Mastering Hierarchies: Graphs, DAGs and BI, Oh My! (Part 3 of 5)

Recap 

In the first post of this series, I walked through the basic ideas of what constitutes a graph, and in the second post I discussed a particular kind of graph called a directed acyclical graph (DAG), providing a handful of examples from the world of data engineering along with an attempt to disambiguate  –  and expand  – the meaning of what a DAG is. 

To briefly recap the important points discussed so far: 

  • A graph consists of one or more nodes. 
  • Nodes are related to other nodes via one or more edges. 
  • Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle. 
  • A given edge can be directed or undirected. 
  • Any set of nodes that can’t reach other sets of nodes are a subgraph
  • Nodes and edges can have attributes, including numerical attributes (usually called weights in formal graph theory), that are often leveraged when executing more advanced graph algorithms/analytics. 
  • Graphs for which all edges are directed, and which have no cycles, are referred to as directed acyclical graphs (DAGs). 

In the following post, we’re going to further constrain the definition of a DAG to arrive at the definition of a hierarchy and then discuss some of the associated nuances, i.e. when it comes to modelling and consuming hierarchical data, including a few additional attributes that “fall out” from the definition of a hierarchy (that aren’t particularly relevant to more generic graph structures). 

Hierarchies 

In order to flesh out what constitutes a hierarchy, recall from the first post in this series that it’s common to refer to a pair of nodes that are related by a directed edge as either “parent/child” nodes or “predecessor/successor” nodes. For simplicity, we’ll stick with “parent/child” for the rest of this discussion. 

Now, as you may have noticed, we didn’t really incorporate this vernacular in our discussion of DAGs. Why? Well, the family analogy (parent/child) breaks down rather quickly when your realise that it’s possible, and actually quite common, for a “child” node to have many “parent” nodes (which doesn’t jibe with our intuition about parents and children). So, if anything, DAGs should probably discussed with the terms “predecessor” and “successor” nodes. 

When it comes to hierarchies, the parent/child analogy becomes more useful. The distinguishing characteristic for a hierarchy, compared to a DAG, is that a node only ever has one — and only one — parent node. (Conversely, a given parent node can have multiple child nodes.) 

(Note: there are edge cases where a node can potentially have more than one parent, but such cases are the exceptions, and they need to handled on a case-by-case basis. Perhaps in a future post I’ll spend time discussing these cases in more detail, but for our purposes, and in light of the expected cardinality of dimensional modelling (i.e. many-to-one from fact table to dimension tables), we’re going to move forward with the expectation that any/all hierarchies should always respect this relationship, i.e. a child node only ever having a single parent node.) 

Now, the parent/child analogy is still a bit limited. Obviously in real life, children have only two biological parents (and clearly can have multiple step parents, adopted parents, etc.), whereas with our analogy, we’re restricting a child node to only one parent node, but otherwise you get the point. 

Also, as an FYI, a more formal term for this data structure from computer science would be a “tree” which indicates a similar analogy where any given branch stems from a single other branch (or trunk). 

(This analogy will play an important role later in our discussion.) 

Here’s a basic illustration of a hierarchy: 

Example hierarchy of… alphabet soup? 

As you can see, it’s still very much a graph with: 

  • seven nodes 
  • six edges (directed) 
  • no cycles 
  • each node only has a single parent 

So far so good. 

I mentioned earlier that even though hierarchies have more constraints, i.e. they’re more limited in what they can represent compared to generic graphs, they nonetheless yield additional attributes. What do I mean by that? 

Consider the following two points. 

Firstly, because I have: 

  • a single starting point (i.e. the very first parent node), 
  • only directed edges, 
  • no cycles, 

a hierarchy therefore allows me to measure the “distance” of any node in the hierarchy by how many edges away from the starting node (or “root node” as we’ll later call it) it is. This measurement is typically referred to as which level a particular node is on (analogous to measuring how high up in a building you are by what level you’re on. And much like the British/American confusion around which floor constitutes the “first” floor, you’ll also have to decide whether you index your root node at level 0 or 1). 

(Side tangent: you can, of course, measure the number of edges between any two nodes in a generic graph, but unlike with hierarchies, there’s not a single value for level per node but rather N-1 such values, i.e. since you can measure the number of edges between a given node and all other nodes in the same subgraph. Storing such information for all nodes introduce exponential storage requirements, i.e. O(n²), and offers very little technical or business value. Hence level is a fairly meaningless attribute in the context of generic graphs or DAGs.) 

Secondly, because you can now refer to multiple “sibling” nodes under a single parent, you can describe the sort order of each sibling node. In other words, you define a particular sort order for all the related sibling nodes under a particular parent node. (This is another example of an attribute that is meaningful for hierarchies, but really doesn’t make much sense in the context of a generic graph or DAG.) 

A classic example of this are the various lines in a financial statement such as a profit-and-loss statement: 

Financial reports are a very common use case in the world of BI, and it’s clearly extremely important to ensure that your reports display the various lines of your report in the correct order, defined by the sort order of each node in your financial statement hierarchy. (Note in the image above how this sort order is defined at each level.) 

Real quick, this sort order attribute doesn’t have a canonical name in the world of BI as such. But for a variety of reasons, my preference is “ordinal position”, so we’ll stick with that for now. 

(Also, for folks new to hierarchies, you might find it a worthwhile exercise to analyse the financial statement structure above as a hierarchy: How many nodes does it have? How many edges? How many levels? How many child nodes per parent? How might you actually model this data in a relational format? We’ll come back to this in a bit…) 

Two more points I’d like to make, as we work towards a generic/reusable data model for hierarchies: 

  • This should be obvious, but there are many different types of hierarchies, such as (in an enterprise context): profit centre hierarchies, cost centre hierarchies, WBS structure hierarchies, cost element hierarchies, financial statement hierarchies, and the list goes on. 
  • A particular hierarchy type might have multiple hierarchy instances. For example, a time dimension is actually a hierarchy, and almost all large enterprises care about tracking metrics along both the calendar year as well as the fiscal year, each of which is its own hierarchy. (We’ll circle back to time dimension hierarchies here in a bit.) 

And lastly, let me introduce a bit more vernacular specific to hierarchies: 

  • A given hierarchy has one (and only one) root node, which is the only node that has children but no parents. 
  • Any nodes that have parent and child nodes are called intermediate or inner nodes
  • Any nodes that have parent but no child nodes themselves are called leaf nodes

A colorful alphabet soup hierarchy. Very academic. 

(If you like to consider edges cases, keep in mind you could technically have a hierarchy of a single node, which would mean it’s simultaneously all three types. It’s also technically possible to have a leaf node at level 2, which would mean that particular “branch” of the hierarchy has no inner nodes.) 

As you can see, the world of data warehousing and BI has managed to mix metaphors by using terminology from the idea of a tree data structure (with things like roots, leaves and branches), and mixing it up with the family analogy (i.e. parents, children, sibling nodes, etc.) of hierarchies. 

How fun and entirely not confusing at all. 

Anyways. 

I’d say we’re ready for our generic hierarchy data model. Rather than write the SQL, I’m just going to provide a poor man’s data dictionary, especially given that your naming conventions (and probably also data types) may well differ. 

Basic data dictionary for a hierarchy table.

And just to finish up the example from earlier, here’s what that data might look like (primary key fields highlighted in green): 

Sample hierarchy.

It’s worth taking a moment to talk through this a bit. 

  • First of all, I’d suggest playing with some sample data, and sorting by the different columns, to familiarise yourself even more with this data model. For example, you could sort by NODE_TYPE to quickly find all the INNER nodes and LEAF nodes. You could sort by LEVEL and then ORDINAL_POSITION to “read” the hierarchy as you would with its visualised equivalent, i.e. top-to-bottom, left-to-right. 
  • Note that root nodes are defined as those records for which PARENT_NODE is NULL. This is a commonly accepted convention, especially for query languages that can parse parent-child hierarchies natively (such as MDX, which we’ll come back to), so I’d highly suggest adopting it (especially if you’re modeling in a strictly OLAP context), but there’s clearly no “formal” reason, i.e. from relational theory or anything else, that requires this. 
  • If this were a data model for a generic graph or even a DAG, we would have to include PARENT_NODE as part of the primary key, since a node could have more than one parent (er, predecessor). However, since hierarchy nodes only have a single parent node, we don’t need to do this, and in fact, we don’t want to do this (as, from a data quality perspective, it would let us insert records with multiple parent records, which breaks the definition set forth previously for hierarchies, and could lead to incorrect results in our BI queries by inflating metrics when joining hierarchies to fact tables due to M:N join cardinality instead of M:1). 
  • Be aware that the child nodes in a hierarchy are often just referred to as “nodes”, whereas parent nodes are always referred to as parent nodes. 
  • ORDINAL_POSITION might be a bit confusing at first glance, because again, it only defines the sort order for a given node compared to its immediate siblings (again, consider the financial statement hierarchy structure that we glanced at earlier. Revenue has to always come before Expenses at the top level, but underneath revenue, it’s important that its constituent child records are also displayed in the right order according to accounting expectations, i.e. “License” coming before “Support”). Now, sorting data only requires that sequential values are larger than previous values, so you could technically use a typical sequence for this column, i.e. each record in the table increasing by 1, which would give us 1, 2, 3… up to 7. But that kind of implies that all of the records are somehow related to each other when it comes to sorting (which they’re not — sort order only makes sense amongst sibling nodes), and if you casually glance at the data, you could get the impression that some records are missing (i.e. if the first node among a set of siblings starts with the value 7 for example). 
  • Speaking of sorting hierarchy records — in many cases, such as with GL Account hierarchies, the sort order is defined by the business in a dedicated column, and almost always used in any reporting. However, in other cases, a sort order doesn’t really matter. Think of a typical org chart. If you were looking at expenses charged per department (or employee), you often don’t really care about sort order of the hierarchy, as long as the data is grouped properly (i.e. you can see all of the employees per department, all departments per division, etc.). And lastly, you often only care about the alphanumeric sort order, such as with your time dimension. We’ll get to the time dimension shortly, but it’s absolute a hierarchy in the strict definition, and you’re almost always going to sort your days by their intrinsic value (assuming you have your days modeled properly, i.e. don’t forget your padded zeroes for any columns you use for sorting! Otherwise the value 11 might come before the value 2.) 
  • Ok, very last comment about sorting… you obviously could have more than one sort column for some reason, and you could also join your nodes to some other master data table, let’s say, which might include yet other columns that you’d want to sort on for some reason… 

Before I forget to mention, you might hear this structure referred to as an “adjacency list”, which is another technical term from graph theory. Technically, an adjacency list maps a single node to *all* of its adjacent nodes, so in the case of hierarchies, a single row would show a parent node mappes to *all* of its children nodes (in an array, for example). In practice though, folks often refer to the parent/child data model above as an adjacency list, even though each child node gets its own row. Just FYI. 

In Summary 

We’ve now gone through graphs, DAGs, and hierarchies, and we’ve walked through a basic, scalable data model for parent-child hierarchies and discussed some of the nuances of this model. 

In the next post, we’ll review a few more considerations when it comes to data modeling, such as: 

  • Discuss data quality implications of hierarchies. 
  • Disambiguate some confusion around terms like “parent/child” 
  • Discuss two major patterns for consuming hierarchies, i.e. schema-on-read compared to schema-on-write. 
  • The reason hierarchies are so prevalent in BI in terms of how human cognition goes about processing information. 

Mastering Hierarchies: Graphs, DAGs and BI, Oh My! (Part 2 of 5)

Recap

In the first post of this series, I walked through the basic ideas of what constitutes a graph (from a somewhat formal computer science perspective) with a simplified example from supply chain logistics.

To briefly recap:

  • A graph consists of one or more nodes.
  • Nodes are related to other nodes via one or more edges.
  • Nodes that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle.
  • A given edge can be directed or undirected.
  • Any set of nodes that can’t reach other sets of nodes are a subgraph.
  • Nodes and edges can have attributes, including numerical attributes (usually called weights), that are often leveraged when executing more advanced graph algorithms/analytics.

In terms of goals set forth in the last post:

  • The last post gave a sense of what a data model might start to look like for graphs in general, which we’ll continue to flesh out.
  • It attempted to simplify the idea of graphs and the few rules you need to understand what graphs represent.

In terms of the goals of this series of posts, this particular post will further flesh out the technical context for hierarchies (another stated goals of this series of posts) by digging further into a certain subset of graphs called directed acyclical graphs (DAGs) and further disambiguate DAGs and their related concepts.

Directed Acyclical Graphs (DAGs)

This post is not actually about ETL tools, but I would say most modern day Data Engineers are at least loosely familiar with the concept of DAGs as it is a term that has experienced quite a bit of adoption in data pipeline orchestration tools like Apache Airflow and, uh, Dagster.

Now that we’ve fleshed out the concept of a graph, it’s pretty straightforward to flesh out what a DAG is, just by examining the acronym.

  • A DAG is a graph where each and every edge is directed, and
  • that has no cycles, i.e. is acyclical

So, where and how do DAGs present themselves in the world of data engineering? While this is a series of posts specifically about hierarchies, not DAGs, I think it’s quite relevant to quickly examine all the different instances of DAGs in data engineering, as they are everywhere.

Also, in the way of disambiguation, let me point out what should hopefully be obvious — DAGs are an abstract concept from graph theory, which includes (but in no way is limited to) what we might call “data flow graphs” in an ETL/orchestration tool, which have come to just be referred to as DAGs. So, be mindful of context when discussing DAGs, as they constitute a much broader concept than just that of data pipelines.

Queries

Queries are DAGs. Look at the EXPLAIN PLAN of even the most basic query, and you’ll see that the visualised execution plan is a DAG, where:

  • Each node is a particular operation (such as join, filter, aggregate).
  • Each edge visualises how the output of a given operation serve as an input to the next operation.

Data Lineage

Data lineage, which represents the flow of data across a data pipeline, is also a DAG. (What I find interesting to note is that in the case of data lineage, each edge represents a particular operation, and each node represents data. This is the opposite of a query plan, where each edge represents data, and each node represents an operation).

This image has an empty alt attribute; its file name is 1730464839246-1024x529.png

Data Pipelines

Data pipelines are, rather obviously, DAGs.

Gantt Charts

Gantt charts? Aren’t Gantt charts something from project management? Yes, but they’re also used to visualise query plans by visualising the time each operation takes within a query execution plan, by expanding the length of the node to represent the amount of time it takes.

(Random tangent: I’ve always wanted a dynamic Gantt chart visualisation for query plans, where instead of being limited just to time (represented by bar width), I’d have the choice to select from things like memory consumption, CPU time, any implicit parallelization factors, disk I/O for root nodes, etc… any product managers out there want to take up the mantle?)

Relational Data Models

Okay, so, cycles can be found in some real-world relational data models, but even so, all relational data models are directed, and many are acyclical, so I decided to include them here.

  • Entities are nodes.
  • Primary/foreign key relationships are the edges.
  • The direction of the directed edges are from primary keys to foreign keys.
  • Clearly large enterprise data models with multiple subject areas can consist of multiple sub-graphs.

And if you’re in the mood for a bit of nuance, it’s absolutely fair to refer to the data model itself (i.e. the metadata) as a DAG, and separately calling out the data itself also as a DAG (i.e. the records themselves and their primary/foreign key relationships).

Summary

This is a fairly quick post to summarise:

  • DAGs are directed, acyclical graphs
  • The common use of the term DAG by Data Engineers is usually limited to the code artifacts of many modern data pipeline/orchestration tools, although it’s clearly a much broader concept.
  • Even within the field of Data Engineering, DAGs can be found everywhere, as they describe the structure of things like: query execution plans, data lineage visualisations, data pipelines, and entity-relationship models.

In the next post, we’ll further constrain the definition of a DAG in order to arrive at the concept of a hierarchy and discuss a handful of related considerations.

Mastering Hierarchies: Graphs, DAGs and BI, Oh My! (Part 1 of 5)

Introduction

I remember feeling somewhat lost on the first large-scale data warehousing project I was put on in late 2012, early in my consulting career. SAP was trying to win the business of Europe’s largest chain of hardware stores, and we were thus trying to replatform their data warehouse and business intelligence (BI) platform from Oracle onto an SAP stack.

At the time, I’d like to think I was a pretty strong database developer with some pretty sharp SQL skills, but had still never really spent any time digging into the academic rigors of data warehousing as its own discipline.

As the team started discussing this particular company’s various product hierarchies, I found myself feeling a bit lost as folks threw around terms like “root nodes”, “leaf nodes”, “hierarchy levels” and “recursive CTEs”. Nonetheless, I muddled my way through the discussion and the rest of the project, and since then have spent several years deep in the weeds of hierarchies. Nonetheless, I never felt particularly satisfied with the “academic” resources I had come across.

So, what I’m hoping to accomplish with this post is to create the resource I wish I’d had at the time — an introduction to hierarchies that is accessible yet comprehensive without introducing unnecessary complexity.

So whether you’re earlier in your Data Engineering career, or if you’re more senior but looking for a refresher, I hope you’ll walk away from this series of posts with a much stronger understanding of hierarchies and how to both model and consume them in the context of BI.

Goals

Now, there’s an endless amount of blog posts out there that cover hierarchies in the context of data warehousing and BI, so I thought it would be helpful to flesh out my specific goals with this post that I feel haven’t really been met by most of the other content out there.

So, in no particular, the goals of this blog series include:

  • Simplification — My personal opinion is that most authors over-complicate hierarchy discussion both in terms of their examples (such as this eyesore in another blog post I found) and the technical minutiae of hierarchies (“ragged” hierarchies anyone?). By focusing on fundamentals that are often glossed over, I’m hoping the discussion will facilitate a much simpler conceptualisation of hierarchies.

  • Cognitive Context — Hierarchies are all over the place in data analytics, but oftentimes folks don’t realise it (case in point: folks often don’t recognise that a time dimension is a hierarchy). But why are hierarchies in particular, as a data structure, so common in data analytics? In other words, how do they support human cognition when it comes to processing information? While my discussion is perhaps overly simplistic, it will help facilitate an understanding of why hierarchies are so helpful, especially in the context of “multi-dimensional” BI, exploratory ad-hoc analytics in particular.

  • Technical Context — hierarchies are a subset of “directed acyclical graphs” (DAGs) which themselves are a subset of graphs. In my opinion, a brief discussion of these different data structures helps Data Engineers, especially those lacking a formal computer science background, better understand the technical concepts underlying hierarchies. (Call me crazy, but I propose that such content helps, rather than hinders, the prior stated goal of “simplification”.)

  • Disambiguation — most content I’ve seen on hierarchies muddies the waters between technical theory and real-world use cases (if I may reuse this rather hideous example), so I hope to more effectively handle those contexts separately before synthesising them. I also hope to deliberately address the vocabulary of hierarchies and disambiguate terms I often see confused, such as the relationship between “parent/child tables” (such as order headers and order line items), and “parent/child hierarchies”, which at first glance have nothing in common (but are actually related).

  • Data Model — I’m also not satisfied with the data model(s) most authors have put forth for hierarchies. So, I’ll share a generic data model that hopefully serves as a stronger starting point for hierarchies, and I’ll also spend time discussing the tradeoffs of a schema-on-write model vs. schema-on-read for processing hierarchical data.

Graphs

As alluded to above, I want to ground this discussion of hierarchies in terms of their “super structure” — graphs. Bear with me, as this is going to go down a bit of a rabbit hole, but I promise it’ll be worth your time in better framing hierarchies themselves.

And just to chart our path forward in this series of posts:

  • Post 1 of 4 (this post) will focus on graphs.
  • Post 2 of 4 will focus on directed acyclical graphs (DAGs).
  • Post 3 of 4 will introduce hierarchies.
  • Post 4 of 4 will discuss the data engineering considerations when modeling/consuming hierarchies.

So, let’s proceed!

To being, what is meant by the term “graph”? As always, it depends. In the context of BI, one might immediately think of something like a line graph visualisation in Tableau, for example.

However, the academic concept of a graph from computer science describes a data structure that represents “things” and their relationships to one another. More specifically, these “things” are referred to as nodes (or sometimes “vertices”), and the relationships between nodes are referred to as edges.

Here’s a rather trivial (and abstract) example:

In the above example, there are:

  • Six nodes
  • Five edges
  • Two subgraphs
  • One cycle

Can you identify each of those concepts just from the visual depiction of this graph?

The nodes and edges should be easy to figure out.

  • Nodes: A, B, C, D, E, F
  • Edges: A-B, B-C, C-A, D-E, E-F

The two subgraphs should be easy enough to intuit. The above graph shows that A, B, and C are related, and also the D, E and F are related, but each of these collections of nodes aren’t related to each other. That means each of them constitute their own “subgraph”.

(I’m being a bit informal with my terminology here. The overall dataset is technically called a “disconnected graph” since there are two sets of nodes/edges (subgraphs), also called components, which cannot reach other. We could also call it a graph with two connected subgraphs. I’m also not distinguishing what might be obvious — we might have data that itself constitutes multiple such subgraphs, but in other cases, we might just be filtering a larger dataset such that subgraphs “fall out” of a larger connected graph… but these details are not particularly germane to the rest of this discussion.)

And lastly, the one cycle should be relatively intuitive. A cycle is a path through the graph that repeats its edges. In other words, I could put the tip of a pencil on the node A (and moving only in one direction) can trace my way from A > B > C and then back to A. This is called a cycle, and clearly there’s no way to do the same thing from D, for example (hence the second sub-graph does not have a cycle).

If we wanted to, we could model this data in a relational database (like Snowflake) as follows:

Obviously this is a simplified data model, but it illustrates the concept just fine.

Now, one thing I want to call your attention to is the fact that I’ve named the columns of the edges table as “from_node” and “to_node” which seems to imply a direction, i.e. starting from the “from_node” and heading towards the “to_node”. (In practice, these are often also referred to as the “predecessor” node and the “successor” node.)

This illustrates the last concept worth highlighting, i.e. whether we treat a graph as directed or undirected.

In other words, we sometimes might care about some concept of directionality or precedence, i.e. when a given node should somehow come before another node that it’s related to.

Let’s further flesh these concepts out with a real-world use case, i.e. a simplified example from the world of logistics / supply chain.

In order to simplify the example, which typically consists of a network (synonym for graph) of thousands of ships, trains, and trucks and all their various routes throughout the supply chain, let’s consider a single example most of us experience every day: navigating from one point to another using our favorite map app on our phone.

This is obviously a screenshot from my Google Maps app on my phone, showing directions from Boulder to Denver International Airport.

From a data perspective, clearly Google Maps is leveraging geospatial data to visually trace out the exact routes that I could take, but what it’s also doing is figuring out and displaying graph data, which itself doesn’t really care about the geospatial coordinates of anything.

In short, Google Maps is using sophisticated (graph) algorithms to select a handful of routes that are likely to make the most sense for me to then choose from.

In my case, I was given three different routes, which I’ve visualised above by just showing the durations for each leg of each route (taking a few liberties here with my version of the graph).

Clearly, Google Maps is summing up the amount of time each leg of each route, and assigning that total value to each route, and displaying them so that I can choose the shortest option.

It’s also showing things like any associated toll road fees (which I’ll just call cost), as well as the total distance.

So, I have three metrics to choose from, for each edge (duration, distance, and cost), and a simple exercise worth thinking about is which of these “edge metrics” should be considered directed vs. undirected:

  • Duration: directed
  • Distance: undirected
  • Cost: undirected

Let me briefly explain.

Duration can often differ in one direction compared to the reverse direction (even though they cover the same route and distance), for example, in cases when there’s a large influx of traffic in one direction, i.e. from suburbs into commercial city centers during the morning rush hour.

Distance between two points on the same leg is pretty much always the same, so that’s an undirected metric (bear with me here — I’m ignoring important edges cases such as construction which might only impact one direction of traffic). And toll road fees (cost) are almost always a function of distance, not time, so those metrics also remain undirected.

So, I can redraw just one leg of one route and show all of these metrics, which also demonstrates how a graph can have multiple edges between the same two nodes (and how some edges can be directed while others can be undirected).

And if I wanted to capture all of this data in my (relational) database, I might model it as follows:

(Don’t @ me about all the ways to improve this data model. It’s just meant to illustrate a practical example of why a graph might have multiple edges between the same two nodes. ?)

I chose this example not only because it’s intuitive and helps flesh out the key concepts of graphs, but also because it represents a great example of a significant software engineering challenge in a critical real-world use case, i.e. how to optimise supply chain logistics on a much larger scale.

For example, think of all of the manufacturers, shipping ports, distribution centers, warehouses, and stores associated with Wal-Mart across the world, along with the different transport types (truck, train, ship… drone?). Optimising the supply chain of large enterprises is no joke!

In the way of an example, this is one of the core problems addressed by SAP’s Advanced Planning and Optimisation (APO) capabilities, which themselves leveraged an in-memory graph engine which was ultimately rolled into SAP HANA’s graph engine.

But, I’m not here to get in the weeds of any particular use case nor vendor-specific features for that matter, nor do we even need to get in the minutiae of graph algorithms (Dijkstra’s algorithm, anyone?), so let’s take a moment to step back and just summarise the important features of a graph.

Summary

Again, one of the goals of this series of posts is to make complex concepts simple. As you can see in the discussion thus far, such as in the example of optimising travel plans at scale in complex supply chain networks, graph analytics can get very hairy, very quickly, and introduce a need for very sophisticated algorithms.

But, we’re not here for the algorithms. We’re here for the data, and at the end of the data, these points are basically all you need to remember about graphs:

  • A graph consists of one or more nodes
  • Nodes are related to other nodes via one or more edges
  • Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle
  • A given edge can be directed or undirected
  • Any set of nodes that can’t reach other sets of nodes are a subgraph (or a “connected component” if you like).
  • Nodes and edges can have attributes, including numerical attributes, that are often leveraged for performing more advanced analytics. (For simplicity, I’m ignoring the BI vernacular that distinguishes attributes from measures.) These numerical attributes are often referred to as “weights”, which is the reason you might have heard of weights (and biases — same as weights but assigned to nodes rather than edges) in the context of neural networks, such as those undergirding large language models (LLMs) such as ChatGPT. Neural networks, or any kind of network really (including computer networks) are examples of graphs.

Hopefully you’ll find those points simple enough to remember, at it covers pretty much everything you’ll need to know about graphs in the context of data engineering! Again, we’re not particularly concerned with graph algorithms, but rather just the structure of graphs themselves.

In the next post, we’ll impose a few constraints on the definition of a graph to arrive at the concept of a direct acyclical graph (DAG), which should already start to make a bit of sense based on our discussion thus far, and should also check the boxes of the other goals set forth at the beginning of this post.

Into the Matrix: Mastering Conditional Formatting in Power BI

Introduction

In the world of data viz, creating a clear and informative matrix can often be a daunting challenge. Recently, I was tasked with developing a complex matrix in Power BI to display various brands alongside their key performance indicators (KPIs). This report required six measures for each brand and several time intelligence calculations – such as yesterday’s figures and week-to-date comparisons. Therefore, ensuring a cohesive and consistent visual style became a challenge. Adding to the complexity, I had to ensure that the conditional formatting aligned with that of previous reports for the same client.  

In this post, we’ll explore how to build a matrix visual that pivots values to rows, groups them by dimensional hierarchies, whilst performing conditional formatting on our columns, which in this example, are calculated time intelligence measures. 

The Challenge

As I began to set up the matrix, I quickly ran into a significant obstacle: Power BI’s default settings only allowed conditional formatting on the values section of the matrix. This limitation became frustrating when I pivoted the data to display KPIs as rows, causing the conditional formatting to be incorrectly applied along the horizontal axis instead of the intended vertical columns. My efforts to create a visually appealing and informative matrix were hampered by these constraints, and I needed to find a way to implement the conditional formatting rules effectively while preserving the integrity of the data presentation. 

Exploration of Alternatives

In my quest for a solution, I initially considered hard coding the conditional formatting directly into the time intelligence columns. Using DAX formula, I was able to display up or down arrows based on the calculated values. While this approach worked for the “TW Vs LW” columns, it proved inadequate for the “% change” columns, limiting the overall functionality of my matrix. 

Additionally, formatting the values in the DAX query led to performance issues, significantly slowing down the report. I quickly realised that this method did not align with the existing aesthetics of my report pack, further complicating my efforts. 

To enhance readability, I explored splitting the matrix into six separate matrices. One for each KPI, with countries as rows and individual measures for each time-period. This solution worked remarkably well for mobile viewing, allowing users to easily access all KPIs, brands, and time periods in a more digestible format when using devices in portrait mode. However, this approach was less ideal for desktop users, as navigating between six matrices became cumbersome and disrupted the seamless experience I intended to deliver. 

Breakthrough Solution

After identifying the challenges in presenting a cohesive matrix with effective conditional formatting, I adopted a systematic approach to streamline the data structure and enhance the visual representation of KPIs across different brands. 

1. Creating the Source Table: I initiated the solution by developing a new source table in Excel. This table organised essential measures associated with various brands, accompanied by an ordinal column to facilitate the correct sequencing of KPIs. This structured layout not only simplified data management but also allowed for straightforward integration with existing data models. For example: 

2. Establishing Relationships: To ensure seamless data interaction, I joined this new source table to the store dimension table using a cross-directional, many-to-many relationship based on the brand name. Although many-to-many relationships are generally avoided in Power BI due to potential complexity, they work in this case because the source table’s pre-aggregated KPIs and distinct brand names ensure efficiency and clarity in the model. This configuration maximised flexibility in how data could be accessed and analysed across the two tables. 

3. Implementing Calculated Columns: Utilising a SWITCH() statement, I created calculated columns that enabled dynamic switching between the measures in the source table and the corresponding measures in the data model. A SWITCH() statement in DAX evaluates an expression against multiple conditions and returns the corresponding result for the first match. This approach allowed for targeted calculations based on specific KPIs while maintaining clarity in data representation. 

4. Developing Measures: Building on the calculated columns, I developed measures to sum the values required. This was essential for ensuring accurate aggregations, providing a comprehensive view of performance metrics across all brands and time periods. For example: 

5. Applying Conditional Formatting: With the measures established, I leveraged Power BI’s built-in conditional formatting rules to enhance visual clarity. This consistency in formatting not only aligned with the existing report pack but also facilitated stakeholders in quickly identifying trends and insights. 

This solution not only resolved the immediate technical challenges but also enhanced the overall user experience for those engaging with the reports. The dynamic calculations enabled by the calculated columns ensure that the reports remain relevant and insightful, catering to the business’s evolving needs. Additionally, the focus on consistent formatting facilitates the identification of critical patterns, a fundamental concept in Data-Viz best practices. 

Conclusion 

By combining thoughtful data modelling and conditional formatting, we can transform raw data into actionable insights, empowering informed decision-making among various stakeholders. As we continue to navigate the complexities of data analysis, this approach serves as a valuable blueprint for future projects, ensuring our reporting remains insightful and adaptable to shifting business landscapes. 

I encourage you to apply this method in your own reports and experience the improvements in your workflow!