Introduction
I remember feeling somewhat lost on the first large-scale data warehousing project I was put on in late 2012, early in my consulting career. SAP was trying to win the business of Europe’s largest chain of hardware stores, and we were thus trying to replatform their data warehouse and business intelligence (BI) platform from Oracle onto an SAP stack.
At the time, I’d like to think I was a pretty strong database developer with some pretty sharp SQL skills, but had still never really spent any time digging into the academic rigors of data warehousing as its own discipline.
As the team started discussing this particular company’s various product hierarchies, I found myself feeling a bit lost as folks threw around terms like “root nodes”, “leaf nodes”, “hierarchy levels” and “recursive CTEs”. Nonetheless, I muddled my way through the discussion and the rest of the project, and since then have spent several years deep in the weeds of hierarchies. Nonetheless, I never felt particularly satisfied with the “academic” resources I had come across.
So, what I’m hoping to accomplish with this post is to create the resource I wish I’d had at the time — an introduction to hierarchies that is accessible yet comprehensive without introducing unnecessary complexity.
So whether you’re earlier in your Data Engineering career, or if you’re more senior but looking for a refresher, I hope you’ll walk away from this series of posts with a much stronger understanding of hierarchies and how to both model and consume them in the context of BI.
Goals
Now, there’s an endless amount of blog posts out there that cover hierarchies in the context of data warehousing and BI, so I thought it would be helpful to flesh out my specific goals with this post that I feel haven’t really been met by most of the other content out there.
So, in no particular, the goals of this blog series include:
- Simplification — My personal opinion is that most authors over-complicate hierarchy discussion both in terms of their examples (such as this eyesore in another blog post I found) and the technical minutiae of hierarchies (“ragged” hierarchies anyone?). By focusing on fundamentals that are often glossed over, I’m hoping the discussion will facilitate a much simpler conceptualisation of hierarchies.
- Cognitive Context — Hierarchies are all over the place in data analytics, but oftentimes folks don’t realise it (case in point: folks often don’t recognise that a time dimension is a hierarchy). But why are hierarchies in particular, as a data structure, so common in data analytics? In other words, how do they support human cognition when it comes to processing information? While my discussion is perhaps overly simplistic, it will help facilitate an understanding of why hierarchies are so helpful, especially in the context of “multi-dimensional” BI, exploratory ad-hoc analytics in particular.
- Technical Context — hierarchies are a subset of “directed acyclical graphs” (DAGs) which themselves are a subset of graphs. In my opinion, a brief discussion of these different data structures helps Data Engineers, especially those lacking a formal computer science background, better understand the technical concepts underlying hierarchies. (Call me crazy, but I propose that such content helps, rather than hinders, the prior stated goal of “simplification”.)
- Disambiguation — most content I’ve seen on hierarchies muddies the waters between technical theory and real-world use cases (if I may reuse this rather hideous example), so I hope to more effectively handle those contexts separately before synthesising them. I also hope to deliberately address the vocabulary of hierarchies and disambiguate terms I often see confused, such as the relationship between “parent/child tables” (such as order headers and order line items), and “parent/child hierarchies”, which at first glance have nothing in common (but are actually related).
- Data Model — I’m also not satisfied with the data model(s) most authors have put forth for hierarchies. So, I’ll share a generic data model that hopefully serves as a stronger starting point for hierarchies, and I’ll also spend time discussing the tradeoffs of a schema-on-write model vs. schema-on-read for processing hierarchical data.
Graphs
As alluded to above, I want to ground this discussion of hierarchies in terms of their “super structure” — graphs. Bear with me, as this is going to go down a bit of a rabbit hole, but I promise it’ll be worth your time in better framing hierarchies themselves.
And just to chart our path forward in this series of posts:
- Post 1 of 4 (this post) will focus on graphs.
- Post 2 of 4 will focus on directed acyclical graphs (DAGs).
- Post 3 of 4 will introduce hierarchies.
- Post 4 of 4 will discuss the data engineering considerations when modeling/consuming hierarchies.
So, let’s proceed!
To being, what is meant by the term “graph”? As always, it depends. In the context of BI, one might immediately think of something like a line graph visualisation in Tableau, for example.
However, the academic concept of a graph from computer science describes a data structure that represents “things” and their relationships to one another. More specifically, these “things” are referred to as nodes (or sometimes “vertices”), and the relationships between nodes are referred to as edges.
Here’s a rather trivial (and abstract) example:
In the above example, there are:
- Six nodes
- Five edges
- Two subgraphs
- One cycle
Can you identify each of those concepts just from the visual depiction of this graph?
The nodes and edges should be easy to figure out.
- Nodes: A, B, C, D, E, F
- Edges: A-B, B-C, C-A, D-E, E-F
The two subgraphs should be easy enough to intuit. The above graph shows that A, B, and C are related, and also the D, E and F are related, but each of these collections of nodes aren’t related to each other. That means each of them constitute their own “subgraph”.
(I’m being a bit informal with my terminology here. The overall dataset is technically called a “disconnected graph” since there are two sets of nodes/edges (subgraphs), also called components, which cannot reach other. We could also call it a graph with two connected subgraphs. I’m also not distinguishing what might be obvious — we might have data that itself constitutes multiple such subgraphs, but in other cases, we might just be filtering a larger dataset such that subgraphs “fall out” of a larger connected graph… but these details are not particularly germane to the rest of this discussion.)
And lastly, the one cycle should be relatively intuitive. A cycle is a path through the graph that repeats its edges. In other words, I could put the tip of a pencil on the node A (and moving only in one direction) can trace my way from A > B > C and then back to A. This is called a cycle, and clearly there’s no way to do the same thing from D, for example (hence the second sub-graph does not have a cycle).
If we wanted to, we could model this data in a relational database (like Snowflake) as follows:
Obviously this is a simplified data model, but it illustrates the concept just fine.
Now, one thing I want to call your attention to is the fact that I’ve named the columns of the edges table as “from_node” and “to_node” which seems to imply a direction, i.e. starting from the “from_node” and heading towards the “to_node”. (In practice, these are often also referred to as the “predecessor” node and the “successor” node.)
This illustrates the last concept worth highlighting, i.e. whether we treat a graph as directed or undirected.
In other words, we sometimes might care about some concept of directionality or precedence, i.e. when a given node should somehow come before another node that it’s related to.
Let’s further flesh these concepts out with a real-world use case, i.e. a simplified example from the world of logistics / supply chain.
In order to simplify the example, which typically consists of a network (synonym for graph) of thousands of ships, trains, and trucks and all their various routes throughout the supply chain, let’s consider a single example most of us experience every day: navigating from one point to another using our favorite map app on our phone.
This is obviously a screenshot from my Google Maps app on my phone, showing directions from Boulder to Denver International Airport.
From a data perspective, clearly Google Maps is leveraging geospatial data to visually trace out the exact routes that I could take, but what it’s also doing is figuring out and displaying graph data, which itself doesn’t really care about the geospatial coordinates of anything.
In short, Google Maps is using sophisticated (graph) algorithms to select a handful of routes that are likely to make the most sense for me to then choose from.
In my case, I was given three different routes, which I’ve visualised above by just showing the durations for each leg of each route (taking a few liberties here with my version of the graph).
Clearly, Google Maps is summing up the amount of time each leg of each route, and assigning that total value to each route, and displaying them so that I can choose the shortest option.
It’s also showing things like any associated toll road fees (which I’ll just call cost), as well as the total distance.
So, I have three metrics to choose from, for each edge (duration, distance, and cost), and a simple exercise worth thinking about is which of these “edge metrics” should be considered directed vs. undirected:
- Duration: directed
- Distance: undirected
- Cost: undirected
Let me briefly explain.
Duration can often differ in one direction compared to the reverse direction (even though they cover the same route and distance), for example, in cases when there’s a large influx of traffic in one direction, i.e. from suburbs into commercial city centers during the morning rush hour.
Distance between two points on the same leg is pretty much always the same, so that’s an undirected metric (bear with me here — I’m ignoring important edges cases such as construction which might only impact one direction of traffic). And toll road fees (cost) are almost always a function of distance, not time, so those metrics also remain undirected.
So, I can redraw just one leg of one route and show all of these metrics, which also demonstrates how a graph can have multiple edges between the same two nodes (and how some edges can be directed while others can be undirected).
And if I wanted to capture all of this data in my (relational) database, I might model it as follows:
(Don’t @ me about all the ways to improve this data model. It’s just meant to illustrate a practical example of why a graph might have multiple edges between the same two nodes. ?)
I chose this example not only because it’s intuitive and helps flesh out the key concepts of graphs, but also because it represents a great example of a significant software engineering challenge in a critical real-world use case, i.e. how to optimise supply chain logistics on a much larger scale.
For example, think of all of the manufacturers, shipping ports, distribution centers, warehouses, and stores associated with Wal-Mart across the world, along with the different transport types (truck, train, ship… drone?). Optimising the supply chain of large enterprises is no joke!
In the way of an example, this is one of the core problems addressed by SAP’s Advanced Planning and Optimisation (APO) capabilities, which themselves leveraged an in-memory graph engine which was ultimately rolled into SAP HANA’s graph engine.
But, I’m not here to get in the weeds of any particular use case nor vendor-specific features for that matter, nor do we even need to get in the minutiae of graph algorithms (Dijkstra’s algorithm, anyone?), so let’s take a moment to step back and just summarise the important features of a graph.
Summary
Again, one of the goals of this series of posts is to make complex concepts simple. As you can see in the discussion thus far, such as in the example of optimising travel plans at scale in complex supply chain networks, graph analytics can get very hairy, very quickly, and introduce a need for very sophisticated algorithms.
But, we’re not here for the algorithms. We’re here for the data, and at the end of the data, these points are basically all you need to remember about graphs:
- A graph consists of one or more nodes
- Nodes are related to other nodes via one or more edges
- Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle
- A given edge can be directed or undirected
- Any set of nodes that can’t reach other sets of nodes are a subgraph (or a “connected component” if you like).
- Nodes and edges can have attributes, including numerical attributes, that are often leveraged for performing more advanced analytics. (For simplicity, I’m ignoring the BI vernacular that distinguishes attributes from measures.) These numerical attributes are often referred to as “weights”, which is the reason you might have heard of weights (and biases — same as weights but assigned to nodes rather than edges) in the context of neural networks, such as those undergirding large language models (LLMs) such as ChatGPT. Neural networks, or any kind of network really (including computer networks) are examples of graphs.
Numerical attributes of graph edges, or “weights”, represented in the edges of a neural network
Hopefully you’ll find those points simple enough to remember, at it covers pretty much everything you’ll need to know about graphs in the context of data engineering! Again, we’re not particularly concerned with graph algorithms, but rather just the structure of graphs themselves.
In the next post, we’ll impose a few constraints on the definition of a graph to arrive at the concept of a direct acyclical graph (DAG), which should already start to make a bit of sense based on our discussion thus far, and should also check the boxes of the other goals set forth at the beginning of this post.