Introduction
I remember feeling somewhat lost on the first large-scale data warehousing project I was put on in late 2012, early in my consulting career. SAP was trying to win the business of Europe’s largest chain of hardware stores, and thus as an early career consultant in SAP’s Business Analytics division, I was staffed on the team hired to replatform their entire data stack from Oracle onto SAP.
At the time, I’d like to think I was a pretty strong database developer with some pretty sharp SQL skills, but had still never really spent any time digging into the academic rigors of data warehousing as its own discipline.
As the team started discussing this particular company’s various product hierarchies, I found myself feeling a bit lost as folks threw around terms like “root nodes”, “leaf nodes”, “hierarchy levels” and “recursive CTEs”. Nonetheless, I muddled my way through the discussion, ended up successfully executing my first large data warehousing project, and since then have spent many years developing experience, skills and technological contributions to the disciplines of data modeling, data engineering and BI development, including those disciplines as they relate to hierarchies.
So, given that initial confusion at the onset of my career, the subsequent career progression in the way of all things data engineering, and my overall frustration with the lack of high quality content when it comes to understanding how to extract, model, transform, and consume hierarchies, I thought it would be helpful to put together this series of posts as a resource I wish I’d had on that first project - a detailed and systematic introduction to hierarchies with lots of examples, discussion of edge cases, and most importantly – working SQL to dig deep into a subject area that can get quite complex.
So whether you’re earlier in your data engineering career, or if you’re more senior but looking for a refresher, I hope you’ll walk away from this series of posts with a much stronger understanding of hierarchies and how to both model and consume them, primarily in the context of BI.
Goals
Now, there’s an endless amount of blog posts out there that cover hierarchies in the context of data warehousing and BI, so I thought it would be helpful to flesh out my specific goals with this post that I feel haven’t really been met by most of the other content out there.
So, in no particular, the goals of this blog series include:
- Simplification — My personal opinion is that most authors over-complicate hierarchy discussion both in terms of their examples (such as this eyesore in another blog post I found) and the technical minutiae of hierarchies (“ragged” hierarchies anyone?). By focusing on fundamentals that are often glossed over, I’m hoping the discussion will facilitate a much simpler conceptualisation of hierarchies.
- Cognitive Context — Hierarchies are all over the place in data analytics, but oftentimes folks don’t realise it (case in point: folks often don’t recognise that a time dimension is a hierarchy). But why are hierarchies in particular, as a data structure, so common in data analytics? In other words, how do they support human cognition when it comes to processing information? While my discussion is perhaps overly simplistic, it will help facilitate an understanding of why hierarchies are so helpful, especially in the context of “multi-dimensional” BI, exploratory ad-hoc analytics in particular.
- Technical Context — hierarchies are a subset of “directed acyclical graphs” (DAGs) which themselves are a subset of graphs. In my opinion, a brief discussion of these different data structures helps Data Engineers, especially those lacking a formal computer science background, better understand the technical concepts underlying hierarchies. (Call me crazy, but I propose that such content helps, rather than hinders, the prior stated goal of “simplification”.)
- Disambiguation — most content I’ve seen on hierarchies muddies the waters between technical theory and real-world use cases (if I may reuse this rather hideous example), so I hope to more effectively handle those contexts separately before synthesising them. I also hope to deliberately address the vocabulary of hierarchies and disambiguate terms I often see confused, such as the relationship between “parent/child tables” (such as order headers and order line items), and “parent/child hierarchies”, which at first glance have nothing in common (but are actually related).
- Data Model — I’m also not satisfied with the data model(s) most authors have put forth for hierarchies. So, I’ll share a few generic data models (and iterations thereof) that can accommodate the ingestion, transformation and consumption layers of the most common types of hierarchies (parent/child, level, unbalanced, ragged, heterogenous, time-dependent, etc.) along with a brief discussion of the tradeoffs of a schema-on-write model vs. schema-on-read for hierarchy processing data pipelines.
Contents
Here’s is a high-level summary of each of the seven posts in these series:
- Post 1: Introducing Graphs
- Post 2: Introducing DAGs
- Post 3: Introducing Parent-Child Hierarchies
- Post 4: Introducing Level Hierarchies
- Post 5: Handling Ragged, Unbalanced and Heterogeneous Hierarchies
- Post 6: Hierarchy Review (Data Modeling)
- Post 7: Final Thoughts and Reflections
Note that the emphasis in the first five posts is largely conceptual (although some SQL is provided where relevant). A detailed review of relational data models for hierarchies, their corresponding SQL definitions and ETL logic, and a brief discussion on architecture and tradeoffs addressed specifically in Post 6.
Also note that the query language MDX and its role in consuming hierarchies is touched on at a few points throughout this series of posts, but doesn’t take a central role given its (unfortunate) absence in most “modern data stack” deployments.
Graphs
As alluded to above, I want to ground this discussion of hierarchies in terms of their “super structure” — graphs. This will lead us down a bit of a rabbit hole, but will ultimately help in better framing the discussion of hierarchies and their role as a type of data structure.
To begin, what is meant by the term “graph”? In the context of BI, one might immediately think of something like a line chart in a BI tool like Tableau, for example (i.e. graphing data points on a Cartesian plane).
However, the academic concept of a graph from computer science is a separate and distinct concept that describes a data structure that represents “things” and their relationships to one another. More specifically, these “things” are referred to as nodes (or sometimes “vertices”), and the relationships between nodes are referred to as edges.
Here’s a rather trivial (and abstract) example:
In the above example, there are:
- Six nodes
- Five edges
- Two subgraphs
- One cycle
Can you identify each of those concepts just from the visual depiction of this graph?
The nodes and edges should be easy to figure out.
- Nodes: A, B, C, D, E, F
- Edges: A-B, B-C, C-A, D-E, E-F
The two subgraphs should be easy enough to intuit. The above graph shows that A, B, and C are related, and also the D, E and F are related, but each of these collections of nodes aren’t related to each other. That means each of them constitute their own “subgraph”.
(I’m being a bit informal with my terminology here. The overall dataset is technically called a “disconnected graph” since there are two sets of nodes/edges (subgraphs), also called components, which cannot reach other. We could also call it a graph with two connected subgraphs… but in any case, these details are not particularly germane to the rest of this discussion, and thus the formal vernacular of graph theory won’t be addressed much further since it historically hasn’t played an important role in the discussion of hierarchies nor other related topics in the fields of data engineering and business intelligence.)
And lastly, the one cycle should be relatively intuitive. A cycle is a path through the graph that repeats its edges. In other words, I could put the tip of a pencil on the node A (and moving only in one direction) can trace my way from A > B > C and then back to A. This is called a cycle, and clearly there’s no way to do the same thing from D, for example (hence the second sub-graph does not have a cycle).
In the way of a simple SQL example, we could model this data in a relational database (like Snowflake) as follows:
/*
Create tables
*/
CREATE TABLE IF NOT EXISTS NODES
(
NODE_ID TEXT
);
CREATE TABLE IF NOT EXISTS EDGES
(
NODE_ID_FROM TEXT,
NODE_ID_TO TEXT
);
/*
Populate tables
*/
INSERT INTO
NODES
VALUES
('A'), ('B'), ('C'), ('D'), ('E'), ('F');
INSERT INTO
EDGES
VALUES
('A', 'B'),
('B', 'C'),
('C', 'A'),
('D', 'E'),
('E', 'F');
Granted, this is an incredibly simple model, and should probably be extended (i.e. to accommodate attributes associated with each node/edges, as well as accommodate multiple graphs), but it illustrates the concept just fine.
Now, one thing worth calling out is the names of the the columns in the EDGES table. The fact that they have the prepositions “TO” and “FROM” imply a direction, i.e. starting from the “from_node” and heading towards the “to_node”. (In practice, these are often also referred to as the “predecessor” node and the “successor” node.)
This illustrates the last concept worth highlighting, i.e. whether we treat a graph as directed or undirected. (Graphs are NOT always directed, and thus these naming conventions are not entirely generic for arbitrary graphs. However, we won’t consider the case of arbitrary graphs much further, whereas directed graphs will play an important role in the rest of our discussion.)
In other words, we sometimes might care about some concept of directionality or precedence, i.e. when a given node should somehow come before another node that it’s related to.
Let’s further flesh these concepts out with a real-world use case from the world of logistics.
Logistics is, “the part of supply chain management that deals with the efficient forward and reverse flow of goods, services, and related information from the point of origin to the point of consumption.” Large supply chains consist of the waterways, railways, and roads that connect cargo vehicles (ships, trains, trucks) all across the globe.
In order to simplify the example, which let’s just consider a single example most of us experience every day: navigating from one point to another using our favorite map app on our phone.
This is obviously a screenshot from my Google Maps app on my phone, showing directions from Boulder to Denver International Airport.
From a data perspective, clearly Google Maps is leveraging geospatial data to visually trace out the exact routes that I could take, but what it’s also doing is figuring out and displaying graph data (which is not actually concerned at all with any geospatial coordinates).
In short, Google Maps is using sophisticated (graph) algorithms to select a handful of routes that are likely to make the most sense for me to choose from, based on time and/or distance.
In my case, I was given three different routes, which I’ve visualised above by just showing the durations for each leg of each route (taking a few liberties here with my version of the graph).
Clearly, Google Maps is summing up the amount of time each leg of each route, and assigning that total value to each route, and displaying them so that I can choose the shortest option.
It’s also showing things like any associated toll road fees (which I’ll just call cost), as well as the total distance.
So, I have three metrics to choose from, for each edge (duration, distance, and cost), and a simple exercise worth thinking about is which of these “edge metrics” should be considered directed vs. undirected:
- Duration: directed
- Distance: undirected
- Cost: undirected
Let me briefly explain.
Duration can often differ in one direction compared to the reverse direction (even though they cover the same route and distance), for example, in cases when there’s a large influx of traffic in one direction, i.e. from suburbs into commercial city centers during the morning rush hour.
Distance between two points on the same leg is pretty much always the same, so that’s an undirected metric (bear with me here — I’m ignoring important edges cases such as construction which might only impact one direction of traffic). And toll road fees (cost) are almost always a function of distance, not time, so those metrics also remain undirected.
So, I can redraw just one leg of one route and show all of these metrics, which also demonstrates how a graph can have multiple edges between the same two nodes (and how some edges can be directed while others can be undirected).
And if I wanted to capture all of this data in my (relational) database, I might model it as follows:
CREATE TABLE IF NOT EXISTS EDGES
(
NODE_ID_FROM TEXT,
NODE_ID_TO TEXT,
DIRECTION TEXT,
METRIC TEXT,
AMOUNT DECIMAL
);
INSERT INTO
EDGES
VALUES
('Boulder', 'Exit A', 'DEFAULT', 'Duration', 10.0),
('Boulder', 'Exit A', 'REVERSE', 'Duration', 12),
('Boulder', 'Exit A', NULL, 'Distance', 8),
('Boulder', 'Exit A', NULL, 'Cost', 4.5);
I chose this example not only because it’s intuitive and helps flesh out the key concepts of graphs, but also because it represents a great example of a significant software engineering challenge in a critical real-world use case, i.e. how to optimise supply chain logistics on a much larger scale.
For example, think of all of the manufacturers, shipping ports, distribution centers, warehouses, and stores associated with Wal-Mart across the world, along with the different transport types (truck, train, ship… drone?). Optimising the supply chain of large enterprises is no joke!
In the way of an example, this is one of the core problems addressed by SAP’s Advanced Planning and Optimisation (APO) capabilities, which themselves leveraged an in-memory graph engine which was ultimately rolled into SAP HANA’s graph engine.
But, I’m not here to get in the weeds of any particular use case nor vendor-specific features for that matter, nor do we even need to get in the minutiae of graph algorithms (Dijkstra’s algorithm, anyone?), so let’s take a moment to step back and just summarise the important features of a graph.
Summary
Again, one of the goals of this series of posts is to make complex concepts simple. As you can see in the discussion thus far, such as in the example of optimising travel plans at scale in complex supply chain networks, graph analytics can get very hairy, very quickly, and introduce a need for very sophisticated algorithms.
But, we’re not here for the algorithms. We’re here for the data, and at the end of the data, these points are basically all you need to remember about graphs:
- A graph consists of one or more nodes
- Nodes are related to other nodes via one or more edges
- Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle
- A given edge can be directed or undirected
- Any set of nodes that can’t reach other sets of nodes are a subgraph.
- Nodes and edges can have *attributes associated with them.
*Such attributes can consist of numerical values used in advanced analytics and algorithms. One such example would be numerical attributes of edges called “weights” and numerical attributes of nodes called “biases” which are leveraged in neural networks, which are the graphs undergirding large language models (LLMs) such as ChatGPT.
(Since this is not a BI use case per se, I’m avoiding the typical distinction between “measures” and “attributes”.)
Hopefully you’ll find those points simple enough to remember, at it covers pretty much everything you’ll need to know about graphs in the context of data engineering! Again, we’re not particularly concerned with graph algorithms, but rather just the structure of graphs themselves.
In the next post, we’ll impose a few constraints on the definition of a graph to arrive at the concept of a direct acyclical graph (DAG), which will help further motivate our introduction of hierarchies.