Mastering Hierarchies: Graphs, DAGs and BI, Oh My! (Part 1 of 7)

Introduction

I remember feeling somewhat lost on the first large-scale data warehousing project I was put on in late 2012, early in my consulting career. SAP was trying to win the business of Europe’s largest chain of hardware stores, and we were thus trying to replatform their data warehouse and business intelligence (BI) platform from Oracle onto an SAP stack.

At the time, I’d like to think I was a pretty strong database developer with some pretty sharp SQL skills, but had still never really spent any time digging into the academic rigors of data warehousing as its own discipline.

As the team started discussing this particular company’s various product hierarchies, I found myself feeling a bit lost as folks threw around terms like “root nodes”, “leaf nodes”, “hierarchy levels” and “recursive CTEs”. Nonetheless, I muddled my way through the discussion and the rest of the project, and since then have spent several years deep in the weeds of hierarchies. Nonetheless, I never felt particularly satisfied with the “academic” resources I had come across.

So, what I’m hoping to accomplish with this series of posts is to create the resource I wish I’d had at the time — an introduction to hierarchies that is accessible yet comprehensive without introducing unnecessary complexity.

So whether you’re earlier in your Data Engineering career, or if you’re more senior but looking for a refresher, I hope you’ll walk away from this series of posts with a much stronger understanding of hierarchies and how to both model and consume them, primarily in the context of BI.

Goals

Now, there’s an endless amount of blog posts out there that cover hierarchies in the context of data warehousing and BI, so I thought it would be helpful to flesh out my specific goals with this post that I feel haven’t really been met by most of the other content out there.

So, in no particular, the goals of this blog series include:

  • Simplification — My personal opinion is that most authors over-complicate hierarchy discussion both in terms of their examples (such as this eyesore in another blog post I found) and the technical minutiae of hierarchies (“ragged” hierarchies anyone?). By focusing on fundamentals that are often glossed over, I’m hoping the discussion will facilitate a much simpler conceptualisation of hierarchies.

  • Cognitive Context — Hierarchies are all over the place in data analytics, but oftentimes folks don’t realise it (case in point: folks often don’t recognise that a time dimension is a hierarchy). But why are hierarchies in particular, as a data structure, so common in data analytics? In other words, how do they support human cognition when it comes to processing information? While my discussion is perhaps overly simplistic, it will help facilitate an understanding of why hierarchies are so helpful, especially in the context of “multi-dimensional” BI, exploratory ad-hoc analytics in particular.

  • Technical Context — hierarchies are a subset of “directed acyclical graphs” (DAGs) which themselves are a subset of graphs. In my opinion, a brief discussion of these different data structures helps Data Engineers, especially those lacking a formal computer science background, better understand the technical concepts underlying hierarchies. (Call me crazy, but I propose that such content helps, rather than hinders, the prior stated goal of “simplification”.)

  • Disambiguation — most content I’ve seen on hierarchies muddies the waters between technical theory and real-world use cases (if I may reuse this rather hideous example), so I hope to more effectively handle those contexts separately before synthesising them. I also hope to deliberately address the vocabulary of hierarchies and disambiguate terms I often see confused, such as the relationship between “parent/child tables” (such as order headers and order line items), and “parent/child hierarchies”, which at first glance have nothing in common (but are actually related).

  • Data Model — I’m also not satisfied with the data model(s) most authors have put forth for hierarchies. So, I’ll share a few generic data models (and iterations thereof) that can accommodate both the ingestion layer and consumption layer of the most common types of hierarchies (parent/child, level, unbalanced, ragged, heterogenous, time-dependent, etc.) along with a brief discussion of the tradeoffs of a schema-on-write model vs. schema-on-read for hierarchy processing data pipelines.

Contents

Here’s is a high-level summary of each of the seven posts in these series:

  • Post 1: Introducing Graphs
  • Post 2: Introducing DAGs
  • Post 3: Introducing (Parent/Child) Hierarchies
  • Post 4: Introducing Level Hierarchies
  • Post 5: Handling Ragged, Unbalanced and Heterogeneous Hierarchies
  • Post 6: Hierarchy Review (Data Modeling)
  • Post 7: Final Thoughts and Reflections

Note that while some SQL and data modeling examples are included throughout the series, the focus is primarily on the concepts behind hierarchies. For a detailed review of relational data models for hierarchies, their corresponding SQL definitions and ETL logic, and a brief discussion on architecture and tradeoffs is addressed in Post 6.

Also note that the query language MDX and its role in consuming hierarchies is touched on at a few points throughout this series of posts, but doesn’t take a central role given its (unfortunate) absence in most “modern data stack” deployments.

Graphs

As alluded to above, I want to ground this discussion of hierarchies in terms of their “super structure” — graphs. This will lead us down a bit of a rabbit hole, but will ultimately help in better framing the discussion of hierarchies and their role as a type of data structure.

To begin, what is meant by the term “graph”? In the context of BI, one might immediately think of something like a visual graph, such as a line chart in a BI tool like Tableau, for example.

However, the academic concept of a graph from computer science describes a data structure that represents “things” and their relationships to one another. More specifically, these “things” are referred to as nodes (or sometimes “vertices”), and the relationships between nodes are referred to as edges.

Here’s a rather trivial (and abstract) example:

In the above example, there are:

  • Six nodes
  • Five edges
  • Two subgraphs
  • One cycle

Can you identify each of those concepts just from the visual depiction of this graph?

The nodes and edges should be easy to figure out.

  • Nodes: A, B, C, D, E, F
  • Edges: A-B, B-C, C-A, D-E, E-F

The two subgraphs should be easy enough to intuit. The above graph shows that A, B, and C are related, and also the D, E and F are related, but each of these collections of nodes aren’t related to each other. That means each of them constitute their own “subgraph”.

(I’m being a bit informal with my terminology here. The overall dataset is technically called a “disconnected graph” since there are two sets of nodes/edges (subgraphs), also called components, which cannot reach other. We could also call it a graph with two connected subgraphs… but in any case, these details are not particularly germane to the rest of this discussion, and thus the formal vernacular of graph theory won’t be addressed much further since it historically hasn’t played an important role in the discussion of hierarchies nor other related topics in the fields of data engineering and business intelligence.)

And lastly, the one cycle should be relatively intuitive. A cycle is a path through the graph that repeats its edges. In other words, I could put the tip of a pencil on the node A (and moving only in one direction) can trace my way from A > B > C and then back to A. This is called a cycle, and clearly there’s no way to do the same thing from D, for example (hence the second sub-graph does not have a cycle).

In the way of a simple SQL example, we could model this data in a relational database (like Snowflake) as follows:

Granted, this is an incredibly simple model, and should probably be extended (i.e. to accommodate attributes associated with each node/edges, as well as accommodate multiple graphs), but it illustrates the concept just fine.

Now, one thing worth calling out is the names of the the columns in the EDGES table. The fact that they have the prepositions “TO” and “FROM” imply a direction, i.e. starting from the “from_node” and heading towards the “to_node”. (In practice, these are often also referred to as the “predecessor” node and the “successor” node.)

This illustrates the last concept worth highlighting, i.e. whether we treat a graph as directed or undirected. (Graphs are NOT always directed, and thus these naming conventions are not entirely generic for arbitrary graphs. However, we won’t consider the case of arbitrary graphs much further, whereas directed graphs will play an important role in the rest of our discussion.)

In other words, we sometimes might care about some concept of directionality or precedence, i.e. when a given node should somehow come before another node that it’s related to.

Let’s further flesh these concepts out with a real-world use case from the world of logistics.

Logistics is, “the part of supply chain management that deals with the efficient forward and reverse flow of goods, services, and related information from the point of origin to the point of consumption.” Large supply chains consist of the waterways, railways, and roads that connect cargo vehicles (ships, trains, trucks) all across the globe.

Supply chain networks can be modeled as directed graphs.
(Image source)

In order to simplify the example, which let’s just consider a single example most of us experience every day: navigating from one point to another using our favorite map app on our phone.

This is obviously a screenshot from my Google Maps app on my phone, showing directions from Boulder to Denver International Airport.

From a data perspective, clearly Google Maps is leveraging geospatial data to visually trace out the exact routes that I could take, but what it’s also doing is figuring out and displaying graph data (which is not actually concerned at all with any geospatial coordinates).

A graph representation of navigation paths

In short, Google Maps is using sophisticated (graph) algorithms to select a handful of routes that are likely to make the most sense for me to choose from, based on time and/or distance.

In my case, I was given three different routes, which I’ve visualised above by just showing the durations for each leg of each route (taking a few liberties here with my version of the graph).

Clearly, Google Maps is summing up the amount of time each leg of each route, and assigning that total value to each route, and displaying them so that I can choose the shortest option.

It’s also showing things like any associated toll road fees (which I’ll just call cost), as well as the total distance.

So, I have three metrics to choose from, for each edge (duration, distance, and cost), and a simple exercise worth thinking about is which of these “edge metrics” should be considered directed vs. undirected:

  • Duration: directed
  • Distance: undirected
  • Cost: undirected

Let me briefly explain.

Duration can often differ in one direction compared to the reverse direction (even though they cover the same route and distance), for example, in cases when there’s a large influx of traffic in one direction, i.e. from suburbs into commercial city centers during the morning rush hour.

Distance between two points on the same leg is pretty much always the same, so that’s an undirected metric (bear with me here — I’m ignoring important edges cases such as construction which might only impact one direction of traffic). And toll road fees (cost) are almost always a function of distance, not time, so those metrics also remain undirected.

So, I can redraw just one leg of one route and show all of these metrics, which also demonstrates how a graph can have multiple edges between the same two nodes (and how some edges can be directed while others can be undirected).

And if I wanted to capture all of this data in my (relational) database, I might model it as follows:

I chose this example not only because it’s intuitive and helps flesh out the key concepts of graphs, but also because it represents a great example of a significant software engineering challenge in a critical real-world use case, i.e. how to optimise supply chain logistics on a much larger scale.

For example, think of all of the manufacturers, shipping ports, distribution centers, warehouses, and stores associated with Wal-Mart across the world, along with the different transport types (truck, train, ship… drone?). Optimising the supply chain of large enterprises is no joke!

In the way of an example, this is one of the core problems addressed by SAP’s Advanced Planning and Optimisation (APO) capabilities, which themselves leveraged an in-memory graph engine which was ultimately rolled into SAP HANA’s graph engine.

But, I’m not here to get in the weeds of any particular use case nor vendor-specific features for that matter, nor do we even need to get in the minutiae of graph algorithms (Dijkstra’s algorithm, anyone?), so let’s take a moment to step back and just summarise the important features of a graph.

Summary

Again, one of the goals of this series of posts is to make complex concepts simple. As you can see in the discussion thus far, such as in the example of optimising travel plans at scale in complex supply chain networks, graph analytics can get very hairy, very quickly, and introduce a need for very sophisticated algorithms.

But, we’re not here for the algorithms. We’re here for the data, and at the end of the data, these points are basically all you need to remember about graphs:

  • A graph consists of one or more nodes
  • Nodes are related to other nodes via one or more edges
  • Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle
  • A given edge can be directed or undirected
  • Any set of nodes that can’t reach other sets of nodes are a subgraph.
  • Nodes and edges can have *attributes associated with them.

*Such attributes can consist of numerical values used in advanced analytics and algorithms. One such example would be numerical attributes of edges called “weights” and numerical attributes of nodes called “biases” which are leveraged in neural networks, which are the graphs undergirding large language models (LLMs) such as ChatGPT.

(Since this is not a BI use case per se, I’m avoiding the typical distinction between “measures” and “attributes”.)

Numerical attributes of graph edges, or “weights”, represented in the edges of a neural network

Hopefully you’ll find those points simple enough to remember, at it covers pretty much everything you’ll need to know about graphs in the context of data engineering! Again, we’re not particularly concerned with graph algorithms, but rather just the structure of graphs themselves.

In the next post, we’ll impose a few constraints on the definition of a graph to arrive at the concept of a direct acyclical graph (DAG), which will help further motivate our introduction of hierarchies.

Lakehouses For Sale

Whilst a ‘Lakehouse’ might sound like something out of a Nordic real estate brochure, we’re actually referring to ‘Data Lakehouses’ – the emerging technology in analytics.

The evolution of the Data Lakehouse all started with Data Warehouses which have been around since the 70’s when Inmon coined the term ‘Data Warehouse’. The Data Warehouse was designed to store numerous disparate data sets within a single repository ensuring that there is a single version of the truth for organisations. These data warehouse solutions are highly structured and governed, but typically they became extremely difficult to change and started to struggle when the volumes of data became too large.

Technology continued to evolve, and organisations started to have an exponential increase in the amount of data that they were creating as a result. This data that was being created by new applications was frequently produced in a semi-structured or unstructured format in continuous streams which is very different to the structured data from transactional systems that data warehouses were used to dealing with. For example, the Internet of Things (IoT) has meant that there are a huge number of devices that are continually creating sensory data such as smart draught pumps in pubs which automatically order new beer deliveries when the keg starts to run out!

This change in data meant that organisations had to adapt their data solutions. Around 10 years ago organisations started to create Data Lakes built using platforms such as Hadoop. Whereas Data Warehouses typically would only store the data that was needed for reporting in a highly structured format, Data Lakes were able to store all data possible but in its raw unstructured or semi-structured format. However, this lack of data quality and consistency often means that it is very difficult to use this data effectively for reporting which has seen much of the promise of Data Lakes unable to be realised (see our blog on ‘Data Swamps’). In addition, these Data Lakes, due to their lack of consistency, also struggled with a lack of governance and functionality compared to Data Warehouse solutions. To deal with this dilemma, lots of organisations typically have a data landscape with a data lake in combination with several data warehouses to help them meet their analytics needs. These extra systems have made the data landscape unnecessarily complex, with data frequently having to be copied between the data lake and the data warehouse.

Evolution of data storage, from data warehouses to data lakes to lakehouses
Figure 1 – Evolution of Data Solutions as Proposed by Databricks

Enter the ‘Data Lakehouse’ – technologies which provide both the benefits from traditional Data Warehouses in combination with the huge amounts of cheap storage utilised for Data Lakes which has only been possible with the advancement of cloud technology. There are several key features of a Data Lakehouse, however perhaps the most important of these is the ability to natively support both structured and semi-structured data. Snowflake, for example, has a native ‘Variant’ data type which means that you can load semi-structured data such as JSON, XML and Avro straight into the Data Lakehouse, whilst also providing SQL extensions to query this semi-structured data directly. The result is that it’s easy to store fully ACID compliant transactions commonly used in management reporting solutions alongside semi-structured data provided by streaming applications. Data Lakehouses also provide the auditability and governance that was clearly lacking in Data Lake solutions supporting traditional modelling techniques such as Star and Snowflake schemas. Lastly, by storing all of their data in one central data hub it reduces both the complexity of their technical landscape as well as reducing the cost as they don’t need to be storing data in both the data warehouse and the data lake (for more details, see our blog ‘Better, Faster, Cheaper”).

Data Lakehouses are the start of a really exciting new era in the world of data and analytics, offering a huge competitive advantage to organisations. Companies can finally use the huge amounts of data that they have started to create, being able to effectively use it in analytics and reporting to drive new insights and ultimately take better decisions.