Recap
In the first post of this series, I walked through the basic ideas of what constitutes a graph (from a somewhat formal computer science perspective) with a simplified example from supply chain logistics.
To briefly recap:
- A graph consists of one or more nodes.
- Nodes are related to other nodes via one or more edges.
- Nodes that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle.
- A given edge can be directed or undirected.
- Any set of nodes that can’t reach other sets of nodes forms a separate subgraph (more precisely, a connected component).
- Nodes and edges can have attributes, including numerical attributes (usually called weights), that are often leveraged when executing more advanced graph algorithms/analytics.
In terms of goals set forth in the last post:
- The last post gave a sense of what a data model might start to look like for graphs in general, which we’ll continue to flesh out.
- It attempted to simplify the idea of graphs and the few rules you need to understand what graphs represent.
In terms of the goals of this series of posts, this particular post will further flesh out the technical context for hierarchies (another stated goal of this series) by digging further into a certain subset of graphs called directed acyclical graphs (DAGs, also commonly written as directed acyclic graphs), and will further disambiguate DAGs and their related concepts.
Directed Acyclical Graphs (DAGs)
This post is not actually about ETL tools, but I would say most modern-day Data Engineers are at least loosely familiar with the concept of DAGs, as the term has seen quite a bit of adoption in data pipeline orchestration tools like Apache Airflow and, uh, Dagster.
Now that we’ve fleshed out the concept of a graph, it’s pretty straightforward to flesh out what a DAG is, just by examining the acronym.
- A DAG is a graph where each and every edge is directed, and
- that has no cycles, i.e. is acyclical
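To make the two conditions concrete, here is a minimal sketch of a DAG check in Python. It assumes the graph is given as a list of node names plus a list of directed `(source, target)` edges, and it uses Kahn's algorithm, which is one common way to detect cycles (not the only one). The function name `is_dag` and the sample graphs are made up for illustration.

```python
from collections import deque

def is_dag(nodes, edges):
    """Return True if the directed graph has no cycles (Kahn's algorithm)."""
    # Count incoming edges per node and build an adjacency list.
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1

    # Start from nodes with no incoming edges and peel them off one by one.
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Nodes that sit on a cycle never reach in-degree zero, so they are never visited.
    return visited == len(nodes)

# Hypothetical sample graphs.
nodes = ["a", "b", "c"]
acyclic_edges = [("a", "b"), ("b", "c"), ("a", "c")]
cyclic_edges = [("a", "b"), ("b", "c"), ("c", "a")]

print(is_dag(nodes, acyclic_edges))  # True
print(is_dag(nodes, cyclic_edges))   # False
```

Because every edge is a `(source, target)` pair, the "directed" half of the definition is baked into the representation; the function only has to verify the "acyclical" half.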
So, where and how do DAGs present themselves in the world of data engineering? While this is a series of posts specifically about hierarchies, not DAGs, I think it’s quite relevant to quickly examine all the different instances of DAGs in data engineering, as they are everywhere.
Also, in the way of disambiguation, let me point out what should hopefully be obvious — DAGs are an abstract concept from graph theory, which includes (but in no way is limited to) what we might call “data flow graphs” in an ETL/orchestration tool, which have come to just be referred to as DAGs. So, be mindful of context when discussing DAGs, as they constitute a much broader concept than just that of data pipelines.
Queries
Queries are DAGs. Look at the EXPLAIN PLAN of even the most basic query, and you’ll see that the visualised execution plan is a DAG, where:
- Each node is a particular operation (such as join, filter, aggregate).
- Each edge visualises how the output of a given operation serves as an input to the next operation.
Data Lineage
Data lineage, which represents the flow of data across a data pipeline, is also a DAG. (What I find interesting to note is that in the case of data lineage, each edge represents a particular operation, and each node represents data. This is the opposite of a query plan, where each edge represents data, and each node represents an operation.)
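The node/edge swap between lineage and query plans can be sketched in a few lines of Python. This is a simplified illustration, not how any particular lineage tool or database models things: the `steps` list, the dataset names, and the operation names are all hypothetical, and source tables are left out of the plan view to keep it short (a real plan would typically show them as scan operations).

```python
# Hypothetical steps: (operation, input datasets, output dataset).
steps = [
    ("filter", ["orders"], "recent_orders"),
    ("join", ["recent_orders", "customers"], "enriched_orders"),
    ("aggregate", ["enriched_orders"], "revenue_by_customer"),
]

# Lineage view: nodes are datasets, and each edge is labelled with an operation.
# Edge shape: (source dataset, target dataset, operation).
lineage_edges = [(src, out, op) for op, inputs, out in steps for src in inputs]

# Plan view: nodes are operations, and each edge is labelled with the data handed over.
# Edge shape: (producing operation, consuming operation, dataset).
producer = {out: op for op, _, out in steps}
plan_edges = [
    (producer[src], op, src)
    for op, inputs, _ in steps
    for src in inputs
    if src in producer  # source tables have no producing operation in this sketch
]

print(lineage_edges)
print(plan_edges)
```

Both edge lists are derived from the same three steps, which is the point: it's one underlying DAG, viewed with the roles of nodes and edges exchanged.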
Data Pipelines
Data pipelines are, rather obviously, DAGs.
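One practical consequence: because a pipeline is a DAG, there is always at least one valid order in which to run its tasks. Here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names and the `run_order` helper are hypothetical, and this is not how Airflow or Dagster schedule work internally; it only illustrates the idea.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "aggregate_revenue": {"join_orders_customers"},
    "load_report": {"aggregate_revenue"},
}

def run_order(dependencies):
    """Return one valid execution order, or None if the dependencies contain a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None

order = run_order(pipeline)
print(order)

# A "pipeline" with a cycle has no valid order at all.
print(run_order({"a": {"b"}, "b": {"a"}}))  # None
```

The exact order printed for the two extract tasks may vary, since neither depends on the other; any order that places every task after its dependencies is equally valid.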
Gantt Charts
Gantt charts? Aren’t Gantt charts something from project management? Yes, but they’re also used to visualise query plans: each operation in the query execution plan is drawn as a bar whose length represents the amount of time that operation takes.
(Random tangent: I’ve always wanted a dynamic Gantt chart visualisation for query plans, where instead of being limited just to time (represented by bar width), I’d have the choice to select from things like memory consumption, CPU time, any implicit parallelization factors, disk I/O for root nodes, etc… any product managers out there want to take up the mantle?)
Relational Data Models
Okay, so, cycles can be found in some real-world relational data models, but even so, all relational data models are directed, and many are acyclical, so I decided to include them here.
- Entities are nodes.
- Primary/foreign key relationships are the edges.
- The direction of each directed edge is from primary key to foreign key.
- Clearly, large enterprise data models with multiple subject areas can consist of multiple subgraphs.
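As a rough sketch of the points above, here is a tiny, entirely hypothetical schema expressed as a graph in Python. Each edge runs from the table holding the primary key to the table holding the foreign key. The `subject_areas` helper groups tables into separate subgraphs by ignoring edge direction, and the self-referencing `employee` table (think of a `manager_id` column) shows how a cycle can sneak into a real-world model.

```python
from collections import defaultdict

# Hypothetical schema: (referenced table, referencing table), i.e. each edge
# runs from the primary-key side to the foreign-key side.
foreign_keys = [
    ("customer", "sales_order"),
    ("sales_order", "order_line"),
    ("product", "order_line"),
    ("department", "employee"),
    ("employee", "employee"),  # e.g. manager_id: a self-reference, hence a cycle
]

def subject_areas(edges):
    """Group tables into disconnected subgraphs, ignoring edge direction."""
    neighbours = defaultdict(set)
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)

    seen, areas = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first walk collecting every table reachable from `start`.
        stack, area = [start], set()
        while stack:
            table = stack.pop()
            if table in area:
                continue
            area.add(table)
            stack.extend(neighbours[table] - area)
        seen |= area
        areas.append(sorted(area))
    return areas

# Tables whose foreign key points back at their own primary key.
self_references = [parent for parent, child in foreign_keys if parent == child]

print(subject_areas(foreign_keys))
# [['customer', 'order_line', 'product', 'sales_order'], ['department', 'employee']]
print(self_references)
# ['employee']
```

Here the sales tables and the HR tables never reference each other, so they fall into two separate subgraphs, and only the HR one stops being acyclical because of the self-reference.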
And if you’re in the mood for a bit of nuance, it’s absolutely fair to refer to the data model itself (i.e. the metadata) as a DAG, and to separately call out the data itself (i.e. the records themselves and their primary/foreign key relationships) as a DAG too.
Summary
This is a fairly quick post to summarise:
- DAGs are directed, acyclical graphs
- The common use of the term DAG by Data Engineers is usually limited to the code artifacts of many modern data pipeline/orchestration tools, although it’s clearly a much broader concept.
- Even within the field of Data Engineering, DAGs can be found everywhere, as they describe the structure of things like: query execution plans, data lineage visualisations, data pipelines, and entity-relationship models.
In the next post, we’ll further constrain the definition of a DAG in order to arrive at the concept of a hierarchy and discuss a handful of related considerations.