Mastering Hierarchies: Graphs, DAGs and BI, Oh My! (Part 1 of 7)

Introduction

I remember feeling somewhat lost on the first large-scale data warehousing project I was put on in late 2012, early in my consulting career. SAP was trying to win the business of Europe’s largest chain of hardware stores, and we were thus trying to replatform their data warehouse and business intelligence (BI) platform from Oracle onto an SAP stack.

At the time, I’d like to think I was a pretty strong database developer with some pretty sharp SQL skills, but had still never really spent any time digging into the academic rigors of data warehousing as its own discipline.

As the team started discussing this particular company’s various product hierarchies, I found myself feeling a bit lost as folks threw around terms like “root nodes”, “leaf nodes”, “hierarchy levels” and “recursive CTEs”. Nonetheless, I muddled my way through the discussion and the rest of the project, and since then have spent several years deep in the weeds of hierarchies. Nonetheless, I never felt particularly satisfied with the “academic” resources I had come across.

So, what I’m hoping to accomplish with this series of posts is to create the resource I wish I’d had at the time — an introduction to hierarchies that is accessible yet comprehensive without introducing unnecessary complexity.

So whether you’re earlier in your Data Engineering career, or if you’re more senior but looking for a refresher, I hope you’ll walk away from this series of posts with a much stronger understanding of hierarchies and how to both model and consume them, primarily in the context of BI.

Goals

Now, there’s an endless amount of blog posts out there that cover hierarchies in the context of data warehousing and BI, so I thought it would be helpful to flesh out my specific goals with this post that I feel haven’t really been met by most of the other content out there.

So, in no particular, the goals of this blog series include:

  • Simplification — My personal opinion is that most authors over-complicate hierarchy discussion both in terms of their examples (such as this eyesore in another blog post I found) and the technical minutiae of hierarchies (“ragged” hierarchies anyone?). By focusing on fundamentals that are often glossed over, I’m hoping the discussion will facilitate a much simpler conceptualisation of hierarchies.

  • Cognitive Context — Hierarchies are all over the place in data analytics, but oftentimes folks don’t realise it (case in point: folks often don’t recognise that a time dimension is a hierarchy). But why are hierarchies in particular, as a data structure, so common in data analytics? In other words, how do they support human cognition when it comes to processing information? While my discussion is perhaps overly simplistic, it will help facilitate an understanding of why hierarchies are so helpful, especially in the context of “multi-dimensional” BI, exploratory ad-hoc analytics in particular.

  • Technical Context — hierarchies are a subset of “directed acyclical graphs” (DAGs) which themselves are a subset of graphs. In my opinion, a brief discussion of these different data structures helps Data Engineers, especially those lacking a formal computer science background, better understand the technical concepts underlying hierarchies. (Call me crazy, but I propose that such content helps, rather than hinders, the prior stated goal of “simplification”.)

  • Disambiguation — most content I’ve seen on hierarchies muddies the waters between technical theory and real-world use cases (if I may reuse this rather hideous example), so I hope to more effectively handle those contexts separately before synthesising them. I also hope to deliberately address the vocabulary of hierarchies and disambiguate terms I often see confused, such as the relationship between “parent/child tables” (such as order headers and order line items), and “parent/child hierarchies”, which at first glance have nothing in common (but are actually related).

  • Data Model — I’m also not satisfied with the data model(s) most authors have put forth for hierarchies. So, I’ll share a few generic data models (and iterations thereof) that can accommodate both the ingestion layer and consumption layer of the most common types of hierarchies (parent/child, level, unbalanced, ragged, heterogenous, time-dependent, etc.) along with a brief discussion of the tradeoffs of a schema-on-write model vs. schema-on-read for hierarchy processing data pipelines.

Contents

Here’s is a high-level summary of each of the seven posts in these series:

  • Post 1: Introducing Graphs
  • Post 2: Introducing DAGs
  • Post 3: Introducing (Parent/Child) Hierarchies
  • Post 4: Introducing Level Hierarchies
  • Post 5: Handling Ragged, Unbalanced and Heterogeneous Hierarchies
  • Post 6: Hierarchy Review (Data Modeling)
  • Post 7: Final Thoughts and Reflections

Note that while some SQL and data modeling examples are included throughout the series, the focus is primarily on the concepts behind hierarchies. For a detailed review of relational data models for hierarchies, their corresponding SQL definitions and ETL logic, and a brief discussion on architecture and tradeoffs is addressed in Post 6.

Also note that the query language MDX and its role in consuming hierarchies is touched on at a few points throughout this series of posts, but doesn’t take a central role given its (unfortunate) absence in most “modern data stack” deployments.

Graphs

As alluded to above, I want to ground this discussion of hierarchies in terms of their “super structure” — graphs. This will lead us down a bit of a rabbit hole, but will ultimately help in better framing the discussion of hierarchies and their role as a type of data structure.

To begin, what is meant by the term “graph”? In the context of BI, one might immediately think of something like a visual graph, such as a line chart in a BI tool like Tableau, for example.

However, the academic concept of a graph from computer science describes a data structure that represents “things” and their relationships to one another. More specifically, these “things” are referred to as nodes (or sometimes “vertices”), and the relationships between nodes are referred to as edges.

Here’s a rather trivial (and abstract) example:

In the above example, there are:

  • Six nodes
  • Five edges
  • Two subgraphs
  • One cycle

Can you identify each of those concepts just from the visual depiction of this graph?

The nodes and edges should be easy to figure out.

  • Nodes: A, B, C, D, E, F
  • Edges: A-B, B-C, C-A, D-E, E-F

The two subgraphs should be easy enough to intuit. The above graph shows that A, B, and C are related, and also the D, E and F are related, but each of these collections of nodes aren’t related to each other. That means each of them constitute their own “subgraph”.

(I’m being a bit informal with my terminology here. The overall dataset is technically called a “disconnected graph” since there are two sets of nodes/edges (subgraphs), also called components, which cannot reach other. We could also call it a graph with two connected subgraphs… but in any case, these details are not particularly germane to the rest of this discussion, and thus the formal vernacular of graph theory won’t be addressed much further since it historically hasn’t played an important role in the discussion of hierarchies nor other related topics in the fields of data engineering and business intelligence.)

And lastly, the one cycle should be relatively intuitive. A cycle is a path through the graph that repeats its edges. In other words, I could put the tip of a pencil on the node A (and moving only in one direction) can trace my way from A > B > C and then back to A. This is called a cycle, and clearly there’s no way to do the same thing from D, for example (hence the second sub-graph does not have a cycle).

In the way of a simple SQL example, we could model this data in a relational database (like Snowflake) as follows:

Granted, this is an incredibly simple model, and should probably be extended (i.e. to accommodate attributes associated with each node/edges, as well as accommodate multiple graphs), but it illustrates the concept just fine.

Now, one thing worth calling out is the names of the the columns in the EDGES table. The fact that they have the prepositions “TO” and “FROM” imply a direction, i.e. starting from the “from_node” and heading towards the “to_node”. (In practice, these are often also referred to as the “predecessor” node and the “successor” node.)

This illustrates the last concept worth highlighting, i.e. whether we treat a graph as directed or undirected. (Graphs are NOT always directed, and thus these naming conventions are not entirely generic for arbitrary graphs. However, we won’t consider the case of arbitrary graphs much further, whereas directed graphs will play an important role in the rest of our discussion.)

In other words, we sometimes might care about some concept of directionality or precedence, i.e. when a given node should somehow come before another node that it’s related to.

Let’s further flesh these concepts out with a real-world use case from the world of logistics.

Logistics is, “the part of supply chain management that deals with the efficient forward and reverse flow of goods, services, and related information from the point of origin to the point of consumption.” Large supply chains consist of the waterways, railways, and roads that connect cargo vehicles (ships, trains, trucks) all across the globe.

Supply chain networks can be modeled as directed graphs.
(Image source)

In order to simplify the example, which let’s just consider a single example most of us experience every day: navigating from one point to another using our favorite map app on our phone.

This is obviously a screenshot from my Google Maps app on my phone, showing directions from Boulder to Denver International Airport.

From a data perspective, clearly Google Maps is leveraging geospatial data to visually trace out the exact routes that I could take, but what it’s also doing is figuring out and displaying graph data (which is not actually concerned at all with any geospatial coordinates).

A graph representation of navigation paths

In short, Google Maps is using sophisticated (graph) algorithms to select a handful of routes that are likely to make the most sense for me to choose from, based on time and/or distance.

In my case, I was given three different routes, which I’ve visualised above by just showing the durations for each leg of each route (taking a few liberties here with my version of the graph).

Clearly, Google Maps is summing up the amount of time each leg of each route, and assigning that total value to each route, and displaying them so that I can choose the shortest option.

It’s also showing things like any associated toll road fees (which I’ll just call cost), as well as the total distance.

So, I have three metrics to choose from, for each edge (duration, distance, and cost), and a simple exercise worth thinking about is which of these “edge metrics” should be considered directed vs. undirected:

  • Duration: directed
  • Distance: undirected
  • Cost: undirected

Let me briefly explain.

Duration can often differ in one direction compared to the reverse direction (even though they cover the same route and distance), for example, in cases when there’s a large influx of traffic in one direction, i.e. from suburbs into commercial city centers during the morning rush hour.

Distance between two points on the same leg is pretty much always the same, so that’s an undirected metric (bear with me here — I’m ignoring important edges cases such as construction which might only impact one direction of traffic). And toll road fees (cost) are almost always a function of distance, not time, so those metrics also remain undirected.

So, I can redraw just one leg of one route and show all of these metrics, which also demonstrates how a graph can have multiple edges between the same two nodes (and how some edges can be directed while others can be undirected).

And if I wanted to capture all of this data in my (relational) database, I might model it as follows:

I chose this example not only because it’s intuitive and helps flesh out the key concepts of graphs, but also because it represents a great example of a significant software engineering challenge in a critical real-world use case, i.e. how to optimise supply chain logistics on a much larger scale.

For example, think of all of the manufacturers, shipping ports, distribution centers, warehouses, and stores associated with Wal-Mart across the world, along with the different transport types (truck, train, ship… drone?). Optimising the supply chain of large enterprises is no joke!

In the way of an example, this is one of the core problems addressed by SAP’s Advanced Planning and Optimisation (APO) capabilities, which themselves leveraged an in-memory graph engine which was ultimately rolled into SAP HANA’s graph engine.

But, I’m not here to get in the weeds of any particular use case nor vendor-specific features for that matter, nor do we even need to get in the minutiae of graph algorithms (Dijkstra’s algorithm, anyone?), so let’s take a moment to step back and just summarise the important features of a graph.

Summary

Again, one of the goals of this series of posts is to make complex concepts simple. As you can see in the discussion thus far, such as in the example of optimising travel plans at scale in complex supply chain networks, graph analytics can get very hairy, very quickly, and introduce a need for very sophisticated algorithms.

But, we’re not here for the algorithms. We’re here for the data, and at the end of the data, these points are basically all you need to remember about graphs:

  • A graph consists of one or more nodes
  • Nodes are related to other nodes via one or more edges
  • Nodes that that can reach themselves again (by traversing the graph in a single direction) indicate the graph has a cycle
  • A given edge can be directed or undirected
  • Any set of nodes that can’t reach other sets of nodes are a subgraph.
  • Nodes and edges can have *attributes associated with them.

*Such attributes can consist of numerical values used in advanced analytics and algorithms. One such example would be numerical attributes of edges called “weights” and numerical attributes of nodes called “biases” which are leveraged in neural networks, which are the graphs undergirding large language models (LLMs) such as ChatGPT.

(Since this is not a BI use case per se, I’m avoiding the typical distinction between “measures” and “attributes”.)

Numerical attributes of graph edges, or “weights”, represented in the edges of a neural network

Hopefully you’ll find those points simple enough to remember, at it covers pretty much everything you’ll need to know about graphs in the context of data engineering! Again, we’re not particularly concerned with graph algorithms, but rather just the structure of graphs themselves.

In the next post, we’ll impose a few constraints on the definition of a graph to arrive at the concept of a direct acyclical graph (DAG), which will help further motivate our introduction of hierarchies.

SAP’s cynical move to keep control of your enterprise data (aka note 3255746)

SAP has rocked the boat. They have issued an SAP note (3255746), declaring a popular method for moving data from SAP into analytics platforms out of bounds for customers. Customers and software vendors are concerned. They have to ensure they operate within the terms & conditions of the license agreement with SAP. It seems unfair that SAP unilaterally changes these Ts and Cs after organisations have purchased their product. I will refrain from giving legal advice but my understanding is that SAP notes are not legally binding. I imagine the legal teams will have a field day trying to work this all out. In this article I will explain the context and consequences of this SAP note. I will also get my crystal ball out and try and predict SAPs next move, as well as giving you some further considerations which perhaps help you decide how to move forward.

What exactly have SAP done this time?

SAP first published note 3255746 in 2022. In the note, SAP explained that 3rd parties (customers, product vendors) could use SAP APIs for the Operational Data Provisioning (ODP) framework but these APIs were not supported. The APIs were meant for internal use. As such, SAP reserved the right to change the behaviour and/or remove these APIs altogether. Recently, SAP have updated the note (version 4). Out of the blue, SAP declared it is no longer permitted to use the APIs for ODP. For good measure, SAP threatens to restrict and audit the unpermitted use of this feature. With a history of court cases decided in SAPs favour over license breaches, it is no wonder that customers and software vendors get a bit nervous. So, let’s look at the wider context. What is this ODP framework and what does it actually mean for customers and product vendors?

SAP ODP – making the job of getting data out of SAP somewhat less painful

Getting data out of SAP is never easy, but ODP offered very useful features to take away some of the burden. It enabled external data consumers to subscribe to datasets. Instead of receiving difficult to decipher raw data, these data sets would contain data which was already modelled for analytical consumption. Moreover, the ODP framework supports ‘delta enabled’ data sets, which significantly reduces the volumes of data to refresh on a day-to-day basis. When the ODP framework was released (around 2011(1)), 3rd party data integration platforms were quick to provide a designated SAP ODP connector. Vendors like Informatica, Talend, Theobald and Qlik have had an ODP connector for many years. Recently Azure Data Factory and Matillion released their connector as well. SAP also offered a connection to the ODP framework through the open data protocol OData. This means you can easily build your own interface if the platform of your choice does not have an ODP plug-in.

One can imagine that software vendors are not best pleased with SAP’s decision to no longer permit the use of the ODP framework by 3rd parties. Although all platforms mentioned above have other types of SAP connectors(2), the ODP connector has been the go-to solution for many years. The fact that this solution was not officially supported by SAP has never really scared the software vendors. ODP was and remains to be deeply integrated in SAP’s own technology stack and the chances that SAP will change the architecture in current product versions are next to zero.

Predicting SAP’s next move

You might wonder why SAP is doing this? Well, in recent years, customers have voted with their feet and moved SAP data to more modern, flexible and open data & analytics platforms. There is no lack of competition. AWS, Google, Microsoft, Snowflake and a handful of other contenders all offer cost effective data platforms, with limitless scalability. On these data platforms, you are free to use the data and analytics tools of your choice, or take the data out to wherever you please without additional costs. SAP also has a data & analytics platform but this is well behind the curve. There are two SAP products to consider, SAP Analytics Cloud (SAC) and SAP DataSphere.
The first is a planning and analytics toolset for business users and was introduced in 2015. For a long time, it was largely ignored. In recent years, it has come to maturity and should now be considered a serious contender to PowerBI, Tableau, Qlik and so on. I’m not going to do a full-blown comparison here but the fact that SAC has integrated planning capabilities is a killer feature.
SAP DataSphere is a different story. It is relatively new (introduced as SAP Data Warehouse Cloud in 2020) – and seasoned SAP professionals know what to do with new products: If you’re curious you can do a PoC or innovation project. If not, or you don’t have the time or means for this kind of experimenting, you just sit and wait until the problems are flushed out. SAP DataSphere is likely to suffer from teething problems for a bit longer, and it will take time before it is as feature-rich as the main competitor data platforms. One of the critical features which was missing until very recently was the ability to offload data to cloud storage (S3/Blob/buckets, depending on your cloud provider). That feature was added in Feb 2024. Around the same time as when SAP decided that 3rd parties could no longer use the ODP interface to achieve exactly the same. Coincidence?

So where is SAP going with this? Clearly they want all their customers to embrace SAP DataSphere. SAP charges for storage and compute so of course they try and contain as many workloads and as much data as they can on their platform. This is not different from the other platform providers. What is different is that SAP deliberately puts up barriers to take the data out, where other providers let you take your data wherever you want. SAP’s competitors know they offer a great service at a very competitive price. It seems SAP doesn’t want to compete on price or service, but chooses to put up a legal barrier to keep the customer’s data on their platform.

SAP Certification for 3rd party ETL tools no longer available

Blocking the use of ODP by 3rd party applications is only the beginning. SAP has already announced it will no longer certify 3rd party ETL tools for the SAP platform(3). The out-and-out SAP specialists have invested heavily in creating bolt-on features on the SAP platform to replicate large SAP data sets efficiently, often in near real-time. The likes of Fivetran, SNP Glue and Theobald have all introduced their own innovative (proprietary) code purely for this function. SAP used to certify this code, but has now stopped doing so. Again, the legal position is unclear and perhaps SAP will do a complete u-turn on this, but for now it leaves these vendors wondering what the future will be for their SAP data integration products.

What do you need to do if you use ODP now through a 3rd party application?

My advice is to start with involving your legal team. In my opinion an SAP note is not legally binding like terms & conditions are, but I appreciate my opinion in legal matters doesn’t count for much.
If you are planning to stay on your current product version for the foreseeable future and you have no contract negotiations with SAP coming up then you can carry on as normal. When you are planning to move to a new product version though, or if your contract with SAP is up for renewal, it would be good to familiarise yourself with alternatives.

As I mentioned before, most 3rd party products have multiple ways of connecting to SAP, so it would be good to understand what the impact is if you had to start using a different method.
It also makes sense to stay up-to-date with the SAP DataSphere roadmap. When I put my rose-tinted glasses on, I can see a future where SAP provides an easy way to replicate SAP data to the cloud storage of your choice, in near-real time, in a cost effective way. Most customers wouldn’t mind paying a reasonable price for this. I expect SAP and its customers might have a very different expectation of what that reasonable price is but until the solution is there, there is no point speculating. If you are looking for some inspiration to find the best way forward for you, come talk to Snap Analytics. Getting data out of SAP is our core business and I am sure we can help you find a futureproof, cost effective way for you.


Footnotes and references

(1) – The ODP framework version 1.0 was released around 2011, specifically with SAP NetWeaver 7.0 SPS24,  7.01 SPS 09, 7.02 SPS 08. The current version of ODP is 2.0, which was released in 2014 with SAP Netweaver 7.3 SPS 08, 7.31 SPS 05, 7.4 SPS 02. See notes 1521883 and 1931427 respectively.

(2) – Other types of SAP connections: One of my previous blog posts discusses the various ways of getting data out of SAP in some detail: Need to get data out of SAP and into your cloud data platform? Here are your options

(3) – Further restrictions for partners on providing solutions to get data out of SAP, see this article: Guidance for Partners on certifying their data integration offerings with SAP Solutions

Be a Data Hero and deliver Net Zero!

The biggest problem in the WORLD!

It is clear that we need radical changes to save our planet. Governments, the private sector and individuals aspire to achieve ‘Net Zero’ – but radically changing the way we operate is not going to be easy.

Achieving this goal is going to be a huge challenge for big, complex organisations.  There are so many areas to explore, from reducing travel and fossil fuel consumption, leveraging renewable energy, improving efficiency of existing equipment, or simple behavior change.  With so much complexity the task can be daunting. 

Can data save us?…

Starting with data can help you to understand where the quickest and biggest wins are.  This helps you to understand what to focus on first.  As Peter Drucker once famously said “You can’t manage what you don’t measure”.

To create a link between desired outcomes and measurable targets you can use a ‘Data Value Map’. Whilst I love technology and data…it’s only useful when it drives actions and creates positive change.  The Data Value Map helps to visualise how data can help you to achieve your goals.  If your goal is Net Zero…it could look something like this:

Data Value Maps can be achieved using a mind mapping or collaboration tool (I like Mindmeister and Miro) and are best done as a highly collaborative team workshop…don’t forget to bring the coffee and cakes!

Now you have a clear view what data is required to measure and act (your “use cases”) to deliver the Net Zero goal.  Next you can score these in terms of Value and Complexity.  Something like a prioritisation matrix can help:

By focusing in on the ‘high priority’ and ‘low complexity’ use cases you can deliver quick wins to the business.  This will help you to demonstrate you are a true Data Hero and can help your organisation to fly!

Once you have prioritised your use cases, you can start to map out the underpinning systems and processes that are needed to deliver connected, structured data to drive your Net Zero goals. 

Delivering at lightning speed…

There are numerous technologies out there that can help you connect all of this data, but we love Matillion for being able to easily and quickly connect to almost any source and transform and join data to make it useful.  As a data platform Snowflake is fantastic for virtually unlimited storage, blistering speed, data warehousing and data science capabilities.  These technologies will certainly enable you to hone your capabilities as a true Data Hero!! There are also many other fantastic cloud solutions that can help you to supercharge your Net Zero data capabilities.

Join the Data League!

Snap Analytics’ team of Data Heroes are helping one of the UK’s largest food manufacturers to leverage data to drive positive change…but if we’re going to solve humanity’s greatest threat…it’s going to take a whole Justice League of Data Heroes.  So join us on this mission to save the planet, and lets all make sure the decision makers in our organisations have the data they need to drive positive change.  Don’t delay…be a Data Hero today!

We believe that businesses have a responsibility to look after our earth…it’s the only one we have!  We will give any organisation a 15% discount on our standard rates for any work directly linked to making a positive change to the environment!

Part 2 – Scaling and Flexing: Data Vault Benefits

If you’re wondering what exactly Data Vault is then check out Part 1 of our Data Vault blog where we outline exactly what Data Vault is, otherwise keep reading to find out all the great benefits of using Data Vault methodology!

Scalability

Probably the biggest benefit given the current data landscape that we find ourselves in is the scalability of data vault. From my own experiences, data warehouses over time become more resistant to change and much less flexible to adapt to the changing needs of the business. Once certain reports and datasets have been created, in order to develop new functionality all of the existing functionality will need to be tested to ensure that it hasn’t broken, and refactored if it is. This is where Data Vault really helps as it is easily extensible without the worry about losing any historical data. New link tables and satellites can easily be added to the model, and similarly other link tables can be closed off when required. When virtual information marts are used this means it’s even more adaptable as it’s only the joins that will require updates to handle the changes.

Flexibility

Link tables mean that relationships are modelled in a way that it assumes that there will be many-to-many requiring virtually no work when a relationship changes. All historical data is tracked by default in the Data Vault by using the satellite tables meaning again the flexibility is there to report on the historical changes of the data warehouse. For example, if the business decides that actually they need to report on the historical status of a master data object that previously wasn’t a requirement, this can simply be provided by altering the joins to pull the correct data from the master data satellite tables. Previously we would have probably thought that using this approach would have caused lots of redundant data to be stored, however with the rise of cloud storage the costs for storing are so trivial that this now means we can design and build with future change in mind.

Auditability and Traceability

This tracking of history also means that the data warehouse is fully auditable. Data Vault practitioners like to say that it becomes the “single version of the facts” rather than the “single version of the truth”. All data is loaded into the data vault exactly as it was in source, rather than just cleansed data for a specific reporting purpose. This means that you can fully reconcile the data back to the source system at the point of entry, and it also becomes a lot easier to track the data lineage. Business users will no longer see the data warehouse as a ‘black box’, but instead it will become the trusted source for all of their business data!

Big Data & Loading

Traditional modelling approaches were created long before the advent of semi-structured and machine generated data. In Data Vault there isn’t the requirement to cleanse and conform the data to ensure star schema compliance which means that huge volumes of data can be loaded very quickly. Data Vault also uses hashed keys meaning that there is reduced dependencies during loading between hub, satellite and link tables meaning that they can be loaded in parallel, making use of scalable processing power that is available with cloud data warehouse solutions like Snowflake. Data vault also uses the concept of a virtual information mart layer on top of the data vault for reporting which again means increased performance as cubes don’t need to be persisted and loaded. This is a key feature that helps to ensure that real-time data warehousing can be realised.

All of this sounds great, so why isn’t data vault more widely used when building out data warehouses? I think the simply answer can be put down to a couple of key factors. The first is that cloud data warehousing has only recently started to be adopted, and it’s only with the scalability of cloud storage and processing that means that a data vault structure is truly viable. The other key factor is that if a project that is being delivered is a small project then it may be that building it using a data vault methodology is overkill. However, given the increasing importance of data and stricter data regulations (e.g. GDPR) I would argue that in many cases Data Vault methodology would be the right choice to ensure a modern and scalable data warehouse.

Written by Tom Bruce the Delivery Lead and Co-founder of Snap Analytics.

5 things to do before starting a data project

You’re about to start a big data project. Fantastic! We’re big believers in the fact that every business can gain a real competitive advantage through analysing their data. It’s why we do what we do. 

But just before you go running off all excited, stop for a moment. If you really want your data project to be a success, you need to think about five key things before you even start. 

Understand the problem that you are trying to solve 

Chances are you’re looking to data analytics to fix a specific need, something which is causing inefficiencies and costing you money. Don’t assume though that by shaking the data tree enough times a solution will magically fall into your lap. First you need to look at your existing systems to see what exactly needs fixing. It’s only once you have a clearly defined vision and end point in mind, that we can see exactly how we can help. 

Define what success will look like  

Having identified what you want, it is time to think about what a successful outcome might look like. It helps nobody to embark on a data project without setting any specific goals or measurable outcomes. So make a plan, draw up a list of milestones, devise ways of measuring what’s happening and then track the results against that. One useful approach we’ve found is to run a user survey six months down the line and find out how people are using, or benefitting from the findings.  

Align with company strategy 

It’s all very well you dreaming up fantastic, innovative data driven projects that will change the very fabric of your business and the world generally. But it might be best, first of all, to check that your goals are something that fall in line the wider strategic direction of the business. Is this a problem you should even be solving? Is it a business priority? Will it help tick some important boxes when the annual report comes round? If the answer is yes to all the above, fantastic – you’re on your way to getting managerial buy-in and tapping up a healthy budget for an important piece of work. 

Data for the people 

You’ve addressed the needs of the bigger cheeses but don’t forget about the little guys, the people on the front line who are working hard to produce this data in the first place. Think about how this is going to benefit them in the long term, how will it make their day-to-work work easier, more efficient, or more effective? This is particularly pertinent if your business is going through a restructuring process. We’re great believers in the power of data analysis, but if you’re losing half your team it might not be perceived as the best use of company resources. 

Build the right team 

One commonly held assumption we come across is the idea that data is purely a tech led process: you identify a problem or need and the nerds crunch the numbers. It’s not that simple of course. To produce an effective outcome, you need quality input from people on the business side, members of the team who can provide insights into how the company works and what its goals and strategies are. You should bring together people who use the data in different ways and can provide the broadest possible range of experience. That way the insights we produce will be deeper, richer and ultimately more valuable. 

Modern Data Warehouse – Snowflake & Fivetran

Better, cheaper, faster – you can have all three!

You may have heard the saying “Better, cheaper, faster – pick two”. The idea isn’t new. If you want something really good and you want it quickly, you’re going to have to pay. If you want to save money and still keep the quality, you’ll need to wait. And so on.

But in big data that mantra is being subverted. Thanks to the cloud, you can now deliver data solutions that are more flexible, scalable and, crucially, cheaper and faster. Best of all it doesn’t mean abandoning quality or reliability – if well designed you can achieve better quality and consistency.

Not so long ago, if you were planning a data project you might have to set aside a big chunk of your budget for licences, servers and a data warehouse in which to store it. On top of this, you’d need a specialised (and potentially expensive) team of people to set up the infrastructure and operate the hardware. This effectively put data analysis out of the reach of many smaller businesses.

The cloud has changed all that – revolutionising the delivery of data analytics. So, what exactly can it offer you?

BETTER

Today’s cloud based technology is so simple even the least tech savvy people are able to reap the rewards. You no longer need to know how much storage you require up front as companies like Snowflake simply offer small, medium or large solutions plus the option of almost infinite scalability. For many new and smaller businesses the entry package will be enough, allowing you to upload millions of rows of data. And as you expand you can simply scale up.

Conventional wisdom once said that there was no more secure way of storing data than keeping it all on your premises where it was maintained by and managed by a member of staff. In 2019 that is no longer true. Even the most conscientious IT person will be constrained by your budget and facilities. By handing this responsibility over to the likes of Microsoft with their near infinite resources, there is arguably no safer way of storing your valuable data.

CHEAPER

The maths is simple: with modern data platforms like Snowflake, you just pay for what you use. Whereas previously you would have had to try and work out up front how much space you needed and hope you hadn’t overestimated or underestimated (with the associated painful time and cost implications), now you can simply scale up or down as necessary as and when your business requires. If for example your business acquires a new company, it’s easy, simply instantly increase the size of your data warehouse. At the time of writing, a terabyte of storage with Snowflake is an astonishing $23 per month.

This flexibility also means reduced waste. Once you had to pay for a solution that might only be used on one day every month when you had to run 10,000 reports. The other 30 days it sat idle costing you money. Now you can pay for the smallest package for the majority of the month and set it to automatically scale up when you really need the resources.

FASTER

Remember the sound of whirring fans and the wall of heat that would hit you when you went anywhere near the server room? Thanks to the cloud you can do away with the racks upon racks of energy guzzling storage and move it all off site, possibly thousands of miles away. This doesn’t make it slower; thanks to modern petabyte networks, you can access your data in a fraction of the time, generating reports in 10 seconds rather than 20 minutes.

Several years ago Snap Analytics was hired by a large automotive manufacturer for a major project based on their premises. At the time cloud storage didn’t have quite the same functionality and wasn’t trusted to do the job. As a result we had to work on site with their people, working within their existing systems just to set up the architecture. It added nearly 6 months to the project – and quite a few zeros to the final invoice. Thankfully, with modern data platforms, these overheads are completely eliminated, the scalability is infinite and the speed is truly phenomenal. And all delivered at a fraction of the price!