Part 2 – Scaling and Flexing: Data Vault Benefits

If you’re wondering what exactly Data Vault is, check out Part 1 of our Data Vault blog where we cover the fundamentals; otherwise, keep reading to find out about the key benefits of using the Data Vault methodology!

Scalability

Given the current data landscape we find ourselves in, probably the biggest benefit is the scalability of Data Vault. In my experience, data warehouses become more resistant to change over time and less able to adapt to the evolving needs of the business. Once certain reports and datasets have been created, developing new functionality means testing all of the existing functionality to ensure it hasn’t broken, and refactoring it where it has. This is where Data Vault really helps, as the model is easily extensible without the worry of losing any historical data. New link tables and satellites can easily be added, and existing link tables can be closed off when required. When virtual information marts are used the model is even more adaptable, as only the joins need updating to handle the changes.
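
As a rough sketch of what this extensibility looks like in practice (using Snowflake-style SQL; the table and column names below are hypothetical, not from the article), a new satellite can be added alongside an existing Customer hub without modifying any existing tables or loads:

```sql
-- Hypothetical example: extending an existing Customer hub with a new
-- satellite for loyalty data. Existing hubs, links and satellites are
-- untouched, and no historical data is lost.
CREATE TABLE sat_customer_loyalty (
    customer_hash_key  CHAR(32)      NOT NULL,  -- points to hub_customer
    load_date          TIMESTAMP_NTZ NOT NULL,  -- when the row was loaded
    record_source      VARCHAR(50)   NOT NULL,  -- originating system
    loyalty_tier       VARCHAR(20),
    points_balance     NUMBER(10,0),
    PRIMARY KEY (customer_hash_key, load_date)
);
```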

Flexibility

Link tables mean that relationships are modelled as many-to-many by default, so virtually no rework is required when a relationship changes. All historical data is tracked by default in the Data Vault through the satellite tables, giving the flexibility to report on historical changes across the data warehouse. For example, if the business decides it needs to report on the historical status of a master data object that previously wasn’t a requirement, this can be provided simply by altering the joins to pull the correct data from the master data satellite tables. Previously we might have assumed that this approach would store lots of redundant data; however, with the rise of cloud storage the costs are now so trivial that we can design and build with future change in mind.
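
To illustrate (a hypothetical query; the hub, satellite and column names are assumed for the example), the historical status of a customer can be reconstructed from its satellite by deriving each record’s validity window from the load dates:

```sql
-- Hypothetical query: rebuild the history of a customer's status from
-- the satellite, using each row's load date as the start of its
-- validity window and the next row's load date as the end.
SELECT
    h.customer_id,
    s.status,
    s.load_date AS valid_from,
    COALESCE(
        LEAD(s.load_date) OVER (
            PARTITION BY s.customer_hash_key
            ORDER BY s.load_date),
        CURRENT_TIMESTAMP) AS valid_to
FROM hub_customer h
JOIN sat_customer_status s
  ON s.customer_hash_key = h.customer_hash_key;
```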

Auditability and Traceability

This tracking of history also means that the data warehouse is fully auditable. Data Vault practitioners like to say that it becomes the “single version of the facts” rather than the “single version of the truth”. All data is loaded into the Data Vault exactly as it was in the source, rather than as cleansed data for a specific reporting purpose. This means that you can fully reconcile the data back to the source system at the point of entry, and it becomes a lot easier to track data lineage. Business users will no longer see the data warehouse as a ‘black box’; instead it will become the trusted source for all of their business data!
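
As a simple sketch of what that reconciliation can look like (hypothetical names; a typical Data Vault satellite carries a record source and load timestamp on every row), row counts per source and load day can be compared directly against the source system’s own extracts:

```sql
-- Hypothetical reconciliation query: every satellite row records where
-- and when it came from, so totals can be checked against the source.
SELECT
    record_source,
    DATE_TRUNC('day', load_date) AS load_day,
    COUNT(*)                     AS rows_loaded
FROM sat_customer_details
GROUP BY record_source, DATE_TRUNC('day', load_date)
ORDER BY load_day, record_source;
```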

Big Data & Loading

Traditional modelling approaches were created long before the advent of semi-structured and machine-generated data. Data Vault has no requirement to cleanse and conform the data into star schema compliance, which means that huge volumes of data can be loaded very quickly. Data Vault also uses hashed keys, which reduces the dependencies between hub, satellite and link tables during loading, so they can be loaded in parallel, making use of the scalable processing power available in cloud data warehouse solutions like Snowflake. Data Vault also uses the concept of a virtual information mart layer on top of the vault for reporting, which again improves performance as cubes don’t need to be persisted and loaded. This is a key feature in making real-time data warehousing achievable.
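
A minimal sketch of a parallel-friendly hub load (hypothetical staging table and names; MD5 is one common choice of hash function in Data Vault implementations): because the hashed key is derived deterministically from the business key, hubs, links and satellites can each compute the same key independently, with no lookups between them during loading:

```sql
-- Hypothetical hub load: the hashed key is computed directly from the
-- business key, so no lookup against other tables is needed and the
-- hub, link and satellite loads can all run in parallel.
INSERT INTO hub_customer (customer_hash_key, customer_id, load_date, record_source)
SELECT DISTINCT
    MD5(UPPER(TRIM(s.customer_id))),  -- deterministic hashed key
    s.customer_id,                    -- the business key itself
    CURRENT_TIMESTAMP,
    'CRM'
FROM staging_customers s
WHERE NOT EXISTS (                    -- only insert unseen business keys
    SELECT 1
    FROM hub_customer h
    WHERE h.customer_hash_key = MD5(UPPER(TRIM(s.customer_id)))
);
```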

All of this sounds great, so why isn’t Data Vault more widely used when building out data warehouses? I think the simple answer comes down to a couple of key factors. The first is that cloud data warehousing has only recently started to be adopted, and it’s only with the scalability of cloud storage and processing that a Data Vault structure becomes truly viable. The other key factor is project size: for a small project, building with the Data Vault methodology may be overkill. However, given the increasing importance of data and stricter data regulations (e.g. GDPR), I would argue that in many cases the Data Vault methodology is the right choice to ensure a modern and scalable data warehouse.

Written by Tom Bruce, Delivery Lead and Co-founder of Snap Analytics.

Part 1 – What is Data Vault?

What is Data Vault?

Put simply, Data Vault is a modern approach to designing and building enterprise data warehouses. Most traditional data warehouses are designed and built using concepts from Kimball’s dimensional modelling and Inmon’s relational 3NF approaches. Data Vault combines the best of both, giving a more agile way to build data warehouses that are designed for change, while also being scalable and able to handle petabytes of data. The Data Vault layer is introduced between staging and the dimensional star schema to track history and provide full traceability of the data.

Why is it needed?

While both Kimball’s dimensional modelling and Inmon’s relational 3NF were revolutionary when first introduced, the data landscape has changed significantly since the 1990s. Companies now capture a much greater variety of data than they did previously (when it was mainly transactional systems), including semi-structured data, web data and machine-generated data. The rate at which this data is produced is increasing exponentially, with 90% of the data on the web created in the last two years alone, according to a marketing cloud study conducted by IBM. Put simply, data warehouses built using these traditional methodologies are not responsive enough to the change that organisations are currently experiencing – which is where Data Vault comes in!

How does it work?

The Data Vault layer is based on patterns found in nature, using a hub-and-spoke architecture that scales well and is easy to extend. There are three distinct components to a Data Vault model: hubs, links and satellites.

Figure 1 – Example Data Vault Model (simplified)

Hubs

Business keys tend to change less frequently than other data, so these are separated out into hubs, which store purely the business key along with a hashed key that is used to increase the loading performance of the Data Vault layer. In the example above, there are hubs for Customer, Account and Payment, which represent the core business keys.
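
A hub from the example above might look something like this (a hypothetical sketch in Snowflake-style SQL; column names and types are illustrative):

```sql
-- Hypothetical Customer hub: just the business key, its hashed form,
-- and standard Data Vault audit columns.
CREATE TABLE hub_customer (
    customer_hash_key  CHAR(32)      NOT NULL PRIMARY KEY,  -- hash of the business key
    customer_id        VARCHAR(50)   NOT NULL,              -- the business key itself
    load_date          TIMESTAMP_NTZ NOT NULL,              -- when first seen
    record_source      VARCHAR(50)   NOT NULL               -- originating system
);
```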

Links

Link tables then connect these hub keys together, representing the source business processes and transactions. The link tables are key to the flexibility of the Data Vault model, because it is easy to add new links or to change the relationships of links already in place. In the example above, the payment link table connects the Customer and Payment hubs.
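
A sketch of that payment link (again with hypothetical names): it simply pairs the hashed keys of the hubs it connects, which is what makes the relationship many-to-many by default:

```sql
-- Hypothetical payment link: connects the Customer and Payment hubs.
-- Because it only stores key pairs, any cardinality is supported.
CREATE TABLE link_payment (
    link_payment_hash_key  CHAR(32)      NOT NULL PRIMARY KEY,  -- hash of the combined keys
    customer_hash_key      CHAR(32)      NOT NULL,              -- points to hub_customer
    payment_hash_key       CHAR(32)      NOT NULL,              -- points to hub_payment
    load_date              TIMESTAMP_NTZ NOT NULL,
    record_source          VARCHAR(50)   NOT NULL
);
```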

Satellites

Lastly, there are satellite tables, which provide the additional data that adds context to a business key, relationship or transaction. Most importantly, this is where changes are tracked over time, enabling the full history to be analysed. A single hub, for example Customer, may have several different satellites, depending on the source system that provides the additional details or on the rate of change of the particular fields. In our example above, separate systems provide the Address and Segmentation details, which is why these are held in separate satellites. This also means that changes to fields from one system can easily be applied to just the satellite table affected.
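
One of the Customer satellites from the example might be sketched like this (hypothetical names; the hash_diff column is a common Data Vault pattern for detecting attribute changes):

```sql
-- Hypothetical Address satellite on the Customer hub: a new row is
-- inserted only when the attributes change, so the table accumulates
-- the full history of each customer's address over time.
CREATE TABLE sat_customer_address (
    customer_hash_key  CHAR(32)      NOT NULL,  -- points to hub_customer
    load_date          TIMESTAMP_NTZ NOT NULL,  -- when this version arrived
    record_source      VARCHAR(50)   NOT NULL,
    hash_diff          CHAR(32)      NOT NULL,  -- hash of attributes, to detect change
    street             VARCHAR(100),
    city               VARCHAR(50),
    postcode           VARCHAR(10),
    PRIMARY KEY (customer_hash_key, load_date)
);
```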

Now that you know a bit more about what Data Vault is, look out for Part 2 of our Data Vault blog, where we’ll highlight the key benefits of the Data Vault methodology.

Written by Tom Bruce, Delivery Lead and Co-founder of Snap Analytics.