Lakehouses For Sale

Whilst a ‘Lakehouse’ might sound like something out of a Nordic real estate brochure, we’re actually referring to ‘Data Lakehouses’ – the emerging technology in analytics.

The evolution of the Data Lakehouse started with Data Warehouses, which have been around since the 1970s, with Bill Inmon widely credited with popularising the term ‘Data Warehouse’. The Data Warehouse was designed to store numerous disparate data sets within a single repository, ensuring a single version of the truth for organisations. These solutions are highly structured and governed, but they typically became extremely difficult to change and started to struggle as data volumes grew too large.

Technology continued to evolve, and organisations saw an exponential increase in the amount of data they were creating as a result. This data, produced by new applications, frequently arrived in semi-structured or unstructured formats and in continuous streams, which is very different to the structured data from transactional systems that data warehouses were used to dealing with. For example, the Internet of Things (IoT) means there are a huge number of devices continually creating sensor data, such as smart draught pumps in pubs that automatically order new beer deliveries when the keg starts to run out!

This change in data meant that organisations had to adapt their data solutions. Around 10 years ago, organisations started to create Data Lakes built on platforms such as Hadoop. Whereas Data Warehouses would typically store only the data needed for reporting, in a highly structured format, Data Lakes could store all possible data in its raw unstructured or semi-structured format. However, this lack of data quality and consistency often makes the data very difficult to use effectively for reporting, which has left much of the promise of Data Lakes unrealised (see our blog on ‘Data Swamps’). Because of this inconsistency, Data Lakes also lacked the governance and functionality of Data Warehouse solutions. To deal with this dilemma, many organisations ended up with a data landscape combining a data lake with several data warehouses to meet their analytics needs. These extra systems have made the data landscape unnecessarily complex, with data frequently having to be copied between the data lake and the data warehouses.

Figure 1 – Evolution of Data Solutions as Proposed by Databricks

Enter the ‘Data Lakehouse’: a class of technologies that combines the benefits of traditional Data Warehouses with the huge amounts of cheap storage used by Data Lakes, something only made possible by the advancement of cloud technology. A Data Lakehouse has several key features, but perhaps the most important is the ability to natively support both structured and semi-structured data. Snowflake, for example, has a native ‘Variant’ data type, which means you can load semi-structured data such as JSON, XML and Avro straight into the Data Lakehouse, and it provides SQL extensions to query this semi-structured data directly. The result is that fully ACID-compliant transactions, commonly used in management reporting solutions, can easily be stored alongside semi-structured data from streaming applications. Data Lakehouses also provide the auditability and governance that Data Lake solutions clearly lacked, while supporting traditional modelling techniques such as Star and Snowflake schemas. Lastly, by storing all of their data in one central hub, organisations reduce both the complexity of their technical landscape and their costs, as they no longer need to store data in both a data warehouse and a data lake (for more details, see our blog ‘Better, Faster, Cheaper’).
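As a concrete sketch of this idea (assuming a Snowflake environment; the table and field names here are purely hypothetical), raw JSON can be loaded into a VARIANT column and then queried with Snowflake's path notation right alongside ordinary structured columns:

```sql
-- Hypothetical table mixing structured and semi-structured data
CREATE TABLE sensor_readings (
    reading_id   NUMBER,          -- structured column
    recorded_at  TIMESTAMP_NTZ,   -- structured column
    payload      VARIANT          -- raw JSON from the device, stored as-is
);

-- Load a raw JSON event without defining its schema up front
INSERT INTO sensor_readings
SELECT 1,
       CURRENT_TIMESTAMP(),
       PARSE_JSON('{"device": "draught-pump-7", "keg_level_pct": 12.5}');

-- Query the JSON directly, casting fields with :: as needed
SELECT payload:device::STRING       AS device,
       payload:keg_level_pct::FLOAT AS keg_level_pct
FROM   sensor_readings
WHERE  payload:keg_level_pct::FLOAT < 15;  -- e.g. flag kegs running low
```

Because the JSON is stored as-is, new fields added by a device simply appear in the VARIANT column and can be queried immediately, with no schema change required.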

Data Lakehouses are the start of a really exciting new era in the world of data and analytics, offering a huge competitive advantage to organisations. Companies can finally put the huge amounts of data they have started to create to effective use in analytics and reporting, driving new insights and ultimately making better decisions.

5 things to do before starting a data project

You’re about to start a big data project. Fantastic! We’re firm believers that every business can gain a real competitive advantage by analysing its data. It’s why we do what we do.

But just before you go running off all excited, stop for a moment. If you really want your data project to be a success, you need to think about five key things before you even start. 

Understand the problem that you are trying to solve 

Chances are you’re looking to data analytics to fix a specific need, something which is causing inefficiencies and costing you money. Don’t assume, though, that by shaking the data tree enough times a solution will magically fall into your lap. First you need to look at your existing systems to see what exactly needs fixing. Only once you have a clearly defined vision and end point in mind can we see exactly how we can help.

Define what success will look like  

Having identified what you want, it is time to think about what a successful outcome might look like. It helps nobody to embark on a data project without setting specific goals or measurable outcomes. So make a plan, draw up a list of milestones, devise ways of measuring what’s happening, and then track the results against that. One useful approach we’ve found is to run a user survey six months down the line to find out how people are using, or benefitting from, the findings.

Align with company strategy 

It’s all very well you dreaming up fantastic, innovative data-driven projects that will change the very fabric of your business and the world generally. But it might be best, first of all, to check that your goals fall in line with the wider strategic direction of the business. Is this a problem you should even be solving? Is it a business priority? Will it help tick some important boxes when the annual report comes round? If the answer is yes to all the above, fantastic – you’re on your way to getting managerial buy-in and tapping up a healthy budget for an important piece of work.

Data for the people 

You’ve addressed the needs of the bigger cheeses, but don’t forget about the little guys, the people on the front line who are working hard to produce this data in the first place. Think about how this is going to benefit them in the long term: how will it make their day-to-day work easier, more efficient, or more effective? This is particularly pertinent if your business is going through a restructuring process. We’re great believers in the power of data analysis, but if you’re losing half your team it might not be perceived as the best use of company resources.

Build the right team 

One commonly held assumption we come across is the idea that data is a purely tech-led process: you identify a problem or need and the nerds crunch the numbers. It’s not that simple, of course. To produce an effective outcome, you need quality input from people on the business side, members of the team who can provide insights into how the company works and what its goals and strategies are. You should bring together people who use the data in different ways and can provide the broadest possible range of experience. That way the insights we produce will be deeper, richer and ultimately more valuable.

Modern Data Warehouse – Snowflake & Fivetran