Data Lake vs Data Warehouse: Picking The Right Data Architecture

A Quick Look Into Data Warehouses, Lakes and Delta Lakes

Posted by Juan Martin Pampliega

on March 17, 2022 · 9 mins read

Before We Get Started, Why Was Data Important Again?

Entire industries, companies, and ideas are built on data. We no longer only rely on gut feeling decisions. Data has become the driving force behind decision-making. YOUR decision-making.

What makes data elemental is the possibility of understanding how customers will behave and interact with your brand.

When advertising became a thing in the XX century, we had some inkling of an idea of what people wanted, probably a sharp razor for businessmen aged 18-50 and stockings for coquette women in marrying age.

Of course, today we know women also shave, and some men wear stockings, all thanks to data they willingly give us through their devices.

Storage methods

Great! We’re on the same page: Data is the new gold, how we use it defines our very competitive advantage. The issue? There’s a lot of it and it needs to be stored somewhere to make it accessible and useful. Remember this is our gold, you wouldn’t store your gold anywhere.

You’ve probably heard about Data Warehouses and Lakes, most likely because of the hype surrounding the latter. A common discussion is which is the best alternative. A not so common answer: It Depends.

Now, you want to have your own personal Fort Knox, but where?

Like glistening gold on tumultuous rivers, Data can be kept on Lakes, unhandled and raw. Or you can mold it into ingots and keep it on the shelves of a tidy Warehouse.

Much like comparing a pair of beloved siblings, one is not better than the other. We are all different and good at certain things, the same concept applies here. Your choice on storage method should be based on your company’s specific needs. In some cases a combination of both methods might be the ideal solution. Let’s break it down.

Data Warehouse

Like the name indicates, It is a storage or repository where all the data in a company is kept, we can consider it as the organization's single source of truth.

This system takes data, both current and historical, from a variety of sources, both internal and external and centralizes and consolidates it in one place.

All of this information is like natural water. If we want to feed water to a village we wouldn’t just draw up a pipe or open the flood gates. We need to make sure the water is clean, so it can be fractioned and then used.

That’s the other side of a Data Warehouse’s job, it’s not just about storing all the water but about cleansing, processing and filtering it making it drinkable for the village. Your Data Warehouse makes raw data palatable and useful for your employees through a clear schema and hierarchical structure.

Imagine you have a mumbo jumbo of information, numbers and letters that point nowhere. This system, for example, might categorize them into: names, addresses, ages, and occupations.

Suddenly, we’re not just looking at numbers but understandable, useful actionable insights to offer your products and services according to specific customer attributes or needs.

Your source of raw water becomes a bottle of pure water with all of the user’s info written in the label. Your whole Data Warehouse acts as a big closet full of bottled waters where every employee can just grab one and read the label.

You might have heard (or had first hand experience) with old Data Warehouses. Those were hard to scale, incredibly rigid and required constant maintenance by Database Administrators to make them work properly. Old Data Warehouses were not meant to have more than one instance running. All of this ended up slowing up your analysts.

Thankfully, that's a thing of the past: new technologies like Redshift, BigQuery and Snowflake changed the game. All of them offer cloud based Data Warehouses, which are easy to scale up to your needs. Not only that: all of them allow you to store and query semi-structured data like JSON, Avro or Parquet.

Date Warehouses have come a long way, and are more than up to the task of modern data tasks.

Data Lake

Before we get into this, let's get one thing clear: Data Lakes are not the improved Data Warehouse or version 2.0. A Data Lake is a different, independent, storage method. We know you are probably thirsty by now, but we’re going to stick to the water metaphors a bit longer.

Unlike a Data Warehouse a Data lake has no flood gates, all the raw water and everything in it comes in. In this case, to avoid flooding a village, let’s say you build an artificial lake and you want to fill it. You let all the water in, leaving you with a container of raw, crude data.

You're probably now hearing an alarm sound in your head. Won't storing all of my data be incredibly expensive? I probably need the latest and most expensive hardware to make this work! That might have been true in the past, but not now.

Hardware is now a commodity: inexpensive, widely available and easily interchangeable with similar hardware of its type. Data Lakes were built to exploit that. They are made out of multiple machines conforming a cluster.

All of the machines that are part of the cluster are running the same software and coordinate tasks and resources with each other. Clusters act like one big machine made out of multiple smaller machines. Need to scale up? Just add a new instance! There are no set limits! Its hardware doesn't even have to be the same: Data Lakes abstract the need to use a certain hardware vendor.

All the data we can collect, regardless of size or type then goes into this cluster unfiltered and unprocessed. This might seem a bit messy but you have all the data available whenever you want and can do whatever you want with it.

It doesn’t require a complex system for loading and extracting, but it also does not differentiate between water coming from a waterfall, rain or from an underground water source. Although the input is different in that it’s not processed data to begin with, a lake with so much going in quickly becomes swamp-like and soon enough you can’t see much through the water.

Just like a real lake it needs some maintenance and filtering. Defined mechanisms are put in place to catalog data and make it safe, think: governance and consistency controls.

The flexibility that a Data Lake offers is a double-edged sword. When you're not entirely sure what you will end up doing with your data it offers great flexibility but if you become too flexible and start collecting too much data without a clear objective in mind it might become a swamp. Once you know what you are after, with the help of machine learning techniques you can dive through your pristine lake finding only useful data.

Another key aspect to take into account: Quality. Over. Quantity. Making a lake bigger with the sole objective of having more data available for analytics doesn’t make sense. A great habit is to always question the consistency and quality we are after which takes as back a skip: Start with a clear objective in mind. What do you need?

Key Differences

Just like twins, when you look closely the differences start to stand out quite clearly:

	Data Warehouse	Data Lake
Data Structure	Processed	Raw
Data Purpose	Currently in use	Not yet determined
Data Users	Business professionals	Data scientists
Data Accessibility	Restringed	Highly accessible

Data structure: Processed vs. Raw

DWs’ are silos’ of processed organized information while DLs’ are containers of raw material. Data inputted into a warehouse will be processed and cataloged according to predetermined parameters. A shoe company will have predefined parameters to store names, addresses, ages and shoe sizes but probably won’t be interested in the name of your pet.

In a lake, inputted data is raw, there is no hierarchy or direction, just an unlimited storage space with everything that is inputted. Say you run a business that’s a bit harder to structure, for example, a hospital. Infinite amount of illnesses, conditions, doctor and pharmacists notes, studies, etc, etc. All this information can be a bit overwhelming for such a tight and structured storage method such as a Data Warehouse. Maybe a Data Lake is a better fit for a business with so many variables.

Data Purpose: In-use vs. undetermined.

In a DW there is a specific need for each piece of information, input data has been processed and cataloged for a specific use. If we go back to the bottled waters, every employee knows the first row are customers A-B and that their labels contain: name, age, occupation and search tendencies.

In a DL, information is unfiltered. Some data might have a specific use in the future. Think of a patient’s medical history, they might have inherited a health issue or not, but you keep the data available just in case. Data we input today might generate useful insights in the future in order to prevent or understand a disease that might show itself tomorrow.

Data Users: Business Professionals vs. Data scientists

Data from DW is mostly used for charts and spreadsheets. Most, if not all, your employees can make something of it. The only requirement is being familiar with the topic the data is referring to.

DL needs a more experienced user. Like a scuba diver you need some knowledge and training before you dive into a data lake. Data Scientists can understand and translate the input data, so they are the more likely users.

Data Accessibility: Restricted vs. Flexible

It’s harder to make structural changes in a DW once structures have been defined. In the case of a DL there is no structure to begin with, this makes it easier to access and change. Modifications are carried out quickly due to lack of limitations.

Delta Lake

Think of a Delta Lake, as the best of both worlds. In short, it is a well kept, organized Lake House. It’s an open format storage layer over a Data Lake, that allows you to stream and batch operations.

This means that you can either get certain nuggets of information in real time (Like when you stream a movie) or a more thought out operation that processes and organizes the data after a certain amount of time (i.e. A bank that processes transactions such as international money transfer after-hours.)

It replaces data silos with a single place for three different levels of data: structured, semi-structured and unstructured.

It supports ACID transactions, using Apache Spark, and schema enforcement, providing the reliability that regular Data Lakes lack.

But what is ACID?

The IT world has a kick coming up with funky acronyms, this one actually meaning: Atomicity, Consistency, Isolation and Durability.

Atomicity: A transaction must be 100% completed or they are not done at all in case anything goes wrong. This guarantees no incomplete processes.
Consistency: Any change in a process has to go from a valid estate to another equally valid estate. So there are no differences or surprises with the transformation.
Isolation: No operation can affect another one. This prevents any sort of parallel processes being affected by a change in their partners.
Durability: Changes have to last. Even if there are any shortcomings in the future.

Delta Lake: Key Features

Open and Secure Data Sharing: Open protocol for easy and secure data sharing regardless of where information is stored. Shared data can be visualized and controlled making meeting security needs easy.
Scalable Metadata Handling: Can handle processing power needed to process Big Data by scaling information with partitions without sacrificing quality.
Time Travel: Access to and possibility of modifying earlier versions of data.
Open Format: Data available to anyone with correct tools and pass.
Schema Application: Ability to ensure data types, required columns and other attributes of the data schema are met. This guarantees correct data ingestion.
Audit History: Users can verify log records to visualize changes and executed commands in tables, trace transactions, etc.
Upsert: Users can effortlessly insert and update records.

Wrapping Up

And the winner is… Drum Roll Please… Well… It Depends. The point of this blogpost is that while Data Lakes are newer, Delta Lakes are pretty cool and shiny and Data Warehouses might be looking a bit old, each architecture has its PROs and CONs and there isn’t a go-to-architecture for every company and every set of needs. Before choosing, it’s key to ask the right questions and really map out what you’re after. Not to mention, you might find yourself needing a combination of different architectures depending on how varied your needs are.

Need help choosing the best architecture for your company and needs? Feel like it’s time to update your data stack or implement a modern data stack from scratch? We always enjoy a challenge, get in touch with our team of data experts anytime.

We hope you’ve found this post useful, and at least mildly entertaining. If you like what you’ve read so far, got some mad dev skills and like applying machine-learning to solve tough business challenges, check out our [team openings]({{ site.jobs_page }}).