
What Is a Data Lakehouse?

Dipankar Mazumdar, Developer Advocate, Dremio

As the name suggests, a data lakehouse architecture combines a data lake and a data warehouse. It is not a mere integration of the two, however; the idea is to bring out the best of both architectures: the reliable transactions of a data warehouse and the scalability and low cost of a data lake.

Over the last decade, businesses have invested heavily in their data strategy to derive relevant insights and use them for critical decision-making. This has helped them reduce operational costs, predict future sales, and take strategic actions.

A lakehouse is a new type of data platform architecture that:

  • Provides the data management capabilities of a data warehouse and takes advantage of the scalability and agility of data lakes
  • Helps reduce data duplication by serving as the single platform for all types of workloads (e.g., BI, ML)
  • Is cost-efficient
  • Prevents vendor lock-in and lock-out by leveraging open standards
Fig: A data lakehouse architecture

Data Lakehouse vs. Data Warehouse vs. Data Lake

To better understand what a data lakehouse is and why it’s valuable, let’s take a step back in time to understand how organizations historically used data to answer business questions. After moving away from reporting off siloed data marts, organizations began moving day-to-day operational data into a centralized repository and using it for business intelligence (BI) tasks. This was the advent of the first-generation centralized data platform: the data warehouse.

Data Warehouse

A data warehouse is a repository that centrally stores data from one or more disparate sources (e.g., application databases, CRM systems, etc.). The data stored in a warehouse is structured, i.e., it takes a particular form and conforms to a defined data model. This makes data consumption efficient for downstream BI applications, which typically deal with well-organized data. The following are some of the key pros and cons of first-generation on-premises data warehouses.

Fig: On-prem data warehouse

Advantages:

  • Allows businesses to derive insights using both semi-recent and historical data
  • Data consumers (e.g., analysts) have easier access to data instead of having to access multiple disparate sources
  • Serves as a single source of truth — data reliability, quality, and governance

Disadvantages:

  • The need to scale compute and storage at the same time
    • As businesses matured in their analytics journey, data generation also increased. Data warehouses typically tied “compute” and “storage” together, so scaling these components independently was a massive concern as datasets grew and workloads increased; a business’s compute needs would often grow faster than its storage, or vice versa.
  • It’s expensive to store and query data in a data warehouse
  • Businesses are limited to structured data
    • Along with the increase in dataset size, organizations also saw various types of data flowing in, most of which was unstructured (or semi-structured), such as CSV, images, text, videos, etc.
  • Businesses are limited to relational workloads
    • The availability of new semi-structured and unstructured data opened up exciting avenues for more advanced analytical workloads, such as machine learning, which could help the enterprise understand customer sentiment, forecast sales, and more. However, data warehouses did not support these types of workloads. To apply machine learning, data had to be moved out of the warehouse, creating data copies and often resulting in critical issues later, such as data drift and model decay.
  • Businesses may encounter lock-in issues
    • Because the warehouse engine was the only engine that could directly access the data stored in it, and due to an ever-increasing amount of workloads and data on the platform, businesses became locked in to that platform. There are numerous tales of businesses that wanted to (or even tried to) migrate off data warehouses (e.g., Teradata) and either abandoned the effort due to its difficulty, time, and/or complexity, or were only partially successful.
  • Businesses may encounter lock-out issues
    • For the same reason, as the analytics industry ramped up its advances in new capabilities and technologies, businesses often found themselves locked out: unable to leverage the new capabilities for their benefit, or limited in the extent to which they could.

Data Lake

To address the problems businesses were experiencing with data warehouses and to democratize data for all sorts of analytical workloads, a different type of data platform emerged — the data lake. It all started with storing, managing, and processing a huge volume of data using the on-premises Hadoop ecosystem (e.g., HDFS for storage, Hive for processing).

Fig: On-prem data lake

On-premises data lakes solve some of the issues of the first-gen data warehouses, but they still have disadvantages.

Advantages:

  • Addresses the challenges of dealing with various types of data (structured, unstructured) and the huge volume of data
  • Enables multiple analytical workloads to run directly on the data stored openly in HDFS
  • Drastically reduces costs compared to data warehouses

Disadvantages:

  • Compute and storage are tied together, resulting in scalability issues as organizations mature in their analytics journey
  • Poor governance of the stored data
  • Complexity in the tooling stack that requires specialized data engineers’ expertise, which results in slower time to insight

Cloud Data Warehouse

While first-generation on-premises data warehouses helped businesses derive historical insights from multiple data sources, they required a significant investment in cost and infrastructure management. The next generation of data warehouses, cloud data warehouses, took advantage of the cloud and addressed some of the problems with the on-premises data warehouses discussed earlier.

Fig: Cloud data warehouse

Advantages:

  • Separates storage and compute, allowing each component to scale independently
  • Reduces the costs associated with adopting and maintaining on-premises physical servers
  • Cloud data warehouses provide tighter integration with various data sources and other SaaS services (such as Fivetran and Keboola)

Disadvantages:

  • While the cloud data warehouse reduces some costs, it is still very expensive 
  • To run any workload where performance matters, data needs to be copied into the data warehouse before the workload can run
  • Data in a cloud data warehouse is often stored in a vendor-specific format, which leads to lock-in/lock-out issues (although some cloud data warehouses offer the option to store data in external storage)
  • Support for multiple analytical workloads, especially those involving unstructured data, such as machine learning, is still unavailable

Cloud Data Lake

With the recent upsurge of the cloud industry, organizations started to leverage cloud-based object stores, such as Amazon S3, ADLS, etc., to build their data lake platforms. The elastic nature of cloud data lakes allowed organizations to enjoy the same benefits of a data lake while scaling the storage and compute components independently, per their needs. The usual way of working in a data lake architecture is that after the data lands in a cloud object store, data engineering teams extract and load the required data into a data warehouse so downstream BI applications can use it for reporting. Since “storage” is decoupled from “compute” in this architecture, organizations can store any amount of data they want and use a compute engine of their choice to process it.
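To make that extra hop concrete, here is a minimal PySpark sketch of the extract-and-load step described above; the bucket, table, and connection details are hypothetical, and a working Spark environment with the relevant JDBC driver is assumed.

```python
# Sketch of the two-hop pattern: data lake -> data warehouse (hypothetical names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-to-warehouse-copy").getOrCreate()

# Step 1: raw data has already landed in the cloud object store.
orders = spark.read.parquet("s3a://company-data-lake/raw/orders/")

# Step 2: copy the subset needed for BI into the warehouse over JDBC --
# the extra data copy that the lakehouse architecture later eliminates.
(orders
    .select("order_id", "customer_id", "amount", "order_date")
    .write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com:5432/analytics")
    .option("dbtable", "bi.orders")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("append")
    .save())
```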

Cloud data lakes delivered some significant advantages.

Fig: Cloud data lake

Advantages:

  • Stores all data types (structured, semi-structured, or unstructured) in open file formats such as Apache Parquet and ORC
  • Is very cost-efficient
  • Separates storage and compute
  • Workloads such as machine learning can run directly on the data without the need to move data out

Disadvantages:

  • While data lakes allow organizations to store any volume or type of data without considering the structure or storage costs, quality and governance are still significant problems. If data isn’t managed properly, it might lead to data lakes becoming swamps.
  • This approach leads to additional data copies as the data is first extracted and loaded to a data lake and then needs to be extracted and loaded to a data warehouse for downstream apps (such as BI). This may also lead to more job failures and ultimately impact downstream apps.
  • Since data is stored in raw formats and written by many different tools and jobs, files may not be optimized for query engines and low-latency analytical applications.

Advantages of Data Lakehouse Solutions

A data lakehouse mitigates the critical problems experienced with data warehouses and data lakes.

Fig: Representation of a data lakehouse architecture

Multiple Analytical Workload Support

Various BI and reporting tools (e.g., Tableau) have direct access to the data in a lakehouse without the need for complex and error-prone ETL processes. Additionally, since the data is stored in open file formats like Parquet, data scientists and ML engineers can build models directly on any data (structured, unstructured) depending on their use cases.
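As an illustration, below is a minimal sketch of that direct access for an ML workload using pyarrow; the S3 path and column names are hypothetical.

```python
# Read lakehouse data directly for ML -- no export or copy step.
import pyarrow.dataset as ds

# Point at the Parquet files in the object store (hypothetical path).
dataset = ds.dataset("s3://company-data-lake/features/orders/", format="parquet")

# The columnar format lets us load only the columns the model needs.
features = dataset.to_table(
    columns=["customer_id", "amount", "order_date"]
).to_pandas()

# `features` is a pandas DataFrame, ready for scikit-learn, PyTorch, etc.
print(features.head())
```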

Cost

With all the data stored in cost-effective cloud object storage, organizations don’t have to pay the hefty costs associated with data warehouses. The data lake also serves as the central repository for an organization’s data, so there is no overhead of storing data in multiple systems and managing them.

Data Copies

With a data lakehouse architecture, engines can access data directly from the data lake storage without copying data using ETL pipelines for reporting or moving data out for machine learning-based workloads. This ultimately ensures reliable data in the downstream applications and helps prevent issues such as data drift, concept drift, etc.

No Lock-In or Lock-Out

The open nature of the data lakehouse architecture allows businesses to use multiple engines on the same data, depending on the use case, and helps to avoid vendor lock-in. Also, for new workloads, organizations can add any new tool to their stack.

Independent Scaling of Compute and Storage

A lakehouse architecture separates compute and storage, which allows these components to scale independently to meet an organization’s needs.

Infinite Scalability

The lakehouse also allows resources to scale up elastically based on the type of workload: for a resource-intensive task, organizations can easily scale up compute as needed.

Key Data Lakehouse Features

A data lakehouse architecture blends the best of a data warehouse and data lake to support modern analytical workloads. Some of the key features are:

  • Transactional support: One of the most important features of a data lakehouse is that it supports ACID transactions, which ensures the same atomicity and data consistency guaranteed in a data warehouse. This is critical for multiple read and write operations to run concurrently in a production scenario.
  • Open data: In a data lakehouse architecture, the data is stored in open formats such as Parquet, ORC, etc., which allows multiple engines to work in unison on the same data for a range of analytical workloads. Therefore, data consumers can have faster and more direct access to the data.
  • Data quality and governance: One of the pain points with a data lake architecture is that there are no governance policies on the data, which means the quality of data (accuracy, correctness, etc.) landing in the object store may not be helpful for deriving insights. This often leads to data swamp problems. The data lakehouse explicitly focuses on these aspects by adopting tried-and-true best practices from the data warehousing world to ensure proper access control and to adhere to regulatory requirements.
  • Support for schema management: A lakehouse architecture enforces a specific schema when new data is written, ensuring that no “garbage” enters the tables. As new use cases emerge, data types may change and new fields may be added; lakehouses support this kind of schema evolution without side effects through the table formats (see the sketch after this list).
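To illustrate the transactional and schema-management points above, here is a minimal sketch using Apache Iceberg through Spark SQL. It assumes a SparkSession already configured with the Iceberg extensions, and the catalog and table names (`lakehouse.sales.orders`) are hypothetical.

```python
# ACID transaction: an atomic update -- concurrent readers never see a
# partially applied change.
spark.sql("""
    UPDATE lakehouse.sales.orders
    SET status = 'shipped'
    WHERE order_id = 1001
""")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("""
    ALTER TABLE lakehouse.sales.orders
    ADD COLUMN discount_pct DOUBLE
""")
```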

Components of Data Lakehouse Architecture

A data lakehouse architecture typically comprises the following five components.

1. Storage 

Storage is the first component of a lakehouse. This is where the data lands after ingestion from operational systems. Object stores from the major cloud service providers (Amazon S3, Azure Blob Storage, and Google Cloud Storage) support storing any type of data and provide the required performance and security. These systems are also highly scalable and inexpensive, helping to streamline costs.
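For example, landing a file in the storage layer can be as simple as an object-store upload. The sketch below uses boto3 with a hypothetical bucket and key; configured AWS credentials are assumed.

```python
# Land a local Parquet file in the lakehouse storage layer (S3 here).
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="orders_2023_06.parquet",        # local file (hypothetical)
    Bucket="company-data-lake",               # hypothetical bucket
    Key="raw/orders/orders_2023_06.parquet",  # object key in the lake
)
```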

2. File Formats 

File formats hold the actual data; these files are stored in the object storage. Typically, they are columnar formats, which provide significant advantages when reading data or sharing it between multiple systems. Common file formats include Apache Parquet, ORC, and Apache Arrow.
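The sketch below shows the columnar advantage with pyarrow and Parquet; the file and column names are hypothetical. A reader can fetch just the columns it needs instead of scanning whole rows.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to a Parquet file.
table = pa.table({
    "order_id": [1, 2, 3],
    "customer_id": ["a", "b", "a"],
    "amount": [10.5, 20.0, 7.25],
})
pq.write_table(table, "orders.parquet")

# Columnar layout: read back only the column we care about.
amounts = pq.read_table("orders.parquet", columns=["amount"])
print(amounts.to_pandas())
```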

3. Table Formats 

The data lake table format is the most important component of a lakehouse architecture. There must be some way to organize and manage all the raw data files in the data lake storage. Table formats abstract the complexity of the physical data structure and allow different engines to work simultaneously on the same data. The table format in a lakehouse architecture facilitates data warehouse-level transactions (DML) along with ACID guarantees. Other critical features of a table format include schema evolution, expressive SQL, time travel, and data compaction. Apache Iceberg, Hudi, and Delta Lake are the three most popular table formats and are rapidly gaining momentum.
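To illustrate one of these features, here is a minimal time-travel sketch against the hypothetical Iceberg table from the earlier example, again assuming a configured SparkSession (the `TIMESTAMP AS OF` syntax requires a recent Spark and Iceberg version).

```python
# Query the table as it existed at an earlier point in time.
previous = spark.sql("""
    SELECT *
    FROM lakehouse.sales.orders
    TIMESTAMP AS OF '2023-06-01 00:00:00'
""")
previous.show()
```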

4. Query Engines 

Table formats provide the specifications and APIs required to interact with the table data. However, the responsibility for processing the data and providing efficient read performance falls on the query engine. Some query engines also allow native connections with BI tools such as Tableau, making it easy to report directly on the data stored in object storage. Query engines such as Dremio Sonar and Apache Spark work seamlessly with table formats like Apache Iceberg to enable a robust lakehouse architecture using commonly used languages like SQL.
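For instance, a reporting-style aggregation can run directly against the table in object storage. The sketch below uses Spark SQL with the same hypothetical table as before.

```python
# A BI-style aggregation executed by the engine directly on lake data.
report = spark.sql("""
    SELECT customer_id,
           SUM(amount) AS total_spend
    FROM lakehouse.sales.orders
    GROUP BY customer_id
    ORDER BY total_spend DESC
""")
report.show()
```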

5. Applications 

The final component of a data lakehouse is the set of downstream applications interacting with the data. These include BI tools such as Tableau and Power BI and machine learning frameworks like TensorFlow and PyTorch, making it easy for data analysts, data scientists, and ML engineers to access the data directly. In other data architectures, gaining this kind of access usually takes weeks, if not months.

Conclusion 

The data lakehouse combines the capabilities, reliability, and consistency guarantees of a data warehouse with the cost-efficiency and scalability of a data lake to present a robust data architecture. Organizations experience significant benefits in adopting a lakehouse architecture, such as:

  • Businesses can rapidly make sense of both structured and unstructured data, improving time to insight
  • It is very cost-efficient
  • Data is stored in open formats, which allows different engines to run simultaneously instead of waiting on dependencies
  • You are not locked into a specific vendor since the data stays open; this also makes your architecture future-proof for new use cases

If you are interested in getting started with a lakehouse architecture, Dremio’s open lakehouse platform provides an easy and efficient way. Sign up for Dremio Cloud and directly access all the data in your data lake or lakehouse. You can also build BI dashboards using tools such as Tableau and Power BI directly on the data using Dremio.
