As the name suggests, a data lakehouse architecture combines a data lake and a data warehouse. It is not a mere integration of the two, however; the idea is to bring together the best of both architectures: the reliable transactions of a data warehouse and the scalability and low cost of a data lake.
Over the last decade, businesses have been heavily investing in their data strategy to be able to deduce relevant insights and use them for critical decision-making. This has helped them reduce operational costs, predict future sales, and take strategic actions.
A lakehouse is a new type of data platform architecture that combines the low-cost, scalable storage of a data lake with the transactional guarantees and data management capabilities of a data warehouse.
To better understand what a data lakehouse is and why it’s valuable, let’s take a step back in time to understand how organizations historically used data to answer business questions. After organizations moved away from reporting off siloed data marts, the typical approach was to move day-to-day operational data into a centralized repository and use it for business intelligence (BI) tasks. This was the advent of the first-generation centralized data platform: the data warehouse.
A data warehouse is a repository that centrally stores data from one or more disparate sources (e.g., application databases, CRM, etc.). The data stored in a warehouse is structured, i.e., it conforms to a defined data model. This makes the data consumption process efficient for downstream BI applications, as they typically deal with well-organized data. First-generation on-premises data warehouses came with some notable pros and cons.
To address the problems businesses were experiencing with data warehouses and to democratize data for all sorts of analytical workloads, a different type of data platform emerged — the data lake. It all started with storing, managing, and processing a huge volume of data using the on-premises Hadoop ecosystem (e.g., HDFS for storage, Hive for processing).
On-premises data lakes solve some of the issues of the first-gen data warehouses, but they still have disadvantages.
While first-generation on-premises data warehouses helped businesses derive historical insights from multiple data sources, they required significant investment, both in cost and in managing the infrastructure. The next generation of data warehouses, cloud data warehouses, took advantage of the cloud and addressed some of the problems with the on-premises data warehouses discussed earlier.
With the rapid growth of the cloud industry, organizations started to leverage cloud-based object stores, such as Amazon S3, ADLS, etc., to build their data lake platforms. The elastic nature of cloud data lakes let organizations enjoy the same benefits of a data lake while scaling storage and compute independently, per their needs. The usual way of working in a data lake architecture is that after the data lands in a cloud object store, data engineering teams extract and load the required data into a data warehouse so downstream BI applications can use it for reporting. Since storage is decoupled from compute in this architecture, organizations can store any amount of data they want and bring a compute engine to process it.
Cloud data lakes delivered some significant advantages.
A data lakehouse mitigates the critical problems experienced with data warehouses and data lakes.
Various BI and reporting tools (e.g., Tableau) have direct access to the data in a lakehouse without the need for complex and error-prone ETL processes. Additionally, since the data is stored in open file formats like Parquet, data scientists and ML engineers can build models directly on any data (structured, unstructured) depending on their use cases.
With all the data stored in a cost-effective cloud object storage, organizations don’t have to pay hefty costs associated with data warehouses. Data lakes also serve as a central repository for an organization’s data, so there is no overhead to storing data in multiple systems and managing them.
With a data lakehouse architecture, engines can access data directly from the data lake storage without copying data using ETL pipelines for reporting or moving data out for machine learning-based workloads. This ultimately ensures reliable data in the downstream applications and helps prevent issues such as data drift, concept drift, etc.
The open nature of the data lakehouse architecture allows businesses to use multiple engines on the same data, depending on the use case, and helps to avoid vendor lock-in. Also, for new workloads, organizations can add any new tool to their stack.
A lakehouse architecture separates compute and storage, which lets organizations scale these components independently to meet their needs.
Scaling resources elastically based on the type of workload is another important capability the lakehouse provides. For a resource-intensive task, organizations can easily scale up compute without over-provisioning storage.
A data lakehouse architecture blends the best of a data warehouse and a data lake to support modern analytical workloads. Key features include direct access for BI and ML tools, low-cost object storage, no redundant data copies, an open architecture that avoids vendor lock-in, and decoupled compute and storage.
A data lakehouse architecture typically comprises the following five components.
Storage is the first component of a lakehouse. This is where the data lands after ingestion from operational systems. Object stores from the major cloud providers, such as Amazon S3, Azure Blob Storage, and Google Cloud Storage, support storing any type of data and provide the required performance and security. These services are also highly scalable and inexpensive, which helps streamline costs.
The next component is the file format, which defines how the actual data is stored. These are typically columnar formats, which provide significant advantages when reading data or sharing data between multiple systems. Common file formats include Apache Parquet, ORC, Apache Arrow, etc. These files are stored in the object storage.
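To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python (no Parquet library involved; the records and column names are invented for illustration). It shows why an analytical query over one column is cheaper against a columnar layout: only that column needs to be read.

```python
# Hypothetical records, stored two ways.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 45.5},
]

# Row-oriented layout: summing "amount" touches every field of every record.
row_total = sum(r["amount"] for r in rows)

# Column-oriented layout (the idea behind Parquet/ORC): each column is
# stored contiguously, so a query reads only the columns it needs.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}
col_total = sum(columns["amount"])  # scans only the "amount" column

print(row_total)  # 245.5
print(col_total)  # 245.5 -- same data, different physical layout
```

Real columnar formats add compression, encoding, and per-column statistics on top of this basic layout, but the access pattern advantage is the same.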
The data lake table format is the most important component of a lakehouse architecture. There must be some way to organize and manage all the raw data files in the data lake storage. Table formats help abstract the physical data structure’s complexity and allow different engines to work simultaneously on the same data. The table format in a lakehouse architecture facilitates the ability to do data warehouse-level transactions (DML) along with ACID guarantees. Some of the other critical features of a table format are schema evolution, expressive SQL, time travel, data compaction, etc. Apache Iceberg, Hudi, and Delta Lake are the three most popular table formats, and are widely gaining momentum.
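The snapshot idea behind these table formats can be sketched in a few lines of plain Python. This is a toy model, not the Apache Iceberg specification: the class, method names, and file names are invented, but it captures how an append-only log of immutable snapshots gives you atomic commits and time travel.

```python
# Toy table format: the table's state is a log of immutable snapshots,
# each listing the data files visible at that point in time.
class ToyTable:
    def __init__(self):
        self.snapshots = []  # append-only snapshot log

    def commit(self, new_files):
        """Atomically publish a new table state as a new snapshot."""
        self.snapshots.append(self.current_files() + list(new_files))

    def current_files(self):
        """Readers see the latest committed snapshot."""
        return list(self.snapshots[-1]) if self.snapshots else []

    def files_as_of(self, snapshot_id):
        """Time travel: read the table as it was at an older snapshot."""
        return list(self.snapshots[snapshot_id])

table = ToyTable()
table.commit(["data-000.parquet"])                      # snapshot 0
table.commit(["data-001.parquet", "data-002.parquet"])  # snapshot 1

print(table.current_files())  # all three files
print(table.files_as_of(0))   # only the first file: the table's old state
```

Because commits append a complete new snapshot rather than mutating files in place, readers always see a consistent table state, which is the essence of the ACID guarantees these formats provide on top of object storage.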
Table formats provide the specifications and APIs required to interact with the table data. However, the responsibility of processing the data and providing efficient read performance is on the query engine. Some query engines also allow native connection with BI tools such as Tableau, which makes it easy to do reporting directly on the data stored in the object storage. Query engines such as Dremio Sonar and Apache Spark work seamlessly with table formats like Apache Iceberg to enable a robust lakehouse architecture using commonly used languages like SQL.
The final component of a data lakehouse is the set of downstream applications interacting with the data. These include BI tools such as Tableau and Power BI and machine learning frameworks like TensorFlow, PyTorch, etc., making it easy for data analysts, data scientists, and ML engineers to access the data directly. In other data architectures, providing this kind of access usually takes weeks, if not months.
The data lakehouse brings the capabilities of a data warehouse with the reliability, consistency guarantees, and cost-efficiency of data lakes to present a robust data architecture. Organizations experience significant benefits in adopting a lakehouse architecture, such as:
If you are interested in getting started with a lakehouse architecture, Dremio’s open lakehouse platform provides an easy and efficient way. Sign up for Dremio Cloud and directly access all the data in your data lake or lakehouse. You can also build BI dashboards using tools such as Tableau and Power BI directly on the data using Dremio.