While data lakes and data warehouses are conceptually different in terms of their design and implementation, they have at least a few things in common:
Both are meant to help organizations make better decisions
Both are of interest to analysts and data scientists
Both are designed to store large amounts of enterprise data
However, this is usually where the similarities end. Before comparing data warehouses and data lakes, it is useful first to explain what we mean by data warehousing.
What Is a Data Warehouse?
Data warehouses have been around in various forms since the early 1980s. They are generally used to store data from operational systems and a variety of other sources. The idea behind a data warehouse is to collect enterprise data into a single location where it can be consolidated and analyzed to help organizations make better business decisions. For example, a company might use a data warehouse to store information about things like products, orders, customers, inventory, employees and more.
Data warehouses are deployed in different tiers. For example, large organizations may deploy data marts, which are topic- or function-specific data warehouses. They may also have operational data stores (ODS) used for various reporting and operational tasks. Both data marts and operational data stores will typically feed a central enterprise data warehouse (EDW) that aggregates data from multiple sources.
Business intelligence (BI) software is used by business analysts to help explore and visualize data in the data warehouse. This category of software is also referred to as decision support software (DSS) since its purpose is to help enterprises make better decisions based on data.
When most people refer to data warehouses, they are referring to traditional data warehouses built using SQL relational databases (RDBMS). As database technology continues to evolve, some organizations may use alternative data management environments such as NoSQL data stores or cloud-based services to warehouse data.
Traditional vs. Cloud Data Warehouses
At one time all data warehouses were deployed on premises. Increasingly data warehouses are also deployed in the cloud, so it is important to distinguish between traditional data warehouses (on premises) and cloud data warehouse services. What both traditional and cloud data warehouses have in common is that they are optimized to support structured data and present a standard SQL interface to familiar BI tools.
The Traditional Data Warehouse
Traditional data warehouses like Teradata store data in relational database tables. As relational administrators know, running complex queries across multiple large tables can be time-consuming. Complex queries can run for several minutes or, in some cases, even hours.
Data warehouses need to support operational reporting, so database administrators typically design schemas to process these anticipated queries efficiently. They carefully decide what data belongs in what table, what fields should be primary or secondary keys, and what fields should be indexed.
ETL – Extract, Transform and Load
Data warehouses also need to be constantly refreshed with new data from other systems. The data sources that feed the EDW typically will not match the schema of data warehouse tables. This means that data often needs to be extracted from operational systems (E), transformed into the desired format (T) and loaded into data warehouse tables (L). Collectively, these ETL operations are part of data maintenance operations. In some environments ETL operations may run almost continuously, feeding the warehouse from various data sources, aggregating data, and purging data that is no longer required.
Online Analytical Processing (OLAP)
In addition to planned queries and data maintenance activities, data warehouses may also support ad hoc queries and online analytical processing (OLAP). Ad hoc queries and OLAP activities pose particular challenges because the nature of these queries is not known in advance.
Organizations cannot afford to have analysts running queries that interfere with business-critical reporting and data maintenance activities. Because of this, datasets required to support OLAP are frequently extracted from the data warehouse, and analysts run queries against these data extracts so as not to interfere with the data warehouse itself. The diagram below illustrates how users of BI tools typically analyze data in the data warehouse.
Data extracts can take different forms, including raw data extracts, aggregation tables and multi-dimensional data cubes. Data cubes are preprocessed data extracts where data is stored to support multi-dimensional analysis efficiently. Data cubes are used to explore relationships between data items.
In some cases, analysts may extract data from the data warehouse themselves. In other cases, data extraction and cube construction will run as batch workflows along with other data maintenance activities. Business analysts typically use BI tools such as Tableau, Power BI or Qlik to explore, analyze and visualize data.
The Cloud Data Warehouse
Companies are increasingly moving toward cloud-based data warehouses such as Snowflake, Amazon Redshift or Azure Synapse Analytics (formerly Azure SQL Data Warehouse) to augment, and in some cases replace, traditional on-premises data warehouses. Cloud data warehouses apply the concept of the traditional data warehouse to the cloud and differ from traditional warehouses in the following ways:
There is no need to purchase and maintain/support physical hardware
It’s quicker and cheaper to set up and scale cloud data warehouses
The elastic nature of the cloud makes it faster and cheaper to perform complex massively parallel processing (MPP) workloads compared to on-prem
Cloud data warehouses still require that organizations deal with ETL workflows, but with modern cloud databases these requirements may be reduced. For example, a cloud database may support fast, column-oriented queries allowing OLAP tools to run against the database directly, reducing the need for ETL workflows to prepare data in advance.
While data warehouses have been around for decades, data lakes, by comparison, are relatively new. Data lakes are typically built on scalable, low-cost commodity servers or leverage cloud-based object storage. Examples of data lake environments include Apache Hadoop, Amazon S3 and Microsoft Azure Data Lake Storage (ADLS).
Some Key Differences
Traditional data warehouses and data lakes were created to solve different problems. Data lakes are generally much more economical than data warehouses per terabyte stored. They are also elastic, resilient and far more scalable.
Data types such as text, images, social media activity, web server logs and telemetry from sensors are difficult or impractical to store in a traditional database. These data types may lack a clear structure that is easily parsed to fit into a database table with rows and columns. Text and social media activity are good examples. In other cases (such as streaming data or sensor telemetry), even though the data may be structured, the rate at which data needs to be collected would overwhelm a traditional RDBMS.
Valuable new use cases continue to be found for these and other types of nontraditional data, however. For example, analyzing raw web, transaction logs or syslog data can help organizations detect cybersecurity threats or various types of fraud.
A high-level comparison of some of the differences between a data warehouse and a data lake appears below:
Today, most organizations initially store data in a cloud data lake environment such as Amazon S3 or ADLS. In some cases, data in the data lake can be queried directly, but in others, data needs to be loaded into a separate enterprise data warehouse. The process of transforming and loading data from the data lake into an enterprise data warehouse results in replicated data, added complexity and increased total cost of ownership (TCO). For these reasons, users are increasingly looking for solutions that enable them to query data lake storage directly to avoid the cost and overhead of managing an enterprise data warehouse.
What Your Organization Should Use
Today, many enterprises operate both data warehouses and data lakes. Data lakes have overtaken the capabilities of data warehouses in many areas, including cost-efficiency, flexibility and scale. First-generation Hadoop data lakes may lag the capabilities of the data warehouse in other areas, however, including performance, security and data governance features. Many organizations have large significant sunk investments in data warehouses. Even though data lakes are quickly closing the gap in these other functional areas, it is unlikely that enterprise data warehouses will disappear anytime soon.
Organizations have historically used a data warehouse when:
The data that needs to be stored is known in advance, and organizations are comfortable discarding additional data or creating duplicates
Data formats are relatively static, and not expected to change with time
Organizations run standard sets of reports requiring fast, efficient queries
Results need to be drawn from accurate, carefully curated data
Regulatory or business requirements dictate special handling of data for security or audit purposes
Data lakes should be considered when:
The types of data that need to be stored are not known in advance
Data types do not easily fit a tabular or relational model
Datasets are either very large or are growing fast enough that cost of storage is a concern
Relationships between data elements are not understood in advance
Applications include data exploration, predictive analytics and machine learning where it is valuable to have complete, raw datasets
Supporting Operational Queries
Data warehouses are primarily suited to business analysts and operational users. These are users that need to access data and reports to answer business-level questions. Data lakes, on the other hand, can support all types of users, including data architects, data scientists, analysts and operational users.
Data analysts will see value in summary operational reports. However, they may also want to delve more deeply into the source data to understand the underlying reasons for changes in metrics and KPIs not apparent from the summary reports. Data scientists may be tasked with employing more advanced analytic techniques to get more value from data. These include statistical analysis and predictive modeling to gather insights from data not attainable from the limited data available in the data warehouse.
First-generation data lakes could support operational reporting, but achieving adequate performance has been a challenge historically. Operational reporting from a data lake is supported by metadata that sits over raw data in a data lake, rather than the physically rigid data views in a data warehouse. The advantage of the data lake is that operations can change without requiring a developer to make changes to underlying data structures (an expensive and time-consuming process).
Different Use Cases for Data Warehouses and Data Lakes
Opportunities for Consolidation and Savings
Ideally, most organizations would prefer to have a single source of enterprise data rather than replicating data in both the data lake and the data warehouse. Traditional barriers to doing this have been concerns related to query performance, compatibility with existing BI tools and data security and data governance-related issues.
Fortunately, technological barriers that prevented data lakes from being used for operational data warehouse applications are quickly falling away. Modern technology such as Dremio’s data lake engine enables data analysts to run BI queries directly against the data lake using familiar BI tools and no change to the analyst’s environment. Equally important, queries made against the data lake are now lightning-fast and support the same data security, provenance and data lineage related features found in a much more expensive data warehouse.
These advances in data lake query technologies can help enterprises offload expensive analytic processes from data warehouses at their own pace. By embracing data lake native solutions, organizations can boost productivity and efficiency, run across their choice of on-premises and cloud platforms, and significantly reduce cost.