
Data Lake vs. Data Warehouse

While data lakes and data warehouses are conceptually different in terms of their design and implementation, they have at least a few things in common:

  • Both are meant to help organizations make better decisions
  • Both are of interest to analysts and data scientists
  • Both are designed to store large amounts of enterprise data

However, this is usually where the similarities end. Before comparing data warehouses and data lakes, it is useful first to explain what we mean by data warehousing.

What Is a Data Warehouse?

Data warehouses have been around in various forms since the early 1980s. They are generally used to store data from operational systems and a variety of other sources. The idea behind a data warehouse is to collect enterprise data into a single location where it can be consolidated and analyzed to help organizations make better business decisions. For example, a company might use a data warehouse to store information about things like products, orders, customers, inventory, employees and more.

Data warehouses are deployed in different tiers. For example, large organizations may deploy data marts, which are topic- or function-specific data warehouses. They may also have operational data stores (ODS) used for various reporting and operational tasks. Both data marts and operational data stores will typically feed a central enterprise data warehouse (EDW) that aggregates data from multiple sources.

Business intelligence (BI) software is used by business analysts to help explore and visualize data in the data warehouse. This category of software is also referred to as decision support software (DSS) since its purpose is to help enterprises make better decisions based on data.

When most people refer to data warehouses, they are referring to traditional data warehouses built using SQL relational databases (RDBMS). As database technology continues to evolve, some organizations may use alternative data management environments such as NoSQL data stores or cloud-based services to warehouse data.

Traditional vs. Cloud Data Warehouses

At one time all data warehouses were deployed on premises. Increasingly data warehouses are also deployed in the cloud, so it is important to distinguish between traditional data warehouses (on premises) and cloud data warehouse services. What both traditional and cloud data warehouses have in common is that they are optimized to support structured data and present a standard SQL interface to familiar BI tools.

The Traditional Data Warehouse

Traditional data warehouses like Teradata store data in relational database tables. As relational database administrators know, running complex queries across multiple large tables can be time-consuming. Complex queries can run for several minutes or, in some cases, even hours.

Data warehouses need to support operational reporting, so database administrators typically design schemas to process these anticipated queries efficiently. They carefully decide which data belongs in which table, which fields should serve as primary or foreign keys, and which fields should be indexed.
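As a minimal sketch of this kind of schema design (all table, column and index names here are hypothetical), an administrator might define a fact table with indexed key fields along the lines of:

```python
import sqlite3

# In-memory SQLite stands in for the warehouse RDBMS in this illustration
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT,
    category   TEXT
);
CREATE TABLE fact_orders (
    order_id   INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),  -- foreign key
    quantity   INTEGER,
    order_date TEXT
);
-- Index the fields that anticipated queries will join and filter on
CREATE INDEX idx_orders_product ON fact_orders(product_id);
CREATE INDEX idx_orders_date    ON fact_orders(order_date);
""")
```

With the indexes in place, the anticipated reporting queries (joins on product_id, filters on order_date) can avoid full table scans.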

ETL – Extract, Transform and Load

Data warehouses also need to be constantly refreshed with new data from other systems. The data sources that feed the EDW typically will not match the schema of the data warehouse tables. This means that data often needs to be extracted from operational systems (E), transformed into the desired format (T) and loaded into data warehouse tables (L). Collectively, these ETL steps make up a large part of routine data maintenance. In some environments ETL operations may run almost continuously, feeding the warehouse from various data sources, aggregating data, and purging data that is no longer required.
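The three ETL steps can be sketched in a few lines of Python (the row layout and table names below are hypothetical, and an in-memory SQLite database stands in for the warehouse):

```python
import sqlite3

# Extract (E): rows pulled from an operational system, still in source format
extracted = [
    {"sku": "A-100", "qty": "3", "unit_price": "9.99"},
    {"sku": "B-200", "qty": "1", "unit_price": "24.50"},
]

# Transform (T): cast string fields to numeric types and derive a total,
# matching the schema the warehouse table expects
transformed = [
    (r["sku"], int(r["qty"]), float(r["unit_price"]),
     int(r["qty"]) * float(r["unit_price"]))
    for r in extracted
]

# Load (L): insert the conformed rows into the warehouse table
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (sku TEXT, qty INTEGER, unit_price REAL, total REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transformed)
```

Production ETL pipelines add scheduling, error handling and incremental loads, but the extract, transform, load sequence is the same.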

Online Analytical Processing (OLAP)

In addition to planned queries and data maintenance activities, data warehouses may also support ad hoc queries and online analytical processing (OLAP). Ad hoc queries and OLAP activities pose particular challenges because the nature of these queries is not known in advance.

Organizations cannot afford to have analysts running queries that interfere with business-critical reporting and data maintenance activities. Because of this, datasets required to support OLAP are frequently extracted from the data warehouse, and analysts run queries against these data extracts so as not to interfere with the data warehouse itself. The diagram below illustrates how users of BI tools typically analyze data in the data warehouse.

Data extracts can take different forms, including raw data extracts, aggregation tables and multi-dimensional data cubes. Data cubes are preprocessed data extracts where data is stored to support multi-dimensional analysis efficiently. Data cubes are used to explore relationships between data items.
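To make the idea of a cube concrete, here is a toy sketch (dimension and measure names are invented) that pre-aggregates fact rows along two dimensions, including the rollups a multi-dimensional cube would hold:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, revenue)
facts = [
    ("East", "widget", 100.0),
    ("East", "gadget", 50.0),
    ("West", "widget", 75.0),
    ("East", "widget", 25.0),
]

# Pre-aggregate every cell and rollup once, so later multi-dimensional
# queries become simple dictionary lookups instead of scans
cube = defaultdict(float)
for region, product, revenue in facts:
    cube[(region, product)] += revenue   # individual cell
    cube[(region, "ALL")] += revenue     # rollup across products
    cube[("ALL", product)] += revenue    # rollup across regions
    cube[("ALL", "ALL")] += revenue      # grand total
```

An analyst's question like "widget revenue across all regions" then reads a single precomputed cell rather than rescanning the underlying fact data.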

In some cases, analysts may extract data from the data warehouse themselves. In other cases, data extraction and cube construction will run as batch workflows along with other data maintenance activities. Business analysts typically use BI tools such as Tableau, Power BI or Qlik to explore, analyze and visualize data.

The Cloud Data Warehouse

Companies are increasingly moving toward cloud-based data warehouses such as Snowflake, Amazon Redshift or Azure Synapse Analytics (formerly Azure SQL Data Warehouse) to augment, and in some cases replace, traditional on-premises data warehouses. Cloud data warehouses apply the concept of the traditional data warehouse to the cloud and differ from traditional warehouses in the following ways:

  • There is no need to purchase and maintain/support physical hardware
  • It’s quicker and cheaper to set up and scale cloud data warehouses
  • The elastic nature of the cloud makes it faster and cheaper to run complex massively parallel processing (MPP) workloads than on-premises infrastructure

Cloud data warehouses still require that organizations deal with ETL workflows, but with modern cloud databases these requirements may be reduced. For example, a cloud database may support fast, column-oriented queries allowing OLAP tools to run against the database directly, reducing the need for ETL workflows to prepare data in advance.

While data warehouses have been around for decades, data lakes, by comparison, are relatively new. Data lakes are typically built on scalable, low-cost commodity servers or leverage cloud-based object storage. Examples of data lake environments include Apache Hadoop, Amazon S3 and Microsoft Azure Data Lake Storage (ADLS).

What Is a Data Lake?

Unlike data warehouses, data lakes do not require that a rigorous schema be defined in advance of loading data. Any information can be stored in the data lake, including binary files, images, video files, documents and text. This includes semi-structured data such as CSV files, log files and JSON files. A key feature of a data lake is that data is stored in its natural form, and data is transformed only when applications are ready to use it. This approach is known as "schema on read" versus the "schema on write" approach used in a data warehouse.
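A minimal illustration of schema on read (field names and values are invented): raw JSON events are stored exactly as they landed, and a schema is imposed only when the data is read:

```python
import json

# Raw events stored in the lake as-is; nothing was validated on write
raw_lines = [
    '{"user": "u1", "amount": "19.99", "ts": "2021-03-01"}',
    '{"user": "u2", "amount": "5.00"}',  # fields may be missing entirely
]

def read_with_schema(lines):
    """Impose a schema at read time: pick fields, cast types, default gaps."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user": rec.get("user"),
            "amount": float(rec.get("amount", 0.0)),
            "ts": rec.get("ts"),  # missing fields surface as None
        }

records = list(read_with_schema(raw_lines))
```

Under schema on write, the sparse second event would have been rejected or coerced at load time; here it is kept in its natural form and interpreted only at query time.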

To accommodate the full range of data types, while also making data efficient to parse, query and analyze, data lakes frequently use specialized data formats. In addition to text and binary files, data lake storage may support key-value file formats and sequence files (.seq) composed of binary key-value pairs. Other common file formats include Avro (a row-oriented binary format with JSON-defined schemas), Apache Parquet (a column-oriented format), RCFile (record columnar) and ORC (optimized row columnar).

Sometimes raw data will exist in external systems such as time-series databases, NAS filers, or relational or NoSQL databases. Rather than ingest data from these sources into the data lake and store it redundantly, these external sources may be thought of as part of the logical data lake. Analytic tools will typically access the data wherever it exists. These additional data sources can be thought of as ponds, feeding the data lake while residing nearby.

Some Key Differences

Traditional data warehouses and data lakes were created to solve different problems. Data lakes are generally much more economical than data warehouses per terabyte stored. They are also elastic, resilient and far more scalable.

Data types such as text, images, social media activity, web server logs and telemetry from sensors are difficult or impractical to store in a traditional database. These data types may lack a clear structure that is easily parsed to fit into a database table with rows and columns. Text and social media activity are good examples. In other cases (such as streaming data or sensor telemetry), even though the data may be structured, the rate at which data needs to be collected would overwhelm a traditional RDBMS.

Valuable new use cases continue to be found for these and other types of nontraditional data, however. For example, analyzing raw web, transaction logs or syslog data can help organizations detect cybersecurity threats or various types of fraud.

A high-level comparison of some of the differences between a data warehouse and a data lake appears below:

Comparison      | Data Warehouse                 | Data Lake
Data volumes    | Usually in the TBs             | PBs or even hundreds of PBs
Data curation   | Carefully select data to store | Store everything
Data types      | Structured                     | Structured, semi-structured and unstructured
Schemas         | Carefully predefined schemas   | Schemas applied on read
Cost            | Generally expensive            | Low cost per GB
Data format     | Proprietary                    | Diverse, comprised of multiple software frameworks, data sources and file formats
Vendor lock-in  | Yes                            | No

Today, most organizations initially store data in a cloud data lake environment such as Amazon S3 or ADLS. In some cases, data in the data lake can be queried directly, but in others, data needs to be loaded into a separate enterprise data warehouse. The process of transforming and loading data from the data lake into an enterprise data warehouse results in replicated data, added complexity and increased total cost of ownership (TCO). For these reasons, users are increasingly looking for solutions that enable them to query data lake storage directly to avoid the cost and overhead of managing an enterprise data warehouse.

What Your Organization Should Use

Today, many enterprises operate both data warehouses and data lakes. Data lakes have overtaken the capabilities of data warehouses in many areas, including cost-efficiency, flexibility and scale. First-generation Hadoop data lakes may lag the capabilities of the data warehouse in other areas, however, including performance, security and data governance features. Many organizations also have significant sunk investments in data warehouses. Even though data lakes are quickly closing the gap in these other functional areas, it is unlikely that enterprise data warehouses will disappear anytime soon.

Organizations have historically used a data warehouse when:

  • The data that needs to be stored is known in advance, and organizations are comfortable discarding additional data or creating duplicates
  • Data formats are relatively static, and not expected to change with time
  • Organizations run standard sets of reports requiring fast, efficient queries
  • Results need to be drawn from accurate, carefully curated data
  • Regulatory or business requirements dictate special handling of data for security or audit purposes

Data lakes should be considered when:

  • The types of data that need to be stored are not known in advance
  • Data types do not easily fit a tabular or relational model
  • Datasets are either very large or are growing fast enough that cost of storage is a concern
  • Relationships between data elements are not understood in advance
  • Applications include data exploration, predictive analytics and machine learning where it is valuable to have complete, raw datasets

Supporting Operational Queries

Data warehouses are primarily suited to business analysts and operational users. These are users that need to access data and reports to answer business-level questions. Data lakes, on the other hand, can support all types of users, including data architects, data scientists, analysts and operational users.

Data analysts will see value in summary operational reports. However, they may also want to delve more deeply into the source data to understand the underlying reasons for changes in metrics and KPIs not apparent from the summary reports. Data scientists may be tasked with employing more advanced analytic techniques, such as statistical analysis and predictive modeling, to extract insights not attainable from the limited data available in the data warehouse.

First-generation data lakes could support operational reporting, but achieving adequate performance has historically been a challenge. Operational reporting from a data lake is supported by a metadata layer that sits over the raw data, rather than by the physically rigid views of a data warehouse. The advantage of the data lake is that reporting requirements can change without requiring a developer to modify underlying data structures (an expensive and time-consuming process).

Different Use Cases for Data Warehouses and Data Lakes

Data Warehouse | Data Lake
Run a consolidated quarterly report on e-commerce return rates by store and sales item, to rank top customer satisfaction issues. | Explore customer order data together with raw chat, email, reviews and customer support transcriptions to identify customer experience issues impeding sales.
Present an accounting dashboard showing days sales outstanding (DSO) by branch office to assess accounts receivable for the quarter. | For a mobile communication company, develop a new "opt-in" mobile service where businesses can receive anonymized notifications when a customer is in physical proximity to their location.
For all sales regions, rank SKUs based on average sales margin and inventory turnover to identify poorly performing sales items. | Compile a dataset comprised of web and OS-level logs, order details and support interactions to train a machine learning model to quantify the probability of fraud before a transaction completes.

Opportunities for Consolidation and Savings

Ideally, most organizations would prefer to have a single source of enterprise data rather than replicating data in both the data lake and the data warehouse. Traditional barriers to doing this have been concerns related to query performance, compatibility with existing BI tools and data security and data governance-related issues.

Fortunately, technological barriers that prevented data lakes from being used for operational data warehouse applications are quickly falling away. Modern technology such as Dremio's data lake engine enables data analysts to run BI queries directly against the data lake using familiar BI tools, with no change to the analyst's environment. Equally important, queries made against the data lake are now lightning-fast and support the same data security, provenance and data lineage features found in a much more expensive data warehouse.

These advances in data lake query technologies can help enterprises offload expensive analytic processes from data warehouses at their own pace. By embracing data lake native solutions, organizations can boost productivity and efficiency, run across their choice of on-premises and cloud platforms, and significantly reduce cost.
