A data lake engine is an application or service which queries and/or processes the vast sets of data stored in data lake storage. Data lake processing engines like Apache Spark are often used for batch data transformation jobs and machine learning. Data lake query engines such as Dremio and Presto are used to analyze structured and semi-structured data in place for business intelligence (BI) and data science.
Data lake engines support queries directly against raw structured, semi-structured or unstructured data in a data lake that is stored in text, CSV, JSON or open columnar formats such as ORC and Apache Parquet. They leverage a variety of new open source and proprietary software technologies to accelerate query performance dramatically. In doing so, they help avoid the limitations of traditional data warehouses, including the need for ETL processing, cubes, data extracts and aggregation tables.
The data lake engine allows data consumers to query or process data in cloud storage directly, as illustrated below.
A data lake engine addresses a variety of business needs by simplifying data access, accelerating analytical processing, securing and masking data, curating datasets and providing a unified catalog of data across all sources.
Popular BI and reporting tools often assume that all data exists in a single, high-performance relational database. When data is in multiple systems, or non-relational data stores such as Azure Data Lake Storage (ADLS), Amazon S3, Hadoop or NoSQL databases, then these tools are compromised in their abilities. Despite advances in data lake technologies, it is generally not practical to run analytic and BI tools against data lakes directly. As a result, IT is tasked with moving this data into relational environments, creating cubes, and generating proprietary extracts for different analytical tools.
Data lake engines simplify these challenges by allowing companies to leave data where it is already managed and stored, and providing fast access for data consumers, regardless of the tool they use.
The Evolution of Data Lakes
While data lakes provide a cost-effective way to store large amounts of data, they have several limitations. To understand these limitations, it is useful to review how data lakes have evolved.
First-generation data lakes were typically implemented using Hadoop and deployed in on-premises environments. As cloud providers began to provide inexpensive cloud-based object storage, many organizations began to deploy data lakes in the cloud.
These cloud data lakes such as Amazon S3 and ADLS represented a significant architectural advancement. Cloud-based storage allows for the decoupling of storage from compute (allowing the two to scale independently). Perhaps the most compelling aspect of cloud data lakes, however, is that enterprises no longer needed to manage their infrastructure.
Despite the advances in software for managing and querying data lakes, traditional data lakes continue to have limitations. Among these are:
Performance – Early generations of data lakes were notoriously slow, relying on batch-oriented methods such as MapReduce and Hive to extract data. While performance has improved with Spark and other data engines, even cloud data lakes are still too slow for most BI and analytic requirements.
Complexity– Data lakes have historically required significant expertise to use and manage. As technologies for querying big data evolved, data lakes came to support a variety of engines optimized for specific data management problems. Examples include Hive, HBase, Impala, Spark and others. For analysts who have been using SQL for decades, learning these new tools posed a significant barrier to adoption.
Efficiency – While cloud-based data lakes are less expensive per TB stored than traditional data warehouses, they can be inefficient in terms of storage. Data models are often determined by query requirements, and unlike relational databases, data is frequently de-normalized with the same data fields spread across different tables designed to support specific queries. Also, to support BI tools, data may be extracted into OLAP cubes or intermediate representations leading to further data duplication and complicating data governance.
Hybrid Data Warehouse/Data Lake Implementations
To overcome the limitations described above, data engineers are often tasked with moving data sets into the enterprise data warehouse. Once data is in the data warehouse, it is more accessible to business users. However, to maximize performance, data engineers also need to create cubes (for multi-dimensional data analysis), and generate extracts using familiar BI and analytic tools.
To achieve this, organizations need to devise processes to extract, transform and load (ETL) data from its raw form in the data lake into SQL tables in the data warehouse. Despite the added cost and complexity, organizations may see advantages to this approach:
Business analysts can easily access new datasets using familiar BI tools and analytic techniques.
Analysts can be more productive since data in the data warehouse can be queried more quickly and easily.
Organizations can benefit from data governance and security features often missing in less mature Hadoop or cloud-based data lake solutions. These data controls may be essential to ensure data security or meet external regulatory requirements.
A hybrid data warehouse/data lake environment is illustrated below.
Needless to say, while this hybrid approach is used in many enterprises, it is far from ideal. Some drawbacks are as follows:
Inefficient – The same data may be replicated multiple times. The same data may exist in the data lake, in intermediate files created during ETL processing, in the data warehouse or data mart, or extracts, cubes or intermediate aggregation tables created by analysts.
Significant work for IT – A second challenge is that loading data into the data warehouse requires assistance from IT. Organizations need to create new data warehouse tables, define workflows, and transform data such that it can be loaded into the warehouse. Also, these workflows need to be operationalized, typically via batch processing so that as new data appears in the data lake, it becomes available in the data warehouse. In many organizations, this can take months and require significant development effort.
Expensive – In addition to concerns related to productivity and efficiency, the cost is another concern. Storing data in the data warehouse is expensive, and rather than offloading the data warehouse, the strategy above increases the cost of data management. Customers not only pay for a data lake infrastructure, but they continue to grow data warehouse and supporting ETL-related infrastructure.
Data governance concerns – Finally, while the architecture above is meant to simplify data access and better manage data, ironically, it can undermine data governance efforts. Analysts will often have cubes and various data extracts containing potentially sensitive data sitting outside of the governance framework of the data warehouse.
The Rise of the Cloud Data Lake Engine
Most enterprises would prefer to manage a single, unified data environment. They would like to present users and applications with a single repository of enterprise data accessible without compromising on performance, flexibility or data security. Also, they would like to avoid the need for the complex ETL processing required to stage data in the data warehouse.
The cloud data lake engine is a new category of analytic platform that extends maturing cloud data lakes by addressing their traditional shortcomings.
Cloud data lake engines support existing BI and data science workloads with performance comparable to an enterprise data warehouse (EDW) while simplifying data access, and using resources more efficiently. They blend the performance, structural and governance advantages of the SQL-oriented data warehouse with the openness, scale and efficiency of cloud object stores.
Data lake engines enable companies to leave data in the data lake but provide fast, secure access for data consumers, regardless of the tool they use. Organizations can deploy data lake engines in their own data centers or cloud platforms such as Amazon AWS, Microsoft Azure and Google Cloud Platform.
A high-level architecture of a data lake engine is illustrated below.
Characteristics of a Data Lake Engine
Modern data lake engines are faster than SQL query engines because they use a variety of new technologies for in-memory, columnar and parallel processing to accelerate performance. This enables customers to achieve query performance against raw data in a data lake, similar to what they would expect from an optimized data warehouse.
Ease of Use
Data lake engines are also easier to use and manage, providing a central control point for data engineers to manage data. A semantic layer abstracts raw data in the multi-tenant object store, providing virtual data views that can be queried using familiar SQL and BI tools via standard ODBC/JDBC interfaces. The data lake also reduces the need for OLAP cubes and BI extracts for offline analysis because it is fast and flexible enough for analytic tools to query data sources directly without the need for intermediate data representations.
Data lake engines are also much more efficient and flexible. They reduce or avoid the requirement for costly enterprise data warehouse infrastructure running instead on lower-cost distributed infrastructure or cloud instances. Data lake engines are deployable in elastic cloud computing environments or containerized environments managed by Kubernetes.
Data scientists and BI analysts can be more productive. Analysts can create intermediate data representations themselves, and no longer wait for data engineers to load data into the data warehouse. Also, analysts can continue to use existing BI tools. Data becomes faster and easier to explore leading to better quality analytic models and ultimately better business decisions.
Openness and Flexibility
Along with the other advantages described above, data lake engines are inherently more open and flexible. Customers retain the flexibility to use their analytic tools of choice, and they can run on premises or in multiple clouds against multiple data sources. They can also take advantage of multiple deployment models and open data formats.
Dremio Data Lake Engine
Dremio’s data lake engine employs the technologies above along with other innovations, including data reflections for query acceleration, a columnar cloud cache (C3), and predictive pipelining.
Speed up queries by up to 3,000x directly on data lake storage