16 minute read · July 1, 2021

What is a Data Lake?

A data lake is a centralized repository that allows you to store all of your structured and unstructured data at any scale. In the past, when disk storage was expensive and data was costly and time-consuming to gather, enterprises needed to be discerning about what data to collect and store. Organizations would carefully design databases and data warehouses to capture the information viewed as essential to the operation of their business. Today, with changes in the economics of data storage and improved analytic tools, how organizations view data collection and storage has changed dramatically.

Data Lake Use Cases

Data architecture modernization
Avoid reliance on proprietary data warehouse infrastructure and the need to manage cubes, extracts and aggregation tables. Run operational data warehouse queries on lower-cost data lakes, offloading the data warehouse at your own pace.

Business intelligence on data lake storage
Dramatically improve speed for ad hoc queries, dashboards and reports. Run existing BI tools on lower-cost data lakes without compromising performance or data quality. Avoid costly delays when adding new data sources and reports.

Data science on data lake storage
Accelerate data science on data lake storage with simplified data exploration and feature engineering. Dramatically improve performance, making data scientists and engineers more efficient and resulting in higher-quality analytic models.

Cloud data lake migration
Optionally deploy new applications to the cloud using data lake storage such as S3 or ADLS. Migrate from older on-prem data lake environments that are expensive and difficult to maintain while ensuring agility and flexibility.

Data in the data lake may be queried directly from various client tools via a modern data lake engine. Data may be extracted from the data lake to feed an existing data warehouse using ETL tools.

The data lake storage layer is where data is physically stored. In modern data lakes, data is frequently stored in cloud-based object stores such as Amazon S3 or ADLS but data may reside on premises as well. The data lake storage layer is not necessarily monolithic. Data in the logical data lake may span multiple physical data stores.

Data stored in data lake storage can exist in a variety of file formats, from plain text to various binary formats to specialized query-optimized formats. Some open source file formats, such as Apache Parquet, have their origins in Hadoop. Parquet is designed to support large, complex datasets with efficient compression and encoding while supporting column-oriented queries against large data tables. JSON (JavaScript Object Notation) is a popular format with developers: a lightweight, human-readable, text-based data-interchange format that can represent arbitrarily complex data. JSON is popular because it is easy to parse and generate in a variety of programming languages, and it is frequently the basis of messages passed via modern RESTful APIs. Other tabular data may be stored in simple text files containing comma-separated values (CSV) or tab-separated values (TSV).
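To make the trade-off between these text formats concrete, here is a minimal sketch using only Python's standard library. The record and field names are hypothetical; it writes the same record as JSON and as CSV and reads both back.

```python
import csv
import io
import json

# A hypothetical record that might land in the data lake from a RESTful API.
event = {"order_id": 1001, "customer": "acme", "amount": 42.5}

# JSON: human-readable text that round-trips types like numbers and strings.
json_text = json.dumps(event)
assert json.loads(json_text) == event

# CSV: the same record flattened into delimiter-separated text with a header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["order_id", "customer", "amount"])
writer.writeheader()
writer.writerow(event)
csv_text = buf.getvalue()

# Reading the CSV back yields strings only -- the type information is gone.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["customer"])  # -> acme
```

Note that the CSV round trip returns `"42.5"` as a string rather than a number, one reason schema-carrying formats such as Parquet are preferred for analytic tables.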

Table formats are metadata constructs built on top of the various physical file formats described above to make tables appear like SQL tables. Examples include Apache Iceberg, Delta Lake, AWS Glue and the Hive Metastore (HMS).
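The idea that a table format is "metadata on top of files" can be sketched with a toy, purely illustrative catalog; the table name, schema and S3 paths below are invented, and real formats such as Iceberg track far more (snapshots, partitions, statistics).

```python
from dataclasses import dataclass, field

# Toy illustration, not a real table format: metadata that maps a
# table name to a schema plus the physical files holding its rows.
@dataclass
class TableMetadata:
    schema: dict                       # column name -> type
    data_files: list = field(default_factory=list)

catalog = {
    "sales": TableMetadata(
        schema={"order_id": "bigint", "amount": "double"},
        data_files=[
            "s3://lake/sales/part-0000.parquet",  # hypothetical paths
            "s3://lake/sales/part-0001.parquet",
        ],
    )
}

# A query engine resolves the table name through the metadata,
# then scans only the listed files.
table = catalog["sales"]
print(len(table.data_files))  # -> 2
```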

The table repository layer provides a uniform view of, and optimized access to, the various table formats described above. Project Nessie is an open source project that works with these table formats, including Iceberg, Delta Lake and Hive tables.

Analytic and data science client tools typically access data in the data lake through a data lake engine. A variety of standard protocols efficiently encode and transmit queries and return results to client tools. These protocols include ODBC, JDBC, RESTful APIs and Apache Arrow Flight.
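The RESTful style among these protocols can be illustrated with a self-contained sketch: a stand-in HTTP endpoint (not a real engine's API) returns query results as JSON, and a client fetches and decodes them, all with Python's standard library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Hypothetical result rows a data lake engine might return for a query.
ROWS = [{"id": 1, "city": "Austin"}, {"id": 2, "city": "Oslo"}]

class QueryHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Encode the result set as a JSON response body.
        body = json.dumps({"rows": ROWS}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on any free local port in a background thread.
server = HTTPServer(("127.0.0.1", 0), QueryHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "client tool": issue the request and decode the JSON result.
with urlopen(f"http://127.0.0.1:{server.server_port}/query") as resp:
    result = json.loads(resp.read())

print(result["rows"][1]["city"])  # -> Oslo
server.shutdown()
```

Binary protocols such as ODBC, JDBC and Arrow Flight serve the same request/response role but with far more efficient encodings for large result sets.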

The data lake engine is an application or service that efficiently queries and processes the vast sets of data stored in a data lake via the various standardized software layers described above. Examples of data lake engines include Apache Spark, Presto and the Dremio data lake engine.
