Introduction to Data Lakes

  • Dremio

What Is a Data Lake?

A data lake is a centralized repository that allows you to store all of your data, whether a little or a lot, in one place.

Like a real lake, a data lake stores large amounts of unrefined data, flowing in from various streams and tributaries, in its natural state. Also like a real lake, the sources that feed it can change over time.

Data Lakes Are the New Normal

In the old days, the cost of storage and the complexity of the software involved meant that organizations had to be picky about how much data they kept. Organizations would carefully design databases but use them only to store business-critical information. Today, storage is far cheaper and the software for analyzing data is far more capable, so the way organizations view data collection and storage has changed dramatically.

Storing a terabyte of data is a staggering 10M+ times less expensive than it was 40 years ago. It's now a lot more viable to keep all the data your business generates. Sometimes it can be cheaper to collect all the data you can in a data lake as it comes in, and then sort it later.

Just as storage costs have plummeted, so too has the cost of data acquisition. Thanks to all the devices we use today, the cost of capturing data has dropped to almost zero, with nearly all data originating from computers, laptops, tablets, and phones. Nearly every interaction leaves a digital trail: everything from in-store purchases and in-app e-commerce orders to recorded customer service interactions via phone or chat.

Data Lake Storage Technologies

There are a variety of data lake storage technologies. Popular storage solutions for data lakes include:

  • Legacy HDFS (Hadoop Distributed File System)
  • Amazon S3
  • Azure Blob storage
  • Azure Data Lake Storage (ADLS)

Those are just a few examples, but lots of other on-premises and cloud solutions exist.

Whichever one you choose, these systems work in similar ways: they are distributed systems that spread data across multiple low-cost hosts or cloud instances. Data is usually replicated in multiple places simultaneously to provide a backup if something goes wrong.

Data Lake File and Table Formats

Traditional databases had to store data in very specific, organized ways, but a data lake can easily store any kind of data — whether it’s fully organized when it’s uploaded or completely unstructured.

A data lake can store a variety of file formats. Common file formats for data storage include:

  • Comma-separated values, or CSV
  • JavaScript Object Notation, or JSON
  • Query-optimized, open source columnar formats, such as Apache Parquet
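
To make the difference between these formats concrete, here is a small sketch, using only the Python standard library, of the same records serialized as CSV and as JSON Lines. The records themselves are invented for illustration; Parquet, being a binary columnar format, needs a third-party library such as pyarrow, so it is shown only as a comment.

```python
import csv
import json
import io

# The same two records, serialized in two common data lake file formats.
rows = [
    {"id": 1, "name": "alice", "spend": 9.50},
    {"id": 2, "name": "bob", "spend": 3.25},
]

# CSV: flat, row-oriented, human-readable; one header row, then data rows.
csv_buf = io.StringIO()
writer = csv.DictWriter(csv_buf, fieldnames=["id", "name", "spend"])
writer.writeheader()
writer.writerows(rows)
print(csv_buf.getvalue())

# JSON Lines: one JSON object per line; handles nested/semi-structured data.
jsonl = "\n".join(json.dumps(r) for r in rows)
print(jsonl)

# Parquet is a binary, column-oriented format optimized for analytics.
# With the pyarrow library installed, the equivalent would be roughly:
#   import pyarrow as pa, pyarrow.parquet as pq
#   pq.write_table(pa.Table.from_pylist(rows), "spend.parquet")
```

The practical difference is that row formats like CSV and JSON are easy to produce and read back whole, while columnar formats like Parquet let query engines scan only the columns a query needs.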

Data lakes are only as useful as their metadata. Table formats are metadata constructs that describe what data you have in your data lake and make it easier to interact with the underlying files as tables. Common table formats include:

  • Apache Iceberg (open source)
  • Delta Lake (Databricks)

A metastore stores metadata about all the tables in your data lake and how they are structured, essentially acting as a catalog for everything in your lake.
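
Real metastores, such as the Hive Metastore or the AWS Glue Data Catalog, persist this information durably and track far more detail. As a purely illustrative sketch (the table name, bucket path, and columns below are made up, and this is not any real metastore's API), the core idea is a mapping from table names to schema and storage location:

```python
# Toy in-memory "metastore": table names mapped to schema and location.
# Everything here is hypothetical; real metastores persist this and more.
metastore = {
    "sales.orders": {
        "location": "s3://my-lake/sales/orders/",  # hypothetical bucket
        "format": "parquet",
        "columns": {"order_id": "bigint", "amount": "double"},
    },
}

def describe(table: str) -> str:
    """Answer 'what data do I have?' from metadata alone, no file reads."""
    meta = metastore[table]
    cols = ", ".join(f"{c} {t}" for c, t in meta["columns"].items())
    return f"{table} ({cols}) stored as {meta['format']} at {meta['location']}"

print(describe("sales.orders"))
```

This is why the catalog matters: a query engine can discover tables, their schemas, and where their files live without scanning the lake itself.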

Data Lakehouses

Before the smartphone, we had to carry around lots of different single-function devices, whether a diary, a camera, or a phone. The smartphone brought the best parts of each together in one device. In the same way, data lakehouses combine the best of data warehouses and data lakes.

Data warehouses typically have carefully crafted schemas designed to answer predetermined queries quickly and efficiently. Data lakes store all your data, but historically they have been harder to query because the data is not rigorously structured and formatted for analysis. A data lakehouse combines the best of both worlds.

Because a data lakehouse combines the features of a data lake and a data warehouse, it can be greater than the sum of its parts. It separates transactional functions from storage and reduces the overall amount of compute power needed to run queries by directly accessing standardized source data, whether or not it has been fully structured.

A cloud-based lakehouse supports a wide range of schemas, data governance protocols, and end-to-end streaming. It can also read and write data simultaneously, making a more stable platform for concurrent users.

Dremio and Data Lakehouses

Dremio helps companies get more value from their data, faster. Dremio’s forever-free lakehouse platform delivers high-performing BI dashboards and interactive analytics directly on the data lake.

Ready to go deeper? Read a more technical article on data lakes.

Ready to Get Started? Here Are Some Resources to Help

Analyst Report

DZone Data Pipelines Trend Report

DZone has released their Data Pipelines Trend Report. It is a survey of software developers, architects, platform engineers, and IT professionals to understand the challenges and potential solutions around ingesting, processing, and leveraging data.



How Enel Group Built a Data Mesh Architecture with Dremio and AgileLab

In this webinar, learn how Enel Group worked with Agile Lab to implement Dremio as a data mesh solution for providing broad access to a unified view of their data, and how they use that architecture to enable a multitude of use cases.



The Digital Bank: Building a data-driven customer experience on the open lakehouse

Learn how direct access to analytics on Amazon S3 with Dremio can help banking and finance firms achieve a consistent and data-driven customer experience.

