Introduction to Data Lakes

   
  • Dremio

What Is a Data Lake?

A data lake is a centralized repository that allows you to store all of your data, whether a little or a lot, in one place.

Like a real lake, data lakes store large amounts of unrefined data coming from various streams and tributaries in its natural state. Also, like a real lake, the sources that feed the lake can change with time.

Data Lakes Are the New Normal

In the old days, the cost of data storage and the complexity of the software meant that organizations had to be picky about how much data they kept. Organizations would carefully design databases and use them only to store business-critical information. Today, storage is cheaper and the software for analyzing its contents is more capable, so the way organizations view data collection and storage has changed dramatically.

Storing a terabyte of data is a staggering 10M+ times less expensive than it was 40 years ago. It’s now a lot more viable to keep all the data your business generates. Sometimes, it can be cheaper to collect all the data you can in a data lake as it comes in, and then sort it later.

Just as storage costs have plummeted, so too has the cost of data acquisition. Thanks to all the devices we use today, the cost of capturing data has dropped to almost zero, with nearly all data originating from computers, laptops, tablets, and phones. Almost every interaction now leaves a digital trail, from in-store purchases and in-app e-commerce orders to recorded customer service interactions via phone or chat.

Data Lake Storage Technologies

There is a variety of data lake storage technologies. Popular choices include:

  • Legacy HDFS (Hadoop Distributed File System)
  • Amazon S3
  • Azure Blob storage
  • Azure Data Lake Storage (ADLS)

Those are just a few examples, but lots of other on-premises and cloud solutions exist.

Whichever one you choose, these solutions work in similar ways: they are distributed systems in which data is spread across multiple low-cost hosts or cloud instances. Data is usually stored in several places at once, so that a copy is still available if something goes wrong.
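To make the idea of spreading and duplicating data more concrete, here is a minimal sketch in Python. It is not how HDFS, Amazon S3, or Azure storage actually place data. The host names, the file path, the replication factor of 3, and the hash-based placement rule are all assumptions chosen for illustration.

```python
import hashlib


def place_replicas(object_key, hosts, replication_factor=3):
    """Pick `replication_factor` distinct hosts for one object.

    Hypothetical placement rule: hash the key to choose a starting host,
    then take the following hosts in order, wrapping around the list.
    """
    if replication_factor > len(hosts):
        raise ValueError("not enough hosts for the requested replication factor")
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(hosts)
    return [hosts[(start + i) % len(hosts)] for i in range(replication_factor)]


def readable_copies(replicas, failed_hosts):
    """Return the replicas that are still reachable after some hosts fail."""
    return [host for host in replicas if host not in failed_hosts]


# Illustrative cluster of five low-cost hosts (names are made up).
hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]
replicas = place_replicas("sales/2024/orders.csv", hosts)

# Simulate losing the first host that holds a copy.
survivors = readable_copies(replicas, failed_hosts={replicas[0]})
print("stored on:", replicas)
print("still readable from:", survivors)
```

Because the file lives on three different hosts in this sketch, losing one of them still leaves two readable copies.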

Data Lake File and Table Formats

Traditional databases had to store data in very specific, organized ways, but a data lake can easily store any kind of data — whether it’s fully organized when it’s uploaded or completely unstructured.

A data lake can store a variety of file formats. Common file formats for data storage include:

  • Comma-separated values, or CSV
  • JavaScript Object Notation, or JSON
  • Query-optimized open source columnar formats, for example, Apache Parquet
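As a rough illustration of how the first two formats differ, the sketch below writes the same records as CSV and as JSON using only the Python standard library, then reads them back. The sample records and field names are made up, and storing one JSON object per line is just one common convention. Parquet is left out because reading and writing it requires a third-party library.

```python
import csv
import io
import json

# Made-up sample records for illustration.
records = [
    {"order_id": 1, "customer": "Ada", "total": 19.99},
    {"order_id": 2, "customer": "Grace", "total": 5.0},
]

# CSV: one header row, then one row of comma-separated values per record.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["order_id", "customer", "total"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# JSON: here, one self-describing object per line.
json_text = "\n".join(json.dumps(record) for record in records)

# Reading back: CSV values come back as strings, JSON keeps numbers as numbers.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_rows = [json.loads(line) for line in json_text.splitlines()]

print(type(csv_rows[0]["total"]).__name__)
print(type(json_rows[0]["total"]).__name__)
```

The CSV reader returns every value as text, so whoever queries the file has to decide which columns are numbers, while JSON carries that information with each record.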

Data lakes are only as useful as their metadata. Table formats are metadata constructs that make it easier to interact with files as tables: they help you understand what data you have in your data lake and make that data easier to use. Common table formats include:

  • Apache Iceberg (open source)
  • Delta Lake (open source, originally developed by Databricks)
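The sketch below shows, in a deliberately simplified way, what it means for a table format to be metadata layered over files. It does not reproduce the actual metadata layout of Apache Iceberg or Delta Lake. The `TableMetadata` class, the bucket name, the file paths, and the row counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    """Hypothetical, simplified stand-in for what a table format records."""

    name: str
    schema: dict  # column name -> type name
    data_files: list = field(default_factory=list)

    def add_file(self, path, row_count):
        # Register a data file as part of the table without touching the file.
        self.data_files.append({"path": path, "row_count": row_count})

    def total_rows(self):
        # Answered from metadata alone; no data file is opened.
        return sum(f["row_count"] for f in self.data_files)


orders = TableMetadata(
    name="orders",
    schema={"order_id": "int", "customer": "string", "total": "double"},
)
# Made-up paths and row counts.
orders.add_file("s3://example-bucket/orders/part-0001.parquet", row_count=1000)
orders.add_file("s3://example-bucket/orders/part-0002.parquet", row_count=250)

print(orders.schema)
print(orders.total_rows())
```

With the metadata in place, a reader can learn the table's columns and which files belong to it without scanning the storage layer.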

A metastore stores metadata about all the tables in your data lake and how they are structured, essentially acting as a catalog for everything in your lake. Data lake metastores include:

Data Lakehouses

Before the smartphone, we had to carry around lots of single-function devices, such as a diary, a camera, or a phone. The smartphone brought the best parts of each together in one device. In the same way, data lakehouses combine the best of both data warehouses and data lakes.

Data warehouses typically have carefully crafted schemas designed to answer predetermined queries quickly and efficiently. Data lakes store all your data, but historically they have been harder to query because the data is not rigorously structured and formatted for analysis. A data lakehouse combines the best of both worlds.
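To see why loosely structured data is harder to query, consider the small sketch below. The raw events, field names, and the `read_with_schema` helper are made up for illustration. The point is that the reader has to impose a shape on the data at query time, whereas a warehouse would have enforced that shape when the data was loaded.

```python
import json

# Raw events as they might land in a lake: fields and types vary per record.
raw_lines = [
    '{"user": "ada", "amount": "19.99", "channel": "app"}',
    '{"user": "grace", "amount": 5}',
    '{"user": "linus", "note": "support call, no purchase"}',
]


def read_with_schema(line):
    """Apply a fixed shape (user: str, amount: float or None) while reading."""
    record = json.loads(line)
    amount = record.get("amount")
    return {
        "user": str(record["user"]),
        "amount": float(amount) if amount is not None else None,
    }


rows = [read_with_schema(line) for line in raw_lines]

# Only now can the amounts be summed reliably.
total = sum(row["amount"] for row in rows if row["amount"] is not None)
print(round(total, 2))
```

Every query against the raw events would need similar cleanup logic, which is the kind of work a predefined schema avoids.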

Because a data lakehouse combines the features of a data lake and a data warehouse, it can be greater than the sum of its parts. It separates transactional functions from storage and reduces the overall amount of compute power needed to run queries by directly accessing standardized source data, whether or not it has been fully structured.

A cloud-based lakehouse supports a wide range of schemas, data governance protocols, and end-to-end streaming. It can also read and write data simultaneously, which makes it a more stable platform for concurrent users.

Dremio and Data Lakehouses

Dremio helps companies get more value from their data, faster. Dremio’s forever-free lakehouse platform delivers high-performing BI dashboards and interactive analytics directly on the data lake.

