Introducing the Apache Hudi Table Format, Purpose-Built for Low-Latency Data Lake Use Cases

Data lakes are one of the fastest-growing trends in managing big data for any organization. They offer massively scalable processing over vast amounts of data. As businesses have evolved, they have also come to demand many more features from data lakes: change data capture (CDC) at low latency to serve business needs, data deletion to meet compliance requirements while new data continues to be ingested, and lower costs for storing and accessing data (and metadata), all while scaling the data infrastructure for business continuity.

Apache Hudi is a data platform technology that provides much of the functionality needed to build and manage your own data lake. To give users more flexibility, we recently introduced a set of low-level APIs for programming directly against our table format. In this session, we will describe the Apache Hudi table format, which is designed to improve on the canonical table layouts popularly used to build modern data lakes. We will discuss the data and metadata layout of Hudi tables that realizes primitives such as upserts, deletes, and incremental pulls. We will go over ways to access the Hudi timeline (a sequential audit log of actions performed on the table) to help monitor and manage pipelines and tables. We will dive into Hudi's concurrency models and how Apache Hudi's table format also supports lock-free concurrent writing from multiple applications.

As the data lake ecosystem evolves, table services are becoming an integral part of an efficient data lake architecture. These include cleaning and compaction for efficient storage management of data and metadata, and data clustering, which intelligently and dynamically re-clusters data for better storage management and faster query times. We will talk about how compaction policy and scheduling and dynamic data clustering can be used as out-of-the-box solutions or plugged in with custom implementations, so that users get the best out of their Hudi tables. Similarly, a locking service for multi-writer support can be plugged in by users who already run their own lock service for the data infrastructure in their organization. To conclude, we will discuss our ongoing efforts to add column indexes to the table format to reduce read latency for queries that filter on commonly predicated columns.
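
To make the primitives named above more concrete, here is a minimal in-memory sketch of how upserts, deletes, and incremental pulls can all be derived from a sequential log of actions. It is a toy model under stated assumptions, not Hudi's API or on-disk layout: the class `ToyTable`, its methods, and the integer instants are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass
class ToyTable:
    """Toy keyed table with a commit timeline (illustrative only, not Hudi's API)."""

    # Timeline: ordered (instant, action) pairs, one per completed write.
    timeline: List[Tuple[int, str]] = field(default_factory=list)
    # Changes per instant: record key -> new value, or None for a delete.
    changes: Dict[int, Dict[str, Optional[dict]]] = field(default_factory=dict)

    def _commit(self, action: str, batch: Dict[str, Optional[dict]]) -> int:
        # Assumption: instants are simple increasing integers.
        instant = len(self.timeline) + 1
        self.timeline.append((instant, action))
        self.changes[instant] = batch
        return instant

    def upsert(self, records: Dict[str, dict]) -> int:
        """Insert new keys or overwrite existing ones; returns the commit instant."""
        return self._commit("upsert", dict(records))

    def delete(self, keys: Iterable[str]) -> int:
        """Record a delete marker for each key; returns the commit instant."""
        return self._commit("delete", {key: None for key in keys})

    def snapshot(self) -> Dict[str, dict]:
        """Replay the timeline in order to compute the latest table state."""
        state: Dict[str, dict] = {}
        for instant, _action in self.timeline:
            for key, value in self.changes[instant].items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state

    def incremental_pull(self, after_instant: int) -> Dict[str, Optional[dict]]:
        """Return only the keys changed after the given instant (None = deleted)."""
        latest: Dict[str, Optional[dict]] = {}
        for instant, _action in self.timeline:
            if instant > after_instant:
                latest.update(self.changes[instant])
        return latest


table = ToyTable()
c1 = table.upsert({"u1": {"city": "SF"}, "u2": {"city": "NYC"}})
c2 = table.upsert({"u1": {"city": "LA"}})  # update an existing key
c3 = table.delete(["u2"])                  # e.g. a compliance deletion

print(table.timeline)              # the audit log of actions
print(table.snapshot())            # latest state after replaying the log
print(table.incremental_pull(c1))  # only what changed after the first commit
```

The point of the sketch is the design choice rather than the implementation: because every write is an entry in an ordered log, a downstream consumer can remember the last instant it processed and ask only for later changes, instead of rescanning the whole table.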

Topics Covered

Table Formats
