Introducing the Apache Hudi Table Format, Purpose-Built for Low-Latency Data Lake Use Cases
Data lakes are one of the fastest-growing trends in big data management, offering massively scalable processing over vast amounts of data. As businesses have evolved, they have come to demand more from their data lakes: change data capture (CDC) at low latency to serve business needs, data deletion to meet compliance requirements while new data continues to be ingested, and lower costs for storing and accessing data (and metadata) while scaling the data infrastructure for business continuity.
Apache Hudi is a data platform technology that provides several of the capabilities needed to build and manage your own data lake. To give users more flexibility, we recently introduced a set of low-level APIs for programming directly against our table format. In this session, we will describe the Apache Hudi table format, which is designed to improve on the canonical table layouts popularly used to build modern data lakes. We will discuss the data and metadata layout of Hudi tables that realizes primitives such as upserts, deletes, and incremental pulls. We will go over ways to access the Hudi timeline (a sequential audit log of actions performed on the table) to assist in monitoring and managing pipelines and tables. We will dive into Hudi’s concurrency models and show how the table format also supports lock-free concurrent writing from multiple applications.
As the data lake ecosystem evolves, table services are becoming an integral part of an efficient data lake architecture. These include cleaning and compaction for efficient storage management of data and metadata, and clustering, which intelligently and dynamically reorganizes data for better storage management and faster query times. We will talk about how compaction policies and scheduling, as well as dynamic data clustering, can be used out of the box or replaced with pluggable implementations to get the best out of your Hudi tables. Similarly, users who already run a lock service across their organization’s data infrastructure can plug it in for multi-writer support. To conclude, we will discuss our ongoing efforts to add column indexes to the table format to reduce read latency for queries on commonly predicated columns.
Sivabalan Narayanan, Lead for Network Infrastructure at Uber, has an extensive background in large-scale distributed systems. He worked at LinkedIn as a data infrastructure engineer with a focus on blob storage and is now an Apache Hudi PMC member.
Nishith leads the Data Infra team at Uber, where he manages the storage and compute platforms. He is a PMC member of Apache Hudi and has 10+ years of experience in distributed systems and databases.