Hiveberg: Integrating Apache Iceberg with the Hive Metastore
Apache Iceberg is an open table format for huge (petabyte-scale) datasets. This talk will give an overview of Iceberg and its many attractive features, such as time travel, improved performance, snapshot isolation, schema evolution and partition spec evolution. We’ll then discuss how Iceberg can be used inside an organisation such as Expedia Group to power next-generation data lake technology. One of the challenges of moving to a new table format, for an organisation that already has a significant investment in existing technologies (in our case Hive and, specifically, the Hive Metastore), is preventing data silos from forming, where data generated in the new format can’t be used by those who haven’t switched to it yet. We’ll discuss the solution we came up with, Hiveberg, which opens up a path to read Iceberg tables from Hive (and thus from any tooling that supports Hive). This allows more advanced users to take advantage of Iceberg’s features when creating data, while still allowing that data to be widely read and used by others.
Adrian Woodhead is a principal engineer at Expedia Group in London, working with teams focused on open source and on the platform that powers the company’s big data processing systems.
Christine Mathiesen is a software development intern at Expedia Group in London, focusing on next-generation data lake technologies.