Note: You can find this and many other great Apache Iceberg instructional videos and articles in our Apache Iceberg 101 article.
Cloud object storage services like AWS S3, Azure Blob Storage, and Google Cloud Storage have been transformational in the world of data storage. By drastically lowering the cost and complexity of storing large amounts of data, they have led many organizations to migrate away from Hadoop clusters.
Cloud object storage has its own architectural considerations as the scale of your data analytics grows, and Apache Iceberg is the data lakehouse table format best suited to ensuring high performance at any scale on cloud object storage.
The Prior Hadoop and Hive Paradigm
When it came to large-scale data lake storage, large Hadoop clusters were the standard: huge networks of low-cost hardware acting as one large-scale distributed hard drive through the Hadoop Distributed File System (HDFS). The Hive table format was originally created to help recognize groups of folders stored on HDFS as “tables,” facilitating better data analytics workloads.
The Hive table format relied heavily on the physical layout of files in the file system to know which files belonged to a table. A directory would represent a table and any sub-directories would represent partitions, like so (an illustrative layout; the paths below are hypothetical):
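/warehouse/db/my_table                         <- directory representing the table
/warehouse/db/my_table/month=2023-01           <- sub-directory representing a partition
/warehouse/db/my_table/month=2023-01/file1.parquet
/warehouse/db/my_table/month=2023-02
/warehouse/db/my_table/month=2023-02/file2.parquet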
This all works fine from the HDFS paradigm, but a particular bottleneck arises when we structure our data files in this way on object storage.
In object storage, there aren’t traditional directories, only “prefixes” (a string before the file name, like /this/is/the/prefix/file.txt). Many cloud storage services, including S3, are architected so that there are limits on how many requests per second can be made to files sharing the same prefix (the equivalent of the same directory). Amazon S3, for example, supports roughly 3,500 write and 5,500 read requests per second per prefix.
This becomes a problem if you depend on the Hive-style file layout: a large table partition can end up with so many files under one prefix that requests are throttled, slowing data access.
Table formats like Delta Lake and Apache Hudi rely heavily on the Hive file layout, exposing themselves to these issues when used on cloud object storage.
The Apache Iceberg Solution
Apache Iceberg abandoned this layout-heavy approach and instead relies on a list of files within its metadata structure. This means the actual data files don’t need to live in any particular location, as long as the manifest files record the right location for each file.
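As a quick illustration, assuming a Spark SQL session with an Iceberg catalog configured (the table name here is hypothetical), you can see the absolute paths the metadata tracks by querying the table’s files metadata table:

-- Each row is a data file referenced by the current table snapshot;
-- file_path is an absolute location, so files may live under any prefix.
SELECT file_path, record_count
FROM this_catalog.db.my_table.files;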
When creating an Apache Iceberg table, you can set the following table properties so that the files it writes are spread across multiple prefixes, optimized for cloud object storage:
CREATE TABLE this_catalog.db.my_table (
  id bigint,
  name string
)
USING iceberg
PARTITIONED BY (bucket(8, id))
TBLPROPERTIES (
  'write.object-storage.enabled' = 'true',
  'write.data.path' = 's3://the-bucket-with-my-data'
);
With 'write.object-storage.enabled' set, Iceberg generates a hash for each data file it writes and injects that hash into the file’s path.
So let’s say it generates a hash of 7be596a8. The resulting path for a data file would look something like this (the database, table, partition, and file names below are illustrative, and the exact structure varies across Iceberg versions):
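s3://the-bucket-with-my-data/7be596a8/db/my_table/id_bucket=0/00000-0-data-file.parquet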
This spreads the files within a partition across many prefixes, avoiding throttling when accessing large numbers of files in a big table or partition.
Conclusion
Cloud object storage has been a great innovation in making data lakehouses more accessible. When paired with Apache Iceberg, you get the full benefits of cloud storage and a table format without the risk of hitting prefix-throttling bottlenecks at scale.