Dremio Blog

10 minute read · March 12, 2026

Beyond Parquet: The Apache Iceberg File Format API and the AI Era

Will Martin, Technical Evangelist

Key Takeaways

  • The Apache Iceberg community introduced a new File Format API to decouple object models from physical storage, aiming for engine-agnostic file formats.
  • Traditional formats struggle with modern AI workloads, causing bottlenecks that hinder performance.
  • New AI-native file formats like Lance, Nimble, and Vortex emerge to address these limitations by optimizing for specific data access patterns.
  • The File Format API facilitates modular integrations and a standardized approach for adding new formats, enhancing innovation within Apache Iceberg.
  • This strategic shift prepares Apache Iceberg for the future of AI, allowing flexible adaptations to new formats without requiring extensive infrastructure changes.

The Apache Iceberg community recently finalised a new File Format API, scheduled for the upcoming 1.11.0 release. It is a strategic architectural shift that decouples the object model from the physical storage layout. The aim? To make file formats engine-agnostic, so Apache Iceberg can integrate with new formats without rewriting the core engine logic every time.

This post explores why traditional file formats are hitting a wall in the AI era and how this new API prepares the data lakehouse for an AI-native future.


Why Traditional Formats are Hitting a Wall

Columnar formats, like Parquet and ORC, were designed for high compression and predictable batch performance. However, modern AI workloads, such as retrieval-augmented generation (RAG) and model training, require data access patterns that these formats were never meant to handle.

Several technical limitations have become critical bottlenecks as hardware has evolved toward high-bandwidth cloud object stores and massive GPU clusters:

  • Metadata Overhead in High-Feature Tables: Modern training datasets often contain tens of thousands of features. In Parquet, accessing a single column out of 10,000 requires parsing the metadata for every column in the file. This creates a massive I/O and CPU bottleneck before a single byte of actual data is read.
  • Inefficiency of 1MB Page Scans: To read one specific record, a Parquet engine typically must scan an entire 1MB page. That adds significant latency for AI applications, which can require sub-millisecond random access to specific embeddings.
  • GPU Parallelisation and Variable-Length Encodings: Parquet uses complex, variable-length encodings to maximise storage density. While efficient for disk space, these encodings are difficult to parallelise on the thousands of cores available in a GPU. This often leads to "GPU starvation," where the processor sits idle waiting for the CPU to decode data into a usable format.
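The page-scan point is easy to see with back-of-envelope arithmetic. The sketch below uses illustrative numbers (a 512-byte embedding and a 1MB data page, a common default in parquet-mr); it is not a benchmark, just the read-amplification maths.

```python
# Back-of-envelope sketch (illustrative numbers, not a benchmark):
# fetching one 512-byte embedding when the smallest readable unit is a
# ~1 MB page, versus an O(1) offset read of just the record.
record_bytes = 512            # one small embedding vector
page_bytes = 1_048_576        # typical Parquet data page size

amplification = page_bytes // record_bytes
print(f"Read amplification: {amplification}x")  # → 2048x
```

Every point lookup pays that multiplier in wasted I/O, which is why formats built around O(1) record access look so different at the byte level.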

Hardware has transitioned from high-latency spinning disks to high-bandwidth NVMe and GPU-accelerated compute. The file formats we use must now adapt to keep these resources saturated.

The New Contenders: AI-Native File Formats

To address these technical limitations, a new generation of file formats has emerged. Rather than pursuing a "one size fits all" strategy, these formats focus on specific performance profiles required by AI and high-speed analytics.

  • Lance: Designed for high-dimensional, multimodal data like embeddings and images. Created by contributors to Pandas and HDFS, its core philosophy is O(1) random access. In real-world benchmarks, Lance has proved to be approximately 2,000x faster than Parquet for specific random access workloads, such as retrieving 100 million small records in milliseconds.
  • Nimble: Focuses on the "data loading" bottleneck in machine learning training. When training at scale, the primary cost is often the speed at which data can be decoded to feed the GPU. By using simplified encodings and memory layouts, Nimble reduces CPU overhead during data ingestion. Early reports indicate Nimble can deliver 2x–3x speedups in decoding compared to Parquet, effectively increasing training throughput by reducing I/O wait times.
  • Vortex: Intended as a general-purpose successor to Parquet. Vortex uses cascading compression, allowing the engine to run filter expressions directly on compressed data without full decompression. In TPC-H SF=100 benchmarks, Vortex demonstrated speeds 18% to 35% faster than Parquet while maintaining a comparable storage footprint.

The Innovation Bottleneck: Why Iceberg Needed an API

Before this update, adding support for a new format like Lance or Vortex to Apache Iceberg was a massive, multi-engine undertaking. The community faced several problems that throttled innovation:

  1. Fragmented Logic: Every engine integration maintained its own custom readers and writers. Adding one format required modifying multiple independent modules.
  2. Uneven Feature Support: Because logic was duplicated across engines, features were inconsistent. A delete file might work perfectly for Parquet in Spark but remain unsupported for ORC in Flink because that specific integration path had not been written yet.
  3. Innovation Friction: There was no standardised contract for what a format implementation must provide. This made it extremely difficult for contributors to add new formats without a deep, cross-cutting effort across the entire Iceberg project.

How It Works: The Pluggable Architecture

The new File Format API solves these issues by establishing a unified, modular interface. Iceberg no longer needs to manage the internal binary structure of every file it tracks. Instead, it interacts with formats through a standardised set of builders and metadata structures. This decoupling means that the storage layer can evolve independently of the table format and the query engine, ending the need for format-specific code.

The architecture centers on three primary components:

  • Read and Write Builders: Interfaces that let engines configure read and write operations. The builders handle the format-specific complexity, while the engine simply consumes the resulting data stream.
  • FormatModel: A format implementation that provides a file format identifier, format-specific configuration/capabilities, and the Read and Write Builders. 
  • FormatModelRegistry: A central directory of the available FormatModels. New format implementations register here and become available to every engine that uses the API.
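The relationship between these three components can be sketched in a few lines. This is a hypothetical illustration only: the real Iceberg interfaces are Java, and their exact names and signatures may differ from what is shown here.

```python
# Hypothetical sketch of the pluggable-format pattern behind the new API.
# The actual Iceberg (Java) interfaces may differ; this only illustrates
# how a registry of format models decouples engines from file formats.

class ReadBuilder:
    """Configures a read, then yields rows; the engine never sees format internals."""
    def __init__(self, path, decode):
        self.path = path
        self._decode = decode

    def build(self):
        return self._decode(self.path)


class FormatModel:
    def __init__(self, name, decode):
        self.name = name              # file format identifier, e.g. "vortex"
        self._decode = decode         # format-specific decoding logic

    def read_builder(self, path):
        return ReadBuilder(path, self._decode)


class FormatModelRegistry:
    _models = {}

    @classmethod
    def register(cls, model):
        cls._models[model.name] = model

    @classmethod
    def lookup(cls, name):
        return cls._models[name]


# A new format plugs in by registering one model; no engine code changes.
FormatModelRegistry.register(
    FormatModel("vortex", decode=lambda path: [f"row from {path}"])
)

model = FormatModelRegistry.lookup("vortex")
rows = model.read_builder("s3://bucket/data.vortex").build()
print(rows)  # → ['row from s3://bucket/data.vortex']
```

The key property is that the engine only ever touches the registry and the builders; everything format-specific lives behind the FormatModel.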

For an in-depth discussion of the technical architecture, I would recommend this talk from Péter Váry, a PMC member of Apache Iceberg.

Architecting Beyond Parquet

The API enables advanced storage capabilities that go beyond simply making it easier to support AI-native file formats. Here are two major capabilities this new API will enable for Apache Iceberg. 

Direct Conversion vs. Intermediate Formats

Many systems use an intermediate format like Apache Arrow to ensure compatibility between file formats and engines. While this simplifies the integration, it creates a performance bottleneck for every single record read or written. For high-performance workloads, the overhead of converting to an intermediate format and then to the engine’s native memory format is too high. This new Iceberg API allows direct conversion from the starting file format to the engine’s memory, preserving the low-latency benefits of AI-native formats like Vortex.
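A toy example makes the cost of the extra hop concrete. The sketch below is not real engine code; it just counts per-record conversion steps on the two paths described above.

```python
# Toy illustration (not real engine code): counting per-record conversion
# steps on the intermediate-format path versus a direct decode path.
records = [b"a", b"b", b"c"]
conversions = 0

def decode_to_intermediate(raw):
    global conversions
    conversions += 1
    return raw.decode()          # file bytes -> intermediate representation

def intermediate_to_engine(value):
    global conversions
    conversions += 1
    return value.upper()         # intermediate -> engine memory format

def decode_direct_to_engine(raw):
    global conversions
    conversions += 1
    return raw.decode().upper()  # file bytes -> engine memory format, one hop

via_intermediate = [intermediate_to_engine(decode_to_intermediate(r)) for r in records]
hops_via_intermediate = conversions          # 2 steps per record → 6

conversions = 0
direct = [decode_direct_to_engine(r) for r in records]
hops_direct = conversions                    # 1 step per record → 3

print(hops_via_intermediate, hops_direct)    # → 6 3
```

Both paths produce identical engine rows; the direct path simply halves the per-record work, and that saving compounds across billions of records.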

Column Families

In a traditional wide table, updating a single column requires rewriting the entire data file. With Column Families, groups of columns can be vertically split and stored in separate files. This enables you to update or replace a specific column family, e.g. a set of refreshed embeddings, without touching the rest of the table. This leads to higher write parallelism and more efficient selective reads.
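The write savings are straightforward to estimate. The numbers below are illustrative assumptions (a 10 GiB wide-table data file with 512 MiB of embeddings), not measurements.

```python
# Back-of-envelope sketch (illustrative sizes): refreshing the embedding
# columns in a wide table, with and without column families.
GIB = 1024 ** 3
whole_file_bytes = 10 * GIB                 # one monolithic wide-table data file
embedding_family_bytes = 512 * 1024 ** 2    # embeddings stored as their own family

# Traditional layout: the whole file is rewritten to change one column group.
# Column families: only the embedding family's file is rewritten.
savings = whole_file_bytes // embedding_family_bytes
print(f"{savings}x less data rewritten")    # → 20x less data rewritten
```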

Summary

The File Format API is a strategic move to ensure Apache Iceberg remains the foundation for the AI era. By making the storage layer pluggable, Iceberg avoids the risk of technical obsolescence if a new format eventually replaces Parquet as the industry standard.

Whether the future belongs to Lance, Vortex, or a format yet to be invented, Iceberg’s architecture is now prepared to accommodate it. This flexibility allows organisations to build data lakes that adapt to changing hardware and shifting AI requirements without a total infrastructure rebuild. 

For more information on the current state of and planned next steps for the API, I recommend visiting the official website.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.