October 20, 2025
Exploring the Evolving File Format Landscape in the AI Era: Parquet, Lance, Nimble, and Vortex, and What It Means for Apache Iceberg
File formats rarely get the spotlight. They sit under layers of query engines, orchestration tools, and machine learning frameworks, quietly doing the heavy lifting. Yet, the way we store and access data has a direct impact on everything from query latency to model accuracy. And right now, the file format space is undergoing one of its biggest shifts in years.
For over a decade, formats like Parquet, Avro, and ORC served as the backbone of analytical processing. They brought predictable performance to batch queries and were tightly integrated into the Hadoop ecosystem that many data architectures grew up on.
But things are different now. AI and machine learning workloads demand low-latency access to high-dimensional data. Vector embeddings, multimodal content, and fast-changing schemas are no longer edge cases; they’re core use cases. And modern infrastructure, from NVMe drives to cloud object stores, offers new performance tradeoffs that older formats weren’t designed to take advantage of.
The result? A growing ecosystem of purpose-built file formats designed to address today’s workloads. Some focus on sub-second random access. Others prioritize GPU-friendly layouts or native support for concurrent writes. This shift has major implications, not just for engines and pipelines, but for the table formats that sit one level above.
That’s where Apache Iceberg enters the picture. As the open table format adopted by a growing number of platforms, Iceberg supports multiple file formats by design. But keeping those formats in sync, and enabling support for new ones, hasn’t been easy. A new proposal aims to change that, creating a more modular and extensible way for Iceberg to support the next generation of file formats.
Before we get into the details of that proposal, let’s explore why this evolution is happening in the first place, and why now.
What’s Changing in File Formats
A few years ago, choosing a file format was straightforward. If you needed high-performance analytics, you used Parquet. If you wanted row-level storage or fast serialization, you picked Avro. ORC had its place too, especially in Hive-centric environments. Each format had a clear set of trade-offs, and those trade-offs aligned with batch processing on large datasets.
Today, the landscape is more complicated, and more interesting.
The workloads themselves have changed. AI and machine learning aren’t just occasional extensions of analytics, they’re driving many platform decisions. Training tables can include tens of thousands of features. Vector embeddings, often 768 to 4096 dimensions, are now a standard part of recommendation engines, retrieval-augmented generation (RAG) systems, and LLM pipelines.
These use cases expose the limitations of traditional formats:
- Reading a single record in Parquet often means scanning a full 1MB page.
- Accessing one column out of 10,000 still requires parsing metadata for all of them.
- Updates and deletes remain cumbersome, often relying on higher-level abstractions to patch over the lack of native support.
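To make the first of those limitations concrete, here’s a minimal sketch using PyArrow (an assumption on my part; the point holds for any Parquet reader). Fetching one record means locating its row group and then decoding that whole row group, even though only a single row is wanted:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small sample file so the sketch is self-contained.
num_rows = 100_000
table = pa.table({"id": list(range(num_rows)), "payload": ["x" * 100] * num_rows})
pq.write_table(table, "sample.parquet")

pf = pq.ParquetFile("sample.parquet")
target = 12_345  # the one record we actually want

# Step 1: find the row group that contains the target row.
row_group_index, rows_before = 0, 0
for i in range(pf.num_row_groups):
    rows_in_group = pf.metadata.row_group(i).num_rows
    if rows_before + rows_in_group > target:
        row_group_index = i
        break
    rows_before += rows_in_group

# Step 2: decode the entire row group just to slice out one row.
row_group = pf.read_row_group(row_group_index)
record = row_group.slice(target - rows_before, 1)
print(record.to_pydict())
```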
Meanwhile, modern infrastructure has evolved. Cloud object stores like S3 are high-bandwidth but high-latency. NVMe SSDs deliver 850,000+ IOPS, enabling access patterns that weren’t feasible with spinning disks. GPUs now play a growing role in data processing, but variable-length encodings like those in Parquet don’t parallelize well on thousands of GPU cores.
All of this has opened the door for new formats with more focused goals:
- Lance, designed by contributors to Pandas and Apache HDFS, offers blazing-fast random access and native support for vector search.
- Vortex and other experimental formats aim to improve on storage layout, update mechanics, and real-time read performance.
- Nimble targets fast, GPU-friendly decoding for ML training.
None of these are general-purpose replacements, yet. But they reflect a clear trend: file format innovation is accelerating, and format choice is becoming more workload-specific.
Lance: Fast Access for Vector-Driven Workloads
If Parquet was designed for the world of batch SQL, Lance is built for the age of embeddings, LLMs, and vector search.
Created by Chang She (co-creator of Pandas) and Lei Xu (contributor to Apache HDFS), the Lance format was introduced in 2022 with one clear goal: to make high-dimensional, multimodal data easier and faster to query. This includes use cases where traditional formats hit a wall, like retrieving nearest neighbors in a vector store or combining embeddings with structured data like product metadata, clickstreams, or images.
Why Lance Exists
In vector databases and RAG pipelines, random access is everything. After running an approximate nearest neighbor (ANN) search, a system might return a list of record IDs. To be useful, those IDs need to be dereferenced, quickly, so the system can display a document, show a product, or stream a video.
This is where Parquet slows things down. It’s optimized for scanning large batches, not plucking individual rows off disk at speed. Lance, on the other hand, has been benchmarked at roughly 2000x faster random access than Parquet on some real-world workloads, retrieving individual records from a dataset of 100 million 1KB strings in milliseconds instead of seconds.
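As a rough illustration of that lookup pattern, here’s a sketch against the pylance Python package. Treat the calls as indicative rather than definitive; the exact API surface is an assumption and may differ between versions:

```python
import lance
import pyarrow as pa

# A small dataset of "documents" (id plus text payload).
table = pa.table({
    "doc_id": list(range(10_000)),
    "body": [f"document body {i}" for i in range(10_000)],
})
lance.write_dataset(table, "docs.lance", mode="overwrite")

dataset = lance.dataset("docs.lance")

# Pretend an ANN search returned these row positions; dereference them directly.
hit_rows = [17, 4242, 9876]
results = dataset.take(hit_rows)  # point lookups, no large scan
print(results.to_pydict()["doc_id"])
```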
How Lance Achieves It
Lance gets this performance through several design choices:
- Adaptive Encoding: It chooses the right storage strategy based on data size. Large values (like images or embeddings) use "full zip encoding" for tight packing and O(1) row access.
- Miniblock Layouts: Smaller values are grouped into small, metadata-light chunks (4–8KB), reducing the cost of jumping between blocks.
- No Row Groups: One of Parquet’s biggest trade-offs is the row group. Lance removes it entirely, letting each column define its own layout and access strategy.
- Built-in Vector Indexing: Lance supports common ANN algorithms like IVF_PQ and HNSW natively—no need for external indexing systems.
- Git-Like Versioning: Using manifest files, Lance supports multi-version concurrency control (MVCC) with snapshot-based updates and zero-copy writes.
These design elements make Lance especially attractive for AI-native systems where data is complex, high-dimensional, and constantly queried by smart agents or search algorithms.
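And to show what “built-in vector indexing” looks like in practice, here’s a second hedged sketch with pylance. The parameter names (index_type, num_partitions, num_sub_vectors, the nearest query) follow the documented API as I understand it, but verify them against the current Lance docs before relying on them:

```python
import lance
import numpy as np
import pyarrow as pa

dim, n = 128, 50_000
vectors = np.random.rand(n, dim).astype("float32")

# Embeddings live in a fixed-size list column next to ordinary fields.
table = pa.table({
    "item_id": pa.array(range(n)),
    "vector": pa.FixedSizeListArray.from_arrays(pa.array(vectors.ravel()), dim),
})
lance.write_dataset(table, "items.lance", mode="overwrite")
ds = lance.dataset("items.lance")

# Build an IVF_PQ index on the vector column; no external index service needed.
ds.create_index("vector", index_type="IVF_PQ", num_partitions=64, num_sub_vectors=16)

# Approximate nearest-neighbor search for the top 5 matches.
query = np.random.rand(dim).astype("float32")
hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 5})
print(hits.to_pydict()["item_id"])
```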
But with all this performance comes a tradeoff: Lance isn’t optimized for every use case. It’s not widely supported in traditional BI tools, and its ecosystem is still growing. That’s where table formats like Apache Iceberg come in, helping standardize access while offering room for innovation at the file layer.
Next, let’s take a look at Nimble, another modern format focused on ML training efficiency.
Nimble: Speeding Up ML Training with Simpler Decoding
While Lance focuses on high-performance random access for AI applications like search and retrieval, Nimble targets a different challenge: fast decoding for massive ML training datasets.
Many companies today train models on wide tables with tens of thousands of columns. These tables often include dense numeric features, sparse binary indicators, and embedding references, all stored in columnar formats like Parquet. But as models scale and refresh more frequently, the bottleneck isn’t always the model architecture, it’s how quickly the training data can be read and decoded.
That’s the problem Nimble tries to solve.
What's the Problem with Parquet in ML Workloads?
Parquet is designed to be compact and efficient for scanning. It uses compression codecs and encoding schemes like dictionary encoding and run-length encoding to reduce file size and scan cost. That’s great for BI-style queries, but decoding this structure can create a performance hit during model training, especially when features are read in mini-batches or streamed across GPUs.
This decoding overhead shows up in practical ML workloads as:
- Longer training startup times.
- High CPU usage for data loading.
- I/O bottlenecks when processing extremely wide rows.
These factors add latency and cost, especially at scale.
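To see where that cost shows up, here’s a small sketch (PyArrow again, with a hypothetical wide training table of dense float features) that streams the file batch by batch, the way a data loader would, and times the decode step:

```python
import time
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# A hypothetical wide training table: 2,000 dense float32 features, 10,000 rows.
num_rows, num_features = 10_000, 2_000
data = {f"f{i}": np.random.rand(num_rows).astype("float32") for i in range(num_features)}
pq.write_table(pa.table(data), "train.parquet")

# Stream the file the way a training loop consumes it: one mini-batch at a time.
pf = pq.ParquetFile("train.parquet")
start = time.perf_counter()
batches = 0
for batch in pf.iter_batches(batch_size=1_024):
    _ = batch.to_pandas()  # decode + materialize, roughly what a data loader does
    batches += 1

print(f"decoded {batches} batches in {time.perf_counter() - start:.2f}s")
```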
How Nimble Optimizes for ML
Nimble flips the tradeoff. It prioritizes faster read and decode speeds over maximum compression. While details about the format are still emerging, here’s what distinguishes Nimble from traditional formats:
- Simplified Encodings: Nimble minimizes the use of variable-length encodings that slow down decompression, especially in parallel environments like GPUs or multithreaded CPUs.
- Predictable Memory Layouts: Data is stored in a way that aligns with the memory access patterns used in ML frameworks like PyTorch and TensorFlow.
- Batch-Friendly Reads: Instead of optimizing for full-column scans, Nimble aligns data layout with batch sampling patterns typical in model training pipelines.
Early reports suggest that Nimble can deliver 2x–3x decoding speedups compared to Parquet for training scenarios. That’s not just a quality-of-life improvement, it’s a tangible boost in training throughput, especially when running large experiments or frequent model refreshes.
Where It Fits
Nimble is best suited for environments where:
- You control the full ML pipeline end-to-end.
- Training speed is a priority over storage footprint.
- You’re working with feature stores or training tables that change frequently.
Unlike Lance, Nimble isn’t trying to serve low-latency lookups or interactive queries. Instead, it’s focused on reducing the friction between data lake storage and ML frameworks, a valuable optimization in a world where AI is everywhere.
Vortex: Real-Time Access for the Next Generation of Analytics
Where Lance zeroes in on vector workloads and Nimble accelerates ML training, Vortex steps into a different frontier: making traditional tabular data faster and more responsive for real-time use cases.
Vortex is still early in its development, but it reflects a growing demand for file formats that bridge the gap between batch-optimized storage and the responsiveness required by modern applications, especially those powered by agentic AI, streaming pipelines, or real-time dashboards.
What Problem Is Vortex Trying to Solve?
In most analytical systems, “real-time” access is a bit of a misnomer. Even fast query engines often operate on immutable, append-only files. To support updates, deletions, or fresh data ingestion, many platforms lean on table formats (like Apache Iceberg) or external layers to patch over file-level limitations.
The challenge is this: batch-optimized formats like Parquet weren’t built for low-latency access patterns or frequent writes. Their structure assumes large scans, complex query planning, and the ability to cache and optimize over time. That’s great for historical analytics. It’s less ideal when users, or AI agents, expect immediate feedback and up-to-the-minute context.
Vortex aims to make tabular data more dynamic at the file format level.
How Vortex Approaches the Problem
While technical documentation on Vortex is still emerging, early indications suggest it focuses on:
- Low-latency access to individual rows or ranges without scanning large blocks.
- Efficient support for updates and deletes, potentially through index-aware file layouts.
- Schema agility, enabling frequent changes without long reload cycles or coordination overhead.
- Lightweight metadata, making files easier to manage, load, and reason about in dynamic systems.
Vortex is likely exploring ways to reduce coordination costs between engines and file metadata, either by flattening the file structure or building in features that are usually handled at the table format level (e.g., indexing, record-level versioning).
Where It Could Fit
Vortex is being positioned as a format that could complement workloads such as:
- Real-time analytics on fast-changing datasets.
- Operational intelligence platforms where data freshness matters more than scan throughput.
- Event-driven systems where micro-batches or streaming data need to be queryable on arrival.
While it’s still too early to predict widespread adoption, Vortex reflects a larger trend: a desire to rethink tabular data storage for systems that aren’t just read-heavy and static.
As more of the stack becomes interactive, intelligent, and incremental, the file format underneath needs to evolve too.
With new formats like Lance, Nimble, and Vortex gaining traction, the need for interoperability, consistency, and governance becomes more urgent. That’s where Apache Iceberg’s new File Format API proposal comes into play.
Let’s explore what it means, and how it helps the ecosystem keep pace.
How Apache Iceberg Is Preparing for What’s Next
The rise of new file formats is exciting, but it also introduces real complexity.
For teams building on Apache Iceberg, this complexity shows up when trying to adopt or extend file format support. Iceberg is a table format, not a file format, which means it abstracts storage layout from query behavior. That abstraction is powerful: it allows you to time-travel, update schemas safely, and manage deletes, regardless of whether your data is stored in Parquet, Avro, or ORC.
But here’s the catch: every new feature in Iceberg (like default column values or new types in the v3 spec) needs to be implemented individually for each supported file format. This slows progress and creates uneven support: some features might work in Parquet but not in Avro or ORC, depending on what’s been prioritized.
As the community looks ahead to formats like Lance, Nimble, and Vortex, the challenge becomes clear:
How do we enable Iceberg to support a growing number of file formats, consistently, safely, and without duplicating logic?
That’s what the File Format API proposal is designed to solve.
What the File Format API Proposal Introduces
The proposal, currently under review by the Apache Iceberg community, lays out a unified, pluggable way for file formats to integrate with Iceberg. Instead of hardcoding support in multiple places, it defines a clean interface that new and existing formats can implement.
At a high level, the new API introduces:
- ReadBuilder and WriteBuilder interfaces. These define how a file format reads data into an engine’s object model, and how it writes data back out. Every file format registers its own logic using these builders.
- A central Format Model Registry. This registry acts as the directory for all file formats and their implementations. Engines can look up formats by name and access the correct logic for reading or writing.
- Specialized write contexts. Formats can expose specific builder types for:
  - Data files
  - Equality delete files
  - Positional delete files

This allows engines to reuse the same core logic while plugging in any file format that implements the spec. In other words, Iceberg no longer needs to know about the internals of every format, just that it follows the contract.
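The proposal itself defines Java interfaces inside Iceberg, but the shape of the idea is easy to sketch. The Python below is purely illustrative, with class and method names invented for this post rather than taken from the proposal; it only shows the pattern of formats registering read and write builders that the engine looks up by name:

```python
# Illustrative only: invented names, not the actual Iceberg File Format API.

class FormatModelRegistry:
    """Maps a format name to the builders that know how to read and write it."""
    _formats = {}

    @classmethod
    def register(cls, name, read_builder, write_builder):
        cls._formats[name] = (read_builder, write_builder)

    @classmethod
    def read_builder(cls, name):
        return cls._formats[name][0]

    @classmethod
    def write_builder(cls, name):
        return cls._formats[name][1]


class ParquetReadBuilder:
    def build(self, file_path, projected_schema):
        # Would return a reader that yields rows in the engine's object model.
        ...


class ParquetWriteBuilder:
    def build(self, file_path, write_schema):
        # Would return a writer for data files; delete-file variants look similar.
        ...


# Each format registers itself once; engine code never special-cases it again.
FormatModelRegistry.register("parquet", ParquetReadBuilder(), ParquetWriteBuilder())
reader = FormatModelRegistry.read_builder("parquet").build("part-0001.parquet", projected_schema=None)
```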
Why This Matters
For the community, this is a meaningful shift:
- It removes duplication. No more writing feature support multiple times across file formats.
- It improves consistency. Features like default values, delete handling, or schema evolution can be tested and validated across formats using a shared test compatibility kit (TCK).
- It unlocks extensibility. Want to add Lance or Vortex? You don’t need to modify core Iceberg logic, just implement the interface.
This doesn’t mean every format is automatically supported. But it does mean the door is open, and the path is clearer than it’s ever been.
The File Format API proposal sets the foundation for a more modular, more future-ready Iceberg—one that can adapt to a growing, fragmented, and fast-moving file format landscape.
Where This Is All Going
The file format layer used to be an afterthought, something picked once, rarely revisited. But that’s no longer the case. As data becomes more complex and more real-time, the choice of file format can directly impact model latency, system cost, and user experience.
Formats like Lance, Nimble, and Vortex show that innovation is accelerating. Each one introduces new ideas about how to structure, index, and access data. But none of them exist in isolation. They need to integrate with engines, catalogs, and governance layers to be truly useful.
That’s why table formats like Apache Iceberg have become essential. They bring consistency to an ecosystem that’s growing more diverse, providing a foundation where new file formats can thrive without breaking everything upstream.
The File Format API proposal is a step toward that future. It doesn’t try to predict the “next Parquet.” Instead, it makes space for many formats, each optimized for a specific need, all coexisting under a common metadata layer.
That flexibility is what modern data architectures require. AI workloads need fast vector access. Business teams need governed tables with version history. Engineers need to experiment without creating chaos. Supporting that range of needs means being able to plug in new capabilities without rebuilding the foundation.
In that sense, the File Format API is more than a cleanup or a refactor, it’s a strategic move to future-proof Apache Iceberg for what’s coming next.
As more engines adopt Iceberg, and as more formats push the limits of what’s possible, this kind of modularity will be key to keeping lakehouses open, scalable, and adaptable.