September 5, 2025

Looking back at the last year in Lakehouse OSS: Advances in Apache Arrow, Iceberg & Polaris (incubating)

Alex Merced · Head of DevRel, Dremio

Over the last twelve months, the open lakehouse ecosystem has taken a decisive step forward. Three projects in particular, Apache Arrow, Apache Iceberg, and Apache Polaris, have delivered milestone releases that make it easier for organizations to unify their data, accelerate queries, and govern access across clouds and engines. Arrow has continued to push the boundaries of performance with new compute kernels, broader data type support, and improvements in its query engine, DataFusion. Iceberg officially adopted its third specification, introducing deletion vectors, row-level lineage, and semi-structured types that reshape how mutable and complex data is managed. Polaris, meanwhile, reached its first production-ready release, delivering a standards-based catalog with governance, credential vending, and even experimental support for managing non-Iceberg tables. Taken together, these advancements demonstrate how quickly the open lakehouse is maturing into a coherent, interoperable standard that reduces vendor lock-in and expands the capabilities of data teams.

Apache Arrow – Faster, Broader, and More Modular

Apache Arrow has continued to cement its role as the in-memory foundation of the lakehouse. Over the last year, the project has shipped several major releases (18.0.0, 19.0.0, 20.0.0, and most recently 21.0.0), each introducing important enhancements for performance, interoperability, and ecosystem integration.

One of the most notable shifts came with the adoption of mimalloc as the default memory allocator, replacing the system allocator. This change, introduced in Arrow 18.0.0, enhances memory efficiency and consistency across platforms, particularly in workloads that heavily rely on columnar data structures. Alongside this, Arrow added canonical extension types, such as UUID, JSON, and Opaque, providing developers with a standardized way to represent complex data across languages and systems.
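
As a quick illustration, here is a minimal PyArrow sketch (assuming PyArrow 18.0.0 or later, where the pa.uuid() factory is exposed) that stores UUID values using the canonical extension type:

```python
import uuid
import pyarrow as pa

# Three random UUIDs as 16-byte values, the fixed-size binary storage
# layout that the canonical arrow.uuid extension type expects.
ids = [uuid.uuid4().bytes for _ in range(3)]
storage = pa.array(ids, type=pa.binary(16))

# Wrap the storage array in the UUID extension type.
uuid_col = pa.ExtensionArray.from_storage(pa.uuid(), storage)

table = pa.table({"id": uuid_col, "score": [0.1, 0.2, 0.3]})
print(table.schema)  # id carries extension<arrow.uuid> metadata
```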

Arrow’s compute engine has also grown significantly. Versions 20.0.0 and 21.0.0 introduced new kernels, including inverse_permutation, pivot_wider, winsorize, and rank_normal, which expand the capabilities of what can be done natively without additional libraries. The Acero execution engine achieved significant efficiency gains: hash joins are now both faster and safer, reducing memory usage while scaling more effectively for analytical workloads. Arrow also made strides in semi-structured and geospatial analytics by adding support for variant, geometry, and geography data types, ensuring a consistent pipeline from raw ingestion to advanced analytics.
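
These kernels are surfaced through the same compute API Arrow has always used. The sketch below calls the long-standing rank kernel to show the pattern; newer additions such as winsorize or rank_normal are invoked the same way once their bindings are available in your Arrow build:

```python
import pyarrow as pa
import pyarrow.compute as pc

values = pa.array([3.0, 1.0, 4.0, 1.0, 5.0])

# An established kernel, shown for the calling pattern; the kernels
# added in 20.0.0/21.0.0 follow the same function-call shape.
print(pc.rank(values, sort_keys="ascending"))
```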

On the ecosystem side, Arrow’s query engine, DataFusion, reached version 49.0.0 in July 2025. This release introduced practical optimizations, including dynamic filtering and Top-K pushdown, which reduce the amount of data scanned during queries. It also introduced asynchronous UDFs, opening the door to integrating external services such as machine learning models directly into query execution.
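
To see why Top-K pushdown matters, consider an ORDER BY ... LIMIT query through DataFusion’s Python bindings. The file path and column names below are placeholders:

```python
from datafusion import SessionContext

ctx = SessionContext()
# Register any local Parquet file; "events.parquet" is a placeholder.
ctx.register_parquet("events", "events.parquet")

# ORDER BY ... LIMIT is the Top-K shape that DataFusion 49 can serve
# with dynamic filters, skipping files and row groups that cannot
# contribute rows to the final top 10.
df = ctx.sql("SELECT user_id, ts FROM events ORDER BY ts DESC LIMIT 10")
print(df.to_arrow_table())
```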

Finally, Arrow Database Connectivity (ADBC) gained traction as a cross-language standard for database drivers. Recent releases expanded its coverage and stability, positioning ADBC alongside Flight SQL as part of the Arrow-native transport stack that lets tools interoperate without costly serialization.
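
In practice, ADBC exposes a familiar DB-API-style interface that returns results as Arrow data. Here is a minimal sketch using the SQLite driver, chosen only because it needs no running server; other ADBC drivers follow the same shape:

```python
import adbc_driver_sqlite.dbapi

# Connect to an in-memory SQLite database; results come back as
# Arrow tables rather than row-by-row Python objects.
with adbc_driver_sqlite.dbapi.connect() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT 1 AS answer")
        print(cur.fetch_arrow_table())
```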

Altogether, Arrow’s advancements reflect a broader push to simplify high-performance analytics by making memory management, query execution, and connectivity more unified and efficient across the lakehouse ecosystem.

Apache Iceberg – Spec v3 Turns the Corner on Mutability and Governance

If Arrow is the memory layer of the open lakehouse, Apache Iceberg is its table layer, and 2025 marked a defining year for the project. In May, the community formally adopted Table Specification v3, a milestone that introduces the most significant evolution of the format since its inception. This new spec reshapes how Iceberg handles mutability, change data capture, and schema flexibility, while laying the foundation for future performance gains.

At the core of v3 are binary deletion vectors, a more compact and efficient method for representing row-level deletes. Instead of rewriting files or relying on clunky position deletes, deletion vectors allow engines to skip irrelevant rows with minimal overhead, reducing both I/O and query latency. In addition, Iceberg now supports row-level lineage, where each row is assigned a unique ID and sequence number. This feature unlocks powerful new use cases such as incremental materialized views, fine-grained auditing, and time-travel queries that track not just what changed, but exactly when and how.
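
As a sketch of what this looks like in practice, assuming Spark 4.0 with the Iceberg runtime on the classpath and a catalog named demo already configured (all names here are illustrative):

```python
from pyspark.sql import SparkSession

# Sketch only: assumes the Iceberg runtime jar is on the classpath
# and a catalog named "demo" is configured in spark-defaults.
spark = SparkSession.builder.appName("iceberg-v3-demo").getOrCreate()

spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, payload STRING)
    USING iceberg
    TBLPROPERTIES (
        'format-version' = '3',            -- opt the table into spec v3
        'write.delete.mode' = 'merge-on-read'
    )
""")

# On a v3 table, this row-level delete is recorded as a compact binary
# deletion vector rather than by rewriting data files.
spark.sql("DELETE FROM demo.db.events WHERE id = 42")
```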

Spec v3 also broadens Iceberg’s type system. The introduction of a variant type enables efficient storage and querying of semi-structured data, such as JSON, without compromising schema enforcement. Meanwhile, the integration of geometry and geography types brings geospatial analytics into the Iceberg ecosystem, aligning with the broader industry trend of incorporating location intelligence into data platforms. To support smoother schema evolution, Iceberg now allows default column values and more flexible partition transforms, reducing friction when adapting tables to new business needs.
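
Continuing the sketch above (reusing the same spark session), the DDL below illustrates what spec v3 defines; engine support for VARIANT columns and column defaults is still rolling out, so treat the exact syntax as illustrative rather than guaranteed:

```python
# Illustrative DDL for the new v3 column capabilities.
spark.sql("""
    CREATE TABLE demo.db.clicks (
        id BIGINT,
        attrs VARIANT,                -- semi-structured payload (spec v3)
        region STRING DEFAULT 'us'    -- default column value (spec v3)
    )
    USING iceberg
    TBLPROPERTIES ('format-version' = '3')
""")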

From an implementation perspective, the project’s Java library advanced quickly to support v3 features. Recent releases have added compatibility with Spark 4.0 (dropping support for Spark 3.3) and Flink 2.0, while introducing a connector for BigQuery catalogs. These updates ensure that engines can leverage v3’s capabilities across diverse execution environments. Looking ahead, the community has already begun scoping Spec v4, which will focus on performance improvements, including columnar metadata formats, faster commits, and relative path support for simpler deployments.

In short, Iceberg’s past year has demonstrated a clear shift: from a table format focused on schema evolution and partitioning to becoming a comprehensive framework for governed, mutable, and auditable data. With deletion vectors, row lineage, and new data types, Iceberg is now poised to power not just traditional analytics, but also real-time, incremental, and compliance-driven workloads at enterprise scale.

Apache Polaris – The Catalog Layer Grows Up

While Arrow refined performance and Iceberg advanced mutability, Apache Polaris reached a milestone of its own: the release of 1.0.0-incubating in July 2025. This marked the catalog’s transition from an emerging project to a production-ready foundation for governance and interoperability in open lakehouses.

At its core, Polaris implements the Apache Iceberg REST Catalog specification, providing engines with a consistent way to discover and manage datasets without requiring bespoke connectors. But the 1.0 release pushed beyond Iceberg alone. With the introduction of the experimental Generic Table API (Beta), Polaris can now register and manage non-Iceberg tables, such as Delta Lake or even simple CSV datasets, alongside Iceberg in a unified namespace. This expansion reflects a pragmatic reality: most enterprises still operate in multi-format environments, and a single catalog that spans them reduces silos and operational friction.
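
Because Polaris speaks the standard Iceberg REST protocol, any REST-capable client can connect to it. Here is a minimal PyIceberg sketch; the endpoint, credential, and warehouse name are placeholders for your own deployment:

```python
from pyiceberg.catalog import load_catalog

# Connect to a Polaris instance over the Iceberg REST protocol.
# URI, credential, and warehouse below are illustrative placeholders.
catalog = load_catalog(
    "polaris",
    type="rest",
    uri="http://localhost:8181/api/catalog",
    credential="client-id:client-secret",
    warehouse="my_catalog",
    scope="PRINCIPAL_ROLE:ALL",
)
print(catalog.list_namespaces())
```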

To make this accessible, the project delivered a Spark client capable of interacting with Polaris’ Generic Tables. With it, Spark can create, load, and drop both Iceberg and Delta tables directly through the Polaris catalog, bringing governance and metadata consistency to Spark-heavy workflows.

Governance and security were also front and center in the 1.0 milestone. Polaris introduced role-based access control (RBAC) and credential vending, ensuring that query engines receive only the temporary permissions they need. Integration with external identity providers and support for multi-cloud deployments across AWS, Azure, and GCP make Polaris a viable choice for enterprises operating in hybrid environments. Meanwhile, operational improvements such as HTTP caching, compaction-aware rollbacks, and snapshot filtering enhance reliability and reduce costs when scaling catalogs to thousands of tables.

Perhaps most importantly, Polaris has begun to position itself not just as “an Iceberg catalog,” but as a general-purpose governance layer for the open lakehouse. By federating external catalogs, supporting multiple table formats, and standardizing APIs, Polaris simplifies the management of data across various tools and environments.

With the 1.0 line established and 1.0.1 already addressing early feedback, Polaris is emerging as the keystone for secure, multi-engine data access in the open lakehouse stack.

Interoperability Wins – Bringing the Pieces Together

The most exciting story over the past year isn’t just what Arrow, Iceberg, and Polaris accomplished individually; it’s how they are starting to fit together into a coherent open lakehouse stack. Each project has matured along its own path, but their recent advancements clearly reinforce one another.

Arrow’s addition of variant, geometry, and geography types lines up directly with Iceberg’s spec v3, which formalized those same data types at the table level. This means that when an engine like DataFusion reads Iceberg tables, the in-memory representations align seamlessly, avoiding the messy conversions that often sap performance. Similarly, Arrow’s dynamic filtering in DataFusion complements Iceberg’s deletion vectors and row lineage, allowing queries to skip irrelevant rows or snapshots with surgical precision. Together, they reduce I/O and make real-time analytics more efficient.

On the connectivity front, Arrow’s Flight RPC and ADBC drivers dovetail with Polaris’ REST Catalog. A developer can use ADBC to issue queries that travel over Flight while relying on Polaris to discover metadata and enforce governance policies. This reduces the need for custom connectors and accelerates adoption across engines. Once a system implements the specification, it can integrate into the entire ecosystem.
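
A hedged sketch of that flow using the ADBC Flight SQL driver; the endpoint, token, and table name are placeholders for a service whose metadata and access policies are managed through Polaris:

```python
import adbc_driver_flightsql.dbapi as flightsql

# Endpoint and bearer token are placeholders; the driver option name
# is ADBC's standard Flight SQL authorization header setting.
with flightsql.connect(
    "grpc://localhost:32010",
    db_kwargs={"adbc.flight.sql.authorization_header": "Bearer <token>"},
) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM demo.db.events")
        print(cur.fetch_arrow_table())
```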

Polaris’ Generic Table API further amplifies this interoperability story. By enabling the management of Delta or CSV alongside Iceberg, Polaris makes it possible to bring legacy datasets into the same governance and discovery framework. When paired with Arrow’s cross-language libraries, those datasets can be queried and transformed without friction, opening the door to smoother migrations and hybrid architectures.

What emerges is a picture of a stack that is greater than the sum of its parts. Arrow ensures that once data is in memory, it can be moved and transformed efficiently across tools and languages. Iceberg provides the table abstraction that brings reliability, mutability, and governance to raw files. Polaris offers the catalog and security layer that makes this data consistently discoverable and governable across engines. Together, they are building the scaffolding for a truly open, interoperable, and future-proof lakehouse.

Practitioner Playbook – Upgrading Your Lakehouse Stack

With Arrow, Iceberg, and Polaris all hitting major milestones, data teams now face a practical question: how do you take advantage of these advancements in your own environment? The answer depends on where you are starting, but the playbook generally follows three steps: engine readiness, governance centralization, and compute optimization.

1. Evaluate engine readiness.
If you’re running Apache Iceberg, the adoption of Spec v3 means your engines need to catch up. Spark 4.0 and Flink 2.0 have already added compatibility, while engines like Dremio are rapidly incorporating support. Before enabling features such as deletion vectors or row lineage, confirm that your chosen engines and connectors can support them. This prevents hidden inconsistencies when multiple tools interact with the same tables.

2. Centralize governance.
Catalog sprawl is one of the biggest operational headaches in lakehouse deployments. By registering your tables with Apache Polaris, you gain a consistent governance layer that enforces RBAC, issues temporary storage credentials, and supports multi-cloud deployments. If your organization still uses Delta or CSV datasets, the new Generic Table API and upcoming Table Sources API enable you to manage them within the same catalog, thereby reducing silos and introducing uniformity to discovery and access control.

3. Optimize compute paths.
Finally, don’t overlook the performance gains available when using engines that leverage Apache Arrow alongside Apache Iceberg. Modern Arrow-based query engines, such as Dremio or DataFusion, benefit from optimizations like dynamic filtering, Top-K pushdown, and vectorized columnar execution. These capabilities can significantly reduce query times and lower resource usage. Combined with Iceberg’s row-level pruning, they enable workloads that are both faster and more cost-efficient.

For many teams, the upgrade path will be incremental, starting with testing v3 tables in a development environment, rolling out Polaris to unify governance, and finally updating engines to leverage Arrow’s new compute features. However, even small steps can yield immediate benefits, including faster queries, simpler security, and greater consistency across the stack.

Looking Ahead – Where the Open Lakehouse Goes Next

The past year has been transformative, but the roadmaps for Arrow, Iceberg, and Polaris show that the open lakehouse is only getting stronger. Each project has a clear trajectory that deepens interoperability and broadens the scope of what the ecosystem can handle.

Apache Iceberg is already laying the groundwork for Spec v4, which aims to improve performance and operational efficiency. Planned improvements include faster commit protocols, a columnar metadata format for more lightweight planning, and relative path support to simplify deployment across both cloud and on-premises environments. Together, these changes aim to cut latency in large-scale environments and make Iceberg catalogs even more portable.

Apache Arrow is expected to continue its push for modularization. The compute kernels introduced in 20.0.0 and 21.0.0 are just the beginning; more functions and optimized algorithms are forthcoming, particularly for statistical and geospatial workloads. Arrow Database Connectivity (ADBC) is likely to expand its driver coverage, filling in more gaps so developers can treat Arrow as the default interface to any SQL engine. Flight SQL, in turn, is moving toward greater standardization, making cross-language, high-performance connectivity less of an experiment and more of a default.

Apache Polaris is entering a fascinating phase. With its Generic Table API already in beta, the community is preparing to take the concept further with the proposed Table Sources feature. Table Sources will enable Polaris to register not just Iceberg or Delta tables, but a much wider variety of datasets (structured, semi-structured, or even unstructured) in a more robust and governable way. Unlike the lightweight Generic Tables, Table Sources are designed to provide richer metadata, stronger lifecycle management, and tighter governance controls. In practice, this could make Polaris the central catalog for an enterprise’s entire data estate, not just Iceberg or lakehouse tables. Combined with existing RBAC and credential vending, this positions Polaris as the governance backbone of the open lakehouse.

Taken together, these roadmaps suggest that the coming year will see the open lakehouse stack move from strong interoperability toward full-spectrum governance and performance optimization. Arrow will continue to power fast, universal data transport; Iceberg will deepen its support for mutable and auditable data; and Polaris will extend its reach to govern every kind of dataset enterprises care about. The momentum is clear: the open lakehouse is evolving from a collection of projects into a tightly integrated, standards-driven platform.

Dremio’s Role – A Platform Built on Arrow, Iceberg, and Polaris

The rapid evolution of Arrow, Iceberg, and Polaris hasn’t happened in a vacuum; Dremio has been involved in all three projects, as a major contributor and, in some cases, a co-creator. This deep involvement gives Dremio a unique position: it’s not just another engine that adopts these standards, but a platform architected from the ground up around them, built by a company that has consistently anticipated and helped pave the way for emerging data standards rather than following them after they establish themselves.

  • Apache Arrow: Dremio co-created Arrow, and Arrow powers the core query engine inside Dremio, enabling zero-copy data access, columnar execution, and interoperability with other engines and libraries. By building directly on Arrow, Dremio ensures high-performance query execution while also aligning with the de facto standard for in-memory analytics.
  • Apache Iceberg: As one of the earliest adopters and contributors to Iceberg, Dremio has been instrumental in advancing its capabilities. From evolving the specification to building features like reflections and incremental refreshes that accelerate Iceberg queries, Dremio has helped shape Iceberg into a robust open table format. Within the Dremio platform, Iceberg serves as the table storage layer, providing ACID guarantees, schema evolution, and support for features like time travel and deletion vectors. Dremio aims to provide the most robust Iceberg-native experience.
  • Apache Polaris (Incubating): Dremio co-created Polaris with Snowflake and donated it to the Apache Software Foundation. Polaris began as an open, standards-based catalog for Iceberg and is quickly expanding into a more general governance layer, featuring capabilities such as Generic Tables and the upcoming Table Sources. Dremio’s own catalog is built on Polaris, providing customers with the benefits of enterprise-grade governance while ensuring alignment with the open standard catalog for the lakehouse.

The result is that Dremio’s Intelligent Lakehouse Platform is not only powered by these projects but also helps advance them. Organizations that adopt Dremio are effectively standing on the shoulders of Arrow, Iceberg, and Polaris, with the assurance that the platform’s roadmap will continue to track and influence the direction of the broader open-source ecosystem.

This alignment delivers more than just technical advantages. It means a reduced risk of lock-in, faster adoption of new open standards, and confidence that the innovations shaping the open lakehouse will be introduced in Dremio first. For practitioners, Dremio is the most direct way to capitalize on the momentum behind Arrow, Iceberg, and Polaris in a unified, enterprise-ready platform.

Conclusion – The Open Lakehouse Comes of Age

The past year has been a watershed moment for the open lakehouse. Apache Arrow pushed in-memory analytics to new heights with richer data types, faster compute kernels, and maturing connectivity through Flight and ADBC. Apache Iceberg solidified its position as the table standard with the adoption of Spec v3, unlocking deletion vectors, row lineage, and support for semi-structured and geospatial data. Apache Polaris reached its first production-ready release, delivering governance, security, and multi-format catalog capabilities that bring order to the complexity of modern data estates.

What makes these advancements truly powerful is how they reinforce one another. Arrow provides the performance substrate, Iceberg defines reliable and flexible tables, and Polaris ensures discoverability and governance across engines. Together, they are transforming the open lakehouse from a promising idea into a practical, interoperable reality.

Dremio’s role as a co-creator and key contributor across all three projects makes it uniquely positioned to deliver these benefits in a single platform. By adopting Dremio, organizations aren’t just selecting an analytics engine; they’re aligning themselves with the very projects that are shaping the future of open data.

The direction is clear: the open lakehouse is no longer about choosing between flexibility and performance, or between innovation and governance. With Arrow, Iceberg, and Polaris maturing side by side, and with Dremio leading the charge, the open lakehouse has become a complete, standards-driven foundation for modern analytics. For enterprises seeking both freedom and power, this is the moment to embrace it.
