Most organizations run at least two copies of their most important data: one in a warehouse for BI, another in a data lake for engineering and ML. That duplication costs real money, creates governance blind spots, and makes it harder for AI agents to get a complete picture of the business. The Apache Iceberg lakehouse eliminates that duplication by putting a SQL-capable engine directly on top of open file formats stored in object storage. But collapsing the warehouse and the lake into one layer only solves the storage problem. To get real value, especially from AI, you need something sitting above that storage layer: a catalog that makes data discoverable, governs access, and carries the business semantics that turn raw tables into trusted analytical assets.
That is the role Apache Polaris is filling. Co-created by Dremio and Snowflake and now an Apache Top-Level Project, Polaris is the community-governed catalog standard for the Iceberg lakehouse. This post walks through why it matters, what it does, and where it is headed.
Why the Iceberg Lakehouse Changes the Economics of Analytics
The traditional analytics stack separates storage into two systems. A data warehouse handles structured BI queries. A data lake handles raw files for engineering and ML. Both store overlapping data, both require separate pipelines to keep current, and both charge for storage and compute independently.
An Iceberg lakehouse collapses that into one layer. Your data sits in object storage (S3, Azure Blob, GCS) in Apache Iceberg's open table format. Any compatible engine (Spark, Flink, Trino, Dremio, StarRocks, and others) reads and writes the same tables. There is no proprietary format locking you to a single vendor's compute pricing.
The economic impact is direct:
Storage costs drop 10-20x. Object storage runs a fraction of the cost per TB compared to proprietary warehouse storage. You stop paying a vendor markup on bytes at rest.
Engine freedom. You choose the right engine for each workload: Flink for streaming ingestion, Spark for batch ETL, Dremio for interactive analytics. No single-vendor lock-in (see the configuration sketch after this list).
AI gets a complete view. When all data lives in one format in one place, AI agents don't need to stitch together answers from fragmented sources. They query one catalog, one set of tables, one semantic layer.
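To make engine freedom concrete, here is a minimal PySpark sketch that points Spark at an Iceberg REST catalog such as Polaris. The catalog name, URI, credential, and runtime version are placeholders to adapt to your deployment:

```python
from pyspark.sql import SparkSession

# Minimal sketch: attach Spark to an Iceberg REST catalog.
# "polaris", the URI, and the credential are placeholders for your deployment.
spark = (
    SparkSession.builder.appName("iceberg-rest-demo")
    # Iceberg Spark runtime; match the version to your Spark/Scala build.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.1")
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
    # OAuth2 client credentials; the catalog vends scoped storage tokens.
    .config("spark.sql.catalog.polaris.credential", "<client_id>:<client_secret>")
    .config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
    .config("spark.sql.catalog.polaris.warehouse", "analytics")
    .getOrCreate()
)

# Trino, Flink, or Dremio can point at the same endpoint with their own
# equivalent settings and read and write these same tables.
spark.sql("SELECT * FROM polaris.sales.orders LIMIT 10").show()
```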
But cheap, accessible, interoperable storage is only half the equation. Without a catalog sitting on top, your lakehouse is a collection of files with no registry, no access control, and no business context. That's a data swamp.
What a Lakehouse Still Needs: Discovery, Governance, and Semantics
A table stored as Iceberg on S3 is technically accessible, but practically invisible unless something registers it. Three capabilities separate a functional lakehouse from a pile of Parquet files:
Discovery. Tables need to be findable. A catalog provides namespace browsing, search, and dependency tracking. Without it, an engineer knows a table exists because they created it. Everyone else has to ask that engineer.
Governance. Access control determines who can read, write, or manage each table. Credential vending ensures engines receive short-lived, scoped tokens for storage access rather than long-lived keys. Audit trails track who accessed what and when. Without governance, every table is either fully open or fully locked down with manual workarounds.
Semantics. This is the layer AI depends on most. Semantics include metric definitions ("active customer" means purchased in the last 90 days, not on a trial), column-level documentation, and business rules encoded in SQL views. An AI agent without business semantics writes generic SQL that produces technically valid but factually wrong answers. A semantic layer teaches the AI your business language.
All three need to be interoperable, meaning they work the same way regardless of which engine or tool accesses the catalog. That requirement is what drives the need for a standard interface.
The Iceberg REST Catalog: A Standard Interface for a Standard Format
Apache Iceberg solved the format problem. Before Iceberg, table formats were either proprietary or relied on Hive's directory-listing approach, which caused partial reads during writes, took minutes to plan queries on large partition sets, and broke when you changed partitioning. Iceberg replaced that with atomic commits, snapshot isolation, schema evolution, and hidden partitioning. It became the community standard for what a table looks like on storage.
But standardizing the format is not enough. Engines also need a standard way to talk to catalogs. Without one, every engine needs custom integration code for every catalog backend. Spark talks to Hive Metastore one way, Trino talks to AWS Glue another way, and your custom tooling talks to your internal catalog a third way.
The Iceberg REST Catalog specification solves this. It defines a vendor-agnostic HTTP API for table and namespace operations: create, list, load, update, and drop. It includes credential vending (the catalog issues scoped, short-lived storage tokens) and namespace management. Any engine that speaks HTTP can interact with any compliant catalog. No engine-specific JARs, no custom client libraries, no integration maintenance.
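To show how thin that interface is, here is a sketch of the spec's core read path using plain HTTP. The base URI, token, and the analytics prefix are placeholders; the endpoint paths follow the Iceberg REST Catalog OpenAPI specification:

```python
import requests

BASE = "http://localhost:8181/api/catalog"      # placeholder catalog endpoint
HEADERS = {"Authorization": "Bearer <token>"}   # token acquisition omitted

# 1. Fetch catalog configuration (defaults, overrides, endpoint prefix).
config = requests.get(f"{BASE}/v1/config",
                      params={"warehouse": "analytics"}, headers=HEADERS).json()

# 2. List namespaces, then the tables inside one of them.
namespaces = requests.get(f"{BASE}/v1/analytics/namespaces",
                          headers=HEADERS).json()
tables = requests.get(f"{BASE}/v1/analytics/namespaces/sales/tables",
                      headers=HEADERS).json()

# 3. Load a table. The access-delegation header asks the catalog to vend
#    short-lived, scoped storage credentials in the response, so the client
#    never holds long-lived storage keys.
table = requests.get(
    f"{BASE}/v1/analytics/namespaces/sales/tables/orders",
    headers={**HEADERS, "X-Iceberg-Access-Delegation": "vended-credentials"},
).json()

print(table["metadata-location"])  # pointer to the table's Iceberg metadata file
```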
Multiple vendors and open-source projects have built implementations of this spec: Snowflake, Databricks (Unity Catalog), AWS Glue, Tabular, Nessie, and others. That is healthy adoption, but it creates a familiar problem: many implementations, each adding proprietary extensions, with no community-governed reference implementation to keep them aligned.
The pattern mirrors Iceberg itself. Before Iceberg, multiple proprietary table formats existed. The community needed an open, vendor-neutral standard that no single company controlled. The same need now exists for catalogs. Commercial implementations serve their creators' ecosystems well, but the broader lakehouse ecosystem needs a catalog primitive governed by the community, extensible by anyone, and not tied to a single vendor's roadmap.
Apache Polaris: The Community Standard for Lakehouse Catalogs
Apache Polaris fills that gap. Co-created by Dremio and Snowflake, Polaris was donated to the Apache Software Foundation in August 2024, incubated for 18 months with contributions from Google, Microsoft, Confluent, and dozens of other organizations, and graduated to an Apache Top-Level Project in February 2026.
Polaris implements the Iceberg REST Catalog specification and provides the core features a lakehouse catalog needs out of the box, while remaining open and extensible for custom use cases.
Role-Based Access Control (RBAC)
Polaris uses a hierarchical security model:
Principals. Individual users or service accounts.
Principal Roles. Logical groupings of principals (e.g., "Data Engineer", "Analyst").
Catalog Roles. Permissions on specific securable objects: catalogs, namespaces, tables, and views.
Privileges are defined at the catalog role level and granted to principal roles. This decouples identity management from permission management. It also means security is enforced at the catalog layer, independent of which engine runs the query. A Spark job and a Dremio query hitting the same table go through the same access control.
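As a sketch of how those pieces connect in practice, the calls below walk the hierarchy through Polaris's management REST API. All names are placeholders, and the endpoint paths and payload shapes follow our reading of the management API spec, so verify them against your Polaris version:

```python
import requests

POLARIS = "http://localhost:8181/api/management/v1"  # placeholder endpoint
HEADERS = {"Authorization": "Bearer <admin-token>"}  # token acquisition omitted

# 1. Create a principal (here, a service account for an ETL job).
requests.post(f"{POLARIS}/principals", headers=HEADERS,
              json={"principal": {"name": "etl_service"},
                    "credentialRotationRequired": False})

# 2. Create a principal role and assign the principal to it.
requests.post(f"{POLARIS}/principal-roles", headers=HEADERS,
              json={"principalRole": {"name": "data_engineer"}})
requests.put(f"{POLARIS}/principals/etl_service/principal-roles",
             headers=HEADERS,
             json={"principalRole": {"name": "data_engineer"}})

# 3. Create a catalog role on the "analytics" catalog and grant it a
#    table-level privilege.
requests.post(f"{POLARIS}/catalogs/analytics/catalog-roles", headers=HEADERS,
              json={"catalogRole": {"name": "sales_reader"}})
requests.put(f"{POLARIS}/catalogs/analytics/catalog-roles/sales_reader/grants",
             headers=HEADERS,
             json={"grant": {"type": "table", "namespace": ["sales"],
                             "table-name": "orders",
                             "privilege": "TABLE_READ_DATA"}})

# 4. Connect the catalog role to the principal role. Any engine authenticating
#    as etl_service now gets read access to sales.orders, enforced by the
#    catalog rather than by the engine.
requests.put(f"{POLARIS}/principal-roles/data_engineer/catalog-roles/analytics",
             headers=HEADERS,
             json={"catalogRole": {"name": "sales_reader"}})
```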
Polaris also supports external policy integration with tools like Open Policy Agent (OPA) for organizations that need more flexible or auditable policy enforcement.
Catalog Federation
A single Polaris instance can serve as a "catalog of catalogs." It manages both internal catalogs (tables Polaris owns directly) and external catalogs synced from AWS Glue, Hive Metastore, or other Iceberg REST endpoints. This means you don't have to rip out your existing metastore to adopt Polaris. Register it as a federated source and govern it from one place.
Generic Tables
Not everything in a lakehouse is an Iceberg table. Polaris supports Generic Tables, which let you register non-Iceberg assets (Delta Lake tables, Hudi tables, custom data assets) alongside Iceberg tables in the same namespace. This provides a unified entry point for discovery and access across formats, which is critical for organizations in the middle of a format transition or running a mixed-format environment.
Extensibility
Polaris is open-source with pluggable persistence backends (PostgreSQL, etc.), custom APIs for extending functionality, and community-driven feature development. It provides the core primitives, and organizations customize on top without waiting for a vendor's feature release cycle.
How Polaris Closes the Gap for AI-Ready Lakehouses
The three requirements from earlier (discovery, governance, and semantics) map directly to Polaris capabilities.
Tables registered in Polaris are automatically discoverable via the REST API. RBAC controls who can access them. Credential vending secures storage access. And because Polaris speaks the Iceberg REST spec, every compatible engine can discover and query those tables without custom integration. That handles discovery and governance.
Semantics is where Polaris gets interesting for AI use cases.
Iceberg SQL Views. Polaris stores SQL view definitions in the catalog. These views are the primary vehicle for encoding business logic: "active customer" definitions, revenue calculations, churn metrics. Because the views live in the catalog with governed access, different teams using different engines can reference the same metric definition without redefining it in each tool.
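As an illustration, assuming the Spark session from the earlier sketch and an engine/catalog pair that both support Iceberg views, the "active customer" rule can be captured once as a governed view (table and column names here are hypothetical):

```python
# Hypothetical names; assumes the `polaris` Spark catalog configured earlier
# and Iceberg view support in both the engine and the catalog.
spark.sql("""
    CREATE OR REPLACE VIEW polaris.marts.active_customers AS
    SELECT customer_id,
           max(order_date) AS last_order_date
    FROM polaris.sales.orders
    WHERE order_date >= date_sub(current_date(), 90)  -- purchased in last 90 days
      AND is_trial = false                            -- trials excluded by definition
    GROUP BY customer_id
""")
```

Every engine that references polaris.marts.active_customers now inherits the same 90-day definition instead of re-implementing it.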
Generic Tables as Semantic Assets. The Generic Tables feature opens up a pattern beyond format support. You can register custom "metric" assets as Generic Table entries, storing metric definitions, ownership, and lineage as metadata properties. Polaris APIs then govern access to these custom assets the same way they govern access to Iceberg tables. This lets you build a governed semantic layer directly in the catalog, not in a separate system.
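Here is a hedged sketch of that pattern with a hypothetical "metric" asset. The endpoint path and payload shape follow our reading of the generic-tables API, and the "metric" format label, names, and properties are illustrative, so verify against your Polaris version:

```python
import requests

BASE = "http://localhost:8181/api/catalog"     # placeholder Polaris endpoint
HEADERS = {"Authorization": "Bearer <token>"}  # token acquisition omitted

# Hypothetical: register a governed "metric" asset as a Generic Table entry
# in the "metrics" namespace of the "analytics" catalog.
resp = requests.post(
    f"{BASE}/polaris/v1/analytics/namespaces/metrics/generic-tables",
    headers=HEADERS,
    json={
        "name": "monthly_recurring_revenue",
        "format": "metric",  # illustrative format label for a custom asset
        "doc": "Sum of active subscription fees, normalized to one month",
        "properties": {
            "owner": "finance",
            "definition-sql": ("SELECT sum(monthly_fee) "
                               "FROM marts.subscriptions "
                               "WHERE status = 'active'"),
        },
    },
)
resp.raise_for_status()
```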
The "Table Sources" Proposal. The Polaris community is actively working on extending the catalog's scope to become a central registry for all lakehouse assets, not just tables and views, but functions, metrics, and models. The goal: one place to track, govern, and provide access to every data asset in your lakehouse.
This is the outcome that matters for agentic analytics. A marketing team using a BI tool, an ML team using a Python framework, and a finance team using a different query engine all point at the same Polaris catalog. They see the same tables, the same metric definitions, the same access policies. Nobody redefines "monthly recurring revenue" in their own tool. Nobody maintains a shadow copy of the customer dimension. The AI agent querying on behalf of any of these teams gets the same governed, semantically rich view of the data.
Polaris is production-ready today. Organizations are already using its RBAC, catalog federation, credential vending, Iceberg SQL views, and generic tables to govern multi-engine lakehouses at scale. These are not roadmap items; they are shipped, tested, and running in production environments. With these features alone, you can register and discover every table in your lakehouse, enforce fine-grained access policies across any Iceberg-compatible engine, federate external catalogs into a single governance plane, and store semantic assets alongside your data. That is a complete governance foundation for both human analysts and AI agents.
What makes the trajectory exciting is where Polaris is headed next. The "Table Sources" proposal and related community efforts are working toward extending the catalog into a universal registry for all lakehouse assets: tables, views, functions, metrics, and models. When that ships, Polaris becomes the single place where every team, regardless of the engine, BI tool, AI framework, or cloud they use, tracks, governs, and accesses every data asset in the organization. No redefinition across tools. No shadow catalogs. One source of truth for the entire lakehouse.
Experience Polaris with Dremio's Open Catalog
Dremio's Open Catalog is more than managed Polaris infrastructure. It is Apache Polaris at the core, integrated with Dremio's query federation engine, AI Semantic Layer technology, and first-class Apache Iceberg lakehouse features to deliver a complete data unification and semantics experience out of the box.
With Dremio Cloud, you get a pre-configured Polaris catalog from the moment you sign up. Dremio's federation engine extends that catalog by connecting databases, warehouses, and external catalogs (PostgreSQL, Snowflake, BigQuery, AWS Glue, Unity Catalog, and more) into the same governed namespace, so your Polaris catalog governs not just Iceberg tables but every data source your organization touches. On top of that, Dremio's semantic layer lets you build curated SQL views (Bronze, Silver, Gold) with Wikis, Tags, and AI-generated metadata directly over your federated and Iceberg data. Fine-Grained Access Control via UDFs adds row-level security and column-level masking that travel with the data across every access path.
The result: you don't assemble separate tools for catalog management, federation, semantics, and governance. You get one integrated platform with Polaris governance at its foundation, Dremio federation reaching every data source, and a semantic layer that makes it all intelligible to humans and AI agents alike.
Start your free Dremio Cloud trial and experience the full Polaris-powered lakehouse with query federation, an AI semantic layer, and agentic analytics built in.
Want to go deeper on Polaris itself? Download Apache Polaris: The Definitive Guide for a comprehensive walkthrough of the architecture, features, and deployment patterns.