Iceberg tables are files. Without a catalog, they are just Parquet and Avro files scattered across object storage with no name, no schema, and no access control layer binding them together. The Apache Iceberg REST Catalog solves this problem with a vendor-neutral HTTP API that any engine can use to discover, read, and write Iceberg tables. Understanding how it works is essential for anyone building or operating a modern data lakehouse.
This post covers what the REST Catalog actually is, how it compares to older catalog approaches like Hive Metastore and AWS Glue, what Apache Polaris adds on top of the base spec, how credential vending works, and how to configure Dremio and Python clients to connect to a REST catalog in practice.
If you want a deeper grounding in Apache Iceberg's architecture before reading this post, start with a guide that covers the metadata layer in detail.
Why Every Iceberg Table Needs a Catalog
An Apache Iceberg table is not a folder or a database object in the traditional sense. It is a collection of files: data files (Parquet), manifest files (Avro), manifest lists, and a metadata JSON file that records the current snapshot. When a query engine wants to read a table, it needs to know which metadata file is "current." When a writer commits a new snapshot, it needs to atomically update that pointer so other readers immediately see the new state.
That is what the catalog does. The catalog is the table registry: a service that maps logical table names to their current metadata file locations. It is also the lock manager for commits. When two writers attempt to commit simultaneously, the catalog ensures exactly one wins and the other must retry, preserving snapshot isolation and making Iceberg's ACID guarantees possible across multiple engines.
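The "exactly one writer wins" behavior can be pictured as a compare-and-swap on the table's metadata pointer. The sketch below is a toy, in-process model, not how any real catalog server is implemented: the ToyCatalog class, the CommitConflict exception, and the storage paths are all hypothetical names invented for illustration.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table pointer moved since the writer read it."""


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog's pointer registry."""

    def __init__(self):
        self._pointers = {}            # table name -> current metadata location
        self._lock = threading.Lock()  # makes check-and-swap atomic in-process

    def register(self, table, metadata_location):
        with self._lock:
            self._pointers[table] = metadata_location

    def current(self, table):
        with self._lock:
            return self._pointers[table]

    def commit(self, table, expected, new):
        """Swap the pointer only if it still equals what the writer saw."""
        with self._lock:
            if self._pointers[table] != expected:
                raise CommitConflict(f"{table} changed; reload and retry")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.register("events.user_interactions", "s3://bucket/t/metadata/v1.json")

# Two writers read the same base pointer, then both try to commit.
base = catalog.current("events.user_interactions")
catalog.commit("events.user_interactions", base, "s3://bucket/t/metadata/v2-a.json")
try:
    catalog.commit("events.user_interactions", base, "s3://bucket/t/metadata/v2-b.json")
    second_writer_won = True
except CommitConflict:
    second_writer_won = False  # the losing writer must reload and retry
```

The second writer loses because its expected pointer is stale. In a real deployment it would reload the table, reapply its changes on top of the new snapshot, and try again.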
The catalog also carries access control. It controls which principals can see which tables, namespaces, and schemas. Without a catalog enforcing those boundaries, any engine with storage credentials could read any file on your data lake, regardless of permissions.
Finally, the catalog tracks the full namespace hierarchy. A multi-part identifier like production.events.user_interactions, where the leading parts are namespace levels and the last part is the table name, is resolved by the catalog. Rename operations, schema changes, and partition spec updates all flow through the catalog to ensure every engine sees a consistent view.
Understanding the Apache Iceberg REST Catalog
The Catalog Landscape Before REST
Before the REST Catalog specification existed, Iceberg users chose from three catalog options, each with significant constraints.
Hive Metastore (HMS) was the original Hadoop catalog. It uses the Thrift protocol, which requires a JVM on every client. HMS works well for Hive, Spark, and Presto because they are all JVM-based. But connecting a Python client, a Go-based engine, or a cloud-native service to HMS requires either a JVM shim layer or a proxy. HMS also has no credential vending and limited role-based access control beyond what the Hadoop ecosystem provides natively.
AWS Glue is a managed service that exposes a proprietary REST-like API. It integrates naturally with AWS IAM, which makes it a reasonable choice if your entire organization is on AWS. Outside AWS, Glue is essentially unavailable. Glue also does not implement the Iceberg REST Catalog spec, so engines that only speak the standard Iceberg REST API cannot connect to it directly.
JDBC catalogs use a relational database (PostgreSQL, MySQL) as the backend. Any JDBC-capable client can connect, which makes them easy to set up for small deployments. But JDBC catalogs provide no credential vending, no RBAC, and no real multi-tenancy. They are typically used for local development or testing, not production at scale.
Why the REST Catalog Won
The Iceberg REST Catalog specification was introduced in Iceberg 0.14 in 2022. It defines a standard HTTP API for catalog operations: managing namespaces and tables, loading table metadata, committing snapshot updates, and vending storage credentials.
Three properties made it the community's preferred approach almost immediately.
First, it has no JVM dependency. Any language can make HTTP calls. Python clients, Go services, and Rust-based engines all work without needing a JVM shim.
Second, it is vendor-neutral. The spec is part of the Apache Iceberg project. No single company controls it. Any service that implements the spec is interoperable with any client that speaks the spec.
Third, it has credential vending built into the spec. When an engine loads a table, the catalog response can include short-lived, narrowly scoped storage credentials. This changes the security model fundamentally. Engines no longer need long-lived S3 access keys. Instead, the catalog issues time-limited tokens scoped to specific table prefixes.
What the Iceberg REST Catalog API Provides
Namespace and Table Management Endpoints
The Iceberg REST Catalog API organizes operations around a small set of resource types. All endpoints are HTTP with JSON request/response bodies and OAuth2 bearer token authentication.
Namespace operations:
GET /v1/{prefix}/namespaces lists all namespaces
POST /v1/{prefix}/namespaces creates a new namespace with optional properties
GET /v1/{prefix}/namespaces/{namespace} fetches namespace properties
DELETE /v1/{prefix}/namespaces/{namespace} removes an empty namespace
Table operations:
GET /v1/{prefix}/namespaces/{namespace}/tables lists tables in a namespace
POST /v1/{prefix}/namespaces/{namespace}/tables creates a new table
GET /v1/{prefix}/namespaces/{namespace}/tables/{table} loads a table, returning the current metadata file location and, optionally, short-lived storage credentials
POST /v1/{prefix}/namespaces/{namespace}/tables/{table} commits a table update (atomic, validated for conflicts)
DELETE /v1/{prefix}/namespaces/{namespace}/tables/{table} drops a table
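As a rough illustration of how a client might assemble these paths, the helper below builds namespace and table URLs from their parts. The function names are hypothetical. Joining multi-level namespaces with the 0x1F unit separator is my reading of the spec's default encoding, so treat it as an assumption and check it against the server you target.

```python
from urllib.parse import quote

# Assumed default separator for multi-level namespaces (URL-encoded as %1F).
NAMESPACE_SEPARATOR = "\x1f"


def namespace_path(prefix, namespace):
    """Build the path for a namespace given as a tuple of levels."""
    encoded = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    return f"/v1/{quote(prefix, safe='')}/namespaces/{encoded}"


def table_path(prefix, namespace, table):
    """Build the path used to load, commit to, or drop a table."""
    return f"{namespace_path(prefix, namespace)}/tables/{quote(table, safe='')}"


single_level = table_path("my-warehouse", ("events",), "user_interactions")
nested = table_path("my-warehouse", ("production", "events"), "user_interactions")
```

The same path serves GET (load), POST (commit), and DELETE (drop); only the HTTP method changes.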
The table load endpoint is the most critical. When an engine calls it, the response body includes:
The location of the current metadata.json file, along with the table metadata itself
Configuration properties for the storage layer
Optionally, s3.access-key-id, s3.secret-access-key, and s3.session-token as short-lived vended credentials
The engine then uses that metadata to follow the manifest list to the data files in object storage and executes the query. Because the response carries the table metadata, the engine does not need a separate catalog call to find it. The catalog server is never in the data path.
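A load response can be unpacked roughly like this. The JSON body is a made-up, trimmed example (a real response also carries the full table metadata), and split_load_response is a hypothetical helper. The metadata-location and config field names reflect my understanding of the spec's load result.

```python
import json

# Illustrative response body; every value here is invented for the example.
sample_response = json.loads("""
{
  "metadata-location": "s3://my-bucket/events/user_interactions/metadata/00003-abc.metadata.json",
  "config": {
    "s3.access-key-id": "ASIAEXAMPLE",
    "s3.secret-access-key": "example-secret",
    "s3.session-token": "example-session-token",
    "client.region": "us-east-1"
  }
}
""")

VENDED_KEYS = ("s3.access-key-id", "s3.secret-access-key", "s3.session-token")


def split_load_response(body):
    """Return (metadata location, vended credentials, other storage config)."""
    config = body.get("config", {})
    creds = {k: config[k] for k in VENDED_KEYS if k in config}
    other = {k: v for k, v in config.items() if k not in VENDED_KEYS}
    return body["metadata-location"], creds, other


metadata_location, vended, storage_config = split_load_response(sample_response)

# Credentials are optional: a catalog may return none of these keys.
has_vended_credentials = len(vended) == len(VENDED_KEYS)
```

Separating the credential keys from the rest of the config makes it easy to fall back to the engine's own storage credentials when the catalog does not vend any.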
Credential Vending: Security Without Full S3 Keys
Credential vending is the feature that most meaningfully improves security in a multi-engine lakehouse. The traditional approach gives every query engine full S3 credentials (an access key and secret). If any engine is compromised, an attacker gets full access to your entire data lake. Rotating credentials requires updating configuration on every engine.
The REST Catalog's credential vending works differently.
When an engine authenticates with the catalog using its OAuth2 service principal and requests a table, the catalog checks whether that principal has the appropriate privilege. If it does, the catalog calls the cloud IAM service (AWS STS AssumeRole, Azure Managed Identity, GCP Service Account) to generate a short-lived token scoped to that table's storage prefix. The token typically expires in 15 to 60 minutes and grants access only to the specific S3 prefix (or ADLS container path) where that table's data lives.
The engine uses those scoped credentials to read or write data files directly. The catalog is not involved in the data transfer. And the audit log at the catalog layer records every credential vend: which engine, which table, which operation, at what time. You get least-privilege access, time-limited exposure, and centralized audit in a single design.
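To make "scoped to that table's storage prefix" concrete, here is a sketch of the kind of IAM policy document a catalog could attach when requesting temporary credentials on AWS. It is an assumption about the general shape, not Polaris's actual implementation: scoped_session_policy, the bucket, and the prefix are hypothetical. STS AssumeRole accepts a session policy as a JSON string along with a duration in seconds, which is where a policy like this would be passed.

```python
import json


def scoped_session_policy(bucket, table_prefix, allow_write=False):
    """Build an IAM policy document limited to one table's storage prefix."""
    prefix = table_prefix.strip("/")
    object_actions = ["s3:GetObject"]
    if allow_write:
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level access only under the table's prefix.
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Listing is allowed only for keys under the same prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


read_policy = scoped_session_policy("my-bucket", "warehouse/events/user_interactions/")
write_policy = scoped_session_policy(
    "my-bucket", "warehouse/events/user_interactions/", allow_write=True
)
read_policy_json = json.dumps(read_policy)  # string form for a session policy
```

A read-only principal gets the first variant and a writer gets the second, so the privilege check at the catalog translates directly into what the temporary credentials can do.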
Storage optimizations such as Iceberg hidden partitioning also benefit from this model because partition metadata flows through the same catalog load call, not through separate storage scans.
Access Control Hooks
Every REST Catalog endpoint requires OAuth2 bearer token authentication. The catalog server validates the token, resolves the principal's identity, and checks whether that principal has permission to perform the requested operation before returning any metadata.
This means access control is enforced at the catalog level, not only at the storage level. When engines hold no long-lived storage credentials of their own and rely on vended ones, they cannot locate or read a table without going through the catalog, and the catalog will refuse if the principal lacks permission. This is a much stronger security model than relying on S3 bucket policies alone.
Apache Polaris: The Open-Source REST Catalog Reference
The Iceberg REST Catalog spec defines what the API should look like. It does not define how to implement the server. That is where Apache Polaris comes in.
Polaris was co-created by Dremio and Snowflake and donated to the Apache Software Foundation. It is now an Apache incubator project with broad community participation. The project lives at polaris.apache.org and serves as the reference implementation of the Iceberg REST Catalog spec. You can also read a detailed overview of what Apache Polaris is and how it works.
A minimal REST catalog implementation only needs to pass the Iceberg spec's integration tests. Polaris goes much further.
RBAC in Polaris
Polaris implements a full role-based access control model with two layers.
Principal roles are assigned to service principals (engines, users, or service accounts). A principal role defines what a service principal is allowed to do within the catalog.
Catalog roles are attached to specific catalogs within a Polaris instance. A catalog role carries specific privileges: TABLE_READ_DATA, TABLE_WRITE_DATA, TABLE_CREATE, NAMESPACE_FULL_METADATA, and others. You map principal roles to catalog roles to grant access.
This two-layer model enables precise control. You can give your Spark ETL service principal write access to the ingestion namespace while giving your Dremio analytics principal read-only access to the production namespace, all within the same Polaris instance.
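The two-layer mapping can be modeled in a few lines. This is a simplified sketch, not Polaris's data model: the role and principal names are hypothetical, and the sketch ignores the fact that real privileges are granted on specific catalogs, namespaces, or tables. Only the privilege names come from the text above.

```python
# Catalog roles carry privileges.
catalog_roles = {
    "ingestion_writer": {"TABLE_READ_DATA", "TABLE_WRITE_DATA", "TABLE_CREATE"},
    "production_reader": {"TABLE_READ_DATA"},
}

# Principal roles map to one or more catalog roles.
principal_roles = {
    "spark_etl": ["ingestion_writer"],
    "dremio_analytics": ["production_reader"],
}

# Service principals are assigned principal roles.
principals = {
    "spark-service": ["spark_etl"],
    "dremio-service": ["dremio_analytics"],
}


def effective_privileges(principal):
    """Union of privileges reachable through both role layers."""
    privileges = set()
    for principal_role in principals.get(principal, []):
        for catalog_role in principal_roles.get(principal_role, []):
            privileges |= catalog_roles.get(catalog_role, set())
    return privileges


def is_allowed(principal, privilege):
    return privilege in effective_privileges(principal)


spark_can_write = is_allowed("spark-service", "TABLE_WRITE_DATA")
dremio_can_write = is_allowed("dremio-service", "TABLE_WRITE_DATA")
```

The indirection is the point: changing what an analytics team can do means editing one catalog role, not every principal that belongs to the team.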
Multi-Tenancy in Polaris
A single Polaris deployment can host multiple independent catalogs. Each catalog has its own namespaces, tables, access controls, and storage configurations. This makes Polaris suitable for multi-team or multi-business-unit deployments where strict isolation is required.
Credential Vending in Polaris
Polaris implements credential vending for AWS S3 (via STS AssumeRole), Azure ADLS (via Managed Identity), and GCS. The storage configuration is set at the catalog or namespace level. Polaris handles the cloud IAM calls transparently, so engines only need to speak the Iceberg REST protocol.
Dremio Open Catalog: Polaris Plus Federated Sources
Dremio's built-in catalog is not a custom catalog implementation built from scratch. It is built on Apache Polaris and extends it with capabilities specific to Dremio's architecture.
Dremio Open Catalog adds:
Federated source connectors: you can register JDBC databases (PostgreSQL, MySQL, Oracle), cloud warehouses (Snowflake, BigQuery), and other object storage collections as sources within the same governed namespace. This means your Polaris-compatible REST catalog endpoint also exposes non-Iceberg data through the same logical namespace hierarchy.
Fine-Grained Access Control (FGAC): column-level and row-level security applied at the catalog layer. Even when credential vending grants file access, Dremio's query engine enforces column and row masking before returning results.
Arrow Flight endpoints: alongside the standard HTTP REST API, Dremio exposes an Apache Arrow Flight interface for high-throughput bulk data access from analytics clients.
Dremio functions simultaneously as a REST catalog server (other engines can point to it as their catalog endpoint) and as the primary SQL query engine for that catalog. This dual role is central to the Dremio Agentic Lakehouse architecture, where a single platform serves BI tools, AI agents, and programmatic Python clients through the same governed data layer.
Connecting to an Iceberg REST Catalog
Connecting Dremio to an External REST Catalog
If you are running an external Polaris instance (or any other REST catalog server) and want Dremio to query through it, you add it as a source in Dremio's configuration. The key parameters are the endpoint URL, authentication credentials, and the warehouse name.
# Connecting Dremio to an Iceberg REST Catalog source
# Illustrative outline of the settings entered via the Dremio UI or API
source:
  type: ARCTIC  # or NESSIE for open-source Nessie; confirm the exact type name for a REST catalog source in your Dremio version
  config:
    endpoint: https://my-polaris-catalog.example.com/api/catalog
    credential: Bearer my-oauth-token
    warehouse: my-warehouse
In the Dremio UI, navigate to Add Source, choose the appropriate Iceberg catalog type, and supply the endpoint and credentials. Dremio will then enumerate the namespaces and tables available through that catalog and make them queryable via SQL.
Using PyIceberg with a REST Catalog
PyIceberg provides a native Python client for the Iceberg REST Catalog API. This is the recommended approach for Python-based data workflows connecting to Polaris or any other REST-compatible catalog.
# Using PyIceberg to connect to a REST catalog (e.g., Polaris)
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    'polaris',
    **{
        'type': 'rest',
        'uri': 'https://my-polaris-catalog.example.com/api/catalog',
        'credential': 'client_id:client_secret',
        'warehouse': 'my-warehouse',
        's3.region': 'us-east-1'
    }
)

# List tables via the catalog API
tables = catalog.list_tables('production')
print(tables)
The credential field uses the client_id:client_secret format for OAuth2 client credentials flow. PyIceberg handles the token exchange automatically, including token refresh when the access token expires. The warehouse parameter tells Polaris which catalog within the Polaris instance to target.
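For readers curious what that exchange looks like on the wire, the sketch below builds the form body of an OAuth2 client credentials request from a client_id:client_secret string. It only constructs the payload and sends nothing. The helper name is hypothetical, and the default scope value of "catalog" is an assumption; some catalogs expect a different scope, so check your server's documentation.

```python
from urllib.parse import parse_qs, urlencode


def client_credentials_form(credential, scope="catalog"):
    """Turn 'client_id:client_secret' into a token request form body."""
    client_id, _, client_secret = credential.partition(":")
    if not client_id or not client_secret:
        raise ValueError("expected credential in client_id:client_secret form")
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,  # assumed default; catalogs may require another value
    })


form_body = client_credentials_form("my-client-id:my-client-secret")
parsed = parse_qs(form_body)  # decoded view, handy for inspection
```

The response to such a request carries an access token, which the client then sends as a bearer token on every catalog call.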
Once the catalog is loaded, you can read tables as DataFrames, run schema evolution, append snapshots, and manage namespaces all through standard PyIceberg calls.
Querying via SQL Once Connected
After Dremio is connected to the catalog, querying is straightforward SQL. Dremio handles all the metadata resolution, credential vending, and file planning internally.
-- Query an Iceberg table via Dremio with REST catalog as source
SELECT
    user_id,
    COUNT(*) AS event_count,
    MAX(event_time) AS last_seen
FROM polaris_catalog.events.user_interactions
WHERE event_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY user_id
HAVING COUNT(*) > 10;
The three-part name polaris_catalog.events.user_interactions maps to the Dremio source name (polaris_catalog), the namespace (events), and the table name (user_interactions). Dremio resolves this through the REST catalog API and uses Iceberg's partition pruning to skip irrelevant files.
Comparison: REST Catalog vs. Other Iceberg Catalogs
Choosing the right catalog depends on your existing ecosystem, your security requirements, and whether you need multi-engine access. The table below summarizes the key differences.
| Catalog Type | Protocol | Credential Vending | RBAC | Engine Support | Best For |
| --- | --- | --- | --- | --- | --- |
| Hive Metastore | Thrift | No | Limited | Broad (JVM engines) | Legacy Hadoop |
| AWS Glue | REST (proprietary) | Via IAM | IAM-based | AWS-native engines | AWS-only shops |
| Apache Polaris | Iceberg REST | Yes | Full RBAC | All REST-compatible | Multi-engine OSS |
| Dremio Open Catalog | Iceberg REST + federated | Yes | FGAC | All REST-compatible + federated sources | AI-native lakehouse |
| Unity Catalog | REST (proprietary) | Yes | Databricks RBAC | Databricks-first | Databricks shops |
A few practical notes on this comparison:
Hive Metastore is still appropriate if you are running a mature Hadoop-based environment and migrating is not yet feasible. JVM engines (Spark, Hive, Presto) connect to it with minimal friction. For new deployments, it is not the right choice.
AWS Glue works if you are fully committed to AWS and prefer managed services over running your own catalog. The lack of a standard Iceberg REST implementation means some engines require a compatibility layer or a custom connector.
Apache Polaris is the right choice if you want an open-source, community-governed catalog that any REST-compatible engine can connect to. It handles multi-engine access natively and runs on any infrastructure.
Dremio Open Catalog makes sense when you want Polaris-level catalog capabilities combined with a SQL query engine, federated source access, and AI-native features in a single platform. Rather than deploying Polaris separately and then deploying a query engine separately, Dremio bundles them.
Unity Catalog is purpose-built for the Databricks ecosystem. It is a reasonable choice if Databricks is your primary platform, but external engines that are not Databricks-native face friction connecting to it.
The Multi-Engine Lakehouse with a REST Catalog
The REST Catalog specification is the key architectural enabler for multi-engine lakehouses. With a common HTTP API, every engine speaks the same language to the catalog, regardless of how it is implemented internally.
In a typical multi-engine setup with Polaris as the catalog, the architecture looks like this:
Apache Spark handles batch ETL jobs. Your Spark cluster authenticates to Polaris with a dedicated service principal that has write access to the ingestion namespace. Spark loads tables through the REST API, gets vended S3 credentials, and writes new snapshots. When the job commits, Polaris records the new snapshot pointer atomically.
Apache Flink handles streaming ingestion. Your Flink jobs authenticate with a separate principal scoped to append-only operations on specific tables. Flink commits micro-batch snapshots as new Iceberg append operations. Polaris validates each commit.
Dremio handles BI, AI, and ad-hoc SQL. Analytics engineers, BI dashboards, and AI agents connect to Dremio, which resolves tables through the catalog, applies FGAC, and returns results. Dremio's Autonomous Reflections can also cache and accelerate common query patterns without any manual tuning, as described in the Autonomous Performance overview.
Trino or StarRocks might handle ad-hoc queries from teams that prefer those engines. Both support the Iceberg REST Catalog protocol natively.
All four engines see the same tables, the same schemas, and the same snapshots. Schema changes committed by Spark are immediately visible to Dremio and Flink without any synchronization step. Time travel works consistently: each engine can query any snapshot by ID or timestamp using standard Iceberg time travel syntax.
The catalog is the contract between engines. As long as every engine speaks the REST spec, the architecture is engine-agnostic by design.
Tradeoffs and Operational Considerations
The REST Catalog is the right choice for most new Iceberg deployments, but it introduces operational responsibilities that Hive Metastore or a managed service like Glue does not.
New infrastructure component to operate. Running Polaris (or another REST catalog server) means managing a new service: deployment, health checks, backups of catalog state, and upgrades. If you use Dremio Cloud, the catalog management is handled for you. For self-managed deployments, plan for catalog high availability using multiple replicas behind a load balancer.
Catalog availability affects all queries. If the REST catalog server is unreachable, queries that need to load table metadata will fail. Engines cache recently loaded table metadata locally, which provides short-term resilience, but the catalog must be available for any operation that needs to see the latest snapshot or commit a new one. Designing the catalog service for high availability is a production requirement.
HTTP latency on table opens. Every table open requires at least one HTTP call to the catalog to retrieve the metadata file location and credentials. For high-concurrency query workloads where thousands of tables are opened per second, this latency can accumulate. Client-side metadata caching (built into most Iceberg clients) mitigates this significantly, but it is worth measuring in your specific environment.
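A client-side cache of this kind can be sketched as a small time-to-live (TTL) wrapper around whatever function performs the catalog call. This is an illustrative design, not the caching built into any particular Iceberg client: TableLoadCache, fake_loader, and the 30-second TTL are hypothetical.

```python
import time


class TableLoadCache:
    """Cache load results per table for a short time-to-live (TTL)."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader   # callable: table identifier -> load result
        self._ttl = ttl_seconds
        self._clock = clock     # injectable so the example is deterministic
        self._entries = {}      # identifier -> (expires_at, result)

    def load(self, identifier):
        now = self._clock()
        entry = self._entries.get(identifier)
        if entry is not None and entry[0] > now:
            return entry[1]     # still fresh: no catalog round trip
        result = self._loader(identifier)
        self._entries[identifier] = (now + self._ttl, result)
        return result

    def invalidate(self, identifier):
        """Drop a cached entry, for example after committing to the table."""
        self._entries.pop(identifier, None)


calls = []


def fake_loader(identifier):
    """Stand-in for the HTTP call; records each invocation."""
    calls.append(identifier)
    return {"metadata-location": f"s3://bucket/{identifier}/metadata/v{len(calls)}.json"}


fake_now = [0.0]
cache = TableLoadCache(fake_loader, ttl_seconds=30.0, clock=lambda: fake_now[0])

first = cache.load("events.user_interactions")
second = cache.load("events.user_interactions")  # served from cache
fake_now[0] = 31.0
third = cache.load("events.user_interactions")   # TTL expired, loader runs again
```

The TTL is the tradeoff knob: a longer value saves more round trips but delays how quickly a reader notices a new snapshot, and cached vended credentials must never outlive their own expiry.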
OAuth2 infrastructure. The REST Catalog requires OAuth2 for authentication. If your organization does not already have an OAuth2-compatible identity provider, you need to set one up. Polaris supports standard OAuth2 client credentials flow, which most IdPs (Okta, Keycloak, Azure AD, AWS IAM Identity Center) support natively.
Migration from existing catalogs. Moving from HMS or Glue to a REST catalog requires re-registering all existing Iceberg tables in the new catalog. This is a one-time migration, not an ongoing operational burden, but it requires planning. Tools like Iceberg's RegisterTable API can automate most of the work if table metadata files are accessible.
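A migration script mostly amounts to producing one registration request per existing table. The sketch below only builds the request paths and JSON bodies and sends nothing. The register path and the name and metadata-location body fields follow my understanding of the REST spec's register-table operation and should be verified against your catalog. The helper name, table inventory, and storage paths are hypothetical.

```python
import json
from urllib.parse import quote


def register_requests(prefix, tables):
    """Yield (path, JSON body) pairs for registering existing tables.

    `tables` maps (namespace tuple, table name) -> metadata file location.
    """
    for (namespace, name), metadata_location in sorted(tables.items()):
        # Multi-level namespaces are joined with 0x1F (assumed spec default).
        encoded_ns = quote("\x1f".join(namespace), safe="")
        path = f"/v1/{quote(prefix, safe='')}/namespaces/{encoded_ns}/register"
        body = json.dumps({"name": name, "metadata-location": metadata_location})
        yield path, body


# Hypothetical inventory exported from the old catalog.
existing_tables = {
    (("events",), "user_interactions"): "s3://my-bucket/events/user_interactions/metadata/00012.metadata.json",
    (("events",), "sessions"): "s3://my-bucket/events/sessions/metadata/00004.metadata.json",
}

requests_to_send = list(register_requests("my-warehouse", existing_tables))
```

Registration points the new catalog at metadata files that already exist, so no data files are copied or rewritten. Writers on the old catalog should be stopped first so the registered pointer does not go stale.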
Despite these tradeoffs, the benefits (vendor-neutral interoperability, credential vending, full RBAC, language-agnostic clients) make the REST Catalog the standard choice for new multi-engine Iceberg deployments.
Getting Started with the Iceberg REST Catalog
If you are starting fresh, the path forward is straightforward.
For a self-managed deployment, stand up Apache Polaris on Kubernetes or any container platform. Configure your storage connections (S3 bucket + IAM role, ADLS container + Managed Identity), create your catalog structure, define principal roles for each engine or team, and point your engines at the Polaris endpoint. The Apache Polaris documentation covers the deployment and configuration steps in detail.
For a managed experience, Dremio Cloud includes the full catalog layer built on Polaris, plus the query engine, FGAC, Arrow Flight, and AI-native features. You get a production-ready REST catalog without the operational overhead of managing Polaris separately.
Once your catalog is running, configure each engine's catalog section to point at the REST endpoint with its service principal credentials. From that point, all engines share a consistent view of your Iceberg tables. New tables created by Spark appear in Dremio immediately. Schema changes committed by Flink are visible to PyIceberg clients without any manual sync. The catalog handles the coordination.
For teams building AI-native data applications, a well-governed REST catalog is the foundation. When AI agents query your lakehouse through Dremio's MCP Server or Arrow Flight interface, the catalog's RBAC layer ensures those agents only access the data they are authorized to see, with every access logged and auditable.
Try Dremio Cloud free for 30 days and experience the Apache Iceberg REST Catalog with a fully managed catalog, query engine, and AI-native layer already configured: https://www.dremio.com/get-started.