Catalog governance is one of the biggest bottlenecks in building a multi-engine lakehouse. When you query the same Apache Iceberg tables with Spark, Flink, and Dremio, synchronizing permissions and access credentials across engines has traditionally been a manual, error-prone chore.
Apache Polaris solves this by providing a centralized, open-source REST catalog for Apache Iceberg tables. Instead of duplicating access control policies across every query engine, you manage them once in Polaris.
The release of Apache Polaris 1.5.0 marks a significant step forward in the project's evolution. This release introduces enterprise-grade security integrations, expanded catalog federation, advanced credential vending, and key performance optimizations.
This deep dive examines the pull requests, code changes, and architectural updates in Polaris 1.5.0, and explains what they mean for your data operations.
Why Polaris Matters: The Core Philosophy
Data lakehouses succeed when they separate compute from storage. If your data is stored in open formats like Apache Iceberg, you should not be locked into a single query engine. You might use Apache Spark for batch ETL, Apache Flink for real-time streaming, and Dremio for interactive business intelligence.
However, sharing storage requires sharing metadata. Without a shared catalog, different engines cannot agree on table schemas, partitions, or transaction state.
The Iceberg REST Catalog specification defines how engines communicate with a catalog service. Apache Polaris is an open-source implementation of this spec, and it also manages access delegation and credential vending.
In Dremio's ecosystem, Polaris serves as the open-source foundation for the Dremio Open Catalog. One Dremio Organization maps to a Polaris Realm, and individual Dremio Projects map to Polaris Catalogs. By building on Polaris, Dremio ensures your metadata remains in an open, standardized catalog, preventing vendor lock-in.
1. Security & Access Control Hardening
Enterprise data lakes require strict, auditable access controls. Polaris 1.5.0 introduces architectural changes to enforce security policies cleanly and integrate with existing enterprise policy engines.
Integrating Apache Ranger for Enterprise Authorization (PR #3928)
For years, Apache Ranger has been the standard for managing data security policies in enterprise environments. Many organizations still maintain extensive Ranger policies that control access to Hadoop, Hive, or Trino.
Pull request #3928, contributed by @sneethiraj, implements the Apache Ranger Authorization Plugin for Apache Polaris. This allows Polaris to delegate authorization decisions directly to an external Ranger service.
When a query engine requests schema access or table metadata from Polaris, the catalog server does not rely only on its internal RBAC rules. It asks the Ranger plugin whether the requesting user has the appropriate privileges.
// Simplified illustration of the Ranger Authorizer integration
public class RangerPolarisAuthorizer implements PolarisAuthorizer {
    private final RangerBasePlugin rangerPlugin;

    public RangerPolarisAuthorizer(RangerBasePlugin plugin) {
        this.rangerPlugin = plugin;
    }

    @Override
    public AuthorizationResponse authorize(PolarisAuthorizableContext context) {
        RangerAccessRequest request = createRangerRequest(context);
        RangerAccessResult result = rangerPlugin.isAccessAllowed(request);
        if (result != null && result.getIsAllowed()) {
            return AuthorizationResponse.allow();
        }
        return AuthorizationResponse.deny("Access denied by Apache Ranger policy.");
    }

    private RangerAccessRequest createRangerRequest(PolarisAuthorizableContext context) {
        RangerAccessRequestImpl request = new RangerAccessRequestImpl();
        request.setResource(new RangerPolarisResource(context.getTargetEntity()));
        request.setUser(context.getPrincipalName());
        request.setAccessType(context.getRequiredAction().name());
        request.setAction(context.getRequiredAction().name());
        request.setRequestData(context.getRequestPayload());
        return request;
    }
}
Ranger policy administrators can define rules using familiar resources: policies map users and groups to permissions on Polaris realms, catalogs, namespaces, and tables.
This integration allows you to centralize policy management. You do not need to rewrite your security policies when migrating from legacy Hadoop environments to a modern cloud-native Iceberg lakehouse. Your existing Ranger policies apply directly to your Iceberg tables.
Decoupling Permissions from Internals (PR #4006)
In earlier versions of Polaris, authorization logic was tightly coupled to the internal PolarisAuthorizableOperation enum. Every permission check was bound directly to a specific Java enum value representing a catalog operation.
Pull request #4006, contributed by @sungwy, decouples RBAC privileges and operation semantics from these internal enums. This refactoring introduces a cleaner separation between what an operation does and what permission it requires.
By isolating the authorization checks from the catalog operation logic, the codebase becomes more modular. Developers can write custom authorization plugins, such as the Ranger plugin or an Open Policy Agent (OPA) authorizer, without modifying the core catalog service codebase.
Consider how this looks in practice for OPA rule declarations. With the decoupled architecture, OPA can query semantic actions directly. Here is an example of OPA Rego rules checking table actions in Polaris:
package polaris.authz

default allow = false

# Allow read access to tables if user belongs to reader group
allow {
    input.action == "READ_TABLE"
    input.user_groups[_] == "data_analysts"
    input.resource.catalog == "shared_catalog"
}

# Restrict table deletion to administrator group
allow {
    input.action == "DROP_TABLE"
    input.user_groups[_] == "platform_admins"
}
This decoupled approach reduces system maintenance costs, prevents code regressions, and makes auditing security rules much simpler.
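The separation can be sketched as a lookup from semantic action names to required privileges that lives outside the catalog operation code. This is a minimal illustration only: the class, action, and privilege names below are hypothetical and are not the identifiers introduced by PR #4006.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical privilege names, for illustration only.
enum Privilege { TABLE_READ_DATA, TABLE_WRITE_DATA, TABLE_DROP }

// Maps a semantic action name to the privileges it requires. Catalog
// operations only know the action name; the mapping lives here.
class ActionPrivileges {
    private final Map<String, Set<Privilege>> required = new HashMap<>();

    ActionPrivileges require(String action, Privilege first, Privilege... rest) {
        required.put(action, EnumSet.of(first, rest));
        return this;
    }

    // True when the caller holds every privilege the action requires.
    // Unknown actions are denied by default.
    boolean isAllowed(String action, Set<Privilege> held) {
        Set<Privilege> needed = required.get(action);
        return needed != null && held.containsAll(needed);
    }

    static ActionPrivileges defaults() {
        return new ActionPrivileges()
            .require("READ_TABLE", Privilege.TABLE_READ_DATA)
            .require("DROP_TABLE", Privilege.TABLE_DROP);
    }
}
```

Because the mapping is data rather than a hard-coded enum switch, an external authorizer could replace or extend it without touching the operations themselves.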
Cleaning Up the Grant Lifecycle (PR #4059, #4234)
When principal roles or users are dropped from a security system, orphaned grant records can remain in the metadata database. These "phantom grants" pose a security risk and clutter the catalog's persistence layer.
Polaris 1.5.0 addresses this with pull requests #4059 and #4234. These updates refactor how grant operations are verified and stored. Polaris now automatically filters out stale grants associated with deleted grantees whenever it loads or resolves authorization entities.
Before applying a new grant or revoking an existing privilege, the catalog service revalidates the target entities. If a role or principal no longer exists, its associated grants are cleaned up, ensuring the authorization state is clean and secure.
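The filtering step can be pictured as follows. This is a simplified sketch under stated assumptions: GrantRecord and GrantFilter are hypothetical types invented for the example, and the real persistence model in Polaris is richer than a grantee id plus a privilege name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified grant record: who receives which privilege.
class GrantRecord {
    final long granteeId;
    final String privilege;

    GrantRecord(long granteeId, String privilege) {
        this.granteeId = granteeId;
        this.privilege = privilege;
    }
}

class GrantFilter {
    // Keeps only grants whose grantee still exists. Grants that point at
    // a deleted role or principal are dropped before authorization uses them.
    static List<GrantRecord> dropStale(List<GrantRecord> grants, Set<Long> liveGranteeIds) {
        List<GrantRecord> active = new ArrayList<>();
        for (GrantRecord grant : grants) {
            if (liveGranteeIds.contains(grant.granteeId)) {
                active.add(grant);
            }
        }
        return active;
    }
}
```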
2. Metastore Federation & Catalog Uniformity
Data teams rarely start with a clean slate. They often have historical datasets stored in Hive Metastores or managed by cloud services like Google Cloud BigQuery. Polaris 1.5.0 introduces federation capabilities to help unify these disparate metadata sources.
Google Cloud BigQuery Metastore Federation (PR #4050)
Google Cloud Platform users often rely on the BigQuery Metastore (an HMS-compatible service) to manage tables. Pull request #4050, contributed by @joyhaldar, adds Google Cloud BigQuery Metastore federation support to Polaris.
This feature allows Polaris to federate tables from BigQuery. Polaris reads the schema and location details from the BigQuery Metastore and projects them as standard Iceberg REST endpoints.
Query engines can join tables in Google Cloud Storage with tables in Google BigQuery through a single Polaris catalog. This eliminates the need to copy metadata or synchronize schemas manually between GCP services and your lakehouse catalog.
To configure BigQuery Metastore federation, you set up a federated catalog in Polaris that points at the BigQuery Metastore. Once configured, Polaris acts as a bridge, reading BigQuery datasets and projecting them as Iceberg namespaces.
Hive Metastore Federation and Migration (PR #4315)
Migrating away from legacy Hadoop architectures is a priority for many organizations, but rewriting millions of files is not feasible. Polaris 1.5.0 improves support for Hive Metastore (HMS) federation, validated by new integration tests in pull request #4315.
Polaris federates legacy Hive tables by reading their metadata and presenting them as readable endpoints to modern engines. This allows you to adopt a gradual migration strategy. You can keep your historical data in place while writing new datasets as native Apache Iceberg tables, accessing both through the same Polaris interface.
Spark reaches the federated Hive catalog through the same Iceberg REST catalog settings it uses for any other Polaris catalog.
Renaming External Catalogs to Federated Catalogs (PR #4116)
As federation became a core feature, the naming convention in the codebase became inconsistent. Some components referred to federated sources as "External Catalogs," while others used "Federated Catalogs."
Pull request #4116, contributed by @flyrain, renames ExternalCatalogFactory to FederatedCatalogFactory throughout the codebase.
This change aligns with Dremio's terminology. In the Agentic Lakehouse, a catalog is not just a local metadata store. It is a federated gateway that connects to databases, warehouses, and other catalogs. The Dremio Open Catalog combines a Polaris catalog with federated sources, exposing a single, governed namespace.
3. Credential Vending & Access Delegation
Credential vending is a key feature of Apache Polaris. In traditional data architectures, you had to distribute cloud storage credentials (like AWS IAM keys) to every client machine running a query. This created a large security risk.
Polaris eliminates this risk by using access delegation. Query engines do not have direct access to cloud storage. Instead, they authenticate with Polaris.
When an engine requests a table read or write, Polaris verifies the user's permissions, requests short-lived, limited-privilege tokens from the cloud provider, and sends them back to the engine. The engine uses these temporary credentials to read or write the Parquet data files.
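On the wire, an engine opts into credential vending by sending the X-Iceberg-Access-Delegation header defined by the Iceberg REST specification when it loads a table. The sketch below only builds such a request with the JDK HTTP client and does not send it. The base URI, catalog, namespace, and table names are placeholders, it assumes the catalog name is used as the path prefix, and it handles single-level namespaces only.

```java
import java.net.URI;
import java.net.http.HttpRequest;

class LoadTableRequests {
    // Builds (but does not send) an Iceberg REST "load table" request that
    // asks the catalog to vend storage credentials with the response.
    static HttpRequest loadTable(String baseUri, String catalog,
                                 String namespace, String table, String bearerToken) {
        URI uri = URI.create(baseUri + "/v1/" + catalog
            + "/namespaces/" + namespace + "/tables/" + table);
        return HttpRequest.newBuilder(uri)
            .header("Authorization", "Bearer " + bearerToken)
            // Defined by the Iceberg REST spec: request vended credentials.
            .header("X-Iceberg-Access-Delegation", "vended-credentials")
            .GET()
            .build();
    }
}
```

The engine authenticates to the catalog with its own bearer token, and the storage credentials arrive in the table response instead of being configured on the client.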
Regional STS Client Configurations (PR #4161)
In global, multi-region cloud deployments, physical distance introduces latency and connection failures. When vending credentials for S3 buckets, Polaris requests temporary tokens from the AWS Security Token Service (STS).
In previous versions, the internal STS client did not always configure its regional endpoints dynamically. An executor in one region might try to fetch tokens from a distant global STS endpoint, causing network timeouts or region-mismatch errors.
Pull request #4161, contributed by @yushesp, fixes this by passing the signing region parameter to the STS client builder.
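// Passing region explicitly to the STS client builder (AWS SDK for Java v2)
import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sts.StsClient;

StsClient stsClient = StsClient.builder()
    .region(Region.of(configuredSigningRegion)) // for example "eu-central-1"
    .credentialsProvider(DefaultCredentialsProvider.create())
    .build();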
By explicitly configuring the regional STS client, Polaris routes token requests to the STS endpoint of the configured region rather than to a distant default endpoint. This prevents query failures caused by network latency or regional policy restrictions in multi-region deployments.
Single Expiration Timestamp per Bundle (PR #4173)
A table write operation in Apache Iceberg can involve writing hundreds of data files and manifest files across different storage locations. Previously, Polaris could return a bundle of vended credentials where different storage paths had different token expiration times.
If some credentials expired before others during a long-running write job, the query engine would hit path-specific write failures partway through, failing the job and leaving partially written files behind.
Pull request #4173, contributed by @yushesp, addresses this by enforcing a single, unified expiration timestamp across the entire credential bundle.
This update simplifies token cache management. Query engines like Spark or Dremio can track a single time-to-live (TTL) for the vended credential package, refreshing all tokens at once before any write path expires.
Because every credential in the vended payload now carries the same expiration, tasks running inside executors complete their writes against one consistent credential lifetime.
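One way to picture a single expiration per bundle is to take the earliest expiry among the per-path credentials and treat it as the lifetime of the whole bundle. The sketch below does exactly that. Choosing the earliest expiry is an assumption made for illustration, not necessarily how PR #4173 computes the value, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.util.Collection;

class CredentialBundle {
    // Picks one expiration for a whole bundle: the earliest expiry among
    // the per-path credentials, so no path can expire before the bundle does.
    static Instant unifiedExpiration(Collection<Instant> perPathExpirations) {
        if (perPathExpirations.isEmpty()) {
            throw new IllegalArgumentException("bundle has no credentials");
        }
        Instant earliest = null;
        for (Instant expiry : perPathExpirations) {
            if (earliest == null || expiry.isBefore(earliest)) {
                earliest = expiry;
            }
        }
        return earliest;
    }

    // True when the engine should refresh, leaving a safety margin
    // (in seconds) before the bundle's single expiration.
    static boolean needsRefresh(Instant bundleExpiry, Instant now, long marginSeconds) {
        return !now.isBefore(bundleExpiry.minusSeconds(marginSeconds));
    }
}
```

With one timestamp to track, an engine-side cache needs a single refresh check per bundle instead of one per storage path.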
KMS Credential Mocking and Generic Table API (PR #4034, #4043)
Polaris 1.5.0 improves support for local testing and developer environments. Pull request #4034 adds support for AWS-shaped Key Management Service (KMS) credentials when using local storage mocks like MinIO. This allows developers to test encrypted credential vending locally before deploying to production.
Additionally, pull request #4043 introduces the Polaris-Generic-Table-Access-Delegation header in the Generic Table API. This allows client applications to request specific token delegation scopes when accessing non-Iceberg metadata formats managed by Polaris.
4. CLI & Admin User Experience Upgrades
Managing a catalog programmatically is essential for platform teams. Polaris 1.5.0 updates the Python-based CLI client to make administration easier.
The New summarize Subcommand (PR #4003)
Platform administrators need to know the state of their catalogs. How many tables exist? What is the size distribution?
Pull request #4003, contributed by @MonkeyCanCode, adds the summarize subcommand to the Polaris CLI.
# Example command using the new summarize subcommand
polaris catalogs summarize --name my_catalog
Executing this command returns a JSON summary of the catalog's structure, including counts of tables, namespaces, and views, along with basic storage details. This provides a quick way to audit catalog use without scraping logs or writing custom SQL scripts.
Locating Tables Across Deep Namespaces (PR #4075)
In large organizations, tables are organized into nested namespaces representing different business units, teams, and environments. Finding a specific table in a deeply nested namespace hierarchy is difficult.
Pull request #4075, also contributed by @MonkeyCanCode, introduces the tables/find command to search for tables.
# Locating tables across the entire namespace tree
polaris tables find --name web_logs
This command searches the entire namespace hierarchy and returns the matching tables along with their schema details and parent namespaces. This makes it easier to locate datasets and verify catalog structure.
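Conceptually, a search like this is a depth-first walk over the namespace tree. The sketch below runs that walk over an in-memory tree. NamespaceNode and TableFinder are hypothetical types invented for the example and say nothing about how the CLI itself is implemented, which would list namespaces and tables through the catalog API instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory namespace node: child namespaces plus table names.
class NamespaceNode {
    final Map<String, NamespaceNode> children = new LinkedHashMap<>();
    final List<String> tables = new ArrayList<>();

    NamespaceNode child(String name) {
        return children.computeIfAbsent(name, n -> new NamespaceNode());
    }
}

class TableFinder {
    // Depth-first walk that returns the dotted path of every table whose
    // name equals the one requested.
    static List<String> find(NamespaceNode root, String tableName) {
        List<String> matches = new ArrayList<>();
        walk(root, "", tableName, matches);
        return matches;
    }

    private static void walk(NamespaceNode node, String path,
                             String tableName, List<String> matches) {
        for (String table : node.tables) {
            if (table.equals(tableName)) {
                matches.add(path + table);
            }
        }
        for (Map.Entry<String, NamespaceNode> entry : node.children.entrySet()) {
            walk(entry.getValue(), path + entry.getKey() + ".", tableName, matches);
        }
    }
}
```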
5. Performance, Concurrency & Persistence
Under heavy analytical workloads, a catalog must handle thousands of concurrent metadata requests. Latency in the catalog translates to latency in query planning. Polaris 1.5.0 optimizes the storage engines and persistence layers.
Per-Realm JDBC Locking (PR #4054)
Many Polaris deployments use relational databases (like PostgreSQL) via JDBC to persist catalog metadata. In previous versions, the JDBC store used coarse-grained synchronized methods to coordinate updates.
This created a concurrency bottleneck. A write operation in one tenant realm would block metadata reads in another realm, causing query planning delays in multi-tenant environments.
Pull request #4054, contributed by @singhpk234, replaces these coarse-grained synchronized blocks with per-realm database locks.
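// Conceptual sketch of the per-realm locking optimization.
// EntityMetadata and executeDbWrite are placeholders for the real types.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

public class PolarisJdbcStore {
    // One lock per realm, created lazily on first use
    private final ConcurrentMap<String, ReentrantLock> realmLocks = new ConcurrentHashMap<>();

    public void updateEntity(String realm, EntityMetadata entity) {
        ReentrantLock lock = realmLocks.computeIfAbsent(realm, r -> new ReentrantLock());
        lock.lock();
        try {
            // Execute database write operation for this specific realm;
            // writers in other realms hold different locks and are not blocked
            executeDbWrite(realm, entity);
        } finally {
            lock.unlock();
        }
    }
}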
This ensures write operations in one realm do not block database access in other realms. Multi-tenant deployments experience lower latency and higher throughput, especially during peak load times.
NoSQL Persistence and GC Reduction (PR #4071)
For NoSQL persistence backends (like Apache Cassandra), database write speed is often limited by Java Garbage Collection (GC) pauses. Frequently allocating and discarding temporary byte arrays during serialization causes high memory pressure.
Pull request #4071, contributed by @snazy, optimizes the NoSQL persistence layer by reusing serialization scratch buffers.
Instead of allocating a new byte array for every entity write, Polaris maintains a pool of reusable byte buffers. The serialization engine writes fields directly into these buffers, reducing allocations and garbage collection overhead.
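The idea of a reusable scratch buffer can be sketched with a per-thread ByteBuffer. This is an illustration only: the class name, the 64 KiB capacity, and the wire layout are assumptions, and the sketch still copies the result into a fresh array at the end, which a real persistence layer could avoid by handing the buffer contents directly to the database driver.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ScratchBufferSerializer {
    // One reusable scratch buffer per thread, instead of a fresh buffer
    // per write. The 64 KiB capacity is an arbitrary choice for the sketch.
    private static final ThreadLocal<ByteBuffer> SCRATCH =
        ThreadLocal.withInitial(() -> ByteBuffer.allocate(64 * 1024));

    // Serializes an id and a name into the shared scratch buffer, then
    // copies out only the bytes that were written. Inputs larger than the
    // buffer capacity would throw BufferOverflowException.
    static byte[] serialize(long id, String name) {
        ByteBuffer buf = SCRATCH.get();
        buf.clear(); // reset position and limit; the memory is reused
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        buf.putLong(id);
        buf.putInt(nameBytes.length);
        buf.put(nameBytes);
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }
}
```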
6. Governance & Community Standards
As the Polaris community grows, maintaining code quality and compliance is essential. Polaris 1.5.0 introduces guidelines to govern contributions, including those generated by AI tools.
The rise of AI coding assistants has led to an increase in automated pull requests. While these tools can improve developer productivity, they can also introduce low-quality code, security vulnerabilities, or license compliance issues.
Polaris 1.5.0 addresses this with pull requests #3948 and #4276. These updates introduce contribution guidelines for AI-generated code, documented in AGENTS.md.
These guidelines define requirements for contributions generated by AI agents. Code submitted by AI must meet checkstyle standards, include unit tests, and avoid introducing deprecated methods. This ensures the project remains secure and maintainable.
Technical Summary of Key Pull Requests
The following table summarizes the key pull requests in Apache Polaris 1.5.0:
| Category | PR Number | Contributor | Core Change | Primary User Impact |
| --- | --- | --- | --- | --- |
| Security | #3928 | @sneethiraj | Integrates Apache Ranger Authorization Plugin | Unified enterprise security policy management. |
| Security | #4006 | @sungwy | Decouples privileges from internal operation enums | Modular and extensible security codebase. |
| Security | #4059 | @flyrain | Refactors grant validation operations | Clean grant lifecycle; no phantom access records. |
| Federation | #4050 | @joyhaldar | Adds Google BigQuery Metastore federation | Query GCP and object storage tables together. |
| Federation | #4116 | @flyrain | Renames External Catalog to Federated Catalog | Terminology alignment with modern query federation. |
Apache Polaris and Dremio: The Open Catalog Integration
The performance and security updates in Polaris 1.5.0 directly benefit users of the Dremio Open Catalog.
Because Dremio is built on open standards, it uses Polaris to manage Iceberg table metadata. When Polaris optimizes its JDBC persistence layer or improves its credential vending logic, Dremio users experience faster query planning and more reliable token exchanges.
By combining Polaris with Dremio's query execution engine, you can build an Agentic Lakehouse that is fast, secure, and open. Dremio provides sub-second query execution on Iceberg tables via automated features like reflections and cloud caching, while Polaris manages the shared metadata layer.
This architecture ensures you retain control of your data. Your tables are stored in open Apache Iceberg format in your own cloud storage, cataloged by an open-source REST service, and accessible by Spark, Flink, and Dremio without vendor lock-in.
This open approach enables complete freedom of choice. You can run Spark jobs, Flink streaming pipelines, and Dremio BI dashboards in parallel, all querying the same physical storage bucket. The Polaris catalog coordinates the transactions, while Dremio accelerates queries, eliminating the serialization tax.
Technical Configuration Guides for Multi-Engine Setup
To use Polaris 1.5.0 from several engines, configure each one (Spark, Flink, Trino) to communicate with the catalog server over the Iceberg REST protocol.
Apache Spark Session Configuration
For data pipeline jobs running on Spark, register Polaris as an Iceberg REST catalog in your Spark configuration.
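As a hedged starting point, the sketch below assembles the Spark properties commonly used to register an Iceberg REST catalog and renders them as spark-submit arguments. The catalog alias "polaris", the URI, the warehouse name, and the credential are placeholders to replace with your own values, and PolarisSparkConf is a helper class invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PolarisSparkConf {
    // Builds Spark properties for an Iceberg REST catalog named "polaris".
    static Map<String, String> forCatalog(String uri, String warehouse, String credential) {
        String prefix = "spark.sql.catalog.polaris";
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions");
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".type", "rest");
        conf.put(prefix + ".uri", uri);
        conf.put(prefix + ".warehouse", warehouse);   // Polaris catalog name
        conf.put(prefix + ".credential", credential); // "<client_id>:<client_secret>"
        conf.put(prefix + ".scope", "PRINCIPAL_ROLE:ALL");
        // Ask the catalog to vend storage credentials with table responses.
        conf.put(prefix + ".header.X-Iceberg-Access-Delegation", "vended-credentials");
        return conf;
    }

    // Renders the properties as spark-submit "--conf key=value" arguments.
    static String asSubmitArgs(Map<String, String> conf) {
        StringBuilder args = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (args.length() > 0) {
                args.append(' ');
            }
            args.append("--conf ").append(e.getKey()).append('=').append(e.getValue());
        }
        return args.toString();
    }
}
```

The same key-value pairs can also go into spark-defaults.conf or be set on a SparkSession builder. Keep the client secret out of source control in either case.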
These configuration templates establish the metadata exchange path. This path enables different analytical tools to operate against the same unified, Ranger-secured catalog tables.
7. Open REST Catalogs vs. Proprietary Lock-In
Data platform strategies often diverge on catalog architecture. Organizations must decide whether to use an open REST catalog like Apache Polaris or a proprietary metadata system. The table below compares the architectural differences between Apache Polaris, Databricks Unity Catalog, and Snowflake Horizon.
| Feature | Apache Polaris | Databricks Unity Catalog | Snowflake Horizon |
| --- | --- | --- | --- |
| Catalog Spec | Native Apache Iceberg REST API | Proprietary REST API | Proprietary REST API |
| Vendor Neutrality | High (open source ASF project) | Medium (managed by Databricks) | Low (tied to Snowflake platform) |
| Multi-Engine Support | Universal (Spark, Flink, Dremio, Trino) | Primary integration with Spark | Primary integration with Snowflake |
| Access Control | Ranger, OPA, internal RBAC | Unity SQL privileges, catalog ACLs | Snowflake SQL grants, row/column masking |
| Vended Credentials | Standardized token exchange (S3/GCS/ADLS) | Internal storage credential mounting | Snowflake-managed external tables |
| Deployment Model | Self-hosted, Kubernetes, or Dremio Open Catalog | Managed cloud service or hosted server | Managed Snowflake platform |
Proprietary metadata catalogs partition your lakehouse. If you write tables through one vendor's catalog, another vendor's query engine must jump through translation hoops to read them. This introduces the serialization tax and limits query speeds.
By utilizing Apache Polaris 1.5.0, your catalog functions as a neutral server. You write access rules in Apache Ranger once, and those rules apply uniformly whether Dremio queries the table or Spark processes a batch job. This is the cornerstone of the Agentic Lakehouse: centralizing governance while preserving absolute freedom of engine choice.
Troubleshooting Polaris 1.5.0 Deployments
When upgrading to Polaris 1.5.0, you may encounter configuration errors in multi-region environments or multi-tenant database clusters. Below are three common troubleshooting strategies.
1. Resolving AWS STS Signature Region Errors
Symptom: Client engines report errors like SignatureDoesNotMatch or ExpiredToken during query planning.
Reason: In multi-region deployments, the internal STS client might route token requests to the default global endpoint (us-east-1), while the target S3 bucket resides in a region that requires regional STS endpoints (such as eu-central-1).
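Fix: Ensure the signing region is set in your Polaris catalog configuration. If your deployment keeps catalog properties in a database table, the update looks like the following (the table and key names are illustrative, so adapt them to your schema):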
UPDATE polaris_properties
SET value = 'eu-central-1'
WHERE key = 'aws.sts.signing-region' AND catalog_name = 'main_warehouse';
Verify that your client engines also set their S3 region parameters explicitly in their configuration properties.
2. Tuning JDBC Connection Pool for Per-Realm Locks
Symptom: Concurrency performance drops when running high numbers of parallel catalog transactions, accompanied by connection timeouts.
Reason: While per-realm JDBC locking prevents cross-realm contention, it increases the total number of parallel active threads requesting database connections. If your database connection pool is too small, threads wait for available connections, causing timeouts.
Fix: Increase the maximum size (max-size) of the Agroal connection pool, which Quarkus uses for JDBC datasources, in your Quarkus server properties file (application.properties):
# Tuning Quarkus JDBC pool sizes
quarkus.datasource.jdbc.max-size=64
quarkus.datasource.jdbc.min-size=8
# Remove connections that have been idle for 10 minutes
quarkus.datasource.jdbc.idle-removal-interval=10M
Ensure your target PostgreSQL or MySQL server is configured to allow a maximum connection limit that accommodates the total connections across all Quarkus nodes.
3. Cleaning Up Phantom Grants Left by Deleted Principal Roles
Symptom: Metadata audits show grant records pointing to non-existent principals or roles.
Reason: If a principal role was deleted in previous versions without running proper revoke calls first, the access records remained in the relational database.
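Fix: Polaris 1.5.0 automatically filters these records, but you can clean them up from the database to improve lookups. Audit first, then delete inside a transaction. The table and column names below are illustrative, so adapt them to your persistence schema, and make sure the subquery covers every grantee type (principals and roles) before deleting:
-- Audit orphaned grants (NOT EXISTS stays correct even if ids are NULL)
SELECT g.* FROM polaris_grants g
WHERE NOT EXISTS (SELECT 1 FROM polaris_principals p WHERE p.id = g.grantee_id);
-- Clean up orphaned records once the audit output looks right
DELETE FROM polaris_grants
WHERE NOT EXISTS (SELECT 1 FROM polaris_principals p WHERE p.id = polaris_grants.grantee_id);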
Regular audits ensure that your metadata storage stays optimized and secure.
Evolving Your Data Architecture
Modern analytics architectures succeed when they are open. If your table catalog is tied to a single proprietary vendor, you will face high data movement costs and complex integration pipelines.
Adopting an open-source, standardized catalog service like Apache Polaris gives you complete architectural choice. You can run Spark jobs for ETL, Flink streaming applications for real-time monitoring, and Dremio BI dashboards in parallel, all querying the same physical storage bucket.
The security and performance updates in Polaris 1.5.0, such as Apache Ranger policy integration and per-realm JDBC locking, help you scale your data operations safely.
Try Dremio Cloud free for 30 days to deploy a managed open catalog directly on your cloud data lake. You can start small, accelerate your slow dashboards, and build a modular, high-performance data lakehouse without vendor lock-in.