July 1, 2025
Benchmarking Framework for the Apache Iceberg Catalog, Polaris


Apache Polaris (incubating) is becoming a central piece of Open Data Architectures. Polaris is an open-source, fully-featured catalog for Apache Iceberg™. It implements Iceberg’s REST API, enabling seamless multi-engine interoperability across a wide range of platforms, including Apache Doris™, Apache Flink®, Apache Spark™, Dremio®, StarRocks, and Trino. As a catalog, Polaris is responsible for maintaining metadata, including table definitions, locations, versions, and more. It is critical that it serves this metadata as quickly as possible so that data processing engines can efficiently serve user queries.
Dremio is one of the original co-creators of the Apache Polaris (incubating) project. Dremio is committed to contributing the main features of Project Nessie (https://projectnessie.org/) to Polaris. This includes the ability to store metadata in horizontally scalable NoSQL databases like MongoDB. That way, Polaris can efficiently manage a high volume of catalogs and metadata, making it well-suited for scalable, multi-tenant environments.
This article describes the benchmarking framework that was created for Polaris.
This framework has already been leveraged multiple times to improve the quality of Polaris. It can be used to verify that a Polaris deployment is properly configured and tuned to meet performance requirements. We’ve used it to detect a scalability issue in the EclipseLink persistence layer. We also used it to demonstrate that the NoSQL persistence layer is able to meet Polaris performance goals. And recently, Prashant Singh, a Polaris committer from Snowflake, used it to demonstrate that the JDBC persistence refactor fixes the EclipseLink concurrency issues.
Design of the framework
Polaris started as an implementation of Iceberg’s REST API, enabling vendors to provide catalog-level features over Iceberg tables. As of April 2025, Polaris is the technology that powers Snowflake Open Catalog as well as Dremio Catalog. And being an open-source project, administrators can also deploy it on-premise, on their own infrastructure.
Since Polaris’s main entry point is its REST API, it makes sense to run performance tests using a regular HTTP load-testing tool. As part of the benchmarking framework initiative, Gatling (https://gatling.io/) was chosen to develop the benchmarks. They are located in the apache/polaris-tools GitHub repository: https://github.com/apache/polaris-tools/tree/main/benchmarks/.
The benchmarks use a procedural dataset generation technique. Given the same parameters, the exact same dataset will be generated. This makes for true apples-to-apples comparisons between runs.
Generation rules have been defined so that datasets with arbitrary size can be generated. This enables testing Polaris deployments with millions of entities (namespaces, tables, and views).
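To illustrate the idea (with a hypothetical naming scheme, not the framework’s actual one), procedural generation boils down to deriving every entity purely from the input parameters, so that two runs with the same configuration always operate on the same dataset:

# Minimal sketch of procedural generation, assuming a hypothetical naming
# scheme (the real generators live in polaris-tools): every namespace path is a
# pure function of the tree parameters, so identical parameters always yield an
# identical dataset.
def namespace_paths(width: int, depth: int):
    """Yield every namespace path of a width-ary tree with `depth` levels,
    rooted at a single namespace."""
    level = [["ns_0"]]                      # single root namespace
    for _ in range(depth):
        for path in level:
            yield path
        # children are derived purely from the parent's path
        level = [path + [f"{path[-1]}_{i}"] for path in level for i in range(width)]

paths = list(namespace_paths(width=2, depth=14))
print(len(paths))   # 16383, the namespace count used in the example later in this article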
Discussions are happening on the Polaris Dev mailing list to augment the framework with more varied tests. Regular benchmark executions are also being considered, to verify that no performance regression is introduced into the codebase.
Before we start
In order to benchmark a Polaris deployment, first ensure that all machines are monitored. This includes the client node that is running the benchmarking code. This kind of monitoring makes it possible to verify that server-side saturation has been achieved and that bottlenecks are not caused by a client-side implementation detail.
Then, ensure that Polaris is deployed as it would be in production: typically with a properly configured and tuned database, and on production-like hardware.
Clone the https://github.com/apache/polaris-tools/ repository locally. For the rest of this article, we will execute commands from the benchmarks/ directory of that repository.
Test dataset
Every dataset generated by the framework has the following characteristics:
- Contains an arbitrary number of catalogs
- Under the first catalog, contains an n-ary tree of namespaces, where n is the tree width and the tree depth is arbitrary
- Under each leaf node of the namespace tree, contains an arbitrary number of tables and views
The framework allows for more configuration, including an arbitrary number of properties per entity, as well as an arbitrary number of columns per table and view.
The diagram below shows the dataset that will be generated for 3 catalogs, a namespace width equal to 2, a namespace depth equal to 3, and each leaf namespace containing 5 tables and 3 views.

An important distinction to keep in mind is that the framework works at the metadata level, not at the data level. In other words, it will create catalog, namespace, table, and view definitions. But it will not create the data that go in these tables and views.
Prepare the creation of a dataset
Our first example will consist of running a 100% write workload against an empty Polaris server. This will create the dataset that will be used in subsequent workloads.
We will use 1 catalog, a namespace tree width equal to 2, a tree depth equal to 14, and 3 tables and 2 views per leaf namespace. This means that the dataset will have the following characteristics (see the quick calculation after the list):
- 1 catalog
- 16383 namespaces forming a binary tree (with 8192 leaf namespaces)
- 24576 tables
- 16384 views
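These totals follow directly from the tree parameters; here is the quick calculation (plain Python, independent of the framework):

# Quick sanity check of the dataset size implied by the tree parameters above.
width, depth = 2, 14
tables_per_namespace, views_per_namespace = 3, 2

namespaces = sum(width ** level for level in range(depth))   # 2^0 + 2^1 + ... + 2^13
leaf_namespaces = width ** (depth - 1)                       # deepest level only

print(namespaces)                                 # 16383
print(leaf_namespaces)                            # 8192
print(leaf_namespaces * tables_per_namespace)     # 24576 tables
print(leaf_namespaces * views_per_namespace)      # 16384 views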
Additionally, we will configure the framework to add 50 properties to each entity and 100 columns to each table and view. We will also set the table and view write concurrency to 40 simultaneous operations.
To achieve that, let’s create the following benchmark configuration file, named application.conf:
http {
  base-url = "http://my-polaris-deployment:8181" // <1>
}
auth { // <2>
  client-id = "..."
  client-secret = "..."
}
dataset.tree { // <3>
  num-catalogs = 1
  namespace-width = 2
  namespace-depth = 14
  tables-per-namespace = 3
  views-per-namespace = 2
  namespace-properties = 50
  table-properties = 50
  view-properties = 50
  columns-per-table = 100
  columns-per-view = 100
  default-base-location = "s3://my-polaris-deployment/" // <4>
}
workload {
  create-tree-dataset {
    table-concurrency = 40
    view-concurrency = 40
  }
}
<1> This property points to your Polaris server.
<2> This section contains the client ID and secret that can be used to authenticate as an administrator against your Polaris server.
<3> This section assigns the parameters as per the above requirements.
<4> The default base location is used for all created catalogs.
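Before launching a long run, it can be useful to check that the client ID and secret are accepted. A minimal sketch, assuming the OAuth2 client-credentials token endpoint at /api/catalog/v1/oauth/tokens and the PRINCIPAL_ROLE:ALL scope (both may differ depending on your Polaris version and authentication setup):

# A minimal credential sanity check, assuming the OAuth2 client-credentials
# token endpoint at /api/catalog/v1/oauth/tokens and the PRINCIPAL_ROLE:ALL
# scope; verify both against your Polaris version before relying on this.
import json
import urllib.parse
import urllib.request

base_url = "http://my-polaris-deployment:8181"      # same value as http.base-url
form = urllib.parse.urlencode({
    "grant_type": "client_credentials",
    "client_id": "...",                              # same values as the auth section
    "client_secret": "...",
    "scope": "PRINCIPAL_ROLE:ALL",
}).encode()

with urllib.request.urlopen(f"{base_url}/api/catalog/v1/oauth/tokens", data=form) as resp:
    token = json.load(resp)["access_token"]
    print("Credentials accepted, token starts with:", token[:16])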
Create the dataset
Note that in Gatling terms, a benchmark is referred to as a “simulation”.
Now we can run the simulation named CreateTreeDataset. Execute the following command from the benchmarks/ directory of your apache/polaris-tools clone.
./gradlew gatlingRun --simulation org.apache.polaris.benchmarks.simulations.CreateTreeDataset -Dconfig.file=./application.conf
This will start a Gatling process that connects to your Polaris instance and then creates each entity in a specific order. First, all catalogs will be created, one at a time. Second, all namespaces will be created, with gradually increasing concurrency. Third, all tables will be created with the configured concurrency. Finally, the same will happen for views.
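Conceptually (this is an illustration, not the actual Gatling scenario), the phases behave like the sketch below: each phase completes before the next one starts, and the number of in-flight requests is bounded by the configured concurrency.

# Conceptual outline of the CreateTreeDataset phases (an illustration, not the
# actual Gatling scenario): each phase finishes before the next one starts, and
# tables and views are created with a bounded number of in-flight requests.
from concurrent.futures import ThreadPoolExecutor

def create_phase(label, entities, concurrency):
    """Create all entities of one kind with at most `concurrency` requests in flight."""
    def create(entity):
        return f"created {label} {entity}"   # placeholder for the actual REST call
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(create, entities))

create_phase("catalog", range(1), concurrency=1)          # catalogs, one at a time
create_phase("namespace", range(16_383), concurrency=10)  # ramped up gradually in the real scenario
create_phase("table", range(24_576), concurrency=40)      # table-concurrency = 40
create_phase("view", range(16_384), concurrency=40)       # view-concurrency = 40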
Every 5 seconds, progress is printed in the terminal; it looks like the text block below. In this example, all catalogs and namespaces have been created successfully (there are no KO requests, which denote failures), 3,172 tables have been created so far, and no view has been created yet.
=================================================================================================
2025-05-20 09:24:59 UTC                                                                30s elapsed
---- Requests ----------------------------------------------|---Total---|-----OK----|----KO----
> Global                                                     |    19,566 |    19,566 |         0
> Authenticate                                               |         1 |         1 |         0
> Create Catalog                                             |        10 |        10 |         0
> Create Namespace                                           |    16,383 |    16,383 |         0
> Create Table                                               |     3,172 |     3,172 |         0
After benchmarks are executed, the Gatling engine creates HTML reports with response time percentiles for each query and group of queries.
Look for the following line:
Reports generated, please open the following file: file:///Users/plaporte/dremio/polaris-tools/benchmarks/build/reports/gatling/createtreedataset-20250520092429542/index.html
In that file, look at the summary table per query. That table shows the total throughput (Cnt/s), the response time percentiles (p0, p25, p50, p75, p99 and p100), as well as the average and standard deviation for each query.

Click on the Requests cell for Create Table to see a detailed view of that specific request. The Stats table contains the statistics for that specific query, much like the initial summary table. In this example, it shows that the Polaris server sustained 135.78 queries/s with a concurrency of 40 simultaneous queries. It also shows that the median latency was 89ms and the p99 was 154ms. The Response Time Percentiles over Time panel shows how percentiles evolved throughout the duration of the benchmark.

Run a 90% Read/10% Write workload
Now that the Polaris server has actual data, we can start simulating mixed workloads composed of reads and writes. More specifically, reads and updates of entities created by the 100% write workload from the previous section. In this section, we will learn how to run a 90% read/10% update workload.
First, add a read-update-tree-dataset block under the workload section of application.conf. That section sets the parameters for the mixed workload, including the read/write ratio, the total throughput, and the benchmark duration. For this example, let’s set a target throughput of 150 ops/s for 10 minutes. The block we are going to add is as follows:
read-update-tree-dataset {
  read-write-ratio = 0.9
  throughput = 150
  duration-in-minutes = 10
}
The full application.conf file should now look like this:
http {
  base-url = "http://my-polaris-deployment:8181"
}
auth {
  client-id = "..."
  client-secret = "..."
}
dataset.tree {
  num-catalogs = 1
  namespace-width = 2
  namespace-depth = 14
  tables-per-namespace = 3
  views-per-namespace = 2
  namespace-properties = 50
  table-properties = 50
  view-properties = 50
  columns-per-table = 100
  columns-per-view = 100
  default-base-location = "s3://my-polaris-deployment/"
}
workload {
  create-tree-dataset { // <5>
    table-concurrency = 40
    view-concurrency = 40
  }
  read-update-tree-dataset { // <6>
    read-write-ratio = 0.9
    throughput = 150
    duration-in-minutes = 10
  }
}
<5> Note that the file can contain parameters for other workloads; they will simply be ignored.
<6> This is where the new block is added.
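To make these three parameters concrete, here is a conceptual sketch (not the actual Gatling scenario) of how a fixed-throughput mixed workload can be driven: roughly one operation every 1/throughput seconds for the configured duration, each operation being a read with a probability equal to the configured ratio.

# Conceptual sketch of a fixed-throughput mixed workload (an illustration, not
# the actual Gatling scenario): issue about `throughput` operations per second
# for the configured duration, picking reads vs. writes by the configured ratio.
import random
import time

def run_mixed_workload(read_write_ratio, throughput, duration_in_minutes, do_read, do_write):
    interval = 1.0 / throughput
    deadline = time.monotonic() + duration_in_minutes * 60
    while time.monotonic() < deadline:
        if random.random() < read_write_ratio:
            do_read()    # e.g. fetch a namespace, table, or view
        else:
            do_write()   # e.g. update namespace properties or table/view metadata
        time.sleep(interval)

# Tiny dry run with print stubs instead of real REST calls:
run_mixed_workload(0.9, 150, 0.01, lambda: print("read"), lambda: print("write"))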
Now we just need to run Gatling as before, using the ReadUpdateTreeDataset simulation name.
./gradlew gatlingRun --simulation org.apache.polaris.benchmarks.simulations.ReadUpdateTreeDataset -Dconfig.file=./application.conf
Just as with the 100% write workload, Gatling prints a progress report every 5 seconds. We can see that the workload includes queries from a Read group and a Write group. Those queries exercise different Iceberg REST API endpoints, fetching and updating entities.
=================================================================================================
2025-05-20 12:41:39 UTC                                                               490s elapsed
---- Requests ----------------------------------------------|---Total---|-----OK----|----KO----
> Global                                                     |    73,078 |    73,078 |         0
> Authenticate                                               |         9 |         9 |         0
> Read / Fetch Namespace                                     |     7,334 |     7,334 |         0
> Read / Check Table Exists                                  |     7,411 |     7,411 |         0
> Read / Check View Exists                                   |     7,221 |     7,221 |         0
> Write / Update Namespace Properties                        |     2,482 |     2,482 |         0
> Read / Fetch all Tables under parent namespace             |     7,247 |     7,247 |         0
> Read / Fetch all Namespaces under specific parent          |     7,352 |     7,352 |         0
> Read / Fetch View                                          |     7,277 |     7,277 |         0
> Read / Fetch Table                                         |     7,353 |     7,353 |         0
> Read / Check Namespace Exists                              |     7,188 |     7,188 |         0
> Read / Fetch all Views under parent namespace              |     7,371 |     7,371 |         0
> Write / Update View metadata                               |     2,397 |     2,397 |         0
> Write / Update table metadata                              |     2,436 |     2,436 |         0
Once again, Gatling generates a full HTML report with a statistics breakdown per group (read/write) as well as per query. In the screenshot below, we can see that 80,920 read queries and 8,996 write queries were issued. This corresponds to a 89.98% read / 10.00% write ratio, which is what we specified in the configuration file. We can also see that, for example, approximately 13.73 table definitions were fetched every second, with a median latency of 7ms and a p99 of 12ms.
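Those counts are also consistent with the configured workload: 150 ops/s for 10 minutes is roughly 90,000 operations, split 90/10 between reads and writes. A quick check:

# Cross-check of the report numbers against the configured workload.
reads, writes = 80_920, 8_996
total = reads + writes

print(total)                     # 89,916 requests, close to 150 ops/s * 600 s = 90,000
print(round(reads / total, 3))   # ~0.9, the configured read-write-ratio
print(round(writes / total, 3))  # ~0.1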

Conclusion
The Polaris benchmarking framework provides a robust mechanism to validate performance, scalability, and reliability of Polaris deployments. By simulating real-world workloads, it enables administrators to identify bottlenecks, verify configurations, and ensure compliance with service-level objectives (SLOs). The framework’s flexibility allows for the creation of arbitrarily complex datasets, making it an essential tool for both development and production environments.
As Polaris continues to evolve, the benchmarking framework will expand to include additional workloads and scenarios. We encourage you to contribute to this effort by:
- Joining the Polaris Community to stay informed and participate in discussions.
- Watching the polaris-tools repository on GitHub for updates and contributing issues or feature requests.
- Submitting pull requests to enhance the framework’s capabilities.