October 14, 2025
Try Apache Polaris (incubating) on Your Laptop with Minio
Head of DevRel, Dremio
Apache Polaris brings open, standards-based governance to the modern data lakehouse. It provides a central catalog that defines how Iceberg tables are organized, accessed, and secured across engines and clouds. For anyone who wants to understand how Polaris works, running it locally is the fastest way to see its features in action.
This guide walks you through deploying Polaris on your laptop using Docker. It uses MinIO as an S3-compatible object store and Apache Spark as the compute engine for querying Iceberg tables. The setup mirrors a real lakehouse environment but runs entirely on your machine.
The GitHub project used in this walkthrough includes everything you need: a Docker Compose file to spin up all components, and a Python bootstrap script that initializes catalogs, creates principals, and prints ready-to-use Spark configurations. In about ten minutes, you’ll have a working lakehouse where you can create Iceberg tables, store data in MinIO, and manage it all through Polaris.
What You’ll Build
Before diving into the setup, it helps to understand what each component in this environment does and how they work together. Apache Polaris acts as the catalog and governance layer: it stores metadata about tables, manages authentication, and provides APIs for engines like Spark to discover and modify tables. MinIO serves as the physical storage system, offering an S3-compatible API that makes it easy for Iceberg to store and access table data and metadata files. Spark acts as the compute layer, reading and writing Iceberg tables through Polaris.
When these three components run together, they create a miniature version of a modern lakehouse architecture. Polaris handles catalogs, permissions, and access control; MinIO stores the data files; and Spark performs the queries and transformations. The configuration used in this setup mirrors what you would use in a larger-scale deployment on cloud storage, making it an excellent environment for hands-on learning and experimentation.
By the end of this guide, you’ll have all three components running on your laptop. You’ll create catalogs, set up a principal with permissions, and connect Spark to Polaris to query Iceberg tables stored in MinIO, all using open standards.
Getting Started
To begin, clone the repository that contains all the files for this quickstart. Open a terminal and run the following commands:
git clone https://github.com/AlexMercedCoder/Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart.git
cd Apache-Polaris-Apache-Iceberg-Minio-Spark-Quickstart
Inside the project folder, you’ll find two essential files. The first is docker-compose.yml, which defines four containers: Polaris, MinIO, a MinIO client, and Spark. The second is bootstrap.py, a Python script that configures Polaris after startup.
The Docker Compose file automatically sets up networking and environment variables so that each service can communicate with the others. Polaris uses ports 8181 and 8182 for its API. MinIO listens on ports 9000 and 9001, storing data for the catalogs created in Polaris. The Spark container includes Jupyter Notebook and is configured to access both Polaris and MinIO.
Once you have the repository cloned, the setup is ready to run. You don't need to edit any configuration files; the defaults will create a local environment that mirrors a small-scale production lakehouse. The next step is to start the containers and verify that each service is running.
Start the environment by running the Docker Compose file. From the project directory, use the following command:
docker compose up -d
This command launches Polaris, MinIO, the MinIO client, and Spark in the background. Docker will pull the necessary images the first time you run it, which might take a few minutes depending on your internet speed. Once all containers are running, you can confirm with:
docker ps
Each service should appear with its assigned ports. Polaris will be available on port 8181, MinIO on 9000 for its API and 9001 for its web console, and Spark on 8888 for the Jupyter Notebook interface.
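If you'd rather confirm this from code, here is a minimal sketch that simply checks whether each published port is reachable. It assumes the default ports from this compose file and that you run it from your host machine:
import socket

# Quick reachability check for each service published by this quickstart.
services = [("Polaris API", 8181), ("MinIO API", 9000), ("MinIO console", 9001), ("Jupyter", 8888)]
for name, port in services:
    with socket.socket() as s:
        s.settimeout(2)
        reachable = s.connect_ex(("localhost", port)) == 0
    print(f"{name:14} port {port}: {'open' if reachable else 'not reachable'}")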
The MinIO client container automatically creates two buckets named lakehouse and warehouse. These serve as storage locations for the catalogs you’ll define in Polaris. The buckets are set to public access for easy testing.
After confirming that all containers are healthy, open the MinIO console at http://localhost:9001 and log in with the credentials admin and password. You should see the two pre-created buckets listed. This verifies that your object store is ready for use.
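You can perform the same verification from code. Below is a minimal sketch using boto3 against MinIO's S3 API, assuming boto3 is installed, you run it from your host machine, and the default admin/password credentials from this setup:
import boto3

# Connect to MinIO through its S3-compatible API (port and credentials from this quickstart).
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="password",
)

# Expect to see the two pre-created buckets: lakehouse and warehouse.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])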
With the storage layer in place, the next step is to initialize Polaris and create the catalogs and user credentials needed for Spark to connect.
Bootstrapping Apache Polaris
With all containers running, the next step is to initialize Polaris so it can recognize your storage buckets and issue credentials for access. Inside the repo is a Python script called bootstrap.py that automates this process.
Head to http://localhost:8888, create a new Python notebook, paste in the contents of bootstrap.py, and run it.
The script will wait for Polaris to become ready, then authenticate using the default admin credentials. It will create two catalogs, lakehouse and warehouse, that point to the MinIO buckets with the same names. It will also create a new principal called user1, assign a role, and grant that user full access to both catalogs.
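To give a sense of what happens under the hood, here is a simplified, hypothetical sketch of the kind of calls the script makes. The endpoint paths, request fields, and credentials shown are assumptions based on Polaris's default configuration; treat bootstrap.py in the repo as the source of truth:
# Illustrative sketch only; see bootstrap.py for the real logic.
import requests

POLARIS = "http://localhost:8181"

# 1. Exchange admin credentials for a bearer token (client_credentials grant).
token_resp = requests.post(
    f"{POLARIS}/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": "root",        # placeholder admin credentials;
        "client_secret": "secret",  # use the values from your deployment
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
headers = {"Authorization": f"Bearer {token_resp.json()['access_token']}"}

# 2. Create a catalog pointing at a MinIO bucket (request body approximated from the
#    Polaris management API; the real script does this for both buckets and also
#    creates the principal, role, and grants).
requests.post(
    f"{POLARIS}/api/management/v1/catalogs",
    headers=headers,
    json={
        "catalog": {
            "name": "lakehouse",
            "type": "INTERNAL",
            "properties": {"default-base-location": "s3://lakehouse"},
            "storageConfigInfo": {
                "storageType": "S3",
                "allowedLocations": ["s3://lakehouse"],
            },
        }
    },
)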
When the process finishes, you’ll see confirmation messages for each action followed by a printed Spark configuration for each catalog configured. This configuration includes the necessary Polaris client package, catalog URI, and the clientId and clientSecret for the newly created principal.
It will look something like this:
=== PySpark Configuration ===
# Spark configuration for catalog: lakehouse
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.config("spark.jars.packages", "org.apache.polaris:polaris-spark-3.5_2.13:1.1.0-incubating,org.apache.iceberg:iceberg-aws-bundle:1.10.0,io.delta:delta-spark_2.12:3.3.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.polaris", "org.apache.polaris.spark.SparkCatalog")
.config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
.config("spark.sql.catalog.polaris.warehouse", "lakehouse")
.config("spark.sql.catalog.polaris.credential", "f5ab655712accbef:df6e4943cdabf76543427fe72dcd4699")
.config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
.config("spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation", "vended-credentials")
.config("spark.sql.catalog.polaris.token-refresh-enabled", "true")
.getOrCreate())
spark.sql("SHOW CATALOGS").show()
# Spark configuration for catalog: warehouse
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.config("spark.jars.packages", "org.apache.polaris:polaris-spark-3.5_2.13:1.1.0-incubating,org.apache.iceberg:iceberg-aws-bundle:1.10.0,io.delta:delta-spark_2.12:3.3.1,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.10.0")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.polaris", "org.apache.polaris.spark.SparkCatalog")
.config("spark.sql.catalog.polaris.uri", "http://polaris:8181/api/catalog")
.config("spark.sql.catalog.polaris.warehouse", "warehouse")
.config("spark.sql.catalog.polaris.credential", "f5ab655712accbef:df6e4943cdabf76543427fe72dcd4699")
.config("spark.sql.catalog.polaris.scope", "PRINCIPAL_ROLE:ALL")
.config("spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation", "vended-credentials")
.config("spark.sql.catalog.polaris.token-refresh-enabled", "true")
.getOrCreate())
spark.sql("SHOW CATALOGS").show()
You can copy one of these configurations (use one of them, not both at the same time) directly into a new notebook; if it runs successfully, you are good to go.
Working with Apache Iceberg Tables in Polaris
At this stage, your Polaris environment is fully initialized. It knows where to find the object store, has a catalog structure, and is ready to connect with an authenticated user. The following section shows how to use this configuration in Spark to create and query Iceberg tables.
Now that Polaris is set up and Spark is connected, you can start experimenting with SQL operations to see how Iceberg tables behave under different workloads. All the examples below assume a working namespace called polaris.db, where you'll create and modify tables (it should have been created by the script you ran after bootstrap.py; if not, you can create it yourself, as shown below).
Now let’s issue SQL commands from your Jupyter notebook or Spark shell.
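If the polaris.db namespace is missing for any reason, the following command creates it; CREATE NAMESPACE IF NOT EXISTS is safe to run even when it already exists:
# Create the working namespace used by every example below (no-op if it already exists).
spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db")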
Creating Tables
Start with a simple unpartitioned table:
spark.sql("""
CREATE TABLE IF NOT EXISTS polaris.db.customers (
id INT,
name STRING,
city STRING
)
USING iceberg
""")To define a partitioning strategy, you can specify one or more partition columns. For example, partitioning by city:
spark.sql("""
CREATE TABLE IF NOT EXISTS polaris.db.sales (
sale_id INT,
product STRING,
quantity INT,
city STRING
)
USING iceberg
PARTITIONED BY (city)
""")For time-based data, partitioning by month or day is common:
spark.sql("""
CREATE TABLE IF NOT EXISTS polaris.db.orders (
order_id INT,
customer_id INT,
order_date DATE,
total DECIMAL(10,2)
)
USING iceberg
PARTITIONED BY (months(order_date))
""")Inserting Data
Insert new records with a standard INSERT INTO command:
spark.sql("""
INSERT INTO polaris.db.customers VALUES
(1, 'Alice', 'New York'),
(2, 'Bob', 'Chicago'),
(3, 'Clara', 'Boston')
""")Updating Records
Iceberg supports SQL-based updates:
spark.sql("""
UPDATE polaris.db.customers
SET city = 'San Francisco'
WHERE name = 'Alice'
""")Deleting Records
You can remove rows directly:
spark.sql("""
DELETE FROM polaris.db.customers
WHERE name = 'Bob'
""")Altering Partition Strategies
If you need to change how your data is organized, Iceberg allows altering partition specs. For instance, switch from city partitioning to bucket(4, city):
spark.sql("""
ALTER TABLE polaris.db.sales
DROP PARTITION FIELD city
""")
spark.sql("""
ALTER TABLE polaris.db.sales
ADD PARTITION FIELD bucket(4, city)
""")This change will apply to new data written after the alteration.
Querying Metadata Tables
Iceberg exposes metadata tables that let you inspect snapshots, manifests, and history:
spark.sql("SELECT * FROM polaris.db.customers.history").show()
spark.sql("SELECT * FROM polaris.db.customers.snapshots").show()
spark.sql("SELECT * FROM polaris.db.customers.files").show()These tables are useful for exploring how Iceberg tracks changes over time.
Explore the Files in MinIO
After running these transactions, open the MinIO console at http://localhost:9001. Navigate to the lakehouse bucket and explore the folders created for each table. You’ll see how Iceberg stores data files, manifests, and metadata under each table directory. Every insert, update, and delete operation creates new files that represent immutable snapshots of your table’s state.
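If you want to do the same inspection programmatically, a minimal sketch (same assumed MinIO endpoint and credentials as earlier, run from your host machine) lists every object under the lakehouse bucket:
import boto3

# Same MinIO connection details assumed earlier in this quickstart.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="password",
)

# Each Iceberg commit adds data files, manifest files, and a new metadata.json under the table's prefix.
for obj in s3.list_objects_v2(Bucket="lakehouse").get("Contents", []):
    print(obj["Key"])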
This exercise gives you a hands-on view of how Iceberg tables evolve over time and how Polaris and MinIO work together to manage and store that evolution.
Conclusion
Running Polaris locally with MinIO gives you a hands-on view of how an open catalog governs Iceberg tables. But when it’s time to move beyond experimentation, deploying and managing a full Polaris environment can take more effort. That’s where Dremio’s integrated catalog comes in.
Dremio’s catalog is built directly on Apache Polaris. It delivers all the same open governance, authentication, and interoperability while removing the setup work. You get a production-ready Polaris deployment with a complete user interface, automated security management, and fine-grained access controls, all without maintaining separate services.
Beyond governance, the Dremio catalog also optimizes your Iceberg tables automatically. It handles compaction, cleanup, and reflection management to keep query performance high and storage efficient. Whether your teams use Spark, Dremio, Trino, Flink, or other engines, Dremio’s catalog provides a consistent and intelligent control layer for your Iceberg lakehouse.
With this quickstart, you’ve seen the open foundation in action. With Dremio, you can scale it across your entire data platform, combining open standards, centralized governance, and automated optimization into one seamless experience.
Download “Apache Iceberg: The Definitive Guide” for Free
Download “Apache Polaris: The Definitive Guide” for Free
Register for an upcoming Workshop to get Hands-on with Dremio and Dremio Catalog