July 7, 2025

Quick Start with Apache Iceberg and Apache Polaris on your Laptop (quick setup notebook environment)

Alex Merced · Head of DevRel, Dremio

Imagine your data lake as a giant filing cabinet: files everywhere, in all formats, with no real sense of order. At first, it feels freeing: store anything, anytime, however you want. But soon the freedom turns to frustration. Teams start asking, “Where's the latest version of this table?” or “Who has access to this data?” or worse, “Why did this query break after last week’s schema change?”

Welcome to the world before table formats.

What is a table format?

A table format is like the index of that giant filing cabinet: it brings structure, rules, and control to your data lake. It defines how to manage schema changes, track versions, organize partitions, and handle transactions across massive files.

And in today’s world of lakehouses, where you want the flexibility of a data lake with the structure of a warehouse, table formats are non-negotiable.

Enter Apache Iceberg

Apache Iceberg is one of the most powerful table formats available today. Originally created at Netflix, it was designed to overcome the limitations of older formats like Hive. Iceberg supports:

  • Time travel & versioning: Query your data as it existed at any point in time
  • Schema evolution: Add, drop, or rename columns safely
  • Hidden partitioning: Simplify query logic and optimize performance
  • ACID transactions: Reliable multi-write operations across distributed systems

Iceberg makes your lakehouse feel like a data warehouse, without giving up the openness of your data lake.

But a Table Format Needs a Catalog...

While Iceberg takes care of the structure inside a table, you still need something to manage all your tables, something that can:

  • Track metadata locations
  • Govern access and roles
  • Serve multiple compute engines, so they know what tables exist
  • Provide a single source of truth across your organization

That’s where lakehouse catalogs come in.

Apache Polaris: The Catalog for the Modern Lakehouse

Apache Polaris is an open-source implementation of the Apache Iceberg REST Catalog Specification. This specification defines how external services (such as Spark, Flink, or Dremio) can communicate with a catalog over HTTP in a standardized, secure, and decoupled manner.
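
To make this concrete, here is a minimal sketch of what that HTTP conversation looks like, using Python’s requests library. The paths follow the Iceberg REST Catalog spec (v1/config, v1/{prefix}/namespaces); the host, warehouse name, prefix, and token are placeholder assumptions, not values tied to this guide’s setup.

import requests

# Placeholder values: a running REST catalog and a bearer token obtained
# through whatever auth flow the catalog uses (Polaris vends OAuth tokens).
CATALOG_URI = "http://localhost:8181/api/catalog"
TOKEN = "your_bearer_token"
headers = {"Authorization": f"Bearer {TOKEN}"}

# Ask the catalog for its configuration (defaults, overrides, endpoint prefix).
config = requests.get(
    f"{CATALOG_URI}/v1/config",
    params={"warehouse": "polariscatalog"},
    headers=headers,
)
print(config.json())

# List namespaces. Many catalogs expect a prefix segment in the path;
# Polaris typically uses the catalog name as that prefix.
namespaces = requests.get(f"{CATALOG_URI}/v1/polariscatalog/namespaces", headers=headers)
print(namespaces.json())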

Polaris takes it a step further by supporting credential vending, multi-tenant governance, and role-based access, all key features for enterprise-grade deployments.

And it’s not just theory: Polaris is already the engine behind two major commercial offerings:

  • Dremio's Enterprise Catalog
  • Snowflake's Open Catalog

Many more are expected to follow from other vendors in the data ecosystem.

Polaris is quickly emerging as the industry standard for open catalogs, making it an essential component of any future-ready lakehouse stack.

Environment Overview: What’s in the Box?

Before we dive into setup steps, let’s zoom out and understand what we’re putting together, and more importantly, why.

This quick-start environment gives you a ready-to-go Iceberg + Polaris + Spark stack, all running locally via Docker. That means you get the full power of a modern lakehouse architecture, right on your laptop, without installing a dozen tools or wiring up cloud infrastructure.

Let’s break down the key components.

Polaris: The Open Catalog Brain

Apache Polaris is your metadata brain. It keeps track of:

  • What tables exist
  • Their schema versions and partitioning
  • Where the files live (in our case, local disk)
  • Who can access what, and how

It speaks the Iceberg REST Catalog protocol, which means any engine that understands Iceberg (like Spark or Trino) can connect and query tables through Polaris.

In this setup, Polaris runs as a Docker container, exposing its API on port 8181.

Spark: Your Compute Engine

Apache Spark acts as the muscle, the engine that processes data and executes SQL queries. Thanks to its support for Iceberg, Spark can read and write tables managed by Polaris just like it would with any built-in catalog.

We’re using a Jupyter-enabled Spark container (alexmerced/spark35nb), which makes it easy to run interactive notebooks for testing and exploration. You’ll access the Jupyter notebook interface via port 8888.

Shared Storage: Where the Data Lives

Both Polaris and Spark need to work with the same physical files. To make that happen, they share a local volume mounted at ./icebergdata on your host, and at /data inside both containers.

This directory acts as our mock "data lake": Polaris stores metadata here, Spark reads and writes Parquet files here, and together they give us a simplified simulation of a file-backed Iceberg setup.
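
Once the example table from later in this guide exists, you can peek at what Iceberg actually writes into this directory. The sketch below simply walks the shared volume from inside the Spark notebook and prints every file it finds; the exact folder layout depends on how the catalog’s base location is configured, so treat the path as an assumption.

import os

# Inside the Spark container the shared volume is mounted at /data;
# on the host, the same files appear under ./icebergdata.
LAKE_ROOT = "/data"

for dirpath, _, filenames in os.walk(LAKE_ROOT):
    for name in filenames:
        # Expect Parquet data files plus Iceberg metadata: *.metadata.json,
        # manifest lists, and manifest *.avro files once a table exists.
        print(os.path.join(dirpath, name))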

The Toolkit: Scripts and Helpers

This project includes:

  • bootstrap.py: Automates the creation of your Polaris catalog, user, and roles
  • table_setup.py: Creates Iceberg table directories with proper permissions
  • A requirements.txt to handle Python dependencies

These scripts are your companions in this journey, smoothing out setup steps and helping avoid common pitfalls (like permissions errors).

Docker Compose: The One-Command Setup

All of this is orchestrated with a single Docker Compose file. With one command, docker compose up -d, you'll spin up Polaris and Spark, connect them to shared storage, and start your local lakehouse.

Think of it as your mini data platform in a box.

In the next section, we’ll walk you through setting up this environment step by step. You won’t just see what to run; you’ll understand why each step matters and how it fits into the lakehouse picture.

Ready? Let’s fire it up.

Step-by-Step Setup

Now that you understand what the environment includes, let’s walk through setting it up on your machine. This guide assumes you have Docker and Python 3.8 or later installed.

Step 1: Clone the Quick-Start Repository

To begin, you need to download the environment setup code. This includes the Docker Compose file and helper scripts.

In your terminal, navigate to a working directory where you want to keep the project. Then run:

git clone https://github.com/AlexMercedCoder/quick-test-polaris-environment.git
cd quick-test-polaris-environment

You should now be in the project folder, which contains the docker-compose.yml file and other setup scripts.

Step 2: Start Polaris and Spark

While still inside the quick-test-polaris-environment directory, start the environment using Docker Compose:

docker compose up -d

This command launches two containers:

  1. Polaris on port 8181
  2. Spark with Jupyter on port 8888

Both containers share a local folder called ./icebergdata, which serves as your data lake storage.

Once running, you can verify that Polaris is live by opening http://localhost:8181 in your browser. Similarly, you can access the Spark notebook interface at http://localhost:8888.

Step 3: Set Up the Polaris Catalog

Before Spark can use Polaris, you need to initialize the Polaris catalog and generate access credentials.

Still in the same directory, create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate # On Windows use .venv\Scripts\activate
pip install -r requirements.txt

Next, run the bootstrap script:

python bootstrap.py

This script will create a Polaris catalog called polariscatalog, set up access roles, and register a user. At the end, it will display your access credentials. Be sure to copy these credentials, as you will need them when configuring Spark.
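
If you want to sanity-check those credentials before configuring Spark, you can exchange them for a token yourself. Below is a minimal sketch using the OAuth client-credentials flow defined by the Iceberg REST specification (the same flow the Spark configuration later in this guide relies on); the endpoint path and scope are assumptions based on that spec and Polaris’s defaults.

import requests

CLIENT_ID = "your_client_id"          # printed by bootstrap.py
CLIENT_SECRET = "your_client_secret"  # printed by bootstrap.py

# Exchange the client credentials for a bearer token.
response = requests.post(
    "http://localhost:8181/api/catalog/v1/oauth/tokens",
    data={
        "grant_type": "client_credentials",
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "scope": "PRINCIPAL_ROLE:ALL",
    },
)
response.raise_for_status()
print("Received token starting with:", response.json()["access_token"][:12], "...")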

Step 4: Fix Permissions for File Storage

When using file-backed storage, Spark and Polaris need shared access to the same directories. However, they run as different users inside their containers, which can cause permission errors when Spark creates a directory that Polaris later tries to write to.

To avoid this, you need to manually set permissions on the shared volume. Run the following command from your terminal, still inside the project directory:

chmod -R 777 icebergdata

Alternatively, if you're planning to create a specific table, use the provided helper script to create the necessary folder structure with the right permissions. First, change into the storage directory:

cd icebergdata

Then run the table setup script, passing your intended table name in namespace.table form. For the events table created later in this guide, that would be db.events:

python table_setup.py db.events

After running the script, go back to the root project folder:

cd ..

Step 5: Configure and Launch Spark

Now that Polaris is initialized and permissions are fixed, you’re ready to start working with Spark.

You can open the Spark Jupyter interface by visiting http://localhost:8888 in your browser. Inside, you can create a new Python notebook and use the provided script to configure your Spark session with Polaris.

Make sure to update the POLARIS_CREDENTIALS variable with the client ID and secret you received earlier.

In the next section, we’ll walk through an example notebook where you connect Spark to Polaris, create an Iceberg table, and run some basic queries.

Let’s build your first lakehouse table.

Example: Creating and Querying an Iceberg Table

With your environment up and running, it’s time to interact with Polaris through Spark and see the power of Iceberg in action. This section will guide you through creating a namespace, defining a table, inserting data, and querying it using Spark SQL.

Step 1: Open the Spark Notebook Interface

In your web browser, go to http://localhost:8888. This opens the Jupyter notebook interface running inside the Spark container.

If prompted for a token, check the Spark container’s logs (for example, with docker compose logs) for the access URL that includes the token. Once inside, create a new Python notebook.

Step 2: Configure Your Spark Session

In a new cell, paste and run the following code to create a Spark session that connects to your Polaris catalog. Replace the placeholder credentials with the values output by bootstrap.py.

import pyspark
from pyspark.sql import SparkSession

POLARIS_URI = 'http://polaris:8181/api/catalog'
POLARIS_CATALOG_NAME = 'polariscatalog'
POLARIS_CREDENTIALS = 'your_client_id:your_client_secret'
POLARIS_SCOPE = 'PRINCIPAL_ROLE:ALL'

conf = (
    pyspark.SparkConf()
        .setAppName('PolarisSparkApp')
        .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2,org.apache.hadoop:hadoop-aws:3.4.0')
        .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
        .set('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog')
        .set('spark.sql.catalog.polaris.warehouse', POLARIS_CATALOG_NAME)
        .set('spark.sql.catalog.polaris.catalog-impl', 'org.apache.iceberg.rest.RESTCatalog')
        .set('spark.sql.catalog.polaris.uri', POLARIS_URI)
        .set('spark.sql.catalog.polaris.credential', POLARIS_CREDENTIALS)
        .set('spark.sql.catalog.polaris.scope', POLARIS_SCOPE)
        .set('spark.sql.catalog.polaris.token-refresh-enabled', 'true')
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark session configured for Polaris is running.")

This config tells Spark how to connect to Polaris using the REST Catalog API and authenticate with the proper credentials and scope.
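
Before creating anything, you can confirm that the session actually reaches Polaris with a quick listing. On a freshly bootstrapped catalog the result may be empty, which is fine; the point is that the call succeeds.

# List namespaces visible through the Polaris catalog; an empty result on a
# fresh catalog still confirms the REST connection and credentials work.
spark.sql("SHOW NAMESPACES IN polaris").show()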

Step 3: Create a Namespace and Table

Now you can begin issuing SQL commands through Spark. Start by creating a namespace, which serves as a logical grouping for tables.

spark.sql("CREATE NAMESPACE IF NOT EXISTS polaris.db").show()

Next, define some sample data and write it to a new Iceberg table within that namespace.

data = [
    (1, "Alice", "2023-01-01"),
    (2, "Bob", "2023-01-02"),
    (3, "Charlie", "2023-01-03"),
]
columns = ["id", "name", "event_date"]
df = spark.createDataFrame(data, columns)

df.writeTo("polaris.db.events") \
  .partitionedBy("event_date") \
  .tableProperty("write.format.default", "parquet") \
  .create()

This creates an Iceberg table named events inside the polaris.db namespace. The table is partitioned by event_date and stores data in Parquet format.
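
Because the table now exists, later writes use append rather than create. Here is a small follow-on sketch, assuming you are still in the same notebook session:

# Append a few more rows to the existing Iceberg table.
more_data = [
    (4, "Dana", "2023-01-04"),
    (5, "Evan", "2023-01-05"),
]
more_df = spark.createDataFrame(more_data, columns)

more_df.writeTo("polaris.db.events").append()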

Step 4: Query the Iceberg Table

You can now run queries just like you would in a traditional database.

spark.sql("SHOW TABLES IN polaris.db").show()
spark.sql("SELECT * FROM polaris.db.events").show()

You should see your sample data returned from the table, confirming that the write and read operations are working correctly through Polaris.
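
This is also a good place to try the time travel and versioning features mentioned at the start of this guide. Iceberg exposes metadata tables alongside the data, and Spark SQL can read a table as of a specific snapshot. A brief sketch, assuming at least one write has completed:

# Inspect the table's snapshot history via Iceberg's snapshots metadata table.
spark.sql("SELECT snapshot_id, committed_at, operation FROM polaris.db.events.snapshots").show()

# Time travel: read the table as it existed at its earliest snapshot.
first_snapshot = spark.sql(
    "SELECT snapshot_id FROM polaris.db.events.snapshots ORDER BY committed_at"
).first()["snapshot_id"]
spark.sql(f"SELECT * FROM polaris.db.events VERSION AS OF {first_snapshot}").show()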

In the next section, we will take a step back and look at how this architecture mirrors real-world lakehouse patterns, and how it sets you up for scalable data governance.

How This Setup Mirrors a Real Lakehouse Architecture

What you have running on your laptop is not just a toy or a demo. It is a scaled-down version of a real-world lakehouse architecture. Understanding how the pieces fit together helps you see how this setup can grow into a production-grade data platform.

Polaris as the Central Metadata Service

In a modern data architecture, the catalog is the control center. It tracks every table, view, and namespace; maintains metadata locations; manages access permissions; and enables interoperability across compute engines. Polaris provides all of these capabilities by implementing the Apache Iceberg REST Catalog specification.

This standard allows any compute engine that understands Iceberg to plug into Polaris using a consistent API. That means tools like Spark, Dremio, Flink, and even commercial platforms can all access the same metadata without vendor lock-in or custom integration.

Spark as the Analytical Workhorse

Spark is one of the most widely used engines for processing large volumes of data. In this setup, it interacts with Polaris to read table definitions and metadata, then uses that information to locate and process the underlying data files.

In production environments, Spark might be replaced or supplemented with other engines, depending on the workload. But the pattern remains the same. The engine pulls metadata from Polaris and operates on data stored in a lake.

Local Disk as a Stand-in for Cloud Object Storage

In this setup, the data is stored on your local file system using a shared volume between Polaris and Spark. This simulates what would normally be an object store like Amazon S3, Google Cloud Storage, or Azure Blob Storage.

Polaris supports these cloud storage systems out of the box. Switching to one of them in production would only require updating the catalog configuration and ensuring access credentials are properly managed.

Security and Governance Ready by Design

Because Polaris is designed for multi-tenant environments, it includes features like role-based access control and credential vending. These are critical in enterprise settings where different teams or applications need access to different parts of the catalog with different levels of permission.

By bootstrapping roles and users from the beginning, you are setting up an architecture that can grow into a secure and governed platform as your needs expand.

A Future-Proof Lakehouse Foundation

The real strength of this setup lies in its use of open standards and decoupled components. Polaris can evolve independently from Spark. Storage can be swapped without changing the table logic. New engines can be added without rewriting your data pipelines.

This kind of flexibility is what the lakehouse model promises. It combines the reliability of data warehouses with the openness of data lakes, all while maintaining control over your architecture.

Final Thoughts and Next Steps

By following the steps in this guide, you now have a fully functional Iceberg and Polaris environment running locally. You have seen how to spin up the services, initialize the catalog, configure Spark, and work with Iceberg tables. Most importantly, you have set up a pattern that closely mirrors what modern data platforms are doing in production today.

This setup is ideal for learning, testing, and prototyping. It gives you hands-on experience with concepts like REST-based catalogs, credential vending, and Iceberg table operations without needing to provision cloud resources or manage complex infrastructure.

If you are looking to expand on this foundation, here are a few ideas for next steps:

Explore other compute engines. Try connecting Dremio Enterprise or Flink to Polaris using the same REST catalog approach.

Experiment with Iceberg features. Test schema evolution, partition spec changes, and table snapshots to understand how Iceberg manages metadata and file layout over time.
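
For example, with the Iceberg SQL extensions already enabled in the Spark configuration above, schema and partition evolution are plain SQL statements. A small sketch against the events table (the country column is purely illustrative):

# Add a column; only table metadata changes, existing data files are untouched.
spark.sql("ALTER TABLE polaris.db.events ADD COLUMNS (country string)")

# Evolve the partition spec; new writes use it, old files keep their old layout.
spark.sql("ALTER TABLE polaris.db.events ADD PARTITION FIELD country")

# Snapshot lineage reflects writes; schema and partition changes update metadata only.
spark.sql("SELECT * FROM polaris.db.events.history").show()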

Switch to cloud storage. Modify your Docker volumes and Polaris config to use Amazon S3 or another object store. This will bring you one step closer to a production-grade environment.

Test role-based access. Create multiple users and roles in Polaris and configure different permissions for different teams or projects.

Integrate with a query engine UI. Connect a BI tool or data cataloging platform to Polaris to explore Iceberg tables through a graphical interface.

Finally, continue exploring the Iceberg and Polaris documentation. These projects are active and evolving, and staying up to date will help you make the most of what they offer.

This is just the beginning. With a solid understanding of how Iceberg and Polaris work together, you are well on your way to building a scalable, open, and future-ready data platform.

Learn more about Apache Polaris by downloading a free early-release copy of Apache Polaris: The Definitive Guide, and learn about Dremio’s Enterprise Catalog, powered by Apache Polaris.
