Dremio Blog

48 minute read · November 24, 2025

Ingesting Data into Apache Iceberg Using Python Tools with Dremio Catalog

Alex Merced, Head of DevRel, Dremio

Key Takeaways

  • The article discusses ingestion of data into Apache Iceberg using various Python tools: PyIceberg, PyArrow, Bauplan, Daft, SpiceAI, DuckDB, and PySpark.
  • It explains how to connect these tools to a Dremio Catalog using bearer tokens for secure access and vended credentials for easy integration.
  • Ingestion follows a consistent pattern: read, shape, and write data into Iceberg tables while leveraging clean snapshots and metadata management.
  • The article provides an end-to-end example showcasing how to ingest a CSV into Iceberg using PyIceberg and a Dremio Catalog.
  • Overall, it emphasizes the flexibility of Iceberg for managing data in lakehouses with simple and clean pipeline setups.

Get your free Dremio Catalog by signing up for the Dremio Cloud Free Trial

Self-guided workshop on using Dremio’s Features During Free Trial

A previous article I wrote on tools for ingesting data into Apache Iceberg:

Ingesting Data into Apache Iceberg with Dremio

Apache Iceberg gives teams a simple way to manage data in a lakehouse. It adds clear tables, strong guarantees, and predictable performance. Python gives engineers an easy way to collect, clean, and load data from many sources. When you combine both, you get a flexible path to build reliable pipelines without heavy infrastructure.

This blog shows how to ingest data into Iceberg using several Python tools: PyIceberg, PyArrow, Bauplan, Daft, SpiceAI, DuckDB, and PySpark. Each tool handles data differently. Some tools work well for small scripts. Others scale across large files or full pipelines. All of them can write data into Iceberg when connected to a proper catalog.

The examples in this blog use Dremio Catalog. It uses the Apache Iceberg REST Catalog interface, bearer-token access, and short-lived credential vending for storage. These features make the catalog easy to connect and safe to use in production. Later sections will cover the details, but the main idea is simple: authenticate with a token, point your client at the catalog URL, and the system handles the rest.

By the end of this guide, you will know how each Python tool works, how they differ, and how to choose the right approach for your next ingestion job.

How Ingestion Works in an Iceberg Lakehouse

Ingestion in an Iceberg lakehouse follows a clear pattern. You load data from a source, shape it into a table-like form, and write it into an Iceberg table. Iceberg manages files, metadata, and version history. This design keeps pipelines predictable and easy to debug.

An Iceberg table has two parts. The first part is the data itself, stored as Parquet files in object storage. The second part is the metadata, stored as JSON. The metadata tracks the schema, partition rules, snapshots, and file changes. Every write creates a new snapshot. This gives you time travel, rollback, and consistent reads.

Ingestion tools interact with the Iceberg table through a catalog. The catalog provides a namespace, a location for metadata, and a way to look up tables. A Dremio Catalog uses the Apache Iceberg REST Catalog API for all table operations. Clients send requests with a bearer token, and the catalog handles the rest. If credential vending is enabled, the catalog also returns short-lived storage credentials, so you do not need to store cloud keys in your code.

Once a client connects to the catalog, ingestion becomes simple. Create a table if it does not exist. Load your data into memory. Write the data into the table. Iceberg handles the file layout and the atomic commit. After the write finishes, any engine connected to the same catalog, including Dremio, can read the new snapshot.

This workflow looks the same no matter how large the data becomes. That is the advantage of Iceberg. The tools you choose may change, but the ingestion pattern stays stable across all stages of growth.

Catalog Setup: Connecting Python Engines to a Dremio Iceberg Catalog

Every ingestion workflow begins with a catalog connection. The catalog stores table metadata, tracks snapshots, and gives each client a consistent view of the lakehouse. A Dremio Catalog follows the Iceberg REST specification so that most Python tools can use it without special plugins.

A Dremio Catalog uses four elements. The first is the REST endpoint. This is the URL clients use to read or write Iceberg metadata. The second is the OAuth2 token server. This server issues and refreshes tokens when needed. The third is a bearer token, which identifies the user or service. The fourth is an optional access-delegation header. This header tells the catalog to vend short-lived storage credentials so your code does not need cloud keys.

A typical configuration looks like this:

{

  "uri": "https://catalog.dremio.cloud/api/iceberg",

  "oauth2-server-uri": "https://login.dremio.cloud/oauth/token",

  "token": "REPLACE_WITH_YOUR_TOKEN_VALUE",

  "warehouse": "your_project_name",

  "header.X-Iceberg-Access-Delegation": "vended-credentials",

  "type": "rest"

}

The warehouse value matches the name of your Dremio Cloud project.
The token is a bearer token and should come from an environment variable or a secret store.
Do not embed a real token in your code.
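
For example, here is a minimal sketch of building that configuration in Python with the token read from an environment variable. The variable name DREMIO_TOKEN is only an illustration; use whatever your secret store or deployment environment provides.

import os

catalog_config = {
    "uri": "https://catalog.dremio.cloud/api/iceberg",
    "oauth2-server-uri": "https://login.dremio.cloud/oauth/token",
    "token": os.environ["DREMIO_TOKEN"],  # set DREMIO_TOKEN outside the code
    "warehouse": "your_project_name",
    "header.X-Iceberg-Access-Delegation": "vended-credentials",
    "type": "rest",
}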

Most Python tools follow the same setup pattern. You pass the catalog URL. You attach the bearer token using a header or an auth field. You set the warehouse for your Dremio project. When vended credentials are enabled, the catalog supplies short-lived access keys so your client can write Parquet files without storing static cloud credentials.

The surrounding libraries vary in syntax, but they follow this same idea. PyIceberg loads a catalog from a configuration map. PyArrow adds the token to FlightSQL headers. Bauplan stores the connection in a client object. PySpark defines the REST catalog through session settings. Once set up, each tool handles table lookups, metadata reads, and atomic commits via the catalog.

Once the connection is established, ingestion becomes straightforward. You create tables, append data, and confirm new snapshots. The client does not manage metadata files. The catalog manages them and enforces consistency. This keeps your ingestion jobs simple, safe, and easy to repeat.

Data Sources You Can Ingest

You can load data into Iceberg from many places. The source does not change the core workflow. You read the data into memory, shape it into a table-like structure, and write it to an Iceberg table via the catalog. The tools you choose only change how these steps run.

Files are the most common source. CSV, JSON, and Parquet files work well because they map cleanly to tabular structures. PyArrow reads these formats with little code. PySpark does the same at a larger scale. Bauplan can scan entire folders of Parquet files and build an Iceberg table from them. Once the data sits in memory as an Arrow table or a DataFrame, you can write it into Iceberg.

APIs are another option. Many teams pull data from REST endpoints. You fetch the data, parse the JSON payload, and turn it into a local structure. PyArrow can convert Python lists and dictionaries into Arrow tables. That makes the data ready for Iceberg. If the source system supports Arrow Flight, the process becomes faster. You send a SQL query or command, receive an Arrow table, and write it into Iceberg with one more step.
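
As a quick illustration, here is a hedged sketch of pulling JSON from a REST endpoint and shaping it into an Arrow table. The URL and field layout are placeholders, and it assumes the requests package is installed.

import requests
import pyarrow as pa

# Placeholder endpoint; replace with your own API
response = requests.get("https://api.example.com/users")
response.raise_for_status()

# Assume the payload is a JSON array of flat records
records = response.json()

# Convert the parsed records into an Arrow table, ready for an Iceberg write
arrow_table = pa.Table.from_pylist(records)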

Databases also feed Iceberg. You can run a direct extract using JDBC or ODBC. PySpark provides a simple path with spark.read.format("jdbc"). PyArrow and PyIceberg can write the results once they are in memory. For continuous feeds, you can use a CDC tool to stream changes into Parquet files. Bauplan and PyIceberg can then register those files as Iceberg tables or append them to an existing table.

All of these sources follow the same pattern. Load. Shape. Write. Iceberg handles layout, metadata, and commits. Dremio handles the catalog and the credential vending. Your code only focuses on the data.

Ingesting Data with PyIceberg

PyIceberg is the simplest way to write data to Iceberg with pure Python. It gives you direct control over catalogs, schemas, and snapshots. It works well for scripts, small ingestion jobs, and service-style pipelines.

PyIceberg follows a simple pattern. You connect to a catalog, create or load a table, and append data. The library uses PyArrow tables for data, which keeps the workflow clean and consistent.

1. Connect to the Dremio Catalog

You load the catalog using a config map.
Here is an example with placeholder values:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(

    "dremio",

    {

        "uri": "https://catalog.dremio.cloud/api/iceberg",

        "oauth2-server-uri": "https://login.dremio.cloud/oauth/token",

        "token": "YOUR_TOKEN_HERE",

        "warehouse": "your_project_name",

        "header.X-Iceberg-Access-Delegation": "vended-credentials",

        "type": "rest"

    }

)
  • The warehouse name matches your Dremio Cloud project.
  • The token should come from an environment variable, not from the code.

2. Create or Load a Table

You define a schema and create the table.
If the table exists, you load it instead.

from pyiceberg.schema import Schema

from pyiceberg.types import NestedField, IntegerType, StringType

schema = Schema(

    NestedField(1, "id", IntegerType(), required=False),

    NestedField(2, "name", StringType(), required=False)

)

table_identifier = "demo.users"

# Create the table if it does not exist

table = catalog.create_table(

    table_identifier,

    schema=schema

)

If the table exists in the catalog, you can load it with:

table = catalog.load_table(table_identifier)

3. Prepare Data as a PyArrow Table

PyIceberg writes Arrow tables.
You convert your data like this:

import pyarrow as pa

records = [

    {"id": 1, "name": "Alice"},

    {"id": 2, "name": "Bob"}

]

arrow_table = pa.Table.from_pylist(records, schema=table.schema().as_arrow())

4. Append the Data

Once the data sits in an Arrow table, you append it:

table.append(arrow_table)

PyIceberg writes the Parquet files, updates metadata, and creates a new snapshot.
The commit is atomic. If something fails, the table stays unchanged.
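
If you want to confirm the commit from the same script, a small check like the one below works. It is a sketch that assumes a recent PyIceberg release; it only adds current_snapshot() on top of calls already used in this post.

# Inspect the snapshot created by the append (optional sanity check)
snapshot = table.current_snapshot()
print("current snapshot id:", snapshot.snapshot_id if snapshot else None)

# Read the rows back through the same catalog connection
print(table.scan().to_arrow().to_pylist())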

5. Other Operations

PyIceberg also supports deletes and overwrites:

from pyiceberg.expressions import EqualTo

# Delete rows where id = 2

table.delete(delete_filter=EqualTo("id", 2))

# Overwrite rows with id = 1

table.overwrite(

    arrow_table,

    overwrite_filter=EqualTo("id", 1)

)

These actions create new snapshots and keep the table consistent.

Summary

PyIceberg works best for small or moderate ingestion tasks.
It does not distribute compute, but it gives you clear control, strong safety, and simple code.
If you need heavier transformations or parallel execution, PySpark or Bauplan may be a better fit.

Ingesting Data with PyArrow & Dremio

PyArrow gives you fast, columnar data handling in Python. It reads files, parses API payloads, and moves data between systems. PyArrow does not write to Iceberg by itself, but it prepares the data that Iceberg needs. It also connects to Dremio through Arrow FlightSQL, which lets you run SQL that writes into Iceberg tables using the Dremio Query Engine for scalable capacity.

The workflow is simple. You read data into an Arrow table. You connect to Dremio with a bearer token. You run SQL that creates or updates an Iceberg table. PyArrow handles the data transfer, and Dremio handles the commit.

1. Read Data into an Arrow Table

You can load CSV, JSON, or Parquet files.

import pyarrow as pa

import pyarrow.csv as csv

import pyarrow.parquet as pq

# CSV example

table = csv.read_csv("data/users.csv")

# Parquet example

# table = pq.read_table("data/users.parquet")

You can also build an Arrow table from Python data.

records = [

    {"id": 1, "name": "Alice"},

    {"id": 2, "name": "Bob"}

]

table = pa.Table.from_pylist(records)

2. Connect to Dremio Using FlightSQL

You connect with the Dremio Flight endpoint and a bearer token.

from pyarrow import flight

token = "YOUR_TOKEN_HERE"

client = flight.FlightClient("grpc+tls://data.dremio.cloud:443")

headers = [

    (b"authorization", f"bearer {token}".encode())

]

3. Create the Iceberg Table Through SQL

Use FlightSQL to send DDL statements.

command = """

CREATE TABLE IF NOT EXISTS demo.users (

    id INT,

    name VARCHAR

)

USING iceberg

"""

descriptor = flight.FlightDescriptor.for_command(command)

options = flight.FlightCallOptions(headers=headers)

# Run the statement and read the result stream so it completes

flight_info = client.get_flight_info(descriptor, options)

client.do_get(flight_info.endpoints[0].ticket, options).read_all()

Dremio creates the table in your project, backed by the Iceberg catalog.

4. Insert Data Using FlightSQL

You send INSERT statements to load data.

insert_cmd = """

INSERT INTO demo.users VALUES

(1, 'Alice'),

(2, 'Bob')

"""

descriptor = flight.FlightDescriptor.for_command(insert_cmd)

flight_info = client.get_flight_info(descriptor, options)

client.do_get(flight_info.endpoints[0].ticket, options).read_all()

For larger datasets, you prepare batches in Python and load them with bulk insert patterns or temporary staging tables. The core idea stays the same: PyArrow sends the SQL, Dremio writes the data into Iceberg.
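
One way to sketch that batching, assuming you reuse the client, headers, and options from the steps above, is to chunk an Arrow table and send one INSERT per chunk. The insert_users_in_batches helper and its literal-building logic are illustrative only; a real pipeline should prefer staging tables or parameterized statements over string interpolation.

from pyarrow import flight

def insert_users_in_batches(client, options, arrow_table, target="demo.users", batch_size=1000):
    # Illustrative helper: build small INSERT ... VALUES statements from an Arrow table
    rows = arrow_table.to_pylist()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        values = ", ".join(f"({r['id']}, '{r['name']}')" for r in batch)
        sql = f"INSERT INTO {target} VALUES {values}"
        descriptor = flight.FlightDescriptor.for_command(sql)
        info = client.get_flight_info(descriptor, options)
        client.do_get(info.endpoints[0].ticket, options).read_all()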

Ingesting Data with Bauplan

Bauplan is a Python-native lakehouse platform that writes and versions Apache Iceberg tables directly on object storage, while exposing those tables through a standard Iceberg REST catalog. It provides a managed execution environment for Python models and ingestion jobs, and applies Git-style semantics to data: branches, commits, and merges are first-class operations. Every ingestion or transformation produces an explicit, versioned Iceberg snapshot, which makes changes auditable, reversible, and safe to promote across environments.

With Bauplan, you can create Iceberg tables from existing Parquet files, append new data incrementally, or materialize the output of Python models as managed tables. All operations run against an explicit branch, so development and validation happen in isolation before merging into main. Once written, the tables are immediately queryable by external engines like Dremio through Bauplan’s Iceberg REST catalog, without copying data or maintaining separate metadata systems.

1. Connect Bauplan to Dremio catalog

Before you run any ingestion code, do a one-time setup so Bauplan can write Iceberg and Dremio can discover and query it. On the Bauplan side, your script just needs to authenticate (typically by exporting BAUPLAN_API_KEY, selecting a BAUPLAN_PROFILE, or using ~/.bauplan/config.yml), which the Bauplan client picks up automatically.

On the Dremio side, add a new Iceberg REST Catalog source that points to Bauplan’s Iceberg REST endpoint (https://api.use1.aprod.bauplanlabs.com/iceberg) and configure it to send Authorization: Bearer <token> using a Bauplan API key (ideally from a read-only Bauplan user).

Because Bauplan stores Iceberg metadata + Parquet data directly in your object store, Dremio must also be able to read that same storage location using its own S3/Azure credentials (Bauplan does not proxy storage access).

Dremio enables “Use vended credentials” by default for Iceberg REST catalogs; if the catalog supports credential vending, Dremio can query storage without additional configuration. If the catalog does not vend credentials, disable this option and provide S3/Azure storage authentication under Advanced Options so Dremio can read the table files directly from your bucket.

If you want Dremio to browse a data branch other than main, register a separate Dremio source pointing at the branch-scoped catalog endpoint (for example .../iceberg/<your_username>.<branch_name>). Once this is configured, Dremio automatically sees Bauplan namespaces and tables via the REST catalog, with no manual refresh workflow.

2. Create an Iceberg Table from Parquet Files

If your data already lives in Parquet files, Bauplan can create a table and infer the schema with two calls.

import bauplan

client = bauplan.Client()

client.create_table(table="demo.taxi", search_uri="s3://my-bucket/nyc_taxi/*.parquet", branch="branch_name", replace=True)
print(f"\n 🧊🧊 New Iceberg table created \n")

Bauplan scans the Parquet files, builds the schema, creates the table, and registers the metadata. Those tables are now visible in Dremio through the catalog source configured earlier.

3. Import External Files into an Existing Table

You can append new Parquet files to an Iceberg table.

import bauplan

client = bauplan.Client()

client.import_data(table="demo.taxi", search_uri="s3://my-bucket/nyc_taxi/*.parquet", branch="branch_name",)
print(f"\n 🧊🧊 Data imported in your new Iceberg table \n")

Bauplan writes new data files, updates the metadata, and commits a new snapshot.

4. Build Tables Using Python Models

You can also define a Python function and have Bauplan materialize the result as an Iceberg table.

@bauplan.model(materialization_strategy="REPLACE")
@bauplan.python("3.11", pip={"polars": "1.35.2"})
def my_parent(
    trips=bauplan.Model(
        name="demo.taxi",
        columns=[
            "pickup_datetime",
            "PULocationID",
            "DOLocationID",
            "trip_miles",
        ],
        filter="pickup_datetime >= '2023-03-01T00:00:00-05:00' AND pickup_datetime < '2023-06-01T00:00:00-05:00'",
    ),
    zones=bauplan.Model(
        name="demo.taxi_metadata",
        columns=[
            "LocationID",
            "Borough",
        ]
    ),
):
    import polars as pl

    # trips and zones are Arrow-backed; convert them to Polars DataFrames.
    trips_df = pl.from_arrow(trips)
    zones_df = pl.from_arrow(zones)

    # Join trips and zones.
    joined_df = trips_df.join(zones_df, left_on="DOLocationID", right_on="LocationID", how="inner")
    print("\n\n ❤️❤️ Bauplan + Dremio ❤️❤️\n\n")

    return joined_df.to_arrow()

Because of the materialization_strategy="REPLACE" flag, running the model writes the returned DataFrame (which, under the hood, is automatically converted into an Arrow table) to the target Iceberg table. Bauplan handles the schema, Parquet files, and commit.

5. Work with Branches and Safe Changes

Because Bauplan uses a catalog based on the Nessie backend, you can branch your data:

import bauplan

def create_import_branch(branch_name: str) -> bool:
    client = bauplan.Client()
    if client.has_branch(branch_name):
        client.delete_branch(branch_name)
        print(f"Branch {branch_name} already exists, deleted it first...")

    client.create_branch(branch_name, from_ref="main")
    assert client.has_branch(branch_name), "Branch creation failed"
    print(f"🌿 Branch {branch_name} from main created!")

    return True

You then run your models or imports on the branch. After validation, you merge the branch into main:

def merge_and_cleanup(client: bauplan.Client, branch_name: str) -> bool:
    print(f"\n    Merging {branch_name} into main...")
    client.merge_branch(source_ref=branch_name, into_branch="main")

    print(f"    Deleting branch {branch_name}...\n")
    client.delete_branch(branch_name)

    print(f"✅ Branch {branch_name} merged in main\n")

    return True

This pattern gives you a safe development path without risk to production tables.
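
Putting the pieces together, an end-to-end sketch using only the calls shown above might look like this. The branch name and S3 path are placeholders.

import bauplan

client = bauplan.Client()
branch_name = "your_username.ingest_demo"  # placeholder branch name

# 1. Create (or recreate) an isolated branch off main
create_import_branch(branch_name)

# 2. Ingest into the branch, not into main
client.import_data(
    table="demo.taxi",
    search_uri="s3://my-bucket/nyc_taxi/*.parquet",
    branch=branch_name,
)

# 3. After validation, promote the branch and clean up
merge_and_cleanup(client, branch_name)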

Ingesting Data with PySpark

PySpark distributes work across many nodes, reads large files with ease, and connects to most data sources. PySpark also has full Iceberg support through the Iceberg Spark extensions. This makes it a reliable choice for ingestion, heavy transforms, and recurring pipelines.

PySpark works with the Dremio Catalog in two ways.

You can pass a bearer token directly. Or you can use Dremio’s Auth Manager. The Auth Manager is being contributed to the Iceberg project. It adds advanced logic for token exchange and token refresh. It is the safer choice in long-running Spark jobs because it handles expired tokens without breaking the job.

The workflow stays simple. You configure the Spark session. You read your data. You write it into an Iceberg table. Spark handles the files, metadata, and commit.

1. Configure Spark with the Basic Dremio Catalog Settings

This example uses a direct bearer token. Use placeholder values in your code:

from pyspark.sql import SparkSession

spark = (

    SparkSession.builder

        .appName("iceberg_ingest")

        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

        .config("spark.sql.catalog.dremio", "org.apache.iceberg.spark.SparkCatalog")

        .config("spark.sql.catalog.dremio.type", "rest")

        .config("spark.sql.catalog.dremio.uri", "https://catalog.dremio.cloud/api/iceberg")

        .config("spark.sql.catalog.dremio.oauth2-server-uri", "https://login.dremio.cloud/oauth/token")

        .config("spark.sql.catalog.dremio.token", "YOUR_TOKEN_HERE")

        .config("spark.sql.catalog.dremio.warehouse", "your_project_name")

        .config("spark.sql.catalog.dremio.header.X-Iceberg-Access-Delegation", "vended-credentials")

        .getOrCreate()

)

This setup works well for short jobs or interactive use.

2. Configure Spark with the Dremio Auth Manager

For production jobs, long-running pipelines, or scheduled Spark clusters, the Auth Manager is a better option. It handles token refresh, token exchange, and expiration. The configuration looks like this:

import pyspark

conf = (

    pyspark.SparkConf()

        .setAppName("DremioIcebergSparkApp")

        .set(

            "spark.jars.packages",

            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.9.2,"

            "org.apache.iceberg:iceberg-aws-bundle:1.9.2,"

            "com.dremio.iceberg.authmgr:authmgr-oauth2-runtime:0.0.5"

        )

        # Enable Iceberg Spark extensions

        .set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")

        # Use REST catalog

        .set("spark.sql.catalog.dremio", "org.apache.iceberg.spark.SparkCatalog")

        .set("spark.sql.catalog.dremio.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")

        .set("spark.sql.catalog.dremio.uri", DREMIO_CATALOG_URI)

        .set("spark.sql.catalog.dremio.warehouse", CATALOG_NAME)

        .set("spark.sql.catalog.dremio.cache-enabled", "false")

        .set("spark.sql.catalog.dremio.header.X-Iceberg-Access-Delegation", "vended-credentials")

        # Enable the Dremio Auth Manager

        .set("spark.sql.catalog.dremio.rest.auth.type", "com.dremio.iceberg.authmgr.oauth2.OAuth2Manager")

        .set("spark.sql.catalog.dremio.rest.auth.oauth2.token-endpoint", DREMIO_AUTH_URI)

        .set("spark.sql.catalog.dremio.rest.auth.oauth2.grant-type", "token_exchange")

        .set("spark.sql.catalog.dremio.rest.auth.oauth2.client-id", "dremio")

        .set("spark.sql.catalog.dremio.rest.auth.oauth2.scope", "dremio.all")

        .set("spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token", DREMIO_PAT)

        .set(

            "spark.sql.catalog.dremio.rest.auth.oauth2.token-exchange.subject-token-type",

            "urn:ietf:params:oauth:token-type:dremio:personal-access-token"

        )

)

The Auth Manager handles authentication for the entire Spark session.
You do not need to manage refresh logic yourself.
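
To use this configuration, pass the conf object when you build the session. A minimal sketch, assuming the SparkConf from step 2 is in scope:

from pyspark.sql import SparkSession

# Build the session from the SparkConf defined above
spark = SparkSession.builder.config(conf=conf).getOrCreate()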

3. Read Data Into a DataFrame

Spark loads many formats:

df = spark.read.csv("data/users.csv", header=True, inferSchema=True)

Or from a database:

jdbc_df = (

    spark.read.format("jdbc")

        .option("url", "jdbc:postgresql://host/db")

        .option("dbtable", "users")

        .option("user", "myuser")

        .option("password", "mypassword")

        .load()

)

4. Create an Iceberg Table

Use Spark SQL:

spark.sql("""

CREATE TABLE IF NOT EXISTS dremio.demo.users (

    id INT,

    name STRING

) USING iceberg

""")

5. Write Data Into Iceberg

You can use SQL:

df.createOrReplaceTempView("staging_users")

spark.sql("""

INSERT INTO dremio.demo.users

SELECT id, name FROM staging_users

""")

Or use the DataFrame writer:

df.writeTo("dremio.demo.users").append()

6. Confirm the Write

spark.sql("SELECT * FROM dremio.demo.users").show()

Summary

PySpark handles large datasets, complex transforms, and distributed workloads. You can use a simple bearer-token setup for short jobs. For stable, long-running ingestion, the Dremio Auth Manager adds safe token exchange and refresh. Both configurations work with the Dremio Catalog and produce clean, atomic Iceberg snapshots.

Ingesting Data with Daft

Daft is a Python DataFrame library built for scalable data processing. It integrates directly with Apache Iceberg via PyIceberg and supports writing data into Iceberg tables using a clean DataFrame-style API. It works well for Python ingestion jobs where you want PySpark-like power with simpler, native syntax.

Daft uses PyIceberg under the hood, so it supports REST catalog connections, including Dremio’s catalog. You configure your catalog using PyIceberg settings, then Daft uses that connection to write data into Iceberg.

1. Load a REST Catalog

Daft relies on PyIceberg’s catalog interface. You first define a REST catalog, with bearer token and access delegation:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(

    "dremio",

    {

        "uri": "https://catalog.dremio.cloud/api/iceberg",

        "oauth2-server-uri": "https://login.dremio.cloud/oauth/token",

        "token": "YOUR_TOKEN_HERE",

        "warehouse": "your_project_name",

        "header.X-Iceberg-Access-Delegation": "vended-credentials",

        "type": "rest"

    }

)

This uses Dremio’s token-based auth and credential vending. Once loaded, you can create or load tables using catalog.load_table().

2. Create a Daft DataFrame

Daft lets you build DataFrames from Python-native data, files, or cloud sources:

import daft

df = daft.from_pydict({

    "id": [1, 2, 3],

    "name": ["Alice", "Bob", "Carol"]

})

Daft supports Arrow tables under the hood, which makes it compatible with Iceberg formats.
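
You can also read files straight into a Daft DataFrame. A small sketch with a placeholder path:

import daft

# Read a local CSV into a Daft DataFrame; Parquet works the same way with daft.read_parquet
df = daft.read_csv("data/users.csv")
df.show()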

3. Write Data Into an Iceberg Table

You use the write_iceberg() method and pass in a PyIceberg table reference.

iceberg_table = catalog.load_table("demo.users")

df.write_iceberg(iceberg_table, mode="append")

This writes the data into Iceberg in append mode. You can also use "overwrite" mode to replace the table snapshot. After the write completes, Daft returns a DataFrame with the write metadata (e.g. file paths and row counts).

4. Output and Verification

You can query the written data using any Iceberg-compatible engine, including Dremio, Spark, or PyIceberg.

If needed, you can inspect the result of the write:

write_result = df.write_iceberg(iceberg_table, mode="append")

write_result.show()

This confirms the number of rows and files written in the snapshot.

Limitations

  • Daft supports append and overwrite modes only.
  • Upserts, deletes, and schema evolution must be handled through PyIceberg or another tool.
  • Partitioning and table creation require using PyIceberg directly, then writing data with Daft (see the sketch below).
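
A sketch of that split, reusing the catalog from step 1 and the same two-column schema used earlier in this post (namespace and field IDs are placeholders):

import daft
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, IntegerType, StringType

# Create the table with PyIceberg first
schema = Schema(
    NestedField(1, "id", IntegerType(), required=False),
    NestedField(2, "name", StringType(), required=False),
)
iceberg_table = catalog.create_table("demo.users", schema=schema)

# Then write the data with Daft
df = daft.from_pydict({"id": [1, 2], "name": ["Alice", "Bob"]})
df.write_iceberg(iceberg_table, mode="append")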

Summary

Daft is a DataFrame engine that supports native writes to Iceberg through PyIceberg. It works well when you want readable code, fast local execution, and Iceberg output. If you're already using PyIceberg, Daft can simplify the ingestion layer without giving up control.

Ingesting Data with SpiceAI

Spice.ai is a query engine and data runtime built for analytics and time-series applications. It supports SQL-based ingestion and can write data into Apache Iceberg tables using standard INSERT INTO statements. The platform is built for declarative pipelines, so ingestion is expressed as part of a Spicepod configuration and executed by the runtime.

Spice connects to Iceberg through its built-in connectors. It supports REST catalogs, bearer tokens, and OAuth2. Once configured, it can insert data into Iceberg tables using SQL, either from static values or from other sources like APIs and cloud storage.

1. Define the Catalog in a Spicepod

You describe the Iceberg connection in a YAML file (typically spicepod.yaml). Here is a sample config:

name: iceberg_ingest

version: 0.1.0

datasets:

  - name: my_table

    from: iceberg:demo.my_table

    access: read_write

    iceberg_uri: https://catalog.dremio.cloud/api/iceberg

    iceberg_token: ${secrets:iceberg_token}

Spice also supports OAuth2 if needed:

    iceberg_oauth2_credential: ${secrets:oauth_token}

    iceberg_oauth2_server_url: https://login.dremio.cloud/oauth/token

Tokens are pulled from the secrets system, not hardcoded.

2. Use SQL to Insert Into Iceberg

Spice allows SQL-based ingestion. For example:

INSERT INTO demo.my_table (id, name)

VALUES (1, 'Alice'), (2, 'Bob');

You can also ingest from another dataset or API:

INSERT INTO demo.my_table

SELECT id, name FROM another_dataset;

Spice runs these SQL commands as part of your pipeline or on-demand.

3. Run the Ingestion Job

You can run the job from the CLI or from Python using the spicepy SDK:

import spicepy

client = spicepy.Client()

client.execute("INSERT INTO demo.my_table VALUES (3, 'Carol')")

This sends the SQL to the Spice runtime, which performs the write into Iceberg.

4. Authentication and Catalog Support

Spice connects to Iceberg REST catalogs, including:

  • Dremio
  • AWS Glue
  • Custom REST endpoints

It supports:

  • Bearer token authentication
  • OAuth2 token exchange
  • Secrets management for tokens

This makes it production-safe and easy to run without exposing credentials.

Limitations

  • Iceberg writes are currently append-only.
  • No support for UPDATE, DELETE, or MERGE.
  • Schema evolution is possible but must be managed outside the SQL layer.
  • Requires running a Spice runtime (local or hosted).

Summary

SpiceAI is a good option when you want SQL-driven ingestion into Iceberg.
It supports REST catalogs and token-based auth out of the box.
It’s especially useful when building pipelines declaratively or integrating with time-series or cloud-native sources.

Ingesting Data with DuckDB

DuckDB is an embedded SQL engine designed for analytics. It supports reading and writing to Apache Iceberg tables using its built-in iceberg extension. Once the extension is loaded, you can attach a REST catalog, authenticate using secrets, and write data using standard SQL.

DuckDB works well for small-to-medium ingestion jobs where you want fast, in-process execution and full SQL control. It also runs in Python, so you can embed ingestion logic directly into your scripts.

1. Load the Iceberg Extension in SQL

DuckDB includes Iceberg support as an extension. You need to load it before use:

INSTALL 'iceberg';

LOAD 'iceberg';

-- You also need httpfs if you’re connecting to a REST catalog:

INSTALL 'httpfs';

LOAD 'httpfs';

2. Authenticate with a REST Catalog (e.g., Dremio)

You use DuckDB’s CREATE SECRET feature to store OAuth or bearer token credentials:

CREATE SECRET dremio_secret (

  TYPE iceberg,

  CLIENT_ID 'dremio',

  CLIENT_SECRET 'YOUR_TOKEN_HERE',

  OAUTH2_SERVER_URI 'https://login.dremio.cloud/oauth/token'

);

Then attach the catalog using the REST endpoint and secret:

ATTACH 'dremio_catalog' AS dremio (

  TYPE iceberg,

  ENDPOINT 'https://catalog.dremio.cloud/api/iceberg',

  SECRET dremio_secret

);

This allows DuckDB to read and write to Iceberg tables managed by Dremio.

3. Create and Insert Into an Iceberg Table

Once attached, you create and populate tables like any SQL database:

CREATE TABLE dremio.demo.users (

  id INTEGER,

  name VARCHAR

);

INSERT INTO dremio.demo.users VALUES (1, 'Alice'), (2, 'Bob');

DuckDB writes the data as Parquet files and updates Iceberg metadata using the REST catalog.

4. Use DuckDB from Python

You can execute all of the above directly in Python:

import duckdb

con = duckdb.connect()

con.execute("INSTALL 'iceberg';")

con.execute("LOAD 'iceberg';")

con.execute("LOAD httpfs;")

con.execute("""

CREATE SECRET dremio_secret (

  TYPE iceberg,

  CLIENT_ID 'dremio',

  CLIENT_SECRET 'YOUR_TOKEN_HERE',

  OAUTH2_SERVER_URI 'https://login.dremio.cloud/oauth/token'

);

""")

con.execute("""

ATTACH 'dremio_catalog' AS dremio (

  TYPE iceberg,

  ENDPOINT 'https://catalog.dremio.cloud/api/iceberg',

  SECRET dremio_secret

);

""")

con.execute("INSERT INTO dremio.demo.users VALUES (3, 'Carol');")

This makes DuckDB a useful embedded tool for Python ingestion jobs that need SQL control and REST catalog support.
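
To double-check the write from the same script, you can query the attached catalog with the same connection. A short sketch:

# Read the rows back through the attached Dremio catalog
rows = con.execute("SELECT * FROM dremio.demo.users;").fetchall()
print(rows)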

Limitations

  • DuckDB only supports append-mode writes to Iceberg.
  • No UPDATE, DELETE, or MERGE support as of now.
  • Schema evolution must be handled externally.
  • Requires explicit loading of extensions and secrets setup.

Summary

DuckDB is a lightweight SQL engine with native Iceberg write support.
It connects to REST catalogs like Dremio, handles authentication, and executes INSERT statements with low overhead.

End-to-End Example: Ingest CSV into Iceberg with PyIceberg and Dremio Catalog

This example walks through a full ingestion pipeline:
read a CSV file, convert it into a PyArrow table, write it into an Iceberg table using PyIceberg, and verify the results, all using a Dremio REST catalog with bearer token authentication.

1. Set Up the Catalog

Configure the catalog using your Dremio project name and personal access token:

from pyiceberg.catalog import load_catalog

catalog = load_catalog(

    "dremio",

    {

        "uri": "https://catalog.dremio.cloud/api/iceberg",

        "oauth2-server-uri": "https://login.dremio.cloud/oauth/token",

        "token": "YOUR_TOKEN_HERE",

        "warehouse": "your_project_name",

        "header.X-Iceberg-Access-Delegation": "vended-credentials",

        "type": "rest"

    }

)

Make sure the token is stored in an environment variable or a secrets manager in real use.

2. Read and Convert the CSV File

Use PyArrow to read the file and prepare it for ingestion:

import pyarrow.csv as csv

arrow_table = csv.read_csv("users.csv")

You can inspect the schema and rows:

print(arrow_table.schema)

print(arrow_table.to_pylist())

3. Create or Load the Iceberg Table

Define the schema and create the table, or load it if it already exists:

from pyiceberg.schema import Schema

from pyiceberg.types import NestedField, IntegerType, StringType

from pyiceberg.exceptions import TableAlreadyExistsError

schema = Schema(

    NestedField(1, "id", IntegerType(), required=False),

    NestedField(2, "name", StringType(), required=False),

    NestedField(3, "email", StringType(), required=False)

)

table_identifier = "demo.users"

try:

    table = catalog.create_table(table_identifier, schema=schema)

except TableAlreadyExistsError:

    table = catalog.load_table(table_identifier)

4. Write the Data into Iceberg

Convert the Arrow table if needed and append it:

table.append(arrow_table)

This creates a new snapshot, writes Parquet files, and commits the metadata.
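
If the CSV reader infers types that do not exactly match the Iceberg schema (for example int64 instead of int32), one option is to cast the Arrow table to the table's Arrow schema before appending. This sketch assumes a PyIceberg version where Schema.as_arrow() is available, as used earlier in this post.

# Align Arrow column types with the Iceberg table schema before appending
arrow_table = arrow_table.cast(table.schema().as_arrow())
table.append(arrow_table)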

5. Verify the Write

You can scan the table with PyIceberg:

for row in table.scan().to_arrow().to_pylist():

    print(row)

Or connect through Dremio or Spark and run:

SELECT * FROM demo.users;

The new data will be immediately visible.

What This Shows

  • Full use of Dremio REST catalog with bearer token
  • Ingestion of local file into Iceberg
  • Clean table creation and append
  • Safe and atomic snapshot commit

You can adapt this to any data source (APIs, databases, or transformed pipelines) by converting the result into an Arrow table and writing through PyIceberg or another compatible tool.

Conclusion

Apache Iceberg gives you a reliable, open standard for managing data in the lakehouse. With the right tools, you can build ingestion pipelines entirely in Python: no JVM, no lock-in, and no extra complexity.

In this post, you explored how to ingest data into Iceberg using:

  • PyIceberg
  • PyArrow with Dremio
  • Bauplan
  • Daft
  • SpiceAI
  • DuckDB
  • PySpark

You also saw how to connect each tool to a REST catalog like Dremio Catalog, using bearer tokens and vended credentials to keep your pipelines secure and portable, and walked through an end-to-end example that you can adapt to any data source.

Every tool you use can write clean, atomic data into Iceberg. Once there, the table is immediately queryable by Dremio or any other engine that speaks Iceberg. That’s the power of open formats and shared catalogs: one standard, many tools, zero friction.

Choose the tool that fits your workload. Stay close to the data. Keep your pipeline simple.
And let Iceberg handle the rest.

Try Dremio Catalog by creating a free trial account with Dremio Today!

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.