December 19, 2025

Data Ingestion Patterns Using Dremio: From Raw Data to Apache Iceberg

Alex Merced · Head of DevRel, Dremio

Modern data platforms are no longer built around monolithic warehouses or tightly coupled ingestion pipelines. Instead, organizations are standardizing on open lakehouse architectures, where data is stored in open formats, governed by shared catalogs, and processed by multiple engines based on workload.

At the center of this shift is Apache Iceberg, which has emerged as the de facto table format for analytic and AI workloads on object storage. Iceberg brings transactional guarantees, schema evolution, time travel, and partition evolution to data lakes, capabilities that were once exclusive to proprietary systems.

However, adopting Iceberg is only part of the story. Teams still face a critical question:

How do you ingest data into Iceberg efficiently, reliably, and at scale, without rebuilding complex ETL infrastructure?

This is where Dremio plays a key role.

Dremio is a lakehouse query and processing engine that natively reads from and writes to Apache Iceberg tables. Rather than introducing a proprietary ingestion framework, Dremio enables ingestion through SQL, file-based loading, and programmatic APIs, allowing teams to use the same engine for exploration, transformation, and data delivery.

Importantly, Dremio is not an orchestration layer. Workflow orchestration remains the responsibility of tools such as Apache Airflow, dbt, or custom pipelines built with DremioFrame, which can schedule, coordinate, and trigger ingestion workloads that execute on Dremio. This separation keeps architectures modular and flexible while allowing Dremio to focus on what it does best: fast, scalable data processing on open data.

In this post, we’ll walk through the most common data ingestion patterns with Dremio, focusing on ingesting data into Apache Iceberg tables managed through an open catalog. We’ll cover when to use each approach, how they fit into real-world pipelines, and best practices for choosing the correct pattern based on your data sources and workloads.

Whether you’re loading ad hoc datasets, migrating existing tables, ingesting files from object storage, or pulling data from APIs and databases, Dremio provides multiple paths to bring data into Iceberg, without locking you into a single engine or ingestion tool.

What Is Dremio: The Agentic Lakehouse Platform

Dremio is the Agentic Lakehouse, a data platform built for AI agents and managed by agents. It unifies data, governance, and business context to enable fast, accurate analytics and AI workflows directly on open data, without pipelines, lock-in, or manual optimization.

At its foundation, Dremio is a high-performance data processing engine built on Apache Arrow and optimized for Apache Iceberg. Iceberg is Dremio’s first-class table format for both reads and writes, allowing users to create, ingest, and evolve analytical tables directly in object storage with full transactional guarantees. Tables written by Dremio are immediately interoperable with other Iceberg-compatible engines, preserving the openness of the lakehouse.

What distinguishes Dremio from traditional query engines is its agentic architecture, which combines AI-driven interaction, autonomous operations, and semantic understanding of data:

Integrated AI Agent and MCP Server

Dremio includes a built-in AI agent that can run queries, generate visualizations, explain SQL, and suggest optimizations using natural language. This capability extends beyond the Dremio UI through an MCP (Model Context Protocol) server, which exposes Dremio’s semantic understanding of data to external clients and tools. Together, these capabilities allow AI agents and users to interact with data more naturally and productively.

AI Functions for Unstructured Data

Dremio brings AI directly into SQL through AI Functions, enabling teams to transform unstructured content, such as PDFs, documents, and images, into structured, queryable data. These functions make it possible to ingest and analyze data that would traditionally require complex preprocessing pipelines, expanding what “data ingestion” means in a lakehouse context.

Autonomous Performance Management

Operating an Iceberg lakehouse at scale typically requires continuous tuning and maintenance. Dremio eliminates this burden through autonomous performance management, including automatic table optimization, results caching, query planning caches, and Autonomous Reflections. These capabilities continuously optimize performance and cost as data volumes and workloads evolve, without manual intervention.

Dremio Open Catalog: Apache Polaris–Based

Dremio includes a built-in, fully managed lakehouse catalog, Dremio Open Catalog, powered by Apache Polaris. The catalog tracks, governs, and secures Iceberg tables while enabling interoperable access across engines through standard Iceberg REST APIs. This ensures that data ingested into Iceberg remains discoverable, governed, and reusable across the broader ecosystem.

Integrated Semantic Layer

Dremio provides a first-class semantic layer that includes views, tags, wikis, and end-to-end lineage. This semantic context is not only consumed by users, but also leveraged by the AI agent and MCP server to deliver more accurate and meaningful results. The semantic layer spans both native Iceberg tables and virtualized data from databases, data warehouses, and data lakes, enabling agentic analytics across the entire data estate.

By combining first-class Iceberg support, autonomous lakehouse management, and AI-driven interaction, Dremio enables organizations to move from fragmented data silos to performant, agentic analytics on unified data, often overnight rather than through long, multi-year platform migrations.

What Is Dremio Open Catalog: An Apache Polaris–Based Lakehouse Catalog

An open lakehouse requires more than an open table format. It also needs a shared, interoperable catalog that tracks table metadata, enforces governance, and allows multiple engines to safely read and write the same data. This is the role of Dremio Open Catalog (DOC).

Dremio Open Catalog is a lakehouse catalog built directly into the Dremio platform, powered by Apache Polaris. It provides a fully managed catalog for Apache Iceberg tables, enabling organizations to govern, secure, and share data without introducing proprietary metadata layers or locking themselves into a single compute engine.

Built on Apache Polaris and Iceberg REST

At its core, DOC implements the Apache Iceberg REST catalog specification via Apache Polaris. This means Iceberg tables registered in Dremio Open Catalog can be accessed by any engine that supports the Iceberg REST API, including Spark, Flink, Trino, and others.

This architecture ensures:

  • Interoperability: Tables ingested through Dremio are immediately available to other Iceberg-compatible engines.
  • Consistency: All engines operate against the same catalog metadata and transactional state.
  • Openness: Metadata remains portable and standards-based, avoiding proprietary lock-in.
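
For example, an external Spark job can read or write the same tables through the catalog’s Iceberg REST endpoint. The sketch below is illustrative only: the catalog name, endpoint URI, credentials, Iceberg runtime version, and table name are placeholders, not actual Dremio Open Catalog settings.

# Illustrative PySpark configuration for an Iceberg REST catalog.
# Endpoint, credentials, and package version are placeholders; use the
# connection details published by your catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-rest-interop")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "lakehouse" that speaks the Iceberg REST spec
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://<catalog-endpoint>/api/catalog")
    .config("spark.sql.catalog.lakehouse.credential", "<client-id>:<client-secret>")
    .getOrCreate()
)

# Tables ingested through Dremio are visible to Spark via the same catalog
spark.sql("SELECT COUNT(*) FROM lakehouse.analytics.orders").show()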

First-Class Governance for Iceberg Tables

Dremio Open Catalog is not just a metadata registry. It is a governance layer that provides fine-grained access control, auditing, and lineage for Iceberg tables. Permissions are enforced consistently whether data is queried interactively, ingested via SQL, or accessed programmatically.

Because the catalog is integrated into the Dremio platform, governance is applied automatically as data is created or ingested, without requiring separate systems to synchronize policies or metadata.

Designed for Ingestion and Evolution

Ingestion is one of the most demanding phases of the data lifecycle, especially in Iceberg-based lakehouses where tables continuously evolve. Dremio Open Catalog is designed to support:

  • Transactional table creation and writes
  • Schema evolution during ingestion
  • Partition evolution without rewrites
  • Safe concurrent access from multiple engines

This makes DOC a natural foundation for ingestion pipelines that start with raw data and mature into curated, shared Iceberg tables over time.
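
As a rough sketch of what evolution looks like in practice, partition changes can be expressed as plain DDL against the cataloged table. The table name and transform below are illustrative, and the exact syntax should be confirmed against your Dremio version.

-- Sketch: evolve partitioning on an existing Iceberg table. New data written
-- after this statement follows the new partition spec; existing data files
-- are not rewritten.
ALTER TABLE analytics.orders ADD PARTITION FIELD month(order_date);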

A Shared Foundation for Agentic Analytics

Dremio Open Catalog also plays a critical role in enabling agentic analytics. By centralizing metadata, permissions, and table definitions, the catalog provides the trusted foundation that Dremio’s AI agent and semantic layer rely on to understand data, apply context, and deliver accurate results.

In practice, this means that once data is ingested into Iceberg and registered in Dremio Open Catalog, it becomes immediately discoverable, governable, and usable by humans, AI agents, and external engines alike.

With an open catalog in place, the question shifts from where data lives to how it should be ingested. In the next sections, we’ll look at the different ingestion paths Dremio provides, starting with simple, ad hoc workflows and progressing toward fully automated, production-grade patterns.

The Ingestion Landscape: Iceberg REST and Dremio-Native Ingestion Paths

With Apache Iceberg and an open catalog in place, ingestion becomes far more flexible than in traditional data platforms. Instead of being tied to a single engine or proprietary pipeline framework, organizations can choose from multiple ingestion paths based on data volume, latency, and operational complexity.

Because Dremio Open Catalog implements the Iceberg REST specification, any engine that supports Iceberg REST can ingest data into tables registered in the catalog. This enables a broad ecosystem of tools, including batch engines, streaming frameworks, and custom applications, to safely create and update Iceberg tables while sharing a single source of truth for metadata and governance.

At the same time, Dremio provides several native ingestion mechanisms that are tightly integrated with its engine, catalog, and semantic layer. These options are often simpler to operate, require fewer moving parts, and automatically benefit from Dremio’s autonomous performance management and governance capabilities.

Broadly, ingestion into Iceberg using Dremio falls into four categories:

1. Ad Hoc and Interactive Ingestion

For exploratory or one-time datasets, Dremio supports interactive ingestion workflows directly in the UI. These are designed for speed and accessibility, allowing analysts and engineers to quickly turn files into Iceberg tables without writing pipelines or provisioning infrastructure.

2. SQL-Based Ingestion and Transformation

Dremio’s SQL engine can create and incrementally update Iceberg tables using familiar patterns such as CREATE TABLE AS SELECT and INSERT INTO SELECT. This approach is well suited for:

  • Migrating data from existing sources
  • Building curated tables from raw datasets
  • Performing incremental ingestion based on timestamps or identifiers

Because these operations are transactional and Iceberg-native, they can be safely re-run and integrated into scheduled workflows.
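
As a rough illustration of that hand-off, an orchestrator such as Apache Airflow can simply submit the SQL to Dremio on a schedule. The sketch below assumes Dremio’s SQL REST endpoint; the host, token, endpoint path, and table names are placeholders for your deployment.

# Hypothetical Airflow DAG that submits an incremental ingestion statement to
# Dremio's SQL REST API on a schedule. Host, token, endpoint path, and table
# names are placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

DREMIO_URL = "https://<your-dremio-host>"      # placeholder
DREMIO_TOKEN = "<personal-access-token>"       # placeholder

INGEST_SQL = """
INSERT INTO analytics.daily_orders
SELECT order_id, customer_id, order_ts, total_amount
FROM postgres.sales.orders
WHERE order_ts > (SELECT MAX(order_ts) FROM analytics.daily_orders)
"""

def run_ingest():
    # Dremio executes the statement and commits a new Iceberg snapshot
    response = requests.post(
        f"{DREMIO_URL}/api/v3/sql",
        headers={"Authorization": f"Bearer {DREMIO_TOKEN}"},
        json={"sql": INGEST_SQL},
        timeout=120,
    )
    response.raise_for_status()

with DAG(
    dag_id="orders_incremental_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_incremental_insert", python_callable=run_ingest)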

3. File-Based Loading from Object Storage

For ingestion patterns centered around files landing in object storage, Dremio provides file-oriented loading mechanisms that are optimized for bulk and continuous ingestion. These patterns are ideal for external data feeds, event-driven pipelines, and landing-zone architectures.

4. Programmatic Ingestion with DremioFrame

Some ingestion scenarios go beyond what can be expressed purely in SQL or through file ingestion. DremioFrame, a Python library built on top of the Dremio engine, enables programmatic ingestion from APIs, custom file formats, and JDBC-accessible systems, while still pushing execution down to Dremio and writing data into Iceberg tables.

Each of these ingestion paths serves a different purpose, and they are often used together within the same lakehouse. The key advantage of Dremio’s approach is that all roads lead to the same destination: governed, interoperable Apache Iceberg tables managed through an open catalog.

In the following sections, we’ll explore each of these ingestion patterns in detail, starting with the simplest workflows and progressing toward more advanced, production-grade ingestion pipelines, along with best practices for choosing the right approach for your use case.

File Upload in the UI: One-Time and Ad Hoc Ingestion into Iceberg

Not every ingestion workflow needs to start with a pipeline. For exploratory analysis, rapid prototyping, or one-off datasets, Dremio provides a simple UI-based file upload experience that allows users to ingest data directly into the lakehouse.

Through the Dremio UI, users can upload files in common formats such as CSV, JSON, and Parquet, preview their contents, and immediately query them using SQL. These uploaded files can then be transformed and materialized as Apache Iceberg tables, making them first-class citizens in the lakehouse rather than isolated artifacts.

This workflow is particularly valuable in early-stage analysis, where the goal is to move quickly from raw data to insight without standing up infrastructure or writing ingestion code.

How UI-Based Ingestion Works

The typical flow for UI-based ingestion looks like this:

  1. Upload a file through the Dremio UI.
  2. Inspect the inferred schema and data types.
  3. The file is materialized as an Iceberg table in Dremio Open Catalog.

Once materialized, the resulting Iceberg table is managed by Dremio Open Catalog, governed like any other dataset, and immediately available for querying by other users, tools, and engines.

When This Pattern Makes Sense

UI-based file uploads are best suited for:

  • Ad hoc datasets shared by partners or internal teams
  • Exploratory analysis and proof-of-concept work
  • Small reference datasets that change infrequently
  • Analyst-driven workflows where speed matters more than automation

This approach lowers the barrier to entry for working with Iceberg by allowing users to focus on data and SQL rather than ingestion infrastructure.
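
For instance, once a file has been uploaded, a short CTAS statement is often all that is needed to promote it into a curated table. The sketch below is illustrative: the home-space path, table name, and columns are placeholders.

-- Sketch: promote an uploaded dataset from a user's home space into a
-- curated Iceberg table. Source path and columns are placeholders.
CREATE TABLE analytics.customer_reference
AS
SELECT
  customer_id,
  customer_name,
  region
FROM "@alex.merced"."customer_reference_upload";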

Best Practices

To use UI-based ingestion effectively:

  • Convert uploaded files to Iceberg early
    Treat uploaded files as a staging step, not a long-term storage solution. Persist curated results into Iceberg tables as soon as possible.
  • Validate schemas before materializing
    Review inferred data types and column names to avoid propagating issues into downstream tables.
  • Avoid using UI uploads for recurring ingestion
    If a dataset needs to be refreshed regularly or scaled over time, transition to SQL-based, file-based, or programmatic ingestion patterns.

Limitations to Keep in Mind

While convenient, UI-based ingestion is intentionally scoped. It is not designed for:

  • Large-scale or high-frequency ingestion
  • Automated or scheduled pipelines
  • Complex schema evolution scenarios

For those use cases, Dremio’s SQL-based and file-based ingestion mechanisms provide more control and scalability.

UI uploads excel as a starting point: a fast path from raw data to governed Iceberg tables that teams can later replace with more robust ingestion patterns as requirements grow.

CTAS and INSERT INTO SELECT: SQL-Driven Ingestion and Incremental Updates

For most production ingestion workflows, SQL-based ingestion provides the best balance of simplicity, scalability, and control. Dremio supports this natively through CREATE TABLE AS SELECT (CTAS) and INSERT INTO SELECT, allowing teams to ingest, transform, and incrementally update Apache Iceberg tables using standard SQL.

These patterns are especially powerful because they:

  • Write transactionally to Iceberg
  • Work across federated sources (databases, warehouses, files, lakes)
  • Integrate cleanly into automated workflows
  • Support repeatable and incremental ingestion

Initial Loads with CREATE TABLE AS SELECT (CTAS)

CTAS is the most common way to perform an initial ingestion into Iceberg. It allows you to create a new Iceberg table directly from an existing source while applying transformations, filtering, and schema normalization.

Example: Migrating Data from an External Source into Iceberg

CREATE TABLE analytics.orders_iceberg
PARTITION BY (order_date)
AS
SELECT
  order_id,
  customer_id,
  CAST(order_timestamp AS DATE) AS order_date,
  order_status,
  total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp >= '2024-01-01';

In this example:

  • Data is read directly from an external source
  • The table is written in Apache Iceberg format
  • Partitioning is applied at creation time
  • Ingestion metadata is added explicitly

Once created, this Iceberg table is immediately registered in Dremio Open Catalog and available to other engines.

Best Practices for CTAS

  • Normalize schemas during ingestion to avoid propagating inconsistencies
  • Add ingestion timestamps for auditability and incremental logic
  • Choose partitions carefully based on query patterns, not source layouts
  • Prefer CTAS for initial loads and backfills, not ongoing updates

Incremental Ingestion with INSERT INTO SELECT

After the initial load, most datasets need to be updated incrementally. Dremio supports this pattern using INSERT INTO SELECT, appending new data to existing Iceberg tables in a fully transactional manner.

Example: Incremental Inserts Based on a Timestamp

INSERT INTO analytics.orders_iceberg
SELECT
  order_id,
  customer_id,
  CAST(order_timestamp AS DATE) AS order_date,
  order_status,
  total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp > (
  SELECT MAX(order_date)
  FROM analytics.orders_iceberg
);

This pattern:

  • Reads only new records from the source
  • Appends data safely to the Iceberg table
  • Can be rerun without affecting existing data

Example: Incremental Inserts Using a High-Watermark Table

For more control, many teams maintain a watermark table:

INSERT INTO analytics.orders_iceberg
SELECT
  o.order_id,
  o.customer_id,
  CAST(o.order_timestamp AS DATE) AS order_date,
  o.order_status,
  o.total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders o
JOIN ingestion_metadata.watermarks w
  ON o.order_timestamp > w.last_processed_ts
WHERE w.dataset_name = 'orders';

This approach is especially useful when:

  • Sources don’t guarantee ordering
  • Multiple ingestion jobs share state
  • Late-arriving data is expected
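
After a successful load, the watermark itself needs to be advanced. A minimal sketch follows, assuming the watermarks table is a writable Iceberg table and that the new high-water mark is computed and supplied by the orchestrating job.

-- Sketch: advance the watermark after a successful load. The literal value
-- stands in for the maximum source timestamp actually ingested, which the
-- orchestrating job would compute and substitute.
UPDATE ingestion_metadata.watermarks
SET last_processed_ts = TIMESTAMP '2024-06-01 00:00:00'
WHERE dataset_name = 'orders';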

Idempotency and Reliability Considerations

When using SQL-based ingestion in production, it’s important to design for reliability:

  • Avoid duplicates by filtering on immutable keys or timestamps
  • Prefer append-only ingestion when possible
  • Track ingestion state explicitly, not implicitly
  • Treat SQL as declarative ingestion logic, not procedural code

Because Iceberg guarantees atomic commits, failed ingestion jobs will not leave tables in a partially written state.
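
When duplicates are a real risk, for example when a source window overlaps previous runs, an upsert is a common alternative to a plain INSERT. The following is a hedged sketch using MERGE on the same tables as above; confirm the exact MERGE clause support in your Dremio version.

-- Sketch: idempotent upsert keyed on order_id. Re-running the statement
-- updates matched rows instead of inserting duplicates.
MERGE INTO analytics.orders_iceberg t
USING (
  SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS DATE) AS order_date,
    order_status,
    total_amount
  FROM snowflake.sales.orders
  WHERE order_timestamp >= '2024-06-01'
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  order_status = s.order_status,
  total_amount = s.total_amount
WHEN NOT MATCHED THEN INSERT
  VALUES (s.order_id, s.customer_id, s.order_date, s.order_status,
          s.total_amount, CURRENT_TIMESTAMP);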

When to Use CTAS and INSERT INTO SELECT

This pattern is ideal for:

  • Migrating data from existing systems into Iceberg
  • Building curated or analytics-ready tables
  • Incremental batch ingestion
  • Pipelines managed by schedulers or CI/CD systems

It is less suitable for:

  • Continuous file-based ingestion
  • Event-driven ingestion from object storage
  • Complex API-driven ingestion logic

For those scenarios, Dremio’s file-based ingestion and programmatic approaches are a better fit.

COPY INTO and CREATE PIPE: File-Based Ingestion from Object Storage

Many ingestion pipelines are built around files landing in object storage, whether produced by upstream systems, event-driven processes, or external partners. For these scenarios, Dremio provides file-native ingestion patterns that load data directly into Apache Iceberg tables without requiring intermediate processing engines.

Two SQL constructs are central to this approach: COPY INTO and CREATE PIPE. Together, they support both bulk ingestion and continuous file loading into Iceberg.

Bulk File Ingestion with COPY INTO

COPY INTO is designed for explicit, batch-oriented ingestion of files from object storage into an existing Iceberg table. It is well-suited for backfills, periodic loads, or controlled batch workflows.

Example: Loading Parquet Files from Object Storage

COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';

In this example:

  • Files are read directly from an object storage location
  • Data is appended transactionally to the Iceberg table
  • No staging tables or external engines are required

Example: Loading CSV Files with Explicit Options

COPY INTO analytics.customers_iceberg
FROM '@s3.raw_data/customers/'
FILE_FORMAT (
  TYPE 'CSV',
  FIELD_DELIMITER ',',
  SKIP_FIRST_LINE TRUE
);

Dremio handles schema mapping, file discovery, and commit semantics automatically, ensuring that each COPY INTO operation results in a consistent Iceberg snapshot.

Best Practices for COPY INTO

  • Use COPY INTO for controlled batch ingestion, not continuous streaming
  • Validate schemas early, especially for CSV and JSON inputs
  • Group files into reasonably sized batches to avoid excessively small files
  • Prefer immutable file drops to simplify ingestion logic

Continuous Ingestion with CREATE PIPE

For pipelines that receive files continuously, CREATE PIPE enables automated ingestion. A pipe defines a persistent ingestion rule that watches a location and loads new files as they appear.

Example: Creating a Pipe for Continuous Ingestion

CREATE PIPE analytics.events_pipe
AS
COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';

Once created, the pipe:

  • Tracks which files have already been processed
  • Automatically ingests new files
  • Ensures each file is loaded exactly once

Pipes are ideal for landing-zone architectures where upstream systems continuously write files to object storage.

Starting and Managing Pipes

ALTER PIPE analytics.events_pipe SET PIPE_EXECUTION_PAUSED = FALSE;

Pipes can be paused, resumed, or monitored without redefining ingestion logic.

Handling Schema Evolution and File Layout

File-based ingestion often introduces schema drift over time. When ingesting into Iceberg:

  • Backward-compatible schema changes (adding columns) are handled gracefully
  • Incompatible changes should be normalized before ingestion
  • Partitioning decisions should be based on query access patterns, not file layout

Dremio’s autonomous optimization features help mitigate issues such as small files and suboptimal layouts after ingestion.
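
When an additive change does arrive, such as a new column appearing in upstream files, the table can be evolved in place before the next load. The column name and type below are illustrative.

-- Sketch: add a new column ahead of files that carry it. Existing data files
-- are untouched, and the column reads as NULL for older rows.
ALTER TABLE analytics.events_iceberg
ADD COLUMNS (device_type VARCHAR);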

When to Use COPY INTO vs CREATE PIPE

Use Case                  Recommended Pattern
One-time backfill         COPY INTO
Scheduled batch loads     COPY INTO
Continuous file drops     CREATE PIPE
Event-driven pipelines    CREATE PIPE
External data feeds       CREATE PIPE

Both patterns integrate seamlessly with Iceberg’s transactional model and Dremio Open Catalog, ensuring ingested data is immediately governed and queryable.

Where File-Based Ingestion Fits Best

File-based ingestion is ideal when:

  • Upstream systems already produce files
  • Object storage acts as a landing zone
  • Low-latency streaming is not required
  • Ingestion must scale independently of source systems

For ingestion that involves APIs, custom formats, or database-driven extraction logic, SQL alone is often not enough. In the next section, we’ll look at DremioFrame, a Python library that enables programmatic ingestion while still leveraging Dremio’s engine and Iceberg-native writes.

DremioFrame: Programmatic Ingestion into Iceberg Using Python

While SQL- and file-based ingestion cover many common lakehouse workflows, some ingestion scenarios require programmatic control. Data arriving from REST APIs, local files, Python applications, or external databases often needs to be fetched, normalized, or enriched before it can be written to Apache Iceberg.

DremioFrame addresses these use cases by providing a Python-native ingestion layer that writes data through the Dremio engine and into Iceberg tables managed by Dremio Open Catalog. Rather than acting as a local processing engine, DremioFrame serves as a control plane for ingestion, pushing work into Dremio wherever possible and preserving governance, lineage, and transactional guarantees.

DremioFrame supports both ELT-style ingestion (moving data from Dremio-connected sources into Iceberg) and ETL-style ingestion (bringing external data into the lakehouse).

API Ingestion

DremioFrame includes built-in support for ingesting data directly from REST APIs using client.ingest_api. This method handles fetching data, batching, and writing results into Iceberg tables.

from dremioframe.client import DremioClient

client = DremioClient()

client.ingest_api(
    url="https://api.example.com/users",
    table_name="marketing.users",
    mode="merge",   # 'replace', 'append', or 'merge'
    pk="id"
)

This pattern is well suited for:

  • SaaS platforms and external services
  • APIs without bulk export capabilities
  • Incremental ingestion using merge semantics

Best practices

  • Use merge with a primary key for mutable API data
  • Use append for append-only event streams
  • Consider staging tables for complex transformations before merging
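
The last point might look like the following sketch, composed only from the DremioFrame calls shown in this post; the staging and target table names are illustrative placeholders.

# Sketch: land raw API data in a staging table, then merge into the curated
# target on its primary key. Table names are illustrative.
from dremioframe.client import DremioClient

client = DremioClient()

# 1. Replace the staging table with the latest API payload
client.ingest_api(
    url="https://api.example.com/users",
    table_name="staging.users_raw",
    mode="replace"
)

# 2. Upsert staged rows into the curated table
client.table("staging.users_raw").merge(
    target_table="marketing.users",
    on="id",
    matched_update={"email": "source.email", "status": "source.status"},
    not_matched_insert={
        "id": "source.id",
        "email": "source.email",
        "status": "source.status"
    }
)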

File Upload from Local Systems

When working with local files that are not already available in object storage, DremioFrame allows you to upload them directly into Dremio as Iceberg tables using client.upload_file.

from dremioframe.client import DremioClient

client = DremioClient()

# Upload a CSV file
client.upload_file("data/sales.csv", "marketing.sales")

# Upload an Excel file
client.upload_file("data/financials.xlsx", "marketing.financials")

# Upload an Avro file
client.upload_file("data/users.avro", "marketing.users")

Supported formats include CSV, JSON, Parquet, Excel, HTML, Avro, ORC, Lance, and Arrow/Feather, with format-specific options passed through to the underlying readers.

This approach is ideal for:

  • Analyst-provided files
  • Small to medium batch ingestion
  • File types not directly ingested through the Dremio UI

Database Ingestion (JDBC / ODBC)

DremioFrame provides a standardized way to ingest data from relational databases into Iceberg using client.ingest.database. This integration supports a wide range of databases and can leverage high-performance backends such as connectorx.

from dremioframe.client import DremioClient

client = DremioClient()

client.ingest.database(
    connection_string="postgresql://user:password@localhost:5432/mydb",
    query="SELECT * FROM users WHERE active = true",
    table_name='"marketing"."users"',
    write_disposition="replace",
    backend="connectorx"
)

This pattern is commonly used for:

  • Migrating operational data into Iceberg
  • Isolating analytics workloads from source systems
  • Periodic batch ingestion from databases

Performance tips

  • Use connectorx whenever supported for faster ingestion
  • Use append for incremental loads
  • For very large datasets with SQLAlchemy, configure batch_size to stream results
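
A rough sketch of that last tip, reusing the call shape shown above; the backend value, batch size, and parameter placement are assumptions to verify against the DremioFrame documentation.

# Sketch: stream a large extraction in batches via the SQLAlchemy backend.
# The batch_size placement alongside the other arguments is an assumption.
client.ingest.database(
    connection_string="postgresql://user:password@localhost:5432/mydb",
    query="SELECT * FROM events",
    table_name='"marketing"."events"',
    write_disposition="append",
    backend="sqlalchemy",
    batch_size=50000
)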

File System Ingestion

For ingesting multiple local files at once, DremioFrame supports file system ingestion using glob patterns.

client.ingest.files(
    "data/events/*.parquet",
    table_name="marketing.events"
)

This is useful for:

  • Backfills from local landing zones
  • Ingesting many files in a single operation
  • Prototyping before moving to object storage–based pipelines

Ingestion from Local DataFrames

When data already exists in Python as a Pandas DataFrame or Arrow table, DremioFrame provides several options for creating or updating Iceberg tables.

import pandas as pd

df = pd.read_csv("local_data.csv")

client.create_table(
    "marketing.local_data",
    schema=df,
    insert_data=True
)

This is the cleanest approach for creating new tables from local data.

Appending to an Existing Table

client.table("marketing.local_data").insert(
    "marketing.local_data",
    data=df,
    batch_size=5000
)

Upserts Using Merge

client.table("staging.users").merge(
    target_table="marketing.users",
    on="id",
    matched_update={
        "email": "source.email",
        "status": "source.status"
    },
    not_matched_insert={
        "id": "source.id",
        "email": "source.email",
        "status": "source.status"
    }
)

These patterns are especially useful for:

  • Application-driven ingestion
  • Controlled updates and merges
  • Small to medium data volumes

Operational Best Practices

After ingestion, DremioFrame exposes table maintenance operations that are especially important for Iceberg tables:

# Compact small files
client.table("marketing.users").optimize()

# Expire old snapshots
client.table("marketing.users").vacuum(retain_last=5)

Additional best practices:

  • Use batching when inserting large DataFrames
  • Ensure type consistency before ingestion
  • Use staging tables for complex transformations
  • Prefer Iceberg tables over direct source queries for analytics workloads

When to Use DremioFrame

DremioFrame is the right choice when:

  • Ingestion originates outside Dremio (APIs, local files, Python apps)
  • Programmatic control is required
  • You want Iceberg-native writes without Spark
  • You need to combine Python logic with lakehouse governance

It complements SQL-based ingestion (CTAS, INSERT INTO SELECT, MERGE) and file-based ingestion (COPY INTO, CREATE PIPE) by bridging custom application logic and governed, Iceberg-native writes.

Conclusion: Choosing the Right Ingestion Pattern with Dremio

Ingesting data into an Apache Iceberg lakehouse does not require a single tool or a one-size-fits-all approach. Instead, effective lakehouse architectures rely on multiple ingestion patterns, each optimized for different sources, data velocities, and operational requirements.

Dremio makes this possible by treating Apache Iceberg as a first-class write format and combining it with an open catalog, autonomous optimization, and both SQL- and programmatic ingestion paths. Whether data arrives as files, database tables, API responses, or local datasets, Dremio provides a clear and consistent path to governed, interoperable Iceberg tables.

Each ingestion method covered in this post serves a distinct purpose:

  • UI-based file uploads enable fast, ad hoc ingestion for exploration and prototyping.
  • CTAS and INSERT INTO SELECT provide a robust, SQL-first approach for migrations, transformations, and incremental batch ingestion.
  • COPY INTO and CREATE PIPE support scalable file-based ingestion from object storage, from one-time backfills to continuous file drops.
  • DremioFrame extends ingestion into the programmatic domain, enabling APIs, databases, local files, and Python-driven workflows to land cleanly in Iceberg.

The key advantage of Dremio’s approach is that all of these paths converge on the same outcome: Apache Iceberg tables managed by Dremio Open Catalog, optimized automatically, governed consistently, and immediately usable across engines and tools.

This convergence dramatically reduces the complexity typically associated with data ingestion. Teams no longer need separate systems for ingestion, transformation, optimization, and governance. Instead, they can focus on choosing the right pattern for the job, confident that the resulting data will be performant, reliable, and ready for analytics and AI.

By combining open standards, autonomous lakehouse operations, and agentic interaction, Dremio turns the journey from raw data to trusted insight into an incremental, flexible process, not a long, disruptive platform migration.

As your data sources and requirements evolve, the ingestion patterns can evolve with them, without re-architecting the lakehouse or sacrificing openness.

Sign up for a free 30-day trial of Dremio Cloud Today!
