December 19, 2025

Data Ingestion Patterns Using Dremio: From Raw Data to Apache Iceberg

Alex Merced · Head of DevRel, Dremio

Modern data platforms are no longer built around monolithic warehouses or tightly coupled ingestion pipelines. Instead, organizations are standardizing on open lakehouse architectures, where data is stored in open formats, governed by shared catalogs, and processed by multiple engines based on workload.

At the center of this shift is Apache Iceberg, which has emerged as the de facto table format for analytic and AI workloads on object storage. Iceberg brings transactional guarantees, schema evolution, time travel, and partition evolution to data lakes, capabilities that were once exclusive to proprietary systems.

However, adopting Iceberg is only part of the story. Teams still face a critical question:

How do you ingest data into Iceberg efficiently, reliably, and at scale, without rebuilding complex ETL infrastructure?

This is where Dremio plays a key role.

Dremio is a lakehouse query and processing engine that natively reads from and writes to Apache Iceberg tables. Rather than introducing a proprietary ingestion framework, Dremio enables ingestion through SQL, file-based loading, and programmatic APIs, allowing teams to use the same engine for exploration, transformation, and data delivery.

Importantly, Dremio is not an orchestration layer. Workflow orchestration remains the responsibility of tools such as Apache Airflow, dbt, or custom pipelines built with DremioFrame, which can schedule, coordinate, and trigger ingestion workloads that execute on Dremio. This separation keeps architectures modular and flexible while allowing Dremio to focus on what it does best: fast, scalable data processing on open data.

In this post, we’ll walk through the most common data ingestion patterns with Dremio, focusing on ingesting data into Apache Iceberg tables managed through an open catalog. We’ll cover when to use each approach, how they fit into real-world pipelines, and best practices for choosing the correct pattern based on your data sources and workloads.

Whether you’re loading ad hoc datasets, migrating existing tables, ingesting files from object storage, or pulling data from APIs and databases, Dremio provides multiple paths to bring data into Iceberg, without locking you into a single engine or ingestion tool.

What Is Dremio: The Agentic Lakehouse Platform

Dremio is the Agentic Lakehouse, a data platform built for AI agents and managed by agents. It unifies data, governance, and business context to enable fast, accurate analytics and AI workflows directly on open data, without pipelines, lock-in, or manual optimization.

At its foundation, Dremio is a high-performance data processing engine built on Apache Arrow and optimized for Apache Iceberg. Iceberg is Dremio’s first-class table format for both reads and writes, allowing users to create, ingest, and evolve analytical tables directly in object storage with full transactional guarantees. Tables written by Dremio are immediately interoperable with other Iceberg-compatible engines, preserving the openness of the lakehouse.

What distinguishes Dremio from traditional query engines is its agentic architecture, which combines AI-driven interaction, autonomous operations, and semantic understanding of data:

Integrated AI Agent and MCP Server

Dremio includes a built-in AI agent that can run queries, generate visualizations, explain SQL, and suggest optimizations using natural language. This capability extends beyond the Dremio UI through an MCP (Model Context Protocol) server, which exposes Dremio’s semantic understanding of data to external clients and tools. Together, these capabilities allow AI agents and users to interact with data more naturally and productively.

AI Functions for Unstructured Data

Dremio brings AI directly into SQL through AI Functions, enabling teams to transform unstructured content, such as PDFs, documents, and images, into structured, queryable data. These functions make it possible to ingest and analyze data that would traditionally require complex preprocessing pipelines, expanding what “data ingestion” means in a lakehouse context.

Autonomous Performance Management

Operating an Iceberg lakehouse at scale typically requires continuous tuning and maintenance. Dremio eliminates this burden through autonomous performance management, including automatic table optimization, results caching, query planning caches, and Autonomous Reflections. These capabilities continuously optimize performance and cost as data volumes and workloads evolve, without manual intervention.

Dremio Open Catalog: Apache Polaris–Based

Dremio includes a built-in, fully managed lakehouse catalog, Dremio Open Catalog, powered by Apache Polaris. The catalog tracks, governs, and secures Iceberg tables while enabling interoperable access across engines through standard Iceberg REST APIs. This ensures that data ingested into Iceberg remains discoverable, governed, and reusable across the broader ecosystem.

Integrated Semantic Layer

Dremio provides a first-class semantic layer that includes views, tags, wikis, and end-to-end lineage. This semantic context is not only consumed by users, but also leveraged by the AI agent and MCP server to deliver more accurate and meaningful results. The semantic layer spans both native Iceberg tables and virtualized data from databases, data warehouses, and data lakes, enabling agentic analytics across the entire data estate.

By combining first-class Iceberg support, autonomous lakehouse management, and AI-driven interaction, Dremio enables organizations to move from fragmented data silos to performant, agentic analytics on unified data, often overnight rather than through long, multi-year platform migrations.

What Is Dremio Open Catalog: An Apache Polaris–Based Lakehouse Catalog

An open lakehouse requires more than an open table format. It also needs a shared, interoperable catalog that tracks table metadata, enforces governance, and allows multiple engines to safely read and write the same data. This is the role of Dremio Open Catalog (DOC).

Dremio Open Catalog is a lakehouse catalog built directly into the Dremio platform, powered by Apache Polaris. It provides a fully managed catalog for Apache Iceberg tables, enabling organizations to govern, secure, and share data without introducing proprietary metadata layers or locking themselves into a single compute engine.

Built on Apache Polaris and Iceberg REST

At its core, DOC implements the Apache Iceberg REST catalog specification via Apache Polaris. This means Iceberg tables registered in Dremio Open Catalog can be accessed by any engine that supports the Iceberg REST API, including Spark, Flink, Trino, and others.

This architecture ensures:

  • Interoperability: Tables ingested through Dremio are immediately available to other Iceberg-compatible engines.
  • Consistency: All engines operate against the same catalog metadata and transactional state.
  • Openness: Metadata remains portable and standards-based, avoiding proprietary lock-in.
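
For example, an external Spark job can read or write the same tables through the catalog’s Iceberg REST endpoint. The sketch below is illustrative only: the catalog name, endpoint URI, credentials, Iceberg runtime version, and table name are placeholders, not actual Dremio Open Catalog settings.

# Illustrative PySpark configuration for an Iceberg REST catalog.
# Endpoint, credentials, and package version are placeholders; use the
# connection details published by your catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-rest-interop")
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a catalog named "lakehouse" that speaks the Iceberg REST spec
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "https://<catalog-endpoint>/api/catalog")
    .config("spark.sql.catalog.lakehouse.credential", "<client-id>:<client-secret>")
    .getOrCreate()
)

# Tables ingested through Dremio are visible to Spark via the same catalog
spark.sql("SELECT COUNT(*) FROM lakehouse.analytics.orders").show()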

First-Class Governance for Iceberg Tables

Dremio Open Catalog is not just a metadata registry. It is a governance layer that provides fine-grained access control, auditing, and lineage for Iceberg tables. Permissions are enforced consistently whether data is queried interactively, ingested via SQL, or accessed programmatically.

Because the catalog is integrated into the Dremio platform, governance is applied automatically as data is created or ingested, without requiring separate systems to synchronize policies or metadata.

Designed for Ingestion and Evolution

Ingestion is one of the most demanding phases of the data lifecycle, especially in Iceberg-based lakehouses where tables continuously evolve. Dremio Open Catalog is designed to support:

  • Transactional table creation and writes
  • Schema evolution during ingestion
  • Partition evolution without rewrites
  • Safe concurrent access from multiple engines

This makes DOC a natural foundation for ingestion pipelines that start with raw data and mature into curated, shared Iceberg tables over time.
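
As a rough sketch of what evolution looks like in practice, partition changes can be expressed as plain DDL against the cataloged table. The table name and transform below are illustrative, and the exact syntax should be confirmed against your Dremio version.

-- Sketch: evolve partitioning on an existing Iceberg table. New data written
-- after this statement follows the new partition spec; existing data files
-- are not rewritten.
ALTER TABLE analytics.orders ADD PARTITION FIELD month(order_date);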

A Shared Foundation for Agentic Analytics

Dremio Open Catalog also plays a critical role in enabling agentic analytics. By centralizing metadata, permissions, and table definitions, the catalog provides the trusted foundation that Dremio’s AI agent and semantic layer rely on to understand data, apply context, and deliver accurate results.

In practice, this means that once data is ingested into Iceberg and registered in Dremio Open Catalog, it becomes immediately discoverable, governable, and usable by humans, AI agents, and external engines alike.

With an open catalog in place, the question shifts from where data lives to how it should be ingested. In the next sections, we’ll look at the different ingestion paths Dremio provides, starting with simple, ad hoc workflows and progressing toward fully automated, production-grade patterns.

The Ingestion Landscape: Iceberg REST and Dremio-Native Ingestion Paths

With Apache Iceberg and an open catalog in place, ingestion becomes far more flexible than in traditional data platforms. Instead of being tied to a single engine or proprietary pipeline framework, organizations can choose from multiple ingestion paths based on data volume, latency, and operational complexity.

Because Dremio Open Catalog implements the Iceberg REST specification, any engine that supports Iceberg REST can ingest data into tables registered in the catalog. This enables a broad ecosystem of tools, including batch engines, streaming frameworks, and custom applications, to safely create and update Iceberg tables while sharing a single source of truth for metadata and governance.

At the same time, Dremio provides several native ingestion mechanisms that are tightly integrated with its engine, catalog, and semantic layer. These options are often simpler to operate, require fewer moving parts, and automatically benefit from Dremio’s autonomous performance management and governance capabilities.

Broadly, ingestion into Iceberg using Dremio falls into four categories:

1. Ad Hoc and Interactive Ingestion

For exploratory or one-time datasets, Dremio supports interactive ingestion workflows directly in the UI. These are designed for speed and accessibility, allowing analysts and engineers to quickly turn files into Iceberg tables without writing pipelines or provisioning infrastructure.

2. SQL-Based Ingestion and Transformation

Dremio’s SQL engine can create and incrementally update Iceberg tables using familiar patterns such as CREATE TABLE AS SELECT and INSERT INTO SELECT. This approach is well suited for:

  • Migrating data from existing sources
  • Building curated tables from raw datasets
  • Performing incremental ingestion based on timestamps or identifiers

Because these operations are transactional and Iceberg-native, they can be safely re-run and integrated into scheduled workflows.
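
As a rough illustration of that hand-off, an orchestrator such as Apache Airflow can simply submit the SQL to Dremio on a schedule. The sketch below assumes Dremio’s SQL REST endpoint; the host, token, endpoint path, and table names are placeholders for your deployment.

# Hypothetical Airflow DAG that submits an incremental ingestion statement to
# Dremio's SQL REST API on a schedule. Host, token, endpoint path, and table
# names are placeholders.
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

DREMIO_URL = "https://<your-dremio-host>"      # placeholder
DREMIO_TOKEN = "<personal-access-token>"       # placeholder

INGEST_SQL = """
INSERT INTO analytics.daily_orders
SELECT order_id, customer_id, order_ts, total_amount
FROM postgres.sales.orders
WHERE order_ts > (SELECT MAX(order_ts) FROM analytics.daily_orders)
"""

def run_ingest():
    # Dremio executes the statement and commits a new Iceberg snapshot
    response = requests.post(
        f"{DREMIO_URL}/api/v3/sql",
        headers={"Authorization": f"Bearer {DREMIO_TOKEN}"},
        json={"sql": INGEST_SQL},
        timeout=120,
    )
    response.raise_for_status()

with DAG(
    dag_id="orders_incremental_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="run_incremental_insert", python_callable=run_ingest)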

3. File-Based Loading from Object Storage

For ingestion patterns centered around files landing in object storage, Dremio provides file-oriented loading mechanisms that are optimized for bulk and continuous ingestion. These patterns are ideal for external data feeds, event-driven pipelines, and landing-zone architectures.

4. Programmatic Ingestion with DremioFrame

Some ingestion scenarios go beyond what can be expressed purely in SQL or through file ingestion. DremioFrame, a Python library built on top of the Dremio engine, enables programmatic ingestion from APIs, custom file formats, and JDBC-accessible systems, while still pushing execution down to Dremio and writing data into Iceberg tables.

Each of these ingestion paths serves a different purpose, and they are often used together within the same lakehouse. The key advantage of Dremio’s approach is that all roads lead to the same destination: governed, interoperable Apache Iceberg tables managed through an open catalog.

In the following sections, we’ll explore each of these ingestion patterns in detail, starting with the simplest workflows and progressing toward more advanced, production-grade ingestion pipelines, along with best practices for choosing the right approach for your use case.

File Upload in the UI: One-Time and Ad Hoc Ingestion into Iceberg

Not every ingestion workflow needs to start with a pipeline. For exploratory analysis, rapid prototyping, or one-off datasets, Dremio provides a simple UI-based file upload experience that allows users to ingest data directly into the lakehouse.

Through the Dremio UI, users can upload files in common formats such as CSV, JSON, and Parquet, preview their contents, and immediately query them using SQL. These uploaded files can then be transformed and materialized as Apache Iceberg tables, making them first-class citizens in the lakehouse rather than isolated artifacts.

This workflow is particularly valuable in early-stage analysis, where the goal is to move quickly from raw data to insight without standing up infrastructure or writing ingestion code.

How UI-Based Ingestion Works

The typical flow for UI-based ingestion looks like this:

  1. Upload a file through the Dremio UI.
  2. Inspect the inferred schema and data types.
  3. The file is materialized as an Iceberg table in Dremio Open Catalog.

Once materialized, the resulting Iceberg table is managed by Dremio Open Catalog, governed like any other dataset, and immediately available for querying by other users, tools, and engines.

When This Pattern Makes Sense

UI-based file uploads are best suited for:

  • Ad hoc datasets shared by partners or internal teams
  • Exploratory analysis and proof-of-concept work
  • Small reference datasets that change infrequently
  • Analyst-driven workflows where speed matters more than automation

This approach lowers the barrier to entry for working with Iceberg by allowing users to focus on data and SQL rather than ingestion infrastructure.
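
For instance, once a file has been uploaded, a short CTAS statement is often all that is needed to promote it into a curated table. The sketch below is illustrative: the home-space path, table name, and columns are placeholders.

-- Sketch: promote an uploaded dataset from a user's home space into a
-- curated Iceberg table. Source path and columns are placeholders.
CREATE TABLE analytics.customer_reference
AS
SELECT
  customer_id,
  customer_name,
  region
FROM "@alex.merced"."customer_reference_upload";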

Best Practices

To use UI-based ingestion effectively:

  • Convert uploaded files to Iceberg early
    Treat uploaded files as a staging step, not a long-term storage solution. Persist curated results into Iceberg tables as soon as possible.
  • Validate schemas before materializing
    Review inferred data types and column names to avoid propagating issues into downstream tables.
  • Avoid using UI uploads for recurring ingestion
    If a dataset needs to be refreshed regularly or scaled over time, transition to SQL-based, file-based, or programmatic ingestion patterns.

Limitations to Keep in Mind

While convenient, UI-based ingestion is intentionally scoped. It is not designed for:

  • Large-scale or high-frequency ingestion
  • Automated or scheduled pipelines
  • Complex schema evolution scenarios

For those use cases, Dremio’s SQL-based and file-based ingestion mechanisms provide more control and scalability.

UI uploads excel as a starting point: a fast path from raw data to governed Iceberg tables that teams can later replace with more robust ingestion patterns as requirements grow.

CTAS and INSERT INTO SELECT: SQL-Driven Ingestion and Incremental Updates

For most production ingestion workflows, SQL-based ingestion provides the best balance of simplicity, scalability, and control. Dremio supports this natively through CREATE TABLE AS SELECT (CTAS) and INSERT INTO SELECT, allowing teams to ingest, transform, and incrementally update Apache Iceberg tables using standard SQL.

These patterns are especially powerful because they:

  • Write transactionally to Iceberg
  • Work across federated sources (databases, warehouses, files, lakes)
  • Integrate cleanly into automated workflows
  • Support repeatable and incremental ingestion

Initial Loads with CREATE TABLE AS SELECT (CTAS)

CTAS is the most common way to perform an initial ingestion into Iceberg. It allows you to create a new Iceberg table directly from an existing source while applying transformations, filtering, and schema normalization.

Example: Migrating Data from an External Source into Iceberg

CREATE TABLE analytics.orders_iceberg
PARTITION BY (order_date)
AS
SELECT
  order_id,
  customer_id,
  CAST(order_timestamp AS DATE) AS order_date,
  order_status,
  total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp >= '2024-01-01';

In this example:

  • Data is read directly from an external source
  • The table is written in Apache Iceberg format
  • Partitioning is applied at creation time
  • Ingestion metadata is added explicitly

Once created, this Iceberg table is immediately registered in Dremio Open Catalog and available to other engines.

Best Practices for CTAS

  • Normalize schemas during ingestion to avoid propagating inconsistencies
  • Add ingestion timestamps for auditability and incremental logic
  • Choose partitions carefully based on query patterns, not source layouts
  • Prefer CTAS for initial loads and backfills, not ongoing updates

Incremental Ingestion with INSERT INTO SELECT

After the initial load, most datasets need to be updated incrementally. Dremio supports this pattern using INSERT INTO SELECT, appending new data to existing Iceberg tables in a fully transactional manner.

Example: Incremental Inserts Based on a Timestamp

INSERT INTO analytics.orders_iceberg
SELECT
  order_id,
  customer_id,
  CAST(order_timestamp AS DATE) AS order_date,
  order_status,
  total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp > (
  SELECT MAX(order_date)
  FROM analytics.orders_iceberg
);

This pattern:

  • Reads only new records from the source
  • Appends data safely to the Iceberg table
  • Can be rerun without affecting existing data

Example: Incremental Inserts Using a High-Watermark Table

For more control, many teams maintain a watermark table:

INSERT INTO analytics.orders_iceberg
SELECT
  o.order_id,
  o.customer_id,
  CAST(o.order_timestamp AS DATE) AS order_date,
  o.order_status,
  o.total_amount,
  CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders o
JOIN ingestion_metadata.watermarks w
  ON o.order_timestamp > w.last_processed_ts
WHERE w.dataset_name = 'orders';

This approach is especially useful when:

  • Sources don’t guarantee ordering
  • Multiple ingestion jobs share state
  • Late-arriving data is expected
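
After a successful load, the watermark itself needs to be advanced. A minimal sketch follows, assuming the watermarks table is a writable Iceberg table and that the new high-water mark is computed and supplied by the orchestrating job.

-- Sketch: advance the watermark after a successful load. The literal value
-- stands in for the maximum source timestamp actually ingested, which the
-- orchestrating job would compute and substitute.
UPDATE ingestion_metadata.watermarks
SET last_processed_ts = TIMESTAMP '2024-06-01 00:00:00'
WHERE dataset_name = 'orders';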

Idempotency and Reliability Considerations

When using SQL-based ingestion in production, it’s important to design for reliability:

  • Avoid duplicates by filtering on immutable keys or timestamps
  • Prefer append-only ingestion when possible
  • Track ingestion state explicitly, not implicitly
  • Treat SQL as declarative ingestion logic, not procedural code

Because Iceberg guarantees atomic commits, failed ingestion jobs will not leave tables in a partially written state.
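
When duplicates are a real risk, for example when a source window overlaps previous runs, an upsert is a common alternative to a plain INSERT. The following is a hedged sketch using MERGE on the same tables as above; confirm the exact MERGE clause support in your Dremio version.

-- Sketch: idempotent upsert keyed on order_id. Re-running the statement
-- updates matched rows instead of inserting duplicates.
MERGE INTO analytics.orders_iceberg t
USING (
  SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS DATE) AS order_date,
    order_status,
    total_amount
  FROM snowflake.sales.orders
  WHERE order_timestamp >= '2024-06-01'
) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
  order_status = s.order_status,
  total_amount = s.total_amount
WHEN NOT MATCHED THEN INSERT
  VALUES (s.order_id, s.customer_id, s.order_date, s.order_status,
          s.total_amount, CURRENT_TIMESTAMP);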

When to Use CTAS and INSERT INTO SELECT

This pattern is ideal for:

  • Migrating data from existing systems into Iceberg
  • Building curated or analytics-ready tables
  • Incremental batch ingestion
  • Pipelines managed by schedulers or CI/CD systems

It is less suitable for:

  • Continuous file-based ingestion
  • Event-driven ingestion from object storage
  • Complex API-driven ingestion logic

For those scenarios, Dremio’s file-based ingestion and programmatic approaches are a better fit.

COPY INTO and CREATE PIPE: File-Based Ingestion from Object Storage

Many ingestion pipelines are built around files landing in object storage, whether produced by upstream systems, event-driven processes, or external partners. For these scenarios, Dremio provides file-native ingestion patterns that load data directly into Apache Iceberg tables without requiring intermediate processing engines.

Two SQL constructs are central to this approach: COPY INTO and CREATE PIPE. Together, they support both bulk ingestion and continuous file loading into Iceberg.

Bulk File Ingestion with COPY INTO

COPY INTO is designed for explicit, batch-oriented ingestion of files from object storage into an existing Iceberg table. It is well-suited for backfills, periodic loads, or controlled batch workflows.

Example: Loading Parquet Files from Object Storage

COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';

In this example:

  • Files are read directly from an object storage location
  • Data is appended transactionally to the Iceberg table
  • No staging tables or external engines are required

Example: Loading CSV Files with Explicit Options

COPY INTO analytics.customers_iceberg
FROM '@s3.raw_data/customers/'
FILE_FORMAT (
  TYPE 'CSV',
  FIELD_DELIMITER ',',
  SKIP_FIRST_LINE TRUE
);

Dremio handles schema mapping, file discovery, and commit semantics automatically, ensuring that each COPY INTO operation results in a consistent Iceberg snapshot.

Best Practices for COPY INTO

  • Use COPY INTO for controlled batch ingestion, not continuous streaming
  • Validate schemas early, especially for CSV and JSON inputs
  • Group files into reasonably sized batches to avoid excessively small files
  • Prefer immutable file drops to simplify ingestion logic

Continuous Ingestion with CREATE PIPE

For pipelines that receive files continuously, CREATE PIPE enables automated ingestion. A pipe defines a persistent ingestion rule that watches a location and loads new files as they appear.

Example: Creating a Pipe for Continuous Ingestion

CREATE PIPE analytics.events_pipe
AS
COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';

Once created, the pipe:

  • Tracks which files have already been processed
  • Automatically ingests new files
  • Ensures each file is loaded exactly once

Pipes are ideal for landing-zone architectures where upstream systems continuously write files to object storage.

Starting and Managing Pipes

ALTER PIPE analytics.events_pipe SET PIPE_EXECUTION_PAUSED = FALSE;

Pipes can be paused, resumed, or monitored without redefining ingestion logic.

Handling Schema Evolution and File Layout

File-based ingestion often introduces schema drift over time. When ingesting into Iceberg:

  • Backward-compatible schema changes (adding columns) are handled gracefully
  • Incompatible changes should be normalized before ingestion
  • Partitioning decisions should be based on query access patterns, not file layout

Dremio’s autonomous optimization features help mitigate issues such as small files and suboptimal layouts after ingestion.
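
When an additive change does arrive, such as a new column appearing in upstream files, the table can be evolved in place before the next load. The column name and type below are illustrative.

-- Sketch: add a new column ahead of files that carry it. Existing data files
-- are untouched, and the column reads as NULL for older rows.
ALTER TABLE analytics.events_iceberg
ADD COLUMNS (device_type VARCHAR);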

When to Use COPY INTO vs CREATE PIPE

Use Case                  Recommended Pattern
One-time backfill         COPY INTO
Scheduled batch loads     COPY INTO
Continuous file drops     CREATE PIPE
Event-driven pipelines    CREATE PIPE
External data feeds       CREATE PIPE

Both patterns integrate seamlessly with Iceberg’s transactional model and Dremio Open Catalog, ensuring ingested data is immediately governed and queryable.

Where File-Based Ingestion Fits Best

File-based ingestion is ideal when:

  • Upstream systems already produce files
  • Object storage acts as a landing zone
  • Low-latency streaming is not required
  • Ingestion must scale independently of source systems

For ingestion that involves APIs, custom formats, or database-driven extraction logic, SQL alone is often not enough. In the next section, we’ll look at DremioFrame, a Python library that enables programmatic ingestion while still leveraging Dremio’s engine and Iceberg-native writes.

DremioFrame: Programmatic Ingestion into Iceberg Using Python

While SQL- and file-based ingestion cover many common lakehouse workflows, some ingestion scenarios require programmatic control. Data arriving from REST APIs, local files, Python applications, or external databases often needs to be fetched, normalized, or enriched before it can be written to Apache Iceberg.

DremioFrame addresses these use cases by providing a Python-native ingestion layer that writes data through the Dremio engine and into Iceberg tables managed by Dremio Open Catalog. Rather than acting as a local processing engine, DremioFrame serves as a control plane for ingestion, pushing work into Dremio wherever possible and preserving governance, lineage, and transactional guarantees.

DremioFrame supports both ELT-style ingestion (moving data from Dremio-connected sources into Iceberg) and ETL-style ingestion (bringing external data into the lakehouse).

API Ingestion

DremioFrame includes built-in support for ingesting data directly from REST APIs using client.ingest_api. This method handles fetching data, batching, and writing results into Iceberg tables.

from dremioframe.client import DremioClient

client = DremioClient()

client.ingest_api(
    url="https://api.example.com/users",
    table_name="marketing.users",
    mode="merge",   # 'replace', 'append', or 'merge'
    pk="id"
)

This pattern is well suited for:

  • SaaS platforms and external services
  • APIs without bulk export capabilities
  • Incremental ingestion using merge semantics

Best practices

  • Use merge with a primary key for mutable API data
  • Use append for append-only event streams
  • Consider staging tables for complex transformations before merging
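
The last point might look like the following sketch, composed only from the DremioFrame calls shown in this post; the staging and target table names are illustrative placeholders.

# Sketch: land raw API data in a staging table, then merge into the curated
# target on its primary key. Table names are illustrative.
from dremioframe.client import DremioClient

client = DremioClient()

# 1. Replace the staging table with the latest API payload
client.ingest_api(
    url="https://api.example.com/users",
    table_name="staging.users_raw",
    mode="replace"
)

# 2. Upsert staged rows into the curated table
client.table("staging.users_raw").merge(
    target_table="marketing.users",
    on="id",
    matched_update={"email": "source.email", "status": "source.status"},
    not_matched_insert={
        "id": "source.id",
        "email": "source.email",
        "status": "source.status"
    }
)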

File Upload from Local Systems

When working with local files that are not already available in object storage, DremioFrame allows you to upload them directly into Dremio as Iceberg tables using client.upload_file.

from dremioframe.client import DremioClient

client = DremioClient()

# Upload a CSV file
client.upload_file("data/sales.csv", "marketing.sales")

# Upload an Excel file
client.upload_file("data/financials.xlsx", "marketing.financials")

# Upload an Avro file
client.upload_file("data/users.avro", "marketing.users")

Supported formats include CSV, JSON, Parquet, Excel, HTML, Avro, ORC, Lance, and Arrow/Feather, with format-specific options passed through to the underlying readers.

This approach is ideal for:

  • Analyst-provided files
  • Small to medium batch ingestion
  • File types not directly ingested through the Dremio UI

Database Ingestion (JDBC / ODBC)

DremioFrame provides a standardized way to ingest data from relational databases into Iceberg using client.ingest.database. This integration supports a wide range of databases and can leverage high-performance backends such as connectorx.

from dremioframe.client import DremioClient

client = DremioClient()

client.ingest.database(
    connection_string="postgresql://user:password@localhost:5432/mydb",
    query="SELECT * FROM users WHERE active = true",
    table_name='"marketing"."users"',
    write_disposition="replace",
    backend="connectorx"
)

This pattern is commonly used for:

  • Migrating operational data into Iceberg
  • Isolating analytics workloads from source systems
  • Periodic batch ingestion from databases

Performance tips

  • Use connectorx whenever supported for faster ingestion
  • Use append for incremental loads
  • For very large datasets with SQLAlchemy, configure batch_size to stream results
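
A rough sketch of that last tip, reusing the call shape shown above; the backend value, batch size, and parameter placement are assumptions to verify against the DremioFrame documentation.

# Sketch: stream a large extraction in batches via the SQLAlchemy backend.
# The batch_size placement alongside the other arguments is an assumption.
client.ingest.database(
    connection_string="postgresql://user:password@localhost:5432/mydb",
    query="SELECT * FROM events",
    table_name='"marketing"."events"',
    write_disposition="append",
    backend="sqlalchemy",
    batch_size=50000
)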

File System Ingestion

For ingesting multiple local files at once, DremioFrame supports file system ingestion using glob patterns.

client.ingest.files(
    "data/events/*.parquet",
    table_name="marketing.events"
)

This is useful for:

  • Backfills from local landing zones
  • Ingesting many files in a single operation
  • Prototyping before moving to object storage–based pipelines

Ingestion from Local DataFrames

When data already exists in Python as a Pandas DataFrame or Arrow table, DremioFrame provides several options for creating or updating Iceberg tables.

import pandas as pd

df = pd.read_csv("local_data.csv")

client.create_table(
    "marketing.local_data",
    schema=df,
    insert_data=True
)

This is the cleanest approach for creating new tables from local data.

Appending to an Existing Table

client.table("marketing.local_data").insert(
    "marketing.local_data",
    data=df,
    batch_size=5000
)

Upserts Using Merge

client.table("staging.users").merge(
    target_table="marketing.users",
    on="id",
    matched_update={
        "email": "source.email",
        "status": "source.status"
    },
    not_matched_insert={
        "id": "source.id",
        "email": "source.email",
        "status": "source.status"
    }
)

These patterns are especially useful for:

  • Application-driven ingestion
  • Controlled updates and merges
  • Small to medium data volumes

Operational Best Practices

After ingestion, DremioFrame exposes table maintenance operations that are especially important for Iceberg tables:

# Compact small files
client.table("marketing.users").optimize()

# Expire old snapshots
client.table("marketing.users").vacuum(retain_last=5)

Additional best practices:

  • Use batching when inserting large DataFrames
  • Ensure type consistency before ingestion
  • Use staging tables for complex transformations
  • Prefer Iceberg tables over direct source queries for analytics workloads

When to Use DremioFrame

DremioFrame is the right choice when:

  • Ingestion originates outside Dremio (APIs, local files, Python apps)
  • Programmatic control is required
  • You want Iceberg-native writes without Spark
  • You need to combine Python logic with lakehouse governance

It complements SQL-based ingestion (CTAS, INSERT INTO SELECT, MERGE) and file-based ingestion (COPY INTO, CREATE PIPE) by bridging custom application logic and governed, Iceberg-native writes.

Conclusion: Choosing the Right Ingestion Pattern with Dremio

Ingesting data into an Apache Iceberg lakehouse does not require a single tool or a one-size-fits-all approach. Instead, effective lakehouse architectures rely on multiple ingestion patterns, each optimized for different sources, data velocities, and operational requirements.

Dremio makes this possible by treating Apache Iceberg as a first-class write format and combining it with an open catalog, autonomous optimization, and both SQL- and programmatic ingestion paths. Whether data arrives as files, database tables, API responses, or local datasets, Dremio provides a clear and consistent path to governed, interoperable Iceberg tables.

Each ingestion method covered in this post serves a distinct purpose:

  • UI-based file uploads enable fast, ad hoc ingestion for exploration and prototyping.
  • CTAS and INSERT INTO SELECT provide a robust, SQL-first approach for migrations, transformations, and incremental batch ingestion.
  • COPY INTO and CREATE PIPE support scalable file-based ingestion from object storage, from one-time backfills to continuous file drops.
  • DremioFrame extends ingestion into the programmatic domain, enabling APIs, databases, local files, and Python-driven workflows to land cleanly in Iceberg.

The key advantage of Dremio’s approach is that all of these paths converge on the same outcome: Apache Iceberg tables managed by Dremio Open Catalog, optimized automatically, governed consistently, and immediately usable across engines and tools.

This convergence dramatically reduces the complexity typically associated with data ingestion. Teams no longer need separate systems for ingestion, transformation, optimization, and governance. Instead, they can focus on choosing the right pattern for the job, confident that the resulting data will be performant, reliable, and ready for analytics and AI.

By combining open standards, autonomous lakehouse operations, and agentic interaction, Dremio turns the journey from raw data to trusted insight into an incremental, flexible process, not a long, disruptive platform migration.

As your data sources and requirements evolve, the ingestion patterns can evolve with them, without re-architecting the lakehouse or sacrificing openness.

Sign up for a free 30-day trial of Dremio Cloud Today!
