December 19, 2025
Data Ingestion Patterns Using Dremio: From Raw Data to Apache Iceberg
Head of DevRel, Dremio
Modern data platforms are no longer built around monolithic warehouses or tightly coupled ingestion pipelines. Instead, organizations are standardizing on open lakehouse architectures, where data is stored in open formats, governed by shared catalogs, and processed by multiple engines based on workload.
At the center of this shift is Apache Iceberg, which has emerged as the de facto table format for analytic and AI workloads on object storage. Iceberg brings transactional guarantees, schema evolution, time travel, and partition evolution to data lakes, capabilities that were once exclusive to proprietary systems.
However, adopting Iceberg is only part of the story. Teams still face a critical question:
How do you ingest data into Iceberg efficiently, reliably, and at scale, without rebuilding complex ETL infrastructure?
This is where Dremio plays a key role.
Dremio is a lakehouse query and processing engine that natively reads from and writes to Apache Iceberg tables. Rather than introducing a proprietary ingestion framework, Dremio enables ingestion through SQL, file-based loading, and programmatic APIs, allowing teams to use the same engine for exploration, transformation, and data delivery.
Importantly, Dremio is not an orchestration layer. Workflow orchestration remains the responsibility of tools such as Apache Airflow, dbt, or custom pipelines built with DremioFrame, which can schedule, coordinate, and trigger ingestion workloads that execute on Dremio. This separation keeps architectures modular and flexible while allowing Dremio to focus on what it does best: fast, scalable data processing on open data.
In this post, we’ll walk through the most common data ingestion patterns with Dremio, focusing on ingesting data into Apache Iceberg tables managed through an open catalog. We’ll cover when to use each approach, how they fit into real-world pipelines, and best practices for choosing the correct pattern based on your data sources and workloads.
Whether you’re loading ad hoc datasets, migrating existing tables, ingesting files from object storage, or pulling data from APIs and databases, Dremio provides multiple paths to bring data into Iceberg, without locking you into a single engine or ingestion tool.
What Is Dremio: The Agentic Lakehouse Platform
Dremio is the Agentic Lakehouse, a data platform built for AI agents and managed by agents. It unifies data, governance, and business context to enable fast, accurate analytics and AI workflows directly on open data, without pipelines, lock-in, or manual optimization.
At its foundation, Dremio is a high-performance data processing engine built on Apache Arrow and optimized for Apache Iceberg. Iceberg is Dremio’s first-class table format for both reads and writes, allowing users to create, ingest, and evolve analytical tables directly in object storage with full transactional guarantees. Tables written by Dremio are immediately interoperable with other Iceberg-compatible engines, preserving the openness of the lakehouse.
What distinguishes Dremio from traditional query engines is its agentic architecture, which combines AI-driven interaction, autonomous operations, and semantic understanding of data:
Integrated AI Agent and MCP Server
Dremio includes a built-in AI agent that can run queries, generate visualizations, explain SQL, and suggest optimizations using natural language. This capability extends beyond the Dremio UI through an MCP (Model Context Protocol) server, which exposes Dremio’s semantic understanding of data to external clients and tools. Together, these capabilities allow AI agents and users to interact with data more naturally and productively.
AI Functions for Unstructured Data
Dremio brings AI directly into SQL through AI Functions, enabling teams to transform unstructured content, such as PDFs, documents, and images, into structured, queryable data. These functions make it possible to ingest and analyze data that would traditionally require complex preprocessing pipelines, expanding what “data ingestion” means in a lakehouse context.
Autonomous Performance Management
Operating an Iceberg lakehouse at scale typically requires continuous tuning and maintenance. Dremio eliminates this burden through autonomous performance management, including automatic table optimization, results caching, query planning caches, and Autonomous Reflections. These capabilities continuously optimize performance and cost as data volumes and workloads evolve, without manual intervention.
Dremio Open Catalog: Apache Polaris–Based
Dremio includes a built-in, fully managed lakehouse catalog, Dremio Open Catalog, powered by Apache Polaris. The catalog tracks, governs, and secures Iceberg tables while enabling interoperable access across engines through standard Iceberg REST APIs. This ensures that data ingested into Iceberg remains discoverable, governed, and reusable across the broader ecosystem.
Integrated Semantic Layer
Dremio provides a first-class semantic layer that includes views, tags, wikis, and end-to-end lineage. This semantic context is not only consumed by users, but also leveraged by the AI agent and MCP server to deliver more accurate and meaningful results. The semantic layer spans both native Iceberg tables and virtualized data from databases, data warehouses, and data lakes, enabling agentic analytics across the entire data estate.
By combining first-class Iceberg support, autonomous lakehouse management, and AI-driven interaction, Dremio enables organizations to move from fragmented data silos to performant, agentic analytics on unified data, often overnight rather than through long, multi-year platform migrations.
What Is Dremio Open Catalog: An Apache Polaris–Based Lakehouse Catalog
An open lakehouse requires more than an open table format. It also needs a shared, interoperable catalog that tracks table metadata, enforces governance, and allows multiple engines to safely read and write the same data. This is the role of Dremio Open Catalog (DOC).
Dremio Open Catalog is a lakehouse catalog built directly into the Dremio platform, powered by Apache Polaris. It provides a fully managed catalog for Apache Iceberg tables, enabling organizations to govern, secure, and share data without introducing proprietary metadata layers or locking themselves into a single compute engine.
Built on Apache Polaris and Iceberg REST
At its core, DOC implements the Apache Iceberg REST catalog specification via Apache Polaris. This means Iceberg tables registered in Dremio Open Catalog can be accessed by any engine that supports the Iceberg REST API, including Spark, Flink, Trino, and others.
This architecture ensures:
- Interoperability: Tables ingested through Dremio are immediately available to other Iceberg-compatible engines.
- Consistency: All engines operate against the same catalog metadata and transactional state.
- Openness: Metadata remains portable and standards-based, avoiding proprietary lock-in.
First-Class Governance for Iceberg Tables
Dremio Open Catalog is not just a metadata registry. It is a governance layer that provides fine-grained access control, auditing, and lineage for Iceberg tables. Permissions are enforced consistently whether data is queried interactively, ingested via SQL, or accessed programmatically.
Because the catalog is integrated into the Dremio platform, governance is applied automatically as data is created or ingested, without requiring separate systems to synchronize policies or metadata.
Designed for Ingestion and Evolution
Ingestion is one of the most demanding phases of the data lifecycle, especially in Iceberg-based lakehouses where tables continuously evolve. Dremio Open Catalog is designed to support:
- Transactional table creation and writes
- Schema evolution during ingestion
- Partition evolution without rewrites
- Safe concurrent access from multiple engines
This makes DOC a natural foundation for ingestion pipelines that start with raw data and mature into curated, shared Iceberg tables over time.
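In practice, these capabilities surface as ordinary DDL against tables in the catalog. The snippet below is a minimal sketch, not a prescribed workflow; the table name is illustrative, and the exact partition-evolution syntax should be confirmed against your Dremio version's documentation:
-- Backward-compatible schema evolution: add a column without rewriting existing data
ALTER TABLE sales.orders ADD COLUMNS (sales_channel VARCHAR);
-- Partition evolution: only files written after this change use the new partition spec
ALTER TABLE sales.orders ADD PARTITION FIELD month(order_date);
Because Iceberg records partition specs in table metadata, files written under the old spec remain readable alongside new data.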
A Shared Foundation for Agentic Analytics
Dremio Open Catalog also plays a critical role in enabling agentic analytics. By centralizing metadata, permissions, and table definitions, the catalog provides the trusted foundation that Dremio’s AI agent and semantic layer rely on to understand data, apply context, and deliver accurate results.
In practice, this means that once data is ingested into Iceberg and registered in Dremio Open Catalog, it becomes immediately discoverable, governable, and usable by humans, AI agents, and external engines alike.
With an open catalog in place, the question shifts from where data lives to how it should be ingested. In the next sections, we’ll look at the different ingestion paths Dremio provides, starting with simple, ad hoc workflows and progressing toward fully automated, production-grade patterns.
The Ingestion Landscape: Iceberg REST and Dremio-Native Ingestion Paths
With Apache Iceberg and an open catalog in place, ingestion becomes far more flexible than in traditional data platforms. Instead of being tied to a single engine or proprietary pipeline framework, organizations can choose from multiple ingestion paths based on data volume, latency, and operational complexity.
Because Dremio Open Catalog implements the Iceberg REST specification, any engine that supports Iceberg REST can ingest data into tables registered in the catalog. This enables a broad ecosystem of tools (batch engines, streaming frameworks, and custom applications) to safely create and update Iceberg tables while sharing a single source of truth for metadata and governance.
At the same time, Dremio provides several native ingestion mechanisms that are tightly integrated with its engine, catalog, and semantic layer. These options are often simpler to operate, require fewer moving parts, and automatically benefit from Dremio’s autonomous performance management and governance capabilities.
Broadly, ingestion into Iceberg using Dremio falls into four categories:
1. Ad Hoc and Interactive Ingestion
For exploratory or one-time datasets, Dremio supports interactive ingestion workflows directly in the UI. These are designed for speed and accessibility, allowing analysts and engineers to quickly turn files into Iceberg tables without writing pipelines or provisioning infrastructure.
2. SQL-Based Ingestion and Transformation
Dremio’s SQL engine can create and incrementally update Iceberg tables using familiar patterns such as CREATE TABLE AS SELECT and INSERT INTO SELECT. This approach is well suited for:
- Migrating data from existing sources
- Building curated tables from raw datasets
- Performing incremental ingestion based on timestamps or identifiers
Because these operations are transactional and Iceberg-native, they can be safely re-run and integrated into scheduled workflows.
3. File-Based Loading from Object Storage
For ingestion patterns centered around files landing in object storage, Dremio provides file-oriented loading mechanisms that are optimized for bulk and continuous ingestion. These patterns are ideal for external data feeds, event-driven pipelines, and landing-zone architectures.
4. Programmatic Ingestion with DremioFrame
Some ingestion scenarios go beyond what can be expressed purely in SQL or through file ingestion. DremioFrame, a Python library built on top of the Dremio engine, enables programmatic ingestion from APIs, custom file formats, and JDBC-accessible systems, while still pushing execution down to Dremio and writing data into Iceberg tables.
Each of these ingestion paths serves a different purpose, and they are often used together within the same lakehouse. The key advantage of Dremio’s approach is that all roads lead to the same destination: governed, interoperable Apache Iceberg tables managed through an open catalog.
In the following sections, we’ll explore each of these ingestion patterns in detail, starting with the simplest workflows and progressing toward more advanced, production-grade ingestion pipelines, along with best practices for choosing the right approach for your use case.
File Upload in the UI: One-Time and Ad Hoc Ingestion into Iceberg
Not every ingestion workflow needs to start with a pipeline. For exploratory analysis, rapid prototyping, or one-off datasets, Dremio provides a simple UI-based file upload experience that allows users to ingest data directly into the lakehouse.
Through the Dremio UI, users can upload files in common formats such as CSV, JSON, and Parquet, preview their contents, and immediately query them using SQL. These uploaded files can then be transformed and materialized as Apache Iceberg tables, making them first-class citizens in the lakehouse rather than isolated artifacts.
This workflow is particularly valuable in early-stage analysis, where the goal is to move quickly from raw data to insight without standing up infrastructure or writing ingestion code.
How UI-Based Ingestion Works
The typical flow for UI-based ingestion looks like this:
- Upload a file through the Dremio UI.
- Inspect the inferred schema and data types.
- Save the dataset; it is materialized as an Iceberg table inside Dremio Open Catalog.
Once materialized, the resulting Iceberg table is managed by Dremio Open Catalog, governed like any other dataset, and immediately available for querying by other users, tools, and engines.
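To promote an uploaded dataset into a curated table, a simple CTAS is usually all that is needed. The following is a minimal sketch, assuming an uploaded dataset reachable at uploads.customers_upload; the path, table, and column names are hypothetical:
-- Promote the uploaded dataset into a curated, governed Iceberg table
CREATE TABLE analytics.customer_reference AS
SELECT
    customer_id,
    UPPER(region) AS region,
    signup_date
FROM uploads.customers_upload;
From this point on, the curated table evolves like any other Iceberg table, independent of the original upload.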
When This Pattern Makes Sense
UI-based file uploads are best suited for:
- Ad hoc datasets shared by partners or internal teams
- Exploratory analysis and proof-of-concept work
- Small reference datasets that change infrequently
- Analyst-driven workflows where speed matters more than automation
This approach lowers the barrier to entry for working with Iceberg by allowing users to focus on data and SQL rather than ingestion infrastructure.
Best Practices
To use UI-based ingestion effectively:
- Convert uploaded files to Iceberg early: treat uploaded files as a staging step, not a long-term storage solution, and persist curated results into Iceberg tables as soon as possible.
- Validate schemas before materializing: review inferred data types and column names to avoid propagating issues into downstream tables.
- Avoid using UI uploads for recurring ingestion: if a dataset needs to be refreshed regularly or scaled over time, transition to SQL-based, file-based, or programmatic ingestion patterns.
Limitations to Keep in Mind
While convenient, UI-based ingestion is intentionally scoped. It is not designed for:
- Large-scale or high-frequency ingestion
- Automated or scheduled pipelines
- Complex schema evolution scenarios
For those use cases, Dremio’s SQL-based and file-based ingestion mechanisms provide more control and scalability.
UI uploads excel as a starting point, a fast path from raw data to governed Iceberg tables, before evolving into more robust ingestion patterns as requirements grow.
CTAS and INSERT INTO SELECT: SQL-Driven Ingestion and Incremental Updates
For most production ingestion workflows, SQL-based ingestion provides the best balance of simplicity, scalability, and control. Dremio supports this natively through CREATE TABLE AS SELECT (CTAS) and INSERT INTO SELECT, allowing teams to ingest, transform, and incrementally update Apache Iceberg tables using standard SQL.
These patterns are especially powerful because they:
- Write transactionally to Iceberg
- Work across federated sources (databases, warehouses, files, lakes)
- Integrate cleanly into automated workflows
- Support repeatable and incremental ingestion
Initial Loads with CREATE TABLE AS SELECT (CTAS)
CTAS is the most common way to perform an initial ingestion into Iceberg. It allows you to create a new Iceberg table directly from an existing source while applying transformations, filtering, and schema normalization.
Example: Migrating Data from an External Source into Iceberg
CREATE TABLE analytics.orders_iceberg
PARTITION BY (order_date)
AS
SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS DATE) AS order_date,
    order_status,
    total_amount,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp >= '2024-01-01';
In this example:
- Data is read directly from an external source
- The table is written in Apache Iceberg format
- Partitioning is applied at creation time
- Ingestion metadata is added explicitly
Once created, this Iceberg table is immediately registered in Dremio Open Catalog and available to other engines.
Best Practices for CTAS
- Normalize schemas during ingestion to avoid propagating inconsistencies
- Add ingestion timestamps for auditability and incremental logic
- Choose partitions carefully based on query patterns, not source layouts (see the sketch after this list)
- Prefer CTAS for initial loads and backfills, not ongoing updates
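On the partitioning point, Iceberg partition transforms let you partition by a derived value rather than the raw column. The sketch below assumes most queries filter by month; the table name is hypothetical, and the transform syntax should be checked against your Dremio version:
-- Partition by month of the order date to avoid an explosion of daily partitions
CREATE TABLE analytics.orders_iceberg_monthly
PARTITION BY (MONTH(order_date)) AS
SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS DATE) AS order_date,
    total_amount
FROM snowflake.sales.orders;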
Incremental Ingestion with INSERT INTO SELECT
After the initial load, most datasets need to be updated incrementally. Dremio supports this pattern using INSERT INTO SELECT, appending new data to existing Iceberg tables in a fully transactional manner.
Example: Incremental Inserts Based on a Timestamp
INSERT INTO analytics.orders_iceberg
SELECT
    order_id,
    customer_id,
    CAST(order_timestamp AS DATE) AS order_date,
    order_status,
    total_amount,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders
WHERE order_timestamp > (
    SELECT MAX(order_date) FROM analytics.orders_iceberg
);
This pattern:
- Reads only new records from the source
- Appends data safely to the Iceberg table
- Can be rerun without affecting existing data
Example: Incremental Inserts Using a High-Watermark Table
For more control, many teams maintain a watermark table:
INSERT INTO analytics.orders_iceberg
SELECT
    o.order_id,
    o.customer_id,
    CAST(o.order_timestamp AS DATE) AS order_date,
    o.order_status,
    o.total_amount,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders o
JOIN ingestion_metadata.watermarks w
    ON o.order_timestamp > w.last_processed_ts
WHERE w.dataset_name = 'orders';
This approach is especially useful when:
- Sources don’t guarantee ordering
- Multiple ingestion jobs share state
- Late-arriving data is expected
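After a successful load, the watermark itself has to be advanced so the next run picks up where this one left off. A minimal sketch, assuming the ingestion_metadata.watermarks table from the example above is itself an Iceberg table that Dremio can UPDATE:
-- Advance the watermark to the latest source timestamp just ingested
UPDATE ingestion_metadata.watermarks
SET last_processed_ts = (
    SELECT MAX(order_timestamp) FROM snowflake.sales.orders
)
WHERE dataset_name = 'orders';
Running the insert and the watermark update as two steps of the same orchestrated job keeps ingestion state explicit, which leads directly into the reliability considerations below.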
Idempotency and Reliability Considerations
When using SQL-based ingestion in production, it’s important to design for reliability:
- Avoid duplicates by filtering on immutable keys or timestamps
- Prefer append-only ingestion when possible
- Track ingestion state explicitly, not implicitly
- Treat SQL as declarative ingestion logic, not procedural code
Because Iceberg guarantees atomic commits, failed ingestion jobs will not leave tables in a partially written state.
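As one illustration of filtering on immutable keys, an anti-join makes an append safely re-runnable. This sketch reuses the orders example from earlier and is one option among several, not the only way to achieve idempotency:
-- Re-runnable append: skip any order_id that already exists in the target
INSERT INTO analytics.orders_iceberg
SELECT
    o.order_id,
    o.customer_id,
    CAST(o.order_timestamp AS DATE) AS order_date,
    o.order_status,
    o.total_amount,
    CURRENT_TIMESTAMP AS ingestion_ts
FROM snowflake.sales.orders o
WHERE NOT EXISTS (
    SELECT 1
    FROM analytics.orders_iceberg t
    WHERE t.order_id = o.order_id
);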
When to Use CTAS and INSERT INTO SELECT
This pattern is ideal for:
- Migrating data from existing systems into Iceberg
- Building curated or analytics-ready tables
- Incremental batch ingestion
- Pipelines managed by schedulers or CI/CD systems
It is less suitable for:
- Continuous file-based ingestion
- Event-driven ingestion from object storage
- Complex API-driven ingestion logic
For those scenarios, Dremio’s file-based ingestion and programmatic approaches are a better fit.
COPY INTO and CREATE PIPE: File-Based Ingestion from Object Storage
Many ingestion pipelines are built around files landing in object storage, whether produced by upstream systems, event-driven processes, or external partners. For these scenarios, Dremio provides file-native ingestion patterns that load data directly into Apache Iceberg tables without requiring intermediate processing engines.
Two SQL constructs are central to this approach: COPY INTO and CREATE PIPE. Together, they support both bulk ingestion and continuous file loading into Iceberg.
Bulk File Ingestion with COPY INTO
COPY INTO is designed for explicit, batch-oriented ingestion of files from object storage into an existing Iceberg table. It is well-suited for backfills, periodic loads, or controlled batch workflows.
Example: Loading Parquet Files from Object Storage
COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';
In this example:
- Files are read directly from an object storage location
- Data is appended transactionally to the Iceberg table
- No staging tables or external engines are required
Example: Loading CSV Files with Explicit Options
COPY INTO analytics.customers_iceberg
FROM '@s3.raw_data/customers/'
FILE_FORMAT (
    TYPE 'CSV',
    FIELD_DELIMITER ',',
    SKIP_FIRST_LINE TRUE
);
Dremio handles schema mapping, file discovery, and commit semantics automatically, ensuring that each COPY INTO operation results in a consistent Iceberg snapshot.
Best Practices for COPY INTO
- Use COPY INTO for controlled batch ingestion, not continuous streaming
- Validate schemas early, especially for CSV and JSON inputs
- Group files into reasonably sized batches to avoid excessively small files
- Prefer immutable file drops to simplify ingestion logic
Continuous Ingestion with CREATE PIPE
For pipelines that receive files continuously, CREATE PIPE enables automated ingestion. A pipe defines a persistent ingestion rule that watches a location and loads new files as they appear.
Example: Creating a Pipe for Continuous Ingestion
CREATE PIPE analytics.events_pipe
AS
COPY INTO analytics.events_iceberg
FROM '@s3.raw_data/events/'
FILE_FORMAT 'PARQUET';
Once created, the pipe:
- Tracks which files have already been processed
- Automatically ingests new files
- Ensures each file is loaded exactly once
Pipes are ideal for landing-zone architectures where upstream systems continuously write files to object storage.
Starting and Managing Pipes
ALTER PIPE analytics.events_pipe SET PIPE_EXECUTION_PAUSED = FALSE;
Pipes can be paused, resumed, or monitored without redefining ingestion logic.
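The same statement form covers pausing; for example, during upstream maintenance the pipe can be paused and later resumed without redefining it:
-- Temporarily stop automatic ingestion; set back to FALSE to resume
ALTER PIPE analytics.events_pipe SET PIPE_EXECUTION_PAUSED = TRUE;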
Handling Schema Evolution and File Layout
File-based ingestion often introduces schema drift over time. When ingesting into Iceberg:
- Backward-compatible schema changes (adding columns) are handled gracefully
- Incompatible changes should be normalized before ingestion
- Partitioning decisions should be based on query access patterns, not file layout
Dremio’s autonomous optimization features help mitigate issues such as small files and suboptimal layouts after ingestion.
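When you do want to intervene, compaction can also be triggered explicitly. A minimal sketch against the events table used above; the available rewrite options vary by Dremio version:
-- Rewrite small data files produced by frequent file drops into larger ones
OPTIMIZE TABLE analytics.events_iceberg;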
When to Use COPY INTO vs CREATE PIPE
| Use Case | Recommended Pattern |
| --- | --- |
| One-time backfill | COPY INTO |
| Scheduled batch loads | COPY INTO |
| Continuous file drops | CREATE PIPE |
| Event-driven pipelines | CREATE PIPE |
| External data feeds | CREATE PIPE |
Both patterns integrate seamlessly with Iceberg’s transactional model and Dremio Open Catalog, ensuring ingested data is immediately governed and queryable.
Where File-Based Ingestion Fits Best
File-based ingestion is ideal when:
- Upstream systems already produce files
- Object storage acts as a landing zone
- Low-latency streaming is not required
- Ingestion must scale independently of source systems
For ingestion that involves APIs, custom formats, or database-driven extraction logic, SQL alone is often not enough. In the next section, we’ll look at DremioFrame, a Python library that enables programmatic ingestion while still leveraging Dremio’s engine and Iceberg-native writes.
DremioFrame: Programmatic Ingestion into Iceberg Using Python
While SQL- and file-based ingestion cover many common lakehouse workflows, some ingestion scenarios require programmatic control. Data arriving from REST APIs, local files, Python applications, or external databases often needs to be fetched, normalized, or enriched before it can be written to Apache Iceberg.
DremioFrame addresses these use cases by providing a Python-native ingestion layer that writes data through the Dremio engine and into Iceberg tables managed by Dremio Open Catalog. Rather than acting as a local processing engine, DremioFrame serves as a control plane for ingestion, pushing work into Dremio wherever possible and preserving governance, lineage, and transactional guarantees.
DremioFrame supports both ELT-style ingestion (moving data from Dremio-connected sources into Iceberg) and ETL-style ingestion (bringing external data into the lakehouse).
API Ingestion
DremioFrame includes built-in support for ingesting data directly from REST APIs using client.ingest_api. This method handles fetching data, batching, and writing results into Iceberg tables.
from dremioframe.client import DremioClient

client = DremioClient()

client.ingest_api(
    url="https://api.example.com/users",
    table_name="marketing.users",
    mode="merge",  # 'replace', 'append', or 'merge'
    pk="id"
)
This pattern is well suited for:
- SaaS platforms and external services
- APIs without bulk export capabilities
- Incremental ingestion using merge semantics
Best practices
- Use merge with a primary key for mutable API data
- Use append for append-only event streams
- Consider staging tables for complex transformations before merging
File Upload from Local Systems
When working with local files that are not already available in object storage, DremioFrame allows you to upload them directly into Dremio as Iceberg tables using client.upload_file.
from dremioframe.client import DremioClient
client = DremioClient()
# Upload a CSV file
client.upload_file("data/sales.csv", "marketing.sales")
# Upload an Excel file
client.upload_file("data/financials.xlsx", "marketing.financials")
# Upload an Avro file
client.upload_file("data/users.avro", "marketing.users")Supported formats include CSV, JSON, Parquet, Excel, HTML, Avro, ORC, Lance, and Arrow/Feather, with format-specific options passed through to the underlying readers.
This approach is ideal for:
- Analyst-provided files
- Small to medium batch ingestion
- File types not directly ingested through the Dremio UI
Database Ingestion (JDBC / ODBC)
DremioFrame provides a standardized way to ingest data from relational databases into Iceberg using client.ingest.database. This integration supports a wide range of databases and can leverage high-performance backends such as connectorx.
from dremioframe.client import DremioClient

client = DremioClient()

client.ingest.database(
    connection_string="postgresql://user:password@localhost:5432/mydb",
    query="SELECT * FROM users WHERE active = true",
    table_name='"marketing"."users"',
    write_disposition="replace",
    backend="connectorx"
)
This pattern is commonly used for:
- Migrating operational data into Iceberg
- Isolating analytics workloads from source systems
- Periodic batch ingestion from databases
Performance tips
- Use connectorx whenever supported for faster ingestion
- Use append for incremental loads
- For very large datasets with SQLAlchemy, configure batch_size to stream results
File System Ingestion
For ingesting multiple local files at once, DremioFrame supports file system ingestion using glob patterns.
client.ingest.files(
    "data/events/*.parquet",
    table_name="marketing.events"
)
This is useful for:
- Backfills from local landing zones
- Ingesting many files in a single operation
- Prototyping before moving to object storage–based pipelines
Ingestion from Local DataFrames
When data already exists in Python as a Pandas DataFrame or Arrow table, DremioFrame provides several options for creating or updating Iceberg tables.
Creating a New Iceberg Table (Recommended)
import pandas as pd
df = pd.read_csv("local_data.csv")
client.create_table(
"marketing.local_data",
schema=df,
insert_data=True
)

This is the cleanest approach for creating new tables from local data.
Appending to an Existing Table
client.table("marketing.local_data").insert(
"marketing.local_data",
data=df,
batch_size=5000
)

Upserts Using Merge
client.table("staging.users").merge(
target_table="marketing.users",
on="id",
matched_update={
"email": "source.email",
"status": "source.status"
},
not_matched_insert={
"id": "source.id",
"email": "source.email",
"status": "source.status"
}
)

These patterns are especially useful for:
- Application-driven ingestion
- Controlled updates and merges
- Small to medium data volumes
Operational Best Practices
After ingestion, DremioFrame exposes table maintenance operations that are especially important for Iceberg tables:
# Compact small files
client.table("marketing.users").optimize()
# Expire old snapshots
client.table("marketing.users").vacuum(retain_last=5)Additional best practices:
- Use batching when inserting large DataFrames
- Ensure type consistency before ingestion
- Use staging tables for complex transformations
- Prefer Iceberg tables over direct source queries for analytics workloads
When to Use DremioFrame
DremioFrame is the right choice when:
- Ingestion originates outside Dremio (APIs, local files, Python apps)
- Programmatic control is required
- You want Iceberg-native writes without Spark
- You need to combine Python logic with lakehouse governance
It complements SQL-based ingestion (CTAS, INSERT, MERGE) and file-based ingestion (COPY INTO, CREATE PIPE) by bridging custom Python automation and governed, Iceberg-native lakehouse ingestion.
Conclusion: Choosing the Right Ingestion Pattern with Dremio
Ingesting data into an Apache Iceberg lakehouse does not require a single tool or a one-size-fits-all approach. Instead, effective lakehouse architectures rely on multiple ingestion patterns, each optimized for different sources, data velocities, and operational requirements.
Dremio makes this possible by treating Apache Iceberg as a first-class write format and combining it with an open catalog, autonomous optimization, and both SQL- and programmatic ingestion paths. Whether data arrives as files, database tables, API responses, or local datasets, Dremio provides a clear and consistent path to governed, interoperable Iceberg tables.
Each ingestion method covered in this post serves a distinct purpose:
- UI-based file uploads enable fast, ad hoc ingestion for exploration and prototyping.
- CTAS and INSERT INTO SELECT provide a robust, SQL-first approach for migrations, transformations, and incremental batch ingestion.
- COPY INTO and CREATE PIPE support scalable file-based ingestion from object storage, from one-time backfills to continuous file drops.
- DremioFrame extends ingestion into the programmatic domain, enabling APIs, databases, local files, and Python-driven workflows to land cleanly in Iceberg.
The key advantage of Dremio’s approach is that all of these paths converge on the same outcome: Apache Iceberg tables managed by Dremio Open Catalog, optimized automatically, governed consistently, and immediately usable across engines and tools.
This convergence dramatically reduces the complexity typically associated with data ingestion. Teams no longer need separate systems for ingestion, transformation, optimization, and governance. Instead, they can focus on choosing the right pattern for the job, confident that the resulting data will be performant, reliable, and ready for analytics and AI.
By combining open standards, autonomous lakehouse operations, and agentic interaction, Dremio turns the journey from raw data to trusted insight into an incremental, flexible process, not a long, disruptive platform migration.
As your data sources and requirements evolve, the ingestion patterns can evolve with them, without re-architecting the lakehouse or sacrificing openness.