The conventional wisdom for data platform modernization goes like this: pick a target system, build ETL pipelines for every source, migrate everything, validate the data, retrain your users, and then start getting value. That process takes six to eighteen months. During that time, analysts are waiting and leadership is asking why the investment has not produced results yet.
There is a better sequence. Instead of making everyone wait for a full migration, you start producing value on day one and migrate to Apache Iceberg at your own pace. The key is treating federation, the semantic layer, AI access, and Iceberg migration as four independent phases, each delivering value on its own, rather than a single all-or-nothing project.
Phase 1: Connect Your Data Where It Lives
Sign up for Dremio Cloud and you get a lakehouse project with a pre-configured Open Catalog right away. From there, start connecting your existing data sources through Dremio's federated query engine: PostgreSQL, MySQL, MongoDB, S3, Snowflake, BigQuery, Redshift, AWS Glue, Unity Catalog, and more.
No data copying. No ETL pipelines. Dremio queries your data where it already lives, using predicate pushdown to delegate filtering work to each source system.
The result: by the end of day one, your team has unified SQL access across every connected source. An analyst can join a PostgreSQL customer table with an S3-based event stream in a single query, without waiting for a data engineer to build a pipeline first.
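A cross-source join like the one described might look like this. This is a hedged sketch: the source names (`postgres_prod`, `s3_lake`), schemas, and columns are all hypothetical, not part of any real environment.

```sql
-- Join a federated PostgreSQL customer table with S3-backed event data
-- in a single query; no pipeline required. All identifiers are examples.
SELECT c.customer_id,
       c.customer_name,
       COUNT(*) AS event_count
FROM postgres_prod.public.customers AS c
JOIN s3_lake.events.clickstream AS e
  ON e.customer_id = c.customer_id
WHERE e.event_date >= DATE '2024-01-01'  -- filter pushed down to each source
GROUP BY c.customer_id, c.customer_name;
```

The `WHERE` predicate is pushed down so that PostgreSQL and the S3 source each filter their own rows before Dremio performs the join.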
Phase 2: Build a Semantic Layer Over Everything
Raw source tables have cryptic column names, inconsistent types, and zero business context. Before anyone can get reliable answers, whether human or AI, you need a curated layer on top.
Bronze/Raw views map to raw sources. They standardize column names, cast data types, and apply basic filters. One Bronze view per source table.
Silver/Business views apply business logic. This is where you define what "active customer" means (purchased in the last 90 days, not on a trial), join data across sources, and compute metrics.
Gold/Application views serve specific consumers: a dashboard, a report, or an AI agent. Each Gold view is optimized for its use case.
Dremio's AI Agent can help you come up with the SQL to generate these views efficiently.
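The three layers can be sketched as stacked view definitions. This is an illustrative example, not production code: the folder layout, source path, column names, and the 90-day rule follow the "active customer" definition in the text, and every identifier is hypothetical.

```sql
-- Bronze: standardize names and types over the raw source table
CREATE OR REPLACE VIEW bronze.customers AS
SELECT CAST(cust_id AS BIGINT)  AS customer_id,
       TRIM(cust_nm)            AS customer_name,
       CAST(last_purch AS DATE) AS last_purchase_date,
       is_trial
FROM postgres_prod.public.customers;

-- Silver: apply business logic ("active customer" = purchased in the
-- last 90 days, not on a trial)
CREATE OR REPLACE VIEW silver.active_customers AS
SELECT customer_id, customer_name, last_purchase_date
FROM bronze.customers
WHERE last_purchase_date >= CURRENT_DATE - INTERVAL '90' DAY
  AND is_trial = FALSE;

-- Gold: shaped for one specific consumer, e.g. a retention dashboard
CREATE OR REPLACE VIEW gold.retention_dashboard AS
SELECT customer_id, last_purchase_date
FROM silver.active_customers;
```

Notice that Silver and Gold reference views, not physical tables; this indirection is what later makes the Iceberg migration invisible to consumers.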
Govern Access and Document Everything
Grant users access to specific views using Role-Based Access Control (RBAC) at the folder, dataset, and column level. For sensitive data, add Fine-Grained Access Control (FGAC) via UDFs for row-level security and column-level masking.
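A minimal sketch of what the RBAC grant and a UDF-based row access policy might look like. The role, view, and column names are hypothetical, and exact DDL syntax varies by Dremio version, so treat this as a shape rather than a recipe.

```sql
-- RBAC: grant a role read access to one specific Gold view
GRANT SELECT ON VIEW gold.retention_dashboard TO ROLE analysts;

-- FGAC: a boolean UDF that decides row visibility per user,
-- attached to a view as a row access policy
CREATE FUNCTION security.region_filter(region VARCHAR)
RETURNS BOOLEAN
RETURN SELECT is_member('emea_analysts') AND region = 'EMEA';

ALTER VIEW silver.sales_by_region
  ADD ROW ACCESS POLICY security.region_filter(region);
```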
Then enrich every dataset with Wikis (human-readable documentation explaining what each column means) and Tags (categorical labels for discoverability). Dremio can auto-generate Wiki descriptions and suggest Tags by sampling your table data and schema. You review and refine the output instead of writing everything from scratch.
This metadata is not just for humans. It is what the AI Agent reads when generating SQL. Better documentation means more accurate answers.
Phase 3: Turn On Agentic Analytics
With a governed semantic layer in place, you are ready for AI. This is the important part: you do not need to complete the Iceberg migration first. Agentic analytics works on federated data from the moment the semantic layer exists.
Dremio's built-in AI Agent lets users type plain-English questions in the console. The agent writes SQL, executes it against your governed views, returns results, generates charts, and suggests follow-up questions. It respects every RBAC and FGAC policy in your catalog. Users can only get answers about data they are authorized to see.
For teams that want to use external tools, Dremio's MCP (Model Context Protocol) server lets ChatGPT, Claude Desktop, or custom agents connect directly to your Dremio environment. External tools get the same semantic context and security controls as the built-in agent.
| Interface | What It Provides |
| --- | --- |
| Built-in AI Agent | Natural language queries, SQL generation, charts, follow-up suggestions inside Dremio |
| MCP Server | Connect any MCP-compatible AI tool (ChatGPT, Claude, custom agents) with full governance |
| AI SQL Functions | Run AI_GENERATE, AI_CLASSIFY, AI_COMPLETE directly in SQL for unstructured data analysis |
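As a rough illustration of the AI SQL functions, a classification query might look like the following. The table and column are hypothetical, and the exact function signatures may differ by Dremio version, so check the current SQL reference before relying on this shape.

```sql
-- Categorize free-text support tickets directly in SQL.
-- Identifiers and the argument shape are assumptions for illustration.
SELECT ticket_id,
       AI_CLASSIFY(ticket_text,
                   ARRAY['billing', 'bug', 'feature request']) AS category
FROM gold.support_tickets;
```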
At this point your organization has unified data access, a governed semantic layer, and AI-powered analytics, and you have not migrated a single table to Iceberg yet.
Phase 4: Migrate to Iceberg, One Dataset at a Time
Federation gets you access, but a full Apache Iceberg lakehouse gets you more: Autonomous Reflections that optimize query performance based on actual usage patterns, end-to-end caching, automated table maintenance (compaction, clustering, vacuuming), and interoperability with every Iceberg-compatible engine (Spark, Flink, Trino). Your data stays in your storage, in an open format, with no vendor lock-in.
The migration pattern is deliberately incremental:
Pick one dataset to migrate (start with the highest-volume or most-queried table)
Build an Iceberg pipeline to land that data in your object storage (S3 or Azure)
Update the Bronze view to point to the new Iceberg table instead of the legacy federated source
Silver and Gold views stay unchanged. They reference the Bronze view, which now reads from Iceberg instead of the old source.
Every consumer is unaffected. Dashboards, reports, and AI agents continue to work exactly as before.
Repeat for the next dataset whenever you are ready. There is no deadline and no big-bang cutover.
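The incremental pattern above can be sketched in two statements: land the data as an Iceberg table, then repoint the Bronze view. All names are hypothetical, and a production pipeline would use incremental ingestion rather than a one-shot CTAS.

```sql
-- Land the source data as an Iceberg table in object storage
CREATE TABLE lakehouse.bronze_tables.customers AS
SELECT * FROM postgres_prod.public.customers;

-- Repoint the Bronze view at the new Iceberg table; Silver and Gold
-- views, dashboards, and AI agents are untouched because the view's
-- column names and types are unchanged
CREATE OR REPLACE VIEW bronze.customers AS
SELECT CAST(cust_id AS BIGINT) AS customer_id,
       TRIM(cust_nm)           AS customer_name
FROM lakehouse.bronze_tables.customers;
```

The second statement is the entire cutover for that dataset: the legacy federated source can be retired whenever you choose.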
Why the View Layer Makes Migration Invisible
This is the architectural insight that makes the whole journey work. The semantic layer acts as a contract between physical data storage and every consumer above it.
When you swap a Bronze view's underlying source from PostgreSQL to an Iceberg table, every Silver view, Gold view, dashboard, report, and AI agent that depends on it continues to work without changes. The view contract (column names, data types, business logic) is preserved. Only the physical source pointer changes.
This means:
No dashboard rewiring
No report migration
No API endpoint changes
No AI Agent reconfiguration
No user communication (beyond governance notifications if your policies require them)
The migration happens underneath the abstraction layer. Everyone above it is oblivious.
The Tradeoffs
This phased approach is not free of costs.
Federation introduces network latency. Queries that join a PostgreSQL table in one region with an S3 bucket in another will be slower than queries against co-located Iceberg tables. Reflections and caching mitigate this for repeated queries, but the first execution of a new query pattern will feel it.
Iceberg migration still requires building ingest pipelines. Dremio does not eliminate that work. What it does is decouple the pipeline work from the analytics timeline. Your analysts and AI agents are productive while engineers build migration pipelines in the background.
Autonomous Reflections need a 7-day observation window before they start optimizing. Day-one performance on brand-new Iceberg tables relies on baseline optimizations (C3 caching, predicate pushdowns, vectorized execution). The system gets faster as it learns your query patterns.
And Dremio is an analytical engine, not a transactional database. Your OLTP workloads stay in PostgreSQL, MongoDB, or whatever system runs your application. You query those systems through federation, not as a replacement.
Start Today, Migrate Over Time
The traditional approach forces you to choose: spend months migrating, or keep running fragmented analytics on scattered data. Dremio eliminates that choice. Connect your sources, build your semantic layer, enable AI access, and start migrating to Iceberg when you are ready. Each phase delivers value independently, and the view layer ensures that migration never disrupts the people who are already getting answers.