Dremio Blog

33 minute read · May 22, 2026

Dremio Semantic Layer: A Practical Step-by-Step Guide

Alex Merced Alex Merced Head of DevRel, Dremio
Start For Free
Dremio Semantic Layer: A Practical Step-by-Step Guide
Copied to clipboard

Most data teams using Dremio start the same way: analysts connect directly to raw source tables, write their own joins, and define their own filters. Within six months, "net revenue" means four different things depending on which dashboard you open. The Dremio semantic layer is the fix for this. It gives you a single governed layer of SQL views where business logic lives once and is shared everywhere.

This guide walks you through building a complete Dremio semantic layer for an e-commerce analytics use case from scratch. You will connect raw sources, build three tiers of views, add documentation, apply access control, and verify the whole thing works with Dremio's AI Agent. Every step includes working SQL. If you want conceptual background before diving in, the Dremio semantic layer overview and the complete semantic layer guide are solid starting points.

What You're Building

The use case is an e-commerce company with three raw data sources:

  • PostgreSQL orders database: transactional records with order ID, customer ID, amount in cents, status, timestamp, and region
  • Apache Iceberg tables on S3: clickstream event logs capturing user sessions, page views, and engagement signals
  • SaaS CRM data: customer profile data federated via a Dremio connector or pre-landed as an Iceberg table

The goal is to give three different consumer groups reliable, governed access to the same underlying data:

  • BI tools (Tableau, Power BI) querying monthly revenue and customer metrics
  • AI agents asking natural language questions about business performance
  • Executive dashboards showing pre-computed KPIs with no tolerance for inconsistency

You will build this as a three-layer stack: a Bronze preparation layer, a Silver business logic layer, and a Gold application layer. Each layer has a naming convention and a clear purpose. The result is a Dremio semantic layer where business definitions are encoded once and consumed many times.

Try Dremio’s Interactive Demo

Explore this interactive demo and see how Dremio's Intelligent Lakehouse enables Agentic AI

Organizing Your Semantic Layer in Dremio

Before writing a single line of SQL, set up the folder structure in Dremio. Dremio organizes virtual datasets in Spaces (think of these as namespaces for your catalog). Create three spaces:

  • semantic.bronze for preparation views
  • semantic.business for business logic views
  • semantic.application for application-ready views

You can create spaces from the Dremio UI under Catalog > New Space, or via the REST API. Once the spaces exist, all your SQL DDL can reference them directly with three-part names like semantic.bronze.prep_orders.

Adopt consistent naming prefixes from the start:

LayerPrefixExample
Bronzeprep_prep_ordersprep_events
Silverbiz_biz_net_revenuebiz_customer_tiers
Goldapp_app_executive_revenueapp_marketing_funnel

This convention pays off when you use Dremio's lineage graph later. You can immediately see which layer a view belongs to just from its name.

Step 1: Connect Your Data Sources

Dremio's query federation means you connect to sources rather than copying data. No ETL pipeline. No staging environment. Dremio queries the source directly at runtime and translates your SQL into the native dialect or format of that source.

Adding a PostgreSQL Source

From the Dremio UI:

  1. Navigate to Sources in the left sidebar
  2. Click + Add Source
  3. Select PostgreSQL
  4. Fill in hostname, port, database name, username, and password
  5. Click Save

Dremio reflects the PostgreSQL schema immediately. Your tables appear at raw_postgres.public.orders (or whatever schema name you configured). No data is moved. Every query against this source hits PostgreSQL at runtime.

Adding an Apache Iceberg Source on S3

For the clickstream event data stored as Iceberg tables on S3:

  1. Navigate to Sources > + Add Source
  2. Select Amazon S3 (or use Nessie if you have a catalog server)
  3. Provide the S3 bucket, path prefix, and AWS credentials
  4. Dremio discovers Iceberg metadata files and maps them to queryable datasets

If you use Dremio's Open Catalog (based on Apache Polaris), your Iceberg tables are registered there and Dremio federates them automatically. The benefit of Open Catalog is that multiple engines (Spark, Trino, Flink) can read the same Iceberg tables using the same catalog, while Dremio provides the semantic layer on top.

Once both sources are connected, you can write cross-source SQL immediately:

SELECT o.customer_id, COUNT(e.event_id) as event_count
FROM raw_postgres.public.orders o
JOIN iceberg_s3.events.clickstream e
  ON o.customer_id = e.user_id
WHERE o.created_at >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY 1;

This is Dremio query federation working. No data movement. No intermediate tables. The semantic layer you build on top of this is all virtual.

Step 2: Build the Bronze Preparation Layer

The Bronze layer does one thing: make raw data safe to use. It does not implement business logic. It casts types, renames cryptic columns, standardizes formats, and filters out records that have no business being in any analysis. Test accounts, nulls from bad ingestion runs, and malformed strings all go here.

The prep_ prefix signals to anyone reading the catalog that this view is a cleaned pass-through, not a metric definition.

prep_orders: Cleaning the Orders Table

-- Bronze Layer: Clean orders view
CREATE OR REPLACE VIEW semantic.bronze.prep_orders AS
SELECT 
  order_id,
  customer_id,
  CAST(created_at AS TIMESTAMP) AS order_timestamp,
  CAST(amount_cents AS DECIMAL(10,2)) / 100 AS order_amount_usd,
  LOWER(TRIM(status)) AS order_status,
  LOWER(region) AS region
FROM raw_postgres.public.orders
WHERE customer_id NOT LIKE 'test%';

Walk through each transformation:

  • CAST(created_at AS TIMESTAMP): the source column is stored as a string in some PostgreSQL configurations. Cast it once here so downstream views do not repeat this work.
  • CAST(amount_cents AS DECIMAL(10,2)) / 100: store amounts in dollars from the start, not cents. This prevents the classic bug where a downstream view forgets to divide.
  • LOWER(TRIM(status)): status values in production databases are often inconsistent ('Refunded'' refunded''REFUNDED'). Normalize to lowercase trimmed strings once.
  • LOWER(region): same principle for region values.
  • WHERE customer_id NOT LIKE 'test%': test accounts created during QA and development pollute every aggregate unless you exclude them here.

For the clickstream data, create a matching prep view:

-- Bronze Layer: Clean events view
CREATE OR REPLACE VIEW semantic.bronze.prep_events AS
SELECT
  event_id,
  user_id AS customer_id,
  CAST(event_timestamp AS TIMESTAMP) AS event_timestamp,
  LOWER(event_type) AS event_type,
  LOWER(page_name) AS page_name,
  session_id,
  LOWER(region) AS region
FROM iceberg_s3.events.clickstream
WHERE user_id IS NOT NULL
  AND user_id NOT LIKE 'test%';

The Bronze layer is your first line of defense. When the PostgreSQL team renames a column or changes a data type, you update one view. Not 40 downstream reports.

Step 3: Build the Silver Business Logic Layer

The Silver layer is where your organization's definitions live in code. "Net revenue" is not just a column name here. It is a specific formula your finance team has agreed on. "Active customer" has a specific time window your product team uses. These definitions go in biz_ views and they build on the Bronze layer, not raw sources.

If you are building your semantic layer alongside a dbt project, dbt's view materialization pattern maps directly to this approach. Your Silver models map to biz_ views in Dremio.

Defining Net Revenue

-- Silver Layer: Net Revenue business metric
CREATE OR REPLACE VIEW semantic.business.biz_net_revenue AS
SELECT 
  DATE_TRUNC('month', order_timestamp) AS revenue_month,
  region,
  SUM(order_amount_usd) AS gross_revenue_usd,
  SUM(CASE WHEN order_status != 'refunded' THEN order_amount_usd ELSE 0 END) AS net_revenue_usd,
  COUNT(DISTINCT customer_id) AS paying_customers
FROM semantic.bronze.prep_orders
WHERE order_status != 'cancelled'
GROUP BY 1, 2;

Three business decisions are encoded here:

  1. Exclude cancelled orders entirely from the base WHERE clause. A cancelled order never generated revenue.
  2. Subtract refunds from net revenue using the CASE WHEN expression. Refunded orders are retained in the row set so you can track the gross figure, but they are excluded from the net calculation.
  3. Monthly granularity via DATE_TRUNC. This is the standard reporting cadence your finance team agreed on.

Defining Customer Tiers

-- Silver Layer: Customer tier classification
CREATE OR REPLACE VIEW semantic.business.biz_customer_tiers AS
SELECT
  customer_id,
  SUM(order_amount_usd) AS lifetime_value_usd,
  CASE
    WHEN SUM(order_amount_usd) >= 10000 THEN 'Enterprise'
    WHEN SUM(order_amount_usd) >= 1000  THEN 'Growth'
    ELSE 'Starter'
  END AS customer_tier,
  MAX(order_timestamp) AS last_order_timestamp,
  CASE
    WHEN MAX(order_timestamp) >= CURRENT_TIMESTAMP - INTERVAL '90' DAY THEN true
    ELSE false
  END AS is_active
FROM semantic.bronze.prep_orders
WHERE order_status NOT IN ('cancelled', 'refunded')
GROUP BY 1;

The tier thresholds ($1,000 for Growth, $10,000 for Enterprise) are business decisions. When the sales team decides to change them, you update this one view. Every downstream report automatically reflects the change.

Step 4: Build the Gold Application Layer

Gold views are tailored for specific consumers. They pull from Silver views, select only the columns each consumer needs, apply final formatting (rounding, aliases), and give column names that make sense to the audience reading them. The app_ prefix marks them as consumer-ready.

Executive Revenue Dashboard View

-- Gold Layer: Executive Revenue Dashboard view
CREATE OR REPLACE VIEW semantic.application.app_executive_revenue AS
SELECT 
  revenue_month,
  region,
  ROUND(net_revenue_usd, 2) AS net_revenue,
  paying_customers,
  ROUND(net_revenue_usd / NULLIF(paying_customers, 0), 2) AS arpu
FROM semantic.business.biz_net_revenue
ORDER BY revenue_month DESC;

NULLIF(paying_customers, 0) prevents division-by-zero errors. The column name arpu (Average Revenue Per User) is the term executives use, and it reads clearly in dashboard labels. The _usd suffix is dropped intentionally for brevity.

Marketing Analytics View

-- Gold Layer: Marketing analytics combining revenue + engagement
CREATE OR REPLACE VIEW semantic.application.app_marketing_analytics AS
SELECT
  r.revenue_month,
  r.region,
  r.net_revenue_usd,
  r.paying_customers,
  COUNT(DISTINCT e.session_id) AS total_sessions,
  COUNT(DISTINCT e.customer_id) AS engaged_users,
  ROUND(r.net_revenue_usd / NULLIF(COUNT(DISTINCT e.session_id), 0), 4) AS revenue_per_session
FROM semantic.business.biz_net_revenue r
LEFT JOIN (
  SELECT
    customer_id,
    session_id,
    region,
    DATE_TRUNC('month', event_timestamp) AS event_month
  FROM semantic.bronze.prep_events
  WHERE event_type = 'purchase_intent'
) e ON r.region = e.region AND r.revenue_month = e.event_month
GROUP BY 1, 2, 3, 4;

AI Agent-Specific View

-- Gold Layer: AI Agent optimized view (clean names, minimal columns)
CREATE OR REPLACE VIEW semantic.application.app_ai_revenue_summary AS
SELECT
  revenue_month AS month,
  region,
  net_revenue_usd AS net_revenue,
  paying_customers AS customer_count,
  ROUND(net_revenue_usd / NULLIF(paying_customers, 0), 2) AS revenue_per_customer
FROM semantic.business.biz_net_revenue;

The AI agent view strips away columns the agent does not need for common revenue questions. Fewer columns reduce ambiguity during SQL generation. Column names are full English words with no abbreviations, because the AI reads them to understand what each field represents. This is part of Dremio's Agentic Lakehouse architecture: the semantic layer is designed to be machine-readable, not just human-readable.

Step 5: Document Your Virtual Datasets

Documentation is what turns a set of SQL views into a real semantic layer. Without it, the views are just views. With it, they become a governed data contract that BI tools, data engineers, and AI agents can all use reliably.

Adding a Wiki Description

In the Dremio UI, navigate to any virtual dataset, click the Wiki tab, and write a Markdown description. For biz_net_revenue, a good wiki entry looks like:

## Net Revenue (Monthly by Region)

Monthly net revenue aggregated by region, calculated as gross order revenue 
minus refunded order amounts. Cancelled orders are excluded entirely.

**Source:** `semantic.bronze.prep_orders`  
**Granularity:** Calendar month  
**Refresh cadence:** Live (queries source at runtime)  
**Business owner:** Finance team  

### Metric definitions
- **gross_revenue_usd:** Sum of all order amounts including refunds
- **net_revenue_usd:** Sum of order amounts excluding refunded orders
- **paying_customers:** Count of distinct customer IDs with non-cancelled, non-refunded orders

You can also write this via the Dremio REST API, which lets you version-control your documentation alongside your SQL in git:

curl -X POST "https://<your-dremio>/api/v3/catalog/<dataset-id>/wiki" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"text": "## Net Revenue (Monthly by Region)\n\nMonthly net revenue..."}'

Adding Labels

Labels in Dremio are tags you apply to datasets for classification. Apply them from the dataset detail page or via API. Useful labels for this semantic layer:

  • revenue_metric: marks this as an official revenue measure
  • financial: financial governance classification
  • ai_consumable: signals to the AI Agent that this view is safe and appropriate to query
  • pii_free: confirms the view does not expose personally identifiable information

AI-Generated Column Descriptions

Dremio can sample your dataset and auto-generate column-level descriptions using its AI capabilities. For a view like app_executive_revenue, Dremio can infer that arpu is a calculated ratio and suggest a description. You review and approve these, then they appear in the catalog for anyone querying the dataset.

The combination of wiki descriptions, column descriptions, and labels is what makes your Dremio data catalog genuinely useful, for human data consumers and for AI agents generating SQL on your behalf alike.

Step 6: Set Up Access Control

The semantic layer is only as trustworthy as its access controls. You want data engineers to maintain all three layers, analytics teams to query the Gold layer, and no one (except authorized roles) to query raw source tables directly.

Role-Based Access

Grant access at the Space level:

-- Grant analytics team access to Gold layer only
GRANT SELECT ON FOLDER "semantic"."application" TO ROLE analytics_team;

-- Grant data engineers access to all semantic layers
GRANT SELECT ON SPACE "semantic" TO ROLE data_engineering;

-- Explicitly deny raw source access for analytics users (Dremio RBAC)
-- Raw sources are not in the semantic space, so simply not granting access is sufficient

Dremio's role-based access control (RBAC) is additive. If analytics_team only has SELECT on semantic.application, they cannot access raw_postgres.public.orders unless you explicitly grant it. The architecture naturally enforces the access boundary.

Row-Level Security with FGAC

For regional data restrictions (where West region managers should only see West region data), use Dremio's Fine-Grained Access Control (FGAC) with a UDF:

-- FGAC: Row-level security function for regional access
CREATE FUNCTION enforce_region_access(region VARCHAR)
RETURNS BOOLEAN
RETURN region = SESSION_USER_ATTRIBUTE('user_region')
   OR SESSION_USER_ATTRIBUTE('user_region') = 'global';

Apply this function in your Gold layer view:

-- Gold Layer with row-level security applied
CREATE OR REPLACE VIEW semantic.application.app_executive_revenue AS
SELECT 
  revenue_month,
  region,
  ROUND(net_revenue_usd, 2) AS net_revenue,
  paying_customers,
  ROUND(net_revenue_usd / NULLIF(paying_customers, 0), 2) AS arpu
FROM semantic.business.biz_net_revenue
WHERE enforce_region_access(region) = TRUE
ORDER BY revenue_month DESC;

SESSION_USER_ATTRIBUTE('user_region') reads the user_region attribute from the logged-in user's session, which you configure in Dremio's user management. A user with user_region = 'west' sees only West region rows. A user with user_region = 'global' (your finance director, your executive team) sees all regions.

This works transparently. The BI tool connecting to this view does not need to know about the regional restriction. The user queries the view normally; Dremio enforces the row filter at runtime.

Step 7: Test with the Dremio AI Agent

With your Gold layer documented and access-controlled, the Dremio AI Agent can now answer business questions using your semantic layer. This is where the documentation work pays off directly.

How the AI Agent Uses Your Semantic Layer

Open the Dremio AI Agent interface. Type:

"What was our net revenue in the West region last quarter?"

The agent processes this in several steps:

  1. Schema discovery: the agent reads your catalog, finds datasets labeled ai_consumable or tagged as revenue metrics
  2. Wiki reading: it reads the wiki description for app_ai_revenue_summary or app_executive_revenue, confirming that net_revenue is the correct column for this question
  3. SQL generation: it generates a SQL query filtering region = 'west' and applying a date range for last quarter
  4. FGAC enforcement: your enforce_region_access() UDF runs automatically; if the agent's service account has user_region = 'global', it can see all data, and the filter in the question narrows the result
  5. Result delivery: the agent returns the answer with the underlying SQL visible for transparency

The key insight is that the agent chooses app_executive_revenue or app_ai_revenue_summary rather than querying raw_postgres.public.orders directly. This happens because your Gold layer views are documented, labeled, and have clean column names. The Silver and Bronze views are also accessible, but the Gold layer's documentation makes the intent unambiguous.

Why the Gold Layer Works Better for AI

FactorRaw Source TableGold Layer View
Column namesamt_ctscust_idcreat_dtnet_revenuecustomer_countmonth
Business logic appliedNoneNet revenue excludes refunds, tier logic applied
DocumentationNoneWiki description, column descriptions, labels
Access controlBroadRole-scoped, FGAC row filtering
Column count40+ source columns5-8 relevant columns

Fewer columns reduce the chance the AI picks the wrong field. Clear documentation removes ambiguity about what each metric means. This is why building a complete semantic layer (not just connecting data sources) is foundational to the Dremio Agentic Lakehouse architecture.

If you connect external AI agents or tools via the Dremio MCP Server, they access the same governed semantic layer. The MCP Server exposes Dremio's SQL interface via the Model Context Protocol, so agents built on LLM frameworks can query app_executive_revenue with the same access controls that apply to any other Dremio user.

Maintaining Your Semantic Layer Over Time

Building the layer is the first step. Keeping it accurate is the ongoing discipline.

Version Control Your View SQL

Treat your SQL DDL as code. Store each view's CREATE OR REPLACE VIEW statement in a git repository:

semantic-layer/
  bronze/
    prep_orders.sql
    prep_events.sql
  business/
    biz_net_revenue.sql
    biz_customer_tiers.sql
  application/
    app_executive_revenue.sql
    app_marketing_analytics.sql
    app_ai_revenue_summary.sql

If you use dbt, your Silver views map naturally to dbt models with view materializations. dbt handles the CREATE OR REPLACE VIEW DDL and tracks lineage in its manifest. You can run dbt models against Dremio's JDBC/ODBC endpoint or Arrow Flight endpoint.

Track Business Definition Changes

When the finance team changes the definition of "net revenue" (say, they decide to also exclude orders with a discount greater than 80%), update biz_net_revenue.sql in git with a commit message that explains the business reason. Then update the wiki description in Dremio to match.

A semantic layer where the SQL disagrees with the documentation is worse than no semantic layer at all. The documentation is part of the contract.

Use Data Lineage for Impact Analysis

Before changing a Bronze view like prep_orders, check Dremio's Data Lineage graph. It shows you every Silver and Gold view that depends on prep_orders. If you change a column name or remove a filter, you can see exactly which downstream views will break before you make the change.

This is especially important for columns that AI agents reference. If an agent has been correctly answering questions using net_revenue and you rename the column, the agent's behavior changes. Lineage makes that impact visible.

Periodic Schema Drift Audits

PostgreSQL schemas change. New columns get added, old ones get deprecated, data types change. Run periodic checks to verify that your Bronze views still reflect the current source schema:

-- Check if source columns exist that prep_orders doesn't expose
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public' AND table_name = 'orders'
  AND column_name NOT IN (
    SELECT column_name FROM information_schema.columns
    WHERE table_name = 'prep_orders'
  );

This kind of maintenance discipline is what separates a semantic layer that data teams trust from one they quietly bypass.

Tradeoffs to Know Before You Start

No architecture is free of tradeoffs. Here is what to weigh honestly.

View overhead is minimal but real. Each layer adds one query planning step. In practice this is sub-millisecond for most query patterns. For heavy aggregations running frequently, Dremio's Reflections (including Autonomous Reflections) can materialize the Silver or Gold layer results transparently. Queries hit the reflection rather than recomputing from Bronze on every run.

Maintenance is ongoing, not optional. A semantic layer is a living artifact. Business definitions change. Source schemas drift. New teams ask for new metrics. Plan for semantic layer maintenance as a recurring engineering activity, not a one-time project.

Layer proliferation is a real risk. Without discipline, you end up with 300 ad hoc views that nobody can navigate. Enforce the prep_/biz_/app_ naming convention. Archive views that are no longer used. Conduct quarterly catalog cleanups.

The layer must stay synchronized with business definitions. The most dangerous state is a semantic layer where the SQL is correct but the documentation is stale. Analysts trust the documentation, build on it, and then get wrong answers. Treat documentation updates as part of any code change.

FGAC adds complexity. Row-level security with UDFs is powerful but requires careful testing. Make sure you test both the "restricted user sees only their region" case and the "global user sees all regions" case before you put a FGAC view in production.

Build Once, Query Everywhere

After completing these seven steps, you have a Dremio semantic layer that is production-ready:

  • Three source connections (PostgreSQL, Iceberg/S3, SaaS) federated without ETL
  • A Bronze layer that cleans and standardizes raw data in 8 views
  • A Silver layer that encodes official business definitions for revenue, customer tiers, and engagement
  • A Gold layer tailored for three distinct consumers: BI tools, executives, and AI agents
  • Documentation (wiki, column descriptions, labels) that makes every view self-describing
  • FGAC row-level security so regional managers see only their data
  • A Dremio AI Agent that uses your documented Gold layer to answer business questions with governed SQL

What makes this approach durable is that it scales in both directions. As you add new sources, you add Bronze views. As business definitions evolve, you update Silver views in one place. As new consumer types emerge (more AI agents, more specialized dashboards), you add Gold views without touching the logic underneath.

The combination of Autonomous Reflections, documented virtual datasets, and the Dremio AI Agent is moving toward a semantic layer that partially manages itself. Today you build and maintain it manually. That ratio shifts over time.

Ready to build your own? Try Dremio Cloud free for 30 days and connect your first source in under five minutes.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.