Dremio Blog

12 minute read · March 31, 2026

How Dremio’s AI Functions Unlock Analytics on Unstructured Data

Will Martin · Technical Evangelist

Key Takeaways

  • Many data engineering teams can only query about 20% of their company's data, as much of it remains unstructured.
  • Dremio's AI functions (AI_GENERATE, AI_CLASSIFY, AI_COMPLETE) allow users to analyze unstructured data directly within SQL, eliminating the need for separate pipelines.
  • Now, unstructured data like PDFs and emails can be stored as-is and queried without prior transformation, streamlining the analytics process.
  • The integration of AI within the SQL engine simplifies data governance, access control, and speeds up the analysis process for analysts familiar with SQL.

Ask a typical data engineering team what percentage of their company's data they can actually query. The honest answer will be low, probably around 20%. The rest sits in object storage as PDFs, email threads, customer feedback forms, call transcripts, maintenance logs, and scanned documents. It's all there. It's all stored. And for the purposes of SQL analytics, it has been essentially invisible.

This isn't a storage problem. Data lakes have handled unstructured files just fine for years. The problem is what happens after storage: to do anything analytically useful with that text, teams have had to pull it into a separate ML pipeline, run it through a language model or NLP service, transform the outputs into something tabular, and load it back somewhere a BI tool can read. And don't forget to account for the governance headaches of moving the data between these systems.

Dremio's AI functions (AI_GENERATE, AI_CLASSIFY, and AI_COMPLETE) collapse that processing pipeline into a single SQL query. They run directly inside the Dremio query engine, against files that live in your existing data lake. No copies, no separate pipelines, no ETL. You write SQL, and the unstructured data becomes part of your analysis.


Why Unstructured Data Has Always Been the Lakehouse's Blind Spot

The data lakehouse promised to deliver warehouse analytics performance on the data lake, with open formats, open storage, and SQL-based access. It delivered on most of that. But "SQL-based access" has always implied structured data. Think files with defined schemas and tabular data with rows and columns.

However, consider what companies actually generate day to day: customer support tickets, legal contracts, product reviews, internal incident reports, clinical notes, job applications, sales call notes. It's created at volume, it accumulates in S3 or Azure Data Lake Storage, and it holds genuinely valuable signals that structured data alone can't capture. But because this data doesn't fit neatly into a schema, it's left outside the scope of typical data analytics.

The standard workaround has been to spin up separate infrastructure: build a custom LLM pipeline, extract what's needed from the unstructured files, serialise it, and land it somewhere as structured data. The cognitive and operational overhead of this multi-step process is considerable, and the costs compound quickly as data volumes grow.

What was missing was a way to bring AI inference inside the lakehouse itself. The files already live there, so why can't the inference move in and live there too?

What Dremio's AI Functions Actually Do

Dremio exposes three AI functions in its SQL query engine, plus a supporting function called LIST_FILES that makes it possible to point any of them directly at files in object storage. Here's how each one works.

AI_GENERATE is the most general of the three. It sends unstructured content to a configured LLM and returns results shaped by an output schema you define in SQL. That's the key detail: you specify exactly what fields you want back, and the model extracts them. For example, if you have a directory of PDF contracts sitting in S3, you can extract counterparty name, contract value, expiration date, and jurisdiction all in a single query:

SELECT
  file['path'] AS contract_path,
  AI_GENERATE(
    'gpt-4o',
    ROW('Extract contract details from this document', file)
    WITH SCHEMA ROW(
      counterparty   VARCHAR,
      contract_value FLOAT,
      expiry_date    VARCHAR,
      jurisdiction   VARCHAR
    )
  ) AS contract_details
FROM TABLE(LIST_FILES('@legal/contracts/'))
WHERE file['path'] LIKE '%.pdf';

The output lands as a structured result set you can filter, aggregate, join to other tables, or materialize as a Dremio Reflection for repeated fast access.
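To make that downstream use concrete, here's a hedged sketch of flattening the extracted struct into a reusable view. The view name, space, and field-access style are illustrative (the source's own examples use bracket-style access into struct columns, as in `file['path']`), and the model identifier is whatever you have configured with your provider:

```sql
-- Hypothetical follow-on: flatten the AI_GENERATE output into a view
-- that BI tools and other queries can reuse directly.
CREATE OR REPLACE VIEW legal.contract_summary AS
SELECT
  contract_path,
  contract_details['counterparty']   AS counterparty,
  contract_details['contract_value'] AS contract_value,
  contract_details['expiry_date']    AS expiry_date,
  contract_details['jurisdiction']   AS jurisdiction
FROM (
  SELECT
    file['path'] AS contract_path,
    AI_GENERATE(
      'gpt-4o',
      ROW('Extract contract details from this document', file)
      WITH SCHEMA ROW(
        counterparty   VARCHAR,
        contract_value FLOAT,
        expiry_date    VARCHAR,
        jurisdiction   VARCHAR
      )
    ) AS contract_details
  FROM TABLE(LIST_FILES('@legal/contracts/'))
  WHERE file['path'] LIKE '%.pdf'
);
```

From there, a `SELECT ... WHERE expiry_date < '2026-01-01'` or a join against a counterparty dimension table works the same as it would against any other view.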

AI_CLASSIFY is the choice when you have a fixed set of categories and want the model to assign each input consistently. Support ticket routing is the obvious example. You want to know whether a ticket is a billing question, a technical issue, a feature request, or a churn signal, and you want that answer to be consistent across hundreds of thousands of rows:

SELECT
  ticket_id,
  ticket_text,
  AI_CLASSIFY(
    'Classify this support ticket by its primary issue type',
    ticket_text,
    ARRAY['Billing', 'Technical', 'Feature Request', 'Churn Signal']
  ) AS issue_type
FROM support_tickets;

This will return a value matching one of the array items for each row. You can immediately group by issue_type, trend volume over time, or join it to an agent assignment table to measure resolution rates by category.
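As a sketch of that follow-up analysis, the classification query can be wrapped in a subquery and aggregated directly (table and column names follow the example above; this is illustrative, not a prescribed pattern):

```sql
-- Hypothetical follow-up: count ticket volume by AI-assigned category.
SELECT
  issue_type,
  COUNT(*) AS ticket_count
FROM (
  SELECT
    ticket_id,
    AI_CLASSIFY(
      'Classify this support ticket by its primary issue type',
      ticket_text,
      ARRAY['Billing', 'Technical', 'Feature Request', 'Churn Signal']
    ) AS issue_type
  FROM support_tickets
)
GROUP BY issue_type
ORDER BY ticket_count DESC;
```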

Finally, AI_COMPLETE returns free text. It's what you reach for when you want generated text without leaving SQL for a separate AI agent or LLM frontend. It can produce summaries, translate customer feedback, or normalise free text from a messy input field. The function always returns VARCHAR, which makes it straightforward to chain into views or downstream transformations:

SELECT
  incident_id,
  raw_notes,
  AI_COMPLETE(
    'Summarize this incident report in two sentences for an executive audience: ' || raw_notes
  ) AS executive_summary
FROM incident_reports
WHERE severity = 'P1';

The three functions aren't mutually exclusive. A realistic workflow might use AI_GENERATE to extract line items into a structured schema, AI_CLASSIFY to flag each invoice by vendor category, and AI_COMPLETE to produce a one-line audit summary. All of that fits within a short sequence of views, with no external dependencies.
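A sketch of that invoice workflow as chained views follows. All names, paths, categories, and the model identifier are hypothetical, and the extraction schema is deliberately minimal; the point is only to show the three functions composing through ordinary views:

```sql
-- Step 1 (hypothetical): extract vendor and total from invoice PDFs.
CREATE OR REPLACE VIEW finance.invoice_extract AS
SELECT
  file['path'] AS invoice_path,
  AI_GENERATE(
    'gpt-4o',
    ROW('Extract the vendor name and total amount from this invoice', file)
    WITH SCHEMA ROW(vendor VARCHAR, total_amount FLOAT)
  ) AS details
FROM TABLE(LIST_FILES('@finance/invoices/'))
WHERE file['path'] LIKE '%.pdf';

-- Step 2: classify each invoice by vendor category.
CREATE OR REPLACE VIEW finance.invoice_classified AS
SELECT
  invoice_path,
  details['vendor']       AS vendor,
  details['total_amount'] AS total_amount,
  AI_CLASSIFY(
    'Classify this vendor by spend category',
    details['vendor'],
    ARRAY['Software', 'Hardware', 'Services', 'Travel']
  ) AS vendor_category
FROM finance.invoice_extract;

-- Step 3: produce a one-line audit summary per invoice.
SELECT
  invoice_path,
  vendor_category,
  AI_COMPLETE(
    'Write a one-line audit summary for an invoice from ' || vendor ||
    ' totalling ' || CAST(total_amount AS VARCHAR)
  ) AS audit_summary
FROM finance.invoice_classified;
```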

The Real Shift: Doing Analytics and AI in the Same Place

Before AI functions existed at the query layer, extracting value from unstructured data meant orchestrating multiple systems. A data engineer would write a pipeline (typically Python) that reads files from S3, calls an LLM API, parses the responses, and writes structured results to a database table. That pipeline has to be scheduled, monitored, and maintained. It has its own failure modes, its own dependencies, and its own access controls.

The moment AI inference moves inside the SQL engine, everything that governs your structured data also governs your AI-processed unstructured data. The same access controls apply. The same query audit logs capture what ran. If a user doesn't have permission to read the underlying files, the AI function won't let them bypass that. It operates within Dremio's existing security model. That's a meaningful difference for organisations that need to demonstrate data lineage or comply with regulatory requirements around how data is processed.

There's also a development speed argument. An analyst who knows SQL doesn't need to spin up a Python environment, learn a new API client, or wait for an engineering team to build a pipeline. If the unstructured data is accessible in Dremio, they can write a query today. For teams running iterative analysis, that difference determines whether an answer comes this afternoon or next sprint.

What This Means for How You Store Data

One practical consequence of AI functions at the query layer is that the decision around what to store changes. For years the implicit rule was to bring data into your lakehouse only if it could be made structured first.

That's now shifted. A PDF invoice, a customer email thread, a maintenance log written in plain text: all of these now have analytical value without any transformation step before storage. You store them in your data lake in their original format, catalog them in Dremio's data catalog, and query them when you need them. The transformation happens at query time, governed by the same policies as everything else.

For organisations already running on Apache Iceberg, this fits naturally into the open lakehouse model. Structured data lives in Iceberg tables with unstructured data in object storage alongside it. Both are queryable through a single SQL interface with consistent governance and no vendor lock-in on the underlying storage format. There's no new architecture to adopt or new platform to evaluate. The files are probably already there. What has changed is what you can do with them.

Getting Started

Dremio's AI functions are available in Dremio Cloud. Before running your first query, you'll need to configure a model provider in your organisation settings. The documentation covers supported providers and the setup steps.

If you want to try it against your own data, Dremio's free trial at dremio.com/get-started takes about five minutes to set up, and all you need is an email address. You can connect an S3 bucket, point LIST_FILES at a directory of documents, and run your first AI_GENERATE query without provisioning any additional infrastructure.

Try Dremio Cloud free for 30 days

Deploy agentic analytics directly on Apache Iceberg data with no pipelines and no added overhead.