Dremio Blog

6 minute read · May 19, 2026

4 Data Quality Tools to Keep Your Data In Shape

Will Martin, Technical Evangelist

A lakehouse is only as useful as the data inside it. Query performance, governance, and semantic layers all depend on one assumption: that the underlying data is accurate, complete, and behaving as expected. When it isn't, dashboards return wrong answers, AI agents reason from bad inputs, and engineering teams spend days diagnosing problems that should have been caught at the source.

Data quality testing is how you close that gap. The tools below each integrate with Dremio and cover the two main approaches: assertion-based testing, where you define what good data looks like and verify it explicitly, and observability-based monitoring, where the platform learns your data's normal behaviour and alerts you when something deviates. Used together, they give you both the checks you know to write and coverage for the problems you haven't anticipated yet.

dbt

The dbt-dremio adapter brings dbt's built-in test framework directly to your Dremio lakehouse. Tests are defined in YAML alongside your models and cover the most common data quality assertions: uniqueness, not-null constraints, accepted value sets, and referential integrity between tables. Running dbt test compiles each test into a SQL query that selects the rows violating the assertion, runs it against Dremio, and marks the test as failed if any rows come back.
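One way to picture the four assertion types is as queries that return the rows breaking each rule. The Python sketch below is not dbt or the dbt-dremio adapter: it uses the standard-library sqlite3 module as a local stand-in for Dremio, and the orders and customers tables, their columns, and the run_assertions helper are hypothetical names chosen for illustration.

```python
import sqlite3

# Assumption: sqlite3 stands in for Dremio so the sketch runs locally.
# The tables and columns below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES
        (100, 1, 'shipped'),
        (101, 2, 'pending'),
        (101, 2, 'pending'),
        (102, NULL, 'shipped'),
        (103, 9, 'lost');
""")

# Each assertion is written as a query that selects the rows violating it.
ASSERTIONS = {
    "unique(order_id)":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    "not_null(customer_id)":
        "SELECT order_id FROM orders WHERE customer_id IS NULL",
    "accepted_values(status)":
        "SELECT order_id FROM orders WHERE status NOT IN ('shipped', 'pending')",
    "relationships(customer_id -> customers.id)":
        """SELECT o.order_id FROM orders o
           LEFT JOIN customers c ON o.customer_id = c.id
           WHERE o.customer_id IS NOT NULL AND c.id IS NULL""",
}

def run_assertions(connection):
    """Return {assertion name: failing rows}; an empty list means the assertion passed."""
    return {name: connection.execute(sql).fetchall() for name, sql in ASSERTIONS.items()}

results = run_assertions(conn)
for name, failing in results.items():
    status = "PASS" if not failing else "FAIL"
    print(f"{status} {name}: {len(failing)} failing row(s)")
```

In this sample data every assertion fails on exactly one key, which makes it easy to see which row each rule catches. In a real project the same rules would be declared in YAML next to the model rather than written by hand.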

Because dbt tests live in the same project as your transformation logic, quality checks and the code they validate stay in sync. When a model changes, its tests change with it. For teams already using dbt-dremio for transformations, adding tests is a low-friction step rather than a separate tooling decision. Dremio's official dbt documentation is at docs.dremio.com/dremio-cloud/developer/dbt/, and the adapter repository with setup instructions is at github.com/dremio/dbt-dremio.

Soda

Soda connects to Dremio via Arrow Flight SQL and lets you write data quality checks in SodaCL, a human-readable YAML-based language. A check might assert that a column has no null values, that row counts fall within an expected range, or that a custom SQL expression evaluates to true. Checks run as SQL against your Dremio tables, so they respect your existing access controls and work against federated sources just as they would against native Iceberg tables.
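The three kinds of check mentioned above share a shape: compute a metric with SQL, then compare it to a threshold. The following Python sketch illustrates that shape only. It is not SodaCL or the Soda library; sqlite3 again stands in for Dremio, and the orders table, the CHECKS list, and the run_scan helper are hypothetical.

```python
import sqlite3

# Assumption: sqlite3 stands in for Dremio; the orders table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10, 25.0), (2, NULL, 40.0), (3, 12, 15.5);
""")

# Each check pairs a SQL metric with a condition on the value it returns.
CHECKS = [
    ("row count between 1 and 1000",
     "SELECT COUNT(*) FROM orders",
     lambda n: 1 <= n <= 1000),
    ("no missing customer_id",
     "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL",
     lambda n: n == 0),
    ("all amounts positive",  # custom SQL expression, counted where it is not true
     "SELECT COUNT(*) FROM orders WHERE NOT (amount > 0)",
     lambda n: n == 0),
]

def run_scan(connection):
    """Return {check name: (metric value, passed)} for every check."""
    outcomes = {}
    for name, sql, condition in CHECKS:
        value = connection.execute(sql).fetchone()[0]
        outcomes[name] = (value, condition(value))
    return outcomes

scan = run_scan(conn)
for name, (value, passed) in scan.items():
    print(f"{'PASS' if passed else 'FAIL'} {name} (metric = {value})")
```

Keeping the metric value alongside the pass/fail result is what makes it possible to track a check over time rather than only at the moment it fails.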

Soda Cloud provides a managed layer for scheduling scans, tracking check results over time, and routing alerts to Slack or email when checks fail. For teams running Dremio as their primary analytics platform, Soda is suited to validating data quality at the point of ingestion or after transformation, before results reach downstream consumers. The Soda Core GitHub repository is at github.com/sodadata/soda-core, and documentation for the Arrow Flight SQL ODBC driver that the connection requires is in the Dremio docs.

Great Expectations

Great Expectations is an open-source framework that connects to Dremio and lets you define "expectations" about your data: assertions covering column ranges, regex pattern matching, statistical distributions, set membership, and more. Expectations are organised into suites and run as validation jobs against your Dremio tables. When a validation run completes, Great Expectations generates Data Docs: HTML reports that show exactly which expectations passed or failed, with sample failing rows included.
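To make the suite-and-report idea concrete, here is a minimal pure-Python sketch of a suite with a range check, a regex check, and a set-membership check, producing a report with sample failing rows. It deliberately does not use the Great Expectations API: the Expectation class, the validate function, and the sample rows are all hypothetical, and a real suite would run against Dremio tables rather than an in-memory list.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Expectation:
    """Hypothetical stand-in for one expectation: a named rule applied to one column."""
    name: str
    column: str
    predicate: Callable[[Any], bool]

def validate(rows, suite, sample_size=3):
    """Run every expectation over the rows and collect a pass/fail report."""
    report = []
    for exp in suite:
        failing = [row for row in rows if not exp.predicate(row.get(exp.column))]
        report.append({
            "expectation": exp.name,
            "success": not failing,
            "failing_count": len(failing),
            "sample_failing_rows": failing[:sample_size],
        })
    return report

# Hypothetical sample data.
rows = [
    {"order_id": 1, "amount": 25.0, "email": "a@example.com", "status": "shipped"},
    {"order_id": 2, "amount": -5.0, "email": "b@example.com", "status": "pending"},
    {"order_id": 3, "amount": 40.0, "email": "not-an-email", "status": "shipped"},
]

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

suite = [
    Expectation("amount between 0 and 10000", "amount",
                lambda v: v is not None and 0 <= v <= 10_000),
    Expectation("email matches pattern", "email",
                lambda v: isinstance(v, str) and EMAIL.match(v) is not None),
    Expectation("status in allowed set", "status",
                lambda v: v in {"shipped", "pending", "cancelled"}),
]

report = validate(rows, suite)
for entry in report:
    print(entry["expectation"], "->", "passed" if entry["success"] else "failed")
```

Because the suite is plain data plus functions, it can be versioned and reviewed in the same way as any other code.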

The framework is particularly well-suited to teams that want granular, reproducible quality checks that can be versioned and reviewed like code. Expectations can be generated automatically by profiling an existing dataset, giving you a baseline to refine rather than starting from scratch. Compatibility and setup information is available at docs.greatexpectations.io/docs/application_integration_support.

Monte Carlo

Monte Carlo takes a different approach to data quality. Rather than requiring you to define checks upfront, it connects to Dremio and automatically learns the normal behaviour of your tables: typical row counts, schema structure, distribution patterns, and freshness cadences. When something deviates from the norm, Monte Carlo raises an alert. This covers the class of data quality problems that are hard to write explicit checks for because you don't know what to look for until something goes wrong.
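The idea of learning a baseline and alerting on deviation can be sketched with a simple z-score over historical row counts. This is an illustrative stand-in, not Monte Carlo's method, which the source does not describe: the is_anomalous function, the threshold of three standard deviations, and the daily_row_counts figures are all assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, observed, threshold=3.0):
    """Flag `observed` if it lies more than `threshold` sample standard
    deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to learn a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu  # perfectly flat history: any change is a deviation
    return abs(observed - mu) / sigma > threshold

# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]

typical_day = is_anomalous(daily_row_counts, 10_090)
partial_load = is_anomalous(daily_row_counts, 4_300)
print(typical_day, partial_load)
```

No one wrote a rule saying "row counts must exceed 9,000" here; the acceptable range comes from the history itself. The same pattern extends to other metrics, such as hours since the last update for freshness.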

For Dremio environments, Monte Carlo supports schema change detection, custom SQL monitors, and comparison rules that validate consistency across tables or transformations. It authenticates via a Personal Access Token, and both Dremio Cloud and Dremio Software deployments are supported. The integration is currently in public preview. Setup documentation is at docs.getmontecarlo.com/docs/dremio, and Dremio's partnership overview is at dremio.com/blog/dremio-and-monte-carlo-enhanced-data-reliability-for-your-data-lakehouse/.

Getting Started

If you want to start testing data quality against a Dremio environment, you'll first need a Dremio environment! You can get a free Dremio Cloud account at dremio.com/get-started, giving you a working lakehouse to connect any of these tools from the word go.
