
Join the Dremio/dbt community in the #db-dremio channel of the dbt Slack to meet other Dremio-dbt users and seek support.
The boom in AI over the last couple of years has been unprecedented, both in the performance of available models and in the appetite of businesses to utilise them. Several large tech companies sell powerful, well-maintained generative AI and Large Language Models (LLMs), giving interested users many options to explore, such as OpenAI’s GPT series and Anthropic’s Claude series. While capable, these commercial models are general purpose, so businesses need to customise and fine-tune them to deliver the best performance and address their specific requirements. However, no matter which strategy a business takes to deploy capable models, at the core of a successful and reliable AI deployment lies high-quality data. As such, there are two key data issues AI teams need to consider when planning AI projects.
Problem 1: Data Access
AI model development requires access to vast quantities of data from across an organisation and beyond. However, this first hurdle of data access is where many businesses keen to adopt this powerful technology run into problems. For many, data is isolated in separate departments or systems, in data silos that are inaccessible or unknown to other parts of the business. This compartmentalisation hinders collaboration and knowledge sharing within the organisation and is a blocker to effective AI model training and utilisation.
Dremio breaks down data silos by providing federated data access to all your data, whether on-premise or across cloud providers. Dremio’s data lakehouse platform eliminates complex data integration and provides robust, self-service access to data across multiple sources, teams, and users, all through a single platform.
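For example, once an on-prem operational database and a cloud object store are registered as sources in Dremio, they can be joined in a single query. The following is a minimal sketch; the source, table, and column names (postgres_onprem, s3_lake, and so on) are hypothetical placeholders for whatever sources you have configured.

```sql
-- Hypothetical federated query run through Dremio:
-- joins customer records from an on-prem PostgreSQL source with
-- event data stored in a cloud data lake, without moving either dataset.
SELECT
    c.customer_id,
    c.region,
    COUNT(*)                   AS session_count,
    SUM(e.duration_seconds)    AS total_engagement_seconds
FROM postgres_onprem.crm.customers AS c
JOIN s3_lake.events.web_sessions   AS e
    ON c.customer_id = e.customer_id
GROUP BY c.customer_id, c.region
```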
Problem 2: Data Quality
With data access addressed by a data lakehouse, the second hurdle to model training is data quality. The data you feed your model needs to be reliable, understandable, and accurate. The old adage of “quality over quantity” does not hold true for AI modelling: you need both for a performant AI solution. However, data transformation is not a one-and-done process, as changes to your business model and processes can cause your data to shift. Your data schema can evolve, standardisations can change, and requirements can expand. This is where data teams can benefit from adopting strategies traditionally found in software engineering workflows.
dbt (Data Build Tool) is a transformation workflow that centralises analytics code and, through its git integration and testing capabilities, provides users with collaborative tools and processes such as version control, modularity, portability, and documentation. One of dbt’s guiding principles is modular development: breaking data transformation workflows down into small, manageable components. These components (referred to as ‘models’) are easier for users to understand, develop, and test, and as such are easier to collaborate on and share.
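For example, a dbt model is simply a SELECT statement saved in its own SQL file. The following is a minimal sketch, assuming a hypothetical ‘sales’ source with a ‘raw_orders’ table has been declared in the project’s source definitions; all model, table, and column names are illustrative only.

```sql
-- models/staging/stg_orders.sql (hypothetical model)
-- A small, self-contained dbt model: it reads from a declared raw source,
-- applies light cleaning and type casting, and exposes a tidy relation
-- that downstream models can build on.
with raw_orders as (

    select * from {{ source('sales', 'raw_orders') }}

),

cleaned as (

    select
        cast(order_id as bigint)             as order_id,
        cast(customer_id as bigint)          as customer_id,
        lower(trim(status))                  as order_status,
        cast(order_total as decimal(18, 2))  as order_total,
        cast(ordered_at as timestamp)        as ordered_at
    from raw_orders
    where order_id is not null

)

select * from cleaned
```

A downstream model would select from this one via {{ ref('stg_orders') }}, which is how dbt records dependencies between models and determines the order in which to build them.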
Dremio + dbt
Similar to the data lakehouse paradigm, which combines the strengths of data lakes and data warehouses, the complementary strengths of Dremio and dbt can be combined into a single, powerful data transformation platform with the following advantages:
Hybrid Data Lakehouse: Dremio connects dbt to data wherever it is. There is no need for difficult data migrations or a shift to proprietary data storage. With Dremio you can integrate your existing data storage, from lakes and warehouses in the cloud to those on-prem. With a Dremio Hybrid Data Lakehouse these aren’t exclusive deployment choices: you can operate with a combination of cloud and on-prem depending on your compliance requirements, cost considerations, or other specific needs.
Flexible Data Pipelines: dbt encourages the development of modular data transformation pieces with declared dependencies. This makes it easy to make and deploy changes to data pipelines, so substituting out components or replacing data sources is a simple, non-disruptive process.
Data Reliability: dbt has robust testing capabilities to ensure the integrity of your data models by making assertions about the data that goes in and the data that comes out. With built-in common data checks and the ability to write custom SQL tests, dbt monitors and alerts you to data drift and breaking changes in both your data transformations and your data sources (see the test sketch after this list).
Optimal Data Querying: With native support for Apache Iceberg, Dremio enables efficient data management and high-performance query acceleration, significantly reducing the time required to process complex queries.
Version Control and Collaboration: dbt models are written as SQL code and tracked with git version control. This allows your data teams to safely collaborate on a single dbt project, with code changes tracked and reliably integrated into your data pipeline. You can prevent breaking changes from being deployed and review the history and progress of the project, making it easy to onboard new contributors. Dremio provides similar git-like functionality for your data, such as data versioning, observability, and lakehouse management features, via its Iceberg Catalog. Dremio’s data version control allows rollback in the case of data quality issues and hassle-free access to historical data for analysis projects.
Community and Ecosystem: Being an open-source tool, dbt Core has a strong community and ecosystem, providing users with a host of functionality-boosting resources, plugins, and integrations.
Open Standards: Dremio is built on open standard formats, such as Apache Arrow and Apache Iceberg, ensuring high performance for your data pipelines without the extra costs or vendor lock-in of proprietary formats.
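To make the testing point above concrete, here is a minimal sketch of a dbt singular test, building on the hypothetical stg_orders model sketched earlier. Any SQL file placed in the project’s tests/ directory is run by the dbt test command, and the test fails if the query returns any rows.

```sql
-- tests/assert_no_negative_order_totals.sql (hypothetical singular test)
-- Asserts that no cleaned order has a negative total; any rows returned
-- here are reported as test failures, surfacing the offending records.
select
    order_id,
    order_total
from {{ ref('stg_orders') }}
where order_total < 0
```

Alongside custom tests like this, dbt’s built-in generic tests (such as not_null and unique) can be attached to columns to cover the most common data checks.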
Summary
Dremio allows you to easily share and manage your data within your organisation, while dbt allows the teams working with that data to efficiently share and collaborate. Combining Dremio's powerful data lakehouse platform and dbt's robust data transformation capabilities allows your organisation to produce reliable and accessible data to drive decision making and power AI initiatives. With these two tools, you can follow this 3-step process to build a solid foundation for your AI projects:
- Load and consolidate data from across your organisation, taking advantage of the low costs and high performance of Dremio’s hybrid data lakehouse.
- Standardise and clean your data with modular, version-controlled SQL data transformations in dbt.
- Ensure and maintain high data quality throughout your transformation workflow with dbt’s robust testing suite.
With your data consolidated in Dremio and cleaned and optimised with dbt, you can confidently begin your organisation’s journey to success with AI.
- Schedule a Free Architectural Workshop to See How These Patterns Can Fit into Your Data Architecture
- Become a Verified Lakehouse Associate at Dremio University