
Join the Dremio/dbt community: sign up to the dbt Slack community and join the #db-dremio channel to meet other dremio-dbt users and get support.
A major data trend in 2024, one that has carried on into this year, was the move to self-service analytics. According to our recent survey of IT and data decision-makers, 80% of the companies we spoke to aim to democratise data access in their organisations through self-service, and 41% of organisations have already shifted from cloud warehouses to open lakehouse architectures, like Dremio, to facilitate this for their analytics and AI workloads.
The idea behind self-service is to empower Business Users (BUs), those with the most knowledge and business context for their data, to take ownership of it. When provided with the right tooling and operational flexibility, these domain experts can refine and deliver high-quality data to their organisation (or externally to other companies), instead of leaving that work to data engineers, who are often removed from the business context.
However, a key part of producing business-context data is maintaining the integrity, reliability, and quality of that data. This is achieved by regularly building, testing, and deploying code to production. Compared with analytical tasks, these continuous DevOps processes are more complex and can be intimidating responsibilities for even the most technically minded BU. By combining dbt Core with Dremio, though, you can implement CI/CD workflows in your data transformation projects to streamline those workflows and consistently deliver reliable data.
What is CI/CD?
The process of taking code from development to production involves building, testing (integration, unit, regression), and deployment, steps that have traditionally involved manual intervention. With CI/CD (Continuous Integration and Continuous Deployment), these processes are automated, enabling you to ship code changes to users quickly and reliably. Implementing CI/CD tools and practices increases productivity, improves early detection of defects, and enables faster release cycles.
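To make the idea of an automated pipeline concrete, here is a minimal Python sketch of a fail-fast stage runner: stages run in order, and the first failure stops everything after it. The function name `run_pipeline` and the stage names are illustrative assumptions, and the stand-in commands simply exit with a fixed status; a real pipeline would call your own build, test, and deploy tooling instead.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_pipeline(stages: Sequence[Tuple[str, Sequence[str]]]) -> List[Tuple[str, int]]:
    """Run each stage in order and stop at the first non-zero exit code."""
    results: List[Tuple[str, int]] = []
    for name, command in stages:
        completed = subprocess.run(list(command), check=False)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            break  # a failed stage blocks every stage after it
    return results


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere (hypothetical stages).
    ok = [sys.executable, "-c", "pass"]
    fail = [sys.executable, "-c", "raise SystemExit(1)"]
    print(run_pipeline([("build", ok), ("test", fail), ("deploy", ok)]))
```

In this example the "deploy" stage never runs because "test" fails first, which is the behaviour that keeps broken changes out of production.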
Key Concepts of CI/CD in Dremio and dbt
Automated Testing: dbt allows for the creation of data tests that can verify the consistency of your data models with each code change as well as detect breaking changes in your source datasets.
Data Versioning: Dremio’s data lakehouse catalog automatically tracks changes made to your datasets over time, ensuring traceability and reproducibility of your data.
Code Version Control: dbt integrates seamlessly with version control systems like Git, enabling collaborative development and review processes through pull requests and code reviews.
Efficient CI Jobs: Tags in dbt allow you to categorise your models and other project resources, and can be used to selectively execute commands such as run, build, and test. This ensures resource-efficient validation of changes and makes it easier to manage your project during development and deployment.
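As a rough illustration of tag-based selection, the Python sketch below assembles a dbt command line that targets only resources carrying given tags, using dbt's `--select tag:<name>` selector. The helper name `dbt_tag_command` and the tag names are hypothetical; the sketch only builds the argument list and leaves running it to your CI job.

```python
from typing import Iterable, List

ALLOWED_ACTIONS = {"run", "build", "test"}


def dbt_tag_command(action: str, tags: Iterable[str]) -> List[str]:
    """Build a dbt CLI invocation limited to resources with any of the given tags."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported dbt action: {action!r}")
    selectors = [f"tag:{tag}" for tag in tags]
    if not selectors:
        raise ValueError("at least one tag is required")
    # Space-separated selectors are treated as a union by dbt's node selection.
    return ["dbt", action, "--select", *selectors]


if __name__ == "__main__":
    # "finance" and "nightly" are made-up tag names for illustration.
    print(" ".join(dbt_tag_command("build", ["finance", "nightly"])))
```

Keeping the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting concerns.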
Implementing CI/CD with Dremio and dbt
Build
- Build views in dbt and publish to Dremio, including tags, tests and wiki documentation.
- Build modular data models that reference each other with tracked dependencies for easier maintainability.
- With code branching, multiple people can simultaneously work on the same data model, with changes tracked and evaluated without impacting production.
- Use branching, pull requests, and code reviews to maintain a high-quality codebase for your data models.
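The tracked dependencies mentioned above form a directed graph, and a valid build order is one in which every model comes after the models it references. The Python sketch below shows that idea with the standard library's `graphlib`; the model names are hypothetical, and this illustrates the concept only, not how dbt itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical models; each one maps to the set of models it references.
MODEL_DEPENDENCIES = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_customer_orders": {"stg_orders", "stg_customers"},
    "fct_revenue": {"int_customer_orders"},
}


def build_order(dependencies):
    """Return one valid order in which every model follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(build_order(MODEL_DEPENDENCIES))
```

Because the dependencies are explicit, a change to one model also tells you which downstream models need to be rebuilt and retested.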
Test
- Assert data quality by defining tests for both your source data and transformation code to reduce the likelihood of errors when logic changes and to alert you when issues arise.
- Use dbt incremental model builds to reduce your query run times.
- Implement and review CI checks within pull requests to ensure code changes meet quality standards before merging into the main branch.
- If a model or data change does not work as expected, use code rollback to minimise disruption.
- If there is an issue with the source data, roll back the state of your data via the lakehouse catalog.
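A data test can be thought of as a query that returns the rows violating an assertion, so the test passes when the query returns nothing. The Python sketch below demonstrates that pattern with not-null and uniqueness checks. It uses an in-memory SQLite database purely as a stand-in for a Dremio source, and the `orders` table, its columns, and its rows are invented for illustration.

```python
import sqlite3


def failing_rows(conn, test_sql):
    """Return rows that violate the assertion; an empty list means the test passes."""
    return conn.execute(test_sql).fetchall()


def demo_connection():
    """In-memory stand-in for a source table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (2, 25.5), (2, 7.0), (None, 3.0)],
    )
    return conn


# Each test selects the offending records rather than returning true/false.
NOT_NULL_TEST = "SELECT * FROM orders WHERE order_id IS NULL"
UNIQUE_TEST = (
    "SELECT order_id, COUNT(*) AS n FROM orders "
    "WHERE order_id IS NOT NULL GROUP BY order_id HAVING COUNT(*) > 1"
)

if __name__ == "__main__":
    conn = demo_connection()
    for name, sql in [("not_null", NOT_NULL_TEST), ("unique", UNIQUE_TEST)]:
        rows = failing_rows(conn, sql)
        print(name, "PASS" if not rows else f"FAIL {rows}")
```

Returning the offending rows, rather than a plain pass or fail, makes it much quicker to see what went wrong when a CI check fails.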
Deliver
- Version your data models to easily track and understand the state of your code and what is being run in production.
Learning How to Work with Dremio & dbt
- Video Demonstration of dbt with Dremio Cloud
- Video Demonstration of dbt with Dremio Software
- Dremio dbt Documentation
- Dremio CI/CD with Dremio/dbt whitepaper
- Dremio Quick Guides dbt reference
- End-to-End Laptop Exercise with Dremio and dbt
- Video Playlist: Intro to dbt with Dremio
- Automating Running dbt-dremio with Github Actions
- Orchestrating Dremio with Airflow (can be used to trigger dbt after external data updates)
- Orchestrating Dremio with dbt using Airflow and Github Actions
Conclusion
Automating repetitive, manual processes like code building, testing, and delivery allows engineers and business units to focus on developing new features and data products instead of maintaining pipelines. By leveraging the capabilities of Dremio and dbt and adhering to these best practices, business units can build a strong CI/CD pipeline that improves both the quality and speed of data transformation projects.