
Join the Dremio/dbt community by joining the dbt Slack and the #db-dremio channel to meet other Dremio-dbt users and seek support.
The data management strategy of the data lakehouse is to combine the key strengths of data lakes and data warehouses into one unified platform: the low-cost, scalable storage of a data lake with the performance and reliability of a data warehouse. This streamlined approach to data management and analysis brings plenty of benefits, but it is not without potential pitfalls.
Whether your data is generated by tools, sensors, or business processes, you can easily store large amounts of it, as data storage is cheap and abundant. However, the analysis and refinement of this vast store of data can become a pain point. If you are processing your entire dataset each time new data arrives or a record is updated, you will see an ever-increasing demand on the time and resources required for your data transformations. The solution? Incremental models from dbt.
How They Work
Incremental models are a materialization strategy designed for the efficient update of your data models. They only process the data that has been added or modified since the previous execution of the model, focusing transformation and loading on just the new and changed records. This is beneficial not only for large datasets but also for frequently updated models, such as event-style data models.
Incremental materializations are created as tables in your storage. Naturally, when your model is first run, all records from the source data are transformed to create your initial table. It is then on subsequent runs of your model that your table can be efficiently updated. Any records in the source data that have been created or changed since the last time dbt was run will be transformed according to your data model and inserted into your table.
Incremental models offer a performance middle-ground between view and table materializations, capable of providing better performance or reduced costs for your dbt runs, with the trade-off of being more complex to implement.
- Incremental models build a table in pieces over time.
- Query costs are equivalent to a standard table, but build times and build costs are lower.
- Building the initial table is a table materialization, so first runs of incremental models are slow.
- Can drift from source data over time. Effort must be made to refresh historical data when your schema or data model changes.
How to Configure
Turning on incremental
Like the other materializations built into dbt, such as table or view, incremental models are defined with select statements, with the materialization defined in a config block:
{{
  config(
    materialized='incremental',
    unique_key='column_1'
  )
}}
You can also optionally set a unique_key for your model. By defining this primary key for your table, you enable incremental runs to update records in your target table rather than just appending. This unique key can be a single column or a list of columns.
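For example, a composite key can be passed as a list (a minimal sketch; the column names here are hypothetical):
{{
  config(
    materialized='incremental',
    unique_key=['order_id', 'order_line']
  )
}}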
With Dremio, your incremental materializations are formatted as Apache Iceberg tables and can be configured using the same partitioning options as tables and reflections.
Defining “new” data
To tell dbt which data should be transformed on an incremental run, you need to write SQL that identifies and filters for those rows. Unfortunately, unlike with testing in dbt, there are no pre-defined or out-of-the-box filters, so you have to write your own. This is one of the main areas of operational complexity compared to the other materialization options.
The two most common filter use cases are for:
- New Records: Rows that have been created since the last time the model was run. This can usually be achieved by comparing a timestamp column in the source against the latest value already present in your target table (see the sketch after this list).
- New and Updated Records: By providing a unique key for your model, modified records overwrite the original and are not inserted as duplicates.
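A minimal sketch of the timestamp-based filter, using dbt's is_incremental() macro and the {{ this }} relation; the source and column names are hypothetical:
{{
  config(
    materialized='incremental'
  )
}}

select
  event_id,
  event_type,
  event_time
from {{ source('app', 'raw_events') }}
{% if is_incremental() %}
-- Only pick up rows newer than the latest record already in the target table.
-- is_incremental() is false on the first run and on --full-refresh runs,
-- so the initial build still processes the full source.
where event_time > (select max(event_time) from {{ this }})
{% endif %}
On the first run the filter is skipped and the full source is processed. To also handle updated records, add a unique_key and use the merge strategy described in the next section.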
Setting your incremental strategy
The next thing to configure is how your incremental chunks of data get inserted into your existing table. This is specified via the incremental_strategy config, set either in individual models or at the project level via your dbt_project.yml. Dremio supports two incremental strategies:
- Append: Insert all changes as new records without updating or overwriting any existing data. The default option.
- Merge: Insert new records and update existing records that match the unique key provided.
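For example, a per-model config block selecting the merge strategy might look like this (a minimal sketch; the unique_key column is hypothetical):
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
  )
}}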
What to do if your model changes?
Your incremental implementation will check the source data for new records to process according to the SQL filters you defined. But how does it react to changes in your data model, whether in the schema or in the transformation logic?
Schema updates
The on_schema_change config allows you to specify how to update your incremental table when columns are added or removed from the schema. With Dremio you have four available behaviour options:
- ignore: Added columns will not appear in your existing table. Attempting to remove a column will cause dbt run to fail. The default option.
- fail: Triggers an error message when your schema changes from that of the existing table.
- append_new_columns: Append new columns to the existing table. Ignore any columns removed in the new schema and retain them in the existing table.
- sync_all_columns: Incorporate all schema changes to the existing table, including data types. Add any new columns to the existing table and remove any columns that are now missing.
It is important to note that none of the available behaviours will backfill values in old rows for any newly added columns. To populate missing values on old records, you either need to run a manual update or trigger a full refresh.
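A minimal sketch of opting into column additions in a model's config block (the choice of append_new_columns here is illustrative):
{{
  config(
    materialized='incremental',
    on_schema_change='append_new_columns'
  )
}}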
Model updates
Incremental updates by design do not reprocess old records. For schema updates, this means that new columns will only be populated for new rows. Likewise, if you update the logic of your data model, for example by changing the transformation applied to a column, the change will only be applied to new rows of data. Additionally, incremental models are susceptible to data drift, where your existing table of processed data diverges from the source data over time, due either to data loading latency or to other processes updating the source records after your model has run.
Depending on your model, these differences in data can range from obvious (a value vs. nothing) to inconspicuous, but they will almost always cause data integrity issues and can break your data pipeline. When you make changes to your data model, the responsibility is on you to do a full refresh of your table using --full-refresh, which will drop and rebuild your incremental model. It is best practice to do this not just for your incremental model but also for any downstream models.
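A minimal sketch of such a refresh from the command line, assuming a model named my_incremental_model (a hypothetical name); the + selector also rebuilds downstream models:
# Rebuild the incremental model and everything downstream of it
dbt run --full-refresh --select my_incremental_model+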
While the whole motivation for using incremental models is to avoid full data refreshes, a full refresh is still a useful and necessary operation. Full refreshes take time, so be efficient and use them sparingly.
Learning How to Work with Dremio & dbt
- Video Demonstration of dbt with Dremio Cloud
- Video Demonstration of dbt with Dremio Software
- Dremio dbt Documentation
- Dremio CI/CD with Dremio/dbt whitepaper
- Dremio Quick Guides dbt reference
- End-to-End Laptop Exercise with Dremio and dbt
- Video Playlist: Intro to dbt with Dremio
- Automating Running dbt-dremio with Github Actions
- Orchestrating Dremio with Airflow (can be used to trigger dbt after external data updates)
Conclusion
Incremental materializations allow you to build your data table piece by piece as new data comes in. By restricting your build operations to just this required data, you will not only greatly reduce the runtime of your data transformations, but also improve query performance and reduce compute costs.