8 minute read · July 1, 2024
More Flexible, Powerful Data Branching
· Principal Product Manager, Dremio
Dremio's built-in lakehouse catalog, powered by Project Nessie, makes data engineering workflows easy by enabling a Git-like experience on Iceberg tables and views. Companies can use branches to make and validate changes to data without disrupting production workloads, instantly merge changes into their production branch when ready, and easily roll back from mistakes if needed. Data branching saves time and money by eliminating the need to create multiple physical environments and manage pipelines between those environments.
Today, we’re excited to announce that Dremio’s catalog now makes data engineering workflows safer and more flexible by enabling you to perform dry runs before merging changes, and easily understand and resolve merge conflicts.
Perform Dry Runs Before Merging Changes Into Production
Dremio’s catalog eliminates the need for separate physical staging and development environments by enabling you to use Git-like branches for data engineering workflows. Branches enable you to support development, test, and production use cases on the same lakehouse environment without creating physical copies of data. With branches, you can run ETL pipelines and validate changes on dev/UAT branches before instantly promoting changes to production, instead of having to manage pipelines between multiple physical environments.
With this release, after you’ve made and verified changes in your development branch, you can safely dry run the branch merge to find out whether it will have any conflicts. A dry run shows how the merge would behave and, if there are conflicts, which objects are in conflict and why. Performing a dry run of a MERGE BRANCH statement returns the output the statement would produce if it were executed, but does not commit anything.
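To make the dry-run idea concrete, here is a minimal Python sketch of a toy in-memory model. It is not how Dremio or Project Nessie implement merges: the function names, the dictionary representation of a branch, and the definition of a conflict (a table changed on both branches since their common ancestor, with different results) are all assumptions made for illustration.

```python
def find_conflicts(base, source, target):
    """Return tables changed on both branches since `base` with differing results.

    Each branch is modeled as a dict of table name -> content version (assumption).
    """
    names = set(base) | set(source) | set(target)
    return sorted(
        name
        for name in names
        if source.get(name) != base.get(name)      # changed on the source branch
        and target.get(name) != base.get(name)     # also changed on the target branch
        and source.get(name) != target.get(name)   # and the two results disagree
    )


def merge_branch(base, source, target, dry_run=False):
    """Merge `source` into `target`; with dry_run=True, report only and commit nothing."""
    conflicts = find_conflicts(base, source, target)
    if conflicts:
        # A conflicting merge never modifies the target in this toy model.
        return {"status": "FAILURE", "conflicts": conflicts}
    if not dry_run:
        # Apply every change made on the source branch to the target branch.
        for name in set(base) | set(source):
            if source.get(name) != base.get(name):
                if name in source:
                    target[name] = source[name]
                else:
                    target.pop(name, None)
    return {"status": "SUCCESS", "conflicts": []}


if __name__ == "__main__":
    base = {"HR.Salaries": "v1", "Sales.Pipeline": "v1"}
    dev = {"HR.Salaries": "v2", "Sales.Pipeline": "v1"}
    main = dict(base)
    print(merge_branch(base, dev, main, dry_run=True), main)  # report only
    print(merge_branch(base, dev, main), main)                # commits the change
```

The point of the sketch is that the dry run and the real merge share the same conflict check, so the dry-run report predicts the real outcome as long as neither branch changes in between.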
Easily Understand and Resolve Merge Conflicts
You can now resolve merge conflicts that arise when working with branches by using ON CONFLICT syntax. When a merge conflict occurs, you can specify how to resolve it in several ways.
- ON CONFLICT OVERWRITE: If the table/view has a conflict, then upon merge, overwrite the version of the table/view in the target branch with the version that exists in the source branch.
- ON CONFLICT DISCARD: If the table/view has a conflict, then upon merge, keep the version of the table/view in the target branch (don’t do anything).
- ON CONFLICT CANCEL: If the table/view has a conflict, then cancel the entire merge operation altogether and do not commit any changes.
In addition to the above, you can specify exceptions for how the catalog should handle merge conflict behavior on a per-table/view basis. For example, you can specify that the catalog should discard any conflicting tables during a merge, but overwrite one specific table.
- ON CONFLICT … EXCEPT: If there is a merge conflict, then all tables/views that have a conflict will follow the behavior specified by the ON CONFLICT option, except for any tables/views specified in the EXCEPT clause.
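The way a default behavior and per-object exceptions combine can be sketched as a small Python function. This is a toy model under stated assumptions, not the catalog's implementation: the function name and data shapes are hypothetical, the table names are borrowed from the example in this post, and two behaviors are assumed rather than confirmed by the text above, namely that omitting ON CONFLICT acts like CANCEL and that a CANCEL exception on a conflicting object fails the whole merge.

```python
BEHAVIORS = {"OVERWRITE", "DISCARD", "CANCEL"}


def plan_conflict_resolution(conflicts, on_conflict="CANCEL", exceptions=None):
    """Decide, per conflicting object, which behavior applies.

    conflicts   -- names of the tables/views that are in conflict
    on_conflict -- default behavior (the ON CONFLICT option); CANCEL default is an assumption
    exceptions  -- optional {name: behavior} overrides (the EXCEPT clause)
    """
    exceptions = exceptions or {}
    for behavior in [on_conflict, *exceptions.values()]:
        if behavior not in BEHAVIORS:
            raise ValueError(f"unknown conflict behavior: {behavior}")
    # An exception wins over the default for the object it names.
    plan = {name: exceptions.get(name, on_conflict) for name in conflicts}
    # Assumption: any conflicting object resolved as CANCEL aborts the whole merge.
    status = "FAILURE" if "CANCEL" in plan.values() else "SUCCESS"
    return {"status": status, "plan": plan}


if __name__ == "__main__":
    conflicts = ["MyCatalog.HR.Salaries", "MyCatalog.Sales.Pipeline"]
    # Overwrite conflicting objects by default, but keep the target's Pipeline table.
    print(plan_conflict_resolution(
        conflicts,
        on_conflict="OVERWRITE",
        exceptions={"MyCatalog.Sales.Pipeline": "DISCARD"},
    ))
```

Only objects that are actually in conflict appear in the plan; non-conflicting changes merge as usual and are outside the scope of this sketch.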
Example
Suppose we have two branches in our environment – main and dev, where main is our production branch and dev is our development branch. Suppose we have made and validated changes to our data on dev, and are ready to merge our changes back into production for data analysts to leverage. Before we commit any changes, we want to validate whether or not our merge will succeed:
MERGE BRANCH DRY RUN "dev" INTO "main";
--> message | contentName | status
--> Branch "dev" cannot be merged into "main" | null | FAILURE
--> Values of existing and expected content differ | MyCatalog.HR.Salaries | VALUE_DIFFERS
--> Values of existing and expected content differ | MyCatalog.Sales.Pipeline | VALUE_DIFFERS
Using the DRY RUN clause shows us that there are two conflicts that would prevent the merge from succeeding: the Salaries and Pipeline tables.
Assume that in our workflow, conflicting objects in main should be overwritten with the versions from dev, with one exception: the Pipeline table in main should never be changed. We can add the ON CONFLICT clause to our MERGE BRANCH statement to tell the catalog how to handle these conflicts, and then dry run our merge once more:
MERGE BRANCH DRY RUN "dev" INTO "main" ON CONFLICT OVERWRITE EXCEPT DISCARD "MyCatalog.Sales.Pipeline";
--> message | contentName | status
--> Branch "dev" can be merged into "main" | null | SUCCESS
The DRY RUN output tells us that the MERGE BRANCH statement will succeed, because we’ve specified how to resolve the merge conflicts that arise. Now, if we want to run our merge and commit all our changes, we can simply remove the DRY RUN syntax from our MERGE BRANCH command:
MERGE BRANCH "dev" INTO "main" ON CONFLICT OVERWRITE EXCEPT DISCARD "MyCatalog.Sales.Pipeline";
--> Branch "dev" has been merged into "main"
Once the merge is committed, end users querying main see the new data instantly.
Conclusion
Data branching is a powerful new paradigm that makes data engineering workflows faster, safer, and cheaper. We’re excited to continue to make branching more flexible and powerful for our end users. To learn more:
- Get started for free
- Check out the documentation
- Visit the Project Nessie website