8 minute read · July 1, 2024

More Flexible, Powerful Data Branching

Ben Hudson · Principal Product Manager, Dremio

Dremio's built-in lakehouse catalog, powered by Project Nessie, makes data engineering workflows easy by enabling a Git-like experience on Iceberg tables and views. Companies can use branches to make and validate changes to data without disrupting production workloads, instantly merge changes into their production branch when ready, and easily roll back from mistakes if needed. Data branching saves time and money by eliminating the need to create multiple physical environments and manage pipelines between those environments.

Today, we’re excited to announce that Dremio’s catalog now makes data engineering workflows safer and more flexible by enabling you to perform dry runs before merging changes, and easily understand and resolve merge conflicts.

Perform Dry Runs Before Merging Changes Into Production

Dremio’s catalog eliminates the need for separate physical staging and development environments by enabling you to use Git-like branches for data engineering workflows. Branches enable you to support development, test, and production use cases on the same lakehouse environment without creating physical copies of data. With branches, you can run ETL pipelines and validate changes on dev/UAT branches before instantly promoting changes to production, instead of having to manage pipelines between multiple physical environments.
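To see why a branch needs no physical copy of the data, consider the toy model below. It is a conceptual sketch only, not Dremio's or Nessie's implementation: the `ToyCatalog` class, its methods, and the metadata pointer strings are all hypothetical names made up for illustration. The idea it shows is that a branch is a name pointing at a commit, and a commit maps table names to metadata pointers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """Immutable catalog state: table name -> pointer to table metadata."""
    tables: Dict[str, str]
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical in-memory catalog; branches are names that point at commits."""

    def __init__(self) -> None:
        self.branches: Dict[str, Commit] = {"main": Commit(tables={})}

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        # No table data is copied: the new name points at the same commit.
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch: str, table: str, metadata_pointer: str) -> None:
        head = self.branches[branch]
        tables = {**head.tables, table: metadata_pointer}
        # Only this branch's pointer moves; other branches are untouched.
        self.branches[branch] = Commit(tables=tables, parent=head)

    def read(self, branch: str, table: str) -> str:
        return self.branches[branch].tables[table]


catalog = ToyCatalog()
catalog.commit("main", "HR.Salaries", "salaries-metadata-v1")  # made-up pointer
catalog.create_branch("dev")
shares_commit = catalog.branches["dev"] is catalog.branches["main"]

# A change on dev is invisible to readers of main until it is merged.
catalog.commit("dev", "HR.Salaries", "salaries-metadata-v2")
print(catalog.read("main", "HR.Salaries"))  # salaries-metadata-v1
print(catalog.read("dev", "HR.Salaries"))   # salaries-metadata-v2
```

In this model, creating `dev` costs one dictionary entry, which is why development, test, and production can share one lakehouse environment.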

With this release, after you’ve made and verified changes in your development branch, you can dry run the branch merge to see whether it will have any conflicts. A dry run tells you not only whether the merge would succeed but, if it would not, which objects are in conflict and why. Performing a dry run of a MERGE BRANCH statement returns the output the statement would produce if it were executed, but does not commit anything.
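The dry-run behavior can be sketched with a small model. This is an illustration under simplifying assumptions, not the catalog's actual logic: a branch is a dictionary of table name to version label, a conflict is any table changed on the source whose version on the target has also moved away from the common ancestor, and `find_conflicts` and `merge_branch` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # table name -> version label


def find_conflicts(base: State, source: State, target: State) -> List[str]:
    """Tables changed on source whose target version also moved away from base."""
    conflicts = []
    for table in sorted(source):
        changed_on_source = source[table] != base.get(table)
        changed_on_target = target.get(table) != base.get(table)
        if changed_on_source and changed_on_target and source[table] != target.get(table):
            conflicts.append(table)
    return conflicts


def merge_branch(base: State, source: State, target: State,
                 dry_run: bool = False) -> Tuple[dict, State]:
    """Return (report, resulting target state). A dry run never changes target."""
    conflicts = find_conflicts(base, source, target)
    if conflicts:
        return {"status": "FAILURE", "conflicts": conflicts}, dict(target)
    report = {"status": "SUCCESS", "conflicts": []}
    if dry_run:
        return report, dict(target)  # same report, nothing committed
    merged = dict(target)
    for table, version in source.items():
        if version != base.get(table):
            merged[table] = version
    return report, merged


# Illustrative data: base is the state both branches started from.
base = {"HR.Salaries": "v1", "Sales.Pipeline": "v1", "Sales.Orders": "v1"}
dev = {"HR.Salaries": "v2-dev", "Sales.Pipeline": "v2-dev", "Sales.Orders": "v2-dev"}
main = {"HR.Salaries": "v2-main", "Sales.Pipeline": "v2-main", "Sales.Orders": "v1"}

report, main_after = merge_branch(base, dev, main, dry_run=True)
print(report["status"], report["conflicts"])
```

The dry run and the real merge share the same conflict check, so the report from a dry run matches what the real merge would return against the same branch states.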

Easily Understand and Resolve Merge Conflicts

You can now resolve merge conflicts that arise when merging branches by using ON CONFLICT syntax. When a table or view is in conflict, you can tell the catalog how to resolve it in one of three ways.

  1. ON CONFLICT OVERWRITE: If the table/view has a conflict, then upon merge, overwrite the version of the table/view in the target branch with the version that exists in the source branch.
  2. ON CONFLICT DISCARD: If the table/view has a conflict, then upon merge, keep the version of the table/view in the target branch (don’t do anything).
  3. ON CONFLICT CANCEL: If the table/view has a conflict, then cancel the entire merge operation altogether and do not commit any changes.

In addition to the above, you can specify exceptions for how the catalog should handle merge conflict behavior on a per-table/view basis. For example, you can specify that the catalog should discard any conflicting tables during a merge, but overwrite one specific table.

  1. ON CONFLICT … EXCEPT: If there is a merge conflict, then all tables/views that have a conflict will follow the behavior specified by the ON CONFLICT option, except for any tables/views specified in the EXCEPT clause.
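The three options and the EXCEPT clause can be modeled as a default action plus per-table overrides. The sketch below is a toy illustration, not Dremio's implementation: branch states are plain dictionaries of table name to version label, the conflict rule is a simplification (the table changed on both branches since their common ancestor), and `merge_with_policy` and its parameters are hypothetical names.

```python
from typing import Dict, Optional, Tuple

State = Dict[str, str]  # table name -> version label


def merge_with_policy(base: State, source: State, target: State,
                      on_conflict: str = "CANCEL",
                      exceptions: Optional[Dict[str, str]] = None) -> Tuple[State, bool]:
    """Return (resulting target state, committed?) under the given conflict policy."""
    exceptions = exceptions or {}
    merged = dict(target)
    for table in sorted(source):
        if source[table] == base.get(table):
            continue  # unchanged on source, nothing to merge
        target_changed = target.get(table) != base.get(table)
        in_conflict = target_changed and target.get(table) != source[table]
        if not in_conflict:
            merged[table] = source[table]
            continue
        # EXCEPT entries take precedence over the default ON CONFLICT action.
        action = exceptions.get(table, on_conflict)
        if action == "OVERWRITE":
            merged[table] = source[table]   # take the source branch's version
        elif action == "DISCARD":
            pass                            # keep the target branch's version
        elif action == "CANCEL":
            return dict(target), False      # abort: commit nothing at all
        else:
            raise ValueError(f"unknown conflict action: {action}")
    return merged, True


# Illustrative data: both branches changed Salaries and Pipeline.
base = {"MyCatalog.HR.Salaries": "v1", "MyCatalog.Sales.Pipeline": "v1"}
dev = {"MyCatalog.HR.Salaries": "v2-dev", "MyCatalog.Sales.Pipeline": "v2-dev"}
main = {"MyCatalog.HR.Salaries": "v2-main", "MyCatalog.Sales.Pipeline": "v2-main"}

result, committed = merge_with_policy(
    base, dev, main,
    on_conflict="OVERWRITE",
    exceptions={"MyCatalog.Sales.Pipeline": "DISCARD"},
)
print(committed, result)
```

Here Salaries takes the dev version while Pipeline keeps the main version. With the default of CANCEL and no exceptions, the same inputs leave main unchanged.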

Example

Suppose we have two branches in our environment – main and dev, where main is our production branch and dev is our development branch. Suppose we have made and validated changes to our data on dev, and are ready to merge our changes back into production for data analysts to leverage. Before we commit any changes, we want to validate whether or not our merge will succeed:

MERGE BRANCH DRY RUN "dev" INTO "main";

--> message | contentName | status
--> Branch “dev” cannot be merged into “main” | null | FAILURE
--> Values of existing and expected content differ | MyCatalog.HR.Salaries | VALUE_DIFFERS
--> Values of existing and expected content differ | MyCatalog.Sales.Pipeline | VALUE_DIFFERS

The DRY RUN output shows two conflicts that would prevent the merge from succeeding: one on the Salaries table and one on the Pipeline table.

Assume that in our workflow, conflicting changes from the dev branch should overwrite what is on the main branch, with one exception: the Pipeline table on main must never be changed by a merge. We can add the ON CONFLICT clause to our MERGE BRANCH statement to tell the catalog how to resolve each conflict, and then dry run our merge once more:

MERGE BRANCH DRY RUN "dev" INTO "main"
ON CONFLICT OVERWRITE
EXCEPT DISCARD "MyCatalog.Sales.Pipeline";

--> message | contentName | status
--> Branch “dev” can be merged into “main” | null | SUCCESS

The DRY RUN output tells us that the merge branch statement will succeed, because we’ve specified how to overcome any merge conflicts that may arise. Now, if we want to run our merge and commit all our changes, then we can simply remove the DRY RUN syntax from our MERGE BRANCH command:

MERGE BRANCH "dev" INTO "main"
ON CONFLICT OVERWRITE
EXCEPT DISCARD "MyCatalog.Sales.Pipeline";

--> Branch “dev” has been merged into “main”

End users get to see the new data instantly.

Conclusion

Data branching is a powerful new paradigm that makes data engineering workflows faster, safer, and cheaper. We’re excited to continue to make branching more flexible and powerful for our end users.
