h3h4h3h4h3

7 minute read · October 18, 2024

The Importance of Versioning in Modern Data Platforms: Catalog Versioning with Nessie vs. Code Versioning with dbt

Alex Merced

Alex Merced · Senior Tech Evangelist, Dremio

Version control is a key aspect of modern data management, ensuring the smooth and reliable evolution of both your data and the code that generates insights from it. While code versioning has long been a staple in software development, the advent of catalog versioning brings a powerful new tool to data teams. In this post, we'll dive into the differences between versioning your data with a Nessie-based catalog like Dremio’s integrated lakehouse catalog and code versioning with dbt, exploring how both approaches complement each other to deliver robust and flexible data operations.

Catalog Versioning with Nessie

Catalog versioning involves maintaining the history of your data's metadata evolution in a structured way. In a system like Dremio, which leverages Nessie for catalog versioning, the catalog acts much like Git but for your data. As your data evolves, every change to the metadata of tables (such as schema, partitioning, and format) is tracked in a history of commits. This allows you to branch, tag, and roll back the catalog to a specific point in time, similar to version control for code.

Benefits of Catalog Versioning

  1. Direct Impact on What Users See: When you version your data catalog, you control the data that is exposed to end users. If a rollback is necessary, it affects the actual data that users interact with, making it ideal for disaster recovery.
  2. Rollback of Data Changes: In case of data corruption or unwanted changes, catalog versioning allows you to roll back the entire state of your data, providing a clean and efficient recovery process.
  3. Zero-Copy Environments: Catalog versioning allows you to create development or testing environments without duplicating data. This is possible through branching, which lets you test changes to the data without affecting production environments.
  4. Historical Data Tagging: You can tag specific states of your data, making it easy to replicate or access historical data at any given point, which is highly useful for auditing or regulatory compliance.
  5. Auditing and Traceability: With catalog commits, it's easy to see who made changes and when, providing transparency and control over data management.

Code Versioning with dbt

On the other hand, code versioning refers to managing the SQL code that defines how data is processed and transformed. In dbt, this means checking in your SQL models into Git, where the changes to code are stored as commits. Each commit represents a point in the evolution of your SQL models, and like data catalog versioning, you can branch, tag, and roll back the code as needed.

Benefits of Code Versioning

  1. Collaboration and Team Development: dbt’s code versioning allows teams to collaborate on writing SQL that generates your datasets and views. Code can be reviewed, tested, and improved collaboratively without impacting production.
  2. Safe Development with Branching: You can create branches to test new dbt models in isolation from production data. This way, you can safely develop and test without the risk of affecting live datasets.
  3. Automation with GitHub Actions: You can integrate GitHub Actions to automatically run models when production code changes or perform validations before changes are committed, adding an extra layer of safety to your development process.
  4. Commit History for Debugging: When something goes wrong, code versioning helps trace issues back to specific changes. Each commit is a checkpoint that makes it easier to diagnose and resolve problems.

Why You Need Both

While catalog versioning and code versioning are powerful on their own, their combined use delivers even greater value for modern data platforms.

  1. Data Rollback vs. Code Rollback: Rolling back code commits does not affect the underlying data. Having the ability to roll back both code (via Git) and data (via a catalog like Nessie) provides more comprehensive recovery options in the event of bad code making it into production.
  2. Diagnosing Issues: Code versioning helps identify where changes were made in the SQL, while catalog versioning highlights changes to the data. This makes diagnosing issues faster and more precise.
  3. Environment Isolation: In dbt, you can define different profiles for different environments, allowing SQL code in development branches to be executed against isolated environments. Simultaneously, catalog branches can be created for these environments, and used in your dbt projects through the AT BRANCH clause in Dremio SQL, giving you complete control over both data and code in development workflows.

Conclusion

Catalog versioning with Nessie and code versioning with dbt both serve distinct but complementary purposes. While catalog versioning ensures the integrity and traceability of your data, code versioning ensures the collaborative, flexible development of the SQL code that transforms your data into actionable insights. Using both techniques in tandem provides a robust framework for managing data operations and handling inevitable changes in your data landscape.

By combining the power of version control in both your data and your code, you can confidently develop, test, and deploy changes, knowing you have the tools to recover quickly and effectively when something goes wrong.

Ready to Get Started?

Enable the business to create and consume data products powered by Apache Iceberg, accelerating AI and analytics initiatives and dramatically reducing costs.