9 minute read · February 23, 2024

What is Nessie, Catalog Versioning and Git-for-Data?

Alex Merced · Senior Tech Evangelist, Dremio

Data is not just an asset but the backbone of innovation and strategic decision-making, so managing it efficiently is paramount. Traditional data systems have struggled to keep pace with the explosion of data, evolving data formats, and the accelerating shift towards data lakes and cloud-based storage solutions. Enter Project Nessie, a new approach to lakehouse catalogs designed to address these challenges by bringing the proven principles of version control, a staple of software development, to data management.

Understanding Project Nessie

Project Nessie is akin to Git for data, offering a way to apply version control principles to data catalogs. This open-source project enables data engineers, scientists, and analysts to manage and maintain data as quickly and flexibly as developers manage code. Nessie was developed to tackle the complexities inherent in modern data platforms.

Nessie allows users to branch, tag, and commit changes to data catalogs, treating data changes as transactions. This capability ensures data evolution is manageable, auditable, and reversible, providing a robust foundation for data governance and security. By decoupling data and metadata management from the underlying storage system, Nessie supports a wide array of storage backends, from HDFS to cloud storage solutions, making it a versatile tool in the data engineer's toolkit.

Nessie and Git-for-Data: A Revolutionary Approach

The term Git-for-Data reflects applying Git's version control principles to data management. Just as Git revolutionized software development by enabling developers to track changes, branch off, merge updates, and explore the history of their codebases, Nessie aims to transform data management by applying these same principles to data catalogs.

With Nessie, data teams can create branches for experimenting with data without affecting the main branch, commit changes to track data evolution, and merge updates across all the tables in the catalog when they're ready to be shared or deployed. This Git-like functionality enhances collaboration among data teams and introduces flexibility and safety not previously available in data management. Through branching and merging, teams can test new data models, algorithms, or transformations in isolation, ensuring that only validated changes make their way into production.

Key Features of Nessie

Nessie introduces several groundbreaking features that set it apart from traditional data management systems and tools. Here’s an overview of the core capabilities that Nessie offers:

Branching and Merging

Like Git, Nessie supports creating branches of the data catalog, allowing teams to work on different versions of data simultaneously. This is particularly useful for experimenting with data models or conducting analysis without risking the integrity of the main data set. Once the work on a branch is complete and verified, it can be merged back into the main branch, ensuring that only accurate and validated changes are incorporated.
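
To make the branch-and-merge idea concrete, here is a minimal Python sketch of a toy in-memory catalog. It is not Nessie's API: `ToyCatalog`, `Commit`, and the `s3://example-bucket` paths are hypothetical names invented for illustration, and the merge deliberately ignores conflicts and deletions. What it tries to show is that a branch can be modeled as a named pointer to an immutable snapshot of the whole catalog, so work on one branch does not change what readers of `main` see until a merge happens.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot of the whole catalog (toy model only)."""
    tables: Dict[str, str]            # table name -> pointer to table metadata
    message: str
    parent: Optional["Commit"] = None


class ToyCatalog:
    """Hypothetical stand-in for a versioned catalog; not Nessie's API."""

    def __init__(self) -> None:
        # Each branch is just a name pointing at its latest commit.
        self.branches: Dict[str, Commit] = {"main": Commit({}, "init")}

    def create_branch(self, name: str, source: str = "main") -> None:
        # Creating a branch copies a pointer, not the data.
        self.branches[name] = self.branches[source]

    def commit(self, branch: str, changes: Dict[str, str], message: str) -> None:
        head = self.branches[branch]
        self.branches[branch] = Commit({**head.tables, **changes}, message, head)

    def merge(self, source: str, target: str = "main") -> None:
        # Simplification: the source's table pointers win; no conflict detection.
        src, dst = self.branches[source], self.branches[target]
        merged = {**dst.tables, **src.tables}
        self.branches[target] = Commit(merged, f"merge {source} into {target}", dst)


catalog = ToyCatalog()
catalog.commit("main", {"sales": "s3://example-bucket/sales/v1.metadata.json"}, "add sales")

# Work happens on an isolated branch.
catalog.create_branch("etl")
catalog.commit(
    "etl",
    {
        "sales": "s3://example-bucket/sales/v2.metadata.json",
        "customers": "s3://example-bucket/customers/v1.metadata.json",
    },
    "nightly load",
)

main_before = dict(catalog.branches["main"].tables)  # main is still untouched
catalog.merge("etl")                                  # publish both tables at once
main_after = dict(catalog.branches["main"].tables)
```

Note that the merge publishes the changes to both tables in a single step, which is the multi-table behavior described above.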

Time Travel and Rollbacks

One of the most compelling features of Nessie is the ability to travel back in time to retrieve previous versions of your catalog. This feature is invaluable for auditing, reproducing analyses, debugging data issues, and more. Time travel ensures that no data change is ever truly lost, providing a safety net for data engineers and scientists.
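
The following Python sketch illustrates why time travel and rollbacks come almost for free once every catalog change is recorded as a commit. It is a toy model under stated assumptions, not Nessie's implementation: `ToyHistory` and the table pointer strings are hypothetical, and commits are identified by simple integers here, whereas a real system would more likely use commit hashes or timestamps.

```python
from typing import Dict, List


class ToyHistory:
    """Hypothetical append-only commit log for a catalog (illustration only)."""

    def __init__(self) -> None:
        self.log: List[Dict[str, str]] = [{}]   # commit 0 is the empty catalog
        self.head = 0                           # commit the branch currently points at

    def commit(self, changes: Dict[str, str]) -> int:
        # Each commit stores a full snapshot built on top of the current head.
        self.log.append({**self.log[self.head], **changes})
        self.head = len(self.log) - 1
        return self.head

    def at(self, commit_id: int) -> Dict[str, str]:
        # Time travel: read the catalog exactly as it was at a past commit.
        return dict(self.log[commit_id])

    def rollback(self, commit_id: int) -> None:
        # Rollback: move the head pointer; older and newer commits stay in the log.
        self.head = commit_id


history = ToyHistory()
good = history.commit({"orders": "orders-v1"})
bad = history.commit({"orders": "orders-v2-corrupt"})

as_of_good = history.at(good)       # query the past without changing anything
history.rollback(good)              # restore the catalog to the known-good state
current = history.at(history.head)
```

In this sketch the bad commit is still readable after the rollback, which is what makes the history useful for debugging and audits.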

Nessie integrates seamlessly with various data processing tools and platforms, including Apache Spark, Dremio, Flink, Trino, Presto and more. This compatibility ensures data teams can continue using their preferred tools while benefiting from Nessie's version control capabilities.

Use Cases for Catalog Versioning with Nessie

The unique features of Nessie open up a wide range of use cases that were challenging or impossible to address with traditional data management practices. Here are some of the most impactful applications of catalog versioning with Nessie:

Data Experimentation and Rollbacks

Data scientists and engineers often need to experiment with data transformations, schema changes, and model training without disrupting ongoing operations. Nessie enables this experimentation by allowing users to create branches where they can freely make changes. If an experiment doesn't yield the expected results, it's easy to roll back to a previous state or discard the branch without affecting the production data.

Collaborative Data Management

In large organizations, multiple teams may need to work on the same datasets simultaneously. Nessie's branching and merging capabilities facilitate collaborative workflows, enabling teams to work in isolation and merge their changes when ready. This approach reduces conflicts and ensures that the primary dataset remains consistent and reliable.

Audit and Compliance

Regulatory compliance and audits require businesses to maintain accurate records of data changes, accesses, and lineage. Nessie's immutable history and time-travel capabilities provide an auditable trail of all data changes, simplifying compliance efforts and enhancing data governance.

Continuous Integration/Continuous Deployment (CI/CD) for Data Pipelines

Adopting CI/CD practices for data pipelines can significantly improve the reliability and efficiency of data operations. Nessie supports CI/CD by managing versions of data pipelines and datasets, allowing teams to automate testing and deployment processes. This ensures that changes can be deployed quickly and safely, promoting a more agile and responsive data infrastructure.
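
One way to picture this workflow is as a gate: load new data on a branch, run automated checks there, and publish to the main branch only if every check passes. The Python sketch below is a rough illustration under stated assumptions, not Nessie's API. The function `publish_if_valid`, the `Catalog` type, and the `rows` and `null_ids` statistics are all hypothetical, and a deep copy of a dictionary stands in for creating a catalog branch.

```python
import copy
from typing import Callable, Dict, List, Tuple

# Hypothetical catalog shape: table name -> simple quality statistics.
Catalog = Dict[str, Dict[str, int]]


def publish_if_valid(
    main: Catalog,
    load: Callable[[Catalog], None],
    checks: List[Callable[[Catalog], bool]],
) -> Tuple[Catalog, bool]:
    """Run a load on an isolated copy and publish it only if all checks pass."""
    branch = copy.deepcopy(main)      # stand-in for creating a branch
    load(branch)                      # the pipeline writes to the branch only
    if all(check(branch) for check in checks):
        return branch, True           # stand-in for merging the branch into main
    return main, False                # branch is discarded; main is unchanged


def good_load(cat: Catalog) -> None:
    cat["orders"] = {"rows": 120, "null_ids": 0}


def bad_load(cat: Catalog) -> None:
    cat["orders"] = {"rows": 0, "null_ids": 0}   # simulates an empty, broken load


checks = [
    lambda c: c["orders"]["rows"] > 0,        # table must not be empty
    lambda c: c["orders"]["null_ids"] == 0,   # no missing identifiers
]

main = {"orders": {"rows": 100, "null_ids": 0}}

main_after_bad, published_bad = publish_if_valid(main, bad_load, checks)
main_after_good, published_good = publish_if_valid(main, good_load, checks)
```

The broken load never reaches the main catalog, while the valid one is published in one step. In a real pipeline the checks would be data quality tests run by the CI system against the branch.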

Conclusion

Concluding our exploration of Project Nessie, its revolutionary approach to catalog versioning, and the concept of Git-for-Data, it's clear that Nessie presents a transformative solution to the complex challenges of modern data management. By enabling version control for data at the catalog level, Nessie enhances collaboration and experimentation and significantly improves data governance, compliance, and the overall agility of data teams.

A key aspect of Nessie's utility is its integration with the data lakehouse platform, Dremio, which exemplifies the practical application of Nessie in two distinct ways:

Dremio Cloud: Within Dremio Cloud, the integrated catalog is powered by Nessie, offering users a seamless experience with version-controlled data management. This integration includes automated table management features that simplify the complexities of handling vast data sets. Additionally, Dremio Cloud provides an intuitive user interface for monitoring catalog transactions, making it easier for data engineers and scientists to track changes, conduct experiments, and roll back if necessary. This combination of Nessie and Dremio Cloud enhances the data lakehouse's capability, ensuring that data management is both efficient and user-friendly.

Demonstration of Using Nessie Versioning with Dremio Cloud's Integrated Catalog

Dremio Software: A Nessie connector is available for users of Dremio's self-managed software, allowing the connection to a Nessie server that users manage independently. This flexibility ensures that organizations can leverage the power of Nessie's version control capabilities while maintaining control over their data infrastructure. The Nessie connector for Dremio software enables teams to integrate version control into their existing workflows, bringing the benefits of Git-like data management to their lakehouse architecture.

Nessie's integration with platforms like Dremio demonstrates the significant value that version control brings to the data lakehouse architecture. Whether through the cloud-based ease of Dremio Cloud or the flexible, self-managed approach with Dremio software, Nessie is set to redefine how organizations manage, collaborate on, and deploy their data assets. As data grows in volume, variety, and importance, adopting tools like Nessie that offer robust, scalable solutions will be key to unlocking the full potential of data-driven insights and innovations.

Setting up Nessie and Dremio on your Laptop for Evaluation

Learn more about Dremio's Nessie-Powered Lakehouse Catalog, which is part of Dremio's Lakehouse Management Features.
