
Nessie: Git for Data Lakes

The Rise of Data Lake Storage

For decades, organizations relied on relational databases, and later enterprise data warehouses, to organize and store corporate data. These systems provided a strong structural model for organizing data, along with data consistency and reliability guarantees. However, they achieved this through vertically integrated designs that were isolated from one another, which created data silos between individual systems. Moving data between the many systems in an organization therefore required complex data migration and ETL flows that were hard to operate, brittle, and weakened the reliability guarantees of any individual system.

To simplify data management and centralize data onto a common and shared platform, organizations increasingly deploy data lakes, which are large repositories that store structured and unstructured data at any scale. Initially, data lakes were associated with the Hadoop Distributed File System (HDFS), but increasingly organizations utilize cloud storage systems such as Amazon S3 or Microsoft Azure Data Lake Storage (ADLS). Due to the highly accessible and flexible nature of data lakes, organizations can enable all applications to operate on a shared data repository for processing.

Unlike traditional relational databases which are closed vertically integrated systems, data lakes are built upon an open architecture with numerous components for different functions. This open design provides organizations the flexibility to pick and utilize technologies that suit their individual requirements while still providing shared access to data across applications. For example, organizations can select SQL processing engines (e.g., Dremio, Spark, Hive, etc.), file formats (Parquet, ORC, etc.) and table formats (Iceberg, Delta Lake, Hudi) that match the needs of different workloads while operating on a shared data model. This adaptability enables organizations to optimize workloads while still maintaining a highly flexible and shared data model.


Figure 1: Data Lake Architecture Components

Data Engineering Challenges

Despite the flexibility that data lakes offer, two key challenges are commonly encountered when deploying one. The first is how to achieve the transactional consistency of traditional relational databases with an open architecture, and the second is how to address operational challenges to ensure users never see an inconsistent view of data as data is ingested and updated.

Relational databases provide ACID guarantees as data is ingested and updated, with the ability to modify and commit data in a single transaction. This is possible because traditional databases are designed as a centralized system that gates all access to internal data storage. However, in open data lake architectures, where multiple applications work on the same data in place without a centralized gate, coordinating changes to data with similar ACID guarantees has been a challenge. For example, if one application is in the middle of an update, how can other applications view those changes in a reliable manner?

Another challenge data engineers have grappled with for decades, both with traditional databases and newer data lake architectures, is how to ensure users never see an inconsistent view of data as new data is ingested and updated. Organizational data is often updated through a series of workflows that curate new data and apply business logic, and these workflows typically consist of multiple steps developed over time. This presents several challenges including:

  • How to ensure intermediate processing steps are not exposed to users
  • How to verify results and data correctness before exposing to users
  • How to roll back to the last known correct data if verification fails
  • How to develop and test changes to the process without exposing changes in development to users

Approaches to solve these challenges are often implemented in an unstructured and ad hoc manner. For example, one common approach is to create a series of views that function as a pointer to data and are switched after processing is complete. Such solutions are brittle, error-prone and highly time intensive, regardless of whether data is in an enterprise data warehouse or a data lake.

The Git Revolution

When looking for potential solutions to these coordination challenges it is helpful to look at solutions other disciplines implemented. Software engineering in particular had similar challenges coordinating versions on large code bases across teams of developers. Version control tools such as CVS existed but it was difficult to create new versions of code bases, and share, test, review and merge changes back into production code. In particular, ensuring changes on non-production versions maintained consistency with the production code base as it evolved was very expensive and time consuming.

Git changed this. With Git, developers could cheaply create branches where changes across an entire code base could be developed, shared, tested, verified and iteratively improved, either alone or as a team. Maintaining consistency with the production code base as it evolved became far easier, and merging completed changes became routine.

As a result, there was a significant jump in both individual developer productivity and the productivity of whole engineering teams, as multiple independent features could easily be developed on a common code base. Collaboration improved as well: with the pull request workflows built around Git, developers could propose a change, have it independently reviewed, receive suggestions for improvement and verify correctness before merging. These patterns resulted in more productive collaboration and significantly improved code quality.

Following the Git revolution in software engineering, Git started to be deployed to manage software operations. First, configurations started to be stored in Git, then Git was used for deployment, and finally the entire application stack was stored in Git. Beyond software development, ML models started to be stored in Git as well.

Git for the Data Lake

Nessie, a new open source project, brings the capabilities of Git to data and your entire data lake by implementing a repository of all objects in the data lake along with version control for the data lake. This enables data engineers to manage the data lake with the same best practices Git enabled for application development across a large and diverse code base. Nessie does so by implementing several core concepts of Git, such as branches, tags, commits, merges, version control and reproducibility, over the entire data lake repository.

Suppose, for example, an organization has a complicated ETL process that runs regularly, imports new data from multiple data sources, executes a series of transformations over multiple steps, modifies table structures by adding/dropping partitions and utilizes multiple engines for computation (Spark, Hive), all while needing to ensure users are always exposed to a valid and consistent view of data. Ad hoc methods exist to ensure consistency but they are manual, time consuming and brittle.

With Nessie such processes are easy to implement and automate through the use of branches and commits. The ETL job begins by creating a branch for the entire process, which is an instantaneous and free process in Nessie. All modifications are then implemented within the branch, including new data ingestion, table partition changes, data transformation steps across multiple tables, etc., all while using multiple different tools operating together within the ETL branch.

Data verification is run on the branch to ensure correctness. If issues are discovered, data can be corrected or the entire process and branch can be rolled back or deleted, both of which are instantaneous processes. Once data passes verification the ETL branch is simply merged back onto the main branch. Merges are atomic operations and fast, which guarantees that users always see a consistent view of the data lake. One second users see the pre-ETL version of the data lake, and the next second users see the data lake post all ETL operations.

Such workflows are the standard best practice methods used in software development and other fields due to their simplicity, and with Nessie these practices are now viable for data engineers to use on the data lake.
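The branch, commit, verify and merge flow described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: ToyCatalog, Commit and the metadata pointer strings are hypothetical names invented for illustration, and the code does not use or represent Nessie's actual API or storage model. It shows why creating a branch can be cheap (only a reference is copied) and why a merge can be atomic (one reference is reassigned).

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot: table name -> pointer to that table's metadata."""
    tables: Mapping[str, str]
    parent: Optional["Commit"] = None
    message: str = ""


class ToyCatalog:
    """Hypothetical in-memory model of branches; not Nessie's real API."""

    def __init__(self):
        self.branches = {"main": Commit(tables=MappingProxyType({}))}

    def create_branch(self, name, source="main"):
        # Copies a single reference; no table data is read or duplicated.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes, message):
        # A commit may change many tables at once; the branch head moves once.
        head = self.branches[branch]
        tables = dict(head.tables)
        tables.update(changes)
        self.branches[branch] = Commit(MappingProxyType(tables), head, message)

    def merge(self, source, target="main"):
        # Toy restriction: fast-forward only, so the target head must be
        # an ancestor of the source head.
        target_head = self.branches[target]
        node = self.branches[source]
        while node is not None and node is not target_head:
            node = node.parent
        if node is None:
            raise RuntimeError("target has diverged; toy supports fast-forward only")
        # Readers of the target see either the old head or the new one.
        self.branches[target] = self.branches[source]

    def delete_branch(self, name):
        # Rolling back a failed ETL run is just dropping the reference.
        del self.branches[name]

    def tables(self, ref):
        return dict(self.branches[ref].tables)


catalog = ToyCatalog()
catalog.commit("main", {"sales": "sales/metadata-v1"}, "initial load")

catalog.create_branch("etl")
catalog.commit("etl", {"sales": "sales/metadata-v2"}, "ingest new data")
catalog.commit("etl", {"sales_summary": "sales_summary/metadata-v1"}, "transform")

before = catalog.tables("main")   # users still see the pre-ETL state
catalog.merge("etl")              # verification would run before this step
after = catalog.tables("main")    # users now see every ETL change together

catalog.create_branch("experiment")
catalog.commit("experiment", {"sales": "sales/metadata-bad"}, "failed run")
catalog.delete_branch("experiment")  # main is untouched by the failed run
```

In this sketch a verification step would sit between the last commit on the etl branch and the merge; if it failed, deleting the branch would leave main exactly as it was.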

Reproducibility with Zero Copies

Nessie also enables reproducibility with ease and zero cost. Reproducibility is important for a variety of scenarios, such as being able to publish a static view of data at a specific point in time or being able to regenerate past results that were used for decision making in a reliable manner. Previously, such functionality was only available by making copies of data or saving query results, which duplicated data and required time consuming and expensive management.

Nessie does so by giving users the ability to set a historical commit context for any operation and simplifies management through Nessie tagging. For example, insurance companies make policy decisions based on a variety of actuarial models applied to data that is available at the time an insurance policy is made. Nessie tagging enables easy to use reference points to the data lake at different times, making it effortless to regenerate the results used to make a previous policy decision, or to test actuarial models against the data lake at different points in time, all by utilizing version control and without making expensive copies of data.
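As a rough illustration of how tags can give reproducibility without copying data, the sketch below keeps a list of immutable snapshots and lets a tag name point at one of them. ToyHistory, the tag name and the metadata pointer strings are hypothetical and chosen only for this example; this is not Nessie's API, just a minimal model of the idea that a tag is a named, fixed reference into history.

```python
from types import MappingProxyType


class ToyHistory:
    """Hypothetical model: a linear history of snapshots plus named tags."""

    def __init__(self):
        # Each snapshot maps table name -> pointer to that table's metadata.
        self.commits = [MappingProxyType({})]
        self.tags = {}

    def commit(self, changes):
        snapshot = dict(self.commits[-1])
        snapshot.update(changes)
        self.commits.append(MappingProxyType(snapshot))

    def tag(self, name):
        # A tag is a fixed pointer to the current snapshot; it never moves.
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = len(self.commits) - 1

    def tables(self, ref=None):
        # With no ref, read the latest state; with a tag, read the past state.
        index = len(self.commits) - 1 if ref is None else self.tags[ref]
        return dict(self.commits[index])


history = ToyHistory()
history.commit({"claims": "claims/metadata-v1"})
history.tag("policy-review-a")            # hypothetical tag name
history.commit({"claims": "claims/metadata-v2"})

as_tagged = history.tables("policy-review-a")  # state used for the old decision
latest = history.tables()                      # state users see today
```

Only pointers are stored per snapshot here, which is the sense in which the historical view is available without duplicating the underlying data files.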

The Data Lake Repository

Nessie defines a repository of all objects within the data lake and creates a shared and centralized definition of the data lake across multiple tools. More specifically, all data lake objects, whether they are tables, views, files, schema definitions or tool specific attributes, are defined and versioned by Nessie.

Table repositories have been a key missing component of data lake architectures compared to traditional relational databases and enterprise data warehouses. An advantage for traditional databases in being a vertically integrated and closed design is they can function as a central access point for data and schema changes, which simplifies data architectures since all users access data through the database.

Open design data lake architectures, on the other hand, lack a centralized definition which is often a source of complexity. The Nessie table repository, however, creates this central definition for data lakes in an open and flexible architecture, giving all users and tools that access the data lake a common and centralized definition of tables, views, files and more. This reproduces many of the benefits and functionalities of traditional databases, while preserving the openness and flexibility of a data lake architecture.


Figure 2: Nessie Table Repository

By utilizing a table repository for the data lake all tools operate on a consistent view of the data lake, remain in sync with each other and produce consistent results. For example, if one tool such as Spark creates new data lake tables or adds new partitions to multiple tables, other tools such as Hive see those changes immediately, in a reliable manner, and without complex and brittle synchronization activities.
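The effect of a shared table repository on multiple tools can be sketched as follows. SharedCatalog and Engine are hypothetical stand-ins invented for this example (they are not Spark, Hive or Nessie interfaces), and the lock-plus-reassignment approach is an assumption about one simple way to publish a multi-table change in a single step.

```python
import threading


class SharedCatalog:
    """Hypothetical stand-in for a table repository shared by several engines."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # table name -> tuple of partition names

    def commit(self, changes):
        # Build the new state off to the side, then publish it with a single
        # assignment so readers never observe a half-applied change.
        with self._lock:
            updated = dict(self._tables)
            updated.update(changes)
            self._tables = updated

    def snapshot(self):
        # The published dict is never mutated, so this is a consistent view.
        return dict(self._tables)


class Engine:
    """Hypothetical engine client; both engines resolve tables via the catalog."""

    def __init__(self, name, catalog):
        self.name = name
        self.catalog = catalog

    def write(self, changes):
        self.catalog.commit(changes)

    def partitions(self, table):
        return self.catalog.snapshot().get(table, ())


catalog = SharedCatalog()
spark = Engine("spark", catalog)
hive = Engine("hive", catalog)

# One engine adds partitions to two tables in a single commit.
spark.write({"orders": ("2020-10-01",), "customers": ("2020-10-01",)})

# The other engine sees both changes without any separate sync step.
hive_orders = hive.partitions("orders")
hive_customers = hive.partitions("customers")
```

Because neither engine keeps its own copy of the table definitions, there is nothing to synchronize between them in this model.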

Nessie also provides a centralized governance layer to simplify security and compliance policy management for the data lake. Traditionally, this management has been divided across multiple tools, creating numerous operational headaches as security policies evolved and had to be kept consistent and reliable. With Nessie, all tools use the security policies defined within the table repository, thereby increasing the security and governance of the data lake.

Wrapping Up

Nessie improves modern data lake architectures first by bringing the capabilities and functionality of enterprise data warehouses to the data lake, and second by leapfrogging enterprise data warehouses through offering a Git-like experience for data management. With Nessie users have:

  • A central repository for the data lake including tables, views, files, schema changes, security and governance
  • Version control over the entire data lake
  • Powerful Git-like capabilities including branches, commits, tags and merges
  • Data reproducibility, verification and rollback capabilities
  • Multi-table transactional consistency in the data lake
