The Data Lakehouse is rapidly emerging as the ideal data architecture, utilizing a single source of truth on your data lake. This is made possible by technologies like Apache Iceberg and Project Nessie. Apache Iceberg, a revolutionary table format, allows you to organize files on your data lake into database tables and execute efficient transactions with robust ACID guarantees. Project Nessie complements Iceberg as a catalog for these data lake tables, making them accessible to various tools. Its unique strength lies in enabling a "Git for Data" experience at the catalog level, allowing you to track changes, isolate modifications with branching, merge changes for publication, and create tags for easily replicable points in time across all your tables simultaneously.
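To make the "Git for Data" idea concrete, here is a minimal sketch using Nessie's Spark SQL extensions, assuming a Spark session already configured with a Nessie catalog named `nessie` and the Nessie SQL extensions enabled; the branch, tag, and table names are invented for the example.

```python
# Minimal sketch of the "Git for Data" flow via Nessie's Spark SQL extensions.
# Assumes `spark` is a SparkSession configured with a Nessie catalog named
# "nessie"; all branch/tag/table names below are illustrative.

# Isolate changes on a new branch cut from main.
spark.sql("CREATE BRANCH IF NOT EXISTS etl IN nessie FROM main")
spark.sql("USE REFERENCE etl IN nessie")

# Writes now land on the 'etl' branch, invisible to readers of main.
spark.sql("INSERT INTO nessie.sales.transactions VALUES (1, 'IT', 100.0)")

# Publish the changes atomically across all modified tables, then tag the
# resulting state so it can be referenced (and reproduced) later.
spark.sql("MERGE BRANCH etl INTO main IN nessie")
spark.sql("CREATE TAG nightly_2024_05_01 IN nessie FROM main")
```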
Apache Iceberg's ecosystem has flourished, offering many tools for reading, writing, and optimizing Iceberg tables. Project Nessie's ecosystem is growing just as quickly: it has been embraced by platforms like Dremio (where it was originally created) and Bauplan, as well as open-source tools such as Apache Spark, Apache Flink, Presto, and Trino.
In this blog, we'll explore two platforms built around Nessie, Dremio and Bauplan, to showcase Nessie's power and how each puts this open-source technology to work.
Dremio's open architecture empowers a data-anywhere, deliver-everywhere approach. It supports many data sources, including databases, data lakes, and warehouses, and users can access all datasets through open interfaces such as a REST API, JDBC/ODBC, and Apache Arrow Flight for analytics, data science, and more. Apache Iceberg tables cataloged in Dremio's Nessie-based catalogs can be used seamlessly from other tools like Apache Spark, Apache Flink, and Bauplan.
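For instance, the Arrow Flight interface can be queried from Python with nothing but `pyarrow`, following the standard Flight client pattern; the host, port, credentials, and table name below are placeholders for your own deployment.

```python
# Sketch: querying Dremio over Arrow Flight with plain pyarrow.
# Endpoint, credentials, and table name are placeholders.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-host:32010")

# Exchange basic credentials for a bearer token (standard Flight handshake).
token = client.authenticate_basic_token("username", "password")
options = flight.FlightCallOptions(headers=[token])

# Plan the query, then stream the results back as Arrow record batches.
info = client.get_flight_info(
    flight.FlightDescriptor.for_command(
        "SELECT * FROM nessie.sales.transactions LIMIT 10"
    ),
    options,
)
reader = client.do_get(info.endpoints[0].ticket, options)
df = reader.read_all().to_pandas()
print(df.head())
```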
Dremio is a unified access layer for curating, organizing, governing, and analyzing your data. Its open nature simplifies the implementation of cutting-edge DataOps and data mesh patterns, optimizes costs, and enhances overall productivity within your data platform. Dremio has also recently announced Enterprise Support for Nessie in the Dremio Software product.
Bauplan is a programmable data lake, offering optimized multi-language runtimes (Python and SQL) for data workloads over object storage (e.g., Parquet files on S3). Bauplan's main abstraction is the data pipeline: a series of tables produced by repeatedly applying transformations to source ("raw") datasets.
A data pipeline is a collection of tables obtained by applying SQL or Python to source tables.
In this example, the source table "transactions" logs individual transactions for several countries; the child table "euro_selection" is created from it by filtering for countries in the eurozone; finally, "usd_by_country" is an aggregation table created by applying a Pandas transformation to "euro_selection". Bauplan interoperates with SQL engines such as the Dremio SQL engine to provide a unified runtime for these transformations, abstracting away data movement, containerization, and caching from the end user.
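To make the shape of this pipeline concrete, here is the transformation logic in plain pandas, independent of the engine that actually runs it; the column names, country list, and exchange rate are invented for the example.

```python
# Sketch of the pipeline logic itself (not Bauplan's API): each function takes
# a parent table and returns a child table. All names are illustrative.
import pandas as pd

EUROZONE = {"IT", "FR", "DE", "ES"}  # abbreviated list, for the example
EUR_TO_USD = 1.08                    # invented rate, for illustration only

def euro_selection(transactions: pd.DataFrame) -> pd.DataFrame:
    # Keep only transactions from eurozone countries.
    return transactions[transactions["country"].isin(EUROZONE)]

def usd_by_country(euro_selection: pd.DataFrame) -> pd.DataFrame:
    # Aggregate per country and convert the totals to USD.
    out = euro_selection.groupby("country", as_index=False)["amount_eur"].sum()
    out["amount_usd"] = out["amount_eur"] * EUR_TO_USD
    return out[["country", "amount_usd"]]

# source table -> euro_selection -> usd_by_country
transactions = pd.DataFrame(
    {"country": ["IT", "US", "FR"], "amount_eur": [100.0, 80.0, 50.0]}
)
print(usd_by_country(euro_selection(transactions)))
```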
The easiest way to appreciate Nessie's value for Bauplan is a concrete example in pipeline maintenance. Jacopo wakes up Tuesday to an alert: the pipeline that ran Monday night unexpectedly produced an empty "usd_by_country" table, and the failure needs to be investigated. He can't simply re-run the code against production on Tuesday: the data may have changed in the meantime, and debugging pipelines shouldn't conflict with the work of colleagues who depend on a clean, stable production data lake. The picture below depicts our ideal scenario:
A production incident: to debug Monday’s failure we need Git for code and for data.
Jacopo can branch out from production data as of Monday to create a new debug branch; this debug branch will host temporary tables during debugging;
Jacopo can retrieve the code that ran on Monday and apply it to the source tables in the debug branch to recreate the bug (i.e. the empty table).
Walking back from our ideal debugging scenario, it should now be clear why Nessie is our catalog of choice: the Nessie abstraction over Iceberg tables allows us to perform both time travel (how did the data look on Monday?) and sandboxing (can I debug on production data without creating conflicts in the production environment?). Through its zero-copy, multi-table capabilities, Nessie enables a seamless Git-like experience on data pipelines, not just data tables: combined with Bauplan's metastore (which stores the code), the entire debugging session comes down to three commands in the terminal, listed below and sketched in code after the list:
Re-running an arbitrary pipeline in Bauplan leveraging Nessie.
Create a new branch;
Re-run the pipeline from Monday by id: it's Bauplan's responsibility to retrieve from its metastore both Monday's state of the data lake and Monday's code base;
Verify that in the debug branch the "usd_by_country" table is actually empty: it is again Bauplan's responsibility to route the query to the appropriate engine and branch, and to stream the results back to the user.
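Put together, the session looks roughly like the following. `PipelineClient` and its methods are invented stand-ins used only to make the three steps explicit; they are not Bauplan's actual CLI or API.

```python
# Hypothetical sketch of the three-step debugging session. PipelineClient and
# its methods are invented stand-ins, NOT Bauplan's actual API; the point is
# the shape of the workflow, not the exact calls.

class PipelineClient:
    """Stand-in for a pipeline platform client (hypothetical)."""
    def create_branch(self, name: str, from_ref: str) -> None: ...
    def rerun_pipeline(self, run_id: str, branch: str) -> None: ...
    def query(self, sql: str, branch: str) -> list: ...

client = PipelineClient()

# 1. Branch production as of Monday: zero-copy, across all tables at once.
client.create_branch("debug_monday", from_ref="main")

# 2. Re-run Monday's pipeline by id on the debug branch; the metastore
#    supplies both Monday's code and Monday's data lake state.
client.rerun_pipeline("run_2024_05_06_nightly", branch="debug_monday")

# 3. Inspect the suspect table in isolation, leaving production untouched.
client.query("SELECT COUNT(*) FROM usd_by_country", branch="debug_monday")
```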
Finally, while Bauplan users interact through high-level, Git-like APIs, the underlying system natively speaks Iceberg and Arrow, and it is therefore fully interoperable with the Dremio lakehouse: you can create a table in Python with Bauplan and query it with Dremio by simply pointing at the underlying Nessie catalog; vice versa, you can start a Bauplan pipeline from a table you created with the Dremio engine in the Dremio Integrated Catalog. Building on Nessie and open standards allowed us to move faster and gain immediate access to a thriving ecosystem of interoperable tools.
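Concretely, a table written on a Nessie branch by one engine can be read by Dremio with branch-aware SQL. The sketch below reuses the `client` and `options` objects from the earlier Arrow Flight example; the catalog, table, and branch names are placeholders.

```python
# Sketch: reading a branch-pinned Iceberg table from Dremio via Arrow Flight.
# Assumes `client` and `options` from the earlier Flight example; names are
# placeholders. Dremio SQL can pin a query to a Nessie reference with
# AT BRANCH (and similarly AT TAG / AT COMMIT).
sql = "SELECT * FROM nessie.sales.usd_by_country AT BRANCH debug_monday"
info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
table = client.do_get(info.endpoints[0].ticket, options).read_all()
print(table.num_rows)  # rows as written by Bauplan on the debug branch
```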
If you want to know more about the interplay between pipelines and Nessie, Bauplan will present their paper on reproducible pipelines at SIGMOD 2024.
Conclusion
As platforms like Dremio and Bauplan embrace Nessie, they underscore its pivotal role in improving data governance and operational efficiency through its "Git for Data" approach. Nessie's recent adoption of the Apache Iceberg REST catalog specification broadens its accessibility across programming environments and cements its position in the data architecture landscape, letting it serve a wider array of applications and fostering a more robust, interoperable ecosystem. The future of Nessie, and of the catalog versioning paradigm it enables, looks bright: it sets a new standard for data management practices and offers real value to organizations aiming to harness the full potential of their data assets.