May 3, 2024

Prod is the New Dev: Using Nessie to Build the Definitive CI/CD for Data

With the rise of the Data Lakehouse and the continual growth of data workloads, data development must be reshaped to align with established software best practices, such as CI/CD. While many data catalogs provide pointwise solutions to keep track of tables, they cannot handle complex multi-language pipelines.

In this talk, we explain how we built upon Nessie to create the data catalog experience for Bauplan, a novel serverless data transformation platform. We showcase how Nessie can be combined with other open tools to build a truly composable Data Lakehouse, where the separation between dev and prod is naturally integrated in a CI/CD cycle, and discuss the code-first ergonomics that we implemented upon it. Finally, we conclude by demoing how Dremio Cloud can seamlessly interoperate with Bauplan DAGs, such that Bauplan transformations can be easily managed and queried through Dremio Lakehouse Management.

Topics Covered

Governance and Management

Sign up to watch all Subsurface 2024 sessions

Speaker

Ciro Greco

CEO, Bauplan

Transcript

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Ciro Greco:

So, we’ll be talking about some of the affordances that we are building at Bauplan using open formats. In particular, we chose Nessie because of the developer experience that we wanted to build when it comes to working on a lakehouse in the simplest way for developers. This is pretty much what we are building at Bauplan. Before doing this, I was the founder and the CEO at Tooso, which was a company that was doing natural language processing on information retrieval. We brought the company to an acquisition in 2019 and joined a larger organization, helping with the AI and the data platform from growth to IPO. We had a lot of different use cases and a lot of different populations working on data: analysts, data engineers, and machine learning engineers specifically. A lot of my work has been around understanding how to build a data platform that has the right abstractions for people to work easily and to not have to learn too much about the infra. When we left and we started building Bauplan, basically what we wanted to build was the product that we would have loved to have at the time.

We’re an early stage startup backed by very great backers such as Innovation Endeavors and South Park Commons, and we have a bunch of very cool angels and advisors from reputable organizations in the data space. We’re very excited about where the space is going in terms of open source and open formats, and we can’t wait to see this unbundling of the platform towards the Lakehouse architecture happening.

Reproducibility in Data Projects

We’re going to start from ancient Greece. Famous quote from Heraclitus: “No man has ever stepped in the same river twice, for it’s not the same river and it’s not the same man.” Now, basically what we’re going to talk about today is how to make sure that this is never true for your data pipelines. We don’t want your data pipelines to be impossible to replicate. We don’t want data pipelines to change over time constantly. We want to address the problem of how to build a system that ensures reproducibility, among other things. That is a major problem in data projects, mostly because data is an open system, and that makes reproducibility more complicated than in traditional software engineering. Being an open system means that some of the things that will determine the output of your system do not really depend on the system itself. It’s just that the data is going to change, the distribution is going to change, the world is essentially going to change. There are two main things in your business that are affected by the lack of reproducibility.

Personally, I’ve seen this problem a lot. Basically, you have a dev environment, or sometimes people develop locally. In these situations, the dev environment and the production environment are going to diverge. Data scientists and machine learning engineers especially, who know the business problem and the business logic very well, are going to build prototypes. These prototypes will be built on a sample of data and on dev environments, with little guarantee that a particular prototype can then be brought into production seamlessly. Because when it is brought to production, now we have to deal with the actual cloud environment for production and with the data in the production scenario, so the real-world data. What happens there is that bringing things to production takes a long time. This is a very big problem for your business, as delivery becomes very sluggish. It’s also hard to track the responsibility of people, as you will have certain teams that are tasked with building applications and certain teams that are tasked with bringing the application online.

The Reproducibility Checklist

Then, of course, there’s the main problem of debugging. We’re going to use this as our example for today. Basically, it’s when things go wrong. Our ability to debug really is only as good as our ability to reproduce the problem. I’m sure that the scenario described here is one that many of you have found yourselves in many times. Monday night, something is wrong with a pipeline. A certain data pipeline is broken. Now, I come into the office on Tuesday. What do I do? The source data has actually changed because new data came in, the code has also changed as somebody pushed some new transformations, and the environment has changed because the new code comes with new dependencies.

To reproduce this incident and understand what exactly went wrong, we are going to go through the reproducibility checklist. We need to be able to have the same code; the same source data, which is problematic because sometimes that involves the ability to version quite big artifacts; the same environment; and the same hardware, because we have all heard at least once the sentence, “It works on my computer.” That is something that we don’t want to have when we debug a production system. We want to make sure that things run in the cloud and that we can control that the hardware is kept constant.
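One way to picture the checklist is as a fingerprint recorded for every run, with one field per layer. The sketch below is only an illustration: the `RunFingerprint` class, the `drift` helper, and all the example values are hypothetical and are not part of Bauplan or Nessie.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunFingerprint:
    """Everything that has to match for two runs to be comparable."""
    code_commit: str   # e.g. a Git commit hash (hypothetical value below)
    data_ref: str      # e.g. a catalog commit or table snapshot id
    environment: str   # e.g. a container image digest
    hardware: str      # e.g. a cloud instance type


def drift(expected: RunFingerprint, actual: RunFingerprint) -> list[str]:
    """Return the names of the layers that differ between two runs."""
    return [
        f.name
        for f in fields(RunFingerprint)
        if getattr(expected, f.name) != getattr(actual, f.name)
    ]


# Monday night's failing run versus what is deployed on Tuesday morning.
monday = RunFingerprint("a1b2c3", "catalog@42", "sha256:aaa", "m5.xlarge")
tuesday = RunFingerprint("d4e5f6", "catalog@43", "sha256:bbb", "m5.xlarge")

print(drift(monday, tuesday))
```

If any layer shows up in the drift list, Tuesday's run is not a reproduction of Monday's incident, which is exactly the situation described in the example.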

Now, the main problem there is that all these different layers of reproducibility that we need to ensure (data, code, environment, and runtime) can bring a combinatorial explosion, because each one of these layers can be implemented with a bunch of different tools. That is particularly true if you operate on a homegrown infrastructure or if you work on a lake. Taming this complexity is quite important, then, to be able to have a platform where developers can reproduce things fast and with abstractions that they understand, so that you don’t have to have 17 different people from three different teams involved just to replicate a problem.

Option 1: The Warehouse

The solution to taming this complexity can be, well, you can put everything in a warehouse. You can put everything in a system that basically controls all these layers. That is going to make your combinatorial explosion basically disappear if you have one provider and you have good SQL capabilities, but we all know that there are many cons in having only a data warehouse as the center of the data infrastructure. It’s very costly. There’s a lock-in because you have to pour data into the warehouse. Historically, warehouses are not great in multi-language experiences, so data science, machine learning, everything that is around Python is not going to be great.

Option 2: The Lakehouse

The alternative, of course, is that we do the lakehouse. We are going to have the possibility of plugging in multi-language capabilities, with different runtimes supporting different languages, and we can have cheaper compute at scale and much more granular control over costs once we have implemented the system. The con here is that it’s really complicated to operate most of the time. Some of the layers that we’ve seen in the previous slides will be scattered across different systems, and it’s unclear whether a certain persona in your organization will be able to operate all these systems at once.

Reproducing Data Pipelines is Hard

So, reproducing data pipelines is hard. It is a problem. And what we want to do here is to break the problem down into sub-problems so that it becomes easier to solve. There’s a hierarchy of abstractions in general. At the lowest level is the data layer: how data are stored and organized. Then there’s going to be a compute layer: how you run your computation and the environments in which the computation is run. And then there’s a code layer, which is usually the interface at which most of your developers are going to think: the code they write, and how the code is managed across different versions and across the CI/CD pipeline.

One possibility on the data lakehouse is that we can build upon open formats and go through this hierarchy one level at a time. We can build on object storage, and that has the advantage of being cheap and virtually infinitely scalable. Then we want to organize the files in open formats that are easy to understand and widely supported, like Parquet. But fundamentally, we’re going to want these files conceptualized as tables, because that is a much, much better user interface for humans. So we’re going to use Iceberg for this, because it’s the fastest-growing project and has the most vibrant open source community, and I’m sure many people here are going to be very familiar with Iceberg, since Dremio is one of the main contributors to the project. Essentially, what Iceberg is going to give us is the possibility of thinking in tables, support for large big data operations over tables, and one specific capability that we’re going to use a lot, which is time travel. We’re going to be able to version our tables and go back in time.

The tables are then organized in a data catalog, and we chose Nessie for a very specific reason. We’re going to talk about that in a second, but for now, just keep in mind that our data catalog is Nessie, the one developed by Dremio. Then, as far as the code and the compute layers go, the compute layer is going to be built on top of commodity compute, so something easy that anybody can understand, like EC2s. Every environment is containerized. We have our own version of Docker. It doesn’t matter that much right here. What is important is that you have functions that are containerized, so cloud executors can run these functions as isolated functions, and you will be able to reproduce each environment that runs every node of a pipeline. Your code is going to be managed by traditional code versioning tools like GitHub or any Git-based tool. The interface that your developers will adopt is an IDE, okay?

Now, with these elements here, plus Nessie, why is Nessie so important? Nessie is important because it’s frankly a very cool abstraction over the data lake. It’s the only one that provides you a way to conceptualize versioning of your data in terms of multi-table transactions, so it’s very intuitive and very Git-like. You have the possibility of branching your entire data catalog and thinking in terms of, “Oh, I’m building a branch of my pipeline,” and it can support more complicated use cases than just versioning single tables. This is the main reason why we chose Nessie: it has this higher abstraction over Iceberg tables that makes a lot of sense when you have to do reproducibility, or when you have to develop something new and insert it into a mature CI/CD practice.
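To make the Git-like idea concrete, here is a minimal in-memory sketch of a catalog where a branch is just a set of pointers from table names to snapshot ids. It is a toy model under stated assumptions, not the Nessie API: the `ToyCatalog` class, its method names, and the table and snapshot names are all hypothetical, and the merge is deliberately naive (the source branch wins, with no conflict detection).

```python
class ToyCatalog:
    """In-memory stand-in for a Git-like data catalog (not the Nessie API)."""

    def __init__(self):
        # Each branch maps table name -> snapshot id (a pointer, not the data).
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # Branching copies pointers only, so it is cheap whatever the data size.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, table, snapshot_id):
        # A write moves one table pointer on one branch; other branches are untouched.
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        # All tables changed on the source become visible on the target in one step.
        self.branches[target] = {**self.branches[target], **self.branches[source]}


catalog = ToyCatalog()
catalog.write("main", "trips", "snap-1")
catalog.write("main", "zones", "snap-1")

catalog.create_branch("dev")
catalog.write("dev", "trips", "snap-2")           # update an existing table
catalog.write("dev", "forecast_trips", "snap-1")  # add a new table

main_before = dict(catalog.branches["main"])      # main is unchanged so far
catalog.merge("dev")
print(main_before)
print(catalog.branches["main"])
```

The point of the sketch is that both table changes on `dev` stay invisible to `main` until the merge, and then appear together, which is what distinguishes catalog-level branching from versioning tables one at a time.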

Reproducing Production Data

When you put all these things together, this is great, because essentially we have the possibility of reproducing the environment thanks to containerization. We have the possibility of reproducing production data because we can branch the entire data catalog, or a subset of it, across multiple tables at once, and we now have our code versioned in Git. We have all the elements to be able to branch an entire data application while we develop, and we can embed that in whatever CI/CD practices you have. Going back to our example about reproducibility, when something goes wrong, we can go back to our reproducibility checklist and see that we can check all the boxes right now. We have the same code because we have code versioning. We have the same source data because we have time travel, and we have this nice abstraction in terms of branches of the entire data lake. We have the same environments because we have Dockerized immutable functions, and hopefully we’re going to have the same hardware, as we’re going to run in the cloud on commodity hardware.

This basically gives us the possibility of addressing the main problem that I introduced as the reproducibility of pipelines. But the problem is that we want to be sure that this is also very easy to use. All these pieces that I presented should be in place to ensure that all the boxes are checked, but it’s not enough for one single developer to be able to do his job easily. It doesn’t necessarily mean that what we have here is easy to operate. We need to nail the right level of abstractions, and this is particularly important, especially if we want to empower data scientists, machine learning engineers, people who are good at writing Python and SQL business logic because they understand the business problem, but they’re not necessarily super well-versed in questions about how data is stored, serialized, de-serialized, compressed, moved around, or how compute is provisioned or configured, or how environments are managed in the prod environment, and so on.

Usually a data scientist will not care too much about the compute and data layers. They might grudgingly have to learn something about them, but ideally, they conceptualize their data pipeline as a chain of functional transformations, from one function to the next. My code is one function that takes some form of a table as an input and spits out a table as an output. Like in this diagram here, we can think of these functions as either SQL queries, which can query an Iceberg table and write another Iceberg table back into the data lake, or Python functions, which have more components from the environment point of view. A Python function has packages and dependencies, takes an Iceberg table as an input, potentially transforms it into a pandas data frame or some other kind of data frame artifact, and spits out something that we can then write back into our data lake. But the bottom line is that the developer really doesn’t think in terms of what these different layers have to do when they talk to each other. The code is what matters, and the best abstraction that they can think of is the function: a single function as a single node of their pipelines.
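The table-in, table-out view of a pipeline can be sketched in a few lines of plain Python. This is only an illustration of the abstraction, not Bauplan’s programming model: a table is represented as a list of row dictionaries standing in for an Iceberg table, and the function names, column names, and `run_pipeline` helper are hypothetical.

```python
from typing import Callable

Table = list[dict]  # a table as a list of rows; a stand-in for an Iceberg table


def clean_trips(trips: Table) -> Table:
    """Node 1: drop rows with a non-positive distance."""
    return [row for row in trips if row["miles"] > 0]


def trips_per_zone(trips: Table) -> Table:
    """Node 2: aggregate the cleaned trips by pickup zone."""
    counts: dict[str, int] = {}
    for row in trips:
        counts[row["zone"]] = counts.get(row["zone"], 0) + 1
    return [{"zone": zone, "trips": n} for zone, n in sorted(counts.items())]


def run_pipeline(source: Table, nodes: list[Callable[[Table], Table]]) -> Table:
    """Run the nodes in order, feeding each output table to the next node."""
    table = source
    for node in nodes:
        table = node(table)
    return table


raw = [
    {"zone": "A", "miles": 2.5},
    {"zone": "B", "miles": 0.0},
    {"zone": "A", "miles": 1.0},
]
result = run_pipeline(raw, [clean_trips, trips_per_zone])
print(result)
```

Nothing in the two node functions mentions storage, serialization, or compute, which is the level at which the developer in this description wants to work.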

Live Demo

So when it comes to building this way of interacting with the system, we want the data scientists to interact with it seamlessly and not have to deal with all these lower-level layers. Now, I prepared a bunch of slides on this, but I thought that it might actually be easier if I show you something live, okay? This is the abstraction that we built on top of Nessie and on top of Iceberg to make sure that developers can interact with the system easily, without having to know anything about all the different layers.

Let’s say that I’m a developer and I want to interact with my data catalog, and I can use — I’m going to use the CLI as the first entry point. So we can do an inspection of what branches of our data catalog are open because our data catalog is built on Nessie, and we build APIs to interact directly with your data catalog. What it’s going to look like now for me to develop something new is going to look like this. We’re going to basically create a new branch and we’re going to check out into that branch. And now we can see that we are in the new branch that we created, okay? We can inspect this branch if we want and see what tables are inside. This is what we see. These are the tables that we have in our data catalog right here. We want to see one single table and see what this looks like. We can just query that. We can just get an inspection of the metadata and schema, okay?

Now that I am in my development branch, our namespaced sandbox, I can run, for instance, a DAG that I prepared as a sequence of functions in Python. This is what the DAG is going to look like. It doesn’t matter too much at this moment how you’re going to do this, whether you want to do it with an orchestrator or use a system like ours that abstracts away a lot of things. In this case, specifically, we also decided to abstract away the containerization piece, because we don’t want the developer to deal too much with the compute layer. If we now just do run, we can run this entire DAG in the branch that we created and have all these functions run in the cloud one by one as containerized immutable functions. I will always be able to go back and see which environment was provisioned for which function, which runtime ran which functions, and so on. We just ran the entire pipeline right now in the cloud. This is what happens when you do run. You basically get your DAG, and then you run the functions one by one as tasks.

Now, while doing this, we have basically written something new in our branch, some table that is not in our production data catalog. In fact, I can now inspect my branch and see that there is one new table that is not in the main data catalog. It’s this table here, forecast trips, which is a table that is part of my DAG right here. It’s produced by one of the functions that I executed, which, in this case, wrote a new Iceberg table into a branch of the Nessie catalog. If I now want to bring this table into my main catalog, what I can do is just check out my main branch and then do merge. Now we will have this very table that we just created in our branch back in our main catalog, and we can inspect it like this. Done. We just brought a new table into our main catalog. At no point did the developer really have to deal with any of this. They never had to deal with a compute layer. They never had to talk about the data layer. They don’t know how data are moved around. They don’t know how data are stored. All that is completely abstracted away. All they need to know is a number of API calls that they can use to interact with the system.

A Composable and Programmable System

This opens up a bunch of possibilities now, because we can do complicated CI/CD and automated flows where we get data into an S3 bucket and use the capabilities of Nessie and this very powerful abstraction over branches to create a branch, run tests, run whatever arbitrary computation we want in our runtime, and merge it back into our data catalog. Then, because the data catalog that Dremio provides is open, we can also have those tables communicate with other runtimes, fulfilling the dream of the data lakehouse as an open system that is fully composable. We can script all this. We can make this flow automatic if we want, using tasks in an orchestrator such as Airflow or Prefect.
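The branch, run, test, and merge flow described above can be sketched as a single function. This is a minimal sketch assuming a toy catalog represented as a plain dictionary of branch name to tables; it does not use the Nessie or Bauplan APIs, and `run_on_branch`, `forecast`, `not_empty`, and the table names are hypothetical.

```python
def run_on_branch(catalog, branch, transform, checks):
    """Branch, run, test, and merge on a toy catalog: branch -> {table: rows}."""
    # 1. Branch: copy main's table pointers into an isolated sandbox.
    catalog[branch] = dict(catalog["main"])
    # 2. Run: the transformation only ever writes to the branch.
    catalog[branch].update(transform(catalog[branch]))
    # 3. Test: every check must pass on the branch contents.
    ok = all(check(catalog[branch]) for check in checks)
    # 4. Merge on success; either way, the branch is discarded afterwards.
    if ok:
        catalog["main"] = catalog[branch]
    del catalog[branch]
    return ok


def forecast(tables):
    """Hypothetical transformation: derive a new table from an existing one."""
    total = sum(row["trips"] for row in tables["trips_per_zone"])
    return {"forecast_trips": [{"expected_trips": total * 2}]}


def not_empty(tables):
    """Hypothetical data test: the new table must contain at least one row."""
    return len(tables["forecast_trips"]) > 0


catalog = {"main": {"trips_per_zone": [{"zone": "A", "trips": 2}]}}

merged = run_on_branch(catalog, "ci-run", forecast, [not_empty])
rejected = run_on_branch(catalog, "ci-bad", lambda t: {"broken": []}, [lambda t: False])

print(merged, rejected, sorted(catalog["main"]))
```

In a real setup each numbered step would be a task in an orchestrator such as Airflow or Prefect; the design point is that a failed check leaves the main branch untouched, because nothing is written there before the merge.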

This is the vision that we have about interacting with data in a data lakehouse. We’re very excited about the composability, but we really believe that developer experience should be a first-class citizen around that. By providing simple abstractions and not asking developers to move away from the place where they develop most comfortably, namely the IDE, we can have a seamless way to interact with all the different components and hide all the complexity behind the right abstractions.

We started building around this concept a while ago, almost a year ago, and we are very active in the open source community. We publish, we evangelize, we do a lot of open source and open science. Some of the work that you see here will be presented at SIGMOD this year. We also have a position paper, presented at VLDB last year, that describes the general vision of how we want to build the developer experience on top of platforms like Dremio. We’re super happy to talk, and we’re super easy to reach at any time if you want to talk about what we’re building, about open formats and how the data landscape is evolving, or just chat. Don’t be shy and reach out at any moment. Thank you very much.