March 1, 2023
11:45 am - 12:15 pm PST
DataOps in Action with Nessie, Iceberg, and Great Expectations
Sign up to watch all Subsurface 2023 sessions
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
I’m Antonio, I’m a Data Architect at Agile Lab. The purpose of Agile Lab, and for my professional purpose for the time being, is to elevate the data engineering game. This guy over there instead is Osvaldo, he’s my dog. He is not into data engineering yet, but is over hearing a lot of meetings, so maybe in a while he’ll join our company at Agile Lab.
Transactional Data Pipelines
Let’s say you have built your internal data platform and you are at the top of your game of data engineering. You’ve probably built transactional data pipelines, single table transactional data pipelines, using the processing engine of your choice. Sitting on top of an open table format that can be Iceberg, and today is going to be a lot about Iceberg. But it can be also Delta Lake or Hudi for the sake of this presentation. On top of that, you can have any engine that supports the table format you chose.
On top of that, you have built data quality checks. Again, the tool of your choice that integrates well with your engine or Iceberg, or both. And common choices are Great Expectation and Monte Carlo, but you can build those to your own SQL query or your Spark Jobs, whatever works for you. That is able to check that the quality of your data is how you want it to be.
For sure in order to run those data pipelines, you’ll need some orchestration tool. That can be Airflow, Prefect, Dagster, whatever works. It’s the important thing is you are able to create DAGs orchestrate different jobs. The cherry on top is to add Data Lineage. To be able to programmatically know how data is being produced inside your data platform. This can be achieved using almost any data catalog like DataHub, Apache Atlas, or there are other products. Or you can still use a tool, but base that on an open standard, like Open Linkage.
Data Quality DAG
You have all these tools and one of your pipelines looks like this. You have some IoT devices that are sending data to an object storage. From there, you have your first ingestion job. Let’s say it’s a Spark Job that loads this data into an Iceberg table. From the table, other jobs are consuming it and producing higher value data. We’ve talked about data quality, and data quality usually is implemented like this. The same table as before, you periodically run some of these data quality jobs that are able to give you a green or a red light in terms of quality of your data.
Let’s say it’s a day like any other, so everything’s fine, the quality is at point, pipelines are running smoothly. At some point you may have some unexpected behavior on your devices maybe. And these devices start sending data that has never been seen before. This weird data uncovers a bug on your ingestion pipeline, and this bug leads to some bad data in your first table. In a glimpse, you get bad data all over your pipeline, and this can be actually understood by your data quality jobs depending on tables, or it can be even more subtle and have some data quality checks, not even catch the problem. It’s even worse because if you don’t have Lineage, you don’t even know that you are affected because your checks on the last table are not, let’s say, comprehensive enough.
At this point, you are a datastrophy happened to you, and you have to choose either to take decision based on bad data that you know, it’s not at the quality you want it to be or take decision without using any data, like with gut feeling. After that you realize that and then you have to fix the problem.
How the remediation looks like? You have to track downstream processes and you can use Lineage for that. You have to stop them. You have to fix the code. Fix the data pipeline so that you don’t let any bad record inside your Data Lake or your Lake House. At the end, apart from fixing the pipeline, you also have to revert data to a good state. Just before bad data enter there, replay the input. It’s quite cumbersome. It’s a complex process. It looks like four steps. Usually takes days or weeks to have this in place for any single bug that corrupts in some way your data.
Avoid Shipping Bugs in Production
We may ask yourself the question, how do we avoid shipping bugs in production? Because bad data looks a lot like bad software, leads to bad outcomes in the end. And well, it’s easier, right? As we don’t deploy code that does not pass tests, we shouldn’t apply data that doesn’t pass tests and tests for us are data quality checks.
This is such an easy concept that it’s also a known pattern, that’s called, “Write-Audit-Publish-Pattern.” The problem is you have many challenges applying this pattern on Lakehouse architecture. One way of achieving Write-Audit-Publish is that you write to some, let’s say, folder to some temporary table, staging table. Then you audit that table, and finally you rename the table or rename the folder so that it matches the one that your consumers usually read from. But this is not easy at all because you have no atomic renames on object storage, which are the foundation of your Lakehouse.
Second, things that you probably are not able to do, instead of writing data, moving it, and then moving it after auditing, you could write, audit, and then copy. But while data is usually big in this cases, you cannot afford to do it twice, might be because it takes too much time or because it costs too much money. Also, it’s not that you cannot perform Write-Audit-Publish on the fly, unless you limit yourself to one single processing engine.This means that both the processing logic and the data quality checks should be implemented on the same process.
How do we solve this on the Lakehouse architecture? Well, the year of to date is a monster and it’s a Lochness monster. Here I present to you Nessie. Nessie is a technology that brings git-like features into your Data Lake. You can think about Nessie as a Metastore on steroids. It holds metadata about your tables the same way good old Hive Metastore does, but also implements some git features on top of it. So, you have versioning, branching, merging and tagging. Here I propose some way of dealing with this Write-Audit-Published-Pattern using Nessie. You will see it’s really, really simple. I also have a demo about this. So let’s say our workflow looks like this.
First of all, we create a new branch, let’s call it the staging branch. Then we write the new data on, this a staging branch. Finally, we validate this data using our data quality checks. And then if everything went well and the data quality checks are passed, you can merge the branch and finally delete it. Otherwise, if something goes bad in your data quality checks, you can send them out to someone that will check out what’s happening.
Demo of Nessie
Let’s go ahead with the demo. Let’s say we have a table, a very simple table four columns. You have an ID, a name, a surname and a tax ID. If we go to Nessie, we already created this table, as you can see, we can see the customer table, and we can also look at this committed history and we can see there is only one commit.
It’s when I created the table before starting the presentation. Right now, as I said, we go ahead and create our branch. We do that through Nessie CLI. We could do that from a Java client if we want to. It’s only a metadata operation. So as you can see, it’s very, very fast. Already done. And we go back here, we refresh, and we can see where our staging branch, which is exactly the same as the main branch.
Next step, we are in this situation, we created the branch, no commits yet. The two branches main and staging branch are the same. Next step we update our data. This is a very simple job. We will just insert some records based on the ID.
This is a Spark Job because I’m lazy and it’s fast for me to write Spark Jobs, but can be anything that works with Nessie. We can see before the update, we had zero rows. Now we have three rows of three customers. Next step is to run the expectation. Let’s look a moment into Nessie, the main branch is still untouched. One commit staging branch as two commits, and this second commit added those three rows. Next step, apply data quality checks. I used Great Expectation, and I can run Great Expectation through Spark. Great Expectation is cool because data quality checks are the clarity. I can just write this JSON, like notation and the engine so Great Expectation will pick these rules and apply using the engine, the processing engine. I chose Spark in my case, and I just say, I want to add these four columns.
I want the ID to never be null, and I want the text ID column to be unique. So, let’s go add, this is the Spark job, the Pi Spark Job that implements that. It’s really simple, I just load those JSON files, load my table at some given reference. So, at this reference is the branch, and I just run them. I co-validate and I get some results. If everything goes well, I print success, otherwise I exit with something different from zero. Let’s run data quality and we hope everything went well. We have a print success. Data quality was fine. Next step, let’s merge. Let’s merge our branch into the main branch. This is like, as you can see, very, very fast. If we go back to Nessie, we can see that the staging branch has two commits and now also the main branch is to commit. We are in this situation, we merged now finally, we can just delete our branch.
So far so good. Another day, another run of the pipeline. Let’s go ahead create a new branch and let’s update our data set with some wrong data, and I’m doing it on purpose, obviously. As we can see in a while, we will see that we have now four rows. We have another customer, Albertino Einstenino, which has exactly the same tax ID as Albert Einstein. So, if you remember, our data quality checks is a problem. If we are on our data quality checks on our staging branch, we’ll get a failure hopefully.
Here it is. So success falls into failure, it was exciting with one and we can also look here and say that the problem was that there were two albi, two tax IDs that are the same. That’s it. We should send an email to someone that will check this branch is able to. So I’ve never tried this, but no, I won’t try. But someone can check out that branch, see the data, what happened, why there are two customers with the same tax ID, and think about what to do next. But we never merged the branch. So this data was never exposed to our consumers.
What Are the Takeaways Using Nessie?
What are the takeaways? Using Nessie and Iceberg, we can de-couple processing and audit. We are able to never expose bad quality data. And also we are able to reverse changes if they ever happen. This supports all kinds of right modes that also Iceberg supports. It’s both update up and override and delete. This looks like nothing today, but if you’re in big data since the last decade, you will know that these things that we give for granted are, were not granted like three years ago. You can say, sure, but last year at this very conference, I’ve seen a talk about Sam that says that you can achieve the same simply using Iceberg. Yeah, you’re right, without Nessie you can do kind of the same, you can achieve the same result, but it’s more complex. Also, we just scratched the surface of what Nessie can do. This was a very simple example, but with Nessie, we can implement a lot of other workflows to match our needs. One of them is feature testing with diverging branches. We will go deeper on that later. You can also update atomically and consistently multiple tables. Something that with Iceberg alone, you are not able to do. You have transactionality on a single table.
You can do that on multi-step complex data pipelines because you have this very powerful concept of branches that you can use to actually do whatever you want. When you are good, when you are safe, you can merge it back to the main branch and leave it to your consumers.
Feature testing. Let’s say you have a new version of a data pipeline. This new version shouldn’t provide any functional change, just performance and improvement. You want to be really, really sure that it behaves exactly the same as the prior one. We know that you can test this in the DEM environment, QA environment, but data is really unique to the real data. Like you can find patterns in the production data that are not there on QA. How do you do it? Well, with Nessie it’s quite simple because you just create a new branch from the main branch.
You keep running the old pipeline on the main branch, and you run the new pipeline on the just created branch. You run for one day, for two days, for a month, whatever you want. When you are ready, you just check that the two branches are exactly the same. If they are, you’re good, you can merge to the main branch or just switch scheduling. But you are 100% sure that the two pipelines are exactly the same. Well we’ve scratched only the surface of the Iceberg that Nessie is. Let me talk a bit about the downsides.
Downsides of Nessie
Nessie is quite a new technology and I was very delighted to use it so far, but I have to point out some things that can be better and surely will be. First of all, authorization is still really primitive. It’s a lot static. You have to restart the Nessie server when you want to change authorization rules. So it’s a bit inconvenient. Also, and this is the biggest functional problem is that you have one Data Lake. One git history, one Nessie history for one Nessie instance. This means it’s like working with a mono-repo. You have all your tables that are using Nessie on the same history. If you have tables that are not linked to each other, this is a bit inconvenient because rivers are complex. Right now it supports only Iceberg and Delta Lake. So, no plain Parquet tables, no CSB files, no Hudi. You can use Nessie only for these new tables that you are working with on these table formats.
Since it’s not a general purpose Metastore, you will still need to run your Hive Metastore or Glue Metastore or whatever Metastore of your choice on the side of Nessie. For all the tables that Nessie doesn’t support, so it adds to your team a new piece of infrastructure to take care of, to deal with, and it’s widely stable. I’ve never had a problem in production with Nessie so far. I cross fingers right now, but still it’s another piece of infrastructure.
That was it. I thank you guys for attending this talk and I’m open to any question. You can find me on LinkedIn, on Twitter, and I will publish all the demo more and slides on GitHub.