A Git-Like Experience for Data Lakes
Since co-founding Dremio, I’ve strongly believed in the vision of loosely coupled data infrastructure and the power of OSS to achieve that vision. It’s driven much of what Dremio is and has become. For example, on September 12, 2015, I sent an email to my co-founder suggesting that we create a new Apache project code-named RootZero. Six months later, in collaboration with Wes McKinney and a dozen-plus other open source contributors, we announced Apache Arrow. It didn’t take long for people to take notice, and Arrow has become more successful than we could have hoped. It is now downloaded more than 10 million times a month, powers the core of Dremio, is embedded in more than two dozen other technologies, and has led to the creation of additional powerful technologies such as Arrow Flight and Gandiva.
Files and Directories Don’t Make a Data Warehouse
Like many in the data lake space, I come from a data warehousing background. Data warehouses are mature and powerful, but they are also closed systems that quickly become expensive and constrain your technology capabilities. Data lakes are cheap and easy. One simply has to drop a few files in an S3 bucket and then run one of many available tools to get a SQL experience on that data. It’s a great start and solves simple use cases.
As you become more advanced, you start to need more advanced capabilities. Things that are naturally built into traditional closed data warehouses weren’t easy on S3. Something as simple as atomic visibility of new files or renaming a column were surprisingly difficult. On-prem native solutions tried to extend into the cloud to solve some of these problems. Examples include Hadoop’s S3Guard, Hive’s ACID & Metastore and AWS Glue Catalog. These solutions were adopted because they addressed, at least to some extent, a serious problem. But in each case, we saw feature sets that were still short of a data warehouse while in many cases also creating a more siloed data solution; exactly what we were trying to avoid by adopting an open data lake architecture. Each of these solutions also inhibited query performance, something we care a lot about at Dremio.
Managing the Pain: Modern Table Formats
Luckily, multiple people stepped up to provide new open standards to solve many of these problems. In the last year, we’ve seen the rise of Apache Iceberg and Delta Lake as newer approaches to table management. These designs are built specifically around an open approach to table management and are cloud-first by design. They solve a lot of the pain points associated with older technologies. We’ve evaluated different ways to expose this functionality in Dremio since they were first created. With these libraries, you gain several key benefits. Top among these is:
- Powerful schema evolution
- Table versioning and history
- Improved planning/query start times
These technologies target multiple tools and are released as open source. (NOTE: It’s important to note that the OSS approach to Iceberg and Delta Lake are very different. OSS Delta Lake is a stripped-down derivative of what Databricks provides in their proprietary product and most development is done in private. Delta Lake also has weak first-class support for tools outside of Spark. On the flipside, Apache Iceberg is developed in the open with strong integrations for a much larger number of technologies (first-class examples include Hive and Flink).) In many ways, they continue the tradition of offering data lake functionality that is similar to what traditional data warehouses provide (with the inherent scale, cost and flexibility advantages of the lake). But these technologies still fail to provide a couple of key capabilities. Most significant among them is the absence of cross-table consistency and transactions.
As we started to think about how to solve these problems, we realized that while traditional data warehouses had concepts like START TRANSACTION, COMMIT and ROLLBACK, those ideas were built for closed single-kernel systems like databases. They are necessary in a SQL context (and will be available in Dremio soon), but they’re not sufficient for today’s world. For example, once you’re in a situation where moving from state A to B requires running 4 Spark jobs, two Dremio reflection refreshes and a PyArrow job, you start to need something more flexible. The problem at hand is the scale and scope of a transaction. In essence, how do you create loosely coupled transactions that match an open data lake approach? (Isn’t that an oxymoron?)
Loosely Coupled Transactions: Branches to the Rescue
As I thought more about the problem, I started to realize that Git actually provides a reasonable and widely adopted model for exactly that. A set of distributed applications can each perform independent transactions and Git provides the semantics to safely and effectively merge them into a composite record of history. Large transactions (branch merges) are composed of many separate small transactions (commits). At various points, one can take and layer these transactions together, reviewing the explanation for each operation and possibly filtering what is combined. In addition, you can travel back in time with transactional consistency, and it’s easy to understand both how you arrived at this point and what exactly has changed. (Try doing the same with a traditional database oplog.)
Introducing Project Nessie
I’m extremely pleased to share with you a new OSS project, Project Nessie. We’ve spent over a year creating Nessie, driven by the combination of two powerful ideas:
- Cross-table transactions for a data lake
- A Git-like experience for tables and views
Nessie builds on top of and integrates with Apache Iceberg, Delta Lake and Hive. It was designed from day one to run at massive scale in the cloud, supporting millions of tables referencing exabytes of data with 1000s of operations per second. You can branch your data to perform loosely coupled transactions, tag important historical points and work in entirely new ways. You also get all the classic data warehousing transactional concepts including three levels of isolation across all your tables and the classic START…COMMIT primitives.
Like Apache Arrow, we’ve decided to build Project Nessie as an Apache-licensed OSS project. Nessie already provides integration into technologies like Hive, Spark and any tool that supports Apache Iceberg’s pluggable catalog interface. Our goal is for Nessie to support all major data tools and engines. We expect that Nessie will be as prevalent in table management as Arrow is in data processing.
Now that we’ve announced the project, we’ll be working to integrate it into even more tools such as Airflow, custom Athena metadata lambda, etc. We’ll also work with the community to develop new patterns around management and security. (What do you think the permissioning model for a branch + table combination should be? Should ACLs be versioned like tables and views?) In the coming months we’ll also talk more about some of the great ways the Dremio product will leverage Nessie.
As an OSS project, we decided to announce the project early in order to build a strong community of contributors from many different companies and backgrounds. This worked well for Apache Arrow, where we have now seen contributions from nearly 500 unique contributors. To help with this, we’ve created a Nessie Slack channel (email firstname.lastname@example.org for an invite) and Google group, and all development will be done in the open on GitHub. We invite you to join the community. All contributions are welcome and contributions aren’t limited to code. Contribution of documentation, examples, design review, testing and critical feedback are also all welcome.
Git-Like Was Always in the Cards
When we started building Dremio, we talked extensively about how we could possibly use Git as a backing store. As a place to layer logical definitions on top of each other, Dremio users frequently build up complex hierarchies of view definitions. Again and again, we’d talk about how we could have versioning capabilities in your Data Lake for virtual and physical dataset management like those we all enjoy when working in software development. My co-founder would constantly argue how great it would be and I’d always agree, but also identify that Git can’t support the performance and concurrency requirements that we’d need for a system like Dremio. As such, we settled for the best that we could do — since launching the product, we’ve always maintained a user’s historical set of virtual datasets in our dataset history feature. It wasn’t Git but it was as good as we could do at that moment.
As new table formats such as Iceberg and Delta Lake developed, I started to think back to our original goals of providing a Git-like experience for data and the need for loosely coupled transactions. Was there a way to take these new modern table formats and layer a Git-like experience on top? I started exploring this, hoping that along with the rise of the table formats was an improvement in Git concurrency. Late last year, a small team at Dremio started exploring what it would mean to build a new type of technology on top of these modern table formats. We started by evaluating how we could use Git as a backing store, providing a versioning scheme for all your data assets. What we found is what I’d worried about earlier: a Git transaction (Git push) takes an average of 3-5s to complete on most hosted Git providers (GitHub, Azure Repos, GCP source repositories, etc.). Long tail latencies were worse, with 30s being surprisingly common. Unfortunately, that was probably at least two orders of magnitude too slow to provide a viable solution for the data use cases our customers see.
Narrowing Options and a Breakthrough
One of the key things I’ve learned in engineering is that sometimes a problem is hard because there are too many possibilities. With this type of performance gap, I think we were struggling with just that issue. We only made progress as we started to tear apart the problem and start to narrow down the options. We started by looking at the Git protocol internally and how we would map it onto a highly available cloudstore. Given the prevalence of our customers’ use of AWS, we chose DynamoDB as the initial store. From there, we iterated over the Git commit algorithms, stripping away pack operations, controlling the tree layout, etc.
At this point, we finally came to the crux of the problem: if you assume that a conflict resolution requires a minimum of two round trips to a remote system (say DynamoDB) and assume that a roundtrip takes 5-10ms, you immediately find that you can’t exceed ~100 ops/s. It’s better than 5s/op but still far short of what we were targeting. It was then that we explored whether we could have DynamoDB itself maintain a linear history while also resolving most conflicts in a single operation. Given the tighter parameters of the problem, we were able to find a solution and that solution now lives inside the Nessie codebase as the versioned commit kernel. (I’ll write more in the future about the algorithms, including our use of a 151-way striped lock.) Needless to say, the new algorithm allowed us to beat our target (perf analysis post to follow). And the best part, the algorithm isn’t specific to DynamoDB.
Better Than a Data Warehouse, Not Just Cheaper
If there is one message you should take from the launch of Project Nessie, it’s this: A loosely coupled data lake can do things a data warehouse never could. When you embrace the capabilities of a modern data lake and extend it, rather than constrain it, you achieve far more flexibility in how you can work with data, and far greater productivity.
We’re growing! If you’re excited about Project Nessie and want to work on it, we have roles open for OSS developers (wherever you might be).