
Files and Directories Don’t Make a Data Warehouse
Like many in the data lake space, I come from a data warehousing background. Data warehouses are mature and powerful, but they are also closed systems that quickly become expensive and constrain your technology choices. Data lakes are cheap and easy: drop a few files in an S3 bucket, then run one of many available tools to get a SQL experience on that data. It’s a great start and solves simple use cases. As you become more advanced, though, you start to need more advanced capabilities. Things that come built into traditional, closed data warehouses weren’t easy on S3. Something as simple as atomic visibility of new files or renaming a column was surprisingly difficult. On-prem solutions tried to extend into the cloud to address some of these problems; examples include Hadoop’s S3Guard, Hive’s ACID tables and Metastore, and the AWS Glue Catalog. These solutions were adopted because they addressed, at least to some extent, a serious problem. But in each case, the feature set still fell short of a data warehouse while often creating a more siloed data solution, exactly what we were trying to avoid by adopting an open data lake architecture. Each of these solutions also inhibited query performance, something we care a lot about at Dremio.
Managing the Pain: Modern Table Formats
Luckily, multiple people stepped up to provide new open standards that solve many of these problems. In the last year, we’ve seen the rise of Apache Iceberg and Delta Lake as newer approaches to table management. These designs are built specifically around an open approach to table management and are cloud-first by design. They solve a lot of the pain points associated with older technologies. We’ve evaluated different ways to expose this functionality in Dremio since these projects were first created. With these libraries, you gain several key benefits. Top among these are:
- Powerful schema evolution (see the short example after this list)
- Table versioning and history
- Improved planning/query start times
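To make the schema evolution point concrete, here is a minimal sketch using the Apache Iceberg Java API. It assumes an already-configured Iceberg Catalog; the table and column names are illustrative, not from the post:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.types.Types;

public class SchemaEvolutionSketch {
  public static void evolve(Catalog catalog) {
    // Load an existing Iceberg table; "db.orders" is an illustrative name.
    Table table = catalog.loadTable(TableIdentifier.of("db", "orders"));

    // Rename a column and add a new optional column in a single metadata commit.
    // No data files are rewritten; only table metadata changes.
    table.updateSchema()
        .renameColumn("cust_id", "customer_id")
        .addColumn("discount", Types.DoubleType.get())
        .commit();
  }
}
```

Because the change is a metadata-only commit, readers pick it up atomically, and older snapshots keep the schema they were written with.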
Loosely Coupled Transactions: Branches to the Rescue
As I thought more about the problem, I started to realize that Git actually provides a reasonable and widely adopted model for exactly this. A set of distributed applications can each perform independent transactions, and Git provides the semantics to safely and effectively merge them into a composite record of history. Large transactions (branch merges) are composed of many separate small transactions (commits). At various points, one can layer these transactions together, reviewing the explanation for each operation and possibly filtering what is combined. In addition, you can travel back in time with transactional consistency, and it’s easy to understand both how you arrived at this point and what exactly has changed. (Try doing the same with a traditional database oplog.)
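Here is a toy, in-memory sketch of that model applied to tables. It is an illustration only, not Nessie’s design or Git’s internals, and every name in it is made up:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: a branch is a named pointer to a list of commits, and
// each commit maps table names to new snapshot locations.
public class BranchSketch {
  public record Commit(String message, Map<String, String> tableSnapshots) {}

  private final Map<String, List<Commit>> branches = new HashMap<>();

  public BranchSketch() {
    branches.put("main", new ArrayList<>());
  }

  // Branching is cheap: copy the history pointer, not the data files.
  public void createBranch(String name, String from) {
    branches.put(name, new ArrayList<>(branches.get(from)));
  }

  // Each writer commits independently to its own branch (a small transaction).
  public void commit(String branch, String message, Map<String, String> changes) {
    branches.get(branch).add(new Commit(message, changes));
  }

  // Merging layers the branch's commits onto the target, so all of its
  // changes, possibly spanning many tables, become visible at once
  // (a large transaction composed of small ones).
  public void merge(String from, String into) {
    List<Commit> target = branches.get(into);
    for (Commit c : branches.get(from)) {
      if (!target.contains(c)) {
        target.add(c);
      }
    }
  }
}
```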
Introducing Project Nessie
I’m extremely pleased to share with you a new OSS project, Project Nessie. We’ve spent over a year creating Nessie, driven by the combination of two powerful ideas:
- Cross-table transactions for a data lake (sketched in the example below)
- A Git-like experience for tables and views
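Concretely, the cross-table idea means a pipeline can stage changes to several tables on a branch and publish them with a single merge. A usage sketch against the toy model above; the table names and file locations are illustrative:

```java
import java.util.Map;

public class CrossTableTransactionExample {
  public static void main(String[] args) {
    BranchSketch repo = new BranchSketch();

    // Stage a multi-table change on an isolated branch; readers of "main"
    // see none of it yet.
    repo.createBranch("etl-job-42", "main");
    repo.commit("etl-job-42", "rewrite orders partition",
        Map.of("db.orders", "s3://bucket/orders/metadata/v2.json"));
    repo.commit("etl-job-42", "refresh daily rollup to match",
        Map.of("db.orders_daily", "s3://bucket/orders_daily/metadata/v7.json"));

    // Publish both table changes in one step: a cross-table transaction.
    repo.merge("etl-job-42", "main");
  }
}
```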
Community First
As an OSS project, we decided to announce the project early in order to build a strong community of contributors from many different companies and backgrounds. This approach worked well for Apache Arrow, which has now seen contributions from nearly 500 unique contributors. To help with this, we’ve created a Nessie Slack channel (email [email protected] for an invite) and a Google group, and all development will be done in the open on GitHub. We invite you to join the community. All contributions are welcome, and they aren’t limited to code: documentation, examples, design review, testing, and critical feedback are all valuable as well.
Git-Like Was Always in the Cards
When we started building Dremio, we talked extensively about whether we could use Git as a backing store, a place to layer logical definitions on top of each other, since Dremio users frequently build up complex hierarchies of view definitions. Again and again, we’d talk about bringing versioning capabilities to physical and virtual dataset management in your data lake, like those we all enjoy when working in software development. My co-founder would constantly argue how great it would be, and I’d always agree, but also point out that Git couldn’t support the performance and concurrency requirements of a system like Dremio. So we settled for the best we could do at the time: since launching the product, we’ve always maintained a user’s historical set of virtual datasets in our dataset history feature. It wasn’t Git, but it was as good as we could do at that moment.
As new table formats such as Iceberg and Delta Lake developed, I started to think back to our original goals of providing a Git-like experience for data and the need for loosely coupled transactions. Was there a way to take these new modern table formats and layer a Git-like experience on top? I started exploring this, hoping that along with the rise of the table formats had come an improvement in Git concurrency. Late last year, a small team at Dremio started exploring what it would mean to build a new type of technology on top of these modern table formats. We started by evaluating how we could use Git as a backing store, providing a versioning scheme for all your data assets. What we found confirmed my earlier worry: a Git transaction (a Git push) takes an average of 3-5s to complete on most hosted Git providers (GitHub, Azure Repos, GCP source repositories, etc.). Long-tail latencies were worse, with 30s being surprisingly common. Unfortunately, that was probably at least two orders of magnitude too slow to be a viable solution for the data use cases our customers see.
Narrowing Options and a Breakthrough
One of the key things I’ve learned in engineering is that sometimes a problem is hard because there are too many possibilities. With this type of performance gap, I think we were struggling with just that issue. We only made progress once we started to tear apart the problem and narrow down the options. We began by looking at the Git protocol internals and how we could map them onto a highly available cloud store. Given how many of our customers run on AWS, we chose DynamoDB as the initial store. From there, we iterated over the Git commit algorithms, stripping away pack operations, controlling the tree layout, and so on. At this point, we finally came to the crux of the problem: if you assume that conflict resolution requires a minimum of two round trips to a remote system (say, DynamoDB) and that a round trip takes 5-10ms, you immediately find that you can’t exceed roughly 100 ops/s. That’s better than 5s per operation, but still far short of what we were targeting. It was then that we explored whether we could have DynamoDB itself maintain a linear history while also resolving most conflicts in a single operation. Given the tighter parameters of the problem, we were able to find a solution, and that solution now lives inside the Nessie codebase as the versioned commit kernel. (I’ll write more in the future about the algorithms, including our use of a 151-way striped lock.) Needless to say, the new algorithm allowed us to beat our target (perf analysis post to follow). And the best part: the algorithm isn’t specific to DynamoDB.
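To make the back-of-the-envelope math and the striping idea concrete, here is a generic sketch of a striped lock, not Nessie’s actual commit kernel: two round trips at roughly 5ms each cap a single serialized commit path at about 100 ops/s, but striping by key lets commits against unrelated keys proceed in parallel. The 151 simply echoes the stripe count mentioned above; everything else is illustrative:

```java
import java.util.concurrent.locks.ReentrantLock;

// Generic striped lock: commits that touch different stripes can run
// concurrently, so the two-round-trip cost is paid per stripe rather than
// once for the whole repository.
public class StripedCommitLock {
  private static final int STRIPES = 151; // stripe count from the post; the rest is illustrative
  private final ReentrantLock[] locks = new ReentrantLock[STRIPES];

  public StripedCommitLock() {
    for (int i = 0; i < STRIPES; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  private ReentrantLock stripeFor(String key) {
    // Math.floorMod keeps the index non-negative for any hash code.
    return locks[Math.floorMod(key.hashCode(), STRIPES)];
  }

  public void commit(String key, Runnable conflictResolutionRoundTrips) {
    ReentrantLock lock = stripeFor(key);
    lock.lock();
    try {
      // Per-stripe ceiling: 2 round trips x ~5ms each => ~10ms, or roughly
      // 100 commits/s on one stripe, while up to 151 stripes make progress
      // at the same time.
      conflictResolutionRoundTrips.run();
    } finally {
      lock.unlock();
    }
  }
}
```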
Better Than a Data Warehouse, Not Just Cheaper
If there is one message you should take from the launch of Project Nessie, it’s this: a loosely coupled data lake can do things a data warehouse never could. When you embrace the capabilities of a modern data lake and extend it rather than constrain it, you gain far more flexibility in how you work with data, and far greater productivity.
Join Us
We’re growing! If you’re excited about Project Nessie and want to work on it, we have roles open for OSS developers (wherever you might be).