The Apache Iceberg Advantage
An in-depth review of Apache Iceberg, an open table format for enterprise data lakes.
October 10, 2023
Join this session to learn how Dremio Arctic, a lakehouse management service, enables data teams to deliver a consistent and accurate view of their data lake with zero-copy clones of production data and multi-table transactions.
Organizations that want to leverage their data lake for insights often struggle to deliver a consistent, accurate, high-quality view of their data to all of their data consumers. That challenge is often exacerbated by the need to make changes to data that impact multiple tables.
In this webinar, we’ll share how data teams can use Dremio Arctic, a data lakehouse management service, to simplify data management and operations. Using Git for Data capabilities like branching, tagging, and commits, we’ll show how Dremio Arctic makes it easier than ever to:
Watch or listen on your favorite platform
Jeremiah Morrow is the Product Marketing Director for Apache Iceberg and data lakehouse management. He is responsible for evangelizing the value of open table formats, Git for Data, and Dremio as the Iceberg data lakehouse provider. He joined Dremio as a Partner Solution Marketing Director in February 2022. Over the past ten years he has worked in partner and industry marketing, analyst relations, sales, and business development for a number of companies in the technology industry, including Vertica, OVH, SoftwareONE, and Gartner.
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Hey, everybody! This is Alex Merced, and welcome to another episode of Gnarly Data Waves. Here in Episode 36, we'll be talking about how to simplify your lakehouse operations with zero-copy clones and multi-table transactions.
But before we get going there, I want to invite you to get hands-on with Dremio. There are many ways to try out Dremio's features. There are several self-managed options: you can use Docker, Azure, AWS, Google Cloud Platform, Kubernetes, YARN, or a standalone install on Linux––any of those work if you want to get started with Dremio software for free. If you want to get started with Dremio Cloud, you can create a free account in moments using single sign-on, like your Google, Microsoft, or GitHub account.
But if you want to try out Dremio Cloud before you create an account, you can go over to Test Drive, where you can get hands-on and get a feel for the Dremio Cloud platform first. Also, there's a new blog over at dremio.com/blog that [will] walk you through the steps of spinning up a Dremio Docker container on your laptop and trying out many of Dremio's features. It's a great way to get started with Iceberg and with Dremio all in one blog. So go check that out at dremio.com/blog.
Also, there's Apache Iceberg: The Definitive Guide, the book that Dipankar Mazumdar, Jason Hughes, Tomer Shiran, and I are all working on, to be released early next year by O'Reilly. Dipankar and I are going to be over at Data Day Texas doing an Iceberg AMA. If you want to get an early copy of the book, you can scan that QR code right there. Come down to Data Day Texas in January to join us for that event.
Now, speaking of events, Dremio will be at several events in the coming weeks, including dbt's Coalesce 2023 in San Diego, [on] October sixteenth through nineteenth. We'll be doing a Data Lakehouse Meetup in London on October seventeenth. We'll be at AWS re:Invent starting November 27th in Vegas, and we're gonna be at the Microsoft Azure + AI Conference [on] December fifth through seventh, in Orlando, Florida (my stomping grounds!)
So make sure to stop by the Dremio Booth at any of these events, pick up some cool swag, and learn a little bit more about the Lakehouse.
As you can see, at Dremio we've been putting out a lot of great content to help you in your data lakehouse world, with all kinds of new articles on the Dremio blog––you can find it at dremio.com/blog––including articles on how to set up Dremio on your laptop with Iceberg and Nessie, a deep dive into the architectures of Iceberg, Delta Lake, and Hudi, learning how to use Airbyte to ingest data into Iceberg, and so many other great articles walking you through how to do things with your data lakehouse, how to optimize things, and new features. Go check out dremio.com/blog––lots of great content. And of course, there's also the weekly show we have here, with so many great episodes on the way. Next week, we're gonna have NetApp come on and talk about how they were able to improve their customer experience with product analytics through the adoption of Dremio. And then, following that, we'll have Jacopo Tagliabue, founder of Bauplan, talking about how he's building a data science platform using Apache Iceberg and Nessie.
And then after that, we're gonna have a session on how to build an Iceberg data lakehouse with Fivetran and Dremio. So make sure you keep coming in every week. And with no further ado, we're gonna get to our feature presentation––Episode 36: How to Simplify Lakehouse Operations with Zero-Copy Clones and Multi-Table Transactions, with our guest Jeremiah Morrow, Product Marketing Director for Iceberg and Arctic here at Dremio.
Jeremiah, this stage is yours.
Hi, everyone. My name is Jeremiah, and I'm responsible for product marketing for Iceberg and Dremio Arctic. And today I'm going to be talking about simplifying Lakehouse operations with zero-copy clones and multi-table transactions.
Here's a quick agenda for this session. First, I'm going to level-set by talking about the journey from data lakes to data lakehouses. Then I'm going to do a quick review of what Dremio Arctic is. I'll talk about Git for Data and how you can create a zero-copy clone of your production data in just a few seconds. Then I'm going to walk through a use case of Git for Data, which is multi-table transactions. And finally, I'm going to do a quick demonstration of how that works in Dremio Cloud. Then we'll do Q&A. You don't have to wait until the end to submit your questions. Feel free to drop them in the Q&A tab at any time.
So first let's start by recapping approximately 40 years of data architecture history in 2 minutes. We're going to start a long time ago with data warehouses. And data warehouses were great at what they were initially designed to do, which is storing and analyzing structured data from business systems in the data center, and they could deliver BI and reporting on that data in a reasonable amount of time. But the EDW couldn't really keep up with the growth of data volumes or the variety of data and data sources.
And so companies turned to the data lake. The data lake was also really great at what it was initially built to do. It was really great as cheap and efficient storage for all of your data and for exploratory data science projects, but it never replaced the BI and reporting capabilities of the EDW. And so what we have in probably 99% of companies we talk to today is this cooperative architecture with a bunch of data lakes and a bunch of data warehouses doing basically what they've always done: you do BI and reporting in the data warehouse, and you do data science in the data lake.
And the challenge with this architecture is that more and more of our operational and customer data is going to land first in the data lake. Our BI and reporting users need access to that data, so when they do, we have to move it into the data warehouse. Data teams get a ton of data access requests, and those ETL and ELT processes become a bottleneck, where it can take days, weeks, or months to get access to an important data source for your report.
So Dremio has been a fantastic tool for doing two things. First, enabling instant access to data in the data lake and any other data source, so data consumers can get to the data they need faster. It also provides super-fast query performance on that data, so data consumers can get answers to the questions they're asking of their data much more quickly. And in fact, query engines on data lake storage have been the way to access data lake storage for quite some time now. But to move from a data lake, which is primarily for read-only queries, to [the] data lakehouse, which combines the flexibility and scalability of data lake storage with the full read-and-write functionality of the data warehouse, we need a few more components to effectively and efficiently manage that data. You're not gonna find too many companies today arguing against the value of file formats like Apache Parquet from a compression and performance standpoint.
And so the next layer on top of that, that companies are looking into now, is table formats. Table formats like Apache Iceberg build on that compression and performance capability of file formats, and bring a lot of the missing elements of a data warehouse to the data lake, including ACID transactions, versioning, time travel, and table and schema evolution, all of which enable the data team to transition more of those data warehouse workloads into the data lakehouse.
And the final piece of the lakehouse stack is a catalog that makes it easier than ever to manage your lakehouse, and for us, that's Dremio Arctic. So Dremio Arctic is a lakehouse management service. It sits on top of your data lake, and it provides data optimization by automating a lot of the capabilities of Iceberg. It provides governance and security, which is really table stakes for a data management solution today. And it gives data teams the ability to use this concept called Git for Data, which is Git-inspired versioning for your data lake that makes it really easy to deliver a consistent and accurate view of your data to all of your data consumers. And we'll go through a few of these aspects, piece by piece.
So Dremio Arctic is a modern Lakehouse catalog. As I mentioned, it's Iceberg native, so it's based on Project Nessie. It's tightly coupled with Apache Iceberg, and a lot of Arctic's capabilities are based on the Iceberg architecture. Very importantly, it's also accessible by multiple engines. So we firmly believe that the future is multi-engine, and data teams are going to choose the best tool for each workload. So we're committed to ensuring that our customers have that flexibility.
And from an access control standpoint––obviously, none of this works without security and governance. And so Dremio Arctic features fine-grained access controls and integration with existing user and group directories. From an automatic optimization standpoint, Arctic takes some of Iceberg's data management capabilities and automates them.
The first is compaction. So compaction rewrites smaller files into larger files on a set schedule to improve read performance. That's great for use cases that require streaming data or micro-batching data into the data lake where you're bringing lots of small files in, and today, that's a manual task for a lot of data lake owners.
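For reference, in engines that expose Iceberg's maintenance procedures directly (such as Spark with the Iceberg extensions), the compaction that Arctic automates looks roughly like this––the catalog and table names here are hypothetical:

```sql
-- Rewrite many small data files into fewer large ones to speed up reads.
-- Arctic runs this kind of maintenance on a schedule automatically.
CALL my_catalog.system.rewrite_data_files(
  table   => 'db.events',
  options => map('target-file-size-bytes', '134217728')  -- target ~128 MB files
);
```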
The second piece of this is the vacuum function or garbage collection, as it's sometimes called. And vacuum removes unused files on a set schedule to conserve storage space. So automatic optimization is all about automatically improving performance and also storage utilization.
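Likewise, the vacuum/garbage-collection step that Arctic schedules corresponds roughly to Iceberg's snapshot-expiration procedure in Spark––again, a sketch with assumed names:

```sql
-- Expire snapshots older than a cutoff and delete the data files
-- that no remaining snapshot references, reclaiming storage space.
CALL my_catalog.system.expire_snapshots(
  table       => 'db.events',
  older_than  => TIMESTAMP '2023-09-01 00:00:00',
  retain_last => 5  -- always keep the 5 most recent snapshots
);
```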
My favorite feature of Arctic is Git for Data. So the idea is this––GitHub transformed the way developers build and deliver software. Code branches, tags, and commits introduced things like CI/CD, version control, governance, and quality assurance, while also improving collaboration and accelerating delivery. And now, with Git for Data, you can do essentially the exact same thing with data products.
So in this graphic, I have my main branch of data, and that's the source of truth for my organization. All my data consumers hit that branch for their reports and their dashboards, so I want to make sure that view is consistent and accurate. If I want to make any changes to the data, I'd create a branch, which uses metadata pointers to create a view of the catalog at a specific point in time. So it's a clone; it's not actually a physical copy of the data. And within that branch, I can really do whatever I want. In this graphic, I have a data science branch at the bottom, so my data scientists can experiment. They can run scenario analyses. Really, they can do whatever they want with a clone of the production-quality data, without impacting production users.
And then my top branch is an ETL branch. This way, I can bring data into the branch, I can do my transformations, I can check it for quality. I can even run a dashboard against the branch data to make sure that I didn't break anything, and then I can merge it into the main branch. And only after I'm confident that it is a high quality view of the data that has the changes that I was intending to make, will my production users be exposed to those changes.
So these are a few of the benefits of managing your data with Git for Data. The first, obviously, is isolation. All of the work done on a branch happens without impacting other branches and other users, so my production view is safe, and none of those other users are impacted. The second is version control. I can very easily reproduce a view of the data at a specific point in time. And if a mistake does happen, all it takes to roll back that mistake is to point the main branch head to a previous commit––I can, in fact, roll back an entire catalog in just a matter of seconds. And finally, from a governance perspective, all of these changes are tracked and audited. You have a full history of what happens within a branch. And we have access controls to make sure that the right people can access data and make changes to that data.
So let's go through a common use case for zero-copy clones and tackle something that is traditionally a challenge within data management, especially in the data lake space––and that's multi-table transactions. So in this graphic, I have 3 tables. And I might have a data consumer who has a virtual data set, which is a join between these 3 tables. They might even have a dashboard that visualizes that virtual data set. So I might need to make an update to more than one of these tables at a time. But I don't want that data consumer to see a partial or incomplete view of the data––that is, a view of the data before all of those changes are made. So the solution with Dremio Arctic and Git for Data is to do all those updates in a separate branch.
So I can create a test branch, and then I can make changes and updates to the tables.
I can do a quality check, make sure I'm happy with those changes, and then I can merge to main. The changes are made atomically on the merge. So any of my production users who are accessing that joined virtual data set that I mentioned before won't see any of those changes until all of them are made on the merge. So every data consumer essentially has access to a consistent and accurate view of the data.
So now that I've talked through this graphic, I'm actually going to jump into Dremio Cloud and walk through these steps, so you can see how it works. Here I am in Dremio Cloud––and Dremio Cloud is two services today. The first is Dremio Sonar, which is our Arrow-based distributed SQL query engine and also our unified access layer. And then we have Dremio Arctic, which is our data lakehouse management service with Git for Data. I can do some of this work in either. Within Arctic, you can do a lot of this in a low-code/no-code sort of way. But because I want to run some queries, I'm actually going to hop into Sonar for this project.
So here on Dremio's home screen, you can see I have my data sources off to the left, including two Arctic catalogs. For this project, I'm going to choose Dremio test, and now I have a couple of folders. So let's hop into agents and customers. Within this folder I have 3 Iceberg tables with roughly a thousand rows apiece––customers, agents, and jobs. And then over here, if I click on my email address, you can see I have a virtual data set called jobs joined, which is a join between agents, customers, and jobs.
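A view like jobs joined might be defined along these lines––the join keys and column names here are assumptions, since the demo doesn't show the view's SQL:

```sql
-- Hypothetical definition of the "jobs joined" virtual data set
CREATE VIEW jobs_joined AS
SELECT j.job_id,
       c.customer_name,
       a.agent_name,
       j.job_date
FROM   jobs      j
JOIN   customers c ON j.customer_id = c.customer_id
JOIN   agents    a ON j.agent_id    = a.agent_id;
```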
So let's say one of my agents onboarded a new customer, and I need to update the customers table and also the jobs table. But I want anyone who's using the jobs joined view that I showed to see the updated tables only after I've made updates to both of them. So to do that, I'm going to create a branch of the main Arctic catalog. I'm gonna hop into my SQL runner here, and we're going to simply do a create-branch command. There we go. So I have created a branch, and if I click on this, you can see I have my new branch here in addition to the main branch, and it is a clone, as you can tell from the commit history here. I like to change the context here to make sure that I am actually updating just that branch. And now I'm going to update the customers table. I'm going to add our new customer, so I will do an insert statement and add Jeremiah Morrow here, and an email address. Cool. So now my customers table has been updated on the branch.
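As a rough sketch, the branch-and-insert steps above look something like the following in the SQL runner––the branch name, catalog name, and column names are assumptions based on the demo:

```sql
-- Create a zero-copy branch of the catalog (metadata pointers only)
CREATE BRANCH jobs_table_update IN dremio_test;

-- Switch context so subsequent statements hit the branch, not main
USE BRANCH jobs_table_update IN dremio_test;

-- Add the new customer on the branch; main is untouched
INSERT INTO dremio_test.customers (customer_name, email)
VALUES ('Jeremiah Morrow', 'jeremiah@example.com');
```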
And now I need to update my jobs table as well. So I'm going to add this job row here…and go ahead and run. And so now I've updated both tables. I'm going to check a count here: I'm going to select the count from the jobs table on the branch. My original table had 1,000 rows, so this should have 1,001, and, as you can see right here, it has been updated––we do have 1,001 rows. And very importantly for this whole project, obviously, I've been making these updates on the jobs table update branch. Let's run a select count on the main branch of the same table––we should still be at 1,000 rows. And we are, so my production users have not seen any of the changes that we've made so far to these two tables. And so updating the main branch is as easy as doing this merge-branch command. We're gonna merge the jobs table update branch into the main branch, and this will make atomic changes to both tables. And so now my production users should see the updated tables. And very quickly, we will go ahead and do a select count of the main branch jobs table, and that should now be at 1,001.
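The count checks and the merge might look roughly like this, using the same assumed names––the AT BRANCH clause pins a query to a specific branch:

```sql
-- The branch sees the new row; main still has the original 1,000
SELECT COUNT(*) FROM dremio_test.jobs AT BRANCH jobs_table_update;
SELECT COUNT(*) FROM dremio_test.jobs AT BRANCH main;

-- Merge the branch: updates to both tables land on main as one atomic commit
MERGE BRANCH jobs_table_update INTO main IN dremio_test;
```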
And so you can see that we have made the changes. Now, just to show how easy it is to roll back, say I didn't want to make that change after all, or I did something incorrectly and broke someone's dashboard. The good news is rolling back is super easy. All we have to do is assign the branch head to the commit prior to that merge, and we'll undo the changes that we just made. So that looks like this, with the alter-branch command targeting main. Now we just need to grab the commit ID, so we'll go back here to the main branch, and you can see you have your entire commit history right here. So this was the merge, and this is where we want to roll back to. So we'll go to the previous commit, copy this commit ID, and insert it here. And there we go––we've now rolled back the entire catalog to the time before the commit. And just to show that our main branch has been rolled back, we should be back at 1,000 rows, prior to the new customer. And we are at 1,000. So there we go.
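The rollback is a single statement along these lines, where the commit hash is copied from the branch's commit history––the hash below is just a placeholder:

```sql
-- Point main back at the commit before the merge, undoing it catalog-wide
ALTER BRANCH main ASSIGN COMMIT "1a2b3c4d..." IN dremio_test;
```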
To recap: we had an Arctic catalog with 3 Iceberg tables, and we had a join within our production environment that depends on those 3 tables. If business conditions require us to make updates to multiple tables, and we don't want to expose those changes to our end users until all of those tables have been updated, we can create a branch. We can make our updates, check our work, and then merge the test branch into the main branch, and all of those changes are made atomically. And then we can roll back those changes in just a matter of seconds with an alter-branch statement.
So that's, in a nutshell, how Dremio Arctic and Git for Data make multi-table transactions easy and seamless. Again, if you want to try Dremio Arctic, it's now the default catalog within Dremio Cloud. So all you have to do is create a free account and start testing it today. And we'll include a blog post in our follow-up to this that shows you this exact project, so you can try it out for yourself.
So again, my name is Jeremiah. Thank you very much for joining, and now we're happy to take any questions you may have.
Hey, everybody! It's me again now. We're gonna be doing Q&A, so if you have any questions, please send them over into either the Q&A box or the chat box. I will be monitoring those for any questions that you may have on Jeremiah’s great presentation today. And let me just make sure I got everything nice and open. Good.
And yeah, I guess first off: when I create a branch on Arctic, are you really telling me that any queries coming in from my day-to-day data consumers––my data analysts, my data scientists, who are depending on consistent information––they're not being exposed to what's happening on that branch? Assuming that all queries are going to go straight to the main branch?
Yep, that's correct. That is the value of Git for Data. It make[s] sure that your users have a consistent, accurate view of data at all times, and you can do anything you want in that branch on production-level data. And you're safe.
Oh, here we go. Before committing into the main branch, is there a feature of manual review of the changes?
So you can definitely do review checks––I showed that in the demo by checking the count of rows within the branch to make sure that the updates had been made. You can certainly do this with ETL pipelines, for example, where you can actually run a dashboard against your branch and see that the changes have been made, that the data has been updated, and all of that, prior to making the merge. So those are ways to check for quality. And I've seen some people automate that quality check with scripts, and I've seen some people do it manually.
I think that's it. And then, yeah, just as Jeremiah said, you can manually see the state of the data on the branch. So in that case, if you wanted to do a manual review before merging––let's say you're the person making the changes, but your boss is supposed to do the review––they can just do the review once you're done with your checks. There's no flag for a manual review at the moment.
Cool. Any other questions?
Oh, I should add to that also––from a review standpoint, I showed it sort of in the commit history when I was rolling back. But if you wanna check the state that you're rolling a branch back to, you can actually reference that snapshot, and then you can run the exact same sort of quality check––[like] hey, this is the state of the data that I want to see––before you actually make that rollback. So that's a pretty cool feature. All of those snapshots are available for you to check out.
Yes, super powerful. A thing to keep in mind is that, again, Dremio has role-based access controls. So not every user is going to be able to just create branches and roll back the catalog––you have control over which users can. Basically, your data engineers handling the ingestion, ETL, [etc.] can have the permission to do that, and everybody else is just querying that main branch. They don't need to be any wiser, and they don't have the ability to accidentally roll everything back.
And same for merging a branch into the main branch––that is a roadmap item that's coming very soon: the ability to manage permissions on who can actually merge to main. That way, you don't have a data scientist just merging their science experiment into your production branch. So yeah, you've got really robust controls.
With that, any other questions? I'll give it one more moment for any additional questions before we call it a day. Let's see here. These two questions have been answered, so I'll mark those.
Alex, one thing I hear [is]––I was at a data science conference last week, and there was a lot of confusion around the idea of the clone. It's not a copy of the data––you're not storing that anywhere. Can you talk a little bit about clone versus copy, I guess?
Yeah, yeah. So basically, in a traditional environment, what would happen is––let's say you have your data scientists, and they want to have a copy of the data so that they can make X and Y changes, update these rows, see what happens to the model, and whatnot. The problem is, if they mutate that data, then that data doesn't remain as useful for analytical use cases, because it's been tainted with experimental data. So usually, you would make, literally, a full-on copy of the data––like a separate folder or a separate environment [for] those scientists. And then you've just duplicated your storage costs.
Like in Git, when you create a branch, you're not creating a physical copy of your code. Essentially, all Git or Nessie is doing is tracking a series of changes. So when you create that branch, all you're doing is tracking the changes that occur on that branch separately from the changes that occur on the main branch. You're essentially working from the same data, but which changes are visible depend[s] on which branch you're on. So you're not creating a duplicate. The only new storage is any inserts––any files created from new transactions, which are only gonna be the new records, or delete files to track deleted records. So essentially, your storage cost won't explode because you're creating all these clones. And that's a great thing about using Arctic or Nessie.
Literally, you just create a branch, and you theoretically have a workable copy of the data that someone can use for all sorts of experimentation, without actually literally making a copy.
Yeah, that's super useful when you think about your production environment. Some people have dev/test copies of their data––a completely separate environment, completely separate storage and compute and all that––[and a completely separate one for] data science. And then, yeah, you could be managing multiple copies of the same exact data.
Agreed. That's a great thing about zero-copy clones: there's no copy, but you still have another copy. It's like Dr. Jekyll and Mr. Hyde––there's two people, but one body.
With that, I think we're gonna wrap this up, if there's no more questions. But with that, Jeremiah, always delightful to have you on, and we'll have you again on very soon in the coming months.
And then, next week, make sure everyone comes in. Next week, we're gonna be having NetApp come in to talk about how they used Dremio to reduce costs and improve some of the projects that they were working on. There are some really exciting customer stories. If you haven't gone to dremio.com/customers, go hear about how OTP Bank saved 60% on their data storage, or how AP Intego was able to take a project that [would've taken] them 3 years and [get] it done in a month, thanks to Dremio. Go read those stories––[they] are really exciting, and they'll get you excited about the possibilities of data lakehouse architecture in general, but also how Dremio can fit into that architecture. And hopefully you guys sign up for a free architectural workshop with Dremio, so we can see, hey, where you're at, where you want to be, and what advice we can give you to get you there. Again, it's about getting you where you want to be, and then, you know, [we] either fit in that story or we don't, but we're gonna help you get there.
So we'll see you all next week. Have a great day and enjoy.