Gnarly Data Waves
Episode 53 | July 9, 2024
Build the next-generation Iceberg lakehouse with Dremio and NetApp
Watch Vishnu Vardhan, Director of Product Management, StorageGRID, at NetApp, and Alex Merced, Senior Technical Evangelist at Dremio, as they explore the future of data lakes and discover how NetApp and Dremio can revolutionize your analytics by delivering the next generation of lakehouse with Apache Iceberg.
Transitioning to a modern data lakehouse environment allows organizations to increase business insight, reduce management complexity, and lower the overall TCO of their analytics environments. The growing adoption of Apache Iceberg is a key enabler for building the next-generation lakehouse. Its robust feature set, including ACID transactions, time travel, and schema evolution, coupled with an open ecosystem for analytics use cases, continues to drive rapid adoption.
Vishnu and Alex will delve into market trends surrounding Iceberg, as well as key drivers for lakehouse adoption and modernization.
In this webinar, you will learn about:
- Iceberg adoption trends
- NetApp StorageGRID and its benefits
- The Dremio and NetApp data lakehouse solution
- Key Iceberg data lakehouse modernization use cases
- Customer examples
Speakers
Vishnu Vardhan
Alex Merced
Alex Merced is a Senior Tech Evangelist for Dremio, a developer, and a seasoned instructor with a rich professional background, having worked with companies like GenEd Systems, Crossfield Digital, CampusGuard, and General Assembly.
Alex is a co-author of the O’Reilly Book “Apache Iceberg: The Definitive Guide.” With a deep understanding of the subject matter, Alex has shared his insights as a speaker at events including Data Day Texas, OSA Con, P99Conf and Data Council.
Driven by a profound passion for technology, Alex has been instrumental in disseminating his knowledge through various platforms. His tech content can be found in blogs, videos, and his podcasts, Datanation and Web Dev 101.
Moreover, Alex Merced has made contributions to the JavaScript and Python communities by developing a range of libraries. Notable examples include SencilloDB, CoquitoJS, and dremio-simple-query, among others.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Opening
Alex Merced:
Hey, everybody! This is Alex Merced, and welcome to another episode of Gnarly Data Waves, presented to you by Dremio. As usual, here I am as your host, and today, me, Alex Merced, Senior Technical Evangelist at Dremio, and Vishnu Vardhan, Director of Product Management for StorageGRID over there at NetApp, we're going to be talking to you about building the next-generation Iceberg Lakehouse with Dremio and NetApp. And this is particularly exciting because this is talking about the cutting edge of Lakehouse architecture, and what's possible when you embrace Lakehouse architecture.
Dremio: Expert and Leader in the Apache Iceberg Lakehouse
Alex Merced:
Speaking of Apache Iceberg, here at Dremio we have been advocating for and educating on Iceberg since before it was cool. We were one of the first platforms to embrace Apache Iceberg and created some of the earliest online Apache Iceberg education materials. We are the authors of Apache Iceberg: The Definitive Guide, and the creators of one of the first open-source lakehouse catalogs, in the form of Nessie.
An Apache Iceberg Crash Course
Alex Merced:
Now, also, in that regard, we don't stop there. We're going to keep educating about Apache Iceberg. We're going to talk about Apache Iceberg today, but if you want to learn the whole architecture, everything about it, we will be doing an Apache Iceberg crash course: 10 sessions, starting this week, running from July 11th through October 29th. You can scan this QR code to get involved in those sessions. But without further ado, let's begin talking about today's subject of Dremio and NetApp, and building that cutting-edge hybrid Apache Iceberg Lakehouse.
Agenda
Alex Merced:
So in today's agenda, we'll be talking about market news, just talking about some of the changes in the market that make this topic particularly pressing in today's world. And once we establish why Apache Iceberg and why Lakehouse, [we'll look at] what makes combining these technologies such a powerful step towards this Lakehouse architecture, and how Dremio and NetApp, put together, take that even a step further. And we always want to be doing that––taking our data infrastructure a step further. And we'll do a deep dive and just say, hey, what do these two platforms have to offer when you use NetApp as your storage layer and Dremio as your analytics layer? And then we'll take a look at a couple of great examples of these things in action, of the benefits that different customers and entities have had when they've embraced this combination, with examples and case studies and the different use cases for Dremio and NetApp and that hybrid Apache Iceberg Lakehouse.
Data Lifecycle Remains Complex, Brittle, and Expensive
Alex Merced:
And just to better understand the problem that this is solving, think again about the current status quo lifecycle of data. What happens [is] we have all our data sources––our data is found in a disparate array of databases, data warehouses, and data lakes, and [often] we take those sources, the databases and log files and other things, and we have to ETL them through a complex chain of data pipelines into data lakes, and then through another set of complex chains of ETL pipelines into data warehouses, all just to land them and deliver them to our data consumers, for BI, for AI/ML. The problem with this is that the more pipelines you have, the more you delay the delivery of that data.
Dremio and NetApp Iceberg Data Lakehouses Help!
Alex Merced:
You're going to see that we can simplify that, and not only simplify, [but] be able to get that data fast to those who need it by shifting left, by removing that dedicated data warehouse and using our data lake as our data warehouse, a.k.a. a data lakehouse––but it's also going to reduce costs, because you're reducing the amount of compute you're spending on ETL jobs, you're reducing the amount of storage you're spending on having to store multiple copies of the data in different systems in the data lake and the data warehouse, and then you just get faster delivery of that data. So you get faster time to insight and reduced costs. It's a win-win when you embrace data lakehouse architecture.
Databricks Purchases Tabular
Alex Merced:
When it comes to data lake architecture, one of the big choices you have to make is a table format. One of the things that's been established recently is that Apache Iceberg is the winner––the default, the standard bearer––when it comes to a table format for how you represent your data sets in the data lake. There are a couple of different news items that emphasize this. One, Databricks created their own table format called Delta Lake. But despite that, they still ended up purchasing Tabular, the company founded by the creators of Iceberg, for over a billion dollars, a multi-billion-dollar purchase. If a purchase like that by Databricks doesn't signal the need to embrace and adopt Apache Iceberg, I'm not sure what better signal you can find.
Snowflake Releases Polaris Iceberg Catalog
Alex Merced:
And then you also had Snowflake jumping in on the open-source bandwagon by announcing their open-source catalog Polaris, a huge announcement. And you're starting to see everyone moving towards open, Iceberg, and Lakehouse––basically, three things that we at Dremio have been advocating for years. Over the years you're starting to see companies like Snowflake moving more towards Lakehouse and open, and you're starting to see companies like Databricks moving more towards Iceberg. Everyone's converging on that open Iceberg Lakehouse story. So there's never been a better time––you could never move forward with Iceberg as confidently as you can today.
Why Iceberg? A Diverse Developer Community
Alex Merced:
The signals are there that this is the format, this is the architecture, and this is the way forward when it comes to taking that leap with your data. And a couple of other signals of just why the Apache Iceberg project has earned this status: one, it's an open project. It's a community-run project. So there's no one commercial entity that can just flip the project overnight. So you can build confidently, with certainty, going into the future, because you have big-name companies who are contributing to the project, but they're all contributing through diverse, cross-company contributions as a community.
Iceberg for your Next Generation Lakehouse
Alex Merced:
On top of that, with Iceberg, queries are faster. When it comes to data lake queries, Iceberg makes those queries nice and fast because of the metadata that allows you to quickly identify which files you need to scan for any query. Changes are easier: because of the way the metadata is structured in Apache Iceberg, you can do granular inserts, updates, and deletes, and easily add data from files to your Apache Iceberg tables, especially with tools like Dremio that work well with Apache Iceberg. The result is less work for your data team, which means that instead of spending a lot of time maintaining the world they have now, they'll have more room to expand that world, scale out, and take on new projects. They're not spending as much time having to fix what's already there, because everything that's there works. So fewer of those data tickets, and again, lower compute and storage costs.
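As a concrete, hedged illustration of that metadata-driven file pruning, here is a minimal sketch using the open-source PyIceberg library; the catalog endpoint, credentials, and table name are hypothetical placeholders, and Dremio or any other Iceberg-aware engine performs the same planning under the hood.

```python
# Minimal sketch (hypothetical catalog endpoint, credentials, and table name):
# Iceberg's manifest metadata lets a client plan only the data files a filter
# could match, instead of listing and scanning the whole table directory.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "lakehouse",
    **{
        "type": "rest",
        "uri": "http://catalog.example.internal:8181",               # hypothetical REST catalog
        "s3.endpoint": "https://storagegrid.example.internal:10443",  # hypothetical S3 endpoint
        "s3.access-key-id": "ACCESS_KEY",
        "s3.secret-access-key": "SECRET_KEY",
    },
)

table = catalog.load_table("sales.orders")

# Plan the scan: only files whose column statistics can satisfy the filter are kept.
scan = table.scan(
    row_filter="order_date >= '2024-07-01'",
    selected_fields=("order_id", "amount"),
)
print(len(list(scan.plan_files())), "data files actually need to be read")

# Materialize just those files as an Arrow table.
arrow_table = scan.to_arrow()
```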
The cherry on top is, because Apache Iceberg is an open format, you're not locked into any particular vendor. If you need to change who you're doing any particular aspect of your data story with, you can change it and not have to recopy and reduplicate all your data and do these complex migrations. Once you're in Iceberg, you can stay in Iceberg, and still evolve your Lakehouse architecture as time goes on. Iceberg has one of the broadest ecosystems of tools that you can choose from for this, making it a very compelling choice when you're looking for that openness and avoidance of lock-in. So with that, let me bring in Vishnu to tell you about the first half of this amazing solution, where we make these Iceberg lakehouses happen: NetApp StorageGRID. So with that, I'll let Vishnu take it over from there.
Vishnu Vardhan:
Alright, thanks, Alex. So, with that introduction that Alex gave about how Iceberg is now the standard, and how both Databricks and Snowflake are adopting Iceberg––there was this transition where you had multiple formats, but now everybody just converges on Iceberg as the one format, and object storage starts to become much more relevant. And that's because Iceberg is fundamentally an object-storage-based format first. It shines with object storage. So it's really important to understand your object storage solutions from that perspective. And so we'll talk about StorageGRID, which is NetApp's leading object storage platform, and then we'll talk about Dremio, and then we'll put all of this together.
NetApp StorageGRID
Vishnu Vardhan:
So StorageGRID from NetApp is an object storage solution that scales to 800 petabytes of capacity in a single cluster. So one cluster scaling to 800 petabytes, having 300 billion objects in that cluster, and that cluster can span across multiple sites. So we can span 16 different sites with just one single cluster, and create a single global namespace, which means that you can go and access that data from any of those global sites. And this is active-active access. So you can go in and read the data, write the data in San Francisco, and read it from New York, and it's available that very second. And you can have all of these reads and writes going across your entire object storage infrastructure. So that's a massively scaled single global namespace for your objects. StorageGRID does that, and I think that by itself fundamentally differentiates it as an object storage platform, but in addition to that, we provide a policy engine, and I'll speak a little bit about that. It's extremely flexible to deploy. You can mix and match: you can run it on virtual machines, or you can run it on appliances that we provide for simplicity. It offers up to 15 9's of durability, and it's configurable. And finally, a lot of your data doesn't stay in the object storage. You need to be able to move it around, you need to be able to move it to other systems. You need to be able to move it to the cloud many times and get multiple tiers of storage, because each tier has its own economics and its own value proposition, and StorageGRID lets you do that. And when you combine that with Dremio, we'll show you how the solution can be compelling.
Automate Lakehouse Data Management at Scale with ILM Policies
Vishnu Vardhan:
So if you go to the next slide, it'll show you how, first, from a policy perspective, if you have 800 petabytes of data, managing that is hard. And so if you have a large Lakehouse, you need policies that will tell you how to control that data. And that's important because that lets you define where the data lives. So do you want a copy, for example, in a particular data center? Do you want that copy to be within a particular region to meet data sovereignty guidelines––constrained data that stays only in Europe, for example? So you can define policies that do that, as well as how long you want your data around. Much of your data is going to be cold; some's going to be hot. And so you want to distinguish your cold data from your hot data and treat them differently. You want your cold data to be extremely low-cost––make sure that that's the first value prop that you give your customers––and then when they access it, it's just transparent; they don't know that you have identified that data as cold and given it some special treatment. Your users just access the data and just get it, and it's just transparent to them. So the policy engine lets you define how long data is stored, and how it is stored, to get your costs lower.
And lastly, really, be secure. How do you improve the durability of your data? For some data sets you want extreme durability, for others less, and so you can define that in a much simpler way. And so this is a simple screenshot that shows how, with different conditions and policies, you can apply very granular rules to your data. Now, some systems can do this. But what also needs to happen is, your environment is always dynamic, and you need to be able to go in and change these policies and apply them retroactively to data that was ingested earlier. And with StorageGRID, you can do that: say you have a data lake, and some data sets that are now cold, and you want them to be made hot again. You can go in and change a policy, make all that data hot, and now all your queries are back to being performant. That could mean, for example, moving from HDD to flash storage, and doing it transparently. So your applications just have a namespace; they just access the data. You can decide whether it is on flash or on HDD, and then move it around, as an example of improving performance.
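Because StorageGRID speaks the S3 API, the hot/cold treatment described above can key off ordinary object metadata. A minimal sketch, assuming (purely as an illustration) that the grid's ILM rules, configured separately in the Grid Manager, match on a "tier" object tag; the endpoint, bucket, key, and credentials are hypothetical.

```python
import boto3

# Hypothetical StorageGRID S3 endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storagegrid.example.internal:10443",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Tag an existing Parquet data file as cold; an ILM rule filtering on tier=cold
# (defined in the Grid Manager, not shown here) could then erasure-code it or
# place it on cheaper media, transparently to anyone reading the object.
s3.put_object_tagging(
    Bucket="lakehouse",
    Key="warehouse/sales/orders/data/part-00017.parquet",
    Tagging={"TagSet": [{"Key": "tier", "Value": "cold"}]},
)

# Readers keep using the exact same bucket and key; placement is the grid's concern.
obj = s3.get_object(Bucket="lakehouse", Key="warehouse/sales/orders/data/part-00017.parquet")
print(obj["ContentLength"], "bytes, served from wherever the policy placed the object")
```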
Simple and Feature-rich Software-defined Object Storage
Vishnu Vardhan:
So I think that's the policy engine that StorageGRID provides that's extremely differentiated for Lakehouses. But that's not sufficient. And so if you go to the next slide, you will see that apart from the policy engine, we, of course, provide basic bucket management and the basic S3 API––when I say basic, the storage platform has had all the S3 APIs; it's just fundamental that an object storage platform provides S3 APIs. But that's not sufficient when you create this large infrastructure. It's really important in terms of how you manage security, how you manage denial-of-service attacks on your grid––can you prevent them? Do you have traffic shaping? Do you have quality of service? Can you prevent different untrusted and trusted users from hitting a particular endpoint? If you have s3.yourcompany.com, you may want to allow particular people to access that endpoint and not others, and you want to be able to enforce that. And so really, the whole load balancing tier is very fundamental in terms of driving a very secure infrastructure. You, of course, want this to connect to your entire identity framework in your company and be able to apply granular policies, and so that's something that you need to be looking at in an object storage platform. How do you bill this internally, if you have so much storage? How do you charge back internally? Do you have billing and metrics to be able to do billing internally? And so that's something that you need to think about. And these are things StorageGRID has; I'm just posing them as questions so you can understand the value StorageGRID brings.
A big part of this large infrastructure is audit logging: being able to look at all the actions that are happening in your storage system and build an audit trail that lets you prove whether your solution is secure or not. And so that's something that you want to be looking at. Lastly, security is fundamental. You need to have multiple levels of security––we have customers that sometimes enable all security levels, because that's a threat vector that they want considered. So, very flexible, as I said before, multiple ways to deploy it, whether it's appliances, VMs, or bare metal, and many ways to move your data around to different clouds, all providing an object platform that you can use for your data lake. So we will now talk a little bit about Dremio, and then we'll put these two things together. And so, Alex, over to you.
Dremio: The Unified Lakehouse Platform for Self-Service Analytics
Alex Merced:
Hey, everybody. Bottom line: you're starting to see the shape and the flexibility of these platforms, and the flexibility you get when you start building this data lakehouse. And Dremio adds even further flexibility. Generally, Dremio is what's called a lakehouse platform. So the idea here is you want to use your data lake as your data warehouse, as that center. So what you need is a tool that wraps all the pieces and components of building a Lakehouse and puts them together in one nice unified way. Dremio does this with three tiers of service, which include unified analytics, our SQL query engine, and lakehouse management. And let's just click in and look a little deeper at each of these tiers of service.
Unified Analytics
Alex Merced:
So we have unified analytics––the idea here is you're not going to have all your data in one place, so Dremio needs to be able to interact with your data wherever it is. And this is where, again, something like NetApp StorageGRID works well, where it can also move your data to where it needs to be. So you have this analytics layer and storage layer that give you that flexibility. So Dremio can connect to databases, data lakes, and data warehouses, on-prem and in the cloud. That way, you can reach your data wherever it is, and be able to access it quickly and be able to model it. So in that case, if I have an Iceberg table on my data lake, and then maybe another table in Postgres that I need to join to create a particular data asset that one of my consumers needs for a BI dashboard, I can model it in Dremio without having to move the data, and that can then be delivered to serve whatever data purposes you need. On top of [that], you have deep governance and security at that access layer as well. So basically, all these different data sources that are accessible through Dremio can then be governed using role-based access controls, column-based masking, and row-based masking, along with many tools for auditing and querying and being able to see the lineage, along with other security features like single sign-on integrations and whatnot. So you have this robust access layer to access the data across many different sources. And when you query that data, you have a powerful SQL query engine that can federate queries across those different sources, but also has really powerful acceleration features that work across all those sources.
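As a hedged sketch of what that modeling can look like in practice: Dremio exposes an Arrow Flight endpoint (typically on port 32010), so a single SQL statement can join an Iceberg table on the lake with a live Postgres table without copying either. The host name, credentials, source names, and table paths below are hypothetical.

```python
from pyarrow import flight

# Hypothetical Dremio coordinator and credentials.
client = flight.FlightClient("grpc+tcp://dremio.example.internal:32010")
bearer = client.authenticate_basic_token("analyst", "analyst-password")
options = flight.FlightCallOptions(headers=[bearer])

# One federated query: an Iceberg table on the lake joined to a live Postgres source.
sql = """
SELECT o.order_id, o.amount, c.segment
FROM   lakehouse.sales.orders        AS o   -- Iceberg table on StorageGRID
JOIN   postgres_crm.public.customers AS c   -- table living in Postgres
  ON   o.customer_id = c.customer_id
"""

info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
reader = client.do_get(info.endpoints[0].ticket, options)
result = reader.read_all()  # an Arrow table, ready for pandas, Polars, or a BI extract
print(result.num_rows, "rows joined without moving either source")
```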
SQL Query Engine
Alex Merced:
In that case, that comes in the form of reflections. The Dremio query engine already is the most price-performant engine on the market, in the sense that it can execute the same query at a lower price because it's so fast. But when you need that extra juice, there's a feature called reflections that provides, essentially, a relational cache. What it's doing is it'll cache Apache Iceberg representations of key data sets. So maybe I have a Postgres table that I'm looking to accelerate––I can turn on reflections, and it creates an Apache Iceberg version of that table that it can substitute when that table is queried in the future. But not only that, it'll keep that materialization up to date and figure out when to substitute it. So you don't end up with what you have in other systems, with things like materialized views and cubes, where you end up creating all these different versions of the same data set, and the data analyst has to juggle them all in their head and know which is the right one to query for which types of queries. In this case, Dremio will figure out, hey, which of these reflections is the right materialization for that query, and make sure that materialization stays up to date.
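For reference, enabling a reflection is a one-statement operation. The sketch below follows Dremio's ALTER DATASET ... CREATE REFLECTION DDL as I understand it, with hypothetical dataset and column names; the statements could be submitted through the same Arrow Flight connection shown earlier, or any SQL client.

```python
# Hedged sketch: Dremio reflection DDL as plain SQL strings (dataset and column names hypothetical).

# Cache an Iceberg-backed representation of a Postgres table for fast repeated access.
create_raw_reflection = """
ALTER DATASET postgres_crm.public.customers
  CREATE RAW REFLECTION customers_raw
  USING DISPLAY (customer_id, segment, region)
"""

# Pre-aggregate a large fact table for dashboard-style queries.
create_agg_reflection = """
ALTER DATASET lakehouse.sales.orders
  CREATE AGGREGATE REFLECTION orders_by_region
  USING DIMENSIONS (order_date, region)
        MEASURES (amount (SUM, COUNT))
"""

for stmt in (create_raw_reflection, create_agg_reflection):
    print(stmt.strip())  # submit via Arrow Flight, the REST API, or the SQL runner
```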
And then we also have something called the columnar cloud cache, which is specifically designed to accelerate work with object storage. So essentially, on the nodes where the Dremio cluster is running, it's caching, on local NVMe, regularly accessed columns or rows from different Parquet files to accelerate queries, and it can also reduce costs when you're talking about network access costs and things like that. So the SQL query engine is fast and performant and also helps reduce costs.
Lakehouse Management
Alex Merced:
And last, we have Lakehouse management––if you're using your data lake as a data warehouse, one of the things that data warehouses and databases do under the hood that we oftentimes take for granted is that they manage how that data is stored. So cleaning up data we don't need anymore, rewriting data that needs to be rewritten as we write new data––that doesn't automatically happen in the data lake. So in that case we oftentimes would have to run our own regularly scheduled maintenance operations.
Now, Dremio makes this much easier in two ways. One, it has made running those operations easy through easy-to-access SQL commands like OPTIMIZE and VACUUM, [which] allow you to manually maintain your tables. But even better, you can automate that with Dremio's enterprise data catalog––you can schedule a regular cadence of maintenance. So that way, you just don't have to think about it. Your lakehouse Iceberg tables will just work. They'll be performant, and you won't necessarily have to worry about runaway storage, because Dremio is actively expiring snapshots and cleaning up data files you no longer need as you go. So you get all the benefits of that Lakehouse paradigm, and you minimize the amount of work it takes to implement it, because Dremio's data Lakehouse platform makes that whole implementation much, much easier.
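To make those two commands concrete, here is a hedged sketch written against Dremio's OPTIMIZE TABLE and VACUUM TABLE syntax; the table path and retention values are hypothetical, and the same operations can instead be scheduled automatically by the enterprise catalog as described above.

```python
# Hedged sketch (hypothetical table path and retention settings).

# Compact small files into fewer, right-sized Parquet files.
optimize_sql = "OPTIMIZE TABLE lakehouse.sales.orders"

# Expire old snapshots so unreferenced data files can be cleaned up,
# in line with a (hypothetical) retention policy.
vacuum_sql = """
VACUUM TABLE lakehouse.sales.orders
  EXPIRE SNAPSHOTS older_than '2024-06-01 00:00:00.000' retain_last 10
"""

for stmt in (optimize_sql, vacuum_sql):
    print(stmt.strip())  # run manually, or let the catalog run it on a schedule
```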
Dremio Hybrid Iceberg Lakehouse
Alex Merced:
So that's the Dremio platform. And because Dremio can connect to on-prem and cloud sources, Dremio becomes a key component in creating that hybrid Iceberg Lakehouse, because it can connect all those data sources wherever that data is, and provide that nice access layer where you can govern that data, model that data, and deliver that data to wherever people need it, whether it's a Python notebook or a BI dashboard tool like Tableau or Power BI. Dremio can deliver the data there and act as that layer in between that provides you those reflections for acceleration, that provides you that performant SQL query engine, that provides you that semantic layer to model that data––it provides you everything you need to make that data easily deliverable in that lakehouse fashion, which is, well, really cool.
Why Dremio + NetApp
Alex Merced:
So this begs the question: these are two cool platforms that do awesome things to make sure that the data in my data lake is flexible, affordable, and performant, so how do we pair them together and make something even more awesome? Let's talk about what happens when you put Dremio and NetApp together.
Iceberg Ingestion Options with Dremio
Alex Merced:
You end up having something like this––you bring your data into the data lake, and a data lake is flexible, so you can bring your data into Apache Iceberg tables in all sorts of different ways. You could use Dremio to bring that data in. We have features in the SQL language like COPY INTO that make it easy to copy data from JSON and CSV files into an Iceberg table. We've just introduced auto-ingest pipelines that allow you to set up a regularly scheduled ingestion job from object storage. You can use Kafka and other streaming tools to stream into Apache Iceberg tables. You can use something like a CREATE TABLE AS with Dremio to just move data from any other Dremio-connected source into an Iceberg table, or use other ingestion vendors like Airbyte, Fivetran, or Upsolver to bring that data into your data lake. But once the data is in your data lake––again, stored in your StorageGRID layer that provides you all those key tiering features and those cool policies––you have Dremio on top of it, providing that access layer. That way, it makes it easy for people to access that data wherever it lies, and however you move it. And then again, you get reflections to provide that acceleration to then deliver it to your favorite tools. So you get this powerful, affordable, fast lakehouse.
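As a small, hedged illustration of two of those ingestion paths: the COPY INTO statement below follows Dremio's syntax for loading files from an object storage source into an Iceberg table, and the CREATE TABLE AS statement materializes any Dremio-connected source as Iceberg. The source, path, and table names are hypothetical.

```python
# Hedged sketch of two ingestion paths (hypothetical source, path, and table names).

# Load CSV files that landed in a StorageGRID bucket into an Iceberg table.
copy_into_sql = """
COPY INTO lakehouse.sales.orders
  FROM '@storagegrid_source/incoming/orders/'
  FILE_FORMAT 'csv'
"""

# Or materialize any other Dremio-connected source (here, a Postgres table) as Iceberg.
ctas_sql = """
CREATE TABLE lakehouse.sales.customers_iceberg
  AS SELECT * FROM postgres_crm.public.customers
"""

for stmt in (copy_into_sql, ctas_sql):
    print(stmt.strip())
```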
Dremio + NetApp StorageGRID Iceberg Lakehouse
Alex Merced:
And so basically, again, Dremio comes around providing you unified analytics, being able to connect all your data sources wherever they're at; a powerful query engine that allows you to federate queries across all those data sources; and lakehouse management, allowing you to manage those Lakehouse tables. So that way, it truly feels like a data warehouse experience with ease of use and performance, and that's all built on top of the tables you have that are stored in the Iceberg format on your NetApp StorageGRID. And NetApp, as Vishnu will tell us, brings you so many different benefits, such as…
Vishnu Vardhan:
Right, so, fundamentally, you have this data sitting on StorageGRID. Firstly, if you compare the cloud to on-prem, you have a significant cost saving; but compared to other object storage options, StorageGRID, with its ILM, lets you control how that data is being stored. So for example, if you store two copies of data or three copies of data, that's going to have a certain cost. If you erasure code it, then instead of having three copies, you now have 1.5 copies' worth of storage, so that's a 50% cost saving. And you don't have to stop at 1.5; you can erasure code it even more aggressively, which makes it even more efficient. So StorageGRID can lower that cost significantly for you in terms of what that storage is. It has all of the S3 APIs in terms of bucket management, so you don't have to use StorageGRID just for your data lake. There can be other use cases in your enterprise that you want to use StorageGRID for as well. So that lets you use the object storage as a general-purpose system across your entire enterprise.
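The 50% figure follows directly from the storage-overhead arithmetic; here is a quick sketch, with the erasure-coding schemes (4+2, 6+3) used purely as illustrative examples.

```python
# Storage overhead = (data fragments + parity fragments) / data fragments.
def overhead(data_fragments: int, parity_fragments: int) -> float:
    return (data_fragments + parity_fragments) / data_fragments

three_copies = 3.0            # full replication: 3x the raw data
ec_4_plus_2 = overhead(4, 2)  # 1.5x
ec_6_plus_3 = overhead(6, 3)  # also 1.5x, while tolerating more fragment losses

saving = 1 - ec_4_plus_2 / three_copies
print(f"4+2 erasure coding stores {ec_4_plus_2}x vs {three_copies}x, saving {saving:.0%}")
# -> 4+2 erasure coding stores 1.5x vs 3.0x, saving 50%
```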
If you go to the next slide––so security is key. But before that is preventing misuse, so denial-of-service attacks, and restricting that, especially if you have a single storage system with many users on it: you want to be able to restrict who can do what. And so that's what you can do with StorageGRID. We, of course, see significant performance––Dremio is amazing, and we'll talk about some of Dremio's performance numbers––but when you pair that with StorageGRID, you get an even more performant system. Security is not negotiable. And so really, there are multiple levels of encryption that the object storage platform can provide.
Dremio & StorageGRID Automate Data Management at Scale
Alex Merced:
Cool. And thank you so much for that. But the bottom line is, let's just keep exploring the partnership between Dremio and StorageGRID, and how these things work amazingly together. So just [to] reinforce, when you pair the two together, what aspects is Dremio managing for you? Dremio can bring a lot of things to the table as far as helping you manage user access. Dremio can act as that access layer on top of where your data is; essentially, your end users are given an account on Dremio, and they can only access the data that you've permitted them to access from the Dremio platform, which then has access to your storage and other data sources. Again, you can use role-based access controls, column masking, and row-based masking, all providing [those] fine-grained privileges using Dremio.
But then you also get version control––the enterprise catalog provides catalog versioning, so you're able to create multiple versions of your data catalog. This is gonna be useful––for one, like isolating ingestion, so that way, you say, hey, I'm going to make changes to my data, but I want to isolate it into a branch until I'm ready to publish that data, making it much easier to make changes to your data and consistently publish them. You could use it to create zero-copy environments. So for one example, a financial company that has to do certain stress tests on their data regularly can create, essentially, a zero-copy clone of the data by creating a branch, modifying it for those different stress-test scenarios, and running the tests without having to modify their production data, even though they're essentially working from the production data. They're not altering their production data, because they can just drop the branch afterward thanks to Dremio's version control features. On top of that, you can recover from any mistake, because you can always roll back to previous versions of the data.
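That stress-test scenario maps to a handful of catalog-versioning statements. The sketch below is modeled on Dremio's branching SQL (CREATE BRANCH, AT BRANCH, DROP BRANCH); the catalog, branch, and table names are hypothetical, and the exact DDL should be checked against the current documentation.

```python
# Hedged sketch of a zero-copy "clone" via a catalog branch (names hypothetical).
statements = [
    # Branch the whole catalog; no data files are copied.
    "CREATE BRANCH stress_test IN lakehouse",

    # Apply the scenario's modifications only on the branch.
    """UPDATE lakehouse.risk.positions AT BRANCH stress_test
         SET rate_shock = rate_shock + 0.02""",

    # Run the stress-test queries against the branch...
    """SELECT portfolio, SUM(exposure)
         FROM lakehouse.risk.positions AT BRANCH stress_test
        GROUP BY portfolio""",

    # ...then throw the branch away; production data was never touched.
    "DROP BRANCH stress_test FORCE IN lakehouse",
]

for stmt in statements:
    print(stmt)
```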
You have table optimization––again, those lakehouse management features where Dremio is going to constantly clean those tables up, rewriting data files so that they're in a more performant and optimized form, and expiring old versions of the table, so you can clean up data files you no longer need based on your data retention policies. Your data just works, just the way you want it. And again, it provides that isolation, so transactions don't interfere with each other––you can do that isolation on branches and so forth. And with that, Vishnu, tell us a little bit about some of these benchmarks.
Benchmark Results: NetApp StorageGRID & Dremio Win!
Vishnu Vardhan:
Yeah. So as we began to work with Dremio, we decided to do some testing ourselves to validate how Dremio works with StorageGRID. We compared three solutions––we compared Hive, Spark running on Delta Lake (Alex touched on the Delta Lake format earlier), and then, of course, Dremio with Iceberg. So if you look at the bottom line, Dremio is about 20 to 25 times faster than Hive running on a Hadoop environment. And if you look at why there is that significant performance delta of 25x, it's coming from the LIST and HEAD operations. Hive is doing something like a thousand times more list operations than Dremio is. Even if you look at the head operations, they are more than what Dremio does by orders of magnitude. The reason is that the Iceberg format has been uniquely developed for object storage. And so you can see that, where the Hive systems are just not able to scale, Dremio is extremely fast.
The NetApp & Dremio Difference
Vishnu Vardhan:
And so this plays into the next slide, where we'll talk about the overall benefits that you're seeing. Just straight away, we see a query performance improvement of at least 10x, as in the TPC-DS benchmark where we saw that 25x delta. But when you combine that with the cost economics [that] StorageGRID can deliver with its policy engine, which lets you store data at different levels of cost, and combine that with Dremio's query performance and the overall simplification that it's able to drive, you're able to get a 9x price-performance improvement and significant savings in your ETL cost, just because you don't have to be moving your data across all of these different systems. And with this one data lakehouse architecture, it just simplifies all of your engineering. We are seeing up to a 50% better TCO with these two solutions together. Just a phenomenal improvement in terms of what your data lake is now costing.
Dremio & NetApp: Hadoop Modernization
Alex Merced:
Awesome, and all good. But the bottom line is, there are a couple of key places where you can be using Dremio and StorageGRID put together. So bottom line, if you need a lakehouse, this is a powerful combination, as you can see from those numbers as far as TCO and performance benefits. But another key place where this combination becomes helpful is if you already have an existing on-prem Hadoop lakehouse, and you want to start the process of modernizing your Hadoop infrastructure. What happens is you can move to Dremio plus StorageGRID, and end up with sub-second query performance and 10-times-better price performance. You can move to governed self-service analytics, because again, you get all those access features at both layers, with the fine-grained policies that you get with StorageGRID [and] the fine-grained access controls you have on the Dremio layer. So you get real control and real security to make sure that no one's accessing data that they should not. You get a unified view, because, again, Dremio can connect all your different data sources in the cloud and on-prem, and provide that unified view, so that everyone sees their data in one place. And you can scale that compute and storage layer independently––you scale your Dremio and your NetApp based on your needs separately––and end up getting improved overall data management. And you reduce the complexity of your overall infrastructure, providing a very, very powerful result. So again, migrating from a legacy Hadoop data lake to a modern NetApp StorageGRID and Dremio lakehouse environment can bring you lots and lots of benefits.
Use Case: Hadoop Modernization for NetApp’s Own Lakehouse!
Alex Merced:
So what we'll do now is take a look at how NetApp did this with their own lakehouse. I'll bring back Vishnu to talk about NetApp running this solution in their own data lake.
Vishnu Vardhan:
Yeah. So we didn't partner with Dremio first and then build a solution. What we did was we used Dremio and StorageGRID first, together, inside our own infrastructure. And then we realized that this is a compelling story, and we started to put this whole solution together and said, hey, this is great. We should not just keep this inside the house. We need to tell other people about it. And so that's the genesis of this overall presentation, of this overall partnership. And so, in this particular case, how this all started was that we have an internal telemetry system called Active IQ. Active IQ is a massive system. It receives about 10 trillion data points every month from hundreds of thousands of systems that we have out in the field. So we collect all this information, [and] we use it for three different things. We use it to provide a support experience for our customers and make sure we can analyze their systems and help them fix issues. We use it to provide insights to our customers, to say, hey, you can do these five things to improve your utilization of your system or make it perform better. And then we use it internally, for our own analysis of how our customers are using our systems. So, [a] very critical system that defines how we do our business.
And so that's been around for a while; it started as flat files, and then we moved to a Hadoop infrastructure. But we had a bunch of problems with that. With Hadoop, we had to scale both storage and compute together. We could not scale storage independently, and our system was just growing, so we had to keep adding cores to be able to scale the storage system underneath, and that was expensive. So first, the scaling was just not cost-effective. The second challenge we had was in terms of governance, both in terms of the infrastructure––where particular Spark jobs could run, and they could starve the Hive queries, and we were not able to granularly control that. There were some controls, but they were not granular enough for us to say, we don't want this Spark job to run and take over so much compute; we want to share this with the Hive queries. So we didn't have that control. We also had issues in terms of being able to granularly control row and column access. Hadoop does have ways of doing that, but it was a little bit more complicated for us to do.
Lastly, there was just performance. We were seeing 45-minute query times with our Hadoop infrastructure, and that was just extremely painful. So, to address this, we looked at Dremio, we looked at a bunch of vendors, and then we finally worked with Dremio to move this solution from the Hadoop infrastructure to Dremio. And when we moved it, we didn't just move it; we created two copies of the system. We copied the data over from Hadoop to Dremio, and we felt that was safer as a way to move the data. We started to use StorageGRID as the underlying storage system, and Dremio as the query layer. After a little while, we realized we didn't have to move the data or create a copy––Dremio can operate with two different systems at the same time and seamlessly migrate the back end. So that's something that we realized, but initially, when we did it, we copied the data over. And when we finished the migration, we saw dramatic savings: we were able to save up to 60% of our costs on this system. Query times dramatically reduced to about two minutes from what was 45 minutes, so those savings were significant. I think, to wrap this up, what was also very interesting was the effect it had on our overall data lake. We started to get higher satisfaction from our users. So we started to also get more users using it, and we started to have more data come in. And so it just became a very self-reinforcing, good change in terms of how the system started to grow more and more, and we started to get more usage, with much higher user satisfaction at a much lower cost. So just a great solution, and I think that was the basis for all of this; that's what we're talking about here today.
Use Case: Hadoop Modernization: Global 100 Financial Institution
Alex Merced:
Awesome. I just love that story. It's just lovely to see amazing technologies achieving amazing results. And it's not the only story––I mean, here we have another story of someone doing that Hadoop modernization move, [who's] a company in the Global 100. It's a Global 100 financial institution, and it started with an existing large legacy on-prem Hadoop data lake. They had no plans to move to the cloud. That wasn't necessarily the goal, but they wanted to get more out of their existing on-prem infrastructure. And what they ended up doing was adopting a StorageGRID and Dremio lakehouse. Basically, they used Dremio for their financial data landing zone, acting as their single source of truth, giving them the ability to join data on their NetApp StorageGRID lake with their existing Hadoop data too. As they phased out, or modernized away from, Hadoop, they were able to connect both sources and make that transition much more seamless, because end users didn't have to worry about where the data was; they could just see the data sets they had access to. And as the migration happens behind the scenes, it doesn't change the workflows of your end users. And with the Dremio semantic layer, again, all that data is nicely modeled. They just see the data sets they need to have access to, and they're not necessarily concerned with where that data is coming from, making that transition much easier, and then also providing an audit trail of where these queries are coming from, so that they can still provide information to the FTC. With their NetApp storage plans, they're able to reduce their previous Cloudera data platform costs quite a bit, and eventually move entirely off it onto StorageGRID. It's all led to reduced costs, increased performance, [reduced] time to insight, [...] and overall just a better experience. So, whether you're building a hybrid solution, cloud or on-prem, or strictly staying on-prem, there is a great story and great outcomes when you make use of Dremio and StorageGRID to modernize your lake.
Use Case: Warehouse to Lakehouse
Alex Merced:
And one more example, this one from a Fortune 10 customer. They had a 75% TCO savings––so that's $3 million in savings in just one department at this particular company. But essentially, if we take a look at their world before: what used to happen is that they would have to ETL data from their data lake, and then ETL that data into Snowflake, so you have multiple data movements. And then all these transformations would have to happen within the data warehouse, where the compute costs are a little bit higher. And then you're increasing storage costs as you do those materializations, layer by layer, through those transformations. So you would have a lot of costs there; that would create 700 million records, and after doing all of that, the resulting BI dashboard would still take 3 to 4 minutes. So essentially they would turn a knob, click a toggle to change the values in the BI dashboard, and it would take 3 or 4 minutes to refresh. So they were spending a lot more, and having a not-very-pleasant experience on their BI dashboards. So then they moved over to Dremio, [and] in this setup that you see here on the right, all the data just stayed in the data lake. They just modeled the data and did the transformations virtually, using Dremio, right there on the data lake, so there weren't all those copies of data. There wasn't the more expensive compute to do all these transformations. And basically, they didn't even have to do things like reflections. These are just direct queries right off their lake, and they were able to achieve 5 to 15 seconds per click. So basically, they went from having to wait 3 or 4 minutes for that BI dashboard to update, along with all the costs of just getting the data in place to do that, to reducing their costs, because they're reducing the duplication of their data, and improving the speed at which that BI dashboard updates. And so this is the benefit when you use Dremio on top of your data lake and basically treat your data lake like your data warehouse, a.k.a. a data lakehouse.
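What "transforming virtually" means in practice is that the warehouse-style modeling becomes views over the lake rather than physical copies. A hedged sketch, with hypothetical view and table names:

```python
# Hedged sketch: the transformation layers become views over the Iceberg tables
# (no extra copies, no warehouse-side compute); all names are hypothetical.
create_view_sql = """
CREATE VIEW semantic.sales.daily_revenue AS
SELECT order_date,
       region,
       SUM(amount) AS revenue,
       COUNT(*)    AS order_count
FROM   lakehouse.sales.orders          -- Iceberg table sitting on the lake
GROUP  BY order_date, region
"""

# The BI dashboard then queries the view directly.
dashboard_query = "SELECT * FROM semantic.sales.daily_revenue WHERE region = 'EMEA'"

for stmt in (create_view_sql, dashboard_query):
    print(stmt.strip())
```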
So again, just making that case for why you want to make that move to the data lakehouse infrastructure, and why everybody's moving in that direction. It's no longer just a few people screaming on a mountain that lakehouse is the way, everyone! As we saw in those news bits at the very beginning of this presentation, everyone's moving to the Lakehouse. Everyone is looking in that direction. And there's never been a better time to start making those steps.
The NetApp & Dremio Difference
Alex Merced:
And with Dremio and StorageGRID, you're going to get that best-in-class TCO, you're going to get that best price performance, you're going to get that fastest query performance. And you're going to get simplified data engineering and management. And again, there's no better time to start than today, moving in that direction and future-proofing yourself, and getting all those nice results.
Get More Great Information
Alex Merced:
So here are a bunch of QR codes so that you can get access to a case study from NetApp, a case study regarding the Dremio-StorageGRID solution, and some other things to consider when you're moving from Hadoop to a data lakehouse. You have all this information to begin planning your journey. Of course, Dremio and NetApp would always be glad to sit down and have a meeting to help you architect what that journey looks like for you. So, we had a poll earlier to see if you would want to meet, but also, if you have any questions, please do put them in the Q&A box there in the menu bar below, so that we can answer those questions as we wrap up this presentation.
Thank You
Alex Merced:
But I just want to also say, thank you for being here today, and also thank you to Vishnu for being on the show this week. This is a great time, and I always love hearing about the benefits of StorageGRID. And I also especially just love seeing how, basically, not only do they have this great combination between Dremio and StorageGRID, but how well it worked for them in-house. It's just such an amazing story.
Q&A
Alex Merced:
But with that, we have a question: when will Dremio have Spark capability plugins? The way I'm reading this question is like, hey, when would Dremio have some way to write Spark-style dataframe logic in the Spark API, to be able to take your PySpark code and run it against Dremio compute? I wouldn't say that that's on the existing roadmap. Dremio is a very SQL-first platform, although technically the open-source components for making that a possibility exist, and they're essentially the path to that for any open-source devs who would like to build it out. There's a library called SQLglot that allows you to transpile between different SQL dialects. So creating a Dremio dialect in SQLglot would be step number one, and then with that, you're able to build out either an Ibis backend, or there's another one called SQLFrame that's more specific to the PySpark API. And then you essentially would have what you're talking about there, doing it all through open-source channels. So that's a possibility.
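For anyone curious what that first step looks like, SQLglot's transpile API is already usable today between the dialects it ships with; the sketch below converts Spark SQL to Trino purely as an illustration, since, as noted above, a Dremio dialect would first have to be contributed to the library.

```python
import sqlglot

# Spark SQL using backtick-quoted identifiers.
spark_sql = 'SELECT `customer id`, SUM(amount) AS total FROM orders GROUP BY `customer id`'

# Transpile between dialects sqlglot already knows; a "dremio" dialect does not
# exist yet and would be the "step number one" described above.
print(sqlglot.transpile(spark_sql, read="spark", write="trino")[0])
# -> SELECT "customer id", SUM(amount) AS total FROM orders GROUP BY "customer id"
```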
And now another question: how does this compare, in terms of setup and upfront cost, to the public cloud storage option? So I'll pass that off to Vishnu for your thoughts on that question.
Vishnu Vardhan:
Yeah. So, firstly, I just want to thank you for having me on; I wanted to get a chance to say that. Sure. So there were two questions, I think, and I'll address them both together. I answered one of them in chat, but I'll voice it over. So there's a question about how StorageGRID differs from AWS S3, and whether it is migratable to other vendors. StorageGRID is an on-prem version of AWS S3. The reason we have multiple customers choose it is that it's just so much cheaper to run something on-prem. You don't have egress costs, and you don't have access costs on every API call; it's just much cheaper to run things on-prem, and StorageGRID lets you do that. There are use cases where you want to be using the public cloud, for example Glacier, and StorageGRID can tier out to Glacier, so you get a dramatic cost reduction. We have multiple customers that move petabytes of data on-prem, just because they realize the cloud is just so much more expensive. It is a separate product. But as we said, the two work together, so it's not bundled into Dremio; it's a separate product.
So I think the other question was, how does this compare in terms of setup and upfront costs? Your cloud upfront cost is zero, and I think that's the benefit when you want to do test and dev. I think the cloud is a great place to do test and dev. But once you want to roll into production––storage is unique, because it's the unsaid thing. Compute costs flex: compute costs go up, and then you turn machines down––and most people forget to turn machines down––but you turn machines up and you turn machines down, and when you turn up and down you have a variable cost, and so you can save that cost if you go to the cloud. Storage, you don't ingest and then delete; you ingest, you let it stay, and so storage costs are fixed, increasing costs over time. So the cloud is not a good place to have storage, because of your cost––you put the storage in, and there's no flex; there is no, hey, I can turn it off. And so over a 3-year, 5-year period, you're significantly cheaper on-prem than on AWS. I'd ask you to do the math, but you're talking about 70%, 80% numbers. The numbers are pretty large, so it's very compelling from a cost perspective.
Closing
Alex Merced:
Awesome. And let's see, I think that question is now answered. And yeah, I think those are all the questions that we have for today. But again, I highly recommend that all of you add Vishnu and myself on LinkedIn, and follow NetApp and Dremio on LinkedIn for further updates and insights on these technologies, and have a great day. And again, there'll always be more Gnarly Data Waves coming. So if you haven't subscribed to Gnarly Data Waves as a podcast on iTunes and Spotify, I recommend that you do, to make sure that you never miss an episode. And you all have a great day. And again, be thinking about that future for your data, and how the Lakehouse can bring you lots of value. I'll see you next time.
Vishnu Vardhan:
Thanks, Alex. Thanks, everybody.