May 2, 2024

Building and Benefitting from the On-Premises Lakehouse: How NetApp and Dremio Enable On-Premises Data Lakehouse Architectures

For many organizations, on-premises and hybrid analytics architectures are here to stay. Join NetApp and Dremio as they present how to build and deploy on-premises, self-service lakehouse architectures. We’ll talk about the benefits and best practices of on-premises lakehouses. We’ll share learnings from diverse open-source data lakehouse architectures and discover the drivers for lakehouse performance. You’ll also learn how NetApp built its own lakehouse with Dremio and gain insights into tangible benefits and lessons learned.

Topics Covered

Dremio Use Cases
Lakehouse Analytics

Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.

Vishnu Vardhan:

Today, what we want to do is talk about how you can build and benefit from an on-prem lakehouse. We just want to give some credit here. A lot of what I’m going to talk about today is work that Aaron Sims at NetApp has done. So we stand on the shoulders of giants, and so I just want to give a shout out there. What we’ll do today is talk about a lakehouse at NetApp called Active IQ. Active IQ is foundational for our business, and I’ll touch upon why. And we moved Active IQ through multiple generations of lakehouse architectures, and it finally landed on Dremio, and we’re very happy about that. So I’m going to talk about how that journey was, and how the infrastructure choices under that helped us achieve better ROI. Fundamentally, we moved Dremio to an object storage platform called StorageGRID, and we’ll talk about that. My name is Vishnu Vardhan. I’m the director of product management for StorageGRID, and I’ll walk you through this journey.

About NetApp

So with that, first, just a quick overview on what is NetApp. We are a leading storage provider. We help you manage your storage everywhere on-prem and in the cloud, and across multiple clouds. We are the leader in all-flash storage, and we are present in almost every major hyperscaler as a first-party solution. AWS, Azure, and Google manage StorageGrid and provide it themselves as a first-party solution. So we are the only storage provider to be able to do that. We primarily offer three kinds of services, file, block, and object storage. And the object storage is called StorageGrid, which is what we’ll talk about today. 

Active IQ

So let’s talk about Active IQ and the platform that we refer to. So Active IQ is our telemetry system. It gets telemetry data from hundreds of thousands of devices that are out in the field and at customer locations. These could be physical devices, or they could be software-based instances running in the cloud. We get tens of trillions of those data points into our data lakehouse, which we then use for multiple different use cases. There are three fundamental use cases that are important there. The first is for customers to be able to get insights from the data, to see if they’re following best practices, and for us to advise them on how to best use our platforms. The second use case is for product management and the business, for us to make decisions on what exactly we should do next with our products. So it’s pretty critical for us from a future planning perspective. And the third, at the heart of this, is really how we are able to support our customers and give them a very good customer experience. And so Active IQ is the basis for all of this data and touches every part of the organization. So for us to do this, we have built a bunch of different applications, and all of their data is sourced from Active IQ. And this data is refreshed on a daily basis, with new streams of data coming in on an ongoing basis.

Architecture Before Dremio

So given the criticality of this platform, I think it’s kind of important to talk about how we got here and where we were before Dremio. So this has been around for many years now. And in its early iterations, it was really flat files. And we had an Oracle relational database, and we were using that as the basis of our lakehouse, or our warehouse. Really early in the 2010s, we realized that we had to scale beyond Oracle. And so we moved to a Hadoop-based platform. We effectively got data coming in via Flume and Kafka streaming. We had the data landing on a Hadoop CDP infrastructure. And we provided our business users access via Hive and Impala. Of course, people built their own Tableau dashboards and notebooks. And that’s how our customers were using this data.

This was a multi-tenant infrastructure. So we had multiple business users and business groups using it. And so it was fairly large. From an infrastructure perspective, what we did was we built stamps of mini clusters. Each cluster was about four compute blades attached to a shared storage system. So we called that a mini cluster. And we had many instances of these, for a total of more than 133 compute blades in our infrastructure. This was seven petabytes of storage with 4,000-plus cores. So a pretty large infrastructure with data coming in on an ongoing basis. So this was well and good when we made the transition. We were pretty happy with the change from flat files. But as the system became bigger and as we started to mature our use cases, we started to see some challenges. And fundamentally, four or five challenges.

The first, of course, was the fact that the data was increasing and scaling. But every time we had to add more storage capacity, we had to add more compute. And so that was becoming– when that happened at smaller scales, it was less problematic. But as our scale became larger and as our percentage growth remained the same, the amount of compute we were adding was more and more. And so this started to become a pretty important problem for us to address. Performance was critically important. Many queries ran for about 45 minutes. And that was far too long and just not responsive enough to what our business needed. And so performance was something that, as it scaled and became large enough, became very important for us to address. 
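The coupling described here can be made concrete with a little arithmetic. The sketch below takes the ratio from the talk (about 133 blades for 7 PB) and applies a growth rate of 20% per year, which is purely an assumed figure for illustration. It shows that a constant percentage growth in data forces a larger absolute compute purchase every year when storage and compute must scale together.

```python
import math

# From the talk: roughly 133 compute blades serving 7 PB of storage.
BLADES_PER_PB = 133 / 7
# Assumption for illustration only: the talk does not give a growth rate.
ASSUMED_ANNUAL_GROWTH = 0.20


def blades_needed(storage_pb: float) -> int:
    """Blades forced by storage size when compute and storage scale together."""
    return math.ceil(storage_pb * BLADES_PER_PB)


storage_pb = 7.0
added_per_year = []
for year in range(1, 4):
    before = blades_needed(storage_pb)
    storage_pb *= 1 + ASSUMED_ANNUAL_GROWTH
    after = blades_needed(storage_pb)
    added_per_year.append(after - before)
    print(f"year {year}: {storage_pb:.2f} PB, {after - before} blades added")
```

With decoupled storage and compute, the capacity growth would not by itself require any of these blades.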

The other issues we had were around just some operational controls. So the Hadoop environment did not give us enough fine-grained control in terms of being able to limit CPU resource utilization, for example, for a particular query. And what would happen is a particular user would set up a query in such a way that it would hog a lot of the compute resources we had and would starve other jobs and other queries from being able to complete. So it was problematic for us in terms of how finely we could control resources. And there were some coarse controls, of course, cgroups and things like that. But providing fine controls was challenging for us. And then we had these upcoming challenges that we were increasingly facing. Initially, we didn’t have a lot of PII data. We didn’t have a lot of customer-identifiable data. But as our use cases became more prevalent, we needed to be able to provide fine-grained data governance just to handle PII and customer-identifiable information. And that was something that we were lacking in our Hadoop environment.

Lastly, from an infrastructure perspective, we wanted to go to a more efficient storage mechanism. Hadoop keeps three copies. And we were looking at many ways to optimize it. And as I talk through the slides, you’ll see that by the time we reached StorageGRID, which is NetApp’s object storage solution, we were able to dramatically improve our storage efficiency and significantly reduce what was seven petabytes. So these were the main challenges that we had.

So when we started to look for a solution, we, of course, wanted to solve these challenges. And the goals are just the opposite of the challenges I discussed: decouple storage and compute, better performance. But there were three other things that we wanted to do, which were different. So first is the fact that we had a lot of investment, a lot of infrastructure investment, in compute and storage. And we wanted to figure out how we could reuse that. And so that was important for us. The second new piece, compared to the challenges that I already addressed, was that we also had investment in our data pipelines. And we wanted to make sure that we could bring those data pipelines along and minimize the rework that we had to do in those data pipelines. And lastly, a part of what we need to do is really DR planning, to make sure that this infrastructure is available in a disaster event. And so really, how we simplify our disaster recovery plans was something that was of concern to us. So in addition to solving all the challenges that I spoke about earlier, we needed to reuse our infrastructure, make sure the migration of our data pipelines was minimal, and improve disaster recovery.

Why Dremio

So based on that, we selected Dremio. And there were three main reasons why we reached Dremio as a conclusion. So first was the fact that it gave us a lakehouse query engine. It gave us this comfort that we could move to something else and be able to query even that other engine, and have multiple engines federated behind one infrastructure, which is Dremio. That was compelling. And that gave us this freedom of choice, not being locked into one solution, but being able to plan for future migrations, as this was already our second migration. The second reason that we picked it was the semantic layer, where we could create our own views, create our own schemas, as it were, for particular business users or particular use cases. And a business team internally needed completely different data from the support team, which needs a completely different set of views into the data from what our customers needed. Right there, we had three different views of the data. And the way we were doing it was extremely inefficient. So creating this semantic layer was something that we found to be very compelling and simplifying for our use cases. Lastly, and most importantly, of course, was performance. We wanted to make sure that our performance was great and we were able to address the 45-minute query times that we had before.
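To illustrate the idea of a semantic layer with per-audience views, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a query engine. It is not Dremio's API, and the table, columns, rows, and view names are all hypothetical. The point is only that one copy of the telemetry can back three differently shaped views for the business, support, and customer audiences.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE telemetry (
    system_id TEXT, customer TEXT, product TEXT,
    firmware TEXT, open_cases INTEGER
);
INSERT INTO telemetry VALUES
    ('sys-1', 'Acme',   'A400', '9.12', 2),
    ('sys-2', 'Acme',   'A800', '9.14', 0),
    ('sys-3', 'Globex', 'A400', '9.12', 1);

-- Business view: fleet size per product, no customer names exposed.
CREATE VIEW product_fleet AS
    SELECT product, COUNT(*) AS systems FROM telemetry GROUP BY product;

-- Support view: only systems that currently have open cases.
CREATE VIEW support_queue AS
    SELECT system_id, customer, open_cases FROM telemetry WHERE open_cases > 0;

-- Customer view: each customer's systems and firmware levels.
CREATE VIEW customer_systems AS
    SELECT customer, system_id, firmware FROM telemetry;
""")


def rows(view: str) -> list:
    """Return all rows of a view, ordered by its first column."""
    return conn.execute(f"SELECT * FROM {view} ORDER BY 1").fetchall()


for view in ("product_fleet", "support_queue", "customer_systems"):
    print(view, rows(view))
```

Each audience queries its own view, while the underlying data is stored and refreshed once.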

So what we did is we re-architected to Dremio. And I’ll touch briefly upon how the Active IQ team did that. But just in terms of what the architecture was, we moved away from Hadoop to Dremio. Dremio was running on a Kubernetes infrastructure. It was backed by StorageGRID, which is the object storage platform from NetApp. StorageGRID is able to provide a single global namespace, which means that you can access that data active-active from any location in the world. And that plays a key part in how we were able to solve our DR problem, which I’ll cover. With StorageGRID as the object storage layer, we were also able to drive storage efficiencies. Previously, with Hadoop, as you know, Hadoop keeps three copies of data. StorageGRID is able to erasure code data. And what we did was we used 4+2 erasure coding, which dramatically reduced the storage footprint. StorageGRID can do even better and can do 6+1 erasure coding just in a single site, or can do multi-site erasure coding that drives it even lower. The Active IQ team initially started off with 4+2 as the erasure coding scheme.
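The efficiency difference between replication and erasure coding follows from simple arithmetic: a k+m scheme stores k data fragments plus m parity fragments, so it writes (k+m)/k raw bytes per logical byte. The short sketch below compares the schemes named in the talk. The function names are illustrative, and the calculation ignores compression, metadata, and data purging, so it does not by itself reproduce the footprint numbers quoted later.

```python
def replication_overhead(copies: int) -> float:
    """Raw bytes stored per logical byte with full copies."""
    return float(copies)


def erasure_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Raw bytes stored per logical byte with k+m erasure coding."""
    return (data_fragments + parity_fragments) / data_fragments


schemes = {
    "3x replication (Hadoop default)": replication_overhead(3),
    "erasure coding 4+2": erasure_overhead(4, 2),
    "erasure coding 6+1": erasure_overhead(6, 1),
}

for name, factor in schemes.items():
    print(f"{name}: {factor:.2f} raw bytes per logical byte")
```

A 4+2 layout can lose any two fragments and still rebuild the object, while needing half the raw capacity of three full copies.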

What we also did was use policies in StorageGRID so that we could automatically manage old data and purge it out of the grid, and have retention policies that said certain kinds of data should be kept for longer durations, or should be kept in compliance with regulations, and really use policy to keep storage in a particular way. So with Dremio running on Kubernetes and StorageGRID as the storage layer, we’ve been able to shrink our storage footprint from what was seven petabytes down to three petabytes. And that infrastructure was able to serve more than 8,900 tables. We had significant query time acceleration as a result of this. We were able to improve query times by up to 20 times, so 10x to 20x faster query performance. We were able to do this with about 60% fewer cores and a much better TCO. So we were very, very satisfied with the transition in terms of its business outcomes. Dramatically fewer cores, dramatically better performance on those fewer cores, resulting in a much better TCO. And I’ll summarize that shortly in terms of our key takeaways.

Migration Journey

So we were very happy with the outcome. I think part of this question is also what the journey was to get there. So what we did was we actually migrated and copied all the data over to the Dremio cluster. And then we changed our data flows so that our data pipelines were aware of the Dremio cluster. What we did next, and this is not necessarily the best practice, and I’m sure there are other approaches out there, this is just sharing what the Active IQ team did, is we researched the queries that users were building. We worked with individual queries that we thought might be problematic and made sure that they were working on the new Dremio cluster. And then, just from a change management perspective, we worked with the users to help them prepare for the migration, gave them a new environment, and helped them run their queries on that new environment so that they could transform their queries if needed. And then we cut the users over in groups. And this was really the most concerning part: how disruptive would this be? And so as you can see, this is probably a pretty conservative way to do the migration. But it went very well. And we were able to migrate 130-plus users in two months with almost no problems at all. So we were able to migrate all of our users over in a very satisfactory manner. So that was our overall migration journey here.

NetApp StorageGRID

So all of this was based on, of course, the Dremio layer. But the object storage underneath was StorageGRID. And the key thing that StorageGRID is able to do is to create this single global namespace across multiple sites. So, as this picture shows, you can be in San Francisco and New York. And you can write your data in San Francisco and read it from New York the very next minute, or next second, or next millisecond. And StorageGRID will go and pull the data from San Francisco out to New York. So it creates a single global namespace that you can have across all of your sites or all of your data centers. Once you have that multi-site infrastructure, there’s this question of how you move the data around. So one is, hey, can I give you a global namespace? But there’s a second question: if I want to keep a copy in my primary and a copy in my secondary, how do I do that? And StorageGRID has an extremely powerful policy engine that lets you keep the data where you want it to be. And so StorageGRID provides that engine that simplifies all of the workflows.

We do this at scale: 300 billion objects in a single grid, in a single namespace that can scale to about 800 petabytes. We have customers that have multiple grids, of course, which scales this even further. But just one instance of StorageGRID itself can scale to about 800 petabytes with unmatched durability. And I think this was really important in terms of how we were able to solve for DR use cases. So if you look at what we did for DR with Dremio: previously, it was extremely complicated because we had to bring up the entire Hadoop stack in a remote site. And that was very difficult. With Dremio, because StorageGRID is active-active, it’s already present on two sites. The data is always there on the two sites. So the only failover we had to do in a DR scenario was bring up a Dremio cluster. And so as you can see in the picture, there’s that load balancer. There’s a load balancer on both sides, as an example. And because the endpoint is the same, when we bring up Dremio, and because Dremio is based on our Kubernetes infrastructure, as I said before, all I need to do is run a Helm chart. And when I run the Helm chart, it just brings up my Dremio cluster, and there’s no other change to the application. It just goes and queries the exact same data, and all of that is already available in the secondary site. So StorageGRID is able to keep the data in sync, and we’re able to, at the push of a button, fail over and do nothing on the storage side. It’s all just available all the time. So StorageGRID helped us dramatically simplify our DR workflows. That, as I said before, was one of the key things that we were looking to do.
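As a rough sketch of what "just run a Helm chart" could look like, the snippet below builds the same `helm upgrade --install` command for the primary and the secondary site. The release name, chart path, namespace, values file, and Kubernetes context names are all hypothetical; the talk does not describe the actual chart or its settings. The idea it illustrates is that, because the object store endpoint is identical at both sites, the values file does not change and only the target cluster differs.

```python
import shlex


def helm_failover_command(
    kube_context: str,
    release: str = "dremio",          # hypothetical release name
    chart: str = "./dremio-chart",    # hypothetical local chart path
    namespace: str = "dremio",        # hypothetical namespace
    values_file: str = "values.yaml", # same file at both sites: the S3 endpoint is unchanged
) -> list:
    """Build the Helm invocation that brings up the cluster at one site."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--kube-context", kube_context,
        "--namespace", namespace, "--create-namespace",
        "-f", values_file,
    ]


primary = helm_failover_command("site-a")
secondary = helm_failover_command("site-b")

# Print the failover command; pass the list to subprocess.run to execute it.
print(shlex.join(secondary))
```

The two command lines differ only in the Kubernetes context, which mirrors the point that nothing has to be done on the storage side during failover.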

I think just a couple of other things in terms of StorageGRID and object storage. As I said, the policy engine is the other big thing. It allows us to very simply define where the data lives, how long it is stored there, and at what level of protection. You can say, “I want to do DR across two sites.” You could say, “I want DR across three sites.” You could say that particular data sets should never be deleted, while others should be kept for X number of years and then deleted, and this is configurable using the metadata of the data. So you could say, “Hey, this particular object has a tag, and because it has a particular tag, I’m going to treat it differently.” It’s extremely powerful in terms of how ILM is able to move your data around in StorageGRID.
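The behavior described here (placement, retention, and tag-based exceptions) can be modeled as an ordered list of rules where the first matching rule wins. The sketch below is a toy model of that idea only; it is not the StorageGRID ILM API, and the rule names, tag keys, site names, and retention periods are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ObjectMeta:
    key: str
    age_days: int
    tags: dict = field(default_factory=dict)


@dataclass
class Rule:
    name: str
    matches: Callable[[ObjectMeta], bool]
    sites: tuple                  # where copies should live
    retain_days: Optional[int]    # None means never delete


# Hypothetical policy: rules are checked in order, first match wins.
RULES = [
    Rule("legal-hold", lambda o: o.tags.get("hold") == "legal",
         ("site-a", "site-b", "site-c"), None),
    Rule("default", lambda o: True, ("site-a", "site-b"), 3 * 365),
]


def evaluate(obj: ObjectMeta, rules: list = RULES) -> dict:
    """Return the placement and keep/delete decision for one object."""
    rule = next(r for r in rules if r.matches(obj))
    expired = rule.retain_days is not None and obj.age_days > rule.retain_days
    return {
        "rule": rule.name,
        "sites": rule.sites,
        "action": "delete" if expired else "keep",
    }


print(evaluate(ObjectMeta("logs/2015.parquet", 5000)))
print(evaluate(ObjectMeta("audit/2015.parquet", 5000, {"hold": "legal"})))
```

The tagged object is kept on three sites indefinitely, while the untagged object of the same age falls through to the default rule and is deleted.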

Business Outcomes

So what was our overall takeaway here? For us, fundamentally, we were able to reduce our compute by about 60%. That was a significant compute saving. On that 60% compute reduction, we were able to get a 10 to 20X performance improvement. So if we hadn’t reduced that compute, the performance would be even faster. Just to note, these two are on two different axes, and the scale of the change is dramatic. So with a 60% reduction in compute and a 10 to 20X performance increase, we were able to get a net TCO savings of about 30%. I think most importantly, the number of users that we had: we started off, as I said earlier, at about 130 users. We now have double the number of users, and we haven’t had any impact on query time. Our queries today are completing in two minutes. For context, what we were seeing before was 45 minutes. So just the max query times are so different. I’m not even talking about the median and the best case. Just the worst case is so much better that I think we are very, very satisfied with the overall migration.
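The figures quoted here can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the talk (45 minutes to 2 minutes, a 10 to 20X speedup, 60% fewer cores); the per-core figure rests on the simplifying assumption that performance scales linearly with core count, which real workloads rarely do exactly.

```python
def per_core_gain(speedup: float, core_reduction: float) -> float:
    """Speedup per remaining core, assuming work scales linearly with cores."""
    return speedup / (1.0 - core_reduction)


# Worst-case queries: about 45 minutes before, about 2 minutes after.
worst_case_speedup = 45 / 2

print(f"worst-case speedup: {worst_case_speedup:.1f}x")
for speedup in (10, 20):
    gain = per_core_gain(speedup, core_reduction=0.60)
    print(f"{speedup}x on 40% of the cores is roughly {gain:.0f}x per core")
```

The worst-case improvement from 45 minutes to 2 minutes is in line with the upper end of the quoted 10 to 20X range.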

So key points that I want to make here. So we were able to successfully move from Hadoop to Dremio. We thought that migration was going to be hard. We took a very conservative approach in that migration, a more expensive way to do the migration, a more difficult way for us to do the migration, and there are easier ways to do it. And it turned out it was just very, very simple. We were very conservative, and the migration was seamless for us. The second thing, I think, is Dremio is great with object storage. And in fact, it seems to be better on object. I can’t speak to Dremio. I can speak to StorageGRID and object storage. But it seems to me, from all the testing that we have done ourselves, that Dremio is actually faster with object storage than it is otherwise. But speak to a Dremio expert here.

Lastly, StorageGRID as the underlying storage solution, with its global namespace and its ILM capabilities that are able to move your data to where it needs to be, offers a great object storage solution. So together, between Dremio and StorageGRID, we’re finding that this combination is extremely, extremely viable. We’re also seeing this in multiple customers who are doing this at scale. And I know some of you are here at Subsurface, so we look forward to meeting you more. And if anybody else wants to learn more about StorageGRID, please stop by here at Subsurface. We are here in person. So that’s what I had.