How to Build a Modern Data Lake and/or Warehouse On-Prem

Video Session

Together, Dremio and Pure FlashBlade create a modern data lake and/or warehouse with the flexibility of cloud-native query engines and storage. This combination makes it simple to independently scale, operate, and upgrade systems. Your data teams gain agility from the ability to query data in place, whether it is stored as files or objects or in managed databases, allowing you to keep data in the best location with no extra copies. In this session, learn how to modernize your legacy data lake or warehouse into a modern Kubernetes-based platform with Dremio and FlashBlade, and how this architecture enables radical improvements in time-to-value for your data.

Video Transcript

Naveen: Hey guys. Good morning, good afternoon, good evening, wherever you may be. Today, we are going to talk about how to build a cloud data lake, or a cloud-like data lake, on premises. Like Dave said, I’m a senior solutions manager for analytics and AI at Pure Storage. Let’s get started with the agenda. So [00:00:30] we’re going to start with some challenges with legacy data architectures today and how a modern data architecture solves some of these problems. Then we’re going to talk about some requirements for modern infrastructure to create these cloud data lakes on premises, and we’ll show you how to accelerate data insights at your organization. Finally, we’ll conclude with some pointers to where you can find more technical, [00:01:00] in-depth resources about what I’m talking about, to show you some of the proof points and some examples of how other people have done it.

So let’s get started. I’m sure you’ve seen several slides like this throughout this conference; everybody starts with one of these slides, and everybody knows that today’s environment is in silos. You have data warehouses, you have a team working on streaming analytics, there’s a backup copy [00:01:30] of some data somewhere, a data lake, there’s a team working on AI and ML, and many times you have to create copies of your data in all of these different environments. These different environments have different teams managing them, different levels of service, different reliability standards, and different security, right? We all know this is not the way we want to be. We all want to shift to something that’s [00:02:00] more sane, secure, and reliable, that we can take from experimentation to production pretty rapidly. Everybody knows this, you’ve seen many slides like this, so what is the state we want to be in today?

And I know DataOps is a very buzzy word right now. MLOps is also a super big buzzword in the industry right now. What [00:02:30] developers want, what organizations want, is to automate their data pipelines and make them self-service. You want code that’s going through a CI/CD process and is ready for production anytime. You want faster time to insight, and you want to build these pipelines to create business value, right? You’re not creating pipelines for the sake of creating data pipelines. You may encounter new tools, you want to use [00:03:00] the latest and greatest tools, and you want to allocate the right amount of resources to the right project at the right time, right? Whether it’s an AI/ML project or just BI dashboarding or something else.
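As a concrete illustration of the “ready for production anytime” idea, here is a minimal, hypothetical sketch of a pipeline step written as a pure function with a unit test that CI can run on every commit; the function and field names are invented for the example.

```python
# Minimal sketch of a CI-testable pipeline step (names are hypothetical).
# The idea: every transformation is a pure function you can unit-test,
# so the code that passes CI is the same code that ships to production.
from datetime import datetime, timezone


def enrich_events(events):
    """Drop malformed records and add an ingestion timestamp."""
    now = datetime.now(timezone.utc).isoformat()
    return [
        {**e, "ingested_at": now}
        for e in events
        if "user_id" in e and "event_type" in e
    ]


def test_enrich_events_filters_and_tags():
    raw = [{"user_id": 1, "event_type": "click"}, {"event_type": "orphan"}]
    out = enrich_events(raw)
    assert len(out) == 1            # malformed record dropped
    assert "ingested_at" in out[0]  # enrichment applied
```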

And finally, you want open data systems, so you don’t get locked into a tool and then later find out, “I have to go to another tool, and now I have to migrate all this data, which is pretty painful.” So [00:03:30] in another year or two, you may be working with tools that are yet to be invented. This is what most data teams want, and we know that, but what are the infrastructure challenges that are preventing us from getting there? What are some of the challenges we face today? First, we have unpredictable performance. You’ve got data pipelines that service various teams with various requirements, and their jobs [00:04:00] might be slow, their queries might be slowing them down. Anybody whose query is stuck is going to just give up and not use the system, right?

So your users, your business leaders, and your customers are impatient and they want predictable performance, but it’s hard to tune every system and figure out where the bottlenecks are, whether it’s a latency bottleneck or a throughput bottleneck, or whether it’s just a process that’s stuck. It’s hard to [00:04:30] find out where the performance is bad. Second, there’s a lack of agility. Requirements can change on you at any time; you start building something and the requirements change, the tools change. So if your infrastructure is rigid, if your data is rigid and you have a certain set of resources allocated to you, “Oh, you’ve got these 10 nodes and you’ve got two terabytes of data,” that’s all you have and you have to work within that rigid infrastructure. [00:05:00] That’s going to be a problem for you to be agile and create value quickly for your end-user teams.

And finally, complexity, management complexity. It is difficult to plan capacity ahead of time. You plan for something and then you add a node or remove a node from a cluster, and suddenly your data starts rebalancing and you have to move data from one location to another; you need to install a patch; it’s just complex, [00:05:30] and the complexity scales with the data. You start with a few terabytes of data or less, and then you start scaling to more users, more data, more clusters and nodes, and what happens is that complexity goes through the roof along with your scale. Let’s say you have a certain analytics cluster; you can add higher-capacity nodes, [00:06:00] but when you add higher-capacity nodes, what’s going to happen is that when one of those nodes fails, it’s going to cause a huge amount of rebalancing in your cluster, especially with direct-attached storage, right?

I’m talking about direct-attached storage, where you have a hyper-converged infrastructure, where you have nodes. If you have higher-capacity nodes, it’s going to cause more rebalancing. If you have more nodes, then you have frequent failures; with hundreds of nodes, again, managing hundreds of nodes [00:06:30] is complex, patching them and securing them, and there are going to be lots of failures happening all the time. Either way has its problems. And if you’re using multiple clusters, different types of clusters, you may be under-utilizing resources in one area and over-utilizing resources in another area, and you cannot keep trying to rebalance those. And finally, each piece of analytics software you have in your pipeline, whether it’s Spark or Splunk [00:07:00] or Elastic or Dremio, whatever it may be, or some kind of deep learning software, you have to keep performance tuning, and users are always complaining about query speeds not being there or something not functioning, so you have to keep performance tuning.

All of these add complexity, and you guys are well aware of that. So let’s see how our data architectures have evolved to [00:07:30] take into account the paradigm shift that we’ve seen over the last 10 years. Back in 2015 and before, we had these Hadoop clusters and data warehouses where compute and data were co-located. You had these hyper-converged nodes, and you gave a certain number of nodes to a particular application, whether it was Hadoop or Spark or whatever [00:08:00] application that might be, and you had these nodes that you just [inaudible 00:08:06] to scale, like hundreds of nodes, to 200 nodes, 300 nodes.

And from 2015 to 2020, we moved into this cloud data warehouse world, where you were in the cloud and there was separation of compute and storage. Storage became an S3 layer in the cloud, and you had cloud data warehouses which would [00:08:30] separate out compute, so you bring compute to a query, and in the cloud you could bring essentially unlimited compute to a particular query for a few minutes and then spin it down when you don’t need it. This was a fantastic shift; it really brought elasticity and agility to the cloud world.

What we’re seeing in 2020 and beyond, especially with innovators like Dremio, is [00:09:00] cloud data lakes that are built on open data, where there’s a separation between compute and data. You have an open data layer on top of your storage that may be built on open metadata standards, open file formats like Parquet, and open table formats such as Delta Lake and Iceberg and other formats, and then you’ve got this open data layer on top of your storage, [00:09:30] and that open data layer is accessed by various applications, whether that’s Dremio, Spark or [inaudible 00:09:37] or whatever the application may be. This is the new world that we’re headed into, the open data world.
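As a rough sketch of what that open data layer looks like in practice, the snippet below writes a plain Parquet file to an S3-compatible object store with PyArrow; the endpoint, bucket, path, and credentials are placeholders, not a specific FlashBlade configuration. Any engine that speaks Parquet and S3 can read the result, which is the point of the open format.

```python
# A minimal sketch of the "open data layer" idea: write plain Parquet files
# to an S3-compatible object store so any engine (Dremio, Spark, ...) can
# read them. Endpoint, bucket, and credentials below are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(
    endpoint_override="https://flashblade.example.com",  # assumed S3 endpoint
    access_key="ACCESS_KEY",
    secret_key="SECRET_KEY",
)

table = pa.table({"device_id": [1, 2, 3], "temp_c": [21.5, 22.1, 19.8]})

# Files written this way are just Parquet on object storage: no engine lock-in.
pq.write_table(table, "datalake/iot/readings/part-0000.parquet", filesystem=s3)
```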

So we’ve spoken in theory about the various aspects: we spoke about the challenges, and we spoke about how open data architectures are addressing some of them. [00:10:00] Now let’s talk about the infrastructure underlying these cloud-like data lakes on premises, see how we build them, and give you some actual examples, right? Let’s start with an architecture, or [marketecture 00:10:19], diagram which shows the various layers. On top, you have your applications, right? [inaudible 00:10:24] Spark, Dremio, [inaudible 00:10:26], or whatever you’re using; you [00:10:30] could be running data science environments, you could be doing streaming real-time analytics, or you could be doing scale-out SQL analytics. Modern infrastructure is going to be built on containers, these applications are going to run in containers, and so hopefully you have something like a container-as-a-service, cluster-as-a-service, or platform-as-a-service layer that has containers and virtual machines, right?

That’s the picture to have in mind, so let’s look at [00:11:00] storage and how we bring this paradigm to storage. Below the Kubernetes layer, you’re going to have a layer for data management services for Kubernetes. As containers are spun up and spun down, that data management services layer is going to provide the storage to the Kubernetes layer. And then you’re going to have a layer [00:11:30] which is your modern data lake layer, based on open data formats, and this layer is going to be built on top of block or object storage, or it could be more legacy systems; it’s going to be built on a [inaudible 00:11:49].

So let’s double-click into that storage layer. I’m from Pure Storage, so obviously I’m going to double-click into that storage layer and find out, “What are some [00:12:00] of the requirements of that storage layer in this modern data analytics world?” And what are the key market drivers for this layer, for data today? Actually, not just the storage layer: what are the key market drivers for modern data delivery today? We’ve divided this into three trends, depending on whether you look at it from a business angle or from a data angle. [00:12:30] The first trend is that workloads are shifting towards AI and ML, and your data is increasingly machine generated. Today, unstructured, machine-generated data is just growing exponentially, everybody knows that: IoT data and geospatial data generated by devices, video generated by cameras, log data. All of these new data sources that are growing exponentially are machine generated, [00:13:00] and people are doing AI and ML on them.

Forward-thinking companies say that in 10 years, most of the code generated will be AI and ML code. The second shift we’re seeing is, of course, with the cloud, people are moving towards object storage for unstructured data. People are going to object stores, which are cheaper and easier to manage in many ways, and of course people are taking [00:13:30] the container approach, so you can bring compute to whatever you need rather than allocating specific compute silos. With containers, you get the elasticity and agility that you need. Finally, from an organizational and environmental perspective, you’re seeing security becoming a big concern, because data is now the new oil; it is your IP and you have to protect [00:14:00] it. There are ransomware attacks everywhere, locking up your data and demanding ransom, so you want to keep it safe.

And second, people are doing more predictive use cases, more real-time use cases, where you’re not just getting insights and putting them on a dashboard, but you have a piece of code which gets an insight and then takes an action. For example, in the case of trading, you get an insight and then you take an action, you have to trade a stock, [00:14:30] or you have to respond to a security threat, or the software takes the action. So for you to be able to take immediate action on the data, you need real-time data, and if the response is going to be automated, then you had better have real-time data, and it needs to be predictive as we move towards machine learning.

So these are the trends driving modern data requirements today, and if you can capitalize on them, you’ll be well ahead of the competition. [00:15:00] Now let’s look at what kind of storage we would build to meet these requirements if we had some magic pixie dust. The storage that I would build would be an object store capable of many things, so let’s see what those capabilities are. First, multidimensional performance: no matter what application I throw at it and whatever [00:15:30] the data is, the data could be different sizes, it could be sequentially accessed or randomly accessed, it could be batch or real-time jobs, it could be a large number of small files or a small number of large files. Whatever the file sizes, whatever the characteristics of the app, it needs to deliver high-throughput, low-latency, consistent performance, and that’s key.

Second, it needs to be an intelligent architecture built on today’s technologies, and today’s storage demands flash, [00:16:00] right? Beyond the speed aspect, flash is just easier: if it’s built on flash, it’s stable, it’s easy to manage, power requirements are lower, and it’s simple to operate with no tuning required; you want no tuning required. Next, it needs to be cloud ready. Even if you’re on premises, it needs to be agile infrastructure which [00:16:30] gives you the flexibility to bring compute to the data, with disaggregated compute and storage, and which also provides consumption choices that are cloud-like, right? You don’t want to pay for storage that you did not use; [inaudible 00:16:45] let me pay only for the gigabytes that I use today, rather than planning capacity for five years, buying everything today, and handing over all my money. People are very operationally focused, so pay only for [00:17:00] what you use.

And it needs to be reliable and always available. Even if you’re doing upgrades or adding capacity, you don’t want to take the storage down; it needs to be always available, no matter what you’re doing, upgrades or patches, and the data needs to be protected against ransomware attacks and against any kind of failure scenario. It also needs dynamic scalability: [00:17:30] as you scale data, you’re usually faced with more complexity and more performance issues, and you don’t want to deal with that just because you’re scaling your data; you don’t want downtime, and you want performance to go up with scale. And finally, multi-protocol support: you don’t want to bet all your dollars on one particular protocol. Different applications use different protocols. Cloud-like applications [00:18:00] use S3, the [inaudible 00:18:03] protocol, right? Object storage. But you may have legacy software using NFS, or even current software using NFS or SMB, so whatever protocol your application is using to access that data, that protocol should be available.

And it should also be native to the platform, so the performance is good no matter what protocol the application is using [00:18:30] to access data. So let me introduce Unified Fast File and Object storage. This is not the name of a product; it’s what we call a category of storage platform that meets the requirements I just outlined, right? Unified Fast File and Object is a platform engineered from the ground up [00:19:00] with flash to deliver simplicity, because literally nobody wants to deal with storage, it just needs to work and it needs to be out of sight, and at the same time to deliver the multi-dimensional performance that you need for today’s unstructured analytics workloads. We call this category of devices Unified Fast File and Object storage, and Pure Storage FlashBlade just happens to be the industry-leading platform for it.
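To make the multi-protocol idea concrete, here is a hedged sketch that reaches the same dataset two ways: once over S3 and once through a path on an NFS mount. The endpoint, bucket, and mount point are assumptions for illustration, and whether one array exposes the same data over both protocols depends on how it is configured.

```python
# Hedged illustration of multi-protocol access: the same dataset reached
# over S3 and over an NFS mount. The endpoint, bucket name, and mount
# path are assumptions, not a specific FlashBlade configuration.
import boto3
import pyarrow.parquet as pq

# Path 1: object protocol (S3) - cloud-native engines usually go this way.
s3 = boto3.client(
    "s3",
    endpoint_url="https://flashblade.example.com",  # assumed S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.download_file("datalake", "iot/readings/part-0000.parquet", "/tmp/readings.parquet")
print(pq.read_table("/tmp/readings.parquet").num_rows)

# Path 2: file protocol (NFS mount) - legacy tools can keep using paths.
nfs_copy = pq.read_table("/mnt/datalake/iot/readings/part-0000.parquet")
print(nfs_copy.num_rows)
```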

The other layer, on top of this storage: we spoke about building that open data platform, and we also need software to manage storage for your containers. As you spin containers up and down, you want that to be automatic, and the storage needs to be allocated when a container is spun up. And also when there’s a failure scenario, [00:20:00] when there’s a need for backup, a need to migrate data, a need to create a dev/test environment, a need to encrypt that data, when one container fails and Kubernetes takes action to create another container, to cover failure scenarios or scaling scenarios, all of those storage needs have to be addressed, and you need a [00:20:30] Kubernetes data services platform to address all those requirements. Pure Storage acquired a company called Portworx, which provides the industry’s leading Kubernetes data services platform for that.

It can be used for building, automating, and protecting your cloud-native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation. All of that is taken care of [00:21:00] with this hundred-percent-software Kubernetes data services platform. So this is something that’s essential to create the architecture you need for today’s modern data services.
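A minimal sketch of what that looks like from the application side: a Kubernetes PersistentVolumeClaim requested through a StorageClass, so the data services layer provisions and tracks the volume as containers come and go. The StorageClass name "px-replicated" and the namespace are hypothetical, not verified defaults.

```python
# Sketch: request a volume through a StorageClass and let the Kubernetes
# data services layer provision it. "px-replicated" is a hypothetical
# Portworx-style StorageClass name used only for illustration.
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="dremio-executor-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="px-replicated",  # assumed StorageClass
        resources=client.V1ResourceRequirements(requests={"storage": "200Gi"}),
    ),
)

# The claim is all the application declares; provisioning, replication,
# and lifecycle are handled by the data services layer behind it.
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="analytics", body=pvc
)
```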

So let’s talk about how, in the context of Dremio, this architecture is going to help you. First, Dremio is very versatile: you can access data [00:21:30] anywhere with any protocol, so with the mix of NFS and S3, data stored on FlashBlade can be accessed through NFS or S3 or SMB. You can use the multi-protocol approach for batch, streaming, or random access, whatever the workload might be. Also, you can start small and then just keep slipping in blades with no [00:22:00] downtime, with no need to do anything. FlashBlade is completely managed from the cloud, so when you want to add capacity, you just keep slipping in new blades and it adds capacity with no downtime. It’s super simple; there’s no need for tuning, no need for any of that.
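For example, once the data lands on the object store, a client can query it through Dremio from Python over Arrow Flight, roughly as sketched below; the host, port, credentials, and dataset path are assumptions, and TLS and authentication details vary by deployment.

```python
# A hedged sketch of querying Dremio from Python over Arrow Flight.
# Host, port (32010 is commonly used for Dremio's Flight endpoint),
# credentials, and the dataset path are assumptions for illustration.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio.example.com:32010")
token = client.authenticate_basic_token("analyst", "PASSWORD")
options = flight.FlightCallOptions(headers=[token])

query = "SELECT device_id, AVG(temp_c) FROM datalake.iot.readings GROUP BY device_id"
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)

# Fetch the result stream as an Arrow table.
table = client.do_get(info.endpoints[0].ticket, options).read_all()
print(table.to_pandas())
```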

Performance, in terms of throughput and bandwidth, goes up as you add more and more blades and more capacity, so there’s no added complexity; it just scales. [00:22:30] How can this help you migrate away from older infrastructure to a more modern architecture? This slide shows you that. Unified Fast File and Object storage helps you bridge the gap between existing infrastructure, which may be an HDFS cluster, and a modern data lake. You’re able to do this with a combination of a Hive source of [00:23:00] data, Dremio, and FlashBlade. You can start by sharing your table definitions with Hive and sharing your data across existing applications, and then start migrating tables, users, and queries slowly to the S3 interface and Dremio at your own pace. Most of us don’t have the luxury of a forklift upgrade or a weekend to just migrate all our data into this modern architecture, so you can do this over time, making sure that you’re [00:23:30] leveraging the latest S3 protocols while keeping your users happy with zero downtime.
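A rough sketch of one incremental step in that migration, under assumed names and endpoints: read a table that still lives in the Hive/HDFS world with Spark and land it as Parquet behind the S3 interface, one table at a time, while existing users keep querying the original.

```python
# A rough sketch of the incremental migration idea: copy one Hive table
# at a time out to Parquet on the S3 interface. Endpoint, credentials,
# and table/bucket names are placeholders for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-to-s3-migration")
    .enableHiveSupport()
    .config("spark.hadoop.fs.s3a.endpoint", "https://flashblade.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Existing users keep querying the Hive table; this copy moves one table
# to the object store, where Dremio can pick it up at your own pace.
spark.table("warehouse_db.sales_2020").write.mode("overwrite").parquet(
    "s3a://datalake/migrated/sales_2020"
)
```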

And this diagram brings the whole architecture together: it shows the application layer we spoke about, the Kubernetes layer, and then the data management services for Kubernetes, which here is [00:24:00] Portworx. You build an open data lake on top of Pure FlashBlade, with whatever metastore and open data format you choose, with the tables or Parquet files and data lake tables on top, and we’ve tested this and we’ve seen it work very, very well. Finally, to summarize: what you’re doing with all of this is [00:24:30] basically speeding up time to insight. You’re consolidating all your data onto one single, massively scalable platform, so you can go from a few terabytes to multiple petabytes and more, consolidating all your storage into a single device, and then you can bring compute to it whenever you need it.

Again, it delivers consistent performance and consistent security. It’s got something called SafeMode, which locks it against ransomware attacks, so you get consistent [00:25:00] performance and security, and FlashBlade is managed from the cloud, so it’s very simple to manage; you can literally forget about it, and it can be managed with APIs and the latest [inaudible 00:25:10], so you can just forget about managing storage. That’s going to help you tremendously speed up time to insight and also increase your agility as you support various use cases and more data sources, going from simple dashboards to machine learning to actually [productionizing 00:25:31] [00:25:30] machine-learning-based software, right? Whatever the use case may be, it’s going to give you the performance that you need. It gives you disaggregated compute and storage, so you can bring a lot of compute to a problem for a few minutes and then take it away for another problem, right?

It gives you that elasticity, with very, very high reliability; literally, the storage is never going to be down. [00:26:00] And finally, it simplifies operations as you scale. Like I said, Pure [inaudible 00:26:05] is managed from the cloud, and it can be consumed as a service, completely storage as a service: you only pay for what you use, and you never have to be down for any upgrade or patching. Even if you need to do a controller upgrade, that’s all covered by Pure’s Evergreen guarantee. We want you to keep the storage that you have and never pay again for storage that you already bought.

[00:26:30] And finally, you can support all these multi-tenant applications at scale and make everything self-service, and that’s our vision: to make analytics and AI scalable, self-service, and automated. If you want to learn more about this, there are many customers doing this today, and we’ve got a very technical document written by Joshua Robinson, a chief [00:27:00] technologist at Pure, who developed a lot of this content. He’s written a very detailed blog describing it; it’s on Medium, and I can throw that link in the chat. We also have a glossy solution sheet to walk you through the solution and some of the benefits to you. With that, we’ll switch over to Q&A.

Dave: Great. [00:27:30] All right, let me get set up here. Okay, thanks Naveen. Let’s go ahead and open it up for Q&A. Again, if you have any questions, you can use the button in the upper right-hand corner to share your audio or video and you’ll automatically be put into a queue, and if for some reason you’re having trouble with that, you can just ask your question in the chat. All right, let’s take a look and see what we’ve got over there. Okay, we’ve got a couple of questions. All right, the first one is: [00:28:00] what is the difference between a FlashBlade and a RAID array of SSDs?

Naveen: Fantastic, this is a great question. A RAID array of SSDs is that direct-attached storage architecture that we spoke about, where you have compute and storage co-located in a single node and you’re just scaling the nodes, and we spoke about some of the disadvantages of that, right? That’s the old architecture, the old way of doing things, [00:28:30] the hyper-converged architecture, where you have a fixed amount of compute and a fixed amount of storage, and if you need to add storage, the compute just comes along with it.

Let’s say I have very few queries, but I’m getting more data and I need to add storage; I’d have to add a couple of extra nodes there, right? Or if I have a lot of queries on very little data, I still have to buy the storage along with the compute, right? We want to take that cloud-like approach, where the storage and the compute [00:29:00] are disaggregated. So what you want to do is run some nodes with just the operating system and the basic functions on the local SSDs, the local drives, and keep all your data on a centralized file and object store, what we call UFFO. That way you create an open data layer that can be used by any application, so you’re not locking yourself into silos. To me, that’s the difference between FlashBlade and a RAID array of SSDs. I hope that [00:29:30] answers the question.

Dave: Okay. Folks, if you have any other questions, let’s go ahead and get them in. We’ve got one more question here. They say, “We’re using MinIO; would that work with this type of platform?”

Naveen: Yeah. So MinIO is similar to FlashBlade, except FlashBlade is [inaudible 00:29:53] software. FlashBlade is built from the ground up to be a very reliable and [00:30:00] performant object [inaudible 00:30:05] store; MinIO is more like the quick, cheap-and-dirty version of that. But yes, you could use our Portworx software to completely manage all your storage; it works with any storage, any infrastructure. With FlashBlade, you’re getting a much better, much more ground-up-built version [00:30:30] for your specific needs: low latency, simplicity, and other characteristics.

Dave: Okay. Let’s see here. Oh, I didn’t realize we’re out of time already. All right, that’s all the questions we have time for in this session. If we didn’t get to your question, and I think we got them all, but if we didn’t, you can hit up Naveen in Slack. [00:31:00] Before you go, we would appreciate it if you would please fill out the super short Slido session survey, which you’ll find in the chat. The next session is coming up; I think we have a panel actually, or a keynote, a fireside chat, I believe, so please join us there, or you can go to the [inaudible 00:31:20] and check out the booths and the demos you’ll find there and some awesome giveaways. Thanks everyone, enjoy the rest of the conference, and thanks Naveen.

Naveen: Thank you so [00:31:30] much for the session. Thanks Dave.

Dave: Okay guys, take care.

Naveen: [00:32:00] Dave, you still there?

Dave: I’m still here, and I think there’s still a bunch of people on, so they haven’t cut us off yet. I don’t think it’ll get added in, but just for reference, there’s another question that was: is FlashBlade different from a standalone BOC of SSDs? I don’t know what BOC stands for.

Naveen: We’ll try and get that [00:32:30] over Slack, maybe.

Dave: Yeah. [inaudible 00:32:35] Slack. Here, I’ll do it for you, I’ll paste that question into your Slack channel and I’ll post the link, just give me a second here. All right. So… [00:33:00] all right, let me copy the link. Okay. For those of you who are still here, it looks like there are still 20 or so people here, just in the chat.

Naveen: Yeah, if people are still there, I can still try and answer that question. So basically what the user, the customer, is asking is: [00:33:30] how is it different from just a bunch of SSDs put together, right? There’s a lot of design that has gone into FlashBlade to create, let’s say, three things. One, most people buy FlashBlade for simplicity. If you just put a bunch of SSDs together, you need to manage those, performance is going to be inconsistent, and you have to tune it for the different application workloads, and you wouldn’t [00:34:00] have multi-protocol access. There’s a lot of engineering that’s gone into building FlashBlade from the ground up. Actually, if you go to YouTube and just search for FlashBlade, there’s a Field Day session with Brian Gold where he actually walks through all of the design that’s gone into it.

You can put a bunch of SSDs together and make it work for a few terabytes of data, but as you start scaling, you’re going to see all these problems with complexity and performance. [00:34:30] We have several large FAANG companies using petabytes and petabytes of data to do machine learning on top of Pure’s storage devices. You cannot do that and just manage it with one or two guys and forget about storage, right? You’d have to have people who are extremely technically smart to manage petabytes and petabytes [00:35:00] of data storage, and with Pure you can literally set it and forget it; it will just exist and you don’t have to manage it, you don’t have to tune its performance, it’s just going to keep delivering that simplicity, performance, and scale. That’s it, right?

That’s the value that [inaudible 00:35:18]. And again, go check out that Field Day session by Brian Gold from Pure Storage, and he’s going to explain how we built this ground-up architecture to scale. It’s actually not only storage; there’s storage, [00:35:30] compute, and networking built into every blade in FlashBlade, so that you get that linear increase in performance as you scale, and you’ll see the details in that video. That’s a long-winded answer to that question.

Dave: Okay. And by the way, he corrected himself: instead of B-O-C, it was B-O-X. So, box.

Naveen: [crosstalk 00:35:52] That’s essentially what I answered. It’s like asking whether you can take a bunch of SSDs, put them together in a box, and it becomes a FlashBlade, right? There’s just so much engineering [00:36:00] that goes into it, and there are really smart PhDs who worked on putting these things together to make it scale, to [inaudible 00:36:10]

Dave: Okay. Well, we’re way over now, so I think we should [inaudible 00:36:14]. I did post the link there; you might just go check out Slack to see if anybody else is there, but at this point I think we should just wrap it up. So thanks again, Naveen, for sticking around a little extra, and thanks for your talk.

Naveen: Thank you guys. [00:36:30] See you on Slack.

Dave: Yep.

Naveen: Bye.