Dremio Data Lake Engines Subsurface LIVE Sessions

Session Abstract

Organizations struggle with performing modern data engineering and data science because of a number of challenges. Processes are antiquated, tools and systems are outdated, and yet the demand for analytics to be infused into every aspect of the organization continues to grow. HPE recommends an industrial approach to solve these challenges. By looking to modern manufacturing systems, we can copy patterns to enable repeatable, reliable, and performant processes through the use of modern technology. HPE Ezmeral Container Platform and Dremio enable data and analytics across the enterprise through this industrial approach.

During this session we will discuss those challenges organizations face and why an industrial approach is the best way to solve those challenges. Using examples from manufacturing, we will describe how the HPE Ezmeral Container Platform provides the factory and tooling while Dremio delivers just-in-time access to data across the enterprise with streamlined processes for deployment.

Join us for this session of business concepts and technology discussion to gain insight into how your organization can take an industrial approach to data and analytics.

Video Transcript

Matt:    Thank you very much. Hello everybody. I'm going to be talking to you today, as you can see in the title, about how we can bring more efficiency to the enterprise data workloads that we know many organizations are struggling with, which will help you accelerate time to insight and give you better access to data. In the presentation today, we want to cover a number [00:00:30] of topics. First, I want to start by talking about how organizations have been struggling with these challenges of doing data science and data engineering at enterprise scale, and how to address that by taking an industrial approach. I'll explain what that means in a minute, but specifically it means taking the concepts of research and development and assembly-line processes, taking the theme of just-in-time parts in manufacturing, building an assembly line, and seeing what the results are from following that. We're then going to have Tom [00:01:00] Phelan, my colleague, talk about how Ezmeral and Dremio actually enable these industrial analytics. Tom's going to go into an overview of the Ezmeral Container Platform, talk about how Dremio can be deployed via the Ezmeral platform as part of our certified ISV marketplace. And then we'll wrap it up with some conclusions and open it up for that live Q and A. Let's get started. So, as I mentioned, organizations have challenges (of course organizations have challenges), but I want to talk specifically about the challenges [00:01:30] organizations have doing analytics at broader scale. I talk to a number of organizations, whether they're public or private or government entities. Everyone can do an analytics project once; you can develop the killer ML or AI application and put it into production one time. Organizations can move heaven and earth to do that one time, but how do we do that over and over and over? How do we infuse analytics into everything that we do? What is standing in the way of that? 
So what we [00:02:00] see is, of course, that the demand for analytics is really high. Any executive reading the news and the press sees that analytics, machine learning, and artificial intelligence are increasing by the day. And as part of that, we have new applications, new systems, new partnerships feeding more data. We have more data sources, and it's exploding at an exponential rate. The challenge is that we don't have enough people to deal with this. We can't just throw all the data scientists we have [00:02:30] at the problem to try and solve this. We can't hire enough data engineers to go work on these issues. And too much time is wasted because oftentimes we have legacy tools and IT processes that are standing in the way. And then organizationally, from a business perspective, we have too much ad hoc responsiveness. So how do we fix this? I'm going to use an analogy as we go through this. The image here is of a factory line. So for this industrial approach, [00:03:00] this industrialized way of doing analytics, data science, and data engineering at scale, I want you to think of a modern manufacturing plant. You don't have to have been in one; just think of the robotics, the automation. Think about a car plant when we see those commercials. And so the first trick to doing this is that we have significant automation, that we have the right tools in place to be able to do this so that we can repeat this process over and over. We can build, train, deploy, test, redeploy, [00:03:30] optimize, retrain, and redeploy the models over and over again. And like a lot of manufacturing organizations, I want you to think about having this R&D aspect: an automobile that's coming out a couple of years from now, or one that has come out this year, the 2022 models that are coming, was thought of years in advance. And that was part of an R&D lab. 
In IT and technology, we think about this in terms of a center of excellence. And I'm going to explain a little bit more about what that means and what [00:04:00] organizations should specifically be doing to try and solve for that. After we've sorted out that process, I want you to think about this concept of just-in-time manufacturing. I believe Toyota optimized these processes decades ago: instead of having warehouses full of all the parts just waiting around, taking up space, you have the right tools, and in this case data, available right when you need it. So instead of my data scientists and analysts hunting around asking people where they can [00:04:30] find the data sources, the data is made available for them right when they need it. And that requires the right technology in place as well. But specifically, we need to have the right tooling in order to solve for this, right? If we're going to build the next-generation automobile or do the next generation of data science and analytics, we need the right tools in place. We also need to have the right organizational structure and people with the right skills in place. In an auto manufacturing organization, if I bring in all the right tooling but I haven't trained my people on how to use it, [00:05:00] you're going to have an inefficient assembly line. So I need to make sure I've got all the people organized correctly, and that they have the right skills to be able to handle what's coming down that pipeline. I also then need, of course, to update my organizational processes. If I can create an analytics model, a machine-learning-based model, that is the most clever thing in the world, but my organization can't take advantage of those insights, that's kind of all for naught. So let's take a look at what good organizations are doing as they approach this concept of R&D [00:05:30] as it relates to data science. What's key here is that we need to bring the disciplines together throughout the organization. 
And I'm going to go through those personas on the next slide, but it doesn't just mean I have data scientists that are figuring out what comes next. The challenge is I may have a data scientist that loves the latest version of PyTorch, but the prod systems aren't certified to run PyTorch. So I may build, train, and have a model ready to deliver, but prod can't run it. So what's key about [00:06:00] the R&D lab is getting this end-to-end process thought out and then putting the automation in place so that prod, and the folks in prod, are ready to accept those changes on a regular basis. That means we need to have more infrastructure as code, more DevOps in place, but we also need to make sure that these teams are in sync to be able to do that. They need to be looking forward at what's happening in the industry next. So maybe I have a practitioner data scientist that's rotating in and out of these positions, thinking [00:06:30] ahead about the standards coming down the pipeline: what do we think people are going to be using next? And then infusing that into this process, as well as developing best practices. And in larger organizations, these centers of excellence generally then fan out and work with the various lines of business: this is the best way to set up a Jupyter notebook, this is the best way to operationalize a nightly batch job, this is the best way to do such-and-such, et cetera, et cetera. And some of that means putting together tools [00:07:00] and application catalogs, and building web portals where people can click to deploy. That's a lot of the function of that center of excellence. And so once that function has been established, then we can get to scale. Then our data scientists can get to work. They can build models, they can train models, and they can launch environments. They can go to their notebooks, but this is where the other personas come into play. This is where our software development team needs to start interacting. 
The software developers are responsible for a number of things, [00:07:30] but one of the important aspects is taking whatever logic the data scientists have built and making sure that it is packaged up in an application, whether that's a RESTful API or actually refactoring that Python into C# because it has to go into some sort of embedded system. That's where software engineers need to work closely with the data scientists. Sometimes they're called machine learning engineers, but it's someone who's got more of a proper computer science background, as opposed to the more quantitative background that our data scientists have. These two [00:08:00] teams need to work together, and they need to have a common set of code repositories, model repositories, and registries where they can actually work together. If they can't physically be working together in a pair-programming manner, they should at least be virtually working together to ensure that the software engineers are prepared for what the data scientists are building, and maybe help them restructure their code a little bit within that Jupyter notebook. The software engineers are also responsible for some of those DevOps functions that I mentioned. And an important part [00:08:30] of the automation is having what is known as a CI/CD pipeline, a continuous integration and continuous delivery pipeline, which is the automated testing, packaging, and ultimately deploying of that code into some sort of production environment using automated tools and automated scripts. Whether, again, you're on premises, in the cloud, or deploying at the edge, we want to automate this process as much as possible, because an environment variable may change and we may need to retrain the [00:09:00] model and push it out. We can't wait for three-month production environment recycles. We need to be able to do this continuously. 
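The automated testing stage of a CI/CD pipeline like the one described here often includes a quality gate that blocks deployment when a model's metrics fall below a threshold. The following is a minimal toy sketch of that idea; all names, thresholds, and the stand-in "model" are hypothetical illustrations, not part of any HPE Ezmeral or Dremio API.

```python
# Toy sketch of a CI/CD quality gate for a model artifact.
# Names and thresholds are illustrative assumptions only.

def evaluate(model, test_set):
    """Return the fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

def ci_gate(model, test_set, threshold=0.90):
    """Return True (let the deploy stage proceed) only if accuracy meets threshold."""
    return evaluate(model, test_set) >= threshold

# Stand-in "model": classifies a number as even (1) or odd (0).
model = lambda x: 1 if x % 2 == 0 else 0
test_set = [(2, 1), (3, 0), (4, 1), (7, 0)]
assert ci_gate(model, test_set)  # gate passes; the pipeline may deploy
```

In a real pipeline this check would run as one automated stage among many (packaging, integration tests, deployment scripts), failing the build rather than returning a boolean.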
That's the C in CI/CD: continuously being able to push code into that production environment, which means our operations team needs to be tightly involved. Again, they're part of that R&D process, but they need to be very close to the software developers; software developers plus operations gives us DevOps. That means my operations team isn't just sitting around [00:09:30] upgrading hard drives and patching Linux. They are tightly involved in integrating this DevOps process. And they probably need to be working closer with those data scientists to understand the model runtime and the performance characteristics. And they're going to be making sure that that production environment is being upgraded using that infrastructure as code. It really does need to be collaborative. And so that operations team is keeping production up and running. They're helping with auto-scale, they're making sure the multi-cloud deployments are working. [00:10:00] And then the last actor, at least in this example (there are lots of actors when you do data science and analytics), the last persona in this description, is the data analyst. And of course they're doing their own job. They're working on building reports, whether they're ad hoc or batch or nightly reports, but we need to bring them into play as part of this data science operationalization process, because their job is to then take those data science reports and treat them as a key business function. We need to be putting [00:10:30] key performance indicators on the efficacy of the model. Maybe we're measuring defects on a manufacturing line. Maybe we're looking at the defects in the paint job of one of these automobiles. And so we've got cameras looking at the metal as it moves forward: before it gets painted, as it's being painted, and after it's painted, as an example, and determining how many defects they find in the paint. 
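The paint-defect KPI described above can be sketched as a rolling metric with an alert threshold. This is a conceptual toy only; the class name, window size, and threshold are assumptions for illustration, not a real monitoring product's API.

```python
# Toy sketch of a model-efficacy KPI: a rolling defect rate that
# signals the data science team when the model appears to be decaying.
from collections import deque

class DefectRateKPI:
    """Rolling defect rate over the last `window` inspected items."""
    def __init__(self, window=100, alert_threshold=0.05):
        self.results = deque(maxlen=window)   # most recent inspections only
        self.alert_threshold = alert_threshold

    def record(self, is_defect):
        self.results.append(bool(is_defect))

    def rate(self):
        return sum(self.results) / len(self.results) if self.results else 0.0

    def needs_retraining(self):
        """True when the KPI degrades past the alert threshold."""
        return self.rate() > self.alert_threshold

kpi = DefectRateKPI(window=10, alert_threshold=0.2)
for defect in [False] * 9 + [True]:
    kpi.record(defect)
# 1 defect in the last 10 items: still under the 20% alert threshold,
# so no retraining signal is raised yet.
```

When `needs_retraining()` flips to true, that is the moment, in the talk's terms, to notify the data scientists and rehydrate the dev environment.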
And the reason this is important is because models tend to decay over time. The efficacy of [00:11:00] that model, whatever it was solving for, whatever it was attempting to predict, changes over time. Maybe the paint supplier changed, maybe the metal supplier changed, maybe the mine where the ore was refined to make the steel that went into the automobile changed months upstream, and that changes the characteristics of the way the paint adheres. I'm making up examples here; I'm sure you can come up with your own. Variables change all the time: environment variables, human behavior changes. We need to put key performance indicators [00:11:30] on those models so that we can measure the efficacy of them. If we're not measuring it, how do we know that it's even working? And as those models decay over time, this team needs to notify the data scientists so that they can go back and rehydrate their environment. This is a key part of automation: being able to bring that dev environment up just as it was when it was deployed the previous time. Bring up that training set, get me the metadata, so that the data scientist doesn't have to start from scratch. This is what gives us that virtuous cycle. So they retrain [00:12:00] and then ultimately push the code back out into a source repository and get it back into the system. And so this is what an industrialized approach looks like. And so my colleague, Tom Phelan, is going to now present how the HPE Ezmeral Container Platform can support this and, working along with Dremio, deliver this industrialized approach. Tom, over to you. Tom:    Thanks, Matt. Let me just see if I can do the hard part [00:12:30] here, which is sharing my window. Matt has just set the stage for us on how to operationalize data analytics. What I'll be talking about in the next few minutes is the tools and procedures for how we actually do that in the enterprise. 
One component here is what's known as the Ezmeral, or HPE Ezmeral, data platform. It is a platform that allows [00:13:00] artificial intelligence, machine learning, deep learning, basically analytics data workloads, to be deployed with a cloud experience across your entire infrastructure, whether it's edge, cloud, or on-premises. And we have two parts. We have, as Matt pointed out, the manufacturing toolkit: this is the HPE Ezmeral platform. It provides scalability, flexibility, repeatability, and the ability to attach workloads to [00:13:30] datasets. Dremio is the connectivity. It is the tool that provides that columnar cloud cache, which is the bridge between the data residing in the data lake and the application pulling that data to train the analytical model. What I want to show here is just a little bit about what we mean when we say the HPE Ezmeral platform provides a cloud experience. So if you look at that horizontal [00:14:00] green-outlined box two-thirds of the way down the slide, we see that it's a self-service model, it's pay-per-use, it automatically scales up and down, with tons of automation that is managed for you. So it makes it very easy for a data scientist to go ahead and train their model. They don't have to worry about scalability. They don't have to worry about security or data access; all that is controlled for them. And if you look above that, you see the three multi-colored [00:14:30] boxes: this platform supports not only data-intensive workloads, things like Spark, TensorFlow, and Jupyter notebooks, it also supports cloud-native applications. So if you have a microservices-based platform, that will run here, as well as those legacy applications from the '90s, probably written in Java, non-cloud-native applications. All of them run on the Ezmeral Container Platform, controlled across [00:15:00] the edge, core, and cloud. Now, if we dig down a little bit deeper into what this platform is, we have a layer-cake diagram. 
What I like about this is that it pulls in directly from what Matt talked about: the operators, the roles that we're talking about here. Data engineers, data scientists, and app developers are the ones who interact with the platform, and they can instantiate this collection of [00:15:30] applications. Note that some of these applications are open source, like Spark or Kafka. Other ones could be proprietary applications like H2O and so forth. You can also bring your own: if you don't see one in the catalog, it's easy to build your own application and deploy it here. There's integration with enterprise-class security systems; we'll go into a little bit of that later. It runs on top of a certified CNCF Kubernetes distribution. That can [00:16:00] be a distribution installed by the Ezmeral platform or another distribution that you have already installed within your infrastructure. Whether that Kubernetes cluster is running in the cloud or on-premises, it doesn't matter; it can be managed by the Ezmeral platform. It has full resource control, so that's multitenancy, which allows resource constraints and monitoring. So whether you want to limit CPU, GPU, memory, or networking access, you can do [00:16:30] that through the control plane of Ezmeral, and then surface it down into the data fabrics themselves, using the HPE Ezmeral Data Fabric or some other external data lake implementation. So now let's talk about how we can deploy or manage Kubernetes clusters with Ezmeral. So we have a control plane, and since it's enterprise quality, it is HA, highly available. It comes with RESTful APIs, integration with web services, and alerting and monitoring. [00:17:00] What I show here now is a connection to an auth proxy; this goes out and connects to the enterprise's authorization system. 
So whether it's Active Directory or LDAP or what have you, or if you want to do two-factor authentication or whatever it is, this ties into your existing enterprise authorization system so that your data scientists can use their common credentials, spin up a Kubernetes cluster or application, and away they go, without an additional learning curve. There's also an HA [00:17:30] gateway. This gateway is more about infrastructure. It protects the private IP addresses of the Kubernetes clusters and maps them through to routable IP addresses within the enterprise. So the controller can deploy or manage these Kubernetes clusters, and we'll just say here we have three different types of clusters. You'll notice there are three different versions of Kubernetes; Ezmeral will support any of the common versions of the Kubernetes API. One it deployed itself, [00:18:00] a version 1.20. It imported or connected to an EKS-deployed version. And there may also have been a legacy Kubernetes cluster, version 1.7, which was already deployed in the data center. So the Ezmeral platform can connect to all of those. It surfaces administration rights to the individual clusters, if that's what you want to do, using the very standard kubeconfig and kubectl APIs and CLIs. [00:18:30] So there's no additional learning curve for the administrator. There's load balancing as well: if we need to have [inaudible 00:18:36] of our services across multiple ingress IP addresses, we can do that. Now let's drill down on the individual Kubernetes cluster that we manage. As I said earlier, we have a multitenancy model, so you can have multiple namespaces that are partitioned on the resources of that cluster. We stand it up and we automatically plug in a CNI, a Container [00:19:00] Network Interface, and a CSI; by default, Prometheus is the monitoring tool. In this case, we use Canal, a very common CNI that's a combination of Flannel and Calico. 
If you don't like that, you can certainly replace it with a different CNI of your own choice; that's not a big deal to do through the UI with Ezmeral. We integrate with a common set of CSIs; in this case, I'm just showing the data fabric provided by HPE. Now, because we can also support not only [inaudible 00:19:29] [00:19:30] we have the standard user access. So if your data scientist is comfortable using kubectl, they can connect to a shell prompt and then use their kubectl CLI to access their cluster, or they can use the RESTful API or web-based access if they want. What Ezmeral provides is an agent for upkeep. This monitors the health of the Kubernetes cluster [00:20:00] and makes sure everything is up to date with patches, security implementations, and so forth. We also provide what's called KubeDirector. This is an open source project for managing applications in a way that's a little bit more complete than something like Kubeflow or Helm. So great, this is the whole infrastructure that Ezmeral provides. Now, where does Dremio fit in? We'll dig into that next. So Dremio now sits inside this Kubernetes cluster. It uses [00:20:30] the Arrow Flight implementation to connect with the data scientist, as you see at the very top. So that could be a SQL query; it could be through a Jupyter notebook; it could be through a Python script, what have you. It's connecting into this columnar cloud cache, which has its own set of data accesses, which map onto the data lake storage underneath. This provides a complete, simple way for applications to do SQL queries [00:21:00] on unstructured data. This is the value that Dremio is providing. [inaudible 00:21:07] here on the right-hand side sits as an application running within the Kubernetes cluster, managed by the Ezmeral Container Platform. Now let's drill down a little bit deeper into what this actually means. So we have Dremio. 
So it's the Arrow implementation with both the caching and the connection to the remote data storage. And so we can [00:21:30] reach out to data lakes like S3 or HDFS or whatever else you have out there. And what it allows you to do is not have to ingest or copy data from these data lakes in order to surface it via SQL queries to your applications. So altogether, what we have here is a robust platform for providing infrastructure, R&D infrastructure as Matt pointed out, via Kubernetes. Within that, you can run your Dremio [00:22:00] to connect to your remote data, and then surface that data to the various applications running within your Kubernetes cluster. So I'd like to turn it back over to you for our conclusion and on to our Q and A. Matt:    Thank you, Tom. So I hope everyone has enjoyed the session. Just to wrap up these concepts: as I started out by saying, we know that every enterprise struggles with repeatable analytics. It is difficult to do this at scale, but HPE [00:22:30] Ezmeral along with Dremio can provide a platform and a partnership that provides easy deployment and scalability of data science: that's Ezmeral for the analytics and engineering tooling, while Dremio delivers the connectivity from your tools of choice, whether it's data science, data analytics, or data engineering, to the data, wherever it exists. Ezmeral is the easy button for deploying those applications, because, as Tom went into, all of that low-level Kubernetes complexity (the CNI, the CSI, the gateway, the service [00:23:00] mesh) is handled by Ezmeral. Ezmeral is the easy button for deploying applications and services, and Dremio gives you that connectivity and caching layer to be able to connect the applications to the data sources that those applications require, so you don't have to worry about moving data around and all that complexity. And so together we deliver industrialization for the enterprise. 
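The read-through caching idea behind a columnar cache, serving repeated column reads locally instead of going back to the remote data lake, can be sketched as a toy. This is a conceptual illustration only, not Dremio's actual C3 implementation; the dataset and column names are made up.

```python
# Toy sketch of read-through caching: column chunks fetched from a
# remote data lake are kept locally so repeated reads skip the lake.

class ColumnCache:
    def __init__(self, fetch_from_lake):
        self._fetch = fetch_from_lake   # callable: (dataset, column) -> values
        self._cache = {}
        self.lake_reads = 0             # counts trips to remote storage

    def read_column(self, dataset, column):
        key = (dataset, column)
        if key not in self._cache:
            self.lake_reads += 1        # cache miss: go to the remote lake
            self._cache[key] = self._fetch(dataset, column)
        return self._cache[key]         # cache hit: served locally

# Stand-in for a remote data lake (S3 or HDFS in the talk).
lake = {("sales", "amount"): [10, 20, 30]}
cache = ColumnCache(lambda ds, col: lake[(ds, col)])
cache.read_column("sales", "amount")    # first read hits the lake
cache.read_column("sales", "amount")    # second read is served from cache
```

The point mirrors the talk: the application keeps issuing the same queries, but only the first touches the remote storage, so nothing has to be ingested or copied up front.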
And as we show here, if you go and search the HPE Ezmeral [00:23:30] marketplace, or you get a copy of this presentation, Dremio is a featured partner in our marketplace, which means you can click a button and deploy Dremio into your application environment. So we'd like to thank you for joining the presentation today. We're happy to take any questions in the live Q and A. You can certainly add us in the Slack channel and look us up online. Speaker 3:    Thank you, Matt. Thank you, Thomas. It was a great session. All right, everybody. We're going to take [00:24:00] some Q and A. I'm looking through right now, so please get your questions in. The other thing is, if you'd like to do an audio and video Q and A, go ahead and request it now, and I will click to enable you to do so. Let's give everybody a couple of minutes and let's see what comes in. Matt:    Who's going to be brave enough to go on video first? Speaker 3:    Exactly. Matt:    Who's got the worst home office? Speaker 3:    [00:24:30] I don't know. Mine's pretty rough. I mean, I'm in a '60s basement right now. 1850s basement. Matt:    That's pretty cool. I've got the piles of rubble behind me that you can't see. Tom's got the clean one. Speaker 3:    I think so. Most definitely. Tom definitely has us beat there. All right. So if you're actually trying to add to the Q and A, you go to the upper right-hand section of your screen and go to share audio and video, and that'll automatically put you in the [00:25:00] queue. And if you're having trouble sharing your audio, you can also ask that in the chat. I'll also remind you that it would be great if you could fill out a very short survey in the slide-out tab at the right-hand corner of your screen. Your feedback is welcome and obviously appreciated. Let's see if I can go across here and see if I see any questions. I don't have anything coming in. As a reminder, for both Matt and Thomas, I put the link [00:25:30] to the Subsurface Slack channel. 
They will be in that Slack channel. You can search for their names, both Thomas Phelan and Matt Maccaux, and you can ask your question, I guess one-on-two, if you don't feel like asking it here. Matt:    Tom is a treasure; he doesn't brag about himself, but Tom is a treasure. Tom was one of the co-founders of a startup that Hewlett Packard acquired. He ran engineering. He's now the chief architect over the solution. [00:26:00] He's in the office of the CTO. He was over at VMware and created some interesting IP there. He's got really interesting stuff in his career. I'm not all that interesting; I travel a bunch, but Tom is the brains here. So if you want to pick his brain about Kubernetes, data science, data analytics, where the industry's going, please add him. I'd love to watch [Addus 00:26:21] in the channel, and please engage us in conversation. Speaker 3:    So what do [00:26:30] you have to do to get the fellow title? I mean, that is a big deal in my world, right? I've been doing this 20 years. We used the term IBM fellow, and with 300,000 employees it was a nomination, it was years of work. So how did you get that fellow title? Tell us. Tom:    It is a mystical process. Okay. So HPE has a very rigorous set of criteria. You know, you have to meet these technical [00:27:00] goals, you have to meet business goals, you have to demonstrate the ability to [inaudible 00:27:06] within the industry and so forth. I happened to be able to satisfy all those. I think I was somewhat lucky because I came from a company called BlueData. This was the company that Matt was talking about. HPE acquired BlueData. It also acquired another technology company known as MapR, and more recently acquired a company that specialized in SPIFFE [00:27:30] and SPIRE. All those technology companies HPE is putting together under the Ezmeral umbrella. 
And stay tuned; there are probably going to be some additional acquisitions as we build out this whole platform, because HPE is a major player now in containerized workloads and, in fact, in running all your applications in an as-a-service fashion. So I always say I put myself in front of the right trains during my career and rode them down the track, and so I ended up here as a fellow. [00:28:00] HPE deemed that I had met the criteria. Speaker 3:    So he's humble. He's very humble. But there is a question for you in the Q and A, Tom, asking about whether Kubernetes can be considered a modern operating system for these various workloads. It's funny that that term came in, because that's terminology we're starting to hear and use. But anyway, please give your thoughts on this. Tom:    Yeah, it's a very good, it's an astute question. Okay, so [00:28:30] the general terminology is that Kubernetes is a container orchestrator. Okay. What that means is it has [inaudible 00:28:37] amount of virtualization. It uses a container runtime; whether that runtime is Docker or Singularity or what have you doesn't really matter. I am an old-school OS guy. I wrote Unix before Linux existed. So the OS provides the ability for the application software to access the hardware. How does [00:29:00] it get to the NIC? How does it get to the storage? How does it get to the CPU? So Kubernetes is a scheduler and management layer; however, I would not yet really call it an OS. We are starting to see particular flavors of Linux that embed the Kubernetes API within the operating system. So I think we're getting there, but [00:29:30] when we talk about Kubernetes today, I don't think it quite yet meets all the functionality that an operating system has. Speaker 3:    Any other questions here? We're right at time. I want to encourage everybody to get your feedback in about the session, and we've got the next session coming up in about five minutes. Our expo hall is open. 
Check out the [00:30:00] booths, get demos on the latest tech, and win some giveaways. I know that we've got some great sponsors that are taking questions as well. As I mentioned, feel free to… I will re-post the Slack channel link in here for you, so you have it, but you can get engaged directly with both Matt and Tom following today's session. With that, I'm going to close. Gentlemen, thank you so much for your time today. I thought it was a fantastic session, very insightful. If I can help in any way, [00:30:30] let me know, and have a great afternoon or a great morning, depending on where you are. And for the folks attending today's session, thank you, and enjoy the rest of Subsurface. Good day. Tom:    Thank you. Matt:    Thank you. Great day.