May 3, 2024
VAST + Dremio: Astonishing Speed and Accelerated ROI for Data
The VAST Data Platform is a fully integrated solution for streamlining the path to value for structured and unstructured data. With unparalleled performance, significant cost efficiencies, and radical simplicity, VAST aims to both simplify and future-proof data environments for BI, AI, and beyond. We will discuss how to combine the power of VAST with the elegance of the Dremio query engine and analytics solution for a truly full-stack data solution. Dremio and VAST together can deliver a stronger solution for generative AI, which marries structured and unstructured data into an ecosystem to activate all of an organization’s data at scale.
This talk will demonstrate the new Dremio connector for VAST Data, and show users how to truly get the most out of all of their data: structured and unstructured, lake and lakehouse, for BI and AI.
Transcript
Note: This transcript was created using speech recognition software. While it has been reviewed by human transcribers, it may contain errors.
Colleen Tartow:
Hi, everyone. I am really excited to be here. I spent yesterday at the on-site Subsurface, and it was incredibly energizing and exciting hearing all of the announcements and everything yesterday. So I’m very excited to be continuing that today. As I promised yesterday in the keynote, today I’m going to do a deep dive into the VAST database connector for Dremio. So I’m going to assume that most people here know enough about Dremio, but perhaps you haven’t heard of VAST Data. So I’m going to start with an overview of VAST Data, and then I’ll talk more about how VAST and Dremio work together, and then I’m going to do a demo. So very exciting.
The Path to Value for Data
So as you I’m sure know, Dremio is focusing mostly on data federation, which means connecting data from the consumption layer, and also joining, curating, exploring data, and really making it available for end users. This is how I think of data pipelines. You know, we start with data sourcing, then we walk through ingestion, and then we really get to the meat of the business value. We’re curating data and then consuming it. VAST, as opposed to Dremio, is focusing on the sort of first half of the pipeline, focusing, beginning at the source, with data production and ingestion, processing unstructured, structured, semi-structured data, and curating it into consumable data products. And you can see that these two technologies overlap in the middle, and so this is a natural place for a key partnership between our two products.
Requirements for a Modern Data Platform
Thinking more broadly about what’s required for a modern data platform, and this is something I think about probably far too often, the first piece is really scalability, and this has become incredibly important recently, especially with the onset of AI. Data continues to grow exponentially. I don’t even know how many zettabytes of data there are now, but data is growing exponentially, and what’s more, we’re now doing generative AI, meaning it’s actually generating even more data. So whatever platform you’re using needs to be able to operate at all scales and provide incredibly fast access to all data at all scales. The next piece is obviously flexibility. We need to be able to stretch and serve myriad use cases. You don’t want to have to spend a million dollars building up a separate data stack for each use case, right? You have all different kinds of data, you have all different types of compute, you’re going to have CPUs, hopefully you can get your hands on some GPUs, you might also have DPUs, et cetera, and so that’s something you need to think about when you build a data platform.
That said, you also want it to be streamlined, right? You want the most efficient path to business value possible. Things are complicated. This is very complicated. Things are only getting more complex, and so the more efficiencies you have in your data pipeline, the better. So you really want to focus on the path to business value here. Furthermore, we want to make sure that whatever we’re doing, it’s performant, and it’s going to perform at those larger and larger scales, and that you aren’t going to get worse performance as you get bigger. Obviously, everybody always wants an answer immediately, and so minimizing query time is important for both users and efficiency, but it’s also important so that you can minimize your use of compute, because that’s a huge source of spend in data. Obviously, you want something user-friendly. I mean, this is why Dremio is so amazing. None of this is any good if people can’t use it to easily get the value they need from their data. And then lastly, of course, we want something cost-efficient, and this is really the name of the game and something that I think about a lot, which is minimizing the TCO of data so that you can maximize the return on investment, right? So if your data is returning value, that’s great, but if the cost of the data environment is more than that, then you’re actually losing money, so you need to be incredibly cost-efficient.
VAST Data Platform
So that said, VAST provides all of these things. VAST data platform is an all-flash, disaggregated, shared-everything data architecture that’s focused on allowing you to break the trade-offs that are inherent in all of the legacy data platforms. For example, a traditional platform can often deliver on scale and performance, but as you scale, the cost becomes prohibitive. On the other hand, you might find a platform that’s cost-efficient and scalable, but then your performance suffers at larger scales. And so VAST has a unique architecture that delivers on all of the above by focusing on the capabilities of just the latest innovations in storage and compute to really separate out that storage and compute layer for NAS. And so I want to be clear, VAST is not a hardware company, it’s not a storage company. We are VAST data, not VAST storage. We are software that can run anywhere to take advantage of the latest hardware, whatever you have and wherever it lives, whether it’s at the edge, like in IoT use cases, on-prem, or even in the cloud.
Another way to think about this is that deep learning, like I said, has created this inflection point with technology. And VAST data is really poised to take advantage of that by providing software that’s built for all kinds of data, whether structured or unstructured. And it’s also built for both ingestion and consumption, and AI and BI. And the beauty of this is that with VAST as that underlying backbone layer, that data storage layer and database layer, Dremio is then the flexible data lakehouse to bring in the user experience and the consumption layer. So this is how we like to partner together. And you can see here, we are working closely with partners across this ecosystem, and Dremio really provides that data lakehouse layer, which is very exciting.
The Data Pipeline for AI
I want to double-click on something I said earlier when I was talking about data pipelines. We often think of data pipelines as storage or ingestion, processing, sorry, data sourcing, ingestion, curation, and consumption. But consumption and curation means something completely different in AI than BI. AI changes everything. So whether you’re fine-tuning pre-trained models or you’re training your own models, or you’re just using someone else’s model, data engineering is still going to be the underlying workhorse of the AI pipeline. So you’re still going to be doing an incredible amount of data engineering, whether you’re doing AI or BI. I mean, those of you who have been data engineers, you know that the data engineering work is the bulk of the work in getting value out of data, and that doesn’t change with AI. And AI itself, I mean, the models are inherently incredibly complex, and all of this can be very challenging, and it’s not well understood, and it can be a black box and all those things. But it’s still that same pipeline of ingesting through curation that’s going to require that intense data engineering, even if you’re doing it on a GPU now. And there’s also structured data in AI. It’s not all unstructured data. And so you still need to think about that when you’re thinking about all of these things. And it’s now at a different scale than what you’ve done before. Maybe you’re talking about petabytes now instead of gigabytes or terabytes.
So all of that to say, VAST is unique and innovative in the way it handles both structured data in our database and unstructured data in our data store. And the VAST database is incredibly cool. It’s both transactional and analytical. We’re writing in rows and reading in columns, and we’re using flash very efficiently. So it’s a low-cost architecture. And then on top of that, you have incredibly fast access. So you’re really getting that data warehouse performance and feature set at the cost of a data lake, the cost and scale of a data lake. So you’re kind of coupling those together. And so what that allows you to do is to simplify and streamline your pipeline of data from the source to the value across your use cases. And so you can actually use the same environment for operational data processing, AI or BI, and whatever comes next. Who knows what will be next? And so one of the key pieces of the stack is that VAST database.
VAST Database + Dremio
So like I said, that’s groundbreaking in that it’s both transactional and analytical, and you get all the benefits of being in the VAST data platform. So we’re ACID compliant in our transactionality. Like I said, we store data in columns. So for those of you who are familiar, we are both OLTP and OLAP in our formats. With the VAST data platform overall, we get incredibly good data reduction. We have our proprietary data reduction on top of all the usual compression techniques. You get data protection. You get a lot of different benefits just from using the VAST data platform as well. And like I said, you get that data warehouse performance at the data lake scale and cost, which is very exciting. And then finally, when you put all of this together, you are essentially accelerating read and write performance, which again, reduces compute, which reduces your total cost of ownership.
And then if you add Dremio on top, you’re really adding a layer of user experience and lakehouse features that completes the picture for data curation and consumption. And so like I said yesterday, I’m thrilled to announce that our VAST database connector for Dremio is now available in beta, and we’re really unifying this offering into a complete data lakehouse experience. And what I referenced yesterday was specific to the zero trust use case, which is SIEM data that’s specific largely to US federal customers. But that said, I think this is important for everyone, understanding that we run really well with highly selective queries, etc. So together, VAST and Dremio are giving you this complete data lakehouse experience where you get like the incredible user experience and functionality of Dremio and all the data products and that good stuff on top of the scalability, performance and cost efficiency of the VAST data platform.
So yeah, so when you couple these together, you really do have that entire data stack from bottom to top or left to right, however you want to picture it. And I also want to note that we’ve actually had for quite some time now the capability for Dremio to connect directly to the VAST S3 protocol through the data store. So we also have a catalog that I’ll talk about in just a second, but you’re able to use Dremio to actually query structured data stored in your favorite open file format, directly in S3. So if you do have Iceberg files or Parquet or whatever, you can just get to them directly on the data store. And you’ll get all those awesome features like data reduction, et cetera. And it’s scalable out to exabytes, which is obviously important because, you know, 95% of data is unstructured or semi-structured. And so you want to be able to get to that data as well. But all of the data actually has its metadata indexed into the VAST catalog, which then can be queried in Dremio as well. And that’s been around for a couple of years now. So those are the points I was just making, scalable to exabytes. Love that.
AI Data Infrastructure Recipe
So going back to our original data pipeline flow diagram, you can now see how the VAST data platform and Dremio are really working together to cover that entire pipeline and make it streamlined, performant, cost efficient, scalable, flexible, beautiful user experience, all those things I was talking about earlier. And this can really work for AI, BI, and then of course, it’s scalable out to whatever comes next, whatever that may be. And I also think about this in that we’re really building an ideal AI data infrastructure here. You know, we’re using the latest and greatest of everything and you get that exabyte scale for all of that data that’s necessary for deep learning, computer vision, gen AI, all of those good things. And then your efficiency and your cost effectiveness and your performance at that scale and the iterative data engineering and data governance work that you’re doing with Dremio, that’s all part of that same platform. Now I’ve been talking a lot. And while I love to hear myself talk, I’m going to do a demo. And so I can show you firsthand a demonstration of the VAST Dremio environment so you can see it for yourself.
Live Demo
So what I’m going to show you today is actually a sample data pipeline that we dreamed up for our environment, starting with some data that we’ve pulled from the U.S. National Weather Service. We have a fairly straightforward Python script that pulled this data. It actually stopped a few days ago. I got kicked off my server so it stopped polling. But it downloaded XML files and then created Arrow objects and inserted them into the VAST database. And then on top of that, on top of the VAST database, we’re just using the new Dremio connector for the VAST database to query the data. And we’re going to create a few views. And it’s always more fun to use a visualization rather than just showing you SQL queries. So I’ll show you the Dremio connector for Apache Superset to visualize the data. So I’m going to take a moment and stop sharing and I will share something else. So what I’m sharing now is this is the VAST interface. This is the VAST data store interface. And what this is showing you is a bunch of storage-y stuff. It’s showing you things like the data reduction rate, which is 2.8 to 1, the number of data nodes and compute nodes that are up. But really what you want to look at is the VAST database. And so if I scroll over here, I can see I have three databases. And in DB0, I have something called METAR data, which is the meteorological data that we’re going to use. And it has things like pull time, observation time, et cetera. And in this data, basically the United States Weather Service, the National Weather Service has something like 5,000 weather stations all over the country. And they’re constantly polling for weather data. And they’re recording all of this information, like longitude, latitude, temperature, dew point, wind, visibility, blah, blah, blah, temperature, stuff like that. And so what I’m going to do is I’m going to show you this data. And so this is just the VAST UI. And so this is the dashboard that I’m going to load.
And you’ll notice that none of the data is there yet. So it doesn’t exist on the views that are used to drive this data. So now I’m going to go to Dremio.
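For readers who want a concrete picture of the ingestion step described at the start of the demo, here is a minimal Python sketch of turning a METAR-style XML payload into column-oriented lists. The sample XML, the field names, and the `metar_xml_to_columns` helper are all assumptions for illustration; the actual script and the real National Weather Service schema were not shown in the talk. A dictionary of equal-length lists like this is the shape that `pyarrow.table(...)` accepts, which is one plausible route to the Arrow objects mentioned above. The insert into the VAST database is left out because the talk does not show that API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample payload; the real feed has more fields and may be
# laid out differently.
SAMPLE_XML = """
<response>
  <METAR>
    <station_id>KBOS</station_id>
    <observation_time>2024-05-01 01:30:00</observation_time>
    <latitude>42.36</latitude>
    <longitude>-71.01</longitude>
    <temp_c>11.0</temp_c>
    <dewpoint_c>7.0</dewpoint_c>
  </METAR>
  <METAR>
    <station_id>KSFO</station_id>
    <observation_time>2024-05-01 01:30:00</observation_time>
    <latitude>37.62</latitude>
    <longitude>-122.37</longitude>
    <temp_c>14.0</temp_c>
  </METAR>
</response>
"""

# Assumed field names, split by how each value should be parsed.
TEXT_FIELDS = ("station_id", "observation_time")
FLOAT_FIELDS = ("latitude", "longitude", "temp_c", "dewpoint_c")


def metar_xml_to_columns(xml_text):
    """Parse METAR-style XML into a dict of equal-length column lists."""
    root = ET.fromstring(xml_text.strip())
    columns = {name: [] for name in TEXT_FIELDS + FLOAT_FIELDS}
    for report in root.iter("METAR"):
        for name in TEXT_FIELDS:
            columns[name].append(report.findtext(name))
        for name in FLOAT_FIELDS:
            raw = report.findtext(name)
            # Stations do not always report every measurement; keep a
            # None placeholder so all columns stay the same length.
            columns[name].append(float(raw) if raw is not None else None)
    return columns


columns = metar_xml_to_columns(SAMPLE_XML)
print(columns["station_id"])
print(columns["dewpoint_c"])
```

Keeping a `None` for a missing reading, rather than skipping the field, is what keeps every column aligned row for row, which a columnar format requires.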
And you can see this Dremio instance is fairly empty, with the exception of the VAST S3 object storage. So you can then go to Add Source. And if you go down to the bottom, you’ll see VAST data. Now this is a beta, so we’re missing the icon. It’s coming in the next release of Dremio, though, I promise. And so I’m going to be very creative and call this VDB. And I’m, of course, going to cheat and copy and paste my information from here into my connection. Please don’t look and don’t screenshot my secret key. Thank you very much. All right. So then I save that. And now I can see the same thing I saw in the VAST database view, where I can see my VDB. And then inside, I can see the two different things. I want to change my number of splits and subsets. This is just how the parallelism works, just based on my server. So now in DB0, you can see my METAR data is there. And this is actually just the same data. And now I can query it. I’m only going to show you the first 10 records, because it’s actually several million records. And so let’s say we can also look at the number of records. Let’s see how many we have. I think it’s 11 or 18 or something like that. 18 million records, 18.5 million records. OK. So all right. What we’re going to do from here is we can also look at the last time it ran. So the last observation time that I pulled before my machine got rebooted was May 1st. All right. May 1st at 1:30 UTC. OK. So that’s my most recent data.
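The three checks in this part of the demo (preview a few rows, count the records, find the latest observation time) are ordinary SQL. The sketch below reproduces them in Python against an in-memory SQLite table standing in for the Dremio source. The table name `metar_data`, its columns, and the sample rows are hypothetical, and the exact SQL run in the demo is not in the transcript.

```python
import sqlite3

# In-memory SQLite stands in for the Dremio source here; it is not the
# engine used in the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metar_data ("
    "station_id TEXT, pull_time TEXT, observation_time TEXT, temp_c REAL)"
)

# Hypothetical sample rows: (station, pull time, observation time, temp).
rows = [
    ("KBOS", "2024-04-30 01:35:00", "2024-04-30 01:30:00", 9.0),
    ("KSFO", "2024-04-30 01:35:00", "2024-04-30 01:30:00", 13.0),
    ("KBOS", "2024-05-01 01:35:00", "2024-05-01 01:30:00", 11.0),
    ("KSFO", "2024-05-01 01:35:00", "2024-05-01 01:30:00", 14.0),
]
conn.executemany("INSERT INTO metar_data VALUES (?, ?, ?, ?)", rows)

# 1. Preview only the first 10 records rather than the whole table.
preview = conn.execute("SELECT * FROM metar_data LIMIT 10").fetchall()

# 2. Count the records.
record_count = conn.execute("SELECT COUNT(*) FROM metar_data").fetchone()[0]

# 3. Find the most recent observation time. Timestamps stored as
#    'YYYY-MM-DD HH:MM:SS' text sort correctly as plain strings.
latest_observation = conn.execute(
    "SELECT MAX(observation_time) FROM metar_data"
).fetchone()[0]

print(len(preview), record_count, latest_observation)
```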
So now what I’m going to do is I want to see– I have two views that drive this dashboard that I want to create. And so the first one– first of all, I want to put them in a bucket. So I want to make sort of data products, if you will. I’m going to call it METAR weather data. I just copied that in, or autocompleted, because I’ve done this before. And I’m going to cheat by, again, copying the SQL for these views. So I go to the Query Editor, open it up, paste it in. And you’ll see the pull time. And this is the current data. So I’m just pulling the current data– the current most recent pull. So you’ll see there’s 4,800 records from each weather– there’s one record from each weather station. I’m going to save it as a view. Oops. Not clicking. OK. Oops. Sorry. And I’m going to call this METAR current pull. I’ll save that as a view. And then I’m going to do another one. And this is the last seven days’ worth of data. And I’m doing a couple of quick calculations like changing– I’m going to save this as a view as well– changing the elevation from meters to feet, that kind of thing. Sorry. OK. And then, again, I’m going to call this METAR last 7D, in the same bucket. And there are my views. And now– oops. Sorry. Clicking too much. If I load my dashboard, you’ll see all of this really cool data shows up because these views now exist. And so you’ll see, again, I have 18.5 million record counts. I’ve got the most recent pull time, roughly 5,000 weather stations. You can see that current weather conditions are up top. Again, this is just Superset. And then down at the bottom, I have visibility and flight conditions. This is like, I don’t know, aviation data or something. So yeah. So that is the demo. Hopefully, that is exciting for everyone. Let me go back to my– there it is. OK.
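The SQL pasted for the two views is not captured in the transcript, so the following is only a sketch of what views like these might look like, written in Python with in-memory SQLite standing in for Dremio. The view names follow the ones spoken in the demo; the table name `metar_data`, the column names, the sample rows, and the choice to measure "last seven days" back from the most recent pull (rather than from the current date) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE metar_data (station_id TEXT, pull_time TEXT, elevation_m REAL);

    -- Hypothetical sample rows.
    INSERT INTO metar_data VALUES
        ('KBOS', '2024-04-20 00:00:00', 6.0),
        ('KBOS', '2024-04-30 00:00:00', 6.0),
        ('KBOS', '2024-05-01 01:30:00', 6.0),
        ('KDEN', '2024-05-01 01:30:00', 1656.0);

    -- View 1: only the rows from the most recent pull,
    -- i.e. one record per station.
    CREATE VIEW metar_current_pull AS
    SELECT *
    FROM metar_data
    WHERE pull_time = (SELECT MAX(pull_time) FROM metar_data);

    -- View 2: the seven days up to the most recent pull, with the
    -- elevation converted from meters to feet (1 m = 3.28084 ft).
    CREATE VIEW metar_last_7d AS
    SELECT station_id,
           pull_time,
           elevation_m * 3.28084 AS elevation_ft
    FROM metar_data
    WHERE julianday(pull_time) >=
          (SELECT julianday(MAX(pull_time)) FROM metar_data) - 7;
    """
)

current_rows = conn.execute(
    "SELECT station_id FROM metar_current_pull ORDER BY station_id"
).fetchall()
last_7d_count = conn.execute("SELECT COUNT(*) FROM metar_last_7d").fetchone()[0]
denver_ft = conn.execute(
    "SELECT elevation_ft FROM metar_last_7d WHERE station_id = 'KDEN'"
).fetchone()[0]

print(current_rows, last_7d_count, round(denver_ft, 1))
```

Because the dashboard reads from the views rather than the base table, it has nothing to show until they exist, which matches what happens in the demo.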
Requirements for a Modern Data Platform
Back to my slideshow. I do want to note that I’ve showed you all these things. I’ve showed you that this is scalable. We’ve done these things for things that are petabytes of data. That was obviously 18 million records. But I think it’s important to think through, when you’re building that platform, that you can do all these things. And so it was flexible. I built this demo and probably– I already had the Python script to run the polling. It probably took me an hour to get everything up and running. And the Dremio database– VAST database connector was very straightforward as well. And I do think that that’s really important. Oops. Sorry. I think I stopped sharing. Here we go. Sorry. And then, obviously, it’s streamlined. I mean, that was a fairly straightforward ideal situation that I was looking at. But I think, like I said, things are only getting more complicated and more complex. And it’s important for all of these things to be as efficient as possible. And so Dremio, its set of features– VAST, its set of features. When you put them together, you really have a fully functional data platform. It’s also incredibly performant. I mean, obviously, that was a demo. So it’s always going to be pretty quick. But you get that same linear performance. And then you can scale both horizontally and vertically as you need to. Excuse me. Again, super user-friendly. I mean, everybody loves the user experience in Dremio. And like I said, none of this is any good if people can’t use it to get the value from their data that they want. And then, lastly, we want something that’s cost efficient. So in the end, going back to this original data pipeline flow diagram that I had, you can now see how the VAST data platform and Dremio really are covering all the different steps of this pipeline to define a streamlined, performant, cost-efficient, scalable data lakehouse for BI and AI and beyond. So I hope that was a good demo.
I hope everybody enjoyed it. This is a link. If you scan this link, we’re having another webinar, a VAST Dremio joint webinar on May 30 to show– we’re going to show the database connector for Dremio again. And we’ll show it probably with a bit different data. We’ll see what we end up showing. But we’re going to talk about it. And it’s going to be me and a bunch of people from Dremio and VAST. It’s going to be really fun. So please join. You can see Gnarly here with Vinny, the VASTronaut, the VAST guy. Mascot. I’m sorry. I couldn’t think of that word. And so, yeah, I’m really excited. And if you have any questions, feel free to reach out to me or anybody from the VAST team. We also have the virtual booth open right now. And thank you for listening.