
DM Radio: Modern Data Pipelines

Host Eric Kavanagh interviews several open-source experts - including Dremio CMO Kelly Stirman - about the current state of open-source development.

Listen to the entire podcast over at DM Radio.


Eric Kavanagh:

Our next guest is Kelly Stirman, from a company called Dremio. This is a fascinating company. These guys came along and figured out a way to achieve some really, really cool things for data access and data provisioning, basically enabling what we've called for years something called federated queries. In the analyst world, a lot of times you're doing SQL queries; SQL stands for Structured Query Language, for the radio audience out there. SQL is the lingua franca of data, basically. It's the long-standing language for querying databases and data sources. But the dream for years and years was to be able to do federated queries, meaning querying not just one database, but multiple databases in multiple different places. Well, until recently, it was just a pipe dream. I mean, you could actually try to execute one of those things, but there's almost no chance of it coming back, at least with anything meaningful. But Dremio focused on that and they've done some pretty cool stuff. So with that, Kelly Stirman, welcome back to DM Radio. Tell us a bit about what you guys are doing with respect to modern data pipelines.

Kelly Stirman:

Thanks for having me, Eric. I thought we were going to talk about Repo Man. I didn't know we were talking about data things. I'm a little surprised.

Eric Kavanagh:

Life with Repo Man's always intense.

Kelly Stirman:

Yeah. So Dremio is in the data analytics space. We have a product that we call data as a service. As you said, companies have data in lots of different systems, and as our earlier folks on the call described, you're talking about 7,000 just in the domain of marketing solutions. Think about all the other systems that you use to run the business. No company has their data in one place. For decades, the answer to "how do I understand the full view of my customer, or the full view of my supply chain, when data is managed in all these different systems?" has been the same: we'll just copy your data into a new silo, and then you can do your analysis on the data in that one new silo.

For most companies that's just not terribly practical. There are just too many systems, there's too much data. By the time you move it into a new place, a ton of time has gone by, and it's very expensive and error-prone, et cetera, et cetera.

What Dremio's about is can we put something on top of all the data, wherever it happens to be, and let people go and blend data across different sources, do analytics across different sources, without first moving it into yet another silo? In addition, can we do that in a self-service model so that instead of it being something only IT can do, can you make it so the business analyst, or the data scientist or any of what we call the data consumers can go and do things for themselves on their own terms and at their own speed without being so dependent on a core IT function. So that's what Dremio's all about.
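To make that idea concrete, here's a minimal sketch in plain Python (not Dremio's actual implementation; all table names, file contents, and field names here are hypothetical) of blending records from two separate sources, an operational database and a CSV export, in one step, without first copying either source into a new shared silo:

```python
import csv
import io
import sqlite3

# Source 1: a hypothetical operational database (stood up in memory here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)",
               [(1, "Acme"), (2, "Globex")])

# Source 2: a hypothetical CSV export from a marketing tool.
orders_csv = io.StringIO("customer_id,total\n1,250\n2,99\n1,40\n")

# Blend in place: aggregate the CSV, then join against the database rows.
# Neither source is copied into a new store first.
totals = {}
for row in csv.DictReader(orders_csv):
    cid = int(row["customer_id"])
    totals[cid] = totals.get(cid, 0) + int(row["total"])

blended = [(name, totals.get(cid, 0))
           for cid, name in db.execute("SELECT id, name FROM customers")]
print(blended)  # [('Acme', 290), ('Globex', 99)]
```

A federated engine generalizes exactly this move: it pushes work down to each source where it can, and joins the results on top.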

Oh, and also it's open source, so it's something that we've made available so everyone can use it.

Eric Kavanagh:

Yeah, that's really cool stuff. You're kind of speaking to a broader shift in the marketplace, not just in the data marketplace, but also in the media. I see this across the board as a sort of push versus pull. Push is when the reports team would push out a report and say, "Okay, here's what happened to the business," but when we start talking about self-service, that's a pull model. Now the individual's pulling the information that they want. So think about the media world: there used to be just a handful of media companies, so they'd push out the nightly news. They'd tell you what happened. Now with the internet, we pull stuff to us. We go onto Google and see what happened today and search for stuff that matters to us. This is a very widespread transformation; it's not a small thing.

I mean, this is a massive inversion of flow, if you will, of data flow. To get self-service right, especially in the analytics space, takes a lot of engineering under the covers to really nail stuff down. Can you kinda talk about how you managed to jump over some of these hurdles when you get deep into the weeds of pulling different data types from different data systems, and being able to federate that query and give something meaningful back to the end user? How did you guys do all that?

Kelly Stirman:

Yeah. It's a big challenge because data is in lots of different technologies, it's in lots of different formats. In many cases companies don't even know what they have. There's so many different systems and owners of data in the enterprise that just beginning to understand what you have to work with can be a real challenge.

So we are beneficiaries of years of work in this area in the open source space. Ourselves, those of us who started the company, have been in this world for a couple of decades. So one of the key evolutions in data and data management over the past decade is that once upon a time, everything was in one or more flavors of a relational database, but now it's in lots of other systems, not just relational databases. It's in data lakes. It's in NoSQL databases. It's in cloud services. It's in third-party applications like some of the apps that we talked about earlier on the show.

So what Dremio has done an enormous amount of work on is creating core technology innovation that allows us to easily source and connect to data from all these different systems, and work with data no matter what format it's in. That's one part of it. The other part of it is how do we make it fast? Because people that work with data, they want to be able to work at the speed of thought. Answering one question begets the next question. So you need that interactivity in your analytics to actually do your job for most people. So there's a really important open source project called Apache Arrow that we've been heavily involved with, which is all about taking advantage of innovations in hardware to make analysis of data dramatically more efficient.

I don't mean 50% faster. I mean 1000 times faster than what we've had in traditional technology. So at the core of Dremio, it's not just the ability to reach into data and all these different places, but also the ability to accelerate the analytics so that people can work at their own speed and at the speed of thought.

Eric Kavanagh:

Yeah, that's really cool stuff. Let's face it, hardware really leads the charge in terms of innovation. If you wanna watch where the disruptions occur, it's usually in some sort of hardware innovation. Just think about flash, for example, or solid-state drives. I mean flash, or DRAM for example, is 1,000 to 10,000 times faster than spinning disk. We are kind of at that inflection point right now where hard drives, which many of us have had over the years in our computers, were always spinning disks. Those things crash and they're not terribly efficient. So now you have much more memory out there. So is that also kinda playing into how you guys get stuff done?

Kelly Stirman:

Yeah. Yeah. So you're right. You can kind of think about tiers of where the data might live, in different fast lanes if you will. Data in some storage system, like a spinning disk or flash, is one way to think about it. Then there's data in memory. We always hear that, hey, in-memory data is a lot faster. What's faster than in memory is the memory that's actually on the CPU. So the CPU can reach directly over, in many cases on the same chip, into a memory buffer to access and operate on the data.

So a lot of what Arrow is focused on is how do I maximize the amount of data that I can fit in memory and in on-CPU memory, and how do I maximize the efficiency of the CPU cycles as it operates on that data? By focusing on those areas, you get these huge benefits in efficiency. Ultimately that efficiency boils down to speed, but if you're a data center operator, that efficiency ultimately boils down to your electricity bill at the end of the month. So there are real cost savings in terms of operating these systems, but also the experience of the data consumer, and their ability to work with the data, is dramatically improved by these kinds of innovations.
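A rough pure-Python sketch of the columnar idea Arrow is built around (Arrow itself stores fixed-width values in contiguous off-heap buffers; this illustration only contrasts the two layouts, it is not Arrow's API):

```python
from array import array

# Row-oriented layout: each record is a separate Python object, so scanning
# one field means chasing pointers all over the heap.
rows = [{"id": i, "amount": i * 2} for i in range(1_000)]
row_total = sum(r["amount"] for r in rows)

# Column-oriented layout: each field lives in one contiguous buffer of
# fixed-width values -- the shape CPUs and caches like. Scanning "amount"
# touches only that buffer and streams through memory sequentially.
ids = array("q", range(1_000))
amounts = array("q", (i * 2 for i in range(1_000)))
col_total = sum(amounts)

assert row_total == col_total  # same answer; only the access cost differs
print(col_total)  # 999000
```

The columnar version reads one dense buffer instead of a thousand scattered objects, which is where the cache-efficiency and CPU-cycle wins described above come from.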

Eric Kavanagh:

Yeah. That's really good stuff. Like I say, enabling the federated query, you kinda spoke to another big development in the space. First we had the data warehouse, where we aggregated all this data; then the data lake came along and once again we're trying to aggregate all this data. Guess what, it's just not really gonna happen that way. You're always gonna have different systems. It's always gonna be a challenge to pull data from those systems. So you're always gonna have some kind of silo, and I think you guys took a fairly clever approach to give an alternate route, basically, to the data that someone needs to do their analysis. This is all good stuff.