Self-Service Data for the Data Lake

Transcript

Eric Kavanagh:

All right, ladies and gentlemen. Hello, and welcome once again to Inside Analysis. We have one of our famous double headers here today. This is your host, Eric Kavanagh, and we've just finished up a radio show with Kelly Stirman of Dremio, talking all about the evolution of information management and really about this cool stuff happening with federated queries, leaving the data where it lives. So I'm going to dive right in here and share a few slides, just to frame the context here, give you all some perspective.

This is one of my favorite slides ever. Two birds sitting there; one asks, "What is a data lake?" and the other says, "Then the Data Lake evaporated into the Cloud." There's actually a pretty interesting backstory to all that, which we've talked about on past webcasts, mainly that the cloud really has emerged as a tremendous resource for data these days. Not just for storing data, but for analyzing data, the cloud is a great democratizer. And of course, we're going to live in this multi-cloud environment. I don't think there's any stopping the reality of multi-cloud when you've got, of course, Amazon Web Services, still the kingpin of the industry, with, by some estimates, 80% of the market. They got a head start, they really did. They were a very visionary company that has fundamentally transformed enterprise software.

But now you've got Microsoft with Azure, you have SAP with the SAP Cloud Platform, you've got Oracle focusing very heavily on cloud. That's actually where a lot of their good news is coming from. But what does that mean? It means multi-cloud is here to stay. You're going to have lots of different applications across lots of different environments, and of course you have all these different tools that are being used for human resources, for marketing, for sales automation. A tremendous amount of activity is happening up in the cloud, and with good reason.

And the challenge, of course, is how do you manage a multi-cloud environment, especially from an analytical perspective? Well, I can tell you, several years ago ... This dates back probably eight or nine years, in fact, I guess eight. I was talking to a guy at a company called ParAccel, and he was explaining their architecture of being able to reach across different platforms and leverage the computational power of those platforms. Well, in the last couple of years, we've seen a tremendous amount of innovation, and I'm going to explain a little bit about why that is.

So hardware usually is where innovation starts, right? If you think about memory and the falling price of memory, the falling price of flash, for example, it's a huge driver of innovation right now. In fact, you're seeing a lot of companies really focusing on memory exclusively, SAP being one of them with SAP HANA. That's an in-memory architecture. Well, memory is up to a thousand times faster than spinning disk, in some cases 10,000 times, but it used to be very expensive. Well, that's now changing.

So we have all these innovations in CPUs and in GPUs, thanks to all those gamers out there for playing lots of video games, because they fueled a tremendous surge of investment in GPUs, which, as we've heard on other shows, I'm sure we'll hear today, are excellent for machine learning, because they're excellent for parallelized processes. And parallelism, let's face it, is one of the driving forces that's changing the nature of information management these days. It's changing the nature of business, quite frankly. All these tremendously scalable solutions, solutions that are built at web-scale, if you will, are just amazing. And the collaboration capabilities and the analytical capabilities, all of this stuff is being fueled by tremendous innovations in the hardware world.

Well, there's also tremendous innovation in the software world. In our radio show that we just wrapped up, our eponymous radio show Inside Analysis, we were talking a lot about open source, and how open source is fundamentally infused with the mindset of the developer. Now if you want to develop software, you really think open source first. And so open source took off, largely because of Hadoop, also because of Spark, of course, now backed by Databricks, and Kafka and NiFi. All of these innovations are built on the open source model, and the cool thing there is you're really standing on the shoulders of giants.

So you can think about where open source really took off, back in the '90s with Linux. Well, what happened is, people got tired of lock-in. People got tired of Microsoft pulling the rug out from underneath them and forcing a new version of the software. It takes time to write software, and developers are not cheap. So that really did cause a major change in how we think about software development, and I think it's all good news for the end user.

So, of course, the cloud. We just talked about cloud computing a moment ago, how it is the de facto standard. I think most software companies these days, when they're planning out new architectures, when they're planning out their vision, they're thinking cloud first, and with good reason, because data has gravity. Data that's in a data center is probably going to stay there for a long time, and you're probably not going to forklift too much data from data centers into the cloud, but going forward, I have to think we're in very much a cloud-first environment.

So how do you manage that from an analytical perspective? Well, we'll talk about that. So analytics, of course, is a process. It's not just a single activity, it's a process, and ideally, it's a closed loop process. You really want to be able to have access to a lot of data, you want to be able to analyze that data at the speed of thought, as we say. You want to have that conversation with your data.

Well, traditional methods for doing analytics aren't very well designed for that particular use case, for a whole lot of reasons. One, because it just takes time to move the data around. ETL was the mainstay of the data warehousing world. It's still out there today. There's going to be a tremendously long tail for the old way of doing things. I mentioned on our radio show that one of my favorite quotes is by William Gibson, who once said, "The future is here already, it's just not evenly distributed." His point was that the future is in certain pockets around the world, of course, Silicon Valley; Austin, Texas, is another one. You have these tremendous beating hearts of innovation, and that's where a lot of this new technology is coming from, and it's slowly going to find its way out to the rest of the world as the cloud frankly democratizes both functionality and data and access to data.

So on my other radio show, DM Radio, I can tell you, for about eight or nine years now, I would talk to experts from all around the industry. I would ask one question, which is, when are we going to stop the madness, all this ETL? Well, Dr. Phil has this great line where he says, "Things start for one reason but continue for another." I think it's a very compelling thought to bring to the fore here, because ETL is a perfect example.

Well, what happened over many years, especially for larger organizations, is that more and more people wanted access to data. So what does that mean? Well, it means you have to move data around. So you have all these different ETL scripts running in batch processes, and then the ETL developers and the IT teams have to make sure they hit these batch windows. Well, if you can imagine, we reached a point where you're just at a breaking point. You have so much data moving around, it's difficult to get any of these datasets loaded into a data warehouse, for example. That was an inflection point, I would argue. That was part of the reason why we saw this transformation and why this concept of data lakes came about.

So data warehouses: very heavy on engineering, very heavy on leveraging the relational model to get the right data to the right person at the right time. Very useful stuff, but typically pretty expensive. Data warehousing has not been a very cheap activity, though recently we do have some significant innovations. I mentioned Amazon Web Services. Of course, they have Redshift. In fact, the guy I was talking about, Rick Glick is his name, worked at ParAccel and was one of the key engineers designing that solution. The beating heart of ParAccel became the foundation of Amazon Redshift.

So Amazon basically leveraged that technology and has since added to it, of course, and ParAccel has gone away in one of those strange stories of the enterprise software world. But that was Rick Glick who was telling me, eight, nine years ago, what's going on. And you can always see those early signs, you can always see the future as it unfolds in this business, and that's actually one of the things I love so much about being in this field, is that we really get to talk to some of the smartest people who understand the cool things that they're doing.

Well, very quickly, I wanted to just touch on data governance, for a lot of different reasons. There is this whole raft of regulation coming out of the EU called the GDPR, the General Data Protection Regulation, which basically says we have to be more careful with our data, especially the data of EU citizens, since of course that's where it applies. And there are some very interesting components in GDPR, one of which is the so-called right to be forgotten, meaning if you are an end user out there, if you're a consumer, and you find out that some big company has your data and you don't want them to have your data, you can tell the company, "I want you to get rid of my data," and they're supposed to do that.

Well, let me tell you, that's going to be a lot of fun. It's going to spur a whole lot of innovation, but it's just a straw in the wind. And it's interesting, I did a webcast last week on GDPR and data governance, and the timeliness was stunning, because the day before, Mark Zuckerberg himself of Facebook was testifying before Congress, and lots of the different senators were asking him, do you think the Europeans have it right? They were talking about GDPR. They were talking about privacy.

Obviously, this is a major issue, and it's just a straw in the wind here that data governance is important, and really organizations are going to need to have very sound policies that are A) achievable, B) enforceable, and C) defensible, because when the regulator shows up, when the auditor knocks on your door, you're going to need to know what the point of those policies is and explain what it is that you've done. And there's no one size fits all; there's no one solution that covers all these different components. There's no silver bullet. So you're going to need to leverage the functionality of many different tools to achieve robust, effective, defensible data governance.

And with that, I'm going to bring in our guest for today. I'm very pleased to have Kelly Stirman of Dremio online. Kelly, I'm going to hand you the keys for the WebEx here. Go ahead and share your screen, and show us what you got. I think this is fascinating stuff, folks. Take it away, Kelly.

Kelly Stirman:

Thank you, Eric. Well, I'm going to start with a couple of slides just to orient everyone to the conversation as it relates to self-service data and this great new software solution called Dremio that's open source. And we'll talk about many of the things that you touched on, from infrastructure to challenges around data analytics, and bridging the world of traditional tools with some of the newer data formats and data technologies, as well as governance. So thank you for the great introduction.

So just briefly about the company, you've got a group of people who have been working in distributed systems and open source and big data for a decade and longer, who have come together to take this incredible idea that's called Dremio and bring it to market and build a successful business around the technology. And what is it that we're talking about? Well, let me give you some context in how I think Dremio fits into many things that you are already familiar with.

Companies have moved to embrace the model of the data lake over the past five to seven years. Even if they weren't always calling it a data lake, the idea has been around for some time now: with new volumes and variety of data, and data moving faster than ever, and some new software technologies that have been created to address those changes, companies have a different way of thinking about how they deal with data.

Well, for 30+ years, we've had the notion of a data warehouse, where there's a lot of effort and rigor put into structuring data before we put it into the data warehouse. The data lake is much more flexible, and it allows us to take data in its raw form and put it into a place that is fundamentally much more cost-effective to operate, much more elastic, and much more scalable. And in that sense, it's very appealing when you compare it to a traditional data warehouse.

Of course, it has trade-offs. In the absence of all that rigor to structure the data before you put it into the data lake, you necessarily make it harder to access the data that's there, which we'll talk about in just a few minutes. But companies have embraced this idea of data lakes for lots of great reasons.

And here's a kind of overview of what that looks like in practice. On the left, you have data from operational systems that are running the business, and these are a mix of relational databases, things like Oracle and SQL Server, but also some of the newer things like MongoDB and Elasticsearch. And that data is moved into the data lake in its raw form, these raw datasets. And that's the case whether you're running a data lake on AWS using S3, or on Azure using Azure Data Lake Store, or on prem in your own infrastructure using Hadoop. The idea is the same. If you have a file system or object store, that makes it very easy to put whatever data you like into the data lake.

At this point, companies find that they have access to this data for their data engineers and their data scientists, who are fundamentally very capable with software and able to work with data and the APIs available to the data lake. But what they also find is that their BI users, who greatly outnumber their data engineers and data scientists, cannot use standard tools like Tableau and Power BI, Qlik, or Excel. They cannot use those tools with the data that's in a data lake.

So what companies tend to do is they start to try and solve this puzzle. So first of all, they say, "You know what, we need a way to take the data in its raw form and transform it into something that's higher value, cleaner, more normalized, and has the integrity that we need to support the BI users." And we've been doing this for many years in the form of ETL tools. There is a new generation of tools that are designed specifically to work with the complex data formats in the data lake, and to scale using data lake infrastructure, and those are called data prep tools.
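To make that concrete, here's a minimal sketch of the kind of cleanup a data prep step performs, taking raw, inconsistent records and producing something cleaner and more normalized for BI users. All the field names and values here are invented for illustration:

```python
# Minimal sketch of a data prep step: raw, inconsistent records from the
# data lake are cleaned and normalized before BI users touch them.
# All field names and values are invented for illustration.

raw_records = [
    {"name": " Alice ", "amount": "1,200.50", "region": "ny"},
    {"name": "Bob",     "amount": "300",      "region": "NY"},
    {"name": "Carol",   "amount": None,       "region": "ca"},
]

def clean(record):
    """Trim strings, parse numeric strings, and standardize codes."""
    amount = record["amount"]
    return {
        "name": record["name"].strip(),
        "amount": float(amount.replace(",", "")) if amount else 0.0,
        "region": record["region"].upper(),
    }

curated = [clean(r) for r in raw_records]
```

Real data prep tools do far more (type inference, deduplication, scaling out over the cluster), but the raw-to-curated transformation is the core idea.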

And as they start to reshape and refine the data, they now need to inventory and make sense of what they have in their data lake, and so they look at a data catalog tool, so they understand what they have in raw form, and what they have in a curated form for their users to work with.

But at this point, you still haven't really solved one of the core problems of the data lake, which is speed. And without that speed, it's hard for BI users to be able to take advantage of the data in the data lake. And so companies start to look at BI acceleration technologies, cubing technologies that make access for BI workloads possible in the data lake, but don't, unfortunately, solve ad hoc queries.

And so companies still look at ad hoc acceleration, which tends to be a traditional database like Teradata, like Vertica, like Redshift, to give them the speed they need for ad hoc queries. And it's at this point, as they've assembled these pieces of the puzzle together, that their BI users can start to run those workloads on data in the data lake.

But what has happened is that IT has put a lot of complexity onto the plate of the BI users. They have to ask, for a particular query, do I connect to a BI acceleration technology, do I connect to a data mart or a data warehouse, and when do I need to go to the data catalog? I haven't even drawn all the arrows here, but it's a very complex picture that I think doesn't really embrace the ideas of self-service that are so important to BI and analytics today.

So we created Dremio to propose a really fundamentally different approach. We said there's gotta be a better way, and when we sat down and said, what would that have to be, what are the core tenets that we would need to accomplish to change this picture? We said, well, look, it would need to work with any data lake. Not just one cloud vendor or another, not just Hadoop. All of them. And it would also need to work with any BI or data science tool, whether you're using Tableau or Excel or Python or R, every company has a mix of those tools, and it needs to work with all of them.

It would need to solve the fundamental data acceleration problem for BI and ad hoc workloads, by accelerating data by between 10x and 1,000x. It would need to provide a self-service semantic layer, so business users could describe data on their own terms and work together to model and represent data for their own needs. We think that everyone wants a little bit of a different shape and formatting of the data for whatever job they're working on, and traditionally, that's meant lots of ETL and lots and lots of copies of data. And we want to solve that with a zero-copy data curation model.

And we think that a solution like this has to scale like the data lake. All those things that made the elasticity and the infinite scalability of data lakes so appealing, the same has to be true of Dremio. And then finally, we think something like this has to be open source.

And so at a high level, that's what Dremio is. It's a new tier in data analytics called self-service data, which sits between the tools of the data scientists, data engineers, and BI users, and the infrastructure and the data of the data lake. And it runs directly in that infrastructure, providing acceleration capabilities, self-service capabilities, a semantic layer for end users to describe data in their own terms, and zero-copy curation capabilities, and we automatically track data lineage throughout all these different workloads across all the different tools.

And we do this by taking advantage of the infrastructure, by being able to access the data in its raw form in a highly optimized way, and to model and represent the data in a curated form without making any copies, while in the background, we take advantage of the low-cost storage provided by the data lake for Dremio's data reflections, which are core to the data acceleration capabilities.
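To give a feel for why a precomputed summary accelerates queries, here's a toy sketch: rather than rescanning every raw row per query, the engine maintains an aggregate in the background and answers matching queries from it. Dremio's data reflections are conceptually in this family, though far more sophisticated; the dataset and fields below are invented for illustration:

```python
# Toy sketch of acceleration via a precomputed summary. Instead of
# scanning all raw rows for every query, build an aggregate once (in the
# background, on cheap storage) and answer matching queries from it.
# Dataset and field names are invented for illustration.

trips = [
    {"borough": "Manhattan", "fare": 9.5},
    {"borough": "Brooklyn",  "fare": 12.0},
    {"borough": "Manhattan", "fare": 7.0},
]

# Build the summary: per-borough count and total fare.
summary = {}
for t in trips:
    agg = summary.setdefault(t["borough"], {"count": 0, "total": 0.0})
    agg["count"] += 1
    agg["total"] += t["fare"]

# A query like "average fare by borough" is answered from the summary,
# without touching the raw rows again.
avg_manhattan = summary["Manhattan"]["total"] / summary["Manhattan"]["count"]
```

With a billion raw rows, the savings from answering out of a small summary like this are what turn minutes into sub-second responses.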

So that's it at a high level. It's a new layer that sits between the data that you already have without forcing you to move the data into yet another repository, and lets you continue to take advantage of the skills and tools you already have deployed across your data scientists, data engineers, and BI users. It makes the data faster and easier and more self-service for everyone.

It's a whole lot more fun to look at this picture in person instead of just hearing me talk about it in slides, so what I'd like to do now is just jump over into a browser and take a look at Dremio in action. And I'll just say, while I'm doing this, if you have questions about anything that I'm talking about, please use the Q and A feature in the WebEx to ask your questions, and then we will get to those questions a little bit later, in just a few minutes.

So Eric, are you able to see my browser window now?

Eric Kavanagh:

I sure can.

Kelly Stirman:

Okay, great. So what I've done here is logged into Dremio through a browser. Any modern browser will do. And I've accessed a cluster that Dremio is running on, a small, four node cluster that happens to be running in Google Cloud. And I am logged in as an administrator, so I can see the whole world, which is a nice way for me to show you all the different capabilities. We won't look at everything, but we'll look at several different scenarios to help you understand how a self-service data platform can really complement the existing investments you have in BI and data science tools, as well as where you already have your data stored today.

So what I've done in this cluster is I've gone ahead and connected up a number of different data sources, from a data lake to Elasticsearch, to Hive, MongoDB, Oracle, Postgres, and even though I've got this cluster on Google Cloud, I'm also connected to Redshift and S3 over on AWS. So I've really got a cluster here that spans both clouds.

And up above, I have what Dremio calls spaces, and this is the area in which data consumers organize their datasets and collaborate together to do analytical work. So let's take a look at a simple scenario. So, for example, let's say I'm a Tableau user, and I've been assigned the task of analyzing data related to taxi rides in New York City. And I realize for most of you, the nature of taxi rides in New York City may not be core to your business, but bear with me, 'cause it's a sample dataset that I have access to here.

So if you were assigned to work with data, one of the first questions is, where is it? And most companies don't have a central catalog with an inventory of all the different datasets. Well, one of the nice things about Dremio is that when you connect it to any of these sources, whether it's a relational database or Hadoop or S3 or MongoDB or Elasticsearch, all these different sources, we automatically discover schema and build a catalog and index it so it's searchable.
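The discover-and-index idea can be sketched in a few lines: register each discovered dataset's schema in a catalog, then search by name or column. The sources, dataset names, and columns below are invented for illustration:

```python
# Toy sketch of the catalog idea: record discovered dataset schemas from
# several sources and make them searchable by name or column.
# Source names, dataset names, and columns are invented for illustration.

catalog = {}  # dataset name -> metadata about where it lives and its schema

def register(source, dataset, columns):
    """Record a discovered dataset and its schema in the catalog."""
    catalog[dataset] = {"source": source, "columns": columns}

def search(term):
    """Return datasets whose name or any column mentions the term."""
    term = term.lower()
    return [
        name for name, meta in catalog.items()
        if term in name.lower()
        or any(term in col.lower() for col in meta["columns"])
    ]

register("hdfs", "nyc_trips", ["pickup_time", "dropoff_time", "fare"])
register("postgres", "employees", ["name", "hire_date", "salary"])
```

A real catalog adds schema inference, full-text indexing, and permissions, but this is the shape of the lookup an end user's search box performs.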

So I want to find taxi rides in New York City. I can just go into my search box and type in "trips" and get back a set of search results. Now, these search results correspond to datasets from any of those sources in my company. And by clicking on the first search result, I can jump into a sample of this data.

Now, at this point, I don't need to care whether it's stored in a relational database, or in a file system, or in Mongo DB. It doesn't matter to me as an end user. I'm getting this nice, quick sample so I can visually inspect this data and see, is this what I'm looking for?

So, if I look, each row here corresponds to a taxi ride, and I can see pickups and drop-offs, the number of passengers, how far the trip was, and if I scroll over to the right, I have a breakdown of the fees. The tolls, the tip, the taxes, et cetera, et cetera. So from here, there are two things I could do. I could say, this is exactly the data I'm looking for and start to analyze it with Tableau, or I could say, this is not exactly what I'm looking for, I want to reshape it or change it in some way before I do my analysis. And we'll look at both of those scenarios briefly.

But let's say this is what I'm looking for, and Dremio works with any tool that can generate SQL over ODBC or JDBC or REST. But we have a couple tools that we can launch directly from within Dremio. So if I click on this Tableau button, it will launch Tableau connected to this dataset. Now, what I actually have to do here real quick, is share Tableau, because I don't have the option to share my desktop.

Eric Kavanagh:

I think you can just share your whole desktop. Does it put you through share application?

Kelly Stirman:

Sorry, say that again?

Eric Kavanagh:

Yeah, so I think if you ... Did you share your application or did you share your desktop?

Kelly Stirman:

I don't have the option to share my ... I don't have the option to share my whole desktop.

Eric Kavanagh:

I can actually take it back and let you do that if you want. Let me do that just to give you a chance to go ahead and restart. I'm going to take this back, and folks, send your questions. And I do have one question, actually, I'll throw over to you in a second, Kelly. Now I'm going to give you the keys again, and this time, instead of sharing your application, just share your desktop and I think that should enable you to share both those apps. So go back under quick start. And underneath share, there's more options, and you should be able to share your screen, that's what it's called. Do you see that?

Kelly Stirman:

I have ... Let's see. I have all applications. I just have a list. I don't have the option to share the whole desktop. But I think we're okay if I just ... Sorry about this, everyone. If I share this next one ...

Eric Kavanagh:

Just go back to sharing, yeah. Kelly is actually in a brand new office in Austin, Texas. They just launched an office down there in one of my favorite cities in the world. The heartland of Texas. There we go.

Kelly Stirman:

So now you can see that I'm logged into Tableau, and if I first start by just looking at how many records I'm dealing with here, you can see it's just over a billion rows of data. And let me now just do a couple of quick things so you can get a sense for how interactive the experience is with Tableau.

Eric Kavanagh:

Wow. And you're reaching through Dremio into ... You can reach into all sorts of different environments, right? That's the whole idea, that you can grab data from any number of sources, any number of environments, right?

Kelly Stirman:

That's right. It doesn't really matter where the data is. In this case, I'm looking at a billion rows of CSV data that's stored in HDFS, though it doesn't matter if this was MongoDB or Elasticsearch or a relational database. And frankly, I shouldn't have to care as an end user, right? I should just be able to get this nice, interactive experience without waiting for IT to build an extract for me, and without worrying about the underlying technology at work.

And the key thing here, just to demonstrate, is that all of these queries are dynamic SQL queries over ODBC. I haven't moved things into a data mart or built an extract. Every single one of these queries is a live query, which means it doesn't matter if I'm using Tableau. Tableau looks great, it's really easy to use, there are lots of reasons why it's so popular. But in terms of the performance and the interactivity, it doesn't really matter which tool I'm using. I can have the same nice kind of experience.

Eric Kavanagh:

That's amazing, and you actually just ... You must have been reading my mind here, but there's a question that came from an attendee asking, does the data get stored on the Dremio platform, or is it just cataloging the data in the data lake? And what you're talking about here is accessing the data directly, right? Can you talk a little bit about Apache Arrow and the in memory architecture and how that works?

Kelly Stirman:

Sure. So, and by the way, I've just gone in here to look, and this is the SQL query that was coming over ODBC into Dremio, and then Dremio's query planner is figuring out, what's the most efficient way to run this query. And that in some cases can mean pushing the query down to the underlying source, or in this case, accelerating the query using what Dremio calls data reflections. And ultimately, no matter whether we're using a reflection or pushing the query down into the underlying source, we read data into Apache Arrow buffers and do the execution of the query in our distributed query engine that's going to run on maybe 10 nodes, maybe 100 nodes, maybe 1,000 nodes. It just depends on the scale of your data and the concurrency of your queries.

So this query was executed in under a second, even though it was a billion rows of CSV in HDFS.
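The columnar layout Kelly mentions is the key idea behind Apache Arrow: each column lives in its own contiguous, typed buffer, so a query that scans one column never touches the others. Here's a toy, stdlib-only illustration of that layout (the idea only, not the actual Arrow format; the columns are invented for illustration):

```python
from array import array

# Toy illustration of the columnar, in-memory layout that Apache Arrow
# formalizes: each column is one contiguous typed buffer, so scanning a
# single column never touches the others. This shows the idea only, not
# the actual Arrow format. Column names are invented for illustration.

rows = [(1, 9.5), (2, 12.0), (3, 7.25)]  # (passenger_count, fare)

# Row layout would interleave both values per record; columnar layout
# keeps one typed buffer per column instead.
passengers = array("i", (r[0] for r in rows))  # contiguous 32-bit ints
fares = array("d", (r[1] for r in rows))       # contiguous 64-bit floats

# Summing fares scans only the fares buffer: cache-friendly and easy to
# vectorize, which is why execution engines work over such buffers.
total_fares = sum(fares)
```

That cache-friendly, vectorizable access pattern over in-memory buffers is a large part of how a billion-row scan can come back interactively.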

Eric Kavanagh:

That's unbelievable. That really is amazing. And so that in memory approach really allows you to deliver that speed that people want, right?

Kelly Stirman:

That's right. A key thing, though, is it's not required that your data fit into memory. We are reading the data into memory into these really optimized Arrow buffers, and that's part of how Dremio can make this experience so nice and interactive for an end user.

So let's take a look at that other scenario, which is the data that I'm looking for is actually not already created, and I want to go create that for this job that I want to do. Now, traditionally what that means, of course, is you go to IT and say, "Hey, IT, I would like you to create this dataset for me." And then you wait for weeks or maybe months, and then IT finally produces the data that you asked for, and hopefully they understood what you were looking for. And then they make a copy of that data and put it somewhere for you to access.

And so we're really trying to just completely get away from that model of waiting for IT and then having them create a copy of the data, because people don't want to wait. The data consumer doesn't want to wait, and IT doesn't want to make copies, because more copies means more governance challenges, more security vulnerabilities, greater cost, et cetera, et cetera. So neither party is happy in this relationship.

So let's do a couple things here. I'm going to create a new space, and save it here. So if I go to my list here, I'll just put Bloor at the top. You can see there's nothing in this space, and you can think about these spaces like project folders, so based on my LDAP group membership, I would only see spaces that I have access to.

Now I'm going to go into my catalog here, and I'm going to go into this Postgres database. So this is not data that's in a data lake already. This is one of my existing systems. So I'm going to go into this HR database, and let's say the request came from a data science team, and either that team is going to be doing this work themselves, or maybe you have a data engineer that's facilitating this experience for them. But the idea is they're working on building a model that has something to do with HR data, and they want a specific slice of the data for the model, because one of the key topics and trends in data science these days is the data sample, the whole concept of sample bias and finding the right data to feed into your models.

Eric Kavanagh:

Yeah, I think you need to be sharing your browser app, right? Is that what you're trying to do? Because we're still seeing the Tableau thing.

Kelly Stirman:

Oh, sorry about that. Okay. Tired of seeing that, let me go back to my browser here. So I've created this Bloor space with nothing in it, and then I went into my Postgres database down here below by just clicking Postgres. I could've searched for this, but here I wanted to show you browsing into a database, and each of these icons is a physical dataset. These are tables in this database.

And if I click on this employees table, I can see again a sample of the data to help me visually inspect and understand what I'm looking at, and I'm now going to build a dataset by making changes to this data. Now, Dremio can't update the data at the source, because Dremio is a read-only application, but we can make the changes we need in a virtual context. So let me show you how that works.

There are two things to look for. One is that this icon is going to go from purple to green to reflect that you're now in a virtual context, and then over here on the right, these dots are going to start accumulating, because we version control all the changes you make so you can easily toggle back and forth between them.

So let's do a couple of quick things here. I'm going to get rid of this column, employee ID, 'cause nobody's going to use this in their model. So I'll just say drop, and that column goes away, and you see the purple change to green. And now I want to focus on the senior employees in the company, because my data science team believes that data about the senior employees will be more meaningful for this model that they're working on.

So then I go to the hire date column and say keep only. That's going to highlight the hire date column and show a histogram of hire dates for all the employees. And by sliding this slider over, I can zero in on employees who started before a particular date. So I'm focused on the senior employees, and when I slid that slider over, the data refreshed below dynamically to give me feedback on my selection.
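
The drop and keep-only gestures just described compile down to an ordinary SQL view rather than a copy of the data. Here's a minimal sketch of that idea using Python's built-in sqlite3 as a stand-in for the source system; the table, columns, and cutoff date are hypothetical, not Dremio's actual internals:

```python
import sqlite3

# In-memory stand-in for the source database. Dremio itself is read-only
# against the source; the "edits" live only in the view definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees "
             "(employee_id INTEGER, name TEXT, hire_date TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Ana", "2001-03-15", 90000.0),
     (2, "Bo", "2015-07-01", 70000.0),
     (3, "Cy", "1998-11-20", 110000.0)],
)

# The two UI gestures -- "drop employee_id" and "keep only hire_date before
# a cutoff" -- amount to a view: a virtual dataset, not a copy of the data.
conn.execute("""
    CREATE VIEW my_employees AS
    SELECT name, hire_date, salary
    FROM employees
    WHERE hire_date < '2005-01-01'
""")

rows = conn.execute("SELECT name FROM my_employees ORDER BY name").fetchall()
print(rows)  # senior employees only; the base table is untouched
```

The source table is never modified; querying the view applies the transformations on the fly.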

I can click apply. And there are literally thousands of things we could do here, but I'm just showing you a couple of quick, easy ones. Now let's say I want to go get data about the departments that the employees work in, but that's in a completely different system. So I click join, and now Dremio will recommend other datasets that it has learned are complementary to the dataset I was just looking at. And actually, the first option here is the one that's been used the most frequently, and that comes from a table in Redshift.

Now, the employee data you see below is coming from a Postgres database running on Google Cloud, and the department data is going to come from a table in Redshift. So I'll click apply, and that will blend these two datasets together. Now, I could build my own join if I like, but Dremio's making a recommendation so I can take advantage of work that other people have done.

And if I slide over here to the right, now I can see data about the departments that these employees work in. So I'll save this as what we call a virtual dataset, and I'll give it a name. I'll call it My Employees, and I'll put it in the Bloor space, so that everyone on my project team has access to this virtual dataset.
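
The cross-source join being described can be pictured as a federated engine pulling rows from two independent connections and performing the join itself. This is an illustrative toy, with two in-memory SQLite databases standing in for Postgres and Redshift; all table and column names are made up:

```python
import sqlite3

# Two separate connections stand in for two different systems:
# "postgres" holds employees, "redshift" holds departments.
postgres = sqlite3.connect(":memory:")
postgres.execute("CREATE TABLE employees (name TEXT, dept_id INTEGER)")
postgres.executemany("INSERT INTO employees VALUES (?, ?)",
                     [("Ana", 10), ("Bo", 20)])

redshift = sqlite3.connect(":memory:")
redshift.execute("CREATE TABLE departments (dept_id INTEGER, dept_name TEXT)")
redshift.executemany("INSERT INTO departments VALUES (?, ?)",
                     [(10, "Executive"), (20, "Engineering")])

# A federated engine fetches rows from each source and joins them itself;
# here, a simple hash join in the "engine": build a lookup from one side,
# probe it with the other.
depts = dict(redshift.execute("SELECT dept_id, dept_name FROM departments"))
joined = [(name, depts[dept_id])
          for name, dept_id in postgres.execute(
              "SELECT name, dept_id FROM employees ORDER BY name")]
print(joined)
```

Neither source ever sees the other's data; the blend exists only in the engine doing the join.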

So if I go into that space now that was empty before, I can see this virtual dataset we just created, and now I can launch Tableau connected to this virtual dataset. And then go back to share Tableau again, apologies.

Eric Kavanagh:

That's all right.

Kelly Stirman:

And ... Can you see Tableau now?

Eric Kavanagh:

Not yet.

Kelly Stirman:

Wow.

Eric Kavanagh:

It worked a second ago when you switched. We're encountering the limitations of WebEx, I believe, is what we're looking at. There we go.

Kelly Stirman:

Well, I'll log in with my LDAP credentials again. And now I'll take the department names, which are coming from Redshift, and I'll take the salary data which is coming from Postgres on Google Cloud, and I can see that executives are getting paid the most.

And so what just happened? Well, what just happened is, I built a sample dataset for my data scientists without writing any code, and without moving any data, that made it so they could run their models on a refined sample dataset that otherwise, without Dremio, they would've waited some prolonged period of time to get exactly the same data, and it would've been a copy of the data that would've been loaded into some new system. So it really shortened the time for the data scientist team to get access to the data, and I gave it to them in exactly the format that they wanted, and I did it without making any copies. But I also did it without writing any code, which is really exciting.

Eric Kavanagh:

Yeah, we have a couple of really great questions coming in here from the audience. Let me throw a couple of those over to you. One of the questions is, how is this different from something like Denodo, 'cause it is a sort of data virtualization, right? What are the key differentiators between Dremio and Denodo?

Kelly Stirman:

Yeah, there are a couple of things. So first of all, one of the things we saw with the taxi data is an acceleration capability that is not part of data virtualization technologies. The way that Dremio is able to improve the performance of queries by 10x, 100x, 1000x is a patent-pending capability called data reflections, where we're taking advantage of columnar data structures and in-memory computing and our sophisticated query planner to build alternative query plans that are dramatically faster.

In a data virtualization technology, you're really distributing the query to some existing database engine, and so you're really only as fast as the slowest member of that query plan. That's the first thing that's different. The second thing that I think is really different is that Dremio is designed to work with things like MongoDB and Elasticsearch and S3 and Hadoop and all these new, modern sources of data that are really not the areas of strength for traditional data virtualization technologies, which were designed primarily for relational databases.

And then the third thing, of course, is that Dremio's open source, whereas the data virtualization technologies on the market are proprietary and very expensive to operate. So I think those are three big differences between what you're seeing with Dremio. The fourth and probably the biggest in terms of day to day, of course, is that data virtualization products are IT products. They are not self-service products like what you're seeing here with Dremio.

Eric Kavanagh:

Yeah, that's a good point. And I think a lot of the magic here comes from the ability to reach into all of these different kinds of data sources, because I think you guys figured out, heterogeneity is here to stay. I actually remember back in 2006 and 2007, there was a big push by IBM and Oracle and SAP to standardize, right? They were like, oh, why don't you standardize? Everything will work so much better if you use all Oracle apps or all IBM apps, et cetera. And that's just not a reality. The bottom line is that we're going to have lots of different types of data, lots of different data stores, especially in this new world of big data. So being able to virtually coalesce these disparate systems in different data models, for example, that's pretty cool stuff, because now the analyst can get a much more clear picture across different dimensions much faster, and you're not creating a silo, an actual physical silo, right?

Kelly Stirman:

Yeah. I think as long as there's been database companies, the answer has always been, "Well, just move all your data into my database."

Eric Kavanagh:

Right.

Kelly Stirman:

And to that end, to what you're saying, can you see my browser?

Eric Kavanagh:

I can see your browser, yeah.

Kelly Stirman:

Let's do something with one of the newer sources. The first thing we did was query CSV files in HDFS. Let's look at JSON in MongoDB. So here is Yelp data about businesses, and each one of these rows corresponds to a JSON document. I can go look and say, hey, I want to look at a subset of [crosstalk 00:41:06]-

Eric Kavanagh:

Wow.

Kelly Stirman:

I can say, keep only businesses from these particular cities: Charlotte, Phoenix, and Scottsdale. Click apply. And now I want to look at this category data, which is in an array. An array is a data structure fundamental to JSON that is incompatible with SQL and BI tools. What do I do about that?

Well, traditionally, that would mean I write a bunch of code and I copy the data into a relational database. Here I can just click unnest, and now I get a row for each item in the array, and I can rename this column to be category. And I'll do one other quick thing: I want to get the ZIP codes out of the address field, in their own column. I can click extract and get a set of regular expressions. The first one works. Click apply. And now I've got ZIP codes in their own column, along with the full address, categories flattened out, and a subset of the cities. I can save this as a virtual dataset, and I can join it to other data sources.
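
The unnest and extract steps just described are easy to picture in plain Python: one output row per array element, plus a regex pull of the ZIP code. The documents below are hypothetical Yelp-style records invented for illustration, not the actual demo data:

```python
import re

# Hypothetical documents: each business has an array of categories and a
# free-text address ending in a ZIP code.
businesses = [
    {"name": "Desert Diner", "city": "Phoenix",
     "categories": ["Restaurants", "Diners"],
     "address": "100 Main St, Phoenix, AZ 85004"},
]

# "Unnest": emit one flat row per item in the categories array, so the
# nested JSON becomes something SQL and BI tools can consume.
rows = [
    {"name": b["name"], "city": b["city"], "category": c,
     # "Extract": pull the 5-digit ZIP out of the address with a regex.
     "zip": re.search(r"\b(\d{5})\b", b["address"]).group(1)}
    for b in businesses
    for c in b["categories"]
]
print(rows)
```

One business with two categories becomes two flat rows, each carrying its own category value and the extracted ZIP.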

Here, one of the recommendations is to join to reviews about the businesses that are in Elasticsearch, which makes sense. I want to be able to search my reviews, so I put them in a search engine. I can click apply. And now I've got data about the reviews way over here on the right that describes these businesses, combined with data from the businesses in MongoDB. I can save this as Good Businesses and put that in my Bloor space. And now if I go into Bloor and look at Good Businesses, I can connect to this with Tableau and start to analyze the data that's joined between two NoSQL systems. That's something you haven't been able to do without Dremio. Certainly not at this speed.

Are you able to see all the-

Eric Kavanagh:

You've got another couple of questions. Oh, go ahead.

Kelly Stirman:

Are you able to see my Tableau?

Eric Kavanagh:

Yes, I can. It looks like the original. Maybe do a refresh on that if you could.

Kelly Stirman:

'Cause this is kind of fun to see in action, share application, Tableau. Share application. How about that?

Eric Kavanagh:

Okay, yeah. So what's so interesting here is you're able to, in one environment, be able to navigate and discover across multiple different data sources and mash them all up together, right?

Kelly Stirman:

Yeah, there we go. So I have more reviews by category in restaurants than anything else, and this is millions and millions of documents across Elasticsearch and MongoDB.

Eric Kavanagh:

Let me throw a couple of these questions at you, and for my own edification too, I'm really fascinated by this concept of reflections. Can you dig into some details about what constitutes a reflection, and how are you able to make this happen?

Kelly Stirman:

Yeah, I think a reflection is similar to an index in a database, which is to say ... If you connect Tableau to Oracle and there are no indexes, then your queries are going to be slow. And then you add indexes, and it makes the queries much faster. You don't connect to a new database, you don't change your SQL, but the indexes help the query planner figure out a smarter, faster way to run the query.

Reflections are similar in that Tableau connects to Dremio whether there's a reflection or not. But if the query is not fast enough, users can vote to say, hey, I think this dataset should be faster, and Dremio will look at those votes and automatically create reflections that will accelerate the queries of the end users. The end users don't have to connect to something new, they don't have to change their SQL. Now the query planner can figure out a smarter way to run the query that's going to be much, much more efficient than doing what it was doing before.
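
The index analogy can be seen directly in any database's query planner: adding an index changes the plan, not the SQL. Here's a small sqlite3 illustration of that point; the table and index names are invented, and this shows the analogy, not Dremio's actual reflection machinery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxi (pickup_zone TEXT, fare REAL)")
conn.execute("INSERT INTO taxi VALUES ('Midtown', 12.5), ('JFK', 52.0)")

query = "SELECT fare FROM taxi WHERE pickup_zone = 'JFK'"

# Without an index, the planner has to scan the whole table...
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# ...adding one changes the plan while the SQL stays identical, which is
# the analogy: reflections give Dremio's planner a faster alternative
# plan without the user connecting to anything new.
conn.execute("CREATE INDEX idx_zone ON taxi (pickup_zone)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(before[0][-1])  # a full-table scan
print(after[0][-1])   # a search using the index
```

The client's query text never changes; only the plan the optimizer chooses does.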

And so this is, as I mentioned, a patent-pending capability that's part of Dremio that leverages Apache Arrow to make the query execution, in some cases, hundreds of times faster than in the absence of reflections.

Eric Kavanagh:

That's so interesting. And there's another question around discovery. Is there a way, an attendee is asking, to use Dremio to scan across a whole bunch of different systems to hopefully find PII, personally identifiable information?

Kelly Stirman:

That's a great question, and something that we are planning to add to the product: the ability to automatically detect sensitive patterns like PII. You could imagine industry codes, like ICD-10, where there are certain patterns that can be automatically discovered and flagged in the data to make it easier for users to work with the data.

As things stand today, you can use Dremio to explicitly go look for certain patterns. So if you know what to look for, you can ask Dremio to look for those things. For example, you could programmatically mask PII data in Dremio over the wire as it's being accessed by different tools, leveraging LDAP. But right now, we don't automatically go find all those things for you. You need to tell us to go look for them.
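
An explicit, pattern-based masking rule of the kind described might look like the following sketch. The SSN pattern, the group name, and the row shape are all assumptions for illustration, not Dremio's API:

```python
import re

# Hypothetical masking rule: a known sensitive pattern (US SSNs here) is
# masked in-flight, applied only when the requesting user lacks a
# privileged group ("hr-privileged" is an assumed LDAP group name).
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_row(row, user_groups):
    """Return the row with SSN-shaped strings masked for ordinary users."""
    if "hr-privileged" in user_groups:
        return row  # privileged users see the data unmasked
    return {k: SSN.sub("***-**-****", v) if isinstance(v, str) else v
            for k, v in row.items()}

row = {"name": "Ana", "ssn": "123-45-6789"}
print(mask_row(row, {"analysts"}))       # masked
print(mask_row(row, {"hr-privileged"}))  # unmasked
```

The source data is never rewritten; the mask is applied to the result stream, so different users see different views of the same rows.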

Eric Kavanagh:

In terms of data governance, you have the ability to get very granular, is that right? Can you talk about how you're able to allow only certain people access, even down to the row level?

Kelly Stirman:

It's true, yeah. One of the things that we do here ... We created this virtual dataset called Good Businesses, and we track the lineage of this data. If I look, this is the schema for what we created, and that green icon tells me this is virtual. I can see it's descended from this physical source in MongoDB, this collection in MongoDB, and this physical index in Elasticsearch. We track these relationships in a dependency graph, so if you wanted to see, for example, all the virtual datasets descended from this collection in MongoDB, you'd see all the different people that have made their own virtual dataset. And in one click, I can go see all the queries that have ever been issued on that virtual dataset, no matter what the tool was, no matter who the user was: what their query was at that time, how long the query took. We even cache the results of that query.

We use this dependency graph to track the relationship between the different datasets, and as users are accessing data through these virtual datasets, we can programmatically give you row- and column-level access controls to mask certain values, to block access to certain values, to return defaults in some cases, to do joins in other cases. It's the ultimate flexibility: dynamically controlling exactly what users are seeing at the row and column level across all these different data sources.
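
Row- and column-level controls of this sort amount to a per-role policy applied as data flows through: filter out forbidden rows, project away forbidden columns. A toy sketch, where the roles, columns, and policy table are all hypothetical:

```python
# Hypothetical per-role policy: which columns a role may see, and a
# predicate deciding which rows it may see.
POLICIES = {
    "analyst": {"columns": {"name", "department"},
                "row_filter": lambda r: r["department"] != "Executive"},
    "admin":   {"columns": {"name", "department", "salary"},
                "row_filter": lambda r: True},
}

def apply_policy(rows, role):
    """Apply row filtering then column projection for the given role."""
    p = POLICIES[role]
    return [{k: v for k, v in r.items() if k in p["columns"]}
            for r in rows if p["row_filter"](r)]

rows = [{"name": "Ana", "department": "Executive", "salary": 200000},
        {"name": "Bo", "department": "Engineering", "salary": 120000}]
print(apply_policy(rows, "analyst"))  # no Executive rows, no salary column
print(apply_policy(rows, "admin"))    # everything
```

Because the policy is evaluated at access time, the same virtual dataset yields different results for different users with no extra copies of the data.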

Eric Kavanagh:

That's amazing. And as I'm looking at this, I'm realizing we see this across the industry. There are various tools and technologies that are trying to address a similar reality that you guys have focused on, but I really see this remarkable development where you have what amounts to a multifaceted, highly versatile lens through which you can look at different datasets, because as the information landscapes get broader and wider and more diverse in terms of their topology, we're going to need simpler ways to view them and to explore them.

And I think by adding this virtualization component, you've tackled one of the key requirements of the business, which is to enable specific people to create their own view of the world without changing data at the source and without moving data around, because it's those unique views where you can provide some value back to the business, right?

Kelly Stirman:

Couldn't have said it better myself. You need to be able to give users exactly what they want without making copies. You need to accelerate the access so people can work at the speed of thought. And you need to preserve governance and security end to end so that IT remains in control while people get exactly what they want.

Eric Kavanagh:

That is really cool. We have a question here about metadata. Can you talk about metadata, how that's stored, how it's managed, how it's used in order to help users align different information sets?

Kelly Stirman:

Sure. When we connect to these sources, like this Oracle database, we automatically query the system tables to collect metadata about all of the schema in the database, and then we store that metadata in our catalog and make it searchable. You can either search the catalog using Google-style searches, or you can explicitly request an inventory from the catalog of all the sources and all the tables and collections and indexes available within each source, and that's all available through our REST APIs.

We also capture data types and we sample some of the data to understand cardinality, so we build this rich picture of the source systems. But we also expect those systems to change. Schemas evolve, and we automatically discover schema changes at query time, and we also programmatically update our catalog by periodically revisiting the schema of the sources. So the catalog is central to Dremio in terms of optimizing how people use the data, but it's also something that end users can query through our REST APIs to do interesting things through our catalog.
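
Harvesting metadata from system tables and making it searchable can be sketched against SQLite's sqlite_master, the rough equivalent of Postgres's pg_catalog or Oracle's ALL_TABLES; the tables here are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE departments (id INTEGER, dept TEXT)")

# Query the system tables to build a catalog of every table and its
# columns, the way a metadata crawler would.
catalog = {
    name: [col[1] for col in conn.execute(f"PRAGMA table_info({name})")]
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
}
print(catalog)

# A "Google-style" search over the catalog is then a simple match over
# table and column names.
hits = [t for t, cols in catalog.items()
        if "emp" in t or any("emp" in c for c in cols)]
print(hits)
```

Re-running the crawl periodically, as the transcript describes, keeps the catalog current as source schemas evolve.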

Eric Kavanagh:

Wow, that is just really, really interesting. I have one question about mainframes. Can you actually tap into mainframe data?

Kelly Stirman:

Not yet. It's something we're working on. Let me just give you a quick look at what we support today. You see a mix of Redshift, S3, ADLS, NoSQL sources like Elasticsearch and MongoDB, and [HFace 00:51:59] file systems. And we have lots of things we're working on, like Teradata and Salesforce and Cassandra, and then maybe DB2 and the mainframe is around the corner. We'll have to see.

Eric Kavanagh:

Wow. Okay, one last question. You already kind of talked about this, and maybe just add some more context. And folks, we do archive all these webcasts for later viewing, so you should be able to come back later on even today to watch it. But scalability, can you talk a bit more about how you enable scalability?

Kelly Stirman:

Sure. We designed Dremio to scale like Hadoop, to run on hundreds and potentially thousands of nodes. There are two types of nodes in a Dremio cluster: one is called an executor node, and the other is called a coordinator node. When you submit a query over ODBC or JDBC or REST, that goes to one of the coordinator nodes, and it's the coordinator nodes that track the metadata about the system and build query plans, and then they farm out those query plans to the executor nodes.

And so the executor nodes do all the work of executing the query, performing the joins and sorts and aggregations, and streaming the results back to the client. You would scale the coordinator nodes to accommodate greater concurrency in the queries, and you would scale the executor nodes to accommodate greater data volumes, and you can scale those two things independently of one another. If you're running a Hadoop cluster, you can run Dremio as a YARN application directly in the Hadoop cluster. If you're running on AWS or Azure, you can provision Dremio on as many nodes as you like in an elastic fashion, and we will take advantage of things like S3 and ADLS to store the data reflections.
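
The coordinator/executor split described here is a classic scatter-gather pattern: partition the data, aggregate partials on the executors, merge at the coordinator. A thread-based sketch, where the function names and partitioning scheme are illustrative rather than Dremio's implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def executor_node(partition):
    """One executor's share of the work: a partial aggregation."""
    return sum(partition)

def coordinator(data, n_executors=4):
    """Split the data, farm partitions out to executors, merge partials.

    Scaling executors handles more data; scaling coordinators (here, just
    this function) would handle more concurrent queries.
    """
    chunk = max(1, len(data) // n_executors)
    partitions = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=n_executors) as pool:
        partials = list(pool.map(executor_node, partitions))
    return sum(partials)  # coordinator merges the partial results

total = coordinator(list(range(100)))
print(total)  # 4950, same as summing in one place
```

The merged result is identical to a single-node sum; the point is that the heavy lifting is spread across workers that can be scaled independently of the node planning the query.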

Eric Kavanagh:

That is amazing stuff, folks. We burned through an hour, a little bit over an hour here, with our friend Kelly Stirman of Dremio. Absolutely fantastic technology leveraging Apache Arrow. And they have a new office in Austin, Texas. Kelly, thank you so much for your time. Thanks, all of you, for great questions today. Like I said, we do archive all these webcasts. Come back later and check it out, and check out Dremio. I have to say, it's one of the most impressive technologies I've seen. Really, it's a straw in the wind for the future of data management, for the future of analysis. We're going to need solutions like this to be able to reach across these incredibly diverse environments and stay on top of them and build those virtual views, those reflections, which is fantastic stuff.

Thanks so much, Kelly, for your time today. Thanks all of you out there, take care and talk to you next time. Bye bye.