Self-Service Data for the Data Lake

Transcript

Data-as-a-Service: Self-Service Data for the Data Lake

Kelly Stirman:

What I want to talk about today ... Actually, before I get to this picture, I just want to ask everyone to do a little bit of a thought experiment before we get started. Think about the last time that you spent some time doing some searching on Google. Maybe you were planning a trip or planning a night out, or maybe you have a child who's working on a report or something like that, and you wanted to know the answer to a question.

For example, what is the size of the Earth relative to the sun? How long would you expect it to take to answer a question like that? I think for most of us, the answer to that is a few seconds. You would go into Google or your favorite search engine, type in a few keywords, and typically get to a reasonable answer either right away, in terms of what Google puts in front of you, or a few of the pages on the first 10 results are likely to get you the answer to the question that you asked.

If you take that example of what you experience in your personal life, now I want you to consider going into work and asking a question about the business, something about your customers, something about your products. I don't know what, but just think about what are your expectations in terms of how long it takes to answer that question.

For most people, the answer is days, weeks, months. If it's not already an answer that's being put into some sort of a dashboard, chances are you're going to have to go to somebody who's in control of that data or you're going to have to go to IT and ask them to provision data on your behalf so you can start to ask questions.

That contrast of what you experience in your personal life with how you experience asking questions and getting answers at work is at the heart of what we want to talk about today. I just wanted to put that example out there for everyone as a way to frame the conversation and give you an understanding of what we're fundamentally trying to solve with Dremio and our data-as-a-service platform. Okay. Let's get back into the pictures.

This problem that we're talking about, if you look at the 'Then' in the picture, I think once upon a time, companies largely operated on a pretty small technology platform in terms of the software and data that they used to run the business.

If you look at the 'Then' here, once upon a time, when you were working at a company, the data that you used to run your business was largely in one place. Asking questions might have been challenging, but you knew where to go to get the data to answer them. Well, now data for most companies is in lots of systems, hundreds of systems. Those systems all run on different kinds of backends and different-

Kelly Stirman:

Today those different systems run on different kinds of technologies. Some of them are in the cloud, some of them are on-prem. Some are relational, traditional things like Oracle and SQL Server, some are newer things like Elasticsearch.

The point is I think pretty much everyone on the call has this experience of actually the data I need to answer my question is not in one place, and I may not even know where it is. I may be comfortable with a product like Tableau or Power BI, or maybe I'm a data scientist and I'm comfortable with something like Python or Jupyter Notebook. But the question of where the data is, how to access it, how to get the right data to answer my question is a really hard problem.

How hard of a problem is it? Well, companies spend somewhere between $75 billion and $100 billion a year on technologies related to this space, things like ETL and data warehouses and data marts and data lakes. Those kinds of technologies are what they're using to try and make this problem more approachable.

If you think about the consequence of people not being able to do this themselves, the consequence of people having to wait their turn with IT to get access to the data they need, you're looking at lots of underutilized skills across a global workforce of somewhere between 150 and 200 million people. If you make even modest assumptions about salary and wasted productivity, the numbers add up to a multi-trillion-dollar problem. This is a big deal. Again, this is what we're focused on solving for people.
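
To make that arithmetic concrete, here is a back-of-the-envelope sketch; the salary and wasted-time figures are illustrative assumptions, not numbers from the talk.

```python
# Back-of-the-envelope sketch of the "multi-trillion-dollar problem" claim.
# The salary and wasted-time fraction below are illustrative assumptions.
data_workers = 175_000_000     # midpoint of the 150-200 million estimate
avg_salary_usd = 60_000        # assumed average fully-loaded salary
wasted_fraction = 0.20         # assumed share of time lost waiting on data

wasted_value = data_workers * avg_salary_usd * wasted_fraction
print(f"${wasted_value / 1e12:.2f} trillion per year")  # -> $2.10 trillion per year
```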

Again, I'm about to move to the next slide, but just as a reminder, please ask questions along the way if you have them. Appreciate that. It makes things easier for all of us.

Dremio is a data-as-a-service platform. What we're focused on is this really interesting phenomenon over the past 10 years that we've seen in IT. If you think back 10 years ago and you were building an application, you would go to IT and say, "Hey, I need some servers for that application," and IT would order the servers from IBM or HP or Dell or something like that, and a couple of months later, you would have those servers stacked and racked and ready for you to use.

Well, AWS comes along and completely changes that part of the technology equation. Now we think about infrastructure as sort of like shopping on Amazon. Same orange button, you go and click, and in a few minutes, you can have as much infrastructure as you need. People's expectations have changed in terms of how they get access to that infrastructure. They don't expect to wait for their turn with IT. They expect to be able to do that kind of thing themselves.

Well, if you think about data, we're still trapped in the rack-and-stack era that predates AWS. We're still in a situation where that's not something a data consumer can do themselves. The as-a-service idea, which began with infrastructure, has moved its way up the technology stack through different kinds of tools and full-blown applications.

I think most people would love for their data to be the same: they could get more value from their data faster, and they could provision datasets on demand that meet exactly the needs of a particular problem or question. That makes the data engineer as well as the data consumer more productive and self-sufficient, and it makes companies, in general, more agile and flexible in how they work with data and in the problems they can solve with their data. That's why we call it data-as-a-service: bringing these ideas to your data.

Let me tell you just briefly about the company. We were founded in 2015. We spent a few years building out the core platform and came out of stealth a little under two years ago. When we started Dremio, we also helped start a really important Apache project called Apache Arrow, which has exploded in popularity and is now downloaded 3.5 million times a month. Very, very popular open source project.

We are headquartered in Santa Clara, and we're broadcasting live from here today. Dremio, the product, and everything we'll show you and take a look at today, is open source. Arrow is a building block of that product, but the Dremio product as a whole is also an Apache-licensed project.

We're dealing with a tiny bit of a lag between the slides and what you're seeing and what I'm seeing. That lag is something like 15 or 20 seconds. We're going to take a look while I continue to talk and try and get that under control. Okay. I think that's more on our end than on your end. Apologies for that.

Just a little bit about the team. The company was founded by veterans of big data, distributed systems, and open source. We have terrific investors here in the Valley: Lightspeed, Redpoint, Norwest, and Cisco. I was just at an event at Cisco yesterday at their headquarters. There is a lot of support and enthusiasm around this idea of data-as-a-service, and it's being built by people who have a lot of experience in this domain.

The type of companies, largely, that are using Dremio are Fortune 500 and Global 2000 companies across all kinds of industries, geographies, use cases. We'll get into more about that a little bit later.

Let me talk briefly about Apache Arrow. This, as I mentioned before, is a project that has really exploded in popularity. It's used across dozens of different open source projects now.

What is Arrow? Well, it's a standard for in-memory analytics: both a standard representation of data in a columnar format and libraries in over 10 different languages (C, C++, Java, Python, Ruby, Rust, Go, et cetera) that help you operate on these in-memory columnar data structures.
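
As a small illustration of what those libraries look like, here is a minimal sketch using pyarrow, the Python bindings for Apache Arrow; the column names and sample values are made up for illustration.

```python
# A minimal sketch with pyarrow, the Python bindings for Apache Arrow.
# The column names and sample values here are purely illustrative.
import pyarrow as pa
import pyarrow.compute as pc

# Build an in-memory, columnar Arrow table from plain Python lists.
trips = pa.table({
    "passenger_count": [1, 2, 1, 4],
    "trip_distance_miles": [0.8, 3.2, 12.5, 1.1],
    "total_fare_usd": [6.5, 14.0, 42.75, 7.25],
})

# Columnar layout means an aggregation like this scans one contiguous column.
print(pc.mean(trips["total_fare_usd"]))  # -> 17.625
```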

It has quickly become the standard in the industry for representing data in memory for analytical workloads. It's core to Dremio's platform, and it's something we're proud to have been a part of getting started a little over three years ago.

Let's talk about the problem again in just a little bit more detail. Then we'll get into an exploration of how Dremio solves this problem.

What do we see time and time again in terms of the data lake and the need for self-service? Well, companies have, at the top of this picture, what we call data consumers. The data consumer is perhaps a user of a BI tool like Tableau or Power BI, or maybe it's a data scientist, somebody who works in R or Python or Scala or something like that. These people depend on access to data to do their jobs effectively.

If you go to the bottom of the picture, data in organizations is in lots of different technologies and different backend systems, as we discussed earlier. These are the operational systems where data is effectively born in the first place. The question is how are you going to get the data that your data consumers need, how are you going to make that available to them for their own use?

What companies do is they start by moving data into some kind of a staging area. These days, that's the data lake. That is your Hadoop cluster. That might be S3 or ADLS if you're in the cloud. But it's a place that is format and size-agnostic and makes it easy for you to consolidate data from different sources into one central repository. The way you get the data there is through a combination of ETL or certain types of scripts where you make a copy from the operational systems and put it in the data lake.

Now you have the data in different formats and representations, and you need to make it available to the end user. Well, it's not really in a form that's suitable for that. All the tools at the top of the picture assume some kind of a SQL interface. In many cases, you might begin by looking at something like Hive, but Hive is typically not suitable for this kind of access pattern. It's deemed to be too slow and inadequate in terms of security and governance controls.

What companies do next is they start to move the data from the data lake into a data mart or a data warehouse. This could be something like Redshift if you're on S3, it could be Snowflake, it could be SQL Data Warehouse. It's one of those technologies that provides a high-performance, relational front-end to the data. This is another copy of the data, moved with a different set of scripts and ETL to go from the data lake into the data warehouse or data mart.

Now you've got SQL access and people can start to access the system. But frequently what we find is that there are still concerns and limitations in terms of performance and concurrency. So people build aggregation tables, or they build extracts for the different BI tools, or they build cubes that give you more interactive access, at least in the case of cubes, for OLAP-style analytics.

Another copy of the data, another series of steps to move the data around. In many cases, when we talk to companies, it isn't just a three-layer picture that they have. Literally, they have 10 layers in their picture and a dozen different copies of the data along the way. This is a simplified version of what we see with most companies. But chances are you have something like this in your company.

At the end of the day, at the end of this process, the data consumer has access to the data that they need, but what happens when they have a new question or a new dataset that they want to bring into this picture? Well, that's something they could never do themselves. It's too complicated. It's too technical. They have to go to IT to do this for them, open up a ticket, and IT begins a data engineering project to make a new dataset available to the end user.

That is something that's not measured in hours. It's measured in weeks and months and, in many cases, quarters before the data consumer gets access to the new dataset that they requested. That's largely what we see with companies.

We started Dremio and said, "Look, there's got to be a better way to do this." This is the same way we've been approaching the problem for 25 or 30 years: copying the data between different technologies, making different physical representations, a process entirely owned and operated by IT, with long lead times, very expensive, very fragile.

We designed Dremio to work with any BI or data science tool. People like those tools. We don't think they should have to change. They should get to use whatever tool they like. Dremio solves the data acceleration problem.

The reason that a lot of you have all these copies in many cases is all about performance. The reason you have cubes and aggregation tables and extracts is to solve performance challenges. We believe we've solved that in a very innovative and elegant way with something we call Data Reflections.

We've packaged Dremio as an integrated self-service platform. The idea is that the data consumer logs into Dremio and does much of the work themselves. It's not a perfect analogy, but Dremio is like Google Docs or Office 365 for your datasets: something everyone can use, where they can collaborate together and access it simply through their browser.

We believe that, sitting in the middle between the different analytical tools and the different data sources, Dremio has an opportunity to add greater security and governance controls. We think there is a powerful set of capabilities we deliver through Dremio that would be almost impossible for companies to build themselves.

We've developed Dremio to be deeply integrated with the Hadoop ecosystem. If your data lake is an on-prem Hadoop cluster or you're using something like HDInsight, you'll find Dremio fits in perfectly with the way that you think about operating in that kind of infrastructure, in terms of integrations, deployment, monitoring, all that good stuff.

We've also developed Dremio to not be dependent on Hadoop. If your data lake is on S3 or ADLS, you'll find that our Kubernetes-based orchestration, which we provide with some very sophisticated Helm charts to make it really easy for you to deploy Dremio, works extremely well with that kind of data lake environment as well.

Finally, everything you're going to see today is an open source solution. This is something you could go download and try out yourself today.

Okay. What is Dremio? If we go back to that picture of the data consumers on top and different data sources at the bottom, Dremio is typically deployed next to or on the data lake, and, as an integrated self-service platform, it provides a number of capabilities that you would otherwise have to buy and manage several different technology solutions to address. I want to talk through this just briefly, and then, in a few minutes, we'll look at the product together and you'll see this at work.

First of all, I mentioned data acceleration. The reason we build data marts and cubes and extracts and all these other copies of the data in most cases is to get the performance we need for the data consumer. But in Dremio, we've solved this with a patented approach that we call Data Reflections, which are invisible to data consumers and allow Dremio to optimize a wide variety of workloads at interactive speed no matter what the underlying data source is and no matter what the scale is. That's part of the Dremio solution.

A popular use case, we'll talk about this in a few minutes. But a popular use case is, "Hey, I want to take my BI workloads and deploy them in my data lake. I want to do that without building extracts for Tableau. Let Tableau connect to my data lake and do interactive speed queries on terabytes and petabytes of data without all the work to build extracts, and to also not just build the extracts but manage them on something like Tableau server." That's data acceleration.

The next thing here is fine-grained access control and masking. One of the things that typically happens when you embrace something like S3 or ADLS or even HDFS, moving data out of the sophisticated capabilities of a relational database into a file system or object store, is that you give up fine-grained access control: row- and column-level access control, masking of sensitive data, these kinds of things.

This is something that Dremio provides out of the box. Whether your data is coming from files in your data lake or JSON records in a NoSQL database or records in a relational database, you have one uniform and unified way of controlling row and column-level access as well as masking of sensitive data.

The next thing here is self-service data curation. We've designed Dremio so that the data consumer can log into Dremio through a browser, find a dataset that they find interesting and useful and relevant to what they're trying to do, and maybe make some modifications. Maybe add a calculated field or rename the columns or blend a couple of different datasets together to create a new dataset.

That's all something that we think the data consumer is in the best position to make smart decisions about. We've created an interface in Dremio that lets the data consumer do this without writing any code and without being terribly technical to complete that task.

Data catalog and semantic layer, on the bottom left. When Dremio connects to your data lake as well as other data sources, we automatically detect schema, track all kinds of interesting metadata about the source, and ingest that into a searchable catalog. A data consumer can simply log in and do a search, like they're searching on Google, to find data that's relevant to what they're trying to do.

The next thing here is low latency SQL engine. Using Apache Arrow, using Data Reflections, we've built what we think is the world's fastest SQL engine for the data lake and allow you to deliver interactive speed on very, very large heterogeneous datasets in the data lake and other sources.

Then, finally, data virtualization. Nobody has all their data in a data lake. Data still exists in systems outside of the data lake. A common use case for Dremio is to join data in the data lake with external traditional, relational databases, for example. We allow you to drive more workloads to your data lake without first copying all the data there to make it available.

That's the integrated self-service platform. You have all these powerful capabilities packaged in one open source, self-service solution. That's what data-as-a-service is all about. We believe it requires this integrated set of capabilities to make data more of a service in your business and in your organization.

Let's talk about some use cases and outcomes. How are people using Dremio? Just some examples here. I mentioned a few of these, but, for example, interactive BI and data science on Hadoop. I think people think of Hadoop and the data lake in general as a great way to consolidate data from different systems and to do transformations of data to get data ready, but it's typically not used as the way you access the data in terms of interactive analytics.

But people would sure like it if it were because they've made significant investments in that infrastructure, the data's already there. Wouldn't it be nice if I could just take my BI users or my data scientists and point them at a nice interface that sits on my data lake and then get the interactive speed and access that they want? That's a popular use case for Dremio.

We talked a little bit about self-service. For example, Royal Caribbean has hundreds of analysts and data scientists whose common entry point for data, for analytics is Dremio. The way they begin a journey is by going into a catalog, doing a search, finding datasets, and then launching their favorite tool, whether that's a Jupyter Notebook or something like Tableau, and getting access to the data at interactive speed. They do that themselves without having IT provision things on their behalf.

Another use case is offloading analytics from operational databases and the EDW. Maybe you're at capacity with your Teradata cluster, or maybe you don't want to send table scans to your operational system. With Dremio and our Data Reflections capability, we can offload the analytics from those systems onto your data lake infrastructure, and we can do that in a way that's invisible and seamless to your end users, which is a really powerful and useful use case.

Another is accelerated analytics in the cloud. Many companies today are transitioning to the cloud or have a hybrid on-prem and cloud strategy, and when you start to move your data into things like S3 and ADLS, that's just the beginning.

You still, without Dremio, have to then find a way to move your data into an environment that gives you the kind of SQL access and performance you need to perform your analytics. Well, with Dremio, you can simply point Dremio at your data lake on S3 or ADLS and get the speed and performance, as well as the catalog and all the other virtues that we talked about earlier.

Then, finally, enabling the data lake as a data warehouse. There are a lot of companies that want to decommission legacy data warehouses and data marts and consolidate those workloads onto their data lake. Only Dremio really gives you the workload management controls, security and governance controls, and interactive speed necessary to make that a reality.

Those are some of the popular use cases across our customers and the kinds of outcomes you see are things like consolidating BI workloads on the data lake, getting faster time to value from the data, decommissioning legacy platforms, better security and governance controls, and more independence on the part of the data consumer. These are all common themes we see with customers across the board.

I want to talk briefly about just a few of these deployments in more detail. I'll start with TransUnion. TransUnion is a credit bureau. If you're a property owner who's considering leasing your property to an individual, one of the things you might want to understand is the creditworthiness of that individual. You're going to contact somebody like TransUnion to learn about that individual, or particularly a cohort of individuals.

The traditional business of TransUnion has been managing a huge volume of data on over a billion people, collected from 90,000-plus sources, with billions of updates per month to keep an up-to-date picture of the creditworthiness of all these different individuals.

For 50 years, their business model has been you ask us about someone or a group of people and we will generate a PDF report for you. But, increasingly, people want access to the data themselves. They want to be able to take TransUnion's data and blend it with their own data for a variety of different interesting use cases.

PRAMA is an application that TransUnion has built on their big data platform with Dremio with a really nice Tableau front-end that makes it easy for their high-performance users to get much more in-depth with the data that TransUnion manages.

This is something they had tried to do with some other technologies; they had been really struggling to get the performance and had a number of full-time data engineers just keeping things running. With Dremio, they were able to significantly improve their performance SLAs and also dramatically reduce the number of data engineers, from something like 14 or 15 full-time people to just one full-time person.

There have been a number of other use cases at TransUnion along the way, including decommissioning of their Netezza appliances onto this infrastructure that's based on their big data platform and Dremio. That's a little bit about one customer, TransUnion, who's been on this really powerful, impressive journey with Dremio.

Another is Microsoft. Microsoft is a very large software company. One of their popular products is called OneNote. If you're unfamiliar with this product, it's similar to Evernote: a note-taking application that I believe is part of the Office productivity suite.

Microsoft, of course, had built their product usage analytics environment on the traditional Microsoft analytics stack. Probably many of you are familiar with this stack. It comprises SQL Server, SSAS, SSIS, SSRS, that sort of traditional stack, which, from talking to folks, sounds like it starts to hit scalability limitations at dataset sizes of a hundred gigabytes to a few hundred gigabytes. They ran into some limitations.

They were also looking to re-platform some of their infrastructure onto Azure so they could spend less time operating their backend systems and more time doing innovative work with the data.

They found Dremio and were able to move away from a traditional environment with, I believe, over 10 copies of the data, lots of ETL, lots of steps along the way, and a long lead time to make new data available to their analysts and data scientists.

They were able to move away from that to a world where data lands in ADLS. Dremio is sitting on top of ADLS. Now their data consumers have a catalog of different datasets. They can self-serve to find the data, to build new datasets. Then they connect with Power BI or Python and get really nice, fast, interactive access to the data that's in ADLS.

It's a really nice story. I'll tell you, on a personal note, as being the much smaller software vendor, it's fun to have Microsoft call you and tell you how much they like your product.

Let me start with a quick preamble. I'm sure you would like to see the product at work, so what I thought I would do here is bring up a demonstration and just walk through a couple of examples real quick. Again, if you have questions, please ask them along the way. I will do my best to work those into the conversation or to answer them at the end.

Just to give you a sense of what you're going to see in the demonstration, we have a couple of different scenarios. I won't get through all of it in detail, but I'll give you the highlights. Then, for anyone who's really interested, of course, we'd be happy to do a call with you and show you the demonstration in more detail and more depth.

But the scenarios I wanted to briefly touch on: first of all, what is that experience like for the data consumer? How do they use Dremio? What is that like? We'll walk through an example of someone searching a catalog, finding a dataset, and then launching Tableau to have a really nice interactive experience with the data.

The second is what is the data engineer's experience of Dremio? In that example, it's basically, "Hey, I have a team of users. They've asked for a dataset. How do I provision that dataset for that particular group of users?"

This environment that we're looking at is in a Hadoop cluster running in the cloud. It's a small, tiny four-node cluster connected to a variety of backend sources, the data lake itself, but also a mix of relational and NoSQL systems.

That's the environment you're going to see at work. Let me turn on screen sharing here. Sorry, just a second here. I forgot I have to walk through this wizard real quick. Let me bring up the browser here.

I'm logged in through my browser into Dremio. Just to orient you to what you're seeing here, I'm logged in as an administrator so I can see everything. If you are a data consumer, you would see less than what we're going to see here. But I think it's easier to make sense of things seeing the administrative view.

This is a little bit like a cooking show where some of the work is done ahead of time, at least in setting up Dremio. I have a variety of different backend systems I have connected here, ADLS, MongoDB, Postgres, also connected to S3, so I can join across clouds if I want to, SQL server.

Then I've got what we call spaces. You can think of a space kind of like a project folder. As data consumers work with datasets and collaborate together, they organize those in spaces. Then every user has a home space where a surprisingly powerful use case is, "Hey, I've got a spreadsheet. I want to join that to my data warehouse," and you can do that in Dremio without having IT be involved, which is really, really powerful.

Let's try out this first example. I'm a data consumer. I have been assigned to do some work on understanding the penetration of Uber and Lyft in the New York City taxi market. Now I want each of you to do a new thought experiment, which is ask yourself, if you were at work and someone said, "Hey, go find me the answer to this question," where will you go to find the data?

When I ask companies that, most companies do not have an inventory of their data assets. Most companies would answer that question by saying, "Well, I would open a ticket or I would send an email or I would ask the person next to me or I would go on Slack." There's all these different ways where hopefully you connect with someone who knows the answer.

Well, in Dremio, when you connect Dremio to the different data sources that you have, we automatically build a catalog and we automatically maintain that catalog and make it so you can search and find different datasets.

If I were tasked with looking at taxi rides, I know taxis have fares. I could go into the 'Search' up here at the top and I could type in 'fare' and get back different search results. The search results that come back, there's not a whole lot in this cluster, but these are not webpages like a Google search, but each of these search results is instead a dataset.

That dataset could come from any of the backend systems that Dremio is connected to, assuming you have permissions to access the data, or it could come from what we call a virtual dataset, which is a dataset that has been defined in Dremio that might be a subset of the raw data or a blend of multiple datasets. We'll build one of those in just a couple of minutes.

Let's say here that I am interested in this second result, and I could quickly preview the schema here and understand what the columns are that are available in this particular dataset. I can also see that there have been over 2600 jobs run on this dataset, so it seems to be popular. I can see, finally, that there are three descendants, which means there are at least three different people that have set up virtual datasets that are derived from this dataset. We'll look at that in more detail.

Also, there are some tags here. It's been tagged with ... There's some PII data. There's information about trips and gold, and who knows? But these are tags that have been user-assigned on this dataset. Probably what I really want to do is look at a sample and see is this the data that I am looking for to answer these questions?

To orient you briefly on what is this data we're looking at, well, each row in this table is a taxi ride in New York City in a five-year period. I can see here the pickup and drop-off date times, how many people were in the taxi, how many miles the trip was, some geospatial data about the pickup and drop-off locations. If I scroll over, I have a breakdown of the fees: the total, the toll, the tip, the tax, surcharge, and the base fare.

Here's a sample, here are the column names, here are the data types. That's all very meaningful in helping you understand is this the right data to help me with the job at hand? But there's more here. I can look at the catalog entry for this dataset.

Here you can see there's information about who to ask if you have questions, screenshots of reports that use this data, descriptions of the fields and where the data comes from, anything you like. This is all maintained by the users of Dremio. I have a preview over here of all the different fields and data types and then tags that help me organize the data.

If I wanted to say this is a BrightTALK dataset, I could just add that tag and now users can search by the tag in addition to whatever is in the catalog in terms of the catalog entry and metadata about the data that we're just looking at.

Finally, the third thing you can look at to understand if this is the data that you need is what we call the data graph. This tells you that this dataset we're looking at, which is called nyctaxi.trips and has 2682 jobs that had been run on it, well, is descendant from a directory of files in ADLS. Somebody has taken this as a starting point, much the same way you would take a PowerPoint and modify it instead of creating from scratch, and they've created their own dataset that is descendant from the one that we were just looking at.

Between these three things, between the sample, between the catalog entry itself, and between the data graph, you have a really good sense for what is this data and what is its context in the broader data ecosystem within my organization?

From here, you could say, "Hey, this is great. It's exactly the data I'm looking for," and you could launch your favorite tool, let's say Tableau, by clicking this button, and launch Tableau, connect it to this dataset.

When I click on this file, it's going to launch Tableau, which connects to Dremio over standard ODBC and JDBC. While I tested this late last night, apparently my Tableau license has expired, many apologies. But I'm not really here to demonstrate Tableau. What I think you would appreciate if you were looking at Tableau is sub-second queries on billions of rows of data that is managed in a data lake.
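
For anyone who would rather script that same connection than use Tableau, here is a hedged sketch: it assumes the Dremio ODBC driver is installed and a DSN named "Dremio" has been configured, and the credentials and column names are placeholders rather than values from the demo environment.

```python
# Hypothetical sketch: querying Dremio over ODBC from Python instead of Tableau.
# Assumes the Dremio ODBC driver is installed and a DSN named "Dremio" exists;
# the credentials and column names below are placeholders, not real values.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)
cursor = conn.cursor()

# Query the same nyctaxi.trips dataset the demo browsed in the catalog.
cursor.execute("""
    SELECT passenger_count, AVG(total_amount) AS avg_fare
    FROM nyctaxi.trips
    GROUP BY passenger_count
""")
for row in cursor.fetchall():
    print(row.passenger_count, row.avg_fare)
conn.close()
```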

In this particular environment where Dremio is deployed, there are, as I said, four nodes, and queries with something like Hive or Impala take about 10 minutes each. We've basically gone from 10 minutes per query to under a second per query.

That is a night-and-day difference, but it's also the difference between a way for a data consumer to actually do their job effectively versus a way that would never be acceptable, where, to work around those performance challenges, you would move the data into a data mart or data warehouse. So, back to the popular use case of deploying your BI workloads on your data lake: that is exactly what you can do with Dremio.

That's the first scenario. Apologies again, I couldn't show you the live interaction with Tableau. Happy to do that in a follow-up call if you'd like to see it. It's usually exciting to see.

The second scenario I wanted to touch on briefly is: what is the experience of the data engineer? If I go back home, I'm going to create a new space and call it BrightTALK. This space is something I could make available to everyone in my organization, or I could use LDAP or Active Directory group membership to control who has access to the space, and not only who has access, but whether they can query the data or edit the data.

From here, I'm going to actually make this available to everybody and click 'Save'. If I go into this BrightTALK space, you can see there's nothing there.

Let's play out the scenario. The scenario is I'm a data engineer. I have received a request from one of my data science teams. This team is building a predictive model to anticipate when employees will leave the company. They've asked for a training dataset that includes information about employees as well as the departments they work in.

That data's in two totally different systems. They've asked for the information about these employees and departments for people who've been at the company 10 years or longer, because they think that will provide a more accurate training dataset for their model.

In this example, I'm a data engineer and I'm going to go do this work for them. Now it's perfectly possible that your data scientist may want to do this work themselves, and that's absolutely supported by Dremio, but I wanted to give you a different angle and take on things.

I'm going to quickly do this. Think about how long it would normally take a data engineer to provision data for you. I'm going to do this in about 60 seconds or less. I'm going to go into this Postgres database. These are all tables in this database. I'm going to go into the employees table and do a couple of quick things.

First of all, nobody needs this employee ID column, so I'm going to get rid of this column. Next, I want to restrict this to people who've been at the company 10 years or longer, so I'll say 'Keep Only'. I get a nice histogram of all the data. I can slide things over to narrow things down to people who've been at the company 10 years or longer, click 'Apply'.

I now want to join this data to data about the departments that they work in. I can click 'Join' and Dremio's going to recommend different joins. I'm going to actually reach into SQL Server for data about the departments they work in. Now that I've got all that data together, I can save this as 'senior employees' and put it in my BrightTALK space.

Now I can go into that BrightTALK space and see the senior employees virtual dataset, which anyone could connect to with Tableau, Python, Jupyter. I didn't make a copy of the data, I didn't move any data. In just a few seconds, I've provisioned a new virtual dataset for these users, meeting exactly the requirements that they put in their ticket to me in IT.
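
Under the hood, a virtual dataset like this is just a saved SQL definition, so the equivalent query could also be run from a script. The sketch below reuses the ODBC connection idea from earlier; the source paths, column names, and exact tenure filter are illustrative assumptions rather than the actual definition built in the demo.

```python
# Hypothetical sketch of the SQL behind a "senior employees" virtual dataset:
# a cross-source join (Postgres employees to SQL Server departments) with a
# ten-year tenure filter, run over the same ODBC connection as before.
# Source paths, column names, and the date cutoff are illustrative assumptions.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=engineer;PWD=secret", autocommit=True)
cursor = conn.cursor()
cursor.execute("""
    SELECT e.first_name,
           e.last_name,
           e.hire_date,
           d.department_name
    FROM postgres.public.employees AS e
    JOIN sqlserver.dbo.departments AS d
      ON e.department_id = d.department_id
    WHERE e.hire_date <= DATE '2009-01-01'   -- roughly "10 years or longer"
""")
print(cursor.fetchall())
conn.close()
```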

We are just about out of time, so I wanted to make sure to thank you for attending. For those of you who asked questions, apologies that we didn't quite get to those. I'll be sure to answer them offline after the session today.

But I wanted to thank you for listening about Self-Service Data on the Data Lake, about Dremio, and our data-as-a-service platform, which, again, is open source. Feel free to try it out yourself. If you'd like to see a more in-depth demonstration and ask us questions, we'd love to do that with you online some time soon. Thanks again.