
Intro to Self-Service Data With Dremio

 

Transcript

Kelly:

We're going to try and make this a little fun and interactive. We've got a couple of pictures to show you, talk about the architecture of the products, talk a little bit about why we created Dremio, and then get into the product and show you how it works. Using this tool, you should be able to ask questions along the way. Please do. We'll get to them in the course of the discussion or at the end, depending on how things go. I have a couple of polls, quick questions, where I'll ask you for a little bit of feedback and insight along the way, so if you have time, as I inject those into the conversation, please participate. That would be greatly appreciated.

So let's get the conversation going. Tomer, you want to kick things off?

Tomer:

Good morning, everybody. What we're going to be talking about here, I'll start with a kind of overview of the company and what we are up to at Dremio. Feel free to use the chat system to ask questions. Okay, just a few words about the team here at Dremio. My name is Tomer, I'm one of the co-founders and the CEO. Jacques is actually the CTO, not the CEO, but that's ...

Kelly:

That's my mistake. Sorry about that. Actually, I think Jacques did that.

Tomer:

Oh, he might have done that.

Kelly:

Blame Jacques for that.

Tomer:

So Jacques and I came from a company in the big data space called MapR. I was one of the first employees and the VP of Product there. Jacques, you may have seen him presenting at various conferences and events in the big data space. Most recently, he created an open source project called Apache Arrow, which is an in-memory columnar technology that enables much faster analytics, allows the use of modern Intel CPUs for faster data processing, and also fast exchange of data between systems. You may see here at the bottom right, Wes McKinney is an advisor and the creator of pandas, a very popular data processing library for Python. Wes is also a significant contributor to Apache Arrow. Ajay Singh, for those of you from the Hadoop space, is an executive from Hortonworks and he is running field engineering here. Dremio is backed by two of the top VCs in Silicon Valley, Lightspeed and Redpoint.

So why Dremio? Why did we start this company and, more importantly, this open source project? There are two trends happening right now in the industry. The first one, across companies in all verticals, is the fact that data has grown and become extremely complex. It's no longer a world where all data can be stored in a single relational database and you can then simply point your BI tool at that system. It's not that anymore. In fact, the traditional stack of ETL tools also struggles to deal with the modern data landscape.

And so that's one challenge that companies face, which already is hard enough, but then you combine this with this other trend around the users that want to consume the data: the business analysts, the folks in the business like product managers, marketing analysts, and other business users. All these people have this great experience in their personal lives where they go home and they can ask a question on Google and get an answer one second later, and then they come to work and it takes them months to create a new visualization or a new dashboard or a new machine learning model. So in our personal lives, we have this instant gratification, this amazing experience with data. We open our smartphones and use applications to book travel within two minutes and so forth.

This disconnect between our experience in our personal lives and our experience at work with data is just this huge difference, and that also puts pressure on companies to solve that problem. So when you combine the growing demand for self-service from the business users and the analysts with the complexity and size of data, you get these impossible situations. And when you look at the technology stack, it really hasn't changed in years. So what do we do today? We have data in a bunch of different places. Increasingly we're seeing modern data technologies like S3, and Hadoop, and MongoDB, and Elasticsearch, and so forth, of course in addition to the relational databases. And then you have the tools that people want to use. If your company's bigger than a small startup, you probably have more than one tool that you're trying to use, so self-service BI tools like Tableau, Power BI, and so forth. You see people that want to use Excel, and other people that are more technical and want to use Python and R for more data science type workloads. And then you have the SQL analysts. So all sorts of different tools that people want to use to be able to analyze data.

So what do companies do? They first move the data through custom ETL development into a staging area, and this is often a data lake: a Hadoop cluster, cloud storage, Azure Blob Store, S3 ... It's a lot of work, a lot of scripting and maintenance, and just keeping that pipeline up and running is often very challenging because the data in the underlying sources can change. The developer who's writing the app on top of Mongo may add a field to one of the collections. Well, how does that data pipeline then deal with that? Those are all things you have to account for.


Doing the analysis directly on the data lake is often too slow, as I'm sure many of you know. And so what do you do? You ETL a subset of that data, maybe the last 30 days or maybe some aggregate-level data, into a data warehouse, and that could be something like Redshift in the cloud, or maybe Teradata, Vertica, Oracle, or SQL Server on premise. These are all very proprietary, lock-in type systems, a lot of overhead, very expensive, and even these systems often don't provide the performance that the BI user wants.

Then you get into this additional tier of data products. So you have, for example, cubes with things like Microsoft SSAS, or cubes on Hadoop, and those types of technologies. Often times we see people pre-aggregate data in the data warehouse, so they create new tables that are at the aggregate level, maybe aggregated by session ID, or by city, or something like that. And then you see people extracting data into BI systems, so these are BI extracts, yet another set of copies of data.
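To make the pre-aggregation pattern concrete, here is a minimal sketch of the kind of table people build in the warehouse (the table and column names are made up for illustration):

```sql
-- Hypothetical aggregate table built on top of a raw fact table.
-- It answers city-level questions quickly, but it is yet another
-- copy of the data that must be kept in sync with the source.
CREATE TABLE sales_by_city AS
SELECT city,
       COUNT(*)    AS order_count,
       SUM(amount) AS total_amount
FROM   raw_sales
GROUP  BY city;
```

Every such table, cube, or extract is one more physical copy that an analyst has to know about and pick correctly.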

At the end of all this, you pay an enormous price, and that price is not just the infrastructure, of course, where you have ten copies of every piece of data in your company; it's also the complexity that comes with that. It's also the fact that when you have lots of copies of data, they end up diverging, and companies in regulated industries get fined for having different versions that have diverged and provide different results. It also introduces security risks, where you now have these uncontrolled copies of data managed by different departments or different end users who may not have the same kind of expertise that IT does around data security.

And then finally, I think the most problematic aspect of all this is just the fact that you cannot accomplish self-service when this is the data architecture. Because when you have different pieces of your data living in different systems and in different types of structures, it's impossible to expect that Tableau, Power BI, or Qlik user to know where to go to analyze each dataset. Maybe for the sales data, if it's just a very simple sales data analysis, they can go to that cube, but then they want to join that with the session data from the website and they can't do that. At the end of the day, in the vast majority of companies today, when a user wants to do something new, ask a new question, create a new visualization or a new dashboard, they are dependent on IT, right? So they file a ticket, they wait for IT to prioritize it, and finally, maybe a month or several months later, everything has been done and there's some table in some data warehouse where they can point their BI tool and start querying that data.

So very, very complex, cumbersome, and it hasn't changed in 20 years. Our belief at Dremio is that in order for companies to become data driven, which they all want to become, this cannot continue to be the architecture. It just does not work anymore. And that's the reason we started Dremio. We really felt that there had to be a better way. We thought that if we built an open source project that really addressed the modern data landscape, then we could empower data consumers, those people that are using BI tools or doing machine learning, to be self sufficient, and do it in a way that IT feels comfortable with, where they have that kind of governance and some control around that self-service work.

So, what do we mean by "a better way"? First of all, we think it has to be something that will work with any data source. Most companies have lots of different data sources today, and we know that five years from now there will be a new set of data sources which we can't even name today, just like five years ago nobody, or very few people, were using Elasticsearch or MongoDB. We think that it has to work with any tool: any BI tool, any data science tool, things like Tableau and MicroStrategy and Excel and Python and so forth. We think that the world would be a better place without data warehouses and cubes, and we understand that you can't get rid of all that in a day. Those systems are kind of a necessary evil today, but there has to be a better way, and nobody really likes having to manage them.

We're big believers in self-service and collaboration. One of the things you'll see about the user interface is that it very much resembles Google Docs, except instead of documents, it's datasets. Performance is key to all this. If you make something that's not fast enough, nobody wants to use it, and so we've spent a lot of time and a lot of our IP on technology that makes things go faster. If you've ever used BI on top of big data and seen queries take ten minutes to come back into the BI tool, you know that has to be solved. We have to be able to provide an interactive experience of one or two seconds, even for the largest datasets, even for petabytes and many, many billions of records.

And then finally, we think that it has to be open source. It's 2017. Any kind of data infrastructure or analytics technology needs to be open source. Companies don't want to get locked in. They want to have a healthy ecosystem around the technologies that they use and that's really core to what we're doing here. You may have seen every member of the executive team at Dremio comes from a company that has built a successful open source business, whether that's Hortonworks or MongoDB or MapR, these are all companies that have grown from zero to over 500 employees.

So what is Dremio? It's a new tier in data analytics. We call it self-service data.

What does it do? It provides everything that you need in between where the data lives and the tools that people want to use on their desktops to explore and analyze that data. When we say "everything you need," that includes the logical aspects, meaning, for example, the ability to find the data that you want with an integrated data catalog. It includes the ability for people, whether they are technical or not, to curate new datasets, which we call virtual datasets, by interacting with the data and transforming columns and joining things together.


And then at the physical layer, being able to connect with all these different data sources and run queries that join data from different sources and accelerate those queries, so like I said, when you have petabytes of data or terabytes of data, you still have that need to come back and provide results to a query in less than a few seconds. So how do you do that? We developed a unique data acceleration layer, which we call "data reflections".

Let's talk a little bit about the technology and how this is possible. Often times, I talk to people about this and it kind of seems like the Holy Grail. "Well, how is this possible?" A lot of technology goes into making this a reality. The first thing you'll see in our demo is the user interface and everything behind the scenes that's making it possible. We like to think of it as similar to Google Docs for data, where people can collaboratively work together. They can create new virtual datasets. It's not like data preparation; we're never creating copies of the data when you curate it. It's all live data. You can create millions of these new datasets and there is no problem.

The second thing is that ability to perform live data curation. You're always operating with live data, and that experience is powered by the Dremio learning engine, which is our AI-backed engine. So what does that engine actually do? One of the interesting things is that the Dremio curation and collaboration interface, as well as all the queries that are coming in from client applications, are all received and processed by Dremio, so we are the execution engine for these queries. That means that as things are happening in the system, whether it's through our user interface or through one of these BI tools, we are learning from those queries. We're learning what people are joining together. We'll see that many people are, for example, joining one table in SQL Server with this specific Hive table. Now we can recommend that to other people. We call that "predictive transformations", where we're recommending a transformation that somebody may want to do. Again, the unique advantage here is that because we are the execution engine, these recommendations are not just based on other people's data preparation experience. They're based on the actual analysis that happens in the BI tools, and they're also based on us seeing that data over and over again as queries run in the system.

We also have an adaptive schema learning system here. If you've ever used any other product in this space, you'll know that you have to define schemas and mappings and things like that. None of that is necessary in Dremio. We automatically learn the schema from every system that we interact with, whether that system has a well-defined schema, like a relational database, or less of one, like MongoDB, or none at all, like JSON files in a directory on S3. We have a predictive metadata caching layer, where, based on the query patterns, we automatically figure out what to learn more often versus less frequently. And then we have a lot more coming here in our next release; we'll have predictive SQL recommendations.

In terms of the deployment reference architecture, Dremio is open source software. You can download it. It's a distributed system; you run it on anywhere from one server to thousands of servers in a single cluster. You have two options: you can run it in the cloud or you can run it on premise. I'd say about 50% of our users run it on EC2, whereas others run it in their own data centers. We have a persistent caching layer, which we call the reflection store, and that reflection store is a really key ingredient in enabling interactive speed on very large datasets. We can maintain these data reflections either on S3, or on HDFS, or on direct-attached storage, so basically the local disks of the Dremio cluster.

Our interaction with the underlying data sources allows us to push down query processing into those sources, so we spend a lot of time integrating with these technologies, things like Mongo and Elastic and relational databases and Hadoop. For the systems that have execution capabilities, even if it's not full SQL, we can push down processing of relational query plans into those engines to the extent that it is possible. Then we augment that with our own capabilities so you have a full SQL experience, including joins and correlated subqueries and window functions, everything you'd expect from a fully compliant SQL engine, on top of systems that don't even do SQL, like Elasticsearch and Mongo and others.
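As a rough sketch of what that kind of federated query can look like (the source and dataset names here are illustrative, not from the demo):

```sql
-- Hypothetical cross-source query: filters can be pushed down into
-- each source, while the join and the window function run in Dremio.
SELECT o.order_id,
       c.region,
       SUM(o.total) OVER (PARTITION BY c.region) AS region_total
FROM   mongo.sales.orders      AS o
JOIN   sqlserver.crm.customers AS c
  ON   o.customer_id = c.customer_id
WHERE  o.status = 'shipped';
```

Here the `status` filter could be evaluated by MongoDB itself, while the join across sources and the window function, which neither source supports, execute in Dremio's engine.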

We have one question that's probably good to talk about right now, which is, "What is the impact of push downs on source databases?" I think it's good to talk a little bit about how we push down predicates and other expressions when appropriate to make those queries as efficient as possible in the underlying system. If you want to minimize the impact of queries, you can use data reflections, which we talked a little bit about.


It's really a question of "What do you want to do, and what are these databases that we are connecting to? What are they actually serving today?" Dremio gives you a lot of flexibility around how we interact with those systems. We can operate in full push-down mode, where basically we're pushing down as much as possible into the underlying source, and that's great if that underlying source system is designed for analytical type workloads. Often times, say with an Elasticsearch cluster, it is there to serve analytical workloads. It has the data indexed, and if we push down a group-by on an indexed field, it can return a response very efficiently. In other cases, we may be connecting to your OLTP database, the one that's processing your orders and your eCommerce transactions, and in that case, it may be less than ideal for us to be pushing down large analytical queries into that source system.

That's where these data reflections come into play. The data reflections that sit in this reflection store can actually, from a physical standpoint, mask that underlying source so that we never actually talk to that source except maybe on an hourly or daily basis to bring the incremental changes that have happened in that source into the Dremio persistent cache. And that allows the user to have that experience of "I am interacting with any data at any time, it doesn't matter what system it's in," while at the same time isolating those operational databases from these analytical queries. So you have that flexibility. We'll talk more about reflections, in particular the raw reflections that we provide, which are row-granularity data reflections. Those are the ones that can really prevent any query from hitting the operational database.

We optimize your data and your queries for acceleration. There are a number of technologies in the product that enable us to provide this level of performance. We talked about the native push downs. That's the one on the bottom left here, but there are a few other things.

First of all, the execution engine in Dremio is a distributed engine. It is actually based on Apache Arrow, which is now the underlying technology for some of the [inaudible 00:20:56] databases; Python and R have also embraced Arrow as their in-memory, high-performance data representation. So being based on Apache Arrow has allowed us to execute extremely fast and leverage modern CPUs, the kind of SIMD, vectorized operations that Intel CPUs support.


That's often not enough. Just executing fast on a lot of data is not sufficient to provide interactive response times, just from a physics standpoint. The key here is the Dremio reflections. These are optimized physical data structures that allow us to satisfy queries much, much faster by reducing the amount of work that has to happen in order to respond to a query. If you think about traditional technology, things like cubes, projections in some databases, and indexes, all these types of techniques are about optimization of data. Dremio leverages those types of techniques, and it does that in a transparent way, where the user who's running a query, across maybe multiple sources or even just on their data lake, doesn't know about these data reflections. Their queries get magically accelerated from their standpoint, and the reflections also allow the administrator of the system to add additional optimizations for new workloads and new use cases that people care about, without any impact.

This is what the reflections look like. If you think about how people traditionally approach this, they create aggregation tables or cubes, and then the administrator designs and maintains those physical representations of data. That's what you see on the left. Unfortunately, the user then has to pick the right representation that they want to interact with, which in any real-world scenario becomes impossible, because the user can't look at 1,000 tables in the data warehouse and say, "You know what, this is aggregated by session already and session level is enough for my query, so let me go over there". There's no way they can be that sophisticated and always know exactly what the physical representations of data are.

The difference here with Dremio is that for a source table in Dremio, we will design and maintain different physical optimizations of that data, whether they're sorted by specific columns, or partitioned in different ways, or aggregated by different dimensions. When a new query comes into the system, whether it's through a BI tool or through somebody writing SQL by hand, we will automatically pick the right set of physical optimizations, or data reflections, to accelerate that query, and we will internally rewrite the query plan so that instead of going through the raw data for every query, we are now using these data reflections and substantially reducing the amount of work that has to take place in order to respond to that query with the correct answers.
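Conceptually, the rewrite works something like this (a sketch with made-up names; the actual substitution happens inside the query planner and is invisible to the user):

```sql
-- Query as the user writes it, against the raw data:
SELECT city, AVG(stars) AS avg_stars
FROM   mongo.yelp.business
GROUP  BY city;

-- What effectively runs instead, if an aggregation reflection
-- keyed on city already maintains partial sums and counts:
SELECT city, SUM(stars_sum) / SUM(stars_count) AS avg_stars
FROM   reflection_store.business_by_city
GROUP  BY city;
```

The rewritten query reads a far smaller, pre-aggregated structure, which is why the same answer comes back orders of magnitude faster.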

With that said, let's look at the product.

Kelly:

Actually, let me run a couple of quick polls. It'd be great to get feedback from you folks on the call. The first question is just about the tools you're using. It's multiple choice, so check off whichever apply to you. It's interesting to see what tools are being used right now. While you guys are voting, one quick question that came up was, "Where are these reflections stored?"

Tomer:

That's a great question. You have several options for where that reflection store lives. If you do have a distributed file system such as HDFS or MapR, or access to cloud storage like S3, then we can store these reflections inside of that persistent store. The great thing here is that they do not have to fit in memory, and these storage technologies are extremely cheap.

Sometimes you're running in a situation where you don't have access to cloud storage, so maybe you're running on-prem and you also don't have a Hadoop cluster anywhere near. That's okay as well, because Dremio can actually store these reflections just across the local disks of the Dremio cluster. We can stripe them across those disks, and we really don't need that data to be replicated because we're not storing the primary copy of the data. It's really a cache.

Kelly:

So I'll wrap up this one. I have one other quick question, which is more about the sources of the data that you use. Again, it's multiple choice, so just pick all the ones that apply to you. It's interesting to see which sources people are using that are important in terms of analytics.

While you guys are voting on that, again, I really appreciate all of you, let's see if we can answer this question. Another question is "How do you keep these reflections current with respect to the source?"

Tomer:

So the way we keep the reflections current is you basically define the SLA, so you may say, "You know what? I'm okay with the data being up to one hour stale." And then we will figure out the right update order and refresh these reflections automatically. We'll do it in an intelligent way where, internally, we build a reflection graph and we can decide to update one reflection and then, based on that update, other reflections that can be derived from that first one. That can also reduce the load if it's an operational database and not a data lake scenario. You can also choose whether you want to do these incrementally, so we have an option for incremental refresh, and we also have an option for a full refresh on that schedule.

Kelly:

Great. Thank you so much.

Last quick question, and again, this is just another multiple choice, "which role best describes you?" Those of you on this call. I realize that some of you probably fill all of these. I know I feel like I do most days. Really appreciate the feedback. It's very interesting to see.

Okay, Tomer. Take it away. Let's see if there are any other quick questions that we can answer. "If you're running Dremio in Hadoop, does it need to run standalone or can it run on existing nodes?"

Tomer:

When you have a Hadoop cluster, it is most typical that you would run Dremio inside that Hadoop cluster. We actually integrate natively with YARN, and we allow you to use YARN to provision the Dremio executors on your Hadoop cluster. Really, in terms of hardware, all you need if you have Hadoop is one or maybe a few edge nodes where you can install the Dremio coordinator, and then we will run the execution inside of the Hadoop cluster, leveraging YARN.

Kelly:

Thanks everyone for voting on those three polls. Tomer, let's jump into the demonstration and let people get a close look at Dremio in action.

Tomer:

Sure, sure.

So what you're seeing here actually, if you're looking at my screen is the-

Kelly:

Actually, I don't see your screen yet.

Tomer:

You do not see it?

Kelly:

I see ... hopefully the demo is more than just a slide that says "Demo".

Tomer:

Oh, that's ... I didn't get it out. Okay, I think we're sharing the application. Let's go to the sharing ... we'll do a new share. So I'm going to share my entire desktop here so I can show you how it interacts with the BI tools as well.

Kelly:

Yeah, that sounds good.

Tomer:

Okay.

Kelly:

Yes, I can see it now. Looks good.

Tomer:

What you're seeing here is the user interface of Dremio, so when you log into the product, this is what you'll see. The first thing you'll notice on the bottom left here is we have our data sources. In this case, the cluster is kind of a mess; it has all sorts of different sources that different people here have added to Dremio. I have an Elasticsearch cluster and a MongoDB cluster and some Postgres, and what is this S3 account? So on S3, I have all these buckets that different people here have created, and I have a Hadoop cluster here as well. Lots of different sources that have been added to this cluster. I'd say about half of our users focus on one primary source. Typically that's their data lake, either Hadoop or an S3 data source. The other half typically have multiple different sources of data, and they're seeing the value in bringing together data from different sources.

So that's what you have here in the box. If I click this plus icon here (I'm an admin, so I can do this), you can actually add new data sources, and adding a new data source is as simple as selecting that source and entering the information about it. So if that's an Elasticsearch cluster, let's say, I click on "Elasticsearch" and I can enter the credentials and the hosts, and that's about it. We actually discover all the nodes in the Elasticsearch cluster automatically if you enter one of them.

What you see on the left hand side, these are called "spaces", and spaces are a way to organize virtual datasets. As users in Dremio create new virtual datasets, those can be stored, or saved, inside of spaces, and that's also a way to set permissions and sharing, where you can say, "You know what? I'm going to share only one specific space of curated data with no PII information with my analysts. Maybe I'll create a different space with other datasets for other folks. They can also create their own spaces and collaborate".

Then every user has their own home space. This is what you see at the top left, where you can create your own virtual datasets. You can also upload your own spreadsheets, so if you wanted to take a huge dataset in Hadoop and join it with a spreadsheet that has the list of the 10 customers that you care about, you can join those two datasets very easily.

Let's look at a simple example here. Let's create a space for the purpose of the demo, and we'll call this "The Webinar Space". What we see here is "webinar", and that's a new space that we have here. You can see that there are no virtual datasets inside of the webinar space.

Now let's go and start curating data. It seems like my [inaudible 00:32:02] up here is spinning in CPU for some reason. Maybe this webinar software is slowing things down. Typically, this is extremely fast, so it may be a little bit slower here because of that, but when you click on one of these datasets, what happens is that we show you the records. I clicked on this dataset of businesses inside of the MongoDB yelp database. You can see here on the top left, this is a purple table icon. That means this is a physical dataset; it's one of the datasets in your source systems. We never change that data. Every dataset in the system, whether physical or virtual, has unique coordinates. This one is Mongo.yelp.business, and when you query the data, if you're technical, you can actually see the SQL. This one is "select star from Mongo.yelp.business".

Let's close that. Pretend we're non-technical users and we just want to interact with the data. This underscore ID column, we probably do not care about it, so let's just drop that. Click on "drop" and we can drop that column. We have the cities here; every record here is a business in the United States. I may not care about all the cities, so let's click on "keep only" and keep only a subset of the cities in this dataset. It looks like Pittsburgh and Charlotte are the most common cities, so let's focus our analysis on those two cities [inaudible 00:33:29].


Let's see. What else can we do here? Let's say I wanted to figure out the most common business categories in these two cities. You can see we have the categories here as an array, so every business potentially has multiple categories, and that's not good in terms of being able to do a group-by and do this analysis in a BI tool. What I can do here is say, "You know what? Let's click on this and unnest this array," so I'm going to basically unnest so that every category shows up in its own record. I can go here and type; I'll just rename the "categories" column to "category" because we now have one category per record.
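In SQL terms, the unnest step being described might look something like the following sketch (it uses a FLATTEN-style array function; the UI generates the exact SQL for you, so treat the details here as illustrative):

```sql
-- Each business row carries categories as an array, e.g. ['Pizza', 'Bars'].
-- Flattening the array yields one row per (business, category) pair,
-- which makes group-bys in a BI tool straightforward.
SELECT name,
       city,
       FLATTEN(categories) AS category
FROM   Mongo.yelp.business
WHERE  city IN ('Pittsburgh', 'Charlotte');
```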

So I made all these changes, and of course I could do that as a non-technical user; it was very easy to do. If I can use Excel, I can use Dremio's user interface here. But if I am technical and I want to do more sophisticated things, I can actually use the SQL editor, and I have the full range of SQL. You'll even notice here that as we made these changes, we were updating the SQL in real time, so you'll see things like "where city in Pittsburgh and Charlotte". Those are the things we did in the UI.

Let's go ahead and save this inside of our webinar space. I'm going to click on the "save as" button, and we're going to save this inside of our webinar space and call it "The Categories Dataset". Once I've done that, you'll notice the icon changing from purple to green here, which indicates that this is now a virtual dataset. It's called "webinar.categories", and so now I can work with this dataset. In fact, I could even run a query that joins "webinar.categories" with something else. We also have these buttons here which make it very easy to open a BI tool with a live connection to Dremio and to this specific dataset. In this case, we are opening up Tableau with a live connection. We never export the data from Dremio into the BI tool; we're just launching the BI tool with a live connection, so from the BI tool's standpoint, Dremio looks just like one relational database. We accept SQL queries, we run them, and we return results.

Here you'll see I've opened up Tableau desktop by clicking on that Tableau button, and on the left hand side here, what you'll see, for example, is that there's a column called "category", and so this is that new column that we literally just created one minute ago when we flattened the JSON and so forth. That's how easy it is. I can now start dragging category and playing with this dataset inside of Tableau.

But let's look at a more complex example here where I'm joining data across multiple data sources. Let's look at this dataset inside Elasticsearch where I have all the reviews on Yelp, so this is the review dataset. In Elastic terminology, "review" is a type and "yelp" is an index. You can see we happened to name this source "elastic-remote" because it sits remotely from the Dremio cluster in this case. One of the most important columns here is the text column. That's the column that has the actual review. For example, "nice venue, good food, great beer, awful service". You can see that person gave this a three. The venue was nice, the service not so much.

If I open the SQL editor again, you can see that when I just clicked on this dataset, we started off with just a select star. Now if I'm technical, I can start typing SQL here, and actually, for things like Elasticsearch, we support a function called "contains" which allows us to do free text search, and so we can push down any Lucene expression. This is actually a Lucene expression here. We're pushing that down into the underlying data source, in this case, Elasticsearch. Now you can see, if you look at the text column, we see "amazing" all over the place, right? We're looking at all the reviews that have the word "amazing" in them, so "amazing food and coffee" is an example. I can now click on the "join" button and we will recommend things you might want to join this with.
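As a rough sketch, a query using that pushdown might look like the following. The source path and column names are illustrative assumptions, and the expression inside the contains function is the part handed through to Elasticsearch as a Lucene query.

```sql
-- Free text search pushed down to Elasticsearch as a Lucene expression.
-- Paths and the exact CONTAINS syntax shown here are illustrative.
SELECT business_id, stars, "text"
FROM "elastic-remote".yelp.review
WHERE CONTAINS("text":amazing)
```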

Even though we've transformed the dataset already by filtering down to only the "amazing" reviews ... it's not even a named dataset at this point in time because we haven't saved it yet ... we can still identify the right thing to recommend that you join it with, and in this case, the top recommendation is this dataset of businesses inside of Mongo. The reason we're recommending this is because we have seen other people do this join in SQL queries that have been received from various BI tools.
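Conceptually, accepting that recommendation produces a cross-source join along these lines. The dataset paths and the join key are assumptions for illustration, not names confirmed in the session.

```sql
-- Sketch: join reviews living in Elasticsearch with businesses
-- living in MongoDB, all through one Dremio query.
SELECT b.name, b.city, r.stars, r."text"
FROM "elastic-remote".yelp.review AS r
JOIN mongo.yelp.business AS b
  ON r.business_id = b.business_id
```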

I'm going to click on "apply". If this was my first time using the system, I could have just done a custom join and had that visual drag and drop experience that you typically see in a lot of products. You can also see the history here of all the transformations that we've done, and you can go back in time if you want to. Let's go save this dataset now, and I can save this inside the "webinar" space. I'm going to save this and call it "Amazing Reviews".

If I click on this "data graph" icon, one of the cool things is you can actually see the data graph, so you get to that webinar.amazingreviews dataset. It hasn't been queried yet, you can see "zero jobs," but it is derived from these two datasets: this Yelp review dataset, which comes from Elastic, and this Mongo Yelp business dataset, which comes from MongoDB. You can see how many times these datasets have been queried directly, and it's actually quite a bit.

In fact, I can click on one of these things and kind of focus the attention on that specific dataset, and what happens is that dataset then moves to the middle and you can see that this dataset here, Netflix.goodreviews, is actually based on this review dataset. I can look at that. I can also click on this "jobs" link here at the bottom, the four, and see what were the actual four queries that ran on this Netflix dataset, and I can actually see that Kelly here has been playing with that dataset.


Kelly:

I can't get enough of those reviews. I keep going back.

Tomer:

Let's go back to the main screen here and look at our webinar space, and what you'll see here is the two datasets that we created throughout this demo. Earlier on, we created this categories dataset. Now we created amazing reviews. Amazing reviews, again, was that join between Mongo and Elastic. Again, I can open this up in a client application. I like using Tableau desktop because it's very, very lightweight, so I'll spin it up on my laptop. My laptop is pretty slow, so it's taking Tableau some time to kind of open up here.

What you'll see here is Tableau connected, and here at the top, you'll see "webinar.amazingreviews". I can start dragging and dropping, so I may say, "You know what, I want to see the names of the businesses that have the most reviews with the word 'amazing' in them," so I'm going to drag that here. What happens when I drag that here is that Dremio is getting a query ... I'll show you here. We're getting this query from Tableau, so you can see "select star". This is the query we just got from Tableau. I can drag the number of records here to see how many reviews with the word "amazing" we actually got. You can see it only takes about five seconds for us to parse the query, plan it, push it down as two different queries, one into a remote Elasticsearch cluster, one into MongoDB, perform an in-memory distributed join, and return the results back to the client application.

You could think about how hard this would have been if I had to export data from these two different NoSQL databases into a Teradata system and then point my Tableau at that Teradata system. That would have been a project that I probably could have spent a whole week planning out, and then making sure it also keeps working over time. I was able to do that here in just a couple minutes. And you can see ... Mon Ami Gabi is the restaurant that has the most reviews with the word "amazing". Now if I wanted to see, let's say ... I keep playing with this. I always enjoy just doing random things with this data. Let's see what the cities are. I'm curious. This one, I guess, we did not filter by those two cities, so it'll be interesting to see what these are.

Kelly:

Pita Jungle? Wow, they're all over the place. I've never even heard of that.

Tomer:

This is a chain, yeah. It's in Scottsdale, Phoenix, Glendale, Chandler ... Okay, lots of Pita Jungle. That seems like a chain restaurant. These other businesses, like Wicked Spoon, are all single city. Actually, this one has a Henderson location in addition to Las Vegas.

Anyway, that's kind of how easy it is to work across different sources. The last thing I want to show you: those datasets were not very big. Oftentimes, you're going to end up in situations where you have big datasets. These are datasets where a SQL engine like Hive or Athena might take you many, many minutes to run these queries, and so that's where Dremio reflections come into play. Let me, as an example, show you this dataset we have here that has over a billion records. We're actually running on a very small cluster, so a billion records on this cluster take about 6-10 minutes depending on the exact SQL engine and the query that's being run. What I want to show you here-

Kelly:

And by "6-10 minutes," you mean if you were going to query this with Hive or Impala, that's the kind of latency you would see in this cluster with this data?

Tomer:

Right, right. Of course, if you ran a bigger cluster, it'd take less time, but it all scales with the amount of data and the number of nodes you have in your cluster.

What I'm looking at here is a dataset that has all the taxi trips in New York from something like the last five years, I think. You can see the pick up time, the drop off time, number of passengers, trip distance, and so forth. I think this dataset actually sits inside of a Hadoop cluster on HDFS. Again, every time a query runs here, this would be about a 10 minute wait.

Let's see what it looks like when you're using Dremio and we are taking advantage of data reflections behind the scenes. I'm going to drag the number of records here, and you can see that we just performed a count star on 1.03 billion records, and that came back in about a second. I can do whatever I want. I can take the drop off time and I can see, "Okay, when are the taxi drivers dropping off passengers," so let's turn that into a bar chart, and you can see here from 2012-2014 a kind of decline in the number of yellow cab taxi trips in New York. That may be because of Lyft and Uber taking some of that load. Let's say we want to look at the tips. I always like looking at tips and the amounts that people are paying, so ... Actually, that needs to be an average, not a sum, so let's change that to average. Every time I'm interacting with this, every change, changing from sum to average, changing the colors here, all those interactions resulted in a SQL query being sent to Dremio. You can see the tips have gone up from 2009, which you may recall was that first year after the Great Recession, up to 2014, so maybe people were feeling better and tipping more as the years progressed.
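Each of those drag and drop interactions translates into an aggregation query that Tableau generates and sends to Dremio. A hypothetical example of the kind of SQL involved, with column names assumed for illustration:

```sql
-- The sort of query a BI tool emits as you drag fields around:
-- average tip per year over the full taxi trips dataset.
SELECT EXTRACT(YEAR FROM dropoff_datetime) AS trip_year,
       AVG(tip_amount) AS avg_tip
FROM newyorkcity.trips
GROUP BY EXTRACT(YEAR FROM dropoff_datetime)
ORDER BY trip_year
```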

Let's see if we can see any trends in terms of the months. So when we look at the months, we see that if you wanted to be a taxi driver, I would recommend being a taxi driver later in the year. It seems like maybe people are spending all their money ... or they're feeling good, tipping the drivers up until kind of the holiday season, and then come the new year, everybody's in a bad mood and the tips go down. Then they start going up again, and then July comes around and a lot of tourists are in town and the tipping goes down again. I don't know the exact reasons for this, but what I can do is kind of play with the data and get these trends very easily, because I have the ability to get this interactive experience rather than waiting 10 minutes every time I drag the mouse.

If we go here into the "jobs" page, you can see all these queries that have come in as I was demoing that dataset. See all these sub-one-second queries and the flame next to them. This would have been a three hour webinar had we not had data reflections here (laughs), with each query taking us 6-10 minutes from our experience, but because of our ability to accelerate these queries, and that little flame indicates that, we're able to play with the data in a much more interactive way. So that's kind of the experience with acceleration.

Kelly:

Let's see if we can make sure we understand. So this query that I'm looking at, this big "select average from ... group by," etc., etc., that's down there at the bottom of your screen, that's the query that Tableau sent over ODBC to Dremio. What happened to the query after it got to Dremio?


Tomer:

When that query hit Dremio, the first thing, we have a unique cost-based optimizer that took that query, parsed it, turned it into a kind of logical query plan, and then we ran some unique algorithms that we've developed here which basically look for opportunities to not have to scan the entire newyorkcity.trips dataset. That's really what the user is asking us to do: go query this massive dataset of a billion records. Now we, with our algorithms, have identified that we actually don't have to go scan that entire newyorkcity.trips dataset. We can instead leverage some of these data reflections that we have in the cache, where the data may be pre-sorted or pre-aggregated in different ways, and therefore reduce the amount of time that it takes to satisfy this query. In this case, it was 600 times faster.
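A conceptual before/after of that substitution. The reflection name and its columns below are purely illustrative, not Dremio's actual internal naming:

```sql
-- What the BI tool asked for: a full scan of a billion-row dataset.
SELECT passenger_count, AVG(tip_amount) AS avg_tip
FROM newyorkcity.trips
GROUP BY passenger_count;

-- What the optimizer conceptually runs instead: a much smaller,
-- pre-aggregated reflection that can answer the same question.
SELECT passenger_count, SUM(tip_total) / SUM(trip_count) AS avg_tip
FROM reflections."trips_agg_by_passenger"
GROUP BY passenger_count;
```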

Kelly:

Yeah, something like that.

Okay, so we've got a bunch of questions, a lot of really good ones and some fun comments. I especially like the comment, "This is freaking awesome". I agree (laughs). Let's get through a couple of these questions.

Tomer:

Read some more of those types of comments.

Kelly:

(laughs) "If I want to visualize Dremio tables with a non-Tableau SQL compatible tool, what connection types or drivers would I need to use?"

Tomer:

If you go to our website and you go to the download page, you can download the Dremio ODBC and JDBC drivers, and you can use any SQL compliant tool. It doesn't matter. I was just showing Tableau because that's what I have on my laptop. I could show you just as easily Power BI, Qlik, MicroStrategy, any of these other tools.

Kelly:

Another question is, "When I'm pulling data from an RDBMS, is it stored on Hadoop in Arrow format?" I think what they're asking is ... because you don't have to move the data into Dremio, right? We run between your tools and the data. But if you create a reflection on data in a relational database, how does that get stored in Dremio's reflection store?

Tomer:

So all Dremio reflections are stored in a highly compressed columnar representation. That's kind of the foundation for that, and so we leverage Apache Parquet as well as Arrow, and we've introduced some additional optimizations on top of those things in terms of how we deal with dictionary encoding and all sorts of statistics that allow us to be more efficient when we actually run the query. So, for example, we don't want to decompress or deserialize the data when we don't have to. We can run an entire execution without doing that, and that's, of course, even better. That's how the data is stored: in a highly compressed columnar format inside of the Hadoop cluster, if Hadoop is being used as your reflection store.

Kelly:

Great. Another related question is, "Would it be faster to have the reflection store be S3 or HDFS or a local file system? Are there any differences in terms of performance there?"

Tomer:

You know, that's a really good question, because at the end of the day, we have to read that data. The performance at which we can read that data will play a role, and it depends on what kind of reflection we're using. We spend a lot of time integrating well with all of these different systems, so with S3 we'll open lots of different [inaudible 00:50:19] connections, and we'll do the same thing, of course, and get data locality in the case of HDFS. So then it becomes, "Okay, well how many disks do you actually have in that HDFS cluster, versus the performance of S3 and the instance types you're using on Amazon, which dictate [inaudible 00:50:36] that you have to S3?"

Unfortunately, I can't really answer that question because it kind of depends on the exact instance types or hardware that you're using. We have users that use either of those and have good experience.

Kelly:

I think the key thing is that it doesn't require a super high performance storage subsystem to get the kind of experience we just saw. What we just saw was Dremio running in a Hadoop cluster on Google Cloud with cheap cloud storage underneath in this particular cluster.

Tomer:

Yeah, we were just using local infrastructure.

Kelly:

Yeah, so there may be some differences between S3 and HDFS, but that's not going to be a massive difference. The big advantage here is the fact that we're using reflections in a highly optimal data structure, which makes it so you don't need super high performance storage systems.

Another question is related to reflections. How are the reflections updated over time?

Tomer:

So the reflections are updated actively on a schedule, so for a given source, you can say, "What is my SLA? How stale is this data allowed to be?" It's a cache, so when we use reflections, we're not serving the data from one second ago. We may be serving data that is a minute old or 15 minutes old, but you get to define what that period of time is. You can define a different threshold for different sources of data, and also for different physical datasets.

Kelly:

Okay. How does Dremio manage access control?

Tomer:

Yes, access control ... let's break this out. Let's start with authentication. For authentication, you can either define the users inside of Dremio or you can have us connect to your LDAP source, so Active Directory or any other LDAP. That's for identity: that's how we check the password, and the group membership information we use all comes from LDAP if you're connected to LDAP. When it comes to permissions, you can actually see that here. I'm not sure if you remember, but when we created the space called ... let me click on the spaces here ... when we created the "webinar" space, we actually had the ability to say, "Who do you want to share it with?" I was kind of lazy and shared it with all the users, but you could go here and share with specific users, so maybe I want to share this space with Kelly, and now only myself and Kelly have access to this space.

You can do it at the space level. You can also go and do that at the dataset level, and you can override it. Depending on your company and your exact situation, you may actually decide that you don't want to expose the sources of data here at the bottom left to your end users, to the analysts. You may say, "You know what? I want only IT to have access to the raw data." In our company, what we'll do is create a space with the IT-approved datasets, and that will be the starting point for all the users, so maybe you'd just drop some of the PII columns, the social security number, the credit card number, or only decide to expose a very small subset of the data to your end users. And they can build on top of that. People can build virtual datasets on top of other virtual datasets, but permissions really allow you to control who can see what data. You can also do things at the column and the row level.
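As a sketch, that IT-approved layer could be built as a virtual dataset that projects away PII and restricts rows. The names here are hypothetical, and Dremio exposes the same thing through the UI as well as SQL.

```sql
-- A curated virtual dataset for analysts: the SSN and credit card
-- columns are simply not selected, and only permitted rows pass through.
CREATE VDS it_approved.customers AS
SELECT customer_id, name, city, signup_date
FROM oracle.crm.customers
WHERE region = 'US'
```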


Kelly:

So what about Kerberized Hadoop? Is that supported?

Tomer:

Yeah, that's supported. There's no problem and there's documentation on how to do that.

Kelly:

What about user delegation?

Tomer:

Yup. Actually, when you add a Hadoop source, there's a check box that says, "Do you want Dremio to impersonate the user?" Meaning, "Should we be talking to the Hadoop cluster as the identity of the user who is connected to Dremio, or should we be talking to the Hadoop cluster as the Dremio user, or whatever user you set up running the Dremio [inaudible 00:54:47]?" You have that option.

Kelly:

Several questions related to sources that are supported. What's there today and what's sort of the story for the future going forward?

Tomer:

I think we have a dozen ... I'm doing some multiplication here, so 13 sources today: HDFS; your Hive tables (we don't use the Hive engine, this is the Hive metastore); MapR; SQL Server; Oracle; network-attached storage, so you can use any number of NAS solutions; MySQL; and then MongoDB, Elasticsearch, and HBase are the NoSQL databases we support now. We have some of these "coming soon" icons at the bottom, although it's fair to say we've gotten a lot of requests for, "Can we add this source and that source?", and we'll continue to add new data sources to the system as we're getting these requests. That includes Teradata, HANA, and various cloud applications, which we'll be adding as well.

Kelly:

What about MapR-DB?

Tomer:

That's another one where we've had multiple requests. MapR-DB support, including both the binary as well as the JSON document model, is something we'll be adding as well.

Kelly:

We're running out of time and again, I really appreciate all the great questions-

Tomer:

And I'll just add to that: feel free to reach out if you want to talk about specific items and roadmap and dates and so forth. I don't want to get into all that now, but we can, of course, have that one-on-one conversation.

Kelly:

Why don't you open up the community site, just so people can see that. As a reminder, Dremio is open source. We love to hear from you. Please join the community. Ask questions. Make recommendations or suggestions to us there. It's pretty active and we're very responsive, so it's a great place to connect with us. Dremio comes in two editions: the Community edition, which has pretty much everything you saw today with the exception of the data lineage features and the LDAP and Kerberos support, and the Dremio Enterprise edition, which has some advanced management capabilities, advanced security, and several other things that we make available as part of a subscription. We're happy to talk to you about that and make it available to you to evaluate if you'd like. The open source version is terrific. We're really proud of it, and we're just getting started, so lots of great things to come in the future.

I want to thank everyone for joining today. A recording will be sent out to you shortly. I'm not sure how long that will take to get together, but we'll try and get it out to you very soon. Thanks for attending. We're going to be doing these every month, and we'll be focusing on other deep topics related to Apache Arrow, different kinds of optimization techniques, and advanced features we have with specific sources and specific tools, so stay tuned for future webinars, and we look forward to seeing you out there. Thanks again. Take care and have a great day. Bye bye.