Data Reflections: Accelerate your Queries Without Copies

Dremio

Transcript

Jesse Anderson:

Hello everyone, welcome to this webinar. In this webinar, we're going to learn a little bit more about Apache Dremio and Dremio products, and other Apache products. This webinar is called Data Reflections: Accelerate Your Queries Without Copies. Now we're going to introduce the people that are going to be talking in the next slide.

So my name is Jesse Anderson, I'm the managing director at Big Data Institute. I'm also a committer on Apache Beam, and I've contributed to other various open source projects, especially Apache projects. You might have read some of the writing I've done on my personal site or on O'Reilly. With me is Steven Phillips, he's a principal software engineer at Dremio, he's also a very technical person. He's a PMC and committer on the Apache Drill and Apache Arrow projects.

We're in good luck, because we're able to, as technologists, ask some very technical questions to Steven, so be ready. Let's try and stump him with some really technical things.

Now speaking of questions, here's how to do your questions. I'll be monitoring this as kind of the moderator for this, and then as I see questions pop up, or as I see something that we want to cover, we can do that. So if you want to, in the Q&A, if you see the Q&A at the bottom, it'll look a little bit different on yours, but please do avail yourself of the ability to ask questions.

With that, I'm going to turn it over to Steven. Steven, thank you.

Steven Phillips:

Thank you, Jesse. To get started, we want to give a little background on why reflections are important. It all starts with the raw data. Organizations now have lots of data in lots of different formats, all over the place. It could be in places such as Hadoop, Microsoft Azure, or Amazon. It can be very disorganized and hard to manage, especially if we want to be able to get insights from this data for use by data scientists or BI users. It's often a very long and expensive process to make this data accessible.

On this chart here we have a couple of different layers of data and data transformations. The lowest layer would be the data lake. This is very raw, unstructured data, often very messy. Some aspects of the data in a data lake are that it often involves custom ETL, and these transforms can be very fragile. They break easily, and it's also very slow moving.

Typically, the way organizations will then try to wrangle and manage this data is by moving some of it into a data warehouse or data mart, such as Teradata or Vertica. Some of the disadvantages of this, though, are that, first off, they're very expensive, and there's a lot of overhead. And obviously they're proprietary. There's lock-in. It can create some problems for your organization. Even then, the data still doesn't quite satisfy the needs of the BI users and data scientists. These queries may take too long to run, or use too many resources. We can't have BI users hitting it over and over again. It might just cause too many problems.

So the next step that organizations typically take is making copies of the data in different formats that the BI users can use. This includes things like cubes, or extracts for Tableau or other BI tools, and so on. Now the problem here is that the data is sprawling; it's copied in many different locations. Governance becomes a problem. Trying to manage and keep track of where all these copies are located and stored is a real headache. And then it's slow to update. If you need a new dimension in your query, it can be a very long and drawn-out process to have the data engineer modify the cube appropriately for the BI users' or data scientists' needs. So this is kind of the current normal for how larger organizations manage their data and make it accessible to end users.

And BI users are getting left behind here, because it creates many headaches for them. First off, they're dependent on IT to manage this for them. They have to ask IT to find, access, and prepare the data sets, which slows down their access. And some of the data in its raw format is really not accessible to BI users; things like JSON or Parquet files in Hadoop are some examples. This leads to long lead times. It can take months to onboard new datasets, or even just to modify a small detail in one of the cubes or extracts that already exist.

So there is a better way. That's why we built this company and product, Dremio. It's designed to work with any data lake and any BI or data science tool, and it can provide 10X to 1000X data acceleration. Dremio is a self-service semantic layer, based on zero-copy data curation. It's a scale-out architecture: you can run it on just a handful of nodes, and you can scale it out to over a thousand nodes. And it's built on open source software, and is itself open source.

Dremio represents a new tier in data analytics. And we like to call this self-service data, or also known as data as a service. So as you can see, in this diagram at the bottom we have our various data sources. Things like Hadoop, MongoDB, or various relational databases. At the top we have the data science and BI analytics users. So they're going to use things like Python, maybe with pandas or R, for the data scientists. And the BI users typically use something like Tableau or Looker. And Dremio sits in between these two, and it provides various things. And the one I'm going to focus on mostly today is the data acceleration component of what Dremio provides.

So one way to think of data reflections is as a relational cache. I'm going to go into just a little bit of depth about what relational caching means. First off, a little refresher on some of the key aspects of relational algebra.

So first off, you have relations, which are basically tables. In other words, a set of tuples is a relation. Then we have various operators, and these define some set of transformations. Some examples of operators are join, project, scan, filter, aggregate, window function, et cetera. And then we also have rules, which are used by the optimizer. A rule defines a logically equivalent transformation of a plan. So for example, you may have a project followed by a filter operator, and a rule can determine that it's logically equivalent to perform the filter before the project, or vice versa. There are a lot of different potential rules that can be used to optimize a query. One other term to understand is a graph, or a tree. In relational algebra, this pipeline of operators can be represented, or thought of, as a graph. Specifically, in Dremio it's what's known as a DAG, or a directed acyclic graph.

So on the right here we have a couple of very simple graphs. In this case, it's a scan, project, and filter. A logically equivalent tree is a scan, filter, project. Those are examples of query plans, or DAGs.
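In SQL terms, such a plan corresponds to an ordinary query. Here is a minimal sketch (the table and column names are hypothetical) mapping each clause to an operator in the DAG:

    SELECT vendor_id          -- project
    FROM trips                -- scan
    WHERE trip_distance > 10  -- filter

A rule proves that applying the filter before or after the project is logically equivalent, so the optimizer is free to pick whichever ordering is cheaper.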

So what is a relational cache? A relational cache is a little more sophisticated than a basic cache. A basic cache might just store the result of a query, and then if you run that same query, return the results that were cached. We want to go beyond that, and store the data in some sort of intermediate state. The advantage here is that the cache can then be used to satisfy more queries, beyond just the one query whose results we cached.

So there's a concept that we like to use, which is distance to data, or DTD. Originally, without a reflection, without the relational cache, the distance to data is much larger: we have to go all the way from the original data at the bottom of this graph up to what you want. The idea is that we can persist some intermediate state, that maybe isn't exactly what you want in any one case, but is much closer. Then, when the user actually runs a query, the new distance to data is much shorter than it was originally. By doing this we can satisfy a large variety of queries with this cache. They will all be accelerated, and it will reduce resource requirements and latency.

So to some extent you probably are already doing this, manually. This is typically done with things like sessionized or cleansed data. Or maybe in your data warehouse or data mart, you've partitioned by time or region. Or maybe you have some summary tables, summarized for a particular purpose. Currently, what this means is the user has to choose what to query, depending on what they need. Analysts have to be trained on using different tables, depending on the use case, and often there are going to be custom datasets built just for reports. This might include things like summarizations or extracts built for Tableau dashboards.

Okay?

So let's compare the two models. We might call the current model, the one that's often used, copy and pick, whereas what Dremio provides, through data reflections, is the relational caching model. On the left you can see copy and pick, where at the bottom we have our original raw data, and then we have the physical layer, which is represented by the various cubes, or extracts, or whatever the admin has created. And then there's the logical model, which in the copy-and-pick world is the same as the physical model; there is no difference. The end user has to go and pick which physical optimization to use to satisfy their logical use case. In the relational caching model, there is a separation between the logical model and the physical optimizations. We do have various physical optimizations in our relational cache, but the end user doesn't need to know or care about them. They simply look at the original dataset, as in the logical model, and the relational cache itself figures out how to leverage the physical optimizations to satisfy the end user's needs.

So let me describe a few of the key components of relational caching, and specifically what we use for Dremio reflections. First off, the transformations and states are expressed in Dremio's SQL language. In order to match the relational algebra, to determine if a reflection can be used, we use a library called Apache Calcite. The reflections themselves, the materializations, are persisted in Parquet. Parquet is another Apache project, which provides on-disk columnar storage. And we process the data through a combination of Apache Arrow, which is an in-memory columnar format, and Sabot, which is Dremio's query execution engine. And of course we have a lot of code putting it all together, as you can see from this diagram. I won't go into a lot of detail on this, but this is the high-level architecture of how the relational cache works.

So let's talk a little bit about reflection definition and matching. First off, coming back to Calcite. Like I said before, Calcite is an Apache project. It is a planner and an optimizer, and also a SQL parser. It comes pre-built with many of the things that are needed for optimization, such as operators, the various rules that we described earlier, and properties, which include things like distribution, sortedness, collations, and so on. And it has the ability to handle materialized views. If you're familiar with the concept of materialized views, the relational cache is a very similar concept. So this provides a perfect foundation for relational caching.
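For reference, here is the classic materialized-view pattern in a traditional RDBMS, as a minimal sketch in standard SQL (this is not Dremio-specific syntax; the table and columns are hypothetical). Reflections play an analogous role, with Calcite doing the matching:

    -- Persist a precomputed aggregate once...
    CREATE MATERIALIZED VIEW region_totals AS
    SELECT region, SUM(amount) AS total_amount
    FROM sales
    GROUP BY region;

    -- ...so that later queries against the base table can be rewritten
    -- by the optimizer to read the much smaller materialization instead.
    SELECT region, SUM(amount) FROM sales GROUP BY region;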

So here's how we build the cache, and we call the cached entities reflections. Essentially, a reflection is a persisted alternative view of the data, stored in Parquet format. We have two types of reflections. One is the raw reflection, which persists all the records of the underlying dataset, but allows you to control how you partition and sort the data. The other type is the aggregate reflection, which persists a partially aggregated dataset, based on a selection of dimensions and measures, and also allows partitioning and sorting. These reflections can be built on top of either the raw source tables or what we call a virtual dataset, which is somewhat analogous to a database view.

So let me give some examples of how this works. Here, let's say we have a query. On the left, we have our user query, which involves a scan of some table called T1. We then project the columns A and C. From there we perform an aggregation, where we group by A and compute the sum of C, renaming it C Prime. And then a filter, keeping only the records where C Prime is less than 10. This is a very simple example. Okay.

Now, the reflection definition is a similar query, where we simply did a scan of T1, and then an aggregation where we grouped by A and B and computed the sum of C. You'll notice that this query is not exactly the same as the user query, but it is very similar. Okay?

So this reflection definition means that the target query can be replaced with a simple scan of the materialization R1. All right? Using our Calcite-based reflection matching algorithm, Dremio is able to figure out that we can rewrite the original user query using the reflection R1. The plan is now: scan R1, aggregate on A, compute the sum of C, and then perform the filter where C Prime is less than 10.
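Written out as SQL, the rewrite might look like this. This is a sketch: T1, R1, and the column names follow the slide, and the plans Dremio actually produces may differ in detail:

    -- User query
    SELECT a, SUM(c) AS c_prime
    FROM t1
    GROUP BY a
    HAVING SUM(c) < 10;

    -- Reflection definition (materialized as R1): partial aggregate by A and B
    SELECT a, b, SUM(c) AS sum_c
    FROM t1
    GROUP BY a, b;

    -- Rewritten query: roll the partial sums in R1 up to A.
    -- Valid because SUM is decomposable: SUM(c) equals the sum of the per-(A,B) sums.
    SELECT a, SUM(sum_c) AS c_prime
    FROM r1
    GROUP BY a
    HAVING SUM(sum_c) < 10;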

So this new query looks very similar to the original, but one thing to note is that instead of scanning T1, we are scanning R1. And depending on the data, R1 could be a hundred thousand, even a million, times smaller than the original T1. Because of this, our optimizer will look at this alternative plan and say, "Oh, this is a much cheaper, much better plan. Let's use this." And then this query can be a hundred thousand, or even a million, times faster. It depends on how much reduction we got when we created this reflection, based on the cardinality of the dimensions that were specified.

Here's another slightly more complicated example. In this case, the user query involves a join of two tables, with an aggregation after the join. But our reflection does not have the join; it's only an aggregation of T1, which is one side of the join. Well, our algorithm is smart enough to leverage this one as well. It turns out we can replace the left side of the join with the aggregated version, this reflection that contains the aggregation of T1. Most likely, in this scenario, T1 is a very large table, and T2 might be a dimension table. So we didn't have any need to create a reflection for T2, because it was small enough. But T1 might be very large, and so once again this could be a very big improvement over the original query, because R1 could be a thousand times smaller than T1, depending on the data.
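A sketch of that join case in SQL (the join keys and columns are hypothetical, and a clean foreign-key relationship between T1 and T2 is assumed; the point is that only the T1 side of the join is replaced):

    -- User query: large table T1 joined to small dimension table T2
    SELECT t2.region, SUM(t1.c) AS total_c
    FROM t1 JOIN t2 ON t1.k = t2.k
    GROUP BY t2.region;

    -- Reflection R1: aggregation of T1 alone, keyed so the join still works
    SELECT k, SUM(c) AS sum_c FROM t1 GROUP BY k;

    -- Rewritten: scan the much smaller R1 in place of T1
    SELECT t2.region, SUM(r1.sum_c) AS total_c
    FROM r1 JOIN t2 ON r1.k = t2.k
    GROUP BY t2.region;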

So in both of those previous examples we used aggregate reflections. In this example, I'm showing how you might benefit from a raw reflection. This is a very simple query: just a scan of T1, and then a filter with some condition on column A. Sometimes this is referred to as a needle-in-a-haystack query, where maybe we're trying to find just a few records, out of billions, that match some filter criteria.

Now in this example we actually have two reflections being considered. In the one on top, we do a scan of T1 and we hit a raw reflection. In this case the raw reflection is equivalent to the original table, but we also keep track of the fact that R1 is partitioned by A. And we have another definition which is also a raw reflection, but in this case is partitioned by B. Both of these reflections will be considered, and then our optimizer will notice, "Well, since our filter depends on column A, if we use the first reflection we will be able to prune out much of the data, and therefore it will be much cheaper to use that one." And our optimizer will choose R1, instead of...

I think there's a mistake on the slide there. The second one should be R2. Anyway, it'll choose the top one, where it can prune on A.
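A sketch of the needle-in-a-haystack case (the filter value is made up):

    -- User query: find a handful of rows out of billions
    SELECT * FROM t1 WHERE a = 42;

    -- Candidate raw reflections, both full row-for-row copies of T1:
    --   one partitioned by A: the filter on A prunes most partition files
    --   one partitioned by B: no pruning is possible, so a full scan
    -- The optimizer costs both and picks the reflection partitioned by A.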

And the final example I want to show here is a relatively new feature that we developed, which we call the Starflake feature. So in this case, imagine we have a user query.

Well, actually, let's start with the materialization definition. Imagine you have a star or snowflake schema that defines your raw tables. In this case, F1 is a fact table, and we might have D1 and D2 as dimension tables, which are then joined together. Now, let's say you create a reflection on top of the join of those three tables, and call it R1. Now if some user comes in and runs a similar query, but they don't include D2; they're just joining one of the dimension tables with the fact table. Well, our Starflake matching is able to determine, based on the fact that these are dimension tables, that this join is a cardinality-preserving join. Therefore we can still use this materialization, even though the user query is not using all of the data from the reflection.
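Roughly, in SQL (a sketch with a hypothetical schema; F1 is the fact table, D1 and D2 are dimensions joined on their primary keys):

    -- Reflection R1: the full starflake join
    SELECT f.*, d1.d1_attr, d2.d2_attr
    FROM f1 f
    JOIN d1 ON f.d1_key = d1.id
    JOIN d2 ON f.d2_key = d2.id;

    -- User query: joins only D1. R1 still matches, because each F1 row
    -- joins exactly one D2 row (a cardinality-preserving join), so R1
    -- contains the same rows as this result, just with extra D2 columns.
    SELECT f.amount, d1.d1_attr
    FROM f1 f
    JOIN d1 ON f.d1_key = d1.id;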

All right, another thing I'd like to go into now is how we keep these reflections up to date, as data is modified or added in the underlying data sources. First off, we have refresh management. The underlying data may change, and the admin will need to define a refresh frequency for the underlying data. For a rapidly changing data source, you'll probably want a more frequent refresh. They also need to define a TTL, which defines the point at which, if the reflection is beyond some age, we should not use it, because it should be considered stale.

Okay, another important thing about how we manage these refreshes is that it's possible for one reflection to use another reflection as part of its refresh. So we make sure, using what we call a dependency graph, to order the refreshes appropriately. A simple example, as you see in this diagram: if we have a physical dataset and we've created both a raw and an aggregate reflection, most likely the aggregate reflection can use the raw reflection for its refresh. So we will make sure to refresh the raw reflection first, and then, once that is done, refresh the aggregate reflection.

And then we have multiple update modes, which is something admins have to choose based on the underlying data. The simplest one is what we call a full refresh. This is appropriate when the underlying data is highly mutating. If there are updates or deletions in the underlying data, then the update mode should be full refresh, which basically means every time we refresh the reflection, we rebuild the entire thing from scratch.

However, in many cases we can do better than that. If it's known that the underlying data is append-only, meaning we're only adding new rows and not doing updates or deletions, then it makes sense not to recreate the entire reflection every time we refresh, but only to add the newly added data.

There are two ways this is handled, depending on the type of data source. If it's a file-system-based source, then we can simply look at the timestamp, or create time, of the files, and only add the data from the newly added files. If the underlying source is something like an RDBMS, the way we handle this now is that the user can define some column or key that is monotonically increasing. A timestamp is a common one: as new data is added, the timestamp field is always increasing. And we use this field to add only the new data every time we do a refresh.
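Conceptually, an incremental refresh against an append-only RDBMS source boils down to something like this. This is a sketch of the idea, not Dremio's actual internal SQL; event_ts stands in for the user-designated monotonically increasing column, and an initial full build of the reflection is assumed:

    -- Append only the rows added since the last refresh,
    -- using the reflection's own high-water mark.
    INSERT INTO reflection_r1
    SELECT id, payload, event_ts
    FROM source_table
    WHERE event_ts > (SELECT MAX(event_ts) FROM reflection_r1);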

Okay, so I want to go into a demonstration now of how these reflections can be used to accelerate your queries. The dataset I'm using is the New York City taxi data: it's the data collected for all the taxi trips from 2009 to 2014. This dataset contains about a billion rows, which corresponds to about 180 gigabytes of data, and it's in CSV format. CSV is not a very efficient or performant way to store data. The Dremio cluster is running in Google Cloud. It's a pretty small cluster; it has just a single coordinator node and two executor nodes, each with, I think, four cores. So a pretty small system, and 180 gigabytes is a pretty sizeable dataset for this small a cluster.

So let me go ahead and switch to my Dremio UI. Okay. Here I've opened the UI to the page where I'm viewing this table, called Trips, which is a slightly cleansed version of the raw CSV data. Real quickly, I can show you the actual underlying SQL for this dataset, and you can see that we are doing various conversions and a few cleansing steps. Okay.

Okay. But I'll go ahead and hide that for now. The nice thing about virtual datasets is that once you've done your cleansing query, you can ignore it. So this is what the dataset looks like after having been cleansed. You can see here we have a few columns of interest: a vendor ID, and, of more interest, the pick-up date time and drop-off date time. And then a bunch of other things like passenger count and trip distance, plus a few more columns. Also of interest is the fare amount, how much was actually paid for each trip, and then a breakdown of things like surcharge and tip.

Okay, so this was done in advance to make the demo go a little more smoothly, but if you want to create a reflection for this dataset, you can click on the gear up here at the top and come down to the Reflections tab, and you can see there's a button here for raw reflections and also for aggregation reflections. Now, these columns were actually suggested by Dremio as a possibly good choice for what to use as dimensions and what to use as measures. In this case I went with the recommendation, because I think they look pretty good. Looking at what we have for dimensions...

When we say dimension, a dimension is a column we expect to group by when analyzing this data. A measure is a column that we typically expect to compute values from, and what we're typically computing is things like sum, count, and average; those are the more common ones. Okay?
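In effect, an aggregate reflection with these choices persists something like the following (a sketch; the column names follow the demo narration, and keeping both SUM and COUNT per measure is one standard way to let AVG be derived at query time):

    SELECT vendor_id,
           pickup_datetime,
           SUM(fare_amount)    AS sum_fare,
           COUNT(fare_amount)  AS cnt_fare,   -- SUM plus COUNT lets AVG be derived
           SUM(trip_distance)  AS sum_distance
    FROM trips
    GROUP BY vendor_id, pickup_datetime;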

So I previously created these reflections, but I just wanted to demonstrate how one would do this. And if you want to modify this, it's very easy: you can move these around and create different dimensions or different measures. Let me go ahead and cancel. Now, like I said, the original dataset is a one-billion-row, 180-gigabyte dataset, so running even simple queries would take quite a while. For example, what if I just wanted to get a list of the distinct vendor IDs? I'm going to run this right now, and it should be very, very fast. See, it came back in less than a second. Without a reflection this would take much, much longer. First off, because the underlying data is CSV, it would have to read all the data: even though I only care about vendor ID, the CSV format basically requires us to read 100 percent of the data. Then it has to parse it, and then aggregate over a billion records. I'm not sure how long that would take, actually, but it would take quite a while. You can see here that with the reflection it's much, much faster.
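The query in question is simply this; with the aggregate reflection in place, it can be answered from a few hundred thousand pre-aggregated Parquet rows instead of a billion CSV rows:

    SELECT DISTINCT vendor_id FROM trips;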

Okay. I'm going to show you the jobs page. Dremio lets you look at all the jobs that have been run, and if we click here, we can look at a profile for the job. As I demonstrate here, you can see from the plan that the original plan is much more complicated: an aggregate on top of a project, going all the way down to a filter, and then the underlying data source, which is the CSV with 1404 files, or splits, actually. Sorry, those are HDFS splits. But with the reflection, the optimizer ended up choosing a plan that queries the reflection instead, which, first off, is a much simpler plan. And we can also look at the stats and see that we only read a few hundred thousand rows, whereas the original dataset has over a billion rows. So we probably got more than a thousand-fold speedup on this query.

So I'm going to show now how this is useful, and how it integrates with popular BI tools. Specifically, I'm going to look at Tableau. When viewing a dataset in Dremio, you can click on this Tableau button up here, which will download a TDS file which you can simply click on. Assuming you have Tableau installed on your system, it will launch Tableau and prompt you for your credentials; these are your Dremio credentials.

I'll just type that. Okay, Tableau just crashed. Let me... Okay, yeah. It's working fine. I don't know why I got that error.

So now that I'm in Tableau, I can start doing the typical things you might do in Tableau. Let's start by, I don't know, let's drop Number of Records here in Columns. And notice how quickly this returned. Very snappy, and we can see the count: over a billion. Okay, so now let's say, "Well, I don't want just the number of records. Let's look at the number of records grouped by year, for example." So let's grab the pick-up date time and drag that over to Rows. Let's change that to a bar graph. Okay, so now we can see the number of pick-ups per year. It seemed to pick up a little in 2012, but then maybe it's dropping off in 2014. Kind of interesting.

Maybe I want a different granularity. So let's come up here and choose a different one. Maybe by month. Actually no, let's do this one. But first off, I'd like you to notice how quickly these are returning. It's basically an interactive experience. That's kind of cool. Maybe we want to look at the average fare amount. There we go. We can see that the average fare amount seems to be increasing; in fact, there's a big jump around September 2012. Kind of interesting. I have no idea why, but this could be the beginning of an investigation into why that is happening.
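For reference, dragging pills in Tableau emits SQL against Dremio of roughly this shape (illustrative only; the aliases and date functions Tableau actually generates will differ):

    -- Number of records per pick-up year
    SELECT EXTRACT(YEAR FROM pickup_datetime) AS pickup_year,
           COUNT(*) AS num_records
    FROM trips
    GROUP BY EXTRACT(YEAR FROM pickup_datetime);

    -- Average fare amount per month
    SELECT DATE_TRUNC('MONTH', pickup_datetime) AS pickup_month,
           AVG(fare_amount) AS avg_fare
    FROM trips
    GROUP BY DATE_TRUNC('MONTH', pickup_datetime);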

Then we can go back into Dremio and look at the jobs page again, and we can actually see the queries that were generated by Tableau. We can see that they were all able to use the reflection, and they are completing in less than one second. And believe me, without the reflection this particular Tableau use case would not be interactive at all; we'd be looking at probably minutes every time we drag something. So I hope this is a good demonstration of the power of Dremio reflections in making the BI user experience better.

So that's the end of my demo. Oh, let me go back to the PowerPoint. There we go. So now we have time for Q&A, and also some closing words. You'll notice these links; I think you'll find them very interesting.

First off, the Dremio University, which contains lots of great content on how to use Dremio, and also just general big data and analytics content. Also, definitely download Dremio. We have a free community edition. It's open source, so definitely download it and try it out. And the library.

Actually, I don't know the library one offhand. So yeah, check these links out. So now I think we'll have a chance to take some questions.

Jesse Anderson:

Yeah, so a few questions have popped up. Are you able to show, in the Dremio UI, how long it took you to ingest that initial data? You were saying you didn't want to show that because it took a bit. Exactly how long did it take?

Steven Phillips:

Oh, for this one it took about an hour. But how long it takes is obviously highly dependent on the size of your cluster. Like I said, typically a Dremio cluster will be more than two nodes. I think my cluster is the smallest instance size GCE supports that Dremio allows. We have some minimum requirements, something like eight gigabytes of RAM; I forget the actual number, but a very small instance. It took about an hour for this particular one.

Jesse Anderson:

And for a production cluster, on average, how long do you think that would actually take?

Steven Phillips:

Well, it's very scalable, so if you had a ten-node cluster it would probably be about five times as fast. So you'd go from an hour down to maybe 20 minutes. Now, typically users will schedule their refreshes to happen during less loaded times, like maybe overnight. So it's expected that building the reflection will take longer than the actual queries, because we're doing more of the work upfront.

Jesse Anderson:

Yeah. Okay, another question from the audience. Materialized views will initially take time to instantiate, how do you manage this in a production system? Do you need to warm the system with the expected user queries, to allow materialized views to be initially created?

Steven Phillips:

That's a very good question. So right now, that's basically how it would work, and it's not even automatic. We do have some suggestions that we'll make for a given dataset, for what reflections to choose. But right now it's a manual process: looking at what queries you've been running and trying to determine what the optimal reflection layout would be, what dimensions and measures to choose. That being said, we're definitely working on better automation of this, so that Dremio will figure out, based on the workload, what the optimal reflections to create are.

Jesse Anderson:

Next one. Another question: maintaining materialized views can consume considerable space. How are storage limits managed? Are unused materialized views ever GCed, or garbage collected?

Steven Phillips:

So yeah, they can use space. Right now it's somewhat manual. We do have some pages in the UI which can tell you what the footprint of the various reflections is, and I think we have statistics that let us see which reflections are being used and which are not. But right now it's not automated; reflections are not automatically pruned by the system. It would be manual. So no, right now it's not automatically removing unused reflections. That would definitely be a useful feature for us to add.

Jesse Anderson:

Perhaps a follow-on question to that, that they didn't ask. Which parts of the data are stored in Arrow? So in-memory versus on-disk?

Steven Phillips:

So yeah, that's a good question. We actually store it all on-disk. Arrow is simply the format used in-memory during the pipelining. We don't currently store any of the reflections in Arrow format. So that's one thing to know: it's not an in-memory cache, it's an on-disk cache.

Jesse Anderson:

Straight on-disk. And then for the actual storage, is that in HDFS, or is that in your local file system?

Steven Phillips:

So it's configurable. It's best if it's in HDFS, or if you're running in a cloud environment, then it probably makes sense to store it in cloud storage, like S3 or Azure Data Lake. The vast majority of customers do either HDFS or cloud storage. There is also an option for storing on local disk, on the Dremio executor nodes. The disadvantage there, though, is that you lose out on the replication and persistence that you get with HDFS. So that's not our recommended configuration, but it is supported.

Jesse Anderson:

And you didn't talk about it specifically, is there a specific flavor of HDFS you support, or is it all the HDFS versions?

Steven Phillips:

Well, I don't want to say all, but most of the common ones. We specifically have support for the Cloudera, Hortonworks, and MapR distributions of Hadoop. Those are all supported. EMR...

Jesse Anderson:

I know you can spin up HDFS with EMR, do you support that as well?

Steven Phillips:

EMR is also supported, yeah.

Jesse Anderson:

Okay, great. And then another question that was asked: is there downtime during updates to reflections? Say we want to refresh every five to ten minutes.

Steven Phillips:

There's no downtime. However, the refresh itself can consume resources, so you might see some impact on other user queries that are running while the refresh is happening. Now, we do have workload management, which can help manage that. Generally speaking, we'll give current user queries higher priority, to reduce the impact of the refresh. So generally, there shouldn't be downtime. That being said, customers often decide it makes sense to run their refreshes during off-peak hours. But there's no downtime; the cluster is still accessible and usable.

Jesse Anderson:

Another question that I think is related to that: concurrency. You didn't really talk about how many concurrent users you could actually run with this.

Steven Phillips:

There's no hard limit. But we've definitely seen our customers run with a hundred concurrent queries and users.

Jesse Anderson:

Excellent. Yeah, that's been one of the kind of missing pieces out there in the ecosystem of how to do BI queries concurrently at large scale. Great.

Okay, the next question is what was the max memory allocated for Dremio?

Steven Phillips:

So I'm guessing they're talking about this particular example. I said eight gigabytes earlier, but I think it's actually 16 gigabytes of memory on these GCE instances, and most of that is allocated to Dremio. However, I don't know that these queries were using anywhere close to that.

Jesse Anderson:

So to be clear, those Dremio processes are Java processes, correct?

Steven Phillips:

Yes. Yes, they're Java.

Jesse Anderson:

And maybe they're asking about the JVM Xmx setting.

Steven Phillips:

So actually, one interesting thing about Dremio is that it's not just Xmx, which limits the Java heap. Especially on the executor side, most of the memory we consume is not heap; it's direct memory.

Jesse Anderson:

It's off-heap?

Steven Phillips:

Off-heap memory, yeah. So there's a separate setting for that, to control how much off-heap memory we can use. Typically, for an executor node we recommend, I think... Well, this is a pretty small node, so I believe we're doing basically four gigabytes for heap and twelve for off-heap, because all the actual data buffers are off-heap. The heap is only used for some of the control structures; the actual data itself is in off-heap buffers. In a production environment you're typically going to have quite a bit more memory, and the vast majority of that will go towards off-heap, because that's what allows you to increase your capacity for running all the queries. Usually the heap is not where most of the memory is used.
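In JVM terms, the two pools map onto separate settings. Here is a sketch matching the numbers above, using standard JVM flags for illustration rather than Dremio's actual configuration mechanism:

    # Hypothetical executor JVM settings for the small demo node:
    -Xmx4g                         # Java heap: control structures only
    -XX:MaxDirectMemorySize=12g    # direct (off-heap) memory: Arrow data buffers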

Jesse Anderson:

So to be clear, the executor's off-heap usage is more for the aggregation? So, for example, if it's doing a distinct, it's storing those distincts, the actual distinct values, as well as the count, in that off-heap memory?

Steven Phillips:

Sorry, what I meant is all the Arrow buffers, all the data that we're processing. So we read from Parquet, and we load the data into Arrow buffers, and those buffers themselves are off-heap.

Jesse Anderson:

Okay.

Steven Phillips:

And then as the data flows through the pipeline of operators they'll be transformed, and copied, and new data will be written to new buffers. Those are also off-heap.

Jesse Anderson:

I see. Okay, and the next question is: the scheduler is by number of hours instead of a specific time. Is the scheduler going to be updated to allow scheduling reflections at specific times?

Steven Phillips:

I believe that's been asked for, and I think we're working on it, but I can't give a specific timeline at the moment. I'm pretty sure it is in the works, because I know customers have asked for it.

Jesse Anderson:

Okay. That answers that. Okay, you talked a little bit, initially, about who the primary users of Dremio are. Oftentimes it's the data engineer putting that data into Dremio, loading it into Dremio, and then from there the business intelligence analysts and data scientists are using it. Are you finding that data scientists are using Dremio for discovery?

Steven Phillips:

That's a very good question. Unfortunately, I'm not the right person to answer that one; I'm on the development side. In terms of whether or not that's happening, I honestly don't know. That's something I could look into; someone on the sales engineering side, or possibly our product side, would know the answer. I would think some probably are doing that, but I can't say for sure.

Jesse Anderson:

Okay, and it sounds like, given your level of integration with Tableau, it's very, very common for the BI folks to be querying against Dremio with their Tableau instance?

Steven Phillips:

Yes, that is very... I have seen that, for sure, at a lot of our customers.

Jesse Anderson:

Okay.

Steven Phillips:

Tableau does seem to be the most popular of the BI tools I've seen. But we support almost any of them. Power BI is another good one, and Looker, we have integration with that. I think MicroStrategy as well.

Jesse Anderson:

Okay, and somebody asked about this being made available for offline viewing, and the answer is yes: the attendees of the webinar will get a link so they can watch this again. Speaking of that, let's say somebody has a question or wants to reach out to you; is there an email address they can use?

Steven Phillips:

There is. My email address is Steven, with a v, at dremio.com. I don't know if I can just add it to the slide? No. Okay.

Jesse Anderson:

I will chat it out, as well.

Steven Phillips:

Yeah, chat it out. That would have been... I should have added it here.

Jesse Anderson:

Okay.

Steven Phillips:

We also have a Dremio community forum. It also would have been good to add the link here, but I think if you just go to dremio.com, you should be able to find a link to the community forum, which is also a great place for asking questions. Not just Dremio employees, but also other paying and non-paying customers contribute to both asking and answering questions on the forum.

Jesse Anderson:

Okay. There were two other questions I didn't see, because they were in the chat instead of the Q&A widget. So the one question is: is Dremio a replacement for the traditional data warehouse? We'll start with that one. That's a good initial question.

Steven Phillips:

I would say no, it's not a replacement for a data warehouse. We envision it mainly as supplemental, or possibly living side by side. I don't think we would really recommend that you just replace your data warehouse. Generally speaking, though, for some specific use cases, it might make sense to move that use case from your data warehouse into Dremio instead. The reason is that data warehouses like Vertica and Teradata do provide things that Dremio doesn't have, but they're also very expensive, cumbersome, and difficult to work with. If you don't need all that, if you don't need the data warehouse, then Dremio could definitely be a replacement for certain use cases. But generally speaking, no, it's not a replacement; it's supplemental.

Jesse Anderson:

Yeah, I think that's one of the things I would recommend people take away from this: there's a desire from management, and often from database people, to have a single place for everything, and that just isn't going to work in big data at scale. There are just too many trade-offs, as you mentioned.

Steven Phillips:

One thing to note, we actually support Teradata, for example, as an input source to Dremio.

Jesse Anderson:

Okay. Their other question is, does the relational cache essentially copy all of your source data into memory? I think you kind of answered that one already.

Steven Phillips:

Two things to note there. First, it's not really a copy, and it's not in memory; it's on disk. Generally speaking, you're going to be doing aggregate reflections to get the biggest bang for your buck, and that's not a copy; it's a partially pre-computed aggregation. So generally it's much smaller, orders of magnitude smaller than the original data. And even in the case of a raw reflection, from what we see with most of our customers, typically not every column in the raw dataset is even necessary for analytics. So it's not going to be a naive copy of the entire dataset; it will only copy the columns that are needed, which are then compressed using things like dictionary encoding and other types of compression. So typically, even with a raw reflection, the footprint of the reflection is many times smaller than the original data.

So one way to think of it is that a reflection is more like building an index than simply making a copy of the data.

Jesse Anderson:

Okay. Then they asked another question: what's the largest production Dremio implementation, in terms of both source data and the number of data sources?

Steven Phillips:

Oh, that's a...

Jesse Anderson:

May not be one you know as well.

Steven Phillips:

Well, if by implementation they mean cluster size: I know of a customer who has a 600-node Dremio cluster, with, I believe, multiple petabytes of data. That's the largest one I know of.

In terms of the number of data sources, most of our customers don't usually have a huge number of data sources. There's nothing stopping us from supporting a hundred different data sources; I just haven't seen that usually being the case. Typically it's a much smaller number, but there's no reason it couldn't be hundreds.

Jesse Anderson:

Is there an estimate on what that 600 node cluster... Is that a petabyte? Is that two petabytes?

Steven Phillips:

It's in the petabytes. I don't remember how many petabytes.

Jesse Anderson:

Okay. That's usually...

Steven Phillips:

The petabytes is the data in the underlying HDFS that we're connecting to. Or Hive, sorry; it's actually Hive. Apache Hive-formatted data that we've used Dremio to run queries against.

Jesse Anderson:

Okay. Yeah, usually those sorts of questions, they're just trying to get an idea of the high watermark, to make sure that they're under that.

There is one more question that we haven't talked about, and that was Spark integration. Often times the data scientists and data engineers will want to query Dremio with Spark. Could you talk a little bit about that?

Steven Phillips:

Yeah, that's a good question. It's not my area of expertise, but I do know that we have ways of connecting Dremio as a source inside of Spark, though I forget how that works. I should be honest: I don't know that there's that much integration currently, other than being able to add Dremio as a source. I guess we kind of view Spark and Dremio as coexisting; you can use one or the other. Now, one interesting thing to note, though. I didn't get into this in the slides, but we actually have another feature called external reflections, which allows a user to manage the creation of the reflections themselves. So maybe they would use Spark or Hive to actually create the reflections, but then register them in Dremio as being equivalent to some query. Then they can be used for substitutions, in cases where you already have the data generated through some other Spark job or something. So that's one example of how Spark and Dremio can work together.

Jesse Anderson:

Excellent. Well, we are out of time. I'd like to thank you, Steven, for taking the time to tell us a little bit more about Dremio and how we can start using it. And, hey, there are some pretty cool things that we can do, and it can happen fast. So I think that was definitely my takeaway: I'd love to see my BI people spending a lot less time waiting for a query to come back. So with that, I'd like to thank Dremio and Steven. On behalf of myself and Dremio, thank you again for attending, and we wish you the best of luck as you implement your Dremio solutions. Thank you.

Steven Phillips:

Yeah, thank you. Thank you, Jesse.