Dremio 2.1 - Technical Deep Dive

Transcript

Kelly:

Hi everyone, thanks for your patience. We're about to get started here. My name is Kelly Stirman. I run strategy here at Dremio. I'm joined by two of our product managers who will be walking you through the features we want to describe for Dremio 2.1, which we released about a week and a half ago. I'll guide the conversation. This will be very informal and, I think, fun. Typically, these are a lot of fun. But one thing we ask of you: we know you are going to have questions along the way. In the interface, there's a Q&A box, which you can use to ask questions, and that lets us see all the questions that come up and answer them as necessary.

There's also a chat but we prefer that you use the Q&A if you can. Appreciate that. So, what are we going to talk about today? We’re going to talk about Dremio 2.1 but I just want to … We’ve had Dremio out and available to folks for about a year now, a little over a year. And I wanted to give you a sense of the cadence of things. We shipped Dremio 1 last summer in July and then immediately had a 1.1 about five weeks later. Then we have been moving, progressing along every couple of months with a new release.

In each of these releases thus far we’ve had a mix of what you might call stability fixes as well as new features and functionality. Even though Dremio 2.1 you might think of as a minor release with just some stability fixes, there are also a number of new features in this release.

We’re gonna talk about Dremio 2.1 today. At the very end of the session we’re gonna spend a few minutes talking just a little bit about some of the key features coming in Dremio 3.0, which we’re currently tracking for about a month from now. With that said, what are we gonna talk about today?

Well, there are four key areas where we want to describe new features. First is around data sources, so the places Dremio can access and query data. The second is Dremio's acceleration capabilities, which we deliver through a set of features related to data reflections; there are some enhancements and new capabilities related to data reflections. The third is around Dremio's SQL execution. Dremio ships with a SQL execution engine based on Apache Arrow. That's something we continue to refine, adding capabilities and improvements in functionality but also in performance.

We'll talk a little bit about some changes to query planning and execution capabilities, and then one of the key areas of functionality in Dremio that's very important to our customers is security and administration. We'll talk about some of the changes and enhancements in that area as well. Let's get started with capabilities related to data sources.

First of all, finally, Elasticsearch 6 is available on Dremio. A little round of applause for Elasticsearch 6. One of our most popular data sources, until recently there was support for Elastic 2.x and 5.x and now, finally with 2.1, 6.x. Can, do you want to talk a little bit about what’s new here and what people can expect?

Can:

Yeah, thanks, Kelly. In the last few years, one of the most used data sources on the platform has been Elasticsearch. Based on feedback we've been getting from our community as well as some of our users, we've put a lot of effort into the 6.x releases, making sure that the Dremio experience you're used to with 2.x and 5.x is also available on 6.x. This also makes it much easier for our users to get started with Elasticsearch 6, because most managed services also now run this version.

A lot of exciting things in this one: in terms of certification, more push-down capabilities, and making sure that things like predicate push-downs work just as well as they do on the 5.x series.

Kelly:

And just as a reminder for folks, if you haven't taken advantage of this: Elasticsearch is a very, very popular NoSQL system that people use for actually pretty large data sets. But it does not have any real native support for SQL. There are some new capabilities shipping now as part of X-Pack with Elasticsearch, but you don't have things like joins and a number of other fundamental capabilities related to SQL. Window functions are another example.

Dremio, of course, is open source and provides full ANSI SQL capabilities, including more recent capabilities like windowing, all kinds of joins, and joins to other systems using standard SQL. What we'll do is push down expressions into Elasticsearch, either as Painless scripts or in the query DSL. Capabilities that Elasticsearch doesn't provide, like joins, are things that Dremio's SQL execution engine will complete as part of the larger query plan. You don't have to worry too much; you can just send a SQL expression to Dremio and Dremio will push down what it can and do the rest on its own engine.
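As a sketch of what that looks like in practice (the source, table, and column names here are hypothetical), you might join an Elasticsearch index to a relational table with plain SQL and let Dremio decide what gets pushed down:

```sql
-- Hypothetical sources: "elastic" is an Elasticsearch source,
-- "postgres" a relational one. The filter on the Elasticsearch side
-- is a candidate for push-down (as Painless or query DSL); the join
-- itself runs in Dremio's Arrow-based execution engine.
SELECT c.customer_name, SUM(e.amount) AS total
FROM elastic.sales.events AS e
JOIN postgres.public.customers AS c
  ON c.customer_id = e.customer_id
WHERE e.event_type = 'purchase'
GROUP BY c.customer_name
```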

And if you're curious what exactly is happening behind the scenes, go to a job profile in Dremio and click on the planning tab, which I've put a yellow box around here on this job profile. At the bottom, there's a section for the final physical transformation, and if you look in the text there, there's some cryptic stuff and then the push-down. I put a box around the actual expression that's pushed down for this particular query I ran against Elasticsearch.

So if you're trying to get a sense for, "Hey, I sent this arbitrary SQL expression to Dremio. What actually got pushed down into Elasticsearch?" You can go to the job profile and you can find it on the planning tab here. That's it for Elasticsearch.

What else is new in terms of data sources? Can or Jeff, do either of you want to talk about this one in particular?

Jeff:

Sure, I can give that a shot. Thanks, Kelly. One of the things a lot of our customers have asked for is support for AWS Cloud, which is essentially the same as S3 in terms of the features and capabilities it supports, so when I'm using it as a source or as storage for reflections, it works the same way as S3.

And then, like Kelly mentioned, we have a penchant for security. We take it very seriously, so supporting TLS with Oracle was an important introduction in this release, and of course it works with a number of Oracle releases. Going forward we'll start providing this capability for some of the other sources as well.

Our customers are ... there are customers using ORC files that reside in Hive sources. There are quite a few of them, so now we're able to provide much better performance and memory efficiency with ORC files, and these are ORC files primarily in the Hive environment. You'll still have better support for push-downs and, of course, just better performance overall. The other important distinction here is the fact that we're actually using our own statistics instead of Hive's. This will obviously provide better performance and more efficiency in terms of how we access the metadata within the Hive environment.

Kelly:

Yeah, I think one of the things we saw with a lot of customers was that maybe they weren't keeping their hive stats up to date and we were dependent on accurate stats to do efficient query planning in Dremio, even though we weren't pushing the query through hive. The misleading statistics could have us create a pretty poor query plan. Now, by default, we rely on our own statistics, but you can of course opt out of that and just use hive stats. And there's an advanced admin setting here in the slide if you want to switch it back.

Thank you. All right, let's talk about data reflections. As a reminder, data reflections are one of the ways Dremio makes queries so fast so ... One of the first things here is related to selected measures. What's this all about, Can?

Can:

As a part of creating an aggregation reflection, there are several things that you're defining. One of them is the set of dimensions that you're going to be slicing and dicing by, and the other is the set of measures, basically the metrics. This is more of a convenience improvement: now, whenever you're selecting these measures, you can be more selective about the specific measures you want, so you save compute resources when updating that reflection, as well as storage, because we'll be storing less. You'll see here the standard field measures in this case: the count, the sum, the max, and the min.

And you're also seeing here a new measure type that we're introducing in this release, which is approximate count distinct. Previously, Dremio already had the capability of running approximate count distinct queries; NDV is the name of the function. With this release we're introducing the capability of also accelerating count distinct queries using an approximate count distinct algorithm.

Today, we already support exact count distinct acceleration. This adds, on top of that, the ability to do the approximate version. By default, this algorithm gives you, I believe, plus or minus two percent accuracy on the count. And, yeah, it is very exciting because it helps you make the trade-off between accuracy versus memory and space, in terms of compute and storage. This was one of the top requests from our customers in terms of enhancements.
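As an illustration (the table and column names here are hypothetical), the exact and approximate forms look like this; NDV is the function name mentioned above:

```sql
-- Exact distinct count: precise, but memory-hungry at high cardinality.
SELECT COUNT(DISTINCT user_id) FROM analytics.page_views;

-- Approximate distinct count via NDV: roughly plus or minus 2 percent
-- by default, at a fraction of the memory and compute cost.
SELECT NDV(user_id) FROM analytics.page_views;
```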

Kelly:

If you're doing count distincts, typically that's a pretty slow query in any system and if you're more focused or more concerned with performance or resource utilization, this is an option that gets you pretty close.

Can:

Exactly.

Kelly:

For a lot less cost in terms of running the query.

Can:

Yes, it enables you to have more concurrent users with the same resources, and it enables you to be much faster without having to spend as much in resources, yup.

Kelly:

I have potentially a really dumb question here which is, "It looks like Dremio does min and max and sum and count, what if I want to do average? Is that not something Dremio can do?"

Can:

That's a good question. So what happens is, when you have count and sum enabled for a reflection, Dremio can automatically derive averages, so you don't have to configure that or store the data as yet another field, because we can derive it and it's very cheap. We don't actually store it as a separate one.
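The idea is simply that an average is derivable from two stored measures; sketching it in SQL terms (hypothetical names):

```sql
-- If the reflection stores SUM(amount) and COUNT(amount),
-- AVG(amount) can be computed at query time as:
SELECT SUM(amount) / COUNT(amount) AS avg_amount
FROM sales.orders;
-- This is equivalent to SELECT AVG(amount) FROM sales.orders:
-- rows with NULL amount are excluded from both SUM and COUNT(amount),
-- so the two forms agree.
```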

Kelly:

I see.

Can:

You get that automatically. No stupid questions for me, Kelly.

Kelly:

So these are primitive ... Not primitive but they're ... They're foundational calculations that many other measures can be derived from.

Can:

Exactly.

Jeff:

And this is a feature that's useful in scenarios where you have high cardinality or very large data sets, right?

Kelly:

For count distinct?

Jeff:

Yeah.

Kelly:

The approximate count distinct. You approximate this, then?

Can:

It makes handling such cases much less [inaudible 00:22:06] so that you can handle it even with a smaller cluster.

Kelly:

Great. Okay. Look I didn't finish filling out my slides. Never expire reflections F. I get an F for this slide.

Can:

I don't know what the message here is but.

Kelly:

What is an expired reflection? What's that all about?

Can:

So, whenever you're defining a reflection, you generally have two options associated with it from a lifecycle standpoint. One of them is the refresh interval, right? Which is the first option you're seeing there, the refresh policy in the screenshot, which tells us to try to refresh this every X many hours. And you also have the expiration, so that you can tell your end users: hey, the data that you're working with is guaranteed to be at most six hours old, or four hours old, whatever that is. And in some scenarios we see our users have variable refresh patterns, where in some cases it's three hours and in some cases it's six hours; they distribute it themselves. So they want to handle both the refresh as well as the expiration [inaudible 00:23:11] without us expiring it for them.

So this is just to provide some [inaudible 00:23:15] flexibility for users more interested in managing their own updates and expirations.

Kelly:

So I can let Dremio do all the reflection maintenance and refreshes. Or maybe I want to be explicitly in control of that.

Can:

Exactly.

Kelly:

We already let you explicitly control the creation of a reflection through our APIs. Now you can also explicitly control the expiration of existing reflections.

Can:

And more like, you set this to never expire, and when you refresh it, you just switch to using the new one, as opposed to it automatically expiring. It's just so that if you don't know your update patterns, we can still [inaudible 00:23:50].

Kelly:

So you have a pipeline that does a bunch of work and at the end says: hey Dremio, this data is ready, go update the reflection.

Can:

Exactly. And this is supported both from our SQL APIs as well as our REST APIs, so that should give you some flexibility.

Kelly:

Okay. So external reflections, on that topic of reflections. Typically, a reflection is something in Dremio that persists as a Parquet file, and its metadata and use are governed by the Dremio process. But it's possible that you may already be doing work to optimize data for different kinds of query patterns, right? So you might already be creating some kind of aggregation table, or sorting the data in some way, or keeping some subset of the data for a particular group of users and their queries. And maybe you already have that in some other system, like an Oracle database or Elasticsearch, or as a particular partition in Hive, or some bucket in S3, who knows? But the point is you're already doing the work and that isn't broken, so don't reinvent the wheel. If you tell Dremio about it, we can actually use it in queries. Our query planner can cost it out among all the different options and, when it makes sense as the lowest-cost plan, use that external reflection to accelerate certain types of queries.

So that's the preamble on what an external reflection is and why it's important. What's new in 2.1 related to external reflections?

Can:

Yeah, so you could see this as a phased rollout of the capability. When we first released the external reflection capability, it was focused more on data lake sources as well as non-SQL sources. We would enable our users to do things like, let's say you have an activity table, and you've already had that table partitioned in three different ways to optimize for different access patterns. You have it by date, let's say, by customer ID, and by product category, or whatever it is, right? Dremio would provide you one logical structure on top of this, so your queries still query the logical layer, and Dremio would route your queries to whichever of the data sets that you maintain, not Dremio, works best for that query.

So in a relational database context, you can take it slightly differently, where you may already have [inaudible 00:26:25] indexes but not the tables, or you have aggregation tables that you already maintain. Or you have aggregation tables that you're maintaining but you have [inaudible 00:26:34] that you want to have in real time, so it's a very rapidly mutating table that you don't necessarily put on a refresh schedule. That's a perfect use case for external reflections on relational sources, where you want fine, granularly managed control over the content of the data as opposed to doing it [inaudible 00:26:55].

Kelly:

Cool. Very interesting. So, a little bit of detail on how you actually do this. There is not a GUI in Dremio currently to create an external reflection; you would have already created this data set in some external source. You're going to tell Dremio that this particular data set is an external reflection on some virtual data set that's been defined in Dremio, and if you want to take a look at what's there, you can query the sys.reflections table to get a sense of the information. I have a little screenshot of this. On the right is a DESCRIBE of sys.reflections; you can see what columns are in that table, and when a reflection is external, that is reflected in the values in that table. So that's how you work with external reflections: via SQL.
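As a sketch (the data set and table names here are hypothetical), telling Dremio about an existing summary table and then inspecting it might look like this:

```sql
-- Declare that the pre-built table source.db.daily_sales_agg can serve
-- queries against the virtual data set sales.daily_sales.
ALTER DATASET sales.daily_sales
  CREATE EXTERNAL REFLECTION daily_sales_ext
  USING source.db.daily_sales_agg;

-- Inspect reflections, including external ones:
SELECT * FROM sys.reflections;
```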

Okay. Let's move on to query planning and execution. So what is new and interesting in query planning and acceleration? I think we actually don't have slides for this; it's just to talk about these three things: join optimizations, approximate [inaudible 00:28:19], which we talked about a little bit before, and correlated subqueries.

So what's new in terms of join optimizations in 2.1?

Can:

So we've done various enhancements on this side. Let me give you a high-level summary. We now have a better understanding of how best to order joins, which is the most critical thing when you're optimizing joins, to give you the most optimal performance. We now cover more edge cases, so on average you'll do better when you have more than one join in the query, from a planning standpoint. That's how I would summarize that overall. And just to touch on the correlated subquery portion as well: one of the things we see our customers use, as well as some of the existing benchmarks, is a lot of correlated subqueries. With this release, you will see Dremio optimize subqueries in a much better way. Usually, when you optimize correlated subqueries, you actually turn them into joins of different kinds so they can be executed efficiently. In this release, we basically reduced the number of operations we do and do better optimization, and [inaudible 00:29:35] better memory characteristics and better performance overall when you have correlated subqueries.
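To make the correlated subquery point concrete (hypothetical schema), a query like the first form below is typically decorrelated by a planner into something like the join form that follows:

```sql
-- Correlated form: the subquery references the outer row.
SELECT o.order_id, o.amount
FROM orders o
WHERE o.amount > (SELECT AVG(o2.amount)
                  FROM orders o2
                  WHERE o2.customer_id = o.customer_id);

-- Decorrelated form the optimizer effectively produces:
-- compute per-customer averages once, then join.
SELECT o.order_id, o.amount
FROM orders o
JOIN (SELECT customer_id, AVG(amount) AS avg_amount
      FROM orders
      GROUP BY customer_id) a
  ON a.customer_id = o.customer_id
WHERE o.amount > a.avg_amount;
```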

Kelly:

And so, just thinking about it: we just talked about data reflections, and here we're talking about join optimizations and enhanced processing of correlated subqueries. How does that relate to reflections?

Can:

That's a really good question. Any optimization that you do at the planning level affects raw execution in the absence of any reflection, so you're getting better results out of the box if you don't have any reflections. On top of that, you also get better performance with reflections, because reflections often need to be joined with something else, or are present in scenarios where you have correlated subqueries. So you're getting a double benefit, both on the raw execution side as well as when planning and executing against reflections and creating them.

Kelly:

So if I'm already using Dremio, and I'm using reflections, these enhancements are potentially going to benefit queries where I'm already seeing them accelerated with a reflection, as well as those that aren't accelerated by a reflection? Okay. And is there anything I need to do differently as a user to take advantage of these? Do I need to write my SQL differently, or are there things like hints that I need to be aware of?

Can:

No, that's the beauty, right? It's the optimizer behind the scenes, so if anything you should pay less attention; things should just be optimized better and you should see better performance. In our internal testing, something like 80 percent of our complex [inaudible 00:31:09] improved; maybe some cases were slightly less favorable, but on the whole there's a good amount of lift in these scenarios. You're not going to get [inaudible 00:31:18]. These are incremental improvements but should help with typical daily workloads.

Kelly:

I guess if those of you out there see degraded performance in 2.1 with any of your particular queries, you can open a ticket if you're a customer, or you can post on community.dremio.com, and it's something we can take a look at. Maybe it's just a matter of reworking your SQL to be better optimized, or maybe there's something else we can do to help. But we expect that the vast majority of queries will be better off with these enhancements.

Okay, let's talk about admin and security. First of all, Jeff, do you want to talk about the REST API for security features?

Jeff:

Yeah, thanks, Kelly. Kind of a long view here, so stepping back: every object within Dremio is securable, whether it's spaces, physical data sets, whatnot. And you could always assign [inaudible 00:32:19] permissions through the GUI. What we've done with this release is introduce [inaudible 00:32:23] to do this through the REST APIs. So whether you're assigning users or groups, which are all various [inaudible 00:32:31], you can assign those users and groups specific permissions, read/write, to specific objects, whether they're spaces or other objects within Dremio.

A little bit of detail here describes some of those properties and fields; this is in the [inaudible 00:32:48], but this will give our customers who are using scripts and the REST APIs a lot of advanced capabilities for managing their Dremio instance.

Kelly:

Super useful. I think we see lots of our customers provisioning and controlling Dremio through the REST API. And insofar as security is a big part of that, we want everything you can do through the GUI to be addressable via REST as well. Is there anything significant that you can't really do through REST right now?

Can:

Some things are not, like [inaudible 00:33:27] provisioning and such are still internal APIs. But my [inaudible 00:33:32] going forward is going to be more of a [inaudible 00:33:33] where everything that you have access to [inaudible 00:33:36] should be accessible [inaudible 00:33:37] at the same time. You will see that going forward.

Kelly:

Okay. There are a couple of new controls for spaces and file upload in 2.1. As a reminder, a space in Dremio is how you organize virtual data sets. It's also an object on which you can enforce access control. So you could say particular users or LDAP group members can only access a particular space, and when they log in they would only see the spaces that they have access to. There are currently two types of users in Dremio: users and administrators. Prior to 2.1, both users and administrators could create spaces. Now, in 2.1, non-admin users can no longer create spaces by default. If you liked that non-admin users could create spaces, you can toggle that back on; there's a toggle. But the default going forward is that non-admin users can no longer create spaces.

Similarly, there's a new switch you can toggle for whether users can upload files into their home space. One of the really nice features in Dremio is that, as a user, you can have something like a local spreadsheet, and maybe you want to join your spreadsheet to one of the enterprise data sources. Typically, that meant you'd have to send it to IT, and IT would get back to you after doing a bunch of work. Wouldn't it be nice if you could just do that yourself? So in Dremio, you can upload JSON and Excel and CSV files and Parquet, any of the file formats we support, into your home space, and they're private to you. But, as an administrator, that might not be something you want everyone to be able to do. So now, in 2.1, you can disable the ability for users to upload files into the system. There's a toggle for that as well, and both of these are keys that you enter on the advanced settings page on the admin screen. You simply enter them into that yellow box on the right. It's not yellow in the GUI; I just put a yellow box around it.

But you enter that in and then click "show", and then it shows up on the left as something you can toggle left and right or reset the default. Both of these are effectively global settings that apply to all users. So currently you cannot, on a user-by-user basis, assign these settings. They're intended for the whole deployment.

Jeff:

Yeah, I would add that the request for the file upload control came about because there were a number of admins, a number of customers, that wanted to have more control over moderating space usage, assigning quotas if you will, to how their end users were using space in terms of uploading files. That's primarily where the motivation for that came from.

Kelly:

Yeah. And another way you could deal with this of course, is you could put the files on a network drive and then mount the network drive in Dremio. The difference is that that data source would be visible by everyone, whereas people uploading files into their home space is private to them and no one can see them.

So I'm actually personally really excited about this next feature. Because a very common question on the community site is: well, Dremio's got settings for heap and direct memory. How do I know what values to put in each? And the answer was always, well, it depends. And there were lots of questions. Now we've really simplified that. The settings are optimized differently for coordinator nodes and executor nodes. So now in Dremio, you can basically enter one value for the maximum amount of RAM that Dremio can use, and Dremio will figure out the best balance between heap and direct memory for each node, whether it's a coordinator or an executor.

And just at a high level, the coordinator nodes in Dremio rely more on heap, and the executor nodes rely more on direct memory, so they're less susceptible to garbage collection. Each is a little bit different. This should make things a lot easier. And I think, Can, if you really want to, you can override whatever settings this produces with your own dialed-in settings. But we hope that this simplifies things for most people and produces the right set of values the majority of the time.
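On the nodes themselves, this corresponds to settings in the dremio-env configuration file; as a sketch (the values here are illustrative), you set only the total and let Dremio split it, or override the split explicitly:

```shell
# dremio-env: give Dremio a single memory budget (in MB) and let it
# choose the heap/direct split appropriate for the node's role.
DREMIO_MAX_MEMORY_SIZE_MB=16384

# Or override the split explicitly if needed:
# DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
# DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=12288
```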

Can:

Yeah, and my recommendation is, if you just [inaudible 00:38:27] one of them, the system will figure out the other one. It's very [inaudible 00:38:32]; with only some of the options set, we'll figure out the rest [inaudible 00:38:35].

Kelly:

Oh, that's nice. That's nice. So hopefully everyone enjoys that. Can, do you wanna talk about the YARN auto-bundler? It sounds like something related to doing your laundry, I have to say. Not sure why I think that.

Can:

This is more of a convenience capability. Previously, if you had to deploy some of the dependencies in a YARN deployment to the executor nodes, that was a somewhat manual process. Now, with this enhancement, Dremio takes care of this portion. And since we are doing it automatically, we can also optimize its [inaudible 00:39:15]; in our testing we've seen up to a 50 percent reduction in the size of the artifacts that we are deploying to the different nodes. So this hopefully simplifies deployment and saves you some space overall.

Kelly:

Cool. I'd heard that there was a fair bit of copying things around in these yarn deployments, so this should make things a lot easier for folks.

Can:

Exactly, you got it.

Kelly:

Okay, so that wraps up 2.1. If you have questions, please go in and enter them into the Q and A and we'll start to triage those in just a minute. But before we do that, I thought it would be fun to talk just a little bit about what's coming in 3.0.

It's a big release with tons of enhancements overall since 2.0, so we won't talk about everything right now. But I thought it would be fun to talk about a couple of the big, exciting features. I would ask for a drum roll, but I think no.

All right, all right, very good. Appreciate that. All right, so let's talk about these ... I'll talk about Gandiva first and then we'll swap around. So, in June we announced the Gandiva initiative for Apache Arrow. This is about taking the LLVM compiler, which is widely used within Apple to compile different applications for the different hardware platforms that Apple offers. LLVM has a just-in-time compilation capability that allows you to take arbitrary expressions and compile them into optimized machine code. This is something, for example, that's used in Impala to compile SQL into optimized machine code, and other systems use this as well. So the idea here is: what is the fastest possible way we could execute arbitrary expressions on Apache Arrow buffers? The Gandiva initiative is all about taking the full surface area of SQL and making it something that can be compiled into the best possible way to operate on Arrow buffers. And of course, in Dremio, whether we're reading data from Parquet or Elasticsearch or Oracle, the first thing we do after we start receiving results from our push-down is convert the data structures into Arrow buffers.

And we do all our operations in memory on these Arrow buffers; that's one of the things that makes Dremio so fast. This particular set of enhancements is the first wave of making all of the SQL we process and support compiled by LLVM. It will cover things like filtering and projections and lots of different calculations. Later this week, we'll publish a blog post on how you can write UDFs for Gandiva, so the community can start to add more and more specialized operators that are exposed via SQL. In some of our testing we've seen big performance improvements; there's a blog post about a 70 to 80x performance improvement on projections in Dremio with these Gandiva enhancements, and you'll start to see these benefits starting with Dremio 3.0. In subsequent releases we'll keep adding more and more of the full SQL language supported by Gandiva. So that's the Gandiva initiative.

Who wants to talk about workload management? Can, do you want talk about it?

Can:

Yeah, I got it. So prior to 3.0, Dremio basically had four static job queues: two for your regular queries and two for your reflection jobs. And there was a cost threshold, based on the cost of the query plan, that determined whether your jobs were placed in the small tier or the large tier. You had control over concurrency and over the maximum memory per query in these static queues.

In 3.0, a job can be placed in a queue, or refused or rejected. It should be a very flexible mechanism where you can define rules based on your users, group membership, time of day, any function you can use in SQL basically, as well as some of the query metadata, and either place the job in a queue or reject it because you don't think it makes sense for your platform.

So, this should help with workload isolation and increased productivity overall, and make the cluster better utilized across multiple groups.

Kelly:

You said something that I was a little surprised to hear you say: something about anything you can express in a SQL function. What does workload management have to do with SQL?

Can:

So, for example, you can say something like a time-of-day function to say: hey, if it's after 6:00 PM, and if it's a large reporting query, and if the user is in marketing ops, place it in this queue.
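So a placement rule, sketched in that style (the function names, thresholds, and queue name here are illustrative, not exact Dremio syntax), might read:

```sql
-- Route large after-hours reporting queries from the marketing-ops
-- group into a dedicated queue (illustrative rule condition):
is_member('marketing-ops')
  AND EXTRACT(HOUR FROM CURRENT_TIME) >= 18
  AND query_cost() > 1000000
-- action: place the job in the "nightly_reports" queue
```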

Kelly:

Oh, so are you saying you describe or express the queue placement policies using SQL?

Can:

You can, the policies for placing a job in a queue. So you can think of it as: you have queues that you define, your buckets, with some sort of resource allocation.

Kelly:

Oh.

Can:

CPU and memory. And then you have these powerful, expressive rules that you can write in either very simple SQL terms, like hey, user X and so on, or you can go crazy and add whatever conditions you want. And with this [inaudible 00:45:38] our customers love the fact that this is so expressive and gives you [crosstalk 00:45:46].

Kelly:

So I could write a rule that says I can't run queries on the weekends.

Can:

Yep.

Kelly:

Or after nine PM.

Can:

You got it.

Kelly:

I think that's a great idea. I think it should ship with that [inaudible 00:45:57].
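The rule-and-queue model Can describes can be sketched as ordered predicates where the first match wins. This is a hypothetical Python sketch; the rule conditions, queue names, and `QueryContext` fields here are all made up for illustration — in Dremio 3.0 the conditions are written as SQL expressions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class QueryContext:
    user: str
    groups: set
    query_cost: float    # planner's cost estimate for the job
    submitted: datetime

# Ordered (condition, queue) pairs; a queue of None means "reject the job".
rules = [
    (lambda q: q.submitted.hour >= 18 and "marketing-ops" in q.groups
               and q.query_cost > 1e6, "evening_reporting"),
    (lambda q: q.query_cost > 1e8, None),   # too expensive: reject outright
    (lambda q: True, "default"),            # catch-all rule
]

def place(query):
    """Return the queue for the first matching rule, or reject."""
    for condition, queue in rules:
        if condition(query):
            if queue is None:
                raise RuntimeError("query rejected by workload rules")
            return queue

q = QueryContext("alice", {"marketing-ops"}, 5e6,
                 datetime(2018, 9, 1, 19, 30))
print(place(q))  # → evening_reporting
```

The appeal of expressing the conditions in SQL, as Can notes, is that the same predicates users already write in queries (time functions, group membership, cost metadata) become placement policy with no new rule language to learn.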

Data catalog enhancements. Actually, let's talk about that one last. Quickly, let's talk about Apache Ranger integration. Jeff, you wanna talk about that?

Jeff:

Sure. Yeah, we've got a number of customers that are already using Ranger in house to manage their policies from a security standpoint. And again, this just layers onto Dremio's already strong security foundation; this is in addition to everything else we're doing in terms of security at the source and the object level. What this allows is for customers who have Ranger, and have policies set on their tables and databases defining who has access, which users and which groups have access to the tables, to have Dremio honor those policies transparently. So there's nothing you need to do on the Dremio side. Dremio will just do the right thing based on who you're coming in as: it will consult the Ranger server, consult the policy, and ensure that the right things happen from a security standpoint.

And again, that's primarily for table-level access, which is what most of our users are using Ranger for today.

Kelly:

Cool, excellent. Should make a lot of our customers happy.

Jeff:

Absolutely.

Kelly:

So now let's talk about the last one here, the data catalog, and then we'll get to Q&A. One of the capabilities in Dremio that has been really well received by our users is this notion of a catalog. When you connect Dremio to any source, we automatically capture lots of metadata about the tables and columns and collections and indexes and so on and so forth, and make that searchable and easy to find in Dremio.

The same is true when you create virtual datasets. That all goes into the catalog automatically as well.

What was sort of missing, though, is the notion of, hey, I would like to describe these datasets, or I would like to flag them as authoritative or reference, or any other way to annotate what items in the catalog mean to a user. And so that's something we're adding: a first set of big features in 3.0, with more to come in the future. And Can, I thought maybe it would be fun if you could talk through this one screenshot of what this is gonna look like in 3.0, at a high level.

Can:

Yep, thanks [inaudible 00:48:19]. So one of the things that we've been hearing from our user base, both the community and the customers, is hey, you want to be able to put more information into [inaudible 00:48:30]. If you think of our approach, [inaudible 00:48:33] is a discovery-based catalog: whenever we connect a source, we go ahead and discover all the metadata available in that source, and you can browse and explore it, as opposed to having to define it first. Hence the discovery-based approach. And on top of that, since you have this powerful mechanism, we want to enable our users to add their own user-defined information and be able to input that [inaudible 00:48:59], as well as tag datasets, either for search or for the purposes of, hey, this is the source of truth for this dataset. Being able to basically impart more context on a dataset without having to consult an external source.

You'll see us add more and more functionality around this. One of the things you're seeing in the screenshot is actually a field glossary where you can add field descriptions as well. That is not going to be in 3.0, but things like that are things we'll keep adding, just because our users might otherwise have to switch between multiple tools if they have basic needs around gathering some of this information.

Kelly:

So what is gonna be in 3.0?

Can:

So what's gonna be in 3.0 is we're going to be adding the [inaudible 00:49:47] spaces, sources, and any folder. And we're going to be adding support for tags for datasets.

Kelly:

Okay. So this wiki, is it literally a wiki? Could I put an image in here?

Can:

Yeah, you can put images, and you can import and export it with Markdown if you already have some other external [inaudible 00:50:06] for this. Basically we wanted to make it so that if you have something external you can put it into Dremio, and if you want to export [inaudible 00:50:14] you can also do that. So it's very flexible from that standpoint.

Kelly:

And so if I tag this as reference or something like that, does that mean I can search on the tags?

Can:

You got it.

Kelly:

Excellent.

Can:

You can search on the tags. If you click on a tag, it actually automatically starts a search for you, so you don't have to type it in or copy it. And only users who have edit access to the datasets can manipulate the catalog, yep. Exactly.
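What makes "click a tag, get every dataset with that tag" a cheap operation is an inverted index from tag to datasets. This is an illustrative in-memory sketch, not Dremio's implementation; the class name, dataset paths, and tags are made up.

```python
from collections import defaultdict

class TagIndex:
    """Inverted index: tag -> set of dataset paths carrying that tag."""

    def __init__(self):
        self._by_tag = defaultdict(set)

    def tag(self, dataset_path, *tags):
        # Normalize case so "Reference" and "reference" match in search.
        for t in tags:
            self._by_tag[t.lower()].add(dataset_path)

    def search(self, tag):
        # Constant-time lookup, no scan over all datasets.
        return sorted(self._by_tag.get(tag.lower(), set()))

idx = TagIndex()
idx.tag("marketing.campaigns_q3", "reference", "curated")
idx.tag("sales.pipeline", "reference")
print(idx.search("reference"))  # → ['marketing.campaigns_q3', 'sales.pipeline']
```

The edit-access rule Can mentions would sit in front of `tag()`: only users allowed to edit a dataset get to change its entries in the index.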

Kelly:

And on a personal note, I just wanna say thank you for using the quote from Moby Dick here in the wiki you've copied: "It's better than Greek." Alright, so let's get to the Q&A. Let's see what we have here. Do we have any questions that people have asked? So, one question here: what about aggregation hierarchies? Do we have anything we wanna talk about for aggregation hierarchies?

Can:

So I think the question is how you can define different hierarchies on folders or dimensions, like when you're working with OLAP or a multidimensional analysis-

Kelly:

So like geography.

Can:

Like geography. We do not have plans for that; we're keeping an eye on it, though. That's something we've heard a few times so far, so we'd be interested in understanding if that's something more and more people want.

Kelly:

It's still something you would do in your BI tool today.

Can:

Exactly. You can totally do that in the BI tool. I think the question is whether that would be something integrated into the virtual dataset layer. We are definitely thinking about it, but we have to see some more demand around that one.

Kelly:

Okay. The next question is: can you send us the slides afterwards? Actually, what we're gonna send you is a recording and a transcript of the recording, in about 24 to 36 hours from now, so we'll get that out to everyone.

The next question is about Elasticsearch integration. The results and performance of queries that are pushed down into Elasticsearch are very dependent on the indexing that's set up. How do you manage this from Dremio? So do you want to-

Can:

Yeah, that's a great question. So in Dremio's overall push-down model — and when I say push down, I mean pushing some calculation down to a data source — we'll try to push as much as possible, because specifically with sources like Elasticsearch, otherwise you basically have a full scan and, you know, a transfer. You want the computation happening on that side and then to retrieve a smaller result set into Dremio. Today our push-downs don't vary based on whether you have some field indexed, with an analyzer and so on and so forth; that's something we're thinking about. But already, by pushing as much as possible into Elasticsearch when generating the result set, you see most of the benefit. And currently the recommended approach would be: if you're seeing a certain workload that you want to go faster, data reflections can help. You can target an individual workload that's slow with a reflection, and leverage your indexes for everything else that's already really fast using the indexes.

Kelly:

Yeah, I think there are certain types of queries that Elasticsearch performs very, very efficiently on its own, and there are others that it's really not optimal for. And the nice thing is that you can have reflections on certain indexes in Elasticsearch: we'll push down things that we think are really efficient in Elasticsearch, and we'll push things into Dremio's reflections that we know are more efficient running on the reflection. So you have a lot of flexibility there in how you deal with this. And in a way a reflection is kind of like a new index type in Elasticsearch, but it's configured and maintained by Dremio.

Can:

Yep.

Kelly:

So if anyone else has questions, great, we have a couple more minutes here. If nothing pops up right away... oh, here's one. Please, those of you who have a question, please send it in via Q&A. This question is: how does workload management in Dremio 3.0 integrate with the YARN scheduler at [inaudible 00:54:26]?

Can:

So if you think about Dremio's approach in this regard, you basically give Dremio a single YARN queue, and within that YARN queue Dremio runs all of its jobs and allocates [inaudible 00:54:42] within that queue. Since Dremio's an interactive system, we typically require response times of less than one second, so we don't wanna have to spin up containers per job. So what we end up doing is we run the application in one of the YARN queues, and all the other queues you would have within Dremio would be carved out of those resources.

Kelly:

Hopefully that makes sense to you. There is more that I think we could do there potentially. I think it's interesting to see, in the world of Hadoop, how things are moving more in the direction of Kubernetes for orchestration and provisioning, and away from YARN. Of course, not everyone is moving in that direction, so it's just interesting to observe that that seems to be an emerging trend. And there is a question here about Dremio 3.0 and Kubernetes. So, if you look on our download page there is a link to the Docker container image on Docker Hub. And we have tools for Kubernetes; we have Helm charts to let you provision and control your Dremio cluster via Kubernetes using these Docker containers. So Kubernetes is something we're seeing: most of our customers that aren't running Dremio in Hadoop via YARN are orchestrating their Dremio clusters using Kubernetes. And now we have some scripts to help make that easier for you, so I encourage you to check those out.

There's a specific question about whether we support an Operator for Kubernetes. I'm not a Kubernetes expert, so I don't know; we'll get back to you over email on that. But right now you can basically provision, add capacity, and remove capacity from your Dremio cluster through these Helm charts.

So another question here: is it possible to extend the metadata management capabilities by adding additional meta classes, for example policies, data validation, and detailed job descriptions, and associate those with other objects? I'm not sure I understand the question, but there are smarter people in the room than me, so maybe you guys know.

Can:

Yeah, so our current focus is to enable [inaudible 00:57:08] existing objects within Dremio, keeping it to things like descriptions, field descriptions, tags, and maybe [inaudible 00:57:18]. Our initial focus is going to be keeping to the objects that exist in the system today before we extend to additional types of things, and then assigning those to objects, if that makes sense to you.

And just one other thing: ETL job descriptions, that was something you mentioned. We currently don't have plans to import ETL lineage and things like that into Dremio's metastore. Our current focus is to allow you to make all the catalog annotations you would want on top of Dremio's sources and catalog, not on things outside of Dremio's catalog.

Kelly:

Looks like we got a question about versioning of queries and data, and integration with dev-ops tools. So, as some of you may or may not be aware, we do have versioning of the datasets; those are the breadcrumbs you see on the right-hand side of the screen, so you can go back in time and see all the different changes that were made to a virtual dataset. We are, I think [inaudible 00:58:28], we are discussing possible integration of this with maybe something like Git, such that people can do proper versioning using those types of tools. So yeah, that is something that is on our radar.

Can:

I mean, today our customers already solve this using our APIs, because everything is SQL underneath the covers. You can use a combination of spaces for lifecycle management, and the APIs to facilitate moving between environments [inaudible 00:58:54]. But yeah, as you said, it is one of the things we're thinking about: what would be a good integration, would it make sense, and at what point.

Kelly:

Yeah, I think part of it is the queries, but it's also the data model. So when you create spaces and you create virtual datasets, I think it would be really cool, and this is something we're talking about, if there's sort of a production branch of your data model in Dremio, your semantic layer. And then there's a team that's off iterating on the next version of that, maybe testing it. And when they feel like it's ready, you wanna merge that back into the production system, and that should all work the same way something like Git works.

And I think giving you that kind of flexibility to sort of branch off, work on something in parallel, and then merge it back in is something that we think is really important, and something we're looking at for a future release of Dremio.

There's a question here about the ability to calculate median and ntile functions. Is that something we can do today, or something we're thinking about doing in the future?

Can:

So ... we don't have concrete plans yet around supporting approximate ntile — ntiles, which would also give you median. I don't have a timeline for you guys, unfortunately, but that has come up a few times. Not as much as some of the other SQL operators we've been getting requests for, but we'll keep an eye on that one. It's probably in the top five from an operator standpoint, or function standpoint.
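For readers unfamiliar with what's being requested, here is a small client-side sketch of what median and NTILE compute, applied to values you might have fetched from Dremio (the fetching step is assumed; only the math is shown). The bucket assignment follows the usual SQL NTILE semantics: rows are ranked, then split into n roughly equal groups numbered 1..n.

```python
from statistics import median

def ntile(values, n):
    """Assign each value (by rank) to one of n roughly equal buckets, 1..n."""
    ranked = sorted(range(len(values)), key=lambda i: values[i])
    buckets = [0] * len(values)
    # The first `extra` buckets each take one additional row.
    size, extra = divmod(len(values), n)
    pos = 0
    for bucket in range(1, n + 1):
        count = size + (1 if bucket <= extra else 0)
        for _ in range(count):
            buckets[ranked[pos]] = bucket
            pos += 1
    return buckets

vals = [9, 1, 7, 3, 5]
print(median(vals))    # → 5
print(ntile(vals, 2))  # bucket number per input position
```

With 5 values and 2 buckets, the lower three ranks land in bucket 1 and the upper two in bucket 2, so the output here is `[2, 1, 2, 1, 1]` (positions of 9 and 7 get bucket 2).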

Kelly:

And I think, as I mentioned before, for some of these things that people want to see in the product, the UDF capabilities in Gandiva will make it easier for you to maybe experiment with adding some of those yourself. I appreciate that not everyone is in a position to do that, but lots of our users are. And so we wanna make that easier, and then allow you to tap into all the power built into LLVM directly.

Can:

And if you want to make sure we don't forget about something, please do post the things you're asking about in the community as well, just so we remember the context and your use cases. We typically look at that for prioritization purposes.

Kelly:

So a couple of you have asked to see demos of specific capabilities; that's something we can do outside of this webinar. If you send an email to contact at Dremio.com, we're happy to sit down with you over a Zoom or a WebEx or whatever, and show you whatever you'd like to see. We do that all the time. And for those of you who have asked explicitly, we'll follow up with you after the webinar.

So, last question here and then I think we're gonna need to wrap up, or maybe there's time for two more. Can you explain more about the LLVM project and how Dremio differs from Dask in terms of slash slash? I'm not sure what that means exactly, and I'm not an expert in Dask, but what we're talking about here is that part of the Dremio platform is a SQL execution engine. And the way that Dremio takes a SQL expression and turns it into actual operations that get executed on the hardware where Dremio is running today uses Java just-in-time compilation into bytecode. So instead of interpreting the query [inaudible 01:02:35], we compile each query into bytecode dynamically. And the work we're doing in the Gandiva initiative is about taking what we do today to dynamically compile queries into Java bytecode, and instead compiling them into machine code that's optimized for the hardware platform we're running on.

So all of this work is very specific to the internals of Dremio's SQL engine. It's not intended as a general-purpose computing platform the way something like Spark is; this is very specific to Dremio. And I appreciate that Dask and other systems may be using LLVM to do optimized compilation for different hardware platforms. This is similar in that regard, but Dremio itself is not really anything like Dask directly. This is just about how we're optimizing Dremio's performance internally. Hopefully that answers the question.

So thank you, everyone, for attending the webinar. As I said, we will distribute a link to a transcript and recording of the webinar, the video of the webinar itself, as soon as we can, typically in about 24 to 36 hours. If you have additional questions, please feel free to send us emails at contact at Dremio.com, or to post questions on the community site, community dot Dremio.com. We'll look forward to seeing you at webinars in the future. We're about a month out from 3.0 and we'll have a similar sort of session for that. Thanks for your support of Dremio, and we'll see you out in the community. Take care. Bye bye.