
Dremio 3.0 - Technical Deep Dive

Transcript

Lucio:

All right. Welcome, everyone. My name is Lucio Daza. I direct technical marketing here at Dremio and we have a very exciting presentation for you today. We are going to be talking about Dremio 3.0. As you know, we released our latest version on Halloween day and we are very excited to talk to you about all the incredible features that we included in this release. I have Jeff and Can with me today and we are all going to be having a great discussion about all these features.

Before we continue, I want to run a few housekeeping items by you. This presentation has been created for you. We want to take your questions. We want to make sure that you understand everything that we are about to present here. If you have any questions, please do not wait until the end of the presentation. Feel free to use the Q&A panel on the Zoom platform. We will be able to see your questions there and we will try to address them as soon as possible, but then again, we are also going to have a Q&A at the end.

Also, what are we going to be talking about today? First, as I mentioned a few seconds ago, we're going to be talking about Dremio 3.0, but I wanted to remind all our audience about the cadence that we've had so far. As you know, Dremio 1.0 came out to the market a little bit over a year ago and now we are going to show all these great features. We also had one of these deep dives not too long ago, where I got to talk about Dremio 2.1 and we gave you a teaser of some of the features that we're going to be talking about today.

This is our cadence. If you want to see the change log, all the release notes, everything, you can go to Dremio.com and we have our documentation and release notes out there if you want to. Also the previous webinars are recorded and placed there in the resources page. So you're more than welcome to go and check them out.

Before we start, we have a very, very important question to ask you guys. This year, Gnarly went out on Halloween as a vampire. The question that we have for you today is what should we dress Gnarly as for Halloween next year? So feel free to put your responses in the Q&A or the chat window. The options are an octopus, Darth Vader, or a giant pumpkin. So those are your three options. As you can see, and as you probably saw if you are on social media, we released a little vampire this year. I'm starting to get answers in the chat and I see Darth Vader seems to be the most popular so far, so we'll try to please you guys with that. We thought we would break the ice with this. I think that the vampire is pretty cool, so I can't wait to see one as Darth Vader.

Thank you for participating in the first poll question that we had for you today. And now we are going to dive into what we're actually going to be touching on, all the topics that we're going to be talking about today. First, the key features for Dremio 3.0. We have the Integrated Data Catalog, and if you are familiar with Dremio, you'd know that as soon as you connect to a data source, your data is made searchable. Think about Google Docs or Office 365 for your data. We enhanced those capabilities, so now you can not only search, but there is much more that you can do to enhance the value of the data sets that you're working with in Dremio.

Also we've added Multi-Tenant Workload Management. That is a topic that one of our product managers is going to be discussing with you today. As well as something very exciting: Apache Ranger Integration. We have this feature in one of our tutorials and it's super cool. You guys are going to like it.

We are also going to be talking about the Gandiva Initiative for Apache Arrow, as well as a new feature called High Performance Parallel Exports, which is super cool, another security enhancement for End-to-end Encryption, and the enhancements to the connector framework.

Can:

Let me take this one here. So one of the things that we introduced as Lucio segued is improvements to our existing data catalog capabilities. Today, Dremio, when you connect the data sources or when you have data access within Dremio, we actually go ahead and automatically discover that and make it available right in our user interface for either search purposes or so that you can use them to curate it.

And one of the things that we were increasingly hearing from our users was, "Hey, can we have more functionality in terms of being able to explicitly add additional information to our existing catalog, whether that's on a data set or an area in your data lake?" Basically, being able to add a bit more context.

And within this release, we are introducing data set Wikis and tags. Our users can now go ahead and add tags to individual data items for search purposes. For example, you could mark sensitive information with these tags, or use them for a larger project-wise categorization. It gives you the flexibility of having arbitrary categories that are really easy for search to pick up.

The second thing here is the Wikis. They let you add context either at the source level, so you can describe the importance of the source, who the stewards of it are, or any other information, or you could add that to a specific data set if you'd like, or to any arbitrary grouping within Dremio. All of this information that you're adding is also available from the catalog API, so you can use the API to add and remove information programmatically.
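As a rough sketch of what adding tags and a wiki programmatically might look like, the snippet below only builds the HTTP requests and never sends them. The endpoint paths (`/api/v3/catalog/{id}/collaboration/tag` and `.../wiki`), the payload fields, the host name, and the data set id are all assumptions made for illustration; check the REST API documentation for your Dremio version before relying on any of them.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class CatalogRequest:
    """A prepared (unsent) HTTP request against an assumed catalog API."""
    method: str
    url: str
    body: str


def build_tag_request(base_url: str, dataset_id: str, tags: list[str]) -> CatalogRequest:
    # Assumed endpoint: sets the full list of tags on one catalog entity.
    url = f"{base_url}/api/v3/catalog/{dataset_id}/collaboration/tag"
    # De-duplicate and sort so the payload is stable.
    return CatalogRequest("POST", url, json.dumps({"tags": sorted(set(tags))}))


def build_wiki_request(base_url: str, dataset_id: str, text: str) -> CatalogRequest:
    # Assumed endpoint: sets the wiki text on one catalog entity.
    url = f"{base_url}/api/v3/catalog/{dataset_id}/collaboration/wiki"
    return CatalogRequest("POST", url, json.dumps({"text": text}))


# Hypothetical host and data set id, for illustration only.
BASE = "http://dremio.example.com:9047"
tag_req = build_tag_request(BASE, "abc-123", ["pii", "incidents", "pii"])
wiki_req = build_wiki_request(BASE, "abc-123", "# Incidents\nOwned by the data team.")

print(tag_req.method, tag_req.url)
print(tag_req.body)
```

Keeping request construction separate from sending makes it easy to review what would be written to the catalog before any call is made.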

From a data catalog roadmap standpoint, this is just the beginning of some of the things that we are thinking about. You'll see us add more and more as we get more users, such as yourselves, using it and giving us feedback, so we're looking forward to having you guys try this out and let us know what you think in the community. I'm happy to share where we are headed as we hear more from you guys. I believe we also have a few screenshots that we can show you on this one.

For example, what you're seeing in this one is a data set called incidents, which is in Lucio's home space. For this data set, on the lower part of the screen, you have the Wiki associated with it. You have the tags associated with it on the upper right-hand side, and you also get information about what fields are in there. Previously, when you came in to view a data set, it would give you a preview of that data set and that would be your main view.

Now, along with that main data view, you have a catalog view, which is going to be your go-to location for everything catalog and metadata related, and then you have the graph view so that you can understand the lineage and provenance of the data set that you're currently looking at.

If you were to go to the next slide, just to show you guys a bit more detail on how this looks: when you click any of the tags, they open a search pop-up so you can navigate to all the relevant data sets. And in the listing pages, if you were to go to the next slide, all the tags that you added to a data set also appear, so you can easily browse and search visually if you don't want to use the main search capability.

Jeff:

Great. Yeah, so parallel exports, this is an interesting new feature. It's also known as CTAS, using SQL. As all of you know, within Dremio we have this concept of virtual data sets. You can create virtual data sets that build on other data sets and we have a cool user interface that allows you to do this visually or using a SQL console. But sometimes it also makes sense to materialize the virtual data set and put it into a file system, or maybe hand it off to a different system, or use a fully materialized data set as an intermediate step. That might be part of a multi-step process.

So in 3.0, you can actually use the familiar SQL CTAS-based commands to create these parallel exports. In the past, we always had the $scratch directory, which we still have, but it was problematic for a few reasons.

It was hidden. You couldn't lock it down to any particular users or permissions, and it was hard to search for. Customers wanted a way to export these data sets in a fast, highly performant way, in a fully materialized fashion, out to external data sources. So if you scroll to the next slide here.

So here's how you use it. We have support for all file system source types. Here's an example SQL statement at the bottom, where you can create table as, give it the source name, and use that table hierarchy that we normally use today. The cool thing is you can use LOCALSORT and you can use PARTITION, so you get all that benefit, and you also have the ability to drop as well as create. The important thing to note is that the drop actually deletes the data, and it would also remove data in other directories that they have access to.

The way this feature's implemented under the hood is we're using the same engine as our data reflections. Like I said, you have the same options for sorting and partitioning, and we've also added some other key metadata in the Parquet footers that Dremio or other SQL engines can use to minimize what we read from disk, so you get that benefit as well.
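For reference, a CTAS statement of the kind described here might be assembled as in the sketch below. The source, folder, table, and column names are hypothetical, and the exact clause spelling (`HASH PARTITION BY`, `LOCALSORT BY`) is an assumption based on the keywords mentioned in the talk; consult the SQL reference for your version before using it.

```python
def quote_path(parts):
    """Quote each path part so names containing dots or spaces stay intact."""
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def build_ctas(target_path, select_sql, partition_by=(), localsort_by=()):
    """Build a CREATE TABLE ... AS statement for a file system source.

    target_path: path parts, e.g. ("my_s3", "exports", "incidents_2018").
    """
    clauses = [f"CREATE TABLE {quote_path(target_path)}"]
    if partition_by:
        clauses.append("HASH PARTITION BY (" + ", ".join(partition_by) + ")")
    if localsort_by:
        clauses.append("LOCALSORT BY (" + ", ".join(localsort_by) + ")")
    clauses.append("AS " + select_sql.strip().rstrip(";"))
    return "\n".join(clauses)


def build_drop(target_path):
    # As noted in the talk, dropping the table deletes the underlying data.
    return f"DROP TABLE {quote_path(target_path)}"


# Hypothetical source, folder, and column names.
ctas = build_ctas(
    ("my_s3", "exports", "incidents_2018"),
    'SELECT * FROM "@lucio".incidents;',
    partition_by=("category",),
    localsort_by=("incident_date",),
)
print(ctas)
print(build_drop(("my_s3", "exports", "incidents_2018")))
```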

Lucio:

Quick question, Jeff. This is great. If you are creating one of these files using CTAS, can you join from other data sources or are you limited to the one that you are working with at the moment?

Jeff:

You can write into any of these file system source types. So you can read a JSON file from ADLS or from some other source and then put it into S3. As long as you have access to those file system sources, it should work just fine.

Lucio:

Can, I believe the time is yours. Let's go ahead and now talk about Multi-tenant Workload Management.

Can:

Sounds good. This was one of the most anticipated capabilities that we are introducing, for our larger customers, actually. If you could go to our next slide. Basically, within Dremio, could we go to the next slide, please?

With the addition of these workload management capabilities, our main target is to provide an easy and convenient way to let our users and cluster administrators isolate different workloads and prioritize different groups of users and different types of workloads as they desire, without having to do complex configuration or jump through hoops.

With this release, we're adding the ability to define custom job queues. You can think of them as resource pools that have settings for how much CPU priority you're going to get, that is, how much CPU time you are getting relative to others, and some limits on memory as well as concurrency.
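To make the idea of a job queue as a resource pool more concrete, here is a minimal sketch. The field names and the run/wait/reject admission behavior are hypothetical and only illustrate the three kinds of settings mentioned (CPU priority, memory limit, concurrency limit); they are not Dremio's actual configuration or scheduler.

```python
from dataclasses import dataclass


@dataclass
class JobQueue:
    name: str
    cpu_priority: int      # relative share of CPU time (higher means more)
    max_concurrency: int   # number of jobs allowed to run at once
    max_memory_mb: int     # per-job memory ceiling (assumed semantics)
    running: int = 0

    def try_admit(self, estimated_memory_mb: int) -> str:
        """Return 'run', 'wait', or 'reject' for an incoming job."""
        if estimated_memory_mb > self.max_memory_mb:
            return "reject"   # the job could never fit in this queue
        if self.running >= self.max_concurrency:
            return "wait"     # queue is full; the job waits for a free slot
        self.running += 1
        return "run"

    def finish(self) -> None:
        """Release the slot held by a completed job."""
        self.running = max(0, self.running - 1)


reporting = JobQueue("reporting", cpu_priority=2, max_concurrency=2, max_memory_mb=4096)
decisions = [reporting.try_admit(mb) for mb in (1024, 1024, 1024, 8192)]
print(decisions)
```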

Once you define your job queues, then the issue becomes about assigning different types of jobs, routing different types of jobs to these queues. So if you could go to the next slide, and for that, Dremio actually offers a rule based queue assignment mechanism. So once we have defined our resource pools, aka job queues, I can then come and define a set of rules that are ordered, you see an example here with six rules, for example, that can be based on user, what the user is running, the group membership, are they a member of, for example, super admin group? The type of the job, whether that's a UI preview, whether that's an ODBC query, whether that's a reflection maintenance job, and we have some other categories, as well.

And then the estimated plan cost for that query. For example, if you wanted to say something like, and then I'm actually going to explain the example on the lower right hand side of the screen here. I can say something like if the user is X or Y or they're a member of this other group, the query is coming from an ODBC client. It's somewhat expensive and it's between hours 9:00 a.m. and 6:00 p.m., put it in this queue, otherwise reject that query.

So not only can you, for example, say I want to place this in this queue, but you can also come in and say I don't want this workload to be run at all. So you can reject that workload and provide your users with a custom error message notifying them of specifically what happened or otherwise go with a generic message. And when you combine our resource pools with the rule-based assignment mechanism, you get a quite powerful mechanism for managing capacity.
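The ordered, first-match rule evaluation described above can be sketched roughly as follows. The rule conditions mirror the example from the slide (user or group, ODBC client, plan cost, time of day), but the queue names, the cost threshold, the job type labels, and the data model are hypothetical; this is an illustration of the idea, not Dremio's rule syntax.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Query:
    user: str
    groups: frozenset
    job_type: str      # hypothetical labels: "ODBC", "UI_PREVIEW", "REFLECTION"
    plan_cost: float   # estimated plan cost
    hour: int          # hour of day, 0-23


@dataclass
class Rule:
    name: str
    matches: Callable[[Query], bool]
    queue: Optional[str] = None   # target queue, or None to reject
    reject_message: str = "Query rejected by workload rules."


def route(query, rules):
    """Return ("queue", name) or ("reject", message) from the first matching rule."""
    for rule in rules:
        if rule.matches(query):
            if rule.queue is None:
                return ("reject", rule.reject_message)
            return ("queue", rule.queue)
    return ("reject", "No rule matched this query.")


RULES = [
    Rule("admins", lambda q: "super_admin" in q.groups, queue="high_priority"),
    Rule("reflections", lambda q: q.job_type == "REFLECTION", queue="background"),
    Rule(
        "daytime expensive ODBC",
        lambda q: (q.user in {"x", "y"} or "analysts" in q.groups)
        and q.job_type == "ODBC"
        and q.plan_cost > 1_000_000
        and 9 <= q.hour < 18,
        queue="heavy_odbc",
    ),
    # Catch-all: anything not matched above is rejected with a custom message.
    Rule("everything else", lambda q: True,
         reject_message="This workload is not allowed at this time."),
]

daytime = route(Query("x", frozenset(), "ODBC", 5_000_000, 10), RULES)
evening = route(Query("x", frozenset(), "ODBC", 5_000_000, 20), RULES)
print(daytime)
print(evening)
```

Because the rules are evaluated in order, placing the catch-all reject rule last gives the "otherwise reject that query" behavior from the example.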

One thing to note here is that the Gnarly with the suit, on the upper left-hand side, denotes that this is an Enterprise Edition only feature. In our Community Edition, we already have static job queues in the product today. You already have the capability of limiting concurrency and memory, but not the rule-based mechanism nor the custom job queues. So with this release, we are improving greatly on this, and as you can see, this is the framework that we'll be building on top of. For example, you'll see us add more rule conditions. You'll be able to say something like, if this plan has a Cartesian join, don't admit it. Or you could combine conditions, such as this being a Cartesian join and this user not being an ops person. The idea is to leverage this framework that we built for you guys. With that, I think up next is Apache Ranger. Jeff, do you want to take it away?

Jeff:

This one is enterprise only as well. With the 3.0 release, we've added an optional integration with Apache Ranger to simplify administration across the Hadoop ecosystem, specifically for table-level access enforcement. What we're doing here is we added an additional Hive authorization client called Ranger Based Authorization. You'll see this in the sources screen for Hive when you set it up or when you go to configure it. Basically, by checking that Ranger option and putting in a few settings, like the name of the Ranger host and whatnot, you allow Dremio to confer with the Ranger plug-in and check the Ranger policy permissions for the end user that's logged into Dremio. From there, we're able to allow or disallow access to a particular table or database as per that Hive policy. So that just works under the covers automatically, based on how you configure things.

Of course, these policies are held against users and groups, so it's still the same users and groups: what Hive has access to, Dremio does too, so you don't have to worry about managing that aspect.

The other thing, too, from a monitoring and auditing perspective, is that we do have the integration with Ranger auditing. Lucio mentioned this before, but there's a tutorial that goes into a lot of detail, with an example showing the policies in action. You can also see in the auditing screen that Ranger will log whether the user had access or not, so you've got that capability.

Like I mentioned before, this is for table-level access permissions. For something more granular than that, we've got our own very rich set of SQL-based commands that allow you to add an additional layer of security across all sources, not just HDFS. So that's an important distinction there. Next slide?

So this basically is a diagram depicting the architecture in a nutshell. You've got your Dremio cluster on the right and your Hive environment on the left. There's a Ranger server running there, where the policies are administered and the authentication is done, and then there's a Ranger plug-in that's running on the coordinator node, and this is what we confer with. The Ranger plug-in actually talks to the Ranger server and allows or disallows whether a particular user has access to the HDFS resource in question. So just kind of a high-level view there.
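A table-level check of the kind described here, including the audit trail, can be sketched as below. This is a deliberately simplified, allow-only, deny-by-default model with hypothetical class and field names; it is not the Ranger plug-in API, and real Ranger policies are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TablePolicy:
    database: str
    table: str                       # "*" matches every table in the database
    users: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass
class PolicyChecker:
    policies: list
    audit_log: list = field(default_factory=list)

    def is_allowed(self, user, user_groups, database, table):
        """Allow access only if some policy grants it to the user or one of their groups."""
        allowed = any(
            p.database == database
            and p.table in ("*", table)
            and (user in p.users or bool(p.groups & set(user_groups)))
            for p in self.policies
        )
        # Record every decision, allowed or denied, for later auditing.
        self.audit_log.append((user, f"{database}.{table}", "ALLOWED" if allowed else "DENIED"))
        return allowed


# Hypothetical users, groups, and tables.
checker = PolicyChecker([
    TablePolicy("sales", "orders", users=frozenset({"alice"})),
    TablePolicy("sales", "*", groups=frozenset({"finance"})),
])

print(checker.is_allowed("alice", [], "sales", "orders"))
print(checker.is_allowed("bob", ["finance"], "sales", "customers"))
print(checker.is_allowed("carol", ["marketing"], "sales", "orders"))
print(checker.audit_log[-1])
```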

So pretty exciting stuff with Ranger, and certainly something that is going to help out our enterprise customers. Along the same front, as long as we're talking about security, we also announced some additional enhancements to our already full security profile.

We've added TLS for ODBC/JDBC clients, to secure the connections between those clients and Dremio. We also had a lot of requests from customers for intra-cluster encryption, from customers that have high-security environments where they want encryption between Dremio nodes just to ensure that there's better security there. I should also mention, if you joined our 2.1 webinar about a month or so ago, that we did add TLS for Oracle data sources, and we've got other ones coming down the pipe, so stay tuned for more security enhancements as we get there.

Okay, next up, now this is really cool. This is probably something that you won't necessarily see from an end user perspective, but you might perceive it. By that I mean there are going to be much better pushdown capabilities and an improved level of quality, and you'll see faster turnaround on new data sources. All this has to do with what we call our ARP framework, or our Advanced Relational Pushdown framework.

In previous versions of Dremio, we developed each connector on an independent code path, so now we have this new declarative framework for developing relational connectors going forward. What this does is it allows us to really standardize on a more efficient code base and, like I said, provide better pushdowns. It'll be much easier for us to maintain, and all that just makes for better quality. So we're pretty excited about that, and the two new ones coming out for you that leverage this new framework are PostgreSQL and Microsoft SQL Server.
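To give a feel for what "declarative" can mean for a relational connector, here is a small sketch in which a source's capabilities are plain data and one generic function decides whether an expression can be pushed down. The dictionary layout, function names, and templates are hypothetical and are not the ARP file format; the point is only that supporting a new source becomes a matter of describing it rather than writing a new code path.

```python
# Declarative description of what a hypothetical relational source can run natively.
POSTGRES_LIKE = {
    "name": "postgres_like",
    "functions": {
        "upper": "UPPER({0})",
        "concat": "({0} || {1})",
        "add": "({0} + {1})",
    },
}


def render(expr, dialect):
    """Translate an expression tree into source SQL, or return None if any part is unsupported.

    expr is a column name (str), an integer literal, or a tuple:
    (function_name, arg1, arg2, ...).
    """
    if isinstance(expr, str):
        return '"' + expr + '"'
    if isinstance(expr, int):
        return str(expr)
    func, *args = expr
    template = dialect["functions"].get(func)
    if template is None:
        return None  # not declared for this source: evaluate it in the engine instead
    rendered = [render(arg, dialect) for arg in args]
    if any(part is None for part in rendered):
        return None
    return template.format(*rendered)


pushed = render(("upper", ("concat", "first_name", "last_name")), POSTGRES_LIKE)
not_pushed = render(("soundex", "last_name"), POSTGRES_LIKE)
print(pushed)
print(not_pushed)
```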

And in addition to that, we have the Teradata Connector, which is currently in beta now and that is enterprise only, as well. If there's some folks out there that are interested in that, please connect with your local sales rep to get access to that and then like I said, we'll have new connectors, including some non-relational ones delivered in future releases, as well, so stay tuned to this channel and we'll be sharing those with you in the future.

Lucio:

It's time for another icebreaker. Before we continue, we have another question for you, and this one is actually just a binary question. Is a narwhal's tusk actually a giant tooth? I found this picture on the internet and I thought it was very funny. He's very frustrated that people keep calling him a unicorn, and if you have been to one of our events, either Tableau or Strata, you have seen us wearing the narwhal hat and a lot of people come asking what kind of unicorn is that? But no, it is not a unicorn. It is true, and yes, for those of you who said true, you are right, it is a giant tooth. And they actually use it to break ice when they face big masses of ice when they are swimming. I believe that's what the purpose of it is, and I think I learned that from my four year old, which is a little embarrassing.

But before we continue, we have a couple of questions that I think we can go ahead and address. Before we go further along, so we have one question from the audience and they are asking us if the Apache Ranger integration works with Cloudera CDH hosted data sets. Jeff, do you know? Or is it something that is part of the roadmap?

Can:

So the current integration supports Ranger only, because that's basically what most of the customer base uses and what we hear from the community. We will entertain any requests from a Sentry standpoint, but we haven't been seeing it as widely used just yet. It's definitely on our radar.

Lucio:

Excellent, thank you. All right. So I think we can go ahead and move on into the next section. I believe we have the Gandiva Initiative.

Jeff:

Yeah, so this is a pretty exciting initiative for us and this is one we've been talking about for a while. Obviously performance is a big key for us in every release, and earlier this year we announced the Gandiva Initiative for Apache Arrow, which builds on LLVM JIT compilation to make operating on Arrow buffers as efficient as possible. So we're making full use of the underlying CPU architecture, which provides faster, lower cost operations for workloads.
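For readers unfamiliar with what "operating on Arrow buffers" involves, the sketch below evaluates `a + b` over two columns laid out in the Arrow style: a contiguous value buffer plus a validity bitmap in which a set bit means "not null". It is plain interpreted Python using only the standard library, not Gandiva and not LLVM; a JIT compiler would generate a tight machine-code loop for this same kind of expression. The function names are made up for the example.

```python
from array import array


def get_bit(bitmap, i):
    # Arrow validity bitmaps use least-significant-bit numbering; 1 means "not null".
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def set_bit(bitmap, i):
    bitmap[i // 8] |= 1 << (i % 8)


def add_columns(a_vals, a_valid, b_vals, b_valid, length):
    """Add two int64 columns; the result is null wherever either input is null."""
    out_vals = array("q", [0] * length)
    out_valid = bytearray((length + 7) // 8)
    for i in range(length):
        if get_bit(a_valid, i) and get_bit(b_valid, i):
            out_vals[i] = a_vals[i] + b_vals[i]
            set_bit(out_valid, i)
    return out_vals, out_valid


a_vals, a_valid = array("q", [1, 2, 3, 4]), bytearray([0b1111])
b_vals, b_valid = array("q", [10, 20, 30, 40]), bytearray([0b1011])  # index 2 is null

vals, valid = add_columns(a_vals, a_valid, b_vals, b_valid, 4)
result = [vals[i] if get_bit(valid, i) else None for i in range(4)]
print(result)  # [11, 22, None, 44]
```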

Some of the earlier results we've seen have been really quite staggering. We worked with one early tester on a really complex query, where they saw over a 70X, I'm not talking 70%, but 70 times better performance. So clearly, there's a lot of excitement around this. The SDK's available. It's been available for awhile. This feature is disabled by default, for 3.0, but it can be enabled as part of our customer preview and I think on the next slide, we've got a link preview of Dremio.com. You can send email to receive a support key to enable it.

There's still work that we need to do here to provide 100% coverage. Not all the functions are covered. But over time, more and more will move over to Gandiva and there's ways to tell what part of your query are utilizing Gandiva or not.

A couple of other things to note: it's not currently supported on Windows or Mac, but all these things are coming, so please let us know if this is something you're interested in and send us an email there. The other thing, too, is that we put out a tutorial, or a blog, that describes how to create your own functions for Apache Arrow so that you can build your own. We're seeing well over a million downloads for Arrow each month, so it's really quite popular. The work can be far reaching, not just for Dremio, so any help you can give there would be great.

Lucio:

Excellent. Thank you so much.

Can:

I've got a question for you. In the preview, for users using this, for what types of operations should they expect a big improvement in performance, versus some improvement, versus give or take the same? Is Gandiva targeted to specific operations, or how should we set the expectations there?

Jeff:

I think definitely projections and some of the simpler data types. We have the list of actual supported functions and operators in our documentation if you want to know for sure. The other way to tell is through our query planner. We can send instructions on how you can look at the query planner to see what's being handled by Gandiva and what's not. So there are a number of ways that you can find that out.