
Dremio 4.0 – Technical Deep Dive

Transcript

Lucio Daza:

All right everyone, thank you so much again for being here with us, and thank you for your time. We are very excited to have you here for this presentation. Today, we're going to talk about Dremio 4.0. My name is Lucio Daza, I direct technical marketing here at Dremio, and presenting with me today will be the one and only Tom Fry, our director of product management. So we have a very exciting presentation prepared for you today.

Before we start, there are a couple of things that I want to run by you. First, I want to confess that I don't know how to use these slides - I'm kidding. Okay, now I have them working. The first thing that I want to run by the audience is how you are going to ask questions. This is your time to understand Dremio, especially this new release, and we know that you are going to have some questions for us. We have a Q&A prepared for the end; however, do not wait until the end to ask any questions that you may have. If you want to ask a question, just go ahead and click that little Q&A button that you will see in your Zoom interface. This is going to open a little white window; post your question there and we will capture it and then talk about it at the end, or during the webinar if we can.

If for any reason you need to chat with the rest of the audience, feel free to use the chat button, or if there is any issue with the audio, or something's going on, just go ahead and raise your hand. All right, so, now we have that out of the way. Also, if by any chance you need to leave the webinar at some point, this is going to be recorded. We're going to go ahead and post it later on our site, as we do with the rest of our online resources. All right?

So today, we are talking about Dremio 4.0. It almost seems like it was yesterday when we did our first release. As you can tell, we have a very frequent cadence of releases. As you may know, similar products only release once a quarter, and when they do make releases, they include only a few features, or they save new features for major releases. Every one of our new releases includes new features, and today we are going to be covering many of them and more. So stay tuned!

So now, let's go ahead and do a quick review of what we are talking about in this webinar today. We're going to talk about Columnar Cloud Cache, also known as C3. In addition to that, we'll be talking about Multi-Cluster Isolation, Reflections on Cloud Data Lake Storage, Inbound Impersonation, a bunch of our security features, and multiple Hive metastores. Now you can also go ahead and copy query results and bring them into any of the BI tools that you're working with, and of course we have some other UI changes that we applied, and security, as I mentioned.

So let's go ahead and get it started, and Tom, can you hear me okay?

Tom Fry:

Yeah, we can hear you great.

Lucio Daza:


Awesome, thank you. So one of the things that I want to start the conversation with is, as our audience knows, there are several technologies that contribute to Dremio's query speed. We have talked in the past about Apache Arrow. We have also talked about Arrow Flight, which is one of the latest things that we added. In addition to that, we also have predictive pipelining. So it looks like now we have an extra contributing factor to this whole speed story, and it's the Columnar Cloud Cache.

So tell us, Tom, is this now making things even faster? What does this mean for our users? What can you tell us about this new feature?

Tom Fry:

Sure, thanks a lot for that, Lucio. As you mentioned, we've had a series of major performance-oriented features throughout this year, with essentially every single release this year. And Columnar Cloud Cache is a major performance-oriented feature that we have in 4.0, building on predictive pipelining and the other technologies you just mentioned that we rolled out earlier this year.

So what is Columnar Cloud Cache, and what is the purpose of it? One of the great things about cloud storage systems, such as AWS S3 and Azure Data Lake Storage, is that they offer new opportunities to store data because they're highly scalable, very inexpensive, and very simple to use. And with cloud storage systems, users can provision any amount of storage, on demand, while at the same time reducing storage costs. And this is a very powerful combination. Additionally, cloud storage enables separation of storage and compute, and this lets companies save costs by only paying for compute resources when data needs to be processed, and this is not possible with traditional enterprise data warehouses, for example. So as a result, we see many organizations and many of our customers not only increasing their use of cloud data lake storage, but in many cases, S3 and ADLS are becoming the central data platform across their organization.

However, as great as this is, the scalability and separation of storage introduce new performance challenges, such as increased latency and more variable response times, all of which impact analytical workload performance. And so, as a result, many companies find that it's often necessary to pre-extract data out of data lake storage and into some other data warehouse or mart in order to meet their performance SLAs. And this can obviously introduce new complexities and costs, and reduce the benefits.

So in Dremio 4.0, we are very excited to introduce a new technology stack we're calling "Columnar Cloud Cache," or "C3" for short, which is a new technology that helps us combine the near-infinite scalability and low cost of S3 and ADLS with the ultra-high performance of local NVMe SSDs. And essentially, from a user's perspective, we help convert cloud data lake storage into a very fast, infinitely sized, local NVMe SSD resource. C3 leverages local high-performance disks to dramatically increase throughput while simultaneously reducing network traffic from cloud storage. With C3, users can experience the large scale of cloud storage with no impact on performance. There's actually a lot more to this technology. Columnar Cloud Cache has a lot of intelligence developed and built into it that identifies and stores the most valuable pieces of data, from a user workload perspective, directly on Dremio execution nodes, which helps to keep the most valuable data close to compute in order to accelerate SQL processing.

This is very unlike traditional caches, which simply store data based on past access frequency. Instead, C3 leverages Dremio's knowledge of SQL execution and prioritization information from workload management to intelligently select the most valuable data to cache locally. For example, we can look at the prioritization of a user from a workload management perspective, and also look at the impact of different data blocks on SQL processing, in order to determine which data to cache and keep within the system.
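As a rough illustration only (the class and function names here are hypothetical, not Dremio's actual implementation), cost-based cache selection of this kind can be sketched as a score combining past access frequency, workload priority, and the cost of re-fetching a block from cloud storage:

```python
# Illustrative sketch of priority-aware cache selection; not Dremio's code.
# Each cached block is scored by how often it is read, how important the
# workloads reading it are, and how costly it is to re-fetch from cloud storage.
from dataclasses import dataclass

@dataclass
class Block:
    block_id: str
    reads: int = 0                  # past access frequency
    workload_priority: float = 1.0  # from workload management (e.g. HIGH=2.0, LOW=0.5)
    fetch_cost: float = 1.0         # relative cost to re-read from cloud storage

    def score(self) -> float:
        return self.reads * self.workload_priority * self.fetch_cost

def evict_candidates(blocks, keep_n):
    """Keep the keep_n highest-scoring blocks; return the rest for eviction."""
    ranked = sorted(blocks, key=lambda b: b.score(), reverse=True)
    return ranked[keep_n:]

blocks = [
    Block("footer", reads=50, workload_priority=2.0, fetch_cost=0.2),   # score 20.0
    Block("hot_col", reads=30, workload_priority=1.0, fetch_cost=1.0),  # score 30.0
    Block("cold_col", reads=2, workload_priority=0.5, fetch_cost=1.0),  # score 1.0
]
evicted = evict_candidates(blocks, keep_n=2)  # only the lowest-scoring block
```

The point of the sketch is simply that eviction is driven by a workload-aware score rather than raw recency or frequency alone.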

C3 also leverages our knowledge of advanced columnar file formats when selecting which data to cache. For example, for Parquet files, the footers are very heavily accessed, so we will much more aggressively pin that type of information into the cache. And here's something else that's quite a bit different from traditional caches: because we developed C3 to utilize high-capacity NVMe SSDs for storage, instead of traditional memory, we're really able to build cloud-sized caches for cloud-sized datasets, which is very different from traditional systems that utilize memory for caching and are limited in terms of the footprint and the datasets that they can accelerate.

One of the great things about this technology is it's fully automatic, with zero administration/user involvement required. All that needs to be done is simply configure the local disk resources that are available for caching and Dremio fully manages everything else, with zero user involvement or administrative activities.

Best of all, we've seen some pretty dramatic performance improvements for users on cloud storage with actually no changes involved. With C3, we've seen query response times improve anywhere from 2 to 10 times, while at the same time you're able to see cost savings by reducing the amount of data that needs to be read from storage and sent over the network. This is very important to the cost structure in cloud environments, for example. So we see this as a major enabler for users to utilize cloud storage, whether they run in the cloud or run in their own data centers, remotely from the cloud, and simply source data from S3, for example. And we're very excited to introduce this capability in this release.

Lucio Daza:


This is excellent, and thank you for that explanation Tom. Now, the question that everyone wants to know; does it only apply to cloud data sources or can we use it with something else?

Tom Fry:

So that's a great question. We have also enabled this in 4.0 for S3-compatible storage, a concept we introduced in the previous 3.3 release, and we talked about MinIO, for example, as a source that we have multiple customers using today. This also works for all the S3-compatible storage systems as well. So on-prem customers who are using MinIO, for example, are able to utilize C3 on those systems as well. We're also looking at expanding the different data systems for C3, to work with more and more systems in future releases. In fact, this month we're actually going to add, in a minor patch update, C3 for HDFS systems as well, so that if you have a Hadoop HDFS file store, Dremio can cache that data too.

Lucio Daza:

Excellent. Another question: one of the things that we talked about in previous webinars is that, as our audience may know, we are open and flexible, meaning that we have two different flavors, the community edition and the enterprise edition. In previous versions, there were certain features where you had to get in touch with us to enable them within the product, so is this Columnar Cloud Cache one of the enterprise edition's features, or is it available for the community as well?

Tom Fry:

So this is available for the community as well. We see this as a great enabler in terms of performance acceleration across many different workloads and we wanted to make sure that we can put this in the community's hands.

Lucio Daza:

Excellent. So now let's go ahead and talk about multi-cluster isolation, and I want to start this part of the conversation with a quick note: I had the opportunity to attend the Strata event in New York last week with the rest of our amazing team, and one of the things that people seem to be concerned about is concurrency: making sure that one workload is not going to bottleneck everybody else's progress, and so on. Is this multi-cluster isolation something that is going to help us with that?

Tom Fry:

That's exactly right. Multi-cluster isolation is a great technology, both to help with concurrency and to help with isolating different workloads. One of the great things that we've seen is more and more companies using Dremio for many different departments and many different use cases within their organization. And as great as that is, one of the things we've seen is organizations can, at times, struggle with how to onboard new workloads that are currently in development onto Dremio without disrupting existing workloads that they may have already established and running in production.


For example, a common pattern might be: an organization would deploy a major use case in production and then tune the workload and the resources for some target SLAs, and they'd want to maintain those SLAs on a consistent basis. But at the same time, they want to develop and onboard new use cases as well, and they need to ensure that this new work, while it is in development, will not disrupt work that is already in production. However, this can often be difficult because new workloads can impact existing production work by consuming an excessive amount of resources. Maybe while something is in development it might scan multiple petabytes of data, maybe filters were set up incorrectly, and a new workload in development can consume a lot of resources and impact existing work in production.

So, to solve this, users would often deploy multiple separate instances of Dremio for different use cases or departments, so that they would not impact established workloads already in production, or so they could isolate different workloads across different groups. And we've seen some customers with actually several dozen Dremio instances or clusters in their environments.

However, this itself can create new administrative challenges. For example, each instance would have its own metadata catalog in Dremio, and you have to keep those in sync, etc.

So in Dremio 4.0, one of the ways that we're solving this is by enabling users to create multiple, completely isolated clusters of resources within a single Dremio instance. Multi-cluster isolation integrates with our workload management queues to allow administrators to route different workloads to separate, fully-isolated clusters of execution resources at a node level. And since execution resources are fully isolated from each other, one workload will never impact another workload.

For example, with multi-cluster isolation, it's possible to set up an existing production use case and maintain stable SLAs for an existing process, while at the same time developing new workloads within the same Dremio instance, sharing the same Dremio catalog. And even if that new workload in development consumes an excess amount of resources, because you can route it to a completely different set of execution resources, it will not impact your production work. And since each of these sets of resources also operates under the same Dremio instance, they all share the same unified catalog, which really helps organizations maintain a common data model and semantic layer for the entire organization, while still being able to isolate different workloads at the same time.

So this is a very powerful set of capabilities, we have several large customers who will be able to make use of it right away after the release, to simplify their environments and expand their usage of Dremio. So we're really excited to get this out there, and we think it's a great enabler for users.

Lucio Daza:


This is awesome. So, back in 3.1, we talked about multi-tenant workload controls. Is this something that builds on top of that, or is this going to enhance that feature that we included back in 3.1?

Tom Fry:

So that's a great question. This expands upon the workload management capabilities that we introduced in that earlier release. Workload management is a way to assign different resources or operations to different queues, so you can manage concurrency and queuing differently. It also sets prioritization for different workloads: some workloads might be high priority or low priority. But even though you might be able to set different priorities of workloads, even if something's a low-priority workload, for example, it still consumes some number of resources, and that can impact other work on the system.

With this, we leverage the queue concept within workload management, where now you can make workload management queues completely isolated from each other, because they're being routed to separate execution resources. And it's very simple to set up. Essentially, whether you're using YARN or Kubernetes, there are simple methods to tag different execution nodes, and each tag defines essentially a cluster of resources. So it's very simple to say these 10 nodes are in one group, these 10 nodes are in another, and then you can assign workload management queues to route to those different groups of resources. It's integrated into the workload management stack, it's fairly easy and straightforward to set up, and it's really an enhancement to the existing workload management rules and capabilities: not just setting prioritization, but also being able to set full isolation across queues.
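To make the idea concrete, here is a minimal sketch of tag-based routing, with hypothetical node names, tags, and queue names (this is not Dremio's configuration syntax, just an illustration of how tagged executor groups keep queues fully isolated):

```python
# Illustrative only: map executors to tags, and workload-management queues
# to a tag, so each queue's queries run on a disjoint set of nodes.
executor_tags = {
    "node01": "production", "node02": "production",  # SLA-bound workloads
    "node03": "dev", "node04": "dev",                # workloads in development
}

queue_to_tag = {
    "prod-dashboards": "production",
    "dev-exploration": "dev",
}

def nodes_for_queue(queue: str) -> set:
    """Resolve the isolated set of executors a queue's queries run on."""
    tag = queue_to_tag[queue]
    return {node for node, t in executor_tags.items() if t == tag}

prod_nodes = nodes_for_queue("prod-dashboards")
dev_nodes = nodes_for_queue("dev-exploration")
```

Because the two node sets never overlap, a runaway query routed to the dev queue cannot consume resources belonging to the production group.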

The other thing that this enables you to do is horizontal scaling with Dremio. For example, you can scale out Dremio to hundreds of nodes, and we have several customers running Dremio at well over a hundred nodes at scale, for very large-scale SQL processing. There are also other use cases; think about small SQL look-up type operations: they might not need a hundred nodes, but you might want to run thousands of them. This is another way to horizontally scale Dremio, by creating multiple separate clusters to have more options. So this enables not just isolation, but also some pretty dramatic concurrency improvements as well.

Lucio Daza:

Excellent, excellent. Thank you so much. So, as our audience knows, Dremio is the data lake engine, and one of the cool things that we like to talk about is that we understand the value of working directly on the data lake that you may have, and Dremio also allows you to join multiple data sources and so on.

So one of the things that we have today is "Reflections on Cloud Data Lake Storage," and Tom, walk us through this and let me and the audience know what this is all about, what this awesome feature is, and how it is going to help us in our use cases.

Tom Fry:

That's a great question, thanks for asking. So, just backing up a little bit, reflections are a key optimization within Dremio. With reflections, users are able to easily, with a single click of a button, identify data sets that they would like to optimize, and Dremio will automatically both pre-extract data from remote systems and precompute commonly performed operations so that they're immediately available to accelerate workloads.

Reflections provide users a significant increase in performance, while at the same time administrators see a reduced load on external systems by offloading operations from a data source into Dremio. This can provide extremely fast response times on precomputed results. Because of their usefulness, many users make significant use of reflections and define a very large number of reflections within their configurations.

For example, we have some users using thousands of reflections within Dremio. Now, as great as this is, users will often look at how they can best store reflections, both cost-efficiently and at scale, and more importantly, how to let users create the reflections they need without necessarily having to go to IT for resources. In Dremio 4.0, reflections can now be stored in cloud data lake storage, in S3, ADLS, and S3-compatible storage, without any performance degradation.

Because reflections are a form of acceleration in Dremio, we want reflections to be close to the SQL execution engine, with high throughput, for processing. By using cloud storage to store reflections, users can benefit from the scalability and on-demand provisioning of cloud data lake storage and use that to create any number of reflections. And because we're coupling this with our C3 Columnar Cloud Cache technology, users are able to store reflections in cloud storage for large scalability and low storage cost, while at the same time Dremio is able to identify the reflections that are most commonly used and aggressively keep those within the execution engine for high performance.

This really marries the large scalability of cloud storage with the concept of reflections as an accelerator, and it enables users to cost-efficiently accelerate any workload on demand, without really any of the traditional storage-space provisioning that might have been required before.

Lucio Daza:

Excellent. And by the way, thank you everyone for posting the questions that we're receiving; we have a ton of them. One of the questions that I'm reading here is: will reflections behave the same way regardless of where they're stored? We're talking about cloud data lake storage in this case. Do reflections have the same behavior, or are they going to have any limitations depending on where they are? What's happening here?

Tom Fry:

That's a great question. We touched on this a little bit when we were talking about Columnar Cloud Cache. One of the great things about cloud storage is the scalability and on-demand provisioning of these systems, but that comes at the cost of some performance characteristics: cloud storage can have slightly higher latency and more variability in terms of response times, and obviously, if you're using it to store accelerated data, that can impact SQL response times.

One of the things that we did is heavily integrate this option with our cloud caching technology, where Dremio will aggressively determine which reflections are used most often and which portions of reflections are really needed, and make sure that we leverage the C3 technology to offer reflections on cloud storage with no performance impact. We've done a lot of performance analysis on what happens when we store reflections in cloud storage with this option, and we really see no performance impact at all from storing reflections externally. In fact, the C3 technology was a major enabler to be able to add this as an option.

We'll aggressively look at which reflections to keep in the cache. A common pattern that we see, for example, is users creating multiple reflections of the same data set, where which reflection is used actually changes over time: they might make a reflection that accelerates a data set, and then decide to change the dimensions or measures, or make a new reflection that might be more of a super-set of a previous reflection. Now new workloads are actually hitting and using the new reflection, and the old reflection is still there, but administrators haven't decided to age it out or delete it from the system. What we often see is that there is a subset of reflections being used by a workload, and that subset can actually change over time, and by tying this in with our Columnar Cloud Cache technology, we'll dynamically, behind the scenes, constantly be updating the reflections that we keep in the system, based upon real-time workloads.

Lucio Daza:

Nice. Very nice. One question that just came in, and I want to echo this question to the rest of the audience, it's very interesting: "do we have any documentation or best practices related to these reflections?" There is a very, very good course in our University; if you go to university.dremio.com, you will find the catalog of courses that we have, and one of them is on data reflections. It takes a few hours to go through the content, but you can actually launch a Dremio Enterprise edition lab in there and follow along with the exercises and learn everything there is to learn about data reflections. So, go ahead and give it a try. It's free to enroll and free to try. Super interesting.

Tom, thank you so much for that clarification, and before we move on, there is another question that we have about reflections. Maybe you mentioned this or not, but in many use cases, and we have seen this happening a lot, you have storage on-prem and storage in the cloud, and you're using Dremio to join those two data sources. So can users go ahead and select whether certain reflections are going to be stored on-prem, in the cloud, or both?

Tom Fry:

Yeah. Reflection storage, and where to store reflections, is a configurable option within Dremio. It's something that we call "distributed storage." You mentioned some tutorials; we also have a lot of documentation up on our website about how to configure distributed storage. That's used to specify where you store reflections, where you store user uploads (for example, user-uploaded data sets), and where you store create-table-style data sets as well. That's all configurable. Before, the options were HDFS, NAS-type devices, etc. This essentially follows the same paradigm, and S3, ADLS, and S3-compatible storage are now options for reflection storage as well.
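As a sketch of what this can look like (the bucket name and path here are hypothetical, and you should consult the Dremio distributed storage documentation for the exact syntax for your version and source type), distributed storage is configured in dremio.conf along these lines:

```
paths: {
  # Hypothetical bucket/path: reflections, user uploads, and CTAS results
  # are all written under this distributed storage location.
  dist: "dremioS3:///example-bucket/dremio/dist"
}
```

The same `paths.dist` setting is what previously pointed at HDFS or NAS locations; moving to cloud storage is a matter of changing this one target.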

Lucio Daza:

Excellent. Now let's go ahead and move on to a topic that I know most of us, and the audience, are waiting to learn about: security. This is right up there with concurrency; when you're working with a cloud data lake, you want to know all about security and that you can leverage most of the security features of your cloud data lake using Dremio. We have continued to add more and more features. So what do we have to talk about today on security in 4.0, Tom?

Tom Fry:


That's a great point. Security is key for organizations to be able to deploy Dremio at scale, and similar to performance, we have a series of improvements that we continue to add in terms of enhancements on the security front.

One of the things that we looked at is that many Dremio users utilize public cloud environments to deploy Dremio, some as part of a hybrid framework, and some in cloud-native-only environments. Not only is security obviously very important, but it's also essential to integrate with the common security tools and best practices utilized within cloud environments.

One of the things that we did recently was look at security best practices, and the tools that users were making use of in public cloud environments, in order to better help Dremio integrate into these models. Based on a lot of feedback, we identified several common security best practices, specifically in AWS, to support, and we added those in Dremio 4.0.

The first is the concept of configurable IAM roles for AWS S3 storage. With configurable IAM roles, it's possible to customize and define the role that's used on a per-source basis, so that each S3 source bucket can essentially have a different IAM role assigned for permissions. This enables much finer-grained access rights to be configured for S3 sources, and it also enables authorizations to be centrally managed within AWS, since this utilizes AWS's AssumeRole permissions. Essentially, the Dremio instance has an IAM role associated with it that it can use to connect to S3 buckets. That IAM role associated with the Dremio instance can also be authorized to assume other roles, and then you can assign those roles to S3 buckets as well.
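To make the AssumeRole pattern concrete: each per-source role carries a trust policy that allows the instance's role to assume it. A minimal sketch (the account ID and role name below are hypothetical placeholders) follows the standard AWS trust-policy format:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/dremio-instance-role"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

With a trust policy like this on each per-source role, S3 access rights stay centrally managed in AWS IAM rather than in the query engine.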

Second was support for AWS Secrets Manager. One of the things that we've seen is that, in order to simplify operations and improve the security model, many organizations often prefer to avoid scenarios where they have to store passwords in multiple different systems. Instead, they prefer to centrally manage passwords in a single system, ideally within a centralized key vault for additional security.

In Dremio 4.0, we added support for AWS Secrets Manager. This allows access credentials for external sources, such as Oracle or Redshift, to be stored and centrally managed from AWS Secrets Manager, and it avoids having to store access credentials within Dremio. Our initial implementation supports AWS Secrets Manager as a key vault, but we can also look at additional key vaults in the future that are utilized in other environments or on-prem environments. We really see AWS Secrets Manager as the first key vault that we'll add, and we've designed it so that we can add other key vaults in the future as well.
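As a hedged sketch of the general pattern (not Dremio's implementation), the idea is to resolve a source's credentials from a central vault at connection time rather than storing them with the engine. Here an in-memory dict stands in for the vault; in a real AWS deployment, the lookup would instead call Secrets Manager's GetSecretValue API:

```python
import json

# Illustrative only: a dict stands in for AWS Secrets Manager. In production
# the lookup would be something like boto3's
#   secretsmanager.get_secret_value(SecretId=secret_id)
VAULT = {
    "prod/oracle": json.dumps({"username": "app_user", "password": "s3cr3t"}),
}

def resolve_credentials(secret_id: str, vault=VAULT) -> dict:
    """Fetch and parse credentials from the central vault at connect time,
    so they are never stored alongside the engine's own configuration."""
    return json.loads(vault[secret_id])

creds = resolve_credentials("prod/oracle")
```

Rotating a password then only touches the vault; every system that resolves the secret at connect time picks up the change automatically.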

The third major practice that we saw was the desire to use server-side encryption with Amazon's Key Management Service. This provides the ability for users to use AWS customer master keys to encrypt Amazon S3 objects in a centrally managed way. This can help users utilize Amazon's Key Management Service and provide additional data-at-rest security for their S3 storage.
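For reference, default SSE-KMS on a bucket is typically set with S3's bucket-encryption configuration; a sketch of that payload (the region, account ID, and key ID in the ARN are hypothetical placeholders) looks like this:

```json
{
  "Rules": [
    {
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
      }
    }
  ]
}
```

With a default like this in place, objects written to the bucket are encrypted at rest under the customer master key without each writer having to request encryption explicitly.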

Lucio Daza:

Excellent, and before we move on, Tom, there is something that we didn't cover here in the slide, but I've seen in our UI. In 4.0, we also include Azure Active Directory and SSO, is that correct?

Tom Fry:

Yes, we have also recently added Azure Active Directory support with single sign-on. Essentially, instead of users having to enter passwords into multiple different systems, you can use an identity provider, such as Azure Active Directory; users only have to sign in once, and then they can access multiple different systems. We talked a little bit about that in last month's webinar: not only did we add support specifically for Azure Active Directory single sign-on, we also added support for the OAuth and OpenID protocols, which pretty much all major identity providers support today, so this enables Dremio to integrate with multiple different identity providers and provide a seamless single sign-on process.

With that, users do not have to enter a username and password into Dremio, instead, they simply sign-on once with their identity provider, and then when they access Dremio, it's essentially a click-through process to gain access to the system. This also helps administrators not have to manage access rights to multiple different systems, but to centrally manage everything from one central identity provider.

That's an example of us continuing to improve on the security front, add more and more security oriented behaviors within Dremio.

Lucio Daza:

Excellent. For the audience, we are working on a series of blogs that are going to start coming out next week or the week after, I believe. It's going to be based on how to secure your cloud data lake, so I encourage you to stay tuned. We're going to cover most of these features in detail, and there is a lot more material on how to enable these things in your environment as well. So keep an eye out for those. We've got you covered on that.


So the next thing that we're going to talk about, I believe we mentioned this in our prior webinar, and we have been talking to the audience and exposing this on the community site, as well as our own site. It's Dremio Hub, and this is a great opportunity, a great window for people to do a lot of great things with connectors, and so on. Tom, do you want to walk the audience through what Dremio Hub is and how it can help them take more advantage of Dremio in their environments?

Tom Fry:

Sure, thanks a lot for that, Lucio. Dremio Hub really helps Dremio offer connections to many, many different data sources throughout the industry. We had a major launch of this last month and we're very happy to get it out the door, and we've seen some great adoption of it from the get-go.

So for a little bit of a backdrop on this: there is a very large number of data sources within the industry where users store data and from which Dremio can read data, several hundred in fact. We've seen Dremio users interested in connecting to data from many, many different data sources, and we're always asked for new data sources.

One of the things that we looked at was how we can accelerate the addition of a large number of data sources that Dremio can connect to, and really rapidly expand the footprint of available data source connectors that users have access to. At the same time, we also wanted to enable Dremio users to connect to their own data sources. For example, we have customers that create their own custom databases or their own data sources internally, and they want Dremio to be able to access that data as well.

To enable this, we were very excited to announce the launch of Dremio Hub last month, which has two key components. The first is a simple framework so Dremio users can easily create their own connectors to data sources without advanced coding or knowledge required. The second is a marketplace of community-built and supported connectors, where users can share connectors and use connectors built by others, and also where third-party vendors can develop and post supported connectors as well.

With Dremio Hub, the Dremio community can now create connectors to really any data source; this includes just about any relational database or NoSQL store, and even many SaaS applications. For example, as long as there is a driver that can read your data, Dremio can plug into that.

Now the process is very simple; we designed it so that it's a template-based framework where users simply specify the supported data types, operations and functions that an external source supports, and Dremio uses that information to identify what data can be read from a source, and what operations can be pushed down into the source for processing.

Now, one of the great things about this is you can define template files very simply, with just one page of data types, and that's enough for Dremio to be able to read and do basic filtering on the data set, and that can be done very quickly, in a matter of an hour.

However, we also designed Dremio Hub connectors to take advantage of the exact same advanced relational push-down framework that we have for the native Dremio connectors you see in the Dremio UI. So Dremio Hub connectors have all the same performance optimizations as native connectors. Once you've made that first initial template file that enables Dremio to read various different data types from an external source, it's optional to specify additional function mappings, where you can say: here's a function within Dremio, and here's how you call that function within the external source.

The Dremio Hub execution path will utilize the advanced relational push-down framework to push more operations down. So it's very simple to make a very basic connector from the get-go that reads data, and it's also very simple to expand upon that definition to take advantage of all our accelerations within the product.
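The function-mapping idea Tom describes can be sketched in a few lines. Note this is a toy illustration only: real Dremio Hub templates are YAML files, as documented on dremio.com/hub, and the names below are made up for the example.

```python
# Toy sketch of the idea behind a Dremio Hub connector template (hypothetical
# structure; real templates are YAML, per the Dremio Hub documentation).
# A template maps Dremio functions to the equivalent expression in the external
# source; anything without a mapping is fetched raw and evaluated inside Dremio.

FUNCTION_MAP = {
    # Dremio function -> source-side SQL template ({0}, {1} are argument slots)
    "concat":   "CONCAT({0}, {1})",
    "lower":    "LOWER({0})",
    "date_add": "DATEADD(day, {1}, {0})",   # hypothetical source dialect
}

def push_down(func, *args):
    """Return the source-dialect SQL for a function call, or None if the
    function has no mapping and must be evaluated by Dremio itself."""
    template = FUNCTION_MAP.get(func)
    if template is None:
        return None                  # no mapping: Dremio evaluates it locally
    return template.format(*args)
```

The point of the design is graceful degradation: an unmapped function doesn't break the connector, it just means Dremio pulls the raw data and does that work itself.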

Additionally, we're building a Dremio Hub marketplace on our website where community members can contribute and share new connectors that they've built, and users can simply download connectors others have created. In fact, that's the main usage pattern that we see: one person will post a connector and then hundreds can download it and use it.

Our goal is really to have a very large range of connectors available on the marketplace. We launched the website only a few weeks ago, and already there's been some really great activity on it, with connectors to multiple systems already available, such as a connector to Salesforce, and many other systems as well. So we look forward to everyone contributing, and we're rapidly trying to grow the footprint in terms of the number of connectors that Dremio can read data from.

Lucio Daza:

I want to emphasize to the audience that when Tom mentions "simple", he actually means it. The process of creating a connector from Dremio Hub is very, very simple. Just go to dremio.com/hub; you will find there pretty much the screenshot that we're showing right now, plus instructions. The SQLite connector, I believe, is the one that the audience can use as a template; Tom, correct me if I'm wrong. Download that, follow the instructions that we have, either in our documentation or in Dremio University, where we have a course too. It's a very short session that teaches you how to download the template, how to create your own connector, and then how to install it in your Dremio instance to start using it right away. So this is very exciting and, as I mentioned, very simple to use. If I can do it, you can do it, I promise you that.

So now let's go ahead and move on. I think we're going to start talking about UI changes and some of the features that we included in 4.0, and the first one is the ability to track job status in the UI. So what is the benefit here, Tom? How is this going to help us continue enjoying our experience with Dremio?

Tom Fry:

So this next feature might seem a little bit minor, but based upon a lot of the feedback that we've received, and from talking to users, I think it might well be one of the more popular ones in the release. And that's because it significantly improves the user experience for users that execute queries directly within the Dremio UI. And that is the ability to monitor and control queries directly from the SQL editor page in Dremio.

To step back a little bit, a common user workflow that we saw was that often, after submitting a query, users would want to stop and then change the query. For example, a common pattern might be that the query took a little bit longer than they thought it would, and then they realized: wait a minute, I'm missing a filter, I wanted to add a filter to improve results and speed up execution, or maybe I forgot to join some data set and I want to join that in. However, doing that before required leaving the SQL editor page, going to the jobs administration page in the UI, finding the query in a potentially very long list of queries in the system, if it's busy at the time, stopping it, and then going back.

So to improve on this in Dremio 4.0, queries can be fully managed directly from within the SQL editor page, without having to leave the page at all. After submitting a query, a run-time status for the query is shown in the UI, plus controls to view query details and to stop the query. This makes it very, very easy to quickly stop a query that you just submitted, if you decide you need to make a modification of some sort, edit that query, and then resubmit it. That workflow pattern, which we think is very common, is now very simple.
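The same submit, stop, and resubmit workflow is also scriptable through Dremio's REST API (SQL submission and job cancel endpoints exist in API v3; the base URL below is an assumption about your deployment). As a hedged sketch, these helpers only build the requests; actually sending them requires a live coordinator and an auth token:

```python
# Sketch of the job-control workflow against Dremio's REST API v3.
# The endpoint paths follow Dremio's v3 API docs; localhost:9047 is an
# assumed coordinator address, and sending the requests (e.g. with the
# `requests` library, plus an Authorization header) is left out here.

BASE = "http://localhost:9047/api/v3"   # assumed coordinator address

def submit_sql_request(sql):
    """Request that submits a query; the response carries a job id."""
    return ("POST", f"{BASE}/sql", {"sql": sql})

def job_status_request(job_id):
    """Request that polls a job's state (RUNNING, COMPLETED, CANCELED, ...)."""
    return ("GET", f"{BASE}/job/{job_id}", None)

def cancel_job_request(job_id):
    """Request that cancels a running job, like the stop button in the editor."""
    return ("POST", f"{BASE}/job/{job_id}/cancel", None)
```

A script would submit, poll status, and on "missed a filter" cancel and resubmit the edited SQL, mirroring what the SQL editor page now does in one click.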

In addition, the page is also dynamic and visually shows query status. What you see here on the page is a wheel, and this is the real UI; while the query is running, that wheel continuously spins as long as the query is making progress. It only stops if there's something that needs investigation. With that, you can know: hey, the query's running fine, but maybe I want to add a filter, so I'll stop it and resubmit it.

So again, this may seem minor, but based on feedback, I think this is an improvement that'll benefit many users, and one I've heard a lot of great things about.

Lucio Daza:

This is great, and I have to say, in my previous life, I was using a product that when you were generating something, it wouldn't tell you what was going on in the background and it was one of the most frustrating things to not know if there was an error, if the thing was frozen, or whatnot. But I think this is going to be amazing.

So now let's go ahead and move on to talking about the multiple... no, actually we're talking about cut and paste of results. This is another UI change, is that correct, Tom?

Tom Fry:

Right. So, much like the work from before, where we identified some usage patterns in the Dremio UI that we thought were pretty common, another major one is that we saw a lot of people wanting to take data results out of Dremio and then simply move them to some other tool that they might have on their laptop.

For example, a very common pattern was to say: I executed a query, that query returned some number of hundreds or thousands of rows, and it took a little bit of processing, and now I want to further refine the results. I don't want to have to do so by submitting multiple different queries; instead, what I want to do is just extract those results and dump them into Excel, for example. And then once they're in Excel, I can do some additional sorting, grouping, or filtering, and that all happens in real time because the data set's been processed down to a few hundred rows, and I just want to do some additional exploration in Excel. So one of the things that we did was make it very easy to simply copy and paste the results of an existing query out of the UI and into another tool on your PC.

One of the great things that we also did with this is automatic formatting, using a kind of rich text format, in order to preserve the formatting and column structure when you paste, specifically into Excel. So you can take this data and copy it into just about any tool, but we also did some optimizations to make sure that when you copy and paste directly into Excel, the headers and the column structure are all preserved without having to do any kind of additional data processing within Excel.
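A minimal sketch of why pasted results keep their columns: spreadsheet tools split clipboard text on tabs and newlines, so emitting a header row plus tab-separated data rows is enough to land each value in its own cell. (Dremio's actual copy feature uses a richer format, as Tom notes; this only illustrates the underlying idea.)

```python
# Toy illustration of clipboard formatting that pastes cleanly into Excel:
# one line per row, one tab per column boundary, header row first.

def rows_to_clipboard_text(headers, rows):
    """Render headers and rows as tab-separated lines, Excel-paste friendly."""
    lines = ["\t".join(headers)]
    for row in rows:
        lines.append("\t".join(str(cell) for cell in row))
    return "\n".join(lines)
```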

This is a very common user pattern that we saw for our customer base, and it's something else that, again, may seem minor, but I think has a large impact, and we've had a lot of great feedback on.

Lucio Daza:

Definitely. And that 5,000-row, I don't want to call it a limit, but is it a hard limit, or can we increase it to a different number if you want to copy more than just 5,000 rows?

Tom Fry:

So that is a limit that's driven by typical OS requirements; the clipboard's only so big, so we fixed the amount of data that you can copy. There's always the option to execute SQL directly outside of the UI and create data sets that way, but at that scale, we didn't see a lot of people wanting to do that in Excel, for example. The 99% use case is: I submitted something that has several hundred rows and I want to explore those in Excel. So this is really for those types of use cases, and if you want to do analysis on returned results of maybe a million rows, we suggest traditional paths for that.

Lucio Daza:

Excellent. So we're getting close to our time limit, but we still have some more material to cover. I appreciate all the questions that are coming in; we're going to try to move along with the rest of the content that we have, but please rest assured that if we don't get to your question, we'll definitely address it in the follow-up. And if you don't get an answer here, you are always more than welcome to go ahead and post your question on the Dremio community, so we, or anyone on the team, can address it there.

So, Tom, let's go ahead and talk about multiple Hive metastores. What is this about? What does this mean for the audience today?

Tom Fry:

So another key capability that we added in Dremio 4.0 is the ability to connect to Hive 3, plus the ability to connect to multiple different Hive metastore versions at the same time, and even within the same query. So within Dremio 4.0, both Hive 2 and Hive 3 data sources can be configured and queried at the same time, and queries can even join data sets across the different Hive metastores, even if they're different versions.

Additionally, Dremio supports the new transactional tables and ACID properties that Hive introduced with Hive 3.1, and so this lets users take advantage of all the new features Hive supports in its most recent releases.

Lucio Daza:

And I lost my mute button, sorry about that. Okay, this is great. So now, we also have several editions that you can find, and this goes along with the question that was asked about where you can download Dremio: please go to dremio.com/deploy. You will find there all the deployment options that we have, and I believe this is what Tom is going to cover at this point. So what do we have now for deployment options, Tom?

Tom Fry:

Yes, so that's a great point. One of the things we added is not just the ability to deploy Dremio in multiple different environments; we also offer what you can think of as custom editions of Dremio that are optimized and custom-tailored for the major cloud providers and for on-prem deployments. So each edition is pre-configured for the various environments and services that each cloud vendor offers, and optimized for their unique environments. The configuration of S3 is a little different than ADLS, you know, and so is how best to structure Dremio on top of that.

We support both Kubernetes and direct deployments; AWS CloudFormation and Azure ARM templates are supported. And to deploy, all users need to do is simply select the environment and type of deployment that they're interested in, and then, with essentially a single click or a very limited workflow, you can launch an optimized configuration of Dremio in your environment of choice. So this makes it very easy to quickly deploy Dremio in many environments with no additional configuration required.

Lucio Daza:

Very simple again, I want to emphasize this. I tried it, and there is also some documentation in our tutorials and resources that you can follow along with to do any one of these deployments. Super simple, as Tom mentioned, just follow the template. Now I believe we're making a jump back into the security topic; we're going to talk about inbound impersonation. Talk to me about this, Tom.

Tom Fry:

So this is another area where we wanted to make our security model simpler for our users to manage within their environments, and also enable more advanced and fine-grained controls. Dremio, for several releases, has supported the ability to access external sources with a single service account while, at the same time, utilizing at run-time the unique authorizations of the end user currently using Dremio. This can simplify configurations while still providing advanced and fine-grained authorizations that are aligned to each user.

But that's for Dremio accessing external sources. Starting with Dremio 4.0, we allow the same model of impersonation when users access Dremio from a third-party client tool. This enables external tools that access Dremio to use a single service account ID and password, while still instructing Dremio to use the authorizations of the current end user. This can really simplify the management of these situations while still providing fine-grained access controls.

An example is: we have a lot of users that might have a custom ETL job or a custom application they've built, but they have end users that utilize that application, and they don't want to have to store lots of different user credentials within that application. They want the application to access Dremio through a centralized service account, but still limit access to the authorizations of the end user using that application, and that's possible now with this feature.
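On the client side, this pattern boils down to one extra connection property: the application always authenticates as the service account but names the end user whose authorizations Dremio should apply. The sketch below builds a JDBC URL for that; the `impersonation_target` property name and port 31010 follow Dremio's JDBC documentation, but verify both against your client and version.

```python
# Sketch of the client side of inbound impersonation: one service account
# connects, but Dremio enforces the named end user's authorizations.
# Property name `impersonation_target` is taken from Dremio's JDBC docs
# (check your version); host and user names here are made up.

def jdbc_url(host, service_user, end_user):
    """Build a JDBC URL where `service_user` connects on behalf of `end_user`."""
    return (
        f"jdbc:dremio:direct={host}:31010;"
        f"user={service_user};"
        f"impersonation_target={end_user}"
    )
```

Note the impersonation also has a server side: an administrator must first grant the service account permission to impersonate, so an arbitrary client can't simply name any user.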

Lucio Daza:

And as a security feature, this is enterprise edition only, right? Or is it for both?

Tom Fry:

This is just enterprise only.

Lucio Daza:

Mm-hmm (affirmative). Excellent. All right, so, what are we missing here? What else is included in this release that didn't make it to the big slides?

Tom Fry:

So there's actually still a bunch of other stuff. I encourage everyone to go to our release notes and review them for a complete list of features. One of the things we did was add decimal as a data type for relational and MongoDB sources, and all math functions have been ported to Gandiva for high performance. We also made some other Gandiva improvements. We introduced Gandiva in the 3.3 release, and we talked about a bunch of the performance improvements and capabilities there, and we continue to expand its footprint. Gandiva can now process string-type functions, so this really enables Gandiva to accelerate a wider variety of workloads that do a lot of string-heavy operations, where previously it was more math operations.

We're also adding the ability, in a follow-up patch release pretty soon, to download existing results and 'save as'. A pretty common access pattern was: I want to download one query's results, then run another query and download those results, then create another query and download those results, and you could kind of lose track of which file was for which query. You can now basically specify the file name, so this is somewhat similar to the copy and paste as well.

But again, we have many more features that are in the release, and I encourage everyone to take a look at our documentation and please feel free to reach out to your Dremio representative if you have any specific questions as well.

Lucio Daza:

Excellent. So I think we have time for maybe a question or two. While Tom checks the questions that we have been receiving while he was talking, I want to address the audience and, as Tom mentioned, invite them to read our release notes. Also, if you want to learn more about Dremio, feel free to join Dremio University, or go to the tutorials and resources library on our site. We have a bunch of documentation in there on how you can use these new features or any other features that you see in our product, as well as recordings from our previous webinars, and so on and so forth. And of course, as was asked here through the Q&A panel, you can go ahead and deploy Dremio by going to dremio.com/deploy.

Also, as you noticed, we didn't do a demo of our product today. If you want to see a live demo, please join us every Tuesday, I deliver a live demo every Tuesday at 2PM eastern time, I believe that's going to be 11 in the morning, pacific time. It's only 30 minutes, we walk through a couple of use cases and it shows you everything, all the basics about how to use Dremio, what the UI is about, and how you can do things in there.

So, the first question, Tom, that we are going to talk about today, and let me see if I can catch it here really quick, is: how is C3 different from Dremio data reflections?

Tom Fry:

That's a great question, and I think it's really worthwhile to spend a little time explaining the differences there. I think a very good analogy is traditional databases; let's say you're on Oracle or any other traditional database, you actually have both concepts there as well. You'd have the concept of indexes on your tables, which are user- or administrator-defined areas or columns that you want to accelerate access to, and with an index, there'll be some pre-computation that's done. There are also some techniques used to keep that data closer to memory, faster and more available for usage. And that's really a user-driven activity: here are the things that I want to accelerate.

At the same time, traditional databases also have their own memory cache, and they also utilize the operating system's page cache to cache data off of disk and into memory. So if you look at a traditional database environment, you actually have both concepts. You have the concept of an index, which is a form of user-driven acceleration, plus you have the concept of a cache, which is simply finding the most commonly accessed data and keeping it off of disk and in memory for fast processing.

You can make a very similar analogy for reflections and the columnar cloud cache technology as well. Reflections are specific optimizations that a user wants to make. These are areas where a user says: I want you to pre-extract this from the system, and that system could be a file store, or it could be an enterprise data warehouse like Teradata. Reflections are also a form of pre-computing results; those results might come from performing join operations, aggregate operations, field calculations, etc. So a reflection is a way of saying: please do those calculations ahead of time, put them in a reflection, and now thousands or millions of user queries can access those pre-computed results.

C3 is a very different concept, where we identify the most commonly accessed data, similar to a traditional memory cache, and keep that data close to the system. But that happens without user or administrator involvement. So reflections are more of a user-driven activity for data that you know you want to pull into the system, and they work across all of our different sources, even relational sources, or NoSQL sources, etc. They're also a form of acceleration based on pre-computed results, whereas C3 is a technology stack that instead automatically responds in real time to the external data that's being accessed most often, and keeps that in the system even though users haven't asked for it.
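The contrast Tom draws, a user-requested precomputation versus an automatic recency-based cache, can be illustrated with a tiny least-recently-used cache. This is a toy only; Dremio's real C3 implementation is far more involved, and nothing here is Dremio code.

```python
# Toy LRU cache illustrating the C3 side of the analogy: hot data stays
# close automatically, cold data is evicted, no user configuration needed.
# (A reflection, by contrast, would be an explicitly requested, precomputed
# result that queries can be served from.)

from collections import OrderedDict

class TinyCache:
    """LRU cache: recently used keys stay, least recently used are evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key, load):
        if key in self.data:
            self.data.move_to_end(key)         # mark as recently used
        else:
            self.data[key] = load(key)         # fetch from "remote storage"
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used
        return self.data[key]
```

Accessing keys a, b, a, c with capacity 2 leaves a and c cached and b evicted, the recency behavior a cache gives you with zero administrator input.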

Lucio Daza:

Great. I think we have time for one more question, and I believe this is related to the multi-cluster isolation section. The question is, is this virtual isolation or physical isolation?

Tom Fry:

So that's a great question. It's actually physical isolation at the VM level. Dremio instances are installed within a given VM; you could, for example, in an environment, have multiple Dremio executors on the same node, but that's because Dremio is deploying multiple nodes on the same physical server. The isolation is really done at the full VM level, so if you were, for example, deploying Dremio nodes each on their own server instance, in that case you really are able to use this technology for full physical isolation between resources.

If, within your Kubernetes or Yarn-type deployment model, you're running multiple Dremio executor nodes, potentially on the same server, then you're relying on the provisioning at the VM level for isolation. So the isolation is really defined at the boundaries of the VM node executor that you configure, and that can be flexible, from a virtual environment, depending on how you're using Kubernetes or Yarn, to complete physical isolation if that's required. In fact, we have some users that, specifically to secure the different environments, are looking at it from that perspective as well. So I'm glad somebody asked that.

Lucio Daza:

Excellent. Well Tom, as always, it has been an absolute pleasure to talk to you about one of our new releases. I want to thank you and to the audience, thank you for being here with us. As I mentioned before, if you had any questions that were not addressed here, feel free to put them on our community. We will try to get to that, and also, join our live demo so you can learn more about Dremio. I hope everyone learned a little something today. Tom, thank you so much, it was a great pleasure. To all our audience, thank you for being here with us and on behalf of the Dremio team, we wish you a great week ahead and we'll see you next time. Bye bye.

Tom Fry:

Thanks a lot!