
Dremio 3.3 – Technical Deep Dive


Transcript

Lucio Daza:

Hello, everyone. Thank you for being here with us. My name is Lucio Daza, and I direct technical marketing here at Dremio. Today, we have a very special presentation prepared for you: we are going to be talking about Dremio 3.3. I have with me the one and only Tom Fry, Director of Product Management. Tom, can you say hello?

Tom Fry:

Hello everyone. Thank you for joining us today.

Lucio Daza:

Awesome. Before we get started, there are a couple of things that I want to run by you. This is your presentation and your time. This is your time to understand what is going on and what our new release is all about. What I want to show you here is how you're going to communicate with us. We have a Q&A prepared for you at the end; however, don't wait until the end to ask any questions you may have. The only favor I ask is to please use the Q&A button that you will see at the bottom of your Zoom interface. That will bring up a little white panel; go ahead and put your question there. Do not write your question in the chat window, because I cannot guarantee that we're going to get to it if you put it there.


All right. Also, if for any reason you need to communicate with the audience, you can use the chat window. If there is a problem with the audio and so on, you can raise your hand, but please use the Q&A if you have any questions. Now we are at Dremio 3.3, as I mentioned earlier. It is amazing. It has been roughly two years since we released Dremio 1.0, and we have included a ton of new features in these two years. We continue to include a lot of cool features in every one of the releases that we do. Normally, you see software shipping major or important features only in major releases, but we are doing [inaudible 00:05:06], including very cool features like the ones we're going to be talking about today.


If for any reason you missed it, Dremio 3.2, we did this Deep Dive a couple of months ago. We talked about a bunch of cool features in relation to more connectivity support for Azure. We included Cloud-Aware Predictive Pipelining. We also provided you with the ability to install or deploy Dremio using Kubernetes and Helm Charts, either on AKS or EKS depending on the cloud flavor you wanted to work with. Also, we have some deployment templates on the deploy page on our site, which, by the way, we updated. We also have some cool tutorials around how to use those templates and the Helm Charts. We talked about enhancements to Data Reflections, improved query concurrency, and faster query planning. Of course, we included ORC data types and a new dataset rendering engine. We have all this information documented on our site.

If you want to go ahead and listen to that webinar, you are more than welcome to do so. We have the video and the transcript there for you. What are we doing today? In the next 45 or so minutes, we're going to be covering everything that we included in 3.3 and more. We have some very cool stuff, and Tom is going to be walking us through all of the features we are going to be talking about. We have Personal Access Tokens. We also have Automatic Virtual Dataset Updates. This is huge. This is an amazing feature. Of course, Online Cluster Maintenance. We are going to be talking about Dremio Hub; for those of you who attend my Live Demo Tuesdays, I teased you on this, and now we are going to be talking about it. Of course, this is humongous: Gandiva is now generally available for everyone on Dremio.

We're also going to be talking about Reflection Insights and Filtering, compatibility with MinIO and other S3-compatible storage, as well as Predictive Pipelining. We're continuing the Predictive Pipelining enhancements, now for ADLS and S3. Of course, there are some other changes that we're going to be discussing. The time is yours, Tom. Let's go ahead and start discussing what is new in this release.

Tom Fry:

Great. Well, thanks a lot for that, Lucio. We have a lot of different features we're excited to talk about today. Thanks a lot to everyone for joining us here. The first feature that we wanted to talk about is the addition of Single Sign-On for Dremio. I know a lot of people are very interested in this, and we're very excited to be able to offer it today. A little bit of background on this: organizations today support a very large number of tools for their user communities. This introduces several different complications. Users may be faced with the need to manage many different passwords for multiple different tools. Organizations really have this sprawling mix of different access controls across different systems, and that can introduce risk and the potential for gaps in security.

People leaving passwords on sticky notes because they have too many, for example. Organizations are looking for ways to simplify user access so that end users get smooth access to a variety of different tools. Also, organizations are looking at how they can strengthen their security models by centralizing access rights for all their systems. This enables stronger controls, centralized management, and really full visibility into access rights. To support this, in 3.3, Dremio added support for Single Sign-On. Single Sign-On is the ability for a user to enter a single set of credentials with a centralized identity provider one time, and from that point forward have those credentials used to automatically log into and access the multiple different tools they require access to. Once configured, users accessing Dremio no longer need to enter a Dremio user ID and password. Instead, with one click they can log into Dremio and automatically gain access through their identity provider.

If they've already logged into their identity provider through another tool or another process, when they click to log into Dremio, the process happens seamlessly behind the scenes; you just automatically land in Dremio. If you've not previously logged into your identity provider, you're prompted with that identity provider's sign-on process, you enter your credentials there, and then the process continues. With this, IT can centrally manage access through a single identity provider, really simplify their management, and reduce sprawl. Users no longer need to manage different credentials for different systems. In 3.3, Dremio supports automatic configuration with Azure Active Directory out of the box. We also support the OAuth and OpenID Connect protocols, which are widely supported throughout the industry. Most identity providers support OAuth and can be configured for Single Sign-On with Dremio. Additionally, for third-party tools such as BI tools or other tools that may access Dremio through ODBC or JDBC-type connections, we also support the concept of Personal Access Tokens.

Personal Access Tokens are an alternative to usernames and passwords. Users can create their tokens within Dremio after signing in, and then use those to log in over ODBC or JDBC from any other tool. They have built-in expiration capabilities, administrators can revoke them on demand, et cetera. They enable seamless access controls.
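For illustration, here is a minimal Python sketch of a tool logging in over ODBC with a Personal Access Token. The DSN name, the username, and the convention of passing the token in the password field are assumptions for the example; check the Dremio ODBC documentation for your driver version.

```python
# Minimal sketch: connecting to Dremio over ODBC with a Personal Access Token.
# The DSN name ("Dremio"), the username, and the idea of passing the token in
# the password field are illustrative assumptions, not the documented contract.
import pyodbc

PAT = "<personal-access-token-created-in-the-dremio-ui>"

conn = pyodbc.connect(f"DSN=Dremio;UID=alice;PWD={PAT}", autocommit=True)
cursor = conn.cursor()
for row in cursor.execute("SELECT 1"):
    print(row)
```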

Lucio Daza:


This is great. Let me ask you this. It is possible that you already mentioned it, but just to clarify: Azure Active Directory is supported out of the box. If I use a different identity provider, such as Okta, can SSO be configured for that IdP?

Tom Fry:


That's a great question. To reemphasize, we support the OAuth and OpenID Connect protocols, which pretty much all identity providers support. In 3.3, we have automatic configuration with Azure Active Directory, where we know how to pre-configure things automatically. But you can configure Dremio with any identity provider that supports those protocols, such as Okta, et cetera, to enable similar functionality.

Lucio Daza:

Excellent.

Tom Fry:

This really enables a very wide range of identity providers that Dremio can work with.

Lucio Daza:

Awesome. Thank you. How about some Automatic Virtual Dataset Updates? I was looking at the documentation on this feature, and I have to say I'm very excited to hear more about it. What can you tell us about that, Tom?

Tom Fry:


This is actually, I think, a distinguishing feature for Dremio compared to other data services. To provide a little bit of context on this, many Dremio users have a very, very large number of tables and virtual datasets that they define and manage in Dremio. For example, we actually have customers with several hundred thousand VDSs in Dremio that they have to manage. They do so because of how powerful the Dremio platform is in their organization in terms of providing a single access point for all of their company's data. However, this scale can introduce a lot of challenges in managing the relationships between different datasets, especially as datasets change. For example, when columns are added to or deleted from an external source, how do you propagate those changes up through the virtual datasets defined within Dremio?


In previous releases, virtual datasets were fixed. For example, if you had an external table or physical dataset with 10 columns and you defined VDSs with SELECT * on top of it, and you then changed that physical dataset to add columns, you had to go and refresh all the downstream VDSs. That's fine when you have 10 or 20 VDSs; if you have 500,000 VDSs in your system, obviously, that's a bit of an administrative burden. This behavior is actually common across relational databases. It's done for a variety of internal optimization reasons that we don't need to get into. As a result, you might have VDSs or views, for example, that are defined as SELECT * from some table with a GROUP BY on positional order numbers. This is a pretty common type of view to define. In traditional systems, and pretty much most relational systems, if you update the base table, that view is actually not updated, even though it says SELECT *. It only sees the columns that were there when you created the view.

One of the things that we want to do is make it much easier for Dremio to automatically respond to changes in the catalog and really reduce the administrative burden there. What we introduced is this concept called Automatic Virtual Dataset Update. For VDSs that, for example, use a SELECT * or positional ordering for GROUP BYs or ORDER BYs, et cetera, as the definition of the physical dataset or the external table changes, Dremio will automatically identify those changes and apply them to the derived virtual datasets. This includes adding and removing columns, changing column names, GROUP BY and ORDER BY statements with positional references, et cetera. This is really a distinguishing feature, I think, if you look at Dremio against other data systems. We like to think of ourselves, for example, as Google Docs for your data. If you think of it that way, we should always respond to changes. This is a method to both reduce administrative burden and respond in real time to changes as datasets evolve.
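To make the behavior concrete, here is a hypothetical sketch. The DSN, the space and dataset paths, and the added column are placeholders, and the CREATE VDS statements assume the target space already exists. A VDS defined with SELECT * (or a positional GROUP BY) picks up a column added to the underlying physical dataset on the next query, without redefining the VDS.

```python
# Hypothetical illustration of Automatic Virtual Dataset Updates. The DSN and
# the space/dataset paths below are placeholders for your own environment.
import pyodbc

conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cur = conn.cursor()

# A VDS defined with SELECT *, and another using a positional GROUP BY.
cur.execute("CREATE VDS demo.orders_all AS SELECT * FROM src.sales.orders")
cur.execute(
    "CREATE VDS demo.orders_by_region AS "
    "SELECT region, SUM(amount) AS total_amount FROM src.sales.orders GROUP BY 1"
)

# Suppose an external ETL job later adds a "discount" column to
# src.sales.orders. In 3.3, the next query against demo.orders_all reflects
# the new column automatically; no manual refresh of the VDS is needed.
for row in cur.execute("SELECT * FROM demo.orders_all LIMIT 5"):
    print(row)
```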

Lucio Daza:

You mentioned something there that resonated very well: it's going to respond to changes. I'm butchering the quote, but I guess the question is, if there was anything that changed in the background, say, for example, an ETL job or some other job that alters the data that I'm connecting to, are the changes going to happen automatically? Do users have to go in and apply some sort of refresh interval for the VDS to capture those changes? Or is this something that is constantly listening for new changes, so it does the automatic update?

Tom Fry:

That's a really great question. Do you have to do something or specify some type of refresh-metadata operation, which itself would be useful? We wanted to make it even more seamless than that and have it happen automatically behind the scenes. The way it technically works is that when you run a query, at that time we check the metadata information for the underlying sources. At query time, if there's a change, we will identify that change, propagate it up through the derived virtual datasets, and then essentially run the operation after having done so. The change that you see happens at runtime with zero action required from a user or administrator. If you make a change, for example, if you go to your S3 bucket and you add a column to a [inaudible 00:17:07] file, on the next query you will see it appear in your results.

Lucio Daza:

Excellent. Thank you. Now let's go ahead and move on to Online Cluster Maintenance. I think this also comes with some changes in the UI, right, Tom?

Tom Fry:

Yes. To give a little bit of context on this, Dremio, as everybody knows, is a very highly scalable execution engine. We have customers and deployments that manage Dremio at very high scale, up to the order of several hundred nodes running in production. As powerful as that is, a challenge at very high scale is how you maintain consistent and uninterrupted operations with no downtime. The challenge with a larger number of resources is that the potential for failures increases. If you're running on 10 nodes versus 500 nodes, with 500 nodes you have a 50 times higher rate of hardware failures, for example. Additionally, all systems require maintenance from time to time, for example, to upgrade Hadoop libraries, et cetera. What we included in Dremio 3.3 are new capabilities that enable administrators to fully manage and maintain their Dremio cluster while it remains online and available for use with zero downtime. The way this works is, we're showing three nodes here, but if you had many hundreds of nodes, for example, you can specify certain resources to essentially take out of the cluster temporarily and avoid using them.

When that happens, Dremio will stop submitting work to that node. Once that node has finished the queries currently in flight on it, it will be taken out of the cluster and essentially put into an at-rest state. At that point, administrators can perform whatever maintenance activities they may need to do. That could be, for example, upgrading the operating system on the node. It could be upgrading Hadoop libraries. Maybe there's physical maintenance needed at the hardware level, for example, replacing SSDs or memory or CPUs, et cetera. All those activities can be done. Once they're done, you can bring the node back online. While that node is offline, its resources are obviously not available for computation, but especially at larger scale, if you go from 100 nodes to 99, there's essentially no real impact to users.

With this, maintenance activities can be handled on a per-node basis, but you can also think about rolling updates. For example, if you wanted to have continuous zero-downtime availability of a cluster, but at the same time you wanted to upgrade your Hadoop version or your OS or other aspects of the system, you can do so in a rolling manner. This is really a big enabler, we think, in terms of helping people stay online 100% of the time while still performing the day-to-day activities that all systems require.

Lucio Daza:

Excellent. I want to call your attention really quickly to something I'm seeing on the screen. It looks like the user is selecting to avoid using this executor node. I'm also noticing that there is a black dot where the other two nodes show green, meaning that node in this case is offline. I guess I'm trying to analyze the use case: in this case, a node went offline, and then I'm trying to avoid any bottlenecks, right?

Tom Fry:

Right. What "avoid" means is stop submitting work to it. It doesn't mean stop the node right now, because that would cancel queries in flight. What this will do is enable you to basically say, "Avoid using this node from now on." It's a temporary state; it's not a shutdown state, for example. Dremio will still complete operations on that node and let the queries drain out. Once it's offline, you'll be notified with a change in its color status. At that point, you can do whatever you want to the node. When you want to bring it back online, you can manually specify that it be returned to the pool, and work will start to go to it again. Then you could move on to the next node, for example, if you want to.

Lucio Daza:

Great. Another question, and I am sorry for hijacking this slide here, because this is very interesting. Say, for example, I go ahead and deploy Dremio in the cloud, for example on Azure using an ARM template. Would this work there too? Are there any differences in the behavior of this feature depending on the platform Dremio is deployed on?

Tom Fry:

That's a really great question. These capabilities are agnostic to the deployment mode. As you know, Dremio supports a lot of different environments, from YARN-based environments to Kubernetes to managing your own hardware resources directly, et cetera. This works across all of those. One thing to point out is that if you have Reflections or accelerations stored within Dremio's local pseudo-distributed file system (PDFS), this operation is essentially not compatible with that, because if you take a node out, you lose access to the Reflection data stored on that node. But for most systems, if you use our standard out-of-the-box configurations and our standard Kubernetes setups and Helm Charts, et cetera, you should be using external distributed storage for Reflections, and you would not encounter that. It's really only if you're using the PDFS store. Otherwise, for pretty much all deployment scenarios, this feature is available.

Lucio Daza:

Excellent, thank you. Now let's go ahead and talk about this super awesome feature called Dremio Hub. I hope the audience is as excited as I am to see this. Tom, can you walk us through what Dremio Hub is, why we are creating it, and pretty much all the benefits that it brings to the community?

Tom Fry:

We're really excited to announce this. I think I've even talked to several of you on the call today about this coming, and at some earlier events as well. We haven't launched the web page yet today, but it'll be coming out very soon. A little bit of context here: there's an extremely large number of data sources in the industry from which Dremio could read. Several hundred different databases are out there. Dremio users are interested in connecting to data from many different data sources, and we're hearing requests for new data sources all the time. One of the things that we looked at is how we could accelerate the addition of a large number of data sources that Dremio can connect to, and really rapidly expand the footprint of available data source connectors that users have access to. We wanted to do a couple of things. The first is we wanted to enable Dremio users to be able to connect to their own data sources.

For example, we have customers with their own custom databases that they've developed internally and want to connect to Dremio. We also wanted to enable a community to really grow the adoption of different connectors. Today, we're really excited to announce the launch of what we're calling Dremio Hub, which has two components to it. The first is a framework so that Dremio users can easily create their own connectors to data sources. The second is a marketplace of community-built and supported connectors, where users can post connectors that they've built for other tools. With this, the Dremio community can now create their own connectors to any data source with a JDBC driver.

This includes just about every relational database that's out there, every NoSQL store, even many SaaS applications. For example, there are JDBC drivers that can read your organization's Salesforce data. The process is very simple. It's a template-based framework, where supported data types, operations, and functions can easily be defined. Dremio will use that information to identify what operations can be pushed down into the data source and what data can be read from the data source. It's very advanced. One of the things you have with native Dremio connectors is advanced pushdown capabilities, where Dremio will push very advanced operations down into the relational source and take advantage of the capabilities there. This community framework has the exact same capabilities. You can define the functions and other operations that a data source supports, and Dremio will be able to do advanced relational pushdowns into those sources the same as it would with a native relational connector. You can do so without having to write code; it's all very template-based, human-readable, et cetera.

Additionally, the second aspect is we're building the Dremio Hub marketplace on our website, where community members can contribute and share connectors that they've built. More importantly, users can download connectors that others have created. Our goal here is really to have a very large number of available connectors and really enable Dremio users to connect to data wherever it may be. This website will be launched very soon, most likely within a week. I've already been working with many people on a preview of the capabilities, and there's been a lot of activity there, some of which you can even follow on GitHub. We're very excited to announce this, and we think it will be very useful.

Lucio Daza:

This is awesome. For no personal reason, I want to call attention to this Vertica connector; some cool guy at Dremio developed it. In all seriousness, this is going to be very simple to use. There is going to be a ton of resources out there, documentation on how to use the template. I want to give kudos to pretty much everyone involved in this process, because they couldn't have made it any easier to follow the template and develop those connectors. I want to follow up with a couple of questions. The first one is something I'm wondering: once we go ahead and open the door for Dremio Hub, are we going to accept pretty much any submission that the community wants to send? Also, is there going to be an approval process? For example, somebody develops or improves one of the connectors that we have in there and sends it to us; is that going to go through our QA process and so on for us to make it publicly available?

Tom Fry:

That's a great question. There is a contribution process, and it's pretty straightforward. The website will walk through that process. There's some basic information that needs to be provided, plus, obviously, the connector itself. Initially, on day one, we have some internal test frameworks that we will apply to provide some initial testing. However, in time we will be exposing those tools to the community so that they can utilize them when they build connectors and include test results with the submission. On day one, there's information on the website in terms of what to provide; it's pretty simple information. Then we will do some internal checks, basic sanity checks, just to ensure correctness. From there, the connector can get on the website.

We want to make this a very seamless process. Our real goal is to enable a community-led effort for the very long tail of different sources out there. There's a Snowflake connector, for example, on GitHub, and there have actually already been dozens of pull requests on it and lots of activity. We think it's a great model, and we're really happy to introduce it.

Lucio Daza:

Excellent, excellent. Cool. That is great. Keep an eye out next week. As I mentioned, there is a Dremio University course dedicated to this, as well as tutorials. We have a bunch of instructional blog posts, and so on. There is not going to be any lack of documentation on how to use this for anyone who is interested. Now let's go ahead and talk about Gandiva. This is something, pardon me, that we have been talking about in the last couple of releases, and we continue to improve Gandiva. Now it is GA in 3.3. Is that right, Tom?

Tom Fry:

That's right. It's both GA and enabled by default. To bring a little bit of context on this, we've discussed before how Apache Arrow, which Dremio is a key maintainer and sponsor of, is now a common in-memory representation of data that accelerates data transfer between systems. It's really become an industry standard, with millions of downloads per month, and has been incorporated into many different systems. With Apache Arrow, systems can easily transfer data while avoiding the overhead of serializing and deserializing data. Gandiva is an entirely new execution kernel, built and designed from the ground up to process data natively in the Apache Arrow format with zero transformations. We announced Gandiva earlier this year. In 3.3, we're excited to announce that Gandiva is now fully GA. It's also the default execution engine for Dremio. On upgrade to 3.3, Gandiva will automatically be enabled, with no action required by the user.

Gandiva is an Apache-licensed, open source execution kernel. It enables very high-speed compute on Apache Arrow data and makes optimal use of CPU resources. It provides faster and lower-cost operations on analytical workloads, and we have seen some pretty dramatic speedups on a variety of workloads. It's designed to take native advantage of today's hardware through vectorization and other low-level CPU techniques, and we have GPU work planned as well. I think we can go to the next slide. Gandiva is also more than just a fast execution engine. More importantly, it's a platform- and language-agnostic system. What I mean by this is that expressions from different languages are compiled into an expression tree. The Gandiva compiler takes that tree, performs compilation for the hardware platform it's currently running on, and then submits it to the Gandiva execution kernel.


The kernel will basically take batches of Arrow data as they're received from the source and execute on them directly with no other transformations. What's great about this model is we can extend it to many different languages. Today, out of the box, it includes C++ and Java bindings. We have published examples of people writing their own Gandiva-based functions in those languages, where you can write your own functions and incorporate them into the Gandiva execution engine. The model is extensible to other languages. We can extend this, for example, for data scientists to R and Python, et cetera. In the end, it all just becomes an expression tree that goes to the Gandiva compiler, and we can compile those operations to make native use of hardware resources. We have planned extensions beyond just the initial rollout, so expect more improvements on this going forward. We really think the extensibility of this platform is a major aspect of it.
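Dremio's use of Gandiva is internal and automatic, but the open source kernel can also be exercised directly. As a hedged illustration of the expression-tree model, here is a small example using the Python bindings that ship with Apache Arrow (pyarrow.gandiva), assuming your pyarrow build includes Gandiva; the exact API may differ across versions.

```python
# Illustration of Gandiva's expression-tree model via Apache Arrow's Python
# bindings. Requires a pyarrow build that includes Gandiva; API details may
# vary by version. This exercises the open source kernel directly and is not
# how Dremio invokes it internally.
import pyarrow as pa
import pyarrow.gandiva as gandiva

# A small Arrow record batch with two int64 columns.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3], type=pa.int64()),
     pa.array([10, 20, 30], type=pa.int64())],
    names=["a", "b"],
)

# Build an expression tree for "a + b" and compile it into a projector.
builder = gandiva.TreeExprBuilder()
node_a = builder.make_field(batch.schema.field("a"))
node_b = builder.make_field(batch.schema.field("b"))
add_node = builder.make_function("add", [node_a, node_b], pa.int64())
expr = builder.make_expression(add_node, pa.field("a_plus_b", pa.int64()))
projector = gandiva.make_projector(batch.schema, [expr], pa.default_memory_pool())

# Evaluate the compiled expression directly on Arrow data, no conversions.
result, = projector.evaluate(batch)
print(result)  # -> [11, 22, 33]
```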

Next slide. We also have a variety of optimizations that we put into Gandiva; these are just some of the techniques, and you can go to the website for even more. The first is what we call pipelining and null decomposition. For data within relational systems, there's both the data value and a null bit as well: is the value valid, or is it null? That actually introduces a lot of branching complexity within CPUs, which greatly slows down operations; branches are highly expensive in modern processors. What we've done in Gandiva, as one of the many optimization techniques, is separate column validity from the column data for execution. This really optimizes CPU pipelining and avoids branches in the CPU execution.

It enables us to take full advantage of SIMD and vectorized operations. We separate data and validity, perform operations in parallel, and then combine the results. This is one of the many things that enable one to two orders of magnitude of improvement on the compute side. We're also going to be putting up some user guides on ways to look at job profiles, so that it's easy to analyze a given workload and see how Gandiva has helped, which operations are being executed in Gandiva, et cetera. We'll be providing a variety of information to help people analyze both the impact and the usefulness of it.
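As a rough, framework-agnostic illustration of the null-decomposition idea (this is NumPy, not Gandiva code): the values and the validity bitmap are processed separately, so there is no per-element branch on nulls in the hot loop.

```python
# Rough NumPy illustration of null decomposition (not Gandiva code): data
# values and validity bits are processed separately, so the arithmetic runs
# branch-free and vectorized, and nullness is combined with a bitwise AND.
import numpy as np

values_a = np.array([1, 5, 7, 2], dtype=np.int64)
valid_a  = np.array([1, 1, 0, 1], dtype=np.uint8)   # 1 = value present, 0 = null
values_b = np.array([3, 9, 4, 8], dtype=np.int64)
valid_b  = np.array([1, 0, 1, 1], dtype=np.uint8)

out_values = values_a + values_b   # vectorized arithmetic on every slot
out_valid  = valid_a & valid_b     # result is null wherever either input is null

print(out_values)  # [ 4 14 11 10]  (slots marked invalid below are ignored)
print(out_valid)   # [1 0 0 1]
```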

Lucio Daza:

Great. Just a quick recap for the audience and for us as well. So far, we have talked about SSO, Automatic Virtual Dataset Updates, and Online Cluster Maintenance. We also talked about Dremio Hub, and now we are discussing Gandiva. None of these features are in preview, right? I remember with Gandiva in our previous release, people had to get in touch with Dremio for us to enable it for them. Now, are any of these features in preview, or are they available right out of the box in 3.3?

Tom Fry:

That's a great question. All the features we've talked about so far are fully out of preview and generally available.

Lucio Daza:

Excellent, thank you. Now let's go ahead and talk about Reflection Insights and Filtering.

Tom Fry:

Sure. A little bit of background: Reflections are a key optimization within Dremio. With Reflections, users are able to easily, with the click of a button, specify datasets that they'd like to optimize. Dremio will automatically perform all the optimizations required, including pre-fetching data from remote systems and pre-computing operations of interest. With Reflections, users see a significant increase in performance, while administrators, at the same time, see a reduced load on the external systems, because operations are offloaded from the data source and performed fully within Dremio. Because of this, many users make heavy use of Reflections and define a very large number of them in their Dremio configurations. As powerful as that is, it does introduce some administrative complications. For example, many administrators want to be able to quickly see which Reflections are disabled, or which Reflections have an invalid configuration or might need some maintenance. We have some users, for example, with several thousand Reflections defined in their system. This is because of the power that Reflections provide.

With that large a number of Reflections, identifying which ones might need attention or how much each one costs can be a little challenging. We overhauled Reflection management to make it easier in 3.3. This includes a variety of different capabilities. The first is the ability to search for Reflections by name, data source, or folder. You can really narrow down: "Maybe this is the space I'm interested in; let me just look at the Reflections there." We also made it easy to search for Reflections that need attention, for example, Reflections that are disabled, or Reflections that cannot be rebuilt and are currently offline. Before, you had to scroll through a large list of Reflections; now it's very easy to filter on the status of different Reflections and say, "Show me all the ones that are disabled. Show me the ones that had a failure recently." Administrators can quickly identify Reflections that need some attention.

We also made it possible to easily see the cost of a Reflection in terms of the size and space used in storage, so that administrators can quickly identify which Reflections are consuming the most space and, more importantly, remove expensive but underutilized Reflections. This has been a common source of feedback that we've heard from a lot of people, and it's been a key area where users have expressed interest in improved administration. We think it'll significantly help people.

Lucio Daza:

Excellent. Let me pick your brain a little bit. We have this Reflection Insights and Filtering, all these UI changes; where are they going to be located? I am sorry to put you on the spot, but I want to point the audience to where inside Dremio they're going to be able to see this. You go to the Admin panel, and then you will see your Reflections there; that's where I'm going to be able to see all these actions for each one of them, correct?

Tom Fry:

That's correct. In previous releases, in the administration pages, there were tabs specific to Reflections. We actually just overhauled those pages. It's the exact same path that you used before; we've just added a lot of new capabilities to the existing pages.

Lucio Daza:

Excellent. Excellent. Thank you. Just to double-check again: this feature is also out of the box, right? There's no preview behind it.

Tom Fry:

Yes, it's right there.

Lucio Daza:

Excellent. Let's go ahead and talk about this. I believe we're talking now about MinIO S3-compatible storage.

Tom Fry:

That's right. It's not just MinIO. To provide a little context here, the S3 API has really become an industry standard. Many storage systems today support Amazon's S3 API and describe themselves as S3-compatible. A common interest we've heard from a lot of customers and users has been to connect additional storage systems that are S3-compatible and support AWS's S3 API. With 3.3, Dremio includes the ability to connect to storage systems that are compatible with Amazon's S3 API the same way it would connect to the Amazon S3 storage system itself. To do so, you use the existing S3 connector. It now includes something that we call an S3 compatibility mode, which you can see highlighted here. When enabled, this allows Dremio to communicate with systems that are S3-compatible but not Amazon S3. Essentially, what you do is enable the compatibility mode and then enter the URL for the source. We've tested this and confirmed that it works with MinIO and a couple of other systems, for example, EMC ECS Object Storage.


It really can be used with any other S3-compatible storage system. We're listing it here as experimental, as you can see; that's largely because we haven't had the chance to test it with every single S3 object store. We're definitely willing to work with people as they have interest in different S3 systems that have basic support for the Amazon S3 API, and we've seen pretty good success with this. Additionally, this can be used for configuring distributed storage for internal objects, for example, uploads or Reflections, et cetera. This now enables on-premises customers, especially, to specify an S3-like storage system to store all of Dremio's internal data. It really expands the ways that you can configure Dremio.
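For reference, here is a hedged sketch of registering an S3-compatible source such as MinIO through the catalog REST API. The login flow and the /api/v3/catalog endpoint are the standard ones, but the specific config field names (compatibilityMode, propertyList, and the fs.s3a.* properties) are assumptions based on the S3 source settings; confirm them against the API documentation for your version.

```python
# Hedged sketch: registering a MinIO (S3-compatible) source via Dremio's
# catalog REST API. The config field names (compatibilityMode, propertyList,
# fs.s3a.*) are assumptions; confirm against the API docs for your version.
import requests

DREMIO = "http://localhost:9047"

# Standard REST login to obtain a session token.
token = requests.post(f"{DREMIO}/apiv2/login",
                      json={"userName": "admin", "password": "secret"}).json()["token"]
headers = {"Authorization": f"_dremio{token}"}

source = {
    "entityType": "source",
    "name": "minio",
    "type": "S3",
    "config": {
        "accessKey": "<minio-access-key>",
        "accessSecret": "<minio-secret-key>",
        "secure": False,
        "compatibilityMode": True,  # the "S3 compatibility mode" toggle
        "propertyList": [
            {"name": "fs.s3a.endpoint", "value": "minio.example.com:9000"},
            {"name": "fs.s3a.path.style.access", "value": "true"},
        ],
    },
}
resp = requests.post(f"{DREMIO}/api/v3/catalog", headers=headers, json=source)
print(resp.status_code, resp.json())
```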

Lucio Daza:

Great. I believe you just mentioned this, but just to double-check: you mentioned that not only Amazon S3 would be supported. Are there any other S3 storage systems supported with this feature?

Tom Fry:

We've tested MinIO and EMC ECS Object Storage here, and we've done some fairly light internal testing with some other systems as well. We really take them on a case-by-case basis. If there's something of interest, please reach out to us, and we're happy to work with you on it. There are a lot of systems out there, somewhat like what we do with Dremio Hub for relational JDBC-type sources. This should be compatible with most things that advertise themselves as S3-compatible, but it's largely a case-by-case basis. We've had a lot of success with the initial systems that we've connected to. Definitely, if you're interested in this, reach out to us; we're happy to talk to you about other systems as well. They should require no changes, but it will be a case-by-case basis, depending on the API support of the system itself.

Lucio Daza:

In the case of MinIO, and I hope I got that name right... This is what happens when you have a five-year-old who does nothing but watch the Minions: any time I see this word, I want to call it Minions. In this case, we have MinIO. Is it going to be supported for file uploads and Reflections as well? Can I use it to store those?

Tom Fry:

Yeah. Exactly, it can. It can be configured as a distributed store option for uploads, Reflections, all of those options, same as regular S3 storage.

Lucio Daza:

Great. Right. I think this covers it all, unless I'm missing one or two. There are some additional changes; I'm sorry, Tom, I think we have some more here that you would like to cover as well.

Tom Fry:


There's quite a bit that we put into the release, more than we could discuss within a webinar. Feel free to go to our website; we have dozens of improvements and fixes that are beyond the scope of the conversation today, and you can see them in the Release Notes. A couple of other things may be of interest to people. We announced Predictive Pipelining for ADLS and S3 in the last release. This improves the ability to pipeline the reading of data from cloud storage systems with very high latencies. We've made a variety of improvements to that and have seen further reductions in query times on S3. For example, one thing we'll do is start to read the next file ahead. We have a variety of other improvements there. We expanded our support for partition pruning; for example, if you're using a LIMIT operator, we'll be able to utilize that to more aggressively prune the number of partitions that we read.

We have some additional improvements to our Helm Charts. Something that may look small, but that I've heard a lot of people express interest in, is that Dremio now supports submitting queries terminated by a semicolon. Before, you'd get a parse error; today, you can paste SQL from other systems that has a semicolon, and that'll be fine. I know there have been a lot of people interested in even minor things like that. There is a long tail of improvements and fixes, many dozens of things. Feel free to go to our Release Notes or reach out to us.

Lucio Daza:

Excellent. Hey, a semicolon can be life-changing when writing a query, man. You never know. It's great. This is good. Now we have a couple of minutes left for Q&A. Obviously, we are not going to answer everything due to time, but we're going to try to cover some of the questions that we have here. While Tom combs through the questions, I want to remind everyone to go ahead and check the Release Notes. If there is anything that we didn't cover here that you would like to know, you can always go back to the Release Notes, or you can go to our website. We're going to have this webinar, the slides, the recording, as well as a transcript of the conversation that Tom and I had today available for you. You can go ahead and rewind and pause and double-check everything that we talked about today.

In addition to that, I want to invite our audience to participate in Dremio University. We're constantly adding new courses there. It's free to try, free to register, and free to play around with. If you register for the Dremio Fundamentals, Data Reflections, or Dremio for Data Consumers courses, you will have the opportunity to launch a virtual lab, which is going to be your private lab with Dremio in it, without the need to install anything. You can play with your lab and follow the exercises, and when you complete that, you can go ahead and capture and print your completion certificate. Of course, go to our deploy page; you're going to find there not only the binaries, but also the templates and other resources to deploy and install Dremio. Tom, we have a couple of questions here that I would like to bring to your attention. It looks like the first one is: do you have support for access tokens when using the REST API?

Tom Fry:

I believe that's referring to our Personal Access Tokens that we were talking about before as part of the Single Sign-On package. The answer is yes, we do. When you log into the UI, that's where you go through the single click to log in with your identity provider. All of the other endpoints that access Dremio, and that includes ODBC and JDBC connections and even the REST API, support Personal Access Tokens for access. This is actually a really great feature. For example, think about automated tools that might want to connect to Dremio's REST API. Instead of having to store a password, what you can do is use a Personal Access Token instead. You go through the normal authentication process on the REST API, but instead of using a username and password, you provide just the Personal Access Token.

A username is automatically associated with that, and you go through the normal process to log into the REST API. From there, you get a session token and use that. We have customers, for example, that put together some pretty interesting configurations with that, where they store Personal Access Tokens in, for example, a Key Vault or a secrets manager. You can have a cron job or some type of automated tool go to that secrets manager, get a Personal Access Token, use that to log into the REST API, and do a whole variety of interesting things. Yes, the REST API is fully covered within that path as well.
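Here is a hypothetical sketch of the flow described above. The endpoint paths are the standard ones, but the idea of sending the Personal Access Token in place of the password at login is an assumption drawn from the description here; verify the exact mechanism against the Dremio REST API documentation for your version.

```python
# Hypothetical sketch: using a Personal Access Token with the REST API instead
# of a stored password. Sending the PAT in the password field at login is an
# assumption based on the flow described above; check the REST API docs.
import os
import requests

DREMIO = "http://localhost:9047"
PAT = os.environ["DREMIO_PAT"]   # e.g. fetched from a secrets manager / Key Vault

# Authenticate with the PAT rather than a stored password.
resp = requests.post(f"{DREMIO}/apiv2/login",
                     json={"userName": "automation-user", "password": PAT})
resp.raise_for_status()
session_token = resp.json()["token"]

# Use the session token for subsequent calls, e.g. submitting SQL.
headers = {"Authorization": f"_dremio{session_token}"}
job = requests.post(f"{DREMIO}/api/v3/sql", headers=headers,
                    json={"sql": "SELECT 1"})
print(job.json())  # contains the job id for the submitted query
```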

Lucio Daza:

Beautiful. Thank you. We have another question; I believe this is coming from Jason, and I want to say it is in relation to the Automatic Virtual Dataset Updates. His question is: how does this affect the Reflections on those VDSs? Do Reflections change and reload? I think the question is, if you have a Reflection that depends on the VDS, how is it going to be impacted?

Tom Fry:

That's a really great and astute question. What will happen is that when the dataset changes, the Reflection on that dataset is essentially temporarily invalidated. If the query that was submitted can still be satisfied by the existing Reflection, we will make use of it; if not, you have to wait for the Reflection to rebuild. Think, for example, of selecting all the rows or all the columns within a given table. If a new field is there, it obviously won't be in the Reflection, and the Reflection will have to be rebuilt. In those cases, we'll go to the source, push down operations, and retrieve that new field. If, for example, you're selecting a subset of data that was stored in the Reflection, we'll be able to make use of that. This process should be fully invisible to users; it's something that is kicked off automatically within Dremio. Obviously, if there's new data, we have to go and fetch that data, but it's not something that needs to be managed manually. It will happen for you automatically behind the scenes.

Lucio Daza:

Great. Thank you. That covers that. I think we have time for one more question. The question here says: what is your plan to support Google Cloud Platform? What do we have going on in that world?

Tom Fry:

That's a great question. We support all the different cloud environments. Dremio is capable of running both in VMs and within Kubernetes on each of the different platforms. Though we offer out-of-the-box Kubernetes setups, for example, on AKS in Azure and EKS in AWS, it is more than possible to do so within Google as well. We actually do so for a lot of our internal systems; we run Dremio on Kubernetes in GCP. That is definitely possible. If you see some challenges with those configurations, of course, feel free to reach out to us. But we work in all the different clouds.

Lucio Daza:

Awesome. All right. I think this wraps up the webinar that we have prepared for today. Then again, if you arrived late, or for some reason you had to leave and come back, you can always find the recording and the transcript for this webinar on our website in a couple of days; we'll have it ready for you. I want to thank Tom. Thank you so much for all your insights, for the announcements, and for all the good explanations of all these amazing features. It was great having you here today with me and the rest of the audience as well.

Tom Fry:

Thanks so much, Lucio. This was a great conversation. We hope everyone enjoyed it.

Lucio Daza:

Looking forward to the next releases. I know more stuff is coming. To the audience, please stay tuned; we have a lot of good stuff coming. Keep an eye on our blog and our tutorials. As always, if there is anything that you would like to learn about Dremio, don't hesitate to send us an email. We are always eager to keep documenting new tutorials and new topics that can help make your life easier when you're trying to overcome your data challenges. Then again, my name is Lucio Daza, and I had Tom Fry here with me today. I hope everyone has a wonderful day. I'll talk to you soon. Bye-bye.