Announcing Dremio 3.3
Dremio 3.3 includes numerous new features that enhance the performance, security and administration of Dremio, providing faster time to insight and easier access to data. Highlights include Automatic Virtual Dataset Update, Single Sign-On, Online Cluster Reconfiguration and Maintenance, Reflection Insights and more. Additionally, we are excited to announce the GA release of Gandiva, which is now enabled as the default execution engine for Dremio. Gandiva, introduced last fall, is the first high-performance engine designed from the ground up to process data in Apache Arrow format and provides significant performance improvements.
Automatic Virtual Dataset Update
This release introduces new advanced capabilities to automatically track schema changes throughout the entire system, including Dremio-defined datasets, external database tables/views and files in external storage – and apply those changes in real-time to Virtual Datasets in Dremio.
This includes both data-type changes and column additions or deletions. For example, if a new column is added to an underlying Parquet file, this new column will automatically be added to the Virtual Dataset’s schema definition.
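As an illustrative sketch of the idea (not Dremio's internal implementation), automatic schema update can be thought of as merging the newly observed physical schema into the existing Virtual Dataset schema; all names below are hypothetical:

```python
# Illustrative sketch only: how an automatic schema update could reconcile a
# newly observed physical schema with a virtual dataset's current schema.
# Schemas are modeled as {column_name: type_name} dicts.

def merge_schema(current, observed):
    """Return an updated schema reflecting added columns, type changes, and drops."""
    updated = {}
    for col, typ in observed.items():
        if col not in current:
            updated[col] = typ           # newly added column
        elif current[col] != typ:
            updated[col] = typ           # data-type change propagates
        else:
            updated[col] = current[col]  # unchanged column
    # Columns absent from the observed schema were deleted upstream,
    # so they simply do not appear in the result.
    return updated

# A new column appears in an underlying Parquet file:
vds = {"id": "int64", "name": "utf8"}
parquet = {"id": "int64", "name": "utf8", "created_at": "timestamp"}
print(merge_schema(vds, parquet))
# {'id': 'int64', 'name': 'utf8', 'created_at': 'timestamp'}
```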
Automatic Virtual Dataset Update drastically simplifies administration and maintenance for large deployments by tracking and updating catalog changes for administrators and users automatically and with no direct user involvement. This capability is unique to Dremio – other data services require manual catalog changes to all tables and views as data definitions evolve.
Gandiva GA
Performance is a key focus for every release and Dremio continues to implement new features to further accelerate queries and speed time to results. Last year we introduced the Gandiva Initiative for Apache Arrow, the first execution engine designed to make processing Arrow buffers as fast and efficient as possible. In Dremio 3.0, we made Gandiva available to users in a Preview release for trial and testing. With Dremio 3.3, Gandiva is available for general use and is automatically enabled as the default engine for Dremio.
Gandiva is the first execution kernel optimized for efficient, high-performance processing of Apache Arrow data. It is Apache-licensed and available in open source on GitHub. Gandiva makes optimal use of underlying CPU architectures, is written in C++ for performance and uses runtime code-generation in LLVM for efficient evaluation of arbitrary SQL expressions on Arrow buffers. Performance improvements are striking, with complex analytical workloads seeing up to a 70x performance improvement from Gandiva.
Gandiva includes numerous performance enhancements, including Null Decomposition and Vectorization. With Null Decomposition, Gandiva tracks data validity separately from the actual values, which enables algorithms that significantly reduce CPU branching overhead and allow efficient SIMD processing on modern CPUs. Additionally, since Arrow memory buffers are already well organized for SIMD instructions, Gandiva uses Vectorization and CPU SIMD instructions to process larger batches of values as a single operation.
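The idea behind null decomposition can be sketched in plain Python (Gandiva itself does this in LLVM-generated code over Arrow buffers; the lists below are a simplified stand-in for Arrow's value and validity buffers):

```python
# Simplified illustration of null decomposition: validity flags are tracked
# separately from values, so the arithmetic kernel runs branch-free over the
# whole batch, and nulls are resolved afterwards with a cheap logical AND.

def add_columns(a_values, a_valid, b_values, b_valid):
    # Branch-free value computation: every slot is computed, including slots
    # that will turn out to be null (their computed value is simply ignored).
    out_values = [x + y for x, y in zip(a_values, b_values)]
    # Validity is resolved independently: a result slot is valid only where
    # both input slots are valid.
    out_valid = [va and vb for va, vb in zip(a_valid, b_valid)]
    return out_values, out_valid

values, valid = add_columns([1, 2, 3], [True, False, True],
                            [10, 20, 30], [True, True, False])
print(values, valid)  # [11, 22, 33] [True, False, False]
```

Because the value loop contains no per-element null checks, a vectorizing compiler can map it directly onto SIMD instructions, which is exactly the property Gandiva exploits.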
Gandiva is also the optimal method to create User-Defined Functions in Dremio and other systems built using Apache Arrow. Instructions to build your own Gandiva-based User-Defined Functions are available here. Arrow is widely adopted throughout the industry, so work to build Gandiva UDFs is far-reaching and not only for Dremio.
Single Sign-On & Azure AD
Another key capability we added in Dremio 3.3 is Single Sign-On (SSO) support, which provides a flexible method to integrate Dremio with existing identity management systems and offers numerous advantages for an organization, including:
User experience: With Single Sign-On, users can move securely between different systems without being interrupted to re-login to each system separately. SSO joins individual systems from a user’s perspective, making switching between applications seamless.
Security: With SSO, user credentials and access are governed directly by a central Identity Provider, instead of by the individual system a user is trying to access. This consolidates and centralizes identity management, which significantly reduces administrative overhead.
Full support for Azure AD is included and drastically simplifies management, security and administration. Simply configure Dremio to use Azure AD and Dremio will automatically integrate with it for identity management and security. Additionally, Dremio supports the OAuth 2.0 standard and can be configured with most Identity Providers that support it.
Personal Access Tokens
Building on SSO integration, Dremio also now supports Personal Access Tokens. Personal Access Tokens provide the ability to authenticate and log in to Dremio using SSO-configured tokens over ODBC, JDBC and even Arrow Flight endpoints. They offer security features such as built-in expiration and on-demand revocation for flexible administration.
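The two security features called out above, built-in expiration and on-demand revocation, can be illustrated with a small conceptual model; this is a sketch of the token semantics, not Dremio's implementation, and all names are hypothetical:

```python
# Conceptual model of personal-access-token semantics: tokens carry an
# expiry time (built-in expiration) and can be deleted at any moment
# (on-demand revocation). Not Dremio's actual implementation.
import secrets
import time

class TokenStore:
    def __init__(self):
        self._tokens = {}  # token -> (user, expires_at)

    def issue(self, user, ttl_seconds):
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user, time.time() + ttl_seconds)
        return token

    def validate(self, token):
        entry = self._tokens.get(token)
        if entry is None:
            return None                  # unknown or revoked token
        user, expires_at = entry
        if time.time() >= expires_at:
            del self._tokens[token]      # built-in expiration
            return None
        return user

    def revoke(self, token):
        self._tokens.pop(token, None)    # on-demand revocation

store = TokenStore()
t = store.issue("alice", ttl_seconds=3600)
print(store.validate(t))   # alice
store.revoke(t)
print(store.validate(t))   # None
```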
Online Cluster Reconfiguration and Maintenance
Dremio 3.3 comes with new capabilities that enable maintenance and configuration of Dremio execution clusters while they remain online. This enables Dremio to provide uninterrupted service during hardware or software maintenance activities, without impacting workloads or user activity.
This functionality is offered through the new ability to temporarily remove specific execution nodes from a Dremio cluster and then add them back later. While these nodes are removed, maintenance activities, such as upgrading the Hadoop version on a node, can be performed without impacting the rest of the Dremio cluster or user queries. By iteratively performing rolling updates, administrative maintenance tasks can be completed without any service disruption.
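The rolling-update pattern described above can be sketched as a simple loop; the function names here are hypothetical and stand in for whatever mechanism an administrator uses to remove and re-add executors:

```python
# Conceptual sketch of a rolling-maintenance loop: drain one executor at a
# time, perform maintenance on it, and return it to service before moving to
# the next node, so the cluster never loses more than one node at once.
# Names are illustrative, not Dremio's API.

def rolling_maintenance(nodes, do_maintenance):
    active = set(nodes)
    for node in nodes:
        active.remove(node)          # temporarily remove from the cluster
        assert active, "never drain the last remaining executor"
        do_maintenance(node)         # e.g. upgrade Hadoop on this node
        active.add(node)             # add back before touching the next node
    return active

upgraded = []
rolling_maintenance(["exec-1", "exec-2", "exec-3"], upgraded.append)
print(upgraded)  # ['exec-1', 'exec-2', 'exec-3']
```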
Column-Aware Predictive Pipelining Enhancements for ADLS and S3
In Dremio 3.2 we introduced Predictive Pipelining, which leverages our understanding of columnar file formats and analytic workload patterns to intelligently predict likely access patterns and to coalesce reads of nearby columns in columnar file formats such as Parquet and ORC. These optimizations delivered 2-4x faster query response times.
In Dremio 3.3 we further enhanced Predictive Pipelining for the Cloud Storage sources Azure ADLS and AWS S3 by optimizing for their unique latency profiles. For example, our testing shows that S3 typically buffers the first 1MB of the next 8MB block within a file, and that when reading through a file at high throughput, latency spikes occur at regular intervals (at the 9MB, 17MB, 25MB, etc. positions). By taking the unique behavior of Cloud Storage into consideration, Dremio is able to optimize data access to maximize bandwidth and performance.
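One core technique behind this, coalescing reads of nearby columns, can be sketched as merging byte ranges whose gaps are small enough that one larger request beats two round trips; the threshold below is illustrative, not Dremio's actual tuning:

```python
# Sketch of byte-range coalescing: nearby column-chunk ranges in a columnar
# file are merged into fewer, larger reads when the gap between them is below
# a threshold, trading a little wasted bandwidth for far fewer round trips.

def coalesce_ranges(ranges, max_gap):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_offset, _ = merged[-1]
            merged[-1] = (prev_offset, offset + length - prev_offset)
        else:
            merged.append((offset, length))
    return merged

# Three column chunks; the first two are close enough to fetch in one request:
print(coalesce_ranges([(0, 1000), (1100, 500), (8_000_000, 1000)], max_gap=4096))
# [(0, 1600), (8000000, 1000)]
```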
Reflection Insights & Filtering
Dremio 3.3 significantly simplifies management of reflections and provides administrators the ability to quickly search for reflections, identify the status and cost of reflections, and visualize reflections that might require attention, including:
- Fast Reflection Search: Simplified searching for reflections by dataset name, space or folder hierarchy
- Reflection Cost Insights: Understand the storage cost associated with each reflection
- Reflection Status Insights: Visibility into each reflection’s current status (e.g., whether it is stale), with simplified searching for reflections that require attention
We are very excited about the new capabilities in Dremio 3.3 and we hope you are too. As always, we look forward to your feedback. For the full list of new features, enhancements and changes, please review the release notes here, which also cover several dozen improvements and fixes. And as always, please post questions on our community site and we’ll do our best to answer them there, along with other members of the Dremio community.