Announcing the Data Lake Engine (Dremio 4.0)

Today we are excited to announce the release of Dremio’s Data Lake Engine! Learn more in the technical deep dive on demand.

Dremio’s Data Lake Engine delivers lightning-fast query speed and a self-service semantic layer operating directly against your data lake storage. There is no need to move data into proprietary data warehouses or to create cubes, aggregation tables and BI extracts. Data Architects get flexibility and control, and Data Consumers get self-service.

This release, also known as Dremio 4.0, dramatically accelerates query performance on S3 and ADLS, and provides deeper integration with the security services of AWS and Azure. In addition, this release makes it simpler to query data across a broader range of data sources, including multiple lakes (with different Hive versions) and sources reached through community-developed connectors offered in Dremio Hub.

Columnar Cloud Cache (C3)

Dremio’s Data Lake Engine introduces new capabilities that accelerate SQL queries directly on data lake storage (S3 and ADLS). S3 and ADLS are attractive because they are infinitely scalable, inexpensive ($20/TB/month) and easy to use. In addition, with the separation of compute and storage, organizations are now able to choose best-of-breed tools and systems for accessing and processing data. As a result, organizations are rapidly expanding their usage of data lake storage, and for most organizations S3 and ADLS have become the primary data stores in the cloud.

Unfortunately, the separation of compute and storage (i.e., remote data lake storage) naturally introduces performance challenges, such as higher latency and variability in response times. As a result, many organizations find that they need to extract data from data lake storage into data warehouses and marts in order to achieve the query performance that analysts and data scientists require.

What if you could have the infinite scalability and low cost of S3 and ADLS combined with the performance of local NVMe, thus making it possible to achieve lightning-fast query speed directly on S3/ADLS? The new Columnar Cloud Cache (C3) in Dremio 4.0 makes this a reality. C3 is a real-time distributed caching technology that leverages the local storage (typically NVMe) of EC2 instances and Azure Virtual Machines to drastically increase read throughput while simultaneously reducing network traffic.

Columnar Cloud Cache (C3) intelligently caches commonly accessed data on Dremio nodes to keep data close to compute. It is fully automatic with zero administration or user involvement required. Unlike traditional database caches which utilize DRAM, C3 utilizes local storage to provide a scalable cache that significantly accelerates most Big Data workloads.

While many caching technologies introduce downsides – such as stale data, increased load on the system and administrative burden – C3 does not suffer from any of these issues. First, users’ queries are guaranteed to see the live data on S3/ADLS (any new/modified bytes will be fetched from S3/ADLS directly). Second, the cache is populated as an “asynchronous side-effect” of accessing the data on S3/ADLS for the first time, leveraging Dremio’s Predictive Pipelining to overcome high latency and read the data from S3/ADLS at wire speed. Third, there is no involvement from the system administrator: there are no knobs to turn and no explicit steps to hydrate or manage the cache.
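The first guarantee above (cached reads never return stale data) can be illustrated with a small read-through cache that checks an object's version before serving a local copy. This is a minimal sketch and not Dremio's implementation: `RemoteStore`, `ValidatingCache` and the integer version counter are hypothetical stand-ins for object storage and its change metadata, and the cache here is filled synchronously rather than as an asynchronous side effect.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RemoteStore:
    """Hypothetical stand-in for S3/ADLS: maps key -> (version, data)."""
    objects: Dict[str, Tuple[int, bytes]] = field(default_factory=dict)
    reads: int = 0  # number of full-object fetches served

    def put(self, key: str, data: bytes) -> None:
        # Every write bumps the version, so readers can detect changes.
        version = self.objects.get(key, (0, b""))[0] + 1
        self.objects[key] = (version, data)

    def version(self, key: str) -> int:
        # Cheap metadata lookup; does not count as a data read.
        return self.objects[key][0]

    def get(self, key: str) -> Tuple[int, bytes]:
        self.reads += 1
        return self.objects[key]


class ValidatingCache:
    """Read-through cache that serves local data only if it is current."""

    def __init__(self, store: RemoteStore) -> None:
        self.store = store
        self.local: Dict[str, Tuple[int, bytes]] = {}

    def read(self, key: str) -> bytes:
        current = self.store.version(key)
        cached = self.local.get(key)
        if cached is not None and cached[0] == current:
            return cached[1]  # hit: local copy matches the live version
        # Miss or stale copy: fetch from the store and populate the cache.
        version, data = self.store.get(key)
        self.local[key] = (version, data)
        return data
```

In this sketch a repeated read of an unchanged object is served locally, while a read after the object is rewritten goes back to the store, so the reader always sees the latest bytes.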

Additionally, unlike standard caches which simply cache data blocks based on number of accesses, Columnar Cloud Cache (C3) includes unique capabilities to selectively cache data based on SQL query patterns, workload management and file structures to optimize what to store and evict.

With C3, users automatically experience a significant improvement in overall query response times, typically 2-10x, with no changes required. At the same time, organizations reduce operational costs because less data is read from data lake storage. C3 is currently supported only for columnar data on S3, ADLS and S3-compatible stores; support for HDFS will be available soon.

Multi-Cluster Isolation

Many companies are using Dremio to serve multiple use cases and departments on a regular basis. Dremio 4.0 extends existing workload management capabilities with the ability to fully isolate the resources used by different workloads, thereby preventing one set of queries from impacting the performance of another set of queries. With Multi-Cluster Isolation, users can configure multiple execution clusters within a single Dremio deployment.

All execution clusters within a single deployment share the same unified catalog, which enables organizations to define a common data model/semantic layer. As a result, organizations can now avoid having multiple Dremio deployments (i.e., silos). In addition, it is now possible to support higher concurrency workloads.
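The idea of isolating workloads on separate execution clusters while sharing one catalog can be sketched as a simple routing table. This is only an illustration under stated assumptions: `ClusterRouter`, the workload labels and the cluster names are hypothetical and do not correspond to Dremio configuration options.

```python
from typing import Dict, List


class ClusterRouter:
    """Toy router: each workload label maps to one execution cluster."""

    def __init__(self, routes: Dict[str, str], default: str) -> None:
        self.routes = routes      # workload label -> cluster name (hypothetical)
        self.default = default    # cluster used for unlabeled workloads
        self.assigned: Dict[str, List[str]] = {}

    def route(self, query_id: str, workload: str) -> str:
        # Queries from one workload never land on another workload's cluster,
        # so a heavy workload cannot consume a different cluster's resources.
        cluster = self.routes.get(workload, self.default)
        self.assigned.setdefault(cluster, []).append(query_id)
        return cluster


router = ClusterRouter(
    routes={"dashboards": "cluster-bi", "data-science": "cluster-ds"},
    default="cluster-general",
)
```

With this table, a long-running `data-science` query is placed on `cluster-ds` and cannot slow down `dashboards` queries running on `cluster-bi`, while both would still resolve datasets through the same shared catalog.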

Data Reflections on Cloud Data Lake Storage (S3/ADLS)

Reflections enable Dremio to significantly accelerate operations and offload requests from external systems. They do this by pre-extracting data from an external database or data lake storage, pre-computing commonly performed calculations, and then loading the data and results into Dremio so they are immediately available for user queries. Reflections therefore serve two purposes: accelerating workloads and reducing the load on existing systems. Because they are so useful, many Dremio users create numerous Reflections.

Starting in Dremio 4.0, users can store Reflections in Cloud Data Lake Storage (S3 & ADLS) without any performance degradation. This lets users take advantage of the scalability, on-demand provisioning and low cost of cloud data lake storage to define and store any number of Reflections. As a result, users can accelerate any workload on demand without traditional storage space provisioning.

AWS & Azure Security

AWS & Azure offer a variety of security tools, methods and practices that organizations increasingly rely on to secure data access. In Dremio 4.0 we added several common AWS & Azure security tools and methods, including:

Configurable AWS S3 IAM Roles: The IAM Role used to access S3 data can now be configured. This allows permissions to be set on a per-source basis and finer-grained access rights to be defined. Configurable S3 IAM Roles rely on AWS AssumeRole capabilities.

AWS Secrets Manager: Supports new configuration options to store password credentials for connectors in AWS Secrets Manager instead of entering them within Dremio. This lets administrators centralize password management and store passwords in a single secure key vault. Supported for the Redshift, Oracle and PostgreSQL connectors.

Server-Side Encryption with AWS KMS Keys: Enables the use of AWS KMS customer master keys (CMKs) to encrypt Amazon S3 objects. KMS keys can be centrally managed and controlled, and audited to prove they are being used correctly.

Azure AD, Single Sign-On and OAuth: Full support for Azure AD, including Single Sign-On, drastically simplifies management, security and administration. Additionally, Dremio now supports the OAuth and OpenID Connect protocols, which can be used to configure most major Identity Providers.
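For readers unfamiliar with the Server-Side Encryption with AWS KMS Keys item above, the sketch below shows the request headers that the Amazon S3 REST API uses to ask for SSE-KMS on an upload. It illustrates the S3 side only; how Dremio sets these values internally is not described here. The helper name `sse_kms_headers` and the example key ARN are hypothetical.

```python
from typing import Dict, Optional


def sse_kms_headers(kms_key_id: Optional[str] = None) -> Dict[str, str]:
    """Build the S3 request headers that request SSE-KMS for an object.

    If no key ID is given, S3 falls back to the AWS managed key for S3
    in the account instead of a customer-managed key.
    """
    headers = {"x-amz-server-side-encryption": "aws:kms"}
    if kms_key_id:
        headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
    return headers


# Hypothetical key ARN, shown only to illustrate the shape of the value.
EXAMPLE_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
example_headers = sse_kms_headers(EXAMPLE_KEY_ARN)
```

Because the key is referenced by ID in each request, access to encrypted objects can be controlled and audited centrally through the key's policy and usage logs.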

Dremio Hub

Dremio Hub provides an accessible, easy-to-use listing of community-developed connectors to a wide variety of data sources, ranging from relational databases to SaaS applications. A Dremio administrator can download the desired connector and run the specified command to add it to the environment.

With the recently released SDK, developers can easily build new connectors and share them on Dremio Hub. Connectors can be built for any data source with a JDBC driver and are template based, making it simple to define new connectors without writing code.

Dremio Hub connectors have the same high-performance capabilities as native Dremio connectors, including Advanced Relational Pushdown, which executes complex SQL logic directly within the data source.

Multiple Hive Metastores

Another key capability we added is the ability to connect to multiple Hive metastore versions from the same Dremio instance, plus the ability to connect to Hive 3.1 as a new data source. In Dremio 4.0 both Hive 2 and Hive 3 data sources can be configured and queried at the same time, and datasets can be joined across Hive 2 and Hive 3. Additionally, Dremio supports the new transactional tables and ACID properties that Hive introduced in Hive 3.1.

AWS, Azure & On-Prem Editions

To tailor Dremio to popular environments, Dremio now offers editions optimized for major cloud providers and on-prem installations, with deployment options for both production use and evaluation. Simply select the environment and the type of deployment required, then launch Dremio with one click in your environment of choice.

Tracking Job Status & Copy Results

Users who submit queries through the Dremio UI often need to stop, edit and resubmit a SQL query, or to monitor the status of a query that is running. Previously, users accomplished this by navigating to the Jobs page and managing their running jobs there.

In Dremio 4.0, these activities can be accomplished without leaving the SQL query page. The SQL query page now shows the current runtime status of a query and provides the ability to stop and resubmit the query. This makes it much simpler for users to monitor and manage their own activities.

Additionally, query results can now be copied directly to a user’s clipboard and pasted into another application for additional exploration. Copied results are formatted in rich text format and optimized for pasting into Microsoft Excel while preserving columns and data formats.

Inbound Impersonation

Access to Dremio from client applications can now be configured to securely impersonate, or run with the authorizations of, another user. This enables scenarios where client applications always log on to Dremio with a centralized service account, but queries run with the authorizations and access rights of the end users on whose behalf they are submitted.
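The impersonation scenario above amounts to a policy check: a logged-on account may act as another user only if that pairing has been explicitly allowed. The sketch below is a hypothetical illustration of that check, not Dremio's API; `ImpersonationPolicy`, the account name `svc_bi` and the user names are invented for the example.

```python
from typing import Dict, Optional, Set


class ImpersonationPolicy:
    """Toy policy: which login accounts may act as which end users."""

    def __init__(self, allowed: Dict[str, Set[str]]) -> None:
        self.allowed = allowed  # login account -> users it may impersonate

    def effective_user(self, login_user: str,
                       target_user: Optional[str] = None) -> str:
        # No impersonation requested: run as the logged-on account.
        if target_user is None or target_user == login_user:
            return login_user
        # Impersonation requested: allow it only for approved pairings.
        if target_user in self.allowed.get(login_user, set()):
            return target_user
        raise PermissionError(
            f"{login_user} may not impersonate {target_user}"
        )


# Hypothetical setup: one BI service account acting for two analysts.
policy = ImpersonationPolicy({"svc_bi": {"alice", "bob"}})
```

Here a connection authenticated as `svc_bi` that requests `alice` would run with Alice's access rights, while a request to act as an unlisted user is rejected.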

Wrapping Up

We are very excited about this release and its capabilities and look forward to your feedback. For a complete list of additional new features, enhancements and changes, please review the release notes. As always, please post questions on our community site and we’ll do our best to answer them there, along with other members of the Dremio community. If you’d like to hear more about this release, please join the deep-dive webinar on-demand.