
Announcing Dremio 3.0

Why We’re Excited About Dremio 3.0

It’s been a little less than 18 months since we first launched Dremio and less than 3 years since we worked with others in the open source community to create the Apache Arrow project. In that time we’ve seen countless companies start using Dremio, with more added every day. We’ve also seen stellar adoption of Apache Arrow and great community growth, with downloads recently approaching 1 million per month. Together, we believe these technologies allow us to deliver on our mission: make it easier to access and work with data.

Today, I’m happy to announce Dremio 3.0, a further step in our mission. With the addition of Gandiva, we’ve delivered even more powerful, user-friendly access to Apache Arrow technology. We’ve expanded our support for complex enterprise security, operational, and governance concerns, adding several new capabilities, including powerful workload management. And, as always, we keep adding new connectors (including Teradata) to make sure that wherever your data is, it is available within Dremio.

Integrated data catalog adds order to data chaos

Where do you find data in your company? Most organizations don’t have an inventory of their data assets, which makes taking that first step very challenging. We consistently hear from users that they want to begin their work with a simple Google-like search to find datasets from across their physical sources as well as the virtual datasets built in Dremio.

In this release we’ve built on Dremio’s sophisticated schema learning capabilities so that data stewards can easily tag datasets, simplifying how they are organized and discovered by data consumers. We’ve also added built-in wiki pages for datasets, spaces, and sources, so users can capture tribal knowledge about their datasets, such as whom to ask questions, how often the data is updated, which sources of data make up the dataset, and screenshots of reports and visualizations that use the dataset. All of this information goes into Dremio’s searchable index so it’s easy for users to access.

We believe that catalog features belong in the execution layer, as this allows us to govern access, mask data, and provide row- and column-level access controls at query time. For standalone catalogs, it is difficult to control access for every tool. We’re excited about future enhancements in this area, such as column-level catalog abilities.

Orchestration via Kubernetes

Companies want to make it easier to provision, upgrade, and scale their deployments flexibly and reliably. Ultimately they want a serverless model for Dremio, and this is something we are very focused on developing for next year (we’re hiring!). We also see opportunities to deliver incremental capabilities along the way.

In this release, we are making Helm Charts available for Dremio, building on the official Docker container we released earlier this year.

If you’re new to Helm Charts, you can think of them like a recipe for Chef: a template-driven configuration for provisioning systems through Kubernetes. For example, you can scale up the number of executors in your Dremio cluster with a single command:

helm upgrade 3.0 dremio --set executor.count=75

Helm charts join our existing capabilities for managing Dremio in a Hadoop cluster via YARN. For more details, there’s now a tutorial for using Dremio with Kubernetes and Helm. We already have customers using Dremio on the Kubernetes services of each major cloud: Amazon EKS, Azure AKS, and Google GKE.

Big performance improvements from Gandiva

Performance is a key focus in every release. Earlier this year we announced the Gandiva Initiative for Apache Arrow, which builds on LLVM JIT compilation to make operating on Arrow buffers as efficient as possible. Dremio 3.0 is our first release that makes the benefits of Gandiva available to users. (We shipped 3.0 with this feature off by default so that users can opt into testing this new kernel – if you’re interested, please send a note to preview@dremio.com.)

The benefits of Gandiva can be quite striking in some contexts. For example, we worked with an early tester on a complex query that was improved by over 70x. We still have a lot of work to do to provide 100% coverage under this new engine, but for now many queries can be optimized with Gandiva, and those that cannot will automatically compile through our existing Java-based engine.

In addition, we view Gandiva as the optimal way to create UDFs for Dremio and other systems built on Apache Arrow, and we have a post explaining how you can build your own. With over a million downloads of Arrow each month, the work you put into a UDF can reach far beyond Dremio.

Multi-tenant workload controls ensure quality of experience

Another key capability we’ve added in Dremio 3.0 is workload management controls for multi-tenant environments. Companies want to run mixed workloads with different SLAs on a common pool of resources. (We are making this capability available as part of Dremio Enterprise Edition as a preview feature, so email preview@dremio.com to try it out.)

With Dremio 3.0 you can now assign jobs to resource queues, with fine-grained control of CPU, memory, concurrency, queue depth, runtime limits, and enqueued time limits. Jobs are assigned to rules based on query-time factors such as user identity, LDAP group membership, job type, query plan cost, or any combination of these.

In working with customers, we found real interest in maximizing the flexibility of expressing these conditions without adding massive complexity to the interface. One of the data engineers we interviewed suggested using SQL to define the rules. We thought that was a great idea, and now you can, for example, route power users issuing expensive queries over ODBC to a priority queue with an expression like the following:

(
  USER IN ('JRyan', 'PDirk', 'CPhillips')
  OR is_member('superadmins')
)
AND query_type IN ('ODBC')
AND query_cost > 3000000
AND EXTRACT(HOUR FROM CURRENT_TIME) BETWEEN 9 AND 18

These rules can be used to reject jobs from queues as well, returning a custom error message to the user over ODBC, for example.
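As a rough sketch of what a rejection rule could look like (the cost threshold and group name here are hypothetical), the same condition syntax might be used to turn away extremely expensive ODBC queries from non-admin users:

query_type IN ('ODBC')
AND query_cost > 50000000
AND NOT is_member('superadmins')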

Making Dremio more secure for critical deployments

Dremio provides its own row- and column-level access controls for data from any source we connect to, providing a layer of flexible security that is especially useful for sources that don’t support this level of control. With Dremio 3.0 Enterprise Edition, we’ve added the option to integrate with Apache Ranger to simplify the administration of table-level access enforcement policies across the Hadoop ecosystem. You can follow a tutorial for configuring the integration.
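Because datasets in Dremio are defined in SQL, a row-level policy can be expressed directly in a virtual dataset definition. The following is only a sketch, with hypothetical dataset, column, and group names, reusing the is_member() function shown in the workload rules above:

SELECT *
FROM sales.transactions              -- hypothetical physical dataset
WHERE region = 'EMEA'                -- regular users see only EMEA rows
   OR is_member('finance_admins')    -- members of this group see all rows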

In this release, we’ve also added a long-requested security feature: TLS for ODBC and JDBC connections, as well as TLS encryption of traffic between the nodes of a Dremio cluster. You can read about how to configure these options in the docs.

More relational connectors, faster iterations

In earlier versions of Dremio each connector was developed on an independent code path. With 3.0 we have developed an all-new declarative framework (ARP) for building relational connectors. This allows us to standardize on a single code base that is more efficient, provides better push-down capabilities, and is easier for us to maintain. More importantly, it allows us to develop new connectors more quickly, starting with Teradata, which is new in this release for users of Enterprise Edition.

The next horizon for data sources is applications like Salesforce. Many of these systems have off-the-shelf ODBC/JDBC drivers we can use to build connectors, and ARP makes these connectors fast for us to develop and relatively lightweight to maintain. Next year we plan to accelerate the availability of connectors developed by Dremio, as well as those developed by members of our community.

Parallel exports open the door to several new use cases

We designed Dremio around the concept of virtual datasets, a way to define new datasets as derivatives of physical datasets using standard SQL. Dremio’s user interface gives users the option to build virtual datasets visually or using our SQL console. Virtual datasets are a great way to provision data for users without making copies, while ensuring secure access, including masking sensitive data.
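As a sketch of what that masking can look like (the dataset, column, and group names here are hypothetical), a virtual dataset might expose customer records while hiding email addresses from everyone outside an authorized group:

SELECT
  customer_id,
  region,
  CASE WHEN is_member('marketing_admins') THEN email
       ELSE '***masked***' END AS email
FROM marketing.customers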

Sometimes it makes sense to materialize a virtual dataset to a filesystem for handoff to a different system, or as an intermediate result set in a multi-step process. In Dremio 3.0 you can now use familiar CTAS syntax to save a dataset as Parquet to S3, ADLS, HDFS, MapR-FS, and NAS or other locally attached storage. This feature uses the same engine as our Data Reflections, so you have the same options for sorting and partitioning your data to speed up access. We also embed key metadata in the Parquet footers that Dremio’s SQL engine can use to minimize what we read from disk, providing the fastest possible access.
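For example, assuming an S3 source named s3 and a virtual dataset in a space named marketing (both names are hypothetical), the export is a single statement:

CREATE TABLE s3.analytics_bucket.exports.daily_summary AS
SELECT region, SUM(amount) AS total_sales
FROM marketing.transactions
GROUP BY region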

In prior releases Dremio supported CTAS, but writes went to a single $SCRATCH directory on the file system that was open to all users and hidden (not searchable). In Dremio 3.0, you can control where data is written and secure the location of the exported data using the underlying controls of the storage layer. In addition, you can enable or disable this ability on a per-source basis. One note of caution: with this new capability you now have DROP semantics, which will delete physical data. This means users can delete data from the source, but only if they have the underlying permissions to do so.
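Continuing the hypothetical example above, dropping the exported table removes the underlying Parquet files, so it’s worth reviewing the storage-layer permissions on these locations before enabling exports on a source:

DROP TABLE s3.analytics_bucket.exports.daily_summary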

Wrapping up

We’re very excited about this release and look forward to your feedback. If you’d like to hear more about these features please join us on November 7th for a deep dive on Dremio 3.0. Please post questions on community.dremio.com and we’ll do our best to answer them there, along with other members of the Dremio community.