Enabling Open Data Lakes with Dremio and Delta Sharing

May 26, 2021
Thomas Fry

As a contributor and sponsor of key open source Data Lake and Lakehouse projects, Dremio is excited to be a launch partner on the open source Delta Sharing initiative. Delta Sharing is the first open protocol to securely share data with anyone, anywhere. This opens a host of new possibilities for organizations to exchange data regardless of which tools they might be using.

Why Open Data Matters for Data Lakes & Lakehouses

Companies across all industries are seeking to gain more value from their data by democratizing data access. To do so, they are migrating away from traditional, siloed enterprise data warehouses and toward Data Lakes and Lakehouses, which make data more accessible throughout the organization and enable rapid discovery and value generation.

A key attribute of Data Lakes and Lakehouses is that data is stored in open source file and table formats, providing the foundation for an open data ecosystem. In a modern Data Lake architecture, companies are free to choose the right technology for a given task or workload. This is the opposite of the design philosophy behind data warehouses, whether on-prem or in the cloud, which are vertically integrated and proprietary. They limit organizations to the functionality provided by one vendor and create silos of data within that vendor’s proprietary stack.

The openness of cloud Data Lake and Lakehouse architectures provides three key benefits:

  1. Flexibility to use the best engine, service or tool for any task
    Companies are free to choose the right technology for each use case. This includes the ability to use different storage systems, file formats, table formats, processing engines (e.g., SQL engines), or catalogs such as Delta Sharing. For example, many of Dremio’s customers use Databricks and Dremio together: Databricks for some workloads (e.g., data processing, machine learning) and Dremio for others (e.g., BI) on the lake. At Dremio, we encourage companies to always choose the right tool for the job, as this results in the most successful and cost-effective solution.
  2. No vendor lock-in
    A key attribute of Data Lake and Lakehouse architectures is that individual components can be changed at any time as needs evolve and workloads change, without having to initiate migration projects to new systems or duplicate massive amounts of data. This flexibility is critical to organizations as they seek to gain value from their data.
  3. Future-proof
    Openness allows new technologies to be easily incorporated into existing Data Lake and Lakehouse deployments, so organizations can quickly take advantage of innovations and stay current with industry best practices.

Dremio Enables BI on Cloud Data Lakes

Dremio is a Data Lake service enabling BI tools to perform real-time interactive queries directly on Data Lakes and Lakehouses, whether data is accessed directly from the company’s own S3 buckets or Azure Storage account, or from a partner’s cloud storage account through the secure Delta Sharing protocol.
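For readers new to the protocol, the following is a minimal sketch of the recipient side of Delta Sharing, written with only the Python standard library. It assumes the provider has issued a JSON profile containing an endpoint and a bearer token, and it builds (but does not send) the authenticated request that lists the available shares. The endpoint, token, and helper names are placeholders chosen for illustration; this is not a description of how Dremio implements its integration.

```python
import json
import urllib.request

# Placeholder profile content; a real profile is issued by the data provider.
EXAMPLE_PROFILE = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.example.com/delta-sharing/",
  "bearerToken": "REPLACE_WITH_TOKEN"
}
"""

REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def load_profile(text: str) -> dict:
    """Parse a sharing profile and check that the expected fields exist."""
    profile = json.loads(text)
    for key in REQUIRED_FIELDS:
        if key not in profile:
            raise ValueError(f"profile is missing required field: {key}")
    return profile


def build_list_shares_request(profile: dict) -> urllib.request.Request:
    """Build (without sending) the request that lists the shares."""
    url = profile["endpoint"].rstrip("/") + "/shares"
    headers = {"Authorization": "Bearer " + profile["bearerToken"]}
    return urllib.request.Request(url, headers=headers, method="GET")


profile = load_profile(EXAMPLE_PROFILE)
request = build_list_shares_request(profile)
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) against a real sharing server would return the shares visible to the token holder. Because access is governed by the token rather than by any particular tool, the same profile can be used from whichever engine the recipient prefers.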

With Dremio, analysts and data scientists can analyze data in place using standard SQL and BI tools, without copying or moving it into traditional databases, while experiencing the same interactive response times as relational databases. As a result, there is no need to load data into other systems such as data marts, cubes, aggregation tables, or BI extracts.

In addition, Dremio makes it easy for business analysts and data scientists to discover and explore datasets in the Data Lake, curate new virtual datasets, and collaborate with other users within the company, whether exploring the Data Lake directly or using catalogs such as a Delta Sharing server. Dremio makes it possible for users to interact with any data at any time.

High-Speed SQL on Data Lakes

Dremio provides numerous features and technologies to enable interactive SQL performance on Data Lakes, including:

  • Elastic scale-out execution engine: Capable of scaling from a few containers to thousands of containers.
  • Apache Arrow: First execution engine built from the ground up on Apache Arrow, which provides columnar in-memory data processing.
  • Gandiva: Open-source execution kernel enabling high-speed computation on data in Apache Arrow format. Makes optimal use of CPU and GPU architectures.
  • Data Reflections: Accelerates queries, delivering sub-second response times over terabyte- and petabyte-sized datasets.
  • Predictive Pipelines: Anticipates data access patterns so the system is never waiting for data from cloud storage.
  • Cloud Columnar Cache (C3): Intelligently caches data within the execution engine for fast retrieval and processing.
  • Arrow Flight: An Apache Arrow-based RPC layer enabling Python and R to consume data into data frames 10-100x faster than JDBC/ODBC.
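To illustrate why a columnar in-memory layout suits analytical queries, here is a small sketch in plain Python. It is not Apache Arrow or Dremio code; the sample rows and the to_columns helper are made up for illustration. It pivots row-oriented records into one contiguous sequence per column, so an aggregate reads only the column it needs.

```python
from array import array

# Hypothetical row-oriented records, as a transactional system might hold them.
rows = [
    {"region": "east", "sales": 120.0},
    {"region": "west", "sales": 80.5},
    {"region": "east", "sales": 42.5},
]


def to_columns(records):
    """Pivot row dictionaries into one sequence per column."""
    return {
        "region": [r["region"] for r in records],
        # array("d") stores the doubles contiguously, loosely mirroring
        # how a columnar format keeps one column's values together.
        "sales": array("d", (r["sales"] for r in records)),
    }


columns = to_columns(rows)

# The aggregate scans a single contiguous column and never touches "region".
total_sales = sum(columns["sales"])
print(total_sales)
```

A real columnar engine adds much more on top of this idea, such as typed buffers, null handling, and vectorized execution.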

Empowering Analysts with a Self-Service Semantic Layer

Dremio provides a self-service semantic layer so that organizations can create Data Lakes that include both a raw zone and a semantic zone. The semantic zone, consisting of virtual datasets organized in spaces, enables IT to apply security and governance while allowing business analysts and data scientists to define structure, create new virtual datasets, and collaborate with each other. The easy-to-use interface lets non-technical users access and work with data in the Data Lake, making data accessible to everyone in the organization.

Summary

Companies are increasingly migrating from legacy data warehouses to Data Lake and Lakehouse architectures to democratize data access. With these open architectures, organizations gain agility, scalability, and availability from cloud-native services, and enjoy flexibility without vendor lock-in. Dremio enables truly interactive SQL queries and BI directly on data within the Lake/Lakehouse. Dremio is excited to be a launch partner on the open source Delta Sharing initiative and to provide users with interactive SQL on data made available through Delta Sharing servers.

Next Steps

  1. Deploy Dremio on AWS
  2. Learn about Lakehouses from Databricks CEO Ali Ghodsi and Dremio CEO Billy Bosworth
  3. Implement a Data Lake on AWS