
What is Data-as-a-Service?

Data-as-a-Service (DaaS) is an open source software solution or cloud service that provides critical capabilities for a wide range of data sources for analytical workloads through a unified set of APIs and data model. Data-as-a-Service platforms address key needs in terms of simplifying access, accelerating analytical processing, securing and masking data, curating datasets, and providing a unified catalog of data across all sources.

The tools used by millions of data consumers, such as BI tools, data science platforms, and dashboarding tools, assume that all data exists in a single, high-performance relational database. When data is spread across multiple systems, or held in non-relational stores such as Amazon S3, Hadoop, and NoSQL databases, these tools cannot work effectively. As a result, IT is tasked with moving this data into relational environments, cubes, and proprietary extracts for different analytical tools. Data-as-a-Service solutions simplify these challenges by allowing companies to leave data where it is already managed, while providing fast access for data consumers regardless of the tool they use.

Organizations deploy Data-as-a-Service solutions in their own data centers or on cloud platforms such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.

Architecture of Data-as-a-Service

Data-as-a-Service runs between the systems that manage your data and the tools that data consumers use to analyze, visualize, and process it, such as BI tools and data science platforms. Rather than moving data from the sources into a single repository, a Data-as-a-Service solution leaves the data in place and sits between the existing sources and those tools.

A graphic showing Data-as-a-Service architecture.

From the perspective of these tools, Data-as-a-Service solutions are accessed using standard SQL over ODBC, JDBC, or REST, and the Data-as-a-Service takes care of accessing and securing the data as efficiently as possible wherever it is managed.

Benefits of Data-as-a-Service

BI and analytics tools such as Tableau, Power BI, R, Python, and machine learning models were designed for a world in which data lives in a single, high-performance relational database. However, most organizations manage their data in multiple solutions using different data formats and different technologies. Most organizations now utilize one or more non-relational datastores, such as cloud storage (e.g., S3, Azure ADLS), Hadoop, and NoSQL databases (e.g., MongoDB, Elasticsearch, Cassandra). In addition, following new design patterns in microservices, data for an application is often fragmented and distributed across multiple datastores.

Because data no longer lives in a single relational database, IT typically applies a combination of custom ETL development and proprietary products to integrate data from different systems in order to improve access for analytics. In many organizations, the analytics stack includes the following layers:

  • Data lake. The data is moved from various operational databases into a single staging area such as a cloud storage service (e.g., Amazon S3, Azure ADLS).
  • Data warehouse. While it is possible to execute SQL queries directly on Hadoop and cloud storage, these systems are simply not designed to deliver interactive performance. Therefore, a subset of the data is usually loaded into a relational data warehouse or MPP database.
  • Cubes, aggregation tables, and BI extracts. In order to provide interactive performance on large datasets, the data must be pre-aggregated and/or indexed by building cubes in an OLAP system or materialized aggregation tables in the data warehouse.

This multi-layer architecture introduces many challenges. For example:

  • Flexibility. As data sources change, or data requirements evolve, every layer of the stack must be revisited to ensure data pipelines and tools continue to work with the system. Changes can take months or even quarters to implement, and they carry high risk.
  • Complexity. Business analysts must understand which component to use for a given query, an unnecessary complexity that is easy to get wrong.
  • IT-centric. This architecture inhibits business self-service. Any change at any stage can only be performed by IT. Business users are unable to do things for themselves.
  • Engineering cost. This architecture requires extensive custom ETL development, DBA expertise, and data engineering to address the evolving data needs of the business.
  • Infrastructure cost. This architecture is extremely expensive because it requires numerous proprietary technologies and typically results in many copies of the data stored in different systems. In addition, the use of proprietary technologies leads to vendor lock-in.
  • Data governance. This architecture leads to data sprawl, making it very difficult to track lineage or maintain tight security.
  • Data freshness. It takes time to move data between systems. It also takes time to build the derived forms of the data, such as cubes and aggregation tables. In addition, the data pipeline becomes increasingly inefficient over time, resulting in longer cycles and even more stale data.

Data-as-a-Service platforms take a different approach to powering data analytics. Rather than moving data into a single repository, Data-as-a-Service platforms access the data where it is managed, and perform any necessary transformations and integrations of data dynamically. In addition, Data-as-a-Service platforms provide a self-service model that enables data consumers to explore, organize, describe, and analyze data regardless of its location, size or structure, using their favorite tools such as Tableau, Python, and R.
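The idea of integrating data dynamically, rather than copying it into one repository first, can be illustrated with a small Python sketch. This is a toy illustration under stated assumptions, not how any particular platform is implemented: an in-memory SQLite table stands in for a relational source, a JSON string stands in for a non-relational source, and the names `customers`, `ORDERS_JSON`, and `revenue_by_customer` are hypothetical.

```python
import json
import sqlite3

# Relational source: an in-memory SQLite table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Non-relational source: JSON documents standing in for a NoSQL store or data lake files.
ORDERS_JSON = (
    '[{"customer_id": 1, "total": 120.0},'
    ' {"customer_id": 1, "total": 30.0},'
    ' {"customer_id": 2, "total": 75.5}]'
)


def revenue_by_customer(conn, orders_json):
    """Join both sources at query time without copying either into a shared store."""
    # Aggregate the JSON documents by customer id.
    totals = {}
    for doc in json.loads(orders_json):
        cid = doc["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + doc["total"]
    # Read the relational side and join on the customer id.
    rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    return [(name, totals.get(cid, 0.0)) for cid, name in rows]


result = revenue_by_customer(conn, ORDERS_JSON)
```

Here the join happens when the question is asked, and each source keeps its own copy of the data. A real platform would also push work down to the sources and plan the join for scale, which this sketch does not attempt.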

Some data sources are not optimized for analytical processing and cannot provide efficient access to the data. Data-as-a-Service platforms make it possible to optimize physical access to data independently of the schema used to organize and facilitate access to it. With this ability, individual datasets can be optimized without changing the way data consumers access the data, and without changing the tools they use. These changes can be made over time to address the evolving needs of data consumers.

Data-as-a-Service Compared to Traditional Solutions

Organizations employ multiple technologies to make data accessible to data consumers. Data-as-a-Service platforms are different in that they integrate many of these capabilities into a single, self-service solution.

Data-as-a-Service Compared to Data Lakes

Data lakes are typically deployed on Amazon S3, Azure ADLS, or Hadoop. They provide a flexible, file-oriented storage layer. Hadoop includes interfaces for querying the data, while S3 and ADLS rely on other services for performing analysis beyond basic file-level access.

Data-as-a-Service platforms are typically deployed alongside data lakes. The data lake is used in two distinct ways: 1) as a data source, and 2) as a persistence layer for metadata or any data acceleration-related data structures. Data-as-a-Service platforms provide many features that are complementary to the data lake, including:

  • Fast SQL-based access
  • A searchable data catalog
  • A logical data model
  • Data provenance & lineage
  • Data curation
  • Row and column-level access controls
  • Self-service access for data consumers

Typically, raw data is loaded into the data lake, and the Data-as-a-Service platform then provides the capabilities for making that data ready for analysis, easy to find, and fast to analyze. These capabilities are provided in a self-service model so that data consumers can perform these tasks on their own.
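Of the features listed above, row and column-level access controls are perhaps the easiest to picture in code. The following is a minimal sketch assuming a policy is just a list of visible columns plus a row predicate. The function `apply_policy` and the sample records are hypothetical and do not reflect any specific product's policy model.

```python
def apply_policy(rows, allowed_columns, row_filter):
    """Return only the rows a user may see, restricted to the columns they may see."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_filter(row)
    ]


# Hypothetical raw records as they might sit in a data lake.
rows = [
    {"region": "EU", "customer": "Acme", "ssn": "123-45-6789", "total": 120.0},
    {"region": "US", "customer": "Globex", "ssn": "987-65-4321", "total": 75.5},
]

# An EU analyst sees only EU rows, and never the sensitive "ssn" column.
eu_analyst_view = apply_policy(
    rows,
    allowed_columns=["region", "customer", "total"],
    row_filter=lambda row: row["region"] == "EU",
)
```

Applying the policy at read time means the underlying data is stored once, while different users see different slices of it.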

Data-as-a-Service Compared to Data Warehouses

Data warehouses are specialized relational databases that are optimized for analytical workloads. Data-as-a-Service platforms are typically deployed alongside data warehouses to simplify access to this data and to allow for uniform access to the data warehouse along with other sources, such as data lakes and NoSQL systems.

Data-as-a-Service platforms provide many features that are complementary to the data warehouse, including:

  • Joins between the data warehouse and other systems
  • SQL-based access to non-relational data sources such as NoSQL
  • A searchable data catalog
  • A logical data model
  • Data provenance & lineage
  • Data curation
  • Self-service access for data consumers

Data-as-a-Service Compared to Cubes, Extracts, and Aggregation Tables

Cubes, extracts, and aggregation tables provide optimized access to data for different analytical tools. Organizations use these technologies by making a copy of the data and transforming it into one of these data structures. In order to create these resources, IT must have a strong understanding of the types of queries that will be issued by data consumers. As a result, designing and building these resources is typically a multi-week or multi-month exercise.

These resources tend to require a relational database as the input to their build process, so non-relational data (e.g., JSON) must first be loaded into a relational database. This can add significant time and effort to the work of building and maintaining cubes, extracts, and aggregation tables.

Some Data-as-a-Service platforms provide an alternative to these techniques that is similar to the concept of a materialized view in a relational database. Data is organized in such a way that it is physically optimized for different query patterns, such as sorting, aggregating, or partitioning the data. These data structures are then maintained in a columnar, compressed representation that is invisible to data consumers.

The query planner of the Data-as-a-Service platform determines which materializations of the data can be used to generate a more efficient query plan. Just as with materialized views, the data consumer benefits from a faster query without changing their behavior. Unlike materialized views, Data-as-a-Service platforms can perform this optimization on non-relational data structures as well, and there is no dependency on a database for storing or querying the data.
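How a planner might decide that a materialization can answer a query can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of any real planner: it assumes a materialization is usable when it contains every grouping column and every measure the query needs, that the measures are additive (such as sums) so they can be re-aggregated from a finer grain, and that the smallest usable materialization is the cheapest. `Materialization` and `choose_materialization` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Materialization:
    name: str
    dimensions: frozenset  # columns the pre-computed data is grouped by
    measures: frozenset    # additive columns that are pre-aggregated
    row_count: int         # used as a rough cost estimate: fewer rows, cheaper scan


def choose_materialization(query_dims, query_measures, candidates):
    """Pick the cheapest materialization covering the query, or None to use the source."""
    usable = [
        m for m in candidates
        if set(query_dims) <= m.dimensions and set(query_measures) <= m.measures
    ]
    return min(usable, key=lambda m: m.row_count, default=None)


candidates = [
    Materialization("by_region_day", frozenset({"region", "day"}), frozenset({"sales"}), 50_000),
    Materialization("by_region", frozenset({"region"}), frozenset({"sales"}), 40),
]

# Both candidates cover a "sales by region" query; the smaller one wins.
pick = choose_materialization({"region"}, {"sales"}, candidates)

# No candidate has a "customer" column, so the planner falls back to the source.
fallback = choose_materialization({"customer"}, {"sales"}, candidates)
```

The data consumer issues the same query in both cases. Only the plan behind it changes, which is why the optimization stays invisible to them.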

Dremio

Dremio is a Data-as-a-Service Platform. Dremio enables business analysts and data scientists to explore and analyze any data at any time, regardless of its location, size or structure, using their favorite tools such as Tableau, Python, and R. Dremio leverages Apache Arrow and a patented data acceleration capability called Data Reflections to provide interactive performance on massive datasets.

Dremio enables a self-service model, where consumers of data use Dremio’s data curation capabilities to collaboratively discover, curate, accelerate and share data without relying on IT. This user experience is powered by a modern, intuitive, Web-based UI.

Discover

Dremio includes a unified data catalog where users can discover and explore physical and virtual datasets. The data catalog is automatically updated when new data sources are added, and as data sources and virtual datasets evolve. All metadata is indexed in a high-performance, searchable index, and exposed to users throughout the Dremio interface.

Curate

Dremio enables users to curate data by creating virtual datasets. A variety of point-and-click transformations are supported, and advanced users can utilize SQL syntax to define more complex transformations. As queries execute in the system, Dremio learns about the data, enabling it to recommend various transformations such as joins and data type conversions.
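A virtual dataset is conceptually close to a SQL view: a saved transformation over source data that copies nothing. The sketch below uses Python's built-in `sqlite3` module purely as an analogy. It does not use Dremio's interfaces, and the table `raw_orders` and view `shipped_orders` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A "physical" dataset whose amount column arrived as text.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.50", "shipped"), (2, "5.00", "cancelled"), (3, "42.25", "shipped")],
)

# The "virtual" dataset: a saved query that converts a data type and filters rows.
# No data is copied; the view is evaluated against raw_orders when it is queried.
conn.execute(
    """
    CREATE VIEW shipped_orders AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'shipped'
    """
)

shipped = conn.execute("SELECT id, amount FROM shipped_orders ORDER BY id").fetchall()
```

Because the curated dataset is only a definition, rows added to the source later appear in it automatically.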

Accelerate

Dremio is capable of accelerating datasets by up to 1000x over the performance of the source system. Users can vote for datasets they think should be faster, and Dremio’s heuristics will consider these votes in determining which datasets to accelerate. Optionally, system administrators can manually determine which datasets to accelerate through a flexible and rich set of interfaces to optimize the Data Reflections.

Share

Dremio enables users to securely share data with other users and groups. In this model, a group of users can collaborate on a virtual dataset that will be used for a particular analytical job. Alternatively, users can upload their own data, such as Excel spreadsheets, to join to other datasets from the enterprise catalog. Creators of virtual datasets can determine which users can query or edit their virtual datasets. It’s like Google Docs for your data.

Open Source

Dremio is licensed under the Apache 2.0 license. Dremio Community Edition is available free of charge, and Dremio Enterprise Edition is available as part of an annual subscription that includes support.