
Dremio Compared To Data Warehouses

Overview

Most companies rely on a data warehouse to centralize current and historic data for analytical use. These systems are critical to the business and used across many different departments, including sales, marketing, finance, and others.

In this article we compare the capabilities of a data warehouse to those of Dremio, a new approach to data analytics. Data warehouses are part of a larger end-to-end analytical process that involves many steps, technologies, and teams across IT. Dremio is fundamentally different in that it integrates these steps into a new self-service solution for business users to access any data from any source using their favorite BI and data science tools.

What is a Data Warehouse?

The data warehouse is a central repository of data from across the enterprise that powers data analytics. BI tools and data science technologies access data from the data warehouse to serve the needs of many use cases across most business units. Because most companies use many different applications to run their business, the data warehouse simplifies analysis by providing access through a single system. It also provides data in a standardized, consistent form, which makes the analysis more reliable and informative.

Typically there are one or more data pipelines that move data from sources into the data warehouse. As part of these data pipelines, data may be transformed, filtered, enriched, or summarized in order to make it more suitable to the needs of analysts and data scientists. In some projects ETL tools are used as part of the data pipeline. In other projects data prep tools might also be used to help get the data ready for the data warehouse.
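To make the pipeline steps above concrete, here is a minimal sketch in Python of a filter, transform, enrich, and summarize pass over a handful of records. The field names, the lookup table, and the `run_pipeline` function are all hypothetical and chosen only for illustration; a real pipeline would typically use an ETL tool or a scheduler rather than a single function.

```python
from collections import defaultdict

# Hypothetical source records; the field names are illustrative only.
SOURCE_ROWS = [
    {"order_id": 1, "region": "emea", "amount": "120.50", "status": "complete"},
    {"order_id": 2, "region": "amer", "amount": "80.00", "status": "cancelled"},
    {"order_id": 3, "region": "emea", "amount": "42.25", "status": "complete"},
    {"order_id": 4, "region": "apac", "amount": "300.00", "status": "complete"},
]

# Hypothetical lookup table used for the enrichment step.
REGION_NAMES = {
    "emea": "Europe, Middle East, Africa",
    "amer": "Americas",
    "apac": "Asia Pacific",
}


def run_pipeline(rows):
    """Filter, transform, enrich, and summarize rows before loading."""
    totals = defaultdict(float)
    for row in rows:
        # Filter: drop orders that were never completed.
        if row["status"] != "complete":
            continue
        # Transform: convert the amount from text to a number.
        amount = float(row["amount"])
        # Enrich: replace the region code with a readable name.
        region = REGION_NAMES.get(row["region"], "Unknown")
        # Summarize: accumulate total revenue per region.
        totals[region] += amount
    return dict(totals)


# The summarized result is what would be loaded into the warehouse.
warehouse_table = run_pipeline(SOURCE_ROWS)
```

Each step here runs before the data reaches the warehouse, which is why changes to the analysts' requirements usually mean changes to the pipeline as well.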

It is common for companies to store subsets of the data warehouse in smaller systems called data marts. A data mart might be specific to an individual department or region within an organization, with data specific to these users and their needs.

The data warehouse must be capable of reliably storing large volumes of data, while providing SQL access for BI tools and data science tools with high concurrency and low latency. In addition, user access must be secure and follow enterprise governance standards.

What is Dremio?

Dremio is a new and unique approach to data analytics that lets you do more with your data, with less effort, and at an end-to-end speed never before possible. Unlike data warehouse systems, Dremio connects to your source systems directly, eliminating the need for elaborate data pipelines, ETL, and data prep tools. Dremio also optimizes your data and queries, providing fast, interactive performance no matter where your data originates. Instead of building aggregation tables or BI extracts, Dremio makes your data fast using cutting-edge columnar, in-memory data structures.

Instead of cobbling together products from multiple vendors, Dremio lets you start seeing value in minutes, and for the first time makes all of your data easily accessible to IT as well as business users.

Analysts connect to Dremio with their favorite BI tool (Tableau, Power BI, Qlik Sense, etc.) or language (SQL, R, Python, etc.). To an analyst, all data appears as tables, no matter what system it came from, with the full power of SQL to join, aggregate, transform and sort data across one or more data sources. Dremio is entirely transparent to your users. And Dremio Reflections accelerate your data so that no matter the size or data source, your data feels small, approachable, and instantaneous. Unlike cubes that only work for a small set of pre-defined queries, Dremio makes all your SQL fast, including ad-hoc row-level queries.
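The idea of joining tables from different systems in a single SQL statement can be sketched with Python's built-in `sqlite3` module. This is only an analogy and does not use Dremio or its drivers: two in-memory SQLite databases stand in for two separate source systems, and the table and column names are hypothetical.

```python
import sqlite3

# Two in-memory SQLite databases stand in for two separate source systems.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")

# Hypothetical tables: orders live in one "system", customers in the other.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# One SQL statement joins and aggregates across both "sources".
query = """
    SELECT c.name, SUM(o.amount) AS total
    FROM main.orders AS o
    JOIN crm.customers AS c ON c.customer_id = o.customer_id
    GROUP BY c.name
    ORDER BY total DESC
"""
results = conn.execute(query).fetchall()
conn.close()
```

From the analyst's point of view, the query does not need to know where each table physically lives; it only refers to table names.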

Terminology & Concepts

Cubes
A specialized software process analyzes source data and pre-computes measures across one or more dimensions of data to make accessing these calculations efficient. Typical measures are SUM, MIN, MAX, AVG, and typical dimensions include time, geography, organization, and product taxonomies. Cubes are queried with MDX or a proprietary language.
BI Extracts
Similar to a cube, but typically designed for a proprietary in-memory database management system that is specific to one BI Tool. BI Extracts cannot be used across multiple tools; extracts must be created for each tool. BI Extracts are queried with a proprietary language.
Aggregation Tables
A table in a relational database management system used to store pre-computed measures across one or more dimensions of data. Similar to cubes and BI Extracts, but the data is persisted in a relational model and is accessible by standard SQL.
Columnarized Data
Also known as column-oriented data, columnarization is a way of organizing data so that values for a given column across many rows are stored contiguously. Organizing data in this way makes scans of many rows for a given column much more efficient, and it makes compression of the data more efficient as data of similar types and values is stored together. Row-oriented data, in contrast, organizes all columns for a row contiguously, which is more efficient for discrete reads and writes to a single row.
Apache Parquet & Apache ORC
Parquet and ORC are two similar columnar data formats in use in the Hadoop ecosystem. These formats are engineered for optimal storage of analytical data on disk.
Apache Arrow
Arrow is a columnar data format for in-memory data analytics. It is complementary to Apache Parquet and Apache ORC.
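The definitions above can be illustrated with a small, self-contained Python sketch. It uses plain lists and dictionaries rather than Parquet, ORC, or Arrow, and the sample data and function names are hypothetical. It shows a row-oriented layout, the same data in a column-oriented layout, run-length encoding as one simple example of why similar adjacent values compress well, and an aggregation table of pre-computed measures for one dimension.

```python
# Hypothetical row-oriented records: each tuple holds every column for one row.
rows = [
    ("2024-01-01", "US", 10.0),
    ("2024-01-01", "DE", 20.0),
    ("2024-01-02", "US", 30.0),
    ("2024-01-02", "US", 5.0),
]

# Column-oriented layout: the values of each column are stored together.
columns = {
    "day": [r[0] for r in rows],
    "country": [r[1] for r in rows],
    "sales": [r[2] for r in rows],
}

# Scanning one column touches only that column's values.
total_sales = sum(columns["sales"])


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


# Similar values sit next to each other, so the column compresses well.
encoded_days = run_length_encode(columns["day"])


def build_aggregation_table(rows):
    """Pre-compute SUM, MIN, MAX, and AVG of sales for each country."""
    grouped = {}
    for _day, country, sales in rows:
        grouped.setdefault(country, []).append(sales)
    return {
        country: {
            "sum": sum(v),
            "min": min(v),
            "max": max(v),
            "avg": sum(v) / len(v),
        }
        for country, v in grouped.items()
    }


# Queries on the country dimension can now read pre-computed measures.
agg = build_aggregation_table(rows)
```

The aggregation table answers only the questions it was built for (here, measures by country); a query by day would need a different table or a scan of the underlying data.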

Feature Comparison

Feature | Dremio | Data Warehouse
Store multi-TB to multi-PB datasets | Yes | Yes (limited; most systems struggle to support PB-scale datasets)
Provide full SQL access to all structured data over ODBC and JDBC | Yes | Yes
Provide full SQL access to all unstructured data over ODBC and JDBC | Yes | No
Support BI workloads with 100s of concurrent users | Yes | No
Ensure secure, governed access through integration to centralized security controls for authentication, authorization, and auditing, as well as end-to-end encryption | Yes | Yes
Ensure SLAs across many users and multiple tenants with resource management | Yes | Yes
Scale out on commodity hardware to 1000+ nodes | Yes | No
Query and join across external sources, including non-relational systems (e.g., MongoDB, Elasticsearch, S3) | Yes | No
Provide a self-service interface for business users to discover, curate, accelerate, and share data | Yes | No
Natively optimize data structures for multiple workloads, entirely transparent to end users, eliminating the need for cubes, BI extracts, and aggregation tables | Yes | No
Provide direct access to in-memory data buffers with zero copy and zero serialization/deserialization for Python, R, C++, Java, Spark, and other languages | Yes | No
Software license | Open source | Proprietary

What Are Common Use Cases for Dremio?

Dremio lets you reimagine your end-to-end analytical processes, with a solution that makes your data engineers and your analysts more productive on day one. Instead of using data prep, ETL, and custom scripts to move your data between different environments, Dremio connects to your data sources directly and automatically creates a highly optimized cache that makes even your biggest data feel small, approachable, and interactive. Dremio supports all your favorite BI tools, as well as languages and frameworks like Python/Pandas, R, and Apache Spark.

Customers use Dremio in a wide range of applications. Here are some popular first projects:

  • BI on Modern Data. Fast access to Elasticsearch, MongoDB, HDFS, S3, plus joins to relational data.
  • Data Acceleration. Make PB-scale queries fast, without cubes or aggregation tables.
  • Self-Service Data. Empower IT and analysts to discover, curate, accelerate, and share data.
  • Data Lineage. Lineage of data flows, data reshaping, sharing, and access patterns.

Want to Learn More?

Dremio is a new approach to data analytics. Learn about Dremio.