Dremio Compared to SQL Execution Engine Alternatives
Companies use SQL Execution Engines to run analytical workloads on data they have accumulated in environments such as Hadoop and Amazon S3. SQL Execution Engines allow companies to apply existing skills and tools that are based on SQL to their analytical data without loading this data into a relational database.
What is a SQL Execution Engine?
SQL execution engines evaluate SQL expressions on underlying sources of data. Traditional databases tightly integrate data storage and query evaluation. In contrast, SQL execution engines can evaluate queries against data in sources they do not manage, such as file systems, databases, and NoSQL systems.
In the Hadoop ecosystem, Hive quickly became the most popular way for users to run analytical jobs on data in HDFS due to the familiarity of its SQL interface. Users could apply their existing SQL skills to write HiveQL, which Hive compiled into MapReduce jobs that were executed against data in HDFS.
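The compilation described above can be illustrated with a toy sketch. This is purely conceptual (Hive actually generates distributed MapReduce jobs, and the table and column names here are invented): an aggregation query becomes a map phase that emits key/value pairs and a reduce phase that combines values per key.

```python
from collections import defaultdict

# Toy illustration of how an aggregation such as
#   SELECT region, SUM(amount) FROM sales GROUP BY region
# maps onto the map/reduce model that Hive compiles HiveQL into.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

def map_phase(rows):
    # Map: emit one (key, value) pair per input row.
    for row in rows:
        yield row["region"], row["amount"]

def reduce_phase(pairs):
    # Reduce: sum the values for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

result = reduce_phase(map_phase(rows))
print(result)  # {'east': 17, 'west': 5}
```

In a real cluster the map and reduce phases run in parallel across many nodes, with a shuffle step grouping pairs by key between them.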
A number of alternatives to Hive emerged as new SQL execution engine projects, including Apache Impala, Apache Spark SQL, Apache Drill, Presto, and Apache HAWQ.
SQL Execution Engines typically allow users to:
- Execute complex SQL queries on data in multiple formats (e.g., CSV, Avro, Parquet, ORC)
- Run on a cluster of servers, from a few nodes to several hundred or more
- Execute queries using a distributed SQL engine, sometimes running in a Hadoop cluster
What is Dremio?
Dremio is the Data-as-a-Service Platform. It helps you get more value from your data, faster. Unlike SQL Execution Engines, Dremio is a comprehensive solution that eliminates the need for complex ETL, aggregation tables, or data cubes. Instead of cobbling together products from multiple vendors, Dremio lets you start seeing value in minutes, with a user experience whose quality is unprecedented in the SQL Execution Engine market.
Dremio lets you easily query all of your data sources, not just the data you’ve moved to HDFS, with optimized push-downs to relational databases and non-relational systems like Elasticsearch and S3. Dremio lets you reach your data faster, with far less effort.
Analysts connect to Dremio with their favorite BI tool (Tableau, Power BI, Qlik Sense, etc.) or language (SQL, R, Python, etc.). To an analyst, all data appears as tables, no matter what system it came from, with the full power of SQL to join, aggregate, transform and sort data across one or more data sources.
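The "everything appears as a table" experience above can be sketched with plain SQL. In this conceptual example, Python's built-in SQLite stands in for Dremio, and the table names (`customers`, `orders`) are hypothetical; the point is that once every source is exposed as a table, one SQL statement can join and aggregate across them.

```python
import sqlite3

# Conceptual sketch: SQLite stands in for Dremio. Imagine "customers"
# backed by a relational database and "orders" backed by, say, S3 --
# to the analyst, both are simply tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 25.0)])

# Join, aggregate, and sort across both "sources" with ordinary SQL.
rows = conn.execute("""
    SELECT c.name, SUM(o.total) AS revenue
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('Acme', 150.0), ('Globex', 25.0)]
```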
Entirely invisible to your users, Dremio Reflections™ accelerate your data so that no matter how big it is or where it came from, it feels small, approachable, and instantaneous. Unlike cubes that only work for small data on a small set of pre-defined queries, Dremio makes all your SQL fast, including ad-hoc row-level queries as well as OLAP workloads.
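The idea behind an Aggregation Reflection with a TTL can be sketched in a few lines. This is a minimal conceptual model, not Dremio's implementation: queries are answered from a pre-aggregated copy while it is fresh, and the copy is rebuilt from the raw data once the TTL expires, trading freshness against query latency.

```python
import time

# Minimal sketch (NOT Dremio's implementation) of a pre-aggregated
# "reflection" with a time-to-live. The caller queries the logical
# data; the cached aggregate is substituted transparently when fresh.
class AggregationReflection:
    def __init__(self, raw_rows, ttl_seconds):
        self.raw_rows = raw_rows  # (key, value) pairs from the source
        self.ttl = ttl_seconds
        self.refresh()

    def refresh(self):
        # Pre-aggregate the raw rows and record when we did it.
        totals = {}
        for key, value in self.raw_rows:
            totals[key] = totals.get(key, 0) + value
        self.cache = totals
        self.refreshed_at = time.monotonic()

    def query(self):
        # Planner analogy: use the reflection while it is fresh;
        # otherwise rebuild it first.
        if time.monotonic() - self.refreshed_at > self.ttl:
            self.refresh()
        return self.cache

reflection = AggregationReflection(
    [("east", 10), ("west", 5), ("east", 7)], ttl_seconds=60)
print(reflection.query())  # {'east': 17, 'west': 5}
```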
Dremio runs as a distributed process, on dedicated infrastructure, in containers, or as a YARN application in your Hadoop cluster. With Dremio you can query data that’s already in HDFS, or you can query external systems directly, removing the need for ETL.
Unlike SQL Execution Engines, Dremio provides:
- Data Acceleration. Using columnar, compressed Apache Arrow for efficient in-memory analytical processing, and Apache Parquet for persistence of source data that is optimized for one or more query workloads through partitioning, sorting, aggregations, projections, and distributions.
- Data Catalog. A searchable index of your data source metadata, as well as of the virtual datasets created by Dremio users.
- Integrated Data Curation. Through a powerful and intuitive GUI, easy for business users, yet sufficiently powerful for your data engineers, and fully integrated into Dremio.
- Push-Downs On Any Data Source. Including optimized push downs and parallel connectivity to relational databases, non-relational systems like Elasticsearch, as well as S3 and HDFS.
- Cross-Data Source Joins. Execute high-performance joins across multiple disparate systems and technologies, spanning relational databases, NoSQL systems, S3, HDFS, and more.
- Data Lineage. Full visibility into your data lineage, from your data sources, through transformations, joining with other data sources, and sharing with other users.
Terminology & Concepts
- Data Reflections™
- Physically optimized representations of source data that both offload operational systems, and optimize one or more analytical workloads. Reflections are transparent to end users, and automatically substituted by Dremio’s query planner. Reflections have a configurable TTL SLA, so you can trade off freshness and query latency.
- Data Catalog
- An index of source metadata, including the names of tables, views, columns, fields, collections, indexes, and more. Users can easily issue Google-style searches to find datasets for a given job. The Data Catalog includes all metadata from virtual datasets as well.
- Data Curation
- A visual and intuitive way for analysts, data scientists, and data engineers to transform data for the needs of a particular job, without making copies of the data.
- Data Lineage
- As data is used for multiple jobs, it is transformed, joined, and shared with other users, forming an implicit graph of relationships and dependencies. These relationships help users understand how data is used, and are essential for security, governance, and remediation.
- As users interact with datasets, their behavior can serve as the basis for recommendations to other users, helping to build joins and transformations more easily.
- Apache Arrow-Based Execution
- Apache Arrow is a columnar standard for in-memory analytics. It provides significant advantages in terms of memory and CPU efficiency, and is designed to work well with GPUs and FPGAs.
| | Dremio | SQL Execution Engines |
|---|---|---|
| Accelerates aggregation queries | Yes. Queries are written against the logical schema, and Dremio's query planner automatically rewrites the query to use Aggregation Reflections, invisible to the end user. | No. Requires a slow full table scan each time. |
| Accelerates ad-hoc queries | Yes. Queries are written against the logical schema, and Dremio's query planner automatically rewrites the query to use Raw Reflections, invisible to the end user. | No. Requires a slow full table scan each time. |
| Accelerates relational data sources | Yes. Dremio Reflections, and native optimizers with first-class push-downs of queries. | No. Varies by engine, but most require third-party ETL to move and prep data for HDFS or S3. |
| Accelerates NoSQL data sources | Yes. Dremio Reflections, and native optimizers with first-class push-downs of queries. | No. Varies by engine, but most require third-party ETL to move and prep data for HDFS. |
| Integrated data curation | Yes. Natural and intuitive UI for data discovery, curation, acceleration, and collaboration. | No. Requires a third-party tool or custom scripts written by data engineers. |
| Integrated data lineage | Yes. Full visibility into data lineage and access patterns for governance and error remediation. | No. Requires a third-party tool or custom scripts written by data engineers. |
What Are Common Use Cases for Dremio?
Dremio lets you reimagine your end-to-end analytical processes, with a solution that makes your data engineers and your analysts more productive on day one. Instead of using ETL and custom scripts to move your data between different environments, Dremio connects to your data sources directly, and automatically creates a highly optimized cache that makes even your biggest data feel small, approachable, and interactive. Dremio supports all your favorite BI tools, and advanced languages like Python/Pandas, R, and Apache Spark.
We see a wide range of applications, but here are a few popular first projects:
- BI on Modern Data. Fast access to Elasticsearch, HDFS, S3, plus joins to relational data.
- Data Acceleration. Make PB-scale queries fast, without cubes or aggregation tables.
- Self-Service Data. Empower IT and analysts to discover, curate, accelerate, and share data.
- Data Lineage. Lineage of data flows, data reshaping, sharing, and access patterns.
Want to Learn More?
Dremio is a new approach to data analytics. Learn about Dremio.