Since the 1990s, data architecture teams have distinguished between operational and analytical workloads using the terms online transactional processing (OLTP) and online analytical processing (OLAP). While the data and technologies we use for applications have evolved over the years, these terms are still very relevant today.
OLTP workloads tend to be dominated by discrete operations that read, write, update, or delete a small number of records for each transaction. These applications tend to support many concurrent operations (thousands of queries per second), with low latency (<10ms) and high availability (99.999% uptime). When these systems are down or slow, they can significantly impact the business. A good example is an ecommerce system for a popular retail site.
In contrast, OLAP workloads are dominated by queries that read large numbers of records for each operation. These queries usually involve aggregations such as SUM, MIN, MAX, and AVG across several dimensions of the data. Because each operation spans most or all of the records, OLAP queries tend to touch orders of magnitude more data than OLTP queries. There tend to be fewer concurrent users, and the latency requirements tend to be more relaxed when compared to OLTP systems.
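The contrast between the two workload shapes is easiest to see in the queries themselves. Here is a minimal sketch using an illustrative `orders` table in SQLite (the table and data are invented for the example, not taken from any particular system):

```python
import sqlite3

# Toy orders table to contrast the two workload shapes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("east", 10.0), ("east", 20.0), ("west", 5.0), ("west", 15.0)],
)

# OLTP-style: touch a single record, looked up by key, with low latency.
row = conn.execute("SELECT amount FROM orders WHERE id = ?", (1,)).fetchone()

# OLAP-style: scan every record and aggregate across a dimension.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
)
```

The OLTP query reads one row by primary key; the OLAP query must visit every row to compute its aggregates, which is why OLAP latency and concurrency expectations are so different.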
Many companies have embraced a Data Lake strategy for modern data analytics. In this model, data from operational systems is moved to Hadoop for long-term storage, processing, and analytics. Hadoop’s rich ecosystem of tools and integrations addresses a wide range of workloads through a shared resource model. Furthermore, Hadoop’s architecture and scalability allow companies to secure and centrally manage storage and compute resources of virtually any scale.
OLAP on Hadoop is a category of products that leverage Hadoop to accelerate analytical queries, typically by loading data into HDFS and processing it into cubes or aggregation tables.
Dremio is the Data-as-a-Service Platform. It helps you get more value from your data, faster. Unlike OLAP on Hadoop products, Dremio is a comprehensive solution that eliminates the need for complex ETL, aggregation tables, or data cubes. Instead of cobbling together products from multiple vendors, Dremio lets you start seeing value in minutes, with a user experience whose quality is unprecedented in the Hadoop market.
Dremio seamlessly accelerates all of your data, not just the things you’ve moved to HDFS, with optimized pushdowns to relational and non-relational systems like MongoDB, Elasticsearch, and S3. Dremio lets you reach your data faster, with far less effort.
Analysts connect to Dremio with their favorite BI tool (Tableau, Power BI, Qlik Sense, etc.) or language (SQL, R, Python, etc.). To an analyst, all data appears as tables, no matter what system it came from, with the full power of SQL to join, aggregate, transform and sort data across one or more data sources. And entirely transparent to your users, Dremio Reflections™ accelerate your data so that no matter how big it is or where it came from, it feels small, approachable, and instantaneous. Unlike cubes that only work for small data on a small set of pre-defined queries, Dremio makes all your SQL fast, including ad-hoc row-level queries.
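How Dremio executes cross-source joins is internal to its engine, but the core idea, that two physically separate stores can appear as ordinary tables in a single SQL statement, can be sketched with two separate SQLite databases joined via ATTACH (the `users` and `clicks` tables here are invented stand-ins for, say, a relational source and a NoSQL source):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A second, physically separate in-memory database, standing in for
# another data source exposed to the analyst as plain tables.
conn.execute("ATTACH DATABASE ':memory:' AS events")

conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "lin")])

conn.execute("CREATE TABLE events.clicks (user_id INTEGER, page TEXT)")
conn.executemany("INSERT INTO events.clicks VALUES (?, ?)",
                 [(1, "home"), (1, "pricing"), (2, "docs")])

# One SQL statement joins and aggregates across both "sources".
rows = conn.execute("""
    SELECT u.name, COUNT(*) AS clicks
    FROM users u JOIN events.clicks c ON c.user_id = u.id
    GROUP BY u.name ORDER BY u.name
""").fetchall()
```

The analyst writes one join; where each table physically lives is the engine's problem, which is the experience the paragraph above describes.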
Dremio runs as a distributed process in your Hadoop cluster, provisioned and managed by YARN. With Dremio you can query data that’s already in HDFS, or you can query external systems directly, removing the need for ETL.
Unlike OLAP on Hadoop, Dremio allows you to:
| | Dremio | OLAP on Hadoop |
|---|---|---|
| Runs in the Hadoop cluster | Yes (optionally provisioned and managed via YARN) | Yes (provisioned and managed via YARN) |
| Accelerates aggregation queries | Yes (queries are written against the logical structure of the data and automatically rewritten to take advantage of Aggregation Reflections) | Yes (only on Hadoop data, and requires queries to be written against the physical structure of the cubes) |
| Accelerates ad-hoc queries | Yes (queries are written against the logical structure of the data and automatically rewritten to take advantage of Raw Reflections) | No (requires a slow full table scan each time) |
| Accelerates relational data sources | Yes (Dremio Reflections, plus native optimizers with first-class query pushdowns) | No (requires third-party ETL to move and prep data for HDFS) |
| Accelerates NoSQL data sources | Yes (Dremio Reflections, plus native optimizers with first-class query pushdowns) | No (requires third-party ETL to move and prep data for HDFS) |
| Integrated data curation | Yes (natural and intuitive UI for data discovery, curation, acceleration, and collaboration) | No (requires a third-party tool or custom scripts written by data engineers) |
| Integrated data lineage | Yes (full visibility into data lineage and access patterns for governance and error remediation) | No (requires a third-party tool or custom scripts written by data engineers) |
| Joins across data sources | Yes (full SQL, including joins across relational, NoSQL, S3, HDFS, and more) | No (requires all data to be loaded into HDFS, then processed into cubes) |
| License | Apache open source | Proprietary |
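The aggregation-acceleration idea in the table above can be illustrated by hand. Dremio's Reflections and its automatic query rewriting are internal to its optimizer; this sketch merely shows the underlying principle, answering an aggregation query from a precomputed summary instead of rescanning raw rows (the `sales` table and all names are illustrative, not Dremio APIs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("east", 20.0), ("west", 5.0)])

# A hand-rolled stand-in for an "aggregation reflection":
# a precomputed rollup by dimension.
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total FROM sales GROUP BY region
""")

# The logical query the analyst writes...
logical = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
# ...can be answered from the rollup without touching the raw rows.
# Dremio performs this rewrite automatically; here we do it by hand.
accelerated = "SELECT region, total FROM sales_by_region ORDER BY region"

assert conn.execute(logical).fetchall() == conn.execute(accelerated).fetchall()
result = dict(conn.execute(accelerated))
```

The key difference from a cube is that the analyst's query targets the logical table (`sales`), never the physical rollup; in Dremio the optimizer decides when a Reflection can substitute for a scan.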
Dremio lets you reimagine your end-to-end analytical processes, with a solution that makes your data engineers and your analysts more productive on day one. Instead of using ETL and custom scripts to move your data between different environments, Dremio connects to your data sources directly, and automatically creates a highly optimized cache that makes even your biggest data feel small, approachable, and interactive. Dremio supports all your favorite BI tools, and advanced languages like Python/Pandas, R, and Apache Spark.
We see a wide range of applications, but here are a few popular first projects: