Bringing the Economics of Cloud Data Lakes to Everyone
As we discussed in this blog, cloud data lakes are much more scalable and cost-efficient than data warehouses, but to become pervasive and applicable to both technical and non-technical users, new approaches are required. In particular, query performance must be drastically improved while reducing the cloud infrastructure cost. Furthermore, to deliver self-service with appropriate security and governance, a semantic layer is required. In this section we explore some of Dremio’s innovations in these areas, as well as some of the capabilities we’re currently working on.
Query performance is what makes or breaks a data consumer’s ability to use interactive BI tools. Without adequate performance, organizations have no choice but to copy data into data warehouses and create BI extracts (e.g., Tableau extracts) in order to make data accessible. This creates new problems: the data becomes stale, the scripts are difficult to maintain, and a backlog of data engineering requests builds up that can take weeks or even months to fulfill.
Dremio’s query acceleration technologies address this issue, enabling BI users to interactively drag-and-drop dimensions and measures onto a dashboard, without having to wait minutes each time. These acceleration technologies leverage a variety of capabilities:
Columnar in-memory execution (Apache Arrow). We created an open source project called Apache Arrow, which is now downloaded over 10 million times per month. We use Arrow not just on the outside (to return results), but at our core to speed up query execution by leveraging vectorization in the CPU. Note that while other SQL engines can read columnar data from disk (e.g., Parquet files), the data is immediately converted into a row-based representation in memory, resulting in slower execution.
Columnar Cloud Cache (C3). Unlike on-prem data lakes, compute and storage are not co-located in cloud data lakes. While this separation provides many advantages (like the ability to scale compute and storage independently, and the flexibility to try out new compute engines), it can lead to slower performance because data must be read over a higher-latency link. We created a distributed cache called C3, which utilizes the ephemeral SSDs (typically NVMe) that come with most cloud compute instances. C3 is tightly integrated with our query optimizer and scheduler, and is 100% transparent - users and administrators will never even notice that it exists. Furthermore, C3 practically eliminates the cost of IOPS on S3 and ADLS, a cost that otherwise represents ~15% of the amortized query cost.
Data Reflections. Dremio’s columnar execution and caching result in ~4x faster ad-hoc query performance than other SQL engines like Presto. However, achieving interactive BI performance for dashboards and reports often requires a much bigger boost. This is where data reflections come into play. Similar in some ways to database indexes, cubes and materialized views with query rewriting, Dremio can maintain various data structures that reduce the amount of processing required to satisfy a query. These data structures are persisted in S3 and ADLS, and our query optimizer automatically and transparently rewrites incoming query plans to take advantage of them.
Infrastructure elasticity. The public cloud provides the ability to acquire a large amount of compute capacity for a short period of time. Let’s assume that a query requires 10 “EC2 hours” to complete, and that a single EC2 instance costs $1/hour. In that case, a compute engine could complete the query in 10 hours using a single EC2 instance, or in 6 minutes using 100 EC2 instances. In both cases, the amortized cost of the query is $10. Clearly the latter is better, but it requires a more advanced compute engine that can dynamically acquire and release compute capacity from the public cloud provider. This is something that we are actively working on and expect to release soon.
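The arithmetic behind this example can be sketched in a few lines of Python. This is an illustrative model only: the function name and its parameters are hypothetical, and it assumes perfectly linear scaling with no startup or coordination overhead, which a real engine will only approximate.

```python
def elastic_query_estimate(instance_hours, instances, price_per_instance_hour):
    """Estimate wall-clock time and cost for one query.

    Assumes perfectly linear scaling (an idealization): the work is a fixed
    number of instance-hours that can be split evenly across instances.
    """
    if instances <= 0:
        raise ValueError("instances must be positive")
    wall_clock_hours = instance_hours / instances
    # The instance count cancels out of the cost: more instances run for
    # proportionally less time, so the bill depends only on instance-hours.
    total_cost = instance_hours * price_per_instance_hour
    return wall_clock_hours, total_cost


# The example from the text: 10 "EC2 hours" of work at $1 per instance-hour.
for instances in (1, 100):
    hours, cost = elastic_query_estimate(10, instances, 1.0)
    print(f"{instances:>3} instance(s): {hours * 60:.0f} minutes, ${cost:.2f}")
```

Under these assumptions both configurations produce the same bill, and only the waiting time changes.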
Cloud data lake storage (S3 and ADLS) became pervasive due to its virtually infinite scalability and low cost per TB. Cloud data lake compute engines must have similar characteristics.
It’s worth noting that many compute engines can already scale horizontally. However, the ratio between the output (e.g., number of queries) and the infrastructure size (e.g., number of EC2 instances) is just as important, especially in a world in which budgets are tight and cost control is vital. The following illustration compares two distributed systems that can each scale horizontally, but have very different “unit economics”:
One attribute that affects the $ per query ratio is query performance. In the cloud, cost can be viewed as the inverse of performance: a query that finishes in a fraction of the time consumes a fraction of the instance-hours. For example, Dremio is about 4-100x faster than Presto (4x for ad-hoc queries, 100x for BI queries). Even at 4x, the same workload can be accomplished with about 75% less cloud infrastructure, translating into a large, immediate cost saving.
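The relationship between speedup and infrastructure can be written down directly. The helper below is a hypothetical sketch that assumes cost scales linearly with instance-hours, so a workload that runs N times faster needs 1/N of the infrastructure.

```python
def infrastructure_savings(speedup):
    """Fraction of infrastructure saved when a workload runs `speedup` times faster.

    Assumes cost is proportional to instance-hours (an idealization).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# The two speedup figures quoted in the text.
for label, speedup in (("ad-hoc", 4), ("BI", 100)):
    print(f"{label}: {infrastructure_savings(speedup):.0%} less infrastructure")
```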
Another attribute that affects $ per query is the amount of infrastructure idle time. Ideally, the amount of infrastructure (e.g., EC2 instances) running at any point in time should be correlated with the workload at that time. In the extreme case, when nothing is running, there should be almost no infrastructure running, and virtually no costs. This is something that we are actively working on and expect to release soon.
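To see how much idle time matters, consider a rough model of $ per query. Everything here is an assumption made for illustration: the function name, the workload of 1,000 queries, the 10-instance cluster, and the $1 per instance-hour price are all hypothetical.

```python
def cost_per_query(queries, busy_hours, idle_hours, instances, price_per_instance_hour):
    """Average cost per query when both busy and idle hours are billed."""
    if queries <= 0:
        raise ValueError("queries must be positive")
    billed_instance_hours = (busy_hours + idle_hours) * instances
    return billed_instance_hours * price_per_instance_hour / queries


# Hypothetical day: the cluster is only busy for 6 of 24 hours.
always_on = cost_per_query(1000, busy_hours=6, idle_hours=18,
                           instances=10, price_per_instance_hour=1.0)
# Same workload on infrastructure that is released whenever it is idle.
elastic = cost_per_query(1000, busy_hours=6, idle_hours=0,
                         instances=10, price_per_instance_hour=1.0)

print(f"always-on: ${always_on:.2f} per query")
print(f"elastic:   ${elastic:.2f} per query")
```

In this made-up scenario the query workload is identical, yet the always-on cluster costs four times as much per query because three quarters of its billed hours do no work.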
Data consumers need access to governed and curated datasets, but that is not enough. Invariably, they will also require the ability to derive their own datasets from those base datasets. Dremio includes a semantic layer consisting of well-organized virtual datasets, enabling data engineers and data consumers to achieve just that, especially in data-driven organizations and environments where changes in data access are the norm, not the exception.
Our semantic layer provides numerous benefits for data consumers including:
Consistent business logic and KPIs. Is a work week considered Mon-Sun or Sun-Sat? Are all users considered customers, or only those who paid? The semantic layer provides a way to define consistent KPIs via virtual datasets to answer business questions like those. Moreover, it can be consistently leveraged across all client applications, such as Tableau, Power BI, Jupyter Notebooks and even custom applications.
Less waiting on data engineering. A semantic layer, especially if it includes self-service capabilities, enables data consumers to derive their own virtual datasets from those offered by the data team, without having to wait weeks or months for data engineers to provision new one-off physical datasets for every use case. Moreover, a semantic layer makes it very easy for the data team to quickly provision new virtual datasets to data consumers who do not have the knowledge or access to create them independently.
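The idea of layered virtual datasets can be illustrated with ordinary SQL views. The sketch below uses Python's built-in sqlite3 module purely as a stand-in; it is not Dremio's API, and the table, view names, and sample rows are invented for the example. The data team defines "customer" once, and a data consumer derives a further dataset from it without copying any data.

```python
import sqlite3

# SQLite is used only as a convenient stand-in for a SQL engine with views.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT, total_paid REAL);
    INSERT INTO users VALUES (1, 'Ana', 120.0), (2, 'Ben', 0.0), (3, 'Chen', 35.5);

    -- Data team: one shared definition of "customer" (a user who has paid).
    CREATE VIEW customers AS
    SELECT user_id, name, total_paid FROM users WHERE total_paid > 0;

    -- Data consumer: a derived virtual dataset built on the curated one.
    CREATE VIEW high_value_customers AS
    SELECT user_id, name FROM customers WHERE total_paid >= 100;
    """
)

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
high_value_names = [
    row[0] for row in conn.execute("SELECT name FROM high_value_customers")
]
print(customer_count, high_value_names)
```

Because every client queries the same `customers` view, changing the definition in one place changes the KPI everywhere.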
Data teams also realize multiple benefits from our semantic layer, including:
Centralized data security and governance. The semantic layer enables data teams to control who can see what data. They can: define permissions at the dataset, row, and/or column level; define virtual datasets that mask sensitive columns; and provide different views of the data to different users and groups (e.g., data scientists may be allowed to see credit card numbers, but interns should see only the last four digits).
Less reactive and tedious work. Without a semantic layer, data consumers are entirely dependent on the data team any time they need to derive a new dataset. The semantic layer makes them more self-sufficient. Also, because the new datasets are virtual, data teams do not need to worry about analysts creating tens or hundreds of copies of the data.
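Column masking through a virtual dataset can be sketched the same way. Again, sqlite3 is only a stand-in and the table, view, and card numbers are made up for illustration. The sketch shows the masking view itself; deciding which users or groups may query the raw table versus the masked view is left to the engine's permission system and is not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payments (payment_id INTEGER PRIMARY KEY, customer TEXT, card_number TEXT);
    INSERT INTO payments VALUES
        (1, 'Ana', '4111111111111111'),
        (2, 'Ben', '5500000000000004');

    -- Virtual dataset exposing only the last four digits of the card number.
    CREATE VIEW payments_masked AS
    SELECT payment_id,
           customer,
           '************' || substr(card_number, -4) AS card_number
    FROM payments;
    """
)

masked_rows = conn.execute(
    "SELECT customer, card_number FROM payments_masked ORDER BY payment_id"
).fetchall()
for customer, card in masked_rows:
    print(customer, card)
```

Since the masked dataset is a view rather than a copy, there is no second physical dataset to keep in sync or to secure separately.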
While we’ve already taken significant steps towards making data analytics much easier and more cost-efficient, we’re just getting started. We have a lot of new capabilities planned for the coming year, with a particular focus on these three categories:
- Further simplify operations. Compute engines can be difficult to set up and manage. We want to make it extremely easy and intuitive to get started with Dremio directly in your AWS or Azure account. We’re eliminating the need to provision resources and install, configure and tune the software.
- Continue to reduce infrastructure costs. We have numerous new capabilities that will further drive down the amount of infrastructure required for your cloud data lake, thus reducing your costs even more.
- Drive more functionality directly in the lake. We understand that when you can’t solve a use case directly on your data lake, you’re forced to use expensive technologies and invest significant engineering resources. We will continue to push the envelope on what’s possible directly in the data lake, without moving your data into another system or locking it into a proprietary file or table format.
The world is a different place today from what it was in 2019. We are dealing with a global pandemic that has already taken the lives of tens of thousands of people worldwide, and will impact many of us on a very personal level. While we do not fully understand the economic fallout from this, we know that it, too, will be significant and impact many of us.
Although it may seem impossible at this moment to continue advancing your strategic data initiatives due to the exorbitant cost of data warehouses and resource-hungry compute engines, we believe that by taking a more modern approach, as discussed in this post, you will be able to do just that. We are committed and eager to partner with you on that journey, in both good times and bad.