"Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation)." - E. F. Codd, "A Relational Model of Data for Large Shared Data Banks"
Data teams can choose from a variety of data warehouses and query engines to support analytical workloads. In many cases, data teams also build a semantic layer on top of their data stores so their non-technical end users can work with their data in business-friendly terms.
Semantic layers aim to protect end users from the complexities of the underlying physical data model. End users should be able to use their semantic layer to understand what data exists, write queries or create dashboards (using business-friendly language), and then let the underlying system do whatever it needs to return results quickly.
However, this rarely happens in practice. While semantic layers make it easier for end users to understand their data, they don’t provide sufficient performance to support BI and analytics workloads. End users need to know about underlying physical structures, such as materialized views, to get the performance they need — which defeats the purpose of having a semantic layer in the first place.
In this blog, we’ll explore how Dremio’s semantic layer helps end users analyze massive datasets, without sacrificing speed or ease of use, by effectively decoupling the logical data model from its physical implementation. We’ll dive into one of the key technologies that makes this possible, and discuss how Dremio’s approach also benefits data engineers.
To help end users analyze data quickly, data teams build pipelines to create a variety of derived datasets, such as summary tables and materialized views, that pre-aggregate, pre-sort, and/or pre-filter data.
While these derived datasets can help end users accelerate their workloads, they come with drawbacks for both end users and data engineers:
SELECT * FROM catalog.salesschema.materialized_view_total_sales;
This approach to supporting analytics is complex and inhibits self-service data access. The latter issue is particularly problematic, as companies want to help more users, especially those who are non-technical, derive value from data in a self-service manner.
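To make this coupling concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module rather than any Dremio API. The table names `sales` and `summary_total_sales` and the toy rows are hypothetical. It illustrates two things: the end user has to name the physical summary table to benefit from it, and a pipeline-built summary can drift from the base data until the pipeline runs again.

```python
import sqlite3

# In-memory database with a base table and a derived summary table.
# All names and rows here are hypothetical, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10.0), ('east', 5.0), ('west', 7.5);
    -- Derived dataset, as a pipeline step might build it.
    CREATE TABLE summary_total_sales AS
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")


def logical_totals(connection):
    """Query written against the logical model (the base table)."""
    sql = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return connection.execute(sql).fetchall()


def physical_totals(connection):
    """Query that only works if the user knows the summary table exists."""
    sql = "SELECT region, total FROM summary_total_sales ORDER BY region"
    return connection.execute(sql).fetchall()


# Right after the pipeline runs, both queries agree.
fresh_match = logical_totals(conn) == physical_totals(conn)

# New data arrives in the base table; the summary is not rebuilt.
conn.execute("INSERT INTO sales VALUES ('west', 2.5)")
stale_match = logical_totals(conn) == physical_totals(conn)

print("summary matches base data after build:", fresh_match)
print("summary matches base data after new rows:", stale_match)
```

In this toy setup the summary table answers the question faster in principle, but only for users who know its name, and only as accurately as its last rebuild.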
At Dremio, we want to make data as easy as possible. One of the ways we do this is by enabling data architects to build a lakehouse architecture to support analytics — which is simpler, cheaper, and more open than a warehouse-centric architecture. Another key innovation that makes data easy is our semantic layer, which helps data engineers build a logical view of data, define common business logic and metrics across data sources, and expose data in business-friendly terms for end users.
Dremio’s semantic layer has two important differentiators that make data easy for end users. First, Dremio’s semantic layer gives end users a self-service experience to curate, analyze, and share datasets. In addition, query acceleration is completely transparent to end users, so analysts can quickly build out datasets and consume data without worrying about performance, and work completely in their logical data model. Every user and tool that connects to Dremio benefits from Dremio’s transparent query acceleration, so analysts can use BI tools as thin clients through a live connection, instead of having to create extracts and cubes for performance.
One of the key technologies in Dremio’s query acceleration toolbox that makes Dremio’s semantic layer so powerful is reflections. Reflections are materializations that are aggregated, sorted, and partitioned in different ways and that transparently accelerate queries, like indexes on steroids. Dremio persists reflections as Parquet files in your data lake.
Reflections are similar to materialized views, but have a few key differences:
With reflections, end users can work freely in their semantic layer without ever needing to know about their physical data model. Data engineers can eliminate redundant data pipelines and physical data copies, as well as their associated compute and storage costs. In addition, data engineers no longer need to spend time working with end users to help them understand which derived datasets can accelerate their specific workloads, which ultimately accelerates time to value.
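The idea of transparent substitution can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Dremio's actual planner logic: the `AggShape` type, the `covers` and `choose_source` functions, the materialization names, and the column names (`pickup_month`, `pickup_day`, `vendor_id`) are all hypothetical. The sketch treats a materialization as usable when it is built on the same dataset and contains every dimension and measure the query needs, and otherwise falls back to the original dataset.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# (aggregate function, column), for example ("SUM", "tip_amount").
Measure = Tuple[str, str]


@dataclass(frozen=True)
class AggShape:
    """Shape of an aggregate query or of an aggregate materialization."""
    source: str
    dimensions: FrozenSet[str]
    measures: FrozenSet[Measure]


def covers(mat: AggShape, query: AggShape) -> bool:
    """True if the materialization holds everything the query asks for.

    Assumes measures are re-aggregable (SUM, COUNT, MIN, MAX), so that a
    finer-grained materialization can be rolled up to coarser dimensions.
    """
    return (
        mat.source == query.source
        and query.dimensions <= mat.dimensions
        and query.measures <= mat.measures
    )


def choose_source(query: AggShape, materializations: Dict[str, AggShape]) -> str:
    """Pick a covering materialization, or fall back to the logical dataset."""
    candidates = [name for name, mat in materializations.items() if covers(mat, query)]
    if not candidates:
        return query.source
    # Fewer dimensions is used here as a rough proxy for fewer rows to scan.
    return min(candidates, key=lambda n: (len(materializations[n].dimensions), n))


VIEW = "NYC Taxi Trips & Weather"
TIP_MEASURES = frozenset({("SUM", "tip_amount"), ("COUNT", "tip_amount")})

materializations = {
    "agg_by_month": AggShape(VIEW, frozenset({"pickup_month"}), TIP_MEASURES),
    "agg_by_day": AggShape(VIEW, frozenset({"pickup_month", "pickup_day"}), TIP_MEASURES),
}

monthly_tips = AggShape(VIEW, frozenset({"pickup_month"}), TIP_MEASURES)
by_vendor = AggShape(VIEW, frozenset({"vendor_id"}), frozenset({("SUM", "tip_amount")}))

chosen_for_monthly = choose_source(monthly_tips, materializations)
chosen_for_vendor = choose_source(by_vendor, materializations)

print(chosen_for_monthly)
print(chosen_for_vendor)
```

The point of the design is that the query author only ever names the logical dataset. Whether a materialization is used, and which one, is decided by the system at planning time.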
The easiest way to see how Dremio’s transparent acceleration benefits end users is through a quick example. We’ll run through the example below from the perspective of an end user, interacting through the Dremio UI.
Suppose we have a dataset named NYC Taxi Trips & Weather:
Dremio’s built-in lineage graph shows us that NYC Taxi Trips & Weather is a view that combines two datasets:
We can also see through a simple COUNT(*) query that NYC Taxi Trips & Weather contains over 1 billion records:
With Dremio’s transparent acceleration, we can analyze this billion-row dataset with interactive speed without having to know about how the data is organized, or any underlying physical optimizations that may (or may not) exist.
Let’s use the SQL Runner to run the following query, which calculates the average tip amount for taxi trips in each month, along with each month’s average minimum and maximum temperature:
SELECT
  MONTH(pickup_date) "Month",
  ROUND(AVG(tip_amount), 2) "Average Tip Amount",
  ROUND(AVG(tempmin), 2) "Average Min Temp",
  ROUND(AVG(tempmax), 2) "Average Max Temp"
FROM "NYC Taxi Trips & Weather"
GROUP BY 1
ORDER BY 1 ASC
This query on a billion-row dataset took less than a second to run — 493ms to be precise. The query’s raw profile shows us the details:
493ms (including preparation steps like metadata retrieval and planning) to run an aggregate query on a billion-row dataset? How? Let’s take a look at what happens under the hood when we run a query.
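As a rough illustration of why a pre-aggregated materialization can answer a query like this quickly (a general technique, not a description of Dremio's internals), note that a per-month average can be derived from a stored per-month sum and count. The Python sketch below uses hypothetical toy rows and function names. It builds a small monthly aggregate once, then answers the average-tip question from that aggregate instead of rescanning every trip.

```python
from collections import defaultdict

# Hypothetical toy rows: (pickup month, tip amount).
trips = [(1, 2.0), (1, 4.0), (2, 3.0), (2, 1.0), (2, 6.0)]


def avg_tip_direct(rows):
    """Average tip per month, computed by scanning every row."""
    by_month = defaultdict(list)
    for month, tip in rows:
        by_month[month].append(tip)
    return {m: round(sum(tips) / len(tips), 2) for m, tips in sorted(by_month.items())}


def build_monthly_aggregate(rows):
    """One pass over the raw rows, keeping only SUM and COUNT per month."""
    agg = defaultdict(lambda: [0.0, 0])
    for month, tip in rows:
        agg[month][0] += tip
        agg[month][1] += 1
    return {m: (total, count) for m, (total, count) in agg.items()}


def avg_tip_from_aggregate(agg):
    """Average tip per month, computed from the small aggregate only."""
    return {m: round(total / count, 2) for m, (total, count) in sorted(agg.items())}


monthly = build_monthly_aggregate(trips)   # built ahead of time
direct = avg_tip_direct(trips)             # touches every raw row
accelerated = avg_tip_from_aggregate(monthly)  # touches one entry per month

print(direct)
print(accelerated)
```

With real data the aggregate would hold at most twelve entries for a month-level grouping, regardless of how many raw rows sit underneath it, which is why reading it is so much cheaper than scanning the full dataset.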
The best part is that to a user, all this acceleration is completely transparent and happens behind the scenes. In the example above, we saw that we didn’t need to know anything about the underlying physical data model when working with our data. We submitted a query that performed aggregations over a large virtual dataset that contains data residing across multiple sources, and let Dremio do the rest.
While semantic layers aim to expose a common view of data for end users in business-friendly terms (with common business logic and metrics across data sources), it’s important to ensure that end users can also still work with data quickly so they can make timely, impactful decisions for their companies.
In this blog, we discussed how Dremio’s semantic layer achieves this through transparent query acceleration. We showed how easy it is for end users to analyze large datasets in their semantic layer without sacrificing speed or ease of use, and walked through one of the key technologies in Dremio’s query acceleration toolbox that makes this possible. In addition, we learned about how Dremio’s approach to query acceleration makes life easier for data engineers, not just end users.
If you’d like to learn more about reflections, check out our documentation, watch this technical deep dive video, or read this whitepaper. You can also get hands-on with Dremio through our Test Drive. It’s the simplest and fastest way to experience Dremio’s lakehouse (for free!).
Thanks to Brock Griffey, Jason Hughes, and Tomer Shiran for their guidance on this blog.