March 2, 2023
11:45 am - 12:15 pm PST
Massive Survey Dataset Exploration with Dremio and Power BI: Discussing Strategies for an Optimal View
In this session, learn how one ClearPeaks customer used Dremio and Power BI to analyze a massive survey dataset that has grown to over 1 billlion rows. By switching to Dremio and Power BI on data lake storage from their Relational Database Management System (RDBMS) this customer was able to overcome data volume limitations, optimize query performance, and deliver faster insights at reduced costs. ClearPeaks will use a subset of the actual dashboards to discuss best practices for creating views and reflections in Dremio based on the analytic use case.
Sign up to watch all Subsurface 2023 sessions/h3>
Note: This transcript was created using speech recognition software. It may contain errors.
Oscar Martinez Rubi:
Welcome everyone and get ready for a fun next 30 minutes. I’m Oscar Martinez. I am the head of Big Data Cloud and Advance Analytics in Clears. And together with me, I have the pleasure to, to have Alex Maio, one of the data engineers in in my team. today, as, as Nate mentioned, we are gonna discuss basically how we use drio and, and Power BI to optimize the way that one of our customers is managing Sobi data. we’re gonna have a little introduction to, to click picks to us, and then we’re gonna jump right into the, the use case with a little reminder refresher of some couple of Dremeo concepts, which I’m sure most of the audience will, will be aware of. And then I’m gonna give it to, to Alex, and he will basically explain the rest of the use case.
And then, you know, the meat of the talk will be covered by, by Alex. and then we will have a wrap conclusions and then q and a where we will answer questions that, that you might write in the q and a section. Okay. clear picks. We are a basically consultancy firm that is specialized, is in everything about data from, from the typical BI to big data and cloud and a bunch analytics. And we also have a bunch of developers that produce custom applications for, for the custom experience of, of our customers. We cover all the planet, our headquarters in Barcelona, and we also have a few offices around the globe. But but yeah, we are doing everything about data for over 23 years already. the, the project that we have been working is it’s basically massive survey exploration.
And basically it’s a customer that, eh, the job is to collect and, and to execute surveys, ask questions to people basically around Europe. And they are collecting all the, all the informations every, every survey, eh, have around between 200, 400 questions. And it’s asked two thousands of thousands of people. So you can imagine that the amount of data that this is collecting is relatively large until now. The customer was storing the data into our classical relational database management system, be it Oracle, postgre, my either of this. And they were using the b i extracts the, the extract features of, of a certain BI tool. Okay. And they basically, they were loading the entire dataset into extracts what was occupying a lot of space. And the more the, the dataset was growing, they started to have concerns with the cost of the, of the server and the associated cost of the license.
Okay. So at some point they came to us and they asked for advice because, you know, by now when they came to us, it was 1 billion of row, and now it’s already 2 billion of row. Basically, what we proposed was to, to decouple the storage and compute. So basically all the survey data is gonna be stored in, in a data lake. And then basically we’re gonna have a very cool query engine like dremeo basically to do all the heavy compute to execute the queries and the light, the, the weight, the workload that the BI two has to handle is very, very light. There is no need to use bi extracts. It can be a very small server. And basically this means a very cost effective solution. The solution itself is cloud agnostic because it, it can run anywhere, not only in the cloud, it can also run on premise.
And basically what we did for today was to generate a dummy dataset with 1 billion rows, and that’s the one that Alex will, will show. And we will talk about this Dami dataset. so this is the old architecture answers of the SUV data slowed into our relational database management system. Then we have the BI tool that it’s getting the extracts and basically add to as a compute platform, and it is produce a dashboard. Our solution, as you can anticipate, is basically we store the data in a bunch of CSVs or ques wherever in Amazon Stream. Then we have REO that is basically the SQL engine in reo. We are gonna make use of the reflections, which I will cover in just a, a second. And then basically we demonstrated the capability with Power BI because we just lost Power bi, but it could do the same with other BI tools.
But today we decided to, to do this demonstration with Power bi. so now before I give it to, to, to Alex, just a little refresher, refresher on reflection reflections in reo. I’m sure most of you know it, it’s the main feature to accelerate queries. It’s something that you build on top of dataset. Dataset can be a, a view of the data or can be an, an anchor data set. You can program DR to automate to update them automatically. And basically, Dr suggest creation of certain reflections when certain queries are run there. there Ismi recommends a way to organize the different data in different layers. And we have totally followed this, this, this recommendation, which is this, this semantic layer from the physical layer to staging business and application layer. so basically there are two types of reflections. There are reflections.
It’s basically a way to organize and partition the data, but the data is left as is kind of okay. You can, you can have some little transformations, but the data is basically here. What you get is to use a more efficient format like Parquette or Iceberg or, or some, some table format or file format that is more efficient and basically then does that for you if your data is not already in that format. If your data is already on, on an iceberg table, probably a road reflection won’t give a lot, but if it’s in CSVs, like it was over case road reflection will speed up a lot. The second type of reflection is the aggregation reflection, which is kind of a cube. Okay? so for queries that do group buys or wears or this kind of filtering, then basically you can use like a pre-computed cube that is totally automatically managed byo, and then basically your queries will be much faster because they will touch much less data, UMRA said, consists basically of one fat table that is very, very fat with 1 billion rows, and then a bunch of little dimension tables that cover the demographic data, like the gender, like the social level, like the, the studies or the, the, if it’s a woman or a male, and also a type of question that the type of are the questions and the answers.
And the big back table is basically the combination of all the answers. And with that, I will give it to, to Alex. So Alex, all yours.
Yeah. Thanks. Thanks, Oscar, for the introduction. It’s Alex speaking now, and I’m going to work you through the creation of reflections for getting the most out of REO when it comes to handling. Am I on mute? Yeah. No, you can
Oscar Martinez Rubi:
Hear me? No, you’re okay. No, you’re okay. Okay.
Let’s just start with with the deep dive into reflections. the first thing we do when starting a project with reo is a one to one bill for, for each of the tables in the physical data set, and we save them in the staging layer, right? within these view, we can perform some data cleaning actions such as renaming columns changing their type and so on. But these are only bills we created as we have the, the data start in the, in the data lake in series format. What we can do is using the, the view of the fact table in the staging layer, to use it as, as an anchor data set for a general purpose raw reflection. and with this first reflection, we achieve having the data in in an optimized format. And additionally, we can select some fields that appear in most of the common joints.
Wait, because I’ve passed this we can also select partitioning fields. on the table you see in the screen is what a reflection will look like when using the DR is ui. This reflection will improve the overall performance of the queries and even the creation of new reflections. So it is great to start with with this reflection for this specific use case. So what we’re going to, to see in the following slides is a step by step guide on how to create reflections since we’re going to see some other queries to familiarize with reflections and make sure that everyone follows a presentation. And after we have understood how we can liberate reflections to improve performance, we will focus on accelerating BI dashboards. So, to start building our reflection, we must know what we are going to query beforehand.
No, it’s, it’s a typical thing, a typical thing to do when, when working with bi. So this study is needed to detect the partitions the measures and dimensions that we are going to, to be selecting for the reflections. So then, then it is common that we find a query that has a queries in it. we must focus on this first and, and build reflections for the, for the subqueries, and then expand afterwards if needed. And after the previous steps, we can start creating an anchor data set, which will be the bill upon which we will build the reflection. And finally, we feel we build the reflection with the dimensions the measures and the partitions. So now that we we gonna start with another query to see how this works. Here we have a query for the survey data set, which result shows all the surveys that are 5%, 5% longer than the average.
And to make useful reflections, we must focus on the sub queries. The first one has account distinct operation on the field. Questionnaire 80, the second sub query has an average operation. And finally, we have the end the query, which only select fields, and there’s nothing special there. so we will make two reflections, one for, for each of the previous sub queries. So let’s focus on the first one for now. in the first subquery, we can see that there are two fields being used, survey id, which is in the group by and question id, which appears in the, in the account operation. With this fields, we create a view or anchor data set, and then we save the view in the business layer, for example, with the name of question count. And finally, we build a reflection on top of it.
when creating the reflection, it’ll contain the, the group by field and the distinct field, as I mentioned, and the con field, which is the one used together with the distinct as the measure and optional. We can also select approximate con distinct if you expect to use it any time. And in the second sub query, we can see that there is only one field being used con questions, which is a field that comes from the previous sub query. This field is, is used in an average operation. That means that we repeat the process of creating a view with the used fields, and we keep this bill for the second query, the business layer as well. And now with the name of AB questions per survey. And finally, we create the reflection by using the, the only field as as the measure.
And this time for calculating the average, we select some and count. So after we have all the reflections prepared, we can execute the query and check the, the job history in Romeo to see that we get around one second of execution time for 1 billion rows. And that the job used both reflections we created to finish faster. So now that we have seen how to create a reflection to accelerate our queries, let’s move on to the, to the dashboards with, with Power bi. previous to this, and as we just did with the, with the previous example, we must understand the data set and know the type of queries that we are going to cover with the charts. So, for example, we know that we are that we will always be filtered by question 80 to focus on a single question of the surveys.
And we’ll also be focusing on the proportions of the answers for each quest. And in addition to that, charts will be having a filter for the rest of dimensions in the dataset. so with the following slides, we will keep things simple by focusing on one single chart. We will show the queries generated by Power Pi to compute the chart and, and what reflections we can create, accelerate them. We will see what happens with addition of different possible filters, and we will show execution times to be able to compare different approaches when building reflections for bi.
We can now start with the dashboard. Here we see start with some filters around it. In this dashboard, we are calculating the proportions of the answers for a specific question, and we are dividing or grouping the result by its country. So we can guess that the dimension fields will be all the filters we are using as well as the countries and, and answer levels by, by which we are growing the result. But even so, we can run that visual and check the job that is created in Dream, which is this one. then we can create the, the reflection focusing on this query, right? So we can get what we can get of these query is the key fields that are highlighted. And other than that, we can say that this query looks a bit long and complex due to the joints that Power BI needs to perform, because we are querying from 80 different tables, right?
So the issue here is that this is the query from the previous chart. without using any of the optional filters, if we wanted to apply some of your split chart, we would have a query with even more nets of queries and more complexity. And this could be problematic. So another thing worth considering is the time it takes to run the query without any extra work. Just as we have everything by now you can sit in the top right corner of the screen. it takes 18 seconds. So let’s see why having complex queries can be problematic. one of the phases that each job in dreamy has to go through is the panic phase here, among other things the query is analyzed to seek a matching reflection that can accelerate it. when we have queries with high complexity, the panel might not fully check its structure and might not match the most accurate affections in order to save time.
So this is a problem for the, for the query we just saw, because it has many levels of nested queries, and we have even more as we plan more filters. So one solution we thought of is joining all the data in a build and querying to that, to that bill from Power bi. So in, instead of having the 18 tables model, right, so as this is, as this joint will be appeal, no data is being replicated, and we can take advantage of this to lower the complexity of the query. So that’s what we are going to we create a view that joins all the tables. We save the view in the layer within most fit the query, and we change the data source from Power BI to target the new view. So when plotting the chart, again, power bi we can, we, we get this query sent to Dria.
it is the same chart we saw before but now we are querying from a view that contains the data of all the tables. So we can see that we are querying from the virtual joint of the dataset of all the dataset in the land that says from processing dot, all joint processing is the semantic layer, and all joint would be the name of the view. For the highlighted fields, we have the survey id as aggregated field, and all the other as dimensions. So we can appreciate that this query is much simpler than the, than the one we had before. And now we can make use of reflections. So the same way we did before. We create a bill to have reflection with these fields, then we select each of the, of the fields as they mentioned or measures. And finally we can run the query again and check the job historian REU to see that the reflection is being used as well as the, as the time it takes to be executed.
In this case, we see it takes five seconds to execute the query. And if we compare this five seconds to the previous 18 seconds we can, we can say that we had, we have lowered the, the runtime thanks to this aggregation reflection. So although this has been successful, there is another thing to consider. If we wanted to apply some, some of the optional filters to chart, for example in the red side of the, of this image, we can see that we have selected dh gender and nationality filters. so once power BI starts a request to Dr. You, we end up with a, with a quarter. Like this is similar to the one we had before, but it contains more fields in the workloads, but now they are in the, in the same software level. So I could think, okay we could solve this by changing the, the previous reflection and adding all the possible filters as dimensions. So we would have an aggregation reflection with all the dimensions, all the 17 fields. But if we did that DMI would calculate each measure of the reflection for all the combinations of the dimensions. That means the reflection would take long to finish. But not only that, by putting so much data into a single reflection, we would be limiting reo parallelism to resolve queries.
We can check the job and see that we get the query in, in 23 seconds. We’re having the age, gender, and nationality filters. we’re also using the, the 17 dimension dimensions reflection, and that’s the lowest time we have gotten so far. So instead of, of this large reflection, we can keep the, the smaller aggregation reflection for when we do not use additional filters or when we have very few filters, but we must avoid using an aggregation reflection with so many different values. because what makes an aggregation reflection worse in performance is having high cardinality fields as they dimensions or simply having lots of dimensions. So if we run this query again without using this last giant ion reflection, we will see that the raw reflection we created at the beginning is covering the query, right? So we can see that it lasts eight seconds and that the reflection being used is the, the general purpose one, the, the one we started with together with all the one-to-one reflections targeting each its table of the data set.
And there is more to this if we go deeper into the details of a job in Ramo and see how it was planned. here we have the plan for the job of the application reflection with all the dimensions fields. It is a single threat of tasks. This is for the sample with the same filters. The only difference is that we use the giant aggregation reflection, the one that contains all the dimensions and took 23 seconds. So in contrast we have the job for the last approach the one that rhymes the same query, but with no aggregation reflection. Here we see that Theo plans it with a multiple thread of tasks, and it uses all the one-to-one road reflections from the staging layer. And this task is the one that took eight seconds to finish. So we have seen that there are some things to consider before creating reflections that we cannot just start creating them without, with without previously premium.
Right? So let’s finish with with a summary of the key points. we have just seemed to consolidate what we have learned from, from this use with reo. The most general things to keep in mind are related to the use of the right architecture for your data. Things like separating compute and storage are healthy practices that are being applied nowadays. by doing this, you can scale each of these independently and often it leads to reusable structures that do not rely on specific services, right? dream is a perfect example for that as it works directly with a data lake but remains isolated when it comes to, to scaling to allow that cost effective scalability. And comput. Another great thing is that it is not tied to any specific provider. as Oscar mentioned, you can build premium different clouds using even on-premise using different data lakes as the data source and connecting it to a wide variety of BI tools.
And regarding the most technical concepts, here are the main points we think you should remember. row and aggregation reflections have different purposes. None of them is better than the other. row reflections though are amassed when we do not have the data stored in an optimized, in an optimized format. And as for the aggregation books, for the aeration reflection we should avoid using one single reflection for all the possible dimensions. Okay? It offer, it offers better performance to have small reflections. we have seen that using the fierce raw reflection offers lower execution time than using the, the giant generation reflection with, with all the dimensions in it, but also beware with having a lot of reflections that perform similar calculations, because that may cause the planning phase of the query last longer. That happens because the matching of the queries with reflections is held during this planning phase, and having more possibilities to analyze means more time to find the most suitable one. And the last thing we have seen a useful way to prepare data to lower query complexity by joining it in a, in a single, in a single view. This will make all the filters be applied in the same supporting level from the VA tool, and will release the, the development with reflections under detection by planner. Okay, so here ends our presentation.