How a Self-Service Semantic Layer for Your Data Lake Saves You Money

We all know the benefits of a semantic layer for your data, but how do you implement one on your data lake? In this webinar we discuss common challenges with semantic layers and how to overcome them, how a semantic layer reduces pipeline complexity, and best practices for successfully implementing a semantic layer on the data lake.


Transcript

Lucio Daza Hello everyone and thank you so much for being here with us today. If you are here for the webinar on the self-service semantic layer with Dremio, you are in the right place. We are going to allow one more minute for the rest of the audience to trickle in and we will get started shortly. In the meantime enjoy the trivia questions and we will be right back.
Lucio Daza All right everyone, thank you so much for being here with us. My name is Lucio Daza. I run technical marketing here at Dremio, and today we have a very exciting presentation prepared for you called How a Self-Service Semantic Layer for Your Data Lake Will Save You Money. Before we get started, there are a couple of things that I want to run by the audience. First of all, let's go ahead and meet today's expert, Brock Griffey, one of our solutions architects here at Dremio. Brock, I am going to run through this list of amazing items, and all I have to say is that I don't know how much I have accomplished in life, because this list is just great. So Brock ran 13 Spartan races, and he did so in the year of his wedding. This is just amazing. Also something that I'm very envious of, check this out: he has his own server, I call it a mini data center, in his house. The only thing that I'm not envious of is the constant noise that those little machines can produce, even though he calls it his own white noise machine. And in addition to that, Brock has a super cool Tesla, and as we put here in the description, he even wrote a script to pull information from it and visualize it on a dashboard. Isn't that cool? Brock, thank you so much for being here with us. How are you today?
Brock Griffey I'm good, thank you for having me.
Lucio Daza Awesome. So before we get started, a couple of things that I want to mention. This webinar is being recorded and will be published on our site later, along with the slides and the transcript, so you can go back and check this webinar out, plus the other webinars we have in the library. Just go to Dremio.com/library and you will find all our webinars, white papers, data sheets, you name it. All the information is going to be there.
Lucio Daza So what are we doing today? In the next 30 minutes, which at this point is roughly 26 minutes, we are going to talk about the common challenges with semantic layers. We are going to talk about how you can reduce the complexity of your data pipeline using a semantic layer, and Brock is going to walk us through the best practices for implementing these kinds of solutions. Time permitting, we will have a Q&A session at the end of the presentation. However, if you have any questions, please do not wait until the end. Just go ahead and put your questions in the question panel in your GoToWebinar interface and we will try to address them as we speak. And without further ado, Brock, the time is yours. Everybody please enjoy.
Brock Griffey Thank you. All right, so just to give you a little bit of an intro into what the semantic layer is in Dremio. Many of you may have seen this before, but really, the semantic layer in Dremio is a place where you can create your own virtual datasets, create your own spaces, share those spaces, and collaborate with your other engineers or business users. This creates an easy place for everyone to just work with the data.
Brock Griffey As you may know, there are two parts to this, right? There's also the raw zone. When you register any dataset from data lake storage or another source, it shows up in the raw zone as a raw, or physical, dataset. From there you can provision virtual datasets into your semantic area. And then on top of all this, you can use your tools, right? So you can use Python, R, Jupyter Notebooks, Tableau, Power BI, Looker, any of those tools. Any tool that uses ODBC or JDBC, and soon Arrow Flight, will be able to access Dremio.
Brock Griffey So it's great that we have this semantic area where we can go out there and create anything we want. But without a little bit of governance, a little bit of control around this, we end up with very messy semantic layers, which make it harder for you to actually move forward. For example, if you can see inside here, we have this NYC Taxi folder, and inside this space we have "trips" multiple times, we have "metrics" multiple times, and we have "demos" multiple times. If you don't organize this properly and make sure that there is some kind of governance around it, you may end up in a situation like this.
Brock Griffey So this is where the best practices come in. This is a concept that's been around for a long time. The best practices are based on the concepts of ETL and ELT and the way that you structure your data when you pull it from a source and put it into a data warehouse. Traditionally you load your data into a staging, or landing, area, and then you cleanse it and move it to the next, base layer. Sometimes in these data warehouses we'll call that the semantic layer. Then on top of the semantic layer you create a view layer, also known as a reporting layer or application layer, where you apply your business rules and definitions, and then you expose that out to your end users. This process has been used in enterprise data warehouses for 30-plus years. As long as I've been around, this concept has been around as well.
Brock Griffey So in Dremio, the great thing is we don't have to create copies of the data and we don't have to create all these complex ETL pipelines. We can just use virtual datasets. The purpose of this semantic layer is to expose a representation of the organization's data assets using common business terms. It gives the business an easy way to look at the data and understand what it means. It also enables your IT team to create security around everything while still promoting data exploration. We'll show you exactly how to do that. And of course the best thing about all of this is that by structuring it the way the best practice lays out, you'll know exactly where to apply performance accelerations, such as data reflections, to get even better performance and sub-second query times.
Brock Griffey So what I'm going to walk through now is the best practice, which is a prescribed, layered approach to VDS creation. I just want to mention that every layer has its own purpose, and they all follow a common pattern, which makes onboarding easier if you use this. If you're going between multiple environments, maybe dev, prod, and others, or maybe an environment per group, and you have users moving between those groups, it makes it easier to onboard and get people familiar. Or say you're working for a consulting agency and you're going between different customers; if the customers are all implementing the same best practice, it's easy to understand the organization and layout of each environment. This approach also promotes a more performant, low-maintenance solution, so you can easily migrate between environments and keep things very consistent.
Brock Griffey It allows us to have reusable virtual datasets as building blocks rather than recreating everything all over the place. In the past, you may have seen tools, like Business Objects, where everyone creates their own view of what they want the data to look like. If you don't govern it somehow, you may end up in that same situation. So we want to make sure that people are following these kinds of best practices.
Brock Griffey And so I'm going to go ahead and talk a little bit about naming. The naming of the layers is all logical. You can use whatever naming terms and standards you want within your own environment. The idea is to stay consistent within an environment. So if you use different names, which I'll show you on the next slide, just make sure that you follow the same naming pattern throughout your different environments: if you have dev, prod, and QA, they should follow the same naming patterns. If you're working with different customers, they may use different semantic definitions, so you just need to make sure the pattern is the same, even if the names differ a little.
Brock Griffey And without further ado, I'm going to show you what this looks like in practice. As you can see, this looks like the first slide that I showed you, where we have the data lake storage at the bottom; those are your physical datasets, your raw zone. Then we have the staging layer, on top of that staging layer we have the semantic layer, and lastly we have a reporting layer on top of that. Each one of these will be its own space that we'll create, and I'm going to walk through each space one at a time.
Brock Griffey So first, the sources, right? These are your physical datasets. They're created automatically when you register a dataset, and when you format data from a data lake environment, it will automatically show up as a physical dataset. So there's not much to this; you can't really go wrong. You just add your datasets. Next is the staging layer.
Brock Griffey This layer is the layer we use to prevent rework. I like to think of it as a shim layer. If the source underneath the semantic layer changes, we can easily just change the FROM clause to pick a different staging folder, and that folder will have a representation of the data in the same format it had for the other source. I'll show you a little later what I mean by that. In this layer we're going to be doing common transformations and casting for joins, but you should not be doing the joins themselves here, just the casting for them, as well as cleansing the data if you need to clean it up a little bit.
Brock Griffey Some rules: you should have a staging folder and then a sub-folder per source. In here we're doing one-to-one column mappings between the physical dataset and the virtual dataset, which means no joins in this layer and no grouping. Common tasks in this layer are column aliasing, data type casting, data cleansing, and derived columns.
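To make this concrete, here is a minimal sketch of what a staging-layer VDS might look like. All names (Staging.teradata.customer, Teradata.sales.customer, the columns) are hypothetical, and the exact DDL keyword (CREATE VDS vs. CREATE VIEW) varies by Dremio version:

```sql
-- Staging-layer sketch (hypothetical names): one-to-one with the physical
-- dataset, no joins, no grouping; just aliasing, casting, and light cleansing.
CREATE VDS Staging.teradata.customer AS
SELECT
  CAST(cust_id AS BIGINT) AS customer_id,   -- cast now so later joins line up
  TRIM(cust_nm)           AS customer_name, -- cleanse stray whitespace
  CAST(signup_dt AS DATE) AS signup_date,
  UPPER(state_cd)         AS state_code     -- derived/normalized column
FROM Teradata.sales.customer
```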
Brock Griffey So the next layer, the semantic or business layer, is the common view of the data that the business would actually look at. This is almost like your data model, right? In this area, your VDSs become reusable components for the next layer. You're going to use folders for organization, so maybe you have sub-folders broken out by subject area or vertical, something like customer, order, or products. The source data for this layer should only come from the staging layer. We shouldn't have source data coming from within the same semantic layer; it should come from the staging layer below and be pulled up. In this layer we also shouldn't be doing any grouping, right? Unless you're trying to get rid of duplicates, you shouldn't be doing any grouping here.
Brock Griffey The common tasks are the same as the previous layer: you're going to be doing column aliasing, data cleansing, and maybe derived columns. Also, this is a really great place for creating raw reflections, because these datasets will be used by many reports in the reporting layer, so this is a great place to put a raw reflection as needed.
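As a hedged illustration, a semantic-layer VDS might simply re-expose a staging view under business-friendly terms, and then get a raw reflection on top. Names are hypothetical, and reflection DDL of this shape exists in Dremio but its exact syntax varies by version:

```sql
-- Semantic-layer sketch (hypothetical names): sourced only from staging,
-- business-friendly naming, no grouping.
CREATE VDS Semantic.customer."Customer" AS
SELECT customer_id, customer_name, signup_date, state_code
FROM Staging.teradata.customer;

-- A raw reflection here accelerates the many reports built on top of it.
ALTER DATASET Semantic.customer."Customer"
CREATE RAW REFLECTION customer_raw
USING DISPLAY (customer_id, customer_name, signup_date, state_code);
```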
Brock Griffey The last layer is the reporting and application layer. This is an access layer for your applications, your reporting tools, and maybe your customers, depending on how you have it laid out. Some rules for this layer: of course, you want to use folders for organization and make sure specific virtual datasets are located inside those folders. The source data should only come from the semantic layer; it should not come from somewhere within the reporting layer, and it should not come from the staging layer. This helps keep everything secure and makes sure there are no data lineage breaks.
Brock Griffey The common tasks are very similar to the other layers: column aliasing, data cleansing, derived columns. Here, though, we can do grouping, and I also recommend that you create aggregation reflections at this level as needed. Now, grouping is something you may or may not want to do, because with aggregation reflections Dremio can automatically do the level of grouping you need whenever you write a query against that virtual dataset. So if a tool like Power BI, or any other BI tool, comes in and does a sum on a column, with an aggregation reflection it will automatically use the reflection while pointing to the original dataset, give you the performance you need, and let you keep drilling down. By doing that, you won't need the grouping to happen in the VDS at all. But there are some cases where you might want grouping, for example to remove duplicates, and this is the common place to do it.
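A sketch of both options, again with hypothetical names and version-dependent DDL: a pre-grouped reporting VDS that joins semantic VDSs only, and, alternatively, an aggregation reflection on the semantic VDS so BI tools can aggregate on the fly:

```sql
-- Reporting-layer sketch (hypothetical names): joins semantic VDSs only;
-- grouping is allowed here if the report genuinely needs it.
CREATE VDS Reporting.sales."Orders By State" AS
SELECT
  c.state_code,
  COUNT(o.order_id)  AS order_count,
  SUM(o.order_total) AS total_revenue
FROM Semantic.orders."Order" o
JOIN Semantic.customer."Customer" c ON o.customer_id = c.customer_id
GROUP BY c.state_code;

-- Alternatively, an aggregation reflection on the semantic VDS lets BI tools
-- sum and drill down against the ungrouped view with reflection speed.
ALTER DATASET Semantic.orders."Order"
CREATE AGGREGATE REFLECTION order_agg
USING DIMENSIONS (customer_id) MEASURES (order_total);
```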
Brock Griffey So if you do this right and you follow this approach, the graph view will show a very straight, linear, logical view of your data lineage, right? Everything should flow from the sources to the staging.
Brock Griffey Sorry, my headset cut out there. Hopefully everyone can hear me now.
Brock Griffey Okay, thank you. Everything should flow from the sources to the staging to the semantic to the reporting. It should be very linear. And this helps make sure that we don't get into some problems that I'll show you here in a second. So what could go wrong?
Brock Griffey If you aren't following these practices and you start linking things together, you may start running into issues. Before I go much further into that, I'm going to talk about two different things that happen here. In Dremio we have this concept of query planning, and what a query plan does is plan the path of the query. It goes through and works out every step it needs to take to access the data and what it should be doing. Maybe it's using a reflection; it evaluates, should I use a reflection or not? Should I go access the original data? Should I be splitting this out? How many nodes do I need to use? It's the step-by-step of how Dremio processes the data.
Brock Griffey The other thing I want to talk about is heap, specifically coordinator heap. Coordinator heap gets used when you do query planning, and if the query plans get very big, they start taking up a lot of memory. It's also used for internal Dremio metadata queries. So if you have a lot of large things happening in heap memory, you can run into problems.
Brock Griffey So for instance, this is a bad query plan. In this case, the user had created a virtual dataset that tries to join everything together, but also does a lot of cyclical joins: they're joining to the layers below it and to the layers above it to try to create this dataset. They're also doing small joins everywhere and building on top of each other. They take one virtual dataset, create a join to another virtual dataset just to pick up two columns, and then keep doing that maybe ten times until they finally have a virtual dataset that has everything they want inside of it. And they may only actually use three or four of those columns, but they've created this monster virtual dataset.
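In SQL terms, the anti-pattern looked roughly like this sketch (all names hypothetical):

```sql
-- Anti-pattern sketch (hypothetical names): each VDS bolts one or two columns
-- onto the previous one, sometimes reaching back down into staging, so the
-- final "monster" VDS drags ten layers of joins into every query plan.
CREATE VDS Reporting.misc.step1 AS
SELECT b.*, a.col_a
FROM Reporting.misc.base b
JOIN Semantic.x.some_vds a ON b.id = a.id;

CREATE VDS Reporting.misc.step2 AS
SELECT s.*, c.col_b
FROM Reporting.misc.step1 s
JOIN Staging.src.other_table c ON s.id = c.id;  -- cross-layer join: avoid
-- ...repeated roughly ten times...
```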
Brock Griffey And this will lead you to problems like this. A query plan typically takes milliseconds to maybe 10 seconds. This one took a lot longer because of the way the virtual dataset was designed. And typically the phases are not this numerous: 195 is a lot of phases. Typically it's maybe 10 to 15, maybe a little more if you have a larger query. And just so you can see, when they ran that query, this is a view of what the heap looked like. You see it go up and down a little bit, but then at the end it dramatically increases, because this query plan took a very, very long time to plan, took up a lot of memory in the heap, and caused a lot of issues all around, including constant garbage collection.
Brock Griffey So again, like I said, in this situation they were doing a lot of joins to the layers above and below, they were building multiple reports off of each other, adding one to two columns at a time, and they were trying to create this very huge dataset. This obviously is going to give you slow performance. If you follow the prescribed approach, that will help you avoid this scenario.
Brock Griffey So how did we fix this? We went through and re-evaluated it. We clustered similar virtual datasets together to combine them at a higher level, so we removed those small, one-off joins all over the place. We also removed any circular relationships. To do this, we had to review the profile; in Dremio there's an option to view the profile: if you go to the Jobs tab, you can see a detailed view of what's happening. We also flattened out any of the nesting they had going on.
Brock Griffey If all of this fails, there's always the option of a temporary fix, which is to do a CREATE TABLE AS SELECT statement until you've fixed everything else. So there are options you can use to address the problem temporarily, but if you follow the prescribed approach, this should never happen in the first place.
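A hedged sketch of that temporary fix, materializing the problem VDS into Dremio's $scratch space (any write-enabled source would do; the VDS name is hypothetical):

```sql
-- Temporary escape hatch: materialize the expensive VDS once so downstream
-- queries read a physical copy instead of replanning the monster every time.
CREATE TABLE "$scratch".monster_materialized AS
SELECT * FROM Reporting.misc.monster_vds;
```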
Brock Griffey So remember before when I mentioned the shim layer? This gives you an idea of what the shim layer does. Say right now, today, I'm using a Teradata dataset for my customer view, my customer virtual dataset inside the semantic layer, and the staging layer is pointing to Teradata.customer to grab the columns that have been curated to match the format that I need. That's how it is originally. But if I'm moving everything from Teradata to HDFS or to S3, I want to be able to just change the FROM clause to say select from HDFS.customer. I can easily do that because both staging views have the same format; the common casting is already done, so they both look exactly the same. All I do in the semantic layer is select from HDFS.customer instead of Teradata.customer, and the view now flows through the new shim. It creates a very easy way to switch out your sources, and your end users never have to know. The performance will still be there and it'll still work for them; they'll query the same dataset with no problems.
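A sketch of that swap, with hypothetical names; CREATE OR REPLACE availability varies by Dremio version, and the same edit can be made by editing the VDS in the UI:

```sql
-- Before the migration: the semantic VDS reads the Teradata-backed staging view.
CREATE VDS Semantic.customer."Customer" AS
SELECT customer_id, customer_name, signup_date
FROM Staging.teradata.customer;

-- After the move to HDFS: only the FROM clause changes, because both staging
-- views already cast their columns to the same names and types.
CREATE OR REPLACE VDS Semantic.customer."Customer" AS
SELECT customer_id, customer_name, signup_date
FROM Staging.hdfs.customer;
```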
Brock Griffey I mentioned before that this helps you with your permissioning. This is a table view of what the permissions look like. Alongside the different layers, the staging, semantic, and reporting layers, we also have different user roles. Looking through these roles, we have a data engineer type role, a semantic modeler role, and a report developer role. As we go down through each layer, we can see the different permissions that would typically apply. These permissions are at a high level; you can go more granular into the folders you've added inside. So if you want specific data engineers to only have access to a specific sub-folder, you can do that as well. Same thing with any of the other roles.
Brock Griffey But typically, for the staging layer, the data engineer is the person creating that layer, so they're going to have the edit permission, whereas the data modeler is only going to have the view permission, and they need that view permission so they can create the semantic layer, where they have the edit permission. The report developer doesn't need to see the staging layer, only the semantic layer, so they have no permissions on staging but a view permission on the semantic layer. And of course, because they're building the report, they have edit on the reporting layer. For the other layers, I typically add edit or view if I want the data engineers and the semantic modelers to be able to assist the other users, or you can set those to none; it depends on your preference. The idea is that everyone has their own space with their own permissions, so you can make sure no one has access to something they shouldn't, while keeping it easy to go in and apply permissions at every level.
Brock Griffey So just to give you a brief idea of what this looks like, imagine Brock is the data engineer, right? Inside the source space, I'm going to have the Can Edit permission. This is what it looks like. If you're using the Community Edition, you may not see this, because Community Edition doesn't have these permission levels, but in the Enterprise Edition we have the ability to specify which users have access to a dataset. I would set myself to Can Edit. The other kind of access is Can Query. So in the staging layer, the data modeler is going to have the Can Query permission, whereas I, the admin, will have the Can Edit permission.
Brock Griffey In the next layer, the semantic layer, I left myself with the Can Edit permission. That's my preference, but you can change it so I only have Can Query; the data modeler would have Can Edit and the report user would have Can Query. And lastly, it changes around for the last layer: in the reporting layer, the report user would have Can Edit, the data modeler would have Can Query, and as the admin or data engineer I'd have Can Edit.
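In the UI these are the Can Edit / Can Query toggles Brock describes; newer Dremio releases also expose SQL GRANT/REVOKE, so the same matrix could be sketched roughly as follows (role and space names are hypothetical, and exact privilege names vary by version):

```sql
-- Permission-matrix sketch: edit on your own layer, view on the layer below.
GRANT ALTER  ON SPACE Staging   TO ROLE data_engineers;
GRANT SELECT ON SPACE Staging   TO ROLE data_modelers;
GRANT ALTER  ON SPACE Semantic  TO ROLE data_modelers;
GRANT SELECT ON SPACE Semantic  TO ROLE report_developers;
GRANT ALTER  ON SPACE Reporting TO ROLE report_developers;
```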
Brock Griffey So that gives you a brief walkthrough of how that works. I think at this point I'm going to open it up to some questions.
Lucio Daza And I'm sorry Brock, while you go through the questions, something that I want to mention is that there are a lot of resources that you, the audience, can go and check out if you want to learn more about what Brock just discussed. The first place I want to direct you to is our white paper on best practices when using semantic layers with Dremio; here's a link, which I tried to shorten as much as I could. If you cannot capture the link, as I mentioned before, just go to Dremio.com/library and you will find our webinars and, of course, white papers like this one. If you want to practice with Dremio, if you want to experiment, just go ahead and deploy it: go to Dremio.com/deploy. And if you are not familiar with Dremio yet, I invite everyone to go to Dremio University. It's a free online learning platform that we have developed for you, where you can take all the classes. You have the opportunity to launch your own virtual lab with the latest Dremio Enterprise Edition in it, so you can kick the tires and try the exercises. Practice everything that Brock just showed; we talk about it in Dremio Fundamentals, I believe, which is the first course we have in there. You will be able to learn about data reflections as well, security, you name it. And once you're done experimenting with Dremio, if you still have any questions, go ahead and check out the Dremio community, which is also free; get in there, start asking questions, and join other folks in answering and posting challenges.
Lucio Daza So I really want to invite you all to go ahead and check out all these resources. And so Brock, it looks like we have a ton of questions here, some that I've been able to answer while you were talking, and some others that we will follow up on after the presentation. But I see one; let's see if you want to take a shot at it. It says, "How does personally identifiable information, PII, feed into the semantic layer best practices?"
Brock Griffey Yeah, that's a great question. You'll notice that throughout these layers we have not really addressed PII and how that works. Typically, people want to cleanse PII out at the very beginning, and if you're going to do that, what we've done in the past is create a separate staging piece for your PII data so that data can be masked out. And because of the way reflections work with PII data, I would keep the PII masked out of the virtual datasets until it's needed for a join, and give the PII its own section, its own folder, throughout the layers. This lets you join back to that data based on the primary key, or however your data is related, and recover the PII information when you need it without stopping you from using your reflections.
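A minimal sketch of that separation, with hypothetical names: the masked view lives in the normal staging path, while the raw values sit in their own locked-down folder, keyed so authorized users can join back:

```sql
-- Masked staging view (hypothetical names): safe for general use; the join
-- key is preserved so other layers can still relate records.
CREATE VDS Staging.teradata.customer_masked AS
SELECT
  customer_id,                 -- primary key kept for joins
  'REDACTED' AS customer_name, -- PII columns masked out
  'REDACTED' AS email
FROM Teradata.sales.customer;

-- Restricted companion view in its own locked-down folder: authorized users
-- join back on customer_id to recover PII only when genuinely needed.
CREATE VDS Staging.pii.customer_pii AS
SELECT customer_id, customer_name, email
FROM Teradata.sales.customer;
```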
Brock Griffey There was also another question I wanted to answer, about using virtual datasets inside of other virtual datasets. So I'm just going to go back to a slide here. Referring to a virtual dataset inside another virtual dataset is something you should certainly be doing; that is how it should be done, right? If you look at this data lineage for a report, this virtual dataset is composed of multiple virtual datasets from the semantic layer: it's taking the customer, the line item, and the orders, joined together, to create the report. That's how it should flow. It should always be virtual datasets built on top of virtual datasets, but coming only from the layer below them, so the lineage stays very straight.
Lucio Daza And just to clarify, did I lose it? I read the question and now I cannot see it; I was going to read it out for the rest of the audience, but I don't see where the question went. Anyhow, thanks for adding that, Brock, very helpful. Are there any other questions that you would like to address, Brock? Or anything you would like to add?
Brock Griffey For the actual semantic layer, I think that was the main question. There are some other questions out there that don't relate to the semantic layer, but I'm sure we can answer them outside of this.
Lucio Daza Yeah, definitely, we will follow up. All right, Brock, thank you so much. This has been amazing. I appreciate your time and I appreciate you sharing your knowledge with the rest of the audience. For those of you who may have arrived late, I want to remind you that this recording is going to be posted in our library later, so please keep checking our website for all the resources in terms of webinars and white papers as well.
Lucio Daza And also, if you want to see Dremio in action, we have a live demonstration every Tuesday that you can register for. If you go to Dremio.com you will see the registration button right there on the home page. I deliver a live demo where we tackle a couple of use cases and you can see how this whole thing works in real life. Other than that, I hope everyone enjoyed. I hope everyone has a good week. Everybody stay safe, stay healthy, and Brock, once again, thank you so much. Everyone have a good day. Bye bye.
Brock Griffey Thank you, everyone.