Dremio Jekyll


Subsurface LIVE Winter 2021

Power BI Best Practices for Working with Big Data

Session Abstract

Power BI is a great tool for building reports and dashboards against datasets of any size, but if you’re working with big data you need to make the right choices to be sure of getting good report performance. In this session you’ll learn how to choose between Import mode and DirectQuery mode for your dataset, find out how composite models and aggregations can help, and see a number of other options that can make a world of difference to how your report performs.

Presented By

Chris Webb, Principal Program Manager, Microsoft

Chris Webb is a member of the Power BI Customer Advisory Team at Microsoft.


Webinar Transcript

Anita:

[00:04:00] Okay. Let’s get started. All right. Okay. Hello, everyone. My name is Anita Pandey, and I will be your moderator today. I want to welcome Chris Webb, who is [00:04:30] from the customer advisory board at the Power BI group in Microsoft. He will be giving us some great tips and hands-on best practices on optimizing Power BI for working with big data.

And just before we get started, I’d like to go through a couple of housekeeping items. Number one, we will do a live Q&A about 10 minutes before the end of the session, so please [00:05:00] hold your questions for that. Second, feel free to use the chat module to the right to do live Q&A and queue up your questions. And then also, during the Q&A session, please remember to enable your mic as well as your camera to the right side of the screen. And finally, if we don’t get to some of your questions, Chris Webb will have [00:05:30] a dedicated Slack channel that I’ll also post into the chat, so you’ll get some great time there as well. So with that, Chris, please take it away.

Chris: Thanks a lot. All right. Let’s get started. So we don’t have an awful lot of time, but what I want to do here is cover pretty much two things. First of all, give you a very, very quick introduction to Power BI and specifically why you as a big [00:06:00] data person should care or be interested in Power BI. Then we’ll do some demos. Power BI connects to a wide variety of data sources, but I thought it might be appropriate to show you connecting from Power BI to Dremio. We’ve got a particularly nice experience there, so I’ll show you that. And then we’ll get onto the main part of the session, which is talking about how you optimize Power BI for working with large data volumes. That will involve a little bit of discussion [00:06:30] about how Power BI works internally. Talk about how Power BI can work with data in different ways. Talk about some best practices. And then we’ll leave some time at the end for doing some Q&A.

Okay. Now, for those of you who are not familiar with Power BI, let me give you a couple of marketing slides. I’ll apologize for the marketing slides. I never like to see marketing slides at a tech conference. But [00:07:00] we won’t spend too long on them. So Power BI is a Microsoft product. It is one of the fastest growing BI tools in the world. It’s something we are focusing on an awful lot, but basically it is a cloud-native BI tool. It has an on-premises cousin if you don’t want to put your data in the cloud, but we are cloud-first. All [00:07:30] of our features come first to the cloud. A lot of our features will only ever come to the cloud as well. But it allows you to quickly and easily create reports and dashboards and share them across your organization through the browser and through a variety of mobile applications.

It’s part of something we call the Power Platform. The Power Platform is a suite of tools developed by Microsoft which is intended to empower [00:08:00] not the technical user but the power user, the business analysts, the people who in the past might have used Excel and Access and tools like that to go away and be productive with their jobs. The Power Platform is the successor to those types of tools. And as far as Power BI goes, Power BI is the tool for building reports and dashboards. So if you’re somebody who in the past has spent an awful lot [00:08:30] of time building reports and dashboards in Excel, then we hope that building reports and dashboards in Power BI is pretty easy. Our related products are Power Apps, which is a low-code/no-code development platform, Power Automate is a workflow automation tool, and Power Virtual Agents is… Well, guess what? A tool for building virtual agents.

Power BI is one of the fastest growing products at Microsoft. We’ve had [00:09:00] five straight years of triple-digit paid growth, and we’re now at the point where we have millions, and I mean millions, of users every month building Power BI reports and consuming Power BI reports all around the world. So we’re a big deal. We’ve got the power of Microsoft behind us to go and sell. We have a very competitive pricing strategy, but that doesn’t mean that we’re a cheap and nasty tool by any means. [00:09:30] We think that we are one of the if not the best BI platform out there. Obviously we would say that, wouldn’t we? But if you’re into things like Gartner Magic Quadrants, this is the Gartner Magic Quadrant from February last year. [inaudible 00:09:44] another one this year. But you can see that we are right out there in the top right-hand corner as the clear leader in the AI the analytics and BI platform space. [00:10:00] So that’s a brief introduction to what Power BI is. It’s the BI and analytics tool that is taking the world by storm.

But why should you as a big data technical person really care about Power BI? So to state the obvious, all data really needs to reach this audience. The whole reason for working with data is to put it in the hands of business people. You can [00:10:30] use it to make decisions and more efficient decisions. So Power BI can be a tool that helps you achieve that goal. It’s a kind of data visualization, dashboarding, and reporting tool. So if you’ve prepared your data, you can then build a report and then send it out to a wide audience throughout your organization. And it’s a really great tool, and we hope that [inaudible 00:10:55] use it as well as all of the power users, business analysts, managers, [00:11:00] and people inside your organization. And it’s certainly true that, whether you know about it or not, there are an increasing number of people out there inside organizations who are just discovering it spontaneously. We find that it grows virally inside organizations. So whether you know about it or not, there may be people today inside your organization using it to build reports and dashboards.

So if you’re interested [00:11:30] in the area of data visualization and reporting and analytics pick up Power BI. You can download Power BI Desktop, which is our Windows-only development environment, for free. So you’ve got a Windows PC available, or you can run Windows somewhere in a VM, then you can download Power BI Desktop and start playing around with it.

So let’s get to the meat of the presentation. So I thought I’d kill two birds [00:12:00] with one stone and give you a quick idea of what Power BI can do by connecting it up to Dremio. Now, the interesting thing about Dremio is that there are actually two ways that you can connect Dremio up to Power BI. First of all, if I flip over to Chrome… Here we are in the Dremio portal. I have to admit that I’m not a Dremio expert, but I played around with it, and [00:12:30] I’ve got far enough to be able to show you the integration with Power BI. Here we are. We’ve got one of the sample datasets. I can see that this is a list of crimes in San Francisco committed in 2016. And apart from everything else on the screen, you will see that the Power BI icon up here has pride of place.

So let’s assume [00:13:00] I want to take this data and use it inside Power BI. I can click. This downloads something called a PBIDS, a PBIDS file. And this is literally a very small file. It provides the connection information for Power BI to connect up to this particular dataset. I can open it up. It will open up Power BI Desktop. Here we go. It may take a few seconds. [00:13:30] It always takes a bit longer when you’re doing a presentation to hundreds of people on screen. And what’s going to happen when Power BI Desktop opens is that I’ll have a connection already configured up to this dataset.

So as I mentioned earlier, Power BI Desktop is our development environment. As I said, it’s a Windows-only app, but it’s the tool that you use to design reports and work with data in Power BI. If you’re somebody who wants to consume a report, [00:14:00] you do not need Power BI Desktop. What you need is just a browser or the Power BI mobile application.

So we’re creating the connection here. I can see a preview of the data. If I wanted to go straight ahead and start doing something with it, I could do that. I can hit transform data. This will give me the ability to do a little bit more filtering [00:14:30] or transformation here in Power BI if I want. Let’s call this SF Crime. I’m going to click close and apply. In this particular situation, no data is downloaded into Power BI. Power BI has got what we call a direct query connection back to Dremio. I’ll explain what that is a little bit more in [00:15:00] a moment. But here we are creating that connection.

And in a moment on the right-hand side of the screen you’ll see the fields pane. You’ll see a list of the fields we’ve got available. And I can drag and drop. And I could say, “Well, let’s just…” First of all, let’s filter by date to try and make things a little bit [00:15:30] faster. Let’s look at just single dates. [inaudible 00:15:34] data. Which will come in a moment. There we go. So let’s filter this just by the 1st of January, 2016. And then let’s maybe drag category into here. [00:16:00] Let’s make a bar chart, and let’s take incident number. We’ll do a distinct count of incident numbers. And we will see in a moment a simple bar chart with all of our crimes and the distinct count of incidents that were associated with each crime.

We’ve got lots and lots of [00:16:30] visualization types here to choose from, because we’ve got latitude and longitude data. Well, it’s inevitable we’ll want to put this on a map, so let me drag the Azure Maps visual in here. Now I need to remember that Y is actually latitude. X is longitude. Let’s put category on the legend. And in a moment, what [00:17:00] we will see is a map with all our crimes plotted on it. Here we go. There. So there’s an awful lot here that I could do, but I’m not going to spend too much time showing you the visual capabilities of Power BI, because that’s not really the point of the [00:17:30] session.

Something else that I should mention is that you don’t have to start in Dremio to connect Power BI up to Dremio. You can instead go here to get data, go to more… And I’ll see a list of all of the data sources that I can connect to here in Power BI in a moment. I [00:18:00] think it’s being very slow, I suspect probably because I’m casting. So here I just search for Dremio. You can see that we’ve got a Dremio connector. I had to install some ODBC drivers to make this work, but that was a fairly trivial thing to do.

Okay, so let’s get back to the presentation. Hopefully this gives you an idea of what Power BI is. [00:18:30] Now, to answer the question of how do you work with Power BI and big data, I need to explain a little bit about how Power BI works internally. In Power BI, when we connect to a data source, we can’t just take that data and put it directly into a report. The thing we have to do first is create what we call a dataset. A dataset is a kind of semantic layer. And once you’ve modeled the data inside your dataset, connected up to different tables of data and different data [00:19:00] sources and created calculations and defined joins between tables, that’s when you can start to build your reports. And all of your reports get their data by querying the dataset, and the dataset in turn returns the data in different ways.

Now, a dataset contains multiple tables of data, and there are two fundamental ways that Power BI can store data in a dataset. On one hand, we’ve got import mode. So [00:19:30] import mode is where you can actually copy the data from your data source into Power BI. And Power BI has its own in-memory [inaudible 00:19:38] database inside it. Now, when you import data into Power BI, all of your reports are going to be as fast as they possibly can be, because the data is native. It’s there inside Power BI. All Power BI functionality is available as well. But there’s going to be a downside copying your data into Power BI, and that is that there is a limit on how much data you can import, [00:20:00] because it’s an in-memory database. And you’re going to have to refresh the data when it gets out of date.

On the other hand, you have DirectQuery mode. So in DirectQuery mode in Power BI, when you run a report, the report queries the dataset, and the dataset translates all the requests for data into whatever query language your data source understands, which nine times out of 10 is going to be SQL. So in DirectQuery mode, you [00:20:30] don’t have to import the data. You can get the data directly from the data source. So there are no limits on data volume on the Power BI side. There’s no need to refresh. On the other hand, report performance is going to be slower, and not all functionality is necessarily going to be available. The good new is that you don’t have to use one or the other for all of your tables. You can mix or match. And this is what we call a composite model. So with a composite model, you can store some of the tables in import [00:21:00] mode and others in DirectQuery.

Now, as far as big data goes, you can see that choosing a storage mode is going to be a very, very important choice, and it’s going to dictate whether you get the performance you need for your reports. Generally speaking, forgetting about big data for a moment, but generally speaking, at Microsoft we always say that import mode should be your default choice, because it gives you the best possible report performance, and that’s the most important [00:21:30] thing. But DirectQuery mode is there for larger data volumes and data that might be too slow to import, perhaps data that changes too frequently and therefore you can’t refresh the data every 30 seconds or whatever.

So it’s fair to say DirectQuery is going to be the key to working with big data. You’re going to be working with genuinely big data and you need to be able to query large amounts of data for your report, then you will need to use [00:22:00] DirectQuery mode. But you should always think when you’re building a report, do you actually need to query large data volumes, or can you get a small subset of the data and then import it? Because if you can, that will make all the difference.

A few general points to make about how Power BI likes to work as well. Regardless of whether you’re working with large or small data volumes, Power BI does like it stated to be modeled in a particular [00:22:30] way. Generally speaking, we say that with Power BI you should be modeling your data as a star schema wherever possible. Now of course, in the world of big data it’s pretty common to have your data in wide, flat tables, because you want to avoid expensive joins. And Power BI can work perfectly well with that, but avoiding expensive joins is only one thing you need to consider. There are going to be other situations where perhaps having something more like a dimensional [00:23:00] model will give you performance benefits, for example, earlier when you saw me drag that list of dates into the filter. There are going to be cases where Power BI wants to get a distinct list of dates. You’ve got a table which has only got one row for each date, then that will populate an awful lot faster than having to go over some very large table that has a date column in it.

Also, Power BI supports multiple fact tables, if you’re familiar with dimensional modeling, [00:23:30] and these fact tables can have different dimensionality and different granularities. And if you’re used to working with BI tools that aren’t able to handle that, and you’re used to working with apps trying to munge everything into one big table, you’ll probably be aware that doesn’t always work when you want to do even simple calculations. Because if you’re trying to squeeze mixed granularity of data into a single table, you may be duplicating data. You might end up with incorrect results [00:24:00] if you sum something that shouldn’t be summed. Or if you’re trying to avoid just doing a simple sum, you might end up with overly complex calculation.

Which then leads us to DirectQuery-specific best practices. So obviously, as I said, you should be using composite models appropriately. Power BI has its own aggregation table functionality, where you can declare an aggregation table and Power BI will detect whether that aggregation table can be used. [00:24:30] There are a couple of properties available for optimizing reports for Direct!uery mode. We’ve got options that will turn off some of the automatic cross-filtering that happens in a Power BI report. We’ve got options here listed as query reduction that will allow you to make buttons appear inside your report so when people change a selection they have to click a button before that takes.

But generally speaking, what we say with DirectQuery mode is, try to keep things relatively [00:25:00] simple. Don’t pack too many visuals on the page, because the more visuals you’ve got, the more queries are going to be fired off against the data source. Limit the complexity of the calculations you define in Power BI. Limit the amount of data prep work you do in Power BI. The more you can precalculate in the data source, the more complexity you can push back to the data source, the better.

So just before I finish, let me flip over to a DirectQuery [00:25:30] report that I’ve created here. This is an example of a report with a dataset with various mixed granularities here. I’ve got two main fact tables, Internet Sales and Product Inventory. Internet Sales is dimensioned by date, customer, and product. Product Inventory is only dimensioned by date and product. I’ve got an aggregation table here in import mode which aggregates all of the data of Internet Sales up to the date level. And if we go over to here, [00:26:00] you can see that I can change my data range. I’ve got some apply buttons here that mean that when I change my date range I’ll see an apply button appear. I can filter by commute distance, which is a property of customer. When I do that, you’ll see that my sales and tax amount values should change, [00:26:30] but my last unit balance doesn’t. Okay. Let me apply this. Hopefully you saw that change. There we go. It’s changing. And this more or less demonstrates all of the options that I mentioned.

I think that brings us pretty much to the end of the demo part of the presentation.

Anita:

Yes.

Chris:

I’m ready to take some questions.

Anita:

Thanks, Chris. And with that, again, I will invite you [00:27:00] into the live Q&A. But in the meantime, we have a couple of questions coming in. One of them is, “Does this run on Linux?”

Chris:

So Power BI Desktop only runs on Windows, but Power BI Desktop is our development environment. If you want to consume Power BI reports, all you need is a browser. We are working to move our complete development experience to the web. We are making steps towards that. [00:27:30] And even now you can do a certain amount of development. You can do all of your report layout in the web, for example. But the data preparation work we do, all of the kind of ETL, the definition of calculations, that needs Windows, a Windows application at the moment.

Anita:

Thanks for that, Chris. And again, a reminder to the audience to enable your audio and video in the right-hand corner if you’ve received an invitation into the video. [00:28:00] In the meantime, another question we’ve got here is, “Any plans for an on-prem Power BI to support embedded visualizations?”

Chris:

No. We’ve got a very strong embedding story, but it’s purely on our cloud-based platform, and I don’t think we have any plans to add embedding to our on-prem platform at the moment.

Anita:

Got it. Thank you, Chris. Another question here [00:28:30] from [Camille 00:28:31]. “We’re trying to work with Power BI dashboards on real-time data. I’m wondering if there are best practices for that in Power BI. So far, we’ve managed to make it work in Synapse Analytics, but it’s only semi real-time.”

Chris:

Okay. So if you’re working with real-time data, you’re going to need to use DirectQuery mode. If you have a look for the Power BI guidance documentation, so if you specifically Bingle [00:29:00] for that, look for Power BI guidance documentation, there’s a lot of useful information there. And there is a page on working with DirectQuery, and that will hopefully cover in a lot more detail the best practices to follow for DirectQuery mode.

Anita:

Got it. And a question from Kevin. “Does Power BI use Dremio Reflections for improved performance?”

Chris:

It can do, yes. All Power BI does [00:29:30] is generate SQL queries against your favorite data sources in DirectQuery mode. It also supports some other query languages for other types of data source, but with Dremio it will generate SQL queries.

Anita:

And another question from Srinivas is, “What are the trade-offs in performance?”

Chris:

It depends. So it will depend on how you’ve modeled your data. It’ll depend on what you’re trying to do with your report. But generally speaking, as [00:30:00] I said, import mode is always going to be faster from a report performance point of view. DirectQuery gives you up-to-date data, allows you to work with how much data is in your data source, but there will always be an overhead with Power BI having to generate a query to get the data that it needs to bring back to populate all of the different visuals. But if you really, really tune your data source and you follow all the best practices, you could absolutely [00:30:30] be successful with DirectQuery mode on top of Dremio or whatever data source you choose to use. And we have lots of customers who are successful with DirectQuery mode on a wide variety of data sources

Anita:

Makes sense. Another question here is, “Are there any plans to enhance Power BI Admin Analytics so that I can easily see who’s joining my reports and how often?” And this question comes from [John O’Neil 00:30:59].

Chris:

[00:31:00] Yes, there are. Something which will be going into public preview fairly soon, it’s something I can talk about because it’s on our public roadmap, is what we’re calling bring your own log analytics. And basically rather than us trying to guess what you want to see in your admin reports, we are giving you the ability to… Well, we are enabling ourselves to give you a dump of all of the data that we collect about usage, [00:31:30] about internal engine performance… Anything that you could possibly collect data on and Power BI that we collect, we are going to be making available just simply as a dump or your data lake. And then once we’ve given you the raw data, then you’ll be able to go away and do whatever you want with it.

Anita:

Got it. And then a question from Richard MacKie here is, “Can Power BI pass user credentials to Dremio so that DirectQuery can use row-level [00:32:00] security?”

Chris:

It can’t yet, but this is something that we are definitely considering for the future. But I can’t make any promises at the moment.

Anita:

Got it. And what’s the best way to ensure tracking of Power BI crashing due to data lake changing table structures? Any logging ways?

Chris:

I mean, before we give you the bring your own log analytics functionality that I mentioned, we do have [00:32:30] an audit log anyway that you should be able to get your hands on. And it depends on what you mean by crashing. But if a schema’s changed or something that Power BI was expecting to be there is no longer there, I would have thought that should be showing up in the audit log. But it depends on exactly what you mean there, I think.

Anita:

Got it. And then another question, “When will Power BI Report Server be able to connect to Dremio [00:33:00] via DirectQuery?”

Chris:

I’m not sure. I don’t think we have any public plans to do that. If you’re using Power BI Report Server, I would strongly recommend that you consider using the Power BI service in the cloud. Not that Power BI Report Server is bad, but the cloud is always better.

Anita:

Makes sense. Makes sense. Yeah, let me see. And again, to the audience, if you’d like to join [00:33:30] live, please enable your audio and video. We have about a minute left here. So one other question from [inaudible 00:33:38]. “Is any…” Assuming this means governance. “Any governance options in…” Actually, yeah. “Any governance options in Power BI like user or [inaudible 00:33:50]…” Sounds like it’s the same question as earlier. Let me give you another one from Roger here. “You describe Power BI as a cloud-first product. And being a Microsoft product that means Azure [00:34:00] only cloud? Or does it? Is there a storage layer in Azure comparable to AWS storage options that support pushdown predicate queries?”

Chris:

So Power BI is… It’s cloud-only, and it is Microsoft cloud only. It sounds like the question is not really about Power BI though. If you’re talking about a query layer that supports pushdown predicates, that could describe a number [00:34:30] of other different Azure products. But from a Power BI point of view, we aim to work with as many different data sources as possible. Just because we’re a Microsoft BI tool, it doesn’t mean that we only work with Microsoft data sources. I’ve just shown you working with Dremio. We work with a number of other tools that would be considered direct competitors to a number of different Microsoft products.

Anita:

Appreciate that clarification, Chris. [00:35:00] I think we have time for one last question. “How can version control be ensured for the Power BI reports published into cloud as a best practice?”

Chris:

So this is something that we are working on improving. As far as version control goes, today we’ve got a reasonably good story for datasets, which is where, to be honest, most of the complexity is. We’ve got a number of community-developed tools that will allow you to extract the definition of the dataset and [00:35:30] put that into version control, so check out tools like Tabular Editor and ALM Toolkit, which are free community-developed tools. As far as report definitions go, I will admit this is something we need to do some more work on. But with a report definition, assuming that you’ve split that away from your data definition, which is definitely a good practice, the report… The Power BI file is fairly small. You could just save that as a binary and put that into [00:36:00] your version control tool of choice. We don’t support you looking inside the definition of the report, the definition of the report file changes a lot. But hopefully in the future we’ll have a better story around that.

Anita:

Thank you, Chris. And thank you all for joining today. That’s all the questions we have time for. But as a reminder, Chris has his own dedicated Slack channel that I shared earlier on chat, so please feel free to continue [00:36:30] the conversation after the presentation. And then also the expo hall is now open, so I encourage you to go check out the demos and also enter a raffle for fun giveaways. And with that, thank you once again, Chris, and thank you, everyone, for attending, and enjoy the rest of your conference.

Chris:

Thanks a lot.

Anita:

Thanks. ( [00:37:00] silence)