Visualize Your Data Lake Using Apache Superset

Session Abstract

Apache Superset is an open source BI platform built for modern data teams. Originally created by Maxime Beauchemin (who also created Apache Airflow), Superset embraces open standards to speak to nearly any SQL-speaking data engine and supports a wide range of viz types.

Superset has two key workflows, designed to support all personas in an organization. SQL-savvy analysts can tap into SQL Lab, a browser-based SQL IDE to explore and sculpt data. Stakeholders can use Explore, a no-code viz builder that generates SQL for you.

In this talk, I’ll showcase how both workflows empower end users to visualize data from Dremio in Superset.

Video Transcript

Speaker 1:    So with that, I would like to welcome our next speaker. Robert Stolz. Robert, over to you.

Robert:    Thanks so much [Deepa 00:00:13]. I really appreciate all the effort that has gone into organizing this conference. And I’ve been having a blast. This has been a really cool experience for me. So thanks for putting this on. I’m going to go ahead and share my screen really quick here, or attempt to. [00:00:30] Okie doke. So hopefully everybody is able to see stuff okay. Awesome. Okay. Hey, how’s it going? My name is Robert. I’m here to talk today about Apache Superset, our open source BI for the data lake age. So who am I? Just really quickly to start out with, I imagine most people here have never seen me or [inaudible 00:00:49] me. I’m a data engineer, but also a developer advocate at Preset. So a lot of what I do is work on internal data engineering projects, but I’m also responsible for reaching out to the community, [00:01:00] helping people contribute to the open source side of the project, and just trying to build basically the best possible open source community around Superset.

My background before that was in scientific research, math, computational biology. And for most of those things I ended up writing a lot of open source software to support those research efforts. And that’s kind of how I got into open source more generally. But I’ve been writing software for a long time, and it’s something I’m really passionate about. Also I’m a data architecture and best practices nerd, for sure. And at Preset one [00:01:30] of the main things that I do is I build open source data stacks to learn more about open source communities and solve problems in those communities. And you can see for my human interest image on the right I have a lizard wearing a little top hat. I’m super into herpetology and herpetological fashion. So feel free to reach out to me if you yourself keep reptiles, or you’re interested in their genetics, or really anything else about them.

So what we’re going to talk about today really quick… We don’t have a ton of time. So I’m going to try to move pretty fast. A rough [00:02:00] introduction to what Superset is, and what technologies under the hood enable Superset to do what it does. We’re going to talk about the three faces of Superset, which are Explore mode, SQL Lab, and Dashboards. We’re also going to talk about some specific tips for getting the best out of Superset with data lake engines like Dremio, and more generally data lake architectures.

So Apache Superset, for those that don’t know, I hope that there’s people here who’ve never heard of it, and this is their first introduction to it. It is an enterprise-focused [00:02:30] open source business intelligence and data analytics tool. And it makes simple and beautiful dashboards that are powerful. It’s flexible. Because it’s open source, it’s extensible. There’s a lot of things that are great about Superset. And I’m hoping that over the course of this talk I’ll be able to convince you to at least give it a shot, or think about it for your own data.

Superset started around 2015. There’s a gentleman by the name of Max Beauchemin who some people may know. He’s the original creator of Apache Airflow as well. And when [00:03:00] he was working at Airbnb they needed a front end and visualization layer for Druid. And so as part of a hackathon project at Airbnb Max and some other people ended up creating something that at the time was called Panoramix, which is the name of the druid in a popular Belgian cartoon series. After a few years, that project was spun out of Airbnb to the Apache Software Foundation. And for those who maybe aren’t familiar with the ASF, they provide a solid [inaudible 00:03:28] structure that you can build commercial open source [00:03:30] projects on where all parties involved can be confident that the project is going to be well-managed, and that everyone’s interests will be respected, and that it will basically continue to be a useful open source project for everyone to use in the future.

So Apache Superset was what Panoramix became. And around the beginning of 2019, Max started a company called Preset, consisting of a lot of the folks who had made major early contributions to Apache Superset. And then over the next two years, we’ve [00:04:00] pushed forward as a company. And Superset at the beginning of this year crossed the version 1.0 threshold, which marked a major milestone as far as Superset being really ready for enterprise applications, like maturity, having everything about the release cycle figured out, stability, the features. It’s all there now. So version 1.0 is complete, and that’s something I’m very happy to be able to say.

So what does Superset offer, generally speaking, that makes it enterprise [00:04:30] quality BI? Well, it’s got dynamic dashboards. So you can use Jinja templating, and dashboard filters are very useful. It has a built-in SQL IDE, which is extremely useful for actually exploring the raw data. I use it all the time. It allows no-code exploration of that data through Explore mode as well, which we’ll talk about a little bit later. It has a bunch of rich visualizations, and we’re bringing in visualizations from the Apache ECharts project, [00:05:00] which is another Apache project that is focused basically on just making awesome charts. And we’re slowly bringing all of the ECharts chart types into Superset, in addition to having some of our own that are separate from that.

Really granular permissions. Enterprise security is a major consideration. It’s something that we think about a lot. Custom visualization plugins. Semantic layer, which is something we’re going to talk about in a little bit of detail as far as what the architecture of that is. But basically it provides an object-oriented [inaudible 00:05:30] that sits on top of the SQL and allows you to manipulate data in a Pythonic way in the backend. And it’s very lightweight. Modern Data Stack support. So via a technology called SQLAlchemy you can connect to almost any SQL-speaking data source, popular cloud data warehouses, data lake engines, real-time data stores, time series optimized data stores, Google Sheets, CSV files in an S3 bucket. It’s got a lot of options. Also, it has a caching service that’s [00:06:00] built in, or rather that can be configured to work with it, to be more accurate about it. It reduces the load on the database, provides faster queries, faster results, and reduces costs. And then it has alerts and reports. So you can be notified via a Slack bot or emails when certain things happen in the data, which is really useful for monitoring use cases.
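For reference, connecting Superset to an engine like Dremio comes down to a SQLAlchemy URI plus the matching dialect package. Below is a minimal sketch assuming the community sqlalchemy_dremio dialect over Arrow Flight; the host, port, and credentials are placeholders, and the exact URI format depends on the driver version you install.

    # A minimal sketch of how Superset reaches a SQL engine: SQLAlchemy plus a
    # dialect package. Assumes `pip install sqlalchemy_dremio`; the host, port,
    # and credentials below are placeholders.
    from sqlalchemy import create_engine, text

    # The same URI string you would paste into Superset's "Add database" form
    # (assumed format for the Dremio Flight dialect).
    engine = create_engine("dremio+flight://my_user:my_password@dremio-host:32010/dremio")

    with engine.connect() as conn:
        # Simple connectivity check, analogous to Superset's "Test connection" button.
        print(conn.execute(text("SELECT 1")).scalar())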

So all of these things are essential in an enterprise BI. And there’s a lot of companies that basically use Superset as a consequence of that. And in fact, I would go so [00:06:30] far as to say, based on lots of conversations that I’ve had with people who are users in the community, that there is hardly a large technology company that doesn’t use Superset in some capacity or another, although not all are willing to come out and say that they use it. But there are many organizations that do. And it’s for the reasons that I listed above. It offers a lot to these organizations.

It also, more generally as I said, through SQLAlchemy speaks SQL. So this is just a small list of data sources. If you want to see the full list you can jump into our documentation. [00:07:00] But essentially anything that speaks SQL you could write a SQLAlchemy driver for it. And then once that’s done, it works with Superset, and the interface is the same. And Superset, one of the nice things about it is that it provides a common interface for all of the SQL speaking data systems that are part of your data architecture. So that’s pretty cool too.

So just to talk really briefly, I mentioned that there was a lightweight semantic layer. And I wanted to talk a little bit about what is the architecture of Superset that is connected to that, and what does that really mean. So we have a React front [00:07:30] end, which is really three large apps which are bound together in a common user interface and user experience. Those three are Dashboard mode, where you can view and manipulate dashboards. Explore mode, where you can create and manipulate charts and visualizations on top of some dataset. And then SQL Lab, which is our SQL IDE. Underneath that is the semantic layer, which is essentially a Python backend. And underneath that you have SQLAlchemy tables, which could be [00:08:00] actually representing a physical table in your data system, or it could also be a virtual table. So Superset has the ability to create virtual tables with derived columns, and other things that are really useful like that. And then on the bottom you have SQL-speaking data sources. Which is again, really anything that speaks SQL and has a SQLAlchemy driver.

So I also wanted to comment as far as what do we mean by lightweight? What is lightweight really from a performance perspective? Well, I ran some tests with Dremio just to get a sense of a lot [00:08:30] of things actually, but among them Dremio’s performance particularly, and how caching factors into that and all these things. And essentially I queried a 1 billion record test data set unoptimized on Dremio. No optimizations. Just raw data in Dremio. Took about four minutes to actually query that data and generate a chart. Reflections turned on. So Reflections is a feature of Dremio that allows you to pre-compute certain aggregates, and other important metrics. Reflections on, about 16 seconds. [00:09:00] When the data is cached, about one second. And I want to comment further that the semantic layer adds a little bit less than a second: Dremio reports 16 seconds for the query, and then at 17 seconds you have a chart in Superset. So after the query comes back, it’s very fast to go from data to actually having a visualization in Superset.

Also, generally speaking this is a general figure of what is [00:09:30] a cloud data platform. It’s from a really great book, which I highly recommend checking out if you’re new to the subject, called Designing Cloud Data Platforms by Danil Zburivsky and Lynda Partner. And essentially I’ve marked some things on here, but Dremio, while technically speaking it provides direct data lake access, can also, because of its performance, act as a data warehouse. You can use Dremio as a data warehouse in some sense. Apache Superset, as BI software, really wants [00:10:00] fast queries underneath your visualizations. And Dremio provides that in a really good way. And it’s part of why I’m using Dremio for some stuff at Preset now.

Also another thing worth mentioning here is that not everything has to live in Dremio. Modern cloud data platforms have a lot of different data systems. Not all of them are unified in a data lake. Sometimes you have other kinds of faster structured data stores, or data stores that are specialized for geospatial data, or for temporal data, or whatever. And Apache Superset can connect [00:10:30] to all of them and provide a common interface.

So of the three faces of Superset, I’m going to go through them really quick. I know that we’re tight on time. But I’m hoping to just give people a sense of what the experience of using Superset is like. We have a lot of visualizations when you go into Explore mode. They’re really attractive. We have a lot of really beautiful visualizations, particularly geospatial and temporal visualizations. We have a lot of really good time series charts and maps. Generally [00:11:00] speaking this is what a chart looks like in Explore mode. You can see that you get a sense of what the data looks like on the left, which is really useful for exploring a data set as a persona that maybe doesn’t have the SQL knowledge that would allow them to run direct queries against the data. You can still get a sense of what the columns are, what the data types are, things like that.

The column [inaudible 00:11:21] essentially has all the chart-specific controls, which include some controls that are common to all charts, like time, but also more specific controls that give you the ability [00:11:30] to manipulate essentially what the underlying query is for the chart. And if you have caching turned on that query is cached, and essentially, depending on how you have the cache configured, the chart is going to retrieve data from the cache rather than actually hitting your data lake again, which is a big time saver, and also a big money saver potentially.

Also we have some in-chart analytics that are pretty cool. This is an area that I’ve been really interested in and trying to push forward at Preset. But [00:12:00] essentially what we can do is we can surface the functionality of certain Python analytics packages in charts, and allow people to actually manipulate that at the chart layer. So what we’re looking at here is Facebook’s Prophet package for time series forecasting. I’ve got a blog series on the Preset technical blog about how to get that set up and working if you’re interested. But the ability to basically take any package that’s out there, even if it’s not ASF license compatible, and then make that available through Superset is really cool as well. That’s something I value [00:12:30] a lot, and I hope to explore that a little bit more with my own work in the future.
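To make the forecasting piece concrete, here is a rough sketch of the kind of call that integration drives under the hood: fitting Prophet on a daily series and projecting it forward. The CSV file name is hypothetical, and depending on your install the package may be named fbprophet rather than prophet.

    # A rough sketch of the forecasting Superset can surface in charts: Prophet
    # fit on a daily time series, projected 30 days ahead. The file name is
    # hypothetical; Prophet expects columns named "ds" (date) and "y" (value).
    import pandas as pd
    from prophet import Prophet  # may be `fbprophet` on older installs

    df = pd.read_csv("daily_metric.csv")  # columns: ds, y

    model = Prophet(yearly_seasonality=True)  # seasonality knobs mirror the chart controls
    model.fit(df)

    future = model.make_future_dataframe(periods=30)  # extend 30 days past the data
    forecast = model.predict(future)
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())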

SQL Lab is our SQL IDE. So this is the second face. And you can see here that you get a good sense of what the columns are, what the data looks like. You can run SQL queries. It’s got a suggestion engine for helping you write good SQL queries. You can save the result of your query as a virtual dataset, which you can then build charts on top of, which is really handy. So there’s a lot you can do with this [00:13:00] SQL IDE.

And then the third face is the Dashboard mode. So in Dashboard mode you have a lot of options that go beyond just being able to create charts. You have markdown cells. You can create functional elements that are not charts, like filter [inaudible 00:13:16]. And all of these things basically contribute to providing Dashboard consumers with more context about what they’re looking at, and more control over the underlying data. Which is something that I want to talk about in some more specific detail right now.

We now have filters [00:13:30] that live at the Dashboard level in Superset. This is a relatively recent thing. But what it allows you to do is without any SQL, without doing anything with virtual datasets, or anything like that, you can essentially create a customizable filter bar on the side of your dashboard that can pre-filter the underlying data set, and also can provide controls to basically manipulate the data that’s underneath all of the selected charts in your dashboard. So for example, you can have a time slider, or [00:14:00] various other check boxes, and things that are useful elements for being able to give people power over what they’re looking at more than anything.

So lastly, we have the ability to edit dashboards, and this is a drag and drop interface. So again, no coding necessary here. You can just pop into dashboard edit mode, drag things down, adjust width, create dividers, headers, markdown cells, things [00:14:30] that are going to provide context and organization for your dashboard. This strays into dashboard creation territory. There’s a whole art of dashboard creation, which is really interesting. And it’s not actually my area of expertise. But the ability to provide context, and to be able to provide functional elements that divide the dashboard up is really critical as well.

So some tips for using Superset with Dremio, generally speaking, number one, I would say take advantage of Dremio’s Reflections feature on large tables. That’s actually essential. [00:15:00] BI lives in the realm of speed. You really want fast response times for your queries. And Reflections really gets you there. So if you’re not using Reflections, and you have large tables, and you’re using Dremio with BI, I would strongly recommend checking out Reflections. Query caching is configurable in Superset, and if you’re running open source Superset it’s extensible as well. Caching minimizes unnecessary queries against the data lake, saves time, and saves money.
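As a sketch of what configurable query caching looks like in practice, the snippet below shows a Redis-backed data cache block for superset_config.py; the Redis URL is a placeholder, and the key names come from Flask-Caching and may vary slightly by version.

    # superset_config.py -- a sketch of chart-data caching against Redis, assuming
    # a local Redis instance; key names follow Flask-Caching conventions and may
    # differ slightly between Superset versions.
    DATA_CACHE_CONFIG = {
        "CACHE_TYPE": "RedisCache",
        "CACHE_DEFAULT_TIMEOUT": 60 * 60 * 24,  # e.g. 24 hours for a once-a-day batch refresh
        "CACHE_KEY_PREFIX": "superset_data_",
        "CACHE_REDIS_URL": "redis://localhost:6379/1",
    }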

Superset speaks SQL. Not everything needs to live in Dremio. So if you have other data systems, [00:15:30] other data that does not live in your data lake, Superset can still work with it, and actually provide a unified interface for all of that data. Also, you can use dashboard native filters, which I just showed, to allow dashboard consumers to manipulate large data sets without code. And this includes things like filtering out null values, values that don’t make sense. Things that would fall more under the purview of data cleaning, which gives people like data scientists and data analysts more direct access to data that might be living in a data lake, without the need [00:16:00] to necessarily go through a data engineer to clean and organize the data to a really high degree before it’s usable.

So there’s three ways to run Superset. The first is the manual setup. It’s complex, but it gives you maximum control over configuration. This is usually what enterprises do. There are quite a few advanced features in Superset that require some additional backend dependencies and setup. You’ve got to set up a caching service, Celery workers, packages like Prophet if you want to use Prophet, [00:16:30] and basically a service for dashboard thumbnails, alerts, and reports. So the setup is complex, but gives you the most control.
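For a sense of what those extra backend dependencies look like, here is a hedged sketch of the Celery portion of superset_config.py, loosely following the documented pattern; the Redis URLs are placeholders.

    # superset_config.py -- a sketch of the async worker setup the manual install
    # needs for things like Alerts & Reports and async SQL Lab queries. The Redis
    # URLs are placeholders; adjust the imports to match your Superset version.
    class CeleryConfig:
        broker_url = "redis://localhost:6379/0"
        result_backend = "redis://localhost:6379/0"
        imports = ("superset.sql_lab", "superset.tasks")
        worker_prefetch_multiplier = 1
        task_acks_late = False

    CELERY_CONFIG = CeleryConfig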

The second is Docker Compose, which is what I would recommend folks here check out if you maybe want to try out Superset, or don’t know about it. It gives you a really easy setup. Great for trying out Superset locally, and for doing local development. Some features are part of the stack by default, such as caching. So we include a Redis cache, although there are other caching services you can configure as well. And some are not part of the stack. So alerts and [00:17:00] reports, Prophet integration, and dashboard thumbnails are not enabled by default in the Docker Compose setup.

And then lastly is Preset Cloud, which is our cloud service offering. There’s no setup involved with this. I think it’s good for an individual evaluation all the way up to what you would need as an enterprise. It has all advanced Superset features by default. So all of these complex backend dependencies are configured and working for every workspace. And I’m actually really happy to be able to announce that it’s still free for small teams. So this is something new that we’re [00:17:30] bringing in August of this year. So just around the corner we’re going to provide essentially free business intelligence for small teams that’s built on top of Apache Superset. So if you want to try Superset out, and you don’t want to go through the process of setting it up locally with Docker Compose, or going through the manual setup, try it out on Preset Cloud in a month. That’s going to be pretty cool.

And then lastly, I want to leave everyone with an idea. There’s so many logos missing from this figure, and that’s totally fine. But I really [00:18:00] want to emphasize why I think Apache Superset is on the cutting edge of open source projects conceptually. And that is, as time has gone on there, of course, have been more and more open source projects. Many of them data focused. But there’s a smaller thread that runs through the history of open source that’s actually end user applications that are not directed at developers, are not critical pieces of data infrastructure. And I’ve tried to put some of those projects in this image as they come to my mind from memory.

Apache [00:18:30] Superset, considering that it’s primarily consumed by analysts, data scientists, decision makers in business, these people are not software developers. They don’t have the ability to directly contribute to a project like Apache Superset through code. But their input is essential to building Apache Superset into something that’s actually going to be useful as a BI tool. And I think that Apache Superset is on the cutting edge of open source in the sense that our community makes a lot of effort to try to include other personas [00:19:00] that are not just software developers in the open source process, particularly designers and analysts.

So we actually would love to have you join our community to talk to us about how you use BI generally, and what you would like to see Apache Superset become able to do in the future. And yeah, feel free to reach out to me. I want to throw a big thanks up again to the Dremio conference organizers, and to everybody for attending. That’s my email. You can find me on GitHub. You can find me on LinkedIn. And yeah, I’m happy to take any questions that exist. I think I ran over [00:19:30] a little bit, so apologies about that.

Speaker 1:    Hey, thank you Robert. So we do have some questions. Let me see if I can get someone on the channel, or maybe not. Okay. So let’s take from the chat. It’s basically open source Tableau from what I can tell so far, from my use of it. So I don’t know whether that’s a question or-

Robert:    I [00:20:00] can make a comment on that.

Speaker 1:    Yeah. Yeah. Please do.

Robert:    I mean it is open source Tableau in the sense that Tableau is a mature enterprise ready BI solution, and so is Superset. And the way that you use them is going to be very similar. Like Superset’s not reinventing the wheel in terms of what BI is from the perspective of the analyst, or the data scientist yet. What’s different about it is that it’s open source. It’s been built by a community of enterprises, and [00:20:30] individuals, and essentially is fully available for anyone to understand how it works, and more importantly modify it. Which is why a lot of enterprises that use Superset use it, because they need a solution that they can embed, a solution that they can modify. And Superset is modifiable.

Speaker 1:    Thank you. So we have a few more questions. What are the system requirements?

Robert:    The system requirements, generally speaking, if you’re going to run it locally in Docker Compose are [00:21:00] you need at least six gigs of memory. I would say probably eight would be better because it actually spins up a Postgres database, and a Redis cache for query caching. And those things use a fair amount of RAM. Yeah. It’s about that. No specific requirements as far as processor. You have to be running a *nix-based operating system where you can install dependencies and stuff like that. I don’t think it works very well in Windows to the best of my knowledge. Yeah. Hopefully [00:21:30] that answers the question.

Speaker 1:    The next question is what about sharing dashboards via email, or working with MS Office components?

Robert:    I’m not quite sure about working with MS Office components. I don’t think we have tight integration with MS Office right now, although you can connect Superset to spreadsheets that live in Google Drive and treat that essentially as a data source, which is kind of cool. What was the first part of that question?

Speaker 1:    How about sharing [00:22:00] dashboards via email?

Robert:    Yeah. You can share dashboards. You can share them as images. You can share them as links to dashboards. Somewhere we really want to take the project is the ability to more easily embed dashboard components into other things. You can do it right now, but it’s not quite the easiest thing in the world. So we’re trying to make that easier as time goes on.

Speaker 1:    Great. The next question is if the data to be visualized is in AWS [00:22:30] S3, do you have to put AWS Athena between S3 and Superset? Is that correct?

Robert:    I believe that that is correct. I don’t think you can query directly on top of S3. Yeah. You have to use Athena, which I have tested pretty extensively as well. And I have a blog post on how to set that up if you’re curious.
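For reference, a sketch of that Athena route, assuming the PyAthena SQLAlchemy dialect; the region, schema, and S3 staging bucket are placeholders, and AWS credentials are assumed to come from the environment.

    # A sketch of querying S3-resident data through Athena, as described above.
    # Assumes `pip install PyAthena`; the region, schema, and staging bucket are
    # placeholders, and AWS credentials are read from the environment here.
    from urllib.parse import quote_plus
    from sqlalchemy import create_engine, text

    staging = quote_plus("s3://my-athena-results/")
    engine = create_engine(
        "awsathena+rest://:@athena.us-east-1.amazonaws.com:443/my_schema"
        "?s3_staging_dir=" + staging
    )

    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())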

Speaker 1:    Okay. Can you embed dashboards into emails, SharePoint, an intranet, or export them [00:23:00] outside of the corporate LAN? For example, for clients or investors to view without copying and pasting images from the dashboard?

Robert:    Yes. Yes you can. There’s a few ways to do it. The simplest way is to embed the iframe from the dashboard into whatever the thing that you’re trying to embed it into is. There’s some permissions-related things associated with that, but you can solve those problems. Essentially dashboards have [00:23:30] security on them. So if you want to display a public dashboard, there’s some slight modifications you have to make to the security roles in Superset to enable them to view the dashboard without having credentials. But yeah, it’s entirely doable. And then beyond that you can, and many organizations have, extended Superset to be a more full featured embedded solution. And like I said, at Preset we’re thinking about how to push Superset further in that direction as time goes on. I would expect to see something about that maybe next year.

Speaker 1:    Right. So what are [00:24:00] the tips on using Superset with On-Prem Microsoft SQL Server, as well as cloud-based stores, like Redshift, to analyze data quickly? You did mention about Reflections in Dremio.

Robert:    Yeah. Wow. That’s kind of an expansive question. And there’s a lot of different services that were listed there. Generally speaking, for BI a dimensionally [00:24:30] focused data model is going to give you faster queries. It results in having more duplicates of your data, but the queries are faster because the tables don’t need to be enriched at the time that you’re getting the data from your data source to make the chart. So generally speaking, having a data model that’s appropriate for BI, I think, is really important. And most modern data warehousing solutions support that. You can do that in all of them. I’m trying to think of any [00:25:00] specific tips for Microsoft SQL Server or Redshift. I don’t have any specific tips for those.

For Dremio I would say definitely use Reflections. And actually maybe in Superset something that’s more useful generally is if you as a data engineer go into Superset you can actually manipulate the caching behavior of the queries that are underneath the charts. And it’s actually the query itself, the data that comes back from the [00:25:30] query, that’s cached, not the chart. You can build different visualizations on the same query, and they share the same thing in the cache. And you can set the cache timeout to be sufficiently long. So for example, if you have a data system that’s a batch data system, and it refreshes once per day, there’s no need to hit your data more than once a day. You can set the cache to time out every twenty-four hours, and then just avoid any unnecessary redundant hits against your data [00:26:00] lake. So knowing how often the data is refreshed in your data source is something that you can translate into optimizations to the caching behavior in Superset. That’s really, I guess, what I’m trying to say. Hopefully that made sense.

Speaker 1:    Yeah. One more question. I think a few more, but let’s see if we can cover everything. Can you join tables from different DBs to create a chart?

Robert:    [00:26:30] No, you can’t right now, unfortunately. I’m excited by that question though. Because currently we have a Superset improvement proposal open that basically proposes to expand the semantic layer a little bit by introducing some new abstractions. And one of those abstractions is the data set abstraction, which will allow us to do cross data source joins. That’s something that we’re really interested in being able to do in the near future. But right now Superset does not do that, unfortunately. But I would expect to see it come pretty soon.

Speaker 1:    [00:27:00] Okay. Thanks. I think we have two more minutes. So let’s take a couple of more questions. Could Superset be treated as a backend server used to generate plots for an existing front-end app?

Robert:    Hm. Used to generate plots for an existing front end app? So there’s two different ways that I could interpret that question. One is to say if you actually want Superset to generate the plot, then what you’re saying is you want the React front end for Superset to [00:27:30] generate a plot. And then you want to embed that plot inside of another application. And for that use case that’s what we would call the embedded use case. And that one is coming. It’s on a timescale. And it’s actually doable now, but it requires a little bit of tinkering with Superset under the hood. Sorry. I’m really losing my train of thought here. It’s been a long half an hour. Well, what was the first part of that question again?

Speaker 1:    [00:28:00] Hang on. Just give me one second. So could Superset be treated as a backend server?

Robert:    Right. Okay. So the other question there is could you take the Python backend of Superset, and divorce it from the React front end, and attach your own thing to it? And the answer is yes, you can. That’s the really cool thing about open source is you can do that if you want to do it, and you have the ability to do it. So yeah. I’m happy to say that yeah, you can do that.

Speaker 1:    [00:28:30] Awesome. So we have one more minute, and I see a lot of questions coming in the chat. I’ll try to answer as much as possible, but I think you have a Slack channel, Robert. So feel free to speak to him on the Slack channel and get answers to your questions. It almost seems like-

Robert:    I would love to talk to people on Slack.

Speaker 1:    Yeah, I think we are at the end of the session. So thank you, Robert. Thanks everyone for joining the session. And have a good day. Thanks. Bye.

Robert:    It’s been a pleasure. I hope everybody learned something. [00:29:00] Take it easy.

Speaker 1:    Bye.