6 minute read · March 29, 2024

BI Dashboards with Apache Iceberg Using AWS Glue and Apache Superset

Alex Merced · Senior Tech Evangelist, Dremio

Business Intelligence (BI) dashboards aggregate, visualize, and analyze data to provide actionable insights and support data-driven decision-making. Serving these dashboards directly from the data lake, especially with technologies like Apache Iceberg, offers real-time data access, cost-efficiency, and the elimination of data silos. Dremio, as a data lakehouse platform, enhances this setup by providing high-performance query acceleration and an integrated analytics layer, so dashboards stay timely and draw on comprehensive data sources. This blog will guide you through using your AWS Glue catalog as a Dremio data source and Apache Superset as the BI tool to create and deliver BI dashboards. Combining these technologies makes the data in your lake more accessible and actionable for all stakeholders.

Setting Up Our Environment

This exercise will use Docker Compose to set up a local Dremio and Superset environment. We will use the official Dremio Docker image and a custom Superset image with the requisite Dremio libraries installed. Create a docker-compose.yml with the following:

version: "3"

services:
  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047   # Dremio web UI and REST API
      - 31010:31010 # ODBC/JDBC clients
      - 32010:32010 # Arrow Flight clients
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      dremio-superset:
  # Superset
  superset:
    image: alexmerced/dremio-superset
    container_name: superset
    networks:
      dremio-superset:
    ports:
      - 8080:8088 # Superset UI, reachable on host port 8080
networks:
  dremio-superset:

To spin up the environment, run the following command from a terminal opened in the same folder as the docker-compose.yml:

docker compose up

This will spin up Dremio and Superset. To finish initializing Superset, open another terminal and run:

docker exec -it superset superset init
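Both containers can take a little while to start answering requests. As a minimal sketch (not part of the original setup), the helper below polls a URL until it returns any HTTP response. The function name `wait_for_http` and the timeout values are illustrative assumptions; the two URLs in the usage comment come from the port mappings in the docker-compose.yml above.

```python
import http.client
import time
import urllib.error
import urllib.request

# Ignore any system proxy settings, since the targets are local containers.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url: str, timeout: float = 180.0, interval: float = 3.0) -> bool:
    """Return True once `url` answers with any HTTP response, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with _opener.open(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server replied with an error status, so it is up.
            return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            # Nothing is answering yet; retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example usage (assumes the compose stack above is running):
# for name, url in [("Dremio", "http://localhost:9047"),
#                   ("Superset", "http://localhost:8080")]:
#     print(name, "ready" if wait_for_http(url) else "not responding")
```

An HTTP check is used rather than a bare TCP connect because Docker can accept connections on a published port before the application inside the container is listening.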

Connecting Your AWS Glue Catalog to Dremio

  • Go to localhost:9047 in your browser and create your Dremio user. Then add a new “AWS Glue” data source.
  • Name the source “glue”, select your preferred AWS region, and enter your AWS credentials. The simplest way is to use your access key and secret key, but you can also use IAM roles if you prefer.
  • Under the advanced options tab, add a connection property named “hive.metastore.warehouse.dir”. Its value should be the storage location (such as an S3 path) where you want data written when Dremio creates Iceberg tables in your Glue catalog.
  • Then click “save” to add the data source.
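If you are unsure which tables in a Glue database are registered as Iceberg tables, you can check before browsing them in Dremio. The following is a sketch under stated assumptions: it expects a boto3 Glue client to be passed in, the helper name `list_iceberg_tables` is hypothetical, and it relies on Iceberg tables carrying a `table_type` parameter of `ICEBERG` in their Glue metadata.

```python
from typing import Any, List


def list_iceberg_tables(glue_client: Any, database: str) -> List[str]:
    """Return names of tables in `database` that Glue metadata marks as Iceberg."""
    names: List[str] = []
    # get_tables is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page.get("TableList", []):
            params = table.get("Parameters") or {}
            # Assumption: Iceberg tables are tagged with table_type=ICEBERG.
            if params.get("table_type", "").upper() == "ICEBERG":
                names.append(table["Name"])
    return names


# Example usage (requires boto3 and AWS credentials; the region and
# database name below are placeholders, not values from this article):
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# print(list_iceberg_tables(glue, "my_database"))
```

Passing the client in as an argument keeps the function independent of how you authenticate, whether with access keys or an IAM role.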

Connecting Superset to Dremio

Dremio can be used with most existing BI tools, with one-click integrations in the user interface for tools like Tableau and Power BI. For this exercise we will use Superset, an open-source option, but the experience is similar with any BI tool.

To get started, head over to localhost:8080 and log in to Superset with the username “admin” and password “admin”. Once you are in, click on “Settings” and select “Database Connections”. From there, add a new database connection that points at Dremio.
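When you add a database connection, Superset asks for a SQLAlchemy URI. The sketch below builds one, assuming the custom Superset image ships the sqlalchemy-dremio driver and its `dremio+flight` URI scheme. The helper name `dremio_flight_uri` is hypothetical. The host `dremio` and port 32010 come from the docker-compose.yml above, and `UseEncryption=false` assumes the local Dremio container has no TLS configured.

```python
from urllib.parse import quote


def dremio_flight_uri(
    username: str,
    password: str,
    host: str = "dremio",   # container name on the shared Docker network
    port: int = 32010,      # Arrow Flight port exposed by the Dremio service
    use_encryption: bool = False,
) -> str:
    """Build a SQLAlchemy URI for Dremio over Arrow Flight."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URI.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    flag = "true" if use_encryption else "false"
    return f"dremio+flight://{user}:{secret}@{host}:{port}/dremio?UseEncryption={flag}"


# Example usage with the Dremio user you created earlier (placeholder values):
# print(dremio_flight_uri("my_user", "my_password"))
```

The host is `dremio` rather than `localhost` because Superset resolves the address from inside its own container, where the Dremio container is reachable by name on the shared network.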

The next step is to add a dataset by clicking on the + icon in the upper right corner and selecting “create dataset”. From here, select the table you want to add to Superset, in this case, our sales_data table.

We can then click the + to add charts based on the datasets we’ve added. Once we create the charts we want, we can add them to a dashboard, and that’s it. To accelerate your dashboard even further, you can enable aggregate reflections on the underlying datasets in Dremio.

Consider deploying Dremio into production to make delivering data for analytics easier for your data engineering team.
