17 minute read · May 13, 2024

From Apache Druid to Dashboards with Dremio and Apache Iceberg

Alex Merced

Alex Merced · Senior Tech Evangelist, Dremio

Moving data from source systems like Apache Druid to a dashboard traditionally involves a multi-step process: transferring data to a data lake, moving it into a data warehouse, and then building BI extracts and cubes for acceleration. This process can be tedious and costly. However, this entire workflow is simplified with Dremio, the Data Lakehouse Platform. Dremio enables directly serving BI dashboards from Apache Druid or leveraging Apache Iceberg tables in your data lake. This post will explore how Dremio's data lakehouse platform simplifies your data delivery for business intelligence by doing a prototype version that can run on your laptop.

Setting Up Our Environment

To run this exercise on your laptop, you will need Docker Desktop to run Docker Compose files to spin up our environment. In an empty folder, create a docker-compose.yml file with the following:

**You may want to make sure docker has 6gb of memory allocated to it at least to prevent 137 (error cause not enough memory to start all containers)**

version: "3"

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      dremio-druid-superset:
    ports:
      - 19120:19120

  # Minio Storage Server
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=storage
      - MINIO_REGION_NAME=us-east-1
      - MINIO_REGION=us-east-1
    networks:
      dremio-druid-superset:
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      dremio-druid-superset:

  # Apache Druid
  postgres:
    container_name: postgres
    image: postgres:latest
    ports:
      - "5433:5432"
    volumes:
      - metadata_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=FoolishPassword
      - POSTGRES_USER=druid
      - POSTGRES_DB=druid
    networks:
      dremio-druid-superset:

  # Need 3.5 or later for container nodes
  zookeeper:
    container_name: zookeeper
    image: zookeeper:3.5.10
    ports:
      - "2181:2181"
    environment:
      - ZOO_MY_ID=1
    networks:
      dremio-druid-superset:

  coordinator:
    image: apache/druid:29.0.1
    container_name: coordinator
    volumes:
      - druid_shared:/opt/shared
      - coordinator_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
    ports:
      - "8081:8081"
    command:
      - coordinator
    env_file:
      - environment
    networks:
      dremio-druid-superset:

  broker:
    image: apache/druid:29.0.1
    container_name: broker
    volumes:
      - broker_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8082:8082"
    command:
      - broker
    env_file:
      - environment
    networks:
      dremio-druid-superset:

  historical:
    image: apache/druid:29.0.1
    container_name: historical
    volumes:
      - druid_shared:/opt/shared
      - historical_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8083:8083"
    command:
      - historical
    env_file:
      - environment
    networks:
      dremio-druid-superset:

  middlemanager:
    image: apache/druid:29.0.1
    container_name: middlemanager
    volumes:
      - druid_shared:/opt/shared
      - middle_var:/opt/druid/var
    user: root
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8091:8091"
      - "8100-8105:8100-8105"
    command:
      - middleManager
    env_file:
      - environment
    networks:
      dremio-druid-superset:

  router:
    image: apache/druid:29.0.1
    container_name: router
    volumes:
      - router_var:/opt/druid/var
    depends_on:
      - zookeeper
      - postgres
      - coordinator
    ports:
      - "8888:8888"
    command:
      - router
    env_file:
      - environment
    networks:
      dremio-druid-superset:

  # Superset
  superset:
    image: apache/superset
    container_name: superset
    networks:
      dremio-druid-superset:
    ports:
      - 8088:8088

networks:
  dremio-druid-superset:

volumes:
  metadata_data:
  middle_var:
  historical_var:
  broker_var:
  coordinator_var:
  router_var:
  druid_shared:

This Docker Compose file sets up a development environment for integrating Apache Druid with a BI dashboard through Dremio and Apache Iceberg, leveraging additional tools like Nessie and MinIO. The services are configured to work together within a custom network named dremio-druid-superset.

This Docker Compose file sets up a development environment for integrating Apache Druid with a BI dashboard through Dremio and Apache Iceberg, leveraging additional tools like Nessie and MinIO. Here's what each service in the file represents:

  • Nessie: Acts as a catalog server using an in-memory store, helping to manage and version data in the data lake. It runs on port 19120 and is part of the dremio-druid-superset network, facilitating integration with Dremio and Apache Druid for table cataloging and version control.
  • MinIO: Serves as a storage server, mimicking a cloud storage environment locally. It's configured with essential security and region settings and exposes ports 9000 for API access and 9001 for the console. MinIO provides the storage layer for the data lake, where data from Apache Druid can be stored and managed before processing with Dremio.
  • Dremio: This is the core component of the data lakehouse platform, providing the ability to directly query and manage data across different sources like Apache Druid and the data stored in MinIO. It exposes several ports for different purposes, such as the Dremio UI (9047), and data communication (31010 and 32010).
  • Apache Druid: Serves as a real-time analytics database designed to handle large volumes of event-driven data efficiently. Druid is set up as a single instance for simplicity in initial configuration and demonstration. It exposes its ports 8081 for the router and 8888 for the web console, facilitating easy access for data querying and interaction by other services in the setup, notably Dremio.
  • Superset: A business intelligence tool configured to connect with Dremio, allowing users to create and view dashboards and reports. It runs on port 8088 but is mapped to 8080 on the host machine for easy access.

The networks section defines a custom network named dremio-druid-superset, ensuring these services can communicate seamlessly within this isolated network environment.

In the same folder as the docker-compose.yml create a file called “environment” with the follow:

DRUID_SINGLE_NODE_CONF=micro-quickstart

druid_emitter_logging_logLevel=debug

druid_extensions_loadList=["druid-histogram", "druid-datasketches", "druid-lookups-cached-global", "postgresql-metadata-storage", "druid-multi-stage-query", "druid-basic-security" ]

druid_zk_service_host=zookeeper

druid_metadata_storage_host=
druid_metadata_storage_type=postgresql
druid_metadata_storage_connector_connectURI=jdbc:postgresql://postgres:5432/druid
druid_metadata_storage_connector_user=druid
druid_metadata_storage_connector_password=FoolishPassword

druid_coordinator_balancer_strategy=cachingCost

druid_indexer_runner_javaOptsArray=["-server", "-Xmx1g", "-Xms1g", "-XX:MaxDirectMemorySize=3g", "-Duser.timezone=UTC", "-Dfile.encoding=UTF-8", "-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
druid_indexer_fork_property_druid_processing_buffer_sizeBytes=256MiB

druid_storage_type=local
druid_storage_storageDirectory=/opt/shared/segments
druid_indexer_logs_type=file
druid_indexer_logs_directory=/opt/shared/indexing-logs

druid_processing_numThreads=2
druid_processing_numMergeBuffers=2

DRUID_LOG4J=<?xml version="1.0" encoding="UTF-8" ?><Configuration status="WARN"><Appenders><Console name="Console" target="SYSTEM_OUT"><PatternLayout pattern="%d{ISO8601} %p [%t] %c - %m%n"/></Console></Appenders><Loggers><Root level="info"><AppenderRef ref="Console"/></Root><Logger name="org.apache.druid.jetty.RequestLog" additivity="false" level="DEBUG"><AppenderRef ref="Console"/></Logger></Loggers></Configuration>

# Druid basic security
druid_auth_authenticatorChain=["MyBasicMetadataAuthenticator"]

druid_auth_authenticator_MyBasicMetadataAuthenticator_type=basic
druid_auth_authenticator_MyBasicMetadataAuthenticator_initialAdminPassword=password1
druid_auth_authenticator_MyBasicMetadataAuthenticator_initialInternalClientPassword=password2
druid_auth_authenticator_MyBasicMetadataAuthenticator_credentialsValidator.type=metadata
druid_auth_authenticator_MyBasicMetadataAuthenticator_skipOnFailure=false
druid_auth_authenticator_MyBasicMetadataAuthenticator_authorizerName=MyBasicMetadataAuthorizer

# Escalator
druid_escalator_type=basic
druid_escalator_internalClientUsername=druid_system
druid_escalator_internalClientPassword=password2
druid_escalator_authorizerName=MyBasicMetadataAuthorizer

druid_auth_authorizers=["MyBasicMetadataAuthorizer"]

druid_auth_authorizer.MyBasicMetadataAuthorizer.type=basic

In the context of our blog, this Docker Compose setup illustrates how to create a local development environment that mirrors the process of moving data from Apache Druid to a BI dashboard using Dremio and Apache Iceberg, with Nessie for table cataloging and MinIO for simulating a cloud storage environment.

Populating Data in Our Apache Druid

After setting up your Docker Compose file, you must populate Apache Druid with data to simulate operational databases. First, ensure Apache Druid is running by executing:

docker-compose up -d

Once everything has started up you should be able to access Druid at localhost:8888 so we can begin adding data. Your username/password is admin/password1 then Click on load data and select batch and then select “paste data” from the available options:

Paste in the following data:

day, amount
1,1
2,2
3,4
4,8
5,16
6,32

Then on the next screen ask for the data to be read as csv that includes a header row.

Then hit next till eventually you click on “start loading data” and wait for the data to load up.

Connecting Apache Druid, Nessie, and Minio to Dremio

The next step is to spin up the Dremio environment and connect our data lake (MinIO as storage + Nessie as catalog) and Apache Druid data sources.

docker compose up dremio nessie minio

Open MinIO in the browser at localhost:9000 and log in with username “admin” and password “password”. Once you log in, create a new bucket called “warehouse”. Remember that Nessie is always running in the background and is accessed by a REST API, which is how the Dremio connector communicates with the catalog.

Now, head over to localhost:9047 in your browser to set up your Dremio admin account. Once that is set up, click “add a Source” and select a “Nessie” as the source. Enter in the following settings:

  • General settings tab
  • Storage settings tab
    • AWS Root Path: warehouse
    • AWS Access Key: admin
    • AWS Secret Key: password
    • Uncheck “Encrypt Connection” Box (since we aren’t using SSL)
    • Connection Properties
      • Key: fs.s3a.path.style.access | Value: true
      • Key: fs.s3a.endpoint | Value: minio:9000
      • Key: dremio.s3.compat | Value: true

Click on “Save,” and the source will now be added to Dremio, against which you can run full DDL and DML SQL. Dremio turns your data lake into a data warehouse, a data lakehouse!

Now let’s add an Apache Druid source with the following settings:

  • Name: druid
  • Host: router
  • Port: 8888
  • Username: admin
  • Password: password1

Once you save the source, you can see both sources on the Dremio dashboard and begin working with them.

Moving Our Apache Druid Data to Our Data Lake

Historically, moving data from our databases into our data lake as Apache Iceberg tables would require spinning up an Apache Spark cluster and writing complex jobs in Python, Java, or Scala. But with Dremio, we can move our data to our data lake with simple SQL.

Head over to the SQL Runner in the Dremio UI and run the following query to ingest the table in Apache Druid as an Apache Iceberg table in our Nessie source:

CREATE TABLE data AS SELECT * FROM druid.druid."inline_data";

This is a great way to initially move the table to your data lake. To update the table, you can use the INSERT INTO SELECT statement, adding all records with a primary key value to the highest value in the destination. This will ensure only new records are added to the table.

INSERT INTO nessie.data
SELECT *
FROM druid.druid."inline_data"
WHERE day > (SELECT COALESCE(MAX(day), 0) FROM nessie.day);

Now we have the data in our data lakehouse. While we could choose not to move the data and operate with the data directly from Apache Druid, we could run into competition for resources as the growing number of analytical queries compete with existing operational queries being sent to our Apache Druid database. To avoid this, we can either create a data reflection, which creates a Dremio-managed materialization on your data lake that Dremio substitutes when the table is queried, or ingest the data into our data lakehouse as we did above. Either option allows analytical queries to utilize Dremio’s infinite horizontal and vertical scaling capabilities fully.

Connecting Superset to Dremio

Dremio can be used with most existing BI tools, with one-click integrations in the user interface for tools like Tableau and Power BI. We will use an open-source option in Superset for this exercise, but any BI tool would have a similar experience. To get started with Superset, let’s open another terminal and spin up the Superset container.

docker compose up superset

We then need to initialize Superset, so open another terminal and run this command:

docker exec -it superset superset init

This may take a few minutes to finish initializing, but once it is done, you can head over to localhost:8080 and log in to Superset with the username “admin” and password “admin”. Once you are in, click on “Settings” and select “Database Connections”.

  • Add a New Database
  • Select “Other”
  • Use the following connection string (make sure to include Dremio username and password in URL): dremio+flight://USERNAME:PASSWORD@dremio:32010/?UseEncryption=false
  • Test connection
  • Save connection

The next step is to add a dataset by clicking on the + icon in the upper right corner and selecting “create dataset”. From here, choose the table you want to add to Superset, which is, in this case, our inline_data table.

We can then click the + to add charts based on the datasets we’ve added. Once we create the charts we want, we can add them to a dashboard, and that’s it! You’ve now taken data from an operational database, ingested it into your data lake, and served a BI dashboard using the data.

Consider deploying Dremio into production to make delivering data for analytics easier for your data engineering team.

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.