Dremio Jekyll

Unlocking Tableau on Elasticsearch

Intro

Elasticsearch is a popular open source datastore that enables developers to query data using a JSON-style domain-specific language, known as the Query DSL. Elasticsearch’s scale-out architecture, JSON data model, and text search capabilities make it an attractive datastore for many applications.

Dremio makes it easy to connect your favorite BI tools to Elasticsearch, including Tableau. We’ll work through an example using Tableau, but you can do the same with Qlik, Power BI, and other SQL-based BI tools.

Assumptions

This tutorial is a continuation of the tutorial Unlocking SQL on Elasticseach. For this tutorial you’ll need access to Dremio, an Elasticsearch cluster, and Tableau.

You’ll also need to install an ODBC driver for Dremio. There are options for Windows, macOS, and Linux.

Tableau on Elasticsearch Indices

We’ve seen how Dremio can translate and execute a SQL query on Elasticsearch data by pushing-down most of the query into Elasticsearch using the Elasticsearch Query DSL. Because SQL is simpler and much more familiar to most people than the Query DSL, this alone makes Elasticsearch accessible to a much broader audience including business analysts and data scientists.

That said, you don’t need to know SQL in order to utilize Elaticsearch. You can actually use BI tools such as Tableau, Power BI, Qlik Sense and MicroStrategy to connect to Elasticsearch via Dremio.

In this tutorial we’ll connect Tableau Desktop to the Dremio cluster and interact with our restaurant inspection dataset. There are two ways you can connect Tableau to Dremio. You can launch Tableau and connect to the Dremio cluster just like you would connect Tableau to any other database. Alternatively, you can click on the Tableau button in the Dataset Viewer in the Dremio UI to launch Tableau Desktop with the necessary connection settings preconfigured. Let’s do the latter. Open the production.restaurants.inspections dataset in Dremio:

Restaurant inspections data

Now click on the Tableau button at the top right:

Tableau button

The browser will download a small connection file (with a .tds extension) which contains the necessary connection settings. Click on this file to launch Tableau:

Tableau

If you’re using Tableau on Windows, you’ll be prompted to log in now. Use your credentials for logging into Dremio. Notice that the dataset name, production.restaurants.inspections, is displayed at the top left. You can now interact with your Elasticsearch data just like you would interact with a relational database. For example, drag BORO onto the Rows shelf at the top. At this point you’ll be asked to enter your Dremio credentials:

Connecting to Dremio from Tableau

If you’re using Tableau on macOS, you won’t be prompted to log in until this step. Use your Dremio credentials. You may notice that the credentials dialog box in the macOS version of Tableau says Apache Drill. That’s because Dremio, at the time of writing, utilizes the same ODBC driver as Apache Drill (though it has little in common with Apache Drill beyond that).

Let’s continue our ad-hoc analysis here by dragging the Number of Records dimension onto the Columns shelf, and the CRITICAL_FLAG measure onto the Color shelf. The Tableau visualization should look like this:

Counting citations by Boro

You can take advantage of Tableau’s rich visualization capabilities. For example, we can look at the distribution of scores for different cuisines, which reveals that Californian restaurants consistently receive good scores, while Chinese restaurants are all over the map.

Counting citations by cuisine type

Choosing an Index/Type in Tableau

Click on the Data Source tab at the bottom left of the Tableau window:

Tableau data sources

Here you have the opportunity to select other indexes and types in the Elasticsearch cluster. You could join multiple indexes/types, or even join data from Elasticsearch with data from any other datastore that’s connected to Dremio.

Text Search in Tableau

As you may recall, Dremio supports text search via the SQL contains function. If you want to utilize text search from the BI tool, you have two options. You can utilize the BI tool’s Custom SQL option to manually enter a SQL query that includes the contains function. Alternatively, you can create a virtual dataset inside Dremio that includes the contains function. This approach has the advantage of supporting any tool that connects to Dremio, rather than working through the various approaches each vendor provides.

Text Search with Custom SQL

Let’s start with the first option. Remove the inspections table from the Tableau Data Source screen, and click on the New Custom SQL button. In the Edit Custom SQL dialog, enter the desired SQL query:

Editing SQL query in Tableau

Once you click OK, this custom SQL statement serves as the basis for your analysis in Tableau:

Custom SQL query in Tableau

Click on Sheet 1 at the bottom to return to create a visualization based on the inspections that contain the terms mice or vermin. You may want to click on the Clear Sheet button at the top in order to start the visualization from scratch.

For example, click on the CAMIS measure and select Convert to Dimension (since it’s an integer, Tableau assumes it’s a measure). CAMIS represents the unique restaurant identifier. Next, drag Number of Records to the Columns shelf at the top, and then drag DBA (Doing Business As) to the Label shelf. Click the Sort icon to sort in descending order.

Citations by DBA

What we can see here is that the New Victory Restaurant has the most inspection records that mentioned the words mice or vermin.

Text Search with Virtual Datasets

Entering custom SQL in the BI tool can get the job done, but sometimes you want to save a query so that other people or tools can utilize it too. Dremio’s virtual datasets enable you to do just that. While virtual datasets and data curation are largely beyond the scope of this tutorial (see Working With Your First Dataset, we do want to show you a simple example.

In the Dremio UI, create a shared space by clicking on the New Space button. Call the new space “restaurant analysis” and click Save.

Creating a new space

One way to create a new dataset is by clicking on the New Query button at the top. Go ahead and do that, and enter the custom SQL query we just used:

Click the Save As button to save this as a virtual dataset named “mice” in the space “restaurant analysis” which we just created:

Saving virtual dataset

This new dataset, which includes the inspection records that mention the words mice or vermin, is immediately available to any SQL query and BI tool (using the canonical path “restaurant analysis”.mice). For example, head back to Tableau, and select the “restaurant analysis” space as a Schema. You should see the new virtual dataset named mice.

New virtual dataset in Tableau

Note that you can share this virtual dataset with your colleagues, who may use other tools like Power BI, Excel and R and Pandas/Python to explore and analyze the dataset. It’s important to keep in mind that Dremio has not exported the data from Elasticsearch. This virtual dataset is… virtual. Dremio will continue to utilize Elasticsearch’s execution capabilities, to the extent possible, even for SQL queries on the virtual dataset.

Accelerating Elasticsearch Queries with Reflections

There’s a lot to cover about Data Reflections™, much more than we can do in this tutorial. In short, reflections are optimized data structures, or data materializations, that are housed in a persistent storage layer such as S3, HDFS or the Dremio cluster’s local disks. In the context of Elasticsearch, reflections provide three major benefits:

  • They accelerate query execution, often by orders of magnitude, especially on queries that require a full table scan on the Elasticsearch index/type.
  • They enable accurate COUNT DISTINCT calculations. Elasticsearch does not natively perform accurate counts, and instead resorts to approximates (also known as Cardinality Aggregation).
  • They enable you to control whether or not analytical queries should be pushed down into Elasticsearch. In some cases, Elasticsearch may be serving an operational use case that must guarantee low latency SLAs, in which case it may be beneficial to enable Dremio to respond to queries via reflections rather than pushing down the processing into Elasticsearch.

Let’s look at an example. Open the Dataset Settings dialog by clicking the Settings button on the production.restaurants.inspections dataset. Click on the Reflections tab, and wait a few seconds while Dremio calculates various statistics on the data in order to recommend appropriate reflections. You should see something like this:

Setting up reflections - basic

In this case, Dremio incorrectly concluded that CAMIS and ZIPCODE are measures. Let’s switch to advanced mode by clicking the button Switch to Advanced:

Setting up reflections - advanced

Enable Raw Reflections by clicking on the toggle at the top:

Reflections toggle

Switch to the Aggregation Reflections tab.

Advanced aggregation reflections

Mark DBA, ZIPCODE, VIOLATION_CODE, CUISINE_DESCRIPTION, INSPECTION_TYPE, CRITICAL_FLAG and GRADE as dimensions (in addition to the dates), and unmark CAMIS. Then enable Aggregation Reflections by clicking on the toggle.

Click on Save and wait until Dremio creates these two reflections. You can track the progress in the Jobs screen, but it should only take a minute or two. Once it’s complete, you should see a colorful flame icon indicating that the reflection is ready.

Advanced aggregation reflections - detail

The Dremio optimizer will now have the option to rewrite query plans to utilize reflections, which are stored in a high-performance columnar format, in lieu of reading from Elasticsearch. Depending on the query, this can lead to significant speedups. That said, when a query involves text search (with the SQL contains function), Dremio will always involve the Elasticsearch cluster in the processing of the query.

Summary

In this tutorial we showed you how to run SQL queries on Elasticsearch, as well as how to utilize standard BI tools such as Tableau on your Elasticsearch indices. These capabilities can open up your Elasticsearch clusters to a much broader set of users in the organization, and can make it easier for anyone to explore and analyze data in Elasticsearch.