Dremio Jekyll

Analyzing Elasticsearch With Qlik Sense

Intro

Elasticsearch is one of the most reliable, fast, and powerful search and analytics engines. It allows you to index and query large amounts of structured data, provides a convenient RESTful API and is widely used with a variety of applications.

However, some tools, like Qlik, don’t work with Elasticsearch, as it has no support for SQL and manages data in JSON documents. In this tutorial we will show you how you can use Dremio to easily connect Qlik as well as other BI tools to Elasticsearch and to start building comprehensive visualizations.

Assumptions

To follow this tutorial you should have Dremio and its ODBC driver installed - downloads for your favorite operating system are at https://www.dremio.com/download. We also recommend you read Getting Oriented to Dremio and Working With Your First Dataset. In addition, we will use Elasticsearch, and Logstash v.5.6.0, as well as Qlik Sense Desktop. You can register for a Qlik Account and download Qlik Sense Desktop here if you don’t have one already.

Loading data into Elasticsearch

In this article, we will work with NYPD Motor Vehicle Collision data which contains details on every collision in NYC, broken down by location and injury. So, go to NYC Open Data and download the data file in CSV format. You can find all of the necessary configuration files at Exploring Public Dataset github page.

Github repo

Note.In order to get the best Dremio performance we will use Elasticsearch v.5.6.0. So, for this version, the process of ingesting data into Elasticsearch will be a little different from the instructions provided on the github page. If you are familiar with Elasticsearch and know how to load data into your cluster - feel free to use your favorite script or tool.

If you are not familiar with this procedure, you can use the following directions:

  1. Download nyc_collision_logstash.conf, nyc_collision_pipeline.json, and nyc_collision_template.json files.

  2. Move nyc_collision_logstash.conf to the Logstash\bin directory.

  3. Modify nyc_collision_logstash.conf and provide the correct paths to the data and downloaded files.

  4. Open the command line, cd into the folder containing the configuration file, and run the next command

1
./logstash.bat -f nyc_collision_logstash.conf

The process of ingesting data may take from 20-30 minutes to few hours, depending on your machine. Check the progress by running

1
curl http://localhost:9200/nyc_visionzero/_count

After the whole data is ingested, you will see the count of 1,161,717 (note that the exact number may change as the data is volatile).

image alt text

Connecting Dremio to Elasticsearch

First off, create a new NYC space by clicking New Space on the main page of Dremio UI. After that, let’s connect to our Elasticsearch data source. Click on New source button and select Elasticsearch option.

Creating an Elasticsearch source

Now, provide details of your Elasticsearch cluster. If you are running Elasticsearch on your laptop, enter localhost as Host and select No Authentication; you can also check out the Advanced options for additional specifications. Then click save:

Configuring your Elasticsearch source

Note: you can read more about Elasticsearch setup and best practices here.

After above preparations, the created nyc_collision source will be available under the source section. Click on it to see all the indexes. Go ahead to nyc_visionzero and click on logs to view the available data in the Dataset Viewer.

Viewing your index in Elasticsearch

There are over a million rows and note how quickly Dremio is preparing results!

Now you can see the data and a query editor window. You can either write an SQL query or simply take advantage of Dremio’s ability to build queries using user friendly interface.

Dremio's SQL Console

To begin with, let’s transform our data for further analysis by dropping the columns we will not need. In order to do this, click a small arrow near the column name and select Drop in the drop-down menu.

Drop a column

Perform this with the next columns: On_street_name, path, @version, host, off_street_name, unique_key, tags - null, _index, type, _uid, and message.

After these transformations, the query in the SQL Editor looks as following:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SELECT
  logs."date" AS "date",
  number_of_motorist_injured,
  latitude,
  number_of_cyclist_killed,
  borough,
  number_of_persons_killed,
  contributing_factor_vehicle,
  number_persons_impacted,
  zip_code,
  logs."intersection" AS "intersection",
  number_of_pedestrians_killed,
  hour_of_day,
  longitude,
  number_of_motorist_killed,
  vehicle_type,
  number_of_cyclist_injured,
  "@timestamp",
  cross_street_name,
  number_of_pedestrians_injured,
  number_of_persons_injured,
  "location",
  logs."time" AS "time"
FROM
  nyc_collision.nyc_visionzero.logs

Finally, we can create a virtual dataset by clicking Save As at the top right corner of the page. Let’s call it nyc_collision_data and save to our NYC space.

Saving to a Space

Using Dremio and Qlik

First of all, let’s make sure that Qlik is running. Then click on Analyze button at the top right corner of the toolbar and select Qlik Sense.

Connecting with Qlik Sense

You will now be prompted to enter your Dremio password. After connection to Qlik Sense is established, follow the provided instructions.

User prompt for connecting to Qlik Sense

The data load editor should contain the code below:

1
2
3
4
5
6
7
8
9
10
11
SET DirectIdentifierQuoteChar='"';

LIB CONNECT TO 'Dremio';

nyc_collision_data: DIRECT QUERY
DIMENSION "date", "latitude", "borough", "zip_code", "intersection", "longitude", "cross_street_name", "location", "time",  "hour_of_day"

MEASURE "number_of_motorist_injured", "number_of_cyclist_killed", "number_of_persons_killed", "number_persons_impacted", "number_of_pedestrians_killed", "number_of_motorist_killed", "number_of_cyclist_injured", "number_of_pedestrians_injured", "number_of_persons_injured"

DETAIL "contributing_factor_vehicle", "vehicle_type", "@timestamp"
FROM "NYC"."nyc_collision_data";

Note*Qlik editor automatically reads char values as dimensions and numerical values as measures. But let’s move “hour_of_day” to dimensions as we will need it this way later.

Now click Load data at the top right corner. This process can take several minutes.

Loading data from Qlik

As soon as the data load is complete, navigate to App overview at the top left corner.

App overview button in Qlik Sense

On this step you can select the default sheet or create a new one. Let’s go with the default one for our example. After that, you will see the editor to work with your data. There are plenty of charts available on the left side of the screen, as well as different options for data manipulation. For example, we can build a scatter plot by dragging the icon from the list to the editor area. Select hour_of_day as a dimension, and the average number_of_persons_killed, and number_of_persons_impacted as measures. Also, check out all the choices in the Appearance section on the Properties panel. You can play with titles, axes, bubble sizes, colors, and legend. For our example, let’s choose custom color and arrange it by measure of Max(number_of_persons_injured).

Configuring colors in Qlik Sense

You can see the result of such manipulations on the screen below.

Viewing our dataset in Qlik Sense

One of the greatest features of using Dremio with Qlik is the ability to make data transformations directly in Dremio and then simply refreshing data in Qlik without using the load manager. For example, for more detailed analysis, we can filter our data, selecting only the incidents which happened in Manhattan and involved 2 or more people injured. To do this, go to Dremio SQL Editor and add the following line to the query:

1
WHERE borough = 'MANHATTAN' AND 2<=number_of_persons_injured

Great! Now, just go back to Qlik editor, select Fields on the left and click on the Refresh button on the bottom of the page and wait until data is ready.

Calling refresh from Qlik Sense

Now we can build visualizations on the refreshed data. Let’s add Combo chart which shows the average number of persons impacted depending on the hour of the day, Pie chart that represents the percentage of collisions happened in each hour, and the average time of collisions with help of Gauge.

Combo chart

Once again, drag the Combo chart icon to the editor area. First, click Add dimension and select hour_of_day. After that, choose number_of_persons_impacted as a measure and set the Average (Avg) aggregation.

Pie chart

To see the percentage of collisions happened in each hour, select hour_of_day as the dimension and Count(hour_of_day) as a measure of the pie chart. We will leave Auto colors for this visualization.

Gauge

For gauge, let’s add average hour_of_day as a measure and then slightly change the appearance by setting the max limit to 23, toggling the Use segments checkbox, and choosing the color you like the most.

As a result, you should have the following visualizations:

Our final visualization in Qlik Sense

Conclusion

In this tutorial, we showed you how Dremio makes it not only possible, but very easy to connect Qlik to Elasticsearch for building extensive visualizations. While we were using Qlik, we could also work with Tableau or Power BI, which greatly expands the ability to analyze Elasticsearch data. With the help of Dremio, we were able to transform our data before and in the middle of working with Qlik, which made the whole visualization process much more convenient.