Elasticsearch is one of the most reliable, fast, and powerful search and analytics engines. It allows you to index and query large amounts of structured data, provides a convenient RESTful API and is widely used with a variety of applications.
However, some tools, like Qlik, don’t work with Elasticsearch, as it has no support for SQL and manages data in JSON documents. In this tutorial we will show you how you can use Dremio to easily connect Qlik as well as other BI tools to Elasticsearch and to start building comprehensive visualizations.
To follow this tutorial you should have Dremio and its ODBC driver installed - downloads for your favorite operating system are at https://www.dremio.com/download. We also recommend you read Getting Oriented to Dremio and Working With Your First Dataset. In addition, we will use Elasticsearch, and Logstash v.5.6.0, as well as Qlik Sense Desktop. You can register for a Qlik Account and download Qlik Sense Desktop here if you don’t have one already.
In this article, we will work with NYPD Motor Vehicle Collision data which contains details on every collision in NYC, broken down by location and injury. So, go to NYC Open Data and download the data file in CSV format. You can find all of the necessary configuration files at Exploring Public Dataset github page.
Note.In order to get the best Dremio performance we will use Elasticsearch v.5.6.0. So, for this version, the process of ingesting data into Elasticsearch will be a little different from the instructions provided on the github page. If you are familiar with Elasticsearch and know how to load data into your cluster - feel free to use your favorite script or tool.
If you are not familiar with this procedure, you can use the following directions:
Download nyc_collision_logstash.conf, nyc_collision_pipeline.json, and nyc_collision_template.json files.
Move nyc_collision_logstash.conf to the Logstash\bin directory.
Modify nyc_collision_logstash.conf and provide the correct paths to the data and downloaded files.
Open the command line, cd into the folder containing the configuration file, and run the next command
1 ./logstash.bat -f nyc_collision_logstash.conf
The process of ingesting data may take from 20-30 minutes to few hours, depending on your machine. Check the progress by running
1 curl http://localhost:9200/nyc_visionzero/_count
After the whole data is ingested, you will see the count of 1,161,717 (note that the exact number may change as the data is volatile).
First off, create a new NYC space by clicking New Space on the main page of Dremio UI. After that, let’s connect to our Elasticsearch data source. Click on New source button and select Elasticsearch option.
Now, provide details of your Elasticsearch cluster. If you are running Elasticsearch on your laptop, enter localhost as Host and select No Authentication; you can also check out the Advanced options for additional specifications. Then click save:
Note: you can read more about Elasticsearch setup and best practices here.
After above preparations, the created nyc_collision source will be available under the source section. Click on it to see all the indexes. Go ahead to nyc_visionzero and click on logs to view the available data in the Dataset Viewer.
There are over a million rows and note how quickly Dremio is preparing results!
Now you can see the data and a query editor window. You can either write an SQL query or simply take advantage of Dremio’s ability to build queries using user friendly interface.
To begin with, let’s transform our data for further analysis by dropping the columns we will not need. In order to do this, click a small arrow near the column name and select Drop in the drop-down menu.
Perform this with the next columns: On_street_name, path, @version, host, off_street_name, unique_key, tags - null, _index, type, _uid, and message.
After these transformations, the query in the SQL Editor looks as following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 SELECT logs."date" AS "date", number_of_motorist_injured, latitude, number_of_cyclist_killed, borough, number_of_persons_killed, contributing_factor_vehicle, number_persons_impacted, zip_code, logs."intersection" AS "intersection", number_of_pedestrians_killed, hour_of_day, longitude, number_of_motorist_killed, vehicle_type, number_of_cyclist_injured, "@timestamp", cross_street_name, number_of_pedestrians_injured, number_of_persons_injured, "location", logs."time" AS "time" FROM nyc_collision.nyc_visionzero.logs
Finally, we can create a virtual dataset by clicking Save As at the top right corner of the page. Let’s call it nyc_collision_data and save to our NYC space.
First of all, let’s make sure that Qlik is running. Then click on Analyze button at the top right corner of the toolbar and select Qlik Sense.
You will now be prompted to enter your Dremio password. After connection to Qlik Sense is established, follow the provided instructions.
The data load editor should contain the code below:
1 2 3 4 5 6 7 8 9 10 11 SET DirectIdentifierQuoteChar='"'; LIB CONNECT TO 'Dremio'; nyc_collision_data: DIRECT QUERY DIMENSION "date", "latitude", "borough", "zip_code", "intersection", "longitude", "cross_street_name", "location", "time", "hour_of_day" MEASURE "number_of_motorist_injured", "number_of_cyclist_killed", "number_of_persons_killed", "number_persons_impacted", "number_of_pedestrians_killed", "number_of_motorist_killed", "number_of_cyclist_injured", "number_of_pedestrians_injured", "number_of_persons_injured" DETAIL "contributing_factor_vehicle", "vehicle_type", "@timestamp" FROM "NYC"."nyc_collision_data";
Note*Qlik editor automatically reads char values as dimensions and numerical values as measures. But let’s move “hour_of_day” to dimensions as we will need it this way later.
Now click Load data at the top right corner. This process can take several minutes.
As soon as the data load is complete, navigate to App overview at the top left corner.
On this step you can select the default sheet or create a new one. Let’s go with the default one for our example. After that, you will see the editor to work with your data. There are plenty of charts available on the left side of the screen, as well as different options for data manipulation. For example, we can build a scatter plot by dragging the icon from the list to the editor area. Select hour_of_day as a dimension, and the average number_of_persons_killed, and number_of_persons_impacted as measures. Also, check out all the choices in the Appearance section on the Properties panel. You can play with titles, axes, bubble sizes, colors, and legend. For our example, let’s choose custom color and arrange it by measure of Max(number_of_persons_injured).
You can see the result of such manipulations on the screen below.
One of the greatest features of using Dremio with Qlik is the ability to make data transformations directly in Dremio and then simply refreshing data in Qlik without using the load manager. For example, for more detailed analysis, we can filter our data, selecting only the incidents which happened in Manhattan and involved 2 or more people injured. To do this, go to Dremio SQL Editor and add the following line to the query:
1 WHERE borough = 'MANHATTAN' AND 2<=number_of_persons_injured
Great! Now, just go back to Qlik editor, select Fields on the left and click on the Refresh button on the bottom of the page and wait until data is ready.
Now we can build visualizations on the refreshed data. Let’s add Combo chart which shows the average number of persons impacted depending on the hour of the day, Pie chart that represents the percentage of collisions happened in each hour, and the average time of collisions with help of Gauge.
Once again, drag the Combo chart icon to the editor area. First, click Add dimension and select hour_of_day. After that, choose number_of_persons_impacted as a measure and set the Average (Avg) aggregation.
To see the percentage of collisions happened in each hour, select hour_of_day as the dimension and Count(hour_of_day) as a measure of the pie chart. We will leave Auto colors for this visualization.
For gauge, let’s add average hour_of_day as a measure and then slightly change the appearance by setting the max limit to 23, toggling the Use segments checkbox, and choosing the color you like the most.
As a result, you should have the following visualizations:
In this tutorial, we showed you how Dremio makes it not only possible, but very easy to connect Qlik to Elasticsearch for building extensive visualizations. While we were using Qlik, we could also work with Tableau or Power BI, which greatly expands the ability to analyze Elasticsearch data. With the help of Dremio, we were able to transform our data before and in the middle of working with Qlik, which made the whole visualization process much more convenient.