
Join Elasticsearch and MongoDB with Qlik Sense

Intro

Elasticsearch and MongoDB are two popular data management systems, widely used to work with large amounts of data. While both use a JSON-based data model, each has different strengths, and in some cases it makes sense to use the two systems together. When it comes to analytics, however, combining data from these two systems can be challenging. Dremio makes it easy to combine data from MongoDB and Elasticsearch and then visualize it in a BI tool like Qlik Sense.

Assumptions

This tutorial is a continuation of the Analyzing Elasticsearch with Qlik Sense tutorial, so we assume that you already have Dremio and its ODBC driver installed. If not, go to Dremio’s download page, pick the installer for your operating system, and see the documentation for installation and deployment instructions. We will also work with Elasticsearch, MongoDB, and Qlik Sense Desktop, so you need to have them installed and configured as well.

Loading data into MongoDB

Today, we will continue to explore the NYPD Motor Vehicle Collisions data, but this time we will combine it with a weather dataset stored in MongoDB. You can download this dataset here. The weather data is collected from the National Climatic Data Center for the station in Central Park. It contains a number of weather indicators for each day from 01.01.2016 to 05.31.2018. The dataset has the following data types:

Data types in sample dataset

First off, we need to import the dataset into MongoDB. From the command line, navigate to the bin directory of MongoDB and start the server with the following command:

mongod

Then open another terminal tab and, from the same bin directory, run the following command to import the data:

mongoimport -d weather -c NYC --type csv --headerline --file <PATH TO FILE>/NYC_weather.csv
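The --headerline flag tells mongoimport to use the first CSV row as field names and to turn every subsequent row into one document. A minimal Python sketch of that behavior, using made-up values rather than the real weather file:

```python
import csv
import io

# Hypothetical stand-in for the first lines of NYC_weather.csv.
sample = "DATE,AWND,PRCP,TMAX\n2016-01-01,11.18,0.0,42\n2016-01-02,6.26,0.0,40\n"

# Like mongoimport --headerline: the header row supplies the field names,
# and each following row becomes one {field: value} record (document).
docs = list(csv.DictReader(io.StringIO(sample)))
print(len(docs))        # 2
print(docs[0]["AWND"])  # 11.18
```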

Data curation in Dremio

Start by connecting to a new data source, and choose MongoDB as the type. Enter a name, specify the host (if you are running MongoDB on your local machine, enter localhost) and port, pick No Authentication, and click Save.

Setting up connection to MongoDB

After that, navigate to the weather folder and click on the NYC dataset.

Now, we have to transform the dataset a little. First of all, let’s convert our variables to appropriate data types. We begin by converting AWND to float. To do this, click the data type icon to the left of the column name, select Convert to Single type, and choose the desired type. Click Apply.

Accessing the data type menu for the column

Setting the data type for a column

Then, convert date to Date type with an appropriate format.

Formatting the date column

After that, the SQL editor automatically writes the following:

SELECT
  "_id",
  STATION,
  NAME,
  TO_DATE(NYC."DATE", 'YYYY-MM-DD', 1) AS "DATE",
  clean_data_to_FLOAT(AWND, 1, 1, 0) AS AWND,
  PRCP,
  SNOW,
  SNWD,
  TAVG,
  TMAX,
  TMIN,
  WT01,
  WT02,
  WT03,
  WT04,
  WT06,
  WT08
FROM
  NYC_weather.weather.NYC
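The clean_data_to_FLOAT call is generated by Dremio’s convert-to-single-type action: it casts values to FLOAT and falls back to a replacement value when a cell cannot be parsed. Conceptually, it behaves like this Python sketch (an analogy, not Dremio’s actual implementation; the default value is an assumption):

```python
def to_float(value, default=0.0):
    """Cast to float when possible, otherwise fall back to a default,
    roughly mirroring a cast-with-fallback such as clean_data_to_FLOAT."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default

print(to_float("11.18"))  # 11.18
print(to_float(""))       # 0.0 (unparseable cell replaced by the default)
```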

Now, we can save this dataset to the NYC space, where we already have the nyc_collision_data dataset.

Saving the new dataset

Let’s make some additional changes to the collision dataset. Again, convert the date to an appropriate Date type, but this time with the custom format “MM/DD/YYYY”.

Formatting the date column
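The “MM/DD/YYYY” pattern passed to TO_DATE means month/day/year parsing; in Python terms (the sample date below is hypothetical):

```python
from datetime import datetime

# TO_DATE(..., 'MM/DD/YYYY') parses month/day/year, like strptime's %m/%d/%Y.
d = datetime.strptime("05/31/2018", "%m/%d/%Y").date()
print(d.isoformat())  # 2018-05-31
```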

Next, as we want to analyze the Manhattan area, let’s keep only the data for this borough. To do this, click the small arrow next to the column name and select Keep Only.

Selecting Keep Only for the column

Toggle the Manhattan check mark and click Apply.

Selecting Manhattan for the dataset

Note that the preview is based on a sample of 1,000 records; our actual dataset is much larger, but it is not displayed here. We need cases from 01.01.2016 to 05.31.2018, so, just like before, let’s keep only the values in this range.

Narrowing the range of values to keep
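Because the converted dates sort in ISO order, the Keep Only range filter amounts to a simple boundary check. A small sketch of the same logic (note that the SQL Dremio generates uses an exclusive upper bound):

```python
def in_range(date_str):
    # ISO-formatted dates compare correctly as strings; the upper bound is
    # exclusive here, matching the '>' comparison in the generated SQL.
    return "2016-01-01" <= date_str < "2018-05-31"

print(in_range("2017-07-04"))  # True
print(in_range("2018-05-31"))  # False
```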

Now, we can save this dataset. The SQL Editor has the following code:

SELECT
  TO_DATE(nyc_collision_data."DATE", 'MM/DD/YYYY', 1) AS "DATE",
  nyc_collision_data."TIME" AS "TIME",
  BOROUGH,
  "ZIP CODE",
  LATITUDE,
  LONGITUDE,
  LOCATION,
  "NUMBER OF PERSONS INJURED",
  "NUMBER OF PERSONS KILLED",
  "NUMBER OF PEDESTRIANS INJURED",
  "NUMBER OF PEDESTRIANS KILLED",
  "NUMBER OF CYCLIST INJURED",
  "NUMBER OF CYCLIST KILLED",
  "NUMBER OF MOTORIST INJURED",
  "NUMBER OF MOTORIST KILLED",
  "CONTRIBUTING FACTOR VEHICLE 1",
  "CONTRIBUTING FACTOR VEHICLE 2",
  "CONTRIBUTING FACTOR VEHICLE 3",
  "CONTRIBUTING FACTOR VEHICLE 4",
  "CONTRIBUTING FACTOR VEHICLE 5",
  "VEHICLE TYPE CODE 1",
  "VEHICLE TYPE CODE 2",
  "VEHICLE TYPE CODE 3",
  "VEHICLE TYPE CODE 4",
  "VEHICLE TYPE CODE 5"
FROM
  NYC.nyc_collision_data
WHERE
  ('2016-01-01' <= TO_DATE(nyc_collision_data."DATE", 'MM/DD/YYYY', 1) AND '2018-05-31' > TO_DATE(nyc_collision_data."DATE", 'MM/DD/YYYY', 1)) AND (BOROUGH = 'MANHATTAN')

Finally, we want to join these two datasets into one. Dremio makes this easy: open the collision dataset, click Join, and select the NYC_weather dataset.

Joining MongoDB to Elasticsearch

Selecting the dataset for the join

Then create an Inner Join on the Date column of both datasets.

Creating an inner join

After that, let’s call this dataset fulldata and save it to the NYC space. The final SQL code is:

SELECT
  nested_0."TIME" AS "TIME",
  nested_0.BOROUGH AS BOROUGH,
  nested_0."ZIP CODE" AS "ZIP CODE",
  nested_0.LATITUDE AS LATITUDE,
  nested_0.LONGITUDE AS LONGITUDE,
  nested_0.LOCATION AS LOCATION,
  nested_0."NUMBER OF PERSONS INJURED" AS "NUMBER OF PERSONS INJURED",
  nested_0."NUMBER OF PERSONS KILLED" AS "NUMBER OF PERSONS KILLED",
  nested_0."NUMBER OF PEDESTRIANS INJURED" AS "NUMBER OF PEDESTRIANS INJURED",
  nested_0."NUMBER OF PEDESTRIANS KILLED" AS "NUMBER OF PEDESTRIANS KILLED",
  nested_0."NUMBER OF CYCLIST INJURED" AS "NUMBER OF CYCLIST INJURED",
  nested_0."NUMBER OF CYCLIST KILLED" AS "NUMBER OF CYCLIST KILLED",
  nested_0."NUMBER OF MOTORIST INJURED" AS "NUMBER OF MOTORIST INJURED",
  nested_0."NUMBER OF MOTORIST KILLED" AS "NUMBER OF MOTORIST KILLED",
  nested_0."CONTRIBUTING FACTOR VEHICLE 1" AS "CONTRIBUTING FACTOR VEHICLE 1",
  nested_0."CONTRIBUTING FACTOR VEHICLE 2" AS "CONTRIBUTING FACTOR VEHICLE 2",
  nested_0."CONTRIBUTING FACTOR VEHICLE 3" AS "CONTRIBUTING FACTOR VEHICLE 3",
  nested_0."CONTRIBUTING FACTOR VEHICLE 4" AS "CONTRIBUTING FACTOR VEHICLE 4",
  nested_0."CONTRIBUTING FACTOR VEHICLE 5" AS "CONTRIBUTING FACTOR VEHICLE 5",
  nested_0."VEHICLE TYPE CODE 1" AS "VEHICLE TYPE CODE 1",
  nested_0."VEHICLE TYPE CODE 2" AS "VEHICLE TYPE CODE 2",
  nested_0."VEHICLE TYPE CODE 3" AS "VEHICLE TYPE CODE 3",
  nested_0."VEHICLE TYPE CODE 4" AS "VEHICLE TYPE CODE 4",
  nested_0."VEHICLE TYPE CODE 5" AS "VEHICLE TYPE CODE 5",
  nested_0."DATE" AS "DATE",
  join_NYC_weather."DATE" AS DATE0,
  join_NYC_weather."_id" AS "_id",
  join_NYC_weather.STATION AS STATION,
  join_NYC_weather.NAME AS NAME,
  join_NYC_weather.AWND AS AWND,
  join_NYC_weather.PRCP AS PRCP,
  join_NYC_weather.SNOW AS SNOW,
  join_NYC_weather.SNWD AS SNWD,
  join_NYC_weather.TAVG AS TAVG,
  join_NYC_weather.TMAX AS TMAX,
  join_NYC_weather.TMIN AS TMIN,
  join_NYC_weather.WT01 AS WT01,
  join_NYC_weather.WT02 AS WT02,
  join_NYC_weather.WT03 AS WT03,
  join_NYC_weather.WT04 AS WT04,
  join_NYC_weather.WT06 AS WT06,
  join_NYC_weather.WT08 AS WT08
FROM (
 SELECT
  nyc_collision_data."DATE" AS "DATE",
  nyc_collision_data."TIME" AS "TIME",
  BOROUGH,
  "ZIP CODE",
  LATITUDE,
  LONGITUDE,
  LOCATION,
  "NUMBER OF PERSONS INJURED",
  "NUMBER OF PERSONS KILLED",
  "NUMBER OF PEDESTRIANS INJURED",
  "NUMBER OF PEDESTRIANS KILLED",
  "NUMBER OF CYCLIST INJURED",
  "NUMBER OF CYCLIST KILLED",
  "NUMBER OF MOTORIST INJURED",
  "NUMBER OF MOTORIST KILLED",
  "CONTRIBUTING FACTOR VEHICLE 1",
  "CONTRIBUTING FACTOR VEHICLE 2",
  "CONTRIBUTING FACTOR VEHICLE 3",
  "CONTRIBUTING FACTOR VEHICLE 4",
  "CONTRIBUTING FACTOR VEHICLE 5",
  "VEHICLE TYPE CODE 1",
  "VEHICLE TYPE CODE 2",
  "VEHICLE TYPE CODE 3",
  "VEHICLE TYPE CODE 4",
  "VEHICLE TYPE CODE 5"
 FROM
  NYC.nyc_collision_data) nested_0
INNER JOIN NYC.NYC_weather AS join_NYC_weather ON nested_0."DATE" = join_NYC_weather."DATE"
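The inner join above keeps only the rows whose DATE exists in both datasets. A self-contained sketch of the same semantics using in-memory SQLite with hypothetical rows (table and column names are simplified stand-ins for the tutorial's datasets):

```python
import sqlite3

# Tiny stand-ins for the collision and weather datasets; values are made up.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE collisions (date TEXT, borough TEXT, injured INTEGER)")
con.execute("CREATE TABLE weather (date TEXT, tmin REAL, prcp REAL)")
con.executemany("INSERT INTO collisions VALUES (?,?,?)",
                [("2016-01-01", "MANHATTAN", 2), ("2016-01-02", "MANHATTAN", 0)])
con.executemany("INSERT INTO weather VALUES (?,?,?)",
                [("2016-01-01", 34.0, 0.0)])  # no weather row for 01-02

# INNER JOIN keeps only dates present in both tables, as in the Dremio query.
rows = con.execute(
    "SELECT c.date, c.injured, w.tmin FROM collisions c "
    "INNER JOIN weather w ON c.date = w.date").fetchall()
print(rows)  # [('2016-01-01', 2, 34.0)]
```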

Connecting to the dataset in Qlik Sense

First, start Qlik Sense Desktop, since Dremio requires it to be running. Then, in Dremio, click the Analyze button at the top right corner of the toolbar and select Qlik Sense.

Launching Qlik Sense connected to Dremio

At this point, you will need to enter your Dremio password and wait for the connection to be established. We will work with Qlik Sense in the browser, so follow the provided link. When Qlik Sense opens, you will see a window asking whether you want to open the data load editor with no data. Click Open.

You will now see the data load editor with automatically generated code; however, it needs to be changed. To insert the right script and load the data correctly, follow these steps:

  • On the right side of the page, you will find the Edit connection icon. Make sure that the connection uses the System DSN and the correct bitness (32- or 64-bit) for your machine.

Editing the connection in Qlik Sense

  • Click the Select data icon to the left of Edit connection. Select NYC in the Owner menu and click the fulldata dataset. Check in the preview that the data is visible and displays correctly, make sure Include LOAD statement is selected, and click Insert script.

Configuring the Qlik Sense connection

After that, the data load editor looks like this:

SET DirectIdentifierQuoteChar='"';

LIB CONNECT TO 'Dremio';

LOAD "TIME",
    BOROUGH,
    "ZIP CODE",
    LATITUDE,
    LONGITUDE,
    LOCATION,
    "NUMBER OF PERSONS INJURED",
    "NUMBER OF PERSONS KILLED",
    "NUMBER OF PEDESTRIANS INJURED",
    "NUMBER OF PEDESTRIANS KILLED",
    "NUMBER OF CYCLIST INJURED",
    "NUMBER OF CYCLIST KILLED",
    "NUMBER OF MOTORIST INJURED",
    "NUMBER OF MOTORIST KILLED",
    "CONTRIBUTING FACTOR VEHICLE 1",
    "CONTRIBUTING FACTOR VEHICLE 2",
    "CONTRIBUTING FACTOR VEHICLE 3",
    "CONTRIBUTING FACTOR VEHICLE 4",
    "CONTRIBUTING FACTOR VEHICLE 5",
    "VEHICLE TYPE CODE 1",
    "VEHICLE TYPE CODE 2",
    "VEHICLE TYPE CODE 3",
    "VEHICLE TYPE CODE 4",
    "VEHICLE TYPE CODE 5",
    "DATE",
    DATE0,
    "_id",
    STATION,
    NAME,
    AWND,
    PRCP,
    SNOW,
    SNWD,
    TAVG,
    TMAX,
    TMIN,
    WT01,
    WT02,
    WT03,
    WT04,
    WT06,
    WT08;
SQL SELECT "TIME",
    BOROUGH,
    "ZIP CODE",
    LATITUDE,
    LONGITUDE,
    LOCATION,
    "NUMBER OF PERSONS INJURED",
    "NUMBER OF PERSONS KILLED",
    "NUMBER OF PEDESTRIANS INJURED",
    "NUMBER OF PEDESTRIANS KILLED",
    "NUMBER OF CYCLIST INJURED",
    "NUMBER OF CYCLIST KILLED",
    "NUMBER OF MOTORIST INJURED",
    "NUMBER OF MOTORIST KILLED",
    "CONTRIBUTING FACTOR VEHICLE 1",
    "CONTRIBUTING FACTOR VEHICLE 2",
    "CONTRIBUTING FACTOR VEHICLE 3",
    "CONTRIBUTING FACTOR VEHICLE 4",
    "CONTRIBUTING FACTOR VEHICLE 5",
    "VEHICLE TYPE CODE 1",
    "VEHICLE TYPE CODE 2",
    "VEHICLE TYPE CODE 3",
    "VEHICLE TYPE CODE 4",
    "VEHICLE TYPE CODE 5",
    "DATE",
    DATE0,
    "_id",
    STATION,
    NAME,
    AWND,
    PRCP,
    SNOW,
    SNWD,
    TAVG,
    TMAX,
    TMIN,
    WT01,
    WT02,
    WT03,
    WT04,
    WT06,
    WT08
FROM DREMIO.NYC.fulldata;

Click Load data at the top right corner and wait until the process finishes successfully.

Loading data from Dremio into Qlik Sense

Building visualizations

Now, we can start building visualizations. From the data load editor window, navigate to App overview at the top left corner.

Select Sheet and click Edit at the top of the page.

First, we want to see the relation between the average wind speed and the number of pedestrians injured in collisions. To explore this, let’s build a line chart. Drag the chart icon to the editing area and select AWND as the dimension and the average number of pedestrians injured as the measure. We can also adjust the appearance in the corresponding menu, for example, selecting a red color and the area mode.

Building the visualization

As a result, we get the following chart:

Building the visualization type

Now, let’s build a visualization that uses more of the weather variables. For instance, let’s look at the average number of motorists injured with respect to the minimum temperature and the presence of fog. For that we will use WT02: it has the value 1 if there was fog and is empty otherwise. We will use a distribution plot as the visualization type. Drag it to the editor and pick WT02 as the dimension and the average number of motorists injured as the measure. After that, in the data menu on the right, add TMIN to the Y-axis and sort by temperature.
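The measure behind this chart is simply a grouped average. A sketch of that computation with hypothetical (WT02, motorists injured) pairs:

```python
from collections import defaultdict

# Hypothetical joined rows: (WT02 fog flag, number of motorists injured).
rows = [(1, 3), (1, 1), (0, 1), (0, 0), (0, 2)]

totals = defaultdict(lambda: [0, 0])  # fog flag -> [sum, count]
for fog, injured in rows:
    totals[fog][0] += injured
    totals[fog][1] += 1

# Average motorists injured per value of the fog flag.
averages = {fog: s / n for fog, (s, n) in totals.items()}
print(averages)  # {1: 2.0, 0: 1.0}
```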

Now, exit the editing mode to switch to the interactive view. If you click on one of the points, you will see all of the points with the same value of WT02. With this visualization, we can see how the average differs between days with and without fog for each minimum temperature value.

The final visualization looks like this.

Final visualization of data in Qlik Sense

Conclusion

In this tutorial, we showed how Dremio can help combine data from different sources such as Elasticsearch and MongoDB, and how this data can then be analyzed in Qlik Sense, a very popular BI tool that is not typically used with these data stores. Not only was this challenge successfully solved, but all of the results were achieved without writing a single line of code.