
Analyzing MongoDB With Power BI

Introduction

Power BI is a big player in the market of business intelligence solutions, and there are good reasons why: it is flexible, features a vast library of standard visualizations, and provides the ability to connect to an array of data sources of varying types.

However, everything has its limitations, and Power BI is no exception. One such limitation is the inability to connect to NoSQL databases (e.g., MongoDB) directly. Of course, you can use an ODBC driver for MongoDB (which is unfortunately slow) or integrate it via custom R code, but both require a lot of manual work and coding.

This poses a real problem, especially since Power BI targets a non-programming audience while MongoDB is a versatile database used in many scenarios: textual data, nested objects and arrays, and a naturalistic data model make it a great fit for applications such as mobile development, IoT, analytics, and many more.

Fortunately, Dremio lets us connect to MongoDB data very easily. Beyond that, we can also use Dremio to reshape and blend different data sources so they are easier to work with in Power BI (it is possible to do all the crunching in Power BI, but some of it, such as transforming nested objects, is better done in Dremio).

Let’s take a closer look!

Data curation

First, we will need to connect to a running instance of MongoDB in order to get to the database. You can download the full dataset, and here is a sample of our data:

{
    "state":"GA",
    "email":null,
    "ip_address":"114.63.187.46",
    "client_browser":"mozilla",
    "feedback":null,
    "session":{
      "pages_visited":["links","about","support"],
      "date":"07/28/2017",
      "time":"17:10",
      "errors":[400,404]
    }
}

You can import the full dataset into MongoDB using:

mongoimport --db traffic --collection events --file website_traffic.json --jsonArray

Start by connecting to a new data source, and choose MongoDB as the type. We have ours running on the same machine, so we will connect to localhost:

Setting up connection to MongoDB

We will specify the address, port, and name for our data source, as well as authentication information (the latter may be omitted if the database has authentication disabled).

Next, the source will appear in the list, and we can navigate into it to view its databases and then its collections.

View list of collections in MongoDB

Collections represent physical data, and as such they are marked with purple icons (virtual datasets have green icons).

And here is what we see when opening the events collection from the traffic database:

viewing our dataset

Here we have MongoDB data in a tabular view. As you can see, some of the fields have data types that don't work in Power BI, such as arrays and sub-documents. Let's start transforming our data into a more SQL-like form, which is much more convenient for building visual analysis around.
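If you prefer SQL, the same preview is available from the SQL editor. A minimal query, assuming the source is named my_mongo (the name that appears in the generated SQL later in this tutorial):

-- Preview the raw MongoDB collection as Dremio exposes it
SELECT *
FROM my_mongo.traffic.events
LIMIT 10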

Let’s drop the _id field, which is MongoDB’s internal identifier. We will return to it when we are finished with all other transforms.

Removing the _id field

Next, let’s deal with that pesky nested field, session. This field has the form of a JSON object and looks like this:

{
    "pages_visited":["links"],
    "date":"07/17/2017",
    "time":"16:03",
    "errors":[502,404]
}

Note: the errors field may be null.

You can inspect the contents of this column in any row by hovering over the cell and clicking the three dots.

Viewing the contents of a cell

When viewing the field, you can select any attribute of the object and click “Extract” to pull it out into its own column.

This is what we have done here:

Extracting the field into a column

We have pulled the pages_visited field out into a separate column. Let’s name it pages_visited and do the same with the errors attribute. This time, though, let’s take a different route: click the down-arrow icon at the top of the session column to access the data transformation menu and choose Extract element (as shown below).

Using the extract element feature

Then, we manually type the name of the attribute we want to extract and, voilà, the field is added.
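For reference, the same extraction can be expressed in SQL using dot notation on the nested document, as the generated query later in this tutorial does:

-- Pull nested attributes of the session sub-document into their own columns
SELECT events.session.pages_visited AS pages_visited,
       events.session.errors AS errors
FROM my_mongo.traffic.events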

What else have we got in the session object? Two more attributes: date and time. It would be great if we could extract each of those, merge them into one field, and convert it into a datetime data type. This is exactly what we are about to do.

First, let’s extract date and time into separate fields, just like we did before.

Extracting date and time into separate columns

Then, we are going to add a new field (by clicking “Add New Field”) and enter a simple formula for it. We will use the CONCAT function (short for concatenation) and pass it three arguments to build the contents of the new field: date, a space, and time. Here is how our formula will look:

CONCAT("date", ' ', "time")

Note: string literals use single quotes.

Building the new date time field

We have now built the field we wanted; next, let’s cast it to the correct type, datetime.

Converting the string to a datetime data type

For this, we click on the data type icon in the column header and select “Date & Time”. We choose the custom format option and enter this format string: 'MM/DD/YYYY" "HH24:MI'
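Under the hood this step becomes a TO_TIMESTAMP call in the generated SQL. A rough sketch of the same conversion, applied directly to the source collection:

-- Build the combined string and parse it with the custom format;
-- the trailing 1 matches the argument Dremio generates for this conversion
SELECT TO_TIMESTAMP(
         CONCAT(events.session."date", ' ', events.session."time"),
         'MM/DD/YYYY" "HH24:MI', 1) AS datetime
FROM my_mongo.traffic.events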

Great! We have dealt with the nested session field to the point where we are happy with the result, and we can drop the original session field since we have extracted all the data we were interested in.

Next on our agenda is getting rid of null values in the email column. We want to analyze the email domains, and null values in this column would be an obstacle. So we will use the replace text function (again, available by clicking the down arrow by the name of the field) to replace null values with “@anonymous”. After clicking on replace text, we choose the custom condition replacement option and specify the replacement criterion: email is NULL. Then we add the replacement value mentioned above and click apply (make sure to check the box to drop the initial column).

Replacing null values with @anonymous
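In SQL terms this replacement is just a CASE expression, which is how it will show up in the generated query:

-- Substitute '@anonymous' wherever the email is missing
SELECT CASE WHEN "email" IS NULL THEN '@anonymous' ELSE email END AS email
FROM my_mongo.traffic.events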

After that, we will extract the domain part of the email, since that is the information we really want for this job. We are going to do it with a regular expression. Click on the down arrow by the name of the field, choose Extract Text, and switch the radio button from position to pattern. Next, we specify a regex pattern that captures the domain part of the email: the pattern ‘(?<=@).+’ will do just fine, so we paste it into the Regular Expression field, type “email” as the new field name, and check the option to drop the initial field. That is it.

Extracting the email domain
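In the generated SQL this step uses Dremio’s extract_pattern function; a minimal sketch:

-- Capture everything after the @ sign as the domain
SELECT extract_pattern(email, '(?<=@).+', 0, 'INDEX') AS domain
FROM my_mongo.traffic.events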

Right now we have the data we wanted: we have built the datetime field, dealt with missing values in email, extracted the domain out of the email addresses, and unnested that JSON field into a flatter representation.

This is what we have:

Viewing the SQL expression we have built

However, we still have two array fields, and it is not advisable to push them into Power BI as-is. The most natural way to work around arrays is to flatten them: put each item of the array into its own row, repeating the other columns as necessary.

So what we are going to do is click the down arrow by the name of each field and choose “Unnest.” That will break the arrays apart and repeat records as needed. The result is basically the same as if we had errors and pages tables with the same _ids as in our current table and joined them on those _ids: a virtual dataset that is the result of joining these three tables together. Alternatively, we could delete the errors and pages_visited columns altogether and create two additional virtual datasets for these columns while preserving the _id column. This would yield three virtual datasets with one-to-many relationships between them. Both are valid approaches in Dremio, and the latter may be more familiar to some users. It is also a little more interesting to show, so let’s follow that option for the rest of the tutorial.
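In SQL, the Unnest action corresponds to Dremio’s FLATTEN function. A minimal sketch for one of the arrays, applied straight to the source collection:

-- Each array element becomes its own row, with the other columns repeated:
--   {_id: 1, errors: [400, 404]}  ->  rows (1, 400) and (1, 404)
SELECT "_id", FLATTEN(events.session.errors) AS errors
FROM my_mongo.traffic.events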

For this we will save the current query as a virtual dataset and name it “main”. This is our query for main:

SELECT "_id", TO_TIMESTAMP(datetime, 'MM/DD/YYYY" "HH24:MI', 1) AS datetime, state, extract_pattern(CASE
  WHEN "email" IS NULL THEN '@anonymous'
  ELSE email END, '(?<=@).+', 0, 'INDEX') AS domain, ip_address, client_browser, feedback
FROM (
  SELECT "_id", CONCAT("date", ' ', "time") AS datetime, state, email, ip_address, client_browser, feedback, nested_0."time" AS "time", nested_0."date" AS "date", errors, pages_visited
  FROM (
    SELECT "_id", state, email, ip_address, client_browser, feedback, events.session."time" AS "time", events.session."date" AS "date", events.session.errors AS errors, events.session.pages_visited AS pages_visited
    FROM my_mongo.traffic.events
  ) nested_0
) nested_1

Next, we are going to create two new virtual datasets from our physical dataset, events, so we go back to the MongoDB source to start over. This time we will extract errors from the session field and drop all other fields except the newly extracted one and _id. This is the result:

Building the errors virtual dataset

Next we are going to do just the same, but preserve pages_visited. This is our second dataset:

Building the pages_visited virtual dataset

We will name them errors and pages_visited respectively.
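For reference, the SQL behind these two virtual datasets could look roughly like the sketch below; this assumes each extracted array is also unnested with FLATTEN, which gives the one-to-many relationship to main described above.

-- errors: one row per (visit, error code)
SELECT "_id", FLATTEN(events.session.errors) AS errors
FROM my_mongo.traffic.events

-- pages_visited: one row per (visit, page)
SELECT "_id", FLATTEN(events.session.pages_visited) AS pages_visited
FROM my_mongo.traffic.events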

This is what we have. Let’s proceed to connect to this data in Power BI.

Configuring the connection

The first step is to install the ODBC driver and the Power BI connector for Dremio.

You can download the ODBC driver from here; make sure to choose the correct version (the download page will prompt you for it). Then install it, following the instructions here.

After that, you will need to enable the Dremio connector in Power BI. This is also described in Dremio’s documentation here. Long story short, you navigate to Options -> Preview features and enable custom data connectors. Then, when you click Get Data, you can choose Dremio (Beta) from the list of sources on the Database tab. All that is left to do is enter your Dremio login and password, choose the schemas you want, and voilà, you have your data in Power BI. Here we have connected to all three virtual datasets we created: errors, main, and pages_visited.

Importing the schemas into Power BI

Building visualizations

We will build a simple dashboard of visualizations as an example. We do not aim for it to be a fully comprehensive guide to traffic analytics in Power BI, so we will stick to the basic visuals.

First, let’s build a bar chart of visits to the website over time. For this, we select the bar chart visual type. Then, we put datetime into the Axis property of the visual, which divides the chart into bars based on date, and put _id into Values (notice that Power BI automatically chooses count as the summarization method).

Power BI - visualization one

Power BI - visualization two

Let’s continue analyzing the data and add more to our report. Next, we will look at the share of browsers that users utilize to visit our website. Once again, we choose the visual type (a pie chart) and add _id to Values, which provides the count for the total and the buckets, and then we add client_browser to Details, which splits the pie into slices.

Power BI - visualization three

Power BI - visualization four

Next, we are going to visualize error occurrences over time. For this, we will use a line chart, set the Axis to datetime, and fill Values with the count of errors over time.

Power BI - visualization five

Power BI - visualization six

We also have some geo data that captures the spatial distribution of visitors across states in the contiguous US, and it can be put on a map. This is exactly what we are going to do. There are two types of maps in Power BI: a conventional map that displays data as bubbles, and a filled map that saturates each region based on its value. We are going to use the latter. Choose the filled map visual type, then choose state as Location and the count of _id as Value, and this is what it gives:

Power BI - visualization seven

As you can see, we get most visitors from the New England region plus a couple of Southern states. We can also compare those states in a bar chart like this:

Power BI - visualization eight

Or try a slightly different visualization, such as this tiled blocks diagram:

Power BI - visualization nine

Finally, when we are happy with the report, we save it and hit the Publish button to upload it to the Power BI web-based service. From there we can share it with our colleagues or schoolmates. Reports are published by clicking this button:

Publish report to Power BI server

Then, we can view it online and edit it, meaning adding new visuals, changing the layout, removing visuals, and so on. We can also pin it to a dashboard or embed it into a webpage through code.

Embedded report in webpage via Power BI

Conclusion

Today we looked at how Dremio can serve as an intermediary for connecting data from MongoDB to Power BI. We gave an example of how Dremio not only allowed us to easily connect Power BI with MongoDB, but also gave us a rich ability to modify and curate the data for use with a sophisticated business intelligence tool that is tailored to relational data, all without writing a single line of code.