Dremio Jekyll

Analyzing MongoDB Atlas with R and Tableau

Intro

R is a powerful tool for data analysis. It can be used for statistical computing and data visualization. Moreover, the significant benefit of R is the high-quality plots which can include mathematical symbols and formulae. MongoDB Atlas is a fully automated application that can be used as a database reducing the time spent on database managing.

The integration of R and MongoDB Atlas allows broadening a variety of data that can be analyzed. Despite the benefits of R, it demands much more effort while doing data-preprocessing. Fortunately, Dremio allows connecting data from MongoDB Atlas to R easily. Moreover, Dremio helps to reduce the number of code lines related to data curation and can clean up the data just with several clicks. Just like R, you can connect MongoDB Atlas to any BI tool using Dremio. Let’s explore closer these integrations.

Assumptions

We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s download page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. Also, we will be working with MongoDB Atlas, R, and RStudio, so you need to have them installed and configured as well.

Importing data into MongoDB Atlas and connecting to Dremio

Data about films is commonly used for different types of analytical reporting. We will take a database of TMDB ratings, which you can download here. This dataset includes user reviews of thousands of films and contains fields from titles, genres, and popularity to budget and revenue.

For this tutorial, we will work only with “tmdb_5000_movies.csv” file. So, first of all, you need to download and import this file into MongoDB Atlas. Choose “Connect” -> “Connect with the Mongo Shell” and then choose your version of Mongo Shell.

image alt text

Copy the connection string and paste it into a command line. Then you can import your data by writing the following:

1
mongoimport --host <your replicaSet>/<your host> --ssl --username <your username> --password <your password> --authenticationDatabase admin --db <db name> --collection <collection name> --type csv --headerline --file <Filepath>

Pay attention, as if you import CSV, you should write headerline. It’s not necessary to do in other cases.

If the data is successfully imported, you should see the following:

image alt text

Then, you can check it out on MongoDB Atlas in Collection section.

To connecting to Dremio you should:

  • 1 Log in to Dremio
  • 2 Click “Add source”
  • 3 Choose MongoDB as a source
  • 4 Fill in the necessary information, as shown on the images below

image alt textimage alt text

After the connection, you should be able to see the tmdb dataset in Dremio.

image alt text

Also, it would be useful to add description and tags. Dremio gives us an opportunity to create descriptions and add tags to any dataset. This feature allows to understand what the data contains and what kind of analysis can be done on it.

Сhoose “Action” -> “Catalog”, and then you can write the description or add tags to make the search easier.

image alt text

For now, we have finished the preparation. So, let’s move to data preprocessing.

Data curation in Dremio vs. R

We want to show how you can make your data look better and how it can be easily done with Dremio. Here are some tips:

1. Let’s start with converting the data types to integer.

In Dremio you can simply click “Convert Data Type”, then choose “Integer” and pick the type of rounding. Then, you can create a new field and decide whether you need your source field.

image alt text

One of the Dremio benefits is that you can preview the result before applying the changes to the dataset.image alt text

In R, you can use similar functions: round(variable), ceiling(variable), floor(variable).

2. While working with text data, sometimes it is a good idea to split the text.

For example, in our dataset we have one important but very messy field. We can do a lot of analysis with genres, but it is written in the following form: [{“id”: 12, “name”: “Adventure”}, {“id”: 14, “name”: “Fantasy”}, {“id”: 28, “name”: “Action”}, {“id”: 53, “name”: “Thriller”}, {“id”: 878, “name”: “Science Fiction”}]. Our goal is to get genres from this field.

You should click “Split” and choose the type of splitting.

image alt text

After that, unnest the result:

image alt text

To get the final result, you should do several splitting again until you get the desired genres.

image alt text

In R there are several functions which you can use:

strsplit() - to split by fixed string;

unlist() - to unpack your splitted data;

grep() - to split by a pattern (similar to LIKE function).

3. Now you should clean genres field a little. We can use “Keep only” function or “Exclude” by different parameters.

image alt text

image alt text

image alt text

In R you can do the same using filter().

4. Also, sometimes you need to drop some fields. It can be easily done in Dremio, by clicking only a few buttons.

image alt text

In R you can do the same using drop(), or it would be better to rewrite your dataframe with columns that you need.

Finally, we get the next result:

image alt text

After all the executions, the automatically written SQL query should look like following:

1
2
3
4
5
6
7
8
9
SELECT "id", budget, genres, flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) AS genres_1_1_1, homepage, id, keywords, original_language, original_title, overview, popularity, popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
FROM (
 SELECT "id", budget, genres, genres_1, flatten(regexp_split(genres_1, '\Q"},\E', 'ALL', 10)) AS genres_1_1, homepage, id, keywords, original_language, original_title, overview, popularity, popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
FROM (
   SELECT "*_* id", budget, genres, flatten(regexp_split(genres, '\Q": "\E', 'ALL', 10)) AS genres_1, homepage, id, keywords, original_language, original_title, overview, popularity, CAST(ROUND(popularity) as INTEGER) AS popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
   FROM Dremio.tmdb.movies
 ) nested_0
) nested_1
WHERE (flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) <> '') AND (NOT regexp_like(flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)), '.*?\Q"id\E.*?')) AND (flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) <> '[]')

Now, you can save the dataset to your space.

image alt text

Connecting to R

You can use following code for connection:

1
2
3
4
5
6
7
install.packages("RODBC")
library(RODBC)
dremio_host <- "your hostname"
dremio_port <- "31010"
dremio_uid <- "your user"
dremio_pwd <- "your password"
channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd))

It’s important to put single quotation marks at the beginning of the query, because we use double quote in a query. Otherwise, you will get an error.

You can get the SQL query from Dremio. In our case, we use the following code:

1
2
3
4
5
6
7
8
9
10
SELECT * FROM moviespace.movie
df <- sqlQuery(channel, '<SQL Query>')
if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) } # stop if query failed
print(nrow(df)) # print # records returned
install.packages('dplyr', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
library(dplyr)
library(ggplot2)
ggplot(data=df, aes(x=popularity_1, y=vote_count, col=genres_1_1_1)) + geom_point()
close(channel)

This is the result we got:

image alt text

image alt text

Analyzing data in Tableau

Moreover, not only this data can be analyzed using R, but also any of your BI tool of choice, for example, Tableau.

To connect Dremio to Tableau you should:

1. Choose Tableau icon and open a file.image alt text

2. Log in with your Dremio user.image alt text

3. Feel free to analyze! image alt text

First, we should decide which fields we want to analyze. We will choose our cleared genres, vote_count, and popularity. Then, put genres at the center and vote_count and popularity as row and column.

Here is the result:

image alt text

Conclusion

To sum up, Dremio is a handy and easy-to-use tool for data analytics. Moreover, there is no need to write and remember functions, thanks to user-friendly UI. The big advantage of using Dremio is that you can broaden your ability of data analyzing by connecting such powerful tools as R, Tableau, and MongoDB Atlas. Consequently, Dremio is not only a great instrument for data curation but also a mean of data analyzing by using any tool, including R or different BI tools.