Analyzing MongoDB Atlas with R and Tableau
Dremio
Intro
R is a powerful tool for data analysis. It can be used for statistical computing and data visualization. Moreover, the significant benefit of R is the high-quality plots which can include mathematical symbols and formulae. MongoDB Atlas is a fully automated application that can be used as a database reducing the time spent on database managing.
The integration of R and MongoDB Atlas allows broadening a variety of data that can be analyzed. Despite the benefits of R, it demands much more effort while doing data-preprocessing. Fortunately, Dremio allows connecting data from MongoDB Atlas to R easily. Moreover, Dremio helps to reduce the number of code lines related to data curation and can clean up the data just with several clicks. Just like R, you can connect MongoDB Atlas to any BI tool using Dremio. Let’s explore closer these integrations.
Assumptions
We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deployments page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. Also, we will be working with MongoDB Atlas, R, and RStudio, so you need to have them installed and configured as well.
Importing data into MongoDB Atlas and connecting to Dremio
Data about films is commonly used for different types of analytical reporting. We will take a database of TMDB ratings, which you can download here. This dataset includes user reviews of thousands of films and contains fields from titles, genres, and popularity to budget and revenue.
For this tutorial, we will work only with “tmdb_5000_movies.csv” file. So, first of all, you need to download and import this file into MongoDB Atlas. Choose “Connect” -> “Connect with the Mongo Shell” and then choose your version of Mongo Shell.
Copy the connection string and paste it into a command line. Then you can import your data by writing the following:
1
mongoimport --host <your replicaSet>/<your host> --ssl --username <your username> --password <your password> --authenticationDatabase admin --db <db name> --collection <collection name> --type csv --headerline --file <Filepath>
Pay attention, as if you import CSV, you should write headerline. It’s not necessary to do in other cases.
If the data is successfully imported, you should see the following:
Then, you can check it out on MongoDB Atlas in Collection section.
To connecting to Dremio you should:
- 1 Log in to Dremio
- 2 Click “Add source”
- 3 Choose MongoDB as a source
- 4 Fill in the necessary information, as shown on the images below
After the connection, you should be able to see the tmdb dataset in Dremio.
Also, it would be useful to add description and tags. Dremio gives us an opportunity to create descriptions and add tags to any dataset. This feature allows to understand what the data contains and what kind of analysis can be done on it.
Сhoose “Action” -> “Catalog”, and then you can write the description or add tags to make the search easier.
For now, we have finished the preparation. So, let’s move to data preprocessing.
Data curation in Dremio vs. R
We want to show how you can make your data look better and how it can be easily done with Dremio. Here are some tips:
1. Let’s start with converting the data types to integer.
In Dremio you can simply click “Convert Data Type”, then choose “Integer” and pick the type of rounding. Then, you can create a new field and decide whether you need your source field.
One of the Dremio benefits is that you can preview the result before applying the changes to the dataset.
In R, you can use similar functions: round(variable), ceiling(variable), floor(variable).
2. While working with text data, sometimes it is a good idea to split the text.
For example, in our dataset we have one important but very messy field. We can do a lot of analysis with genres, but it is written in the following form: [{“id”: 12, “name”: “Adventure”}, {“id”: 14, “name”: “Fantasy”}, {“id”: 28, “name”: “Action”}, {“id”: 53, “name”: “Thriller”}, {“id”: 878, “name”: “Science Fiction”}]. Our goal is to get genres from this field.
You should click “Split” and choose the type of splitting.
After that, unnest the result:
To get the final result, you should do several splitting again until you get the desired genres.
In R there are several functions which you can use:
strsplit() - to split by fixed string;
unlist() - to unpack your splitted data;
grep() - to split by a pattern (similar to LIKE function).
3. Now you should clean genres field a little. We can use “Keep only” function or “Exclude” by different parameters.
In R you can do the same using filter().
4. Also, sometimes you need to drop some fields. It can be easily done in Dremio, by clicking only a few buttons.
In R you can do the same using drop(), or it would be better to rewrite your dataframe with columns that you need.
Finally, we get the next result:
After all the executions, the automatically written SQL query should look like following:
1
2
3
4
5
6
7
8
9
SELECT "id", budget, genres, flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) AS genres_1_1_1, homepage, id, keywords, original_language, original_title, overview, popularity, popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
FROM (
SELECT "id", budget, genres, genres_1, flatten(regexp_split(genres_1, '\Q"},\E', 'ALL', 10)) AS genres_1_1, homepage, id, keywords, original_language, original_title, overview, popularity, popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
FROM (
SELECT "*_* id", budget, genres, flatten(regexp_split(genres, '\Q": "\E', 'ALL', 10)) AS genres_1, homepage, id, keywords, original_language, original_title, overview, popularity, CAST(ROUND(popularity) as INTEGER) AS popularity_1, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count
FROM Dremio.tmdb.movies
) nested_0
) nested_1
WHERE (flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) <> '') AND (NOT regexp_like(flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)), '.*?\Q"id\E.*?')) AND (flatten(regexp_split(genres_1_1, '\Q"}]\E', 'ALL', 10)) <> '[]')
Now, you can save the dataset to your space.
Connecting to R
You can use following code for connection:
1
2
3
4
5
6
7
install.packages("RODBC")
library(RODBC)
dremio_host <- "your hostname"
dremio_port <- "31010"
dremio_uid <- "your user"
dremio_pwd <- "your password"
channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd))
It’s important to put single quotation marks at the beginning of the query, because we use double quote in a query. Otherwise, you will get an error.
You can get the SQL query from Dremio. In our case, we use the following code:
1
2
3
4
5
6
7
8
9
10
SELECT * FROM moviespace.movie
df <- sqlQuery(channel, '<SQL Query>')
if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) } # stop if query failed
print(nrow(df)) # print # records returned
install.packages('dplyr', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
library(dplyr)
library(ggplot2)
ggplot(data=df, aes(x=popularity_1, y=vote_count, col=genres_1_1_1)) + geom_point()
close(channel)
This is the result we got:
Analyzing data in Tableau
Moreover, not only this data can be analyzed using R, but also any of your BI tool of choice, for example, Tableau.
To connect Dremio to Tableau you should:
1. Choose Tableau icon and open a file.
2. Log in with your Dremio user.
3. Feel free to analyze!
First, we should decide which fields we want to analyze. We will choose our cleared genres, vote_count, and popularity. Then, put genres at the center and vote_count and popularity as row and column.
Here is the result:
Conclusion
To sum up, Dremio is a handy and easy-to-use tool for data analytics. Moreover, there is no need to write and remember functions, thanks to user-friendly UI. The big advantage of using Dremio is that you can broaden your ability of data analyzing by connecting such powerful tools as R, Tableau, and MongoDB Atlas. Consequently, Dremio is not only a great instrument for data curation but also a mean of data analyzing by using any tool, including R or different BI tools.