Dremio Jekyll

Analyzing Book Reviews With Dremio, SQL, and R

Intro

R is a powerful language for data analytics and visualization. In this tutorial we’ll show you how Dremio enhances R by simplifying and accelerating data access across multiple sources, allowing R users to focus on their analytics. The example data is simple and relatively small, but these concepts can be applied to data in virtually any source and many formats, as well as any size.

Assumptions

To follow this tutorial you should have access to a Dremio installation. If you haven’t done so already, installing Dremio is easy - see the Quickstart for instructions.

We also think it will be easier if you’ve read Getting Oriented to Dremio and Working With Your First Dataset. If you have questions along the way, don’t hesitate to ask on the Dremio Community Site.

What is Dremio?

Dremio is an innovative data analytics tool that is targeted to the best user experience, simplicity and scalability. At the moment you can connect a large variety of data sources to Dremio, among them are:

  • Amazon Redshift
  • Amazon S3
  • Elasticsearch
  • HDFS
  • Hive
  • Microsoft SQL Server
  • MongoDB
  • MySQL
  • Postgres

In the near future, Dremio will support other sources, like Cassandra, Salesforce, Solr, Netezza, Teradata, and others.

In this article we will take a look at using Dremio and R together to analyze JSON data. What is great about Dremio, it’s its simplicity and a great user experience. There is no need to type big queries to get the data. Instead, all operations can be done through an impressive user interface. However, if you love to write SQL you can always do that too.

One of the main advantages of Dremio is the ability to connect to all the different data sources we listed above. Dremio also allows you to join data from different systems, even if they don’t support SQL. An ability to combine data is indispensable for meaningful analytical reports that truly tell the full story of your data. Dremio makes this process easy and gives us the opportunity to combine data and then work with new data like it is a new data source.

Virtual Datasets

With ETL or other approaches to this work, we would create our transformations, and write the output of the process out to our destination, in effect making a copy of the data.

Dremio has a different approach they call Virtual Datasets. A virtual dataset is like a view in a relational database - it is defined with SQL, and can be queried with SQL, joined to other virtual datasets or physical datasets, etc. In our example, we removed a column and filtered out some of the records. This didn’t change our source data. Instead, it is a transformation of the data that is applied each time we query the virtual dataset.

How to analyze data using Dremio

Data about users is commonly used for different types of analytical reporting. We will take a database of books ratings, which is available for download from here. This dataset includes user reviews of thousands of books and includes fields like:

  • user_id
  • book id
  • user score
  • book title
  • user location

This data consists of user-book ratings “BX-Book-Ratings.csv”, user info “BX-Users.csv,” and book info “BX-Books.csv.” You can store these files in a local filesystem, S3, HDFS, or other locations. In this example we have stored the files in our local filesystem and access them through a NAS data source.

Let’s follow the following simple steps to understand how to work with data in Dremio. For details on how to setup these files, please see Working With Your First Dataset or the documentation on working with files and directories.

Accessing the files in Dremio, you should now see:

Book ratings - CSV file in Dremio

Books - CSV file in Dremio

Users - CSV file in Dremio

First of all, we are going to make a join between book ratings and user info, by user-id. Use “left outer” join to preserve all data with ratings but without user info:

Left outer join - users and reviews

As a second step, let’s make a join between previous result dataset and book info, by ISBN. Also “left outer” join:

Left outer join - add books

Like in every analysis, we need to focus only on the the fields that are important. We have some columns that we do not need now and we can exclude them, like copies of columns from merging or books image-urls from book info. To do that, select the name of the column and press ‘drop’ in the drop-down menu like on the screenshot:

Removing unnecessary columns

Note: This operation isn’t changing the source data (more on this later).

Age data has its own “NULL” values which might cause issues during analysis, so we replace it with Dremio’s native “null” that is seen as correct missing value. To do this, we select “Replace” from the data transformation menu, then select “NULL” from the histogram, then specify Null as the replacement value:

Replacing NULL values

The resulting dataset should look like this.

Cleaned dataset

Virtual Datasets

With ETL or other approaches to this work, we would apply our transformations, and write the output of the process out to our destination, in effect making a copy of the data, typically in a relational database or a file system.

Dremio has a different approach they call Virtual Datasets. A virtual dataset is like a view in a relational database - it is defined with SQL, and can be queried with SQL, joined to other virtual datasets or physical datasets, etc. In our example, we removed a column and filtered out some of the records. This didn’t change our source data. Instead, it is a transformation of the data that is applied each time we query the virtual dataset.

Let’s save this virtual dataset. Click the “Save As” button at the top:

Click Save As

Now provide a name for your virtual dataset. Here we used “BX-Virtual.”

Saving the virtual dataset

After saving, you can use this virtual dataset with any of your favorite programming languages. Also, we can expand our virtual dataset by joining another dataset to the new one without a single line of code. For now, we have already joined datasets together, and we have all the required information we need for this analysis.

Basic Data Analysis With Dremio

The next step is to analyze our data and get some meaningful results. To make an analysis by the age, we need to change the format of this field first. Let’s convert age column to integer for filtering results. To do that, press the datatype button near the name of the column and choose the integer type from the drop-down list:

Concerting data type to integer

Now we can use group by to get data on mean score, mean user age, and number of users who voted for the book

Using group by and aggregate scores

The last step is to sort the data by “Average_book-Rating” so we can get the list of top rated books:

Sorting by average rating

Note: we have not typed any single query for filtering, everything was done just by using Dremio’s UX, which creates the SQL query in the background.

Because this is a virtual dataset, any tool can access it via SQL. Also, if the source data changes, queries will always return the most recent data. Save this virtual dataset as “BX-Join-Grouped” so that we can query it with R.

Example of connecting to data in Dremio from R

You can retrieve data from your virtual dataset and use it in your favorite program, such as R. When working in R, one of the hardest tasks is getting data ready for analysis. In our example above, we have simplified this work by making the data easily accessible via SQL, and by sorting and filtering the data, all in Dremio. While this example is very simple, it helps explain how you can combine Dremio and R to work together, and focus on analytics in R, because that’s what it does best.

Dremio supports JDBC and ODBC, and you can download Dremio’s drivers from the Dremio download page.

In this example we will connect to Dremio using the RODBC package. This package enables R programs to utilize compliant ODBC drivers, in our case, it will be Dremio’s ODBC driver.

Our program will load our dataset into a data frame and prints some basic statistics about the data:

As Dremio easily connects to R, we now can use features of this programming language to analyse the data. In our case, let’s build a simple but meaningful barchart that is going to show us the number of books with the score >0.

Using R functions we receive a Result histogram

Bar chart of book ratings

Conclusions

As you can see, Dremio is a very useful and easy-to-use tool for data analytics, even if you do not have deep SQL knowledge. The big advantage of using Dremio is that you can use a wide variety of databases with it, create a virtual dataset by joining a few sources of data and then operate with it like you have a new database. Also, after creating a dataset, it is possible to analyze your data with any tool, including R or BI tools like Tableau.

Another important feature we didn’t look at in this blog is Dremio’s ability to accelerate data. We’ll take a look at this feature in a later blog.