Using R to perform data science operations on AWS

   

Table of Contents

Intro

Amazon Web Services (AWS) is a cloud services platform with extensive functionality. AWS provides different opportunities and solutions for databases, storage, data management and analytics, computing, security, AI, etc. Among the offered databases and storages are Amazon Redshift and Amazon S3.

Amazon Redshift belongs to the group of the leading data warehouses. It is designed to deliver far faster performance in contrast to others. This is empowered by the application of machine learning, massively parallel query execution, and columnar storage.

Amazon Simple Storage Service (Amazon S3) is an object-oriented storage service available for any customer. Amazon S3 helps in storing and protecting any amount of data for a wide range of use cases. Industry-leading scalability, data availability, security, and performance are guaranteed as well. Due to the user-friendly management features, usage for your business becomes convenient and pleasant.

In this tutorial, we will explain how to use Amazon Redshift and S3 services, demonstrate how to connect them to Dremio, and perform data curation using Dremio. Furthermore, we will analyze the data in R.

Assumptions

We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deployments page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. Also, we will be working with Amazon Redshift, Amazon S3, R, and RStudio, so you need to have them installed and configured as well.

Loading data into the sources

In this tutorial, we will use a sample dataset that contains Amazon customer reviews.

Over 130 million customer reviews were taken into account concerning the products on Amazon.com and customers’ experience. These reviews refer to the following major aspects:

  • Reviews are written in the Amazon.com marketplace.
  • Reviews about products written in multiple languages from different Amazon marketplaces.
  • Non-compliant reviews with respect to Amazon policies.

Amazon S3

First, you need to create a bucket. You can follow this tutorial. Then, go to your bucket and click Upload file.

image alt text

Amazon Redshift

Again, you need to create a cluster. You can follow this tutorial. After the cluster is created, you can go to the Query Editor and perform SQL-statement.

image alt text

If you are not acquainted with SQL you can read our tutorials in topics: MySQL, SQL-Server, Postgres.

Another way is to load data from Amazon S3. You can read more here.

Connecting to Dremio

Amazon S3

For connecting to Dremio you need to receive your credentials. For that create IAM user. At the final step of creating, you can copy your credentials. Or go to My Account->Security credentials and create Access Key.

Then go to Dremio->Add source->fill in data.

image alt text

When the source is connected, you have to open data.csv file and configure dataset settings.

image alt text

Amazon Redshift

Go to your cluster and copy highlighted values.

image alt text

Then fill in the information for Amazon Redshift Source, as shown on the screenshots below:

image alt text
image alt text

Data curation in Dremio

Let’s open our data table from Redshift and join it with data.csv from S3. For that click Join->choose data->choose join condition.

image alt text

In our case, we have the common column - project_title, which is our join condition. Join type is inner because we need only common data for two tables.

image alt text

Save it as “data” to your local space.

Now, it’s time to transform our data. Firstly, change the string data type to int in star_rating, helpful_votes, and total_votes.

image alt text

Also, filter the date and leave values only with date format.

image alt text

When the column is filtered, you can transform type to date type.

image alt text

Here is our final SQL-statement.

SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, product_category, CONVERT_TO_INTEGER(star_rating, 1, 1, 0) AS star_rating, CONVERT_TO_INTEGER(helpful_votes, 1, 1, 0) AS helpful_votes, CONVERT_TO_INTEGER(total_votes, 1, 1, 0) AS total_votes, vine, verified_purchase, review_headline, review_body, TO_DATE(review_date, 'YYYY-MM-DD', 1) AS review_date
FROM "@kshakhovska".data
WHERE review_date = '2015-08-31'

Now, save to your space.

image alt text

Connecting to R

Finally, we can analyze our data in R. Here is the whole code:

install.packages("RODBC")
library(RODBC)
dremio_host <- "localhost"
dremio_port <- "31010"
dremio_uid <- "your dremio user"
dremio_pwd <- "your dremio password"
channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd))
df <- sqlQuery(channel, "SELECT * FROM aws.reviews")
if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) } # stop if query failed
print(nrow(df)) # print # records returned
install.packages('dplyr', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
library(dplyr)
summary(df)
df%>%
  filter(star_rating == 5) %>%
  count(total_votes)
min(df$star_rating)
mean(df$star_rating)
max(df$star_rating)
close(channel)

Let’s go step-by-step:

1. First, we connect R to Dremio. For that provide your user and password. Then, in SQL-query function change Select statement. You can get it directly from a dataset in Dremio:

image alt text

2. After that, you can view data in a global environment.

image alt text

3. Let’s check the number of rows and the dataset summary

image alt text

4. Also, we can filter only the necessary data and work with it.

image alt text

5. Moreover, we can find minimum, mean and max values.

image alt text

Conclusion

In this tutorial we showed how to combine data from Amazon Redshift and Amazon S3, easily work with it in Dremio thanks to user-friendly UI, and then perform data analysis using R. Such combination of these powerful tools gives you a chance to develop a successful project.

We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.

Ready to Get Started? Here Are Some Resources to Help

Whitepaper Thumb

Whitepaper

Dremio Upgrade Testing Framework

read more
Whitepaper Thumb

Whitepaper

Operating Dremio Cloud Runbook

read more

Webinars

Unlock the Power of a Data Lakehouse with Dremio Cloud

read more
get started

Get Started Free

No time limit - totally free - just the way you like it.

Sign Up Now
demo on demand

See Dremio in Action

Not ready to get started today? See the platform in action.

Watch Demo
talk expert

Talk to an Expert

Not sure where to start? Get your questions answered fast.

Contact Us

Ready to Get Started?

Bring your users closer to the data with organization-wide self-service analytics and lakehouse flexibility, scalability, and performance at a fraction of the cost. Run Dremio anywhere with self-managed software or Dremio Cloud.