Intro
Amazon Web Services (AWS) is a cloud services platform with extensive functionality. AWS provides different opportunities and solutions for databases, storage, data management and analytics, computing, security, AI, etc. Among the offered databases and storages are Amazon Redshift and Amazon S3.
Amazon Redshift belongs to the group of the leading data warehouses. It is designed to deliver far faster performance in contrast to others. This is empowered by the application of machine learning, massively parallel query execution, and columnar storage.
Amazon Simple Storage Service (Amazon S3) is an object-oriented storage service available for any customer. Amazon S3 helps in storing and protecting any amount of data for a wide range of use cases. Industry-leading scalability, data availability, security, and performance are guaranteed as well. Due to the user-friendly management features, usage for your business becomes convenient and pleasant.
In this tutorial, we will explain how to use Amazon Redshift and S3 services, demonstrate how to connect them to Dremio, and perform data curation using Dremio. Furthermore, we will analyze the data in R.
Assumptions
We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deployments page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. Also, we will be working with Amazon Redshift, Amazon S3, R, and RStudio, so you need to have them installed and configured as well.
Loading data into the sources
In this tutorial, we will use a sample dataset that contains Amazon customer reviews.
Over 130 million customer reviews were taken into account concerning the products on Amazon.com and customers’ experience. These reviews refer to the following major aspects:
- Reviews are written in the Amazon.com marketplace.
- Reviews about products written in multiple languages from different Amazon marketplaces.
- Non-compliant reviews with respect to Amazon policies.
Amazon S3
First, you need to create a bucket. You can follow this tutorial. Then, go to your bucket and click Upload file.
Amazon Redshift
Again, you need to create a cluster. You can follow this tutorial. After the cluster is created, you can go to the Query Editor and perform SQL-statement.
If you are not acquainted with SQL you can read our tutorials in topics: MySQL, SQL-Server, Postgres.
Another way is to load data from Amazon S3. You can read more here.
Connecting to Dremio
Amazon S3
For connecting to Dremio you need to receive your credentials. For that create IAM user. At the final step of creating, you can copy your credentials. Or go to My Account->Security credentials and create Access Key.
Then go to Dremio->Add source->fill in data.
When the source is connected, you have to open data.csv file and configure dataset settings.
Amazon Redshift
Go to your cluster and copy highlighted values.
Then fill in the information for Amazon Redshift Source, as shown on the screenshots below:
Data curation in Dremio
Let’s open our data table from Redshift and join it with data.csv from S3. For that click Join->choose data->choose join condition.
In our case, we have the common column - project_title, which is our join condition. Join type is inner because we need only common data for two tables.
Save it as “data” to your local space.
Now, it’s time to transform our data. Firstly, change the string data type to int in star_rating, helpful_votes, and total_votes.
Also, filter the date and leave values only with date format.
When the column is filtered, you can transform type to date type.
Here is our final SQL-statement.
SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, product_category, CONVERT_TO_INTEGER(star_rating, 1, 1, 0) AS star_rating, CONVERT_TO_INTEGER(helpful_votes, 1, 1, 0) AS helpful_votes, CONVERT_TO_INTEGER(total_votes, 1, 1, 0) AS total_votes, vine, verified_purchase, review_headline, review_body, TO_DATE(review_date, 'YYYY-MM-DD', 1) AS review_date
FROM "@kshakhovska".data
WHERE review_date = '2015-08-31'
Now, save to your space.
Connecting to R
Finally, we can analyze our data in R. Here is the whole code:
install.packages("RODBC")
library(RODBC)
dremio_host <- "localhost"
dremio_port <- "31010"
dremio_uid <- "your dremio user"
dremio_pwd <- "your dremio password"
channel <- odbcDriverConnect(sprintf("DRIVER=Dremio Connector;HOST=%s;PORT=%s;UID=%s;PWD=%s;AUTHENTICATIONTYPE=Basic Authentication;CONNECTIONTYPE=Direct", dremio_host, dremio_port, dremio_uid, dremio_pwd))
df <- sqlQuery(channel, "SELECT * FROM aws.reviews")
if (is.character(df)) { close(channel); stop(paste(df, collapse = "\n")) } # stop if query failed
print(nrow(df)) # print # records returned
install.packages('dplyr', dependencies = TRUE)
install.packages('ggplot2', dependencies = TRUE)
library(dplyr)
summary(df)
df%>%
filter(star_rating == 5) %>%
count(total_votes)
min(df$star_rating)
mean(df$star_rating)
max(df$star_rating)
close(channel)
Let’s go step-by-step:
1. First, we connect R to Dremio. For that provide your user and password. Then, in SQL-query function change Select statement. You can get it directly from a dataset in Dremio:
2. After that, you can view data in a global environment.
3. Let’s check the number of rows and the dataset summary
4. Also, we can filter only the necessary data and work with it.
5. Moreover, we can find minimum, mean and max values.
Conclusion
In this tutorial we showed how to combine data from Amazon Redshift and Amazon S3, easily work with it in Dremio thanks to user-friendly UI, and then perform data analysis using R. Such combination of these powerful tools gives you a chance to develop a successful project.
We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.