Dremio Jekyll

Analyzing Amazon Redshift with Dremio and Python

Introduction

Amazon Redshift is a powerful data warehouse service in the cloud. It is a simple mean of analyzing data across your data warehouse and data lake. Moreover, it is cost-effective. Redshift delivers ten times faster performance than other data warehouses because of involved technics, such as machine learning, massively parallel query execution, and columnar storage on high-performance disk.

In this tutorial, we will explain how to connect Amazon Redshift, Dremio and Python. Also, we will demonstrate options of data curation using Dremio and ways of visualizing data in Jupyter Notebook.

Assumptions

We assume that you have Dremio and ODBC driver installed; if not, go to Dremio’s deployments page, pick the installations for your operating system, and visit Dremio Docs to read about installations and deployments. Also, we will be working with Amazon Redshift and Python, so you need to have them installed and configured as well.

Connecting Amazon Redshift to Dremio

We will work with the dataset based on tracking sales activity for the fictional TICKIT website, where users buy and sell tickets online for different events. Here we can get information about selling time, rates for sellers, customers, their tastes, etc.

First of all, you have to load data into Amazon, for that you should do the following:

  • Launch your AWS Redshift cluster.
  • Create a new database with empty tables in this cluster.
  • Load sample data from Amazon S3 by using the COPY command table by table.
  • Perform queries that you need and review results on AWS console.

You can read more here.

To connect to Dremio, follow these steps.

  • Go to Sources -> Add Source -> Amazon Redshift.
  • Fill in the form as we do.

image alt text

Copy the JDBC connection string from the AWS console:

image alt text

Type 0 on the “Record fetch size” field and Save.

image alt text

Now, you are connected and can see your data in Dremio.

image alt text

Data Curation with Dremio

Our data consists of several tables, but we only need the necessary information. That is why we will take two tables and join them.

For that, choose your first table and click “Join”. Then, select join condition. Remember that you can preview the result firstly, and only then apply changes.

image alt text

To add, we want to clean up our data from nulls. It’s very easy to do using Dremio. All you need is to click “Exclude” on the desired field and choose data to remove.

image alt text

You can try to perform other preparations using this tutorial.

Here is the final SQL query that Dremio generates automatically:

1
2
3
4
5
6
7
8
9
SELECT nested_0.salesid AS salesid, nested_0.listid AS listid, nested_0.sellerid AS sellerid, nested_0.eventid AS eventid, nested_0.dateid AS dateid, nested_0.qtysold AS qtysold, nested_0.pricepaid AS pricepaid, nested_0.commission AS commission, nested_0.saletime AS saletime, nested_0.buyerid AS buyerid, join_users.userid AS userid, join_users.username AS username, join_users.firstname AS firstname, join_users.lastname AS lastname, join_users.city AS city, join_users.state AS state, join_users.email AS email, join_users.phone AS phone, join_users.likesports AS likesports, join_users.liketheatre AS liketheatre, join_users.likeconcerts AS likeconcerts, join_users.likejazz AS likejazz, join_users.likeclassical AS likeclassical, join_users.likeopera AS likeopera, join_users.likerock AS likerock, join_users.likevegas AS likevegas, join_users.likebroadway AS likebroadway, join_users.likemusicals AS likemusicals

FROM (
  SELECT salesid, listid, sellerid, buyerid, eventid, dateid, qtysold, pricepaid, commission, saletime
  FROM "Amazon Redshift".dev.public.sales
) nested_0
 INNER JOIN "Amazon Redshift".dev.public.users AS join_users ON nested_0.buyerid = join_users.userid
 WHERE join_users.likesports
 

Now, you can save the dataset to the appropriate space.

image alt text

Connecting Dremio to Python

When the data curation is done, we can connect Dremio to Python for further analysis and visualization.

First, we need to make sure that the proper libraries are imported:

1
2
3
import pyodbc
import pandas as pd
import numpy as np

Then, initialize variables:

1
2
3
4
5
6
host = "localhost"
port = 31010
uid = 'your user'
pwd = 'your password'
driver = "Dremio Connector" #for Windows
driver = "file path to odbc driver" #for Linux

The SQL query you can get from the Editor:

image alt text

1
2
3
cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, host,port,uid,pwd),autocommit=True)
sql = "SELECT * FROM aws.data"
dataframe = pd.read_sql(sql,cnxn)

Let’s look at our data:

1
dataframe.head()

image alt text

Building plots

After importing data we can build plots. Building a plot is an important part of data analysis as it allows to check variables dependency. Let’s review several options.

First one is a scatter plot. It’s a good way to estimate data distribution.

Create scatter plots

1
2
g = sns.FacetGrid(dataframe, col="liketheatre", row="likesports", margin_titles=True)
g.map(plt.scatter,"pricepaid","commission")

Show the plot

1
plt.show()

image alt text

Here we can see that ticket price paid depends on commission because data distribution is similar to a line. Therefore, it makes sense to try linear regression in further analysis. Moreover, we can see that theater taste doesn’t effect on data.

Swarmplot does practically the same, but it also demonstrates the proportion of data by using different colors.

1
sns.swarmplot(x="pricepaid", y="commission", data=dataframe)

Show plot

1
plt.show()

image alt text

The last one is barplot. In this example we tried to estimate the effect of taste on ticket price paid. In opera case, we can see that people who don’t like opera spend more money.

1
sns.barplot(x='likeopera',y='pricepaid',data=dataframe)

image alt text

Conclusions

Dremio is a good mean of connecting Amazon Redshift and Python. Moreover, it makes data curation much easier. Combining Amazon Redshift and Python gives an opportunity of common using data warehouse and data analysis. In this tutorial, we demonstrated how to connect Amazon Redshift to Dremio and Dremio to Python and provided simple analysis by using plots.

We hope that you found this tutorial useful. Stay tuned to learn more about how Dremio can help you get more value from your data, faster.