Dremio Jekyll

Analyzing MySQL data with Dremio and Python

Introduction

MySQL is a relational database management system developed and supported by Oracle Corporation. Nowadays, it is one of the most common tools to manage a database. In this tutorial, we will show how to deal with MySQL and Python integration using Dremio.

Assumptions

This tutorial has been developed on Ubuntu and assumes that you have covered the “Getting Oriented on Dremio” tutorial and you have the following requirements:

Loading data into MySQL

In this tutorial we will perform an analysis of the Siemens Employees Corporate Research database. It has data about employees and their salaries which are generated by Siemens Corporation’s data science team. Firstly, we will download and store our data locally in MySQL server. Then we will connect to the server from Dremio and make some exploratory data analysis with Python.

Let’s start by downloading the repository that contains the data that we will be working with, navigate to data directory and run the MySQL shell with the following command, where “user” and “pass” are your root MySQL username and password:

1
mysql -u user -p pass

Before loading a database, we need to create empty MySQL DATABASE:

1
CREATE DATABASE employees;

Now, you can load the data in our “employees” database:

1
employees < employees.sql

To make sure that we have loaded the data correctly, we will display the first 5 rows of “employees” table:

1
2
3
USE employees;
SELECT * FROM employees
LIMIT 5;

If the data has loaded successfully, you will see the following output in your MySQL shell:

image alt text

Now we can connect to MySQL from Dremio.

Connecting to data in Dremio

In order to build a connection to MySQL from Dremio, you need to launch Dremio and go to the http://localhost:9047. Then you will need to login or create a new user if you’re working with Dremio for the first time. After that, you need to add a new data source by pressing the “+” button in the top of Sources menu.

image alt text

Choose “MySQL” from the list and enter “mysql1” (as the name), “localhost” (as a host), and “3306” (default port). Also, you need to enter your MySQL username and password, so Dremio could connect to your source. Then click “Save”.

Once the connection has been created successfully, you will be able to see the new “mysql1” source in the Sources menu.

image alt text

In Dremio 3.0 we have provided users the ability to create searchable tags, descriptions, and annotations on their datasets. For that, you need to click on the database and select your table. Then switch between “Data” and “Catalog” buttons. A detailed description of how to do this you can find in this tutorial. We have added some Wiki text about Employees table and created some searchable tags.

image alt text

Data curation in Dremio

Now, let’s do some data curation: we will change the datatypes of “hire_date” and “birth_date” columns. To do that, you need to select “employees” dataset. After that, select “Change datatype…” in the corresponding columns and select “Date & Time”, in our case.

image alt text

Then, you will see the preview of your changed column. If everything looks good, click “Apply” and your datatype will be changed.

image alt text

.

When you perform such data curation operations, Dremio is not changing your original dataset. Dremio creates a new Virtual dataset (VDS) and perform the changes on that virtual representation of the data instead. In order to save this VDS, you need to create new space. For this tutorial, we already have created space called “tutorial”. More information about spaces you can read in Getting Oriented to Dremio. To save your Virtual dataset you can click the “Save As” button.

.image alt text

Now you need to enter the name of your dataset. In our case, it is “employees_converted”. Then, choose the “tutorial” space and click “Save”.

We will perform the same operations with the “salary” table and will save it to the same space with the “salary_converted” name. Now, you can find these two new datasets under the “tutorial” space.

image alt text

Also in Dremio we can extract features from our datasets using SQL query. For example, let’s get the information from “employees_converted” about all female users.

1
2
3
SELECT emp_no, CAST(birth_date as TIMESTAMP) AS birth_date, first_name, last_name, CAST(hire_date as TIMESTAMP) AS hire_date
FROM mysql1.employees.employees
WHERE gender='F'

image alt text

We can “Save As” this query as a “female_employees” dataset in our “tutorial” space.

image alt text

Data analysis and visualization using Python

In order to connect the data from Dremio with Python, you need to download ODBC Driver for your operating system. You can do it on our Download page.

image alt text

Download the ODBC Driver for Linux and install it according to the instructions. If you use a Debian-based Linux distribution you should previously convert the downloaded package from .rpm to .deb format with the help of Alien:

1
2
sudo apt-get install alien
sudo alien dremio-odbc-1.3.19.1052-1.x86_64.rpm

Before installing the ODBC Driver, we should install the additional package:

1
sudo apt-get install unixodbc-dev

Finally, install the ODBC Driver:

1
sudo apt-get dremio-odbc_1.3.19.1052-2_amd64.deb

Also, you have to install the “pyodbc” package with the following command:

1
pip install pyodbc

Now, let’s begin to write our Python script. Start by importing all necessary libraries:

1
2
3
import pyodbc
import pandas as pd
import numpy as np

To connect to Dremio from Python, we need to configure connection parameters (it is possible that the host and driver vary depending on your environment):

1
2
3
4
5
host = "localhost"
port = 31010
uid = 'username'
pwd = 'password'
driver = "/home/base/dremio-odbc-1.3.19.1052/debian/dremio-odbc/opt/dremio-odbc/lib64/libdrillodbc_sb64.so"

Variables “uid” and “pwd” should be your dremio user and password in order to connect to Dremio. Initialize connector:

1
cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, host,port,uid,pwd),autocommit=True)

Now the connection is made and we can access the data from Dremio.

1
2
employees = pd.read_sql('SELECT * from tutorial.employees_converted', cnxn)
salaries = pd.read_sql('SELECT * from tutorial.salary_converted', cnxn)

Let’s take a quick look at our loaded data:

1
employees.head()

image alt text

We will need libraries for visualization, such as matplotlib and seaborn.

1
2
import seaborn as sns
import matplotlib.pyplot as plt

Now, let’s plot some distribution of “gender” column:

1
2
sns.set(style="darkgrid")
ax = sns.countplot(x="gender", data=employees[employees.gender != 28791])

You can see that there are more male employees. As you remember, we have birth_date and hire_date columns in our dataframe already converted to “Date & Time” format. So, we can calculate the age when the candidate was hired. Let’s plot this:

1
2
3
employees['age_when_hired'] = employees['hire_date'].sub(employees['birth_date'], axis=0) / np.timedelta64(1, 'Y')
employees['age_when_hired'] = employees.age_when_hired.astype(int)
yx = sns.countplot(x='age_when_hired', data=employees)

As you can see, the youngest employee hired was only 20 years old. The oldest one was 47 years old. And most of the employees are about 30-32 years old.

So now as we know the distribution of employees’ age, let’s plot a dependency of age and salary income through the years employee was working in a company. For that, we will group our “salaries” dataframe and calculate the difference between the maximum and the minimum salary of every employee.

1
2
3
4
5
6
7
8
9
10
11
12
grouped_salaries = salaries.groupby('emp_no')
employees['salary_dif'] = np.nan
for i, (name, group) in enumerate(grouped_salaries):
		salary_dif = max(group.salary) - min(group.salary)
		employees.loc[i, 'salary_dif'] = salary_dif
age_range = range(20, 48)
avg_salary = list(np.zeros(len(age_range)))
for i, age in enumerate(age_range):
	avg_salary[i] = np.mean(employees[employees['age_when_hired'] == age].salary_dif)
plt.plot(age_range, avg_salary)
plt.ylabel('Salary difference by age')
plt.xlabel('Age')

And this is the result we got:

image alt text

As you can see from the figure above, younger people have a bigger salary income.

Conclusion

In this tutorial we showed the integration of MySQL with Python using Dremio. We converted data types, did some data visualization and data analysis, but it is just the beginning of what you can do with Dremio. We hope that you found this tutorial useful. Stay tuned to learn more about how Dremio can help you get more value from your data, faster.