Analyzing HDFS and Hive Data Using scikit-learn and Dremio

   


Intro

HDFS stands for Hadoop Distributed File System. HDFS forms the core of Apache Hadoop, along with MapReduce and YARN. This filesystem is used to safely store large amounts of data on distributed clusters. It is a highly scalable, reliable, and fault-tolerant system with a high level of performance. In data science, it is common to work with big data. In addition, the general rule of machine learning says that the more data you have, the better the model you can create and train. So, many data scientists have to work with HDFS, and it is valuable to know the basics of this technology.

Apache Hive is a tool built on top of Apache Hadoop. Hive provides an SQL-like language for querying distributed big data. You can think of the information stored in Hive as tables of structured data. Apache Hive is also a valuable instrument for data scientists.

Dremio can work directly with both HDFS and Apache Hive. Moreover, there may be situations where you will have to combine data obtained from Hive and HDFS. In this tutorial, we will show how to do this using Dremio. We will use the Abalone Dataset. We intentionally split the dataset into two parts and store one using HDFS and the other using Hive. Inside the Hive table, we are going to save data about the abalones’ weights. All other information will be stored as a file in HDFS. Using Dremio, we will load these parts, join them, curate the result, and then import it into the Jupyter Notebook environment for the Python programming language. To prepare the clustering machine learning model, we will use the scikit-learn Python package.

Assumptions

In this tutorial, we assume that you have installed:

  • Dremio 3.2
  • Apache Hive and Apache Hadoop
  • Python 3.6
  • Pyodbc
  • Dremio ODBC Driver
  • Pandas, Scikit-learn, Plotly packages
  • Jupyter Notebook
  • Ubuntu OS

Note that it is possible to reproduce our work with a slightly different set of prerequisites, so this list is only informative. For example, you can use a different operating system, a different IDE, or a different plotting library.

Loading data into HDFS

As we mentioned earlier, we split our dataset into two parts. The first part is saved in the abalone_misc.csv file, and we want to load this file into HDFS.

To start the HDFS cluster, go to Hadoop’s sbin directory and run the start-dfs.sh file:

~/<hadoop_folder>/sbin$ ./start-dfs.sh

This will start the namenode, datanodes, and secondary namenodes. Now create a directory in HDFS to store the files needed for Dremio (the -p parameter tells Hadoop to create all the directories in the given path, so in our case not only the abalone folder will be created, but also the user and dremio folders):

hadoop fs -mkdir -p /user/dremio/abalone

Now we are ready to load the file abalone_misc.csv into HDFS. Go to the directory where you store this file in your local filesystem and issue the following command:

hadoop fs -put abalone_misc.csv /user/dremio/abalone/

In the image below, you can see how to check whether the file was loaded correctly. The command “hadoop fs -ls /user/dremio/abalone” displays all the files in the specified directory. As we can see, our file is there.

image alt text

Loading data into Hive

To load data into Hive, we should first create a table. So, go to the local directory where you have the file with the data you want to load (in our case, this file is abalone_weights.csv). Then, run Hive by issuing the following command:

hive

Now we can use the SQL-like Hive query language to create a table and then load the data. You can see the code below. We named the table abalone_weights. There are 5 columns in it, each of the string type. Also, we specified that commas are used as the field separators.

CREATE TABLE abalone_weights (id string, Whole_weight string, Shucked_weight string, Viscera_weight string, Shell_weight string) row format delimited fields terminated by ",";

One more step is needed to load the data into Hive, namely the LOAD DATA statement. You can find it below.

LOAD DATA LOCAL INPATH "abalone_weights.csv" INTO TABLE abalone_weights;

If everything is okay, you will see the following output:

image alt text

Also, it is possible to check the contents of the abalone_weights table:

image alt text

Dremio and HDFS connection

After our data is loaded into both HDFS and Hive, we can connect these data sources to Dremio. Let’s start with HDFS. To connect the HDFS source to Dremio, click on the “+” sign near Sources on the main page of the Dremio UI. The new data source connection window will appear.

image alt text

Give it a name (hdfs_new), and specify the NameNode host and port. The host is localhost if you run HDFS locally. You can find the port number by visiting the localhost:50070 web address:

image alt text

After a successful connection, you will see the folders of your HDFS:

image alt text

When we open the user folder and then the dremio and abalone folders, we can find the loaded file there:

image alt text

Dremio and Hive connection

Now it’s time to connect Hive to Dremio. Before we try to establish a connection, we need to start the Hive metastore service as follows:

hive --service metastore

In the list of available sources in Dremio, pick Hive. Then, give it a name (hive_new), specify the host (127.0.0.1) and port (default 9083).

image alt text

After the connection, you will be able to see the Hive table in Dremio:

image alt text

That’s it! Now, after both HDFS and Hive sources are connected to Dremio, you can see them in the list:

image alt text

Data curation in Dremio

The first thing we need to do is join the data from the hdfs_new and hive_new data sources. Before joining, we create a workspace named hdfs_hive_space:

image alt text

Now go to the HDFS data and click the Join button:

image alt text

This will open a window where you can select the key columns to join on. We join on the id columns.

image alt text

As a result, we get the joined dataframe:

image alt text

The final step is to save this joined dataset as abalone_joined inside the previously created hdfs_hive_space workspace:

image alt text

Now we can open this dataframe and start data curation:

image alt text

The first thing we want to do is rename the columns that, for whatever reason, don’t have the correct names (as you can see, the first row of our dataframe contains the column names). To rename a column, click on the little arrow near the corresponding column name and select Rename:

image alt text

After renaming all the needed columns, we noticed that the id columns are redundant. So, we want to drop them:

image alt text

Now, we want to get rid of the first row, where the column names are duplicated. To drop this row, we use Exclude from the dropdown menu of the Sex column:

image alt text

In the window that opens, we choose “Sex” as the value that should be excluded. This will remove the entire first row.

image alt text

Now we have the following dataframe:

image alt text

The little Abc characters near each column name mean that the column values have the string type. We want to change this to either integer or float. So, click on the Abc characters near each column and pick the needed type:

image alt text

The last thing we want to do with our data in Dremio is to sort it in ascending order by the Rings column. Pick Sort Ascending in the dropdown menu near the Rings column, as in the image below:

image alt text

Now we are ready to export our curated dataset to Python for further work. We have also saved the curated dataset as abalone_curated inside the hdfs_hive_space workspace.

Use scikit-learn to build a machine learning clustering model

Everything is ready to start the machine learning model creation process. We are going to use the Jupyter Notebook environment for this. First, let’s import all the needed libraries:

import pandas as pd
import numpy as np
import pyodbc
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import plotly as py
import plotly.graph_objs as go
from plotly.offline import *
import matplotlib.pyplot as plt
%matplotlib inline  

Now we will use the following code to connect to the dataset in Dremio:

host='localhost'
port=31010
uid ='username'
pwd = 'password'
driver = '/opt/dremio-odbc/lib64/libdrillodbc_sb64.so'
cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)
sql = "SELECT * FROM hdfs_hive_space.abalone_curated"
df = pd.read_sql(sql,cnxn)

Using a simple head() call, we can check whether the dataset was imported as expected (you can also check the shape of the dataset by accessing its shape attribute - df.shape):
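
For completeness, here is a minimal sketch of this check, assuming the connection code above succeeded and df holds the curated dataset:

# Preview the first rows and the dimensions of the imported dataset
print(df.head())
print(df.shape)  # (number of rows, number of columns)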

image alt text

We also want to look at some summary statistics of our dataset. For this purpose, we use the describe() method:
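
A short sketch of this step, assuming df is the dataframe imported above:

# Summary statistics (count, mean, std, min, quartiles, max) for the numeric columns
print(df.describe())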

image alt text

For example, we can see that the minimum number of rings is 1, while the maximum is 29. The number of rings is an important value in this dataset: it can be used as a measure of how old a particular abalone is.
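
As a side note, the UCI description of the Abalone dataset treats the age in years as roughly the number of rings plus 1.5. A quick sketch of that conversion, assuming the Rings column was converted to a numeric type during curation:

# Approximate age in years, following the dataset's "rings + 1.5" convention
approx_age = df['Rings'] + 1.5
print(approx_age.min(), approx_age.max())  # roughly 2.5 to 30.5 given min=1, max=29 rings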

Here we should stop and talk about what clustering is in machine learning. In general, it is an unsupervised technique for analyzing data. The simplest difference between supervised and unsupervised machine learning is that when you train a supervised model, you have the correct answers for the training set, and you want to teach your model to predict the answers on new data, where the answers are not known. In the case of unsupervised learning, you don’t know what the results will be, because you don’t have any correct answers.

Clustering is a vivid example of unsupervised learning. You don’t know how many clusters there are in your training data or what they represent. You just want the model to find these clusters and to be able to assign new instances of data to them. This means that we may not always understand the internal meaning of the cluster splitting. But we want to try to guess the underlying meaning of the clusters in our Abalone dataset. There are two obvious features which could serve as the basis for the cluster splitting. The first is the ring category, and the second is the sex of the abalone. Let’s start with the ring category.

The Rings feature in our dataset is a continuous integer value rather than a category. We want to split all values (from min=1 to max=29) into some number of groups (categories). Let’s visualize the data:

x = df['Rings']
data = [go.Histogram(
        x=x,
        xbins=dict(
            start=0,
            end=30,
            size=5
        ))]
layout = go.Layout(
    title='Rings Distribution',
    xaxis=dict(
        title='Rings'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.01,
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='rings_histogram.html')

Here is the output:

image alt text

We used a bin size of 5 to form 6 bins over the range (0-30). So, could n=6 be the number of clusters? Possibly. But we want to dig deeper, and that’s why we will use the so-called elbow rule to detect the right number of clusters. But first, we need to perform one-hot encoding of the categorical feature, Sex.

df_ohe = pd.get_dummies(df, columns=['Sex'], prefix='dummy')
x = df_ohe.drop(columns=['Rings'])

image alt text

Now we can use the dataframe in the x variable for clustering:

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Sum of squares within cluster') #within cluster sum of squares
plt.show()

The logic behind the piece of code above is the following. The measure of the cluster splitting quality is the sum of squared differences between the cluster center and each of the data points within the cluster. In the extreme case, where each cluster is formed by only one example, this metric equals 0. The idea of the elbow rule is to perform several clusterings with a different number of clusters for each, then build a graph and look for the point where the sum of squares stops decreasing dramatically and starts to decrease more slowly. Let’s look at the output of the code above. We can see that at k=3 (number of clusters = 3) the sum of squares starts to decrease more slowly than before. This resembles an elbow, so according to the elbow rule, this point is the optimal number of clusters for this dataset.

image alt text

So, we need to generate 3 groups (categories) of rings. We are going to split them into categories as follows:

x = df['Rings']
data = [go.Histogram(
        x=x,
        name='0-5',
        xbins=dict(
            start=0,
            end=6
             )),
       go.Histogram(
        x=x,
        name='6-13',
        xbins=dict(
            start=6,
            end=14
             )),
       go.Histogram(
        x=x,
        name='14-30',
        xbins=dict(
            start=14,
            end=30
             ))]
layout = go.Layout(
    title='Rings Distribution',
    xaxis=dict(
        title='Rings'
    ),
    yaxis=dict(
        title='Count'
    ),
    bargap=0.0,
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='rings_histogram2.html')

As you can see, we are going to split into 3 groups with different borders between them. The first group includes all data points with 0 to 5 rings inclusive. The second group spans from 6 to 13 rings inclusive. Finally, the third group includes all abalones with 14 to 30 rings.

image alt text

Here is the code that creates a new column, ring_category, whose value depends on the Rings column.

df_ohe['ring_category'] = np.where(df_ohe['Rings'] >= 14, '2',
                                   (np.where(df_ohe['Rings'] < 6, '0', '1')))

image alt text

Here are a couple of plots which can help to understand the data better:

data = [
        go.Scatter(
            x=df_ohe[df_ohe['ring_category'] == i]['Diameter'],
            y=df_ohe[df_ohe['ring_category'] == i]['whole_weight'],
            mode='markers',
            opacity=0.7,
            marker={
                'size': 10,
                'line': {'width': 0.3, 'color': 'white'}
            },
            name='ring category - {}'.format(i)
        ) for i in df_ohe.ring_category.unique()
    ]
layout = go.Layout(
    xaxis={'title': 'Diameter'},
    yaxis={'title': 'whole_weight'},
    margin={'l': 40, 'b': 40, 't': 10, 'r': 10},
    legend={'x': 0, 'y': 1},
    hovermode='closest'
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='weight.html')

image alt text

data = [
        go.Scatter(
            x=df_ohe[df_ohe['ring_category'] == i]['Length'],
            y=df_ohe[df_ohe['ring_category'] == i]['Height'],
            mode='markers',
            opacity=0.7,
            marker={
                'size': 10,
                'line': {'width': 0.3, 'color': 'white'}
            },
            name='ring category - {}'.format(i)
        ) for i in df_ohe.ring_category.unique()
    ]
layout = go.Layout(
    xaxis={'title': 'Length'},
    yaxis={'title': 'Height'},
    margin={'l': 40, 'b': 40, 't': 10, 'r': 10},
    legend={'x': 0, 'y': 1},
    hovermode='closest'
)
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig, filename='len_height.html')

image alt text

As you can see, the orange and green classes are mixed. So, the clusters will probably be formed based on other features, not only on height, length, whole weight, and diameter.

Below is the code where we split the dataset into training and testing sets and then train the KMeans clustering model. After the training, we try to predict the ring_category for the X_test dataset. Then we remap the predicted cluster labels so that they use the same naming as the actual labels.

X = df_ohe.drop(columns=['ring_category'])
y = df_ohe['ring_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = KMeans(n_clusters = 3, random_state = 42)
model.fit(X_train)
X_test['cluster'] = list(model.predict(X_test))
X_test['new_cluster'] = np.where(X_test['cluster'] == 2, 0,
                                   (np.where(X_test['cluster'] == 0, 2, 1)))

Here is the evaluation of the results:

image alt text
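
The screenshot above was presumably produced with accuracy_score; a minimal sketch of such a check, assuming the string-typed ring_category labels need to be cast to integers to match the remapped clusters:

# Compare the remapped cluster labels with the true ring categories
# (y_test holds string labels '0'/'1'/'2', so we align the dtypes first)
print(accuracy_score(y_test.astype(int), X_test['new_cluster']))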

As you can see, the accuracy isn’t very high. This may be because we only assumed that the clusters would correspond to the ring_category, while in reality the clusters were generated based on many other features.
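
One hedged way to see which features drive the splitting is to inspect the cluster centers learned by KMeans (a sketch, assuming the model fitted on X_train above is still in memory):

# Each row is a cluster centroid; the columns show the average feature values per cluster
centers = pd.DataFrame(model.cluster_centers_, columns=X_train.columns)
print(centers)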

Let’s now train one more model, but this time we will use all available features. Then, we will try to predict the sex of the abalones based on the clusters. The sex distribution is displayed below:

image alt text

We will use the entire dataset (X) as a training set:

image alt text

Now we will use the following code to train the machine learning model. Here we also perform a transformation similar to the one we did earlier, so that the predicted and true labels correspond to each other.

model = KMeans(n_clusters = 3, random_state = 42)
model.fit(X)
X['cluster'] = list(model.predict(X))
X['sex_transformed'] = np.where(X['dummy_I'] == 1, 0,
                                   (np.where(X['dummy_F'] == 1, 1, 2)))

Here is the dataframe we obtained after these steps:

image alt text

Finally, let’s check the accuracy:

image alt text
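
Again, the number above is presumably an accuracy_score comparison between the remapped sex labels and the predicted clusters; a minimal sketch under that assumption:

# Compare the remapped sex labels with the predicted clusters
print(accuracy_score(X['sex_transformed'], X['cluster']))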

You can see that the accuracy of predicting sex based on the clusters is higher than the accuracy of predicting the ring category based on the clusters. Nevertheless, if we were solving a classification task rather than clustering, this level of model quality probably wouldn’t be enough for us. This fact confirms that you should think about which problem exactly you want to solve before starting to build any machine learning models. In our case, the cluster splitting takes many different features into account, not only the sex or ring category.

Conclusion

In this article, we demonstrated how to build a machine learning clustering model using Dremio and some of the most popular data science tools in the community. We also performed some data curation in Dremio, as the datasets were obtained from different data sources: HDFS and Hive. So, we showed how to load data into the HDFS filesystem and a Hive table, how to connect these data sources to Dremio, how to join dataframes and preprocess them, and how to export the joined dataframe to Python. In the Jupyter Notebook environment, we used the scikit-learn library to create and evaluate the models. The Plotly and matplotlib visualization packages were used to build the graphs needed to explain some ideas. Further improvements can be made after a detailed exploration of the cluster splitting criteria, once you understand the implicit meaning of each cluster.

We hope you enjoyed this tutorial, stay tuned for more tutorials and resources to learn how you can use Dremio to start gaining insights from your data, faster.
