
Building A Recommender With Scikit-Learn And Dremio Virtual Datasets

Introduction

Machine Learning is a hot topic these days. It gained popularity with self-driving cars, smart search completions from Google, contextual advertising, and many other applications of this type of Artificial Intelligence. However, in Machine Learning failure is extremely probable (many Machine Learning applications in development never make it to production), and success depends hugely on the data. Stale data (and data goes stale quickly) is of little use and does nothing but prevent a favorable outcome. Hence, it is vital for Machine Learning to deliver data from the store where it resides to the brains of the application while it is still fresh and hot (like you would want your favorite pizza).

One problem here is that data can come from many places and in many shapes, whereas Machine Learning algorithms require a uniform, SQL-like tabular form of data. But what if our data is distributed across several stores with different technologies, like PostgreSQL and MongoDB (or some other combination of relational and non-relational data models)? In that case, heavy crunching of the data would be required even before doing any ML-related work. Not only does this waste precious time, it also wastes expensive resources. Taking into account what a dicey business Machine Learning is, it may be unduly frustrating to spend hours and hours of work only to discover that the concept (or the set of tools) does not work and has to be replaced with something else, possibly throwing away all of the previous labor.

And thus our topic of today: how Dremio can facilitate the workflow of preparing data for Machine Learning applications. Why is Dremio better than doing the work manually? It prevents you from losing that work if you decide to switch tools, because Dremio is technology agnostic. It saves you a lot of time by giving you a clean and concise way to transform and prepare your data. And it consumes minimal resources, and only when they are needed.

Dremio achieves this with a very powerful concept called the Data Fabric. It combines two critical capabilities - abstraction and a dynamic interface: it abstracts the actual source of data away from its consumer and provides a dynamic interface through which that data is consumed.

In its essence, Dremio is a ready-to-use data platform that enables instantaneous delivery of data to its consumers (e.g. an analytics engine), with integrations for industry-standard tools like R, Python, BI solutions, and SQL tools.

It is quite common to build ETL-like pipelines in order to ingest inconsistent data from various data sources. However, the ETL paradigm is pretty hefty, since data is moved around a lot. Here is what happens:

Traditional ETL process in Dremio

Note that even if the algorithm runs in parallel, this method imposes several bottlenecks, the biggest being the last step, when the full volume of data is copied from its source to a new place, possibly taking up tons of storage space.

Dremio manages to provide the same results with much less overhead. The main point here is that data is never copied, and is only moved when it is requested by the consumer.

Data Curation process in Dremio

First off, Dremio connects to data in any data source, but without actually copying it to a new place - these connections are links, or pointers, to the data. Then, after you have chosen a data source, you can make changes to it and transform it in many ways. These transformations are not applied to the actual data where it resides; the changes you observe exist only in memory and are saved in the form of SQL that is applied on demand. Finally, you save all the changes and the bindings of different data sources in the form of virtual datasets, which on the surface behave like ordinary tables, but whose contents are only computed when they are queried.

Installation

There is an entire chapter on installation in the Dremio Docs. There are many ways to install Dremio, and all of them are described in Deployment Models. Even though Dremio is designed to be distributed, we can still install it in standalone mode on a single node to understand how it works. We will deploy on an Ubuntu Linux machine from a tarball; if you have a different machine and want to install it there, refer to the docs linked above.

First, we download the Dremio Community edition from here (we chose the TAR download; installation instructions are here).

Downloading Dremio via wget

Then, we set up our environment for Dremio:

# Prepare directory structure for Dremio. We ran the next commands:

sudo mkdir /opt/dremio

sudo mkdir /var/run/dremio

sudo mkdir /var/log/dremio

sudo mkdir /var/lib/dremio

sudo mkdir /data

sudo mkdir /data/dremio

# Create user and the group for dremio and delegate the folders to them:

sudo groupadd -r dremio

sudo useradd -r -g dremio -d /var/lib/dremio -s /sbin/nologin dremio

sudo chown dremio:dremio /var/run/dremio

sudo chown dremio:dremio /var/log/dremio

sudo chown dremio:dremio /var/lib/dremio

sudo chown dremio:dremio /data/dremio

# Unpack dremio tar into the installation folder:

sudo tar -xvf dremio-community-1.1.0-201708121825170680-436784e.tar.gz -C /opt/dremio/ --strip-components=1

# Create symlink to config files in dremio directory:

sudo ln -s /opt/dremio/conf /etc/dremio

With that, the installation is complete, but before using Dremio we need to tweak a few configurations.

We will set the log and PID directories for Dremio to the directories we just created, in the dremio-env file located under /etc/dremio.

Installing Dremio

We will also set the data path in dremio.conf, located in the same directory.

Installing Dremio

Next, we are going to start Dremio as a service. Ubuntu uses systemd, so we will use the dremio.service unit, which ships in the /opt/dremio/share folder. Before running it, we will slightly amend it - just change PIDFile from /opt/dremio/run/dremio.pid to /var/run/dremio/dremio.pid.

Installing Dremio

Note: it is important to point these configuration entries to the folders that are owned by dremio, otherwise the service will fail to start.

# Start the service, but first reload systemd:

sudo systemctl daemon-reload

# then

sudo systemctl start dremio

# If you want dremio to start on system boot:

sudo systemctl enable dremio

To confirm a successful installation, navigate to “localhost:9047”, where you should be prompted to create an admin user:

Create admin user in Dremio

High Level Overview Of Dremio

Before jumping in, let’s look at some of the key Dremio concepts mentioned above. We already touched on them when we compared Dremio to the traditional ETL process.

The purpose of Dremio is to give you the ability to discover, transform, and handily ingest data without introducing tight coupling between the application and the data.

Dremio gives concise names to its abstractions, thus providing a consistent interface. In Dremio, everything is built around datasets, which come in two types - physical and virtual. Physical datasets reside in data sources and represent the original source data; they can be found under each individual source. Data sources are represented with cylinder icons, physical datasets with a purple table icon, and virtual datasets with a green icon. Users organize their virtual datasets in a collaborative area based on a directory paradigm called Spaces, and they can create directories and sub-directories within spaces.

Probably the most important part of Dremio’s interface is the transformation tools available within datasets. Transformations support many types of operations - fixing incomplete data, NULL handling, type casting, whitespace trimming, case normalization, etc. All of the transformations are performed and saved in virtual datasets without making copies of the source data.

To read more on Dremio concepts navigate to the relevant docs.

Working With Data In Dremio

As we mentioned before, today we are looking at how Dremio helps with Machine Learning applications. Data is the most crucial element of success here, hence the importance of a fast and reliable connection to it.

For this tutorial we will build a recommender system based on the MovieLens dataset. The key idea is that if we take a group of users’ ratings of movies, we can use that data to predict how similar users will rate other movies. This is the underlying process behind movie recommendations on Netflix, as well as product recommendations on Amazon and other ecommerce sites.

We have most of our data (movie ratings, tags, links) in a PostgreSQL database, but one table is missing there - it lives instead as a collection (movie titles) in MongoDB. With Dremio this is not an issue: we will simply create a virtual dataset and benefit from combining the sources.

First, we will add our data sources. We begin at the default screen, where it says No Data:

Default screen in Dremio

What we have in our database is three tables:

Three physical datasets in Postgres

To add this database we click on the New Sources button and choose our type of source - PostgreSQL:

Add new Postgres data source in Dremio

We will provide the connection details in this window:

Setting up the connection to Postgres

After that the source will appear on the list:

Postgres now available to Dremio

The public schema of this database also becomes available in the main window of Dremio:

Viewing Postgres schemas in Dremio

Next, we want to add a connection to MongoDB. Here is what we have in our Mongo collection:

Connecting Dremio to MongoDB

We add it the same way we did with PostgreSQL, this time choosing MongoDB as our Source Type:

Selecting MongoDB as the data source type

Then, the same way we did before, we fill out the connection info:

Configuring the connection to MongoDB

And then, the data source is readily available:

Viewing the MongoDB source in Dremio

As well as the collection itself:

Viewing MongoDB collections in Dremio

In just a few clicks we were able to add our sources, and we now have access to all the data via SQL. Next, we want to adjust our data a bit to fit our needs better.

We will begin by opening one of the physical datasets - the ratings table from PostgreSQL. For that, we just choose the source, the public schema, and then the table (dataset) itself.

First off, we want to get rid of the time_stamps column, as we will not need it.

Removing timestamps field

Next, we want to join the Mongo data to this dataset to include movie titles. To do so, we hit the Join button at the top of the dataset, choose another dataset (from any source), and click Next.

Joining Postgres and MongoDB in Dremio

In the Join settings, we choose an inner join (as we want only the records that have both a title and ratings) and then choose the fields on which to join, i.e., the fields whose values match when two rows describe the same record.

Creating inner join between Postgres and MongoDB

Finally, after the join is complete and we have all the data, we delete one of the duplicate movie_id columns (since both PostgreSQL and MongoDB had this field), delete Mongo’s id, and we are ready. The data should look like this:

Viewing the final virtual dataset

We save the dataset by clicking Save in the upper right corner and adding it to our new space (created by clicking New Space on the main dashboard):

Saving the virtual dataset in our Space

Now, we are all done and the data is ready to be consumed by any tool that supports SQL.

Querying Data From Dremio In Python

Next, we want to show how this data can be accessed from Python, and we will build a basic recommender on top of it.

First, we need to install the Dremio ODBC driver. To do so, download the driver for your system from the Dremio Downloads Page. We use a Linux installation, so we download the Linux package and install it with:

sudo yum install dremio-odbc-1.3.11.1034-1.x86_64.rpm

Then, we install the pyodbc package for Python:

pip install pyodbc

And from here we can access our data from Python. To connect to Dremio we use pyodbc and read the results into a pandas dataframe:

import pyodbc
import pandas

host = "localhost"
port = 31010
uid = "uid"
pwd = "pwd"

driver = "Dremio ODBC Driver 64-bit"

conn_str="Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, host, port, uid, pwd)

cnxn = pyodbc.connect(conn_str, autocommit=True)

query = "SELECT * FROM Movies.ratings_with_titles LIMIT 100000"

df = pandas.read_sql(query, cnxn)

We limit ourselves to 100,000 records to keep the dataset small and the example quick to run.

We can check the top records with the .head(n) method:

Checking the first few records
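Because a virtual dataset is just SQL to Dremio, we are not limited to pulling raw rows into pandas - filters and aggregations can be pushed down into the query so that Dremio does the heavy lifting. Here is a small sketch of the idea, assuming the same cnxn connection and dataset name as above (the column names follow our example and may need adjusting for your schema):

# Push a simple aggregation down to Dremio instead of computing it in pandas:

agg_query = """
SELECT "movieId", AVG("rating") AS avg_rating, COUNT(*) AS num_ratings
FROM Movies.ratings_with_titles
GROUP BY "movieId"
ORDER BY num_ratings DESC
LIMIT 10
"""

top_rated = pandas.read_sql(agg_query, cnxn)
print(top_rated)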

Now that we have our data in Python, we can use it to build a simple recommender. We will start with the Surprise library - a scikit for building recommender systems - to keep the amount of work to a minimum; later we will also try plain scikit-learn.

Note: install surprise with pip install surprise

# Import everything we are going to need:

from surprise import Dataset
from surprise import Reader
from surprise import SVD
from surprise import evaluate

# Transform the dataframe into a Surprise Dataset:

reader = Reader(rating_scale=(1, 5))

# Surprise only needs three fields for the model: user id, item id, and rating

data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# Next, we divide the data into 5 folds - 4 for training and 1 for testing:

data.split(n_folds=5)

# Create our model

alg = SVD()

# Train and evaluate our model

evaluate(alg, data, measures=['RMSE', 'MAE'])

The evaluate function in Surprise trains the model and applies the testing measures on each of the folds we defined. We can specify which measures to use; we will take Root Mean Squared Error and Mean Absolute Error here.

This should produce output similar to this:

Viewing output of our model

Then, finally, we can go ahead and predict some real values. But before that we need to build the full training set and fit the model on it:

# Fit the entire training set into the model:

trainset = data.build_full_trainset()

alg.train(trainset)

# Do a prediction on sample data.
# We pass the raw ids exactly as they appear in the dataframe:

userid = 7

movieid = 318  # The Shawshank Redemption

actual_rating = 5

# actual prediction

print(alg.predict(userid, movieid, actual_rating))

And the results will be reported as:

Viewing results of our model
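A single point prediction is nice, but a recommender ultimately has to produce a ranked list. Here is a minimal sketch of that step, assuming the df dataframe and the trained alg model from above (user 7 is just an example id from our data):

# Recommend the top 10 movies for one user: predict a rating for every
# movie the user has not rated yet, then sort by the estimated rating.

user_id = 7
seen = set(df[df['userId'] == user_id]['movieId'])
candidates = [m for m in df['movieId'].unique() if m not in seen]

scored = [(m, alg.predict(user_id, m).est) for m in candidates]
top_10 = sorted(scored, key=lambda pair: pair[1], reverse=True)[:10]

for movie_id, estimated_rating in top_10:
    print(movie_id, round(estimated_rating, 2))

In a real application you would also map these movie ids back to their titles, which are already present in our joined virtual dataset.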

Alternatively, we could use scikit-learn and SciPy to build a prediction model from the user-movie matrix, like this:

# First, count users and movies:

n_users = df.userId.unique().shape[0]

n_items = df.movieId.unique().shape[0]

# MovieLens ids are not necessarily contiguous, so map each raw id
# to a matrix row/column index:

user_index = {uid: i for i, uid in enumerate(df.userId.unique())}

item_index = {mid: i for i, mid in enumerate(df.movieId.unique())}

# Then, build matrices for the training and testing datasets:

from sklearn.model_selection import train_test_split

import numpy as np

train_data, test_data = train_test_split(df, test_size=0.25)

train_data_matrix = np.zeros((n_users, n_items))
for row in train_data.itertuples():
    train_data_matrix[user_index[row.userId], item_index[row.movieId]] = row.rating

test_data_matrix = np.zeros((n_users, n_items))
for row in test_data.itertuples():
    test_data_matrix[user_index[row.userId], item_index[row.movieId]] = row.rating

# Create our own RMSE measurement function:

from sklearn.metrics import mean_squared_error

from math import sqrt

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

# Generate predictions as the dot product of the u, s, vt matrices from svds
# and measure these predictions with RMSE:

from scipy.sparse.linalg import svds

u, s, vt = svds(train_data_matrix, k=20)
s_diag_matrix = np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('User-based CF RMSE: ' + str(rmse(X_pred, test_data_matrix)))

This will produce the accuracy score:

Viewing accuracy score of our model
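The X_pred matrix now holds an estimated rating for every user-movie pair in our sample, so a single prediction can be read straight out of it. A small sketch, assuming the user_index and item_index maps built above and that user 7 and movie 318 both appear in the sample:

# Read one predicted rating out of the reconstructed matrix,
# using the id-to-index maps built when the matrices were created.

user_id, movie_id = 7, 318
predicted = X_pred[user_index[user_id], item_index[movie_id]]
print('Predicted rating for user %d, movie %d: %.2f' % (user_id, movie_id, predicted))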

As you have seen, we changed the method we applied to our data without having to change anything about the data plumbing itself. We could even have switched languages (say, to R) and still used the same interface.

Without Dremio, we would have to wire up two separate drivers, one for each type of database, in our application (a lot of dependencies), join the data together with a custom script, and, if we wanted to reuse the result, dump it somewhere - either in a relational database or in CSV files - taking up valuable space. In our example we only had 100K records, but in the real world it would be millions or even billions, and such extravagance would not be feasible.

Conclusion

As you have seen, Dremio’s abstractions greatly facilitate connecting to and working with data. Thanks to the ability to connect to multiple sources of different types, it is possible to bring together data that previously would not have been used in combination because of its size and the expensive crunching involved.

Dremio’s virtual datasets give you the ability to transform data to your needs and combine it in a form native to SQL (with joins). Thanks to spaces and folders, datasets can be nested together logically and namespaced according to their purpose, avoiding name conflicts.

The simplicity of connecting through the ODBC driver enables fast consumption of data from Dremio in any client application. More importantly, the data is abstracted away from the client, which has no idea whether it came from PostgreSQL or MongoDB, adding an extra layer of security. Dremio supports many sources, including Hadoop, relational databases, S3, and more; for the full list see the Data Sources chapter in the docs.

In our example, we used simple data and simple algorithms that did not take any risks. In the real world, Machine Learning applications are resource-intensive and risky. Oftentimes they require their own specially prepared data, which means spending on curating and storing it. Accordingly, if a Machine Learning experiment proves impractical, all the resources put into data preparation are wasted. With Dremio, instead of physically transforming data and storing it, you build a virtual dataset and access the data dynamically, saving resources along the way - especially when Dremio is deployed in a cluster (standalone mode is designed for experimenting rather than performance).

Make sure to check out our other tutorials here. And remember, you can always find help in the docs or in our lively and vibrant community, which is always willing to support you.