
Sentiment Analysis with PyTorch and Dremio

Introduction

Oracle Database is a multi-model relational database management system widely used for data warehousing and online transaction processing. MongoDB is a cross-platform, document-oriented database. It is common to need to analyze data stored in different sources, for example, Oracle and MongoDB. In this tutorial we will show how Dremio allows you to connect to both Oracle and MongoDB data sources, fetch and prepare data, and create a sentiment analysis model based on the IMDB dataset using PyTorch in Python.

Assumptions

This tutorial was developed on Ubuntu 16.04. Please make sure that you have installed and configured all of the requirements:

Loading data into MongoDB

For this tutorial we will use the IMDB Review Dataset. This dataset can be used to predict the emotional sentiment of a film review: it consists of reviews labeled as either “positive” or “negative”. We will load the first part of the dataset into MongoDB and the second part into Oracle. Then we will easily join them using Dremio.

First, you need to download, install, and configure MongoDB on your local machine. For these steps, we suggest you follow these instructions.

After launching MongoDB, use the following command to load the data in CSV format into the database:

mongoimport -d mydb -c collectionName --type csv --file imdb1.csv --headerline

After that, your data will be loaded into the mydb database. To make sure that the data was loaded correctly, let’s select the first two rows of the dataset. By running the following command, you will see the result shown in the image below:

db.collectionName.aggregate({ $limit : 2 });

image alt text

Loading data into Oracle

First, you need to install Oracle Database 11g R2 and the SQL Developer GUI. We suggest you follow these instructions. During installation, you should create your user. Launch SQL Developer with the command below.

sudo /opt/sqldeveloper/sqldeveloper.sh

Then, create a new connection using your username and password, and import the data from a CSV file into a table.

image alt text

Follow the simple import wizard steps and the table will be created. For this tutorial, we named the Oracle table IMDB2_DATASET.

image alt text

To make sure that the data is loaded correctly, you can open the Data tab for your table. You should see your data like this:

image alt text

Connecting to data sources in Dremio

Launch Dremio locally and go to http://localhost:9047. After logging in, you need to add new data sources. This can be done by clicking on the “+” button in the Sources menu.

image alt text

Choose Oracle from the list of available sources, then enter a name for the source and the credentials for your Oracle database.

image alt text

With the release of Dremio 3.0, we can add wiki content and tags to our data, which makes searching for data much easier. To do this, select your data and click the Catalog tab.

image alt text

We will add some tags, information about the columns, and a description of the dataset in general.

![image alt text](/img/tutorials/pytorch_sentiment/image_7.png)

Now, we will connect to MongoDB and load the second part of our data. Select the MongoDB source and fill in the necessary information as shown in the image below.

image alt text

After connecting, we can access our data and add some wiki content.

Data curation in Dremio

Let’s prepare our data in Oracle first; then we will join the two datasets and feed the final dataset to the model. For now, we need to delete columns that are unnecessary for our future model and clean the data.

Let’s drop the Unnamed, type, and file columns. Also, we will keep only rows that have a label. To do that, click on the column name and select “Keep Only…” in the dropdown list. Then, select all the necessary values of label and click “Apply”.

image alt text

image alt text

After applying changes, save the dataset.

image alt text

The same steps (deleting unnecessary columns and keeping only valid values in the label column) need to be performed on the MongoDB dataset. We will save the result as mongo_imdb.

image alt text

Now, we have our two Dremio datasets in the same virtual directory. Let’s join them. To do this, choose one of the datasets and click “Join”. Then, select the other dataset that you want to join.

image alt text

Choose the ID column for joining in both datasets.

image alt text

Apply all the changes and save our dataset as imdb_full in our directory.

image alt text

image alt text

However, in order to keep your datasets separate, we suggest creating a new space. We will name it Imdb and move the imdb_full dataset from the home space to the new one. To create a space, click the “+” button next to the Spaces menu.

image alt text

Then, enter the name and click Save.

![image alt text](/img/tutorials/pytorch_sentiment/image_18.png)

Dremio allows easy data transfer between spaces. Go to your space, click on the dataset, select “Move…”, and then select the new space.

![image alt text](/img/tutorials/pytorch_sentiment/image_19.png)

Analyzing Dremio data with PyTorch

Now, we will build a machine learning model to detect sentiment using PyTorch and TorchText step by step. First, let’s import all the necessary libraries and set up our TorchText Fields: TEXT and LABEL.

import torch
from torchtext import data
import pandas as pd
import random
import io
import nltk
import pyodbc
from nltk.corpus import stopwords

nltk.download('stopwords')

# Fix the random seeds for reproducibility
SEED = 1234
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# TEXT handles the review text, LABEL the sentiment class
TEXT = data.Field(tokenize='spacy', stop_words=stopwords.words('english'))
LABEL = data.LabelField(dtype=torch.float)

Then, connect to Dremio through ODBC using your username and password. Set the driver variable to the path of the Dremio ODBC driver on your machine.

host = 'localhost'
port = 31010
uid = 'username'
pwd = 'password'
driver = '/opt/dremio-odbc/lib64/libdrillodbc_sb64.so'
cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, host, port, uid, pwd), autocommit=True)

Now, we will load our data from Dremio into a pandas DataFrame and display the first 5 rows. The result is shown in the image below.

sql = 'SELECT * FROM Imdb.imdb_full'
df = pd.read_sql(sql, cnxn)
df.head()

image alt text
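
As an optional sanity check before modeling, you can look at the class balance of the joined data (this assumes the label column kept the name label after the join):

print(df['label'].value_counts())  # counts of 'pos' and 'neg' reviews in the joined dataset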

Then, we need to create TorchText datasets from our data. There is no direct conversion from a pandas DataFrame to a TorchText dataset, but we can save the DataFrame to a CSV file and then load that file into a TorchText dataset. We will save the data in the temp.csv file.

# Drop the join key and strip punctuation from the reviews
df = df.drop(['ID0'], axis=1)
df["review"] = df['review'].str.replace(r'[^\w\s]', '')
df.to_csv('temp.csv', index=False)

# Map the CSV columns to the TorchText Fields and load the file
fields = [('text', TEXT), ('label', LABEL)]
train_data = data.TabularDataset.splits(
                                        path = '',
                                        train = 'temp.csv',
                                        format = 'csv',
                                        fields = fields,
                                        skip_header = True
)
train_data = train_data[0]

Now that we have loaded our data, we will split it into training, validation, and test datasets.

train_data, test_data = train_data.split(split_ratio=0.7, random_state=random.seed(5))
train_data, valid_data = train_data.split(split_ratio=0.7, random_state=random.seed(2))

What does our training data look like now? We will print the first example to get a feel for the data.

print("Example of training data ")
print(vars(train_data.examples[0]))

image alt text

Let’s see how the data is split across the three datasets.

print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

image alt text

Next, we have to build a vocabulary, a look-up table where every unique word in the dataset has a corresponding index. We do this because our machine learning model cannot operate on strings, only numbers. The following code builds the vocabulary, keeping only the most common max_size tokens. We also map the ‘pos’ value of our label to 1 and the ‘neg’ value to 0.

TEXT.build_vocab(train_data, max_size=10000)
LABEL.build_vocab(train_data)
LABEL.vocab.stoi['pos'] = 1
LABEL.vocab.stoi['neg'] = 0
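
As an optional check, you can inspect the vocabulary that was just built; the exact counts and tokens will depend on your data:

print(f'Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}')
print(f'Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}')
print(TEXT.vocab.freqs.most_common(10))  # ten most frequent words in the training data
print(LABEL.vocab.stoi)                  # label-to-index mapping, e.g. {'pos': 1, 'neg': 0}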

The final step of preparing the data is creating the iterators, which we will loop over during training and evaluation. We will also train our model on a GPU if one is available.

BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    sort_within_batch=False,
    device=device)
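
To verify that the iterators work as expected, you can pull a single batch and check its shape; TorchText batches are sequence-first, so the text tensor is [sentence length, batch size]:

batch = next(iter(train_iterator))
print(batch.text.shape)   # e.g. torch.Size([sent_len, 64]) - token indices, sequence-first
print(batch.label.shape)  # torch.Size([64]) - one label per example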

We are ready to build the model, train it, and evaluate it on our data. We’ll use a recurrent neural network (RNN), as RNNs are commonly used for analyzing sequences.

import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        #text = [sent len, batch size]
        embedded = self.embedding(text)
        output, hidden = self.rnn(embedded)
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        return self.fc(hidden.squeeze(0))

INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 512
OUTPUT_DIM = 1
model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
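
Optionally, you can count the trainable parameters of the model; with these dimensions most of them live in the embedding layer (count_parameters is just a small helper defined here):

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')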

Now, we’ll create an optimizer. This is the algorithm we use to update the parameters of the model. Here, we’ll use stochastic gradient descent (SGD) with a learning rate of 0.001.

import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=1e-3)

Next, we’ll define our loss function. In PyTorch, this is commonly called a criterion. Also, we need our own function to calculate binary accuracy.

criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()  # convert to float for division
    acc = correct.sum() / len(correct)
    return acc
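
For example, binary_accuracy treats the raw model outputs as logits: it applies a sigmoid, rounds the result to 0 or 1, and compares it to the labels (the tensors below are made-up values, used only to illustrate the behavior):

preds = torch.tensor([2.0, -1.5, 0.3, -0.2])   # hypothetical raw logits from the model
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])    # corresponding true labels
print(binary_accuracy(preds, labels))          # tensor(0.7500) - three of the four predictions match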

Finally, the training and evaluation functions for our model are defined as follows:

def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()
    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.text).squeeze(1)
        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
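
The evaluate function used in the next step mirrors train, but switches the model to evaluation mode and disables gradient tracking since we do not update the parameters; a minimal sketch:

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            predictions = model(batch.text).squeeze(1)
            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)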

Then, we train the model through multiple epochs to get better results.

N_EPOCHS = 4

for epoch in range(N_EPOCHS):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')

image alt text

As you can see, we trained our model for 4 epochs and got about 67% training accuracy. Let’s now evaluate our model on the test set.

test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')

image alt text

An accuracy of 66.34% on the test set with a model that has only a single RNN layer and is trained for just 4 epochs is pretty good. Of course, we can improve this result by adding more layers to our neural network, tuning the hyperparameters, and training for more epochs.
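
For example, one common improvement (not covered in this tutorial, shown here only as a sketch) is to replace the single RNN layer with a multi-layer bidirectional LSTM with dropout, keeping the rest of the pipeline unchanged:

class ImprovedRNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim,
                 n_layers=2, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            bidirectional=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # *2: forward + backward directions
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):
        # text = [sent len, batch size]
        embedded = self.dropout(self.embedding(text))
        output, (hidden, cell) = self.lstm(embedded)
        # concatenate the final forward and backward hidden states
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        return self.fc(hidden)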

Conclusion

In this tutorial we showed how Dremio allows us to connect to and work with data from two different data sources, Oracle and MongoDB. We joined the data from these sources in Dremio, which greatly simplified the subsequent work in Python. We also showed a step-by-step implementation of an RNN-based neural network in PyTorch, trained on data from Dremio, to detect the sentiment of a film review.