Gensim Topic Modeling with Python, Dremio and S3

Dremio

Intro

Topic modeling is one of the most common tasks in natural language processing (NLP) and a clear example of unsupervised learning. The goal is to train a machine learning model on a corpus of texts that has no predefined labels. In other words, we don’t know in advance which topics are present in the corpus. The model analyzes all the texts with a chosen algorithm and splits the corpus into topics. Then, given a new text, the model should be able to predict its topic. It is worth stressing that we don’t even have a list of possible topics: the model detects them itself, without any guidance.

Topic modeling is useful when we don’t have labeled texts to train a classification model. Even though, in theory, humans can label any text, this process can be very expensive. It may be cheaper to do topic modeling than to label the whole corpus and then build a supervised classifier.

There are many tools that a data scientist can use for NLP. One of the best known and most powerful is the Gensim Python library. In this tutorial, we will use it to perform topic modeling on the 20 Newsgroups dataset. Though it is a labeled collection of texts, we will not use the labels in any way except to judge the quality of the trained topic model.

Before performing topic modeling in Python, we will show how to work with Amazon S3 and Dremio to build a data pipeline. We will upload the dataset to Amazon S3, connect it to Dremio, perform some basic data curation in Dremio, and then perform the final analysis in Python.

Assumptions

In this tutorial, we assume that you have the following items already installed and set up:

  1. Dremio
  2. Amazon account
  3. Gensim
  4. PyODBC
  5. Dremio ODBC Driver
  6. Jupyter Notebook environment
  7. Ubuntu OS

Loading data into Amazon S3

We will start by loading data into Amazon S3 storage. In your Amazon account, go to All services, then find and click S3 under the Storage section:

image alt text

All data in S3 is stored in buckets, so we need to create a bucket before we upload the dataset. To create a bucket, click the corresponding button, as in the image below:

image alt text

The next step is bucket configuration. The required parameters are the Bucket name and the Region. We called our bucket 20newsgroups, but you are free to use any name (as long as it follows the bucket naming rules). After you specify the name and the region, click the Create button. Alternatively, click the Next button to tune the options and permissions for the bucket. We don’t want to change the default values, so we just click the Create button:

image alt text

This is enough to create a new bucket which will be displayed in the list of buckets available in the current account:

image alt text

Now click the bucket name to open the bucket. We need to upload the dataset file into it. To do this, click the Upload button:

image alt text

In the next step, we specify the file to upload (see the image below).

image alt text

Our file is called dataframe.csv. If you download the dataset from its source, you will notice that it is not a CSV file. Instead, there are 20 folders, one per category, and each folder contains many plain text files with the posts of that category.

Dremio works with tabular data, so to be able to process the dataset in Dremio, we decided to generate a CSV file in which each row represents a document. Ideally, the text of the post would go in one column and its category (class) in another. However, it is not recommended to store long sequences of text in a single cell, so we needed a way to split the text into relatively small chunks.

We applied the following algorithm: for each document, we generate a random string of 10 characters, which serves as the id of that document. Then we iterate over the text of the document and write consecutive sequences of 1600 characters into a pandas dataframe. A single document is thus split into several pieces; each piece has its own row in the dataframe, but all pieces share the same identifier.

We will use the ids to join the chunks back together later. Note that, to keep the example simple, we limit the number of categories to five (their names are listed in the desired_topics variable). Here is the code for the described algorithm (the folders variable holds a list of paths to each of the 20 folders):

import random
import string
from os import listdir
from os.path import isfile, join

import pandas as pd

def randomString(stringLength=10):
    """Generate a random string of fixed length """
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))

desired_topics = ['talk.politics.guns', 'soc.religion.christian', 'sci.med',
                  'comp.windows.x', 'rec.sport.hockey']
seq_len = 1600
rows = []  # one [file_id, text, label] entry per chunk
for folder in folders:
    name = folder[folder.rfind('/')+1:]
    if name in desired_topics:
        files = [f for f in listdir(folder) if isfile(join(folder, f))]
        for file in files:
            # read the whole post; the file is closed when the block ends
            with open(join(folder, file), "r", encoding="latin-1") as file1:
                text = file1.read()
            random_id = randomString()
            # slicing past the end of a string is safe, so the last chunk is simply shorter
            for i in range(0, len(text), seq_len):
                rows.append([random_id, text[i:i+seq_len], name])
df = pd.DataFrame(rows, columns=['file_id', 'text', 'label'])
# the index is written as an extra unnamed first column
df.to_csv('dataframe.csv')
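
As a quick sanity check of the chunking step, the sketch below splits a string into fixed-size pieces and confirms that the pieces rejoin to the original. The helper name split_into_chunks and the sample text are assumptions made for illustration; they are not part of the original script.

```python
def split_into_chunks(text, seq_len=1600):
    """Split text into consecutive pieces of at most seq_len characters."""
    # Slicing past the end of a string is safe in Python, so the last
    # (shorter) chunk needs no special handling.
    return [text[i:i + seq_len] for i in range(0, len(text), seq_len)]

sample_text = "x" * 4000  # stand-in for the body of one news post
chunks = split_into_chunks(sample_text)

print(len(chunks))               # number of rows this document would occupy
print([len(c) for c in chunks])  # sizes of the individual chunks
```

Joining the chunks with "".join(chunks) gives back the original text, which is what the merge step later in the tutorial relies on.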

Look at the image below to better understand the format of the dataframe generated by the previous script:

image alt text

Now let’s get back to the AWS console. After we click the Upload button, the upload should start:

image alt text

When the upload is finished, you will be able to see the file inside the bucket:

image alt text

The dataset is loaded into AWS S3.

Connecting Dremio to Amazon S3

Now we need to establish a connection between Amazon S3 and Dremio. Go to the Dremio UI. In the Sources section, click on the corresponding button to add a new source. Then, select Amazon S3 from the list of available sources:

image alt text

To connect Dremio to Amazon S3 you have to specify several parameters:

image alt text

The first parameter is the Name. This is the name of the source which you will use inside Dremio to refer to this instance. We called it topic_modeling_s3. Dremio also needs authentication credentials to connect to the source. To get them, click on IAM in the Security, Identity, & Compliance section in the list of services in AWS Console:

image alt text

Then click on the Manage Security Credentials button in the Security Status section:

image alt text

On the next page, click on the Create New Access Key button. You can find it in the Access keys section (toggled menu). After this, the new access key (as well as AWS Access Secret) will be generated (see the image below). Copy them into the corresponding fields in Dremio UI.

image alt text

After you click the Save button, the new datasource will be created. Select the dataset and click on the single file displayed there (dataframe.csv). In the next window, ensure that the line delimiter is specified properly as shown below. The Extract Field Names option should also be selected:

image alt text

Click the Save button.

Data curation in Dremio

First, save the dataset in a separate space. We have created the topic_modeling_space earlier. To save the dataset inside the space, use the Save As… button.

Now go to the dataset inside the topic_modeling_space. You may notice that there is a redundant column “A”. We want to remove it. To do this, click on the Drop button:

image alt text

Let’s say we need to rename columns. To do this, click a column’s current name and type in another name. Alternatively, use the Rename… button from the drop-down menu.

image alt text

What if we want to see all the unique labels that we have? One possible solution is to group the dataset by the column “news_category”. Along the way, let’s count the rows for each category. To do this, use the Group By button, then specify the dimension and measure for grouping, as well as the action you want to perform (count). Then click Apply:

image alt text

The result should look like this:

image alt text

You may notice the warning “Result based on sample dataset”. What does it mean? If you compare the results of the grouping with the same grouping performed elsewhere, you will see that the results in Dremio are slightly different. In our case, the number of rows in the category “talk.politics.guns” differs. This is because Dremio previews only 10000 rows of the dataset. However, all the actions are applied to the entire dataset.

To sort values in descending order, click the Sort Descending button in the drop-down menu:

image alt text
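
To cross-check the grouped counts outside Dremio, where the preview is based on a sample, the same aggregation can be sketched in plain Python. The rows below are made-up stand-ins for dataframe.csv; only the column names file_id, text, and label come from the script that generated the file.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for dataframe.csv (contents are hypothetical).
sample_csv = io.StringIO(
    "file_id,text,label\n"
    "aaaaaaaaaa,first chunk,sci.med\n"
    "aaaaaaaaaa,second chunk,sci.med\n"
    "bbbbbbbbbb,only chunk,rec.sport.hockey\n"
)

# Count rows per category, the equivalent of Group By with a count measure.
label_counts = Counter(row["label"] for row in csv.DictReader(sample_csv))

# most_common() returns the categories sorted by count, descending.
for label, count in label_counts.most_common():
    print(label, count)
```

With the real file, the StringIO object would be replaced by open('dataframe.csv', newline='', encoding='utf-8').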

Next, we demonstrate how to create a calculated field in Dremio. Click the Calculated Field… button in the drop-down menu of the column that you want to use as the basis for the calculation. In our case, we want to calculate the average number of rows into which a single document is split, for each category.

image alt text

We know that there are 1000 posts in each category. So, by dividing the row count of each category by 1000, we obtain the average number of chunks per document:

image alt text
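
The division by 1000 relies on knowing the number of documents per category in advance. As a hedged alternative, the average can be derived from the data itself by counting distinct file_id values per category. The rows below are invented for illustration.

```python
from collections import defaultdict

# (file_id, label) pairs, one per chunk row; the values are hypothetical.
rows = [
    ("aaaaaaaaaa", "sci.med"),
    ("aaaaaaaaaa", "sci.med"),
    ("aaaaaaaaaa", "sci.med"),
    ("bbbbbbbbbb", "sci.med"),
    ("cccccccccc", "rec.sport.hockey"),
    ("cccccccccc", "rec.sport.hockey"),
    ("cccccccccc", "rec.sport.hockey"),
]

row_counts = defaultdict(int)  # chunk rows per category
doc_ids = defaultdict(set)     # distinct documents per category
for file_id, label in rows:
    row_counts[label] += 1
    doc_ids[label].add(file_id)

# Average number of chunks per document in each category.
avg_chunks = {label: row_counts[label] / len(doc_ids[label]) for label in row_counts}
print(avg_chunks)
```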

There are many other useful features for data processing in Dremio. For example, we can exclude missing values from the dataset, convert data types, replace values, etc. Almost all of the features are available through the drop-down menu of the specific column or through the button from the top panel.

In addition, you can write SQL queries directly in Dremio to work with your data. To do this, just click on the SQL Editor button in the top panel.

image alt text

This completes the data curation we need in Dremio. Next, we will connect Dremio to Python and continue to work with the dataset there.

Dremio and Python interaction

To connect Dremio with Python, we will use the ODBC driver. You also need to have the pyodbc Python package installed. If everything is installed properly, connecting is straightforward.

First, import pyodbc and Pandas. Then specify the required connection parameters: the host name, port, Dremio username and password, and the path to the driver. Then pass all the parameters to pyodbc’s connect() function.

Now you should be able to execute SQL queries. The only query we want to run selects all records from the dataset. To execute it, we use the Pandas read_sql() function. The full connection code is below:

import pandas as pd
import pyodbc

# connection parameters for the Dremio instance
host = 'localhost'
port = 31010
username = 'your_dremio_username'
password = 'your_dremio_password'
# path to the Dremio ODBC driver library
driver = '/opt/dremio-odbc/lib64/libdrillodbc_sb64.so'

cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver, host, port, username, password), autocommit=True)

# load the curated dataset from the Dremio space into a dataframe
sql = "SELECT * FROM topic_modeling_space.raw_processed"
df = pd.read_sql(sql, cnxn)

Note that the SQL query accesses the dataset from the topic_modeling_space in Dremio. As a result, we get the following dataframe in Python:

image alt text
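
Hard-coded credentials in a notebook are easy to leak. One possible arrangement, sketched below, builds the same connection string in a helper and reads the username and password from environment variables. The function name build_connection_string and the variable names DREMIO_USER and DREMIO_PASSWORD are assumptions made for this example; the connection-string keys are the ones used in the connection code above.

```python
import os

def build_connection_string(driver, host, port, username, password):
    """Assemble the ODBC connection string in the format used in this tutorial."""
    return (
        "Driver={};ConnectionType=Direct;HOST={};PORT={};"
        "AuthenticationType=Plain;UID={};PWD={}"
    ).format(driver, host, port, username, password)

conn_str = build_connection_string(
    driver="/opt/dremio-odbc/lib64/libdrillodbc_sb64.so",
    host="localhost",
    port=31010,
    # Hypothetical environment variable names; placeholders are used if unset.
    username=os.environ.get("DREMIO_USER", "your_dremio_username"),
    password=os.environ.get("DREMIO_PASSWORD", "your_dremio_password"),
)
# The string would then be passed on as pyodbc.connect(conn_str, autocommit=True).
```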

Now everything is ready for the model creation.

Topic modeling in Python

To perform topic modeling in Python, we need to import packages used for natural language processing and data science in general. The code below shows the ones we will use.

import pandas as pd
import re
import string
import gensim
from gensim import corpora
from nltk.corpus import stopwords

  • Pandas is a package used to work with dataframes in Python.
  • Re is a module for working with regular expressions. We will use them to clean the text before building the machine learning model. The string module is also used for text preprocessing, together with regular expressions.
  • Gensim is the central library in this tutorial. The package is widely used not only for topic modeling but also for other NLP tasks.
  • NLTK is probably the best-known and most mature NLP library. If you don’t have it yet, you can install it with the pip install nltk command. In this tutorial, we will use it to remove stopwords from the news posts.

As you remember, we split documents into chunks of 1600 characters to process them more efficiently in Dremio. Now it’s time to merge them back. The code below demonstrates how to do this. We form a list of all unique ids in the corpus and create an empty Python list. We then iterate over the ids and, for each one, select the rows of the dataframe that belong to the document with that id. We iterate over these rows to build the string whole_text, which is then appended to the documents list. As a result, we have a list called documents, where each element is a complete news post.

ids = list(set(df['file_id']))
documents = []
for _id in ids:
    # select all text chunks that belong to the current document
    subset = df.loc[df['file_id'] == _id]['text']
    whole_text = ""
    for i in range(len(subset)):
        whole_text += subset.iloc[i]
    documents.append(whole_text)
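
The loop above rescans the whole dataframe once per document. A single-pass variant, sketched here on plain tuples, groups the chunks by id in a dictionary. Like the loop above, it assumes that the rows arrive in their original chunk order; the sample rows and the variable names are invented for illustration.

```python
from collections import defaultdict

# (file_id, text) pairs, one per chunk row; the values are hypothetical.
chunk_rows = [
    ("aaaaaaaaaa", "Hello, "),
    ("bbbbbbbbbb", "Another "),
    ("aaaaaaaaaa", "world"),
    ("bbbbbbbbbb", "document"),
]

pieces = defaultdict(list)
for file_id, text in chunk_rows:
    pieces[file_id].append(text)

# Join each document's chunks once instead of growing a string in a loop.
documents_by_id = {file_id: "".join(parts) for file_id, parts in pieces.items()}
sample_documents = list(documents_by_id.values())
```

With the dataframe from Dremio, the pairs could be produced by zip(df['file_id'], df['text']).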

To build a topic model, we first need to tokenize the texts. Put simply, tokenization is the process of splitting text into words (tokens). Gensim can do this for us. Here is the code:

tokenized_docs = []
for doc in documents:
    tokenized_docs.append(gensim.utils.simple_preprocess(doc, min_len=3))

Note that we filter out words shorter than 3 characters, because we consider such short words to carry almost no meaning for the model. You can experiment with this parameter if you want.
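
To make the effect of the length filter concrete, here is a rough approximation of the tokenization step in plain Python. It is not Gensim's implementation (which also handles Unicode and optional accent removal); the function name rough_tokenize and the sample sentence are assumptions for illustration.

```python
import re

def rough_tokenize(text, min_len=3, max_len=15):
    """Lowercase, keep alphabetic runs, and drop tokens outside the length range."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

tokens = rough_tokenize("The X server is up, but it crashed twice in 1993!")
print(tokens)  # ['the', 'server', 'but', 'crashed', 'twice']
```

Short words such as "is", "up", and "in" are dropped, and so is the number, because only alphabetic runs are kept in this sketch.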

Gensim requires a dictionary and a corpus to be created before model training. We create them with the code below:

dct = corpora.Dictionary(tokenized_docs)
corpus = [dct.doc2bow(doc) for doc in tokenized_docs]

The dictionary maps each word to an integer id, and the corpus represents every document as a bag of words: a list of (word id, count) pairs. This is needed because machine learning models work with numbers, not with raw text.
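
The idea behind the dictionary and the bag-of-words corpus can be sketched without Gensim. The snippet below only illustrates the data structures; the ids that Gensim assigns to words may differ, and the names token2id, to_bow, and the toy documents are assumptions made for this example.

```python
from collections import Counter

docs = [["gun", "control", "gun"], ["hockey", "team", "control"]]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(bow_corpus)
```

The word order inside a document is lost in this representation; only the counts remain.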

To create a model we use the following code:

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=5,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

The meaning of the parameters is described in the official Gensim documentation. Most of them can be tuned without significant influence on the model, but you can experiment with them to see the differences. The most important parameters are corpus (all our texts), id2word (the dictionary that maps ids to words), and num_topics (we want to find 5 topics because we know that there are exactly 5 news categories in the corpus).

After the model is trained, we can use it for prediction. But let’s see what topics it has detected:

image alt text

As you can see, there are 5 topics numbered from 0 to 4. For each topic, there is a list of the words most relevant to it. It is obvious that the model is not very good. There is a lot of garbage: articles and auxiliary words, as well as irrelevant tokens such as edu, com, que, cmu, etc. So, we need to improve the model.

The key to improving an NLP model is often text preprocessing rather than tuning the model parameters. The first thing we want to try is removing punctuation. Also, if we look at a random document, we can see many email addresses and newline characters, which should also be removed. One of the most promising steps is removing the so-called stopwords. Stopwords are words that appear in almost any text regardless of its subject, such as the, a, in, at, who, with, and, as, and so on.

Here is the code to remove emails, punctuation, newline characters, and stopwords from documents:

# remove email addresses
regexp = r'\S*@\S*\s?'
pattern = re.compile(regexp)
for i in range(len(documents)):
    documents[i] = pattern.sub('', documents[i])
# collapse newlines and other whitespace runs, then remove single quotes
documents = [re.sub(r'\s+', ' ', sent) for sent in documents]
documents = [re.sub("'", "", sent) for sent in documents]
# remove all punctuation
for i in range(len(documents)):
    documents[i] = documents[i].translate(str.maketrans('', '', string.punctuation))
# tokenize texts (with min_len=4)
tokenized_docs = []
for doc in documents:
    tokenized_docs.append(gensim.utils.simple_preprocess(doc, deacc=True, min_len=4))
# remove stopwords (requires the NLTK stopwords corpus: nltk.download('stopwords'))
stop_words = stopwords.words('english')
tokenized_docs_filtered = []
for i in range(len(tokenized_docs)):
    doc_out = []
    for word in tokenized_docs[i]:
        if word not in stop_words:
            doc_out.append(word)
    tokenized_docs_filtered.append(doc_out)
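
To see what the cleaning steps above do to a single string, the sketch below applies them to a short made-up header. The function name clean, the sample text, and the tiny stopword set are assumptions for illustration; the tutorial itself uses the NLTK stopword list. The stopwords are held in a set here because membership tests on a set are faster than on a list, which starts to matter on a large corpus.

```python
import re
import string

email_pattern = re.compile(r"\S*@\S*\s?")
punct_table = str.maketrans("", "", string.punctuation)

def clean(text):
    """Apply the tutorial's cleaning steps to a single string."""
    text = email_pattern.sub("", text)  # drop email addresses
    text = re.sub(r"\s+", " ", text)    # collapse newlines and runs of spaces
    text = text.translate(punct_table)  # strip punctuation, including quotes
    return text.strip()

# A tiny hand-picked stopword set, used only for this demonstration.
demo_stop_words = {"from", "the", "re"}

cleaned = clean("From: john@example.com\nSubject: Re: Don't panic!\n")
kept = [w for w in cleaned.lower().split() if w not in demo_stop_words]
print(cleaned)  # From Subject Re Dont panic
print(kept)     # ['subject', 'dont', 'panic']
```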
    

After running the code above, we have the tokenized_docs_filtered list of documents. We can use it to build a dictionary and train the model again:

dct = corpora.Dictionary(tokenized_docs_filtered)
corpus = [dct.doc2bow(doc) for doc in tokenized_docs_filtered]
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=5,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

The result of the model training:

image alt text

You can see that the results are better. For example, topic #4 is clearly soc.religion.christian, and topic #3 is the hockey topic. Topic #1 is probably about MS Windows. But we cannot say exactly what topics #0 and #2 are. Also, many topics contain unexpected words that are not relevant to them. So, we need to improve the model further.

One possible way to improve the model is lemmatization combined with filtering out the most frequent words in the corpus. Lemmatization is the process of replacing a word with its base form. For example, the word cats becomes cat, the words was and were become be, etc. At this step, we can also filter words by part of speech. For example, we may want to keep only nouns, adjectives, and adverbs in the corpus.

You may wonder why we need to filter out the most widespread words in the corpus when we have already removed stopwords. The reason is that the stopwords we removed are stopwords for the English language in general. Our corpus also has words that are very common in it specifically and are useless for detecting the topic. One example is the word from, which is present in almost every post. The names of the categories are also often included in the text of the documents. One more example is the word newsgroups. So, we need to remove such words to improve the quality of the model.

Find below the code for lemmatization and common words filtering.

# note: gensim.utils.lemmatize() needs the "pattern" package and exists only in Gensim versions before 4.0
# keep only nouns (NN), adjectives (JJ), and adverbs (RB)
lemmatized_docs = []
for i in range(len(tokenized_docs_filtered)):
    doc_out = []
    for word in tokenized_docs_filtered[i]:
        lemmatized_word = gensim.utils.lemmatize(word, allowed_tags=re.compile('(NN|JJ|RB)'))
        if lemmatized_word:
            # results look like b'word/NN'; keep only the lemma part
            doc_out.append(lemmatized_word[0].split(b'/')[0].decode('utf-8'))
    lemmatized_docs.append(doc_out)
dct = corpora.Dictionary(lemmatized_docs)
dct.filter_extremes(no_below=50, no_above=0.1)
corpus = [dct.doc2bow(doc) for doc in lemmatized_docs]
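
The document-frequency filter that filter_extremes() applies in the code above can be illustrated in plain Python. This is a simplified sketch of the idea, not Gensim's implementation (which, for instance, also caps the vocabulary size through its keep_n parameter); the function name and the toy documents are assumptions.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least no_below documents and in at most
    the fraction no_above of all documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    n_docs = len(docs)
    keep = {token for token, freq in doc_freq.items()
            if freq >= no_below and freq / n_docs <= no_above}
    return [[t for t in doc if t in keep] for doc in docs]

toy_docs = [
    ["from", "gun", "law"],
    ["from", "gun", "bill"],
    ["from", "hockey", "team"],
    ["from", "hockey", "rare"],
]
filtered = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
print(filtered)  # [['gun'], ['gun'], ['hockey'], ['hockey']]
```

The word "from" occurs in every toy document and is dropped as too common, while words that occur in a single document are dropped as too rare.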

As you can see, we keep only the words that are present in at least 50 documents and in no more than 10% of all documents. For this, we use the filter_extremes() method of the dictionary created by Gensim. Now we can train the LDA model one more time and see the results:

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                         id2word=dct,
                         random_state=100,
                         num_topics=5,
                         passes=10,
                         chunksize=1000,
                         batch=False,
                         alpha='asymmetric',
                         decay=0.5,
                         offset=64,
                         eta=None,
                         eval_every=0,
                         iterations=100,
                         gamma_threshold=0.001,
                         per_word_topics=True)

Detected topics:

image alt text

This time we get the best topic model so far. All five topics are clearly distinguishable, and almost all words are relevant:

  • Topic 0 is the topic about Microsoft Windows OS (words window, file, server, application, version).
  • Topic 1 is the topic about religion (words christian, church, bible, etc).
  • Topic 2 is the hockey topic (team, hockey, player, period, shot, etc).
  • Political talks about guns are the main subject of topic 3 (firearm, gun, weapon, control, bill, etc).
  • Finally, words such as disease, medical, patient, doctor, treatment, health, drug, and others clearly relate to the medical topic, topic 4.

So, after several rounds of model improvement, we managed to achieve decent quality. Let’s see how to use the trained model to predict the topic of a given text. Here is the text (simply the first document from the corpus):

image alt text

This is a letter about religion. Is our model able to detect this? To predict the topic of the document, we call the get_document_topics() method and pass it the document in encoded form (a bag of words in our case):

image alt text

The model predicts topic 1 with 98% confidence. Earlier we printed out the list of topics and their numbers, and topic 1 is indeed the “religion” topic. Our model made a correct prediction.
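
The method returns a list of (topic id, probability) pairs, so picking the predicted topic amounts to taking the pair with the highest probability. The sketch below shows this on a hand-written list; the probabilities are made up, and the topic_names mapping is a hypothetical set of labels assigned by hand after inspecting the topics listed earlier.

```python
# Example of the output shape of get_document_topics(); the numbers are made up.
doc_topics = [(0, 0.01), (1, 0.98), (3, 0.01)]

# Hypothetical human-readable names, assigned after inspecting the topics.
topic_names = {0: "windows", 1: "religion", 2: "hockey", 3: "guns", 4: "medicine"}

# Pick the pair with the highest probability.
best_topic, best_prob = max(doc_topics, key=lambda pair: pair[1])
print(topic_names[best_topic], best_prob)
```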

Conclusion

In this tutorial, we have demonstrated how to use data from Amazon S3 to perform topic modeling in Python with the help of the Gensim library. Before loading the data into the Python script, we curated the dataset in Dremio. This was only a basic example of topic modeling. You can try to improve the quality of the model. Possible steps include:

  • Creating and using bigrams and trigrams in the model;
  • Removing the beginning and ending of each document;
  • Replacing some words with a more general token, for example, replacing email addresses with a placeholder word instead of just dropping them;
  • Experimenting with hyperparameters of models and preprocessors;
  • Increasing the number of categories and trying to build a model which is able to distinguish between 10 or even 20 newsgroups.