32 minute read · January 13, 2020

Data Science on the Data Lake using Dremio, NLTK and Spacy

Dremio Authors: Insights and Perspectives · Dremio Team

Copied to clipboard

Data Science on the Data Lake using Dremio, NLTK and Spacy

Introduction

Assumptions

Data curation with Dremio

Connecting Python to Dremio

Working with text using NLTK

Working with text using Spacy

Conclusion

Introduction

Enterprises often have a need to work with data stored in different places; because of the variety of data being produced and stored, it is almost impossible to use SQL to query all these data sources. These two things represent a great challenge for the data science and BI community. Prior to working on the main issue, we need to prepare an extract-transform-load (ETL) pipeline for data preparation. This is often a time and resource-consuming process.

Dremio helps users solve this problem. Dremio can join datasets from completely different storages with a couple of clicks. Moreover, it supports SQL querying even if the original platforms cannot work with it. With Dremio it is possible to take one part of the dataset from Amazon S3, another from Azure Data Lake Storage, and yet another from ElasticSearch index, join them, and send SQL queries to process data. Besides the UI, Dremio has its API, so you can use it inside your automatic data processing workflows.

In this tutorial, we will perform entity recognition using the Sentiment Labelled Sentences Data Set. To do this, two popular NLP frameworks will be used: NLTK and Spacy. We will also show how to perform some other common natural language processing tasks with these tools. The role of Dremio in this tutorial is to join the dataset and perform basic data curation.

Assumptions

In this tutorial, we assume that you have the following items already installed and setup:

Dremio
AWS account
PyODBC
Dremio ODBC Driver
Jupyter Notebook environment
NLTK
Spacy
Ubuntu OS (optional).

The dataset is originally split into 3 files. Each file has data from a different source: IMDB, Yelp, and Amazon. We will store these files in Amazon S3 bucket.

Data curation with Dremio

The first step is to get data from data sources. In our case, we have three files stored in AWS S3. To connect Dremio to AWS S3 we should click on the Add new source button and then select the Amazon S3 option:

Next, we should give the name to the new data source (let’s call it sentences_s3_storage) and specify the credentials for AWS (AWS Access Key and AWS Access Secret). Then click the Save button. As a result, the connection between AWS S3 and Dremio should be established. There is the sentencesbucketdremio bucket inside S3, and we can see it directly in Dremio UI:

Go inside each of the files and perform its formatting. Specifically, pay attention to field and line delimiters:

Now we need to concatenate three pieces of the entire dataset. Before doing this, we need to create a space where we will store the dataset. We call it sentences_space. Go inside each of the datasets and save it to the new space using the Save As button. We have also given them shorter names:

To perform concatenation:

Click on any of the data frames.

Click on the SQL Editor button:

Write an SQL query for datasets concatenation and click the Preview button:

Here is the query:

SELECT * FROM ((SELECT * FROM yelp) UNION (SELECT * FROM amazon) UNION (SELECT * FROM imdb))

Using the Save As button saves the resulting dataset inside the sentences_space. Let’s call it concat. This dataset should contain all rows from yelp, amazon, and imdb data frames.

After looking at the concat data frame, we will realize that it contains two columns. The first column contains sentences, while the second column represents sentiments for these sentences. For the given tutorial we don’t need the sentiment. So, the second column is redundant. To drop it, click on the arrow near the name of the column. Then, in the drop-down menu, select the Drop option:

Also, we want to rename the column with sentences. At the moment, it is called A. To give it the name (for example, sentences) we can click directly on the current name, and after the cursor will appear, remove the old name and enter the new desired name. Another way is to use the Rename option from the drop-down menu described above.

Later, we want to perform quick exploratory data analysis (EDA) of the dataset. This is a recommended step before starting any machine learning work. To simplify this process in future, we will create a new column with sentences lengths directly in Dremio. Here is the sequence of actions you need to perform:

Click the Add Field button:

In the opened window, find the needed function. In our case, it is the LENGTH() function. The function can be found by scrolling the list of all available functions or by typing the name in the search field. We used the second approach and after entering just “len” we were able to see the needed function. Click on the plus sign near the name of the function:

There is the text field on the right side of the window. There we need to enter the expression that should be evaluated to calculate the values for the new column. In our case, we pass the name of the sentences column to the LENGTH() function (see the image below). Also, we have to specify the name for the new column (we decided to call it sentence_length). Then, you can press the Preview button. If everything is as expected, click the Apply button.

Now, the data frame should have two columns:

This is all that we planned to do with the data in Dremio. Don’t forget to save the changes (we create the sentences_curated data frame). The next step is to establish a connection between Python and Dremio.

Connecting Python to Dremio

To work with Dremio datasets from Python we need to have an ODBC driver and a Pyodbc Python’s library. Also, we will use the read_sql() method of the Pandas package.

The general flow of actions is the following:

Import needed libraries (Pandas and Pyodbc).
Set the variables with host and port where the Dremio is available.
Specify the credentials (in variables) for the Dremio account (username and password).
Set the variable with the path to Dremio ODBC driver.
Use the connect() method from the Pyodbc library to create a connection. The parameters with host, port, the path to the driver and Dremio credentials hould be passed to this method.
Create the SQL query which will fetch the dataset from Dremio (for example, “SELECT * FROM space.dataframe”) and save the query into the variable.
Pass the connection object and the variable with the SQL query to the read_sql() Pandas’ function. This function should return the Pandas data frame.

Here is the full code for the described above steps:

import pandas as pd
import pyodbc
host='localhost'
port=31010
uid ='dfesenko'
pwd = 'awdev1dremio'
driver = '/opt/dremio-odbc/lib64/libdrillodbc_sb64.so'
cnxn = pyodbc.connect("Driver={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)
sql = "SELECT * FROM sentences_space.sentences_curated"
df = pd.read_sql(sql,cnxn)

In the result, we now have the data frame in Pandas and can proceed with our work:

Working with text using NLTK

Before starting to use any machine learning models and libraries, let’s explore dataset a little bit. We want to know basic statistics about sentences. It is a recommended step at the beginning of any NLP work. Since we have already generated the column with sentence lengths in Dremio, now we can use it to simplify the process.

To better understand the sentences length distribution we can use simple Plotly Express Python’s visualization library:

fig = px.histogram(df, x="sentence_length", nbins=10, height=500, width=800)
fig

The resulting plot:

As we can see, most sentences are short. Half of all sentences are shorter than 55 characters (based on median statistics). The shortest sentence is 5 letters long, and the longest review has 477 characters.

The EDA helped us to make the following assumptions. We have many short sentences. It is highly likely that there aren’t any named entities in a short piece of text. So, we are interested more in longer sentences.

We think that it is also worth to clarify what are named entities. Named entities are objects from real-world that can be named. For example, locations, persons, products, organizations, dates, events, etc. Extracting named entities from texts is an important step to understand the content. Named entity recognition (NER) is also often used as a component of other NLP-related processes (texts classification, categorization, understanding the market focus, and so on).

Now let’s use NLTK to work with sentences and extract entities. NLTK stands for Natural Language Toolkit and it is probably the most popular NLP tool nowadays.

NLTK has a built-in function for NER called ne_chunk(). It can distinguish between three types of entities: PERSON, ORGANIZATION, and GPE (geopolitical entity). But we cannot directly apply this function to a text. Firstly, we need to tokenize it using the word_tokenize() function. Then, pass the tokenized text as input to the pos_tag() function. This function will label each word as a POS (part of speech) - like nouns, verbs, adjectives, etc. Only after this, we can apply the ne_chunk() function to the output of the pos_tag() function.

Let’s look at the example. Here is the fifth sentence in the dataset:

We want to recognize entities in this sentence.

sent = df.iloc[4]['sentences']
ne_tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(sent)))
print(ne_tree)

The output:

As we can see, the native NLTK NER parser has successfully recognized the Dos Gringos as the organization.

We can loop over the resulting object like over the ordinary Python lists:

You can see unrecognized words are represented by tuples with POS-tags. But words that are recognized as entities are represented by special NLTK objects:

We can access the label of the entity and the words (leaves):

Below is the code where we use all the above things to extract named entities for the first 20 sentences in the dataset:

entities = []
for i in range(20):
    sentence = df.iloc[i]['sentences']
    ne_tree = nltk.ne_chunk(nltk.pos_tag(word_tokenize(sentence)))
    sent_ent = []
    for j in range(len(ne_tree)):
        if type(ne_tree[j]) == type(ne_tree):
            label = ne_tree[j].label()
            entity = " ".join([leaf[0] for leaf in ne_tree[j].leaves()])
            sent_ent.append((label, entity))
    if sent_ent != []:
        entities.append(sent_ent)

In the result, we have the entities list with the following content:

We can see that some entities are not actually named entities. Also, there are named entities that were assigned to an incorrect label (like NYC was classified as an organization instead of GPE).

Entities without labels can be condensed to Python list with the following expression:

entity_words = [i[1] for x in entities for i in x]
print(entity_words)

The result:

We will execute NER for the entire dataset. To do this, we need to specify the right number of iterations in the loop above (replace 20 to the length of the data frame) and execute it again. There will be a lot of entities, but we want to look only at the most common. To do this, we use the Counter object from the collections module:

from collections import Counter
c = Counter()
entity_words = [i[1] for x in entities for i in x]
Counter(entity_words).most_common(20)

The output:

We can see that there are a lot of incorrectly detected named entities. For example, such words as “Vegas”, “Motorola”, “Amazon”, “Thai”, “Hitchcock”, “Verizon” are actually named entities. But words like “Good”, “Food”, “Overall”, “Bad”, etc. are definitely not named entities.

So, let’s try to improve the quality of the parser. One of the possible ways is to use the Stanford Named Entity Recognizer. At the moment, it has three models: for distinguishing between three, four, and seven classes. For example, the 7-classes model can detect MONEY, PERCENT, DATE, and TIME named entities. But we want to adhere to the comparability principle. So, we will use the first model, which can detect three classes: LOCATION, PERSON, and ORGANIZATION.

Before using the Stanford NER with the NLTK interface, we need to download the model and the recognizer. You can find an archive with them here (the Download section). After downloading, unzip it.

Import StanfordNERTagge and specify the paths to the recognizer and model. Then create a StanfoedNERTagger instance:

from nltk.tag.stanford import StanfordNERTagger
jar = '{PATH_TO_FOLDER}/stanford-ner-2018-10-16/stanford-ner.jar'
model = '{PATH_TO_FOLDER}/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz'
ner_tagger = StanfordNERTagger(model, jar, encoding='utf8')

Let’s check how this tagger works on the following sentence:

First, we need to tokenize the sentence. Then use the tag() method of the ner_tagger instance:

words = nltk.word_tokenize(sent)
stand_ents = ner_tagger.tag(words)
print(stand_ents)

Here is the output:

As we can see, all named entities in this sentence were labelled correctly. Let’s compare with the native NLTK tagger. Here is the result we would get if use it:

This tagger didn’t recognize Valley as geolocation. Maybe it is because “Valley” is actually not a geopolitical entity, but just a location. At the same time, the tagger managed to understand that Phoenix Magazine is the single entity though it is represented by two words.

Based on the example of this sentence, we can suppose that the Stanford NER works generally better than native NLTK parser. Let’s see the results on the whole corpus of sentences. See below the code for entities extraction:

entities_stanf = []
for i in range(len(df)):
    sentence = df.iloc[i]['sentences']
    stanf_ents = ner_tagger.tag((word_tokenize(sentence)))
    sent_ent = []
    for j in range(len(stanf_ents)):
        if stanf_ents[j][1] != "O":
            label = stanf_ents[j][1]
            entity = stanf_ents[j][0]
            sent_ent.append((label, entity))
    if sent_ent != []:
        entities_stanf.append(sent_ent)

To extract twenty most common named entities detected by Stanford NER, execute this code snippet:

stanf_entity_words = [i[1] for x in entities_stanf for i in x]
c = Counter()
Counter(stanf_entity_words).most_common(20)

The output:

We can conclude that our hypothesis is now confirmed. It is clear that among 20 most common named entities detected by Stanford NER there are a lot more actually true named entities than among 20 named entities detected by native NLTK tagger. Still, such words as “Motorola”, “Vegas”, “Verizon”, “Amazon”, and “Hitchcock” are in the top list.

The quality of the results produced by Stanford NER is higher. Nevertheless, it takes significantly more time to recognize entities with it. So, when performing named entity recognition in your real projects we recommend to try using several extractors on the subsample of data to understand what works better in your specific case. Only after this research proceed to process the whole dataset.

Working with text using Spacy

NLTK is like a giant in the NLP field. At the same time, there are several other powerful libraries. They are modern and can be interesting for developers in many cases. Spacy is one of these frameworks. In this section, we will show how to perform entity recognition with Spacy.

Spacy works on pre-trained language models. There are models for different languages and different tasks. For entity recognition, it is possible to use models trained on the OntoNotes 5 corpus and on the Wikipedia corpus.

The first model is more versatile. It supports such features as context-specific token vectors, POS tagging, dependency parsing and named entities recognition. As for named entities recognition, this model provides a wide range of possible labels. For example, there are such labels as WORK_OF_ART, EVENT, LANGUAGE, EVENT, LAW, etc. It is a powerful model.

The second model is only for named entities recognition. Also, there are only 4 available tags: for persons, locations, organization, and for miscellaneous entities (MISC). But this model can work with several languages: English, French, German, Spanish, Russian, Italian, Portuguese. The first model is only for English.

For comparability reasons, we will use the second model. Probably, you will need to download it separately from Spacy installation. To download this model, you can issue the following command from your command-line interface (CLI):

python -m spacy download xx_ent_wiki_sm

After the model is successfully downloaded, you can import it and Spacy in the Python script. After importing, the model should be loaded:

import spacy
import xx_ent_wiki_sm
nlp = xx_ent_wiki_sm.load()

Like we did previously, let’s initially experiment with the recognizer on a single sentence. To work with text in Spacy we need to pass the variable with text to the input of the loaded model:

sent = df.iloc[6]['sentences']
doc = nlp(sent)

The doc object is of the type ‘spacy.tokens.doc.Doc’:

Further, you don’t need to tokenize sentence or doing any other preparation (as you need with NLTK). You can directly access recognized entities:

print([(X.text, X.label_) for X in doc.ents])

The sentence: “There is so much good food in Vegas that I feel cheated for wasting an eating opportunity by going to Rice and Company.”

The output:

Spacy NER has correctly recognized all named entities in this sentence. Here is the code for entities extraction from the whole dataset:

entities_spacy = []
for i in range(len(df)):
    sentence = df.iloc[i]['sentences']
    doc = nlp(sentence)
    spacy_ents = [(X.text, X.label_) for X in doc.ents]
    if spacy_ents != []:
        entities_spacy.append(spacy_ents)

To extract the most common entities we execute the code similar to what we have executed earlier:

spacy_entity_words = [i[0] for x in entities_spacy for i in x]
c = Counter()
print(Counter(spacy_entity_words).most_common(20))

The output:

There are many correctly recognized entities, though the misclassified words (“Excellent”, “Highly”, “Very”, “Best”, “Nice”, etc.) are presented as well. At the same time, remember, that this model labels some entities as MISC, when previously described recognizers do not have such label. So, what is better highly depends on the particular use case.

If we load another model (trained on the OntoNotes 5 corpus - en_core_web_sm) we would get the following results for the entire dataset:

There are numbers (“one”, “two”, “3”, etc.), ordinal numbers (“first”, “second”, etc.), and date-related entities (“today”).

Spacy has other interesting features related to named entity recognition. For example, we can explore entities deeper and analyze the multi-token entities:

print([(X, X.ent_iob_, X.ent_type_) for X in doc])

The output:

You can see that words in “Rice and Company” entity, which is a multi-token entity, are labelled with “B” and “I” tags. “B” means the beginning of the multi-token entity, and “I” means the inner token. “O” means that the token is outside of an entity. You can read more about this in the documentation.

The last but not least cool feature of Spacy that we want to demonstrate is its ability to visualize named entities in texts. To show how it works, we need to use the displacy module of Spacy:

from spacy import displacy

sent = df.iloc[777]['sentences']
displacy.render(nlp(sent), jupyter=True, style='ent')

The output of the code above:

As you can see, named entities are highlighted by colours. Also, entities’ labels are marked near the words. It is very clear and convenient. The feature can be especially useful when you need to prepare a report or explain some concepts to stakeholders. Imagine how good it looks for a large piece of text.

Spacy is a modern NLP framework that provides many interesting and useful features. Some of them are unique, some have their analogies in other frameworks, like NLTK, Stanford Core NLP, Gensim, etc. Anyway, the ultimate choice of tools has to be based on the conditions and environment of a particular project.

Conclusion

In this tutorial, we have demonstrated how to perform named entity extraction using NLTK and Spacy. The dataset was stored on AWS S3. With the help of Dremio, we accessed data and did basic preparation. As you could see, there were no problems for Dremio to work with datasets stored by chunks. Also, we could use SQL to process data, even if this feature is not native for AWS S3. So, Dremio reduced an amount of work we needed to do before starting fulfilling our main task - entities extraction.

NLTK has its own named entities recognizer. At the same time, it is possible to use Stanford recognizer via NLTK interface. In our case, the Stanford NER demonstrated better results, than native NLTK recognizer. Spacy has two language models, that allow entities extraction. In addition, Spacy provides many modern NLP features. The results of named entities extraction in Dremio were good. They are near the same level of quality as the results from Stanford recognizer.

We hope you enjoyed this tutorial, stay tuned for more!

Try Dremio’s Interactive Demo

Explore this interactive demo and see how Dremio's Intelligent Lakehouse enables Agentic AI

Data Science on the Data Lake using Dremio, NLTK and Spacy

Table of Contents

Introduction

Assumptions

Data curation with Dremio

Connecting Python to Dremio

Working with text using NLTK

Working with text using Spacy

Conclusion

Try Dremio’s Interactive Demo

Get Started Free

See Dremio in Action

Talk to an Expert

Make data engineers and analysts 10x more productive

Table of Contents

Introduction

Assumptions

Data curation with Dremio

Connecting Python to Dremio

Working with text using NLTK

Working with text using Spacy

Conclusion

Try Dremio’s Interactive Demo

Additional Resources

Apache Iceberg: The Definitive Guide

Intro: Data Lake vs Warehouse by Dremio

Apache Iceberg: An Architectural Look Under the Covers

Get Started Free

See Dremio in Action

Talk to an Expert

Make data engineers and analysts 10x more productive