Sentiment Analysis is the process of determining whether a piece of writing (product/movie review, tweet, etc.) is positive, negative or neutral. In this tutorial we’ll show you step by step how to analyze tweets using Dremio and a sentiment dictionary. To compute the sentiment of each tweet we need to know the sentiment scores of each of the terms in the tweet, and the overall sentiment is simply reported the sum of the sentiment scores for all the terms.
We assume you have access to a Dremio installation. If you don’t have these things, you can still follow the tutorial but you won’t be able to try out the steps yourself.
Twitter represents a powerful data source for making social measurements. Millions of people voluntarily express opinions across any topic imaginable - this data source is incredibly valuable for both research and business. Researchers from Northeastern University and Harvard University studying the characteristics and dynamics of Twitter have an excellent resource for learning more about how Twitter, and this tool can be used to analyze moods at national scale.
First we need to get a Twitter stream for a given period of time. Twitter provides a very rich REST API for querying the system, accessing data, and controling your account. You can read more about the Twitter API. In this post we are not getting deep into this theme and we are simply providing a sample Twitter stream in JSON format for about ten minutes, it contains near 30K tweets. The following several rows give us a brief look:
There are many attributes in these JSON objects, we are interested only in several of them - id, text and lang. We need a lang attribute to keep only English tweets, and id to reference the tweet in a result dataset. Not all objects are tweets themselves, the first row in the example shown above should be excluded as an object without “text” attribute.
The second thing we are going to use is the dictionary with term scores for English words. In this tutorial we are usingAFINN-111 version, which can be found here. AFINN is tab-delimited list of English words rated for valence with an integer between minus five (negative) and plus five (positive):
abandon -2 abandoned -2 abandons -2 abducted -2 …
What should we do first is to get an access to these files as datasets. We can store them in the local folder and add it as NAS source by clicking “plus” on the “Source” panel (data sources can be added only by Administrator users):
and selecting NAS data source type:
Then NAS source properties like Name, Description and Path should be entered:
After saving new source, the list of the files in the corresponding folder is displayed:
Individual files can be configured as datasets by clicking on the dataset configuration button. Hover over the file you want to configure and you will see the configuration button on the right:
Click the button on the right that shows a directory pointing to a directory with a table icon. You will now see a dialog that allows you to configure the dataset. First we are configuring AFINN-111.txt as a physical dataset:
Change Field Delimiter to the “Tab” and Line Delimiter to the “CRLF-Windows”:
Then click Save button:
Here we see the SQL query, used to configure AFINN-111.txt file as physical dataset. Next we can rename the first column to the “term”, typing new name right into the column name, and convert data type of the column B, clicking arrow on the right of the column name and choosing “Convert Data Type” from the drop down menu. Change Data Type form will be opened:
After selecting Integer, change the name of the new column to the “score”:
Click apply to get the final SQL query, used to access file as a virtual dataset:
Click Save As… button and save it as a virtual dataset into the “Blog” Space:
The next step is to configure tweet-stream.json file as physical dataset and save it as the new virtual dataset. Open it the same way as the previous file in the twitter_source folder. The first time you access the file you may get the message “New schema found and recorded. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.” Because JSON documents can vary from document to document in their schema, Dremio will iteratively work through the dataset to lean the schema for the entire dataset.
Click the Save button and get SQL query to access the file as physical dataset:
As we mentioned before, we are only interested in two fields - id and text, also we need only English tweets. We could accomplish this with the “Keep Only” data curation feature (see Working With Your First Dataset), It can also be done by simply adding a predicate to the SQL query to be:
Then click “Preview” after entering the new query:
Since the idea is to get score for the every word in the tweet, we can simply tokenize the text field by the space symbol to get an approximate score (in English at least). To achieve this, select Split… from the drop down menu for the “text” column. Then enter Space as Fixed String then change Position to “All”. Dremio will generate a regular expression that tokenizes on space throughout the tweet. Next, uncheck “Drop source field (text)” and give name “term” for the new field:
After clicking Apply we will get new SQL query for the virtual dataset:
Now the “text” column is represented as an array of words. To join this dataset with the term score dictionary the array needs to be unnested. Select Unnest… from the drop down menu for the “text” column:
The resulting dataset has separate rows for the each word in the tweet. It is exactly what we need, then click “Save as” and save it as virtual dataset in Blog Space:
So now we have two datasets - one with tweets and one with term scores. In order to calculate tweets scores, the first dataset should be joined with the second one, then grouped by the tweet id and text columns (we preserve text in the output to analyze the result). The tweet score will be Sum of the word scores.
Open twitter_stream dataset and click Join icon, then select Custom Join tab:
Select “term_score” dataset and click Next, the form for the configuring Join will be opened. Change join Type to the “Inner”, to eliminate tweets which don’t have any sentiment colored words, and drag “term” columns from the left and right datasets:
After clicking Apply we get a joined result:
Then we can remove “term” and “term0” fields by selecting “Drop” from the column’s drop down menus, and add grouping by clicking Group By button. Then we are moving “id” and “text” columns to the Dimension area and “score” to the Measures area (Sum aggregate function will be added by default):
Click Apply and get the final result!
Now we can save it into the Blog Space:
The resulting dataset can be shared between users, opened in different BI tools or downloaded as CSV or JSON files. You can apply the same technique to many more tweets than this sample.
In this tutorial we took a CSV file and a sample of Twitter tweets stored as rich JSON documents. Using standard SQL and a simple GUI, we were able to easily access the data, filter on English tweets, tokenize the tweet body, and then join to our sentiment dictionary. We were also able to calculate an aggregate score for each tweet, based on the sum of the individual words.
Traditional approaches would require 1) loading all the data into a repository, 2) writing a program in your favorite language, then 3) saving the results into a repository. With Dremio we were able to accomplish all of these steps without making copies of the data. What’s more, the same approach can easily scale up to much larger numbers of tweets, and can use more sophisticated models to score each tweet. Finally, we can access our word as a virtual dataset in Dremio, meaning we can use any BI tool, Python, R, or SQL to query this data, and combine it with other interesting sources.
In the next part of this tutorial, we’ll measure term frequency and sentiment to calculate the happiest states in the US. We’ll then access this data from Microsoft Excel to visualize on a map.