In the last tutorial we measured the sentiment of tweets using a sentiment dictionary, SQL, and Dremio. In this tutorial we’ll take this idea a little further and measure state sentiment according to tweets. We’ll also look at term frequency across all tweets, which can be useful for categorization and machine learning.
We assume you have access to a Dremio installation. If you don’t, you can still follow along, but you won’t be able to try out the steps yourself.
We’ll introduce a few new techniques in this tutorial, like regular expressions, SQL window functions, and simple visualization in Microsoft Excel. At a high level there are two different tasks that share the same datasets.
Term frequency calculation is a content analysis technique. Content analysis is research that categorizes and classifies speech, written text, interviews, images, or other forms of communication. With the contents of communication available as machine-readable text, the input is analyzed for frequencies and coded into categories from which inferences can be drawn.
The second task provides a simple but illustrative example of a social mood investigation.
For both solutions we will provide not only a table representation, but will also visualize the results using 2-D and map charts.
So let’s start with Problem #1. As a base dataset we will use the twitter-stream.json physical dataset from the previous post, which contains a recorded live Twitter stream, and modify it to get the required data.
First, open the *twitter-stream.json* dataset from the twitter-source space by clicking it:
Then we can rewrite the SQL query to keep only the “text” attribute and only English tweets:
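The rewritten query can be sketched as follows. Here we emulate it with SQLite on a tiny in-memory stand-in table; the table name and sample rows are illustrative only — in Dremio the query runs against twitter_source."twitter-stream.json":

```python
import sqlite3

# In-memory stand-in for the twitter-stream.json dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, lang TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [
        ("RT great day in Seattle", "en"),
        ("bonjour tout le monde", "fr"),
        ("happy to be here", "en"),
    ],
)

# Keep only the "text" attribute of English tweets, mirroring
# SELECT text FROM ... WHERE lang = 'en' in Dremio.
rows = conn.execute("SELECT text FROM tweets WHERE lang = 'en'").fetchall()
for (text,) in rows:
    print(text)
```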
Now, to get the number of occurrences of each term, we need to do the following.
First, split the text attribute: select Split… from the drop-down menu for the “text” column, then enter Space as a Fixed String, change Position to “All”, and give the name “term” to the new field:
Click Apply to get the new SQL query:
Then we need to unnest the set of terms in each tweet. Select Unnest… from the drop-down menu for the “term” column:
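Conceptually, the split followed by the unnest turns each tweet into one row per term. A minimal Python sketch of the same transformation (the sample tweets are illustrative):

```python
tweets = ["RT happy day", "so happy"]

# Split each tweet on spaces (the Split step), then flatten the
# resulting lists into one row per term (the Unnest step).
terms = [term for text in tweets for term in text.split(" ")]
print(terms)  # ['RT', 'happy', 'day', 'so', 'happy']
```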
The next step is to group the dataset by the “term” field and count the occurrences of each word. Click the “Group by” icon and drag the “term” column onto both the Dimensions and Measures areas (the aggregate function Count is applied by default):
Click Apply to get the next SQL query:
Now we have the numerator of the target formula:
[# of occurrences of the term in all tweets]/[# of occurrences of all terms in all tweets]
To compute the denominator we just need to sum all values in the “Count_term” column. Here the SQL window aggregate function SUM will be useful. Functions of this type compute a value for every record in a dataset over a defined subset called a partition. In our case we do not define any partition, which means the partition is the whole dataset. Click “Calculated field…” in the “Count_term” drop-down menu and enter the SQL expression and the name “term_frequency” for the calculated field:
What does the expression cast(count_term as float)/sum(count_term) over () mean? The numerator is the count of occurrences of each term, converted to float so the result has floating-point precision. The denominator is the sum of all values in the “count_term” column (the partition is, in fact, the whole dataset). Click Apply to get the final result, which corresponds to the target formula, and sort the dataset in descending order by the “term_frequency” column:
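The same window-function calculation can be reproduced in SQLite (version 3.25 or later, which supports window functions); the table below is a hypothetical stand-in for the grouped dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term_counts (term TEXT, count_term INTEGER)")
conn.executemany(
    "INSERT INTO term_counts VALUES (?, ?)",
    [("RT", 6), ("happy", 3), ("sad", 1)],
)

# SUM(count_term) OVER () has no PARTITION BY clause, so the
# partition is the whole dataset: every row's count is divided
# by the grand total of all term occurrences (here, 10).
rows = conn.execute(
    """
    SELECT term,
           CAST(count_term AS FLOAT) / SUM(count_term) OVER () AS term_frequency
    FROM term_counts
    ORDER BY term_frequency DESC
    """
).fetchall()
print(rows)  # [('RT', 0.6), ('happy', 0.3), ('sad', 0.1)]
```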
Now we can save it as a new virtual dataset in the Blog space; let us call it term_frequency:
The final dataset can be downloaded and analyzed in third-party software tools. In this post I want to present a basic visualization of the result using Microsoft Excel 2016 and the Dremio Connector for Windows, which allows you to get data directly from the Dremio server, provided the Windows machine has access to it.
First we need to download and install the Dremio connector. After that, a new User DSN should be added and set up using the ODBC Administrator, as in the following screenshot:
The next step is to establish a live connection between an Excel range and the Dremio dataset term_frequency. This is done through an ODBC data source:
After selecting From ODBC - Dremio DSN, the Navigator window will be displayed:
Click Load to get the live table corresponding to the term_frequency dataset. Then select the range of the first twenty rows and insert a 2-D column chart:
Although the result is not especially meaningful, since the “RT” term is just the retweet keyword and the other top words are prepositions or pronouns, it demonstrates that further data cleaning is needed, and it provides a basic example of Microsoft Office integration.
Let us continue with the second task. To compute the state sentiment score, we need to extract location information from each tweet, split and unnest the text attribute, and join the result with the sentiment dictionary. Then we group the dataset by state and sum the terms’ sentiment scores. To extract the location information, the tweet JSON structure should be examined. There are two attributes that can be used:
The “place” attribute has two useful properties: “country” and “full_name”. The first gives us the ability to filter tweets to the U.S. only, and from the second we can extract the state when it contains text like “Los Angeles, CA”.
The “user” attribute has a “location” property, which also provides the state when its text looks like “Los Angeles, CA”.
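The extraction logic can be sketched in Python: match a trailing two-letter state code with a regular expression, then take the last two characters. The alternation is abbreviated here for readability — the full query enumerates every state and territory code — and the function name is illustrative:

```python
import re

# Trailing two-letter state code, as in "Los Angeles, CA".
# The alternation is shortened here; the real query lists all codes.
STATE_RE = re.compile(r".*(WA|CA|NY|TX|FL)$")

def extract_state(location):
    """Return the trailing state code, or None if there is no match."""
    if location and STATE_RE.match(location):
        return location[-2:]   # the "right"(..., 2) of the SQL query
    return None

print(extract_state("Los Angeles, CA"))  # CA
print(extract_state("Somewhere else"))   # None
```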
Not every tweet has these attributes properly filled in, so the recorded data stream should have enough volume. To get the state information, we need to write a SQL query on the twitter-stream.json physical dataset, using the regexp_like function to find tweets whose place.full_name or user.location attribute ends in a two-letter state code. Additionally, the source dataset must be filtered by the lang and place.country_code attributes. Click the “+ New Query” button on the top panel and enter the following query:
```sql
SELECT s.text, "right"(s.place.full_name, 2) AS state
FROM twitter_source."twitter-stream.json" s
WHERE lang = 'en'
  AND s.place.country_code = 'US'
  AND (
    regexp_like(s.place.full_name, '.*(WA|WI|WV|FL|WY|NH|NJ|NM|NA|NC|ND|NE|NY|RI|NV|GU|CO|CA|GA|CT|OK|OH|KS|SC|KY|OR|SD|DE|DC|HI|PR|TX|LA|TN|PA|VA|VI|AK|AL|AS|AR|VT|IL|IN|IA|AZ|ID|ME|MD|MA|UT|MO|MN|MI|MT|MP|MS)$')
    OR regexp_like(s."user".location, '.*(WA|WI|WV|FL|WY|NH|NJ|NM|NA|NC|ND|NE|NY|RI|NV|GU|CO|CA|GA|CT|OK|OH|KS|SC|KY|OR|SD|DE|DC|HI|PR|TX|LA|TN|PA|VA|VI|AK|AL|AS|AR|VT|IL|IN|IA|AZ|ID|ME|MD|MA|UT|MO|MN|MI|MT|MP|MS)$')
  )
```
You will get the following preview:
Then split the “text” column by the Space character, set Position to “All”, and give the new field the name “term”:
After that, the “term” column must be unnested:
Now we can join this dataset with the word sentiment dictionary term_score (please refer to the previous post to see how it was built):
And the final step is to group it by state, compute the summary score for each state, and sort the result in descending order:
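The join-and-aggregate step can be sketched with SQLite; the two tables below are hypothetical stand-ins for the unnested state/term dataset and the term_score dictionary, with made-up sample scores:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE state_terms (state TEXT, term TEXT);
CREATE TABLE term_score  (term TEXT, score INTEGER);
INSERT INTO state_terms VALUES ('CA','happy'), ('CA','great'), ('NY','sad');
INSERT INTO term_score  VALUES ('happy', 2), ('great', 1), ('sad', -2);
""")

# Join the unnested terms with the sentiment dictionary, then
# sum the scores per state and sort in descending order.
rows = conn.execute(
    """
    SELECT s.state, SUM(t.score) AS state_score
    FROM state_terms s
    JOIN term_score t ON t.term = s.term
    GROUP BY s.state
    ORDER BY state_score DESC
    """
).fetchall()
print(rows)  # [('CA', 3), ('NY', -2)]
```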
After saving it as the state_sentiment_score virtual dataset in the Blog space, we can proceed with the visualization. First, get the live table from the ODBC Dremio DSN, as we did for Problem #1. Then, after adding map and 2-D charts based on this range, and a few adjustments to their color schemes, we get a good visual presentation of the result:
In this post we’ve used Dremio to calculate term frequency across a text-centric corpus, then visualized our results in Microsoft Excel. We used some regular expressions and window functions along the way. We also calculated and visualized the happiest states in the US, at least based on this 10-minute sample of tweets! Again, we used Microsoft Excel to help our data tell a story.