
Visualizing Tweet Sentiment With Excel, SQL, and Dremio

Intro

In the last tutorial we measured tweet sentiment using a sentiment dictionary, SQL, and Dremio. In this tutorial we'll take the idea a little further and measure the sentiment of each U.S. state based on its tweets. We'll also look at term frequency across all tweets, which can be useful for categorization and machine learning.

Assumptions

We assume you have access to a Dremio installation. If you don't, you can still follow the tutorial, but you won't be able to try out the steps yourself.

Getting started

We’ll introduce a few new techniques in this tutorial, like regular expressions, SQL window functions, and simple visualization in Microsoft Excel. At a high level, we'll tackle two different tasks using the same datasets.

  1. Computing a term frequency histogram for the Twitter stream. The frequency of a term can be calculated as [# of occurrences of the term in all tweets] / [# of occurrences of all terms in all tweets].

Term frequency calculation is a content analysis technique. Content analysis is research that categorizes and classifies speech, written text, interviews, images, or other forms of communication. Once the content is available as machine-readable text, it can be analyzed for term frequencies and coded into categories to support inferences.

  2. Using tweet sentiment analysis to determine which U.S. state is the happiest. In other words, ranking the states by their overall tweet sentiment score.

This task is a simple but illustrative example of investigating social mood.

For both tasks we will present not only tabular results, but also visualize them using 2-D column and map charts.

Computing term frequency

Let's start with Problem #1. As our base we will use the twitter-stream.json physical dataset from the previous post, which contains a recorded live Twitter stream, and transform it to get the data we need.

First, open the twitter-stream.json dataset from the twitter-source space by clicking it:

Open the Twitter data source

Then we can rewrite the SQL query to keep only the “text” attribute and filter down to English tweets:

English only
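If you prefer typing SQL directly, the query is roughly the following minimal sketch (the quoted dataset path matches the space and file names used in this post; adjust it to your installation):

    -- keep only the tweet body, restricted to English tweets
    SELECT "text"
    FROM "twitter-source"."twitter-stream.json"
    WHERE lang = 'en'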

Now, to count the number of occurrences of each term, we need to:

  1. Split each tweet (i.e. its “text” attribute) into separate words,

  2. Transform the resulting sets of terms into separate records,

  3. Group the whole dataset by term and count the records in each group.

To do this, first split the “text” attribute: select Split… from the drop-down menu for the “text” column, then enter a space as the Fixed String, change Position to “All”, and name the new field “term”:

Tokenize the tweet body

Click Apply to get the new SQL query:

Preview the new query

Then we need to unnest the set of terms in each tweet. Select Unnest… from the drop-down menu for the “term” column:

Unnest the array
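Together, the split and unnest steps correspond to SQL along these lines. Treat this as a sketch: the exact regexp_split arguments that the Dremio UI generates (in particular the split-limit argument) can vary by version, and english_tweets is a hypothetical stand-in for the filtered dataset from the previous step:

    -- split each tweet body into an array of words, then flatten the
    -- array so that each word becomes its own record
    SELECT flatten(term) AS term
    FROM (
        SELECT regexp_split("text", ' ', 'ALL', 100) AS term
        FROM english_tweets
    ) split_tweets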

The next step is to group the dataset by the “term” field and count the occurrences of each word. Click the “Group By” icon and drag the “term” column onto both the Dimensions and Measures areas (the aggregate function Count will be applied by default):

Group by term

Click Apply to get the resulting SQL query:

Preview the query
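The aggregation itself is plain SQL. Conceptually, with terms standing in for the split-and-unnested dataset from the previous steps, it is:

    -- one row per distinct term, with its occurrence count
    SELECT term, COUNT(term) AS Count_term
    FROM terms
    GROUP BY term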

Using a SQL window function to calculate the denominator

Now we have the numerator of the target formula:

[# of occurrences of the term in all tweets]/[# of occurrences of all terms in all tweets]
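For example, if “happy” appears 120 times in a stream containing 60,000 term occurrences in total, its term frequency is 120 / 60,000 = 0.002.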

To compute the denominator we just need to sum all values in the “Count_term” column. Here the SQL window aggregate function SUM is useful. Window functions compute a value for every record in a dataset over a defined subset of records called a partition. In our case we don't define a partition, which means the whole dataset is used. Click “Calculated Field…” in the “Count_term” drop-down menu, then enter the SQL expression and the name “term_frequency” for the calculated field:

Computing the score

What does the expression cast(count_term as float)/sum(count_term) over () mean? The numerator is the occurrence count of each term, converted to float so the division produces a fractional result. The denominator is the sum of all values in the “count_term” column (the partition is, in effect, the whole dataset). Click Apply to get the final result, which corresponds to the target formula, and sort the dataset in descending order by the “term_frequency” column:

SQL window expression
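Putting it all together, the final query is conceptually the following (term_counts is a stand-in for the grouped dataset from the previous step):

    -- divide each term's count by the total count of all terms;
    -- the empty OVER () makes SUM run over the whole dataset
    SELECT term,
           CAST(Count_term AS FLOAT) / SUM(Count_term) OVER () AS term_frequency
    FROM term_counts
    ORDER BY term_frequency DESC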

Now we can save it as a new virtual dataset in the Blog space; let's call it term_frequency:

Saving the new virtual dataset

Visualizing term frequency in Excel

The final dataset can be downloaded and analyzed in third-party tools. In this post I want to present a basic visualization of the result using Microsoft Excel 2016 and the Dremio Connector for Windows, which lets Excel pull data directly from the Dremio server, provided the Windows machine has access to it.

First we need to download and install the Dremio connector. After that, a new User DSN should be added and configured using the ODBC Administrator, as in the following screenshot:

Setting up the ODBC connector

The next step is to establish a live connection between an Excel range and the Dremio dataset term_frequency. This is done through the ODBC data source:

Configuring ODBC

After selecting From ODBC and choosing the Dremio DSN, the Navigator window will be displayed:

Connecting via ODBC

Click Load to get a live table corresponding to the term_frequency dataset. Then select the range containing the first twenty rows and insert a 2-D column chart:

Viewing in Excel

The result is not yet very meaningful: the top term “RT” is just the retweet keyword, and the other frequent words are prepositions and pronouns. Still, it demonstrates that further data cleaning is needed, and it provides a basic example of Microsoft Office integration.

Measuring sentiment by state

Let's continue with the second task. To compute a state sentiment score, we need to extract location information from each tweet, split and unnest the “text” attribute, and join the result with the sentiment dictionary. Then we group the dataset by state and sum the term sentiment scores. To extract the location information, we need to examine the tweet JSON structure. There are two attributes we can use:

  1. The “place” attribute has two useful properties: “country” and “full_name”. The first lets us filter tweets to the U.S. only, and from the second we can extract the state when the text looks like “Los Angeles, CA”.

  2. The “user” attribute has a “location” property, which also provides the state when the text looks like “Los Angeles, CA”.

Not every tweet has these attributes properly filled in, so the recorded data stream needs enough volume. To get the state, we write a SQL query against the twitter-stream.json physical dataset, using the regexp_like function to find tweets whose place.full_name or user.location attribute ends with a two-letter state code. Additionally, the source dataset must be filtered by the lang and place.country attributes. Click the “+ New Query” button on the top panel and enter the following query:
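Here is a minimal sketch of such a query; the dataset path and the state-extraction expression are illustrative (the user.location variant is handled the same way), so adjust them to your stream:

    SELECT
        t."text" AS "text",
        -- take the last two characters of place.full_name as the state code
        SUBSTR(t.place.full_name, CHAR_LENGTH(t.place.full_name) - 1, 2) AS state
    FROM "twitter-source"."twitter-stream.json" AS t
    WHERE t.lang = 'en'
      AND t.place.country = 'United States'
      AND regexp_like(t.place.full_name, '.*, [A-Z]{2}$')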

You will get the following preview:

Query preview

Then split the “text” column by the space character, set Position to “All”, and give the new field the name “term”:

Tokenize by space

After that, the “term” column must be unnested:

Unnest the array

Now we can join this dataset with the word sentiment dictionary term_score (see the previous post for how it was built):

Join to sentiment dictionary

And the final step is to group by state, compute the total sentiment score for each state, and sort the result in descending order:

Group by state
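In SQL terms, the join and aggregation amount to something like the following sketch, where state_terms is a hypothetical stand-in for the split-and-unnested dataset above, and term_score is assumed to have “term” and “score” columns (as built in the previous post):

    -- total the sentiment scores of all terms tweeted from each state
    SELECT s.state, SUM(d.score) AS state_sentiment_score
    FROM state_terms AS s
    JOIN Blog.term_score AS d ON s.term = d.term
    GROUP BY s.state
    ORDER BY state_sentiment_score DESC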

After saving it as the state_sentiment_score virtual dataset in the Blog space, we can proceed with visualization. First, get the live table from the Dremio ODBC DSN, as we did for Problem #1. Then, after adding map and 2-D charts based on this range, plus a few tweaks to their color schemes, we get a pretty good visual presentation of the result:

Final result

Conclusion

In this post we used Dremio to calculate term frequency across a text-centric corpus, then visualized the results in Microsoft Excel. We used regular expressions and window functions along the way. We also calculated and visualized the happiest states in the U.S., at least based on this 10-minute sample of tweets! Again, we used Microsoft Excel to help our data tell a story.