Introducing the Dremio Data Science Index

We just launched our Data Science Index - an unbiased ranking of the popularity of data science tools. Our goal was to take a list of data science tools and see if it was possible to reliably quantify the current popularity of each one. We hope this type of ranking gives some insight into the current daily usage of each tool and what the future might look like for some of them.

To begin with, we took some inspiration from the methodology used by the DB-Engines DBMS ranking and put together an initial list of ranking criteria:

  • Number of results in Google
  • Keyword search volume
  • LinkedIn job listings
  • LinkedIn mentions
  • Facebook mentions
  • Number of results from StackOverflow
  • Twitter followers
  • Total website backlinks
  • Total website referring domains

As we started to pull data, we quickly realized that many of these metrics wouldn't be as reliable as we initially thought. For starters, fewer than 60% of the tools had an official Twitter account, so we removed social media following as a metric.

Pulling the number of mentions from Facebook turned out to be very difficult as well - Facebook's built-in search function doesn't currently give an estimate of the number of results. The most accurate way we found to pull data was by creating a Facebook ad and pulling the audience size for each given term. This method was time-consuming, and again only about 60% of the tools returned any results, so we removed "Facebook interest" as a metric.

We also noticed that metrics like "number of results from StackOverflow" or "total number of website backlinks" would give greater weight to tools that had been around longer. We wanted the current popularity of a tool to count for more than its historical popularity, so our first step was to include Google Trends data as a metric. This way we could judge the current interest in a tool based on its search popularity within the last year.

Next we made use of the StackExchange Data Explorer to aggregate more accurate StackOverflow data about each of the tools. The Data Explorer allowed us to write our own SQL query against the StackOverflow database, meaning we could limit our data to results from within the past year. Also, instead of counting general mentions of each tool, we switched to tracking the total number of questions, question views, and followers for each tool's StackOverflow tag.

Finally, we decided that website metrics should carry less weight than the others. Having both "total website backlinks" and "total website referring domains" for each of the tools seemed like overkill, so we removed "total website backlinks", the less reliable of the two metrics.

Next we grouped the metrics together and weighted each group. Our final list of ranking metrics, in order of importance, looks like this (a sketch of how the groups might be combined into a single score follows the list):

Technical Interest:

  • StackOverflow tag followers
  • StackOverflow tagged questions
  • StackOverflow tagged question views

Search Presence:

  • Number of results in Google
  • Keyword search volume

Job Interest:

  • LinkedIn job listings
  • LinkedIn mentions

Search Trend:

  • Google Trends data

Domain Strength:

  • Total number of referring domains
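
To make the grouping concrete, here is a minimal Python sketch of how normalized metric groups could be combined into a single score per tool. The index's exact weights and normalization aren't published here, so the weights below are placeholders that simply follow the stated order of importance.

import pandas as pd

# Placeholder group weights, ordered by the stated importance above.
# These are illustrative only; the real weights used for the index are not published.
WEIGHTS = {
    "technical_interest": 0.35,
    "search_presence": 0.25,
    "job_interest": 0.20,
    "search_trend": 0.15,
    "domain_strength": 0.05,
}

def rank_tools(metrics: pd.DataFrame) -> pd.Series:
    # `metrics` is assumed to have one row per tool and one column per metric group.
    # Min-max normalize each group to 0-1 so no single metric dominates on raw scale,
    # then combine the groups using the weights above.
    normalized = (metrics - metrics.min()) / (metrics.max() - metrics.min())
    scores = sum(normalized[group] * weight for group, weight in WEIGHTS.items())
    return scores.sort_values(ascending=False)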

Here’s an in-depth breakdown of how we crunch the numbers for each of these categories.

Technical Interest

StackOverflow is our most reliable metric, as it gauges how actively people are asking questions or learning about each topic. Because questions have neatly defined tags, and every one of our included tools (so far) has its own tag, we know that our final numbers aren't diluted with questions about unrelated items, as they might be within a Google search.

To pull tagged question counts and tagged question views within the past year, we run the following SQL query in the StackExchange Data Explorer:

DECLARE @total float
-- Site-wide total of question views (computed here but not used in the per-tag results below)
SELECT @total = Sum(Cast(ViewCount AS bigint)) FROM Posts WHERE PostTypeId = 1

-- Questions, views, and answers per tag over the past year
SELECT
    Tags.TagName AS Tag,
    Count(*) AS Questions,
    Sum(Cast(Posts.ViewCount AS bigint)) AS Views,
    Sum(Posts.AnswerCount) AS Answers
FROM Posts
  JOIN PostTags ON Posts.Id = PostTags.PostId
  JOIN Tags ON PostTags.TagId = Tags.Id
WHERE
  Posts.PostTypeId = 1 -- questions only
  AND Tags.TagName IN ('anaconda', 'apache-arrow', 'dremio', 'keras', 'knime', 'numpy',
      'pandas', 'python', 'r', 'rapidminer', 'sas', 'scala', 'scikit-learn', 'scipy',
      'scrapy', 'spark', 'sql', 'statsmodels', 'tensorflow', 'theano', 'matlab')
  AND Posts.CreationDate >= DATEADD(year, -1, GETDATE())
GROUP BY Tags.TagName
ORDER BY Tags.TagName DESC

Oddly enough, the Data Explorer doesn't expose the number of followers for each tag, so to collect this data we go to each tool's tag page on StackOverflow and record the follower count manually.

Search Presence

When pulling the total number of Google results, we ran into a couple of issues. First, generic searches for tools like "Python" or "R" would return millions of results unrelated to what we were trying to quantify. To counter this, we add the type of the tool to the end of the search term - either "library", "software" or "programming language". For example, "R" became "R programming language".

The second major issue we encountered was that Google would often estimate a different number of results for the same search term when it was searched multiple times in a row. To normalize for this discrepancy, we started pulling this number from a handful of different sources and averaging the results.
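
As an illustration of this step, here is a small Python sketch that builds the qualified search term and averages result-count estimates. The tool-to-type mapping and the estimate values are hypothetical, and the post doesn't name the exact sources the averages were pulled from.

from statistics import mean

# Hypothetical tool -> type mapping using the qualifiers described above.
TOOL_TYPES = {"r": "programming language", "pandas": "library", "knime": "software"}

def qualified_query(tool: str) -> str:
    # Append the tool's type so generic names like "R" return relevant results.
    return f"{tool} {TOOL_TYPES[tool]}"

def averaged_result_count(estimates):
    # Average estimates gathered from several sources to smooth out the
    # variation Google reports when the same term is searched repeatedly.
    return mean(estimates)

# Example with made-up estimates for "r programming language":
print(qualified_query("r"), averaged_result_count([512000, 498000, 530000]))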

When looking for keyword search volume, we ran into the same problem of keywords not being specific enough - we couldn't look at the search volume for the word "Python" and expect reliable results. To remedy this we again added qualifying words to each keyword, but this time we added action-oriented words so that the metric would more accurately gauge current interest. For software we used "download", for libraries we used "install", and for languages we used "learn". For example, "SQL" became "Learn SQL".
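
The same idea applies to the search-volume keywords; here is a brief sketch, assuming a simple mapping from tool type to the action words described above (the tool names and categories are examples only).

# Hypothetical action words per tool type, as described above.
ACTION_WORDS = {"software": "download", "library": "install", "language": "learn"}

def volume_keyword(tool: str, tool_type: str) -> str:
    # Prefix the action word so the keyword reflects active, current interest,
    # e.g. volume_keyword("SQL", "language") returns "learn SQL".
    return f"{ACTION_WORDS[tool_type]} {tool}"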

Job Interest

To get the number of job listings for each tool, we simply search for the name of the tool within the “jobs” section of LinkedIn and pull the number of results by hand. For this search we use the same terms as we do with the Google search.

To pull the overall interest in each term we use LinkedIn’s ad creation tools. By creating an ad and setting the name of the tool as an audience interest we are able to see the estimated size of the tool’s audience on LinkedIn.

Search Trend

To get the search trend data, we use Google Trends and compare each tool's current search interest with its search interest over the past year, broken down by month. Google Trends only shows relative values on a scale of 0 to 100, but we can still use that data to evaluate how far the current search volume is from its one-year peak.

For instance, if a tool's current search interest is 56, it is at only 56% of the search volume it had during its best month within the last year; a score of 100 means it is currently having its best month of the year. This gives us a general idea of whether search interest has recently moved up or down.
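
As one way to pull this programmatically, here is a sketch using the unofficial pytrends library; the post doesn't say how the data was actually collected, so treat this as an assumption, and the keyword is just an example. Within a 12-month window Google Trends scales the peak period to 100, so the most recent value is already the "percent of peak" figure described above.

from pytrends.request import TrendReq

# Unofficial Google Trends client (assumed here; not part of the original methodology).
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(["tensorflow"], timeframe="today 12-m")

interest = pytrends.interest_over_time()      # relative values on a 0-100 scale
current = interest["tensorflow"].iloc[-1]     # latest value = % of the one-year peak
print(f"Current search interest is {current}% of the past year's peak")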

Domain Strength

This metric is the simplest to pull: we use Moz's data to get the total number of referring domains for each tool's website. A higher number of referring domains means more websites are linking back to the website for that specific tool. In cases where the tool is part of a website that serves more functions than that tool specifically (e.g. Apache Spark), we use the metrics for either the subdomain or the specific page.

To see where all of the tools stack up, check out the Data Science Index itself.