Dremio Jekyll

The Winter Olympics Story: How I Did It

Introduction

The Olympic Games come around every two years, an international sporting event rooted in an Ancient Greek tradition. During the ancient Games, an Olympic truce was called so that all city-states would lay down their arms in respect for the safety of the participants. The modern Games also observes this tradition, but like its predecessor, the Games become a proxy by which countries assert dominance over each other through sport. Sport becomes a metaphor for geopolitics, where war is waged not on the battlefield but rather the slopes or the rink.

Using a dataset from Sports Reference, I decided to explore the origins of the modern Winter Games, visualizing where the Games have been held, which countries tend to do well, and who the medalists are. Find the story here. Here’s how I did it.

Geocoding in Dremio

First, I copy and pasted this table from Sports Reference and saved it as a tsv. Then, I uploaded this file to Dremio using the following configurations.

add source

The dataset contained information about the city and country where the Winter Games have been hosted, but to plot them on a map, I needed the latitude and longitude of those cities. Make sure you have conda configured so that you can install pandas. Run the following commands so that you have all of the necessary libraries for geocoding.

1
2
3
pip install -U googlemaps
pip install pyodbc
conda install pandas

Then I used the Google Maps API to geocode each city with the following Python script that connects to Dremio through an ODBC connection.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
import googlemaps
import pyodbc
import pandas as pd
import sys
import logging

logger = logging.getLogger("root")
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)

gmaps = googlemaps.Client(key='<your Google Maps API key>')

host = '<your host>'
port = 31010
uid = '<your Dremio username>'
pwd = '<your Dremio password>'
driver = 'Dremio Connector'
output_file_path = '<your file path>'

cnxn =  pyodbc.connect("DRIVER={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True)

sql = '''SELECT * FROM <path to your data in Dremio>'''

df = pd.read_sql(sql,cnxn)

for index, row in df.iterrows():
  to_geocode = row['City'] + ", " + row['Country']
  try:
    geocode_result = gmaps.geocode(to_geocode)
    df.at[index, 'latitude'] = (geocode_result[0]['geometry']['location']['lat'])
    df.at[index, 'longitude'] = (geocode_result[0]['geometry']['location']['lng'])
  except Exception as e:
    logger.exception(e)
    logger.error("Skipping!")
  except:
    e = sys.exc_info()
    print e

df.to_csv(output_file_path,index_label="id")

With all of the points geocoded, I was able to display them in a map in D3. To draw the map in D3, I used this geojson file, and then I plotted all of the points on my map. I used this D3 block as a guide for how to zoom to a bounding box. This was the result.

add source

Analysis in Dremio and Visualization with D3

For my line graphs, I used a simple guide to draw them in D3. Then, I used Susie Lu’s d3-annotation library to annotate various points along the graph.

add source

I also used Jim Valladingham’s So You Want to Build a Scroller tutorial and D3 transitions to create the scrolling animations for the first part of the story.

Then, I moved onto exploring the medalists dataset which can be found here. Credit goes to Sports Reference for the data.

To create the small multiples visualization of how each country did at each Games, I first merged some countries together to present a more accurate representation of total medals, including combining the Soviet Union into Russia’s medal count. I used the following SQL query.

1
2
3
4
5
6
7
8
9
10
SELECT medals."Year" AS "Year", Sport, Event, CASE
  WHEN Country = 'Soviet Union' THEN 'Russia'
  WHEN Country = 'Unified Team' THEN 'Russia'
  WHEN Country = 'West Germany' THEN 'Germany'
  WHEN Country = 'East Germany' THEN 'Germany'
  WHEN Country = 'Czechoslovakia' THEN 'Czech Republic'
  WHEN Country = 'Yugoslavia' THEN 'Serbia'
  ELSE Country
END AS Country, Gender, "Medal Rank", Medal, "Name of Athlete or Team", "Age of Athlete"
FROM "@elbert"."Winter Olympics".medals

Then, I used the following SQL query to look at the total gold, silver, bronze, and total medals each country won.

1
2
3
4
5
6
SELECT Country, count(case Medal when 'gold' then 1 else null end) AS gold,  
  count(case Medal when 'silver' then 1 else null end) AS silver,  
  count(case Medal when 'bronze' then 1 else null end) AS bronze, 
  (count(case Medal when 'gold' then 1 else null end)+count(case Medal when 'silver' then 1 else null end)+count(case Medal when 'bronze' then 1 else null end)) AS total
FROM "@elbert"."Winter Olympics".medals
GROUP BY Country

This analysis led me to visualize each country’s medalists, sorted by the most successful countries to the least. add source

I then used Mike Bostock’s small multiples block as a starting point for how to create multiples for each country.

To analyze which countries dominated which sports, I used this SQL query as a starting point to look at countries that have won the lion’s share of medals for a particular sport.

1
2
3
4
5
6
7
SELECT Country, Sport, count(case Medal when 'gold' then 1 else null end) as gold, 
  count(case Medal when 'silver' then 1 else null end) as silver, 
  count(case Medal when 'bronze' then 1 else null end) as bronze, 
  (count(case Medal when 'gold' then 1 else null end)+count(case Medal when 'silver' then 1 else null end)+count(case Medal when 'bronze' then 1 else null end)) AS total
FROM "@elbert"."Winter Olympics".medals
GROUP BY Country, Sport
ORDER BY Country

I then divided the total number of medals a country earned in a particular sport by the total number of medals ever wom in that sport as a measure of dominance in that sport, represented by the darkness of color in my visualization.

add source

Finally, I looked at the age of the medalists in the dataset. Using Dremio’s UI, I was able to write a SQL query that showed the minimum, average, and maximum age for each gender and sport. I found that age varied across gender and sport.

I used Mike Bostck’s block on box plots to visualize the age of participants across all sports of the Winter Games. Because the number of sports that were held changed from year to year, the visualization follows an enter update exit pattern.