9 minute read · March 27, 2018
The Winter Olympics Story: How I Did It
Introduction
The Olympic Games come around every two years, an international sporting event rooted in an Ancient Greek tradition. During the ancient Games, an Olympic truce was called so that all city-states would lay down their arms in respect for the safety of the participants. The modern Games also observes this tradition, but like its predecessor, the Games become a proxy by which countries assert dominance over each other through sport. Sport becomes a metaphor for geopolitics, where war is waged not on the battlefield but rather the slopes or the rink.
Using a dataset from Sports Reference, I decided to explore the origins of the modern Winter Games, visualizing where the Games have been held, which countries tend to do well, and who the medalists are. Find the story here. Here’s how I did it.
Geocoding in Dremio
First, I copy and pasted this table from Sports Reference and saved it as a tsv. Then, I uploaded this file to Dremio using the following configurations.
The dataset contained information about the city and country where the Winter Games have been hosted, but to plot them on a map, I needed the latitude and longitude of those cities. Make sure you have conda configured so that you can install pandas. Run the following commands so that you have all of the necessary libraries for geocoding.
pip install -U googlemaps pip install pyodbc conda install pandas
Then I used the Google Maps API to geocode each city with the following Python script that connects to Dremio through an ODBC connection.
import googlemaps import pyodbc import pandas as pd import sys import logging logger = logging.getLogger("root") logger.setLevel(logging.DEBUG) ch = logging.StreamHandler() ch.setLevel(logging.DEBUG) logger.addHandler(ch) gmaps = googlemaps.Client(key='<your Google Maps API key>') host = '<your host>' port = 31010 uid = '<your Dremio username>' pwd = '<your Dremio password>' driver = 'Dremio Connector' output_file_path = '<your file path>' cnxn = pyodbc.connect("DRIVER={};ConnectionType=Direct;HOST={};PORT={};AuthenticationType=Plain;UID={};PWD={}".format(driver,host,port,uid,pwd),autocommit=True) sql = '''SELECT * FROM <path to your data in Dremio>''' df = pd.read_sql(sql,cnxn) for index, row in df.iterrows(): to_geocode = row['City'] + ", " + row['Country'] try: geocode_result = gmaps.geocode(to_geocode) df.at[index, 'latitude'] = (geocode_result[0]['geometry']['location']['lat']) df.at[index, 'longitude'] = (geocode_result[0]['geometry']['location']['lng']) except Exception as e: logger.exception(e) logger.error("Skipping!") except: e = sys.exc_info() print e df.to_csv(output_file_path,index_label="id")
With all of the points geocoded, I was able to display them in a map in D3. To draw the map in D3, I used this geojson file, and then I plotted all of the points on my map. I used this D3 block as a guide for how to zoom to a bounding box. This was the result.
Analysis in Dremio and Visualization with D3
For my line graphs, I used a simple guide to draw them in D3. Then, I used Susie Lu’s d3-annotation library to annotate various points along the graph.
I also used Jim Valladingham’s So You Want to Build a Scroller tutorial and D3 transitions to create the scrolling animations for the first part of the story.
Then, I moved onto exploring the medalists dataset which can be found here. Credit goes to Sports Reference for the data.
To create the small multiples visualization of how each country did at each Games, I first merged some countries together to present a more accurate representation of total medals, including combining the Soviet Union into Russia’s medal count. I used the following SQL query.
SELECT medals."Year" AS "Year", Sport, Event, CASE WHEN Country = 'Soviet Union' THEN 'Russia' WHEN Country = 'Unified Team' THEN 'Russia' WHEN Country = 'West Germany' THEN 'Germany' WHEN Country = 'East Germany' THEN 'Germany' WHEN Country = 'Czechoslovakia' THEN 'Czech Republic' WHEN Country = 'Yugoslavia' THEN 'Serbia' ELSE Country END AS Country, Gender, "Medal Rank", Medal, "Name of Athlete or Team", "Age of Athlete" FROM "@elbert"."Winter Olympics".medals
Then, I used the following SQL query to look at the total gold, silver, bronze, and total medals each country won.
SELECT Country, count(case Medal when 'gold' then 1 else null end) AS gold, count(case Medal when 'silver' then 1 else null end) AS silver, count(case Medal when 'bronze' then 1 else null end) AS bronze, (count(case Medal when 'gold' then 1 else null end)+count(case Medal when 'silver' then 1 else null end)+count(case Medal when 'bronze' then 1 else null end)) AS total FROM "@elbert"."Winter Olympics".medals GROUP BY Country
This analysis led me to visualize each country’s medalists, sorted by the most successful countries to the least.
I then used Mike Bostock’s small multiples block as a starting point for how to create multiples for each country.
To analyze which countries dominated which sports, I used this SQL query as a starting point to look at countries that have won the lion’s share of medals for a particular sport.
SELECT Country, Sport, count(case Medal when 'gold' then 1 else null end) as gold, count(case Medal when 'silver' then 1 else null end) as silver, count(case Medal when 'bronze' then 1 else null end) as bronze, (count(case Medal when 'gold' then 1 else null end)+count(case Medal when 'silver' then 1 else null end)+count(case Medal when 'bronze' then 1 else null end)) AS total FROM "@elbert"."Winter Olympics".medals GROUP BY Country, Sport ORDER BY Country
I then divided the total number of medals a country earned in a particular sport by the total number of medals ever wom in that sport as a measure of dominance in that sport, represented by the darkness of color in my visualization.
Finally, I looked at the age of the medalists in the dataset. Using Dremio’s UI, I was able to write a SQL query that showed the minimum, average, and maximum age for each gender and sport. I found that age varied across gender and sport.
I used Mike Bostck’s block on box plots to visualize the age of participants across all sports of the Winter Games. Because the number of sports that were held changed from year to year, the visualization follows an enter update exit pattern.