How can we easily analyse a data lake to calculate a fair outcome? The current football season may end without all of the remaining fixtures being played. This is our proposed recalculated league table for season 2019/2020.

Recalculated table - Liverpool would be the Champions. Norwich City would be relegated.

### Introduction

We are in the midst of an incomplete football season. Will Liverpool be crowned champions, or will they have a muted celebration? No one knows what the outcome of the 2019/2020 season will be. So let’s speculate and propose a fair outcome!

We’ll focus on the Premier League. Let’s assume the following scenario:

• No further fixtures will be played in the 2019/2020 season.
• Winners and losers will still be decided for the 2019/2020 season.

### Big Question: How to Decide the Winners and the Losers for the Current Season?

To answer this question, we will make some assumptions and proposals:

• Exclude results where only the home or the away games have been played; not all teams have played each other both home and away this season. Therefore, we will only include games in which the teams have competed against each other, both home and away. For example, Everton’s 0-2 defeat against Sheffield United will be excluded because the return game has not been played.

• Use Points Average, instead of Points Total: Teams will have an unequal number of included games. Some teams will have more excluded games than others. Therefore, we will use an Average Points Per Game measure, calculated as “total points” / “included matches.”
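The two proposals above can be sketched in a few lines of Python. This is only an illustration of the rules, not the Dremio workflow used in this article; the tuple layout `(home, away, home_goals, away_goals)`, the function name `points_per_game`, and the sample teams are assumptions made for the example.

```python
from collections import defaultdict

def points_per_game(matches):
    """Return {team: average points per included game}.

    `matches` is a list of (home, away, home_goals, away_goals) tuples.
    A match is included only if the reverse fixture has also been played.
    """
    played = {(home, away) for home, away, _, _ in matches}
    points = defaultdict(int)
    games = defaultdict(int)
    for home, away, home_goals, away_goals in matches:
        if (away, home) not in played:
            continue  # return game not played, so exclude this result
        if home_goals > away_goals:
            home_pts, away_pts = 3, 0
        elif home_goals < away_goals:
            home_pts, away_pts = 0, 3
        else:
            home_pts, away_pts = 1, 1
        points[home] += home_pts
        points[away] += away_pts
        games[home] += 1
        games[away] += 1
    # "total points" / "included matches"
    return {team: points[team] / games[team] for team in games}

# Made-up sample data, not real results
sample = [
    ("A", "B", 2, 0),
    ("B", "A", 1, 1),
    ("A", "C", 0, 2),  # excluded: C has not yet hosted A
]
ppg = points_per_game(sample)
print(ppg)  # {'A': 2.0, 'B': 0.5}
```

Team C does not appear in the output at all, because its only match has no return fixture.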

### Data Discovery

Our data source resides in an Amazon S3 data lake. Let’s launch our Dremio portal and discover more.

There are no data sources currently available in this Dremio instance. We will connect to a data source that has the results data. First, click “add source.”

### Connect to Data Source

We will add a source from Amazon S3. Click on the Amazon S3 icon and enter the connection credentials.

I named my S3 location “s3flower-bucket.” We can see that there is a data file available called “201920_football_results.parquet.” Let’s examine the file by clicking on the filename. Dremio has recognized the filetype as Parquet, and has made suggestions about the names of the column headings.

• date
• home team
• away team
• result

Let’s accept the defaults, and click “Save” to make the dataset available.

No data is imported into Dremio; we are simply making the S3 bucket available for analysis.

### Prepare the Data for Analysis

We can alter the definition of the data without changing the underlying stored data. We call this a Virtual Data Set (VDS).

Extract the goals from the “Result”

We need to extract the goal values from the “result” field so that we can determine the winner of each match.

• Highlight one of the numeric fields in the Result column, e.g., where the result was “4 – 1”, highlight the number 4. Dremio will display a drop-down menu.
• Select “Extract”

Name the new column “homegoals”

Enter a name for the New Field, e.g., “homegoals” and click “Apply.”

Create a new column “awaygoals”

Apply the same process to create a new field for “awaygoals.”
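For readers who want to see the same extraction outside the Dremio interface, here is a rough Python equivalent. It assumes the result is stored as text such as “4 – 1” (with either a hyphen or an en dash between the two numbers); the function name `split_result` is hypothetical and is not part of Dremio.

```python
import re

# Two integers separated by a hyphen or an en dash, with optional spaces
RESULT_PATTERN = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*$")

def split_result(result):
    """Split a result string such as '4 – 1' into (homegoals, awaygoals)."""
    match = RESULT_PATTERN.match(result)
    if match is None:
        raise ValueError(f"Unrecognised result: {result!r}")
    return int(match.group(1)), int(match.group(2))

homegoals, awaygoals = split_result("4 – 1")
print(homegoals, awaygoals)  # 4 1
```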

Save the Virtual Data Set

Click “Save As;” in the subsequent popup, enter a name for the dataset (e.g., “eplmatches”) and click “Save.”

Now we are ready to analyze our prepared dataset. Dremio’s SQL Editor is a great interface for us to explore the data further.

### View the Results - The League Table Showing All Completed Games

As a reminder, here is the regular table, including all games played, up to and including March 9th, 2020. The Liverpool team is in 1st position with 82 points. They have a goal difference of +45. Norwich is at the bottom of the league, with 21 points.
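As a point of comparison, a regular table of this kind can be built from match results as in the sketch below. It is a simplified illustration rather than the query used in this article: it awards 3 points for a win and 1 for a draw, and ranks by points and then goal difference. The final tie-break on team name is an assumption added only to make the ordering deterministic, and the function name and sample data are made up.

```python
from collections import defaultdict

def league_table(matches):
    """Return rows of (team, played, points, goal_difference), best first.

    `matches` is a list of (home, away, home_goals, away_goals) tuples.
    """
    rows = defaultdict(lambda: {"played": 0, "points": 0, "gf": 0, "ga": 0})
    for home, away, home_goals, away_goals in matches:
        # Update both teams from their own point of view
        for team, scored, conceded in (
            (home, home_goals, away_goals),
            (away, away_goals, home_goals),
        ):
            row = rows[team]
            row["played"] += 1
            row["gf"] += scored
            row["ga"] += conceded
            if scored > conceded:
                row["points"] += 3
            elif scored == conceded:
                row["points"] += 1
    table = [
        (team, r["played"], r["points"], r["gf"] - r["ga"])
        for team, r in rows.items()
    ]
    # Sort by points, then goal difference, then name (assumed tie-break)
    return sorted(table, key=lambda t: (-t[2], -t[3], t[0]))

# Made-up sample data, not real results
table = league_table([("A", "B", 2, 0), ("B", "C", 1, 1), ("C", "A", 0, 1)])
print(table)  # [('A', 2, 6, 3), ('C', 2, 1, -1), ('B', 2, 1, -2)]
```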

### What If the Season Ends Now? - Recalculated League Table

Based on average points per game, across the set of games in which teams have played their opponents both home and away, it’s no surprise that the champions will still be Liverpool. At the other end of the table, Aston Villa will avoid relegation by 0.04 of a point!

Proposed final league positions for season 2019/2020.

### Winners and Losers - What Happened to Leicester?!

More surprisingly, Arsenal has made a major leap into 3rd place, mainly because many of the opponents against whom they have fared badly haven’t yet played the return game. As a result, many of Arsenal’s poor performances have been excluded from our calculation. Similarly, Leicester has fallen from 3rd to 9th. Let’s examine why:

• Leicester played 29 matches and gained 53 points.
• In our recalculated league table, Leicester played 20 matches and gained 29 points.

Therefore, excluding those 9 matches reduced Leicester’s points total by 24. In other words, they won 8 out of those 9 matches, but those matches are no longer included in our calculations.
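The arithmetic can be checked with a short Python sketch. It uses only the figures quoted above and the standard 3 points for a win and 1 for a draw; the variable names are illustrative.

```python
total_matches, total_points = 29, 53        # regular table
included_matches, included_points = 20, 29  # recalculated table

excluded_matches = total_matches - included_matches  # 9
excluded_points = total_points - included_points     # 24

# Every (wins, draws, losses) split of the excluded matches
# that adds up to the excluded points
splits = [
    (wins, draws, excluded_matches - wins - draws)
    for wins in range(excluded_matches + 1)
    for draws in range(excluded_matches - wins + 1)
    if 3 * wins + draws == excluded_points
]
print(splits)  # [(8, 0, 1)]

# Average points per game before and after the exclusions
print(round(total_points / total_matches, 2))        # 1.83
print(round(included_points / included_matches, 2))  # 1.45
```

Eight wins and one defeat is the only combination that gives 24 points from 9 matches, which is why the exclusions hurt Leicester so much.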

Here’s a list of the Leicester results that are excluded from our final table. Unfortunately for Leicester, 8 of those 9 games were wins. Unlucky Foxes!

For completeness, this table contains only the excluded games. Each team’s points total is effectively its points deduction from our final calculation.

We can see the biggest winners and losers from our recalculated league.

• Liverpool lost the most points because they won all of their excluded matches. However, this did not affect their final league position.
• Leicester lost 24 points and this did affect their final position. They fell from 3rd to 9th.
• Everton lost only 7 points from their excluded 9 matches; their overall points total was not as affected, so they climbed from 12th to 8th.
• Arsenal lost only 8 points, and they climbed from 9th to 3rd in our final table.
