Welcome to the 2024 Football Playoffs Hackathon powered by Dremio. Teams from across the globe will apply their analytics prowess to predict:
American Champion
National Champion
Overall League Winner
Each team must analyze the provided stats and support their selections with detailed insights.
Judging criteria will include the accuracy of predictions, the quality of analysis, the clarity of visual presentation, and the depth of insights shared.
Time for kickoff!
Why Compete
All Qualified Submissions: Meet the requirements, and all team members will receive a special edition Gnarly football t-shirt, water bottle and Hackathon digital badge.
Top Submission: The selected winning team will be given the chance to present their solution live at Subsurface 2025 (Date TBD) in person in NYC.
Note: Teams should be no larger than 5 people. You may include more than 5 members if you wish, but qualified submissions will receive Dremio swag for at most 5 people per team. Current Dremio employees are ineligible to compete in this Hackathon.
Disclaimer: Due to applicable trade control law and the policies of our shipping partners, prizes cannot be shipped to the following countries: Cuba, Iran, North Korea, Syria, Ukraine, Russia, and Belarus. This list is subject to change without notice.
Introduction
This guide will walk you through setting up a powerful local data environment so you can focus on what matters—gaining insights and building visualizations, applications, or even AI/ML models with real football data.
Using Docker Compose, we’ll set up a local environment to run Dremio for querying, MinIO for data storage, Apache Superset for BI and a Jupyter Notebook environment for interactive data science work. Here’s everything you need to get started.
How to Participate
Once you've set up your local environment and loaded the football data, here's how to proceed:
Dive into the Dataset
Analyze the main dataset provided, exploring key insights and patterns.
Feel free to supplement this data by integrating additional sources to enrich your analysis.
Transform Your Data with Dremio Views
Use Dremio’s semantic layer to create custom views, transforming the data to suit your specific project goals.
Build Your Final Project
Design a compelling visualization, generate an insightful report, or develop an application that highlights your findings.
Once you're happy with your project, create a short video presentation:
Record a 3-5 minute video:
Spend 1-2 minutes walking through your data modeling and transformations in Dremio.
Use 1-2 minutes to showcase your final product.
Dedicate 1-2 minutes to share your experience and insights from the project.
Upload the Video:
Post the video to YouTube as “Unlisted” (or public, if you prefer).
Submit the Form:
Share the video link via the provided submission form.
Setting Up Your Environment
Step 1: Understanding Docker and Docker Compose
Docker is a platform for developing, shipping, and running containerized applications. Containers bundle software together with its dependencies, ensuring consistent behavior across environments.
Docker Compose is a tool that allows you to define and run multi-container Docker applications using a single docker-compose.yml file. In this file, you define all services, their configurations, and how they interact.
Step 2: Creating the Docker Compose File
Let’s create the docker-compose.yml file that defines our services.
Open a text editor (VS Code, Notepad, etc.).
Create a new file named docker-compose.yml.
Copy and paste the following configuration into it (the superset and datanotebook services are optional if you prefer other tools):
version: "3"

services:
  # Nessie Catalog Server Using In-Memory Store
  nessie:
    image: projectnessie/nessie:latest
    container_name: nessie
    networks:
      - iceberg
    ports:
      - 19120:19120

  # MinIO Storage Server
  ## Creates two buckets named lakehouse and lake
  minio:
    image: minio/minio:latest
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
    networks:
      - iceberg
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
    entrypoint: >
      /bin/sh -c "
      minio server /data --console-address ':9001' &
      sleep 5 &&
      mc alias set myminio http://localhost:9000 admin password &&
      mc mb myminio/lakehouse &&
      mc mb myminio/lake &&
      tail -f /dev/null
      "

  # Dremio
  dremio:
    platform: linux/x86_64
    image: dremio/dremio-oss:latest
    container_name: dremio
    environment:
      - DREMIO_JAVA_SERVER_EXTRA_OPTS=-Dpaths.dist=file:///opt/dremio/data/dist
    networks:
      - iceberg
    ports:
      - 9047:9047
      - 31010:31010
      - 32010:32010

  # Superset
  superset:
    image: alexmerced/dremio-superset
    container_name: superset
    networks:
      - iceberg
    ports:
      - 8088:8088

  # Data Science Notebook (Jupyter Notebook)
  datanotebook:
    image: alexmerced/datanotebook
    container_name: datanotebook
    environment:
      - JUPYTER_TOKEN= # Set a token if desired, or leave blank to disable token authentication
    networks:
      - iceberg
    ports:
      - 8888:8888
    volumes:
      - ./notebooks:/home/pydata/work # Mounts a local folder for persistent notebook storage

networks:
  iceberg:
Explanation of Services:
Nessie: Provides version control for data, useful for tracking data lineage and historical states.
MinIO: Acts as an S3-compatible object store, holding the data buckets that Dremio will use as data sources.
Dremio: The query engine that enables SQL-based interactions with our data stored in MinIO and Nessie.
Superset: A BI tool for creating and visualizing dashboards based on data queried through Dremio.
Datanotebook: A Jupyter Notebook environment for interactive data science work against the same data.
Step 3: Running the Environment
With the docker-compose.yml file ready, let’s start the environment.
Open a terminal and navigate to the folder where you saved docker-compose.yml.
Run the following command to start all services in detached mode: docker-compose up -d
Wait a few moments for the services to initialize. Verify they are running with: docker ps
You should see containers for Nessie, MinIO, Dremio, Datanotebook and Superset.
Run the following command to initialize superset before using it: docker exec -it superset superset init
Step 4: Verifying the Services
After starting the containers, check that each service is reachable:
Dremio: http://localhost:9047
MinIO Console: http://localhost:9001
Nessie API: http://localhost:19120
Superset: http://localhost:8088
Jupyter Notebook: http://localhost:8888
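If you prefer to verify everything from a script, here is a minimal sketch in Python (using the requests library, which is an assumption on my part, and the default ports from the docker-compose.yml above) that pings each service and prints whether it responds:

# check_services.py - quick reachability check for the local stack
# Assumes the default ports from docker-compose.yml; adjust if you changed them.
import requests

SERVICES = {
    "Dremio": "http://localhost:9047",
    "MinIO Console": "http://localhost:9001",
    "Nessie": "http://localhost:19120",
    "Superset": "http://localhost:8088",
    "Jupyter Notebook": "http://localhost:8888",
}

for name, url in SERVICES.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: {url} -> HTTP {status}")
    except requests.exceptions.RequestException as err:
        print(f"{name}: {url} -> not reachable ({err})")

Any HTTP response (even a redirect or 401) means the container is up; "not reachable" usually means the service is still starting or a port mapping is wrong.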
Step 5: Adding Nessie and MinIO as Data Sources in Dremio
Now, let’s configure Dremio to use Nessie as a catalog and MinIO as an S3-compatible data source.
Connecting Nessie as a Catalog in Dremio
In Dremio, go to Add Source.
Choose Nessie from the source types and enter the following configuration:
General Settings:
Name: lakehouse
Endpoint URL: http://nessie:19120/api/v2
Authentication: None
Storage Settings:
Access Key: admin
Secret Key: password
Root Path: lakehouse
Connection Properties:
fs.s3a.path.style.access: true
fs.s3a.endpoint: minio:9000
dremio.s3.compat: true
Save the source. Dremio will now connect to Nessie, and lakehouse should appear in the Datasets section.
Connecting MinIO as an S3-Compatible Source in Dremio
Again, click Add Source in Dremio and select S3 as the source type.
Configure MinIO with these settings:
General Settings:
Name: lake
Credentials: AWS Access Key
Access Key: admin
Secret Key: password
Encrypt Connection: unchecked
Advanced Options:
Enable Compatibility Mode: true
Root Path: /lake
Connection Properties:
fs.s3a.path.style.access: true
fs.s3a.endpoint: minio:9000
Save the source. The lake source will appear in the Datasets section of Dremio.
Step 6: Setting Up Superset for BI Visualizations
Superset allows us to create dashboards based on data queried from Dremio.
If you haven’t already done so in Step 3, initialize Superset by running this command in a new terminal: docker exec -it superset superset init
Open Superset at http://localhost:8088, log in, and navigate to Settings > Database Connections.
Add a new database:
Select Other as the type.
Enter the connection string (replace USERNAME and PASSWORD with your Dremio credentials): dremio+flight://USERNAME:PASSWORD@dremio:32010/?UseEncryption=false
Click Test Connection to verify connectivity, then Save.
To add datasets, select the + icon, choose the table or view you want to analyze (for example, a view you created in Dremio), and add it to your workspace.
Now, create charts and add them to a dashboard.
Step 7: Shutting Down the Environment
To stop the environment, run:
docker-compose down -v
This removes the containers and their volumes, giving you a clean slate for the next session. Note that with this configuration Dremio, Nessie, and MinIO state does not persist between sessions; only the ./notebooks folder mounted from your host is preserved.
Conclusion
You’ve now set up a powerful local data environment with Nessie for versioned data, MinIO for S3-compatible storage, Dremio for SQL querying, and Superset for BI visualization. This setup enables you to perform SQL-based data operations, track data history, and create visual insights from your data lakehouse environment, all running locally on Docker Compose. Happy data engineering!
Loading Your Data
To incorporate the football dataset from Kaggle into your Dremio and MinIO environment, you’ll start by downloading the data, then upload it to the MinIO service. Once in MinIO, it will be accessible in Dremio as a data source for analysis and querying. Here’s how to do it step-by-step.
This dataset includes player stats, game details, and play-by-play data aggregated from several sources.
Step 1: Download and Extract the Dataset
Download the Dataset:
If you have a Kaggle account, log in and click on the Download button on the dataset page.
The dataset will download as a compressed file (usually a .zip file) containing multiple .parq files (Parquet format), ideal for analysis and compatible with data lake storage.
Extract the Files:
Once the download is complete, unzip the file. You’ll find files like:
games.parq
players.parq
plays.parq
tackles.parq
tracking_all_weeks.parq
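Before uploading anything, you can optionally sanity-check the extracted files locally. Below is a small sketch using pandas (with pyarrow installed for Parquet support); it assumes the .parq files sit in your current working directory, and the exact columns will depend on the dataset version you downloaded:

# inspect_downloads.py - confirm the Parquet files read cleanly
import pandas as pd

files = ["games.parq", "players.parq", "plays.parq", "tackles.parq"]
# tracking_all_weeks.parq is much larger; read it the same way if you have the memory.

for path in files:
    df = pd.read_parquet(path)  # requires pyarrow (or fastparquet)
    print(f"{path}: {len(df)} rows x {len(df.columns)} columns")
    print(df.head(3))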
Step 2: Prepare the MinIO Environment
Next, we’ll upload this data to the MinIO instance, simulating an S3 bucket storage.
Open the MinIO console at http://localhost:9001 and log in using the credentials specified in the Docker Compose setup:
Username: admin
Password: password
Locate or Create the lake Bucket:
In the MinIO console, you should see a bucket called lake, which was created automatically by our Docker Compose configuration.
If you do not see the lake bucket, click + to create a new bucket and name it lake.
Step 3: Upload the Dataset Files to MinIO
Upload Files:
In the MinIO console, navigate to the lake bucket.
Click the Upload button and select the Parquet files you extracted from the Kaggle download (e.g., games.parq, players.parq, plays.parq, tackles.parq, tracking_all_weeks.parq).
MinIO will store these files in the lake bucket, making them available as raw data for querying in Dremio.
Verify Uploads:
After uploading, you should see each of the dataset files listed in the lake bucket within MinIO.
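If you would rather script the upload than click through the console, here is a sketch using boto3 against MinIO’s S3 API; it assumes the admin/password credentials and the 9000 port from the Docker Compose file, and that the files are in your current working directory:

# upload_to_minio.py - push the Kaggle Parquet files into the "lake" bucket
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # MinIO's S3 API port from docker-compose.yml
    aws_access_key_id="admin",
    aws_secret_access_key="password",
)

files = ["games.parq", "players.parq", "plays.parq",
         "tackles.parq", "tracking_all_weeks.parq"]

for path in files:
    s3.upload_file(path, "lake", path)  # the lake bucket was created by the MinIO entrypoint
    print(f"uploaded {path} to s3://lake/{path}")

Either way, the result is the same: the raw files end up in the lake bucket where Dremio can see them.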
Step 4: Access the Data in Dremio
Now that the data is stored in MinIO, you can connect to it in Dremio.
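Once the files show up under the lake source (you may need to click each file in the Dremio UI and format it as a dataset first), you can also query them straight from the Jupyter notebook over Dremio’s Arrow Flight port. The sketch below uses pyarrow’s Flight client; USERNAME and PASSWORD are placeholders for the Dremio credentials you create at first login, port 32010 comes from the Compose file, and the view name in the final comment is purely illustrative:

# query_dremio.py - run SQL against Dremio from the notebook via Arrow Flight
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://localhost:32010")

# Authenticate with the Dremio user you created at http://localhost:9047
token = client.authenticate_basic_token("USERNAME", "PASSWORD")
options = flight.FlightCallOptions(headers=[token])

def run_query(sql):
    """Execute a SQL statement in Dremio and return the result as a pandas DataFrame."""
    info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
    reader = client.do_get(info.endpoints[0].ticket, options)
    return reader.read_all().to_pandas()

# Peek at one of the uploaded files in the lake source
print(run_query('SELECT * FROM lake."plays.parq" LIMIT 10'))

# The same helper can run DDL, for example saving a curated view into the Nessie catalog:
# run_query('CREATE VIEW lakehouse.my_views.plays_clean AS SELECT ... FROM lake."plays.parq"')

From here you can build the views described earlier, hand curated datasets to Superset, or keep working in pandas for your final project.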