Football Statistics Tracker ⚽
An end-to-end data engineering pipeline that collects, processes, and analyzes match results, standings, weather data, and Reddit comments for the top 5 European football leagues, and summarizes each matchday using Gemini. Data sources include the football-data.org API, the Open-Meteo API, PRAW (Reddit API), Maps...
Introduction
This project demonstrates a complete data pipeline for football (soccer) results, from data extraction to visualization. It applies common data engineering practices, including a data lake, transformation layers, and Infrastructure as Code (IaC) with Terraform.
Features
- Automated Data Collection: Scheduled data fetching from multiple APIs using Google Cloud Functions
- Multi-layer Data Architecture: Raw data stored in GCS, processed data in BigQuery, and user-facing data in Firestore
- Weather Integration: Match statistics enriched with weather data at match time
- Social Media (Reddit) Data: Reddit comments for fan sentiment
- Infrastructure as Code: Cloud Functions and Pub/Sub topics and subscriptions are defined and deployed with Terraform
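As an illustration of the weather integration, the sketch below picks the hourly weather reading closest to kickoff. It is a minimal example, not the project's actual code: the function name `weather_at_kickoff`, the sample values, and the layout of `hourly` (parallel lists keyed by variable name, in the style of an Open-Meteo hourly response) are assumptions, and kickoff and weather timestamps are assumed to be in the same time zone.

```python
from datetime import datetime


def weather_at_kickoff(kickoff: datetime, hourly: dict) -> dict:
    """Return the hourly reading closest to kickoff (hypothetical helper).

    `hourly` is assumed to hold parallel lists, e.g.
    {"time": ["2024-05-01T15:00", ...], "temperature_2m": [14.2, ...]}.
    """
    times = [datetime.fromisoformat(t) for t in hourly["time"]]
    # Index of the timestamp with the smallest absolute distance to kickoff.
    idx = min(range(len(times)), key=lambda i: abs(times[i] - kickoff))
    # Take the value at that index from every parallel list.
    return {name: values[idx] for name, values in hourly.items()}


# Made-up sample data for illustration only.
sample_hourly = {
    "time": ["2024-05-01T14:00", "2024-05-01T15:00", "2024-05-01T16:00"],
    "temperature_2m": [13.8, 14.2, 13.5],
    "precipitation": [0.0, 0.4, 1.1],
}
reading = weather_at_kickoff(datetime(2024, 5, 1, 15, 10), sample_hourly)
print(reading)
```

Choosing the nearest hour keeps the join simple; a real pipeline might instead average the readings across the full match duration.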
Architecture
The pipeline is organized into the following stages:
- Data Ingestion: Cloud Functions trigger on schedule to fetch data
- Storage Layers: Raw data (JSON) → External BQ tables (Parquet) → Processed data in BQ → Firestore
- Validation: Basic validation and data quality checks with Dataplex
- Summarization: Creation of short summaries in Markdown with Gemini
- Visualization: Web app for insights
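The step from raw JSON to tabular data in the storage layers can be sketched as a flattening function. This is an illustrative example only: `flatten_match`, the output column names, and the sample payload are assumptions loosely modeled on a nested match record, not the project's actual schema.

```python
import json


def flatten_match(match: dict) -> dict:
    """Flatten one nested match record into a single-level row (hypothetical schema)."""
    # The score may be missing for matches that have not been played yet.
    full_time = match.get("score", {}).get("fullTime", {})
    return {
        "match_id": match["id"],
        "utc_date": match["utcDate"],
        "matchday": match.get("matchday"),
        "home_team": match["homeTeam"]["name"],
        "away_team": match["awayTeam"]["name"],
        "home_goals": full_time.get("home"),
        "away_goals": full_time.get("away"),
    }


# Made-up raw payload standing in for a JSON file in the raw layer.
raw_payload = json.loads("""
{"matches": [{"id": 1, "utcDate": "2024-05-01T19:00:00Z", "matchday": 35,
  "homeTeam": {"name": "Team A"}, "awayTeam": {"name": "Team B"},
  "score": {"fullTime": {"home": 2, "away": 1}}}]}
""")
rows = [flatten_match(m) for m in raw_payload["matches"]]
print(rows[0])
```

Flat rows like these map directly onto columns, which is what a columnar format such as Parquet and a BigQuery table expect.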
Data Sources
- Football-data.org: Match data, team data, and standings
- Open-Meteo API: Historical weather data
- Reddit (via PRAW): Fan comments and sentiment
- Maps SDK: Location of stadiums
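To show how stadium location and match date might combine into a weather lookup, here is a minimal sketch that builds a request URL for the Open-Meteo historical weather endpoint. The helper `build_weather_url`, the chosen hourly variables, and the coordinates are assumptions for illustration; the project's real request parameters may differ.

```python
from urllib.parse import urlencode

# Open-Meteo historical weather endpoint.
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"


def build_weather_url(lat: float, lon: float, match_date: str) -> str:
    """Build a one-day hourly weather query for a stadium location (hypothetical helper)."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": match_date,  # ISO date, e.g. "2024-05-01"
        "end_date": match_date,
        "hourly": "temperature_2m,precipitation",
    }
    # Keep commas unescaped so the variable list stays readable.
    return f"{ARCHIVE_URL}?{urlencode(params, safe=',')}"


# Placeholder coordinates, not taken from the project.
url = build_weather_url(51.5549, -0.1084, "2024-05-01")
print(url)
```

The function only builds the URL and performs no network call, so it can be unit tested without API access.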
Technology Stack
| Category | Technologies |
|---|---|
| Cloud Platform | Google Cloud Platform (GCP) |
| Infrastructure as Code | Terraform |
| Programming Languages | Python, TypeScript (Svelte) |
| Data Storage | Cloud Storage, BigQuery, Firestore |
| Data Quality | Dataplex |
| Data Transformation | Dataform |
| Serverless Computing | Cloud Functions |
| Event-Driven Architecture | Pub/Sub |
| API Consumption | Football-data.org, Open-Meteo, Reddit API, Google Maps |
| CI/CD | GitHub Actions |
| Package Management | uv, pyproject.toml |
| Code Quality | Ruff, Bandit, Mypy |
| Testing | pytest |
| Web Framework | Svelte, ShadCN UI Components |
| Hosting | Firebase App Hosting |
| LLM | Google Gemini 2.5 Flash |
Project Structure
soccer-tracker-DE-project/
├── README.md
├── .gitignore
├── pyproject.toml
├── Github/workflows/ # CI/CD in Github Actions
│   ├── cd.yml
│   └── ci.yml
├── terraform/ # IaC definitions
│   ├── main.tf
│   ├── variables.tf
│   ├── pubsub.tf
│   └── cloud_functions.tf
├── cloud_functions/
│   ├── league_data/ # League and Teams data extraction and load
│   ├── discord_utils/ # Package for sending Discord notifications using webhooks
│   ├── match_data/ # Match data extraction and load
│   ├── weather_data/ # Weather data extraction and load
│   ├── reddit_data/ # Reddit data extraction and load
│   ├── standings_data/ # Standings data extraction and load for each matchday
│   ├── data_validation/ # Data validation using Dataplex
│   ├── serving_layer/ # Load data to firestore
│   └── generate_summaries/ # Generate match summaries with Gemini
├── soccer_tracker_ui/ # Svelte web app in Firebase
│   ├── src/
│   │   ├── lib/ # Reusable components
│   │   │   ├── components/ # UI components from [shadcn](https://next.shadcn-svelte.com/)
│   │   │   ├── firebase.ts # Firebase/Firestore connection
│   │   │   └── stores/ # Svelte stores for state management
│   │   └── routes/ # Page components
│   ├── package.json # Dependencies and scripts
│   ├── svelte.config.js # Svelte configuration
│   └── vite.config.js # Vite bundler config
└── tests/ # Test suite for Cloud Functions with Pytest
Additional Documentation
Web app
The project includes a Svelte web app for visualizing match results, weather data, and match summaries.
The app includes:
- Match Results
- Match summaries using an LLM (Gemini 2.5 Flash)
- Weather data during matches
- Comments from Reddit
The idea for this project came from a repo by digitalghost-dev.