About sports-quant

End-to-end NFL data pipeline that scrapes PFF grades and Pro Football Reference game data, builds analysis-ready datasets, and trains ensemble XGBoost models with walk-forward backtesting.

Published by

thadhutch

sports-quant

March Madness bracket prediction and NFL over/under modeling.

2025 bracket prediction — 51 of 63 games correct (81.0%)

Model Highlights


Highlight	Detail
81.0%	Bracket accuracy (2025)
4 of 6	Champions correctly predicted
Back-to-Back	UConn champion picks (2023–2024)
Called It	NC State's Cinderella run (2024)

Year-by-Year Accuracy

Results from v6b LightGBM ensemble with forward simulation, backtested across 6 tournaments:

Year	Accuracy	Correct / Total	Champion	Champion Correct?	Highlights
2025	79.4%	50 / 63	Florida (1)	No	Perfect Elite 8 (4/4), Sweet 16: 87.5%
2024	76.2%	48 / 63	UConn (1)	Yes	Perfect Final Four + National Championship
2019	76.2%	48 / 63	Virginia (1)	Yes	R32: 100% — perfect second round
2022	71.4%	45 / 63	Kansas (1)	Yes	Perfect Elite 8 (4/4)
2021	67.7%	42 / 62	Baylor (1)	No	Perfect Final Four
2023	65.1%	41 / 63	UConn (4)	Yes	F4 + NCG correct despite historically wild year

Average accuracy: 72.7% across 6 tournaments (274 / 377 games)
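
The aggregate figure can be reproduced from the Correct / Total column. A minimal sketch that uses only the numbers in the table above; the variable names are illustrative and not taken from the project:

```python
# (year, correct, total) copied from the Year-by-Year Accuracy table.
results = [
    (2025, 50, 63),
    (2024, 48, 63),
    (2019, 48, 63),
    (2022, 45, 63),
    (2021, 42, 62),
    (2023, 41, 63),
]

correct = sum(c for _, c, _ in results)
total = sum(t for _, _, t in results)

# Pooled accuracy weights each game equally rather than each tournament.
pooled_accuracy = correct / total

for year, c, t in sorted(results):
    print(f"{year}: {c / t:.1%}")
print(f"Pooled: {correct} / {total} = {pooled_accuracy:.1%}")
```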

Upset Predictions

The model's best upset calls — games where a lower-seeded team was correctly predicted to win:

2024

Round	Prediction	Seed Gap	Result
R64	12 Grand Canyon over 5 Saint Mary's	7	Correct
R64	11 NC State over 6 Texas Tech	5	Correct — NC State went on a Cinderella run to the Final Four
S16	4 Alabama over 1 North Carolina	3	Correct

2023

Round	Prediction	Seed Gap	Result
NCG	4 UConn wins it all	—	Correct — a 4-seed national champion is rare

2019

Round	Prediction	Seed Gap	Result
R64	12 Oregon over 5 Wisconsin	7	Correct

How the March Madness Model Works

The March Madness model uses a LightGBM ensemble trained on historical tournament data with features derived from team performance metrics, seeding, and matchup interactions.

Feature engineering — KenPom ratings, Barttorvik T-Rank, seed-based statistics, conference strength, and matchup interaction features
Ensemble prediction — Multiple LightGBM models vote on each game's win probability
Forward simulation — The bracket is filled round-by-round, feeding predicted winners into the next round
Seed debiasing — Adjusts for historical seed-vs-seed upset rates to avoid over-favoring top seeds
Backtesting — Every prediction is out-of-sample; the model never sees future tournament results during training
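
The ensemble and forward-simulation steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's implementation: the "vote" is modelled as a plain average of win probabilities, the toy rating-based models stand in for trained LightGBM models, and `ensemble_prob`, `simulate_bracket`, and `make_toy_model` are hypothetical names. Seed debiasing is left out.

```python
import math
from statistics import mean
from typing import Callable, List, Sequence

# A model maps (team_a, team_b) to P(team_a beats team_b).
Model = Callable[[str, str], float]


def ensemble_prob(models: Sequence[Model], a: str, b: str) -> float:
    """Average the win probability over all models (soft voting, an assumption)."""
    return mean(m(a, b) for m in models)


def simulate_bracket(teams: Sequence[str], models: Sequence[Model]) -> List[List[str]]:
    """Fill the bracket round by round, feeding predicted winners forward.

    `teams` is ordered so that adjacent pairs meet in the first round.
    Returns the list of winners for each round.
    """
    n = len(teams)
    if n < 2 or n & (n - 1):
        raise ValueError("number of teams must be a power of two")
    rounds: List[List[str]] = []
    current = list(teams)
    while len(current) > 1:
        winners = []
        for i in range(0, len(current), 2):
            a, b = current[i], current[i + 1]
            winners.append(a if ensemble_prob(models, a, b) >= 0.5 else b)
        rounds.append(winners)
        current = winners
    return rounds


# Toy stand-ins for trained models: logistic curves over made-up ratings.
ratings = {"A": 10.0, "B": 4.0, "C": 7.0, "D": 8.0}


def make_toy_model(scale: float) -> Model:
    return lambda a, b: 1.0 / (1.0 + math.exp(-(ratings[a] - ratings[b]) / scale))


toy_models = [make_toy_model(s) for s in (1.0, 2.0, 4.0)]
bracket = simulate_bracket(["A", "B", "C", "D"], toy_models)
print(bracket)
```

Because each round consumes the previous round's predicted winners, one early miss can propagate through later rounds, which is why bracket accuracy is reported per round as well as overall.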

Survivor Pool Optimizer

The project also includes a survivor pool optimizer that uses the model's round-by-round probabilities to select optimal picks across multiple strategies:

Greedy — Pick the highest-probability survivor each round
Bracket-aware — Avoid picking teams from the same bracket side
Monte Carlo optimal — Simulate thousands of scenarios to maximize expected survival
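
As a rough illustration of the greedy strategy, the sketch below picks the most likely winner each round. It assumes the common survivor-pool rule that a team cannot be picked twice and that round outcomes are independent; the function names and probabilities are made up for the example and are not from the project.

```python
import math
from typing import Dict, List, Sequence


def greedy_survivor(round_probs: Sequence[Dict[str, float]]) -> List[str]:
    """Pick, for each round, the unused team with the highest win probability."""
    used = set()
    picks = []
    for round_number, probs in enumerate(round_probs, start=1):
        candidates = {team: p for team, p in probs.items() if team not in used}
        if not candidates:
            raise ValueError(f"no unused team available in round {round_number}")
        team = max(candidates, key=candidates.get)
        picks.append(team)
        used.add(team)
    return picks


def survival_probability(round_probs: Sequence[Dict[str, float]], picks: Sequence[str]) -> float:
    """Probability that every pick wins, treating rounds as independent."""
    return math.prod(probs[team] for probs, team in zip(round_probs, picks))


# Made-up round-by-round win probabilities for three placeholder teams.
example_rounds = [
    {"Team A": 0.95, "Team B": 0.93, "Team C": 0.90},
    {"Team A": 0.85, "Team B": 0.80},
    {"Team A": 0.70, "Team B": 0.60, "Team C": 0.55},
]

picks = greedy_survivor(example_rounds)
print(picks, round(survival_probability(example_rounds, picks), 3))
```

The example shows the weakness of greedy selection: spending the strongest team in the first round leaves a weaker pick for the last one, which is the kind of trade-off the bracket-aware and Monte Carlo strategies are meant to address.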

NFL Over/Under Modeling

An end-to-end data pipeline that scrapes PFF team grades and Pro Football Reference game/betting data, builds analysis-ready datasets, and trains an ensemble XGBoost model for NFL over/under prediction.

Accuracy by Algorithm Score

How the NFL Model Works

The core idea is simple: don't try to predict every game — find the games where the model is reliably right, and only bet those.

On each game-day, the pipeline trains 50 XGBoost models with different random seeds on all available historical data. It keeps the top 3 by weighted seasonal accuracy score and counts a pick only when all three agree. Each consensus pick gets an algorithm score that captures how well the ensemble has historically performed at that confidence level.
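
The selection and consensus steps can be sketched as follows. This is a simplified illustration: the weighted seasonal accuracy scores are taken as given inputs (how the project computes them is not described here), and `select_top_models`, `consensus_pick`, and the model names are hypothetical.

```python
from typing import Dict, List, Optional


def select_top_models(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k model names with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def consensus_pick(predictions: Dict[str, str], selected: List[str]) -> Optional[str]:
    """Return the shared pick if every selected model agrees, else None."""
    picks = {predictions[name] for name in selected}
    return picks.pop() if len(picks) == 1 else None


# Made-up scores for four of the models trained on one game-day.
scores = {"seed_07": 0.61, "seed_21": 0.59, "seed_33": 0.58, "seed_02": 0.52}
top3 = select_top_models(scores)

# All three selected models say "over", so the game counts as a pick.
agree = {"seed_07": "over", "seed_21": "over", "seed_33": "over", "seed_02": "under"}
# One dissenting model is enough to skip the game.
split = {"seed_07": "over", "seed_21": "over", "seed_33": "under", "seed_02": "under"}

print(top3, consensus_pick(agree, top3), consensus_pick(split, top3))
```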

Accuracy by Algorithm Score and Season

Higher algorithm-score bins tend to stay accurate across multiple seasons, while lower bins stay inaccurate.

Technical Details

Parameter	Value
Models trained per game-day	50
Models kept after selection	Top 3 by weighted seasonal accuracy
Consensus requirement	All 3 must agree
Algorithm score	Weighted blend of per-model confidence-bin accuracy (0.4 / 0.35 / 0.25)
Bet sizing	1% Kelly criterion
Starting simulation capital	$100
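
A worked example of the algorithm score blend, assuming the three weights apply to the selected models in rank order (best model first). That ordering, the function name, and the accuracy values are assumptions made for illustration only.

```python
from typing import Sequence

# Blend weights from the table above.
WEIGHTS = (0.4, 0.35, 0.25)


def algorithm_score(bin_accuracies: Sequence[float]) -> float:
    """Weighted blend of each selected model's historical accuracy
    in the confidence bin that the current pick falls into."""
    if len(bin_accuracies) != len(WEIGHTS):
        raise ValueError(f"expected {len(WEIGHTS)} accuracies, got {len(bin_accuracies)}")
    return sum(w * a for w, a in zip(WEIGHTS, bin_accuracies))


# Made-up per-model bin accuracies for one consensus pick.
score = algorithm_score([0.60, 0.58, 0.55])
print(f"{score:.4f}")  # 0.4*0.60 + 0.35*0.58 + 0.25*0.55
```

Since the weights sum to 1, the score stays on the same 0 to 1 scale as the underlying accuracies.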

NFL Pipeline Architecture

PFF Scrape              PFR Scrape
    |                       |
    v                       v
Extract Dates          Normalize Dates
    |                       |
    v                       v
Normalize Names        Normalize Names
    |                       |
    +----------+   +--------+
               |   |
               v   v
              Merge
                |
                v
           Over/Under
                |
                v
         Rolling Averages
                |
                v
          Games Played
                |
                v
            Rankings
                |
       +--------+--------+
       |                  |
       v                  v
  Model Train        Backtest
  (ensemble)       (walk-forward)

Example Charts

Vegas Line Accuracy

Vegas Accuracy by Conditions

O/U Line vs Actual

PFF vs Vegas Spread

PFF Grade vs Points

Correlation Heatmap

Team Ranking Heatmap

Upset Rate

Underdog Teams

Dogs That Bite

Features

March Madness Bracket Prediction — LightGBM ensemble with forward simulation, seed debiasing, and survivor pool optimization
PFF Scraping — Selenium-based scraper for PFF team grades (requires PFF Premium; manual login on first run, cookies cached for subsequent runs)
PFR Scraping — Proxy-rotated scraper for Pro Football Reference boxscores
Data Normalization — Standardizes dates and team names across sources
Dataset Merging — Inner join on date + team columns
Rolling Averages — Pre-game cumulative stat averages per team per season
Games Played Tracking — Cumulative games played before each matchup
Feature Rankings — Per-date rankings across all teams
Ensemble Training — Trains 50 XGBoost models per game-day, selects top 3 by weighted seasonal accuracy, requires consensus agreement, and runs a financial simulation
Walk-Forward Backtesting — Trains 50 models across every historical date using walk-forward validation and averages metrics across all models
CLI + Python API — Run the full pipeline or any individual step
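
To make the rolling-average and games-played features concrete, the sketch below computes a pre-game cumulative average per team per season, so each row only sees games played before it. The record layout, the column names (`games_played`, `<stat>_pregame_avg`), and the sample values are assumptions for illustration; the project's own column names may differ.

```python
from collections import defaultdict
from typing import Dict, List


def add_pregame_features(games: List[Dict], stat: str) -> List[Dict]:
    """Return copies of `games` (sorted by date) with two extra fields:
    the number of games the team had already played that season, and the
    team's average of `stat` over those earlier games (None if there are none).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    enriched = []
    for game in sorted(games, key=lambda g: g["date"]):
        key = (game["season"], game["team"])
        played = counts[key]
        average = totals[key] / played if played else None
        enriched.append({**game, "games_played": played, f"{stat}_pregame_avg": average})
        # Update running totals only after the features are recorded,
        # so the current game never leaks into its own features.
        totals[key] += game[stat]
        counts[key] += 1
    return enriched


# Made-up rows for one team in one season.
sample = [
    {"season": 2024, "team": "KC", "date": "2024-09-05", "points": 20},
    {"season": 2024, "team": "KC", "date": "2024-09-15", "points": 30},
    {"season": 2024, "team": "KC", "date": "2024-09-22", "points": 10},
]

features = add_pregame_features(sample, "points")
for row in features:
    print(row["date"], row["games_played"], row["points_pregame_avg"])
```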

Installation

Install from PyPI:

pip install sports-quant

Or install from source with Poetry:

git clone https://github.com/thadhutch/sports-quant.git
cd sports-quant
poetry install

Prerequisites

Requirement	Why
Python 3.12+	Runtime
Google Chrome	PFF scraper uses Selenium to render client-side data
PFF Premium subscription	Authenticates access to PFF team grades
Rotating proxies (CSV)	PFR rate-limits aggressively; proxies prevent blocks

Configuration

Create a .env file (see .env.example) or export environment variables directly:

cp .env.example .env

Variable	Default	Description
`NFL_SEASONS`	`2025`	Comma-separated seasons to scrape from PFF
`NFL_START_YEAR`	`2025`	First year for PFR boxscore URL collection
`NFL_END_YEAR`	`2025`	Last year for PFR boxscore URL collection
`NFL_MAX_WEEK`	`18`	Final week to scrape in the last season
`NFL_DATA_DIR`	`data`	Base directory for all output files
`NFL_PROXY_FILE`	`proxies/proxies.csv`	Path to proxy list (`address:port:user:password` per line)
`NFL_MODEL_CONFIG`	`model_config.yaml`	Path to model configuration file
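
For illustration, the variables above could be read in Python along these lines. This is a sketch under assumptions, not the contents of the project's `_config.py`: it reads an environment mapping directly (loading the `.env` file is not shown), and `Settings`, `load_settings`, and `parse_proxy_line` are hypothetical names.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Settings:
    seasons: Tuple[int, ...]
    start_year: int
    end_year: int
    max_week: int
    data_dir: Path
    proxy_file: Path
    model_config: Path


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    seasons = tuple(
        int(part) for part in env.get("NFL_SEASONS", "2025").split(",") if part.strip()
    )
    return Settings(
        seasons=seasons,
        start_year=int(env.get("NFL_START_YEAR", "2025")),
        end_year=int(env.get("NFL_END_YEAR", "2025")),
        max_week=int(env.get("NFL_MAX_WEEK", "18")),
        data_dir=Path(env.get("NFL_DATA_DIR", "data")),
        proxy_file=Path(env.get("NFL_PROXY_FILE", "proxies/proxies.csv")),
        model_config=Path(env.get("NFL_MODEL_CONFIG", "model_config.yaml")),
    )


def parse_proxy_line(line: str) -> Dict[str, str]:
    """Split one `address:port:user:password` line into named fields.

    maxsplit=3 keeps any colons inside the password intact.
    """
    address, port, user, password = line.strip().split(":", 3)
    return {"address": address, "port": port, "user": user, "password": password}


defaults = load_settings({})
custom = load_settings({"NFL_SEASONS": "2023, 2024", "NFL_MAX_WEEK": "10"})
print(defaults.seasons, custom.seasons, custom.max_week)
```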

Usage

CLI

# March Madness
sports-quant march-madness backtest         # Backtest across historical tournaments
sports-quant march-madness simulate 2025    # Generate 2025 bracket predictions
sports-quant march-madness survivor 2025    # Run survivor pool optimizer

# NFL (full pipeline)
sports-quant pipeline

# NFL (individual steps)
sports-quant scrape pff
sports-quant scrape pfr
sports-quant process all
sports-quant model train
sports-quant model backtest

Python API

import sports_quant

# March Madness
sports_quant.run_march_madness_backtest()
sports_quant.simulate_bracket(year=2025)

# NFL
sports_quant.run_full_pipeline()
sports_quant.run_training()
sports_quant.run_backtest()

Project Structure

sports-quant/
├── src/sports_quant/
│   ├── __init__.py           # Public API re-exports
│   ├── _config.py            # Paths, env vars, logging
│   ├── cli.py                # Click CLI entry point
│   ├── pipeline.py           # Pipeline orchestrators
│   ├── teams.py              # Team name/abbreviation mappings
│   ├── march_madness/        # March Madness bracket prediction
│   ├── scrapers/
│   │   ├── pff.py            # PFF grades scraper (Selenium)
│   │   ├── pfr.py            # PFR game data scraper
│   │   ├── pfr_urls.py       # PFR boxscore URL collector
│   │   ├── auth.py           # PFF authentication
│   │   └── proxies.py        # Proxy loading utilities
│   ├── parsers/
│   │   ├── pff_dates.py      # PFF date extraction
│   │   ├── pff_teams.py      # PFF team name normalization
│   │   ├── pfr_dates.py      # PFR date normalization
│   │   └── pfr_teams.py      # PFR team name extraction
│   ├── processing/
│   │   ├── merge.py          # Merge PFF + PFR datasets
│   │   ├── over_under.py     # O/U betting line extraction
│   │   ├── rolling_averages.py
│   │   ├── games_played.py
│   │   └── rankings.py
│   └── modeling/
│       ├── __init__.py       # Public API (run_training, run_backtest)
│       ├── _features.py      # Shared feature definitions
│       ├── _data.py          # Data loading and preparation
│       ├── _training.py      # Single-model and ensemble training
│       ├── _scoring.py       # Season weighting, consensus, model selection
│       ├── _simulation.py    # Financial simulation / profit tracking
│       ├── train.py          # Ensemble training orchestrator
│       ├── backtest.py       # Walk-forward backtesting orchestrator
│       └── plots.py          # All modeling-related charts
├── tests/
├── model_config.yaml
├── pyproject.toml
├── Makefile
├── LICENSE
└── README.md

Development

git clone https://github.com/thadhutch/sports-quant.git
cd sports-quant
poetry install
poetry run pytest -v

CI runs automatically on every push to master and on pull requests via GitHub Actions. Releases are published to PyPI through Trusted Publishers.

Contributing

Fork the repository
Create a feature branch (git checkout -b feature/my-feature)
Make your changes and add tests
Run the test suite (poetry run pytest -v)
Commit and push
Open a Pull Request

Known Limitations

Data files are not tracked in git. Run the pipeline to generate them.
PFF scraping is DOM-dependent. If PFF changes their frontend, selectors need updating.
PFR scraping requires rotating proxies. Without them, requests will be rate-limited.
PFF login requires manual interaction on first run. Cookies are cached afterward.

License

This project is licensed under the MIT License.

PyPI · Issues · CI Status
