About Coffee-Shop-Profit-Predictor

Predict the profitability of potential coffee shop locations using SQL and Python. Combines data engineering with feature-rich regression modeling, visual analytics, and business insights to support data-driven site selection and retail decision-making.

a

Published by

amirhosseinhonardoust

Visit View Profile

README.md

View on GitHub

Coffee Shop Profit Predictor

A professional machine learning project that predicts the monthly profit of potential coffee-shop locations using SQL feature engineering and a Python / scikit-learn regression workflow. The project loads location data into SQLite, creates reusable engineered features, compares multiple regression models, evaluates performance honestly, and ranks new candidate sites with business-readable risk explanations.

Important: This project is a site-selection decision-support demo, not a production financial forecasting system. It does not guarantee real-world profitability. It estimates likely profit from the available features and should be interpreted with the project limitations in mind.

Project Overview
What This Project Does
What This Project Does Not Do
Features
Charts and Visual Analysis
How the Model Works
Project Structure
Installation
Building the Database
Training the Model
Scoring Candidate Sites
Model Output
Evaluation
Testing
Code Quality
Data Statement
Limitations
Responsible Use
Future Improvements
Tech Stack
Author
License

Project Overview

Coffee-shop site selection is a practical business problem. A strong location can increase sales, while a poor location can create high rent costs, low traffic, and weak profitability.

This project demonstrates a clean end-to-end analytics workflow:

Load training and candidate-location data from CSV files
Store the data in a SQLite database
Create reusable SQL feature views
Train and compare regression models
Evaluate the selected model on a holdout test set
Score new candidate sites
Save metrics, charts, predictions, and ranked recommendations
Document limitations clearly

The goal is to demonstrate:

SQL + Python integration
Practical feature engineering
Reproducible machine learning workflow design
Honest model evaluation
Business-readable candidate ranking
Professional portfolio documentation

What This Project Does

This project can:

Predict monthly profit for coffee-shop locations
Engineer features using SQL views
Compare multiple regression models
Use cross-validation for model selection
Compare against a mean baseline
Generate holdout test-set diagnostics
Save trained model artifacts
Score new candidate locations
Rank candidates by predicted profitability
Add risk bands and explanation columns
Run automated tests
Document data and model limitations

What This Project Does Not Do

This project does not:

Guarantee actual future profit
Replace a full real-estate or finance analysis
Include labor cost, cost of goods sold, store size, opening hours, or seasonality
Use live market data
Validate predictions against real store openings
Provide a formal statistical prediction interval
Prove that a feature causally increases or decreases profit
Make final business decisions without human review

A production site-selection system would require real transaction data, lease terms, store size, local regulations, labor costs, competitor quality, seasonality, and geographic validation.

Features

SQLite database creation from CSV inputs
SQL feature engineering through reusable views
Centralized configuration in src/config.py
Data validation for required columns, missing values, numeric types, and impossible values
Regression model comparison using scikit-learn
Mean baseline model for honest comparison
Linear Regression, Ridge, ElasticNet, Random Forest, and Gradient Boosting experiments
Cross-validation for model comparison
Hyperparameter tuning for Ridge and ElasticNet
Holdout test-set evaluation
Saved sklearn model using joblib
Candidate-site ranking
Risk band assignment
Business-readable positive and negative drivers
Charts for model diagnostics
Unit tests with Python unittest
Portfolio-ready README documentation

Charts and Visual Analysis

The project automatically generates visual outputs during training to make model behavior easier to understand.

Generated charts are saved in:

outputs/charts/

The main charts include:

Chart	Purpose
Actual vs Predicted	Compares holdout actual profit with predicted profit
Residuals Histogram	Shows model error distribution
Feature Importance	Shows the strongest model signals
Model Comparison MAE	Compares models using mean absolute error

Actual vs Predicted Profit

This chart uses holdout test-set predictions only. That makes the diagnostic more honest than plotting predictions on data the model already trained on.

Residuals Histogram

The residual chart shows how far predictions are from actual profit values. A narrower distribution means lower prediction error.

Feature Importance

For the selected ElasticNet model, feature importance is based on standardized coefficients. These values should be interpreted as model associations, not causal proof.

Model Comparison

The model comparison chart helps show whether the selected model meaningfully improves over simpler alternatives and the mean baseline.

How the Model Works

The project uses a structured regression workflow:

CSV location data
        ↓
SQLite database
        ↓
SQL feature views
        ↓
Data validation
        ↓
Train/test split
        ↓
Model comparison + cross-validation
        ↓
Selected regression pipeline
        ↓
Holdout evaluation
        ↓
Candidate-site scoring
        ↓
Ranked recommendations with risk bands

SQL Feature Engineering

Feature creation is performed in src/queries.sql, which creates two SQLite views:

features_train
features_candidates

Engineered features include:

Feature	Formula	Purpose
`demand_adj`	`foot_traffic / (1 + competition)`	Demand adjusted for competitive intensity
`wknd_traffic`	`weekend_activity × foot_traffic`	Weekend demand potential
`price_income`	`coffee_price × (median_income / 1000)`	Price-income fit / affordability signal
`promo_comp_adj`	`promo_spend / (1 + competition)`	Promotion adjusted for competition

Regression Models

The training script compares several models:

Mean Baseline
Linear Regression
Ridge
ElasticNet
Random Forest
Gradient Boosting

ElasticNet is currently selected because it has the best mean cross-validated MAE among the non-baseline models in the current run.

Project Structure

Coffee-Shop-Profit-Predictor/
│
├── data/
│   ├── locations_train.csv
│   └── locations_candidates.csv
│
├── outputs/
│   ├── charts/
│   │   ├── actual_vs_predicted.png
│   │   ├── feature_importance.png
│   │   ├── model_comparison_mae.png
│   │   └── residuals_hist.png
│   ├── feature_importance.csv
│   ├── metrics.json
│   ├── model.joblib
│   ├── model_comparison.csv
│   ├── model_metadata.json
│   ├── predictions_all.csv
│   ├── predictions_test.csv
│   └── scored_candidates.csv
│
├── src/
│   ├── config.py
│   ├── create_db.py
│   ├── queries.sql
│   ├── score_new_sites.py
│   ├── train_regression.py
│   └── utils.py
│
├── tests/
│   └── test_workflow.py
│
├── README.md
└── requirements.txt

Installation

1. Clone the Repository

git clone https://github.com/AmirhosseinHonardoust/Coffee-Shop-Profit-Predictor.git
cd Coffee-Shop-Profit-Predictor

2. Create a Virtual Environment

On Windows CMD:

python -m venv .venv
.venv\Scripts\activate

On macOS/Linux:

python -m venv .venv
source .venv/bin/activate

3. Install Requirements

pip install -r requirements.txt

Building the Database

Run:

python src/create_db.py --train data/locations_train.csv --candidates data/locations_candidates.csv --db coffee.db

This will:

Read the training CSV file
Read the candidate-site CSV file
Validate required columns and numeric ranges
Create a SQLite database
Save the data into database tables

Created database:

coffee.db

Training the Model

Run:

python src/train_regression.py --db coffee.db --sql src/queries.sql --outdir outputs

This will:

Load SQL feature views
Split the training data into train and holdout sets
Compare multiple regression models
Run cross-validation
Select the best non-baseline model by cross-validated MAE
Evaluate the selected model on the holdout test set
Save the trained model
Save predictions, metrics, model comparison results, and charts

Generated outputs include:

outputs/model.joblib
outputs/metrics.json
outputs/model_metadata.json
outputs/model_comparison.csv
outputs/feature_importance.csv
outputs/predictions_test.csv
outputs/predictions_all.csv
outputs/charts/

Scoring Candidate Sites

Run:

python src/score_new_sites.py --db coffee.db --sql src/queries.sql --model outputs/model.joblib --outdir outputs

This will:

Load candidate-site features from SQLite
Apply the trained model
Predict monthly profit
Rank candidate locations
Add risk bands
Add business-readable positive and negative drivers
Save the scored candidate file

Generated output:

outputs/scored_candidates.csv

Model Output

Candidate scoring returns ranked results with fields such as:

Field	Meaning
`rank`	Candidate rank by predicted monthly profit
`lat`, `lon`	Candidate coordinates
`predicted_profit`	Estimated monthly profit in euros
`risk_band`	Heuristic risk level based on feature profile distance
`expected_error_eur`	Approximate expected error based on model MAE
`profile_distance`	Distance from the training-data feature profile
`main_positive_drivers`	Business-readable features helping the score
`main_negative_drivers`	Business-readable features hurting the score

Example output format:

Rank: 1
Predicted profit: €2,004.10
Risk band: high
Main positive drivers: high competition-adjusted promotion; high competition-adjusted demand; high coffee price
Main negative drivers: no major red flags

The risk band is a practical warning flag, not a formal statistical confidence interval.

Evaluation

The project uses a more responsible evaluation workflow than a single in-sample score.

Evaluation includes:

Train/test split
Mean baseline comparison
Cross-validation
Holdout R²
Holdout MAE
Model comparison table
Test-set diagnostic charts

Metrics are saved to:

outputs/metrics.json

Model comparison is saved to:

outputs/model_comparison.csv

Charts are saved to:

outputs/charts/

Current Results

The current run selects ElasticNet.

Metric	Value
Holdout R²	`0.647`
Holdout MAE	`€383.58`
Mean baseline R²	`-0.030`
Mean baseline MAE	`€699.39`
5-fold CV R²	`0.530 ± 0.110`
5-fold CV MAE	`€413.67 ± €46.65`

Why This Matters

A common mistake in beginner machine learning projects is reporting overly optimistic scores from training data. This project avoids that by using a separate holdout test set and cross-validation.

The model improves clearly over the mean baseline, but the cross-validation spread shows that results should still be interpreted carefully because the dataset is small.

Testing

Run the test suite:

python -m unittest discover -s tests -v

The tests check important project behavior, including:

Python source files compile
CSV data loads into SQLite
SQL feature views are created correctly
Required engineered feature columns exist
Data validation rejects impossible values

Example passing result:

Ran 3 tests
OK

Code Quality

The project includes several maintainability improvements:

Centralized project settings in src/config.py
Shared validation helpers in src/utils.py
Clear separation between database creation, model training, and candidate scoring
Reusable SQL feature definitions
Saved model artifacts
Saved evaluation outputs
Automated tests

These choices make the project easier to reproduce, review, and extend.

Data Statement

This repository is presented as a demonstration / simulated retail analytics project.

The included dataset contains:

220 training locations
60 candidate locations

The data is suitable for demonstrating a portfolio machine learning workflow, but it should not be treated as verified real business data unless the source is updated and documented.

If this project is later connected to real data, this section should be updated with:

Data source
Collection period
Geographic coverage
Feature definitions
Usage permissions
Known data quality issues

Limitations

This project has important limitations.

The model:

Uses a small dataset
Uses demonstration/simulated data
Does not include store size
Does not convert rent_per_sqm into total rent
Does not include labor cost
Does not include cost of goods sold
Does not include opening hours
Does not include seasonality
Does not include competitor quality
Does not include brand effects
Does not include local regulations or lease terms
Uses heuristic risk bands, not formal confidence intervals
May perform poorly on locations outside the training-data profile
Should not be used as the only basis for investment decisions

High performance on the included data does not guarantee high performance in real-world site selection.

Responsible Use

This project is intended for:

Machine learning education
Data analyst / data scientist portfolio demonstration
SQL feature engineering practice
Regression modeling practice
Retail analytics workflow design
Responsible model documentation practice

It should not be used for:

Real lease-signing decisions without deeper analysis
Financial forecasting without validation
Automated investment decisions
Replacing expert real-estate judgment
High-stakes business decisions without human review

Predictions should be treated as decision-support estimates, not final answers.

Future Improvements

Possible future improvements include:

Add real store size and calculate total monthly rent
Add labor cost and cost of goods sold
Add opening hours and seasonality
Add competitor density using geographic distance
Add competitor quality and brand strength
Add local demographic and transit data
Add map-based candidate visualization
Add bootstrapped prediction intervals
Add SHAP or permutation importance explanations
Add a Streamlit dashboard
Add Docker support
Add GitHub Actions CI
Add linting and formatting checks
Add a full EDA notebook

Tech Stack

Python
pandas
NumPy
SQLite
scikit-learn
matplotlib
joblib
unittest

Author

Amir Honardoust

GitHub: @AmirhosseinHonardoust

License

This project is intended for educational and portfolio purposes.

If you use or modify this project, please keep the limitations and responsible-use notes clear.

Coffee-Shop-Profit-Predictor

About Coffee-Shop-Profit-Predictor

Platforms

Languages

Links

README.md

Coffee Shop Profit Predictor

Table of Contents

Project Overview

What This Project Does

What This Project Does Not Do

Features

Charts and Visual Analysis

Actual vs Predicted Profit

Residuals Histogram

Feature Importance

Model Comparison

How the Model Works

SQL Feature Engineering

Regression Models

Project Structure

Installation

1. Clone the Repository

2. Create a Virtual Environment

3. Install Requirements

Building the Database

Training the Model

Scoring Candidate Sites

Model Output

Evaluation

Current Results

Why This Matters

Testing

Code Quality

Data Statement

Limitations

Responsible Use

Future Improvements

Tech Stack

Author

License