November, 2024
UBC MDS Diabetes Predictor (DSCI 522)
Built in a four-person sprint for the program’s Workflow and Reproducibility course, this project delivers a fully reproducible Quarto report that classifies diabetes risk in Pima Indian women. We wrapped the entire analysis (data download, cleaning, EDA, model training, and report generation) in an opinionated, best-practices architecture so any teammate (or grader) can regenerate results with a single command.
End‑to‑end, reproducible workflow:
- Layered Makefile pipeline – make all chains every step:
  - data/: pulls the Pima Indians Diabetes dataset from Kaggle.
  - src/: cleans & splits data, then trains a logistic‑regression model with a C‑grid search.
  - reports/: renders a Quarto HTML/PDF report that embeds figures, tables, and model metrics.
- Containerised execution – A Dockerfile + conda-lock pin exact package versions; GitHub Actions builds and pushes the image on every commit.
- Testing & CI – Pytest suites guard data‑processing functions, and the workflow fails fast if tests or make lint do not pass.
- One‑click hygiene – make clean purges intermediates, guaranteeing the repo stays lightweight and deterministic.
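
The training step in src/ can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: synthetic data from make_classification stands in for the cleaned Pima dataset, and the C grid, the scaling step, the 5-fold cross-validation, and the 70/30 split are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (8 numeric features, binary target).
# The real pipeline would load the processed Pima data instead.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=522
)

# Hold out a stratified test set; the 30% size is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=522
)

# Scale features, then fit logistic regression.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid over the inverse regularisation strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated grid search on the training split only.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_C = search.best_params_["logisticregression__C"]
test_accuracy = search.score(X_test, y_test)  # accuracy of the refit best model
print(f"best C: {best_C}, test accuracy: {test_accuracy:.2f}")
```

Fixing random_state in both the split and the synthetic data keeps reruns deterministic, which is the property the Makefile pipeline relies on.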
⸻
Model highlights
- Logistic‑regression classifier (after hyper‑parameter tuning) achieved 75% accuracy on the held‑out test set, beating the dummy baseline of 67%.
- Feature importance ranks Glucose, BMI, and Pregnancies as the strongest predictors, while Blood Pressure and Insulin contributed less.
- The Quarto report discusses false‑positive/false‑negative trade‑offs and ethical considerations around clinical deployment. 
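
To make the baseline comparison and the false‑positive/false‑negative trade‑off concrete, here is a small sketch in plain Python. The helper names and the toy labels are hypothetical and made up for illustration; they are not the project's data or results. A dummy baseline of this kind always predicts the most common training label, so a model is only useful if it beats that accuracy.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return sum(1 for label in y_test if label == majority_label) / len(y_test)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # flagged as diabetic, actually not
        elif truth == positive:
            fn += 1  # missed a true diabetic case
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy labels (1 = diabetic, 0 = not), invented for this example only.
y_train = [0, 0, 0, 1, 0, 1, 0, 1]
y_test = [0, 1, 0, 0, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

baseline = majority_baseline_accuracy(y_train, y_test)
tp, fp, fn, tn = confusion_counts(y_test, y_pred)
model_accuracy = (tp + tn) / len(y_test)

print(f"baseline: {baseline:.3f}, model: {model_accuracy:.3f}")
print(f"false positives: {fp}, false negatives: {fn}")
```

In a screening context a false negative (a missed case) is usually costlier than a false positive, which is why accuracy alone does not settle whether a model is fit for clinical use.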
⸻
Tech stack: Python, Pandas, Scikit‑learn, Quarto, Make, Conda + Mamba, Docker, GitHub Actions, Pytest, Matplotlib/Seaborn.