Javier's Portfolio

Diabetes Predictor

November, 2024

Machine Learning

UBC MDS Diabetes Predictor (DSCI 522) Built in a four-person sprint for the program’s Workflow and Reproducibility course, this project delivers a fully reproducible Quarto report that classifies diabetes risk in Pima Indian women. We wrapped the entire analysis—data download, cleaning, EDA, model training, and report generation—in an opinionated, best-practices architecture so any teammate (or grader) can regenerate results with a single command.

End‑to‑end, reproducible workflow:

Layered Makefile pipeline – make all chains every step:
1. data/: pulls the Pima Indians Diabetes dataset from Kaggle.
2. src/: cleans & splits data, then trains a logistic‑regression model with a C‑grid search.
3. reports/: renders a Quarto HTML/PDF report that embeds figures, tables, and model metrics.
Containerised execution – A Dockerfile + conda-lock pin exact package versions; GitHub Actions builds and pushes the image on every commit.
Testing & CI – Pytest suites guard data‑processing functions, and the workflow fails fast if tests or make lint do not pass.
One‑click hygiene – make clean purges intermediates, guaranteeing the repo stays lightweight and deterministic.

⸻

Model highlights

Logistic‑regression classifier (after hyper‑parameter tuning) achieved 75% accuracy on the held‑out test set—beating the dummy baseline of 67%.*
Feature importance ranks Glucose, BMI, and Pregnancies as the strongest predictors, while Blood Pressure and Insulin contributed less.*
The Quarto report discusses false‑positive/false‑negative trade‑offs and ethical considerations around clinical deployment.

⸻

Tech stack: Python, Pandas, Scikit‑learn, Quarto, Make, Conda + Mamba, Docker, GitHub Actions, Pytest, Matplotlib/Seaborn.

Javier Martinez

Diabetes Predictor