Diabetic Readmission Prediction

Outcome

Production-ready ML system with SHAP explainability, demographic fairness audit, and FastAPI + Streamlit deployment with LLM-grounded explanation chatbot

Timeline

March 2025 – March 2026 (AIM x Emeritus Postgrad Capstone)

Role

ML Engineer — AIM x Emeritus Capstone

Technologies

Python · scikit-learn · XGBoost · SHAP · FastAPI · Streamlit · LangChain · Docker · SMOTE · PCA

Project Overview

The Diabetic Readmission Prediction system identifies diabetic patients at high risk of being readmitted to hospital within 30 days of discharge. By surfacing these patients before they leave, care teams can prioritise discharge planning, schedule follow-up calls, and allocate care coordinator resources where the need is highest.

Credential: Built as the capstone project for the Post Graduate Diploma in AI/ML from Emeritus x Asian Institute of Management (AIM). It demonstrates end-to-end ML, from raw clinical data through recall-optimised modelling, SHAP explainability, and fairness auditing to production-ready deployment.

Problem & Context

Hospital readmissions within 30 days are a standard quality-of-care metric and a significant cost driver for health systems. For diabetic patients, a chronic and often comorbid population sensitive to medication management and access to care, unplanned readmissions are common, and many can be prevented with targeted follow-up.

❌ False Negative (High Cost)

Missing a truly high-risk patient. They leave without adequate support and are likely to return as an emergency readmission — worse outcomes, higher cost.

⚠️ False Positive (Lower Cost)

Flagging a lower-risk patient. Consumes follow-up resources but provides a safety net — acceptable trade-off given the cost asymmetry.

This asymmetry drives the core design decision: optimise for recall (catch as many true positives as possible) while maintaining a minimum precision floor to keep the system operationally credible.

Dataset & Preparation

The dataset covers 101,766 hospital encounters across 130 U.S. hospitals from 1999–2008, sourced from the UCI Machine Learning Repository. The target class is readmission within 30 days (~11% prevalence, a roughly 8:1 class imbalance).

Cleaning & Leakage Prevention

  • Removed identifiers, constant columns, and high-missingness fields (weight, payer code)
  • Excluded 1,671 structurally non-readmittable encounters (expired, hospice discharge) to prevent label leakage
  • 80:20 stratified train-test split preserving class proportions
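The stratified split in the last bullet can be sketched in plain Python. This is a minimal illustration, not the project's code: the function name `stratified_split`, the seed, and the toy labels are assumptions made here. In a scikit-learn workflow the same effect is usually obtained with `train_test_split(..., stratify=y)`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so each class is split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels at roughly the 11% prevalence described above (1 = readmitted within 30 days).
labels = [1] * 110 + [0] * 890
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train prevalence={train_rate:.3f}, test prevalence={test_rate:.3f}")
```

Because each class is sampled separately, both partitions keep the original prevalence, which matters when the positive class is this rare.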

Feature Engineering & Selection

  • ICD-9 diagnosis grouping, prior utilisation flags, age encoding, medication ordinals
  • 170 engineered features → 117 selected via mutual information
  • PCA: 117 → 44 components (95.21% variance retained)
  • SMOTE applied to training set only: 80,076 → 141,980 balanced rows
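To make the SMOTE step concrete, here is a hand-rolled sketch of the core idea: each synthetic row is an interpolation between a minority row and one of its nearest minority neighbours. It is an assumption-laden toy (brute-force neighbour search, hypothetical function name `smote_sketch`, made-up 2-feature data), not the library implementation the project would have used. The last lines redo the arithmetic for the row counts quoted above, assuming "balanced" means an even 50/50 split.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority rows by interpolating from a sampled row
    towards a random one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Brute-force nearest neighbours; acceptable for a toy example only.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy TRAINING data only: 40 majority rows, 8 minority rows, 2 features each.
data_rng = random.Random(1)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(40)]
minority = [(data_rng.gauss(2, 0.5), data_rng.gauss(2, 0.5)) for _ in range(8)]

new_rows = smote_sketch(minority, n_synthetic=len(majority) - len(minority))
balanced_minority = minority + new_rows

# Row counts quoted above, assuming equal class sizes after resampling.
per_class = 141_980 // 2                          # rows per class after SMOTE
original_minority = 80_076 - per_class            # real minority rows in training
synthetic_needed = per_class - original_minority  # synthetic rows added
print(per_class, original_minority, synthetic_needed)
```

The test set is never resampled in this sketch, mirroring the "training set only" rule above, so evaluation metrics still reflect the real class balance.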

Modelling & Threshold Optimisation

Eight candidate models were trained and compared. The final model, a Random Forest on PCA-reduced features, was selected because it offered the best combination of AUC and recall.

Threshold Strategy Comparison

The default threshold of 0.50 left too much recall on the table, so four strategies were evaluated:

Strategy                   Threshold   Recall   Precision   Patients flagged
Default (0.50)             0.50        0.54     0.17        35%
F2 Score Max               0.42        0.81     0.14        66%
Youden's J                 0.50        0.55     0.17        36%
✓ Constrained (selected)   0.46        0.72     0.15        54%
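The non-default strategies above can be sketched as a simple grid search over thresholds. This is an illustration under stated assumptions: the function names, the 0.01 grid, the toy scores, and the `precision_floor` values are all hypothetical, since the source does not state the exact floor or search procedure that was used.

```python
def confusion_at(y_true, scores, threshold):
    """Count TP, FP, FN, TN when every score >= threshold is flagged."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics_at(y_true, scores, threshold, beta=2.0):
    """Recall, precision, F-beta, Youden's J and flag rate at one threshold."""
    tp, fp, fn, tn = confusion_at(y_true, scores, threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    return {
        "threshold": threshold,
        "recall": recall,
        "precision": precision,
        "f_beta": (1 + b2) * precision * recall / denom if denom else 0.0,
        "youden_j": recall + specificity - 1,
        "flag_rate": (tp + fp) / len(y_true),
    }

def pick_thresholds(y_true, scores, precision_floor):
    """Return the F2-max, Youden's J and precision-constrained operating points."""
    rows = [metrics_at(y_true, scores, t / 100) for t in range(1, 100)]
    f2_best = max(rows, key=lambda r: r["f_beta"])
    youden_best = max(rows, key=lambda r: r["youden_j"])
    # Constrained: highest recall among thresholds that respect the precision
    # floor; ties go to the higher threshold, which flags fewer patients.
    eligible = [r for r in rows if r["precision"] >= precision_floor]
    constrained = (
        max(eligible, key=lambda r: (r["recall"], r["threshold"])) if eligible else None
    )
    return f2_best, youden_best, constrained

# Toy labels and scores, not the project's predictions.
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.60, 0.70, 0.90]

f2_best, youden_best, constrained = pick_thresholds(y_true, scores, precision_floor=0.60)
_, _, strict = pick_thresholds(y_true, scores, precision_floor=0.70)
print(constrained["threshold"], constrained["recall"], constrained["flag_rate"])
print(strict["threshold"], strict["recall"])
```

Raising the floor pushes the chosen threshold up and gives back some recall, which is the same trade-off the table shows between the F2-max and constrained rows.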

Honest assessment: The AUC target (≥0.75) was not met — final test AUC was 0.6446. However, the recall target (≥0.50) was exceeded at 0.72. The result aligns with published benchmarks for this dataset given its class imbalance, administrative coding limitations, and temporal bounds. The model is positioned as a screening aid, not a standalone clinical decision system.

Explainability (SHAP)

SHAP (SHapley Additive exPlanations) was used to understand which features drive individual predictions — critical for clinical credibility and user trust.

Prior Utilisation

Prior emergency visits and inpatient admissions are the strongest readmission signals. Patients with high prior utilisation tend to be chronically unstable.

Encounter Complexity

Number of diagnoses and length of stay reflect the complexity of the current admission and predict difficulty stabilising post-discharge.

Discharge Pathway

Home discharge is associated with lower predicted risk than facility destinations. Patients discharged to rehab or skilled nursing have elevated risk.
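For readers unfamiliar with what a SHAP value is, the sketch below computes exact Shapley values for a three-feature toy risk score by enumerating every feature coalition. Everything here is hypothetical: `toy_risk`, its coefficients, the feature names, and the baseline patient are invented for illustration and are not the project's Random Forest. The SHAP library computes or approximates the same quantity far more efficiently for real models.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) per feature.

    Features outside a coalition are replaced by their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        mixed = {f: (x[f] if f in coalition else baseline[f]) for f in names}
        return predict(mixed)

    phi = {}
    for feature in names:
        others = [f for f in names if f != feature]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {feature})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[feature] = total
    return phi

def toy_risk(p):
    """Hypothetical risk score with one interaction term (illustration only)."""
    score = 0.10 + 0.05 * p["prior_inpatient"] + 0.02 * p["num_diagnoses"]
    if not p["discharged_home"]:
        score += 0.08
    if p["prior_inpatient"] >= 2 and not p["discharged_home"]:
        score += 0.05  # interaction: frequent admissions plus facility discharge
    return score

patient = {"prior_inpatient": 3, "num_diagnoses": 9, "discharged_home": False}
baseline = {"prior_inpatient": 0, "num_diagnoses": 5, "discharged_home": True}

phi = shapley_values(toy_risk, patient, baseline)
for name, contribution in sorted(phi.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {contribution:+.3f}")
```

The attributions sum to the difference between the patient's score and the baseline score, which is the additive property that makes per-patient explanations possible. The interaction term is split evenly between the two features involved.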

Fairness Audit

A demographic fairness audit was run across gender and race subgroups, using recall as the primary equity metric — missing a high-risk patient matters equally regardless of demographic group.

✓ Gender — Stable

Female recall: 0.72 · Male recall: 0.71

Negligible gap — model performs equitably across gender.

⚠️ Race — Known Limitation

Caucasian recall: 0.74 · Other recall: 0.52

Gap of 0.22 exceeds the ≤0.15 target. Documented as a limitation; mitigation paths identified.

The racial recall disparity is partly attributable to smaller subgroup sample sizes (especially the "Other" category with n=270) and the historical nature of the dataset. Future work includes stratified thresholds per demographic group and re-weighting strategies.
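A recall-based audit of this kind reduces to computing recall per subgroup and comparing the spread against the target. The sketch below assumes hypothetical helper names (`recall_by_group`, `recall_gap`) and toy labels. Only the 0.74 and 0.52 recalls and the 0.15 target come from the figures reported above.

```python
from collections import defaultdict

def recall_by_group(y_true, y_pred, groups):
    """Recall (true positives / actual positives) for each subgroup."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for y, pred, group in zip(y_true, y_pred, groups):
        if y == 1:
            positives[group] += 1
            true_positives[group] += int(pred == 1)
    # Groups with no actual positives are omitted: recall is undefined for them.
    return {g: true_positives[g] / positives[g] for g in positives}

def recall_gap(recalls):
    """Largest difference in recall between any two subgroups."""
    return max(recalls.values()) - min(recalls.values())

# Toy example with two hypothetical groups.
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
y_pred = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
groups = ["A"] * 5 + ["B"] * 5
toy_recalls = recall_by_group(y_true, y_pred, groups)
toy_gap = recall_gap(toy_recalls)

# The race figures reported above, checked against the 0.15 target.
reported = {"Caucasian": 0.74, "Other": 0.52}
reported_gap = recall_gap(reported)
exceeds_target = reported_gap > 0.15
print(toy_recalls, round(toy_gap, 2), round(reported_gap, 2), exceeds_target)
```

With small subgroups, each recall rests on few positive cases, so a gap check like this is best read alongside the subgroup counts.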

Deployment Architecture

The project ships production-ready deployment artifacts — not just notebooks. The system is fully containerised with Docker Compose and exposes two services:

FastAPI Backend

  • POST /v1/predict — async risk scoring with Pydantic validation
  • POST /v1/explain — LangChain-powered Q&A with session memory
  • GET /health — health check endpoint
  • Structured logging, custom exception handling, environment config via Pydantic Settings

Streamlit Frontend

  • Tab 1 — Project Summary: Model analysis, metrics, and visualisations
  • Tab 2 — Prediction Tool: Interactive form with preset patient profiles
  • Tab 3 — Explanation Chatbot: LLM-grounded Q&A about predictions
  • Preset drift validation to detect risk-band consistency issues

Artifacts: deployment_pipeline.joblib · best_model_random_forest_pca.joblib · standard_scaler.joblib · pca_transformer.joblib · selected_features.json

Outcomes & Learnings

What Was Delivered

  • Recall of 0.72 at the constrained threshold — recall target exceeded
  • Full production deployment stack (FastAPI, Streamlit, Docker)
  • SHAP explainability integrated into the explanation API endpoint
  • Demographic fairness audit with documented limitations and mitigation paths
  • Risk-band policy document and preset calibration scripts
  • Unit test coverage across feature adapter, API endpoints, and preset validation

Key Learnings

  • Threshold engineering can recover more recall than model tuning alone
  • Fairness audits should be built into the pipeline, not bolted on at the end
  • AUC is not enough — operating metrics at deployment threshold matter more
  • SHAP explanations add credibility but require careful framing for non-ML stakeholders
  • PCA-based dimensionality reduction improved generalisation on this noisy dataset

Future Directions

  • Stratified thresholds — set per-demographic operating thresholds to reduce the racial recall gap below the 0.15 target
  • Temporal validation — test on more recent clinical data to assess model drift and generalisation beyond 1999–2008
  • MLflow integration — replace local experiment tracking with a proper MLOps pipeline for reproducible runs and artefact versioning
  • Richer clinical signals — incorporate medication adherence, lab trends, and post-discharge follow-up data currently absent from the dataset

Reflection

This capstone reinforced something I already believed from my FinTech background: the model is only one part of the solution. Understanding the cost structure of errors, designing for operational credibility, auditing for fairness, and delivering something that can actually be deployed — that's the full job.

A recall of 0.72 on a hard, imbalanced, historically bounded dataset is a defensible result when you've been honest about the limitations and designed the system with appropriate guardrails. The AIM x Emeritus programme gave me the framework to think rigorously about the full ML lifecycle — and this project is where that thinking became concrete.