Diabetic Readmission Prediction
Outcome
Production-ready ML system with SHAP explainability, demographic fairness audit, and FastAPI + Streamlit deployment with LLM-grounded explanation chatbot
Timeline
March 2025 – March 2026 (AIM x Emeritus Postgrad Capstone)
Role
ML Engineer — AIM x Emeritus Capstone
Project Overview
The Diabetic Readmission Prediction system identifies diabetic patients at high risk of being readmitted to hospital within 30 days of discharge. By surfacing these patients before they leave, care teams can prioritise discharge planning, schedule follow-up calls, and allocate care coordinator resources where the need is highest.
Credential: Built as the capstone project for the Post Graduate Diploma in AI/ML from Emeritus x Asian Institute of Management (AIM). Demonstrates end-to-end ML from raw clinical data through recall-optimised modelling, SHAP explainability, fairness auditing, and production-ready deployment.
Problem & Context
Hospital readmissions within 30 days are a standard quality-of-care metric and a significant cost driver for health systems. For diabetic patients — a chronic, often comorbid population sensitive to medication management and access to care — unplanned readmissions are both common and largely preventable with targeted follow-up.
❌ False Negative (High Cost)
Missing a truly high-risk patient. They leave without adequate support and are likely to return as an emergency readmission — worse outcomes, higher cost.
⚠️ False Positive (Lower Cost)
Flagging a lower-risk patient. Consumes follow-up resources but provides a safety net — acceptable trade-off given the cost asymmetry.
This asymmetry drives the core design decision: optimise for recall (catch as many true positives as possible) while maintaining a minimum precision floor to keep the system operationally credible.
Dataset & Preparation
The dataset covers 101,766 hospital encounters across 130 U.S. hospitals from 1999–2008, sourced from the UCI Machine Learning Repository. The target class is readmission within 30 days, with about 11% prevalence (roughly an 8:1 class imbalance).
Cleaning & Leakage Prevention
- Removed identifiers, constant columns, and high-missingness fields (weight, payer code)
- Excluded 1,671 structurally non-readmittable encounters (expired, hospice discharge) to prevent label leakage
- 80:20 stratified train-test split preserving class proportions
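A minimal sketch of what an 80:20 stratified split does, using only the standard library. The function name `stratified_split`, the seed, and the toy labels are illustrative assumptions; the project itself may well have used a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class, not across classes
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels at roughly the reported prevalence (11 positives per 100 rows).
labels = [1] * 110 + [0] * 890
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"train positives: {train_rate:.3f}, test positives: {test_rate:.3f}")
```

Because each class is split separately, the positive rate in the train and test partitions matches the overall rate, which is what "preserving class proportions" means in practice.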
Feature Engineering & Selection
- ICD-9 diagnosis grouping, prior utilisation flags, age encoding, medication ordinals
- 170 engineered features → 117 selected via mutual information
- PCA: 117 → 44 components (95.21% variance retained)
- SMOTE applied to training set only: 80,076 → 141,980 balanced rows
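The row counts quoted above can be cross-checked with a little arithmetic. The sketch below assumes that SMOTE kept every original training row and added synthetic minority rows until both classes were the same size; the per-class counts it derives are inferred from the reported totals, not taken from the project's logs.

```python
total_encounters = 101_766   # reported dataset size
excluded = 1_671             # non-readmittable encounters removed
train_rows = 80_076          # reported training rows before SMOTE
balanced_rows = 141_980      # reported training rows after SMOTE

eligible = total_encounters - excluded      # 100,095 encounters remain
expected_train = round(eligible * 0.8)      # 80% share of the eligible rows
test_rows = eligible - train_rows           # 20,019 held-out rows

# Assumption: a balanced set means two equally sized classes.
majority = balanced_rows // 2               # 70,990 negatives (inferred)
minority = train_rows - majority            # 9,086 real positives (inferred)
synthetic = balanced_rows - train_rows      # 61,904 synthetic positives (inferred)

prevalence = minority / train_rows
imbalance_ratio = majority / minority

print(f"train rows match 80% split: {expected_train == train_rows}")
print(f"training prevalence: {prevalence:.4f}")
print(f"imbalance ratio: {imbalance_ratio:.2f}:1")
```

Under that assumption the implied training prevalence is about 11.3% and the imbalance ratio about 7.8:1, which agrees with the rounded figures quoted in the dataset description. The test partition is never resampled, so metrics are measured at the real-world class balance.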
Modelling & Threshold Optimisation
Eight candidate models were trained and compared. The final model, a Random Forest on PCA-reduced features, was selected because it offered the best combination of AUC and recall.
Threshold Strategy Comparison
The default threshold of 0.50 left too much recall on the table, so four strategies were evaluated:
| Strategy | Threshold | Recall | Precision | Flags % |
|---|---|---|---|---|
| Default (0.50) | 0.50 | 0.54 | 0.17 | 35% |
| F2 Score Max | 0.42 | 0.81 | 0.14 | 66% |
| Youden's J | 0.50 | 0.55 | 0.17 | 36% |
| ✓ Constrained (selected) | 0.46 | 0.72 | 0.15 | 54% |
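As a quick consistency check, the F2 score can be recomputed from the rounded precision and recall values in the table. This is a sketch on the published, rounded figures, so the results are approximate; `f_beta` is a helper defined here for illustration.

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# (precision, recall) pairs copied from the table above.
strategies = {
    "Default (0.50)": (0.17, 0.54),
    "F2 Score Max": (0.14, 0.81),
    "Youden's J": (0.17, 0.55),
    "Constrained": (0.15, 0.72),
}

f2_scores = {name: f_beta(p, r) for name, (p, r) in strategies.items()}
for name, score in sorted(f2_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} F2 = {score:.3f}")
```

On these rounded values the constrained threshold gives up only about 0.005 of F2 relative to the F2-maximising threshold, while flagging 54% of cases instead of 66%.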
Honest assessment: The AUC target (≥0.75) was not met — final test AUC was 0.6446. However, the recall target (≥0.50) was exceeded at 0.72. The result aligns with published benchmarks for this dataset given its class imbalance, administrative coding limitations, and temporal bounds. The model is positioned as a screening aid, not a standalone clinical decision system.
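One way to implement a constrained strategy of this kind is to scan candidate thresholds and keep the one with the highest recall among those that still meet a precision floor. The sketch below is an assumption about how such a search could look, not the project's actual code: the function names, the toy scores, and the floor of 0.6 are all hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is flagged."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def constrained_threshold(y_true, scores, precision_floor, candidates):
    """Highest-recall candidate meeting the floor, as (threshold, precision, recall).

    Ties keep the first (lowest) candidate. Returns None if no candidate qualifies.
    """
    best = None
    for threshold in candidates:
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if precision < precision_floor:
            continue
        if best is None or (recall, precision) > (best[2], best[1]):
            best = (threshold, precision, recall)
    return best


# Toy data for illustration only.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.92, 0.81, 0.73, 0.64, 0.57, 0.52, 0.46, 0.38, 0.31, 0.12]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

best = constrained_threshold(y_true, scores, precision_floor=0.6, candidates=candidates)
print(best)
```

In a real pipeline the search would run on a validation split rather than the test set, so that the reported test metrics stay unbiased.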
Explainability (SHAP)
SHAP (SHapley Additive exPlanations) was used to understand which features drive individual predictions — critical for clinical credibility and user trust.
Prior Utilisation
Prior emergency visits and inpatient admissions are the strongest readmission signals; high prior utilisation typically indicates a chronically unstable patient.
Encounter Complexity
Number of diagnoses and length of stay reflect the complexity of the current admission and predict difficulty stabilising post-discharge.
Discharge Pathway
Home discharge is protective relative to facility destinations — patients discharged to rehab or skilled nursing have elevated risk.
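To make the attribution idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written scoring function. It is illustrative only: the SHAP library uses far more efficient, model-specific algorithms, and the function `toy_risk`, its coefficients, and the feature names are hypothetical stand-ins rather than the project's Random Forest.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values; 'absent' features are set to their baseline value."""
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Features in the coalition take the instance's value, the rest the baseline's.
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    contributions = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        contributions[name] = total
    return contributions


def toy_risk(x):
    """Hypothetical risk score with one interaction term."""
    score = 0.10 + 0.05 * x["prior_inpatient_visits"] + 0.01 * x["num_diagnoses"]
    score -= 0.04 * x["discharged_home"]
    score += 0.005 * x["prior_inpatient_visits"] * (1 - x["discharged_home"])
    return score


baseline = {"prior_inpatient_visits": 0, "num_diagnoses": 5, "discharged_home": 1}
patient = {"prior_inpatient_visits": 3, "num_diagnoses": 9, "discharged_home": 0}

phi = shapley_values(toy_risk, patient, baseline)
for name, contribution in phi.items():
    print(f"{name:<24} {contribution:+.4f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the property that lets a per-patient explanation be read as "how much each factor moved the risk away from a reference patient".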
Fairness Audit
A demographic fairness audit was run across gender and race subgroups, using recall as the primary equity metric — missing a high-risk patient matters equally regardless of demographic group.
✓ Gender — Stable
Female recall: 0.72 · Male recall: 0.71
Negligible gap — model performs equitably across gender.
⚠️ Race — Known Limitation
Caucasian recall: 0.74 · Other recall: 0.52
Gap of 0.22 exceeds the ≤0.15 target. Documented as a limitation; mitigation paths identified.
The racial recall disparity is partly attributable to smaller subgroup sample sizes (especially the "Other" category with n=270) and the historical nature of the dataset. Future work includes stratified thresholds per demographic group and re-weighting strategies.
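A recall-based audit of this kind reduces to computing recall separately per subgroup and comparing the spread against the target. The sketch below shows one way to do that; the helper names and the toy data are assumptions, and only the final check uses figures reported above.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Recall per subgroup, computed over truly positive cases only."""
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for label, pred, group in zip(y_true, y_pred, groups):
        if label != 1:
            continue  # recall ignores true negatives and false positives
        if pred == 1:
            true_pos[group] += 1
        else:
            false_neg[group] += 1
    seen = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in seen}


def recall_gap(recalls):
    """Largest difference in recall between any two subgroups."""
    return max(recalls.values()) - min(recalls.values())


# Toy data for illustration only.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

toy_recalls = recall_by_group(y_true, y_pred, groups)
print(toy_recalls, round(recall_gap(toy_recalls), 2))

# Applying the same gap check to the race recalls reported above.
reported = {"Caucasian": 0.74, "Other": 0.52}
gap = recall_gap(reported)
print(f"reported gap: {gap:.2f}, within 0.15 target: {gap <= 0.15}")
```

With only 270 rows in the "Other" category, the number of true positives behind that group's recall is small, so the estimate carries wide uncertainty; reporting subgroup counts alongside the recall values helps readers judge that.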
Deployment Architecture
The project ships production-ready deployment artifacts — not just notebooks. The system is fully containerised with Docker Compose and exposes two services:
FastAPI Backend
- POST /v1/predict: async risk scoring with Pydantic validation
- POST /v1/explain: LangChain-powered Q&A with session memory
- GET /health: health check endpoint
- Structured logging, custom exception handling, environment config via Pydantic Settings
Streamlit Frontend
- Tab 1 — Project Summary: Model analysis, metrics, and visualisations
- Tab 2 — Prediction Tool: Interactive form with preset patient profiles
- Tab 3 — Explanation Chatbot: LLM-grounded Q&A about predictions
- Preset drift validation to detect risk-band consistency issues
Artifacts: deployment_pipeline.joblib · best_model_random_forest_pca.joblib · standard_scaler.joblib · pca_transformer.joblib · selected_features.json
Outcomes & Learnings
What Was Delivered
- Recall of 0.72 at the constrained threshold — recall target exceeded
- Full production deployment stack (FastAPI, Streamlit, Docker)
- SHAP explainability integrated into the explanation API endpoint
- Demographic fairness audit with documented limitations and mitigation paths
- Risk-band policy document and preset calibration scripts
- Unit test coverage across feature adapter, API endpoints, and preset validation
Key Learnings
- Threshold engineering can recover more recall than model tuning alone
- Fairness audits should be built into the pipeline, not bolted on at the end
- AUC is not enough — operating metrics at deployment threshold matter more
- SHAP explanations add credibility but require careful framing for non-ML stakeholders
- PCA-based dimensionality reduction improved generalisation on this noisy dataset
Future Directions
- Stratified thresholds — set per-demographic operating thresholds to reduce the racial recall gap below the 0.15 target
- Temporal validation — test on more recent clinical data to assess model drift and generalisation beyond 1999–2008
- MLflow integration — replace local experiment tracking with a proper MLOps pipeline for reproducible runs and artefact versioning
- Richer clinical signals — incorporate medication adherence, lab trends, and post-discharge follow-up data currently absent from the dataset
Reflection
This capstone reinforced something I already believed from my FinTech background: the model is only one part of the solution. Understanding the cost structure of errors, designing for operational credibility, auditing for fairness, and delivering something that can actually be deployed — that's the full job.
A recall of 0.72 on a hard, imbalanced, historically bounded dataset is a defensible result when you've been honest about the limitations and designed the system with appropriate guardrails. The AIM x Emeritus programme gave me the framework to think rigorously about the full ML lifecycle — and this project is where that thinking became concrete.