Diabetic Readmission Prediction

Project Overview

The Diabetic Readmission Prediction system identifies diabetic patients at high risk of being readmitted to hospital within 30 days of discharge. By surfacing these patients before they leave, care teams can prioritise discharge planning, schedule follow-up calls, and allocate care coordinator resources where the need is highest.

Credential: Built as the capstone project for the Post Graduate Diploma in AI/ML from Emeritus x Asian Institute of Management (AIM). Demonstrates end-to-end ML from raw clinical data through recall-optimised modelling, SHAP explainability, fairness auditing, and production-ready deployment.

Problem & Context

Hospital readmissions within 30 days are a standard quality-of-care metric and a significant cost driver for health systems. For diabetic patients — a chronic, often comorbid population sensitive to medication management and access to care — unplanned readmissions are both common and largely preventable with targeted follow-up.

❌ False Negative (High Cost)

Missing a truly high-risk patient. They leave without adequate support and are likely to return as an emergency readmission — worse outcomes, higher cost.

⚠️ False Positive (Lower Cost)

Flagging a lower-risk patient. Consumes follow-up resources but provides a safety net — acceptable trade-off given the cost asymmetry.

This asymmetry drives the core design decision: optimise for recall (catch as many true positives as possible) while maintaining a minimum precision floor to keep the system operationally credible.

Dataset & Preparation

The dataset covers 101,766 hospital encounters across 130 U.S. hospitals from 1999–2008, sourced from the UCI Machine Learning Repository. Target class: readmitted within 30 days (~11% prevalence, roughly an 8:1 ratio of negatives to positives).

Cleaning & Leakage Prevention

Removed identifiers, constant columns, and high-missingness fields (weight, payer code)
Excluded 1,671 structurally non-readmittable encounters (expired, hospice discharge) to prevent label leakage
80:20 stratified train-test split preserving class proportions
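
As an illustration of what "stratified" means in the split above, the sketch below splits each class separately so both partitions keep the same positive rate. It is a stdlib-only toy, not the project's code: the function name `stratified_split`, the seed, and the toy labels are assumptions, and the project may well have used a library routine instead.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)

    # Group row indices by class label so each class is split on its own.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels with roughly the prevalence reported above (11% positive).
labels = [1] * 110 + [0] * 890
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Splitting per class matters most for the minority class: a plain random split could, by chance, leave the test set with noticeably more or fewer readmissions than the training set.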

Feature Engineering & Selection

ICD-9 diagnosis grouping, prior utilisation flags, age encoding, medication ordinals
170 engineered features → 117 selected via mutual information
PCA: 117 → 44 components (95.21% variance retained)
SMOTE applied to training set only: 80,076 → 141,980 balanced rows
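
The row counts above can be cross-checked with simple arithmetic. The sketch below assumes the 80% training share is taken from the encounters left after exclusion, and that SMOTE oversampled the minority class to an exact 1:1 balance; the per-class counts it derives follow from those assumptions and are not stated in the write-up.

```python
total_encounters = 101_766
excluded = 1_671                       # non-readmittable encounters removed

eligible = total_encounters - excluded  # 100,095 encounters kept
train_rows = eligible * 80 // 100       # 80% training share (integer maths)

balanced_rows = 141_980                 # training rows after SMOTE
majority = balanced_rows // 2           # assumes an exact 1:1 balance
minority = train_rows - majority        # original positive training rows
synthetic = balanced_rows - train_rows  # rows generated by SMOTE

prevalence = minority / train_rows      # about 0.113
imbalance = majority / minority         # about 7.8 negatives per positive

print(train_rows, majority, minority, synthetic)
print(round(prevalence, 3), round(imbalance, 1))
```

Under these assumptions the training split matches the reported 80,076 rows, and the implied prevalence and imbalance agree with the ~11% and roughly 8:1 figures quoted for the full dataset.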

Modelling & Threshold Optimisation

Eight candidate models were trained and compared. The final model, a Random Forest on PCA-reduced features, was selected because it offered the best combination of AUC and recall.

Threshold Strategy Comparison

Default threshold of 0.50 left too much recall on the table. Four strategies were evaluated:

Strategy	Threshold	Recall	Precision	Patients flagged (%)
Default (0.50)	0.50	0.54	0.17	35%
F2 Score Max	0.42	0.81	0.14	66%
Youden's J	0.50	0.55	0.17	36%
✓ Constrained (selected)	0.46	0.72	0.15	54%
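
The three non-default strategies in the table can be sketched as a sweep over candidate thresholds. This is a minimal stdlib illustration on toy data, not the project's implementation: the function names, the toy scores, the candidate grid, and the precision floor of 0.6 are all assumptions, and "constrained" is interpreted here as maximising recall subject to a minimum precision.

```python
def metrics_at(y_true, scores, threshold):
    """Recall, precision, F2 and Youden's J when flagging scores >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y:
            tp += 1
        elif flagged:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    # F-beta with beta = 2 weights recall four times as heavily as precision.
    denom = 4 * precision + recall
    f2 = 5 * precision * recall / denom if denom else 0.0
    return {"threshold": threshold, "recall": recall, "precision": precision,
            "f2": f2, "youden_j": recall - fpr}


def pick_thresholds(y_true, scores, candidates, precision_floor):
    rows = [metrics_at(y_true, scores, t) for t in candidates]
    eligible = [r for r in rows if r["precision"] >= precision_floor]
    # Constrained: highest recall among thresholds meeting the precision
    # floor; ties go to the higher threshold, which flags fewer patients.
    constrained = (
        max(eligible, key=lambda r: (r["recall"], r["threshold"]))["threshold"]
        if eligible else None
    )
    return {
        "f2_max": max(rows, key=lambda r: r["f2"])["threshold"],
        "youden_j": max(rows, key=lambda r: r["youden_j"])["threshold"],
        "constrained": constrained,
    }


# Toy data (hypothetical): 4 positives, 6 negatives.
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.55, 0.6, 0.7, 0.9]
chosen = pick_thresholds(y_true, scores, [0.3, 0.5, 0.8], precision_floor=0.6)
print(chosen)
```

On the toy data, F2 maximisation picks the lowest candidate because it tolerates many false positives in exchange for recall, while the precision floor pulls the constrained choice higher. The table above shows the same ordering, with F2 at 0.42 and the constrained threshold at 0.46.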

Honest assessment: The AUC target (≥0.75) was not met; final test AUC was 0.6446. However, the recall target (≥0.50) was exceeded, with recall of 0.72 at the selected 0.46 threshold. The result aligns with published benchmarks for this dataset given its class imbalance, administrative coding limitations, and temporal bounds. The model is positioned as a screening aid, not a standalone clinical decision system.

Explainability (SHAP)

SHAP (SHapley Additive exPlanations) was used to understand which features drive individual predictions — critical for clinical credibility and user trust.

Prior Utilisation

Prior emergency visits and inpatient admissions are the strongest readmission signals: patients with high prior utilisation tend to be chronically unstable.

Encounter Complexity

Number of diagnoses and length of stay reflect the complexity of the current admission and predict difficulty stabilising post-discharge.

Discharge Pathway

Home discharge is protective relative to facility destinations — patients discharged to rehab or skilled nursing have elevated risk.

Fairness Audit

A demographic fairness audit was run across gender and race subgroups, using recall as the primary equity metric — missing a high-risk patient matters equally regardless of demographic group.

✓ Gender — Stable

Female recall: 0.72 · Male recall: 0.71

Negligible gap — model performs equitably across gender.

⚠️ Race — Known Limitation

Caucasian recall: 0.74 · Other recall: 0.52

Gap of 0.22 exceeds the ≤0.15 target. Documented as a limitation; mitigation paths identified.

The racial recall disparity is partly attributable to smaller subgroup sample sizes (especially the "Other" category with n=270) and the historical nature of the dataset. Future work includes stratified thresholds per demographic group and re-weighting strategies.
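
The audit metric itself is straightforward to compute. The sketch below is an assumed stdlib illustration with hypothetical function names: it derives recall per subgroup from labels and predictions, then checks the largest gap against the 0.15 target, using the two race-subgroup recalls reported above.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups):
    """Recall per subgroup; groups with no positive cases are omitted."""
    positives = defaultdict(int)
    true_positives = defaultdict(int)
    for y, p, g in zip(y_true, y_pred, groups):
        if y == 1:
            positives[g] += 1
            if p == 1:
                true_positives[g] += 1
    return {g: true_positives[g] / positives[g] for g in positives}


def max_recall_gap(recalls):
    """Difference between the best- and worst-served subgroup."""
    return max(recalls.values()) - min(recalls.values())


# Toy check of the per-group calculation (hypothetical data).
toy = recall_by_group(
    y_true=[1, 1, 1, 1, 0],
    y_pred=[1, 0, 1, 1, 1],
    groups=["A", "A", "B", "B", "A"],
)

# Recalls reported in the audit above.
reported = {"Caucasian": 0.74, "Other": 0.52}
gap = max_recall_gap(reported)
exceeds_target = round(gap, 2) > 0.15
print(toy, round(gap, 2), exceeds_target)
```

With only a few hundred encounters in a subgroup, and far fewer true positives among them, each missed patient moves that subgroup's recall by a visible amount, which is one reason the small "Other" category is hard to assess reliably.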

Deployment Architecture

The project ships production-ready deployment artefacts, not just notebooks. The system is fully containerised with Docker Compose and exposes two services:

FastAPI Backend

POST /v1/predict — async risk scoring with Pydantic validation
POST /v1/explain — LangChain-powered Q&A with session memory
GET /health — health check endpoint
Structured logging, custom exception handling, environment config via Pydantic Settings
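
A client call to the scoring endpoint might look like the sketch below. Only the path `POST /v1/predict` comes from the list above; the base URL, the payload field names, and the helper `build_predict_request` are assumptions for illustration, and the real request schema is whatever the service's Pydantic model defines.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local Docker Compose default


def build_predict_request(features, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for /v1/predict."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload: field names are illustrative, not the real schema.
example_features = {
    "number_inpatient": 2,
    "number_emergency": 1,
    "time_in_hospital": 6,
    "number_diagnoses": 9,
}
request = build_predict_request(example_features)
print(request.get_method(), request.full_url)

# Sending it requires the service to be running:
# with urllib.request.urlopen(request, timeout=10) as response:
#     result = json.load(response)
```

Because the backend validates input with Pydantic, a payload that does not match the real schema would be rejected with a validation error rather than scored.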

Streamlit Frontend

Tab 1 — Project Summary: Model analysis, metrics, and visualisations
Tab 2 — Prediction Tool: Interactive form with preset patient profiles
Tab 3 — Explanation Chatbot: LLM-grounded Q&A about predictions
Preset drift validation to detect risk-band inconsistencies

Artefacts: deployment_pipeline.joblib · best_model_random_forest_pca.joblib · standard_scaler.joblib · pca_transformer.joblib · selected_features.json

Outcomes & Learnings

What Was Delivered

Recall of 0.72 at the constrained threshold — recall target exceeded
Full production deployment stack (FastAPI, Streamlit, Docker)
SHAP explainability integrated into the explanation API endpoint
Demographic fairness audit with documented limitations and mitigation paths
Risk-band policy document and preset calibration scripts
Unit test coverage across feature adapter, API endpoints, and preset validation

Key Learnings

Threshold engineering can recover more recall than model tuning alone
Fairness audits should be built into the pipeline, not bolted on at the end
AUC is not enough: operating metrics at the deployment threshold matter more
SHAP explanations add credibility but require careful framing for non-ML stakeholders
PCA-based dimensionality reduction improved generalisation on this noisy dataset

Future Directions

Stratified thresholds — set per-demographic operating thresholds to reduce the racial recall gap below the 0.15 target
Temporal validation — test on more recent clinical data to assess model drift and generalisation beyond 1999–2008
MLflow integration — replace local experiment tracking with a proper MLOps pipeline for reproducible runs and artefact versioning
Richer clinical signals — incorporate medication adherence, lab trends, and post-discharge follow-up data currently absent from the dataset

Reflection

This capstone reinforced something I already believed from my FinTech background: the model is only one part of the solution. Understanding the cost structure of errors, designing for operational credibility, auditing for fairness, and delivering something that can actually be deployed — that's the full job.

A recall of 0.72 on a hard, imbalanced, historically bounded dataset is a defensible result when you've been honest about the limitations and designed the system with appropriate guardrails. The AIM x Emeritus programme gave me the framework to think rigorously about the full ML lifecycle — and this project is where that thinking became concrete.
