Predicting Hospital Length of Stay

Overview

We built a model that estimates how many days a patient will stay so the hospital can plan beds and staff in advance.

Misallocating beds, ventilators, and staff leads to avoidable complications, a risk the COVID-19 pandemic made vivid.
HealthPlus hired us to find what drives length of stay (LOS) and predict it from data captured at admission.
Accurate LOS forecasts let management pre-position resources, smooth patient flow, and reduce costly bottlenecks.
Goal: identify the strongest LOS drivers and deliver an interpretable, statistically valid prediction model.

Methodology

flowchart LR
  A[Raw Data] --> B[Clean & Encode]
  B --> C[EDA]
  C --> D[Train/Test Split]
  D --> E["Random Forest / Linear Regression / Ridge / Lasso"]
  E --> F["Tune (Cross-Validation)"]
  F --> G["Evaluate: R2 / RMSE"]
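
The flow above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code: the column names (`ward`, `severity`, `visitors`, `stay`), the effect sizes, and the hyperparameters are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data; column names and effect sizes are hypothetical.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "ward": rng.choice(list("ABCDEF"), size=n),
    "severity": rng.choice(["minor", "moderate", "extreme"], size=n),
    "visitors": rng.integers(0, 10, size=n),
})
ward_effect = df["ward"].map({"A": 6, "B": 0, "C": 4, "D": 0, "E": 2, "F": 0})
df["stay"] = 8 + ward_effect + rng.normal(0, 1.0, size=n)

# Train/test split.
X, y = df.drop(columns="stay"), df["stay"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Encode: one-hot the categorical columns, pass numeric columns through.
encode = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["ward", "severity"])],
    remainder="passthrough",
)

# Candidate models named in the flowchart (hyperparameters are placeholders).
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
}

# Compare candidates with 5-fold cross-validation on the training split only.
cv_r2 = {}
for name, model in candidates.items():
    pipe = Pipeline([("encode", encode), ("model", model)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
    cv_r2[name] = scores.mean()
    print(f"{name}: mean CV R2 = {cv_r2[name]:.3f}")

# Evaluate the best candidate once on the held-out test split.
best_name = max(cv_r2, key=cv_r2.get)
best_pipe = Pipeline([("encode", encode), ("model", candidates[best_name])])
best_pipe.fit(X_train, y_train)
pred = best_pipe.predict(X_test)
test_r2 = r2_score(y_test, pred)
test_rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"best = {best_name}, test R2 = {test_r2:.3f}, test RMSE = {test_rmse:.3f}")
```

Keeping the encoder inside the pipeline means it is refit on each cross-validation fold, so no information from the validation fold leaks into the encoding step.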

The Data

We started with half a million admission records, each describing a patient and their hospital visit.

Dataset of 500,000 patient admission records across 15 columns, with no missing values and all rows unique.
Mix of numeric fields (extra rooms, staff available, visitors, admission deposit, stay) and categorical fields.
A single patient was admitted up to 21 times; on average ~3 rooms and ~5 staff were available per admission.
About 82% of patients arrive with moderate or minor illness; gynecology receives ~68% of all patients.
Target variable is Stay (length of stay in days); the predictors are fields available at admission and after a few initial tests.
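
Checks like the ones summarized above (missing values, duplicate rows, repeat admissions, department shares) might be run along these lines. The tiny frame and its column names (`patient_id`, `department`, `severity`, `staff_available`, `stay`) are hypothetical stand-ins, not the real schema.

```python
import pandas as pd

# Tiny hypothetical sample; the real dataset has 500,000 rows and 15 columns.
df = pd.DataFrame({
    "patient_id": [101, 102, 101, 103],
    "department": ["gynecology", "gynecology", "surgery", "radiotherapy"],
    "severity": ["minor", "moderate", "extreme", "moderate"],
    "staff_available": [5, 4, 7, 5],
    "stay": [8, 9, 21, 10],
})

# Total number of missing cells across the whole frame.
missing_cells = int(df.isna().sum().sum())

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Largest number of admissions recorded for any single patient.
max_admissions = int(df["patient_id"].value_counts().max())

# Share of admissions handled by each department.
dept_share = df["department"].value_counts(normalize=True)

print(missing_cells, duplicate_rows, max_admissions)
print(dept_share)
```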

Exploratory Analysis

We looked at how long patients typically stay and how the main numeric factors are distributed.

Most patients stay 8-9 days; few stay beyond 10 days and very few beyond 40, matching the mild-illness mix.
Admission deposit is roughly normally distributed, with outliers at both unusually high and unusually low amounts.
Number of visitors is highly right-skewed, with 2 and 4 being the most common visitor counts.
Correlation heatmap shows almost no correlation among numeric features or with LOS.
Weak numeric correlations signaled that categorical features (ward, department, severity) would drive prediction.
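
A rough sketch of how skewness and pairwise correlation can be inspected with pandas. The data here is synthetic and the three columns are generated independently, so near-zero correlations are expected by construction; it illustrates the checks, not the project's results.

```python
import numpy as np
import pandas as pd

# Synthetic, independently generated columns (names are hypothetical).
rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "admission_deposit": rng.normal(5000, 1000, size=n),            # symmetric
    "visitors": np.floor(rng.exponential(3.0, size=n)).astype(int),  # right-skewed
    "stay": rng.gamma(shape=9.0, scale=1.3, size=n),
})

# Sample skewness per column: near 0 is symmetric, positive is right-skewed.
skewness = df.skew()

# Pearson correlation matrix; the "stay" column shows each feature's
# linear association with the target.
corr = df.corr()
stay_corr = corr["stay"].drop("stay")

print(skewness.round(2))
print(stay_corr.round(3))
```

The same matrix can be passed to `seaborn.heatmap(corr, annot=True)` to draw the kind of heatmap described above.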

Key Drivers of Length-of-Stay

Stay length is shaped mostly by which ward and department a patient is in, plus illness severity and age.

Wards A and C have the longest stays, suggesting they handle the most serious cases.
Ward A holds the most extreme cases and the only surgery patients, demanding more staff and resources.
Wards B, D, and F are dedicated to gynecology; A, C, and E cover all other diseases.
Patients aged 1-10 and 51-100 stay longest, while the 21-50 group skews toward shorter gynecology stays.
Nine doctors staff the hospital, four of them in high-volume gynecology; Dr. Sarah and Dr. Olivia treat the most patients.
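
Comparisons like these usually come from simple group summaries. Below is a minimal sketch with made-up rows; the column names and values are assumptions chosen only to mirror the pattern described above.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "ward": ["A", "A", "B", "B", "C", "D", "E", "F"],
    "department": [
        "surgery", "radiotherapy", "gynecology", "gynecology",
        "radiotherapy", "gynecology", "radiotherapy", "gynecology",
    ],
    "stay": [30, 20, 8, 9, 18, 9, 12, 8],
})

# Average stay per ward, longest first.
mean_stay_by_ward = (
    df.groupby("ward")["stay"].mean().sort_values(ascending=False)
)

# Ward-by-department counts show which wards serve which departments.
ward_department = pd.crosstab(df["ward"], df["department"])

print(mean_stay_by_ward)
print(ward_department)
```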

Modeling & Results

A linear regression model predicts stay with a mean absolute error of about two days, and adding complexity did not improve it.

Linear Regression reached an adjusted R-squared of ~0.84, explaining about 84% of the variance in length of stay.
Mean Absolute Error was ~2.15 days on test data; train and test metrics were close, which suggests little overfitting.
Ridge, Lasso, and Elastic Net (tuned via GridSearchCV) gave no improvement over ordinary least squares.
All linear regression assumptions held: zero-mean normal residuals, linearity, and homoscedasticity.
Forward feature selection cut features from 42 to 8 (~81% fewer) while keeping R-squared at 0.840.
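
The metrics and the feature-selection step above can be illustrated as follows. This is a sketch under stated assumptions: the data is synthetic (so it will not reproduce the 0.84 or 2.15 figures), `adjusted_r2` is a helper written for this example, and forward selection uses scikit-learn's `SequentialFeatureSelector` as a stand-in rather than the mlxtend selector listed in the project's tooling.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 10 features, only the first three carry signal.
rng = np.random.default_rng(42)
n, p = 1000, 10
X = rng.normal(size=(n, p))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(0, 1.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)


# Full model: ordinary least squares on all features.
full = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, full.predict(X_train))
r2_test = r2_score(y_test, full.predict(X_test))
adj_r2_train = adjusted_r2(r2_train, len(y_train), p)
mae_test = mean_absolute_error(y_test, full.predict(X_test))

# Forward selection: greedily add the feature that most improves CV R-squared.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=5
)
selector.fit(X_train, y_train)
kept = np.flatnonzero(selector.get_support())

# Reduced model: refit using only the selected columns.
small = LinearRegression().fit(X_train[:, kept], y_train)
r2_test_small = r2_score(y_test, small.predict(X_test[:, kept]))

print(f"train R2 = {r2_train:.3f}, adjusted = {adj_r2_train:.3f}")
print(f"test R2 = {r2_test:.3f}, test MAE = {mae_test:.3f}")
print(f"kept features = {kept.tolist()}, reduced test R2 = {r2_test_small:.3f}")
```

Comparing train and test R-squared side by side is the overfitting check described above, and comparing the full and reduced test scores shows whether the dropped features carried any real signal.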

Key Takeaways

We delivered an accurate, easy-to-explain LOS model and pinpointed the few factors that matter most.

LOS can be predicted at admission with a mean absolute error of ~2.15 days, supporting proactive bed and staff planning.
Ward, department, severity, and age are the dominant drivers; numeric variables add little signal.
A simple 8-feature linear model matches the full model, easing deployment, storage, and interpretation.
Verified statistical assumptions make the model trustworthy for inference, not just prediction.
Built with: pandas, numpy, matplotlib, seaborn, scikit-learn, statsmodels, mlxtend

Tech Stack

pandas — data wrangling and tabular manipulation
numpy — fast numerical arrays
scikit-learn — modeling, pipelines, and evaluation
seaborn — statistical visualization
matplotlib — plotting
statsmodels — OLS / statistical inference & VIF

Attribution

This project was completed as part of the MIT Applied Data Science Program (MIT IDSS / Great Learning). The program provided the case-study scaffolding; the analysis, code, and results are my own. Published with permission, for portfolio use only.