Model cards
Five individual models across three families: tabular (Random Forest, LightGBM), recurrent (LSTM), and statistical (ARIMA, Prophet). An inverse-MAPE weighted ensemble brings the total to six. All models are trained per nurse unit; the tabular and recurrent models are additionally trained per forecast horizon.
Cards below follow the Mitchell et al. (2019) ML model card convention, adapted for capstone scope.
Individual model cards
Random Forest
Non-parametric ensemble of 200 deep decision trees, each trained on a bootstrap sample with random feature subsets. Captures non-linear interactions across the 61-feature input space without requiring feature scaling or distributional assumptions.
Architecture & hyperparameters
- 200 trees (n_estimators=200)
- max_depth=20, min_samples_split=5, min_samples_leaf=2
- Per-unit, per-horizon — separate model per (unit, horizon)
- Feature inputs respect horizon-dependent leakage filter
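Under the settings above, the per-(unit, horizon) estimator could be constructed roughly as follows. This is a sketch using scikit-learn; the factory name, `random_state`, and `n_jobs` are assumptions, not card settings.

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical factory mirroring the card's settings; one such model
# is fit per (unit, horizon) pair.
def make_rf() -> RandomForestRegressor:
    return RandomForestRegressor(
        n_estimators=200,     # 200 trees
        max_depth=20,
        min_samples_split=5,
        min_samples_leaf=2,
        n_jobs=-1,            # fit trees in parallel (assumption)
        random_state=42,      # assumed seed for reproducibility
    )
```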
Strengths
- Handles non-linear feature interactions natively
- Built-in feature importance for interpretability
- Robust to outliers and missing values
- Best in class at the 1-hour horizon (99.7% ±2 accuracy, tied with LightGBM)
Limitations
- No native uncertainty quantification (point predictions)
- Memory-heavy at inference (200-tree ensemble per (unit, horizon))
- Cannot extrapolate beyond training-data range
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| **99.7%** | 97.4% | 95.3% | 93.8% | 88.1% | 85.9% | 82.4% | 82.1% |

Bold cells mark horizons where this model is the top performer across all models (1h is a tie with LightGBM).
LightGBM
Microsoft's leaf-wise gradient-boosting library, with histogram-based splits and early stopping on the validation set. Produces strong tabular predictions at modest compute cost; the workhorse model for short-to-medium horizons in this pipeline.
Architecture & hyperparameters
- 500 estimators max with early stopping on validation
- max_depth=8, num_leaves=31, learning_rate=0.05
- Subsampling: 0.8 (rows), 0.8 (columns)
- Per-unit, per-horizon
Strengths
- Best-in-class accuracy on tabular features (94–98% ±2 at 2–4h horizons)
- Faster training and inference than Random Forest at comparable accuracy
- Handles categorical features natively, NaN-safe
- Built-in feature importance (split-count and gain)
Limitations
- Hyperparameter sensitive (tuning matters)
- No native uncertainty quantification
- Boosting is sequential across iterations, so trees cannot be trained in parallel
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| **99.7%** | **97.8%** | **95.8%** | **94.5%** | 89.6% | 88.5% | 86.5% | 86.4% |

Bold cells mark horizons where this model is the top performer across all models (1h is a tie with Random Forest).
LSTM
Two-layer stacked Long Short-Term Memory network reading 168-hour (7-day) sequences of feature vectors. Strongest at long horizons, where short-term lag features are removed by the leakage filter and the model must rely on learned temporal structure.
Architecture & hyperparameters
- 2-layer stacked LSTM, 64 hidden units per layer
- Dropout 0.2 between layers, dense output head
- Sequence length 168 hours (7 days), batch size 64
- Adam optimizer, lr=0.001, MSE loss, early stopping patience 10
- Per-unit, per-horizon (with per-horizon scaler)
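The architecture above can be sketched in PyTorch roughly as follows. The class name, the feature count of 61, and reading the forecast from the last timestep are illustrative assumptions, not confirmed by the card.

```python
import torch
import torch.nn as nn

class CensusLSTM(nn.Module):
    """Sketch of the card's architecture: 2 stacked LSTM layers of 64
    hidden units, dropout 0.2 between them, and a dense head producing
    a single census forecast."""
    def __init__(self, n_features: int = 61, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden,
            num_layers=2,
            dropout=0.2,       # applied between the two layers
            batch_first=True,
        )
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 168, n_features) -- a 7-day hourly window
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # forecast from the last step

# Training setup from the card: Adam at lr=0.001 with MSE loss.
model = CensusLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
```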
Strengths
- Best at long horizons (12h–72h: 87–91% ±2 accuracy)
- Captures long-term temporal patterns missed by tabular models
- Sequence input naturally encodes recent history
Limitations
- Computationally expensive (PyTorch + per-horizon scaler)
- Less interpretable than tree models
- Requires sufficient history per unit (sparse units underperform)
- Per-horizon training inflates artifact count
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| 98.1% | 95.6% | 88.9% | 94.2% | **90.5%** | **89.0%** | **87.5%** | **87.6%** |

Bold cells mark horizons where this model is the top performer across all models.
ARIMA
Univariate seasonal ARIMA fit per unit via auto-selection over (p,d,q) and seasonal (P,D,Q) orders with daily seasonality. Each model is trained once per unit and forecasts all horizons from that single fit (an 8× speedup over per-horizon training).
Architecture & hyperparameters
- auto_arima search: max_p=5, max_d=2, max_q=5
- Seasonal period = 24 (daily seasonality)
- 3-month rolling training window per unit (full-history fits are impractical due to Kalman filter state cost)
- JSON parameter serialization (avoids multi-GB SARIMAX pickles)
Strengths
- Native confidence intervals from get_forecast()
- Statistically interpretable (clear order semantics)
- Robust baseline; ~82% ±2 accuracy across all horizons
Limitations
- Univariate — ignores ED, surgery, ADT flow features
- Slow to fit on long histories (Kalman filter overhead)
- Near-identical accuracy across horizons (a single fit cannot exploit horizon-specific features)
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| 82.0% | 82.0% | 82.0% | 82.0% | 82.0% | 82.1% | 82.2% | 82.1% |
Prophet
Facebook's piecewise-linear trend model with explicit yearly, weekly, and daily seasonality plus US holiday effects. Trained once per unit; forecasts at each horizon are produced by shifting the future dataframe.
Architecture & hyperparameters
- yearly_seasonality, weekly_seasonality, daily_seasonality all enabled
- changepoint_prior_scale=0.05 (default)
- country_holidays='US' for built-in holiday effects
- Per-unit, train-once
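A configuration sketch of the train-once, per-unit model under the settings above. The factory name is illustrative, and holidays are added here via Prophet's `add_country_holidays` API.

```python
from prophet import Prophet

# Hypothetical per-unit factory; all values mirror the card's settings.
def make_prophet() -> Prophet:
    m = Prophet(
        yearly_seasonality=True,
        weekly_seasonality=True,
        daily_seasonality=True,
        changepoint_prior_scale=0.05,  # library default, kept explicit
    )
    m.add_country_holidays(country_name="US")  # built-in US holidays
    return m
```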
Strengths
- Robust to missing data and outliers
- Native uncertainty intervals
- Explicit, interpretable seasonality components
- Holiday-aware out of the box
Limitations
- Univariate (CENSUS only)
- Doesn't capture short-term shocks driven by ADT flow
- ~86% ±2 accuracy across horizons — solid baseline but bested by ML models
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| 86.3% | 86.3% | 86.3% | 86.3% | 86.3% | 86.3% | 86.2% | 86.1% |
Inverse-MAPE weighted ensemble
Per-unit, per-horizon weighted average of all available individual model predictions. Weights are computed from validation MAPE, so better-performing models get higher weight, and are then normalized to sum to 1.
Architecture & hyperparameters
- weight_i = (1 / MAPE_i) normalized across enabled models
- Per-unit, per-horizon weights stored as JSON
- Skips models with NaN MAPE on the validation set
- Requires ≥2 component models per (unit, horizon) to produce an ensemble prediction
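The weighting rule above can be sketched in plain Python; the function name and the skip condition for non-positive scores are illustrative.

```python
import math

def inverse_mape_weights(val_mape):
    """Sketch of the card's rule: w_i = (1 / MAPE_i), normalized to
    sum to 1 across the usable models. Models with NaN (or assumed
    non-positive) validation MAPE are skipped, and fewer than two
    usable components yields no ensemble."""
    usable = {name: s for name, s in val_mape.items()
              if s is not None and not math.isnan(s) and s > 0}
    if len(usable) < 2:
        return {}
    inverse = {name: 1.0 / s for name, s in usable.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}
```

For example, validation MAPEs of 2.0 and 4.0 yield weights of 2/3 and 1/3, and a model with NaN MAPE is dropped before normalization.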
Strengths
- Hedges against single-model failure modes
- Smoother performance across horizons (no sharp handoff between tree-model dominance at short horizons and LSTM dominance at long ones)
- MAPE-based weighting downweights poor performers automatically
Limitations
- Weights frozen from validation set — won't adapt to drift
- Inherits the failure modes of all components
- Not always the winner: tree models or LSTM often beat the ensemble at their respective horizons
Validation ±2 patient accuracy
| 1h | 2h | 3h | 4h | 12h | 24h | 48h | 72h |
|----|----|----|----|-----|-----|-----|-----|
| 97.0% | 94.1% | 91.6% | 91.3% | 87.7% | 86.6% | 85.3% | 85.2% |