Pipeline design notes covering data ingestion, leakage-safe feature engineering, per-unit per-horizon model training, evaluation, and deployment.
The training data is 21 months (Mar 2024 – Dec 2025) of de-identified hourly census and ADT (admit / discharge / transfer) aggregates across 9 active nurse units, plus contextual features (ED census, scheduled surgeries, holiday flags): 139,507 hourly observations after cleaning. Lag features (1–72-hour and 168-hour previous census, rolling 4/8/24h flow rates, 7-day rolling stats) are pre-computed in SQL and consumed directly by the pipeline. A real production deployment would replace the static CSV input with an ETL job pulling from the live data warehouse.
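For reference, the SQL-side lag/rolling computation can be sketched in pandas. Column names (`unit_id`, `ts`, `census`) and the exact lag subset shown are assumptions, not the production schema:

```python
import pandas as pd
import numpy as np

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the pre-computed features; assumes hourly rows per unit_id."""
    out = df.sort_values(["unit_id", "ts"]).copy()
    g = out.groupby("unit_id")["census"]
    for lag in [1, 2, 3, 24, 48, 72, 168]:  # illustrative subset of the 1-72h + 168h lags
        out[f"census_lag_{lag}h"] = g.shift(lag)
    for w in [4, 8, 24]:  # rolling flow-rate windows
        out[f"flow_rate_{w}h"] = g.transform(lambda s: s.diff().rolling(w).mean())
    out["census_roll_mean_7d"] = g.transform(lambda s: s.rolling(168).mean())
    out["census_roll_std_7d"] = g.transform(lambda s: s.rolling(168).std())
    return out
```

Grouping by unit before shifting matters: a plain `df["census"].shift(1)` would bleed the last hour of one unit into the first hour of the next.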
| Split | Date range | Rows | % |
|---|---|---|---|
| Train | 2024-03-25 – 2025-06-30 | 99,772 | 71.5% |
| Validation | 2025-07-01 – 2025-09-30 | 19,872 | 14.2% |
| Test | 2025-10-01 – 2025-12-30 | 19,863 | 14.2% |
Strictly chronological — no shuffling. Train spans full seasonal cycles; validation captures the late-summer trough; test holds out the year-end peak.
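The split logic reduces to slicing by timestamp against fixed boundaries. A minimal sketch, assuming a `ts` column of hourly timestamps (the boundary dates come from the table above):

```python
import pandas as pd

# Split boundaries matching the table above (end dates inclusive).
SPLITS = {
    "train": ("2024-03-25", "2025-06-30"),
    "validation": ("2025-07-01", "2025-09-30"),
    "test": ("2025-10-01", "2025-12-30"),
}

def chronological_split(df: pd.DataFrame, ts_col: str = "ts") -> dict:
    """Slice by date range only -- no shuffling, so no future rows reach train."""
    out = {}
    for name, (start, end) in SPLITS.items():
        # Include all 24 hours of the inclusive end date.
        mask = (df[ts_col] >= pd.Timestamp(start)) & (
            df[ts_col] <= pd.Timestamp(end) + pd.Timedelta(hours=23)
        )
        out[name] = df.loc[mask]
    return out
```

Because the boundaries are half-open in time rather than row counts, re-running the split on refreshed data cannot silently move rows between partitions.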
For a forecast at horizon H, only features that would be known H hours in advance can be used. The pipeline enforces this automatically: lag features with lag < H are excluded from the feature set when training the H-hour model. For H = 72, that excludes every census lag shorter than 72 hours and every short rolling delta — leaving only temporal/calendar features, the 72-hour and 168-hour lags, and 7-day rolling stats.
Twelve dedicated tests verify this property: for each of 8 horizons, no shorter-lag
feature appears in the corresponding feature set; for every horizon, no TARGET_*
column is in features.
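The horizon filter can be sketched as below. The `census_lag_24h`-style naming and the regex are assumptions; the real pipeline additionally drops the short rolling deltas, which this minimal version omits:

```python
import re

def features_for_horizon(all_features: list, horizon: int) -> list:
    """Keep only features knowable `horizon` hours in advance.

    Assumes lag columns are named like census_lag_24h. Calendar features and
    7-day rolling stats carry no lag suffix and always pass.
    """
    keep = []
    for col in all_features:
        if col.startswith("TARGET_"):
            continue  # never let a target column into the feature set
        m = re.search(r"lag_(\d+)h", col)
        if m and int(m.group(1)) < horizon:
            continue  # a lag shorter than the horizon would leak future data
        keep.append(col)
    return keep
```

A leakage test then reduces to asserting, per horizon, that no surviving feature has a lag below that horizon and no `TARGET_*` column is present.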
Nurse units differ markedly in capacity, ADT volatility, and patient mix, so each unit gets its own model rather than a single cross-unit model with unit-as-feature. Tabular and recurrent models additionally train per horizon (one model per (unit, horizon) pair). ARIMA and Prophet train once per unit and forecast at all horizons by slicing — both are univariate so the fitted parameters are horizon-invariant, yielding an 8× training speedup. Training across units is parallelized with joblib.
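Since the (unit, horizon) models are independent, the per-unit fan-out is embarrassingly parallel. A sketch of the joblib layout, with hypothetical horizon values and a generic `model_factory` callable:

```python
from joblib import Parallel, delayed

HORIZONS = [1, 2, 4, 8, 12, 24, 48, 72]  # assumed set of 8 forecast horizons

def train_unit(unit_id, data, model_factory):
    """Train one tabular/recurrent model per horizon for a single unit.
    `data` maps (unit_id, horizon) -> (X, y) with horizon-filtered features."""
    return {h: model_factory().fit(*data[(unit_id, h)]) for h in HORIZONS}

def train_all_units(unit_ids, data, model_factory, n_jobs=-1):
    # One task per unit; joblib handles process scheduling across the 9 units.
    results = Parallel(n_jobs=n_jobs)(
        delayed(train_unit)(u, data, model_factory) for u in unit_ids
    )
    return dict(zip(unit_ids, results))
```

ARIMA and Prophet would bypass the inner horizon loop entirely, fitting once per unit, which is where the stated 8× speedup comes from.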
Primary metric: percentage of forecasts within ±2 patients of the actual census (operationally meaningful for staffing decisions). Secondary metrics: MAE, RMSE, MAPE. Residual diagnostics include Shapiro-Wilk normality test (sampled if n > 5000) and Ljung-Box autocorrelation. Evaluation is per (model, unit, horizon); aggregated tables report cross-unit means.
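The primary and secondary error metrics are simple to state precisely. A sketch (function names are ours, not the pipeline's):

```python
import numpy as np

def within_band(actual, forecast, band=2.0):
    """Primary metric: fraction of forecasts within +/- `band` patients of actual."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs(actual - forecast) <= band))

def mae(actual, forecast):
    return float(np.mean(np.abs(np.asarray(actual, float) - np.asarray(forecast, float))))

def rmse(actual, forecast):
    return float(np.sqrt(np.mean((np.asarray(actual, float) - np.asarray(forecast, float)) ** 2)))
```

The ±2-patient band is attractive as a headline metric because, unlike MAPE, it is stable on low-census units where small absolute errors produce large percentage errors.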
A scheduled GitHub Actions workflow runs daily at 12:15 UTC. Each run regenerates a
synthetic hourly window calibrated against the real distributions in
unit_metadata.csv, runs the export pipeline, rebuilds the static dashboards,
and commits the result. GitHub Pages auto-rebuilds and any connected Tableau Public
workbook refreshes daily. Production deployment swaps the synthetic generator for
a live-data ETL job and leaves the rest of the pipeline unchanged.
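The scheduled workflow could be wired up roughly as follows; the script paths and step names here are assumptions, only the 12:15 UTC cron and the overall shape (generate, export, rebuild, commit) come from the notes:

```yaml
name: daily-refresh
on:
  schedule:
    - cron: "15 12 * * *"   # 12:15 UTC daily
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: python scripts/generate_synthetic_window.py   # hypothetical script names
      - run: python scripts/run_export_pipeline.py
      - run: python scripts/build_dashboards.py
      - run: |
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add -A && git commit -m "daily refresh" && git push
```

Swapping in the live ETL job would change only the first `run` step.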
Random seeds set centrally (numpy, random, PyTorch). Dependencies pinned in
requirements.txt. All hyperparameters in config/config.yaml —
no magic constants in code. 34 pytest cases
cover data integrity, leakage prevention, chronological splits, metric correctness,
model train/predict, ensemble weights, and feature validation; the data-free subset
(23 cases) runs in GitHub Actions on every push.
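Central seeding can be sketched as a single helper called once at startup; the seed value and the optional-PyTorch handling are assumptions:

```python
import os
import random

import numpy as np

def set_global_seed(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches from one place."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is optional in the lightweight data-free CI environment
```

Guarding the torch import keeps the same helper usable in the 23-case data-free CI subset, which need not install PyTorch.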