Methodology

Pipeline design notes covering data ingestion, leakage-safe feature engineering, per-unit per-horizon model training, evaluation, and deployment.

1. Data

The training data comprises 21 months (Mar 2024 – Dec 2025) of de-identified hourly census and ADT (admit / discharge / transfer) aggregates across 9 active nurse units, plus contextual features (ED census, scheduled surgeries, holiday flags); 139,507 hourly observations remain after cleaning. Lag features (1–72-hour census lags, rolling 4/8/24-hour flow rates, 7-day rolling statistics) are pre-computed in SQL and consumed directly by the pipeline. A production deployment would replace the static CSV input with an ETL job pulling from the live data warehouse.
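
For illustration, a minimal pandas sketch of the same lag/rolling features; the column names (`unit_id`, `ts`, `census`) are assumed, and the real features are computed in SQL, not Python:

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Pandas equivalent of the SQL feature job (column names assumed)."""
    df = df.sort_values(["unit_id", "ts"]).copy()
    g = df.groupby("unit_id")["census"]
    for lag in range(1, 73):                                   # 1-72 hour census lags
        df[f"census_lag_{lag}h"] = g.shift(lag)
    for w in (4, 8, 24):                                       # rolling flow rates
        df[f"flow_rate_{w}h"] = g.transform(lambda s: s.diff().rolling(w).mean())
    df["census_mean_7d"] = g.transform(lambda s: s.rolling(168).mean())  # 7-day stats
    df["census_std_7d"] = g.transform(lambda s: s.rolling(168).std())
    return df
```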

2. Train / validation / test split

| Split      | Date range              | Rows   | %     |
|------------|-------------------------|--------|-------|
| Train      | 2024-03-25 – 2025-06-30 | 99,772 | 71.5% |
| Validation | 2025-07-01 – 2025-09-30 | 19,872 | 14.2% |
| Test       | 2025-10-01 – 2025-12-30 | 19,863 | 14.2% |

Strictly chronological — no shuffling. Train spans a full annual seasonal cycle; validation captures the late-summer trough; test holds out the year-end peak.
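
A minimal sketch of the split logic, assuming a datetime `ts` column; the boundary dates come from the table above:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame):
    """Date-based split -- no shuffling, so later rows never leak into train."""
    train = df[df["ts"] <= "2025-06-30 23:00"]                  # endpoints inclusive
    val = df[df["ts"].between("2025-07-01", "2025-09-30 23:00")]
    test = df[df["ts"] >= "2025-10-01"]
    return train, val, test
```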

3. Leakage-safe feature filtering

For a forecast at horizon H, only features that would be known H hours in advance can be used. The pipeline enforces this automatically: lag features with lag < H are excluded from the feature set when training the H-hour model. For H = 72, that excludes every census lag shorter than 72 hours and every short rolling delta — leaving only temporal/calendar features, the 72-hour and 168-hour lags, and 7-day rolling stats.
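
A sketch of how such a filter can be enforced, assuming lag features follow a `_lag_{N}h` naming convention (the real pipeline's naming may differ):

```python
import re

def features_for_horizon(columns: list[str], horizon: int) -> list[str]:
    """Keep only columns that would be known `horizon` hours in advance."""
    keep = []
    for col in columns:
        if col.startswith("TARGET_"):
            continue                           # labels are never features
        m = re.search(r"_lag_(\d+)h$", col)
        if m and int(m.group(1)) < horizon:
            continue                           # a lag shorter than H leaks
        keep.append(col)
    return keep
```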

Twelve dedicated tests verify this property: for each of the 8 horizons, no shorter-lag feature appears in the corresponding feature set, and no TARGET_* column appears among the features at any horizon.
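
The checks could look like the following pytest sketch; the horizon list and toy column set are illustrative, since only "8 horizons" and the 72-hour case come from the text:

```python
import re
import pytest
from features import features_for_horizon   # the sketch above, assumed importable

HORIZONS = [1, 2, 4, 8, 12, 24, 48, 72]      # illustrative 8-horizon set
COLUMNS = ["hour_of_day", "census_lag_1h", "census_lag_72h", "TARGET_1h"]  # toy schema

@pytest.mark.parametrize("h", HORIZONS)
def test_no_shorter_lags(h):
    feats = features_for_horizon(COLUMNS, h)
    lags = [int(m.group(1)) for f in feats if (m := re.search(r"_lag_(\d+)h$", f))]
    assert all(lag >= h for lag in lags)

@pytest.mark.parametrize("h", HORIZONS)
def test_no_target_columns(h):
    assert not any(f.startswith("TARGET_") for f in features_for_horizon(COLUMNS, h))
```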

4. Per-unit, per-horizon training

Nurse units differ markedly in capacity, ADT volatility, and patient mix, so each unit gets its own model rather than a single cross-unit model with unit-as-feature. Tabular and recurrent models additionally train per horizon (one model per (unit, horizon) pair). ARIMA and Prophet train once per unit and forecast at all horizons by slicing — both are univariate so the fitted parameters are horizon-invariant, yielding an 8× training speedup. Training across units is parallelized with joblib.
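
A hypothetical driver for the per-unit parallelism; `frames`, `fit`, and the horizon list are placeholders for the real objects:

```python
from joblib import Parallel, delayed

def _fit_unit(unit, df, horizons, fit):
    # Tabular/recurrent models: one fit per (unit, horizon) pair.
    # ARIMA/Prophet would instead fit once per unit and slice the forecast.
    return unit, {h: fit(df, h) for h in horizons}

def train_per_unit(frames, horizons, fit):
    """frames: unit -> DataFrame; fit(df, horizon) -> fitted model (assumed API)."""
    pairs = Parallel(n_jobs=-1)(
        delayed(_fit_unit)(u, df, horizons, fit) for u, df in frames.items()
    )
    return dict(pairs)
```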

5. Evaluation

Primary metric: percentage of forecasts within ±2 patients of the actual census (operationally meaningful for staffing decisions). Secondary metrics: MAE, RMSE, MAPE. Residual diagnostics include Shapiro-Wilk normality test (sampled if n > 5000) and Ljung-Box autocorrelation. Evaluation is per (model, unit, horizon); aggregated tables report cross-unit means.
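
A sketch of the primary metric and the residual diagnostics; the 24-lag Ljung-Box choice and the sampling seed are assumptions:

```python
import numpy as np
from scipy.stats import shapiro
from statsmodels.stats.diagnostic import acorr_ljungbox

def pct_within_two(y_true, y_pred) -> float:
    """Primary metric: share of forecasts within +/-2 patients of actual census."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return float(np.mean(err <= 2))

def residual_diagnostics(residuals, max_n=5000, lags=24, seed=0) -> dict:
    r = np.asarray(residuals)
    if len(r) > max_n:                         # Shapiro-Wilk on a sample for large n
        r = np.random.default_rng(seed).choice(r, max_n, replace=False)
    _, shapiro_p = shapiro(r)
    lb = acorr_ljungbox(np.asarray(residuals), lags=[lags])    # returns a DataFrame
    return {"shapiro_p": float(shapiro_p),
            "ljungbox_p": float(lb["lb_pvalue"].iloc[0])}
```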

6. Operational deployment

A scheduled GitHub Actions workflow runs daily at 12:15 UTC. Each run regenerates a synthetic hourly window calibrated against the real distributions in unit_metadata.csv, runs the export pipeline, rebuilds the static dashboards, and commits the result. GitHub Pages auto-rebuilds and any connected Tableau Public workbook refreshes daily. Production deployment swaps the synthetic generator for a live-data ETL job and leaves the rest of the pipeline unchanged.
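
The schedule portion of such a workflow might look like this; only the 12:15 UTC cron comes from the text, everything else is illustrative:

```yaml
on:
  schedule:
    - cron: "15 12 * * *"    # daily at 12:15 UTC
  workflow_dispatch: {}      # manual trigger for debugging
jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # generate synthetic window -> run export pipeline -> rebuild dashboards -> commit
```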

7. Reproducibility

Random seeds set centrally (numpy, random, PyTorch). Dependencies pinned in requirements.txt. All hyperparameters in config/config.yaml — no magic constants in code. 34 pytest cases cover data integrity, leakage prevention, chronological splits, metric correctness, model train/predict, ensemble weights, and feature validation; the data-free subset (23 cases) runs in GitHub Actions on every push.
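
The central seeding might be a helper along these lines; the seed value and function name are assumptions:

```python
import random
import numpy as np
import torch

def set_global_seeds(seed: int = 42) -> None:
    """Seed every RNG the pipeline touches, in one place."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```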