Abstract

Background: Automated sepsis early-warning systems have attracted substantial research investment, yet a fundamental question remains unresolved: do these models detect independent biological signals, or do they predominantly learn care-process intensity, the pattern of clinician ordering behavior applied to patients already suspected of being ill? We report a pre-registered falsification study testing this hypothesis across four independent clinical datasets.

Methods: A four-phase falsification framework with pre-specified thresholds was registered on OSF (March 11, 2026) before any data access. The primary confirmatory analysis used MIMIC-IV v3.1 (n=65,241 adult ICU stays, Beth Israel Deaconess Medical Center, 2008-2022). Exploratory replication analyses used eICU-CRD v2.0 (n=136,864, 208 US hospitals), MIMIC-III v1.4 (n=44,091), and the PhysioNet/CinC 2019 Sepsis Challenge (n=40,314). Each phase tested a distinct falsification criterion: (1) concordance across Sepsis-2, Sepsis-3, and CMS SEP-1 definitions; (2) model performance degradation when care-intensity proxy features are removed; (3) predictive performance of care-intensity features alone; and (4) discriminability of synthetic records generated to match care-intensity distributions.

Results: The pre-registered primary analysis (MIMIC-IV) did not confirm the hypothesis (0/4 phases confirmed). Biological features predicted Sepsis-3 labels with AUROC 0.901 (95% CI 0.899-0.904), and removing care-intensity features reduced AUROC by only 0.0027. The pre-specified Phase 3 threshold (care-only AUROC >0.70) was not met by the primary logistic regression model (AUROC 0.660); a sensitivity XGBoost model did, however, exceed the threshold (AUROC 0.729), suggesting a nonlinear care-intensity signal. A clinically significant finding nonetheless emerged consistently across all four datasets: mean pairwise Jaccard similarity between clinical sepsis definitions and administrative coding (CMS SEP-1) was approximately 0.32 at the primary site and 0.20 across the multi-center cohorts, indicating that hospital quality metrics and regulatory reporting systematically measure a different patient population than clinical definitions identify. Exploratory analyses also revealed a care-intensity signal in the multi-center eICU cohort that was absent at the single academic center; at community hospitals the AUROC drop was 0.076, versus 0.0027 at the academic center, suggesting the null result may be specific to high-acuity academic settings.

Conclusions: At an elite academic medical center, sepsis prediction models detect genuine biological signal; care-process leakage is not the primary driver of model performance in MIMIC-IV. The more consequential and robust finding is the systematic divergence between clinical and administrative sepsis definitions across all datasets examined, which has direct implications for regulatory reporting, pay-for-performance metrics, and the validity of AI benchmarks built on administrative data.
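For concreteness, a minimal sketch of the definition-concordance metric reported in the Results: mean pairwise Jaccard similarity over the sets of ICU stays that each sepsis definition labels positive. The function names and the stay-ID sets below are illustrative placeholders, not study data or the authors' actual pipeline.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two labeled cohorts."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def mean_pairwise_jaccard(cohorts: dict[str, set]) -> float:
    """Mean Jaccard over all unordered pairs of definitions."""
    pairs = list(combinations(cohorts.values(), 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical stay-ID sets; in the study these would be the stays
# flagged septic by each definition within a given dataset.
cohorts = {
    "sepsis2":  {101, 102, 103, 105, 108},
    "sepsis3":  {102, 103, 104, 108, 110},
    "cms_sep1": {103, 108, 111},
}
print(f"mean pairwise Jaccard = {mean_pairwise_jaccard(cohorts):.3f}")
```

Under this metric, a value near 1.0 would mean the definitions flag essentially the same patients; the reported values of roughly 0.32 and 0.20 indicate largely non-overlapping cohorts.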