Comparative Evaluation of Logistic Regression and Gradient Boosting Models for Influenza Outbreak Early-Warning Using U.S. CDC ILINet Surveillance Data (2010–2025)

2026 · 0 citations · Journal Article · green Open Access

Authors

Chika Nduka Onwuameze · Grambling State University

Abstract

Background: Timely detection of seasonal influenza outbreaks is critical for healthcare system preparedness and public health response. Although numerous studies have examined short-term influenza forecasting, fewer have operationalized prediction as a binary early-warning problem linked to actionable surveillance thresholds. This study evaluated the performance of traditional and machine learning models for detecting national influenza outbreak weeks using U.S. Centers for Disease Control and Prevention (CDC) ILINet surveillance data.

Methods: Weekly national ILINet data from 2010–2025 were analyzed. Outbreak weeks were defined as those in which weighted influenza-like illness (ILIPERCENT) exceeded the 90th percentile of the 2010–2017 training distribution (threshold = 3.3932%). Predictors included three-week lags of ILIPERCENT and percent positive laboratory specimens, along with seasonal harmonic terms. Models were trained on 2010–2017 data and evaluated on a temporally held-out 2020–2025 test period. Performance metrics included area under the receiver operating characteristic curve (AUC), precision–recall area under the curve (PR-AUC), sensitivity, specificity, precision, and F1-score.

Findings: On the 2020–2025 test set, logistic regression achieved an AUC of 0.9964 and PR-AUC of 0.9868, with sensitivity of 1.0000 and specificity of 0.9516. XGBoost achieved an AUC of 0.9946 and PR-AUC of 0.9812, with sensitivity of 0.8939 and specificity of 0.9798. Both models demonstrated near-perfect discrimination between outbreak and non-outbreak weeks under strict temporal validation.

Interpretation: National influenza outbreak early-warning can be implemented using publicly available CDC surveillance data with high discriminatory accuracy. Framing prediction as a threshold-based outbreak detection problem strengthens operational relevance and supports integration of predictive analytics into routine influenza surveillance and preparedness planning.
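
The labeling and feature construction described in the Methods can be sketched as follows. This is a minimal illustration rather than the author's code, and it rests on several assumptions the abstract does not confirm: "three-week lags" is read as lags of 1, 2, and 3 weeks, the harmonic terms are a single sine/cosine pair with a 52-week period, the percentile uses linear interpolation, and the label refers to the same week as the feature row. The names `percentile`, `build_rows`, and the `toy_*` values are hypothetical and are not real ILINet data.

```python
import math

def percentile(values, q):
    """Percentile (q in 0..100) using linear interpolation between order statistics."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def build_rows(week_of_year, ili, pct_pos, threshold, n_lags=3, period=52.0):
    """Return one (features, label) pair per week that has a full lag history.

    features = [ILI lag 1..n, percent-positive lag 1..n, sin, cos]
    label    = 1 if that week's ILI exceeds the threshold, else 0
    """
    rows = []
    for t in range(n_lags, len(ili)):
        ili_lags = [ili[t - k] for k in range(1, n_lags + 1)]
        pos_lags = [pct_pos[t - k] for k in range(1, n_lags + 1)]
        # Assumption: one annual harmonic pair keyed on week of year.
        angle = 2.0 * math.pi * week_of_year[t] / period
        features = ili_lags + pos_lags + [math.sin(angle), math.cos(angle)]
        label = 1 if ili[t] > threshold else 0
        rows.append((features, label))
    return rows

# Toy values for illustration only. In the study's design the threshold is
# taken from the training period alone and then reused for the test period.
toy_weeks = list(range(40, 50))
toy_ili = [1.0, 1.2, 1.5, 2.0, 3.0, 4.0, 3.5, 2.5, 1.8, 1.1]
toy_pos = [2.0, 3.0, 5.0, 9.0, 15.0, 22.0, 18.0, 11.0, 6.0, 3.0]

toy_threshold = percentile(toy_ili, 90)
toy_rows = build_rows(toy_weeks, toy_ili, toy_pos, toy_threshold)
```

Fixing the threshold on the training years before touching the test years keeps the outbreak definition from leaking information across the temporal split, which is the point of the held-out evaluation.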

Author Summary

Seasonal influenza places a heavy burden on hospitals and communities each year, yet public health officials often rely on surveillance reports that describe what has already happened rather than signaling when activity is about to intensify. We examined whether routinely collected U.S. influenza surveillance data could be used to detect outbreak conditions earlier and more clearly. Using national data from the Centers for Disease Control and Prevention (CDC) covering 2010 to 2025, we compared a traditional statistical model with a machine learning approach to determine how accurately each could identify weeks when influenza activity exceeded a predefined outbreak threshold. Both approaches performed extremely well when tested on recent seasons, correctly distinguishing outbreak from non-outbreak weeks with high accuracy. Importantly, this framework translates weekly surveillance data into a practical alert signal rather than simply producing numerical forecasts. By linking model output to a clear outbreak definition, health departments and healthcare systems could use similar tools to support timely planning, communication, and resource allocation during influenza season.
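
Turning model output into an alert signal, and scoring that signal with the sensitivity, specificity, precision, and F1 measures named in the abstract, might look like the sketch below. The 0.5 probability cutoff is an assumption, since the abstract does not state the decision threshold used; `alert_metrics` and the `toy_*` values are hypothetical.

```python
def alert_metrics(y_true, y_prob, cutoff=0.5):
    """Score binary alerts (probability >= cutoff) against true outbreak labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        alert = p >= cutoff
        if alert and y == 1:
            tp += 1
        elif alert and y == 0:
            fp += 1
        elif not alert and y == 1:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    sensitivity = ratio(tp, tp + fn)   # share of outbreak weeks that were flagged
    specificity = ratio(tn, tn + fp)   # share of quiet weeks left unflagged
    precision = ratio(tp, tp + fp)     # share of alerts that were real outbreaks
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Toy labels and predicted probabilities, for illustration only.
toy_true = [1, 1, 0, 0, 0, 1, 0, 0]
toy_prob = [0.9, 0.4, 0.2, 0.6, 0.1, 0.8, 0.3, 0.05]
toy_metrics = alert_metrics(toy_true, toy_prob)
```

Lowering the cutoff trades specificity for sensitivity, which is the same kind of trade-off visible in the reported results, where logistic regression flags every outbreak week at the cost of more false alerts than XGBoost.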

Topics & Keywords

Data-Driven Disease Surveillance · Influenza Virus Research Studies · COVID-19 epidemiological studies

UN Sustainable Development Goals

Reduced inequalities

Publication Details

Published in: medRxiv

DOI: 10.64898/2026.03.05.26347655

Field-Weighted Citation Impact: 0.00