Abstract

Short-horizon epidemic forecasting is difficult when surveillance series are highly nonstationary and affected by structural change and evolving reporting conditions. This study evaluates statistical models for global daily COVID-19 incidence using a rolling-origin benchmark designed to approximate real-time forecasting under such conditions. Using global incidence data from 22 January to 27 July 2020, we compare naive, seasonal naive, drift, ARIMA(log1p), ETS(log1p), and Prophet(log1p) forecasts at horizons h ∈ {1, 3, 7, 14} days. Structural phases are identified retrospectively on a variance-stabilized scale and used only to stratify forecast errors. Forecast ranking is strongly horizon-dependent. In the full-sample benchmark, drift performs best at the 1-, 7-, and 14-day horizons, while seasonal naive performs best at 3 days. Among the transformed statistical models, ARIMA(log1p) is competitive at short horizons, whereas ETS(log1p) becomes stronger at 7 and 14 days. Diebold-Mariano tests confirm that several of these differences are statistically significant, particularly in favor of drift at short and long horizons and in favor of ETS(log1p) over ARIMA(log1p) at longer horizons. Prophet(log1p) is not competitive in point forecasting and achieves high nominal interval coverage mainly through very wide prediction intervals. Robustness analyses show that the main ranking patterns are broadly stable under alternative segmentation settings, training-window policies, coverage-stabilized subsamples, and alternative target construction based on cumulative confirmed counts. Overall, the results show that simple baselines remain difficult to outperform in epidemic surveillance data and that horizon-specific rolling evaluation is essential for credible forecast comparison under structural change.

Author summary

Forecasting infectious disease incidence is difficult when case data change rapidly over time and when reporting systems are still evolving.
In this study, I examined how several common statistical forecasting models perform on global daily COVID-19 incidence during the early pandemic. Rather than asking which model is best overall, I focused on whether model ranking changes across forecast horizons and whether those conclusions remain stable under different evaluation choices. I compared simple baselines, including naive, seasonal naive, and drift forecasts, with ARIMA, exponential smoothing, and Prophet models using a rolling-origin benchmark that mimics real-time forecasting. I found that forecast ranking depends strongly on the horizon: drift performed best at 1, 7, and 14 days, while seasonal naive performed best at 3 days. Among the transformed statistical models, ARIMA was more competitive at shorter horizons, whereas exponential smoothing was stronger at longer horizons. I also found that these conclusions remained broadly stable under alternative segmentation settings, training windows, coverage-stabilized subsamples, and target definitions. These results show that simple baselines can remain highly competitive in epidemic surveillance data and that horizon-specific evaluation is essential for fair forecast comparison under structural change.
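The rolling-origin benchmark with simple baselines described above can be sketched as follows. This is a minimal illustrative example on a toy trending series, not the study's actual data, models, or evaluation settings; the function names, the minimum training window, and the weekly period are assumptions made for the sketch.

```python
# Minimal sketch of horizon-specific rolling-origin evaluation with
# naive, seasonal naive, and drift baselines. Toy data only.
import math

def naive(train, h):
    # Repeat the last observed value h steps ahead
    return [train[-1]] * h

def seasonal_naive(train, h, period=7):
    # Use the value from the same position in the last observed weekly cycle
    return [train[-period + (i % period)] for i in range(h)]

def drift(train, h):
    # Extrapolate the average historical increment (first-to-last slope)
    slope = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + (i + 1) * slope for i in range(h)]

def rolling_origin_mae(series, forecaster, h, min_train=28):
    """Mean absolute error of the h-step-ahead point forecast over all
    rolling origins that leave at least h observations for scoring."""
    errors = []
    for origin in range(min_train, len(series) - h + 1):
        train = series[:origin]
        fc = forecaster(train, h)
        # Score only the h-step-ahead point: horizon-specific evaluation
        errors.append(abs(fc[h - 1] - series[origin + h - 1]))
    return sum(errors) / len(errors)

# Toy series standing in for daily incidence: linear trend + weekly cycle
series = [100 + 10 * t + 5 * math.sin(2 * math.pi * t / 7) for t in range(120)]

for h in (1, 3, 7, 14):
    scores = {name: round(rolling_origin_mae(series, f, h), 2)
              for name, f in [("naive", naive),
                              ("snaive", seasonal_naive),
                              ("drift", drift)]}
    print(f"h={h}: {scores}")
```

On this toy series the drift baseline dominates at every horizon because the data are trend-dominated; on real incidence data the ranking varies by horizon, which is exactly why the study scores each horizon separately rather than averaging across them.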