Abstract
Background: Prognostic tools such as genomic assays are typically evaluated on whether they can distinguish high-risk from low-risk patients. However, their scores are uncalibrated: they do not directly correspond to the probability of recurrence and therefore lack a straightforward interpretation. As a result, while such tools may be prognostic, translating their outputs into clinical decisions can be challenging. Calibration offers a solution by aligning predicted risks with observed event rates to produce absolute recurrence risk estimates. However, calibration in survival analysis remains challenging because of censoring. In this study, we demonstrate the critical role of calibration in prognostic model development, where it ensures that scores are aligned with observed event rates.
Methods: First, we calibrated a novel AI-based recurrence prediction model with spline-based methods, using data from 5,351 early-stage breast cancer (BC) patients across 13 cohorts. We evaluated the AI model’s calibration on a hold-out set of 3,457 patients from 5 independent cohorts (separate from the 13 training cohorts). Additionally, we evaluated the calibration of the Oncotype DX recurrence score (ODX RS) against data from the SWOG 8814 trial. For the evaluation, we stratified the hold-out patients into quartiles by predicted AI risk. Within each quartile, we compared the median AI score with the observed recurrence rate estimated with the Kaplan-Meier estimator. To evaluate ODX RS, patients in our cohorts were divided into quartiles by RS, and the median score within each quartile was mapped to the corresponding 5-year disease-free survival (DFS) estimate from the SWOG trial. This analysis was performed separately by treatment regimen (CAF+Tamoxifen vs. Tamoxifen alone) and nodal status (1-3 vs. ≥4 positive nodes). We used Lin’s concordance correlation coefficient (CCC) to quantify calibration in all analyses.
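The quartile-based evaluation described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`km_event_rate`, `lins_ccc`, `quartile_calibration`), the use of plain NumPy, and the toy numbers are all assumptions made for illustration, and the spline-based calibration step itself is not shown. The sketch estimates the observed event rate in each predicted-risk quartile with a Kaplan-Meier product-limit estimate and then scores agreement between median predicted risk and observed rate with Lin's CCC.

```python
import numpy as np

def km_event_rate(time, event, horizon):
    """Kaplan-Meier estimate of the cumulative event probability by `horizon`.

    `event` is 1/True for an observed recurrence and 0/False for censoring.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Step through each distinct event time up to the horizon.
    for t in np.unique(time[event & (time <= horizon)]):
        at_risk = np.sum(time >= t)            # still under observation at t
        n_events = np.sum((time == t) & event)
        surv *= 1.0 - n_events / at_risk
    return 1.0 - surv

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2.0 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def quartile_calibration(pred, time, event, horizon):
    """Median predicted risk and KM-observed event rate per risk quartile."""
    pred = np.asarray(pred, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    edges = np.quantile(pred, [0.25, 0.5, 0.75])
    group = np.searchsorted(edges, pred, side="right")  # quartile index 0..3
    medians, observed = [], []
    for g in range(4):
        mask = group == g
        medians.append(np.median(pred[mask]))
        observed.append(km_event_rate(time[mask], event[mask], horizon))
    return np.array(medians), np.array(observed)

# Toy illustration (made-up numbers): scores that track the observed rates
# perfectly in rank and linear trend, but are offset by a constant.
observed_rates = np.array([0.10, 0.20, 0.30, 0.40])
shifted_scores = observed_rates + 0.30
pearson_r = np.corrcoef(shifted_scores, observed_rates)[0, 1]
ccc = lins_ccc(shifted_scores, observed_rates)
print(f"Pearson r = {pearson_r:.2f}, Lin's CCC = {ccc:.2f}")
```

The toy example shows why a correlation measure alone cannot establish calibration: a constant offset leaves the Pearson correlation untouched, while the CCC penalizes the gap between the means. With only four quartile points per analysis, any interval around such a CCC would be expected to be wide.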
Results: Among the 3,457 evaluated patients, both calibrated and uncalibrated scores showed strong monotonic associations with recurrence (Pearson R-squared = 0.98), but the uncalibrated AI model was poorly calibrated, with a CCC of 0.29 (95% CI: [-0.14, 0.63], p=0.19). Calibrating the model raised the CCC to 0.92 (95% CI: [0.53, 0.99], p=0.001). Among patients with ODX RS, the AI model remained well calibrated, with a CCC of 0.90 (95% CI: [0.19, 0.99], p=0.02). In comparison, ODX RS was poorly calibrated, with CCC ranging from 0.13 (≥4 nodes, CAF+Tam) to 0.47 (1-3 nodes, Tam only).
Conclusion: Calibration enables the model to produce risk estimates that more accurately reflect true recurrence probabilities, enhancing clinical interpretability and utility for treatment decision-making.
Citation Format: K. G. Zeng, J. Witowski, J. Cappadona, B. Machura, C. Fernandez-Granada, K. J. Geras. Aligning AI Recurrence Predictions with Real-world Recurrence Rates Through Survival Calibration [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PS3-06-07.
Published in: Clinical Cancer Research
Volume 32, Issue 4_Supplement, pp. PS3-06