Abstract

Background: Pancreatic cancer is typically detected at an incurable stage because symptoms are often absent early in the disease, and current screening guidelines lack sensitivity and specificity. Existing risk prediction tools require structured data, which hinders deployment. We evaluated whether large language models (LLMs) can predict pancreatic cancer risk using only free-text clinical notes.

Methods: We used routine free-text general practitioner clinical notes from individuals in Ontario aged >18 years, collected through ICES between 2010 and 2016. Pancreatic cancer patients were matched with controls in a nested case–control design, with metrics adjusted using inverse probability weighting (IPW). Two approaches were explored. (1) Reasoning-based LLM prediction: source-available reasoning LLMs (DeepSeek-R1, QwQ) were prompted to simulate step-by-step clinical reasoning over raw clinical notes. (2) Ensemble prediction: to minimize computational requirements at deployment, we tested several lightweight LLMs with different ensembling techniques, such as sampling under varied decoding parameters (min-p, top-k, top-p) and across LLMs, with samples aggregated using different strategies. We developed both methods in a development cohort of 200 patients (1:1 cases to controls) in Southwestern Ontario and subsequently tested them in a cohort of 750 patients (1:5) in Toronto. Five-year look-ahead windows were evaluated with a one-year exclusion period preceding diagnosis, to focus on future risk and exclude patients already undergoing a diagnostic work-up.

Results: The median (range) number of characters per note was 390 (20–8,000), and the median number of notes per patient was 20. In the reasoning-based approach, the best-performing model from the development cohort achieved an area under the receiver operating characteristic curve (AUROC) of 0.77 (95% CI: 0.70–0.84) in the test cohort for predicting pancreatic cancer in the five years after each clinical note.
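The two sample-aggregation strategies named in the Methods (selecting the minimal predicted score, and selecting the most frequently predicted score across samples) can be sketched as follows. This is an illustrative sketch only, not the authors' code; the 0–10 risk-score scale and the sample values are assumptions for the example.

```python
# Illustrative sketch (not the study's implementation): aggregating
# discrete risk scores sampled from one or more LLMs run under
# different decoding parameters (min-p, top-k, top-p).
from collections import Counter

def aggregate_min(scores):
    """Conservative strategy: keep the minimal predicted score across samples."""
    return min(scores)

def aggregate_mode(scores):
    """Majority strategy: keep the most frequently predicted score."""
    return Counter(scores).most_common(1)[0][0]

# Hypothetical scores from five samples on an assumed 0-10 scale:
samples = [3, 7, 3, 2, 3]
print(aggregate_min(samples))   # minimal score across samples
print(aggregate_mode(samples))  # most frequent score across samples
```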
Lightweight models in ensembles showed variable performance depending on the strategy used. Sampling a single model under different decoding parameters reached an AUROC of 0.70. Ensembling multiple models and selecting the minimal predicted score across samples yielded an AUROC of 0.75. Selecting the most frequently predicted score across models with different decoding parameters improved the AUROC to 0.77. A simulated screening strategy that flagged the top 0.5% highest-risk individuals yielded a relative risk of 28.1, a specificity of 0.991, a sensitivity of 0.192, a positive predictive value of 0.025, and a negative predictive value of 0.999.

Conclusions: LLMs can predict pancreatic cancer risk directly from clinical notes years before diagnosis, without structured inputs or pre-processing. This approach offers a scalable, generalizable, and interpretable framework for future risk prediction, potentially supporting novel population-based approaches to pancreatic cancer screening.

Citation Format: Daniel Mau, Karl Everett, Ning Liu, Jason Chai-Onn, Liisa Jaakkimainen, Anna Dodd, Spring Holter, Steven Gallinger, Rahul G. Krishnan, Kelvin Chan, Robert Grant. Predicting pancreatic cancer risk from clinical notes using large language models [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Advances in Pancreatic Cancer Research—Emerging Science Driving Transformative Solutions; 2025 Sep 28-Oct 1; Boston, MA. Philadelphia (PA): AACR; Cancer Res 2025;85(18_Suppl_3):Abstract nr B073.
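The screening metrics reported in the Results all follow from a single 2x2 table of flagged vs. unflagged individuals against true case status. The sketch below shows that arithmetic with hypothetical counts chosen for illustration; these are not the study's data, and the computed values will not reproduce the abstract's figures.

```python
# Illustrative arithmetic (hypothetical counts, not the study's data):
# deriving sensitivity, specificity, PPV, NPV, and relative risk from a
# 2x2 screening table (flagged vs. unflagged x case vs. non-case).
def screening_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)          # fraction of cases that were flagged
    specificity = tn / (tn + fp)          # fraction of non-cases left unflagged
    ppv = tp / (tp + fp)                  # risk of cancer among flagged
    npv = tn / (tn + fn)                  # probability of no cancer if unflagged
    risk_unflagged = fn / (fn + tn)       # risk of cancer among unflagged
    relative_risk = ppv / risk_unflagged  # flagged-vs-unflagged risk ratio
    return sensitivity, specificity, ppv, npv, relative_risk

# Hypothetical example: 100,000 people screened, 500 flagged (top 0.5%).
sens, spec, ppv, npv, rr = screening_metrics(tp=12, fp=488, fn=50, tn=99450)
```

A relative risk well above 1 at a 0.5% flagging rate is what makes such a strategy attractive: it concentrates follow-up testing on a small, substantially enriched subgroup.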
Published in: Cancer Research
Volume 85, Issue 18_Supplement_3, pp. B073-B073