Multimodal large language models for use in diabetic retinopathy screening

2026 · 0 citations · Journal Article · Hybrid Open Access

Authors

S. Saeed Mohammadi · Northwestern University

Sahana Aggarwal · Northwestern University

Kavina Aggarwal · Northwestern University

Grant Wiarda · Northwestern University

Kayla Nguyen · Northwestern University

Emmanuel A. Sarmiento

Quan Nguyen · Smith-Kettlewell Eye Research Institute

Abstract

Purpose: To evaluate the performance of ChatGPT-4o and Gemini 2.5 Pro in detecting more-than-mild diabetic retinopathy (mtmDR) from fundus photography (FP) and diabetic macular edema (DME) from optical coherence tomography (OCT) using publicly available datasets.

Methods: A custom GPT (powered by ChatGPT-4o) was created and instructed to follow the LumineticsCore™ (IDx-DR) screening criteria for mtmDR, defined as an ETDRS level ≥ 35 and/or clinically significant diabetic macular edema (CSDME). Gemini 2.5 Pro was evaluated with the same criteria. Performance on FPs was assessed using 2 publicly available datasets: MESSIDOR-2 (n = 106; 66 mtmDR, 40 non-mtmDR) and EyePACS (n = 99; 56 mtmDR, 43 non-mtmDR). To assess detection of DME, a separate OCT dataset (n = 48; 24 DME, 24 normal) was used to evaluate identification of intraretinal and/or subretinal fluid. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) for detecting mtmDR on FP and DME on OCT were calculated for each multimodal large language model (LLM).

Results: On MESSIDOR-2 (n = 106), ChatGPT-4o achieved a sensitivity of 90.77%, specificity of 97.50%, PPV of 98.33%, and NPV of 86.67% for mtmDR detection. Gemini 2.5 Pro achieved a sensitivity of 80.30%, specificity of 97.50%, PPV of 98.15%, and NPV of 75.00%. On EyePACS (n = 99), ChatGPT-4o demonstrated a sensitivity of 94.64%, specificity of 86.05%, PPV of 89.83%, and NPV of 92.50%, while Gemini 2.5 Pro achieved a sensitivity of 89.29%, specificity of 88.37%, PPV of 90.91%, and NPV of 86.36%. For OCT-based DME detection (n = 48), ChatGPT-4o achieved a sensitivity of 95.83%, specificity of 100%, and PPV of 100%, while Gemini 2.5 Pro achieved a sensitivity of 95.83%, specificity of 95.65%, PPV of 95.83%, and NPV of 95.65%.

Conclusion: ChatGPT-4o and Gemini 2.5 Pro demonstrated high performance in detecting mtmDR and DME across multiple publicly available datasets. These findings support the potential of multimodal LLMs as cost-effective and accessible tools for diabetic retinopathy screening. Further validation in larger, more diverse real-world datasets is warranted.
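For readers who want to check how the four reported screening metrics relate to one another, the following is a minimal Python sketch of the standard definitions computed from a 2×2 confusion matrix. The abstract reports only percentages and class totals, so the counts used here (53 true positives, 3 false negatives, 37 true negatives, 6 false positives) are an assumption: they were back-calculated from the ChatGPT-4o EyePACS figures and the stated split of 56 mtmDR and 43 non-mtmDR images, and are not taken from the paper. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a 2x2 table of model output vs. reference grading."""
    tp: int  # disease present, model flagged it
    fn: int  # disease present, model missed it
    tn: int  # disease absent, model cleared it
    fp: int  # disease absent, model flagged it

    @property
    def sensitivity(self) -> float:
        # Fraction of diseased eyes that the model flagged.
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        # Fraction of non-diseased eyes that the model cleared.
        return self.tn / (self.tn + self.fp)

    @property
    def ppv(self) -> float:
        # Fraction of positive calls that were truly diseased.
        return self.tp / (self.tp + self.fp)

    @property
    def npv(self) -> float:
        # Fraction of negative calls that were truly non-diseased.
        return self.tn / (self.tn + self.fn)


def as_percent(value: float) -> float:
    """Express a proportion as a percentage rounded to two decimals."""
    return round(value * 100, 2)


# ASSUMPTION: counts inferred from the reported EyePACS percentages for
# ChatGPT-4o (56 mtmDR, 43 non-mtmDR); the paper does not list them.
eyepacs_gpt4o = ConfusionCounts(tp=53, fn=3, tn=37, fp=6)

for name in ("sensitivity", "specificity", "ppv", "npv"):
    print(f"{name}: {as_percent(getattr(eyepacs_gpt4o, name))}%")
```

With these inferred counts the sketch reproduces the reported 94.64% sensitivity, 86.05% specificity, 89.83% PPV, and 92.50% NPV. PPV and NPV depend on the proportion of diseased images in the sample, so they would change in a screening population with a different prevalence even if sensitivity and specificity stayed the same.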

Topics & Keywords

Retinal Imaging and Analysis · Retinal Diseases and Treatments · Retinal and Optic Conditions

UN Sustainable Development Goals

Good health and well-being

Publication Details

Published in: Artificial Intelligence in Vision and Ophthalmology

Volume 2, Issue 1

DOI: 10.35119/aivo.v2i1.157

Field-Weighted Citation Impact: 0.00