Large language models (LLMs) have shown increasing relevance in clinically supervised decision-support frameworks; however, their performance in orthopedic sports injury scenarios remains unclear. This study aimed to comparatively evaluate the diagnostic, treatment, and rehabilitation recommendations generated by GPT-4o and GPT-5 using standardized clinical scenarios assessed by orthopedic specialists.

Fifteen sports injury–based clinical scenarios were developed and validated by orthopedic specialists with subspecialty expertise in sports traumatology. Each scenario was scored for clinical realism, adequacy of physical examination findings, and adequacy of radiological information using a 7-point Likert scale adapted from AGREE II domains. Both GPT-4o and GPT-5 were prompted with standardized zero-shot instructions, and each scenario was submitted three times to assess internal consistency. Two blinded orthopedic specialists evaluated content-level consistency, and five independent orthopedic specialists rated the clinical adequacy of the AI-generated responses on a 0–5 scale. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) and Cohen's kappa.

Specialists rated the clinical scenarios favorably, with 69–72% agreement across domains and ICC values indicating good reliability for clinical realism (ICC = 0.725) and moderate reliability for physical examination (ICC = 0.634) and radiological adequacy (ICC = 0.512). GPT-4o produced consistent outputs in 93.3% of cases, with one scenario showing clinically relevant inconsistency (κ = 0.82). Comparative expert evaluation demonstrated significantly higher scores for GPT-5 (median = 4.60) than for GPT-4o (median = 4.00) (p = 0.007). Inter-rater reliability for AI response scoring was high for both models (ICC = 0.888 for GPT-4o; ICC = 0.895 for GPT-5).
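For readers unfamiliar with the reliability statistics used above, the sketch below shows how Cohen's kappa and one common ICC variant can be computed from raw ratings. This is purely illustrative and is not the study's analysis code; the abstract does not specify which ICC form was used, so the sketch assumes the two-way random-effects, absolute-agreement, single-rater form ICC(2,1), and the rating data in the example are hypothetical.

```python
import numpy as np
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters
    who assigned categorical labels to the same items."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal label counts
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[lab] * c2[lab] for lab in set(r1) | set(r2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def icc2_1(X):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    X is an (n subjects x k raters) array of scores."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    # Sums of squares from a two-way ANOVA decomposition
    SSR = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between subjects
    SSC = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    SSE = ((X - grand) ** 2).sum() - SSR - SSC        # residual
    MSR = SSR / (n - 1)
    MSC = SSC / (k - 1)
    MSE = SSE / ((n - 1) * (k - 1))
    return (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)

# Hypothetical example: two raters, four items
print(cohens_kappa([0, 0, 1, 1], [0, 1, 1, 1]))  # → 0.5
# Hypothetical example: three scenarios scored identically by two raters
print(icc2_1([[1, 1], [2, 2], [3, 3]]))          # → 1.0
```

With more raters (e.g., the five specialists scoring fifteen scenarios here), `X` would simply be a 15×5 matrix; scipy or the pingouin package offer validated implementations preferable for real analyses.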
GPT-4o and GPT-5 generated responses with generally high expert-rated clinical adequacy and strong consistency across standardized sports injury–related clinical scenarios, with GPT-5 achieving higher scores in expert evaluations. By providing a structured, specialty-specific expert assessment under controlled conditions, this study adds comparative insight into how contemporary large language models are perceived in orthopedic sports injury contexts, without implying objective diagnostic accuracy or autonomous clinical decision-making.