Large language models (LLMs) have shown increasing relevance in clinically supervised decision-support frameworks; however, their performance in orthopedic sports injury scenarios remains unclear. This study aimed to comparatively evaluate the diagnostic, treatment, and rehabilitation recommendations generated by GPT-4o and GPT-5 using standardized clinical scenarios assessed by orthopedic specialists.

Fifteen sports injury–based clinical scenarios were developed and validated by orthopedic specialists with subspecialty expertise in sports traumatology. Each scenario was scored for clinical realism, adequacy of physical examination findings, and adequacy of radiological information using a 7-point Likert scale adapted from AGREE II domains. Both GPT-4o and GPT-5 were prompted with standardized zero-shot instructions, and each scenario was submitted three times to assess internal consistency. Two blinded orthopedic specialists evaluated content-level consistency, and five independent orthopedic specialists rated the clinical adequacy of the AI-generated responses on a 0–5 scale. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) and Cohen's kappa.

Specialists rated the clinical scenarios favorably, with 69–72% agreement across domains and ICC values indicating good reliability for clinical realism (ICC = 0.725) and moderate reliability for physical examination (ICC = 0.634) and radiological adequacy (ICC = 0.512). GPT-4o produced consistent outputs in 93.3% of cases, with one scenario showing clinically relevant inconsistency (κ = 0.82). Comparative expert evaluation demonstrated significantly higher scores for GPT-5 (median = 4.60) than for GPT-4o (median = 4.00) (p = 0.007). Inter-rater reliability for AI response scoring was high for both models (ICC = 0.888 for GPT-4o; ICC = 0.895 for GPT-5).
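For readers unfamiliar with the reliability statistics used above, the sketch below shows how Cohen's kappa and one common ICC variant can be computed from raw ratings. This is purely illustrative and is not the study's analysis code; the abstract does not specify which ICC form was used, so the sketch assumes the two-way random-effects, absolute-agreement, single-rater form ICC(2,1), and the rating data in the example are hypothetical.

```python
import numpy as np
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters
    who assigned categorical labels to the same items."""
    assert len(r1) == len(r2) and len(r1) > 0
    n = len(r1)
    # Observed proportion of agreement
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement by chance, from each rater's marginal label counts
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[lab] * c2[lab] for lab in set(r1) | set(r2)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def icc2_1(X):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    X is an (n subjects x k raters) array of scores."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grand = X.mean()
    # Sums of squares from a two-way ANOVA decomposition
    SSR = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between subjects
    SSC = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    SSE = ((X - grand) ** 2).sum() - SSR - SSC        # residual
    MSR = SSR / (n - 1)
    MSC = SSC / (k - 1)
    MSE = SSE / ((n - 1) * (k - 1))
    return (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n)

# Hypothetical example: two raters, four items
print(cohens_kappa([0, 0, 1, 1], [0, 1, 1, 1]))  # → 0.5
# Hypothetical example: three scenarios scored identically by two raters
print(icc2_1([[1, 1], [2, 2], [3, 3]]))          # → 1.0
```

With more raters (e.g., the five specialists scoring fifteen scenarios here), `X` would simply be a 15×5 matrix; scipy or the pingouin package offer validated implementations preferable for real analyses.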
GPT-4o and GPT-5 generated responses with generally high expert-rated clinical adequacy and strong consistency across standardized sports injury–related clinical scenarios, with GPT-5 achieving higher scores in expert evaluations. By providing a structured, specialty-specific expert assessment under controlled conditions, this study adds comparative insight into how contemporary large language models are perceived in orthopedic sports injury contexts, without implying objective diagnostic accuracy or autonomous clinical decision-making.