Introduction: Standardised tests using short-answer questions (SAQs) are common in postgraduate education. Publicly available large language models (LLMs) are designed to simulate conversational language and interpret free-text responses, and could potentially be used to grade SAQs. We evaluated the use of an LLM (ChatGPT 4o) to grade SAQs in a postgraduate medical setting.

Methods: An observational retrospective study was performed using data from 215 students (557 short-answer responses) enrolled in an online unit of study on mechanical ventilation (2020–2024). Deidentified SAQ responses to three case-based scenarios on mechanical ventilation were presented to ChatGPT with a standardised grading prompt (see Appendix) and rubric. Analysis used mixed-effects modelling, variance component analysis, intraclass correlation coefficients (ICCs), Cohen's kappa, Kendall's W and Bland-Altman statistics.

Results: ChatGPT consistently awarded lower marks than human graders, with a mean difference of −1.34 on a 10-point scale. Agreement was poor at the individual level (ICC1 = 0.086), and Cohen's kappa (−0.0786) indicated no meaningful agreement. While ChatGPT showed internal consistency across five sessions (G-value = 0.87), it diverged significantly from human graders. Agreement was poorest for evaluative and analytic items, while checklist and prescriptive rubric items showed less disagreement.

Discussion: Though LLMs remain an attractive option for repetitive tasks involving the reading and interpretation of language, we caution against their use in grading postgraduate coursework. Over 60% of ChatGPT-assigned grades differed from human grades by more than the acceptable boundary for high-stakes assessments. ChatGPT performed best with rubric items that had clearly defined right and wrong answers, that allocated marks according to the presence of specific terms or concepts in the answer, or that tested recall (e.g., asked for lists of items). Higher-weighted questions that test more complex learning outcomes are likely to be the most affected.

Conclusion: Grading by consumer-level LLMs with unmodified rubrics disagreed with human grading beyond the acceptable standard for high-stakes assessments (10% on 10-mark items). Promising results for certain rubric types (e.g., checklists) suggest that prompt engineering and rubric refinement may play a role in improving the future performance of automated grading.
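For readers unfamiliar with the agreement statistics named in the Methods, the short Python sketch below shows how a Bland-Altman bias, a one-way ICC and Cohen's kappa could be computed for paired human and LLM marks. It is an illustrative sketch on simulated data under assumed variable names and values; it is not the study's analysis code, and the figures it produces bear no relation to the reported results.

```python
# Hypothetical sketch: agreement between a human grade vector and an LLM
# grade vector for the same responses (marks out of 10). Data are simulated.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
human = rng.integers(4, 11, size=50).astype(float)                 # human marks, 4-10
llm = np.clip(human - rng.normal(1.3, 1.5, size=50), 0, 10)        # LLM marks, biased low

# Bland-Altman: bias (mean difference) and 95% limits of agreement
diff = llm - human
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# ICC(1): one-way random-effects model, computed from mean squares.
# Each response is a "target" rated once by each of the two raters.
ratings = np.column_stack([human, llm])          # shape (n_targets, k_raters)
n, k = ratings.shape
grand = ratings.mean()
ms_between = k * ((ratings.mean(axis=1) - grand) ** 2).sum() / (n - 1)
ms_within = ((ratings - ratings.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (k - 1))
icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Cohen's kappa on grades rounded to whole marks (categorical agreement)
kappa = cohen_kappa_score(np.round(human).astype(int), np.round(llm).astype(int))

print(f"bias={bias:.2f}, LoA=({loa[0]:.2f}, {loa[1]:.2f}), ICC1={icc1:.3f}, kappa={kappa:.3f}")
```

A negative bias with a wide limits-of-agreement interval, a low ICC(1) and a near-zero kappa would correspond to the pattern the abstract describes: internally consistent but systematically harsher LLM grading that agrees poorly with human graders at the level of individual responses.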
Published in: Focus on Health Professional Education: A Multi-Professional Journal
Volume 27, Issue 1, pp. 90–104