Introduction: Artificial intelligence (AI)-powered chatbots have shown promise on medical assessments and licensing exams. However, their performance on allied health board exams, such as the Therapist Multiple-Choice (TMC) Examination for respiratory therapists, remains unknown. This study compared the accuracy, consistency, and reasoning quality of seven advanced AI chatbots on the TMC Practice Examination.

Methods: A total of 140 questions were sourced from the free TMC Practice Examination with the permission of the National Board for Respiratory Care. Seven chatbots were tested: ChatGPT 4o, SuperGrok, Claude Pro Sonnet 3.6, Mistral Pro, Qwen2.5-VL-32B-Instruct, Gemini 2.5 Advanced, and Llama 3.2 (11B and 90B). Each question was entered into a new chat session and independently tested by two investigators from April 7–22, 2025. Accuracy was scored against the official answer key. Consistency was measured as the rate of identical responses across repeated trials. For reasoning quality, licensed, board-certified respiratory therapists independently evaluated 20 randomly selected responses using a modified scoring rubric.

Results: All chatbots surpassed the passing threshold, with an average accuracy of 83%. ChatGPT attained the highest accuracy (90%), while Mistral had the lowest (76%). Top performers by content area were Grok in patient data (95%), Gemini in equipment troubleshooting (88%), and ChatGPT in intervention modification (91%); Mistral and Qwen scored lowest in these areas. Cognitive-level analysis showed Gemini, ChatGPT, and Grok leading in recall (92%), application (92%), and analysis (90%), respectively. Self-agreement ranged from 84% (Gemini) to 93% (ChatGPT and Qwen), with a mean of 89%. Justification quality was high overall, with over 90% accuracy in comprehension, retrieval, and reasoning for all chatbots except Qwen (85% in reasoning). Rates of errors, omissions, bias, and potential harm were low, ranging from 3% to 10%.

Conclusions: ChatGPT and Grok outperformed the other chatbots on the TMC Practice Examination in both accuracy and explanation quality. Despite these promising results, chatbot-generated misinformation, bias, and response variability remain concerns. Careful evaluation is warranted before integrating chatbots into clinical education or practice.
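To make the two headline metrics concrete, the following Python sketch shows one way the accuracy and self-agreement percentages described in Methods could be computed from two investigators' response logs. It is a minimal illustration, not the study's actual scoring code; the Trial class, the answer_key mapping, and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One chatbot answer to one question, as recorded by one investigator."""
    question_id: int
    answer: str  # the option the chatbot selected, e.g. "A"

def accuracy(trials: list[Trial], answer_key: dict[int, str]) -> float:
    """Fraction of responses that match the official answer key."""
    correct = sum(t.answer == answer_key[t.question_id] for t in trials)
    return correct / len(trials)

def self_agreement(run1: list[Trial], run2: list[Trial]) -> float:
    """Fraction of questions answered identically across two independent runs."""
    answers2 = {t.question_id: t.answer for t in run2}
    same = sum(t.answer == answers2[t.question_id] for t in run1)
    return same / len(run1)
```

Under this scoring, 126 answers matching the key out of 140 questions gives 126/140 = 90%, in line with ChatGPT's reported top accuracy.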