Introduction: Artificial intelligence (AI)-powered chatbots have shown promise on medical assessments and licensing exams. However, their performance on allied health board exams, such as the Therapist Multiple-Choice (TMC) Examination for respiratory therapists, remains unknown. This study compared the accuracy, consistency, and reasoning quality of seven advanced AI chatbots on the TMC Practice Examination.

Methods: A total of 140 questions were sourced from the free TMC Practice Examination with the permission of the National Board for Respiratory Care. Seven chatbots were tested: ChatGPT 4o, SuperGrok, Claude Pro Sonnet 3.6, Mistral Pro, Qwen2.5-VL-32B-Instruct, Gemini 2.5 Advanced, and Llama 3.2 (11B and 90B). Each question was entered into a new chat session and independently tested by two investigators from April 7–22, 2025. Accuracy was scored against the official answer key. Consistency was measured as the rate of identical responses across repeated trials. For reasoning quality, licensed, board-certified respiratory therapists independently evaluated 20 randomly selected responses using a modified scoring rubric.

Results: All chatbots surpassed the passing threshold, with an average accuracy of 83%. ChatGPT attained the highest accuracy (90%), while Mistral had the lowest (76%). Top performers by content area were Grok in patient data (95%), Gemini in equipment troubleshooting (88%), and ChatGPT in intervention modification (91%); Mistral and Qwen scored lowest in these areas. Cognitive-level analysis showed Gemini, ChatGPT, and Grok leading in recall (92%), application (92%), and analysis (90%), respectively. Self-agreement ranged from 84% (Gemini) to 93% (ChatGPT and Qwen), with a mean of 89%. Justification quality was high overall, with over 90% accuracy in comprehension, retrieval, and reasoning for all chatbots except Qwen (85% in reasoning). Rates of errors, omissions, bias, and potential harm were low, ranging from 3% to 10%.

Conclusions: ChatGPT and Grok outperformed the other chatbots on the TMC Practice Examination in both accuracy and explanation quality. Despite these promising results, chatbot-generated misinformation, bias, and response variability remain concerns. Careful evaluation is warranted before integrating chatbots into clinical education or practice.
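To make the two headline metrics concrete, the following Python sketch shows one way the accuracy and self-agreement percentages described in Methods could be computed from two investigators' response logs. It is a minimal illustration, not the study's actual scoring code; the Trial class, the answer_key mapping, and both function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    """One chatbot answer to one question, as recorded by one investigator."""
    question_id: int
    answer: str  # the option the chatbot selected, e.g. "A"

def accuracy(trials: list[Trial], answer_key: dict[int, str]) -> float:
    """Fraction of responses that match the official answer key."""
    correct = sum(t.answer == answer_key[t.question_id] for t in trials)
    return correct / len(trials)

def self_agreement(run1: list[Trial], run2: list[Trial]) -> float:
    """Fraction of questions answered identically across two independent runs."""
    answers2 = {t.question_id: t.answer for t in run2}
    same = sum(t.answer == answers2[t.question_id] for t in run1)
    return same / len(run1)
```

Under this scoring, 126 answers matching the key out of 140 questions gives 126/140 = 90%, in line with ChatGPT's reported top accuracy.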