INTRODUCTION: Over the past several years, use of, and trust in, generative artificial intelligence (AI), and specifically large language model (LLM) chatbots, have increased among the general population. Surveys suggest that 17% of U.S. adults now use chatbots to seek healthcare advice, with the highest uptake (25%) among younger adults aged 18–29. However, there is concern among users and the medical professional community alike that the information provided by these chatbots may sometimes be incomplete or inaccurate. Cohen et al. (2024, AJOG) demonstrated inter-chatbot variability, as well as room for improvement, in both the correctness and comprehensiveness of chatbots' responses to commonly asked patient questions about endometriosis. Few other studies to date have examined the completeness and accuracy of medical information provided by chatbots.

OBJECTIVE: Inspired by the work of Cohen et al., our study assesses the correctness and thoroughness of three leading LLM chatbots' responses to frequently asked patient questions regarding fibroids.

METHODS: The authors posed eleven frequently asked patient questions regarding fibroids to three LLM chatbots: ChatGPT-4 (OpenAI), Claude (Anthropic), and Gemini (Google). Five minimally invasive gynecologic surgeons independently reviewed the chatbots' responses, comparing them against current guidelines and expert opinion on fibroids, and rated each response on the following scale: (1) completely inaccurate; (2) mostly inaccurate and some accurate; (3) mostly accurate and some inaccurate; (4) accurate but incomplete; (5) accurate and comprehensive. The five graders' scores were averaged to calculate final scores.

RESULTS: Average scores were 4.44 (standard deviation 0.34) for Claude, 3.98 (0.48) for ChatGPT, and 3.80 (0.48) for Gemini. Claude answered all 11 (100%) questions accurately, and 7 (64%) questions both accurately and comprehensively, according to a majority (≥3) of reviewers, compared with 7 (64%) and 4 (36%) for ChatGPT and 5 (45%) and 1 (9%) for Gemini, respectively. The question "How common are fibroids?" received the highest-scoring response (average 4.8) across all chatbots, followed by "Can fibroids cause bleeding?" (4.5). There was greater variability in inter-reviewer scoring for questions pertaining to symptoms and diagnosis (e.g., "How do I know if I have fibroids?") or treatment (e.g., "What is the treatment for fibroids?") than for general questions (e.g., "How common are fibroids?").

CONCLUSIONS: While chatbots may provide mostly accurate information in response to patient questions about fibroids, their responses often lack comprehensiveness. Generative AI has the potential to supplement the information provided by medical professionals to the public, but as its presence within the medical field grows, so too should investigation into how this new technology shapes patients' health literacy.
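The scoring procedure described in the methods (five independent reviewer ratings per question, averaged per question and then summarized across all questions per chatbot) can be sketched as follows. This is an illustrative sketch only: the rating values, question subset, and choice of population standard deviation are assumptions, not the study's actual data or reported convention.

```python
# Illustrative sketch of the abstract's scoring procedure:
# five reviewer ratings (1-5) per question are averaged, then
# summarized (mean, SD) across questions for each chatbot.
# All rating values below are HYPOTHETICAL, not the study's data.
from statistics import mean, pstdev

# hypothetical ratings: {question: [five reviewer scores]}
claude_ratings = {
    "How common are fibroids?": [5, 5, 5, 4, 5],
    "Can fibroids cause bleeding?": [5, 4, 5, 5, 4],
    "What is the treatment for fibroids?": [4, 3, 5, 4, 4],
}

# per-question average of the five reviewers' scores
question_means = [mean(scores) for scores in claude_ratings.values()]

# chatbot-level summary across questions; the abstract does not state
# whether a population or sample SD was used, so population SD is
# chosen here purely for illustration
overall_mean = mean(question_means)
overall_sd = pstdev(question_means)

print(f"mean={overall_mean:.2f}, sd={overall_sd:.2f}")  # prints mean=4.47, sd=0.34
```

The same structure extends directly to the full eleven questions and all three chatbots; the majority-of-reviewers counts reported in the results would additionally require tallying, per question, how many of the five individual ratings reach a given score.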
Published in: Obstetrics and Gynecology
Volume 147, Issue 4S, pp. 84S-85S