Treatment adherence and regular monitoring of drug-related adverse effects and disease activity are critical in the management of connective tissue diseases (CTDs). However, challenges such as limited disease literacy, insufficient awareness of long-term risks, and constrained access to reliable health information often hinder the adoption of evidence-based self-management strategies.1 The rapid integration of large language models (LLMs) into healthcare has enabled patients to seek medication-related guidance via artificial intelligence (AI) platforms.2 However, the quality, safety, and readability of AI-generated medical advice remain insufficiently evaluated from an evidence-based perspective, raising concerns about potential patient misinformation.3 We collected potential patient questions on the treatment of CTDs and evaluated the responses generated by LLMs. First, two rheumatologists reviewed the initial questions compiled from the LLMs and determined the final list via a consensus voting process. The finalized questions were then input into four LLMs commonly used in China (Deepseek R1, Hunyuan T1, Kimi k1.5, and Wenxin X1) and ChatGPT-4o. Three rheumatologists independently rated each response on a 5-point Likert scale4 (1: very poor, 2: poor, 3: acceptable, 4: good, 5: very good) across three predefined dimensions: Medical Accuracy (the degree to which the response is grounded in evidence-based, peer-reviewed research and represents a reasonable conclusion), Completeness (the degree to which the response answers all parts of the patient question), and Focus (the degree to which the response includes only information relevant to the patient question). An Overall Quality Score was calculated as the average of the Medical Accuracy, Completeness, and Focus scores. The ratings from the three rheumatologists were averaged to obtain a consensus score for each response.
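As a minimal sketch, the consensus-scoring scheme described above can be expressed in code; the rater values below are hypothetical placeholders, not data from the study.

```python
# Minimal sketch of the scoring scheme: three rheumatologists rate each
# response on three dimensions using a 1-5 Likert scale; ratings are
# averaged across raters, and the Overall Quality Score is the mean of
# the three dimension means. All values here are illustrative.
from statistics import mean

# ratings[dimension] = [rater 1, rater 2, rater 3]
ratings = {
    "medical_accuracy": [4, 5, 4],
    "completeness": [3, 4, 4],
    "focus": [5, 5, 4],
}

# Consensus score per dimension (average of the three raters)
consensus = {dim: mean(scores) for dim, scores in ratings.items()}

# Overall quality: average of the three dimension consensus scores
overall_quality = mean(consensus.values())
print({d: round(s, 2) for d, s in consensus.items()}, round(overall_quality, 2))
# → {'medical_accuracy': 4.33, 'completeness': 3.67, 'focus': 4.67} 4.22
```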
Text readability was quantified using AlphaReadabilityChinese, a Chinese text readability tool developed by Shanghai International Studies University, which evaluates nine linguistic metrics. These metrics are categorized according to their correlation with reading difficulty: lexical richness, syntactic richness, semantic richness, and semantic noise correlate positively with reading difficulty, whereas semantic accuracy of nouns, semantic accuracy of verbs, semantic accuracy of nouns and verbs, semantic accuracy of content words, and semantic clarity correlate negatively. Statistical analysis was performed using GraphPad Prism 10.4.1. Kruskal–Wallis tests with Dunn's post hoc comparisons were used to analyze differences in quality scores and readability metrics among the LLMs. Nonnormally distributed quantitative data were expressed as medians with interquartile ranges. A p < 0.05 was considered statistically significant. Based on the rheumatologists' votes, we constructed a finalized list of 50 questions for assessing the performance of the LLMs (Supporting Information). As illustrated in Figure 1A, for medical accuracy, Kimi k1.5 achieved a higher score than Hunyuan T1 (4.00 [interquartile range, IQR: 4.00, 5.00] vs. 3.84 [IQR: 3.00, 4.00], p = 0.005). Although its differences from the other three models were not significant, Kimi k1.5 still showed a trend toward higher scores. Conversely, Kimi k1.5 exhibited the poorest completeness (3.84 [IQR: 3.00, 4.00]) compared with Deepseek R1 (4.67 [IQR: 4.00, 5.00], p < 0.0001), Hunyuan T1 (4.00 [IQR: 4.00, 5.00], p < 0.0001), Wenxin X1 (4.00 [IQR: 4.00, 5.00], p < 0.001), and GPT-4o (4.33 [IQR: 4.25, 5.00], p < 0.0001).
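As a hedged sketch of the statistical approach (the study itself used GraphPad Prism), the Kruskal–Wallis omnibus test and the median [IQR] summaries can be reproduced with SciPy and NumPy; Dunn's post hoc comparisons are not included in SciPy and would require an additional package such as scikit-posthocs. The score arrays below are randomly generated placeholders, not the study's data.

```python
# Sketch of the omnibus comparison across models: a Kruskal-Wallis test
# on per-question consensus quality scores, summarized as median [IQR].
# Scores are random placeholders, not the study's data. Dunn's post hoc
# comparisons (used in the study) would need e.g. scikit-posthocs.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical 1-5 Likert consensus scores for 50 questions per model
scores = {
    "Deepseek R1": rng.integers(3, 6, size=50).astype(float),
    "Hunyuan T1": rng.integers(3, 5, size=50).astype(float),
    "Kimi k1.5": rng.integers(4, 6, size=50).astype(float),
}

# Omnibus test for any difference among the groups
h_stat, p_value = kruskal(*scores.values())
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4g}")

# Median [IQR] summary per model, as reported in the text
for model, s in scores.items():
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    print(f"{model}: {med:.2f} [IQR: {q1:.2f}, {q3:.2f}]")
```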
This deficiency contrasted sharply with Kimi k1.5's exceptional performance in focus (5.00 [IQR: 4.00, 5.00]), a score that significantly exceeded those of Deepseek R1 (4.00 [IQR: 3.00, 5.00], p < 0.01), Hunyuan T1 (4.00 [IQR: 3.59, 4.75], p = 0.01), Wenxin X1 (4.00 [IQR: 3.59, 5.00], p = 0.001), and GPT-4o (4.17 [IQR: 3.33, 4.67], p = 0.003). Across all three evaluation dimensions, no significant differences were observed among Deepseek R1, Hunyuan T1, Wenxin X1, and GPT-4o. The overall quality score, calculated from the three dimensions combined, also showed no significant differences among these four models. We also compared the readability assessment results across the five LLMs (Figure 1B). Notably, Kimi k1.5 demonstrated superior readability in six critical dimensions: lexical richness, semantic accuracy of content words, semantic accuracy of verbs, semantic accuracy of nouns and verbs, semantic clarity, and semantic richness. For semantic noise, Kimi k1.5 outperformed both Hunyuan T1 and GPT-4o. In contrast, Kimi k1.5 underperformed in syntactic richness compared with the other four models. However, Deepseek R1, Hunyuan T1, and Wenxin X1 showed no significant differences across any readability metrics. LLMs are inherently susceptible to the risk of “hallucination,” a critical limitation characterized by the generation of content that appears semantically coherent, contextually plausible, and grammatically flawless, yet contains factual inaccuracies, fabricated details, or misleading information.5 In our evaluation, significant variability was observed across the five LLMs. In terms of medical accuracy, Kimi k1.5 outperformed Hunyuan T1, though no model achieved perfect scores. This variability may stem from differences in training data quality, particularly regarding evidence-based guidelines for CTDs, which are dynamic and require frequent updates.
In the dimensions of completeness and focus, Kimi k1.5 exhibited a paradoxical pattern: it received the lowest score in completeness while achieving the highest in focus. This suggests a trade-off between brevity and comprehensiveness, in which Kimi k1.5 may prioritize conciseness at the expense of addressing all aspects of a question. From a clinical perspective, incomplete information (e.g., omitting dosage adjustments or monitoring requirements) could mislead patients. Conversely, excessive verbosity (observed in other models) may reduce information retention, highlighting the need for LLMs to balance detail and clarity. Readability analyses further highlighted Kimi k1.5's strengths, with superior performance in semantic accuracy, clarity, and reduced semantic noise, all of which are critical for ensuring that patients with varying education levels can comprehend the information. The study's limitations include the lack of patient perspectives on the comprehensibility and trustworthiness of the LLMs and the fact that it evaluated fixed model versions at a specific time point, so the results may not reflect current or future performance given ongoing model updates. In conclusion, this study indicates that Kimi k1.5 outperformed the other models in medical accuracy, focus, and readability but lagged in completeness. LLMs hold significant potential for patient education in CTD management but require further improvements in accuracy and updates to training data to align with the latest clinical evidence.

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Chao Xue, Jiaxin Bai, and Wenrui Zhang. The first draft of the manuscript was written by Chao Xue, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript. The authors have nothing to report. The authors declare no conflict of interest.
The study was approved by the Ethical Committee of the Chinese PLA General Hospital (S2022-255-03). Informed consent was obtained from all patients before entry into the study, and the work was conducted in accordance with the Declaration of Helsinki. The data that support the findings of this study are available from the corresponding author upon reasonable request. Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.