INTRODUCTION
Large language models (LLMs) are being introduced into clinical and public health environments.[1] Proposed applications include clinical documentation, literature summarization, patient communication, and decision support. Their ability to process large volumes of natural language data has expanded the range of digital tools available to clinicians. However, the clinical environment imposes strict requirements on developers and deploying institutions to ensure the reliability and transparency of these systems and their outputs. Inaccurate or unverifiable information can influence clinical decisions and compromise patient safety. Fabricated statements, outdated recommendations, and responses lacking identifiable sources remain widely reported limitations of current generative models.[2]

Recent reviews of retrieval-augmented generation (RAG) have highlighted its potential for improving factual grounding in LLMs.[3-5] Although many discussions have emphasized improvements in model accuracy or task performance, questions of knowledge governance, defined here as the frameworks for managing data sourcing, validation, and accountability, have received less attention. From a medical informatics perspective, healthcare information systems have long relied on structured knowledge management, traceable evidence sources, and regulatory oversight. Retrieval mechanisms provide an opportunity to reintroduce these principles into generative models.

CLINICAL KNOWLEDGE AND THE LIMITS OF PARAMETRIC MODELS
Parametric language models encode knowledge within model parameters learned during large-scale training. Once deployed, these models cannot easily update individual knowledge elements without retraining or fine-tuning.
This limitation is particularly relevant in medicine, where clinical recommendations evolve rapidly as new clinical trials, regulatory alerts, and guideline updates emerge.[6] Systems that rely exclusively on static training data risk generating responses inconsistent with current clinical practice.

RAG addresses this limitation by introducing external knowledge retrieval during inference. Relevant documents are retrieved from curated repositories and incorporated into the generation process. Several studies have demonstrated that retrieval-based architectures can improve factual grounding and reduce fabricated content in clinical question-answering tasks.[7,8] For example, retrieval-based systems applied to clinical guideline retrieval have shown improved evidence attribution compared with standalone language models.[5] These findings indicate that external knowledge retrieval can enhance the reliability of knowledge-intensive clinical tasks.

RETRIEVAL AS A CLINICAL KNOWLEDGE INFRASTRUCTURE
From a medical informatics perspective, retrieval systems should be considered part of the clinical knowledge infrastructure rather than mere extensions of language models (Figure 1).

Figure 1. RAG architecture for clinical knowledge infrastructure. RAG, Retrieval-Augmented Generation; CDS, clinical decision support; EHR, electronic health record; LLMs, large language models; AI, artificial intelligence.

Clinical knowledge does not depend solely on isolated textual facts. Instead, it emerges from the interplay among diseases, treatments, contraindications, and patient-specific context.[9] Effective retrieval therefore requires curated and structured knowledge resources. Clinical guidelines, biomedical literature, medical ontologies, and institutional knowledge bases are potential sources of retrieved knowledge for clinical RAG systems.
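The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration rather than a clinical implementation: the toy corpus, the term-overlap scoring function, and the `build_prompt` helper are hypothetical stand-ins for a curated guideline repository and a production-grade retriever.

```python
from collections import Counter

# Hypothetical toy corpus standing in for a curated guideline repository.
# Each entry carries a source identifier so generated answers stay traceable.
CORPUS = [
    {"id": "guideline-001", "text": "first line treatment for condition X is drug A"},
    {"id": "guideline-002", "text": "drug A is contraindicated in renal impairment"},
    {"id": "alert-003", "text": "regulatory alert drug B withdrawn for condition Y"},
]

def score(query: str, text: str) -> int:
    """Crude relevance score: shared term count between query and document."""
    q, d = Counter(query.lower().split()), Counter(text.lower().split())
    return sum((q & d).values())

def retrieve(query: str, k: int = 2) -> list:
    """Rank the corpus against the query and return the top-k matching documents."""
    ranked = sorted(CORPUS, key=lambda doc: score(query, doc["text"]), reverse=True)
    return [doc for doc in ranked[:k] if score(query, doc["text"]) > 0]

def build_prompt(query: str, docs: list) -> str:
    """Condition the generator on retrieved evidence, citing each source id."""
    evidence = "\n".join(f"[{d['id']}] {d['text']}" for d in docs)
    return (
        "Answer using only the evidence below and cite the source ids.\n"
        f"{evidence}\nQuestion: {query}"
    )

query = "what is the first line treatment for condition X"
docs = retrieve(query)
prompt = build_prompt(query, docs)
```

A production system would replace the term-overlap scorer with dense or hybrid retrieval, but the governance-relevant property is the same: every passage handed to the generator arrives with an identifiable source.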
Structured knowledge representations developed within medical informatics encode relationships between clinical concepts and enable retrieval strategies that align with clinical reasoning. Integration of clinical information systems with RAG systems is also essential, as retrieval pipelines interact with electronic health records, clinical data warehouses, or institutional guideline repositories. Such integration enables responses that not only incorporate general medical evidence but also account for the local clinical context. In this architecture, LLMs function primarily as interfaces to curated medical knowledge rather than as autonomous sources of clinical expertise.

GOVERNANCE AND TRACEABILITY IN CLINICAL ARTIFICIAL INTELLIGENCE (AI)
Healthcare systems operate under regulatory frameworks that require transparency and accountability. Clinical recommendations must therefore be traceable to underlying evidence. Traditional clinical decision support systems address this requirement by linking recommendations to guideline statements or sources of evidence. RAG introduces a similar capability for generative systems: when identifiable documents are referenced during inference, the origins of generated responses can be examined.

Traceability enables several governance functions, including retrospective evaluation of system behavior when recommendations are questioned, clinical auditing through the preservation of evidence sources, and regulatory oversight through documentation of the provenance of generated information. These characteristics align retrieval-based architectures with established principles of clinical governance and evidence-based medicine.

RISKS AND GOVERNANCE GAPS
Although retrieval architectures provide governance capabilities, they also introduce new risks. In RAG systems, the knowledge corpus becomes a critical component of the information pipeline, and retrieved documents directly influence generated responses.
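One way to make these governance functions concrete is to log, for every generated response, exactly which corpus documents (and which versions of them) were retrieved. The sketch below is an assumed design, not a standard API: the `AuditRecord` structure, its field names, and the content-hash fingerprint are illustrative choices for a retrospective audit trail.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """Provenance trail for one generated response, retained for later review."""
    query: str
    response: str
    # One (document id, corpus version, content hash) triple per retrieved source.
    sources: list = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def fingerprint(text: str) -> str:
    """Content hash so auditors can detect later edits to a cited document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def log_response(query, response, retrieved_docs, corpus_version, audit_log):
    """Append a provenance record linking the response to its evidence."""
    record = AuditRecord(
        query=query,
        response=response,
        sources=[
            (d["id"], corpus_version, fingerprint(d["text"]))
            for d in retrieved_docs
        ],
    )
    audit_log.append(record)
    return record

audit_log = []
docs = [{"id": "guideline-001", "text": "first line treatment for condition X is drug A"}]
rec = log_response(
    "first line treatment for condition X?",
    "Drug A [guideline-001]",
    docs,
    corpus_version="2024.06",
    audit_log=audit_log,
)
# Records serialize to JSON for archival and regulatory inspection.
archived = json.dumps(asdict(rec))
```

Because each record pins the corpus version and a content hash alongside the document identifier, a later reviewer can reconstruct exactly what evidence the system saw, even if the repository has since been updated.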
Errors, biases, or corrupted documents within the retrieval corpus may therefore affect system outputs. Recent studies have shown that contradictory or temporally inconsistent evidence sources may degrade the reliability of retrieval-based responses in clinical contexts.[10] Governance must therefore extend to the management of knowledge repositories. Retrieval corpora require curation, version control, and continuous quality monitoring. Provenance information and update histories should be maintained so that retrieved knowledge can be evaluated within its clinical and temporal context.

EVALUATION FOR CLINICAL DEPLOYMENT
Evaluation practices for LLMs frequently rely on benchmark datasets and metrics such as exact match or F1 scores. These metrics measure linguistic similarity but provide limited insight into system safety. Clinical deployment requires broader evaluation criteria. Relevant dimensions include hallucination frequency, behavior when evidence is insufficient, response latency, robustness to adversarial prompts, and computational cost. Recent reviews of RAG-based medical systems highlight the lack of standardized evaluation frameworks and the limited number of clinically validated deployments.[1,5]

As illustrated in Figure 1, retrieval mechanisms connect generative models with curated medical knowledge infrastructures. The clinical knowledge layer includes guidelines, biomedical literature, medical ontologies, and institutional knowledge bases. The retrieval layer identifies and ranks relevant evidence during inference. Generated responses are conditioned on retrieved evidence and linked to identifiable sources. Governance mechanisms enable traceability, auditing, and regulatory oversight in clinical environments.

CONCLUSION
The introduction of LLMs into healthcare raises fundamental questions regarding reliability, accountability, and knowledge governance. RAG provides an architectural mechanism that links generative models with curated medical knowledge sources.
By associating generated responses with identifiable evidence, retrieval enables traceability, facilitates knowledge updates, and supports the oversight mechanisms required in clinical environments. From a medical informatics perspective, retrieval can therefore be interpreted as a governance layer for clinical LLMs. Future work should focus on the development of retrieval infrastructures, governance frameworks, and evaluation standards that support the safe integration of generative systems into clinical information environments.

Acknowledgements
We acknowledge the editorial team for their support during manuscript preparation.

Author Contributions
Wang BC: Conceptualization, Writing—Original draft preparation. Wang ZY: Writing—Reviewing and Editing. Zhong YN: Writing—Reviewing and Editing, Supervision, Project administration. All authors have read and approved the final version of the manuscript.

Source of funding
This work received no specific funding or financial support.

Ethical approval
Not applicable.

Informed consent
Not applicable.

Conflict of Interest
Baocheng Wang is an associate editor of the journal. The article was subject to the journal's standard procedures, with peer review handled independently of the editor and the affiliated research groups.

Use of large language models, AI and machine learning tools
None declared.

Data availability statement
No additional data.