The deployment of large language models (LLMs) for science carries an intrinsic risk of hallucination: fabricated citations, drug approvals, or clinical trials, and unsupported experimental outcomes. Here we describe the testing and deployment of a novel systematic, multi-layer approach called the Validation as a System (VaaS) pipeline, iteratively developed during the construction of an open-source, living Rare Disease Database (RDD). We report lessons learned and production results from 225 carefully annotated rare disease gene curations and a prospective 100-gene collection (99 net new), together representing over 3,000 verified citations. After three iterations of directed refinement, the net functional hallucination rate approached zero. We validated the pipeline using three complementary benchmarks: (1) VaaS-RIKER2, a 640-run prospective ablation study (4 conditions × 4 temperatures × 40 genes) plus 117 open-weight model runs on dedicated GPU hardware, in which unguided LLM output produced 95.9% Type II hallucination (wrong-topic citations: real papers cited in a correctly framed claim context that nonetheless do not support the cited claim), the full VaaS protocol achieved 0.0% Type I and 6.5% Type II (a more than 14-fold reduction), and live PMID verification alone (C3) eliminated both error types entirely (0.0%/0.0%); (2) an independent L3 citation audit of Wave 3 (179 PMIDs, 99.4% valid, 0 Type I errors); and (3) the MedHallu clinical hallucination benchmark, on which the VaaS protocol achieved F1 = 0.9853 on the hard tier (cases where all benchmark ensemble models were fooled), compared to the published GPT-4o baseline of F1 = 0.811 (Pandit et al., 2025). Three independent open-weight models (llama3.2, qwen2.5:14b, mistral:7b) showed 81–87% Type II rates under unguided conditions, confirming that wrong-topic citation hallucination is structural and model-agnostic.
In contrast, the corresponding VaaS rate under the same conditions was measured to be zero (n = 508 verified citations; 160 runs, C4 full protocol). Human validation of at least 50 entries confirmed zero Type I errors and less than 0.5% Type II errors in the manual curation test. The VaaS pipeline operated at a total cost of less than approximately $1 per comprehensive gene review, demonstrating that citation-integrity standards in AI-assisted biomedical synthesis are achievable at production scale. To the authors’ knowledge, the VaaS approach has the lowest measured hallucination rate of any system for science to date and is set to further accelerate the use of AI and AI agents for advancing research.
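The run counts and the fold reduction quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the VaaS pipeline itself: the abstract names only conditions C3 and C4, so the labels `C1` and `C2` and the temperature indices are hypothetical placeholders.

```python
from itertools import product

# Ablation grid from the abstract: 4 conditions x 4 temperatures x 40 genes.
# Only C3 (live PMID verification) and C4 (full protocol) are named in the
# text; "C1" and "C2" are assumed labels for the remaining two conditions.
conditions = ["C1", "C2", "C3", "C4"]
temperature_slots = range(4)  # placeholder indices; actual values are not stated
genes = range(40)

runs = list(product(conditions, temperature_slots, genes))
total_runs = len(runs)
runs_per_condition = total_runs // len(conditions)

# Type II hallucination rates (percent) reported in the abstract.
unguided_type2 = 95.9
vaas_type2 = 6.5
fold_reduction = unguided_type2 / vaas_type2

print(f"total runs: {total_runs}")
print(f"runs per condition: {runs_per_condition}")
print(f"Type II fold reduction: {fold_reduction:.2f}")
```

The grid yields 640 runs in total and 160 per condition, matching the 160 runs reported for the C4 full protocol, and the ratio 95.9 / 6.5 is consistent with the stated reduction of more than 14-fold.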