The Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×

20260 citationsPreprintgreen Open Access