Abstract

We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient–clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates medical dialogue on 105 dimensions covering the medical process, treatment safety, treatment outcomes, and doctor-patient communication, using a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) AI Patients instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) × encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1–4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges, which are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models (Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, and Grok-4) across 366 AI patients and 7,097 conversations using a standardized “vanilla clinician” prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.
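To make the Task Matrix and the committee-based scoring concrete, the following is a minimal Python sketch, not the benchmark's actual implementation. The encounter reasons and objectives are only the examples named in the abstract, not the full lists. The function names `build_task_matrix` and `aggregate_committee` are hypothetical, and taking the median of the judges' scores is an assumed aggregation rule: the abstract does not say how the committee combines its scores.

```python
from itertools import product
from statistics import median

# Example values named in the abstract; the complete lists are not given there.
ENCOUNTER_REASONS = ["anxiety", "pregnancy", "wellness checkup"]
ENCOUNTER_OBJECTIVES = ["diagnosis", "lifestyle advice", "medication advice"]


def build_task_matrix(reasons, objectives):
    """Return every (reason, objective) pair, i.e. the Cartesian product."""
    return list(product(reasons, objectives))


def aggregate_committee(scores):
    """Combine the judges' scores for one dimension into a single value.

    Each score must lie on the 1-4 scale. The median is an ASSUMED rule,
    used here only for illustration.
    """
    if not scores:
        raise ValueError("need at least one judge score")
    if any(s not in (1, 2, 3, 4) for s in scores):
        raise ValueError("scores must be integers from 1 to 4")
    return median(scores)


tasks = build_task_matrix(ENCOUNTER_REASONS, ENCOUNTER_OBJECTIVES)
print(len(tasks))                      # 9
print(tasks[0])                        # ('anxiety', 'diagnosis')
print(aggregate_committee([2, 3, 3]))  # 3
```

The median is used in this sketch because a single outlying judge cannot move the result far on a four-point scale; a mean or a majority vote would be equally plausible stand-ins.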