Abstract
Evaluation of Large Language Models (LLMs) and their clinical competence has mainly focused on conventional multiple-choice question (MCQ) formatted medical examinations, yielding benchmarks such as MedQA-USMLE, on which models have already exceeded expert-level performance. However, alternative assessment methods have recently been proposed, such as SCT-Bench, based on Script Concordance Testing (SCT), which evaluates clinical reasoning and probabilistic thinking under uncertainty. Reasoning-optimized models have unexpectedly scored worse on SCT-Bench despite outperforming non-reasoning models on other medical benchmarks. This study compared performance metrics, uncertainty proxies, and clinical reasoning quality between MedQA-USMLE and the public subset of SCT-Bench using instruction-tuned GPT-4.1, contrasting baseline and Chain-of-Thought (CoT) prompting across sampled responses. The CoT prompts were designed to explicitly instruct the model to apply cognitive clinical reasoning strategies, and their use was subsequently evaluated across both benchmark formats. CoT prompting improved MedQA performance from 86.4% to 93.0%, while the SCT-Bench score showed a non-significant decline from 77.7% to 74.7%. Under CoT, GPT-4.1 systematically overestimated the impact of new information, leading to overconfidence and more frequent extreme ratings on SCT questions. Sample-based majority voting significantly improved MedQA scores under CoT but had no meaningful effect on SCT-Bench. Response entropy analysis showed that CoT increased overall answer variability while simultaneously clustering correct responses on MedQA, an effect absent on SCT-Bench. Calibration and ROC performance were substantially poorer on SCT-Bench than on MedQA, although CoT improved both metrics on both benchmarks. Qualitative analysis confirmed that GPT-4.1 could apply situation-appropriate reasoning strategies and showed signs of metacognitive awareness of its own reasoning process, with rating patterns suggesting possible alignment with expert-like logic. These findings further corroborate limitations in elicited clinical reasoning under SCT-based benchmarking and suggest that reasoning-aware evaluation frameworks could contribute meaningfully to the medical AI benchmark landscape.
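As a rough illustration of the sample-based analyses mentioned above (not the authors' code), the sketch below shows how answer entropy and majority voting over repeated sampled responses to a single question might be computed; the function names and the assumed list-of-labels input format are illustrative assumptions.

```python
from collections import Counter
import math

def answer_entropy(sampled_answers):
    """Shannon entropy (bits) of the empirical distribution over sampled answers.

    `sampled_answers` is assumed to be a list of answer labels, e.g. ["B", "B", "C", ...],
    collected from repeated generations for the same question.
    """
    counts = Counter(sampled_answers)
    total = len(sampled_answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def majority_vote(sampled_answers):
    """Most frequent answer across samples (ties broken by first occurrence)."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical usage: 10 sampled answers to one MedQA-style item.
samples = ["B", "B", "B", "C", "B", "B", "D", "B", "B", "B"]
print(majority_vote(samples))              # "B"
print(round(answer_entropy(samples), 3))   # ~0.922 bits
```

Lower entropy with a higher majority-vote accuracy would correspond to the "clustering of correct responses" described for MedQA under CoT, whereas SCT-style Likert ratings would require an analogous treatment over the rating scale rather than discrete option labels.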