Legal decision-making is expected to meet high standards of consistency and rationality, yet human judgments in this domain are known to be influenced by procedural factors such as evidence order and intermediate evaluations. Recent work has shown that even legal professionals, including judges, are susceptible to such biases when assessing criminal cases. This raises a critical question: do large language models, which are increasingly proposed as decision-support tools in legal contexts, exhibit similar procedural biases, and if so, can these biases be mitigated? To address this question, we tested GPT-4o and GPT-5.2 on a controlled legal judgment task adapted from prior human research. The task involved simplified criminal cases in which we systematically manipulated (i) the order of incriminating and exonerating evidence and (ii) whether an intermediate guilt judgment was required before the final decision. Model responses were compared directly with human judgments from the original study. We additionally examined whether prompt engineering strategies based on current best-practice recommendations could reduce the observed biases. GPT-4o exhibited robust order effects and a form of evaluation bias, although the latter differed in structure from the human pattern. GPT-5.2 showed similar but attenuated effects. Across both models, prompt engineering had limited and inconsistent impact and failed to reliably eliminate procedural sensitivity. These findings suggest that even advanced large language models remain vulnerable to normatively irrelevant procedural influences. More broadly, they advise caution against treating large language models as inherently rational or bias-resistant decision-support systems in high-stakes professional domains such as law.
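To make the manipulation concrete, the sketch below is a minimal illustration, not the authors' materials: the evidence snippets, probe wording, and function names are invented stand-ins. It shows one way the 2x2 conditions (evidence order crossed with whether an intermediate guilt judgment is elicited) could be assembled into prompts before being sent to a model.

```python
# Illustrative sketch only: builds the four experimental prompt conditions.
# Evidence texts and probes are hypothetical placeholders, not the study's items.
from itertools import product

INCRIMINATING = "Evidence A: a witness places the defendant near the scene."
EXONERATING = "Evidence B: phone records suggest the defendant was elsewhere."

INTERMEDIATE_PROBE = (
    "Before seeing further evidence, rate the likelihood of guilt from 0 to 100."
)
FINAL_PROBE = "Give a final verdict (guilty / not guilty) with a brief justification."


def build_prompt(incriminating_first: bool, intermediate_judgment: bool) -> str:
    """Assemble one condition: evidence order x presence of an intermediate rating."""
    first, second = (
        (INCRIMINATING, EXONERATING)
        if incriminating_first
        else (EXONERATING, INCRIMINATING)
    )
    parts = ["You are assessing a simplified criminal case.", first]
    if intermediate_judgment:
        # Intermediate guilt judgment requested between the two pieces of evidence.
        parts.append(INTERMEDIATE_PROBE)
    parts.extend([second, FINAL_PROBE])
    return "\n\n".join(parts)


if __name__ == "__main__":
    for inc_first, intermediate in product([True, False], [True, False]):
        order = "incriminating-first" if inc_first else "exonerating-first"
        print(f"--- order={order}, intermediate={intermediate} ---")
        print(build_prompt(inc_first, intermediate))
        print()
```

Each resulting prompt would then be submitted to the model under test, with order effects measured by comparing final verdicts across the two evidence orders, and evaluation bias by comparing conditions with and without the intermediate probe.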