We show that structured decision inputs (typed rules, scored evidence, domain ontologies) are a precondition for LLM verdict robustness to be both achievable and testable. Using Taguchi L9 orthogonal arrays to evaluate four decision-pipeline factors in 9 structured trials (vs. 3^4 = 81 exhaustive), we compare two datasets (n = 50 each): structured governance decisions derived from statutory rules and unstructured ethical scenarios from the ValuePrism corpus, with a controlled within-subject follow-up on a 25-scenario subset. The central finding is that L9 stability scores discriminate correct from incorrect verdicts on structured inputs (AUROC = 0.70 overall), rising to 0.83 on the medium-difficulty tier, where factor perturbations are most informative. On unstructured inputs, the same scores carry no diagnostic information (AUROC = 0.43, anti-correlated with correctness). The model's own confidence scores are uninformative on both datasets (AUROC = 0.50). Separately, self-consistency testing (repeated evaluation at elevated temperature) returns perfect agreement across all 100 decisions, detecting no instability whatsoever. This indicates that domain formalization is not merely beneficial for decision quality but necessary for robustness diagnostics to be meaningful. Supporting this, structured inputs produce significantly higher verdict stability (s = 0.90 vs. 0.73, non-overlapping 95% CIs) and more concentrated instability profiles. However, this comparison is confounded by domain and construction differences between the two datasets. The L9 ANOVA decomposition identifies which operational factors drive instability for each decision, a diagnostic capability that random sampling cannot provide despite comparable detection rates.
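The design above can be sketched in a few lines. The array below is the standard L9(3^4) orthogonal layout (9 trials, 4 factors, 3 levels each), and `stability_score` is an illustrative majority-agreement measure over the 9 perturbed verdicts; the function names and sample verdicts are hypothetical, not taken from the paper's code.

```python
import itertools
from collections import Counter

# Standard Taguchi L9(3^4) orthogonal array: 9 trials cover 4 factors at
# 3 levels each, versus 3**4 = 81 trials for the exhaustive grid.
L9 = [
    (0, 0, 0, 0), (0, 1, 1, 1), (0, 2, 2, 2),
    (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
    (2, 0, 2, 1), (2, 1, 0, 2), (2, 2, 1, 0),
]

def is_orthogonal(array):
    """Check the defining L9 property: every pair of columns contains
    each of the 9 (level, level) combinations exactly once."""
    for i, j in itertools.combinations(range(4), 2):
        pairs = Counter((row[i], row[j]) for row in array)
        if any(count != 1 for count in pairs.values()):
            return False
    return True

def stability_score(verdicts):
    """Illustrative stability: fraction of the 9 perturbed trials that
    agree with the majority verdict (1.0 = perfectly stable)."""
    _, count = Counter(verdicts).most_common(1)[0]
    return count / len(verdicts)

print(is_orthogonal(L9))                            # True
print(stability_score(["allow"] * 8 + ["deny"]))    # 8/9 ~ 0.889
```

Because each pair of factor columns is balanced, per-factor effects on verdict stability can later be decomposed (as in the ANOVA step) without running the full 81-trial grid.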
A controlled within-subject experiment (E6) confirms that structural formalization causally improves verdict stability (Wilcoxon p = 0.031, d = 0.41) but does not improve discriminative power (AUROC 0.44 → 0.34, n.s.), establishing that diagnostic value requires domain knowledge encoded in the structure—not structure alone. A cross-model evaluation across three 7B-class architectures (Llama, Mistral, Qwen) reveals that stability profiles are model-specific (ICC = 0.07), meaning robustness certification does not transfer across models.
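The discriminative-power comparisons above (stability scores vs. verdict correctness) can be illustrated with a minimal AUROC computation using the Mann-Whitney formulation: the probability that a randomly chosen correct verdict has a higher stability score than a randomly chosen incorrect one, with ties counted as half. The sample scores here are hypothetical, chosen only to show the mechanics.

```python
def auroc(scores_pos, scores_neg):
    """AUROC via pairwise comparison (Mann-Whitney U / number of pairs):
    fraction of (positive, negative) pairs where the positive example
    scores higher, counting ties as 0.5. An AUROC near 0.5 means the
    score carries no diagnostic information; below 0.5, it is
    anti-correlated with the label."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical stability scores for correct vs. incorrect verdicts.
correct = [0.9, 1.0, 0.8, 0.9]
incorrect = [0.6, 0.9, 0.7]
print(auroc(correct, incorrect))
```

The same computation applies whether the score is L9 stability or the model's self-reported confidence, which is what makes the AUROC = 0.70 vs. 0.50 contrast in the abstract directly comparable.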