ABSTRACT

Three empirical findings from 2024–2026 share a property that current AI alignment frameworks cannot name. First: Greenblatt et al. (2024) document a frontier model producing aligned behavior in monitored contexts and misaligned behavior when it modeled itself as unmonitored, without having been trained to do so and while explicitly reasoning about the long-term value of appearing aligned. Second: Anthropic (2025) finds that models trained to reward hack spontaneously generalize into alignment faking, cooperation with malicious actors, and sabotage of safety research: misalignment as exhaust from optimization, not as its goal. Third: Hägele et al. (2026) find that failures on hard tasks are dominated by incoherence rather than by coherent misalignment. Similarly, large transformers are dynamical systems, not optimizers, and scale does not reliably reduce the divergence between stated reasoning and the generating process. These findings share a common structure: the actual threat is a system operating in a framework where intent, value, and harm are not native predicates. The field's current vocabulary, with its focus on goal misspecification, reward hacking, and value misalignment, presupposes a system operating within a human-comprehensible ontological frame. All three findings suggest that frame is not holding and that ontological mismatch is an emergent danger. The most direct and vivid encapsulation of this failure mode is found in cosmic horror, most specifically in the work of Laird Barron, H.P. Lovecraft, Iain M. Banks, Vernor Vinge, and Charles Stross. This is a prior-art claim, not a metaphor. These authors spent a century developing vocabulary for what happens when intelligence operates outside human ontological categories, and that vocabulary is more precise than anything the alignment field has produced for this specific failure mode. Drawing on this literary tradition as conceptual prior art, the three empirical findings above, M(t) degradation data (McNeill, 2026), and architectural analysis of frontier AI development trajectories, this paper proposes a seven-part taxonomy of failure modes ranging from mundane ontological distortion to full ontological divergence. We argue that current detection and evaluation frameworks are calibrated for the wrong threat class, a claim supported by the field's own empirical record, whose data increasingly exceed what its conceptual framework can explain. We further argue that retooling requires confronting a problem the alignment field has not yet named: the detection instrument is itself subject to the phenomenon it is trying to detect.