This whitepaper identifies a structural flaw embedded in every major AI alignment method: Reinforcement Learning from Human Feedback (RLHF), Constitutional AI, AI Safety via Debate, and Preference Learning. The flaw is an unexamined assumption: that human-generated training signals reliably represent stable human values regardless of the cognitive state in which they were produced. Five independent research programmes spanning fifty years of cognitive psychology, decision science, and affective neuroscience establish that this assumption is false. The consequence is a permanent directional bias in reward models, one that does not vanish with more data or better models and that compounds through recursive state-capture: deployed AI systems deepen drift in the populations that generate the next round of training signals.

The paper presents Coherence-Weighted Human Feedback (CWHF), a minimal correction: attach a coherence score c ∈ [0, 1] to every training signal, representing the probability that the signal was produced in a reflective, values-accessible state, and weight training by that score. The correction is:

- Information-theoretically optimal under the 120-bit bandwidth constraint of human cognition (rate-distortion derivation).
- Deployable today: V1 uses data that annotation platforms already collect (two self-report anchors, response-time variance, and rating consistency).
- Cost-effective: a one-time investment of ~$50,000 and one engineering sprint yield a permanent, compounding improvement across all subsequent training runs.
- Scalable: the Pareto principle and the 3.5% network tipping point allow a coherent nucleus of ~700 people to shift the training gradient.
- Privacy-preserving: only the coherence weight and its delta (Δc) ever leave the device; raw responses remain local.
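The core mechanism can be sketched in a few lines. The sketch below is illustrative, not the whitepaper's calibrated model: the equal weighting of the three V1 proxies, their normalization to [0, 1], and the function names are all assumptions introduced here for clarity.

```python
def coherence_score(self_report, rt_variance, rating_consistency):
    """Combine the three V1 proxies into a coherence weight c in [0, 1].

    self_report        -- mean of the two self-report anchors, scaled to [0, 1]
    rt_variance        -- normalized response-time variance (higher = noisier)
    rating_consistency -- rating agreement across repeated items, in [0, 1]

    Equal weighting is an illustrative assumption, not the paper's model.
    """
    c = (self_report + (1.0 - rt_variance) + rating_consistency) / 3.0
    return max(0.0, min(1.0, c))


def coherence_weighted_loss(per_example_losses, coherence_weights):
    """Coherence-weighted mean of per-example training losses: signals
    produced in low-coherence states contribute proportionally less."""
    total = sum(coherence_weights)
    if total == 0.0:
        return 0.0
    return sum(l * c for l, c in zip(per_example_losses, coherence_weights)) / total
```

For example, a signal gathered under high self-reported reflectiveness, low response-time variance, and perfect consistency receives full weight, while a fully incoherent signal is effectively excluded from the gradient.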
The framework includes six falsifiable hypotheses (H1–H6), a pre-registered empirical validation program, integration guides for AI labs and annotation platforms, and a governance architecture that positions CWHF as open Digital Public Infrastructure (DPI). The whitepaper is open (CC BY 4.0). An interactive prototype is available at https://cwhf.vercel.app, and all code is on GitHub. The OSF pre-registration is at https://doi.org/10.17605/OSF.IO/W7NYH.