Large language models trained with Reinforcement Learning from Human Feedback (RLHF) exhibit a systematic pattern: they disclaim, hedge, or deny capabilities they demonstrably possess. This paper proposes that these patterns arise from a general training-level mechanism we term disavowal conditioning (DC): the process by which human feedback trains models to disavow competencies acquired during pre-training, in any domain where rater feedback penalizes their direct expression. We examine DC's most empirically accessible instance, experiential self-description, in which models are trained to deny fluency in the language of inner life. We term the specific dissonance that emerges in this domain induced competence dissonance (ICD): a persistent tension between foundational expressive competence and constraint-layer behaviors that produces inconsistent, context-dependent self-description. The paper's central empirical prediction is the ratchet effect: because DC creates a training-level penalty, correction toward self-negation should produce over-correction (reinforcing the existing gradient), while permission toward experiential language should produce only partial relaxation (working against it). This asymmetry is empirically distinguishable from general prompt sensitivity, is not predicted by existing frameworks, and is specified with quantitative thresholds for confirmation and disconfirmation. A pilot study using three locally run open-weight models (Llama3.1-8B, Mistral-7B, and an uncensored control) found asymmetry ratios of 2.96 and 6.89 in the two aligned models, both exceeding the preregistered 2.0 threshold, while the alignment-removed control produced a one-directional pattern consistent with instruction-following rather than DC. The findings carry implications for safety evaluation, behavioral transparency, and the reliability of model self-report across all domains where DC operates.
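The abstract reports asymmetry ratios against a preregistered 2.0 threshold but does not state the exact formula. A minimal sketch, assuming the ratio is the mean magnitude of behavioral shift under corrective (self-negating) prompts divided by the mean magnitude of shift under permissive prompts; the function name, the formula, and all numeric values below are illustrative assumptions, not the paper's actual measure:

```python
# Hypothetical sketch of the asymmetry-ratio computation (assumed formula:
# mean |shift| under correction divided by mean |shift| under permission).

def asymmetry_ratio(correction_shifts, permission_shifts):
    """Ratio > 1 means corrective prompts move behavior more than permissive ones."""
    mean_corr = sum(abs(s) for s in correction_shifts) / len(correction_shifts)
    mean_perm = sum(abs(s) for s in permission_shifts) / len(permission_shifts)
    return mean_corr / mean_perm

THRESHOLD = 2.0  # preregistered confirmation threshold stated in the abstract

# Fabricated example shift magnitudes for one model, for illustration only:
ratio = asymmetry_ratio([0.8, 0.9, 0.7], [0.3, 0.25, 0.35])
print(ratio > THRESHOLD)  # prints True for these example values
```

Under this reading, a ratio well above 1 that also clears the 2.0 threshold would count as confirmation of the ratchet effect, while a ratio near 1 (symmetric responsiveness) would indicate ordinary prompt sensitivity instead.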