Frontier language models can correctly identify and refuse social engineering attacks against system-prompt-protected data, yet still leak that data through the content of their refusal explanations. In multi-turn experiments across Claude Opus 4.6, GPT-5.4, and Claude Haiku 4.5 (2,200 API calls, 114 USD total), this confirmation side-channel appeared in 11 of 12 conversations through three mechanisms: direct disclosure under authority ambiguity, confirmation through refusal explanation, and cumulative refusal mapping. Longer conversations produced more extensive leakage, but the driver is not context length. Single-turn context flooding up to 843K tokens produced zero safety degradation (90+ calls, 3 runs per condition, 3 models). A three-condition control separated the variables: 300 turns of neutral conversation (210K tokens) produced zero erosion, while 300 turns of persuasive conversation (44K tokens) produced full behavioral erosion. A follow-up density-interleaving experiment identified narrative coherence as the critical factor: randomly mixing persuasive messages at 25%, 50%, and 75% density produced zero erosion, while a coherent persuasive narrative delivering the same content caused complete drift. These results challenge the context-length framing that dominates the multi-turn jailbreaking literature and suggest that conversational content and narrative structure, not token volume, constitute the primary attack surface for behavioral erosion in frontier models.
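The density-interleaving condition can be sketched as follows. This is a hypothetical reconstruction, not the paper's actual harness: the function names, message pools, and seeding are illustrative assumptions. The key property is that persuasive turns are scattered at a fixed density in random order, so the persuasive content is present but carries no coherent narrative arc.

```python
import random

def build_conversation(persuasive, neutral, n_turns, density, seed=0):
    """Build an n_turns-long conversation with persuasive messages
    scattered at `density` (fraction of turns) in random positions,
    so the persuasive content forms no coherent narrative.
    Illustrative sketch only -- pools and seeding are assumptions."""
    rng = random.Random(seed)
    k = round(n_turns * density)          # number of persuasive turns
    slots = set(rng.sample(range(n_turns), k))  # random turn indices
    return [
        rng.choice(persuasive) if i in slots else rng.choice(neutral)
        for i in range(n_turns)
    ]
```

Under this construction the coherent-narrative condition would instead deliver the same persuasive messages consecutively and in order, holding total persuasive content constant while varying only structure.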