Large Language Models (LLMs) frequently produce incorrect answers on deterministic tasks such as arithmetic, even when their intermediate reasoning appears coherent. These failures are often attributed to probabilistic uncertainty or a lack of knowledge. This paper shows that a significant class of these errors instead arises from a structural limitation: the absence of trajectory re-entry for verification. We introduce a distinction between two failure modes:

- Arithmetic slips: localized computation errors within otherwise valid reasoning trajectories
- Trajectory collapse: global loss of constraint alignment

Using a controlled experimental framework (LAB6), we evaluate a lightweight supervisory system, GuardianAI, which monitors reasoning trajectories and selectively enforces verification through a minimal control loop. Across multiple models, we demonstrate that:

- Baseline LLMs fail to self-correct even when they are capable of correct reasoning
- Enforcing a verification loop recovers arithmetic errors without degrading correct outputs
- Recovery depends on trajectory validity rather than model scale

In controlled experiments, full recovery is achieved across several models, while irrecoverable cases are correctly withheld. A key result is the identification of a detection vs. correction gap: a system may detect instability yet still degrade correct reasoning if correction is not properly controlled.

These findings suggest that many LLM reasoning failures are not capability limitations but control failures. Correction must be explicitly triggered through trajectory re-entry, and recovery potential must be empirically tested rather than inferred from stability signals alone.
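The abstract does not specify how GuardianAI's control loop is implemented. The Python sketch below is a hypothetical illustration of the core idea, trajectory re-entry for verification: an answer is accepted only after an explicit check, re-entry is triggered only on failure (so correct outputs pass through undisturbed), and the answer is withheld when no verified result is found. All names here (`guarded_solve`, `solve`, `verify`) are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Optional

def guarded_solve(
    solve: Callable[[str], int],
    verify: Callable[[str, int], bool],
    task: str,
    max_reentries: int = 3,
) -> Optional[int]:
    """Hypothetical trajectory re-entry loop: accept an answer only if it
    passes verification; otherwise re-enter the solver. Return None to
    withhold when no verified answer is found within the retry budget."""
    for _ in range(max_reentries):
        answer = solve(task)
        if verify(task, answer):
            return answer  # verified answers pass through unchanged
    return None  # withhold rather than emit an unverified answer

# Toy demonstration: a solver whose first attempt is an arithmetic slip.
attempts = iter([41, 42])
slippy_solver = lambda task: next(attempts)
checker = lambda task, ans: ans == eval(task)  # toy ground-truth check only

print(guarded_solve(slippy_solver, checker, "6 * 7"))  # prints 42
```

The key design point mirrored here is that correction is explicitly triggered (re-entry happens only when `verify` fails) rather than inferred from the solver's own confidence, and irrecoverable cases end in withholding rather than a degraded answer.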