Contemporary benchmark design treats evaluation as a one-directional act of measurement. We formalize and challenge this assumption by introducing Evaluation-as-Development-Signal (EDS): the property of a benchmark suite that, when included in a system’s operational context (or supplied through any optimal form of advanced RAG, hypergraphs, ...), causally improves both the system’s capability and the benchmark’s bootstrapping potential in a Compound-Loop, not by exposing training labels but by structuring the cognitive demands placed on the system during evaluation. We prove the EDS Non-Collapse Theorem by reductio ad absurdum. We also introduce EDS Velocity (V-EDS), a novel operational health metric for the suite, and formalize three EDS Failure Modes with concrete architectural defenses. We propose four synthetic datasets (EpistemicCitizenship-1K, ParadigmConflict-500, CoherenceStress-200, and BenchmarkCritique-100) that ground SOMA, PRISM, ECHO, and META-∆ in falsifiable empirical claims. Three pilot studies constitute the pre-registration protocol required before any Proto-AGI designation.

Key Concepts: evaluation-as-development-signal • Goodhart’s Law • compounding intelligence • synthetic benchmarks • dialectical evaluation • recursion • loop-based systems