State-Space Models (SSMs) achieve linear-time sequence modeling but treat all temporal inputs uniformly, absorbing noise and redundancy alike into the hidden state. We study the problem of protecting SSM hidden states from high-frequency multimodal noise during long-horizon video understanding, and present a principled solution grounded in continuous-time control theory. We propose the Instability-Gated SSM (IG-SSM), which computes a deterministic temporal instability signal from the rolling channel-wise variance of a low-cost trimodal stream (video, audio, and ASR text), and uses it to analytically modulate the continuous-time discretization step size ∆. We prove that this gating mechanism provides a formal upper bound on hidden-state corruption under bounded noise, a guarantee absent from standard learned-gating SSMs. Critically, we resolve the architectural contradiction between noise blocking and information seeking through a Decoupled Integration Rule: while noisy continuous frames are throttled, a triggered high-resolution spatial query to a frozen billion-parameter backbone bypasses the attenuation and is fully absorbed into the state. Evaluated on three challenging long-form multimodal benchmarks (Video-MME Long, EgoSchema, and LongVideoBench), our framework maintains accuracy within 1% of dense baselines while reducing visual token consumption and GFLOPs by up to 40%, strictly outperforming competitive token-reduction and active-sampling baselines including TempMe Ren et al. [2025], VideoMamba Li et al. [2024], and VTM Kim et al. [2024].
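The abstract's gating idea can be illustrated with a minimal sketch. The function below is hypothetical (names, window size, and the gating form `∆_t = ∆_base / (1 + α·s_t)` are illustrative assumptions, not the paper's exact equations): it computes a rolling channel-wise variance over a trimodal feature stream as the instability signal `s_t`, shrinks the discretization step `∆` when instability is high, and lets frames flagged as triggered queries bypass the attenuation, per the Decoupled Integration Rule.

```python
import numpy as np

def instability_gated_deltas(features, query_mask=None,
                             delta_base=0.1, alpha=5.0, window=8):
    """Illustrative sketch of instability-gated step sizes (not the paper's code).

    features:   (T, C) concatenated trimodal stream (video + audio + ASR text).
    query_mask: optional (T,) bool array; True marks a triggered high-resolution
                query, which bypasses attenuation (Decoupled Integration Rule).
    Returns (T,) per-step discretization sizes delta_t.
    """
    T, _ = features.shape
    deltas = np.empty(T)
    for t in range(T):
        lo = max(0, t - window + 1)
        win = features[lo:t + 1]            # rolling window ending at step t
        s_t = win.var(axis=0).mean()        # channel-wise variance, averaged
        # High instability -> small delta -> hidden state absorbs less noise.
        deltas[t] = delta_base / (1.0 + alpha * s_t)
    if query_mask is not None:
        # Triggered queries are fully absorbed: restore the unattenuated step.
        deltas[np.asarray(query_mask, dtype=bool)] = delta_base
    return deltas
```

On a near-constant stream the variance is tiny, so `delta_t` stays close to `delta_base`; on a noisy stream the step shrinks, throttling how much the hidden state integrates per frame.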