Recent advances in vision-language cross-modal learning have substantially improved video temporal grounding. However, most existing methods directly associate global video features with sentence-level features, overlooking the fact that textual semantics usually correspond to only limited spatio-temporal regions within a video. This often leads to unstable alignment in complex scenarios involving intertwined events and diverse actions. In essence, accurate video temporal grounding requires jointly modeling fine-grained spatial semantics and heterogeneous temporal event structures. Motivated by this observation, we propose a hierarchical prototype alignment approach that models cross-modal correspondence between video and text through structured intermediate prototype representations. Specifically, the alignment process is decomposed into two complementary stages: object-phrase alignment and event-sentence alignment. In the object-phrase stage, discriminative local visual regions and informative textual words are aggregated into object and phrase prototypes, strengthening fine-grained spatial correspondence at the level of entities and localized actions. In the event-sentence stage, object prototypes are further integrated along the temporal dimension into event prototypes that represent continuous action units, enabling effective alignment with sentence-level semantics and the modeling of diverse temporal event structures. Building on these aligned prototypes, we inject cross-modal alignment information directly into candidate moment aggregation, so that candidate moment representations emphasize query-relevant temporal regions. Extensive experiments on Charades-STA, ActivityNet Captions, and TACoS demonstrate that the proposed method outperforms existing approaches, validating the effectiveness of hierarchical prototype alignment for improving both cross-modal alignment quality and temporal grounding accuracy.
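
To make the two-stage design concrete, the following is a minimal PyTorch sketch of how hierarchical prototype alignment could be realized. Everything here is an illustrative assumption rather than the paper's actual implementation: the class and parameter names (`HierarchicalPrototypeAlignment`, `obj_queries`, `event_queries`, etc.), the choice of learnable-query cross-attention for prototype aggregation, and the similarity-based scoring are all stand-ins consistent with the abstract's description, not confirmed details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPrototypeAlignment(nn.Module):
    """Sketch of two-stage prototype alignment (hypothetical design).

    Stage 1 pools local visual regions into object prototypes and words
    into phrase prototypes; Stage 2 integrates object prototypes along
    time into event prototypes aligned with the sentence embedding.
    """

    def __init__(self, dim=256, num_obj_protos=8, num_event_protos=4):
        super().__init__()
        # Learnable queries that aggregate features into prototypes
        # (assumed mechanism; the paper may aggregate differently).
        self.obj_queries = nn.Parameter(torch.randn(num_obj_protos, dim))
        self.phrase_queries = nn.Parameter(torch.randn(num_obj_protos, dim))
        self.event_queries = nn.Parameter(torch.randn(num_event_protos, dim))
        self.obj_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.phrase_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.event_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, region_feats, word_feats, sent_feat):
        # region_feats: (B, T, R, D) spatio-temporal region features
        # word_feats:   (B, W, D)    token-level text features
        # sent_feat:    (B, D)       sentence-level text feature
        B, T, R, D = region_feats.shape

        # --- Stage 1: object-phrase alignment (fine-grained, per frame) ---
        regions = region_feats.reshape(B * T, R, D)
        obj_q = self.obj_queries.unsqueeze(0).expand(B * T, -1, -1)
        obj_protos, _ = self.obj_attn(obj_q, regions, regions)  # (B*T, P, D)
        obj_protos = obj_protos.reshape(B, T, -1, D)

        phr_q = self.phrase_queries.unsqueeze(0).expand(B, -1, -1)
        phrase_protos, _ = self.phrase_attn(phr_q, word_feats, word_feats)

        # Object-phrase similarity: best-matching phrase per object,
        # averaged over time and object prototypes.
        obj_n = F.normalize(obj_protos, dim=-1)
        phr_n = F.normalize(phrase_protos, dim=-1)
        op_sim = torch.einsum('btpd,bqd->btpq', obj_n, phr_n)
        op_score = op_sim.amax(dim=-1).mean(dim=(1, 2))          # (B,)

        # --- Stage 2: event-sentence alignment (temporal aggregation) ---
        obj_seq = obj_protos.reshape(B, T * obj_protos.shape[2], D)
        evt_q = self.event_queries.unsqueeze(0).expand(B, -1, -1)
        event_protos, _ = self.event_attn(evt_q, obj_seq, obj_seq)  # (B, E, D)

        evt_n = F.normalize(event_protos, dim=-1)
        sent_n = F.normalize(sent_feat, dim=-1)
        es_score = torch.einsum('bed,bd->be', evt_n, sent_n).amax(dim=-1)

        return op_score, es_score, event_protos
```

In a sketch like this, the two alignment scores could act as auxiliary training signals for the respective stages, and the event prototypes could reweight clip features during candidate moment aggregation, which would correspond to the injection step described in the abstract; how the paper realizes that step is not specified here.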