MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

2023 · 74 citations · Journal Article

Authors

Difei Gao · National University of Singapore

Luowei Zhou · Microsoft Research (United Kingdom)

Lei Ji · Microsoft Research Asia (China)

Linchao Zhu · Zhejiang University

Yi Yang · Zhejiang University

Mike Zheng Shou · National University of Singapore

Abstract

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at efficiency. The code is available at github.com/showlab/mist.
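
The select-then-attend loop described in the abstract can be illustrated with a minimal plain-Python sketch. This is not the authors' implementation (see the repository above for that). The function names, the `video[s][r]` feature layout, the dot-product relevance score, mean pooling of segments, hard top-k selection, and the residual update are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(candidates, query, k):
    """Indices of the k candidates scoring highest against query, in original order."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: dot(candidates[i], query), reverse=True)
    return sorted(ranked[:k])

def mean_pool(vectors):
    """Average a list of equal-length vectors into one summary vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attend(query, tokens):
    """Scaled dot-product attention of a single query over the given tokens."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(t, query) / scale for t in tokens])
    return [sum(w * t[d] for w, t in zip(weights, tokens))
            for d in range(len(query))]

def iterative_select_attend(video, question, num_layers=2,
                            top_segments=2, top_regions=2):
    """video[s][r] is the feature vector of region r in segment s (assumed layout)."""
    state = list(question)
    for _ in range(num_layers):
        # Segment selection: keep the segments most relevant to the current state.
        summaries = [mean_pool(regions) for regions in video]
        segment_ids = top_k_indices(summaries, state, top_segments)
        # Region selection: within each kept segment, keep the most relevant regions.
        tokens = []
        for s in segment_ids:
            for r in top_k_indices(video[s], state, top_regions):
                tokens.append(video[s][r])
        # Attention runs only over the selected tokens, not the whole video.
        context = attend(state, tokens)
        # Residual update, so the next layer can select different evidence.
        state = [a + b for a, b in zip(state, context)]
    return state
```

With `top_segments=2` and `top_regions=2`, each layer attends over 4 tokens regardless of video length, which is the source of the cost saving relative to dense attention over every region of every segment. A trainable model would need a differentiable selection step in place of the hard top-k used here.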

Topics & Keywords

Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Publication Details

DOI: 10.1109/cvpr52729.2023.01419

Field-Weighted Citation Impact: 9.09
