Cross-modal retrieval is a fundamental application of multi-modal learning and has achieved remarkable success with large-scale, well-paired data. In practice, however, collecting such data is costly. To reduce the dependence on paired data, this paper studies a practical learning paradigm, semi-paired cross-modal learning (SPL), which leverages a small amount of paired data together with a large amount of unpaired data and is thus far more accessible in practice. Taking image-text retrieval as an example, we propose a novel Robust Cross-modal Semi-paired Learning method (RCSL) that addresses two challenges. Specifically, i) to overcome the under-optimization caused by the scarcity of paired data, we present Semi-paired Discriminative Learning (SDL), which fully learns visual-semantic associations from a small number of image-text pairs by preserving the alignment and uniformity of modality representations. ii) To mine visual-semantic correspondences from unpaired data, RCSL first constructs pseudo-paired correlations across modalities by nearest neighbor association. This, however, may introduce noisy correspondences (NCs) due to inaccurate pseudo signals, which can degrade performance. To tackle NCs, we devise Robust Cross-correlation Mining (RCM), which builds on the risk minimization criterion to robustly and explicitly learn visual-semantic associations from pseudo-paired data, thereby boosting cross-modal learning. Finally, extensive experiments on four datasets, i.e., three widely used benchmarks (Flickr30K, MS-COCO, and CC152K) and a newly constructed real-world dataset (Drone-SP), demonstrate the effectiveness of RCSL under semi-paired and noisy settings.
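As a rough illustration of the alignment-and-uniformity objective that SDL builds on, the sketch below follows the standard formulation of Wang and Isola (2020): alignment pulls matched image-text embeddings together, while uniformity spreads each modality's embeddings over the unit hypersphere. The function names, loss weighting, and batch setup are assumptions for illustration, not the paper's exact SDL loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(img_emb, txt_emb, alpha=2):
    # Alignment: matched image/text embeddings should lie close together.
    # Expects L2-normalized embeddings of shape (batch, dim).
    return (img_emb - txt_emb).norm(p=2, dim=1).pow(alpha).mean()

def uniformity_loss(emb, t=2):
    # Uniformity: embeddings should spread over the unit hypersphere,
    # measured as the log mean Gaussian potential over all pairs.
    return torch.pdist(emb, p=2).pow(2).mul(-t).exp().mean().log()

# Hypothetical mini-batch of paired embeddings; the 0.5 weight is an assumption.
img = F.normalize(torch.randn(128, 256), dim=1)
txt = F.normalize(torch.randn(128, 256), dim=1)
loss = alignment_loss(img, txt) + 0.5 * (uniformity_loss(img) + uniformity_loss(txt))
```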
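The nearest-neighbor pseudo-pairing step can likewise be sketched: match each unpaired image to its nearest text in the shared embedding space. All names below, and in particular the `threshold` filter, are hypothetical; as the abstract notes, such pseudo signals remain noisy, which is what RCM is designed to handle (its risk-minimization objective is not reproduced here).

```python
import torch
import torch.nn.functional as F

def pseudo_pair(img_emb, txt_emb, threshold=0.5):
    # Match each unpaired image to its nearest text by cosine similarity,
    # keeping only confident matches (the threshold is a crude filter;
    # residual noisy correspondences are left to a robust loss like RCM).
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.T            # (n_img, n_txt) cosine similarities
    best_sim, best_txt = sim.max(dim=1)  # nearest text index per image
    keep = best_sim > threshold
    img_idx = torch.nonzero(keep).squeeze(1)
    return img_idx, best_txt[keep]       # pseudo-paired (image, text) indices
```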
Published in: Proceedings of the AAAI Conference on Artificial Intelligence
Volume 40, Issue 30, pp. 24964-24972