INTENT: Invariance and Discrimination-aware Noise Mitigation for Robust Composed Image Retrieval

2026 · 0 citations · Journal Article · Diamond Open Access

Authors

Zhiwei Chen · Shandong University

Yupeng Hu · Shandong University

Zhiheng Fu · Shandong University

Zixu Li · Shandong University

Jiale Huang · Shandong University

Qinlei Huang · Shandong University

Yinwei Wei · Shandong University

Abstract

Composed Image Retrieval (CIR) is a challenging image retrieval paradigm that retrieves target images based on multimodal queries consisting of reference images and modification texts. Although substantial progress has been made in recent years, existing methods assume that all samples are correctly matched. In real-world scenarios, however, the high cost of triplet annotation means that CIR datasets inevitably contain annotation errors, resulting in incorrectly matched triplets. Consequently, the problem of Noisy Triplet Correspondence (NTC) has attracted growing attention. We argue that noise in CIR can be categorized into two types: cross-modal correspondence noise and modality-inherent noise. The former arises from mismatches across modalities, whereas the latter originates from intra-modal background interference or visual factors irrelevant to the coarse-grained modification annotations. Modality-inherent noise is often overlooked, and research on cross-modal correspondence noise remains nascent. To tackle these issues, we propose the Invariance and discrimiNaTion-awarE Noise neTwork (INTENT), which comprises two components, Visual Invariant Composition and Bi-Objective Discriminative Learning, specifically designed to handle these two types of noise. The former applies causal intervention on the visual side via the Fast Fourier Transform (FFT) to generate intervened composed features, enforcing visual invariance and enabling the model to ignore modality-inherent noise during composition. The latter adopts collaborative optimization with both positive and negative samples and constructs a scalable decision boundary that dynamically adjusts decisions based on the loyalty degree, enabling robust correspondence discrimination. Extensive experiments on two widely used benchmark datasets demonstrate the superiority and robustness of INTENT.
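
The abstract does not specify the exact form of the FFT-based intervention, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that the intervention perturbs the Fourier amplitude of each visual feature vector (mixing it with the amplitude of another sample in the batch) while keeping the phase, and that invariance is encouraged by a cosine-based consistency term. The function names `fft_amplitude_intervention` and `invariance_loss`, the `strength` parameter, and the mixing scheme are all hypothetical.

```python
import numpy as np


def fft_amplitude_intervention(features, rng, strength=0.5):
    """Hypothetical intervention: mix Fourier amplitudes across the batch, keep phase.

    features: real array of shape (batch, dim).
    strength: upper bound of the per-sample mixing weight (assumed parameter).
    """
    spectrum = np.fft.rfft(features, axis=-1)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    # Borrow amplitude from a randomly chosen other sample in the batch.
    perm = rng.permutation(features.shape[0])
    lam = rng.uniform(0.0, strength, size=(features.shape[0], 1))
    mixed_amplitude = (1.0 - lam) * amplitude + lam * amplitude[perm]

    # Recombine the perturbed amplitude with the original phase.
    intervened = mixed_amplitude * np.exp(1j * phase)
    return np.fft.irfft(intervened, n=features.shape[-1], axis=-1)


def invariance_loss(original, intervened, eps=1e-12):
    """Assumed consistency term: mean (1 - cosine similarity) per sample."""
    a = original / (np.linalg.norm(original, axis=-1, keepdims=True) + eps)
    b = intervened / (np.linalg.norm(intervened, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))


rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 64))          # stand-in for visual features
perturbed = fft_amplitude_intervention(visual, rng, strength=0.5)
loss = invariance_loss(visual, perturbed)  # would be minimized during training
```

In this sketch, setting `strength=0` leaves the features unchanged up to floating-point error, which gives a simple sanity check that the transform and its inverse are applied consistently.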

Topics & Keywords

Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

UN Sustainable Development Goals

Reduced inequalities

Publication Details

Published in: Proceedings of the AAAI Conference on Artificial Intelligence

Volume 40, Issue 25, pp. 20463-20471

DOI: 10.1609/aaai.v40i25.39181

Field-Weighted Citation Impact: 0.00