Prompt-based Weakly-supervised Vision-language Pre-training

2025 · 2 citations · Journal Article · hybrid Open Access

Authors

Zixin Guo · Aalto University

Tzu-Jui Julius Wang · Zenuity (Sweden)

Selen Pehlivan · VTT Technical Research Centre of Finland

Abduljalil Radman · Aalto University

Min Cao · Soochow University

Jorma Laaksonen · Aalto University

Abstract

Weakly-supervised Vision-Language Pre-training (W-VLP) explores methods that leverage weak cross-modal supervision, typically relying on object tags generated from images by a pre-trained object detector (OD). However, training such an OD requires dense cross-modal information, namely images paired with numerous object-level annotations. To alleviate that requirement, this paper addresses W-VLP in two stages: (1) creating data with weaker cross-modal supervision and (2) pre-training a vision-language (VL) model with the created data. The data creation process collects knowledge from large language models (LLMs) to describe images: given the category label of an image, descriptions generated by an LLM serve as its language counterpart. This knowledge, such as spatial relationships among the objects most likely to appear in a scene, supplements what can be obtained using an OD. To mitigate the noise in the LLM-generated descriptions, which destabilizes training and may lead to overfitting, we incorporate knowledge distillation and external retrieval-augmented knowledge during pre-training. Furthermore, we present an effective VL model pre-trained with the created data. Empirically, despite its weaker cross-modal supervision, our pre-trained VL model notably outperforms other W-VLP works in image and text retrieval tasks; for example, in text-to-image retrieval it improves Recall@1 over VLMixer by a relative 17.7% on MSCOCO and over RELIT by a relative 11.25% on Flickr30K. It also shows superior performance on other VL downstream tasks, making a big stride towards matching the performance of strongly supervised VLP models. The results demonstrate the effectiveness of the proposed W-VLP methodology.

• PiTL uses weak cross-modal supervision, relying on LLM-generated descriptions of image labels.
• PiTL mitigates overfitting with knowledge distillation and retrieval-augmented data.
• PiTL unifies text and multi-modal encoders and uses contrastive learning.
• PiTL’s efficacy is evaluated across image-text retrieval, VE, VQA, and NLVR2 tasks.
• PiTL’s components undergo a detailed analysis in retrieval tasks.
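
The abstract reports its retrieval gains as relative improvements in Recall@1. As a minimal sketch of how such a figure is typically computed, the Python below ranks images for each text query and then compares two Recall@1 scores. The function names `recall_at_k` and `relative_gain`, the one-to-one text/image pairing, and the toy similarity values are all assumptions for illustration; none of them come from the paper, and benchmarks such as MSCOCO pair several captions with each image.

```python
def recall_at_k(similarity, k=1):
    """Text-to-image Recall@K.

    similarity[i][j] scores text i against image j. This sketch assumes
    the correct image for text i is image i (one-to-one pairing).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank image indices by descending similarity to text i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Toy 3x3 similarity matrix (made-up numbers, not results from the paper).
toy = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy, k=1)
r2 = recall_at_k(toy, k=2)
```

A relative gain differs from an absolute one: moving Recall@1 from 0.50 to 0.60 is a gain of 10 points in absolute terms but 20% in relative terms, which is the sense in which the abstract's 17.7% and 11.25% figures are stated.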

Topics & Keywords

Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Publication Details

Published in: Pattern Recognition Letters

Volume 197, pp. 8-15

DOI: 10.1016/j.patrec.2025.06.020

Field-Weighted Citation Impact: 2.66
