• A plug-and-play prompt-driven framework for RGB-thermal image semantic segmentation.
• A LoRA-based fine-tuning strategy for integrating SAM-series models.
• A model-agnostic encoder that generates statistically distributed prompts for training.

Semantic segmentation of RGB-thermal images is critical for applications under low-light conditions. Existing works focus primarily on feature-fusion strategies and model design to enhance performance. Although Visual Foundation Models (VFMs) have been introduced in previous studies to improve generalization and segmentation accuracy, they suffer from poor compatibility with other models and therefore require full model retraining. Moreover, the domain gap and modality gap between VFM pre-training datasets and RGB-thermal semantic segmentation datasets pose significant challenges to adapting VFMs to downstream tasks. To address these issues, this paper proposes a plug-and-play prompt-driven framework, P³D. Unlike existing VFM-based methods that require complete retraining for each specific architecture, P³D adopts a model-agnostic training strategy that enables one-time training and seamless integration with various existing methods without retraining them. First, a dual-branch LoRA (Low-Rank Adaptation) fine-tuned (DBLF) image encoder for the RGB and thermal branches is proposed to narrow the domain gap and modality gap when incorporating SAM-series models into this task. Second, a unified prompt generation and representation (UPGR) encoder is proposed; it generates diverse prompts from semantic labels during the training stage, ensuring the generated prompts are model-agnostic and compatible with existing methods. Finally, a cross-modality spatial-channel attention (CM-SCA) decoder is developed to fuse the embeddings of the two image modalities and the prompts for the final prediction. Extensive experiments are conducted on three popular benchmarks. Results demonstrate that P³D not only improves the performance of existing models but also outperforms current state-of-the-art (SOTA) methods while using fewer than 1% trainable parameters. More importantly, simply plugging P³D into existing methods consistently yields significant performance improvements without retraining these base models, demonstrating the practical value of the plug-and-play design.
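To make the LoRA fine-tuning idea behind the DBLF encoder concrete, the following is a minimal PyTorch sketch, not the paper's implementation. It freezes a pretrained linear layer and adds a trainable low-rank residual; the layer name `attn_qkv`, the rank, and the dual-adapter wiring are illustrative assumptions about how separate RGB and thermal adapters could share one frozen SAM-style encoder.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank residual:
    y = W0 x + (alpha / r) * B A x, where only A and B are updated."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Dual-branch idea: one frozen encoder layer, two sets of LoRA adapters.
# `attn_qkv` is a hypothetical attention projection inside the encoder.
attn_qkv = nn.Linear(256, 768)                # stand-in pretrained layer
rgb_qkv = LoRALinear(attn_qkv, rank=4)        # adapter for the RGB branch
thermal_qkv = LoRALinear(attn_qkv, rank=4)    # adapter for the thermal branch

x = torch.randn(2, 16, 256)                   # (batch, tokens, dim)
print(rgb_qkv(x).shape, thermal_qkv(x).shape)
```

Because only the rank-r matrices are trained while the SAM backbone stays frozen, this style of adaptation is consistent with the abstract's claim of fewer than 1% trainable parameters.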
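The abstract does not detail how the UPGR encoder derives its statistically distributed prompts from semantic labels, so the sketch below shows only one plausible ingredient: sampling random point prompts from each class region of the label map during training. The function name `sample_point_prompts` and the parameter `points_per_class` are made-up for illustration.

```python
import torch

def sample_point_prompts(label: torch.Tensor, points_per_class: int = 5):
    """Hypothetical sketch: during training, draw random point prompts
    from each class region of a semantic label map of shape (H, W).
    Returns a dict mapping class id -> (K, 2) tensor of (x, y) coords."""
    prompts = {}
    for cls in label.unique():
        ys, xs = (label == cls).nonzero(as_tuple=True)
        k = min(points_per_class, ys.numel())
        idx = torch.randperm(ys.numel())[:k]  # random subset of class pixels
        prompts[int(cls)] = torch.stack([xs[idx], ys[idx]], dim=-1)
    return prompts

label = torch.randint(0, 3, (64, 64))         # toy 3-class label map
prompts = sample_point_prompts(label)
print({c: tuple(p.shape) for c, p in prompts.items()})
```

Randomizing the sampled points each epoch is one simple way to obtain the diverse, label-driven prompts the abstract describes, without tying the prompts to any particular base model.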
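For the CM-SCA decoder, the abstract specifies only that spatial and channel attention fuse the two modalities' embeddings. The following is a CBAM-style sketch under that assumption, not the paper's actual decoder: concatenated RGB and thermal features are reweighted first per channel, then per spatial position, and projected back to the original width.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """CBAM-style sketch of cross-modality fusion: concatenate RGB and
    thermal features, reweight channels, then reweight spatial positions."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(1, (2 * channels) // reduction)
        self.channel_gate = nn.Sequential(    # squeeze-and-excitation branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)             # (B, 2C, H, W)
        x = x * self.channel_gate(x)                     # channel attention
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_gate(stats))  # spatial attention
        return self.proj(x)                              # back to C channels

fuse = SpatialChannelFusion(channels=64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Channel attention lets the fusion emphasize whichever modality carries more signal (e.g., thermal in low light), while spatial attention localizes where each modality is informative.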