• A plug-and-play prompt-driven framework for RGB-thermal image semantic segmentation.
• A LoRA-based fine-tuning strategy for integrating SAM-series models.
• A model-agnostic encoder that generates statistically distributed prompts for training.

Semantic segmentation of RGB-thermal images is critical for applications under low-light conditions. Existing works focus primarily on feature-fusion strategies and model design to enhance performance. Although Visual Foundation Models (VFMs) have been introduced in previous studies to improve generalization and segmentation accuracy, they suffer from poor compatibility with other models and therefore require full model retraining. Moreover, the domain gap and modality gap between VFM pre-training datasets and RGB-thermal semantic segmentation datasets pose significant challenges to adapting VFMs to downstream tasks. To address these issues, this paper proposes a plug-and-play prompt-driven framework, P³D. Unlike existing VFM-based methods that require complete retraining for each specific architecture, P³D adopts a model-agnostic training strategy that enables one-time training and seamless integration with various existing methods without retraining them. First, a dual-branch LoRA (Low-Rank Adaptation) fine-tuned (DBLF) image encoder for the RGB and thermal branches is proposed to narrow the domain gap and modality gap when incorporating SAM-series models into this task. Second, a unified prompt generation and representation (UPGR) encoder is proposed; it generates diverse prompts from semantic labels during the training stage, ensuring the generated prompts are model-agnostic and compatible with existing methods. Finally, a cross-modality spatial-channel attention (CM-SCA) decoder is developed to fuse the embeddings of the two image modalities and the prompts for the final prediction. Extensive experiments are conducted on three popular benchmarks. Results demonstrate that P³D not only improves the performance of existing models but also outperforms current state-of-the-art (SOTA) methods while using fewer than 1% trainable parameters. More importantly, simply plugging P³D into existing methods consistently yields significant performance improvements without retraining these base models, demonstrating the practical value of the plug-and-play design.
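To make the LoRA fine-tuning idea behind the DBLF encoder concrete, the following is a minimal PyTorch sketch, not the paper's implementation. It freezes a pretrained linear layer and adds a trainable low-rank residual; the layer name `attn_qkv`, the rank, and the dual-adapter wiring are illustrative assumptions about how separate RGB and thermal adapters could share one frozen SAM-style encoder.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer with a trainable low-rank residual:
    y = W0 x + (alpha / r) * B A x, where only A and B are updated."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Dual-branch idea: one frozen encoder layer, two sets of LoRA adapters.
# `attn_qkv` is a hypothetical attention projection inside the encoder.
attn_qkv = nn.Linear(256, 768)                # stand-in pretrained layer
rgb_qkv = LoRALinear(attn_qkv, rank=4)        # adapter for the RGB branch
thermal_qkv = LoRALinear(attn_qkv, rank=4)    # adapter for the thermal branch

x = torch.randn(2, 16, 256)                   # (batch, tokens, dim)
print(rgb_qkv(x).shape, thermal_qkv(x).shape)
```

Because only the rank-r matrices are trained while the SAM backbone stays frozen, this style of adaptation is consistent with the abstract's claim of fewer than 1% trainable parameters.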
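The abstract does not detail how the UPGR encoder derives its statistically distributed prompts from semantic labels, so the sketch below shows only one plausible ingredient: sampling random point prompts from each class region of the label map during training. The function name `sample_point_prompts` and the parameter `points_per_class` are made-up for illustration.

```python
import torch

def sample_point_prompts(label: torch.Tensor, points_per_class: int = 5):
    """Hypothetical sketch: during training, draw random point prompts
    from each class region of a semantic label map of shape (H, W).
    Returns a dict mapping class id -> (K, 2) tensor of (x, y) coords."""
    prompts = {}
    for cls in label.unique():
        ys, xs = (label == cls).nonzero(as_tuple=True)
        k = min(points_per_class, ys.numel())
        idx = torch.randperm(ys.numel())[:k]  # random subset of class pixels
        prompts[int(cls)] = torch.stack([xs[idx], ys[idx]], dim=-1)
    return prompts

label = torch.randint(0, 3, (64, 64))         # toy 3-class label map
prompts = sample_point_prompts(label)
print({c: tuple(p.shape) for c, p in prompts.items()})
```

Randomizing the sampled points each epoch is one simple way to obtain the diverse, label-driven prompts the abstract describes, without tying the prompts to any particular base model.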
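For the CM-SCA decoder, the abstract specifies only that spatial and channel attention fuse the two modalities' embeddings. The following is a CBAM-style sketch under that assumption, not the paper's actual decoder: concatenated RGB and thermal features are reweighted first per channel, then per spatial position, and projected back to the original width.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """CBAM-style sketch of cross-modality fusion: concatenate RGB and
    thermal features, reweight channels, then reweight spatial positions."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        mid = max(1, (2 * channels) // reduction)
        self.channel_gate = nn.Sequential(    # squeeze-and-excitation branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 2 * channels, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, thermal], dim=1)             # (B, 2C, H, W)
        x = x * self.channel_gate(x)                     # channel attention
        stats = torch.cat([x.mean(1, keepdim=True),
                           x.amax(1, keepdim=True)], dim=1)
        x = x * torch.sigmoid(self.spatial_gate(stats))  # spatial attention
        return self.proj(x)                              # back to C channels

fuse = SpatialChannelFusion(channels=64)
out = fuse(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

Channel attention lets the fusion emphasize whichever modality carries more signal (e.g., thermal in low light), while spatial attention localizes where each modality is informative.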