Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

20260 citationsJournal Articlegold Open Access

Authors

Shan Dong · Beijing Institute of Technology

Jianlin Xie · Beijing Institute of Technology

Ling Chen · Beijing Institute of Technology

Huichao Chen · Beijing Institute of Technology

Baogui Qi · Chengdu Institute of Information Technology (China)

Y Ge · Tsinghua University

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a representative multimodal understanding task for remote sensing, which segments designated targets from remote images according to free-form natural language descriptions. However, complex remote sensing characteristics, such as cluttered backgrounds, large-scale variations, small scattered targets and repetitive textures, lead to unstable visual grounding and further spatial grounding drift, resulting in inaccurate segmentation results. Existing approaches typically perform implicit visual–linguistic fusion across encoding and decoding stages, entangling spatial grounding with mask refinement. This tightly coupled formulation lacks explicit structural constraints and is prone to cross-modal ambiguity, especially in complex remote sensing layouts. To address these limitations, we propose a Structurally consistent and Grounding-aware Stagewise Reasoning Framework (SGSRF) that follows a grounding-first, segmentation-second paradigm. The framework decomposes inference into three cascaded stages with progressively imposed structural constraints. First, Cross-modal Consistency Refinement (CCR) lays the foundation for stable spatial grounding by enhancing visual–textual structural alignment via CLIP-based features and Structural Consistency Regularization (SCR), producing well-aligned multimodal representations and reliable grounding cues. Second, Grounding-aware Prompt (GPG) Generation bridges grounding and segmentation by converting aligned representations into complementary sparse and dense prompts, which serve as explicit grounding guidance for the segmentation model. Third, Grounding Modulated Segmentation (GMS) leverages the Segment Anything Model (SAM) to generate fine-grained mask prediction under the joint guidance of prompts and grounding cues, improving spatial grounding stability and robustness to background interference and scale variation. Extensive experiments on three remote sensing benchmarks, namely RefSegRS, RRSIS-D, and RISBench, demonstrate that SGSRF achieves state-of-the-art performance. The proposed stagewise paradigm integrates structural alignment, explicit grounding, and prompt-driven segmentation into a unified framework, providing a practical and robust solution for RRSIS in real-world Earth observation applications.

Topics & Keywords

Multimodal Machine Learning Applications Advanced Neural Network Applications Constraint Satisfaction and Optimization

Publication Details

Published in: Remote Sensing

Volume 18, Issue 7, pp. 1015-1015

DOI: 10.3390/rs18071015

Field-Weighted Citation Impact: 0.00

Command Palette

Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

Authors

Abstract

Topics & Keywords

Publication Details