The fusion of low-resolution hyperspectral images (LR-HSI) and high-spatial-resolution multispectral images (HR-MSI) combines the advantages of both modalities to generate high-spatial-resolution hyperspectral images (HR-HSI). However, existing methods still struggle to balance global-local feature modeling against computational efficiency, and they face a core challenge: spectral distortion during upsampling due to the lack of cross-modal guidance. To address these issues, this paper proposes a cross-modal token selection network, termed CTSNet. First, a novel cross-modal guided spatial implicit upsampling pyramid (SIUP) structure is introduced. Unlike conventional implicit neural representation (INR) methods, SIUP directly incorporates MSI features as conditional inputs during the local multilayer perceptron (MLP) prediction stage, providing precise spatial priors for the HSI upsampling process. This design enables early-stage, deep fusion of cross-modal information, effectively resolving spatial detail blurring and spectral distortion during upsampling. Second, a token selection Transformer block (TSTB) is proposed to collaboratively extract global-local spatial and spectral features through a parallel dual-branch structure; its token selection attention mechanism (TSAM) significantly reduces computational complexity by employing an adjustable token selection rate. Finally, a multi-scale hybrid fusion (MSHF) module is designed to achieve deep feature reconstruction. Experiments on four public hyperspectral datasets demonstrate that CTSNet outperforms current state-of-the-art (SOTA) methods in both qualitative and quantitative evaluations.
Published in: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
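To make the cost argument behind the token selection attention mechanism concrete, the sketch below implements a generic token-selection attention step in plain Python. It is an illustration of the general technique, not the paper's TSAM: the function name, the L2-norm token scorer (standing in for whatever learned saliency measure CTSNet uses), and the single-head, unprojected attention are all assumptions. The point it shows is that attending only among a kept fraction r of the n tokens shrinks the quadratic attention cost from O(n²) to O((rn)²), with the keep ratio playing the role of the paper's adjustable token selection rate.

```python
import math

def token_selection_attention(tokens, keep_ratio=0.5):
    """Sketch of token-selection attention (hypothetical, not the paper's TSAM).

    tokens: list of n feature vectors (lists of floats), all of dimension d.
    keep_ratio: fraction of tokens that participate in attention; the rest
    pass through unchanged, so the pairwise cost scales with (keep_ratio*n)^2.
    """
    n, d = len(tokens), len(tokens[0])
    k = max(1, int(n * keep_ratio))
    # Score tokens by L2 norm -- a simple stand-in for a learned scorer.
    scores = [math.sqrt(sum(v * v for v in t)) for t in tokens]
    keep = sorted(range(n), key=lambda i: scores[i])[-k:]  # k most salient tokens
    sel = [tokens[i] for i in keep]
    scale = math.sqrt(d)
    out = [list(t) for t in tokens]  # unselected tokens are left as-is
    for row, i in enumerate(keep):
        # Scaled dot-product self-attention restricted to the selected subset.
        logits = [sum(a * b for a, b in zip(sel[row], s)) / scale for s in sel]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]  # numerically stable softmax
        z = sum(weights)
        out[i] = [sum(weights[j] * sel[j][c] for j in range(k)) / z
                  for c in range(d)]
    return out

tokens = [[float(i + j) for j in range(4)] for i in range(8)]
refined = token_selection_attention(tokens, keep_ratio=0.5)
```

With keep_ratio=0.5 on 8 tokens, only the 4 highest-scoring tokens attend to one another and are updated; the other 4 are returned untouched, which is how the adjustable rate trades accuracy for compute.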