In recent decades, deep neural networks, particularly convolutional neural networks, have achieved state-of-the-art performance in various medical image segmentation tasks. Recently, the introduction of vision transformers (ViTs) has significantly altered the landscape of deep segmentation models, owing to their ability to capture long-range dependencies. However, we argue that the current design of ViT-based UNet (ViT-UNet) segmentation models is limited in handling the heterogeneous appearance (e.g., varying shapes and sizes) of target objects commonly encountered in medical image segmentation tasks. To tackle this limitation, we present a structured approach to introducing spatially dynamic components into a ViT-UNet, enabling the model to effectively capture features of target objects with diverse appearances. This is achieved by three main components: (i) deformable patch embedding; (ii) spatially dynamic multi-head attention; and (iii) multi-scale deformable positional encoding. These components are integrated into a novel architecture, termed AgileFormer, enabling more effective capture of heterogeneous objects at every stage of a ViT-UNet. Experiments on three segmentation tasks using publicly available datasets (Synapse multi-organ, ACDC cardiac, and Decathlon brain tumor) demonstrated the effectiveness of AgileFormer for 2D and 3D segmentation. Remarkably, our AgileFormer sets a new state-of-the-art performance, with Dice scores of 85.74% and 87.43% for 2D and 3D multi-organ segmentation on Synapse, without significant computational overhead. Our code is available at https://github.com/sotiraslab/AgileFormer.

• AgileFormer captures spatially varying features in medical image segmentation.
• Patch embedding and positional encoding are as crucial as self-attention in ViT-UNet.
• AgileFormer achieves SOTA on multi-organ, cardiac, and brain tumor segmentation.
• AgileFormer scales well, enhancing segmentation accuracy as model size increases.
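To make component (i) concrete: a standard ViT patch embedding samples pixels on a rigid grid, whereas a deformable patch embedding shifts each sampling point by a spatial offset before projection, so patches can adapt to object shape. The following is a minimal NumPy sketch of that idea, not the authors' implementation (AgileFormer learns the offsets end-to-end; here `offsets` is a placeholder array, and `bilinear_sample` and `deformable_patch_embed` are hypothetical helper names):

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly interpolate a 2D image at fractional coordinates (y, x),
    clamping to the image border."""
    H, W = img.shape
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def deformable_patch_embed(img, patch=4, embed_dim=8, offsets=None, seed=0):
    """Embed non-overlapping patches whose sampling grid is displaced by
    per-point (dy, dx) offsets. In AgileFormer these offsets are learned;
    zero offsets reduce this to rigid ViT patch embedding."""
    H, W = img.shape
    gh, gw = H // patch, W // patch
    if offsets is None:
        offsets = np.zeros((gh, gw, patch, patch, 2))
    rng = np.random.default_rng(seed)
    W_proj = rng.standard_normal((patch * patch, embed_dim)) * 0.1  # stand-in projection
    tokens = np.empty((gh * gw, embed_dim))
    for i in range(gh):
        for j in range(gw):
            vals = np.empty((patch, patch))
            for u in range(patch):
                for v in range(patch):
                    dy, dx = offsets[i, j, u, v]
                    vals[u, v] = bilinear_sample(img, i * patch + u + dy,
                                                 j * patch + v + dx)
            tokens[i * gw + j] = vals.ravel() @ W_proj
    return tokens
```

An 8×8 image with `patch=4` yields a 2×2 grid of 4 tokens; feeding non-zero offsets shifts where each token looks, which is what lets the embedding follow objects of varying shape and size.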
Published in: Biomedical Signal Processing and Control
Volume 112, Article 108842