Most 2D human pose estimation frameworks rely on static designs for multi-scale feature fusion, integrating information from different scales with fixed weights. A drawback of these approaches is that they often produce localization biases in complex scenarios. This paper addresses the problems of multi-scale feature mismatch and joint localization bias in pose estimation. From the perspective of feature processing, multi-scale weights must adapt to the size and position of joints, while joint predictions should respect human anatomical constraints. Existing methods lack effective dynamic adaptation, structural constraints, and bidirectional complementarity between high-level semantics and low-level details; they often mislocalize joints in occluded scenes, and their heatmap peaks frequently deviate from the true joint positions. Through theoretical analysis, we identify the causes of these performance gaps and propose directions for narrowing them. We propose Bidirectional Multi-Scale Collaborative Pose Estimation (BiMS-Pose), a framework that introduces dynamic weights to adjust feature proportions, establishes bidirectional topological constraints on joint relationships, and integrates a bidirectional attention flow. The framework filters key information along three dimensions, adjusts its filtering strategies in real time, and is further enhanced by heatmap optimization to improve localization accuracy. Extensive experiments on COCO, MPII, and our self-built Orchard Spraying Pose Dataset (OSPD) demonstrate the effectiveness of BiMS-Pose. In general scenarios, it achieves a 1.2 percentage-point increase in average precision (AP) on the COCO val2017 dataset over ViTPose with the same backbone.
In agricultural orchard spraying scenarios, it effectively handles interference factors such as illumination changes, occlusion, and varying shooting distances, achieving 75.4% AP and 90.7% PCKh@0.5 (head-normalized percentage of correct keypoints) on the OSPD dataset. It also sustains an average frame rate of 18.3 FPS on embedded devices, meeting real-time monitoring requirements. These results highlight the model’s potential for precise, stable, and practical human pose estimation in both general and agricultural application scenarios.
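To make the dynamic-weight fusion idea concrete: instead of combining multi-scale feature maps with fixed scalar weights, a small head can predict position-wise logits per scale, which a softmax turns into fusion weights that adapt to joint size and location. The sketch below is a minimal, hypothetical NumPy illustration of this general mechanism, not the authors' BiMS-Pose implementation; the shapes and the `weight_logits` head are assumptions.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fusion(features, weight_logits):
    """Fuse S same-resolution feature maps with per-position dynamic weights.

    features:      (S, C, H, W) feature maps already resized to a common scale.
    weight_logits: (S, H, W) logits, assumed to come from a small conv head
                   (hypothetical). Softmax over the scale axis S yields
                   position-wise weights, unlike static fusion where the
                   weights are fixed scalars shared across all positions.
    """
    w = softmax(weight_logits, axis=0)            # (S, H, W), sums to 1 over scales
    return (features * w[:, None, :, :]).sum(0)   # (C, H, W) fused map

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8, 4, 4))   # 3 scales, 8 channels, 4x4 grid
logits = rng.standard_normal((3, 4, 4))
fused = dynamic_fusion(feats, logits)
print(fused.shape)  # (8, 4, 4)
```

Because the weights vary per spatial position, large joints can draw more from coarse, semantic scales while small or occluded joints lean on fine-detail scales, which is the adaptation static fusion lacks.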