Introduction
Accurate identification of rice seedling age is essential for guiding precise field management and optimizing agronomic practices. However, traditional identification methods rely mainly on manual experience or simple visual cues and often lack robustness under complex field conditions such as illumination variation, background interference, and subtle morphological differences between adjacent growth stages. A reliable, automated method for fine-grained recognition of rice seedling stages is therefore needed.

Methods
To address this problem, this study proposes two deep learning models for automatic recognition of 13 rice seedling stages. The first model, Lresnet50, improves the baseline ResNet50 with three components: a Row-Prior Strip Attention (RPS) mechanism to enhance visual feature representation, a Feature Pyramid Network (FPN) for multi-scale feature extraction, and Dynamic Channel Pruning (DCP) to reduce redundant channels and improve computational efficiency. Building on this model, a multimodal framework named M-Lresnet50 integrates image features with temporal environmental data through a Long Short-Term Memory (LSTM) network, enabling cross-modal feature fusion and improving recognition of continuous seedling growth stages.

Results
Experimental results demonstrate that the proposed models achieve high accuracy in recognizing 13 rice seedling stages. The Lresnet50 model achieves an average classification accuracy of 97.70%, outperforming several existing convolutional neural network architectures and showing strong performance in transitional growth stages where morphological differences are subtle. By integrating visual features with temporal environmental information, the multimodal M-Lresnet50 further improves the accuracy to 98.33%. The model contains 27.656 million parameters with a computational complexity of 13.965 GFLOPs, indicating a good balance between recognition accuracy and computational cost.

Discussion
The results confirm the effectiveness of the proposed improvements and multimodal fusion strategy. The Row-Prior Strip Attention (RPS) enhances the model’s ability to focus on row-structured crop regions, while the Feature Pyramid Network (FPN) improves multi-scale feature representation. In addition, Dynamic Channel Pruning (DCP) reduces redundant channels and improves computational efficiency. The integration of temporal environmental information through the multimodal framework further enhances the robustness and consistency of seedling stage recognition. Overall, the proposed approach provides a practical solution for intelligent monitoring of rice seedling growth in greenhouse environments.
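
The image-plus-LSTM fusion described under Methods can be illustrated with a minimal PyTorch sketch. It is not the authors' M-Lresnet50: the abstract does not specify the fusion operator, the environmental variables, or the sequence length. This sketch therefore assumes simple feature concatenation, four hypothetical sensor readings per time step, and a tiny stand-in convolutional encoder in place of the improved ResNet50 backbone. The class name `FusionSketch` and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn


class FusionSketch(nn.Module):
    """Illustrative late-fusion classifier; NOT the authors' M-Lresnet50."""

    def __init__(self, num_classes=13, env_features=4, img_dim=64, env_dim=32):
        super().__init__()
        # Stand-in image encoder (assumption: the paper uses its improved
        # ResNet50 with RPS, FPN, and DCP at this point).
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, img_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # LSTM over the environmental series, shaped (batch, time, features).
        self.env_encoder = nn.LSTM(env_features, env_dim, batch_first=True)
        # Linear classifier over the concatenated features (assumed fusion).
        self.classifier = nn.Linear(img_dim + env_dim, num_classes)

    def forward(self, images, env_series):
        img_feat = self.image_encoder(images)           # (batch, img_dim)
        _, (h_n, _) = self.env_encoder(env_series)      # h_n: (1, batch, env_dim)
        fused = torch.cat([img_feat, h_n[-1]], dim=1)   # (batch, img_dim + env_dim)
        return self.classifier(fused)                   # (batch, num_classes)


model = FusionSketch()
images = torch.randn(2, 3, 64, 64)    # two dummy RGB images
env_series = torch.randn(2, 10, 4)    # ten time steps of four hypothetical readings
logits = model(images, env_series)
print(logits.shape)
```

The final hidden state of the LSTM summarizes the environmental history, so each image is classified together with the conditions that preceded it. Other fusion choices, such as attention or gating, would fit the same interface.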