Surveillance video systems have become indispensable in modern societies for monitoring human activity and detecting abnormal behavior in both public and private environments. This growing reliance on video data has increased the demand for intelligent, efficient Human Activity Recognition (HAR) methods that operate reliably in real time. However, existing HAR approaches face two persistent challenges: limited spatiotemporal modeling and high computational cost. Traditional handcrafted methods and two-dimensional convolutional neural network (2D CNN) based models offer fast processing but struggle to capture temporal dynamics, while more advanced architectures, such as hybrid convolutional neural network-recurrent neural network (CNN-RNN) models and Transformers, deliver stronger accuracy at the expense of greater complexity, larger training-data requirements, and substantial computational resources, which restricts their scalability and practical deployment. To address these challenges, this study presents an optimized three-dimensional convolutional neural network (3D CNN) designed to learn spatiotemporal representations directly from raw video clips. The architecture comprises three consecutive Conv3D blocks, each containing a 3D convolutional layer, batch normalization, 3D max-pooling, and dropout for stable learning and effective regularization. After the final convolutional block, a GlobalAveragePooling3D layer aggregates the spatiotemporal features, which are then passed to a fully connected layer with dropout for further abstraction; a final Dense layer produces the classification output. To improve learning efficiency and help the model generalize across diverse video patterns, Bayesian optimization is employed to automatically tune the architecture's key hyperparameters.
Evaluated on the UCF50 dataset, the proposed model achieves 89.67% test accuracy, outperforming several competitive architectures by 2–7%, and performing within 1–2% of more advanced transformer-based methods, while using significantly fewer parameters than convolutional neural network-gated recurrent unit (CNN-GRU) hybrids and transformer models.
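The block structure described in the abstract can be illustrated with a minimal, dependency-free sketch that traces tensor shapes through the pipeline: three Conv3D blocks (convolution with 'same' padding, batch normalization, 2×2×2 max-pooling, dropout), then GlobalAveragePooling3D, a fully connected layer, and a final Dense classifier. The filter counts, clip size, and dense width below are illustrative assumptions, not values from the paper; only the number of output classes (50, for UCF50) is taken from the text.

```python
# Hypothetical shape trace of the described 3D CNN. Filter counts, input
# clip size, and dense-layer width are illustrative assumptions; the
# abstract does not specify them. Only num_classes=50 (UCF50) is sourced.

def conv3d_block_shape(shape, filters, pool=2):
    """Shape after Conv3D (padding='same') + BatchNorm + MaxPool3D + Dropout.

    'same' convolution preserves the (frames, height, width) dims;
    2x2x2 max-pooling halves each of them; channels become `filters`.
    """
    frames, h, w, _ = shape
    return (frames // pool, h // pool, w // pool, filters)

def model_output_shapes(input_shape=(16, 64, 64, 3),   # assumed clip shape
                        filters=(32, 64, 128),         # assumed filter counts
                        dense_units=256,               # assumed dense width
                        num_classes=50):               # UCF50 action classes
    shapes = [input_shape]
    shape = input_shape
    for f in filters:                  # three consecutive Conv3D blocks
        shape = conv3d_block_shape(shape, f)
        shapes.append(shape)
    shapes.append((shape[-1],))        # GlobalAveragePooling3D -> (channels,)
    shapes.append((dense_units,))      # fully connected layer with dropout
    shapes.append((num_classes,))      # final Dense classification layer
    return shapes

for s in model_output_shapes():
    print(s)
```

Under these assumptions the spatiotemporal volume shrinks from (16, 64, 64, 3) to (2, 8, 8, 128) before global pooling collapses it to a 128-dimensional vector, which is what keeps the parameter count low relative to CNN-GRU and transformer baselines.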