Surveillance video systems have become indispensable in modern societies for monitoring human activity and detecting abnormal behavior in both public and private environments. This growing reliance on video data has increased the demand for intelligent, efficient Human Activity Recognition (HAR) methods that operate reliably in real time. However, existing HAR approaches face two persistent challenges: limited spatiotemporal modeling and high computational cost. Traditional handcrafted methods and two-dimensional convolutional neural network (2D CNN) based models offer fast processing but struggle to capture temporal dynamics, while more advanced architectures, such as hybrid convolutional neural network-recurrent neural network (CNN-RNN) models and Transformers, deliver stronger accuracy at the expense of greater complexity, larger training-data requirements, and substantial computational resources, which restricts their scalability and practical deployment. To address these challenges, this study presents an optimized three-dimensional convolutional neural network (3D CNN) designed to learn spatiotemporal representations directly from raw video clips. The architecture comprises three consecutive Conv3D blocks, each containing a 3D convolutional layer, batch normalization, 3D max-pooling, and dropout for stable learning and effective regularization. After the final convolutional block, a GlobalAveragePooling3D layer aggregates the spatiotemporal features, which are then passed to a fully connected layer with dropout for further abstraction; a final Dense layer produces the classification output. To improve learning efficiency and help the model generalize across diverse video patterns, Bayesian optimization is employed to automatically tune the architecture's key hyperparameters.
Evaluated on the UCF50 dataset, the proposed model achieves 89.67% test accuracy, outperforming several competitive architectures by 2–7%, and performing within 1–2% of more advanced transformer-based methods, while using significantly fewer parameters than convolutional neural network-gated recurrent unit (CNN-GRU) hybrids and transformer models.
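The block structure described in the abstract can be illustrated with a minimal, dependency-free sketch that traces tensor shapes through the pipeline: three Conv3D blocks (convolution with 'same' padding, batch normalization, 2×2×2 max-pooling, dropout), then GlobalAveragePooling3D, a fully connected layer, and a final Dense classifier. The filter counts, clip size, and dense width below are illustrative assumptions, not values from the paper; only the number of output classes (50, for UCF50) is taken from the text.

```python
# Hypothetical shape trace of the described 3D CNN. Filter counts, input
# clip size, and dense-layer width are illustrative assumptions; the
# abstract does not specify them. Only num_classes=50 (UCF50) is sourced.

def conv3d_block_shape(shape, filters, pool=2):
    """Shape after Conv3D (padding='same') + BatchNorm + MaxPool3D + Dropout.

    'same' convolution preserves the (frames, height, width) dims;
    2x2x2 max-pooling halves each of them; channels become `filters`.
    """
    frames, h, w, _ = shape
    return (frames // pool, h // pool, w // pool, filters)

def model_output_shapes(input_shape=(16, 64, 64, 3),   # assumed clip shape
                        filters=(32, 64, 128),         # assumed filter counts
                        dense_units=256,               # assumed dense width
                        num_classes=50):               # UCF50 action classes
    shapes = [input_shape]
    shape = input_shape
    for f in filters:                  # three consecutive Conv3D blocks
        shape = conv3d_block_shape(shape, f)
        shapes.append(shape)
    shapes.append((shape[-1],))        # GlobalAveragePooling3D -> (channels,)
    shapes.append((dense_units,))      # fully connected layer with dropout
    shapes.append((num_classes,))      # final Dense classification layer
    return shapes

for s in model_output_shapes():
    print(s)
```

Under these assumptions the spatiotemporal volume shrinks from (16, 64, 64, 3) to (2, 8, 8, 128) before global pooling collapses it to a 128-dimensional vector, which is what keeps the parameter count low relative to CNN-GRU and transformer baselines.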