The detection of fire and smoke in images and videos is essential for environmental monitoring and safety; however, the unpredictable nature of fire makes it a difficult task. Although traditional methods such as CNNs, LSTMs, and 3D-CNNs have made progress in fire detection, they frequently struggle to integrate spatial and temporal information from images and videos effectively. In this study, we introduce a novel method that combines Transformer attention mechanisms and Vision Transformers (ViTs) to improve the accuracy of fire and smoke detection in both images and videos. Our model employs ViTs to extract spatial features from images, leveraging their capacity to capture long-range dependencies that are essential for identifying fire and smoke. We use 3D-CNNs to extract spatiotemporal features from video sequences, while a Transformer encoder tracks the evolution of fire and smoke over time. Furthermore, we apply several enhancements to optimise the model's performance: improved temporal modelling, advanced self-attention mechanisms, and a multi-task learning framework that increases robustness by identifying multiple hazards, such as smoke, fire, and other threats. To improve the model's adaptability to dynamic environments, we incorporate sophisticated data augmentation techniques and optimise the model for real-time deployment on edge devices. To address the inherent class imbalance between fire and non-fire samples in existing datasets, we implement targeted data augmentation and class-weighted learning strategies, ensuring balanced training and improved generalisation. The model was evaluated on two well-known datasets: the NASA Space Apps Challenge Dataset and Kaggle's Fire Videos Dataset. Our method outperforms conventional approaches, achieving 99.2% accuracy on the NASA dataset and 98.3% on the Fire Videos dataset.
In contrast, ResNet50, VGG16, LSTM, 3D-CNN, and the hybrid ResNet50 + LSTM and VGG16 + 3D-CNN baselines achieved accuracies ranging from 85% to 94%. These findings show that our hybrid model, through its improved integration of spatial and temporal features, is a more effective solution for real-time fire and smoke detection in real-world settings.
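The class-weighted learning strategy mentioned above can be illustrated with a minimal sketch. The abstract does not specify the exact weighting scheme, so this assumes the common inverse-frequency formulation (weights proportional to the total sample count divided by the per-class count), with an illustrative 80/20 non-fire/fire split:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).

    Rare classes (e.g. fire frames) receive larger weights, so their
    loss terms count proportionally more during training.
    """
    classes, counts = np.unique(labels, return_counts=True)
    w = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

# Hypothetical imbalanced set: 80 non-fire (0) vs 20 fire (1) samples
labels = np.array([0] * 80 + [1] * 20)
weights = class_weights(labels)
# weights -> {0: 0.625, 1: 2.5}
```

These per-class weights would then scale each sample's contribution to the cross-entropy loss, which is one standard way to counter the fire/non-fire imbalance described in the abstract.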
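The attention mechanism at the core of both the ViT branch and the temporal Transformer encoder is standard scaled dot-product attention. A minimal NumPy sketch (not the paper's implementation; shapes and the random inputs are illustrative only) shows the computation that lets the model relate distant image patches or video frames:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V  --  one attention head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Illustrative shapes: 4 queries attending over 6 keys/values of width 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

In the ViT branch the queries/keys/values come from image-patch embeddings; in the temporal encoder they come from per-frame features, which is how the model tracks the evolution of fire and smoke across a sequence.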