The detection of fire and smoke in images and videos is essential for environmental monitoring and safety; however, the unpredictable nature of fire makes it a difficult task. Although traditional methods such as CNNs, LSTMs, and 3D-CNNs have made progress in fire detection, they frequently struggle to integrate spatial and temporal information from images and videos effectively. In this study, we introduce a novel method that combines Transformer attention mechanisms and Vision Transformers (ViTs) to improve the accuracy of fire and smoke detection in both images and videos. Our model employs ViTs to extract spatial features from images, leveraging their capacity to capture long-range dependencies that are essential for identifying fire and smoke. We use 3D-CNNs to extract spatiotemporal features from video sequences, while a Transformer encoder tracks the evolution of fire and smoke over time. Furthermore, we apply several enhancements to optimise the model's performance: improved temporal modelling, advanced self-attention mechanisms, and a multi-task learning framework that increases robustness by identifying multiple hazards, such as smoke, fire, and other threats. To improve the model's adaptability to dynamic environments, we incorporate sophisticated data augmentation techniques and optimise the model for real-time deployment on edge devices. To address the inherent class imbalance between fire and non-fire samples in existing datasets, we implement targeted data augmentation and class-weighted learning strategies, ensuring balanced training and improved generalisation. The model was evaluated on two well-known datasets: the NASA Space Apps Challenge Dataset and Kaggle's Fire Videos Dataset. Our method outperforms conventional approaches, achieving 99.2% accuracy on the NASA dataset and 98.3% on the Fire Videos dataset.
In contrast, ResNet50, VGG16, LSTM, 3D-CNN, and the hybrid ResNet50 + LSTM and VGG16 + 3D-CNN baselines achieved accuracies ranging from 85% to 94%. These findings show that our hybrid model, through its improved integration of spatial and temporal features, is a more effective solution for real-time fire and smoke detection in real-world settings.
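The class-weighted learning strategy mentioned above can be illustrated with a minimal sketch. The abstract does not specify the exact weighting scheme, so this assumes the common inverse-frequency formulation (weights proportional to the total sample count divided by the per-class count), with an illustrative 80/20 non-fire/fire split:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * count_c).

    Rare classes (e.g. fire frames) receive larger weights, so their
    loss terms count proportionally more during training.
    """
    classes, counts = np.unique(labels, return_counts=True)
    w = counts.sum() / (len(classes) * counts)
    return dict(zip(classes.tolist(), w.tolist()))

# Hypothetical imbalanced set: 80 non-fire (0) vs 20 fire (1) samples
labels = np.array([0] * 80 + [1] * 20)
weights = class_weights(labels)
# weights -> {0: 0.625, 1: 2.5}
```

These per-class weights would then scale each sample's contribution to the cross-entropy loss, which is one standard way to counter the fire/non-fire imbalance described in the abstract.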
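The attention mechanism at the core of both the ViT branch and the temporal Transformer encoder is standard scaled dot-product attention. A minimal NumPy sketch (not the paper's implementation; shapes and the random inputs are illustrative only) shows the computation that lets the model relate distant image patches or video frames:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V  --  one attention head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

# Illustrative shapes: 4 queries attending over 6 keys/values of width 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(6, 8))
V = rng.normal(size=(6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

In the ViT branch the queries/keys/values come from image-patch embeddings; in the temporal encoder they come from per-frame features, which is how the model tracks the evolution of fire and smoke across a sequence.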