Multimodal sentiment analysis (MSA) has emerged as one of the most dynamic and rapidly advancing areas within artificial intelligence. By combining audio, visual, and textual data, it yields a richer understanding of emotions expressed in online communication. Unlike unimodal sentiment analysis, which frequently misses cues such as sarcasm or cross-cultural emotional signals, MSA employs more sophisticated methods to address these shortcomings, including attention mechanisms, hierarchical fusion, and transformer-based architectures. This study presents a critical assessment of 58 studies published between 2010 and 2025, following the PRISMA methodology to limit the risk of biased literature selection. The main topics covered are fusion techniques (early, late, and hybrid), advanced feature extraction approaches, and benchmark datasets (e.g., CMU-MOSEI, MELD, MOSEAS). Problems discussed in detail include high computational complexity, poor cross-modal synchronization, dataset bias, and a lack of real-time applications. The review also notes a shortage of interdisciplinary work and of common ground between psychological theories and AI models. Applications in healthcare, education, human–computer interaction (HCI), and mood monitoring demonstrate the real-world applicability of MSA. Finally, the study identifies key research gaps and suggests future directions toward multimodal systems that are scalable, culturally sensitive, and ethically responsible, and that can operate in multilingual and dynamic environments.