Objective Multimodal emotion recognition aims to infer the emotional state of a subject by deeply fusing data from multiple modalities such as text, vision, and audio. However, the inherent heterogeneity in the representation forms and distributions of different modalities means that emotional semantics in the latent feature space are often entangled with modality-specific, non-emotional noise. This feature entanglement not only hinders the model from learning the key emotional features but also limits the interpretability and generalization of its decisions. Moreover, existing feature fusion strategies mostly rely on simple concatenation or coarse-grained attention mechanisms, which struggle to capture fine-grained cross-modal emotional interaction cues in complex contexts, so the fused emotional representation lacks sufficient discriminability. To address these issues, an interpretable invertible disentanglement and adaptive fusion method for multimodal emotion recognition is proposed.
Method First, to reduce the loss of semantic information during feature learning and achieve structured feature disentanglement, an invertible attention mask-based disentanglement (IAMD) module is designed. Built on invertible neural networks (INNs), it constructs a bidirectional invertible mapping between each modality's latent features and the emotional semantic factors, and combines an attention mask mechanism to disentangle the latent features along the channel dimension into two parts: one capturing shared features with cross-modal semantic consistency, the other retaining specific features that carry each modality's unique attributes.
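The IAMD idea described above can be sketched in PyTorch. The sketch below is illustrative only: it pairs a standard affine coupling block (a common invertible building block, assumed here as a stand-in for the paper's actual INN design) with a soft channel-attention mask that splits the invertible latent into shared and specific parts; all layer sizes and class names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible affine coupling block (RealNVP-style stand-in for the INN)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(torch.tanh(s)) + t   # bijective in x2, so no information is lost
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(s))
        return torch.cat([y1, x2], dim=-1)

class IAMDSketch(nn.Module):
    """Hypothetical IAMD-style module: invertible mapping + channel attention mask."""
    def __init__(self, dim):
        super().__init__()
        self.flow = AffineCoupling(dim)
        self.mask_net = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x):
        z = self.flow(x)                        # invertible latent representation
        m = self.mask_net(z)                    # soft channel mask in (0, 1)
        shared, specific = m * z, (1 - m) * z   # shared + specific recomposes z
        return z, shared, specific
```

Because the coupling block is bijective, the original modality features can be recovered exactly from the latent, which is the property the paper leverages to avoid semantic information loss during disentanglement.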
Second, to further strengthen the disentanglement from an information-theoretic perspective, a mutual information constraint (MIC) mechanism is constructed. The semantic consistency of emotional features in the shared subspace is enhanced by maximizing the mutual information between the shared features of different modalities, and between the shared features and the emotion labels. Meanwhile, minimizing the mutual information between specific features and emotion labels, conditioned on the shared features, constrains the model to strip modality-specific attributes that are irrelevant to the emotion task into the specific feature subspace, reducing the interference of redundant modality noise with the emotional semantics. Finally, to address insufficient interaction during feature fusion, a semantic-guided adaptive feature fusion (SGAFF) module is designed. This module uses the cross-modal consistent emotional semantics captured in the shared subspace as contextual cues to apply fine-grained semantic correction and guidance to the modality-specific features through residual connections, and builds a dual-branch prediction structure in which a gating mechanism adaptively weights the shared branch and the specific-guided branch, enhancing the discriminability of the fused representation.
Result Extensive comparative experiments and ablation studies were conducted on the CMU-MOSI, CMU-MOSEI, and UR-FUNNY datasets. On CMU-MOSI, the model improved mean absolute error (MAE) and 7-class accuracy (Acc-7) by 2.4% and 2.9%, respectively, over the DLF model. On CMU-MOSEI, it improved MAE and the Pearson correlation coefficient (Corr) by 2.6% and 1.7%, respectively, over the TMBL model. On UR-FUNNY, it improved the F1-score (F1) by 5.6% over the MISA model.
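The SGAFF fusion described above (residual semantic guidance plus a gated dual-branch prediction) can be sketched as follows. This is a minimal illustration under assumed dimensions and module names, not the paper's actual implementation; the guidance network, gate, and prediction heads are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SGAFFSketch(nn.Module):
    """Hypothetical SGAFF-style fusion: shared semantics refine the specific
    features via a residual correction, then a scalar gate adaptively mixes
    the shared branch and the specific-guided branch."""
    def __init__(self, dim, num_out=1):
        super().__init__()
        self.guide = nn.Linear(2 * dim, dim)                     # semantic correction term
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.head_shared = nn.Linear(dim, num_out)               # shared branch predictor
        self.head_specific = nn.Linear(dim, num_out)             # specific-guided predictor

    def forward(self, shared, specific):
        # Residual guidance: shared context corrects the specific features.
        guided = specific + self.guide(torch.cat([shared, specific], dim=-1))
        # Gate in (0, 1) assigns adaptive weights to the two branches.
        g = self.gate(torch.cat([shared, guided], dim=-1))
        return g * self.head_shared(shared) + (1 - g) * self.head_specific(guided)
```

The residual connection keeps the specific features intact while adding a learned correction, and the sigmoid gate lets the network shift decision weight between branches per sample, matching the adaptive weighting behavior the abstract describes.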
In addition, detailed ablation experiments verified that the IAMD, MIC, and SGAFF modules are each necessary for the performance gains. Feature visualization based on t-distributed stochastic neighbor embedding confirmed that the model effectively separates emotional semantics from modality noise in the latent space. Visualization of the fusion weights further showed that the model adaptively assigns a larger decision contribution to the specific-guided branch, verifying the critical role of fine-grained complementary cues in the final emotion judgment.
Conclusion The proposed method achieves interpretable disentanglement of emotional semantics from modality-specific noise through invertible neural networks and mutual information constraints; at the same time, the semantic-guided adaptive fusion strategy enables deep, fine-grained interaction between cross-modal emotional semantic features, improving the accuracy and robustness of multimodal emotion recognition in complex scenarios. Although the method markedly improves performance and interpretability, the invertible transformations and multiple mutual information constraints increase its computational complexity. The method is suited to multimodal scenarios with strong modality heterogeneity, as well as tasks with strict quantitative requirements on emotion recognition metrics. Future work will focus on lightweight disentanglement and fusion mechanisms to further improve inference efficiency and generalization in scenarios with limited computational resources.