Recent advancements in speech emotion recognition (SER) have centered primarily on selecting effective features from acoustic data. This study introduces a novel SER algorithm that operates on raw speech data to enhance recognition accuracy, eliminating the need for manually selected acoustic features. Our approach integrates a Residual Convolutional Neural Network (R-CNN) that detects emotions directly from raw speech signals with a Conformer Transformer that captures long-range dependencies and temporal features in speech. The R-CNN processes the raw audio and extracts emotional cues for accurate classification, capturing subtle emotion-driven nuances that methods relying on pre-selected features may overlook. In parallel, the Conformer Transformer learns complex representations of the emotional content, and Long Short-Term Memory (LSTM) layers model the sequential nature of the speech signal, further enhancing the emotion recognition process. Evaluated on three public datasets spanning multiple languages, the proposed model demonstrates a notable improvement in accuracy and interpretability by leveraging both emotional and temporal information. These results highlight the benefits of a multi-model framework that combines deep learning architectures, pushing the boundaries of affective computing through a more holistic understanding of speech data.
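To make the described architecture concrete, the following is a minimal PyTorch sketch of such a multi-branch pipeline, assuming a residual 1-D CNN front-end over the raw waveform, a standard Transformer encoder standing in for the Conformer blocks, and an LSTM for sequential modelling. The class name `MultiBranchSER`, all layer sizes, and the four-emotion output are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class MultiBranchSER(nn.Module):
    """Hypothetical sketch: residual CNN over raw audio, a Transformer
    encoder as a stand-in for the Conformer, then an LSTM. Hyperparameters
    are illustrative, not taken from the paper."""
    def __init__(self, n_emotions: int = 4, d_model: int = 64):
        super().__init__()
        # CNN front-end on the raw waveform: (batch, 1, samples) -> frames
        self.conv1 = nn.Conv1d(1, d_model, kernel_size=80, stride=16)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # Transformer encoder approximating the Conformer's role of
        # capturing long-range dependencies across frames
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # LSTM branch for the sequential structure of speech
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_emotions)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) of raw audio, no hand-crafted features
        x = torch.relu(self.conv1(wav.unsqueeze(1)))
        x = x + torch.relu(self.conv2(x))       # residual connection
        x = x.transpose(1, 2)                   # (batch, frames, d_model)
        x = self.encoder(x)                     # long-range context
        x, _ = self.lstm(x)                     # temporal modelling
        return self.head(x.mean(dim=1))         # pooled emotion logits

model = MultiBranchSER()
logits = model(torch.randn(2, 16000))           # two 1-second clips at 16 kHz
print(logits.shape)                             # (2, 4): one logit per emotion
```

In a real training setup these logits would be passed through a cross-entropy loss against the dataset's emotion labels; mean-pooling over frames is the simplest aggregation choice and could be replaced by attention pooling.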
Published in: Technix International Journal for Engineering Research
Volume 13, Issue 3