Speech Emotion Recognition using Attention-based LSTM-Network with Residual Connection
Title: Attention-Enhanced LSTM with Residual Connections for Speech Emotion Recognition
Abstract
Speech emotion recognition plays a pivotal role in contemporary human-computer interaction. Yet, the widespread adoption of current state-of-the-art methods is often hindered by their reliance on massive pretrained models that demand significant computational resources and memory. To address this limitation, we introduce ResLSTM-SA, a streamlined architecture that combines residual connections with soft attention mechanisms within an LSTM framework.
We evaluate the model on the RAVDESS dataset using a speaker-independent split, so that no speaker appears in both the training and test sets. ResLSTM-SA surpasses standard attention-based LSTM baselines, as well as various CNN and hybrid CNN-LSTM models, in unweighted average recall (UAR). The top-performing configuration, ResLSTM-SA-h64, reaches a peak UAR of 0.6517 with only 46,800 trainable parameters. The model thus achieves competitive accuracy with roughly three orders of magnitude fewer parameters than large-scale self-supervised counterparts, which makes it suitable for deployment on edge devices and in real-time voice assistants. The source code is available at https://github.com/Mak-Sim/ResLSTM-SER.
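The evaluation protocol named above can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the `(speaker_id, features, label)` tuple layout, and the toy labels are all assumptions made for illustration. It shows a split in which no speaker is shared between training and test data, and UAR computed as the plain mean of per-class recalls.

```python
from collections import defaultdict


def speaker_independent_split(samples, test_speakers):
    """Split samples so that no speaker appears in both sets.

    `samples` is assumed to be a list of (speaker_id, features, label)
    tuples; this layout is hypothetical, not the authors' data format.
    """
    test_speakers = set(test_speakers)
    train = [s for s in samples if s[0] not in test_speakers]
    test = [s for s in samples if s[0] in test_speakers]
    return train, test


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; each class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy samples with made-up speakers and labels (not RAVDESS data).
samples = [
    (1, None, "angry"), (1, None, "sad"),
    (2, None, "angry"), (2, None, "sad"),
    (3, None, "angry"), (3, None, "sad"),
]
train, test = speaker_independent_split(samples, test_speakers=[3])

# Toy predictions: recall is 3/4 for "angry" and 1/2 for "sad".
y_true = ["angry", "angry", "angry", "angry", "sad", "sad"]
y_pred = ["angry", "angry", "angry", "sad", "sad", "angry"]
uar = unweighted_average_recall(y_true, y_pred)
```

Because UAR averages recall over classes rather than over samples, a model cannot score well simply by favoring the most frequent emotion, which is why it is commonly preferred over plain accuracy when class sizes differ.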
Source: arXiv



