Investigating Salient Representations and Label Variance Modeling in Dimensional Speech Emotion Analysis
Authors: Vikramjit Mitra, Jingping Nie, Erdrin Azemi
Representations from models such as Bidirectional Encoder Representations from Transformers (BERT) and Hidden-Unit BERT (HuBERT) have helped achieve state-of-the-art performance in dimensional speech emotion recognition. Both HuBERT and BERT generate fairly high-dimensional representations, and neither model was trained with the emotion recognition task in mind. These high-dimensional representations lead to speech emotion models with large parameter counts, which in turn incur both memory and computational costs. In this work, we investigate selecting representations based on their task saliency, which may reduce model complexity without sacrificing dimensional emotion estimation performance. In addition, we investigate modeling label uncertainty in the form of grader opinion variance, and demonstrate that such information can improve the model's generalization capacity and robustness. Finally, we analyze the robustness of the speech emotion model against acoustic degradation and observe that selecting salient representations from pre-trained models and modeling label uncertainty improve the model's generalization to unseen data containing acoustic distortions in the form of environmental noise and reverberation.
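The abstract does not spell out how saliency-based selection or grader-variance modeling might be implemented. The sketch below is a minimal illustration, not the paper's recipe: it assumes per-utterance pooled embeddings (e.g., mean-pooled HuBERT features), per-utterance mean grader scores, and per-utterance grader variances, and it uses an illustrative correlation-based saliency ranking plus a concordance correlation coefficient (CCC) loss augmented with a Gaussian negative log-likelihood term. Names such as select_salient_dims and VarianceAwareHead are hypothetical.

import torch
import torch.nn as nn

def ccc_loss(pred, target, eps=1e-8):
    # 1 - concordance correlation coefficient (CCC), a standard objective for
    # dimensional emotion regression (activation / valence / dominance).
    pred_mean, tgt_mean = pred.mean(), target.mean()
    pred_var, tgt_var = pred.var(unbiased=False), target.var(unbiased=False)
    cov = ((pred - pred_mean) * (target - tgt_mean)).mean()
    ccc = 2.0 * cov / (pred_var + tgt_var + (pred_mean - tgt_mean) ** 2 + eps)
    return 1.0 - ccc

def select_salient_dims(embeddings, targets, k=256):
    # Toy saliency criterion: rank embedding dimensions by absolute Pearson
    # correlation with the emotion score and keep the top-k. The paper's
    # saliency measure may differ; this only illustrates pruning non-salient
    # dimensions of a large pre-trained representation.
    emb = embeddings - embeddings.mean(dim=0, keepdim=True)
    tgt = targets - targets.mean()
    corr = (emb * tgt.unsqueeze(1)).mean(dim=0) / (
        emb.std(dim=0, unbiased=False) * tgt.std(unbiased=False) + 1e-8)
    return torch.topk(corr.abs(), k).indices

class VarianceAwareHead(nn.Module):
    # Small regressor over the selected dimensions that predicts both the mean
    # grader score and (the log of) its spread across graders.
    def __init__(self, in_dim):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU())
        self.mean_head = nn.Linear(128, 1)
        self.logvar_head = nn.Linear(128, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mean_head(h).squeeze(-1), self.logvar_head(h).squeeze(-1)

def label_variance_loss(pred_mean, pred_logvar, grader_mean, grader_var):
    # One plausible way to use grader opinion variance: a CCC term on the mean
    # score plus a Gaussian negative log-likelihood, with an extra penalty that
    # nudges the predicted spread toward the observed grader variance.
    nll = 0.5 * (pred_logvar + (grader_mean - pred_mean) ** 2 / pred_logvar.exp())
    var_match = (pred_logvar.exp() - grader_var).abs()
    return ccc_loss(pred_mean, grader_mean) + nll.mean() + 0.1 * var_match.mean()

In a training loop, one would pool frame-level HuBERT embeddings per utterance, run select_salient_dims once on a development set to fix the retained indices, index the embeddings accordingly, and train VarianceAwareHead with label_variance_loss.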