Speaker Awareness for Speech Emotion Recognition

The idea of recognizing human emotion through speech (SER) has recently received considerable attention from the research community, mostly due to the current machine learning trend. Nevertheless, even the most successful methods are still rather lacking in terms of adaptation to specific speakers and scenarios, which reduces their performance when compared to humans. In this paper, we evaluate a large-scale machine learning model for classification of emotional states. This model has been trained for speaker identification but is instead used here as a front-end for extracting robust features from emotional speech. We aim to verify that SER improves when a speaker's emotional prosody cues are considered. Experiments using various state-of-the-art classifiers are carried out, using the Weka software, so as to evaluate the robustness of the extracted features. Considerable improvement is observed when comparing our results with other SER state-of-the-art techniques.
Keywords: Speech emotion recognition, machine learning, CNN, VGG


Introduction
Emotion and its expression undoubtedly govern many aspects of human interaction. It is self-evident that the emotional phenomena experienced by a person should tend to mold their behavior and conversational register in the social settings they engage with. Effectively, by adulthood most humans will have developed a large set of highly nature/nurture dependent, distinct behavioral responses to the multiple emotional states they experience throughout their personal lives. An emulated corroboration of this notion can be inferred from the specific articulation patterns observed in professional actors simulating emotion [1]. Further support is given by the widely accepted effects of recurrent stress on an individual's emotional state [2], [3], which naturally affect their prosody.
The role of someone's personality, and consequently their way of communicating, is often overlooked when it comes to emotion recognition from speech. In fact, even though state-of-the-art machine learning models are exceptionally competent at evaluating data with an unspecified set of relevant features, most designs tend to focus directly and solely on overt emotional cues. As of now, this should be considered an erroneous approach given the observed higher performances of systems with deeper adaptation levels, such as domain-based [4], [5] or context-based [6], [7]. Hence, more information channels should be considered when analyzing emotion in speech, one of which is the potential speaker dependency of human emotional prosody.
Empirical evidence of prosody variations for identical emotional states in different people is of relevance to interactive systems and other social robotic applications. Provided a machine can identify an intervening speaker, the ability to further adapt itself not only to said speaker but also to their emotional conveyance mannerisms can boost the quality of the system's behavior and the suitability of its response to the situation at hand. This concept is not unlike how pet social companions are able to perceive how their owners feel and modify their behavior to conform with the respective emotional state. Emulation of this in machines would be highly useful.
We introduce our approach to speech emotion recognition (SER), based on the use of the speaker recognition CNN model VGGVox [8], for feature extraction from 6 standard and established emotional speech databases, with minimal preprocessing. The application of the features extracted using our technique, in state-of-the-art classifiers, confirmed that speaker specific features extracted from speech are robust enough to allow for clear classification of emotional states. Moreover, our technique's performance was shown to surpass that of other state-of-the-art methods.
This paper is divided in the following manner: section II provides an overview of recent related work while section III outlines the methodology of our approach. This is followed by section IV where detail is given about the experiments carried out and the obtained supporting results are discussed, and finally section V where a conclusion and overview of future work are presented.

Related Work
Given the necessity of evaluating a panoply of informational cues embedded in speech, added to the already complex task of considering as many vocal features as possible, most classical recognition systems based on speech have seen their performance greatly surpassed by machine learning models. As such, these architectures have been used as baselines in the performance evaluation of new emotion recognition techniques which use representations learned from other paralinguistic tasks.
Gideon et al. [9] assessed the effectiveness of progressive neural networks (ProgNets) at freezing the weights of a model's initial layers, tuned for speaker recognition and gender detection from speech, and using these transitional representations as input to the posterior layers, trained for emotion recognition. Performance rates were somewhat higher than those of standard DNN or simple pre-training and fine-tuning (PT/FT) networks. On a separate note, Sidorov et al. [10] explored the effects of adding speaker-specific and gender information as features in the vectors used to train emotion recognition models, essentially further detailing the datafiles in one experiment. In parallel, the group predicted speaker and gender information with ANN-based recognizers, adding the obtained hypotheses to the feature sets fed into the emotion recognizer. This plain method of extending the feature vector with additional speaker-specific information was found to improve emotion recognition performance in both experiments. It was also found that including more specific speaker information besides gender in the feature vectors yielded better results.
Our work is relevant in the sense that it performs only minimal preprocessing on the raw data fed to the network. Moreover, instead of merely relying on the participation of actors, the transferred learning related to speaker recognition comes in the form of feature matrices generated directly by a large-scale model trained with utterances from hundreds of persons with different ethnicities, accents, professions and ages.

Methodology
In this section we provide an outline of the speech corpora used. Following that, detail is given on the applied VGGVox model, and on how feature matrices were extracted from it when fed the data from the emotional speech databases.

Emotion speech databases
For this work, a set of 6 emotional speech databases was gathered, totaling over 9000 utterances of varying duration in 8 different languages, and portraying 9 different emotional states, to be applied in a speaker recognition model for feature extraction. The set of clips from the databases was reduced to only include clips corresponding to anger, disgust, fear, happiness, sadness, surprise and the neutral state, which were common to all databases. The list of databases is shown in Table 1. All files were converted to the WAV format, at a sampling rate of 16 kHz, as this value has been proved to be more than enough to capture all information embedded in a speech signal. In accordance with the VGGVox model's implementation, and in order to take full advantage of all the provided audio, files were adapted to be 1 to 10 seconds in length as well. Therefore, a small number of clips below the 1 second mark were disregarded, as these would hardly provide any emotional information, and clips above the 10 second mark were divided into equally long audio segments.
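The duration rule described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_clip` is hypothetical, and since the text says only that long clips were divided into equally long segments, the choice of the fewest segments that each fit within 10 seconds is our assumption rather than a documented detail.

```python
import math

MIN_SECONDS = 1.0   # clips shorter than this are discarded
MAX_SECONDS = 10.0  # clips longer than this are split


def split_clip(num_samples, sample_rate=16000):
    """Return (start, end) sample ranges to keep for one clip.

    Clips under 1 s yield no ranges, clips of 1 to 10 s are kept whole,
    and longer clips are cut into the fewest equal-length segments that
    each last at most 10 s (an assumed splitting policy).
    """
    duration = num_samples / sample_rate
    if duration < MIN_SECONDS:
        return []
    parts = max(1, math.ceil(duration / MAX_SECONDS))
    # Evenly spaced boundaries, rounded to whole samples.
    bounds = [round(i * num_samples / parts) for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))
```

For instance, at 16 kHz a 20-second clip (320000 samples) would be returned as two 10-second ranges, while a half-second clip would be dropped.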

The VGGVox model
This model, developed by Nagrani et al. [8], is based on a VGG-M architecture composed of 12 layers and is fed raw data which undergoes minimal processing. Narrowband magnitude spectrograms are generated using a sliding Hamming window of width 25 ms and step 10 ms, meaning an n-second input will provide a spectrum of 100n frames. Mean and variance normalization is also performed at every frequency bin of the spectrum, as it was observed that such a step produced an increase of 10% in classification accuracy. No other operations are performed on the input data, so the CNN is fed essentially raw spectrograms.
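To make the framing arithmetic and the normalization step concrete, a minimal sketch is given below. The function names are hypothetical, the spectrogram is assumed to be stored as one row per frequency bin, and the small `eps` guard against division by zero is our addition. Note that counting only complete windows without padding gives slightly fewer than 100n frames; the exact count depends on the padding scheme, which is not specified here.

```python
from statistics import fmean, pstdev


def num_frames(num_samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Count the complete analysis windows that fit in a clip (no padding)."""
    win = sample_rate * win_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000  # 160 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


def normalize_per_bin(spectrogram, eps=1e-8):
    """Normalize each frequency bin to zero mean and unit variance.

    `spectrogram` is a list of rows, one per frequency bin, each row
    holding that bin's magnitude across time frames (assumed layout).
    """
    normalized = []
    for row in spectrogram:
        mean = fmean(row)
        std = pstdev(row)
        # eps keeps constant (zero-variance) bins from dividing by zero.
        normalized.append([(value - mean) / (std + eps) for value in row])
    return normalized
```

With these settings, a 3-second clip at 16 kHz yields 298 complete windows, close to the nominal 300 frames.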
Variable-length inputs are also efficiently dealt with by varying the support filter dimension of the apool6 layer. As such, the implementation is adaptable to an audio clip's duration, provided it is between 1 and 10 seconds in length, according to Table 2. The dimension values conform to the stride and padding methods used by the model for each duration value. It should be noted that the model does handle clips longer than 10 seconds by considering only the central 10-second segment of the clip, at the cost of losing all the other potentially relevant surrounding information.
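The central cropping behavior described for clips longer than 10 seconds can be illustrated with a short sketch. The function name `central_segment` is hypothetical and this is not the model's actual code, only an assumed equivalent operating on a sequence of samples.

```python
def central_segment(samples, sample_rate=16000, max_seconds=10):
    """Keep only the central `max_seconds` of a clip.

    Clips that are already short enough are returned unchanged.
    """
    max_len = sample_rate * max_seconds
    if len(samples) <= max_len:
        return samples
    # Drop an equal amount (up to one sample) from each end.
    start = (len(samples) - max_len) // 2
    return samples[start:start + max_len]
```

This is the loss of surrounding audio that the segmentation of long clips in our database preparation was meant to avoid.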
In terms of purpose, the model was directed towards speaker classification and trained using the VoxCeleb1 dataset [8], also developed by Nagrani and her team. This dataset is of large scale, including over 100,000 utterances by 7000+ speakers of varied backgrounds, resulting in more than 2000 hours of audio. Consequently, the model is an ideal candidate for capturing copious amounts of speaker-specific cues and prosody mannerisms from any type of human speech, emotional speech included. Training iterations also included batch normalization [17] and used the default hyperparameter values of the MatConvNet toolbox [18].

Feature extraction
Feature arrays were obtained from the output of the apool6 layer of the VGGVox model, corresponding to its bottleneck. This layer was chosen as an ideal middle point of speaker adaptation, meaning the extracted features would suffer from neither under-specialization nor over-specialization. The model was simply applied to the audio clips, without any form of processing other than that already specified, in order to obtain these feature arrays which, given their origin, had a dimension of 1x1x4096.

Experimental Results
Several experiments were carried out in order to evaluate the robustness and efficacy of the extracted feature arrays in terms of emotion recognition. The Weka software [19] was employed so as to apply the feature arrays to the following state-of-the-art classifiers: Naive Bayes [20], kNN [21], Random Forest [22], Logistic Model Tree (LMT) [23] and Support Vector Machine (SVM) [24]. A neural network-based approach was not followed during the classification stage, given that the amount of data available was not enough to credibly train such a model. In this section, we provide more detail on the experiments carried out and the results obtained, as well as a discussion and comparison of these to other state-of-the-art techniques.
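As an illustration of how flattened feature arrays could be handed to Weka, the sketch below serializes them in ARFF, Weka's native text format. This is an assumption about the workflow rather than a description of the exact files used; the function name `to_arff`, the relation name and the attribute names `f0`, `f1`, ... are all hypothetical.

```python
EMOTIONS = ["anger", "disgust", "fear", "happiness",
            "sadness", "surprise", "neutral"]


def to_arff(features, labels, relation="ser_features"):
    """Serialize feature vectors and emotion labels as ARFF text."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    dim = len(features[0])
    lines = [f"@RELATION {relation}", ""]
    # One numeric attribute per feature dimension.
    lines += [f"@ATTRIBUTE f{i} NUMERIC" for i in range(dim)]
    # The class is a nominal attribute listing the allowed emotions.
    lines.append("@ATTRIBUTE emotion {" + ",".join(EMOTIONS) + "}")
    lines += ["", "@DATA"]
    for vector, label in zip(features, labels):
        if len(vector) != dim or label not in EMOTIONS:
            raise ValueError("inconsistent vector length or unknown label")
        lines.append(",".join(f"{v:.6g}" for v in vector) + "," + label)
    return "\n".join(lines) + "\n"
```

In practice each 1x1x4096 array would first be flattened into a 4096-element vector, giving one data row per audio clip.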

Classifier performance
The Naive Bayes classifier was used as a mere baseline for evaluation against the rest of the classifiers in the Weka software, when fed the provided feature arrays for emotion recognition. Performance results in terms of accuracy were obtained using 5-fold cross validation, on each database individually. Furthermore, the k-statistic [25] was also calculated to further support the validity of the obtained results against random chance, in parallel with unweighted average recall (UAR), a favored metric in emotion recognition systems which attributes the same significance to all possible classes [26]. All these results are shown in Table 3, with the highlighted cells corresponding to the best performance results used for comparison later on.
Table 3. State-of-the-art classifier performance on standalone emotional database 1x1x4096 feature arrays. Code: A=classifier accuracy (percentage), B=k-statistic and C=UAR.
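For reference, the three reported metrics can be computed from a confusion matrix as sketched below. We assume here that the k-statistic corresponds to Cohen's kappa, as reported by Weka; the function name `metrics` is hypothetical.

```python
def metrics(confusion):
    """Return (accuracy, kappa, uar) from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    accuracy = correct / total
    # Agreement expected by chance, derived from the marginal totals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # UAR: plain mean of per-class recalls, so each class weighs the same.
    recalls = [confusion[i][i] / row_sums[i] for i in range(n) if row_sums[i] > 0]
    uar = sum(recalls) / len(recalls)
    return accuracy, kappa, uar
```

On the balanced matrix [[8, 2], [1, 9]] this gives an accuracy of 0.85, a kappa of 0.7 and a UAR of 0.85. On the imbalanced matrix [[90, 10], [5, 5]] the UAR drops to 0.7 while accuracy stays above 0.86, which is why UAR is preferred when class sizes differ.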

Discussion
Results clearly vary from database to database. This suggests that, even though database size must also be taken into consideration, emotional prosody is affected differently in each population due to language and cultural diversity. As such, adaptation to cultural background is likely an additional approach worth researching in order to improve SER systems.
Altogether, the obtained results always surpassed the proposed baseline, with LMT giving the best results (see highlighted column). This results column was therefore used for comparison against other state-of-the-art techniques, whose performances are shown in Table 4. Here it is possible to verify that our technique almost always produced better results on the databases. As such, the efficacy of our proposed method for speech emotion recognition is affirmed, having surpassed other state-of-the-art techniques. Finally, our observations support the existence of relevant emotional information in speaker-specific speech features. Speaker adaptation should therefore be performed in systems aiming for successful SER.

Conclusion
In this paper, we examined the robustness of speech features extracted using a large-scale speaker recognition model, for emotion recognition. We determined that, regardless of language, there is valuable emotional information embedded within speaker-specific features. Acceptable but varying performance rates were obtained on standalone databases of different languages. This suggests varying degrees of emotional prosody mannerism for different cultural backgrounds. Finally, based on a general observation of the results, we can conclude that an initial step of speaker adaptation is of paramount importance and should be performed in any SER system in order to achieve higher accuracy rates.
In the future, we intend to assess the efficacy of dimension reduction techniques such as PCA or LDA, and delve deeper into adaptable emotion recognition, by considering additional speaker information, such as cultural background, and incorporating facial expression analysis into a multi-modal emotion recognition system.