An Analysis of the Impact of Spectral Contrast Feature in Speech Emotion Recognition

Abstract — Feature extraction is an integral part of speech emotion recognition. Some emotions become indistinguishable from others due to the high resemblance of their features, which results in low prediction accuracy. This paper analyses the impact of the spectral contrast feature in increasing the accuracy for such emotions. The RAVDESS dataset has been chosen for this study. The SAVEE, CREMA-D, and JL corpus datasets were also used to test its performance over different English accents. In addition, the EmoDB dataset has been used to study its performance in the German language. The spectral contrast feature has increased the prediction accuracy of speech emotion recognition systems to a good degree, as it performs well in distinguishing emotions with significant differences in arousal levels, and this is discussed in detail in this paper.


Introduction
Speech Emotion Recognition (SER) is a progressive area of study, and it plays a remarkable role in applications such as telecommunications, lie detection, and the Human-Computer Interface (HCI). The interaction between humans and machines is the most significant application, and intensive research in this area has been going on for several years. Displaying emotions through speech is one of the most distinctive and natural characteristics of humans. SER can be defined as the extraction of the speaker's emotional state from his or her speech signal. Feature extraction, feature selection, and classification are the three main stages of speech emotion recognition. The features extracted from a speech signal carry the information on which the recognition rate and accuracy depend; therefore, feature extraction algorithms help improve both.
The speech features extracted for emotion recognition are roughly classified as (1) acoustic features, (2) language features such as lexical information, (3) context information such as gender and cultural influences, and (4) hybrid features, which integrate two or more of the above. Acoustic features, which are the most commonly used, mainly consist of prosodic features, spectral features, and voice quality features.
Although there is no agreement on the best features for SER, it is generally accepted that prosodic features carry most of the emotional information. Prosodic features in combination with spectral and voice quality parameterizations are often used in this field of study. However, prosodic features exhibit a recurring confusion pattern among emotions: they can easily discriminate high-arousal emotions (anger, happiness) from low-arousal ones (sadness, boredom), but the confusion between emotions of the same arousal level is very high [7]. Pitch, loudness, and duration are commonly used as prosodic features since they express the stress and intonation patterns of spoken language. The relation between prosody and emotion has been studied in many works in the literature [2]–[6].
Studies have also been performed on harmony features for speech emotion recognition. It has been found that the first- and second-order differences of harmony features also play an important role, and harmony features have been used in SER in [15]. Fourier parameter (FP) features have been shown to be effective in identifying various emotional states in speech signals [14]. Spectral features (such as spectrum centroid, spectrum cut-off frequency, correlation density, and Mel-frequency energy) and cepstral features (such as Linear Prediction Cepstral Coefficients (LPCC), Mel-Frequency Cepstral Coefficients (MFCC), and Perceptual Linear Prediction (PLP)) [13] have also been explored and used in SER. SER has also been carried out by combining acoustic features with phonological representations [1].
Although there are various feature extraction algorithms and techniques, there is no common agreement on which group of features should be considered for better classification. This paper shows that combining the spectral contrast feature with MFCC, Mel, and chroma features increases the accuracy of emotion prediction with the K-Nearest Neighbor, Logistic Regression, Support Vector Machine (SVM), and Gradient Boosting classifiers, compared with using only MFCC, Mel, and chroma. However, it yields lower accuracy with the Random Forest and Multi-Layer Perceptron classifiers. It is also noted that including the spectral contrast feature with MFCC, Mel, and chroma significantly increases the accuracy between pairs of emotions with different arousal levels. This could be further utilized in applications that must instantly detect high levels of anxiety and depression, especially in helpline services.

Related Work
Various deep learning methods have previously been identified to classify emotions, and they are being improved and applied to increase classification accuracy. In [19], characteristics of the emotions are learned from low-level speech signals, and the emotional intensity of words is indicated; for instance, a word like "awesome" is meant to carry stronger emotion than a word like "human". Features such as MFCC, Mel frequency, and prosodic features are extracted from the data. This multimodal system's performance could be improved further when more features of the data are taken into account.
In [21], the speaker is recognized in an emotional environment. Spectral features are extracted from the data and classified using Support Vector Machine (SVM), Gaussian Mixture Model (GMM), Gaussian Naive Bayes, K-Nearest Neighbor, Random Forest, and a simple neural network built with Keras. Feature combinations are used to improve classification accuracy. Different spectral features (Mel-Frequency Cepstral Coefficients (MFCC), Shifted Delta Cepstral Coefficients (SDCC), etc.) were analyzed, and the feature combination that was considered achieved the highest accuracy of 100% in a neutral environment and 87.0967% in an emotional environment.
In [22], speech emotion is recognized using multi-scale area attention on the Log Mel Spectrogram, and the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset is augmented with vocal tract length perturbation (VTLP). With area attention and VTLP-based data augmentation, an accuracy of 77.5% was achieved. The work in [23] considers both the semantic and paralinguistic information in the signals; a unified feature vector is used to make the final prediction using an LSTM.
Different features have been used in SER systems; however, there is no specific set of features agreed upon for the classification, and this remains an area of progressive study. In the survey [20], the global and local features analyzed for SER systems are prosodic features, voice quality features, spectral features, and Teager Energy Operator (TEO) based features. It is also noted that the consonant regions of a speech signal carry more emotional information than the vowel regions of utterances. After feature extraction, SER systems can select from a wide range of classification algorithms and methods. Selecting and extracting additional features could improve the recognition rates of SER systems.

Block diagram
The detailed project flow and architecture diagram are shown in Fig. 1 and Fig. 2. The acoustic features extracted from the sound files of the datasets are used to train a multi-layer perceptron to predict emotions. The dataset is split into two parts: 80% as training data and 20% as test data. At run time, the features extracted from the received speech input are used to recognize the emotion. The simulation is done in a Jupyter Notebook and mainly uses the Keras library and the Librosa package in Python. The dataset, feature extraction, and classification methods are discussed below.
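As a rough illustration, the 80/20 split described above can be sketched with scikit-learn's train_test_split; the feature matrix and labels below are random placeholders standing in for the extracted acoustic features and emotion labels, not the paper's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 180))   # stand-in feature vectors for 100 sound files
y = rng.integers(0, 8, size=100)  # stand-in labels for 8 emotion classes

# 80% training data, 20% test data, as in the pipeline above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test))  # 80 20
```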

Dataset
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [12] used for speech emotion recognition has 1440 sound files, recorded by 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements (1. "Kids are talking by the door" and 2. "Dogs are sitting by the door") in a neutral North American accent. The speech emotions include calm, happy, sad, angry, fearful, surprised, and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal and strong), with an additional neutral expression.

Feature extraction
The feature extraction part involves extracting the following features from the sound file.
MFCC: MFCC was first introduced and applied to speech emotion recognition in [8]. Mel-Frequency Cepstral Coefficients are the coefficients that collectively make up the MFCC, a representation of the short-term power spectrum of a sound based on a linear cosine transform of a log power spectrum on a non-linear Mel scale of frequency. MFCCs are derived by taking the Fourier transform of a windowed signal, mapping the powers of the resulting spectrum onto the Mel scale using triangular overlapping windows, taking the logarithm of the power at each Mel frequency, and then taking the discrete cosine transform of the list of Mel log powers as if it were a signal. The MFCCs are the amplitudes of the resulting spectrum.

Fig. 3. MFCC Derivation
Mel: The Mel spectrogram is a spectrogram with the Mel scale as its y-axis. It is derived by sampling the input with overlapping windows (hopping forward for each new window), computing the fast Fourier transform of each window, and mapping the result onto a Mel scale, which separates the entire frequency spectrum into perceptually evenly spaced bands.
Chroma: The term Chroma closely relates to the 12 different pitch classes. One main property of Chroma features is that they capture the harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation.
Spectral Contrast: Spectral contrast considers the spectral peak, the spectral valley, and their difference in each frequency sub-band. This feature includes more spectral information than MFCC. It represents the relative spectral distribution instead of the average spectral envelope.

Classification method
The classifier used for the study is the multilayer perceptron (MLP), a class of feedforward artificial neural networks composed of multiple layers of perceptrons. The MLP model used consists of an input layer, three hidden layers of sizes (300, 100, 50), and an output layer. Except for the input nodes, each node is a neuron that uses the Rectified Linear Unit (ReLU) activation function; the Adam optimizer is used for weight optimization. The network is trained using the supervised learning technique called backpropagation.
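The configuration described above matches the shape of scikit-learn's MLPClassifier, so a hedged sketch is given below; the random features and labels are stand-ins, since the paper does not publish its training code, and the exact training parameters are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 187))   # stand-in acoustic feature vectors
y = rng.integers(0, 8, size=200)  # stand-in labels for 8 emotion classes

# Three hidden layers (300, 100, 50), ReLU activation, Adam optimizer,
# trained with backpropagation, as described above
clf = MLPClassifier(hidden_layer_sizes=(300, 100, 50),
                    activation='relu', solver='adam',
                    max_iter=200, random_state=1)
clf.fit(X, y)
preds = clf.predict(X[:3])
print(preds.shape)  # (3,)
```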

Experimental Results
The most commonly used features for speech emotion recognition are MFCC, Mel, and chroma. The accuracies obtained by using spectral contrast together with MFCC, Mel, and chroma are compared with those obtained without spectral contrast on the RAVDESS dataset across different machine learning algorithms. For the RAVDESS dataset, using the spectral contrast feature increased the accuracy for 8 pairs of emotions with considerable differences in their arousal levels: neutral and happy, happy and surprised, sad and fearful, sad and surprised, angry and fearful, disgust and fearful, fearful and surprised, and sad and calm. The comparisons of accuracies before and after using the spectral contrast feature for each pair of emotions in the RAVDESS dataset can be seen in Table 1 and Table 2, respectively.

Table 1. Accuracies between pairs of emotions before using the spectral contrast feature

Table 2. Accuracies between pairs of emotions after using the spectral contrast feature

Using the multi-layer perceptron, the overall accuracy for all eight emotions was 45% with the spectral contrast feature and 56.39% without it. With logistic regression, the overall accuracy increased from 46.11% without the spectral contrast feature to 51.67% with it. Similarly, the gradient boosting classifier gives an accuracy of 52.78% with the spectral contrast feature versus 51.11% without it, again increasing the overall accuracy. Table 4 and Table 5 show the accuracy comparison between pairs of emotions for the other datasets used in this study.

Discussion
To evaluate the performance of the system, we also used datasets with different accents of the English language, as English is spoken in many accents worldwide. The analysis was carried out on the SAVEE dataset [10] for the British accent, the JL corpus for the New Zealand accent, and CREMA-D [11], which contains a collection of accents including Caucasian and African-American. With spectral contrast, the overall accuracy for the SAVEE dataset increased from 65% to 70%, while the overall accuracy for CREMA-D decreased from 46.64% to 44.8%. In addition to the English-language datasets, we used the German-language dataset EmoDB [9] to analyze the performance of this system for a different language. The overall accuracy for EmoDB decreased from 71.64% to 68.66% when using spectral contrast.

Table 4. Accuracies between pairs of emotions in the SAVEE dataset with spectral contrast (upper-half triangle) and without spectral contrast (lower-half triangle)

Table 5. Accuracies between pairs of emotions in the CREMA-D dataset with spectral contrast (upper-half triangle) and without spectral contrast (lower-half triangle)

Conclusion
The impact of the spectral contrast feature on speech emotion recognition has provided significant results. Spectral contrast, when combined with Mel, MFCC, and chroma, decreased the overall accuracy on the RAVDESS dataset; however, it proved significant when classifying emotions with relatively different arousal levels. Higher accuracies were obtained in detecting certain pairs of emotions after including the spectral contrast feature in the system. The results on the different English-language datasets suggest that it works across different accents of the same language.
This can also be combined further with language features and phonological representations to improve accuracy. The modified Feed Forward Neural Network with Particle Swarm Optimization via Euclidean Distance (FNNPSOED) [16] could be used to handle the classification problem instead of the multilayer perceptron model. The impact of spectral contrast could also be studied in gender identification systems used for health-related communications [17]. A speech emotion recognition system that uses the spectral contrast feature could further be extended to identify stress and depression in individuals, since the arousal level of normal speech differs prominently from that of a depressed person's speech, and could be used to provide recommendations for metacognition to control stress levels [18].

Authors
Swarnalaxmi Thiruvenkadam is a Computer Science Engineering student at College of Engineering Guindy, Anna University, India (email: thiruvenkadamswarna@gmail.com).
Shreya Kumar is a Computer Science Engineering student at College of Engineering Guindy, Anna University, India (email: shreyakumar603@gmail.com).