Voice Pathology Detection Using the Adaptive Orthogonal Transform Method, SVM and MLP.

In this paper, an automatic voice pathology recognition system is presented. The features are extracted by the Adaptive Orthogonal Transform method, and their statistical properties are summarized by the average, variance, skewness and kurtosis values. Classification uses two models that are widely used in the field of signal processing: the Support Vector Machine (SVM) and the Multilayer Perceptron (MLP). The proposed system is tested on a German voice database, the Saarbruecken Voice Database (SVD). The experimental results show that the Adaptive Orthogonal Transform method performs best with the Multilayer Perceptron Neural Network, which achieved 98.87% accuracy, whereas its combination with the Support Vector Machine reached 85.79% accuracy.


Introduction
Signal processing is a science that analyzes and interprets the information contained in a signal. It has a great effect on our daily life. Signal processing is used to solve many problems in various fields, such as medicine, industry and economy.
Every year, the number of people who suffer from voice problems increases, reaching about 25% of the world population [1]. The most affected are people who smoke, who are affected by climate change, or who use their voices excessively at work, such as singers, lawyers, and teachers [1], [2].
Voice pathology detection is done by an otolaryngologist, also known as an ear, nose, and throat doctor, who performs a painless examination to visualize the vocal cords and the larynx, called indirect laryngoscopy [3]. The doctor puts a light source on the forehead and uses a small specific mirror, called a laryngeal mirror [4], to check the larynx and vocal cords. The doctor may ask the patient to make sounds to examine the mobility of the vocal cords, but depending on the suspected cause of hoarseness, other examinations may be prescribed to refine the diagnosis, such as a phoniatric assessment [5], an X-ray [6], or a CT scan.
The problem with these techniques is that they require several tests to detect the type of disease. In addition, they make patients uncomfortable, and it is sometimes difficult for patients to go to the hospital every time, especially in severe cases. To avoid these issues, many researchers have proposed various voice pathology recognition systems.
Muhammad et al. [7] proposed a system for automatic voice pathology detection and classification using vocal tract area irregularity, where features are extracted from the vocal tract area and classification is done by the Support Vector Machine (SVM) method. The system was tested on an American and a German database: the Massachusetts Eye and Ear Infirmary (MEEI) database and the Saarbruecken Voice Database (SVD). The system reached 99.22% accuracy with MEEI and 94.7% accuracy with SVD.
Mohammed et al. [8] realized a system for voice pathology detection and classification using a Convolutional Neural Network (CNN) model. The aim of their paper was to improve accuracy by using CNN and ResNet34 layers. To test the performance of the system, they used the Saarbruecken Voice Database (SVD). The authors obtained 95.41% accuracy.
Shia and Jayasree [9] presented a voice pathology detection system using the Discrete Wavelet Transform method (DWT) and Feed Forward Neural Network (FFNN). They used the Saarbruecken Voice Database (SVD) to test their work. The energy of wavelet sub-band coefficients is calculated and FFNN is applied as a classification method. The proposed system achieved 93.3% accuracy.
Hammami, Salhi, and Labidi [10] realized a voice pathology recognition system using Empirical Mode Decomposition (EMD) and Discrete Wavelet Transform (DWT). Features of the Higher Order Statistics (HOS) and the coefficients of the wavelet decomposition are extracted. The database used in this work is the Saarbruecken Voice Database (SVD). This study achieved 99.29% accuracy, and 100% accuracy for detection and classification of voice pathologies.
Srinivasan, Ramalingam, and Arulmozhi [11] proposed a method of voice pathology recognition using Mel-Frequency Cepstral Coefficients (MFCC) for feature extraction, and different models of Neural Network for classification: the Multilayer Perceptron Neural Network (MLPNN), the Generalized Regression Neural Network (GRNN) and the Probabilistic Neural Network (PNN). 20 samples are taken for analysis: 10 for training and 10 for testing. The system achieved 100% accuracy in the classification process using MLPNN.
From these studies, we found that several methods are widely used for the detection and classification of voice pathologies, such as MFCC, DWT, EMD, MLP, and SVM. However, the researchers did not test all the diseases provided by the database used in their work, either with SVD or with MEEI. They chose only a few diseases [1], [7], [10], such as dysphonia, polyps, cysts, or paralysis. Moreover, the number of voice signals used in the test phase is too small [9]-[12].
In this work, we propose an automatic voice pathology detection and classification system that uses the Adaptive Orthogonal Transform method to extract the important features from the input signals with a small training dataset [13]-[15]. The classification phase is done by two of the most well-known methods in the field of signal processing, the Support Vector Machine (SVM) [16] and the Multilayer Perceptron (MLP) [17], [18], in order to compare them and to see which one gives the best results. To test the performance of the proposed system, the Saarbruecken Voice Database (SVD) [19] is used.
This paper is organized as follows: section 2 presents the proposed method for a voice pathology recognition system based on the Adaptive Orthogonal Transform. The experimental results are presented in section 3. Finally, section 4 concludes the paper and gives an overview of our future work.

Research method
Proposed system
The proposed system goes through several steps, as shown in Figure 1. The first step is the compression of the voice signal using the Fourier Transform to improve the speed of the algorithm. Then, the informative features are extracted by the Adaptive Orthogonal Transform and used as the input of the classification methods SVM and MLP, which identify whether the voice is pathological or not.
Compression. Before extracting features from a voice signal, it is first necessary to compress it by choosing the informative intervals without distorting the information. In our study, the Fourier Transform is used as a compression method [20], [21]. Figure 2 shows the informative intervals of the pathological and normal voice signals obtained with the Fourier Transform.
Feature extraction. Many researchers have used several methods for feature extraction from the voice signal, such as Mel-Frequency Cepstral Coefficients (MFCC) [11], Linear Prediction Cepstral Coefficients (LPCC) [22], Perceptual Linear Prediction Coefficients (PLP) [22], or Empirical Mode Decomposition (EMD) [10]. However, these methods have several disadvantages: they are sensitive to noise, take a long execution time, and need a large training dataset to extract the features from the input signal [15], [23]-[25]. Our goal is to solve these issues by using the Adaptive Orthogonal Transform method [13]-[15].
The proposed approach can extract the informative features of the voice signal with a minimal dimension by creating an operator $H$ that is adaptable to any input signal. First, it is necessary to calculate the average of the statistical features obtained at the compression phase to form the vector $\hat{s}_{dR}$ [15], [26], [27]. $H$ is adapted to a class of signals represented by a standard vector $\hat{s}_{dR}$ when the following condition is verified:
where $Y_t$ is the target vector that constructs the adaptation criterion of the operator $H_a$ to $\hat{s}_{dR}$.
Then the vector of the informative features is calculated by the following operation: where: i is the class number. In our study, we worked with two classes: a pathological class and a normal class.
N is the number of class signals. In our study, the pathological class contains 200 voice signals (100 male and 100 female), and the normal class contains 200 voice signals (100 male and 100 female).
After extracting the vector of the informative features, we decompose it into several frames, and then the average, variance, skewness, and kurtosis of each frame are calculated to provide the statistical properties [28].
To find the number and size of frames corresponding to our vector, we used the following equations:
$$s = t \times f \qquad (3)$$
where $s$ is the size of the frames, $t$ is the time per frame, and $f$ is the frequency.
$$n = \frac{l}{s} \qquad (4)$$
where $n$ is the number of frames, $l$ is the length of the vector, and $s$ is the size of the frames obtained in equation (3).
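The framing step and the per-frame statistics can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the frame size is assumed to be the time per frame multiplied by the frequency, the moments are assumed to be population (1/N) moments computed over the samples of each frame, and the kurtosis is assumed to be the non-excess form. The input vector and the values of `time_per_frame` and `frequency` are synthetic.

```python
import math


def frame_size(time_per_frame, frequency):
    """Frame size in samples, s = t * f (assumed reading of the text)."""
    return int(round(time_per_frame * frequency))


def split_into_frames(vector, size):
    """Split the vector into n = len(vector) // size non-overlapping frames."""
    n = len(vector) // size
    return [vector[k * size:(k + 1) * size] for k in range(n)]


def frame_statistics(frame):
    """Return (average, variance, skewness, kurtosis) of one frame.

    Population moments are an assumption; the source does not state
    which normalization was used.
    """
    count = len(frame)
    mean = sum(frame) / count
    var = sum((x - mean) ** 2 for x in frame) / count
    if var == 0.0:
        # Constant frame: standardized moments are undefined, report 0.
        return mean, 0.0, 0.0, 0.0
    std = math.sqrt(var)
    skew = sum(((x - mean) / std) ** 3 for x in frame) / count
    kurt = sum(((x - mean) / std) ** 4 for x in frame) / count
    return mean, var, skew, kurt


def feature_measures(vector, size):
    """Concatenate the four statistics of every frame into one list."""
    measures = []
    for frame in split_into_frames(vector, size):
        measures.extend(frame_statistics(frame))
    return measures


# Synthetic 64-sample vector standing in for an extracted feature vector.
vector = [math.sin(0.3 * k) + 0.01 * k for k in range(64)]
size = frame_size(time_per_frame=0.016, frequency=1000)
measures = feature_measures(vector, size)
print(len(split_into_frames(vector, size)), "frames,", len(measures), "measures")
```

With these synthetic values the vector splits into 4 frames, which yields 16 measures, the same layout as the 4 frames with 4 statistics each described in the text.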
The following equations show the formulas of the average, variance, skewness and kurtosis:
where $N$ is the number of frames and $frm_i$ is the $i$th frame. In our case, the vector is decomposed into 4 frames with 16 different measures (4 frames × 4 values of average, variance, skewness and kurtosis).
Figure 3 shows the overall process of the Adaptive Orthogonal Transform method, which goes through 7 steps [27]:
• Step 1: The input speech signals pass through the compression phase using the Fourier Transform.
Classification. After the extraction of the features from the voice signals, we come to the classification phase. In our study, we used two methods that are widely used by researchers: Support Vector Machine (SVM) and Multilayer Perceptron (MLP) [29]-[31].
• Support Vector Machine: The Support Vector Machine (SVM) is very popular for data classification. To create it, we go through several steps. First, the input vector is created by assigning the informative features of the two classes: the normal class and the pathological class. Then, it is necessary to choose an efficient kernel, which is an important factor in obtaining a good classification. In this work, we used the Radial Basis Function (RBF) kernel. Finally, to optimize the parameters C and gamma, we chose grid search as the optimization method [7], [16], [32]-[34].
• Multilayer Perceptron: Neural Network models are numerous, and among them is the Multilayer Perceptron (MLP), which consists of three types of layers: the input layer, the hidden layer, and the output layer, as shown in Figure 5. The input layer receives the features extracted in the previous phase, then an arbitrary number of hidden layers are placed between the input and output layers. Finally, the classification and decision are made by the output layer [11], [17]. MLP neurons are trained using the back-propagation learning algorithm, which consists of two steps: a forward pass and a backward pass through the network. The forward pass propagates the input layer by layer to the output neuron, which produces the actual output of the network. The backward pass computes the error signal by subtracting the desired output from the actual output and propagates it back through the network to adjust the weights [18], [34], [35].
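The forward and backward passes of back-propagation can be illustrated with a small NumPy sketch. It is a sketch under stated assumptions rather than the network used in this work: a single hidden layer of 8 tanh units, a sigmoid output neuron, a cross-entropy loss, plain gradient descent, and synthetic 16-dimensional data standing in for the extracted measures are all choices made for the illustration, and every function name is hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Forward pass: propagate the input layer by layer to the output neuron."""
    W1, b1, W2, b2 = params
    hidden = np.tanh(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output


def backward(X, y, hidden, output, params):
    """Backward pass: error = actual output - desired output, sent back layer by layer."""
    W1, b1, W2, b2 = params
    n = X.shape[0]
    delta_out = (output - y) / n  # gradient of mean cross-entropy w.r.t. output pre-activation
    grad_W2 = hidden.T @ delta_out
    grad_b2 = delta_out.sum(axis=0)
    delta_hidden = (delta_out @ W2.T) * (1.0 - hidden ** 2)  # tanh derivative
    grad_W1 = X.T @ delta_hidden
    grad_b1 = delta_hidden.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2


def cross_entropy(y, output):
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(y * np.log(output + eps) + (1 - y) * np.log(1 - output + eps)))


def train(X, y, hidden_units=8, learning_rate=0.1, epochs=300, seed=0):
    rng = np.random.default_rng(seed)
    params = [
        rng.normal(0.0, 0.1, (X.shape[1], hidden_units)),
        np.zeros(hidden_units),
        rng.normal(0.0, 0.1, (hidden_units, 1)),
        np.zeros(1),
    ]
    losses = []
    for _ in range(epochs):
        hidden, output = forward(X, params)
        losses.append(cross_entropy(y, output))
        grads = backward(X, y, hidden, output, params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params, losses


# Synthetic two-class data (0 = normal, 1 = pathological), 16 values per sample.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(-1.0, 1.0, (40, 16)), data_rng.normal(1.0, 1.0, (40, 16))])
y = np.vstack([np.zeros((40, 1)), np.ones((40, 1))])

params, losses = train(X, y)
_, probabilities = forward(X, params)
predictions = (probabilities >= 0.5).astype(int)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

The error term `output - y` is the actual output minus the desired output, as in the description of the backward pass; in practice a library implementation would normally replace this hand-written loop.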

Data
In this work, we used a German database, the Saarbruecken Voice Database (SVD), which is freely available online and produced by the Institute of Phonetics at Saarland University [19].
This database presents a collection of voice recordings from over 2000 people. It contains the sustained vowels /a/, /i/, and /u/ at normal, high, low and low-high-low pitch, as well as recordings of the sentence "Guten Morgen, wie geht es Ihnen?", which means "Good morning, how are you?" in English. The length of the signals varies between 1 and 4 seconds, and they are sampled at 50 kHz with 16-bit resolution. The number of pathologies provided by the database is 71. All these properties make SVD very interesting for researchers. In our study, we used the sustained vowel /a/ at normal pitch. Table 1 gives more details about the Saarbruecken Voice Database (SVD).
Table 1 presents the total number of available records in SVD, but since our work uses only the sustained vowel /a/ at normal pitch, the values are slightly modified, as shown in Table 2. Every system needs a training dataset to extract the features from the input signals and a test dataset to evaluate the performance. Table 3 shows the number of voice recordings in our training and test datasets.

Results and discussions
The performance of our system is measured by its recognition (success) rate. The following equation presents the rate of pathological voice recognition:
$$\text{pathological voice recognition rate} = \frac{p}{ts} \times 100\%$$
where $p$ is the number of recognized pathological voices and $ts$ is the size of the corresponding test dataset.
The following equation presents the rate of normal voice recognition:
$$\text{normal voice recognition rate} = \frac{n}{ts} \times 100\%$$
where $n$ is the number of recognized normal voices and $ts$ is the size of the corresponding test dataset.
From Table 4, we observe that the combination of our method and the MLP Neural Network model achieves higher results compared to the combination with SVM:
• 98.87% for pathological voice recognition with MLP, which means it successfully identified 1141 pathological voices among 1154.
• 98.36% for normal voice recognition with MLP, which means it successfully identified 479 normal voices among 487.
• 85.79% for pathological voice recognition with SVM, which means it successfully identified 990 pathological voices among 1154.
• 87.27% for normal voice recognition with SVM, which means it successfully identified 425 normal voices among 487.
The gap between the two approaches is 13.08 percentage points for pathological voice recognition and 11.09 percentage points for normal voice recognition.
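The reported rates and gaps can be checked directly from the counts given above. The short Python sketch below assumes that each rate is the number of recognized voices divided by the size of the test set of that class, expressed as a percentage; the function and variable names are hypothetical.

```python
def recognition_rate(recognized, test_size):
    """Percentage of the test recordings of one class that were recognized."""
    return 100.0 * recognized / test_size


# (recognized voices, test set size) taken from the counts reported in the text.
counts = {
    ("MLP", "pathological"): (1141, 1154),
    ("MLP", "normal"): (479, 487),
    ("SVM", "pathological"): (990, 1154),
    ("SVM", "normal"): (425, 487),
}

rates = {key: round(recognition_rate(*value), 2) for key, value in counts.items()}

# Gap between the two classifiers, in percentage points.
gap_pathological = round(rates[("MLP", "pathological")] - rates[("SVM", "pathological")], 2)
gap_normal = round(rates[("MLP", "normal")] - rates[("SVM", "normal")], 2)

for (model, voice_class), rate in rates.items():
    print(f"{model} {voice_class}: {rate:.2f}%")
print(f"gap: {gap_pathological:.2f} (pathological), {gap_normal:.2f} (normal)")
```

The computed values agree with the rates and gaps stated in the text.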
These results show that the MLP model combines very well with the Adaptive Orthogonal Transform method and demonstrate the efficiency of our system for the detection and classification of voice pathology. The combination of our approach with the SVM classifier also gives good results, but they fall short of the desired level. The algorithm ran quickly with both SVM and MLP.

Conclusion
In this paper, our approach is evaluated with two classification models: MLP and SVM. The results obtained show the efficiency of our system by combining the Adaptive Orthogonal Transform method with MLP. The proposed method consists of extracting the informative feature vectors with minimal dimension to improve the speed of the algorithm, then decomposing them and calculating the average, variance, skewness and kurtosis values with the aim of providing their statistical properties. Finally, the obtained values are assigned as input for the SVM and MLP classification models.
In future work, we plan to combine our method with other Neural Network models, such as the Probabilistic Neural Network (PNN) and the Generalized Regression Neural Network (GRNN), to see whether they achieve better results than MLP and SVM.