Evaluation of the Vocal Tract Length Normalization Based Classifiers for Speaker Verification

This paper proposes and evaluates classifiers based on Vocal Tract Length Normalization (VTLN) in a text-dependent speaker verification (SV) task with short testing utterances. This type of task is important in commercial applications and is not easily addressed with methods designed for long utterances, such as JFA and i-Vectors. In contrast, VTLN is a speaker compensation scheme that can lead to significant improvements in speech recognition accuracy with just a few seconds of speech. A novel scheme to generate new classifiers is employed by incorporating the observation vector sequence compensated with VTLN. The modified sequence of feature vectors and the corresponding warping factors are used to generate classifiers whose scores are combined by a Support Vector Machine (SVM) based SV system. The proposed scheme can provide an average reduction in EER of 14% when compared with the baseline system based on the likelihood of the observation vectors.


I. INTRODUCTION
Vocal Tract Length Normalization (VTLN) is a widely used method to compensate for inter-speaker variation in speaker-independent automatic speech recognition (ASR) [1]. VTLN compensates for the effects of speaker-specific vocal tract lengths by warping the frequency axis of the power spectrum of the observation vector sequence with a warping factor optimized for each speaker [2]. VTLN has been employed extensively in ASR, but there is limited work on it in speaker verification (SV). A simple and efficient implementation can be achieved by moving the center frequencies of the filter bank in the parameterization process via the inverse frequency warping function [3]. VTLN can also be applied in the cepstral domain by using a linear transformation [2].
As mentioned above, with a few exceptions, VTLN has hardly been applied to SV. In [4], the authors proposed a GMM-UBM SV system with multiple background models (MBM), based on a VTLN criterion for UBM training data selection. An improvement of 8% in EER can be achieved if the UBM is trained with selected mean-VTLN data when compared with training with all the data. In [5][6], the authors proposed the use of one background model for each group of target speakers, where the groups were clustered by employing their vocal tract length factor as well as MLLR supervectors in a text-dependent SV system with GMM. A different approach is presented in [7], where an ASR system is employed to estimate the warping factor, which is then combined with a GMM-UBM SV system in order to improve the SV accuracy. This scheme provided an improvement of 23% in EER when compared with the baseline system. Although VTLN-based approaches can improve SV accuracy, they have not been explored further.
Text-dependent SV tasks with short testing utterances have an important presence in commercial applications and are not easily addressed with methods designed for long utterances, such as Joint Factor Analysis (JFA) and i-Vectors. In contrast, VTLN is a speaker compensation scheme that can lead to significant improvements in speech recognition accuracy with just a few seconds of speech.
In [8] it was suggested that the VTLN warping factor could be employed for gender classification. It is well known that men and women have different warping factors, with men in general showing a higher value [7]. These results imply that the VTLN warping factor could be a criterion for discriminating clients (target speakers) from impostors (non-target speakers) in an SV task. In addition, new feature vectors may be generated by compensating the input utterance with VTLN warping to obtain new classifiers, and selection and combination techniques may then be used to improve the accuracy of the entire SV system. This paper proposes a novel scheme for the generation of classifiers based on VTLN in a text-dependent SV task with short testing utterances. In order to fuse these new classifiers, an SVM-based combination scheme is used to improve the accuracy of the SV system.

II. SPEAKER VERIFICATION SYSTEM
In an SV system, the task is to verify the identity that is claimed by a given user. Two classes are possible: client, C1, and impostor, C2. In the enrollment process, each user is prompted to pronounce a given number of utterances that are employed to generate the user's speaker-dependent (SD) model. In verification, the speech signal from a user who claims a given identity is compared with the SD model associated with the claimed identity. In an HMM-based system, the observation vector sequence is also compared with an impostor model [10]. This impostor model is called speaker-independent (SI) because it is usually trained with a wide variety of users.
Consider an input vector sequence $X = [x_1, x_2, \ldots, x_K]$, where $x_k$ is the feature vector in the $k$-th frame and $K$ is the total number of feature vectors. The SV system outputs the alignments, i.e. the sequences of states associated with each frame, for the SD model and for the SI model. A log-likelihood score is then computed for each frame and its aligned state.
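The frame-level scoring described above can be sketched in Python as follows. This is only an illustration under stated assumptions: each HMM state is modeled by a single diagonal-covariance Gaussian, the utterance score is the frame-averaged difference between the SD and SI log-likelihoods, and the names `log_gauss_diag` and `ll_scores` are hypothetical. The exact score definition used in the paper is not recoverable from the text.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def ll_scores(frames, sd_states, si_states, sd_model, si_model):
    """Return (LL_SD, LL_SI, LL_SD - LL_SI), each averaged over the K frames.

    sd_states[k] and si_states[k] are the states aligned to frame k by the
    forced Viterbi pass; each model maps a state id to a (mean, var) pair.
    Assumption: one Gaussian per state instead of a full mixture.
    """
    K = len(frames)
    ll_sd = sum(log_gauss_diag(x, *sd_model[s])
                for x, s in zip(frames, sd_states)) / K
    ll_si = sum(log_gauss_diag(x, *si_model[s])
                for x, s in zip(frames, si_states)) / K
    return ll_sd, ll_si, ll_sd - ll_si

# Toy data (hypothetical): two 2-dimensional frames, unit variances.
frames = [[0.0, 0.0], [1.0, 1.0]]
sd_model = {0: ([0.0, 0.0], [1.0, 1.0]), 1: ([1.0, 1.0], [1.0, 1.0])}
si_model = {0: ([2.0, 2.0], [1.0, 1.0])}
ll_sd, ll_si, score = ll_scores(frames, [0, 1], [0, 0], sd_model, si_model)
print(ll_sd, ll_si, score)
```

A positive `score` means the frames fit the SD model better than the SI model along the given alignments.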

III. VOCAL TRACT LENGTH NORMALIZATION
VTLN attempts to compensate for the difference among speakers' vocal tract lengths by warping the frequency axis of the speech signal power spectrum [2]. In general, the frequency axis is scaled by a warping function with a transformation parameter $\alpha$.
Consider that $\omega_m$ is the central frequency of filter $m$ in a filter bank composed of $M$ filters, and that $\hat{\omega}_m$ is the warped central frequency of filter $m$. With the piece-wise linear warping function proposed in [3], $\hat{\omega}_m$ is written in terms of $\omega_m$, $\omega_{\max}$, $\alpha$ and $\omega_0$, where $\omega_{\max}$ corresponds to the highest filter-bank frequency, $\alpha$ is the warping factor or parameter, and $\omega_0$ is the breakpoint frequency of the piece-wise function.
Conventional VTLN is usually implemented by generating one filter bank for each warping factor $\alpha$ to be evaluated. The optimal $\alpha$ is then the one that provides the maximum likelihood of the feature vector sequence transformed with the warping function, given $W_r$, the recognized word sequence obtained by a first recognition pass.
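The warping of the filter centers and the search over warping factors can be sketched in Python as follows. This is a common piece-wise linear form, not necessarily the exact function of [3]: the breakpoint ratio of 0.875, the grid of $\alpha$ values and the names `warp_frequency`, `warped_centres` and `best_alpha` are illustrative assumptions, and `toy_log_likelihood` merely stands in for a real scoring pass.

```python
def warp_frequency(omega, alpha, omega_max, breakpoint_ratio=0.875):
    """Piece-wise linear warp of one frequency.

    Below the breakpoint omega_0 the frequency is scaled by alpha; above it,
    a second linear segment maps omega_max onto itself so that the warped
    axis still ends at omega_max.
    """
    omega_0 = breakpoint_ratio * omega_max  # assumed breakpoint, not from the paper
    if omega <= omega_0:
        return alpha * omega
    slope = (omega_max - alpha * omega_0) / (omega_max - omega_0)
    return alpha * omega_0 + slope * (omega - omega_0)

def warped_centres(centres, alpha, omega_max):
    """Warp every filter-bank central frequency with the same alpha."""
    return [warp_frequency(w, alpha, omega_max) for w in centres]

def best_alpha(log_likelihood, alphas):
    """Grid search: return the alpha with the highest log-likelihood."""
    return max(alphas, key=log_likelihood)

def toy_log_likelihood(alpha):
    """Placeholder for scoring the warped features; peaks at alpha = 1.04."""
    return -(alpha - 1.04) ** 2

# Hypothetical grid of warping factors: 0.88, 0.90, ..., 1.12
alphas = [round(0.88 + 0.02 * i, 2) for i in range(13)]
alpha_hat = best_alpha(toy_log_likelihood, alphas)
print(alpha_hat, warped_centres([500.0, 2000.0, 3800.0], alpha_hat, 4000.0))
```

In a real system, `toy_log_likelihood` would be replaced by a function that extracts features with the warped filter bank and scores them against the acoustic model along the first-pass transcription.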

IV. VTLN BASED CLASSIFICATION
After the conventional operational procedure of an SV system explained in Section II, supplementary information can potentially be obtained by incorporating a new observation vector sequence adapted with VTLN. This modified utterance can be used as an input to the SV system, yielding scores that differ from those computed with the original, uncompensated feature vectors. The new scores, together with the one obtained with the original observation vector sequence, can be fused by using standard techniques of feature selection and classifier combination to improve the accuracy of the SV system.
Given an SD model state alignment resulting from the first forced Viterbi pass in a text-dependent SV system, an optimal warping factor can be estimated by employing VTLN:
$$\alpha_{optimal} = \arg\max_{\alpha} \Pr\left(g_{\alpha}(X) \mid \Phi_{SD}\right) \tag{5}$$
where $g_{\alpha}(X)$ is the piece-wise function defined in (1), $X$ is the input feature vector sequence and $\Phi_{SD}$ is the SD model state alignment. The estimation of $\alpha_{optimal}$ can be achieved using any VTLN technique.
It is reasonable to assume that if the estimated warping factor $\alpha_{optimal}$ is distant from 1, the observation vector sequence had to be compensated more to increase its likelihood with respect to the SD model. In this case, the probability that the input feature vectors came from an impostor should be higher. In contrast, if $\alpha_{optimal}$ is close or equal to 1, it is sensible to suppose that the utterance corresponds to a client or target speaker. The difference between $\alpha_{optimal}$ and $\alpha_0 = 1$ is calculated as
$$\Delta\alpha = \left|\alpha_{optimal} - \alpha_0\right|$$
where $\Delta\alpha$ is the absolute value of this distance or difference.
The observation vector sequence compensated with VTLN, $X(\alpha_{optimal})$, can be used as a new input to the SV system in order to obtain a new sequence of aligned SD and SI models, and a new log-likelihood score that depends on $\alpha_{optimal}$, $LL_{Score}(\alpha)$. As in the $\Delta\alpha$ case, if the input observation vector sequence corresponds to a client, the VTLN compensation applied to the signal should be smaller than that estimated for an utterance from an impostor. Consequently, the difference between the log-likelihood scores obtained with the original and the compensated observation vector sequences, $\Delta LL_{Score}$, should be lower for a client speaker than for an impostor.
The adapted feature vector sequence $X(\alpha_{optimal})$ can therefore potentially be used to generate new criteria for classification. Fig. 1 shows a scheme to obtain new classifiers using VTLN. The SD model alignment ($\Phi_{SD}$) obtained by the first forced Viterbi pass and the input feature sequence $X$ are employed to estimate the $\alpha_{optimal}$ that maximizes (5). The adapted observation vector sequence $X(\alpha_{optimal})$ is the input to the SV system, which produces new scores such as $LL_{Score}(\alpha)$, $LL_{SD\text{-}Score}(\alpha)$ and $LL_{SI\text{-}Score}(\alpha)$. Finally, given any VTLN method, a difference is estimated between the new and the original scores, obtaining five new classifiers.
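As a small illustration of two of the criteria described in this section, the Python sketch below computes the warping-factor distance and the log-likelihood difference for one utterance. The function names and the toy numbers are hypothetical, and taking the absolute value of the log-likelihood difference is an assumption, since the sign convention of the original equation is not recoverable from the text.

```python
def delta_alpha(alpha_optimal, alpha_0=1.0):
    """Absolute distance between the estimated warping factor and alpha_0 = 1."""
    return abs(alpha_optimal - alpha_0)

def delta_ll(ll_original, ll_warped):
    """Difference between the scores with and without VTLN compensation.

    Assumption: the absolute difference is used; the paper's exact
    definition may keep the sign.
    """
    return abs(ll_warped - ll_original)

def vtln_criteria(alpha_optimal, ll_original, ll_warped):
    """Collect the criteria for one utterance as a small feature dictionary."""
    return {
        "delta_alpha": delta_alpha(alpha_optimal),
        "ll_warped": ll_warped,
        "delta_ll": delta_ll(ll_original, ll_warped),
    }

# Toy values (hypothetical): a client needs little compensation,
# an impostor needs more and gains more likelihood from it.
client = vtln_criteria(alpha_optimal=1.02, ll_original=2.40, ll_warped=2.45)
impostor = vtln_criteria(alpha_optimal=0.90, ll_original=-1.10, ll_warped=-0.35)
print(client)
print(impostor)
```

With these toy values both `delta_alpha` and `delta_ll` are smaller for the client than for the impostor, which is the behavior the section argues for.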

V. EXPERIMENTS
Experiments were carried out with the Yoho database [9]. The Yoho Speaker Verification Corpus supports development, training and testing of speaker verification systems that use limited vocabulary, free-text input. The vocabulary is composed of two-digit numbers spoken continuously in sets of three. The database is divided into "enrollment" and "verification" segments; each segment contains data from all of the 138 speakers. There are four enrollment sessions per speaker and each session contains 24 utterances. Each verification segment contains 10 sessions and each session contains four utterances per speaker. The database was divided into three groups: Yoho_A, Yoho_B and Yoho_C. Ninety-two speakers were selected for Yoho_A and Yoho_B: 77 of these were randomly selected for Yoho_A, which was used for testing, and the remaining 15 speakers formed Yoho_B, which was used to estimate the parameters of the SVM classifier explained later. The random selection of users for Yoho_A and Yoho_B was repeated 1000 times to generate an equal number of experiments. Finally, Yoho_C, composed of 41 speakers (29 males and 12 females), was used to train the SI model.
The text-dependent SV (TD-SV) system is based on HMMs with the forced Viterbi algorithm [10], while the combination of scores is based on an SVM with a linear kernel [11,12]. The SVM parameters were estimated with Yoho_B.
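The fusion of two scores with a linear-kernel SVM can be sketched in plain Python as follows. This is not the solver used in the paper: the SVM is trained here by full-batch subgradient descent on the regularized hinge loss, and the hyperparameters, the toy score pairs and the names `train_linear_svm` and `fused_score` are all illustrative assumptions.

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Linear SVM via subgradient descent on the L2-regularized hinge loss.

    X is a list of score vectors, y a list of labels (+1 client, -1 impostor).
    Returns the weight vector and the bias of the separating hyperplane.
    """
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample lies inside the margin or is misclassified
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def fused_score(w, b, x):
    """Signed distance-like score: positive suggests client, negative impostor."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy training set (hypothetical): [baseline score, score after VTLN compensation]
clients = [[2.0, 2.3], [1.8, 2.1], [2.4, 2.2], [1.5, 1.9]]
impostors = [[-1.0, -0.4], [-0.6, -0.9], [0.1, -0.2], [-1.4, -0.1]]
X = clients + impostors
y = [1] * len(clients) + [-1] * len(impostors)

w, b = train_linear_svm(X, y)
print(w, b, [round(fused_score(w, b, x), 2) for x in X])
```

In the setup described above, the training pairs would come from the Yoho_B speakers and the fused score would then be thresholded on the Yoho_A trials.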
The procedure for training the SVM and evaluating the classifier combinations was repeated 1000 times in order to obtain a more representative result. VTLN according to [8] was implemented in the experiments reported here. Fig. 2 depicts the histogram of $\Delta\alpha$ with VTLN. It can be seen that the values of $\Delta\alpha$ for the clients are on average lower than those for the impostors. This indicates that $\Delta\alpha$ may discriminate between the two classes and can itself be a criterion for classification in SV. It also implies that the client's observation vector sequence compensated with VTLN will not be very different from the original one, and that the log-likelihood score will be similar to the one obtained in the first round of verification. Similar results are obtained for the histogram of the log-likelihood difference ($\Delta LL_{Score}$) with VTLN: on average, $\Delta LL_{Score}$ is higher for the impostors than for the clients. As mentioned above, this is due to the fact that the VTLN compensation for the client's feature vectors is lower than for the impostor's observation vectors. Consequently, for client speakers the log-likelihood score with the adapted features is similar to the one obtained without adaptation.

VI. RESULTS AND DISCUSSION
Table 1 shows the individual performance of the six classification criteria: the baseline criterion and the five new criteria obtained with VTLN. The best performance corresponds to the baseline and to $LL_{Score}^{VTLN}(\alpha)$. In contrast, $\Delta\alpha$ achieves low discrimination between client and impostor, and its performance is very poor when compared with the baseline.
Fig. 3 suggests that incorporating a second criterion to the baseline improves the discrimination between client and impostor. The results indicate that the improvement depends on the group of users selected to estimate the SVM parameters. In the worst scenario there is no improvement (0%), while in the best ones the reduction in EER can be as high as 24%.
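For reference, the EER and the relative reduction in EER quoted in this section can be computed along the following lines. This Python sketch is an approximation that sweeps the decision threshold over the observed scores. The function names and the toy scores are hypothetical, and the paper does not state how its EER values were interpolated.

```python
def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER, assuming a higher score means more client-like.

    The threshold is swept over all observed scores; the EER is taken as the
    mean of the false acceptance rate (FAR) and the false rejection rate
    (FRR) at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        frr = sum(s < t for s in client_scores) / len(client_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer

def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Toy scores (hypothetical), not taken from the experiments.
toy_clients = [1.0, 2.0, 3.0, 4.0]
toy_impostors = [0.0, 1.5, -1.0, -2.0]
eer = equal_error_rate(toy_clients, toy_impostors)
print(eer)
```

Repeating this computation for each of the 1000 random Yoho_A and Yoho_B partitions and averaging the relative reductions would give a figure comparable to the average reported here.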

VII. CONCLUSION
In this paper, a scheme to generate new classification criteria based on VTLN in a text-dependent speaker verification task is proposed. Additional information was obtained by incorporating a new observation vector sequence computed from the original one by applying VTLN compensation. The modified observation vector sequence was used as an input to the SV system to estimate new scores in addition to the original ones obtained without VTLN compensation. The new criteria based on VTLN were combined with the baseline one by using an SVM. It is worth emphasizing that the proposed scheme can be employed with any VTLN method.
Experiments with the YOHO database are presented. The performance of each classification criterion, and of the selection and combination of these new classifiers with the baseline, is discussed. The VTLN warping factor was also tested as a criterion for classification. However, its performance was found to be poor when compared with the baseline system (EER equal to 14.24% and 0.72%, respectively).
The proposed method, using the combined scores $LL_{Score}$ and $LL_{Score}^{VTLN}(\alpha)$ with SVM, provided an average reduction of 13.9% in EER when compared with the baseline. Also, a reduction as high as 24% in EER can be achieved for some speakers.