Development of Hybrid Feature Based Models for Dysarthric Speech Recognition
DOI: https://doi.org/10.3991/ijoe.v22i04.59849

Keywords: dysarthric word recognition, Wav2Vec 2.0, mel-frequency cepstral coefficients (MFCC), wavelet transform, hybrid feature extraction

Abstract
Speech recognition for individuals with dysarthria remains challenging due to unstable acoustic signals, high temporal variability, and frequent articulatory distortions, all of which hinder an acoustic model's ability to capture phonetic patterns consistently. This study identifies the most effective of three feature extraction strategies, namely Wav2Vec 2.0 alone, MFCC combined with Wav2Vec 2.0, and Wavelet-MFCC combined with Wav2Vec 2.0, evaluated on the UA-Speech dataset. All models were trained on the Wav2Vec 2.0 Base architecture with connectionist temporal classification (CTC) decoding, mapping audio signals to character sequences end to end. The experimental results show that the MFCC-Wav2Vec 2.0 combination performs best, achieving a word error rate (WER) of 0.2990. These findings indicate that combining traditional acoustic features with self-supervised representations yields a more robust recognition system for dysarthric speech.
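To make the hybrid-feature pipeline concrete, the Python sketch below pairs frame-level MFCCs with Wav2Vec 2.0 hidden states and scores a hypothesis with WER. It is a minimal sketch under stated assumptions, not the paper's implementation: the Hugging Face checkpoint facebook/wav2vec2-base-960h, the input file name, the hop length, and the truncate-and-concatenate fusion are illustrative choices, and the wavelet branch of the third model is omitted.

# A minimal sketch of the hybrid-feature idea described above. Illustrative
# only: checkpoint, file name, hop length, and the fusion scheme are
# assumptions, and the wavelet branch is omitted.
import librosa
import torch
from jiwer import wer
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_ID = "facebook/wav2vec2-base-960h"  # stand-in for the Base model used
extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
encoder = Wav2Vec2Model.from_pretrained(MODEL_ID)

# Load one utterance at the 16 kHz rate Wav2Vec 2.0 expects;
# "sample.wav" is a placeholder for a UA-Speech recording.
speech, sr = librosa.load("sample.wav", sr=16_000)

# Self-supervised path: (1, T, 768) hidden states from the pre-trained encoder.
inputs = extractor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    ssl_feats = encoder(inputs.input_values).last_hidden_state

# Traditional path: 13 MFCCs with a 320-sample hop, roughly matching the
# ~20 ms stride of the Wav2Vec 2.0 feature encoder at 16 kHz.
mfcc = torch.tensor(
    librosa.feature.mfcc(y=speech, sr=sr, n_mfcc=13, hop_length=320).T
).float()

# Naive fusion: truncate both streams to a common frame count and
# concatenate channel-wise, giving (T', 768 + 13) hybrid frames that a
# downstream CTC head could consume.
T = min(ssl_feats.shape[1], mfcc.shape[0])
hybrid = torch.cat([ssl_feats[0, :T], mfcc[:T]], dim=-1)
print(hybrid.shape)

# Scoring: word error rate, as reported in the abstract (best model 0.2990).
print(wer("reference transcript", "hypothesis transcript"))

WER here is the standard ratio (S + D + I) / N of substitutions, deletions, and insertions to the number of reference words, so lower values indicate better recognition.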
License
Copyright (c) 2026 Faila Nadhifatul Aryza, Syahroni Hidayat

This work is licensed under a Creative Commons Attribution 4.0 International License.

