Automatic Speech Recognition for Crisis Communication in the Albanian Language: Evaluating Whisper Turbo
DOI: https://doi.org/10.3991/ijoe.v22i02.58873

Keywords: Automatic Speech Recognition (ASR), Whisper Turbo, Crisis Communication, Emergency Audio, Low-Resource Languages, Albanian Language, Word Error Rate (WER), Character Error Rate (CER)

Abstract
This study evaluates the performance of the Whisper Turbo automatic speech recognition (ASR) model for crisis communication in the Albanian language. Applying even a robust system such as Whisper Turbo in this setting is particularly challenging, because the stress and urgency of speaking in emergency situations produce rapid tempo and emotional intonation. We assess its accuracy and speed on two distinct datasets: a controlled corpus of formal, literary Albanian and a challenging corpus of emergency-style speech from first responders. The research uses word error rate (WER) and character error rate (CER) to quantify performance. We found a significant performance difference between the two conditions: the model achieved an average WER of 52.9% on formal Albanian but degraded to 63.5% on the dialectal and stressed speech. These results indicate a clear bias toward standardized language and highlight the model's difficulty with non-standard pronunciations, emotional intonation, and specialized vocabulary. The findings underscore that while current multilingual ASR models can process low-resource languages such as Albanian, they are not yet suitable for deployment in critical emergency contexts without domain-specific fine-tuning. This work contributes an essential evaluation to the under-researched field of Albanian ASR and provides a foundation for developing more robust and reliable systems for crisis communication.
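As a minimal illustration of the two reported metrics: WER and CER are both normalized Levenshtein edit distances, computed over word sequences and character sequences respectively. The following self-contained Python sketch (an illustrative reimplementation, not the authors' evaluation code) shows the standard computation.

```python
# Sketch of word error rate (WER) and character error rate (CER):
# WER = (substitutions + deletions + insertions) / reference word count,
# computed via dynamic-programming Levenshtein distance.

def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all remaining reference tokens
    for j in range(n + 1):
        d[0][j] = j  # insert all remaining hypothesis tokens
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over whitespace-split words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Diacritic errors, common in Albanian ASR output, count as substitutions:
# wer("zjarr në ndërtesë", "zjarr ne ndertese") yields 2/3.
```

In practice, evaluation pipelines typically normalize casing and punctuation before scoring, which can materially change the reported WER.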
Copyright (c) 2025 Labehat Kryeziu, Visar Shehu

This work is licensed under a Creative Commons Attribution 4.0 International License.

