Challenges and Solutions in Clustering Low-Resource Language Social Media Text: An Evaluation Using Unsupervised Algorithms

Mërgim H. Hoti; Avni Rexhepi; Arbër H. Hoti; Blerim Rexha

doi:10.3991/ijim.v19i20.56307

Authors

Mërgim H. Hoti University of Prishtina, Prishtinë, Republic of Kosova https://orcid.org/0000-0003-0744-2250
Avni Rexhepi University of Prishtina, Prishtinë, Republic of Kosova https://orcid.org/0000-0003-3306-8784
Arbër H. Hoti University of Prishtina, Prishtinë, Republic of Kosova https://orcid.org/0000-0003-3106-7024
Blerim Rexha University of Prishtina, Prishtinë, Republic of Kosova https://orcid.org/0000-0002-3428-7666

DOI:

https://doi.org/10.3991/ijim.v19i20.56307

Keywords:

Unsupervised algorithms, K-Means, DBSCAN, HDBSCAN, low-resource language

Abstract

Low-resource languages present unique challenges for natural language processing (NLP) due to limited annotated corpora, linguistic resources, and pre-trained models. This paper addresses the gap in clustering methodologies for such languages by evaluating the performance of three unsupervised algorithms (K-Means, DBSCAN, and HDBSCAN) on social media text data. Unlike prior studies focusing on high-resource languages, this study explores challenges in preprocessing, tokenization, and vectorization specific to low-resource settings. The results highlight the sensitivity of clustering performance to linguistic nuances and preprocessing approaches, with DBSCAN and HDBSCAN excelling in handling noisy and unstructured data. The findings provide actionable insights into algorithm selection and preprocessing strategies, showcasing the potential and limitations of traditional clustering methods in low-resource NLP. By shedding light on these challenges, this study contributes to the development of inclusive approaches for text analysis across underrepresented languages, advancing NLP applications globally.
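To make the kind of pipeline described in the abstract concrete, the following is a minimal, standard-library sketch of vectorizing short posts and clustering them with a density-based method. It is not the authors' implementation: the character n-gram TF-IDF weighting, the cosine distance, the `eps` and `min_samples` values, the function names, and the toy posts are all assumptions chosen for illustration. Character n-grams are used here only because they do not require a language-specific tokenizer; K-Means and HDBSCAN are not shown.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase the text and return its overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Build L2-normalised TF-IDF vectors (as dicts) over character n-grams."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        vec = {g: tf * (math.log((1 + len(docs)) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())


def dbscan(vectors, eps=0.6, min_samples=2):
    """Plain DBSCAN; returns one label per vector, with -1 marking noise."""
    n = len(vectors)
    # Each point counts as its own neighbour (distance to itself is ~0).
    neighbors = [[j for j in range(n)
                  if cosine_distance(vectors[i], vectors[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


if __name__ == "__main__":
    # Hypothetical toy posts, not data from the study.
    posts = [
        "great match tonight!!",
        "great match tonite",
        "traffic is terrible today",
        "terrible traffic today again",
        "qwxz",
    ]
    print(dbscan(tfidf_vectors(posts), eps=0.6, min_samples=2))
```

Posts that have no sufficiently similar neighbour receive the label -1 rather than being forced into a cluster, which is the property that makes density-based methods attractive for noisy text. In practice `eps` and `min_samples` would need tuning on the actual corpus.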