Challenges and Solutions in Clustering Low-Resource Language Social Media Text: An Evaluation Using Unsupervised Algorithms

Authors

DOI:

https://doi.org/10.3991/ijim.v19i20.56307

Keywords:

Unsupervised algorithms, K-Means, DBSCAN, HDBSCAN, low-resource language

Abstract


Low-resource languages present unique challenges for natural language processing (NLP) due to limited annotated corpora, linguistic resources, and pre-trained models. This paper addresses the gap in clustering methodologies for such languages by evaluating three unsupervised algorithms (K-Means, DBSCAN, and HDBSCAN) on social media text data. Unlike prior studies, which focus on high-resource languages, this study explores the preprocessing, tokenization, and vectorization challenges specific to low-resource settings. The results show that clustering performance is sensitive to linguistic nuances and to the choice of preprocessing approach, and that DBSCAN and HDBSCAN handle noisy and unstructured data particularly well. The findings provide actionable insights into algorithm selection and preprocessing strategies, and they show both the potential and the limitations of traditional clustering methods in low-resource NLP. By shedding light on these challenges, this paper contributes to the development of inclusive approaches for text analysis across underrepresented languages, advancing NLP applications globally.
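
As an illustration of why a density-based method can suit noisy social media text, the following is a minimal sketch and not the pipeline used in the paper. It assumes character trigram counts as the vectorization (a common choice when no reliable tokenizer exists for a language), cosine distance as the metric, and a plain textbook DBSCAN written in standard-library Python. The function names, the `eps` and `min_samples` values, and the sample posts are all hypothetical and chosen only for demonstration.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and count overlapping character n-grams."""
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (Counter objects)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat empty texts as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples):
    """Textbook DBSCAN. Returns one label per vector; -1 marks noise."""
    n = len(vectors)
    # Brute-force neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors[i] if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Hypothetical sample posts, invented for this sketch (not data from the study).
posts = [
    "good morning everyone",
    "goood morning everyone!!",
    "traffic jam downtown again",
    "traffic jam down town again",
    "xq7 promo code 999",
]
vectors = [char_ngrams(p) for p in posts]
# eps and min_samples are illustrative values that would need tuning on real data.
print(dbscan(vectors, eps=0.5, min_samples=2))
```

Unlike K-Means, this approach does not require the number of clusters in advance, and posts that are not close to anything else are labeled as noise rather than forced into a cluster. The brute-force neighborhood search is quadratic in the number of posts, so a sketch like this is only practical for small samples.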

Author Biographies

Mërgim H. Hoti, University of Prishtina, Prishtinë, Republic of Kosova

 

Avni Rexhepi, University of Prishtina, Prishtinë, Republic of Kosova

 

Published

2025-10-17

How to Cite

Hoti, M. H., Rexhepi, A., Hoti, A. H., & Rexha, B. (2025). Challenges and Solutions in Clustering Low-Resource Language Social Media Text: An Evaluation Using Unsupervised Algorithms. International Journal of Interactive Mobile Technologies (iJIM), 19(20), 151–167. https://doi.org/10.3991/ijim.v19i20.56307

Issue

Section

Papers