Challenges and Solutions in Clustering Low-Resource Language Social Media Text: An Evaluation Using Unsupervised Algorithms
DOI: https://doi.org/10.3991/ijim.v19i20.56307
Keywords: Unsupervised algorithms, K-Means, DBSCAN, HDBSCAN, low resource language
Abstract
Low-resource languages present unique challenges for natural language processing (NLP) due to limited annotated corpora, linguistic resources, and pre-trained models. This paper addresses the gap in clustering methodologies for such languages by evaluating the performance of three unsupervised algorithms (K-Means, DBSCAN, and HDBSCAN) on social media text data. Unlike prior studies focusing on high-resource languages, this study explores challenges in preprocessing, tokenization, and vectorization specific to low-resource settings. The results highlight the sensitivity of clustering performance to linguistic nuances and preprocessing approaches, with DBSCAN and HDBSCAN excelling in handling noisy and unstructured data. The findings provide actionable insights into algorithm selection and preprocessing strategies, showcasing the potential and limitations of traditional clustering methods in low-resource NLP. By shedding light on these challenges, this study contributes to the development of inclusive approaches for text analysis across underrepresented languages, advancing NLP applications globally.
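To make the noise-handling claim concrete, the following is a minimal sketch of density-based clustering on short texts. It is not the authors' pipeline: the abstract does not specify their vectorization, so this sketch assumes character trigram counts with cosine distance, a common choice when no reliable tokenizer exists for a language. The toy posts, the helper names (`char_ngrams`, `cosine_distance`, `dbscan`), and the `eps` and `min_samples` values are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and count overlapping character n-grams."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (Counter objects)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one label per vector, with -1 marking noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(vectors)

    def neighbors(i):
        # A point counts as its own neighbor (distance 0).
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = -1
    for i in range(len(vectors)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = noise  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reachable)
    return labels


# Invented toy posts: two topics plus one junk message.
posts = [
    "the match tonight was great",
    "the match tonight was so great",
    "great match tonight",
    "new phone battery drains fast",
    "my new phone battery drains so fast",
    "phone battery drains fast again",
    "zzz qqq xxx",
]
vectors = [char_ngrams(p) for p in posts]
labels = dbscan(vectors, eps=0.6, min_samples=2)  # illustrative parameters
print(labels)
```

Unlike K-Means, which assigns every point to some centroid, this procedure leaves the junk message unassigned (label -1), which is the behavior the abstract credits for robustness on noisy data. In practice, `eps` would need tuning per corpus and vectorization.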
License
Copyright (c) 2025 Mërgim H. Hoti, Avni Rexhepi, Arbër H. Hoti, Blerim Rexha

This work is licensed under a Creative Commons Attribution 4.0 International License.

