AI-Based Hate Speech Detection in Albanian Social Media: New Dataset and Mobile Web Application Integration

Endrit Fetahi; Mentor Hamiti; Arsim Susuri; Jaumin Ajdari; Xhemal Zenuni

doi:10.3991/ijim.v18i24.50851

Authors

Endrit Fetahi South East European University, Tetovo, North Macedonia & University of Prizren, Prizren, Kosovo https://orcid.org/0000-0002-6141-7438
Mentor Hamiti South East European University, Tetovo, North Macedonia
Arsim Susuri University of Prizren, Prizren, Kosovo https://orcid.org/0000-0002-4434-5233
Jaumin Ajdari South East European University, Tetovo, North Macedonia https://orcid.org/0000-0003-0375-3748
Xhemal Zenuni South East European University, Tetovo, North Macedonia https://orcid.org/0000-0002-8195-1507

DOI:

https://doi.org/10.3991/ijim.v18i24.50851

Keywords:

Hate Speech, Twitter, Toxic, Cyberbullying, Convolutional Neural Network, Machine Learning, Social Media, AI, Facebook

Abstract

This paper aims to advance AI-based hate speech (HS) detection in Albanian, a low-resource language in natural language processing (NLP). To address the shortage of data, we developed a human-annotated dataset of over 11,000 comments, carefully curated from various Albanian social media platforms and containing a substantial number of HS instances. The dataset was annotated using a detailed two-layer taxonomy to capture the complex dimensions of HS. To ensure high-quality annotations, three expert annotators labeled the data and disagreements were resolved by majority voting; the resulting Fleiss' kappa coefficient of 0.62 indicates substantial agreement and supports the reliability and consistency of the annotations. We conducted a comparative analysis of several machine learning (ML) algorithms, including support vector machine (SVM), Naïve Bayes (NB), XGBoost, and random forest (RF), paired with various text vectorization techniques and pre-processing methods. In binary classification, the NB model with term frequency-inverse document frequency (TF-IDF) vectorization achieved the highest performance, with an F1 score of 0.80. For multiclass classification, XGBoost outperformed the other models, achieving an F1 score of 0.77. Interestingly, our experiments revealed that pre-processing steps generally reduced model performance, suggesting that raw text inputs work better for the Albanian language. Through error analysis using local interpretable model-agnostic explanations (LIME), we identified key challenges, such as polysemy and irony, that contributed to misclassifications. To demonstrate the practical applicability of our work, we developed a user-friendly mobile web application based on the best-performing model, providing real-time HS detection with the potential for integration into social media platforms.
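
The annotation procedure described in the abstract (three annotators, majority voting, Fleiss' kappa) can be illustrated with a minimal sketch. This is not the authors' code: the function names `majority_label` and `fleiss_kappa`, the label strings, and the toy ratings are assumptions made up for illustration, and the toy data will not reproduce the reported value of 0.62.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None


def fleiss_kappa(ratings):
    """Compute Fleiss' kappa.

    `ratings` is a list with one entry per item; each entry is the list of
    labels assigned to that item. Every item must have the same number of
    raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall proportion of each category.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy annotations: 4 comments, 3 annotators each.
toy_ratings = [
    ["hs", "hs", "hs"],
    ["not_hs", "not_hs", "not_hs"],
    ["hs", "hs", "not_hs"],
    ["not_hs", "not_hs", "hs"],
]

gold_labels = [majority_label(item) for item in toy_ratings]
kappa = fleiss_kappa(toy_ratings)
```

With three annotators and two labels a strict majority always exists; in a multiclass setting `majority_label` returns `None` when all three annotators disagree, and such items would need a separate resolution rule that the abstract does not specify.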

Author Biographies

Endrit Fetahi, South East European University, Tetovo, North Macedonia & University of Prizren, Prizren, Kosovo

Endrit Fetahi is a full-time teaching assistant at the University of Prizren. He holds bachelor’s and master’s degrees in computer science and is currently a PhD candidate at South East European University (email: ef30456@seeu.edu.mk).

Mentor Hamiti, South East European University, Tetovo, North Macedonia

Mentor Hamiti is a full-time professor at the Faculty of Contemporary Sciences and Technologies, South East European University in Tetovo, Macedonia.

Arsim Susuri, University of Prizren, Prizren, Kosovo

Arsim Susuri is an Associate Professor at the University of Prizren 'Ukshin Hoti'. He holds a Ph.D. in Computer Science (email: arsim.susuri@uni-prizren.com).

Jaumin Ajdari, South East European University, Tetovo, North Macedonia

Jaumin Ajdari is a full-time professor at the Faculty of Contemporary Sciences and Technologies at South East European University in Tetovo, Macedonia.

Xhemal Zenuni, South East European University, Tetovo, North Macedonia

Xhemal Zenuni is a full-time professor at the Faculty of Contemporary Sciences and Technologies at South East European University in Tetovo, Macedonia.