AI-Based Hate Speech Detection in Albanian Social Media: New Dataset and Mobile Web Application Integration
DOI:
https://doi.org/10.3991/ijim.v18i24.50851
Keywords:
Hate Speech, Twitter, Toxic, Cyberbullying, Convolutional Neural Network, Machine Learning, Social Media, AI, Facebook
Abstract
This paper aims to advance AI-based hate speech (HS) detection in Albanian, a low-resource language in natural language processing (NLP). To address the shortage of data, we developed a human-annotated dataset of over 11,000 comments, carefully curated from various Albanian social media platforms and containing a substantial number of HS instances. The dataset was annotated using a detailed two-layer taxonomy to capture the complex dimensions of HS. To ensure high-quality annotations, three expert annotators labeled the data and the final labels were assigned by majority vote; the annotators reached a Fleiss' kappa coefficient of 0.62, which indicates substantial agreement and supports the reliability and consistency of the annotations. We conducted a comparative analysis of several machine learning (ML) algorithms, including support vector machine (SVM), Naïve Bayes (NB), XGBoost, and random forest (RF), paired with various text vectorization techniques and pre-processing methods. In binary classification, the NB model with term frequency-inverse document frequency (TF-IDF) vectorization achieved the highest performance, with an F1 score of 0.80. For multiclass classification, XGBoost outperformed the other models, achieving an F1 score of 0.77. Interestingly, our experiments revealed that pre-processing steps generally reduced model performance, suggesting that raw text inputs work better for the Albanian language. Through error analysis using local interpretable model-agnostic explanations (LIME), we identified key challenges, such as polysemy and irony, that contributed to misclassifications. To demonstrate the practical applicability of our work, we developed a user-friendly mobile web application based on the best-performing model, providing real-time HS detection with the potential for integration into social media platforms.
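As an illustration of the annotation step described in the abstract, the following minimal sketch shows how majority voting and Fleiss' kappa can be computed for a fixed number of annotators per comment. It is not the authors' code: the function names, the label names `hs` and `not`, and the four toy items are assumptions made purely for illustration, and the toy data is not drawn from the dataset.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators.

    With three annotators and two labels there is always a strict majority.
    If counts tie (possible with more labels), the label seen first wins.
    """
    return Counter(votes).most_common(1)[0][0]


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items rated by the same number of annotators.

    ratings: list of per-item label lists, e.g. [["hs", "not", "hs"], ...]
    categories: list of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of annotators who gave item i the label categories[j]
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement for each item, then averaged over items
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance, from the overall label proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy annotations: 4 comments, 3 annotators each
toy_ratings = [
    ["hs", "hs", "hs"],
    ["hs", "not", "hs"],
    ["not", "not", "not"],
    ["not", "not", "hs"],
]
final_labels = [majority_label(votes) for votes in toy_ratings]
kappa = fleiss_kappa(toy_ratings, ["hs", "not"])
```

Under the commonly used Landis and Koch interpretation, kappa values between 0.61 and 0.80 are read as substantial agreement, which is the band the reported 0.62 falls into.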
License
Copyright (c) 2024 Arsim Susuri, Endrit Fetahi, Mentor Hamiti, Jaumin Ajdari, Xhemal Zenuni

This work is licensed under a Creative Commons Attribution 4.0 International License.