Attention-Driven Image Captioning for Mobile Accessibility of the Visually Impaired
DOI: https://doi.org/10.3991/ijim.v19i09.53441

Keywords: attention, image captioning, mobile accessibility, ResNet, visually impaired

Abstract
In a world increasingly reliant on visual information, individuals with visual impairments face significant challenges in understanding their environment. This paper introduces an attention-based image captioning model to improve accessibility for visually impaired users. The model integrates ResNet-152 for visual feature extraction, long short-term memory (LSTM) for text generation, and an attention mechanism to produce contextual image descriptions. Images captured on a mobile device are processed by the model, and the resulting description is translated into Bahasa Indonesia and converted to speech in real time using text-to-speech technology. The system achieves an average inference time of 2.99 seconds per image, enabling real-time use. The model is evaluated on the Flickr dataset and on new datasets covering a variety of environments and object interactions. Experimental results show strong performance on the Flickr dataset, with a bilingual evaluation understudy (BLEU)-1 score of 0.59 and a metric for evaluation of translation with explicit ordering (METEOR) score of 0.25. Performance on the real-world datasets is slightly lower, indicating challenges in generalizing to scenes with occluded objects and inconsistent text. Future research will focus on scaling up real-world datasets, adversarial training, and integrating the system into devices such as smart glasses or canes for wider accessibility.
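The abstract does not include source code; the following is a minimal PyTorch sketch of the described encoder-attention-decoder design, assuming a Bahdanau-style soft attention over ResNet-152 spatial features. All dimensions, class names, and the choice of additive attention are illustrative placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Extracts a grid of spatial features with a pretrained ResNet-152."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the average-pool and FC head to keep the 7x7x2048 feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                      # (B, 3, 224, 224)
        feats = self.backbone(images)               # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # (B, 49, 2048)

class Attention(nn.Module):
    """Additive (Bahdanau-style) attention over the 49 spatial regions."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, 49, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)             # (B, 49, 1) region weights
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)

class Decoder(nn.Module):
    """One LSTM step conditioned on the attended image context."""
    def __init__(self, vocab_size, feat_dim=2048,
                 embed_dim=512, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, token, feats, h, c):
        # Attend to image regions, then advance the LSTM one word.
        context, alpha = self.attention(feats, h)
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=1), (h, c))
        return self.out(h), h, c, alpha
```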
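The translate-then-speak step of the accessibility pipeline could be chained as below. The abstract does not name the translation or text-to-speech backends used; deep_translator and gTTS are stand-in libraries chosen here only because they support Indonesian, and speak_caption is a hypothetical helper.

```python
from deep_translator import GoogleTranslator   # assumed library; the paper names none
from gtts import gTTS                          # assumed TTS backend; the paper names none

def speak_caption(english_caption: str, out_path: str = "caption_id.mp3") -> str:
    """Translate an English caption to Bahasa Indonesia and synthesize speech."""
    indonesian = GoogleTranslator(source="en", target="id").translate(english_caption)
    gTTS(text=indonesian, lang="id").save(out_path)  # "id" selects an Indonesian voice
    return out_path

# Example: speak_caption("a man riding a bicycle on the street")
```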
License
Copyright (c) 2025 Dessy Santi, Amil Ahmad Ilham, Syafaruddin, Ingrid Nurtanio

This work is licensed under a Creative Commons Attribution 4.0 International License.

