Attention-Driven Image Captioning for Mobile Accessibility of the Visually Impaired

Authors

D. Santi, A. A. Ilham, Syafaruddin, I. Nurtanio

DOI:

https://doi.org/10.3991/ijim.v19i09.53441

Keywords:

Attention, image captioning, mobile accessibility, ResNet, visually impaired

Abstract


In a world increasingly reliant on visual information, individuals with visual impairments face significant challenges in understanding their environment. This paper introduces an attention-based image captioning model to improve accessibility for visually impaired users. The model integrates ResNet-152 for visual feature extraction, a long short-term memory (LSTM) network for caption generation, and an attention mechanism that focuses on salient image regions to produce contextual descriptions. Images captured on a mobile device are processed by the model, and the resulting description is translated into Bahasa Indonesia and converted to speech in real time using text-to-speech technology. With an average inference time of 2.99 seconds per image, the system supports real-time use. The model is evaluated on the Flickr dataset and on new datasets covering a variety of environments and object interactions. Experimental results show superior performance on the Flickr dataset (bilingual evaluation understudy (BLEU)-1: 0.59; metric for evaluation of translation with explicit ordering (METEOR): 0.25). Performance on the real-world datasets is slightly lower, indicating challenges in generalizing to scenarios with occluded objects and inconsistent text. Future research will focus on scaling up real-world datasets, adversarial training, and integrating the system into devices such as smart glasses or canes for wider accessibility.
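
The attention step the abstract describes (weighting ResNet region features against the LSTM decoder state to build a context vector) can be sketched as below. This is an illustrative additive-attention sketch, not the paper's implementation; all dimensions, weight matrices (`W_f`, `W_h`, `v`), and names are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_f, W_h, v):
    """One attention step over CNN region features.

    features: (num_regions, feat_dim) grid features, e.g. from ResNet-152
    hidden:   (hid_dim,) current LSTM decoder state
    W_f, W_h, v: learned projection weights (illustrative shapes)
    """
    # Additive attention scores: one scalar per image region.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (num_regions,)
    alpha = softmax(scores)                               # weights sum to 1
    context = alpha @ features                            # (feat_dim,) weighted sum
    return context, alpha
```

At each decoding step, `context` would be fed to the LSTM together with the previous word embedding, so the decoder attends to different image regions as the caption unfolds.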

Published

2025-05-09

How to Cite

Santi, D., Ilham, A. A., Syafaruddin, & Nurtanio, I. (2025). Attention-Driven Image Captioning for Mobile Accessibility of the Visually Impaired. International Journal of Interactive Mobile Technologies (iJIM), 19(09), pp. 4–18. https://doi.org/10.3991/ijim.v19i09.53441

Section

Papers