Attention-Driven Image Captioning for Mobile Accessibility of the Visually Impaired
DOI: https://doi.org/10.3991/ijim.v19i09.53441

Keywords: attention, image captioning, mobile accessibility, ResNet, visually impaired

Abstract
In a world increasingly reliant on visual information, individuals with visual impairments face significant challenges in understanding their environment. This paper introduces an attention-based image captioning model to improve accessibility for visually impaired users. The model integrates ResNet-152 for visual feature extraction, long short-term memory (LSTM) for text generation, and an attention mechanism to produce contextual image descriptions. Images captured on a mobile device are processed by the model, and the resulting description is translated into Bahasa Indonesia and converted to speech in real time using text-to-speech technology. The system achieves an average inference time of 2.99 seconds per image, enabling real-time use. The model is evaluated on the Flickr dataset and on new datasets covering a variety of environments and object interactions. Experimental results show strong performance on the Flickr dataset, with a bilingual evaluation understudy (BLEU)-1 score of 0.59 and a metric for evaluation of translation with explicit ordering (METEOR) score of 0.25. Performance on the real-world datasets is slightly lower, indicating challenges in generalizing to scenes with occluded objects and inconsistent text. Future research will focus on scaling up real-world datasets, adversarial training, and integrating the system into devices such as smart glasses or canes for wider accessibility.
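The abstract does not include source code; the following is a minimal PyTorch sketch of the described encoder-attention-decoder design, assuming a Bahdanau-style soft attention over ResNet-152 spatial features. All dimensions, class names, and the choice of additive attention are illustrative placeholders, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """Extracts a grid of spatial features with a pretrained ResNet-152."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
        # Drop the average-pool and FC head to keep the 7x7x2048 feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                      # (B, 3, 224, 224)
        feats = self.backbone(images)               # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)     # (B, 49, 2048)

class Attention(nn.Module):
    """Additive (Bahdanau-style) attention over the 49 spatial regions."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, 49, feat_dim); hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)             # (B, 49, 1) region weights
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim) weighted context
        return context, alpha.squeeze(-1)

class Decoder(nn.Module):
    """One LSTM step conditioned on the attended image context."""
    def __init__(self, vocab_size, feat_dim=2048,
                 embed_dim=512, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = Attention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, token, feats, h, c):
        # Attend to image regions, then advance the LSTM one word.
        context, alpha = self.attention(feats, h)
        h, c = self.lstm(torch.cat([self.embed(token), context], dim=1), (h, c))
        return self.out(h), h, c, alpha
```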
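The translate-then-speak step of the accessibility pipeline could be chained as below. The abstract does not name the translation or text-to-speech backends used; deep_translator and gTTS are stand-in libraries chosen here only because they support Indonesian, and speak_caption is a hypothetical helper.

```python
from deep_translator import GoogleTranslator   # assumed library; the paper names none
from gtts import gTTS                          # assumed TTS backend; the paper names none

def speak_caption(english_caption: str, out_path: str = "caption_id.mp3") -> str:
    """Translate an English caption to Bahasa Indonesia and synthesize speech."""
    indonesian = GoogleTranslator(source="en", target="id").translate(english_caption)
    gTTS(text=indonesian, lang="id").save(out_path)  # "id" selects an Indonesian voice
    return out_path

# Example: speak_caption("a man riding a bicycle on the street")
```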
License
Copyright (c) 2025 Dessy Santi, Amil Ahmad Ilham, Syafaruddin, Ingrid Nurtanio

This work is licensed under a Creative Commons Attribution 4.0 International License.

