Image Captioning for Medical Surveillance in Smart Home Environments Using Vision Transformers

Authors

DOI:

https://doi.org/10.3991/ijoe.v21i05.54331

Keywords:

vision transformers, medical surveillance, image captioning, smart healthcare, AI in healthcare

Abstract


Medical surveillance in smart homes represents a transformative approach to patient care, using advances in computer vision to monitor and analyze patient behavior continuously. This study builds upon previous research by fine-tuning vision transformer (ViT) neural networks on a curated dataset that includes diverse scenarios of patients in both normal and abnormal conditions. The proposed model generates descriptive captions from surveillance camera images, capturing contextual information and identifying potential medical indicators. These insights are integrated into an automated notification system designed to alert healthcare providers promptly, enabling timely and informed interventions. To evaluate the effectiveness of the approach, the fine-tuned ViT model is compared against a state-of-the-art model based on traditional convolutional neural networks (CNNs) and demonstrates superior performance, with an accuracy of 87.2%, a BLEU-4 score of 0.351, and a ROUGE-2 score of 0.591. These results highlight the model’s ability to generate accurate and contextually relevant captions, outperforming CNN-LSTM baselines in accuracy, robustness, and contextual understanding. The findings underscore the critical role of artificial intelligence (AI) in detecting changes in patient conditions and providing personalized care through real-time monitoring. This proof-of-concept demonstrates the feasibility of deploying AI-driven solutions in medical surveillance systems, paving the way for innovative healthcare technologies. By addressing key challenges in patient monitoring, the study establishes ViT as a reliable and scalable tool for enhancing the quality and efficiency of healthcare delivery in smart home environments.
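For readers unfamiliar with the ROUGE-2 metric cited in the abstract, the following is a minimal sketch of how bigram overlap between a generated caption and a reference caption might be computed. It is not the authors' evaluation code: the function names `bigrams` and `rouge_2`, the whitespace tokenization, and both example captions are assumptions made for illustration, and the abstract does not state whether the reported score is a recall, precision, or F1 variant.

```python
from collections import Counter


def bigrams(tokens):
    """Return a multiset (Counter) of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate, reference):
    """Return (precision, recall, f1) of bigram overlap between two captions.

    Assumes simple lowercase whitespace tokenization; published evaluations
    typically use a dedicated tokenizer and may apply stemming.
    """
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared bigram,
    # so repeated bigrams are not over-counted.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Hypothetical captions, not taken from the paper's dataset.
reference = "an elderly man is lying on the floor near the bed"
candidate = "a man is lying on the floor next to the bed"
p, r, f = rouge_2(candidate, reference)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this toy pair, six of the ten bigrams in each caption are shared, so all three values come out at about 0.6. A corpus-level score such as the one reported would aggregate over many image and caption pairs.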

Published

2025-04-18

How to Cite

Eloutouate, L., Gibet Tani, H., Elouaai, F., Bouhorma, M., & Hajoub, M. W. (2025). Image Captioning for Medical Surveillance in Smart Home Environments Using Vision Transformers. International Journal of Online and Biomedical Engineering (iJOE), 21(05), pp. 113–126. https://doi.org/10.3991/ijoe.v21i05.54331

Section

Papers