Multimodal Human Action Recognition for Ubiquitous Systems: Cross-Attention of Skeleton and Audio
DOI: https://doi.org/10.3991/ijim.v20i05.58381
Keywords: Human Action Recognition, Artificial Intelligence, Computer Vision, Skeleton, Audio, Cross-Attention
Abstract
Human action recognition (HAR) systems are foundational for mobile educational technologies such as gesture-based learning analytics and remote skill acquisition. However, current systems often fail in real-world settings because of visual occlusion and because they neglect the rich contextual information carried by the acoustic modality, a limitation reinforced by visual-centric datasets such as NTU RGB+D 60 and MSR Daily Activity 3D. To address this, we manually produce action-relevant audio streams for these datasets and propose a multimodal approach that fuses the skeleton and audio modalities through a cross-attention mechanism. Our framework processes skeleton data by integrating joints and limbs into an H × W × 31 spatial feature map, which is then fed into a ResNet50 backbone, while log-Mel spectrograms are encoded using a ConvNeXt-T architecture. A cross-attention mechanism fuses these features, effectively learning inter-modal dependencies. Evaluations demonstrate significant gains: 94.7% on NTU RGB+D X-SUB (up from 90.5% using skeleton data alone) and 97.9% on MSR Daily Activity 3D (up from 89.8%). These results quantitatively establish the critical role of audio in enabling the robust, real-time feedback loops that are essential for smart learning environments and interactive mobile coaching, where visual data alone is unreliable.
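To make the fusion step concrete, the sketch below shows one plausible PyTorch realization of the pipeline the abstract describes: a ResNet50 trunk over the H × W × 31 skeleton map, a ConvNeXt-T trunk over the log-Mel spectrogram, and a cross-attention block in which skeleton tokens query audio tokens. It is a reconstruction from the abstract alone, not the authors' released code; the 256-d shared embedding, the four attention heads, the 224 × 224 skeleton-map resolution, and the tiling of the mono spectrogram to three channels are all illustrative assumptions.

```python
# Minimal sketch of skeleton-audio cross-attention fusion (PyTorch).
# Reconstructed from the abstract; dimensions and head count are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50, convnext_tiny

class SkeletonAudioFusion(nn.Module):
    def __init__(self, num_classes: int = 60, dim: int = 256, heads: int = 4):
        super().__init__()
        # Skeleton branch: ResNet50 trunk over an H x W x 31 feature map
        # (first conv widened from 3 to 31 input channels).
        skel = resnet50(weights=None)
        skel.conv1 = nn.Conv2d(31, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.skel_trunk = nn.Sequential(*list(skel.children())[:-2])  # (B, 2048, h, w)
        self.skel_proj = nn.Linear(2048, dim)

        # Audio branch: ConvNeXt-T trunk over a log-Mel spectrogram
        # tiled to 3 channels to match the stock stem.
        self.audio_trunk = convnext_tiny(weights=None).features       # (B, 768, h', w')
        self.audio_proj = nn.Linear(768, dim)

        # Cross-attention: skeleton tokens as queries, audio tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    @staticmethod
    def _tokens(feat: torch.Tensor) -> torch.Tensor:
        # (B, C, h, w) -> (B, h*w, C): flatten the spatial grid into a token sequence.
        return feat.flatten(2).transpose(1, 2)

    def forward(self, skel_map: torch.Tensor, log_mel: torch.Tensor) -> torch.Tensor:
        q = self.skel_proj(self._tokens(self.skel_trunk(skel_map)))
        kv = self.audio_proj(self._tokens(self.audio_trunk(log_mel)))
        fused, _ = self.cross_attn(q, kv, kv)     # skeleton attends to audio
        fused = self.norm(fused + q).mean(dim=1)  # residual connection, then pool
        return self.head(fused)

# Illustrative shapes: a 31-channel 224x224 skeleton map and a
# 128-mel x 256-frame spectrogram repeated across 3 channels.
model = SkeletonAudioFusion(num_classes=60)
logits = model(torch.randn(2, 31, 224, 224), torch.randn(2, 3, 128, 256))
print(logits.shape)  # torch.Size([2, 60])
```

The query/key assignment here is one reading of the abstract; a symmetric variant in which audio tokens also attend to skeleton tokens would be equally consistent with the description.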
License
Copyright (c) 2026 Mounir Boudmagh, Adlen Kerboua, Mohamed Redjimi

This work is licensed under a Creative Commons Attribution 4.0 International License.

