Multimodal Human Action Recognition for Ubiquitous Systems: Cross-Attention of Skeleton and Audio
DOI: https://doi.org/10.3991/ijim.v20i05.58381
Keywords: Human Action Recognition, Artificial Intelligence, Computer Vision, Skeleton, Audio, Cross-Attention
Abstract
Human action recognition (HAR) systems are foundational for mobile educational technologies such as gesture-based learning analytics and remote skill acquisition. However, current systems often fail in real-world settings because of visual occlusion and because they neglect the rich contextual information carried by the acoustic modality, a limitation reinforced by visual-centric datasets such as NTU RGB+D 60 and MSR Daily Activity 3D. To address this, we manually produce action-relevant audio streams for these datasets and propose a multimodal approach that fuses the skeleton and audio modalities through a cross-attention mechanism. Our framework processes skeleton data by integrating joints and limbs into an H × W × 31 spatial feature map, which is then fed into a ResNet50 backbone, while log-Mel spectrograms are encoded using a ConvNeXt-T architecture. A cross-attention mechanism fuses these features, effectively learning inter-modal dependencies. Evaluations demonstrate significant gains: 94.7% on NTU RGB+D X-SUB (up from 90.5% using skeleton data alone) and 97.9% on MSR Daily Activity 3D (up from 89.8%). These results quantitatively establish the critical role of audio in enabling the robust, real-time feedback loops that are essential for smart learning environments and interactive mobile coaching, where visual data alone is unreliable.
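To make the fusion step concrete, the sketch below shows one plausible PyTorch realization of the pipeline the abstract describes: a ResNet50 trunk over the H × W × 31 skeleton map, a ConvNeXt-T trunk over the log-Mel spectrogram, and a cross-attention block in which skeleton tokens query audio tokens. It is a reconstruction from the abstract alone, not the authors' released code; the 256-d shared embedding, the four attention heads, the 224 × 224 skeleton-map resolution, and the tiling of the mono spectrogram to three channels are all illustrative assumptions.

```python
# Minimal sketch of skeleton-audio cross-attention fusion (PyTorch).
# Reconstructed from the abstract; dimensions and head count are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50, convnext_tiny

class SkeletonAudioFusion(nn.Module):
    def __init__(self, num_classes: int = 60, dim: int = 256, heads: int = 4):
        super().__init__()
        # Skeleton branch: ResNet50 trunk over an H x W x 31 feature map
        # (first conv widened from 3 to 31 input channels).
        skel = resnet50(weights=None)
        skel.conv1 = nn.Conv2d(31, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.skel_trunk = nn.Sequential(*list(skel.children())[:-2])  # (B, 2048, h, w)
        self.skel_proj = nn.Linear(2048, dim)

        # Audio branch: ConvNeXt-T trunk over a log-Mel spectrogram
        # tiled to 3 channels to match the stock stem.
        self.audio_trunk = convnext_tiny(weights=None).features       # (B, 768, h', w')
        self.audio_proj = nn.Linear(768, dim)

        # Cross-attention: skeleton tokens as queries, audio tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    @staticmethod
    def _tokens(feat: torch.Tensor) -> torch.Tensor:
        # (B, C, h, w) -> (B, h*w, C): flatten the spatial grid into a token sequence.
        return feat.flatten(2).transpose(1, 2)

    def forward(self, skel_map: torch.Tensor, log_mel: torch.Tensor) -> torch.Tensor:
        q = self.skel_proj(self._tokens(self.skel_trunk(skel_map)))
        kv = self.audio_proj(self._tokens(self.audio_trunk(log_mel)))
        fused, _ = self.cross_attn(q, kv, kv)     # skeleton attends to audio
        fused = self.norm(fused + q).mean(dim=1)  # residual connection, then pool
        return self.head(fused)

# Illustrative shapes: a 31-channel 224x224 skeleton map and a
# 128-mel x 256-frame spectrogram repeated across 3 channels.
model = SkeletonAudioFusion(num_classes=60)
logits = model(torch.randn(2, 31, 224, 224), torch.randn(2, 3, 128, 256))
print(logits.shape)  # torch.Size([2, 60])
```

The query/key assignment here is one reading of the abstract; a symmetric variant in which audio tokens also attend to skeleton tokens would be equally consistent with the description.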
License
Copyright (c) 2026 Mounir Boudmagh, Adlen Kerboua, Mohamed Redjimi

This work is licensed under a Creative Commons Attribution 4.0 International License.

