A Multimodal Sensor Fusion-Based Mobile System for Real-Time Music Generation: Synergistic Interaction of Gesture, Posture, and Acoustic Environment

Jing Song

doi:10.3991/ijim.v20i10.61929

Authors

Jing Song, Shanxi University of Applied Science and Technology, Taiyuan, China

Keywords:

mobile real-time generation; multimodal sensor fusion; gesture and posture interaction; acoustic environment perception; edge intelligence; resource-adaptive scheduling; federated meta-learning; music semantic understanding

Abstract

Contemporary mobile music applications are largely limited to interface-level control, which makes it difficult to capture users’ creative intentions or to support truly collaborative interaction. Moreover, the trade-off among resource constraints, multimodal signal fusion efficiency, and real-time generation quality is a fundamental bottleneck for creative AI applications on mobile platforms. To address these challenges, this paper proposes and implements a multimodal sensor fusion-based mobile system for real-time music generation that enables synergistic interaction among gesture, posture, and acoustic environment cues. The system adopts a hierarchically decoupled edge-intelligent music interaction architecture consisting of a multimodal music semantic understanding network and a resource-adaptive generation engine. The former employs a lightweight cross-modal attention mechanism to map heterogeneous sensor signals (gestures, body posture, and the acoustic environment) directly into a structured music semantic space, translating low-level perceptual signals into high-level creative intent. The latter dynamically switches generation modes according to real-time device state, balancing computational resource consumption against generation quality. To support personalization while preserving user privacy, we design an edge-personalized learning framework that combines cloud-based federated meta-learning pretraining with on-device incremental fine-tuning. Experimental results show that the proposed system achieves an end-to-end P99 latency below 45 ms on mainstream mobile devices while significantly reducing power consumption and memory footprint. The generated music also exhibits superior melodic fluency and harmonic consistency.
This work extends the deployment boundary of creative AI in resource-constrained environments and establishes a technical paradigm for multimodal interaction and intelligent music generation on mobile platforms, providing key technological support for next-generation interactive artistic experiences.
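
To make the resource-adaptive generation engine more concrete, the following is a minimal sketch of how generation modes might be switched according to device state and recent latency. It is an illustration under stated assumptions, not the system's actual implementation: the mode names, the DeviceState fields, and every threshold are hypothetical, and only the 45 ms budget is taken from the P99 latency figure reported in the abstract.

```python
from dataclasses import dataclass
from enum import Enum


class GenerationMode(Enum):
    # Hypothetical mode names; the abstract does not enumerate the modes.
    FULL = "full"
    BALANCED = "balanced"
    LITE = "lite"


@dataclass
class DeviceState:
    # Hypothetical fields standing in for "real-time device state".
    battery_pct: float     # remaining battery, 0-100
    cpu_load: float        # utilisation, 0.0-1.0
    free_memory_mb: float  # memory available to the app


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def select_mode(state, recent_latencies_ms, budget_ms=45.0):
    """Pick a generation mode from device state and recent latency.

    All thresholds below are illustrative assumptions, not values
    reported in the paper.
    """
    over_budget = bool(recent_latencies_ms) and p99(recent_latencies_ms) >= budget_ms
    if over_budget or state.battery_pct < 15 or state.free_memory_mb < 200:
        return GenerationMode.LITE
    if state.cpu_load > 0.7 or state.battery_pct < 40:
        return GenerationMode.BALANCED
    return GenerationMode.FULL
```

For example, select_mode(DeviceState(80, 0.9, 1000), [20.0]) falls back to the balanced mode because of the high CPU load, while a recent latency window whose P99 reaches the 45 ms budget forces the lite mode regardless of the other readings. A rule-based policy is used here only for readability; the abstract does not state whether the switching policy is rule-based or learned.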