Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing
Title: Enhancing Text-Driven 3D Human Motion Editing via Cross-Axis Feature Fusion and Joint-Wise Motion Difference Prediction
Abstract: This study addresses text-based 3D human motion editing: modifying a source motion according to a natural-language instruction while preserving the style and structure of the original clip. The introduction of the MotionFix dataset has spurred research into training-based diffusion models that generate edited motions directly from a source motion and a text instruction. Existing approaches have largely concentrated on when an edit should occur; our objective is a model that learns not only when changes occur but also which joints they involve. To this end, we present a novel architecture together with an auxiliary task that supports its training. The architecture incorporates two axis-anchored transformers: one extracts features along the joint dimension, and the other along the time dimension. A cross-axis fusion block then combines the two representations. The auxiliary objective requires the joint-anchored transformer to regress the Soft-DTW distance between the source and target joint rotations, which enables the module to distinguish the joints that require alteration from those that should remain unchanged. Extensive experiments on the MotionFix dataset show that our approach substantially improves semantic alignment with both the source motion and the text instruction while also improving the fidelity of the generated motion, establishing a new state of the art.
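To make the auxiliary target concrete, the following is a minimal sketch of how a per-joint Soft-DTW distance between a source and a target motion could be computed. It is not the authors' implementation. The array layout `(frames, joints, dims)`, the squared Euclidean frame cost, the smoothing parameter `gamma`, and the function names `soft_min`, `soft_dtw`, and `per_joint_soft_dtw` are all assumptions made for illustration; the abstract does not specify the rotation representation or any of these details.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between sequences x (T1, D) and y (T2, D).

    Assumption: squared Euclidean distance as the frame-to-frame cost.
    """
    t1, t2 = len(x), len(y)
    # Pairwise cost matrix between every frame of x and every frame of y.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    # Accumulated cost table with an infinite border, as in standard DTW.
    r = np.full((t1 + 1, t2 + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[t1, t2]


def per_joint_soft_dtw(source, target, gamma=1.0):
    """One Soft-DTW value per joint.

    source: (T_src, J, D) and target: (T_tgt, J, D) joint rotation features
    (hypothetical layout). Returns an array of shape (J,).
    """
    n_joints = source.shape[1]
    return np.array(
        [soft_dtw(source[:, j], target[:, j], gamma) for j in range(n_joints)]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src = rng.normal(size=(8, 4, 6))   # 8 frames, 4 joints, 6-D rotations (assumed)
    tgt = src.copy()
    tgt[:, 2] += 1.0                   # "edit" only joint 2
    print(per_joint_soft_dtw(src, tgt, gamma=0.1))
```

In this toy setup only the edited joint differs between source and target, so its value should stand out from the others, which is the kind of signal a joint-wise regression target could carry. Note that for `gamma > 0` the Soft-DTW value of two identical sequences is generally not zero and can be negative.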
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




