MIND: Multi-Scale Intent Diffusion for Text-Driven Physics-Based Humanoid Control
Original: arXiv:2605.26006v2 Abstract: Enabling physics-based humanoids to execute diverse behaviors from high-level textual commands remains a significant challenge. Existing methods typically follow either a two-stage paradigm that combines kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning paradigm that directly generates actions from text. However, the former suffers from the inherent domain shift between kinematic generation and physics-based tracking, while the latter struggles with the substantial modality gap between textual commands and low-level actions, limiting effective semantic alignment. Notably, humanoid states encode rich motion dynamics that are more semantically aligned with textual descriptions than low-level actions, making them a natural basis for deriving behavioral intent. Building upon this insight, we propose MIND, a novel end-to-end diffusion framework for text-driven physics-based humanoid control that leverages behavioral intent as a semantic bridge between textual commands and low-level actions. At its core, MIND introduces a multi-scale intent diffusion mechanism, where a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor provides step-wise, fine-grained signals for local behavior refinement at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias for humanoid control, improving semantic alignment and behavioral naturalness. Furthermore, MIND encodes humanoid states into a latent space to enable more effective semantic intent modeling. Extensive experiments demonstrate that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.
Rewritten:
Abstract:
Getting physics-based humanoids to perform diverse behaviors from high-level text commands remains a significant challenge. Existing approaches follow one of two strategies: a two-stage pipeline that pairs kinematic motion generation with physics-based tracking, or an end-to-end imitation-learning model that maps text directly to actions. The former suffers from the domain shift between kinematic generation and physics-based tracking; the latter must cross a large modality gap between text and low-level actions, which limits semantic alignment. Notably, humanoid states encode rich motion dynamics that align more closely with textual descriptions than low-level actions do, making them a natural basis for deriving behavioral intent. Building on this observation, we introduce MIND, an end-to-end diffusion framework for text-driven, physics-based humanoid control that uses behavioral intent as a semantic bridge between text commands and low-level actions. At its core is a multi-scale intent diffusion mechanism: a holistic intent predictor captures global behavioral dynamics to guide overall behavior synthesis, while an immediate intent predictor supplies fine-grained, step-wise signals that refine local behavior at each diffusion step. This hierarchical intent formulation imposes a structured inductive bias that improves semantic alignment and behavioral naturalness. MIND also encodes humanoid states into a latent space to make semantic intent modeling more effective. Extensive experiments show that MIND outperforms existing methods and synthesizes coherent, physically plausible, and semantically aligned humanoid behaviors from text commands. Project page: https://binlee26.github.io/MIND_page.
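Illustrative sketch: the abstract gives no implementation details, so the following toy Python loop only shows the control flow that the two-scale description suggests, with a holistic intent computed once per command and an immediate intent recomputed at every denoising step. It is not the authors' architecture. Every name here (`encode_text`, `holistic_intent`, `immediate_intent`, `denoise_step`, `sample_action`) is hypothetical, the "predictors" are hand-written placeholder formulas rather than learned networks, and the latent state encoder and the physics simulator are omitted entirely.

```python
import math
import random


def encode_text(text, dim=8):
    """Hypothetical stand-in for a text encoder: a deterministic pseudo-embedding."""
    rng = random.Random(sum(ord(c) for c in text))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def holistic_intent(text_emb):
    """Global intent, computed once per command (placeholder: squashed embedding)."""
    return [math.tanh(v) for v in text_emb]


def immediate_intent(x_t, global_intent, t, num_steps):
    """Step-wise intent, recomputed at every diffusion step (placeholder blend)."""
    w = t / num_steps  # early (noisy) steps lean more on the current sample
    return [w * x + (1.0 - w) * g for x, g in zip(x_t, global_intent)]


def denoise_step(x_t, global_intent, local_intent, step_size=0.3):
    """Toy denoiser: nudge the sample toward the mean of the two intents."""
    target = [0.5 * (g + l) for g, l in zip(global_intent, local_intent)]
    return [x + step_size * (tg - x) for x, tg in zip(x_t, target)]


def sample_action(text, num_steps=20, dim=8, seed=0):
    """Run the toy reverse process from Gaussian noise to a final sample."""
    rng = random.Random(seed)
    g = holistic_intent(encode_text(text, dim))      # computed once
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # start from noise
    for t in range(num_steps, 0, -1):
        l = immediate_intent(x, g, t, num_steps)     # recomputed each step
        x = denoise_step(x, g, l)
    return x, g


if __name__ == "__main__":
    action, intent = sample_action("walk forward")
    print(action)
    print(intent)
```

In this toy, the global signal fixes where the sample is heading while the per-step signal changes how strongly each update pulls toward it. That is only meant to mirror the abstract's division of labor between overall guidance and local refinement, not to reproduce it.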
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





