Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
Title: Qwen-VLA: A Unified Approach to Vision-Language-Action Modeling for Diverse Tasks, Environments, and Robotic Forms
Embodied intelligence research has traditionally relied on specialized models designed for isolated tasks like navigation or manipulation, a practice that leads to fragmented capabilities and poor generalization across different tasks, environments, and robot types. This study investigates whether heterogeneous embodied decision-making challenges can be consolidated into a single vision-language-action (VLA) model. We introduce Qwen-VLA, an embodied foundation model that extends Qwen’s vision-language capabilities (perception, comprehension, and reasoning) to continuous action and trajectory generation via a diffusion transformer (DiT) based action decoder.
The model is developed using a large-scale joint pretraining strategy that leverages diverse data sources. These include robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation datasets, trajectory-centric supervision, and auxiliary vision-language data. To accommodate various robot platforms, we implement embodiment-aware prompt conditioning, which utilizes robot-specific textual descriptions to define the current embodiment and its control conventions.
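The idea of embodiment-aware prompt conditioning can be illustrated with a minimal sketch. It assumes the robot description is simply prepended to the task instruction as plain text. The `Embodiment` fields, the `build_prompt` helper, the prompt wording, and all numeric values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform (fields are assumptions)."""
    name: str
    action_dim: int
    control_mode: str
    control_hz: int


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a textual embodiment description."""
    header = (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-dimensional {embodiment.control_mode}. "
        f"Control frequency: {embodiment.control_hz} Hz."
    )
    return f"{header}\nTask: {instruction}"


# Illustrative values only; the actual descriptions used by Qwen-VLA are not given here.
widowx = Embodiment("WidowX single arm", 7, "end-effector delta pose plus gripper", 5)
aloha = Embodiment("ALOHA bimanual", 14, "absolute joint positions", 50)

# The same instruction yields a different conditioning prompt per embodiment.
prompt_a = build_prompt(widowx, "put the spoon on the towel")
prompt_b = build_prompt(aloha, "put the spoon on the towel")
```

Under this assumption, a single model sees the same task text for every robot, while the prefix tells it which action dimensionality and control convention the decoded actions should follow.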
By framing manipulation, navigation, and trajectory prediction within a unified action-and-trajectory prediction framework, Qwen-VLA facilitates transferable visual grounding, spatial reasoning, and continuous action generation across different robot morphologies, task categories, and environments. Evaluations on manipulation, navigation, and trajectory-centric benchmarks demonstrate consistent multi-task performance and robust out-of-distribution generalization when subjected to variations in scene layout, background, lighting, object configuration, and robot embodiment.
Qwen-VLA-Instruct reaches 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 86.1% and 87.2% on RoboTwin-Easy and RoboTwin-Hard, respectively. On navigation, it attains a 69.0% oracle success rate (OSR) on R2R and a 59.6% success rate (SR) on RxR. In real-world tests on the ALOHA platform, it averages a 76.9% out-of-distribution success rate, and zero-shot dynamic manipulation on DOMINO yields a 26.6% success rate.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





