arXiv

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

June 2, 2026 · Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wan

Title: Qwen-VLA: A Unified Approach to Vision-Language-Action Modeling for Diverse Tasks, Environments, and Robotic Forms

Embodied intelligence research has traditionally relied on specialized models built for isolated tasks such as navigation or manipulation, which leads to fragmented capabilities and poor generalization across tasks, environments, and robot types. This study investigates whether heterogeneous embodied decision-making problems can be consolidated into a single vision-language-action (VLA) model. We introduce Qwen-VLA, an embodied foundation model that extends Qwen's vision-language capabilities (perception, comprehension, and reasoning) to continuous action and trajectory generation through an action decoder based on a diffusion transformer (DiT).

The model is trained with a large-scale joint pretraining strategy over diverse data sources: robot manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation datasets, trajectory-centric supervision, and auxiliary vision-language data. To accommodate different robot platforms, we use embodiment-aware prompt conditioning, in which a robot-specific textual description specifies the current embodiment and its control conventions.
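As a rough illustration of embodiment-aware prompt conditioning, the sketch below prefixes a task instruction with a textual description of the robot. It is a minimal sketch under stated assumptions: the `EmbodimentSpec` fields, the `build_prompt` helper, the prompt wording, and the example values are all hypothetical and are not taken from the paper, which does not give its prompt template here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical description of one robot platform (not from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(spec: EmbodimentSpec, instruction: str) -> str:
    """Prefix a task instruction with a textual embodiment description."""
    header = (
        f"Embodiment: {spec.name}. "
        f"Action dimension: {spec.action_dim}. "
        f"Control: {spec.control_convention}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Illustrative values only; the real conventions depend on the platform.
widowx = EmbodimentSpec(
    name="WidowX single arm",
    action_dim=7,
    control_convention="end-effector delta pose plus gripper open/close",
)
prompt = build_prompt(widowx, "put the spoon on the towel")
print(prompt)
```

Because the embodiment is described in text rather than encoded in the network architecture, supporting a new platform in this sketch only requires writing a new `EmbodimentSpec`.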

By framing manipulation, navigation, and trajectory prediction within a unified action-and-trajectory prediction framework, Qwen-VLA facilitates transferable visual grounding, spatial reasoning, and continuous action generation across different robot morphologies, task categories, and environments. Evaluations on manipulation, navigation, and trajectory-centric benchmarks demonstrate consistent multi-task performance and robust out-of-distribution generalization when subjected to variations in scene layout, background, lighting, object configuration, and robot embodiment.
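One common way to give heterogeneous outputs a shared prediction target is to pad each per-step vector to a fixed width and keep a validity mask. The sketch below shows that idea for a manipulation action and a navigation waypoint sequence. This is an assumption for illustration, not the method described in the paper: `MAX_DIM`, `pad_chunk`, and the example dimensions are hypothetical.

```python
from typing import List, Tuple

MAX_DIM = 14  # assumed upper bound on per-step output size (hypothetical)


def pad_chunk(
    chunk: List[List[float]], max_dim: int = MAX_DIM
) -> Tuple[List[List[float]], List[int]]:
    """Pad every step of an action or waypoint chunk to max_dim.

    Returns the padded chunk and a mask with 1 for real dimensions
    and 0 for padding.
    """
    if not chunk:
        raise ValueError("chunk must contain at least one step")
    dim = len(chunk[0])
    if dim > max_dim or any(len(step) != dim for step in chunk):
        raise ValueError("inconsistent or oversized step dimension")
    padded = [list(step) + [0.0] * (max_dim - dim) for step in chunk]
    mask = [1] * dim + [0] * (max_dim - dim)
    return padded, mask


# A 7-D manipulation step and 2-D navigation waypoints share one target shape.
arm_chunk, arm_mask = pad_chunk([[0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0]])
nav_chunk, nav_mask = pad_chunk([[0.5, 0.0], [1.0, 0.2]])
```

In a setup like this, the mask would let a training loss ignore padded dimensions, so a single decoder head could serve robots and tasks with different output sizes.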

Qwen-VLA-Instruct scored 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 86.1%/87.2% on RoboTwin-Easy/Hard. In navigation, it achieved a 69.0% oracle success rate (OSR) on R2R and a 59.6% success rate (SR) on RxR. Real-world tests on the ALOHA platform showed an average out-of-distribution success rate of 76.9%, and zero-shot dynamic manipulation on DOMINO yielded a 26.6% success rate.
