arXiv

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Embodied intelligence research has traditionally relied on specialized models designed for isolated tasks like navigation or manipulation, a practice that leads to fragmented capabilities and poor generalization across different tasks, environments, and robot types. This study investigates whether heterogeneous embodied decision-making challenges can be consolidated into a single vision-language-action (VLA) model. We introduce Qwen-VLA, an embodied foundation model that expands Qwen’s vision-language capabilities—spanning perception, comprehension, and reasoning—into continuous action and trajectory generation via a DiT-based action decoder.

The model is developed using a large-scale joint pretraining strategy that leverages diverse data sources. These include robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation datasets, trajectory-centric supervision, and auxiliary vision-language data. To accommodate various robot platforms, we implement embodiment-aware prompt conditioning, which utilizes robot-specific textual descriptions to define the current embodiment and its control conventions.
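As a rough illustration of embodiment-aware prompt conditioning, the sketch below prepends a robot-specific textual description to the task instruction. The source does not give the actual prompt template, so the `Embodiment` fields, the `build_prompt` helper, the wording of the template, and the example action dimensions and control conventions are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical container for a robot's textual description."""
    name: str
    action_dim: int          # illustrative value, not taken from the paper
    control_convention: str  # illustrative wording, not taken from the paper


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prepend a description of the robot and its controls to the instruction.

    The template is an assumption; the paper's real format may differ.
    """
    return (
        f"Robot: {embodiment.name}. "
        f"Action dimensions: {embodiment.action_dim}. "
        f"Control: {embodiment.control_convention}. "
        f"Task: {instruction}"
    )


# Two platforms named in the abstract, with placeholder descriptions.
widowx = Embodiment("WidowX single arm", 7, "end-effector delta pose plus gripper")
aloha = Embodiment("ALOHA bimanual", 14, "joint positions for both arms")

instruction = "put the spoon on the towel"
prompt_widowx = build_prompt(widowx, instruction)
prompt_aloha = build_prompt(aloha, instruction)
print(prompt_widowx)
print(prompt_aloha)
```

The same instruction yields a different prompt for each robot, which is the property that would let a single model select the appropriate control convention from text alone.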

By framing manipulation, navigation, and trajectory prediction within a unified action-and-trajectory prediction framework, Qwen-VLA facilitates transferable visual grounding, spatial reasoning, and continuous action generation across different robot morphologies, task categories, and environments. Evaluations on manipulation, navigation, and trajectory-centric benchmarks demonstrate consistent multi-task performance and robust out-of-distribution generalization when subjected to variations in scene layout, background, lighting, object configuration, and robot embodiment.
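One common way to place heterogeneous outputs in a single prediction format is to pad every action or waypoint vector to a shared width and carry a mask marking the valid dimensions. The sketch below shows that idea only; the abstract does not say how Qwen-VLA represents its outputs, so `MAX_ACTION_DIM`, `pad_action_chunk`, and the example dimensions are assumptions made for illustration.

```python
from typing import List, Tuple

MAX_ACTION_DIM = 16  # hypothetical shared width, not taken from the paper


def pad_action_chunk(
    chunk: List[List[float]], max_dim: int = MAX_ACTION_DIM
) -> Tuple[List[List[float]], List[int]]:
    """Zero-pad each step of a chunk to max_dim and return a validity mask."""
    if not chunk:
        raise ValueError("chunk must contain at least one step")
    dim = len(chunk[0])
    if any(len(step) != dim for step in chunk):
        raise ValueError("all steps in a chunk must have the same dimension")
    if dim > max_dim:
        raise ValueError(f"action dimension {dim} exceeds max_dim {max_dim}")
    padded = [list(step) + [0.0] * (max_dim - dim) for step in chunk]
    mask = [1] * dim + [0] * (max_dim - dim)  # 1 marks a real dimension
    return padded, mask


# Illustrative inputs: 2-D navigation waypoints and 7-D arm actions.
nav_waypoints = [[0.5, 0.0], [1.0, 0.2]]
arm_actions = [[0.01] * 7, [0.02] * 7, [0.03] * 7]

padded_nav, nav_mask = pad_action_chunk(nav_waypoints)
padded_arm, arm_mask = pad_action_chunk(arm_actions)
print(len(padded_nav[0]), sum(nav_mask))
print(len(padded_arm[0]), sum(arm_mask))
```

With a mask of this kind, a training loss can ignore the padded dimensions, so navigation and manipulation samples can share one decoder output shape.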

Qwen-VLA-Instruct scores 97.9% on LIBERO, 73.7% on Simpler-WidowX, and 86.1% and 87.2% on RoboTwin-Easy and RoboTwin-Hard, respectively. In navigation, it reaches a 69.0% oracle success rate (OSR) on R2R and a 59.6% success rate (SR) on RxR. In real-world tests on the ALOHA platform, it attains an average out-of-distribution success rate of 76.9%, and zero-shot dynamic manipulation on DOMINO yields a 26.6% success rate.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
