arXiv

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

Abstract:

Vision-Language-Action (VLA) models leverage large-scale vision-language backbones to translate visual inputs and natural language instructions into robotic commands. However, existing approaches face significant limitations: they either rely on autoregressive generation with a rigid left-to-right sequence, resulting in suboptimal performance, or they employ external diffusion heads that detach from the main backbone. These separate components disrupt information flow and prevent the development of unified, scalable architectures.

To address these challenges, we introduce Discrete Diffusion VLA, a novel approach that discretizes action sequences and processes them using a discrete diffusion mechanism. This method maintains progressive refinement within a single, unified transformer backbone. A key feature of our design is adaptive decoding, which prioritizes the resolution of high-confidence action elements before tackling more difficult ones. Additionally, we implement secondary re-masking to allow the model to revisit and correct uncertain predictions, thereby enhancing robustness and error correction capabilities.
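The adaptive decoding and secondary re-masking described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the confidence measure, the decoding schedule, or the re-masking rule, so the linear schedule, the fixed `remask_threshold`, and the `predict` callable (a hypothetical stand-in for the transformer backbone) are all assumptions made for illustration.

```python
import math

MASK = -1  # sentinel for a not-yet-decoded action token


def adaptive_decode(predict, seq_len, num_steps, remask_threshold):
    """Iteratively fill a masked action sequence, most confident positions first.

    `predict(tokens)` is a hypothetical stand-in for the transformer: it returns
    one (token, confidence) pair per position given the current partial sequence.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")

    tokens = [MASK] * seq_len
    commit_step = [None] * seq_len  # round in which each position was last committed

    for step in range(num_steps):
        preds = predict(tokens)

        # Secondary re-masking (assumed rule): reopen committed positions whose
        # confidence under the current context falls below the threshold.
        for i in range(seq_len):
            if tokens[i] != MASK and preds[i][1] < remask_threshold:
                tokens[i] = MASK
                commit_step[i] = None

        masked = [i for i in range(seq_len) if tokens[i] == MASK]

        # Assumed linear schedule: spread the masked positions evenly over the
        # remaining rounds, so the final round commits everything still masked.
        n_commit = math.ceil(len(masked) / (num_steps - step))

        # Adaptive order: commit the highest-confidence masked positions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
            commit_step[i] = step

    return tokens, commit_step


def toy_predict(tokens):
    """Toy predictor (illustrative only): fixed targets, confidence grows with index."""
    target = [3, 1, 4, 1, 5, 9, 2, 6]
    return [(target[i], 0.5 + 0.05 * i) for i in range(len(tokens))]


tokens, commit_step = adaptive_decode(
    toy_predict, seq_len=8, num_steps=4, remask_threshold=0.6
)
print(tokens, commit_step)
```

All positions are predicted in parallel in every round, and only the order in which they are committed depends on confidence. Because re-masking runs before the commit in each round, the last round always leaves the sequence fully decoded.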

This architecture preserves the pretrained priors of the vision-language model while enabling parallel decoding, which significantly boosts efficiency. Our method demonstrates strong performance, achieving an average success rate of 96.4% on LIBERO, a visual matching score of 71.2% on SimplerEnv-Fractal, and an overall score of 54.2% on SimplerEnv-Bridge.

Furthermore, out-of-distribution evaluations on LIBERO-Goal indicate that the model better retains pretrained capabilities: language performance degrades by only 0.8%, versus 8.0% for parallel decoding, and vision performance degrades by 20.4%, versus 29.0% for continuous diffusion. Finally, we validate the practical effectiveness of Discrete Diffusion VLA through two real-robot evaluations conducted on the AgileX Cobot Magic platform.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
