arXiv

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

June 2, 2026 · Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

Abstract:

Vision-Language-Action (VLA) models leverage large-scale vision-language backbones to translate visual inputs and natural language instructions into robotic commands. However, existing approaches have significant limitations. They either decode actions autoregressively in a rigid left-to-right order, which yields suboptimal performance, or they attach external diffusion heads that are detached from the main backbone. These separate heads disrupt information flow and stand in the way of unified, scalable architectures.

To address these challenges, we introduce Discrete Diffusion VLA, a novel approach that discretizes action sequences and processes them using a discrete diffusion mechanism. This method maintains progressive refinement within a single, unified transformer backbone. A key feature of our design is adaptive decoding, which prioritizes the resolution of high-confidence action elements before tackling more difficult ones. Additionally, we implement secondary re-masking to allow the model to revisit and correct uncertain predictions, thereby enhancing robustness and error correction capabilities.
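The decoding loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `MASK` sentinel, the `toy_predict` stand-in for the transformer, the linear commit schedule, and the `remask_threshold` rule are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking an undecided action token


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the policy: one probability vector per position.

    It ignores `tokens` and draws random logits; a real model would
    condition on the image, the instruction, and the unmasked tokens.
    """
    probs = []
    for _ in tokens:
        logits = [rng.gauss(0.0, 2.0) for _ in range(vocab_size)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def decode(length, vocab_size, predict, steps=4, remask_threshold=0.2, seed=0):
    """Iteratively fill a fully masked sequence, most confident positions first."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        probs = predict(tokens, vocab_size, rng)
        last_step = step == steps

        # Secondary re-masking (assumed rule): reopen committed tokens whose
        # probability under the current prediction fell below the threshold.
        if not last_step:
            for i, tok in enumerate(tokens):
                if tok != MASK and probs[i][tok] < remask_threshold:
                    tokens[i] = MASK

        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break

        # Adaptive decoding: rank masked positions by the confidence of their
        # best token and commit only the top ones this round.
        best = {i: max(range(vocab_size), key=lambda t, i=i: probs[i][t]) for i in masked}
        ranked = sorted(masked, key=lambda i: probs[i][best[i]], reverse=True)
        committed = length - len(masked)
        target = math.ceil(length * step / steps)  # assumed linear schedule
        n_commit = len(masked) if last_step else max(1, target - committed)
        for i in ranked[:n_commit]:
            tokens[i] = best[i]
    return tokens
```

Calling `decode(length=7, vocab_size=16, predict=toy_predict)` returns a list of seven token ids with no masked entries left. All positions are predicted in parallel at every step, so the number of model calls is bounded by `steps` rather than by the sequence length. Because `toy_predict` is random, the re-masking it triggers here is arbitrary and only demonstrates the control flow.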

This architecture preserves the pretrained priors of the vision-language model while enabling parallel decoding, which significantly boosts efficiency. Our method demonstrates strong performance, achieving an average success rate of 96.4% on LIBERO, a visual matching score of 71.2% on SimplerEnv-Fractal, and an overall score of 54.2% on SimplerEnv-Bridge.

Furthermore, out-of-distribution evaluations on LIBERO-Goal show that the model retains more of its pretrained capabilities than the alternatives. Its language performance degrades by only 0.8%, versus 8.0% for parallel decoding, and its vision performance degrades by 20.4%, versus 29.0% for continuous diffusion. Finally, we validate the practical effectiveness of Discrete Diffusion VLA through two real-robot evaluations conducted on the AgileX Cobot Magic platform.
