
$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

June 3, 2026 · Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin


Original (arXiv:2604.18995v2): Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88\% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.

Rewrite: Diffusion Large Language Models (dLLMs) offer a compelling alternative to autoregressive generation because they can predict tokens in parallel. In practice, however, high inference latency during decoding still limits deployment. Our analysis shows that much of this inefficiency stems from recurring redundancy in the decoding process. Specifically, we identify spatial redundancy, caused by confidence clusters and positional ambiguity, and temporal redundancy, caused by repeatedly remasking predictions that have already stabilized. To address these issues, we introduce $R^{2}$-dLLM, a unified framework that reduces decoding redundancy from both the inference and training sides. At inference time, it applies training-free decoding rules that aggregate local confidence and token predictions, and that finalize temporally stable tokens to skip redundant decoding steps. We also present a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments show that $R^{2}$-dLLM reduces the number of decoding steps by up to 88\% relative to existing decoding strategies while maintaining competitive generation quality across different models and tasks. These findings indicate that decoding redundancy is a central bottleneck for dLLMs and that explicitly reducing it yields substantial practical efficiency gains. The code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
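The abstract does not spell out the decoding rules, so the following is only a toy sketch of the two ideas it names: aggregating confidence over neighbouring positions, and finalizing tokens whose prediction has stopped changing. It is not the authors' implementation. The function names, the sliding-window mean, the streak counter, and the default thresholds are all assumptions made for illustration.

```python
from typing import List, Sequence


def smooth_confidence(conf: Sequence[float], radius: int = 1) -> List[float]:
    """Hypothetical spatial rule: average each position's confidence
    with its neighbours inside a window of the given radius."""
    n = len(conf)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = conf[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def update_streak(prev_pred: Sequence[int], curr_pred: Sequence[int],
                  streak: Sequence[int]) -> List[int]:
    """Hypothetical temporal bookkeeping: count how many consecutive
    steps each position has kept the same predicted token."""
    return [s + 1 if p == c else 1
            for p, c, s in zip(prev_pred, curr_pred, streak)]


def select_to_finalize(masked: Sequence[bool], conf: Sequence[float],
                       streak: Sequence[int], conf_threshold: float = 0.9,
                       stable_steps: int = 3, radius: int = 1) -> List[int]:
    """Pick still-masked positions to commit in this step.

    A position is committed if its locally averaged confidence clears
    the threshold, or if its prediction has been stable long enough.
    The threshold values are illustrative assumptions.
    """
    local = smooth_confidence(conf, radius)
    chosen = [i for i in range(len(conf))
              if masked[i] and (local[i] >= conf_threshold
                                or streak[i] >= stable_steps)]
    if not chosen and any(masked):
        # Guarantee progress: commit the most confident masked position.
        candidates = [i for i in range(len(conf)) if masked[i]]
        chosen = [max(candidates, key=lambda i: local[i])]
    return chosen
```

For example, with `conf = [0.95, 0.97, 0.5, 0.2]`, all positions masked, and `streak = [1, 1, 3, 1]`, `select_to_finalize` returns `[0, 2]`: position 0 clears the averaged-confidence threshold, and position 2 is committed because its prediction has been stable for three steps, so it is not remasked again.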

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC