arXiv

$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction

Original: arXiv:2604.18995v2 Announce Type: replace-cross Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^{2}$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^{2}$-dLLM consistently reduces the number of decoding steps by up to 88\% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains. Our code and models are available at https://github.com/GATECH-EIC/R2-dLLM.

Rewrite: Diffusion Large Language Models (dLLMs) offer a compelling alternative to autoregressive generation because they predict tokens in parallel. In practice, however, high inference latency during decoding still limits deployment. Our analysis shows that much of this inefficiency stems from recurring redundancy in the decoding process. Specifically, we identify spatial redundancy arising from confidence clusters and positional ambiguity, and temporal redundancy arising from repeatedly remasking predictions that have already stabilized. To address these issues, we introduce $R^{2}$-dLLM, a unified framework that reduces decoding redundancy from both the inference and training perspectives. At inference time, training-free decoding rules aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We also present a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments show that $R^{2}$-dLLM reduces the number of decoding steps by up to 88\% compared to existing decoding strategies while maintaining competitive generation quality across different models and tasks. These findings indicate that decoding redundancy is a central bottleneck in dLLMs and that explicitly reducing it yields substantial practical efficiency gains. The code and models are available at https://github.com/GATECH-EIC/R2-dLLM.
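The abstract does not spell out the decoding rules, so the following is only a toy sketch of what the two inference-time ideas might look like, not the authors' implementation. It assumes per-position confidences and predicted token ids are available at each decoding step. The function names, the mean-over-window aggregation, and the `conf_threshold`, `patience`, and `radius` parameters are all hypothetical choices made for illustration.

```python
from typing import List


def aggregate_local_confidence(conf: List[float], radius: int = 1) -> List[float]:
    """Average each position's confidence with its neighbours (assumed rule).

    A plain mean over a window of +/- `radius` positions is used here as a
    stand-in for "aggregating local confidence"; the paper may do otherwise.
    """
    n = len(conf)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = conf[lo:hi]
        out.append(sum(window) / len(window))
    return out


def update_streak(prev_pred: List[int], curr_pred: List[int],
                  streak: List[int]) -> List[int]:
    """Count consecutive steps in which a position kept the same prediction."""
    return [s + 1 if p == c else 0
            for p, c, s in zip(prev_pred, curr_pred, streak)]


def select_to_finalize(conf: List[float], streak: List[int],
                       finalized: List[bool], conf_threshold: float = 0.9,
                       patience: int = 3, radius: int = 1) -> List[int]:
    """Return positions to unmask permanently at this step (hypothetical rule).

    A position is chosen if its locally aggregated confidence is high, or if
    its prediction has been stable for at least `patience` steps, so it is
    not remasked and re-decoded again.
    """
    agg = aggregate_local_confidence(conf, radius)
    return [i for i in range(len(conf))
            if not finalized[i]
            and (agg[i] >= conf_threshold or streak[i] >= patience)]


if __name__ == "__main__":
    conf = [0.2, 0.95, 0.95, 0.95, 0.1]
    streak = update_streak([5, 7, 7, 3, 9], [5, 8, 7, 3, 1], [1, 0, 1, 2, 0])
    print(select_to_finalize(conf, streak, [False] * 5))
```

In this toy run, position 2 is selected because its neighbourhood is uniformly confident, and position 3 is selected because its prediction has not changed for three steps even though its aggregated confidence falls below the threshold.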


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
