DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention
Title: DyLLM: Accelerating Diffusion LLM Inference Through Salient Token Selection and Partial Attention
Abstract: Masked diffusion language models offer a compelling alternative to traditional autoregressive generation by enabling parallel token decoding. However, the iterative denoising process inherent to these models is computationally intensive, because it requires processing the entire sequence at each step. We find that token representations remain largely stable across diffusion steps, with only a small subset, referred to as salient tokens, significantly influencing the next update. Capitalizing on this temporal sparsity, we introduce DyLLM, a training-free inference framework that speeds up decoding by focusing computation exclusively on these salient tokens. DyLLM determines saliency by computing the cosine similarity of attention contexts between consecutive denoising steps. This allows the model to recompute feed-forward and attention operations solely for salient tokens while reusing cached activations for all other tokens. Evaluations across various reasoning and code-generation benchmarks show that DyLLM boosts throughput by up to 9.6x while maintaining the baseline accuracy of popular open-source diffusion LLMs such as LLaDA and Dream.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
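
The selection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each token's attention context is a plain vector, that a token counts as salient when the cosine similarity between its contexts at two consecutive denoising steps falls below a threshold, and that non-salient tokens keep their cached activation. The function names (`cosine_similarity`, `select_salient`, `partial_update`), the threshold value of 0.99, and the toy `recompute` callback are all hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as "changed", so it gets recomputed.
        return 0.0
    return dot / (norm_a * norm_b)


def select_salient(prev_ctx, curr_ctx, threshold=0.99):
    """Indices of tokens whose context moved between two consecutive steps.

    threshold is an illustrative value, not one taken from the paper.
    """
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_ctx, curr_ctx))
        if cosine_similarity(prev, curr) < threshold
    ]


def partial_update(cache, curr_ctx, salient, recompute):
    """Recompute activations for salient tokens only and reuse the cache elsewhere."""
    updated = list(cache)
    for i in salient:
        updated[i] = recompute(curr_ctx[i])
    return updated


# Toy example: only token 1 changes direction between the two steps.
prev_ctx = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_ctx = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
salient = select_salient(prev_ctx, curr_ctx)

cache = [10.0, 20.0, 30.0]  # stand-in for cached per-token activations
updated = partial_update(cache, curr_ctx, salient, recompute=lambda ctx: sum(ctx))
```

In this toy run only one of the three tokens is recomputed, which is the source of the savings the abstract describes: the cost of a step scales with the number of salient tokens rather than with the full sequence length.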






