arXiv

DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

June 2, 2026 · Longxuan Yu, Yunshu Wu, Yu Fu, Siheng Xiong, Rob Brekelmans, Hui Liu, Yue Dong, Greg Ver Steeg


Abstract:

Discrete masked diffusion language models generate text by iterative parallel decoding, but in few-step inference they face a trade-off between output length and quality: under a fixed step budget, conventional approaches must choose between concise, high-quality text and long but repetitive text. Continuous denoising addresses this limitation by jointly evolving all text positions in embedding space. However, training such a model from scratch at large scale remains an open challenge.

In this work, we show that a pretrained masked diffusion language model (DLM) can be efficiently adapted to perform continuous denoising in embedding space. Starting from LLaDA-8B-Instruct, we run a lightweight continued-pretraining phase of only 1,000 steps using Discrete Stochastic Localization (DSL), which replaces binary masking with continuous, per-token Gaussian noise that acts as a soft mask. The resulting model supports continuous inference: all positions evolve simultaneously in embedding space, and hard token commitment is deferred until the final step.
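To make the contrast between binary masking and a per-token Gaussian soft mask concrete, the following is a minimal toy sketch, not the authors' implementation. The function names (`hard_mask`, `soft_mask`, `commit_tokens`), the interpolation `sqrt(1 - t) * e + sqrt(t) * eps`, and the nearest-embedding commitment rule are all assumptions made for illustration; the abstract does not specify the noise schedule or the decoding rule.

```python
import math
import random


def hard_mask(embeddings, mask_embedding, mask_prob, rng):
    """Binary masking: each token is either kept intact or fully replaced
    by a single mask embedding."""
    return [
        list(mask_embedding) if rng.random() < mask_prob else list(e)
        for e in embeddings
    ]


def soft_mask(embeddings, noise_levels, rng):
    """Per-token Gaussian corruption: token i gets its own level t_i in [0, 1].

    t_i = 0 leaves the token clean, t_i = 1 replaces it with pure noise.
    The variance-preserving blend below is an assumed schedule.
    """
    noisy = []
    for e, t in zip(embeddings, noise_levels):
        keep, noise = math.sqrt(1.0 - t), math.sqrt(t)
        noisy.append([keep * x + noise * rng.gauss(0.0, 1.0) for x in e])
    return noisy


def commit_tokens(states, vocab_embeddings):
    """Hard commitment at the final step: map each continuous state to the
    index of its nearest vocabulary embedding (assumed decoding rule)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(vocab_embeddings)), key=lambda v: sqdist(s, vocab_embeddings[v]))
        for s in states
    ]


# Toy setup: a 3-token vocabulary with 2-dimensional embeddings (hypothetical).
rng = random.Random(0)
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
tokens = [0, 2, 1]
clean = [vocab[t] for t in tokens]

# Each position carries a different amount of noise instead of a 0/1 mask.
noisy = soft_mask(clean, noise_levels=[0.0, 0.5, 1.0], rng=rng)
```

In this toy, a position with noise level 0 passes through unchanged while the others are perturbed to different degrees, which is the sense in which the noise level behaves like a soft mask. In a continuous sampler, a denoiser would update all positions for several steps and `commit_tokens` would run only once, at the end.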

In zero-shot summarization with low step budgets (at most 16 forward passes), DSL-LLaDA-SDE achieves the highest ROUGE-1 scores on four benchmarks and largely avoids the premature termination and repetition commonly associated with iterative unmasking. The adaptation also confers selective robustness to noisy states: the model corrects corrupted tokens without disturbing those that are already clean. Control experiments using standard masked diffusion training with equivalent compute did not exhibit these behaviors.
