DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
Title: DFlare: Enhancing Draft Capacity in Block Diffusion Speculative Decoding
Abstract:
Block diffusion speculative decoding accelerates Large Language Model (LLM) inference by having the draft model predict an entire block of tokens at once, which the target model then verifies in parallel. This approach requires both a highly capable draft model and efficient use of the target model’s internal knowledge. The current leading method, DFlash, conditions every draft layer on a single fused representation drawn from only a few target layers. This bottleneck limits the expressiveness of individual layers and prevents further scaling of draft capacity.
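As a rough illustration of the draft-then-verify loop described above, the following sketch accepts the longest prefix of a drafted block that agrees with the target model. It assumes greedy (exact-match) verification, which the abstract does not specify; the function name `accept_block` and its arguments are hypothetical and are not taken from DFlash or DFlare.

```python
from typing import List, Sequence


def accept_block(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Greedy verification of one drafted block (illustrative assumption).

    target_tokens[i] is the target model's greedy choice at block position i,
    computed in one parallel pass over the prefix plus draft_tokens[:i].
    It therefore has one more entry than draft_tokens.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")

    accepted: List[int] = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)

    # The target's own token at the first mismatch (or after a fully accepted
    # block) is always kept, so every step emits at least one token.
    accepted.append(target_tokens[len(accepted)])
    return accepted
```

Under this rule, a more capable draft model produces longer accepted prefixes per target forward pass, which is where the wall-clock speedup comes from.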
To address this, we introduce DFlare, a method that widens the narrow conditioning bottleneck in DFlash using a lightweight, layer-wise fusion mechanism. With negligible overhead, each draft layer attends to its own learnable combination of a wide range of target layers. This injects richer knowledge from the target model and ensures that every draft layer receives a distinct input, thereby enhancing per-layer expressiveness. As a result, draft models can be scaled to deeper architectures with consistent performance gains. Additionally, we expand the training dataset from 800,000 to 2.4 million samples to make full use of this increased capacity.
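One way to picture the layer-wise fusion described above is as a separate set of mixing weights per draft layer over the target model's hidden states. The sketch below assumes a softmax-weighted sum, which is only one plausible form of a "learnable combination"; the abstract does not give the exact formulation, and the names `fuse_for_draft_layer` and `layerwise_fusion` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_for_draft_layer(target_states: List[Vector], mixing_logits: List[float]) -> Vector:
    """Combine target-layer hidden states with one draft layer's own weights.

    Assumption: the combination is a softmax-weighted sum; in training,
    mixing_logits would be learnable parameters.
    """
    weights = softmax(mixing_logits)
    dim = len(target_states[0])
    return [sum(w * h[d] for w, h in zip(weights, target_states)) for d in range(dim)]


def layerwise_fusion(target_states: List[Vector],
                     logits_per_draft_layer: List[List[float]]) -> List[Vector]:
    """Return one fused conditioning vector per draft layer."""
    return [fuse_for_draft_layer(target_states, logits)
            for logits in logits_per_draft_layer]


# Three target layers with 2-dimensional states, two draft layers.
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = layerwise_fusion(target_states, [[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
```

In this toy setup the first draft layer averages all three target layers while the second focuses almost entirely on the first target layer, so the two draft layers receive different conditioning inputs. A single shared fused representation corresponds to every draft layer using the same weights.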
Evaluated on six benchmarks covering mathematical reasoning, code generation, and conversational tasks, DFlare achieves average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B. These results correspond to improvements of approximately 11%, 8%, and 5% over DFlash, respectively. The source code is available at https://github.com/Tencent/AngelSlim.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





