D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models
Title: D^2SD: Enhancing Speculative Decoding Speed via Dual Diffusion Drafting Mechanisms
Abstract: Speculative decoding speeds up autoregressive large language model inference by drafting multiple tokens and verifying them in a single forward pass of the target model. Recent diffusion-based drafters can generate token blocks in parallel, but they typically commit to a single draft sequence during verification. Once the first mismatch occurs, all subsequent tokens in that draft are discarded, which limits the acceptance rate. Simply increasing the number of batched draft candidates yields only marginal gains, because redundant or poorly placed branches raise drafting and verification costs without a proportional rise in accepted tokens. To address this, we introduce D^2SD, a speculative decoding framework with dual diffusion drafters that organizes candidates in a confidence-guided prefix tree. First, a diffusion drafter produces a token block together with per-position confidence scores, which are used to locate the most likely rejection boundary and to select the top-K prefix ranges for recovery. Second, a variable-prefix diffusion drafter re-anchors at each selected prefix and proposes alternative continuations in a single batched operation. Because these candidates share common prefixes, they are verified jointly using cascade attention. Experiments show that D^2SD significantly outperforms both the base diffusion method and strong autoregressive speculative decoding baselines.
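The candidate-selection idea in the abstract can be illustrated with a small, hedged Python sketch. It is not the authors' implementation. The scoring rule in `top_k_prefix_lengths` (independent per-position acceptance with probability equal to the confidence), all function names, the token ids, and the hard-coded "second drafter" outputs are assumptions made for illustration only. Verification is modeled as greedy prefix matching against a fixed target continuation, and cascade attention is not modeled; the trie only shows how shared prefixes reduce the number of tokens to score.

```python
from typing import Dict, List, Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Greedy verification: count leading draft tokens that match the target's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def top_k_prefix_lengths(confidences: Sequence[float], k: int) -> List[int]:
    """Rank prefix lengths by an assumed first-rejection probability.

    Assumption (not from the source): position i is accepted independently
    with probability confidences[i], so
    P(first rejection at i) = prod(confidences[:i]) * (1 - confidences[i]).
    """
    scores = []
    survive = 1.0  # probability that all earlier positions were accepted
    for i, c in enumerate(confidences):
        scores.append((survive * (1.0 - c), i))
        survive *= c
    # Highest score first; ties broken toward the earlier position.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(i for _, i in scores[:k])


def build_prefix_tree(candidates: Sequence[Sequence[int]]) -> Dict:
    """Merge candidates into a nested-dict trie so shared prefixes are stored once."""
    root: Dict = {}
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def count_nodes(tree: Dict) -> int:
    """Number of token nodes in the trie, i.e. tokens the verifier must score."""
    return sum(1 + count_nodes(child) for child in tree.values())


# Hypothetical first draft and per-position confidences.
first_draft = [11, 12, 13, 14, 15, 16]
confidences = [0.95, 0.90, 0.40, 0.85, 0.50, 0.90]

# Prefix lengths to keep before re-drafting.
anchors = top_k_prefix_lengths(confidences, k=2)  # [2, 4]

# Stand-in for the second drafter: fixed alternative continuations per anchor.
alternatives = {2: [23, 24, 25, 26], 4: [35, 36]}
candidates = [first_draft] + [first_draft[:a] + alternatives[a] for a in anchors]

tree = build_prefix_tree(candidates)
flat_tokens = sum(len(c) for c in candidates)  # 18 tokens if verified separately
tree_tokens = count_nodes(tree)                # 12 tokens with shared prefixes

# Hypothetical greedy continuation of the target model.
target = [11, 12, 23, 24, 25, 99]
best_accepted = max(accepted_prefix_len(c, target) for c in candidates)  # 5
```

In this toy setting the single first draft would have 2 tokens accepted, while the re-anchored candidate at prefix length 2 has 5 accepted, and the shared-prefix tree needs 12 token positions instead of 18 for verification.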
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




