arXiv

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

June 2, 2026 · Zhuoyu Wang, Junnan Huang, Xinyu Chen

Original: arXiv:2606.00487v1

Abstract: Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD

Rewrite:

Abstract: Diffusion models are a promising way to draft tokens in parallel for speculative decoding. Because a diffusion drafter predicts tokens at multiple future positions in a single forward pass, drafting latency drops substantially. The bottleneck then moves to verification: verifying a single sequence limits the acceptance length, whereas verifying a large draft tree incurs excessive target-model latency.

We identify a key mismatch in existing diffusion-tree methods: they rank nodes by marginal probability, even though verification is prefix-conditioned. As a result, they may verify descendants that become unreachable once an earlier prefix is rejected, which increases latency while yielding only limited acceptance gains.
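
To make the mismatch concrete, the toy sketch below ranks the nodes of a small draft tree by the marginal probability of their last token alone, then checks which selected nodes are missing an ancestor. The probabilities, the helper names (`all_nodes`, `top_k_by_marginal`, `unreachable`), and the tuple-of-tokens node encoding are illustrative assumptions, not details from the paper.

```python
from itertools import product

# Hypothetical per-position marginals from a diffusion drafter:
# MARGINALS[d][token] = probability of `token` at future position d.
MARGINALS = [
    {"a": 0.5, "b": 0.4},
    {"c": 0.9, "d": 0.05},
    {"e": 0.8, "f": 0.1},
]

def all_nodes(marginals):
    """Enumerate every path (tuple of tokens) in the full draft tree."""
    nodes = []
    for depth in range(1, len(marginals) + 1):
        nodes.extend(product(*(m.keys() for m in marginals[:depth])))
    return nodes

def top_k_by_marginal(marginals, k):
    """Rank nodes by the marginal of their last token only, ignoring the prefix."""
    nodes = all_nodes(marginals)
    nodes.sort(key=lambda p: marginals[len(p) - 1][p[-1]], reverse=True)
    return set(nodes[:k])

def unreachable(selected):
    """Selected nodes with at least one ancestor prefix missing from the selection."""
    return {
        p for p in selected
        if any(p[:i] not in selected for i in range(1, len(p)))
    }

selected = top_k_by_marginal(MARGINALS, k=4)
wasted = unreachable(selected)
```

In this toy case the highest marginals all sit at deeper positions, so none of the four selected nodes has its first-position ancestor in the selection, and verification could never reach any of them.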

To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact, prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree.
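
One way to picture budgeted, prefix-closed selection is a best-first search that scores each node by the product of the marginals along its path. This is only a minimal sketch under that assumption: the abstract does not specify how TAPS computes its path-conditioned acceptance estimates, and `select_subtree`, the product score, and the example probabilities are hypothetical.

```python
import heapq

def select_subtree(marginals, budget):
    """Pick up to `budget` nodes, best-first by path score.

    marginals[d] maps token -> probability at future position d.
    A node is a tuple of tokens; its score is the product of the
    marginals along the path (an assumed stand-in for a
    path-conditioned acceptance estimate).
    """
    # heapq is a min-heap, so scores are stored negated.
    heap = [(-p, (tok,)) for tok, p in marginals[0].items()]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < budget:
        neg_score, path = heapq.heappop(heap)
        selected.append((path, -neg_score))
        depth = len(path)
        if depth < len(marginals):
            # Children enter the frontier only after their parent is chosen,
            # so the selected set is always prefix-closed.
            for tok, p in marginals[depth].items():
                heapq.heappush(heap, (neg_score * p, path + (tok,)))
    return selected

# Hypothetical marginals for three future positions.
marginals = [
    {"a": 0.5, "b": 0.3},
    {"c": 0.9, "d": 0.05},
    {"e": 0.8, "f": 0.1},
]
tree = select_subtree(marginals, budget=4)
paths = {path for path, _ in tree}
```

Because a child's score is its parent's score multiplied by a probability of at most 1, the search never spends budget on a node whose prefix has not already been selected.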

Experiments across diverse datasets and model families show that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding and outperforms the state-of-the-art DFlash and DDTree by 1.36x and 1.74x, respectively. Our work is available at https://anonymous.4open.science/r/TAPS-EMNLP2026-53DD

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC