BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
Abstract: Speculative decoding accelerates autoregressive generation by having a drafter model propose multiple tokens, which a verifier then validates in parallel. In resource-constrained settings, the drafter uses a sparse key-value (KV) cache to limit peak GPU memory usage and reduce end-to-end latency within a fixed KV budget, whereas the verifier keeps a full KV cache. Mid-to-long context inference (4K to 16K tokens) is common in practice, yet naively pairing a sparse-cache drafter with a full-cache verifier degrades as context length grows: the mismatch between sparse and full states causes token acceptance rates to drop rapidly. To address this, we introduce BudgetDraft, a multi-view sparse training framework for drafting in mid-to-long context inference. During training, the drafter is exposed to multiple sampled KV budgets and learns to align each sparse view with a single full-cache teacher target. BudgetDraft combines an acceptance-aware loss on the full-cache branch with a multi-view loss on the sparse-cache branch, yielding a single drafter that is robust to budget variation. This restores acceptance rates across sparsity levels without adding any components at inference time. On PG-19, LongBench, and LWM, BudgetDraft achieves end-to-end speedups of up to 6.55x, 4.46x, and 2.10x over autoregressive (AR) decoding at context lengths of 4K, 8K, and 16K, respectively, while keeping the inference pipeline memory-efficient.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
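
The training objective described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption made for illustration: the acceptance-aware term is taken to be one minus the per-token acceptance probability of standard speculative sampling, sum_x min(p(x), q(x)), the multi-view term is taken to be the average KL divergence from the full-cache teacher to each sparse view, and the names `sample_budgets`, `budgetdraft_style_loss`, and `acc_weight` are hypothetical. Distributions are plain Python lists over a toy vocabulary rather than model outputs.

```python
import math
import random


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def expected_acceptance(p, q):
    """Probability that a token drafted from q is accepted against target p
    under standard speculative sampling: sum_x min(p(x), q(x))."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def sample_budgets(rng, choices, k):
    """Draw k distinct KV budgets for one training step (hypothetical helper)."""
    return rng.sample(choices, k)


def budgetdraft_style_loss(teacher, full_view, sparse_views, acc_weight=1.0):
    """Assumed form of the combined objective, not the paper's exact loss.

    teacher      : full-cache teacher distribution (the shared target)
    full_view    : drafter distribution computed with the full KV cache
    sparse_views : dict mapping KV budget -> drafter distribution at that budget
    """
    # Acceptance-aware term on the full-cache branch (assumption).
    acceptance_term = 1.0 - expected_acceptance(teacher, full_view)
    # Multi-view term: every sparse view is pulled toward the same teacher.
    multi_view_term = sum(
        kl_divergence(teacher, q) for q in sparse_views.values()
    ) / len(sparse_views)
    return acc_weight * acceptance_term + multi_view_term


if __name__ == "__main__":
    rng = random.Random(0)
    budgets = sample_budgets(rng, [256, 512, 1024, 2048], k=2)
    teacher = [0.7, 0.2, 0.1]
    full_view = [0.6, 0.3, 0.1]
    # Hand-written stand-ins for the drafter's output at each sampled budget.
    sparse_views = {b: [0.5, 0.3, 0.2] for b in budgets}
    print(budgets, budgetdraft_style_loss(teacher, full_view, sparse_views))
```

In this sketch the loss is zero only when the full-cache view and every sparse view match the teacher, which mirrors the stated goal of a single drafter that behaves consistently across budgets. In an actual training loop the per-budget distributions would come from running the drafter with its KV cache truncated to each sampled budget.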




