Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding
Title: Hybrid Verified Decoding: Optimizing Verification Allocation in Speculative Decoding
Abstract
The computational cost of Large Language Model (LLM) generation is driven primarily by the autoregressive nature of decoding, which requires invoking the model once for every new token. Speculative decoding reduces this cost by drafting several tokens and verifying them against the target model in a single step. The resulting speedup, however, depends on how many of the drafted tokens are ultimately accepted. Parameter-free, cache-based draft sources can efficiently propose long continuations for structured and agentic tasks, but the value of a cache match is not constant: a match that appears promising at one generation step may yield minimal returns at the next.
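To make the dependence on accepted tokens concrete, the following is a minimal sketch of greedy draft verification, not the paper's implementation. The names `accepted_length` and `toy_target` are hypothetical, and the target model is replaced by a toy next-token rule. A real system scores all draft positions in one batched forward pass; the loop here only emulates that result position by position.

```python
from typing import Callable, List, Sequence


def accepted_length(
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
    prefix: Sequence[int],
) -> int:
    """Count the leading draft tokens that match the target's greedy choices."""
    context = list(prefix)
    accepted = 0
    for token in draft:
        # Verification stops at the first token the target would not have produced.
        if target_next(context) != token:
            break
        context.append(token)
        accepted += 1
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the target model: next token is last + 1 (mod 10)."""
    return (context[-1] + 1) % 10


# The draft agrees with the toy target for two tokens, then diverges.
n_accepted = accepted_length([4, 5, 9, 7], toy_target, [1, 2, 3])
```

Under this sketch, a draft of four tokens with only two accepted still costs one verification step, which is why the accepted length determines the payoff of a draft.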
To address this variability, we introduce Hybrid Verified Decoding. Before verification, the approach predicts how many tokens of a cache draft will be accepted and uses this estimated payoff to decide whether to verify the cache draft or switch to a model-based drafter. Evaluations across three LLMs and sixteen datasets show that Hybrid Verified Decoding is particularly advantageous for agentic workflows, where it surpasses EAGLE3 in all tested conditions and achieves an average speedup of 2.73x. Our analysis shows how specific prompt structures create cache opportunities and how high-value cache drafts are concentrated within a limited segment of the draft space. It further shows that selecting drafts based on payoff estimates reduces the need for sequential decoding, suggesting that runtime draft selection is a promising direction for advancing speculative decoding.
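The selection step described above can be sketched as a simple routing rule. This is an illustrative assumption rather than the paper's method: the `CacheMatch` fields, the `choose_drafter` function, the `min_payoff` threshold, and the toy predictor are all hypothetical, and the abstract does not specify how the accepted length is actually predicted.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class CacheMatch:
    draft: Sequence[int]  # tokens proposed by the cache
    match_len: int        # hypothetical feature: length of the matched context suffix


def choose_drafter(
    match: Optional[CacheMatch],
    predict_accepted: Callable[[CacheMatch], float],
    min_payoff: float = 2.0,  # hypothetical threshold, not taken from the paper
) -> str:
    """Return "cache" if the predicted accepted length justifies cache verification."""
    if match is None or not match.draft:
        return "model"
    # A draft can never yield more accepted tokens than it contains.
    predicted = min(predict_accepted(match), float(len(match.draft)))
    return "cache" if predicted >= min_payoff else "model"


def toy_predictor(match: CacheMatch) -> float:
    """Hypothetical predictor: longer context matches are assumed to pay off more."""
    return 0.5 * match.match_len


strong = CacheMatch(draft=[11, 12, 13, 14, 15, 16], match_len=8)
weak = CacheMatch(draft=[11, 12, 13], match_len=2)
decisions = [
    choose_drafter(strong, toy_predictor),
    choose_drafter(weak, toy_predictor),
    choose_drafter(None, toy_predictor),
]
```

In this sketch the decision is made per generation step, so a cache match that looked valuable at one step can be bypassed at the next when its predicted payoff drops below the threshold.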
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




