NanoSpec: Accelerating Speculative Decoding using Minimalist In-Context Vocabularies
Title: NanoSpec: Boosting Speculative Decoding via Minimalist In-Context Vocabularies
Original: arXiv:2605.26444v2 Announce Type: replace
Abstract: Large language models typically have vocabularies of more than 100,000 tokens, which makes the final linear projection layer a significant computational bottleneck during speculative decoding. Current methods for reducing this cost depend on static or loosely defined sub-vocabularies, which still require substantial active sizes (approximately 30k tokens) to preserve the quality of drafted tokens. We introduce NanoSpec, a novel, training-free technique that resolves this trade-off by dynamically generating a compact, context-sensitive active vocabulary at every generation step. Capitalizing on the natural temporal locality of language generation, NanoSpec maintains high coverage rates while reducing the average vocabulary size by more than 40-fold (to under 3k tokens), without any additional trained parameters. To exploit such extreme sparsity on contemporary hardware, we present an algorithm-system co-design that mitigates the inefficiency of sparse memory access through asynchronous gathering and GPU-resident state management. Functioning as a plug-and-play module, NanoSpec reduces draft time by an average of 51.6%, providing a 1.17x to 1.29x end-to-end speedup over leading speculative decoding approaches such as EAGLE-2 and EAGLE-3 across seven tasks, while surpassing complex baselines that rely on training-based pruning.
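The core idea in the abstract, projecting the draft hidden state onto a small, context-dependent subset of vocabulary rows instead of the full vocabulary, can be illustrated with a minimal sketch. The abstract does not say how NanoSpec selects its active vocabulary, so the selection rule here (tokens from a recent context window plus a small always-on set) is an assumption chosen only to illustrate temporal locality. The names `build_active_vocab`, `sparse_lm_head`, and `draft_next_token` are hypothetical, and plain Python lists stand in for GPU tensors.

```python
import random


def build_active_vocab(context_ids, window, always_on):
    """Token ids seen in the last `window` context positions, plus a fixed set.

    Assumption: this recency rule stands in for whatever rule NanoSpec uses.
    """
    recent = context_ids[-window:] if window > 0 else []
    return sorted(set(recent) | set(always_on))


def sparse_lm_head(hidden, lm_head_rows, active_ids):
    """Compute logits only for `active_ids` by gathering those weight rows."""
    return {
        tid: sum(h * w for h, w in zip(hidden, lm_head_rows[tid]))
        for tid in active_ids
    }


def draft_next_token(hidden, lm_head_rows, active_ids):
    """Greedy draft: the highest-logit token within the active vocabulary."""
    logits = sparse_lm_head(hidden, lm_head_rows, active_ids)
    return max(logits, key=logits.get)


# Toy usage with made-up sizes (not the paper's configuration).
rng = random.Random(0)
vocab_size, hidden_dim = 1000, 8
lm_head_rows = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(vocab_size)]
hidden = [rng.gauss(0, 1) for _ in range(hidden_dim)]
context_ids = [rng.randrange(vocab_size) for _ in range(64)]

active_ids = build_active_vocab(context_ids, window=32, always_on=range(16))
token = draft_next_token(hidden, lm_head_rows, active_ids)
print(f"active vocab: {len(active_ids)} of {vocab_size} rows, drafted token id {token}")
```

In this sketch the projection cost scales with the number of gathered rows rather than with the full vocabulary size. Because the target model verifies every drafted token in speculative decoding, a token missing from the active set would be expected to lower the acceptance rate rather than change the final output.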
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





