KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
Abstract:
While test-time scaling is an effective strategy for enhancing the reasoning capabilities of large language models, it frequently encounters memory bottlenecks during long-horizon decoding due to the expansion of the Key-Value (KV) cache. Although KV-cache quantization offers a potential solution, existing methods are typically assessed in prefill-like contexts, where error dynamics differ significantly from those observed during autoregressive decoding. Our analysis reveals that in the autoregressive regime, quantization errors tend to accumulate over time, a phenomenon primarily caused by inaccurate token scales.
To address this, we propose KVarN, a calibration-free quantizer for the KV-cache. KVarN employs a Hadamard rotation followed by dual-scaling variance normalization applied to both axes of the K and V matrices. This approach effectively corrects outliers in token-scale errors and significantly curbs error accumulation compared to current baselines. Consequently, KVarN sets a new state-of-the-art for KV-cache quantization on generative benchmarks, achieving 2-bit precision on MATH500, AIME24, and HumanEval. A vLLM implementation of KVarN can be accessed at https://github.com/huawei-csl/KVarN.
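
The pipeline described above (rotate, normalize along both axes, then quantize) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and the abstract does not specify these details, so the following are all assumptions made for illustration: the Sylvester-type Hadamard matrix, alternating row/column RMS normalization as the "dual-scaling" step, the iteration count, and the per-tensor uniform quantization grid. Every function name (`hadamard`, `dual_scale`, `quantize_uniform`, `kv_quant_roundtrip`) is hypothetical.

```python
import numpy as np

def hadamard(n):
    """Orthonormal Sylvester Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def dual_scale(X, iters=4, eps=1e-8):
    """Alternately rescale rows (tokens) and columns (channels) to unit RMS.

    Assumption: RMS scaling without mean subtraction stands in for the
    paper's variance normalization. Returns (Xn, r, c) such that
    X == r[:, None] * Xn * c[None, :] up to floating-point error.
    """
    Xn = np.array(X, dtype=np.float64)  # float copy; input is left untouched
    r = np.ones(Xn.shape[0])
    c = np.ones(Xn.shape[1])
    for _ in range(iters):
        s = np.sqrt(np.mean(Xn ** 2, axis=1)) + eps  # per-token scale
        Xn /= s[:, None]
        r *= s
        s = np.sqrt(np.mean(Xn ** 2, axis=0)) + eps  # per-channel scale
        Xn /= s[None, :]
        c *= s
    return Xn, r, c

def quantize_uniform(X, bits=2):
    """Per-tensor min/max uniform quantization (an illustrative grid choice)."""
    lo, hi = X.min(), X.max()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((X - lo) / step).astype(np.uint8)
    return q, lo, step

def kv_quant_roundtrip(X, bits=2):
    """Rotate, dual-scale, quantize, then invert every step."""
    H = hadamard(X.shape[1])
    Xn, r, c = dual_scale(X @ H)          # rotate channels, then normalize
    q, lo, step = quantize_uniform(Xn, bits)
    Xhat = (q * step + lo) * r[:, None] * c[None, :]  # undo both scalings
    return Xhat @ H.T, q                  # H is orthonormal, so H.T inverts it
```

For a key or value matrix `X` of shape `(tokens, channels)` with a power-of-two channel count, `kv_quant_roundtrip(X, bits=2)` returns the dequantized reconstruction and the integer codes, which makes it easy to compare reconstruction error with and without the normalization step. Storing the row scales `r` and column scales `c` in full precision is a design choice of this sketch, not a claim about KVarN.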
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



