Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference
Title: Qift: A Zero-Free W2 Post-Training Quantization Approach for Efficient Rotated W2A4/KV4 LLM Inference
Abstract:
Two-bit weight quantization offers significant memory savings for large language model (LLM) inference, but the conventional W2 level set, {-2, -1, 0, +1}, often degrades under demanding W2A4/KV4 configurations. This study investigates the geometry of two-bit weight level sets within a quantization framework that uses Hadamard rotation. Our findings indicate that standard asymmetric W2 quantization yields substantial improvements over the traditional level set, suggesting that the limitations of W2A4 stem not only from the bit width but also from poor weight reconstruction.
An analysis of 224 linear modules in both LLaMA-2-7B and LLaMA-3.1-8B reveals that pretrained weights are already nearly zero-centered. Furthermore, Hadamard rotation effectively Gaussianizes their standardized distribution, reducing excess kurtosis and Q-Q error by several orders of magnitude. Building on this approximately zero-centered, Gaussian-like source model, we introduce Qift, a training-free, fixed no-zero W2 level set designed for rotated W2A4/KV4 inference. The primary level set is {+/-0.5, +/-1.5}, which corresponds to {+/-1, +/-3} under a half-scale reparameterization. A power-of-two variant instead uses {+/-1, +/-4}, so that decoded weights can be applied with sign-and-shift operations.
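The sign-and-shift idea behind the power-of-two variant can be illustrated with a minimal sketch. The 2-bit code layout used here (one sign bit, one magnitude bit) and the names `POT_LEVELS`, `apply_pot_weight`, and `dot_pot` are assumptions made for illustration; the abstract does not specify the actual code-to-level mapping or kernel.

```python
# Hypothetical 2-bit code layout (an assumption, not taken from the paper):
#   bit 1 = sign (0 -> +, 1 -> -), bit 0 = magnitude (0 -> 1, 1 -> 4).
POT_LEVELS = {0b00: 1, 0b01: 4, 0b10: -1, 0b11: -4}


def apply_pot_weight(code: int, x: int) -> int:
    """Multiply an integer activation x by the coded level using shift and negate only."""
    shift = 2 if (code & 0b01) else 0   # magnitude 4 is 1 << 2, magnitude 1 is 1 << 0
    y = x << shift
    return -y if (code & 0b10) else y


def dot_pot(codes, xs):
    """Integer dot product between 2-bit weight codes and integer activations."""
    return sum(apply_pot_weight(c, x) for c, x in zip(codes, xs))
```

Because both magnitudes are powers of two, no integer multiplier is needed for the weight itself; the per-channel scale would still be applied once to the accumulated sum.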
Qift requires no learned codebooks, group grids, zero points, or redesign of the fixed two-bit code-to-level mapping, and it keeps standard per-channel scaling. Through scale-invariant ratio analysis, we identify an optimal inner-to-outer centroid ratio range of 0.25 to 0.33. This range explains why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well, and why the {+/-1, +/-2} set, whose ratio is 0.5, does not.
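The ratio argument can be explored with a small numerical sketch. It assumes a unit-variance Gaussian source, nearest-level rounding, and a grid search over a single scale; these choices, along with the helper names, are illustrative assumptions rather than the paper's analysis procedure.

```python
import math


def phi(x):
    """Standard normal density, taken as 0 at +/-infinity."""
    return 0.0 if math.isinf(x) else math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def xphi(x):
    """x * phi(x), with the limit 0 at +/-infinity."""
    return 0.0 if math.isinf(x) else x * phi(x)


def gaussian_mse(levels, scale):
    """Exact MSE of nearest-level quantization of N(0, 1) onto scale * levels."""
    lv = sorted(scale * l for l in levels)
    edges = [-math.inf] + [(a + b) / 2 for a, b in zip(lv, lv[1:])] + [math.inf]
    total = 0.0
    for l, u, v in zip(lv, edges, edges[1:]):
        p = Phi(v) - Phi(u)            # probability mass of the cell
        m1 = phi(u) - phi(v)           # integral of x * phi(x) over the cell
        m2 = p + xphi(u) - xphi(v)     # integral of x^2 * phi(x) over the cell
        total += m2 - 2.0 * l * m1 + l * l * p
    return total


def best_mse(levels):
    """Minimum MSE over a coarse scale grid (0.01 to 3.00)."""
    return min(gaussian_mse(levels, 0.01 * k) for k in range(1, 301))


def inner_outer_ratio(levels):
    """Inner-to-outer magnitude ratio of a symmetric no-zero level set."""
    mags = [abs(l) for l in levels]
    return min(mags) / max(mags)


MNZ = [-1.5, -0.5, 0.5, 1.5]       # ratio 1/3
POT = [-4, -1, 1, 4]               # ratio 0.25
PM12 = [-2, -1, 1, 2]              # ratio 0.5
CONVENTIONAL = [-2, -1, 0, 1]      # standard W2 level set with a zero level

for name, lv in [("MNZ", MNZ), ("PoT", POT), ("+/-1,+/-2", PM12), ("conventional", CONVENTIONAL)]:
    print(name, round(best_mse(lv), 4))
```

Under this toy source model, the two no-zero sets with ratios of 1/3 and 0.25 reach a lower minimum error than both {+/-1, +/-2} and the conventional set. This measures only weight reconstruction error on an idealized Gaussian, not perplexity or downstream accuracy.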
Across both models, these no-zero level sets consistently improve on the standard W2 level set in pure W2A4 perplexity, mixed W2/W4 perplexity across L layers, downstream accuracy, and GPTQ residual behavior. In particular, at the mixed-precision setting L=16, they significantly narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision. Qift therefore provides a simple, source-aware, deployment-ready alternative to more complex learned W2 codebooks.
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



