GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation
Abstract:
Post-training quantization is a standard technique for compressing large neural networks, but aggressive low-bit quantization often degrades model performance substantially. A common remedy augments the quantized weights with a low-rank correction, giving approximations of the form $W\approx Q+LR$. This study analyzes the low-precision plus low-rank framework through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is the calibration matrix. To the best of our knowledge, we present the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints.
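The objective above can be made concrete with a small NumPy sketch. Everything named here is an assumption made for illustration: the helpers `layer_error`, `round_to_grid`, and `best_lowrank_correction` are hypothetical, the quantizer is plain round-to-nearest rather than GPTQ, and the data are random. For a fixed $Q$, the correction is obtained by truncating the SVD of $X(W-Q)$ (Eckart-Young), which assumes $X$ has full column rank.

```python
import numpy as np

def layer_error(X, W, Q, L=None, R=None):
    """Layer-wise reconstruction error ||XW - X(Q + LR)||_F^2."""
    approx = Q if L is None else Q + L @ R
    return float(np.linalg.norm(X @ W - X @ approx, "fro") ** 2)

def round_to_grid(W, bits=3):
    """Round-to-nearest on a symmetric uniform grid (stand-in quantizer, not GPTQ)."""
    levels = 2 ** (bits - 1) - 1            # 3-bit: integer codes in [-3, 3]
    scale = np.abs(W).max() / levels
    return np.clip(np.round(W / scale), -levels, levels) * scale

def best_lowrank_correction(X, W, Q, r):
    """Rank-r (L, R) minimizing ||X(W - Q) - X L R||_F; assumes full-column-rank X."""
    U, s, Vt = np.linalg.svd(X @ (W - Q), full_matrices=False)
    L = np.linalg.pinv(X) @ U[:, :r]        # pull the top-r left vectors back through X
    R = s[:r, None] * Vt[:r]                # singular values times right vectors
    return L, R

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))          # calibration matrix (n x d), synthetic
W = rng.standard_normal((32, 16))           # layer weights (d x d_out), synthetic
Q = round_to_grid(W, bits=3)
L, R = best_lowrank_correction(X, W, Q, r=4)

err_q = layer_error(X, W, Q)                # quantization only
err_qlr = layer_error(X, W, Q, L, R)        # quantization plus low-rank correction
print(err_q, err_qlr)
```

This corresponds to the sequential recipe (quantize first, compensate afterwards); the remaining error is the energy in the discarded singular values of $X(W-Q)$.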
We then introduce GPTQ-intrinsic LoRA, a training-free method that embeds the low-rank correction directly into a GPTQ-style quantization procedure by appropriately augmenting the calibration Hessian. We show that when $L$ is chosen as $V_r$, the matrix of top-$r$ right singular vectors of $X$, the layer-wise reconstruction error bounds replace the conventional GPTQ dependence on $\|X\|_F^2$ with the rank-$r$ residual $\|X-X_r\|_F^2$, apart from regularization terms. Under reasonable structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, differing only by constants and mild factors.
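A short numerical check gives some intuition for why the rank-$r$ residual appears when $L=V_r$. This is a sketch under stated assumptions, not the paper's argument: $Q$ is held fixed, `Delta` is a synthetic stand-in for the quantization error $W-Q$, and the Hessian augmentation itself is not reproduced because its form is not given in this abstract. With $L=V_r$ and the least-squares optimal $R$, the remaining error equals $\|(X-X_r)\Delta\|_F^2$, where $X_r$ is taken to be the rank-$r$ SVD truncation of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_out, r = 200, 24, 10, 6

# Synthetic calibration matrix with a decaying column scale
X = rng.standard_normal((n, d)) * np.logspace(0, -2, d)
# Stand-in for the quantization error W - Q (assumption: bounded, uniform noise)
Delta = rng.uniform(-0.5, 0.5, size=(d, d_out))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
V_r = Vt[:r].T                              # top-r right singular vectors (d x r)
X_r = (U[:, :r] * s[:r]) @ Vt[:r]           # rank-r truncation of X

# With L = V_r fixed, the optimal R is a least-squares fit of X @ Delta on X @ V_r
R, *_ = np.linalg.lstsq(X @ V_r, X @ Delta, rcond=None)

err_fixed_L = np.linalg.norm(X @ Delta - X @ V_r @ R, "fro") ** 2
err_residual = np.linalg.norm((X - X_r) @ Delta, "fro") ** 2
err_no_lowrank = np.linalg.norm(X @ Delta, "fro") ** 2

# Norm bound: ||(X - X_r) Delta||_F^2 <= ||X - X_r||_F^2 * ||Delta||_2^2
bound = np.linalg.norm(X - X_r, "fro") ** 2 * np.linalg.norm(Delta, 2) ** 2
print(err_no_lowrank, err_fixed_L, err_residual, bound)
```

The first two printed errors differ by exactly the part of $X\Delta$ captured by the top-$r$ left singular directions of $X$, which is why the bound scales with $\|X-X_r\|_F^2$ rather than $\|X\|_F^2$ in this simplified setting.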
We also propose Bid-Up, a fixed-grid quantization refinement technique that can be interleaved with optimal low-rank compensation and guarantees that the layer-wise reconstruction error never increases. Experiments on DeiT vision transformers and Qwen3 language models show that GPTQ-intrinsic LoRA outperforms both standard GPTQ and the sequential pipeline of GPTQ followed by low-rank compensation, with further gains from refinement loops.
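The idea of interleaving a fixed-grid refinement with optimal low-rank compensation can be sketched as an alternating loop. The refinement step below is not Bid-Up, whose details are not given in this abstract; it is a generic substitute that re-rounds $W-LR$ on the same grid and accepts the result only if the error does not increase. All helper names (`err`, `quantize`, `best_lowrank`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def err(X, W, Q, LR):
    """Layer-wise error ||X(W - Q - LR)||_F^2."""
    return float(np.linalg.norm(X @ (W - Q - LR), "fro") ** 2)

def quantize(W, scale, levels):
    """Round-to-nearest on a fixed symmetric grid with the given scale."""
    return np.clip(np.round(W / scale), -levels, levels) * scale

def best_lowrank(X, Delta, r):
    """Rank-r product LR minimizing ||X Delta - X LR||_F (full-column-rank X assumed)."""
    U, s, Vt = np.linalg.svd(X @ Delta, full_matrices=False)
    return np.linalg.pinv(X) @ ((U[:, :r] * s[:r]) @ Vt[:r])

rng = np.random.default_rng(2)
X = rng.standard_normal((256, 32))          # synthetic calibration matrix
W = rng.standard_normal((32, 16))           # synthetic weights
levels = 3                                  # integer codes in [-3, 3]
scale = np.abs(W).max() / levels            # grid is fixed once and never changed

Q = quantize(W, scale, levels)
LR = np.zeros_like(W)
history = [err(X, W, Q, LR)]
for _ in range(5):
    # Step 1: optimal low-rank compensation for the current Q
    LR = best_lowrank(X, W - Q, r=4)
    history.append(err(X, W, Q, LR))
    # Step 2: candidate re-quantization on the same grid (generic stand-in, not Bid-Up)
    Q_new = quantize(W - LR, scale, levels)
    if err(X, W, Q_new, LR) <= history[-1]:  # accept only if the error does not increase
        Q = Q_new
    history.append(err(X, W, Q, LR))

print(history)
```

Step 1 cannot increase the error because the previous $LR$ is itself a feasible rank-$r$ candidate, and step 2 cannot increase it because of the explicit acceptance test, so the recorded error sequence is non-increasing by construction.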
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





