Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data
Abstract:
Deploying large language models (LLMs) on edge devices is often hindered by strict limits on memory capacity and bandwidth. Reducing weight precision to 2 bits can significantly boost inference throughput and cut memory usage, but it usually causes a substantial drop in model accuracy. This study adapts Recover-LoRA, a lightweight, data-free technique originally designed to mitigate general weight corruption, to the specific challenge of ultra-low-bit quantization.
We introduce a selective mixed-precision approach named GateUp, where only the gate and up projection layers within the Multi-Layer Perceptron (MLP) are compressed to 2-bit (W2), while all other linear layers retain higher precision. Through roofline analysis across three model architectures ranging from 4B to 20B parameters and two distinct hardware platforms, we show that this W4/W2-GateUp configuration offers a 7.5% to 23.3% increase in tokens per second (TPS) compared to uniform 4-bit (W4) deployment, depending on the specific model and context length. This strategy effectively isolates quantization errors to a predictable set of layers.
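The throughput argument can be illustrated with a back-of-the-envelope, memory-bound estimate. The sketch below is not the paper's roofline analysis: it assumes decoding is limited purely by the bytes of linear-layer weights read per token, and it ignores the KV cache, embeddings, quantization scales, and compute. The layer sizes, the bandwidth figure, and every function name are hypothetical, and the 4-bit setting for the non-GateUp layers is read off the "W4/W2-GateUp" label.

```python
def weight_bytes_per_token(hidden, intermediate, n_layers, bits):
    """Bytes of linear-layer weights read to decode one token (toy transformer)."""
    # Parameter counts per layer. Attention is assumed to use four square
    # projections (q, k, v, o); the MLP has gate, up, and down projections.
    params = {
        "attn": 4 * hidden * hidden,
        "gate": hidden * intermediate,
        "up": hidden * intermediate,
        "down": intermediate * hidden,
    }
    total_bits = sum(params[name] * bits[name] for name in params)
    return n_layers * total_bits / 8


def memory_bound_tps(bytes_per_token, bandwidth_bytes_per_s):
    """Tokens per second if reading weights from memory is the only cost."""
    return bandwidth_bytes_per_s / bytes_per_token


# Hypothetical model shape and memory bandwidth, chosen only for illustration.
HIDDEN, INTERMEDIATE, N_LAYERS = 1024, 4096, 24
BANDWIDTH = 100e9  # bytes per second

uniform_w4 = {"attn": 4, "gate": 4, "up": 4, "down": 4}
gate_up_w2 = {"attn": 4, "gate": 2, "up": 2, "down": 4}

bytes_w4 = weight_bytes_per_token(HIDDEN, INTERMEDIATE, N_LAYERS, uniform_w4)
bytes_gu = weight_bytes_per_token(HIDDEN, INTERMEDIATE, N_LAYERS, gate_up_w2)

tps_w4 = memory_bound_tps(bytes_w4, BANDWIDTH)
tps_gu = memory_bound_tps(bytes_gu, BANDWIDTH)
speedup = tps_gu / tps_w4

print(f"idealized TPS gain from W2 gate/up: {100 * (speedup - 1):.1f}%")
```

Under these toy assumptions the gain is an upper bound. Counting the traffic this sketch omits (KV cache reads that grow with context length, embeddings, quantization metadata) would shrink it, which is in line with the abstract's statement that the measured gain depends on model and context length.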
To address the accuracy loss caused by 2-bit quantization of these layers, we employ Recover-LoRA. This method trains low-rank adapters on the quantized components using logit distillation on synthetic data. In a detailed case study involving Qwen3-4B, Recover-LoRA restored 80% to 95% of the accuracy on nine out of twelve benchmarks, using just 10,000 synthetic training samples and no labeled data. Furthermore, our findings indicate that synthetic data yields distillation-based recovery comparable to that achieved with curated labeled datasets, and that the recovery extends to out-of-distribution evaluation tasks. These results position Recover-LoRA as a practical solution for post-quantization accuracy recovery in scenarios that demand aggressive weight compression.
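The two ingredients named above, a low-rank adapter on a frozen quantized weight and a logit distillation loss, can be sketched in a few lines of plain Python. This is a minimal toy under stated assumptions, not the authors' implementation: a single linear layer stands in for the whole model, the "quantization error" is constructed to be exactly rank 1 so that a rank-1 adapter can cancel it, and the adapter is set by hand rather than trained. All names (`lora_forward`, `kd_loss`, `W_q`, `A`, `B`) are hypothetical.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W_q, A, B, x, scale=1.0):
    """Frozen quantized weight plus low-rank correction: W_q x + scale * B (A x)."""
    base = matvec(W_q, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t, s))


# Toy "quantized" weight and a rank-1 error E = b a^T (illustrative values).
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a = [0.5, -0.25]
b = [0.2, -0.4, 0.6]
W_fp = [[W_q[i][j] + b[i] * a[j] for j in range(2)] for i in range(3)]

x = [1.0, -2.0]  # stands in for one synthetic input; no label is needed
teacher = matvec(W_fp, x)  # full-precision model provides the target logits

# Adapter initialised with B = 0, so the student equals the quantized model.
A = [a]
B_zero = [[0.0], [0.0], [0.0]]
loss_before = kd_loss(teacher, lora_forward(W_q, A, B_zero, x))

# A rank-1 adapter that matches the error drives the distillation loss to ~0.
B_fit = [[bi] for bi in b]
loss_after = kd_loss(teacher, lora_forward(W_q, A, B_fit, x))

print(f"KD loss before: {loss_before:.6f}, after: {loss_after:.2e}")
```

In practice the adapter weights would be found by gradient descent on this loss over many synthetic samples, and real quantization error is not exactly low rank, so recovery is partial rather than exact.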
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




