WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization
Title: WINDQuant: A Weight-Aware Reinforcement Learning Framework for Global Mixed-Precision LLM Quantization
Abstract: While quantization is a proven method for lowering the memory requirements and inference expenses of Large Language Models (LLMs), preserving model accuracy in ultra-low-bit scenarios presents a significant hurdle. Current post-training techniques frequently lead to substantial drops in performance, whereas quantization-aware training demands expensive retraining processes and substantial computational resources. Furthermore, prevailing mixed-precision strategies typically depend on coarse-grained or heuristic sensitivity analyses, failing to account for the nuanced variations found within weight matrices. To address these issues, we introduce WINDQuant, a reinforcement learning-driven allocation controller designed for ultra-low-bit LLM quantization. Instead of proposing a new low-level quantization operator, WINDQuant utilizes reinforcement learning to determine optimal bit-widths and quantization treatments for fine-grained column chunks, all while adhering to a global storage constraint. By functioning at the column-chunk level, this approach allows for flexible, high-resolution precision allocation within layers under a specified global bit-width target. The system integrates Proximal Policy Optimization (PPO) with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit accounting of the effective bits in the learned mixed-precision configuration. Evaluations on LLaMA models indicate that WINDQuant delivers competitive results in ultra-low-bit environments, significantly lowering optimization overhead compared to retraining-based methods and demonstrating the viability of reinforcement learning as an effective controller for adaptive mixed-precision quantization.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
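The abstract's "explicit accounting of the effective bits" under a global storage constraint can be illustrated with a small sketch. This is not WINDQuant's implementation: the `Chunk` type, the `meta_bits` per-chunk overhead (assumed here to be 32 bits, for example an fp16 scale plus an fp16 zero-point), and the function names are all hypothetical, chosen only to show how an average bit-width over column chunks might be computed and compared with a global target.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Chunk:
    """One column chunk of a weight matrix (hypothetical representation)."""
    rows: int            # number of rows in the weight matrix
    cols: int            # number of columns grouped into this chunk
    bits: int            # bit-width assigned to every weight in the chunk
    meta_bits: int = 32  # assumed per-chunk overhead, e.g. scale + zero-point


def effective_bits(chunks: Sequence[Chunk]) -> float:
    """Average storage per weight, counting quantizer metadata."""
    total_weights = sum(c.rows * c.cols for c in chunks)
    if total_weights == 0:
        raise ValueError("at least one non-empty chunk is required")
    total_storage = sum(c.rows * c.cols * c.bits + c.meta_bits for c in chunks)
    return total_storage / total_weights


def within_budget(chunks: Sequence[Chunk], target_bits: float) -> bool:
    """Check a mixed-precision configuration against a global bit target."""
    return effective_bits(chunks) <= target_bits


# Four equally sized chunks assigned 2, 2, 4, and 1 bits.
chunks = [
    Chunk(rows=4096, cols=128, bits=2),
    Chunk(rows=4096, cols=128, bits=2),
    Chunk(rows=4096, cols=128, bits=4),
    Chunk(rows=4096, cols=128, bits=1),
]
print(effective_bits(chunks))
print(within_budget(chunks, target_bits=2.25))
```

In this toy configuration the weights alone average 2.25 bits, but the assumed metadata pushes the effective figure slightly higher, so a strict 2.25-bit target is missed. That gap is the reason an allocation controller would need to count overhead explicitly rather than average the assigned bit-widths.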





