BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
Abstract:
Although Mixture-of-Experts (MoE) large language models reduce per-token compute through sparse expert activation, their deployment is limited by memory: every expert's weights must stay resident, even though only a few experts run for each token. Existing compression methods for MoE architectures struggle in the ultra-low-bit regime. Pruning permanently removes model capacity, while coarse-grained quantization cannot allocate bits according to how important each expert and each weight direction is. To address this, we introduce BitsMoE, an MoE LLM quantization framework that uses spectral energy to guide bit allocation.
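A rough back-of-the-envelope estimate shows why bit-width dominates the memory cost. Take, as an illustration, a model with 30 billion total parameters, the nominal size implied by the name of the Qwen3-30B-A3B model evaluated below (the exact count differs somewhat). Ignoring quantization scales, the unquantized shared basis, and any layers kept at higher precision, the weight storage is roughly:

```latex
\text{16-bit: } 30\times10^{9} \times \tfrac{16}{8}\ \text{bytes} = 60\ \text{GB},
\qquad
\text{2-bit: } 30\times10^{9} \times \tfrac{2}{8}\ \text{bytes} = 7.5\ \text{GB}
```

The "A3B" suffix indicates only about 3 billion parameters are activated per token, so most of that storage holds experts that are idle for any given token.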
BitsMoE applies Singular Value Decomposition (SVD) to factor each MoE layer into a shared basis and expert-specific spectral factors. The shared basis captures structure common to all experts and is kept unquantized (in full precision), while the expert-specific factors serve as the units of fine-grained quantization. To assign a bit-width to each unit, BitsMoE models spectrum-wise mixed-precision quantization with an activation-aware reconstruction surrogate, then solves an integer linear program (ILP) that minimizes the estimated reconstruction loss within a fixed bit budget.
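To make the decomposition and allocation steps concrete, here is a minimal NumPy sketch. It rests on assumptions the abstract does not state: the shared basis is taken as the top right singular vectors of the row-stacked expert weights; each column of an expert's factor is one quantization unit; quantizing a unit with b bits is assumed to add output error proportional to its spectral energy, times the calibration activation energy along that direction, times 4^-b; and the candidate bit-widths are {2, 3, 4}. The ILP is solved exactly with a multiple-choice knapsack dynamic program instead of a general ILP solver. All names (shared_basis_decompose, surrogate_losses, allocate_bits, BIT_CHOICES) are hypothetical and not taken from the BitsMoE code.

```python
import numpy as np

# Hypothetical candidate bit-widths for each spectral unit.
BIT_CHOICES = (2, 3, 4)


def shared_basis_decompose(experts, rank):
    """Factor each expert W_e (d_out x d_in) as W_e ~= U_e @ V.T with one V shared by all experts."""
    stacked = np.concatenate(experts, axis=0)            # (E * d_out, d_in)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    v = vt[:rank].T                                      # shared basis (d_in, rank), left unquantized
    factors = [w @ v for w in experts]                   # expert-specific factors (d_out, rank)
    return v, factors


def surrogate_losses(v, factors, calib_x):
    """Assumed error of giving unit (e, k) b bits: ||u_ek||^2 * E[(v_k . x)^2] * 4^-b."""
    act_energy = np.mean((calib_x @ v) ** 2, axis=0)     # (rank,) calibration energy per direction
    losses = {}
    for e, u in enumerate(factors):
        col_energy = np.sum(u ** 2, axis=0)              # (rank,) spectral energy per factor column
        for k in range(u.shape[1]):
            for b in BIT_CHOICES:
                losses[(e, k, b)] = col_energy[k] * act_energy[k] * 4.0 ** (-b)
    return losses


def allocate_bits(losses, units, budget):
    """Exactly solve min sum(loss) s.t. one bit-width per unit and sum(bits) <= budget,
    using a multiple-choice knapsack DP as a stand-in for a general ILP solver."""
    best = {0: (0.0, {})}                                # bits used -> (total loss, assignment)
    for unit in units:
        nxt = {}
        for used, (loss, assign) in best.items():
            for b in BIT_CHOICES:
                total = used + b
                if total > budget:
                    continue
                cand = loss + losses[(*unit, b)]
                if total not in nxt or cand < nxt[total][0]:
                    nxt[total] = (cand, {**assign, unit: b})
        best = nxt
    if not best:
        raise ValueError("budget is below the minimum bit-width for every unit")
    return min(best.values(), key=lambda item: item[0])


# Usage on random toy weights: 4 experts, rank-8 shared basis, 3-bit average budget.
rng = np.random.default_rng(0)
experts = [rng.standard_normal((16, 32)) for _ in range(4)]
calib_x = rng.standard_normal((64, 32))
v, factors = shared_basis_decompose(experts, rank=8)
losses = surrogate_losses(v, factors, calib_x)
units = [(e, k) for e in range(len(experts)) for k in range(v.shape[1])]
# Every unit has the same length (d_out = 16), so the budget is counted in bits per element.
loss, assign = allocate_bits(losses, units, budget=3 * len(units))
```

Because the assumed surrogate shrinks by a factor of 4 per extra bit, the solver spends bits on the units whose energy-times-activation product is largest and gives 2 bits to the rest. The DP is exact but scales with units times budget; with many more units per layer, handing the same objective to an off-the-shelf ILP solver would likely be more practical.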
Evaluations on several MoE LLMs show that BitsMoE substantially reduces accuracy degradation in ultra-low-bit settings. For 2-bit quantization of Qwen3-30B-A3B-Base, BitsMoE quantizes the model 12.3 times faster than GPTQ, improves average accuracy by 27.83 percentage points, and decodes 1.76 times faster. The source code and model are available at https://github.com/zjiayu064/BitsMoE.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




