Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
Abstract
Large Reasoning Models (LRMs) typically depend on extensive reasoning traces, which significantly increases inference costs. Although low-bit quantization is known to lower the computational expense of per-token decoding, we demonstrate that aggressive 2-bit inference often fails to provide an overall speedup. This is due to generation instability, which causes the total number of tokens to inflate. Rather than simply reducing answer accuracy, 2-bit quantization frequently results in significantly longer reasoning paths characterized by repetitive loops, premature budget depletion, delayed decision-making, and incomplete reasoning segments. By analyzing the complete reasoning traces of Qwen3 models on both mathematical and commonsense benchmarks, we establish that accuracy drops are closely associated with these process-level failures. To mitigate these issues, we propose two lightweight control mechanisms: FP16 planning, which provides the 2-bit model with a brief high-precision outline, and loop rescue, which identifies repetitive patterns and either selects an earlier answer or reverts to FP16. Our experiments on MATH-500 show that loop rescue boosts Qwen3-8B accuracy from 17.2% to 74.2%, while the combination of planning and loop rescue raises Qwen3-32B accuracy from 65.0% to 87.2%. These findings indicate that extreme low-bit reasoning is viable when its failures are managed as controllable generation pathologies. With selective FP16 assistance and lightweight detection methods, 2-bit inference can restore accuracy without sacrificing end-to-end speed. The code for this work is available at: https://github.com/brain-lab-research/quantized-reasoning.
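The loop-rescue idea described in the abstract (detect a repetitive pattern, then either take an earlier answer or revert to FP16) can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram repetition test, the thresholds `n` and `max_repeats`, the whitespace tokenization, the `\boxed{...}` answer convention, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Optional, Sequence, Tuple


def has_repetitive_loop(tokens: Sequence[str], n: int = 8, max_repeats: int = 3) -> bool:
    """Return True if any n-gram of `tokens` occurs at least `max_repeats` times.

    Hypothetical detector: the n-gram criterion and both thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    if len(tokens) < n:
        return False
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values()) >= max_repeats


def last_boxed_answer(text: str) -> Optional[str]:
    r"""Return the contents of the last \boxed{...} in `text`, if any.

    Assumes answers are marked with \boxed{...} and contain no nested braces.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None


def loop_rescue(trace: str) -> Tuple[str, Optional[str]]:
    """Decide what to do with a 2-bit reasoning trace.

    Returns (action, answer) where action is one of:
      "keep"               - no loop detected, use the trace as-is
      "use_earlier_answer" - loop detected, an answer was already stated
      "fallback_fp16"      - loop detected, no answer found; rerun in FP16
    """
    tokens = trace.split()  # whitespace split stands in for a real tokenizer
    answer = last_boxed_answer(trace)
    if not has_repetitive_loop(tokens):
        return "keep", answer
    if answer is not None:
        return "use_earlier_answer", answer
    return "fallback_fp16", None
```

In a real decoding loop a check like this would run on token IDs at intervals during generation, so that a looping trace is cut off before it exhausts the token budget.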
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




