arXiv

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

June 3, 2026 · Xiaocan Li, Shiliang Wu, Zheng Shen

Title: Deconstructing MXFP4 Quantization Error in LLM Reinforcement Learning: Reducible Bias, a Recoverable Deadzone, and an Irreducible Floor

Original: arXiv:2605.20402v3 Announce Type: replace-cross Abstract: MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms by which quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias, Outlier Fallback to recover deadzone entries (which also partially reduces scale bias-induced error), and Adaptive Quantization Noise (AQN) to control the policy entropy. On the Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts models, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.

Rewrite: MXFP4 arithmetic can substantially speed up the reinforcement learning (RL) post-training of large language models (LLMs), but its quantization error causes severe accuracy degradation. Previous work has treated this error as a single, monolithic noise term, which hides the distinct mechanisms through which quantization harms training. In this work, we prove an exact three-way decomposition of the quantization error and show that each component dominates a different RL training pathway. Through both theoretical and empirical analysis, we identify three additive components: "scale bias," resulting from rounding the shared scale to a power of two; "deadzone truncation," caused by setting small values to zero; and "grid noise," arising from rounding to the nearest point on a 4-bit grid.
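To make the idea of an additive split concrete, here is a minimal NumPy sketch of one way to separate the error of a single MXFP4 block into three parts that sum exactly to the total. It is an illustration under stated assumptions, not the paper's definitions: the abstract does not give the formal decomposition. The E2M1 value grid and the power-of-two scale rule follow the usual description of the MX format, while the real-valued reference scale `ideal_scale`, the attribution rule, and the function names `round_to_grid` and `decompose_block` are hypothetical choices made for this example. Ties are broken toward the smaller magnitude for simplicity rather than round-to-nearest-even.

```python
import numpy as np

# Magnitudes representable by an FP4 E2M1 element (the sign is handled separately).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def round_to_grid(v):
    """Round each value to the nearest E2M1 magnitude, saturating at 6."""
    mag = np.minimum(np.abs(v), E2M1_GRID[-1])
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return np.sign(v) * E2M1_GRID[idx]


def decompose_block(x):
    """Split the quantization error of one block into three additive parts.

    Assumed attribution (illustrative only):
      deadzone   : error on nonzero entries that MXFP4 flushes to zero
      grid_noise : rounding error the surviving entries would have under a
                   real-valued reference scale (amax / 6)
      scale_bias : whatever remains, i.e. the extra error caused by using a
                   power-of-two scale instead of the reference scale
    """
    x = np.asarray(x, dtype=np.float64)
    amax = np.abs(x).max()
    if amax == 0.0:
        zeros = np.zeros_like(x)
        return zeros, zeros, zeros, zeros

    # Shared power-of-two scale: 2^(floor(log2(amax)) - 2), where 2 is the
    # largest exponent of an E2M1 element.
    pow2_scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    ideal_scale = amax / E2M1_GRID[-1]  # hypothetical real-valued reference

    q_pow2 = pow2_scale * round_to_grid(x / pow2_scale)
    q_ideal = ideal_scale * round_to_grid(x / ideal_scale)

    total = q_pow2 - x
    dead = (q_pow2 == 0) & (x != 0)
    deadzone = np.where(dead, -x, 0.0)
    grid_noise = np.where(dead, 0.0, q_ideal - x)
    scale_bias = total - deadzone - grid_noise
    return total, scale_bias, deadzone, grid_noise


# Usage: decompose one 32-element block and check that the parts add up.
rng = np.random.default_rng(0)
block = rng.standard_normal(32)
total, scale_bias, deadzone, grid_noise = decompose_block(block)
assert np.allclose(scale_bias + deadzone + grid_noise, total)
```

The split is exact by construction because `scale_bias` is defined as the remainder. Under this attribution it is zero on flushed entries and equals the difference between the power-of-two-scaled and reference-scaled quantizations elsewhere, including any saturation at the top of the grid.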

We find that each component dominates a specific RL failure mode. Scale bias accumulates multiplicatively through the backward pass, compromising gradient accuracy. Deadzone truncation degrades rollout quality, and grid noise raises the policy’s entropy. To address these issues, we combine corrections that each target one RL failure mode but are not exclusive to a single error component: Macro-block scaling to reduce scale bias, Outlier Fallback to recover deadzone entries (which also partially reduces scale bias-induced error), and Adaptive Quantization Noise (AQN) to control policy entropy. With these targeted corrections, the Qwen2.5-3B dense model recovers to within 0.7% of BF16 accuracy, and the Qwen3-30B-A3B-Base mixture-of-experts model exceeds BF16 by 1.0%.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC