arXiv

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Title: Deconstructing MXFP4 Quantization Error in LLM Reinforcement Learning: Reducible Bias, a Recoverable Deadzone, and an Irreducible Floor

Original: arXiv:2605.20402v3. Abstract: MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms by which quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias, Outlier Fallback to recover deadzone entries (which also partially reduces scale-bias-induced error), and Adaptive Quantization Noise (AQN) to control the policy entropy. On the Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts models, the targeted corrections recover BF16 accuracy to within 0.7% and exceed BF16 by +1.0%, respectively.

Rewrite: MXFP4 arithmetic offers significant speedups for the reinforcement learning (RL) post-training of large language models (LLMs), but its quantization error causes severe drops in accuracy. Previous studies have treated this error as a single, monolithic noise term, overlooking the distinct mechanisms through which quantization harms training. In this work, we prove that MXFP4 quantization error decomposes exactly into three additive components, each of which dominates a different RL training pathway. Through both theoretical and empirical analysis, we identify these components as "scale bias," resulting from rounding the shared scale to a power of two; "deadzone truncation," caused by setting small values to zero; and "grid noise," arising from rounding to the nearest point on a 4-bit grid.
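To make the three-way split concrete, here is a minimal NumPy sketch of one way to write an exact additive decomposition. It is not taken from the paper, whose precise definitions the abstract does not give. The assumptions are: blocks of 32 values with a shared power-of-two scale and the FP4 E2M1 grid, as in the OCP microscaling format; a hypothetical real-valued "ideal" scale (`amax / 6`) used as the reference for scale bias; and simplified tie-breaking. The function names `round_to_grid`, `quantize`, and `decompose` are illustrative.

```python
import numpy as np

# Magnitudes representable by FP4 E2M1 elements.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32  # MXFP4 shares one scale across 32 elements


def round_to_grid(y):
    """Round to the nearest FP4 magnitude, keeping the sign; saturate at 6.

    Ties resolve toward the smaller magnitude (a simplification).
    """
    mag = np.minimum(np.abs(y), FP4_GRID[-1])
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(y) * FP4_GRID[idx]


def quantize(x, scale):
    """Quantize-dequantize x with a per-block scale."""
    return scale * round_to_grid(x / scale)


def decompose(x):
    """Split MXFP4 error into (total, scale_bias, deadzone, grid_noise)."""
    x = np.asarray(x, dtype=np.float64).reshape(-1, BLOCK)
    amax = np.abs(x).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)  # guard all-zero blocks

    # Assumed reference: a real-valued scale that maps amax onto the top grid point.
    s_ideal = amax / FP4_GRID[-1]
    # Power-of-two shared scale: 2^(floor(log2(amax)) - 2), since E2M1 has emax = 2.
    s_pow2 = 2.0 ** (np.floor(np.log2(amax)) - 2)

    q_ideal = quantize(x, s_ideal)
    q_pow2 = quantize(x, s_pow2)

    total = q_pow2 - x
    scale_bias = q_pow2 - q_ideal            # effect of power-of-two scale rounding
    dead = (q_ideal == 0) & (x != 0)         # nonzero inputs flushed to zero
    deadzone = np.where(dead, -x, 0.0)
    grid_noise = np.where(dead, 0.0, q_ideal - x)
    return total, scale_bias, deadzone, grid_noise


rng = np.random.default_rng(0)
total, scale_bias, deadzone, grid_noise = decompose(rng.standard_normal(4 * BLOCK))
print(np.allclose(scale_bias + deadzone + grid_noise, total))
```

Under these definitions the three terms telescope, so their sum equals the total error by construction. Choosing a different reference quantizer would shift error between the scale-bias and grid-noise terms.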

We find that each component dominates a specific RL failure mode. Scale bias accumulates multiplicatively through the backward pass, compromising gradient accuracy. Deadzone truncation degrades the quality of rollouts, while grid noise raises the policy’s entropy. To address these issues, we combine corrections that each target one RL failure mode but are not exclusive to a single error component: Macro-block scaling to reduce scale bias, Outlier Fallback to recover truncated deadzone entries (which also partially reduces scale-bias-induced error), and Adaptive Quantization Noise (AQN) to control policy entropy. With these targeted corrections, the dense Qwen2.5-3B model recovers to within 0.7% of BF16 accuracy, and the mixture-of-experts Qwen3-30B-A3B-Base model exceeds BF16 accuracy by 1.0%.
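The claim that scale bias compounds multiplicatively in the backward pass can be illustrated with a toy model. The sketch below assumes, purely for illustration, that every layer's weights carry the same systematic relative bias (a hypothetical 2%) and that backpropagation through a layer is a single transposed matrix product. Neither assumption comes from the paper, and the names `backprop_gain` and `simulate` are illustrative.

```python
import numpy as np


def backprop_gain(per_layer_bias, depth):
    """Gradient scale after `depth` layers if each layer multiplies by (1 + bias)."""
    return (1.0 + per_layer_bias) ** depth


def simulate(depth=32, dim=64, bias=0.02, seed=0):
    """Backpropagate through random linear layers, with and without a scale bias.

    Returns the ratio of the biased gradient norm to the reference gradient norm.
    """
    rng = np.random.default_rng(seed)
    g_ref = rng.standard_normal(dim)
    g_biased = g_ref.copy()
    for _ in range(depth):
        w = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        g_ref = w.T @ g_ref                        # unbiased backward step
        g_biased = ((1.0 + bias) * w).T @ g_biased  # same step with scaled weights
    return np.linalg.norm(g_biased) / np.linalg.norm(g_ref)


print(simulate(), backprop_gain(0.02, 32))
```

In this toy setting the two printed values agree, because a shared per-layer factor multiplies through every step of the chain. Real quantized layers would also contribute the deadzone and grid-noise terms, which this sketch deliberately omits.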


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
