GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
Title: GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
Abstract: Numerical instability remains a critical hurdle in training low-precision language models. Even when utilizing efficient, cost-effective pathways, a limited number of operators can introduce transient numerical risks. To address this, we introduce Gradient Norm-to-Mean Ratio (GNMR), a lightweight runtime controller that evaluates the current gradient norm of each recoverable unit against its historical average. By integrating $\Delta$-GNMR to detect sudden, short-window spikes, GNMR translates these localized risk indicators into constrained recovery measures. This approach operates within a strict $\mathrm{maxO}$ budget and a brief lock interval, requiring no alterations to the numerical format, kernels, or backend configurations. Evaluations across activation-quantization stress tests, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning demonstrate that GNMR maintains high-fidelity quality through sparse, budget-aware recovery. These findings position GNMR as a backend-agnostic solution that enhances the stability of low-precision training without compromising low-cost execution.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





