Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise
Title: Optimizing Scale-Invariant Neural Networks: The Role of Norm Geometry and Heavy-Tailed Noise
Abstract:
Recent insights into neural network optimization suggest that optimizer design must account for model parametrization. Scale-invariant approaches have gained prominence because their layerwise normalized updates facilitate the transfer of hyperparameters across different model scales and leverage the geometry of input-output matrix norms. Concurrently, stochastic gradient noise in deep learning frequently deviates from sub-Gaussian distributions, often displaying heavy-tailed characteristics. While these observations have driven recent algorithmic developments, their combined theoretical implications remain largely unexamined. Specifically, the unavoidable dimension dependence for scale-invariant techniques involving general input-output matrix norms, as well as the potential for higher-order smoothness to accelerate training amidst heavy-tailed noise, are not fully understood.
This study investigates these issues through the lens of nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ equipped with general norms. The objective is to identify an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our primary finding is a dimension-dependent lower bound: if the ratio $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is sufficiently large, any scale-invariant first-order method relying on the spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We demonstrate that a batched Scion method utilizing the spectral norm attains a matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$.
To harness the benefits of higher-order smoothness, we introduce a transported Scion method. When the norm is spectral and the Hessian is Lipschitz, this approach improves the convergence bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$. Furthermore, we integrate practical heuristics into our transported method and assess its performance across various model sizes and architectures, highlighting its adaptability and effectiveness in training neural networks.
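
To make the two rates above easier to compare, the following minimal sketch evaluates the exponents of $\epsilon^{-1}$ quoted in the abstract, $\frac{3p-2}{p-1}$ and $\frac{5p-3}{2p-2}$, for a given moment order $p$. The function names are illustrative and do not come from the paper, and the check `p > 1` is an assumption made only so that the denominators are positive.

```python
from fractions import Fraction


def first_order_exponent(p):
    """Exponent a in the eps^(-a) rate (3p - 2)/(p - 1) quoted for spectral-norm Scion."""
    p = Fraction(p)
    if p <= 1:
        # Assumption: the formulas are only meaningful when p - 1 > 0.
        raise ValueError("p must be greater than 1")
    return (3 * p - 2) / (p - 1)


def transported_exponent(p):
    """Exponent a in the eps^(-a) rate (5p - 3)/(2p - 2) quoted for transported Scion."""
    p = Fraction(p)
    if p <= 1:
        raise ValueError("p must be greater than 1")
    return (5 * p - 3) / (2 * p - 2)


if __name__ == "__main__":
    # Compare the two exponents for a few moment orders.
    for p in (Fraction(2), Fraction(3, 2), Fraction(5, 4)):
        a1 = first_order_exponent(p)
        a2 = transported_exponent(p)
        print(f"p = {p}: first-order {a1}, transported {a2}, gap {a1 - a2}")
```

For $p = 2$ the exponents are $4$ and $7/2$. More generally, $\frac{3p-2}{p-1} - \frac{5p-3}{2p-2} = \frac{1}{2}$ for every $p > 1$, so the stated improvement from Hessian Lipschitzness is a factor of $\epsilon^{-1/2}$ regardless of how heavy the tails are, while both exponents grow without bound as $p \to 1$.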
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





