arXiv

Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Title: Optimizing Scale-Invariant Neural Networks: The Role of Norm Geometry and Heavy-Tailed Noise

Abstract:

Recent insights into neural network optimization suggest that optimizer design must account for model parametrization. Scale-invariant approaches have gained prominence because their layerwise normalized updates facilitate the transfer of hyperparameters across different model scales and leverage the geometry of input-output matrix norms. Concurrently, stochastic gradient noise in deep learning frequently deviates from sub-Gaussian distributions, often displaying heavy-tailed characteristics. While these observations have driven recent algorithmic developments, their combined theoretical implications remain largely unexamined. Specifically, the unavoidable dimension dependence for scale-invariant techniques involving general input-output matrix norms, as well as the potential for higher-order smoothness to accelerate training amidst heavy-tailed noise, are not fully understood.

This study investigates these issues through the lens of nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ equipped with general norms. The objective is to identify an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our primary finding is a dimension-dependent lower bound: if the ratio $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is sufficiently large, any scale-invariant first-order method relying on the spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We demonstrate that a batched Scion method utilizing the spectral norm attains a matching upper bound, $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$.
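To make the notion of a scale-invariant spectral-norm update concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the batched Scion method analyzed in the paper: the function names `spectral_lmo` and `batched_spectral_step`, the plain mini-batch average, and the absence of momentum are all choices made for this example. The direction is the linear minimization oracle over the unit spectral-norm ball, which for a gradient with reduced SVD $G = U\Sigma V^\top$ is $-UV^\top$.

```python
import numpy as np


def spectral_lmo(G):
    """Direction minimizing <G, D> over the unit spectral-norm ball.

    For the reduced SVD G = U S V^T the minimizer is -U V^T, so every
    nonzero singular value of the returned direction equals 1.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -(U @ Vt)


def batched_spectral_step(W, grads, lr):
    """One illustrative update: average a batch of stochastic gradients,
    then move along the spectral-norm LMO direction.

    Hypothetical helper for this sketch; the paper's batched Scion method
    may differ (for example in how batching or momentum is handled).
    """
    G = np.mean(grads, axis=0)  # mini-batch average of the gradients
    return W + lr * spectral_lmo(G)
```

Because the direction depends only on the singular vectors of the averaged gradient, rescaling the gradient by any positive constant leaves the update unchanged, which is the scale-invariance property the abstract refers to.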

To harness the benefits of higher-order smoothness, we introduce a transported Scion method. When the norm is spectral and the Hessian is Lipschitz, this approach improves the convergence bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$. Furthermore, we integrate practical heuristics into our transported method and assess its performance across various model sizes and architectures, highlighting its adaptability and effectiveness in training neural networks.
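As a quick check on the two rates quoted above, the gap between their exponents can be worked out directly. The range $p \in (1, 2]$ is an assumption made here (the usual heavy-tailed setting); the abstract does not state it.

$$
\frac{3p-2}{p-1} - \frac{5p-3}{2p-2} = \frac{(6p-4) - (5p-3)}{2(p-1)} = \frac{p-1}{2(p-1)} = \frac{1}{2}.
$$

So, under that assumption, the Lipschitz-Hessian bound saves a factor of $\epsilon^{-1/2}$ for every admissible $p$. For example, at $p = 2$ (bounded variance) the exponents are $\frac{3\cdot 2-2}{2-1} = 4$ and $\frac{5\cdot 2-3}{2\cdot 2-2} = \frac{7}{2}$, giving $O(\min\{m,n\}\epsilon^{-4})$ and $O(\min\{m,n\}\epsilon^{-7/2})$ respectively.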


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
