Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent
Title: Deconstructing the Curvature Exponent in Neural Network Loss Landscapes: A Precise Spectral Analysis
Abstract
The curvature exponent $\alpha$, defined by the scaling $h_k \propto \sigma_k^\alpha$ between Hessian eigenvalues $h_k$ and gradient singular values $\sigma_k$, varies with layer architecture: convolutional layers show $\alpha \approx 2$, transformer attention mechanisms show $\alpha \approx 1$, and MLP up-projections typically have $\alpha < 1$. To explain this variation, we introduce the Spectral Alignment Decomposition, $\alpha = 2 + d\log\Phi_k / d\log\sigma_k$, where $\Phi_k$ quantifies the alignment between the eigenbases of the Kronecker factors and the singular directions of the gradient. This formulation turns the question of why $\alpha$ varies into a geometric one, which we resolve for components such as LayerNorm, residual connections, and softmax heads.
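One way to read the decomposition, assuming the Hessian eigenvalue factorizes as $h_k = \sigma_k^2\,\Phi_k$ (a form inferred here from the stated formula, not given explicitly in the abstract), is as a logarithmic derivative:

```latex
\begin{align}
\log h_k &= 2\log\sigma_k + \log\Phi_k, \\
\alpha \equiv \frac{d\log h_k}{d\log\sigma_k} &= 2 + \frac{d\log\Phi_k}{d\log\sigma_k}.
\end{align}
```

Under this reading, an alignment factor $\Phi_k$ that does not vary with $\sigma_k$ gives $\alpha = 2$, while an alignment that decreases as $\sigma_k$ grows gives $\alpha < 2$.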
Furthermore, this decomposition yields a spectral transfer identity, $s = \alpha\gamma$, which connects the curvature exponent $\alpha$, the effective gradient rank-decay $\gamma$, and the Hessian decay exponent $s$. Although the identity is algebraic, it also holds up empirically: fitting $\alpha$ from Hessian-vector products (HVPs) and $\gamma$ from a singular value decomposition (SVD), two independent sources of data, recovers $s$ with a median error of approximately 2% across five architectures, three datasets, and 93 layers, with no free parameters. Additionally, a zeta-function bound on the participation ratio indicates that curvature tends to concentrate along a single effective direction within each layer. As a proof of concept, we derive an architecture-adaptive preconditioner, $T(\sigma;\alpha)$, and show that Spectral Newton, which applies this preconditioner in the gradient singular basis, outperforms AdamW on vision benchmarks where $\alpha \approx 2$.
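A minimal numerical sketch of the identity $s = \alpha\gamma$, assuming power-law spectra $\sigma_k \propto k^{-\gamma}$ and $h_k \propto k^{-s}$. This parameterization, the synthetic spectra, and the helper `fit_loglog_slope` are illustrative assumptions, not the authors' pipeline, which fits $\alpha$ and $\gamma$ from HVP and SVD measurements:

```python
import numpy as np


def fit_loglog_slope(x, y):
    """Return the least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope


# Synthetic spectra (hypothetical values, chosen only for illustration).
gamma_true = 0.8   # assumed decay of gradient singular values: sigma_k ~ k^-gamma
alpha_true = 2.0   # assumed curvature exponent: h_k ~ sigma_k^alpha

k = np.arange(1, 201, dtype=float)   # spectral index
sigma = k ** (-gamma_true)           # gradient singular values
h = 3.0 * sigma ** alpha_true        # Hessian eigenvalues, arbitrary prefactor

# Fit each exponent from its own log-log regression.
gamma_hat = -fit_loglog_slope(k, sigma)   # decay of sigma_k in k
alpha_hat = fit_loglog_slope(sigma, h)    # scaling of h_k in sigma_k
s_hat = -fit_loglog_slope(k, h)           # decay of h_k in k, fitted directly

# Prediction from the spectral transfer identity s = alpha * gamma.
s_pred = alpha_hat * gamma_hat

print(f"alpha={alpha_hat:.3f} gamma={gamma_hat:.3f}")
print(f"s (direct fit)={s_hat:.3f} s (alpha*gamma)={s_pred:.3f}")
```

On exact power laws the two estimates of $s$ agree to floating-point precision. On measured spectra the gap between them is the error that the abstract reports as approximately 2% at the median.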
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



