arXiv

Spectral Asymptotics of Neural Network Loss Landscapes: An Exact Decomposition of the Curvature Exponent

Title: Deconstructing the Curvature Exponent in Neural Network Loss Landscapes: A Precise Spectral Analysis

Abstract

The curvature exponent $\alpha$, defined by the scaling $h_k \propto \sigma_k^\alpha$ between Hessian eigenvalues $h_k$ and gradient singular values $\sigma_k$, varies with layer type: convolutional layers show $\alpha \approx 2$, transformer attention mechanisms show $\alpha \approx 1$, and MLP up-projections typically have $\alpha < 1$. To explain this variation, we introduce the Spectral Alignment Decomposition, $\alpha = 2 + d\log\Phi_k / d\log\sigma_k$, where $\Phi_k$ quantifies the alignment between the eigenbases of the Kronecker factors and the singular directions of the gradient. The decomposition turns the question of why $\alpha$ varies into a geometric one, which we resolve for components such as LayerNorm, residual connections, and softmax heads.
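As a rough numerical illustration of the decomposition, the sketch below builds a synthetic spectrum and checks that the fitted log-log slope of $h_k$ against $\sigma_k$ equals $2$ plus the slope of $\Phi_k$ against $\sigma_k$. It assumes the form $h_k \propto \sigma_k^2 \Phi_k$, which is what the stated decomposition implies but is not spelled out in the abstract. The power-law alignment profile $\Phi_k \propto \sigma_k^\beta$, the values of `beta`, and the helper name `loglog_slope` are hypothetical choices made only for this example.

```python
import numpy as np

def loglog_slope(x, y):
    """Least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope

# Synthetic gradient singular values (illustrative power-law decay in k).
k = np.arange(1, 201, dtype=float)
sigma = k ** -0.8

# Hypothetical alignment profiles Phi_k ∝ sigma_k^beta; beta is not from the paper.
results = {}
for beta in (0.0, -1.0, -1.5):
    phi = sigma ** beta
    h = sigma ** 2 * phi                       # assumed form h_k ∝ sigma_k^2 * Phi_k
    alpha = loglog_slope(sigma, h)             # fitted curvature exponent
    predicted = 2.0 + loglog_slope(sigma, phi) # 2 + dlog(Phi)/dlog(sigma)
    results[beta] = (alpha, predicted)
    print(f"beta={beta:+.1f}  alpha={alpha:.3f}  predicted={predicted:.3f}")
```

Under these assumptions, a flat alignment profile gives the baseline exponent of 2, while an alignment that grows as $\sigma_k$ shrinks pulls the exponent below 2.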

Furthermore, the decomposition yields a spectral transfer identity, $s = \alpha\gamma$, which connects the curvature exponent $\alpha$, the effective gradient rank-decay exponent $\gamma$, and the Hessian decay exponent $s$. Although the identity is algebraic, it is useful in practice: fitting $\alpha$ and $\gamma$ from independent measurements, Hessian-vector products (HVPs) for $\alpha$ and singular value decomposition (SVD) for $\gamma$, recovers $s$ with a median error of approximately 2%. This accuracy holds across five architectures, three datasets, and 93 layers, with no free parameters. Additionally, a zeta-function bound on the participation ratio indicates that curvature tends to concentrate along a single effective direction within each layer. As a proof of concept, we derive an architecture-adaptive preconditioner, $T(\sigma;\alpha)$. We show that Spectral Newton, which applies this preconditioner in the gradient singular basis, outperforms AdamW on vision benchmarks where $\alpha \approx 2$.
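The algebra behind the transfer identity can be sketched on synthetic data. The example assumes the conventions $\sigma_k \propto k^{-\gamma}$ and $h_k \propto k^{-s}$, which the abstract does not state explicitly; substituting the first into $h_k \propto \sigma_k^\alpha$ then gives $h_k \propto k^{-\alpha\gamma}$, so $s = \alpha\gamma$. The noise level, the true exponents, and the helper name `fit_exponent` are hypothetical, and plain log-log regression stands in for whatever fitting procedure the authors use.

```python
import numpy as np

def fit_exponent(x, y):
    """Least-squares slope of log(y) against log(x)."""
    slope, _intercept = np.polyfit(np.log(x), np.log(y), 1)
    return slope

rng = np.random.default_rng(0)
k = np.arange(1, 301, dtype=float)

# Illustrative ground-truth exponents (not values from the paper).
gamma_true, alpha_true = 0.7, 2.0

# Synthetic spectra with small multiplicative noise.
sigma = k ** -gamma_true * np.exp(0.02 * rng.standard_normal(k.size))
h = sigma ** alpha_true * np.exp(0.02 * rng.standard_normal(k.size))

gamma_hat = -fit_exponent(k, sigma)  # assumed convention: sigma_k ∝ k^(-gamma)
alpha_hat = fit_exponent(sigma, h)   # h_k ∝ sigma_k^alpha
s_hat = -fit_exponent(k, h)          # assumed convention: h_k ∝ k^(-s)

# Relative gap between the product alpha*gamma and the directly fitted s.
rel_err = abs(alpha_hat * gamma_hat - s_hat) / s_hat
print(f"alpha*gamma={alpha_hat * gamma_hat:.4f}  s={s_hat:.4f}  rel_err={rel_err:.2%}")
```

In the paper's setting $\alpha$ and $\gamma$ come from separate measurements on real layers, so the match is an empirical check rather than a tautology. Here the three fits share one synthetic spectrum, so the sketch only illustrates the algebra.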


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
