Tight Long-Term Tail Decay of (Clipped) SGD in Non-Convex Optimization
Title: Proving the Tightness of Long-Term Tail Decay for (Clipped) SGD in Non-Convex Optimization
Abstract: The tail behavior of processes driven by Stochastic Gradient Descent (SGD) has attracted significant attention because it yields strong guarantees for individual runs of the algorithm. Many studies establish high-probability guarantees, which quantify the error rate for a fixed probability threshold, but few directly examine the probability of failure, that is, the tail decay rate for a fixed error threshold. Moreover, most existing results hold only over finite time horizons, which limits their ability to capture the true long-term tail decay. The long-term view matters for modern machine learning models, which are often trained for millions of iterations.
To address these gaps, we analyze the long-term tail decay of SGD-based methods through large deviations theory and obtain several results. First, we derive an upper bound on the tails of the squared gradient norm of the best iterate generated by vanilla SGD: for non-convex cost functions and bounded noise, the long-term decay rate is $e^{-t/\log(t)}$. Next, we relax the noise assumption and consider clipped SGD (c-SGD) under heavy-tailed noise with a bounded moment of order $p \in (1,2]$. Here we establish an upper bound with long-term decay rate $e^{-t^{\beta_p}/\log(t)}$, where $\beta_p = \frac{4(p-1)}{3p-2}$, for $p \in (1,2)$, and $e^{-t/\log^2(t)}$ for $p = 2$.
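As a rough illustration of the c-SGD setting described above, the following sketch runs clipped SGD on a toy non-convex cost and records the squared gradient norm of the best iterate. Everything specific here is an assumption made for illustration and is not taken from the paper: the names `clip`, `grad`, and `clipped_sgd`, the cost function, the symmetric Pareto noise, the constant step size, and the clipping threshold `tau`. The abstract does not state which clipping operator or step-size schedule the analysis uses; standard norm clipping is assumed.

```python
import math
import random


def clip(g, tau):
    """Norm clipping (assumed form): rescale g so its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm <= tau or norm == 0.0:
        return list(g)
    return [x * (tau / norm) for x in g]


def grad(x):
    """Gradient of a toy non-convex cost f(x) = sum(x_i^2 + 3 * sin(x_i)^2)."""
    return [2.0 * xi + 3.0 * math.sin(2.0 * xi) for xi in x]


def clipped_sgd(x0, steps, step_size, tau, alpha=1.5, seed=0):
    """Run clipped SGD with heavy-tailed noise; return the final point and the
    smallest squared gradient norm seen along the trajectory (best iterate).

    Noise is a random sign times a Pareto(alpha) draw. For alpha = 1.5 it has
    zero mean, finite moments of order p < 1.5, and infinite variance.
    """
    rng = random.Random(seed)
    x = list(x0)
    best_sq_norm = float("inf")
    for _ in range(steps):
        g = grad(x)
        best_sq_norm = min(best_sq_norm, sum(v * v for v in g))
        noise = [rng.choice((-1.0, 1.0)) * rng.paretovariate(alpha) for _ in x]
        noisy_g = [gi + ni for gi, ni in zip(g, noise)]
        update = clip(noisy_g, tau)  # clipping bounds the effect of any one sample
        x = [xi - step_size * ui for xi, ui in zip(x, update)]
    return x, best_sq_norm


if __name__ == "__main__":
    _, best = clipped_sgd([2.0, -1.5], steps=5000, step_size=0.01, tau=5.0)
    print("best squared gradient norm:", best)
```

Repeating the run over many seeds and counting how often the best squared gradient norm exceeds a fixed threshold gives an empirical estimate of the failure probability whose decay in $t$ the bounds above describe.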
Additionally, we present lower bounds for tail decay at a rate of $e^{-t}$, confirming that our derived rates for both SGD and c-SGD are tight, modulo poly-logarithmic factors. Notably, these findings indicate that the long-term tail decay is an order of magnitude faster than previously documented rates based on finite-time bounds, which reported $e^{-\sqrt{t}}$ for SGD and $e^{-t^{\beta_p/2}}$ for c-SGD (with $p \in (1,2]$). Consequently, we identify regimes in which the tails diminish significantly more rapidly than previously understood, thereby offering enhanced long-term guarantees for individual execution runs.
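To make the comparison of rates concrete, the sketch below evaluates the exponents of the long-term and finite-time c-SGD tail bounds quoted above for a given $t$ and $p$. It is only a numerical illustration: the function names are hypothetical, and all problem-dependent constants, which the abstract does not give, are set to one.

```python
import math


def beta_p(p):
    """Exponent beta_p = 4(p - 1) / (3p - 2) for a noise moment of order p in (1, 2]."""
    if not 1.0 < p <= 2.0:
        raise ValueError("p must lie in (1, 2]")
    return 4.0 * (p - 1.0) / (3.0 * p - 2.0)


def long_term_exponent(t, p):
    """Negative log of the long-term rate: t^beta_p / log(t) for p < 2,
    and t / log(t)^2 for p = 2 (constants omitted)."""
    if p == 2.0:
        return t / math.log(t) ** 2
    return t ** beta_p(p) / math.log(t)


def finite_time_exponent(t, p):
    """Negative log of the finite-time rate e^{-t^(beta_p / 2)} (constants omitted)."""
    return t ** (beta_p(p) / 2.0)


if __name__ == "__main__":
    t = 1e6
    for p in (1.2, 1.5, 2.0):
        print(p, beta_p(p), long_term_exponent(t, p), finite_time_exponent(t, p))
```

A larger exponent means a smaller tail probability. Because of the logarithmic factors, the long-term exponent only overtakes the finite-time one once $t$ is large enough, which is consistent with the rates being statements about long-term behavior.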
Source: arXiv




