Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
Title: Investigating the Convergence of Adaptive Gradient Methods in the Presence of Heavy-Tailed Noise: An Analysis of AdaGrad
Original: arXiv:2605.18694v2 (announce type: replace-cross)
Abstract: Gradient noise in contemporary machine learning is frequently heavy-tailed. To guarantee the convergence of first-order algorithms in this regime, researchers have developed techniques such as gradient normalization and clipping. Yet adaptive gradient optimizers, such as the widely used $\mathtt{Adam}$ and $\mathtt{AdamW}$, often succeed without relying on such interventions. This observation raises a natural question: do adaptive methods inherently converge under heavy-tailed noise, without additional algorithmic modifications? As a first step toward an answer, our study focuses on $\mathtt{AdaGrad}$, the foundational algorithm of the adaptive gradient family. We establish the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex settings, showing that it converges when the tail index $p$ satisfies $4/3 < p \leq 2$, under a mild additional assumption.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
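
For readers unfamiliar with the setting described in the abstract, the following is a minimal sketch of the textbook coordinate-wise AdaGrad update run on noisy gradients, with no clipping or normalization. It is not the paper's experimental setup or its exact algorithmic variant. The test objective, the Student-t noise model, the degrees-of-freedom value `df=1.6` (chosen so that moments of order below 1.6 are finite, loosely matching a tail index in the $4/3 < p \leq 2$ range), and the helper names `adagrad` and `make_noisy_grad` are all assumptions made for illustration.

```python
import numpy as np


def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=2000):
    """Textbook coordinate-wise AdaGrad (illustrative, not the paper's code).

    Update: accum += g^2 ; x -= lr * g / (sqrt(accum) + eps)
    """
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        accum += g * g
        x -= lr * g / (np.sqrt(accum) + eps)
    return x


def make_noisy_grad(rng, df=1.6):
    """Hypothetical stochastic gradient oracle with heavy-tailed noise.

    Objective (assumed for illustration): f(x) = sum(x^2 / (1 + x^2)),
    which is smooth and non-convex. Noise is Student-t with `df` degrees
    of freedom, so its variance is infinite whenever df <= 2.
    """
    def grad_fn(x):
        true_grad = 2.0 * x / (1.0 + x * x) ** 2
        noise = rng.standard_t(df, size=x.shape)
        return true_grad + noise
    return grad_fn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_final = adagrad(make_noisy_grad(rng), x0=[1.0, -2.0])
    print("final iterate:", x_final)
```

Because each coordinate's step has magnitude at most `lr` (since $|g| \le \sqrt{\text{accum}}$), a single extreme noise sample cannot move the iterate arbitrarily far. This sketch makes no claim about convergence rates; it only shows the unmodified update that the abstract's question concerns.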





