Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification
Title: Tight Generalization Bounds for Gradient Descent in Deep ReLU Classification
Recent work has substantially improved our understanding of how gradient descent (GD) generalizes in deep neural networks. A central open question is whether GD can attain generalization rates that match the minimax-optimal rates known for kernel methods. Existing results either yield the suboptimal rate $O(1/\sqrt{n})$ in the sample size $n$, or are restricted to networks with smooth activations, where the bounds depend exponentially on the network depth $L$.
In this work, we establish optimal generalization rates for GD on deep ReLU networks by carefully trading off optimization and generalization errors, with only polynomial dependence on depth. Assuming the data are NTK-separable with margin $\gamma$, we show that the excess risk is $\widetilde{O}(L^6 / (n \gamma^2))$. This matches the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is a precise control of activation patterns near a reference model, which yields a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
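To make the comparison with the earlier $O(1/\sqrt{n})$ rates concrete, a rough worked calculation may help. It drops all constants and logarithmic factors, which is a simplifying assumption made here for illustration and not part of the stated result. Under that simplification, the fast rate is the smaller of the two once

$$
\frac{L^6}{n\gamma^2} \le \frac{1}{\sqrt{n}}
\quad\Longleftrightarrow\quad
\sqrt{n} \ge \frac{L^6}{\gamma^2}
\quad\Longleftrightarrow\quad
n \ge \frac{L^{12}}{\gamma^4}.
$$

For example, with the hypothetical values $L = 2$ and $\gamma = 0.5$, the threshold is $2^{12} / 0.5^{4} = 4096 \cdot 16 = 65536$ samples. The crossover point grows quickly with depth and shrinks with a larger margin, which is why the polynomial (rather than exponential) dependence on $L$ matters.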
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC



