When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks
Title: Joint Layer Learning: Analyzing the Training Dynamics of ReLU Networks for Linear Function Approximation
Abstract:
We study gradient descent when both layers of a single-hidden-layer ReLU network are trained simultaneously to approximate a linear target function. We work in a realizable setting where the inputs are drawn i.i.d. from a Gaussian distribution and the labels are generated by a planted linear model. This simplified setup captures key characteristics of end-to-end training in inverse problems and certain auto-encoder architectures. Although the model is simple, its dynamics are not well understood, largely because the loss landscape contains multiple non-strict saddle points. These saddle points make it difficult to explain why gradient descent from random initialization consistently avoids poor stationary regions.
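One way to write this setup down is sketched below. The abstract does not fix any notation, so the symbols $m$ (width), $w_j$ and $a_j$ (hidden and output weights), $\beta^\star$ (planted vector), the isotropic covariance, and the $1/(2n)$ normalization are all assumptions made for illustration.

```latex
% Hypothetical notation: none of these symbols or normalizations is given in the abstract.
\[
  f(x; W, a) = \sum_{j=1}^{m} a_j \max\{0, \langle w_j, x \rangle\},
  \qquad
  x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d),
  \qquad
  y_i = \langle \beta^\star, x_i \rangle ,
\]
\[
  L(W, a) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( f(x_i; W, a) - y_i \bigr)^2 .
\]
% Realizability: a linear map is the difference of two ReLU units,
\[
  \langle \beta^\star, x \rangle
  = \max\{0, \langle \beta^\star, x \rangle\} - \max\{0, \langle -\beta^\star, x \rangle\} ,
\]
% so zero loss is attainable with as few as two hidden neurons of opposite output sign.
```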
We give a comprehensive characterization of the optimization landscape and show that gradient descent on both layers, started from a moderately small random initialization, converges to a global minimum at a linear rate with order-wise optimal sample complexity. Our analysis divides the optimization trajectory into three stages:
- Alignment Phase: Hidden layer weights gradually align with the planted direction, while output weights stabilize to maintain the correct sign pattern.
- Growth Phase: The weight norms in both layers grow while the alignment established in the previous phase is preserved.
- Local Refinement Phase: Neurons that have achieved alignment converge rapidly toward the planted direction, facilitating fast local convergence.
To rigorously prove that gradient descent successfully navigates away from non-strict saddle points, we introduce trajectory-level control arguments for the end-to-end dynamics. Furthermore, we derive new uniform concentration results that remain valid throughout the entire optimization trajectory; these results are critical for establishing the order-wise optimal sample complexity. Our theoretical findings are supported by extensive experimental validation across various configurations.
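The three stages described above can be observed qualitatively in a small simulation. The following is a minimal NumPy sketch, not the authors' experimental code. The function name `train_both_layers`, the width, step size, initialization scale, standard isotropic Gaussian inputs, and the norm-weighted alignment metric are all assumptions chosen for illustration.

```python
import numpy as np


def train_both_layers(n=2000, d=10, m=20, init_scale=1e-2, lr=0.1, steps=2000, seed=0):
    """Full-batch gradient descent on both layers of a one-hidden-layer ReLU net.

    All hyperparameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)

    # Planted linear model with a unit-norm direction (assumed normalization).
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    X = rng.standard_normal((n, d))  # i.i.d. standard Gaussian inputs
    y = X @ beta                     # noiseless (realizable) labels

    # Small random initialization of both layers.
    W = init_scale * rng.standard_normal((m, d)) / np.sqrt(d)  # rows are w_j
    a = init_scale * rng.standard_normal(m)

    hist = {"loss": [], "alignment": [], "norm": []}
    for _ in range(steps):
        pre = X @ W.T                # pre-activations, shape (n, m)
        act = np.maximum(pre, 0.0)   # ReLU
        resid = act @ a - y          # prediction error, shape (n,)

        # Diagnostics recorded before the parameter update.
        hist["loss"].append(0.5 * np.mean(resid ** 2))
        row_norms = np.linalg.norm(W, axis=1)
        # Norm-weighted average of |cos(w_j, beta)| over hidden neurons.
        hist["alignment"].append(np.abs(W @ beta).sum() / row_norms.sum())
        hist["norm"].append(np.sqrt(np.sum(W ** 2) + np.sum(a ** 2)))

        # Gradients of the loss (1 / 2n) * sum_i resid_i^2 with respect to a and W.
        grad_a = act.T @ resid / n
        gate = (pre > 0).astype(float)  # ReLU subgradient, taken as 0 at the kink
        grad_W = (gate * resid[:, None] * a[None, :]).T @ X / n

        # Simultaneous update of both layers.
        a -= lr * grad_a
        W -= lr * grad_W
    return hist


if __name__ == "__main__":
    hist = train_both_layers()
    for t in (0, 50, 100, 200, 500, 1999):
        print(t, hist["loss"][t], hist["alignment"][t], hist["norm"][t])
```

Plotting `hist["loss"]` on a log scale alongside `hist["alignment"]` and `hist["norm"]` is one way to see where alignment rises, where the norms grow, and where the loss decay becomes linear. The exact timing depends on the assumed hyperparameters.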
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC



