arXiv

When Both Layers Learn: Training Dynamics of Representing Linear Models via ReLU Networks

Title: Joint Layer Learning: Analyzing the Training Dynamics of ReLU Networks for Linear Function Approximation

Abstract:

This study analyzes gradient descent when both layers of a single-hidden-layer ReLU network are trained simultaneously to approximate a linear target function. We work in a realizable setting in which inputs are drawn i.i.d. from a Gaussian distribution and labels are generated by a planted linear model. This simplified setup captures key features of end-to-end training in inverse problems and certain auto-encoder architectures. Although the model appears simple, its dynamics are not well understood, largely because the loss landscape contains multiple non-strict saddle points. These saddle points make it unclear why gradient descent from random initialization consistently avoids poor stationary regions.
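For concreteness, one standard way to write such a setting is sketched below. The abstract does not state the exact parametrization, so the symbols (m hidden units, weights w_j and a_j, planted vector \beta, step size \eta), the identity covariance, the absence of bias terms, and the squared loss are all assumptions made here for illustration.

```latex
% Assumed model: one hidden layer of m ReLU units, no biases.
f(x; W, a) = \sum_{j=1}^{m} a_j \, \max\!\bigl(0,\; w_j^{\top} x\bigr)

% Assumed data: Gaussian inputs, labels from a planted linear model.
x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_d), \qquad y_i = \beta^{\top} x_i, \qquad i = 1, \dots, n

% Assumed objective: empirical squared loss over both layers.
L(W, a) = \frac{1}{2n} \sum_{i=1}^{n} \bigl( f(x_i; W, a) - y_i \bigr)^2

% Realizability: a linear map is the difference of two ReLU units.
\beta^{\top} x = \max\!\bigl(0,\; \beta^{\top} x\bigr) - \max\!\bigl(0,\; -\beta^{\top} x\bigr)

% Gradient descent on both layers with step size \eta.
W^{(t+1)} = W^{(t)} - \eta \, \nabla_W L\bigl(W^{(t)}, a^{(t)}\bigr), \qquad
a^{(t+1)} = a^{(t)} - \eta \, \nabla_a L\bigl(W^{(t)}, a^{(t)}\bigr)
```

The realizability identity shows why a sign pattern on the output weights matters in this form: a zero-loss solution needs units aligned with \beta carrying positive output weight and units aligned with -\beta carrying negative output weight.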

We give a comprehensive characterization of the optimization landscape and show that gradient descent on both layers, started from a moderately small random initialization, converges to a global minimum at a linear rate with order-wise optimal sample complexity. Our analysis divides the optimization trajectory into three distinct phases:

  1. Alignment Phase: Hidden layer weights gradually align with the planted direction, while output weights stabilize to maintain the correct sign pattern.
  2. Growth Phase: The norms of weights in both layers increase, a process that preserves the alignment established in the previous phase.
  3. Local Refinement Phase: Neurons that have achieved alignment converge rapidly toward the planted direction, facilitating fast local convergence.
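The three phases can be observed informally in a small simulation. The sketch below is not the authors' code: the function name `train_two_layer_relu`, the squared loss, the identity-covariance inputs, and every hyperparameter value are assumptions chosen for illustration. It runs full-batch gradient descent on both layers and reports the loss history, the cosine between each hidden weight and the sign-adjusted planted direction, and the weight norms.

```python
import numpy as np

def train_two_layer_relu(d=10, m=20, n=2000, lr=0.05, steps=3000,
                         init_scale=1e-2, seed=0):
    """Full-batch gradient descent on both layers of a one-hidden-layer
    ReLU network fit to a planted linear model (illustrative assumptions)."""
    rng = np.random.default_rng(seed)

    # Planted unit-norm direction and Gaussian inputs (assumed N(0, I_d)).
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    X = rng.standard_normal((n, d))
    y = X @ beta

    # Small random initialization of hidden weights W and output weights a.
    W = init_scale * rng.standard_normal((m, d))
    a = init_scale * rng.standard_normal(m)

    losses = []
    for _ in range(steps):
        pre = X @ W.T                    # pre-activations, shape (n, m)
        act = np.maximum(pre, 0.0)       # ReLU
        resid = act @ a - y              # residuals, shape (n,)
        losses.append(0.5 * np.mean(resid ** 2))

        # Gradients of the loss 1/(2n) * sum(resid^2) w.r.t. a and W.
        grad_a = act.T @ resid / n
        grad_W = (resid[:, None] * (pre > 0) * a[None, :]).T @ X / n

        # Update both layers simultaneously.
        a -= lr * grad_a
        W -= lr * grad_W

    norms = np.linalg.norm(W, axis=1)
    # Cosine between w_j and sign(a_j) * beta; near 1 means aligned.
    cosines = np.sign(a) * (W @ beta) / np.maximum(norms, 1e-12)
    return np.array(losses), cosines, norms, a

if __name__ == "__main__":
    losses, cosines, norms, a = train_two_layer_relu()
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("largest-norm neuron cosine:", cosines[np.argmax(norms)])
```

Logging `cosines` and `norms` at intermediate steps, rather than only at the end, is one way to see whether alignment precedes norm growth in a given run. A single simulation of this kind is only a sanity check and says nothing about the rates or sample complexity proved in the paper.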

To rigorously prove that gradient descent successfully navigates away from non-strict saddle points, we introduce trajectory-level control arguments for the end-to-end dynamics. Furthermore, we derive new uniform concentration results that remain valid throughout the entire optimization trajectory; these results are critical for establishing the order-wise optimal sample complexity. Our theoretical findings are supported by extensive experimental validation across various configurations.


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
