arXiv

Unlocking Feature Learning in Gated Delta Networks at Scale

Title: Scaling Feature Learning in Gated Delta Networks

Original: arXiv:2606.04048v1 (announce type: cross)

Abstract: Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

Rewrite: The computational cost of training and scaling Large Language Models has driven interest in efficient sub-quadratic architectures and principled hyperparameter tuning. Although Maximal Update Parametrization ($\mu$P) enables zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, especially those with structured state transitions and complicated architectures, remains largely unexplored. This study derives scaling rules for Gated Delta Networks by propagating coordinate-size estimates through the forward pass, the gating mechanisms, and the recurrent state dynamics. Language-model pre-training experiments show that the resulting configurations give stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, which supports both the correctness and the practical utility of the analysis.
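To make "coordinate-size estimate" concrete, here is a minimal sketch for a single dense layer. It is not the paper's derivation for Gated Delta Networks, whose scaling rules are not reproduced in the abstract. It only illustrates the standard observation that a weight standard deviation of 1/sqrt(fan_in) keeps output coordinates at roughly the same size as the width grows, while a width-independent standard deviation does not. The function names `rms` and `linear_output_rms` are hypothetical helpers introduced for this illustration.

```python
import math
import random


def rms(values):
    """Root-mean-square of a vector: the typical size of one coordinate."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def linear_output_rms(width, weight_std, seed=0):
    """Coordinate size of y = W x for a width-by-width Gaussian matrix W.

    The input x has every coordinate equal to 1, so its coordinate size is
    exactly 1. Entries of W are drawn i.i.d. from N(0, weight_std**2).
    Illustrative assumption: one dense layer, no gating, no recurrence.
    """
    rng = random.Random(seed)
    x = [1.0] * width
    y = [
        sum(rng.gauss(0.0, weight_std) * xj for xj in x)
        for _ in range(width)
    ]
    return rms(y)


if __name__ == "__main__":
    for width in (64, 256, 512):
        fan_in_scaled = linear_output_rms(width, 1.0 / math.sqrt(width))
        unscaled = linear_output_rms(width, 1.0)
        print(f"width={width:4d}  "
              f"std=1/sqrt(width): {fan_in_scaled:.2f}  "
              f"std=1: {unscaled:.2f}")
```

With the fan-in scaling the printed coordinate size should stay near 1 at every width, while with a fixed standard deviation it should grow roughly like the square root of the width. Tracking how such sizes change from layer to layer, and under updates, is the kind of bookkeeping the abstract refers to.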


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
