arXiv

Unlocking Feature Learning in Gated Delta Networks at Scale

June 4, 2026 · Yifeng Liu, Quanquan Gu

Title: Scaling Feature Learning in Gated Delta Networks

Original: arXiv:2606.04048v1 (announce type: cross). Abstract: Training and scaling Large Language Models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization ($\mu$P) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.

Rewrite: Training and scaling Large Language Models is computationally expensive, which motivates both efficient sub-quadratic architectures and principled hyperparameter tuning. The Maximal Update Parametrization ($\mu$P) enables zero-shot hyperparameter transfer for standard Transformers, but its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. This study derives the scaling rules for Gated Delta Networks by rigorously propagating coordinate-size estimates through the forward pass, the gating mechanisms, and the recurrent state dynamics. In language-model pre-training experiments, the resulting configurations give stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer. The authors take this as validation of the correctness and practical utility of the analysis.
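
For orientation, the "gating mechanisms and recurrent state dynamics" named in the abstract can be illustrated with a single recurrent step. The following is a minimal NumPy sketch of the gated delta rule as it is commonly written for Gated Delta Networks; it is not the paper's code and does not encode the paper's scaling rules, which the abstract does not state. The function name `gated_delta_step`, the argument names, the shapes, and the unit-norm key are all illustrative assumptions.

```python
import numpy as np

def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S: (d_v, d_k) memory state; q, k: (d_k,); v: (d_v,)
    alpha: scalar decay gate in [0, 1]; beta: scalar write strength in [0, 1].
    All names and shapes are assumptions made for this example.
    """
    # Delta-rule erase: remove what the state currently stores under key k.
    erased = S - beta * np.outer(S @ k, k)
    # The gate decays the remaining memory, then the new key-value pair is written.
    S_new = alpha * erased + beta * np.outer(v, k)
    # Read the updated memory with the query.
    return S_new, S_new @ q

rng = np.random.default_rng(0)
d_k, d_v = 8, 4
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)  # unit-norm key (an assumption of this sketch)
v = rng.standard_normal(d_v)
S0 = rng.standard_normal((d_v, d_k))

# No decay (alpha = 1) and a full-strength write (beta = 1).
S1, o = gated_delta_step(S0, q=k, k=k, v=v, alpha=1.0, beta=1.0)
```

With `alpha = 1`, `beta = 1`, and a unit-norm key, querying with the same key returns `v` regardless of the previous state, which is the overwrite behavior the delta rule is designed for. How the entries of `S`, the gates, and the output scale with model width is the kind of coordinate-size question the paper's analysis addresses.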

Source: arXiv