Scaling depth capacity via zero/one-layer model expansion
Title: Enhancing Depth Capacity Through Zero/One-Layer Model Expansion
Abstract:
In deep learning, model depth is a trade-off: deeper architectures generally achieve higher accuracy but demand substantially more compute. To train large-scale models efficiently, progressive training (often called model expansion) increases model capacity incrementally during training. This approach substantially reduces computational cost with minimal loss in performance.
This study investigates depth expansion in large-scale models through the lens of optimization theory and feature learning. The analysis offers insights into key aspects of the expansion process: the initialization of newly added layers, the transfer of hyperparameters, the design of learning rate schedules, and the timing of model expansion.
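To make the question of initializing newly added layers concrete, the sketch below shows one common function-preserving choice: appending residual blocks whose output projection is zero, so the expanded model initially computes exactly what the shallow model did. This is an illustrative assumption, not necessarily the initialization this work recommends; the names `residual_block`, `new_block`, and `forward`, the toy sizes, and the weight scales are all hypothetical.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def residual_block(x, W_in, W_out):
    # y = x + W_out @ relu(W_in @ x)
    h = relu(matvec(W_in, x))
    return [xi + di for xi, di in zip(x, matvec(W_out, h))]

def new_block(dim, hidden, rng):
    # Assumption: the input projection is random and the output projection
    # is zero, so the new block acts as the identity when it is added.
    W_in = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(hidden)]
    W_out = [[0.0] * hidden for _ in range(dim)]
    return W_in, W_out

def forward(x, blocks):
    for W_in, W_out in blocks:
        x = residual_block(x, W_in, W_out)
    return x

rng = random.Random(0)
dim, hidden = 4, 8

# A "trained" one-layer model, stood in for here by random weights.
shallow = [(
    [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(hidden)],
    [[rng.gauss(0.0, 0.5) for _ in range(hidden)] for _ in range(dim)],
)]

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
before = forward(x, shallow)

# Depth expansion: append three zero-initialized blocks.
deep = shallow + [new_block(dim, hidden, rng) for _ in range(3)]
after = forward(x, deep)

print(before == after)  # the expanded model reproduces the shallow output
```

With this choice the loss is unchanged at the moment of expansion, and the new blocks begin contributing only once training moves their output projections away from zero.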
We introduce zero/one-layer progressive training, a strategy designed to balance computational cost against loss reduction, and validate it with comprehensive ablation studies. For instance, applying the method to GPT-2 saves approximately 80% of the compute, which is equivalent to a roughly fivefold training speedup. Despite these savings, the model reaches a loss comparable to that of a fully trained 60-layer model with 7 billion parameters, exhibiting a mixing behavior in the loss.
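The equivalence between an 80% compute saving and a fivefold speedup follows from speedup = 1 / (1 - savings). The sketch below checks that arithmetic and adds a rough cost model for a depth schedule. The cost model assumes per-step compute is proportional to the number of layers, ignoring embeddings and other depth-independent costs, and the example schedule is hypothetical rather than taken from this work.

```python
def speedup_from_savings(savings):
    # A run that saves a fraction `savings` of compute costs (1 - savings)
    # of the baseline, so it is 1 / (1 - savings) times faster.
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)

def relative_cost(schedule, full_depth):
    # schedule: list of (depth, fraction_of_steps) pairs.
    # Assumption: per-step cost scales linearly with depth.
    total = sum(frac for _, frac in schedule)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("step fractions must sum to 1")
    return sum(depth * frac for depth, frac in schedule) / full_depth

# The figure quoted in the abstract: about 80% savings, about 5x speedup.
print(round(speedup_from_savings(0.8), 6))

# Hypothetical schedule: one layer for 80% of steps, then 60 layers.
cost = relative_cost([(1, 0.8), (60, 0.2)], full_depth=60)
print(f"relative cost: {cost:.3f}, savings: {1.0 - cost:.1%}")
```

Under these assumptions the cost is dominated by the steps spent at full depth, which is why the timing of the expansion matters so much for the overall saving.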
Additionally, scaling-law analyses on LLAMA3 and DeepSeekV3 models show a 3x to 5x improvement in compute efficiency, and this advantage grows as model scale increases.
Source: arXiv





