Inverse Depth Scaling From Most Layers Being Similar
Title: Inverse Depth Scaling Arises When Most Layers Remain Functionally Identical
Abstract: While established neural scaling laws link loss to model size in large language models (LLMs), the distinct contributions of depth versus width necessitate more granular investigation. This study quantifies the impact of depth on loss by analyzing both LLMs and toy residual networks. Our results indicate that loss in LLMs scales inversely with depth. We attribute this phenomenon to functionally redundant layers that mitigate error via ensemble averaging, rather than through compositional learning or the discretization of smooth dynamics. Although this operational regime is inefficient, it proves robust, likely stemming from the inherent architectural bias of residual networks and target functions that are ill-suited for smooth dynamics. These insights imply that enhancing LLM efficiency will likely depend on architectural advancements that promote the compositional utilization of depth.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
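
The ensemble-averaging mechanism named in the abstract can be illustrated with a toy simulation. This is a minimal sketch and not the paper's experiment: it assumes each of the L layers contributes an independent, zero-mean Gaussian error of equal variance around the target, and the function name `ensemble_mse` and its parameters are hypothetical. Under that assumption, the squared error of the averaged estimate falls roughly as 1/L.

```python
import random


def ensemble_mse(num_layers, noise_std=1.0, trials=20000, seed=0):
    """Estimate the mean squared error of averaging `num_layers` noisy estimates.

    Assumption (not from the paper): every layer's error is an independent
    Gaussian draw with mean 0 and standard deviation `noise_std`, and the
    target value is 0.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Average the per-layer errors, as an ensemble of similar layers would.
        avg = sum(rng.gauss(0.0, noise_std) for _ in range(num_layers)) / num_layers
        total += avg * avg
    return total / trials


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 16):
        mse = ensemble_mse(depth)
        # If error scales as 1/depth, mse * depth should stay close to noise_std**2.
        print(f"depth={depth:2d}  mse={mse:.4f}  mse*depth={mse * depth:.3f}")
```

In this toy setting the product `mse * depth` stays near `noise_std**2`, which is the signature of inverse scaling with depth. Correlated layer errors, which the sketch ignores, would weaken the effect.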






