Depth-Attention: Cross-Layer Value Mixing for Language Models
Abstract
Self-attention lets a model select information freely across sequence positions, but standard Transformer architectures limit cross-layer interaction to adding each layer’s output to the residual stream. As a result, later layers cannot selectively reuse representations produced by earlier layers. Recent cross-layer approaches improve this information flow, but they typically operate on hidden states outside the attention module and therefore require additional inference state beyond the key-value cache. This overhead is particularly costly because modern large language models (LLMs) increasingly rely on techniques such as grouped-query attention and multi-head latent attention to compress their caches.
To address this, we propose Depth-Attention, a method that performs this cross-layer selection directly within the attention module. Specifically, before a layer attends over the sequence, its query interacts with the keys of earlier layers at the same token position, mixing their values into the value vector that self-attention subsequently reads. Depth-Attention reuses the standard attention queries, keys, and value-cache slots, storing the depth-mixed values in place of the originals. Consequently, it introduces neither new parameters nor any persistent inference state beyond the standard key-value cache: its cache footprint is identical to that of a vanilla decoder and smaller than that of hidden-state-based cross-layer methods.
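As a rough illustration of the mechanism described above, the following NumPy sketch mixes values across depth at each token position and then runs ordinary causal self-attention over the mixed values. The abstract does not specify the details, so several choices here are assumptions rather than the authors' implementation: a single head, scaled dot-product scores with a softmax over layers, and the current layer's own key and value included as the last entry of the depth stack. The function names `depth_mix_values` and `causal_self_attention` are hypothetical.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_mix_values(q, layer_keys, layer_values):
    """Mix values across layers at each token position (assumed form).

    q:            (T, d)    queries of the current layer
    layer_keys:   (L, T, d) keys of layers 0..l at the same positions
    layer_values: (L, T, d) (already mixed) values of layers 0..l
    Returns the mixed values (T, d) and the depth weights (T, L).
    """
    d = q.shape[-1]
    # Score each token's query against the keys of every layer
    # at that same token position (no interaction across positions).
    scores = np.einsum("td,ltd->tl", q, layer_keys) / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # assumption: softmax over depth
    mixed = np.einsum("tl,ltd->td", weights, layer_values)
    return mixed, weights


def causal_self_attention(q, k, v):
    # Standard single-head causal attention over the sequence.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v


# Toy example with random tensors standing in for real projections.
rng = np.random.default_rng(0)
L, T, d = 3, 5, 8
layer_keys = rng.normal(size=(L, T, d))
layer_values = rng.normal(size=(L, T, d))
q = rng.normal(size=(T, d))

mixed_v, depth_w = depth_mix_values(q, layer_keys, layer_values)
# The mixed values take the place of the layer's original values,
# so the cache still holds one key and one value per layer and token.
out = causal_self_attention(q, layer_keys[-1], mixed_v)
```

In this sketch the depth step only reads tensors that a standard decoder already caches (keys and values of earlier layers), which is consistent with the claim that no state beyond the key-value cache is needed; how the actual method handles multiple heads or grouped-query attention is not covered here.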
Experiments on Qwen3-style decoder architectures with 1.5B and 3B parameters show that Depth-Attention achieves the lowest perplexity and the highest average downstream accuracy among the tested models. It improves accuracy by up to 2.3 points over the vanilla Transformer and outperforms strong cross-layer baselines in both perplexity and average accuracy. These gains come with fewer than 0.01% additional arithmetic FLOPs and no extra persistent inference state. The benefits are consistent across model sizes from 360M to 3B parameters and also carry over to looped Transformer architectures.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





