arXiv

Depth-Attention: Cross-Layer Value Mixing for Language Models

Abstract

While self-attention lets a model freely select information across sequence positions, standard Transformer architectures limit cross-layer interaction by simply adding each layer’s output to the residual stream. This structure prevents later layers from selectively reusing representations generated by earlier layers. Although recent cross-layer approaches have improved this information flow, they typically operate on hidden states outside the attention module and therefore require additional inference state beyond the key-value cache. This overhead becomes particularly problematic as modern large language models (LLMs) increasingly rely on techniques like grouped-query attention and multi-head latent attention to compress their caches.

To address this, we propose Depth-Attention, a method that performs this cross-layer selection directly within the attention module. Specifically, before a layer attends over the sequence, its query vector attends to the keys from previous layers at the same token position, mixing their values into the value vector that self-attention subsequently reads. Depth-Attention reuses the standard attention queries, keys, and value-cache slots, and stores the depth-mixed values in place of the originals. Consequently, it introduces neither new parameters nor any persistent inference state beyond the standard key-value cache, keeping a cache footprint identical to that of a vanilla decoder and smaller than that of hidden-state-based cross-layer methods.
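The mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the scoring function or normalization, so the sketch assumes a single head, scaled dot-product scores, a softmax over layers, and that the current layer's own key and value take part in the mix. The names `depth_mix_values` and `causal_self_attention` are hypothetical, and random arrays stand in for the learned query, key, and value projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def depth_mix_values(q, k_stack, v_stack):
    """Mix values across layers at each token position (assumed form).

    q:       (T, d)    queries of the current layer
    k_stack: (L, T, d) keys of layers 0..current, same token positions
    v_stack: (L, T, d) cached values of earlier layers plus the current raw value
    returns: (T, d)    depth-mixed values for the current layer
    """
    d = q.shape[-1]
    # One score per (token, layer): the query only sees keys at its own position.
    scores = np.einsum("td,ltd->tl", q, k_stack) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return np.einsum("tl,ltd->td", weights, v_stack)

def causal_self_attention(q, k, v):
    # Ordinary causal attention over the sequence, reading the mixed values.
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
num_layers, T, d = 3, 4, 8
k_cache, v_cache = [], []
for layer in range(num_layers):
    # Stand-ins for the layer's projected queries, keys, and values.
    q, k, v = rng.standard_normal((3, T, d))
    k_cache.append(k)
    v_mixed = depth_mix_values(q, np.stack(k_cache), np.stack(v_cache + [v]))
    # The mixed value is stored in place of the raw one, so the cache
    # holds one key and one value tensor per layer, as in a vanilla decoder.
    v_cache.append(v_mixed)
    out = causal_self_attention(q, k, v_mixed)
```

Under these assumptions the first layer has nothing to mix, so its cached value equals its raw value, and the number of cached tensors per layer never grows beyond one key and one value.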

Experiments on Qwen3-style decoder architectures with 1.5B and 3B parameters show that Depth-Attention achieves the lowest perplexity and the highest average downstream accuracy among the models tested. It improves accuracy by up to 2.3 points over the vanilla Transformer and outperforms strong cross-layer baselines in both perplexity and average accuracy. These gains come with fewer than 0.01% additional arithmetic FLOPs and no extra persistent inference state. The benefits are consistent across model sizes from 360M to 3B parameters and also carry over to looped Transformer architectures.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
