Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Original: arXiv:2606.02780v1. Abstract: The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
Rewritten: The transformer architecture owes much of its success as the backbone of modern large language models (LLMs) to its attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and produces context-dependent query, key, and value vectors. We find, however, that model performance meaningfully improves when deeper layers instead learn only a context-free value vector, one that preserves the original token information and draws on no context from the residual stream. Once the model has access to this context-free value vector, adding the context-dependent component back provides little additional benefit for aggregate benchmark performance. Because context-free value vectors depend only on token identity, they can be stored as sparse model parameters, so they need not be recomputed or persistently cached. Through systematic ablations of the key design choices, we propose Bank of Values (BoV), which computes attention value vectors by learning a lookup table of token-specific value vectors for each layer in the last third of the network. Across 135M and 780M models, BoV improves validation loss over standard attention. At 780M it also improves the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector while using less compute and memory.
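Illustration: The contrast between a context-dependent value vector and a context-free one can be made concrete with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify head count, normalization, initialization, or how "the last third of layers" is rounded, so the single-head attention, the dimensions, and the names `value_bank`, `uses_value_bank`, and `causal_attention` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    """Single-head causal attention over (seq_len, d) arrays."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def uses_value_bank(layer_idx, n_layers):
    # Assumption: "last third" means the final n_layers // 3 layers.
    return layer_idx >= n_layers - n_layers // 3

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 50, 8, 5  # toy sizes, chosen arbitrarily

token_ids = np.array([3, 17, 3, 42, 17])   # tokens 3 and 17 repeat
x = rng.normal(size=(seq_len, d_model))    # stand-in for the residual stream

W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))
# Hypothetical per-layer lookup table: one learned value vector per token.
value_bank = rng.normal(size=(vocab_size, d_model))

# Queries and keys still come from the residual stream in both variants.
q, k = x @ W_q, x @ W_k

v_standard = x @ W_v            # context-dependent: projected from x
v_bank = value_bank[token_ids]  # context-free: looked up by token id only

out_standard = causal_attention(q, k, v_standard)
out_bank = causal_attention(q, k, v_bank)
```

In this sketch the two occurrences of token 3 receive identical rows in `v_bank` but different rows in `v_standard`, because only the latter depends on the residual stream. Since `v_bank` is a pure function of `token_ids`, it can be re-read from the table at any time, which is the property the abstract cites for avoiding recomputation and persistent caching of values.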
Source: arXiv Generated at: 2026-06-03 00:00:00 UTC





