Do Transformers Need Three Projections? Systematic Study of QKV Variants
Title: Is a Triple Projection Necessary? A Comprehensive Analysis of QKV Configurations
Abstract: Transformers have established themselves as the dominant architecture for diverse artificial intelligence applications, largely due to the foundational role of the query, key, and value (QKV) attention mechanism. Despite their prevalence, the distinct function of each individual projection and the consequences of removing certain components remain insufficiently explored. This study conducts a systematic assessment of three distinct projection sharing strategies: a) Q-K=V, which shares the key and value projections; b) Q=K-V, which shares the query and key projections; and c) Q=K=V, which utilizes a single unified projection. While the latter two approaches result in symmetric attention maps, we also investigate asymmetric attention mechanisms utilizing two-dimensional positional encodings to mitigate potential limitations.
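The three sharing strategies can be illustrated with a minimal, dependency-free sketch of a single attention head. It is not taken from the paper's code: the function names (`project`, `attention_scores`, `is_symmetric`), the toy dimensions, and the random weights are all assumptions made for illustration. It shows only that tying the query and key projections makes the pre-softmax score matrix symmetric, while tying the key and value projections does not.

```python
import math
import random

MODES = ("QKV", "Q-K=V", "Q=K-V", "Q=K=V")

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def project(x, w_q, w_k, w_v, mode):
    """Return (q, k, v) for one head under the given sharing mode (hypothetical helper)."""
    if mode == "QKV":    # standard attention: three separate projections
        return matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    if mode == "Q-K=V":  # key and value share one projection
        kv = matmul(x, w_k)
        return matmul(x, w_q), kv, kv
    if mode == "Q=K-V":  # query and key share one projection
        qk = matmul(x, w_q)
        return qk, qk, matmul(x, w_v)
    if mode == "Q=K=V":  # one projection for everything
        shared = matmul(x, w_q)
        return shared, shared, shared
    raise ValueError(f"unknown mode: {mode}")

def attention_scores(x, w_q, w_k, w_v, mode):
    """Scaled dot-product scores q k^T / sqrt(d), before the softmax."""
    q, k, _ = project(x, w_q, w_k, w_v, mode)
    scale = math.sqrt(len(q[0]))
    return [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]

def is_symmetric(m, tol=1e-9):
    """Check whether a square matrix equals its transpose up to a tolerance."""
    n = len(m)
    return all(math.isclose(m[i][j], m[j][i], abs_tol=tol)
               for i in range(n) for j in range(n))

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x = rand_matrix(5, 8)  # 5 tokens, model width 8 (toy sizes, assumed)
w_q, w_k, w_v = rand_matrix(8, 8), rand_matrix(8, 8), rand_matrix(8, 8)

symmetric = {mode: is_symmetric(attention_scores(x, w_q, w_k, w_v, mode))
             for mode in MODES}
```

In this sketch the symmetry holds for the raw scores only. The row-wise softmax that follows generally breaks it, and a causal mask or positional encoding would change the picture further.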
Our evaluation encompasses synthetic tasks, computer vision benchmarks (including MNIST, CIFAR, TinyImageNet, and anomaly detection), and large-scale language modeling using 300M- and 1.2B-parameter models trained on 10 billion tokens. The findings indicate that these modified transformer architectures perform comparably to, and in some instances surpass, standard QKV transformers. In language modeling, the Q-K=V sharing strategy reduces the KV cache size by 50% with a perplexity increase of only 3.1%.
Furthermore, projection sharing demonstrates strong synergy with head-sharing techniques such as Grouped Query Attention (GQA) and Multi-Query Attention (MQA). Integrating Q-K=V with GQA-4 results in an 87.5% reduction in cache requirements, while combining it with MQA achieves a 96.9% reduction, making on-device inference feasible. We attribute the quality retention in the Q-K=V model to the fact that keys and values reside in similar representational spaces, allowing attention to function effectively within a low-rank regime. Conversely, the Q=K-V configuration disrupts the directional nature of attention. These results highlight projection sharing as a significant yet under-researched form of weight tying in attention mechanisms, offering measurable reductions in inference memory that are particularly advantageous for edge computing. The source code is publicly accessible at https://github.com/anushamadan02/Do-Transformers-Need-3-Projections
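The reported cache reductions can be reproduced with simple arithmetic. The sketch below rests on two assumptions not stated in the abstract: the baseline model has 16 attention heads (a head count consistent with the 96.9% figure), and GQA-4 leaves 4 key/value heads. The function name `kv_cache_fraction` is hypothetical.

```python
def kv_cache_fraction(num_heads, kv_heads, share_kv):
    """Fraction of the baseline multi-head KV cache that still must be stored.

    num_heads: query heads in the baseline model (assumed value, see below)
    kv_heads:  key/value heads kept after head sharing (GQA or MQA)
    share_kv:  True when K and V are one tensor (Q-K=V), so only one is cached
    """
    head_fraction = kv_heads / num_heads
    tensor_fraction = 0.5 if share_kv else 1.0
    return head_fraction * tensor_fraction

NUM_HEADS = 16  # assumption: inferred from the 96.9% figure, not stated in the abstract

reductions = {
    "Q-K=V": 1 - kv_cache_fraction(NUM_HEADS, NUM_HEADS, share_kv=True),
    "Q-K=V + GQA-4": 1 - kv_cache_fraction(NUM_HEADS, 4, share_kv=True),
    "Q-K=V + MQA": 1 - kv_cache_fraction(NUM_HEADS, 1, share_kv=True),
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%} smaller KV cache")
```

Under these assumptions the three reductions come out to 50%, 87.5%, and 96.875%, which rounds to the 96.9% quoted above. A different head count would change the MQA figure.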
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC





