arXiv

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Title: Is a Triple Projection Necessary? A Comprehensive Analysis of QKV Configurations

Abstract: Transformers have established themselves as the dominant architecture for diverse artificial intelligence applications, largely due to the foundational role of the query, key, and value (QKV) attention mechanism. Despite their prevalence, the distinct function of each individual projection and the consequences of removing certain components remain insufficiently explored. This study conducts a systematic assessment of three distinct projection sharing strategies: a) Q-K=V, which shares the key and value projections; b) Q=K-V, which shares the query and key projections; and c) Q=K=V, which utilizes a single unified projection. While the latter two approaches result in symmetric attention maps, we also investigate asymmetric attention mechanisms utilizing two-dimensional positional encodings to mitigate potential limitations.
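The symmetry claim for the Q=K-V and Q=K=V variants can be checked numerically. The following is a minimal pure-Python sketch, not the authors' implementation: it assumes that "sharing" means reusing one weight matrix for both projections, and every name in it (`attention_scores`, `w_a`, `w_b`, the toy dimensions) is hypothetical. It only inspects the pre-softmax score matrix; a row-wise softmax applied afterwards does not preserve symmetry in general.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def attention_scores(x, w_q, w_k):
    """Unscaled pre-softmax scores: (x @ w_q) @ (x @ w_k)^T."""
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    return matmul(q, transpose(k))


def is_symmetric(m, tol=1e-9):
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))


rng = random.Random(0)
seq_len, d_model, d_head = 4, 8, 3  # toy sizes, chosen only for illustration
x = random_matrix(seq_len, d_model, rng)
w_a = random_matrix(d_model, d_head, rng)
w_b = random_matrix(d_model, d_head, rng)

# Standard QKV and Q-K=V: queries and keys use different weights.
separate_qk_scores = attention_scores(x, w_a, w_b)

# Q=K-V and Q=K=V: queries and keys use the same weights.
shared_qk_scores = attention_scores(x, w_a, w_a)

print("separate Q/K symmetric:", is_symmetric(separate_qk_scores))
print("shared Q/K symmetric:  ", is_symmetric(shared_qk_scores))
```

With a shared query/key projection the score matrix has the form AAᵀ, so token i scores token j exactly as j scores i. This is the loss of directionality that the asymmetric variants with two-dimensional positional encodings are meant to address.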

Our evaluation encompasses synthetic tasks, computer vision benchmarks (including MNIST, CIFAR, TinyImageNet, and anomaly detection), and large-scale language modeling using 300M and 1.2B parameter models trained on 10 billion tokens. The findings indicate that these modified transformer architectures perform comparably to, and in some instances surpass, standard QKV transformers. In the context of language modeling, the Q-K=V sharing strategy reduces the KV cache size by 50% with a perplexity increase of only 3.1%.

Furthermore, projection sharing demonstrates strong synergy with head-sharing techniques such as Grouped Query Attention (GQA) and Multi-Query Attention (MQA). Integrating Q-K=V with GQA-4 results in an 87.5% reduction in cache requirements, while combining it with MQA achieves a 96.9% reduction, thereby facilitating feasible on-device inference. We attribute the quality retention in the Q-K=V model to the fact that keys and values reside in similar representational spaces, allowing attention to function effectively within a low-rank regime. Conversely, the Q=K-V configuration disrupts the directional nature of attention. These results highlight projection sharing as a significant yet under-researched form of weight tying in attention mechanisms, offering measurable reductions in inference memory that are particularly advantageous for edge computing. The source code is publicly accessible at https://github.com/anushamadan02/Do-Transformers-Need-3-Projections
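The cache reductions quoted above follow from simple arithmetic, sketched below. The abstract does not state the head count; 16 attention heads is an assumption inferred here because it reproduces the reported 96.9% figure for MQA (1 - 1/32 = 96.875%). With 16 heads, GQA-4 corresponds to 4 key/value heads. The function names are hypothetical.

```python
from fractions import Fraction


def kv_cache_fraction(num_heads, num_kv_heads, share_kv):
    """Fraction of the baseline KV cache that remains.

    Baseline: every head caches its own key tensor and value tensor.
    Head sharing (GQA/MQA) keeps only num_kv_heads of num_heads.
    Q-K=V sharing stores one tensor instead of two per cached head.
    """
    head_factor = Fraction(num_kv_heads, num_heads)
    tensor_factor = Fraction(1, 2) if share_kv else Fraction(1)
    return head_factor * tensor_factor


def reduction_percent(fraction):
    return float((1 - fraction) * 100)


NUM_HEADS = 16  # assumption, not stated in the abstract

configs = {
    "Q-K=V alone": kv_cache_fraction(NUM_HEADS, NUM_HEADS, share_kv=True),
    "Q-K=V + GQA-4": kv_cache_fraction(NUM_HEADS, 4, share_kv=True),
    "Q-K=V + MQA": kv_cache_fraction(NUM_HEADS, 1, share_kv=True),
}

for name, fraction in configs.items():
    print(f"{name}: {reduction_percent(fraction):.3f}% smaller cache")
```

Under these assumptions the three configurations give 50%, 87.5%, and 96.875% reductions, matching the reported figures after rounding.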


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
