arXiv

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

Abstract:

Large language models (LLMs) use key-value (KV) caching to accelerate inference by reusing previous computations for newly generated tokens. This technique is particularly important in long-context scenarios such as in-context learning (ICL) and retrieval-augmented generation (RAG). However, traditional KV caching bakes positional information directly into the cached entries, so a cached segment cannot be reused at a different position. Existing approaches to this problem either restrict reuse to specific prefixes or require costly memory operations to re-encode positions.

To address these limitations, we present LazyAttention, an attention mechanism that defers positional encoding to the attention kernels, enabling zero-copy, position-agnostic reuse of KV caches. By applying positional encoding inside the attention kernels at execution time, LazyAttention eliminates the materialization bottleneck. Consequently, a single physical instance of the KV cache can serve multiple logical requests at different positions.
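The idea of deferring positional encoding can be illustrated with a small sketch. It assumes rotary positional embeddings (RoPE), which the abstract does not specify, and it uses plain Python lists rather than GPU kernels. The names `rope`, `attend`, and `lazy_attend` are hypothetical and chosen only for this illustration; this is not the paper's implementation. Keys are stored without positional information, and each request rotates them on the fly using its own logical offset.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that depend on `pos`
    (assumption: RoPE-style encoding with an even vector length)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def attend(q_rot, keys_rot, values):
    """Single-query softmax attention over already-encoded keys."""
    scale = 1.0 / math.sqrt(len(q_rot))
    scores = [scale * sum(a * b for a, b in zip(q_rot, k)) for k in keys_rot]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[t] for w, v in zip(weights, values)) / total
            for t in range(len(values[0]))]

def lazy_attend(query, q_pos, cache_keys, cache_values, offset):
    """Attention over a position-free cache (hypothetical helper).

    `cache_keys` hold un-rotated keys. The logical position of key j is
    `offset + j`; it is applied here, at attention time, and the cache
    itself is never modified or copied into a re-encoded form."""
    q_rot = rope(query, q_pos)
    keys_rot = [rope(k, offset + j) for j, k in enumerate(cache_keys)]
    return attend(q_rot, keys_rot, cache_values)

# One physical cache for a two-token document (toy values).
cache_keys = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.9]]
cache_values = [[1.0, 2.0], [3.0, 4.0]]
query = [0.3, -0.7, 0.1, 0.4]

# Two logical requests place the same document at different positions.
out_a = lazy_attend(query, 2, cache_keys, cache_values, offset=0)
out_b = lazy_attend(query, 9, cache_keys, cache_values, offset=7)
```

In both requests the query sits at the same distance from the document, so under the RoPE assumption the two outputs agree up to floating-point error, even though the cached keys were never rewritten. A real kernel would fuse the rotation into the attention computation instead of building `keys_rot` as a separate list.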

With specialized attention kernels for both the prefilling and decoding phases, our system delivers substantial efficiency gains. Compared with the leading Block-Attention method, LazyAttention reduces time-to-first-token (TTFT) by 1.37× and increases inference throughput by 1.40× under skewed document distributions, while preserving output quality on par with existing state-of-the-art solutions.


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
