arXiv

Leyline: KV Cache Directives for Agentic Inference

June 2, 2026 · Bole Ma, Jan Eitzinger, Harald Koestler

Title: Leyline: Implementing KV Cache Directives for Agentic Inference

Abstract

Current KV cache management strategies assume standard chatbot workloads, in which a prompt is submitted once and the cache grows in an append-only fashion. Under this model, techniques such as prefix caching and forward-only eviction are inherently correct. Agentic large language models (LLMs) break this assumption. They evolve conversations through policy-driven edits: retrying failed tool executions, discarding outdated outputs, and pivoting trajectories. This creates two cache challenges.

The first issue involves content migration: identical data shifts to new positions across different turns, thereby invalidating exact-prefix caches despite the underlying key-value (KV) data remaining valid. While recent research into position-independent caching for Multi-Head Latent Attention (MLA) has begun to address this reuse problem, it is not the primary focus of this study.
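The content-migration problem can be illustrated with a minimal sketch, assuming a block-hashed exact-prefix cache in which each block's hash is chained to the hash of everything before it. The helper names `block_hashes` and `shared_prefix_blocks`, the block size, and the token values are hypothetical and chosen only for illustration; real serving systems differ in detail.

```python
import hashlib

def block_hashes(tokens, block_size=4):
    """Return one chained hash per full block; each hash covers the entire prefix."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):
        block = ",".join(map(str, tokens[i:i + block_size])).encode()
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes

def shared_prefix_blocks(a, b):
    """Count leading blocks whose chained hashes match (the reusable prefix)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Turn 1: 16 tokens, i.e. four blocks of four.
turn1 = list(range(100, 116))
# Turn 2: the policy drops the second block (say, a stale tool output),
# so the remaining content shifts left by one block.
turn2 = turn1[:4] + turn1[8:]

h1, h2 = block_hashes(turn1), block_hashes(turn2)
reusable = shared_prefix_blocks(h1, h2)
print(f"reusable blocks: {reusable} of {len(h2)}")
```

Only the first block hits the cache here, although the last two blocks of the second turn contain exactly the tokens already cached in the first turn; they simply sit at new positions.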

The second, central challenge addressed by this paper is the need for policies to instruct the serving system to actively remove or replace specific segments of cached content. Crucially, the system must continue operation without requiring a full re-prefill of all subsequent data. No existing primitive currently supports this functionality. Consequently, production agentic frameworks are forced to re-prefill after every edit, incurring the full cost of prefix recomputation. Meanwhile, kernel-level eviction mechanisms operate independently, lacking the ability to accept external policy directives.

To bridge this gap, we present Leyline, a serving-side primitive designed for this purpose. Leyline uses a declarative directive, structured as a 4-tuple, that separates what content to edit from how positional correctness is maintained. The policy specifies the edit and its execution mode: either an in-place splice or a prefix-trimmed re-prefill for semantic forgetting. An architecture-agnostic interface dispatches these requests to per-architecture kernels, which restore the attention math through closed-form RoPE-rotation corrections.
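The idea behind a closed-form RoPE-rotation correction can be sketched as follows. This is not the Leyline kernel: it is a pure-Python illustration of the standard property that RoPE rotations compose additively, so a key cached at one position can be moved to another by rotating it through the position delta rather than recomputing it. The function names `rope_rotate` and `splice_correct`, the adjacent-dimension pairing, and the base of 10000 are assumptions made for this example; actual architectures may pair dimensions differently.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Apply RoPE at position `pos` to an even-length vector (adjacent dims paired)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # per-pair frequency scaled by position
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def splice_correct(cached_key, old_pos, new_pos, base=10000.0):
    """Re-rotate an already-rotated key by the position delta (hypothetical helper)."""
    return rope_rotate(cached_key, new_pos - old_pos, base)

# A raw (pre-RoPE) key vector with made-up values.
k = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.1, -0.3]

cached = rope_rotate(k, 37)              # key as stored in the cache at position 37
moved = splice_correct(cached, 37, 21)   # 16 earlier tokens removed: shift to 21
fresh = rope_rotate(k, 21)               # what a full recompute would produce

max_err = max(abs(a - b) for a, b in zip(moved, fresh))
print(f"max abs difference vs. recompute: {max_err:.2e}")
```

Because the correction depends only on the delta between old and new positions, it needs neither the original hidden states nor a forward pass, which is what makes an in-place splice cheaper than a re-prefill in this simplified setting.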

Our results demonstrate significant performance gains. The splice kernel improved replay cache hit rates by 11.2 percentage points and reduced latency by as much as 241 ms. Furthermore, a simple ten-line truncation rule, implemented via the same interface, increased the agentic solve rate by 14.3 percentage points on the debug-gym benchmark. The mechanism is open; exploring the policy space it unlocks is the next item on the agenda.
