arXiv

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

June 2, 2026 · Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

Abstract:

In multi-LLM agent systems, role specialization is often implemented with multiple Low-Rank Adaptation (LoRA) adapters: agents share a pretrained backbone and differ only in their lightweight adapters. Even though the base model weights are shared, each agent typically builds and stores its own Key-Value (KV) cache for the same long, tool-augmented trajectories, which incurs substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting.

Our analysis reveals that variations in the cache across different agents are primarily driven by adapter outputs, whereas activations originating from the shared pretrained backbone exhibit high similarity. Leveraging this insight, we introduce LRAgent, a framework designed for KV cache sharing within multi-LoRA agent ecosystems. LRAgent partitions the cache into two distinct parts: a shared base component, which stems from the pretrained weights, and an adapter-specific component, which originates from the LoRA weights.
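The base/adapter split can be illustrated with the standard LoRA parameterization, in which a projection's effective weight is the pretrained matrix plus a low-rank product. The following is a minimal NumPy sketch, not the paper's implementation. It assumes a single key projection without positional encoding, and it assumes every agent sees identical input hidden states `X` (the abstract reports backbone activations as highly similar, not identical). All names and sizes (`W_k`, `A_agents`, `B_agents`, `rank`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only; none of these come from the paper.
seq_len, d_model, d_head, rank, n_agents = 16, 64, 32, 4, 3

X = rng.standard_normal((seq_len, d_model))    # hidden states of the shared context (assumed identical across agents)
W_k = rng.standard_normal((d_model, d_head))   # pretrained key projection, shared by all agents
A_agents = [rng.standard_normal((d_model, rank)) for _ in range(n_agents)]  # per-agent LoRA down-projections
B_agents = [rng.standard_normal((rank, d_head)) for _ in range(n_agents)]   # per-agent LoRA up-projections

# Shared base component: computed and stored once for all agents.
K_base = X @ W_k
# Adapter component kept in low-rank form: shape (seq_len, rank) per agent.
Z_agents = [X @ A for A in A_agents]

def merged_keys(i):
    """Keys an agent would cache if the adapter were merged into the weight."""
    return X @ (W_k + A_agents[i] @ B_agents[i])

def decomposed_keys(i):
    """Same keys rebuilt from the shared base plus the low-rank adapter part."""
    return K_base + Z_agents[i] @ B_agents[i]

keys_match = all(
    np.allclose(merged_keys(i), decomposed_keys(i)) for i in range(n_agents)
)

# Cache footprint in stored floats (keys only, one head, one layer).
separate_floats = n_agents * seq_len * d_head                       # one full cache per agent
decomposed_floats = seq_len * d_head + n_agents * seq_len * rank    # one base + low-rank parts
```

With these toy sizes, three separate key caches hold 1536 floats, while one shared base plus three rank-4 components holds 704. The gap widens as the number of agents grows or as `rank` shrinks relative to `d_head`.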

By sharing the base component across all agents and keeping the adapter component in its native low-rank form, LRAgent substantially reduces memory usage. It further reduces compute by sharing the low-rank cache itself through a shared-A multi-LoRA architecture, which avoids redundant processing of context that other agents have already handled. To reconstruct adapter contributions efficiently at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders the attention computation so that the low-rank cache never has to be materialized at full dimension.
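One way to see why the low-rank cache need not be expanded is matrix associativity: the up-projection can be applied to the query and to the attention-weighted sum instead of to every cached token. The sketch below shows only this algebraic identity for a single decode-step query. It is not the authors' Flash-LoRA-Attention kernel, and it ignores positional encodings, multi-head layout, and tiling. The shared down-projections `A_k` and `A_v` stand in for the shared-A design, and every name here is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def naive_attention(q, K_base, V_base, Z_k, Z_v, B_k, B_v):
    """Materializes full-dimensional K and V before attending."""
    K = K_base + Z_k @ B_k            # (seq_len, d_head)
    V = V_base + Z_v @ B_v            # (seq_len, d_head)
    p = softmax(q @ K.T / np.sqrt(q.shape[-1]))
    return p @ V

def reordered_attention(q, K_base, V_base, Z_k, Z_v, B_k, B_v):
    """Same result, but the low-rank caches Z_k and Z_v stay at rank r."""
    # q (K_base + Z_k B_k)^T = q K_base^T + (q B_k^T) Z_k^T
    scores = q @ K_base.T + (q @ B_k.T) @ Z_k.T
    p = softmax(scores / np.sqrt(q.shape[-1]))
    # p (V_base + Z_v B_v) = p V_base + (p Z_v) B_v
    return p @ V_base + (p @ Z_v) @ B_v

rng = np.random.default_rng(0)
# Illustrative sizes only; none of these come from the paper.
seq_len, d_model, d_head, rank = 16, 64, 32, 4

X = rng.standard_normal((seq_len, d_model))      # shared context hidden states
W_k = rng.standard_normal((d_model, d_head))     # pretrained key projection
W_v = rng.standard_normal((d_model, d_head))     # pretrained value projection
A_k = rng.standard_normal((d_model, rank))       # down-projection, assumed shared by all agents
A_v = rng.standard_normal((d_model, rank))
B_k = rng.standard_normal((rank, d_head))        # up-projections of one particular agent
B_v = rng.standard_normal((rank, d_head))

# Caches that could be shared across agents: base parts and low-rank parts.
K_base, V_base = X @ W_k, X @ W_v
Z_k, Z_v = X @ A_k, X @ A_v

q = rng.standard_normal((1, d_head))             # one decode-step query
out_naive = naive_attention(q, K_base, V_base, Z_k, Z_v, B_k, B_v)
out_reordered = reordered_attention(q, K_base, V_base, Z_k, Z_v, B_k, B_v)
outputs_match = np.allclose(out_naive, out_reordered)
```

Because `A_k` and `A_v` are shared in this sketch, `Z_k` and `Z_v` depend only on the context and not on the agent, so a second agent could reuse them and supply only its own `B_k` and `B_v`.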

Evaluations demonstrate that LRAgent delivers throughput and time-to-first-token latency comparable to fully shared caching systems. Simultaneously, it maintains accuracy levels nearly identical to those of non-shared caching baselines across various agentic question-answering benchmarks.
