arXiv

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

Abstract:

In multi-LLM agent systems, role specialization is often achieved with multiple Low-Rank Adaptation (LoRA) modules: agents share a common pretrained backbone and differ only in their lightweight adapters. Although the base model weights are shared, each agent typically builds and stores its own Key-Value (KV) cache while processing the same long, tool-augmented trajectories, which incurs substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting.

Our analysis shows that cache differences across agents are driven primarily by adapter outputs, whereas activations from the shared pretrained backbone remain highly similar. Building on this observation, we introduce LRAgent, a KV cache sharing framework for multi-LoRA agents. LRAgent decomposes the cache into two parts: a shared base component, derived from the pretrained weights, and an adapter-dependent component, derived from the LoRA weights.
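The decomposition into a base part and an adapter part can be illustrated with a small NumPy sketch. It assumes a standard LoRA parameterization of the key projection (W_k + A B) and, as a simplification, that every agent sees identical hidden states X. The shapes, the agent names, and the omission of the LoRA scaling factor are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, r = 6, 16, 2  # tokens, hidden size, LoRA rank (toy values, assumed)

X = rng.standard_normal((T, d))    # hidden states of a shared context
W_k = rng.standard_normal((d, d))  # pretrained key projection, shared by all agents

# One (A, B) LoRA pair per agent; the agent names are hypothetical.
adapters = {
    "planner": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
    "coder": (rng.standard_normal((d, r)), rng.standard_normal((r, d))),
}

# Base component: depends only on the pretrained weights, so it is stored once.
base_k = X @ W_k

for name, (A, B) in adapters.items():
    # Adapter component, kept at rank r: shape (T, r) instead of (T, d).
    low_rank_k = X @ A
    # What a conventional per-agent cache would hold for the same tokens.
    full_k = X @ (W_k + A @ B)
    # Base part plus up-projected adapter part reproduces the full key.
    assert np.allclose(full_k, base_k + low_rank_k @ B)
    print(name, base_k.shape, low_rank_k.shape)
```

In this toy setup, each agent's adapter part holds T × r values rather than the T × d values of a full per-agent key cache, while the T × d base part is stored once for all agents.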

By sharing the base component across all agents and keeping the adapter component in its native low-rank form, LRAgent substantially reduces memory usage. It further reduces compute by also sharing the low-rank cache, enabled by a shared-A multi-LoRA architecture, so an agent does not reprocess contexts that other agents have already handled. To reconstruct adapter contributions efficiently at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders the attention computation so the low-rank cache never has to be materialized at full dimension.
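One way to see why reordering can avoid materialization is the associativity of matrix products. The sketch below is a plain NumPy illustration of that identity for a single head and a single decode-step query. It is not the paper's fused kernel, and it ignores positional encodings, multi-head layout, and tiling. All names and shapes are assumptions, and the low-rank caches stand in for X A under a shared-A setup where only the up-projections B differ per agent.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, r = 8, 16, 2  # cached tokens, head size, LoRA rank (toy values, assumed)

base_k = rng.standard_normal((T, d))  # shared base key cache
base_v = rng.standard_normal((T, d))  # shared base value cache
z_k = rng.standard_normal((T, r))     # low-rank key cache (stands in for X @ A_k)
z_v = rng.standard_normal((T, r))     # low-rank value cache (stands in for X @ A_v)
B_k = rng.standard_normal((r, d))     # agent-specific key up-projection
B_v = rng.standard_normal((r, d))     # agent-specific value up-projection
q = rng.standard_normal((1, d))       # one decode-step query


def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention_materialized(q):
    # Reference path: expand the low-rank caches to full (T, d) tensors first.
    k = base_k + z_k @ B_k
    v = base_v + z_v @ B_v
    p = softmax(q @ k.T / np.sqrt(d))
    return p @ v


def attention_reordered(q):
    # q (K_base + Z_k B_k)^T = q K_base^T + (q B_k^T) Z_k^T
    # The adapter term only ever forms (1, r) and (1, T) intermediates.
    scores = q @ base_k.T + (q @ B_k.T) @ z_k.T
    p = softmax(scores / np.sqrt(d))
    # P (V_base + Z_v B_v) = P V_base + (P Z_v) B_v
    return p @ base_v + (p @ z_v) @ B_v


assert np.allclose(attention_materialized(q), attention_reordered(q))
```

Both paths give the same output, but the reordered one never builds a (T, d) adapter tensor: the query is first projected down to rank r on the key side, and the attention-weighted low-rank values are projected up once on the value side.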

Evaluations show that LRAgent achieves throughput and time-to-first-token latency comparable to fully shared caching, while keeping accuracy nearly identical to the non-shared caching baseline across agentic question-answering benchmarks.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
