arXiv

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

June 2, 2026 · Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein

Title: Prioritize Query Routing Over Cache Migration: An Analysis of Cross-Instance Latent Attention Redistribution Across GPU Networks

Abstract:

Modern Large Language Models (LLMs) increasingly rely on sparse-attention indexers that select only a few Key-Value (KV) cache blocks per query. As a result, the fundamental unit of attention has shifted to small, reusable chunks. Agentic workloads stress this mechanism: many sub-agents query a single large codebase, so the same blocks are reused heavily. When the corpus exceeds the capacity of a single GPU, it is partitioned across multiple instances. The query and the blocks it selects then often reside on different GPUs, so attention must span instances.

Previous cross-instance KV systems have predominantly adopted a "move the cache" strategy, pulling the selected blocks to the requesting node. Multi-head Latent Attention (MLA) inverts this logic by compressing each token's key and value into a single, narrow vector. With this compression, a routed query row is approximately 1 KB, which is smaller than the chunk it attends to, so routing the query is often cheaper than migrating the cache. How the two primitives compare across network fabrics and request characteristics remains largely unexplored, particularly with device-initiated RDMA, which makes per-request cross-node transfers cheap.
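To make the size argument concrete, the following is a minimal sketch that compares the bytes moved by each primitive. Only the roughly 1 KB query row comes from the text. The latent width, positional width, element size, and chunk lengths are illustrative assumptions, and the helper names `chunk_bytes` and `bytes_ratio` are hypothetical.

```python
# Illustrative payload arithmetic: route a query row vs. move a compressed KV chunk.
# All dimensions below are assumptions for the sake of the example, not values
# reported by the paper.
LATENT_DIM = 512        # assumed width of the compressed KV latent per token
ROPE_DIM = 64           # assumed width of the positional key part per token
BYTES_PER_ELEM = 2      # assumed 16-bit storage
QUERY_ROW_BYTES = 1024  # "approximately 1 KB", as stated in the text


def chunk_bytes(tokens_per_chunk: int) -> int:
    """Bytes needed to move one compressed KV chunk of the given length."""
    return tokens_per_chunk * (LATENT_DIM + ROPE_DIM) * BYTES_PER_ELEM


def bytes_ratio(tokens_per_chunk: int) -> float:
    """How many times larger the chunk is than one routed query row."""
    return chunk_bytes(tokens_per_chunk) / QUERY_ROW_BYTES


if __name__ == "__main__":
    for tokens in (16, 64, 256):  # hypothetical chunk lengths
        print(f"{tokens:4d} tokens: {chunk_bytes(tokens):7d} B, "
              f"{bytes_ratio(tokens):6.1f}x the query row")
```

Under these assumptions the chunk outweighs the query row at every chunk length, and the gap grows linearly with the number of tokens per chunk. Byte counts alone do not settle the choice, since fixed per-request latency also matters.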

This study characterizes cross-instance MLA attention on a real multi-node H100 cluster. We derive two reusable artifacts: a topology-aware cost model covering the probe, transfer, compute, return, and merge phases, and a closed-form predicate that decides whether to route, fetch, or process locally. We calibrate the model's constants against real InfiniBand GPUDirect Async (IBGDA) measurements, and the model predicts observed batched round-trips to within ~7%. During decoding, the predicate favors routing the query. Routing trades the cost of moving the cache (a ~3 ms re-adaptation splice for contiguous chunks, or scattered gathers under selection) for a round trip measured in tens of microseconds. Furthermore, fabric selection is driven by probe latency rather than by peak bandwidth.
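The phase structure described above can be sketched as a small additive cost model with a three-way decision. This is a hedged illustration, not the paper's calibrated model or its closed-form predicate: the class names, the additive form, and every constant in the example (probe latency, bandwidth, compute and merge times, result size, chunk size) are placeholder assumptions. Only the ~3 ms fetch overhead and the ~1 KB query size echo figures from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    probe_us: float        # fixed per-request latency, in microseconds
    bandwidth_gb_s: float  # sustained bandwidth, in GB/s

    def transfer_us(self, nbytes: int) -> float:
        # 1 GB/s moves 1e3 bytes per microsecond.
        return nbytes / (self.bandwidth_gb_s * 1e3)


@dataclass(frozen=True)
class Request:
    query_bytes: int       # routed query payload
    result_bytes: int      # partial attention result sent back
    chunk_bytes: int       # size of the selected KV chunk
    compute_us: float      # attention over the chunk
    merge_us: float        # merging the partial result at the requester
    fetch_fixed_us: float  # fixed overhead of the move-the-cache primitive
    chunk_is_local: bool = False


def route_us(fabric: Fabric, req: Request) -> float:
    """Probe + transfer (query) + compute + return (result) + merge."""
    return (fabric.probe_us
            + fabric.transfer_us(req.query_bytes)
            + req.compute_us
            + fabric.transfer_us(req.result_bytes)
            + req.merge_us)


def fetch_us(fabric: Fabric, req: Request) -> float:
    """Probe + fixed fetch overhead + transfer (chunk) + local compute."""
    return (fabric.probe_us
            + req.fetch_fixed_us
            + fabric.transfer_us(req.chunk_bytes)
            + req.compute_us)


def decide(fabric: Fabric, req: Request) -> str:
    """Return 'local', 'route', or 'fetch' under the assumed additive model."""
    if req.chunk_is_local:
        return "local"
    return "route" if route_us(fabric, req) <= fetch_us(fabric, req) else "fetch"


if __name__ == "__main__":
    fabric = Fabric(probe_us=5.0, bandwidth_gb_s=50.0)  # placeholder values
    req = Request(query_bytes=1024, result_bytes=1024, chunk_bytes=73728,
                  compute_us=10.0, merge_us=1.0, fetch_fixed_us=3000.0)
    print(decide(fabric, req), route_us(fabric, req), fetch_us(fabric, req))
```

With these placeholder numbers the fixed fetch overhead dominates and the sketch picks routing. With payloads this small the bandwidth term is negligible, so the fixed `probe_us` term carries most of the fabric's contribution, which is consistent with the observation that probe latency, not peak bandwidth, drives fabric selection.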

While we instantiate the cost model and predicate specifically for MLA, the framework is not architecture-specific. It applies broadly to any system where compression or sparse selection reduces attention to small chunks, including current models such as DeepSeek-V3.2, V4, and GLM-5.1. Adapting these tools to new architectures requires only the measurement of two coefficients: the size of the routed payload and the cost of fetching via the "move-the-cache" primitive.
