Global News Digest

arXiv

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Title: Prioritize Query Routing Over Cache Migration: An Analysis of Cross-Instance Latent Attention Redistribution Across GPU Networks

Abstract:

Modern Large Language Models (LLMs) increasingly rely on sparse-attention indexers to determine query focus, selecting only a few Key-Value (KV) cache blocks per query. Consequently, the fundamental unit of attention has shifted to small, reusable data chunks. Agentic workloads place significant strain on this mechanism, as numerous sub-agents frequently query a single, expansive codebase, leading to high reuse of identical blocks. When the underlying corpus exceeds the capacity of a single GPU, it is partitioned across multiple instances. In such scenarios, the query and the specific blocks it targets often reside on different GPUs, necessitating attention mechanisms that span across instances.

Previous cross-instance KV systems have predominantly adopted a "move the cache" strategy, pulling the selected blocks to the requesting node. Multi-head Latent Attention (MLA) inverts this logic by compressing each token’s key and value into a single, narrow vector. This compression shrinks a routed query row to approximately 1 KB, smaller than the chunk it attends to, so routing the query is often cheaper than migrating the cache. How the two primitives compare across network fabrics and request characteristics remains largely unexplored, particularly with device-initiated RDMA, which makes per-request cross-node transfers inexpensive.

This study characterizes cross-instance MLA attention on a real multi-node H100 cluster. We derive two reusable artifacts: a topology-aware cost model covering the probe, transfer, compute, return, and merge phases, and a closed-form predicate that decides whether to route the query, fetch the cache, or process locally. We calibrate the model's constants against real InfiniBand GPUDirect Async (IBGDA) measurements, and its predictions fall within ~7% of observed batched round-trips. During decoding, the predicate favors routing the query. Routing replaces the cost of moving the cache (a ~3 ms re-adaptation splice for contiguous chunks, or scattered gathers during selection) with a round-trip measured in tens of microseconds. Furthermore, fabric choice is driven by probe latency rather than peak bandwidth.
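The shape of such a cost model and predicate can be illustrated with a minimal sketch. This is not the paper's model: the `Fabric` class, the function names, the additive form of each cost, and every numeric constant except the ~3 ms splice and the ~1 KB query row quoted above are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fabric:
    """Hypothetical link model: a fixed probe latency plus a bandwidth term."""
    probe_us: float            # per-request probe latency, in microseconds
    bandwidth_gb_per_s: float  # sustained bandwidth; 1 GB/s = 1e3 bytes/us

    def transfer_us(self, nbytes: int) -> float:
        # Time to move nbytes over the link, in microseconds.
        return nbytes / (self.bandwidth_gb_per_s * 1e3)


def route_cost_us(fabric, query_bytes, result_bytes, compute_us, merge_us):
    """Move the query: probe + transfer + remote compute + return + merge."""
    return (fabric.probe_us
            + fabric.transfer_us(query_bytes)
            + compute_us
            + fabric.transfer_us(result_bytes)
            + merge_us)


def fetch_cost_us(fabric, chunk_bytes, splice_us, compute_us):
    """Move the cache: probe + chunk transfer + local splice + local compute."""
    return (fabric.probe_us
            + fabric.transfer_us(chunk_bytes)
            + splice_us
            + compute_us)


def choose(resident, route_us, fetch_us):
    """Pick 'local' if the blocks are already resident, else the cheaper primitive."""
    if resident:
        return "local"
    return "route" if route_us <= fetch_us else "fetch"


# Assumed values: 5 us probe, 50 GB/s link, 1 KB query and result rows,
# a 1 MiB chunk, 10 us of attention compute, 1 us merge, 3 ms splice.
fabric = Fabric(probe_us=5.0, bandwidth_gb_per_s=50.0)
route_us = route_cost_us(fabric, 1024, 1024, compute_us=10.0, merge_us=1.0)
fetch_us = fetch_cost_us(fabric, 1 << 20, splice_us=3000.0, compute_us=10.0)
print(choose(False, route_us, fetch_us))  # prints: route
```

With these assumed numbers the 1 KB transfers contribute only a few hundredths of a microsecond, so the route cost is dominated by probe latency and compute. That is consistent with the abstract's remark that probe latency, not peak bandwidth, drives fabric choice.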

While we instantiate the cost model and predicate specifically for MLA, the framework is not architecture-specific. It applies broadly to any system where compression or sparse selection reduces attention to small chunks, including current models such as DeepSeek-V3.2, V4, and GLM-5.1. Adapting these tools to new architectures requires only the measurement of two coefficients: the size of the routed payload and the cost of fetching via the "move-the-cache" primitive.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
