arXiv

NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference

June 3, 2026 · Mubarak Adetunji Ojewale


Abstract

In disaggregated Large Language Model (LLM) inference, the KV cache must cross the datacenter network before decoding can start, so transfer latency counts directly against the Time to First Token (TTFT) budget. Existing schedulers typically prioritize compute load and prefix-cache locality, and neglect both the topological distance and the dynamic network congestion between prefill and decode instances. To address this limitation, we introduce the network cost oracle, a lightweight operator-to-scheduler interface. We show theoretically that cache-aware scheduling alone, because it omits network cost, becomes arbitrarily suboptimal as context lengths increase. NetKV is a greedy algorithm that uses this oracle at O(|D|) complexity per request, and its tier rankings are proven to remain robust under stale telemetry. Evaluated on a 64-GPU four-tier fat-tree simulator using Mooncake traces, NetKV reduces mean TTFT by up to 21.2% compared to round-robin scheduling and by up to 17.6% compared to a tuned scheduler that accounts for both cache and load. It also improves Service Level Objective (SLO) attainment by as much as 20.1 percentage points while keeping Time Between Tokens overhead below 0.5 ms across all tested scenarios. These gains require no modifications to the transport layer, inference engine, or underlying hardware.
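To make the idea of a greedy, oracle-driven selection concrete, the following is a minimal sketch and not the paper's algorithm. The abstract does not specify the cost function or the oracle's signature, so everything here is an assumption for illustration: the `NetworkCostOracle` type, the `DecodeInstance` fields, the additive cost (queueing delay plus transfer time), and the toy bandwidth-based oracle. The tier rankings mentioned in the abstract are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical oracle signature: (prefill_id, decode_id, payload_bytes) -> seconds.
NetworkCostOracle = Callable[[str, str, int], float]


@dataclass(frozen=True)
class DecodeInstance:
    name: str
    queue_delay_s: float       # assumed estimate of compute/queueing delay
    cached_prefix_tokens: int  # assumed number of prompt tokens already cached here


def make_bandwidth_oracle(
    bandwidth_bytes_per_s: Dict[Tuple[str, str], float],
) -> NetworkCostOracle:
    """Toy oracle (assumption): transfer time = bytes / path bandwidth."""
    def oracle(src: str, dst: str, payload_bytes: int) -> float:
        return payload_bytes / bandwidth_bytes_per_s[(src, dst)]
    return oracle


def select_decode_instance(
    prefill: str,
    prompt_tokens: int,
    kv_bytes_per_token: int,
    candidates: Sequence[DecodeInstance],
    network_cost: NetworkCostOracle,
) -> DecodeInstance:
    """Pick the candidate with the lowest estimated cost in one pass, O(|D|)."""
    if not candidates:
        raise ValueError("no decode instances available")
    best = candidates[0]
    best_cost = float("inf")
    for inst in candidates:
        # Only the part of the KV cache not already on the instance is transferred.
        missing_tokens = max(prompt_tokens - inst.cached_prefix_tokens, 0)
        transfer_s = network_cost(prefill, inst.name, missing_tokens * kv_bytes_per_token)
        cost = inst.queue_delay_s + transfer_s
        if cost < best_cost:
            best, best_cost = inst, cost
    return best


# Illustrative numbers only: "near" has a fast path but a cold cache and a short
# queue; "far" holds a 1,000-token cached prefix behind a 10x slower path.
oracle = make_bandwidth_oracle({("p0", "near"): 50e9, ("p0", "far"): 5e9})
fleet = [DecodeInstance("near", 0.010, 0), DecodeInstance("far", 0.0, 1000)]
short = select_decode_instance("p0", 1_000, 100_000, fleet, oracle)
long_ = select_decode_instance("p0", 10_000, 100_000, fleet, oracle)
```

With these made-up numbers, the 1,000-token request goes to `far` because its cached prefix covers the whole prompt, while the 10,000-token request goes to `near` because the uncached remainder is cheaper to move over the faster path. This mirrors, in a toy setting, the abstract's point that the cost of ignoring the network grows with context length.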
