DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing
Original: arXiv:2511.04791v2 Announce Type: replace
Abstract:
Current Large Language Model (LLM) serving systems must sustain high throughput while meeting strict latency Service Level Objectives (SLOs) across two distinct inference phases: the compute-bound prefill phase and the memory-bound decode phase. Existing approaches generally fall into two categories: (1) aggregating both phases on shared GPUs, which causes interference that worsens Time-Between-Tokens (TBT); or (2) disaggregating the phases onto separate GPUs, which reduces latency but wastes resources through model duplication and KV cache transfer.
We introduce DuetServe, a unified LLM serving framework that delivers the isolation benefits typically associated with disaggregation, but within a single GPU. DuetServe operates in aggregated mode by default and switches to dynamic spatial multiplexing at the level of streaming multiprocessors (SMs) when it predicts TBT degradation. The core idea is to decouple prefill and decode only when necessary, using adaptive, fine-grained SM partitioning to isolate the phases only when contention threatens the latency SLOs.
DuetServe has three primary components: (1) an attention-aware roofline model that predicts iteration latency; (2) a partitioning optimizer that selects the SM split maximizing throughput without violating TBT limits; and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that, relative to state-of-the-art frameworks, DuetServe improves overall throughput by up to 1.3x while maintaining low generation latency.
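To make components (1) and (2) concrete, the following is a minimal illustrative sketch, not DuetServe's actual implementation. It assumes a plain roofline model (latency is the larger of compute time and memory time) and further assumes that both peak compute and achievable memory bandwidth scale linearly with the number of SMs assigned to a phase. It does not model attention separately, so it is not the attention-aware model the abstract describes. The names `Gpu`, `Work`, `roofline_latency`, and `choose_split` are hypothetical. The search picks the smallest decode partition that meets the TBT budget, on the assumption that giving the remaining SMs to prefill maximizes throughput.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Gpu:
    """Hypothetical GPU description; peak values are for the whole device."""
    num_sms: int
    peak_flops: float       # FLOP/s with all SMs
    peak_bandwidth: float   # bytes/s with all SMs


@dataclass(frozen=True)
class Work:
    """Hypothetical per-iteration cost of one phase (prefill or decode)."""
    flops: float
    bytes_moved: float


def roofline_latency(work: Work, gpu: Gpu, sms: int) -> float:
    """Predict iteration latency in seconds on a partition of `sms` SMs.

    Assumption: compute and bandwidth both scale linearly with the SM share.
    """
    frac = sms / gpu.num_sms
    compute_time = work.flops / (gpu.peak_flops * frac)
    memory_time = work.bytes_moved / (gpu.peak_bandwidth * frac)
    return max(compute_time, memory_time)


def choose_split(decode: Work, gpu: Gpu, tbt_slo_s: float) -> Optional[Tuple[int, int]]:
    """Return (prefill_sms, decode_sms), or None if no split meets the SLO.

    Scans decode partitions from smallest to largest and stops at the first
    one whose predicted decode latency fits the TBT budget, leaving as many
    SMs as possible for prefill.
    """
    for decode_sms in range(1, gpu.num_sms):
        if roofline_latency(decode, gpu, decode_sms) <= tbt_slo_s:
            return gpu.num_sms - decode_sms, decode_sms
    return None
```

A caller would build a `Work` estimate for the next decode batch, call `choose_split` with the TBT budget, and stay in aggregated mode when the result is `None` or when the predicted aggregated latency already fits the SLO.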
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





