arXiv

Lodestar: An Online-Learning LLM Inference Router

June 2, 2026 · Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan, Le Xu, Liguang Xie

Title: Lodestar: A Machine Learning-Driven Router for Online LLM Inference

Abstract

Efficient large language model (LLM) inference serving must minimize user-perceived latency, in particular time-to-first-token (TTFT), while maximizing GPU resource efficiency. Routing LLM requests, that is, assigning each inference request to a specific GPU instance, is difficult for three reasons. Execution cost depends heavily on the input; batching and key-value (KV) cache reuse couple requests tightly to one another; and latency responds nonlinearly to context length, model configuration, and the mix of hardware accelerators. As a result, conventional load-balancing techniques, including heuristics designed specifically for LLM inference, struggle to deliver optimal performance.

To address these challenges, we introduce Lodestar, a novel, learning-based request routing framework designed for distributed GPU clusters. Lodestar operates by continuously capturing real-time snapshots of the cluster environment at the individual request level. These snapshots encompass instance states, request attributes, and historical performance metrics. Using this data, the system trains an online reward predictor to direct inference requests to the GPU instance most likely to maximize a predefined reward objective, such as minimizing TTFT.
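The routing loop described above can be illustrated with a minimal sketch. The abstract does not state the model family, the feature set, or the exploration strategy, so everything below is an assumption made for illustration: a linear predictor trained by stochastic gradient descent, epsilon-greedy exploration, and the hypothetical names `Instance`, `OnlineRewardPredictor`, `features`, and `route`. The reward is taken to be negative TTFT, so maximizing reward corresponds to minimizing TTFT.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical per-instance snapshot; the fields are illustrative only."""
    name: str
    queued_tokens: float    # tokens waiting to be processed on this instance
    cache_hit_ratio: float  # fraction of the prompt already in the KV cache


class OnlineRewardPredictor:
    """Assumed model: linear regression updated by one SGD step per request."""

    def __init__(self, n_features, lr=0.05):
        self.weights = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, reward):
        # Squared-error gradient step toward the observed reward.
        err = self.predict(x) - reward
        self.weights = [w - self.lr * err * xi
                        for w, xi in zip(self.weights, x)]


def features(inst, prompt_tokens):
    # Bias term, queue depth, and uncached prompt tokens (both in thousands).
    uncached = prompt_tokens * (1.0 - inst.cache_hit_ratio)
    return [1.0, inst.queued_tokens / 1000.0, uncached / 1000.0]


def route(predictor, instances, prompt_tokens, epsilon=0.1, rng=random):
    """Pick the instance with the highest predicted reward (epsilon-greedy)."""
    if rng.random() < epsilon:
        return rng.choice(instances)  # explore so every instance gets samples
    return max(instances,
               key=lambda i: predictor.predict(features(i, prompt_tokens)))
```

In use, a caller would invoke `route(...)` for each incoming request, record the TTFT once the first token arrives, and then call `predictor.update(features(chosen, prompt_tokens), -ttft)`. Updating on every completed request is what makes the predictor online: its weights follow the workload as it drifts rather than being fit once offline.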

Lodestar is built on cloud-native principles and integrates with established serving stacks such as vLLM. It adapts continuously to fluctuations in workload and infrastructure conditions. In experiments on a public cloud GPU cluster, Lodestar reduced average TTFT by 1.41x and P99 TTFT by 1.47x compared to state-of-the-art prefix-caching and load-aware heuristics. Gains were larger in specific settings: up to 2.15x and 1.86x on homogeneous clusters, and 4.38x and 4.42x on heterogeneous clusters. Notably, Lodestar learns these routing strategies in approximately five minutes.
