Global News Digest

arXiv

Lodestar: An Online-Learning LLM Inference Router

Title: Lodestar: A Machine Learning-Driven Router for Online LLM Inference

Abstract

Deploying large language model (LLM) inference well means minimizing user-perceived delay, specifically time-to-first-token (TTFT), while maximizing GPU resource efficiency. Routing LLM requests, that is, assigning each inference job to a specific GPU instance, is difficult for three reasons. Execution cost depends heavily on the input; batching and Key-Value (KV) cache reuse couple requests tightly to one another; and latency responds nonlinearly to context length, model configuration, and the mix of hardware accelerators. As a result, conventional load-balancing techniques, including heuristics designed specifically for LLM inference, struggle to deliver optimal performance.

To address these challenges, we introduce Lodestar, a novel, learning-based request routing framework designed for distributed GPU clusters. Lodestar operates by continuously capturing real-time snapshots of the cluster environment at the individual request level. These snapshots encompass instance states, request attributes, and historical performance metrics. Using this data, the system trains an online reward predictor to direct inference requests to the GPU instance most likely to maximize a predefined reward objective, such as minimizing TTFT.
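The loop described above (snapshot the cluster, predict a reward per instance, route to the best one, learn from the outcome) can be pictured with a small sketch. This is not Lodestar's implementation: the linear model, the optional epsilon-greedy exploration, the feature names (`queued_tokens`, `cache_hit_ratio`), and the synthetic TTFT formula are all assumptions chosen only to keep the example short and runnable.

```python
from dataclasses import dataclass, field
import random


@dataclass
class InstanceState:
    """Hypothetical per-instance snapshot; field names are illustrative."""
    name: str
    queued_tokens: float    # tokens waiting in the instance's queue
    cache_hit_ratio: float  # fraction of the prompt already in the KV cache


@dataclass
class OnlineRewardPredictor:
    """Linear reward model updated one sample at a time (SGD on squared error)."""
    n_features: int
    lr: float = 0.05
    weights: list = field(default_factory=list)

    def __post_init__(self):
        self.weights = [0.0] * self.n_features

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x, reward):
        err = self.predict(x) - reward
        self.weights = [w - self.lr * err * xi for w, xi in zip(self.weights, x)]


def featurize(inst, prompt_tokens):
    # Bias term, queue load, and uncached prompt work (both in thousands of tokens).
    uncached = prompt_tokens * (1.0 - inst.cache_hit_ratio)
    return [1.0, inst.queued_tokens / 1000.0, uncached / 1000.0]


def route(predictor, instances, prompt_tokens, epsilon=0.0, rng=random):
    """Pick the instance with the highest predicted reward; explore with prob. epsilon."""
    if rng.random() < epsilon:
        return rng.choice(instances)
    return max(instances, key=lambda i: predictor.predict(featurize(i, prompt_tokens)))


def synthetic_ttft(inst, prompt_tokens):
    # Made-up latency model in seconds, used only to generate training signal here.
    x = featurize(inst, prompt_tokens)
    return 0.05 + 0.2 * x[1] + 0.3 * x[2]


def train_on_synthetic(steps=3000, rng=random):
    """Feed the predictor random snapshots with reward = -TTFT."""
    predictor = OnlineRewardPredictor(n_features=3)
    for _ in range(steps):
        inst = InstanceState("sim", rng.uniform(0, 2000), rng.uniform(0, 1))
        prompt = rng.uniform(0, 2000)
        predictor.update(featurize(inst, prompt), -synthetic_ttft(inst, prompt))
    return predictor


if __name__ == "__main__":
    rng = random.Random(0)
    predictor = train_on_synthetic(rng=rng)
    busy = InstanceState("busy", queued_tokens=1800.0, cache_hit_ratio=0.0)
    warm = InstanceState("warm", queued_tokens=200.0, cache_hit_ratio=0.9)
    print(route(predictor, [busy, warm], prompt_tokens=1000, rng=rng).name)
```

In a live system the reward would come from the TTFT measured once a request completes, and `update` would be called per finished request, so the predictor keeps tracking the current workload. A richer model and feature set would be needed to capture the nonlinear effects the abstract mentions.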

Lodestar is built on cloud-native principles and integrates with established serving stacks such as vLLM. Because it adapts continuously to changes in workload and infrastructure conditions, it routes more efficiently than fixed heuristics. In experiments on a public cloud GPU cluster, Lodestar reduced average TTFT by 1.41x and P99 TTFT by 1.47x compared to state-of-the-art prefix caching and load-aware heuristics. Gains were larger in specialized environments: up to 2.15x and 1.86x on homogeneous clusters, and 4.38x and 4.42x on heterogeneous clusters. Notably, Lodestar learns these routing strategies in approximately five minutes.


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
