arXiv

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

Title: ViBE: Jointly Addressing Workload Imbalance and Hardware Heterogeneity in MoE Inference

Abstract:

In distributed Mixture-of-Experts (MoE) inference, execution phases are synchronized across GPUs, which creates persistent bottlenecks. Because each layer's latency is dictated by the slowest GPU, the interaction between input-dependent token routing and the inherent performance variability of GPUs produces significant straggler effects. This hardware variability is intrinsic to modern accelerators: manufacturing tolerances, power constraints, and thermal dynamics cause measurable execution-time disparities among nominally identical GPUs.

The core difficulty is that MoE execution imbalance stems from the interplay between workload skew and hardware asymmetry. Token routing generates uneven, layer-specific expert loads, while GPU throughput depends on both device-specific operating traits and workload intensity. Previous work on mitigating routing skew typically assumes homogeneous hardware and balances token counts rather than actual execution latency. Consequently, even when token assignments are balanced, hardware-induced stragglers remain.
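
To make the straggler effect concrete, the sketch below models one synchronized layer under a deliberately simple assumption: each GPU processes tokens at a fixed rate, so its time is its token count divided by its throughput. The throughput figures and the helper names `layer_latency` and `speed_proportional_split` are hypothetical and not taken from the paper. The linear model also ignores the dependence of throughput on workload intensity that the abstract mentions.

```python
def layer_latency(tokens_per_gpu, tokens_per_ms):
    """Latency of one synchronized layer: the slowest GPU sets the pace."""
    return max(t / s for t, s in zip(tokens_per_gpu, tokens_per_ms))


def speed_proportional_split(total_tokens, tokens_per_ms):
    """Split tokens in proportion to each GPU's throughput (idealized)."""
    total_speed = sum(tokens_per_ms)
    return [total_tokens * s / total_speed for s in tokens_per_ms]


# Hypothetical throughputs (tokens/ms) for four nominally identical GPUs;
# the last one runs 20% slower than the others.
speeds = [100.0, 100.0, 100.0, 80.0]

token_balanced = [1000, 1000, 1000, 1000]              # equal token counts
latency_balanced = speed_proportional_split(4000, speeds)

print(layer_latency(token_balanced, speeds))    # 12.5
print(layer_latency(latency_balanced, speeds))  # about 10.53
```

With equal token counts, the slow GPU finishes at 12.5 ms while the others idle from 10 ms onward. Shifting tokens away from the slow GPU brings every device to roughly the same finish time in this toy model.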

To address this, we introduce Variability-Informed Binning of Experts (ViBE), a hardware-aware framework designed to minimize execution-time disparities across GPUs. ViBE integrates per-GPU performance modeling with expert activation profiling to strategically place high-load experts on faster devices and low-load experts on slower ones. This approach reduces layer-level stragglers without altering model semantics or requiring hardware modifications. Given that both workload profiles and effective GPU throughput can fluctuate under different serving conditions, ViBE incorporates a lightweight recalibration mechanism. This allows the system to update its routing and performance estimates in response to workload or performance drift.
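
The placement idea can be illustrated with a small greedy heuristic. This is a sketch under stated assumptions, not ViBE's actual algorithm, which the abstract does not specify: expert loads are profiled token counts, each GPU has a single fixed throughput, and each GPU holds a fixed number of expert slots. The function name `place_experts` and all numbers are hypothetical. The heuristic visits experts from heaviest to lightest and assigns each to the GPU with a free slot that would finish earliest. It is not guaranteed to find the optimal placement.

```python
def place_experts(expert_loads, gpu_speeds, slots_per_gpu):
    """Greedy speed-aware placement (illustrative heuristic only).

    expert_loads: profiled tokens per expert (hypothetical units)
    gpu_speeds:   tokens per ms for each GPU (hypothetical units)
    Returns (expert indices per GPU, estimated time per GPU in ms).
    """
    num_gpus = len(gpu_speeds)
    if len(expert_loads) > num_gpus * slots_per_gpu:
        raise ValueError("not enough expert slots for all experts")

    assigned = [[] for _ in range(num_gpus)]
    load = [0.0] * num_gpus

    # Heaviest experts first, so they get first pick of the fast GPUs.
    order = sorted(range(len(expert_loads)), key=lambda i: -expert_loads[i])
    for e in order:
        free = [g for g in range(num_gpus) if len(assigned[g]) < slots_per_gpu]
        # Choose the GPU whose finish time would be lowest after adding e.
        best = min(free, key=lambda g: (load[g] + expert_loads[e]) / gpu_speeds[g])
        assigned[best].append(e)
        load[best] += expert_loads[e]

    times = [l / s for l, s in zip(load, gpu_speeds)]
    return assigned, times


expert_loads = [8, 4, 4, 2]   # tokens routed to each expert
gpu_speeds = [2.0, 1.0]       # GPU 0 is twice as fast as GPU 1
placement, times = place_experts(expert_loads, gpu_speeds, slots_per_gpu=2)
print(placement)  # [[0, 2], [1, 3]]
print(times)      # [6.0, 6.0]
```

In this toy case, a placement that balances token counts as closely as possible (experts 0 and 3 on GPU 0, experts 1 and 2 on GPU 1, for 10 versus 8 tokens) would finish in 5 ms and 8 ms, so the layer would take 8 ms instead of 6 ms.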

Our evaluation shows that ViBE consistently reduces execution-time imbalance, improves Service Level Objective (SLO) attainment by 14%, and lowers P90 Time-To-First-Token (TTFT) by as much as 45%. Our findings also indicate that the impact of hardware variability intensifies at scale, underscoring the need for variability-aware placement in efficient, high-utilization Large Language Model (LLM) serving.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
