ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
Title: ViBE: Jointly Addressing Workload Imbalance and Hardware Heterogeneity in MoE Inference
Abstract:
In distributed Mixture-of-Experts (MoE) inference, the synchronization of execution phases creates persistent bottlenecks. Because the layer latency is dictated by the slowest GPU, the interaction between input-dependent token routing and the inherent performance variability of GPUs leads to significant straggler effects. This hardware variability is an intrinsic feature of modern accelerators, driven by manufacturing tolerances, power constraints, and thermal dynamics, which result in measurable execution-time disparities among nominally identical GPUs.
The fundamental difficulty is that MoE execution imbalance stems from the interplay between workload skew and hardware asymmetry. Token routing generates uneven, layer-specific expert loads, while GPU throughput depends on both device-specific operating traits and the intensity of the workload. Previous research has attempted to mitigate routing skew, but these approaches typically assume homogeneous hardware and balance token counts rather than actual execution latency. Consequently, even when token assignments are balanced, hardware-induced stragglers remain unresolved.
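A small numeric sketch can make the last point concrete. It is illustrative only: the throughput figures are hypothetical, and it assumes each GPU processes tokens at a constant rate, which simplifies the workload-dependent throughput described above. The function name `layer_latency` is likewise an assumption, not part of ViBE.

```python
def layer_latency(tokens_per_gpu, throughput_per_gpu):
    """Return (layer latency, per-GPU times) for one synchronized MoE layer.

    Assumes a constant per-GPU throughput in tokens per millisecond.
    Because execution is synchronized, the slowest GPU sets the layer latency.
    """
    times = [t / s for t, s in zip(tokens_per_gpu, throughput_per_gpu)]
    return max(times), times


# Hypothetical throughputs (tokens/ms): the third GPU is a slower device.
throughput = [100.0, 100.0, 80.0, 100.0]

# Perfectly token-balanced assignment: every GPU receives 1000 tokens.
balanced_tokens = [1000, 1000, 1000, 1000]

latency, times = layer_latency(balanced_tokens, throughput)
print(times)    # per-GPU execution times in ms
print(latency)  # layer latency, set by the slowest GPU
```

Under these assumed numbers, three GPUs finish in 10 ms and then wait for the slower one, which needs 12.5 ms, even though the token counts are identical.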
To address this, we introduce Variability-Informed Binning of Experts (ViBE), a hardware-aware framework designed to minimize execution-time disparities across GPUs. ViBE integrates per-GPU performance modeling with expert activation profiling to strategically place high-load experts on faster devices and low-load experts on slower ones. This approach reduces layer-level stragglers without altering model semantics or requiring hardware modifications. Given that both workload profiles and effective GPU throughput can fluctuate under different serving conditions, ViBE incorporates a lightweight recalibration mechanism. This allows the system to update its routing and performance estimates in response to workload or performance drift.
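The placement idea can be sketched as a simple greedy heuristic. This is not the algorithm from the paper, which the abstract does not specify; it is a longest-processing-time-style approximation under stated assumptions: each expert has a single profiled load in tokens, each GPU has a constant measured speed, and every GPU holds the same number of experts. The names `place_experts`, `expert_load`, `gpu_speed`, and `slots_per_gpu` are hypothetical.

```python
def place_experts(expert_load, gpu_speed, slots_per_gpu):
    """Greedy, variability-aware expert placement (illustrative sketch).

    expert_load   -- profiled tokens per expert (hypothetical input)
    gpu_speed     -- measured tokens/ms per GPU (hypothetical input)
    slots_per_gpu -- maximum number of experts each GPU may host

    Experts are visited from heaviest to lightest; each goes to the GPU
    with a free slot whose estimated finish time would be lowest.
    """
    num_gpus = len(gpu_speed)
    if len(expert_load) > num_gpus * slots_per_gpu:
        raise ValueError("not enough expert slots for all experts")

    assigned = [[] for _ in range(num_gpus)]
    tokens = [0.0] * num_gpus

    order = sorted(range(len(expert_load)),
                   key=lambda e: expert_load[e], reverse=True)
    for e in order:
        candidates = [g for g in range(num_gpus)
                      if len(assigned[g]) < slots_per_gpu]
        best = min(candidates,
                   key=lambda g: (tokens[g] + expert_load[e]) / gpu_speed[g])
        assigned[best].append(e)
        tokens[best] += expert_load[e]

    times = [tokens[g] / gpu_speed[g] for g in range(num_gpus)]
    return assigned, times


# Hypothetical profile: four experts, one fast GPU and one slower GPU.
assigned, times = place_experts([900, 700, 300, 100], [100.0, 80.0], 2)
print(assigned)    # expert indices hosted by each GPU
print(max(times))  # estimated layer latency in ms
```

With these assumed inputs, the heuristic gives the faster GPU 1200 tokens and the slower one 800, for an estimated layer latency of 12 ms. A token-balanced split of 1000 tokens each would take 12.5 ms on the slower GPU. Recalibration, in this simplified picture, would amount to rerunning the placement when the profiled loads or measured speeds drift.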
Our evaluations demonstrate that ViBE consistently reduces execution-time imbalance and enhances Service Level Objective (SLO) attainment by 14%, while decreasing the P90 Time-To-First-Token (TTFT) by as much as 45%. Furthermore, our findings indicate that the impact of hardware variability intensifies at scale, underscoring the necessity of variability-aware placement for achieving efficient, high-utilization Large Language Model (LLM) serving.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





