arXiv

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

June 4, 2026 · Xinming Wei, Chao Jin, Tuo Dai, Yinmin Zhong, Shan Yu, Chengxu Yang, Bingyang Wu, Zili Zhang, Jing Mai, Qianchao Zhu, Zhouyang Li, Yuliang Liu, Guojie Luo

UltraEP: Achieving Near-Optimal Load Balancing for Rack-Scale MoE Training and Inference

arXiv:2606.04101v1 | Announcement Type: Cross

Abstract

As large-scale expert parallelism (EP) becomes essential for training and deploying cutting-edge Mixture-of-Experts (MoE) models, it exacerbates device-level expert load imbalance, which leads to compute stragglers, token all-to-all bottlenecks, and spikes in activation memory usage. Current balancing strategies typically redistribute experts at periodic intervals based on historical load data; this approach is unreliable in production environments, where load patterns are non-stationary.
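To make device-level imbalance concrete, the following minimal sketch counts tokens per device for one microbatch and reports the ratio of the maximum device load to the mean. The contiguous expert-to-device placement, the max/mean definition of imbalance, and the names `device_loads` and `imbalance` are assumptions made for illustration; the abstract does not define the metric or the placement.

```python
from collections import Counter

def device_loads(expert_ids, num_experts, num_devices):
    """Tokens per device under a static, contiguous expert placement (assumed)."""
    experts_per_device = num_experts // num_devices
    counts = Counter(expert_ids)
    loads = [0] * num_devices
    for expert, n in counts.items():
        loads[expert // experts_per_device] += n
    return loads

def imbalance(loads):
    """Assumed metric: max device load over mean device load (1.0 = balanced)."""
    return max(loads) / (sum(loads) / len(loads))

# Toy gating output: 8 experts on 4 devices, with expert 0 "hot".
routing = [0] * 10 + [1, 2, 3, 4, 5, 6]
loads = device_loads(routing, num_experts=8, num_devices=4)
print(loads, imbalance(loads))  # [11, 2, 2, 1] 2.75
```

In this toy case the device hosting the hot expert processes 2.75 times the mean load, so every other device waits on it; this is the straggler effect described above.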

To address these limitations, we introduce UltraEP, the first real-time balancer designed for exact-load management in large-EP MoE training and prefilling on rack-scale nodes (RSNs). Leveraging the enhanced scale-up connectivity inherent to RSNs, UltraEP performs rebalancing for every microbatch and layer along critical execution paths. This process demands a sophisticated co-design of plan solving and expert replication communication to minimize overhead.

UltraEP responds immediately to post-gating load variations through efficient, quota-driven planning. It subsequently executes irregular expert-state transfers using RSN-native persistent tile streaming and employs relay-based fan-out mitigation techniques. Evaluated across MoE models ranging from 106B to 671B parameters during both training and prefilling phases, UltraEP attains 94.3% of the ideal throughput achieved by force-balanced systems. This represents a 1.49$\times$ performance gain over systems without balancing, while significantly lowering the final inter-rank imbalance from a range of 1.30–4.01 down to 1.01–1.04. Furthermore, we demonstrate UltraEP’s scalability and robustness through production MoE training experiments utilizing 2,560 GPUs.
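The idea of planning against exact post-gating loads can be illustrated with a toy sketch. The abstract does not describe UltraEP's planner, so the greedy procedure below is a stand-in and not the authors' algorithm: the hypothetical `plan_quotas` gives every device a quota equal to the ceiling of the mean load and moves excess tokens from over-quota devices to under-quota ones. In a real system each move would require the receiving device to hold a replica of the relevant expert; the sketch ignores that cost. The max/mean imbalance metric is likewise an assumption.

```python
import math

def imbalance(loads):
    """Assumed metric: max device load over mean device load."""
    return max(loads) / (sum(loads) / len(loads))

def plan_quotas(loads):
    """Hypothetical greedy plan, not UltraEP's planner.

    Returns (moves, new_loads), where moves is a list of (src, dst, tokens).
    """
    quota = math.ceil(sum(loads) / len(loads))
    new_loads = list(loads)
    donors = [d for d, load in enumerate(loads) if load > quota]
    receivers = [d for d, load in enumerate(loads) if load < quota]
    moves = []
    for src in donors:
        for dst in receivers:
            excess = new_loads[src] - quota
            if excess == 0:
                break
            room = quota - new_loads[dst]
            n = min(excess, room)
            if n > 0:
                moves.append((src, dst, n))
                new_loads[src] -= n
                new_loads[dst] += n
    return moves, new_loads

# Per-device token counts observed after gating for one microbatch and layer.
moves, balanced = plan_quotas([11, 2, 2, 1])
print(moves)                          # [(0, 1, 2), (0, 2, 2), (0, 3, 3)]
print(balanced, imbalance(balanced))  # [4, 4, 4, 4] 1.0
```

Because the plan is computed from the loads of the current microbatch rather than from history, it does not depend on the load pattern staying stable, which is the property the abstract attributes to real-time balancing.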
