arXiv

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving

June 2, 2026 · Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

Abstract:

Mixture-of-Experts (MoE) has recently become the dominant architecture for scaling large language models efficiently, keeping per-token computational cost nearly constant as the parameter count grows. Expert parallelism distributes the parameters by partitioning experts across devices, but it frequently causes token-level load imbalance during inference. To mitigate this imbalance in large-scale deployments, serving frameworks commonly employ expert replication, a technique that duplicates heavily loaded experts. However, our analysis reveals that current replication strategies often over-replicate: many of the replicas they create yield only negligible performance gains. These redundant replicas occupy significant GPU memory, potentially causing resource contention and reducing overall throughput.
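To make the notion of token-level load imbalance concrete, the following is a minimal sketch that measures it for a single MoE layer as the ratio of the busiest device's token count to the mean. The function name `device_imbalance`, the routing counts, and the one-expert-per-device placement are all hypothetical and chosen for illustration; the metric itself is an assumed, common way to quantify imbalance and is not taken from the paper.

```python
def device_imbalance(tokens_per_expert, expert_to_device, num_devices):
    """Return max device load / mean device load (1.0 means perfectly balanced).

    Illustrative metric only; the paper may measure imbalance differently.
    """
    loads = [0] * num_devices
    for expert, tokens in tokens_per_expert.items():
        loads[expert_to_device[expert]] += tokens
    mean = sum(loads) / num_devices
    return max(loads) / mean if mean > 0 else 1.0


# Hypothetical routing counts for one layer: expert 0 is "hot".
tokens = {0: 700, 1: 100, 2: 100, 3: 100}
placement = {0: 0, 1: 1, 2: 2, 3: 3}  # one expert per device (assumed)
ratio = device_imbalance(tokens, placement, num_devices=4)
print(f"imbalance ratio: {ratio:.2f}")
```

Under this toy model, the device hosting the hot expert receives 2.8 times the average load, so the other devices sit idle while it finishes.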

To address this, we introduce CRAFT, a novel framework designed to optimize load balance within a strict memory budget. CRAFT achieves this through fine-grained, per-layer replication decisions driven by estimated replication benefits. The framework is designed for seamless integration into existing serving environments, requiring no additional model training or architectural modifications. Our evaluations demonstrate that CRAFT enhances end-to-end serving throughput by an average of $1.14\times$ (reaching up to $1.2\times$) compared to conventional replication methods. These improvements were observed in large-scale deployments utilizing models with parameter counts ranging from hundreds of billions to one trillion.
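The idea of benefit-driven, per-layer replication under a budget can be illustrated with a simple greedy sketch. This is not CRAFT's algorithm, which the abstract does not specify. It assumes that the benefit of a replica is the reduction in a layer's largest per-replica token load, that an expert's tokens split evenly across its replicas, and that the memory budget is expressed as a count of extra replicas. The names `bottleneck` and `allocate_replicas` are hypothetical.

```python
def bottleneck(tokens, replicas):
    """Largest per-replica token load in a layer (assumes an even split)."""
    return max(t / r for t, r in zip(tokens, replicas))


def allocate_replicas(layer_tokens, budget):
    """Greedily place up to `budget` extra replicas across layers.

    Each step replicates the hottest expert of whichever layer would see
    the largest drop in its bottleneck load; stops early if no layer gains.
    """
    replicas = [[1] * len(t) for t in layer_tokens]
    for _ in range(budget):
        best_gain, best_layer, best_expert = 0.0, None, None
        for li, tokens in enumerate(layer_tokens):
            reps = replicas[li]
            hot = max(range(len(tokens)), key=lambda e: tokens[e] / reps[e])
            before = bottleneck(tokens, reps)
            reps[hot] += 1  # tentatively add a replica
            gain = before - bottleneck(tokens, reps)
            reps[hot] -= 1  # undo the tentative change
            if gain > best_gain:
                best_gain, best_layer, best_expert = gain, li, hot
        if best_layer is None:
            break  # no single replica improves any layer
        replicas[best_layer][best_expert] += 1
    return replicas


# Hypothetical per-layer token counts: layer 0 is skewed, layer 1 is nearly flat.
layers = [[700, 100, 100, 100], [260, 250, 250, 240]]
plan = allocate_replicas(layers, budget=3)
print(plan)
```

With these toy inputs, all three replicas go to the hot expert in the skewed layer, while the nearly balanced layer receives none, since a replica there would lower its bottleneck only from 260 to 250 tokens. A uniform per-layer policy would spend memory on that second layer for almost no gain.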
