arXiv

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

June 2, 2026 · Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

Abstract: While Mixture-of-Experts (MoE) architectures scale by activating only a few experts per token, their training is hindered by the discrete, non-differentiable nature of top-$k$ routing. Expert selection therefore requires gradient estimators, and designing effective ones remains an open challenge. To address this, we present ProbMoE, a framework that models expert selection as a distribution over expert subsets of fixed cardinality, casting routing as probabilistic inference over this discrete space.

Our approach introduces ProbMoE Exact-$k$ routing, which samples subsets of exactly $k$ experts in the forward pass. In the backward pass, it uses the exact marginal probability of each expert as a computationally efficient surrogate for approximating the true gradient. ProbMoE also extends to a dynamic-$k$ setting, in which both training and inference restrict the routing cardinality to a predefined range, allowing the model to allocate experts adaptively per token.

Experiments across several model backbones and benchmarks show that ProbMoE Exact-$k$ performs robustly against strong baselines while improving routing diversity and expert utilization. ProbMoE Dynamic-$k$ maintains comparable performance while activating fewer experts.
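
To make the Exact-$k$ idea concrete, the following is a minimal sketch of sampling a size-$k$ expert subset and computing exact per-expert marginals. The abstract does not specify the form of the subset distribution, so this sketch assumes $p(S) \propto \prod_{i \in S} \exp(\ell_i)$ over subsets with $|S| = k$, where $\ell_i$ are router logits. The function names (`esp`, `exact_k_marginals`, `sample_exact_k`, `brute_force_marginals`) are hypothetical and not taken from the paper. It uses only the Python standard library and assumes $1 \le k \le$ the number of experts.

```python
import itertools
import math
import random


def esp(weights, k):
    """Elementary symmetric polynomials e_0..e_k of `weights` (dynamic programming)."""
    e = [1.0] + [0.0] * k
    for w in weights:
        # Iterate j downward so each weight enters a given subset at most once.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def _weights(logits):
    # Shifting by the max rescales every size-k subset by the same factor,
    # so the assumed distribution is unchanged while exp() stays stable.
    m = max(logits)
    return [math.exp(x - m) for x in logits]


def exact_k_marginals(logits, k):
    """P(i in S) under the assumed p(S) ∝ prod_{i in S} exp(logit_i), |S| = k."""
    w = _weights(logits)
    z = esp(w, k)[k]  # normaliser: e_k over all experts
    # Marginal of expert i: w_i * e_{k-1}(all other weights) / e_k(all weights).
    return [
        w[i] * esp(w[:i] + w[i + 1:], k - 1)[k - 1] / z
        for i in range(len(w))
    ]


def sample_exact_k(logits, k, rng=random):
    """Draw one size-k subset with sequential include/exclude decisions."""
    w = _weights(logits)
    chosen, remaining = [], k
    for i, wi in enumerate(w):
        if remaining == 0:
            break
        e_rest = esp(w[i + 1:], remaining)
        include = wi * e_rest[remaining - 1]  # mass of subsets containing i
        exclude = e_rest[remaining]           # mass of subsets skipping i
        if rng.random() < include / (include + exclude):
            chosen.append(i)
            remaining -= 1
    return chosen


def brute_force_marginals(logits, k):
    """Reference marginals by enumerating every size-k subset (small n only)."""
    w = _weights(logits)
    totals, z = [0.0] * len(w), 0.0
    for subset in itertools.combinations(range(len(w)), k):
        p = math.prod(w[i] for i in subset)
        z += p
        for i in subset:
            totals[i] += p
    return [t / z for t in totals]
```

Under this assumed distribution the marginals sum to $k$, and `brute_force_marginals` offers a check on small inputs. In an autodiff framework, one plausible way to use the marginals as a backward-pass surrogate is a straight-through construction (a hard sampled mask in the forward pass, with gradients flowing through the marginals); whether ProbMoE does exactly this is not stated in the abstract.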

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC