ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts
Abstract: Mixture-of-Experts (MoE) architectures scale by activating only a few experts per token, but their training is hindered by the discrete, non-differentiable nature of top-$k$ routing. Training therefore relies on gradient estimators for expert selection, and designing such estimators remains a significant open challenge. To address this, we present ProbMoE, a framework that models expert selection as a probability distribution over expert subsets of fixed cardinality, thereby casting routing as probabilistic inference over this discrete space.
Our approach introduces ProbMoE Exact-$k$ routing, which samples subsets of exactly $k$ experts during the forward pass. For the backward pass, we use the exact marginal probability of each expert as a computationally efficient surrogate that approximates the true gradient. ProbMoE also extends to a dynamic-$k$ setting, in which both training and inference restrict the routing cardinality to a predefined range, enabling the model to allocate experts adaptively on a per-token basis.
Empirical evaluations across various model backbones and benchmarks demonstrate that ProbMoE Exact-$k$ delivers robust performance relative to strong baselines while also improving routing diversity and expert utilization. ProbMoE Dynamic-$k$ maintains comparable performance while activating fewer experts.
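
To make the Exact-$k$ idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not state which distribution over size-$k$ subsets ProbMoE uses; the sketch assumes a product-form distribution $p(S) \propto \prod_{i \in S} \exp(\ell_i)$ with $|S| = k$, where $\ell_i$ are router logits. Under that assumption, exact marginals follow from elementary symmetric polynomials, and a subset can be sampled exactly by deciding experts one at a time. The function names `esp`, `exact_marginals`, and `sample_subset` are hypothetical.

```python
import math
import random


def esp(weights, k):
    """Elementary symmetric polynomials e_0..e_k of `weights`, by dynamic programming."""
    e = [1.0] + [0.0] * k
    for w in weights:
        # Go downwards so each weight is used at most once per product.
        for j in range(k, 0, -1):
            e[j] += w * e[j - 1]
    return e


def _weights(logits):
    # Subtracting the max logit rescales every size-k subset by the same
    # factor, so the (assumed) fixed-cardinality distribution is unchanged.
    m = max(logits)
    return [math.exp(x - m) for x in logits]


def exact_marginals(logits, k):
    """P(i in S) under the assumed p(S) proportional to prod_{i in S} exp(logit_i), |S| = k."""
    w = _weights(logits)
    z = esp(w, k)[k]  # normaliser: sum over all size-k subsets
    return [
        w[i] * esp(w[:i] + w[i + 1:], k - 1)[k - 1] / z
        for i in range(len(w))
    ]


def sample_subset(logits, k, rng=random):
    """Draw one size-k subset exactly from the assumed distribution."""
    w = _weights(logits)
    chosen, need = [], k
    for i in range(len(w)):
        if need == 0:
            break
        e_rest = esp(w[i + 1:], need)
        include = w[i] * e_rest[need - 1]  # mass of completions containing i
        exclude = e_rest[need]             # mass of completions without i
        if rng.random() < include / (include + exclude):
            chosen.append(i)
            need -= 1
    return chosen


logits = [0.5, -1.0, 2.0, 0.0, 1.2]
marginals = exact_marginals(logits, k=2)
subset = sample_subset(logits, k=2, rng=random.Random(0))
```

In an automatic-differentiation framework, one plausible reading of the abstract is that the sampled subset drives the forward pass while gradients flow through the marginals; that wiring is an assumption here and is not shown. The marginals sum to $k$, which gives a quick sanity check on any implementation.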
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




