DTop-p MoE: Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Title: DTop-p MoE: A Sparsity-Controlled Dynamic Top-p MoE for Foundation Model Pre-training
Abstract:
While Sparse Mixture-of-Experts (MoE) architectures are critical for scaling model capacity efficiently, the conventional Top-$k$ routing mechanism enforces a rigid sparsity structure that fails to account for variations in token difficulty and the distinct computational requirements of different layers. In contrast, Top-$p$ routing offers greater adaptability by selecting experts until their cumulative routing probability surpasses a specific threshold. This approach allows high-confidence tokens to utilize fewer experts, while uncertain tokens can engage additional resources. However, our investigation reveals that standard Top-$p$ implementations, which rely on fixed global probability thresholds, yield only slight improvements over Top-$k$, are highly sensitive to hyperparameters, and lead to unpredictable computational overheads.
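To make the Top-$p$ selection rule concrete, here is a minimal Python sketch for a single token. It is only an illustration of the general rule: the function name `top_p_experts`, the softmax over raw router logits, and the use of `>=` rather than a strict comparison are assumptions, not details taken from the paper.

```python
import math


def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_experts(logits, p):
    """Return the smallest set of expert indices whose cumulative
    routing probability reaches the threshold p (hypothetical helper)."""
    probs = softmax(logits)
    # Visit experts from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # assumption: stop once the threshold is reached
            break
    return chosen


# A high-confidence token concentrates its probability on one expert,
# while an uncertain token spreads it across several.
confident = top_p_experts([8.0, 0.0, 0.0, 0.0], p=0.7)
uncertain = top_p_experts([1.0, 0.9, 0.8, 0.7], p=0.7)
print(len(confident), len(uncertain))
```

With the same threshold, the peaked distribution activates one expert and the flat one activates three, which is why the per-token compute under a fixed global threshold is hard to predict in advance.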
To address these limitations, this paper introduces DTop-$p$, a novel dynamic routing mechanism featuring controllable sparsity. DTop-$p$ employs a Proportional-Integral controller to learn the optimal Top-$p$ probability threshold and utilizes dynamic routing normalization to facilitate layer-wise expert selection while adhering to a global sparsity constraint. Comprehensive experiments on Large Language Models and Diffusion Transformers show that DTop-$p$ consistently surpasses both Top-$k$ and fixed Top-$p$ baselines, all while maintaining an average FLOP count comparable to that of Top-$k$ MoE. Furthermore, our analysis highlights that DTop-$p$ demonstrates robust scaling characteristics across various dimensions, including expert granularity, total expert capacity, model scale, and dataset size, establishing it as a resilient and efficient framework for pre-training foundation models.
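The abstract does not spell out the controller, so the following Python sketch shows only one plausible way a Proportional-Integral loop could steer a Top-$p$ threshold toward a sparsity budget. Everything here is an assumption made for illustration: the class `PIThresholdController`, the gains `kp` and `ki`, the clamping bounds, the error definition, and especially `toy_avg_experts`, a hypothetical stand-in for measuring the average number of active experts per token in a real model.

```python
class PIThresholdController:
    """Positional PI controller that nudges a Top-p threshold toward a
    target average number of active experts per token.

    Gains, bounds, and the update form are illustrative assumptions.
    """

    def __init__(self, target_k, p_init=0.5, kp=0.02, ki=0.02,
                 p_min=0.05, p_max=0.99):
        self.target_k = target_k
        self.p_init = p_init
        self.kp, self.ki = kp, ki
        self.p_min, self.p_max = p_min, p_max
        self.integral = 0.0
        self.p = p_init

    def update(self, observed_k):
        # Positive error: fewer experts active than budgeted, so raise p.
        error = self.target_k - observed_k
        self.integral += error
        raw = self.p_init + self.kp * error + self.ki * self.integral
        # Keep the threshold inside a valid probability range.
        self.p = min(self.p_max, max(self.p_min, raw))
        return self.p


def toy_avg_experts(p, num_experts=8):
    # Hypothetical stand-in for a real router: the average number of
    # active experts grows monotonically with the threshold p.
    return 1.0 + (num_experts - 1) * p ** 2


controller = PIThresholdController(target_k=2.0)
p = controller.p
for _ in range(300):
    p = controller.update(toy_avg_experts(p))
final_k = toy_avg_experts(p)
print(round(p, 3), round(final_k, 3))
```

In this toy loop the integral term removes the steady-state offset, so the threshold settles where the measured sparsity matches the target, giving an average compute budget comparable to a Top-$k$ model with the same $k$.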
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



