DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
Abstract
Mixture-of-Experts (MoE) architectures have become a leading strategy for decoupling parameter count from computational cost in large language models, yet scaling them effectively remains difficult. Prior work shows that fine-grained experts enlarge the space of possible expert combinations and thereby improve flexibility, but they also incur substantial routing overhead, which creates a new scalability bottleneck. This work explores an alternative scaling dimension: how expert outputs are aggregated. We show theoretically that replacing the conventional weighted summation with structural aggregation increases the diversity of expert combinations without modifying the underlying experts or the routing mechanism, and that it enables multi-step reasoning within a single MoE layer. To realize this, we introduce DAG-MoE, a sparse MoE architecture with a lightweight module that automatically identifies the most efficient aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, outperforming established MoE baselines.
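To make the contrast between weighted summation and structural aggregation concrete, the following is a minimal sketch, not the paper's actual formulation. It assumes that each selected expert may read the outputs of earlier selected experts through weighted edges of a directed acyclic graph. The names `make_expert`, `weighted_sum_moe`, `dag_moe`, and the edge-weight matrix `adj` are hypothetical and chosen only for illustration. The toy experts are random single-layer networks, and the edge weights are fixed by hand, whereas the abstract describes a module that identifies the structure automatically.

```python
import math
import random

def make_expert(rng, dim):
    """Toy expert (hypothetical): one dense layer followed by tanh."""
    w = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
         for _ in range(dim)]
    def expert(h):
        return [math.tanh(sum(h[k] * w[k][c] for k in range(dim)))
                for c in range(dim)]
    return expert

def axpy(a, x, y):
    """Return a * x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def weighted_sum_moe(x, experts, gates):
    """Conventional aggregation: y = sum_i g_i * E_i(x)."""
    y = [0.0] * len(x)
    for g, e in zip(gates, experts):
        y = axpy(g, e(x), y)
    return y

def dag_moe(x, experts, gates, adj):
    """Illustrative structural aggregation (an assumption, not the paper's rule).

    adj[i][j] is the weight of edge j -> i and is only read for j < i,
    so the graph over the selected experts is acyclic by construction.
    """
    outputs = []
    for i, e in enumerate(experts):
        h = list(x)
        for j in range(i):
            # Add the outputs of earlier experts to this expert's input.
            h = axpy(adj[i][j], outputs[j], h)
        outputs.append(e(h))
    y = [0.0] * len(x)
    for g, o in zip(gates, outputs):
        y = axpy(g, o, y)
    return y

rng = random.Random(0)
dim, k = 4, 3
experts = [make_expert(rng, dim) for _ in range(k)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
gates = [0.5, 0.3, 0.2]  # stand-in for router weights of the top-k experts

no_edges = [[0.0] * k for _ in range(k)]
chain = [[0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0],   # expert 1 also reads expert 0
         [0.0, 1.0, 0.0]]   # expert 2 also reads expert 1

flat = weighted_sum_moe(x, experts, gates)
same = dag_moe(x, experts, gates, no_edges)
chained = dag_moe(x, experts, gates, chain)
```

Under these assumptions, an edge-free graph reproduces the ordinary weighted sum, while the chain composes the same three experts sequentially. This illustrates how one set of experts and one router can yield different combinations, including multi-step ones, depending only on the aggregation structure.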
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




