LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling
Title: LoopMoE: Harmonizing Iterative Processing and Mixture-of-Experts for Language Modeling
Abstract: Mixture-of-Experts (MoE) and looped architectures offer distinct pathways for scaling models, enhancing parameter capacity and effective depth respectively. However, conventional looped designs typically utilize dense backbones, creating a coupling between parameter volume and per-token FLOPs. This interdependence hinders the ability to isolate the specific impact of iterative computation when operating under equivalent budget constraints. To address this limitation, we introduce LoopMoE, a looped MoE language model that combines sparse routing with iterative, weight-shared computation through two key innovations. First, IterAdaLN breaks the symmetry inherent in weight-sharing by employing a modulation signal derived from both the per-token hidden state and the iteration index. Second, we implement a capacity-balancing mechanism that restores the attention-to-FFN active parameter ratio found in well-optimized, non-looped counterparts. These innovations facilitate the first rigorous, head-to-head comparison between a looped MoE and a standard Vanilla MoE, maintaining identical totals for parameters, per-token FLOPs, and active sublayer ratios. In evaluations at the 3B scale, LoopMoE surpasses the Vanilla MoE on eight out of nine downstream benchmarks, achieving an average improvement of more than one point. Furthermore, at the 9B scale, LoopMoE maintains its superiority over the matched Vanilla MoE, demonstrating that these architectural benefits remain effective at larger sizes. This study provides a controlled integration of sparsity and recurrence, highlighting a viable trajectory for the development of looped language models.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
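
The abstract describes IterAdaLN only at a high level: a modulation signal derived from the per-token hidden state and the iteration index, applied inside a weight-shared loop. The following is a minimal sketch of that idea, not the authors' implementation. The specific form (an adaptive LayerNorm whose scale and shift are linear in the hidden state plus a learned per-iteration embedding), the names `iter_adaln`, `looped_forward`, and `make_params`, and the `tanh` stand-in for the shared attention and MoE sublayers are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def iter_adaln(h, step, w_scale, w_shift, step_emb):
    """Hypothetical IterAdaLN: scale and shift depend on h AND the loop index.

    Assumed form: cond = h + step_emb[step]; out = (1 + scale) * LN(h) + shift.
    """
    cond = [hj + ej for hj, ej in zip(h, step_emb[step])]
    scale = matvec(w_scale, cond)
    shift = matvec(w_shift, cond)
    normed = layer_norm(h)
    return [(1.0 + s) * n + b for s, n, b in zip(scale, normed, shift)]


def looped_forward(h, n_loops, params):
    """Apply the same block weights n_loops times; only the modulation varies."""
    for step in range(n_loops):
        mod = iter_adaln(h, step, params["w_scale"], params["w_shift"],
                         params["step_emb"])
        # Stand-in for the shared attention + sparse MoE sublayers (assumption).
        update = [math.tanh(v) for v in matvec(params["w_block"], mod)]
        h = [hj + uj for hj, uj in zip(h, update)]  # residual connection
    return h


def make_params(dim, n_loops, seed=0):
    """Random toy parameters; one embedding vector per loop iteration."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    return {
        "w_scale": mat(),
        "w_shift": mat(),
        "w_block": mat(),
        "step_emb": [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                     for _ in range(n_loops)],
    }
```

In this sketch the block weights `w_block` are reused on every pass, so without the per-iteration embedding each loop would apply the same function to its input. Conditioning the normalization on the loop index is one plausible way to break that symmetry, which is the role the abstract assigns to IterAdaLN.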




