Confidence-Adaptive SwiGLU for Mixture-of-Experts
Abstract:
While SwiGLU has established itself as the standard gated activation in modern Transformer Multi-Layer Perceptrons (MLPs), its gate sharpness, meaning how smooth or how selective the gating function is, remains static throughout training. To address this limitation, we introduce Confidence-Aware SwiGLU ($\kappa$-SwiGLU), a variant designed for Mixture-of-Experts (MoE) architectures that dynamically adjusts expert gate sharpness based on token-level routing confidence. $\kappa$-SwiGLU treats the SiLU gate sharpness coefficient as a learnable function of the router logit, allowing each expert gate unit to interpolate between smooth, broadly active gating and sharp, highly selective gating.
We assessed the efficacy of $\kappa$-SwiGLU on the FineWeb-Edu dataset using MoE Transformer models with depths ranging from 8 to 28 layers. Our results indicate that $\kappa$-SwiGLU enhances mean CORE performance across all tested configurations. Notably, this improvement is achieved with negligible parameter overhead and only a minor computational cost. These findings suggest that incorporating confidence-aware gate sharpness is a promising strategy for optimizing MoE MLPs. The source code is publicly available at https://github.com/askerlee/kappa-swiglu.
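To make the gating idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes the sharpened gate takes the form $z \cdot \sigma(\beta z)$ and that the sharpness $\beta$ is obtained as `softplus(a * router_logit + b)`; the abstract states only that the sharpness coefficient is a learnable function of the router logit, so this parametrization, the scalars `a` and `b`, and all function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    # log(1 + exp(z)), computed stably; the result is always positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharp_silu(z: float, beta: float) -> float:
    # SiLU with sharpness beta: z * sigmoid(beta * z).
    # beta = 1 is standard SiLU, large beta approaches ReLU,
    # and beta = 0 gives the linear map z / 2.
    return z * sigmoid(beta * z)


def sharpness(router_logit: float, a: float, b: float) -> float:
    # ASSUMPTION: an affine map of the router logit passed through softplus
    # to keep beta positive. `a` and `b` are hypothetical learnable scalars.
    return softplus(a * router_logit + b)


# With a = 0 and b = log(e - 1), softplus(b) = 1, so the gate starts out
# as standard SwiGLU regardless of the router logit.
B_INIT = math.log(math.e - 1.0)


def kappa_swiglu_unit(gate_pre, up_pre, router_logit, a=0.0, b=B_INIT):
    # gate_pre and up_pre hold one token's pre-activations from the expert's
    # gate and up projections; the down projection is omitted for brevity.
    beta = sharpness(router_logit, a, b)
    return [sharp_silu(g, beta) * u for g, u in zip(gate_pre, up_pre)]
```

Under these assumptions, a positive `a` makes tokens with larger router logits see a sharper, more ReLU-like gate, while a negative `a` does the opposite; which direction is preferable is left to training.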
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





