MESA: Improving MoE Safety Alignment via Decentralized Expertise
Title: MESA: Enhancing MoE Safety Alignment Through Decentralized Expertise
Abstract: Mixture-of-Experts (MoE) architectures offer an efficient scaling path for Large Language Models (LLMs), boosting capacity while lowering computational overhead by dynamically directing inputs to specialized experts. However, this approach introduces a significant security flaw known as "Safety Sparsity." In this scenario, safety mechanisms are concentrated within a small subset of experts, rendering the model vulnerable to adversarial attacks. Furthermore, traditional alignment techniques often apply uniform adjustments across all parameters, disregarding functional distinctions and inadvertently compromising model performance.
To overcome these issues, we introduce MESA (MoE Safety Alignment), a framework designed specifically for MoE-based LLMs. MESA decentralizes safety responsibilities across experts to maximize coverage while minimizing disruption to the model's utility. Grounded in Optimal Transport (OT) theory, MESA employs two core mechanisms. First, Expert Capacity Reallocation uses a transport cost matrix to assign safety duties to the most efficient experts. Second, Dynamic Routing Refinement adjusts the router so that these decentralized safety modules are activated precisely. Experimental results demonstrate that MESA maintains robust defenses across diverse harmfulness benchmarks without sacrificing helpfulness. The source code is available at https://github.com/lorraine021/MESA.
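As a rough illustration of how a transport cost matrix can spread concentrated mass across experts, the sketch below computes an entropy-regularized OT plan with Sinkhorn iterations. It is not MESA's actual procedure: the abstract does not specify the cost matrix, the marginals, or the solver, so the four-expert setup, the distance-based cost, the uniform target, and the names `sinkhorn_plan`, `safety_mass`, and `target_mass` are all assumptions made for illustration only.

```python
import math

def sinkhorn_plan(a, b, cost, reg=0.5, n_iters=200):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn)."""
    n, m = len(a), len(b)
    # Gibbs kernel: cheaper moves get larger weight.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale to match the column and row marginals.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical toy setup: safety "mass" is concentrated in expert 0
# and should be spread evenly over four experts.
safety_mass = [0.85, 0.05, 0.05, 0.05]
target_mass = [0.25, 0.25, 0.25, 0.25]
# Hypothetical cost: moving responsibility between distant experts is pricier.
cost = [[abs(i - j) / 3.0 for j in range(4)] for i in range(4)]

plan = sinkhorn_plan(safety_mass, target_mass, cost)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(4)) for j in range(4)]
```

Here `plan[i][j]` is the share of safety mass moved from expert `i` to expert `j`; its row sums recover `safety_mass` and its column sums approach `target_mass`. In a real system the cost would presumably encode how much utility an expert loses by taking on safety duties, which this toy cost does not model.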
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




