Expert-Aware Refusal Steering
Title: Expert-Aware Refusal Steering
Abstract: The safety alignment of instruction-tuned large language models (LLMs) depends on their ability to consistently refuse harmful or prohibited prompts. Prior work has shown that applying a steering vector at inference time in dense LLMs can suppress refusal, causing the model to answer such requests. This study extends refusal steering to three open-source Mixture-of-Experts (MoE) LLMs and finds that the routing dynamics of MoE architectures do not hinder steering efficacy. We introduce two new expert-aware refusal steering techniques that use refusal-related expert routing patterns and expert-specific steering vectors to suppress standard refusal responses. Our analysis shows that refusal behavior can be modulated by steering the output of a single expert. The findings suggest that the refusal signals captured by these steering methods are distinct from expert routing behavior, highlighting the significant role of attention mechanisms in MoE refusal behavior.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
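
The abstract does not say how the steering vectors are built or applied, so the following is only a minimal sketch of one common construction from the dense-model steering literature: a difference-of-means direction that is projected out of the activations. The function names (`refusal_direction`, `ablate_direction`, `steer_single_expert`), the array shapes, and the choice of ablation rather than vector addition are all assumptions made for illustration, not the authors' method. The last function shows what steering the output of a single expert could look like for one token in an MoE layer.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations.

    Both inputs are assumed to have shape [n_prompts, d_model].
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(x, direction):
    """Remove the component of each row of x along a unit-norm direction."""
    return x - np.outer(x @ direction, direction)

def steer_single_expert(expert_outputs, gate_weights, expert_id, direction):
    """Combine expert outputs for one token, steering only one expert.

    expert_outputs: [n_experts, d_model] (hypothetical per-expert outputs)
    gate_weights:   [n_experts] (hypothetical router weights)
    """
    steered = expert_outputs.copy()
    # Ablate the direction from the chosen expert's output only.
    steered[expert_id] = ablate_direction(steered[expert_id][None, :], direction)[0]
    # Weighted sum over experts, as in a standard MoE layer.
    return gate_weights @ steered

# Toy usage with random activations (no real model involved).
rng = np.random.default_rng(0)
harmful = rng.normal(size=(8, 4)) + np.array([2.0, 0.0, 0.0, 0.0])
harmless = rng.normal(size=(8, 4))
r = refusal_direction(harmful, harmless)
outputs = rng.normal(size=(2, 4))
gates = np.array([0.7, 0.3])
mixed = steer_single_expert(outputs, gates, expert_id=0, direction=r)
```

In this toy setup only expert 0's contribution along `r` is removed; the other expert's output passes through unchanged, which is the sense in which the intervention is expert-specific.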




