arXiv

Expert-Aware Refusal Steering

Abstract: The safety alignment of instruction-tuned large language models (LLMs) depends on their ability to consistently refuse harmful or prohibited prompts. Prior work has shown that applying a steering vector at inference time in dense LLMs can suppress refusal, leading the model to answer such requests. This study extends refusal steering to three open-source Mixture-of-Experts (MoE) LLMs and finds that the routing dynamics characteristic of MoE architectures do not hinder steering efficacy. We introduce two novel expert-aware refusal steering techniques that use refusal-related expert routing patterns and expert-specific steering vectors to suppress standard refusal responses. Our analysis indicates that refusal behavior can be modulated by targeting the output of a single expert. The findings suggest that the refusal signals captured by these steering methods are distinct from expert routing behavior, highlighting the significant role of attention mechanisms in the refusal behavior of MoE models.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
