arXiv

Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Abstract

Agentic language models operate under a different safety regime than standard chat models. Rather than producing a single static response, they must plan, invoke tools, and execute long sequences of actions, where a single misstep, such as accessing sensitive files or entering credentials, can cause irreversible harm. Conventional alignment methods, designed primarily for static generation and task completion, often break down in these settings because of sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning.

To address these challenges, we present MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and trainable. MOSAIC structures inference as a loop of planning, checking, and then either acting or refusing, and it treats explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which captures safety distinctions that scalar rewards often miss.
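
The abstract does not specify how the loop or the training objective is implemented, so the following is only a minimal sketch under stated assumptions. The names `Step`, `run_episode`, `propose`, `is_safe`, and `execute` are hypothetical, ending the episode on refusal is an assumption, and the Bradley-Terry style loss is a common stand-in for pairwise preference learning rather than MOSAIC's confirmed objective.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    plan: str        # the agent's stated intent for this step
    tool_call: str   # hypothetical string form of the proposed tool call


def run_episode(
    propose: Callable[[List[str]], Step],
    is_safe: Callable[[Step], bool],
    execute: Callable[[str], str],
    max_steps: int = 5,
) -> List[str]:
    """Plan -> check -> act-or-refuse loop in which refusal is an explicit action."""
    history: List[str] = []
    for _ in range(max_steps):
        step = propose(history)              # plan the next tool call
        if not is_safe(step):                # explicit safety check
            history.append(f"REFUSE: {step.tool_call}")
            break                            # assumption: a refusal ends the episode
        result = execute(step.tool_call)     # act only after the check passes
        history.append(f"ACT: {step.tool_call} -> {result}")
    return history


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss, -log(sigmoid(s_preferred - s_rejected))."""
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))
```

In this sketch the loss shrinks as the preferred trajectory's score rises above the rejected one's, so training needs only a judgment of which of two trajectories is better, not a scalar label for each.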

We evaluate MOSAIC zero-shot on three models: Qwen2.5-7B, Qwen3-4B-Thinking, and Phi-4. The evaluation uses out-of-distribution benchmarks covering harmful tasks, prompt injection attacks, legitimate tool use, and cross-domain privacy leakage. MOSAIC reduces harmful behavior by up to 50% and raises the refusal rate for harmful tasks by more than 20% under injection attacks. It also reduces privacy leakage while maintaining or improving performance on benign tasks, demonstrating robust generalization across models, domains, and agentic settings.


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
