arXiv

A New Framework for Cybersecurity Refusals in AI Agents

June 3, 2026 · Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter

Title: Establishing Boundaries: A New Protocol for AI Agent Refusals in Cybersecurity

Abstract:

Agentic scaffolding has significantly improved the ability of Large Language Models (LLMs) to execute complex, long-horizon tasks. These gains bring substantial benefits but also heightened risks, particularly in cybersecurity. Current benchmarks for AI agents in this field mostly measure capability, focusing on how well agents can perform offensive security operations. They largely overlook a crucial ethical and operational question: when, and how, should an agent decline a harmful instruction?

To address this gap, we introduce the first framework for defining refusal boundaries in offensive security environments. The framework has three core components: (1) principled criteria for deciding when a task must be refused, (2) specific categories of tasks that warrant refusal, and (3) an evaluation methodology that measures how robustly agents refuse under both standard and adversarial conditions.

We applied this framework to evaluate whether contemporary LLM-powered agents refuse appropriately across a range of web-based offensive security scenarios. The results reveal a stark safety gap: six of the eight leading models tested exhibited near-zero refusal rates. Only two models, GPT-5.2 and GPT-5.1 Codex, showed any significant refusal behavior.
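As a rough illustration of the headline metric, the sketch below computes a per-model refusal rate separately for standard and adversarial conditions. It is not the authors' evaluation harness: the `Trial` record, the `refusal_rates` helper, the condition labels, and the `model-a`/`model-b` entries are all hypothetical, and the trial outcomes are made up purely to exercise the code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    model: str       # placeholder model label, not a real system
    condition: str   # assumed labels: "standard" or "adversarial"
    refused: bool    # True if the agent declined the task


def refusal_rates(trials):
    """Return {(model, condition): fraction of trials that were refused}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [refused, total]
    for t in trials:
        entry = counts[(t.model, t.condition)]
        entry[0] += int(t.refused)
        entry[1] += 1
    return {key: refused / total for key, (refused, total) in counts.items()}


# Made-up outcomes for illustration only; these are not results from the paper.
trials = [
    Trial("model-a", "standard", True),
    Trial("model-a", "standard", True),
    Trial("model-a", "adversarial", True),
    Trial("model-a", "adversarial", False),
    Trial("model-b", "standard", False),
    Trial("model-b", "standard", False),
]

rates = refusal_rates(trials)
for (model, condition), rate in sorted(rates.items()):
    print(f"{model:8s} {condition:12s} refusal rate = {rate:.2f}")
```

Reporting the two conditions separately makes it visible when a model that refuses under plain prompting stops refusing once the request is adversarially rephrased, which a single pooled rate would hide.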
