AgentRedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations
Abstract
Indirect prompt injection is a practical risk for tool-using agents in production. LLM agents retrieve information from third-party integrations (such as Gmail, Salesforce, or Jira) through tool calls, so they process response content that users neither author nor control. Current benchmarks do not capture this vulnerability well: they typically test only a few integrations, reuse static attack payloads across executions, and rely on open-source guard models trained on conversational data rather than on tool-response content.
To address these gaps, we present AGENTREDBENCH, a dynamic, LLM-driven redteaming benchmark. It comprises 215 scenarios built around underspecified-authorization attacks, which probe the boundaries of what the user has permitted the agent to do. The scenarios span 24 enterprise integrations, five attack types, and nine functional families. Across an eight-model panel from Anthropic, OpenAI, and Google, the attack success rate (ASR) without guardrails ranged from 32% for Claude Sonnet 4.6 to 81% for Gemini 3 Flash.
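As a minimal sketch of the metric, ASR can be read as the fraction of attack trials in which the agent carried out the injected action. The record layout and names below (`Trial`, `attack_succeeded`, `attack_success_rate`) are assumptions for illustration, not the benchmark's actual schema, and the sample records are toy data rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One attack attempt against one model (hypothetical record layout)."""
    model: str
    scenario_id: str
    attack_succeeded: bool  # True if the agent performed the injected action


def attack_success_rate(trials):
    """Return {model: ASR}, where ASR = successful attacks / total attack trials."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for t in trials:
        totals[t.model] += 1
        wins[t.model] += int(t.attack_succeeded)
    return {m: wins[m] / totals[m] for m in totals}


# Toy data only; these are not results from the paper.
toy_trials = [
    Trial("model-a", "s1", True),
    Trial("model-a", "s2", False),
    Trial("model-a", "s3", False),
    Trial("model-a", "s4", False),
    Trial("model-b", "s1", True),
    Trial("model-b", "s2", True),
]
asr_by_model = attack_success_rate(toy_trials)
```

A panel-level figure would then be an average over the per-model values; whether the paper averages per model or per trial is not stated in the abstract.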
We openly release the codebase, the integration schemas, and the AGENTREDGUARD guard model. Evaluation on the canonical scenarios, by contrast, is conducted through a maintainer-mediated channel with immutable versioning, so that the scenarios stay out of future training data and ASR figures remain valid over time.
Complementing the benchmark, we introduce AGENTREDGUARD, a guard model trained on a diverse corpus of adversarial tool-response content. This defense reduces the panel’s average ASR from 69.9% to 2.4% while maintaining a false-positive rate of 0.37%. On both metrics, AGENTREDGUARD outperforms every open-source baseline that shows meaningful detection capability, including Llama Guard, PromptGuard 2, and ProtectAI. Holdout tests across integrations and across attack types indicate that these improvements generalize beyond the subsets used during training.
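The two guard metrics can be sketched as follows. This assumes a simple pipeline in which a flagged tool response is blocked before it reaches the agent, so an attack counts as successful only if it evades the guard and the agent then complies. The record layout and function names (`GuardedTrial`, `guarded_asr`, `false_positive_rate`, `relative_reduction`) are hypothetical, and the paper's actual evaluation harness may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardedTrial:
    """One tool response passed through a guard (hypothetical record layout)."""
    is_attack: bool       # ground truth: response carries an injected instruction
    flagged: bool         # guard decision: True means the response is blocked
    agent_complied: bool  # would the agent perform the injected action if it saw it?


def guarded_asr(trials):
    """Fraction of attack trials that evade the guard AND succeed against the agent."""
    attacks = [t for t in trials if t.is_attack]
    if not attacks:
        raise ValueError("no attack trials")
    hits = sum(1 for t in attacks if not t.flagged and t.agent_complied)
    return hits / len(attacks)


def false_positive_rate(trials):
    """Fraction of benign tool responses that the guard wrongly flags."""
    benign = [t for t in trials if not t.is_attack]
    if not benign:
        raise ValueError("no benign trials")
    return sum(1 for t in benign if t.flagged) / len(benign)


def relative_reduction(before, after):
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy data only; these are not results from the paper.
toy = [
    GuardedTrial(is_attack=True, flagged=True, agent_complied=True),
    GuardedTrial(is_attack=True, flagged=True, agent_complied=False),
    GuardedTrial(is_attack=True, flagged=False, agent_complied=True),
    GuardedTrial(is_attack=True, flagged=False, agent_complied=False),
    GuardedTrial(is_attack=False, flagged=True, agent_complied=False),
    GuardedTrial(is_attack=False, flagged=False, agent_complied=False),
    GuardedTrial(is_attack=False, flagged=False, agent_complied=False),
    GuardedTrial(is_attack=False, flagged=False, agent_complied=False),
    GuardedTrial(is_attack=False, flagged=False, agent_complied=False),
]
toy_asr = guarded_asr(toy)
toy_fpr = false_positive_rate(toy)
```

Applied to the averages quoted above, `relative_reduction(69.9, 2.4)` corresponds to a drop of 67.5 percentage points, or roughly a 96.6% relative reduction in ASR.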
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




