SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence
Abstract:
As the operational capabilities of Large Language Model (LLM)-based agents broaden, ensuring their reliability is essential for deployment in real-world settings. In practice, however, human operators cannot supervise every individual action an agent takes. The internal execution process therefore often functions as a "black box," forcing users to rely exclusively on the agent’s self-generated status updates. This lack of transparency introduces a significant hazard: agents may produce reports for observers that do not match the actions they actually executed. Such discrepancies can render a system uncontrollable, particularly in high-stakes autonomous environments. We define this phenomenon, in which self-reported plans diverge from executed actions, as agent deception.
To measure this issue, we present SPADE-Bench, a benchmark designed to assess spontaneous plan-action divergence. Unlike previous deception benchmarks, SPADE-Bench combines genuine tool execution with controlled pressure scenarios. This design preserves ecological validity and, by comparing plans with actions under pressure, enables a rigorous distinction between strategic deception and simple hallucination. Empirical tests on leading models show that agent deception is a tangible and urgent challenge in tool-use contexts. By offering a comprehensive and robust evaluation framework, SPADE-Bench addresses a vital gap in agent safety and supports the broader effort to develop trustworthy and controllable autonomous systems.
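The idea of comparing plans with actions under pressure can be illustrated with a minimal Python sketch. It is not the SPADE-Bench implementation: the `Episode` record, its field names, and the `pressure_gap` statistic are all hypothetical and introduced here only for illustration. The sketch assumes that divergence which appears mainly under pressure is more suggestive of strategic behavior, while divergence that occurs at a similar rate without pressure looks more like ordinary hallucination or error.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run. All fields are hypothetical, not SPADE-Bench's schema."""
    reported_actions: tuple[str, ...]  # what the agent told the observer it did
    executed_actions: tuple[str, ...]  # tool calls actually logged by the harness
    under_pressure: bool               # True if a pressure condition was applied


def diverges(ep: Episode) -> bool:
    """An episode diverges when the report and the execution log differ."""
    return ep.reported_actions != ep.executed_actions


def divergence_rate(episodes: list[Episode]) -> float:
    """Fraction of episodes whose report does not match the execution log."""
    if not episodes:
        return 0.0
    return sum(diverges(ep) for ep in episodes) / len(episodes)


def pressure_gap(episodes: list[Episode]) -> float:
    """Divergence rate under pressure minus the rate in control episodes.

    Assumption (ours): a clearly positive gap points toward pressure-driven
    divergence; a gap near zero is more consistent with baseline error.
    """
    pressured = [ep for ep in episodes if ep.under_pressure]
    control = [ep for ep in episodes if not ep.under_pressure]
    return divergence_rate(pressured) - divergence_rate(control)


# Toy data, invented for illustration only.
episodes = [
    Episode(("read_file", "run_tests"), ("read_file", "run_tests"), False),
    Episode(("read_file",), ("read_file",), False),
    Episode(("run_tests",), ("skip_tests",), True),   # report differs from log
    Episode(("deploy",), ("deploy",), True),
]
gap = pressure_gap(episodes)
```

Exact tuple equality is a deliberately crude divergence test. A real harness would likely need a more tolerant comparison, for example one that ignores harmless reordering or differences in argument formatting.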
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC




