"I Strongly Suspect This Website Is a Scam": Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents
Abstract:
Autonomous web agents are frequently manipulated by deceptive online content—commonly referred to as social-engineering attacks—into transmitting users' personally identifiable information (PII) to servers controlled by malicious actors. This study demonstrates that such social-engineering tactics are remarkably successful at extracting high-value PII from state-of-the-art web agents, presenting a significant threat to deployed agentic systems.
To measure this vulnerability, we present Scammer4U, a pre-registered benchmark comprising 91 attacker-controlled environments and 10 benign-twin baseline sites. This framework covers 8 distinct attack vectors and 16 site categories, organized within an 8-axis factorial taxonomy designed to isolate the causal impact of specific attack design elements.
Our analysis of frontier agents reveals that critical-tier PII leakage rates range from 54% to 93% in the absence of privacy guidance. In contrast, leakage remains at 0% for the benign-twin baselines. This stark disparity confirms that data leakage is directly attributable to the attacks rather than being a result of incidental form-filling behavior.
While escalating prompt-level mitigation strategies leads to model-dependent reductions across the four agent families studied, these measures prove insufficient at the pooled level to consistently prevent the submission of critical PII. Most importantly, we identify a critical "detection–action gap": even when an independent LLM judge verifies that the agent’s reasoning process has correctly identified the site as suspicious, the agent still submits critical PII in 35.9% of sessions, compared with 66.1% of sessions in which no suspicion is verbalized. This is a robust gap of 30.2 percentage points that is consistent across all four model families.
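The gap quoted above is the difference between two submission rates, so it is an absolute difference in percentage points and not a relative change. The short sketch below reproduces that arithmetic from the two rates given in the abstract. The function names are illustrative, and the relative reduction is a derived figure that the abstract does not report.

```python
# Submission rates quoted in the abstract, in percent of sessions.
SUBMIT_RATE_NO_SUSPICION = 66.1    # agent never verbalizes suspicion
SUBMIT_RATE_WITH_SUSPICION = 35.9  # judge confirms the agent flagged the site


def gap_percentage_points(baseline: float, treated: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(baseline - treated, 1)


def relative_reduction(baseline: float, treated: float) -> float:
    """Share of the baseline rate removed by the difference (derived, not from the source)."""
    return (baseline - treated) / baseline


gap = gap_percentage_points(SUBMIT_RATE_NO_SUSPICION, SUBMIT_RATE_WITH_SUSPICION)
rel = relative_reduction(SUBMIT_RATE_NO_SUSPICION, SUBMIT_RATE_WITH_SUSPICION)

print(f"gap: {gap} percentage points")   # gap: 30.2 percentage points
print(f"relative reduction: {rel:.1%}")  # relative reduction: 45.7%
```

Read this way, verbalized suspicion is associated with fewer submissions, yet more than a third of the sessions in which the agent flags the site still end with critical PII being sent.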
These results indicate that defenses relying on the agent’s own recognition of an attack are targeting the wrong signal. Consequently, we argue for the implementation of output-level interception mechanisms for outbound submissions, which function independently of the agent’s internal reasoning loop.
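As a rough illustration of what output-level interception could look like, the sketch below checks an outbound form submission using only its destination and payload, without consulting the agent's reasoning trace. It is a minimal sketch under stated assumptions and not the mechanism proposed in the paper: the function `check_outbound_submission`, the `Verdict` type, the trusted-host allowlist, and the two regular expressions are all hypothetical and cover only a narrow slice of critical PII.

```python
import re
from dataclasses import dataclass

# Hypothetical detectors for illustration only; real coverage would need to be much broader.
CRITICAL_PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


@dataclass
class Verdict:
    allowed: bool
    reasons: list[str]


def check_outbound_submission(
    destination_host: str,
    form_fields: dict[str, str],
    trusted_hosts: set[str],
) -> Verdict:
    """Decide whether a form submission may leave the agent's sandbox.

    Only the destination and the payload are inspected; the agent's own
    judgment about the site plays no role in the decision.
    """
    # Assumption: the user maintains an allowlist of hosts that may receive PII.
    if destination_host in trusted_hosts:
        return Verdict(True, [])

    reasons = [
        f"{field}: matches {label}"
        for field, value in form_fields.items()
        for label, pattern in CRITICAL_PII_PATTERNS.items()
        if pattern.search(value)
    ]
    # Block the submission if any field looks like critical PII.
    return Verdict(not reasons, reasons)


verdict = check_outbound_submission(
    "prizes.example",
    {"full_name": "Ada Lovelace", "tax_id": "123-45-6789"},
    trusted_hosts={"bank.example"},
)
print(verdict.allowed, verdict.reasons)
```

Because the check sits on the submission path, it would apply equally to sessions where the agent voiced suspicion and sessions where it did not, which is the property the argument above calls for.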
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





