arXiv

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

June 3, 2026 · Alexandre Cristov\~ao Maiorano · Original Source

Title: Mapping Defense Mechanisms to Specific Threats: An Analysis of OWASP-LLM-Top-10 Coverage and Vulnerability to Paraphrasing

Abstract: While production Large Language Model (LLM) applications typically employ a layered defense strategy—combining refusal-phrase filters, token-budget constraints, model allowlists, rate limits, and tool-registry authentication—current breach-and-attack-simulation (BAS) benchmarks often obscure the specific efficacy of these measures by reporting only a single aggregate coverage metric. This study investigates the attribution of these defenses. By integrating four agents designed to address OWASP-LLM-Top-10 vulnerabilities into a baseline scanner of 21 agents, we evaluated four synthetic LLM endpoints: $L_0$ (unprotected), $L_1$ (refusal filters only), $L_2$ (budget controls only), and $L_3$ (the complete defense stack). It is important to note that $L_1$ and $L_2$ function as independent, single-axis ablations rather than subsets of one another, while $L_3$ represents their combination augmented with tool-registry authentication and credential scrubbing.

Analysis across $N=10$ replications yielded distinct findings for each OWASP category. The refusal mechanism alone successfully eliminated all instances of LLM01 (jailbreaking) and LLM07 (system prompt leakage). Conversely, budget controls alone were effective against LLM02 (sensitive information disclosure) and LLM10 (unbounded consumption) by terminating multi-step attack sequences. However, mitigating LLM06 (excessive agency) required the implementation of the full defense stack.

We further examined the robustness of these defenses against paraphrasing attacks. Using 300 paraphrases generated by Gemini ($K=5$ variations across a 60-template corpus), we observed that $L_1$’s refusal block rate dropped by 15 percentage points for LLM01 and 25 percentage points for LLM07. Additionally, we introduced a fifth target, $L_4$-real, which replaced the stub backend with Gemini-2.5-flash while maintaining the same $L_3$ regex configuration. This setup mirrored $L_1$’s performance exactly, suggesting that within this context, there was no measurable alignment contribution beyond the regex rules (a finding specific to this experimental setup, not a general assertion about alignment capabilities). Notably, budget controls demonstrated resilience against such mutations, showing no decline in performance (0 pp) once the rate-limit floor was accounted for. These results indicate that while a refusal whitelist may pass static benchmarks, it can be circumvented by an LLM-driven paraphraser without altering the underlying attack intent; in contrast, budget controls proved resistant to the same type of mutation.

Source: arXiv Generated at: 2026-06-03 00:00:00 UTC