CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
Title: CyberGym-E2E: A Scalable Real-World Benchmark for Assessing End-to-End Cybersecurity Skills in AI Agents
Abstract:
Artificial intelligence holds the promise of revolutionizing cybersecurity by facilitating the autonomous detection, analysis, and remediation of software flaws. Despite this potential, current evaluations of AI systems in this domain are often restricted in either their breadth or depth, failing to adequately represent the complete lifecycle involved in discovering and fixing real-world vulnerabilities. To bridge this critical gap, we introduce CyberGym-E2E, a comprehensive and scalable benchmark designed to rigorously test AI agents across the entire spectrum of vulnerability management, including discovery, proof-of-concept (PoC) creation, and patch development. Our approach leverages an automated, agent-enhanced pipeline to convert open-source vulnerability data into realistic evaluation scenarios, ensuring the benchmark’s scalability. At present, CyberGym-E2E encompasses 920 genuine vulnerabilities drawn from 139 distinct open-source projects.
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




