AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
Title: AgentLens: Exposing the 'Lucky Pass' Phenomenon in SWE-Agent Assessment
Abstract:
Current evaluations of software engineering (SWE) agents rely heavily on a binary metric: whether the final code patch passes the tests. This outcome-centric approach treats rigorous, principled solutions and haphazard trial-and-error as equivalent. We show that this equivalence does not hold empirically. We analyze 2,614 OpenHands trajectories generated by eight model backends across 60 SWE-bench Verified tasks, of which 1,815 trajectories come from the 47 tasks with enough passing instances to establish process-level references. Within this subset, 10.7% of the passing trajectories are what we term a "Lucky Pass": a passing run marked by regression loops, unguided retries, skipped verification steps, or a disordered sequence of exploration, implementation, and verification activities.
To address these limitations, we present AgentLens, a framework for process-level evaluation of SWE-agent trajectories, alongside AgentLens-Bench, a dataset of 1,815 annotated trajectories. The dataset includes quality scores, waste indicators, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens constructs each PTA reference by merging multiple successful solutions to the same task, and it employs a context-sensitive intent labeler. This labeler categorizes each action as Exploration, Implementation, Verification, or Orchestration based on the trajectory’s historical context rather than on tool identity alone.
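To make the PTA idea concrete, the following is a minimal sketch, not the AgentLens implementation. It assumes a trajectory is a list of tool names, uses a hypothetical labeling rule (a test run before any edit counts as Exploration, after an edit as Verification), and takes the divergence point to be the first step that leaves every reference path. The names `label_intent`, `build_pta`, and `divergence_point`, and the tool names, are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PTANode:
    children: dict = field(default_factory=dict)
    count: int = 0  # reference trajectories that pass through this node


def label_intent(tool, has_edited):
    """Hypothetical context-sensitive rule: intent depends on history,
    not only on which tool was called."""
    if tool == "edit":
        return "Implementation"
    if tool == "run_tests":
        return "Verification" if has_edited else "Exploration"
    if tool in ("read", "search"):
        return "Exploration"
    return "Orchestration"


def label_trajectory(tools):
    """Label each step using whether an edit has already happened."""
    labels, has_edited = [], False
    for tool in tools:
        labels.append(label_intent(tool, has_edited))
        if tool == "edit":
            has_edited = True
    return labels


def build_pta(sequences):
    """Merge label sequences into a prefix tree; shared prefixes share nodes."""
    root = PTANode()
    for seq in sequences:
        node = root
        node.count += 1
        for symbol in seq:
            node = node.children.setdefault(symbol, PTANode())
            node.count += 1
    return root


def divergence_point(root, seq):
    """Index of the first step not on any reference path, or None."""
    node = root
    for i, symbol in enumerate(seq):
        if symbol not in node.children:
            return i
        node = node.children[symbol]
    return None


# Two made-up passing trajectories for one task form the reference.
passing = [
    ["read", "run_tests", "edit", "run_tests"],
    ["search", "edit", "run_tests"],
]
reference = build_pta([label_trajectory(t) for t in passing])

# A made-up candidate that edits twice in a row.
candidate = label_trajectory(["read", "edit", "edit", "run_tests"])
print(divergence_point(reference, candidate))
```

In this toy case the candidate follows the reference through Exploration and Implementation, then leaves it at the second consecutive edit, since no reference trajectory repeats Implementation at that position.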
When applied to AgentLens-Bench, the quality score stratifies passing trajectories into three tiers: Lucky, Solid, and Ideal. It also breaks Lucky Passes down into five recurring operational mechanisms. Across the eight model backends, Lucky Pass rates vary widely, from 0.5% to 23.2%. Notably, when models are ranked by quality score rather than by pass rate, some shift by up to five positions. We intend to make the project repository publicly available shortly; it will include the AgentLens-Bench artifacts, the AgentLens SDK, and associated analysis tools.
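The tiering and re-ranking described above can be sketched as follows. This is an illustration under stated assumptions: the abstract gives neither the quality-score range nor the tier cut-offs, so the thresholds `lucky_below` and `ideal_from` are hypothetical, and all model names and numbers are invented for the example rather than taken from the paper.

```python
def tier(score, lucky_below=0.4, ideal_from=0.8):
    """Map a quality score to a tier. Both thresholds are hypothetical."""
    if score < lucky_below:
        return "Lucky"
    if score >= ideal_from:
        return "Ideal"
    return "Solid"


def lucky_pass_rate(passing_scores):
    """Fraction of passing trajectories that fall in the Lucky tier."""
    lucky = sum(1 for s in passing_scores if tier(s) == "Lucky")
    return lucky / len(passing_scores)


def rank(scores):
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shifts(pass_rate, quality):
    """Positions gained (positive) or lost (negative) when ranking by
    quality score instead of pass rate."""
    by_pass, by_quality = rank(pass_rate), rank(quality)
    return {model: by_pass[model] - by_quality[model] for model in pass_rate}


# Invented numbers, for illustration only.
pass_rate = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.50}
quality = {"model_a": 0.45, "model_b": 0.70, "model_c": 0.65}

print(rank_shifts(pass_rate, quality))
print(lucky_pass_rate([0.2, 0.5, 0.9, 0.3]))
```

Here `model_a` leads on pass rate but drops two positions under the quality ranking, which is the kind of reordering the abstract reports at larger scale.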
Source: arXiv
Generated at: 2026-06-03 00:00:00 UTC



