arXiv

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

Abstract:

Current evaluations of software engineering (SWE) agents rely heavily on a binary metric: whether the final code patch successfully passes the tests. This outcome-centric approach erroneously equates rigorous, principled solutions with haphazard trial-and-error methods. Our research demonstrates that this equivalence is empirically invalid. By analyzing 2,614 OpenHands trajectories generated by eight different model backends across 60 SWE-bench Verified tasks, we identified a subset of 1,815 trajectories from 47 tasks that contained sufficient passing instances to establish process-level references. Within this subset, 10.7% of the passing trajectories displayed what we term a "Lucky Pass." This phenomenon is characterized by regression loops, unguided retries, skipped verification steps, or a disordered sequence of exploration, implementation, and verification activities.

To address these limitations, we present AgentLens, a framework designed for process-level evaluation of SWE-agent trajectories, alongside AgentLens-Bench, a comprehensive dataset comprising 1,815 annotated trajectories. This dataset includes quality scores, waste indicators, divergence points, and 47 task-level Prefix Tree Acceptor (PTA) references. AgentLens constructs PTA references by synthesizing multiple successful solutions for identical tasks and employs a context-sensitive intent labeler. This labeler categorizes actions into Exploration, Implementation, Verification, or Orchestration based on the trajectory’s historical context, rather than relying solely on tool identity.
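The PTA construction described above can be illustrated with a minimal sketch. It assumes each passing trajectory has already been reduced to a sequence of intent labels. The names `PTANode`, `build_pta`, and `divergence_point` are hypothetical and are not taken from the AgentLens SDK; the actual framework may encode states, merge references, and score divergence differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class PTANode:
    """One state of a prefix tree acceptor (hypothetical structure)."""
    children: Dict[str, "PTANode"] = field(default_factory=dict)
    accepting: bool = False  # True if some reference trajectory ends here


def build_pta(sequences: Sequence[Sequence[str]]) -> PTANode:
    """Merge several label sequences into one prefix tree.

    Sequences that share a prefix share the corresponding path of states.
    """
    root = PTANode()
    for seq in sequences:
        node = root
        for label in seq:
            node = node.children.setdefault(label, PTANode())
        node.accepting = True
    return root


def divergence_point(root: PTANode, seq: Sequence[str]) -> Optional[int]:
    """Return the index of the first step that leaves the tree, or None
    if every step of the sequence follows an existing path."""
    node = root
    for i, label in enumerate(seq):
        nxt = node.children.get(label)
        if nxt is None:
            return i
        node = nxt
    return None


# Illustrative label sequences only; these are not AgentLens-Bench data.
references = [
    ["Exploration", "Implementation", "Verification"],
    ["Exploration", "Exploration", "Implementation", "Verification"],
]
pta = build_pta(references)
candidate = ["Exploration", "Implementation", "Implementation", "Verification"]
step = divergence_point(pta, candidate)
```

In this sketch a candidate trajectory is compared against the merged references step by step, and the first unmatched label marks where it departs from every known successful path.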

When applied to AgentLens-Bench, the quality score stratifies passing trajectories into three tiers: Lucky, Solid, and Ideal. It also breaks Lucky Passes down into five recurring operational mechanisms. Across the eight model backends, Lucky Pass rates range from 0.5% to 23.2%. When models are ranked by quality score rather than by pass rate, some shift by up to five positions. We intend to make the project repository publicly available shortly; it will include the AgentLens-Bench artifacts, the AgentLens SDK, and associated analysis tools.
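The ranking comparison can be sketched as follows. This is only an illustration of how a position shift between two rankings might be computed; the helper names `rank_positions` and `rank_shifts` are hypothetical, and the model names and scores are placeholders, not results from the paper.

```python
from typing import Dict


def rank_positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}


def rank_shifts(pass_rate: Dict[str, float],
                quality: Dict[str, float]) -> Dict[str, int]:
    """Positions gained (positive) or lost (negative) when ranking by
    quality score instead of pass rate."""
    by_pass = rank_positions(pass_rate)
    by_quality = rank_positions(quality)
    return {name: by_pass[name] - by_quality[name] for name in pass_rate}


# Placeholder values for illustration only; not data from AgentLens-Bench.
pass_rate = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.7}
quality = {"model_a": 0.5, "model_b": 0.9, "model_c": 0.7}
shifts = rank_shifts(pass_rate, quality)
```

Here a model with many passes but a low quality score drops in the quality-based ranking, which is the kind of reordering the abstract reports.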


Source: arXiv Generated at: 2026-06-03 00:00:00 UTC
