AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

arXiv:2605.12925 · Submitted Jun 3, 2026 · 0 benchmark results

Authors pending

Abstract

Process-level analysis of 2,614 SWE-agent trajectories; 10.7% of passing runs are 'Lucky Passes' with chaotic behavior.

Results

No benchmark results recorded yet.

CodeSOTA extraction

Verify the AgentLens-Bench 'Lucky Pass' rate (10.7% of passing runs) and whether model rankings shift by 5 or more positions when models are ranked by quality score instead of pass rate.
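A reproduction of these two checks could be sketched as below. This is a minimal sketch, not the paper's method. The record fields `passed` and `lucky`, the model names, and all numbers are hypothetical placeholders. How a run gets flagged as a Lucky Pass is not specified here, so the sketch assumes a precomputed boolean label.

```python
from typing import Dict, Iterable, Mapping


def lucky_pass_rate(runs: Iterable[Mapping[str, bool]]) -> float:
    """Fraction of passing runs flagged as lucky (0.0 if nothing passed).

    Assumes each run has hypothetical boolean fields 'passed' and 'lucky'.
    """
    passing = [r for r in runs if r["passed"]]
    if not passing:
        return 0.0
    return sum(1 for r in passing if r["lucky"]) / len(passing)


def rank_positions(scores: Mapping[str, float]) -> Dict[str, int]:
    """Map each model to its 1-based rank, highest score first.

    Ties are broken alphabetically by model name so results are stable.
    """
    ordered = sorted(scores, key=lambda m: (-scores[m], m))
    return {model: i + 1 for i, model in enumerate(ordered)}


def rank_shifts(
    pass_rate: Mapping[str, float], quality: Mapping[str, float]
) -> Dict[str, int]:
    """Rank under quality score minus rank under pass rate, per model.

    A positive value means the model drops when ranked by quality score.
    """
    by_pass = rank_positions(pass_rate)
    by_quality = rank_positions(quality)
    return {m: by_quality[m] - by_pass[m] for m in pass_rate}


# Toy data, invented for illustration only.
runs = [
    {"passed": True, "lucky": False},
    {"passed": True, "lucky": True},
    {"passed": False, "lucky": False},
    {"passed": True, "lucky": False},
]
pass_rate = {"model_a": 0.60, "model_b": 0.55, "model_c": 0.50}
quality = {"model_a": 0.40, "model_b": 0.70, "model_c": 0.65}

rate = lucky_pass_rate(runs)
shifts = rank_shifts(pass_rate, quality)
# Models whose rank moves by at least `threshold` positions in either direction.
threshold = 5
large_movers = [m for m, s in shifts.items() if abs(s) >= threshold]
```

On real data, `rate` would be compared against the reported 10.7%, and `large_movers` would list the models that move 5 or more positions between the two rankings.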
