Autonomous Coding · 2024 · en
SWE-bench Verified (Agentic)
A human-validated subset of 500 GitHub issues drawn from real Python repositories. For each issue, the model must produce a patch that passes hidden tests. It is a standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.
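As a rough illustration of the pct_resolved metric reported on this page, the sketch below assumes it is simply the share of issues whose patch passes the hidden tests, expressed as a percentage. The function name and the sample data are hypothetical and are not taken from the benchmark's own tooling.

```python
def pct_resolved(resolved_flags):
    """Return the percentage of issues marked as resolved.

    resolved_flags: iterable of booleans, one per issue, where True means
    the model's patch passed the hidden tests (assumed definition).
    """
    flags = list(resolved_flags)
    if not flags:
        raise ValueError("no results to score")
    # sum() counts the True entries because bool is a subclass of int.
    return 100.0 * sum(flags) / len(flags)


# Hypothetical run over four issues, three of them resolved.
example = [True, True, False, True]
print(pct_resolved(example))  # 75.0
```

On the full benchmark, the list would hold one entry for each of the 500 issues.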
Current State of the Art
Claude Opus 4.5 (Anthropic): 80.9 pct_resolved
SWE-bench Verified — pct_resolved
3 results · 1 SOTA advance · higher is better
Top Models Performance Comparison
Top 3 models ranked by pct_resolved
Best Score: 80.9
Top Model: Claude Opus 4.5
Models Compared: 3
Score Range: 6.0