Autonomous Coding

SWE-bench Verified (Agentic)

Human-validated subset of 500 GitHub issues from real Python repositories. Models must produce a patch that passes hidden tests. It is the standard end-to-end benchmark for autonomous coding agents, covering repository navigation, code editing, and test execution.
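An instance counts as resolved only if the model's patch makes the hidden tests pass; the headline number is the share of resolved instances. A minimal sketch of that scoring rule (illustrative names, not the official evaluation harness API):

```python
# Hypothetical sketch of SWE-bench-style scoring: an instance is
# "resolved" only if every hidden test passes after the patch is applied.

def pct_resolved(results: list[bool]) -> float:
    """results[i] is True if instance i's hidden tests all passed."""
    return 100.0 * sum(results) / len(results)

# Toy run over 500 instances with 404 resolved.
toy = [True] * 404 + [False] * 96
print(round(pct_resolved(toy), 1))  # 80.8
```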

Current State of the Art

Claude Opus 4.5 (Anthropic): 80.9 pct_resolved

SWE-bench Verified — pct_resolved

3 results · 1 SOTA advance · higher is better

[Chart: SWE-bench Verified pct_resolved over time (2026–2027), showing all results and the SOTA frontier; Claude Opus 4.5 tops the range at 80.9.]

Top Models Performance Comparison

Top 3 models ranked by pct_resolved

1. Claude Opus 4.5: 80.9 (100.0% of best)
2. Gemini 3 Pro: 78.8 (97.4% of best)
3. GPT-5 Codex: 74.9 (92.6% of best)
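The "% of best" figures above are each model's score normalized to the top score. A quick sketch of that calculation (model names and scores taken from the table; the formatting is an assumption about how the page rounds):

```python
# Normalize each model's pct_resolved to the best score, one decimal place.
scores = {
    "Claude Opus 4.5": 80.9,
    "Gemini 3 Pro": 78.8,
    "GPT-5 Codex": 74.9,
}
best = max(scores.values())
for model, score in scores.items():
    print(f"{model}: {100 * score / best:.1f}% of best")
# Claude Opus 4.5: 100.0% · Gemini 3 Pro: 97.4% · GPT-5 Codex: 92.6%
```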
Best Score: 80.9
Top Model: Claude Opus 4.5
Models Compared: 3
Score Range: 6.0

Primary metric: pct_resolved

| # | Model | Organization | Score | Paper / Code | Date |
|---|-------|--------------|-------|--------------|------|
| 1 | Claude Opus 4.5 | Anthropic | 80.9 | | Apr 2026 |
| 2 | Gemini 3 Pro | Google DeepMind | 78.8 | | Apr 2026 |
| 3 | GPT-5 Codex | OpenAI | 74.9 | | Apr 2026 |