Human-validated subset of 500 GitHub issues from real Python repositories. For each issue, the model must produce a patch that passes hidden tests. A standard end-to-end benchmark for autonomous coding agents, covering repository navigation, editing, and testing.
Pct Resolved, the percentage of issues a model resolves, is the reported evaluation metric for SWE-bench Verified. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
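
As a rough illustration of how a percent-resolved figure can be derived, the sketch below assumes the metric is simply the share of issues whose patch passed the hidden tests, expressed as a percentage. The function name `pct_resolved` and the instance IDs are hypothetical and are not taken from the benchmark's own tooling or from any published result.

```python
def pct_resolved(outcomes: dict[str, bool]) -> float:
    """Return the percentage of issues marked as resolved.

    `outcomes` maps an issue identifier to True when the model's patch
    passed the hidden tests for that issue, and False otherwise.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one issue")
    resolved = sum(1 for passed in outcomes.values() if passed)
    return 100.0 * resolved / len(outcomes)


# Hypothetical per-issue results, for illustration only.
example_outcomes = {
    "example-repo__issue-1": True,
    "example-repo__issue-2": False,
    "example-repo__issue-3": True,
    "example-repo__issue-4": True,
}

score = pct_resolved(example_outcomes)
print(f"Pct Resolved: {score:.1f}")
```

Under this assumption, each resolved issue in a 500-issue set contributes 0.2 percentage points to the score.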
| Rank | Model | Trust | Score (% resolved) | Year | Links |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.5 | paper | 80.9 | 2026 | Source ↗ |
| 02 | Gemini 3 Pro | paper | 78.8 | 2026 | Source ↗ |
| 03 | GPT-5 Codex | paper | 74.9 | 2026 | Source ↗ |