2,294 real GitHub issue→PR pairs across 12 Python repos. The benchmark that redefined what coding evaluation meant: function synthesis was no longer enough; models had to navigate, edit, and test inside real repositories. Verified (the human-filtered 500-task subset) is what every vendor reports.
Frontier attention has moved to SWE-bench Pro, where contamination control and held-out splits drop GPT-5 and Claude Opus 4.1 from 70%+ on Verified to ~23%. Verified scores still dominate vendor announcements; treat them as an inflated ceiling.
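Concretely, each task ships a repository snapshot, the raw issue text, and the tests its fixing PR added. A minimal sketch of one Verified instance, assuming the public Hugging Face release and its documented field names (FAIL_TO_PASS / PASS_TO_PASS are JSON-encoded lists in that release):

```python
# Peek at one SWE-bench Verified task via the public Hugging Face dataset.
# Field names follow the released dataset; the real harness additionally
# builds a Docker image per instance, checks out base_commit, applies the
# model's patch, and runs the project's own test suite.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["instance_id"])              # e.g. "astropy__astropy-12907"
print(task["repo"])                     # repo the issue was filed against
print(task["base_commit"])              # commit the candidate patch must apply to
print(task["problem_statement"][:300])  # the raw GitHub issue text

# Tests that must flip red -> green, and tests that must stay green:
print(json.loads(task["FAIL_TO_PASS"]))
print(json.loads(task["PASS_TO_PASS"]))
```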
The attention path follows the leaderboard frontier: where every credible vendor reports next once the previous benchmark stops separating models. SWE-bench was the first scope shift, from function-level problems to repository-level engineering. The frontier has now moved past it.
APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). In September 2025, OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts, and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current attention path: 1,865 problems across public, commercial, and held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.
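The pass@1 figures quoted here are conventionally the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), computed over n samples per problem rather than a single greedy run. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of
    k samples, drawn from n generations of which c are correct, passes.
    Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable form."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws: always passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 200 generations per problem, 190 correct -> pass@1 = 0.95 (i.e. "95%")
print(pass_at_k(200, 190, 1))
```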
Each card is a node in the curated coding lineage. Edges are typed: a scope shift means leaderboard attention jumped to a new kind of task; a direct successor means the same task with a sharper test set.
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them; results can be filtered to problems posted after a model's training cutoff, eliminating contamination. This is where the leaderboard moved once HumanEval+ also began saturating.
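The filtering itself is just date arithmetic. A sketch under assumed field names (release_date, a per-model cutoff table); LiveCodeBench's actual schema may differ:

```python
# Contamination control by date: score a model only on problems published
# after its training cutoff. Field names and the cutoff value below are
# illustrative, not LiveCodeBench's actual schema.
from datetime import date

TRAINING_CUTOFFS = {"gpt-5": date(2025, 3, 1)}  # hypothetical cutoff

def contamination_free(problems: list[dict], model: str) -> list[dict]:
    """Keep only problems the model cannot have seen during training."""
    cutoff = TRAINING_CUTOFFS[model]
    return [p for p in problems if p["release_date"] > cutoff]

problems = [
    {"id": "atcoder/abc399_f", "release_date": date(2025, 4, 5)},
    {"id": "leetcode/two-sum", "release_date": date(2015, 5, 1)},
]
print(contamination_free(problems, "gpt-5"))  # only the post-cutoff problem
```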
From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
500 SWE-bench tasks human-confirmed solvable, with sufficient issue information and a passing test. The agentic-coding standard until September 2025, when OpenAI publicly stopped evaluating on it, citing flawed tests that reward shortcuts and training-data leakage that inflates scores.
Human-filtered subset of 500 verified-solvable tasks. The original SWE-bench is rarely quoted now; Verified is what agentic-coding evals report.
1,865 problems across public, commercial, and held-out splits, sourced from 41 actively maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems: GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier OpenAI now reports.
OpenAI publicly stopped evaluating on Verified in September 2025: contamination and shortcut-rewarding tests had inflated scores. Pro adds held-out splits, commercial repos, and contamination control. GPT-5 and Claude Opus 4.1 drop from >70% on Verified to ~23% on Pro.
On launch day in October 2023, Claude 2 resolved 1.96% of issues end-to-end. By April 2026, Claude Opus 4.7 reached 87.6%. Each row is a record. The vertical bar is the score; the marker to its right is the model that set it.
Resolve rate on SWE-bench Verified, the human-filtered subset every vendor reports. The shaded row marks SOTA. Numbers reflect each model evaluated under a credible standardized harness; some are vendor-internal runs. Treat the ~50-point gap to Pro as the contamination tax.
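"Resolved" has a strict meaning in the harness: the patch must make the issue's failing tests pass without breaking any previously passing test. A sketch, with run_test standing in for the harness's containerized test execution:

```python
# Resolve criterion sketch: run_test(test_id) -> bool stands in for the
# harness's per-instance Docker execution of one project test.

def resolved(fail_to_pass: list[str], pass_to_pass: list[str], run_test) -> bool:
    """True iff every FAIL_TO_PASS test now passes and every PASS_TO_PASS
    test still passes after the model's patch is applied."""
    return (all(run_test(t) for t in fail_to_pass)
            and all(run_test(t) for t in pass_to_pass))

# Toy usage with a stubbed runner; resolve rate is then just
# resolved-instances / total-instances, reported as a percentage.
passing = {"tests/test_fix.py::test_issue", "tests/test_core.py::test_old"}
print(resolved(["tests/test_fix.py::test_issue"],
               ["tests/test_core.py::test_old"],
               run_test=lambda t: t in passing))  # True
```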
| # | Model | Org | Family | Params | Type | Submitted | Resolve % |
|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | Anthropic | Claude | Undisclosed | API | Apr 2026 | 87.6 |
| 02 | Claude Opus 4.5 | Anthropic | Claude | Undisclosed | API | Feb 2026 | 80.9 |
| 03 | MiniMax M2.5 | MiniMax | MiniMax | 229B | OSS | Jan 2026 | 80.2 |
| 04 | GPT-5.2 | OpenAI | GPT | Undisclosed | API | Feb 2026 | 80.0 |
| 05 | Claude Opus 4.6 | Anthropic | Claude | Undisclosed | API | Feb 2026 | 79.8 |
| 06 | GLM-5 | Zhipu AI | GLM | 130B | OSS | Jan 2026 | 77.8 |
| 07 | Gemini 3 Pro | Google | Gemini | Undisclosed | API | Jan 2026 | 77.4 |
| 08 | Claude Sonnet 4.5 | Anthropic | Claude | Undisclosed | API | Dec 2025 | 77.2 |
| 09 | Kimi K2.5 | Moonshot AI | Kimi | Undisclosed | API | Jan 2026 | 76.8 |
| 10 | DeepSeek R1 | DeepSeek | DeepSeek | 671B MoE | OSS | Dec 2025 | 76.3 |
| 11 | Gemini 3 Flash | Google | Gemini | Undisclosed | API | Feb 2026 | 75.8 |
| 12 | Qwen3-Max-Thinking | Alibaba | Qwen | MoE | OSS | Feb 2026 | 75.3 |
| 13 | DeepSeek V3.5 | DeepSeek | DeepSeek | 685B MoE | OSS | Nov 2025 | 74.6 |
| 14 | Step-3.5-Flash | StepFun | Step | Unknown | OSS | Jan 2026 | 74.4 |
| 15 | Qwen3 72B | Alibaba | Qwen | 72B | OSS | Oct 2025 | 72.4 |
| 16 | DeepSeek-Coder V2.5 | DeepSeek | DeepSeek | 236B MoE | OSS | Aug 2025 | 68.2 |
| 17 | Qwen2.5-Coder 32B | Alibaba | Qwen | 32B | OSS | Jun 2025 | 55.4 |
| 18 | CodeLlama 70B | Meta | CodeLlama | 70B | OSS | Dec 2024 | 29.8 |
| 19 | StarCoder2 15B | BigCode | StarCoder | 15B | OSS | Oct 2024 | 18.3 |
| 20 | DeepSeek-Coder 33B | DeepSeek | DeepSeek | 33B | OSS | Jun 2024 | 15.6 |
In late 2024 the gap between open and closed models was 30+ points. By early 2026, MiniMax M2.5 (open) lands within 8 points of Anthropic's frontier. Self-hostable code models are now production-viable for most repository workloads.
Coding evals compared by what they ask the model to do. The two highlighted rows, SWE-bench and Verified, are what this page tracks. The next row, Pro, is where attention has moved.
| Benchmark | Focus | Tasks | Scope | Tests | Top score |
|---|---|---|---|---|---|
| HumanEval | Function synthesis | 164 | Single fn | Hand-written unit tests | ~98% |
| LiveCodeBench | Competitive coding | Rolling | Single file | I/O matching | ~70% |
| SWE-bench | Repo-scale SE | 2,294 | Multi-file | Project test suites | Often 70%+ (noisy) |
| SWE-bench Verified | Repo-scale SE (filtered) | 500 | Multi-file | Project test suites | 87.6% |
| SWE-bench Pro | Held-out + commercial | 1,865 | Multi-file | Extended + held-out | ~23% (frontier) |
| Multi-SWE-bench | Multi-language fork | ~1,500 | Multi-file | Project test suites | ~50% |