Codesota · Code · Code Generation · SWE-bench
Coding lineage · scope shift from contests to repositories · Oct 2023

SWE-bench.

2,294 real GitHub issue→PR pairs across 12 Python repos. The benchmark that redefined what coding evaluation meant — function synthesis was no longer enough; models had to navigate, edit and test inside repositories of record. Verified (the human-filtered 500-task subset) is what every vendor reports.
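Concretely, each task is a frozen repo state plus an issue text; the gold patch and the tests that arbitrate success ship alongside it. A minimal look at one Verified task via the Hugging Face datasets loader (dataset IDs and field names are as published by the Princeton maintainers; verify them against the current release before relying on them):

from datasets import load_dataset

# Human-filtered 500-task subset; the unfiltered original is
# "princeton-nlp/SWE-bench" (2,294 tasks in its test split).
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(verified))  # 500

task = verified[0]
print(task["repo"])               # one of the 12 source Python repos, e.g. "astropy/astropy"
print(task["base_commit"])        # repo state to check out before attempting a fix
print(task["problem_statement"])  # the GitHub issue text the model is given
# task["patch"] holds the gold PR diff; task["FAIL_TO_PASS"] names the tests
# the fix must turn green, task["PASS_TO_PASS"] the tests it must leave passing.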

Lineage status · Superseded · OpenAI publicly stopped reporting on Verified · Sep 2025

Frontier attention has moved to SWE-bench Pro, where contamination control and held-out splits drop GPT-5 / Claude Opus 4.1 from 70%+ on Verified to ~23%. Verified scores still dominate vendor announcements; treat them as ceiling-distorted.

Official site · Read the paper · Full coding lineage
§ 01 · Lineage

Where SWE-bench fits in coding eval.

The attention path follows the leaderboard frontier — where every credible vendor reports next, once the previous benchmark stops separating models. SWE-bench was the first scope shift from function-level problems to repository-level engineering. The frontier has now moved past it.

APPS · May 2021 · Saturated
HumanEval · Jul 2021 · Saturated
HumanEval+ · May 2023 · Active
LiveCodeBench · Sep 2023 · Active
SWE-bench · Oct 2023 · Superseded · ◆ this page
SWE-bench Verified · Aug 2024 · Saturating
SWE-bench Pro · Sep 2025 · Active · ← attention now
Editor's note · 2026-04-26

APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP had both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). As of 2025-09, OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts, and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.

§ 02 · Context

What changed,
and what changed it.

Each card is a node from the curated coding lineage. Edges are typed: scope shift means leaderboard attention jumped tasks; direct successor means same task, sharper test set.

In-edge · scope shift
LiveCodeBench
SWE-bench
Sep 2023 → Oct 2023

Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them; results can be filtered to problems posted after a model's training cutoff, eliminating contamination (the filter is sketched below this card). Where the leaderboard moved once HumanEval+ also began saturating.

From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.

See in lineage graph →
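The contamination filter referenced above is one line of logic: keep only problems published after the model's training cutoff. A minimal sketch with hypothetical problem records (the real benchmark tags each scraped problem with its contest date; the record shape here is an illustration, not LiveCodeBench's schema):

from datetime import date

# Hypothetical records; LiveCodeBench stamps each scraped problem with
# the date it appeared on LeetCode, AtCoder, or Codeforces.
problems = [
    {"id": "lc-3142", "posted": date(2024, 5, 4)},
    {"id": "cf-1942C", "posted": date(2024, 3, 30)},
]

def post_cutoff(problems, training_cutoff):
    # Only problems the model cannot have seen during training count.
    return [p for p in problems if p["posted"] > training_cutoff]

print(post_cutoff(problems, date(2024, 4, 1)))  # only "lc-3142" survives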
◆ This page
SWE-bench
Superseded · Oct 2023
SWE-bench (original, unfiltered)

2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.

Jimenez et al. (Princeton) · paper
Out-edge · direct successor (de-facto leaderboard)
SWE-bench
SWE-bench Verified
Oct 2023 → Aug 2024

500 SWE-bench tasks human-confirmed solvable, with sufficient issue information and a passing test (the resolve check is sketched below this card). It was the agentic-coding standard until 2025; OpenAI publicly stopped evaluating on it in Sep 2025, citing flawed tests that reward shortcuts plus training-data leakage that inflates scores.

Human-filtered subset of 500 verified-solvable tasks. The original SWE-bench is rarely quoted now; Verified is what agentic-coding evals report.

See in lineage graph →
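Scoring on SWE-bench and Verified is binary per task: apply the model's patch, run the project's own suite, and demand that the issue's failing tests now pass while the previously passing tests stay green. A minimal sketch of that check; gathering the test outcomes is elided here (the official harness runs each repo's suite inside a per-repo Docker image):

# `outcomes` maps test id -> True/False after the model's patch is applied.
def is_resolved(outcomes, fail_to_pass, pass_to_pass):
    fixed = all(outcomes.get(t, False) for t in fail_to_pass)     # issue actually fixed
    unbroken = all(outcomes.get(t, False) for t in pass_to_pass)  # nothing regressed
    return fixed and unbroken

# The leaderboard metric is just the fraction of tasks resolved.
def resolve_rate(per_task_results):
    return 100 * sum(per_task_results) / len(per_task_results)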
Out-edge · current attention path
SWE-bench Verified
SWE-bench Pro
Aug 2024 → Sep 2025

1,865 problems across public/commercial/held-out splits sourced from 41 actively-maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems — GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier OpenAI now reports.

Same repo-scale task, rebuilt test set: held-out and commercial splits keep solutions out of training data, and extended tests close the shortcuts Verified rewarded. This edge is where vendor reporting is moving.

See in lineage graph →
§ 03 · SOTA

1.96% → 87.6%, in 30 months.

On launch day in October 2023, Claude 2 resolved 1.96% of issues end-to-end. By April 2026, Claude Opus 4.7 reached 87.6%. Each row below is a record: date, score, and the model that set it.

Oct 2023 · 1.96% · Claude 2 · SWE-bench launch, raw LM baseline
Mar 2024 · 12.5% · GPT-4 Turbo · First strong code model
Jun 2024 · 19% · GPT-4o
Aug 2024 · 27% · Claude 3.5 Sonnet · Anthropic enters
Oct 2024 · 36.2% · o1-preview · Reasoning-enhanced
Dec 2024 · 49% · Claude 3.5 Sonnet v2
Mar 2025 · 55.2% · Claude Opus 4
Jun 2025 · 62% · GPT-4.5
Sep 2025 · 70.8% · Claude Sonnet 4.5 · OpenAI stops reporting on Verified
Dec 2025 · 78% · Claude Opus 4.5
Jan 2026 · 80.2% · MiniMax M2.5 (open) · First open model above 80%
Apr 2026 · 87.6% · Claude Opus 4.7 · Current SOTA
Fig 2 · SWE-bench Verified resolve rate, by record-setting model. The Sep 2025 break is the OpenAI Verified-deprecation announcement; SOTA on Verified continued to climb regardless, on cleaner harnesses and better models.
§ 04 · Leaderboard

Best published scores.

Resolve rate on SWE-bench Verified — the human-filtered subset every vendor reports. Shaded row marks SOTA. Numbers reflect each model evaluated under a credible standardized harness; some are vendor-internal runs. Treat the gap to Pro (~50 points lower) as the contamination tax.


Metric · resolve % · higher is better
Subset · Verified (500/2,294)
Rows · 20
# · Model · Org · Family · Params · Type · Submitted · resolve %
01 · Claude Opus 4.7 · Anthropic · Claude · Undisclosed · API · Apr 2026 · 87.6
02 · Claude Opus 4.5 · Anthropic · Claude · Undisclosed · API · Feb 2026 · 80.9
03 · MiniMax M2.5 · MiniMax · MiniMax · 229B · OSS · Jan 2026 · 80.2
04 · GPT-5.2 · OpenAI · GPT · Undisclosed · API · Feb 2026 · 80.0
05 · Claude Opus 4.6 · Anthropic · Claude · Undisclosed · API · Feb 2026 · 79.8
06 · GLM-5 · Zhipu AI · GLM · 130B · OSS · Jan 2026 · 77.8
07 · Gemini 3 Pro · Google · Gemini · Undisclosed · API · Jan 2026 · 77.4
08 · Claude Sonnet 4.5 · Anthropic · Claude · Undisclosed · API · Dec 2025 · 77.2
09 · Kimi K2.5 · Moonshot AI · Kimi · Undisclosed · API · Jan 2026 · 76.8
10 · DeepSeek R1 · DeepSeek · DeepSeek · 671B MoE · OSS · Dec 2025 · 76.3
11 · Gemini 3 Flash · Google · Gemini · Undisclosed · API · Feb 2026 · 75.8
12 · Qwen3-Max-Thinking · Alibaba · Qwen · MoE · OSS · Feb 2026 · 75.3
13 · DeepSeek V3.5 · DeepSeek · DeepSeek · 685B MoE · OSS · Nov 2025 · 74.6
14 · Step-3.5-Flash · StepFun · Step · Unknown · OSS · Jan 2026 · 74.4
15 · Qwen3 72B · Alibaba · Qwen · 72B · OSS · Oct 2025 · 72.4
16 · DeepSeek-Coder V2.5 · DeepSeek · DeepSeek · 236B MoE · OSS · Aug 2025 · 68.2
17 · Qwen2.5-Coder 32B · Alibaba · Qwen · 32B · OSS · Jun 2025 · 55.4
18 · CodeLlama 70B · Meta · CodeLlama · 70B · OSS · Dec 2024 · 29.8
19 · StarCoder2 15B · BigCode · StarCoder · 15B · OSS · Oct 2024 · 18.3
20 · DeepSeek-Coder 33B · DeepSeek · DeepSeek · 33B · OSS · Jun 2024 · 15.6
Fig 3 · Vendor-reported resolve rates on SWE-bench Verified. Frontier proprietary models lead, but the open-vs-closed gap at the top is 7.4 points and shrinking — the second tier is now self-hostable.
§ 05 · Open vs closed

The gap is 7.4 points.

In late 2024 the gap was 30+ points. By early 2026, MiniMax M2.5 (open) lands within 8 points of Anthropic's frontier. Self-hostable code models are now production-viable for most repository workloads.

Open-weight avg · 59.9% · 12 models · top: MiniMax M2.5 · 80.2%
API/closed avg · 79.4% · 8 models · top: Claude Opus 4.7 · 87.6%
Frontier gap · 7.4 pp · Claude Opus 4.7 − MiniMax M2.5, narrowing
§ 06 · Compare

Same lineage, different tests.

Coding evals, compared by what they ask the model to do. The SWE-bench and Verified rows are what this page tracks; the next row, Pro, is where attention has moved.

Benchmark · Focus · Tasks · Scope · Tests · Top score
HumanEval · Function synthesis · 164 · Single fn · Hand-written unit tests · ~98%
LiveCodeBench · Competitive coding · Rolling · Single file · I/O matching · ~70%
SWE-bench · Repo-scale SE · 2,294 · Multi-file · Project test suites · Often 70%+ (noisy)
SWE-bench Verified · Repo-scale SE (filtered) · 500 · Multi-file · Project test suites · 87.6%
SWE-bench Pro · Held-out + commercial · 1,865 · Multi-file · Extended + held-out · ~23% (frontier)
Multi-SWE-bench · Multi-language fork · ~1,500 · Multi-file · Project test suites · ~50%
§ 07 · Resources

Papers and code.

Key papers
Repositories

See the full coding lineage · All code-generation benchmarks · SWE-bench, explained