2,294 real GitHub issue→PR pairs across 12 Python repos. The benchmark that redefined what coding evaluation meant: function synthesis was no longer enough; models had to navigate, edit, and test inside real repositories. Verified (the human-filtered 500-task subset) is what every vendor reports.
Frontier attention has moved to SWE-bench Pro, where contamination control and held-out splits drop GPT-5 and Claude Opus 4.1 from 70%+ on Verified to ~23%. Verified scores still dominate vendor announcements; treat them as an inflated ceiling.
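Concretely, each task ships a repository snapshot, the raw issue text, and the tests its fixing PR added. A minimal sketch of one Verified instance, assuming the public Hugging Face release and its documented field names (FAIL_TO_PASS / PASS_TO_PASS are JSON-encoded lists in that release):

```python
# Peek at one SWE-bench Verified task via the public Hugging Face dataset.
# Field names follow the released dataset; the real harness additionally
# builds a Docker image per instance, checks out base_commit, applies the
# model's patch, and runs the project's own test suite.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["instance_id"])              # e.g. "astropy__astropy-12907"
print(task["repo"])                     # repo the issue was filed against
print(task["base_commit"])              # commit the candidate patch must apply to
print(task["problem_statement"][:300])  # the raw GitHub issue text

# Tests that must flip red -> green, and tests that must stay green:
print(json.loads(task["FAIL_TO_PASS"]))
print(json.loads(task["PASS_TO_PASS"]))
```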
The attention path follows the leaderboard frontier: where every credible vendor reports next once the previous benchmark stops separating models. SWE-bench was the first scope shift, from function-level problems to repository-level engineering. The frontier has now moved past it.
APPS (2021-05) was the first widely cited coding benchmark of the post-Codex era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023: frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). In September 2025, OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts, and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current attention path: 1,865 problems across public, commercial, and held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.
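The pass@1 figures quoted here are conventionally the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), computed over n samples per problem rather than a single greedy run. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): probability that at least one of
    k samples, drawn from n generations of which c are correct, passes.
    Equals 1 - C(n-c, k) / C(n, k), computed in a numerically stable form."""
    if n - c < k:
        return 1.0  # too few wrong samples to fill k draws: always passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 200 generations per problem, 190 correct -> pass@1 = 0.95 (i.e. "95%")
print(pass_at_k(200, 190, 1))
```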
Each card is a node in the curated coding lineage. Edges are typed: a scope shift means leaderboard attention jumped to a new kind of task; a direct successor means the same task with a sharper test set.
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them; results can be filtered to problems posted after a model's training cutoff, eliminating contamination. This is where the leaderboard moved once HumanEval+ also began saturating.
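The filtering itself is just date arithmetic. A sketch under assumed field names (release_date, a per-model cutoff table); LiveCodeBench's actual schema may differ:

```python
# Contamination control by date: score a model only on problems published
# after its training cutoff. Field names and the cutoff value below are
# illustrative, not LiveCodeBench's actual schema.
from datetime import date

TRAINING_CUTOFFS = {"gpt-5": date(2025, 3, 1)}  # hypothetical cutoff

def contamination_free(problems: list[dict], model: str) -> list[dict]:
    """Keep only problems the model cannot have seen during training."""
    cutoff = TRAINING_CUTOFFS[model]
    return [p for p in problems if p["release_date"] > cutoff]

problems = [
    {"id": "atcoder/abc399_f", "release_date": date(2025, 4, 5)},
    {"id": "leetcode/two-sum", "release_date": date(2015, 5, 1)},
]
print(contamination_free(problems, "gpt-5"))  # only the post-cutoff problem
```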
From contest-style problems to real-world software engineering — issues, multi-file edits, regression tests. Different task, but the same field's frontier.
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
500 SWE-bench tasks human-confirmed solvable, with sufficient issue information and a passing test. The agentic-coding standard until September 2025, when OpenAI publicly stopped evaluating on it, citing flawed tests that reward shortcuts and training-data leakage that inflates scores.
Human-filtered subset of 500 verified-solvable tasks. The original SWE-bench is rarely quoted now; Verified is what agentic-coding evals report.
1,865 problems across public, commercial, and held-out splits, sourced from 41 actively maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems: GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier OpenAI now reports.
OpenAI publicly stopped evaluating on Verified in September 2025: contamination and shortcut-rewarding tests had inflated scores. Pro adds held-out splits, commercial repos, and contamination control. GPT-5 and Claude Opus 4.1 drop from >70% on Verified to ~23% on Pro.
On launch day in October 2023, Claude 2 resolved 1.96% of issues end-to-end. By April 2026, Claude Opus 4.7 reached 87.6%. Each row is a record. The vertical bar is the score; the marker to its right is the model that set it.
Resolve rate on SWE-bench Verified, the human-filtered subset every vendor reports. The shaded row marks SOTA. Numbers reflect each model evaluated under a credible standardized harness; some are vendor-internal runs. Treat the ~50-point gap to Pro as the contamination tax.
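"Resolved" has a strict meaning in the harness: the patch must make the issue's failing tests pass without breaking any previously passing test. A sketch, with run_test standing in for the harness's containerized test execution:

```python
# Resolve criterion sketch: run_test(test_id) -> bool stands in for the
# harness's per-instance Docker execution of one project test.

def resolved(fail_to_pass: list[str], pass_to_pass: list[str], run_test) -> bool:
    """True iff every FAIL_TO_PASS test now passes and every PASS_TO_PASS
    test still passes after the model's patch is applied."""
    return (all(run_test(t) for t in fail_to_pass)
            and all(run_test(t) for t in pass_to_pass))

# Toy usage with a stubbed runner; resolve rate is then just
# resolved-instances / total-instances, reported as a percentage.
passing = {"tests/test_fix.py::test_issue", "tests/test_core.py::test_old"}
print(resolved(["tests/test_fix.py::test_issue"],
               ["tests/test_core.py::test_old"],
               run_test=lambda t: t in passing))  # True
```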
| # | Model | Org | Family | Params | Type | Submitted | Resolve % |
|---|---|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | Anthropic | Claude | Undisclosed | API | Apr 2026 | 87.6 |
| 02 | Claude Opus 4.5 | Anthropic | Claude | Undisclosed | API | Feb 2026 | 80.9 |
| 03 | MiniMax M2.5 | MiniMax | MiniMax | 229B | OSS | Jan 2026 | 80.2 |
| 04 | GPT-5.2 | OpenAI | GPT | Undisclosed | API | Feb 2026 | 80.0 |
| 05 | Claude Opus 4.6 | Anthropic | Claude | Undisclosed | API | Feb 2026 | 79.8 |
| 06 | GLM-5 | Zhipu AI | GLM | 130B | OSS | Jan 2026 | 77.8 |
| 07 | Gemini 3 Pro | Google | Gemini | Undisclosed | API | Jan 2026 | 77.4 |
| 08 | Claude Sonnet 4.5 | Anthropic | Claude | Undisclosed | API | Dec 2025 | 77.2 |
| 09 | Kimi K2.5 | Moonshot AI | Kimi | Undisclosed | API | Jan 2026 | 76.8 |
| 10 | DeepSeek R1 | DeepSeek | DeepSeek | 671B MoE | OSS | Dec 2025 | 76.3 |
| 11 | Gemini 3 Flash | Google | Gemini | Undisclosed | API | Feb 2026 | 75.8 |
| 12 | Qwen3-Max-Thinking | Alibaba | Qwen | MoE | OSS | Feb 2026 | 75.3 |
| 13 | DeepSeek V3.5 | DeepSeek | DeepSeek | 685B MoE | OSS | Nov 2025 | 74.6 |
| 14 | Step-3.5-Flash | StepFun | Step | Unknown | OSS | Jan 2026 | 74.4 |
| 15 | Qwen3 72B | Alibaba | Qwen | 72B | OSS | Oct 2025 | 72.4 |
| 16 | DeepSeek-Coder V2.5 | DeepSeek | DeepSeek | 236B MoE | OSS | Aug 2025 | 68.2 |
| 17 | Qwen2.5-Coder 32B | Alibaba | Qwen | 32B | OSS | Jun 2025 | 55.4 |
| 18 | CodeLlama 70B | Meta | CodeLlama | 70B | OSS | Dec 2024 | 29.8 |
| 19 | StarCoder2 15B | BigCode | StarCoder | 15B | OSS | Oct 2024 | 18.3 |
| 20 | DeepSeek-Coder 33B | DeepSeek | DeepSeek | 33B | OSS | Jun 2024 | 15.6 |
In late 2024 the gap between open and closed models was 30+ points. By early 2026, MiniMax M2.5 (open) lands within 8 points of Anthropic's frontier. Self-hostable code models are now production-viable for most repository workloads.
Coding evals compared by what they ask the model to do. The two highlighted rows, SWE-bench and Verified, are what this page tracks. The next row, Pro, is where attention has moved.
| Benchmark | Focus | Tasks | Scope | Tests | Top score |
|---|---|---|---|---|---|
| HumanEval | Function synthesis | 164 | Single fn | Hand-written unit tests | ~98% |
| LiveCodeBench | Competitive coding | Rolling | Single file | I/O matching | ~70% |
| SWE-bench | Repo-scale SE | 2,294 | Multi-file | Project test suites | Often 70%+ (noisy) |
| SWE-bench Verified | Repo-scale SE (filtered) | 500 | Multi-file | Project test suites | 87.6% |
| SWE-bench Pro | Held-out + commercial | 1,865 | Multi-file | Extended + held-out | ~23% (frontier) |
| Multi-SWE-bench | Multi-language fork | ~1,500 | Multi-file | Project test suites | ~50% |