SWE-bench Verified Subset.

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

§ 01 · Leaderboard

Best published scores.

39 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.

Primary: resolve-rate · higher is better

#	Model	Access	Org	Submitted	Paper / code	resolve-rate
01	Claude Opus 4.7	—	—	Apr 2026	vendor	87.60
02	Claude Opus 4.5	API	Anthropic	Nov 2025	anthropic-blog	80.90
03	Claude Opus 4.6	API	Anthropic	Feb 2026	anthropic-blog	80.80
04	Gemini 3.1 Pro	API	Google	Feb 2026	google-blog	80.60
05	MiniMax M2.5	OSS	MiniMax	Feb 2026	minimax-blog	80.20
06	GPT-5.2 Thinking	API	OpenAI	Dec 2025	openai-blog	80.00
07	Claude Sonnet 4.6	API	Anthropic	Feb 2026	anthropic-blog	79.60
08	Gemini 3 Flash	API	Google	Dec 2025	google-blog	78.00
09	Claude Sonnet 4.5	API	Anthropic	Mar 2026	anthropic-blog	77.20
10	Kimi K2.5	API	Moonshot AI	Mar 2026	moonshot-blog	76.80
11	GPT-5.1	API	OpenAI	Mar 2026	openai-blog	76.30
12	Gemini 3 Pro	API	Google	Mar 2026	google-blog	76.20
13	GPT-5	API	OpenAI	Mar 2026	openai-blog	74.90
14	MiniMax M2.1	API	MiniMax	Mar 2026	minimax-blog	74.00
15	Claude Haiku 4.5	API	Anthropic	Mar 2026	anthropic-blog	73.30
16	Claude Sonnet 4	API	Anthropic	Mar 2026	anthropic-blog	72.70
17	Claude Opus 4	API	Anthropic	Mar 2026	anthropic-blog	72.50
18	Devstral 2	OSS	Mistral	Mar 2026	mistral-blog	72.20
19	Qwen3-Coder 480B A35B	OSS	Alibaba Cloud	Mar 2026	qwen-blog	69.60
20	MiniMax M2	API	MiniMax	Mar 2026	minimax-blog	69.40
21	o3	API	OpenAI	Mar 2026	openai-blog	69.10
22	o4-mini	API	OpenAI	Mar 2026	swebench-leaderboard	68.10
23	DeepSeek-V3.1	OSS	DeepSeek	Mar 2026	deepseek-blog	66.00
24	Kimi-K2	OSS	Moonshot AI	Mar 2026	kimi-techreport	65.80
25	Grok 3	API	xAI	Mar 2026	xai-blog	63.80
26	Gemini 2.5 Pro	API	Google	Mar 2026	google-blog	63.80
27	Claude 3.7 Sonnet	API	Anthropic	Mar 2026	anthropic-blog	63.70
28	Gemini 2.5 Flash	API	Google	Mar 2026	google-blog	60.40
29	DeepSeek-R1-0528	OSS	DeepSeek	Mar 2026	deepseek-blog	57.60
30	o3-mini	API	OpenAI	Mar 2026	swebench-leaderboard	55.80
31	GPT-4.1	API	OpenAI	Mar 2026	swebench-leaderboard	54.60
32	Claude 3.5 Sonnet	API	Anthropic	Mar 2026	anthropic-blog	50.80
33	DeepSeek R1	OSS	DeepSeek	Mar 2026	swebench-leaderboard	49.20
34	o1	API	OpenAI	Mar 2026	swebench-leaderboard	48.90
35	Devstral Small 2505	OSS	Mistral	Mar 2026	mistral-blog	46.80
36	DeepSeek-V3	OSS	DeepSeek	Mar 2026	swebench-leaderboard	42.00
37	GPT-4o	API	OpenAI	Mar 2026	swebench-leaderboard	41.20
38	Claude 3.5 Haiku	API	Anthropic	Mar 2026	anthropic-blog	40.60
39	DeepSeek-V2.5	OSS	DeepSeek	Mar 2026	deepseek-blog	37.00

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
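
The ordering rule stated for this table (score descending, ties broken by submission date) can be sketched as a two-part sort key. This is a minimal illustration, not the site's actual code: the rows are hypothetical placeholders, and treating the earlier submission as the tie winner is an assumption, since the page does not say which direction the tie-break runs.

```python
from datetime import date

# Hypothetical rows (model, submitted, resolve_rate); not taken from the table.
rows = [
    ("model-a", date(2026, 3, 10), 63.80),
    ("model-b", date(2026, 2, 1), 63.80),
    ("model-c", date(2026, 4, 18), 87.60),
]


def rank(rows):
    """Sort by score, highest first; on equal scores the earlier date wins (assumed)."""
    return sorted(rows, key=lambda row: (-row[2], row[1]))


ranked = rank(rows)
for position, (model, submitted, score) in enumerate(ranked, start=1):
    print(f"{position:02d}  {model:<10} {submitted:%b %Y}  {score:.2f}")
```

Here model-b and model-a share a score, so model-b ranks higher only because of its earlier date. When two rows also share a date, Python's stable sort leaves them in input order.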

§ 03 · Progress

2 steps of state of the art.

Each row below marks a model that broke the previous record on resolve-rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · resolve-rate

Date	Model	Org	resolve-rate
Nov 24, 2025	Claude Opus 4.5	Anthropic	80.90
Apr 18, 2026	Claude Opus 4.7	—	87.60

Fig 3 · SOTA-setting models only. 2 entries span Nov 2025 → Apr 2026.
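
The SOTA line can be reproduced as a running maximum over entries in date order: an entry is a step only if its score strictly beats every earlier one. The sketch below uses four rows from the leaderboard. The function name `sota_steps` is illustrative, and the day of the month is assumed to be the 1st where the table gives only a month.

```python
from datetime import date

# (submitted, model, resolve_rate). Days marked "assumed" are not in the source.
entries = [
    (date(2025, 11, 24), "Claude Opus 4.5", 80.90),
    (date(2025, 12, 1), "GPT-5.2 Thinking", 80.00),  # day assumed
    (date(2026, 2, 1), "Claude Opus 4.6", 80.80),    # day assumed
    (date(2026, 4, 18), "Claude Opus 4.7", 87.60),
]


def sota_steps(entries):
    """Return the entries that strictly improved on the best score seen so far."""
    best = float("-inf")
    steps = []
    for submitted, model, score in sorted(entries):  # tuples sort by date first
        if score > best:
            best = score
            steps.append((submitted, model, score))
    return steps


steps = sota_steps(entries)
for submitted, model, score in steps:
    print(f"{submitted:%b %d, %Y}  {model}  {score:.2f}")
```

On these four rows the filter keeps Claude Opus 4.5 and Claude Opus 4.7, matching the two steps listed above. The two entries in between score below 80.90 and do not set a record.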

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with frozen commit + seed
03 Declared evaluation environment (Python, deps)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
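
As a rough pre-flight check, the five requirements above could be expressed as a small manifest validator. This is only a sketch: the page does not define a submission format, so every field name here (`checkpoint_or_endpoint`, `repro_script`, `commit`, `seed`, `environment`, `metrics`, `contact`) is hypothetical.

```python
# Hypothetical field names mirroring the five listed requirements.
REQUIRED_FIELDS = [
    "checkpoint_or_endpoint",  # 01
    "repro_script",            # 02
    "commit",                  # 02 (frozen commit)
    "seed",                    # 02
    "environment",             # 03
    "metrics",                 # 04
    "contact",                 # 05
]
DECLARED_METRICS = ["resolve-rate"]  # the one metric this dataset declares


def problems(submission):
    """Return a list of human-readable problems; an empty list means it looks complete."""
    found = []
    for field in REQUIRED_FIELDS:
        # A seed of 0 is valid, so test for absent or empty rather than falsy.
        if field not in submission or submission[field] in (None, "", [], {}):
            found.append(f"missing field: {field}")
    for metric in DECLARED_METRICS:
        if metric not in submission.get("metrics", {}):
            found.append(f"missing metric row: {metric}")
    return found


example = {
    "checkpoint_or_endpoint": "https://example.com/model",  # placeholder URL
    "repro_script": "run_eval.sh",
    "commit": "abc1234",
    "seed": 0,
    "environment": {"python": "3.11", "deps": "requirements.txt"},
    "metrics": {"resolve-rate": 50.0},  # placeholder score
    "contact": "someone@example.com",
}
print(problems(example))
```

The example manifest passes with no reported problems. Removing `contact` or the `resolve-rate` row would produce one message each.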
