SWE-bench · 2024 · python

SWE-bench Verified — Agentic Leaderboard

500 manually verified GitHub issues, each confirmed solvable by human engineers. The primary benchmark for software-engineering agents; results here track full autonomous scaffolds, not just raw model capability.

Samples: 500
Metrics: resolve-rate
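A resolved instance is one where the agent's generated patch makes the instance's fail-to-pass tests pass without breaking its pass-to-pass tests; resolve-rate is the resolved fraction of the 500 instances, as a percentage. A minimal sketch of that arithmetic (the dict layout is illustrative, not the official harness's report schema):

```python
# Sketch: computing resolve-rate from per-instance results.
# The 'resolved' field here is an assumption for illustration;
# the official SWE-bench harness emits its own report format.
def resolve_rate(results: list[dict]) -> float:
    """results: one dict per benchmark instance, with a boolean
    'resolved' flag (fail-to-pass tests pass, pass-to-pass intact)."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["resolved"])
    return 100.0 * resolved / len(results)

# e.g. 404 of 500 instances resolved -> 80.8
sample = [{"resolved": i < 404} for i in range(500)]
print(round(resolve_rate(sample), 1))  # 80.8
```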
Current State of the Art

Claude Opus 4.5 (Anthropic): 80.9 resolve-rate

resolve-rate Progress Over Time

Showing 9 breakthroughs from Feb 2024 to Mar 2026

[Chart: resolve-rate vs. date, Feb 2024 to Mar 2026; y-axis ticks 18.0 to 86.6]

Key Milestones

Feb 2024 · SWE-agent (GPT-4o) · 23.7
SWE-agent with GPT-4o backbone. Table 2, arxiv:2402.07927. Resolves 23.7% of SWE-bench Verified.

May 2024 · Agentless (GPT-4o) · 30.2 (+27.4%)
Agentless v1.5 with GPT-4o. arxiv:2405.15793. Localize-then-repair without an agent loop.

Aug 2024 · GPT-4o · 33.2 (+9.9%)
GPT-4o (2024-08-06) via llm-stats leaderboard.

Sep 2024 · Amazon Q Developer · 38.8 (+16.9%)
Updated from 25.6% (May 2024) to 38.8% in its September 2024 agent release.

Jan 2025 · o3-mini · 49.3 (+27.1%)
o3-mini (2025-01-30) via llm-stats leaderboard.

Feb 2025 · Claude 3.7 Sonnet · 63.7 (+29.2%)
Claude 3.7 Sonnet with agentic scaffold. Reported in the system card, arxiv:2502.18449.

Apr 2025 · o3 · 69.1 (+8.5%)
o3 (2025-04-16) via llm-stats leaderboard.

May 2025 · Codex (codex-1) · 72.1 (+4.3%)
codex-1 single-attempt score; 83.8% with 8 tries. Announced May 2025.

Mar 2026 · Claude Opus 4.5 (Current SOTA) · 80.9 (+12.2%)
Claude Opus 4.5 via the Claude Code scaffold. Top of the SWE-bench Verified leaderboard as of March 2026.
Total Improvement: 241.4%
Time Span: 2y 2m
Breakthroughs: 9
Current SOTA: 80.9
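Each milestone's +% figure is its relative improvement over the previous milestone's score, and "Total Improvement" is the relative gain from the first milestone (23.7) to the current SOTA (80.9). A quick check, with the scores copied from the milestone list:

```python
# Milestone scores in chronological order (Feb 2024 .. Mar 2026).
scores = [23.7, 30.2, 33.2, 38.8, 49.3, 63.7, 69.1, 72.1, 80.9]

# Relative improvement of each milestone over its predecessor, in percent.
deltas = [100.0 * (b - a) / a for a, b in zip(scores, scores[1:])]
print([round(d, 1) for d in deltas])
# [27.4, 9.9, 16.9, 27.1, 29.2, 8.5, 4.3, 12.2] -- matches the list above

# Total improvement: first milestone vs. current SOTA.
total = 100.0 * (scores[-1] - scores[0]) / scores[0]
print(round(total, 1))  # 241.4
```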

Top Models Performance Comparison

Top 10 models ranked by resolve-rate

#  | Model                          | resolve-rate | % of best
1  | Claude Opus 4.5                | 80.9         | 100.0%
2  | Claude Sonnet 4                | 72.7         | 89.9%
3  | Codex (codex-1)                | 72.1         | 89.1%
4  | o3                             | 69.1         | 85.4%
5  | Claude 3.7 Sonnet              | 63.7         | 78.7%
6  | Moatless Tools (Claude 3.7)    | 57.6         | 71.2%
7  | Devin 2.0                      | 53.6         | 66.3%
8  | OpenHands CodeAct (Claude 3.7) | 53.0         | 65.5%
9  | o3-mini                        | 49.3         | 60.9%
10 | Claude Code (Sonnet 3.5)       | 49.0         | 60.6%
Best Score: 80.9
Top Model: Claude Opus 4.5
Models Compared: 10
Score Range: 31.9
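The "% of best" column normalizes each model's score against the current best (80.9), and "Score Range" is the spread between the best and worst of the ten compared models. A sketch of both calculations, using three of the ten models for brevity:

```python
# Three of the ten compared models; the first is the current best,
# the last is the lowest of the top 10.
scores = {
    "Claude Opus 4.5": 80.9,
    "Claude Sonnet 4": 72.7,
    "Claude Code (Sonnet 3.5)": 49.0,
}
best = max(scores.values())

# "% of best": each score as a percentage of the top score.
pct_of_best = {m: round(100.0 * s / best, 1) for m, s in scores.items()}
print(pct_of_best["Claude Sonnet 4"])  # 89.9, matching the chart

# "Score Range": best minus worst among the compared models.
score_range = round(best - min(scores.values()), 1)
print(score_range)  # 31.9
```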

resolve-rate (Primary)

#  | Model                                      | Org                      | Score | Date
1  | Claude Opus 4.5 [API]                      | Anthropic                | 80.9  | Feb 2025
2  | Claude Sonnet 4 [API]                      | Anthropic                | 72.7  | Feb 2025
3  | Codex (codex-1)                            | OpenAI                   | 72.1  | May 2025
4  | o3 [API]                                   | OpenAI                   | 69.1  | Apr 2025
5  | Claude 3.7 Sonnet [API]                    | Anthropic                | 63.7  | Feb 2025
6  | Moatless Tools (Claude 3.7) [Open Source]  | Moatless / community     | 57.6  | Feb 2025
7  | Devin 2.0                                  | Cognition AI             | 53.6  | May 2024
8  | OpenHands CodeAct (Claude 3.7) [Open Source] | OpenHands / All-Hands AI | 53.0 | Jul 2024
9  | o3-mini [API]                              | OpenAI                   | 49.3  | Jan 2025
10 | Claude Code (Sonnet 3.5)                   | Anthropic                | 49.0  | Feb 2025
11 | Amazon Q Developer                         | Amazon Web Services      | 38.8  | Sep 2024
12 | GPT-4o [API]                               | OpenAI                   | 33.2  | Aug 2024
13 | Agentless (GPT-4o) [Open Source]           | UIUC / Microsoft         | 30.2  | May 2024
14 | SWE-agent (GPT-4o) [Open Source]           | Princeton NLP            | 23.7  | Feb 2024
15 | Devin 1.0                                  | Cognition AI             | 13.8  | May 2024

Related Papers (5)

Claude 3.7 Sonnet System Card
Feb 2025 · Models: Claude Opus 4.5, Claude Sonnet 4, Claude 3.7 Sonnet, +2 more
Devin: The First AI Software Engineer
May 2024 · Models: Devin 2.0, Devin 1.0