Codesota · Agentic AI · SWE-bench · SWE-bench Verified
SWE-bench · benchmark dataset · 2024 · PYTHON

SWE-bench Verified — Agentic Leaderboard.

500 manually verified GitHub issues, each confirmed solvable by human engineers. The primary benchmark for software engineering agents. Results are tracked for autonomous agent scaffolds, not just raw model capability.

Paper · Submit a result
§ 01 · Leaderboard

Best published scores.

81 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: resolve-rate · higher is better · 81 rows
| # | Model | Access | Org | Submitted | Paper / code | resolve-rate |
|----|-------|--------|-----|-----------|--------------|--------------|
| 01 | Claude Mythos Preview | | Anthropic | Apr 2026 | editorial | 93.90 |
| 02 | Claude Opus 4.5 | API | Anthropic | Apr 2026 | editorial | 80.90 |
| 03 | Claude Opus 4.6 | API | Anthropic | Apr 2026 | editorial | 80.80 |
| 04 | Gemini 3.1 Pro | API | Google | Apr 2026 | editorial | 80.60 |
| 05 | MiniMax M2.5 | OSS | MiniMax | Apr 2026 | editorial | 80.20 |
| 06 | GPT-5.2 | API | OpenAI | Apr 2026 | editorial | 80.00 |
| 07 | Claude Sonnet 4.6 | API | Anthropic | Apr 2026 | editorial | 79.60 |
| 08 | Qwen3.6 Plus | | Alibaba Cloud | Apr 2026 | editorial | 78.80 |
| 09 | MiMo-V2-Pro | OSS | Xiaomi | Apr 2026 | editorial | 78.00 |
| 10 | Gemini 3 Flash | API | Google | Apr 2026 | editorial | 78.00 |
| 11 | GLM-5 | OSS | Zhipu AI | Apr 2026 | editorial | 77.80 |
| 12 | Muse Spark | | Meta | Apr 2026 | editorial | 77.40 |
| 13 | Kimi K2.5 | API | Moonshot AI | Apr 2026 | editorial | 76.80 |
| 14 | Seed 2.0 Pro | | ByteDance | Apr 2026 | editorial | 76.50 |
| 15 | Qwen3.5-397B-A17B | | Alibaba Cloud | Apr 2026 | editorial | 76.40 |
| 16 | GPT-5.1 Instant | | OpenAI | Apr 2026 | editorial | 76.30 |
| 17 | GPT-5.1 Thinking | | OpenAI | Apr 2026 | editorial | 76.30 |
| 18 | GPT-5.1 | API | OpenAI | Apr 2026 | editorial | 76.30 |
| 19 | Gemini 3 Pro | API | Google | Apr 2026 | editorial | 76.20 |
| 20 | GPT-5 | API | OpenAI | Apr 2026 | editorial | 74.90 |
| 21 | MiMo-V2-Omni | | Xiaomi | Apr 2026 | editorial | 74.80 |
| 22 | GPT-5 Codex | | OpenAI | Apr 2026 | editorial | 74.50 |
| 23 | Claude Opus 4.1 | | Anthropic | Apr 2026 | editorial | 74.50 |
| 24 | Step-3.5-Flash | OSS | StepFun | Apr 2026 | editorial | 74.40 |
| 25 | GLM-4.7 | | Zhipu AI | Apr 2026 | editorial | 73.80 |
| 26 | GPT-5.1 Codex | | OpenAI | Apr 2026 | editorial | 73.70 |
| 27 | Seed 2.0 Lite | | ByteDance | Apr 2026 | editorial | 73.50 |
| 28 | MiMo-V2-Flash | | Xiaomi | Apr 2026 | editorial | 73.40 |
| 29 | Claude Haiku 4.5 | API | Anthropic | Apr 2026 | editorial | 73.30 |
| 30 | DeepSeek-V3.2-Speciale | | DeepSeek | Apr 2026 | editorial | 73.10 |
| 31 | DeepSeek-V3.2 (Thinking) | | DeepSeek | Apr 2026 | editorial | 73.10 |
| 32 | Claude Sonnet 4 | API | Anthropic | Apr 2026 | editorial | 72.70 |
| 33 | Claude Opus 4 | API | Anthropic | Apr 2026 | editorial | 72.50 |
| 34 | Qwen3.5-27B | | Alibaba Cloud | Apr 2026 | editorial | 72.40 |
| 35 | Qwen3.5-122B-A10B | | Alibaba Cloud | Apr 2026 | editorial | 72.00 |
| 36 | Kimi K2-Thinking-0905 | OSS | Moonshot AI | Apr 2026 | editorial | 71.30 |
| 37 | Grok Code Fast 1 | | xAI | Apr 2026 | editorial | 70.80 |
| 38 | Claude 3.7 Sonnet | API | Anthropic | Apr 2026 | editorial | 70.30 |
| 39 | LongCat-Flash-Thinking-2601 | | Meituan | Apr 2026 | editorial | 70.00 |
| 40 | Qwen3-Coder 480B A35B | OSS | Alibaba Cloud | Apr 2026 | editorial | 69.60 |
| 41 | Qwen3 Max | OSS | Alibaba Cloud | Apr 2026 | editorial | 69.60 |
| 42 | MiniMax M2 | API | MiniMax | Apr 2026 | editorial | 69.40 |
| 43 | Qwen3.5-35B-A3B | | Alibaba Cloud | Apr 2026 | editorial | 69.20 |
| 44 | o3 | API | OpenAI | Apr 2026 | editorial | 69.10 |
| 45 | o4-mini | API | OpenAI | Apr 2026 | editorial | 68.10 |
| 46 | GLM-4.6 | | Zhipu AI | Apr 2026 | editorial | 68.00 |
| 47 | DeepSeek-V3.2-Exp | | DeepSeek | Apr 2026 | editorial | 67.80 |
| 48 | Gemini 2.5 Pro Preview | | Google | Apr 2026 | editorial | 67.20 |
| 49 | MiniMax M2.1 | API | MiniMax | Apr 2026 | editorial | 67.00 |
| 50 | DeepSeek-V3.1 | OSS | DeepSeek | Apr 2026 | editorial | 66.00 |
| 51 | Kimi K2-Instruct-0905 | | Moonshot AI | Apr 2026 | editorial | 65.80 |
| 52 | GLM-4.5 | | Zhipu AI | Apr 2026 | editorial | 64.20 |
| 53 | Gemini 2.5 Pro | API | Google | Apr 2026 | editorial | 63.20 |
| 54 | Devstral Medium | | Mistral AI | Apr 2026 | editorial | 61.60 |
| 55 | LongCat-Flash-Chat | | Meituan | Apr 2026 | editorial | 60.40 |
| 56 | Gemini 2.5 Flash | API | Google | Apr 2026 | editorial | 60.40 |
| 57 | LongCat-Flash-Thinking | | Meituan | Apr 2026 | editorial | 59.40 |
| 58 | GLM-4.7-Flash | | Zhipu AI | Apr 2026 | editorial | 59.20 |
| 59 | GLM-4.5-Air | | Zhipu AI | Apr 2026 | editorial | 57.60 |
| 60 | MiniMax M1 80K | | MiniMax | Apr 2026 | editorial | 56.00 |
| 61 | MiniMax M1 40K | | MiniMax | Apr 2026 | editorial | 55.60 |
| 62 | GPT-4.1 | API | OpenAI | Apr 2026 | editorial | 54.60 |
| 63 | LongCat-Flash-Lite | | Meituan | Apr 2026 | editorial | 54.40 |
| 64 | Nemotron 3 Super (120B) | | NVIDIA | Apr 2026 | editorial | 53.70 |
| 65 | Devstral Small 1.1 | | Mistral AI | Apr 2026 | editorial | 53.60 |
| 66 | o3-mini | API | OpenAI | Apr 2026 | editorial | 49.30 |
| 67 | Claude 3.5 Sonnet | API | Anthropic | Apr 2026 | editorial | 49.00 |
| 68 | Sarvam-105B | | Sarvam AI | Apr 2026 | editorial | 45.00 |
| 69 | DeepSeek-R1-0528 | OSS | DeepSeek | Apr 2026 | editorial | 44.60 |
| 70 | DeepSeek-V3 | OSS | DeepSeek | Apr 2026 | editorial | 42.00 |
| 71 | o1-preview | API | OpenAI | Apr 2026 | editorial | 41.30 |
| 72 | o1 | API | OpenAI | Apr 2026 | editorial | 41.00 |
| 73 | Claude 3.5 Haiku | API | Anthropic | Apr 2026 | editorial | 40.60 |
| 74 | Nemotron 3 Nano (30B) | | NVIDIA | Apr 2026 | editorial | 38.80 |
| 75 | GPT-4.5 | API | OpenAI | Apr 2026 | editorial | 38.00 |
| 76 | Sarvam-30B | | Sarvam AI | Apr 2026 | editorial | 34.00 |
| 77 | GPT-4o | API | OpenAI | Apr 2026 | editorial | 33.20 |
| 78 | Gemini 2.5 Flash-Lite | | Google | Apr 2026 | editorial | 31.60 |
| 79 | GPT-4.1 mini | API | OpenAI | Apr 2026 | editorial | 23.60 |
| 80 | Gemini Diffusion | | Google | Apr 2026 | editorial | 22.90 |
| 81 | DeepSeek-V2.5 | OSS | DeepSeek | Apr 2026 | editorial | 16.80 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
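The resolve-rate column above is the share of the 500 Verified instances an agent fully resolves, expressed as a percentage. A minimal sketch of the computation, assuming a per-instance boolean report (the field names are illustrative, not the official harness schema):

```python
def resolve_rate(report: dict) -> float:
    """Percentage of instances resolved, given {instance_id: resolved_bool}."""
    if not report:
        return 0.0
    return 100.0 * sum(report.values()) / len(report)

# Example: 404 of 500 instances resolved.
demo = {f"repo__issue-{i}": i < 404 for i in range(500)}
print(resolve_rate(demo))  # → 80.8
```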
§ 03 · Progress

1 step of state of the art.

Each row below marks a model that broke the previous record on resolve-rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · resolve-rate
  1. Apr 9, 2026 · Claude Mythos Preview · Anthropic · 93.90
Fig 3 · SOTA-setting models only. 1 entry, dated Apr 2026.
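The SOTA line above is just a running maximum over the leaderboard ordered by submission date: an entry is re-listed only if it strictly beats every earlier score. A minimal sketch (the entries shown are illustrative):

```python
from datetime import date

def sota_steps(entries):
    """Keep only record-setting entries.

    entries: iterable of (submitted, model, score) tuples, any order.
    Returns the date-ordered subsequence of entries that raised the best score.
    """
    steps, best = [], float("-inf")
    for submitted, model, score in sorted(entries, key=lambda e: e[0]):
        if score > best:          # strict improvement breaks date-order ties
            best = score
            steps.append((submitted, model, score))
    return steps

entries = [
    (date(2026, 4, 9), "Claude Mythos Preview", 93.9),
    (date(2026, 4, 1), "Claude Opus 4.5", 80.9),
    (date(2026, 4, 5), "GPT-5.2", 80.0),
]
print([m for _, m, _ in sota_steps(entries)])
# → ['Claude Opus 4.5', 'Claude Mythos Preview']
```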
§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.

Submit a result · Read submission guide
What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
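The checklist above can be collapsed into a single pinned manifest that a reproduction script reads before the run. A hypothetical Python sketch — every value and file name below is a placeholder, not Codesota's actual submission schema:

```python
import json
import random

# Everything a reviewer needs to replay the run, pinned in one place.
# All values are placeholders for illustration.
MANIFEST = {
    "checkpoint": "https://example.com/ckpt/my-agent",  # public checkpoint or endpoint
    "scaffold_commit": "0123abc",                       # frozen scaffold commit
    "seed": 42,                                         # frozen seed for stochastic steps
    "environment": {"python": "3.11", "deps": ["swebench", "requests"]},
    "metrics": ["resolve-rate"],                        # one row per declared metric
    "contact": "you@example.com",                       # for follow-up on discrepancies
}

def write_manifest(path="manifest.json"):
    """Seed the run deterministically and record the manifest alongside it."""
    random.seed(MANIFEST["seed"])
    with open(path, "w") as f:
        json.dump(MANIFEST, f, indent=2)
    return MANIFEST

if __name__ == "__main__":
    write_manifest()
```

Keeping the commit, seed, and environment in one machine-readable file makes it easy to verify a resubmission byte-for-byte against the original run.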