Code Generation · benchmark dataset · 2024 · PYTHON

SWE-bench Verified Subset.

500 manually verified GitHub issues confirmed solvable by human engineers. High-quality subset of SWE-bench.

§ 01 · Leaderboard

Best published scores.

61 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: resolve-rate · higher is better
All metrics: accuracy, resolve-rate
accuracy · 22 rows

| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro Max | DeepSeek | Apr 2026 | pwc-dump · code | 80.60 |
| 02 | MiniMax-M2.5 (Open) | MiniMaxAI | Feb 2026 | pwc-dump · code | 80.20 |
| 03 | Kimi K2.6 | | Apr 2026 | pwc-dump | 80.20 |
| 04 | DeepSeek-V4-Flash Max | DeepSeek | Apr 2026 | pwc-dump · code | 79 |
| 05 | MiMo-V2.5-Pro | | Apr 2026 | pwc-dump | 78.90 |
| 06 | GLM-5 (Open) | Zhipu AI | Feb 2026 | GLM-5: from Vibe Coding to Agentic Engineering · code | 77.80 |
| 07 | Qwen3.6-27B | | Apr 2026 | pwc-dump · code | 77.20 |
| 08 | Kimi-K2.5 (Open) | Moonshot.AI | Feb 2026 | Kimi K2.5: Visual Agentic Intelligence · code | 76.80 |
| 09 | Qwen3.5-397B-A17B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 76.40 |
| 10 | Step-3.5-Flash | | Feb 2026 | Step 3.5 Flash: Open Frontier-Level Intelligence with 11… · code | 74.40 |
| 11 | Qwen3.6-35B-A3B | | Apr 2026 | pwc-dump · code | 73.40 |
| 12 | DeepSeek-V3.2 (Open) | DeepSeek | Dec 2025 | DeepSeek-V3.2: Pushing the Frontier of Open Large Langua… | 73.10 |
| 13 | Qwen3.5-27B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 72.40 |
| 14 | Ling-2.6-1T | | Apr 2026 | pwc-dump | 72.20 |
| 15 | Qwen3.5-122B-A10B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 72 |
| 16 | Qwen3-Coder-Next | Qwen | Feb 2026 | Qwen3-Coder-Next Technical Report · code | 70.60 |
| 17 | Qwen3.5-35B-A3B (Open) | Alibaba | Feb 2026 | pwc-dump · code | 69.20 |
| 18 | GLM-4.5 (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 64.20 |
| 19 | NVIDIA-Nemotron-3-Super-120B-A12B-BF16 | | Dec 2025 | NVIDIA Nemotron 3: Efficient and Open Intelligence | 60.47 |
| 20 | Gemini 2.5 Pro | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 59.60 |
| 21 | GLM-4.5-Air (Open) | Zhipu AI | Aug 2025 | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation… · code | 57.60 |
| 22 | Gemini 2.5 Flash | | Jul 2025 | Gemini 2.5: Pushing the Frontier with Advanced Reasoning… | 48.90 |

resolve-rate · primary · 39 rows

| # | Model | Org | Submitted | Paper / code | resolve-rate |
|---|---|---|---|---|---|
| 01 | Claude Opus 4.7 | Anthropic | Apr 2026 | vendor | 87.60 |
| 02 | Claude Opus 4.5 (API) | Anthropic | Nov 2025 | anthropic-blog | 80.90 |
| 03 | Claude Opus 4.6 (API) | Anthropic | Feb 2026 | anthropic-blog | 80.80 |
| 04 | Gemini 3.1 Pro | Google | Feb 2026 | google-blog | 80.60 |
| 05 | MiniMax M2.5 (Open) | MiniMax | Feb 2026 | minimax-blog | 80.20 |
| 06 | GPT-5.2 Thinking (API) | OpenAI | Dec 2025 | openai-blog | 80 |
| 07 | Claude Sonnet 4.6 (API) | Anthropic | Feb 2026 | anthropic-blog | 79.60 |
| 08 | Gemini 3 Flash (API) | Google | Dec 2025 | google-blog | 78 |
| 09 | Claude Sonnet 4.5 (API) | Anthropic | Mar 2026 | anthropic-blog | 77.20 |
| 10 | Kimi K2.5 (Open) | Moonshot AI | Mar 2026 | moonshot-blog | 76.80 |
| 11 | GPT-5.1 (API) | OpenAI | Mar 2026 | openai-blog | 76.30 |
| 12 | Gemini 3 Pro (API) | Google | Mar 2026 | google-blog | 76.20 |
| 13 | GPT-5 | OpenAI | Mar 2026 | openai-blog | 74.90 |
| 14 | MiniMax M2.1 (Open) | MiniMax | Mar 2026 | minimax-blog | 74 |
| 15 | Claude Haiku 4.5 (API) | Anthropic | Mar 2026 | anthropic-blog | 73.30 |
| 16 | Claude Sonnet 4 | Anthropic | Mar 2026 | anthropic-blog | 72.70 |
| 17 | Claude Opus 4 | Anthropic | Mar 2026 | anthropic-blog | 72.50 |
| 18 | Devstral 2 (Open) | Mistral | Mar 2026 | mistral-blog | 72.20 |
| 19 | Qwen3-Coder 480B A35B (Open) | Alibaba Cloud | Mar 2026 | qwen-blog | 69.60 |
| 20 | MiniMax M2 (API) | MiniMax | Mar 2026 | minimax-blog | 69.40 |
| 21 | o3 | OpenAI | Mar 2026 | openai-blog | 69.10 |
| 22 | o4-mini | OpenAI | Mar 2026 | swebench-leaderboard | 68.10 |
| 23 | DeepSeek-V3.1 (Open) | DeepSeek | Mar 2026 | deepseek-blog | 66 |
| 24 | Kimi-K2 (Open) | Moonshot.AI | Mar 2026 | kimi-techreport | 65.80 |
| 25 | Gemini 2.5 Pro (API) | Google | Mar 2026 | google-blog | 63.80 |
| 26 | Grok 3 (API) | xAI | Mar 2026 | xai-blog | 63.80 |
| 27 | Claude 3.7 Sonnet (API) | Anthropic | Mar 2026 | anthropic-blog | 63.70 |
| 28 | Gemini 2.5 Flash (API) | Google | Mar 2026 | google-blog | 60.40 |
| 29 | DeepSeek-R1-0528 (Open) | DeepSeek | Mar 2026 | deepseek-blog | 57.60 |
| 30 | o3-mini (API) | OpenAI | Mar 2026 | swebench-leaderboard | 55.80 |
| 31 | GPT-4.1 | OpenAI | Mar 2026 | swebench-leaderboard | 54.60 |
| 32 | Claude 3.5 Sonnet (API) | Anthropic | Mar 2026 | anthropic-blog | 50.80 |
| 33 | DeepSeek R1 (Open) | DeepSeek | Mar 2026 | swebench-leaderboard | 49.20 |
| 34 | o1 (API) | OpenAI | Mar 2026 | swebench-leaderboard | 48.90 |
| 35 | Devstral Small 2505 (Open) | Mistral | Mar 2026 | mistral-blog | 46.80 |
| 36 | DeepSeek-V3 (Open) | DeepSeek | Mar 2026 | swebench-leaderboard | 42 |
| 37 | GPT-4o (API) | OpenAI | Mar 2026 | swebench-leaderboard | 41.20 |
| 38 | Claude 3.5 Haiku (API) | Anthropic | Mar 2026 | anthropic-blog | 40.60 |
| 39 | DeepSeek-V2.5 (Open) | DeepSeek | Mar 2026 | deepseek-blog | 37 |

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
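
The ordering rule stated for the leaderboard (sorted by score, ties broken by submission date) can be sketched as a two-part sort key. This is a minimal illustration, not Codesota's actual code: the `Row` type and `rank` function are hypothetical names, the day of month is set to 1 because the table only gives month precision, and "earlier submission wins a tie" is an assumption inferred from rows 02 and 03 of the accuracy table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Row:
    """Hypothetical leaderboard row; field names are illustrative."""
    model: str
    submitted: date  # table shows month only, so the day is a placeholder
    score: float


def rank(rows):
    """Sort by score descending, then by submission date ascending.

    Assumption: among equal scores the earlier submission ranks first.
    Python's sort is stable, so rows tied on both keys keep input order.
    """
    return sorted(rows, key=lambda r: (-r.score, r.submitted))


# Three rows taken from the accuracy table above.
rows = [
    Row("Kimi K2.6", date(2026, 4, 1), 80.20),
    Row("DeepSeek-V4-Pro Max", date(2026, 4, 1), 80.60),
    Row("MiniMax-M2.5", date(2026, 2, 1), 80.20),
]
ranked = rank(rows)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d} {row.model} {row.score:.2f}")
```

Rows that tie on both score and month, such as Gemini 2.5 Pro and Grok 3 at 63.80, cannot be separated by this key alone; the sketch simply leaves them in input order.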
§ 03 · Progress

2 steps of state of the art.

Each row below marks a model that broke the previous record on resolve-rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · resolve-rate
  1. Nov 24, 2025 · Claude Opus 4.5 · Anthropic · 80.90
  2. Apr 18, 2026 · Claude Opus 4.7 · Anthropic · 87.60
Fig 3 · SOTA-setting models only. 2 entries span Nov 2025 to Apr 2026.
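
The record-setting rule described in this section amounts to a running maximum over rows ordered by date. The sketch below is one way to compute it, under stated assumptions: `sota_steps` is a hypothetical helper, a score must strictly exceed the previous best to count as a new step, and the day of month for the Dec 2025 and Feb 2026 rows is a placeholder because the leaderboard lists only the month.

```python
from datetime import date


def sota_steps(rows):
    """Return the rows that set a new best score, in chronological order.

    rows: iterable of (submitted, model, score) tuples.
    Assumption: a tie with the current best does not count as a new record.
    """
    steps = []
    best = float("-inf")
    for submitted, model, score in sorted(rows, key=lambda r: r[0]):
        if score > best:
            best = score
            steps.append((submitted, model, score))
    return steps


# A few resolve-rate rows from the leaderboard above.
rows = [
    (date(2026, 4, 18), "Claude Opus 4.7", 87.60),
    (date(2025, 12, 1), "GPT-5.2 Thinking", 80.0),   # day is a placeholder
    (date(2025, 11, 24), "Claude Opus 4.5", 80.90),
    (date(2026, 2, 1), "Claude Opus 4.6", 80.80),    # day is a placeholder
]
steps = sota_steps(rows)
for submitted, model, score in steps:
    print(submitted.isoformat(), model, f"{score:.2f}")
```

On these four rows the helper keeps only Claude Opus 4.5 and Claude Opus 4.7, which matches the two steps listed in the SOTA line.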
§ 04 · Literature

8 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
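
Items 02 and 03 in the list above (frozen commit, seed, declared Python environment) could be captured in a small manifest that ships alongside the reproduction script. The following is only a sketch: `build_manifest` and every field name in it are hypothetical, and Codesota's submission guide may require a different format.

```python
import json
import platform
from importlib import metadata


def build_manifest(commit, seed, packages):
    """Collect reproduction metadata into a plain dict.

    commit:   frozen commit hash of the reproduction script (supplied by caller)
    seed:     random seed used for the evaluation run
    packages: names of the dependencies whose versions should be declared
    All keys below are illustrative, not a required schema.
    """
    deps = {}
    for name in packages:
        try:
            deps[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            deps[name] = None  # dependency not installed in this environment
    return {
        "commit": commit,
        "seed": seed,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "deps": deps,
    }


# Placeholder commit hash and seed, for illustration only.
manifest = build_manifest("0000000", 1234, ["pip"])
print(json.dumps(manifest, indent=2))
```

Passing the commit hash in as an argument, rather than shelling out to git, keeps the sketch runnable outside a repository.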