Codesota · Computer Code · Code Generation · SWE-bench Verified
Code Generation · benchmark dataset · 2024 · PYTHON

SWE-bench Verified Subset.

A high-quality subset of SWE-bench: 500 GitHub issue tasks manually screened and confirmed solvable by human engineers.

§ 01 · Leaderboard

Best published scores.

39 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.
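The ordering rule can be sketched as follows: sort by score descending, breaking ties by earlier submission date. The rows and exact dates below are illustrative (the leaderboard only publishes month-level dates), not the site's actual tie-break timestamps.

```python
# Illustrative sketch of the leaderboard ordering: resolve-rate descending,
# ties broken by earlier submission date. Dates here are made up for the demo.
from datetime import date

rows = [
    {"model": "Grok 3",         "score": 63.80, "submitted": date(2026, 3, 1)},
    {"model": "Gemini 2.5 Pro", "score": 63.80, "submitted": date(2026, 3, 15)},
    {"model": "GPT-5",          "score": 74.90, "submitted": date(2026, 3, 1)},
]

# Sort key: negate the score so higher scores come first; among equal
# scores, the earlier submission date wins (Python's sort is stable).
ranked = sorted(rows, key=lambda r: (-r["score"], r["submitted"]))

for rank, r in enumerate(ranked, start=1):
    print(f'{rank:02d}  {r["model"]}  {r["score"]:.2f}')
```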


Primary metric: resolve-rate · higher is better · 39 rows
#  | Model                 | Access | Org           | Submitted | Paper / code         | resolve-rate
01 | Claude Opus 4.7       | —      | —             | Apr 2026  | vendor               | 87.60
02 | Claude Opus 4.5       | API    | Anthropic     | Nov 2025  | anthropic-blog       | 80.90
03 | Claude Opus 4.6       | API    | Anthropic     | Feb 2026  | anthropic-blog       | 80.80
04 | Gemini 3.1 Pro        | API    | Google        | Feb 2026  | google-blog          | 80.60
05 | MiniMax M2.5          | OSS    | MiniMax       | Feb 2026  | minimax-blog         | 80.20
06 | GPT-5.2 Thinking      | API    | OpenAI        | Dec 2025  | openai-blog          | 80.00
07 | Claude Sonnet 4.6     | API    | Anthropic     | Feb 2026  | anthropic-blog       | 79.60
08 | Gemini 3 Flash        | API    | Google        | Dec 2025  | google-blog          | 78.00
09 | Claude Sonnet 4.5     | API    | Anthropic     | Mar 2026  | anthropic-blog       | 77.20
10 | Kimi K2.5             | API    | Moonshot AI   | Mar 2026  | moonshot-blog        | 76.80
11 | GPT-5.1               | API    | OpenAI        | Mar 2026  | openai-blog          | 76.30
12 | Gemini 3 Pro          | API    | Google        | Mar 2026  | google-blog          | 76.20
13 | GPT-5                 | API    | OpenAI        | Mar 2026  | openai-blog          | 74.90
14 | MiniMax M2.1          | API    | MiniMax       | Mar 2026  | minimax-blog         | 74.00
15 | Claude Haiku 4.5      | API    | Anthropic     | Mar 2026  | anthropic-blog       | 73.30
16 | Claude Sonnet 4       | API    | Anthropic     | Mar 2026  | anthropic-blog       | 72.70
17 | Claude Opus 4         | API    | Anthropic     | Mar 2026  | anthropic-blog       | 72.50
18 | Devstral 2            | OSS    | Mistral       | Mar 2026  | mistral-blog         | 72.20
19 | Qwen3-Coder 480B A35B | OSS    | Alibaba Cloud | Mar 2026  | qwen-blog            | 69.60
20 | MiniMax M2            | API    | MiniMax       | Mar 2026  | minimax-blog         | 69.40
21 | o3                    | API    | OpenAI        | Mar 2026  | openai-blog          | 69.10
22 | o4-mini               | API    | OpenAI        | Mar 2026  | swebench-leaderboard | 68.10
23 | DeepSeek-V3.1         | OSS    | DeepSeek      | Mar 2026  | deepseek-blog        | 66.00
24 | Kimi-K2               | OSS    | Moonshot AI   | Mar 2026  | kimi-techreport      | 65.80
25 | Grok 3                | API    | xAI           | Mar 2026  | xai-blog             | 63.80
26 | Gemini 2.5 Pro        | API    | Google        | Mar 2026  | google-blog          | 63.80
27 | Claude 3.7 Sonnet     | API    | Anthropic     | Mar 2026  | anthropic-blog       | 63.70
28 | Gemini 2.5 Flash      | API    | Google        | Mar 2026  | google-blog          | 60.40
29 | DeepSeek-R1-0528      | OSS    | DeepSeek      | Mar 2026  | deepseek-blog        | 57.60
30 | o3-mini               | API    | OpenAI        | Mar 2026  | swebench-leaderboard | 55.80
31 | GPT-4.1               | API    | OpenAI        | Mar 2026  | swebench-leaderboard | 54.60
32 | Claude 3.5 Sonnet     | API    | Anthropic     | Mar 2026  | anthropic-blog       | 50.80
33 | DeepSeek R1           | OSS    | DeepSeek      | Mar 2026  | swebench-leaderboard | 49.20
34 | o1                    | API    | OpenAI        | Mar 2026  | swebench-leaderboard | 48.90
35 | Devstral Small 2505   | OSS    | Mistral       | Mar 2026  | mistral-blog         | 46.80
36 | DeepSeek-V3           | OSS    | DeepSeek      | Mar 2026  | swebench-leaderboard | 42.00
37 | GPT-4o                | API    | OpenAI        | Mar 2026  | swebench-leaderboard | 41.20
38 | Claude 3.5 Haiku      | API    | Anthropic     | Mar 2026  | anthropic-blog       | 40.60
39 | DeepSeek-V2.5         | OSS    | DeepSeek      | Mar 2026  | deepseek-blog        | 37.00
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

2 steps of state of the art.

Each row below marks a model that broke the previous record on resolve-rate. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
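The rule described above can be sketched in a few lines: walk dated submissions in chronological order and keep only those that strictly beat the running best. The two endpoint entries below come from the leaderboard; the intermediate entries and their exact dates are illustrative.

```python
# Minimal sketch of deriving the SOTA line from dated submissions: keep only
# entries that strictly improve on the running best score. Intermediate
# dates are illustrative, not the site's actual timestamps.
from datetime import date

submissions = [
    (date(2025, 11, 24), "Claude Opus 4.5", 80.90),
    (date(2025, 12, 1),  "GPT-5.2 Thinking", 80.00),   # below 80.90: not SOTA
    (date(2026, 2, 1),   "Claude Opus 4.6", 80.80),    # below 80.90: not SOTA
    (date(2026, 4, 18),  "Claude Opus 4.7", 87.60),
]

def sota_line(entries):
    best = float("-inf")
    line = []
    for day, model, score in sorted(entries):  # chronological order
        if score > best:                       # strict improvement only
            best = score
            line.append((day, model, score))
    return line

for day, model, score in sota_line(submissions):
    print(day, model, score)
```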

SOTA line · resolve-rate
  1. Nov 24, 2025 · Claude Opus 4.5 · Anthropic · 80.90
  2. Apr 18, 2026 · Claude Opus 4.7 · 87.60
Fig 3 · SOTA-setting models only. 2 entries span Nov 2025 to Apr 2026.
§ 06 · Contribute

Have a score that beats
this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
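As a minimal sketch, a reproduction script's preamble might pin the commit and seed and dump the declared environment to a file. Everything here (the checkpoint URL, commit hash, seed value, and output filename) is a placeholder, not part of Codesota's actual submission format.

```python
# Hypothetical skeleton of a reproduction-script preamble satisfying the
# checklist above: frozen commit, fixed seed, declared environment.
# All concrete values below are placeholders.
import json
import platform
import random

CONFIG = {
    "checkpoint": "https://example.org/my-model",  # public checkpoint or API endpoint
    "repo_commit": "deadbeef",                     # pin the exact evaluation commit
    "seed": 1234,                                  # fixed seed for reproducibility
}

def main():
    random.seed(CONFIG["seed"])  # fix randomness before any evaluation code runs

    # Declare the evaluation environment so reviewers can reproduce it.
    env = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    with open("submission_env.json", "w") as fh:
        json.dump({"config": CONFIG, "environment": env}, fh, indent=2)

if __name__ == "__main__":
    main()
```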