Codesota · Benchmark · SWE-bench Verified

SWE-bench Verified.

A human-validated subset of 500 GitHub issues drawn from real Python repositories. For each issue, a model must produce a patch that passes hidden tests. It is a standard benchmark for evaluating autonomous coding agents end to end (repository navigation, editing, and testing).
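As a rough illustration of what "passes hidden tests" can mean in practice, the sketch below treats an issue as resolved only when every required test passes after the patch is applied. It assumes the SWE-bench convention of two test groups per issue (tests that should flip from failing to passing, and tests that must keep passing); the function name `is_resolved` and the shape of `test_results` are hypothetical, not part of any official harness.

```python
from typing import Mapping, Sequence


def is_resolved(
    test_results: Mapping[str, bool],
    fail_to_pass: Sequence[str],
    pass_to_pass: Sequence[str],
) -> bool:
    """Return True if the patch fixes the issue without breaking other tests.

    test_results maps a test name to True (passed) or False (failed)
    after the candidate patch was applied. Hypothetical data shape.
    """
    if not fail_to_pass:
        # Without any issue-specific test there is nothing to verify.
        return False
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test that is missing from the results counts as a failure.
    return all(test_results.get(name, False) for name in required)


# Hypothetical results for one issue.
results = {"test_fix": True, "test_existing": True}
print(is_resolved(results, ["test_fix"], ["test_existing"]))
```

A patch that makes the issue-specific test pass but breaks a previously passing test would not count as resolved under this sketch.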

§ 01 · Leaderboard

Results by metric.

Only 3 models on this benchmark

Pct Resolved

Pct Resolved (the percentage of issues for which the model's patch passes the hidden tests) is the reported evaluation metric for SWE-bench Verified. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
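The metric itself is simple arithmetic. As a minimal sketch, assuming one pass/fail outcome per issue, Pct Resolved is the share of resolved issues expressed as a percentage. The function name `pct_resolved` and the outcome list are hypothetical and do not correspond to any leaderboard entry.

```python
def pct_resolved(outcomes: list[bool]) -> float:
    """Percentage of issues resolved, given one boolean outcome per issue."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    # True counts as 1 and False as 0 when summed.
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run over 500 issues: 400 resolved, 100 unresolved.
outcomes = [True] * 400 + [False] * 100
print(pct_resolved(outcomes))  # 80.0
```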

Trust tiers for Pct Resolved: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score (Pct Resolved) | Year | Notes |
|------|-------|-------|----------------------|------|-------|
| 01 | Claude Opus 4.5 | paper | 80.9 | 2026 | Top score on SWE-bench Verified leaderboard (Anthropic reported). seed; verify |
| 02 | Gemini 3 Pro | paper | 78.8 | 2026 | Gemini 3 Pro on SWE-bench Verified via VALS AI bash-agent harness. seed; verify |
| 03 | GPT-5 Codex | paper | 74.9 | 2026 | GPT-5 Codex on SWE-bench Verified. seed; verify |
