Explainer · Updated April 2026

SWE-Bench Explained

A benchmark that measures whether an LLM agent can fix real GitHub bugs: what the number actually represents, how it's scored, which variant to believe, and why the scores on the leaderboard are probably 5-15 points too generous.

What SWE-Bench actually is

The task

Each SWE-Bench instance gives an agent: (1) a GitHub issue describing a bug, and (2) the full source repo at the commit just before that bug was fixed. The agent must produce a patch that makes the issue's fail-to-pass tests pass without breaking any previously-passing (pass-to-pass) tests.

  • 2,294 issues in the original (Oct 2023)
  • 500 human-verified in SWE-Bench Verified (Aug 2024)
  • From 12 mature Python repos: django, scikit-learn, sympy, flask, requests, matplotlib, pytest, astropy, pylint, pydata/xarray, mwaskom/seaborn, sphinx
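The instances ship as a Hugging Face dataset, so you can inspect one directly. A minimal sketch using the dataset's published column names (requires pip install datasets):

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
inst = ds[0]

print(inst["instance_id"])        # e.g. "django__django-11099"
print(inst["repo"])               # "owner/name" of the source repo
print(inst["base_commit"])        # pre-fix commit the agent starts from
print(inst["problem_statement"])  # the GitHub issue text
print(inst["FAIL_TO_PASS"])       # tests that must flip red -> green
print(inst["PASS_TO_PASS"])       # tests that must stay green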

The scoring

Binary per-task. An agent resolves an issue iff, after its patch is applied, every fail-to-pass test passes AND every pass-to-pass test still passes (the rule is sketched in code after the list below). The reported metric is the % resolved, pass@1 by convention.

  • No partial credit. "Mostly works" scores 0.
  • No test creativity — agents are graded against the PR author's chosen tests.
  • Harness: one Docker container per instance, reproducible bit-for-bit.
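The verdict logic itself is tiny. A minimal sketch, where run_test is a hypothetical stand-in for the harness's per-test Docker runner:

def resolved(fail_to_pass: list[str], pass_to_pass: list[str], run_test) -> bool:
    """Binary verdict: every bug-specific test must now pass AND every
    previously-passing test must still pass. No partial credit.
    run_test(test_id) -> bool stands in for the Dockerized test runner."""
    return (all(run_test(t) for t in fail_to_pass)
            and all(run_test(t) for t in pass_to_pass))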

Scoring pipeline

Every SWE-Bench submission flows through the same harness.

Architecture

SWE-Bench scoring flow

Issue + repo → agent patch → apply → tests → resolved / unresolved

[Diagram: GitHub issue (bug report) + repo @ commit (pre-fix snapshot) → agent/model generates candidate patch (unified diff) → patch applied to repo → FAIL→PASS tests (bug-specific) and PASS→PASS tests (regression guard) → RESOLVED if both green, UNRESOLVED if any red.]
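Stripped of the Docker plumbing, that flow is apply-then-test. A rough sketch of the shape (not the harness's actual internals, which build per-instance images and parse test logs):

import subprocess

def evaluate(repo_dir: str, patch_file: str,
             fail_to_pass: list[str], pass_to_pass: list[str]) -> str:
    # Apply the candidate unified diff to the pre-fix checkout.
    if subprocess.run(["git", "apply", patch_file], cwd=repo_dir).returncode != 0:
        return "UNRESOLVED"  # a patch that does not even apply scores 0
    # Bug-specific tests first, then the regression guard.
    for test in fail_to_pass + pass_to_pass:
        result = subprocess.run(["python", "-m", "pytest", test], cwd=repo_dir)
        if result.returncode != 0:
            return "UNRESOLVED"  # any red test fails the whole instance
    return "RESOLVED"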

One concrete example

Real instance: django__django-11099. A URL name that is itself a Python reserved word broke reverse(). One-line fix.

Gold patch (human PR #11100) · django/urls/resolvers.py

@@ -341,6 +341,10 @@ class URLResolver:
     def _populate(self):
         lookups = MultiValueDict()
-        if name.isidentifier():
+        if name.isidentifier() and not keyword.iskeyword(name):
         self.apps[name] = app_namespaces

Test added in the same PR · tests/urlpatterns/tests.py

@@ -131,0 +131,7 @@ class ResolverMatchTests(...):
+    def test_keyword_reserved_name_rejected(self):
+        with self.assertRaises(ImproperlyConfigured):
+            resolve("/with/class/view/", urlconf="class_urls")

The agent succeeds iff, after applying its patch, the new test passes and no pre-existing Django URL resolver test regresses.
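The root cause is reproducible in any REPL: str.isidentifier() accepts reserved words, which is exactly the gap keyword.iskeyword() closes in the gold patch.

import keyword

# str.isidentifier() does NOT exclude reserved words; keyword.iskeyword() does.
print("class".isidentifier())      # True: syntactically a valid identifier
print(keyword.iskeyword("class"))  # True: but reserved, so unusable as a name
print("my_view".isidentifier() and not keyword.iskeyword("my_view"))  # True: a usable name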

What an agent actually does

Stylized call trace for the above instance under Claude Code + Opus 4.5.

Trajectory

Agent trajectory — one SWE-Bench instance

django__django-11099 · Opus 4.5 + Claude Code

Tool legend: read · edit · bash · test · search · think. Elapsed time: ~0s to ~37s.

  1. read: issue body + linked discussion
  2. search: grep -rn "ResolverMatch" django/urls/
  3. read: django/urls/resolvers.py lines 320-420
  4. think: hypothesize that isidentifier() accepts reserved words
  5. bash: grep keyword tests/urlpatterns/
  6. edit: str_replace to add the keyword.iskeyword guard
  7. bash: pytest tests/urlpatterns/ -x
  8. read: inspect 2 failing tests
  9. edit: str_replace to also import keyword at the top
  10. bash: pytest tests/urlpatterns/ tests/urls/
  11. bash: pytest tests/ -x -q (full resolver suite)
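The loop behind such a trace is short. A stylized sketch, where llm and the single-shell-tool interface are schematic assumptions, not Claude Code's real API:

import subprocess

def run_shell(cmd: str, cwd: str) -> str:
    """Run one command in the repo sandbox and capture its output."""
    out = subprocess.run(cmd, shell=True, cwd=cwd,
                         capture_output=True, text=True, timeout=300)
    return out.stdout + out.stderr

def solve(issue: str, repo_dir: str, llm, max_steps: int = 50) -> str:
    """Stylized agent loop: the model sees the history, emits one shell
    command per step, and stops once the suite looks green.
    llm(history) -> str is a schematic stand-in for the real model call."""
    history = [f"Issue:\n{issue}"]
    for _ in range(max_steps):
        cmd = llm(history)                        # e.g. 'grep -rn ResolverMatch django/urls/'
        obs = run_shell(cmd, repo_dir)
        history.append(f"$ {cmd}\n{obs[:2000]}")  # truncate long outputs
        if cmd.startswith("pytest") and " failed" not in obs:
            break                                 # tests green: stop editing
    return run_shell("git diff", repo_dir)        # the candidate patch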

The family of variants

"SWE-Bench" in 2026 means a family, not a single benchmark. Each variant exists to patch a specific weakness of the original.

Benchmark family

SWE-Bench variants — size, coverage, purpose

  • SWE-Bench Full: 2,294 tasks. The original; 12 Python repos (2023).
  • SWE-Bench Lite: 300 tasks. Subset for faster eval.
  • SWE-Bench Verified: 500 tasks. Human-vetted; the headline leaderboard.
  • SWE-Bench Multimodal: 517 tasks. Screenshots + code.
  • SWE-Bench Multilingual: 300 tasks. JS/TS/Go/Rust/Java.
  • SWE-Bench Live: 1,319 tasks. Post-2024 issues, refreshed monthly.
  • SWE-Bench Pro: 500 tasks. Harder; Scale AI curated.

Progress, 2023 → 2026

How the headline number has moved. Note the logarithmic shape: the classic benchmark saturation curve.

Progress curve

SWE-Bench Verified SOTA over time

[Chart: SWE-Bench Verified resolved %, 2023 → 2026] Claude 2: 2.0% → GPT-4: 12.5% → SWE-agent: 18.0% → GPT-4o: 19.0% → Sonnet 3.5: 27.0% → o1-preview: 36.2% → OpenHands: 41.3% → Sonnet 3.5 v2: 49.0% → Opus 4: 55.2% → GPT-5: 62.0% → Sonnet 4.5: 70.8% → Opus 4.5: 80.9% → MiniMax M2.5: 80.2% → Opus 4.7: 87.6%

The contamination problem

All 2,294 original issues predate every major 2024+ training cutoff. The fixes are on GitHub, visible in discussions, blogged about. Models may have memorized them.

Contamination risk

Why SWE-Bench scores may be inflated

Train-set overlap with test-set PRs. Scored fixes existed publicly before model cutoffs.

[Diagram: the model training corpus (a pre-2024 GitHub crawl) covers all 12 benchmark repos (github.com/django/django, scikit-learn, sympy, psf/requests, matplotlib, and the rest, all public since 2008-2015), including issue threads, PR conversations, merged fix commits (verbatim), and release notes naming each bug. The SWE-Bench Verified test set (500 human-vetted issues, e.g. an issue linking to django#11099, closed 2019, with gold patch PR #11100) sits inside that crawl: the model has plausibly seen the issue body, the PR diff, the discussion, backport notes, and blog posts. Overlay: the OpenAI Feb 2026 audit found 59.4% of hard tasks have flawed tests; estimated inflation 5-15 points.]

The OpenAI Feb 2026 audit found 59.4% of the hardest Verified tasks had tests that wouldn't actually catch the intended bug — meaning an agent can "pass" without truly resolving the issue. Contamination plus weak tests inflates Verified scores by an estimated 5-15 points on post-2023 models. For clean numbers, use SWE-Bench Live or SWE-Bench Pro.
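One cheap first-order check you can run yourself: compare each instance's issue date (the dataset's created_at column) against the model's training cutoff. The cutoff below is an assumed placeholder; for Verified, essentially every instance lands on the wrong side of any recent one.

from datetime import datetime, timezone
from datasets import load_dataset

# Hypothetical training cutoff for the model under test.
CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
seen = [inst for inst in ds
        if datetime.fromisoformat(inst["created_at"].replace("Z", "+00:00")) < CUTOFF]
print(f"{len(seen)}/{len(ds)} instances predate the cutoff")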

Running SWE-Bench yourself

# 1. Install
git clone https://github.com/SWE-bench/SWE-bench && cd SWE-bench
pip install -e .
# 2. Pull the Verified split
python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path my_preds.jsonl \
  --run_id my-eval
# 3. Or run mini-SWE-agent end-to-end (recommended)
git clone https://github.com/SWE-agent/mini-SWE-agent && cd mini-SWE-agent && pip install -e .
mini-swe-agent run --model claude-opus-4-5 --split verified

A full Verified run costs $100-$500 in API calls for frontier models and takes 4-12 hours on a single machine.
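The my_preds.jsonl file is just one JSON object per attempted instance, carrying the three keys the harness's predictions format reads:

import json

# One JSON object per line; these three keys are what the harness reads.
prediction = {
    "instance_id": "django__django-11099",  # which task the patch targets
    "model_name_or_path": "my-agent-v1",    # free-form label for the run
    "model_patch": "diff --git a/django/urls/resolvers.py b/django/urls/resolvers.py\n...",
}
with open("my_preds.jsonl", "a") as f:
    f.write(json.dumps(prediction) + "\n")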

What SWE-Bench does not measure

Design

No architectural decisions. Every task is "fix this bug", not "design this system".

Stakeholder negotiation

No ambiguity about what the bug actually is — the linked PR defines ground truth.

Frontend

Python-first (until Multilingual). No React, no visual correctness.

Production ops

No deployment, no monitoring, no on-call debugging of live systems.

Code review

Agents generate patches; they do not review or critique human code.

Cross-service

Single-repo only. No microservices, no distributed debugging.
