Coding Benchmarks
How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks where benchmark-frontier focus has migrated; branches show specialised variants and successors that remain active in their own right.
APPS (2021-05) was the first widely-cited coding benchmark of the LLM era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023 — frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). In 2025-09 OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current head of the attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialised branches. Click any node to jump to its detail.
Nodes in detail.
APPS
10,000 Python problems scraped from coding sites at three difficulty tiers (introductory, interview, competition). The first widely-shared coding benchmark of the LLM era — from the same Hendrycks group that built MMLU. Preceded HumanEval by two months and is the closest direct ancestor of the function-synthesis line.
HumanEval
164 hand-written Python problems with unit tests, released alongside Codex. Quickly became the default LLM coding benchmark; pass@1 became the standard code-quality metric.
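Pass@1 is the k=1 case of pass@k, which is usually computed with the unbiased estimator from the Codex paper: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that a random draw of k contains at least one pass. A minimal sketch, with made-up counts:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark score = mean over problems (164 of them for HumanEval).
per_problem = [(200, 37, 1), (200, 112, 1), (200, 0, 1)]  # (n, c, k), illustrative
print(sum(pass_at_k(n, c, k) for n, c, k in per_problem) / len(per_problem))
```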
MBPP
974 entry-level Python problems crowdsourced from non-experts. Companion to HumanEval — broader coverage, easier on average, similar saturation curve.
CodeContests
Codeforces-style competitive programming problems. Harder algorithmic reasoning than HumanEval; requires multi-sample generation to score well.
MultiPL-E
HumanEval and MBPP translated into 18+ languages. Tests whether code-LLMs generalise beyond Python or just memorised it.
HumanEval+
80× more test cases per problem, automatically generated to catch the edge cases the original tests missed. Reopened the leaderboard gap that HumanEval had closed.
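The extra tests boil down to differential testing: fuzz a large pool of inputs and require the candidate's output to match a trusted ground-truth solution, so code that only satisfies the original handful of asserts gets caught. A rough sketch of the idea, loosely modelled on HumanEval/0 (has_close_elements); the helper names and input generator are illustrative, not EvalPlus's actual API:

```python
import random

def differential_check(candidate, reference, inputs) -> bool:
    """True only if the candidate matches the reference on every fuzzed input."""
    for args in inputs:
        try:
            if candidate(*args) != reference(*args):
                return False
        except Exception:
            return False  # crashing on an edge case counts as a failure
    return True

def reference(nums, threshold):   # ground truth: compare all pairs
    return any(abs(a - b) < threshold
               for i, a in enumerate(nums) for b in nums[i + 1:])

def candidate(nums, threshold):   # shallow solution: only adjacent pairs
    return any(abs(a - b) < threshold for a, b in zip(nums, nums[1:]))

fuzzed = [(random.sample(range(100), 6), random.uniform(0.5, 5.0)) for _ in range(1000)]
print(differential_check(candidate, reference, fuzzed))  # almost always False
```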
MBPP+
Same EvalPlus treatment for MBPP — adversarial tests, broader coverage, hard mode.
LiveCodeBench
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them — results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.
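The contamination control is purely mechanical: every problem carries its publication date, so scores can be computed only over problems released after a given model's training cutoff. A hedged sketch of that filter; the record fields and IDs are illustrative, not LiveCodeBench's real schema:

```python
from datetime import date

results = [  # one record per (problem, model) attempt; IDs and dates are made up
    {"problem": "leetcode-weekly-XYZ", "released": date(2024, 5, 12), "passed": True},
    {"problem": "codeforces-div2-ABC", "released": date(2023, 11, 3), "passed": False},
]

def contamination_free_pass_rate(records, training_cutoff: date) -> float:
    """Keep only problems published after the model's training cutoff."""
    fresh = [r for r in records if r["released"] > training_cutoff]
    return sum(r["passed"] for r in fresh) / len(fresh) if fresh else float("nan")

print(contamination_free_pass_rate(results, training_cutoff=date(2024, 1, 1)))
```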
SWE-bench
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
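Scoring is patch-level rather than function-level: the harness checks out the repo at the issue's base commit, applies the model-generated patch, and reruns two test groups, the tests the real fix turned from failing to passing (FAIL_TO_PASS) and the tests that must keep passing (PASS_TO_PASS). A simplified sketch of that resolution check, not the actual SWE-bench harness code:

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """True if every listed test passes in the current checkout."""
    proc = subprocess.run(["python", "-m", "pytest", "-q", *test_ids],
                          cwd=repo_dir, capture_output=True, text=True)
    return proc.returncode == 0

def is_resolved(repo_dir: str, model_patch: str, instance: dict) -> bool:
    """Apply the model's patch, then require both test groups to pass."""
    applied = subprocess.run(["git", "apply"], input=model_patch,
                             cwd=repo_dir, text=True)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    return (tests_pass(repo_dir, instance["FAIL_TO_PASS"])
            and tests_pass(repo_dir, instance["PASS_TO_PASS"]))
```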
SWE-bench Verified
500 SWE-bench tasks human-screened to confirm the issue gives enough information to solve and the tests actually validate the fix. Was the agentic-coding standard until 2025 — OpenAI publicly stopped evaluating on it in Sep 2025, citing flawed tests that reward shortcuts plus training-data leakage that inflates scores.
Multi-SWE-bench
Extends SWE-bench beyond Python to Java, TypeScript, Go, Rust, C, C++. A parallel multi-language branch — useful for cross-language reasoning, but not where leaderboard attention has consolidated.
SWE-bench Pro
1,865 problems across public/commercial/held-out splits sourced from 41 actively-maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems — GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier benchmark OpenAI now reports against.
Terminal-Bench
152 hand-built terminal tasks — devops, data, SWE, scientific computing — each scored by container-internal unit tests inside a Docker sandbox. Agent-coupled: the harness, prompt scaffold and underlying model are measured as one system, unlike SWE-bench where only the model is scored. A scope shift, not a successor — Codex + GPT-5.5 currently leads at 82.0%.
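Verification stays simple even when the tasks are not: once the agent's terminal session ends, the harness executes the task's own test script inside the same container, and the exit code decides pass or fail. An illustrative sketch using the Docker CLI; the script path is a placeholder, not Terminal-Bench's actual layout:

```python
import subprocess

def score_task(container_id: str, test_script: str = "/tests/run_tests.sh") -> bool:
    """Run the task's verification script inside the sandbox the agent worked in."""
    proc = subprocess.run(["docker", "exec", container_id, "bash", test_script],
                          capture_output=True, text=True, timeout=600)
    return proc.returncode == 0  # container-internal unit tests decide pass/fail
```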