Coding Benchmarks
How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks where benchmark-frontier focus has migrated; branches show specialised variants and successors that remain active in their own right.
APPS (2021-05) was the first widely-cited coding benchmark of the LLM era; OpenAI shipped the purpose-built HumanEval two months later, and attention migrated within a year. HumanEval and MBPP both saturated by 2023 — frontier models hit >95% pass@1, leaving no signal. EvalPlus (HumanEval+, MBPP+) reopened the gap with adversarial tests. Attention then jumped to LiveCodeBench (contamination-free by date) and SWE-bench Verified (repo-scale, human-filtered). In 2025-09 OpenAI publicly announced it no longer evaluates on SWE-bench Verified: flawed tests reward shortcuts and training-data leakage inflates scores. SWE-bench Pro (Scale AI, arXiv:2509.16941) is the current head of the attention path: 1,865 problems across public/commercial/held-out splits, where GPT-5 and Claude Opus 4.1 land at ~23% vs >70% on Verified.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialised branches. Click any node to jump to its detail.
Nodes in detail.
APPS
10,000 Python problems scraped from coding sites at three difficulty tiers (introductory, interview, competition). The first widely-shared coding benchmark of the LLM era — from the same Hendrycks group that built MMLU. Preceded HumanEval by two months and is the closest direct ancestor of the function-synthesis line.
HumanEval
164 hand-written Python problems with unit tests, released alongside Codex. Quickly became the default LLM coding benchmark; pass@1 became the standard code-quality metric.
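Pass@1 is the k=1 case of pass@k, which is usually computed with the unbiased estimator from the Codex paper: sample n completions per problem, count the c that pass the unit tests, and estimate the probability that a random draw of k contains at least one pass. A minimal sketch, with made-up counts:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them passed."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark score = mean over problems (164 of them for HumanEval).
per_problem = [(200, 37, 1), (200, 112, 1), (200, 0, 1)]  # (n, c, k), illustrative
print(sum(pass_at_k(n, c, k) for n, c, k in per_problem) / len(per_problem))
```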
MBPP
974 entry-level Python problems crowdsourced from non-experts. Companion to HumanEval — broader coverage, easier on average, similar saturation curve.
CodeContests
Codeforces-style competitive programming problems. Harder algorithmic reasoning than HumanEval; requires multi-sample generation to score well.
MultiPL-E
HumanEval and MBPP translated into 18+ languages. Tests whether code-LLMs generalise beyond Python or just memorised it.
HumanEval+
80× more test cases per problem, automatically generated to catch the edge cases the original tests missed. Reopened the leaderboard gap that HumanEval had closed.
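The extra tests boil down to differential testing: fuzz a large pool of inputs and require the candidate's output to match a trusted ground-truth solution, so code that only satisfies the original handful of asserts gets caught. A rough sketch of the idea, loosely modelled on HumanEval/0 (has_close_elements); the helper names and input generator are illustrative, not EvalPlus's actual API:

```python
import random

def differential_check(candidate, reference, inputs) -> bool:
    """True only if the candidate matches the reference on every fuzzed input."""
    for args in inputs:
        try:
            if candidate(*args) != reference(*args):
                return False
        except Exception:
            return False  # crashing on an edge case counts as a failure
    return True

def reference(nums, threshold):   # ground truth: compare all pairs
    return any(abs(a - b) < threshold
               for i, a in enumerate(nums) for b in nums[i + 1:])

def candidate(nums, threshold):   # shallow solution: only adjacent pairs
    return any(abs(a - b) < threshold for a, b in zip(nums, nums[1:]))

fuzzed = [(random.sample(range(100), 6), random.uniform(0.5, 5.0)) for _ in range(1000)]
print(differential_check(candidate, reference, fuzzed))  # almost always False
```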
MBPP+
Same EvalPlus treatment for MBPP — adversarial tests, broader coverage, hard mode.
LiveCodeBench
Continuously scrapes new LeetCode/AtCoder/Codeforces problems and dates them — results can be filtered to problems posted after a model's training cutoff, eliminating contamination. Where the leaderboard moved once HumanEval+ also began saturating.
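The contamination control is purely mechanical: every problem carries its publication date, so scores can be computed only over problems released after a given model's training cutoff. A hedged sketch of that filter; the record fields and IDs are illustrative, not LiveCodeBench's real schema:

```python
from datetime import date

results = [  # one record per (problem, model) attempt; IDs and dates are made up
    {"problem": "leetcode-weekly-XYZ", "released": date(2024, 5, 12), "passed": True},
    {"problem": "codeforces-div2-ABC", "released": date(2023, 11, 3), "passed": False},
]

def contamination_free_pass_rate(records, training_cutoff: date) -> float:
    """Keep only problems published after the model's training cutoff."""
    fresh = [r for r in records if r["released"] > training_cutoff]
    return sum(r["passed"] for r in fresh) / len(fresh) if fresh else float("nan")

print(contamination_free_pass_rate(results, training_cutoff=date(2024, 1, 1)))
```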
SWE-bench
2,294 real GitHub issue→PR pairs across 12 Python repos. The first benchmark to test whether models could function as software engineers, not just function generators. Superseded by Verified after analysis showed many issues were unsolvable as posed.
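Scoring is patch-level rather than function-level: the harness checks out the repo at the issue's base commit, applies the model-generated patch, and reruns two test groups, the tests the real fix turned from failing to passing (FAIL_TO_PASS) and the tests that must keep passing (PASS_TO_PASS). A simplified sketch of that resolution check, not the actual SWE-bench harness code:

```python
import subprocess

def tests_pass(repo_dir: str, test_ids: list[str]) -> bool:
    """True if every listed test passes in the current checkout."""
    proc = subprocess.run(["python", "-m", "pytest", "-q", *test_ids],
                          cwd=repo_dir, capture_output=True, text=True)
    return proc.returncode == 0

def is_resolved(repo_dir: str, model_patch: str, instance: dict) -> bool:
    """Apply the model's patch, then require both test groups to pass."""
    applied = subprocess.run(["git", "apply"], input=model_patch,
                             cwd=repo_dir, text=True)
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly
    return (tests_pass(repo_dir, instance["FAIL_TO_PASS"])
            and tests_pass(repo_dir, instance["PASS_TO_PASS"]))
```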
SWE-bench Verified
500 SWE-bench tasks human-screened to confirm the issue gives enough information to solve and the tests actually validate the fix. Was the agentic-coding standard until 2025 — OpenAI publicly stopped evaluating on it in Sep 2025, citing flawed tests that reward shortcuts plus training-data leakage that inflates scores.
Multi-SWE-bench
Extends SWE-bench beyond Python to Java, TypeScript, Go, Rust, C, C++. A parallel multi-language branch — useful for cross-language reasoning, but not where leaderboard attention has consolidated.
SWE-bench Pro
1,865 problems across public/commercial/held-out splits sourced from 41 actively-maintained business and B2B repos. Designed to fix Verified's contamination and shortcut problems — GPT-5 and Claude Opus 4.1 land at ~23% here vs >70% on Verified. The frontier benchmark OpenAI now reports against.
Terminal-Bench
152 hand-built terminal tasks — devops, data, SWE, scientific computing — each scored by container-internal unit tests inside a Docker sandbox. Agent-coupled: the harness, prompt scaffold and underlying model are measured as one system, unlike SWE-bench where only the model is scored. A scope shift, not a successor — Codex + GPT-5.5 currently leads at 82.0%.
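Verification stays simple even when the tasks are not: once the agent's terminal session ends, the harness executes the task's own test script inside the same container, and the exit code decides pass or fail. An illustrative sketch using the Docker CLI; the script path is a placeholder, not Terminal-Bench's actual layout:

```python
import subprocess

def score_task(container_id: str, test_script: str = "/tests/run_tests.sh") -> bool:
    """Run the task's verification script inside the sandbox the agent worked in."""
    proc = subprocess.run(["docker", "exec", container_id, "bash", test_script],
                          capture_output=True, text=True, timeout=600)
    return proc.returncode == 0  # container-internal unit tests decide pass/fail
```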