NLP Benchmarks
How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU tasks; SuperGLUE raised the bar when GLUE saturated; BIG-bench expanded coverage to hundreds of tasks. The shift around 2023 was as much conceptual as technical — once models passed human baselines on NLU tasks, the interesting question became not 'does the model understand language' but 'can it reason'. Branches include SQuAD (reading comprehension), HellaSwag (commonsense completion), and WinoGrande (Winograd schemas).
The NLP benchmark era ended not with a single saturation event but with a framing shift. The GLUE human baseline was crossed in 2019, the same year SuperGLUE launched; SuperGLUE's human baseline fell in 2021. BIG-bench (2022) was a collective attempt to find the next hard thing — 204 tasks, most of them showing rapid improvement. The key observation from 2023: once chain-of-thought prompting was routine, the tasks that remained hard were reasoning tasks, not NLU tasks. That reframing — 'reasoning' as the successor label to 'NLP' — explains why MMLU, not SuperGLUE, became the standard model-report benchmark. GLUE and SuperGLUE are now training-time sanity checks, not leaderboard categories.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
SQuAD
107,785 crowd-sourced QA pairs over 536 Wikipedia articles. Extractive reading comprehension — the model finds an answer span in the passage. SQuAD 2.0 (2018) added 53,775 unanswerable questions. Token-level F1 and exact match (EM) are the metrics. Both versions are now saturated, with fine-tuned models exceeding the human baseline.
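A simplified sketch of how predicted spans are typically scored — the normalization and token-overlap F1 follow the official SQuAD evaluation recipe, but the real script also takes the max over multiple reference answers and handles SQuAD 2.0's no-answer cases:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial span overlap earns partial F1 credit but zero EM.
print(exact_match("Denver Broncos", "the Denver Broncos"))  # 1.0 (articles stripped)
print(f1_score("Broncos", "the Denver Broncos"))            # ~0.67
```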
GLUE
9-task NLU benchmark: NLI, coreference, sentiment, similarity, QA. It unified training and evaluation across diverse language tasks under a single aggregate score. The human baseline (~87) was crossed in 2019, within months of BERT-style fine-tuning becoming standard. The prototype for multi-task language evaluation.
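As one illustration of that unified format, a sketch using the Hugging Face `datasets` library (an assumption about tooling, not part of GLUE itself) to pull two very different GLUE tasks through the same interface:

```python
# Sketch: loading GLUE tasks through one interface with the Hugging Face
# `datasets` library (assumed tooling; column names vary per task).
from datasets import load_dataset

# Single-sentence task: SST-2 sentiment (columns: sentence, label).
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])

# Sentence-pair task: RTE entailment (columns: sentence1, sentence2, label).
rte = load_dataset("glue", "rte")
print(rte["validation"][0])
```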
HellaSwag
70,000 activity-description contexts where the model picks the most plausible completion from four choices. Adversarially filtered so BERT-era models struggled; later frontier models approached human accuracy (~95%). Often cited alongside MMLU as a commonsense-coverage complement.
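A common way to score completion benchmarks like HellaSwag is to pick the candidate ending with the highest (often length-normalized) log-likelihood under the model. A minimal sketch with Hugging Face transformers; the model name, example item, and the choice of length normalization are illustrative assumptions, not part of the benchmark definition:

```python
# Sketch: likelihood-based scoring for a HellaSwag-style item.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, ending: str) -> float:
    """Mean log-probability per ending token given the context (length-normalized)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[0, 1:]
    ending_start = ctx_ids.shape[1] - 1                     # first predicted ending token
    token_lp = log_probs[torch.arange(len(targets)), targets][ending_start:]
    return token_lp.sum().item() / len(token_lp)

context = "A man is standing on a ladder. He"
endings = [" paints the ceiling with a roller.", " eats the ladder for breakfast."]
scores = [completion_logprob(context, e) for e in endings]
print(endings[scores.index(max(scores))])
```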
SuperGLUE
8 harder NLU tasks: question answering with multi-sentence reasoning, coreference with context, word sense disambiguation. The human baseline is 89.8. GPT-3-class models began approaching it in 2021; fine-tuned DeBERTa ensembles and ST-MoE crossed it shortly after. Retired as a frontier benchmark by 2022.
WinoGrande
44,000 Winograd-style commonsense coreference questions, adversarially filtered to remove statistical artifacts. Human accuracy ~94%. Commonly included in LLM evaluation suites as a commonsense-reasoning check.
MMLU
57-subject multiple-choice exam spanning STEM, law, history, and social sciences. Though technically a knowledge benchmark, MMLU is the bridge node — it emerged from the NLU era but became the standard 'reasoning-era' model report benchmark. See the reasoning lineage for its successors.
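Unlike the continuation-likelihood recipe sketched for HellaSwag above, MMLU is usually scored by comparing the model's next-token probabilities for the answer letters after a formatted prompt. A minimal sketch; the model choice and the toy question are illustrative, and real harnesses also prepend few-shot examples from the subject's dev split:

```python
# Sketch: MMLU-style multiple-choice scoring via answer-letter logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Which planet is known as the Red Planet?"      # toy example, not from MMLU
choices = ["Venus", "Mars", "Jupiter", "Saturn"]
letters = ["A", "B", "C", "D"]

prompt = question + "\n"
for letter, choice in zip(letters, choices):
    prompt += f"{letter}. {choice}\n"
prompt += "Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]                  # next-token distribution

# Compare the logits of " A", " B", " C", " D" as the next token.
letter_ids = [tokenizer(f" {l}").input_ids[0] for l in letters]
pred = letters[int(logits[letter_ids].argmax())]
print(pred)
```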
BIG-bench
204 tasks contributed by the research community, covering reasoning, mathematics, code, social bias, and more. The last major attempt to build a single comprehensive NLP benchmark. Most tasks show rapid improvement post-GPT-4, validating the community's move toward harder reasoning-focused evaluations.