NLP Benchmarks
How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU tasks; SuperGLUE raised the bar when GLUE saturated; BIG-bench expanded coverage to hundreds of tasks. The shift around 2023 was as much conceptual as technical — once models passed human baselines on NLU tasks, the interesting question became not 'does the model understand language' but 'can it reason'. Branches include SQuAD (reading comprehension), HellaSwag (commonsense completion), and WinoGrande (Winograd schemas).
The NLP benchmark era ended not with a single saturation event but with a framing shift. The GLUE human baseline was crossed in 2019, the same year SuperGLUE launched; SuperGLUE's human baseline fell in 2021. BIG-bench (2022) was a collective attempt to find the next hard thing — 204 tasks, most of them showing rapid improvement. The key observation from 2023: once chain-of-thought prompting was routine, the tasks that remained hard were reasoning tasks, not NLU tasks. That reframing — 'reasoning' as the successor label to 'NLP' — explains why MMLU, not SuperGLUE, became the standard model-report benchmark. GLUE and SuperGLUE are now training-time sanity checks, not leaderboard categories.
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
SQuAD
107,785 crowd-sourced QA pairs over 536 Wikipedia articles. Extractive reading comprehension — the model finds an answer span in the passage. SQuAD 2.0 (2018) added 53,775 unanswerable questions. Token-level F1 and exact match (EM) are the metrics. Both versions are now saturated, with fine-tuned models exceeding the human baseline.
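A simplified sketch of how predicted spans are typically scored — the normalization and token-overlap F1 follow the official SQuAD evaluation recipe, but the real script also takes the max over multiple reference answers and handles SQuAD 2.0's no-answer cases:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Partial span overlap earns partial F1 credit but zero EM.
print(exact_match("Denver Broncos", "the Denver Broncos"))  # 1.0 (articles stripped)
print(f1_score("Broncos", "the Denver Broncos"))            # ~0.67
```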
GLUE
9-task NLU benchmark: NLI, coreference, sentiment, similarity, QA. It unified training and evaluation across diverse language tasks under a single aggregate score. The human baseline (~87) was crossed in 2019, within months of BERT-style fine-tuning becoming standard. The prototype for multi-task language evaluation.
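As one illustration of that unified format, a sketch using the Hugging Face `datasets` library (an assumption about tooling, not part of GLUE itself) to pull two very different GLUE tasks through the same interface:

```python
# Sketch: loading GLUE tasks through one interface with the Hugging Face
# `datasets` library (assumed tooling; column names vary per task).
from datasets import load_dataset

# Single-sentence task: SST-2 sentiment (columns: sentence, label).
sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])

# Sentence-pair task: RTE entailment (columns: sentence1, sentence2, label).
rte = load_dataset("glue", "rte")
print(rte["validation"][0])
```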
HellaSwag
70,000 activity-description contexts where the model picks the most plausible completion from four choices. Adversarially filtered so BERT-era models struggled; later frontier models approached human accuracy (~95%). Often cited alongside MMLU as a commonsense-coverage complement.
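A common way to score completion benchmarks like HellaSwag is to pick the candidate ending with the highest (often length-normalized) log-likelihood under the model. A minimal sketch with Hugging Face transformers; the model name, example item, and the choice of length normalization are illustrative assumptions, not part of the benchmark definition:

```python
# Sketch: likelihood-based scoring for a HellaSwag-style item.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, ending: str) -> float:
    """Mean log-probability per ending token given the context (length-normalized)."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[0, 1:]
    ending_start = ctx_ids.shape[1] - 1                     # first predicted ending token
    token_lp = log_probs[torch.arange(len(targets)), targets][ending_start:]
    return token_lp.sum().item() / len(token_lp)

context = "A man is standing on a ladder. He"
endings = [" paints the ceiling with a roller.", " eats the ladder for breakfast."]
scores = [completion_logprob(context, e) for e in endings]
print(endings[scores.index(max(scores))])
```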
SuperGLUE
8 harder NLU tasks: question answering with multi-sentence reasoning, coreference with context, word sense disambiguation. The human baseline is 89.8. GPT-3-class models began approaching it in 2021; fine-tuned DeBERTa ensembles and ST-MoE crossed it shortly after. Retired as a frontier benchmark by 2022.
WinoGrande
44,000 Winograd-style commonsense coreference questions, adversarially filtered to remove statistical artifacts. Human accuracy ~94%. Commonly included in LLM evaluation suites as a commonsense-reasoning check.
MMLU
57-subject multiple-choice exam spanning STEM, law, history, and social sciences. Though technically a knowledge benchmark, MMLU is the bridge node — it emerged from the NLU era but became the standard 'reasoning-era' model report benchmark. See the reasoning lineage for its successors.
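Unlike the continuation-likelihood recipe sketched for HellaSwag above, MMLU is usually scored by comparing the model's next-token probabilities for the answer letters after a formatted prompt. A minimal sketch; the model choice and the toy question are illustrative, and real harnesses also prepend few-shot examples from the subject's dev split:

```python
# Sketch: MMLU-style multiple-choice scoring via answer-letter logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")          # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Which planet is known as the Red Planet?"      # toy example, not from MMLU
choices = ["Venus", "Mars", "Jupiter", "Saturn"]
letters = ["A", "B", "C", "D"]

prompt = question + "\n"
for letter, choice in zip(letters, choices):
    prompt += f"{letter}. {choice}\n"
prompt += "Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]                  # next-token distribution

# Compare the logits of " A", " B", " C", " D" as the next token.
letter_ids = [tokenizer(f" {l}").input_ids[0] for l in letters]
pred = letters[int(logits[letter_ids].argmax())]
print(pred)
```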
BIG-bench
204 tasks contributed by the research community, covering reasoning, mathematics, code, social bias, and more. The last major attempt to build a single comprehensive NLP benchmark. Most tasks show rapid improvement post-GPT-4, validating the community's move toward harder reasoning-focused evaluations.