Historical Benchmark: effectively retired since late 2022

GLUE & SuperGLUE

The General Language Understanding Evaluation (GLUE, 2018) and its harder successor SuperGLUE (2019) defined how the field measured NLU for roughly five years. This page is a historical record of that run — not a live SOTA tracker. The leaderboards saturated near human baseline in 2022 and have seen minimal new submissions since.

SuperGLUE is effectively retired. The aggregate score plateaued around 91.2–91.3 in 2022 (ST-MoE-32B, Vega v2), and no frontier LLM — GPT-4, GPT-5, Claude, Gemini, Llama — submits to the official leaderboard. Current frontier evaluation lives on MMLU, GPQA, BIG-Bench Hard, HELM, and LiveBench. The numbers on this page reflect the last active wave of competition (2018–2022), not the current ceiling of language understanding.

Aggregate Metric: Avg. Score
SuperGLUE Tasks: 8
Human Baseline: 89.8 (SuperGLUE)
SOTA (2022): 91.3
Status: Retired ~2022

The NLU Paradigm Shift

Before GLUE, NLP research was fragmented. Models were often designed for single tasks—sentiment analysis, question answering, or natural language inference—making it difficult to measure general linguistic intelligence.

Introduced by researchers from NYU, DeepMind, and Facebook AI, GLUE established a multi-task evaluation paradigm. It forced models to share parameters across tasks, favoring architectures that learned robust, transferable representations of language rather than task-specific heuristics.

As BERT and its successors rapidly saturated GLUE, SuperGLUE was launched with harder tasks like coreference resolution and causal reasoning, featuring lower-resource training sets and more complex linguistic phenomena.

1. Aggregate Scoring: a single macro-average score across all tasks ranks general performance.

2. Private Test Sets: held-out labels on a centralized server prevent data leakage and overfitting.
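The macro-average works as follows: tasks that report two metrics (e.g. MRPC's F1/accuracy, STS-B's Pearson/Spearman) are averaged internally first, then every per-task score contributes with equal weight. A minimal sketch in plain Python, with illustrative (not official) scores:

```python
def glue_aggregate(task_scores):
    """Macro-average a GLUE-style score dict.

    Tasks with multiple metrics are averaged internally first,
    then all per-task scores are averaged with equal weight.
    """
    per_task = [sum(metrics) / len(metrics) for metrics in task_scores.values()]
    return sum(per_task) / len(per_task)

# Illustrative per-task scores (not an official submission)
scores = {
    "CoLA": [60.5],          # Matthews correlation
    "SST-2": [94.9],         # accuracy
    "MRPC": [89.3, 85.4],    # F1, accuracy -> averaged first
    "STS-B": [87.6, 86.5],   # Pearson, Spearman -> averaged first
}
print(round(glue_aggregate(scores), 2))  # 82.45
```

Because every task counts equally, a weak outlier task (historically CoLA or WNLI) drags the aggregate down regardless of dataset size.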

Task Category Distribution

GLUE Task Radar Chart
Single Sentence: CoLA, SST-2
Similarity/Paraphrase: MRPC, STS-B, QQP
Inference (NLI): MNLI, QNLI, RTE, WNLI
Reasoning (SuperGLUE): BoolQ, COPA, MultiRC

The Road to Parity (2018–2022)

Aggregate score progression from BERT to the final wave of SuperGLUE submissions.

Model score by year (human baseline: 89.8):
2018 BERT-Large: 80.5
2019 RoBERTa: 84.6
2019 T5-11B: 89.3
2021 DeBERTa ens.: 90.3
2021 ERNIE 3.0: 90.6
2022 ST-MoE-32B: 91.2
2022 Vega v2: 91.3

SuperGLUE Leaderboard (Final Wave)

Verified submissions only

Rows show the official top submissions to super.gluebenchmark.com through 2022, when active competition ended. Each score links to its primary source. Entries that could not be independently sourced have been omitted rather than estimated.

Rank | Model | Score | Metric | Organization | Date | Source | Type
1 | Vega v2 (6B) | 91.3 | SuperGLUE avg | JD Explore Academy | 2022-10 | arxiv.org/abs/2212.01853 | Proprietary
2 | ST-MoE-32B | 91.2 | SuperGLUE avg | Google Brain | 2022-02 | arxiv.org/abs/2202.08906 | Research
3 | ERNIE 3.0 | 90.6 | SuperGLUE avg | Baidu | 2021-07 | arxiv.org/abs/2107.02137 | Proprietary
4 | DeBERTa (ensemble) | 90.3 | SuperGLUE avg | Microsoft | 2021-01 | microsoft.com/research (DeBERTa blog) | Open-source
5 | T5-11B | 89.3 | SuperGLUE avg | Google | 2019-10 | arxiv.org/abs/1910.10683 | Open-source
Ref | Human baseline | 89.8 | SuperGLUE avg | Wang et al. | 2019-05 | arxiv.org/abs/1905.00537 | Reference

Note: The official SuperGLUE leaderboard has seen no meaningful new submissions since late 2022. Frontier LLMs (GPT-4/5, Claude, Gemini, Llama) are not represented here because their developers target MMLU, GPQA, BIG-Bench Hard, and HELM instead. Treat scores above as the endpoint of a saturated benchmark, not a current capability ranking.

Task Taxonomy

CoLA (MCC, Single-Sentence): grammatical acceptability (linguistic competence)
SST-2 (Acc, Single-Sentence): binary sentiment analysis of movie reviews
MRPC (F1/Acc, Sentence-Pair): semantic equivalence detection
STS-B (Pearson/Spearman, Sentence-Pair): similarity scoring (0-5 scale)
MNLI (Acc, Sentence-Pair): natural language inference (entailment/neutral/contradiction)
BoolQ (Acc, QA/Reasoning): yes/no questions based on passage context
COPA (Acc, Reasoning): causal reasoning (Choice of Plausible Alternatives)
WSC (Acc, Coreference): Winograd Schema Challenge (pronoun resolution)
ReCoRD (F1/EM, QA): reading comprehension with commonsense reasoning
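CoLA's choice of Matthews correlation (MCC) over accuracy matters because its labels are imbalanced: a majority-class guesser scores high accuracy but an MCC near zero. A minimal sketch of the metric from a binary confusion matrix (the counts below are toy values):

```python
from math import sqrt

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient, the CoLA metric.

    Ranges from -1 to +1; 0 means chance-level prediction,
    which makes it robust to class imbalance.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy confusion matrix for acceptable/unacceptable judgments
print(round(matthews_corrcoef(tp=40, tn=30, fp=10, fn=20), 3))  # 0.408
```

Note the degenerate case: a model that predicts a single class for everything zeroes out the denominator, and the convention used here returns 0.0 rather than dividing by zero.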

Linguistic Diagnostic Suite

GLUE includes a curated set of "diagnostic" examples to test specific linguistic phenomena. Even models with high aggregate scores often struggle with these edge cases.

Phenomenon: Negation
"The keys to the dual functionality is not the enzymes."

Models often ignore the 'not' and predict the same as the affirmative version.

Phenomenon: Coreference (Winograd Schema)
"The trophy doesn't fit into the brown suitcase because it is too large."

Requires world knowledge to know 'it' refers to the trophy, not the suitcase.
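One common way to quantify the negation failure mode is with minimal contrast pairs: flip a single word (e.g. insert "not") and check whether the model's prediction flips too. A toy sketch of that check; the keyword "model" here is purely illustrative, standing in for any sentence-to-label callable:

```python
def negation_flip_rate(pairs, predict):
    """Contrast-set check: for (sentence, negated sentence) pairs,
    count how often the model's prediction actually flips.
    `predict` is any callable mapping a sentence to a label."""
    flipped = sum(predict(a) != predict(b) for a, b in pairs)
    return flipped / len(pairs)

# Toy "model" that ignores negation entirely -- the failure mode described above
bag_of_words = lambda s: "positive" if "good" in s else "negative"

pairs = [
    ("The movie was good.", "The movie was not good."),
    ("Service was good.", "Service was not good."),
]
print(negation_flip_rate(pairs, bag_of_words))  # 0.0 -- never flips
```

A flip rate of 0.0 on pairs whose gold labels differ is exactly the diagnostic failure the suite is designed to expose; aggregate accuracy alone would hide it.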

Diagnostic Heatmap

Diagnostic Heatmap

Visualization of model failures across 30+ linguistic categories including logic, lexical semantics, and predicate-argument structure.

Foundational Papers

Implementation & Tools

nyu-mll/glue-baselines (800+ stars): Official GLUE baseline implementations and evaluation toolkit.

huggingface/transformers (157k stars): Industry standard for fine-tuning on GLUE/SuperGLUE tasks.

microsoft/DeBERTa (2.2k stars): Implementation of the model that dominated SuperGLUE for two years.

google-research/t5 (6.4k stars): Text-to-Text Transfer Transformer framework for NLU.
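The T5 framework listed above treats every benchmark task as text-to-text: a task prefix is prepended to the input and the model emits the label as a string (e.g. "entailment"). A hedged sketch of that casting for two tasks, roughly following the convention shown in the T5 paper; the dict field names are illustrative, not an official schema:

```python
def to_text_to_text(task, example):
    """Cast a GLUE-style example dict into a T5-style text-to-text input,
    where every task becomes string -> string prediction.
    (Field names are illustrative, not the official dataset schema.)"""
    if task == "mnli":
        return (f"mnli premise: {example['premise']} "
                f"hypothesis: {example['hypothesis']}")
    if task == "sst2":
        return f"sst2 sentence: {example['sentence']}"
    raise ValueError(f"unsupported task: {task}")

ex = {"premise": "A man is playing a guitar.",
      "hypothesis": "A person plays music."}
print(to_text_to_text("mnli", ex))
```

The task prefix is what lets a single shared model serve all tasks at once, which is the multi-task, shared-parameter paradigm GLUE was built to reward.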

Related Benchmarks

Benchmark | Focus | Key Difference
MMLU | World Knowledge | Tests 57 subjects (STEM, humanities) vs. NLU reasoning.
SQuAD | Reading Comprehension | Extractive QA on Wikipedia vs. multi-task classification.
HELM | Holistic Evaluation | Includes fairness, bias, and toxicity metrics.

Using GLUE/SuperGLUE for research or teaching?

The datasets remain a clean, well-documented starting point for NLU work and transfer-learning experiments, even though the leaderboard itself has stopped drawing frontier submissions.