Benchmark Standard | Updated March 2026

GLUE & SuperGLUE

The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for evaluating NLP models across a diverse set of existing natural language understanding (NLU) tasks. SuperGLUE was introduced in 2019 to provide a more difficult challenge as models reached human parity on the original suite.

Aggregate Metric: Avg. Score
Total Tasks: 17 (9 GLUE + 8 SuperGLUE)
Human Baseline: 89.8 (SuperGLUE)
SOTA Score: 98.9
Venue: ICLR (GLUE) / NeurIPS (SuperGLUE)

The NLU Paradigm Shift

Before GLUE, NLP research was fragmented. Models were often designed for single tasks—sentiment analysis, question answering, or natural language inference—making it difficult to measure general linguistic intelligence.

Introduced by researchers from NYU, the University of Washington, and DeepMind (later joined by Facebook AI for SuperGLUE), GLUE established a multi-task evaluation paradigm. Rather than mandating any particular architecture, its single aggregate leaderboard rewarded models that learned robust, transferable representations of language over task-specific heuristics.

As BERT and its successors rapidly saturated GLUE, SuperGLUE was launched with harder tasks like coreference resolution and causal reasoning, featuring lower-resource training sets and more complex linguistic phenomena.

1. Aggregate Scoring: a single macro-average score across all tasks ranks general performance.

2. Private Test Sets: held-out labels on a centralized server prevent data leakage and overfitting.
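The aggregate in (1) is a plain unweighted mean over per-task scores. A minimal sketch in Python; the task names and numbers below are illustrative, not real leaderboard values, and tasks that report two metrics (e.g. MRPC's F1/Acc) are assumed to be pre-averaged into a single number first:

```python
# Macro-average aggregation as used for leaderboard ranking: every task
# contributes equally, regardless of its dataset size or metric type.

def glue_aggregate(task_scores: dict[str, float]) -> float:
    """Unweighted macro-average over per-task scores in [0, 100]."""
    return sum(task_scores.values()) / len(task_scores)

# Illustrative scores only (not actual model results).
example = {"cola": 60.0, "sst2": 95.0, "mrpc": 88.0}
print(glue_aggregate(example))  # 81.0
```

Because the mean is unweighted, a few points gained on a small, hard task (like CoLA) move the aggregate as much as the same gain on a large one (like MNLI).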

Task Category Distribution

GLUE Task Radar Chart
Single Sentence: CoLA, SST-2
Similarity/Paraphrase: MRPC, STS-B, QQP
Inference (NLI): MNLI, QNLI, RTE, WNLI
Reasoning (SuperGLUE): BoolQ, COPA, MultiRC

The Road to Parity

Evolution of the aggregate score from 2018 baselines to 2026 LLMs.

Human Baseline (SuperGLUE): 89.8

Year  Model       Score
2018  BERT        80.5
2019  RoBERTa     88.5
2020  T5          90.3
2021  DeBERTa     91.7
2023  Vega v2     96.3
2025  Llama-3.1   97.8
2026  Gemini 2.5  98.9

Leaderboard

Top 10, historical and current

Rank  Model               Score  Benchmark  Organization        Date     Type
1     Gemini 2.5 Pro      98.9   SuperGLUE  Google              2026-03  API
2     Llama-3.1-405B      97.8   SuperGLUE  Meta                2025-02  Open-weight
3     Vega v2             96.3   SuperGLUE  JD Explore Academy  2023-09  Proprietary
4     ST-MoE-32B          95.1   SuperGLUE  Google Research     2023-04  Open-source
5     METRO-1.6T          94.2   SuperGLUE  Microsoft           2022-11  Proprietary
6     DeBERTa (ensemble)  90.3   SuperGLUE  Microsoft           2020-12  Open-source
7     T5-11B              89.9   SuperGLUE  Google              2019-10  Open-source
8     RoBERTa-Large       88.5   GLUE       Meta/FAIR           2019-07  Open-source
9     BERT-Large          80.5   GLUE       Google              2018-10  Open-source
10    BiLSTM Baseline     65.7   GLUE       NYU                 2018-01  Open-source

Task Taxonomy

CoLA (Single-Sentence, MCC): grammatical acceptability (linguistic competence).
SST-2 (Single-Sentence, Acc): binary sentiment analysis of movie reviews.
MRPC (Sentence-Pair, F1/Acc): semantic equivalence detection.
STS-B (Sentence-Pair, Pearson/Spearman): similarity scoring on a 0-5 scale.
MNLI (Sentence-Pair, Acc): natural language inference (entailment/neutral/contradiction).
BoolQ (QA/Reasoning, Acc): yes/no questions based on passage context.
COPA (Reasoning, Acc): causal reasoning (Choice of Plausible Alternatives).
WSC (Coreference, Acc): Winograd Schema Challenge (pronoun resolution).
ReCoRD (QA, F1/EM): reading comprehension with commonsense reasoning.
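CoLA uses the Matthews correlation coefficient (MCC) rather than accuracy because its labels are unbalanced, and MCC penalizes majority-class guessing. A self-contained sketch of the standard binary formula (the official GLUE evaluation uses its own scripts; this is just the textbook definition):

```python
import math

def matthews_corrcoef(y_true: list[int], y_pred: list[int]) -> float:
    """Binary MCC: +1 is perfect, 0 is chance level, -1 is total disagreement."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: if any marginal is empty (e.g. the model predicts one class
    # for everything), the coefficient is defined as 0.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

print(matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```

Note that always predicting the majority class scores 0.0 here, even though its accuracy can be high on a skewed acceptability dataset.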

Linguistic Diagnostic Suite

GLUE includes a curated set of "diagnostic" examples to test specific linguistic phenomena. Even models with high aggregate scores often struggle with these edge cases.

Phenomenon: Negation
"The keys to the dual functionality are not the enzymes."

Models often ignore the 'not' and predict the same as the affirmative version.

Phenomenon: Coreference (Winograd Schema)
"The trophy doesn't fit into the brown suitcase because it is too large."

Requires world knowledge to know 'it' refers to the trophy, not the suitcase.
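The negation failure can be made concrete with a toy predictor. The heuristic below is hypothetical (not any real GLUE submission): a bag-of-words overlap rule that treats "not" as just another token and therefore, by construction, returns the same prediction for a premise and its negation:

```python
import re

def overlap_predict(premise: str, hypothesis: str, threshold: float = 0.6) -> str:
    """Toy NLI 'model': predict entailment when most hypothesis words
    also appear in the premise, ignoring word order and polarity."""
    p = set(re.findall(r"[a-z]+", premise.lower()))
    h = set(re.findall(r"[a-z]+", hypothesis.lower()))
    return "entailment" if len(p & h) / len(h) >= threshold else "not_entailment"

premise = "The keys to the dual functionality are the enzymes."
negated = "The keys to the dual functionality are not the enzymes."
hypothesis = "The enzymes are the keys to the dual functionality."

# The gold labels differ (entailment vs. contradiction), but the heuristic's
# predictions do not: adding 'not' only grows the premise's word set.
print(overlap_predict(premise, hypothesis), overlap_predict(negated, hypothesis))
```

Diagnostic sets exist precisely to catch models whose learned features behave like this overlap rule despite a high aggregate score.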

Diagnostic Heatmap

Visualization of model failures across 30+ linguistic categories including logic, lexical semantics, and predicate-argument structure.

Foundational Papers

Implementation & Tools

nyu-mll/glue-baselines (800+ stars)
Official GLUE baseline implementations and evaluation toolkit.

huggingface/transformers (157k stars)
Industry standard for fine-tuning on GLUE/SuperGLUE tasks.

microsoft/DeBERTa (2.2k stars)
Implementation of the model that dominated SuperGLUE for two years.

google-research/text-to-text-transfer-transformer (6.4k stars)
Text-to-Text Transfer Transformer (T5) framework for NLU.

Related Benchmarks

Benchmark  Focus                  Key Difference
MMLU       World Knowledge        Tests 57 subjects (STEM, humanities) vs. NLU reasoning.
SQuAD      Reading Comprehension  Extractive QA on Wikipedia vs. multi-task classification.
HELM       Holistic Evaluation    Includes fairness, bias, and toxicity metrics.

Ready to evaluate your model?

The GLUE and SuperGLUE datasets are available via Hugging Face Datasets or the official NYU mirrors.
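Assuming the Hugging Face hub's naming for GLUE (a "glue" dataset with the subset names below; requires `pip install datasets` and network access), loading a task might look like:

```python
# The nine GLUE subsets as exposed on the Hugging Face hub (assumption: hub
# naming; verify against the dataset card before relying on these strings).
GLUE_TASKS = ["cola", "sst2", "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", "wnli"]

def load_glue_task(task: str):
    """Download one GLUE task's splits via the `datasets` library."""
    from datasets import load_dataset  # lazy import: optional dependency
    if task not in GLUE_TASKS:
        raise ValueError(f"unknown GLUE task: {task}")
    return load_dataset("glue", task)

# Usage (network required):
#   splits = load_glue_task("sst2")
#   print(splits["train"][0])
```

Test-set labels are withheld, so scored results still require a submission to the official leaderboard server.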