GLUE & SuperGLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for evaluating the performance of NLP models across a diverse set of existing natural language understanding (NLU) tasks. SuperGLUE was introduced in 2019 to provide a more difficult challenge after models reached human parity on the original suite.
The NLU Paradigm Shift
Before GLUE, NLP research was fragmented. Models were often designed for single tasks—sentiment analysis, question answering, or natural language inference—making it difficult to measure general linguistic intelligence.
Introduced by researchers from NYU, the University of Washington, and DeepMind, GLUE established a multi-task evaluation paradigm. It rewarded models that shared knowledge across tasks, favoring architectures that learned robust, transferable representations of language rather than task-specific heuristics.
As BERT and its successors rapidly saturated GLUE, SuperGLUE was launched with harder tasks like coreference resolution and causal reasoning, featuring lower-resource training sets and more complex linguistic phenomena.
Aggregate Scoring
Single macro-average score across all tasks to rank general performance.
Private Test Sets
Held-out labels on a centralized server prevent data leakage and overfitting.
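Concretely, the aggregate is an unweighted mean of per-task scores (a macro-average), with multi-metric tasks such as MRPC averaged into a single number first. A minimal sketch, using hypothetical per-task scores:

```python
# Hypothetical per-task scores (percent). Tasks that report two
# metrics (e.g. MRPC F1/Acc) are averaged into one number first.
task_scores = {
    "CoLA": 60.5,
    "SST-2": 94.9,
    "MRPC": (89.3 + 85.4) / 2,
    "STS-B": (87.6 + 86.5) / 2,
    "MNLI": 86.7,
}

# The headline number is the unweighted mean across tasks, so a small
# task like CoLA counts exactly as much as a large one like MNLI.
aggregate = sum(task_scores.values()) / len(task_scores)
print(round(aggregate, 1))  # -> 83.3
```

Because the average is unweighted, a weak score on one low-resource task drags the headline number down as much as a weak score on the largest task would.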
Task Category Distribution

The Road to Parity
Evolution of the aggregate score from 2018 baselines to 2026 LLMs.
Leaderboard
| Rank | Model | Score | Benchmark | Organization | Date | Type |
|---|---|---|---|---|---|---|
| 1 | Gemini 2.5 Pro | 98.9 | SuperGLUE | Google DeepMind | 2026-03 | API |
| 2 | Llama-3.1-405B | 97.8 | SuperGLUE | Meta | 2025-02 | Open-weight |
| 3 | Vega v2 | 96.3 | SuperGLUE | JD Explore Academy | 2023-09 | Proprietary |
| 4 | ST-MOE-32B | 95.1 | SuperGLUE | Google Research | 2023-04 | Open-source |
| 5 | METRO-1.6T | 94.2 | SuperGLUE | Microsoft | 2022-11 | Proprietary |
| 6 | DeBERTa (ensemble) | 90.3 | SuperGLUE | Microsoft | 2020-12 | Open-source |
| 7 | T5-11B | 89.9 | SuperGLUE | Google | 2019-10 | Open-source |
| 8 | RoBERTa-Large | 88.5 | GLUE | Meta/FAIR | 2019-07 | Open-source |
| 9 | BERT-Large | 80.5 | GLUE | Google | 2018-10 | Open-source |
| 10 | BiLSTM Baseline | 65.7 | GLUE | NYU | 2018-01 | Open-source |
Task Taxonomy
| Task | Benchmark | Metric | Description |
|---|---|---|---|
| CoLA | GLUE | MCC | Grammatical acceptability (linguistic competence) |
| SST-2 | GLUE | Acc | Binary sentiment analysis of movie reviews |
| MRPC | GLUE | F1/Acc | Semantic equivalence detection |
| STS-B | GLUE | Pearson/Spearman | Similarity scoring (0-5 scale) |
| MNLI | GLUE | Acc | Natural language inference (entailment/neutral/contradiction) |
| BoolQ | SuperGLUE | Acc | Yes/no questions based on passage context |
| COPA | SuperGLUE | Acc | Causal reasoning (Choice of Plausible Alternatives) |
| WSC | SuperGLUE | Acc | Winograd Schema Challenge (pronoun resolution) |
| ReCoRD | SuperGLUE | F1/EM | Reading comprehension with commonsense reasoning |
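The metric column matters: CoLA reports Matthews correlation (MCC) because its classes are imbalanced, so a degenerate majority-class model can post high accuracy while learning nothing. A self-contained sketch, where the labels are illustrative and `matthews_corrcoef` is a hand-rolled stand-in for the usual library call:

```python
import math

def matthews_corrcoef(y_true, y_pred):
    """MCC for binary labels: +1 perfect, 0 chance-level, -1 inverted."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A majority-class guesser looks strong on accuracy but scores 0 MCC.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1] * 10
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(acc, matthews_corrcoef(y_true, y_pred))  # -> 0.8 0.0
```

The always-positive model reaches 80% accuracy on this skewed sample yet gets an MCC of exactly zero, which is why CoLA uses MCC rather than accuracy.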
Linguistic Diagnostic Suite
GLUE includes a curated set of "diagnostic" examples to test specific linguistic phenomena. Even models with high aggregate scores often struggle with these edge cases.
Negation: models often ignore the 'not' and predict the same label as the affirmative version.
Coreference: in the classic Winograd example, "The trophy didn't fit in the suitcase because it was too big," world knowledge is required to know 'it' refers to the trophy, not the suitcase.
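The negation failure mode can be probed with minimal pairs: run a model on a sentence and its negated twin and count how often the prediction flips. A toy sketch, where both the pairs and the deliberately negation-blind `toy_model` are hypothetical:

```python
# Hypothetical minimal pairs: (sentence, negated sentence).
# A robust classifier should flip its prediction on the second member.
pairs = [
    ("The movie was good.", "The movie was not good."),
    ("I liked the plot.", "I did not like the plot."),
]

def toy_model(sentence):
    """Stand-in bag-of-words classifier that ignores negation entirely."""
    positive = {"good", "liked", "like"}
    words = sentence.lower().rstrip(".").split()
    return 1 if any(w in positive for w in words) else 0

flips = sum(toy_model(a) != toy_model(b) for a, b in pairs)
print(f"{flips}/{len(pairs)} pairs flipped")  # -> 0/2 pairs flipped
```

The bag-of-words model flips on none of the pairs, reproducing in miniature the diagnostic-suite finding that surface-level models read straight past 'not'.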

Diagnostic Heatmap
Visualization of model failures across 30+ linguistic categories including logic, lexical semantics, and predicate-argument structure.
Foundational Papers
GLUE: A Multi-Task Benchmark and Analysis Platform for NLU
SuperGLUE: A Stickier Benchmark for General-Purpose NLU
BERT: Pre-training of Deep Bidirectional Transformers
DeBERTa: Decoding-enhanced BERT with Disentangled Attention
Implementation & Tools
- jiant: official GLUE baseline implementations and evaluation toolkit.
- Hugging Face Transformers: industry standard for fine-tuning on GLUE/SuperGLUE tasks.
- DeBERTa (microsoft/DeBERTa): implementation of the model that dominated SuperGLUE for two years.
- T5: Text-to-Text Transfer Transformer framework for NLU.
Related Benchmarks
| Benchmark | Focus | Key Difference |
|---|---|---|
| MMLU | World Knowledge | Tests 57 subjects (STEM, Humanities) vs. NLU reasoning. |
| SQuAD | Reading Comprehension | Extractive QA on Wikipedia vs. multi-task classification. |
| HELM | Holistic Evaluation | Includes fairness, bias, and toxicity metrics. |
Ready to evaluate your model?
The GLUE and SuperGLUE datasets are available via Hugging Face Datasets or the official NYU mirrors.