The General Language Understanding Evaluation benchmark (GLUE, 2018) and its harder successor SuperGLUE (2019) drove natural-language-understanding evaluation for several years. SuperGLUE covers 8 sub-tasks (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC). The leaderboard saturated around 91 in 2022, above the 89.8 human baseline, and has seen no frontier submissions since; frontier evaluation has moved to MMLU, GPQA, BIG-Bench Hard, and HELM.
Effectively retired since late 2022: the aggregate score plateaued at 91.2–91.3 (ST-MoE-32B, Vega v2), and frontier LLMs (GPT-4/5, Claude, Gemini, Llama) no longer submit. Human baseline: 89.8.
5 results indexed across 1 metric. The shaded row marks the current SOTA; ties are broken by submission date.
| # | Model | Org | Submitted | Paper / code | SuperGLUE avg |
|---|---|---|---|---|---|
| 01 | Vega v2 (6B) [API] | JD Explore Academy | Oct 2022 | Toward Efficient Language Model Pretraining and Downstre… | 91.30 |
| 02 | ST-MoE-32B [OSS] | Google Brain | Feb 2022 | ST-MoE: Designing Stable and Transferable Sparse Expert … | 91.20 |
| 03 | ERNIE 3.0 [API] | Baidu | Jul 2021 | ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training f… | 90.60 |
| 04 | DeBERTa (ensemble) [OSS] | Microsoft | Jan 2021 | DeBERTa: Decoding-enhanced BERT with Disentangled Attent… | 90.30 |
| 05 | T5-11B [OSS] | Google | Oct 2019 | Exploring the Limits of Transfer Learning with a Unified… | 89.30 |
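The "SuperGLUE avg" column is the benchmark's overall score. A minimal sketch of how such an aggregate is conventionally built — an unweighted mean over tasks, with a task's multiple metrics averaged first. The per-task numbers below are illustrative placeholders, not any real submission:

```python
# Illustrative per-task scores (NOT a real submission).
# Multi-metric tasks carry a tuple of metrics, averaged before the mean.
task_scores = {
    "BoolQ": 91.0,
    "CB": (98.0, 99.2),       # (F1, accuracy)
    "COPA": 97.0,
    "MultiRC": (88.0, 63.0),  # (F1a, exact match)
    "ReCoRD": (94.0, 93.5),   # (F1, exact match)
    "RTE": 93.0,
    "WiC": 77.0,
    "WSC": 96.0,
}

def superglue_avg(scores):
    """Unweighted mean over tasks; average a task's metrics first."""
    per_task = [
        sum(v) / len(v) if isinstance(v, tuple) else v
        for v in scores.values()
    ]
    return sum(per_task) / len(per_task)

print(round(superglue_avg(task_scores), 2))  # → 90.23
```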
Each row below marks a model that broke the previous record on SuperGLUE avg. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.
Higher scores win. Each subsequent entry improved upon the previous best.
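The progression described above is just a running maximum over submissions ordered by date; a minimal sketch, with the entries transcribed from the leaderboard table:

```python
# Leaderboard entries ordered by submission date, transcribed from the
# table above: (model, date, superglue_avg).
entries = [
    ("T5-11B", "2019-10", 89.30),
    ("DeBERTa (ensemble)", "2021-01", 90.30),
    ("ERNIE 3.0", "2021-07", 90.60),
    ("ST-MoE-32B", "2022-02", 91.20),
    ("Vega v2 (6B)", "2022-10", 91.30),
]

def sota_steps(entries):
    """Keep only entries that strictly beat the previous best score."""
    best, steps = float("-inf"), []
    for model, date, score in entries:
        if score > best:
            best = score
            steps.append((model, date, score))
    return steps

for model, date, score in sota_steps(entries):
    print(f"{date}  {score:.2f}  {model}")
```

Here every submission listed set a new record, so all five rows survive the filter; intermediate non-record submissions would be dropped.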
Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.
Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top spot — annotate the step on the progress chart with your name.