Codesota · Natural Language Processing · Text classification · GLUE
Text classification · benchmark dataset · 2018 · EN

GLUE & SuperGLUE.

The General Language Understanding Evaluation (GLUE, 2018) and its harder successor SuperGLUE (2019) are multi-task NLU benchmarks: GLUE covers 9 sub-tasks (CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI) and SuperGLUE covers 8 (BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, WSC). The SuperGLUE leaderboard saturated just above the human baseline, plateauing around 91 in 2022, and has seen no frontier submissions since; current frontier evaluation has moved to MMLU, GPQA, BIG-Bench Hard, and HELM.

Saturated benchmark · last significant update Oct 2022

Effectively retired since late 2022. The aggregate score plateaued at 91.2–91.3 (ST-MoE-32B, Vega v2); frontier LLMs (GPT-4/5, Claude, Gemini, Llama) do not submit. Human baseline: 89.8.
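The "SuperGLUE avg" aggregate used throughout this page is an unweighted mean over the eight sub-tasks, where tasks that report two metrics are averaged internally first. A minimal sketch of that computation; the per-task scores below are hypothetical illustrations, not leaderboard rows:

```python
# Sketch of a SuperGLUE aggregate: a macro-average over the eight tasks.
# Tasks reporting two metrics (CB, MultiRC, ReCoRD) are averaged
# internally before the overall mean. Per-task numbers are hypothetical.
per_task = {
    "BoolQ":   [91.0],
    "CB":      [95.0, 97.0],   # F1 / accuracy, averaged first
    "COPA":    [98.0],
    "MultiRC": [88.0, 63.0],   # F1a / EM, averaged first
    "ReCoRD":  [94.0, 93.0],   # F1 / EM, averaged first
    "RTE":     [93.0],
    "WiC":     [77.0],
    "WSC":     [96.0],
}

def superglue_avg(scores: dict) -> float:
    """Macro-average: mean of the per-task means."""
    task_means = [sum(v) / len(v) for v in scores.values()]
    return sum(task_means) / len(task_means)

print(round(superglue_avg(per_task), 2))  # → 90.0
```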

§ 01 · Leaderboard

Best published scores.

5 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: SuperGLUE avg · higher is better · 5 rows
#    Model                Access  Org                 Submitted  Paper / code                                               SuperGLUE avg
01   Vega v2 (6B)         API     JD Explore Academy  Oct 2022   Toward Efficient Language Model Pretraining and Downstre…  91.30
02   ST-MoE-32B           OSS     Google Brain        Feb 2022   ST-MoE: Designing Stable and Transferable Sparse Expert …  91.20
03   ERNIE 3.0            API     Baidu               Jul 2021   ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training f…  90.60
04   DeBERTa (ensemble)   OSS     Microsoft           Jan 2021   DeBERTa: Decoding-enhanced BERT with Disentangled Attent…  90.30
05   T5-11B               OSS     Google              Oct 2019   Exploring the Limits of Transfer Learning with a Unified…  89.30
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
§ 03 · Progress

5 steps of state of the art.

Each row below marks a model that broke the previous record on SuperGLUE avg. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · SuperGLUE avg
  1. Oct 1, 2019 · T5-11B · Google · 89.30
  2. Jan 1, 2021 · DeBERTa (ensemble) · Microsoft · 90.30
  3. Jul 1, 2021 · ERNIE 3.0 · Baidu · 90.60
  4. Feb 1, 2022 · ST-MoE-32B · Google Brain · 91.20
  5. Oct 1, 2022 · Vega v2 (6B) · JD Explore Academy · 91.30
Fig 3 · SOTA-setting models only. 5 entries span Oct 2019 – Oct 2022.
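The filtering described above, keeping only record-breaking entries, is a running-max scan over date-sorted submissions. A small sketch using the Fig 3 data:

```python
# Sketch: extract SOTA-setting entries (scores that beat the previous
# best) from a chronologically sorted leaderboard. Data mirrors Fig 3,
# where every entry happened to set a new record.
rows = [
    ("Oct 2019", "T5-11B", 89.30),
    ("Jan 2021", "DeBERTa (ensemble)", 90.30),
    ("Jul 2021", "ERNIE 3.0", 90.60),
    ("Feb 2022", "ST-MoE-32B", 91.20),
    ("Oct 2022", "Vega v2 (6B)", 91.30),
]

def sota_steps(submissions):
    """Keep only entries that strictly improve on the running best."""
    best, steps = float("-inf"), []
    for date, model, score in submissions:   # must be in date order
        if score > best:                     # strictly better => new SOTA
            best = score
            steps.append((date, model, score))
    return steps

assert len(sota_steps(rows)) == 5  # all five entries set a new record
```

Intermediate submissions that fail the `score > best` test stay in the full leaderboard but are dropped from the progress chart.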
§ 04 · Literature

5 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
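A hypothetical skeleton of a reproduction script per the checklist above, showing the frozen-commit and frozen-seed convention; `COMMIT`, `SEED`, and the `run_eval` entry point are placeholders of our own, not part of any Codesota interface:

```python
# Hypothetical reproduction-script skeleton (item 02 of the checklist).
# COMMIT, SEED, and run_eval are illustrative placeholders.
import random

COMMIT = "deadbeef"   # frozen commit of the model/eval code (placeholder)
SEED = 42             # frozen seed so reruns are bit-identical

def set_seed(seed: int) -> None:
    random.seed(seed)
    # with numpy/torch installed you would also pin those RNGs here

def run_eval(seed: int) -> float:
    """Stand-in for the real evaluation; deterministic under the seed."""
    set_seed(seed)
    return round(80 + 20 * random.random(), 2)

if __name__ == "__main__":
    a, b = run_eval(SEED), run_eval(SEED)
    assert a == b, "evaluation is not deterministic under the frozen seed"
    print(f"commit={COMMIT} seed={SEED} score={a}")
```

The point of the assertion is that a reviewer rerunning the script at the declared commit and seed should reproduce the submitted row exactly.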