Natural Language Processing

The field of AI concerned with the interaction between computers and human language, encompassing text understanding, generation, translation, sentiment analysis, and question answering.

3 tasks · 66 datasets · 31 results

Tasks & Benchmarks

Language Modeling

ARC
AlignBench
Arena-Hard
81.4 (Accuracy), Qwen2.5-Plus
AutoLogi
BBH
Bird-SQL (dev)
C-Eval
C-SimpleQA
CommonsenseQA
Creative Writing Benchmark v3
DROP
ECLeKTic
EvalPlus
GPQA
49.7 (Accuracy), Qwen2.5-Plus
96 (Accuracy), Qwen2.5-Plus
Global MMLU-Lite
HellaSwag
HiddenMath
IFEval
86.3 (Accuracy), Qwen2.5-Plus
INCLUDE
LV-Eval
60.4 (Accuracy), Qwen2.5-72B-Instruct
54.6 (Accuracy), Qwen2.5-Plus
LongBench v2
LongBench-Chat
8.72 (Score, 1-10), Qwen2.5-72B-Instruct
MATH
84.7 (Accuracy), Qwen2.5-Plus
MGS
88.16 (Accuracy), Qwen2.5-72B-Instruct
72.5 (Accuracy), Qwen2.5-Plus
MMLU-Redux
86.8 (Accuracy), Qwen2.5-72B-Instruct
MRCR v2 (1M)
MRCR v2 (≤128K)
MTbench
9.35 (Score, 1-10), Qwen2.5-72B-Instruct
Multi-IF
MultiChallenge
OpenBookQA
OpenRewrite-Eval
Penn Treebank (WSJ Section 23)
RULER
95.1 (Accuracy), Qwen2.5-72B-Instruct
SafetyBench
SuperGPQA
SysBench (ISR)
TriviaQA
Winogrande
WritingBench
ZeroSCROLLS/QuALITY
okapi MMLU (translated)
79.97 (Accuracy), Qwen2.5-72B-Instruct

Text Classification

GLUE (2018)
91.3 (SuperGLUE avg), Vega v2 (6B)
GLUE (dev)
SuperGLUE (2019)
91.4 (average-score), DeBERTa-v3-large

Machine Translation

DoTA (en->zh)
83.48 (COMET), HunyuanOCR (1B)
FLORES-101
FLORES-200 devtest
MTOB (kalam -> eng)
WMT 2014 English->French (newstest2014)
WMT 2014 English->German (newstest2014)
WMT'23 (2023)
84.1 (COMET), GPT-4
