| Benchmark | Full title | Metric | Qwen2.5-Plus | Qwen2.5-72B-Instruct |
|---|---|---|---|---|
| ARC | AI2 Reasoning Challenge | | | |
| AlignBench | AlignBench: Benchmarking Chinese Alignment of Large Language Models | Accuracy | 81.4 | |
| AutoLogi | AutoLogi: Automated Logic Puzzle Benchmark | | | |
| BIRD-SQL (dev) | BIRD-SQL: BIg Bench for Large-Scale Database-Grounded Text-to-SQLs | | | |
| C-Eval | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | | | |
| DROP | Discrete Reasoning Over Paragraphs | | | |
| ECLeKTic | ECLeKTic: A Multi-Lingual Knowledge Testing Dataset | | | |
| GPQA | GPQA: Graduate-Level Google-Proof Q&A Benchmark | Accuracy | 49.7 | |
| HellaSwag | HellaSwag: Can a Machine Really Finish Your Sentence? | | | |
| IFEval | Instruction-Following Evaluation | Accuracy | 86.3 | |
| INCLUDE | INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge | | | |
| LV-Eval | LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K | Accuracy | 54.6 | 60.4 |
| LongBench v2 | LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks | | | |
| LongBench-Chat | LongBench-Chat: Long Context Instruction-Following Benchmark | Score (1-10) | | 8.72 |
| MATH | Measuring Mathematical Problem Solving (MATH Dataset) | Accuracy | 84.7 | |
| MGSM | Multilingual Grade School Math | Accuracy | 72.5 | 88.16 |
| MMLU-Redux | MMLU-Redux: Massive Multitask Language Understanding Redux | Accuracy | | 86.8 |
| MRCR v2 (1M) | Multi-Round Co-reference Resolution, 1M-token context | | | |
| MT-Bench | MT-Bench: Multi-Turn Benchmark | Score (1-10) | | 9.35 |
| Multi-IF | Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following | | | |
| MultiChallenge | MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs | | | |
| OpenBookQA | Open Book Question Answering | | | |
| RULER | RULER: What’s the Real Context Size of Your Long-Context Language Models? | Accuracy | | 95.1 |
| SafetyBench | SafetyBench: Evaluating the Safety of Large Language Models with Multiple Choice Questions | | | |
| SuperGPQA | SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines | | | |
| TriviaQA | TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension | | | |
| WinoGrande | WinoGrande: An Adversarial Winograd Schema Challenge at Scale | | | |
| WritingBench | WritingBench: A Comprehensive Benchmark for Generative Writing | Accuracy | | 79.97 |