Natural Language Processing · text-generation

Language Modeling.

Language Modeling is the task of predicting the next word or character in a sequence given the preceding context. Language models learn a probability distribution over word sequences and are foundational to many NLP applications, including text generation, machine translation, and speech recognition.
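The next-word formulation above corresponds to the standard chain-rule factorization of a sequence probability. The notation here (tokens $w_1, \dots, w_T$ and model distribution $P$) is introduced for illustration and does not come from this page:

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

Each factor is one next-token prediction, so a model that predicts the next token well also assigns high probability to the whole sequence.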

Datasets: 55
Results: 14
Canonical metric: perplexity
§ 02 · Canonical benchmark

The reference dataset.

WikiText Perplexity

Language modeling quality measured by perplexity on Wikipedia text

Primary metric: perplexity
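As a rough illustration of the metric, perplexity is commonly defined as the exponential of the average negative log-likelihood per token. The sketch below assumes natural-log probabilities and uses made-up values; the function name `perplexity` and the sample data are hypothetical, and the page does not specify the tokenization or evaluation protocol behind its leaderboard.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities a model assigned to four consecutive tokens.
sample_probs = [0.25, 0.25, 0.25, 0.25]
sample_log_probs = [math.log(p) for p in sample_probs]

# A model that is uniformly unsure among four options scores roughly 4.
print(perplexity(sample_log_probs))
```

Lower is better: a perplexity near 4 means the model is, on average, about as uncertain as if it were choosing uniformly among four tokens at each step.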
§ 03 · Top 10

Leading models.

Leading models on WikiText Perplexity.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

55 datasets tracked for this task.

WikiText Perplexity · CANONICAL · 0 results · perplexity
Arena-Hard · 1 result · Top: Qwen2.5-Plus 81.4
GPQA · 1 result · Top: Qwen2.5-Plus 49.7
GSM8k · 1 result · Top: Qwen2.5-Plus 96.0
IFEval · 1 result · Top: Qwen2.5-Plus 86.3
LV-Eval · 1 result · Top: Qwen2.5-72B-Instruct 60.4
Livebench · 1 result · Top: Qwen2.5-Plus 54.6
LongBench-Chat · 1 result · Top: Qwen2.5-72B-Instruct 8.72
MATH · 1 result · Top: Qwen2.5-Plus 84.7
MGS · 1 result · Top: Qwen2.5-72B-Instruct 88.2
MMLU-Pro · 1 result · Top: Qwen2.5-Plus 72.5
MMLU-Redux · 1 result · Top: Qwen2.5-72B-Instruct 86.8
MTbench · 1 result · Top: Qwen2.5-72B-Instruct 9.35
RULER · 1 result · Top: Qwen2.5-72B-Instruct 95.1
okapi MMLU (translated) · 1 result · Top: Qwen2.5-72B-Instruct 80.0
AIME 2024 · 0 results
AIME 2025 · 0 results
ARC · 0 results
AlignBench · 0 results
AutoLogi · 0 results
BBH · 0 results
Bird-SQL (dev) · 0 results
C-Eval · 0 results
C-SimpleQA · 0 results
CommonsenseQA · 0 results
Creative Writing Benchmark v3 · 0 results
DROP · 0 results
ECLeKTic · 0 results
EvalPlus · 0 results
FACTS Grounding · 0 results
GPQA Diamond · 0 results
Global MMLU-Lite · 0 results
HellaSwag · 0 results
HiddenMath · 0 results
INCLUDE · 0 results
LongBench v2 · 0 results
MATH 500 · 0 results
MMLU · 0 results
MMMLU · 0 results
MRCR v2 (1M) · 0 results
MRCR v2 (≤128K) · 0 results
Multi-IF · 0 results
MultiChallenge · 0 results
OpenBookQA · 0 results
OpenRewrite-Eval · 0 results
Penn Treebank (WSJ Section 23) · 0 results
SafetyBench · 0 results
SimpleQA · 0 results
SuperGPQA · 0 results
SysBench (ISR) · 0 results
TriviaQA · 0 results
Winogrande · 0 results
WritingBench · 0 results
ZebraLogic · 0 results
ZeroSCROLLS/QuALITY · 0 results
§ 05 · Related tasks

Other tasks in Natural Language Processing.

Machine Translation · Text classification