Language Modeling

Language modeling — predicting the next token — is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is now largely a historical artifact; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity long ago stopped tracking usefulness. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference: the model is no longer the bottleneck; serving it is.
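Concretely, "predicting the next token" means minimizing cross-entropy: the model outputs a probability for every vocabulary item, and the loss at each position is the negative log-probability of the token that actually appeared. A minimal sketch; the vocabulary and probabilities below are invented for illustration, not taken from any real model:

```python
import math

# Invented toy distribution over a four-word vocabulary for some
# context -- purely illustrative, not from any real model.
predicted = {"the": 0.1, "cat": 0.1, "sat": 0.6, "mat": 0.2}

def next_token_loss(dist, target):
    """Cross-entropy loss for a single next-token prediction."""
    return -math.log(dist[target])

# The observed next token was "sat", which the model favored,
# so the loss is small: -ln(0.6), about 0.51 nats.
loss = next_token_loss(predicted, "sat")
```

Training averages this loss over every position in the corpus; perplexity is just the exponential of that average.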

1 dataset · 0 results · Canonical metric: perplexity

Canonical Benchmark

WikiText Perplexity

Language-modeling quality measured by perplexity on Wikipedia text (lower is better)

Primary metric: perplexity
View full leaderboard
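Perplexity, the primary metric here, is the exponential of the average per-token negative log-likelihood: a model with perplexity k is on average as uncertain as if it were choosing uniformly among k tokens. A minimal sketch, with illustrative log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (lower is better)."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model assigning probability 0.5 to each of four tokens has
# perplexity 2: as uncertain as a uniform choice between two options.
print(perplexity([math.log(0.5)] * 4))
```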

Top 10

Leading models on WikiText Perplexity.

No results yet. Be the first to contribute.

What were you looking for on Language Modeling?

Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Natural Language Processing.
