Language Modeling
Language modeling — predicting the next token — is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is essentially a historical artifact now; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped correlating with usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference — the model is no longer the bottleneck, serving it is.
History
2003: Bengio et al. introduce neural language models with feedforward networks, replacing n-gram models
2013: Word2Vec shows that language model byproducts (embeddings) transfer to downstream NLP tasks
2017: Transformer architecture (Vaswani et al.) enables massively parallel training, replacing recurrent models
2018: GPT (Radford et al.) demonstrates that autoregressive pretraining on unlabeled book text (BooksCorpus) produces useful representations
2019: GPT-2 (1.5B params) shows emergent generation quality; OpenAI delays release over misuse concerns
2020: GPT-3 (175B params) demonstrates in-context learning — the model performs tasks from examples in the prompt
2023: GPT-4 and Claude 2 reach broadly expert-level performance across NLP, coding, and reasoning
2023: Llama 2 (Meta) opens the floodgates for open-weight LLMs; Mistral-7B matches Llama 2 13B
2024: Llama 3.1 405B, DeepSeek-V3, and Qwen2.5-72B close the gap with proprietary frontier models
2024-2025: Claude 3.5, GPT-4o, Gemini 2.0 compete on reasoning, coding, and agentic capabilities; Llama 4 and DeepSeek-R1 push open-source further
How Language Modeling Works
Tokenization
Text is encoded into subword tokens using BPE (GPT), SentencePiece (Llama), or custom tokenizers; vocabulary sizes range from 32K to 256K
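The merge-learning step at the heart of BPE can be sketched on a toy corpus (the corpus and merge count here are made up for illustration; production tokenizers like GPT-2's operate on bytes and apply pretokenization rules first):

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols, starting from characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = {}
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        words = merged
    return merges

merges = bpe_train("low low low lower lowest", 3)
# Frequent substrings like "low" become single vocabulary entries.
```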
Embedding
Each token is mapped to a dense vector; positional information is added via learned or rotary (RoPE) position embeddings
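Rotary embeddings encode position by rotating adjacent dimension pairs of each vector by position-dependent angles. A minimal sketch (real implementations apply this to query/key vectors inside attention, vectorized over the whole batch):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary position embedding: rotate dimension pair (2i, 2i+1)
    by angle pos * base**(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

v = [1.0, 0.0, 0.5, 0.5]
# Position 0 leaves the vector unchanged; rotations preserve the norm,
# and dot products between rotated vectors depend only on relative position.
```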
Transformer layers
Tokens pass through N layers of multi-head self-attention and feed-forward networks; modern models use 32-128 layers
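The core computation of each layer, scaled dot-product self-attention, can be sketched in a few lines (single head, no Q/K/V projections, no causal mask, no feed-forward block — all of which real layers add):

```python
import math

def self_attention(X):
    """Each position attends to every position: softmax of scaled dot
    products, then a weighted average of the (here unprojected) values."""
    d = len(X[0])
    out = []
    for q in X:
        # Score this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        # Numerically stable softmax over the scores.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        weights = [e / Z for e in exps]
        # Convex combination of value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```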
Next-token prediction
A linear head projects the final hidden state to vocabulary logits; softmax gives probability distribution over next token
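The head itself is just a matrix product followed by a softmax. A toy sketch with a 2-dim hidden state and a 3-token vocabulary (the weights are invented for illustration):

```python
import math

def next_token_probs(hidden, W):
    """LM head sketch: W has one weight vector per vocabulary token;
    dot products give logits, softmax turns them into probabilities."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in W]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    Z = sum(exps)
    return [e / Z for e in exps]

probs = next_token_probs([1.0, -1.0], [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
# Greedy decoding would emit the argmax token; samplers draw from probs.
```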
Training
Cross-entropy loss on next-token prediction over trillions of tokens from web text, code, and curated data
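The training objective reduces to averaging -log p(correct token) over positions; perplexity is just exp of that loss, which links this section back to the intrinsic metric discussed above:

```python
import math

def lm_loss(prob_seq, target_ids):
    """Average next-token cross-entropy. prob_seq[t] is the model's
    distribution at step t; target_ids[t] is the true next token."""
    return -sum(math.log(p[t]) for p, t in zip(prob_seq, target_ids)) / len(target_ids)

# Sanity check: a model assigning probability 1/V to every token over a
# V-token vocabulary has loss log(V), i.e. perplexity exp(loss) = V.
V = 8
uniform = [[1.0 / V] * V] * 4
loss = lm_loss(uniform, [0, 3, 5, 7])
```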
Current Landscape
Language modeling in 2025 is the foundation of the entire AI industry. The scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) continue to hold: more compute and data produce better models. But the frontier has shifted from pure scale to efficiency (MoE architectures, DeepSeek), reasoning (o1-style inference-time compute), and post-training (RLHF, DPO, Constitutional AI). Open-source models lag frontier by 6-12 months but are increasingly competitive. The Chinchilla-optimal training paradigm has given way to over-training smaller models for cheaper inference.
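The Hoffmann et al. (2022) result referenced above is usually summarized by a parametric loss fit (exponents approximate, from the Chinchilla paper):

```latex
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad \alpha \approx 0.34, \quad \beta \approx 0.28
```

Here N is parameter count, D is training tokens, and E the irreducible loss. Minimizing L under a fixed compute budget C ≈ 6ND gives the "Chinchilla-optimal" rule of roughly 20 tokens per parameter; the over-training trend mentioned above deliberately pushes D far past that ratio to buy cheaper inference with a smaller N.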
Key Challenges
Scaling cost: training a frontier model costs $50-500M+ in compute; only a handful of organizations can afford it
Data quality and curation are arguably more important than model size — garbage in, garbage out at scale
Evaluation: perplexity doesn't capture reasoning ability; benchmarks saturate quickly; human evaluation is expensive
Alignment: making models helpful, harmless, and honest through RLHF/RLAIF adds complexity and potential capability loss
Inference cost: serving large models requires expensive GPU clusters; efficiency techniques (quantization, speculative decoding) are critical
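As a flavor of the efficiency techniques listed above, symmetric int8 weight quantization can be sketched in a few lines (per-tensor scale for simplicity; real systems typically use per-channel or group-wise scales, and int4 formats):

```python
def quantize_int8(weights):
    """Map floats to int8 via a single scale: q = round(w / scale),
    with scale chosen so the largest magnitude lands on 127."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

w = [0.02, -1.27, 0.64, 0.001]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step (scale / 2),
# while storage drops from 32 bits to 8 bits per weight.
```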
Quick Recommendations
Best frontier model
Claude 3.5 Sonnet, GPT-4o, or Gemini 2.0 Pro
Top performance on reasoning, coding, and instruction following; competitive pricing
Open-source (large)
Llama 3.1 405B or DeepSeek-V3-671B (MoE)
Approaching frontier model quality; self-hostable for full data control
Open-source (efficient)
Qwen2.5-72B or Llama 3.1 70B
Best quality at the 70B scale; fits on 2x A100 with quantization
Small / edge
Llama 3.2 3B or Phi-3.5 Mini (3.8B)
Runs on mobile and laptop hardware; surprisingly capable for their size
Research / perplexity benchmark
GPT-4 or Gemini 1.5 Pro
Strong results on held-out text, though frontier labs rarely publish perplexity on the classic benchmarks; open-weight models are needed for reproducible perplexity measurement
What's Next
The next phase is test-time compute scaling (thinking longer to solve harder problems), multi-modal native models (text + image + audio + video in one architecture), and agentic models that can use tools, write code, and take actions. Expect the open-source gap to continue closing, with 70B-class models matching today's frontier within a year. Architecture innovations (state-space models, hybrid attention-SSM) may complement or partially replace pure transformers.
Related Tasks
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Reading Comprehension
Understanding and answering questions about passages.
Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer — can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.