Language models & text processing, side by side.
From frontier LLMs to specialised NER models. Which model for which task, at what cost — and when an LLM is overkill.
Descriptions in serif; scores in tabular mono; navigation in sans. Costs quoted per million input tokens.
The eight that matter.
Ranked by reasoning benchmarks. MMLU and HumanEval are shown as the most-cited comparison axes — see /llm for the full registry.
| Model | Vendor | MMLU | HumanEval | Reasoning | Speed | Cost | Best for |
|---|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 92.4 | 95.1 | Best | Medium | $15/1M in | Complex reasoning, analysis, coding |
| GPT-5 | OpenAI | 91.8 | 93.7 | Excellent | Fast | $5/1M in | General-purpose, multimodal |
| Claude Sonnet 4 | Anthropic | 90.1 | 93.8 | Excellent | Fast | $3/1M in | Best value frontier, coding |
| Gemini 2.5 Pro | Google | 90.3 | 91.2 | Excellent | Fast | $1.25/1M in | 1M+ context, multimodal |
| Llama 4 Maverick | Meta (Open) | 89.2 | 90.5 | Very Good | Variable | Self-host | Open source, MoE, customization |
| DeepSeek R1 | DeepSeek (Open) | 90.8 | 92.1 | Excellent | Slow (CoT) | $0.55/1M in | Math, reasoning, open weights |
| Claude Haiku 4 | Anthropic | 84.5 | 88.0 | Good | Very Fast | $0.25/1M in | High volume, cost-efficient |
| GPT-4o-mini | OpenAI | 82.0 | 87.2 | Good | Very Fast | $0.15/1M in | Cheapest frontier, high throughput |
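The per-million prices only become meaningful against a workload. A back-of-the-envelope sketch using the input-token prices from the table above (it assumes input tokens dominate; output-token pricing, usually higher, is ignored):

```python
# USD per 1M input tokens, taken from the table above.
PRICE_PER_M_INPUT = {
    "Claude Opus 4": 15.00,
    "GPT-5": 5.00,
    "Claude Sonnet 4": 3.00,
    "Gemini 2.5 Pro": 1.25,
    "DeepSeek R1": 0.55,
    "Claude Haiku 4": 0.25,
    "GPT-4o-mini": 0.15,
}

def monthly_input_cost(model: str, requests_per_day: int,
                       avg_input_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend on input tokens alone."""
    tokens = requests_per_day * avg_input_tokens * days
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# 50K requests/day at ~1K input tokens each:
for model in ("Claude Opus 4", "Claude Haiku 4"):
    print(f"{model}: ${monthly_input_cost(model, 50_000, 1_000):,.0f}/mo")
# Claude Opus 4: $22,500/mo
# Claude Haiku 4: $375/mo
```

At that volume the spread between the top and bottom of the table is two orders of magnitude, which is why the routing question below matters.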
Not every task needs an LLM.
Six text-processing axes where specialised models still compete — or win outright — on latency, cost, or accuracy at scale.
Text Embeddings →
Semantic search, RAG, clustering
Translation →
33+ languages, document-level
Question Answering →
Extractive, abstractive, multi-hop
Named Entity Recognition →
People, orgs, locations, custom
Text Classification →
Sentiment, intent, topic
Summarization →
News, documents, conversations
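The embeddings axis is the clearest case: once vectors are computed, search is arithmetic, not generation. A minimal sketch of the retrieval step behind semantic search and RAG — the 3-dimensional vectors here are hypothetical stand-ins for the 384–3072-dimensional output of a real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document vectors.
docs = {
    "refund policy": [0.9, 0.1, 0.2],
    "shipping times": [0.1, 0.8, 0.3],
    "api rate limits": [0.2, 0.2, 0.9],
}
# Hypothetical embedding of the query "how do I get my money back?"
query_vec = [0.85, 0.15, 0.25]

best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
print(best)  # -> refund policy
```

The nearest-neighbour lookup runs in microseconds per document; an LLM call answering the same routing question costs a full inference round trip.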
LLM, or specialised model?
Reach for an LLM when:
- Few examples are available (few-shot)
- Task definitions are complex or nuanced
- You need the reasoning explained
- The task evolves frequently
- Volume is low (< 10K requests/day)

Reach for a specialised model when:
- Volume is high (> 100K requests/day)
- Latency is critical (< 100ms)
- Cost is sensitive (pennies per 1K calls)
- The task is well-defined and stable
- Training data is available
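The checklist above can be sketched as a routing heuristic. The thresholds mirror the bullets; the field names and the 1,000-example cutoff are illustrative assumptions, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class Task:
    requests_per_day: int
    latency_budget_ms: int
    stable_definition: bool   # well-defined and unlikely to change?
    training_examples: int

def recommend(task: Task) -> str:
    """Route to a specialised model when the second column of criteria applies."""
    if (task.requests_per_day > 100_000          # high volume
            or task.latency_budget_ms < 100      # latency critical
            or (task.stable_definition           # stable task with data to train on
                and task.training_examples >= 1_000)):
        return "specialised model"
    return "LLM"

print(recommend(Task(500_000, 50, True, 10_000)))  # -> specialised model
print(recommend(Task(2_000, 2_000, False, 20)))    # -> LLM
```

In practice the two answers compose: many teams prototype with an LLM, then distil the stable, high-volume slice of traffic onto a specialised model.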
Go deeper.
Verified benchmarks across every text task. Submit new SOTA results or suggest benchmarks we should be tracking.