Language models & text processing, side by side.
From frontier LLMs to specialised NER models. Which model for which task, at what cost — and when an LLM is overkill.
Descriptions in serif; scores in tabular mono; navigation in sans. Costs quoted per million input tokens.
The eight that matter.
Ranked by reasoning benchmarks. MMLU and HumanEval are shown as the most-cited comparison axes — see /llm for the full registry.
| Model | Vendor | MMLU | HumanEval | Reasoning | Speed | Cost | Best for |
|---|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 92.4 | 95.1 | Best | Medium | $15/1M in | Complex reasoning, analysis, coding |
| GPT-5 | OpenAI | 91.8 | 93.7 | Excellent | Fast | $5/1M in | General-purpose, multimodal |
| Claude Sonnet 4 | Anthropic | 90.1 | 93.8 | Excellent | Fast | $3/1M in | Best value frontier, coding |
| Gemini 2.5 Pro | Google | 90.3 | 91.2 | Excellent | Fast | $1.25/1M in | 1M+ context, multimodal |
| Llama 4 Maverick | Meta (Open) | 89.2 | 90.5 | Very Good | Variable | Self-host | Open source, MoE, customization |
| DeepSeek R1 | DeepSeek (Open) | 90.8 | 92.1 | Excellent | Slow (CoT) | $0.55/1M in | Math, reasoning, open weights |
| Claude Haiku 4 | Anthropic | 84.5 | 88.0 | Good | Very Fast | $0.25/1M in | High volume, cost-efficient |
| GPT-4o-mini | OpenAI | 82.0 | 87.2 | Good | Very Fast | $0.15/1M in | Cheapest frontier, high throughput |
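The per-million prices only become meaningful against a workload. A back-of-the-envelope sketch using the input-token prices from the table above (it assumes input tokens dominate; output-token pricing, usually higher, is ignored):

```python
# USD per 1M input tokens, taken from the table above.
PRICE_PER_M_INPUT = {
    "Claude Opus 4": 15.00,
    "GPT-5": 5.00,
    "Claude Sonnet 4": 3.00,
    "Gemini 2.5 Pro": 1.25,
    "DeepSeek R1": 0.55,
    "Claude Haiku 4": 0.25,
    "GPT-4o-mini": 0.15,
}

def monthly_input_cost(model: str, requests_per_day: int,
                       avg_input_tokens: int, days: int = 30) -> float:
    """Estimated monthly spend on input tokens alone."""
    tokens = requests_per_day * avg_input_tokens * days
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# 50K requests/day at ~1K input tokens each:
for model in ("Claude Opus 4", "Claude Haiku 4"):
    print(f"{model}: ${monthly_input_cost(model, 50_000, 1_000):,.0f}/mo")
# Claude Opus 4: $22,500/mo
# Claude Haiku 4: $375/mo
```

At that volume the spread between the top and bottom of the table is two orders of magnitude, which is why the routing question below matters.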
Not every task needs an LLM.
Six text-processing axes where specialised models still compete — or win outright — on latency, cost, or accuracy at scale.
Text Embeddings →
Semantic search, RAG, clustering
Translation →
33+ languages, document-level
Question Answering →
Extractive, abstractive, multi-hop
Named Entity Recognition →
People, orgs, locations, custom
Text Classification →
Sentiment, intent, topic
Summarization →
News, documents, conversations
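The embeddings axis is the clearest case: once vectors are computed, search is arithmetic, not generation. A minimal sketch of the retrieval step behind semantic search and RAG — the 3-dimensional vectors here are hypothetical stand-ins for the 384–3072-dimensional output of a real embedding model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document vectors.
docs = {
    "refund policy": [0.9, 0.1, 0.2],
    "shipping times": [0.1, 0.8, 0.3],
    "api rate limits": [0.2, 0.2, 0.9],
}
# Hypothetical embedding of the query "how do I get my money back?"
query_vec = [0.85, 0.15, 0.25]

best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
print(best)  # -> refund policy
```

The nearest-neighbour lookup runs in microseconds per document; an LLM call answering the same routing question costs a full inference round trip.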
LLM, or specialised model?
Reach for an LLM when:
- Few examples are available (few-shot)
- Task definitions are complex or nuanced
- You need the reasoning explained
- The task evolves frequently
- Volume is low (< 10K requests/day)

Reach for a specialised model when:
- Volume is high (> 100K requests/day)
- Latency is critical (< 100ms)
- Cost is sensitive (pennies per 1K calls)
- The task is well-defined and stable
- Training data is available
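The checklist above can be sketched as a routing heuristic. The thresholds mirror the bullets; the field names and the 1,000-example cutoff are illustrative assumptions, not prescriptions:

```python
from dataclasses import dataclass

@dataclass
class Task:
    requests_per_day: int
    latency_budget_ms: int
    stable_definition: bool   # well-defined and unlikely to change?
    training_examples: int

def recommend(task: Task) -> str:
    """Route to a specialised model when the second column of criteria applies."""
    if (task.requests_per_day > 100_000          # high volume
            or task.latency_budget_ms < 100      # latency critical
            or (task.stable_definition           # stable task with data to train on
                and task.training_examples >= 1_000)):
        return "specialised model"
    return "LLM"

print(recommend(Task(500_000, 50, True, 10_000)))  # -> specialised model
print(recommend(Task(2_000, 2_000, False, 20)))    # -> LLM
```

In practice the two answers compose: many teams prototype with an LLM, then distil the stable, high-volume slice of traffic onto a specialised model.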
Go deeper.
Verified benchmarks across every text task. Submit new SOTA results or suggest benchmarks we should be tracking.