Codesota · Language & Text
Which model, what task, at what cost
Issue: March 2026
§ 00 · Language & text

Language models & text processing, side by side.

From frontier LLMs to specialised NER models. Which model for which task, at what cost — and when an LLM is overkill.

Descriptions in serif; scores in tabular mono; navigation in sans. Costs quoted per million input tokens.

§ 01 · Frontier LLMs

The eight that matter.

Ranked by reasoning benchmarks, with costs quoted per million input tokens. MMLU and HumanEval are shown as the most-cited comparison axes; see /llm for the full registry.

| Model | Vendor | MMLU | HumanEval | Reasoning | Speed | Cost | Best for |
|---|---|---|---|---|---|---|---|
| Claude Opus 4 | Anthropic | 92.4 | 95.1 | Best | Medium | $15/1M in | Complex reasoning, analysis, coding |
| GPT-5 | OpenAI | 91.8 | 93.7 | Excellent | Fast | $5/1M in | General-purpose, multimodal |
| Claude Sonnet 4 | Anthropic | 90.1 | 93.8 | Excellent | Fast | $3/1M in | Best-value frontier, coding |
| Gemini 2.5 Pro | Google | 90.3 | 91.2 | Excellent | Fast | $1.25/1M in | 1M+ context, multimodal |
| Llama 4 Maverick | Meta (open) | 89.2 | 90.5 | Very good | Variable | Self-host | Open source, MoE, customization |
| DeepSeek R1 | DeepSeek (open) | 90.8 | 92.1 | Excellent | Slow (CoT) | $0.55/1M in | Math, reasoning, open weights |
| Claude Haiku 4 | Anthropic | 84.5 | 88 | Good | Very fast | $0.25/1M in | High volume, cost-efficient |
| GPT-4o-mini | OpenAI | 82 | 87.2 | Good | Very fast | $0.15/1M in | Cheapest frontier, high throughput |
Fig 1 · Copper row marks the reasoning leader. Costs change faster than this page — double-check the vendor page before building on a quoted rate.
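The quoted input rates translate into per-request and per-day costs with simple arithmetic. A minimal sketch, using the input rates from the table and a hypothetical 2,000-token prompt sent 100K times a day; output tokens, which are billed at higher rates, are ignored here:

```python
# Rough input-cost comparison from the per-million-token rates quoted above.
RATES_PER_M_INPUT = {
    "Claude Opus 4": 15.00,
    "GPT-5": 5.00,
    "Claude Sonnet 4": 3.00,
    "Gemini 2.5 Pro": 1.25,
    "DeepSeek R1": 0.55,
    "Claude Haiku 4": 0.25,
    "GPT-4o-mini": 0.15,
}

def input_cost(model: str, tokens: int) -> float:
    """Dollar cost of `tokens` input tokens at the quoted rate."""
    return RATES_PER_M_INPUT[model] * tokens / 1_000_000

# Hypothetical workload: a 2,000-token prompt, 100K requests/day.
daily_tokens = 2_000 * 100_000
for model, rate in sorted(RATES_PER_M_INPUT.items(), key=lambda kv: kv[1]):
    print(f"{model:16s} ${input_cost(model, daily_tokens):8.2f}/day")
```

At that volume the spread is two orders of magnitude, which is why the decision checklist in § 03 leads with request volume.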
§ 02 · Text tasks

Not every task needs an LLM.

Six text-processing axes where specialised models still compete — or win outright — on latency, cost, or accuracy at scale.

| Task | Typical uses | Benchmark | Leader |
|---|---|---|---|
| Text embeddings | Semantic search, RAG, clustering | MTEB | KaLM-Gemma3-12B (72.3%) |
| Translation | 33+ languages, document-level | WMT | HY-MT1.5 (WMT2025 winner) |
| Question answering | Extractive, abstractive, multi-hop | SQuAD, TriviaQA | GPT-5 / Claude 4 |
| Named entity recognition | People, orgs, locations, custom | CoNLL-2003 | Fine-tuned DeBERTa v3 |
| Text classification | Sentiment, intent, topic | GLUE, SuperGLUE | DeBERTa v3 (GLUE 91.3) |
| Summarization | News, documents, conversations | CNN/DailyMail | Claude 4 / GPT-5 |
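Behind the text-embeddings entry: semantic search and RAG reduce to nearest-neighbour lookup by cosine similarity over dense vectors. A minimal pure-Python sketch, with toy 4-dimensional vectors and made-up documents standing in for real model output (a model like KaLM-Gemma3-12B returns vectors with thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d "embeddings"; a real embedding model produces these vectors.
corpus = {
    "refund policy": [0.9, 0.1, 0.0, 0.1],
    "shipping times": [0.1, 0.8, 0.2, 0.0],
    "api rate limits": [0.0, 0.1, 0.9, 0.3],
}
query = [0.8, 0.2, 0.1, 0.1]  # pretend: embed("how do I get my money back")

# Retrieval = nearest neighbour by cosine similarity.
best = max(corpus, key=lambda doc: cosine(query, corpus[doc]))
print(best)  # → refund policy
```

Production systems swap the linear scan for an approximate nearest-neighbour index; the similarity function itself stays this simple.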
§ 03 · Decision

LLM, or specialised model?

Use an LLM when
  • Few examples available (few-shot)
  • Complex, nuanced task definitions
  • You need to explain the reasoning
  • The task evolves frequently
  • Low volume (< 10K requests/day)

Use a specialised model when
  • High volume (> 100K requests/day)
  • Latency-critical (< 100 ms)
  • Cost-sensitive (pennies per 1K calls)
  • Well-defined, stable task
  • Training data available
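The checklist above can be sketched as a rule-of-thumb router. The thresholds come straight from the two lists; the function name and signature are hypothetical:

```python
def use_llm(requests_per_day: int,
            latency_budget_ms: float,
            labeled_examples: int,
            task_is_stable: bool) -> bool:
    """True → reach for an LLM; False → a specialised model fits better."""
    if latency_budget_ms < 100:
        return False          # latency-critical: specialised model
    if requests_per_day > 100_000 and task_is_stable and labeled_examples > 0:
        return False          # high volume, stable task, training data exists
    if requests_per_day < 10_000 or labeled_examples == 0:
        return True           # low volume, or few-shot is all you have
    return not task_is_stable # evolving task definitions favour an LLM

print(use_llm(5_000, 2_000, 0, False))     # prototype → True (LLM)
print(use_llm(500_000, 50, 10_000, True))  # production classifier → False
```

Real routing decisions weigh more than four variables, but the ordering matters: a hard latency budget overrules everything else, and volume dominates cost.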
§ 04 · Keep reading

Go deeper.

Verified benchmarks across every text task. Submit new SOTA results or suggest benchmarks we should be tracking.

Frontier leaderboard · MTEB embedding benchmark · All NLP benchmarks