Machine translation task router

Choose a translation model by language pair, domain, document context, deployment boundary, and evaluation metric. WMT is the public benchmark signal; production quality still needs COMET, terminology checks, and human edit rate.

Compare models ->Benchmarks Explainer

01 - Benchmarks

Public scores are a shortlist, not the deployment answer.

BLEU is still reported for continuity, but modern translation decisions should look at COMET, human preference, terminology errors, and edit distance on your own corpus.

Benchmark	Role	Metric	Best use
WMT	Primary competition	COMET / human eval / BLEU	High-resource language pairs and annual system comparisons.
FLORES-200	Coverage test	spBLEU / chrF++	Broad multilingual coverage, especially underrepresented languages.
COMET	Quality estimator	reference-based / QE	Ranking fluent translations when BLEU misses meaning and adequacy.
Human review	Production gate	edits / accept rate	Legal, medical, marketing, and other high-liability translation.

02 - Models

Translation systems by lane.

Dedicated MT, open multilingual models, and LLM translators solve different versions of the task. The right model depends more on pair and corpus than on one global rank.

Model / system	Lane	Good fit	Watch out
HY-MT1.5	WMT frontier	Competition-grade text translation; use as current WMT reference signal.	Availability and deployment path depend on the publisher.
DeepL	Production API	European languages, business copy, and editorial workflows.	Narrower language coverage than massively multilingual systems.
Google Cloud Translation	Production API	Broad language support, low operational burden, and AutoML-style customization.	Quality varies by pair and domain; test domain terminology.
NLLB-200	Open model	Self-hosted translation across 200 languages and low-resource coverage.	License and quality constraints; needs evaluation per pair.
MADLAD-400	Open model	Very broad language coverage when you need one multilingual backbone.	Larger deployment footprint and uneven pair quality.
Claude / GPT	LLM translation	Document-level context, terminology instructions, style transfer, and format repair.	Higher cost and more latency; require review for factual fidelity.

03 - Routing

Pick by operational need.

Fast website or app localization

DeepL or Google Cloud Translation

Low ops, mature APIs, glossary support, and predictable latency.

Low-resource language coverage

NLLB-200 or MADLAD-400

Open multilingual models cover many pairs that commercial APIs handle weakly.

Document with style and terminology

LLM translation with glossary + review

Long context preserves names, tone, and paragraph-level coherence.

Regulated or private corpus

Self-hosted MT + domain fine-tune

Keeps data inside your boundary and lets terminology be audited.

04 - Related

Need the lower-level explainer?

The building-block page explains encoder-decoder translation, LLM translation, beam search, and implementation options.

Open translation explainer ->