Machine Translation

Machine translation is one of the oldest challenges in AI, running from rule-based systems in the 1950s to the Transformer architecture introduced in "Attention Is All You Need" (2017), which was first evaluated on translation and now underlies most large language models. Massively multilingual models such as Google's mT5 and Meta's NLLB-200 pushed coverage to as many as 200 languages, but the larger disruption came from general-purpose LLMs such as GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. Two frontiers remain open: low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation, where preserving style, humor, and cultural nuance remains beyond current systems. BLEU is increasingly seen as unreliable because it rewards surface n-gram overlap with a reference rather than meaning; human evaluation and learned metrics like COMET and BLEURT are becoming the standard.
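
To make the concern about BLEU concrete, here is a minimal sketch of a sentence-level BLEU score. It assumes whitespace tokenization, a single reference, and no smoothing, so it is an illustration of the idea and not the sacreBLEU implementation normally used for WMT scoring. The function names `ngrams` and `bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU (0-100): one reference, whitespace tokens, no smoothing."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            # Without smoothing, one empty n-gram order zeroes the whole score.
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))        # identical output
print(bleu("a cat was sitting on the mat", reference))  # acceptable paraphrase
```

In this sketch the identical sentence gets a perfect score, while the paraphrase shares no 4-gram with the reference and drops to zero even though its meaning is the same. Learned metrics such as COMET aim to reduce exactly this sensitivity to surface wording.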

Datasets: 2
Results: 4
Canonical metric: bleu

Canonical Benchmark

WMT'23

State-of-the-art machine translation evaluation from WMT 2023 shared task

Primary metric: bleu

Top 10

Leading models on WMT'23.

Rank  Model             COMET  Year  Source
1     GPT-4             84.1   2023  paper
2     Google Translate  83.8   2023  paper
3     DeepL             83.5   2023  paper
4     NLLB-3.3B         81.6   2023  paper

All datasets

2 datasets tracked for this task.

