Machine Translation

Machine translation is one of the oldest AI grand challenges, stretching from rule-based systems in the 1950s to the transformer revolution sparked by "Attention Is All You Need" (2017), the architecture that now underpins most large language models. Massively multilingual models such as Google's mT5 and Meta's NLLB-200 pushed coverage past 100 and 200 languages respectively, but the bigger disruption came from GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. The unsolved frontier is low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation, where preserving style, humor, and cultural nuance remains beyond any system. BLEU scores are increasingly seen as unreliable; human evaluation and newer learned metrics like COMET and BLEURT are becoming the standard.
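Since BLEU remains the canonical metric for this task, a minimal single-sentence sketch of it helps show what it measures: clipped n-gram precision combined via a geometric mean, scaled by a brevity penalty. This is a simplified stdlib-only illustration, not a drop-in replacement for standardized tools like sacreBLEU (which fix tokenization and aggregate over a corpus), and it makes plain why BLEU rewards surface overlap rather than meaning.

```python
from collections import Counter
import math

def bleu(hypothesis, reference, max_n=4):
    """Single-sentence BLEU sketch: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty.
    Simplified illustration only; real evaluation uses sacreBLEU."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # "Clipped" counts: each hypothesis n-gram is credited at most
        # as many times as it appears in the reference.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)
```

A perfect match scores 1.0, a hypothesis sharing no 4-gram with the reference scores 0.0, and a near-miss like "the cat sat on the mat today" against "the cat sat on the mat" lands strictly between the two. The hard zero on any missing n-gram order is one reason sentence-level BLEU is noisy and learned metrics like COMET correlate better with human judgment.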

Datasets: 2 · Results: 4 · Canonical metric: BLEU

Canonical Benchmark

WMT'23

State-of-the-art machine translation evaluation from the WMT 2023 shared task.

Primary metric: BLEU (the leaderboard below reports COMET scores).
View full leaderboard

Top 10

Leading models on WMT'23.

Rank  Model             COMET  Year  Source
1     GPT-4             84.1   2023  paper
2     Google Translate  83.8   2023  paper
3     DeepL             83.5   2023  paper
4     NLLB-3.3B         81.6   2023  paper

All datasets

2 datasets tracked for this task.

Related tasks

Other tasks in Natural Language Processing.

Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.