Machine Translation
Machine Translation is the task of automatically translating text from one natural language to another. The goal is to produce translations that preserve the meaning, style, and grammatical correctness of the source text while being fluent in the target language.
Machine translation has gone from statistical phrase-based systems to neural transformers to LLMs that translate 100+ languages. Meta's NLLB-200 covers 200 languages, while GPT-4 and Claude match or exceed specialized MT systems for high-resource language pairs. Low-resource and document-level translation remain the hard frontiers.
History
2014: Sutskever et al. introduce sequence-to-sequence models with LSTMs for MT, launching the neural MT era
2016: Google Neural Machine Translation (GNMT) deploys LSTM-based NMT in Google Translate, replacing phrase-based SMT
2017: The Transformer architecture (Vaswani et al.) achieves new SOTA on WMT English-German and becomes the foundation of all modern MT
2017: Multilingual NMT (Johnson et al.) shows a single model can translate between many language pairs with zero-shot transfer
2020: Facebook's mBART introduces multilingual denoising pretraining for MT across 25 languages
2020: M2M-100 (Fan et al.) trains a 12B-parameter model for direct translation between 100 languages without English pivoting
2022: NLLB-200 (Meta) covers 200 languages, including many low-resource African and Asian languages, with a single model
2023: GPT-4 matches Google Translate and DeepL on high-resource WMT pairs and excels at context-aware translation
2024: Tower (Unbabel) and ALMA-R fine-tune Llama for MT, achieving SOTA on WMT with instruction-tuned open models
How Machine Translation Works
Tokenization
Source text is tokenized with a multilingual subword vocabulary (SentencePiece BPE, typically 64K-256K tokens) shared across languages
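To make the subword step concrete, here is a toy sketch of greedy longest-match segmentation over a made-up vocabulary. Real systems use SentencePiece with a learned BPE or unigram vocabulary; the `VOCAB` set and the example sentence below are illustrative assumptions, not the actual vocabulary of any MT model.

```python
# Toy greedy longest-match subword segmentation.
# SentencePiece marks word boundaries with "▁" instead of spaces;
# single characters are kept in the vocabulary as a fallback so
# segmentation never fails (real vocabularies do the same with bytes).
VOCAB = {"▁trans", "lat", "ion", "▁is", "▁fun",
         "▁", "t", "r", "a", "n", "s", "l", "i", "o", "u", "f"}

def segment(text: str) -> list[str]:
    s = text.replace(" ", "▁")
    if not s.startswith("▁"):
        s = "▁" + s
    pieces, i = [], 0
    while i < len(s):
        # Take the longest vocabulary entry matching at position i;
        # the single-character fallback guarantees a match.
        for j in range(len(s), i, -1):
            if s[i:j] in VOCAB:
                pieces.append(s[i:j])
                i = j
                break
    return pieces

print(segment("translation is fun"))
# → ['▁trans', 'lat', 'ion', '▁is', '▁fun']
```

Because the boundary marker is part of the pieces, detokenization is just concatenation followed by replacing "▁" with spaces, which is what makes the scheme lossless across languages without whitespace conventions.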
Encoding
The encoder transformer builds contextualized representations of the source sentence through self-attention layers
Cross-attention decoding
The decoder generates target tokens autoregressively, attending to encoder representations at each step
Beam search
Multiple hypotheses are explored in parallel; beam search (width 4-5) balances fluency and adequacy
Quality estimation
Optional QE models (COMET, CometKiwi) score translations without references, enabling automatic filtering and routing
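The decode-with-beam-search steps above can be sketched with a toy autoregressive scorer standing in for a real decoder. The `STEP_PROBS` table below is a made-up stand-in for one decoder step (cross-attention over the encoder states followed by a softmax); only the beam-search bookkeeping itself is faithful.

```python
import math

# Fake next-token distributions, keyed by the decoded prefix so far.
# In a real system these come from the decoder; these numbers are invented.
STEP_PROBS = {
    (): {"die": 0.6, "das": 0.4},
    ("die",): {"katze": 0.7, "<eos>": 0.3},
    ("das",): {"katze": 0.2, "<eos>": 0.8},
    ("die", "katze"): {"<eos>": 1.0},
    ("das", "katze"): {"<eos>": 1.0},
}

def beam_search(width: int = 2, max_len: int = 3):
    beams = [((), 0.0)]          # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in STEP_PROBS[prefix].items():
                cand = (prefix + (tok,), score + math.log(p))
                # Hypotheses that emit <eos> are complete; the rest compete
                # for the `width` slots kept for the next step.
                (finished if tok == "<eos>" else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
        if not beams:
            break
    return max(finished, key=lambda c: c[1])

best, logprob = beam_search()
print(best)  # → ('die', 'katze', '<eos>')
```

Scores are accumulated in log space so that multiplying token probabilities becomes addition, which is numerically stable for long outputs; production decoders add length normalization on top so beam search does not systematically prefer short hypotheses.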
Current Landscape
Machine translation in 2025 is a mature field where the frontier has shifted from high-resource pairs (effectively solved) to low-resource languages, document-level consistency, and domain adaptation. LLMs have upended the specialized-MT paradigm: GPT-4-class models match or exceed purpose-built systems for major languages. But for the long tail of 200+ languages, dedicated multilingual models like NLLB remain essential. The commercial players (DeepL, Google Translate, Microsoft Translator) increasingly use LLM backends.
Key Challenges
Low-resource languages (1,000+ languages have virtually no parallel training data) remain poorly served
Document-level translation — maintaining consistency in pronouns, terminology, and style across paragraphs — is unsolved
Domain-specific translation (medical, legal, technical) requires specialized terminology that general models miss
Evaluation: BLEU correlates poorly with human judgment; COMET and human evaluation are expensive
Hallucination in MT: models occasionally generate fluent text that bears no relation to the source, especially for rare languages
Quick Recommendations
Best quality (high-resource pairs)
GPT-4o or Claude 3.5 Sonnet
Context-aware, handles idioms and nuance; matches or beats DeepL on EN↔DE, EN↔FR, EN↔ZH
Low-resource languages
NLLB-200 (3.3B)
Covers 200 languages including many with zero support elsewhere; Meta open-source
Production MT (cost-efficient)
ALMA-R-13B or Tower-13B
Fine-tuned Llama models matching commercial MT at self-hosted cost
Real-time / streaming
MarianMT (Helsinki-NLP)
Small, fast, available for 1,400+ language pairs on Hugging Face; <50ms inference
Translation with QE
NLLB + CometKiwi
Translate then score quality without references; route low-confidence segments to human review
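The translate-then-score pattern above reduces to a routing rule over reference-free quality scores. In the sketch below the scores are supplied directly, standing in for a CometKiwi-style QE model, and the 0.8 threshold is an assumption you would tune per language pair and domain; the example segments (including the mistranslation) are invented.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    source: str
    translation: str
    qe_score: float  # reference-free quality estimate in [0, 1]

def route(segments, threshold: float = 0.8):
    """Split segments into auto-publishable and human-review queues.

    In production qe_score would come from a QE model such as CometKiwi;
    here it is supplied directly. The threshold is a made-up starting
    point, not a recommended value.
    """
    auto, review = [], []
    for seg in segments:
        (auto if seg.qe_score >= threshold else review).append(seg)
    return auto, review

batch = [
    Segment("Hello", "Bonjour", 0.94),
    # "yawed" mistranslated as "yawned" — the kind of fluent error
    # QE models are meant to catch.
    Segment("The spacecraft yawed", "Le vaisseau a bâillé", 0.41),
]
auto, review = route(batch)
print(len(auto), len(review))  # → 1 1
```

A single global threshold is rarely optimal: QE score distributions shift across language pairs and domains, so the cutoff is usually calibrated against a small human-labeled sample for each route.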
What's Next
The next breakthroughs will come from speech-to-speech translation (bypassing text entirely), real-time document-level MT that maintains terminology consistency across entire books, and closing the gap for the 3,000+ languages with essentially zero digital resources. Multimodal translation (translating text in images, videos, and mixed media) is an emerging frontier driven by VLMs.
Benchmarks & SOTA
WMT'23
State-of-the-art machine translation evaluation from WMT 2023 shared task
State of the Art
GPT-4 (OpenAI): 84.1 COMET
DoTA (en->zh)
DoTA (Document image machine Translation dataset of ArXiv articles in markdown format)
DoTA is a large-scale dataset of document-image → translation pairs introduced for document image machine translation (DIMT). It was built from arXiv articles rendered in markdown format and is intended to evaluate translation of long-context, complex-layout document images (e.g., whole pages with tables, figures, and sections) into markdown-formatted target text. The NAACL 2024 paper reports a filtered set of about 126K image–translation pairs; the authors also provide an unfiltered collection of ~139K samples in the public repository. The dataset includes multilingual content (English source and Chinese target for the en→zh subset used in evaluations; the dataset metadata indicates other language variants are present) and is distributed under an MIT license on Hugging Face, where the dataset is gated and requires agreeing to access conditions.
State of the Art
HunyuanOCR (1B): 83.48 COMET
FLORES-200 devtest
FLORES-200 (FLoRes-200) Evaluation Benchmark for Multilingual Machine Translation
FLORES-200 (sometimes written FLoRes-200) is a multilingual evaluation benchmark for machine translation that extends Facebook AI’s earlier FLORES benchmarks to cover ~200 languages. It provides fully aligned sentence-level translations across many languages (many-to-many evaluation) and standard dev/devtest splits that are widely used as the primary evaluation benchmark for multilingual MT research (papers commonly report metrics such as COMET-22 and SacreBLEU on the FLORES-200 devtest split). The dataset is maintained from Meta/Facebook AI resources (GitHub) and is available via community-curated Hugging Face dataset repos (e.g., Muennighoff/flores200). FLORES-200 is the evaluation set used in the NLLB (No Language Left Behind) work and many subsequent multilingual MT evaluations.
No results tracked yet
FLORES-101
The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation
FLORES-101 is a high-quality, human-translated evaluation benchmark for low-resource and multilingual machine translation. It contains 3,001 sentences extracted from English Wikipedia and professionally translated into 101 languages through a carefully controlled process, producing a multilingually-aligned set useful for many-to-many MT evaluation. FLORES-101 was released to provide broad coverage of low-resource languages and to enable more reliable comparison of translation quality (commonly used as the dev/devtest evaluation benchmark). The dataset is distributed under a CC BY-SA 4.0 license.
No results tracked yet
WMT 2014 English->French (newstest2014)
WMT 2014 English–French (newstest2014)
WMT 2014 English–French is the English↔French parallel corpus collection used in the shared translation tasks of the Ninth Workshop on Statistical Machine Translation (WMT14). It is a news-domain translation benchmark assembled from multiple parallel corpora (e.g., Europarl, Common Crawl, News Commentary and others) and is widely used for training and evaluation of machine translation models. The standard evaluation/test set from this campaign is newstest2014 (commonly referred to as WMT14 newstest2014). In the "Attention Is All You Need" paper (arXiv:1706.03762) the authors state they trained on ~36M sentence pairs and report results on newstest2014; dataset splits and exact counts can vary depending on preprocessing and filtering (the Hugging Face wmt/wmt14 fr-en view lists ~40.8M fr-en sentence pairs for one collected version). Commonly used resources and mirrors for WMT14 include the official WMT14 website (statmt.org) and dataset pages on Hugging Face and TensorFlow Datasets.
No results tracked yet
MTOB (kalam -> eng)
Machine Translation from One Book (MTOB)
MTOB (Machine Translation from One Book) is a benchmark for learning to translate between English and Kalamang (an extremely low-resource language with fewer than 200 speakers) using a single field-linguistics grammar book and related reference materials (word lists, example sentence pairs, and grammar excerpts). It was introduced by Tanzer et al. (A Benchmark for Learning to Translate a New Language from One Grammar Book; arXiv:2309.16575 / ICLR 2024). The benchmark frames translation as learning a new language from human-readable grammar materials (rather than large mined corpora) and evaluates model performance on English↔Kalamang translation. The original paper reports automatic metrics (e.g., chrF) for the kgv→eng direction; follow-up evaluations have also reported BLEURT for Kalamang→English under no-context, half-book, and full-book settings. Code and data are available from the authors' repository (https://github.com/lukemelas/mtob), and there is a Hugging Face dataset entry.
No results tracked yet
WMT 2014 English->German (newstest2014)
WMT 2014 English–German (WMT14 En→De, newstest2014)
WMT 2014 English–German (WMT14 En–De) is the English⇄German parallel data collection used in the Ninth Workshop on Statistical Machine Translation (WMT 2014) shared translation task. The corpus is a combination of multiple parallel sources commonly used in MT research (e.g., Europarl, Common Crawl, News Commentary, and other parallel collections) and is distributed with standard splits used for training, validation and testing. For the English→German task the training set contains on the order of ~4.5 million sentence pairs (this is the size reported and used in many papers, including “Attention Is All You Need”); commonly used validation/dev and test sets are newstest2013 (dev) and newstest2014 (test). The Hugging Face dataset card (wmt/wmt14) provides per-language-pair configs (e.g., de-en) and lists splits and sizes; it also includes a warning about issues in the Common Crawl portion (misaligned / non-English files). Typical preprocessing applied in literature includes tokenization and Byte-Pair Encoding (BPE) with a shared vocabulary (~37k) as used in the Transformer paper. Primary sources / references: the WMT14 workshop pages (statmt.org/wmt14) and the Hugging Face dataset card (https://huggingface.co/datasets/wmt/wmt14).
No results tracked yet
Related Tasks
Text classification
Text classification is the task of automatically assigning predefined categories or labels to text based on its content, typically using natural language processing (NLP). Common applications include sentiment analysis (e.g., positive/negative reviews), spam detection, and topic categorization (e.g., organizing news articles).
Language Modeling
Language Modeling is the task of predicting the next word or character in a sequence given the previous context. Language models learn the probability distribution of word sequences and are foundational for many NLP applications including text generation, machine translation, and speech recognition.