
Machine Translation

Translate text between languages. Essential for global communication, localization, and cross-lingual applications.

How Machine Translation Works

A technical deep-dive into machine translation. From statistical phrase tables to transformer-based neural MT and the challenge of low-resource languages.

1. The Problem

Why translation is harder than simple word substitution.

Picture yourself trying to translate "time flies like an arrow" into another language. Word-by-word substitution gives nonsense. The real challenge is that languages encode meaning differently: word order changes, idioms have no literal equivalent, and a single word in one language might need five words in another.

Modern neural machine translation solves this by learning to understand the source sentence as a whole, then generate a natural sentence in the target language. The model does not translate words; it translates meaning.

Word Order

Languages structure sentences differently.

English (SVO): "I love you"
Japanese (SOV): "Watashi wa anata o aishiteimasu" (I you love)

Ambiguity

Same word, different meanings.

"The bank was closed"
- Financial institution?
- River bank?
Context determines the translation.

Idioms

Literal translation fails for expressions.

"Break a leg!"
Literal: a wish for injury
Actual: good luck
You must translate the meaning, not the words.

Idiom Translation Examples

English:"The early bird catches the worm."
German:"Der fruhe Vogel fangt den Wurm."(Literal, same idiom exists)
Spanish:"A quien madruga, Dios le ayuda."(Different idiom, same meaning)
Japanese:"Hayaoki wa sanmon no toku."(Early rising = three coins gain)
2. Encoder-Decoder with Attention

The architectural breakthrough that makes neural MT work.

The core idea is deceptively simple: encode the source sentence into a rich representation, then decode that representation into the target language. The magic is in attention, which lets the decoder look back at different parts of the source sentence as it generates each target word.


Transformer Translation Architecture

Source: "The cat sat on the mat."
        |
Encoder (6 layers)
  Self-attention + FFN
  Each word attends to all other source words
        |
Encoded states: h1 h2 h3 h4 h5 h6
        |
Cross-attention
        |
Decoder (6 layers)
  Masked self-attention + cross-attention + FFN
  Generates one token at a time, left to right
        |
Target: "Die Katze saß auf der Matte."
Cross-Attention

When generating each target word, the decoder attends strongly to the corresponding source word. This alignment emerges from training.

The decoder learns: "What source words are relevant for this target word?"
Self-Attention

In the encoder, each word builds context from all other words. "bank" understands its meaning from surrounding words like "river" or "money".

Captures long-range dependencies that RNNs struggled with.
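This alignment can be inspected directly. Below is a minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de checkpoint, that generates a translation with greedy decoding while returning cross-attention weights, then prints how strongly the final decoder step attended to each source token. The tensor layout follows the generate() output of recent transformers versions and may vary.

# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=1,                    # greedy decoding keeps the attention tensors simple
    output_attentions=True,
    return_dict_in_generate=True,
)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))

# out.cross_attentions holds one tuple per generated token, one tensor per decoder layer,
# each shaped (batch, heads, 1, source_len). Averaging the heads of the last layer at the
# last step gives a rough soft alignment over the source tokens.
last_layer = out.cross_attentions[-1][-1]
alignment = last_layer.mean(dim=1)[0, 0]
for token, weight in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                         alignment.tolist()):
    print(f"{token:>12}  {weight:.3f}")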

Beam Search: Finding the Best Translation

Instead of greedily picking the most probable word at each step, beam search maintains multiple candidate translations (beams) and explores them in parallel.

Beam Width = 1 (Greedy)
Pick best token each step
Fast but may miss better translations

Beam Width = 4-5 (Common)
Track top 4-5 candidates
Good quality/speed tradeoff

Beam Width = 10+
More thorough search
Slower, diminishing returns
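In code, the beam width is just a decoding argument. Here is a minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de checkpoint, comparing greedy decoding with a beam of 5 via generate():

# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("Time flies like an arrow.", return_tensors="pt")

# Greedy decoding: beam width 1, always pick the most probable next token
greedy = model.generate(**inputs, num_beams=1, do_sample=False)

# Beam search: keep the 5 best partial hypotheses, return the top 3 finished ones
beams = model.generate(**inputs, num_beams=5, num_return_sequences=3, early_stopping=True)

print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))
for i, candidate in enumerate(beams, start=1):
    print(f"Beam {i}:", tokenizer.decode(candidate, skip_special_tokens=True))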
3. Evolution of MT

From hand-crafted rules to neural networks: 70 years of progress.

Era                 | Year | Type         | BLEU   | Key Ideas
Rule-Based MT       | 1954 | Rules        | ~5-10  | Hand-crafted grammar rules, dictionaries
Statistical MT      | 1990 | Statistical  | ~20-25 | Phrase tables, language models, beam search
Seq2Seq + Attention | 2014 | Neural       | ~25-30 | Bahdanau attention, encoder-decoder RNNs
Transformer         | 2017 | Neural       | ~30-35 | Self-attention, parallelizable, SOTA
mBART               | 2020 | Multilingual | ~35-40 | Denoising pre-training, 25 languages
M2M-100             | 2020 | Multilingual | ~35-40 | Direct translation, 100 languages
NLLB-200            | 2022 | Multilingual | ~40+   | No Language Left Behind, 200 languages
GPT-4/Claude        | 2023 | LLM          | ~40+   | Zero-shot, context-aware, expensive
SMT to Neural (2014)

Phrase tables gave way to end-to-end learning. No more hand-crafted features. The Bahdanau attention mechanism was the key breakthrough.

RNN to Transformer (2017)

Self-attention replaced recurrence. Parallel training, better long-range dependencies. "Attention Is All You Need" changed everything.

Bilingual to Multilingual (2020)

Single models handling 50-200 languages. Transfer learning between related languages. No more separate model per language pair.

4. Key Models

The models you should know: open-source and commercial.

Model            | Org          | Languages   | Size      | Best For
NLLB-200         | Meta         | 200+        | 600M-54B  | Low-resource languages, research
mBART-50         | Meta         | 50          | 611M      | Production multilingual apps
MarianMT         | Helsinki-NLP | 1400+ pairs | 74M-226M  | Specific language pairs, edge deployment
Google Translate | Google       | 130+        | API       | Production apps, high volume
DeepL            | DeepL        | 30+         | API       | European business content
NLLB-200
Best for low-resource languages
200 languages including Swahili, Yoruba, Nepali

MarianMT
Best for specific pairs, CPU deployment
Small, fast models for common language pairs

DeepL API
Best quality for European languages
Natural, fluent output. Formality control.
5. Low-Resource vs. High-Resource

Why some language pairs work great and others struggle.

Translation quality depends heavily on training data availability. English-German has billions of sentence pairs; Swahili-Nepali might have thousands. This "resource" gap is the biggest challenge in making translation work for everyone.

High-Resource
Examples: English, Chinese, Spanish, French, German
Data: 10M+ sentence pairs
Quality: Near human-level
Models: All models work well

Medium-Resource
Examples: Polish, Turkish, Vietnamese, Indonesian
Data: 100K-10M pairs
Quality: Good, some errors
Models: NLLB, mBART, fine-tuned

Low-Resource
Examples: Swahili, Nepali, Yoruba, many African languages
Data: <100K pairs
Quality: Variable, often poor
Models: NLLB-200 (specifically designed)

NLLB: No Language Left Behind

Meta's NLLB-200 was specifically designed to address the low-resource problem. It uses:

  • Transfer learning from high-resource to related low-resource languages
  • Back-translation to create synthetic training data
  • Shared vocabulary across all 200 languages
  • Balanced training to prevent high-resource languages from dominating
Result: 44% improvement in BLEU for low-resource languages compared to previous models.
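As a concrete usage example, here is a minimal translation sketch with the distilled 600M-parameter NLLB checkpoint, assuming the Hugging Face transformers library; the FLORES-style language codes (eng_Latn, and swh_Latn for Swahili) come from the model card, and the way the target language tag is forced may differ slightly between transformers versions.

# pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The clinic opens at eight in the morning.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the first decoder token to the Swahili language tag
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])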
6. Benchmarks and Metrics

How we measure translation quality.

Understanding BLEU Score

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between machine output and human references. Higher is better (0-100).

BLEU = BP * exp(sum(w_n * log(p_n)))
p_n = n-gram precision
BP = brevity penalty
w_n = weights (usually uniform)
Rough quality guidelines:
40+: Very high quality
30-40: High quality, understandable
20-30: Understandable, some errors
<20: Hard to understand
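To score your own outputs, here is a small sketch assuming the sacrebleu package (2.x), which implements corpus-level BLEU and chrF++; the sentences are made up for illustration.

# pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sat on the mat."]                # system output, one string per sentence
references = [["The cat was sitting on the mat."]]      # one reference stream, aligned by index

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++

print(f"BLEU:   {bleu.score:.1f}")
print(f"chrF++: {chrf.score:.1f}")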
COMET

Neural metric using embeddings. Correlates better with human judgment than BLEU. Scores typically 0-1, higher is better.
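A scoring sketch assuming the unbabel-comet package and the Unbabel/wmt22-comet-da checkpoint; the predict() call follows the COMET README and may change between releases.

# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der frühe Vogel fängt den Wurm.",   # source sentence
    "mt":  "The early bird catches the worm.",  # machine translation
    "ref": "The early bird catches the worm.",  # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)   # roughly 0-1, higher is better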

chrF++

Character-level F-score. Better for morphologically rich languages (German, Finnish). More robust to tokenization differences.

Human Evaluation

Still the gold standard. Metrics like adequacy (meaning preserved) and fluency (natural output). Expensive but definitive.

Benchmark  | Full Name                   | Description                                         | Metrics
WMT        | Workshop on MT              | Annual shared task; news domain; EN-DE, EN-RU, etc. | BLEU, COMET
FLORES-200 | Facebook Low Resource       | 200 languages, Wikipedia-style sentences            | spBLEU, chrF++
IWSLT      | Spoken Language Translation | TED talks, conversational speech                    | BLEU
OPUS-100   | Open Parallel Corpus        | 100 languages, diverse domains                      | BLEU
7. Code Examples

Get started with machine translation in Python.

Quick Start (MarianMT)
Beginner

pip install transformers
from transformers import pipeline

# Quick start: Helsinki-NLP MarianMT
translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-de"  # English -> German
)

text = "Machine translation has come a long way since the 1950s."
result = translator(text)
print(result[0]['translation_text'])
# "Maschinelle Ubersetzung hat seit den 1950er Jahren einen langen Weg zuruckgelegt."

# For other language pairs, find models at:
# https://huggingface.co/Helsinki-NLP
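For a multilingual alternative to per-pair MarianMT models, here is a sketch using the mBART-50 many-to-many checkpoint through transformers; the language codes (en_XX, fr_XX) and the lang_code_to_id lookup follow the Hugging Face model card and may vary by library version.

# pip install transformers sentencepiece torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"                     # source language code
inputs = tokenizer("The meeting was moved to Friday.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],   # target: French
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])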

Quick Reference

For Production
  • Google/DeepL APIs (quality + reliability)
  • MarianMT (self-hosted, fast)
  • mBART-50 (multilingual)
For Low-Resource
  • NLLB-200 (designed for this)
  • Fine-tune on domain data
  • Back-translation for augmentation
Key Metrics
  • BLEU (n-gram overlap)
  • COMET (neural, human-correlated)
  • chrF++ (character-level)

Use Cases

  • Document translation
  • Real-time communication
  • Content localization
  • Multilingual search

Architectural Patterns

Encoder-Decoder Transformers

Dedicated translation models (mBART, NLLB).

Pros:
  • + Optimized for translation
  • + Fast
  • + Many language pairs
Cons:
  • - Fixed language pairs
  • - May miss context

LLM Translation

Use GPT-4/Claude for translation with prompting.

Pros:
  • + Handles nuance
  • + Context-aware
  • + Any language
Cons:
  • - Expensive
  • - Slower
  • - May hallucinate
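A minimal sketch of the prompting approach, assuming the OpenAI Python client (openai>=1.0); the model name and prompt wording are illustrative, and the same pattern works with any instruction-following LLM API.

# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_translate(text: str, target_language: str) -> str:
    """Translate text with an instruction-following LLM via prompting."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. "
                        "Preserve meaning and tone; translate idioms by sense, not literally. "
                        "Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(llm_translate("Break a leg at tomorrow's premiere!", "German"))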

Massively Multilingual

One model for 200+ languages (NLLB-200).

Pros:
  • + Low-resource languages
  • + Single model
Cons:
  • - Lower quality than specialized
  • - Large model

Implementations

API Services

Google Cloud Translation

Google
API

Production quality. 130+ languages. AutoML Translation for custom models.

DeepL

DeepL
API

Best quality for European languages.

Open Source

NLLB-200

CC-BY-NC 4.0
Open Source

200 languages. Best open-source coverage.

MADLAD-400

Apache 2.0
Open Source

400 languages. Google's massive multilingual model.

SeamlessM4T

CC-BY-NC 4.0
Open Source

Multimodal translation (speech + text). 100+ languages.

Benchmarks

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for machine translation.
