
Machine Translation

Translate text between languages. Essential for global communication, localization, and cross-lingual applications.

How Machine Translation Works

A technical deep-dive into machine translation. From statistical phrase tables to transformer-based neural MT and the challenge of low-resource languages.

1. The Problem

Why translation is harder than simple word substitution.

Picture yourself trying to translate "time flies like an arrow" into another language. Word-by-word substitution gives nonsense. The real challenge is that languages encode meaning differently: word order changes, idioms have no literal equivalent, and a single word in one language might need five words in another.

Modern neural machine translation solves this by learning to understand the source sentence as a whole, then generate a natural sentence in the target language. The model does not translate words; it translates meaning.

Word Order

Languages structure sentences differently.

English (SVO): "I love you"
Japanese (SOV): "Watashi wa anata o aishiteimasu" (I you love)

Ambiguity

Same word, different meanings.

"The bank was closed"
- Financial institution?
- River bank?
Context determines the translation.

Idioms

Literal translation fails for expressions.

"Break a leg!"
Literal: a wish for injury
Actual: good luck
You must translate the meaning, not the words.

Idiom Translation Examples

English:"The early bird catches the worm."
German:"Der fruhe Vogel fangt den Wurm."(Literal, same idiom exists)
Spanish:"A quien madruga, Dios le ayuda."(Different idiom, same meaning)
Japanese:"Hayaoki wa sanmon no toku."(Early rising = three coins gain)
2. Encoder-Decoder with Attention

The architectural breakthrough that makes neural MT work.

The core idea is deceptively simple: encode the source sentence into a rich representation, then decode that representation into the target language. The magic is in attention, which lets the decoder look back at different parts of the source sentence as it generates each target word.


Transformer Translation Architecture

Source: "The cat sat on the mat."
        |
Encoder (6 layers)
  Self-attention + FFN
  Each word attends to all other source words
        |
Encoded states: h1 h2 h3 h4 h5 h6
        |
Cross-attention
        |
Decoder (6 layers)
  Masked self-attention + cross-attention + FFN
  Generates one token at a time, left to right
        |
Target: "Die Katze saß auf der Matte."
Cross-Attention

When generating each target word, the decoder attends strongly to the corresponding source word. This alignment emerges from training.

The decoder learns: "What source words are relevant for this target word?"
Self-Attention

In the encoder, each word builds context from all other words. "bank" understands its meaning from surrounding words like "river" or "money".

Captures long-range dependencies that RNNs struggled with.
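This alignment can be inspected directly. Below is a minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de checkpoint, that generates a translation with greedy decoding while returning cross-attention weights, then prints how strongly the final decoder step attended to each source token. The tensor layout follows the generate() output of recent transformers versions and may vary.

# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=1,                    # greedy decoding keeps the attention tensors simple
    output_attentions=True,
    return_dict_in_generate=True,
)
print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))

# out.cross_attentions holds one tuple per generated token, one tensor per decoder layer,
# each shaped (batch, heads, 1, source_len). Averaging the heads of the last layer at the
# last step gives a rough soft alignment over the source tokens.
last_layer = out.cross_attentions[-1][-1]
alignment = last_layer.mean(dim=1)[0, 0]
for token, weight in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                         alignment.tolist()):
    print(f"{token:>12}  {weight:.3f}")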

Beam Search: Finding the Best Translation

Instead of greedily picking the most probable word at each step, beam search maintains multiple candidate translations (beams) and explores them in parallel.

Beam Width = 1 (Greedy)
Pick best token each step
Fast but may miss better translations

Beam Width = 4-5 (Common)
Track top 4-5 candidates
Good quality/speed tradeoff

Beam Width = 10+
More thorough search
Slower, diminishing returns
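In code, the beam width is just a decoding argument. Here is a minimal sketch, assuming the Hugging Face transformers library and the Helsinki-NLP/opus-mt-en-de checkpoint, comparing greedy decoding with a beam of 5 via generate():

# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

inputs = tokenizer("Time flies like an arrow.", return_tensors="pt")

# Greedy decoding: beam width 1, always pick the most probable next token
greedy = model.generate(**inputs, num_beams=1, do_sample=False)

# Beam search: keep the 5 best partial hypotheses, return the top 3 finished ones
beams = model.generate(**inputs, num_beams=5, num_return_sequences=3, early_stopping=True)

print("Greedy:", tokenizer.decode(greedy[0], skip_special_tokens=True))
for i, candidate in enumerate(beams, start=1):
    print(f"Beam {i}:", tokenizer.decode(candidate, skip_special_tokens=True))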
3. Evolution of MT

From hand-crafted rules to neural networks: 70 years of progress.

Era                 | Year | Type         | BLEU   | Key Ideas
Rule-Based MT       | 1954 | Rules        | ~5-10  | Hand-crafted grammar rules, dictionaries
Statistical MT      | 1990 | Statistical  | ~20-25 | Phrase tables, language models, beam search
Seq2Seq + Attention | 2014 | Neural       | ~25-30 | Bahdanau attention, encoder-decoder RNNs
Transformer         | 2017 | Neural       | ~30-35 | Self-attention, parallelizable, SOTA
mBART               | 2020 | Multilingual | ~35-40 | Denoising pre-training, 25 languages
M2M-100             | 2020 | Multilingual | ~35-40 | Direct translation, 100 languages
NLLB-200            | 2022 | Multilingual | ~40+   | No Language Left Behind, 200 languages
GPT-4/Claude        | 2023 | LLM          | ~40+   | Zero-shot, context-aware, expensive
SMT to Neural (2014)

Phrase tables gave way to end-to-end learning. No more hand-crafted features. The Bahdanau attention mechanism was the key breakthrough.

RNN to Transformer (2017)

Self-attention replaced recurrence. Parallel training, better long-range dependencies. "Attention Is All You Need" changed everything.

Bilingual to Multilingual (2020)

Single models handling 50-200 languages. Transfer learning between related languages. No more separate model per language pair.

4. Key Models

The models you should know: open-source and commercial.

Model            | Org          | Languages   | Size      | Best For
NLLB-200         | Meta         | 200+        | 600M-54B  | Low-resource languages, research
mBART-50         | Meta         | 50          | 611M      | Production multilingual apps
MarianMT         | Helsinki-NLP | 1400+ pairs | 74M-226M  | Specific language pairs, edge deployment
Google Translate | Google       | 130+        | API       | Production apps, high volume
DeepL            | DeepL        | 30+         | API       | European business content
NLLB-200
Best for low-resource languages
200 languages including Swahili, Yoruba, Nepali

MarianMT
Best for specific pairs, CPU deployment
Small, fast models for common language pairs

DeepL API
Best quality for European languages
Natural, fluent output. Formality control.
5. Low-Resource vs. High-Resource

Why some language pairs work great and others struggle.

Translation quality depends heavily on training data availability. English-German has billions of sentence pairs; Swahili-Nepali might have thousands. This "resource" gap is the biggest challenge in making translation work for everyone.

High-Resource
Examples: English, Chinese, Spanish, French, German
Data: 10M+ sentence pairs
Quality: Near human-level
Models: All models work well

Medium-Resource
Examples: Polish, Turkish, Vietnamese, Indonesian
Data: 100K-10M pairs
Quality: Good, some errors
Models: NLLB, mBART, fine-tuned

Low-Resource
Examples: Swahili, Nepali, Yoruba, many African languages
Data: <100K pairs
Quality: Variable, often poor
Models: NLLB-200 (specifically designed)

NLLB: No Language Left Behind

Meta's NLLB-200 was specifically designed to address the low-resource problem. It uses:

  • Transfer learning from high-resource to related low-resource languages
  • Back-translation to create synthetic training data
  • Shared vocabulary across all 200 languages
  • Balanced training to prevent high-resource languages from dominating
Result: 44% improvement in BLEU for low-resource languages compared to previous models.
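As a concrete usage example, here is a minimal translation sketch with the distilled 600M-parameter NLLB checkpoint, assuming the Hugging Face transformers library; the FLORES-style language codes (eng_Latn, and swh_Latn for Swahili) come from the model card, and the way the target language tag is forced may differ slightly between transformers versions.

# pip install transformers sentencepiece torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The clinic opens at eight in the morning.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the first decoder token to the Swahili language tag
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])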
6. Benchmarks and Metrics

How we measure translation quality.

Understanding BLEU Score

BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between machine output and human references. Higher is better (0-100).

BLEU = BP * exp(sum(w_n * log(p_n)))
p_n = n-gram precision
BP = brevity penalty
w_n = weights (usually uniform)
Rough quality guidelines:
40+: Very high quality
30-40: High quality, understandable
20-30: Understandable, some errors
<20: Hard to understand
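To score your own outputs, here is a small sketch assuming the sacrebleu package (2.x), which implements corpus-level BLEU and chrF++; the sentences are made up for illustration.

# pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sat on the mat."]                # system output, one string per sentence
references = [["The cat was sitting on the mat."]]      # one reference stream, aligned by index

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # word_order=2 gives chrF++

print(f"BLEU:   {bleu.score:.1f}")
print(f"chrF++: {chrf.score:.1f}")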
COMET

Neural metric using embeddings. Correlates better with human judgment than BLEU. Scores typically 0-1, higher is better.
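A scoring sketch assuming the unbabel-comet package and the Unbabel/wmt22-comet-da checkpoint; the predict() call follows the COMET README and may change between releases.

# pip install unbabel-comet
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "Der frühe Vogel fängt den Wurm.",   # source sentence
    "mt":  "The early bird catches the worm.",  # machine translation
    "ref": "The early bird catches the worm.",  # human reference
}]
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)   # roughly 0-1, higher is better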

chrF++

Character-level F-score. Better for morphologically rich languages (German, Finnish). More robust to tokenization differences.

Human Evaluation

Still the gold standard. Metrics like adequacy (meaning preserved) and fluency (natural output). Expensive but definitive.

Benchmark  | Full Name                   | Description                                         | Metrics
WMT        | Workshop on MT              | Annual shared task; news domain; EN-DE, EN-RU, etc. | BLEU, COMET
FLORES-200 | Facebook Low Resource       | 200 languages, Wikipedia-style sentences            | spBLEU, chrF++
IWSLT      | Spoken Language Translation | TED talks, conversational speech                    | BLEU
OPUS-100   | Open Parallel Corpus        | 100 languages, diverse domains                      | BLEU
7. Code Examples

Get started with machine translation in Python.

Quick Start (MarianMT)
Beginner

pip install transformers
from transformers import pipeline

# Quick start: Helsinki-NLP MarianMT
translator = pipeline(
    "translation",
    model="Helsinki-NLP/opus-mt-en-de"  # English -> German
)

text = "Machine translation has come a long way since the 1950s."
result = translator(text)
print(result[0]['translation_text'])
# "Maschinelle Ubersetzung hat seit den 1950er Jahren einen langen Weg zuruckgelegt."

# For other language pairs, find models at:
# https://huggingface.co/Helsinki-NLP
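For a multilingual alternative to per-pair MarianMT models, here is a sketch using the mBART-50 many-to-many checkpoint through transformers; the language codes (en_XX, fr_XX) and the lang_code_to_id lookup follow the Hugging Face model card and may vary by library version.

# pip install transformers sentencepiece torch
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"                     # source language code
inputs = tokenizer("The meeting was moved to Friday.", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"],   # target: French
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])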

Quick Reference

For Production
  • Google/DeepL APIs (quality + reliability)
  • MarianMT (self-hosted, fast)
  • mBART-50 (multilingual)
For Low-Resource
  • NLLB-200 (designed for this)
  • Fine-tune on domain data
  • Back-translation for augmentation
Key Metrics
  • BLEU (n-gram overlap)
  • COMET (neural, human-correlated)
  • chrF++ (character-level)

Use Cases

  • Document translation
  • Real-time communication
  • Content localization
  • Multilingual search

Architectural Patterns

Encoder-Decoder Transformers

Dedicated translation models (mBART, NLLB).

Pros:
  • + Optimized for translation
  • + Fast
  • + Many language pairs
Cons:
  • - Fixed language pairs
  • - May miss context

LLM Translation

Use GPT-4/Claude for translation with prompting.

Pros:
  • + Handles nuance
  • + Context-aware
  • + Any language
Cons:
  • - Expensive
  • - Slower
  • - May hallucinate
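A minimal sketch of the prompting approach, assuming the OpenAI Python client (openai>=1.0); the model name and prompt wording are illustrative, and the same pattern works with any instruction-following LLM API.

# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_translate(text: str, target_language: str) -> str:
    """Translate text with an instruction-following LLM via prompting."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": f"Translate the user's text into {target_language}. "
                        "Preserve meaning and tone; translate idioms by sense, not literally. "
                        "Return only the translation."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(llm_translate("Break a leg at tomorrow's premiere!", "German"))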

Massively Multilingual

One model for 200+ languages (NLLB-200).

Pros:
  • + Low-resource languages
  • + Single model
Cons:
  • - Lower quality than specialized
  • - Large model

Implementations

API Services

Google Cloud Translation

Google
API

Production quality. 130+ languages. AutoML Translation for custom models.

DeepL

DeepL
API

Best quality for European languages.

Open Source

NLLB-200

CC-BY-NC 4.0
Open Source

200 languages. Best open-source coverage.

MADLAD-400

Apache 2.0
Open Source

400 languages. Google's massive multilingual model.

SeamlessM4T

CC-BY-NC 4.0
Open Source

Multimodal translation (speech + text). 100+ languages.

Benchmarks

Quick Facts

Input: Text
Output: Text
Implementations: 3 open source, 2 API
Patterns: 3 approaches

Have benchmark data?

Help us track the state of the art for machine translation.
