Speech Translation
Translating spoken audio directly to another language.
Speech translation converts spoken audio in one language directly into text or speech in another, bypassing the traditional ASR-then-MT pipeline. SeamlessM4T (Meta) covers speech translation across 100+ languages, Whisper translates speech from nearly 100 source languages into English, and real-time speech-to-speech systems are approaching the Star Trek universal-translator vision. The main challenges are latency, quality for low-resource languages, and preserving speaker characteristics across languages.
History
2016: First end-to-end speech translation models emerge, learning to translate without intermediate ASR transcripts
2017: Weiss et al. demonstrate that end-to-end speech translation can match cascaded ASR+MT systems on Fisher Spanish-English
2019: MuST-C dataset (Cattoni et al.) provides 400+ hours of TED talk speech translation data for 14 language pairs
2020: Fairseq S2T and ESPnet establish open-source toolkits for end-to-end speech translation research
2022: Whisper (OpenAI) includes speech translation to English from 97 languages as a built-in capability
2023: SeamlessM4T (Meta) delivers speech-to-speech translation across 100+ languages in a unified model
2023: Google USM (Universal Speech Model) handles ASR, translation, and language identification, pretrained on audio spanning 300+ languages
2023: SeamlessM4T v2 and SeamlessStreaming enable real-time simultaneous speech translation with low latency
2024: GPT-4o and Gemini 2.0 process speech natively, enabling conversational cross-lingual dialogue in real-time
How Speech Translation Works
Audio encoding
Source speech is processed by a speech encoder (Wav2Vec 2.0, Conformer, or shared encoder) into hidden representations
Language identification
An optional module identifies the source language to route processing, especially for multilingual models
Translation
A sequence-to-sequence decoder generates target text tokens conditioned on the encoded speech, optionally using CTC or attention-based alignment
Speech synthesis (for S2ST)
For speech-to-speech, a TTS module converts translated text to speech in the target language, potentially preserving speaker characteristics
Streaming (optional)
Simultaneous translation uses wait-k or monotonic attention policies to translate before the speaker finishes, trading quality for latency
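The wait-k schedule mentioned in the streaming step can be sketched in a few lines. This is an illustrative simulation only (the function name and the idea of fixed-size "segments" are assumptions made here, not any particular system's implementation): the model first READs k source segments, then alternates one WRITE per READ, flushing the remaining target tokens once the source is exhausted.

```python
def wait_k_schedule(num_source, num_target, k=3):
    """Toy wait-k policy: emit a READ/WRITE action sequence.

    Reads k source segments up front, then writes one target token
    per additional source segment; once the full source has been
    read, the remaining target tokens are flushed.
    """
    actions = []
    read, written = 0, 0
    while written < num_target:
        # Keep reading while we are fewer than k segments ahead
        # of the target position and source remains.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

With five source segments, five target tokens, and k=3, the schedule reads three segments before the first write, then interleaves, which is exactly the quality-for-latency trade described above: a larger k delays output but gives the decoder more source context.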
Current Landscape
Speech translation in 2025 is one of the most tangible demonstrations of AI progress. SeamlessM4T and Whisper have made multilingual speech translation accessible to anyone — a capability that required teams of engineers five years ago now works with a single API call. The field has bifurcated: end-to-end models are preferred for their simplicity and latency, while cascaded systems (ASR → MT → TTS) still win on some quality metrics. Real-time simultaneous translation is production-ready for common language pairs (EN↔ES, EN↔ZH, EN↔FR) but struggles with distant language pairs and fast speakers.
Key Challenges
End-to-end models still slightly lag cascaded ASR+MT pipelines on high-resource pairs, though the gap is closing
Low-resource language pairs have minimal parallel speech translation data, limiting direct end-to-end training
Simultaneous / real-time translation must decide when to start translating before the full source sentence is available
Preserving prosody, emotion, and speaker identity across languages in speech-to-speech translation
Code-switched speech (mixing two languages within an utterance) still breaks current speech translation systems
Quick Recommendations
Best multilingual ST
SeamlessM4T v2 (Meta)
100+ languages, speech-to-text and speech-to-speech; openly released weights with strong quality
Speech-to-English (any source)
Whisper large-v3
Translates 97 languages to English with excellent accuracy; widely deployed
Real-time / simultaneous
SeamlessStreaming or Meta's STST
Low-latency simultaneous translation with configurable quality-latency tradeoff
Conversational (bidirectional)
GPT-4o voice or Google Translate conversation mode
Real-time bidirectional speech translation with natural conversational flow
Open-source toolkit
ESPnet-ST or Fairseq S2T
Full training and inference pipelines for custom language pairs and domains
What's Next
The vision is zero-latency speech-to-speech translation that preserves the speaker's voice, emotion, and style across languages — essentially dubbing yourself in real-time. Multimodal translation that incorporates visual context (lip reading, gestures) will improve accuracy. Expect personal translation devices (earbuds) to reach consumer quality for common language pairs within 1-2 years. The long-term goal is covering all 7,000 human languages.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while models such as Bark, VALL-E (Microsoft), and F5-TTS have demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
Speaker Verification
Verifying speaker identity from voice samples.
Voice Cloning
Replicating a speaker's voice characteristics.
Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.