Speech Translation
Translating spoken audio directly to another language.
Speech translation converts spoken audio in one language directly into text or speech in another, bypassing the traditional ASR-then-MT pipeline. SeamlessM4T (Meta) covers speech translation across 100+ languages, Whisper translates speech from nearly 100 source languages into English, and real-time speech-to-speech systems are approaching the Star Trek universal-translator vision. The main challenges are latency, quality for low-resource languages, and preserving speaker characteristics across languages.
History
2016: First end-to-end speech translation models emerge, learning to translate without intermediate ASR transcripts
2017: Weiss et al. demonstrate that end-to-end speech translation can match cascaded ASR+MT systems on Fisher Spanish-English
2019: MuST-C dataset (Cattoni et al.) provides 400+ hours of TED talk speech translation data for 14 language pairs
2020: Fairseq S2T and ESPnet establish open-source toolkits for end-to-end speech translation research
2022: Whisper (OpenAI) includes speech translation to English from 97 languages as a built-in capability
2023: SeamlessM4T (Meta) delivers speech-to-speech translation across 100+ languages in a unified model
2023: Google USM (Universal Speech Model) handles ASR, translation, and language identification, pretrained on audio spanning 300+ languages
2023: SeamlessM4T v2 and SeamlessStreaming enable real-time simultaneous speech translation with low latency
2024: GPT-4o and Gemini 2.0 process speech natively, enabling conversational cross-lingual dialogue in real-time
How Speech Translation Works
Audio encoding
Source speech is processed by a speech encoder (Wav2Vec 2.0, Conformer, or shared encoder) into hidden representations
Language identification
An optional module identifies the source language to route processing, especially for multilingual models
Translation
A sequence-to-sequence decoder generates target text tokens conditioned on the encoded speech, optionally using CTC or attention-based alignment
Speech synthesis (for S2ST)
For speech-to-speech, a TTS module converts translated text to speech in the target language, potentially preserving speaker characteristics
Streaming (optional)
Simultaneous translation uses wait-k or monotonic attention policies to translate before the speaker finishes, trading quality for latency
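The wait-k schedule mentioned in the streaming step can be sketched in a few lines. This is an illustrative simulation only (the function name and the idea of fixed-size "segments" are assumptions made here, not any particular system's implementation): the model first READs k source segments, then alternates one WRITE per READ, flushing the remaining target tokens once the source is exhausted.

```python
def wait_k_schedule(num_source, num_target, k=3):
    """Toy wait-k policy: emit a READ/WRITE action sequence.

    Reads k source segments up front, then writes one target token
    per additional source segment; once the full source has been
    read, the remaining target tokens are flushed.
    """
    actions = []
    read, written = 0, 0
    while written < num_target:
        # Keep reading while we are fewer than k segments ahead
        # of the target position and source remains.
        if read < min(written + k, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

With five source segments, five target tokens, and k=3, the schedule reads three segments before the first write, then interleaves, which is exactly the quality-for-latency trade described above: a larger k delays output but gives the decoder more source context.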
Current Landscape
Speech translation in 2025 is one of the most tangible demonstrations of AI progress. SeamlessM4T and Whisper have made multilingual speech translation accessible to anyone — a capability that required teams of engineers five years ago now works with a single API call. The field has bifurcated: end-to-end models are preferred for their simplicity and latency, while cascaded systems (ASR → MT → TTS) still win on some quality metrics. Real-time simultaneous translation is production-ready for common language pairs (EN↔ES, EN↔ZH, EN↔FR) but struggles with distant language pairs and fast speakers.
Key Challenges
End-to-end models still slightly lag cascaded ASR+MT pipelines on high-resource pairs, though the gap is closing
Low-resource language pairs have minimal parallel speech translation data, limiting direct end-to-end training
Simultaneous / real-time translation must decide when to start translating before the full source sentence is available
Preserving prosody, emotion, and speaker identity across languages in speech-to-speech translation
Code-switched speech (mixing two languages within an utterance) still breaks current speech translation systems
Quick Recommendations
Best multilingual ST
SeamlessM4T v2 (Meta)
100+ languages, speech-to-text and speech-to-speech; openly released weights with strong quality
Speech-to-English (any source)
Whisper large-v3
Translates 97 languages to English with excellent accuracy; widely deployed
Real-time / simultaneous
SeamlessStreaming or Meta's STST
Low-latency simultaneous translation with configurable quality-latency tradeoff
Conversational (bidirectional)
GPT-4o voice or Google Translate conversation mode
Real-time bidirectional speech translation with natural conversational flow
Open-source toolkit
ESPnet-ST or Fairseq S2T
Full training and inference pipelines for custom language pairs and domains
What's Next
The vision is zero-latency speech-to-speech translation that preserves the speaker's voice, emotion, and style across languages — essentially dubbing yourself in real-time. Multimodal translation that incorporates visual context (lip reading, gestures) will improve accuracy. Expect personal translation devices (earbuds) to reach consumer quality for common language pairs within 1-2 years. The long-term goal is covering all 7,000 human languages.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while models such as Bark, VALL-E (Microsoft), and F5-TTS have demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
Speaker Verification
Verifying speaker identity from voice samples.
Voice Cloning
Replicating a speaker's voice characteristics.
Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.