Speech

Speech Translation

Translating spoken audio directly to another language.


Speech translation converts spoken audio in one language directly to text or speech in another language, bypassing the traditional ASR-then-MT pipeline. SeamlessM4T (Meta) and Whisper handle 100+ language directions, while real-time speech-to-speech systems are approaching the Star Trek universal translator vision. The main challenges are latency, quality for low-resource languages, and preserving speaker characteristics across languages.

History

2016

First end-to-end speech translation proof-of-concepts (Bérard et al., Duong et al.) learn to translate without intermediate ASR transcripts

2017

Weiss et al. demonstrate that end-to-end speech translation can outperform cascaded ASR+MT systems on Spanish-English (Fisher)

2019

MuST-C dataset (Di Gangi et al.) provides 400+ hours of TED-talk speech translation data per language pair, initially covering 8 English-source pairs and later extended to 14

2020

Fairseq S2T and ESPnet establish open-source toolkits for end-to-end speech translation research

2022

Whisper (OpenAI) includes speech translation to English from 97 languages as a built-in capability

2023

SeamlessM4T (Meta) delivers speech-to-speech translation across 100+ languages in a unified model

2023

Google USM (Universal Speech Model) handles ASR, translation, and language identification for 100+ languages, pretrained on audio spanning 300+ languages

2024

SeamlessM4T v2 and SeamlessStreaming enable real-time simultaneous speech translation with low latency

2024

GPT-4o and Gemini 2.0 process speech natively, enabling conversational cross-lingual dialogue in real time

How Speech Translation Works

Speech Translation Pipeline
1

Audio encoding

Source speech is processed by a speech encoder (Wav2Vec 2.0, Conformer, or shared encoder) into hidden representations

2

Language identification

An optional module identifies the source language to route processing, especially for multilingual models

3

Translation

A sequence-to-sequence decoder generates target text tokens conditioned on the encoded speech, optionally using CTC or attention-based alignment

4

Speech synthesis (for S2ST)

For speech-to-speech, a TTS module converts translated text to speech in the target language, potentially preserving speaker characteristics

5

Streaming (optional)

Simultaneous translation uses wait-k or monotonic attention policies to translate before the speaker finishes, trading quality for latency
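The wait-k policy named in the streaming step can be sketched as a simple read/write scheduler. A minimal illustration in Python, where token counts stand in for a real incremental speech encoder and decoder (an assumption of this sketch):

```python
# A minimal sketch of the wait-k policy: read k source tokens up front,
# then alternate one WRITE per READ. Token counts stand in for a real
# incremental speech encoder/decoder (an assumption of this sketch).

def wait_k_schedule(src_len: int, tgt_len: int, k: int) -> list[str]:
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions, read, written = [], 0, 0
    while written < tgt_len:
        if read - written >= k or read == src_len:
            actions.append("WRITE")   # emit one target token
            written += 1
        else:
            actions.append("READ")    # consume one more source token
            read += 1
    return actions

# k=2 on a 5-token sentence: two READs of lookahead, then interleaved output.
print(wait_k_schedule(src_len=5, tgt_len=5, k=2))
# → ['READ', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'READ', 'WRITE', 'WRITE']
```

Larger k delays output but gives the decoder more source context — exactly the quality-latency tradeoff described in the step above.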

Current Landscape

Speech translation in 2025 is one of the most tangible demonstrations of AI progress. SeamlessM4T and Whisper have made multilingual speech translation accessible to anyone — a capability that required teams of engineers five years ago now works with a single API call. The field has bifurcated: end-to-end models are preferred for their simplicity and latency, while cascaded systems (ASR → MT → TTS) still win on some quality metrics. Real-time simultaneous translation is production-ready for common language pairs (EN↔ES, EN↔ZH, EN↔FR) but struggles with distant language pairs and fast speakers.
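The cascaded-versus-end-to-end split comes down to how the stages compose. A minimal structural sketch, in which every function is a hypothetical stub standing in for a real ASR, MT, or TTS model (none of these are any library's actual API):

```python
# Structural sketch of cascaded vs. end-to-end speech-to-speech translation.
# Every function here is a hypothetical stub, not a real model or library call.

def asr(audio: bytes) -> str:
    return "hola mundo"                         # stub: source speech -> source text

def mt(text: str) -> str:
    return {"hola mundo": "hello world"}[text]  # stub: source text -> target text

def tts(text: str) -> bytes:
    return text.encode()                        # stub: target text -> target speech

def cascaded_s2st(audio: bytes) -> bytes:
    """ASR -> MT -> TTS: each stage is trained and debugged separately,
    but transcription errors propagate and per-stage latency adds up."""
    return tts(mt(asr(audio)))

def end_to_end_s2st(audio: bytes) -> bytes:
    """One model maps source speech directly to target speech (the
    SeamlessM4T-style design), with no intermediate transcript to corrupt."""
    return b"hello world"                       # stub for a direct model

print(cascaded_s2st(b"<audio>"))  # → b'hello world'
```

The cascade's modularity is why it still wins on some quality metrics: each stage can use the best available model, at the cost of compounding errors and latency.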

Key Challenges

End-to-end models still slightly lag cascaded ASR+MT pipelines on high-resource pairs, though the gap is closing

Low-resource language pairs have minimal parallel speech translation data, limiting direct end-to-end training

Simultaneous / real-time translation must decide when to start translating before the full source sentence is available

Preserving prosody, emotion, and speaker identity across languages in speech-to-speech translation

Code-switched speech (mixing two languages) breaks current speech translation systems
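The real-time challenge above is typically quantified with Average Lagging (AL), the standard latency metric for simultaneous translation (Ma et al., 2019). A minimal sketch, assuming g[t-1] records how many source tokens had been read when target token t was emitted and that the full source is eventually read:

```python
# Minimal sketch of Average Lagging (AL), the standard latency metric for
# simultaneous translation (Ma et al., 2019). Assumes g[t-1] is the number of
# source tokens read when target token t was emitted, and that the full
# source is eventually consumed.

def average_lagging(g: list[int], src_len: int, tgt_len: int) -> float:
    gamma = tgt_len / src_len                   # target/source length ratio
    # tau: first target position emitted after the whole source was consumed
    tau = next(t for t, gt in enumerate(g, start=1) if gt == src_len)
    # Mean gap between tokens actually read and an ideal wait-0 translator
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

# For wait-k with equal source/target lengths, AL works out to exactly k:
print(average_lagging([2, 3, 4, 5, 5], src_len=5, tgt_len=5))  # → 2.0
```

An AL of 2.0 means the system trails an ideal zero-lag translator by two source tokens on average; lowering AL without losing BLEU is the core streaming research goal.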

Quick Recommendations

Best multilingual ST

SeamlessM4T v2 (Meta)

100+ languages, speech-to-text and speech-to-speech; open-source with strong quality

Speech-to-English (any source)

Whisper large-v3

Translates 97 languages to English with excellent accuracy; widely deployed

Real-time / simultaneous

SeamlessStreaming (Meta)

Low-latency simultaneous translation with configurable quality-latency tradeoff

Conversational (bidirectional)

GPT-4o voice or Google Translate conversation mode

Real-time bidirectional speech translation with natural conversational flow

Open-source toolkit

ESPnet-ST or Fairseq S2T

Full training and inference pipelines for custom language pairs and domains

What's Next

The vision is zero-latency speech-to-speech translation that preserves the speaker's voice, emotion, and style across languages — essentially dubbing yourself in real time. Multimodal translation that incorporates visual context (lip reading, gestures) will improve accuracy. Expect personal translation devices (earbuds) to reach consumer quality for common language pairs within 1-2 years. The long-term goal is covering all 7,000 human languages.

Benchmarks & SOTA

No datasets indexed for this task yet.


Related Tasks

Text-to-Speech

Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark, VALL-E (Microsoft), and F5-TTS have shown that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.

Speaker Verification

Verifying speaker identity from voice samples.

Voice Cloning

Replicating a speaker's voice characteristics.

Speech Recognition

Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.

Something wrong or missing?

Help keep Speech Translation benchmarks accurate. Report outdated results, missing benchmarks, or errors.
