Sound in, symbols out. Recognition, synthesis, speaker verification and the latency numbers that decide whether any of it feels human.
Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper dominates ASR with 680K hours of training data. Diffusion models revolutionized TTS with sub-200ms latency. Systems are now multilingual, accent-robust, and capable of real-time synthesis.
Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.
Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.
| # | Task | Benchmark | Leading model | Score |
|---|---|---|---|---|
| 01 | Speech Translation | MuST-C English-German tst-COMMON | SeamlessM4T v2 Large | 37.1 BLEU |
| 02 | Speech Recognition | Mozilla Common Voice | Whisper Large-v2 | 11.2% WER |
| 03 | Voice Cloning | LibriTTS test-clean zero-shot TTS evaluation | VALL-E | 5.9% WER |
| 04 | Text-to-Speech | The LJ Speech Dataset | VALL-E 2 | 4.610 MOS |
| 05 | Speaker Verification | VoxCeleb1 Original Test Set (VoxCeleb1-O) | ResNet-34 (AM-Softmax, VoxCeleb2) | 1.180% EER |
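Two of the five headline scores above are WER, which is nothing more exotic than word-level edit distance divided by reference length. A minimal sketch of the computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = min edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution / match
                d[i - 1][j] + 1,                               # deletion
                d[i][j - 1] + 1,                               # insertion
            )
    return d[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1667 -> 16.7% WER
```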
1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.
Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.
LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.
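One common pattern behind this: hand the acoustic model's n-best hypotheses plus known domain vocabulary to an LLM and let it pick or repair the transcript. A hypothetical sketch; `llm_complete` stands in for any chat-completion call and is not a real API:

```python
def llm_complete(prompt: str) -> str:
    """Placeholder for any chat-completion call; not a real API."""
    raise NotImplementedError

def rescore_transcript(nbest: list[str], domain_terms: list[str]) -> str:
    # World knowledge + domain vocabulary resolve what acoustics alone cannot.
    prompt = (
        "Choose the most plausible transcript, correcting obvious mishearings.\n"
        f"Known domain terms: {', '.join(domain_terms)}\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    )
    return llm_complete(prompt)

# An acoustically ambiguous pair a plain ASR decoder often gets wrong:
best = rescore_transcript(
    nbest=["the patient has a fib", "the patient has afib"],
    domain_terms=["AFib", "tachycardia"],
)
```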
43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios vs Whisper's general multilingual coverage.
3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.
Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.
6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.
0.5B params, runs on a Raspberry Pi. Near-human quality without cloud dependency. Kills latency and privacy concerns.
0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
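EER, the speaker-verification metric in the table above, is the operating point where the false-accept rate equals the false-reject rate. A minimal sketch over trial scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER: error rate at the threshold where false accepts == false rejects."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # closest point to FAR == FRR
    return (fpr[idx] + fnr[idx]) / 2

# labels: 1 = same-speaker trial, 0 = impostor; scores: embedding similarity
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.91, 0.78, 0.42, 0.55, 0.30, 0.12])
print(f"EER: {equal_error_rate(labels, scores):.2%}")
```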
Mixture of accent-specific LoRA experts improves performance on unknown accents vs full fine-tuning. Parameter-efficient, reduces catastrophic forgetting.
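A minimal PyTorch sketch of the idea, not any specific paper's implementation: a frozen linear layer plus K accent-specific low-rank adapters, mixed by a learned gate.

```python
import torch
import torch.nn as nn

class AccentLoRAMoE(nn.Module):
    """Frozen backbone layer + K accent-specific LoRA experts, softly gated."""
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # backbone stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))  # zero init: no update at start
        self.gate = nn.Linear(d_in, num_experts)        # routes by input features

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)    # (batch, K) soft mixture
        delta = torch.einsum("bi,kir,kro->bko", x, self.A, self.B)
        update = torch.einsum("bk,bko->bo", weights, delta)
        return self.base(x) + update                     # only adapters + gate train

layer = AccentLoRAMoE(nn.Linear(256, 256))
out = layer(torch.randn(2, 256))                         # (2, 256)
```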
Open-source eliminates per-request API costs. Accept infrastructure management responsibility for 10-100x cost savings at scale.
Dialogue-focused with laughter, sighing, and other nonverbal elements. Streaming architecture. 2 minutes of continuous English per output.
Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.
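A toy way to see the effect with the open-source `whisper` package (the file name is a placeholder): decode the whole file once, then decode independent 5-second windows, which throw away cross-boundary context.

```python
# pip install openai-whisper
import whisper

model = whisper.load_model("base")
sr = whisper.audio.SAMPLE_RATE              # 16 kHz
audio = whisper.load_audio("meeting.wav")   # placeholder file

# Batch: one pass, full left context for the decoder.
batch_text = model.transcribe(audio)["text"]

# Naive "streaming": independent 5 s windows. Words split across boundaries
# and lost textual context are one source of the streaming WER penalty.
chunk = 5 * sr
stream_text = " ".join(
    model.transcribe(audio[start : start + chunk])["text"].strip()
    for start in range(0, len(audio), chunk)
)
```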
Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect WER to double.
Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.
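Back-of-envelope turn latency for a voice agent; every figure below is an illustrative assumption, not a measurement:

```python
# All figures are assumed for illustration.
asr_partial_ms = 300       # streaming ASR emits a stable partial transcript
llm_ttft_ms = 450          # LLM time-to-first-token
tts_first_audio_ms = 180   # Aura-class TTS first audio chunk (< 200 ms)

turn_latency = asr_partial_ms + llm_ttft_ms + tts_first_audio_ms
print(f"user stops speaking -> first audio heard: {turn_latency} ms")  # 930 ms
# With synthesis under 200 ms, the LLM's first token dominates the budget.
```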
XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.
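With Coqui's `TTS` package, the whole zero-shot cloning pipeline is a few lines; the file names and sample text here are placeholders:

```python
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Welcome back. Here is your daily summary.",  # placeholder text
    speaker_wav="reference_6s.wav",                    # one short reference clip
    language="en",
    file_path="cloned_output.wav",
)
```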
Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.
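A sketch of the one-model-many-tasks pattern via the Hugging Face `transformers` SeamlessM4T v2 checkpoint; the same model also accepts audio inputs for ASR and speech translation:

```python
# pip install transformers sentencepiece
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

inputs = processor(text="Good morning.", src_lang="eng", return_tensors="pt")

# Same checkpoint, two outputs: German speech...
waveform = model.generate(**inputs, tgt_lang="deu")[0]

# ...or German text, by switching off speech generation.
tokens = model.generate(**inputs, tgt_lang="deu", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```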
The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
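As a mental model only (the actual schema is not published on this page), each registry row carries roughly this shape:

```python
# Assumed shape of a registry row, for illustration; not the real schema.
from dataclasses import dataclass
from datetime import date
from typing import Literal, Optional

@dataclass
class BenchmarkResult:
    task: str                      # e.g. "Speech Recognition"
    dataset: str                   # exactly one canonical dataset per task
    model: str
    metric: str                    # e.g. "WER"
    direction: Literal["higher_better", "lower_better"]
    score: float
    recorded: date
    reproduced: Optional[bool] = None  # reproduction status, where known
```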
Sibling area hubs, the unified task index and the methodology that binds them.