Codesota · Registry · Speech
The area-level register · Issue: April 22, 2026
Area hub · Speech

Speech,
sounded out.

Sound in, symbols out. Recognition, synthesis, speaker verification and the latency numbers that decide whether any of it feels human.

Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper dominates ASR, trained on 680K hours. Diffusion models revolutionized TTS quality, and synthesis latency has dropped below 200ms. Systems now handle multilingual, accent-robust recognition and real-time synthesis.

§ 01 · Top tasks

Sub-tasks in speech.

Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.

Fig 01 · Showing top 5 of 5 tasks under Speech.

§ 02 · Top benchmarks

Current state of the art.

Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.

#    Task                  Benchmark                                      Leading model                        Score
01   Speech Translation    MuST-C English-German tst-COMMON               SeamlessM4T v2 Large                 37.1 BLEU
02   Speech Recognition    Mozilla Common Voice                           Whisper Large-v2                     11.2% WER
03   Voice Cloning         LibriTTS test-clean zero-shot TTS evaluation   VALL-E                               5.9% WER
04   Text-to-Speech        The LJ Speech Dataset                          VALL-E 2                             4.610 MOS
05   Speaker Verification  VoxCeleb1 Original Test Set (VoxCeleb1-O)      ResNet-34 (AM-Softmax, VoxCeleb2)    1.180% EER
Fig 02 · Headline benchmarks for Speech. Full leaderboards, dated history and reproduction status live on the task pages.

Side note

State of the Field (2025)

  • 01 · ASR: Whisper-Large (1.5B params, 680K hours) achieves 1.9-3.9% WER on clean speech. AssemblyAI Conformer-1 (650K hours) cuts noisy-speech errors by 43%. Gemini leads on accented speech via LLM integration.
  • 02 · TTS: Higgs Audio V2 (3B params, 10M hours) tops expressiveness. Deepgram Aura delivers sub-200ms latency. XTTS enables voice cloning from 6-second samples. NeuTTS Air runs on-device with 0.5B params.
  • 03 · Speaker Verification: w2v-BERT 2.0 (600M params, 450M hours across 143 languages) achieves 0.12% EER on VoxCeleb1-O. The SVeritas benchmark reveals cross-language and age-mismatch vulnerabilities.
  • 04 · Architectures: Conformer dominates ASR with progressive downsampling and grouped attention (29% faster inference). Diffusion models power TTS. Self-supervised pre-training (wav2vec, WavLM) enables low-resource deployment; see the feature-extraction sketch after this list.
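
To make the self-supervised pre-training point concrete, here is a minimal sketch of pulling frame-level wav2vec 2.0 features through Hugging Face transformers. The checkpoint name and the dummy waveform are illustrative, not drawn from the register.

```python
# Frame-level features from a self-supervised speech encoder (wav2vec 2.0).
# Requires `pip install transformers torch`; the checkpoint is illustrative.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.zeros(16_000).numpy()  # placeholder: 1 s of 16 kHz silence
inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, frames, 768)
```

These frozen features are what low-resource ASR and verification systems fine-tune a small head on top of.
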
Picks by use-case

What to reach for.

Editorial picks · not vendor rankings
Production ASR (batch, high accuracy)
Whisper-Large or AssemblyAI Conformer-1

1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.
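
A minimal batch-transcription sketch with the open-source openai-whisper package; the audio path is illustrative, and ffmpeg must be on PATH.

```python
# Batch ASR with open-source Whisper (`pip install openai-whisper`).
# "large-v2" is the 1.5B-parameter checkpoint cited above.
import whisper

model = whisper.load_model("large-v2")
result = model.transcribe("call_recording.wav")  # illustrative path

print(result["text"])              # full transcript
for seg in result["segments"]:     # segment-level timestamps
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}")
```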

Real-time ASR (streaming, low latency)
AWS Transcribe or AssemblyAI Streaming

Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.

Accented/technical speech ASR
Google Gemini (multimodal)

LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.
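
A sketch of the LLM-as-transcriber pattern via the google-generativeai SDK. The model name and prompt are assumptions; check the current docs for audio-capable model identifiers.

```python
# Transcription through a multimodal LLM (Gemini). Assumes
# `pip install google-generativeai`; the model name is illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # or read from the environment
model = genai.GenerativeModel("gemini-1.5-flash")

audio = genai.upload_file("accented_speech.wav")  # illustrative path
response = model.generate_content(
    [audio, "Transcribe this audio verbatim, preserving technical terms."]
)
print(response.text)
```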

Multilingual/code-switched ASR
SeamlessM4T-v2-Large

43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios, versus Whisper's general-purpose multilingual coverage.
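
A sketch of speech-to-text with SeamlessM4T v2 through transformers, following the model card's pattern; treat the exact arguments as assumptions to verify.

```python
# Multilingual speech-to-text with SeamlessM4T v2 (transformers >= 4.37).
# The waveform is a placeholder; load real 16 kHz audio with soundfile.
import torch
from transformers import AutoProcessor, SeamlessM4Tv2Model

processor = AutoProcessor.from_pretrained("facebook/seamless-m4t-v2-large")
model = SeamlessM4Tv2Model.from_pretrained("facebook/seamless-m4t-v2-large")

waveform = torch.zeros(16_000).numpy()  # placeholder: 1 s of 16 kHz silence
inputs = processor(audios=waveform, sampling_rate=16_000, return_tensors="pt")

# generate_speech=False requests text tokens (ASR/translation), not audio.
tokens = model.generate(**inputs, tgt_lang="eng", generate_speech=False)
print(processor.decode(tokens[0].tolist()[0], skip_special_tokens=True))
```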

High-quality TTS (audiobooks, media)
Higgs Audio V2

3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.

Low-latency TTS (chatbots, IVR)
Deepgram Aura

Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.
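
A sketch of calling a hosted low-latency TTS endpoint over REST, with Deepgram's speak API as the example. The endpoint path and voice name are from memory of Deepgram's docs; verify both before relying on them.

```python
# Hosted low-latency TTS over REST (Deepgram Aura). Endpoint, query
# parameter and voice name are assumptions to check against current docs.
import os
import requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak",
    params={"model": "aura-asteria-en"},  # illustrative Aura voice
    headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
    json={"text": "Your order has shipped and arrives Tuesday."},
    timeout=30,
)
resp.raise_for_status()
with open("reply.mp3", "wb") as f:
    f.write(resp.content)  # response body is the synthesized audio
```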

Voice cloning (minimal reference data)
XTTS-v2

6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.
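
A minimal zero-shot cloning sketch with the Coqui TTS package; file paths are illustrative.

```python
# Zero-shot voice cloning with Coqui XTTS-v2 (`pip install TTS`).
# reference.wav is a ~6-second clip of the target speaker.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Thanks for calling. How can I help you today?",
    speaker_wav="reference.wav",  # short reference sample of the voice
    language="en",
    file_path="cloned_output.wav",
)
```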

On-device TTS (mobile, IoT, privacy)
NeuTTS Air

0.5B params runs on a Raspberry Pi. Near-human quality without cloud dependency. Removes the latency and privacy concerns of cloud TTS.

Speaker verification (security-critical)
w2v-BERT 2.0 based systems

0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
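
Whatever the embedding backend, verification quality reduces to an equal error rate over genuine and impostor trial scores. A minimal EER computation, with made-up scores, looks like this:

```python
# Equal error rate (EER) from verification trials. Labels: 1 = genuine
# (same speaker), 0 = impostor. Scores and labels here are made up.
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.91, 0.84, 0.62, 0.40, 0.13, 0.55, 0.77, 0.30])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where FAR meets FRR
eer = (fpr[idx] + fnr[idx]) / 2
print(f"EER: {eer:.2%}")
```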

Accent-robust ASR (non-native speakers)
Whisper + MAS-LoRA fine-tuning

A mixture of accent-specific LoRA experts improves performance on unknown accents versus full fine-tuning. Parameter-efficient, and reduces catastrophic forgetting.
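
The mixture routing in MAS-LoRA is not a library one-liner, but the per-accent adapter half is. A hedged sketch of attaching one LoRA adapter to Whisper with peft; rank and target modules are illustrative.

```python
# Parameter-efficient accent adaptation: a LoRA adapter on Whisper's
# attention projections via peft. One adapter per accent is the building
# block of a MAS-LoRA-style mixture; r and targets are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a fraction of a percent of 1.5B params
```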

Cost-optimized ASR (high volume)
Self-hosted Whisper on containers

Open-source eliminates per-request API costs. Accept the infrastructure-management burden in exchange for 10-100x cost savings at scale.
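
For self-hosting, the CTranslate2-based faster-whisper port is the usual choice over the reference implementation. A minimal sketch; device, precision and path are illustrative.

```python
# Self-hosted, throughput-oriented Whisper (`pip install faster-whisper`).
# float16/int8 quantization cuts memory and cost per request.
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
segments, info = model.transcribe("batch_audio.wav", beam_size=5)

print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:  # a generator: decoding runs lazily as you iterate
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```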

Multi-speaker dialogue TTS
Dia (1B-2B variants)

Dialogue-focused, with laughter, sighs and other nonverbal elements. Streaming architecture. Up to two minutes of continuous English per output.

Editor's note

Honest takes.

Whisper is overhyped for production

Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.

Accent robustness remains embarrassing

Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect your WER to double.

TTS latency wars are won

Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.

Zero-shot voice cloning is production-ready

XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.

Foundation models killed task-specific speech systems

Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.

§ 03 · Method

How this area is tracked

Every row in this register is dated and sourced.

The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.

When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
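
As a sketch of the shape each row takes, with field names that are hypothetical rather than the actual registry schema:

```python
# Hypothetical sketch of a registry row; field names are illustrative,
# not the real Codesota schema.
from dataclasses import dataclass
from datetime import date
from typing import Literal, Optional

@dataclass(frozen=True)
class BenchmarkResult:
    task: str                              # e.g. "Speech Recognition"
    benchmark: str                         # the task's one canonical dataset
    model: str
    score: float
    metric: str                            # e.g. "wer", "mos"
    direction: Literal["higher", "lower"]  # which way is better
    recorded: date
    reproduction: Optional[str] = None     # e.g. "verified", "contested"
```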

Full methodology · The unified task index
§ Final · Related

Neighbouring registers.

Sibling area hubs, the unified task index and the methodology that binds them.