Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

5 tasks · 4 datasets · 0 results

Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper dominates ASR with 680K hours of training data. Diffusion models revolutionized TTS quality, and the fastest systems now synthesize with sub-200ms latency. The field has converged on multilingual, accent-robust recognition and real-time synthesis.

State of the Field (Dec 2024)

  • ASR: Whisper-Large (1.5B params, 680K hours) achieves 1.9-3.9% WER on clean speech. AssemblyAI Conformer-1 (650K hours) cuts noisy-speech errors by 43%. Gemini leads on accented speech via LLM integration.
  • TTS: Higgs Audio V2 (3B params, 10M hours) tops expressiveness. Deepgram Aura delivers sub-200ms latency. XTTS enables voice cloning from 6-second samples. NeuTTS Air runs on-device with 0.5B params.
  • Speaker Verification: w2v-BERT 2.0 (600M params, 450M hours across 143 languages) achieves 0.12% EER on VoxCeleb1-O. The SVeritas benchmark reveals cross-language and age-mismatch vulnerabilities.
  • Architectures: Conformer dominates ASR with progressive downsampling and grouped attention (29% faster inference). Diffusion models power TTS. Self-supervised pre-training (wav2vec, WavLM) enables low-resource deployment.

Quick Recommendations

Production ASR (batch, high accuracy)

Whisper-Large or AssemblyAI Conformer-1

1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.
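
A minimal batch-transcription sketch using the open-source openai-whisper package; the model size, file path, and language hint are placeholders to adjust.

```python
# pip install openai-whisper   (requires ffmpeg on the PATH)
import whisper

# "large-v3" maximizes accuracy; "medium" or "small" trade accuracy for speed.
model = whisper.load_model("large-v3")

# transcribe() handles decoding, resampling to 16 kHz, and chunking internally.
result = model.transcribe("meeting_recording.mp3", language="en")

print(result["text"])             # full transcript
for seg in result["segments"]:    # per-segment timestamps, useful for captions
    print(f'{seg["start"]:7.2f}s  {seg["text"]}')
```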

Real-time ASR (streaming, low latency)

AWS Transcribe or AssemblyAI Streaming

Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.
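
For streaming, the managed SDKs wrap a websocket session. Here is a sketch against AssemblyAI's real-time API with their Python SDK; the class and helper names (RealtimeTranscriber, aai.extras.MicrophoneStream) match one SDK version and should be checked against current docs, and the API key is a placeholder.

```python
# pip install "assemblyai[extras]"
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

def on_data(transcript: aai.RealtimeTranscript):
    # Partial hypotheses stream in continuously; print whatever text is available.
    if transcript.text:
        print(transcript.text)

transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=lambda err: print("error:", err),
)
transcriber.connect()

# Stream raw microphone audio until interrupted, then close the session.
mic = aai.extras.MicrophoneStream(sample_rate=16_000)
transcriber.stream(mic)
transcriber.close()
```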

Accented/technical speech ASR

Google Gemini (multimodal)

LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.
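
A sketch of transcription-by-prompting with the google-generativeai SDK; the model name, prompt, and file path are illustrative, and the upload-then-generate pattern should be verified against the current SDK docs.

```python
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder

# Upload the audio, then ask the multimodal model to transcribe it.
audio_file = genai.upload_file("support_call.wav")
model = genai.GenerativeModel("gemini-1.5-pro")  # model name is illustrative

response = model.generate_content([
    "Transcribe this call verbatim. Keep domain-specific terms exactly as spoken.",
    audio_file,
])
print(response.text)
```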

Multilingual/code-switched ASR

SeamlessM4T-v2-Large

43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios, unlike Whisper's general-purpose multilingual support.
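
A speech-to-text sketch via the Hugging Face transformers port of SeamlessM4T-v2; the class names follow that integration, while the audio path and target language are placeholders.

```python
# pip install transformers torchaudio sentencepiece
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

# Load the clip and resample to the 16 kHz the model expects.
waveform, sr = torchaudio.load("code_switched_clip.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(audios=waveform.squeeze().numpy(),
                   sampling_rate=16_000, return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="eng")  # transcribe into English
print(processor.decode(tokens[0].tolist(), skip_special_tokens=True))
```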

High-quality TTS (audiobooks, media)

Higgs Audio V2

3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.

Low-latency TTS (chatbots, IVR)

Deepgram Aura

Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.

Voice cloning (minimal reference data)

XTTS-v2

6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.
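
A zero-shot cloning sketch with the Coqui TTS package's XTTS-v2 checkpoint; the reference clip, text, and output path are placeholders.

```python
# pip install TTS
from TTS.api import TTS

# Downloads and loads the multilingual XTTS-v2 checkpoint on first use.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A clean ~6-second reference clip is enough for zero-shot cloning.
tts.tts_to_file(
    text="Thanks for calling. How can I help you today?",
    speaker_wav="reference_6s.wav",
    language="en",
    file_path="cloned_output.wav",
)
```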

On-device TTS (mobile, IoT, privacy)

NeuTTS Air

0.5B params runs on Raspberry Pi. Near-human quality without cloud dependency. Kills latency and privacy concerns.

Speaker verification (security-critical)

w2v-BERT 2.0 based systems

0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
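
EER is the threshold-independent point where false-accept and false-reject rates are equal; a minimal sketch computing it from verification trial scores with scikit-learn and NumPy (the labels and similarity scores below are illustrative).

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 = same speaker, 0 = different; scores: higher = more similar."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FPR and FNR cross
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])                     # illustrative trials
scores = np.array([0.92, 0.81, 0.33, 0.45, 0.77, 0.20])   # cosine similarities
print(f"EER: {equal_error_rate(labels, scores):.2%}")
```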

Accent-robust ASR (non-native speakers)

Whisper + MAS-LoRA fine-tuning

A mixture of accent-specific LoRA experts improves performance on unknown accents compared with full fine-tuning. Parameter-efficient and reduces catastrophic forgetting.
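
The full MAS-LoRA mixture adds an accent router, but each expert is an ordinary LoRA adapter. A single-adapter sketch on Whisper with the peft library; the target module names assume Whisper's attention projections, and the rank/alpha values are illustrative.

```python
# pip install transformers peft
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Low-rank adapters on the attention projections; the base model stays frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumes Whisper's projection names
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base model

# Train one such adapter per accent, then route between (or average) the
# experts at inference time; that combination is the MAS-LoRA idea.
```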

Cost-optimized ASR (high volume)

Self-hosted Whisper on containers

Open-source eliminates per-request API costs. Accept infrastructure management responsibility for 10-100x cost savings at scale.
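
For self-hosting, the faster-whisper reimplementation (CTranslate2 backend) is a common way to cut per-request GPU cost; a sketch, with the model size and quantization settings as tunable assumptions.

```python
# pip install faster-whisper
from faster_whisper import WhisperModel

# int8/float16 quantization trades a little accuracy for throughput and memory.
model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")

segments, info = model.transcribe("batch_item_001.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:  # segments is a generator; decoding happens lazily
    print(f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text}")
```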

Multi-speaker dialogue TTS

Dia (1B-2B variants)

Dialogue-focused, with laughter, sighing, and other nonverbal elements. Streaming architecture. Up to 2 minutes of continuous English per output.

Tasks & Benchmarks


Speaker Verification

No datasets indexed yet. Contribute on GitHub

Speech Recognition

Common Voice (Mozilla Common Voice, 2019)

Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents.

LibriSpeech (LibriSpeech ASR Corpus, 2015)

1000 hours of English speech from audiobooks. Standard benchmark for automatic speech recognition.
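
LibriSpeech results are reported as word error rate (WER); a minimal sketch of the computation with the jiwer package, using illustrative reference and hypothesis strings.

```python
# pip install jiwer
import jiwer

reference = "he began a confused complaint against the wizard"
hypothesis = "he began a confused complaint against the wizards"

# WER = (substitutions + deletions + insertions) / reference word count
error = jiwer.wer(reference, hypothesis)
print(f"WER: {error:.1%}")  # one substitution out of eight words -> 12.5%
```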

Speech Translation

No datasets indexed yet. Contribute on GitHub

Text-to-Speech

LJ Speech (The LJ Speech Dataset, 2017)

13,100 short audio clips of a single speaker reading passages from non-fiction books. Standard benchmark for single-speaker TTS.

VCTK (CSTR VCTK Corpus, 2019)

Speech data from 110 English speakers with various accents. Used for multi-speaker TTS.

Voice Cloning

No datasets indexed yet. Contribute on GitHub

Honest Takes

Whisper is overhyped for production

Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.

Accent robustness remains embarrassing

Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect your WER to roughly double.

TTS latency wars are won

Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.

Zero-shot voice cloning is production-ready

XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.

Foundation models killed task-specific speech systems

Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.
