Speech
Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.
Speech tech in 2025 is defined by massive foundation models trained on 500K+ hours of audio. Whisper dominates ASR with 680K hours of training data. Diffusion models revolutionized TTS, and synthesis latency has dropped below 200ms. Systems now handle multilingual, accent-robust, real-time synthesis.
State of the Field (Dec 2024)
- ASR: Whisper-Large (1.5B params, 680K hours) achieves 1.9-3.9% WER on clean speech. AssemblyAI Conformer-1 (650K hours) cuts noisy speech errors 43%. Gemini leads on accented speech via LLM integration.
- TTS: Higgs Audio V2 (3B params, 10M hours) tops expressiveness. Deepgram Aura delivers sub-200ms latency. XTTS enables voice cloning from 6-second samples. NeuTTS Air runs on-device with 0.5B params.
- Speaker Verification: w2v-BERT 2.0 (600M params, 450M hours across 143 languages) achieves 0.12% EER on VoxCeleb1-O. SVeritas benchmark reveals cross-language and age-mismatch vulnerabilities.
- Architectures: Conformer dominates ASR with progressive downsampling and grouped attention (29% faster inference). Diffusion models power TTS. Self-supervised pre-training (wav2vec, WavLM) enables low-resource deployment.
Quick Recommendations
Production ASR (batch, high accuracy)
Whisper-Large or AssemblyAI Conformer-1
1.9-3.9% WER on clean speech. Whisper is open-source with broad support. Conformer-1 offers enterprise reliability and business-domain optimization.
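If you go the open-source route, batch transcription with the openai-whisper package is only a few lines. A minimal sketch; the model name and file path are placeholders:

```python
# Minimal batch transcription with the open-source openai-whisper package.
# pip install -U openai-whisper  (requires ffmpeg on the system)
import whisper

model = whisper.load_model("large-v3")      # downloads weights on first use
result = model.transcribe("meeting.wav")    # language auto-detected by default
print(result["text"])

# Per-segment timestamps are also returned:
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s -> {seg['end']:.1f}s] {seg['text']}")
```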
Real-time ASR (streaming, low latency)
AWS Transcribe or AssemblyAI Streaming
Best latency-accuracy tradeoff. Whisper's 6-7% WER penalty on streaming makes it unusable for conversational AI. Managed APIs handle scaling.
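For AWS Transcribe, streaming looks roughly like the sketch below, using the amazon-transcribe SDK and assuming 16 kHz, 16-bit mono PCM input; region, chunk size, and file names are illustrative:

```python
# Streaming transcription sketch using the amazon-transcribe SDK
# (pip install amazon-transcribe). Assumes 16 kHz, 16-bit mono PCM input.
import asyncio
from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent


class PrintHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, event: TranscriptEvent) -> None:
        # Partial hypotheses stream in continuously; final results have is_partial=False.
        for result in event.transcript.results:
            for alt in result.alternatives:
                print(("partial: " if result.is_partial else "final:   ") + alt.transcript)


async def main() -> None:
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
    )

    async def send_audio() -> None:
        # In production this would read from a microphone or call audio buffer.
        with open("call.pcm", "rb") as f:
            while chunk := f.read(1024 * 8):
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = PrintHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


asyncio.run(main())
```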
Accented/technical speech ASR
Google Gemini (multimodal)
LLM integration crushes traditional ASR on accents and domain-specific terminology. World knowledge compensates for acoustic ambiguity.
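A rough sketch with the google-generativeai SDK; the model name, prompt, and file path are assumptions, not a prescribed setup:

```python
# Sketch of LLM-based transcription with the google-generativeai SDK
# (pip install google-generativeai). Model name and prompt are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

audio = genai.upload_file("accented_speech.wav")
response = model.generate_content([
    "Transcribe this recording verbatim. Preserve domain terminology exactly.",
    audio,
])
print(response.text)
```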
Multilingual/code-switched ASR
SeamlessM4T-v2-Large
43.6% improvement on code-switched speech. Handles 143 languages. Purpose-built for mixed-language scenarios, unlike Whisper's general-purpose multilingual training.
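A hedged sketch of speech-to-text with SeamlessM4T-v2 through Hugging Face transformers; paths, language codes, and preprocessing choices are illustrative:

```python
# Speech-to-text with SeamlessM4T-v2 via Hugging Face transformers
# (pip install transformers torchaudio). Paths and language codes are illustrative.
import torchaudio
from transformers import AutoProcessor, SeamlessM4Tv2ForSpeechToText

model_id = "facebook/seamless-m4t-v2-large"
processor = AutoProcessor.from_pretrained(model_id)
model = SeamlessM4Tv2ForSpeechToText.from_pretrained(model_id)

# The model expects 16 kHz mono audio.
waveform, sr = torchaudio.load("code_switched.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16000)

inputs = processor(audios=waveform.numpy(), sampling_rate=16000, return_tensors="pt")
tokens = model.generate(**inputs, tgt_lang="eng")
print(processor.decode(tokens[0].tolist(), skip_special_tokens=True))
```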
High-quality TTS (audiobooks, media)
Higgs Audio V2
3B params, 10M hours training. Best expressiveness and emotional modulation. Top-trending on Hugging Face for a reason.
Low-latency TTS (chatbots, IVR)
Deepgram Aura
Sub-200ms latency enables natural conversational flow. Includes speech fillers and emotional modulation. Purpose-built for real-time.
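A minimal synthesis request sketch; the REST endpoint, query parameter, and voice/model name are assumptions to verify against Deepgram's current documentation:

```python
# Sketch of a text-to-speech request against Deepgram's Aura REST endpoint.
# Endpoint, query parameters, and model name are assumptions; check Deepgram's docs.
import requests

resp = requests.post(
    "https://api.deepgram.com/v1/speak?model=aura-asteria-en",
    headers={
        "Authorization": "Token YOUR_DEEPGRAM_API_KEY",
        "Content-Type": "application/json",
    },
    json={"text": "Thanks for calling. How can I help you today?"},
    timeout=30,
)
resp.raise_for_status()
with open("reply.mp3", "wb") as f:
    f.write(resp.content)  # response body is the synthesized audio
```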
Voice cloning (minimal reference data)
XTTS-v2
6-second samples for full voice replication. Widely adopted, extensive integrations, robust across diverse speakers. Zero-shot works.
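A minimal cloning sketch with the Coqui TTS package; the reference clip and output paths are placeholders:

```python
# Zero-shot voice cloning with XTTS-v2 via the Coqui TTS package (pip install TTS).
# Roughly 6 seconds of clean reference speech is enough for cloning.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text="Welcome back! Here's your daily summary.",
    speaker_wav="reference_6s.wav",   # short sample of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```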
On-device TTS (mobile, IoT, privacy)
NeuTTS Air
0.5B params runs on Raspberry Pi. Near-human quality without cloud dependency. Kills latency and privacy concerns.
Speaker verification (security-critical)
w2v-BERT 2.0 based systems
0.12% EER on VoxCeleb. 450M hours training across 143 languages. Evaluate on SVeritas benchmark for real-world robustness.
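Production verification systems add a trained speaker-embedding head and calibrated thresholds on top of the encoder; as an illustration of the plumbing only, this sketch mean-pools w2v-BERT 2.0 embeddings and scores two clips by cosine similarity:

```python
# Illustrative sketch: score two utterances with mean-pooled w2v-BERT 2.0 embeddings.
# Real verification systems train a speaker-embedding head and calibrate thresholds;
# this only shows the encoder plumbing. pip install transformers torchaudio
import torch
import torchaudio
from transformers import AutoFeatureExtractor, Wav2Vec2BertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/w2v-bert-2.0")
encoder = Wav2Vec2BertModel.from_pretrained("facebook/w2v-bert-2.0")


def embed(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)  # 16 kHz mono
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)              # mean-pool over time


score = torch.nn.functional.cosine_similarity(embed("enroll.wav"), embed("probe.wav"), dim=0)
print(f"cosine similarity: {score:.3f}")  # compare to a threshold tuned on held-out data
```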
Accent-robust ASR (non-native speakers)
Whisper + MAS-LoRA fine-tuning
Mixture of accent-specific LoRA experts improves unknown accents vs full fine-tuning. Parameter-efficient, reduces catastrophic forgetting.
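MAS-LoRA routes between accent-specific adapters; as a simpler starting point, a plain single-adapter LoRA setup on Whisper with peft looks like the sketch below (rank and target modules are illustrative, not the paper's configuration):

```python
# Simplified single-adapter LoRA setup on Whisper with peft
# (pip install transformers peft). Rank and target modules are illustrative;
# MAS-LoRA itself trains a mixture of accent-specific adapters.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Whisper
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# From here, fine-tune on accent-specific data with Seq2SeqTrainer or a custom loop.
```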
Cost-optimized ASR (high volume)
Self-hosted Whisper on containers
Open-source eliminates per-request API costs. Accept infrastructure management responsibility for 10-100x cost savings at scale.
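One common self-hosting pattern (an assumption, not prescribed here) is faster-whisper inside a GPU container:

```python
# Self-hosted transcription sketch with faster-whisper (pip install faster-whisper),
# a CTranslate2 reimplementation often used for high-volume deployments.
# Device and compute_type depend on the container's hardware.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("batch_item_0001.wav", beam_size=5)
print(f"detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```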
Multi-speaker dialogue TTS
Dia (1B-2B variants)
Dialogue-focused with laughter, sighing, and other nonverbal elements. Streaming architecture. Up to 2 minutes of continuous English per output.
Tasks & Benchmarks
Speaker Verification
Verifying speaker identity from voice samples.
Speech Recognition
Converting spoken audio to text (LibriSpeech, Common Voice).
Speech Translation
Translating spoken audio directly to another language.
Text-to-Speech
Generating natural-sounding speech from text.
Voice Cloning
Replicating a speaker's voice characteristics.
Datasets
Speech Recognition
- Common Voice: Massive multilingual dataset of transcribed speech. Covers diverse demographics and accents.
- LibriSpeech: 1000 hours of English speech from audiobooks. Standard benchmark for automatic speech recognition.
Honest Takes
Whisper is overhyped for production
Whisper excels on benchmarks but struggles with streaming. 6-7% WER increase vs batch processing makes real-time painful. For conversational AI or live captioning, AWS Transcribe or AssemblyAI streaming APIs deliver better latency-accuracy tradeoffs despite Whisper's fame.
Accent robustness remains embarrassing
Google's legacy ASR hits 35% WER on non-native speech while Gemini achieves 10-15%. After billions in R&D, the field still can't reliably transcribe half the world's English speakers. If your users aren't native speakers, expect to double WER.
TTS latency wars are won
Deepgram Aura's sub-200ms latency kills the 'robotic delay' problem for conversational AI. Combined with streaming synthesis (Dia, MELA-TTS), we finally have TTS that feels human-speed. The bottleneck shifted from synthesis to LLM response time.
Zero-shot voice cloning is production-ready
XTTS cloning voices from 6-second samples isn't a research demo anymore. It's deployed at scale. The ethical nightmare is here, but so is massive UX improvement for multilingual content, accessibility, and personalized experiences.
Foundation models killed task-specific speech systems
Why train separate ASR, speaker verification, and emotion recognition models? w2v-BERT 2.0 (450M hours, 143 languages) handles all tasks. SeamlessM4T does ASR, translation, and TTS in one model. Specialist systems are legacy tech.