Two pillars share this register. Speech-to-text now clears human-level accuracy on clean audio; text-to-speech clears the blind-test bar for naturalness. We keep both on the same page because the pipeline almost always needs both.
18 STT and 18 TTS models tracked, sourced from the shared model catalogue. The top-ranked row in each table marks the current state of the art. Numbers are shown only where reported; every model links to its paper or code where available.
LibriSpeech test-clean remains the canonical benchmark. Lower is better; a minimal WER computation follows the table. Human-annotator WER on this split sits in the 2–4% band, which several models now clear.
| # | Model | Vendor | Kind | Params | WER (%) | Δ |
|---|---|---|---|---|---|---|
| 01 | Parakeet RNNT 1.1B | NVIDIA | Open Source | 1.1B | 1.8 | — |
| 02 | Conformer XL | — | Research | 600M | 2.0 | +0.2 |
| 03 | Deepgram Nova-3 | Deepgram | Cloud API | — | 2.2 | +0.2 |
| 04 | Voxtral Large | Mistral AI | Cloud API | — | 2.3 | +0.1 |
| 05 | AssemblyAI Universal-2 | AssemblyAI | Cloud API | — | 2.4 | +0.1 |
| 06 | Canary 1B | NVIDIA | Open Source | 1B | 2.4 | 0.0 |
| 07 | Whisper Large v3 Turbo | OpenAI | Open Source | 809M | 2.5 | +0.1 |
| 08 | Gladia v2 | Gladia | Cloud API | — | 2.5 | 0.0 |
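For orientation, WER is a word-level edit distance: substitutions, insertions and deletions against the reference transcript, divided by reference length. A minimal sketch of the computation; real evaluation pipelines also normalise casing and punctuation before scoring, which this skips.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 0.1667: 1 substitution / 6 words
```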
Naturalness is scored by human raters on a 1–5 scale. Commercial and open-source entries now overlap in the 4.5–4.8 band — a gap small enough that the right model is chosen by latency, licence or voice cloning rather than raw quality.
| # | Model | Vendor | Kind | Params | MOS | Δ |
|---|---|---|---|---|---|---|
| 01 | ElevenLabs Turbo v2.5 | ElevenLabs | Cloud API | — | 4.8 | — |
| 02 | Sesame CSM | Sesame | Open Source | 1B+ | 4.7 | -0.1 |
| 03 | OpenAI TTS HD | OpenAI | Cloud API | — | 4.7 | 0.0 |
| 04 | Gemini 2.5 Pro TTS | Google | Cloud API | — | 4.7 | 0.0 |
| 05 | Cartesia Sonic 2 | Cartesia | Cloud API | — | 4.7 | 0.0 |
| 06 | ElevenLabs Flash v2.5 | ElevenLabs | Cloud API | — | 4.6 | -0.1 |
| 07 | PlayHT 3.0 | PlayHT | Cloud API | — | 4.6 | 0.0 |
| 08 | Orpheus TTS | Canopy Labs | Open Source | 3B | 4.6 | 0.0 |
Long-form reads for the common decisions: which commercial TTS, which open-source, which model fits podcasts, audiobooks, voice bots or cloning.
- Flagship commercial TTS head-to-head: quality, cost, latency, voice library.
- Quality leader against the purpose-built low-latency challenger.
- Hyperscaler comparison: pricing, voices, SSML, streaming.
- Long-form naturalness ranked: pacing, breath, intonation over 30+ minutes.
- SSML, character voices, consistency across chapters.
- TTFB under 200 ms: Cartesia, ElevenLabs Flash, Gemini Flash.
- Zero-shot similarity, data requirements, and consent-ethics framing.
- Kokoro, Sesame CSM, Orpheus, F5-TTS, Dia: licensed and deployable.
Eleven open-source TTS voices, the same prompt, rendered through five DSP lenses and Griffin-Lim resynthesis. A reproducible walkthrough of the representations that vocoders, ASR systems and human ears actually read: mel spectrograms, MFCC, F0, formants.
Every figure is generated from the same code path; every voice is labelled with its provenance. No fabricated spectrograms, no stock audio. If the sample cannot be reproduced, it doesn't appear.
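Every lens in the walkthrough is reproducible with standard tooling. A minimal sketch using librosa; the input path is a placeholder, not one of the article's samples, and the parameters are common defaults rather than the exact figures used above.

```python
import librosa
import soundfile as sf

# Load any mono speech sample (placeholder path).
y, sr = librosa.load("sample.wav", sr=22050)

# Mel spectrogram: the representation most vocoders and ASR front ends consume.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_S = librosa.power_to_db(S)

# MFCC and F0, two more of the lenses named above.
mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))
# (Formants would need LPC analysis, e.g. librosa.lpc; omitted here.)

# Griffin-Lim resynthesis: invert the mel spectrogram back to a waveform.
y_hat = librosa.feature.inverse.mel_to_audio(S, sr=sr, n_fft=1024, hop_length=256)
sf.write("resynth.wav", y_hat, sr)
```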
The canonical benchmark for each direction, plus the community-adopted follow-ups. LibriSpeech, Common Voice and VCTK are canonicalised in our dataset registry; FLEURS, AudioBench and EARS are tracked qualitatively pending canonicalisation.
Rows marked ✓ live in the registry and carry full lineage.
| Benchmark | Scope | Primary metric | Year | Source | Registry |
|---|---|---|---|---|---|
| LibriSpeech | Speech-to-Text | wer-test-clean | 2015 | link → | ✓ |
| Common Voice | Speech-to-Text | wer | 2019 | link → | ✓ |
| LJ Speech | Text-to-Speech | mos | 2017 | link → | ✓ |
| VCTK | Text-to-Speech | mos | 2019 | link → | ✓ |
| TTS Intelligibility | Text-to-Speech | critical-entity-accuracy | 2026 | link → | ✓ |
| FLEURS | Speech-to-Text | WER (per-lang) | 2022 | link → | |
| AudioBench | Audio-LLM | composite | 2024 | link → | |
| EARS | Text-to-Speech | MOS · subjective | 2024 | link → | |
Modern speech recognition converts raw audio into mel-spectrogram features, runs them through a Conformer or Transformer encoder, and decodes with CTC, RNNT or attention. Post-processing (language-model rescoring, punctuation, diarisation) yields the final transcript.
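Sketched in code, with torchaudio for the front end and a random tensor standing in for the encoder's output; the greedy CTC collapse rule (merge repeats, drop blanks) is the part worth seeing. The path and vocabulary size are placeholders.

```python
import torch
import torchaudio

# Front end: raw waveform -> 80-bin log-mel features, the standard encoder input.
wav, sr = torchaudio.load("utterance.wav")  # placeholder path
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_fft=400, hop_length=160, n_mels=80)
features = mel(wav).clamp(min=1e-5).log()   # [1, 80, frames]

def ctc_greedy_decode(logits: torch.Tensor, blank: int = 0) -> list[int]:
    """Collapse per-frame argmax labels: merge repeats, then drop blanks."""
    ids = logits.argmax(dim=-1).tolist()
    out, prev = [], blank
    for i in ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out

# A real encoder would map `features` to these logits; the shape stands in here.
logits = torch.randn(features.shape[-1], 32)   # [frames, vocab]
token_ids = ctc_greedy_decode(logits)          # map ids to text with the model's vocab
```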
Modern speech synthesis runs the pipeline in reverse. Text is embedded by a language model; acoustic tokens are predicted autoregressively or by flow matching; a vocoder or neural codec decodes those tokens back to waveform. The neural audio codec — EnCodec, SoundStream, Mimi — is the hinge that lets TTS borrow the tooling of LLMs.
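A toy version of that loop, with a GRU standing in for the acoustic-token predictor; the codebook size, shapes and sampling are illustrative, not any shipping model's, and conditioning on text is omitted for brevity.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # size of a codec codebook; real systems predict several in parallel

class ToyARModel(nn.Module):
    """Predicts the next acoustic token from the running token history."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, 64)   # +1 for a BOS token
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)

    def forward(self, tokens):                     # tokens: [1, T]
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h[:, -1])                 # logits for the next token

@torch.no_grad()
def synthesise(model, max_tokens=200):
    tokens = torch.tensor([[VOCAB]])               # start from BOS
    for _ in range(max_tokens):                    # the autoregressive loop
        nxt = torch.multinomial(model(tokens).softmax(-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens[:, 1:]                           # acoustic tokens, sans BOS

acoustic_tokens = synthesise(ToyARModel())
# A real pipeline would now hand these tokens to the codec decoder for waveform audio.
```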
What changed recently is the representation. Once audio could be tokenised, every architectural trick from text generation became available to speech: pretraining, instruction-tuning, prompted style control, zero-shot cloning. That is why the open-source gap in TTS closed so quickly after 2023.
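The tokenisation itself is easy to see with Meta's open-source encodec package (pip install encodec); the figures below are from its 24 kHz model at 6 kbps, and the input is random noise standing in for speech.

```python
import torch
from encodec import EncodecModel

# Pretrained 24 kHz EnCodec; at 6 kbps it emits 8 parallel codebooks of discrete ids.
model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)
model.eval()

wav = torch.randn(1, 1, 24000)  # one second of mono audio at 24 kHz
with torch.no_grad():
    frames = model.encode(wav)                     # list of (codes, scale) frames
codes = torch.cat([c for c, _ in frames], dim=-1)  # [1, 8, steps] integer tokens

print(codes.shape)  # torch.Size([1, 8, 75]): 75 token steps per second of audio
# These integer ids are exactly what a token-based TTS language model predicts.
```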
On the STT side, the Conformer block — self-attention plus convolution — is still the workhorse. Whisper took a different path with a pure Transformer encoder-decoder trained on weak supervision at scale, trading some efficiency for massive multilingual coverage.
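The block's shape is compact enough to read in PyTorch. A simplified sketch: Macaron half-step feed-forwards around self-attention and a depthwise convolution, omitting the paper's relative positional encoding and dropout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Pointwise -> GLU -> depthwise conv -> BatchNorm -> Swish -> pointwise."""
    def __init__(self, d, kernel=31):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.pw1 = nn.Conv1d(d, 2 * d, 1)
        self.dw = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.bn = nn.BatchNorm1d(d)
        self.pw2 = nn.Conv1d(d, d, 1)

    def forward(self, x):                          # x: [B, T, d]
        y = self.norm(x).transpose(1, 2)           # Conv1d wants [B, d, T]
        y = F.glu(self.pw1(y), dim=1)
        y = self.pw2(F.silu(self.bn(self.dw(y))))
        return y.transpose(1, 2)

class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> convolution -> half-step FFN."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.ff1, self.ff2 = ffn(), ffn()
        self.norm_att = nn.LayerNorm(d)
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = ConvModule(d)
        self.norm_out = nn.LayerNorm(d)

    def forward(self, x):                          # x: [B, T, d]
        x = x + 0.5 * self.ff1(x)                  # Macaron half-step FFN
        a = self.norm_att(x)
        x = x + self.att(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)                       # the part a plain Transformer lacks
        x = x + 0.5 * self.ff2(x)
        return self.norm_out(x)

out = ConformerBlock()(torch.randn(2, 100, 256))   # [batch, frames, features]
```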
Other modality hubs on Codesota worth reading next.