Codesota · Speech · Vol. II
The register of speech-to-text and text-to-speech
Issue: April 22, 2026
§ 00 · Speech

Speech AI, both directions.

Two pillars share this register. Speech-to-text now clears the human-accuracy bar on clean audio; text-to-speech clears the blind-test bar for naturalness. We keep both on the same page because the pipeline almost always needs both.

18 STT models and 18 TTS models tracked, sourced from the shared model catalogue. The top row of each leaderboard marks the current state of the art. Numbers shown only where reported; every model links to paper or code where available.

§ 01 · Speech-to-text

Word error rate, ranked.

LibriSpeech test-clean remains the canonical benchmark. Lower is better. Human-annotator WER on this split sits in the 2–4% band, which several models now clear.
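
WER is word-level edit distance: substitutions, insertions and deletions against the reference transcript, divided by the reference length. A minimal sketch of the metric itself, not the harness any leaderboard actually runs:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                        # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                        # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat in the mat"))  # 1/6 ≈ 0.167
```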


Metric: WER · lower is better
Models: 18 tracked · top 8 shown
Dataset: LibriSpeech test-clean
Full guide · speech recognition →
Top 8 · April 2026
Row 01 marks the current SOTA
#    Model                    Vendor       Kind         Params   WER   Δ
01   Parakeet RNNT 1.1B       NVIDIA       Open Source  1.1B     1.8
02   Conformer XL             Google       Research     600M     2.0   +0.2
03   Deepgram Nova-3          Deepgram     Cloud API             2.2   +0.2
04   Voxtral Large            Mistral AI   Cloud API             2.3   +0.1
05   AssemblyAI Universal-2   AssemblyAI   Cloud API             2.4   +0.1
06   Canary 1B                NVIDIA       Open Source  1B       2.4   0.0
07   Whisper Large v3 Turbo   OpenAI       Open Source  809M     2.5   +0.1
08   Gladia v2                Gladia       Cloud API             2.5   0.0
Fig 1 · WER on LibriSpeech test-clean. Δ is the difference against the row above; Params is blank where no size is reported.
§ 02 · Text-to-speech

Mean opinion score, ranked.

Naturalness is scored by human raters on a 1–5 scale. Commercial and open-source entries now overlap in the 4.5–4.8 band, a gap small enough that the right model is chosen on latency, licence terms or voice-cloning support rather than raw quality.
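
MOS itself is nothing exotic: the mean of those 1–5 ratings, with a confidence interval that explains why small gaps are meaningless. A minimal sketch, assuming a normal approximation and a made-up ten-listener panel:

```python
import math
import statistics

def mos(ratings: list[int]) -> tuple[float, float]:
    """Mean opinion score plus half-width of a ~95% confidence interval."""
    mean = statistics.mean(ratings)
    half_ci = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_ci

ratings = [5, 4, 5, 4, 4, 5, 3, 5, 4, 4]    # hypothetical listener panel
m, ci = mos(ratings)
print(f"MOS {m:.2f} ± {ci:.2f}")             # MOS 4.30 ± 0.42
```

With panels of this size the interval dwarfs the 0.1 steps in the table below, which is why Fig 2 warns against reading them as real differences.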


Metric: MOS · higher is better
Models: 18 tracked · top 8 shown
Evaluation: Subjective listening tests
Full guide · TTS models →
Top 8 · April 2026
Row 01 marks the current SOTA
#    Model                   Vendor        Kind         Params   MOS   Δ
01   ElevenLabs Turbo v2.5   ElevenLabs    Cloud API             4.8
02   Sesame CSM              Sesame        Open Source  1B+      4.7   -0.1
03   OpenAI TTS HD           OpenAI        Cloud API             4.7   0.0
04   Gemini 2.5 Pro TTS      Google        Cloud API             4.7   0.0
05   Cartesia Sonic 2        Cartesia      Cloud API             4.7   0.0
06   ElevenLabs Flash v2.5   ElevenLabs    Cloud API             4.6   -0.1
07   PlayHT 3.0              PlayHT        Cloud API             4.6   0.0
08   Orpheus TTS             Canopy Labs   Open Source  3B       4.6   0.0
Fig 2 · MOS is subjective. Vendors publish different listener panels and reference tracks; differences below 0.1 should be treated as noise.
§ 03 · Comparison pages

Pairwise, and by use-case.

Long-form reads for the common decisions: which commercial TTS, which open-source, which model fits podcasts, audiobooks, voice bots or cloning.

Fig 3 · Each comparison page has its own evidence table; these are editorial reads, not benchmark duplicates.
§ 04 · Featured deep-dive

How speech becomes a picture.

Eleven open-source TTS voices, the same prompt, rendered through five DSP lenses and Griffin-Lim resynthesis. A reproducible walkthrough of the representations that vocoders, ASR systems and human ears actually read — mel spectrograms, MFCC, F0, formants.

Every figure is generated from the same code path; every voice is labelled with its provenance. No fabricated spectrograms, no stock audio. If the sample cannot be reproduced, it doesn't appear.
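
A minimal sketch of those representations, assuming librosa and soundfile are installed; the filename is a placeholder, formant tracking is omitted for brevity, and this mirrors the idea of the walkthrough rather than its exact code path:

```python
import librosa
import soundfile as sf

# Any mono clip; "voice_sample.wav" is a placeholder filename.
y, sr = librosa.load("voice_sample.wav", sr=22050)

# Mel spectrogram: the representation most vocoders and ASR front-ends read.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

# MFCC: the compact cepstral summary classic ASR was built on.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# F0 contour via probabilistic YIN: the pitch track a human ear follows.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"))

# Griffin-Lim resynthesis: invert the mel spectrogram back to a waveform
# with estimated phase. Audibly lossy, which is exactly the point.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("resynthesised.wav", y_hat, sr)
```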

§ 05 · Benchmarks

The datasets we believe.

Canonical for each direction plus the community-adopted follow-ups. LibriSpeech, Common Voice and VCTK are canonicalised in our dataset registry; FLEURS, AudioBench and EARS are tracked qualitatively pending canonicalisation.

Rows marked ● live in the registry and carry full lineage.

     Benchmark             Scope            Primary metric             Year   Source
●    LibriSpeech           Speech-to-Text   WER (test-clean)           2015   link →
●    Common Voice          Speech-to-Text   WER                        2019   link →
●    LJ Speech             Text-to-Speech   MOS                        2017   link →
●    VCTK                  Text-to-Speech   MOS                        2019   link →
●    TTS Intelligibility   Text-to-Speech   critical-entity accuracy   2026   link →
○    FLEURS                Speech-to-Text   WER (per-language)         2022   link →
○    AudioBench            Audio-LLM        composite                  2024   link →
○    EARS                  Text-to-Speech   MOS (subjective)           2024   link →
Fig 5 · Solid marker (●) = canonicalised in the Codesota registry. Hollow marker (○) = widely cited, tracked qualitatively, not yet graded.
ASR · English: 1.8 WER ↓ (2023–26)
TTS · naturalness: 4.8 MOS ↑ (2023–26)
Realtime TTS: ~90 ms TTFB ↓ (2023–26)
Open-source TTS: 4.7 MOS ↑ (2023–26)
Fig 6 · Directional trends across four speech axes, 2023–26. Each value is the current SOTA entry from the catalogue.
§ 06 · How it works

Two pipelines, one register.

Modern speech recognition converts raw audio into mel-spectrogram features, runs them through a Conformer or Transformer encoder, and decodes with CTC, RNNT or attention. Post-processing (language-model rescoring, punctuation, diarisation) yields the final transcript.
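
As a concrete instance of the decoding step, greedy CTC decoding in miniature: argmax per frame, collapse repeats, drop blanks. Production systems run beam search with language-model rescoring on top; this sketch shows only the collapsing rule:

```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, vocab: list[str],
                      blank: int = 0) -> str:
    """log_probs: (time, vocab) frame-level scores from the encoder."""
    best = log_probs.argmax(axis=1)        # best label per frame
    out, prev = [], blank
    for t in best:
        if t != prev and t != blank:       # collapse repeats, skip blanks
            out.append(vocab[t])
        prev = t
    return "".join(out)

vocab = ["_", "a", "c", "t"]               # "_" is the CTC blank
frames = np.log(np.array([                 # toy six-frame posterior
    [.1, .7, .1, .1], [.1, .7, .1, .1],    # "a" held for two frames
    [.8, .1, .05, .05],                    # blank
    [.1, .1, .7, .1],                      # "c"
    [.1, .1, .1, .7], [.1, .1, .1, .7],    # "t" held for two frames
]))
print(ctc_greedy_decode(frames, vocab))    # -> "act"
```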

Modern speech synthesis runs the pipeline in reverse. Text is embedded by a language model; acoustic tokens are predicted autoregressively or by flow matching; a vocoder or neural codec decodes those tokens back to waveform. The neural audio codec — EnCodec, SoundStream, Mimi — is the hinge that lets TTS borrow the tooling of LLMs.
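
To make that hinge concrete: codecs like EnCodec and SoundStream rest on residual vector quantisation, where each stage quantises what the previous stage left behind, turning a continuous frame into a short tuple of discrete token IDs. A toy sketch with random codebooks; a real codec learns them end to end, and nothing here is any codec's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, codebook_size, n_stages = 8, 16, 4
codebooks = rng.normal(size=(n_stages, codebook_size, dim))  # random, not learned

def rvq_encode(frame: np.ndarray) -> list[int]:
    residual, codes = frame.copy(), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)          # one discrete token per stage
        residual -= cb[idx]        # next stage models what is left over
    return codes

def rvq_decode(codes: list[int]) -> np.ndarray:
    return sum(cb[i] for cb, i in zip(codebooks, codes))

frame = rng.normal(size=dim)       # stand-in for one latent audio frame
codes = rvq_encode(frame)
error = np.linalg.norm(frame - rvq_decode(codes))
print(codes, round(float(error), 3))
```

Once every frame is a tuple of integers like this, predicting audio is token prediction, and the whole LLM toolchain applies.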

What changed recently is the representation. Once audio could be tokenised, every architectural trick from text generation became available to speech: pretraining, instruction-tuning, prompted style control, zero-shot cloning. That is why the open-source gap in TTS closed so quickly after 2023.

On the STT side, the Conformer block — self-attention plus convolution — is still the workhorse. Whisper took a different path with a pure Transformer encoder-decoder trained on weak supervision at scale, trading some efficiency for massive multilingual coverage.
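
A structural sketch of that block (half-step feed-forward, self-attention, convolution module, second half-step feed-forward, all residual), following the published Conformer recipe; the dimensions, kernel size and activations here are illustrative defaults, not any tracked model's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConformerBlock(nn.Module):
    """FFN/2 -> self-attention -> conv module -> FFN/2, every step residual."""

    def __init__(self, d: int = 256, heads: int = 4, kernel: int = 31):
        super().__init__()
        self.ff1 = self._ffn(d)
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm_conv = nn.LayerNorm(d)
        self.pw_in = nn.Conv1d(d, 2 * d, 1)   # pointwise, feeds a GLU
        self.dw = nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d)
        self.bn = nn.BatchNorm1d(d)
        self.pw_out = nn.Conv1d(d, d, 1)
        self.ff2 = self._ffn(d)
        self.norm_out = nn.LayerNorm(d)

    @staticmethod
    def _ffn(d: int) -> nn.Sequential:
        return nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                             nn.SiLU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, d)
        x = x + 0.5 * self.ff1(x)                 # half-step feed-forward
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]             # attention: global context
        c = self.norm_conv(x).transpose(1, 2)     # (batch, d, time) for conv
        c = F.glu(self.pw_in(c), dim=1)           # gated pointwise expansion
        c = self.pw_out(F.silu(self.bn(self.dw(c))))
        x = x + c.transpose(1, 2)                 # convolution: local context
        x = x + 0.5 * self.ff2(x)                 # second half-step feed-forward
        return self.norm_out(x)

x = torch.randn(2, 100, 256)                      # (batch, frames, features)
print(ConformerBlock()(x).shape)                  # torch.Size([2, 100, 256])
```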

Related

Neighbouring registers.

Other modality hubs on Codesota worth reading next.

Guide · TTS models
Long-form overview of the TTS landscape.
Guide · speech recognition
How ASR models are built, trained, evaluated.
OCR · register
Document understanding and text extraction.
LLM · register
Frontier language-model benchmarks.