Turn text into natural-sounding speech. The rare ML task where the frontier is entirely API-only and the open academic benchmarks (LJSpeech, VCTK) lag production by two years. Below: a side-by-side comparison of 12 providers on the axes buyers actually care about — cost, latency, languages, voice cloning, license.
Last verified 2026-04 · Submit corrections below · Vendors: claim your listing →
Frontier API · hyperscaler cloud · open weights. Pricing shown per million characters of synthesized output.
| Provider / Model | Tier | License | Cost / 1M chars | First-byte | Langs | Cloning | Stream | Claim |
|---|---|---|---|---|---|---|---|---|
| ElevenLabs Multilingual v2 / Turbo v2.5 / v3 | Frontier | Proprietary API | $150–$330/M | ~250–400 ms | 32+ | Professional | ✓ | Claim → |
| OpenAI tts-1 / tts-1-hd / gpt-4o voice | Frontier | Proprietary API | $15–$30/M | ~500–700 ms | 50+ | — | ✓ | Claim → |
| Cartesia Sonic 2 / Sonic Turbo | Frontier | Proprietary API | ~$19/M | ~90–150 ms | 15+ | Instant | ✓ | Claim → |
| Deepgram Aura 2 | Frontier | Proprietary API | ~$30/M | ~150–250 ms | 30+ | Limited | ✓ | Claim → |
| Hume EVI 2 / Octave | Frontier | Proprietary API | Per-minute billing | Realtime | 40+ | Instant | ✓ | Claim → |
| Sesame CSM-1B · Maya / Miles | Frontier | Hybrid | Demo / research | ~400 ms | English (primary) | Limited | ✓ | Claim → |
| Google Cloud Studio / Neural2 / Wavenet | Cloud | Proprietary API | $4–$160/M | ~300–500 ms | 50+ (380+ voices) | Limited | ✓ | Claim → |
| Microsoft Azure Neural TTS · HD voices | Cloud | Proprietary API | $16–$30/M | ~300–500 ms | 140+ locales | Professional | ✓ | Claim → |
| AWS Polly Neural · Long-form · Generative | Cloud | Proprietary API | $4–$30/M | ~300–600 ms | 30+ | — | ✓ | Claim → |
| SWivid F5-TTS | Open | Open weights | Self-host | GPU-dependent | English, Chinese (+ finetune) | Instant | — | Claim → |
| Fish Audio Fish Speech 1.5 | Open | Open weights | Self-host | GPU-dependent | 8 (en, zh, ja, de, fr, es, ko, ar) | Instant | ✓ | Claim → |
| Coqui XTTS v2 | Open | Open weights | Self-host | GPU-dependent | 17 | Instant | ✓ | Claim → |
Pricing is list-price per million characters as of 2026-04 and rounds to the nearest meaningful tier — most vendors negotiate at scale. Click any price to open the vendor’s pricing page. Spot an error? Tell us →
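Per-character pricing converts to a monthly budget with simple arithmetic. A minimal sketch; the rates and volumes below are illustrative placeholders, not quotes from any vendor:

```python
def monthly_cost(chars_per_month: int, rate_per_million: float) -> float:
    """List-price cost in dollars for a month of synthesis."""
    return chars_per_month / 1_000_000 * rate_per_million

# Illustrative: 5M characters/month at a $30/M list rate
print(monthly_cost(5_000_000, 30.0))  # → 150.0
```

At the $4/M standard tiers the same volume costs $20; at $150/M it costs $750, which is why the quality/cost axis dominates at scale.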
Picking a TTS provider is a budget-shaped decision on four axes: quality, latency, voice control, and license. Shortcuts by use-case:
- **Best API quality:** ElevenLabs v3 · OpenAI tts-1-hd. ElevenLabs leads on expressive range and voice library; OpenAI leads on consistency and safety.
- **Lowest latency (real-time agents):** Cartesia Sonic · Deepgram Aura 2. 90–250 ms first-byte beats everything else. Built for conversational voice pipelines.
- **Voice cloning:** ElevenLabs · Cartesia · F5-TTS. ElevenLabs for professional cloning (minutes of audio, consent-verified); Cartesia for instant cloning; F5-TTS for zero-shot open weights.
- **On-prem / compliance:** Fish Speech 1.5 · F5-TTS · XTTS v2. Run on your GPUs. Watch licenses: F5-TTS and Fish Speech are non-commercial by default; Coqui XTTS has the friendliest commercial terms.
- **Enterprise with an MSA:** Azure Neural TTS · Google Studio · AWS Polly. Already in the hyperscaler MSA. Azure leads on locale breadth; Google on top-tier voice quality; AWS on cost at scale.
- **Empathetic voice / agentic:** Hume EVI 2 · Sesame CSM · gpt-4o voice. Modeled for interruption timing and emotional tone, not just naturalness.
- **Cheapest at scale:** AWS Polly Standard · GCP Standard. $4/M for standard-tier voices. Sounds worse, but fine for IVR, accessibility, and bulk notifications.
MOS scores collapse a 30-second listen into a single number. If you're evaluating providers, A/B test your own text, not marketing demos, and listen for these six things that separate polished TTS from the uncanny:

- **Stress placement.** Does the stress land on the right word? A good TTS emphasizes new information; a weak one monotones everything.
- **Pauses and breath.** Real speakers pause mid-sentence to breathe. Synthetic speech that rushes through commas sounds robotic.
- **Sibilants.** Listen to s, sh, and z sounds. Cheap TTS hisses; good TTS renders sibilants without distortion.
- **Disfluencies.** Um, uh, and self-corrections matter for conversational AI. Most TTS scrubs them; the frontier systems model them.
- **Intonation range.** Play the same sentence as a question, a statement, and in excitement. Most providers produce identical audio.
- **Long-form stability.** Run a 5-minute script. Does the voice drift in pitch or pace? Attention-based TTS famously loses the thread past 30 seconds.
MOS (Mean Opinion Score) was designed in 1996 for telephony codecs. It asks human raters to score a speech clip from 1 (bad) to 5 (excellent). For decades it was the only metric in town.
In 2026 it’s breaking because: (a) top systems saturate at 4.3–4.6 and human raters lose discrimination; (b) ratings are crowd-sourced on short clips that miss long-form failures; (c) published MOS typically uses the author’s own test set, which no two papers share.
The metrics buyers should trust are WER (intelligibility: run the synthesized audio through ASR and compare the transcript to the ground-truth text), SECS (speaker-embedding cosine similarity, for cloning), first-byte latency measured from your own network, and blind A/B preference against a real human baseline.
Vendor-published MOS is not reliable enough to build a ranking on. That’s why the comparison matrix above uses operational axes — cost, latency, features — not MOS.
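WER is easy to compute yourself once an ASR system has transcribed the synthesized audio. A minimal sketch: word-level edit distance divided by reference length (production evaluations usually also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i ref words into the first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # → 0.1666…
```

Note that this measures the whole pipeline (TTS plus the ASR you chose), so hold the ASR model fixed across providers.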
Useful for training open-weights models and reproducible research. Frontier API providers don’t train on these — they use proprietary voice-actor corpora orders of magnitude larger.
- **LJSpeech** — Single female English speaker reading public-domain books. The canonical TTS training set for a generation: small, clean, copyright-safe. Dataset page →
- **VCTK** — Multi-speaker corpus designed for voice-cloning research. Regional English accents. The canonical benchmark for zero-shot speaker conditioning. Dataset page →
- **LibriTTS** — Cleaned subset of LibriSpeech with original punctuation and casing preserved. The scale-up training set for modern open-weights TTS. Dataset page →
- **Common Voice** — Mozilla's ongoing multilingual speech corpus. The go-to for multilingual open-weights TTS, though audio quality varies widely. Dataset page →

Don't train from scratch. Frontier quality requires tens of thousands of hours of studio-grade voice acting. Unless you have a niche (a specific language, a specific voice type), start from XTTS v2 or Fish Speech 1.5 and finetune.
Latency is priced in. Cartesia and Deepgram lead on first-byte; ElevenLabs Turbo is close. For conversational agents the 150 ms line is the difference between natural and awkward — worth paying for.
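Measuring first-byte latency from your own network takes a few lines: time how long until the first non-empty audio chunk of a streamed response arrives. A minimal sketch; the endpoint in the comment is a placeholder, not a real URL:

```python
import time

def first_byte_latency(chunks):
    """Seconds until the first non-empty chunk of streamed audio arrives."""
    start = time.monotonic()
    for chunk in chunks:
        if chunk:
            return time.monotonic() - start
    return None  # stream ended without producing audio

# Hypothetical usage with a streaming HTTP response:
#   import requests
#   r = requests.post("https://api.example.com/tts/stream",
#                     json={"text": "...", "voice_id": "..."}, stream=True)
#   print(first_byte_latency(r.iter_content(chunk_size=1024)))
```

Run it from the region your users are in, not your laptop: round-trip time to the vendor's nearest point of presence is often a bigger factor than model speed.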
Voice cloning is a consent problem, not a tech problem. The tech works. The risk is legal: impersonation, deepfake audio, brand hijacking. ElevenLabs Professional Cloning and Azure Custom Neural Voice both require consent-verified onboarding. If your vendor doesn’t, that’s a red flag.
Stream and cache. Streaming cuts perceived latency by half. Cache by hash(text + voice_id + params) — TTS is deterministic enough that 20–40% of requests in a production app are repeats.
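The cache key described above can be sketched as a hash over a canonical serialization of text, voice, and parameters, so identical requests always collide (names here are illustrative, not any vendor's API):

```python
import hashlib
import json

def tts_cache_key(text: str, voice_id: str, params: dict) -> str:
    """Stable cache key: identical text + voice + params always hash the same."""
    payload = json.dumps(
        {"text": text, "voice_id": voice_id, "params": params},
        sort_keys=True,       # dict ordering must not change the key
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Sorting keys matters: `{"speed": 1.0, "pitch": 0}` and the same dict built in a different order must produce the same cache entry.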
Evaluate on your own text. Vendor demos are hand-picked. Write 10 scripts from your actual domain — technical terms, names, long sentences, emotional beats — and AB them blind.
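A blind A/B pass is easy to set up: randomize which provider plays on which side so raters can't learn a positional bias. A minimal sketch (provider labels are placeholders):

```python
import random

def blind_pairs(scripts, providers=("A", "B"), seed=0):
    """Assign each script a randomized left/right provider order."""
    rng = random.Random(seed)  # seeded so the trial sheet is reproducible
    trials = []
    for script in scripts:
        order = list(providers)
        rng.shuffle(order)
        trials.append({"script": script, "left": order[0], "right": order[1]})
    return trials
```

Keep the mapping file separate from the rating sheet; unblind only after all preferences are collected.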
CodeSOTA’s TTS comparison is read by engineers evaluating providers. If you represent one of the vendors above — or a provider we missed — claim the listing to submit verified pricing, latency benchmarks, voice samples, and a demo link. Free; credibility-gated, not pay-to-play.
Missing a provider, a column we skipped, or a use case you need help picking for? Tell us — we reply within 48 hours and update the page based on what readers actually ask.
Real humans read every message. We track what people are asking for and prioritize accordingly.