Text-to-Speech

Turn text into natural-sounding speech. The rare ML task where the frontier is entirely API-only and the open academic benchmarks (LJSpeech, VCTK) lag production by two years. Below: a side-by-side comparison of 12 providers on the axes buyers actually care about — cost, latency, languages, voice cloning, license.

Last verified 2026-04

12 providers, side by side

Frontier API · hyperscaler cloud · open weights. Pricing shown per million characters of synthesized output.

| Provider / Model | Tier | License | Cost / 1M chars | First-byte | Langs | Cloning |
|---|---|---|---|---|---|---|
| ElevenLabs · Multilingual v2 / Turbo v2.5 / v3 | Frontier | Proprietary API | $150–330/M | ~250–400 ms | 32+ | Professional |
| OpenAI · tts-1 / tts-1-hd / gpt-4o voice | Frontier | Proprietary API | $15 / $30/M | ~500–700 ms | 50+ | — |
| Cartesia · Sonic 2 / Sonic Turbo | Frontier | Proprietary API | ~$19/M | ~90–150 ms | 15+ | Instant |
| Deepgram · Aura 2 | Frontier | Proprietary API | ~$30/M | ~150–250 ms | 30+ | Limited |
| Hume · EVI 2 / Octave | Frontier | Proprietary API | Per-minute billing | Realtime | 40+ | Instant |
| Sesame · CSM-1B (Maya / Miles) | Frontier | Hybrid | Demo / research | ~400 ms | English (primary) | Limited |
| Google Cloud · Studio / Neural2 / Wavenet | Cloud | Proprietary API | $4–160/M | ~300–500 ms | 50+ (380+ voices) | Limited |
| Microsoft Azure · Neural TTS / HD voices | Cloud | Proprietary API | $16–30/M | ~300–500 ms | 140+ locales | Professional |
| AWS · Polly Neural / Long-form / Generative | Cloud | Proprietary API | $4–30/M | ~300–600 ms | 30+ | — |
| F5-TTS | Open | Open weights | Self-host | GPU-dependent | English, Chinese (+ finetune) | Instant |
| Fish Audio · Fish Speech 1.5 | Open | Open weights | Self-host | GPU-dependent | 8 (en, zh, ja, de, fr, es, ko, ar) | Instant |
| Coqui · XTTS v2 | Open | Open weights | Self-host | GPU-dependent | 17 | Instant |

Pricing is list price per million characters as of 2026-04, rounded to the nearest meaningful tier — most vendors negotiate at scale.

Which should I use?

Picking a TTS provider is a budget-constrained trade-off across four axes: quality, latency, voice control, and license. Shortcuts by use case:

Best API quality

ElevenLabs v3 · OpenAI tts-1-hd

ElevenLabs leads on expressive range and voice library; OpenAI leads on consistency and safety.

Lowest latency (real-time agents)

Cartesia Sonic · Deepgram Aura 2

90–250 ms first-byte beats everything else. Built for conversational voice pipelines.

Voice cloning

ElevenLabs · Cartesia · F5-TTS

ElevenLabs professional cloning (minutes of audio, consent-verified). Cartesia instant cloning. F5-TTS for zero-shot open weights.

On-prem / compliance

Fish Speech 1.5 · F5-TTS · XTTS v2

Run on your GPUs. Watch licenses — F5-TTS and Fish Speech are non-commercial by default; Coqui XTTS has the friendliest commercial terms.

Enterprise with an MSA

Azure Neural TTS · Google Studio · AWS Polly

Already in the hyperscaler MSA. Azure leads on locale breadth; Google on voice quality at top-tier; AWS on cost at scale.

Empathetic voice / agentic

Hume EVI 2 · Sesame CSM · gpt-4o voice

Modelled for interruption timing and emotional tone, not just naturalness.

Cheapest at scale

AWS Polly Standard · GCP Standard

$4/M for standard-tier voices. They sound worse, but are fine for IVR, accessibility, and bulk notifications.
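The standard-versus-frontier spread is wide enough to dominate the decision at volume. A back-of-envelope sketch, using illustrative list prices in the range of those shown in the comparison table (real contracts are negotiated and often tiered):

```python
# Illustrative per-million-character list rates, not quotes.
RATES_PER_M_CHARS = {
    "standard_tier": 4.0,     # hyperscaler standard voices
    "mid_tier_api": 15.0,     # e.g. a tts-1-class model
    "frontier_api": 150.0,    # low end of a frontier range
}

def monthly_cost(chars_per_month: int, rate_per_m: float) -> float:
    """List-price cost in USD for a month of synthesis."""
    return chars_per_month / 1_000_000 * rate_per_m

# 50M chars/month, roughly a mid-size IVR or notification workload.
for name, rate in RATES_PER_M_CHARS.items():
    print(f"{name}: ${monthly_cost(50_000_000, rate):,.0f}/mo")
```

At 50M characters a month the same workload spans $200 to $7,500 depending on tier, which is why the cheap voices survive despite sounding worse.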

What to listen for

MOS scores collapse a 30-second listen into a single number. If you’re evaluating providers, A/B test your own text — not marketing demos — and listen for these six things that separate polished TTS from uncanny:

Prosody

Does the stress land on the right word? A good TTS emphasizes new information; a weak one monotones everything.

Breath & pauses

Real speakers pause mid-sentence to breathe. Synthetic speech that rushes through commas sounds robotic.

Sibilance

Listen to s, sh, z sounds. Cheap TTS hisses; good TTS renders sibilants without distortion.

Disfluencies

Um, uh, and self-corrections matter for conversational AI. Most TTS scrubs them — the frontier ones model them.

Emotional range

Play the same sentence as a question, a statement, and in excitement. Most providers produce identical audio.

Long-form consistency

Run a 5-minute script. Does the voice drift in pitch or pace? Attention-based TTS famously loses the thread past 30 seconds.

Why MOS scores are misleading in 2026

MOS (Mean Opinion Score) was designed in 1996 for telephony codecs. It asks human raters to score a speech clip from 1 (bad) to 5 (excellent). For decades it was the only metric in town.

In 2026 it’s breaking because: (a) top systems saturate at 4.3–4.6 and human raters lose discrimination; (b) ratings are crowd-sourced on short clips that miss long-form failures; (c) published MOS typically uses the author’s own test set, which no two papers share.

The metrics buyers should trust are WER (intelligibility — pass the synthesized audio through ASR, compare to ground truth), SECS (speaker similarity for cloning), first-byte latency measured from your own network, and blind AB preference against a real human baseline.
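The WER check is cheap to automate: synthesize, transcribe with any ASR model, and score the transcript against the text you asked the TTS to speak. A minimal word-level edit-distance sketch (the ASR step itself is assumed to happen upstream):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.

    `reference` is the text sent to the TTS; `hypothesis` is the ASR
    transcript of the synthesized audio.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```

One substituted word out of four gives 0.25; a production gate might reject any script scoring above a few percent.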

Vendor-published MOS is not reliable enough to build a ranking on. That’s why the comparison matrix above uses operational axes — cost, latency, features — not MOS.

Academic datasets

Useful for training open-weights models and reproducible research. Frontier API providers don’t train on these — they use proprietary voice-actor corpora orders of magnitude larger.

LJSpeech

24 hours · 13K utterances · 1 speaker · 2017

Single female English speaker reading public-domain books. The canonical TTS training set for a generation — small, clean, copyright-safe.

VCTK

44 hours · 110 speakers · English · 2017

Multi-speaker corpus designed for voice-cloning research. Regional English accents. Canonical benchmark for zero-shot speaker conditioning.

LibriTTS

585 hours · 2,456 speakers · English · 2019

Cleaned subset of LibriSpeech with original punctuation and casing preserved. The scale-up training set for modern open-weights TTS.

Common Voice

30,000+ hours · 100+ languages · crowdsourced · 2019

Mozilla’s ongoing multilingual speech corpus. The go-to for multilingual open-weights TTS — though audio quality varies widely.

Practical tips for 2026

Don’t train from scratch. Frontier quality requires tens of thousands of hours of studio-grade voice acting. Unless you have a niche (a specific language, a specific voice type), start from XTTS v2 or Fish Speech 1.5 and finetune.

Latency is priced in. Cartesia and Deepgram lead on first-byte; ElevenLabs Turbo is close. For conversational agents the 150 ms line is the difference between natural and awkward — worth paying for.
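First-byte latency is easy to measure from your own network: wrap whatever chunk iterator the vendor SDK returns and time the first yield. A vendor-agnostic sketch, with a simulated stream standing in for the real SDK call:

```python
import time

def time_to_first_byte(chunks) -> tuple[float, bytes]:
    """Seconds from call to the first audio chunk of a streaming
    TTS response. `chunks` is any iterator of bytes, e.g. whatever
    your vendor SDK's streaming endpoint returns."""
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

# Simulated vendor stream: ~120 ms to the first chunk.
def fake_stream():
    time.sleep(0.12)
    yield b"\x00" * 3200  # 100 ms of 16 kHz 16-bit mono PCM
    yield b"\x00" * 3200

ttfb, _ = time_to_first_byte(fake_stream())
print(f"first byte in {ttfb * 1000:.0f} ms")
```

Run this against each candidate endpoint from the region your agents actually run in; dashboard numbers are measured from the vendor's side of the network.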

Voice cloning is a consent problem, not a tech problem. The tech works. The risk is legal: impersonation, deepfake audio, brand hijacking. ElevenLabs Professional Cloning and Azure Custom Neural Voice both require consent-verified onboarding. If your vendor doesn’t, that’s a red flag.

Stream and cache. Streaming cuts perceived latency by half. Cache by hash(text + voice_id + params) — TTS is deterministic enough that 20–40% of requests in a production app are repeats.
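The cache-key idea above takes a few lines. A sketch, with the `synthesize` callable standing in for whatever vendor SDK call you use:

```python
import hashlib
import json

def tts_cache_key(text: str, voice_id: str, params: dict) -> str:
    """Deterministic key over everything that affects the audio.
    sort_keys makes {'speed': 1.0, 'fmt': 'mp3'} and its permutation
    hash identically."""
    payload = json.dumps(
        {"text": text, "voice_id": voice_id, "params": params},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache: dict[str, bytes] = {}

def synthesize_cached(text, voice_id, params, synthesize):
    """Return cached audio if this exact request was seen before,
    otherwise call the vendor once and store the result."""
    key = tts_cache_key(text, voice_id, params)
    if key not in cache:
        cache[key] = synthesize(text, voice_id, params)
    return cache[key]
```

In production you would back `cache` with Redis or object storage rather than a dict, but the keying logic is the same.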

Evaluate on your own text. Vendor demos are hand-picked. Write 10 scripts from your actual domain — technical terms, names, long sentences, emotional beats — and AB them blind.
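Making the A/B blind is mostly bookkeeping: present each pair of clips under neutral labels in a random order and keep the answer key aside until rating is done. A sketch with placeholder provider names:

```python
import random

def blind_ab_trials(scripts, provider_a, provider_b, seed=42):
    """Build a blind A/B playlist. Raters see only trial ids and
    neutral clip labels; the provider mapping stays in the answer
    key until after they have rated."""
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    trials, answer_key = [], {}
    for i, script in enumerate(scripts):
        order = [provider_a, provider_b]
        rng.shuffle(order)
        trial_id = f"trial-{i:02d}"
        trials.append({"id": trial_id, "script": script,
                       "clips": ["clip_1", "clip_2"]})
        answer_key[trial_id] = {"clip_1": order[0], "clip_2": order[1]}
    return trials, answer_key
```

Ten domain scripts per provider pair is usually enough to surface pronunciation and prosody failures that a vendor demo never will.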

Run a TTS product? Claim your listing.

CodeSOTA’s TTS comparison is read by engineers evaluating providers. If you represent one of the vendors above — or a provider we missed — claim the listing to submit verified pricing, latency benchmarks, voice samples, and a demo link. Free; credibility-gated, not pay-to-play.

Reply within 48 hours · No newsletter

What were you looking for on text-to-speech?

Missing a provider, a column we skipped, or a use case you need help picking for? Tell us — we reply within 48 hours and update the page based on what readers actually ask.

Real humans read every message. We track what people are asking for and prioritize accordingly.