Text-to-Speech

Turn text into natural-sounding speech. The rare ML task where the frontier is entirely API-only and the open academic benchmarks (LJSpeech, VCTK) lag production by two years. Below: a side-by-side comparison of 12 providers on the axes buyers actually care about — cost, latency, languages, voice cloning, license.

Last verified 2026-04

12 providers, side by side

Frontier API · hyperscaler cloud · open weights. Pricing shown per million characters of synthesized output.

| Provider / Model | Tier | License | Cost / 1M chars | First-byte | Langs | Cloning |
|---|---|---|---|---|---|---|
| ElevenLabs · Multilingual v2 / Turbo v2.5 / v3 | Frontier | Proprietary API | $150–330/M | ~250–400 ms | 32+ | Professional |
| OpenAI · tts-1 / tts-1-hd / gpt-4o voice | Frontier | Proprietary API | $15 / $30/M | ~500–700 ms | 50+ | — |
| Cartesia · Sonic 2 / Sonic Turbo | Frontier | Proprietary API | ~$19/M | ~90–150 ms | 15+ | Instant |
| Deepgram · Aura 2 | Frontier | Proprietary API | ~$30/M | ~150–250 ms | 30+ | Limited |
| Hume · EVI 2 / Octave | Frontier | Proprietary API | Per-minute billing | Realtime | 40+ | Instant |
| Sesame · CSM-1B (Maya / Miles) | Frontier | Hybrid | Demo / research | ~400 ms | English (primary) | Limited |
| Google Cloud · Studio / Neural2 / Wavenet | Cloud | Proprietary API | $4–160/M | ~300–500 ms | 50+ (380+ voices) | Limited |
| Microsoft Azure · Neural TTS / HD voices | Cloud | Proprietary API | $16–30/M | ~300–500 ms | 140+ locales | Professional |
| AWS · Polly Neural / Long-form / Generative | Cloud | Proprietary API | $4–30/M | ~300–600 ms | 30+ | — |
| F5-TTS | Open | Open weights | Self-host | GPU-dependent | English, Chinese (+ finetune) | Instant |
| Fish Audio · Fish Speech 1.5 | Open | Open weights | Self-host | GPU-dependent | 8 (en, zh, ja, de, fr, es, ko, ar) | Instant |
| Coqui · XTTS v2 | Open | Open weights | Self-host | GPU-dependent | 17 | Instant |

Pricing is list price per million characters as of 2026-04, rounded to the nearest meaningful tier — most vendors negotiate at scale.

Which should I use?

Picking a TTS provider is a budget-constrained trade-off across four axes: quality, latency, voice control, and license. Shortcuts by use case:

Best API quality

ElevenLabs v3 · OpenAI tts-1-hd

ElevenLabs leads on expressive range and voice library; OpenAI leads on consistency and safety.

Lowest latency (real-time agents)

Cartesia Sonic · Deepgram Aura 2

90–250 ms first-byte beats everything else. Built for conversational voice pipelines.

Voice cloning

ElevenLabs · Cartesia · F5-TTS

ElevenLabs professional cloning (minutes of audio, consent-verified). Cartesia instant cloning. F5-TTS for zero-shot open weights.

On-prem / compliance

Fish Speech 1.5 · F5-TTS · XTTS v2

Run on your GPUs. Watch licenses — F5-TTS and Fish Speech are non-commercial by default; Coqui XTTS has the friendliest commercial terms.

Enterprise with an MSA

Azure Neural TTS · Google Studio · AWS Polly

Already in the hyperscaler MSA. Azure leads on locale breadth; Google on voice quality at top-tier; AWS on cost at scale.

Empathetic voice / agentic

Hume EVI 2 · Sesame CSM · gpt-4o voice

Modelled for interruption timing and emotional tone, not just naturalness.

Cheapest at scale

AWS Polly Standard · GCP Standard

$4/M for standard-tier voices. They sound worse, but are fine for IVR, accessibility, and bulk notifications.
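The standard-versus-frontier spread is wide enough to dominate the decision at volume. A back-of-envelope sketch, using illustrative list prices in the range of those shown in the comparison table (real contracts are negotiated and often tiered):

```python
# Illustrative per-million-character list rates, not quotes.
RATES_PER_M_CHARS = {
    "standard_tier": 4.0,     # hyperscaler standard voices
    "mid_tier_api": 15.0,     # e.g. a tts-1-class model
    "frontier_api": 150.0,    # low end of a frontier range
}

def monthly_cost(chars_per_month: int, rate_per_m: float) -> float:
    """List-price cost in USD for a month of synthesis."""
    return chars_per_month / 1_000_000 * rate_per_m

# 50M chars/month, roughly a mid-size IVR or notification workload.
for name, rate in RATES_PER_M_CHARS.items():
    print(f"{name}: ${monthly_cost(50_000_000, rate):,.0f}/mo")
```

At 50M characters a month the same workload spans $200 to $7,500 depending on tier, which is why the cheap voices survive despite sounding worse.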

What to listen for

MOS scores collapse a 30-second listen into a single number. If you’re evaluating providers, A/B test your own text — not marketing demos — and listen for these six things that separate polished TTS from uncanny:

Prosody

Does the stress land on the right word? A good TTS emphasizes new information; a weak one monotones everything.

Breath & pauses

Real speakers pause mid-sentence to breathe. Synthetic speech that rushes through commas sounds robotic.

Sibilance

Listen to s, sh, z sounds. Cheap TTS hisses; good TTS renders sibilants without distortion.

Disfluencies

Um, uh, and self-corrections matter for conversational AI. Most TTS scrubs them — the frontier ones model them.

Emotional range

Play the same sentence as a question, a statement, and in excitement. Most providers produce identical audio.

Long-form consistency

Run a 5-minute script. Does the voice drift in pitch or pace? Attention-based TTS famously loses the thread past 30 seconds.

Why MOS scores are misleading in 2026

MOS (Mean Opinion Score) was designed in 1996 for telephony codecs. It asks human raters to score a speech clip from 1 (bad) to 5 (excellent). For decades it was the only metric in town.

In 2026 it’s breaking because: (a) top systems saturate at 4.3–4.6 and human raters lose discrimination; (b) ratings are crowd-sourced on short clips that miss long-form failures; (c) published MOS typically uses the author’s own test set, which no two papers share.

The metrics buyers should trust are WER (intelligibility — pass the synthesized audio through ASR, compare to ground truth), SECS (speaker similarity for cloning), first-byte latency measured from your own network, and blind AB preference against a real human baseline.
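The WER check is cheap to automate: synthesize, transcribe with any ASR model, and score the transcript against the text you asked the TTS to speak. A minimal word-level edit-distance sketch (the ASR step itself is assumed to happen upstream):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length.

    `reference` is the text sent to the TTS; `hypothesis` is the ASR
    transcript of the synthesized audio.
    """
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```

One substituted word out of four gives 0.25; a production gate might reject any script scoring above a few percent.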

Vendor-published MOS is not reliable enough to build a ranking on. That’s why the comparison matrix above uses operational axes — cost, latency, features — not MOS.

Academic datasets

Useful for training open-weights models and reproducible research. Frontier API providers don’t train on these — they use proprietary voice-actor corpora orders of magnitude larger.

LJSpeech

24 hours · 13K utterances · 1 speaker · 2017

Single female English speaker reading public-domain books. The canonical TTS training set for a generation — small, clean, copyright-safe.

VCTK

44 hours · 110 speakers · English · 2017

Multi-speaker corpus designed for voice-cloning research. Regional English accents. Canonical benchmark for zero-shot speaker conditioning.

LibriTTS

585 hours · 2,456 speakers · English · 2019

Cleaned subset of LibriSpeech with original punctuation and casing preserved. The scale-up training set for modern open-weights TTS.

Common Voice

30,000+ hours · 100+ languages · crowdsourced · 2019

Mozilla’s ongoing multilingual speech corpus. The go-to for multilingual open-weights TTS — though audio quality varies widely.

Practical tips for 2026

Don’t train from scratch. Frontier quality requires tens of thousands of hours of studio-grade voice acting. Unless you have a niche (a specific language, a specific voice type), start from XTTS v2 or Fish Speech 1.5 and finetune.

Latency is priced in. Cartesia and Deepgram lead on first-byte; ElevenLabs Turbo is close. For conversational agents the 150 ms line is the difference between natural and awkward — worth paying for.
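First-byte latency is easy to measure from your own network: wrap whatever chunk iterator the vendor SDK returns and time the first yield. A vendor-agnostic sketch, with a simulated stream standing in for the real SDK call:

```python
import time

def time_to_first_byte(chunks) -> tuple[float, bytes]:
    """Seconds from call to the first audio chunk of a streaming
    TTS response. `chunks` is any iterator of bytes, e.g. whatever
    your vendor SDK's streaming endpoint returns."""
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first

# Simulated vendor stream: ~120 ms to the first chunk.
def fake_stream():
    time.sleep(0.12)
    yield b"\x00" * 3200  # 100 ms of 16 kHz 16-bit mono PCM
    yield b"\x00" * 3200

ttfb, _ = time_to_first_byte(fake_stream())
print(f"first byte in {ttfb * 1000:.0f} ms")
```

Run this against each candidate endpoint from the region your agents actually run in; dashboard numbers are measured from the vendor's side of the network.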

Voice cloning is a consent problem, not a tech problem. The tech works. The risk is legal: impersonation, deepfake audio, brand hijacking. ElevenLabs Professional Cloning and Azure Custom Neural Voice both require consent-verified onboarding. If your vendor doesn’t, that’s a red flag.

Stream and cache. Streaming cuts perceived latency by half. Cache by hash(text + voice_id + params) — TTS is deterministic enough that 20–40% of requests in a production app are repeats.
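The cache-key idea above takes a few lines. A sketch, with the `synthesize` callable standing in for whatever vendor SDK call you use:

```python
import hashlib
import json

def tts_cache_key(text: str, voice_id: str, params: dict) -> str:
    """Deterministic key over everything that affects the audio.
    sort_keys makes {'speed': 1.0, 'fmt': 'mp3'} and its permutation
    hash identically."""
    payload = json.dumps(
        {"text": text, "voice_id": voice_id, "params": params},
        sort_keys=True, ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache: dict[str, bytes] = {}

def synthesize_cached(text, voice_id, params, synthesize):
    """Return cached audio if this exact request was seen before,
    otherwise call the vendor once and store the result."""
    key = tts_cache_key(text, voice_id, params)
    if key not in cache:
        cache[key] = synthesize(text, voice_id, params)
    return cache[key]
```

In production you would back `cache` with Redis or object storage rather than a dict, but the keying logic is the same.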

Evaluate on your own text. Vendor demos are hand-picked. Write 10 scripts from your actual domain — technical terms, names, long sentences, emotional beats — and AB them blind.
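Making the A/B blind is mostly bookkeeping: present each pair of clips under neutral labels in a random order and keep the answer key aside until rating is done. A sketch with placeholder provider names:

```python
import random

def blind_ab_trials(scripts, provider_a, provider_b, seed=42):
    """Build a blind A/B playlist. Raters see only trial ids and
    neutral clip labels; the provider mapping stays in the answer
    key until after they have rated."""
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    trials, answer_key = [], {}
    for i, script in enumerate(scripts):
        order = [provider_a, provider_b]
        rng.shuffle(order)
        trial_id = f"trial-{i:02d}"
        trials.append({"id": trial_id, "script": script,
                       "clips": ["clip_1", "clip_2"]})
        answer_key[trial_id] = {"clip_1": order[0], "clip_2": order[1]}
    return trials, answer_key
```

Ten domain scripts per provider pair is usually enough to surface pronunciation and prosody failures that a vendor demo never will.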

Run a TTS product? Claim your listing.

CodeSOTA’s TTS comparison is read by engineers evaluating providers. If you represent one of the vendors above — or a provider we missed — claim the listing to submit verified pricing, latency benchmarks, voice samples, and a demo link. Free; credibility-gated, not pay-to-play.

Reply within 48 hours · No newsletter

What were you looking for on text-to-speech?

Missing a provider, a column we skipped, or a use case you need help picking for? Tell us — we reply within 48 hours and update the page based on what readers actually ask.

Real humans read every message. We track what people are asking for and prioritize accordingly.