OpenAI's TTS (tts-1, tts-1-hd, gpt-4o-mini-tts) is the newcomer: three models, nine voices, flat pricing. Google Cloud TTS is the incumbent: 400+ voices, 50+ languages, full SSML, and the new Chirp 3 HD / Gemini 2.5 Flash TTS lines pushing quality back to the top.
Published rates and capabilities as of April 2026. Google prices by voice class (Standard, Neural2, Studio, Chirp 3 HD); the HD rate is the like-for-like comparison against OpenAI.
| Attribute | OpenAI TTS | Google Cloud TTS |
|---|---|---|
| Flagship model | gpt-4o-mini-tts / tts-1-hd | Chirp 3 HD / Gemini 2.5 Flash TTS |
| MOS (approx) | ~4.3 (hd) / ~4.0 (tts-1) | ~4.45–4.5 (Chirp 3 HD / Gemini) |
| Voices | 9 presets | 400+ (30 Chirp 3 HD personas) |
| Languages | ~57 (auto-detect) | 50+ (80+ locales for Gemini) |
| Voice cloning | Not supported | Instant Custom Voice (Chirp 3 HD) |
| SSML | None | Full |
| Steerability | instructions field (text) | SSML prosody + Gemini prompt control |
| Streaming | Yes (HTTP chunked) | Yes (gRPC streaming) |
| Price / 1M chars | $15 (mini / tts-1), $30 (tts-1-hd) | $4 (Standard), $16 (Neural2), $30 (HD) |
| Free tier | None | 1M/mo Standard, 100k Neural/HD |
| Best for | Apps inside OpenAI stack, prototypes | Contact centers, IVR, multilingual global apps |
Google's Standard voices at $4 per 1M characters are the cheapest credible option if a noticeably synthetic voice is acceptable. At the top, Chirp 3 HD and Gemini 2.5 Flash TTS edge out tts-1-hd on naturalness. OpenAI's gpt-4o-mini-tts sits in the middle ground most teams are looking for: $15 per 1M characters with near-top quality.
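The pricing comparison reduces to simple arithmetic, sketched below. The rates and free-tier allowances are copied from the table above; the tier keys and the helpers `monthly_cost` and `rank_tiers` are hypothetical names for illustration. The sketch assumes the free allowance is simply subtracted from monthly volume, which may not match how either provider actually bills.

```python
# USD per 1M characters, taken from the comparison table (tier keys are illustrative).
RATES_PER_1M = {
    "openai/gpt-4o-mini-tts": 15.0,
    "openai/tts-1": 15.0,
    "openai/tts-1-hd": 30.0,
    "google/standard": 4.0,
    "google/neural2": 16.0,
    "google/hd": 30.0,
}

# Free characters per month, taken from the table's "Free tier" row.
# Assumption: the allowance is subtracted before billing.
FREE_CHARS_PER_MONTH = {
    "google/standard": 1_000_000,
    "google/neural2": 100_000,
    "google/hd": 100_000,
}


def monthly_cost(tier: str, chars: int) -> float:
    """Estimated monthly cost in USD for `chars` characters on `tier`."""
    billable = max(0, chars - FREE_CHARS_PER_MONTH.get(tier, 0))
    return billable / 1_000_000 * RATES_PER_1M[tier]


def rank_tiers(chars: int) -> list[tuple[str, float]]:
    """All tiers sorted from cheapest to most expensive for a monthly volume."""
    costs = [(tier, monthly_cost(tier, chars)) for tier in RATES_PER_1M]
    return sorted(costs, key=lambda item: item[1])


if __name__ == "__main__":
    for tier, cost in rank_tiers(5_000_000):
        print(f"{tier:<24} ${cost:,.2f}")
```

At low volumes the free allowance dominates the ranking; at high volumes only the per-character rate matters, so it is worth running the estimate at your expected monthly volume.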
Figure: Pareto frontier, OpenAI vs Google, MOS vs cost (log-scale x-axis). OpenAI (green) clusters at commodity price; Google (blue) spans every tier.
Figure: Capability radar, OpenAI TTS vs Google Cloud TTS. Each axis is scored 0–10, qualitatively; higher is better.
Interactive: TTS cost calculator. Results are sorted from cheapest to most expensive. ElevenLabs effective rates vary by tier; the numbers shown are blended list prices for common tiers. Self-host costs exclude compute. For streaming voice-bot workloads, latency and concurrency matter at least as much as per-character price.
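Since latency matters for voice bots, it helps to measure it the same way for every provider. The sketch below times any stream of audio chunks; `time_stream` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real provider response. With a real client you would pass a callable that issues the request and returns its chunk iterator, so the clock starts before the request goes out.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    first_chunk_s: float  # seconds until the first non-empty chunk arrived
    total_s: float        # seconds until the stream was exhausted
    total_bytes: int      # audio bytes received


def time_stream(open_stream: Callable[[], Iterable[bytes]]) -> StreamTiming:
    """Start the clock, open the stream, and time chunks as they arrive."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in open_stream():
        if first_chunk_s is None and chunk:
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_chunk_s is None:  # no audio arrived at all
        first_chunk_s = total_s
    return StreamTiming(first_chunk_s, total_s, total_bytes)


def fake_stream(delay_s: float = 0.01, n_chunks: int = 3, size: int = 1024) -> Iterator[bytes]:
    """Hypothetical stand-in for a provider stream: waits, then yields silence."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * size


if __name__ == "__main__":
    timing = time_stream(lambda: fake_stream())
    print(f"first chunk: {timing.first_chunk_s * 1000:.0f} ms, "
          f"total: {timing.total_s * 1000:.0f} ms, bytes: {timing.total_bytes}")
```

Time to first chunk is usually the number a caller perceives as responsiveness; total time matters more for batch jobs.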
“Your package has been delivered. Thank you for shopping with us.”
Quality is close at the top: gpt-4o-mini-tts ≈ Chirp 3 HD ≈ Gemini 2.5 Flash TTS on short-form. Google's prosody control edges ahead on long-form. The decision is rarely about MOS; it's about SSML, locale coverage, cloning, and which cloud you already pay.
Choose OpenAI TTS if you are English-first, already on OpenAI, want flat, cheap pricing, and prefer describing tone in text rather than authoring SSML. It is a great default for consumer apps, notifications, and read-aloud.
Choose Google Cloud TTS if you ship in 5+ languages, need SSML (break timing, say-as, emphasis), need voice cloning, or have enterprise procurement on GCP. It is essential for IVR and contact center workloads.
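To make the SSML features named above concrete, here is a minimal sketch that assembles a prompt using `<break>`, `<say-as>`, and `<emphasis>`. The helper `build_shipping_ssml` and its parameters are hypothetical; the tags themselves are standard SSML, but which ones a given voice honors varies by voice class, so check the provider's documentation before relying on any of them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_shipping_ssml(order_id: str, pause_ms: int = 400) -> str:
    """Build an SSML prompt that spells out an order ID and pauses before the sign-off."""
    safe_id = escape(order_id)  # escape &, <, > so user data cannot break the markup
    return (
        "<speak>"
        "Your order "
        f'<say-as interpret-as="characters">{safe_id}</say-as> '
        'has <emphasis level="strong">shipped</emphasis>.'
        f'<break time="{int(pause_ms)}ms"/>'
        "Thank you for shopping with us."
        "</speak>"
    )


if __name__ == "__main__":
    ssml = build_shipping_ssml("A7&B2")
    ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
    print(ssml)
```

The resulting string would go in the `ssml` field of the request input rather than the plain-text field. Escaping user-supplied values matters because a stray `&` or `<` otherwise produces invalid markup.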
OpenAI ships a one-line client. Google requires GCP credentials and the texttospeech client; SSML input unlocks the full prosody surface.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream the synthesized audio straight to disk.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="sage",
    input="OpenAI keeps TTS simple and steerable.",
    instructions="Speak slowly and reassuringly.",
) as resp:
    resp.stream_to_file("out.mp3")
```

```python
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()  # uses Application Default Credentials

# SSML input instead of plain text.
synthesis_input = texttospeech.SynthesisInput(
    ssml="<speak>Google supports full <emphasis>SSML</emphasis>.</speak>",
)
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Chirp3-HD-Charon",  # Chirp 3 HD voice
)
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16,
)
response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config,
)

# Write the returned audio bytes to a WAV file.
with open("out.wav", "wb") as out:
    out.write(response.audio_content)
```