Long-form · Updated April 2026

Best TTS for podcasts

Podcasts punish TTS in ways short-form content doesn't. Twenty-minute stamina, natural co-host banter, unusual proper nouns, and pacing across long pauses are what separate ElevenLabs v3 and Google NotebookLM from the rest of the pack.

TL;DR

  • Most natural long-form solo: ElevenLabs v3 with audio tags, 4.8 MOS on 20-minute passages.
  • Best two-voice show: Google NotebookLM's Audio Overview, which generates a full back-and-forth from a document. Free.
  • Best production pipeline: PlayHT 3.0 with voice cloning for a branded host voice.
  • Best self-host: Sesame CSM for dialogue, F5-TTS for cloned-host narration.

Prosody: why podcasts sound robotic

Bad TTS feels robotic on long passages mainly because its pitch is flat. Natural speakers drop F0 on declaratives, rise on questions, and pause 150-400 ms at semantic boundaries. The top-tier models get this right; commodity models flatten everything into a monotone.

Prosody curves (F0 in Hz plus energy envelope; y-axis 100-250 Hz, x-axis syllable position), both rendered on the sample sentence "State-space models are Transformers with selective memory — let me explain why that matters."

  • ElevenLabs v3: contour with prosodic breaks (||) marked.
  • Commodity TTS (tts-1): contour with no prosodic breaks marked.

F0 pitch range is ~50Hz for expressive TTS, <20Hz for flat TTS. Prosodic breaks (||) mark where a listener expects a pause. Commodity models rarely insert them.
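As a rough illustration of the range figures above, the following sketch labels a pitch contour by its F0 span. It assumes you already have per-frame F0 values in Hz from some pitch tracker (not shown), with unvoiced frames stored as 0 or None. The function names and the "middling" label for the in-between band are illustrative and not from any library; the two thresholds are the approximate ones quoted above.

```python
# Sketch: classify a pitch contour by its F0 range.
# Thresholds follow the text (~50 Hz expressive, <20 Hz flat);
# function names and the "middling" label are assumptions for illustration.

def f0_range_hz(f0_contour):
    """Return max - min F0 over voiced frames (unvoiced frames are 0 or None)."""
    voiced = [f for f in f0_contour if f]
    if not voiced:
        return 0.0
    return float(max(voiced) - min(voiced))


def classify_prosody(f0_contour, expressive_hz=50.0, flat_hz=20.0):
    """Label a contour as expressive, flat, or somewhere in between."""
    span = f0_range_hz(f0_contour)
    if span >= expressive_hz:
        return "expressive"
    if span < flat_hz:
        return "flat"
    return "middling"


# Toy contours, not measurements from any real model.
falling_declarative = [180, 195, 210, 0, 170, 160, 150]  # one unvoiced frame
monotone = [150, 152, 149, 151, 150, 148]

print(classify_prosody(falling_declarative), classify_prosody(monotone))
```

A single range number ignores where the pitch moves, so it is only a quick screen; it says nothing about whether pauses land at semantic boundaries.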

Long-form capability radar

Capability radar: podcast-grade TTS. Each axis is scored 0-10; higher is better. The overlay shows trade-offs.

  • Axes: Naturalness, Long-form stamina, Multi-voice, Pronunciation, Emotion control, Cost
  • Models: ElevenLabs v3, NotebookLM, Cartesia Sonic 2, PlayHT 3.0

Voice fingerprints: solo narrator

ElevenLabs v3 · Hope · narrator
Mel spectrogram (0-8 kHz, 0.0-2.0 s). Expressive long-form: wide dynamic range, formant richness.

PlayHT 3.0 · Cloned host
Mel spectrogram (0-8 kHz, 0.0-2.0 s). Cloned branded voice: consistent timbre across episodes.

Two-voice dialogue

# Two-voice podcast with ElevenLabs v3 (audio tags + voice switching).
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="sk_...")

# Map script speaker names to ElevenLabs voice IDs.
# Placeholder values: substitute the IDs from your own voice library.
VOICES = {
    "rachel": "RACHEL_VOICE_ID",
    "adam": "ADAM_VOICE_ID",
}

script = [
    {"voice": "rachel", "text": "[warm] Welcome back. Today we're talking about state-space models."},
    {"voice": "adam",   "text": "[curious] Linear attention, but make it recurrent, right?"},
    {"voice": "rachel", "text": "[laughs] Roughly. Let's actually define what 'selective' means here."},
]

# Render one turn per API call and append the MP3 bytes to a single file.
with open("episode.mp3", "wb") as f:
    for turn in script:
        # convert() yields the rendered audio as chunks of bytes.
        for chunk in client.text_to_speech.convert(
            voice_id=VOICES[turn["voice"]],
            model_id="eleven_v3",
            text=turn["text"],
            output_format="mp3_44100_128",
        ):
            f.write(chunk)

For a no-code approach, drop any article into Google NotebookLM and it generates a ~10-minute two-host podcast using the multi-speaker mode of Gemini 2.5 Flash TTS. The result is remarkably natural, but there are few editing knobs.

Listen: 30-second long-form clips

  • ElevenLabs v3 · Hope (eleven_v3): intro monologue to a tech podcast. Sample TBD.
  • Google NotebookLM · Hosts A & B (Gemini 2.5 Flash TTS): auto-generated two-host banter. Sample TBD.
  • PlayHT 3.0 · Cloned host (Play 3.0 Mini): long-form solo narration. Sample TBD.
  • Cartesia · British Narrator (sonic-2): long-form solo narration. Sample TBD.

Where the quality ceiling is

Long-form TTS quality has plateaued near 4.7-4.8 MOS since late 2024. The remaining gap to human narration is in disfluencies, micro-pauses, and context-aware intonation — not timbre.

Evolution: TTS quality over time

MOS per release, 2022-2026 (y-axis 4.0-5.0, with the human reference at 5.0). Quality has plateaued near 4.7-4.8; the action is now in latency and steerability.

  • ElevenLabs: v1, v2, Turbo v2.5, Flash v2.5, v3 (alpha)
  • OpenAI: tts-1, tts-1-hd, gpt-4o-mini-tts
  • Cartesia: Sonic, Sonic 2
  • Google: WaveNet, Neural2, Chirp 3 HD, Gemini 2.5 Flash TTS

Practical long-form tactics

Scripting

  • Chunk scripts into 2-4 sentence blocks — most models drift past ~500 chars.
  • Spell unusual names phonetically: "Karpathy" → "kar-PATH-ee".
  • Insert explicit commas for natural pauses; em-dashes for dramatic ones.
  • Use audio tags (ElevenLabs v3) sparingly — overuse sounds forced.
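The chunking and respelling tactics above can be sketched with the standard library alone. This is a minimal version under stated assumptions: sentence boundaries are detected with a naive regex on ., ! and ?, the 4-sentence and 500-character limits are the rough figures from the list, and the names `respell`, `chunk_script` and `PRONUNCIATIONS` are hypothetical helpers rather than part of any TTS SDK.

```python
import re

# Hypothetical pronunciation table; the single entry is the example from the list.
PRONUNCIATIONS = {"Karpathy": "kar-PATH-ee"}


def respell(text, table=PRONUNCIATIONS):
    """Replace whole-word matches with their phonetic respelling."""
    for name, phonetic in table.items():
        text = re.sub(rf"\b{re.escape(name)}\b", phonetic, text)
    return text


def chunk_script(text, max_sentences=4, max_chars=500):
    """Group sentences into blocks of at most max_sentences and max_chars.

    A single sentence longer than max_chars is kept whole as its own block.
    """
    # Naive split: whitespace that follows ., ! or ? ends a sentence.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], []
    for sentence in sentences:
        if not sentence:
            continue
        candidate = " ".join(current + [sentence])
        # Flush the current block before it grows past either limit.
        if current and (len(current) >= max_sentences or len(candidate) > max_chars):
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


for block in chunk_script(respell("Karpathy coined the term. It stuck. Why? Nobody knows. Ask him.")):
    print(block)
```

The sketch caps blocks at four sentences but does not enforce the two-sentence minimum, and the regex will mis-split abbreviations such as "Dr."; a real pipeline would likely want a proper sentence tokenizer.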

Post-production

  • Normalize to -16 LUFS integrated — the podcast standard.
  • High-pass at 80Hz to remove vocoder rumble.
  • Render per-turn and concatenate with 350ms gaps, not a single long call.
  • Re-render any turn that trips on a name; keep a pronunciation dictionary.
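One way to wire up the gap and mastering steps above is sketched below. It assumes each turn has already been decoded to raw 16-bit signed PCM at one shared sample rate, and that ffmpeg is installed for the filtering step. The function and file names are illustrative. The command is only built here, not run, and single-pass `loudnorm` is an approximation (a two-pass measurement is generally more accurate).

```python
# Sketch: join per-turn PCM audio with fixed silence gaps, then build an
# ffmpeg command for the high-pass and loudness-normalization steps.
# Assumptions: 16-bit signed PCM, one shared sample rate, illustrative names.

BYTES_PER_SAMPLE = 2  # 16-bit PCM, where silence is all-zero bytes


def silence(gap_ms, sample_rate=44100, channels=1):
    """Return gap_ms of digital silence as raw PCM bytes."""
    frames = sample_rate * gap_ms // 1000
    return b"\x00" * (frames * channels * BYTES_PER_SAMPLE)


def join_turns(turns, gap_ms=350, sample_rate=44100, channels=1):
    """Concatenate PCM turns with a gap between (not after) each turn."""
    return silence(gap_ms, sample_rate, channels).join(turns)


def mastering_command(src, dst):
    """Build ffmpeg args: 80 Hz high-pass, then loudnorm at -16 LUFS integrated."""
    return ["ffmpeg", "-i", src, "-af", "highpass=f=80,loudnorm=I=-16", dst]


# Two tiny fake turns, just to show the byte arithmetic.
episode = join_turns([b"\x01\x00" * 100, b"\x02\x00" * 100])
print(len(episode))
print(" ".join(mastering_command("episode.wav", "episode_mastered.wav")))
```

Working on decoded PCM rather than MP3 bytes keeps the gap length exact; the joined audio can be written to a WAV container (for example with the standard `wave` module) before it is passed to ffmpeg.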
