Best TTS for voice cloning
Voice cloning has moved from novelty to infrastructure. Ten seconds of reference audio is enough for most models to produce convincing output. This page compares vendors on fidelity, sample requirements, and consent tooling — because the ethical defaults matter as much as the speaker-similarity score.
TL;DR
- Best fidelity: ElevenLabs Professional Voice Clone. Needs 30+ min of studio audio; near-indistinguishable.
- Best instant clone: PlayHT 3.0 or ElevenLabs IVC, from 60 seconds of input.
- Best self-host: F5-TTS (MIT code; the released pretrained weights are CC-BY-NC) or XTTS-v2 (CPML). Zero-shot, 10s reference.
- Best streaming clone: Cartesia, the only sub-100ms TTFB vendor with cloning.
How voice cloning actually works
Modern zero-shot cloning is not fine-tuning. A pretrained speaker encoder reads the reference audio and emits a fixed-length embedding that conditions the acoustic model at inference. The acoustic model was already trained on thousands of speakers, so it knows how voices differ; the embedding just tells it which voice to render. Fine-tuned modes, such as ElevenLabs Professional, go further and adapt the model to a much longer recording.
Figure: How voice cloning works. Reference audio is encoded into a speaker embedding that conditions the acoustic model at inference.
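To make the "fixed-length embedding" idea concrete, here is a toy sketch in plain Python. It is not how any vendor's encoder works: real speaker encoders are learned neural networks, and mean-pooling plus concatenation are stand-ins chosen for illustration. The function names `speaker_embedding` and `condition` are hypothetical.

```python
import math


def speaker_embedding(frames: list[list[float]]) -> list[float]:
    """Mean-pool per-frame features into one fixed-length, unit-norm vector.

    Assumption: mean-pooling stands in for a learned speaker encoder.
    """
    if not frames:
        raise ValueError("reference audio produced no frames")
    dim = len(frames[0])
    pooled = [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def condition(text_frames: list[list[float]], embedding: list[float]) -> list[list[float]]:
    """Append the same speaker embedding to every text frame.

    Assumption: concatenation is one simple conditioning scheme among several.
    """
    return [frame + embedding for frame in text_frames]


# References of different lengths yield embeddings of the same size.
short_ref = [[0.2, 0.9, 0.1]] * 50
long_ref = [[0.2, 0.9, 0.1]] * 500
emb = speaker_embedding(short_ref)
assert len(emb) == len(speaker_embedding(long_ref)) == 3

text_frames = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for encoded text
conditioned = condition(text_frames, emb)
print(len(conditioned), len(conditioned[0]))  # 2 5
```

The point of the sketch is the shape of the data: however long the reference is, the acoustic model only ever sees one vector of fixed size alongside the text.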
Vendor capability radar
Figure: Capability radar. Cloning vendors across six axes, each scored 0-10 (higher is better). The overlay shows trade-offs.
Side-by-side
| Vendor | Mode | Min sample | Fidelity | Streaming | License / hosting |
|---|---|---|---|---|---|
| ElevenLabs Professional | Fine-tuned | 30+ min | 9.5/10 | Yes (Flash) | Hosted, $99+/mo |
| ElevenLabs Instant (IVC) | Zero-shot | 60s | 8.5/10 | Yes | Hosted, $22+/mo |
| PlayHT 3.0 | Zero-shot or fine-tune | 30-60s | 9/10 | Yes | Hosted, $39+/mo |
| Cartesia | Zero-shot | 15s | 8/10 | Yes (<100ms) | Hosted, usage-based |
| Google Chirp 3 HD | Zero-shot (Custom Voice) | 10s | 7.5/10 | Yes | GCP, usage-based |
| F5-TTS | Zero-shot | 10s | 8/10 | Limited | MIT code (pretrained weights CC-BY-NC), self-host |
| XTTS-v2 (Coqui) | Zero-shot | 6s | 7.5/10 | No | CPML (non-commercial), self-host |
| Fish Speech (OpenAudio-S1) | Zero-shot | 10s | 8/10 | Yes | CC-BY-NC, self-host |
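If you want to shortlist programmatically, the table can be treated as data. The sketch below is only an illustration: the `Vendor` type and `shortlist` function are hypothetical, the numbers are copied from the table above, and PlayHT's "30-60s" is recorded as its lower bound of 30 seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vendor:
    name: str
    min_sample_s: int  # minimum reference audio, in seconds
    fidelity: float    # score out of 10, taken from the table
    self_host: bool


VENDORS = [
    Vendor("ElevenLabs Professional", 30 * 60, 9.5, False),
    Vendor("ElevenLabs Instant (IVC)", 60, 8.5, False),
    Vendor("PlayHT 3.0", 30, 9.0, False),  # lower bound of "30-60s"
    Vendor("Cartesia", 15, 8.0, False),
    Vendor("Google Chirp 3 HD", 10, 7.5, False),
    Vendor("F5-TTS", 10, 8.0, True),
    Vendor("XTTS-v2 (Coqui)", 6, 7.5, True),
    Vendor("Fish Speech (OpenAudio-S1)", 10, 8.0, True),
]


def shortlist(available_s: int, self_host_only: bool = False) -> list[str]:
    """Return vendors usable with `available_s` seconds of audio, best fidelity first."""
    candidates = [
        v for v in VENDORS
        if v.min_sample_s <= available_s and (v.self_host or not self_host_only)
    ]
    return [v.name for v in sorted(candidates, key=lambda v: v.fidelity, reverse=True)]


print(shortlist(20))
print(shortlist(20, self_host_only=True))
```

Fidelity is only one axis; license terms and streaming support from the same table should be checked before committing to anything on the shortlist.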
Reference vs cloned fingerprint
A well-cloned voice matches the reference's formant positions and harmonic density. Cheap clones get the timbre wrong but fake the pitch — audible as an uncanny-valley effect.
“This voice is a clone trained on a short reference sample. Can you tell it apart?”
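A common way to put a number on how closely a clone matches its reference is the cosine similarity between speaker embeddings of the two recordings. The sketch below assumes you already have embeddings from some speaker-verification model; the vectors shown are made-up values for illustration, not measurements.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embeddings: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


# Made-up embedding values, for illustration only.
reference = [0.8, 0.1, 0.6]
good_clone = [0.79, 0.12, 0.61]
poor_clone = [0.1, 0.9, 0.2]

better = cosine_similarity(reference, good_clone) > cosine_similarity(reference, poor_clone)
print(better)  # True
```

A high score is necessary but not sufficient: a clone can score well on an embedding metric and still sound off to listeners, which is why the audio samples matter.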
Listen: one reference, four clones
“This voice is a clone trained on a short reference sample. Can you tell it apart?”
Minimal cloning code
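Hosted: ElevenLabs IVC

    # ElevenLabs Instant Voice Clone: 60s of consented audio.
    # pip install elevenlabs
    from elevenlabs.client import ElevenLabs

    client = ElevenLabs(api_key="sk_...")  # placeholder key

    # Upload the reference recording and create the instant clone.
    with open("alice_reading_script.wav", "rb") as ref:
        voice = client.voices.ivc.create(
            name="Alice (consented)",
            files=[ref],
            description="Alice gave written consent on 2026-02-14. See consent-ledger.md.",
        )

    # convert() returns an iterator of audio byte chunks.
    audio = client.text_to_speech.convert(
        voice_id=voice.voice_id,
        model_id="eleven_multilingual_v2",
        text="This voice is a clone trained on Alice's consented sample.",
    )

    # Write the chunks to disk.
    with open("alice_clone.mp3", "wb") as out:
        for chunk in audio:
            out.write(chunk)

Self-host: F5-TTS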
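    # F5-TTS: MIT-licensed code, flow-matching. Zero-shot cloning from 10s reference.
    # pip install f5-tts
    from f5_tts.api import F5TTS

    # Constructor arguments vary between f5-tts releases (newer ones may name
    # the first argument differently), so check your installed version.
    model = F5TTS(model_type="F5-TTS", ckpt_file="F5-TTS/ckpts/model.pt")

    # infer() returns the waveform, its sample rate, and a spectrogram.
    audio, sr, _spec = model.infer(
        ref_file="alice_reference.wav",
        ref_text="This is a reference sample from Alice.",
        gen_text="And this is new text in Alice's voice.",
    )

Consent, ethics, and law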
Voice cloning without consent is at minimum a civil wrong in most jurisdictions and a crime in some. The 2024 FCC ruling declaring AI-cloned voice robocalls illegal under the TCPA was a warning shot. The EU AI Act classifies voice cloning as limited-risk with mandatory disclosure.
Production minimums
- Written, timestamped consent tied to the specific reference audio.
- Per-clone audit log of prompts generated.
- Watermark output audio, or otherwise make it detectable after the fact (Resemble AI, ElevenLabs AI Speech Classifier, or SileroWM).
- Rate limiting and prompt moderation to block impersonation of public figures.
- “AI-generated voice” disclosure on output, per EU AI Act.
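The first two items on this list can be sketched with the standard library alone. This is a minimal illustration under assumptions, not a compliance implementation: `ConsentRecord`, `record_consent`, `authorize`, and `audit_entry` are hypothetical names, and hashing the reference file is just one way to tie consent to a specific recording.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentRecord:
    speaker: str
    audio_sha256: str  # fingerprint of the exact reference audio consented to
    consented_at: str  # ISO 8601 timestamp, UTC


def record_consent(speaker: str, audio: bytes) -> ConsentRecord:
    """Create a timestamped consent record bound to one reference recording."""
    digest = hashlib.sha256(audio).hexdigest()
    return ConsentRecord(speaker, digest, datetime.now(timezone.utc).isoformat())


def authorize(record: ConsentRecord, audio: bytes) -> bool:
    """Allow cloning only from the exact audio the consent was given for."""
    return hashlib.sha256(audio).hexdigest() == record.audio_sha256


def audit_entry(record: ConsentRecord, prompt: str) -> str:
    """Serialize one generated prompt as a JSON line for the per-clone log."""
    return json.dumps({
        "at": datetime.now(timezone.utc).isoformat(),
        "speaker": record.speaker,
        "audio_sha256": record.audio_sha256,
        "prompt": prompt,
    })


reference = b"stand-in for the bytes of alice_reference.wav"
consent = record_consent("Alice", reference)
print(authorize(consent, reference))             # True
print(authorize(consent, b"a different recording"))  # False
print(audit_entry(consent, "Hello from the cloned voice."))
```

In practice the records and log lines would go to append-only storage so they can serve as evidence later; where and how to store them is outside this sketch.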
Vendor consent features
- ElevenLabs: mandatory voice verification via spoken phrase for Pro clones.
- Google Chirp 3 HD Custom Voice: requires consent statement embedded in reference audio.
- PlayHT: identity verification + consent form on clone creation.
- Open-source: no gates. Build your own — or you personally own the liability.
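For self-hosted models, one gate you could build yourself mirrors the spoken-consent approach: transcribe the reference audio with a speech recognizer of your choice and refuse to clone unless the transcript contains a required statement. The sketch below covers only the text check. The consent wording and the function names are hypothetical, and the transcript is assumed to come from an ASR step that is not shown.

```python
import re

# Hypothetical wording; use whatever statement your legal review approves.
REQUIRED_CONSENT = "I consent to the creation of a synthetic version of my voice"


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(cleaned.split())


def has_consent_statement(transcript: str, required: str = REQUIRED_CONSENT) -> bool:
    """True if the transcript contains the required statement as whole words."""
    # Padding with spaces prevents matches that start or end mid-word.
    return f" {_normalize(required)} " in f" {_normalize(transcript)} "


ok = "Hi, this is Alice. I consent to the creation of a synthetic version of my voice."
missing = "Hi, this is Alice reading a paragraph from a novel."
print(has_consent_statement(ok))       # True
print(has_consent_statement(missing))  # False
```

A matching transcript shows that the phrase was spoken, not who spoke it, so this check complements identity verification and written consent and does not replace them.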