Kokoro · Hexgrad
Kokoro is the efficiency champion. Built on StyleTTS 2, it achieves the highest MOS in this comparison (4.2) with only 82M parameters, a fraction of the size of the other models here. It runs comfortably on CPU and reaches RTF 0.03 on GPU, meaning a ten-second clip synthesises in 0.3 seconds. It ships with curated style presets for different voices but does not support arbitrary voice cloning. As of early 2026 it supports 9 languages, including English, Japanese, Korean and the major European languages.
Architecture: StyleTTS 2 based
Best for: Narration, assistants
kokoro_example.py
# pip install "kokoro>=0.8" soundfile
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices: af_heart, af_bella, am_adam, am_michael, etc.
samples = pipe("Hello from Kokoro, the most efficient open-source TTS.", voice="af_heart", speed=1.0)

# The pipeline yields (graphemes, phonemes, audio) per text segment
for i, (gs, ps, audio) in enumerate(samples):
    sf.write(f"output_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
XTTS v2 · Coqui
XTTS v2 remains the gold standard for zero-shot voice cloning. With six seconds of reference audio it produces remarkably faithful voices across 17 languages. The architecture combines a GPT-style autoregressive model with a DVAE and a HiFi-GAN vocoder. The main caveat is the CPML licence, which restricts commercial use without a separate agreement; for commercial projects, the Apache-licensed Fish Speech is the closest alternative (F5-TTS matches the quality but is likewise non-commercial).
Architecture: GPT + DVAE + HiFi-GAN
Best for: Voice cloning, dubbing
xtts_example.py
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Zero-shot voice cloning from a ~6 s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)
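Because the reference clip defines the voice independently of language, one clip can drive output in any of the 17 supported languages. A sketch looping over a few language codes with the same reference (in practice you would translate the text per language):
xtts_multilingual_example.py
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Reuse one reference clip across several target languages
for lang in ["en", "fr", "de", "es"]:
    tts.tts_to_file(
        text="The same voice, now speaking another language.",
        speaker_wav="reference.wav",
        language=lang,
        file_path=f"output_{lang}.wav",
    )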
Bark · Suno
Bark is unique in its ability to generate non-speech audio alongside speech — laughter, music snippets, sighs, paralinguistic cues — using inline tags. The GPT-style autoregressive architecture makes it slower (RTF 0.85) and thirstier for VRAM (~6 GB), but for creative applications where expressive, varied audio is needed, Bark remains unmatched. MIT licence; suitable for commercial projects.
Architecture: GPT-style autoregressive
Best for: Creative audio, games
bark_example.py
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads model weights on first run

# Bark supports non-speech tags: [laughs], [sighs], [music], etc.
text = """Hello! [laughs] This is Bark speaking.
It can generate musical notes and even [sighs] emotions."""

audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)
Piper · Rhasspy
Piper is the go-to TTS for edge devices. Built on VITS / VITS2 and exported to ONNX, it achieves RTF 0.008 — a ten-second clip in eighty milliseconds. It runs entirely on CPU with under 100 MB of RAM. With 30+ pre-trained language models it is the most broadly multilingual option. The trade-off is lower naturalness (MOS 3.5) and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks and offline applications.
Architecture: VITS / VITS2 (ONNX)
Streaming: Yes (sentence-level)
Best for: RPi, offline, IoT
piper_example.py
# pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."

# Pipe the text to the piper CLI, which writes a WAV file
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
)
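Piper's sentence-level streaming comes from its --output-raw mode, which emits raw 16-bit mono PCM on stdout as each sentence completes. A sketch piping that straight into aplay for near-instant playback, assuming a 22050 Hz voice (check your model's JSON config for the actual rate):
piper_streaming_example.py
import subprocess

text = "Streaming sentence by sentence with almost no latency."

# piper --output-raw streams 16-bit mono PCM to stdout per sentence;
# pipe it into aplay for immediate playback (Linux/ALSA)
piper = subprocess.Popen(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output-raw"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
aplay = subprocess.Popen(
    ["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
    stdin=piper.stdout,
)
piper.stdin.write(text.encode())
piper.stdin.close()
aplay.wait()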
Fish Speech · Fish Audio
Fish Speech combines a VQGAN tokeniser with a Llama-based decoder to achieve strong voice cloning across 8 languages. It needs 10–30 seconds of reference audio, slightly more than XTTS v2, but ships under Apache 2.0, making it the best commercially friendly cloning option. A MOS of 4.1 puts it near the top for naturalness, and the architecture fine-tunes well on custom voices from relatively small datasets.
Architecture: VQGAN + Llama
Best for: Commercial cloning
fish_speech_example.py
# Install from the official repository; inference is exposed through the
# project's CLI tools, WebUI and API server rather than a stable pip API.
# The wrapper below is an illustrative sketch of the cloning flow only;
# check the Fish Speech docs for the current entry points.
from fish_speech.api import FishSpeechTTS  # hypothetical high-level wrapper

tts = FishSpeechTTS(device="cuda")

# Zero-shot cloning: 10-30s reference clip plus the text to synthesise
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)
Dia · Nari Labs
Dia is purpose-built for dialogue. Pass a script with speaker tags — [S1], [S2] — and it produces a natural multi-speaker conversation with appropriate prosody, pacing and turn-taking. At 1.6B parameters it is the largest model in this comparison, needing ~5 GB VRAM. It also handles non-verbal cues like laughter and hesitations. English-only today, but the dialogue capability is unmatched.
Architecture: Encoder-decoder transformer
Best for: Podcasts, audiobooks
dia_example.py
# pip install git+https://github.com/nari-labs/dia.git
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Multi-speaker dialogue with [S1]/[S2] speaker tags; parenthesised
# cues like (laughs) render as non-verbal audio
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. (laughs) It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""

audio = model.generate(dialogue)
model.save_audio("dialogue.wav", audio)
F5-TTS · SWivid
F5-TTS uses flow matching with a Diffusion Transformer (DiT) backbone. It reaches MOS 4.1 with 336M parameters and delivers strong zero-shot cloning from 5–15 seconds of reference audio. The flow-matching objective produces more consistent output than autoregressive approaches, avoiding the occasional artefacts of GPT-style TTS. The licence is CC-BY-NC 4.0, so non-commercial use only.
Architecture: Flow matching + DiT
Streaming: Yes (chunk-based)
Best for: Research, cloning
f5tts_example.py
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Zero-shot voice cloning via flow matching; the reference transcript
# guides alignment between the reference audio and the new text
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",
)
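The infer call exposes useful knobs beyond the basics; a sketch trading quality for speed via the number of flow-matching steps (parameter names taken from the f5_tts.api signature, defaults may shift between releases):
f5tts_tuning_example.py
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Fewer ODE solver steps = faster synthesis, slightly rougher audio;
# speed stretches or compresses the speaking rate
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="A faster, lower-step render of the same voice.",
    nfe_step=16,          # default is 32
    speed=1.1,            # 10% faster speech
    remove_silence=True,  # trim leading/trailing silence
    file_wave="output_fast.wav",
)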
Parler-TTS · Hugging Face
Parler-TTS takes a different route: instead of a reference clip, you describe the voice you want. “A warm female voice with a slight British accent, speaking clearly and calmly”, and the model attempts a match. That level of control makes it ideal for rapid prototyping without any reference recordings. MOS 3.8 is decent but not top-tier; the value is in the controllability and the Apache 2.0 licence.
Architecture: T5 + DAC decoder
Best for: Prototyping, content
parler_example.py
# pip install parler-tts
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

# The description controls the voice; the prompt is the text to speak
description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)