Gradium TTS · Gradium
Gradium belongs in the vendor track, not the open-weight track. It now has a live CodeSOTA Elo row for the Russell voice at 1527 Elo from 24 blind same-prompt votes, plus a measured hard-text run: 13.4% WER, 73.3% critical-entity accuracy, and roughly 206 ms p50 first-byte latency on the Audrey run. That makes it a real voice-agent candidate rather than a MOS-only catalog entry.
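The scoring code behind the WER and critical-entity figures is not shown here. As a rough sketch of how such numbers are commonly computed, word error rate is the word-level edit distance between the reference text and an ASR transcript of the synthesised audio, divided by the reference length. The function names, the lower-casing normalisation and the substring-based entity check are assumptions for illustration, not the CodeSOTA methodology.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def entity_accuracy(entities: list[str], hypothesis: str) -> float:
    """Fraction of critical entities that appear verbatim in the transcript."""
    if not entities:
        raise ValueError("entities must not be empty")
    hyp = hypothesis.lower()
    hits = sum(1 for entity in entities if entity.lower() in hyp)
    return hits / len(entities)


reference = "revenue increased by 17.8 percent to 4.2 million dollars"
transcript = "revenue increased by 17.8 percent to 4 million dollars"
print(f"WER: {word_error_rate(reference, transcript):.3f}")
print(f"Entity accuracy: {entity_accuracy(['17.8 percent', '4.2 million'], transcript):.2f}")
```

One misread number costs a single word in WER but half of the entities in this toy case, which is why the two metrics are worth reporting side by side.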
Preference: 1527 Elo · Russell
Measured run: 13.4% WER · 73.3% entity
Best for: Realtime hosted TTS
gradium_example.py (Python)
# Hosted vendor API example shape
# Use your Gradium API key and selected voice ID from the dashboard.
import os

import requests

resp = requests.post(
    "https://api.gradium.ai/v1/tts",
    # Read the key from the environment; "$GRADIUM_API_KEY" inside a Python string is not expanded.
    headers={"Authorization": f"Bearer {os.environ['GRADIUM_API_KEY']}"},
    json={
        "voice": "russell",
        "text": "The quarterly revenue increased by 17.8 percent to 4.2 million dollars.",
        "format": "wav",
        "stream": True,
    },
    timeout=30,
)
resp.raise_for_status()
# resp.content buffers the whole response body before it is written out.
with open("gradium.wav", "wb") as f:
    f.write(resp.content)
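The first-byte latency quoted for Gradium is a time-to-first-chunk figure, which can only be observed when the response is read incrementally. The sketch below shows one way to take a p50 over several runs. The helper names are hypothetical, and `fake_stream` stands in for the network so the example runs offline.

```python
import statistics
import time
from typing import Callable, Iterator


def time_to_first_chunk(open_stream: Callable[[], Iterator[bytes]]) -> float:
    """Seconds from opening the stream until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in open_stream():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any audio bytes")


def p50_first_chunk_ms(open_stream: Callable[[], Iterator[bytes]], runs: int = 5) -> float:
    """Median time-to-first-chunk over several runs, in milliseconds."""
    samples = [time_to_first_chunk(open_stream) * 1000 for _ in range(runs)]
    return statistics.median(samples)


def fake_stream() -> Iterator[bytes]:
    """Stand-in for a network stream: about 20 ms before the first chunk."""
    time.sleep(0.02)
    yield b"RIFF"
    yield b"more audio bytes"


print(f"p50 first chunk: {p50_first_chunk_ms(fake_stream):.1f} ms")
```

With `requests`, `open_stream` could be a lambda that calls `requests.post(..., stream=True)` and returns `resp.iter_content(chunk_size=4096)`. Whether the vendor endpoint actually delivers audio incrementally in that mode is an assumption to verify against its documentation.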
Kokoro · Hexgrad
Kokoro is the efficiency champion and the strongest open-weight row in the current Elo pool. Built on StyleTTS 2, it has 82M parameters, roughly 20 times fewer than the 1.6B-parameter Dia covered below. It runs comfortably on CPU and reaches RTF 0.03 on GPU, which means a ten-second clip synthesises in 0.3 seconds. It ships with curated voice presets but does not do arbitrary voice cloning. As of early 2026 it supports 9 languages including English, Japanese, Korean and major European languages.
Architecture: StyleTTS 2 based
Preference: 1424 Elo · am_michael
Best for: Local narration, assistants
kokoro_example.py (Python)
# pip install "kokoro>=0.8" soundfile   (quote the spec so the shell does not treat >= as a redirect)
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices: af_heart, af_bella, am_adam, am_michael, etc.
samples = pipe("Hello from Kokoro, the most efficient open-source TTS.", voice="af_heart", speed=1.0)
# The pipeline yields one (graphemes, phonemes, audio) tuple per text segment.
for i, (gs, ps, audio) in enumerate(samples):
    sf.write(f"output_{i}.wav", audio, 24000)  # 24 kHz output
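To put the Elo rows in perspective, the standard Elo formula converts a rating gap into an expected head-to-head preference rate. The sketch below applies it to the two ratings quoted in this comparison: 1527 for Gradium's Russell voice and 1424 for Kokoro's am_michael. It assumes the CodeSOTA pool uses the conventional 400-point logistic scale, and the K-factor of 32 is likewise an assumption. Neither is stated in the text.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head votes won by A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def elo_update(rating: float, expected: float, actual: float, k: float = 32.0) -> float:
    """New rating after one vote; actual is 1.0 for a win and 0.0 for a loss."""
    return rating + k * (actual - expected)


gradium, kokoro = 1527, 1424
p = elo_expected_score(gradium, kokoro)
print(f"Expected preference for the 1527 row over the 1424 row: {p:.1%}")
```

Under those assumptions a 103-point gap corresponds to the higher-rated voice being preferred in roughly 64% of pairings. With only 24 votes behind the 1527 row, that estimate carries a wide margin.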
XTTS v2 · Coqui
XTTS v2 remains a strong reference point for zero-shot voice cloning. With six seconds of reference audio it produces remarkably faithful voices across 17 languages. The architecture combines a GPT-style autoregressive model with a DVAE and a HiFi-GAN vocoder. The main caveat is the CPML licence, which restricts commercial use without a separate agreement. Of the other cloning models covered here, F5-TTS is also non-commercial (CC-BY-NC 4.0), so Fish Speech is the alternative to evaluate for commercial projects.
Architecture: GPT + DVAE + HiFi-GAN
Best for: Voice cloning, dubbing
xtts_example.py (Python)
# pip install TTS
from TTS.api import TTS

# Load the model, then move it to the GPU (use "cpu" if no GPU is available).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
# Zero-shot voice cloning from a 6s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)
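The cloning models in this comparison quote different reference lengths: about 6 seconds for XTTS v2, 10–30 seconds for Fish Speech and 5–15 seconds for F5-TTS. A short pre-flight check on the reference clip can catch clips that are too short before a model is loaded. This is a minimal sketch using only the standard-library `wave` module. The helper names are hypothetical, and it handles uncompressed PCM WAV files only.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str, min_seconds: float = 6.0) -> float:
    """Return the clip duration, or raise if it is below the quoted minimum."""
    duration = wav_duration_seconds(path)
    if duration < min_seconds:
        raise ValueError(f"{path} is {duration:.1f}s; need at least {min_seconds:.1f}s")
    return duration


# Demo: write 7 s of 16-bit mono silence at 16 kHz, then validate it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 7)

print(f"reference clip: {check_reference_clip(demo_path):.1f} s")
```

Passing `min_seconds=10.0` would apply the Fish Speech lower bound instead.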
Bark · Suno
Bark is unusual in its ability to generate non-speech audio alongside speech (laughter, music snippets, sighs and other paralinguistic cues) using inline tags. The GPT-style autoregressive architecture makes it slower (RTF 0.85) and heavier on VRAM (~6 GB), but for creative applications that need expressive, varied audio it remains the most flexible option in this comparison. MIT licence; suitable for commercial projects.
Best for: Creative audio, games
bark_example.py (Python)
# pip install git+https://github.com/suno-ai/bark.git
# (the "bark" package on PyPI is an unrelated project)
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads and caches the model weights on first run
# Bark supports non-speech: laughter, music, hesitations
text = """Hello! [laughs] This is Bark speaking.
It can generate musical notes and even [sighs] emotions."""
audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)
Piper · Rhasspy
Piper is the go-to TTS for edge devices. Built on VITS / VITS2 and exported to ONNX, it achieves RTF 0.008, so a ten-second clip takes about eighty milliseconds. It runs entirely on CPU with under 100 MB of RAM. With pre-trained voices for 30+ languages it is the most broadly multilingual option here. The trade-off is lower perceived naturalness in its published evaluations and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks and offline applications.
Architecture: VITS / VITS2 (ONNX)
Streaming: Yes (sentence-level)
Best for: RPi, offline, IoT
piper_example.py (Python)
# Install: pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."
# Piper reads the text to speak from stdin and writes a WAV file.
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
    check=True,  # raise if the piper CLI exits with an error
)
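The real-time factor (RTF) figures quoted throughout this comparison translate to wall-clock time by a single multiplication: synthesis time equals RTF times audio duration. The sketch below applies that to the three RTF values given in the text. The function name is illustrative, and the hardware labels repeat only what the text states.

```python
def synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock synthesis time implied by a real-time factor."""
    return rtf * audio_seconds


# RTF values as quoted in this comparison
quoted_rtf = {"Piper (CPU)": 0.008, "Kokoro (GPU)": 0.03, "Bark": 0.85}

for name, rtf in quoted_rtf.items():
    ms = synthesis_seconds(rtf, 10.0) * 1000
    print(f"{name}: 10 s clip in about {ms:.0f} ms ({1 / rtf:.1f}x real time)")
```

An RTF below 1.0 means synthesis is faster than playback. For sentence-level streaming the same multiplication applies to each sentence rather than to the whole clip.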
Fish Speech · Fish Audio
Fish Speech combines a VQGAN tokeniser with a Llama-based decoder to achieve strong voice cloning across 8 languages. It requires 10–30 seconds of reference audio, more than XTTS v2, but ships under Apache 2.0, which makes it the most commercially friendly cloning option here. Confirm the licence attached to the specific model weights before deploying, since code and checkpoints can be licensed separately. It is not yet ranked in the CodeSOTA Elo pool, so its published MOS stays a model-card note rather than a cross-model rank. The architecture tolerates fine-tuning on custom voices from relatively small datasets.
Architecture: VQGAN + Llama
Best for: Commercial cloning
fish_speech_example.py (Python)
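# pip install fish-speech
# Illustrative API shape only: the FishSpeechTTS wrapper below is not a confirmed part of
# the published package. Check the project's inference docs for the current entry points.
from fish_speech.api import FishSpeechTTS

tts = FishSpeechTTS(device="cuda")
# Zero-shot cloning with 10-30s reference
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)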
Dia · Nari Labs
Dia is purpose-built for dialogue. Pass a script with speaker tags ([S1], [S2]) and it produces a natural multi-speaker conversation with appropriate prosody, pacing and turn-taking. At 1.6B parameters it is among the largest models in this comparison, needing ~5 GB VRAM. It also handles non-verbal cues like laughter and hesitations. It is English-only today, but its dialogue capability stands out in this group.
Architecture: Encoder-decoder transformer
Best for: Podcasts, audiobooks
dia_example.py (Python)
# pip install git+https://github.com/nari-labs/dia.git   # Dia by Nari Labs
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")
# Multi-speaker dialogue generation
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""
audio = model.generate(dialogue)
sf.write("dialogue.wav", audio, 44100)  # Dia generates 44.1 kHz audio
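Because Dia's input is a single string with inline speaker tags, it can help to split a script into turns before generation, for example to check speaker counts or estimate length. The sketch below is a small standalone parser for the [S1]/[S2] tag format described above. It is not part of the Dia package, and `split_turns` is a hypothetical helper name.

```python
import re

# A tag such as [S1], followed by everything up to the next tag or the end of the script.
TURN_PATTERN = re.compile(r"\[(S\d+)\]\s*(.*?)(?=\[S\d+\]|\Z)", re.DOTALL)


def split_turns(script: str) -> list[tuple[str, str]]:
    """Split a tagged script into (speaker, text) turns, preserving order."""
    return [
        (speaker, " ".join(text.split()))  # collapse newlines and repeated spaces
        for speaker, text in TURN_PATTERN.findall(script)
    ]


script = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue.
[S1] The prosody is surprisingly good."""

turns = split_turns(script)
for speaker, text in turns:
    print(f"{speaker}: {text}")
print("speakers:", sorted({speaker for speaker, _ in turns}))
```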
F5-TTS · SWivid
F5-TTS uses flow matching with a Diffusion Transformer (DiT) backbone. It has 336M parameters and offers strong zero-shot cloning from 5–15 seconds of reference audio. The flow-matching architecture produces more consistent output than autoregressive approaches, avoiding the occasional artefacts of GPT-style TTS. The licence is CC-BY-NC 4.0, so use is non-commercial only. It still needs shared-prompt audio before it can be ranked by Elo here.
Architecture: Flow matching + DiT
Streaming: Yes (chunk-based)
Best for: Research, cloning
f5tts_example.py (Python)
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")
# Zero-shot voice cloning via flow matching
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",  # path the generated waveform is written to
)
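For intuition about the sampling side of flow matching: generation amounts to integrating an ordinary differential equation dx/dt = v(x, t) from noise at t = 0 to data at t = 1, where v is the velocity field predicted by the network. The toy sketch below is not F5-TTS code. It replaces the learned DiT with the closed-form velocity of a straight-line path toward a known target, so that only the fixed-step Euler integration loop is on display. All names and values are illustrative.

```python
import random


def oracle_velocity(x: list[float], target: list[float], t: float) -> list[float]:
    """Velocity of the straight-line path from the current point toward target.

    Stand-in for the learned network: a real model predicts this from x, t and
    the conditioning inputs, without being given the target.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(noise: list[float], target: list[float], steps: int = 32) -> list[float]:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in the velocity is safe
        v = oracle_velocity(x, target, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
target = [0.5, -1.0, 2.0]  # pretend acoustic feature values
noise = [random.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(noise, target)
print([round(value, 6) for value in sample])
```

With the oracle velocity the integration lands on the target. In a trained model the same loop runs with the network's prediction in place of `oracle_velocity`, and the number of steps trades speed against quality.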
Parler-TTS · Hugging Face
Parler-TTS takes a different route: instead of a reference clip, you describe the voice you want. Given a description such as “A warm female voice with a slight British accent, speaking clearly and calmly”, the model attempts a match. That makes it well suited to rapid prototyping without any reference recordings. The value is controllability and Apache 2.0 licensing; the model still needs a same-prompt Elo row before it should be ranked against the others.
Architecture: T5 + DAC decoder
Best for: Prototyping, content
parler_example.py (Python)
# pip install git+https://github.com/huggingface/parler-tts.git
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

# The description conditions the voice; the prompt is the text to be spoken.
description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."
input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)