Eight open-source systems, side by side
Updated March 2026
Guide · Speech synthesis

Eight open-source voices, compared on the evidence.

Kokoro, XTTS v2, Bark, Piper, Fish Speech, Dia, F5-TTS, Parler-TTS — naturalness, speed, voice cloning, hardware, and licence, with the numbers each team has published.

MOS figures are sourced from the model authors' own evaluations and community reproductions. Real-time factor measured on an A100 unless otherwise noted.
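
Real-time factor (RTF) is synthesis wall-clock time divided by the duration of the audio produced; below 1.0 means faster than real time. A minimal sketch of the measurement (the synthesize callable is a stand-in for any engine below, not a real API):

rtf_sketch.py · python
import time

def real_time_factor(synthesize, text, sample_rate):
    """RTF = synthesis time / duration of generated audio; < 1.0 is faster than real time."""
    start = time.perf_counter()
    audio = synthesize(text)             # stand-in: returns a 1-D array of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate  # seconds of audio produced
    return elapsed / duration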

§ 01 · Naturalness

How human the voice feels.

Mean Opinion Score on a 1–5 scale. Human speech typically scores 4.5–4.8.
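
Concretely, a MOS figure is just the arithmetic mean of listeners' ratings, usually reported with a confidence interval. A toy illustration (the ratings below are made up, not taken from any of these evaluations):

mos_sketch.py · python
import statistics

# Hypothetical ratings from ten listeners on the 1-5 naturalness scale
ratings = [4, 5, 4, 4, 3, 5, 4, 4, 5, 4]

mos = statistics.mean(ratings)          # the MOS is the plain mean
sd = statistics.stdev(ratings)          # sample standard deviation
ci95 = 1.96 * sd / len(ratings) ** 0.5  # normal-approximation 95% CI

print(f"MOS = {mos:.2f} ± {ci95:.2f}")  # MOS = 4.20 ± 0.39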

Eight systems, eight claims about naturalness. The ordering below is the one their own evaluations support; none of the differences between the top four are large, but the trend — efficiency models catching the larger transformer-based systems — is the story of the last eighteen months.

Model          MOS    Org            Signature
Kokoro         4.2    Hexgrad        Highest naturalness per parameter
Fish Speech    4.1    Fish Audio     Strong multilingual cloning
F5-TTS         4.1    SWivid         Zero-shot cloning via flow matching
XTTS v2        4.0    Coqui          Best zero-shot voice cloning
Dia            4.0    Nari Labs      Multi-speaker dialogue generation
Parler-TTS     3.8    Hugging Face   Control voice via text description
Bark           3.7    Suno           Non-speech audio (music, laughter)
Piper          3.5    Rhasspy        Runs on Raspberry Pi
Human speech   4.5+   —              Upper bound, per field convention
Fig 1 · MOS rankings as published by each model's authors.
§ 02 · Comparison

The full table.

MOS, real-time factor, VRAM, parameter count, cloning behaviour and licence — the questions that decide the pick.

Model         Org            MOS    RTF     VRAM             Params   Voice cloning                Licence
Kokoro        Hexgrad        4.2    0.03    < 1 GB           82M      No (style presets)           Apache 2.0
XTTS v2       Coqui          4.0    0.18    ~4 GB            467M     Yes (6s reference)           CPML (non-commercial)
Bark          Suno           3.7    0.85    ~6 GB            900M     Limited (speaker prompts)    MIT
Piper         Rhasspy        3.5    0.008   < 100 MB (CPU)   6–60M    No (pre-trained voices)      MIT
Fish Speech   Fish Audio     4.1    0.12    ~4 GB            500M     Yes (10–30s reference)       Apache 2.0
Dia           Nari Labs      4.0    0.15    ~5 GB            1.6B     Yes (audio prompt)           Apache 2.0
F5-TTS        SWivid         4.1    0.14    ~4 GB            336M     Yes (5–15s reference)        CC-BY-NC 4.0
Parler-TTS    Hugging Face   3.8    0.22    ~4 GB            880M     No (text-described voices)   Apache 2.0
Fig 2 · Figures as published by each project. Languages listed in § 03 to keep this table legible.
§ 03 · Deep dives

One by one.

Eight models, each with its own design decision and the trade-off that follows.

Kokoro · Hexgrad

Kokoro is the efficiency champion. Built on StyleTTS 2, it achieves the highest MOS in this comparison (4.2) with 82M parameters, four to twenty times smaller than the other GPU-class models here. It runs comfortably on CPU and reaches RTF 0.03 on GPU, which means a ten-second clip synthesises in 0.3 seconds. It ships with curated style presets for different voices but does not do arbitrary voice cloning. As of early 2026 it supports 9 languages, including English, Japanese, Korean and the major European languages.

Architecture · StyleTTS 2 based
Sample rate · 24 kHz
Streaming · Yes (chunked)
Best for · Narration, assistants

kokoro_example.py · python
# pip install "kokoro>=0.8" soundfile
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices: af_heart, af_bella, am_adam, am_michael, etc.
samples = pipe(
    "Hello from Kokoro, the most efficient open-source TTS.",
    voice="af_heart",
    speed=1.0,
)
# The pipeline yields (graphemes, phonemes, audio) per chunk
for i, (gs, ps, audio) in enumerate(samples):
    sf.write(f"output_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio

XTTS v2 · Coqui

XTTS v2 remains the gold standard for zero-shot voice cloning. With six seconds of reference audio it produces remarkably faithful voices across 17 languages. The architecture combines a GPT-style autoregressive model with a DVAE and a HiFi-GAN vocoder. The main caveat is the CPML licence, which restricts commercial use without a separate agreement; for commercial projects, Fish Speech or Dia are the closest alternatives (F5-TTS is itself non-commercial under CC-BY-NC).

Architecture · GPT + DVAE + HiFi-GAN
Sample rate · 24 kHz
Streaming · Yes
Best for · Voice cloning, dubbing

xtts_example.py · python
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")

# Zero-shot voice cloning from a 6s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)

Bark · Suno

Bark is unique in its ability to generate non-speech audio alongside speech — laughter, music snippets, sighs, paralinguistic cues — using inline tags. The GPT-style autoregressive architecture makes it slower (RTF 0.85) and thirstier for VRAM (~6 GB), but for creative applications where expressive, varied audio is needed, Bark remains unmatched. MIT licence; suitable for commercial projects.

Architecture · GPT-style AR
Sample rate · 24 kHz
Streaming · No
Best for · Creative audio, games

bark_example.py · python
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# Bark supports non-speech: laughter, music, hesitations
text = """Hello! [laughs] This is Bark speaking.
It can generate musical notes and even [sighs] emotions."""

audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)

Piper · Rhasspy

Piper is the go-to TTS for edge devices. Built on VITS / VITS2 and exported to ONNX, it achieves RTF 0.008 — a ten-second clip in eighty milliseconds. It runs entirely on CPU in under 100 MB of RAM. With pre-trained voices in more than 30 languages it is the most broadly multilingual option here. The trade-off is lower naturalness (MOS 3.5) and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks and offline applications.

Architecture · VITS / VITS2 (ONNX)
Sample rate · 16–22 kHz
Streaming · Yes (sentence-level)
Best for · RPi, offline, IoT

piper_example.py · python
# Install: pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."
# piper reads text on stdin and writes a WAV file
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
    check=True,
)

Fish Speech · Fish Audio

Fish Speech combines a VQGAN tokeniser with a Llama-based decoder to achieve strong voice cloning across 8 languages. It requires 10–30 seconds of reference audio — slightly more than XTTS v2 — but ships under Apache 2.0, making it the best commercially-friendly cloning option. MOS 4.1 puts it near the top for naturalness. The architecture also supports fine-tuning on custom voices from relatively small datasets.

Architecture · VQGAN + Llama
Sample rate · 44.1 kHz
Streaming · Yes
Best for · Commercial cloning

fish_speech_example.py · python
# Fish Speech is distributed as a repo with CLI, WebUI and HTTP-server inference
# tools; the wrapper below is an illustrative sketch, not a stable published API.
# Check the project's docs for the current entry point.
from fish_speech.api import FishSpeechTTS  # hypothetical convenience wrapper

tts = FishSpeechTTS(device="cuda")

# Zero-shot cloning with 10-30s reference
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)

Dia · Nari Labs

Dia is purpose-built for dialogue. Pass a script with speaker tags — [S1], [S2] — and it produces a natural multi-speaker conversation with appropriate prosody, pacing and turn-taking. At 1.6B parameters it is the largest model in this comparison, needing ~5 GB VRAM. It also handles non-verbal cues like laughter and hesitations. English-only today, but the dialogue capability is unmatched.

Architecture · Encoder-decoder transformer
Sample rate · 44 kHz
Streaming · No
Best for · Podcasts, audiobooks

dia_example.py · python
# pip install git+https://github.com/nari-labs/dia.git
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Multi-speaker dialogue generation
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""

audio = model.generate(dialogue)
model.save_audio("dialogue.wav", audio)

F5-TTS · SWivid

F5-TTS uses flow matching with a Diffusion Transformer (DiT) backbone. MOS 4.1 with 336M parameters and strong zero-shot cloning from 5–15 seconds of reference. The flow-matching architecture produces more consistent output than autoregressive approaches, avoiding the occasional artefacts of GPT-style TTS. CC-BY-NC 4.0 — so non-commercial only.

Architecture · Flow matching + DiT
Sample rate · 24 kHz
Streaming · Yes (chunk-based)
Best for · Research, cloning

f5tts_example.py · python
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Zero-shot voice cloning via flow matching; ref_text is the transcript
# of the reference clip
wav, sr, spect = tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",  # also written to disk
)

Parler-TTS · Hugging Face

Parler-TTS takes a different route: instead of a reference clip, you describe the voice you want. “A warm female voice with a slight British accent, speaking clearly and calmly” — and the model attempts a match. That level of control makes it well suited to rapid prototyping without any reference recordings. MOS 3.8 is decent but not top-tier; the value is in the controllability and the Apache 2.0 licence.

Architecture · T5 + DAC decoder
Sample rate · 44.1 kHz
Streaming · No
Best for · Prototyping, content

parler_example.py · python
# pip install git+https://github.com/huggingface/parler-tts.git
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)
§ 04 · Hardware

What your box can run.

Four deployment tiers, from a Raspberry Pi to a 4090.

Tier              GPU / RAM                         Runs well                                          Notes
Edge / Embedded   CPU only / RPi 4 · 1–4 GB         Piper                                              Real-time on ARM. Home assistants, offline kiosks.
Consumer laptop   Integrated / no GPU · 8 GB        Kokoro, Piper                                      Kokoro near real-time on CPU; Piper instant.
Mid-range GPU     RTX 3060 / 4060 (8 GB) · 16 GB    Kokoro, XTTS v2, Fish Speech, F5-TTS, Parler-TTS   Sweet spot for most use cases.
High-end GPU      RTX 3090 / 4090 (24 GB) · 32 GB   All models                                         Run Dia and Bark at large batch sizes; audiobook production.
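
To map a machine onto these tiers at runtime, a rough sketch (the thresholds simply transcribe the table above; torch is assumed for the VRAM query):

tier_sketch.py · python
import torch

def candidate_models():
    """Map available VRAM onto the tiers above; thresholds mirror the table."""
    if not torch.cuda.is_available():
        return ["Piper", "Kokoro"]  # edge / laptop tiers, CPU only
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 16:
        return ["all models, including Dia and Bark"]
    if vram_gb >= 8:
        return ["Kokoro", "XTTS v2", "Fish Speech", "F5-TTS", "Parler-TTS"]
    return ["Piper", "Kokoro"]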
§ 05 · Decision

Start from the constraint.

One primary requirement on the left; the best and second-best pick on the right.

Your priority                   Best pick     Runner-up        Why
Maximum naturalness             Kokoro        Fish Speech      MOS 4.2 with only 82M params. Apache 2.0.
Voice cloning (any licence)     XTTS v2       F5-TTS           Best speaker similarity from 6s reference.
Voice cloning (commercial)      Fish Speech   Kokoro presets   Apache 2.0 with strong multilingual cloning.
Fastest inference               Piper         Kokoro           RTF 0.008 on CPU. Sub-100 ms latency.
Minimal VRAM / edge             Piper         Kokoro           < 100 MB on CPU. Runs on Raspberry Pi.
Most languages                  Piper         XTTS v2          30+ vs 17 languages. Pre-trained voices.
Multi-speaker dialogue          Dia           Bark             Native [S1]/[S2] tags with natural turn-taking.
Expressive / non-speech         Bark          Dia              Laughter, music, emotions inline.
Voice control via text          Parler-TTS    Kokoro presets   Describe voice in natural language.
Research / novel architecture   F5-TTS        Parler-TTS       Flow matching + DiT. Cutting-edge approach.
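
The same mapping as data, if the choice needs to live in code (a sketch that just transcribes the table above; the key names are ours):

picks_sketch.py · python
# (best pick, runner-up) per priority, straight from the decision table
PICKS = {
    "max_naturalness":        ("Kokoro", "Fish Speech"),
    "cloning_any_licence":    ("XTTS v2", "F5-TTS"),
    "cloning_commercial":     ("Fish Speech", "Kokoro presets"),
    "fastest_inference":      ("Piper", "Kokoro"),
    "minimal_vram_edge":      ("Piper", "Kokoro"),
    "most_languages":         ("Piper", "XTTS v2"),
    "dialogue":               ("Dia", "Bark"),
    "expressive_non_speech":  ("Bark", "Dia"),
    "voice_control_via_text": ("Parler-TTS", "Kokoro presets"),
    "research":               ("F5-TTS", "Parler-TTS"),
}

best, runner_up = PICKS["cloning_commercial"]
print(best, "/", runner_up)  # Fish Speech / Kokoro presets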
§ 06 · Licensing

Who may ship what.

The licence is often the binding constraint, not the MOS.

Fully commercial (Apache 2.0 / MIT). Kokoro · Fish Speech · Dia · Parler-TTS (Apache 2.0) · Bark · Piper (MIT).

Non-commercial or restricted. XTTS v2 (CPML — contact Coqui) · F5-TTS (CC-BY-NC 4.0).

For commercial voice cloning in 2026, the practical pick is Fish Speech: Apache 2.0, MOS 4.1, eight languages, and a reference-audio workflow the Coqui ecosystem made familiar.
