
Best Open-Source TTS Models Compared (2026 Edition)

Eight models, one goal: human-quality speech from open-source code. We compare naturalness, speed, voice cloning, hardware needs, and licensing so you can pick the right TTS for your project.

Updated March 2026 | 20 min read | 8 models compared

TL;DR - Pick Your Model

  • Best overall quality: Kokoro (MOS 4.2, 82M params, Apache 2.0)
  • Best voice cloning: XTTS v2 (6s reference, 17 languages)
  • Best for edge/embedded: Piper (runs on Raspberry Pi, 30+ langs)
  • Best for dialogue: Dia (multi-speaker turns, 1.6B params)
  • Best multilingual cloning: Fish Speech (8 langs, Apache 2.0)
  • Best non-speech audio: Bark (laughter, music, MIT license)
  • Best flow-matching TTS: F5-TTS (zero-shot cloning, 336M params)
  • Most controllable: Parler-TTS (describe voice in text)

Naturalness (MOS Scores)

Mean Opinion Score on a 1-5 scale. Human speech typically scores 4.5-4.8. Scores below are from published evaluations and community benchmarks.
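MOS is nothing exotic: it is the arithmetic mean of listener ratings on that 1-5 scale. A minimal sketch, with a hypothetical listener panel invented for illustration:

```python
def mean_opinion_score(ratings: list[float]) -> float:
    """MOS is the arithmetic mean of listener ratings on a 1-5 scale."""
    assert all(1 <= r <= 5 for r in ratings), "ratings must be in [1, 5]"
    return sum(ratings) / len(ratings)

# Five hypothetical listeners rate one utterance:
print(mean_opinion_score([4, 5, 4, 4, 4.5]))  # 4.3
```

Published MOS figures average many utterances and many raters, which is why small differences (4.0 vs 4.1) should be read loosely.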

  • Kokoro: 4.2
  • Fish Speech: 4.1
  • F5-TTS: 4.1
  • XTTS v2: 4.0
  • Dia: 4.0
  • Parler-TTS: 3.8
  • Bark: 3.7
  • Piper: 3.5
  • Human speech: 4.5+

Full Comparison Table

| Model | Developer | MOS | RTF | VRAM | Params | Voice Clone | Languages | License |
|---|---|---|---|---|---|---|---|---|
| Kokoro | Hexgrad | 4.2 | 0.03 | < 1 GB | 82M | No (style presets) | English, Japanese, Korean, Chinese, French, Spanish, Italian, Portuguese, Hindi | Apache 2.0 |
| XTTS v2 | Coqui | 4.0 | 0.18 | ~4 GB | 467M | Yes (6s reference) | 17 (EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CZ, AR, ZH, JA, HU, KO, HI) | CPML (non-commercial) |
| Bark | Suno | 3.7 | 0.85 | ~6 GB | 900M | Limited (speaker prompts) | 13 | MIT |
| Piper | Rhasspy | 3.5 | 0.008 | < 100 MB (CPU) | 6-60M | No (pre-trained voices) | 30+ | MIT |
| Fish Speech | Fish Audio | 4.1 | 0.12 | ~4 GB | 500M | Yes (10-30s reference) | English, Chinese, Japanese, Korean, Spanish, French, German, Arabic | Apache 2.0 |
| Dia | Nari Labs | 4.0 | 0.15 | ~5 GB | 1.6B | Yes (audio prompt) | English | Apache 2.0 |
| F5-TTS | SWivid | 4.1 | 0.14 | ~4 GB | 336M | Yes (5-15s reference) | English, Chinese | CC-BY-NC 4.0 |
| Parler-TTS | Hugging Face | 3.8 | 0.22 | ~4 GB | 880M | No (text-described voices) | English | Apache 2.0 |

RTF = Real-Time Factor (lower is faster; <1.0 means faster than real-time). Measured on NVIDIA A100 unless noted. MOS scores from published papers and community evaluations. VRAM at fp16, single utterance.
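To turn an RTF figure into expected wall-clock latency, multiply it by the clip duration. A quick sketch using numbers from the table above (hardware-dependent in practice):

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock synthesis time = real-time factor x audio duration."""
    return audio_seconds * rtf

# A 10-second clip at Kokoro's RTF 0.03 synthesizes in ~0.3 s;
# the same clip at Bark's RTF 0.85 takes ~8.5 s.
print(round(synthesis_seconds(10, 0.03), 2))  # 0.3
print(round(synthesis_seconds(10, 0.85), 2))  # 8.5
```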

Model Deep Dives

Kokoro

Highest MOS | 82M params

Kokoro is the efficiency champion. Built on the StyleTTS 2 architecture, it achieves the highest MOS (4.2) of the models compared here while using just 82M parameters -- a fraction of the size of its competitors. It runs comfortably on CPU and reaches RTF 0.03 on GPU, meaning a 10-second clip is synthesized in 0.3 seconds. The model ships with curated style presets for different voices but does not support arbitrary voice cloning. As of early 2026 it supports nine languages, including English, Japanese, Korean, Chinese, Hindi, and the major European languages.

  • Architecture: StyleTTS 2 based
  • Sample Rate: 24 kHz
  • Streaming: Yes (chunked)
  • Best For: Narration, assistants
kokoro_example.py
# pip install "kokoro>=0.8" soundfile  (quote the spec so the shell doesn't treat > as a redirect)
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices: af_heart, af_bella, am_adam, am_michael, etc.
samples = pipe("Hello from Kokoro, the most efficient open-source TTS.", voice="af_heart", speed=1.0)
for i, (gs, ps, audio) in enumerate(samples):
    sf.write(f"output_{i}.wav", audio, 24000)

XTTS v2

Best Voice Cloning | CPML License

XTTS v2 remains the gold standard for zero-shot voice cloning. With just 6 seconds of reference audio, it produces remarkably faithful voice reproductions across 17 languages. The architecture combines a GPT-style autoregressive model with DVAE and HiFi-GAN vocoder. The main caveat is its CPML license, which restricts commercial use without a separate agreement. For commercial projects, consider Fish Speech or F5-TTS as alternatives.

  • Architecture: GPT + DVAE + HiFi-GAN
  • Sample Rate: 24 kHz
  • Streaming: Yes
  • Best For: Voice cloning, dubbing
xtts_example.py
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Zero-shot voice cloning from a 6s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)

Bark

Non-Speech Audio | MIT

Bark by Suno is unique in its ability to generate non-speech audio alongside speech. It can produce laughter, music snippets, sighs, and other paralinguistic sounds using inline tags. The GPT-style autoregressive architecture means it is slower (RTF 0.85) and requires more VRAM (~6 GB), but for creative applications where expressive, varied audio is needed, Bark remains unmatched. The MIT license makes it suitable for any commercial project.

  • Architecture: GPT-style AR
  • Sample Rate: 24 kHz
  • Streaming: No
  • Best For: Creative audio, games
bark_example.py
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# Bark supports non-speech: laughter, music, hesitations
text = """Hello! [laughs] This is Bark speaking.
It can generate ♪ musical notes ♪ and even [sighs] emotions."""

audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)

Piper

Edge / Embedded | MIT

Piper is the go-to TTS for edge devices. Built on VITS/VITS2 architecture and exported to ONNX, it achieves RTF 0.008 -- meaning a 10-second clip generates in 80 milliseconds. It runs entirely on CPU with less than 100 MB of RAM. With 30+ pre-trained language models, it is the most broadly multilingual option. The trade-off is lower naturalness (MOS 3.5) and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks, and offline applications.

  • Architecture: VITS / VITS2 (ONNX)
  • Sample Rate: 16-22 kHz
  • Streaming: Yes (sentence-level)
  • Best For: RPi, offline, IoT
piper_example.py
# Install: pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
)

# Or use the Python API directly (synthesize() expects a wave.Wave_write,
# so open the output with the wave module rather than a raw binary file):
import wave

from piper import PiperVoice

voice = PiperVoice.load("./voices/en_US-lessac-high.onnx")
with wave.open("output.wav", "wb") as wav_file:
    voice.synthesize(text, wav_file)

Fish Speech

Multilingual Cloning | Apache 2.0

Fish Speech combines a VQGAN tokenizer with a Llama-based decoder to achieve strong voice cloning across 8 languages. It requires 10-30 seconds of reference audio for cloning, slightly more than XTTS v2, but comes with an Apache 2.0 license -- making it the best commercially friendly voice-cloning option. A MOS of 4.1 puts it near the top for naturalness. The architecture allows fine-tuning on custom voices with relatively small datasets.

  • Architecture: VQGAN + Llama
  • Sample Rate: 44.1 kHz
  • Streaming: Yes
  • Best For: Commercial cloning
fish_speech_example.py
# pip install fish-speech
# Note: the high-level Python API below is illustrative; check the fish-speech
# repo for its current inference interface (CLI tools and an API server).
from fish_speech.api import FishSpeechTTS

tts = FishSpeechTTS(device="cuda")

# Zero-shot cloning with 10-30s reference
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)

Dia (Nari Labs)

Multi-Speaker Dialogue | Apache 2.0

Dia is purpose-built for dialogue. You pass in a script with speaker tags ([S1], [S2]) and it generates a natural multi-speaker conversation with appropriate prosody, pacing, and turn-taking. At 1.6B parameters it is the largest model in this comparison, requiring ~5 GB VRAM. It also supports non-verbal cues like laughter and hesitations. Currently English-only, but the dialogue capability is unmatched.

  • Architecture: Enc-dec transformer
  • Sample Rate: 44 kHz
  • Streaming: No
  • Best For: Podcasts, audiobooks
dia_example.py
# pip install git+https://github.com/nari-labs/dia.git
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Multi-speaker dialogue generation
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""

audio = model.generate(dialogue)
model.save_audio("dialogue.wav", audio)

F5-TTS

Flow Matching | CC-BY-NC 4.0

F5-TTS uses a novel flow matching approach with a Diffusion Transformer (DiT) backbone. It achieves MOS 4.1 with only 336M parameters and provides strong zero-shot voice cloning from 5-15 seconds of reference audio. The flow matching architecture produces more consistent output than autoregressive approaches, avoiding the occasional artifacts common in GPT-style TTS. The CC-BY-NC license limits commercial use.

  • Architecture: Flow matching + DiT
  • Sample Rate: 24 kHz
  • Streaming: Yes (chunk-based)
  • Best For: Research, cloning
f5tts_example.py
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Zero-shot voice cloning via flow matching
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",
)

Parler-TTS

Text-Described Voices | Apache 2.0

Parler-TTS from Hugging Face takes a unique approach: instead of providing reference audio for cloning, you describe the voice you want in natural language. "A warm female voice with a slight British accent, speaking clearly and calmly" -- and the model generates speech matching that description. This makes it highly controllable without needing any reference recordings. MOS of 3.8 is decent but not top-tier; the value is in the controllability and Apache 2.0 license.

  • Architecture: T5 + DAC decoder
  • Sample Rate: 44.1 kHz
  • Streaming: No
  • Best For: Prototyping, content
parler_example.py
# pip install parler-tts
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer
import soundfile as sf

model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1")
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

# Describe the voice you want in natural language
description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."

input_ids = tokenizer(description, return_tensors="pt").input_ids
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)

Hardware Requirements

Edge / Embedded

CPU only / RPi 4 / 1-4 GB RAM
Models: Piper

Real-time on ARM. Perfect for home assistants and offline kiosks.

Consumer Laptop

Integrated / no GPU / 8 GB RAM
Models: Kokoro, Piper

Kokoro runs on CPU at near real-time. Piper is instant.

Mid-range GPU

RTX 3060 / 4060 (8 GB) / 16 GB RAM
Models: Kokoro, XTTS v2, Fish Speech, F5-TTS, Parler-TTS

Sweet spot for most use cases. All mainstream models run comfortably.

High-end GPU

RTX 3090 / 4090 (24 GB) / 32 GB RAM
Models: all eight

Run Dia and Bark with large batch sizes. Batch TTS for audiobook production.
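These tiers reduce to a simple lookup over the VRAM column of the comparison table. A sketch with approximate fp16 figures hardcoded from that table (0 marks CPU-only capable models; nothing here is measured):

```python
# Approximate fp16 VRAM needs in GB, per the comparison table above.
# 0 means the model runs comfortably on CPU alone.
VRAM_GB = {
    "Piper": 0, "Kokoro": 1, "XTTS v2": 4, "Fish Speech": 4,
    "F5-TTS": 4, "Parler-TTS": 4, "Dia": 5, "Bark": 6,
}

def models_within(vram_budget_gb: float) -> list[str]:
    """Models whose approximate VRAM need fits the given budget."""
    return sorted(m for m, need in VRAM_GB.items() if need <= vram_budget_gb)

print(models_within(0))  # CPU-only: ['Piper']
print(models_within(8))  # 8 GB GPU: all eight models
```

Real headroom also depends on batch size, sequence length, and framework overhead, so treat the budget check as a first pass, not a guarantee.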

Decision Matrix

Start from your primary requirement and follow it to the right model.

| Your Priority | Best Pick | Runner-Up | Why |
|---|---|---|---|
| Maximum naturalness | Kokoro | Fish Speech | MOS 4.2 with only 82M params. Apache 2.0. |
| Voice cloning (any license) | XTTS v2 | F5-TTS | Best speaker similarity from 6s reference. |
| Voice cloning (commercial) | Fish Speech | Kokoro presets | Apache 2.0 with strong multilingual cloning. |
| Fastest inference | Piper | Kokoro | RTF 0.008 on CPU. Sub-100ms latency. |
| Minimal VRAM / edge | Piper | Kokoro | <100 MB on CPU. Runs on Raspberry Pi. |
| Most languages | Piper | XTTS v2 | 30+ vs 17 languages. Pre-trained voices. |
| Multi-speaker dialogue | Dia | Bark | Native [S1]/[S2] tags with natural turn-taking. |
| Expressive / non-speech | Bark | Dia | Laughter, music, emotions inline. |
| Voice control via text | Parler-TTS | Kokoro presets | Describe voice in natural language. |
| Research / novel architecture | F5-TTS | Parler-TTS | Flow matching + DiT. Cutting-edge approach. |
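If you want to query the matrix from a script, it encodes naturally as a dictionary. A sketch with the picks transcribed from the table above (the priority keys are invented shorthand):

```python
# priority -> (best pick, runner-up), transcribed from the decision matrix
RECOMMENDATIONS = {
    "naturalness": ("Kokoro", "Fish Speech"),
    "cloning": ("XTTS v2", "F5-TTS"),
    "commercial cloning": ("Fish Speech", "Kokoro presets"),
    "speed": ("Piper", "Kokoro"),
    "edge": ("Piper", "Kokoro"),
    "languages": ("Piper", "XTTS v2"),
    "dialogue": ("Dia", "Bark"),
    "expressive": ("Bark", "Dia"),
    "text control": ("Parler-TTS", "Kokoro presets"),
    "research": ("F5-TTS", "Parler-TTS"),
}

def recommend(priority: str) -> str:
    """Return a human-readable recommendation for one priority."""
    best, runner_up = RECOMMENDATIONS[priority]
    return f"{best} (runner-up: {runner_up})"

print(recommend("dialogue"))  # Dia (runner-up: Bark)
```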

Licensing Quick Reference

Fully Commercial (Apache 2.0 / MIT)

  • Kokoro (Apache 2.0)
  • Fish Speech (Apache 2.0)
  • Dia (Apache 2.0)
  • Parler-TTS (Apache 2.0)
  • Bark (MIT)
  • Piper (MIT)

Non-Commercial / Restricted

  • XTTS v2 (CPML -- contact Coqui for commercial use)
  • F5-TTS (CC-BY-NC 4.0)

Key Considerations

  • Training data licenses may add constraints
  • Voice cloning raises consent/legal issues
  • Check model card for dataset-specific terms
  • Some jurisdictions restrict synthetic speech
