Kokoro · Hexgrad
Kokoro is the efficiency champion. Built on StyleTTS 2, it achieves the highest MOS in this comparison (4.2) with only 82M parameters, a fraction of the size of the other models here. It runs comfortably on CPU and reaches RTF 0.03 on GPU, meaning a ten-second clip synthesises in 0.3 seconds. It ships with curated style presets for different voices but does not support arbitrary voice cloning. As of early 2026 it supports 9 languages, including English, Japanese, Korean and the major European languages.
Architecture: StyleTTS 2 based
Best for: Narration, assistants
kokoro_example.py
# pip install "kokoro>=0.8" soundfile
from kokoro import KPipeline
import soundfile as sf

pipe = KPipeline(lang_code="a")  # 'a' = American English
# Available voices: af_heart, af_bella, am_adam, am_michael, etc.
samples = pipe("Hello from Kokoro, the most efficient open-source TTS.", voice="af_heart", speed=1.0)

# The pipeline yields (graphemes, phonemes, audio) per text segment
for i, (gs, ps, audio) in enumerate(samples):
    sf.write(f"output_{i}.wav", audio, 24000)  # Kokoro outputs 24 kHz audio
XTTS v2 · Coqui
XTTS v2 remains the gold standard for zero-shot voice cloning. With six seconds of reference audio it produces remarkably faithful voices across 17 languages. The architecture combines a GPT-style autoregressive model with a DVAE and a HiFi-GAN vocoder. The main caveat is the CPML licence, which restricts commercial use without a separate agreement; for commercial projects, the Apache-licensed Fish Speech is the closest alternative (F5-TTS matches the quality but is likewise non-commercial).
Architecture: GPT + DVAE + HiFi-GAN
Best for: Voice cloning, dubbing
xtts_example.py
# pip install TTS
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Zero-shot voice cloning from a ~6 s reference clip
tts.tts_to_file(
    text="This is a cloned voice speaking naturally.",
    speaker_wav="reference.wav",
    language="en",
    file_path="output.wav",
)
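Because the reference clip defines the voice independently of language, one clip can drive output in any of the 17 supported languages. A sketch looping over a few language codes with the same reference (in practice you would translate the text per language):
xtts_multilingual_example.py
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

# Reuse one reference clip across several target languages
for lang in ["en", "fr", "de", "es"]:
    tts.tts_to_file(
        text="The same voice, now speaking another language.",
        speaker_wav="reference.wav",
        language=lang,
        file_path=f"output_{lang}.wav",
    )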
Bark · Suno
Bark is unique in its ability to generate non-speech audio alongside speech — laughter, music snippets, sighs, paralinguistic cues — using inline tags. The GPT-style autoregressive architecture makes it slower (RTF 0.85) and thirstier for VRAM (~6 GB), but for creative applications where expressive, varied audio is needed, Bark remains unmatched. MIT licence; suitable for commercial projects.
Architecture: GPT-style autoregressive
Best for: Creative audio, games
bark_example.py
# pip install git+https://github.com/suno-ai/bark.git
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()  # downloads model weights on first run

# Bark supports non-speech tags: [laughs], [sighs], [music], etc.
text = """Hello! [laughs] This is Bark speaking.
It can generate musical notes and even [sighs] emotions."""

audio = generate_audio(text, history_prompt="v2/en_speaker_6")
write_wav("output.wav", SAMPLE_RATE, audio)
Piper · Rhasspy
Piper is the go-to TTS for edge devices. Built on VITS / VITS2 and exported to ONNX, it achieves RTF 0.008 — a ten-second clip in eighty milliseconds. It runs entirely on CPU with under 100 MB of RAM. With 30+ pre-trained language models it is the most broadly multilingual option. The trade-off is lower naturalness (MOS 3.5) and no voice cloning; you pick from pre-trained voices. Ideal for home assistants, kiosks and offline applications.
Architecture: VITS / VITS2 (ONNX)
Streaming: Yes (sentence-level)
Best for: RPi, offline, IoT
piper_example.py
# pip install piper-tts
# Download a voice: piper --download-dir ./voices --model en_US-lessac-high
import subprocess

text = "Piper runs on a Raspberry Pi in real-time."

# Pipe the text to the piper CLI, which writes a WAV file
subprocess.run(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output_file", "output.wav"],
    input=text.encode(),
)
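Piper's sentence-level streaming comes from its --output-raw mode, which emits raw 16-bit mono PCM on stdout as each sentence completes. A sketch piping that straight into aplay for near-instant playback, assuming a 22050 Hz voice (check your model's JSON config for the actual rate):
piper_streaming_example.py
import subprocess

text = "Streaming sentence by sentence with almost no latency."

# piper --output-raw streams 16-bit mono PCM to stdout per sentence;
# pipe it into aplay for immediate playback (Linux/ALSA)
piper = subprocess.Popen(
    ["piper", "--model", "./voices/en_US-lessac-high.onnx", "--output-raw"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)
aplay = subprocess.Popen(
    ["aplay", "-r", "22050", "-f", "S16_LE", "-t", "raw", "-"],
    stdin=piper.stdout,
)
piper.stdin.write(text.encode())
piper.stdin.close()
aplay.wait()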
Fish Speech · Fish Audio
Fish Speech combines a VQGAN tokeniser with a Llama-based decoder to achieve strong voice cloning across 8 languages. It needs 10–30 seconds of reference audio, slightly more than XTTS v2, but ships under Apache 2.0, making it the best commercially friendly cloning option. A MOS of 4.1 puts it near the top for naturalness, and the architecture fine-tunes well on custom voices from relatively small datasets.
Architecture: VQGAN + Llama
Best for: Commercial cloning
fish_speech_example.py
# Install from the official repository; inference is exposed through the
# project's CLI tools, WebUI and API server rather than a stable pip API.
# The wrapper below is an illustrative sketch of the cloning flow only;
# check the Fish Speech docs for the current entry points.
from fish_speech.api import FishSpeechTTS  # hypothetical high-level wrapper

tts = FishSpeechTTS(device="cuda")

# Zero-shot cloning: 10-30s reference clip plus the text to synthesise
tts.synthesize(
    text="Fish Speech excels at multilingual voice cloning.",
    reference_audio="speaker_ref.wav",
    output_path="output.wav",
)
Dia · Nari Labs
Dia is purpose-built for dialogue. Pass a script with speaker tags — [S1], [S2] — and it produces a natural multi-speaker conversation with appropriate prosody, pacing and turn-taking. At 1.6B parameters it is the largest model in this comparison, needing ~5 GB VRAM. It also handles non-verbal cues like laughter and hesitations. English-only today, but the dialogue capability is unmatched.
Architecture: Encoder-decoder transformer
Best for: Podcasts, audiobooks
dia_example.py
# pip install git+https://github.com/nari-labs/dia.git
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# Multi-speaker dialogue with [S1]/[S2] speaker tags; parenthesised
# cues like (laughs) render as non-verbal audio
dialogue = """[S1] Hey, have you tried the new open-source TTS models?
[S2] Yeah, Dia is amazing for dialogue. (laughs) It handles turn-taking naturally.
[S1] The prosody between speakers is surprisingly good."""

audio = model.generate(dialogue)
model.save_audio("dialogue.wav", audio)
F5-TTS · SWivid
F5-TTS uses flow matching with a Diffusion Transformer (DiT) backbone. It reaches MOS 4.1 with 336M parameters and delivers strong zero-shot cloning from 5–15 seconds of reference audio. The flow-matching objective produces more consistent output than autoregressive approaches, avoiding the occasional artefacts of GPT-style TTS. The licence is CC-BY-NC 4.0, so non-commercial use only.
Architecture: Flow matching + DiT
Streaming: Yes (chunk-based)
Best for: Research, cloning
f5tts_example.py
# pip install f5-tts
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Zero-shot voice cloning via flow matching; the reference transcript
# guides alignment between the reference audio and the new text
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="F5-TTS uses flow matching for natural-sounding speech synthesis.",
    file_wave="output.wav",
)
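The infer call exposes useful knobs beyond the basics; a sketch trading quality for speed via the number of flow-matching steps (parameter names taken from the f5_tts.api signature, defaults may shift between releases):
f5tts_tuning_example.py
from f5_tts.api import F5TTS

tts = F5TTS(device="cuda")

# Fewer ODE solver steps = faster synthesis, slightly rougher audio;
# speed stretches or compresses the speaking rate
tts.infer(
    ref_file="reference.wav",
    ref_text="This is the reference transcript.",
    gen_text="A faster, lower-step render of the same voice.",
    nfe_step=16,          # default is 32
    speed=1.1,            # 10% faster speech
    remove_silence=True,  # trim leading/trailing silence
    file_wave="output_fast.wav",
)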
Parler-TTS · Hugging Face
Parler-TTS takes a different route: instead of a reference clip, you describe the voice you want. “A warm female voice with a slight British accent, speaking clearly and calmly”, and the model attempts a match. That level of control makes it ideal for rapid prototyping without any reference recordings. MOS 3.8 is decent but not top-tier; the value is in the controllability and the Apache 2.0 licence.
Architecture: T5 + DAC decoder
Best for: Prototyping, content
parler_example.py
# pip install parler-tts
import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-large-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

# The description controls the voice; the prompt is the text to speak
description = "A warm female voice with a slight British accent, speaking clearly and calmly."
prompt = "Parler TTS lets you describe the exact voice characteristics you want."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

gen = model.generate(input_ids=input_ids, prompt_input_ids=prompt_ids)
sf.write("output.wav", gen.cpu().numpy().squeeze(), model.config.sampling_rate)