§ 00 · Direct answer
Piper vs Kokoro: Kokoro wins quality, Piper wins appliances.
Choose Kokoro when people will judge the voice. Choose Piper when the system needs simple offline speech, predictable packaging, and many practical voices. One caveat matters: CodeSOTA has measured Kokoro on hard text, but Piper has not yet been through the same run, so the quality gap is not yet quantified.
§ 01 · Short decision table
The winner changes with the job.
| Question | Winner | Why |
|---|---|---|
| Local voice quality default | Kokoro | A newer model that is generally regarded as more listenable, and CodeSOTA has an artifact-backed Kokoro run. |
| Small appliance / offline fleet | Piper | Simpler engine path, broad voice catalog, and a long Home Assistant-style deployment history. |
| Voice-agent demo people will judge | Kokoro | A better first impression matters when the voice is part of the product surface. |
| Status prompts, kiosks, alerts | Piper | Predictable utility speech is enough when the content matters more than the speaker. |
| Evidence confidence on this page | Kokoro | CodeSOTA has measured Kokoro; Piper still needs the same hard-text and latency run. |
§ 02 · Evidence ledger
What is measured, and what is still only a deployment claim.
| Claim | Result | Tier | How to use it |
|---|---|---|---|
| Kokoro hard-text intelligibility | 30 prompts, WER 15.6%, CER 6.8%, entity accuracy 66.7% | CodeSOTA measured | Use this as the current local evidence floor, not a final universal ranking. |
| Kokoro latency on CodeSOTA run | M2 Max, ONNX Runtime, p50 first audio 855 ms, p95 2123 ms | CodeSOTA measured | Good enough for many local prototypes; retest on Raspberry Pi, N100, and target cloud CPU before shipping. |
| Kokoro blind preference | 234 pairwise votes across 8 model families; Kokoro placed 7th overall, 3rd on number-heavy prompts, and beat Gradium 10-7 head-to-head. | CodeSOTA measured | Naturalness is not the same as fidelity; Kokoro is strong for tiny local TTS, not the overall winner against larger hosted systems. |
| Piper deployment surface | Fast local neural TTS engine, current Open Home Foundation fork, CLI/server/Python/C/C++ paths, and a 35-language MIT voice repository. | Primary project sources | Pick it when operational simplicity, language availability, and offline repeatability beat expressiveness. |
| Piper vs Kokoro same-prompt quality | Missing from CodeSOTA today: no Piper run on the same 30 hard prompts, no blind Piper-vs-Kokoro vote table, no p95 target-device latency row. | Gap | Do not write 'Piper is worse' as a measured claim until Piper has the same harness. |
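
The fidelity figures in the ledger (WER, CER, entity accuracy) can be reproduced for any engine once each synthesized clip has been transcribed back to text. The sketch below is a minimal Python version under stated assumptions: the transcript is assumed to come from an ASR model run over the synthesized audio, and `normalize` and `entity_accuracy` are hypothetical helpers, since this page does not spell out CodeSOTA's exact normalization rules or entity-matching definition.

```python
import re


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumed normalization: lowercase, replace punctuation with spaces.
    # A real harness may treat URLs, dates, and numbers differently.
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def wer(reference, transcript):
    """Word error rate: word-level edits divided by reference word count."""
    ref, hyp = normalize(reference), normalize(transcript)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(reference, transcript):
    """Character error rate over the normalized, space-joined text."""
    ref = " ".join(normalize(reference))
    hyp = " ".join(normalize(transcript))
    return edit_distance(ref, hyp) / max(len(ref), 1)


def entity_accuracy(entities, transcript):
    # Hypothetical definition: an entity counts as correct when its
    # normalized form appears as whole tokens in the transcript.
    hyp = " " + " ".join(normalize(transcript)) + " "
    hits = sum(1 for e in entities
               if " " + " ".join(normalize(e)) + " " in hyp)
    return hits / max(len(entities), 1)
```

Running the same functions over Piper and Kokoro transcripts of identical prompts is what would turn the "Gap" row into a measured comparison. Scores are only comparable when both engines go through the same ASR model and the same normalization.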
§ 03 · Model facts
The registry view.
| Model | Params | License | Deployment | CodeSOTA tier | Best read |
|---|---|---|---|---|---|
| Kokoro v1.0 | 82M | Apache-2.0 | local, edge | CodeSOTA measured | Small open-weight voice that sounds better than its size suggests. |
| Piper | ~20M | MIT | local, edge | Community reported | Operationally boring in a useful way: good for offline voice plumbing. |
§ 04 · Workload matrix
Pick by where the voice fails.
| Workload | Pick | Reason |
|---|---|---|
| New local English voice prototype | Kokoro | Fast to evaluate, stronger default voice quality, Apache-2.0 model weights. |
| Home appliance, kiosk, embedded prompt | Piper | Small local engine, deterministic output, practical voice catalog. |
| Long listening sessions | Benchmark both | Fatigue can flip the decision; run 20-30 minute listening tests, not only one-sentence demos. |
| Raspberry Pi / constrained CPU | Piper first, Kokoro second | Piper was built around this class of deployment; Kokoro may still win if the device can run it within the latency budget. |
| Public voice assistant demo | Kokoro | Users notice prosody and naturalness more than the ops stack during a demo. |
| Voice cloning or multilingual style control | Neither as default | Compare XTTS, F5-TTS, Chatterbox, and hosted APIs; Piper and Kokoro are not cloning-first choices. |
§ 05 · Production bake-off
The minimum benchmark before a real decision.
| Check | Run this |
|---|---|
| Prompt set | Use the same 30 hard-text prompts from the Kokoro run for both engines, plus 30 conversational turns and 10 long-form paragraphs. |
| Objective fidelity | Score WER, CER, critical entity accuracy, URL/date/number failures, omission and repetition rate. |
| Latency | Measure cold start, p50/p95 first audio, real-time factor, peak RSS, and CPU load on the real target box. |
| Preference | Run blind A/B votes with volume matched, randomized order, same text, and at least 30 judgments per pair. |
| Production fit | Check license, voice availability, install size, monitoring path, streamability, and whether the voice is tolerable after repeated listening. |
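
The latency and preference rows reduce to a few small calculations. The following Python sketch shows one way to summarize them; the function names are hypothetical, the percentile uses the nearest-rank convention (other conventions interpolate), and the preference check is a plain two-sided exact sign test that assumes ties are dropped before counting.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a list of measurements, q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def real_time_factor(synthesis_seconds, audio_seconds):
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    return synthesis_seconds / audio_seconds


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial p-value against a 50/50 null (ties excluded)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def summarize_latency(first_audio_ms):
    """p50 and p95 time-to-first-audio for one engine on one device."""
    return {
        "p50_ms": percentile(first_audio_ms, 50),
        "p95_ms": percentile(first_audio_ms, 95),
    }
```

As a usage note, `sign_test_p_value(10, 7)` evaluates to about 0.63, so a 10-7 head-to-head split on its own does not separate two voices from a coin flip. That is one reason the preference row asks for at least 30 judgments per pair, and more votes are better when the expected gap is small.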
§ 06 · Sources
Primary references used for non-CodeSOTA facts.
| Source | Used for |
|---|---|
| Kokoro model card | 82M model, Apache-2.0, Hugging Face files |
| Kokoro inference library | Open-weight 82M TTS model and install path |
| Piper engine | Current Open Home Foundation fork, local engine and APIs |
| Piper voices | MIT voice repository covering 35 languages |