§ 00 · Direct answer
XTTS v2 is still useful, but not universal.
XTTS v2 is a multilingual voice-cloning TTS model. Use it when you need local zero-shot cloning from a short reference clip and coverage across its 17 supported languages. Do not treat it as the default winner for every TTS task: Kokoro is lighter for local English synthesis, F5-TTS is a strong cloning alternative, and hosted APIs usually win for realtime voice agents.
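A minimal sketch of local zero-shot cloning, assuming the Coqui `TTS` Python package is installed and the XTTS v2 weights can be downloaded on first use. The helper names `check_request` and `clone_to_file`, the file paths, and the language set (copied from the XTTS v2 model card) are illustrative assumptions, not a definitive integration.

```python
from pathlib import Path

# Language codes listed on the XTTS v2 model card
# (assumption: unchanged in the version you have installed).
XTTS_V2_LANGUAGES = frozenset({
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
})

XTTS_V2_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def check_request(text: str, reference_wav: str, language: str) -> None:
    """Fail fast on unusable inputs before the (large) model is loaded."""
    if not text.strip():
        raise ValueError("text is empty")
    if language not in XTTS_V2_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    if not Path(reference_wav).is_file():
        raise FileNotFoundError(reference_wav)


def clone_to_file(text: str, reference_wav: str, language: str,
                  out_path: str = "out.wav") -> str:
    """Synthesize `text` in the voice of `reference_wav` (hypothetical helper)."""
    check_request(text, reference_wav, language)
    # Imported lazily so the validation above works without the package installed.
    from TTS.api import TTS

    tts = TTS(XTTS_V2_MODEL)
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_to_file("Hello from a cloned voice.", "reference.wav", "en")` would write `out.wav`, where `reference.wav` stands in for your own reference clip. Validating the language code first avoids loading the model for a request it cannot serve.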
§ 01 · Decision table
When XTTS is the right pick.
| Use case | Pick | Reason |
|---|---|---|
| Multilingual cloning prototype | XTTS v2 | Good local baseline with reference-audio cloning. |
| Tiny local English TTS | Kokoro | Much smaller and easier to run for plain synthesis. |
| Voice-agent production | Hosted realtime API | Streaming latency, monitoring, and product controls matter more than model-card appeal. |
| Cloning quality bake-off | XTTS v2 + F5-TTS | Run both on the same reference clip and score intelligibility, speaker similarity, and artifacts. |
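The bake-off row can be sketched as a small scoring harness. This is one possible setup, not a standard protocol: it assumes you already have an ASR transcript of each synthesized clip and a speaker embedding for the reference and for each output, both produced by tools of your choice. The function names and the choice of word error rate and cosine similarity as proxies are assumptions made for illustration.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by reference length (intelligibility proxy)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two speaker embeddings (similarity proxy)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embedding has zero norm")
    return dot / norm


def score_clip(prompt, asr_transcript, ref_embedding, out_embedding):
    """Score one synthesized clip; call once per model on the same prompt."""
    return {
        "wer": word_error_rate(prompt, asr_transcript),
        "speaker_similarity": cosine_similarity(ref_embedding, out_embedding),
    }
```

Run `score_clip` on the XTTS v2 and F5-TTS outputs for the same prompt and reference clip, then compare the two dictionaries. Artifacts are not covered by either metric, so they still need a listening pass.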