CodeSOTA · Text-to-speech · XTTS
§ 00 · Direct answer

XTTS v2 is still useful, but not universal.

XTTS v2 is a multilingual voice-cloning TTS model. Use it when you need local zero-shot cloning from a reference voice and broad language coverage. Do not treat it as the default winner for every TTS task: Kokoro is lighter for local English synthesis, F5-TTS is a strong cloning alternative, and hosted APIs usually win for realtime voice agents.
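
For the local cloning case, a minimal sketch might look like the following. It assumes the Coqui `TTS` Python package is installed and uses its `TTS.api.TTS` class with the `xtts_v2` model identifier. The helper name `clone_voice` and all file names are hypothetical, chosen only for illustration.

```python
from pathlib import Path

# Model identifier as listed in the Coqui TTS model catalog.
XTTS_V2_MODEL = "tts_models/multilingual/multi-dataset/xtts_v2"


def clone_voice(text: str, reference_wav: str, out_path: str, language: str = "en") -> str:
    """Synthesize `text` in the voice of `reference_wav` and write a WAV file.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Fail fast, before loading a large model, if the reference clip is missing.
    if not Path(reference_wav).is_file():
        raise FileNotFoundError(f"reference clip not found: {reference_wav}")

    # Imported lazily so this module can be imported without the TTS package.
    from TTS.api import TTS

    tts = TTS(XTTS_V2_MODEL)  # fetches the model weights on first use
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,  # reference clip used for zero-shot cloning
        language=language,          # language code of the text to synthesize
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("Hello from a cloned voice.", "speaker.wav", "out.wav")` would then write the synthesized audio to `out.wav`, where `speaker.wav` stands in for your own reference recording.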

§ 01 · Decision table

When XTTS is the right pick.

| Use case | Pick | Reason |
| --- | --- | --- |
| Multilingual cloning prototype | XTTS v2 | Good local baseline with reference-audio cloning. |
| Tiny local English TTS | Kokoro | Much smaller and easier to run for plain synthesis. |
| Voice-agent production | Hosted realtime API | Streaming latency, monitoring, and product controls matter more than model-card appeal. |
| Cloning quality bake-off | XTTS v2 + F5-TTS | Run both on the same reference clip and score intelligibility, speaker similarity, and artifacts. |
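
The bake-off row can be made concrete with a small scoring harness. The sketch below is one possible setup, not a standard benchmark. It assumes intelligibility is approximated by word error rate between the prompt text and an ASR transcript of the output, speaker similarity by cosine similarity between speaker embeddings, and artifacts by a manual count from listening. The transcripts and embeddings are assumed to come from tools of your choice, and every name here is hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length (lower is better)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two speaker embeddings (higher is closer)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embedding has zero norm")
    return dot / norm


@dataclass
class ClipScore:
    model: str
    wer: float                 # intelligibility proxy
    speaker_similarity: float  # similarity to the reference speaker
    artifacts: int             # glitches counted by a human listener


def score_clip(model: str, prompt: str, asr_transcript: str,
               ref_embedding: Sequence[float], out_embedding: Sequence[float],
               artifacts: int) -> ClipScore:
    """Bundle the three bake-off measurements for one synthesized clip."""
    return ClipScore(model, word_error_rate(prompt, asr_transcript),
                     cosine_similarity(ref_embedding, out_embedding), artifacts)
```

Scoring both models on the same prompt and the same reference clip yields two `ClipScore` records that can be compared directly. How to weight the three measurements is left to the evaluator.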