Text-to-speech · watchlist

New TTS models stay unscored until evidence catches up.

These models are tracked for upcoming CodeSOTA runs. They are not ranked against measured rows, and they do not inherit vendor MOS until a source and verification tier are attached.
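The gating rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the `WatchlistEntry` type, its field names, and the `status` and `reportable_mos` helpers are hypothetical and do not describe CodeSOTA's actual schema, and the example entry and its MOS value are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatchlistEntry:
    # All field names are hypothetical; they are not CodeSOTA's real schema.
    name: str
    vendor_mos: Optional[float] = None       # MOS as claimed by the vendor, if any
    source: Optional[str] = None             # where the claimed number comes from
    verification_tier: Optional[str] = None  # assumed label for how it was verified


def status(entry: WatchlistEntry) -> str:
    """Stay 'metadata only' until both a source and a verification tier exist."""
    if entry.source and entry.verification_tier:
        return "evidence attached"
    return "metadata only"


def reportable_mos(entry: WatchlistEntry) -> Optional[float]:
    """Vendor MOS is not inherited while the entry is metadata only."""
    if status(entry) == "metadata only":
        return None
    return entry.vendor_mos


# Placeholder entry with an invented MOS value, for illustration only.
candidate = WatchlistEntry(name="example-tts", vendor_mos=4.5)
print(status(candidate), reportable_mos(candidate))

# Attaching both pieces of evidence lifts the gate in this sketch.
candidate.source = "vendor model card"
candidate.verification_tier = "vendor-reported"
print(status(candidate), reportable_mos(candidate))
```

Keeping the claimed number on the record but hiding it behind the gate means nothing has to be re-entered once the evidence arrives.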


metadata only

Realtime TTS 1.5 Max

Inworld realtime voice-agent candidate; add API metadata, pricing, and a CodeSOTA hard-text run.

metadata only

Realtime TTS 1.5 Mini

Realtime voice-agent candidate from the same Inworld model family; verify latency and pricing.

metadata only

Gemini 3.1 Flash TTS

Google frontier TTS model with audio tags and broad language coverage; add the exact API model ID.

metadata only

Eleven v3

ElevenLabs' latest expressive synthesis model; needs shared-prompt samples and a control-following run.

metadata only

Cartesia Sonic 3.5

Stable May 2026 Cartesia snapshot; verify transcript following, language quality, and p95 latency.

metadata only

OpenAI gpt-4o-mini-tts

OpenAI's current TTS model for prompt-controlled realtime speech; add voices, latency, and cost.

metadata only

StepAudio 2.5 TTS

Contextual TTS candidate; verify public access, cloning claims, and language coverage.

metadata only

MiniMax Speech 2.8 HD

High-fidelity TTS candidate; verify emotion control, languages, and commercial terms.

metadata only

OmniVoice

Candidate with explicit controls for pitch, whisper, speed, duration, and tags.

metadata only

VoxCPM2

Open multilingual speech model; test on hard text and code-switching.

metadata only

Qwen3-TTS

Qwen speech generation candidate; verify language breadth and licensing.

metadata only

Voxtral TTS

Open-weight TTS candidate; verify artifacts and serving path.

metadata only

MOSS-TTS-Nano

Small-model edge candidate for browser/mobile deployment.

metadata only

Chatterbox

Control stress-test candidate for exaggeration, CFG, temperature, and reference audio.

metadata only

VibeVoice

Expressive long-form and dialogue candidate; watch for drift and repetition.

metadata only

Dramabox

Narration and character-voice candidate.

metadata only

Scenema

Scene/dialogue TTS candidate; needs an artifact-backed eval.