Voice cloning.

Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing speech that mimics vocal characteristics such as pitch and tone. It has legitimate uses, such as generating audiobooks or restoring speech for people who have lost their voice, but it is also used maliciously, for example in scams where fraudsters impersonate a specific individual.

1 dataset · 3 results · Canonical metric: wer

§ 02 · Canonical benchmark

The reference dataset.

LibriTTS test-clean (Zero-Shot TTS)

Standard zero-shot voice-cloning / TTS evaluation using LibriTTS test-clean speaker prompts. Word error rate (WER) on resynthesized utterances, measured with a frozen ASR model such as HuBERT-Large or Whisper, is the primary intelligibility metric (lower is better); speaker encoder cosine similarity (SECS) is a secondary metric.
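As a rough illustration of the two metrics named above, the sketch below computes WER as word-level edit distance divided by the reference length, and speaker similarity as the cosine between two embedding vectors. The function names, the lowercase-and-split text normalization, and the toy inputs are assumptions for illustration only; real evaluations obtain the hypothesis from an ASR transcript of the synthesized audio and the embeddings from a speaker encoder, and typically apply stricter text normalization. The leaderboard values appear to be percentages, which would correspond to this ratio multiplied by 100.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy example: the hypothetical ASR transcript drops one word,
    # so there is 1 deletion over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.4f}")

    # Toy 2-dimensional "speaker embeddings" pointing the same way.
    sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])
    print(f"Speaker similarity: {sim:.4f}")
```

In practice the per-utterance edit counts are usually summed over the whole test set before dividing, rather than averaging per-utterance ratios, so long and short utterances are weighted by their word counts.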

Primary metric: wer

§ 03 · Top 10

Leading models.

Leading models on LibriTTS test-clean (Zero-Shot TTS).

#	Model	wer	Year	Source
★	NaturalSpeech 3	1.81	2026	paper ↗
2	Voicebox	1.90	2026	paper ↗
3	VALL-E	5.90	2026	paper ↗

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

LibriTTS test-clean (Zero-Shot TTS)

CANONICAL

3 results · wer

Top: NaturalSpeech 3 (1.81)

§ 05 · Related tasks

Other tasks in Audio.

Audio Classification · Audio-Language Models · Automatic Speech Recognition · Text-to-speech
