Speech

Voice Cloning

Replicating a speaker's voice characteristics.


Voice cloning replicates a specific person's voice from minimal reference audio, enabling TTS in any voice with as little as 3-30 seconds of sample speech. XTTS-v2, OpenVoice, and ElevenLabs have made the technology accessible, but the ethical implications — deepfakes, fraud, impersonation — make this one of the most dual-use capabilities in AI.

History

2018

SV2TTS (Jia et al., Google) demonstrates voice cloning from a few seconds of reference audio using speaker verification embeddings

2019

Transfer learning from multi-speaker TTS enables few-shot voice cloning by fine-tuning on 5-10 minutes of target speech

2021

YourTTS (Casanova et al.) achieves cross-lingual voice cloning — speak in a language the reference speaker never used

2023

VALL-E (Microsoft) uses neural codec language modeling to clone voices from just 3 seconds of audio

2023

XTTS-v2 (Coqui) provides open-source zero-shot voice cloning in 17 languages from a 6-second reference clip

2023

ElevenLabs Professional Voice Cloning captures studio-quality voice replicas from 30 minutes of audio

2024

OpenVoice v2 (MyShell) separates voice cloning from language/emotion control, enabling flexible voice transfer

2024

F5-TTS and MaskGCT demonstrate high-fidelity voice cloning with flow-matching and masked generative approaches

2025

Voice consent protocols and synthetic speech watermarking emerge as industry responses to cloning misuse

How Voice Cloning Works

Voice Cloning Pipeline
1

Speaker encoding

Reference audio (3-30 seconds) is processed by a speaker encoder to extract a voice embedding capturing timbre, pitch, and speaking style

2

Text processing

Target text is converted to phonemes or linguistic features for synthesis

3

Conditioned generation

A TTS model generates speech conditioned on both the text and the speaker embedding, producing audio in the target voice

4

Voice adaptation

Advanced systems use fine-tuning on the reference audio to capture nuances (breathing, pronunciation, rhythm) beyond the embedding

5

Quality enhancement

Post-processing with HiFi-GAN or BigVGAN improves waveform quality and removes artifacts
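The speaker-encoding step above can be sketched in a toy example. A real system uses a trained neural speaker encoder (e.g. a speaker-verification network producing d-vectors); here the "embedding" is just summary statistics of a log-magnitude spectrogram, and the "speakers" are synthetic harmonic tones. All names and parameters below are illustrative assumptions, not part of any real cloning system:

```python
import numpy as np

SR = 16000  # sample rate (Hz)

def fake_voice(pitch_hz: float, seconds: float = 3.0) -> np.ndarray:
    """Stand-in for reference audio: a harmonic tone at a given pitch."""
    t = np.arange(int(SR * seconds)) / SR
    return sum(np.sin(2 * np.pi * pitch_hz * k * t) / k for k in (1, 2, 3))

def speaker_embedding(wave: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Crude 'voice embedding': mean log spectrum over frames (a timbre proxy)."""
    frames = wave[: len(wave) // n_fft * n_fft].reshape(-1, n_fft)
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    emb = np.log1p(spec).mean(axis=0)
    return emb / np.linalg.norm(emb)  # unit-normalize, as real encoders do

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two unit-norm embeddings."""
    return float(a @ b)

# Same 'speaker', different clip lengths -> embeddings nearly identical.
same = similarity(speaker_embedding(fake_voice(120, 3.0)),
                  speaker_embedding(fake_voice(120, 2.0)))
# Different 'speakers' (different pitch) -> lower similarity.
diff = similarity(speaker_embedding(fake_voice(120, 3.0)),
                  speaker_embedding(fake_voice(260, 3.0)))
```

In a full pipeline, this embedding would then condition the TTS decoder (step 3); speaker-verification systems use the same similarity score to decide whether two clips come from the same voice.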

Current Landscape

Voice cloning in 2025 is simultaneously one of the most impressive and controversial AI capabilities. The technology has reached the point where casual listeners cannot distinguish cloned speech from real recordings, especially with commercial services like ElevenLabs. Open-source alternatives (XTTS-v2, OpenVoice, F5-TTS) are close behind. The field is grappling with dual-use concerns: legitimate applications (audiobooks, voice preservation for medical patients, content localization) coexist with malicious ones (phone scams, political deepfakes, non-consensual impersonation). Industry self-regulation (voice consent, watermarking) is emerging but not yet standardized.

Key Challenges

Ethical risk: voice cloning enables impersonation, fraud, and non-consensual deepfakes of real individuals

Speaker similarity vs. naturalness tradeoff: maximizing voice similarity can reduce overall speech quality

Emotional range: cloned voices often sound flat compared to the original speaker's expressive range

Cross-lingual cloning: maintaining voice identity when generating speech in a language the speaker has never spoken

Consent and detection: no standardized framework exists for voice usage consent or synthetic speech detection

Quick Recommendations

Best quality (commercial)

ElevenLabs Voice Cloning

Professional-grade cloning from 30 minutes of audio; best speaker similarity on the market

Open-source zero-shot

XTTS-v2 or F5-TTS

Clone from 6-10 seconds of reference; 17+ languages; fully open weights

Flexible voice control

OpenVoice v2

Decouples tone color from style, emotion, and accent for more controllable voice transfer

Research / custom training

VALL-E X or XTTS-v2 fine-tuned

Full training pipelines for custom voice datasets and specialized applications

What's Next

Expect mandatory voice watermarking that embeds detectable signatures in all AI-generated speech, regulatory frameworks requiring explicit consent for voice cloning, and voice authentication systems that can reliably detect cloned speech. On the positive side, voice preservation services (for terminally ill patients, aging relatives) will become a product category, and personalized voice assistants that speak in your own voice will go mainstream. Cross-lingual voice cloning will let anyone 'speak' any language in their own voice.
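The watermarking idea mentioned above can be illustrated with a minimal spread-spectrum sketch: embed a keyed pseudorandom sequence at low amplitude, then detect it by correlation against the same key. This is a toy under stated assumptions (the function names, key, and threshold are invented for illustration); production schemes add perceptual masking and robustness to compression and re-recording:

```python
import numpy as np

def watermark(signal: np.ndarray, key: int, strength: float = 0.01) -> np.ndarray:
    """Add a low-amplitude keyed +/-1 carrier to the audio."""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=signal.shape)
    return signal + strength * carrier

def detect(signal: np.ndarray, key: int, threshold: float = 0.005) -> bool:
    """Correlate against the keyed carrier; marked audio scores near `strength`."""
    carrier = np.random.default_rng(key).choice([-1.0, 1.0], size=signal.shape)
    score = float(signal @ carrier) / len(signal)
    return score > threshold

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000) * 0.1   # stand-in for 1 s of audio
marked = watermark(speech, key=42)

detect(marked, key=42)   # True: the correct key recovers the mark
detect(speech, key=42)   # False: unmarked audio correlates near zero
```

Without the key, the carrier is statistically indistinguishable from noise, which is why real proposals pair watermarking with key management and standardized detectors.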

Benchmarks & SOTA

No datasets indexed for this task yet.

Contribute on GitHub

Related Tasks

Text-to-Speech

Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while models such as Bark, Microsoft's VALL-E, and F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.

Speaker Verification

Verifying speaker identity from voice samples.

Speech Translation

Translating spoken audio directly to another language.

Speech Recognition

Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). Assembly AI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.

Something wrong or missing?

Help keep Voice Cloning benchmarks accurate. Report outdated results, missing benchmarks, or errors.
