Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Modern systems replicate a specific person's voice from minimal reference audio, enabling text-to-speech in any voice from as little as 3-30 seconds of sample speech. XTTS-v2, OpenVoice, and ElevenLabs have made the technology widely accessible, but the ethical implications (deepfakes, fraud, impersonation) make this one of the most dual-use capabilities in AI.
History
SV2TTS (Jia et al., Google) demonstrates voice cloning from a few seconds of reference audio using speaker verification embeddings
Transfer learning from multispeaker TTS enables zero-shot voice cloning with fine-tuning on 5-10 minutes of target speech
YourTTS (Casanova et al.) achieves cross-lingual voice cloning — speak in a language the reference speaker never used
VALL-E (Microsoft) uses neural codec language modeling to clone voices from just 3 seconds of audio
XTTS-v2 (Coqui) provides open-source zero-shot voice cloning in 17 languages from 6-second reference
ElevenLabs Professional Voice Cloning captures studio-quality voice replicas from 30 minutes of audio
OpenVoice v2 (MyShell) separates voice cloning from language/emotion control, enabling flexible voice transfer
F5-TTS and MaskGCT demonstrate high-fidelity voice cloning with flow-matching and masked generative approaches
Voice consent protocols and synthetic speech watermarking emerge as industry responses to cloning misuse
How Voice Cloning Works
Speaker encoding
Reference audio (3-30 seconds) is processed by a speaker encoder to extract a voice embedding capturing timbre, pitch, and speaking style
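The encoding step above can be illustrated with a toy sketch: real systems use a trained d-vector/x-vector network (e.g., the GE2E encoder behind SV2TTS), but the interface is the same — audio in, fixed-size unit vector out, compared by cosine similarity. Everything below (the frame size, the energy/zero-crossing features) is a made-up stand-in, not any production encoder.

```python
import numpy as np

def toy_speaker_embedding(wav: np.ndarray, frame: int = 256) -> np.ndarray:
    """Toy stand-in for a speaker encoder: per-frame energy and
    zero-crossing statistics pooled into a fixed-size vector.
    Real systems use a trained neural speaker encoder instead."""
    n = len(wav) // frame * frame
    frames = wav[:n].reshape(-1, frame)
    energy = np.log1p((frames ** 2).mean(axis=1))            # loudness proxy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # pitch proxy
    feats = np.stack([energy, zcr], axis=1)
    emb = np.concatenate([feats.mean(axis=0), feats.std(axis=0)])
    return emb / (np.linalg.norm(emb) + 1e-9)  # unit-normalize

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two unit-norm embeddings."""
    return float(a @ b)

# Two "speakers" faked as sines at different fundamental frequencies
sr = 16000
t = np.arange(sr * 2) / sr
spk_a = np.sin(2 * np.pi * 120 * t)   # lower-pitched stand-in voice
spk_b = np.sin(2 * np.pi * 220 * t)   # higher-pitched stand-in voice
emb_a1 = toy_speaker_embedding(spk_a[:sr])   # two clips of "speaker A"
emb_a2 = toy_speaker_embedding(spk_a[sr:])
emb_b = toy_speaker_embedding(spk_b[:sr])
print(similarity(emb_a1, emb_a2), similarity(emb_a1, emb_b))
```

Same-speaker clips score higher than cross-speaker pairs; production encoders apply the same comparison, just with embeddings learned on thousands of speakers.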
Text processing
Target text is converted to phonemes or linguistic features for synthesis
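A minimal sketch of this front-end step, using a hypothetical two-word ARPAbet lexicon: real pipelines use a trained grapheme-to-phoneme model or an espeak-backed phonemizer with full dictionaries and stress marks, but the text-to-phoneme interface looks like this.

```python
# Hypothetical mini-lexicon in ARPAbet; real front-ends use a full
# pronunciation dictionary plus a trained G2P model for unknown words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text: str) -> list[str]:
    """Convert raw text to a flat phoneme sequence for the synthesizer."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Crude out-of-vocabulary fallback: spell the word letter by letter
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes

print(to_phonemes("Hello, world!"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```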
Conditioned generation
A TTS model generates speech conditioned on both the text and the speaker embedding, producing audio in the target voice
Voice adaptation
Advanced systems use fine-tuning on the reference audio to capture nuances (breathing, pronunciation, rhythm) beyond the embedding
Quality enhancement
Post-processing with HiFi-GAN or BigVGAN improves waveform quality and removes artifacts
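The five steps above can be strung together as a purely structural sketch. Nothing here is a real model: the "decoder" is a sine generator whose pitch is read off one embedding dimension, and the "enhancement" stage is a fade envelope standing in for a neural vocoder like HiFi-GAN. It only shows how text features and a speaker embedding jointly condition the generated waveform.

```python
import numpy as np

def toy_clone_tts(phonemes: list[str], speaker_emb: np.ndarray,
                  sr: int = 16000, unit_dur: float = 0.1) -> np.ndarray:
    """Structural sketch of conditioned generation: length comes from the
    phoneme sequence, pitch from the speaker embedding. A real model
    (XTTS-v2, VALL-E, F5-TTS) learns this mapping from data."""
    f0 = 80 + 200 * float(speaker_emb[0])   # pretend dim 0 encodes pitch
    n = int(sr * unit_dur * len(phonemes))  # fixed duration per phoneme
    t = np.arange(n) / sr
    wav = 0.5 * np.sin(2 * np.pi * f0 * t)
    # "Quality enhancement": crude fade in/out instead of a neural vocoder
    ramp = min(256, n)
    env = np.ones(n)
    env[:ramp] *= np.linspace(0.0, 1.0, ramp)
    env[-ramp:] *= np.linspace(1.0, 0.0, ramp)
    return wav * env

emb = np.array([0.4, 0.1, 0.2, 0.3])        # stand-in speaker embedding
wav = toy_clone_tts(["HH", "AH", "L", "OW"], emb)
print(wav.shape)  # (6400,) — 4 phonemes x 0.1 s x 16 kHz
```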
Current Landscape
Voice cloning in 2025 is simultaneously one of the most impressive and controversial AI capabilities. The technology has reached the point where casual listeners cannot distinguish cloned speech from real recordings, especially with commercial services like ElevenLabs. Open-source alternatives (XTTS-v2, OpenVoice, F5-TTS) are close behind. The field is grappling with dual-use concerns: legitimate applications (audiobooks, voice preservation for medical patients, content localization) coexist with malicious ones (phone scams, political deepfakes, non-consensual impersonation). Industry self-regulation (voice consent, watermarking) is emerging but not yet standardized.
Key Challenges
Ethical risk: voice cloning enables impersonation, fraud, and non-consensual deepfakes of real individuals
Speaker similarity vs. naturalness tradeoff: maximizing voice similarity can reduce overall speech quality
Emotional range: cloned voices often sound flat compared to the original speaker's expressive range
Cross-lingual cloning: maintaining voice identity when generating speech in a language the speaker has never spoken
Consent and detection: no standardized framework exists for voice usage consent or synthetic speech detection
Quick Recommendations
Best quality (commercial)
ElevenLabs Voice Cloning
Professional-grade cloning from 30 minutes of audio; among the best speaker similarity on the market
Open-source zero-shot
XTTS-v2 or F5-TTS
Clone from 6-10 seconds of reference; 17+ languages; fully open weights
Flexible voice control
OpenVoice v2
Decouples tone color from style, emotion, and accent for more controllable voice transfer
Research / custom training
VALL-E X or XTTS-v2 fine-tuned
Full training pipelines for custom voice datasets and specialized applications
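For the open-source zero-shot route, a minimal usage sketch with Coqui's `TTS` Python package (`pip install TTS`) looks like the following. The file paths are placeholders, the model downloads on first use, and the exact model identifier and keyword arguments can shift between package versions, so treat this as an assumption to check against the current Coqui docs rather than a guaranteed API.

```python
def clone_with_xtts(text: str, reference_wav: str, out_path: str,
                    language: str = "en") -> None:
    """Zero-shot voice cloning with Coqui XTTS-v2: synthesize `text`
    in the voice of the ~6+ second clip at `reference_wav`."""
    from TTS.api import TTS  # imported lazily: heavy, optional dependency
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text,
                    speaker_wav=reference_wav,  # reference clip (placeholder path)
                    language=language,
                    file_path=out_path)

# Example call (requires the TTS package and a real reference clip):
# clone_with_xtts("Hello from a cloned voice.", "reference.wav", "out.wav")
```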
What's Next
Expect mandatory voice watermarking that embeds detectable signatures in all AI-generated speech, regulatory frameworks requiring explicit consent for voice cloning, and voice authentication systems that can reliably detect cloned speech. On the positive side, voice preservation services (for terminal patients, aging relatives) will become a product category, and personalized voice assistants that speak in your own voice will go mainstream. Cross-lingual voice cloning will let anyone "speak" any language in their own voice.
Benchmarks & SOTA
Related Tasks
Audio-Language Models
Audio-Language Models (ALMs) are a form of artificial intelligence that extend natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and the creation of generative audio, such as text-to-audio synthesis.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.