Voice cloning
Voice cloning is a type of audio deepfake technology that uses machine learning to create a digital replica of a specific person's voice, synthesizing spoken audio that mimics their vocal characteristics like pitch and tone. While it has positive uses, such as generating audiobooks or helping people who have lost their voice, it is also used for malicious purposes, including creating convincing scams where fraudsters impersonate individuals.
Modern systems replicate a specific person's voice from minimal reference audio, enabling text-to-speech in any voice from as little as 3-30 seconds of sample speech. XTTS-v2, OpenVoice, and ElevenLabs have made the technology widely accessible, but the ethical implications (deepfakes, fraud, impersonation) make this one of the most dual-use capabilities in AI.
History
SV2TTS (Jia et al., Google) demonstrates voice cloning from a few seconds of reference audio using speaker verification embeddings
Transfer learning from multispeaker TTS enables zero-shot voice cloning with fine-tuning on 5-10 minutes of target speech
YourTTS (Casanova et al.) achieves cross-lingual voice cloning — speak in a language the reference speaker never used
VALL-E (Microsoft) uses neural codec language modeling to clone voices from just 3 seconds of audio
XTTS-v2 (Coqui) provides open-source zero-shot voice cloning in 17 languages from 6-second reference
ElevenLabs Professional Voice Cloning captures studio-quality voice replicas from 30 minutes of audio
OpenVoice v2 (MyShell) separates voice cloning from language/emotion control, enabling flexible voice transfer
F5-TTS and MaskGCT demonstrate high-fidelity voice cloning with flow-matching and masked generative approaches
Voice consent protocols and synthetic speech watermarking emerge as industry responses to cloning misuse
How Voice Cloning Works
Speaker encoding
Reference audio (3-30 seconds) is processed by a speaker encoder to extract a voice embedding capturing timbre, pitch, and speaking style
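The encoding step above can be illustrated with a toy sketch: real systems use a trained d-vector/x-vector network (e.g., the GE2E encoder behind SV2TTS), but the interface is the same — audio in, fixed-size unit vector out, compared by cosine similarity. Everything below (the frame size, the energy/zero-crossing features) is a made-up stand-in, not any production encoder.

```python
import numpy as np

def toy_speaker_embedding(wav: np.ndarray, frame: int = 256) -> np.ndarray:
    """Toy stand-in for a speaker encoder: per-frame energy and
    zero-crossing statistics pooled into a fixed-size vector.
    Real systems use a trained neural speaker encoder instead."""
    n = len(wav) // frame * frame
    frames = wav[:n].reshape(-1, frame)
    energy = np.log1p((frames ** 2).mean(axis=1))            # loudness proxy
    zcr = (np.diff(np.sign(frames), axis=1) != 0).mean(axis=1)  # pitch proxy
    feats = np.stack([energy, zcr], axis=1)
    emb = np.concatenate([feats.mean(axis=0), feats.std(axis=0)])
    return emb / (np.linalg.norm(emb) + 1e-9)  # unit-normalize

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two unit-norm embeddings."""
    return float(a @ b)

# Two "speakers" faked as sines at different fundamental frequencies
sr = 16000
t = np.arange(sr * 2) / sr
spk_a = np.sin(2 * np.pi * 120 * t)   # lower-pitched stand-in voice
spk_b = np.sin(2 * np.pi * 220 * t)   # higher-pitched stand-in voice
emb_a1 = toy_speaker_embedding(spk_a[:sr])   # two clips of "speaker A"
emb_a2 = toy_speaker_embedding(spk_a[sr:])
emb_b = toy_speaker_embedding(spk_b[:sr])
print(similarity(emb_a1, emb_a2), similarity(emb_a1, emb_b))
```

Same-speaker clips score higher than cross-speaker pairs; production encoders apply the same comparison, just with embeddings learned on thousands of speakers.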
Text processing
Target text is converted to phonemes or linguistic features for synthesis
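A minimal sketch of this front-end step, using a hypothetical two-word ARPAbet lexicon: real pipelines use a trained grapheme-to-phoneme model or an espeak-backed phonemizer with full dictionaries and stress marks, but the text-to-phoneme interface looks like this.

```python
# Hypothetical mini-lexicon in ARPAbet; real front-ends use a full
# pronunciation dictionary plus a trained G2P model for unknown words.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def to_phonemes(text: str) -> list[str]:
    """Convert raw text to a flat phoneme sequence for the synthesizer."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        # Crude out-of-vocabulary fallback: spell the word letter by letter
        phonemes.extend(LEXICON.get(word, list(word.upper())))
    return phonemes

print(to_phonemes("Hello, world!"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```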
Conditioned generation
A TTS model generates speech conditioned on both the text and the speaker embedding, producing audio in the target voice
Voice adaptation
Advanced systems use fine-tuning on the reference audio to capture nuances (breathing, pronunciation, rhythm) beyond the embedding
Quality enhancement
Post-processing with HiFi-GAN or BigVGAN improves waveform quality and removes artifacts
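The five steps above can be strung together as a purely structural sketch. Nothing here is a real model: the "decoder" is a sine generator whose pitch is read off one embedding dimension, and the "enhancement" stage is a fade envelope standing in for a neural vocoder like HiFi-GAN. It only shows how text features and a speaker embedding jointly condition the generated waveform.

```python
import numpy as np

def toy_clone_tts(phonemes: list[str], speaker_emb: np.ndarray,
                  sr: int = 16000, unit_dur: float = 0.1) -> np.ndarray:
    """Structural sketch of conditioned generation: length comes from the
    phoneme sequence, pitch from the speaker embedding. A real model
    (XTTS-v2, VALL-E, F5-TTS) learns this mapping from data."""
    f0 = 80 + 200 * float(speaker_emb[0])   # pretend dim 0 encodes pitch
    n = int(sr * unit_dur * len(phonemes))  # fixed duration per phoneme
    t = np.arange(n) / sr
    wav = 0.5 * np.sin(2 * np.pi * f0 * t)
    # "Quality enhancement": crude fade in/out instead of a neural vocoder
    ramp = min(256, n)
    env = np.ones(n)
    env[:ramp] *= np.linspace(0.0, 1.0, ramp)
    env[-ramp:] *= np.linspace(1.0, 0.0, ramp)
    return wav * env

emb = np.array([0.4, 0.1, 0.2, 0.3])        # stand-in speaker embedding
wav = toy_clone_tts(["HH", "AH", "L", "OW"], emb)
print(wav.shape)  # (6400,) — 4 phonemes x 0.1 s x 16 kHz
```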
Current Landscape
Voice cloning in 2025 is simultaneously one of the most impressive and controversial AI capabilities. The technology has reached the point where casual listeners cannot distinguish cloned speech from real recordings, especially with commercial services like ElevenLabs. Open-source alternatives (XTTS-v2, OpenVoice, F5-TTS) are close behind. The field is grappling with dual-use concerns: legitimate applications (audiobooks, voice preservation for medical patients, content localization) coexist with malicious ones (phone scams, political deepfakes, non-consensual impersonation). Industry self-regulation (voice consent, watermarking) is emerging but not yet standardized.
Key Challenges
Ethical risk: voice cloning enables impersonation, fraud, and non-consensual deepfakes of real individuals
Speaker similarity vs. naturalness tradeoff: maximizing voice similarity can reduce overall speech quality
Emotional range: cloned voices often sound flat compared to the original speaker's expressive range
Cross-lingual cloning: maintaining voice identity when generating speech in a language the speaker has never spoken
Consent and detection: no standardized framework exists for voice usage consent or synthetic speech detection
Quick Recommendations
Best quality (commercial)
ElevenLabs Voice Cloning
Professional-grade cloning from 30 minutes of audio; among the best speaker similarity on the market
Open-source zero-shot
XTTS-v2 or F5-TTS
Clone from 6-10 seconds of reference; 17+ languages; fully open weights
Flexible voice control
OpenVoice v2
Decouples tone color from style, emotion, and accent for more controllable voice transfer
Research / custom training
VALL-E X or XTTS-v2 fine-tuned
Full training pipelines for custom voice datasets and specialized applications
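For the open-source zero-shot route, a minimal usage sketch with Coqui's `TTS` Python package (`pip install TTS`) looks like the following. The file paths are placeholders, the model downloads on first use, and the exact model identifier and keyword arguments can shift between package versions, so treat this as an assumption to check against the current Coqui docs rather than a guaranteed API.

```python
def clone_with_xtts(text: str, reference_wav: str, out_path: str,
                    language: str = "en") -> None:
    """Zero-shot voice cloning with Coqui XTTS-v2: synthesize `text`
    in the voice of the ~6+ second clip at `reference_wav`."""
    from TTS.api import TTS  # imported lazily: heavy, optional dependency
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text,
                    speaker_wav=reference_wav,  # reference clip (placeholder path)
                    language=language,
                    file_path=out_path)

# Example call (requires the TTS package and a real reference clip):
# clone_with_xtts("Hello from a cloned voice.", "reference.wav", "out.wav")
```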
What's Next
Expect mandatory voice watermarking that embeds detectable signatures in all AI-generated speech, regulatory frameworks requiring explicit consent for voice cloning, and voice authentication systems that can reliably detect cloned speech. On the positive side, voice preservation services (for terminal patients, aging relatives) will become a product category, and personalized voice assistants that speak in your own voice will go mainstream. Cross-lingual voice cloning will let anyone "speak" any language in their own voice.
Benchmarks & SOTA
Related Tasks
Audio-Language Models
Audio-Language Models (ALMs) are a form of artificial intelligence that extend natural language processing (NLP) to the domain of audio, enabling computers to understand, generate, and reason about sounds and speech by integrating audio data with language understanding. Trained on audio-text data, ALMs bridge the gap between acoustic signals and linguistic meaning, allowing for tasks like zero-shot audio recognition, audio captioning, and the creation of generative audio, such as text-to-audio synthesis.
Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the technology that converts spoken language into written text. ASR systems process audio signals containing human speech and transcribe them into readable text format. These systems use acoustic models, language models, and often neural networks to recognize phonemes, words, and sentences from audio input. ASR is foundational for applications like voice assistants (Siri, Alexa), transcription services, voice-controlled systems, and accessibility tools for the hearing impaired.
Text-to-speech
Text-to-speech (TTS) is technology that converts written text into natural-sounding audio, also known as "read aloud" technology or speech synthesis. It works by analyzing text to understand words, punctuation, and sentence structure, then generating phonetic representations of those words before synthesizing them into a human-like voice output. TTS is a crucial form of assistive technology and a key component of natural language processing, making digital content accessible and improving user interaction in numerous applications.
Audio Classification
Classification of audio signals into predefined categories such as music genres, environmental sounds, or speaker identification.