Voice cloning
Last reviewed
May 1, 2026
Sources
24 citations
Review status
Source-backed
Revision
v1 · 3,937 words
Voice cloning is the use of machine learning to generate synthetic speech that imitates the voice of a specific human speaker, called the target speaker, given some reference audio. Modern systems can clone a voice from anywhere between a few seconds (zero-shot) and a few minutes or hours (fine-tuned) of reference audio, often with a level of fidelity that ordinary listeners struggle to distinguish from the real speaker.
The term sits inside the broader field of Text-to-Speech (TTS) but specifically refers to the speaker-conditioning side: not just generating any natural speech, but generating speech in a particular person's voice. Voice cloning has become one of the more visible applications of generative audio AI since around 2018, when neural speaker-encoder methods made few-second cloning practical. By 2023, codec-based language models such as Microsoft's VALL-E pushed zero-shot quality much closer to natural recordings, and commercial systems from ElevenLabs, Sesame, OpenAI, Resemble AI and others followed.
It is also one of the more controversial. Voice cloning has been used in CEO impersonation fraud, robocall election interference, non-consensual celebrity impersonation, and the wider deepfake problem. The U.S. Federal Communications Commission (FCC) declared AI-generated voices in robocalls illegal under the Telephone Consumer Protection Act on February 8, 2024, and SAG-AFTRA's 2023 actors' strike negotiated explicit AI replica protections.
The technology has both helpful and harmful uses, and most public debate is about how to keep one without the other.
On the constructive side, voice cloning supports accessibility (voice restoration for people with ALS, throat cancer, or other voice loss), media localization and dubbing, audiobook narration at scale, voice agents and conversational AI, and personal voice assistants. Apple's Personal Voice feature on iOS 17 (2023) lets users record about 15 minutes of speech and create a synthetic version of their own voice on device, intended for people whose voices may degrade over time.
On the harm side, voice cloning powers a class of social-engineering attacks that did not exist a decade ago. The first widely reported case was in 2019, when criminals used cloned audio to impersonate the German CEO of an energy company's parent firm and convince a UK subsidiary's chief executive to wire about $243,000 to a fraudulent supplier. Since then, voice cloning has shown up in fake kidnap-for-ransom calls, family-emergency scams targeting elderly relatives, and political deepfakes such as the January 2024 New Hampshire robocall that imitated President Joe Biden telling Democrats to skip the primary. Consent and copyright are also active issues: the Scarlett Johansson dispute over OpenAI's "Sky" voice in May 2024 became the canonical example of a commercial system shipping a voice that sounded uncomfortably like a real, unconsenting celebrity.
Voice cloning is therefore a useful case study for the broader generative AI policy debate. The same model that lets a person with ALS keep speaking in their own voice can also be used to defraud their grandmother.
Voice cloning is a sub-area of speech synthesis that has progressed through several technical generations. Each generation either added new capabilities or sharply reduced the amount of target-speaker data required.
| Era | Approach | Reference data needed | Quality |
|---|---|---|---|
| 1990s-2000s | Concatenative TTS: stitch pre-recorded speech units (diphones or longer segments) from one speaker | Hours of studio recordings of one speaker | Robotic, with audible joins |
| 2000s | Statistical parametric (HMM-based) synthesis | Tens of minutes to hours, single speaker | Smoother but "buzzy" |
| 2016-2017 | Neural TTS without speaker conditioning: Tacotron, WaveNet, Tacotron 2 | Tens of hours, single speaker | Near-natural for the trained voice only |
| 2017-2018 | Multi-speaker neural TTS: shared model with speaker IDs or learned embeddings | Tens of hours per speaker, all known in advance | Good for in-set speakers |
| 2018-2020 | Fine-tuning on target speaker from a base model | Roughly 30 minutes to a few hours | High quality, person-specific model |
| 2019-2022 | Few-shot adaptation: brief fine-tune on a small speaker set | Roughly 5 to 10 minutes | Usable, often with reduced naturalness |
| 2018-present | Zero-shot voice cloning: condition the synthesis model on a speaker embedding extracted from any reference utterance, no retraining required | A few seconds | Steadily improving, near-ground-truth in 2024-2025 systems |
In current research and commercial practice, "voice cloning" usually means the last two rows: a model that can clone a new voice on the fly from a short clip, optionally with extra fine-tuning for higher fidelity.
Modern voice cloning is built from a small set of reusable components.
A speaker embedding is a fixed-size vector that captures "who is talking" while throwing away "what they are saying." Early variants were i-vectors and Joint Factor Analysis. The neural era introduced d-vectors and x-vectors, which are extracted from speaker-verification networks trained to push utterances by the same speaker close together in embedding space and utterances by different speakers apart.
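The "close together, far apart" property is usually checked with cosine similarity. The sketch below is a toy illustration, not a trained encoder: the 256-dimensional size, the noise level, and the 0.7 decision threshold are all assumptions, and the "embeddings" are random vectors standing in for real encoder outputs.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
dim = 256  # assumed embedding size

# Toy stand-ins for encoder outputs: each speaker has a base vector,
# and each utterance adds a small amount of noise to it.
speaker_a = rng.normal(size=dim)
speaker_b = rng.normal(size=dim)
utt_a1 = speaker_a + 0.1 * rng.normal(size=dim)
utt_a2 = speaker_a + 0.1 * rng.normal(size=dim)
utt_b1 = speaker_b + 0.1 * rng.normal(size=dim)

same_speaker = cosine_similarity(utt_a1, utt_a2)
different_speaker = cosine_similarity(utt_a1, utt_b1)

THRESHOLD = 0.7  # hypothetical accept/reject threshold
print(f"same speaker:      {same_speaker:.3f} accept={same_speaker >= THRESHOLD}")
print(f"different speaker: {different_speaker:.3f} accept={different_speaker >= THRESHOLD}")
```

The same comparison, applied to a real encoder's output for a cloned clip and a genuine clip, is the basis of embedding-based speaker-similarity scores.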
A particularly important step was Wan, Wang, Papir, and Moreno's Generalized End-to-End Loss for Speaker Verification (Google, ICASSP 2018), which introduced the GE2E loss. GE2E reduced equal error rates by more than 10% over the previous tuple-based loss while cutting training time by roughly 60%. The same group's embeddings were reused as the speaker encoder in subsequent voice-cloning work.
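A minimal NumPy sketch of the softmax form of the GE2E loss follows, under stated assumptions. Each utterance embedding is scored against every speaker centroid with a scaled cosine similarity, the utterance is left out of its own speaker's centroid, and the loss rewards a high score for the true speaker relative to the others. The scale `w` and offset `b` are learned parameters in the original method; here they are held fixed at illustrative values, and no gradient step is shown.

```python
import numpy as np

def _unit(x: np.ndarray) -> np.ndarray:
    """L2-normalise along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def ge2e_softmax_loss(emb: np.ndarray, w: float = 10.0, b: float = -5.0) -> float:
    """emb has shape (n_speakers, n_utterances, dim); returns the mean loss."""
    n_spk, n_utt, _ = emb.shape
    emb = _unit(emb)

    # One centroid per speaker, used when scoring utterances of *other* speakers.
    centroids = _unit(emb.mean(axis=1))                              # (n_spk, dim)
    # Leave-one-out centroid: the utterance is excluded from its own speaker's mean.
    loo = _unit((emb.sum(axis=1, keepdims=True) - emb) / (n_utt - 1))  # (n_spk, n_utt, dim)

    sim = np.einsum("jid,kd->jik", emb, centroids)   # cos(e_ji, c_k)
    own = np.einsum("jid,jid->ji", emb, loo)         # cos(e_ji, c_j without e_ji)
    idx = np.arange(n_spk)
    sim[idx, :, idx] = own                           # overwrite the true-speaker entries
    sim = w * sim + b

    # Softmax loss per utterance: -S(true speaker) + log sum_k exp(S_k)
    peak = sim.max(axis=-1, keepdims=True)
    log_sum_exp = peak[..., 0] + np.log(np.exp(sim - peak).sum(axis=-1))
    return float((log_sum_exp - sim[idx, :, idx]).mean())

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 1, 64))
clustered = centers + 0.05 * rng.normal(size=(4, 5, 64))  # tight per-speaker clusters
scattered = rng.normal(size=(4, 5, 64))                   # no speaker structure

loss_clustered = ge2e_softmax_loss(clustered)
loss_scattered = ge2e_softmax_loss(scattered)
print(loss_clustered, loss_scattered)
```

Well-separated speaker clusters drive the loss toward zero, while embeddings with no speaker structure stay near the chance level for a four-way choice.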
The direct ancestor of modern zero-shot cloning is Jia et al., Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018), often abbreviated SV2TTS. It is a three-stage system: a speaker encoder trained on speaker verification with thousands of speakers, a Tacotron 2 sequence-to-sequence model that produces a mel spectrogram from text conditioned on the speaker embedding, and a WaveNet vocoder that turns the spectrogram into a waveform. The key claim was that knowledge learned by a discriminative speaker encoder could transfer to generative TTS, allowing voice cloning of speakers that had never been seen during TTS training.
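The three-stage data flow can be sketched at the level of array shapes. Everything in the block below is a stand-in: the function bodies are toy pooling and random projections rather than trained networks, and the sizes (`EMB_DIM`, `TEXT_DIM`, `N_MELS`, `HOP`) are illustrative assumptions rather than values from the paper. The point it shows is structural: a fixed-size speaker vector is computed once from the reference audio and attached to every text-encoder timestep before the spectrogram is produced.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, TEXT_DIM, N_MELS, HOP = 256, 128, 80, 200  # illustrative sizes only

def speaker_encoder(wav: np.ndarray) -> np.ndarray:
    """Toy pooling: any-length audio -> one fixed-size, unit-length vector."""
    usable = len(wav) // EMB_DIM * EMB_DIM
    pooled = wav[:usable].reshape(-1, EMB_DIM).mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def synthesizer(text: str, spk_emb: np.ndarray) -> np.ndarray:
    """Toy text-to-mel stage conditioned on the speaker embedding."""
    text_states = rng.normal(size=(len(text), TEXT_DIM))         # one state per character
    tiled = np.tile(spk_emb, (len(text), 1))                     # same speaker vector at every step
    conditioned = np.concatenate([text_states, tiled], axis=1)   # (chars, TEXT_DIM + EMB_DIM)
    projection = rng.normal(size=(conditioned.shape[1], N_MELS))
    return np.repeat(conditioned @ projection, 4, axis=0)        # 4 mel frames per character

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Toy mel-to-waveform stage: HOP noise samples per mel frame."""
    return rng.normal(size=mel.shape[0] * HOP)

reference = rng.normal(size=16000 * 3)   # 3 s of fake 16 kHz reference audio
embedding = speaker_encoder(reference)   # stage 1
mel = synthesizer("hello world", embedding)  # stage 2
wav = vocoder(mel)                       # stage 3
print(embedding.shape, mel.shape, wav.shape)
```

Because the speaker encoder is a separate stage, cloning a new voice in this design only requires running stage 1 on a new reference clip; the synthesizer and vocoder are left untouched.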
SV2TTS was the first widely demonstrated few-second zero-shot voice cloning system, and an open-source reimplementation by Corentin Jemine became one of the most-used voice-cloning code bases of the late 2010s.
A vocoder turns an intermediate acoustic representation, usually a mel spectrogram, into a raw waveform. The quality of the vocoder largely determines whether a synthesized voice sounds clean or muffled.
The shift in 2022-2023 was to move from continuous mel spectrograms to discrete audio tokens. Google's SoundStream (Zeghidour et al., 2021-2022) and Meta's Encodec (Défossez et al., arXiv 2210.13438, October 2022) are encoder-decoder models that compress audio into a small number of parallel token streams using residual vector quantization, then decode those tokens back into a high-quality waveform. Encodec runs at 24 kHz and 48 kHz and can compress speech to a few kbps while remaining intelligible. Descript Audio Codec (DAC) is a later variant.
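Residual vector quantization can be sketched in a few lines, under stated assumptions. Each stage picks the nearest codebook entry to whatever error the previous stages left behind, so one frame becomes a short list of integer codes. The codebooks here are random rather than trained, and each one includes an all-zero entry purely so that the toy reconstruction error can never grow from stage to stage. The bitrate helper uses 75 frames per second, 8 codebooks, and 1,024 entries per codebook as assumed values in the range of a 24 kHz neural codec.

```python
import math
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> list:
    """Quantise x stage by stage; each stage codes the residual of the previous one."""
    residual = x.astype(float).copy()
    codes = []
    for cb in codebooks:                                   # cb: (codebook_size, dim)
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes

def rvq_decode(codes: list, codebooks: list) -> np.ndarray:
    """Reconstruction is the sum of the selected entry from each stage."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

def bitrate_bps(frames_per_second: int, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second = frames/s x codes per frame x bits per code."""
    return frames_per_second * n_codebooks * math.log2(codebook_size)

rng = np.random.default_rng(0)
dim, size, stages = 8, 64, 4
codebooks = []
for stage in range(stages):
    cb = rng.normal(scale=0.5 ** stage, size=(size, dim))  # finer codebooks at later stages
    cb[0] = 0.0                                            # zero entry: a stage may "pass"
    codebooks.append(cb)

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
errors = [
    float(np.linalg.norm(x - rvq_decode(codes[:k], codebooks[:k])))
    for k in range(1, stages + 1)
]
print("codes:", codes)
print("error after each stage:", [round(e, 3) for e in errors])
print("assumed bitrate:", bitrate_bps(75, 8, 1024), "bps")
```

Dropping later codebooks trades quality for bitrate, which is why codecs of this kind can offer several bitrates from one model.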
Once audio is a token sequence, the same Transformer machinery used for text language models can be used to model speech, which is exactly what VALL-E does.
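The idea of continuing an audio prompt by next-token prediction can be shown with a deliberately tiny stand-in. The sketch below swaps the Transformer for a bigram count model and uses a hypothetical 16-entry vocabulary of audio token ids; it is meant only to show the shape of the procedure (fit on token sequences, then extend a prompt one token at a time), not any real system's architecture.

```python
import numpy as np

VOCAB = 16  # hypothetical codebook size for the toy example

def fit_bigram(tokens: list, vocab: int = VOCAB) -> np.ndarray:
    """Estimate P(next token | current token) from counts, with add-one smoothing."""
    counts = np.ones((vocab, vocab))
    for current, nxt in zip(tokens[:-1], tokens[1:]):
        counts[current, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def continue_sequence(probs: np.ndarray, prompt: list, n_new: int) -> list:
    """Greedy decoding: repeatedly append the most likely next token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(int(np.argmax(probs[out[-1]])))
    return out

# Toy "training audio": token ids that cycle 0, 1, ..., 15, 0, 1, ...
training_tokens = [i % VOCAB for i in range(400)]
probs = fit_bigram(training_tokens)

acoustic_prompt = [3, 4, 5]  # stands in for the tokens of a short reference clip
generated = continue_sequence(probs, acoustic_prompt, n_new=5)
print(generated)  # [3, 4, 5, 6, 7, 8, 9, 10]
```

A real codec language model conditions on text as well as the acoustic prompt and predicts several token streams per frame, but the generation loop has the same outline.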
The table below covers the current generation of zero-shot voice-cloning systems. The boundary between research, open source, and commercial is fuzzy and shifts year to year.
| System | Org | First public | Approach | Open source? |
|---|---|---|---|---|
| YourTTS | Casanova et al. | ICML 2022 | VITS extension, multilingual zero-shot | Yes |
| Tortoise TTS | James Betker | Jan 2022 | Autoregressive plus diffusion, five-stage pipeline | Yes (Apache 2.0) |
| VALL-E | Microsoft | Jan 2023 | GPT-style decoder over Encodec tokens, 3-second prompt | No (research) |
| VALL-E X | Microsoft | March 2023 | Cross-lingual extension of VALL-E | No (research) |
| NaturalSpeech 2 | Microsoft | April 2023 | Latent diffusion over codec features, also sings | No (research) |
| Voicebox | Meta | June 2023 | Non-autoregressive flow matching, infill-style training | No, weights withheld |
| StyleTTS 2 | Li et al. (Columbia) | June 2023, NeurIPS 2023 | Style diffusion plus speech LM adversarial training | Yes |
| XTTS / XTTS-v2 | Coqui | Sept-Nov 2023 | Multilingual zero-shot from a 6-second clip, 17 languages | Yes |
| NaturalSpeech 3 | Microsoft | March 2024 | Factorized neural codec plus factorized diffusion | No (research) |
| OpenAI Voice Engine | OpenAI | 29 March 2024 (preview) | 15-second sample, multilingual | No, limited partner access |
| F5-TTS | Chen et al. (SJTU) | October 2024 | Non-autoregressive flow matching with Diffusion Transformer | Yes |
| Sesame CSM | Sesame | Demo Feb 2025, weights March 2025 | Conversational speech model with paralinguistic detail | Partial (1B variant on HuggingFace) |
| Eleven Multilingual v2 / v3 | ElevenLabs | August 2023 onward | Proprietary, ~29 languages, voice cloning plus voice design | No |
| Resemble AI, WellSaid, Murf, Play.ht, Hume AI | Various commercial | 2019 onward | Proprietary stacks, mostly fine-tune plus zero-shot hybrids | No |
Some facts worth pinning down. VALL-E (Wang et al., arXiv 2301.02111, January 5, 2023) was trained on around 60,000 hours of English speech and treats TTS as conditional language modeling over Encodec tokens, taking a 3-second "acoustic prompt" of the target speaker. It can carry over not just timbre but also the speaker's emotion and acoustic environment. Voicebox (Le et al., June 2023) was trained on more than 50,000 hours of unfiltered speech and reported a word error rate of about 1.9% versus VALL-E's 5.9% on the same benchmark, but Meta chose not to release weights, citing safety concerns. NaturalSpeech 3 (Ju et al., March 2024) factorizes speech into content, prosody, timbre, and acoustic detail, each generated in its own subspace by a factorized diffusion model, and was scaled to 1B parameters and 200K hours of training data. Sesame's Conversational Speech Model (CSM) demo went viral in early 2025 for its handling of pauses, breath sounds, and self-corrections; the company later open-sourced a 1B-parameter variant.
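Voice cloning is unusual for a high-end generative AI capability in that strong open implementations exist alongside the commercial ones. Those marked open in the table above are YourTTS, Tortoise TTS, StyleTTS 2, XTTS / XTTS-v2, and F5-TTS, with Sesame CSM partially open through its 1B variant.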
Open-weight availability is one reason the policy conversation is so hard. Even if every commercial provider added perfect watermarking tomorrow, an attacker could still pull XTTS-v2 down from Hugging Face and run it locally.
Voice-cloning systems are typically evaluated along several axes at once. There is no single number that captures "how good" a clone is.
| Metric | What it measures | Typical use |
|---|---|---|
| MOS (Mean Opinion Score) | Subjective 1-5 naturalness rating from human listeners | Compares overall speech quality |
| MOS-S / SECS (Speaker Encoder Cosine Similarity) | How close the cloned voice sounds to the target | Compares speaker fidelity |
| WER (Word Error Rate) | An ASR system transcribes the synthesis; measures pronunciation accuracy | Compares intelligibility, robustness |
| RTF (Real-Time Factor) | Synthesis time per second of output audio | Important for live voice agents |
| Prosody / emotion similarity | Match of pitch, rhythm, expressiveness | Important for narration and acting |
Reported numbers should be read with care. Many papers compare on different subsets of LibriSpeech or VCTK, with different reference durations and different ASR systems for WER, so figures from different papers are rarely directly comparable. As a rough guide, by 2024 the best zero-shot systems claim SECS scores from the high 0.5s to about 0.7 from a few seconds of audio, climbing to roughly 0.7-0.85 with about 10 minutes of fine-tuning data. Several recent systems (Voicebox, NaturalSpeech 3, Eleven Multilingual v2) claim MOS ratings statistically indistinguishable from real recordings on certain benchmarks. Whether that survives outside the lab is another question.
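Two of the metrics in the table above are simple enough to compute by hand. The sketch below is a minimal version, assuming whitespace tokenization and lowercasing for WER (published evaluations normally apply more careful text normalization) and taking the ASR transcript and timing figures as given inputs; the example sentences and timings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))          # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word), # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean audio is produced faster than it plays back."""
    return synthesis_seconds / audio_seconds

target_text = "the quick brown fox jumps"
asr_transcript = "the quick brown box jumps"   # hypothetical ASR output for the synthesized clip

print(word_error_rate(target_text, asr_transcript))   # 0.2
print(real_time_factor(synthesis_seconds=0.5, audio_seconds=2.0))  # 0.25
```

Because WER depends on the ASR system used to transcribe the synthesis, the same clip can score differently under two evaluation setups.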
Voice cloning has more legitimate commercial uses than is sometimes appreciated.
The abuse cases are not theoretical. They show up in the news regularly enough that voice cloning is now a recurring topic in U.S. and EU regulatory work.
| Concern | Concrete example |
|---|---|
| CEO / vendor impersonation fraud | A UK energy firm wired roughly $243,000 in 2019 after a cloned voice impersonated the German CEO of its parent company; multiple later cases reported in seven-figure ranges |
| Family-emergency / kidnap scams | "Grandparent" calls using cloned voices of relatives, frequently flagged by U.S. and Canadian consumer-protection agencies |
| Election interference robocalls | January 2024 New Hampshire primary robocall using a fake Joe Biden voice telling voters to stay home |
| Non-consensual celebrity voices | OpenAI's "Sky" voice, perceived as resembling Scarlett Johansson, removed in May 2024 after she objected through counsel |
| Performer consent and compensation | SAG-AFTRA's 2023 TV/Theatrical contract included AI replica protections; the 2024-2025 video game strike, ended July 2025, added consent, transparency, and compensation rules for digital voice replicas |
| Voice biometric defeat | Banks that use voice as an authentication factor are increasingly vulnerable to synthetic speech cloned from recordings of the account holder |
| Defamation and harassment | Cloned voices used to impersonate teachers, public figures, or private individuals to damage their reputations |
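Several regulatory and standards efforts are now in place, including the FCC's February 2024 ruling that AI-generated voices in robocalls are illegal under the Telephone Consumer Protection Act and the C2PA provenance standard.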
Industry self-regulation is uneven. ElevenLabs, OpenAI, Resemble AI and others publish acceptable-use policies, run abuse-detection on uploads, and embed watermarks. Open-weight models on Hugging Face have no such gating, which is why detection has become its own research field.
Detection takes two main forms: passive and active.
Active watermarking embeds a signal at synthesis time. SynthID Audio, announced by Google DeepMind in November 2023 and expanded in 2024, embeds an inaudible watermark into audio generated by Google models such as Lyria. The watermark is designed to survive common transformations including added noise, MP3 compression, and tempo changes. C2PA provenance metadata is a complementary approach: cryptographically signed information about how a piece of media was created, attached to the file. Resemble AI ships its own watermarking and maintains Resemblyzer, a free open-source speaker-embedding library that has been demonstrated for flagging fake speech by comparing a clip against genuine reference audio; ElevenLabs publishes an AI Speech Classifier that scores uploaded audio for the likelihood that it was generated with ElevenLabs.
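The general principle of embedding and later detecting a keyed signal can be illustrated with a classic spread-spectrum toy. This is not how SynthID or any vendor's watermark works, since those designs are not described here; the key seed, strength, and threshold are all hypothetical, and a watermark this strong would be audible rather than perceptually shaped. The block only shows why a detector that knows the key can find a signal that survives mild added noise.

```python
import numpy as np

def make_key(n_samples: int, seed: int) -> np.ndarray:
    """Pseudo-random +/-1 sequence derived from a secret seed."""
    return np.random.default_rng(seed).choice([-1.0, 1.0], size=n_samples)

def embed(audio: np.ndarray, seed: int, strength: float = 0.02) -> np.ndarray:
    """Add a low-amplitude copy of the key sequence to the audio."""
    return audio + strength * make_key(len(audio), seed)

def detect(audio: np.ndarray, seed: int) -> float:
    """Correlate with the key: close to `strength` if marked, close to 0 otherwise."""
    return float(np.dot(audio, make_key(len(audio), seed)) / len(audio))

sample_rate = 16000
t = np.arange(5 * sample_rate) / sample_rate
clean = 0.3 * np.sin(2 * np.pi * 220 * t)          # 5 s test tone standing in for speech
marked = embed(clean, seed=1234)
noisy = marked + 0.01 * np.random.default_rng(7).normal(size=len(marked))

THRESHOLD = 0.01  # hypothetical decision threshold, half the embedding strength
for name, clip, seed in [("clean", clean, 1234), ("marked", marked, 1234),
                         ("marked + noise", noisy, 1234), ("marked, wrong key", marked, 999)]:
    score = detect(clip, seed)
    print(f"{name:18s} score={score:+.4f} watermarked={score > THRESHOLD}")
```

The wrong-key case also shows the limitation noted for watermarking in general: detection only works for parties that hold the key and only on audio from a system that embedded it.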
Passive detection trains classifiers on acoustic and spectral features that distinguish synthesized speech from natural speech. The ASVspoof challenge series has been the main benchmark for this work since 2015. The hard part is that detectors trained on one generation of synthesis models tend to lose accuracy against the next generation, so adversarial robustness is an active research problem.
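Detectors of this kind are commonly summarized by an equal error rate (EER), the operating point at which genuine speech is rejected as often as spoofed speech is accepted. The sketch below is a simple threshold sweep over made-up scores, assuming a detector that outputs higher scores for audio it believes is genuine; benchmark tooling typically interpolates between thresholds rather than picking the closest one.

```python
import numpy as np

def equal_error_rate(genuine_scores, spoof_scores) -> float:
    """Sweep thresholds and return the error rate where false accepts and false rejects are closest."""
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for threshold in np.sort(np.concatenate([genuine, spoof])):
        false_accept = np.mean(spoof >= threshold)    # spoofed audio passed as genuine
        false_reject = np.mean(genuine < threshold)   # genuine audio flagged as spoofed
        gap = abs(false_accept - false_reject)
        if gap < best_gap:
            best_gap, eer = gap, (false_accept + false_reject) / 2
    return float(eer)

# Hypothetical detector scores (higher = more likely genuine)
separable = equal_error_rate([0.80, 0.90, 0.95], [0.10, 0.20, 0.30])
overlapping = equal_error_rate([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.6, 0.3])
print(separable, overlapping)  # 0.0 0.25
```

A detector's EER on the synthesis systems it was trained against can be much lower than its EER on newer systems, which is the generalization gap described above.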
None of these approaches solve the underlying problem on their own. Watermarks help only when the synthesis system cooperates, provenance helps only when receivers check it, and detectors lag the attackers by definition. The current consensus in policy circles is that defence has to be layered.
Personal Voice is worth a section because it is the largest consumer deployment of on-device voice cloning. Announced as part of iOS 17 in 2023, Personal Voice is built into iPhone, iPad, and Mac running iOS 17 / iPadOS 17 / macOS Sonoma or later. Users record themselves reading a series of prompts, around 15 minutes of audio in total, and the device trains a synthetic version of their voice locally. The model is encrypted and stored on device behind Face ID, Touch ID, or the device passcode, and never leaves the device unless the user opts to share it across their iCloud-linked devices.
The primary use case is accessibility. Personal Voice integrates with the Live Speech feature, which lets users type messages and have them spoken aloud in their own voice during phone calls, FaceTime, or in person. Apple positioned it for people who may lose the ability to speak, for example due to ALS, but anyone can create one. Because training and inference happen locally on the user's own recordings, Personal Voice sidesteps many of the consent and platform-abuse problems that come with cloud cloning.
Despite the impressive demos, voice cloning has well-known weak points.
Prosody and emotional control remain imperfect. Most systems can match the timbre of a target speaker more easily than they can convincingly act. Code-switching between languages mid-sentence, regional dialects, whispering, shouting, and singing all remain harder than steady-state read speech. Real-time low-latency cloning at conversational tempo is computationally demanding, which is why current voice agents often run smaller, less expressive models in the loop.
Multilingual zero-shot quality varies by language. English, Mandarin, Spanish, and a handful of others get the most training data and the best results; lower-resource languages can suffer from accent leakage from the source language or outright mispronunciation. Some systems hallucinate or skip text under certain prompts, particularly when the reference audio is short or noisy. High-quality reference audio still matters: 5 seconds of clean, expressive speech beats 30 seconds of phone-quality mumbling.
Voice cloning is in a phase where commercial competition is fierce and the technical frontier is moving quickly. Sesame's CSM demo in early 2025 reset expectations for naturalness and emotional range, with users describing extended conversations they had to remind themselves were synthetic. Multimodal foundation models (such as the audio-capable variants of GPT-4o and Gemini) are absorbing voice cloning into general-purpose models rather than treating it as a separate stack. The open-source side keeps narrowing the gap with closed systems, and almost every major launch in 2024-2025 has had to ship watermarking, abuse policies, and consent flows alongside the model itself.
It is genuinely hard to predict where the next year goes. The research is not slowing down, the misuse is not slowing down either, and the policy response is still catching up to systems that worked a year ago, never mind the ones being trained now.