# Voice cloning

> Source: https://aiwiki.ai/wiki/voice_cloning
> Updated: 2026-06-21
> Categories: Generative AI, Speech & Audio AI, Voice AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Voice cloning** is the use of [machine learning](/wiki/machine_learning) to generate synthetic speech in the voice of a specific real person (the target speaker) from a sample of their recorded audio. Modern zero-shot systems can replicate a voice from as little as a 3-second clip: Microsoft's [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e) needs a 3-second acoustic prompt and OpenAI's Voice Engine needs a 15-second sample, and the resulting speech is often hard for ordinary listeners to distinguish reliably from the real speaker.[1][14] The same capability powers both accessibility tools (synthetic voices for people who have lost the ability to speak) and a wave of fraud, which led the U.S. Federal Communications Commission to declare AI-generated voices in robocalls illegal on February 8, 2024.[15]

The term sits inside the broader field of [Text-to-Speech](/wiki/text_to_speech_ai) (TTS) but specifically refers to the speaker-conditioning side: not just generating any natural speech, but generating speech in a particular person's voice. Voice cloning has become one of the more visible applications of generative audio AI since around 2018, when neural speaker-encoder methods made few-second cloning practical.[2] By 2023, codec-based language models such as VALL-E pushed zero-shot quality much closer to natural recordings, and commercial systems from [ElevenLabs](/wiki/elevenlabs), Sesame, OpenAI, Resemble AI and others followed.[1] ElevenLabs, founded in 2022, reached an $11 billion valuation in a February 2026 funding round, a marker of how commercially significant the field has become.[26]

It is also one of the more controversial. Voice cloning has been used in CEO impersonation fraud, robocall election interference, non-consensual celebrity impersonation, and the wider deepfake problem. The FCC declared AI-generated voices in robocalls illegal under the Telephone Consumer Protection Act on February 8, 2024,[15] and SAG-AFTRA's 2023 actors' strike negotiated explicit AI replica protections.[21]

## What is voice cloning used for?

The technology has both helpful and harmful uses, and most public debate is about how to keep one without the other.

On the constructive side, voice cloning supports accessibility (voice restoration for people with ALS, throat cancer, or other voice loss), media localization and dubbing, audiobook narration at scale, voice agents and conversational AI, and personal voice assistants. Apple's Personal Voice feature on iOS 17 (2023) lets users record about 15 minutes of speech and create a synthetic version of their own voice on device, intended for people whose voices may degrade over time.[18]

On the harm side, voice cloning powers a class of social-engineering attacks that did not exist a decade ago. The first widely reported case was in 2019, when criminals used cloned audio to impersonate the German CEO of an energy company's parent firm and convince a UK subsidiary's chief executive to wire about $243,000 to a fraudulent supplier; the incident, reported via the insurer Euler Hermes, is generally cited as the first cybercrime in which criminals clearly drew on AI voice synthesis.[16] Since then, voice cloning has shown up in fake kidnap-for-ransom calls, family-emergency scams targeting elderly relatives, and political deepfakes such as the January 2024 New Hampshire robocall that imitated President Joe Biden telling Democrats to skip the primary.[15] Consent and copyright are also active issues: the Scarlett Johansson dispute over OpenAI's "Sky" voice in May 2024 became the canonical example of a commercial system shipping a voice that sounded uncomfortably like a real, unconsenting celebrity.[17]

Voice cloning is therefore a useful case study for the broader generative AI policy debate. The same model that lets a person with ALS keep speaking in their own voice can also be used to defraud their grandmother.

## What are the main approaches to voice cloning?

Voice cloning is a sub-area of [speech synthesis](/wiki/speech_synthesis) that has progressed through several technical generations.[22] Each generation either added new capabilities or sharply reduced the amount of target-speaker data required.

| Era | Approach | Reference data needed | Quality |
|-----|----------|----------------------|---------|
| 1990s-2000s | Concatenative TTS: stitch pre-recorded speech units (diphones or longer units) from one speaker | Hours of studio recordings of one speaker | Robotic, with audible joins |
| 2000s | Statistical parametric (HMM-based) synthesis | Tens of minutes to hours, single speaker | Smoother but "buzzy" |
| 2016-2017 | Neural TTS without speaker conditioning: Tacotron, WaveNet, Tacotron 2 | Tens of hours, single speaker | Near-natural for the trained voice only |
| 2017-2018 | Multi-speaker neural TTS: shared model with speaker IDs or learned embeddings | Tens of hours per speaker, all known in advance | Good for in-set speakers |
| 2018-2020 | Fine-tuning on target speaker from a base model | Roughly 30 minutes to a few hours | High quality, person-specific model |
| 2019-2022 | Few-shot adaptation: brief fine-tune on a small speaker set | Roughly 5 to 10 minutes | Usable, often with reduced naturalness |
| 2018-present | Zero-shot voice cloning: condition the synthesis model on a speaker embedding extracted from any reference utterance, no retraining required | A few seconds | Steadily improving, near-ground-truth in 2024-2025 systems |

In current research and commercial practice, "voice cloning" usually means the last two rows: a model that can clone a new voice on the fly from a short clip, optionally with extra fine-tuning for higher fidelity.

## How does voice cloning work?

Modern voice cloning is built from a small set of reusable components.

### Speaker embeddings

A speaker embedding is a fixed-size vector that captures "who is talking" while throwing away "what they are saying." Early variants were i-vectors and Joint Factor Analysis. The neural era introduced d-vectors and x-vectors, which are extracted from speaker-verification networks trained to push utterances by the same speaker close together in [embedding space](/wiki/embedding_space) and utterances by different speakers apart.
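
As a minimal sketch of how such embeddings are compared, the snippet below scores two utterances by cosine similarity and applies a threshold. The 4-dimensional vectors, the `same_speaker` helper, and the 0.75 threshold are all made up for illustration; real d-vectors and x-vectors have hundreds of dimensions, and thresholds are tuned on held-out data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb_a, emb_b, threshold=0.75):
    """Hypothetical decision rule: accept when similarity clears the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold

# Made-up embeddings: two utterances by "Alice", one by "Bob".
alice_utt1 = [0.9, 0.1, 0.3, 0.2]
alice_utt2 = [0.8, 0.2, 0.4, 0.1]
bob_utt1 = [0.1, 0.9, 0.2, 0.7]

print(same_speaker(alice_utt1, alice_utt2))  # same speaker, high similarity
print(same_speaker(alice_utt1, bob_utt1))    # different speakers, low similarity
```

The words spoken do not appear anywhere in the comparison, only the direction of the vectors, which is what lets the same embedding condition a synthesizer on arbitrary new text.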

A particularly important step was Wan, Wang, Papir, and Moreno's *Generalized End-to-End Loss for Speaker Verification* (Google, ICASSP 2018), which introduced the GE2E loss.[4] GE2E reduced equal error rates by more than 10% over the previous tuple-based loss while cutting training time by roughly 60%.[4] The same group's embeddings were reused as the speaker encoder in subsequent voice-cloning work.[2]
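
The softmax form of the GE2E objective can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the scale `w` and offset `b` are fixed at illustrative values here (in the paper they are learned parameters), the toy embeddings are invented, and every speaker is assumed to contribute at least two utterances.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def ge2e_softmax_loss(batch, w=10.0, b=-5.0):
    """batch[j][i] is the embedding of utterance i from speaker j."""
    centroids = [mean_vector(utterances) for utterances in batch]
    total = 0.0
    for j, utterances in enumerate(batch):
        for i, embedding in enumerate(utterances):
            sims = []
            for k, centroid in enumerate(centroids):
                if k == j:
                    # Leave the utterance out of its own speaker's centroid.
                    centroid = mean_vector(utterances[:i] + utterances[i + 1:])
                sims.append(w * cosine(embedding, centroid) + b)
            # Cross-entropy with the true speaker j as the target class.
            total += math.log(sum(math.exp(s) for s in sims)) - sims[j]
    return total

# Toy 2-D embeddings: two speakers, two utterances each.
well_separated = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
mixed_up = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]

print(ge2e_softmax_loss(well_separated))  # close to zero
print(ge2e_softmax_loss(mixed_up))        # much larger
```

The loss is small when each utterance sits near its own speaker's centroid and far from the others, which is the "same speaker close, different speakers apart" geometry described above.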

### SV2TTS

The direct ancestor of modern zero-shot cloning is Jia et al., *Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis* (NeurIPS 2018), often abbreviated SV2TTS.[2] It is a three-stage system: a speaker encoder trained on speaker verification with thousands of speakers, a Tacotron 2 sequence-to-sequence model that produces a mel spectrogram from text conditioned on the speaker embedding, and a [WaveNet](/wiki/wavenet) vocoder that turns the spectrogram into a waveform.[2] The key claim was that knowledge learned by a discriminative speaker encoder could transfer to generative TTS, allowing voice cloning of speakers that had never been seen during TTS training.[2]
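
The data flow of the three stages can be sketched as follows. Every function here is a toy stand-in with invented arithmetic, and all names and constants are hypothetical; the point is only the interface between stages, not how the real networks compute their outputs.

```python
from typing import List

EMBED_DIM = 4    # toy value; real speaker encoders use hundreds of dimensions
N_MELS = 8       # toy value; real systems use many more mel bands
HOP_LENGTH = 4   # toy value; waveform samples produced per spectrogram frame

def speaker_encoder(reference_audio: List[float]) -> List[float]:
    """Stage 1 stand-in: audio of any length -> fixed-size speaker embedding."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    return [sum(reference_audio[i * chunk:(i + 1) * chunk]) / chunk
            for i in range(EMBED_DIM)]

def synthesizer(text: str, speaker_embedding: List[float]) -> List[List[float]]:
    """Stage 2 stand-in: text + speaker embedding -> mel spectrogram frames."""
    bias = sum(speaker_embedding) / len(speaker_embedding)
    return [[(ord(ch) % 32) / 32.0 + bias for _ in range(N_MELS)] for ch in text]

def vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: mel frames -> waveform samples."""
    return [sum(frame) / len(frame) for frame in mel for _ in range(HOP_LENGTH)]

def clone_voice(text: str, reference_audio: List[float]) -> List[float]:
    embedding = speaker_encoder(reference_audio)  # computed once, no retraining
    mel = synthesizer(text, embedding)            # conditioned on the embedding
    return vocoder(mel)

waveform = clone_voice("hello", [0.1] * 16)
print(len(waveform))  # one frame per character here, HOP_LENGTH samples per frame
```

The only thing that changes between target speakers is the embedding passed into the synthesizer; the synthesizer and vocoder weights stay fixed, which is what makes the approach zero-shot.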

SV2TTS was the first widely demonstrated few-second zero-shot voice cloning system, and an open-source reimplementation by Corentin Jemine became one of the most-used voice-cloning code bases of the late 2010s.

### Neural vocoders

A vocoder turns an intermediate acoustic representation, usually a mel spectrogram, into a raw waveform. The quality of the vocoder largely determines whether a synthesized voice sounds clean or muffled.

- **WaveNet** (van den Oord et al., DeepMind 2016): autoregressive sample-by-sample generation. Slow but very high quality at the time.[23]
- **WaveRNN, MelGAN, HiFi-GAN, BigVGAN, Vocos**: progressively faster successors. WaveRNN is still autoregressive but far lighter than WaveNet; the later models are non-autoregressive, with HiFi-GAN being a 2020 workhorse.
- **WaveGrad and DiffWave**: diffusion-based vocoders introduced in 2020, trading sampling steps for very natural waveforms.
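
A little arithmetic shows why vocoder speed matters. The sketch below assumes a 22.05 kHz sample rate, a hop of 256 samples per frame, and 80 mel bands; these are common settings in open-source vocoder configurations but are assumptions here, not figures from the sources cited in this article.

```python
SAMPLE_RATE = 22_050  # Hz (assumed)
HOP_LENGTH = 256      # waveform samples per mel frame (assumed)
N_MELS = 80           # mel bands per frame (assumed)

def mel_frames(duration_s: float) -> int:
    """Number of whole spectrogram frames covering the given duration."""
    return int(duration_s * SAMPLE_RATE) // HOP_LENGTH

def samples_to_generate(n_frames: int) -> int:
    """Waveform samples the vocoder must produce for those frames."""
    return n_frames * HOP_LENGTH

frames = mel_frames(1.0)
print(frames)                       # 86 frames per second of speech
print(frames * N_MELS)              # 6880 spectrogram values as input
print(samples_to_generate(frames))  # 22016 waveform samples as output
```

Under these assumptions an autoregressive vocoder needs roughly 22,000 sequential steps per second of audio, while a non-autoregressive one can produce all of those samples in parallel from the 86 input frames.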

### Neural audio codecs

The shift in 2022-2023 was to move from continuous mel spectrograms to discrete audio tokens. Google's SoundStream (Zeghidour et al., 2021-2022) and Meta's Encodec (Defossez et al., arXiv 2210.13438, October 2022) are encoder-decoder models that compress audio into a small number of token streams using residual vector quantization, then decode those tokens back into a high-quality waveform.[5] Encodec runs at 24 kHz and 48 kHz and can compress speech to a few kbps while remaining intelligible.[5] Descript Audio Codec (DAC) is a later variant.
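
Residual vector quantization itself is simple enough to sketch directly. The example below is a toy with two hand-written 2-D codebooks (hypothetical values, not learned ones): each stage quantizes whatever error the previous stage left behind, so every extra codebook refines the reconstruction.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to the vector (squared distance)."""
    def sq_dist(code):
        return sum((c - v) ** 2 for c, v in zip(code, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Return one token per codebook, each quantizing the remaining residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the selected entries; fewer indices give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

coarse = [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]]
fine = [[-0.25, -0.25], [-0.25, 0.25], [0.25, -0.25], [0.25, 0.25]]
codebooks = [coarse, fine]

tokens = rvq_encode([0.8, -1.2], codebooks)
print(tokens)                             # [2, 0]
print(rvq_decode(tokens[:1], codebooks))  # [1.0, -1.0], coarse stage only
print(rvq_decode(tokens, codebooks))      # [0.75, -1.25], both stages
```

Dropping the later tokens still yields a usable, coarser reconstruction, which is how codecs of this kind can trade bitrate against quality with a single model.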

Once audio is a token sequence, the same Transformer machinery used for text language models can be used to model speech, which is exactly what VALL-E does.[1]
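
To get a feel for the sequence lengths involved, the sketch below works through the arithmetic for a codec running at 75 frames per second with 1,024-entry codebooks. Those figures are my recollection of the 24 kHz Encodec configuration and of the 8-codebook setting used by VALL-E; treat them as assumptions rather than values confirmed by this article.

```python
import math

FRAME_RATE_HZ = 75    # codec frames per second (assumed)
CODEBOOK_SIZE = 1024  # entries per codebook (assumed)

def bitrate_bps(n_codebooks: int) -> float:
    """Bits per second when every frame emits one token per codebook."""
    bits_per_token = math.log2(CODEBOOK_SIZE)
    return FRAME_RATE_HZ * n_codebooks * bits_per_token

def token_count(duration_s: float, n_codebooks: int) -> int:
    """Total discrete tokens representing a clip of the given duration."""
    return int(duration_s * FRAME_RATE_HZ) * n_codebooks

print(bitrate_bps(8))     # 6000.0 bits per second, i.e. 6 kbps
print(token_count(3, 8))  # 1800 tokens for a 3-second prompt
```

Under these assumptions a 3-second acoustic prompt is under two thousand tokens, comfortably within the context length of a Transformer language model.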

## What are the leading zero-shot voice-cloning systems?

The table below covers the current generation of zero-shot voice-cloning systems. The boundary between research, open source, and commercial is fuzzy and shifts year to year.

| System | Org | First public | Approach | Open source? |
|--------|-----|--------------|----------|--------------|
| YourTTS | Casanova et al. | ICML 2022 | VITS extension, multilingual zero-shot | Yes |
| Tortoise TTS | James Betker | Jan 2022 | Autoregressive plus diffusion, five-stage pipeline | Yes (Apache 2.0) |
| VALL-E | Microsoft | Jan 2023 | GPT-style decoder over Encodec tokens, 3-second prompt | No (research) |
| VALL-E X | Microsoft | March 2023 | Cross-lingual extension of VALL-E | No (research) |
| NaturalSpeech 2 | Microsoft | April 2023 | Latent diffusion over codec features, also sings | No (research) |
| Voicebox | Meta | June 2023 | Non-autoregressive flow matching, infill-style training | No, weights withheld |
| StyleTTS 2 | Li et al. (Columbia) | June 2023, NeurIPS 2023 | Style diffusion plus speech LM adversarial training | Yes |
| XTTS / XTTS-v2 | Coqui | Sept-Nov 2023 | Multilingual zero-shot from a 6-second clip, 17 languages | Yes |
| NaturalSpeech 3 | Microsoft | March 2024 | Factorized neural codec plus factorized diffusion | No (research) |
| OpenAI Voice Engine | OpenAI | 29 March 2024 (preview) | 15-second sample, multilingual | No, limited partner access |
| F5-TTS | Chen et al. (SJTU) | October 2024 | Non-autoregressive flow matching with Diffusion Transformer | Yes |
| Sesame CSM | Sesame | Demo Feb 2025, weights March 2025 | Conversational speech model with paralinguistic detail | Partial (1B variant on HuggingFace) |
| Eleven Multilingual v2 / v3 | ElevenLabs | August 2023 onward | Proprietary, ~29 languages, voice cloning plus voice design | No |
| Resemble AI, WellSaid, Murf, Play.ht, Hume AI | Various commercial | 2019 onward | Proprietary stacks, mostly fine-tune plus zero-shot hybrids | No |

Some facts worth pinning down. VALL-E (Wang et al., arXiv 2301.02111, January 5, 2023) was trained on around 60,000 hours of English speech from Meta's LibriLight corpus and treats TTS as conditional language modeling over Encodec tokens, taking a 3-second "acoustic prompt" of the target speaker.[1] It can carry over not just timbre but the speaker's emotion and acoustic environment.[1] Voicebox (Le et al., June 2023) was trained on more than 50,000 hours of unfiltered speech and reported a word error rate of about 1.9% versus VALL-E's 5.9% on the same benchmark, but Meta chose not to release weights, citing safety concerns.[6] NaturalSpeech 3 (Ju et al., March 2024) factorizes speech into content, prosody, timbre, and acoustic detail, each generated by its own diffusion subspace, and was scaled to 1B parameters and 200K hours of training data.[7] Sesame's Conversational Speech Model (CSM) demo went viral in early 2025 for its handling of pauses, breath sounds, and self-corrections; the company later open-sourced a 1B-parameter variant.[12]

## Is voice cloning available as open source?

Voice cloning is unusual for a high-end generative AI capability in that strong open implementations exist alongside the commercial ones. The most actively used:

- **Coqui TTS** and the XTTS family. XTTS-v2 supports 17 languages, clones a voice from a roughly 6-second reference clip, and is widely deployed for self-hosted dubbing and voice agents.[11]
- **StyleTTS 2** (Yinghao Aaron Li et al., NeurIPS 2023). Reported to surpass human recordings on the LJSpeech single-speaker benchmark in MOS evaluations and to match them on multi-speaker VCTK.[9]
- **Tortoise TTS**. Autoregressive decoder plus diffusion, slow at inference but very expressive.[24]
- **F5-TTS** (Chen et al., October 2024). Non-autoregressive flow matching with a Diffusion Transformer backbone, English and Chinese.[10]
- **GPT-SoVITS**. A community project that took off in 2024-2025 for Chinese and Japanese cloning.
- **Open VALL-E reimplementations** (e.g., lifeiteng/vall-e). Useful for research, generally below the closed Microsoft model in quality.
- **Hugging Face TTS pipelines** wrap many of the above behind a common interface.

Open-weight availability is one reason the policy conversation is so hard. Even if every commercial provider added perfect watermarking tomorrow, an attacker could still pull XTTS-v2 down from Hugging Face and run it locally.

## How is voice-cloning quality measured?

Voice-cloning systems are typically evaluated along several axes at once. There is no single number that captures "how good" a clone is.

| Metric | What it measures | Typical use |
|--------|------------------|-------------|
| MOS (Mean Opinion Score) | Subjective 1-5 naturalness rating from human listeners | Compares overall speech quality |
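| MOS-S / SECS (Speaker Encoder Cosine Similarity) | How close the clone is to the target: MOS-S is a subjective similarity rating from listeners, SECS is the cosine similarity between speaker-encoder embeddings of the clone and the reference | Compares speaker fidelity |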
| WER (Word Error Rate) | An ASR system transcribes the synthesis; measures pronunciation accuracy | Compares intelligibility, robustness |
| RTF (Real-Time Factor) | Synthesis time per second of output audio | Important for live voice agents |
| Prosody / emotion similarity | Match of pitch, rhythm, expressiveness | Important for narration and acting |
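
Two of these metrics are easy to compute by hand. The sketch below implements WER as a word-level edit distance and RTF as a simple ratio; the example sentences and timings are invented, and real evaluations also normalize punctuation and casing more carefully than this does.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean synthesis runs faster than playback."""
    return synthesis_seconds / audio_seconds

# ASR heard "box" instead of "fox" and dropped "jumps": 2 errors over 5 words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box"))  # 0.4
print(real_time_factor(0.5, 2.0))                                           # 0.25
```

Because WER depends on the ASR system doing the transcription, two papers reporting WER with different recognizers are not directly comparable.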

Reported numbers should be read with care. Many papers compare on different subsets of LibriSpeech or VCTK, with different reference durations and different ASR systems for WER. As a rough guide, by 2024 the best zero-shot systems claim speaker similarity in the high 0.5 to 0.7 SECS range from a few seconds of audio, climbing into the 0.7-0.85 range with about 10 minutes of fine-tuning data. Several recent systems (Voicebox, NaturalSpeech 3, Eleven Multilingual v2) claim MOS ratings statistically indistinguishable from real recordings on certain benchmarks.[6] Whether that survives outside the lab is another question.
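
One crude way to read "statistically indistinguishable" is that the confidence intervals of two MOS results overlap. The sketch below uses invented listener ratings and a normal-approximation 95% interval; published evaluations typically use more listeners and often paired tests, so this is an illustration of the idea rather than a recommended protocol.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score and the half-width of an approximate 95% interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

def intervals_overlap(a, b):
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b

# Made-up 1-5 ratings from ten listeners per condition.
real_speech = [5, 4, 5, 4, 4, 5, 3, 4, 5, 4]
good_clone = [4, 4, 5, 4, 3, 5, 4, 4, 4, 5]
weak_clone = [3, 2, 3, 3, 2, 3, 2, 3, 3, 2]

real = mos_with_ci(real_speech)
print(intervals_overlap(real, mos_with_ci(good_clone)))  # True
print(intervals_overlap(real, mos_with_ci(weak_clone)))  # False
```

With only ten ratings the intervals are wide, so overlap is easy to achieve; small listener panels are one reason such claims deserve scrutiny.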

## Use cases

Voice cloning has more legitimate commercial uses than is sometimes appreciated.

- **Audiobook narration.** Speechki, ElevenLabs and others have signed deals with publishers to produce audiobooks at scale, often with voices cloned from human narrators under license.
- **Localization and dubbing.** HeyGen, Synthesia, ElevenLabs Dubbing and similar tools take a single performance and produce versions in dozens of languages while preserving the original speaker's voice.[13]
- **Game NPCs and interactive media.** Replica Studios and others sell licensed voice replicas for in-game characters; this was the topic of the SAG-AFTRA video game strike.[21]
- **Accessibility.** VocaliD, CereProc and Apple Personal Voice support people with degenerative conditions or post-surgical voice loss.
- **Voice restoration for ALS patients.** Apple's Personal Voice (iOS 17, September 2023) trains a synthetic copy of the user's own voice on device from about 15 minutes of recorded prompts, then plugs into Live Speech for accessibility output.[18]
- **Marketing and advertising.** Brands use cloned voices for personalised ads (with the original speaker's consent and typically a license fee).
- **Customer service voice agents.** OpenAI's gpt-realtime / Realtime API, Sesame, ElevenLabs Conversational AI, and others power voice agents that sound like real people answering the phone.
- **Multilingual content creation.** Cloning a creator's voice into many languages (17 with XTTS-v2, nearly 30 with Eleven Multilingual v2) is now a standard YouTube workflow.[13]

## What are the legal and ethical concerns?

The abuse cases are not theoretical. They show up in the news regularly enough that voice cloning is now a recurring topic in U.S. and EU regulatory work.

| Concern | Concrete example |
|---------|------------------|
| CEO / vendor impersonation fraud | A UK energy firm wired roughly $243,000 in 2019 after a cloned voice impersonated the German CEO of its parent company; multiple later cases reported in seven-figure ranges |
| Family-emergency / kidnap scams | "Grandparent" calls using cloned voices of relatives, frequently flagged by U.S. and Canadian consumer-protection agencies |
| Election interference robocalls | January 2024 New Hampshire primary robocall using a fake Joe Biden voice telling voters to stay home |
| Non-consensual celebrity voices | OpenAI's "Sky" voice, perceived as resembling Scarlett Johansson, removed in May 2024 after she objected through counsel |
| Performer consent and compensation | SAG-AFTRA's 2023 TV/Theatrical contract included AI replica protections; the 2024-2025 video game strike, ended July 2025, added consent, transparency, and compensation rules for digital voice replicas |
| Voice biometric defeat | Banks that use voice as an authentication factor are increasingly vulnerable to cloned reference audio |
| Defamation and harassment | Cloned voices used to impersonate teachers, public figures, or private individuals to damage their reputations |

Several regulatory and standards efforts are now in place:

- The **FCC** issued a Declaratory Ruling on **February 8, 2024** confirming that AI-generated voices in robocalls count as "artificial" under the Telephone Consumer Protection Act, making such calls illegal without prior express consent.[15] FCC Chairwoman Jessica Rosenworcel framed the move bluntly: "Bad actors are using AI-generated voices in unsolicited robocalls to extort vulnerable family members, imitate celebrities, and misinform voters. We're putting the fraudsters behind these robocalls on notice."[15] The ruling was directly motivated by the New Hampshire Biden robocall. In September 2024 the FCC adopted a $6 million forfeiture order against political consultant Steve Kramer over the calls, and it separately reached a $1 million settlement with the transmitting carrier, Lingo Telecom; a New Hampshire jury acquitted Kramer of related state criminal charges in 2025.[25]
- The **EU AI Act**, adopted in 2024, treats deepfakes and AI-generated content as a transparency-tier risk and requires labelling.
- **NIST AI 100-4** (released by the U.S. AI Safety Institute on November 20, 2024) lays out voluntary technical approaches for digital content transparency, including provenance, watermarking, and labelling, and explicitly covers AI-generated audio.[20]
- The **C2PA** (Coalition for Content Provenance and Authenticity), a cross-industry standard backed by Adobe, Microsoft, Google, Meta and others, defines cryptographic provenance metadata that can travel with synthetic audio files.

Industry self-regulation is uneven. ElevenLabs, OpenAI, Resemble AI and others publish acceptable-use policies, run abuse-detection on uploads, and embed watermarks.[14] In announcing Voice Engine, OpenAI said it was "taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse," requiring partners to obtain the "explicit and informed consent" of the original speaker and watermarking generated audio.[14] Open-weight models on Hugging Face have no such gating, which is why detection has become its own research field.

## How can cloned voices be detected?

Detection takes two main forms: active watermarking, applied when the audio is generated, and passive detection, applied to audio after the fact.

Active watermarking embeds a signal at synthesis time. **SynthID Audio**, announced by Google DeepMind in November 2023 and expanded in 2024, embeds an inaudible watermark into audio generated by Google models such as Lyria.[19] The watermark is designed to survive common transformations including added noise, MP3 compression, and tempo changes.[19] **C2PA** provenance metadata is a complementary approach: cryptographically signed information about how a piece of media was created, attached to the file. Resemble AI ships its own watermarking and maintains Resemblyzer, an open-source speaker-encoder package for analysing and comparing voices; ElevenLabs publishes an AI Speech Classifier that scores arbitrary audio for synthetic origin.
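
The general idea of embedding a keyed signal and detecting it by correlation can be shown with a textbook spread-spectrum toy. This is not how SynthID or any commercial watermark works (their methods are not described in the sources here); the key, strength, and threshold are hypothetical, and the mark makes no attempt at inaudibility or robustness to compression.

```python
import math
import random

def key_sequence(key: int, n: int):
    """Pseudo-random +1/-1 sequence reproducible from the secret key."""
    rng = random.Random(key)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]

def embed(signal, key: int, strength: float = 0.02):
    """Add a low-level keyed sequence to every sample."""
    marks = key_sequence(key, len(signal))
    return [s + strength * m for s, m in zip(signal, marks)]

def detect(signal, key: int, strength: float = 0.02) -> bool:
    """Correlate with the keyed sequence; marked audio scores near `strength`."""
    marks = key_sequence(key, len(signal))
    correlation = sum(s * m for s, m in zip(signal, marks)) / len(signal)
    return correlation > strength / 2

SAMPLE_RATE = 48_000
# One second of a 220 Hz tone stands in for synthesized speech.
clean = [0.5 * math.sin(2 * math.pi * 220 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
marked = embed(clean, key=1234)

print(detect(marked, key=1234))  # watermark present, correct key
print(detect(clean, key=1234))   # no watermark
print(detect(marked, key=9999))  # watermark present, wrong key
```

Detection only works for whoever holds the key and only on audio that was marked at generation time, which is why watermarking cannot help against models run outside cooperating platforms.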

Passive detection trains classifiers on acoustic and spectral features that distinguish synthesized speech from natural speech. The ASVspoof challenge series has been the main benchmark for this work since 2015. The hard part is that detectors trained on one generation of synthesis models tend to lose accuracy against the next generation, so adversarial robustness is an active research problem.

None of these approaches solve the underlying problem on their own. Watermarks help only when the synthesis system cooperates, provenance helps only when receivers check it, and detectors lag the attackers by definition. The current consensus in policy circles is that defence has to be layered.

## Apple Personal Voice

Personal Voice is worth a section because it is the largest consumer deployment of on-device voice cloning. Announced as part of iOS 17 in 2023, Personal Voice is built into iPhone, iPad, and Mac running iOS 17 / iPadOS 17 / macOS Sonoma or later. Users record themselves reading a series of prompts, around 15 minutes of audio in total, and the device trains a synthetic version of their voice locally.[18] The model is encrypted and stored on device behind Face ID, Touch ID, or the device passcode, and never leaves the device unless the user opts to share it across their iCloud-linked devices.

The primary use case is accessibility. Personal Voice integrates with the Live Speech feature, which lets users type messages and have them spoken aloud in their own voice during phone calls, FaceTime, or in person.[18] Apple positioned it for people who may lose the ability to speak, for example due to ALS, but anyone can create one. Because training and inference happen locally, Personal Voice avoids the consent and platform-abuse problems that come with cloud cloning.

## What are the limitations of voice cloning?

Despite the impressive demos, voice cloning has well-known weak points.

Prosody and emotional control remain imperfect. Most systems can match the timbre of a target speaker more easily than they can convincingly act. Code-switching between languages mid-sentence, regional dialects, whispering, shouting, and singing all remain harder than steady-state read speech. Real-time low-latency cloning at conversational tempo is computationally demanding, which is why current voice agents often run smaller, less expressive models in the loop.

Multilingual zero-shot quality varies by language. English, Mandarin, Spanish, and a handful of others get the most training data and the best results; lower-resource languages can suffer from accent leakage from the source language or outright mispronunciation. Some systems hallucinate or skip text under certain prompts, particularly when the reference audio is short or noisy. High-quality reference audio still matters: 5 seconds of clean, expressive speech beats 30 seconds of phone-quality mumbling.

## Recent context

Voice cloning is in a phase where commercial competition is fierce and the technical frontier is moving quickly. ElevenLabs alone raised a $180 million Series C at a $3.3 billion valuation in January 2025, then a $500 million round at an $11 billion valuation in February 2026, a more than threefold increase in about a year that signals how much money is flowing into the category.[26] Sesame's CSM demo in early 2025 reset expectations for naturalness and emotional range, with users describing extended conversations they had to remind themselves were synthetic.[12] Multimodal foundation models (such as the audio-capable variants of GPT-4o and Gemini) are absorbing voice cloning into general-purpose models rather than treating it as a separate stack. The open-source side keeps narrowing the gap with closed systems, and almost every major launch in 2024-2025 has had to ship watermarking, abuse policies, and consent flows alongside the model itself.

It is genuinely hard to predict where the next year goes. The research is not slowing down, the misuse is not slowing down either, and the policy response is still catching up to systems that worked a year ago, never mind the ones being trained now.

## See also

- [Text-to-Speech](/wiki/text_to_speech_ai)
- [Speech synthesis](/wiki/speech_synthesis)
- [VALL-E](/wiki/neural_codec_language_models_are_zero-shot_text_to_speech_synthesizers_vall-e)
- [ElevenLabs](/wiki/elevenlabs)
- [AI Voice Agent](/wiki/ai_voice_agent)
- [Whisper (speech recognition)](/wiki/whisper)
- [OpenAI Whisper](/wiki/openai_whisper)
- [Voice Activity Detection Models](/wiki/voice_activity_detection_models)
- [Text-to-Speech Models](/wiki/text-to-speech_models)

## References

1. Wang, C., Chen, S., Wu, Y., et al. (2023). "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers." arXiv:2301.02111. https://arxiv.org/abs/2301.02111
2. Jia, Y., Zhang, Y., Weiss, R. J., et al. (2018). "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis." NeurIPS 2018. https://arxiv.org/abs/1806.04558
3. Casanova, E., Weber, J., Shulby, C., et al. (2022). "YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for Everyone." ICML 2022. https://arxiv.org/abs/2112.02418
4. Wan, L., Wang, Q., Papir, A., Moreno, I. L. (2018). "Generalized End-to-End Loss for Speaker Verification." ICASSP 2018. https://arxiv.org/abs/1710.10467
5. Defossez, A., Copet, J., Synnaeve, G., Adi, Y. (2022). "High Fidelity Neural Audio Compression" (Encodec). arXiv:2210.13438. https://arxiv.org/abs/2210.13438
6. Le, M., Vyas, A., Shi, B., et al. (2023). "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale." Meta AI. arXiv:2306.15687. https://arxiv.org/abs/2306.15687
7. Ju, Z., Wang, Y., Shen, K., et al. (2024). "NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models." arXiv:2403.03100. https://arxiv.org/abs/2403.03100
8. Shen, K., Ju, Z., Tan, X., et al. (2023). "NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers." arXiv:2304.09116. https://arxiv.org/abs/2304.09116
9. Li, Y. A., Han, C., Raghavan, V. S., Mischler, G., Mesgarani, N. (2023). "StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models." NeurIPS 2023. https://arxiv.org/abs/2306.07691
10. Chen, Y., Niu, Z., Ma, Z., et al. (2024). "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching." arXiv:2410.06885. https://arxiv.org/abs/2410.06885
11. Coqui AI. "XTTS-v2 model card." Hugging Face. https://huggingface.co/coqui/XTTS-v2
12. Sesame AI. "Crossing the Uncanny Valley of Conversational Voice." 2025. https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice
13. ElevenLabs. "ElevenLabs Comes Out of Beta and Releases Eleven Multilingual v2." 2023. https://elevenlabs.io/blog/elevenlabs-comes-out-of-beta-and-releases-eleven-multilingual-v2-a-foundational-ai-speech-model-for-nearly-30-languages
14. OpenAI. "Navigating the challenges and opportunities of synthetic voices." March 29, 2024. https://openai.com/index/navigating-the-challenges-and-opportunities-of-synthetic-voices/
15. Federal Communications Commission. "FCC Makes AI-Generated Voices in Robocalls Illegal." February 8, 2024. https://www.fcc.gov/document/fcc-makes-ai-generated-voices-robocalls-illegal
16. Stupp, C. (2019). "Fraudsters Used AI to Mimic CEO's Voice in Unusual Cybercrime Case." Wall Street Journal, August 30, 2019.
17. Allyn, B. (2024). "Scarlett Johansson says she is 'shocked' by ChatGPT voice that sounds like 'Her'." NPR, May 20, 2024. https://www.npr.org/2024/05/20/1252495087/openai-pulls-ai-voice-that-was-compared-to-scarlett-johansson-in-the-movie-her
18. Apple. "Create a Personal Voice on iPhone, iPad, Mac, or Apple Watch." Apple Support. https://support.apple.com/en-us/HT213878
19. Google DeepMind. "SynthID." https://deepmind.google/models/synthid/
20. National Institute of Standards and Technology. "NIST AI 100-4: Reducing Risks Posed by Synthetic Content." November 20, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-4.pdf
21. SAG-AFTRA. "Artificial Intelligence." https://www.sagaftra.org/contracts-industry-resources/member-resources/artificial-intelligence
22. Tan, X., Qin, T., Soong, F., Liu, T.-Y. (2021). "A Survey on Neural Speech Synthesis." arXiv:2106.15561. https://arxiv.org/abs/2106.15561
23. van den Oord, A., Dieleman, S., Zen, H., et al. (2016). "WaveNet: A Generative Model for Raw Audio." arXiv:1609.03499. https://arxiv.org/abs/1609.03499
24. Betker, J. (2023). "Better Speech Synthesis through Scaling" (Tortoise TTS). arXiv:2305.07243. https://arxiv.org/abs/2305.07243
25. Federal Communications Commission. "FCC Fines Political Consultant Steve Kramer $6 Million for Illegal Spoofed Robocalls." September 26, 2024. https://www.fcc.gov/document/fcc-fines-political-consultant-6m-illegal-deepfake-biden-robocalls
26. CNBC. "Nvidia-backed AI voice startup ElevenLabs hits $11 billion valuation in fresh fundraise, as it eyes IPO." February 4, 2026. https://www.cnbc.com/2026/02/04/nvidia-backed-ai-startup-elevenlabs-11-billion-valuation.html