# Text-to-Speech Models

> Source: https://aiwiki.ai/wiki/text-to-speech_models
> Updated: 2026-06-22
> Categories: AI Models, Speech & Audio AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Text-to-speech (TTS) models** are [machine learning](/wiki/machine_learning) systems that convert written text into spoken audio. The modern lineage runs from [DeepMind](/wiki/deepmind)'s [WaveNet](/wiki/wavenet) (2016), the first neural model to generate raw audio sample by sample, through [Google](/wiki/google)'s Tacotron 2 (2017), which reached a mean opinion score (MOS) of 4.53 against 4.58 for professional human recordings, to the zero-shot voice-cloning systems of 2023 onward (VALL-E, ElevenLabs, Cartesia, OpenAI) that can copy a voice from a few seconds of audio.[^tacotron2][^cartesia-sonic3][^minimax] The broader task and its history are covered on the [text-to-speech](/wiki/text_to_speech_ai) page; this article catalogs the notable models that implement it. TTS models are a class of [generative AI](/wiki/generative_ai), and the strongest 2025 systems produce speech that listeners often cannot reliably distinguish from human recordings on short utterances.[^cartesia-sonic3][^minimax]

Modern TTS approaches use [neural networks](/wiki/neural_network) to generate speech that approximates human naturalness, voice timbre, and prosody. They are the inverse of [speech recognition](/wiki/speech_recognition) systems, which transcribe audio into text, and they sit alongside other [audio models](/wiki/audio_models) such as [voice cloning](/wiki/voice_cloning), voice conversion, [music generation](/wiki/music_generation), and audio enhancement. Typical TTS pipelines decompose synthesis into two stages: an **acoustic model** that turns text or [phonemes](/wiki/phoneme) into an intermediate representation such as a mel-spectrogram, and a **vocoder** that converts the intermediate into a waveform. Newer **end-to-end** systems collapse both stages into a single network or operate directly on discrete codes from a neural audio codec. Capabilities have advanced from robotic rule-based output in the 1980s to zero-shot voice cloning from a few seconds of reference audio in 2023 and 2024.

*See also: [Text-to-Speech](/wiki/text_to_speech_ai), [Audio Models](/wiki/audio_models), [Voice cloning](/wiki/voice_cloning), [Speech recognition](/wiki/speech_recognition), [Generative AI](/wiki/generative_ai), [ElevenLabs](/wiki/elevenlabs), [Diffusion Models](/wiki/diffusion_models)*

## How did TTS models evolve?

### Rule-based and formant synthesis

Early digital TTS used **formant synthesis**, generating speech with parametric models of the human vocal tract. [DECtalk](/wiki/dectalk), introduced by [Digital Equipment Corporation](/wiki/digital_equipment_corporation) in 1984 based on Dennis Klatt's KlattTalk work at [MIT](/wiki/mit), became the canonical example. It produced highly intelligible but mechanical-sounding speech and is recognizable as the voice used by physicist [Stephen Hawking](/wiki/stephen_hawking) from the mid-1980s onward.

### Concatenative TTS

In the 1990s and 2000s, **concatenative synthesis** stitched together short recorded units (diphones or sub-word fragments) from large speech databases. Unit selection, formalized by Hunt and Black in 1996, searched the database for the best-matching segments. Systems such as ATR's nuu-talk and Festival produced more natural output than formant synthesis but required gigabytes of recorded speech and could not generalize beyond the recorded voice.

### HMM-based statistical parametric TTS

Statistical parametric synthesis using [hidden Markov models](/wiki/hidden_markov_model) (HMMs) emerged through the HTS toolkit from Keiichi Tokuda's lab at Nagoya Institute of Technology, formalized by Heiga Zen, Tokuda, and Alan Black in 2009.[^zen2009] HMM-based TTS generalized to new voices through speaker adaptation and dominated commercial deployments until 2016, though it sounded muffled compared with concatenative output.

### Neural TTS

Neural TTS arrived with **WaveNet** from [DeepMind](/wiki/deepmind) in September 2016, an autoregressive model that generated raw 16-bit audio samples one at a time (16,000 samples per second) using dilated causal convolutions.[^wavenet] In blind listening tests using over 500 ratings on 100 test sentences, WaveNet scored a MOS above 4.0 and, in DeepMind's words, "reduces the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."[^wavenet][^deepmind-wavenet] **Tacotron** from [Google](/wiki/google) in March 2017 introduced a sequence-to-sequence character-to-spectrogram model with attention.[^tacotron] **Tacotron 2** in December 2017 paired a simplified version of that encoder-decoder with a WaveNet-style vocoder and reached a MOS of 4.526 ± 0.066, against 4.582 ± 0.053 for professionally recorded studio speech.[^tacotron2]
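
As a rough illustration of why dilated causal convolutions suit sample-by-sample generation, the sketch below computes how many past samples a stack of such layers can see and applies one causal dilated filter in plain Python. The doubling dilation schedule (1 to 512), the kernel size of 2, and the filter weights are assumptions chosen for illustration, not values taken from this article.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of samples (current plus past) that one output sample can see."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(signal, weights, dilation):
    """1-D causal convolution: output[t] uses only signal[t - k * dilation], k >= 0.

    weights[0] multiplies the current sample, weights[1] the sample `dilation`
    steps back, and so on. Positions before the start are treated as zero.
    """
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out


# Assumed schedule: the dilation doubles at every layer, 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
field = receptive_field(dilations)      # 1 + (1 + 2 + ... + 512) samples
field_ms = 1000.0 * field / 16000       # duration at 16,000 samples per second

# An impulse shows that the filter never looks into the future.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv(impulse, weights=[0.5, 0.25], dilation=2)
```

Under these assumptions ten layers cover 1,024 samples, or 64 ms of audio, which is why such stacks are usually repeated to reach longer contexts.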

**FastSpeech** from [Microsoft](/wiki/microsoft) and Zhejiang University in 2019 replaced autoregressive decoding with a parallel feed-forward Transformer, speeding up generation by roughly 270 times for spectrograms.[^fastspeech] **FastSpeech 2** in 2020 added explicit pitch, energy, and duration conditioning.[^fastspeech2] **Glow-TTS** by Kim et al. in 2020 introduced normalizing flows and monotonic alignment search to TTS,[^glowtts] and **VITS** in 2021 unified the acoustic model and vocoder into a single end-to-end variational autoencoder with adversarial training.[^vits] **HiFi-GAN** from [Kakao](/wiki/kakao_corporation) in 2020 became the dominant vocoder, generating 22.05 kHz audio about 168 times faster than real time on a V100 [GPU](/wiki/gpu).[^hifigan]
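
One way to see how a non-autoregressive model can emit every spectrogram frame in parallel is the length-regulator idea from FastSpeech: each phoneme's hidden state is repeated for its predicted number of frames, so the output length is known before decoding starts. The sketch below is a toy version in plain Python; the phoneme labels, the durations, and the `length_regulate` helper name are hypothetical, and real systems repeat vectors rather than labels.

```python
def length_regulate(phoneme_states, durations, speed=1.0):
    """Repeat each phoneme state for its (scaled) duration in frames.

    phoneme_states: one item per phoneme (labels here, hidden vectors in practice).
    durations: predicted frame count for each phoneme.
    speed: values above 1.0 shorten the output, values below 1.0 lengthen it.
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for state, dur in zip(phoneme_states, durations):
        n = max(0, round(dur / speed))
        frames.extend([state] * n)
    return frames


phonemes = ["HH", "AH", "L", "OW"]   # hypothetical phoneme sequence
durations = [2, 4, 2, 6]             # hypothetical predicted frame counts

frames = length_regulate(phonemes, durations)          # 14 frames in total
fast = length_regulate(phonemes, durations, speed=2.0) # half as many frames
```

Because the expanded sequence is fixed up front, the decoder can process all frames at once, and scaling the durations gives a simple speaking-rate control.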

### What started the zero-shot voice cloning era?

**Tortoise TTS** by James Betker, released in 2022, combined a GPT-style autoregressive prior with a diffusion decoder and a contrastive language-voice transformer.[^tortoise] **VALL-E** from Microsoft in January 2023 reframed TTS as a [language modeling](/wiki/language_model) task over [EnCodec](/wiki/encodec) tokens, learning from 60,000 hours of English audio and cloning unseen voices from a 3-second prompt.[^valle] Microsoft reported that VALL-E "significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity."[^valle] **NaturalSpeech 2** in April 2023 used latent [diffusion](/wiki/diffusion_models) over codec tokens and trained on 44,000 hours including singing.[^ns2] **Bark** from [Suno](/wiki/suno) was released in April 2023 under the MIT license, generating speech, music, sound effects, and nonverbal cues like laughter from text alone.[^bark]

**Voicebox** by Meta's [FAIR](/wiki/fair) lab in 2023 introduced flow-matching for speech infilling.[^voicebox] **StyleTTS 2** from Columbia in June 2023 used style diffusion with self-supervised models like [HuBERT](/wiki/hubert) and [WavLM](/wiki/wavlm).[^styletts2] **XTTS-v2** from [Coqui](/wiki/coqui) in November 2023 brought open multilingual voice cloning across 17 languages.[^xtts] **OpenVoice** from [MyShell](/wiki/myshell) and [MIT](/wiki/mit) added accent and emotion control.[^openvoice] [OpenAI](/wiki/openai) demonstrated **Voice Engine** in a March 2024 preview, cloning voices from 15-second samples; it remained a restricted preview limited to trusted partners through 2025 and 2026, with no broad release, which OpenAI attributed to misuse concerns.[^openai-voiceengine][^voiceengine-2025] **F5-TTS** from Shanghai Jiao Tong University in October 2024 used flow matching with a Diffusion Transformer.[^f5tts]

### Production neural codec and conversational systems (2024 to 2026)

By late 2024 and 2025, the frontier shifted toward streaming, conversational, and prompt-steerable systems, many shipped as commercial APIs. **Cartesia Sonic**, released in May 2024 by a team of Stanford researchers, is built on a state-space model (SSM) architecture rather than a Transformer; the company says this gives lower latency and better long-context memory for real-time use.[^cartesia-sonic] **Kokoro** by Hexgrad, released December 25, 2024 with 82 million parameters under Apache 2.0, is a compact StyleTTS-style model trained on fewer than 100 hours of permissive audio at a reported cost of roughly $1,000; it topped the [Hugging Face](/wiki/hugging_face) TTS Spaces Arena despite its small size.[^kokoro] **CosyVoice 2** from [Alibaba](/wiki/alibaba) added streaming synthesis,[^cosyvoice2] and **Sesame CSM-1B**, released March 13, 2025 under Apache 2.0, fuses a [Llama](/wiki/llama) backbone with a smaller audio decoder that produces Mimi codec tokens; it powers Sesame's voice companions Maya and Miles.[^sesame][^sesame-maya]

**OpenAI gpt-4o-mini-tts**, announced in March 2025, is a production text-to-speech model built on GPT-4o mini that is steerable by natural-language instructions (for example, asking for a calm or excited delivery), supports more than 50 languages, and is priced by OpenAI at roughly $0.015 per minute of generated audio; a December 2025 refresh reported about 35% lower word error rate on the Common Voice and FLEURS benchmarks.[^openai-audio][^openai-tts-pricing] **Nari Labs Dia-1.6B**, released in April 2025 under Apache 2.0, is a 1.6 billion parameter open model focused on multi-speaker dialogue and nonverbal cues such as laughter, initially English only.[^dia][^dia-vb] **MiniMax Speech-02** (China), described in a May 2025 paper, pairs an autoregressive Transformer with a learnable speaker encoder and Flow-VAE; MiniMax reported that its Speech-02-HD variant reached first place on the Artificial Analysis Speech Arena and the Hugging Face TTS Arena, ahead of OpenAI and ElevenLabs models.[^minimax][^minimax-arena] **ElevenLabs Eleven v3**, announced June 5, 2025 in public alpha and reaching general availability in February 2026, supports more than 70 languages, inline audio tags such as [excited], [sighs], [laughing], and [whispers], and multi-speaker dialogue through a Text to Dialogue endpoint.[^elevenlabs-v3] **Cartesia Sonic-3**, launched October 28, 2025 alongside a $100 million funding round, extends the SSM line with around 90 milliseconds of model latency (about 190 milliseconds end-to-end), support for 42 languages, and expressive cues such as laughter.[^cartesia-sonic3][^cartesia-funding]

## What are the components of a TTS model?

Most neural TTS pipelines place a text front-end ahead of the acoustic model and vocoder, giving three components:

* **Text front-end**: normalizes numbers, abbreviations, and punctuation; performs grapheme-to-phoneme conversion; and predicts prosodic structure.
* **Acoustic model**: maps phoneme or character sequences to a mel-spectrogram, latent codes, or pitch and duration targets. Tacotron, FastSpeech, Glow-TTS, and StyleTTS are acoustic models.
* **Vocoder**: converts the intermediate representation into a waveform. Autoregressive (WaveNet, [WaveRNN](/wiki/wavernn)), flow-based ([WaveGlow](/wiki/waveglow)), and adversarial (MelGAN, HiFi-GAN, BigVGAN) vocoders trade quality against speed.
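
The text front-end step can be sketched as a small normalizer. This is a minimal illustration only: the abbreviation table, the 0 to 999 number range, and the `normalize` helper are assumptions, and grapheme-to-phoneme conversion and prosody prediction are not shown.

```python
import re

ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "vs.": "versus"}  # hypothetical table
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize(text):
    """Lower-case, expand known abbreviations, spell out small integers, drop punctuation."""
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
            continue
        word = re.sub(r"[^\w']", "", raw)   # strip punctuation except apostrophes
        if word.isdecimal() and int(word) < 1000:
            tokens.append(number_to_words(int(word)))
        elif word:
            tokens.append(word)
    return " ".join(tokens)


spoken = normalize("Dr. Lee lives at 42 Elm St. with 3 cats.")
```

A production front-end would also need context to disambiguate cases such as "St." meaning "saint" or "street", which a lookup table like this cannot do.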

**Neural audio codec models** such as Google's [SoundStream](/wiki/soundstream) (2021) and Meta's [EnCodec](/wiki/encodec) (2022) replaced mel-spectrograms with discrete tokens produced by residual vector quantization.[^soundstream][^encodec] This enabled language-model-style TTS systems like VALL-E and Bark to operate on audio the way [transformers](/wiki/transformer) operate on text. A parallel line of work replaces the Transformer backbone itself with **state-space models** (Cartesia Sonic), which the developers position as more efficient for streaming, low-latency synthesis.[^cartesia-sonic]
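
Residual vector quantization can be illustrated with a few lines of plain Python: each stage picks the nearest codebook entry for whatever error the previous stages left behind, so one latent frame becomes one token per codebook. The hand-written two-dimensional codebooks and the example frame are hypothetical; real codecs learn their codebooks and use far larger ones.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage codes the residual of the previous one."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entry from each stage to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse first stage followed by finer corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
    [[0.0, 0.0], [0.0625, 0.0], [0.0, 0.0625], [0.0625, 0.0625]],
]

frame = [1.3, 1.05]                    # one latent frame from a hypothetical encoder
codes = rvq_encode(frame, codebooks)   # one discrete token per codebook
approx = rvq_decode(codes, codebooks)  # reconstruction from all three stages
coarse = rvq_decode(codes[:1], codebooks)  # reconstruction from the first stage only
```

Decoding with fewer stages gives a coarser reconstruction, which is how such codecs can trade bitrate against quality, and the per-stage indices are the discrete tokens a language-model-style TTS system predicts.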

### Key technical innovations

* **Attention-based alignment** (Tacotron, 2017) replaced hand-crafted forced alignment with learned soft attention between text and audio frames.
* **Non-autoregressive parallel synthesis** (FastSpeech, Glow-TTS) cut inference time from seconds to milliseconds per utterance.
* **Normalizing flows** (Glow-TTS, VITS, WaveGlow) enabled invertible generative modeling with exact likelihood.
* **Diffusion-based TTS** (NaturalSpeech 2, Voicebox) brought iterative refinement to speech.
* **Neural codec language models** (VALL-E, Bark, CosyVoice) cast synthesis as next-token prediction over discrete audio.
* **Zero-shot voice cloning** from 3 to 15 seconds of reference audio became routine after 2023.
* **Flow matching** (Voicebox, F5-TTS) offered faster sampling than score-based diffusion.
* **State-space sequence models** (Cartesia Sonic) targeted ultra-low-latency streaming as an alternative to Transformers.
* **Instruction steerability** (OpenAI gpt-4o-mini-tts, ElevenLabs v3 audio tags) let users direct tone, emotion, and delivery through natural-language prompts or inline tags.

## What are the most notable TTS models?

| Model | Release | Organization | Type | Notable feature |
|---|---|---|---|---|
| [WaveNet](/wiki/wavenet) | Sep 2016 | [DeepMind](/wiki/deepmind) | Autoregressive vocoder | First neural raw-audio generator |
| [Tacotron 2](/wiki/tacotron) | Dec 2017 | [Google](/wiki/google) | Seq2seq + WaveNet | MOS 4.53, near studio quality |
| [Glow-TTS](/wiki/glow_tts) | May 2020 | [Kakao](/wiki/kakao_corporation) / Seoul National University | Flow-based | Monotonic alignment search |
| [FastSpeech 2](/wiki/fastspeech) | Jun 2020 | [Microsoft](/wiki/microsoft) / Zhejiang | Non-autoregressive | Pitch and energy control |
| [HiFi-GAN](/wiki/hifi_gan) | Oct 2020 | [Kakao](/wiki/kakao_corporation) | GAN vocoder | 168x real-time on V100 |
| [VITS](/wiki/vits) | Jun 2021 | [Kakao](/wiki/kakao_corporation) / KAIST | End-to-end VAE | First single-stage neural TTS |
| [Tortoise TTS](/wiki/tortoise_tts) | Apr 2022 | James Betker | AR + diffusion | Open multi-voice cloning |
| [VALL-E](/wiki/vall_e) | Jan 2023 | [Microsoft](/wiki/microsoft) | Codec LM | 3-second voice cloning |
| [NaturalSpeech 2](/wiki/naturalspeech) | Apr 2023 | Microsoft | Latent diffusion | Zero-shot singing |
| [Bark](/wiki/bark) | Apr 2023 | [Suno](/wiki/suno) | Codec LM | Laughter, music, sound effects |
| [Voicebox](/wiki/voicebox) | Jun 2023 | [Meta](/wiki/meta_platforms) | Flow matching | Speech infilling, multilingual |
| [StyleTTS 2](/wiki/styletts) | Jun 2023 | Columbia | Style diffusion + GAN | Matches human MOS on LJSpeech |
| [XTTS-v2](/wiki/xtts) | Nov 2023 | [Coqui](/wiki/coqui) | AR codec | 17-language voice cloning |
| [OpenVoice](/wiki/openvoice) | Dec 2023 | [MyShell](/wiki/myshell) / MIT | Two-stage | Style and accent control |
| [Voice Engine](/wiki/voice_engine) | Mar 2024 | [OpenAI](/wiki/openai) | Proprietary | 15-second cloning, preview only |
| [Cartesia Sonic](/wiki/cartesia) | May 2024 | Cartesia | State-space model | Low-latency streaming SSM |
| [F5-TTS](/wiki/f5_tts) | Oct 2024 | Shanghai Jiao Tong | Flow matching DiT | No duration model needed |
| [Kokoro](/wiki/kokoro_tts) | Dec 2024 | Hexgrad | StyleTTS variant | 82M params, Apache 2.0 |
| [CosyVoice 2](/wiki/cosyvoice) | Dec 2024 | [Alibaba](/wiki/alibaba) | LLM-backbone TTS | Streaming, multilingual |
| [Sesame CSM-1B](/wiki/sesame_csm) | Mar 2025 | Sesame | Llama + audio decoder | Conversational, powers Maya |
| gpt-4o-mini-tts | Mar 2025 | [OpenAI](/wiki/openai) | Proprietary | Prompt-steerable delivery |
| [Dia-1.6B](/wiki/dia) | Apr 2025 | Nari Labs | Codec LM | Multi-speaker dialogue, Apache 2.0 |
| MiniMax Speech-02 | May 2025 | MiniMax | AR Transformer + Flow-VAE | Topped TTS Arena leaderboards |
| [Eleven v3](/wiki/elevenlabs_v3) | Jun 2025 | [ElevenLabs](/wiki/elevenlabs) | Proprietary | Audio tags, 70+ languages |
| Cartesia Sonic-3 | Oct 2025 | Cartesia | State-space model | ~90 ms latency, 42 languages |

## Which vendors offer commercial TTS?

| Vendor | Product | Specialty |
|---|---|---|
| [ElevenLabs](/wiki/elevenlabs) | [Eleven v3](/wiki/elevenlabs_v3), Multilingual v2, Turbo, Flash | Expressive voices, audio tags, dubbing |
| [Cartesia](/wiki/cartesia) | Sonic, Sonic-2, Sonic-3 | Ultra-low-latency state-space voice AI |
| [OpenAI](/wiki/openai) | gpt-4o-mini-tts, Realtime API | Steerable speech, voice agents |
| [Google Cloud](/wiki/google_cloud) | Cloud Text-to-Speech, Studio voices | WaveNet and Neural2 voices |
| [Amazon](/wiki/amazon) | [Polly](/wiki/amazon_polly), Polly Neural | Long-form, generative voices |
| [Microsoft](/wiki/microsoft) Azure | Azure AI Speech, Custom Neural Voice | Brand voice cloning |
| [Resemble AI](/wiki/resemble_ai) | Resemble Clone, Detect | Voice cloning with detection |
| [Play.ht](/wiki/play_ht) | PlayHT 2.0, Play 3.0 | Conversational TTS |
| [Murf AI](/wiki/murf_ai) | Murf Studio | Marketing voiceover |
| [Synthesia](/wiki/synthesia) | Synthesia Avatars | Avatar plus voice video |
| [Hume AI](/wiki/hume_ai) | Octave, EVI | Emotional and expressive TTS |
| [Speechify](/wiki/speechify) | Speechify Voices | Reading assistant, audiobook |
| [Descript](/wiki/descript) | Overdub | Podcast editing, voice clone |

## How is TTS quality measured?

Widely used training corpora and evaluation leaderboards include:

| Dataset or benchmark | Year | Description |
|---|---|---|
| [LJSpeech](/wiki/ljspeech) | 2017 | 13,100 single-speaker English clips from 7 books, by Keith Ito |
| [VCTK](/wiki/vctk) | 2012 | 110 English speakers, various accents, about 400 sentences each |
| [LibriTTS](/wiki/libritts) | 2019 | 585 hours, 24 kHz, multi-speaker, derived from [LibriSpeech](/wiki/librispeech) |
| [MLS](/wiki/mls) | 2020 | Multilingual LibriSpeech, 50,000+ hours across 8 languages |
| [Common Voice](/wiki/common_voice) | 2017-present | Crowdsourced [Mozilla](/wiki/mozilla) corpus, 100+ languages |
| [GigaSpeech](/wiki/gigaspeech) | 2021 | 10,000 hours of transcribed English audio |
| TTS Arena | 2024 | [Hugging Face](/wiki/hugging_face) head-to-head leaderboard |

Quality is measured both subjectively and automatically:

* **Mean Opinion Score (MOS)**: 1 to 5 listener rating, the most common subjective benchmark.
* **Comparative MOS (CMOS)**: side-by-side preference between two systems.
* **Word Error Rate (WER)**: synthesized audio is transcribed by a [speech recognition](/wiki/speech_recognition) system to test intelligibility.
* **Speaker Encoder Cosine Similarity (SECS)**: cosine similarity between speaker embeddings of the reference and the cloned audio; higher values indicate a closer voice match.
* **UTMOS** and **NISQA**: neural networks that predict MOS automatically with high correlation to human judgments.
* **Time to First Audio (TTFA)** and end-to-end latency: critical for streaming voice agents, where current low-latency systems report figures in the tens to low hundreds of milliseconds (Cartesia reports about 90 ms model latency and 190 ms end-to-end for Sonic-3).[^cartesia-sonic3]
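
Two of the automatic metrics above reduce to short formulas. The sketch below computes word error rate as a word-level edit distance divided by the reference length, and speaker similarity as the cosine between two embedding vectors. The example transcript and the three-dimensional embeddings are hypothetical; in practice the transcript would come from a speech recognizer and the embeddings from a speaker encoder.

```python
import math


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b):
    """Cosine of the angle between two embeddings (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical ASR transcript of synthesized audio: one substitution, one insertion.
wer = word_error_rate("the quick brown fox jumps", "the quick brown box jumps high")

# Hypothetical speaker embeddings; real encoders use hundreds of dimensions.
secs = cosine_similarity([3.0, 4.0, 0.0], [6.0, 8.0, 0.0])
```

Here two errors against five reference words give a WER of 0.4, and the two embeddings point in the same direction, so their similarity is 1.0. Note that WER can exceed 1.0 when the hypothesis contains many insertions.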

Head-to-head Elo leaderboards such as the Artificial Analysis Speech Arena and the Hugging Face TTS Arena became a common reference point in 2025; MiniMax reported its Speech-02-HD model topping both, ahead of OpenAI and ElevenLabs entries.[^minimax-arena]
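
For readers unfamiliar with Elo-style ranking, the sketch below shows the textbook update applied to pairwise listener votes. The K-factor of 32, the starting rating of 1000, and the model names are assumptions; the arenas named above may compute their scores differently.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, a_won, k=32):
    """Return the new (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


ratings = {"model_x": 1000.0, "model_y": 1000.0}   # hypothetical entrants
votes = [("model_x", "model_y", True),             # (A, B, did the listener prefer A?)
         ("model_x", "model_y", True),
         ("model_x", "model_y", False)]

for a, b, a_won in votes:
    ratings[a], ratings[b] = update(ratings[a], ratings[b], a_won)
```

Each vote moves the two ratings by equal and opposite amounts, and an upset against a higher-rated model moves them further than an expected result.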

## What can TTS models do?

Flagship TTS systems in 2025 deliver outputs that listeners often cannot reliably distinguish from human recordings on short utterances. Common capabilities include:

* **Zero-shot voice cloning** from 3 to 15 seconds of reference audio.
* **Cross-lingual cloning**, where a speaker recorded in English can be reproduced speaking Japanese or Spanish.
* **Prosody and emotion control**, exposed as tags or style prompts in models such as Bark, Hume Octave, ElevenLabs v3 (inline audio tags), and OpenAI gpt-4o-mini-tts (natural-language instructions).[^elevenlabs-v3][^openai-audio]
* **Code-switching** between languages within a single utterance.
* **Real-time streaming** with low latency (tens to low hundreds of milliseconds), used in voice agents and powered by systems such as Cartesia Sonic.[^cartesia-sonic3]
* **Singing voice synthesis** through systems like [DiffSinger](/wiki/diffsinger) and NaturalSpeech 2.
* **Long-form coherence** for audiobooks and podcasts, with reference-aware chunking.
* **Multi-speaker dialogue** generated in a single pass, with nonverbal cues, in models such as Dia and Sesame CSM.[^dia]
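
Long-form synthesis typically starts by splitting text into chunks that a model can handle in one pass. The sketch below groups whole sentences under a character budget; the budget, the punctuation-based sentence splitter, and the `chunk_text` helper are assumptions, and conditioning each chunk on previously generated audio is not shown.

```python
import re


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks no longer than `max_chars` where possible.

    A single sentence longer than the budget is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + " " + sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)   # budget exceeded: close the current chunk
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chapter = "It was late. The rain had stopped! Would anyone come? Nobody knew."
chunks = chunk_text(chapter, max_chars=35)
```

Splitting only at sentence boundaries keeps each chunk's prosody self-contained, at the cost of uneven chunk lengths.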

## What is TTS used for?

TTS is now embedded across consumer and enterprise products. Major application areas:

* **Virtual assistants** including [Alexa](/wiki/alexa), [Siri](/wiki/siri), and [Google Assistant](/wiki/google_assistant).
* **Conversational AI** voice modes in [ChatGPT](/wiki/chatgpt), [Gemini](/wiki/gemini), and [Claude](/wiki/claude).
* **Audiobook narration** and self-published author tooling.
* **Video voiceover** for marketing, e-learning, and short-form social content.
* **Podcast generation** including auto-generated dialogue formats like [NotebookLM](/wiki/notebooklm) Audio Overviews.
* **Accessibility** through screen readers for users with vision impairments and speech-generating devices for users with speech impairments.
* **Localization and dubbing** of films and games into additional languages while preserving the original speaker's voice.
* **In-car navigation** and infotainment voices.
* **Game NPC dialogue** including dynamic, runtime-generated lines.
* **Customer service** through call-center voice agents.

## What are the ethics and regulatory risks?

Low-cost [voice cloning](/wiki/voice_cloning) has prompted concrete misuse cases. Reports of scam calls impersonating relatives, political robocalls (including a January 2024 New Hampshire robocall mimicking President [Joe Biden](/wiki/joe_biden)), and synthetic [deepfake](/wiki/deepfake) audio used in extortion have drawn regulatory attention. Industry responses include:

* **Watermarking** of synthesized audio (used by OpenAI Voice Engine, Resemble Detect, and Google [SynthID](/wiki/synthid) for audio).
* **Voice consent requirements** for commercial cloning (ElevenLabs Voice Captcha, Resemble VocoderID).
* **Provenance standards** such as [C2PA](/wiki/c2pa) for audio assets.
* **Regulation**: the US [Federal Communications Commission](/wiki/fcc) ruled in February 2024 that AI-voiced robocalls fall under the Telephone Consumer Protection Act.[^fcc] California's AB 2602 (2024) requires explicit consent for digital replicas in performer contracts, and the [EU AI Act](/wiki/eu_ai_act) classifies synthetic audio as content that must be disclosed.

Dataset consent is also disputed: many open TTS corpora were assembled from public audiobooks (LibriVox) or scraped audio, raising debates about whether voice talent must consent specifically to AI training.

## What are the limitations of TTS models?

Despite rapid progress, current TTS systems still exhibit:

* Narrow emotional range over long passages, especially for unscripted reactions.
* Quality-versus-latency tradeoffs that force streaming systems to use smaller models.
* Weak support for low-resource languages and dialects outside major commercial markets.
* Codec artifacts (metallic timbre, popping) at low neural-codec bitrates.
* Hallucinated stress or mispronunciation on rare proper nouns, numbers, and acronyms.
* Degraded similarity when cloning unusual voices (children, elderly speakers, heavily accented English).
* Limited control over background acoustics, microphone characteristics, and recording environment.

## References

[^wavenet]: Oord, A. van den, et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. https://arxiv.org/abs/1609.03499 Accessed 2026-05-31.
[^deepmind-wavenet]: DeepMind. (2016). WaveNet: A generative model for raw audio. DeepMind Blog. https://deepmind.google/blog/wavenet-a-generative-model-for-raw-audio/ Accessed 2026-06-22.
[^tacotron]: Wang, Y., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv:1703.10135. https://arxiv.org/abs/1703.10135 Accessed 2026-05-31.
[^tacotron2]: Shen, J., et al. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884. https://arxiv.org/abs/1712.05884 Accessed 2026-05-31.
[^fastspeech]: Ren, Y., et al. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. arXiv:1905.09263. https://arxiv.org/abs/1905.09263 Accessed 2026-05-31.
[^fastspeech2]: Ren, Y., et al. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv:2006.04558. https://arxiv.org/abs/2006.04558 Accessed 2026-05-31.
[^glowtts]: Kim, J., Kim, S., Kong, J., and Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129. https://arxiv.org/abs/2005.11129 Accessed 2026-05-31.
[^hifigan]: Kong, J., Kim, J., and Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv:2010.05646. https://arxiv.org/abs/2010.05646 Accessed 2026-05-31.
[^vits]: Kim, J., Kong, J., and Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). arXiv:2106.06103. https://arxiv.org/abs/2106.06103 Accessed 2026-05-31.
[^tortoise]: Betker, J. (2023). Better Speech Synthesis through Scaling (Tortoise TTS). arXiv:2305.07243. https://arxiv.org/abs/2305.07243 Accessed 2026-05-31.
[^valle]: Wang, C., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E). arXiv:2301.02111. https://arxiv.org/abs/2301.02111 Accessed 2026-05-31.
[^ns2]: Shen, K., et al. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv:2304.09116. https://arxiv.org/abs/2304.09116 Accessed 2026-05-31.
[^bark]: Suno AI. (2023). Bark: Text-Prompted Generative Audio Model. GitHub. https://github.com/suno-ai/bark Accessed 2026-05-31.
[^voicebox]: Le, M., et al. (2023). Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. arXiv:2306.15687. https://arxiv.org/abs/2306.15687 Accessed 2026-05-31.
[^styletts2]: Li, Y. A., et al. (2023). StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691. https://arxiv.org/abs/2306.07691 Accessed 2026-05-31.
[^xtts]: Coqui. (2023). XTTS: Open Model for Multilingual Voice Cloning. Hugging Face. https://huggingface.co/coqui/XTTS-v2 Accessed 2026-05-31.
[^openvoice]: Qin, Z., et al. (2023). OpenVoice: Versatile Instant Voice Cloning. arXiv:2312.01479. https://arxiv.org/abs/2312.01479 Accessed 2026-05-31.
[^openai-voiceengine]: OpenAI. (2024). Navigating the Challenges and Opportunities of Synthetic Voices. OpenAI Blog. https://openai.com/index/navigating-the-challenges-and-opportunities-of-synthetic-voices/ Accessed 2026-05-31.
[^voiceengine-2025]: Wiggers, K. (2025). A year later, OpenAI still hasn't released its voice cloning tool. TechCrunch. https://techcrunch.com/2025/03/06/a-year-later-openai-still-hasnt-released-its-voice-cloning-tool/ Accessed 2026-05-31.
[^f5tts]: Chen, Y., et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. https://arxiv.org/abs/2410.06885 Accessed 2026-05-31.
[^cartesia-sonic]: Cartesia. (2024). Announcing Sonic: a low-latency voice model for lifelike speech. Cartesia Blog. https://cartesia.ai/blog/sonic Accessed 2026-05-31.
[^kokoro]: Hexgrad. (2024). Kokoro-82M. Hugging Face. https://huggingface.co/hexgrad/Kokoro-82M Accessed 2026-05-31.
[^cosyvoice2]: Du, Z., et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117. https://arxiv.org/abs/2412.10117 Accessed 2026-05-31.
[^sesame]: Sesame AI Labs. (2025). CSM: A Conversational Speech Generation Model. GitHub. https://github.com/SesameAILabs/csm Accessed 2026-05-31.
[^sesame-maya]: Wiggers, K. (2025). Sesame, the startup behind the viral virtual assistant Maya, releases its base AI model. TechCrunch. https://techcrunch.com/2025/03/13/sesame-the-startup-behind-the-viral-virtual-assistant-maya-releases-its-base-ai-model/ Accessed 2026-05-31.
[^openai-audio]: OpenAI. (2025). Introducing next-generation audio models in the API. OpenAI Blog. https://openai.com/index/introducing-our-next-generation-audio-models/ Accessed 2026-05-31.
[^openai-tts-pricing]: OpenAI. (2025). Text to speech. OpenAI API documentation. https://platform.openai.com/docs/guides/text-to-speech Accessed 2026-05-31.
[^dia]: Nari Labs. (2025). Dia-1.6B. Hugging Face. https://huggingface.co/nari-labs/Dia-1.6B Accessed 2026-05-31.
[^dia-vb]: Franzen, C. (2025). A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more. VentureBeat. https://venturebeat.com/ai/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more Accessed 2026-05-31.
[^minimax]: Zhang, B., et al. (2025). MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder. arXiv:2505.07916. https://arxiv.org/abs/2505.07916 Accessed 2026-05-31.
[^minimax-arena]: MiniMax. (2025). MiniMax Speech 02: Pioneering a New Era of AI Speech Generation. MiniMax News. https://www.minimax.io/news/minimax-speech-02 Accessed 2026-05-31.
[^elevenlabs-v3]: ElevenLabs. (2025). Eleven v3: Most Expressive AI TTS Model. ElevenLabs Blog. https://elevenlabs.io/blog/eleven-v3 Accessed 2026-05-31.
[^cartesia-sonic3]: Cartesia. (2025). Sonic-3. Cartesia Documentation. https://docs.cartesia.ai/build-with-cartesia/tts-models/latest Accessed 2026-05-31.
[^cartesia-funding]: Cartesia. (2025). Cartesia raises $100M and launches Sonic-3. Cartesia Blog. https://cartesia.ai/blog Accessed 2026-06-22.
[^zen2009]: Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical Parametric Speech Synthesis. Speech Communication, 51(11). https://www.sciencedirect.com/science/article/abs/pii/S0167639309000648 Accessed 2026-05-31.
[^soundstream]: Zeghidour, N., et al. (2021). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. https://arxiv.org/abs/2107.03312 Accessed 2026-05-31.
[^encodec]: Defossez, A., et al. (2022). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. https://arxiv.org/abs/2210.13438 Accessed 2026-05-31.
[^fcc]: US Federal Communications Commission. (2024). FCC Makes AI-Generated Voices in Robocalls Illegal. FCC. https://www.fcc.gov/document/fcc-makes-ai-generated-voices-robocalls-illegal Accessed 2026-05-31.