Text-to-Speech Models
Last reviewed: May 30, 2026
Sources: No citations yet
Review status: Needs citations
Revision: v3 · 3,427 words
Text-to-speech (TTS) models are machine learning systems that convert written text into spoken audio. The broader task and its history are covered on the text-to-speech page; this article catalogs the notable models that implement it. Modern TTS approaches use neural networks to generate speech that approximates human naturalness, voice timbre, and prosody. They are the inverse of speech recognition systems, which transcribe audio into text, and they sit alongside other audio models such as voice cloning, voice conversion, music generation, and audio enhancement. TTS is a class of generative AI, and the strongest 2025 systems produce speech that listeners often cannot reliably distinguish from human recordings on short utterances.[1][2]
Typical TTS pipelines decompose synthesis into two stages: an acoustic model that turns text or phonemes into an intermediate representation such as a mel-spectrogram, and a vocoder that converts that representation into a waveform. Newer end-to-end systems collapse both stages into a single network or operate directly on discrete codes from a neural audio codec. Capabilities have advanced from robotic rule-based output in the 1980s to zero-shot voice cloning from a few seconds of reference audio in 2023 and 2024.
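To make the intermediate representation concrete, the following is a minimal NumPy sketch of a log mel-spectrogram, the kind of feature an acoustic model might predict and a vocoder might consume. The sample rate, FFT size, hop length, and number of mel bands are illustrative assumptions (common in open-source recipes, not taken from any model described here), and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the mel scale.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=22050, n_fft=1024, hop=256, n_mels=80):
    # Assumed settings: 22.05 kHz audio, 1024-point FFT, 256-sample hop, 80 bands.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2    # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T   # (n_frames, n_mels)
    return np.log(mel + 1e-10)

sr = 22050
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440.0 * t)   # one second of a 440 Hz tone
mel = log_mel_spectrogram(wave, sr=sr)
print(mel.shape)  # (83, 80)
```

Under these assumed settings, one second of audio becomes 83 frames of 80 values each, which is far shorter than the 22,050 raw samples a vocoder must ultimately produce.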
See also: Text-to-Speech, Audio Models, Voice cloning, Speech recognition, Generative AI, ElevenLabs, Diffusion Models
Early digital TTS used formant synthesis, generating speech with parametric models of the human vocal tract. DECtalk, introduced by Digital Equipment Corporation in 1984 based on Dennis Klatt's KlattTalk work at MIT, became the canonical example. It produced highly intelligible but mechanical-sounding speech and is recognizable as the voice used by physicist Stephen Hawking from the mid-1980s onward.
In the 1990s and 2000s, concatenative synthesis stitched together short recorded units (diphones or sub-word fragments) from large speech databases. Unit selection, formalized by Hunt and Black in 1996, searched the database for the best-matching segments. Systems such as ATR's nuu-talk and Festival produced more natural output than formant synthesis but required gigabytes of recorded speech and could not generalize beyond the recorded voice.
Statistical parametric synthesis using hidden Markov models (HMMs) emerged through the HTS toolkit from Keiichi Tokuda's lab at Nagoya Institute of Technology, formalized by Heiga Zen, Tokuda, and Alan Black in 2009.[3] HMM-based TTS generalized to new voices through speaker adaptation and dominated commercial deployments until 2016, though it sounded muffled compared with concatenative output.
Neural TTS arrived with WaveNet from DeepMind in September 2016, an autoregressive model that generated raw audio one sample at a time using dilated causal convolutions, with each 16-bit sample quantized to 256 levels by mu-law companding.[4] WaveNet halved the perceptual gap to natural speech compared with prior systems. Tacotron from Google in March 2017 introduced a sequence-to-sequence character-to-spectrogram model with attention.[5] Tacotron 2 in December 2017 paired the same encoder-decoder with a WaveNet-style vocoder and reached a mean opinion score (MOS) of 4.53 against 4.58 for studio recordings.[6]
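A dilated causal convolution can be sketched in a few lines of NumPy. The example below is an illustration only: it assumes a kernel size of 2 and dilations doubling from 1 to 512, uses fixed weights of 1.0 instead of learned filters, and omits the gated activations and residual connections of a real network. The function names are hypothetical.

```python
import numpy as np

def causal_dilated_conv(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x as zero before t = 0."""
    y = np.zeros_like(x, dtype=float)
    for k, w in enumerate(weights):
        shift = k * dilation
        if shift == 0:
            y += w * x
        else:
            y[shift:] += w * x[:-shift]   # only past samples contribute
    return y

def receptive_field(kernel_size, dilations):
    # Each layer widens the visible history by (kernel_size - 1) * dilation samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

dilations = [2 ** i for i in range(10)]   # 1, 2, 4, ..., 512 (assumed stack)
impulse = np.zeros(2048)
impulse[0] = 1.0

h = impulse
for d in dilations:
    h = causal_dilated_conv(h, weights=[1.0, 1.0], dilation=d)

print(receptive_field(2, dilations))   # 1024
print(int(np.count_nonzero(h)))        # 1024
```

The impulse response spreads over exactly 1,024 future positions, which shows how ten cheap layers let each output sample depend on about a thousand past samples while never looking ahead.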
FastSpeech from Microsoft and Zhejiang University in 2019 replaced autoregressive decoding with a parallel feed-forward Transformer, speeding up mel-spectrogram generation by roughly 270 times.[7] FastSpeech 2 in 2020 added explicit pitch, energy, and duration conditioning.[8] Glow-TTS by Kim et al. in 2020 introduced normalizing flows and monotonic alignment search to TTS,[9] and VITS in 2021 unified the acoustic model and vocoder into a single end-to-end variational autoencoder with adversarial training.[10] HiFi-GAN from Kakao in 2020 became the dominant vocoder, generating 22.05 kHz audio about 168 times faster than real time on a V100 GPU.[11]
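Parallel models of this kind need some way to stretch a short phoneme sequence to the length of the output spectrogram. One common approach is a length regulator that repeats each phoneme's hidden vector according to a predicted duration. The sketch below assumes integer durations measured in frames and uses a hypothetical function name; it is not code from any of the cited systems.

```python
import numpy as np

def length_regulate(hidden, durations):
    """Repeat row i of `hidden` durations[i] times along the time axis."""
    durations = np.asarray(durations, dtype=int)
    if hidden.shape[0] != durations.shape[0]:
        raise ValueError("one duration per phoneme is required")
    return np.repeat(hidden, durations, axis=0)

# Three phonemes with 2-dimensional hidden states (toy values).
hidden = np.arange(6, dtype=float).reshape(3, 2)
frames = length_regulate(hidden, [2, 0, 3])   # a duration of 0 drops the phoneme
print(frames.shape)  # (5, 2)
```

Because the expanded sequence already has one row per output frame, the decoder can produce all frames at once rather than step by step, which is where the speedup over autoregressive decoding comes from.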
Tortoise TTS by James Betker, released in 2022, combined a GPT-style autoregressive prior with a diffusion decoder and a contrastive language-voice transformer.[12] VALL-E from Microsoft in January 2023 reframed TTS as a language modeling task over EnCodec tokens, learning from 60,000 hours of English audio and cloning unseen voices from a 3-second prompt.[13] NaturalSpeech 2 in April 2023 used latent diffusion over codec tokens and trained on 44,000 hours including singing.[14] Bark from Suno was released in April 2023 under the MIT license, generating speech, music, sound effects, and nonverbal cues like laughter from text alone.[15]
Voicebox by Meta's FAIR lab in 2023 introduced flow-matching for speech infilling.[16] StyleTTS 2 from Columbia in June 2023 used style diffusion with self-supervised models like HuBERT and WavLM.[17] XTTS-v2 from Coqui in November 2023 brought open multilingual voice cloning across 17 languages.[18] OpenVoice from MyShell and MIT added accent and emotion control.[19] OpenAI demonstrated Voice Engine in a March 2024 preview, cloning voices from 15-second samples; it remained a restricted preview limited to trusted partners through 2025 and 2026, with no broad release, which OpenAI attributed to misuse concerns.[20][21] F5-TTS from Shanghai Jiao Tong University in October 2024 used flow matching with a Diffusion Transformer.[22]
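The core idea of flow matching can be illustrated without a neural network. In the simplified sketch below, training pairs lie on a straight line between a noise sample and a data sample, and the regression target is the constant velocity along that line. This assumes the simplest linear path with no minimum noise term, and it replaces the learned, text-conditioned model with a hypothetical oracle that already knows the target; real systems differ in both respects.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Point on the straight path from noise x0 to data x1, plus its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0
    return x_t, velocity

def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 80))   # stand-in for four mel-spectrogram frames
x0 = rng.normal(size=(4, 80))   # Gaussian noise of the same shape

# Hypothetical perfect model for this single pair; a trained network would
# instead predict the velocity from x, t, and the text or audio context.
oracle = lambda x, t: x1 - x0

x_hat = euler_sample(oracle, x0, steps=10)
print(np.allclose(x_hat, x1))  # True
```

With a straight path, a handful of integration steps is enough to move from noise to the target, which is one reason flow-matching systems can sample with relatively few network evaluations.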
By late 2024 and 2025, the frontier shifted toward streaming, conversational, and prompt-steerable systems, many shipped as commercial APIs. Cartesia Sonic, released in May 2024 by a team of Stanford researchers, is built on a state-space model (SSM) architecture rather than a Transformer; the company says this gives lower latency and better long-context memory for real-time use.[23] Kokoro by Hexgrad, released December 2024 with 82 million parameters under Apache 2.0, is a compact StyleTTS-style model that topped the Hugging Face TTS Arena despite its small size.[24] CosyVoice 2 from Alibaba added streaming synthesis,[25] and Sesame CSM-1B, released March 13, 2025 under Apache 2.0, fuses a Llama backbone with a smaller audio decoder that produces Mimi codec tokens; it powers Sesame's voice companions Maya and Miles.[26][27]
OpenAI gpt-4o-mini-tts, announced in March 2025, is a production text-to-speech model built on GPT-4o mini. It is steerable by natural-language instructions (for example, asking for a calm or excited delivery), supports more than 50 languages, and is priced by OpenAI at roughly $0.015 per minute of generated audio. A December 2025 refresh reported about 35% lower word error rate on the Common Voice and FLEURS benchmarks.[28][29] Nari Labs Dia-1.6B, released in April 2025 under Apache 2.0, is a 1.6 billion parameter open model focused on multi-speaker dialogue and nonverbal cues such as laughter, initially English only.[30][31] MiniMax Speech-02 (China), described in a May 2025 paper, pairs an autoregressive Transformer with a learnable speaker encoder and Flow-VAE; MiniMax reported that its Speech-02-HD variant reached first place on the Artificial Analysis Speech Arena and the Hugging Face TTS Arena, ahead of OpenAI and ElevenLabs models.[2][32] ElevenLabs Eleven v3, announced June 5, 2025 in public alpha and reaching general availability in February 2026, supports more than 70 languages, inline audio tags such as [excited] or [whispers], and multi-speaker dialogue through a Text to Dialogue endpoint.[33] Cartesia Sonic-3, launched in late October 2025 alongside a $100 million funding round, extends the SSM line with around 90 millisecond model latency, support for 42 languages, and expressive cues such as laughter.[1]
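Inline audio tags are embedded in the script text itself, so an application that prepares or audits scripts may want to separate tags from the words to be spoken. The helper below is a hypothetical client-side sketch using only the bracket syntax mentioned above; it does not call any vendor API and makes no claim about how a provider parses tags internally.

```python
import re

# Matches a bracketed tag such as [excited] or [whispers]; nested brackets are not handled.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")

def split_audio_tags(text):
    """Return (tag, text) pairs; tag is None for text that precedes any tag."""
    segments = []
    current_tag = None
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append((current_tag, chunk))
        current_tag = match.group(1).strip()
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments

print(split_audio_tags("Hello there. [excited] We won! [whispers] Keep it quiet."))
# [(None, 'Hello there.'), ('excited', 'We won!'), ('whispers', 'Keep it quiet.')]
```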
Most neural TTS pipelines consist of three layers:
Neural audio codec models such as Google's SoundStream (2021) and Meta's EnCodec (2022) replaced mel-spectrograms with discrete tokens produced by residual vector quantization.[34][35] This enabled language-model-style TTS systems like VALL-E and Bark to operate on audio the way transformers operate on text. A parallel line of work replaces the Transformer backbone itself with state-space models (Cartesia Sonic), which the developers position as more efficient for streaming, low-latency synthesis.[23]
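Residual vector quantization can be sketched as a greedy loop in which each stage quantizes whatever the earlier stages failed to capture. The example below uses random codebooks purely for illustration; in a real codec the codebooks are learned jointly with an encoder and decoder. The sizes (8 stages of 1,024 entries over 8-dimensional vectors) and function names are assumptions.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual VQ: pick the nearest entry per stage, then quantize the leftover."""
    residual = x.astype(float).copy()
    codes = []
    for codebook in codebooks:                       # codebook shape: (size, dim)
        dists = np.sum((codebook - residual) ** 2, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct a vector as the sum of the selected entry from each stage."""
    return sum(codebook[idx] for idx, codebook in zip(codes, codebooks))

rng = np.random.default_rng(0)
dim, size, stages = 8, 1024, 8
# Random codebooks with shrinking scale; later stages model finer detail.
codebooks = [rng.normal(scale=0.5 ** s, size=(size, dim)) for s in range(stages)]
for cb in codebooks:
    cb[0] = 0.0   # a zero entry guarantees that no stage can increase the error

x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)
errors = [
    float(np.linalg.norm(x - rvq_decode(codes[:n], codebooks[:n])))
    for n in range(1, stages + 1)
]
print(codes)
print([round(e, 3) for e in errors])   # non-increasing as stages are added
```

Each frame of audio is thus represented by one integer per stage. Assuming a codec that emits 75 frames per second with 8 codebooks, one second of speech becomes 600 tokens and a 3-second prompt becomes 1,800, which is the kind of sequence a language model can handle.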
| Model | Release | Organization | Type | Notable feature |
|---|---|---|---|---|
| WaveNet | Sep 2016 | DeepMind | Autoregressive vocoder | First neural raw-audio generator |
| Tacotron 2 | Dec 2017 | Google | Seq2seq + WaveNet | MOS 4.53, near studio quality |
| FastSpeech 2 | Jun 2020 | Microsoft / Zhejiang | Non-autoregressive | Pitch and energy control |
| Glow-TTS | May 2020 | Kakao / Seoul National University | Flow-based | Monotonic alignment search |
| HiFi-GAN | Oct 2020 | Kakao | GAN vocoder | 168x real-time on V100 |
| VITS | Jun 2021 | Kakao / KAIST | End-to-end VAE | First single-stage neural TTS |
| Tortoise TTS | Apr 2022 | James Betker | AR + diffusion | Open multi-voice cloning |
| VALL-E | Jan 2023 | Microsoft | Codec LM | 3-second voice cloning |
| NaturalSpeech 2 | Apr 2023 | Microsoft | Latent diffusion | Zero-shot singing |
| Bark | Apr 2023 | Suno | Codec LM | Laughter, music, sound effects |
| Voicebox | Jun 2023 | Meta | Flow matching | Speech infilling, multilingual |
| StyleTTS 2 | Jun 2023 | Columbia | Style diffusion + GAN | Matches human MOS on LJSpeech |
| XTTS-v2 | Nov 2023 | Coqui | AR codec | 17-language voice cloning |
| OpenVoice | Dec 2023 | MyShell / MIT | Two-stage | Style and accent control |
| Voice Engine | Mar 2024 | OpenAI | Proprietary | 15-second cloning, preview only |
| Cartesia Sonic | May 2024 | Cartesia | State-space model | Low-latency streaming SSM |
| F5-TTS | Oct 2024 | Shanghai Jiao Tong | Flow matching DiT | No duration model needed |
| Kokoro | Dec 2024 | Hexgrad | StyleTTS variant | 82M params, Apache 2.0 |
| CosyVoice 2 | Dec 2024 | Alibaba | LLM-backbone TTS | Streaming, multilingual |
| Sesame CSM-1B | Mar 2025 | Sesame | Llama + audio decoder | Conversational, powers Maya |
| gpt-4o-mini-tts | Mar 2025 | OpenAI | Proprietary | Prompt-steerable delivery |
| Dia-1.6B | Apr 2025 | Nari Labs | Codec LM | Multi-speaker dialogue, Apache 2.0 |
| MiniMax Speech-02 | May 2025 | MiniMax | AR Transformer + Flow-VAE | Topped TTS Arena leaderboards |
| Eleven v3 | Jun 2025 | ElevenLabs | Proprietary | Audio tags, 70+ languages |
| Cartesia Sonic-3 | Oct 2025 | Cartesia | State-space model | ~90 ms latency, 42 languages |
| Vendor | Product | Specialty |
|---|---|---|
| ElevenLabs | Eleven v3, Multilingual v2, Turbo, Flash | Expressive voices, audio tags, dubbing |
| Cartesia | Sonic, Sonic-2, Sonic-3 | Ultra-low-latency state-space voice AI |
| OpenAI | gpt-4o-mini-tts, Realtime API | Steerable speech, voice agents |
| Google Cloud | Cloud Text-to-Speech, Studio voices | WaveNet and Neural2 voices |
| Amazon | Polly, Polly Neural | Long-form, generative voices |
| Microsoft Azure | Azure AI Speech, Custom Neural Voice | Brand voice cloning |
| Resemble AI | Resemble Clone, Detect | Voice cloning with detection |
| Play.ht | PlayHT 2.0, Play 3.0 | Conversational TTS |
| Murf AI | Murf Studio | Marketing voiceover |
| Synthesia | Synthesia Avatars | Avatar plus voice video |
| Hume AI | Octave, EVI | Emotional and expressive TTS |
| Speechify | Speechify Voices | Reading assistant, audiobook |
| Descript | Overdub | Podcast editing, voice clone |
| Dataset | Year | Description |
|---|---|---|
| LJSpeech | 2017 | 13,100 single-speaker English clips from 7 books, by Keith Ito |
| VCTK | 2012 | 110 English speakers, various accents, about 400 sentences each |
| LibriTTS | 2019 | 585 hours, 24 kHz, multi-speaker, derived from LibriSpeech |
| MLS | 2020 | Multilingual LibriSpeech, 50,000+ hours across 8 languages |
| Common Voice | 2017-present | Crowdsourced Mozilla corpus, 100+ languages |
| GigaSpeech | 2021 | 10,000 hours of transcribed English audio |
| TTS Arena | 2024 | Hugging Face head-to-head leaderboard |
Quality is measured both subjectively and automatically:
Head-to-head Elo leaderboards such as the Artificial Analysis Speech Arena and the Hugging Face TTS Arena became a common reference point in 2025; MiniMax reported its Speech-02-HD model topping both, ahead of OpenAI and ElevenLabs entries.[32]
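For readers unfamiliar with Elo scoring, the sketch below shows the textbook update applied after a single listener vote between two models. The K-factor of 32 and the 400-point scale are conventional defaults assumed here; specific leaderboards may use different constants or a different rating method altogether.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """score_a is 1.0 if A's clip was preferred, 0.0 if B's was, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start level; a listener prefers model A's clip.
print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

An upset moves ratings more than an expected result, so a small model that repeatedly beats higher-rated systems climbs quickly.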
Flagship TTS systems in 2025 deliver outputs that listeners often cannot reliably distinguish from human recordings on short utterances. Common capabilities include:
TTS is now embedded across consumer and enterprise products. Major application areas:
Low-cost voice cloning has prompted concrete misuse cases. Reports of scam calls impersonating relatives, political robocalls (including a January 2024 New Hampshire robocall mimicking President Joe Biden), and synthetic deepfake audio used in extortion have drawn regulatory attention. Industry responses include:
Dataset consent is also disputed: many open TTS corpora were assembled from public audiobooks (LibriVox) or scraped audio, raising debates about whether voice talent must consent specifically to AI training.
Despite rapid progress, current TTS systems still exhibit:
[1] Cartesia. (2025). Sonic-3. Cartesia Documentation. https://docs.cartesia.ai/build-with-cartesia/tts-models/latest Accessed 2026-05-31.
[2] Zhang, B., et al. (2025). MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder. arXiv:2505.07916. https://arxiv.org/abs/2505.07916 Accessed 2026-05-31.
[3] Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical Parametric Speech Synthesis. Speech Communication, 51(11). https://www.sciencedirect.com/science/article/abs/pii/S0167639309000648 Accessed 2026-05-31.
[4] Oord, A. van den, et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. https://arxiv.org/abs/1609.03499 Accessed 2026-05-31.
[5] Wang, Y., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv:1703.10135. https://arxiv.org/abs/1703.10135 Accessed 2026-05-31.
[6] Shen, J., et al. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884. https://arxiv.org/abs/1712.05884 Accessed 2026-05-31.
[7] Ren, Y., et al. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. arXiv:1905.09263. https://arxiv.org/abs/1905.09263 Accessed 2026-05-31.
[8] Ren, Y., et al. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv:2006.04558. https://arxiv.org/abs/2006.04558 Accessed 2026-05-31.
[9] Kim, J., Kim, S., Kong, J., and Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129. https://arxiv.org/abs/2005.11129 Accessed 2026-05-31.
[10] Kim, J., Kong, J., and Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). arXiv:2106.06103. https://arxiv.org/abs/2106.06103 Accessed 2026-05-31.
[11] Kong, J., Kim, J., and Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv:2010.05646. https://arxiv.org/abs/2010.05646 Accessed 2026-05-31.
[12] Betker, J. (2023). Better Speech Synthesis through Scaling (Tortoise TTS). arXiv:2305.07243. https://arxiv.org/abs/2305.07243 Accessed 2026-05-31.
[13] Wang, C., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E). arXiv:2301.02111. https://arxiv.org/abs/2301.02111 Accessed 2026-05-31.
[14] Shen, K., et al. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv:2304.09116. https://arxiv.org/abs/2304.09116 Accessed 2026-05-31.
[15] Suno AI. (2023). Bark: Text-Prompted Generative Audio Model. GitHub. https://github.com/suno-ai/bark Accessed 2026-05-31.
[16] Le, M., et al. (2023). Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. arXiv:2306.15687. https://arxiv.org/abs/2306.15687 Accessed 2026-05-31.
[17] Li, Y. A., et al. (2023). StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691. https://arxiv.org/abs/2306.07691 Accessed 2026-05-31.
[18] Coqui. (2023). XTTS: Open Model for Multilingual Voice Cloning. Hugging Face. https://huggingface.co/coqui/XTTS-v2 Accessed 2026-05-31.
[19] Qin, Z., et al. (2023). OpenVoice: Versatile Instant Voice Cloning. arXiv:2312.01479. https://arxiv.org/abs/2312.01479 Accessed 2026-05-31.
[20] OpenAI. (2024). Navigating the Challenges and Opportunities of Synthetic Voices. OpenAI Blog. https://openai.com/index/navigating-the-challenges-and-opportunities-of-synthetic-voices/ Accessed 2026-05-31.
[21] Wiggers, K. (2025). A year later, OpenAI still hasn't released its voice cloning tool. TechCrunch. https://techcrunch.com/2025/03/06/a-year-later-openai-still-hasnt-released-its-voice-cloning-tool/ Accessed 2026-05-31.
[22] Chen, Y., et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. https://arxiv.org/abs/2410.06885 Accessed 2026-05-31.
[23] Cartesia. (2024). Announcing Sonic: a low-latency voice model for lifelike speech. Cartesia Blog. https://cartesia.ai/blog/sonic Accessed 2026-05-31.
[24] Hexgrad. (2024). Kokoro-82M. Hugging Face. https://huggingface.co/hexgrad/Kokoro-82M Accessed 2026-05-31.
[25] Du, Z., et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117. https://arxiv.org/abs/2412.10117 Accessed 2026-05-31.
[26] Sesame AI Labs. (2025). CSM: A Conversational Speech Generation Model. GitHub. https://github.com/SesameAILabs/csm Accessed 2026-05-31.
[27] Wiggers, K. (2025). Sesame, the startup behind the viral virtual assistant Maya, releases its base AI model. TechCrunch. https://techcrunch.com/2025/03/13/sesame-the-startup-behind-the-viral-virtual-assistant-maya-releases-its-base-ai-model/ Accessed 2026-05-31.
[28] OpenAI. (2025). Introducing next-generation audio models in the API. OpenAI Blog. https://openai.com/index/introducing-our-next-generation-audio-models/ Accessed 2026-05-31.
[29] OpenAI. (2025). Text to speech. OpenAI API documentation. https://platform.openai.com/docs/guides/text-to-speech Accessed 2026-05-31.
[30] Nari Labs. (2025). Dia-1.6B. Hugging Face. https://huggingface.co/nari-labs/Dia-1.6B Accessed 2026-05-31.
[31] Wiggers, K. (2025). A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more. VentureBeat. https://venturebeat.com/ai/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more Accessed 2026-05-31.
[32] MiniMax. (2025). MiniMax Speech 02: Pioneering a New Era of AI Speech Generation. MiniMax News. https://www.minimax.io/news/minimax-speech-02 Accessed 2026-05-31.
[33] ElevenLabs. (2025). Eleven v3: Most Expressive AI TTS Model. ElevenLabs Blog. https://elevenlabs.io/blog/eleven-v3 Accessed 2026-05-31.
[34] Zeghidour, N., et al. (2021). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. https://arxiv.org/abs/2107.03312 Accessed 2026-05-31.
[35] Defossez, A., et al. (2022). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. https://arxiv.org/abs/2210.13438 Accessed 2026-05-31.
[36] US Federal Communications Commission. (2024). FCC Makes AI-Generated Voices in Robocalls Illegal. FCC. https://www.fcc.gov/document/fcc-makes-ai-generated-voices-robocalls-illegal Accessed 2026-05-31.