Text-to-Speech Models
Last reviewed
May 11, 2026
Sources
24 citations
Review status
Source-backed
Revision
v2 · 2,500 words
Text-to-speech (TTS) models are machine learning systems that convert written text into spoken audio. Modern TTS approaches use neural networks to generate speech that approximates human naturalness, voice timbre, and prosody. They are the inverse of speech-to-text systems, which transcribe audio into text, and they sit alongside other audio models such as voice conversion, music generation, and audio enhancement.
Typical TTS pipelines decompose synthesis into two stages: an acoustic model that turns text or phonemes into an intermediate representation such as a mel-spectrogram, and a vocoder that converts that representation into a waveform. Newer end-to-end systems collapse both stages into a single network or operate directly on discrete codes from a neural audio codec. Capabilities have advanced from robotic, rule-based output in the 1980s to zero-shot voice cloning from a few seconds of reference audio in 2023 and 2024.
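The two-stage data flow can be sketched in a few lines of Python. This is only an illustration of the interfaces: `toy_acoustic_model` and `toy_vocoder` are hypothetical placeholders that emit silence, and the values of `N_MELS`, `HOP_LENGTH`, and `SAMPLE_RATE` are assumed (common choices, not taken from any specific system described here).

```python
from typing import Callable, List

MelFrames = List[List[float]]  # shape: [n_frames][n_mels]
Waveform = List[float]         # mono samples

N_MELS = 80          # assumed number of mel channels
HOP_LENGTH = 256     # assumed audio samples per mel frame
SAMPLE_RATE = 22050  # assumed output sample rate in Hz


def toy_acoustic_model(text: str, frames_per_char: int = 5) -> MelFrames:
    """Placeholder stage 1: a fixed number of all-zero mel frames per character."""
    n_frames = len(text) * frames_per_char
    return [[0.0] * N_MELS for _ in range(n_frames)]


def toy_vocoder(mel: MelFrames) -> Waveform:
    """Placeholder stage 2: each mel frame becomes HOP_LENGTH silent samples."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(
    text: str,
    acoustic_model: Callable[[str], MelFrames],
    vocoder: Callable[[MelFrames], Waveform],
) -> Waveform:
    """Run text through the acoustic model, then the vocoder."""
    mel = acoustic_model(text)
    return vocoder(mel)


audio = synthesize("hello", toy_acoustic_model, toy_vocoder)
duration_s = len(audio) / SAMPLE_RATE
print(len(audio), round(duration_s, 3))
```

Because the two stages only share the mel-spectrogram format, either one can be swapped independently, which is why vocoders are often reused across different acoustic models.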
See also: Audio Models, Speech Recognition, Generative AI, Diffusion Models
Early digital TTS used formant synthesis, generating speech with parametric models of the human vocal tract. DECtalk, introduced by Digital Equipment Corporation in 1984 based on Dennis Klatt's KlattTalk work at MIT, became the canonical example. It produced highly intelligible but mechanical-sounding speech and is recognizable as the voice used by physicist Stephen Hawking from the mid-1980s onward.
In the 1990s and 2000s, concatenative synthesis stitched together short recorded units (diphones or sub-word fragments) from large speech databases. Unit selection, formalized by Hunt and Black in 1996, searched the database for the segments that best matched the target while joining smoothly with their neighbors. Systems such as ATR's nuu-talk and Festival produced more natural output than formant synthesis but required gigabytes of recorded speech and could not generalize beyond the recorded voice.
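The search behind unit selection is commonly framed as a shortest-path problem: each candidate unit has a target cost (how well it fits the requested sound) and a join cost (how smoothly it follows its predecessor). The sketch below is a minimal dynamic-programming version under simplifying assumptions: the `Unit` fields, the pitch-only cost functions, and the tiny database are all hypothetical, whereas real systems combine many weighted acoustic and linguistic features.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Unit:
    label: str     # phonetic label, e.g. a diphone name
    pitch: float   # toy feature: average pitch in Hz
    unit_id: int   # position in the recorded database


def target_cost(target_pitch: float, unit: Unit) -> float:
    """Toy target cost: distance between requested and recorded pitch."""
    return abs(target_pitch - unit.pitch)


def join_cost(prev: Unit, nxt: Unit) -> float:
    """Toy join cost: free if the units were adjacent in the recording."""
    if nxt.unit_id == prev.unit_id + 1:
        return 0.0
    return abs(prev.pitch - nxt.pitch)


def select_units(
    targets: List[Tuple[str, float]], database: List[Unit]
) -> Tuple[List[Unit], float]:
    """Return the minimum-cost unit sequence for (label, pitch) targets."""
    candidates = [[u for u in database if u.label == lab] for lab, _ in targets]
    if any(not c for c in candidates):
        raise ValueError("no candidate unit for at least one target")
    # best[i][j] = (cheapest cost of reaching candidate j at step i, backpointer)
    best = [[(target_cost(targets[0][1], u), -1) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i][1], u)
            row.append(min(
                (best[i - 1][k][0] + join_cost(p, u) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            ))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


database = [
    Unit("a-b", 100.0, 0),
    Unit("b-c", 130.0, 1),
    Unit("a-b", 118.0, 10),
    Unit("b-c", 121.0, 20),
]
path, total = select_units([("a-b", 120.0), ("b-c", 120.0)], database)
print([u.unit_id for u in path], total)  # [10, 20] 6.0
```

In this toy database the search prefers units 10 and 20, which fit the target pitch closely, over the contiguous pair 0 and 1, which joins for free but fits poorly.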
Statistical parametric synthesis using hidden Markov models (HMMs) emerged through the HTS toolkit from Keiichi Tokuda's lab at Nagoya Institute of Technology, formalized by Heiga Zen, Tokuda, and Alan Black in 2009. HMM-based TTS generalized to new voices through speaker adaptation and dominated commercial deployments until 2016, though it sounded muffled compared with concatenative output.
Neural TTS arrived with WaveNet from DeepMind in September 2016, an autoregressive model that generated raw audio one sample at a time using dilated causal convolutions, with each 16-bit sample quantized to 256 levels by μ-law companding. DeepMind reported that WaveNet roughly halved the perceptual gap to natural speech compared with prior systems. Tacotron from Google in March 2017 introduced a sequence-to-sequence character-to-spectrogram model with attention. Tacotron 2 in December 2017 paired a similar attention-based encoder-decoder with a WaveNet-style vocoder and reached a mean opinion score (MOS) of 4.53 against 4.58 for studio recordings.
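A dilated causal convolution lets each output depend only on current and past samples, while doubling the dilation at each layer grows the receptive field exponentially with depth. The following is a minimal pure-Python sketch of that idea only: it omits WaveNet's gated activations, residual and skip connections, and learned weights, and the function names are illustrative.

```python
from typing import List


def causal_dilated_conv(x: List[float], weights: List[float], dilation: int) -> List[float]:
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t < 0] as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: never look at future samples
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Push a unit impulse at t = 3 through three layers with dilations 1, 2, 4.
signal = [0.0] * 16
signal[3] = 1.0
for d in [1, 2, 4]:
    signal = causal_dilated_conv(signal, [1.0, 1.0], d)

print(signal)                      # nonzero only at t = 3..10
print(receptive_field(2, [1, 2, 4]))                        # 8
print(receptive_field(2, [2 ** i for i in range(10)]))      # 1024
```

The impulse spreads forward over exactly eight samples and never backward, matching the receptive field of 8. With kernel size 2 and dilations 1, 2, 4, ..., 512, a single stack covers 1,024 samples.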
FastSpeech from Microsoft and Zhejiang University in 2019 replaced autoregressive decoding with a parallel feed-forward Transformer, speeding up generation by roughly 270 times for spectrograms. FastSpeech 2 in 2020 added explicit pitch, energy, and duration conditioning. Glow-TTS by Kim et al. in 2020 introduced normalizing flows and monotonic alignment search to TTS, and VITS in 2021 unified the acoustic model and vocoder into a single end-to-end variational autoencoder with adversarial training. HiFi-GAN from Kakao in 2020 became the dominant vocoder, generating 22.05 kHz audio about 168 times faster than real time on a V100 GPU.
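Speed figures such as "168 times faster than real time" are easier to interpret as throughput and latency. The short calculation below uses the HiFi-GAN numbers quoted above; the helper names are illustrative, and the 10-second clip length is an arbitrary example.

```python
def samples_per_second(sample_rate_hz: int, speedup: float) -> float:
    """Audio samples generated per wall-clock second at a given speedup."""
    return sample_rate_hz * speedup


def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Wall-clock seconds needed to generate audio_seconds of audio."""
    return audio_seconds / speedup


throughput = samples_per_second(22050, 168)  # 3,704,400 samples per second
latency = synthesis_time(10.0, 168)          # about 0.06 s for a 10 s clip
print(throughput, round(latency, 4))
```

At that rate the vocoder is far from the bottleneck for streaming use; end-to-end latency is then dominated by the acoustic model and any text preprocessing.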
Tortoise TTS by James Betker, released in 2022, combined a GPT-style autoregressive prior with a diffusion decoder and a contrastive language-voice transformer. VALL-E from Microsoft in January 2023 reframed TTS as a language modeling task over EnCodec tokens, learning from 60,000 hours of English audio and cloning unseen voices from a 3-second prompt. NaturalSpeech 2 in April 2023 used latent diffusion over continuous codec latents rather than discrete tokens and trained on 44,000 hours including singing. Bark from Suno was released in April 2023 under the MIT license, generating speech, music, sound effects, and nonverbal cues like laughter from text alone.
Voicebox by Meta's FAIR lab in 2023 introduced flow matching for speech infilling. StyleTTS 2 from Columbia in June 2023 used style diffusion with self-supervised models like HuBERT and WavLM. XTTS-v2 from Coqui in November 2023 brought open multilingual voice cloning across 17 languages. OpenVoice from MyShell and MIT added accent and emotion control. OpenAI demonstrated Voice Engine in a March 2024 preview, cloning voices from 15-second samples but holding back public release on safety grounds. F5-TTS from Shanghai Jiao Tong University in October 2024 used flow matching with a Diffusion Transformer. Kokoro by Hexgrad, released December 2024 with 82 million parameters under Apache 2.0, topped the TTS Arena leaderboard. CosyVoice 2 from Alibaba added streaming synthesis, and Sesame CSM in 2025 fused a Llama backbone with an audio decoder for conversational speech.
Most neural TTS pipelines consist of three layers:
Neural audio codec models such as Google's SoundStream (2021) and Meta's EnCodec (2022) replaced mel-spectrograms with discrete tokens produced by residual vector quantization. This enabled language-model-style TTS systems like VALL-E and Bark to operate on audio the way transformers operate on text.
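Residual vector quantization encodes a vector in stages: each codebook quantizes whatever error the previous stages left behind, so a short list of integer indices approximates the original vector. The sketch below is a minimal illustration with two tiny hand-written codebooks. The codebook values and function names are hypothetical, and real codecs learn their codebooks and apply them to encoder outputs rather than raw numbers.

```python
from typing import List

Vector = List[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, vec: Vector) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantize vec stage by stage, passing the residual to the next codebook."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> Vector:
    """Reconstruct by summing the selected entry from every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out


# Hypothetical codebooks: a coarse first stage and a finer second stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]]
codebooks = [coarse, fine]

codes = rvq_encode([1.25, 1.0], codebooks)
recon = rvq_decode(codes, codebooks)
print(codes, recon)  # [1, 1] [1.25, 1.0]
```

Each audio frame therefore becomes a small stack of token indices, one per codebook, which is the representation that codec language models predict.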
| Model | Release | Organization | Type | Notable feature |
|---|---|---|---|---|
| WaveNet | Sep 2016 | DeepMind | Autoregressive vocoder | First neural raw-audio generator |
| Tacotron 2 | Dec 2017 | Google | Seq2seq + WaveNet | MOS 4.53, near studio quality |
| Glow-TTS | May 2020 | KAIST / Kakao | Flow-based | Monotonic alignment search |
| FastSpeech 2 | Jun 2020 | Microsoft / Zhejiang | Non-autoregressive | Pitch and energy control |
| HiFi-GAN | Oct 2020 | Kakao | GAN vocoder | 168x real-time on V100 |
| VITS | Jun 2021 | KAIST | End-to-end VAE | First single-stage neural TTS |
| Tortoise TTS | Apr 2022 | James Betker | AR + diffusion | Open multi-voice cloning |
| VALL-E | Jan 2023 | Microsoft | Codec LM | 3-second voice cloning |
| NaturalSpeech 2 | Apr 2023 | Microsoft | Latent diffusion | Zero-shot singing |
| Bark | Apr 2023 | Suno | Codec LM | Laughter, music, sound effects |
| Voicebox | Jun 2023 | Meta | Flow matching | Speech infilling, multilingual |
| StyleTTS 2 | Jun 2023 | Columbia | Style diffusion + GAN | Matches human MOS on LJSpeech |
| XTTS-v2 | Nov 2023 | Coqui | AR codec | 17-language voice cloning |
| OpenVoice | Dec 2023 | MyShell / MIT | Two-stage | Style and accent control |
| Voice Engine | Mar 2024 | OpenAI | Proprietary | 15-second sample cloning |
| F5-TTS | Oct 2024 | Shanghai Jiao Tong | Flow matching DiT | No duration model needed |
| Kokoro | Dec 2024 | Hexgrad | StyleTTS variant | 82M params, Apache 2.0 |
| CosyVoice 2 | Dec 2024 | Alibaba | LLM-backbone TTS | Streaming, 9 languages |
| Sesame CSM | Mar 2025 | Sesame AI Labs | Llama + audio decoder | Conversational context |
| Vendor | Product | Specialty |
|---|---|---|
| ElevenLabs | Multilingual v2, Turbo, Flash | Premium voice cloning, dubbing |
| Google Cloud | Cloud Text-to-Speech, Studio voices | WaveNet and Neural2 voices |
| Amazon | Polly, Polly Neural | Long-form, generative voices |
| Microsoft Azure | Azure AI Speech, Custom Neural Voice | Brand voice cloning |
| Resemble AI | Resemble Clone, Detect | Voice cloning with detection |
| Play.ht | PlayHT 2.0, Play 3.0 | Conversational TTS |
| Murf AI | Murf Studio | Marketing voiceover |
| Synthesia | Synthesia Avatars | Avatar plus voice video |
| Hume AI | Octave, EVI | Emotional and expressive TTS |
| Speechify | Speechify Voices | Reading assistant, audiobook |
| Descript | Overdub | Podcast editing, voice clone |
| Dataset | Year | Description |
|---|---|---|
| LJSpeech | 2017 | 13,100 single-speaker English clips from 7 books, by Keith Ito |
| VCTK | 2012 | 110 English speakers, various accents, about 400 sentences each |
| LibriTTS | 2019 | 585 hours, 24 kHz, multi-speaker, derived from LibriSpeech |
| MLS | 2020 | Multilingual LibriSpeech, 50,000+ hours across 8 languages |
| Common Voice | 2017-present | Crowdsourced Mozilla corpus, 100+ languages |
| GigaSpeech | 2021 | 10,000 hours of transcribed English audio |
| TTS Arena | 2024 | Hugging Face head-to-head leaderboard |
Quality is measured both subjectively and automatically:
Flagship TTS systems in 2025 deliver outputs that listeners often cannot reliably distinguish from human recordings on short utterances. Common capabilities include:
TTS is now embedded across consumer and enterprise products. Major application areas:
Low-cost voice cloning has prompted concrete misuse cases. Reports of scam calls impersonating relatives, political robocalls (including a January 2024 New Hampshire robocall mimicking President Joe Biden), and synthetic deepfake audio used in extortion have drawn regulatory attention. Industry responses include:
Dataset consent is also disputed: many open TTS corpora were assembled from public audiobooks (LibriVox) or scraped audio, raising debates about whether voice talent must consent specifically to AI training.
Despite rapid progress, current TTS systems still exhibit: