Text-to-Speech Models

Updated Jun 22, 2026

Text-to-speech (TTS) models are machine learning systems that convert written text into spoken audio. The modern lineage runs from DeepMind's WaveNet (2016), the first neural model to generate raw audio sample by sample, through Google's Tacotron 2 (2017), which reached a mean opinion score (MOS) of 4.53 against 4.58 for professional human recordings, to the zero-shot voice-cloning systems of 2023 onward (Microsoft's VALL-E and later systems from ElevenLabs, Cartesia, and OpenAI) that can copy a voice from a few seconds of audio.¹,²,³ The broader task and its history are covered on the text-to-speech page; this article catalogs the notable models that implement it. The strongest 2025 systems, part of the wider field of generative AI, produce speech that listeners often cannot reliably distinguish from human recordings on short utterances.²,³

Modern TTS approaches use neural networks to generate speech that approximates human naturalness, voice timbre, and prosody. They are the inverse of speech recognition systems, which transcribe audio into text, and they sit alongside other audio models such as voice cloning, voice conversion, music generation, and audio enhancement. Typical TTS pipelines decompose synthesis into two stages: an acoustic model that turns text or phonemes into an intermediate representation such as a mel-spectrogram, and a vocoder that converts the intermediate into a waveform. Newer end-to-end systems collapse both stages into a single network or operate directly on discrete codes from a neural audio codec. Capabilities have advanced from robotic rule-based output in the 1980s to zero-shot voice cloning from a few seconds of reference audio in 2023 and 2024.

How did TTS models evolve?

Rule-based and formant synthesis

Early digital TTS used formant synthesis, generating speech with parametric models of the human vocal tract. DECtalk, introduced by Digital Equipment Corporation in 1984 based on Dennis Klatt's KlattTalk work at MIT, became the canonical example. It produced highly intelligible but mechanical-sounding speech and is recognizable as the voice used by physicist Stephen Hawking from the mid-1980s onward.

Concatenative TTS

In the 1990s and 2000s, concatenative synthesis stitched together short recorded units (diphones or sub-word fragments) from large speech databases. Unit selection, formalized by Hunt and Black in 1996, searched the database for the sequence of units that best matched the target utterance while joining smoothly to its neighbors. Systems such as ATR's nuu-talk and Festival produced more natural output than formant synthesis but required gigabytes of recorded speech and could not generalize beyond the recorded voice.
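The search behind unit selection can be sketched as a shortest-path problem: each target position has several candidate units, and the chosen sequence minimizes a sum of target costs (how well a unit fits its slot) and join costs (how smoothly adjacent units connect). The example below is a minimal illustration under stated assumptions, not a reconstruction of any historical system: each unit is reduced to a single pitch value, and the cost functions and the name `select_units` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    targets: Sequence[float],
    candidates: Sequence[Sequence[float]],
    target_cost: Callable[[float, float], float],
    join_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Choose one candidate per position with dynamic programming (Viterbi)."""
    # best[i][j]: lowest total cost of any path ending at candidate j of position i.
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back: List[List[int]] = [[-1] * len(candidates[0])]

    for i in range(1, len(targets)):
        row, pointers = [], []
        for cand in candidates[i]:
            # Cost of arriving at `cand` from each candidate of the previous position.
            arrivals = [
                best[i - 1][k] + join_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(arrivals)), key=arrivals.__getitem__)
            row.append(arrivals[k_min] + target_cost(targets[i], cand))
            pointers.append(k_min)
        best.append(row)
        back.append(pointers)

    # Trace the cheapest path backwards from the last position.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total


# Toy data: target pitches in Hz and two recorded candidates per position.
targets = [100.0, 110.0, 120.0]
candidates = [[98.0, 130.0], [111.0, 90.0], [119.0, 150.0]]
path, total = select_units(
    targets,
    candidates,
    target_cost=lambda t, c: abs(t - c),        # assumed cost: pitch mismatch
    join_cost=lambda a, b: 0.5 * abs(a - b),    # assumed cost: pitch jump at the join
)
print(path, total)
```

Real systems score units on many features (spectral shape, duration, phonetic context) and prune the candidate lists, but the dynamic-programming structure is the same.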

HMM-based statistical parametric TTS

Statistical parametric synthesis using hidden Markov models (HMMs) emerged through the HTS toolkit from Keiichi Tokuda's lab at Nagoya Institute of Technology, formalized by Heiga Zen, Tokuda, and Alan Black in 2009.⁴ HMM-based TTS generalized to new voices through speaker adaptation and dominated commercial deployments until 2016, though it sounded muffled compared with concatenative output.

Neural TTS

Neural TTS arrived with WaveNet from DeepMind in September 2016, an autoregressive model that generated raw audio one sample at a time (16,000 samples per second, with each 16-bit sample quantized to 256 levels by mu-law companding) using dilated causal convolutions.⁵ In blind listening tests using over 500 ratings on 100 test sentences, WaveNet scored a MOS above 4.0 and, in DeepMind's words, "reduces the gap between the state of the art and human-level performance by over 50% for both US English and Mandarin Chinese."⁵,⁶ Tacotron from Google in March 2017 introduced a sequence-to-sequence character-to-spectrogram model with attention.⁷ Tacotron 2 in December 2017 paired a similar attention-based encoder-decoder with a WaveNet-style vocoder and reached a MOS of 4.526 ± 0.066, against 4.582 ± 0.053 for professionally recorded studio speech.¹
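A dilated causal convolution, the building block named above, can be sketched in a few lines of NumPy. Each output sample depends only on the current and earlier inputs, and the dilation sets how far apart the filter taps reach into the past. This is an illustrative sketch rather than WaveNet itself: it omits the gated activations, residual connections, and output distribution, and the function names are chosen for this example.

```python
import numpy as np


def causal_dilated_conv(x: np.ndarray, weights, dilation: int) -> np.ndarray:
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], treating x[t < 0] as 0."""
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])  # left-pad so no future sample is read
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        start = pad - k * dilation               # tap k looks k * dilation samples back
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size: int, dilations) -> int:
    """Number of input samples that can influence one output of a stacked network."""
    return 1 + (kernel_size - 1) * sum(dilations)


impulse = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(causal_dilated_conv(impulse, [1.0, 0.5], dilation=2))

# Doubling the dilation at each layer grows the receptive field exponentially with depth.
print(receptive_field(2, [2 ** i for i in range(10)]))
```

With kernel size 2 and dilations 1, 2, 4, ..., 512, ten layers cover 1,024 past samples, which is why stacking such layers lets a sample-level model see a useful stretch of audio context at 16 kHz.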

FastSpeech from Microsoft and Zhejiang University in 2019 replaced autoregressive decoding with a parallel feed-forward Transformer, speeding up mel-spectrogram generation by roughly 270 times.⁸ FastSpeech 2 in 2020 added explicit pitch, energy, and duration conditioning.⁹ Glow-TTS by Kim et al. in 2020 introduced normalizing flows and monotonic alignment search to TTS,¹⁰ and VITS in 2021 unified the acoustic model and vocoder into a single end-to-end variational autoencoder with adversarial training.¹¹ HiFi-GAN from Kakao in 2020 became the dominant vocoder, generating 22.05 kHz audio about 168 times faster than real time on a V100 GPU.¹²
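Parallel decoding needs the phoneme sequence stretched to the length of the output spectrogram before any frame is generated. The FastSpeech papers do this with a length regulator that repeats each phoneme's hidden vector according to its predicted duration in frames. A minimal NumPy sketch follows, with array shapes and the function name chosen for illustration; in the real models the durations come from a learned predictor.

```python
import numpy as np


def length_regulate(phoneme_hidden: np.ndarray, durations) -> np.ndarray:
    """Repeat row i of phoneme_hidden durations[i] times along the time axis."""
    durations = np.asarray(durations, dtype=int)
    if len(durations) != len(phoneme_hidden):
        raise ValueError("need exactly one duration per phoneme")
    if np.any(durations < 0):
        raise ValueError("durations must be non-negative")
    return np.repeat(phoneme_hidden, durations, axis=0)


# Three phonemes with 2-dimensional hidden vectors, lasting 2, 0, and 3 frames.
hidden = np.arange(6, dtype=float).reshape(3, 2)
frames = length_regulate(hidden, [2, 0, 3])
print(frames.shape)
```

Scaling the durations before expansion is one simple way to change speaking rate, since every later stage just sees a longer or shorter frame sequence.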

What started the zero-shot voice cloning era?

Tortoise TTS by James Betker, released in 2022, combined a GPT-style autoregressive prior with a diffusion decoder and a contrastive language-voice transformer.¹³ VALL-E from Microsoft in January 2023 reframed TTS as a language modeling task over EnCodec tokens, learning from 60,000 hours of English audio and cloning unseen voices from a 3-second prompt.¹⁴ Microsoft reported that VALL-E "significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity."¹⁴ NaturalSpeech 2 in April 2023 applied latent diffusion to the continuous latent vectors of a neural audio codec, rather than to discrete tokens, and trained on 44,000 hours of speech and singing data.¹⁵ Bark from Suno was released in April 2023 under the MIT license, generating speech, music, sound effects, and nonverbal cues like laughter from text alone.¹⁶

Voicebox by Meta's FAIR lab in 2023 introduced flow-matching for speech infilling.¹⁷ StyleTTS 2 from Columbia in June 2023 used style diffusion with self-supervised models like HuBERT and WavLM.¹⁸ XTTS-v2 from Coqui in November 2023 brought open multilingual voice cloning across 17 languages.¹⁹ OpenVoice from MyShell and MIT added accent and emotion control.²⁰ OpenAI demonstrated Voice Engine in a March 2024 preview, cloning voices from 15-second samples; it remained a restricted preview limited to trusted partners through 2025 and into 2026, with no broad release, which OpenAI attributed to misuse concerns.²¹,²² F5-TTS from Shanghai Jiao Tong University in October 2024 used flow matching with a Diffusion Transformer.²³

Production neural codec and conversational systems (2024 to 2026)

By late 2024 and 2025, the frontier shifted toward streaming, conversational, and prompt-steerable systems, many shipped as commercial APIs. Cartesia Sonic, released in May 2024 by a team of Stanford researchers, is built on a state-space model (SSM) architecture rather than a Transformer; the company says this gives lower latency and better long-context memory for real-time use.²⁴ Kokoro by Hexgrad, released December 25, 2024 with 82 million parameters under Apache 2.0, is a compact StyleTTS-style model trained on fewer than 100 hours of permissive audio at a reported cost of roughly $1,000; it topped the Hugging Face TTS Spaces Arena despite its small size.²⁵ CosyVoice 2 from Alibaba added streaming synthesis,²⁶ and Sesame CSM-1B, released March 13, 2025 under Apache 2.0, fuses a Llama backbone with a smaller audio decoder that produces Mimi codec tokens; it powers Sesame's voice companions Maya and Miles.²⁷,²⁸

OpenAI gpt-4o-mini-tts, announced in March 2025, is a production text-to-speech model built on GPT-4o mini that is steerable by natural-language instructions (for example, asking for a calm or excited delivery), supports more than 50 languages, and is priced by OpenAI at roughly $0.015 per minute of generated audio; a December 2025 refresh reported about 35% lower word error rate on the Common Voice and FLEURS benchmarks.²⁹,³⁰ Nari Labs Dia-1.6B, released in April 2025 under Apache 2.0, is a 1.6 billion parameter open model focused on multi-speaker dialogue and nonverbal cues such as laughter, initially English only.³¹,³² MiniMax Speech-02, from the Chinese company MiniMax and described in a May 2025 paper, pairs an autoregressive Transformer with a learnable speaker encoder and Flow-VAE; MiniMax reported that its Speech-02-HD variant reached first place on the Artificial Analysis Speech Arena and the Hugging Face TTS Arena, ahead of OpenAI and ElevenLabs models.³,³³ ElevenLabs Eleven v3, announced June 5, 2025 in public alpha and reaching general availability in February 2026, supports more than 70 languages, inline audio tags such as [excited], [sighs], [laughing], and [whispers], and multi-speaker dialogue through a Text to Dialogue endpoint.³⁴ Cartesia Sonic-3, launched October 28, 2025 alongside a $100 million funding round, extends the SSM line with around 90 milliseconds of model latency (about 190 milliseconds end-to-end), support for 42 languages, and expressive cues such as laughter.²,³⁵
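Inline audio tags mix delivery cues with the words to be spoken, so tooling around such models often needs to tell the two apart, for example to count only spoken characters or to validate tags before sending a request. The sketch below is a hypothetical helper, not any vendor's parser: it assumes tags are lowercase words in square brackets, as in the examples above, and the name `split_audio_tags` is invented for this illustration.

```python
import re
from typing import List, Tuple

# Assumption: a tag is one or more lowercase words inside square brackets.
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def split_audio_tags(text: str) -> List[Tuple[str, str]]:
    """Split a script into ordered ("tag", name) and ("text", words) segments."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        spoken = text[pos:match.start()].strip()
        if spoken:
            segments.append(("text", spoken))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


script = "[excited] We shipped it! [laughing] Finally."
for kind, value in split_audio_tags(script):
    print(kind, value)
```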

What are the components of a TTS model?

Most neural TTS pipelines consist of three layers:

Text front-end: normalizes numbers, abbreviations, and punctuation; performs grapheme-to-phoneme conversion; and predicts prosodic structure.
Acoustic model: maps phoneme or character sequences to a mel-spectrogram, latent codes, or pitch and duration targets. Tacotron, FastSpeech, Glow-TTS, and StyleTTS are acoustic models.
Vocoder: converts the intermediate representation into a waveform. Autoregressive (WaveNet, WaveRNN), flow-based (WaveGlow), and adversarial (MelGAN, HiFi-GAN, BigVGAN) vocoders trade quality against speed.
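The first of these layers can be illustrated with a toy normalizer. The sketch below lowercases the input, expands a handful of abbreviations, and reads digits one at a time; the lookup tables are hand-written for this example, and production front-ends use much larger rule sets or learned models and handle dates, currencies, and context-dependent readings.

```python
import re

# Toy tables, hand-written for this example only.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "vs.": "versus"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def normalize(text: str) -> str:
    """Lowercase, expand known abbreviations, and spell out digits one by one."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token.isascii() and token.isdigit():
            # Simplification: "42" is read as "four two", not "forty-two".
            words.extend(DIGIT_WORDS[int(d)] for d in token)
        else:
            cleaned = re.sub(r"[^a-z']", "", token)  # drop punctuation
            if cleaned:
                words.append(cleaned)
    return " ".join(words)


print(normalize("Dr. Smith lives at 42 Elm St."))
```

The normalized words would then go to grapheme-to-phoneme conversion, and the resulting phoneme sequence to the acoustic model.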

Neural audio codec models such as Google's SoundStream (2021) and Meta's EnCodec (2022) replaced mel-spectrograms with discrete tokens produced by residual vector quantization.³⁶,³⁷ This enabled language-model-style TTS systems like VALL-E and Bark to operate on audio tokens the way text language models operate on text tokens. A parallel line of work replaces the Transformer backbone itself with state-space models (Cartesia Sonic), which the developers position as more efficient for streaming, low-latency synthesis.²⁴
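Residual vector quantization can be sketched as a loop: the first codebook picks the codeword nearest the input vector, the second codebook quantizes what is left over, and so on, so each frame becomes a short list of integer tokens. The example below uses tiny hand-made codebooks for illustration; real codecs learn their codebooks jointly with an encoder and decoder network, which this sketch leaves out.

```python
import numpy as np


def rvq_encode(x: np.ndarray, codebooks) -> list:
    """Return one codeword index per codebook, each quantizing the previous residual."""
    residual = np.asarray(x, dtype=float).copy()
    codes = []
    for codebook in codebooks:
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))   # nearest codeword to the current residual
        codes.append(index)
        residual = residual - codebook[index]
    return codes


def rvq_decode(codes, codebooks) -> np.ndarray:
    """Reconstruct the vector by summing the selected codeword from every stage."""
    return sum(codebook[i] for codebook, i in zip(codebooks, codes))


# Two illustrative stages: a coarse codebook followed by a fine one.
coarse = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
fine = np.array([[0.0, 0.0], [0.25, -0.1], [-0.25, 0.1]])
x = np.array([1.2, 0.9])

codes = rvq_encode(x, [coarse, fine])
approx = rvq_decode(codes, [coarse, fine])
print(codes, approx)
```

Using more stages lowers the reconstruction error at the price of more tokens per frame, which is the bitrate and quality trade-off that codec-based TTS systems inherit.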

Key technical innovations

Attention-based alignment (Tacotron, 2017) replaced hand-crafted forced alignment with learned soft attention between text and audio frames.
Non-autoregressive parallel synthesis (FastSpeech, Glow-TTS) cut inference time from seconds to milliseconds per utterance.
Normalizing flows (Glow-TTS, VITS, WaveGlow) enabled invertible generative modeling with exact likelihood.
Diffusion-based TTS (NaturalSpeech 2, Voicebox) brought iterative refinement to speech.
Neural codec language models (VALL-E, Bark, CosyVoice) cast synthesis as next-token prediction over discrete audio.
Zero-shot voice cloning from 3 to 15 seconds of reference audio became routine after 2023.
Flow matching (Voicebox, F5-TTS) offered faster sampling than score-based diffusion.
State-space sequence models (Cartesia Sonic) targeted ultra-low-latency streaming as an alternative to Transformers.
Instruction steerability (OpenAI gpt-4o-mini-tts, ElevenLabs v3 audio tags) let users direct tone, emotion, and delivery through natural-language prompts or inline tags.
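As a worked illustration of the flow-matching entry above, a simplified form of the training objective can be written out. The notation here is an assumption made for this sketch: x_0 is Gaussian noise, x_1 is a target speech representation such as a mel-spectrogram, c is the conditioning (text and any audio context), and v_θ is the learned velocity field. Published systems use variants of this form, for example with a small minimum-noise term in the path.

```latex
% Straight-line path between a noise sample and a data sample
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\text{data}},\quad t \sim \mathcal{U}[0, 1]

% Training: regress the velocity field onto the constant velocity of that path
\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2

% Sampling: integrate dx/dt = v_\theta(x, t, c) from t = 0 to t = 1,
% for example with N Euler steps
x_{t + 1/N} = x_t + \tfrac{1}{N}\, v_\theta(x_t, t, c)
```

Because the target paths are close to straight lines, the ordinary differential equation can often be integrated with relatively few steps.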

What are the most notable TTS models?

Model	Release	Organization	Type	Notable feature
WaveNet	Sep 2016	DeepMind	Autoregressive vocoder	First neural raw-audio generator
Tacotron 2	Dec 2017	Google	Seq2seq + WaveNet	MOS 4.53, near studio quality
Glow-TTS	May 2020	Kakao Enterprise / Seoul National University	Flow-based	Monotonic alignment search
FastSpeech 2	Jun 2020	Microsoft / Zhejiang	Non-autoregressive	Pitch and energy control
HiFi-GAN	Oct 2020	Kakao	GAN vocoder	168x real-time on V100
VITS	Jun 2021	Kakao Enterprise / KAIST	End-to-end VAE	Single-stage text-to-waveform training
Tortoise TTS	Apr 2022	James Betker	AR + diffusion	Open multi-voice cloning
VALL-E	Jan 2023	Microsoft	Codec LM	3-second voice cloning
NaturalSpeech 2	Apr 2023	Microsoft	Latent diffusion	Zero-shot singing
Bark	Apr 2023	Suno	Codec LM	Laughter, music, sound effects
Voicebox	Jun 2023	Meta	Flow matching	Speech infilling, multilingual
StyleTTS 2	Jun 2023	Columbia	Style diffusion + GAN	Matches human MOS on LJSpeech
XTTS-v2	Nov 2023	Coqui	AR codec	17-language voice cloning
OpenVoice	Dec 2023	MyShell / MIT	Two-stage	Style and accent control
Voice Engine	Mar 2024	OpenAI	Proprietary	15-second cloning, preview only
Cartesia Sonic	May 2024	Cartesia	State-space model	Low-latency streaming SSM
F5-TTS	Oct 2024	Shanghai Jiao Tong	Flow matching DiT	No duration model needed
Kokoro	Dec 2024	Hexgrad	StyleTTS variant	82M params, Apache 2.0
CosyVoice 2	Dec 2024	Alibaba	LLM-backbone TTS	Streaming, multilingual
Sesame CSM-1B	Mar 2025	Sesame	Llama + audio decoder	Conversational, powers Maya
gpt-4o-mini-tts	Mar 2025	OpenAI	Proprietary	Prompt-steerable delivery
Dia-1.6B	Apr 2025	Nari Labs	Codec LM	Multi-speaker dialogue, Apache 2.0
MiniMax Speech-02	May 2025	MiniMax	AR Transformer + Flow-VAE	Topped TTS Arena leaderboards
Eleven v3	Jun 2025	ElevenLabs	Proprietary	Audio tags, 70+ languages
Cartesia Sonic-3	Oct 2025	Cartesia	State-space model	~90 ms latency, 42 languages

Which vendors offer commercial TTS?

Vendor	Product	Specialty
ElevenLabs	Eleven v3, Multilingual v2, Turbo, Flash	Expressive voices, audio tags, dubbing
Cartesia	Sonic, Sonic-2, Sonic-3	Ultra-low-latency state-space voice AI
OpenAI	gpt-4o-mini-tts, Realtime API	Steerable speech, voice agents
Google Cloud	Cloud Text-to-Speech, Studio voices	WaveNet and Neural2 voices
Amazon	Polly, Polly Neural	Long-form, generative voices
Microsoft Azure	Azure AI Speech, Custom Neural Voice	Brand voice cloning
Resemble AI	Resemble Clone, Detect	Voice cloning with detection
Play.ht	PlayHT 2.0, Play 3.0	Conversational TTS
Murf AI	Murf Studio	Marketing voiceover
Synthesia	Synthesia Avatars	Avatar plus voice video
Hume AI	Octave, EVI	Emotional and expressive TTS
Speechify	Speechify Voices	Reading assistant, audiobook
Descript	Overdub	Podcast editing, voice clone

How is TTS quality measured?

Common training corpora and benchmarks used in TTS evaluation:

Dataset	Year	Description
LJSpeech	2017	13,100 single-speaker English clips from 7 books, by Keith Ito
VCTK	2012	110 English speakers, various accents, about 400 sentences each
LibriTTS	2019	585 hours, 24 kHz, multi-speaker, derived from LibriSpeech
MLS	2020	Multilingual LibriSpeech, 50,000+ hours across 8 languages
Common Voice	2017-present	Crowdsourced Mozilla corpus, 100+ languages
GigaSpeech	2021	10,000 hours of transcribed English audio
TTS Arena	2024	Hugging Face head-to-head leaderboard

Quality is measured both subjectively and automatically:

Mean Opinion Score (MOS): 1 to 5 listener rating, the most common subjective benchmark.
Comparative MOS (CMOS): side-by-side preference between two systems.
Word Error Rate (WER): synthesized audio is transcribed by a speech recognition system to test intelligibility.
Speaker Encoder Cosine Similarity (SECS): cosine similarity between speaker embeddings of the reference and the cloned audio; values closer to 1 indicate a closer voice match.
UTMOS and NISQA: neural networks that predict MOS automatically with high correlation to human judgments.
Time to First Audio (TTFA) and end-to-end latency: critical for streaming voice agents, where current low-latency systems report figures in the tens to low hundreds of milliseconds (Cartesia reports about 90 ms model latency and 190 ms end-to-end for Sonic-3).²
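Three of these metrics reduce to short formulas, sketched below in plain Python. The sketch assumes that the transcripts, speaker embeddings, and listener ratings already exist, since producing them requires a speech recognizer, a speaker encoder, and a listening test respectively. The 95% interval uses a normal approximation, which is one common convention rather than a fixed standard.

```python
import math
from typing import Sequence, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: edits needed to turn ref[:i] into hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[-1][-1] / len(ref)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Mean opinion score and the half-width of an approximate 95% interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    variance = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    return mean, z * math.sqrt(variance / n)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(mos_with_ci([4, 5, 4, 3, 5, 4, 4, 5]))
```

A MOS reported as a value plus or minus an interval, as in the Tacotron 2 figures earlier in the article, is this kind of mean with a confidence half-width.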

Head-to-head Elo leaderboards such as the Artificial Analysis Speech Arena and the Hugging Face TTS Arena became a common reference point in 2025; MiniMax reported its Speech-02-HD model topping both, ahead of OpenAI and ElevenLabs entries.³³
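The arithmetic behind an Elo-style leaderboard is small enough to show directly. The sketch below uses the classic chess formulation, with a 400-point scale and an update factor of 32; these constants are assumptions for illustration, and individual arenas may use different constants or fit ratings with a different statistical model.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (tie), or 0."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


# Two equally rated models; a listener prefers model A's clip.
print(update_elo(1500.0, 1500.0, score_a=1.0))
```

An upset moves ratings further than an expected result, so a new model that keeps beating highly rated systems climbs quickly.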

What can TTS models do?

Flagship TTS systems in 2025 deliver outputs that listeners often cannot reliably distinguish from human recordings on short utterances. Common capabilities include:

Zero-shot voice cloning from 3 to 15 seconds of reference audio.
Cross-lingual cloning, where a speaker recorded in English can be reproduced speaking Japanese or Spanish.
Prosody and emotion control, exposed as tags or style prompts in models such as Bark, Hume Octave, ElevenLabs v3 (inline audio tags), and OpenAI gpt-4o-mini-tts (natural-language instructions).²⁹,³⁴
Code-switching between languages within a single utterance.
Real-time streaming with low latency (tens to low hundreds of milliseconds), used in voice agents and powered by systems such as Cartesia Sonic.²
Singing voice synthesis through systems like DiffSinger and NaturalSpeech 2.
Long-form coherence for audiobooks and podcasts, with reference-aware chunking.
Multi-speaker dialogue generated in a single pass, with nonverbal cues, in models such as Dia and Sesame CSM.³¹
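Long-form synthesis usually starts by splitting the script into pieces that fit a model's input limit without cutting sentences in half. The sketch below shows one simple way to do that; the 200-character default and the sentence-splitting rule are assumptions for illustration, and it does not cover the reference-aware part, in which each chunk is conditioned on the same voice reference or on preceding audio to keep the delivery consistent.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two three. Four five six.", max_chars=15))
```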

What is TTS used for?

TTS is now embedded across consumer and enterprise products. Major application areas:

Virtual assistants including Alexa, Siri, and Google Assistant.
Conversational AI voice modes in ChatGPT, Gemini, and Claude.
Audiobook narration and self-published author tooling.
Video voiceover for marketing, e-learning, and short-form social content.
Podcast generation including auto-generated dialogue formats like NotebookLM Audio Overviews.
Accessibility through screen readers for users with vision impairments and speech-generating devices for users with speech impairments.
Localization and dubbing of films and games into additional languages while preserving the original speaker's voice.
In-car navigation and infotainment voices.
Game NPC dialogue including dynamic, runtime-generated lines.
Customer service through call-center voice agents.

What are the ethics and regulatory risks?

Low-cost voice cloning has prompted concrete misuse cases. Reports of scam calls impersonating relatives, political robocalls (including a January 2024 New Hampshire robocall mimicking President Joe Biden), and synthetic deepfake audio used in extortion have drawn regulatory attention. Industry responses include:

Watermarking of synthesized audio (used by OpenAI Voice Engine, Resemble Detect, and Google SynthID for audio).
Voice consent requirements for commercial cloning (ElevenLabs Voice Captcha, Resemble VocoderID).
Provenance standards such as C2PA for audio assets.
Regulation: the US Federal Communications Commission ruled in February 2024 that AI-voiced robocalls fall under the Telephone Consumer Protection Act.³⁸ California's AB 2602 (2024) requires explicit consent for digital replicas in performer contracts, and the EU AI Act classifies synthetic audio as content that must be disclosed.

Dataset consent is also disputed: many open TTS corpora were assembled from public audiobooks (LibriVox) or scraped audio, raising debates about whether voice talent must consent specifically to AI training.

What are the limitations of TTS models?

Despite rapid progress, current TTS systems still exhibit:

Narrow emotional range over long passages, especially for unscripted reactions.
Audio quality vs latency tradeoffs that force streaming systems to use smaller models.
Weak support for low-resource languages and dialects outside major commercial markets.
Codec artifacts (metallic timbre, popping) at low neural-codec bitrates.
Hallucinated stress or mispronunciation on rare proper nouns, numbers, and acronyms.
Degraded similarity when cloning unusual voices (children, elderly speakers, heavily accented English).
Limited control over background acoustics, microphone characteristics, and recording environment.

References

1. Shen, J., et al. (2017). Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. arXiv:1712.05884. https://arxiv.org/abs/1712.05884 Accessed 2026-05-31.
2. Cartesia. (2025). Sonic-3. Cartesia Documentation. https://docs.cartesia.ai/build-with-cartesia/tts-models/latest Accessed 2026-05-31.
3. Zhang, B., et al. (2025). MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder. arXiv:2505.07916. https://arxiv.org/abs/2505.07916 Accessed 2026-05-31.
4. Zen, H., Tokuda, K., and Black, A. W. (2009). Statistical Parametric Speech Synthesis. Speech Communication, 51(11). https://www.sciencedirect.com/science/article/abs/pii/S0167639309000648 Accessed 2026-05-31.
5. Oord, A. van den, et al. (2016). WaveNet: A Generative Model for Raw Audio. arXiv:1609.03499. https://arxiv.org/abs/1609.03499 Accessed 2026-05-31.
6. DeepMind. (2016). WaveNet: A generative model for raw audio. DeepMind Blog. https://deepmind.google/blog/wavenet-a-generative-model-for-raw-audio/ Accessed 2026-06-22.
7. Wang, Y., et al. (2017). Tacotron: Towards End-to-End Speech Synthesis. arXiv:1703.10135. https://arxiv.org/abs/1703.10135 Accessed 2026-05-31.
8. Ren, Y., et al. (2019). FastSpeech: Fast, Robust and Controllable Text to Speech. arXiv:1905.09263. https://arxiv.org/abs/1905.09263 Accessed 2026-05-31.
9. Ren, Y., et al. (2020). FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv:2006.04558. https://arxiv.org/abs/2006.04558 Accessed 2026-05-31.
10. Kim, J., Kim, S., Kong, J., and Yoon, S. (2020). Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search. arXiv:2005.11129. https://arxiv.org/abs/2005.11129 Accessed 2026-05-31.
11. Kim, J., Kong, J., and Son, J. (2021). Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS). arXiv:2106.06103. https://arxiv.org/abs/2106.06103 Accessed 2026-05-31.
12. Kong, J., Kim, J., and Bae, J. (2020). HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. arXiv:2010.05646. https://arxiv.org/abs/2010.05646 Accessed 2026-05-31.
13. Betker, J. (2023). Better Speech Synthesis through Scaling (Tortoise TTS). arXiv:2305.07243. https://arxiv.org/abs/2305.07243 Accessed 2026-05-31.
14. Wang, C., et al. (2023). Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E). arXiv:2301.02111. https://arxiv.org/abs/2301.02111 Accessed 2026-05-31.
15. Shen, K., et al. (2023). NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers. arXiv:2304.09116. https://arxiv.org/abs/2304.09116 Accessed 2026-05-31.
16. Suno AI. (2023). Bark: Text-Prompted Generative Audio Model. GitHub. https://github.com/suno-ai/bark Accessed 2026-05-31.
17. Le, M., et al. (2023). Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale. arXiv:2306.15687. https://arxiv.org/abs/2306.15687 Accessed 2026-05-31.
18. Li, Y. A., et al. (2023). StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv:2306.07691. https://arxiv.org/abs/2306.07691 Accessed 2026-05-31.
19. Coqui. (2023). XTTS: Open Model for Multilingual Voice Cloning. Hugging Face. https://huggingface.co/coqui/XTTS-v2 Accessed 2026-05-31.
20. Qin, Z., et al. (2023). OpenVoice: Versatile Instant Voice Cloning. arXiv:2312.01479. https://arxiv.org/abs/2312.01479 Accessed 2026-05-31.
21. OpenAI. (2024). Navigating the Challenges and Opportunities of Synthetic Voices. OpenAI Blog. https://openai.com/index/navigating-the-challenges-and-opportunities-of-synthetic-voices/ Accessed 2026-05-31.
22. Wiggers, K. (2025). A year later, OpenAI still hasn't released its voice cloning tool. TechCrunch. https://techcrunch.com/2025/03/06/a-year-later-openai-still-hasnt-released-its-voice-cloning-tool/ Accessed 2026-05-31.
23. Chen, Y., et al. (2024). F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching. arXiv:2410.06885. https://arxiv.org/abs/2410.06885 Accessed 2026-05-31.
24. Cartesia. (2024). Announcing Sonic: a low-latency voice model for lifelike speech. Cartesia Blog. https://cartesia.ai/blog/sonic Accessed 2026-05-31.
25. Hexgrad. (2024). Kokoro-82M. Hugging Face. https://huggingface.co/hexgrad/Kokoro-82M Accessed 2026-05-31.
26. Du, Z., et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models. arXiv:2412.10117. https://arxiv.org/abs/2412.10117 Accessed 2026-05-31.
27. Sesame AI Labs. (2025). CSM: A Conversational Speech Generation Model. GitHub. https://github.com/SesameAILabs/csm Accessed 2026-05-31.
28. Wiggers, K. (2025). Sesame, the startup behind the viral virtual assistant Maya, releases its base AI model. TechCrunch. https://techcrunch.com/2025/03/13/sesame-the-startup-behind-the-viral-virtual-assistant-maya-releases-its-base-ai-model/ Accessed 2026-05-31.
29. OpenAI. (2025). Introducing next-generation audio models in the API. OpenAI Blog. https://openai.com/index/introducing-our-next-generation-audio-models/ Accessed 2026-05-31.
30. OpenAI. (2025). Text to speech. OpenAI API documentation. https://platform.openai.com/docs/guides/text-to-speech Accessed 2026-05-31.
31. Nari Labs. (2025). Dia-1.6B. Hugging Face. https://huggingface.co/nari-labs/Dia-1.6B Accessed 2026-05-31.
32. Wiggers, K. (2025). A new, open source text-to-speech model called Dia has arrived to challenge ElevenLabs, OpenAI and more. VentureBeat. https://venturebeat.com/ai/a-new-open-source-text-to-speech-model-called-dia-has-arrived-to-challenge-elevenlabs-openai-and-more Accessed 2026-05-31.
33. MiniMax. (2025). MiniMax Speech 02: Pioneering a New Era of AI Speech Generation. MiniMax News. https://www.minimax.io/news/minimax-speech-02 Accessed 2026-05-31.
34. ElevenLabs. (2025). Eleven v3: Most Expressive AI TTS Model. ElevenLabs Blog. https://elevenlabs.io/blog/eleven-v3 Accessed 2026-05-31.
35. Cartesia. (2025). Cartesia raises $100M and launches Sonic-3. Cartesia Blog. https://cartesia.ai/blog Accessed 2026-06-22.
36. Zeghidour, N., et al. (2021). SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312. https://arxiv.org/abs/2107.03312 Accessed 2026-05-31.
37. Defossez, A., et al. (2022). High Fidelity Neural Audio Compression (EnCodec). arXiv:2210.13438. https://arxiv.org/abs/2210.13438 Accessed 2026-05-31.
38. US Federal Communications Commission. (2024). FCC Makes AI-Generated Voices in Robocalls Illegal. FCC. https://www.fcc.gov/document/fcc-makes-ai-generated-voices-robocalls-illegal Accessed 2026-05-31.
