Text-to-speech (TTS) refers to artificial intelligence systems that convert written text into natural-sounding spoken audio. Modern TTS systems use deep learning models to produce speech that closely mimics human vocal patterns, including natural intonation, rhythm, emphasis, pauses, and emotional expression. The technology has evolved from robotic, clearly synthetic voices in the 1960s to neural models in the 2020s that can generate speech virtually indistinguishable from human recordings.
TTS is a foundational technology for a wide range of applications, including accessibility tools for visually impaired users, virtual assistants like Siri and Alexa, audiobook production, video game dialogue, customer service automation, content creation, and real-time dubbing of video and audio content across languages. The global TTS market was valued at approximately $3.2-3.7 billion in 2025 and is projected to grow to over $11 billion by 2034-2035, driven by expanding adoption in enterprise, media, and consumer applications [1].
The field has been shaped by several breakthrough moments: the introduction of concatenative synthesis in the 1990s, statistical parametric synthesis in the 2000s, WaveNet by DeepMind in 2016, Tacotron by Google in 2017, and the emergence of commercial neural TTS platforms like ElevenLabs starting in 2022. Today, the leading TTS systems achieve Mean Opinion Scores (MOS) above 4.5 on a 5-point scale, approaching or matching the perceived quality of natural human speech.
The history of text-to-speech technology spans over six decades, progressing through several distinct technological paradigms before reaching the current era of neural synthesis.
The earliest TTS systems used formant synthesis, which generated speech by modeling the resonant frequencies (formants) of the human vocal tract using electronic circuits or mathematical rules. These systems produced intelligible but highly robotic-sounding speech. Notable early systems include:
| System | Year | Developer | Approach |
|---|---|---|---|
| IBM Shoebox | 1961 | IBM | Speech recognition (16 words), not synthesis; an early milestone often cited alongside TTS |
| Klattalk | 1983 | Dennis Klatt (MIT) | Rule-based formant synthesis, foundation for DECtalk |
| DECtalk | 1984 | Digital Equipment Corporation | Commercial formant synthesizer; used by Stephen Hawking |
| Prose 2000 | 1980s | Centigram Communications | One of the first commercial TTS products |
Dennis Klatt's work at MIT was particularly influential. His Klattalk system, and its commercial derivative DECtalk, became the most recognizable synthetic voice of the 1980s and 1990s. Physicist Stephen Hawking used a DECtalk voice synthesizer from 1986 until his death in 2018, making it one of the most famous synthetic voices in history [2]. These rule-based systems worked by applying linguistic rules to convert text into phoneme sequences, then generating audio waveforms based on mathematical models of vocal tract acoustics.
Concatenative synthesis represented a major shift in approach. Rather than generating speech from mathematical rules, concatenative systems assembled speech from pre-recorded snippets of human voice. A voice actor would record a large corpus of speech covering a wide range of phoneme combinations, and the system would select and concatenate the most appropriate segments for each word or phrase.
The two main variants were unit selection synthesis, which chose the best-matching units from a large database, and diphone synthesis, which used smaller units (pairs of adjacent phonemes) for more complete coverage. AT&T's Natural Voices and the Festival Speech Synthesis System (developed at the University of Edinburgh) were prominent examples.
Concatenative synthesis produced more natural-sounding speech than formant synthesis, but it had significant limitations. The output quality depended heavily on the size and quality of the recorded database. Transitions between concatenated segments often produced audible artifacts (clicks, pitch discontinuities, or unnatural joins). And creating a new voice required recording an entirely new database, a process that could take weeks of studio time and cost tens of thousands of dollars.
Statistical parametric synthesis emerged in the 2000s as a middle ground between the flexibility of rule-based systems and the naturalness of concatenative approaches. These systems used statistical models, primarily Hidden Markov Models (HMMs), to learn the relationship between linguistic features (phonemes, stress patterns, prosody) and acoustic parameters (spectral features, pitch, duration) from a training corpus [3].
During synthesis, the statistical model generated a sequence of acoustic parameters that were then converted into audio using a vocoder. The most influential system in this paradigm was HTS (HMM-based Speech Synthesis System), developed by a consortium of Japanese and British universities. HTS could produce smooth, consistent speech and could adapt to new voices with relatively small amounts of training data.
However, statistical parametric speech had a characteristic "muffled" or "buzzy" quality caused by the vocoder, which could not faithfully reconstruct the full complexity of natural speech from the compact parametric representation. Listeners could reliably tell they were hearing synthesized speech.
The landscape changed dramatically in September 2016 when Google DeepMind published WaveNet, a deep neural network that generated raw audio waveforms one sample at a time [4]. WaveNet was built on a dilated causal convolutional architecture that could model long-range dependencies in audio signals while maintaining the autoregressive property (each sample was predicted based on all preceding samples).
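The receptive field of a stack of dilated convolutions grows exponentially with depth, which is what lets WaveNet condition each sample on hundreds of milliseconds of prior audio. A minimal sketch of that arithmetic (the layer counts below are illustrative, not WaveNet's exact published configuration):

```python
def receptive_field(dilations, kernel_size=2):
    """Receptive field (in samples) of stacked dilated causal convolutions.

    Each layer with dilation d and kernel size k extends the receptive
    field by (k - 1) * d samples.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# One WaveNet-style stack: dilations doubling from 1 to 512.
stack = [2 ** i for i in range(10)]
print(receptive_field(stack))        # 1024 samples from a single stack

# Repeating the stack three times (a common pattern) at 16 kHz:
samples = receptive_field(stack * 3)
print(samples, samples / 16000)      # 3070 samples, ~0.19 s of context
```

Doubling the dilation at each layer is what makes this efficient: context grows exponentially while the parameter count grows only linearly with depth.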
WaveNet's results were striking. In listening tests, WaveNet-generated speech narrowed the gap between synthetic and natural human speech by over 50% compared to the best existing TTS systems, for both American English and Mandarin Chinese [4]. The model produced speech with natural-sounding intonation, appropriate emphasis, and realistic vocal qualities that previous systems could not achieve.
The original WaveNet was too computationally expensive for real-time use, generating audio far slower than real time because it had to produce each of the 24,000 samples per second of audio sequentially. DeepMind addressed this through several optimizations. Using a technique called knowledge distillation, a student model was trained to mimic WaveNet's outputs in a parallel (non-autoregressive) fashion, achieving a 1,000-fold speedup. The optimized model could generate one second of speech in just 50 milliseconds [5]. DeepMind also developed WaveRNN, a simpler and more efficient recurrent model that could run on mobile devices rather than requiring data center compute.
In 2018, Google integrated WaveNet voices into Google Cloud Text-to-Speech, making neural TTS commercially available for the first time [5].
While WaveNet solved the problem of generating high-quality audio from acoustic features, it still required a separate system to convert text into those features. Google's Tacotron, published in March 2017, addressed this by creating an end-to-end model that could learn the entire mapping from text characters to spectrograms without hand-engineered linguistic features [6].
Tacotron used a sequence-to-sequence architecture with an attention mechanism to align input text with output spectrogram frames. The model learned to handle pronunciation, timing, emphasis, and intonation directly from paired text-audio training data, eliminating the need for explicit phoneme dictionaries, prosody models, or linguistic feature extraction pipelines that previous systems required.
Tacotron 2, published in December 2017, combined an improved sequence-to-sequence model with a WaveNet vocoder, achieving audio quality that listeners rated as comparable to natural speech in many cases [7]. The Tacotron architecture became the foundation for a generation of end-to-end TTS systems.
Since Tacotron 2, the TTS field has seen rapid progress along several lines:
Non-autoregressive models like FastSpeech (2019) and FastSpeech 2 (2020) replaced the sequential generation process with parallel generation, dramatically reducing inference time while maintaining quality. These models could generate complete utterances in a single forward pass, enabling real-time synthesis on modest hardware [8].
Transformer-based architectures adapted the transformer model (originally developed for natural language processing) to TTS. Models like SpeechT5 (Microsoft, 2022) and VALL-E (Microsoft, 2023) used large-scale transformer architectures trained on massive speech datasets, achieving new levels of naturalness and enabling zero-shot voice cloning [9].
Diffusion-based TTS models like Grad-TTS and DiffSpeech applied diffusion models (the same family of models behind AI image generation) to speech synthesis. These models generate spectrograms through an iterative denoising process, often producing exceptionally smooth and natural output.
Large-scale commercial platforms like ElevenLabs (founded 2022), OpenAI's TTS API (2023), and numerous others brought neural TTS to mainstream audiences, making high-quality voice synthesis accessible through simple APIs and web interfaces.
Contemporary TTS systems employ several distinct architectural paradigms, often combining elements from multiple approaches.
Autoregressive TTS models generate audio sequentially, predicting each element (whether an audio sample, a spectrogram frame, or a discrete audio token) based on all previous elements. WaveNet and Tacotron are classic autoregressive models. More recent examples include VALL-E (Microsoft, 2023), which treats TTS as a language modeling task: it encodes speech into discrete tokens using a neural audio codec and then uses a transformer to predict these tokens autoregressively, conditioned on the input text and a short audio prompt of the target voice [9].
Autoregressive models tend to produce highly natural output because each generation step is informed by the full history of what has been generated so far. However, their sequential nature makes them inherently slower than parallel methods, and they can suffer from error accumulation (where early mistakes in generation propagate through the rest of the utterance).
Non-autoregressive models generate the entire output in parallel (or in a fixed small number of steps), trading some quality for dramatically faster inference. FastSpeech and FastSpeech 2 are foundational non-autoregressive TTS models that use a feed-forward transformer to predict the entire mel-spectrogram at once [8]. These models typically require a duration predictor to determine how long each phoneme should be, since the alignment between text and audio must be determined upfront rather than learned through attention during sequential generation.
Non-autoregressive models are well-suited to real-time applications (virtual assistants, live narration, phone systems) where latency is critical.
Diffusion-based TTS models generate speech by starting with random noise and iteratively denoising it into a clean spectrogram, conditioned on the input text. Grad-TTS (2021) was one of the first diffusion-based TTS models, using a score-based diffusion process to generate high-quality mel-spectrograms [10]. The approach produces smooth, artifact-free output and offers natural control over the diversity of generated speech through the number of diffusion steps.
More recent models like NaturalSpeech 2 (Microsoft, 2023) and NaturalSpeech 3 (2024) use latent diffusion for TTS, operating in a compressed representation space for efficiency. These models have achieved MOS scores that match or exceed natural speech recordings in some evaluations.
Regardless of the acoustic model architecture, most TTS systems require a vocoder to convert the generated mel-spectrogram (or other acoustic representation) into a raw audio waveform. Neural vocoders have progressed through several generations:
| Vocoder | Year | Type | Key Innovation |
|---|---|---|---|
| WaveNet | 2016 | Autoregressive | First neural vocoder; sample-by-sample generation |
| WaveRNN | 2018 | Autoregressive | Efficient single-layer RNN; mobile deployment |
| WaveGlow | 2018 | Flow-based | Parallel generation using normalizing flows |
| HiFi-GAN | 2020 | GAN-based | Fast, high-fidelity; widely adopted in production |
| BigVGAN | 2022 | GAN-based | Improved generalization across speakers and conditions |
| Vocos | 2023 | CNN-based | Lightweight, fast; used in many modern TTS systems |
HiFi-GAN, in particular, has become a standard component in many TTS pipelines due to its combination of high audio quality and fast inference speed [11].
The following table summarizes the major TTS platforms available as of early 2026.
| Platform | Developer | Key Features | Voices/Languages | Pricing (approx.) |
|---|---|---|---|---|
| ElevenLabs | ElevenLabs | Voice cloning, 1,200+ voices, Eleven v3, multilingual | 70+ languages | Free tier; Starter $5/mo; Scale $99/mo |
| OpenAI TTS | OpenAI | gpt-4o-mini-tts, steerability, low latency | 13 voices, multilingual | $15/1M input tokens |
| Google Cloud TTS | Google | WaveNet, Neural2, Studio, Chirp 3 HD voices | 50+ languages, ~300 voices | Free tier; $4-30/1M chars |
| Amazon Polly | Amazon (AWS) | Standard, Neural, Generative voices, SSML support | 60+ languages | $4-19.20/1M chars |
| Microsoft Azure Speech | Microsoft | Custom Neural Voice, emotional styles, SSML | 140+ languages, 400+ voices | Free tier; $16-24/1M chars |
| Bark | Suno | Open-source, non-verbal sounds (laughs, sighs), music | Multilingual | Free (open-source) |
| Coqui XTTS | Coqui (community) | Open-source, zero-shot voice cloning, multilingual | 17 languages | Free (open-source) |
ElevenLabs has emerged as the leading dedicated TTS platform, known for producing the most natural and expressive synthetic speech commercially available. Founded in 2022 by Piotr Dabkowski and Mati Staniszewski, the company reached a $3.3 billion valuation after a $180 million Series C round in January 2025, and then raised an additional $500 million at an $11 billion valuation in February 2026, led by Sequoia Capital [12].
ElevenLabs' platform offers over 1,200 pre-built voices, professional voice cloning from as little as one minute of sample audio, and a Voice Design tool for creating entirely new synthetic voices. The company's Eleven v3 model, released in June 2025, supports more than 70 languages, natural multi-speaker dialogue, and audio tags that control expression (such as [excited], [whispers], and [sighs]) [13]. In independent benchmarks, ElevenLabs achieved the lowest word error rate at 2.83%, a hallucination rate of 5%, and superior scores in context awareness (63.37%) and prosody accuracy (64.57%) compared to competitors [14].
The company reported over $330 million in annual recurring revenue at the end of 2025, driven by enterprise customers including Deutsche Telekom and Revolut. ElevenLabs has expanded beyond TTS into a broader audio AI platform with 14 products including dubbing, sound effects generation, conversational AI agents, and transcription (Scribe v2).
OpenAI's TTS offering integrates speech synthesis into its broader AI ecosystem. The gpt-4o-mini-tts model, released in 2025, emphasizes steerability, allowing users to control pitch, speed, and emotional delivery through natural language instructions [14]. While OpenAI offers only 13 voices (compared to ElevenLabs' 1,200+), its pricing is significantly lower (roughly one-twelfth the cost per unit of audio), making it attractive for high-volume applications.
In benchmark comparisons, OpenAI's pronunciation accuracy sits at 77.30% versus ElevenLabs' 81.97%, and its hallucination rate is 10% versus ElevenLabs' 5%. OpenAI's strength lies in its integration with the GPT ecosystem, allowing developers to combine language understanding, reasoning, and speech generation in a single API call.
Google Cloud TTS offers a tiered system of voice models reflecting the evolution of TTS technology. Standard voices use older concatenative/parametric methods. WaveNet voices, powered by DeepMind's technology, provide significantly more natural output. Neural2 voices use Custom Voice technology for improved pronunciation and intonation. Studio voices offer professional-grade quality. And the newest Chirp 3 HD voices represent Google's latest neural TTS capabilities [15].
The platform supports over 50 languages with approximately 300 voices and offers extensive SSML (Speech Synthesis Markup Language) support for fine-grained control over pronunciation, pauses, emphasis, and speaking rate. A free tier provides 1 million characters per month for WaveNet voices and 4 million for Standard voices.
Amazon Polly is Amazon Web Services' TTS service, offering Standard, Neural, Long-form, and Generative voice types across 60+ languages [16]. Polly integrates with other AWS services and is commonly used in enterprise applications, IVR (Interactive Voice Response) systems, and IoT devices. Standard voices are priced at $4.80 per million characters, while Neural voices cost $19.20 per million characters. Polly's strength is its reliability, scalability, and deep integration with the AWS ecosystem rather than cutting-edge voice quality.
Microsoft Azure Speech Service offers one of the largest selections of voices, with over 400 neural voices across 140+ languages [17]. A distinguishing feature is Custom Neural Voice, which allows enterprises to create a unique branded voice trained on their own audio data. Azure Speech supports emotional speaking styles (cheerful, angry, sad, and others) and fine-grained SSML controls. The platform's Dragon HD Omni model represents Microsoft's latest-generation TTS technology.
Bark, developed by Suno (the same company behind the AI music platform), is an open-source text-to-audio model that can generate not only speech but also non-verbal sounds like laughter, sighs, music, and ambient noise [18]. Bark's ability to produce expressive, contextual audio makes it unique among open-source options, though its output quality trails commercial platforms.
Coqui XTTS (and its successor XTTS v2) is a major open-source TTS model that supports zero-shot voice cloning across 17 languages [19]. It requires only a few seconds of reference audio to clone a voice and can be deployed locally on consumer hardware. While Coqui as a company shut down operations, the open-source community has continued maintaining and improving the models.
Voice cloning is the ability to replicate a specific person's voice, including their unique timbre, accent, speech patterns, and vocal characteristics, so that the TTS system can generate new speech that sounds like that person. Voice cloning has become one of the most commercially significant and ethically contentious capabilities of modern TTS.
Early approaches to voice cloning required extensive recordings of the target speaker (typically 10-30 hours of studio-quality audio) to train a speaker-specific model. This process was expensive, time-consuming, and limited to professional applications. The resulting models could only produce speech in the language and style represented in the training data.
Few-shot voice cloning reduced the data requirement to a few minutes of audio. Systems like Microsoft's Custom Neural Voice and ElevenLabs' Professional Voice Clone use fine-tuning techniques to adapt a pre-trained multi-speaker model to a new voice using a small amount of target audio (typically 1-30 minutes). The quality improves with more data, but even a few minutes can produce a recognizable clone.
Zero-shot voice cloning represents the current frontier of the technology. These systems can replicate a voice from just 3-10 seconds of reference audio, without any fine-tuning or additional training [9]. The model extracts a speaker embedding from the short audio sample and uses it to condition the generation process, producing speech in the target voice for any input text.
Microsoft's VALL-E (2023) demonstrated this capability using a neural codec language model approach. Given a 3-second audio sample, VALL-E could generate speech that preserved the speaker's voice characteristics, emotional tone, and even the acoustic environment of the recording [9]. More recent models like XTTS v2, CosyVoice 2, and GLM-TTS have pushed zero-shot cloning quality further, with CosyVoice 2 reporting a MOS of 5.53 (a figure above the conventional 5-point ceiling, indicating a different rating scale in that evaluation) and GLM-TTS reaching a character error rate of 0.89 with reinforcement learning optimization [20].
Zero-shot voice cloning works across languages in many modern systems. A voice sampled from English speech can be used to generate speech in French, Japanese, or Mandarin, with the system maintaining the speaker's vocal identity while producing phonetically correct speech in the target language.
TTS technology serves a broad spectrum of use cases across industries and contexts.
TTS is a critical assistive technology for people who are blind or have low vision, enabling them to access written content through screen readers, navigation systems, and document readers. It also serves individuals with reading disabilities such as dyslexia, motor impairments that make reading physical text difficult, and speech disabilities (allowing them to communicate using a synthetic voice). The DECtalk voice used by Stephen Hawking is perhaps the most famous example of TTS as an assistive communication tool.
Modern accessibility applications use neural TTS to provide a more comfortable listening experience than the robotic voices of earlier screen readers, reducing listener fatigue during extended use.
The audiobook market was valued at approximately $7.9-11.2 billion in 2025 and is projected to grow significantly through the next decade [21]. AI TTS has transformed audiobook production by dramatically reducing the time and cost of creating narrated versions of books. A professional human narrator typically takes 2-4 hours to record one finished hour of audio, not counting editing and post-production. AI can generate an entire audiobook in minutes.
Apple launched AI-narrated audiobooks in 2023, and Amazon's Audible has experimented with AI voices for selected titles. ElevenLabs' Reader app and similar tools allow users to generate audiobook-quality narration from any text. Publishers are increasingly using AI TTS for backlist titles that would not justify the cost of human narration, while reserving human narrators for frontlist and premium titles.
AI dubbing uses TTS combined with voice cloning to translate and re-voice video content across languages while preserving the original speaker's voice characteristics. ElevenLabs offers an automated dubbing product that can translate video content into 70+ languages, maintaining the speaker's voice identity and synchronizing lip movements. This application is transforming media localization, which has traditionally been expensive and time-consuming.
The global dubbing and voice-over market was valued at $5.8 billion in 2025 and is projected to reach $9.67 billion by 2033 [22]. AI is expected to capture an increasing share of this market, particularly for content like corporate training, e-learning, and social media, where the premium quality of human dubbing may not justify the cost.
Virtual assistants like Apple's Siri, Amazon's Alexa, and Google Assistant rely on TTS to communicate with users. The shift from concatenative to neural TTS voices has made these assistants sound noticeably more natural in recent years. Google integrated WaveNet voices into Google Assistant in 2018, and other assistant platforms have followed with their own neural TTS upgrades.
AI-powered customer service systems use TTS to provide spoken responses in call centers, IVR (Interactive Voice Response) systems, and voice-based chatbots. The combination of large language models for understanding and generating responses with neural TTS for delivering those responses has enabled increasingly natural automated customer interactions. ElevenLabs' Conversational AI product and similar offerings allow businesses to deploy AI voice agents that can handle customer inquiries with human-like speech.
Content creators use TTS for narrating videos, producing podcasts, creating voiceovers for presentations, and generating audio versions of written content. The availability of high-quality, affordable TTS has enabled solo creators and small teams to produce professional-sounding audio content that previously required hiring voice actors.
TTS supports educational applications including reading assistance for children, language learning tools (where learners can hear correct pronunciation), and narration of e-learning courses. The ability to generate speech in multiple languages and accents is particularly valuable for language education.
The quality of TTS systems is evaluated using both subjective and objective measures.
Mean Opinion Score (MOS) is the most widely used subjective metric for TTS quality. Human listeners rate speech samples on a scale from 1 (bad) to 5 (excellent) across dimensions including naturalness, clarity, and intelligibility. Natural human speech typically receives MOS scores between 4.0 and 4.5 in controlled evaluations (it rarely achieves a perfect 5.0 due to recording conditions and listener variability). State-of-the-art neural TTS systems now regularly achieve MOS scores of 4.3 to 4.7, overlapping with or exceeding the range for natural speech [20].
Similarity Mean Opinion Score (SMOS) specifically measures how closely a synthetic voice matches a target speaker, rated on the same 1-5 scale. This metric is particularly important for evaluating voice cloning systems.
| Metric | What It Measures | Ideal Direction |
|---|---|---|
| Word Error Rate (WER) | Accuracy of pronounced words (via ASR) | Lower is better |
| Character Error Rate (CER) | Accuracy at the character level | Lower is better |
| Speaker Embedding Cosine Similarity (SECS) | How closely the generated voice matches the target speaker | Higher is better |
| Mel Cepstral Distortion (MCD) | Spectral distance between synthetic and natural speech | Lower is better |
| F0 RMSE | Pitch accuracy compared to reference | Lower is better |
| Real-Time Factor (RTF) | Generation speed relative to audio duration | Lower is better (< 1 for real-time) |
ElevenLabs achieved a word error rate of 2.83% in independent evaluations, compared to OpenAI's higher error rate, while also demonstrating superior prosody accuracy at 64.57% versus OpenAI's 45.83% [14].
The capabilities of modern TTS, particularly voice cloning, have raised significant safety and ethical concerns.
Voice cloning technology has made it possible to generate convincing fake audio of real people saying things they never actually said. These voice deepfakes can be used for fraud, identity theft, political manipulation, and harassment. Cybersecurity firm DeepStrike estimated an increase from roughly 500,000 online deepfakes in 2023 to approximately 8 million in 2025, with annual growth nearing 900% [23].
Voice cloning has crossed what researchers call the "indistinguishable threshold," meaning that a few seconds of audio now suffice to generate a clone with natural intonation, rhythm, emphasis, emotion, pauses, and breathing sounds that most listeners cannot distinguish from the real person [24]. This capability has been exploited in several high-profile fraud cases, including a 2019 incident where criminals used AI-generated voice to impersonate a CEO and authorize a fraudulent wire transfer of $243,000.
AI-generated voices are increasingly used in phone scams, where the caller impersonates a family member, bank representative, or authority figure. Some major retailers report receiving over 1,000 AI-generated scam calls per day [23]. The FBI and FTC have issued warnings about AI voice scams, particularly those targeting elderly individuals with fake emergency calls purporting to be from relatives.
Deepfake voices can be weaponized in politics, where fabricated speeches or manipulated audio can spread disinformation and distort public opinion. In January 2024, a robocall using a synthetic voice impersonating President Biden was sent to New Hampshire voters before the state primary, telling them not to vote. The incident demonstrated the potential for AI voice technology to directly interfere with democratic processes.
Detecting AI-generated speech has become increasingly difficult as the quality of synthesis has improved. While early TTS systems produced artifacts that were easy to identify (metallic tones, unnatural rhythms, pronunciation errors), modern systems produce speech that even trained listeners struggle to distinguish from genuine recordings. Research into audio deepfake detection is an active area, with methods based on spectral analysis, temporal patterns, and trained classifiers, but the detection models face a constant arms race against improving generation quality.
The regulatory landscape for TTS and voice cloning is evolving rapidly in response to the technology's growing capabilities and misuse potential.
As of February 2026, 46 U.S. states have enacted legislation directly targeting the use of AI-generated media, including synthetic voice [25]. Tennessee's Ensuring Likeness, Voice, and Image Security (ELVIS) Act was the first state law to expressly extend right-of-publicity protections to AI-generated voice clones. New York has added enhanced protections with new civil remedies for individuals whose voice or likeness is used through synthetic media.
At the federal level, the NO FAKES Act (Nurture Originals, Foster Art, and Keep Entertainment Safe) was reintroduced in Congress in April 2025, proposing uniform national protections against vocal cloning and digital deepfakes [25]. The FCC has also ruled that AI-generated voices in robocalls violate the Telephone Consumer Protection Act, creating regulatory tools to combat voice-based scams.
The EU AI Act classifies certain TTS applications based on risk levels. Under Article 50's transparency obligations, providers and deployers of AI systems that generate synthetic audio must inform users that the content is AI-generated and label or mark it accordingly [26]. High-risk applications (such as using synthetic voices in law enforcement or critical infrastructure) face additional requirements including conformity assessments and ongoing monitoring.
Major TTS providers have implemented their own safeguards. ElevenLabs requires users to confirm they have the rights to clone a voice before allowing voice cloning. The platform uses AI-based moderation to detect and block potentially harmful content. OpenAI restricts its voice cloning capabilities and requires explicit consent from the voice owner. Many platforms embed inaudible watermarks in generated audio to enable tracing of AI-generated content to its source.
As of early 2026, TTS technology is in a period of rapid commercial expansion and increasing integration into everyday products and services.
The leading TTS systems have reached a quality level where the difference between synthetic and natural speech is negligible for most listeners and most use cases. ElevenLabs' Eleven v3, Google's Chirp 3 HD, Microsoft's Dragon HD Omni, and OpenAI's gpt-4o-mini-tts all produce speech that can pass casual listening tests. The competitive frontier has shifted from raw quality to other dimensions: expressiveness, emotional range, control, latency, multilingual support, and cost.
One of the most significant trends is the integration of TTS with large language models to create real-time conversational AI systems. OpenAI's approach of combining language understanding and speech generation in a single multimodal model, and ElevenLabs' Conversational AI product, represent a shift toward AI systems that can listen, understand, think, and speak in a seamless loop. This is enabling a new generation of AI voice agents for customer service, sales, healthcare, and personal assistance.
The TTS market has seen consolidation, with larger companies acquiring smaller players. Meta acquired PlayHT in July 2025, shutting down the PlayHT API by December 2025 [27]. ElevenLabs' rapid growth (from $3.3 billion to $11 billion valuation in just over a year) and potential IPO plans suggest that the company may emerge as a dominant standalone player in the voice AI space.
Open-source TTS models continue to improve, with Coqui XTTS v2, Bark, and newer models like CosyVoice 2 and GLM-TTS narrowing the gap with commercial offerings. The availability of high-quality open-source models has democratized access to neural TTS for researchers, hobbyists, and organizations that need local deployment for privacy or cost reasons.
Recent developments point toward several emerging capabilities: