Voice AI

Updated Jul 23, 2026

Voice AI is the umbrella term for artificial intelligence systems that listen to, understand, and generate human speech. The field spans speech recognition (ASR), text-to-speech (TTS) synthesis, voice cloning and conversion, and real-time voice agents that hold spoken conversations over apps and phone lines. Between 2024 and 2026 voice shifted from a peripheral interface to one of the most heavily funded areas of applied AI: OpenAI, Google, and Amazon shipped native speech-to-speech models, and in February 2026 ElevenLabs raised a $500 million Series D, led by Sequoia Capital, at an $11 billion valuation ^[1]. The conversational AI market alone was estimated at $11.58 billion in 2024 and is forecast to reach $41.39 billion by 2030, a 23.7 percent compound annual growth rate ^[2]. The same generative advances revived old harms in new forms, from cloned-voice fraud to the deepfake robocall that prompted a 2024 US Federal Communications Commission ruling ^[3].
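
The market growth rate quoted above can be sanity-checked with the standard compound-annual-growth formula. The sketch below assumes six compounding periods (2024 to 2030); the function name `cagr` is illustrative, and the source report may define its forecast window differently. Under that assumption the result should land close to the quoted 23.7 percent.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10 percent)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


# Market figures cited above, in billions of US dollars.
# Assumption: six compounding periods between 2024 and 2030.
rate = cagr(start_value=11.58, end_value=41.39, years=2030 - 2024)
print(f"Implied CAGR: {rate:.1%}")
```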

What does voice AI include?

Voice AI covers four overlapping problem areas:

Speech recognition (ASR or speech-to-text): converting audio into text, including streaming transcription, speaker diarization, and multilingual code-switching.
Speech synthesis (TTS): generating audio from text, from neutral narration to expressive, emotionally controllable dialogue.
Voice cloning and conversion: reproducing a specific person's voice from a short sample, or transforming one voice into another while preserving the words.
Conversational voice agents: full-duplex or near-real-time spoken dialogue systems that combine listening, reasoning, and speaking, usually around a large language model.

Modern voice AI is the product of two earlier revolutions. Deep learning replaced hidden-Markov-model ASR in the mid-2010s, and DeepMind's WaveNet (2016) showed that neural networks could generate raw audio waveforms that sounded far more natural than concatenative synthesis. OpenAI's Whisper, an open-source ASR model trained on 680,000 hours of weakly supervised audio and released in September 2022, made robust multilingual transcription a commodity; OpenAI described it as approaching "human level robustness and accuracy on English speech recognition" ^[4]. The third revolution arrived when LLMs were placed in the middle of the loop, and then trained to operate on audio directly.

How does a voice AI system work?

Most production voice agents in 2026 still use a cascaded pipeline: a voice activity detector and turn-taking model decides when the user has finished speaking; a streaming ASR model transcribes the audio; an LLM generates a response (often calling tools or APIs); and a streaming TTS model speaks it. Transport runs over WebRTC for apps or SIP for telephony, with orchestration frameworks such as LiveKit Agents and Pipecat gluing the stages together.
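
The cascade described above can be sketched as four pluggable stages. This is a minimal illustration, not any framework's actual API: the `VoiceAgent` class, its field names, and the toy stand-in functions are all hypothetical, and real systems stream each stage rather than running them strictly in sequence.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VoiceAgent:
    """One conversational turn through a cascaded pipeline (every stage is a stub)."""
    is_end_of_turn: Callable[[bytes], bool]  # VAD / turn-taking model
    transcribe: Callable[[bytes], str]       # streaming ASR
    respond: Callable[[str], str]            # LLM, possibly calling tools
    synthesize: Callable[[str], bytes]       # streaming TTS

    def run_turn(self, audio_chunks: List[bytes]) -> bytes:
        buffered = b""
        for chunk in audio_chunks:
            buffered += chunk
            if self.is_end_of_turn(chunk):
                break  # the user has finished speaking; ignore later chunks
        transcript = self.transcribe(buffered)
        reply_text = self.respond(transcript)
        return self.synthesize(reply_text)


# Toy stand-ins so the sketch runs without audio or model dependencies.
agent = VoiceAgent(
    is_end_of_turn=lambda chunk: chunk == b"",       # an empty chunk marks silence
    transcribe=lambda audio: audio.decode("utf-8"),  # pretend the audio is text bytes
    respond=lambda text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)

reply_audio = agent.run_turn([b"hello ", b"world", b"", b"ignored"])
print(reply_audio.decode("utf-8"))  # prints "You said: hello world"
```

Because each stage is just a callable, any one of them can be swapped without touching the others, which is the "best-of-breed components" property that keeps cascades popular.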

Stage	Function	Representative systems (2024 to 2026)
Transport	WebRTC, SIP/telephony streaming	LiveKit, Daily, Twilio
Turn-taking	Voice activity detection, semantic endpointing	Silero VAD, vendor turn-detection models
ASR	Streaming speech-to-text	Whisper, Deepgram Nova-3, AssemblyAI Universal-2, ElevenLabs Scribe
Reasoning	LLM with tool calling	GPT-4.1, Claude, Gemini, Llama
TTS	Streaming synthesis	ElevenLabs, Cartesia Sonic-3, Deepgram Aura-2, Hume AI Octave, OpenAI TTS
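
As an illustration of the turn-taking stage, the sketch below implements a deliberately simple energy-threshold endpoint detector. It is a stand-in for the idea only: neural detectors such as Silero VAD and vendor semantic endpointing models work differently, and the threshold and hangover values here are hypothetical.

```python
import math
from typing import Optional, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


def find_end_of_turn(
    frames: Sequence[Sequence[float]],
    energy_threshold: float = 0.02,  # hypothetical value; tune per microphone
    silence_frames_needed: int = 3,  # hypothetical "hangover" length
) -> Optional[int]:
    """Return the index of the frame at which the turn is judged to have ended.

    The turn ends once speech has been heard and is then followed by
    `silence_frames_needed` consecutive quiet frames. Returns None otherwise.
    """
    heard_speech = False
    quiet_run = 0
    for index, frame in enumerate(frames):
        if rms(frame) >= energy_threshold:
            heard_speech = True
            quiet_run = 0  # speech resumed, so restart the silence count
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= silence_frames_needed:
                return index
    return None


loud, quiet = [0.5, -0.5, 0.5, -0.5], [0.0, 0.0, 0.0, 0.0]
frames = [quiet, loud, loud, quiet, loud, quiet, quiet, quiet, loud]
print(find_end_of_turn(frames))  # prints 7
```

The single quiet frame at index 3 does not end the turn because speech resumes before the hangover expires. A longer hangover avoids cutting speakers off mid-pause, but it adds directly to response latency.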

Latency is the stack's defining constraint. Humans respond in conversation after roughly 200 to 300 milliseconds, while the original ChatGPT voice mode, a three-model cascade, averaged 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4 ^[5]. Component vendors now compete in tens of milliseconds: Cartesia's Sonic-3, built on state space models rather than transformers, claims about 90 ms model latency and 190 ms end-to-end across 42 languages ^[6]. Cascades remain popular because they preserve the strongest text-domain reasoning, let builders swap best-of-breed components, and produce transcripts for observability. Their weaknesses are added latency, brittle turn-taking, and the loss of paralinguistic signal: a transcript discards tone, emotion, hesitation, and speaker identity.
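
To see why latency dominates design decisions, it helps to add up a per-turn budget. The stage figures below are hypothetical placeholders chosen for illustration (only the 90 ms TTS figure echoes the vendor claim cited above); the point is the arithmetic, not the specific numbers.

```python
HUMAN_GAP_MS = (200, 300)  # conversational response range cited above

# Hypothetical per-stage latencies for one cascaded turn, in milliseconds.
stage_latency_ms = {
    "transport": 40,
    "turn_detection": 150,
    "asr_final": 120,
    "llm_first_token": 250,
    "tts_first_audio": 90,  # roughly the model latency claimed for Sonic-3
}

total_ms = sum(stage_latency_ms.values())
slowest_stage = max(stage_latency_ms, key=stage_latency_ms.get)
overshoot_ms = max(0, total_ms - HUMAN_GAP_MS[1])

print(f"time to first audio: {total_ms} ms")
print(f"slowest stage: {slowest_stage}")
print(f"over the {HUMAN_GAP_MS[1]} ms human gap by: {overshoot_ms} ms")
```

Even with every stage individually fast, the sum in this example is more than double the upper end of the human range, which is why builders overlap stages through streaming instead of waiting for each one to finish.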

What are end-to-end speech models?

The alternative is a single model that consumes and produces audio natively. OpenAI's GPT-4o (May 2024) was the first frontier model trained end-to-end across text, vision, and audio, responding to speech in as little as 232 ms, with an average of 320 ms, comparable to human conversational turn-taking ^[5]. It reached consumers as ChatGPT's Advanced Voice Mode, rolled out to paid users on September 24, 2024 ^[7], and reached developers through the Realtime API, which launched in beta in October 2024 ^[8]. The Realtime API became generally available on August 28, 2025 with the gpt-realtime speech-to-speech model, adding SIP phone calling, image input, and remote MCP tool support, priced at $32 per million audio input tokens and $64 per million audio output tokens ^[9].
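
The per-token prices above translate into a session cost as follows. This is a back-of-the-envelope sketch: the function name and the example token counts are hypothetical, and it covers audio tokens only, ignoring text tokens and any other pricing tiers.

```python
from decimal import Decimal

# Prices quoted above for gpt-realtime, in US dollars per million audio tokens.
INPUT_PRICE_PER_MILLION = Decimal("32")
OUTPUT_PRICE_PER_MILLION = Decimal("64")
MILLION = Decimal(1_000_000)


def audio_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Audio-token cost of a session; text tokens and other tiers are ignored."""
    return (
        Decimal(input_tokens) * INPUT_PRICE_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_PRICE_PER_MILLION
    ) / MILLION


# Hypothetical call: 50,000 audio tokens heard and 25,000 audio tokens spoken.
print(audio_cost_usd(50_000, 25_000))  # prints 3.2
```

Because output audio costs twice as much per token as input audio, the 25,000 spoken tokens in this example cost as much as the 50,000 heard tokens.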

Google launched Gemini Live, its answer to Advanced Voice Mode, in August 2024 for Gemini Advanced subscribers on Android ^[10], and later shipped native-audio versions of Gemini 2.5 with controllable speech and "thinking" variants. Other notable end-to-end systems include Amazon's Nova Sonic speech-to-speech model (April 2025) and Kyutai's Moshi, unveiled in July 2024 and released openly that September. Moshi's authors describe it as "the first real-time full-duplex spoken large language model"; it models the user's audio and its own as parallel streams, so it can listen while it speaks, at about 200 ms latency in practice ^[11]. Hume AI's EVI 3 (May 2025) is a speech-language model that can speak in voices designed by prompt ^[12], and Sesame's Conversational Speech Model drew more than a million users to its "Maya" demo within weeks of its February 2025 debut ^[13].

End-to-end models capture emotion, accent, and interruption dynamics that cascades lose, but early versions reasoned worse in speech than in text. On Big Bench Audio, a 1,000-question audio reasoning benchmark released in December 2024, GPT-4o scored 92 percent in text but 66 percent speech-to-speech, while a cascaded Whisper-GPT-4o-TTS pipeline showed minimal degradation ^[14]. The gap has since narrowed: Artificial Analysis measured Google's Gemini 2.5 Native Audio Thinking at 92 percent in October 2025, the first native speech model to match cascaded pipelines on the benchmark ^[15]. Production systems increasingly mix both patterns, using speech-native models for conversation and delegating hard reasoning to text models.
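
The hybrid pattern in the last sentence can be sketched as a small router. Everything here is a hypothetical illustration: the keyword list stands in for whatever difficulty signal a real system would use, and the handlers take a transcript only to keep the example runnable, whereas a speech-native model would normally receive audio.

```python
from typing import Callable

# Hypothetical hints that a query needs careful reasoning.
HARD_REASONING_HINTS = ("calculate", "compare", "step by step", "how many")


def needs_text_reasoning(transcript: str) -> bool:
    """Crude keyword heuristic standing in for a real difficulty classifier."""
    lowered = transcript.lower()
    return any(hint in lowered for hint in HARD_REASONING_HINTS)


def route(
    transcript: str,
    speech_native: Callable[[str], str],
    text_reasoner: Callable[[str], str],
) -> str:
    """Send hard queries to the text model; keep everything else speech-native."""
    handler = text_reasoner if needs_text_reasoning(transcript) else speech_native
    return handler(transcript)


def speech_native(transcript: str) -> str:
    return f"[speech model] {transcript}"


def text_reasoner(transcript: str) -> str:
    return f"[text model] {transcript}"


print(route("Hi, how are you today?", speech_native, text_reasoner))
print(route("How many days until my renewal?", speech_native, text_reasoner))
```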

Who are the leading voice AI companies?

Company	Core focus	Notable releases	Recent milestones
ElevenLabs	TTS, cloning, dubbing, agents	Eleven v3, Flash, Scribe ASR (Feb 2025) ^[16]	$180M Series C at $3.3B valuation Jan 2025 ^[17]; secondary sale at $6.6B valuation Sept 2025; $500M Series D at $11B valuation Feb 2026 ^[1]
OpenAI	Speech-to-speech, ASR, TTS APIs	gpt-realtime, Whisper, Voice Engine	Realtime API GA Aug 2025 ^[9]
Google	Assistant and cloud speech	Gemini Live, Gemini 2.5 native audio, Chirp	Top Big Bench Audio score, Oct 2025 ^[15]
Cartesia	Low-latency SSM voice models	Sonic-1/2/3	$100M round (Kleiner Perkins, Index, Lightspeed, Nvidia) Oct 2025; customers include ServiceNow, Cresta, Decagon ^[6]
Deepgram	Enterprise ASR, voice agent API	Nova-3 ASR (Feb 2025) ^[18], Aura-2 TTS (Apr 2025) ^[19]	Founded 2015; contact-center focus
AssemblyAI	ASR and speech understanding	Universal-2 (Oct 2024) ^[20]	Claims 73% human preference over prior model ^[20]
Hume AI	Emotionally expressive voice	EVI 1/2/3, Octave TTS	EVI 3 launched May 2025 ^[12]
Sesame	Voice companions, smart glasses	CSM, Maya and Miles	$250M Series B (Sequoia, Spark) Oct 2025 ^[13]

The segments are converging. ElevenLabs, which began in TTS and closed 2025 at roughly $330 million in annual recurring revenue, added the Scribe ASR model in February 2025 and a conversational agents platform ^[1]^[16]; Deepgram, an ASR specialist founded in 2015, added TTS and a full voice-agent API; OpenAI and Google sell every layer. A parallel open ecosystem (Whisper, Moshi, Sesame's CSM-1B base model, NVIDIA's open ASR models, and small open TTS models such as Kokoro) keeps self-hosted stacks viable. Voice agent startups such as Sierra, Decagon, PolyAI, Parloa, Retell AI, and Vapi build on these models for enterprise deployments.

What is voice AI used for?

Contact centers are the largest commercial market. LLM-based voice agents answer, qualify, and resolve calls that previously went to interactive voice response systems or offshore agents, and the Realtime API's SIP support made plugging models directly into phone systems a first-class feature ^[9]. Quick-service restaurants are a visible proving ground with mixed results: McDonald's, which acquired voice startup Apprente in 2019, ended its IBM-partnered automated drive-thru pilot across more than 100 locations in 2024 after accuracy complaints ^[21], while Wendy's FreshAI, built with Google Cloud, and Yum Brands' Nvidia-backed Byte platform at Taco Bell continued expanding to hundreds of sites ^[22]. SoundHound, which in 2024 acquired restaurant voice-ordering firm SYNQ3 and enterprise conversational AI company Amelia, powers ordering for other chains.

Other significant uses include ambient clinical scribes that document doctor-patient conversations, audiobook and news narration, video dubbing and localization that preserves the original speaker's voice, real-time speech translation, in-car assistants, accessibility tools such as voice banking for people losing speech to ALS, and AI companions, the focus of Sesame's planned smart glasses ^[13].

What are the ethical and legal risks of voice AI?

Modern cloning systems need only seconds of audio, which makes consent the field's central ethical issue. OpenAI built Voice Engine, a cloning model requiring a 15-second sample, in 2022 but announced in March 2024 that it would keep access restricted, citing impersonation risks in an election year ^[23]. Microsoft similarly never released its VALL-E research models. The stakes were illustrated in May 2024 when OpenAI paused ChatGPT's "Sky" voice after Scarlett Johansson, who had declined to voice the product, said it sounded "eerily similar" to her own ^[24].

Cloned voices are already used in fraud, including impostor scams that mimic a relative in distress. A March 2025 Consumer Reports assessment of six cloning products found that four (ElevenLabs, Speechify, PlayHT, and Lovo) relied on self-attestation checkboxes rather than technical safeguards to confirm a speaker's consent, while Descript and Resemble AI imposed stronger checks ^[25]. Vendors have responded with voice verification steps, watermarking, blocked "no-go" voices of public figures, and detection classifiers.

Regulators moved quickly after a January 2024 robocall imitating President Biden urged New Hampshire voters to skip the state's primary. On February 8, 2024 the FCC unanimously ruled that AI-generated voices are "artificial" under the Telephone Consumer Protection Act, making robocalls that use them illegal without prior express consent ^[3]. "Bad actors are using AI-generated voices in unsolicited robocalls to extort vulnerable family members, imitate celebrities, and misinform voters," said FCC Chairwoman Jessica Rosenworcel in announcing the ruling ^[3]. The consultant behind the calls, Steve Kramer, was fined $6 million in September 2024 ^[26], and carrier Lingo Telecom paid a $1 million settlement ^[27]. Tennessee's ELVIS Act, signed March 21, 2024 and effective July 1, 2024, became the first US state law adding voice to right-of-publicity protections against AI imitation ^[28]. In the European Union, the EU AI Act's transparency rules, applying from August 2, 2026, require that people be told when they are talking to an AI system and that synthetic audio be disclosed. A proposed federal NO FAKES Act, reintroduced in the US Congress in April 2025, would create a national digital-replica right but had not become law as of June 2026.

References

^[1] TechCrunch. "ElevenLabs raises $500M from Sequoia at an $11 billion valuation." February 4, 2026. techcrunch.com/...quioia-at-a-11-billion-valuation
^[2] Grand View Research. "Conversational AI Market Size, Share & Trends Analysis Report, 2025-2030." 2025. grandviewresearch.com/...sational-ai-market-report
^[3] Federal Communications Commission. "FCC Makes AI-Generated Voices in Robocalls Illegal." February 8, 2024. fcc.gov/...s-ai-generated-voices-robocalls-illegal
^[4] OpenAI. "Introducing Whisper." September 21, 2022. openai.com/...whisper
^[5] OpenAI. "Hello GPT-4o." May 13, 2024. openai.com/...hello-gpt-4o
^[6] AIM Media House. "Cartesia raises $100 million to transform real-time voice AI with Sonic-3." October 2025. aimmediahouse.com/...al-time-voice-ai-with-sonic-3
^[7] MIT Technology Review. "OpenAI released its advanced voice mode to more people. Here's how to get it." September 24, 2024. technologyreview.com/...people-heres-how-to-get-it
^[8] OpenAI. "Introducing the Realtime API." October 1, 2024. openai.com/...introducing-the-realtime-api
^[9] OpenAI. "Introducing gpt-realtime and Realtime API updates for production voice agents." August 28, 2025. openai.com/...introducing-gpt-realtime
^[10] TechCrunch. "Gemini Live, Google's answer to ChatGPT's Advanced Voice Mode, launches." August 13, 2024. techcrunch.com/...pts-advanced-voice-mode-launches
^[11] Defossez, A. et al. "Moshi: a speech-text foundation model for real-time dialogue." arXiv:2410.00037, 2024. arxiv.org/...2410.00037
^[12] Hume AI. "Introducing EVI 3: the world's most realistic and instructible speech-to-speech foundation model." May 2025. hume.ai/...introducing-evi-3
^[13] TechCrunch. "Sesame, the conversational AI startup from Oculus founders, raises $250M and launches beta." October 21, 2025. techcrunch.com/...rs-raises-250m-and-launches-beta
^[14] Hugging Face Blog (Artificial Analysis). "Evaluating Audio Reasoning with Big Bench Audio." December 2024. huggingface.co/...big-bench-audio-release
^[15] Artificial Analysis. "Google's Gemini 2.5 Native Audio Thinking is the new leading Speech to Speech model." October 2025. x.com/...1977720537519636756
^[16] TechCrunch. "ElevenLabs is launching its own speech-to-text model." February 26, 2025. techcrunch.com/...ing-its-own-speech-to-text-model
^[17] TechCrunch. "ElevenLabs, the hot AI audio startup, confirms $180M in Series C funding at a $3.3B valuation." January 30, 2025. techcrunch.com/...funding-at-3-3-billion-valuation
^[18] Deepgram. "Introducing Nova-3: Setting a New Standard for AI-Driven Speech-to-Text." February 12, 2025. deepgram.com/...roducing-nova-3-speech-to-text-api
^[19] Business Wire. "Deepgram Unveils Aura-2: The World's Most Professional, Cost-Effective, and Enterprise-Grade Text-to-Speech Model." April 15, 2025. businesswire.com/...en
^[20] AssemblyAI. "Introducing Universal-2." October 31, 2024. assemblyai.com/universal-2
^[21] Restaurant Dive. "McDonald's ends IBM drive-thru voice order test." June 2024. restaurantdive.com/...719085
^[22] Metaintro. "McDonald's and Wendy's AI Drive-Thru 2026." 2026. metaintro.com/...-2026-fast-food-jobs-disappearing
^[23] OpenAI. "Navigating the challenges and opportunities of synthetic voices." March 29, 2024. openai.com/...nd-opportunities-of-synthetic-voices
^[24] NPR. "Scarlett Johansson wants answers about ChatGPT voice that sounds like 'Her'." May 20, 2024. npr.org/...-to-scarlett-johansson-in-the-movie-her
^[25] Consumer Reports. "Consumer Reports' Assessment of AI Voice Cloning Products." March 2025. consumerreports.org/...f-ai-voice-cloning-products
^[26] Federal Communications Commission. "FCC Issues $6M Fine For N.H. Robocalls." September 2024. fcc.gov/...fcc-issues-6m-fine-nh-robocalls
^[27] Perkins Coie. "FCC Fines Telecom That Transmitted AI-Generated Deepfake Robocalls Impersonating President Biden." 2024. perkinscoie.com/...eepfake-robocalls-impersonating
^[28] Office of the Tennessee Governor. "Gov. Lee Signs ELVIS Act Into Law." March 21, 2024. tn.gov/...photos--gov--lee-signs-elvis-act-into-law
