Voice AI
Last reviewed
Jun 9, 2026
Sources
27 citations
Review status
Source-backed
Revision
v1 · 2,132 words
Voice AI is the umbrella term for artificial intelligence systems that listen to, understand, and generate human speech. The field spans speech recognition (ASR), text-to-speech (TTS) synthesis, voice cloning and conversion, and real-time voice agents that hold spoken conversations over apps and phone lines. Between 2024 and 2026 voice shifted from a peripheral interface to one of the most heavily funded areas of applied AI: OpenAI, Google, and Amazon shipped native speech-to-speech models, ElevenLabs reached an $11 billion valuation [1], and enterprises began replacing touch-tone phone trees with LLM-driven agents in contact centers and drive-thrus. The same generative advances revived old harms in new forms, from cloned-voice fraud to the deepfake robocall that prompted a 2024 US Federal Communications Commission ruling [2].
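Voice AI covers four overlapping problem areas: speech recognition (ASR), text-to-speech (TTS) synthesis, voice cloning and conversion, and real-time voice agents.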
Modern voice AI is the product of two earlier revolutions. Deep learning replaced hidden-Markov-model ASR in the mid-2010s, and DeepMind's WaveNet (2016) showed that neural networks could generate raw audio waveforms that sounded far more natural than concatenative synthesis. OpenAI's Whisper, an open-source ASR model trained on 680,000 hours of weakly supervised audio and released in September 2022, made robust multilingual transcription a commodity [3]. The third revolution arrived when LLMs were placed in the middle of the loop, and then trained to operate on audio directly.
Most production voice agents in 2026 still use a cascaded pipeline: a voice activity detector and turn-taking model decides when the user has finished speaking; a streaming ASR model transcribes the audio; an LLM generates a response (often calling tools or APIs); and a streaming TTS model speaks it. Transport runs over WebRTC for apps or SIP for telephony, with orchestration frameworks such as LiveKit Agents and Pipecat gluing the stages together.
| Stage | Function | Representative systems (2024 to 2026) |
|---|---|---|
| Transport | WebRTC, SIP/telephony streaming | LiveKit, Daily, Twilio |
| Turn-taking | Voice activity detection, semantic endpointing | Silero VAD, vendor turn-detection models |
| ASR | Streaming speech-to-text | Whisper, Deepgram Nova-3, AssemblyAI Universal-2, ElevenLabs Scribe |
| Reasoning | LLM with tool calling | GPT-4.1, Claude, Gemini, Llama |
| TTS | Streaming synthesis | ElevenLabs, Cartesia Sonic-3, Deepgram Aura-2, Hume AI Octave, OpenAI TTS |
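The cascade described above can be sketched as a chain of swappable callables. This is a minimal illustration, not any framework's API: `CascadedAgent`, `Turn`, `feed`, and `demo_agent` are hypothetical names, and the stub stages stand in for real VAD, ASR, LLM, and TTS components.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Turn:
    user_text: str
    agent_text: str


@dataclass
class CascadedAgent:
    # Each stage is a plain callable so components can be swapped independently.
    is_end_of_turn: Callable[[List[bytes]], bool]  # VAD / turn-taking
    transcribe: Callable[[List[bytes]], str]       # ASR
    respond: Callable[[str, List[Turn]], str]      # LLM (tool calls omitted)
    synthesize: Callable[[str], bytes]             # TTS
    history: List[Turn] = field(default_factory=list)
    buffer: List[bytes] = field(default_factory=list)

    def feed(self, frame: bytes) -> Optional[bytes]:
        """Accept one audio frame; return reply audio once a turn completes."""
        self.buffer.append(frame)
        if not self.is_end_of_turn(self.buffer):
            return None  # user is still speaking
        text = self.transcribe(self.buffer)
        self.buffer = []
        reply = self.respond(text, self.history)
        self.history.append(Turn(text, reply))
        return self.synthesize(reply)


def demo_agent() -> CascadedAgent:
    """Build an agent from toy stubs: frames carry text, an empty frame is silence."""
    return CascadedAgent(
        is_end_of_turn=lambda frames: frames[-1] == b"",
        transcribe=lambda frames: b"".join(frames).decode(),
        respond=lambda text, history: f"You said: {text}",
        synthesize=lambda reply: reply.encode(),
    )
```

Because every stage sits behind the same narrow interface, replacing one vendor's ASR or TTS with another's changes a single argument. The intermediate `text` and `reply` strings are also the transcript that cascades make available for observability.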
Latency is the stack's defining constraint. Humans respond in conversation after roughly 200 to 300 milliseconds, while the original ChatGPT voice mode, a three-model cascade, averaged 2.8 seconds with GPT-3.5 and 5.4 seconds with GPT-4 [4]. Component vendors now compete in tens of milliseconds: Cartesia's Sonic-3, built on state space models rather than transformers, claims about 90 ms model latency and 190 ms end-to-end across 42 languages [5]. Cascades remain popular because they preserve the strongest text-domain reasoning, let builders swap best-of-breed components, and produce transcripts for observability. Their weaknesses are added latency, brittle turn-taking, and the loss of paralinguistic signal: a transcript discards tone, emotion, hesitation, and speaker identity.
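A rough way to reason about this constraint is to add up the sequential stage delays of one turn and compare the total with the human response gap. The sketch below does that arithmetic; the per-stage numbers in `example_stages` are hypothetical placeholders chosen for illustration, not vendor measurements, and the function names are assumptions.

```python
from typing import Dict

# Upper end of the 200 to 300 ms human response gap cited in the text.
HUMAN_GAP_MS = 300.0


def total_latency_ms(stages: Dict[str, float]) -> float:
    """Sum sequential per-stage latencies for one cascaded turn."""
    return sum(stages.values())


def over_budget_ms(stages: Dict[str, float], budget_ms: float = HUMAN_GAP_MS) -> float:
    """Return how far the turn exceeds the budget, or 0.0 if it fits."""
    return max(0.0, total_latency_ms(stages) - budget_ms)


# Hypothetical figures for illustration only.
example_stages = {
    "endpointing": 200.0,      # waiting to be sure the user has stopped
    "asr_final": 100.0,        # final transcript after end of speech
    "llm_first_token": 350.0,  # time to first generated token
    "tts_first_audio": 90.0,   # time to first synthesized audio
    "network": 60.0,           # transport round trips
}
```

With these placeholder values the stages sum to 800 ms, 500 ms over the budget, which suggests why shaving tens of milliseconds from a single component matters less than overlapping stages through streaming.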
The alternative is a single model that consumes and produces audio natively. OpenAI's GPT-4o (May 2024) was the first frontier model trained end-to-end across text, vision, and audio, responding to speech in as little as 232 ms, with an average of 320 ms [4]. It reached consumers as ChatGPT's Advanced Voice Mode, rolled out to paid users on September 24, 2024 [6], and reached developers through the Realtime API, launched in beta in October 2024 [7]. The Realtime API became generally available on August 28, 2025, with the gpt-realtime speech-to-speech model, adding SIP phone calling, image input, and remote MCP tool support, priced at $32 per million audio input tokens and $64 per million audio output tokens [8].
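The per-token prices quoted above translate into a per-call cost with simple arithmetic. The sketch below uses only the two rates from the text; the function name and the token counts in the usage note are hypothetical, since the number of audio tokens per minute of speech is not given here.

```python
from decimal import Decimal

# Rates quoted in the text for gpt-realtime audio tokens [8].
INPUT_USD_PER_MILLION = Decimal("32")
OUTPUT_USD_PER_MILLION = Decimal("64")


def audio_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Return the audio token cost in US dollars for one session."""
    million = Decimal(1_000_000)
    return (
        Decimal(input_tokens) * INPUT_USD_PER_MILLION
        + Decimal(output_tokens) * OUTPUT_USD_PER_MILLION
    ) / million
```

For example, a session that consumed a hypothetical 50,000 audio input tokens and 25,000 audio output tokens would cost `audio_cost_usd(50_000, 25_000)`, which is $3.20. Output audio is billed at twice the input rate, so the agent's speaking time weighs more heavily than the caller's.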
Google launched Gemini Live, its answer to Advanced Voice Mode, in August 2024 for Gemini Advanced subscribers on Android [9], and later shipped native-audio versions of Gemini 2.5 with controllable speech and "thinking" variants. Other notable end-to-end systems include Amazon's Nova Sonic speech-to-speech model (April 2025); Kyutai's Moshi (unveiled July 2024 and released openly that September), the first open full-duplex spoken dialogue model, which models the user's audio and its own as parallel streams and listens while it speaks at about 200 ms latency [10]; Hume AI's EVI 3 (May 2025), a speech-language model that can speak in voices designed by prompt [11]; and Sesame's Conversational Speech Model, whose "Maya" demo drew more than a million users within weeks of its February 2025 debut [12].
End-to-end models capture emotion, accent, and interruption dynamics that cascades lose, but early versions reasoned worse in speech than in text. On Big Bench Audio, a 1,000-question audio reasoning benchmark released in December 2024, GPT-4o scored 92 percent in text but 66 percent speech-to-speech, while a cascaded Whisper-GPT-4o-TTS pipeline showed minimal degradation [13]. The gap has since narrowed: Artificial Analysis measured Google's Gemini 2.5 Native Audio Thinking at 92 percent in October 2025, the first native speech model to match cascaded pipelines on the benchmark [14]. Production systems increasingly mix both patterns, using speech-native models for conversation and delegating hard reasoning to text models.
| Company | Core focus | Notable releases | Recent milestones |
|---|---|---|---|
| ElevenLabs | TTS, cloning, dubbing, agents | Eleven v3, Flash, Scribe ASR (Feb 2025) [15] | Series C at $3.3B valuation, Jan 2025 [16]; secondary sale at $6.6B valuation, Sept 2025; round at $11B valuation, Feb 2026 [1] |
| OpenAI | Speech-to-speech, ASR, TTS APIs | gpt-realtime, Whisper, Voice Engine | Realtime API GA Aug 2025 [8] |
| Google | Assistant and cloud speech | Gemini Live, Gemini 2.5 native audio, Chirp | Top Big Bench Audio score, Oct 2025 [14] |
| Cartesia | Low-latency SSM voice models | Sonic-1/2/3 | $100M round (Kleiner Perkins, Index, Lightspeed, Nvidia) Oct 2025; customers include ServiceNow, Cresta, Decagon [5] |
| Deepgram | Enterprise ASR, voice agent API | Nova-3 ASR (Feb 2025) [17], Aura-2 TTS (Apr 2025) [18] | Founded 2015; contact-center focus |
| AssemblyAI | ASR and speech understanding | Universal-2 (Oct 2024) [19] | Claims 73% human preference over prior model [19] |
| Hume AI | Emotionally expressive voice | EVI 1/2/3, Octave TTS | EVI 3 launched May 2025 [11] |
| Sesame | Voice companions, smart glasses | CSM, Maya and Miles | $250M Series B (Sequoia, Spark) Oct 2025 [12] |
The segments are converging. ElevenLabs, which began in TTS, added the Scribe ASR model in February 2025 and a conversational agents platform [15]; Deepgram, an ASR specialist founded in 2015, added TTS and a full voice-agent API; OpenAI and Google sell every layer. A parallel open ecosystem (Whisper, Moshi, Sesame's CSM-1B base model, NVIDIA's open ASR models, and small open TTS models such as Kokoro) keeps self-hosted stacks viable. Voice agent startups such as Sierra, Decagon, PolyAI, Parloa, Retell AI, and Vapi build on these models for enterprise deployments.
Contact centers are the largest commercial market. LLM-based voice agents answer, qualify, and resolve calls that previously went to interactive voice response systems or offshore agents, and the Realtime API's SIP support made plugging models directly into phone systems a first-class feature [8]. Quick-service restaurants are a visible proving ground with mixed results: McDonald's, which acquired voice startup Apprente in 2019, ended its IBM-partnered automated drive-thru pilot across more than 100 locations in 2024 after accuracy complaints [20], while Wendy's FreshAI, built with Google Cloud, and Yum Brands' Nvidia-backed Byte platform at Taco Bell continued expanding to hundreds of sites [21]. SoundHound, which in 2024 acquired restaurant voice-ordering firm SYNQ3 and enterprise conversational AI company Amelia, powers ordering for other chains.
Other significant uses include ambient clinical scribes that document doctor-patient conversations, audiobook and news narration, video dubbing and localization that preserves the original speaker's voice, real-time speech translation, in-car assistants, accessibility tools such as voice banking for people losing speech to ALS, and AI companions, the focus of Sesame's planned smart glasses [12].
Modern cloning systems need only seconds of audio, which makes consent the field's central ethical issue. OpenAI built Voice Engine, a cloning model requiring a 15-second sample, in 2022 but announced in March 2024 that it would keep access restricted, citing impersonation risks in an election year [22]. Microsoft similarly never released its VALL-E research models. The stakes were illustrated in May 2024 when OpenAI paused ChatGPT's "Sky" voice after Scarlett Johansson, who had declined to voice the product, said it sounded "eerily similar" to her own [23].
Cloned voices are already used in fraud, including impostor scams that mimic a relative in distress. A March 2025 Consumer Reports assessment of six cloning products found that four (ElevenLabs, Speechify, PlayHT, and Lovo) relied on self-attestation checkboxes rather than technical safeguards to confirm a speaker's consent, while Descript and Resemble AI imposed stronger checks [24]. Vendors have responded with voice verification steps, watermarking, blocked "no-go" voices of public figures, and detection classifiers.
Regulators moved quickly after a January 2024 robocall imitating President Biden urged New Hampshire voters to skip the state's primary. On February 8, 2024 the FCC unanimously ruled that AI-generated voices are "artificial" under the Telephone Consumer Protection Act, making robocalls that use them illegal without prior express consent [2]. The consultant behind the calls, Steve Kramer, was fined $6 million in September 2024 [25], and carrier Lingo Telecom paid a $1 million settlement [26]. Tennessee's ELVIS Act, signed March 21, 2024 and effective July 1, 2024, became the first US state law adding voice to right-of-publicity protections against AI imitation [27]. In the European Union, the EU AI Act's transparency rules, applying from August 2, 2026, require that people be told when they are talking to an AI system and that synthetic audio be disclosed. A proposed federal NO FAKES Act, reintroduced in the US Congress in April 2025, would create a national digital-replica right but had not become law as of June 2026.