An AI voice agent is a conversational artificial intelligence system that communicates with users through spoken language in real time. Unlike text-based chatbots or simple interactive voice response (IVR) menus, AI voice agents can hold fluid, natural-sounding telephone or in-app conversations, interpret intent, handle interruptions, and execute tasks such as booking appointments or transferring calls. The technology sits at the intersection of speech recognition, large language models, and text-to-speech synthesis, and it has become one of the fastest-growing segments of the broader conversational AI market.
The global voice AI agents market was valued at approximately USD 2.4 billion in 2024 and is projected to reach USD 47.5 billion by 2034, growing at a compound annual growth rate (CAGR) of 34.8%. Enterprises across banking, healthcare, retail, and telecommunications have driven this growth by deploying voice agents for customer service automation, outbound sales, appointment scheduling, and phone triage. Gartner predicts conversational AI will reduce contact center agent labor costs by $80 billion in 2026, and a Forrester Consulting study found that companies using voice AI report three-year ROI between 331% and 391%.
AI voice agents rely on one of two primary architectural paradigms: the cascaded (pipeline) approach and the end-to-end (speech-to-speech) approach.
The cascaded architecture is the most widely deployed design in production systems as of 2025. It chains together three distinct components in sequence:

1. Speech-to-text (STT), which transcribes the caller's audio into text;
2. a large language model (LLM), which interprets the transcript and generates a text response; and
3. text-to-speech (TTS), which synthesizes the response into audio.
Because each component operates independently, the cascaded pipeline is highly modular. Developers can swap in different STT, LLM, or TTS providers depending on cost, latency, or language requirements. The text intermediary between stages also provides a convenient point for applying content filters, compliance checks, and logging before the caller hears anything. These properties make the cascaded approach attractive for enterprise deployments where control, debuggability, and regulatory compliance are priorities.
The primary drawback is latency. Each handoff between components adds delay. A well-optimized cascaded pipeline typically achieves 500 to 800 milliseconds of end-to-end latency, though poorly tuned setups can exceed two seconds. Information is also lost at each boundary: the STT stage discards prosodic cues like tone, emphasis, and emotion, meaning the LLM works only with flat text.
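The per-turn flow can be sketched with hypothetical provider interfaces. The class names and toy stand-ins below are illustrative, not any vendor's actual SDK; the point is that each stage is swappable and the text intermediary carries the audit trail:

```python
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def respond(self, history: list[str], user_text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class CascadedAgent:
    """One conversational turn through an STT -> LLM -> TTS chain.

    The text between stages is where content filters, compliance
    checks, and logging are applied before the caller hears anything.
    """
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts
        self.transcript_log: list[str] = []  # audit trail

    def handle_turn(self, caller_audio: bytes) -> bytes:
        user_text = self.stt.transcribe(caller_audio)                  # stage 1
        self.transcript_log.append(f"caller: {user_text}")
        reply_text = self.llm.respond(self.transcript_log, user_text)  # stage 2
        self.transcript_log.append(f"agent: {reply_text}")
        return self.tts.synthesize(reply_text)                         # stage 3

# Toy stand-ins so the sketch runs without any external service:
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode()

class RuleLLM:
    def respond(self, history: list[str], user_text: str) -> str:
        return "Sure, I can help with that."

class BeepTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode()

agent = CascadedAgent(EchoSTT(), RuleLLM(), BeepTTS())
audio_out = agent.handle_turn(b"I'd like to book an appointment")
print(agent.transcript_log)
```

Because each stage is behind a small interface, swapping an STT or TTS provider means implementing one method, which is exactly the modularity the cascaded design trades latency for.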
End-to-end models process the entire exchange in a single latent space, accepting audio input and producing audio output without intermediate text representations. This design eliminates the multiple conversion steps and handoffs between models, achieving latency as low as 200 to 300 milliseconds. The model can also preserve and respond to non-verbal cues such as tone of voice, hesitation, and emotion, because audio features are never discarded.
OpenAI's GPT-4o and the gpt-realtime model family are leading examples of this approach. Google's Gemini Live similarly operates as an end-to-end voice model.
Despite the latency and expressiveness advantages, end-to-end models present significant challenges for enterprise adoption. Four main issues limit their use: the lack of a text intermediary makes content filtering harder; debugging is more difficult because there is no transcript to inspect mid-pipeline; there are no straightforward fallback mechanisms if one component fails; and evaluation tooling for speech-to-speech quality remains immature. Costs also tend to run higher, sometimes roughly ten times that of a chained pipeline, because the model re-processes the entire conversation context on each turn.
As a result, while speech-to-speech models saw major investment from AI labs throughout 2025, cascaded architectures remain the dominant solution for complex agentic tasks in production environments.
A third approach, sometimes called the "half-cascade" architecture, has gained traction as a middle ground. In this design, audio input is processed by a native audio encoder that feeds directly into a text-based language model for reasoning and response generation. The text output is then synthesized into speech by a TTS component. This preserves the speed advantage of native audio input (because the STT step is replaced by a faster audio encoder) while maintaining the debuggability and tool-calling reliability of text-based reasoning.
Both OpenAI and Google use variants of this half-cascade architecture in their consumer voice products. Google's Gemini 2.5 Flash Native Audio model, for instance, processes raw audio natively through a single low-latency model while still producing text-based reasoning internally. OpenAI's gpt-realtime model similarly combines native audio understanding with structured text-based tool calling.
Some platforms take a different hybrid approach: using a speech-to-speech model for initial response generation to minimize latency, then falling back to a cascaded pipeline for tool calls or knowledge retrieval that require structured text reasoning. These designs attempt to capture the low latency of end-to-end models while retaining the controllability of cascaded pipelines.
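The per-turn routing decision in such a hybrid design can be sketched as follows. The keyword heuristic is a deliberately naive placeholder for illustration; a production system would run an intent classifier over the partial transcript instead:

```python
def choose_path(user_text: str,
                tool_keywords: tuple[str, ...] = ("book", "schedule", "balance", "look up")) -> str:
    """Decide per turn whether the fast speech-to-speech path suffices,
    or whether the turn needs the cascaded path for structured tool calls.

    Keyword matching here stands in for a real intent classifier.
    """
    lowered = user_text.lower()
    if any(keyword in lowered for keyword in tool_keywords):
        return "cascaded"  # tool call or retrieval: needs text-based reasoning
    return "s2s"           # small talk or direct answer: minimize latency

print(choose_path("How are you today?"))            # expected: s2s
print(choose_path("Can you book me for Tuesday?"))  # expected: cascaded
```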
As of early 2026, less than 15% of enterprise deployments use pure speech-to-speech architectures, with the majority relying on cascaded or hybrid systems for production workloads.
| Attribute | Cascaded (STT + LLM + TTS) | End-to-end (S2S) | Hybrid (half-cascade) |
|---|---|---|---|
| Typical latency | 500-2,000+ ms | 200-300 ms | 300-800 ms |
| Debuggability | High (full text transcripts at each stage) | Low (no intermediate transcripts) | Medium (text available at reasoning stage) |
| Emotional understanding | Low (prosody lost in STT conversion) | High (native audio preserves tone, emotion) | Medium to high (audio encoder retains some cues) |
| Tool calling reliability | High | Less reliable | High |
| Compliance and auditing | Strong (full transcripts for review) | Weak (no text intermediary) | Moderate to strong |
| Modularity | High (swap STT, LLM, TTS independently) | Low (single model) | Medium |
| Cost per minute | ~$0.15/min | Higher (up to 10x cascaded) | Varies |
| Enterprise adoption (2026) | Dominant | <15% | Growing |
Several major AI companies have launched consumer-facing voice conversation products that showcase the state of the art.
OpenAI demonstrated Advanced Voice Mode as part of the GPT-4o announcement in May 2024, but the feature did not ship immediately. It began rolling out to a limited group of ChatGPT Plus subscribers in late July 2024 and expanded to all Plus and Team subscribers on September 24, 2024. Advanced Voice Mode uses the GPT-4o model's native audio capabilities, allowing it to process speech input and generate speech output in a single model. Users can interrupt the model mid-sentence, and the system can sense and interpret emotions from tone of voice and adjust its responses accordingly. As of September 2024, ChatGPT offered nine voice options: Breeze, Juniper, Cove, Ember, Arbor, Maple, Sol, Spruce, and Vale.
For developers building their own voice agents, OpenAI introduced the Realtime API in public beta on October 1, 2024, at OpenAI Dev Day. The API allows third-party applications to stream audio to and from OpenAI's speech-to-speech models over WebSocket or WebRTC connections. On October 30, 2024, OpenAI added five new voices with greater range and expressiveness.
In August 2025, the Realtime API reached general availability with the launch of the gpt-realtime model. This model showed significant improvements in instruction following accuracy (30.5% on the MultiChallenge audio benchmark, up from 20.6% for the previous model) and function calling accuracy (66.5% on ComplexFuncBench, up from 49.7%). The generally available API added support for remote MCP servers, image inputs, and phone calling through Session Initiation Protocol (SIP), enabling developers and enterprises to build production-ready voice agents.
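Interaction with the Realtime API takes the form of JSON events streamed over the connection. The sketch below constructs two such events locally; the event names follow OpenAI's published Realtime API schema, but exact session fields and voice names change between releases, so consult the current API reference before relying on them:

```python
import base64
import json

def session_update(voice: str, instructions: str) -> str:
    """Build a session.update event configuring the voice session.

    Field names follow OpenAI's published Realtime API event schema;
    verify against the current API reference, as fields evolve.
    """
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "instructions": instructions},
    })

def append_audio(pcm16_chunk: bytes) -> str:
    """Build an input_audio_buffer.append event carrying one chunk of
    caller audio, base64-encoded as the API expects."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

# A client would send these frames over a WebSocket to
# wss://api.openai.com/v1/realtime (authenticated with an API key),
# then send {"type": "response.create"} to request a spoken reply.
print(session_update("marin", "You are a concise phone receptionist."))
```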
Google unveiled Gemini Live at its Pixel launch event in August 2024. The feature allows users to have free-form voice conversations with Gemini, including the ability to interrupt the AI and change topics mid-conversation. Gemini Live initially launched for English-speaking Android users who subscribed to Gemini Advanced. By early October 2024, Google expanded availability to all Android users at no cost. The feature subsequently rolled out to iOS and Google Workspace accounts.
Anthropic launched voice mode for Claude in late May 2025, initially available to paid subscribers on iOS and Android. Voice access was extended to all users on June 3, 2025. The feature offers five distinct voice options (Buttery, Airy, Mellow, Glassy, and Rounded) and allows users to switch between text and voice during a conversation. In March 2026, Anthropic introduced a voice mode for Claude Code, its command-line coding assistant, using a push-to-talk interface activated via the /voice command.
A growing ecosystem of startups and platforms enables businesses to build, deploy, and manage AI voice agents without developing the underlying infrastructure from scratch. These platforms typically provide orchestration layers that combine STT, LLM, and TTS components along with telephony integration, analytics, and compliance tooling.
| Platform | Founded | Headquarters | Key Features | Notable Funding | Typical Latency |
|---|---|---|---|---|---|
| Bland AI | 2023 | San Francisco, CA | No-code Conversational Pathways workflow builder; HIPAA and GDPR compliant self-hosted infrastructure; voice cloning; multi-language support; post-call analytics | $65M total (including $40M Series B, January 2025, led by Emergence Capital) | Not publicly disclosed |
| Vapi | 2023 | San Francisco, CA | Developer-first API; real-time voice orchestration over WebRTC; supports bring-your-own STT/LLM/TTS; 100+ languages; Squad multi-agent routing; GoHighLevel and Make.com integrations | ~$25M total ($20M Series A, December 2024, led by Bessemer Venture Partners) | 550 to 800 ms |
| Retell AI | 2023 | San Francisco, CA | No-code agent builder; proprietary turn-taking model; function calling for appointments and CRM updates; 31+ languages with automatic language detection; HIPAA, SOC 2 Type II, GDPR compliant; Retell Assure automated QA (launched late 2025) | ~$5M seed (Y Combinator, 2024) | ~600 ms |
| ElevenLabs Conversational AI | 2022 | New York, NY | Sub-100 ms voice latency; 32+ languages; RAG integration; SDKs for JavaScript, Python, Swift; Conversational AI 2.0 (May 2025) with multimodal text and voice input | $180M Series C (January 2025, a16z and ICONIQ Growth); $500M Series D (February 2026, $11B valuation) | Sub-100 ms |
| Voiceflow | 2019 | San Francisco, CA | Visual drag-and-drop flow builder; Agent Step for autonomous AI decisions (Winter 2025); 300+ native integrations; SOC 2 and ISO compliant; custom TTS voices via ElevenLabs | ~$39M total | Sub-500 ms |
| Play.ai (PlayAI) | 2022 | San Francisco, CA | PlayDialog model with emotional prompting; Play 3.0 mini for low-latency multilingual TTS (30+ languages); web, phone, and app deployment; 24/7 voice agents | $21M seed (November 2024, led by Kindred Ventures, with Y Combinator) | Not publicly disclosed |
AI voice agents have found traction across a range of industries and business functions. The banking, financial services, and insurance (BFSI) sector leads adoption with a 32.9% market share as of 2024, followed by healthcare, retail, and telecommunications.
The most common deployment scenario involves replacing or augmenting traditional call center operations. Voice agents can handle frequently asked questions, account inquiries, password resets, billing disputes, and order status checks without human intervention. Retell AI reports that companies deploying its technology automate up to 80% of inbound calls. By operating around the clock and handling unlimited concurrent calls, voice agents eliminate hold times and reduce labor costs.
Healthcare providers, dental offices, salons, and service businesses use voice agents to manage appointment scheduling over the phone. The agent accesses the business's calendar system through function calling or API integration, checks availability, and confirms bookings in real time during the call. This use case is well suited to voice AI because the conversation follows a relatively predictable structure while still requiring natural language understanding to handle variations in how callers express their needs.
Voice agents can initiate outbound calls to prospects, deliver a scripted pitch, answer questions, and qualify leads based on predefined criteria before routing interested prospects to human sales representatives. Bland AI's Conversational Pathways feature allows sales teams to design branching call flows that adapt based on the prospect's responses. The scalability of AI-driven outbound calling allows businesses to reach thousands of prospects simultaneously, though this use case faces particular regulatory scrutiny (see Ethical and Legal Considerations below).
In healthcare settings, voice agents perform initial patient triage by asking about symptoms, urgency, and medical history before routing the call to the appropriate department or scheduling a telehealth consultation. In corporate environments, voice agents serve as intelligent receptionists that understand caller intent and route calls to the correct department or individual, replacing rigid IVR menu trees with natural conversation.
Financial institutions and utilities deploy voice agents to make collections calls, negotiate payment plans, and process payments over the phone. The agent can access account information in real time, verify the caller's identity, and complete transactions, all while maintaining compliance with regulations such as the Fair Debt Collection Practices Act.
Voice agents conduct post-interaction surveys, customer satisfaction calls, and market research interviews. Because the agent can ask follow-up questions and probe for detail, voice surveys often yield richer qualitative data than automated text-based surveys or pre-recorded robocalls.
Building voice agents that feel natural and reliable in production requires solving several difficult engineering problems.
Conversational fluency demands that the agent respond within a window that feels natural to the caller. Research on human conversation patterns suggests that pauses longer than approximately 500 milliseconds begin to feel unnatural, and pauses beyond one second are perceived as the system being "stuck" or broken. Achieving sub-500-millisecond end-to-end latency in a cascaded pipeline requires aggressive optimization at every stage: streaming STT that emits partial transcripts, speculative LLM inference that begins generating before the user finishes speaking, and streaming TTS that starts synthesizing audio from the first output tokens.
End-to-end speech-to-speech models can achieve 200 to 300 milliseconds of latency by eliminating inter-component handoffs, but they come with the tradeoffs described in the Architecture section. ElevenLabs claims sub-100-millisecond voice latency for its Conversational AI platform, though this figure likely measures only the TTS component rather than full end-to-end latency.
Human conversations involve constant, subtle negotiation over who speaks next. Speakers use pauses, intonation changes, and filler words to signal that they are yielding the floor or holding it. Replicating this behavior in a voice agent is one of the field's hardest unsolved problems.
The specific challenge of "barge-in" detection (recognizing when a caller interrupts the agent mid-utterance) illustrates the difficulty. Most voice agents rely on Voice Activity Detection (VAD) to notice when the caller is speaking during the agent's turn. But VAD alone cannot distinguish between a genuine interruption ("Actually, never mind, I want something else"), a backchannel acknowledgment ("mm-hmm," "yeah"), ambient noise (a cough, typing, or background chatter), and an echo of the agent's own output.
Treating every detected sound as a full interruption makes the agent jittery: it constantly stops mid-sentence and restarts, creating a frustrating experience. Ignoring all sounds during agent speech makes the agent seem oblivious when the caller genuinely wants to interject. Advanced systems use a combination of VAD, acoustic echo cancellation, semantic analysis of partial transcripts, and trained classifiers to categorize detected speech as interruption, backchannel, or noise. Retell AI, for example, has developed a proprietary turn-taking model specifically designed to determine when to stop speaking and when to continue.
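The triage logic can be sketched as a small classifier over the partial transcript and the utterance duration. The thresholds and backchannel list below are illustrative placeholders; real systems combine VAD, echo cancellation, and trained classifiers as described above:

```python
def classify_barge_in(partial_transcript: str, duration_ms: int) -> str:
    """Heuristic triage of speech detected while the agent is talking.

    Thresholds and the backchannel word list are illustrative;
    production systems use trained classifiers instead.
    """
    BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "ok", "okay", "right"}
    text = partial_transcript.strip().lower()
    if not text:
        return "noise"        # VAD fired but STT heard nothing: cough, typing
    if text in BACKCHANNELS and duration_ms < 700:
        return "backchannel"  # acknowledgment: keep talking
    if duration_ms < 250:
        return "noise"        # too short to be a deliberate utterance
    return "interruption"     # stop speaking and yield the floor

print(classify_barge_in("yeah", 400))            # -> backchannel
print(classify_barge_in("wait, actually", 900))  # -> interruption
```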
Human callers convey frustration, confusion, urgency, and satisfaction through their tone, pace, and inflection. A voice agent that responds to an angry customer in a cheerful tone risks escalating the situation. Detecting caller emotion from audio signals and adapting the agent's response (both in content and delivery) remains an active area of research.
On the generation side, most TTS engines produce speech with a limited emotional range. Play.ai's PlayDialog model introduces "emotional prompting" to control the tone, pacing, and inflection of generated speech. GPT-4o's Advanced Voice Mode can express a range of emotions and adjust its delivery based on the caller's detected emotional state, representing one of the most advanced capabilities in production as of 2025.
Global enterprises require voice agents that operate across multiple languages and understand a wide range of accents. While the leading platforms claim support for 30 to 100+ languages, performance varies significantly across languages and dialects. Code-switching (when a caller mixes two languages in a single sentence) poses particular challenges for both STT and LLM components. The gpt-realtime model introduced in 2025 specifically improved its ability to switch languages within a single sentence and process alphanumeric sequences across languages.
Voice conversations are less forgiving of errors than text interactions. If a text chatbot misunderstands a query, the user can simply retype their message. In a voice conversation, misunderstandings compound: the agent may respond to something the caller did not say, the caller may not realize the misunderstanding occurred, and the conversation can drift far off track before either party recognizes the problem. Building robust error detection and graceful recovery mechanisms (such as confirming understanding before taking irreversible actions) is critical for production voice agents.
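A common guardrail for the "irreversible action" case is a confirm-then-execute wrapper: restate the pending action, then act only on a clear affirmative. The yes/no word lists below are illustrative; a production system would classify the caller's reply with the LLM rather than with substring matching:

```python
from typing import Callable

def confirm_then_execute(action_desc: str, caller_reply: str,
                         execute: Callable[[], str]) -> str:
    """Run `execute` only if the caller clearly confirmed; otherwise
    back off or re-ask. Word lists are illustrative placeholders."""
    YES = {"yes", "yeah", "correct", "confirm"}
    NO = {"no", "nope", "wait", "cancel", "wrong"}
    reply = caller_reply.strip().lower()
    if any(w in reply for w in YES) and not any(w in reply for w in NO):
        return execute()
    if any(w in reply for w in NO):
        return f"Okay, I won't {action_desc}. What should I change?"
    return f"Sorry, just to confirm: should I {action_desc}?"

result = confirm_then_execute(
    "cancel your 2 pm appointment",
    "yes, correct",
    lambda: "Done, the appointment is cancelled.",
)
print(result)
```

Ambiguous replies fall through to a re-ask rather than an action, which is the safer failure mode when a misheard "yes" could delete a booking or move money.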
ElevenLabs, founded in 2022 and headquartered in New York, has established itself as a leading platform for voice AI. Originally known for its high-fidelity text-to-speech and voice cloning technology, the company expanded into the conversational AI space in November 2024 with the launch of its Conversational AI developer platform.
The platform allows developers to build interactive voice agents powered by leading LLMs (including Claude, GPT-4, and Gemini) or custom models. It includes built-in retrieval-augmented generation (RAG) so agents can ground their answers in business-specific data. SDKs are available for JavaScript, Python, Swift, and additional languages.
In May 2025, ElevenLabs released Conversational AI 2.0, introducing multimodal input support (simultaneous text and voice), improved enterprise readiness, and enhanced agent capabilities. The platform claims sub-100-millisecond latency and supports 32 or more languages.
ElevenLabs has raised significant capital to fuel its growth. In January 2025, the company announced a $180 million Series C round co-led by a16z and ICONIQ Growth, valuing the company at $3.3 billion. In February 2026, ElevenLabs raised $500 million in a Series D round at an $11 billion valuation as the company began exploring a potential initial public offering.
The rapid advancement of AI voice technology has raised significant ethical and legal concerns that regulators, industry groups, and civil society organizations are actively working to address.
Modern TTS systems can clone a person's voice from as little as a few seconds of reference audio. While this capability has legitimate applications (such as creating personalized voice agents or preserving the voices of people with degenerative diseases), it also enables misuse. Unauthorized voice cloning has been used to create fraudulent audio of public figures, conduct phone scams impersonating family members, and generate non-consensual content.
The core ethical issue is consent. Voice data is increasingly treated as biometric information: Illinois's Biometric Information Privacy Act covers voiceprints, and the EU's GDPR classifies voice data used for identification as a special category requiring explicit consent. In Lehrman v. Lovo, Inc. (2024), professional voice actors alleged that Lovo, Inc. used their recorded voices without proper authorization to train AI models. The U.S. District Court for the Southern District of New York partially granted and partially denied a motion to dismiss, establishing that unauthorized use of voice data for AI training may support viable legal claims.
AI-generated voice deepfakes pose risks to public trust, political discourse, and individual reputation. Deepfake audio can fabricate statements an individual never made, potentially enabling defamation, market manipulation, or political interference. The EU AI Act imposes transparency obligations on deepfakes, requiring that AI-generated or manipulated audio be clearly disclosed as such, and several U.S. states have passed deepfake and voice cloning laws requiring consent and clear disclosure.
On February 8, 2024, the U.S. Federal Communications Commission (FCC) unanimously adopted a Declaratory Ruling confirming that calls made with AI-generated voices qualify as "artificial or prerecorded voice" calls under the Telephone Consumer Protection Act (TCPA). This ruling requires that callers obtain prior express consent from the called party before making AI-generated voice calls, provide identification and disclosure information about the party responsible for initiating the call, and offer opt-out mechanisms. The ruling gave State Attorneys General new enforcement tools against AI-powered robocall scams.
For businesses deploying AI voice agents for outbound calling, this ruling means that consent management and disclosure practices are legally mandatory, not optional best practices.
A growing consensus among regulators and industry participants holds that AI voice agents should disclose their non-human identity at the beginning of every interaction. California's B.O.T. Act (2019) already requires bots to disclose their artificial identity when communicating with consumers for sales or political purposes. The EU AI Act imposes similar disclosure requirements for AI systems that interact with humans. Several voice AI platforms have built disclosure features into their products, automatically playing a statement such as "This call is powered by AI" at the start of each conversation.
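Operationally, the disclosure requirement is usually implemented by prepending a fixed statement to the agent's first utterance. A minimal sketch; the wording and the opt-out phrasing are illustrative, not legal advice:

```python
DISCLOSURE = "This call is powered by AI. You can ask for a human at any time."

def opening_message(greeting: str, disclose: bool = True) -> str:
    """Prepend an AI-identity disclosure to the agent's first utterance,
    as disclosure rules like California's B.O.T. Act contemplate."""
    return f"{DISCLOSURE} {greeting}" if disclose else greeting

print(opening_message("Thanks for calling Acme Dental. How can I help?"))
```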
Voice agents may exhibit biased behavior based on the caller's accent, dialect, speech patterns, or language. STT models have been shown to perform less accurately on speakers with non-standard accents, which can lead to higher error rates and poorer service for certain demographic groups. Ensuring equitable performance across diverse speaker populations is an ongoing challenge that requires careful dataset curation, testing across demographic groups, and monitoring in production.
The conversational AI market is projected to grow from $17.97 billion in 2026 to $82.46 billion by 2034. Gartner forecasts that 40% of enterprise applications will integrate task-specific AI agents by the end of 2026, up from less than 5% in 2025. An estimated 80% of all businesses plan to integrate AI-driven voice technology into customer service operations by 2026.
In the near term, cascaded architectures continue to dominate enterprise deployments because of their reliability, debuggability, and compliance features. Speech-to-speech models are gaining ground in consumer-facing applications where natural conversational feel is prioritized over auditability. Hybrid (half-cascade) architectures are emerging as the likely medium-term standard for enterprise use, combining the latency benefits of native audio processing with the control and transparency of text-based reasoning.
Competition among platform vendors has driven down per-minute costs, with entry-level pricing dropping below $0.07 per minute for some providers.
Several trends are shaping the trajectory of AI voice agent technology.
Agentic capabilities. Voice agents are evolving from conversational interfaces into autonomous AI agents that can take complex, multi-step actions on behalf of the caller. This includes navigating multiple backend systems, making decisions based on business logic, and completing transactions end to end without human handoff.
Multimodal interaction. The boundary between voice and other modalities is blurring. ElevenLabs' Conversational AI 2.0 supports simultaneous voice and text input. Google's Gemini Live can process visual input alongside voice. Future voice agents will likely combine speech, vision, screen sharing, and text in unified interactions.
On-device processing. Anthropic has been preparing offline voice packs that allow voice processing without an internet connection for short prompts, designed for educational institutions and sensitive enterprise environments. On-device STT and TTS models are becoming feasible on modern smartphones and edge devices, which could reduce latency to near zero and address data privacy concerns.
Improved evaluation and quality assurance. Retell AI's launch of Retell Assure in late 2025, which the company describes as the first automated QA solution for voice AI, signals growing industry recognition that voice agents need systematic monitoring and evaluation. As deployment scales, automated tools for detecting hallucinations, measuring conversation quality, and identifying failure modes will become essential.
Cost reduction. The cost of operating voice agents is declining as competition intensifies among STT, LLM, and TTS providers. Models like Play.ai's Play 3.0 mini and smaller open-source alternatives are making production voice agents accessible to smaller businesses.