Vapi
Last reviewed
May 6, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 4,516 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 6, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 4,516 words
Add missing citations, update stale details, or suggest a clearer explanation.
Vapi is a voice AI orchestration platform that lets software developers build, deploy, and scale AI phone agents through a programmable API. Founded in 2023 by Jordan Dearsley and Nikhil Gupta, the company emerged from a pivot of their earlier Y Combinator-backed startup, Superpowered. Vapi functions as a middleware layer between the raw components of a voice AI system: a speech-to-text (STT) transcriber, a large language model (LLM), and a text-to-speech (TTS) voice engine. Developers configure which providers fill each role, and Vapi handles the real-time orchestration, turn detection, interruption management, and telephony routing that would otherwise require months of engineering work. By December 2024, more than 100,000 developers were building on the platform, which was processing over 400,000 calls per day. The company raised a $20 million Series A round led by Bessemer Venture Partners in December 2024 at a reported valuation of $130 million.
Vapi operates on a usage-based model, charging $0.05 per minute as an orchestration fee on top of provider costs passed through at cost. The company has been described as a "Twilio for AI agents" in investor materials, reflecting its position as programmable infrastructure that developers compose into applications rather than a finished end-user product. Vapi's closest analogy is middleware: it does not own the underlying AI models or voice synthesis technology, but provides the runtime that connects them, manages the real-time audio pipeline, and abstracts away the latency-sensitive engineering challenges that make voice AI hard to build reliably.
Jordan Dearsley grew up in Canada and studied computer science before working in product and engineering roles at various technology companies. His co-founder, Nikhil Gupta, focused on infrastructure engineering. The two met prior to their first YC venture and founded Superpowered, an AI-powered meeting notetaker that captured notes from live audio without requiring a recording bot to join the call.
Dearsley has described his background as coming from product rather than pure infrastructure engineering, which shaped Vapi's positioning. He wanted a platform that removed the infrastructure complexity from voice AI development without requiring developers to become specialists in real-time audio systems. That philosophy is reflected in the company's tagline: "voice AI for developers."
Superpowered entered Y Combinator's Winter 2021 batch. The product gained traction and reached roughly 10,000 weekly active users and around $500,000 in annual revenue by mid-2023. Despite this early success, Dearsley and Gupta concluded that the meeting productivity market had too many well-resourced competitors and that differentiation would be hard to sustain.
The team went through several product experiments before landing on voice AI infrastructure. One experiment was a therapy chatbot called Harmon that used voice interaction. While building it, the founders ran directly into the core technical problem that would define their next company: assembling and running the real-time audio pipeline for a voice agent was extraordinarily difficult. Latency between components, noise filtering, turn-taking logic, and telephony integration each represented weeks of engineering work. Dearsley later described this experience as the direct motivation for Vapi. "Infrastructure itself is not something that people should spend time on," he said in a 2025 interview. "I don't think anyone that's not running real-time audio systems should be running real-time audio systems."
In August and September 2023, Dearsley and Gupta launched Vapi as a voice API platform. The company formally announced the pivot from Superpowered in November 2023, at which point it had already shipped an early version of the product. The seed round of $2.1 million, raised alongside the pivot, included Kleiner Perkins and Abstract Ventures as investors.
At launch, Vapi's main value proposition was latency. At the time, most developers assembling voice agents from raw components were seeing round-trip latency of 1.5 to 2.5 seconds between when a user finished speaking and when the agent responded. Vapi's early architecture brought this closer to 1 second through streaming optimizations and a custom turn-detection model. Dearsley cited latency as the primary early differentiator: "that latency piece was really our differentiator at the time and the reason that people would use us rather than roll it themselves."
Vapi is organized as a YC W21 company in Y Combinator's records, reflecting the original Superpowered batch. The company is headquartered in San Francisco.
Vapi has raised approximately $22 million in total external funding across two disclosed rounds.
The seed round of $2.1 million closed in late 2023, shortly after the pivot from Superpowered. Investors included Kleiner Perkins, Abstract Ventures, and Y Combinator.
On December 12, 2024, Vapi announced a $20 million Series A led by Bessemer Venture Partners. Additional participants included Abstract Ventures (returning), AI Grant, Y Combinator, Saga Ventures, and investor Michael Ovitz. The round valued the company at approximately $130 million post-money.
In its announcement, Bessemer described Vapi as representing "a 10x improvement on the development experience for voice agents" and cited the company's speed of execution, including shipping feature requests from a Friday discussion to production by the following Saturday. The firm positioned its investment thesis around the observation that telephony remains the dominant communication channel for high-stakes, time-sensitive interactions in healthcare, insurance, legal services, and logistics, and that voice AI infrastructure was entering what Bessemer called a "Cambrian explosion."
Vapi stated it would use the Series A proceeds to expand its engineering team, scale infrastructure, and deepen enterprise sales.
The company's revenue trajectory is notable for a two-year-old startup. Independent estimates published in 2025 pegged Vapi's annual recurring revenue at approximately $4.5 million in 2024 and $8 million in 2025, reflecting rapid growth from a near-zero baseline at its November 2023 public launch. By December 2024, over 100,000 developers had signed up for the platform. These figures come from third-party revenue intelligence services rather than official company disclosures.
Vapi's architecture has three primary layers: the core pipeline of STT, LLM, and TTS; the orchestration models that run in parallel with that pipeline; and the telephony or WebRTC transport layer that handles the audio connection.
Building a voice agent without an orchestration layer requires solving several difficult engineering problems simultaneously. The speech-to-text model must stream audio in real time and produce low-latency transcriptions. The language model must receive incremental transcriptions, generate a response, and stream that response at a token level. The text-to-speech engine must begin synthesizing audio before the full response is available. Every component must communicate over a shared timing bus with round-trip budgets measured in tens of milliseconds. On top of that, the system must handle noisy audio, detect when the caller wants to interrupt, manage turn-taking conventions that differ from text conversation, and route calls over a telephone or WebRTC network. Vapi packages all of this as a managed service accessed through a REST API and a real-time events system.
Every Vapi call runs through three configurable model slots:
All three slots are swappable per-call through the API. A developer can run Deepgram for transcription, Groq-hosted Llama for the language model, and Cartesia for TTS in one call, then switch any component on the next call. Vapi also supports bring-your-own API keys across all three layers, which means provider costs pass through directly and the developer maintains their own billing relationship with each provider.
On top of the core pipeline, Vapi runs a suite of proprietary real-time models that the company groups under the name "orchestration layer." These models are what justify the $0.05 per minute platform fee on top of raw provider costs. They include:
Turn detection (endpointing). Rather than using a silence timeout to determine when a caller has finished speaking, Vapi employs a custom fusion audio-text model that analyzes both the acoustic properties of the caller's voice and the semantic content of what was said. This allows the system to distinguish between a natural mid-sentence pause and a genuine end-of-turn, which reduces false triggers and improves perceived conversational fluency.
Interruption handling (barge-in). A separate custom model distinguishes genuine interruptions (the caller saying "stop" or cutting off the agent mid-sentence) from backchannel signals. Backchannel signals are short affirmations like "yeah," "uh-huh," or "got it" that human listeners produce to signal they are still engaged without intending to take the floor. When Vapi detects a backchannel signal, it passes that information to the LLM as context. When it detects a true interruption, it notes the point in the agent's speech where it was cut off and informs the LLM, so the model can resume coherently or adapt its response.
Audio filtering. Vapi runs two parallel audio filtering models. A noise filter removes ambient sounds including music and traffic while preserving speech content. A background voice filter isolates the primary speaker and suppresses other voices, which matters in environments like call centers, open offices, or households where multiple people may be speaking nearby.
Emotion detection. A proprietary model extracts emotional inflection from the caller's voice and passes that signal to the LLM as context. This allows the model to adapt its tone or escalation behavior based on whether the caller sounds calm, frustrated, or distressed. Bessemer cited this capability as a key technical differentiator in its investment announcement.
Filler injection. Because LLMs produce streamed text that begins with formal language, Vapi applies a custom model to inject natural filler sounds and conversational phrases in real-time. This avoids the uncanny valley problem of an agent that pauses completely in silence while processing and then begins speaking with formal language.
Backchanneling. A fusion audio-text model detects appropriate moments during the caller's speech to insert brief affirmations from the agent. The model selects contextually suitable responses based on the content being spoken.
All six of these models run in parallel at sub-50 millisecond latency budgets. The full voice-to-voice round trip, from when the caller stops speaking to when the agent's first audio byte plays, targets between 500 and 700 milliseconds over WebRTC. Over telephone networks, additional network latency from Twilio or Telnyx adds roughly 400 to 600 milliseconds on top of that.
Vapi supports two transport modes. The WebRTC mode connects browser and mobile apps directly to Vapi's servers using the WebRTC protocol, achieving the lowest possible latency for web applications. The telephony mode routes calls through phone infrastructure using either Twilio or Telnyx as the SIP carrier, or a developer's own SIP trunk.
Vapi maintains first-party integrations across all three pipeline layers and the telephony layer. The table below lists the primary supported providers as of early 2025.
| Layer | Provider | Notes |
|---|---|---|
| Speech-to-text | Deepgram Nova-2, Nova-3 | Default option; low latency |
| Speech-to-text | AssemblyAI Universal-Streaming | Lowest cost option at ~$0.00025/min |
| Speech-to-text | Gladia | Multilingual specialization |
| Speech-to-text | OpenAI Whisper | Via OpenAI API |
| Speech-to-text | Speechmatics | High accuracy for accents |
| Language model | OpenAI GPT-4o, GPT-4o mini | Most common default |
| Language model | Anthropic Claude (Sonnet, Haiku) | Strong instruction following |
| Language model | Google Gemini 1.5 Flash, Pro | Long context capability |
| Language model | Groq (Llama, Mixtral) | Ultra-low inference latency |
| Language model | Custom endpoint | Developer's own server via HTTP |
| Text-to-speech | ElevenLabs | High expressiveness; higher cost |
| Text-to-speech | Cartesia Sonic | Low latency, natural voices |
| Text-to-speech | Deepgram Aura | Integrated with Deepgram STT |
| Text-to-speech | PlayHT | Voice cloning support |
| Text-to-speech | LMNT | Fast streaming synthesis |
| Text-to-speech | Azure Neural TTS | Enterprise compliance |
| Telephony | Twilio | Most widely used; higher cost |
| Telephony | Telnyx | Lower cost alternative |
| Telephony | Vonage | Available option |
| Telephony | Custom SIP trunk | Bring your own carrier |
The modularity is intentional. Dearsley has described the design philosophy as: "That's why our approach has always been very modular," noting that the right combination of providers depends on balancing "three variables: performance, latency, and cost" and that the optimal stack changes as individual providers improve.
Developers can provision phone numbers directly through the Vapi dashboard, which purchases numbers via Twilio on the backend. Numbers are currently limited to United States area codes through this native purchase flow; international numbers require importing from an external Twilio or Telnyx account.
Vapi supports both inbound and outbound calling. For inbound calls, a phone number is associated with an assistant configuration. When the number receives a call, Vapi routes it through the configured STT, LLM, and TTS pipeline. For outbound calls, developers trigger calls via the API, specifying the target number and the assistant configuration to use.
Vapi also offers a soft-phone capability through its Web SDK, which uses WebRTC to connect browser-based callers directly to an assistant without requiring a traditional phone number. This is commonly used for website chat widgets that offer voice as an option.
Custom SIP trunking (bring-your-own-carrier) is available on enterprise plans, allowing companies that already have existing telephony infrastructure to route calls into Vapi without changing carriers.
Vapi's pricing model is layered. The platform charges a base orchestration fee of $0.05 per minute on top of whatever the underlying providers charge. Provider costs are passed through approximately at cost.
A typical call stack in 2025 breaks down roughly as follows:
| Component | Provider | Approximate cost per minute |
|---|---|---|
| Platform fee | Vapi | $0.05 |
| Speech-to-text | Deepgram Nova-2 | $0.01 |
| Language model | GPT-4o mini | $0.05–$0.10 |
| Text-to-speech | Cartesia Sonic | $0.01–$0.02 |
| Text-to-speech | ElevenLabs Flash v2.5 | $0.03–$0.04 |
| Telephony | Twilio outbound | $0.008–$0.014 |
| Telephony | Telnyx outbound | $0.005–$0.008 |
| Total (typical range) | $0.15–$0.30 |
The advertised $0.05 per minute figure covers only the orchestration layer. Actual all-in costs typically land between $0.15 and $0.33 per minute depending on which LLM and TTS providers are selected and whether calls are inbound or outbound. Premium voice providers like ElevenLabs and high-capability LLMs like GPT-4o (full version) push costs toward the upper end of that range.
Vapi offers a pay-as-you-go plan with no monthly commitment, limited to 10 concurrent calls. Enterprise plans include volume discounts, higher concurrency limits, dedicated support, and add-ons including HIPAA compliance (priced at approximately $1,000 per month) and a signed Data Processing Agreement (DPA).
New accounts receive approximately $10 in credits to test the platform before spending.
Vapi launched Squads in November 2025 as a way to build voice AI systems that involve multiple specialized assistants within a single call. The problem Squads addresses is that as a voice agent's responsibilities grow, cramming all functionality into a single prompt and tool set makes the system increasingly fragile and unreliable. Squads allow developers to define a pool of specialized assistants, each with a focused prompt and tool set, and configure routing logic that hands off between them during the conversation.
From the caller's perspective, a Squad behaves as a single continuous call. One phone number, one conversation thread, one transcript. Behind the scenes, a billing assistant might handle the opening of a call before handing off to a scheduling assistant, which then transfers to a technical support specialist. Each handoff happens without dead air or noticeable interruption.
Developers control the context passed between assistants at each handoff point. The options include passing no prior context (useful for sensitive operations like payment collection), passing only the last N messages, or passing the full conversation history. Routing logic is expressed through standard LLM tool calls and explicit prompts rather than black-box decision trees, which makes it straightforward to debug and audit.
Vapi provides a canvas-based visual builder for designing Squad flows, where assistants appear as nodes and routing conditions appear as labeled edges between them.
Fleetworks, a company that builds AI agents for the transportation industry, is one of the most publicly cited users of Squads. The company runs assistants handling dispatch, scheduling, billing, and support as a coordinated Squad, processing more than 240,000 calls per day. The company has stated that breaking these workflows across specialized assistants rather than building one monolithic agent was what allowed them to scale reliably.
Latency is one of Vapi's primary competitive claims. The platform's own documentation targets a voice-to-voice round trip of 500 to 700 milliseconds over WebRTC under optimal conditions. Published benchmarks from third parties show how component selection affects this number in practice.
AssemblyAI published a configuration that achieved approximately 465 milliseconds end-to-end over WebRTC using:
This configuration required deliberately disabling Vapi's default turn detection padding settings. The default "startSpeakingPlan" configuration adds a pause before the agent speaks to reduce false starts, which can contribute more than 1.5 seconds of latency on its own if not tuned. Disabling this feature requires prompt-level and configuration-level adjustments but is documented in Vapi's developer documentation.
Over telephone (PSTN) networks, the same configuration produces approximately 965 milliseconds due to the additional 400 to 600 milliseconds of latency introduced by Twilio's network. This is important context when comparing Vapi's stated latency targets with competitors: the 465 ms benchmark is a best-case WebRTC number achieved with a specifically tuned low-latency stack. Production telephony deployments with default configurations typically see round trips of 800 to 1,500 milliseconds.
For comparison, human phone conversations are generally considered natural at roughly 700 milliseconds of round-trip response time. Latency above 1,000 milliseconds creates noticeable pauses that many callers interpret as hesitation or connection problems.
Vapi's publicly disclosed customers span healthcare, transportation, customer service, and fintech verticals.
FleetWorks automates communications between transportation brokers and truck drivers. The company uses Vapi Squads to run assistants covering dispatch coordination, scheduling, billing inquiries, and driver support within single calls. Fleetworks processes over 240,000 calls per day through the platform and has publicly stated the switch to Vapi's managed infrastructure saved more than 100 engineering hours per month that would otherwise go toward maintaining voice pipeline code.
Luma Health is a healthcare patient engagement platform that uses Vapi for automated outbound calls around appointment reminders, scheduling changes, and care gap outreach. Healthcare deployments require HIPAA compliance, which Vapi supports through its enterprise plan with a signed Business Associate Agreement.
Ellipsis Health uses voice AI for mental health screening, where the platform captures spoken responses and routes them through clinical analysis models. The use of Vapi's emotion detection models is relevant to this application.
Mindtickle is a sales readiness platform that uses Vapi-powered voice agents for sales training simulations, where sales representatives practice calls with AI counterparts.
Beyond named customers, Vapi is widely used for outbound sales calling, appointment booking, lead qualification, customer service escalation routing, and post-call survey collection. The platform supports more than 100 languages and accents, which makes it viable for international deployments.
The voice AI infrastructure market includes several competing developer platforms. The main alternatives to Vapi are Retell AI, Bland AI, and the open-source framework Pipecat.
| Platform | Model | Pricing | Latency | Compliance | Target users |
|---|---|---|---|---|---|
| Vapi | Managed SaaS, modular stack | $0.05/min + providers (~$0.15–$0.33 all-in) | 500–700ms (WebRTC); 800–1,500ms (telephony) | SOC 2 Type II; HIPAA at $1,000/mo add-on | Developers and engineering-led teams |
| Retell AI | Managed SaaS, all-inclusive | $0.07/min (all-inclusive) | 300–500ms | SOC 2 Type I/II; HIPAA with self-service BAA | Healthcare, insurance, regulated industries |
| Bland AI | Managed SaaS | $0.11–$0.14/min | ~800ms average | Not specified | Outbound sales, SDR automation |
| Pipecat | Open-source Python framework | Free (self-hosted); Pipecat Cloud $0.01–$0.03/min | Depends on infrastructure | Self-managed | Technical teams, edge deployment |
Vapi vs Retell AI. Retell presents the most direct competition. Both platforms target developers building voice agents for business phone calls. Retell's all-inclusive pricing bundles STT, LLM, and TTS into a flat per-minute rate, which simplifies cost forecasting but removes the ability to substitute providers. Retell's advertised latency is lower at 300 to 500 milliseconds and the platform includes native warm call transfer and branded call display, features Vapi does not natively offer. Retell's compliance story is also simpler for regulated industries: its BAA is available through a self-service portal, while Vapi's HIPAA support requires an enterprise contract and an additional fee. Vapi's advantage is modularity. Teams that need a specific LLM provider, a specific voice, or a specific STT service that Retell does not offer can configure it through Vapi's bring-your-own model.
Vapi vs Bland AI. Bland targets outbound calling automation, with a visual interface and built-in features for call logging, confidence ratings, and call summaries that require manual configuration in Vapi. Bland's per-minute pricing is higher on the base plan but includes more built-in functionality. Vapi offers more flexibility for custom architectures and non-standard deployments. Bland's built-in memory allows agents to recall information from previous calls with the same contact, a feature that requires custom implementation in Vapi.
Vapi vs Pipecat. Pipecat is an open-source Python framework maintained by Daily.co. It provides similar STT/LLM/TTS pipeline orchestration but requires developers to manage their own infrastructure, handle scalability themselves, and implement real-time audio handling at a lower level. There is no orchestration fee, but engineering overhead is substantially higher. Pipecat is appropriate for teams with the infrastructure expertise to run their own voice systems or for edge deployment scenarios. Vapi occupies the managed infrastructure position and abstracts away that complexity at the cost of the platform fee and reduced control over the underlying stack.
Vapi has received positive reception from the developer community for reducing the time required to build a production voice agent from weeks to days. The platform's Squads feature received particular attention when Fleetworks published figures showing 240,000 daily calls running through a multi-agent Squad configuration.
Bessemer Venture Partners' investment memo described Vapi's developer experience as ten times better than the alternative of building voice infrastructure in-house, citing the team's shipping velocity as unusual even by startup standards. The firm's memo specifically called out Vapi's proprietary real-time audio model that detects emotional inflections and the ability to deliver enterprise-grade scalability, reliability, and fault tolerance as differentiating technical properties.
Third-party reviews have noted that the platform's modular approach gives experienced developers a high degree of control that opinionated platforms like Retell do not provide. The 100+ language support and the ability to bring custom LLM endpoints are cited as advantages for international deployments and organizations with proprietary models.
The market context for Vapi's growth is worth noting. Bessemer's 2024 investment thesis described telephony as a 150-year-old technology still critical for conveying nuanced information, handling time-sensitive transactions, and reaching demographics that are not comfortable with web or mobile interfaces. Healthcare, legal services, home services, insurance, and logistics were specifically cited as sectors where phone calls remain the primary customer service channel. Vapi's growth reflects developer adoption of a tool positioned to serve this segment of the economy with AI automation.
Vapi has received criticism across several recurring areas.
Pricing opacity. The advertised $0.05 per minute figure understates actual costs. All-in production costs typically run $0.15 to $0.33 per minute when STT, LLM, TTS, and telephony are included. Managing billing relationships with four to five separate vendors simultaneously adds accounting overhead that is absent from all-inclusive platforms. Multiple developer reviews have described frustration with cost forecasting.
Technical complexity. Vapi's flexibility creates a corresponding configuration burden. Non-technical users find the API-first interface difficult to approach. Even experienced developers report that achieving reliable behavior requires extensive prompt engineering and tuning of the orchestration model parameters. The default configurations are not optimized for every use case, and some latency improvements require disabling features that are on by default.
Platform stability. Developer community reports, including posts on the Vapi Discord and Trustpilot reviews, document instances where platform updates broke working assistant configurations. Support response times have been described as slow, with the primary support channel being Discord. Several reviews from 2024 described a period when documentation lagged behind feature changes.
HIPAA complexity. HIPAA compliance is available only on enterprise plans at an additional fee of approximately $1,000 per month. The compliance configuration requires routing call recordings to a developer-managed S3 bucket with encryption at rest, disabling Vapi's default data persistence (which removes call logs and transcripts from Vapi's own storage), and signing a Business Associate Agreement as part of the enterprise contract process. For teams in regulated industries, this setup is more cumbersome than platforms that include HIPAA compliance at lower tiers.
Phone number limitations. The native phone number purchase flow supports only US numbers. International numbers require an existing Twilio or Telnyx account with the number already provisioned, then imported into Vapi.
No on-premise hosting. Vapi runs exclusively in the cloud. Organizations that require on-premise deployment for data sovereignty or security reasons cannot use the platform.
Concurrent call limits on free tier. The pay-as-you-go plan caps concurrent calls at 10, which is sufficient for development but constrains production deployments that see traffic spikes. Higher concurrency limits require an enterprise agreement.