Pipecat is an open-source Python framework for building real-time voice and multimodal conversational AI agents. Created by Daily, a WebRTC infrastructure company, and first released in May 2024, Pipecat provides a pipeline-based architecture that chains together speech recognition, large language models, and text-to-speech synthesis into a single coherent data flow. The framework is licensed under the BSD 2-Clause license and supports over 100 AI services, APIs, and transport protocols. As of early 2026 it had accumulated nearly 12,000 GitHub stars, more than 2,000 forks, and contributions from over 130 developers, making it one of the most widely adopted open-source frameworks for voice agent orchestration.
Daily was founded in 2016 by Kwindla Hultman Kramer and Nina Kuruvilla with a focus on developer-facing real-time audio and video infrastructure built on WebRTC. The company launched its video API in 2019 and later raised a $40 million Series B, with total funding reaching roughly $62 million. Daily's core business involved providing WebRTC-based media transport to developers building video conferencing, telehealth, and live collaboration tools.
In 2023 and early 2024, as large language models became capable enough for real-time spoken conversation, Daily's engineering team observed a recurring pattern among customers: developers were independently solving the same hard infrastructure problems to wire together speech-to-text, LLMs, and text-to-speech in a low-latency pipeline. Each team was reinventing frame buffering, voice activity detection, interruption handling, and transport abstraction from scratch. Kramer and the Daily team decided to extract and generalize their own internal tooling into a reusable open-source library.
Pipecat was publicly released in May 2024. The MarkTechPost announcement described it as a framework designed to simplify the creation of voice and multimodal conversational agents, initially supporting ElevenLabs and OpenAI for speech synthesis, Deepgram for transcription, and Daily's own WebRTC transport. The name reflects the core design metaphor: audio, text, and video travel through the system as discrete data objects (frames) that flow through a chain of processors (a pipe).
Pipecat organizes processing around a directed pipeline of frame processors. Every unit of data in the system is a frame object. Audio arriving from a microphone becomes an AudioRawFrame. A transcription becomes a TextFrame. A complete LLM response is bracketed by LLMFullResponseStartFrame and LLMFullResponseEndFrame markers. Control signals use ControlFrame, and high-priority system signals such as start, stop, and interruption use SystemFrame objects that bypass normal ordering.
Each FrameProcessor receives frames on its input, optionally transforms them, and passes the result to the next processor in the chain. Processors are doubly linked (each maintains a _next and _prev reference), which allows frames to travel both downstream and upstream through the pipeline. A standard voice agent pipeline takes roughly this shape:
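The sketch below is a minimal illustration of that shape. It assumes the Deepgram, OpenAI, Cartesia, and Daily plugins discussed later in this article; import paths and parameter names have shifted across releases, so treat it as indicative rather than copy-paste ready.

```python
# Minimal cascaded voice agent pipeline (sketch). Assumes the Deepgram, OpenAI,
# Cartesia, and Daily plugins; exact module paths vary between Pipecat releases.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def run_bot(room_url: str, token: str):
    transport = DailyTransport(
        room_url, token, "Voice Bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    stt = DeepgramSTTService(api_key="...")                   # audio -> TextFrame
    llm = OpenAILLMService(model="gpt-4o")                    # context -> streamed response
    tts = CartesiaTTSService(api_key="...", voice_id="...")   # text -> AudioRawFrame

    # Context aggregators sit on either side of the LLM and keep the conversation
    # history: user turns are collected upstream, assistant turns downstream.
    context = OpenAILLMContext([{"role": "system", "content": "Be brief and friendly."}])
    aggregators = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),        # audio in over WebRTC
        stt,                      # speech-to-text
        aggregators.user(),       # add the user's turn to the context
        llm,                      # language model
        tts,                      # text-to-speech
        transport.output(),       # audio out over WebRTC
        aggregators.assistant(),  # record what the bot actually said
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```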
The framework separates LLM context management into specialized aggregators. The LLMUserAggregator collects user inputs including images for multimodal scenarios. The LLMAssistantAggregator captures the model's output. This clean separation makes it straightforward to inspect or modify conversation context between turns.
Pipelines can be branched and merged using the Pipeline and ParallelPipeline containers. A developer can run transcription and sentiment analysis concurrently on the same audio, for instance, without one blocking the other. The async Python runtime (asyncio) underpins all processor execution, keeping I/O-bound AI API calls non-blocking.
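A branched pipeline might look like the sketch below. It assumes ParallelPipeline accepts one list of processors per branch, reuses the transport and service objects from the earlier sketch, and uses a hypothetical SentimentTagger processor as the second branch.

```python
# Branching sketch: run transcription and a hypothetical audio-sentiment
# processor concurrently on the same input frames.
from pipecat.frames.frames import AudioRawFrame, Frame
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SentimentTagger(FrameProcessor):
    """Hypothetical processor that inspects audio frames as they pass through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, AudioRawFrame):
            ...  # hand the audio to a sentiment model without blocking the STT branch
        await self.push_frame(frame, direction)


pipeline = Pipeline([
    transport.input(),
    ParallelPipeline(
        [stt],                # branch 1: transcription
        [SentimentTagger()],  # branch 2: concurrent analysis of the same audio
    ),
    llm,
    tts,
    transport.output(),
])
```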
Knowing when a user has finished speaking is harder than it sounds. Silence-based VAD detects the end of audio energy but misclassifies filled pauses, short hesitations, and sentence fragments as completed turns, causing the bot to respond too early. Pipecat ships with pluggable VAD backends including Silero, AIC, and Krisp VIVA, configurable via VADParams.
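VAD configuration is typically attached to the transport. The sketch below assumes the Silero backend and the VADParams fields shown, which may differ slightly between releases.

```python
# VAD configuration sketch using the Silero backend.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    "https://example.daily.co/room", None, "Voice Bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            # Treat the turn as finished only after 0.8 s of silence,
            # and require reasonably confident speech detection.
            params=VADParams(stop_secs=0.8, confidence=0.7),
        ),
    ),
)
```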
For production deployments, Pipecat offers Smart Turn, an open-source semantic turn detection model trained to analyze raw audio waveforms rather than transcriptions. Smart Turn v3, hosted on Hugging Face, determines whether the speaker has genuinely finished a thought by combining acoustic and linguistic cues. It is designed to run after a VAD silence detection event and completes inference in approximately 65 milliseconds on a standard CPU instance. Smart Turn is the default turn stop strategy in current Pipecat releases and has been tested extensively on Pipecat Cloud.
Interruption handling is built into the pipeline. When a user starts speaking while the bot is talking, the framework broadcasts an InterruptionFrame that cancels downstream tasks and stops audio playback. Frames carrying critical operations (such as an acknowledgment tone) can be marked with UninterruptibleFrame so they are not cut short.
One of Pipecat's primary design goals is vendor neutrality. The framework ships with plugins for over 100 services organized into five categories.
Pipecat supports more than 20 speech-to-text providers. The most commonly used are Deepgram (which offers a streaming WebSocket API with low latency and speaker diarization), AssemblyAI, Google Cloud Speech-to-Text, Microsoft Azure Cognitive Services, OpenAI Whisper, AWS Transcribe, Gladia, Soniox, Speechmatics, and Groq's Whisper endpoint. NVIDIA Riva Parakeet NIM is also supported as part of the NVIDIA partnership described below.
LLM integrations cover the major frontier model providers: OpenAI (GPT-4o and its variants), Anthropic (Claude), Google Gemini, AWS Bedrock (including Amazon Nova), Mistral, Groq, DeepSeek, and xAI Grok. The framework uses a common LLMService interface so that swapping one provider for another requires a single-line code change rather than pipeline restructuring. NVIDIA LLM NIMs are accessible through the framework's NVIDIA module.
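In practice the swap looks roughly like this; the import paths and model identifiers below are illustrative.

```python
# Provider swap sketch: both services implement the common LLMService interface,
# so nothing else in the pipeline changes.
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.openai.llm import OpenAILLMService

llm = OpenAILLMService(model="gpt-4o")
# ...or swap the provider with a single line:
llm = AnthropicLLMService(model="claude-3-5-sonnet-20241022")
```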
Pipecat supports more than 30 TTS providers including ElevenLabs, Cartesia, Google Text-to-Speech, OpenAI TTS, Azure Neural Voices, AWS Polly, Deepgram Aura, and NVIDIA FastPitch-HifiGAN NIM. Runtime voice parameter updates (changing speaker ID, speed, or emotion mid-conversation) are supported for Cartesia, ElevenLabs Realtime, Deepgram, and several others.
For applications that prefer to bypass the STT/LLM/TTS cascade entirely, Pipecat integrates with natively multimodal APIs that accept audio input and return audio output directly. These include the OpenAI Realtime API, Google Gemini Multimodal Live, AWS Nova Sonic, and Ultravox. The speech-to-speech path typically offers lower latency than the three-model cascade because it eliminates two serialization round trips, though it trades away the flexibility to mix providers.
Transport plugins abstract the underlying network protocol. Supported transports include Daily WebRTC, LiveKit WebRTC, Twilio Media Streams, Telnyx, Plivo, Vonage, Exotel, raw WebSocket server, FastAPI WebSocket, WhatsApp via Meta's Cloud API, and a local audio transport for development. The framework's BaseInputTransport and BaseOutputTransport provide a uniform frame-based interface across all of these, so pipeline code does not need to change when switching from a WebSocket prototype to a WebRTC production deployment.
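The practical effect is that the pipeline body stays fixed while the endpoints change. The sketch below assumes the import paths shown and reuses the stt/llm/tts services from the earlier example.

```python
# Transport portability sketch: the pipeline only touches transport.input()
# and transport.output(), regardless of the protocol underneath.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.transports.services.daily import DailyParams, DailyTransport
# from pipecat.transports.network.fastapi_websocket import (
#     FastAPIWebsocketParams, FastAPIWebsocketTransport,
# )


def build_pipeline(transport):
    return Pipeline([transport.input(), stt, llm, tts, transport.output()])


daily = DailyTransport(
    "https://example.daily.co/room", None, "Voice Bot",
    DailyParams(audio_in_enabled=True, audio_out_enabled=True),
)
pipeline = build_pipeline(daily)
# Moving to a WebSocket prototype means constructing a FastAPIWebsocketTransport
# instead; build_pipeline() is untouched.
```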
Pipecat also connects to video avatar services: HeyGen, Tavus, and Simli can render a photorealistic talking head synchronized to the agent's audio output. Vision processing via Moondream allows agents to analyze still images or video frames passed into the pipeline alongside speech.
WebRTC is the transport layer of choice for production voice agents because it was designed specifically for real-time audio streaming over the public internet. Unlike WebSockets, which run over TCP and guarantee ordered delivery at the cost of head-of-line blocking and latency spikes when packets are lost or arrive out of order, WebRTC uses RTP (Real-time Transport Protocol) and deliberately trades delivery guarantees for consistent timing. Network jitter buffers and packet loss concealment are handled at the browser or SDK layer rather than the application layer.
Kwindla Kramer has noted in interviews that WebSockets cause latency spikes in real production environments that do not appear during development, and that roughly 15% of production calls show unexpected disconnects when WebSocket-based voice agents encounter congested networks. This background informed Pipecat's strong default preference for WebRTC transport.
Daily, as a company that has operated WebRTC infrastructure since 2016, ships the Daily transport as the best-supported and most tightly integrated option in Pipecat. The Daily SDK handles TURN server negotiation, adaptive bitrate, audio codec selection (Opus by default), and echo cancellation, all without additional configuration in user code. Pipecat also ships a SmallWebRTCTransport for lightweight deployments and a LiveKit WebRTC transport for teams already invested in the LiveKit ecosystem.
For telephony workloads where WebRTC is not available, PSTN and SIP dial-in and dial-out are handled through provider-specific WebSocket transports (Twilio's Media Streams protocol, Telnyx's similar offering, etc.).
Unstructured LLM conversations work well for open-ended chat but are fragile for business workflows that must follow a defined sequence: collect a name, verify a date, confirm an order, escalate if a condition is met. The model may jump ahead, omit required steps, or fail to enforce branching conditions reliably when given a single monolithic system prompt.
Pipecat Flows is a companion library (distributed as pipecat-ai-flows on PyPI) that adds structured conversation state management. Flows models a conversation as a directed graph where each node represents a conversation state with its own task messages, functions, and transition conditions. The FlowManager maintains a current_node and a persistent state dictionary across the session.
Node transitions are triggered by LLM function calls rather than pattern matching on transcripts, which keeps the logic readable and easy to audit. The library supports three context management strategies when transitioning between nodes: APPEND (accumulate all previous context), RESET (clear context for a fresh prompt), and RESET_WITH_SUMMARY (clear context but inject an AI-generated summary of prior exchanges). This gives developers control over context window usage in long multi-turn workflows.
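A minimal two-node graph, sketched against the dict-based flow_config format of pipecat-ai-flows, looks roughly like this; field names such as task_messages and transition_to follow current Flows examples but may differ between releases.

```python
# Two-node Pipecat Flows sketch: an intake node that collects a name and
# transitions to a confirmation node via an LLM function call.
from pipecat_flows import FlowArgs, FlowManager, FlowResult


async def collect_name(args: FlowArgs) -> FlowResult:
    # Handlers run when the LLM calls the function; "transition_to" below
    # moves the conversation to the next node.
    return {"status": "success", "name": args["name"]}


flow_config = {
    "initial_node": "intake",
    "nodes": {
        "intake": {
            "task_messages": [
                {"role": "system", "content": "Ask the caller for their full name."}
            ],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "collect_name",
                    "handler": collect_name,
                    "description": "Record the caller's name",
                    "parameters": {
                        "type": "object",
                        "properties": {"name": {"type": "string"}},
                        "required": ["name"],
                    },
                    "transition_to": "confirm",
                },
            }],
        },
        "confirm": {
            "task_messages": [
                {"role": "system", "content": "Read the name back and confirm it."}
            ],
            "functions": [],
        },
    },
}

# flow_manager = FlowManager(task=task, llm=llm,
#                            context_aggregator=aggregators, flow_config=flow_config)
# await flow_manager.initialize()
```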
Typical applications of Pipecat Flows include structured intake forms, multi-step customer support escalations, appointment booking sequences, and interactive voice response trees that blend scripted paths with open LLM responses.
In 2024, Daily and NVIDIA announced a collaboration to publish the NVIDIA AI Blueprint: Voice Agents for Conversational AI. The blueprint is a reference architecture for enterprise voice agent deployment that combines Pipecat's pipeline orchestration with NVIDIA NIM microservices.
The reference stack uses NVIDIA Riva Parakeet NIM for automatic speech recognition, NVIDIA Llama 3.3 70B Instruct NIM as the language model, and NVIDIA FastPitch-HifiGAN NIM for speech synthesis. NVIDIA's VP Justin Boitano described the collaboration as enabling developers to create sophisticated, real-time conversational AI experiences with unprecedented ease and flexibility.
NVIDIA extended the integration further through the ACE (Avatar Cloud Engine) Controller microservice, which uses Pipecat as its underlying orchestration framework. The NVIDIA Pipecat library adds frame processors specific to avatar interaction: Audio2Face3DService (which drives a 3D facial animation model from audio), AnimationGraphService, FacialGestureProviderProcessor, and PostureProviderProcessor. This stack enables photorealistic digital humans that lip-sync and gesture in real time, targeting use cases in gaming NPCs, virtual retail assistants, and enterprise training simulations.
The blueprint and ACE Controller are available through the NVIDIA AI catalog and GitHub (github.com/NVIDIA/voice-agent-examples), and are supported under NVIDIA AI Enterprise licenses for production deployments.
Although Pipecat itself is fully open-source and can be self-hosted on any Python-capable server, Daily operates Pipecat Cloud as a managed deployment platform. After a nine-month beta involving more than 1,000 teams, Pipecat Cloud became generally available in January 2026.
The core value proposition is eliminating infrastructure work that is generic to voice AI but difficult to get right: fast agent cold starts, global low-latency routing, autoscaling to handle bursty traffic, and session lifecycle management. Pipecat Cloud achieves P99 agent start times below one second, which matters because callers notice delays longer than a second as awkward hesitation before the bot responds.
Key features include:

- Sub-second (P99) agent cold starts and global low-latency routing
- Autoscaling for bursty traffic and managed session lifecycles
- Optional audio recording and enterprise support tiers
- HIPAA-compliant deployment for healthcare workloads
- Code portability between the managed platform and self-hosted deployments
Pricing is $0.01 per running agent-minute, with additional costs for optional services like audio recording and enterprise support. Daily bundles AI inference costs for customers who want a single consolidated bill rather than separate vendor invoices.
Pipecat Cloud maintains vendor neutrality: code running on the platform is structurally identical to self-hosted code and can be moved off the platform without modification. Daily's rationale is that lock-in avoidance is itself a selling point for enterprise buyers who have had bad experiences with proprietary voice AI platforms.
Although Pipecat's pipeline runs server-side in Python, the framework ships client SDKs for every major front-end platform. This separation is intentional: the server handles the computationally intensive AI work, while the client handles audio capture, playback, and the WebRTC or WebSocket connection.
The JavaScript SDK (@pipecat-ai/client-js) works in any browser or Node.js environment. The React SDK wraps it in hooks and context providers for easier integration into React applications. React Native is also supported, enabling iOS and Android apps to connect to Pipecat pipelines without embedding Python. Native Swift and Kotlin SDKs cover iOS and Android respectively for teams building purely native mobile experiences. A C++ SDK targets embedded and IoT use cases where Python is impractical.
All client SDKs implement the RTVI (Real-Time Voice Inference) protocol, an open specification for client-server communication in voice AI applications that Daily contributed to the community. RTVI standardizes how a client signals readiness, how it sends audio frames to the server, how the server streams back audio and transcripts, and how both sides exchange configuration and control messages. Because RTVI is an open protocol, a client written for Pipecat can in principle connect to any RTVI-compatible server, and third-party SDKs can implement RTVI without depending on Daily's infrastructure.
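On the server side, exposing a pipeline to RTVI clients typically means inserting an RTVI processor near the top of the chain. The sketch below assumes the import path and configuration shape used in recent Pipecat examples and reuses objects from the earlier pipeline sketch.

```python
# RTVI sketch: the processor handles the client-ready, configuration, and
# control messages defined by the protocol; downstream processors are unchanged.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frameworks.rtvi import RTVIConfig, RTVIProcessor

rtvi = RTVIProcessor(config=RTVIConfig(config=[]))

pipeline = Pipeline([
    transport.input(),
    rtvi,                     # RTVI protocol handling
    stt,
    aggregators.user(),
    llm,
    tts,
    transport.output(),
    aggregators.assistant(),
])
```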
Pipecat followed a rapid development cadence after its May 2024 open-source release. The project used a v0.0.x versioning scheme through most of its first two years, releasing multiple updates per month. As of April 2026, the repository had published 109 releases and crossed the v1.0.0 milestone on April 14, 2026, signaling API stability for users who needed a stable surface to build on. Version v1.1.0 followed on April 27, 2026.
Key capabilities added across the v0.0.x lifecycle included:

- Speech-to-speech integrations (OpenAI Realtime, Gemini Multimodal Live, Nova Sonic)
- The Smart Turn semantic turn detection model
- Additional transport and telephony providers, including WhatsApp, Exotel, and Plivo
- The RTVI protocol and client SDKs for JavaScript, React, React Native, Swift, Kotlin, and C++
The framework requires Python 3.11 or higher, with Python 3.12 or higher recommended. Dependency management via uv is the project's preferred approach for reproducible environment setup.
Pipecat supports distributed multi-agent architectures through a pattern called Pipecat Subagents. In this model, each specialized agent runs its own pipeline and communicates with other agents through a shared message bus. A supervisor agent can delegate to a specialist (for example, handing a technical support call to a model fine-tuned on product documentation) and receive the result without interrupting the caller's experience.
Handoffs can transfer conversation context, session state, and in-progress audio. Background tasks (such as looking up a customer record or sending a confirmation email) can be dispatched as fire-and-forget subagents that do not block the main conversational pipeline. This pattern scales horizontally because each agent process is independent, and the message bus (Redis or a similar pub/sub system) decouples their lifecycles.
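The message-bus side of this pattern is not a Pipecat API; a generic illustration using redis-py's asyncio client might look like the following, with hypothetical channel names and payload shapes.

```python
# Generic fire-and-forget subagent dispatch over Redis pub/sub (illustrative,
# not a Pipecat API).
import json

import redis.asyncio as redis


async def dispatch_background_task(bus: redis.Redis, payload: dict) -> None:
    # Publish work for a specialist agent without blocking the main pipeline.
    await bus.publish("subagent:lookup", json.dumps(payload))


async def lookup_subagent(bus: redis.Redis) -> None:
    # A specialist process subscribes to its channel and handles requests
    # independently of the supervisor's lifecycle.
    pubsub = bus.pubsub()
    await pubsub.subscribe("subagent:lookup")
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue
        request = json.loads(message["data"])
        # ... look up the customer record, then publish the result back
        await bus.publish("subagent:results",
                          json.dumps({"request_id": request.get("id"), "ok": True}))
```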
The multi-agent model is particularly useful for applications that require different personas or knowledge domains within a single call. A customer service agent might handle routine questions but hand off complex billing disputes to a specialist agent with different system prompts, tool configurations, and even a different LLM provider optimized for that subdomain. The caller experiences this as a seamless transfer rather than a jarring hold.
LiveKit Agents is the other major open-source framework for real-time voice agents. The two projects differ primarily in their design philosophy and their relationship to the underlying transport layer.
LiveKit Agents is built on top of LiveKit's own media server, an SFU (Selective Forwarding Unit) that routes WebRTC streams at scale. The agent framework integrates tightly with this infrastructure: agents join LiveKit rooms as participants and respond to audio track events. LiveKit ships with robust server-side VAD and native SIP support out of the box, which simplifies production deployment for teams starting from scratch.
Pipecat sits above the transport layer and treats transport as a pluggable component. Its pipeline DAG model is more explicit: developers wire together every processor and can inspect or modify frames at any point in the chain. This explicitness buys transparency and flexibility at the cost of a steeper learning curve. A team that already has WebRTC infrastructure or that needs to run on multiple transport types (Daily for web, Twilio for telephony, local audio for testing) benefits from Pipecat's abstraction. A team that wants an integrated, batteries-included stack may find LiveKit Agents faster to reach production.
Latency benchmarks in developer testing have shown LiveKit Agents averaging roughly 750 to 900 milliseconds end-to-end, compared to Pipecat on Daily at roughly 800 to 950 milliseconds, with the difference attributable to LiveKit's tighter integration between its agent framework and its own media server. Both figures are competitive for production voice AI.
From an open-source licensing perspective, LiveKit Agents uses the Apache 2.0 license while Pipecat uses BSD 2-Clause. Both licenses are permissive and allow commercial use without requiring derivative works to be open-sourced, though Apache 2.0 includes an explicit patent grant that BSD 2-Clause does not.
| Feature | Pipecat | LiveKit Agents |
|---|---|---|
| Architecture | Explicit pipeline DAG | Event-driven room model |
| Transport | Vendor-agnostic (Daily, LiveKit, WS, telephony) | Native LiveKit SFU |
| VAD | Silero, Krisp, AIC, Smart Turn | Server-side native VAD |
| SIP/telephony | Via Twilio, Telnyx, Plivo, etc. | Native SIP support |
| Multi-party | Custom implementation | Native room support |
| Language | Python (server-side) | Python (server-side) |
| Provider flexibility | 100+ services | Plugin system |
| Self-hosting complexity | Moderate (transport story varies) | Lower (transport bundled) |
| Managed cloud | Pipecat Cloud (Daily) | LiveKit Cloud |
| License | BSD 2-Clause | Apache 2.0 |
Vapi takes a fundamentally different approach. Where Pipecat exposes the full pipeline as explicit code, Vapi abstracts it behind a configuration API: the developer writes a system prompt, selects a voice, and connects tools; Vapi's infrastructure executes the STT/LLM/TTS loop. This makes Vapi faster to deploy but harder to customize at a granular level.
| Dimension | Pipecat | Vapi |
|---|---|---|
| Model | Open-source framework | Managed SaaS platform |
| License | BSD 2-Clause | Proprietary |
| Pipeline visibility | Full (every step is code) | Abstracted (config-driven) |
| Provider choice | 100+ vendors, freely mixable | Curated integrations |
| Customization | Deep (insert logic anywhere) | Moderate (prompt and tool configuration) |
| Pricing | Infrastructure + AI vendor costs only | $0.05/min platform fee + AI costs |
| Telephony | Via partner providers | Built-in |
| Time to first call | Slower (requires setup) | Fast (minutes) |
| Debugging | Full observability via code | Limited pipeline introspection |
| Mobile deployment | Server-side only | Server-side only |
Vapi is generally preferred for rapid prototyping, standard customer service patterns, and teams without real-time audio engineering expertise. Pipecat is preferred for domain-specific vocabulary requirements, complex multi-step workflows, existing transport infrastructure, and cost optimization at scale, since the $0.05 per minute Vapi platform fee can become significant at hundreds of thousands of minutes per month.
A third class of comparison involves platforms like Retell AI and Bland AI, which similarly offer managed abstractions over voice pipelines and also charge per-minute platform fees. These services sit closer to the Vapi end of the spectrum than to Pipecat's open-source, self-hostable model.
Documented production and reference applications built with Pipecat span several categories.
Customer service and intake bots are the most common deployment. A voice agent answers inbound calls, collects caller information, looks up account records via function calling, and either resolves the issue or transfers to a human agent. Pipecat Flows handles the structured intake sequence while the LLM handles open-ended questions within each state.
AI companions and coaching assistants use Pipecat's low-latency audio pipeline to create conversational experiences that feel natural rather than transactional. The framework's support for emotional voice synthesis (via ElevenLabs and Cartesia providers) and context window management helps maintain coherent, personable personas across long sessions.
Meeting assistants join video calls as a participant (via the Daily or LiveKit transport), transcribe the conversation, answer questions, and produce summaries. Pipecat's multimodal frame system can carry video frames alongside audio for applications that need to interpret screen content during a call.
Virtual avatars in gaming and retail use the NVIDIA ACE integration to drive photorealistic 3D character animations synchronized to agent speech. This application benefits from the NVIDIA NIM pipeline, which minimizes the number of network hops between speech recognition, language modeling, and facial animation inference.
IoT and edge devices that need on-device voice interfaces typically use Pipecat in a client-server configuration: the device runs a lightweight Pipecat client SDK (available for JavaScript, React, React Native, Swift, Kotlin, and C++) that streams audio to a Pipecat server, avoiding the need for a Python runtime on constrained hardware.
Telehealth platforms have adopted Pipecat's HIPAA-compliant Pipecat Cloud deployment to build symptom-checking bots, appointment scheduling assistants, and post-discharge follow-up calls.
Language learning applications use Pipecat to build voice tutors that can evaluate pronunciation, respond in the target language, and adjust to learner proficiency. The framework's support for 83 language combinations across its STT and TTS providers makes it practical to support less common languages that larger managed platforms may not prioritize.
Developer tooling and testing are also documented uses: teams pipe synthetic audio through Pipecat pipelines to run automated integration tests of voice agent behavior, using the local audio transport and a recorded test corpus to simulate callers without incurring telephony costs.
As of the v0.0.77 release milestone highlighted by Kwindla Kramer in early 2026, 131 developers had contributed code to Pipecat core, and the framework supported 83 services, models, and APIs. The GitHub repository had approximately 11,900 stars and 2,000 forks at that point.
Major cloud providers have published first-party tutorials and sample repositories for Pipecat: AWS published a multi-part series on building voice agents with Amazon Bedrock and Pipecat, including a hands-on workshop covering Nova Sonic integration; NVIDIA provides the Voice Agents for Conversational AI blueprint; and multiple AI infrastructure companies including Deepgram, ElevenLabs, and Cartesia ship documented Pipecat integrations as primary developer resources.
The framework has been integrated into deployment pipelines by Amazon Web Services, adopted by medical technology startups building HIPAA-compliant patient intake systems, and used by gaming studios building NPC dialogue systems via NVIDIA ACE.
Pipecat Cloud reached over 1,000 teams during its beta period before its January 2026 general availability launch, suggesting meaningful production adoption beyond hobbyist experimentation.
Pipecat is a Python-only server-side framework. Running the pipeline on iOS or Android devices is not practical: Python has no native support on iOS, Apple's App Store policies restrict dynamic code execution, and embedding a Python interpreter in a mobile app would add tens of megabytes to the binary while the GIL (Global Interpreter Lock) limits parallel execution on multi-core mobile CPUs. Mobile deployments therefore require a client-server architecture where the Pipecat pipeline runs remotely and the device connects via a lightweight client SDK.
Self-hosting Pipecat without using Daily as the transport requires the developer to assemble the transport story independently. LiveKit's agent framework bundles an integrated media server, so a LiveKit self-hosted deployment comes with a complete WebRTC stack. With Pipecat, a team that wants to avoid Daily must configure and operate a separate media server (such as a LiveKit deployment) or use WebSocket transport, which carries the latency risks discussed above. This has been cited by developers as the primary operational burden of the framework.
The explicitness of the pipeline, while powerful, creates a steeper learning curve than configuration-based platforms. Beginners typically need to understand asyncio, WebRTC, streaming audio formats, and the Pipecat frame model before they can confidently build and debug a production agent. Documentation quality has improved substantially across Pipecat's releases but remains a friction point compared to hosted platforms with dedicated onboarding flows.
Real-time audio quality is sensitive to server load. Because Pipecat pipelines run as async Python processes, CPU spikes (from a slow LLM response or a TTS timeout) can introduce perceptible audio glitches. Teams running high-concurrency deployments must size their infrastructure carefully and use Pipecat Cloud's reserved instances or equivalent capacity guarantees to avoid latency variance.
Pipecat's evaluation tooling is also relatively underdeveloped compared to text-based agent frameworks. Measuring whether a voice agent performed well requires either listening to call recordings or running automated transcription of output audio through a separate evaluation pipeline. Kramer has acknowledged that voice quality evals are largely informal in practice, and the framework does not yet ship built-in regression testing or A/B comparison tools for voice agents. Developers typically adapt existing LLM evaluation libraries to transcripts rather than evaluating the audio itself.
Dependency management can be complex. Because Pipecat supports over 100 service integrations, its dependency tree is large. The framework uses optional dependency groups so that users who only need Deepgram and Anthropic do not have to install ElevenLabs and Cartesia libraries. But managing these optional groups alongside conflicting transitive dependencies requires some familiarity with Python packaging tools, adding friction for developers new to the ecosystem.