Pipecat is an open-source Python framework for building real-time voice and multimodal conversational AI agents. Created by Daily, a WebRTC infrastructure company, and first released in May 2024, Pipecat provides a pipeline-based architecture that chains together speech recognition, large language models, and text-to-speech synthesis into a single coherent data flow. The framework is licensed under the BSD 2-Clause license and supports over 100 AI services, APIs, and transport protocols. As of early 2026 it had accumulated nearly 12,000 GitHub stars, more than 2,000 forks, and contributions from over 130 developers, making it one of the most widely adopted open-source frameworks for voice agent orchestration.
Daily was founded in 2016 by Kwindla Hultman Kramer and Nina Kuruvilla with a focus on developer-facing real-time audio and video infrastructure built on WebRTC. The company launched its video API in 2019 and later raised a $40 million Series B, with total funding reaching roughly $62 million. Daily's core business involved providing WebRTC-based media transport to developers building video conferencing, telehealth, and live collaboration tools.
In 2023 and early 2024, as large language models became capable enough for real-time spoken conversation, Daily's engineering team observed a recurring pattern among customers: developers were independently solving the same hard infrastructure problems to wire together speech-to-text, LLMs, and text-to-speech in a low-latency pipeline. Each team was reinventing frame buffering, voice activity detection, interruption handling, and transport abstraction from scratch. Kramer and the Daily team decided to extract and generalize their own internal tooling into a reusable open-source library.
Pipecat was publicly released in May 2024. The MarkTechPost announcement described it as a framework designed to simplify the creation of voice and multimodal conversational agents, initially supporting ElevenLabs and OpenAI for speech synthesis, Deepgram for transcription, and Daily's own WebRTC transport. The name reflects the core design metaphor: audio, text, and video travel through the system as discrete data objects (frames) that flow through a chain of processors (a pipe).
Pipecat organizes processing around a directed pipeline of frame processors. Every unit of data in the system is a frame object. Audio arriving from a microphone becomes an AudioRawFrame. A transcription becomes a TextFrame. A complete LLM response is bracketed by LLMFullResponseStartFrame and LLMFullResponseEndFrame markers. Control signals use ControlFrame, and high-priority system signals such as start, stop, and interruption use SystemFrame objects that bypass normal ordering.
Each FrameProcessor receives frames on its input, optionally transforms them, and passes the result to the next processor in the chain. Processors are doubly linked (each maintains a _next and _prev reference), which allows frames to travel both downstream and upstream through the pipeline. A standard voice agent pipeline takes roughly this shape:
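The sketch below is a minimal illustration of that shape. It assumes the Deepgram, OpenAI, Cartesia, and Daily plugins discussed later in this article; import paths and parameter names have shifted across releases, so treat it as indicative rather than copy-paste ready.

```python
# Minimal cascaded voice agent pipeline (sketch). Assumes the Deepgram, OpenAI,
# Cartesia, and Daily plugins; exact module paths vary between Pipecat releases.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.openai.llm import OpenAILLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def run_bot(room_url: str, token: str):
    transport = DailyTransport(
        room_url, token, "Voice Bot",
        DailyParams(audio_in_enabled=True, audio_out_enabled=True),
    )

    stt = DeepgramSTTService(api_key="...")                   # audio -> TextFrame
    llm = OpenAILLMService(model="gpt-4o")                    # context -> streamed response
    tts = CartesiaTTSService(api_key="...", voice_id="...")   # text -> AudioRawFrame

    # Context aggregators sit on either side of the LLM and keep the conversation
    # history: user turns are collected upstream, assistant turns downstream.
    context = OpenAILLMContext([{"role": "system", "content": "Be brief and friendly."}])
    aggregators = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),        # audio in over WebRTC
        stt,                      # speech-to-text
        aggregators.user(),       # add the user's turn to the context
        llm,                      # language model
        tts,                      # text-to-speech
        transport.output(),       # audio out over WebRTC
        aggregators.assistant(),  # record what the bot actually said
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```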
The framework separates LLM context management into specialized aggregators. The LLMUserAggregator collects user inputs including images for multimodal scenarios. The LLMAssistantAggregator captures the model's output. This clean separation makes it straightforward to inspect or modify conversation context between turns.
Pipelines can be branched and merged using the Pipeline and ParallelPipeline containers. A developer can run transcription and sentiment analysis concurrently on the same audio, for instance, without one blocking the other. The async Python runtime (asyncio) underpins all processor execution, keeping I/O-bound AI API calls non-blocking.
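A branched pipeline might look like the sketch below. It assumes ParallelPipeline accepts one list of processors per branch, reuses the transport and service objects from the earlier sketch, and uses a hypothetical SentimentTagger processor as the second branch.

```python
# Branching sketch: run transcription and a hypothetical audio-sentiment
# processor concurrently on the same input frames.
from pipecat.frames.frames import AudioRawFrame, Frame
from pipecat.pipeline.parallel_pipeline import ParallelPipeline
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class SentimentTagger(FrameProcessor):
    """Hypothetical processor that inspects audio frames as they pass through."""

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, AudioRawFrame):
            ...  # hand the audio to a sentiment model without blocking the STT branch
        await self.push_frame(frame, direction)


pipeline = Pipeline([
    transport.input(),
    ParallelPipeline(
        [stt],                # branch 1: transcription
        [SentimentTagger()],  # branch 2: concurrent analysis of the same audio
    ),
    llm,
    tts,
    transport.output(),
])
```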
Knowing when a user has finished speaking is harder than it sounds. Silence-based VAD detects the end of audio energy but misclassifies filled pauses, short hesitations, and sentence fragments as completed turns, causing the bot to respond too early. Pipecat ships with pluggable VAD backends including Silero, AIC, and Krisp VIVA, configurable via VADParams.
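VAD configuration is typically attached to the transport. The sketch below assumes the Silero backend and the VADParams fields shown, which may differ slightly between releases.

```python
# VAD configuration sketch using the Silero backend.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.audio.vad.vad_analyzer import VADParams
from pipecat.transports.services.daily import DailyParams, DailyTransport

transport = DailyTransport(
    "https://example.daily.co/room", None, "Voice Bot",
    DailyParams(
        audio_in_enabled=True,
        audio_out_enabled=True,
        vad_analyzer=SileroVADAnalyzer(
            # Treat the turn as finished only after 0.8 s of silence,
            # and require reasonably confident speech detection.
            params=VADParams(stop_secs=0.8, confidence=0.7),
        ),
    ),
)
```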
For production deployments, Pipecat offers Smart Turn, an open-source semantic turn detection model trained to analyze raw audio waveforms rather than transcriptions. Smart Turn v3, hosted on Hugging Face, determines whether the speaker has genuinely finished a thought by combining acoustic and linguistic cues. It is designed to run after a VAD silence detection event and completes inference in approximately 65 milliseconds on a standard CPU instance. Smart Turn is the default turn stop strategy in current Pipecat releases and has been tested extensively on Pipecat Cloud.
Interruption handling is built into the pipeline. When a user starts speaking while the bot is talking, the framework broadcasts an InterruptionFrame that cancels downstream tasks and stops audio playback. Frames carrying critical operations (such as an acknowledgment tone) can be marked with UninterruptibleFrame so they are not cut short.
One of Pipecat's primary design goals is vendor neutrality. The framework ships with plugins for over 100 services organized into five categories.
Pipecat supports more than 20 speech-to-text providers. The most commonly used are Deepgram (which offers a streaming WebSocket API with low latency and speaker diarization), AssemblyAI, Google Cloud Speech-to-Text, Microsoft Azure Cognitive Services, OpenAI Whisper, AWS Transcribe, Gladia, Soniox, Speechmatics, and Groq's Whisper endpoint. NVIDIA Riva Parakeet NIM is also supported as part of the NVIDIA partnership described below.
LLM integrations cover the major frontier model providers: OpenAI (GPT-4o and its variants), Anthropic (Claude), Google Gemini, AWS Bedrock (including Amazon Nova), Mistral, Groq, DeepSeek, and xAI Grok. The framework uses a common LLMService interface so that swapping one provider for another requires a single-line code change rather than pipeline restructuring. NVIDIA LLM NIMs are accessible through the framework's NVIDIA module.
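In practice the swap looks roughly like this; the import paths and model identifiers below are illustrative.

```python
# Provider swap sketch: both services implement the common LLMService interface,
# so nothing else in the pipeline changes.
from pipecat.services.anthropic.llm import AnthropicLLMService
from pipecat.services.openai.llm import OpenAILLMService

llm = OpenAILLMService(model="gpt-4o")
# ...or swap the provider with a single line:
llm = AnthropicLLMService(model="claude-3-5-sonnet-20241022")
```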
Pipecat supports more than 30 TTS providers including ElevenLabs, Cartesia, Google Text-to-Speech, OpenAI TTS, Azure Neural Voices, AWS Polly, Deepgram Aura, and NVIDIA FastPitch-HifiGAN NIM. Runtime voice parameter updates (changing speaker ID, speed, or emotion mid-conversation) are supported for Cartesia, ElevenLabs Realtime, Deepgram, and several others.
For applications that prefer to bypass the STT/LLM/TTS cascade entirely, Pipecat integrates with natively multimodal APIs that accept audio input and return audio output directly. These include the OpenAI Realtime API, Google Gemini Multimodal Live, AWS Nova Sonic, and Ultravox. The speech-to-speech path typically offers lower latency than the three-model cascade because it eliminates two serialization round trips, though it trades away the flexibility to mix providers.
Transport plugins abstract the underlying network protocol. Supported transports include Daily WebRTC, LiveKit WebRTC, Twilio Media Streams, Telnyx, Plivo, Vonage, Exotel, raw WebSocket server, FastAPI WebSocket, WhatsApp via Meta's Cloud API, and a local audio transport for development. The framework's BaseInputTransport and BaseOutputTransport provide a uniform frame-based interface across all of these, so pipeline code does not need to change when switching from a WebSocket prototype to a WebRTC production deployment.
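The practical effect is that the pipeline body stays fixed while the endpoints change. The sketch below assumes the import paths shown and reuses the stt/llm/tts services from the earlier example.

```python
# Transport portability sketch: the pipeline only touches transport.input()
# and transport.output(), regardless of the protocol underneath.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.transports.services.daily import DailyParams, DailyTransport
# from pipecat.transports.network.fastapi_websocket import (
#     FastAPIWebsocketParams, FastAPIWebsocketTransport,
# )


def build_pipeline(transport):
    return Pipeline([transport.input(), stt, llm, tts, transport.output()])


daily = DailyTransport(
    "https://example.daily.co/room", None, "Voice Bot",
    DailyParams(audio_in_enabled=True, audio_out_enabled=True),
)
pipeline = build_pipeline(daily)
# Moving to a WebSocket prototype means constructing a FastAPIWebsocketTransport
# instead; build_pipeline() is untouched.
```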
Pipecat also connects to video avatar services: HeyGen, Tavus, and Simli can render a photorealistic talking head synchronized to the agent's audio output. Vision processing via Moondream allows agents to analyze still images or video frames passed into the pipeline alongside speech.
WebRTC is the transport layer of choice for production voice agents because it was designed specifically for real-time audio streaming over the public internet. Unlike WebSockets, which run over TCP and guarantee ordered delivery at the cost of head-of-line blocking and latency spikes when packets are lost or arrive out of order, WebRTC uses RTP (Real-time Transport Protocol) and deliberately trades delivery guarantees for consistent timing. Network jitter buffers and packet loss concealment are handled at the browser or SDK layer rather than the application layer.
Kwindla Kramer has noted in interviews that WebSockets cause latency spikes in real production environments that do not appear during development, and that roughly 15% of production calls show unexpected disconnects when WebSocket-based voice agents encounter congested networks. This background informed Pipecat's strong default preference for WebRTC transport.
Daily, as a company that has operated WebRTC infrastructure since 2016, ships the Daily transport as the best-supported and most tightly integrated option in Pipecat. The Daily SDK handles TURN server negotiation, adaptive bitrate, audio codec selection (Opus by default), and echo cancellation, all without additional configuration in user code. Pipecat also ships a SmallWebRTCTransport for lightweight deployments and a LiveKit WebRTC transport for teams already invested in the LiveKit ecosystem.
For telephony workloads where WebRTC is not available, PSTN and SIP dial-in and dial-out are handled through provider-specific WebSocket transports (Twilio's Media Streams protocol, Telnyx's similar offering, etc.).
Unstructured LLM conversations work well for open-ended chat but are fragile for business workflows that must follow a defined sequence: collect a name, verify a date, confirm an order, escalate if a condition is met. The model may jump ahead, omit required steps, or fail to enforce branching conditions reliably when given a single monolithic system prompt.
Pipecat Flows is a companion library (distributed as pipecat-ai-flows on PyPI) that adds structured conversation state management. Flows models a conversation as a directed graph where each node represents a conversation state with its own task messages, functions, and transition conditions. The FlowManager maintains a current_node and a persistent state dictionary across the session.
Node transitions are triggered by LLM function calls rather than pattern matching on transcripts, which keeps the logic readable and easy to audit. The library supports three context management strategies when transitioning between nodes: APPEND (accumulate all previous context), RESET (clear context for a fresh prompt), and RESET_WITH_SUMMARY (clear context but inject an AI-generated summary of prior exchanges). This gives developers control over context window usage in long multi-turn workflows.
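A minimal two-node graph, sketched against the dict-based flow_config format of pipecat-ai-flows, looks roughly like this; field names such as task_messages and transition_to follow current Flows examples but may differ between releases.

```python
# Two-node Pipecat Flows sketch: an intake node that collects a name and
# transitions to a confirmation node via an LLM function call.
from pipecat_flows import FlowArgs, FlowManager, FlowResult


async def collect_name(args: FlowArgs) -> FlowResult:
    # Handlers run when the LLM calls the function; "transition_to" below
    # moves the conversation to the next node.
    return {"status": "success", "name": args["name"]}


flow_config = {
    "initial_node": "intake",
    "nodes": {
        "intake": {
            "task_messages": [
                {"role": "system", "content": "Ask the caller for their full name."}
            ],
            "functions": [{
                "type": "function",
                "function": {
                    "name": "collect_name",
                    "handler": collect_name,
                    "description": "Record the caller's name",
                    "parameters": {
                        "type": "object",
                        "properties": {"name": {"type": "string"}},
                        "required": ["name"],
                    },
                    "transition_to": "confirm",
                },
            }],
        },
        "confirm": {
            "task_messages": [
                {"role": "system", "content": "Read the name back and confirm it."}
            ],
            "functions": [],
        },
    },
}

# flow_manager = FlowManager(task=task, llm=llm,
#                            context_aggregator=aggregators, flow_config=flow_config)
# await flow_manager.initialize()
```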
Typical applications of Pipecat Flows include structured intake forms, multi-step customer support escalations, appointment booking sequences, and interactive voice response trees that blend scripted paths with open LLM responses.
In 2024, Daily and NVIDIA announced a collaboration to publish the NVIDIA AI Blueprint: Voice Agents for Conversational AI. The blueprint is a reference architecture for enterprise voice agent deployment that combines Pipecat's pipeline orchestration with NVIDIA NIM microservices.
The reference stack uses NVIDIA Riva Parakeet NIM for automatic speech recognition, NVIDIA Llama 3.3 70B Instruct NIM as the language model, and NVIDIA FastPitch-HifiGAN NIM for speech synthesis. NVIDIA's VP Justin Boitano described the collaboration as enabling developers to create sophisticated, real-time conversational AI experiences with unprecedented ease and flexibility.
NVIDIA extended the integration further through the ACE (Avatar Cloud Engine) Controller microservice, which uses Pipecat as its underlying orchestration framework. The NVIDIA Pipecat library adds frame processors specific to avatar interaction: Audio2Face3DService (which drives a 3D facial animation model from audio), AnimationGraphService, FacialGestureProviderProcessor, and PostureProviderProcessor. This stack enables photorealistic digital humans that lip-sync and gesture in real time, targeting use cases in gaming NPCs, virtual retail assistants, and enterprise training simulations.
The blueprint and ACE Controller are available through the NVIDIA AI catalog and GitHub (github.com/NVIDIA/voice-agent-examples), and are supported under NVIDIA AI Enterprise licenses for production deployments.
Although Pipecat itself is fully open-source and can be self-hosted on any Python-capable server, Daily operates Pipecat Cloud as a managed deployment platform. After a nine-month beta involving more than 1,000 teams, Pipecat Cloud became generally available in January 2026.
The core value proposition is eliminating infrastructure work that is generic to voice AI but difficult to get right: fast agent cold starts, global low-latency routing, autoscaling to handle bursty traffic, and session lifecycle management. Pipecat Cloud achieves P99 agent start times below one second, which matters because callers notice delays longer than a second as awkward hesitation before the bot responds.
Key features include:

- Sub-second (P99) agent cold starts and global low-latency routing
- Autoscaling for bursty traffic and managed session lifecycles
- Optional audio recording and enterprise support tiers
- HIPAA-compliant deployment for healthcare workloads
- Code portability between the managed platform and self-hosted deployments
Pricing is $0.01 per running agent-minute, with additional costs for optional services like audio recording and enterprise support. Daily bundles AI inference costs for customers who want a single consolidated bill rather than separate vendor invoices.
Pipecat Cloud maintains vendor neutrality: code running on the platform is structurally identical to self-hosted code and can be moved off the platform without modification. Daily's rationale is that lock-in avoidance is itself a selling point for enterprise buyers who have had bad experiences with proprietary voice AI platforms.
Although Pipecat's pipeline runs server-side in Python, the framework ships client SDKs for every major front-end platform. This separation is intentional: the server handles the computationally intensive AI work, while the client handles audio capture, playback, and the WebRTC or WebSocket connection.
The JavaScript SDK (@pipecat-ai/client-js) works in any browser or Node.js environment. The React SDK wraps it in hooks and context providers for easier integration into React applications. React Native is also supported, enabling iOS and Android apps to connect to Pipecat pipelines without embedding Python. Native Swift and Kotlin SDKs cover iOS and Android respectively for teams building purely native mobile experiences. A C++ SDK targets embedded and IoT use cases where Python is impractical.
All client SDKs implement the RTVI (Real-Time Voice Inference) protocol, an open specification for client-server communication in voice AI applications that Daily contributed to the community. RTVI standardizes how a client signals readiness, how it sends audio frames to the server, how the server streams back audio and transcripts, and how both sides exchange configuration and control messages. Because RTVI is an open protocol, a client written for Pipecat can in principle connect to any RTVI-compatible server, and third-party SDKs can implement RTVI without depending on Daily's infrastructure.
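On the server side, exposing a pipeline to RTVI clients typically means inserting an RTVI processor near the top of the chain. The sketch below assumes the import path and configuration shape used in recent Pipecat examples and reuses objects from the earlier pipeline sketch.

```python
# RTVI sketch: the processor handles the client-ready, configuration, and
# control messages defined by the protocol; downstream processors are unchanged.
from pipecat.pipeline.pipeline import Pipeline
from pipecat.processors.frameworks.rtvi import RTVIConfig, RTVIProcessor

rtvi = RTVIProcessor(config=RTVIConfig(config=[]))

pipeline = Pipeline([
    transport.input(),
    rtvi,                     # RTVI protocol handling
    stt,
    aggregators.user(),
    llm,
    tts,
    transport.output(),
    aggregators.assistant(),
])
```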
Pipecat followed a rapid development cadence after its May 2024 open-source release. The project used a v0.0.x versioning scheme through most of its first two years, releasing multiple updates per month. As of April 2026, the repository had published 109 releases and crossed the v1.0.0 milestone on April 14, 2026, signaling API stability for users who needed a stable surface to build on. Version v1.1.0 followed on April 27, 2026.
Key capabilities added across the v0.0.x lifecycle included:

- Speech-to-speech integrations (OpenAI Realtime, Gemini Multimodal Live, Nova Sonic)
- The Smart Turn semantic turn detection model
- Additional transport and telephony providers, including WhatsApp, Exotel, and Plivo
- The RTVI protocol and client SDKs for JavaScript, React, React Native, Swift, Kotlin, and C++
The framework requires Python 3.11 or higher, with Python 3.12 or higher recommended. Dependency management via uv is the project's preferred approach for reproducible environment setup.
Pipecat supports distributed multi-agent architectures through a pattern called Pipecat Subagents. In this model, each specialized agent runs its own pipeline and communicates with other agents through a shared message bus. A supervisor agent can delegate to a specialist (for example, handing a technical support call to a model fine-tuned on product documentation) and receive the result without interrupting the caller's experience.
Handoffs can transfer conversation context, session state, and in-progress audio. Background tasks (such as looking up a customer record or sending a confirmation email) can be dispatched as fire-and-forget subagents that do not block the main conversational pipeline. This pattern scales horizontally because each agent process is independent, and the message bus (Redis or a similar pub/sub system) decouples their lifecycles.
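The message-bus side of this pattern is not a Pipecat API; a generic illustration using redis-py's asyncio client might look like the following, with hypothetical channel names and payload shapes.

```python
# Generic fire-and-forget subagent dispatch over Redis pub/sub (illustrative,
# not a Pipecat API).
import json

import redis.asyncio as redis


async def dispatch_background_task(bus: redis.Redis, payload: dict) -> None:
    # Publish work for a specialist agent without blocking the main pipeline.
    await bus.publish("subagent:lookup", json.dumps(payload))


async def lookup_subagent(bus: redis.Redis) -> None:
    # A specialist process subscribes to its channel and handles requests
    # independently of the supervisor's lifecycle.
    pubsub = bus.pubsub()
    await pubsub.subscribe("subagent:lookup")
    async for message in pubsub.listen():
        if message["type"] != "message":
            continue
        request = json.loads(message["data"])
        # ... look up the customer record, then publish the result back
        await bus.publish("subagent:results",
                          json.dumps({"request_id": request.get("id"), "ok": True}))
```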
The multi-agent model is particularly useful for applications that require different personas or knowledge domains within a single call. A customer service agent might handle routine questions but hand off complex billing disputes to a specialist agent with different system prompts, tool configurations, and even a different LLM provider optimized for that subdomain. The caller experiences this as a seamless transfer rather than a jarring hold.
LiveKit Agents is the other major open-source framework for real-time voice agents. The two projects differ primarily in their design philosophy and their relationship to the underlying transport layer.
LiveKit Agents is built on top of LiveKit's own media server, an SFU (Selective Forwarding Unit) that routes WebRTC streams at scale. The agent framework integrates tightly with this infrastructure: agents join LiveKit rooms as participants and respond to audio track events. LiveKit ships with robust server-side VAD and native SIP support out of the box, which simplifies production deployment for teams starting from scratch.
Pipecat sits above the transport layer and treats transport as a pluggable component. Its pipeline DAG model is more explicit: developers wire together every processor and can inspect or modify frames at any point in the chain. This explicitness buys transparency and flexibility at the cost of a steeper learning curve. A team that already has WebRTC infrastructure or that needs to run on multiple transport types (Daily for web, Twilio for telephony, local audio for testing) benefits from Pipecat's abstraction. A team that wants an integrated, batteries-included stack may find LiveKit Agents faster to reach production.
Latency benchmarks in developer testing have shown LiveKit Agents averaging roughly 750 to 900 milliseconds end-to-end, compared to Pipecat on Daily at roughly 800 to 950 milliseconds, with the difference attributable to LiveKit's tighter integration between its agent framework and its own media server. Both figures are competitive for production voice AI.
From an open-source licensing perspective, LiveKit Agents uses the Apache 2.0 license while Pipecat uses BSD 2-Clause. Both licenses are permissive and allow commercial use without requiring derivative works to be open-sourced, though Apache 2.0 includes an explicit patent grant that BSD 2-Clause does not.
| Feature | Pipecat | LiveKit Agents |
|---|---|---|
| Architecture | Explicit pipeline DAG | Event-driven room model |
| Transport | Vendor-agnostic (Daily, LiveKit, WS, telephony) | Native LiveKit SFU |
| VAD | Silero, Krisp, AIC, Smart Turn | Server-side native VAD |
| SIP/telephony | Via Twilio, Telnyx, Plivo, etc. | Native SIP support |
| Multi-party | Custom implementation | Native room support |
| Language | Python (server-side) | Python (server-side) |
| Provider flexibility | 100+ services | Plugin system |
| Self-hosting complexity | Moderate (transport story varies) | Lower (transport bundled) |
| Managed cloud | Pipecat Cloud (Daily) | LiveKit Cloud |
| License | BSD 2-Clause | Apache 2.0 |
Vapi takes a fundamentally different approach. Where Pipecat exposes the full pipeline as explicit code, Vapi abstracts it behind a configuration API: the developer writes a system prompt, selects a voice, and connects tools; Vapi's infrastructure executes the STT/LLM/TTS loop. This makes Vapi faster to deploy but harder to customize at a granular level.
| Dimension | Pipecat | Vapi |
|---|---|---|
| Model | Open-source framework | Managed SaaS platform |
| License | BSD 2-Clause | Proprietary |
| Pipeline visibility | Full (every step is code) | Abstracted (config-driven) |
| Provider choice | 100+ vendors, freely mixable | Curated integrations |
| Customization | Deep (insert logic anywhere) | Moderate (prompt and tool configuration) |
| Pricing | Infrastructure + AI vendor costs only | $0.05/min platform fee + AI costs |
| Telephony | Via partner providers | Built-in |
| Time to first call | Slower (requires setup) | Fast (minutes) |
| Debugging | Full observability via code | Limited pipeline introspection |
| Mobile deployment | Server-side only | Server-side only |
Vapi is generally preferred for rapid prototyping, standard customer service patterns, and teams without real-time audio engineering expertise. Pipecat is preferred for domain-specific vocabulary requirements, complex multi-step workflows, existing transport infrastructure, and cost optimization at scale, since the $0.05 per minute Vapi platform fee can become significant at hundreds of thousands of minutes per month.
A third class of comparison involves platforms like Retell AI and Bland AI, which similarly offer managed abstractions over voice pipelines and also charge per-minute platform fees. These services sit closer to the Vapi end of the spectrum than to Pipecat's open-source, self-hostable model.
Documented production and reference applications built with Pipecat span several categories.
Customer service and intake bots are the most common deployment. A voice agent answers inbound calls, collects caller information, looks up account records via function calling, and either resolves the issue or transfers to a human agent. Pipecat Flows handles the structured intake sequence while the LLM handles open-ended questions within each state.
AI companions and coaching assistants use Pipecat's low-latency audio pipeline to create conversational experiences that feel natural rather than transactional. The framework's support for emotional voice synthesis (via ElevenLabs and Cartesia providers) and context window management helps maintain coherent, personable personas across long sessions.
Meeting assistants join video calls as a participant (via the Daily or LiveKit transport), transcribe the conversation, answer questions, and produce summaries. Pipecat's multimodal frame system can carry video frames alongside audio for applications that need to interpret screen content during a call.
Virtual avatars in gaming and retail use the NVIDIA ACE integration to drive photorealistic 3D character animations synchronized to agent speech. This application benefits from the NVIDIA NIM pipeline, which minimizes the number of network hops between speech recognition, language modeling, and facial animation inference.
IoT and edge devices that need on-device voice interfaces typically use Pipecat in a client-server configuration: the device runs a lightweight Pipecat client SDK (available for JavaScript, React, React Native, Swift, Kotlin, and C++) that streams audio to a Pipecat server, avoiding the need for a Python runtime on constrained hardware.
Telehealth platforms have adopted Pipecat's HIPAA-compliant Pipecat Cloud deployment to build symptom-checking bots, appointment scheduling assistants, and post-discharge follow-up calls.
Language learning applications use Pipecat to build voice tutors that can evaluate pronunciation, respond in the target language, and adjust to learner proficiency. The framework's support for 83 language combinations across its STT and TTS providers makes it practical to support less common languages that larger managed platforms may not prioritize.
Developer tooling and testing are also documented uses: teams pipe synthetic audio through Pipecat pipelines to run automated integration tests of voice agent behavior, using the local audio transport and a recorded test corpus to simulate callers without incurring telephony costs.
As of the v0.0.77 release milestone highlighted by Kwindla Kramer in early 2026, 131 developers had contributed code to Pipecat core, and the framework supported 83 services, models, and APIs. The GitHub repository had approximately 11,900 stars and 2,000 forks at that point.
Major cloud providers have published first-party tutorials and sample repositories for Pipecat: AWS published a multi-part series on building voice agents with Amazon Bedrock and Pipecat, including a hands-on workshop covering Nova Sonic integration; NVIDIA provides the Voice Agents for Conversational AI blueprint; and multiple AI infrastructure companies including Deepgram, ElevenLabs, and Cartesia ship documented Pipecat integrations as primary developer resources.
The framework has been integrated into deployment pipelines by Amazon Web Services, adopted by medical technology startups building HIPAA-compliant patient intake systems, and used by gaming studios building NPC dialogue systems via NVIDIA ACE.
Pipecat Cloud reached over 1,000 teams during its beta period before its January 2026 general availability launch, suggesting meaningful production adoption beyond hobbyist experimentation.
Pipecat is a Python-only server-side framework. Running the pipeline on iOS or Android devices is not practical: Python has no native support on iOS, Apple's App Store policies restrict dynamic code execution, and embedding a Python interpreter in a mobile app would add tens of megabytes to the binary while the GIL (Global Interpreter Lock) limits parallel execution on multi-core mobile CPUs. Mobile deployments therefore require a client-server architecture where the Pipecat pipeline runs remotely and the device connects via a lightweight client SDK.
Self-hosting Pipecat without using Daily as the transport requires the developer to assemble the transport story independently. LiveKit's agent framework bundles an integrated media server, so a LiveKit self-hosted deployment comes with a complete WebRTC stack. With Pipecat, a team that wants to avoid Daily must configure and operate a separate media server (such as a LiveKit deployment) or use WebSocket transport, which carries the latency risks discussed above. This has been cited by developers as the primary operational burden of the framework.
The explicitness of the pipeline, while powerful, creates a steeper learning curve than configuration-based platforms. Beginners typically need to understand asyncio, WebRTC, streaming audio formats, and the Pipecat frame model before they can confidently build and debug a production agent. Documentation quality has improved substantially across Pipecat's releases but remains a friction point compared to hosted platforms with dedicated onboarding flows.
Real-time audio quality is sensitive to server load. Because Pipecat pipelines run as async Python processes, CPU spikes (from a slow LLM response or a TTS timeout) can introduce perceptible audio glitches. Teams running high-concurrency deployments must size their infrastructure carefully and use Pipecat Cloud's reserved instances or equivalent capacity guarantees to avoid latency variance.
Pipecat's evaluation tooling is also relatively underdeveloped compared to text-based agent frameworks. Measuring whether a voice agent performed well requires either listening to call recordings or running automated transcription of output audio through a separate evaluation pipeline. Kramer has acknowledged that voice quality evals are largely informal in practice, and the framework does not yet ship built-in regression testing or A/B comparison tools for voice agents. Developers typically adapt existing LLM evaluation libraries to transcripts rather than evaluating the audio itself.
Dependency management can be complex. Because Pipecat supports over 100 service integrations, its dependency tree is large. The framework uses optional dependency groups so that users who only need Deepgram and Anthropic do not have to install ElevenLabs and Cartesia libraries. But managing these optional groups alongside conflicting transitive dependencies requires some familiarity with Python packaging tools, adding friction for developers new to the ecosystem.