LiveKit Agents is an open source framework from LiveKit Inc for building real-time voice, video, and multimodal AI agents. Built on top of LiveKit's WebRTC infrastructure, the framework gives developers a programmable layer that sits between an end user's browser or phone call and a stack of AI providers covering speech recognition, language modeling, and speech synthesis. Since its initial release in 2024, it has become one of the most widely used frameworks in the voice AI ecosystem, used internally by OpenAI to power ChatGPT's Advanced Voice Mode and adopted by tens of thousands of development teams building everything from customer service bots to healthcare intake systems.
LiveKit Inc is a San Jose, California-based company co-founded by Russ d'Sa (CEO) and David Zhao (CTO). The company traces its origins to a 2021 seed-funded open source project that built a scalable, distributed Selective Forwarding Unit (SFU) for WebRTC. The core server is written in Go using the Pion WebRTC implementation and licensed under Apache 2.0. At the time, the pitch was straightforward: existing WebRTC infrastructure was fragmented and hard to scale, and LiveKit offered a clean, open alternative.
The company raised a $7 million seed round in 2021 with backing from Redpoint Ventures and angel investors including Justin Kan, Robin Chan, and Elad Gil. A Series A followed in 2022. By late 2023 and into 2024, the company shifted its primary narrative from "WebRTC infrastructure" toward "real-time AI infrastructure," reflecting what was happening on the platform: developers were connecting speech recognition models, language models, and text-to-speech engines to LiveKit rooms and using them as voice AI backends.
The trajectory accelerated sharply after OpenAI used LiveKit's infrastructure to ship ChatGPT's Advanced Voice Mode in late 2024. A $45 million Series B at a $345 million valuation followed in April 2025, led by Altimeter Capital with participation from Redpoint Ventures and Hanabi Capital. By January 2026, LiveKit reached unicorn status with a $100 million Series C at a $1 billion valuation, led by Index Ventures with Salesforce Ventures, Hanabi Capital, Altimeter, and Redpoint participating. At that point the company reported over 100,000 developers on the platform collectively handling more than 3 billion calls per year.
LiveKit's underlying media server uses a Selective Forwarding Unit architecture. An SFU acts as a specialized media router: when multiple participants send media to a room, the SFU forwards copies of each participant's stream to each subscriber without decoding or re-encoding the media. This keeps latency low and compute costs on the server side manageable compared to an MCU (Multipoint Control Unit), which mixes streams centrally.
Client SDKs exist for JavaScript, Swift, Android, Flutter, React Native, Rust, Python, Unity, and C++, covering virtually every deployment surface where voice AI might run. The server itself is horizontally scalable and can be self-hosted or used through LiveKit Cloud, the managed service hosted on the company's own global edge network.
For voice AI specifically, the SFU architecture means audio from an end user arrives at the SFU and can be forwarded to an agent process running in the cloud, which processes the audio through an STT-LLM-TTS pipeline and sends synthesized speech back through the same room. The agent participates in the room the same way a human participant does, which makes the mental model simple: the agent is just another room participant that happens to be driven by code.
LiveKit Agents is a separate repository from the core LiveKit server (github.com/livekit/agents). The Python package is available on PyPI as livekit-agents. A Node.js port, agents-js (github.com/livekit/agents-js), covers TypeScript and JavaScript developers.
The fundamental runtime unit in the framework is a worker process. When a developer starts an Agents application, it registers itself with a LiveKit server and waits for dispatch requests. When a user joins a room that triggers an agent dispatch, the worker spawns a job subprocess that connects to that specific room. This design allows agents to scale horizontally: multiple worker processes can be deployed across multiple machines, and the LiveKit server distributes incoming job requests across available workers.
The entrypoint for each job is an async function that receives the framework's job context. A minimal example in Python looks like:

```python
from livekit import agents
from livekit.agents.voice import AgentSession, Agent
from livekit.plugins import openai, silero


async def entrypoint(ctx: agents.JobContext):
    # Connect this job to the room it was dispatched to.
    await ctx.connect()

    # Assemble the voice pipeline: VAD, STT, LLM, and TTS.
    session = AgentSession(
        vad=silero.VAD.load(),
        stt=openai.STT(),
        llm=openai.LLM(model="gpt-4o"),
        tts=openai.TTS(),
    )
    await session.start(room=ctx.room, agent=Agent(instructions="You are a helpful assistant."))


if __name__ == "__main__":
    # Register this process as a worker and wait for job dispatches.
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```
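The final line hands control to the Agents CLI, so the same script serves multiple run modes; in recent 1.x releases these include a `dev` subcommand for local development and a `start` subcommand for production workers (exact subcommands may vary by version).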
The AgentSession class is the main orchestrator introduced in the 1.0 release. It replaced the older VoicePipelineAgent and MultimodalAgent abstractions by unifying them into a single object that handles both traditional STT-LLM-TTS pipelines and direct speech-to-speech paths through APIs like the OpenAI Realtime API.
For agents using a traditional pipeline, user audio flows through three sequential stages. First, a speech-to-text model transcribes incoming audio to text. Second, an LLM generates a text response. Third, a text-to-speech model synthesizes that response into audio, which is played back through the room.
LiveKit implements streaming at every stage boundary. Interim STT transcripts are forwarded as the user speaks, and TTS synthesis begins as soon as the LLM emits its first tokens, rather than waiting for the complete response. This pipelining keeps perceived latency low. Under typical conditions with a hosted stack, end-to-end latency from the end of the user's utterance to the first audio of the agent's response is roughly 750 to 900 milliseconds.
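As an illustrative budget only (the component figures here are assumptions, not published numbers), that range decomposes roughly as: endpointing delay after the user stops speaking (~200 ms) + LLM time-to-first-token (~300-400 ms) + TTS time-to-first-byte (~150-250 ms) + network transit, which sums to the 750-900 ms window when each stage streams into the next.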
Knowing when a user has finished speaking is one of the harder problems in voice AI. LiveKit Agents separates this into two layers. VAD (voice activity detection) detects the presence or absence of speech in the audio stream at the frame level. Silero VAD, a widely used lightweight model from the Silero team, is one of the default options. On top of VAD, the framework has a dedicated turn detection model that decides whether a pause in speech represents the end of a turn or just a mid-sentence pause.
The turn detection model shipped with LiveKit Agents 1.0 was trained specifically for conversational contexts and is described by the team as producing more natural conversation flow than simple VAD-based endpointing. Developers can swap in alternative turn detection strategies, including STT endpointing signals from providers that expose them.
When the VAD detects user speech while the agent is mid-response, the framework fires an interruption event that cancels active TTS playback and triggers a new recognition pass. The AgentSession exposes configuration options for interruption handling and false-interruption suppression.
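A sketch of how these layers wire together, assuming the 1.x AgentSession keyword arguments shown here (the turn-detector import path and the endpointing/interruption parameter names follow recent plugin releases and may differ by version):

```python
from livekit.agents.voice import AgentSession
from livekit.plugins import openai, silero
from livekit.plugins.turn_detector.english import EnglishModel

session = AgentSession(
    vad=silero.VAD.load(),          # frame-level speech presence detection
    turn_detection=EnglishModel(),  # end-of-turn model layered on top of VAD
    stt=openai.STT(),
    llm=openai.LLM(model="gpt-4o"),
    tts=openai.TTS(),
    allow_interruptions=True,       # user speech cancels active TTS playback
    min_endpointing_delay=0.5,      # shortest pause considered a possible turn end
)
```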
For agents that use speech-to-speech APIs rather than a sequential pipeline, the framework wraps the API into the same AgentSession interface. The OpenAI Realtime API integration, for example, lets user audio pass directly to GPT-4o, which produces audio responses directly rather than going through separate STT, LLM, and TTS stages. Google's Gemini Live API is supported the same way.
This dual-path architecture lets developers choose between more control (pipeline mode) and lower latency (realtime API mode) without changing the surrounding agent code or infrastructure.
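In code, switching paths amounts to swapping the pipeline components for a realtime model. A minimal sketch, assuming the `openai.realtime.RealtimeModel` wrapper from the OpenAI plugin (constructor arguments vary by plugin version):

```python
from livekit.agents.voice import AgentSession
from livekit.plugins import openai

# Speech-to-speech path: one model handles audio in and audio out,
# so no separate STT or TTS components are configured.
session = AgentSession(
    llm=openai.realtime.RealtimeModel(voice="alloy"),
)
```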
Agents can define tools using Python decorators, and the framework serializes these into the format expected by the LLM. Tools can also be forwarded to the frontend: if a voice agent needs to trigger a UI action in the user's browser (for example, displaying a confirmation dialog or navigating to a page), the agent can forward a tool call to a LiveKit data channel, which the frontend SDK receives and processes. This frontend tool forwarding is specific to LiveKit's architecture and is not available in frameworks that do not bundle transport and agent logic.
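A sketch of tool definition under the 1.x API, assuming the `function_tool` decorator and `RunContext` type exported by `livekit.agents` (the tool body is a hypothetical stand-in):

```python
from livekit.agents import Agent, RunContext, function_tool

class Assistant(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful assistant.")

    @function_tool()
    async def lookup_order(self, context: RunContext, order_id: str) -> str:
        """Look up the status of an order by its ID."""
        # The docstring and type hints are serialized into the tool
        # schema the LLM sees; a real implementation would call a backend.
        return f"Order {order_id} shipped yesterday."
```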
The 1.0 release introduced multi-agent support. An agent can spawn additional agents within the same room and transfer the conversation between them. This enables patterns like a triage agent that hands off to a specialist agent, or a pipeline where one agent transcribes and another synthesizes a response in a different language. Agent-to-agent communication happens through the room's message bus rather than through direct function calls.
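One commonly documented handoff pattern is a tool that returns the next agent; the sketch below assumes the 1.x convention in which returning an `Agent` (optionally paired with a handoff message) from a tool transfers the session to it:

```python
from livekit.agents import Agent, RunContext, function_tool

class SpecialistAgent(Agent):
    def __init__(self):
        super().__init__(instructions="Handle billing questions in detail.")

class TriageAgent(Agent):
    def __init__(self):
        super().__init__(instructions="Greet the caller and route them.")

    @function_tool()
    async def transfer_to_billing(self, context: RunContext):
        """Transfer the caller to the billing specialist."""
        # Returning a new Agent hands the conversation over to it.
        return SpecialistAgent(), "Transferring you to billing."
```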
The framework uses a plugin system where each provider ships as a separate Python package under the livekit-plugins-* namespace. Standardized interfaces for STT, LLM, TTS, VAD, and realtime API categories mean that swapping providers requires changing only a few lines of code.
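For example, moving the earlier OpenAI-only pipeline to Deepgram STT and Cartesia TTS is a construction-time change only (default model selections here are illustrative):

```python
from livekit.agents.voice import AgentSession
from livekit.plugins import cartesia, deepgram, openai

session = AgentSession(
    stt=deepgram.STT(),   # was openai.STT()
    llm=openai.LLM(model="gpt-4o"),
    tts=cartesia.TTS(),   # was openai.TTS()
)
```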
As of mid-2025, the plugin registry includes integrations across all major AI service categories:
| Category | Providers |
|---|---|
| Speech-to-text (STT) | OpenAI Whisper, Deepgram, AssemblyAI, Google, Azure, AWS, Gladia, Soniox, Speechmatics, Clova, RTZR, Spitch, Sarvam |
| Text-to-speech (TTS) | ElevenLabs, Cartesia, OpenAI, Google, Azure, AWS, LMNT, Neuphonic, Rime, Murf, Resemble AI, Fish Audio, Phonic, Speechify, Smallest AI |
| Voice activity detection (VAD) and turn detection | Silero VAD, LiveKit turn detector (custom model) |
| LLM | OpenAI, Anthropic, Google, Mistral AI, Groq, Cerebras, xAI, Fireworks AI, Baseten, NVIDIA NIM, Azure, AWS Bedrock |
| Realtime speech-to-speech | OpenAI Realtime API, Google Gemini Live, Ultravox, Hume AI, Inworld |
| Avatars | Tavus, Simli, Hedra, BeyondPresence, LiveAvatar, Avatario, D-ID |
| Memory | Mem0, LangChain |
| Telephony | Telnyx, Twilio (via SIP) |
LiveKit also operates its own inference layer, LiveKit Inference, which provides access to models from OpenAI, Google, Deepgram, Cartesia, and ElevenLabs directly through LiveKit Cloud without requiring separate API keys from each provider. This simplifies billing and removes the need to manage credentials across multiple provider dashboards.
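With LiveKit Inference, components can be specified as model descriptor strings instead of provider plugin objects; the sketch below assumes the `provider/model` string format shown in recent documentation (the model names are illustrative):

```python
from livekit.agents.voice import AgentSession

# String descriptors route through LiveKit Inference, so no
# per-provider API keys are configured in the agent process.
session = AgentSession(
    stt="deepgram/nova-3",
    llm="openai/gpt-4o-mini",
    tts="cartesia/sonic-2",
)
```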
In October 2024, OpenAI and LiveKit announced a technical partnership that confirmed LiveKit's role in powering ChatGPT's Advanced Voice Mode. In the architecture OpenAI published, a user's speech is captured by the LiveKit client SDK in the ChatGPT app and streamed over LiveKit Cloud to an OpenAI voice agent process. That agent relays the audio to GPT-4o, which runs inference and streams audio tokens back through the same room to the user's device.
The partnership produced two concrete outputs. First, LiveKit released a Multimodal Agent API in the Agents framework specifically designed to wrap the OpenAI Realtime API. Second, a Hacker News post from an OpenAI engineer confirmed that the open source agents repository was the same framework underlying ChatGPT's voice experience, which drove substantial developer attention toward the project.
The Agents 1.0 release in April 2025 coincided with LiveKit's Series B announcement and included co-authored documentation and examples from OpenAI for the Realtime API integration path. By that point, the partnership had become one of the highest-profile production endorsements in the voice AI tool ecosystem.
LiveKit ships a separate SIP bridge component (github.com/livekit/sip) that connects the LiveKit room model to the public telephone network. When a phone call arrives at a LiveKit SIP trunk, the caller is bridged into a LiveKit room as a SIP participant. From the agent's perspective, there is no meaningful difference between a web browser user and a phone caller: both appear as room participants, and the same agent code handles both.
Inbound calling requires configuring an inbound SIP trunk and dispatch rules that determine which agent process handles incoming calls and whether callers need to enter a PIN. Outbound calling uses the CreateSIPParticipant API to initiate a phone call from within an agent session.
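A hedged sketch of outbound dialing using the server API bindings from the `livekit-api` package (the trunk ID, destination number, and room name are placeholders):

```python
import asyncio
from livekit import api

async def dial(phone_number: str):
    # Reads LIVEKIT_URL and API credentials from the environment.
    lkapi = api.LiveKitAPI()
    try:
        # Bridge an outbound PSTN call into a room as a SIP participant.
        await lkapi.sip.create_sip_participant(
            api.CreateSIPParticipantRequest(
                sip_trunk_id="ST_xxxx",      # placeholder outbound trunk ID
                sip_call_to=phone_number,
                room_name="outbound-call",
                participant_identity="callee",
            )
        )
    finally:
        await lkapi.aclose()

asyncio.run(dial("+15550123"))
```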
In 2025 LiveKit introduced LiveKit Phone Numbers, a managed telephony product that lets developers purchase US local or toll-free phone numbers directly from the LiveKit Cloud dashboard. The phone number routes incoming calls to an agent with roughly four lines of configuration, removing the need to set up a third-party SIP provider. DTMF (touch-tone) support and SIP REFER (call transfer) are both available for more complex telephony workflows.
LiveKit Cloud is the managed hosting product offered by LiveKit Inc. It runs the same open source media server on a global edge network and provides additional services for teams that do not want to operate infrastructure.
Pricing follows a tiered subscription model with usage-based overage charges:
| Plan | Monthly Price | Agent session minutes | WebRTC participant minutes | Concurrent agents |
|---|---|---|---|---|
| Build | $0 | 1,000 | 5,000 | 5 |
| Ship | $50 | 5,000 | 25,000 | 20 |
| Scale | $500 | 50,000 | 250,000 | 600 |
| Enterprise | Custom | Custom | Custom | Custom |
Overage on agent session minutes is billed at $0.01 per minute. WebRTC participant minutes are billed at $0.0004 per minute. LiveKit Inference credits add a per-model-request charge on top of these transport costs.
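As a worked example under the listed rates (transport only, excluding provider and Inference charges): a team on the Ship plan consuming 12,000 agent session minutes in a month would pay the $50 base plus (12,000 − 5,000) × $0.01 = $70 in overage, or $120 total before participant-minute and inference costs.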
LiveKit Cloud is SOC 2 Type II, GDPR, and HIPAA compliant, which is relevant for healthcare deployments. An analytics and telemetry dashboard is included at all tiers. In 2025 the company launched Agent Observability in the dashboard as a beta feature, providing per-session traces that show where latency is accumulating across STT, LLM, and TTS stages.
Pipecat and Vapi are the two most commonly compared alternatives to LiveKit Agents.
| | LiveKit Agents | Pipecat | Vapi |
|---|---|---|---|
| Open source | Yes (Apache 2.0) | Yes (BSD 2-Clause) | No |
| Self-hostable | Yes | Yes | No |
| Transport layer | LiveKit WebRTC (bundled) | Provider-agnostic (Daily, WebRTC, local) | Vapi cloud (managed) |
| Primary language | Python, TypeScript | Python | Any (REST API) |
| Pipeline model | AgentSession orchestrator | Directed processor graph | Hosted pipeline |
| Multi-agent support | Yes (room-based) | Partial | Limited |
| Telephony | LiveKit SIP (first-party) | Via Daily or custom | Built-in |
| Cloud managed option | LiveKit Cloud | Via self-hosting or Daily | Vapi cloud only |
| Pricing model | Usage-based (infra + providers) | Self-hosted (provider costs only) | Per-minute platform fee + providers |
The practical distinction between LiveKit Agents and Pipecat is primarily architectural. Pipecat uses a directed processor graph where audio frames flow through a sequence of processors, and branching the pipeline (for example, to run sentiment analysis in parallel with the main conversation) requires forking the graph. LiveKit organizes parallel workstreams as separate agent processes that coordinate through room events. Pipecat can run on top of LiveKit's transport layer, and some production architectures use both: Pipecat's pipeline composition model with LiveKit's room and SFU infrastructure.
Vapi operates at a higher level of abstraction than either framework. It provides a REST API and a dashboard for configuring voice agents without writing agent code directly. Developers who want an appointment scheduling bot or a customer service IVR replacement can often ship faster with Vapi than with a custom LiveKit setup. However, Vapi is a closed-source managed product, which means costs do not decrease with volume in the same way a self-hosted LiveKit deployment would. At high call volumes the per-minute platform fee Vapi charges typically exceeds the cost of running a custom stack on LiveKit.
LiveKit Agents has seen adoption across several domains:
Customer service and telephony automation. Outbound dialing agents for appointment reminders, lead qualification, and debt collection use LiveKit SIP to place calls. Inbound IVR replacement is another common pattern, where a voice agent replaces a touch-tone phone tree with a natural-language interface.
Healthcare. Pre-visit intake forms, symptom triage, insurance verification, and clinical note generation during telemedicine consultations are all cited by the company as production deployment patterns. HIPAA compliance on LiveKit Cloud enables these deployments without requiring a separately negotiated business associate agreement.
Education and tutoring. Language learning applications, where a voice agent plays the role of a conversation partner, and math or reading tutors that adapt to student responses are among the documented education use cases.
In-app voice companions. Mobile and web applications that embed a voice agent for accessibility or user experience reasons, such as an AI assistant inside a productivity tool, use the browser and mobile client SDKs to integrate the agent without requiring a phone call.
Robotics and embodied AI. LiveKit's description of itself as infrastructure for "physical AI agents" points toward robotic applications where an agent needs a low-latency communication channel with a remote operator or with other agents. The ESP32 SDK enables microcontroller-based edge devices to participate in LiveKit rooms.
The livekit/agents repository on GitHub accumulated approximately 8,500 stars as of mid-2025. The Hacker News post from October 2024 confirming LiveKit's role in ChatGPT's voice mode drove a substantial spike in developer attention and is frequently cited as the moment the framework moved from niche to mainstream awareness in the voice AI developer community.
The company's developer community on Slack is active, and LiveKit runs a community forum for agents-specific questions. The framework's versioning cadence has been fast: the project went from pre-1.0 beta releases to version 1.5 within roughly six months of the 1.0 launch, with incremental releases addressing feedback on session management, error handling, and agent orchestration.
The open source model has been a notable differentiator in a space where several competing products are closed-source managed platforms. Developers building high-volume applications cite the ability to self-host the entire stack as a significant cost advantage. The permissive Apache 2.0 license also removes concerns about commercial usage restrictions.
Positive coverage from TechCrunch followed the Series B announcement in April 2025, and the Series C in January 2026 was covered by Bloomberg, TechCrunch, and SiliconAngle. The OpenAI partnership is consistently the most-cited proof point in coverage of the company.
Developer feedback on GitHub and the LiveKit community forum identifies several recurring issues.
Latency in telephony contexts has been described as noticeably higher than in browser-based deployments. Issues in the GitHub tracker document end-to-end latency of 4 or more seconds per turn when running agents connected via SIP, compared to sub-second performance in browser WebRTC contexts. LiveKit attributes part of this gap to the additional network hops introduced by PSTN routing, but the issue remains an active area of improvement.
Agent startup latency is another reported friction point. Under load, some developers have reported delays of 15 to 50 seconds between room creation and when a worker process receives the job dispatch, which is significantly above the documented expected delay of under 150 milliseconds. This behavior appears to correlate with worker pool exhaustion and process startup overhead rather than fundamental framework design, but it affects production reliability for teams with spiky traffic patterns.
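One mitigation for spiky traffic is prewarming: keeping idle subprocesses on standby so a dispatch does not pay full process-startup cost. A sketch assuming the 1.x WorkerOptions API (the `prewarm_fnc` and `num_idle_processes` parameter names follow recent releases and may differ by version):

```python
from livekit import agents
from livekit.plugins import silero

def prewarm(proc: agents.JobProcess):
    # Load heavyweight assets (e.g., the Silero VAD model) once per
    # subprocess, before any job is dispatched to it.
    proc.userdata["vad"] = silero.VAD.load()

opts = agents.WorkerOptions(
    entrypoint_fnc=entrypoint,   # the job entrypoint defined earlier
    prewarm_fnc=prewarm,
    num_idle_processes=4,        # warm subprocesses kept ahead of demand
)
```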
The versioning cadence, while positive for feature velocity, has produced occasional breaking changes. Developers who built against pre-1.0 APIs found that the 1.0 release replaced VoicePipelineAgent and MultimodalAgent with AgentSession, requiring meaningful code changes to upgrade. The team has signaled an intent to keep the 1.x API stable, but the history of rapid iteration is a risk factor for teams that do not want to track upstream changes closely.
Plugin quality varies across the long tail of integrations. Providers at the center of the ecosystem (OpenAI, Deepgram, ElevenLabs, Cartesia) receive prompt updates and well-tested implementations. Integrations for less common providers are sometimes behind current API versions or lack full feature coverage.
Finally, the framework requires a running LiveKit server (whether self-hosted or LiveKit Cloud) as part of the stack. Teams that want a framework they can run in a completely serverless or embedded context without a persistent signaling server cannot use LiveKit Agents in its current form.