The OpenAI Realtime API is a speech-to-speech interface developed by OpenAI that allows developers to build low-latency, bidirectional voice applications powered by GPT-4o. Rather than chaining separate speech-to-text, language model, and text-to-speech systems, it processes audio input and generates audio output in a single end-to-end pass, preserving prosody, emotional tone, and natural speech cadence throughout. OpenAI announced the API in public beta on October 1, 2024 at its annual developer conference in San Francisco. The system reached general availability in August 2025 alongside the release of a purpose-built model called gpt-realtime.
Before the Realtime API existed as a developer product, OpenAI demonstrated its underlying capabilities through ChatGPT's Advanced Voice Mode. When OpenAI unveiled GPT-4o in May 2024, it described the model as natively multimodal: trained end-to-end on text, images, and audio rather than relying on bolt-on conversion pipelines. The live demo at that announcement showed GPT-4o responding to a researcher telling it a bedtime story, detecting his changing tone, laughing, and commenting on his breathing.
Advanced Voice Mode was eventually rolled out to ChatGPT Plus and Team subscribers in September 2024 after multiple delays. The feature gave users something qualitatively different from the previous voice capability in ChatGPT, which had been a thin layer stitching Whisper (speech recognition), GPT-4 (text reasoning), and a TTS model together. That older pipeline required converting speech to text before the language model could reason about it, and then converting the language model's text response back to speech, losing information at every step. Advanced Voice Mode removed those intermediate conversions: the model hears the audio directly and responds in audio directly.
The Realtime API exposed that same architecture to developers, making it possible to build applications that work the same way Advanced Voice Mode works inside ChatGPT.
OpenAI held its developer conference, called DevDay, in San Francisco on October 1, 2024. The company had moved to a smaller, single-city format compared to its 2023 event. Among several announcements that day, the Realtime API was the most substantial for application developers.
OpenAI described the API as a public beta available immediately to all paying API users. The launch came weeks after Advanced Voice Mode had started rolling out to ChatGPT subscribers, and the framing at DevDay was that developers could now build for their own applications what OpenAI had built for its own consumer product.
At launch, the API supported six voices and required a persistent WebSocket connection. Developers could configure the model's behavior through a session object, set system-level instructions, and define functions the model could call during a conversation. The model handled voice activity detection automatically, detecting when a user stopped speaking and generating a response without the developer needing to signal turn boundaries explicitly.
Other announcements at DevDay 2024 included vision fine-tuning for GPT-4o and a prompt caching feature that reduced costs for applications that sent the same context repeatedly. The Realtime API, however, attracted the most developer attention because it addressed a long-standing friction point: voice applications built on separate STT/LLM/TTS pipelines accumulated latency at each stage and lost non-verbal information during transcription.
The core design decision in the Realtime API is that audio enters and audio exits without text as an intermediary step. A standard voice agent pipeline might look like: user speech to Whisper (producing a transcript), transcript to GPT-4 (producing a text response), text response to a TTS model (producing audio). Each conversion adds latency. Transcription takes time. Moving a text transcript through a separate language model adds more. Running TTS adds more still.
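To make the chained flow concrete, here is a minimal TypeScript sketch of a single turn through such a pipeline; the three stage functions are hypothetical placeholders rather than any particular provider's SDK:

```typescript
// Sketch of a chained STT -> LLM -> TTS turn. The three stage functions are
// hypothetical placeholders standing in for whatever providers a pipeline uses;
// the point is that each awaited stage adds latency, and only text crosses the
// stage boundaries, discarding prosody and tone.
type Stt = (audio: ArrayBuffer) => Promise<string>;
type Llm = (prompt: string) => Promise<string>;
type Tts = (text: string) => Promise<ArrayBuffer>;

async function chainedTurn(
  userAudio: ArrayBuffer,
  stt: Stt,
  llm: Llm,
  tts: Tts,
): Promise<ArrayBuffer> {
  const transcript = await stt(userAudio); // speech-to-text: vocal cues are lost here
  const replyText = await llm(transcript); // the model reasons over neutral text only
  return tts(replyText);                   // text-to-speech: tone cannot reflect context it never saw
}
```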
More importantly, transcription discards information. A speaker who sounds anxious, confused, or sarcastic produces a transcript that reads as neutral text. The language model that receives only the text cannot know the speaker was nervous. The TTS model that receives only the text response cannot modulate the response's tone based on emotional context it was never given.
The Realtime API avoids this by operating at the audio level throughout. The model receives audio tokens directly and produces audio tokens directly. Emotional cues, speaking rate, emphasis, and background context available in the raw audio all remain available to the model during processing.
At launch in October 2024, the Realtime API supported only WebSocket connections. A WebSocket provides a persistent, full-duplex channel between a client and a server. The developer opens a connection to OpenAI's WebSocket endpoint, authenticates using an API key, and then exchanges JSON-formatted events over that channel for the duration of a session.
Client events include session configuration updates, audio buffer appending (chunks of raw PCM audio), response creation requests, and conversation item management. Server events include session confirmations, response lifecycle notifications, audio delta chunks (small fragments of generated audio), transcription results, and function call arguments when the model invokes a tool.
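A minimal server-side sketch of this exchange using the Node ws package is shown below. The endpoint, headers, and event names follow the beta documentation as described above and should be checked against the current API reference; the audio chunk and playback helper are placeholders:

```typescript
// Server-side sketch of the WebSocket event exchange using the "ws" package.
import WebSocket from "ws";

declare const base64Pcm16Chunk: string;         // a base64-encoded PCM16 audio chunk (placeholder)
declare function playChunk(b64: string): void;  // hand generated audio to your playback path (placeholder)

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  },
);

ws.on("open", () => {
  // Client event: configure the session once the socket is up.
  ws.send(JSON.stringify({
    type: "session.update",
    session: { instructions: "You are a concise voice assistant.", voice: "alloy" },
  }));
  // Client events: append a chunk of audio, commit the buffer, request a response.
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Pcm16Chunk }));
  ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
  ws.send(JSON.stringify({ type: "response.create" }));
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Server events arrive as small JSON messages; generated audio comes back as base64 deltas.
  if (event.type === "response.audio.delta") playChunk(event.delta);
  if (event.type === "response.done") console.log("turn complete");
});
```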
Because WebSocket connections require the API key to authenticate, using the API directly from a browser would expose the key to end users. The standard pattern was to run a relay server: the browser connects to the developer's own server, which holds the key and forwards audio and events to OpenAI's endpoint.
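A relay of this kind can be quite small. The sketch below (again using the ws package, with error handling and reconnection omitted) simply pipes events between the browser and OpenAI's endpoint while keeping the API key on the server:

```typescript
// Minimal relay sketch: the browser connects to this server, which holds the
// API key and forwards events in both directions.
import WebSocket, { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (browser) => {
  const upstream = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      headers: {
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        "OpenAI-Beta": "realtime=v1",
      },
    },
  );
  // Pipe events both ways once the upstream socket is open.
  upstream.on("open", () => {
    browser.on("message", (msg) => upstream.send(msg.toString()));
    upstream.on("message", (msg) => browser.send(msg.toString()));
  });
  // Tear down the pair together.
  browser.on("close", () => upstream.close());
  upstream.on("close", () => browser.close());
});
```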
On December 17, 2024, OpenAI added WebRTC as a second transport option and released two new model snapshots: gpt-4o-realtime-preview-2024-12-17 and gpt-4o-mini-realtime-preview-2024-12-17. These changes arrived alongside a 60% price reduction for audio tokens.
WebRTC is a browser-native protocol designed for peer-to-peer real-time media. OpenAI's implementation uses an ephemeral token system to address the API key exposure problem. The developer's server makes a short-lived token request to OpenAI, which returns a token valid for 60 seconds and usable for a single WebRTC session. The browser then uses that token to establish a direct WebRTC connection to OpenAI's infrastructure. The session itself can run up to 30 minutes. Because the ephemeral token is single-use and short-lived, exposing it to the browser carries minimal risk.
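A server endpoint that mints such a token might look like the following sketch. The /v1/realtime/sessions path and the client_secret field reflect the flow as documented for the WebRTC beta; confirm the current request shape before relying on it:

```typescript
// Server-side sketch of minting an ephemeral token for a browser client.
async function mintEphemeralToken(): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/realtime/sessions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      voice: "alloy",
    }),
  });
  const session = await res.json();
  // The short-lived secret is safe to hand to the browser; the real API key never leaves the server.
  return session.client_secret.value;
}
```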
The WebRTC path uses the browser's native audio APIs, meaning the microphone capture, echo cancellation, and audio playback happen at the browser level without additional JavaScript libraries. Developers building browser-based voice interfaces found this substantially simpler than the WebSocket relay approach. OpenAI noted that WebRTC is recommended for browser and mobile applications, while WebSocket remains appropriate for server-to-server connections, such as a telephony bridge where audio arrives from a phone carrier rather than from a browser.
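On the browser side, the handshake amounts to a standard SDP offer/answer exchange carried over HTTPS. The following sketch reflects the documented beta flow; the data channel name and endpoint should be verified against current docs, and the ephemeral token comes from the server-side minting step above:

```typescript
// Browser-side sketch of the WebRTC handshake with the realtime endpoint.
async function connectRealtime(ephemeralToken: string): Promise<RTCPeerConnection> {
  const pc = new RTCPeerConnection();

  // Play whatever audio track the model sends back.
  const speaker = new Audio();
  speaker.autoplay = true;
  pc.ontrack = (e) => { speaker.srcObject = e.streams[0]; };

  // Send the microphone to the model; the browser handles capture and echo cancellation.
  const mic = await navigator.mediaDevices.getUserMedia({ audio: true });
  pc.addTrack(mic.getTracks()[0], mic);

  // JSON events (session updates, tool calls, transcripts) travel over a data channel.
  const events = pc.createDataChannel("oai-events");
  events.onmessage = (e) => console.log("server event", JSON.parse(e.data));

  // Standard SDP offer/answer exchange, with the answer fetched over HTTPS.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  const res = await fetch(
    "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17",
    {
      method: "POST",
      headers: { Authorization: `Bearer ${ephemeralToken}`, "Content-Type": "application/sdp" },
      body: offer.sdp ?? "",
    },
  );
  await pc.setRemoteDescription({ type: "answer", sdp: await res.text() });
  return pc;
}
```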
Starting with the gpt-realtime model released in August 2025, a SIP integration became available as a third transport option, allowing direct connections to telephone infrastructure including PBX systems and public telephone networks.
A Realtime API session centers on a session object that persists for the conversation's duration. This object holds the system instructions, available tools, voice selection, audio format settings, and voice activity detection configuration.
Voice activity detection can run in three modes. Server-side VAD lets OpenAI's model decide when the user has finished speaking, triggering a response automatically. Client-side turn management gives the developer explicit control, which suits push-to-talk interfaces or applications where the speaking boundaries are already known. A semantic VAD mode uses the model's own understanding to decide when a complete thought has been expressed, rather than relying purely on silence duration.
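Sketched against the WebSocket transport, the three modes map onto the turn_detection field of the session object; the field names below follow the beta reference and may evolve:

```typescript
// Session-level configuration covering the three turn-handling modes.
import WebSocket from "ws";
declare const ws: WebSocket; // an open realtime connection, as in the earlier sketch

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    instructions: "You are a patient customer-support agent.",
    voice: "alloy",
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
    // Server-side VAD: OpenAI decides when the caller has finished speaking.
    turn_detection: {
      type: "server_vad",
      threshold: 0.5,            // speech-detection sensitivity
      prefix_padding_ms: 300,    // audio kept from just before speech started
      silence_duration_ms: 500,  // the default silence threshold discussed below
    },
    // Alternatives:
    //   turn_detection: { type: "semantic_vad" }  -> end of a complete thought, not just silence
    //   turn_detection: null                      -> client-side (push-to-talk) turn control
  },
}));
```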
When VAD is enabled, the model handles interruptions automatically. If the user begins speaking while the model is generating audio, the API emits a conversation.interrupted event, cancels the in-progress response, and begins processing the new input. Developers building production voice agents found interrupt handling one of the more finicky areas to tune; the default 500ms silence threshold works for most conversational uses but may clip sentences in environments with background noise.
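In practice the client's main responsibility is to flush any audio it has already buffered when the user barges in. A hedged sketch over the raw WebSocket events (the playback helper is a placeholder for the app's own audio queue):

```typescript
// Client-side interruption handling: with server VAD the API cancels the
// in-progress response on its own, but any audio already delivered to the
// client keeps playing unless the app flushes it.
import WebSocket from "ws";
declare const ws: WebSocket;                  // an open realtime connection
declare function stopLocalPlayback(): void;   // flush whatever audio is still queued locally (placeholder)

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Server VAD has detected the user talking over the assistant.
  if (event.type === "input_audio_buffer.speech_started") {
    stopLocalPlayback();
  }
  // With manual turn control, the client cancels explicitly instead:
  //   ws.send(JSON.stringify({ type: "response.cancel" }));
});
```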
Sessions can last up to 60 minutes (extended from the original 15-minute limit at launch, then increased to 30 minutes before reaching 60 minutes for the GA model). The session maintains a sliding context window. When the context reaches capacity, the API automatically truncates the oldest messages. Developers can disable auto-truncation and handle overflow explicitly if they need precise control over what the model remembers.
The initial model at the October 2024 launch was gpt-4o-realtime-preview, later given the dated snapshot alias gpt-4o-realtime-preview-2024-10-01. It supported text and audio inputs and outputs, six preset voices, function calling, and sessions up to 15 minutes. It was available only via WebSocket.
Initial audio pricing was $100 per million input tokens and $200 per million output tokens. OpenAI also introduced cached audio input pricing at $20 per million tokens for context that had been previously processed and held in the prompt cache, reducing costs for long conversations with stable instructions.
The December 2024 update introduced WebRTC support and released two new model snapshots. The updated gpt-4o-realtime-preview-2024-12-17 brought improved voice quality, more reliable voice activity detection, and audio tokens priced approximately 60% lower than the October 2024 rates, settling at $40 per million input tokens and $80 per million output tokens.
The same update introduced gpt-4o-mini-realtime-preview, a smaller and cheaper model targeting cost-sensitive applications. It was priced at roughly one-tenth the cost of the full-size model, at $0.60 per million text input tokens and $2.40 per million text output tokens (with its own audio token pricing). The mini model supports the same connection methods and API surface as the full model but produces somewhat less capable responses, making it appropriate for simpler use cases like FAQs or appointment scheduling.
Maximum session length increased from 15 minutes to 30 minutes with this update.
On August 28, 2025, OpenAI moved the Realtime API out of beta and released gpt-realtime, described as the first model in the series purpose-built for production voice agent workloads rather than adapted from the GPT-4o family. The release also brought the API to general availability.
gpt-realtime showed significant benchmark gains over the December 2024 snapshot. On MultiChallenge, an audio benchmark measuring instruction-following accuracy, it scored 30.5% versus 20.6% for the previous model. On ComplexFuncBench, measuring function calling performance in audio contexts, it scored 66.5% versus 49.7%. The model was described as better at reading and following system prompts precisely, including tasks like reading disclaimer scripts verbatim, accurately repeating alphanumeric strings, and switching languages mid-sentence.
New capabilities added at GA included image input, remote MCP server integration, asynchronous function calling (where the model continues a fluid conversation while waiting for a tool result rather than pausing and waiting), and SIP phone calling support. Two new voices, Marin and Cedar, were added exclusively to the GA model.
A smaller variant, gpt-realtime-mini, was also made available for cost-sensitive production workloads alongside the full GA model.
Session duration extended to 60 minutes at GA.
Function calling in the Realtime API works similarly to function calling in the standard chat completions API, with adaptations for a streaming audio context. The developer defines tools in the session configuration using the same JSON schema format as the chat API. When the model decides to call a function, it emits a function call event containing the function name and a JSON object of arguments. The developer's code handles the event, executes the function, and returns a result to the model as a conversation item.
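Sketched against the WebSocket transport, the round trip looks roughly like the following; the flat tool shape and the function_call_output item follow the beta reference, and lookupOrder stands in for real business logic:

```typescript
// Sketch of tool definition and handling over the WebSocket transport.
import WebSocket from "ws";
declare const ws: WebSocket; // an open realtime connection, as in the earlier sketches

// 1. Declare the tool in the session configuration.
ws.send(JSON.stringify({
  type: "session.update",
  session: {
    tools: [{
      type: "function",
      name: "lookup_order",
      description: "Look up an order by its ID",
      parameters: {
        type: "object",
        properties: { order_id: { type: "string" } },
        required: ["order_id"],
      },
    }],
  },
}));

// 2. When the model calls it, execute locally, return the result as a
//    conversation item, then ask the model to continue.
ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.function_call_arguments.done" && event.name === "lookup_order") {
    const args = JSON.parse(event.arguments);
    const result = await lookupOrder(args.order_id); // hypothetical backend call
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: { type: "function_call_output", call_id: event.call_id, output: JSON.stringify(result) },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});

async function lookupOrder(orderId: string): Promise<{ status: string }> {
  return { status: `order ${orderId}: shipped` }; // placeholder business logic
}
```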
In the original API, a function call would pause the conversation: the model would stop generating audio and wait for the tool result before continuing. This introduced a perceptible gap for any function that took more than a fraction of a second to complete.
The gpt-realtime GA model added asynchronous function calling. The model can continue speaking while a background function runs. Developers can inject placeholder audio (such as "Let me check on that") while the function executes, then provide the result for the model to incorporate. This made function-heavy voice agents, like those that look up customer records or check inventory, substantially less awkward to interact with.
Functions can be configured at the session level to apply for the entire conversation, or at the individual response level to apply only to a specific response. The latter allows dynamic tool availability: a customer service agent might have access to a payment processing function only after a user has authenticated.
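A hedged sketch of that pattern, attaching a payment tool only to responses generated after authentication (the response.create override shape follows the beta reference; process_payment is illustrative):

```typescript
// Response-level tool scoping: the tool is attached to a single response
// rather than the whole session, so it only becomes callable once the caller
// has authenticated.
import WebSocket from "ws";
declare const ws: WebSocket;            // an open realtime connection
declare const userAuthenticated: boolean;

ws.send(JSON.stringify({
  type: "response.create",
  response: userAuthenticated
    ? {
        tools: [{
          type: "function",
          name: "process_payment",
          description: "Charge the authenticated customer's saved card",
          parameters: {
            type: "object",
            properties: { amount_cents: { type: "integer" } },
            required: ["amount_cents"],
          },
        }],
      }
    : {}, // unauthenticated callers get a response with no payment tool
}));
```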
Audio token pricing has declined substantially since launch. Audio is billed in tokens rather than seconds; at the launch rates this worked out to roughly $0.06 per minute of audio input and $0.24 per minute of output, which makes per-minute cost a useful unit for comparison.
| Period | Audio input | Audio output | Notes |
|---|---|---|---|
| October 2024 (launch) | $100 / 1M tokens | $200 / 1M tokens | WebSocket only; ~$0.06/min input, ~$0.24/min output |
| October 2024 (cached) | $20 / 1M tokens | $200 / 1M tokens | Cached context introduced Oct 30, 2024 |
| December 2024 | $40 / 1M tokens | $80 / 1M tokens | ~60% reduction; WebRTC added; 4o-mini variant added |
| gpt-realtime (GA, Aug 2025) | $32 / 1M tokens | $64 / 1M tokens | ~20% below Dec 2024 rates; $0.40/1M cached audio input; text tokens $4/1M input, $16/1M output; mini variant also available |
The gpt-4o-mini-realtime-preview model has consistently been priced at approximately one-tenth the cost of the full-size model for text tokens, with proportionally lower audio pricing as well.
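As a rough way to reason about these figures, the per-minute costs in the table can be reproduced from the per-token prices; the tokens-per-minute constants below are back-calculated from the launch rates and are approximations, not documented values:

```typescript
// Back-of-the-envelope audio cost helper. Roughly 600 input and 1,200 output
// audio tokens per minute are implied by the launch per-token prices and
// OpenAI's quoted ~$0.06/min input and ~$0.24/min output; treat as estimates.
const AUDIO_TOKENS_PER_MIN = { input: 600, output: 1_200 };

function audioCostPerMinute(pricePerMillion: { input: number; output: number }) {
  return {
    input: (AUDIO_TOKENS_PER_MIN.input / 1_000_000) * pricePerMillion.input,
    output: (AUDIO_TOKENS_PER_MIN.output / 1_000_000) * pricePerMillion.output,
  };
}

// October 2024 launch rates: ~$0.06/min in, ~$0.24/min out.
console.log(audioCostPerMinute({ input: 100, output: 200 }));
// December 2024 rates: ~$0.024/min in, ~$0.096/min out.
console.log(audioCostPerMinute({ input: 40, output: 80 }));
```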
The Realtime API occupies a distinct position among voice AI products because it combines GPT-4o's reasoning capacity with native audio processing. Other products in the category approach the problem differently.
| Product | Architecture | Latency | Key strength | Key limitation |
|---|---|---|---|---|
| OpenAI Realtime API | End-to-end speech model (GPT-4o) | ~480-520ms round trip | Reasoning quality; ecosystem integration | Higher cost; closed-source |
| Moshi | Open-source dual-stream audio model (Kyutai) | ~160-200ms theoretical | Open weights; dual-stream simultaneous processing | Weaker reasoning than GPT-4o |
| Cartesia | TTS-focused (Sonic SSM architecture) | 40-90ms time-to-first-audio | Lowest latency in class; sub-100ms TTS | TTS only; no STT or LLM layer |
| Hume AI EVI | Speech-language model with emotion optimization | Sub-second | Emotional intelligence; flexible LLM backend | 5-language limit; smaller model |
| Sesame CSM | Open-weights conversational speech model | Not specified | Highly natural prosody; open Apache 2.0 license | Not a full API product; 1B-parameter base model |
Moshi was developed by Kyutai, a French AI research lab, and released in July 2024, predating the OpenAI Realtime API by several months. Moshi processes two audio streams simultaneously: one for the user's speech and one for its own output. This allows it to listen and speak at the same time, which is closer to how human conversation actually works. Its theoretical latency of around 160ms, achievable on an L4 GPU, is substantially lower than that of the OpenAI system.
The practical difference is reasoning quality. Moshi is an open-weights model that can be fine-tuned, which makes it attractive for domain-specific applications and for deployments where keeping data on-premises matters. For general-purpose conversation requiring complex reasoning or tool use, GPT-4o's larger architecture holds a clear advantage.
Cartesia is not a speech-to-speech system. Its Sonic model is a text-to-speech engine built on a state-space model (SSM) architecture rather than a transformer. SSMs allow Sonic to achieve time-to-first-audio as low as 40ms (turbo mode) or 90ms (standard), substantially faster than transformer-based TTS systems.
Cartesia occupies a different slot in the pipeline: it replaces the TTS stage of a traditional pipeline rather than the whole pipeline. Developers would not typically combine Cartesia with the Realtime API; the two represent alternative architectural choices. A developer building a traditional STT+LLM+TTS pipeline might choose Cartesia for the TTS stage to minimize audio latency. A developer using the Realtime API avoids the TTS stage entirely.
Hume AI's Empathic Voice Interface (EVI) takes a different design priority than OpenAI's system. EVI is built around Hume's research into human emotional expression. The model analyzes prosody, tone, and vocal cues in the user's speech and incorporates that emotional context into its responses, adjusting its own speech delivery accordingly. EVI 2 was released in 2024 and EVI 3 followed in 2025 as a dedicated speech language model.
EVI supports flexible LLM backends, meaning developers can configure it to use their own fine-tuned models or third-party providers rather than being locked to a single LLM. OpenAI's Realtime API is tightly coupled to the GPT-4o architecture. EVI's language support was limited to five languages at the time of this comparison, compared to GPT-4o's broader multilingual capability.
Pricing comparison: EVI 2 was priced at approximately $0.072 per minute ($4.32 per hour) with scale discounts. OpenAI's Realtime API at its October 2024 launch pricing ran roughly $0.06/min for audio input and $0.24/min for audio output, making a two-minute exchange (input and output combined) cost considerably more.
Sesame's Conversational Speech Model (CSM) was open-sourced under the Apache 2.0 license on March 13, 2025. CSM-1B has 1 billion parameters and frames speech generation as an end-to-end multimodal learning task using transformers. The model inserts natural pauses, filler sounds, laughter, and tonal variation in ways that human evaluators found convincing in blind tests of isolated speech samples.
CSM is not an API product. It is a base model that developers can run on their own infrastructure. Comparing it directly to the Realtime API is comparing a foundation model to a hosted service. CSM's significance lies in what it demonstrated about the achievable quality of open-weights speech models, not in a competing product offering.
Vapi is a voice agent infrastructure platform that abstracts telephony, STT, LLM, and TTS into a unified developer interface. Vapi added support for the OpenAI Realtime API as a model option, allowing developers to use the speech-to-speech path through Vapi's orchestration layer rather than building their own WebSocket or WebRTC integration. With the Realtime API selected as the LLM backend in Vapi, the traditional pipeline stages (STT to LLM to TTS) are replaced by a single call to the realtime endpoint.
Vapi charges a platform fee of approximately $0.05 to $0.11 per minute on top of underlying provider costs. For developers below roughly 10,000 minutes per month, the abstraction and orchestration that Vapi provides are typically worth the platform markup. Above that threshold, building directly against the OpenAI Realtime API with a telephony integration library like LiveKit becomes cost-competitive.
Retell AI is a similar voice agent platform focused on telephony use cases. It supports the OpenAI Realtime API as a model backend and handles the phone number provisioning, call routing, and agent configuration that developers would otherwise build themselves. Retell's pricing is approximately $0.07 per minute as a platform fee, with LLM costs added on top.
Both Vapi and Retell shield developers from the lower-level concerns of WebRTC and WebSocket session management, making the Realtime API accessible without expertise in real-time audio protocols.
The design of the Realtime API makes it applicable to several categories of application that were difficult or impractical to build with pipeline-based voice systems.
Customer service automation is the most commercially active area. Voice agents built on the Realtime API can handle inbound calls, answer questions, look up account information via function calling, escalate to human agents when needed, and do all of this with low enough latency to feel like a natural phone conversation. The model's ability to follow precise instructions verbatim (improved with gpt-realtime) addresses a real production requirement in regulated industries where agents must read specific disclosures.
Language learning applications have deployed the API to provide conversational practice partners. Speak, a language learning company, used the Realtime API to power its role-play feature. The system can provide pronunciation feedback, correct grammar mid-conversation, and sustain practice scenarios in the target language.
Interactive voice response systems in healthcare and financial services have adopted the API for intake workflows, where collecting structured information (appointment reason, insurance details, account number) via voice is faster than web forms for many users. The function calling capability allows the collected information to be written to backend systems in real time during the call.
Accessibility tools for people who prefer or require voice interfaces have found the API useful because its low latency and conversational naturalness make sustained voice interaction more practical than older pipeline-based approaches.
Educational tutoring applications have used the API to conduct Socratic dialogues, provide verbal explanations of technical concepts, and adapt their communication style based on detected comprehension. The emotional tone information available in the raw audio allows the tutoring system to detect frustration or confusion and adjust accordingly.
The Realtime API has several practical limitations that influenced how developers used it.
Session concurrency limits varied by pricing tier. At launch, Tier 5 API accounts supported approximately 100 simultaneous sessions. This was limiting for applications expecting significant concurrent traffic. OpenAI increased limits over time and the GA release addressed scalability for larger deployments, but the limits remain lower than those available on stateless API endpoints.
Voice selection at launch was restricted to six OpenAI-provided voices with no ability to use custom voices or third-party voice cloning. The December 2024 update and subsequent releases expanded the voice count but did not add custom voice support. Applications that needed a branded or specific voice had to use a separate TTS layer, which eliminated some of the latency benefits of the end-to-end architecture.
Voice activity detection, while generally functional, remained sensitive enough to produce noticeable errors in environments with background noise. Aggressive sensitivity settings clipped sentence endings; conservative settings introduced gaps before responses. Developers building for phone or noisy environments found this required careful tuning. The noise reduction options (near_field and far_field configurations) helped but could add latency when combined with transcription and semantic VAD simultaneously.
Function calling latency in the original beta was noticeable when tools took more than a fraction of a second to respond. The asynchronous function calling introduced with gpt-realtime addressed this but was not available in the preview models.
The API does not support fine-tuning on custom audio data. Developers who needed domain-specific voice behavior (accent adaptation, specialized vocabulary, or industry-specific phrasing patterns) had to work through prompt engineering rather than model-level customization.
Token costs for audio remain substantially higher than text costs on a per-token basis, and audio accumulates tokens continuously for as long as a session runs. Even at the reduced GA rates, audio tokens cost several times more than their text counterparts, making extended voice sessions considerably more expensive than equivalent text-only API usage.