GPT-Realtime / OpenAI Realtime API
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,189 words
GPT-Realtime is a family of speech-to-speech models exposed through the OpenAI Realtime API, a low-latency interface that lets developers build voice agents which take spoken audio in and return spoken audio out without an intermediate text transcription stage.[1][2] The API was announced in public beta on October 1, 2024 at the OpenAI DevDay developer conference in San Francisco, alongside the gpt-4o-realtime-preview model.[1][3] After roughly eleven months of iterative beta releases, OpenAI graduated the service to general availability on August 28, 2025 with a new headline model named gpt-realtime, dropping the preview suffix and the gpt-4o- prefix from the model identifier.[2][4] In May 2026 OpenAI shipped a second generation, including gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper, lifting the context window from 32,000 to 128,000 tokens and adding adjustable reasoning effort.[5][6] The Realtime API is the same audio infrastructure that powers Advanced Voice Mode in ChatGPT and forms the model layer used by voice agent platforms such as Vapi, Retell AI, and LiveKit Agents.[7][8]
OpenAI introduced the Realtime API as part of a four-product DevDay slate on October 1, 2024 that also included Prompt Caching, Vision Fine-Tuning, and Model Distillation.[3] The first available model, gpt-4o-realtime-preview-2024-10-01, was a derivative of GPT-4o tuned for low-latency bidirectional audio over a single WebSocket connection.[1] At launch the API supported six voices (alloy, echo, fable, onyx, nova, shimmer) carried over from the existing text-to-speech endpoint and offered function calling, server-side voice activity detection (VAD), and optional input-audio transcription for logging.[1] OpenAI priced the beta at $5 per million text input tokens, $20 per million text output tokens, $100 per million audio input tokens, and $200 per million audio output tokens, which the company estimated at roughly $0.06 per minute of audio input and $0.24 per minute of audio output.[1]
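The per-token and per-minute launch prices quoted above together imply an audio token rate. The following is a minimal arithmetic sketch, not an official figure: the function name is hypothetical, and the result is only as precise as the rounded per-minute estimates it is derived from.

```python
# Launch-era audio prices quoted in the text (USD per 1M tokens).
AUDIO_IN_USD_PER_M = 100.0
AUDIO_OUT_USD_PER_M = 200.0


def implied_tokens_per_minute(usd_per_minute: float, usd_per_million_tokens: float) -> float:
    """Tokens per minute implied by a per-minute price and a per-token price."""
    return usd_per_minute / usd_per_million_tokens * 1_000_000


# Per-minute estimates quoted in the text: $0.06 in, $0.24 out.
input_rate = implied_tokens_per_minute(0.06, AUDIO_IN_USD_PER_M)
output_rate = implied_tokens_per_minute(0.24, AUDIO_OUT_USD_PER_M)

print(round(input_rate), round(output_rate))
```

Under these figures, audio input works out to roughly 600 tokens per minute and audio output to roughly 1,200 tokens per minute.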
The TechCrunch coverage of DevDay characterised the launch as positioning OpenAI against third-party voice stacks that previously chained a speech-to-text model, a text language model, and a text-to-speech engine.[3] OpenAI demonstrated the API live with Twilio integrations that placed phone calls on stage, and the company stated that third-party voices were not permitted in order to avoid copyright disputes (a reference to the public dispute earlier in 2024 over the "Sky" voice and Scarlett Johansson).[3]
On December 17, 2024 OpenAI released two new model snapshots, gpt-4o-realtime-preview-2024-12-17 and gpt-4o-mini-realtime-preview-2024-12-17, alongside native WebRTC support.[9][10] The pricing of the full-sized model fell by 60 percent to $40 per million audio input tokens and $80 per million audio output tokens, with $2.50 per million cached audio input tokens.[10] The mini variant was priced at $10 per million audio input tokens and $20 per million audio output tokens, one tenth of the original beta's audio rates.[10] The update extended the maximum session duration from 15 minutes to 30 minutes and added out-of-band concurrent responses, custom input context, and adjustable response timing controls.[9]
The WebRTC option was significant because it eliminated the need for application servers to terminate user audio: a browser or mobile client could now negotiate a peer connection directly with OpenAI using an ephemeral token issued by a developer backend, which OpenAI recommended for any client-facing deployment to keep API keys server-side.[9][11] OpenAI partnered with LiveKit Agents to publish reference architectures that paired LiveKit's WebRTC transport with the Realtime API's WebSocket model interface, a topology that also underpins ChatGPT Advanced Voice Mode.[8]
In late October and November 2024 OpenAI added five additional voices, Ash, Ballad, Coral, Sage, and Verse, which the company described as more expressive than the original six and tunable for emotion, accent, and tone.[12] Prompt caching for the Realtime API was extended to support both text and audio cache hits, with text input cache hits priced at a 50 percent discount and audio input cache hits at an 80 percent discount.[12] OpenAI estimated that a typical 15-minute conversation cost about 30 percent less than at the October launch once cache savings were applied.[12]
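The effect of the cache discounts can be sketched with simple arithmetic. The example below assumes the 50 and 80 percent discounts apply to the October 2024 launch rates ($5 text, $100 audio per million input tokens); the function names and the cached fraction are hypothetical and chosen only for illustration.

```python
def cached_price(base_usd_per_million: float, discount: float) -> float:
    """Price per 1M cached tokens after applying a fractional discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return base_usd_per_million * (1.0 - discount)


def blended_input_cost(tokens: int, cached_fraction: float,
                       base_usd_per_million: float, discount: float) -> float:
    """Total input cost in USD when a fraction of the tokens are cache hits."""
    cached_tokens = tokens * cached_fraction
    fresh_tokens = tokens - cached_tokens
    cached_rate = cached_price(base_usd_per_million, discount)
    return (fresh_tokens * base_usd_per_million + cached_tokens * cached_rate) / 1_000_000


# Assumed launch rates: $5 per 1M text input tokens, $100 per 1M audio input tokens.
print(cached_price(5.0, 0.50))    # text cache hits
print(cached_price(100.0, 0.80))  # audio cache hits
# Hypothetical session: 1M audio input tokens, half of them served from cache.
print(blended_input_cost(1_000_000, 0.5, 100.0, 0.80))
```

The saving for a real conversation depends on how much of each turn's input repeats earlier context, which this sketch leaves as a free parameter.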
Through the first half of 2025 the API received a series of smaller revisions, including removal of the cap on simultaneous sessions (February 3, 2025), addition of native Python SDK support, and a further interim snapshot gpt-4o-realtime-preview-2025-06-03 that introduced European Union data residency through eu.api.openai.com.[13]
OpenAI moved the Realtime API out of beta on August 28, 2025 and at the same time introduced a new headline model called simply gpt-realtime (snapshot gpt-realtime-2025-08-28).[2][4] The dropped prefix was deliberate: OpenAI clarified that the new model was not a straight derivative of GPT-4o but a separate speech-to-speech network with its own training data mix.[4] Pricing for gpt-realtime was set at $4 per million text input tokens, $16 per million text output tokens, $32 per million audio input tokens (with $0.40 per million cached), $5 per million image input tokens, and $64 per million audio output tokens, a further 20 percent reduction in audio cost from the December 2024 snapshot.[2][4]
The GA model added native image input, native MCP server tool calling (MCP servers can be configured at session creation), and Session Initiation Protocol (SIP) transport for direct integration with telephony providers such as Twilio Elastic SIP Trunking, carrier PBX systems, and desk phones.[2][14] Two new voices, Cedar and Marin, debuted alongside gpt-realtime and were designated by OpenAI as the recommended choices for production deployments.[2][11] On the Big Bench Audio reasoning evaluation the model scored 82.8 percent accuracy, up from 65.6 percent for the December 2024 preview, while instruction following on the MultiChallenge audio benchmark improved correspondingly.[2]
On May 8, 2026 OpenAI released three new Realtime API models simultaneously: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper.[5][15] The flagship gpt-realtime-2 is described by OpenAI as the first voice model with "GPT-5-class reasoning" and exposes five adjustable reasoning effort levels (minimal, low, medium, high, xhigh) that trade latency against quality.[15][6] The context window expanded from 32,000 to 128,000 tokens, and the model added parallel tool calling with audio "preamble" narration of in-flight tool work (e.g., "let me check that for you").[6]
On the Artificial Analysis Big Bench Audio leaderboard gpt-realtime-2 at xhigh reasoning scored 96.6 percent, tied with Google's Gemini 3.1 Flash Live Preview High and a 15.2 point gain over the 81.4 percent achieved by the intermediate gpt-realtime-1.5 snapshot.[15] On Scale AI's Audio MultiChallenge instruction-retention benchmark, gpt-realtime-2 reached 48.5 percent at xhigh reasoning compared to 34.7 percent for the prior generation; on the Conversational Dynamics (Full Duplex) subset measuring turn-taking and interruption handling, even the minimal-reasoning variant scored 96.1 percent.[15]
gpt-realtime-translate is purpose-built for real-time speech translation across more than 70 input languages and 13 output languages, priced at $0.034 per minute of audio.[5] gpt-realtime-whisper provides streaming speech-to-text with controllable latency at $0.017 per minute, replacing the older non-streaming Whisper endpoint for live use cases.[5]
The Realtime API offers three transports, each suited to a different deployment shape.[11][14]
| Transport | Use case | Recommended for |
|---|---|---|
| WebSocket | Server-to-server bidirectional JSON event stream with base64-encoded audio frames | Backend services, server-orchestrated agents |
| WebRTC | Browser/mobile peer connection with built-in jitter buffering and packet-loss concealment | Direct client connections |
| SIP | Standard telephony signalling; OpenAI accepts inbound SIP INVITEs and dispatches realtime.call.incoming webhooks | Phone numbers, PBX, IVR replacement |
WebSocket was the only transport at launch; WebRTC was added on December 17, 2024; SIP was added as part of the August 2025 GA release.[9][2] OpenAI documentation recommends WebRTC for any client running outside a controlled data centre because the protocol's loss recovery and adaptive jitter buffering produce more consistent quality on consumer networks than raw WebSocket framing.[11]
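The transport guidance above reduces to a small decision rule. The sketch below is illustrative only: the enum, the function, and its two boolean parameters are hypothetical names, not part of any OpenAI SDK.

```python
from enum import Enum


class Transport(Enum):
    WEBSOCKET = "websocket"
    WEBRTC = "webrtc"
    SIP = "sip"


def choose_transport(*, telephony: bool, runs_on_client: bool) -> Transport:
    """Pick a transport following the recommendations in the table above."""
    if telephony:
        # Phone numbers, PBX systems, and IVR replacement use SIP signalling.
        return Transport.SIP
    if runs_on_client:
        # Browsers and mobile apps sit on consumer networks, where WebRTC's
        # jitter buffering and loss concealment are recommended.
        return Transport.WEBRTC
    # Backend services and server-orchestrated agents use the JSON event stream.
    return Transport.WEBSOCKET


print(choose_transport(telephony=False, runs_on_client=True).value)
```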
A Realtime session is driven by an event stream defined by approximately three dozen typed events, split into client-emitted events (such as session.update, input_audio_buffer.append, response.create, and conversation.item.create) and server-emitted events (such as session.created, session.updated, input_audio_buffer.speech_started, response.audio.delta, response.function_call_arguments.done, and response.done).[16] Audio is streamed in 20-millisecond frames as base64-encoded payloads inside JSON events; output audio arrives as a series of response.audio.delta events that the client concatenates and plays.[16]
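The framing described above can be sketched without a network connection. The example assumes 24 kHz, 16-bit mono PCM input, and it assumes the base64 payload is carried in an `audio` field on `input_audio_buffer.append` and in a `delta` field on `response.audio.delta`; those field names and both helper functions are illustrative assumptions and should be checked against the current event reference.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 960 bytes


def append_events(pcm: bytes):
    """Yield one JSON-encoded append event per 20 ms frame of raw PCM."""
    for start in range(0, len(pcm), FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]  # the final frame may be shorter
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(frame).decode("ascii"),  # assumed field name
        })


def collect_output_audio(server_events) -> bytes:
    """Concatenate the audio carried by response.audio.delta events."""
    chunks = []
    for raw in server_events:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            chunks.append(base64.b64decode(event["delta"]))  # assumed field name
    return b"".join(chunks)


# 50 ms of silence becomes two full frames plus one partial frame.
events = list(append_events(bytes(2400)))
print(len(events))
```

A real client would send each event over the WebSocket as it is produced and feed decoded output chunks to an audio device rather than collecting them in memory.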
Voice activity detection is performed server-side by default. The session config exposes a turn_detection block that lets developers tune the silence threshold and prefix padding; when a user pause crosses the threshold the server emits input_audio_buffer.speech_stopped and automatically triggers a response unless turn detection is disabled.[16][11] As a GA feature OpenAI added an input_audio_buffer.timeout_triggered event that fires after a configurable idle timeout.[11]
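A turn-detection configuration of the kind described above might be sent as a `session.update` event. The sketch below is hedged: `server_vad`, `silence_duration_ms`, and `prefix_padding_ms` are assumed field names, the default values are arbitrary, and sending `null` to disable turn detection is likewise an assumption rather than something the text confirms.

```python
import json
from typing import Optional


def turn_detection_update(silence_ms: int = 500,
                          prefix_padding_ms: int = 300,
                          enabled: bool = True) -> str:
    """Build a session.update event that tunes or disables server-side VAD."""
    turn_detection: Optional[dict] = None
    if enabled:
        turn_detection = {
            "type": "server_vad",               # assumed mode name
            "silence_duration_ms": silence_ms,  # pause length that ends a turn
            "prefix_padding_ms": prefix_padding_ms,  # audio kept before speech onset
        }
    return json.dumps({
        "type": "session.update",
        "session": {"turn_detection": turn_detection},
    })


print(turn_detection_update(silence_ms=700))
print(turn_detection_update(enabled=False))
```

With turn detection disabled, the client would be responsible for deciding when the user has finished and for sending `response.create` itself.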
The API supports three audio codecs on both input and output: pcm16 (16-bit linear PCM at 24 kHz, mono, little-endian), g711_ulaw, and g711_alaw.[17] The two G.711 variants are 8 kHz logarithmic codecs used by classical telephony; their inclusion is what allows direct interconnection with SIP trunks and PBX systems without an external transcoder.[17][14] Each session can negotiate a different input and output codec, so a Twilio inbound call can stream g711_ulaw while the application receives 24 kHz pcm16 for archival.[14]
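The difference between the three codecs is easiest to see as raw payload sizes. The sketch below uses the sample rates stated above together with the standard G.711 encoding of one byte per sample; the `Codec` class is a hypothetical helper, and the 20 ms frame length is taken as an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Codec:
    name: str
    sample_rate_hz: int
    bytes_per_sample: int

    def bytes_per_frame(self, frame_ms: int = 20) -> int:
        """Raw payload size of one mono frame, before base64 encoding."""
        return self.sample_rate_hz * self.bytes_per_sample * frame_ms // 1000

    def kbit_per_second(self) -> float:
        """Raw mono bitrate in kilobits per second."""
        return self.sample_rate_hz * self.bytes_per_sample * 8 / 1000


CODECS = [
    Codec("pcm16", 24_000, 2),     # 16-bit linear PCM at 24 kHz
    Codec("g711_ulaw", 8_000, 1),  # 8-bit mu-law at 8 kHz
    Codec("g711_alaw", 8_000, 1),  # 8-bit A-law at 8 kHz
]

for codec in CODECS:
    print(codec.name, codec.bytes_per_frame(), codec.kbit_per_second())
```

The six-fold gap in raw bitrate (384 kbit/s against 64 kbit/s) is the trade between archival fidelity and compatibility with telephony trunks.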
Tools are declared at session start in a tools array that mirrors the schema used by the OpenAI API Chat Completions endpoint. When the model decides to call a tool it emits a response.function_call_arguments.delta stream followed by response.function_call_arguments.done; the client executes the tool and writes the result back via a conversation.item.create event with a function_call_output item, then optionally requests a follow-up response.[16] Beginning with the GA release function calling became fully asynchronous: the model can continue conversing while a tool call is in flight, automatically producing filler utterances such as "I'm still waiting on that" rather than blocking on the tool's completion.[11] As of GA the Realtime API also accepts an mcp_servers block at session creation that registers remote MCP server endpoints whose tools become callable in the conversation.[2][11]
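The client side of the tool-call round trip can be sketched as a pure function over event dictionaries. The example is a sketch under assumptions: `get_weather` is a hypothetical tool, and the `name`, `call_id`, `arguments`, and `output` field names are assumed rather than confirmed by the text above.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call an external service."""
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def handle_function_call_done(event: dict) -> list[dict]:
    """Run the requested tool and build the client events that report its result."""
    arguments = json.loads(event["arguments"])  # arguments arrive as a JSON string
    result = TOOLS[event["name"]](**arguments)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],  # ties the output to the model's call
                "output": json.dumps(result),
            },
        },
        # Optional follow-up so the model speaks about the tool result.
        {"type": "response.create"},
    ]


done_event = {
    "type": "response.function_call_arguments.done",
    "name": "get_weather",
    "call_id": "call_1",
    "arguments": '{"city": "Paris"}',
}
replies = handle_function_call_done(done_event)
print(json.dumps(replies[0]))
```

Keeping the handler free of I/O makes it straightforward to run slow tools in a background task while the session continues, which matches the asynchronous behaviour described above.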
The voice catalogue grew across releases. The launch lineup of alloy, echo, fable, onyx, nova, and shimmer came from the pre-existing TTS endpoint.[1] Ash, Ballad, Coral, Sage, and Verse were added in late October 2024 as more expressive options.[12] Cedar and Marin shipped with the August 2025 GA gpt-realtime model and are documented as the recommended voices for assistant audio output.[2][11]
OpenAI does not publish official end-to-end latency numbers, but third-party measurements during 2025 placed median time-to-first-byte at approximately 500 milliseconds for clients in the contiguous United States, with full-sentence response latencies of roughly 1.2 to 2.0 seconds and 95th-percentile latency creeping to 2.5 to 3.0 seconds under noisy input or long tool chains.[18] LiveKit's published architecture for Advanced Voice Mode reports an end-to-end target of approximately 300 milliseconds for the client-server WebRTC leg.[8]
The table below summarises the principal Realtime API model snapshots from beta to second generation.
| Snapshot | Released | Audio in / out per 1M tokens | Cached audio in | Notes |
|---|---|---|---|---|
| gpt-4o-realtime-preview-2024-10-01 | 2024-10-01 | $100 / $200 | n/a | Public beta launch[1] |
| gpt-4o-realtime-preview-2024-12-17 | 2024-12-17 | $40 / $80 | $2.50 | WebRTC; 60% price cut[10] |
| gpt-4o-mini-realtime-preview-2024-12-17 | 2024-12-17 | $10 / $20 | $0.30 | Cheap variant[10] |
| gpt-4o-realtime-preview-2025-06-03 | 2025-06-03 | $40 / $80 | $2.50 | EU data residency[13] |
| gpt-realtime (-2025-08-28) | 2025-08-28 | $32 / $64 | $0.40 | GA; MCP; SIP; image input[2] |
| gpt-realtime-2 | 2026-05-08 | $32 / $64 | $0.40 | 128K context; reasoning levels[5][6] |
| gpt-realtime-translate | 2026-05-08 | $0.034/min | n/a | Live translation[5] |
| gpt-realtime-whisper | 2026-05-08 | $0.017/min | n/a | Streaming STT[5] |
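The successive audio price cuts in the table can be checked with a short calculation. This is a sketch over the table's figures for the full-sized token-priced snapshots only; the dictionary and function names are hypothetical.

```python
# (audio input, audio output) in USD per 1M tokens, taken from the table above.
AUDIO_PRICES = {
    "gpt-4o-realtime-preview-2024-10-01": (100.0, 200.0),
    "gpt-4o-realtime-preview-2024-12-17": (40.0, 80.0),
    "gpt-realtime-2025-08-28": (32.0, 64.0),
}


def percent_cut(old: float, new: float) -> float:
    """Percentage reduction going from an old price to a new price."""
    return (old - new) * 100 / old


snapshots = list(AUDIO_PRICES)
for previous, current in zip(snapshots, snapshots[1:]):
    old_in, old_out = AUDIO_PRICES[previous]
    new_in, new_out = AUDIO_PRICES[current]
    print(current, percent_cut(old_in, new_in), percent_cut(old_out, new_out))
```

Both input and output audio fell by 60 percent in December 2024 and by a further 20 percent at general availability, for a cumulative 68 percent reduction from the launch rates.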
The June 2025 deprecation notice for gpt-4o-realtime-preview-2024-10-01 gave developers a three-month transition window to the December 2024 snapshot.[19]
The Realtime API competes most directly with Google's Gemini Live (delivered as part of the Gemini API and refreshed on March 26, 2026 as gemini-3.1-flash-live), xAI's Grok Voice (which adopted the OpenAI Realtime wire protocol to ease migration), Hume AI's Empathic Voice Interface, and Inworld's Realtime API.[20] Both OpenAI and Google operate on a native multimodal architecture in which a single model ingests audio and emits audio without an explicit transcription step, while Hume and Inworld build on top of orchestrated pipelines.[20] Independent benchmarks from Artificial Analysis show gpt-realtime-2 and gemini-3.1-flash-live tied at 96.6 percent on Big Bench Audio at high reasoning settings.[15] Pricing comparisons are imprecise because Google bills per minute and OpenAI bills per token, but third-party calculators consistently report Gemini Live as cheaper per minute at comparable reasoning effort.[20]
A second axis of competition runs through orchestration platforms rather than model providers. Vapi and Retell AI offer a higher-level voice agent layer over both OpenAI Realtime and traditional STT-LLM-TTS pipelines, charging a per-minute platform fee on top of underlying model costs.[21] ElevenLabs competes through Conversational AI, which pairs ElevenLabs TTS voices with a configurable backbone LLM and a proprietary turn-taking model.[21] LiveKit Agents is OpenAI's officially partnered open-source framework for building Realtime API applications and is the reference implementation OpenAI itself uses for ChatGPT Advanced Voice Mode.[8]
OpenAI's customer documentation and launch posts highlight several deployment patterns.[1][2][22]
Customer support and contact-centre automation is the most-cited application: voice agents handle inbound calls, answer common questions, capture intent before routing to a human, and execute back-office actions through function calls. The combination of SIP transport, async function calling, and image input (for example, a customer holding a damaged product to the phone camera in a hybrid web call) is positioned for this use case.[2][14] Deutsche Telekom is named by OpenAI as a deployment partner building multilingual customer support that uses gpt-realtime-translate to bridge agent and caller languages in real time.[5][22]
Language tutoring was an early showcase. Speak, a conversational language-learning app, integrated the Realtime API during the public beta to power role-play sessions in which learners practice spoken conversations in a target language; the app was used in the DevDay keynote demonstration.[1] Educational applications use the API to build interactive tutors that explain concepts verbally and adapt pacing to learner response.[22]
Accessibility and assistive technologies use the Realtime API for hands-free interaction, including screen-reader replacement, voice interfaces that complement sign-language tools, and conversational interfaces for users with motor impairments.[22] Healthify, a nutrition and fitness coaching app, uses the Realtime API to drive its AI coach "Ria" and routes more complex cases to human dietitians.[2]
Other documented deployments include IVR replacement on top of LiveKit Agents and SIP trunking, voice-controlled robotics through WebRTC connections from on-device controllers, and gaming non-player characters that hold open-ended spoken dialogue with players.[8][14]
Coverage of the October 2024 beta launch was broadly positive on capability but skeptical on cost. TechCrunch noted that the per-minute audio pricing was high enough to make most consumer-scale deployments uneconomic until the December 2024 price cuts; the report also called out OpenAI's decision not to require automated disclosure that callers were speaking to an AI, leaving that responsibility to developers.[3] InfoQ's DevDay 2024 coverage emphasised the integration with Twilio as a stage-demo highlight but also raised pricing concerns.[23]
Developer community discussion during the beta period focused on three recurring issues: input transcription accuracy for accented English and non-English speech, name and proper-noun recognition in tool-call arguments, and unexpectedly high session costs caused by long silence periods being billed as audio input.[12][9] The December 2024 snapshot was credited with improving input reliability, but transcription quality remained a community concern through 2025.[9]
After the August 2025 GA release, coverage focused on the production-readiness of the API: the addition of SIP, MCP, and async function calling made the service practical for contact-centre replacement deployments rather than just demos.[2][11] The May 2026 second-generation release attracted notice both for the GPT-5-class reasoning claim and for the open question of how the new "xhigh" reasoning level interacts with the API's latency floor; OpenAI documentation recommends that production deployments default to "low" effort to preserve real-time response latency.[6][15]
The Realtime API is available to all paying OpenAI API customers globally, with separate European Union data residency available for gpt-realtime-2025-08-28 and gpt-4o-realtime-preview-2025-06-03 through the dedicated eu.api.openai.com endpoint.[13] The service is also exposed through Microsoft Azure OpenAI Service in Microsoft Foundry, where the same model snapshots ship with Microsoft's own SLA and regional deployment options.[24] Azure offers separate WebSocket, WebRTC, and SIP transports mirroring the OpenAI public surface.[24]
OpenAI deprecated gpt-4o-realtime-preview-2024-10-01 on June 10, 2025 with a three-month sunset window, recommending migration to the December 2024 snapshot.[19] As of the May 2026 second-generation release, the gpt-realtime slug points to the August 2025 snapshot, the gpt-realtime-2 slug points to the May 2026 snapshot, and the gpt-realtime-mini slug points to the December 2025 mini snapshot.[13]