# GPT-Realtime / OpenAI Realtime API

> Source: https://aiwiki.ai/wiki/gpt_realtime
> Updated: 2026-07-23
> Categories: OpenAI, Speech & Audio AI, Voice AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**GPT-Realtime** is a family of speech-to-speech models exposed through the **OpenAI Realtime API**, a low-latency interface that lets developers build voice agents which take spoken audio in and return spoken audio out without an intermediate text transcription stage.[^1][^2] The API was announced in public beta on October 1, 2024 at the OpenAI DevDay developer conference in San Francisco, alongside the `gpt-4o-realtime-preview` model.[^1][^3] After roughly eleven months of iterative beta releases, OpenAI graduated the service to general availability on August 28, 2025 with a new headline model named `gpt-realtime`, dropping the `preview` suffix and the `gpt-4o-` prefix from the model identifier.[^2][^4] In May 2026 OpenAI shipped a second generation, including `gpt-realtime-2`, `gpt-realtime-translate`, and `gpt-realtime-whisper`, lifting the [context window](/wiki/context_window) from 32,000 to 128,000 tokens and adding adjustable reasoning effort.[^5][^6] The Realtime API is the same audio infrastructure that powers [Advanced Voice Mode](/wiki/advanced_voice_mode) in [ChatGPT](/wiki/chatgpt) and forms the model layer used by voice agent platforms such as [Vapi](/wiki/vapi), [Retell AI](/wiki/retell_ai), and [LiveKit Agents](/wiki/livekit_agents).[^7][^8]

## History

### Beta launch at DevDay 2024

OpenAI introduced the Realtime API as part of a four-product DevDay slate on October 1, 2024 that also included [Prompt Caching](/wiki/prompt_caching), Vision Fine-Tuning, and Model Distillation.[^3] The first available model, `gpt-4o-realtime-preview-2024-10-01`, was a derivative of [GPT-4o](/wiki/gpt_4o) tuned for low-latency [bidirectional](/wiki/bidirectional) audio over a single WebSocket connection.[^1] At launch the API supported six voices (alloy, echo, fable, onyx, nova, shimmer) carried over from the existing [text-to-speech](/wiki/text_to_speech_ai) endpoint and offered function calling, server-side voice activity detection (VAD), and optional input-audio transcription for logging.[^1] OpenAI priced the beta at $5 per million text input tokens, $20 per million text output tokens, $100 per million audio input tokens, and $200 per million audio output tokens, which the company estimated at roughly $0.06 per minute of audio input and $0.24 per minute of audio output.[^1]

The TechCrunch coverage of DevDay characterised the launch as positioning OpenAI against third-party voice stacks that previously chained a speech-to-text model, a text [language model](/wiki/language_model), and a text-to-speech engine.[^3] OpenAI demonstrated the API live with Twilio integrations that placed phone calls on stage, and the company stated that third-party voices were not permitted in order to avoid copyright disputes (a reference to the public dispute earlier in 2024 over the "Sky" voice and Scarlett Johansson).[^3]

### December 2024 snapshot and WebRTC

On December 17, 2024 OpenAI released two new model snapshots, `gpt-4o-realtime-preview-2024-12-17` and `gpt-4o-mini-realtime-preview-2024-12-17`, alongside native WebRTC support.[^9][^10] The pricing of the full-sized model fell by roughly 60 percent to $40 per million audio input tokens and $80 per million audio output tokens, with $2.50 per million cached audio input tokens.[^10] The mini variant was priced at $10 per million input tokens and $20 per million output tokens, ten times cheaper than the original beta.[^10] The update extended the maximum session duration from 15 minutes to 30 minutes and added out-of-band concurrent responses, custom input context, and adjustable response timing controls.[^9]

The WebRTC option was significant because it eliminated the need for application servers to terminate user audio: a browser or mobile client could now negotiate a peer connection directly with OpenAI using an ephemeral token issued by a developer backend, which OpenAI recommended for any client-facing deployment to keep API keys server-side.[^9][^11] OpenAI partnered with [LiveKit Agents](/wiki/livekit_agents) to publish reference architectures that paired LiveKit's WebRTC transport with the Realtime API's WebSocket model interface, a topology that also underpins ChatGPT Advanced Voice Mode.[^8]

### Expressive voices and intermediate updates

In late October and November 2024 OpenAI added five additional voices, Ash, Ballad, Coral, Sage, and Verse, which the company described as more expressive than the original six and tunable for emotion, accent, and tone.[^12] Prompt caching for the Realtime API was extended to support both text and audio cache hits, with text input cache hits priced at a 50 percent discount and audio input cache hits at an 80 percent discount.[^12] OpenAI estimated that a typical 15-minute conversation cost about 30 percent less than at the October launch once cache savings were applied.[^12]

Through the first half of 2025 the API received a series of smaller revisions, including removal of the cap on simultaneous sessions (February 3, 2025), addition of native Python SDK support, and a further interim snapshot `gpt-4o-realtime-preview-2025-06-03` that introduced European Union data residency through `eu.api.openai.com`.[^13]

### General availability: gpt-realtime (August 2025)

OpenAI moved the Realtime API out of beta on August 28, 2025 and at the same time introduced a new headline model called simply `gpt-realtime` (snapshot `gpt-realtime-2025-08-28`).[^2][^4] The dropped prefix was deliberate: OpenAI clarified that the new model was not a straight derivative of [GPT-4o](/wiki/gpt_4o) but a separate speech-to-speech network with its own training data mix.[^4] Pricing for `gpt-realtime` was set at $4 per million text input tokens, $16 per million text output tokens, $32 per million audio input tokens (with $0.40 per million cached), $5 per million image input tokens, and $64 per million audio output tokens, a further roughly 20 percent reduction in audio cost from the December 2024 snapshot.[^2][^4]

The GA model added native image input, native MCP server tool calling (`[[mcp]]` servers can be configured at session creation), and Session Initiation Protocol (SIP) transport for direct integration with telephony providers such as Twilio Elastic SIP Trunking, carrier PBX systems, and desk phones.[^2][^14] Two new voices, Cedar and Marin, debuted alongside `gpt-realtime` and were designated by OpenAI as the recommended choices for production deployments.[^2][^11] On the Big Bench Audio reasoning evaluation the model scored 82.8 percent accuracy, up from 65.6 percent for the December 2024 preview, while instruction following on the MultiChallenge audio benchmark improved correspondingly.[^2]

### Second generation (May 2026)

On May 8, 2026 OpenAI released three new Realtime API models simultaneously: `gpt-realtime-2`, `gpt-realtime-translate`, and `gpt-realtime-whisper`.[^5][^15] The flagship `gpt-realtime-2` is described by OpenAI as the first voice model with "GPT-5-class reasoning" and exposes five adjustable reasoning effort levels (minimal, low, medium, high, xhigh) that trade latency against quality.[^15][^6] The context window expanded from 32,000 to 128,000 tokens, and the model added parallel tool calling with audio "preamble" narration of in-flight tool work (e.g., "let me check that for you").[^6]

On the Artificial Analysis Big Bench Audio leaderboard `gpt-realtime-2` at xhigh reasoning scored 96.6 percent, tied with Google's Gemini 3.1 Flash Live Preview High and a 15.2 point gain over the 81.4 percent achieved by the intermediate `gpt-realtime-1.5` snapshot.[^15] On Scale AI's Audio MultiChallenge instruction-retention benchmark, `gpt-realtime-2` reached 48.5 percent at xhigh reasoning compared to 34.7 percent for the prior generation; on the Conversational Dynamics (Full Duplex) subset measuring turn-taking and interruption handling, even the minimal-reasoning variant scored 96.1 percent.[^15]

`gpt-realtime-translate` is purpose-built for real-time speech translation across more than 70 input languages and 13 output languages, priced at $0.034 per minute of audio.[^5] `gpt-realtime-whisper` provides streaming speech-to-text with controllable latency at $0.017 per minute, replacing the older non-streaming Whisper endpoint for live use cases.[^5]

## Architecture

### Transport layers

The Realtime API offers three transports, each suited to a different deployment shape.[^11][^14]

| Transport | Use case | Recommended for |
| --- | --- | --- |
| WebSocket | Server-to-server bidirectional JSON event stream with base64-encoded audio frames | Backend services, server-orchestrated agents |
| WebRTC | Browser/mobile peer connection with built-in jitter buffering and packet-loss concealment | Direct client connections |
| SIP | Standard telephony signalling; OpenAI accepts inbound SIP INVITEs and dispatches `realtime.call.incoming` webhooks | Phone numbers, PBX, IVR replacement |

WebSocket was the only transport at launch; WebRTC was added on December 17, 2024; SIP was added as part of the August 2025 GA release.[^9][^2] OpenAI documentation recommends WebRTC for any client running outside a controlled data centre because the protocol's loss recovery and adaptive jitter buffering produce more consistent quality on consumer networks than raw WebSocket framing.[^11]

### Event model

A Realtime session is driven by an event stream defined by approximately three dozen typed events, split into client-emitted events (such as `session.update`, `input_audio_buffer.append`, `response.create`, and `conversation.item.create`) and server-emitted events (such as `session.created`, `session.updated`, `input_audio_buffer.speech_started`, `response.audio.delta`, `response.function_call_arguments.done`, and `response.done`).[^16] Audio is streamed in 20-millisecond frames as base64-encoded payloads inside JSON events; output audio arrives as a series of `response.audio.delta` events that the client concatenates and plays.[^16]

Voice activity detection is performed server-side by default. The session config exposes a `turn_detection` block that lets developers tune the silence threshold and prefix padding; when a user pause crosses the threshold the server emits `input_audio_buffer.speech_stopped` and automatically triggers a response unless turn detection is disabled.[^16][^11] As a GA feature OpenAI added an `input_audio_buffer.timeout_triggered` event that fires after a configurable idle timeout.[^11]

### Audio formats

The API supports three audio codecs on both input and output: `pcm16` (16-bit linear PCM at 24 kHz, mono, little-endian), `g711_ulaw`, and `g711_alaw`.[^17] The two G.711 variants are 8 kHz logarithmic codecs used by classical telephony; their inclusion is what allows direct interconnection with SIP trunks and PBX systems without an external transcoder.[^17][^14] Each session can negotiate a different input and output codec, so a Twilio inbound call can stream `g711_ulaw` while the application receives 24 kHz `pcm16` for archival.[^14]

### Function and tool calling

Tools are declared at session start in a `tools` array that mirrors the schema used by the [OpenAI API](/wiki/openai_api) Chat Completions endpoint. When the model decides to call a tool it emits a `response.function_call_arguments.delta` stream followed by `response.function_call_arguments.done`; the client executes the tool and writes the result back via a `conversation.item.create` event with a `function_call_output` item, then optionally requests a follow-up response.[^16] Beginning with the GA release function calling became fully asynchronous: the model can continue conversing while a tool call is in flight, automatically producing filler utterances such as "I'm still waiting on that" rather than blocking on the tool's completion.[^11] As of GA the Realtime API also accepts a `mcp_servers` block at session creation that registers remote [MCP server](/wiki/mcp_server) endpoints whose tools become callable in the conversation.[^2][^11]

### Voices

The voice catalogue grew across releases. The launch lineup of alloy, echo, fable, onyx, nova, and shimmer came from the pre-existing TTS endpoint.[^1] Ash, Ballad, Coral, Sage, and Verse were added in late October 2024 as more expressive options.[^12] Cedar and Marin shipped with the August 2025 GA `gpt-realtime` model and are documented as the recommended voices for assistant audio output.[^2][^11]

### Latency

OpenAI does not publish official end-to-end latency numbers, but third-party measurements during 2025 placed median time-to-first-byte at approximately 500 milliseconds for clients in the contiguous United States, with full-sentence response latencies of roughly 1.2 to 2.0 seconds and 95th-percentile latency creeping to 2.5 to 3.0 seconds under noisy input or long tool chains.[^18] LiveKit's published architecture for Advanced Voice Mode reports an end-to-end target of approximately 300 milliseconds for the client-server WebRTC leg.[^8]

## Models and pricing

The table below summarises the principal Realtime API model snapshots from beta to second generation.

| Snapshot | Released | Audio in / out per 1M tokens | Cached audio in | Notes |
| --- | --- | --- | --- | --- |
| `gpt-4o-realtime-preview-2024-10-01` | 2024-10-01 | $100 / $200 | n/a | Public beta launch[^1] |
| `gpt-4o-realtime-preview-2024-12-17` | 2024-12-17 | $40 / $80 | $2.50 | WebRTC; 60% price cut[^10] |
| `gpt-4o-mini-realtime-preview-2024-12-17` | 2024-12-17 | $10 / $20 | $0.30 | Cheap variant[^10] |
| `gpt-4o-realtime-preview-2025-06-03` | 2025-06-03 | $40 / $80 | $2.50 | EU data residency[^13] |
| `gpt-realtime` (`-2025-08-28`) | 2025-08-28 | $32 / $64 | $0.40 | GA; MCP; SIP; image input[^2] |
| `gpt-realtime-2` | 2026-05-08 | $32 / $64 | $0.40 | 128K context; reasoning levels[^5][^6] |
| `gpt-realtime-translate` | 2026-05-08 | $0.034/min | n/a | Live translation[^5] |
| `gpt-realtime-whisper` | 2026-05-08 | $0.017/min | n/a | Streaming STT[^5] |

The June 2025 deprecation notice for `gpt-4o-realtime-preview-2024-10-01` gave developers a three-month transition window to the December 2024 snapshot.[^19]

## Comparison to competing speech-to-speech APIs

The Realtime API competes most directly with Google's Gemini Live (delivered as part of the Gemini API and refreshed on March 26, 2026 as `gemini-3.1-flash-live`), xAI's Grok Voice (which adopted the OpenAI Realtime wire protocol to ease migration), Hume AI's Empathic Voice Interface, and Inworld's Realtime API.[^20] Both OpenAI and Google operate on a native multimodal architecture in which a single model ingests audio and emits audio without an explicit transcription step, while Hume and Inworld build on top of orchestrated pipelines.[^20] Independent benchmarks from Artificial Analysis show `gpt-realtime-2` and `gemini-3.1-flash-live` tied at 96.6 percent on Big Bench Audio at high reasoning settings.[^15] Pricing comparisons are imprecise because Google bills per minute and OpenAI bills per token, but third-party calculators consistently report Gemini Live as cheaper per minute at comparable reasoning effort.[^20]

A second axis of competition runs through orchestration platforms rather than model providers. [Vapi](/wiki/vapi) and [Retell AI](/wiki/retell_ai) offer a higher-level voice agent layer over both OpenAI Realtime and traditional STT-LLM-TTS pipelines, charging a per-minute platform fee on top of underlying model costs.[^21] [ElevenLabs](/wiki/elevenlabs) competes through Conversational AI, which pairs ElevenLabs TTS voices with a configurable backbone LLM and a proprietary turn-taking model.[^21] [LiveKit Agents](/wiki/livekit_agents) is OpenAI's officially partnered open-source framework for building Realtime API applications and is the reference implementation OpenAI itself uses for ChatGPT Advanced Voice Mode.[^8]

## Use cases

OpenAI's customer documentation and launch posts highlight several deployment patterns.[^1][^2][^22]

Customer support and contact-centre automation is the most-cited application: voice agents handle inbound calls, answer common questions, capture intent before routing to a human, and execute back-office actions through function calls. The combination of SIP transport, async function calling, and image input (for example, a customer holding a damaged product to the phone camera in a hybrid web call) is positioned for this use case.[^2][^14] Deutsche Telekom is named by OpenAI as a deployment partner building multilingual customer support that uses `gpt-realtime-translate` to bridge agent and caller languages in real time.[^5][^22]

Language tutoring was an early showcase. Speak, a conversational language-learning app, integrated the Realtime API during the public beta to power role-play sessions in which learners practice spoken conversations in a target language; the app was used in the DevDay keynote demonstration.[^1] Educational applications use the API to build interactive tutors that explain concepts verbally and adapt pacing to learner response.[^22]

Accessibility and assistive technologies use the Realtime API for hands-free interaction, including screen reader replacement, sign-language adjacent voice interfaces, and conversational interfaces for users with motor impairments.[^22] Healthify, a nutrition and fitness coaching app, uses the Realtime API to drive its AI coach "Ria" and routes more complex cases to human dietitians.[^2]

Other documented deployments include IVR replacement on top of [LiveKit Agents](/wiki/livekit_agents) and SIP trunking, voice-controlled robotics through WebRTC connections from on-device controllers, and gaming non-player characters that hold open-ended spoken dialogue with players.[^8][^14]

## Reception and criticisms

Coverage of the October 2024 beta launch was broadly positive on capability but skeptical on cost. TechCrunch noted that the per-minute audio pricing was high enough to make most consumer-scale deployments uneconomic until the December 2024 price cuts; the report also called out OpenAI's decision not to require automated disclosure that callers were speaking to an AI, leaving that responsibility to developers.[^3] InfoQ's DevDay 2024 coverage emphasised the integration with Twilio as a stage-demo highlight but also raised pricing concerns.[^23]

Developer community discussion during the beta period focused on three recurring issues: input transcription accuracy on accented English and non-English speakers, name and proper-noun recognition in tool-call arguments, and unexpectedly high session costs caused by long silence periods being billed as audio input.[^12][^9] The December 2024 snapshot was credited with improving input reliability but transcription quality remained a community concern through 2025.[^9]

After the August 2025 GA release, coverage focused on the production-readiness of the API: the addition of SIP, MCP, and async function calling made the service practical for contact-centre replacement deployments rather than just demos.[^2][^11] The May 2026 second-generation release attracted notice both for the GPT-5-class reasoning claim and for the open question of how the new "xhigh" reasoning level interacts with the API's latency floor; OpenAI documentation recommends that production deployments default to "low" effort to preserve real-time response latency.[^6][^15]

## Availability

The Realtime API is available to all paying [OpenAI API](/wiki/openai_api) customers globally, with separate European Union data residency available for `gpt-realtime-2025-08-28` and `gpt-4o-realtime-preview-2025-06-03` through the dedicated `eu.api.openai.com` endpoint.[^13] The service is also exposed through Microsoft Azure OpenAI Service in Microsoft Foundry, where the same model snapshots ship with Microsoft's own SLA and regional deployment options.[^24] Azure offers separate WebSocket, WebRTC, and SIP transports mirroring the OpenAI public surface.[^24]

OpenAI deprecated `gpt-4o-realtime-preview-2024-10-01` on June 10, 2025 with a three-month sunset window, recommending migration to the December 2024 snapshot.[^19] As of the May 2026 second-generation release, the `gpt-realtime` slug points to the August 2025 snapshot, the `gpt-realtime-2` slug points to the May 2026 snapshot, and the `gpt-realtime-mini` slug points to the December 2025 mini snapshot.[^13]

## See also

- [OpenAI](/wiki/openai)
- [OpenAI API](/wiki/openai_api)
- [OpenAI Responses API](/wiki/openai_responses_api)
- [OpenAI Agents SDK](/wiki/openai_agents_sdk)
- [GPT-4o](/wiki/gpt_4o)
- [ChatGPT](/wiki/chatgpt)
- [OpenAI Whisper](/wiki/openai_whisper)
- [function calling](/wiki/function_calling)
- [MCP server](/wiki/mcp_server)
- [AI voice agent](/wiki/ai_voice_agent)
- [Vapi](/wiki/vapi)
- [Retell AI](/wiki/retell_ai)
- [LiveKit Agents](/wiki/livekit_agents)
- [ElevenLabs](/wiki/elevenlabs)
- [Gemini](/wiki/gemini)
- [multimodal AI](/wiki/multimodal_ai)
- [voice activity detection models](/wiki/voice_activity_detection_models)

## References

[^1]: OpenAI, "Introducing the Realtime API", OpenAI, 2024-10-01. https://openai.com/index/introducing-the-realtime-api/. Accessed 2026-05-25.
[^2]: OpenAI, "Introducing gpt-realtime and Realtime API updates for production voice agents", OpenAI, 2025-08-28. https://openai.com/index/introducing-gpt-realtime/. Accessed 2026-05-25.
[^3]: Kyle Wiggers, "OpenAI's DevDay brings Realtime API and other treats for AI app developers", TechCrunch, 2024-10-01. https://techcrunch.com/2024/10/01/openais-devday-brings-realtime-api-and-other-treats-for-ai-app-developers/. Accessed 2026-05-25.
[^4]: Simon Willison, "Introducing gpt-realtime", Simon Willison's Weblog, 2025-09-01. https://simonwillison.net/2025/Sep/1/introducing-gpt-realtime/. Accessed 2026-05-25.
[^5]: OpenAI, "Advancing voice intelligence with new models in the API", OpenAI, 2026-05-08. https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/. Accessed 2026-05-25.
[^6]: DataCamp, "GPT-Realtime-2: A Voice Model with GPT-5-Class Reasoning", DataCamp, 2026-05-08. https://www.datacamp.com/blog/gpt-realtime-2. Accessed 2026-05-25.
[^7]: OpenAI Developers, "Developer notes on the Realtime API", OpenAI Developers Blog, 2025-08-28. https://developers.openai.com/blog/realtime-api. Accessed 2026-05-25.
[^8]: LiveKit, "OpenAI and LiveKit partner to turn Advanced Voice into an API", LiveKit Blog, 2024-10-01. https://livekit.com/blog/openai-livekit-partnership-advanced-voice-realtime-api/. Accessed 2026-05-25.
[^9]: OpenAI Developer Community, "Realtime API updates - WebRTC, cheaper prices, 4o-mini, and more", OpenAI, 2024-12-17. https://community.openai.com/t/realtime-api-updates-webrtc-cheaper-prices-4o-mini-and-more/1059962. Accessed 2026-05-25.
[^10]: OpenAI Developers, "gpt-4o-realtime-preview-2024-12-17 and gpt-4o-mini-realtime-preview-2024-12-17 pricing", OpenAI Developers on X, 2024-12-17. https://x.com/OpenAIDevs/status/1869116963588649326. Accessed 2026-05-25.
[^11]: OpenAI, "Developer notes on the Realtime API", OpenAI Developers, 2025-08-28. https://developers.openai.com/blog/realtime-api. Accessed 2026-05-25.
[^12]: OpenAI Developer Community, "New Realtime API voices and cache pricing", OpenAI, 2024-10-30. https://community.openai.com/t/new-realtime-api-voices-and-cache-pricing/998238. Accessed 2026-05-25.
[^13]: OpenAI, "Changelog", OpenAI API Documentation, 2026-05-08. https://developers.openai.com/api/docs/changelog. Accessed 2026-05-25.
[^14]: Twilio, "Connect the OpenAI Realtime SIP Connector with Twilio Elastic SIP Trunking", Twilio Blog, 2025-08-28. https://www.twilio.com/en-us/blog/developers/tutorials/product/openai-realtime-api-elastic-sip-trunking. Accessed 2026-05-25.
[^15]: MarkTechPost, "OpenAI Releases Three Realtime Audio Models: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper in the Realtime API", MarkTechPost, 2026-05-08. https://www.marktechpost.com/2026/05/08/openai-releases-three-realtime-audio-models-gpt-realtime-2-gpt-realtime-translate-and-gpt-realtime-whisper-in-the-realtime-api/. Accessed 2026-05-25.
[^16]: OpenAI, "Realtime conversations guide", OpenAI API Documentation, 2025-08-28. https://platform.openai.com/docs/guides/realtime-conversations. Accessed 2026-05-25.
[^17]: OpenAI, "Realtime transcription session reference", OpenAI API Reference, 2025-08-28. https://developers.openai.com/api/reference/resources/realtime/subresources/transcription_sessions/methods/create. Accessed 2026-05-25.
[^18]: Skywork AI, "OpenAI Realtime API Review 2025: Honest Pros and Cons", Skywork AI, 2025-10-15. https://skywork.ai/blog/agent/openai-realtime-api-review-2025-honest-pros-cons/. Accessed 2026-05-25.
[^19]: OpenAI, "Deprecations", OpenAI API Documentation, 2025-06-10. https://developers.openai.com/api/docs/deprecations. Accessed 2026-05-25.
[^20]: MindStudio, "Real-Time AI Voice Models Compared: GPT Realtime 2, Gemini TTS, Grok, and InWorld", MindStudio, 2026-05-09. https://www.mindstudio.ai/blog/real-time-ai-voice-models-compared-2025. Accessed 2026-05-25.
[^21]: AssemblyAI, "Best Speech-to-Speech Voice Agent API in 2026", AssemblyAI Blog, 2026-02-15. https://www.assemblyai.com/blog/best-speech-to-speech-voice-agent-api. Accessed 2026-05-25.
[^22]: Skywork AI, "OpenAI Realtime API Use Cases 2025: 10 Real Examples", Skywork AI, 2025-11-12. https://skywork.ai/blog/ai-agent/openai-realtime-api-use-cases-2025-10-real-examples/. Accessed 2026-05-25.
[^23]: InfoQ, "OpenAI Developer Day 2024 (SF) Announces Real-Time API, Vision Fine-Tuning, and More", InfoQ, 2024-10-08. https://www.infoq.com/news/2024/10/openai-sf-dev-day/. Accessed 2026-05-25.
[^24]: Microsoft, "Use the GPT Realtime API via SIP", Microsoft Foundry Documentation, 2025-10-15. https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/realtime-audio-sip. Accessed 2026-05-25.