Advanced Voice Mode
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 1,598 words
Advanced Voice Mode is a real-time, spoken conversation feature in ChatGPT, developed by OpenAI, that lets users talk with the assistant using natural speech and receive spoken replies. It is built on the native, end-to-end audio capabilities of the GPT-4o model, which processes speech directly rather than first transcribing it to text. OpenAI first demonstrated the capability at the GPT-4o launch on May 13, 2024, began an alpha release to a small group of ChatGPT Plus users on July 30, 2024, and expanded it to all Plus and Team subscribers beginning September 24, 2024.[1][2][3] The feature is distinct from the GPT-4o model itself, from the developer-facing Realtime API, and from OpenAI's separate Voice Engine voice-cloning research.
ChatGPT first gained spoken conversation on September 25, 2023, when OpenAI introduced a voice feature for Plus and Enterprise subscribers in its mobile apps.[4] That original system, now referred to as Standard Voice Mode, worked by chaining three separate models together: OpenAI's open-source Whisper model transcribed the user's speech to text, a text model such as GPT-4 generated a written reply, and a separate text-to-speech model read that reply aloud.[1][2] Because each step ran sequentially, the pipeline introduced noticeable latency and discarded information such as tone, emotion, and background sound along the way.
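The chained design described above can be illustrated with a minimal sketch. It is a toy model, not OpenAI's implementation: the three stage functions, the `Utterance` type, and the per-stage delays in `STAGE_LATENCY` are hypothetical placeholders chosen only to show why sequential stages add up in latency and why anything not present in the transcript is lost.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """A hypothetical stand-in for one spoken user turn."""
    words: str       # what was said
    tone: str        # how it was said, e.g. "sad" or "excited"
    background: str  # non-speech audio, e.g. "street noise"


# Assumed, illustrative per-stage delays in seconds (not measured values).
STAGE_LATENCY = {"transcribe": 0.5, "generate": 1.5, "synthesize": 0.8}


def transcribe(utterance: Utterance) -> str:
    """Stage 1: speech to text. Only the words survive this step."""
    return utterance.words


def generate_reply(transcript: str) -> str:
    """Stage 2: a text model that sees only the transcript."""
    return f"You said: {transcript}"


def synthesize(reply_text: str) -> dict:
    """Stage 3: text to speech with a fixed delivery."""
    return {"spoken_text": reply_text, "delivery": "neutral"}


def chained_voice_turn(utterance: Utterance) -> tuple[dict, float]:
    """Run the three stages in order and total their delays."""
    transcript = transcribe(utterance)   # tone and background are dropped here
    reply_text = generate_reply(transcript)
    audio = synthesize(reply_text)
    # Each stage waits for the previous one, so the delays sum.
    total_latency = sum(STAGE_LATENCY.values())
    return audio, total_latency


audio, latency = chained_voice_turn(
    Utterance(words="hello", tone="sad", background="rain")
)
print(audio, latency)
```

In this sketch the reply is identical whether the speaker sounds sad or excited, because `tone` never reaches the second stage.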
Advanced Voice Mode replaced that pipeline with a single model. With GPT-4o, OpenAI trained one neural network end-to-end across text, vision, and audio, so that audio input and audio output are handled by the same model without an intermediate text-conversion step.[1] This allows the model to perceive characteristics of speech that a transcript cannot capture, and to respond with a spoken voice that can vary in tone and pacing. OpenAI reported that GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, a latency comparable to human response time in conversation.[1]
The defining technical characteristic of Advanced Voice Mode is that it relies on GPT-4o's native speech-to-speech processing. Rather than converting audio to text, reasoning over the text, and synthesizing new audio with three different systems, GPT-4o accepts audio directly and emits audio directly. OpenAI stated that all inputs and outputs are processed by the same neural network, and that because the model is multimodal it can "directly observe tone, multiple speakers, or background noises."[1]
This single-model design produces two practical effects. First, it reduces latency dramatically compared with the chained Standard Voice Mode, enabling near-instant, back-and-forth exchanges in which the user can interrupt the assistant mid-sentence and have it stop and respond. Second, it preserves and can generate paralinguistic information: GPT-4o can detect emotional cues in a speaker's voice, such as whether they sound sad or excited, and can modulate its own delivery, including speaking in different emotive styles, varying its speed, and in some demonstrations singing.[1][2] These behaviors are properties of the underlying GPT-4o model; Advanced Voice Mode is the consumer-facing ChatGPT feature that exposes them in a live conversational interface.
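The interruption behavior mentioned above can be sketched as a simple playback loop. This is only an illustration of the general idea of stopping speech output when the user starts talking; the function name `speak_with_barge_in` and its parameters are hypothetical, and nothing here reflects how ChatGPT or GPT-4o implements it.

```python
def speak_with_barge_in(reply_chunks, user_speaking_at):
    """Play reply chunks in order, stopping when the user starts talking.

    reply_chunks: list of short audio segments (strings stand in for audio).
    user_speaking_at: set of chunk indices at which user speech is detected.
    Returns the chunks actually played and whether playback was interrupted.
    """
    played = []
    for index, chunk in enumerate(reply_chunks):
        if index in user_speaking_at:
            # The user has started speaking: stop before playing this chunk.
            return played, True
        played.append(chunk)
    return played, False


played, interrupted = speak_with_barge_in(
    ["The capital", "of France", "is Paris."], user_speaking_at={1}
)
print(played, interrupted)
```

The shorter each chunk is, the sooner the assistant can stop, which is one reason low end-to-end latency matters for this kind of exchange.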
Advanced Voice Mode should be distinguished from related OpenAI systems. The Realtime API, introduced for developers in October 2024, provides programmatic access to the same low-latency speech-to-speech capability for building third-party applications. Voice Engine is a separate research model that clones a specific person's voice from a short audio sample, and it is not the technology behind ChatGPT's preset assistant voices. Advanced Voice Mode refers specifically to the in-app ChatGPT experience.
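Advanced Voice Mode is designed around natural, free-flowing conversation. Its principal features, as described elsewhere in this article, include low-latency turn-taking that the user can interrupt, expressive speech that varies in tone and pacing, and, since December 2024, live video and screen sharing.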
When Advanced Voice limits are reached, or when it is otherwise unavailable, ChatGPT falls back to the older Standard Voice Mode. A version of Advanced Voice powered by GPT-4o mini later reached free users; OpenAI began rolling out a daily preview to all free ChatGPT users on February 25, 2025.[6][7]
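The fallback rule in the preceding paragraph amounts to a small decision function. The sketch below is an assumption-laden illustration: `select_voice_mode`, its parameters, and the use of minutes as the unit for the limit are all hypothetical, since the article does not state how limits are measured.

```python
from enum import Enum


class VoiceMode(Enum):
    ADVANCED = "advanced"
    STANDARD = "standard"


def select_voice_mode(advanced_available: bool,
                      used_minutes: float,
                      limit_minutes: float) -> VoiceMode:
    """Pick Advanced Voice when it is available and under its limit.

    The minute-based limit is an assumed unit used only for illustration.
    """
    if advanced_available and used_minutes < limit_minutes:
        return VoiceMode.ADVANCED
    # Limit reached or feature unavailable: fall back to Standard Voice Mode.
    return VoiceMode.STANDARD


print(select_voice_mode(True, used_minutes=10, limit_minutes=30))
print(select_voice_mode(True, used_minutes=30, limit_minutes=30))
print(select_voice_mode(False, used_minutes=0, limit_minutes=30))
```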
| Date | Milestone |
|---|---|
| September 25, 2023 | Standard Voice Mode (Whisper plus text-to-speech) launches for Plus and Enterprise users[4] |
| May 13, 2024 | GPT-4o launched; native voice capability demonstrated at OpenAI's Spring Update[1] |
| May 19, 2024 | OpenAI pauses the "Sky" voice[8] |
| July 30, 2024 | Advanced Voice Mode alpha released to a small group of ChatGPT Plus users[2] |
| September 24, 2024 | Broad rollout begins to all Plus and Team users, with new voices and a new look[3] |
| October 22, 2024 | Advanced Voice becomes available to Plus users in the EU, Switzerland, Iceland, Norway, and Liechtenstein[9] |
| December 12, 2024 | Live video and screen sharing added during the "12 Days of OpenAI"[5] |
| February 25, 2025 | A GPT-4o mini-powered version begins rolling out to free users as a daily preview[6][7] |
When the general rollout began on September 24, 2024, OpenAI also gave the feature a new visual identity, replacing the animated black dots shown at the May demo with a blue animated sphere.[3] Enterprise and Edu customers received access in the week following the Plus and Team launch.[3] The feature was initially unavailable in the European Union, the United Kingdom, Switzerland, Iceland, Norway, and Liechtenstein; OpenAI said that some regions require additional external reviews before launch.[3][9]
One of the five original ChatGPT voices, named Sky, became the subject of a public dispute in May 2024. After the GPT-4o demo on May 13, 2024, listeners compared Sky to the voice of actress Scarlett Johansson, who had voiced an artificial intelligence assistant in the 2013 film "Her." On the day of the demo, OpenAI chief executive Sam Altman posted the single word "her" on social media, which many interpreted as a reference to the film.[10]
According to a statement Johansson released on May 20, 2024, Altman had approached her in September 2023 to voice the system, suggesting that her involvement could help bridge the gap between technology companies and creatives. She declined the offer. She said that two days before the May demo, Altman contacted her agent asking her to reconsider, and that the demo was released before the two sides could connect. Johansson said she was "shocked, angered and in disbelief that Mr. Altman would pursue a voice that sounded so eerily similar to mine," and that she was "forced to hire legal counsel."[10]
OpenAI paused the use of Sky in its products as of May 19, 2024.[8] In a statement, Altman said, "The voice of Sky is not Scarlett Johansson's, and it was never intended to resemble hers. We cast the voice actor behind Sky's voice before any outreach to Ms. Johansson. Out of respect for Ms. Johansson, we have paused using Sky's voice in our products."[8] In a blog post published on May 19, 2024, describing how the voices were chosen, OpenAI said that the actors were selected through a multi-month casting process, that more than 400 submissions were reviewed before five voices were chosen, and that none of the voices was picked for similarity to any celebrity, each being the talent's natural speaking voice.[8]
Advanced Voice Mode drew substantial attention for the naturalness of its conversations, with coverage describing the feature as "hyperrealistic" and noting how closely its pacing and emotional range approached human speech.[2] The roughly seven-month gap between the May 2024 demo and the December 2024 arrival of live video, a capability shown at the original unveiling, was a recurring point in press coverage.[5] The feature's delayed availability in Europe, attributed by some commentators to regulatory review under the EU AI Act and by OpenAI to standard external review processes, also generated discussion before the October 2024 EU rollout.[9] The Sky episode became a widely cited example in debates over voice likeness, consent, and the rights of performers in generative AI.[10]