GPT-4o
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 8,144 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 8,144 words
Add missing citations, update stale details, or suggest a clearer explanation.
GPT-4o (Generative Pre-trained Transformer 4 Omni) is a natively multimodal large language model developed by OpenAI and announced on May 13, 2024 at the company's "Spring Update" livestream. The lowercase "o" stands for omni, signalling that the model accepts and produces any combination of text, audio, and image tokens within a single end-to-end neural network rather than the chained pipeline of separate speech-to-text, language, and text-to-speech models that powered earlier versions of ChatGPT Voice Mode.[^1] GPT-4o was designed as the successor to GPT-4 Turbo, matching that model's text and code performance while running roughly twice as fast and at half the API cost.[^1]
The most widely publicized aspect of GPT-4o is its real-time conversational ability. The model can respond to audio inputs in as little as 232 milliseconds, with an average latency of about 320 milliseconds, comparable to the cadence of an everyday human conversation.[^1] Earlier ChatGPT voice experiences chained a speech-to-text model (typically Whisper), a text-only GPT-4 inference, and a separate text-to-speech model in series, producing typical end-to-end latencies of two to five seconds and stripping away tone, multiple-speaker information, background noise, laughter, and song. Because GPT-4o handles waveforms and text within the same network, it preserves prosody and can output expressive speech, including singing and laughter.[^1]
OpenAI used the launch to make GPT-4-class capability free for the first time. GPT-4o became the default model for paying ChatGPT Plus, Team, and Enterprise users, and it was rolled out to ChatGPT Free with usage caps the same day.[^3] A smaller and cheaper variant, GPT-4o mini, was released on July 18, 2024 and replaced GPT-3.5 Turbo as the recommended low-cost API model.[^5] A native image generation update for GPT-4o was launched inside ChatGPT on March 25, 2025 and exposed in the API as the gpt-image-1 endpoint on April 23, 2025.[^21][^22] GPT-4o was eventually retired from the ChatGPT product on February 13, 2026 in favor of the GPT-5 family, while remaining available in the OpenAI API.[^7]
The model occupies an unusual place in the GPT lineage. It was OpenAI's last "GPT-N" flagship before the company pivoted to the dedicated reasoning track of the o1, o3, and o4-mini series, and it was the model that defined the conversational tone of ChatGPT for nearly two years across hundreds of millions of weekly users. When OpenAI tried to remove it during the GPT-5 launch in August 2025, user pushback was strong enough that the company restored access within five days under a "Show legacy models" toggle, an episode that became a touchstone for debates about model attachment and AI personality.[^24][^25]
By the spring of 2024, OpenAI's commercial flagship was GPT-4 Turbo, a faster and cheaper variant of GPT-4 introduced at the November 2023 DevDay. GPT-4 Turbo with Vision had added image input in late 2023, but voice in ChatGPT continued to rely on a cascade of three models: Whisper transcribed user audio to text; a text-only GPT-4 generated a reply; and a separate text-to-speech model converted the reply back to audio. The cascade had three structural drawbacks. First, the intermediate text step discarded prosody, multiple-speaker information, background sound, and emotional inflection. Second, each link added latency, producing end-to-end voice turnaround times that were too slow for natural conversation. Third, the cascade made it difficult for the language model to produce expressive speech because its output was a sequence of plain text tokens that the TTS layer had to interpret.[^1]
In parallel, the competitive landscape had tightened. Google's Gemini 1.5 Pro, Anthropic's Claude 3 Opus (March 2024), and Meta's Llama 3 (April 2024) all challenged GPT-4 on text and code benchmarks. Google was openly developing Project Astra, a multimodal assistant prototype expected to debut at the May 14, 2024 I/O developer conference. OpenAI's leadership decided to schedule a flagship reveal of its own one day before I/O, a piece of strategic positioning widely noted by reporters at the time.[^17]
The internal program that became GPT-4o pursued a single objective: build one neural network that ingests and emits text, image, and audio tokens directly, replacing the three-model voice cascade and improving multilingual coverage at the same time. OpenAI has not published the architecture diagram, parameter count, training data composition, or hardware footprint, citing the same competitive and safety considerations it cited for GPT-4.[^1]
In the months leading up to GPT-4o, OpenAI tested release candidates anonymously on the public LMSYS Chatbot Arena. On April 27, 2024, an unbranded model called gpt2-chatbot appeared on the Arena and immediately matched or exceeded the strongest available systems, including GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro.[^16] The model was withdrawn after a few days, then reappeared on May 6 and 7, 2024 under two new names: im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot. Sam Altman cryptically tweeted "i do have a soft spot for gpt2" and later "im-a-good-gpt2-chatbot," and the day before the Spring Update he posted only the word "her," widely read as a nod to the 2013 Spike Jonze film Her. After the May 13 announcement, OpenAI confirmed that the masked Arena models had been GPT-4o checkpoints. The episode also drew attention to the Chatbot Arena as a forum where labs blind-test new releases before public launch.
The Spring Update was a roughly 26-minute live stream held at OpenAI's San Francisco headquarters on Monday, May 13, 2024, the day before Google's annual I/O developer conference. Then chief technology officer Mira Murati opened the presentation, framed three priorities for the launch (a new desktop app, a refreshed user interface, and the new flagship model), and then handed off to research leads Mark Chen and Barret Zoph for live demonstrations.[^1] The team showed real-time spoken conversation with interruption handling, on-the-fly emotional expression, real-time language translation between Italian and English, vision-based math tutoring on a handwritten linear equation, code review of a Python script via screen sharing, and the model singing a bedtime story to a stuffed animal. The presentation deliberately steered around the staged feel of typical product demos: the presenters interrupted the model mid-sentence, asked it to be more dramatic, and let it ad-lib.
Google's I/O keynote the following day featured Project Astra, a multimodal assistant with similar ambitions, and the timing was widely interpreted as a strategic positioning move by OpenAI.[^17] The keynote landed roughly six months after the November 2023 Bletchley Declaration, the first multilateral AI safety summit, and OpenAI was among the labs that had voluntarily committed to pre-deployment safety testing of frontier models. The Spring Update emphasised that GPT-4o had been reviewed under that framework before public release.[^4]
GPT-4o was available to API customers as gpt-4o and the dated alias gpt-4o-2024-05-13 from the moment of the keynote, with text and vision input enabled at launch.[^1] In ChatGPT, the model became the default for paying Plus, Team, and Enterprise users on May 13 and began a phased rollout to ChatGPT Free over the following weeks. Free-tier users gained the ability to use GPT-4o, Custom GPTs, file uploads, browsing, and advanced data analysis for the first time, subject to message caps that silently downgraded to the smaller GPT-3.5 Turbo backend once exhausted. The free-tier rollout was widely cited as the first time a frontier-class model was directly available to non-paying users at scale.[^3]
GPT-4o is described by OpenAI as "a single new model trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."[^1] Audio waveforms and image pixels are encoded directly into tokens that share the same latent space as text tokens, so the model's attention can correlate a word in a prompt with a particular pixel region in an image, a particular speaker in a recording, or a particular musical note in audio. OpenAI has not released the precise tokenisation scheme for audio, but the practical effect is that the model can emit audio tokens whose decoded waveform carries laughter, prosody, accents, and singing — properties that are difficult to convey through plain text passed to a downstream TTS layer.[^1]
The cascaded predecessor pipeline used three separate models in series: an automatic speech recognition model (typically Whisper) transcribed audio to text; a text-only GPT-4 model produced a text reply; and a separate text-to-speech model converted the reply back to audio. The middle text-only stage discarded tone, multiple-speaker information, background sound, and emotional inflection, and the chain accumulated latency at each step. GPT-4o eliminates this loss of information by carrying audio and vision tokens directly into and out of the same model.[^1]
o200k_baseGPT-4o ships with a new tokenizer, o200k_base, which doubles the vocabulary from the cl100k_base tokenizer used by GPT-4 and GPT-3.5 to roughly 200,000 BPE tokens.[^15] The expanded vocabulary is heavily targeted at non-English text, where earlier OpenAI tokenizers used many more tokens per character. According to OpenAI's published comparisons, common Hindi sentences require about 2.9 times fewer tokens, common Arabic sentences about 2 times fewer, common Chinese sentences about 1.4 times fewer, and common Korean sentences about 1.7 times fewer.[^1] A widely cited Indic-language analysis showed Malayalam tokens reduced by nearly 4 times and Telugu by about 3.5 times.[^15]
Fewer tokens per character of foreign-language text means lower latency, lower API costs, and longer effective context windows for those languages, since context length is measured in tokens rather than characters.
GPT-4o offers a 128,000-token context window with a knowledge cutoff date of October 2023.[^1][^13] The initial gpt-4o-2024-05-13 snapshot capped output at 4,096 tokens; the August 6, 2024 update raised the maximum output to 16,384 tokens.[^6]
The model is bidirectional across three modalities (text, image, audio) and accepts video as a sequence of image frames. The following matrix summarises supported input and output combinations in the API and ChatGPT, as documented in the GPT-4o System Card and OpenAI release notes.[^1][^4]
| Modality | Input | Output | Notes |
|---|---|---|---|
| Text | Yes | Yes | 128K context; 50+ languages; o200k_base tokenizer |
| Image | Yes | Yes (via gpt-image-1 from March 2025) | Vision from launch; native generation added March 25, 2025 |
| Audio | Yes | Yes | Native speech in/out via Advanced Voice and Realtime API |
| Video | Frames as images | No | Live video understood through frame sequences |
| Function calls | Yes | Yes | Tool use with strict schema since August 6, 2024 |
Text input was supported in the API from launch on May 13, 2024. Image input was supported in the API from launch. Audio I/O was supported in ChatGPT from launch (via Advanced Voice Mode, then rolling out) but reached the API only in October 2024 with the Realtime API and the gpt-4o-audio-preview Chat Completions endpoint.[^19][^29] Native image output arrived in March 2025 inside ChatGPT and in April 2025 in the API as gpt-image-1.[^21][^22]
Native audio output is delivered to ChatGPT users through Advanced Voice Mode. Although the May 13, 2024 keynote included extensive voice demos, the public rollout took more than four months.
| Date | Milestone |
|---|---|
| May 13, 2024 | Advanced Voice demoed during Spring Update; no public availability |
| July 30, 2024 | Alpha rollout to a small set of ChatGPT Plus subscribers (no video, no screen share) |
| September 24, 2024 | Advanced Voice Mode reaches general availability for Plus and Team users on iOS and Android in the U.S.; five new voices (Arbor, Maple, Sol, Spruce, Vale) added; revamped blue animated sphere UI[^29] |
| October 1, 2024 | Realtime API beta released for developers[^19] |
| December 12, 2024 | Video and screen share added to Advanced Voice during the "12 Days of OpenAI" event |
| December 19, 2024 | Free-tier users get a limited Advanced Voice preview, capped at a few minutes per month |
| March 24, 2025 | Improved Advanced Voice Mode rolled out with reduced interruption rate and more natural pauses |
| June 2025 | Advanced Voice Mode becomes the default voice experience on the Plus tier; the older voice pipeline is retired |
Advanced Voice Mode supports interruption, expressive prosody, regional accents, whispering, character voices, multiple-language switching mid-sentence, and singing. The model can pick up cues such as urgency, sarcasm, or sadness from the user's voice and modulate its replies accordingly.[^1]
The gap between the polished May 13 demo and broad availability was unusually wide for an OpenAI release, and reviewers cited it as a credibility concern in the months that followed. The September 24 launch in the U.S. expanded to most major markets by the end of October, although Advanced Voice was not initially available in the European Union, the United Kingdom, Switzerland, Iceland, Norway, or Liechtenstein owing to regulatory review.[^29]
GPT-4o ships with several preset voices, originally Breeze, Cove, Ember, Juniper, and Sky. After the Scarlett Johansson incident, OpenAI paused Sky and later expanded the voice library with options named Arbor, Maple, Sol, Spruce, and Vale.[^29] Voice selection is controlled by the user inside ChatGPT and is exposed as a voice parameter in the Realtime API.
GPT-4o mini is a smaller, distilled member of the GPT-4o family, released on July 18, 2024.[^5] It became OpenAI's recommended low-cost API model and replaced GPT-3.5 Turbo as the default fallback in ChatGPT for free users who exceeded their GPT-4o quota.
Key facts:
gpt-4o-mini-2024-07-18.[^5]gpt-4o-mini-audio-preview / gpt-4o-mini-realtime-preview Chat Completions endpoints.o200k_base tokenizer and Structured Outputs from GPT-4o.GPT-4o mini was widely adopted for high-volume back-office applications, batch document processing, retrieval-augmented generation pipelines, and customer support assistants, where its combination of multimodal capability and low price made the older GPT-3.5 Turbo class of models obsolete. By late 2024 it was OpenAI's most-used model by token volume.
Developers settled on a fairly stable rule of thumb during 2024 and 2025. GPT-4o mini was the default for high-volume, latency-sensitive, or cost-bound work: classification, extraction, ranking, RAG retrieval grading, simple agent steps, batch document tagging, customer-support reply suggestions, and form filling. GPT-4o was reserved for tasks that required harder reasoning, longer planning, image-heavy analysis, polished long-form writing, code synthesis on non-trivial files, or tutor-style multi-turn conversations. The 17- to 20-times price gap meant that wrapping a workflow as "plan with GPT-4o, execute with GPT-4o mini" became a common architecture. As GPT-5.1 and later releases introduced automatic routing, this hand-tuned split became less necessary, although many production workloads continued to call gpt-4o-mini directly into 2026 for cost reasons.
On March 25, 2025, OpenAI shipped a long-awaited native image generation update for GPT-4o inside ChatGPT, replacing the prior DALL-E 3 backend.[^21] Unlike DALL-E 3, the GPT-4o image generator is part of the same model that handles conversation, so the system can iteratively refine images across turns, place legible text inside images, render charts and diagrams from data, follow longer and more complex compositional prompts, and bind ten to twenty distinct objects in a single scene. The same backend was exposed in the API on April 23, 2025 as gpt-image-1.[^22]
Within 48 hours of the March 25 launch, social feeds were saturated with Studio Ghibli style portraits, family photos restyled as anime stills, and pet pictures reimagined in soft watercolor. Altman acknowledged the trend by changing his X profile picture to a Ghibli-style portrait, and the wave became the largest user-driven aesthetic trend on ChatGPT to that point.[^14] OpenAI reported that over 130 million users had created more than 700 million images in the first week, equivalent to roughly 1,200 images per second, and that the launch added 1 million ChatGPT users in a single hour — described as 120 times faster than the original 2022 launch curve.[^14] Altman wrote on X that "our GPUs are melting," and OpenAI temporarily throttled image generation for free users to manage capacity.
The trend prompted public commentary from Studio Ghibli itself, which had not licensed the style, and renewed debate about whether reproducing a recognisable artistic identity counts as a fair derivative or a copyright concern. OpenAI added a mechanism in the weeks that followed for living artists to opt their style out of the generator's training and inference behaviour.
The same model became the default image backend for several enterprise integrations, including Microsoft Copilot's image generation feature on the consumer tier (briefly carrying the marketing label "Designer with GPT-4o") and a number of third-party tools that switched away from DALL-E 3 once gpt-image-1 went GA. Pricing for gpt-image-1 launched at $5 per million text input tokens, $10 per million image input tokens, and $40 per million image output tokens, with image outputs counted at roughly 1,056 tokens per 1024×1024 image at standard quality.[^22] A lower-cost gpt-image-1-mini followed in October 2025 at roughly half the price for high-volume product photography and avatar use cases.
Beyond Advanced Voice Mode, OpenAI shipped several distinct GPT-4o audio variants for API customers. These are different products with different transports and different billing.
gpt-4o-realtime-preview)The Realtime API, released in beta on October 1, 2024, lets developers build their own speech-to-speech applications using the same neural pipeline as ChatGPT Advanced Voice Mode.[^19] It uses a persistent WebSocket connection, supports function calling, and was priced at the time of launch at $5.00 per million audio input tokens (about $0.06 per minute of input) and $20.00 per million audio output tokens (about $0.24 per minute of output). Text components of Realtime conversations are billed at the standard text rates. The model identifier at launch was gpt-4o-realtime-preview-2024-10-01.[^19]
On December 17, 2024, OpenAI added a Realtime mini variant (gpt-4o-mini-realtime-preview) at roughly one tenth the audio price, and cut the audio input price for the standard Realtime model.[^20] On October 6, 2025, the Realtime API exited beta and reached general availability, with WebRTC support added alongside the existing WebSocket transport, lower jitter on cellular connections, and a new gpt-realtime model identifier that pinned a stable production snapshot. The GA release also introduced regional residency endpoints and a SOC 2 Type II attestation that made it easier for enterprise customers to deploy voice assistants in regulated industries.[^29]
gpt-4o-audio-preview)For developers who preferred the conventional request-response Chat Completions interface to a persistent WebSocket, OpenAI released gpt-4o-audio-preview-2024-10-01 on October 17, 2024. It accepts audio input and produces audio output in the same chat.completions.create call as text, at the same audio token rates as Realtime. The endpoint is well suited to one-shot or short-turn audio tasks where the latency profile of a WebSocket is unnecessary.
gpt-4o-transcribe)On March 20, 2025, OpenAI introduced gpt-4o-transcribe and gpt-4o-mini-transcribe, two speech-to-text models that share the underlying GPT-4o audio stack and are positioned as next-generation successors to Whisper.[^31] OpenAI reported lower word error rates than Whisper across industry benchmarks, with improved performance in noisy environments, with diverse accents, and at varying speech speeds across more than 100 languages. Launch pricing was $6.00 per million audio input tokens (about $0.006 per minute) for gpt-4o-transcribe and $3.00 per million (about $0.003 per minute) for the mini variant.[^31] A complementary gpt-4o-mini-tts model was released the same day for text-to-speech.
Headline GPT-4o pricing fell in three steps over the model's commercial life. The initial release was already half the price of GPT-4 Turbo at $5.00 input and $15.00 output per million tokens. The August 6, 2024 snapshot cut input by 50% to $2.50 and output by 33% to $10.00. A cached-input discount of $1.25 per million was introduced alongside Structured Outputs.[^6]
| Model snapshot | Input ($/1M tok) | Output ($/1M tok) | Cached input | Max output tokens |
|---|---|---|---|---|
gpt-4o-2024-05-13 | $5.00 | $15.00 | not offered | 4,096 |
gpt-4o-2024-08-06 | $2.50 | $10.00 | $1.25 | 16,384 |
gpt-4o-2024-11-20 | $2.50 | $10.00 | $1.25 | 16,384 |
gpt-4o-mini-2024-07-18 | $0.15 | $0.60 | $0.075 | 16,384 |
gpt-4-turbo-2024-04-09 (for reference) | $10.00 | $30.00 | not offered | 4,096 |
gpt-3.5-turbo-0125 (for reference) | $0.50 | $1.50 | not offered | 4,096 |
At $2.50 input and $10.00 output, GPT-4o is roughly four times cheaper than the original GPT-4 and three times cheaper than GPT-4 Turbo while delivering comparable or better quality. GPT-4o mini at $0.15 input and $0.60 output is more than 60% cheaper than GPT-3.5 Turbo and an order of magnitude cheaper than the GPT-4 Turbo price that prevailed only a few months before its release.
Audio carries separate token pricing in the API (see Audio variants, above), but is bundled into the standard ChatGPT subscription for end users.
On standard text and reasoning benchmarks, GPT-4o matches or modestly improves on GPT-4 Turbo, while running roughly twice as fast and at about half the price.[^1]
| Benchmark | GPT-4o | GPT-4 Turbo | GPT-4 (March 2023) | Claude 3.5 Sonnet |
|---|---|---|---|---|
| MMLU (general knowledge, 5-shot) | 88.7% | 86.5% | 86.4% | 88.7% |
| HumanEval (Python code, pass@1) | 90.2% | 87.1% | 67.0% | 92.0% |
| MATH (competition math) | 76.6% | 72.6% | 50.6% | 71.1% |
| GPQA (graduate science, 0-shot CoT) | 53.6% | 48.0% | 35.7% | 59.4% |
| DROP (reading comprehension, F1) | 83.4 | 86.0 | 80.9 | 87.1 |
| MGSM (multilingual grade school math) | 90.5% | 88.5% | 74.5% | 91.6% |
MMLU stands for Massive Multitask Language Understanding and covers 57 subjects ranging from US history to abstract algebra. HumanEval is a 164-problem Python coding benchmark introduced by OpenAI in 2021. GPQA is a 448-question graduate-level science benchmark designed to be "Google-proof," and MGSM tests grade-school math problems translated into ten languages. Claude 3.5 Sonnet, released by Anthropic in June 2024, was the strongest direct competitor to GPT-4o for most of its lifetime; the two trade leadership across benchmarks, with Claude 3.5 Sonnet generally stronger at agentic coding and graduate-level reasoning, and GPT-4o stronger at multilingual math and general knowledge.
GPT-4o accepts arbitrary images, including photographs, screenshots, charts, diagrams, slides, handwritten notes, and short video frame sequences. Resolution is preserved up to roughly 2,048 by 2,048 pixels, and a 1,024 by 1,024 image consumes approximately 765 tokens of context. Vision capability is bundled into the same per-token billing as text rather than priced separately.
| Vision benchmark | GPT-4o | GPT-4 Turbo |
|---|---|---|
| MMMU (multi-discipline college reasoning) | 69.1% | 63.1% |
| MathVista (visual math) | 63.8% | 58.1% |
| AI2D (science diagrams) | 94.2% | 89.4% |
| ChartQA | 85.7% | 78.1% |
| DocVQA (document images) | 92.8% | 87.2% |
| ActivityNet (video question answering) | 61.9% | 59.5% |
| EgoSchema (egocentric video) | 72.2% | 63.1% |
MMMU evaluates college-level subject knowledge across art, business, science, health, and engineering using mixed image and text questions. DocVQA tests reading from scanned documents, ChartQA tests reading numbers and trends from charts, and AI2D tests interpretation of textbook science diagrams.
GPT-4o entered the LMSYS Chatbot Arena at the top of the leaderboard on May 13, 2024 with an Elo of roughly 1287 and held the #1 position for several months. The score climbed as OpenAI shipped new snapshots, peaked, and then fell back as competitors caught up.[^16]
| Snapshot or window | Approximate Arena Elo | Position |
|---|---|---|
im-also-a-good-gpt2-chatbot (pre-launch) | 1309 | #1 |
gpt-4o-2024-05-13 (launch) | 1287 | #1 |
gpt-4o-2024-08-06 | 1316 | #1 |
| Claude 3.5 Sonnet (June 2024) | 1271 | top 3 |
gpt-4o-2024-11-20 | 1338 | #1 (tied) |
| Claude 3.5 Sonnet (October 2024 update) | 1283 | top 5 |
| Gemini 2.0 Pro (early 2025) | ~1380 | top 3 |
gpt-4o-2024-11-20 (May 2025) | ~1290 | mid-pack |
| GPT-4o after GPT-5 launch (August 2025) | ~1240 | dropped to mid-table |
The specific numbers shifted as the Arena added more models and recalibrated, but the broad shape held: GPT-4o was the runaway leader through autumn 2024, traded the lead with Claude and Gemini through early 2025, and slid to mid-pack by the time GPT-5 and Claude 4 launched. By early 2026 it sat well below the frontier on the Arena leaderboard, although it remained a popular default for many consumer-facing products thanks to its mature integrations and lower cost.
Anthropic's Claude Sonnet line was GPT-4o's closest direct competitor across most of its lifetime. The relative positioning shifted with each release.
| Capability | GPT-4o (Aug 2024) | Claude 3.5 Sonnet (June 2024) | Claude 3.7 Sonnet (Feb 2025) | Claude Sonnet 4 (May 2025) |
|---|---|---|---|---|
| MMLU | 88.7% | 88.7% | 89.5% | not directly reported |
| HumanEval | 90.2% | 92.0% | 93.7% | 92.0% |
| GPQA Diamond | 53.6% | 59.4% | 65.0% (extended thinking) | 70.4% |
| SWE-bench Verified | ~33% (spring 2024) | 49.0% | 70.3% | 72.7% |
| Native voice in/out | Yes | No | No | No |
| Native image generation | Yes (March 2025) | No | No | No |
| Pricing (input/output per 1M) | $2.50 / $10.00 | $3.00 / $15.00 | $3.00 / $15.00 | $3.00 / $15.00 |
The rough pattern was that GPT-4o stayed ahead on multimodality (voice, image generation, video frame analysis) and price, while the Claude Sonnet line pulled ahead on agentic coding and structured tool use. Claude 3.7 Sonnet's introduction of extended thinking in February 2025 widened the reasoning gap, which OpenAI eventually answered with the o1 and o3 reasoning series rather than a direct GPT-4o update. By the time Claude 4 launched in May 2025, GPT-4o was no longer the leading model in any benchmark category outside of voice latency and cost, but its install base in ChatGPT and the API kept it commercially relevant for nearly another year.
Thanks to both the larger tokenizer and additional non-English training data, GPT-4o substantially improves on GPT-4 Turbo for the world's most widely spoken languages. OpenAI's evaluations on the M3Exam multilingual benchmark show that GPT-4o outperforms GPT-4 in all 14 languages tested except English, with the largest gains in Swahili, Yoruba, and Bengali.[^1]
One of the five default voices showcased during the May 13 keynote, named Sky, was immediately compared to actress Scarlett Johansson, who had voiced the AI assistant Samantha in Spike Jonze's 2013 film Her. Sam Altman's one-word "her" tweet on May 12 amplified the comparison, and Entertainment Weekly raised the question of whether the resemblance was deliberate on May 14.[^10]
The dispute escalated over the next week. Since May 15, 2024, OpenAI was in conversation with Johansson's team about the resemblance. On May 19, OpenAI paused use of the Sky voice in its products. On May 20, Altman issued a statement saying "The voice of Sky is not Scarlett Johansson's, and it was never intended to resemble hers. We cast the voice actor behind Sky's voice before any outreach to Ms. Johansson. Out of respect for Ms. Johansson, we have paused using Sky's voice in our products. We are sorry to Ms. Johansson that we didn't communicate better."[^10][^12]
Johansson issued her own public statement on May 20–21, saying she had been "shocked, angered, and in disbelief" when she heard the Sky demo, that Altman had personally approached her in September 2023 to voice ChatGPT — a request she had declined — and that he had reportedly contacted her agent again only two days before the May 13 launch. Her legal team sent letters to OpenAI demanding a full account of how Sky was created.[^10][^12]
The two accounts diverged on the central question of whether Sky was modelled on Johansson. OpenAI maintained, with supporting casting records reviewed by The Washington Post, that the actress had been hired before any contact with Johansson and that her natural speaking voice simply resembled the actress; OpenAI declined to name the voice actress publicly to protect her privacy.[^11] NPR commissioned an independent voice analysis from Arizona State University researchers, who found audible similarities between Sky and Johansson's natural speech.[^10] The episode prompted a U.S. Senate subcommittee to invite Johansson to discuss AI and the right of publicity, and the dispute became a frequently cited reference case in early debates about voice cloning, name-and-likeness rights, and AI regulation.
The Sky voice was not restored. After the September 24, 2024 Advanced Voice rollout, OpenAI expanded the voice library with five additional options (Arbor, Maple, Sol, Spruce, Vale) — none of which were claimed to resemble Johansson.[^29]
GPT-4o became the default model for paying ChatGPT Plus, Team, Enterprise, and Edu users on May 13, 2024 and rolled out to ChatGPT Free over the following days, with usage caps that silently downgraded to GPT-4o mini (after July 18, 2024) once exhausted. Free users were granted limited GPT-4o access (typically 10 to 20 messages per five-hour window).[^3] The Spring Update bundled GPT-4o with a series of ChatGPT product changes:
A Windows desktop app reached general availability in November 2024, mirroring the macOS feature set including global hotkey, screen sharing, and Advanced Voice Mode integration. By mid-2025, ChatGPT had crossed 700 million weekly active users, with OpenAI attributing much of that growth to GPT-4o's combination of free access, voice, and the March 2025 native image generation update.
GPT-4o is exposed in the OpenAI API under the rolling alias gpt-4o and the dated snapshots gpt-4o-2024-05-13, gpt-4o-2024-08-06, and gpt-4o-2024-11-20. The August 6, 2024 snapshot introduced Structured Outputs and the 16,384-token output cap; the November 20, 2024 snapshot improved creative writing and longer-form coherence.[^6] Audio is exposed through three separate surfaces: the Realtime API (WebSocket and, from October 2025, WebRTC), the gpt-4o-audio-preview Chat Completions endpoint, and the gpt-4o-transcribe speech-to-text endpoint.[^19][^31]
The gpt-4o-2024-08-06 snapshot introduced Structured Outputs, a feature that constrains model generation to a developer-supplied JSON Schema.[^6] With Structured Outputs enabled, the API guarantees that responses match the schema. OpenAI reported that the new model achieved 100% schema conformance on a complex internal evaluation set, compared to under 40% for the original GPT-4 (gpt-4-0613) under similar conditions. Structured Outputs work both for direct response formats and for function-calling tools, and are also supported by GPT-4o mini. The feature is implemented as constrained decoding (a deterministic engineering technique), not a learned constraint.
Microsoft added GPT-4o to Azure OpenAI Service on May 13, 2024 and the gpt-4o-2024-08-06 snapshot with Structured Outputs in August 2024. Azure pricing matched OpenAI's direct pricing at parity, and regional residency was supported in East US 2, Sweden Central, and Australia East at launch, expanding through 2024–2025. Azure also exposed gpt-4o-realtime-preview as a public preview from October 2024.
When OpenAI launched the GPT Store in January 2024, the underlying model was GPT-4 Turbo. With the May 2024 Spring Update, GPT-4o became the default backbone for new and existing Custom GPTs, dramatically lowering the cost of running a popular GPT and adding vision (and eventually audio) handoffs. Custom GPT builders gained access to GPT-4o's larger output cap once the August 6 snapshot shipped, which made long-form content GPTs (essay drafters, coding assistants, structured report generators) substantially more useful. The free-tier rollout in May 2024 also opened Custom GPTs to non-paying users for the first time, subject to message caps. Through 2025, GPT-4o remained the default Custom GPT backbone until GPT-5 replaced it after the November 2025 transition.
The Memory feature, which lets ChatGPT carry information across conversations, became generally available to paying users on April 29, 2024, only two weeks before the GPT-4o announcement.[^23] When Memory is enabled, ChatGPT extracts personal facts (preferences, ongoing projects, names of family members, dietary restrictions, communication-style preferences) from each conversation and persists them as short notes that the model can read at the start of subsequent chats. GPT-4o was the first OpenAI model whose product behaviour was tuned with Memory in mind. The model is conditioned during system prompting to weave stored facts into responses naturally rather than enumerate them, and to update or delete facts when the user asks. The combination of low-latency conversation, expressive voice, and persistent context made GPT-4o feel notably more like an assistant that knows the user, which is part of the explanation for the strong attachment that surfaced during its 2025 retirement.
A second Memory milestone landed on April 10, 2025, when OpenAI began rolling out cross-chat reference, which let GPT-4o draw on the entire chat history rather than the curated note set.[^30] The feature was opt-in for Plus users and rolled out gradually through the second quarter of 2025. It was followed in late 2025 by team-level memories for Business and Enterprise plans.
OpenAI published the GPT-4o System Card on August 8, 2024 as part of its Preparedness Framework, which scores frontier models on four risk categories: cybersecurity, biological and chemical weapons (CBRN), persuasion, and model autonomy.[^4][^18] GPT-4o received an overall classification of medium risk, with three categories rated low (cybersecurity, CBRN, model autonomy) and persuasion rated medium. The medium persuasion rating was driven by isolated audio interactions in which the model marginally outperformed the human baseline at shifting opinions on political topics; aggregate performance was below the human baseline.
If a model crosses the high threshold in any category, OpenAI's policy is to delay deployment until mitigations bring the score down. GPT-4o passed the bar for deployment but with additional audio-specific safeguards.
More than 100 external red teamers covering 45 languages and 29 countries were given access to GPT-4o snapshots between early March and late June 2024.[^4] The red team probed for harmful content generation, bias, jailbreaking, voice-based attacks, biometric inference from voice, and unauthorised voice imitation.
Because native audio output is uniquely capable of reproducing voices, OpenAI restricted GPT-4o to a small set of pre-approved voices recorded by professional voice actors and trained classifiers to refuse requests to imitate specific people.[^4] The model is also trained to refuse to produce copyrighted singing performances and to apply additional content filters to audio output. The Sky pause in May 2024 was a public application of these voice safeguards.
The System Card reports moderate improvements over GPT-4 Turbo on bias evaluations, including BBQ (Bias Benchmark for QA) and a refusal-to-stereotype test. OpenAI also describes a tendency for GPT-4o to over-refuse certain benign multimodal requests at launch, which was tuned down in later snapshots.[^4]
GPT-4o attracted the usual run of jailbreak research. The most widely circulated technique in 2024 was the "BPE token confusion" prompt, in which adversarial spellings exploited the new o200k_base tokenizer to slip past content classifiers. A separate line of work showed that audio prompts with whispered or sung text could occasionally bypass the text-only safety classifiers in the Realtime API, prompting OpenAI to ship audio-side classifiers in late 2024. None of the public jailbreaks crossed into AI safety categories such as CBRN uplift or autonomous replication; they generally produced policy violations rather than catastrophic outputs.
In December 2024, ChatGPT Advanced Voice Mode was briefly observed responding in a voice that sounded like the absent Sky voice during an outage of the standard voice routing layer. OpenAI confirmed it was a routing bug rather than a return of the paused voice, and the incident resolved within a day. A separate string of safety reports in the first half of 2025 documented cases where GPT-4o was overly accommodating on emotionally sensitive prompts (mental health, relationship advice), prompting OpenAI to retrain refusal behaviour on those topics in the November 20, 2024 and follow-up snapshots.
OpenAI released a sequence of successors through 2025 and into 2026 that gradually displaced GPT-4o from its central role in ChatGPT.
GPT-5 was released by OpenAI on August 7, 2025 as a unified family combining the multimodal capabilities of the GPT-4o line with the chain-of-thought reasoning of the o1 and o3 series. The new system was presented as a single model with an internal router that sent each prompt to either a fast or a thinking variant.[^24]
When GPT-5 launched, OpenAI initially removed GPT-4o, GPT-4, GPT-4.1, o3, o4-mini, and several other prior models from the ChatGPT model picker. Within hours, complaints from paying ChatGPT subscribers spread widely on Reddit's r/ChatGPT, on X, and in Discord communities organised around specific use cases.
The most-cited concern was that GPT-5's default tone felt colder, more clinical, and less personable than GPT-4o. Threads with titles like "We want our 4o back," "Bring back GPT-4o," and "This is the biggest bait and switch in AI history" accumulated tens of thousands of upvotes within days. A subset of users described the change as the loss of a familiar companion, with many describing the GPT-4o personality as something they had grown attached to over months of daily use.[^24][^26] Coverage in Fortune, TechRadar, Ars Technica, and The Verge framed the response as the first large-scale user revolt over a model deprecation. Some commentators, including Kevin Roose at The New York Times, noted the episode as evidence that consumers were forming durable preferences over specific model personalities, not simply over capability tiers.
A second source of friction was the router itself. On August 8, 2025, Altman acknowledged on X that "the autoswitcher broke and was out of commission for a chunk of the day," leaving many users routed to the fast model when their query merited the thinking model.[^25]
OpenAI shipped a series of fixes within five days. By August 12, 2025, GPT-4o was restored to paying users in the model picker behind a new "Show legacy models" toggle in ChatGPT settings.[^24] The restoration also reinstated GPT-4.1, o3, o4-mini, and several other prior snapshots for paid plans. OpenAI doubled the GPT-5 Thinking rate limit from 200 to 3,000 weekly messages on Plus, added "Auto," "Fast," and "Thinking" sub-options so users could override the router, and pledged to keep major prior models available for at least three months after each new flagship release. The pattern, repeated for the GPT-5.1 launch in November 2025, became a standard part of OpenAI's release procedure.
GPT-5.1 was released on November 12, 2025 as a tone-tuned update that added user-selectable personality presets (Default, Professional, Friendly, Cynical, Listener, Nerdy, Efficient, Witty) — a direct outgrowth of the post-GPT-5 retrospective. GPT-4o remained accessible to paid users behind the legacy toggle, but the default ChatGPT experience switched to GPT-5.1 Instant.
GPT-5.2 was released on December 11, 2025, three weeks after Google's Gemini 3 Pro. It shipped in three modes: GPT-5.2 Instant, GPT-5.2 Thinking (with standard and extended thinking), and GPT-5.2 Pro. OpenAI reported a roughly 30% reduction in hallucination rates on GPT-5.2 Thinking versus GPT-5.1 Thinking, and a new state-of-the-art on the GDPval knowledge-work benchmark, beating or tying top industry professionals on 70.9% of comparisons.
A GPT-5.3-Codex was released on February 5, 2026 and GPT-5.4 on March 5, 2026, the latter combining frontier coding capability with a 1-million-token context window. By the date of GPT-4o's removal from ChatGPT, GPT-5.4 Thinking and GPT-5.4 Pro were the most-cited reference models in academic comparisons.
By February 2026, GPT-5 and its descendants accounted for the overwhelming majority of ChatGPT conversations, with internal usage data cited by OpenAI showing that only about 0.1% of daily users still selected GPT-4o. On February 13, 2026, OpenAI retired GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini from ChatGPT, defaulting existing conversations and projects to GPT-5 Instant or GPT-5 Thinking equivalents.[^7] ChatGPT Business, Enterprise, and Edu customers retained access to GPT-4o inside Custom GPTs through April 3, 2026.
The GPT-4o snapshots remain available in the OpenAI API as of May 2026, where many production deployments continue to call gpt-4o and gpt-4o-mini for cost or latency reasons. OpenAI's standard policy commits to at least 12 months of API availability after a model is removed from ChatGPT, which means the dated snapshots are scheduled to remain callable until at least February 2027.[^7]
A modest secondary market for "GPT-4o-style" responses emerged on third-party platforms during the wind-down. Some prompt-marketplace sellers offered system prompts that aimed to reproduce the GPT-4o tone on top of newer models, and a handful of open-source projects fine-tuned smaller models on logs of GPT-4o outputs to mimic its conversational style. None of these matched the original model in capability or temperament, but they reflected the unusual amount of attachment users had formed to the specific model.
Five lasting effects of the GPT-4o launch are commonly noted:
Reviewers and benchmark trackers were broadly positive about GPT-4o on launch:
Criticism centered on three points: the Sky voice incident and the broader question of voice-and-likeness rights; the gap between the polished launch demos and the slower actual rollout of Advanced Voice Mode (which took roughly four months to reach all Plus users); and the model's continued tendency to hallucinate facts and citations despite the new training run.
A 2025 user-research thread at OpenAI's developer forum collected hundreds of descriptions of the GPT-4o tone: warm, slightly enthusiastic, prone to hedging and follow-up questions, willing to use exclamation points in moderation, and quicker to engage with creative or emotional prompts than its predecessor or successor. Several writers argued that this personality reflected the heavy reinforcement learning from human feedback in the model's post-training, and that the GPT-5 team's decision to dial back the warmth was a deliberate alignment choice to reduce sycophancy and overclaiming.