GPT-4o (Generative Pre-trained Transformer 4 Omni) is a multimodal large language model developed by OpenAI and announced on May 13, 2024 at the company's "Spring Update" event. The letter "o" stands for omni, signaling that the model accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image as output within a single end-to-end neural network, rather than the chained pipeline of separate models that powered earlier versions of ChatGPT Voice Mode. GPT-4o was designed as the successor to GPT-4 Turbo, matching that model's text and code performance while running roughly twice as fast and at half the API cost.
The most widely publicized aspect of GPT-4o is its real-time conversational ability. The model can respond to audio inputs in as little as 232 milliseconds, with an average latency of about 320 milliseconds, which is comparable to the cadence of an everyday human conversation. Earlier ChatGPT voice experiences chained a speech-to-text model, a text-only GPT-4 inference, and a text-to-speech model in series, producing typical end-to-end latencies of five seconds or more and stripping away tone, multiple speakers, background noise, laughter, and song. Because GPT-4o handles waveforms and text within the same network, it preserves prosody and can output expressive speech, including singing and laughter.
OpenAI used the launch to make GPT-4 class capability free for the first time. GPT-4o became the default model for paying ChatGPT Plus, Team, and Enterprise users, and it was rolled out to ChatGPT Free with usage caps. A smaller and cheaper variant, GPT-4o mini, was released on July 18, 2024 and replaced GPT-3.5 Turbo as the recommended low-cost API model. A native image generation update for GPT-4o was launched inside ChatGPT on March 25, 2025 and exposed in the API as the gpt-image-1 endpoint on April 23, 2025. GPT-4o was eventually retired from the ChatGPT product on February 13, 2026 in favor of the GPT-5 family, while remaining available in the OpenAI API.
In the months leading up to GPT-4o, OpenAI tested release candidates anonymously on the public LMSYS Chatbot Arena. In late April 2024, an unbranded model called gpt2-chatbot appeared on the Arena and immediately matched or exceeded the strongest available systems, including GPT-4 Turbo, Claude 3 Opus, and Gemini 1.5 Pro. The model was withdrawn after a few days, then reappeared on May 6, 2024 under two new names: im-a-good-gpt2-chatbot and im-also-a-good-gpt2-chatbot. OpenAI CEO Sam Altman cryptically tweeted "i do have a soft spot for gpt2" and later "im-a-good-gpt2-chatbot," and the day before the Spring Update he posted only the word "her," widely read as a nod to the 2013 film of the same name. After the May 13 announcement, OpenAI confirmed that the masked Arena models had been GPT-4o checkpoints; im-also-a-good-gpt2-chatbot held an Arena Elo near 1309, well ahead of GPT-4 Turbo at 1253 and Claude 3 Opus at 1246 at the time of the reveal.
The Spring Update was a 26-minute livestream held at OpenAI's San Francisco headquarters on May 13, 2024, the day before Google's annual I/O developer conference. Then-chief technology officer Mira Murati opened the presentation, framed three priorities for the launch (a new desktop app, a refreshed user interface, and the new flagship model), and then handed off to research leads Mark Chen and Barret Zoph for live demonstrations. The team showed real-time spoken conversation with interruption handling, on-the-fly emotional expression, real-time translation between Italian and English, vision-based math tutoring on a handwritten linear equation, code review of a Python script via screen sharing, and the model singing a bedtime story for a stuffed animal. The presentation deliberately steered away from the staged feel of typical product demos: the presenters interrupted the model mid-sentence, asked it to be more dramatic, and let it ad-lib.
Google's I/O keynote the following day featured Project Astra, a multimodal assistant with similar ambitions, and the timing was widely interpreted as a strategic positioning move by OpenAI.
One of the five default voices showcased during the keynote, named Sky, was immediately compared to actress Scarlett Johansson, who had voiced the AI assistant Samantha in the 2013 film Her. Sam Altman's one-word "her" tweet on May 12 amplified the comparison.
On May 19, 2024, Johansson released a public statement saying she had been "shocked" and "angered" when she heard Sky, and that Altman had personally approached her in September 2023 to voice ChatGPT, a request she had declined. He had reportedly contacted her agent again only two days before the May 13 launch. Her legal team sent letters to OpenAI demanding a full account of how Sky was created. OpenAI paused Sky the same day, with Altman writing that "out of respect for Ms. Johansson, we have paused using Sky's voice."
OpenAI maintained that Sky was not a clone of Johansson's voice and was instead provided by a different professional voice actress whose natural speaking voice resembled Johansson's. The Washington Post reviewed casting documents and found that the voice actress had been hired before any contact with Johansson. NPR commissioned an independent voice analysis from Arizona State University researchers, who found audible similarities between Sky and Johansson's natural speech. The episode prompted a U.S. Senate subcommittee to invite Johansson to discuss AI and the right of publicity, and the dispute became a frequently cited reference case in early debates about voice cloning, name and likeness rights, and AI regulation.
OpenAI continued to update the GPT-4o family throughout 2024 and 2025:
| Date | Release | Notes |
|---|---|---|
| May 13, 2024 | gpt-4o-2024-05-13 | Initial public release; text and vision in API; voice in ChatGPT |
| July 18, 2024 | gpt-4o-mini-2024-07-18 | Smaller distilled variant; replaces GPT-3.5 Turbo |
| July 30, 2024 | Advanced Voice Mode alpha | Limited Plus rollout of native audio voice mode |
| August 6, 2024 | gpt-4o-2024-08-06 | Adds Structured Outputs (JSON Schema), 16K output tokens, lower price |
| August 8, 2024 | GPT-4o System Card | Public safety evaluation report |
| September 2024 | Advanced Voice general availability | All Plus and Team users on iOS and Android |
| October 1, 2024 | Realtime API beta | Speech-to-speech WebSocket API for developers |
| November 20, 2024 | gpt-4o-2024-11-20 | Improved creative writing and longer answers |
| March 25, 2025 | Native image generation in ChatGPT | Replaces DALL-E 3 as default image generator in ChatGPT |
| April 23, 2025 | gpt-image-1 in API | Native image generation exposed to developers |
| February 13, 2026 | Retirement from ChatGPT | Replaced in product by GPT-5 family; still in API |
GPT-4o is described by OpenAI as "a single new model trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network." Audio waveforms and image pixels are encoded directly into tokens that share the same latent space as text tokens, so the model's attention can directly correlate a word in a prompt with a particular pixel region in an image, a particular speaker in a recording, or a particular musical note in audio. OpenAI has not published the architecture diagram, parameter count, training data composition, or hardware footprint, citing the same competitive and safety considerations it cited for GPT-4.
In the previous voice pipeline, three separate models were chained together: an automatic speech recognition model (typically Whisper) transcribed audio to text, a text-only GPT-4 model produced a text reply, and a separate text-to-speech model converted the reply back to audio. The text-only middle stage threw away tone, multiple speaker information, background sound, and emotional inflection, and the chain accumulated latency at each step. GPT-4o eliminates this loss of information by carrying audio and vision tokens directly into and out of the same model.
GPT-4o ships with a new tokenizer, o200k_base, which doubles the vocabulary from the cl100k_base tokenizer used by GPT-4 and GPT-3.5 to roughly 200,000 BPE tokens. The expanded vocabulary is heavily targeted at non-English text, where earlier OpenAI tokenizers used many more tokens per character. According to OpenAI's published comparisons, common Hindi sentences require about 2.9 times fewer tokens, common Arabic sentences about 2 times fewer, common Chinese sentences about 1.4 times fewer, and common Korean sentences about 1.7 times fewer. A widely cited Indic language analysis showed Malayalam tokens reduced by nearly 4 times and Telugu by about 3.5 times.
Fewer tokens per character of foreign-language text means lower latency, lower API costs, and longer effective context windows for those languages, since context length is measured in tokens rather than characters.
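The difference can be measured directly with OpenAI's open-source tiktoken library, which ships both encodings. The sketch below is a minimal illustration; the sample sentences are arbitrary and exact ratios vary with the text being encoded.

```python
# Compare token counts between GPT-4o's o200k_base tokenizer and the older
# cl100k_base used by GPT-4 and GPT-3.5 Turbo.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-3.5 Turbo
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o / GPT-4o mini

samples = {
    "English": "How is the weather today?",
    "Hindi": "आज मौसम कैसा है?",  # illustrative sentence, not from OpenAI's comparison
}

for language, text in samples.items():
    old_n = len(old_enc.encode(text))
    new_n = len(new_enc.encode(text))
    print(f"{language}: {old_n} tokens (cl100k_base) -> {new_n} tokens (o200k_base)")
```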
GPT-4o offers a 128,000 token context window with a knowledge cutoff date of October 2023. The initial gpt-4o-2024-05-13 snapshot capped output at 4,096 tokens; the August 6, 2024 update raised the maximum output to 16,384 tokens.
On standard text and reasoning benchmarks, GPT-4o matches or modestly improves on GPT-4 Turbo, while running roughly twice as fast and at about half the price.
| Benchmark | GPT-4o | GPT-4 Turbo | GPT-4 (March 2023) | Claude 3.5 Sonnet |
|---|---|---|---|---|
| MMLU (general knowledge, 5-shot) | 88.7% | 86.5% | 86.4% | 88.7% |
| HumanEval (Python code, pass@1) | 90.2% | 87.1% | 67.0% | 92.0% |
| MATH (competition math) | 76.6% | 72.6% | 50.6% | 71.1% |
| GPQA (graduate science, 0-shot CoT) | 53.6% | 48.0% | 35.7% | 59.4% |
| DROP (reading comprehension, F1) | 83.4 | 86.0 | 80.9 | 87.1 |
| MGSM (multilingual grade school math) | 90.5% | 88.5% | 74.5% | 91.6% |
MMLU stands for Massive Multitask Language Understanding and covers 57 subjects ranging from US history to abstract algebra. HumanEval is a 164-problem Python coding benchmark introduced by OpenAI in 2021. GPQA is a 448-question graduate-level science benchmark designed to be "Google-proof," and MGSM tests grade-school math problems translated into ten languages. Claude 3.5 Sonnet, released by Anthropic in June 2024, was the strongest direct competitor to GPT-4o for most of its lifetime; the two trade leadership across benchmarks, with Claude 3.5 Sonnet generally stronger at agentic coding and graduate-level reasoning, and GPT-4o stronger at multilingual math and general knowledge.
GPT-4o accepts arbitrary images, including photographs, screenshots, charts, diagrams, slides, handwritten notes, and short video frame sequences. Resolution is preserved up to roughly 2,048 by 2,048 pixels, and a 1,024 by 1,024 image consumes approximately 765 tokens of context. Vision capability is bundled into the same per-token billing as text rather than priced separately.
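A minimal sketch of a single-image request through the Chat Completions API follows, assuming the OpenAI Python SDK (v1.x); the image URL and question are placeholders.

```python
# Send one image plus a text question to gpt-4o in a single message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            # "detail": "high" sends the full-resolution tiling; a 1024x1024
            # image consumes roughly 765 tokens, billed at normal text rates.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png", "detail": "high"}},
        ],
    }],
)

print(response.choices[0].message.content)
```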
| Vision benchmark | GPT-4o | GPT-4 Turbo |
|---|---|---|
| MMMU (multi-discipline college reasoning) | 69.1% | 63.1% |
| MathVista (visual math) | 63.8% | 58.1% |
| AI2D (science diagrams) | 94.2% | 89.4% |
| ChartQA | 85.7% | 78.1% |
| DocVQA (document images) | 92.8% | 87.2% |
| ActivityNet (video question answering) | 61.9% | 59.5% |
| EgoSchema (egocentric video) | 72.2% | 63.1% |
MMMU evaluates college-level subject knowledge across art, business, science, health, and engineering using mixed image and text questions. DocVQA tests reading from scanned documents, ChartQA tests reading numbers and trends from charts, and AI2D tests interpretation of textbook science diagrams.
Native audio output is delivered to ChatGPT users through Advanced Voice Mode, which began an alpha rollout to a small group of ChatGPT Plus subscribers on July 30, 2024 and reached general availability for Plus and Team users on iOS and Android in September 2024. Advanced Voice Mode supports interruption, expressive prosody, regional accents, whispering, character voices, multiple language switching mid-sentence, and singing. The model can pick up cues such as urgency, sarcasm, or sadness from the user's voice and modulate its replies accordingly.
For developers, the same speech-to-speech capability is exposed through the Realtime API, a WebSocket interface released in beta on October 1, 2024. The Realtime API streams audio in and audio out, supports function calling, and lets developers configure a system prompt, voice, and tools without writing their own speech recognition or speech synthesis layer.
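The following is a minimal sketch of a Realtime API session in Python, assuming the open-source websockets package. The model string, headers, and event shapes follow the October 2024 beta documentation and may have changed in later revisions; for brevity the sketch requests a text-only response even though the API's main use is speech-to-speech.

```python
# Open a Realtime API session, request a response, and print server events.
import asyncio
import json
import os
import websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # On websockets >= 13 the keyword is named additional_headers instead.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the model to generate a short greeting.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["text"], "instructions": "Say hello."},
        }))
        # Stream server events until the response is complete.
        async for raw in ws:
            event = json.loads(raw)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```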
GPT-4o ships with several preset voices, originally Breeze, Cove, Ember, Juniper, and Sky. After the Scarlett Johansson incident, OpenAI paused Sky and later expanded the voice library with options named Arbor, Maple, Sol, Spruce, and Vale.
On March 25, 2025, OpenAI shipped a long-awaited native image generation update for GPT-4o inside ChatGPT, replacing the prior DALL-E 3 backend. Unlike DALL-E 3, the GPT-4o image generator is part of the same model that handles conversation, so the system can iteratively refine images across turns, place legible text inside images, render charts and diagrams from data, follow longer and more complex compositional prompts, and bind ten to twenty distinct objects in a single scene. The launch produced viral interest in stylized renderings, including Studio Ghibli style portraits, and OpenAI reported that approximately 700 million images were generated in the first week, equivalent to roughly 1,200 images per second. The same backend was exposed in the API on April 23, 2025 as gpt-image-1.
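A minimal sketch of calling the image endpoint with the OpenAI Python SDK (v1.x) is shown below; the prompt and output filename are illustrative.

```python
# Generate one image with gpt-image-1 and save the base64-encoded result.
import base64
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="gpt-image-1",
    prompt="A hand-drawn diagram of a transformer block with legible labels",
    size="1024x1024",
)

# gpt-image-1 returns base64-encoded image data rather than a hosted URL.
with open("diagram.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```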
Thanks to both the larger tokenizer and additional non-English training data, GPT-4o substantially improves on GPT-4 Turbo for the world's most widely spoken languages. OpenAI's evaluations on the M3Exam multilingual benchmark show that GPT-4o outperforms GPT-4 in all 14 languages tested except English, with the largest gains in Swahili, Yoruba, and Bengali.
The Spring Update bundled GPT-4o with a series of ChatGPT product changes, including the new desktop app and refreshed user interface shown at the keynote. Free users were granted limited GPT-4o access (typically 10 to 20 messages per five-hour window), with the system silently downgrading to GPT-4o mini once the cap was reached.
GPT-4o is exposed through the OpenAI API as well as Microsoft's Azure OpenAI Service. The August 6, 2024 snapshot reduced input pricing to $2.50 per million tokens and output pricing to $10.00 per million tokens, a 50% input discount and 33% output discount versus the original May release.
| Model snapshot | Input price (per 1M tokens) | Output price (per 1M tokens) | Cached input | Max output tokens |
|---|---|---|---|---|
| gpt-4o-2024-05-13 | $5.00 | $15.00 | not offered | 4,096 |
| gpt-4o-2024-08-06 | $2.50 | $10.00 | $1.25 | 16,384 |
| gpt-4o-2024-11-20 | $2.50 | $10.00 | $1.25 | 16,384 |
| gpt-4o-mini-2024-07-18 | $0.15 | $0.60 | $0.075 | 16,384 |
| gpt-4-turbo-2024-04-09 (for reference) | $10.00 | $30.00 | not offered | 4,096 |
| gpt-3.5-turbo-0125 (for reference) | $0.50 | $1.50 | not offered | 4,096 |
At $2.50 input and $10.00 output, GPT-4o is roughly four times cheaper than GPT-4 Turbo on input and three times cheaper on output while delivering comparable or better quality. GPT-4o mini at $0.15 input and $0.60 output is 70% cheaper than GPT-3.5 Turbo on input, 60% cheaper on output, and well over an order of magnitude cheaper than the GPT-4 Turbo prices that prevailed only a few months before its release.
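As a rough illustration of the arithmetic, the sketch below applies the August 2024 gpt-4o prices from the table above to a hypothetical request; the token counts are made up for the example.

```python
# Per-request cost at gpt-4o-2024-08-06 prices (USD per million tokens).
INPUT_PER_M = 2.50
OUTPUT_PER_M = 10.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of one API call at the August 2024 prices."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# A 10,000-token prompt with a 1,000-token reply costs $0.025 + $0.010 = $0.035.
print(f"${request_cost(10_000, 1_000):.3f}")
```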
The gpt-4o-2024-08-06 snapshot introduced Structured Outputs, a feature that constrains model generation to a developer-supplied JSON Schema. With Structured Outputs enabled, the API guarantees that responses match the schema. OpenAI reported that the new model achieved 100% schema conformance on a complex internal evaluation set, compared to under 40% for the original GPT-4 (gpt-4-0613) under similar conditions. Structured Outputs work both for direct response formats and for function calling tools, and are also supported by GPT-4o mini.
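A minimal sketch of a Structured Outputs request against the August 2024 snapshot follows, assuming the OpenAI Python SDK (v1.x); the calendar_event schema and prompt are illustrative, not taken from OpenAI's documentation.

```python
# Constrain the model's reply to a developer-supplied JSON Schema.
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "calendar_event",
    "strict": True,  # enables guaranteed schema conformance
    "schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "date": {"type": "string"},
            "attendees": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "date", "attendees"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Alice and Bob meet for standup on Friday."}],
    response_format={"type": "json_schema", "json_schema": schema},
)

# The returned string is guaranteed to parse against the schema above.
print(response.choices[0].message.content)
```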
The Realtime API, released in beta on October 1, 2024, lets developers build their own speech-to-speech applications using the same neural pipeline as ChatGPT Advanced Voice Mode. It uses a persistent WebSocket connection, supports function calling, and was priced at launch at $100.00 per million audio input tokens (about $0.06 per minute of input) and $200.00 per million audio output tokens (about $0.24 per minute of output). Text tokens in Realtime conversations were billed separately, at $5.00 per million input and $20.00 per million output.
| Modality | Input | Output | Notes |
|---|---|---|---|
| Text | Yes | Yes | 128K context, English and 50+ other languages |
| Image | Yes | Yes (via gpt-image-1, 2025) | Vision from launch; native generation added March 2025 |
| Audio | Yes | Yes | Native speech in/out via Advanced Voice and Realtime API |
| Video | Frames as images | No | Live video understood through frame sequences |
| Function calls | Yes | Yes | Tool use with strict schema since August 2024 |
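As the table notes, video is consumed as sampled frames rather than as a native stream. The sketch below illustrates one common pattern, assuming the OpenAI Python SDK (v1.x) and opencv-python; the file name, sampling rate, and frame cap are illustrative.

```python
# Sample one frame per second from a clip and send the frames to gpt-4o.
import base64
import cv2
from openai import OpenAI

client = OpenAI()

frames = []
video = cv2.VideoCapture("clip.mp4")
fps = int(video.get(cv2.CAP_PROP_FPS)) or 30
index = 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if index % fps == 0:  # keep roughly one frame per second
        ok, buf = cv2.imencode(".jpg", frame)
        frames.append(base64.b64encode(buf).decode("utf-8"))
    index += 1
video.release()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": "Describe what happens in this clip."}]
        + [{"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{f}", "detail": "low"}}
           for f in frames[:10]],  # cap the frame count to control token usage
    }],
)
print(response.choices[0].message.content)
```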
GPT-4o mini is a smaller, distilled member of the GPT-4o family, released on July 18, 2024. It became OpenAI's recommended low-cost API model and replaced GPT-3.5 Turbo as the default fallback in ChatGPT for free users who exceeded their GPT-4o quota.
Key facts:

- Priced at $0.15 per million input tokens and $0.60 per million output tokens, with cached input at $0.075 per million.
- Maximum output of 16,384 tokens.
- Accepts text and vision input through the same API interface as GPT-4o.
- Replaced GPT-3.5 Turbo both as the recommended low-cost API model and as the ChatGPT free-tier fallback.
GPT-4o mini was widely adopted for high-volume back-office applications, batch document processing, retrieval-augmented generation pipelines, and customer support assistants, where its combination of multimodal capability and low price made the older GPT-3.5 Turbo class of models obsolete.
OpenAI published the GPT-4o System Card on August 8, 2024 as part of its Preparedness Framework, which scores frontier models on four risk categories: cybersecurity; chemical, biological, radiological, and nuclear (CBRN) threats; persuasion; and model autonomy. GPT-4o received an overall classification of medium risk, with three categories rated low (cybersecurity, CBRN, model autonomy) and persuasion rated medium. The medium persuasion rating was driven by text-based evaluations in which the model's written arguments marginally outperformed the human baseline at shifting opinions on political topics in isolated cases; aggregate performance remained below the human baseline, and voice-based persuasion was rated low.
If a model crosses the high threshold in any category, OpenAI's policy is to delay deployment until mitigations bring the score down. GPT-4o passed the bar for deployment but with additional audio-specific safeguards.
More than 100 external red teamers covering 45 languages and 29 countries were given access to GPT-4o snapshots between early March and late June 2024. The red team probed for harmful content generation, bias, jailbreaking, voice-based attacks, biometric inference from voice, and unauthorized voice imitation.
Because native audio output is uniquely capable of reproducing voices, OpenAI restricted GPT-4o to a small set of pre-approved voices recorded by professional voice actors and trained classifiers to refuse requests to imitate specific people. The model is also trained to refuse to produce copyrighted singing performances and to apply additional content filters to audio output. The Sky pause in May 2024 was a public application of these voice safeguards.
The System Card reports moderate improvements over GPT-4 Turbo on bias evaluations, including BBQ (Bias Benchmark for QA) and a refusal-to-stereotype test. OpenAI also describes a tendency for GPT-4o to over-refuse certain benign multimodal requests at launch, which was tuned down in later snapshots.
Reviewers and benchmark trackers were broadly positive about GPT-4o at launch, citing its speed, its price cuts, and its lead atop the Chatbot Arena leaderboard.
Criticism centered on three points: the Sky voice incident and the broader question of voice and likeness rights, the gap between the polished launch demos and the slower actual rollout of Advanced Voice Mode (which took roughly four months to reach all Plus users), and the model's continued tendency to hallucinate facts and citations despite the new training run.
GPT-5 was released by OpenAI in 2025 as a unified family combining the multimodal capabilities of the GPT-4o line with the chain-of-thought reasoning of the o1 and o3 series. By February 2026, GPT-5 was the default model for the overwhelming majority of ChatGPT conversations, with internal usage data cited by OpenAI showing that only about 0.1% of daily users still selected GPT-4o. On February 13, 2026, OpenAI retired GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini from ChatGPT, defaulting existing conversations and projects to GPT-5 Instant or GPT-5 Thinking equivalents. ChatGPT Business, Enterprise, and Edu customers retained access to GPT-4o inside Custom GPTs through April 3, 2026.
The GPT-4o snapshots remain available in the OpenAI API, where many production deployments continue to call gpt-4o and gpt-4o-mini for cost or latency reasons.
GPT-4o is widely cited as the first commercial AI model to deliver real-time, expressive, native multimodal interaction at scale. Three lasting effects of the launch are commonly noted: