GPT-5
Last reviewed
May 17, 2026
Sources
32 citations
Review status
Source-backed
Revision
v10 ยท 8,295 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 17, 2026
Sources
32 citations
Review status
Source-backed
Revision
v10 ยท 8,295 words
Add missing citations, update stale details, or suggest a clearer explanation.
GPT-5 is OpenAI's flagship large language model, first released on August 7, 2025. It represents a fundamental shift in OpenAI's model strategy: rather than maintaining separate model families for different capabilities (such as GPT-4o for speed and the o-series for reasoning), GPT-5 unifies these into a single system with built-in "thinking" capabilities and a real-time router that selects the appropriate level of reasoning for each query [1]. The model family has been updated several times since launch, with GPT-5.1 arriving on November 12, 2025, GPT-5.2 on December 11, 2025, GPT-5.3 Instant on March 3, 2026, GPT-5.4 on March 5, 2026, and GPT-5.5 on April 23, 2026 [2][3][4][24][27].
At launch, GPT-5 set new state-of-the-art results on multiple benchmarks, including 94.6% on AIME 2025 (mathematics), 74.9% on SWE-bench Verified (software engineering), and 84.2% on MMMU (multimodal understanding). It also showed a significant reduction in hallucinations, producing roughly six times fewer factual errors than its predecessor o3 when using its thinking mode [1]. Sam Altman introduced the model at launch as having "a legitimate PhD-level expert in anything" and described it as "like having a team of Ph.D.-level experts in your pocket" [25][26].
By mid-2025, OpenAI was maintaining two separate product lines: the GPT-4o family, optimized for low-latency conversational use, and the o-series (o1, o3), designed for complex reasoning tasks requiring chain-of-thought processing. This split created confusion for both developers and end users, who had to choose between models with different strengths and could not get both fast responses and deep reasoning from the same system [1].
GPT-5 was built to solve this problem. The model incorporates a unified architecture with three components: an efficient model that handles straightforward queries quickly, a deeper reasoning model (GPT-5 "thinking") for harder problems, and a real-time router that automatically decides which component to engage based on conversation type, problem complexity, tool needs, and explicit user intent. From the user's perspective, it is a single model that adapts its behavior to the difficulty of the question [1].
The development of GPT-5 took place against a backdrop of intensifying competition among frontier AI labs, with Anthropic's Claude models, Google's Gemini series, and open-source efforts like DeepSeek all making rapid progress. By the time of GPT-5's launch, the AI industry had entered a phase of rapid iteration where major model releases from competing labs were separated by weeks rather than months.
OpenAI's approach with GPT-5 also reflected a strategic bet on model unification. Rather than continuing to fragment its offerings across multiple model families with different API endpoints, pricing structures, and capability profiles, the company consolidated everything into a single product line. This simplified the developer experience and reduced the need for complex model-selection logic in production applications.
In the run-up to launch, OpenAI had been hinting at GPT-5 for more than a year. Altman had repeatedly suggested that the next flagship would close the gap between assistants and "experts," framing the company's mission around delivering AGI-grade capability inside ChatGPT [25]. The marketing language at launch leaned into this expectation, with OpenAI describing GPT-5 as "our smartest, fastest, most useful model yet, with built-in thinking" [1].
OpenAI unveiled GPT-5 in a one-hour livestream on August 7, 2025, beginning at 10:00 a.m. Pacific Time on the company's YouTube channel and X profile. Sam Altman led the presentation alongside more than a dozen OpenAI staff who demonstrated the model on tasks ranging from competitive math problems to building a working web application from a single prompt. Altman compared the experience of returning to GPT-4 after using GPT-5 to switching from a pixelated phone display back to a non-retina screen, saying earlier models felt "quite miserable" by comparison [25][26]. He also emphasized that GPT-5's ability to "instantaneously create an entire piece of computer software" would define the model's appeal, coining the phrase "software on demand" to describe the workflow [26][34].
The livestream also introduced four selectable ChatGPT personalities, named Cynic, Robot, Listener, and Nerd, which let users dial in conversational tone without crafting a system prompt. ChatGPT Pro subscribers gained the ability to connect Gmail, Google Calendar, and Google Contacts so that GPT-5 could draft replies, schedule meetings, and answer questions about personal correspondence inside a single chat [26][34].
GPT-5's most important technical innovation is its unified architecture with an intelligent routing system. The system comprises three integrated components [1][21]:
The router's decision-making process considers multiple factors: conversation type, complexity, tool requirements, and explicit user intent. OpenAI reports the router correctly identifies complexity in 94% of cases, with continuous improvement through reinforcement learning [21].
When thinking mode is engaged, GPT-5 produces 22% fewer major errors compared to standard (non-thinking) mode and dramatically improves performance on expert-level questions, from 6.3% to 24.8% accuracy [21]. The thinking mode achieves superior results using 50-80% fewer tokens than o3 across visual reasoning, agentic coding, and scientific problem-solving [1].
Developers can also override the router's decisions. The API supports explicit control over whether thinking mode is engaged, giving developers the ability to force deep reasoning for specific queries or disable it for latency-sensitive applications. The system exposes a reasoning_effort parameter (with values such as minimal, low, medium, and high) and a verbosity parameter, allowing developers to dial in cost, latency, and answer length on a per-request basis [5].
OpenAI has not publicly disclosed the underlying training compute, parameter count, or model architecture for the components of GPT-5. The company described the system in launch materials as "a system" rather than a single monolithic network, and analysts have generally treated the fast model and thinking model as distinct underlying weights coordinated by the router [1][9][21].
GPT-5 launched with the following API-level specifications:
| Specification | Value |
|---|---|
| Context window (input) | 272,000 tokens |
| Maximum output | 128,000 tokens |
| Model variants | gpt-5, gpt-5-mini, gpt-5-nano, GPT-5 Pro |
| Initial snapshot | gpt-5-2025-08-07 |
| Thinking mode | Built-in, automatic or user-controlled |
| Modalities | Text, image, audio (input and output) |
| API input pricing | $1.25 per 1M tokens |
| API output pricing | $10.00 per 1M tokens |
| Cached input pricing | $0.125 per 1M tokens (90% discount) |
The 272K-token context window represented a significant increase over GPT-4o's 128K tokens. The model also supported parallel tool use, built-in web search, and native audio processing [1][5].
OpenAI released several model sizes at launch. GPT-5 (standard) served as the main offering for complex tasks. GPT-5 mini provided a balance of capability and speed at lower cost. GPT-5 nano, priced at $0.05 per million input tokens, targeted high-volume, cost-sensitive applications like classification and extraction. GPT-5 Pro, a higher-effort reasoning configuration that uses substantially more inference compute, was made available to ChatGPT Pro subscribers and was not initially exposed in the public API at launch [5][26].
The API also exposed fixed-date snapshots so applications could pin to consistent behavior. The launch snapshot was gpt-5-2025-08-07, with the alias gpt-5 updating as OpenAI released improvements [5].
The caching system was a notable addition to the API offering. Tokens that appeared in a prompt recently submitted to the API were automatically cached, and subsequent requests reusing those cached tokens were charged at a 90% discount ($0.125 per million tokens instead of $1.25). This dramatically reduced costs for applications that made repeated calls with overlapping context, such as multi-turn conversations or iterative code generation workflows [5].
| Variant | API input price | API output price | Positioning |
|---|---|---|---|
| GPT-5 | $1.25 / 1M tokens | $10.00 / 1M tokens | Default flagship for complex tasks |
| GPT-5 mini | $0.25 / 1M tokens | $2.00 / 1M tokens | Balanced cost/quality |
| GPT-5 nano | $0.05 / 1M tokens | $0.40 / 1M tokens | High-volume, low-latency |
| GPT-5 Pro | Not in public API at launch | Not in public API at launch | ChatGPT Pro subscribers; deeper extended reasoning |
At launch the standard gpt-5 API endpoint defaulted to medium reasoning effort, with developers able to escalate to higher effort levels for harder problems or downshift to minimal to keep responses snappy and cheap [5].
GPT-5 set new state-of-the-art results across several categories at launch:
| Benchmark | Category | GPT-5 (Thinking) | o3 | GPT-4o |
|---|---|---|---|---|
| AIME 2025 | Mathematics | 94.6% | 79.2% | 26.7% |
| SWE-bench Verified | Software engineering | 74.9% | 69.1% | 38.0% |
| MMMU | Multimodal understanding | 84.2% | 74.9% | 69.1% |
| GPQA Diamond | Graduate-level science | 81.6% | 79.7% | 53.6% |
| Aider Polyglot | Coding (multi-language) | 88.0% | - | 45.3% |
With the Pro variant and Python tools enabled, GPT-5 scored a perfect 100% on AIME 2025. Even without tools, the thinking variant reached 99.6%. With extended reasoning, GPT-5 Pro also set a state-of-the-art on GPQA Diamond at 88.4% without tools, well above any score posted by an OpenAI predecessor at the time of release [1][26].
The AIME (American Invitational Mathematics Examination) results were particularly noteworthy because these are competition-level math problems designed for top high school students. A score of 94.6% without tools placed GPT-5 well above the performance of the vast majority of human test-takers.
Independent benchmarking sites such as Vellum, llm-stats.com, and the LM Council leaderboard published their own evaluations in the days after launch. They generally corroborated OpenAI's claim that GPT-5 thinking matched or beat o3 on most reasoning workloads while using a fraction of the output tokens, but several reviewers cautioned that the gap to Claude and Gemini on agentic coding was smaller than OpenAI's own charts suggested [9][13].
On agentic coding evaluations, GPT-5 reported strong but not category-leading results at launch. OpenAI's own benchmark page showed GPT-5 thinking at 88.0% on Aider Polyglot, a clear improvement over o3, alongside 74.9% on SWE-bench Verified using OpenAI's published harness. On the Tau-bench retail and airline customer-service environments (often referred to as Tau-bench) GPT-5 also led the o-series for tool use, though the gap to Anthropic's Claude family on the same suite was narrower than the SWE-bench gap [1][13].
For developers, the model exposed structured outputs (JSON Schema), function calling, parallel tool calls, file search, web search, image input, and audio input/output through the standard chat completions and Responses APIs. OpenAI also rolled out a custom tools interface that lets developers describe tools in plain text rather than rigid JSON, which was pitched as a better fit for the way GPT-5 reasons about tool selection inside long sessions [5].
GPT-5 launched as a natively multimodal model, capable of processing text, images, and audio as both inputs and outputs. This represented a continuation of the multimodal approach introduced with GPT-4o but with significantly enhanced capabilities.[1]
On the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, which tests a model's ability to reason about images, diagrams, and charts across academic disciplines, GPT-5 Thinking scored 84.2%, compared to GPT-4o's 69.1%. The model demonstrated particular strength on tasks requiring joint reasoning across text and visual inputs, such as interpreting scientific diagrams, analyzing financial charts, and solving geometry problems presented as images.[1]
GPT-5.2 further expanded multimodal performance, with high scores on MMMU-Pro (86.5%) and Video-MMMU (90.5%). The Video-MMMU results suggested a powerful, natively multimodal architecture capable of reasoning across temporal and spatial dimensions simultaneously, enabling the model to understand and reason about video content in addition to static images.[2]
The native audio capabilities allowed GPT-5 to process spoken input directly and generate spoken responses, enabling real-time voice conversations without the intermediate step of speech-to-text transcription. This was particularly relevant for ChatGPT's voice mode and for applications in customer service, accessibility, and language learning.
One of the most significant improvements in GPT-5 was a substantial reduction in hallucinations. According to OpenAI's internal evaluations, GPT-5 (thinking) produced roughly five to six times fewer factual errors than o3 across three factual accuracy benchmarks when browsing was enabled. With web search active, GPT-5 responses were approximately 45% less likely to contain a factual error compared to GPT-4o [1].
This improvement addressed one of the most persistent criticisms of large language models: their tendency to generate plausible-sounding but factually incorrect information. For enterprise and professional use cases where factual reliability is critical, the hallucination reduction was arguably more important than any single benchmark improvement.
Deception rates, measured in scenarios with impossible coding tasks or missing multimodal inputs, dropped from 4.8% for o3 to 2.1% for GPT-5 with reasoning enabled. Sycophantic responses declined from approximately 14.5% to under 6% in OpenAI's internal evaluations, an effort the company explicitly credited to feedback after the unpopular sycophantic GPT-4o update of April 2025 [29][30].
GPT-5 (thinking) matched or exceeded o3's performance across most benchmarks while using 50-80% fewer output tokens. This efficiency translated directly into lower costs and faster response times for developers, making the thinking capabilities practical for production workloads rather than being limited to specialized research scenarios [1].
The token efficiency improvement also had implications for user experience. Shorter reasoning chains meant faster responses, which made the model feel more responsive in interactive settings like ChatGPT conversations, even when engaging in complex reasoning.
OpenAI published the GPT-5 system card on August 7, 2025, alongside the launch. It describes the model's evaluations, training data choices, deployment safeguards, and remaining limitations. The system card was unusually detailed for a frontier release, running to dozens of pages and covering jailbreaks, prompt injection, deception, biological and chemical risk, cybersecurity, and persuasion [28].
Under OpenAI's Preparedness Framework, the company classified GPT-5 thinking as High capability in the Biological and Chemical risk domain. OpenAI stated that it did not have definitive evidence the model could meaningfully help a novice cause severe biological harm (its threshold for High), but said it adopted a precautionary stance because evaluations could not rule out marginal uplift. The classification triggered the activation of associated safeguards under the framework, including additional refusal training, monitoring of API traffic for misuse, and external red-team testing [28].
A central new safety design choice was safe-completions. Rather than a binary classification of user intent ("safe" vs. "unsafe"), safe-completions train the model to maximize helpfulness subject to safety constraints, often producing partial answers, high-level guidance, or explicit refusals with safer alternatives instead of stonewalling. OpenAI reported that this method recovered substantial helpfulness in dual-use scenarios while reducing genuinely harmful outputs [28][30].
The system card also documented the red-teaming campaign behind GPT-5: more than 5,000 hours of work from over 400 external testers and experts focused on violent attack planning, jailbreaks, prompt injection, bioweaponization, child-safety risks, and adversarial multimodal inputs. The classification under the Preparedness Framework also has implications for OpenAI's internal Responsible Scaling commitments, sometimes discussed in the broader AI safety community as analogous to Anthropic's ASL levels, since both frameworks gate deployment on capability evaluations rather than only on alignment evaluations [28].
The initial reception of GPT-5 was sharply mixed. OpenAI reported that API traffic doubled within 24 hours and Microsoft began rolling the model into its Azure and Copilot stacks the same day [1][23]. Early users praised the coding performance, the cost reduction relative to GPT-4o, and the improvements in factuality. Sam Altman's claim that GPT-5 was "like talking to a legitimate PhD-level expert" became one of the most quoted lines from the launch event [25][26].
Within hours, however, the rollout began drawing serious complaints. The most contentious change was that GPT-5 replaced GPT-4o, GPT-4, GPT-4.1, o3, o4-mini, GPT-4.5, and several other models in ChatGPT, removing them from the model picker for many users. Subscribers who had built workflows or even emotional habits around GPT-4o reacted strongly. Some longtime users described it as "the biggest bait-and-switch in AI history" on Reddit and X, and a sizeable subset of Plus subscribers said the new default felt colder and less personable than 4o [11][12].
The second source of complaints was the router. Because the system is presented as a single model, users could not always tell whether their question had been routed to the fast model or the thinking model. On August 8, 2025, the day after launch, Altman acknowledged on X that "the autoswitcher broke and was out of commission for a chunk of the day, and the result was GPT-5 seemed way dumber." The bug routed many queries that should have gone to the thinking model to the fast model instead, dragging benchmark-style behavior on hard problems down [12][14].
Within a week, OpenAI shipped a series of fixes. It restored GPT-4o for paid users in the model picker, doubled rate limits for ChatGPT Plus on GPT-5 thinking from 200 to 3,000 weekly messages, added a clearer indicator showing which underlying model was answering, and introduced "Auto," "Fast," and "Thinking" sub-options so users could override the router. OpenAI also retained legacy access to o3 for paying subscribers under a "show legacy models" toggle [11][12][14].
Independent reviewers were similarly split. Vellum's launch-week analysis described GPT-5 as "clearly state-of-the-art on math and STEM reasoning" but "not the leap people were primed for on coding," placing Claude Opus 4.x ahead on SWE-bench Pro while GPT-5 led AIME and FrontierMath. METR, which evaluates how long autonomous tasks frontier models can complete, reported a noticeable bump in long-horizon task completion compared to o3, with GPT-5 thinking reliably succeeding on tasks taking expert humans up to roughly two hours, but warned that error rates rose sharply beyond that horizon [9][13].
Several journalists also noted basic factual errors in early ChatGPT outputs. Quartz reported that some users got responses claiming Joe Biden was still U.S. president or misspelling "Oregon" as "Onegon," which contradicted Altman's PhD-level marketing. OpenAI argued these were largely cases where the router had wrongly selected the fast model, and shipped further router updates over the next month [11][14].
On the LMArena human-preference leaderboard, GPT-5 was added to the Text, WebDev, and Vision boards within hours of release and entered the top three on each within its first week, though Anthropic and Google updates kept it from holding the outright top spot through the rest of August 2025 [13][14].
OpenAI also drew criticism over a benchmark chart shown during the launch presentation. Several bar graphs comparing GPT-5 to o3 and GPT-4o used inconsistent scaling: in one slide, a bar representing 52.8% accuracy was drawn nearly twice as tall as a bar representing 69.1%, while the 30.8% and 69.1% bars appeared roughly the same height. Altman acknowledged the error on X the following day, calling it a "mega chart screwup," and OpenAI quietly corrected the figures in the published blog post. The Washington Post described the episode as a "chart crime" and tied it to a broader pattern of selective benchmark presentation among frontier labs that month [9][35][36].
On August 18, 2025, eleven days after the launch, Altman publicly conceded that OpenAI had "totally screwed up" the rollout in remarks during a dinner with reporters in San Francisco. He attributed the missteps to the speed of the launch, the underestimated emotional attachment users had to GPT-4o, and the autoswitcher bug, and said the company would invest "trillions of dollars" in data center capacity to support the new model and its successors [37].
OpenAI made GPT-5 the default model in ChatGPT for all users at launch, including the free tier (a first for an OpenAI flagship), with progressive rollout to Enterprise and Edu the following week [1].
Usage limits reflected the tier:
| Tier | GPT-5 access at launch |
|---|---|
| Free | GPT-5 with router; mini fallback when limit reached; ~10 messages per 5 hours |
| Plus ($20/month) | Higher message caps; 3,000 GPT-5 thinking messages per week after Aug 11 update |
| Pro ($200/month) | Unlimited GPT-5 and access to GPT-5 Pro |
| Team / Business | Same as Plus, with admin controls |
| Enterprise / Edu | Phased rollout starting mid-August 2025 |
On the free tier, ChatGPT exposed a streamlined version of the router. Free users got access to the standard GPT-5 endpoint by default and were silently downgraded to GPT-5 mini after hitting their five-hour message cap, rather than being blocked outright. Plus subscribers could explicitly select "GPT-5 Thinking" from the model picker, and Pro subscribers could select "GPT-5 Pro" for the highest-effort reasoning configuration [1][14].
OpenAI publicized a roster of early enterprise customers alongside the GPT-5 launch, framing the model as production-ready for regulated industries. The named partners included:
| Organization | Sector | Use case at launch |
|---|---|---|
| BNY Mellon | Financial services | Internal AI assistant for employees building on prior OpenAI partnership for early model access |
| Lowe's | Retail | Associate-facing assistants for store operations and inventory planning |
| Morgan Stanley | Financial services | Research and client-advisor tooling building on the firm's earlier GPT-4 deployments |
| Figma | Design software | Codex-style code generation and design-to-code workflows |
| Intercom | Customer support software | Customer-facing AI agent product Fin |
| SoftBank | Conglomerate | Internal productivity rollout across portfolio companies |
| T-Mobile | Telecom | Customer-care agents and call-center summarization |
| California State University | Higher education | Campus-wide ChatGPT Edu rollout for students and faculty |
Microsoft made GPT-5 available through Microsoft 365 Copilot, GitHub Copilot, and Azure AI Foundry the same day as the OpenAI launch, with admin controls that let enterprises pilot GPT-5 in specific tenants before broader rollout [1][23][34].
GPT-5's API pricing was structured to nudge developers off GPT-4o and the o-series. At $1.25/$10.00 per million input/output tokens for the standard model, $0.25/$2.00 for mini, and $0.05/$0.40 for nano, the family undercut GPT-4o's $2.50/$10.00 by half on input cost while delivering substantially better evaluations. Cached input was charged at a 90% discount ($0.125 per million tokens for the flagship), and Batch API calls were billed at 50% off, which made high-volume retrieval and bulk classification dramatically cheaper than under the GPT-4o regime [5].
OpenAI also published a comparison showing that for tasks where developers had previously chained a planner on o3 with an executor on GPT-4o, the unified GPT-5 thinking endpoint typically reduced total token spend by half or more thanks to the 50-80% reduction in reasoning tokens versus o3 [1][5].
OpenAI released GPT-5.1 on November 12, 2025, three months after the initial GPT-5 launch. The release was marketed as "a smarter, more conversational ChatGPT" rather than a pure capability bump and rolled out first to paid Pro, Plus, Go, and Business users before reaching free and logged-out users a few days later [27].
GPT-5.1 split the family into two everyday variants:
OpenAI published a system card addendum describing the safety evaluations specific to GPT-5.1 and confirmed that GPT-5 Instant and GPT-5 Thinking would remain available in ChatGPT under a "legacy models" dropdown for paid subscribers for three months after launch. The November release was widely read as OpenAI's response to user complaints that GPT-5 felt overly clinical compared to GPT-4o and as a pre-emptive move ahead of Google's Gemini 3 line [27][32].
OpenAI released GPT-5.2 on December 11, 2025, roughly four months after the initial GPT-5 launch and one month after GPT-5.1. The update introduced a three-tier product structure: Instant (for fast, everyday queries), Thinking (for complex reasoning), and Pro (for maximum performance on the hardest problems) [2].
| Feature | GPT-5 | GPT-5.2 |
|---|---|---|
| Context window | 272K | 400K |
| AIME 2025 (no tools) | 94.6% | 100% |
| SWE-bench Verified | 74.9% | 80.0% |
| ARC-AGI-2 (abstract reasoning) | - | 52.9% |
| GDPval (professional work) | 38.8% | 70.9% |
| FrontierMath | - | 40.3% |
| API input pricing | $1.25/1M | $1.75/1M |
| API output pricing | $10.00/1M | $14.00/1M |
GPT-5.2 expanded the context window to 400,000 tokens across all paid tiers. On GDPval, a benchmark measuring performance on knowledge work tasks across 44 occupations, GPT-5.2 Thinking became the first model to perform at or above human expert level, beating or tying top industry professionals on 70.9% of comparisons [2].
GPT-5.2 Thinking also produced 38% fewer errors than the previous GPT-5.1 update, with the response error rate dropping from 8.8% to 6.2% [2].
The ARC-AGI-2 result was notable because this benchmark tests abstract reasoning ability, a capability widely considered to be a fundamental limitation of current AI systems. GPT-5.2's score of 52.9%, compared to GPT-5.1's 17.6%, represented a 35-point improvement and suggested significant progress on a capability that has historically been resistant to scaling [2].
GPT-5.2's performance on FrontierMath (40.3% on Tiers 1-3) was a significant milestone. FrontierMath, developed by EpochAI, consists of research-level mathematics problems that require graduate-level or beyond mathematical reasoning. Prior to o3, no model had exceeded 2% on this benchmark. o3 reached 25.2%, and GPT-5.2's 40.3% represented a further 60% relative improvement. The result demonstrated that mathematical reasoning capabilities were continuing to scale rapidly with each new model generation.[2]
Alongside the main release, OpenAI introduced GPT-5.2-Codex, a variant specifically optimized for agentic coding tasks in the Codex environment. This version featured improvements in context compaction (allowing it to work with large codebases more efficiently) and stronger performance on large-scale code changes such as refactors and migrations [6].
GPT-5.2-Codex was designed for long-horizon coding workflows where an agent needs to understand a full codebase, plan a multi-file change, and execute it with minimal human intervention. The context compaction feature allowed the model to work within its context window more efficiently by summarizing less-relevant portions of the codebase while maintaining full detail on actively edited files.
GPT-5.2 arrived during an intense period of competition. Google had released Gemini 3 Pro on November 18, 2025, and Anthropic launched Claude Opus 4.5 on November 24, 2025. GPT-5.2 was widely seen as OpenAI's response to these competitive releases. While Claude Opus 4.5 held the edge on SWE-bench Verified at 80.9%, GPT-5.2 achieved state-of-the-art on SWE-bench Pro at 55.6% and led in abstract reasoning with 52.9% on ARC-AGI-2 [2][7].
On March 3, 2026, OpenAI released GPT-5.3 Instant, which replaced GPT-5.2 Instant as the default model for all ChatGPT users, including those on the free tier [3].
GPT-5.3 Instant focused on conversational quality rather than raw benchmark performance. Key improvements included:
GPT-5.3 Instant was also made available to developers in the API as gpt-5.3-chat-latest. The conversational improvements were particularly relevant for ChatGPT's consumer user base, where natural-sounding dialogue matters more than performance on academic benchmarks [3].
A separate release, GPT-5.3-Codex, provided long-term support for GitHub Copilot integrations, optimizing the model specifically for inline code suggestions and repository-level understanding [10].
GPT-5.4 was announced on March 5, 2026, and represents the most capable model in the GPT-5 family until the release of GPT-5.5. It combines frontier reasoning, coding capabilities inherited from GPT-5.3-Codex, and a new native computer-use capability in a single model [4].
| Specification | GPT-5.4 |
|---|---|
| Context window | 1.05M tokens |
| Maximum output | 128,000 tokens |
| API input pricing (standard) | $2.50 per 1M tokens |
| API output pricing | $15.00 per 1M tokens |
| Extended context input (>272K) | $5.00 per 1M tokens |
| Cached input | $1.25 per 1M tokens |
| Computer use | Native, state-of-the-art |
| Variants | GPT-5.4, GPT-5.4 Pro, GPT-5.4 Mini, GPT-5.4 Nano |
GPT-5.4 is the first general-purpose model from OpenAI with native, state-of-the-art computer-use capabilities. This means the model can directly operate desktop applications, navigate web interfaces, and carry out complex multi-step workflows across different programs. The model works through a screenshot-action loop: it receives a screenshot of the current screen, analyzes the visual content, and returns structured actions (clicks, typing, scrolling) that an agent framework can execute. The cycle then repeats with the next screenshot.[4][22]
On OSWorld-Verified, a benchmark for desktop automation tasks, GPT-5.4 scored 75.0%, surpassing the human baseline of 72.4% and dramatically improving over GPT-5.2's 47.3%. This made GPT-5.4 the first AI model to operate a computer better than human experts on this benchmark [4][22].
On BrowseComp, a benchmark for agentic web browsing, GPT-5.4 reached 82.7% (up from 65.8% for GPT-5.2), while GPT-5.4 Pro scored 89.3% [4].
The computer-use capability enables a new class of AI agent applications. Rather than operating through APIs or structured tool calls, GPT-5.4 can interact with software the same way a human would: by reading screen content, moving a cursor, clicking buttons, and typing text. This makes it possible to automate tasks in applications that lack APIs or programmatic interfaces.
The context window expanded to 1.05 million tokens, making GPT-5.4 the first OpenAI model with over one million tokens of context in a standard API offering. This allows agents to plan, execute, and verify tasks across long horizons, processing entire codebases or extensive document collections in a single session. Requests exceeding 272K tokens are priced at 2x for input and 1.5x for output [4].
The million-token context is particularly valuable for coding agents that need to reason about entire repositories, legal professionals reviewing large document sets, and research applications that involve synthesizing information from many sources simultaneously.
| Benchmark | GPT-5.2 | GPT-5.4 | Change |
|---|---|---|---|
| GDPval | 70.9% | 83.0% | +12.1 |
| OSWorld-Verified | 47.3% | 75.0% | +27.7 |
| BrowseComp | 65.8% | 82.7% | +16.9 |
| Investment banking modeling | 68.4% | 87.3% | +18.9 |
| Factual accuracy | Baseline | 33% fewer false claims | - |
| Token efficiency | Baseline | ~50% improvement | - |
GPT-5.4 also improved token efficiency by roughly 50% on complex tasks and reduced false claims by 33% compared to GPT-5.2 [4].
Alongside GPT-5.4, OpenAI released smaller variants:
| Variant | Speed | Context | API Input Price | API Output Price | Use Case |
|---|---|---|---|---|---|
| GPT-5.4 | Standard | 1.05M | $2.50/1M | $15.00/1M | Complex reasoning, professional tasks |
| GPT-5.4 Pro | Slower | 1.05M | $30.00/1M | $180.00/1M | Maximum performance on hardest problems |
| GPT-5.4 Mini | ~180 tok/s | 400K | $0.75/1M | $4.50/1M | High-volume workloads |
| GPT-5.4 Nano | ~200 tok/s | 400K | $0.20/1M | $1.25/1M | Cost-sensitive applications |
GPT-5.4 Mini and Nano were released on March 17, 2026, bringing many of GPT-5.4's strengths to faster, more efficient models designed for high-volume production use. GPT-5.4 Mini operates at roughly 180-190 tokens per second, while GPT-5.4 Nano reaches approximately 200 tokens per second, more than 2x faster than the original GPT-5 Mini [8].
OpenAI released GPT-5.5 on April 23, 2026, just six weeks after GPT-5.4. The release was framed as "a new class of intelligence for real work," with sharper performance on writing and debugging code, web research, data analysis, document and spreadsheet creation, and operating software autonomously [24].
GPT-5.5 was the first OpenAI model to ship with a 1 million-token context window in both ChatGPT (for selected paid tiers) and the public API, and it became the new default frontier model in ChatGPT and Codex. Pricing in the API was set at $5 per million input tokens and $30 per million output tokens for the standard model. Alongside it, OpenAI released GPT-5.5 Pro, a higher-effort reasoning configuration aimed at the hardest professional workloads, priced at $30 per million input tokens and $180 per million output tokens [24][31].
In ChatGPT, GPT-5.5 rolled out as the default for Plus, Pro, Business, and Enterprise tiers, while the free tier continued on GPT-5.3 Instant with automatic upgrades to GPT-5.5 Thinking on harder questions. In Codex, GPT-5.5 was made available with a 400K-token context window for Plus, Pro, Business, Enterprise, Edu, and Go users. The release came just six weeks after GPT-5.4, illustrating how frontier model launches have begun to resemble incremental software updates rather than blockbuster events [24].
On benchmarks, OpenAI reported GPT-5.5 reached 88.7% on SWE-bench Verified and 58.6% on SWE-bench Pro, solving more tasks end-to-end in a single pass than any earlier model. GPT-5.5 also scored 98.0% on Tau2-bench Telecom, a complex customer-service workflow benchmark, without prompt tuning. The model was paired with an updated GPT-5.5 system card and a separate GPT-5.5 Instant system card describing its safety evaluations [31][33].
On May 5, 2026, OpenAI quietly promoted GPT-5.5 Instant to the new default ChatGPT model for free, Plus, and Pro users, replacing GPT-5.3 Instant in everyday conversations [24].
GPT-5's unified architecture was designed partly to simplify the developer experience by eliminating the need to choose between GPT-4o and the o-series for different tasks. When GPT-5 launched, OpenAI positioned it as the default model for both conversational and reasoning workloads, encouraging developers to migrate from both GPT-4o and the o-series.[1][5]
The migration pattern varied by use case. Developers whose applications primarily needed conversational AI or content generation found GPT-5 to be a straightforward replacement for GPT-4o, with better performance and comparable pricing. Developers who had been using o1 or o3 for specialized reasoning tasks had a more nuanced decision, as GPT-5's thinking mode covered most reasoning use cases but did not always match o3's depth on the hardest problems.[1][12]
By late 2025, OpenAI's update cadence had accelerated significantly. The rapid succession of GPT-5, GPT-5.1, and GPT-5.2 within five months required developers to continuously adapt their applications. OpenAI addressed this partly through its tiered model structure, with Instant models providing stability for everyday use while Thinking and Pro variants pushed the capability frontier. The introduction of GPT-5.4 in March 2026 further consolidated the lineup, with OpenAI framing it as the default model for both "broad general-purpose work and most coding tasks," replacing both gpt-5.2 in the API and gpt-5.3-codex in Codex.[4][23]
GPT-5 saw rapid enterprise adoption through Microsoft's integration into its product ecosystem. Microsoft, which integrates OpenAI models across Copilot Studio, Microsoft 365 Copilot, and Azure services, began making GPT-5 available to enterprise customers starting in August 2025. Enterprises could phase in GPT-5 alongside existing models, starting with high-value workflows like code reviews, RFP automation, and analytics.[23]
Azure's native hooks allowed minimal disruption to existing governance, security, and compliance protocols during migration. By December 2025, GPT-5.2 was introduced into Microsoft Foundry as a new standard for enterprise AI, with optimized configurations for large-scale deployment.[23]
The hallucination reduction in GPT-5 was cited as a key factor in enterprise adoption. Several companies reported integrating GPT-5 into customer-facing applications where factual reliability had previously been a barrier. The built-in thinking mode meant that enterprises no longer needed to maintain separate integrations with the o-series for tasks requiring reasoning, simplifying their AI infrastructure.[1]
The rapid iteration of the GPT-5 family produced measurable improvements across each version on key benchmarks.
| Benchmark | GPT-5 (Aug 2025) | GPT-5.2 (Dec 2025) | GPT-5.4 (Mar 2026) | GPT-5.5 (Apr 2026) |
|---|---|---|---|---|
| AIME 2025 (no tools) | 94.6% | 100% | 100% | 100% |
| SWE-bench Verified | 74.9% | 80.0% | - | 88.7% |
| SWE-bench Pro | - | 55.6% | - | 58.6% |
| GPQA Diamond | 81.6% | 93.2% (Pro) | - | - |
| GDPval | 38.8% | 70.9% | 83.0% | - |
| OSWorld-Verified | - | 47.3% | 75.0% | - |
| BrowseComp | - | 65.8% | 82.7% | - |
| ARC-AGI-2 | - | 52.9% | - | - |
| FrontierMath (Tiers 1-3) | - | 40.3% | - | - |
| Tau2-bench Telecom | - | - | - | 98.0% |
| Context window | 272K | 400K | 1.05M | 1.05M |
The GDPval benchmark, which measures performance on professional knowledge work across 44 occupations, showed particularly dramatic improvement, nearly doubling from 38.8% with GPT-5 to 70.9% with GPT-5.2, and reaching 83.0% with GPT-5.4. This trend suggested that the GPT-5 family was becoming increasingly useful for real-world professional tasks beyond the academic benchmarks that had traditionally been used to evaluate language models.[2][4]
The perfect 100% score on AIME 2025 achieved by GPT-5.2 Thinking (without tools) was a watershed moment for mathematical reasoning. The AIME is a competition designed for the top 5% of US high school mathematics students, and a perfect score without tool assistance demonstrated that GPT-5.2 had reached a level of mathematical competence that matched or exceeded top human performers on this particular exam.[2]
| Date | Release | Key Feature |
|---|---|---|
| August 7, 2025 | GPT-5 | Unified model, 272K context, built-in thinking, gpt-5-2025-08-07 snapshot |
| August 8-15, 2025 | Router fixes | GPT-4o restored for paid users, rate limits raised, Auto/Fast/Thinking sub-options added |
| November 12, 2025 | GPT-5.1 (Instant + Thinking) | Warmer tone, adaptive reasoning in Instant tier |
| December 11, 2025 | GPT-5.2 | 400K context, Instant/Thinking/Pro tiers |
| December 18, 2025 | GPT-5.2-Codex | Optimized for agentic coding |
| February 5, 2026 | GPT-5.3-Codex | Combined Codex + GPT-5 stacks |
| March 3, 2026 | GPT-5.3 Instant | Improved conversational tone, 400K context for all |
| March 5, 2026 | GPT-5.4 | 1.05M context, native computer use |
| March 17, 2026 | GPT-5.4 Mini and Nano | Smaller, faster variants of GPT-5.4 |
| April 23, 2026 | GPT-5.5 + GPT-5.5 Pro | 1M context in API and ChatGPT, new default |
| May 5, 2026 | GPT-5.5 Instant default | Replaces GPT-5.3 Instant for free, Plus, and Pro tiers |
The GPT-5 series has maintained competitive pricing relative to its predecessors, particularly given the performance improvements:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Release |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | May 2024 |
| GPT-5 nano | $0.05 | $0.40 | August 2025 |
| GPT-5 mini | $0.25 | $2.00 | August 2025 |
| GPT-5 | $1.25 | $10.00 | August 2025 |
| GPT-5.2 | $1.75 | $14.00 | December 2025 |
| GPT-5.4 | $2.50 | $15.00 | March 2026 |
| GPT-5.4 Mini | $0.75 | $4.50 | March 2026 |
| GPT-5.4 Nano | $0.20 | $1.25 | March 2026 |
| GPT-5.5 | $5.00 | $30.00 | April 2026 |
| GPT-5.5 Pro | $30.00 | $180.00 | April 2026 |
Notably, GPT-5 launched at a lower price per input token than GPT-4o despite substantially better performance, reflecting the efficiency improvements in the underlying architecture. The pricing trend across the GPT-5 series shows a gradual increase for the flagship model (from $1.25 to $5.00 per million input tokens for the standard tier and up to $30 for Pro) alongside the introduction of increasingly affordable smaller variants [5][24].
The GPT-5 series exists in a highly competitive environment. As of mid-2026, the frontier model landscape includes several strong alternatives:
| Provider | Model | Notable Strength |
|---|---|---|
| OpenAI | GPT-5.5 | 1M-token context in API and ChatGPT, agentic coding, computer use |
| Anthropic | Claude Opus 4.5 | Coding (80.9% SWE-bench Verified) |
| Gemini 3 Pro | Reasoning (1501 LMArena Elo), 1M context | |
| DeepSeek | DeepSeek-V3.2 | Cost efficiency (10-30x cheaper) |
| xAI | Grok 4 | Real-time information integration |
The late-2025 and mid-2026 period saw a significant compression of performance gaps between leading models, with each provider developing distinct specializations. Organizations increasingly deploy multiple models, routing queries to the most suitable model for each task type, rather than standardizing on a single provider [7].
By the May 2026 update of the Artificial Analysis intelligence index, GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro sat within roughly three points of each other in headline reasoning scores, with each carving out a distinct niche. GPT-5.5 led agentic benchmarks such as GDPval (84.9%) and OSWorld (78.7%) and was generally chosen for autonomous tool-using workflows. Claude Opus 4.7 led SWE-bench Pro at 64.3% versus GPT-5.5's 58.6% and was favored for production coding because of its tighter handling of destructive actions. Gemini 3.1 Pro retained the largest practical context window, the lowest price among the frontier set, and a slight edge on pure reasoning tasks. Reviewers commonly recommended Claude Opus 4.7 as the default daily driver, GPT-5.5 for autonomous agents, and Gemini 3.1 Pro for high-volume long-context jobs [24][38].
Google's Gemini 3 Pro achieved an unprecedented 91.9% on GPQA Diamond, surpassing human expert performance (approximately 89.8%). Gemini 3 Pro's Deep Think mode also pushed Humanity's Last Exam to 41%, the highest published score on that benchmark. Anthropic's Claude Opus 4.5 held the SWE-bench Verified lead at 80.9%. DeepSeek-V3.2 offered frontier-class performance at a fraction of the cost, providing a strong option for cost-sensitive applications. GPT-5.4 and GPT-5.5's distinctive advantages as of mid-2026 are their native computer-use capabilities and their 1.05M-token context window [7][24].
The competitive dynamics of this period also drove pricing pressure across the industry. With DeepSeek demonstrating that high-quality models could be offered at dramatically lower prices, all major providers were forced to offer more competitive pricing or justify their premium through unique capabilities.
In the weeks after the August 2025 GPT-5 launch, the comparison most frequently cited by reviewers stacked GPT-5 against Anthropic's Claude Opus 4.1 and Google's Gemini 2.5 Pro Deep Think. Vellum and the LM Council leaderboard reached broadly similar conclusions: GPT-5 led on AIME, FrontierMath, and most STEM reasoning benchmarks; Claude Opus 4.1 led on SWE-bench Verified and on long-context coding tasks; and Gemini 2.5 Pro Deep Think led on Humanity's Last Exam at the time. On user-preference benchmarks like LMArena, GPT-5 thinking entered the top three within a week of launch, but rarely held the outright #1 position once Claude and Gemini updates landed later in 2025 [9][13].
The rapid pace of GPT-5 updates created a complex model lifecycle that developers needed to navigate. OpenAI's approach was to deprecate older GPT-5 versions relatively quickly as newer ones launched. When GPT-5.4 was released in March 2026, OpenAI framed it as the replacement for both gpt-5.2 in the standard API and gpt-5.3-codex in the Codex environment, consolidating what had briefly been separate model tracks.[4]
OpenAI also announced in 2025-2026 the deprecation of several pre-GPT-5 models, including o1, GPT-4.5, o3-mini, and GPT-4o. This wave of retirements effectively pushed the entire developer ecosystem toward the GPT-5 family and the o3/o4-mini reasoning models. Developers who had built on GPT-4o were encouraged to migrate to GPT-5, while those on o1 were directed to o3 or o4-mini.[7]
The transition was not without friction. Some developers reported that prompt behaviors changed subtly between GPT-5 versions, requiring regression testing and prompt tuning with each update. OpenAI addressed this partly by maintaining stable model snapshot endpoints (e.g., gpt-5-2025-08-07 for the original launch and gpt-5.4-2026-03-05 for GPT-5.4) that developers could pin to for production stability, while the generic gpt-5 and gpt-5.4 aliases would receive rolling updates.
For ChatGPT users, the deprecation pattern was somewhat softer. After the initial GPT-5 launch backlash, OpenAI committed to keeping prior major versions available under a "legacy models" toggle for paid subscribers for at least three months. As of March 11, 2026, GPT-5.1 models were retired from ChatGPT, with existing conversations automatically migrated to GPT-5.3 Instant, GPT-5.4 Thinking, or GPT-5.4 Pro depending on context [11][27][32].
GPT-5's launch in August 2025 was broadly well-received among developers, though not without serious criticism on the consumer side. The unification of reasoning and conversational capabilities into a single model was praised as a significant usability improvement. Developers no longer had to choose between model families or implement their own routing logic [1].
The hallucination reduction was highlighted as a particularly important advance for enterprise adoption. Several companies reported integrating GPT-5 into customer-facing applications where factual reliability was previously a barrier [1].
However, some researchers noted that OpenAI's benchmark presentations were occasionally misleading. In one instance, a benchmark graph in the launch materials was found to contain errors, drawing public criticism. Others pointed out that while GPT-5 was a clear improvement over OpenAI's previous models, the gap between it and competitors like Claude and Gemini was smaller than in earlier generations [9].
The rapid cadence of updates, from GPT-5 in August to GPT-5.5 in April, roughly nine months later, also raised questions about versioning clarity and the challenge facing developers who need to maintain stable production systems while keeping up with frequent model changes. OpenAI addressed this partly through its tiered model structure, with Instant models providing stability for everyday use while Thinking and Pro variants pushed the capability frontier.
The GPT-5 series also marked a turning point in how AI models are consumed. The built-in thinking mode and automatic routing represented a move away from exposing raw model capabilities to users and toward providing an integrated, managed AI experience. This trend toward "model as service" rather than "model as tool" has implications for how developers build applications and how end users interact with AI systems.
In the long view, the August 2025 launch is now widely cited as the moment when frontier AI moved from a benchmark race to a product race. After GPT-5, every major provider, including Anthropic, Google, and xAI, accelerated their own product roadmaps and matched OpenAI's emphasis on routing, agentic capabilities, and enterprise integration over raw benchmark dominance [9][13].