GPT-5 is OpenAI's flagship large language model, first released on August 7, 2025. It represents a fundamental shift in OpenAI's model strategy: rather than maintaining separate model families for different capabilities (such as GPT-4o for speed and the o-series for reasoning), GPT-5 unifies these into a single system with built-in "thinking" capabilities and a real-time router that selects the appropriate level of reasoning for each query [1]. The model family has been updated several times since launch, with GPT-5.2 arriving in December 2025, followed by GPT-5.3 Instant and GPT-5.4 in March 2026 [2][3][4].
At launch, GPT-5 set new state-of-the-art results on multiple benchmarks, including 94.6% on AIME 2025 (mathematics), 74.9% on SWE-bench Verified (software engineering), and 84.2% on MMMU (multimodal understanding). It also showed a significant reduction in hallucinations, producing roughly six times fewer factual errors than its predecessor o3 when using its thinking mode [1].
By mid-2025, OpenAI was maintaining two separate product lines: the GPT-4o family, optimized for low-latency conversational use, and the o-series (o1, o3), designed for complex reasoning tasks requiring chain-of-thought processing. This split created confusion for both developers and end users, who had to choose between models with different strengths and could not get both fast responses and deep reasoning from the same system [1].
GPT-5 was built to solve this problem. The model incorporates a unified architecture with three components: an efficient model that handles straightforward queries quickly, a deeper reasoning model (GPT-5 "thinking") for harder problems, and a real-time router that automatically decides which component to engage based on conversation type, problem complexity, tool needs, and explicit user intent. From the user's perspective, it is a single model that adapts its behavior to the difficulty of the question [1].
The development of GPT-5 took place against a backdrop of intensifying competition among frontier AI labs, with Anthropic's Claude models, Google's Gemini series, and open-source efforts like DeepSeek all making rapid progress. By the time of GPT-5's launch, the AI industry had entered a phase of rapid iteration where major model releases from competing labs were separated by weeks rather than months.
OpenAI's approach with GPT-5 also reflected a strategic bet on model unification. Rather than continuing to fragment its offerings across multiple model families with different API endpoints, pricing structures, and capability profiles, the company consolidated everything into a single product line. This simplified the developer experience and reduced the need for complex model-selection logic in production applications.
GPT-5's most important technical innovation is its unified architecture with an intelligent routing system. The system comprises three integrated components [1][21]:
The router's decision-making process considers multiple factors: conversation type, complexity, tool requirements, and explicit user intent. OpenAI reports the router correctly identifies complexity in 94% of cases, with continuous improvement through reinforcement learning [21].
When thinking mode is engaged, GPT-5 produces 22% fewer major errors compared to standard (non-thinking) mode and dramatically improves performance on expert-level questions, from 6.3% to 24.8% accuracy [21]. The thinking mode achieves superior results using 50-80% fewer tokens than o3 across visual reasoning, agentic coding, and scientific problem-solving [1].
Developers can also override the router's decisions. The API supports explicit control over whether thinking mode is engaged, giving developers the ability to force deep reasoning for specific queries or disable it for latency-sensitive applications.
GPT-5 launched with the following API-level specifications:
| Specification | Value |
|---|---|
| Context window (input) | 272,000 tokens |
| Maximum output | 128,000 tokens |
| Model variants | gpt-5, gpt-5-mini, gpt-5-nano |
| Thinking mode | Built-in, automatic or user-controlled |
| Modalities | Text, image, audio (input and output) |
| API input pricing | $1.25 per 1M tokens |
| API output pricing | $10.00 per 1M tokens |
| Cached input pricing | $0.125 per 1M tokens (90% discount) |
The 272K-token context window represented a significant increase over GPT-4o's 128K tokens. The model also supported parallel tool calling, built-in web search, and native audio processing [1][5].
OpenAI released three model sizes at launch. GPT-5 (standard) served as the main offering for complex tasks. GPT-5-mini provided a balance of capability and speed at lower cost. GPT-5-nano, priced at $0.05 per million input tokens, targeted high-volume, cost-sensitive applications like classification and extraction [5].
The caching system was a notable addition to the API offering. Tokens that appeared in a prompt recently submitted to the API were automatically cached, and subsequent requests reusing those cached tokens were charged at a 90% discount ($0.125 per million tokens instead of $1.25). This dramatically reduced costs for applications that made repeated calls with overlapping context, such as multi-turn conversations or iterative code generation workflows [5].
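The savings from caching can be made concrete with a small calculation using the launch prices above ($1.25/1M fresh input, $0.125/1M cached input). The function below models a common chat pattern in which each turn resends the system prompt plus all prior turns; the function name and the exact request shape are illustrative assumptions:

```python
# Input-side cost of a multi-turn conversation with and without prompt
# caching, at GPT-5's launch prices: $1.25/1M input, $0.125/1M cached input.

INPUT_PER_M = 1.25
CACHED_PER_M = 0.125

def conversation_input_cost(system_tokens: int, turn_tokens: int,
                            turns: int, cached: bool) -> float:
    """USD input cost of `turns` requests that each resend the system
    prompt plus all prior turns (a common chat pattern)."""
    total = 0.0
    for t in range(turns):
        fresh = turn_tokens                       # the new user message
        prefix = system_tokens + t * turn_tokens  # everything resent
        if cached and t > 0:                      # prefix was seen last turn
            total += prefix * CACHED_PER_M / 1e6 + fresh * INPUT_PER_M / 1e6
        else:
            total += (prefix + fresh) * INPUT_PER_M / 1e6
    return total

# 10 turns, 20K-token system prompt, 1K tokens per user turn:
no_cache = conversation_input_cost(20_000, 1_000, 10, cached=False)
with_cache = conversation_input_cost(20_000, 1_000, 10, cached=True)
print(f"without cache: ${no_cache:.4f}, with cache: ${with_cache:.4f}")
```

For this workload the cached run costs roughly a fifth of the uncached one, which is why the discount matters so much for multi-turn and iterative workflows.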
GPT-5 set new state-of-the-art results across several categories at launch:
| Benchmark | Category | GPT-5 (Thinking) | o3 | GPT-4o |
|---|---|---|---|---|
| AIME 2025 | Mathematics | 94.6% | 79.2% | 26.7% |
| SWE-bench Verified | Software engineering | 74.9% | 69.1% | 38.0% |
| MMMU | Multimodal understanding | 84.2% | 74.9% | 69.1% |
| GPQA Diamond | Graduate-level science | 81.6% | 79.7% | 53.6% |
| Aider Polyglot | Coding (multi-language) | 88.0% | - | 45.3% |
With the Pro variant and Python tools enabled, GPT-5 scored a perfect 100% on AIME 2025; with the same tools, the standard thinking variant reached 99.6% [1].
The AIME (American Invitational Mathematics Examination) results were particularly noteworthy because these are competition-level math problems designed for top high school students. A score of 94.6% without tools placed GPT-5 well above the performance of the vast majority of human test-takers.
GPT-5 launched as a natively multimodal model, capable of processing text, images, and audio as both inputs and outputs. This represented a continuation of the multimodal approach introduced with GPT-4o but with significantly enhanced capabilities.[1]
On the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark, which tests a model's ability to reason about images, diagrams, and charts across academic disciplines, GPT-5 Thinking scored 84.2%, compared to GPT-4o's 69.1%. The model demonstrated particular strength on tasks requiring joint reasoning across text and visual inputs, such as interpreting scientific diagrams, analyzing financial charts, and solving geometry problems presented as images.[1]
GPT-5.2 further expanded multimodal performance, with high scores on MMMU-Pro (86.5%) and Video-MMMU (90.5%). The Video-MMMU results suggested a powerful, natively multimodal architecture capable of reasoning across temporal and spatial dimensions simultaneously, enabling the model to understand and reason about video content in addition to static images.[2]
The native audio capabilities allowed GPT-5 to process spoken input directly and generate spoken responses, enabling real-time voice conversations without the intermediate step of speech-to-text transcription. This was particularly relevant for ChatGPT's voice mode and for applications in customer service, accessibility, and language learning.
One of the most significant improvements in GPT-5 was a substantial reduction in hallucinations. According to OpenAI's internal evaluations, GPT-5 (thinking) produced roughly five to six times fewer factual errors than o3 across three factual accuracy benchmarks when browsing was enabled. With web search active, GPT-5 responses were approximately 45% less likely to contain a factual error compared to GPT-4o [1].
This improvement addressed one of the most persistent criticisms of large language models: their tendency to generate plausible-sounding but factually incorrect information. For enterprise and professional use cases where factual reliability is critical, the hallucination reduction was arguably more important than any single benchmark improvement.
GPT-5 (thinking) matched or exceeded o3's performance across most benchmarks while using 50-80% fewer output tokens. This efficiency translated directly into lower costs and faster response times for developers, making the thinking capabilities practical for production workloads rather than being limited to specialized research scenarios [1].
The token efficiency improvement also had implications for user experience. Shorter reasoning chains meant faster responses, which made the model feel more responsive in interactive settings like ChatGPT conversations, even when engaging in complex reasoning.
OpenAI released GPT-5.2 on December 11, 2025, roughly four months after the initial GPT-5 launch. The update introduced a three-tier product structure: Instant (for fast, everyday queries), Thinking (for complex reasoning), and Pro (for maximum performance on the hardest problems) [2].
| Feature | GPT-5 | GPT-5.2 |
|---|---|---|
| Context window | 272K | 400K |
| AIME 2025 (no tools) | 94.6% | 100% |
| SWE-bench Verified | 74.9% | 80.0% |
| ARC-AGI-2 (abstract reasoning) | - | 52.9% |
| GDPval (professional work) | 38.8% | 70.9% |
| FrontierMath | - | 40.3% |
| API input pricing | $1.25/1M | $1.75/1M |
| API output pricing | $10.00/1M | $14.00/1M |
GPT-5.2 expanded the context window to 400,000 tokens across all paid tiers. On GDPval, a benchmark measuring performance on knowledge work tasks across 44 occupations, GPT-5.2 Thinking became the first model to perform at or above human expert level, beating or tying top industry professionals on 70.9% of comparisons [2].
GPT-5.2 Thinking also produced roughly 30% fewer errors than the previous GPT-5.1 update, with the response error rate dropping from 8.8% to 6.2% [2].
The ARC-AGI-2 result was notable because this benchmark tests abstract reasoning ability, a capability widely considered to be a fundamental limitation of current AI systems. GPT-5.2's score of 52.9%, compared to GPT-5.1's 17.6%, represented a 35-point improvement and suggested significant progress on a capability that has historically been resistant to scaling [2].
GPT-5.2's performance on FrontierMath (40.3% on Tiers 1-3) was a significant milestone. FrontierMath, developed by EpochAI, consists of research-level mathematics problems that require graduate-level or beyond mathematical reasoning. Prior to o3, no model had exceeded 2% on this benchmark. o3 reached 25.2%, and GPT-5.2's 40.3% represented a further 60% relative improvement. The result demonstrated that mathematical reasoning capabilities were continuing to scale rapidly with each new model generation.[2]
Alongside the main release, OpenAI introduced GPT-5.2-Codex, a variant specifically optimized for agentic coding tasks in the Codex environment. This version featured improvements in context compaction (allowing it to work with large codebases more efficiently) and stronger performance on large-scale code changes such as refactors and migrations [6].
GPT-5.2-Codex was designed for long-horizon coding workflows where an agent needs to understand a full codebase, plan a multi-file change, and execute it with minimal human intervention. The context compaction feature allowed the model to work within its context window more efficiently by summarizing less-relevant portions of the codebase while maintaining full detail on actively edited files.
GPT-5.2 arrived during an intense period of competition. Google had released Gemini 3 Pro on November 18, 2025, and Anthropic launched Claude Opus 4.5 on November 24, 2025. GPT-5.2 was widely seen as OpenAI's response to these competitive releases. While Claude Opus 4.5 held the edge on SWE-bench Verified at 80.9%, GPT-5.2 achieved state-of-the-art on SWE-bench Pro at 55.6% and led in abstract reasoning with 52.9% on ARC-AGI-2 [2][7].
On March 3, 2026, OpenAI released GPT-5.3 Instant, which replaced GPT-5.2 Instant as the default model for all ChatGPT users, including those on the free tier [3].
GPT-5.3 Instant focused on conversational quality rather than raw benchmark performance, with improvements centered on a more natural conversational tone in everyday dialogue [3].
GPT-5.3 Instant was also made available to developers in the API as gpt-5.3-chat-latest. The conversational improvements were particularly relevant for ChatGPT's consumer user base, where natural-sounding dialogue matters more than performance on academic benchmarks [3].
A separate release, GPT-5.3-Codex, provided long-term support for GitHub Copilot integrations, optimizing the model specifically for inline code suggestions and repository-level understanding [10].
GPT-5.4 was announced on March 5, 2026, and represents the most capable model in the GPT-5 family as of March 2026. It combines frontier reasoning, coding capabilities inherited from GPT-5.3-Codex, and a new native computer-use capability in a single model [4].
| Specification | GPT-5.4 |
|---|---|
| Context window | 1.05M tokens |
| Maximum output | 128,000 tokens |
| API input pricing (standard) | $2.50 per 1M tokens |
| API output pricing | $15.00 per 1M tokens |
| Extended context input (>272K) | $5.00 per 1M tokens |
| Cached input | $1.25 per 1M tokens |
| Computer use | Native, state-of-the-art |
| Variants | GPT-5.4, GPT-5.4 Pro, GPT-5.4 Mini, GPT-5.4 Nano |
GPT-5.4 is the first general-purpose model from OpenAI with native, state-of-the-art computer-use capabilities. This means the model can directly operate desktop applications, navigate web interfaces, and carry out complex multi-step workflows across different programs. The model works through a screenshot-action loop: it receives a screenshot of the current screen, analyzes the visual content, and returns structured actions (clicks, typing, scrolling) that an agent framework can execute. The cycle then repeats with the next screenshot.[4][22]
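The screenshot-action loop described above can be sketched as a simple control loop. Everything in this sketch (`Action`, `capture_screenshot`, `execute`, `model_step`) is a hypothetical stand-in for illustration, not an actual OpenAI SDK call; a real harness would call the model API and an OS automation layer:

```python
# Minimal skeleton of a screenshot-action agent loop. All names here are
# hypothetical stand-ins; a real harness would call the model API for
# model_step and an OS automation layer for execute.

from dataclasses import dataclass, field

@dataclass
class Action:
    kind: str                          # "click", "type", "scroll", or "done"
    payload: dict = field(default_factory=dict)

def run_agent(model_step, capture_screenshot, execute, max_steps: int = 50):
    """Drive the screenshot -> model -> action -> screenshot cycle."""
    history = []
    for _ in range(max_steps):
        shot = capture_screenshot()         # pixels in, per the loop above
        action = model_step(shot, history)  # structured action out
        if action.kind == "done":
            break
        execute(action)                     # click/type/scroll on the host
        history.append(action)
    return history

# Scripted stubs to show the control flow without a real desktop:
script = iter([Action("click", {"x": 10, "y": 20}),
               Action("type", {"text": "hello"}),
               Action("done")])
log = run_agent(model_step=lambda shot, hist: next(script),
                capture_screenshot=lambda: b"fake-png-bytes",
                execute=lambda a: None)
print([a.kind for a in log])   # ['click', 'type']
```

The `max_steps` bound reflects a practical necessity in such loops: without it, a confused model could cycle indefinitely between screenshots and actions.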
On OSWorld-Verified, a benchmark for desktop automation tasks, GPT-5.4 scored 75.0%, surpassing the human baseline of 72.4% and dramatically improving over GPT-5.2's 47.3%. This made GPT-5.4 the first AI model to operate a computer better than human experts on this benchmark [4][22].
On BrowseComp, a benchmark for agentic web browsing, GPT-5.4 reached 82.7% (up from 65.8% for GPT-5.2), while GPT-5.4 Pro scored 89.3% [4].
The computer-use capability enables a new class of AI agent applications. Rather than operating through APIs or structured tool calls, GPT-5.4 can interact with software the same way a human would: by reading screen content, moving a cursor, clicking buttons, and typing text. This makes it possible to automate tasks in applications that lack APIs or programmatic interfaces.
The context window expanded to 1.05 million tokens, making GPT-5.4 the first OpenAI model with over one million tokens of context in a standard API offering. This allows agents to plan, execute, and verify tasks across long horizons, processing entire codebases or extensive document collections in a single session. Requests exceeding 272K tokens are priced at 2x for input and 1.5x for output [4].
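The two-tier input pricing works out as follows; the function name is illustrative, and the rates are the ones stated above ($2.50/1M standard input, $5.00/1M for input tokens beyond 272K):

```python
# Input cost for a single GPT-5.4 request under the two-tier pricing above:
# the first 272K input tokens at $2.50/1M, tokens beyond 272K at $5.00/1M.

STANDARD_PER_M = 2.50
EXTENDED_PER_M = 5.00
THRESHOLD = 272_000

def gpt54_input_cost(input_tokens: int) -> float:
    base = min(input_tokens, THRESHOLD)        # standard-rate portion
    extra = max(input_tokens - THRESHOLD, 0)   # extended-rate portion
    return base * STANDARD_PER_M / 1e6 + extra * EXTENDED_PER_M / 1e6

print(f"{gpt54_input_cost(100_000):.2f}")    # 0.25  (all standard-rate)
print(f"{gpt54_input_cost(1_000_000):.2f}")  # 4.32  (272K standard + 728K extended)
```

A full million-token request thus costs noticeably more than a naive flat-rate estimate ($2.50) would suggest, which matters when budgeting for whole-repository or large-document workloads.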
The million-token context is particularly valuable for coding agents that need to reason about entire repositories, legal professionals reviewing large document sets, and research applications that involve synthesizing information from many sources simultaneously.
| Benchmark | GPT-5.2 | GPT-5.4 | Change |
|---|---|---|---|
| GDPval | 70.9% | 83.0% | +12.1 |
| OSWorld-Verified | 47.3% | 75.0% | +27.7 |
| BrowseComp | 65.8% | 82.7% | +16.9 |
| Investment banking modeling | 68.4% | 87.3% | +18.9 |
| Factual accuracy | Baseline | 33% fewer false claims | - |
| Token efficiency | Baseline | ~50% improvement | - |
GPT-5.4 also improved token efficiency by roughly 50% on complex tasks and reduced false claims by 33% compared to GPT-5.2 [4].
Alongside GPT-5.4, OpenAI released smaller variants:
| Variant | Speed | Context | API Input Price | API Output Price | Use Case |
|---|---|---|---|---|---|
| GPT-5.4 | Standard | 1.05M | $2.50/1M | $15.00/1M | Complex reasoning, professional tasks |
| GPT-5.4 Pro | Slower | 1.05M | $30.00/1M | $180.00/1M | Maximum performance on hardest problems |
| GPT-5.4 Mini | ~180 tok/s | 400K | $0.75/1M | $4.50/1M | High-volume workloads |
| GPT-5.4 Nano | ~200 tok/s | 400K | $0.20/1M | $1.25/1M | Cost-sensitive applications |
GPT-5.4 Mini and Nano were released on March 17, 2026, bringing many of GPT-5.4's strengths to faster, more efficient models designed for high-volume production use. GPT-5.4 Mini operates at roughly 180-190 tokens per second, while GPT-5.4 Nano reaches approximately 200 tokens per second, more than 2x faster than the original GPT-5 Mini [8].
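The throughput figures translate directly into end-to-end latency for a given response length. A back-of-envelope sketch (decode speed only, ignoring network overhead and time-to-first-token; the function name is illustrative):

```python
# Back-of-envelope generation time from the throughput figures above
# (decode speed only; ignores network latency and time-to-first-token).

def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    return output_tokens / tokens_per_second

for name, tps in [("GPT-5.4 Mini", 180), ("GPT-5.4 Nano", 200)]:
    t = generation_seconds(2_000, tps)
    print(f"{name}: ~{t:.1f}s for a 2,000-token response")
```

At ~200 tokens per second, even a long 2,000-token response completes in about ten seconds, which is what makes these variants practical for high-volume interactive workloads.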
GPT-5's unified architecture was designed partly to simplify the developer experience by eliminating the need to choose between GPT-4o and the o-series for different tasks. When GPT-5 launched, OpenAI positioned it as the default model for both conversational and reasoning workloads, encouraging developers to migrate from both GPT-4o and the o-series.[1][5]
The migration pattern varied by use case. Developers whose applications primarily needed conversational AI or content generation found GPT-5 to be a straightforward replacement for GPT-4o, with better performance and comparable pricing. Developers who had been using o1 or o3 for specialized reasoning tasks had a more nuanced decision, as GPT-5's thinking mode covered most reasoning use cases but did not always match o3's depth on the hardest problems.[1][12]
By late 2025, OpenAI's update cadence had accelerated significantly. The rapid succession of GPT-5, GPT-5.1, and GPT-5.2 within five months required developers to continuously adapt their applications. OpenAI addressed this partly through its tiered model structure, with Instant models providing stability for everyday use while Thinking and Pro variants pushed the capability frontier. The introduction of GPT-5.4 in March 2026 further consolidated the lineup, with OpenAI framing it as the default model for both "broad general-purpose work and most coding tasks," replacing both gpt-5.2 in the API and gpt-5.3-codex in Codex.[4][23]
GPT-5 saw rapid enterprise adoption through Microsoft's integration of the model into its product ecosystem. Microsoft, which deploys OpenAI models across Copilot Studio, Microsoft 365 Copilot, and Azure services, began making GPT-5 available to enterprise customers in August 2025. Enterprises could phase in GPT-5 alongside existing models, starting with high-value workflows like code reviews, RFP automation, and analytics.[23]
Azure's native hooks allowed minimal disruption to existing governance, security, and compliance protocols during migration. By December 2025, GPT-5.2 was introduced into Microsoft Foundry as a new standard for enterprise AI, with optimized configurations for large-scale deployment.[23]
The hallucination reduction in GPT-5 was cited as a key factor in enterprise adoption. Several companies reported integrating GPT-5 into customer-facing applications where factual reliability had previously been a barrier. The built-in thinking mode meant that enterprises no longer needed to maintain separate integrations with the o-series for tasks requiring reasoning, simplifying their AI infrastructure.[1]
The rapid iteration of the GPT-5 family produced measurable improvements across each version on key benchmarks.
| Benchmark | GPT-5 (Aug 2025) | GPT-5.2 (Dec 2025) | GPT-5.4 (Mar 2026) |
|---|---|---|---|
| AIME 2025 (no tools) | 94.6% | 100% | 100% |
| SWE-bench Verified | 74.9% | 80.0% | - |
| GPQA Diamond | 81.6% | - | - |
| GDPval | 38.8% | 70.9% | 83.0% |
| OSWorld-Verified | - | 47.3% | 75.0% |
| BrowseComp | - | 65.8% | 82.7% |
| ARC-AGI-2 | - | 52.9% | - |
| FrontierMath (Tiers 1-3) | - | 40.3% | - |
| Context window | 272K | 400K | 1.05M |
The GDPval benchmark, which measures performance on professional knowledge work across 44 occupations, showed particularly dramatic improvement, nearly doubling from 38.8% with GPT-5 to 70.9% with GPT-5.2, and reaching 83.0% with GPT-5.4. This trend suggested that the GPT-5 family was becoming increasingly useful for real-world professional tasks beyond the academic benchmarks that had traditionally been used to evaluate language models.[2][4]
The perfect 100% score on AIME 2025 achieved by GPT-5.2 Thinking (without tools) was a watershed moment for mathematical reasoning. The AIME is a competition designed for the top 5% of US high school mathematics students, and a perfect score without tool assistance demonstrated that GPT-5.2 had reached a level of mathematical competence that matched or exceeded top human performers on this particular exam.[2]
| Date | Release | Key Feature |
|---|---|---|
| August 7, 2025 | GPT-5 | Unified model, 272K context, built-in thinking |
| December 11, 2025 | GPT-5.2 | 400K context, Instant/Thinking/Pro tiers |
| December 2025 | GPT-5.2-Codex | Optimized for agentic coding |
| March 3, 2026 | GPT-5.3 Instant | Improved conversational tone, 400K context for all |
| March 5, 2026 | GPT-5.4 | 1.05M context, native computer use |
| March 17, 2026 | GPT-5.4 Mini and Nano | Smaller, faster variants of GPT-5.4 |
The GPT-5 series has maintained competitive pricing relative to its predecessors, particularly given the performance improvements:
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Release |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | May 2024 |
| GPT-5 | $1.25 | $10.00 | August 2025 |
| GPT-5.2 | $1.75 | $14.00 | December 2025 |
| GPT-5.4 | $2.50 | $15.00 | March 2026 |
| GPT-5.4 Mini | $0.75 | $4.50 | March 2026 |
| GPT-5.4 Nano | $0.20 | $1.25 | March 2026 |
Notably, GPT-5 launched at a lower price per input token than GPT-4o despite substantially better performance, reflecting the efficiency improvements in the underlying architecture. The pricing trend across the GPT-5 series shows a gradual increase for the flagship model (from $1.25 to $2.50 per million input tokens) alongside the introduction of increasingly affordable smaller variants [5].
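The table's list prices can be compared directly by costing a representative request at each tier. The figures below use a 10K-input, 1K-output request as an arbitrary illustrative workload:

```python
# Cost of one representative request (10K input + 1K output tokens) at
# each model's list price from the table above.

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "GPT-4o":       (2.50, 10.00),
    "GPT-5":        (1.25, 10.00),
    "GPT-5.2":      (1.75, 14.00),
    "GPT-5.4":      (2.50, 15.00),
    "GPT-5.4 Mini": (0.75, 4.50),
    "GPT-5.4 Nano": (0.20, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return input_tokens * p_in / 1e6 + output_tokens * p_out / 1e6

for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 1_000):.5f}")
```

This makes the spread visible at a glance: the same request costs roughly ten times less on GPT-5.4 Nano than on the flagship GPT-5.4, which is the economics behind routing high-volume traffic to the smaller variants.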
The GPT-5 series exists in a highly competitive environment. As of early 2026, the frontier model landscape includes several strong alternatives:
| Provider | Model | Notable Strength |
|---|---|---|
| OpenAI | GPT-5.4 | Computer use, 1.05M context |
| Anthropic | Claude Opus 4.5 | Coding (80.9% SWE-bench Verified) |
| Google | Gemini 3 Pro | Reasoning (1501 LMArena Elo), 1M context |
| DeepSeek | DeepSeek-V3.2 | Cost efficiency (10-30x cheaper) |
| xAI | Grok 4 | Real-time information integration |
The late-2025 and early-2026 period saw a significant compression of performance gaps between leading models, with each provider developing distinct specializations. Organizations increasingly deploy multiple models, routing queries to the most suitable model for each task type, rather than standardizing on a single provider [7].
Google's Gemini 3 Pro achieved an unprecedented 91.9% on GPQA Diamond, surpassing human expert performance (approximately 89.8%). Gemini 3 Pro's Deep Think mode also pushed Humanity's Last Exam to 41%, the highest published score on that benchmark. Anthropic's Claude Opus 4.5 held the SWE-bench Verified lead at 80.9%. DeepSeek-V3.2 offered frontier-class performance at a fraction of the cost, providing a strong option for cost-sensitive applications. GPT-5.4's distinctive advantages as of March 2026 are its native computer-use capabilities and its 1.05M-token context window [7].
The competitive dynamics of this period also drove pricing pressure across the industry. With DeepSeek demonstrating that high-quality models could be offered at dramatically lower prices, all major providers were forced to offer more competitive pricing or justify their premium through unique capabilities.
The rapid pace of GPT-5 updates created a complex model lifecycle that developers needed to navigate. OpenAI's approach was to deprecate older GPT-5 versions relatively quickly as newer ones launched. When GPT-5.4 was released in March 2026, OpenAI framed it as the replacement for both gpt-5.2 in the standard API and gpt-5.3-codex in the Codex environment, consolidating what had briefly been separate model tracks.[4]
Over 2025 and 2026, OpenAI also announced the deprecation of several pre-GPT-5 models, including o1, GPT-4.5, o3-mini, and GPT-4o. This wave of retirements effectively pushed the entire developer ecosystem toward the GPT-5 family and the o3/o4-mini reasoning models. Developers who had built on GPT-4o were encouraged to migrate to GPT-5, while those on o1 were directed to o3 or o4-mini.[7]
The transition was not without friction. Some developers reported that prompt behaviors changed subtly between GPT-5 versions, requiring regression testing and prompt tuning with each update. OpenAI addressed this partly by maintaining stable model snapshot endpoints (e.g., gpt-5.4-2026-03-05) that developers could pin to for production stability, while the generic gpt-5.4 endpoint would receive rolling updates.
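A minimal sketch of the pinning pattern: the snapshot name is the example given in the text, while `PIN_MODELS` and `model_for_production` are hypothetical application-side names, not OpenAI API features:

```python
# Pinning a dated snapshot vs. tracking the rolling alias. The snapshot
# name is the example from the text; PIN_MODELS and model_for_production
# are hypothetical application-side names.

PIN_MODELS = True  # flip to False to accept rolling updates

def model_for_production(pinned: bool = PIN_MODELS) -> str:
    # A dated snapshot keeps behavior stable across rolling updates;
    # the bare alias silently picks up each new revision.
    return "gpt-5.4-2026-03-05" if pinned else "gpt-5.4"

print(model_for_production())  # gpt-5.4-2026-03-05
```

The trade-off is the usual one for pinned dependencies: stability for regression-tested prompts versus missing improvements until the pin is deliberately bumped.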
GPT-5's launch in August 2025 was broadly well-received, though not without criticism. The unification of reasoning and conversational capabilities into a single model was praised as a significant usability improvement. Developers no longer had to choose between model families or implement their own routing logic [1].
Reviewers also singled out the hallucination reduction as the advance most relevant to enterprise adoption, since factual reliability had been a barrier to customer-facing deployments [1].
However, some researchers noted that OpenAI's benchmark presentations were occasionally misleading. In one instance, a benchmark graph in the launch materials was found to contain errors, drawing public criticism. Others pointed out that while GPT-5 was a clear improvement over OpenAI's previous models, the gap between it and competitors like Claude and Gemini was smaller than in earlier generations [9].
The rapid cadence of updates, from GPT-5 in August to GPT-5.4 in March, roughly seven months later, also raised questions about versioning clarity and the challenge facing developers who need to maintain stable production systems while keeping up with frequent model changes.
The GPT-5 series also marked a turning point in how AI models are consumed. The built-in thinking mode and automatic routing represented a move away from exposing raw model capabilities to users and toward providing an integrated, managed AI experience. This trend toward "model as service" rather than "model as tool" has implications for how developers build applications and how end users interact with AI systems.