| OpenAI o3 | |
|---|---|
| Developer | OpenAI |
| Announced | December 20, 2024 |
| Release date | January 31, 2025 (o3-mini); April 16, 2025 (o3, o4-mini); June 10, 2025 (o3-pro) |
| Type | Reasoning-focused large language model |
| Architecture | Dense transformer |
| Variants | o3-mini, o3, o3-pro, o4-mini |
| Parameters | Not disclosed |
| Predecessor | OpenAI o1 |
OpenAI o3 is a family of reasoning-focused large language models developed by OpenAI, representing the second generation of the company's o-series reasoning models. First announced on December 20, 2024, during the "12 Days of OpenAI" event, the o3 family was released in stages: o3-mini launched on January 31, 2025; the full o3 model and o4-mini arrived on April 16, 2025; and o3-pro became available on June 10, 2025. The o3 models build on the inference-time reasoning paradigm established by o1, with significant improvements in performance, tool use, and multimodal capabilities.[1][2][3]
The o3 family posted benchmark results that set new records for AI reasoning. On the ARC-AGI benchmark, a high-compute configuration of o3 scored 87.5%, a result that sparked widespread discussion about the proximity to artificial general intelligence. On AIME 2025, o3 scored 88.9%, and on GPQA Diamond it reached 87.7%. Most notably, o3 solved 25.2% of problems on EpochAI's Frontier Math benchmark, where no previous model had exceeded 2%.[1][4]
OpenAI first revealed the o3 model family on December 20, 2024, the final day of its "12 Days of OpenAI" event. CEO Sam Altman and SVP of Research Mark Chen presented the model's ARC-AGI results in person at ARC Prize's offices, where the 87.5% high-compute score was disclosed. The name skipped "o2" reportedly to avoid confusion with the British telecommunications company O2.[4][5]
The release followed a staggered schedule:
| Date | Model | Availability |
|---|---|---|
| December 20, 2024 | o3 (announced) | Benchmark results shared; safety testing began |
| January 31, 2025 | o3-mini | ChatGPT (all tiers including free); API |
| April 16, 2025 | o3, o4-mini | ChatGPT Plus, Pro, Team; API |
| June 10, 2025 | o3-pro | ChatGPT Pro, Team; API |
Like its predecessor o1, the o3 model uses extended chain-of-thought reasoning during inference. The model generates internal reasoning tokens that are hidden from the user, working through problems step by step before producing a final response. However, o3's reasoning capabilities are substantially more advanced than o1's. OpenAI reported that in evaluations by external experts, o3 makes 20% fewer major errors than o1 on difficult real-world tasks, with particular improvements in programming, business consulting, and creative ideation.[1]
The reasoning process in o3 is more flexible than in o1, with the model able to dynamically adjust its reasoning depth based on problem complexity. Simple queries receive relatively brief internal reasoning, while complex problems can trigger extended chains of thought spanning thousands of tokens.
One of the most significant advances in o3 is its ability to use tools during the reasoning process itself. For the first time in the o-series, o3 can agentically combine multiple tools within ChatGPT, including web search, Python code execution for data analysis, file analysis, and image generation. Previous reasoning models could only think and then produce text; o3 can interleave reasoning with actions, search for information mid-thought, run calculations to verify hypotheses, and incorporate external data into its reasoning chain.[1]
This capability makes o3 substantially more effective for complex research and analysis tasks that require gathering and synthesizing information from multiple sources.
Another major advance is o3's ability to integrate images directly into its chain of thought. Users can provide images as context, and the model can reason about visual information alongside text. OpenAI described this as "thinking with images," meaning the model can analyze diagrams, charts, photographs, and other visual inputs as part of its reasoning process rather than treating them as separate inputs to be described and then reasoned about textually.[1][7]
With the release of o3, OpenAI introduced reasoning summaries through the Responses API, partially addressing the transparency concerns that had surrounded o1's hidden chain of thought. While the raw reasoning tokens remain hidden, developers can access summarized versions of the model's reasoning process. The API also supports encrypted reasoning content that represents the model's reasoning state, persisted entirely on the client side. Developers can pass this encrypted state back to the API in subsequent requests, gaining the quality, cost, and latency benefits of reused reasoning without OpenAI retaining any reasoning data.[20]
Attempting to extract raw reasoning through methods other than the official reasoning summary parameter is not supported and may violate OpenAI's Acceptable Use Policy.[20]
Released on January 31, 2025, o3-mini was the first model in the o3 family to reach the public. It was made available to all ChatGPT users, including free-tier subscribers, and to API developers. o3-mini is a smaller, faster model optimized for cost-efficient reasoning, particularly in math, coding, and science tasks.[2]
o3-mini introduced three configurable reasoning effort levels: low, medium, and high. At medium effort, o3-mini matched o1's performance on most benchmarks while delivering faster responses and lower costs. The high-effort variant (o3-mini-high) was available to paid ChatGPT subscribers and provided additional reasoning depth.[2]
| o3-mini Effort Level | AIME 2024 | GPQA Diamond | Description |
|---|---|---|---|
| Low | 60.0% | 59.9% | Fastest responses, lowest cost |
| Medium | 79.6% | 76.0% | Matches o1 performance |
| High | 86.5% | 77.0% | Exceeds o1, best for hard problems |
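The three effort levels above correspond to a single request parameter. The sketch below assumes the `reasoning_effort` parameter OpenAI documents for o-series models; `o3_mini_request` is a hypothetical helper that only builds the request dict, with no network call.

```python
# Sketch: choosing an o3-mini reasoning effort level per request.
# Assumes the `reasoning_effort` request parameter; o3_mini_request is a
# hypothetical helper that builds the payload only (no network call).
EFFORT_LEVELS = ("low", "medium", "high")

def o3_mini_request(prompt: str, effort: str = "medium") -> dict:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # low = fastest/cheapest, high = deepest
        "messages": [{"role": "user", "content": prompt}],
    }

print(o3_mini_request("Factor 391 into primes.", effort="high")["reasoning_effort"])
```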
The full o3 model launched on April 16, 2025, alongside o4-mini. It represents OpenAI's most capable reasoning model at the time of release, with broad improvements over o1 across all benchmarks. o3 demonstrated particular strength in mathematical reasoning, coding, and scientific problem-solving, while also showing improved performance on creative and business tasks.[1]
Key technical capabilities of the full o3 model include:
- Agentic use of multiple tools (web search, Python execution, file analysis, image generation) within a single reasoning chain
- "Thinking with images": integration of visual inputs directly into the chain of thought
- Configurable reasoning effort levels (low, medium, high)
- Reasoning summaries and client-side encrypted reasoning state through the Responses API
Released on June 10, 2025, o3-pro is a variant of o3 designed for maximum reliability on difficult tasks. Like o1-pro before it, o3-pro uses additional compute during the reasoning phase to think longer and more thoroughly. It is specifically designed for users who prioritize correctness and depth over response speed.[3]
OpenAI tested o3-pro using a "4/4 reliability" metric, requiring the model to answer the same question correctly four times in a row. On this measure, o3-pro outperformed both o1-pro and the base o3 model. It also scored higher on clarity, instruction-following, and domain-specific strength in STEM, writing, and business contexts.[3]
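The 4/4 reliability metric is straightforward to state in code. A minimal sketch (the function name is illustrative, not OpenAI's):

```python
# Sketch of the "4/4 reliability" metric described above: a question
# counts as solved only if all four independent attempts are correct.
def four_of_four(attempts_per_question: list[list[bool]]) -> float:
    """Fraction of questions answered correctly on all 4 attempts."""
    solved = sum(1 for attempts in attempts_per_question
                 if len(attempts) == 4 and all(attempts))
    return solved / len(attempts_per_question)

# Even a model that is right 90% of the time per attempt clears 4/4
# only about 0.9**4 = 65.6% of the time if attempts are independent.
runs = [[True] * 4, [True, True, False, True], [True] * 4]
print(four_of_four(runs))  # 2 of 3 questions solved on all four attempts
```

This is why 4/4 reliability is a much stricter bar than single-attempt accuracy: per-attempt errors compound across the four trials.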
o3-pro integrates real-time web search, file analysis, visual reasoning, Python execution, and advanced memory features, addressing complex workflows in science, programming, business, and writing. On competitive programming, o3-pro achieved a Codeforces Elo of 2748, compared to 2517 for o3 at medium effort, an improvement of just over 230 points.[3][13]
The trade-off is speed: o3-pro responses take significantly longer than standard o3 responses. OpenAI acknowledged that responses "typically take longer" than o1-pro and recommended the model for the most challenging questions "where reliability matters more than speed, and waiting a few minutes is worth the tradeoff." o3-pro is available to ChatGPT Pro and Team subscribers and through the API.[3]
Also released on April 16, 2025, o4-mini is a smaller, cost-efficient reasoning model that achieves remarkable performance relative to its size and cost. Despite being positioned as a budget option, o4-mini produced some of the most impressive benchmark results in the o-series lineup, particularly in mathematics.[1][8]
Among OpenAI's benchmarked models, o4-mini posted the best results on AIME 2024 and AIME 2025. Without tools, it scored 92.7% on AIME 2025, surpassing even the full o3 model (88.9%). When given access to a Python interpreter, o4-mini achieved a near-perfect 99.5% pass@1 on AIME 2025, with 100% consensus@8.[8]
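The consensus@8 metric above can be sketched in a few lines: sample k answers and score the majority answer, whereas pass@1 scores a single sample. The function name and sample answers below are illustrative.

```python
from collections import Counter

# Sketch of the consensus@k metric cited above: sample k answers and
# score the most common one; pass@1 instead scores a single sample.
def consensus_at_k(samples: list[str]) -> str:
    """Return the most common answer among k sampled answers."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical 8 samples for one AIME-style problem:
answers = ["336", "336", "112", "336", "336", "48", "336", "336"]
print(consensus_at_k(answers))  # prints 336
```

Majority voting suppresses occasional sampling errors, which is why consensus@8 can reach 100% even when pass@1 is slightly below it.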
Like o3, o4-mini supports multimodal reasoning, tool use during thinking, and configurable reasoning effort levels. Its combination of strong performance and low cost makes it the recommended choice for most applications that need reasoning capabilities.
From a cost perspective, o4-mini delivers 13.6x cost savings over o1 while maintaining 85.9% accuracy on coding benchmarks. The Batch API further reduces prices by 50%, bringing input costs to $0.55 and output to $2.20 per million tokens. o4-mini also provides a 4x increase in context window compared to o3-mini (from 32K to 128K tokens) while remaining faster at the same reasoning effort level.[14]
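The pricing arithmetic above can be checked directly. A small sketch using the o4-mini list prices and the Batch API's 50% discount as stated in this section (the helper name is illustrative):

```python
# Worked cost arithmetic for the figures above: o4-mini list prices and
# the Batch API's 50% discount. Rates are USD per 1M tokens.
def cost_usd(tokens_in: int, tokens_out: int,
             in_rate: float, out_rate: float, batch: bool = False) -> float:
    discount = 0.5 if batch else 1.0
    return discount * (tokens_in * in_rate + tokens_out * out_rate) / 1e6

# o4-mini at $1.10 in / $4.40 out, for 1M input + 1M output tokens:
print(round(cost_usd(1_000_000, 1_000_000, 1.10, 4.40), 2))              # 5.5
print(round(cost_usd(1_000_000, 1_000_000, 1.10, 4.40, batch=True), 2))  # 2.75
```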
The o3 family demonstrated substantial improvements over o1 across all major benchmarks.
| Benchmark | o3 | o1 | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 | 91.6% | 74.3% | 13.4% | American Invitational Mathematics Exam |
| AIME 2025 | 88.9% | 79.2% | - | American Invitational Mathematics Exam |
| GPQA Diamond | 87.7% | 78.0% | 53.6% | Graduate-level science questions |
| Frontier Math | 25.2% | <2% | <2% | Research-level mathematics (EpochAI) |
| SWE-bench Verified | 71.7% | 48.9% | 33.2% | Real-world software engineering tasks |
| Codeforces Elo | 2727 | 1891 | - | Competitive programming rating |
| ARC-AGI (low compute) | 75.7% | - | 5% | Visual abstract reasoning |
| ARC-AGI (high compute) | 87.5% | - | - | Visual abstract reasoning (172x compute) |
| MMLU | 92.4% | 92.3% | 87.2% | Multitask language understanding |
The Frontier Math result was particularly striking. This benchmark consists of research-level mathematics problems that had stumped all previous models (none exceeding 2%). o3's 25.2% score represented a qualitative leap in mathematical reasoning capability.[1]
| Benchmark | o4-mini | o3 | o3-mini (high) | o1 |
|---|---|---|---|---|
| AIME 2025 | 92.7% | 88.9% | 86.5% | 79.2% |
| AIME 2025 (with tools) | 99.5% | - | - | - |
| GPQA Diamond | 81.4% | 83.3% | 77.0% | 78.0% |
| SWE-bench Verified | 68.1% | 69.1% | 49.3% | 48.9% |
| HumanEval | 98.2% | 97.6% | - | 92.4% |
The following table summarizes the key characteristics of all models in OpenAI's o-series reasoning lineup as of mid-2025.
| Model | Release Date | Reasoning Effort Levels | Tool Use | Multimodal | API Input (per 1M tokens) | API Output (per 1M tokens) | Best For |
|---|---|---|---|---|---|---|---|
| o1-mini | Sep 12, 2024 | No | No | No | $3.00 | $12.00 | Budget STEM reasoning |
| o1 | Dec 5, 2024 | Low/Med/High | Yes | Yes (Dec 2024) | $15.00 | $60.00 | Complex reasoning tasks |
| o1-pro | Dec 5, 2024 | No | Yes | Yes | $150.00 | $600.00 | Maximum o1 reliability |
| o3-mini | Jan 31, 2025 | Low/Med/High | Limited | No | $1.10 | $4.40 | Cost-efficient reasoning |
| o3 | Apr 16, 2025 | Low/Med/High | Yes (agentic) | Yes | $2.00 | $8.00 | Flagship reasoning |
| o4-mini | Apr 16, 2025 | Low/Med/High | Yes (agentic) | Yes | $1.10 | $4.40 | Best value reasoning |
| o3-pro | Jun 10, 2025 | No (always max) | Yes (agentic) | Yes | $20.00 | $80.00 | Highest reliability |
A notable pattern in the pricing evolution is that o3 became dramatically cheaper over time. In June 2025, OpenAI reduced o3's API pricing by 80%, bringing it from $10/$40 (input/output per million tokens) to $2/$8. This price drop, combined with the introduction of o3-pro, reflected OpenAI's strategy of making high-quality reasoning accessible at lower price points while offering premium options for maximum reliability.[10]
The most discussed aspect of o3's announcement was its performance on the ARC-AGI benchmark. ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed to test the ability of AI systems to solve novel visual reasoning tasks that require human-like abstraction abilities. It was created by AI researcher Francois Chollet specifically as a test that would be difficult for systems relying on pattern matching rather than genuine reasoning.[4]
Prior to o3, the best AI performance on ARC-AGI had been around 5% (GPT-4o). The jump to 75.7% at the standard compute budget, and 87.5% at high compute (172x), was described by ARC Prize as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models."[4]
Francois Chollet provided detailed cost breakdowns for o3's ARC-AGI performance. In high-efficiency mode, o3 scored 75.7% at approximately $20 per task. Running o3 in this mode against all 400 public ARC-AGI puzzles cost $6,677 and yielded a score of 82.8%. The high-compute configuration, which achieved the headline 87.5% score, cost roughly $1,000 per task, with estimated total costs of approximately $1,148,444 for the full evaluation run.[4][15]
Chollet noted that while o3 was approaching human levels of performance, it "comes at a steep cost, and wouldn't quite be economical yet." He pointed out that humans could solve ARC-AGI tasks for roughly $5 per task "while consuming mere cents in energy." However, Chollet expressed optimism that "cost-performance will likely improve quite dramatically over the next few months and years."[4][15]
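The per-task economics above follow from simple division over the reported totals (the helper name is illustrative; figures are the ones cited in this section):

```python
# Reproducing the per-task cost arithmetic reported above for o3's
# high-efficiency ARC-AGI run: $6,677 across the 400 public tasks.
def per_task_cost(total_usd: float, n_tasks: int) -> float:
    return total_usd / n_tasks

print(round(per_task_cost(6_677, 400), 1))      # roughly $16.7 per task
print(round(per_task_cost(6_677, 400) / 5, 1))  # ~3.3x the ~$5 human cost
```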
The ARC Prize organization was careful to note that "passing ARC-AGI does not equate to achieving AGI." They pointed out that o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. The high-compute score also came at significant cost, making it impractical for most applications. Nevertheless, the result reignited public debate about the timeline to artificial general intelligence and whether reasoning-focused models represented a viable path toward it.[4]
The ARC-AGI result also prompted the creation of ARC-AGI-2, a harder successor benchmark. When GPT-5.2 was evaluated on ARC-AGI-2 in December 2025, it scored 52.9%, showing continued progress but also demonstrating that substantial challenges remained in abstract reasoning.[11]
The o3 family exists in direct competition with DeepSeek-R1, the open-source reasoning model released by the Chinese company DeepSeek in January 2025. The rivalry between these two model families has defined the reasoning model landscape throughout 2025.
| Benchmark | o3 | DeepSeek-R1 | DeepSeek-R1-0528 |
|---|---|---|---|
| AIME 2024 | 91.6% | 79.8% | 91.4% |
| AIME 2025 | 88.9% | 70.0% | 87.5% |
| GPQA Diamond | 87.7% | 71.5% | - |
| SWE-bench Verified | 71.7% | 49.2% | - |
| Codeforces Elo | 2727 | 2029 | ~1930 |
o3 leads decisively in coding tasks (Codeforces, SWE-bench) and science (GPQA Diamond), while the updated DeepSeek-R1-0528 has closed the gap significantly on mathematics benchmarks. The key differentiator beyond performance is cost and accessibility: R1 is available under the MIT license for self-hosting at zero API cost, and even through DeepSeek's API, it costs approximately 3-4 times less than o3 after OpenAI's June 2025 price cuts.[16]
The competition has been mutually beneficial for the field. DeepSeek's demonstration that reasoning models could be built cheaply and openly pressured OpenAI to reduce prices, while OpenAI's continued performance leadership on harder benchmarks pushed DeepSeek to improve R1 with the 0528 update.
o3 uses a dense transformer architecture where all parameters are active for every task, ensuring consistent performance but requiring more computational resources. DeepSeek-R1 uses a Mixture of Experts architecture that activates only 37 billion of its 671 billion total parameters per token, allowing it to achieve inference costs comparable to a much smaller model. This architectural choice is a key reason for R1's cost advantage.[16]
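The active-parameter gap can be made concrete with a quick calculation using the figures stated above (the helper name is illustrative):

```python
# Back-of-envelope for the architectural contrast above: DeepSeek-R1's
# MoE design activates 37B of its 671B total parameters per token,
# while a dense model activates all of its parameters on every token.
def active_fraction(active_b: float, total_b: float) -> float:
    return active_b / total_b

print(f"{active_fraction(37, 671):.1%}")  # about 5.5% of parameters per token
```

Activating only ~5.5% of weights per token is what lets R1's per-token inference cost resemble that of a much smaller dense model.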
OpenAI released GPT-5 in mid-2025 as a unified model that merges general-purpose language capabilities with reasoning. The relationship between o3 and GPT-5 reflects two different philosophies about how to build capable AI systems.[11][12]
GPT-5 is a unified model that automatically switches between fast and deep thinking modes based on the query's complexity. It uses an intelligent routing system that analyzes conversation type, complexity, tool needs, and user intent to determine whether to use quick response generation or engage a deeper "GPT-5 thinking" mode. OpenAI reports the router correctly identifies complexity in 94% of cases. This design prioritizes ease of use: the user does not need to choose a model or configure reasoning effort.[11]
o3, by contrast, is a specialist. It is specifically trained and optimized for deep reasoning tasks. When engaged, o3 tends to go deep on problems, following extended chains of reasoning, using tools to verify hypotheses, and systematically exploring solution spaces. It gives developers explicit control over reasoning effort and is designed for applications where thoroughness matters more than speed.[1]
In practice, GPT-5 with thinking enabled performs comparably to o3 on many benchmarks while using 50-80% fewer output tokens. GPT-5's responses are also reported to be roughly 80% less likely to contain factual errors than o3's. However, o3 and o3-pro remain the preferred choice for the most demanding reasoning tasks, where the additional depth and reliability justify the specialized model.[11][12]
OpenAI has indicated that GPT-5 and the o-series will continue to coexist, with GPT-5 serving as the default general-purpose model and the o-series models available for tasks requiring maximum reasoning depth.
Between the December 2024 announcement and the April 2025 release of the full o3 model, OpenAI conducted extensive safety testing. The gap between announcement and release was partly driven by the need for external safety evaluations, which included testing by the US AI Safety Institute (USAISI) and the UK AI Safety Institute (UKAISI), as well as external red-teamers.[19]
OpenAI applied its deliberative alignment safety framework to o3, the same approach used for o1. The model was trained to reason explicitly about safety policies within its chain of thought, considering whether a given query might require a refusal or a careful response. Because o3's reasoning chains were more sophisticated than o1's, the deliberative alignment process was reported to be more effective, with o3 achieving a Pareto improvement over o1 on both over-refusal and under-refusal metrics.[19]
The o3-mini release in January 2025 served partly as a lower-risk deployment that allowed OpenAI to gather real-world data about the model's behavior before releasing the full o3. By limiting o3-mini's tool-use capabilities relative to the full model, OpenAI was able to test the reasoning approach at scale while reducing the risk surface area.
The pricing story of the o3 family illustrates the rapid deflation of reasoning model costs throughout 2025. When o3 first launched in April 2025, it was priced at $10 per million input tokens and $40 per million output tokens. Two months later, in June 2025, OpenAI reduced these prices by 80% to $2/$8, making o3 cheaper than the original o1 had been at $15/$60.[10]
This pricing trajectory was driven by several factors: competitive pressure from DeepSeek-R1 (priced at $0.55/$2.19), improvements in inference efficiency, and OpenAI's strategic desire to make reasoning models accessible to a broader developer base. The introduction of o3-pro at $20/$80 created a clear tiering structure: developers could choose between budget reasoning (o4-mini at $1.10/$4.40), mainstream reasoning (o3 at $2/$8), and premium reliability (o3-pro at $20/$80).[10][14]
For high-volume applications, the Batch API provides an additional 50% discount on all o-series models, making o4-mini available at $0.55/$2.20 per million tokens. This pricing puts sophisticated reasoning capabilities within reach of individual developers and small startups, a significant shift from the early days of o1 when reasoning was effectively a premium product.
Developer adoption of the o3 family has been shaped by the convergence trend in AI models during 2025. By mid-to-late 2025, reasoning depth, tool use, and conversational quality increasingly lived inside the same flagship model line, with model selection becoming more about cost, latency, and quality tradeoffs than choosing between fundamentally different model families.[14]
For most production applications, o4-mini has emerged as the preferred reasoning model due to its combination of strong performance and low cost. At $1.10/$4.40 per million tokens (input/output), it costs roughly half as much as o3 at its post-cut $2/$8 pricing, and nearly 10x less than o3's original launch pricing, while maintaining competitive accuracy. The Batch API reduces costs further, making o4-mini particularly attractive for high-volume applications.[14]
o3 itself is typically reserved for tasks requiring maximum reasoning depth, such as complex scientific analysis, multi-step mathematical proofs, and sophisticated code generation. The availability of reasoning effort levels allows developers to fine-tune the tradeoff between cost and thoroughness on a per-query basis.
With the release of GPT-5 in August 2025, some developers migrated away from the o-series entirely, preferring GPT-5's unified model that automatically engages reasoning when needed. However, developers working on tasks requiring maximum reasoning depth continue to prefer o3 and o3-pro for their explicit control and dedicated optimization.
As of March 2026, the o3 family represents OpenAI's primary reasoning model lineup. With the April 2025 release, o3 and o4-mini replaced o1 and o3-mini in the ChatGPT model selector for Plus, Pro, and Team users. o3-pro replaced o1-pro for Enterprise and Edu users.[6]
The o-series reasoning models coexist with OpenAI's GPT-5 family, which took over as the default model in ChatGPT. While GPT-5 handles the majority of everyday tasks and can engage its own thinking mode when needed, o3 and o3-pro remain available for users and developers who need dedicated deep reasoning capabilities.
OpenAI has not publicly confirmed a successor to o3, though the existence of o4-mini (released alongside o3) suggests that the o-series numbering will continue. The pattern of releasing a flagship reasoning model alongside a cost-efficient mini variant appears to be OpenAI's standard approach for the o-series going forward.
The broader impact of the o3 family extends beyond OpenAI's own products. The benchmark results, particularly on ARC-AGI and Frontier Math, have pushed competing labs to invest more heavily in reasoning-focused models. Google's Gemini 2.5 Pro, Anthropic's Claude models with extended thinking, and DeepSeek's R1 series all reflect the competitive pressure created by o3's capabilities.