| OpenAI o3 | |
|---|---|
| Developer | OpenAI |
| Announced | December 20, 2024 |
| Release date | January 31, 2025 (o3-mini); April 16, 2025 (o3, o4-mini); June 10, 2025 (o3-pro) |
| Type | Reasoning-focused large language model |
| Architecture | Dense transformer |
| Variants | o3-mini, o3, o3-pro, o4-mini |
| Parameters | Not disclosed |
| Predecessor | OpenAI o1 |
OpenAI o3 is a family of reasoning-focused large language models developed by OpenAI, representing the second generation of the company's o-series reasoning models. First announced on December 20, 2024, during the "12 Days of OpenAI" event, the o3 family was released in stages: o3-mini launched on January 31, 2025; the full o3 model and o4-mini arrived on April 16, 2025; and o3-pro became available on June 10, 2025. The o3 models build on the inference-time reasoning paradigm established by o1, with significant improvements in performance, tool use, and multimodal capabilities.[1][2][3]
The o3 family posted benchmark results that set new records for AI reasoning. On the ARC-AGI benchmark, a high-compute configuration of o3 scored 87.5%, a result that sparked widespread discussion about the proximity to artificial general intelligence. On AIME 2025, o3 scored 88.9%, and on GPQA Diamond it reached 87.7%. Most notably, o3 solved 25.2% of problems on EpochAI's Frontier Math benchmark, where no previous model had exceeded 2%.[1][4]
OpenAI first revealed the o3 model family on December 20, 2024, the final day of its "12 Days of OpenAI" event. CEO Sam Altman and SVP of Research Mark Chen presented the model's ARC-AGI results in person at ARC Prize's offices, where the 87.5% high-compute score was disclosed. The name skipped "o2" reportedly to avoid confusion with the British telecommunications company O2.[4][5]
The release followed a staggered schedule:
| Date | Model | Availability |
|---|---|---|
| December 20, 2024 | o3 (announced) | Benchmark results shared; safety testing began |
| January 31, 2025 | o3-mini | ChatGPT (all tiers including free); API |
| April 16, 2025 | o3, o4-mini | ChatGPT Plus, Pro, Team; API |
| June 10, 2025 | o3-pro | ChatGPT Pro, Team; API |
Like its predecessor o1, the o3 model uses extended chain-of-thought reasoning during inference. The model generates internal reasoning tokens that are hidden from the user, working through problems step by step before producing a final response. However, o3's reasoning capabilities are substantially more advanced than o1's. OpenAI reported that in evaluations by external experts, o3 makes 20% fewer major errors than o1 on difficult real-world tasks, with particular improvements in programming, business consulting, and creative ideation.[1]
The reasoning process in o3 is more flexible than in o1, with the model able to dynamically adjust its reasoning depth based on problem complexity. Simple queries receive relatively brief internal reasoning, while complex problems can trigger extended chains of thought spanning thousands of tokens.
One of the most significant advances in o3 is its ability to use tools during the reasoning process itself. For the first time in the o-series, o3 can agentically combine multiple tools within ChatGPT, including web search, Python code execution for data analysis, file analysis, and image generation. Previous reasoning models could only think and then produce text; o3 can interleave reasoning with actions, search for information mid-thought, run calculations to verify hypotheses, and incorporate external data into its reasoning chain.[1]
This capability makes o3 substantially more effective for complex research and analysis tasks that require gathering and synthesizing information from multiple sources.
Another major advance is o3's ability to integrate images directly into its chain of thought. Users can provide images as context, and the model can reason about visual information alongside text. OpenAI described this as "thinking with images," meaning the model can analyze diagrams, charts, photographs, and other visual inputs as part of its reasoning process rather than treating them as separate inputs to be described and then reasoned about textually.[1][7]
With the release of o3, OpenAI introduced reasoning summaries through the Responses API, partially addressing the transparency concerns that had surrounded o1's hidden chain of thought. While the raw reasoning tokens remain hidden, developers can access summarized versions of the model's reasoning process. The API also supports encrypted reasoning content that represents the model's reasoning state, persisted entirely on the client side. Developers can pass this encrypted state back to the API in subsequent requests, gaining the quality, cost, and latency benefits of reused reasoning without OpenAI retaining any reasoning data.[20]
Attempting to extract raw reasoning through methods other than the official reasoning summary parameter is not supported and may violate OpenAI's Acceptable Use Policy.[20]
Released on January 31, 2025, o3-mini was the first model in the o3 family to reach the public. It was made available to all ChatGPT users, including free-tier subscribers, and to API developers. o3-mini is a smaller, faster model optimized for cost-efficient reasoning, particularly in math, coding, and science tasks.[2]
o3-mini introduced three configurable reasoning effort levels: low, medium, and high. At medium effort, o3-mini matched o1's performance on most benchmarks while delivering faster responses and lower costs. The high-effort variant (o3-mini-high) was available to paid ChatGPT subscribers and provided additional reasoning depth.[2]
| o3-mini Effort Level | AIME 2024 | GPQA Diamond | Description |
|---|---|---|---|
| Low | 60.0% | 59.9% | Fastest responses, lowest cost |
| Medium | 79.6% | 76.0% | Matches o1 performance |
| High | 86.5% | 77.0% | Exceeds o1, best for hard problems |
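The three effort levels above correspond to a single request parameter. The sketch below assumes the `reasoning_effort` parameter OpenAI documents for o-series models; `o3_mini_request` is a hypothetical helper that only builds the request dict, with no network call.

```python
# Sketch: choosing an o3-mini reasoning effort level per request.
# Assumes the `reasoning_effort` request parameter; o3_mini_request is a
# hypothetical helper that builds the payload only (no network call).
EFFORT_LEVELS = ("low", "medium", "high")

def o3_mini_request(prompt: str, effort: str = "medium") -> dict:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # low = fastest/cheapest, high = deepest
        "messages": [{"role": "user", "content": prompt}],
    }

print(o3_mini_request("Factor 391 into primes.", effort="high")["reasoning_effort"])
```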
The full o3 model launched on April 16, 2025, alongside o4-mini. It represents OpenAI's most capable reasoning model at the time of release, with broad improvements over o1 across all benchmarks. o3 demonstrated particular strength in mathematical reasoning, coding, and scientific problem-solving, while also showing improved performance on creative and business tasks.[1]
Key technical capabilities of the full o3 model include:
- Agentic use of multiple tools (web search, Python execution, file analysis, image generation) within a single reasoning chain
- "Thinking with images": integration of visual inputs directly into the chain of thought
- Configurable reasoning effort levels (low, medium, high)
- Reasoning summaries and client-side encrypted reasoning state through the Responses API
Released on June 10, 2025, o3-pro is a variant of o3 designed for maximum reliability on difficult tasks. Like o1-pro before it, o3-pro uses additional compute during the reasoning phase to think longer and more thoroughly. It is specifically designed for users who prioritize correctness and depth over response speed.[3]
OpenAI tested o3-pro using a "4/4 reliability" metric, requiring the model to answer the same question correctly four times in a row. On this measure, o3-pro outperformed both o1-pro and the base o3 model. It also scored higher on clarity, instruction-following, and domain-specific strength in STEM, writing, and business contexts.[3]
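The 4/4 reliability metric is straightforward to state in code. A minimal sketch (the function name is illustrative, not OpenAI's):

```python
# Sketch of the "4/4 reliability" metric described above: a question
# counts as solved only if all four independent attempts are correct.
def four_of_four(attempts_per_question: list[list[bool]]) -> float:
    """Fraction of questions answered correctly on all 4 attempts."""
    solved = sum(1 for attempts in attempts_per_question
                 if len(attempts) == 4 and all(attempts))
    return solved / len(attempts_per_question)

# Even a model that is right 90% of the time per attempt clears 4/4
# only about 0.9**4 = 65.6% of the time if attempts are independent.
runs = [[True] * 4, [True, True, False, True], [True] * 4]
print(four_of_four(runs))  # 2 of 3 questions solved on all four attempts
```

This is why 4/4 reliability is a much stricter bar than single-attempt accuracy: per-attempt errors compound across the four trials.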
o3-pro integrates real-time web search, file analysis, visual reasoning, Python execution, and advanced memory features, addressing complex workflows in science, programming, business, and writing. On competitive programming, o3-pro achieved a Codeforces Elo of 2748, compared to 2517 for o3 at medium effort, an improvement of just over 230 points.[3][13]
The trade-off is speed: o3-pro responses take significantly longer than standard o3 responses. OpenAI acknowledged that responses "typically take longer" than o1-pro and recommended the model for the most challenging questions "where reliability matters more than speed, and waiting a few minutes is worth the tradeoff." o3-pro is available to ChatGPT Pro and Team subscribers and through the API.[3]
Also released on April 16, 2025, o4-mini is a smaller, cost-efficient reasoning model that achieves remarkable performance relative to its size and cost. Despite being positioned as a budget option, o4-mini produced some of the most impressive benchmark results in the o-series lineup, particularly in mathematics.[1][8]
Among OpenAI's benchmarked models, o4-mini posted the best results on AIME 2024 and AIME 2025. Without tools, it scored 92.7% on AIME 2025, surpassing even the full o3 model (88.9%). When given access to a Python interpreter, o4-mini achieved a near-perfect 99.5% pass@1 on AIME 2025, with 100% consensus@8.[8]
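The consensus@8 metric above can be sketched in a few lines: sample k answers and score the majority answer, whereas pass@1 scores a single sample. The function name and sample answers below are illustrative.

```python
from collections import Counter

# Sketch of the consensus@k metric cited above: sample k answers and
# score the most common one; pass@1 instead scores a single sample.
def consensus_at_k(samples: list[str]) -> str:
    """Return the most common answer among k sampled answers."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical 8 samples for one AIME-style problem:
answers = ["336", "336", "112", "336", "336", "48", "336", "336"]
print(consensus_at_k(answers))  # prints 336
```

Majority voting suppresses occasional sampling errors, which is why consensus@8 can reach 100% even when pass@1 is slightly below it.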
Like o3, o4-mini supports multimodal reasoning, tool use during thinking, and configurable reasoning effort levels. Its combination of strong performance and low cost makes it the recommended choice for most applications that need reasoning capabilities.
From a cost perspective, o4-mini delivers 13.6x cost savings over o1 while maintaining 85.9% accuracy on coding benchmarks. The Batch API further reduces prices by 50%, bringing input costs to $0.55 and output to $2.20 per million tokens. o4-mini also provides a 4x increase in context window compared to o3-mini (from 32K to 128K tokens) while remaining faster at the same reasoning effort level.[14]
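The pricing arithmetic above can be checked directly. A small sketch using the o4-mini list prices and the Batch API's 50% discount as stated in this section (the helper name is illustrative):

```python
# Worked cost arithmetic for the figures above: o4-mini list prices and
# the Batch API's 50% discount. Rates are USD per 1M tokens.
def cost_usd(tokens_in: int, tokens_out: int,
             in_rate: float, out_rate: float, batch: bool = False) -> float:
    discount = 0.5 if batch else 1.0
    return discount * (tokens_in * in_rate + tokens_out * out_rate) / 1e6

# o4-mini at $1.10 in / $4.40 out, for 1M input + 1M output tokens:
print(round(cost_usd(1_000_000, 1_000_000, 1.10, 4.40), 2))              # 5.5
print(round(cost_usd(1_000_000, 1_000_000, 1.10, 4.40, batch=True), 2))  # 2.75
```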
The o3 family demonstrated substantial improvements over o1 across all major benchmarks.
| Benchmark | o3 | o1 | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 | 91.6% | 74.3% | 13.4% | American Invitational Mathematics Exam |
| AIME 2025 | 88.9% | 79.2% | - | American Invitational Mathematics Exam |
| GPQA Diamond | 87.7% | 78.0% | 53.6% | Graduate-level science questions |
| Frontier Math | 25.2% | <2% | <2% | Research-level mathematics (EpochAI) |
| SWE-bench Verified | 71.7% | 48.9% | 33.2% | Real-world software engineering tasks |
| Codeforces Elo | 2727 | 1891 | - | Competitive programming rating |
| ARC-AGI (low compute) | 75.7% | - | 5% | Visual abstract reasoning |
| ARC-AGI (high compute) | 87.5% | - | - | Visual abstract reasoning (172x compute) |
| MMLU | 92.4% | 92.3% | 87.2% | Multitask language understanding |
The Frontier Math result was particularly striking. This benchmark consists of research-level mathematics problems that had stumped all previous models (none exceeding 2%). o3's 25.2% score represented a qualitative leap in mathematical reasoning capability.[1]
| Benchmark | o4-mini | o3 | o3-mini (high) | o1 |
|---|---|---|---|---|
| AIME 2025 | 92.7% | 88.9% | 86.5% | 79.2% |
| AIME 2025 (with tools) | 99.5% | - | - | - |
| GPQA Diamond | 81.4% | 83.3% | 77.0% | 78.0% |
| SWE-bench Verified | 68.1% | 69.1% | 49.3% | 48.9% |
| HumanEval | 98.2% | 97.6% | - | 92.4% |
The following table summarizes the key characteristics of all models in OpenAI's o-series reasoning lineup as of mid-2025.
| Model | Release Date | Reasoning Effort Levels | Tool Use | Multimodal | API Input (per 1M tokens) | API Output (per 1M tokens) | Best For |
|---|---|---|---|---|---|---|---|
| o1-mini | Sep 12, 2024 | No | No | No | $3.00 | $12.00 | Budget STEM reasoning |
| o1 | Dec 5, 2024 | Low/Med/High | Yes | Yes (Dec 2024) | $15.00 | $60.00 | Complex reasoning tasks |
| o1-pro | Dec 5, 2024 | No | Yes | Yes | $150.00 | $600.00 | Maximum o1 reliability |
| o3-mini | Jan 31, 2025 | Low/Med/High | Limited | No | $1.10 | $4.40 | Cost-efficient reasoning |
| o3 | Apr 16, 2025 | Low/Med/High | Yes (agentic) | Yes | $2.00 | $8.00 | Flagship reasoning |
| o4-mini | Apr 16, 2025 | Low/Med/High | Yes (agentic) | Yes | $1.10 | $4.40 | Best value reasoning |
| o3-pro | Jun 10, 2025 | No (always max) | Yes (agentic) | Yes | $20.00 | $80.00 | Highest reliability |
A notable pattern in the pricing evolution is that o3 became dramatically cheaper over time. In June 2025, OpenAI reduced o3's API pricing by 80%, bringing it from $10/$40 (input/output per million tokens) to $2/$8. This price drop, combined with the introduction of o3-pro, reflected OpenAI's strategy of making high-quality reasoning accessible at lower price points while offering premium options for maximum reliability.[10]
The most discussed aspect of o3's announcement was its performance on the ARC-AGI benchmark. ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) is a benchmark designed to test the ability of AI systems to solve novel visual reasoning tasks that require human-like abstraction abilities. It was created by AI researcher Francois Chollet specifically as a test that would be difficult for systems relying on pattern matching rather than genuine reasoning.[4]
Prior to o3, the best AI performance on ARC-AGI had been around 5% (GPT-4o). The jump to 75.7% at the standard compute budget, and 87.5% at high compute (172x), was described by ARC Prize as "a surprising and important step-function increase in AI capabilities, showing novel task adaptation ability never seen before in the GPT-family models."[4]
Francois Chollet provided detailed cost breakdowns for o3's ARC-AGI performance. In high-efficiency mode, o3 scored 75.7% at approximately $20 per task. Running o3 in this mode against all 400 public ARC-AGI puzzles cost $6,677 and yielded a score of 82.8%. The high-compute configuration, which achieved the headline 87.5% score, cost roughly $1,000 per task, with estimated total costs of approximately $1,148,444 for the full evaluation run.[4][15]
Chollet noted that while o3 was approaching human levels of performance, it "comes at a steep cost, and wouldn't quite be economical yet." He pointed out that humans could solve ARC-AGI tasks for roughly $5 per task "while consuming mere cents in energy." However, Chollet expressed optimism that "cost-performance will likely improve quite dramatically over the next few months and years."[4][15]
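The per-task economics above follow from simple division over the reported totals (the helper name is illustrative; figures are the ones cited in this section):

```python
# Reproducing the per-task cost arithmetic reported above for o3's
# high-efficiency ARC-AGI run: $6,677 across the 400 public tasks.
def per_task_cost(total_usd: float, n_tasks: int) -> float:
    return total_usd / n_tasks

print(round(per_task_cost(6_677, 400), 1))      # roughly $16.7 per task
print(round(per_task_cost(6_677, 400) / 5, 1))  # ~3.3x the ~$5 human cost
```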
The ARC Prize organization was careful to note that "passing ARC-AGI does not equate to achieving AGI." They pointed out that o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence. The high-compute score also came at significant cost, making it impractical for most applications. Nevertheless, the result reignited public debate about the timeline to artificial general intelligence and whether reasoning-focused models represented a viable path toward it.[4]
The ARC-AGI result also prompted the creation of ARC-AGI-2, a harder successor benchmark. When GPT-5.2 was evaluated on ARC-AGI-2 in December 2025, it scored 52.9%, showing continued progress but also demonstrating that substantial challenges remained in abstract reasoning.[11]
The o3 family exists in direct competition with DeepSeek-R1, the open-source reasoning model released by the Chinese company DeepSeek in January 2025. The rivalry between these two model families has defined the reasoning model landscape throughout 2025.
| Benchmark | o3 | DeepSeek-R1 | DeepSeek-R1-0528 |
|---|---|---|---|
| AIME 2024 | 91.6% | 79.8% | 91.4% |
| AIME 2025 | 88.9% | 70.0% | 87.5% |
| GPQA Diamond | 87.7% | 71.5% | - |
| SWE-bench Verified | 71.7% | 49.2% | - |
| Codeforces Elo | 2727 | 2029 | ~1930 |
o3 leads decisively in coding tasks (Codeforces, SWE-bench) and science (GPQA Diamond), while the updated DeepSeek-R1-0528 has closed the gap significantly on mathematics benchmarks. The key differentiator beyond performance is cost and accessibility: R1 is available under the MIT license for self-hosting at zero API cost, and even through DeepSeek's API, it costs approximately 3-4 times less than o3 after OpenAI's June 2025 price cuts.[16]
The competition has been mutually beneficial for the field. DeepSeek's demonstration that reasoning models could be built cheaply and openly pressured OpenAI to reduce prices, while OpenAI's continued performance leadership on harder benchmarks pushed DeepSeek to improve R1 with the 0528 update.
o3 uses a dense transformer architecture where all parameters are active for every task, ensuring consistent performance but requiring more computational resources. DeepSeek-R1 uses a Mixture of Experts architecture that activates only 37 billion of its 671 billion total parameters per token, allowing it to achieve inference costs comparable to a much smaller model. This architectural choice is a key reason for R1's cost advantage.[16]
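The active-parameter gap can be made concrete with a quick calculation using the figures stated above (the helper name is illustrative):

```python
# Back-of-envelope for the architectural contrast above: DeepSeek-R1's
# MoE design activates 37B of its 671B total parameters per token,
# while a dense model activates all of its parameters on every token.
def active_fraction(active_b: float, total_b: float) -> float:
    return active_b / total_b

print(f"{active_fraction(37, 671):.1%}")  # about 5.5% of parameters per token
```

Activating only ~5.5% of weights per token is what lets R1's per-token inference cost resemble that of a much smaller dense model.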
OpenAI released GPT-5 in mid-2025 as a unified model that merges general-purpose language capabilities with reasoning. The relationship between o3 and GPT-5 reflects two different philosophies about how to build capable AI systems.[11][12]
GPT-5 is a unified model that automatically switches between fast and deep thinking modes based on the query's complexity. It uses an intelligent routing system that analyzes conversation type, complexity, tool needs, and user intent to determine whether to use quick response generation or engage a deeper "GPT-5 thinking" mode. OpenAI reports the router correctly identifies complexity in 94% of cases. This design prioritizes ease of use: the user does not need to choose a model or configure reasoning effort.[11]
o3, by contrast, is a specialist. It is specifically trained and optimized for deep reasoning tasks. When engaged, o3 tends to go deep on problems, following extended chains of reasoning, using tools to verify hypotheses, and systematically exploring solution spaces. It gives developers explicit control over reasoning effort and is designed for applications where thoroughness matters more than speed.[1]
In practice, GPT-5 with thinking enabled performs comparably to o3 on many benchmarks while using 50-80% fewer output tokens. GPT-5's responses are also reported to be roughly 80% less likely to contain factual errors than o3's. However, o3 and o3-pro remain the preferred choice for the most demanding reasoning tasks, where the additional depth and reliability justify the specialized model.[11][12]
OpenAI has indicated that GPT-5 and the o-series will continue to coexist, with GPT-5 serving as the default general-purpose model and the o-series models available for tasks requiring maximum reasoning depth.
Between the December 2024 announcement and the April 2025 release of the full o3 model, OpenAI conducted extensive safety testing. The gap between announcement and release was partly driven by the need for external safety evaluations, which included testing by the US AI Safety Institute (USAISI) and the UK AI Safety Institute (UKAISI), as well as external red-teamers.[19]
OpenAI applied its deliberative alignment safety framework to o3, the same approach used for o1. The model was trained to reason explicitly about safety policies within its chain of thought, considering whether a given query might require a refusal or a careful response. Because o3's reasoning chains were more sophisticated than o1's, the deliberative alignment process was reported to be more effective, with o3 achieving a Pareto improvement over o1 on both over-refusal and under-refusal metrics.[19]
The o3-mini release in January 2025 served partly as a lower-risk deployment that allowed OpenAI to gather real-world data about the model's behavior before releasing the full o3. By limiting o3-mini's tool-use capabilities relative to the full model, OpenAI was able to test the reasoning approach at scale while reducing the risk surface area.
The pricing story of the o3 family illustrates the rapid deflation of reasoning model costs throughout 2025. When o3 first launched in April 2025, it was priced at $10 per million input tokens and $40 per million output tokens. Two months later, in June 2025, OpenAI reduced these prices by 80% to $2/$8, making o3 cheaper than the original o1 had been at $15/$60.[10]
This pricing trajectory was driven by several factors: competitive pressure from DeepSeek-R1 (priced at $0.55/$2.19), improvements in inference efficiency, and OpenAI's strategic desire to make reasoning models accessible to a broader developer base. The introduction of o3-pro at $20/$80 created a clear tiering structure: developers could choose between budget reasoning (o4-mini at $1.10/$4.40), mainstream reasoning (o3 at $2/$8), and premium reliability (o3-pro at $20/$80).[10][14]
For high-volume applications, the Batch API provides an additional 50% discount on all o-series models, making o4-mini available at $0.55/$2.20 per million tokens. This pricing puts sophisticated reasoning capabilities within reach of individual developers and small startups, a significant shift from the early days of o1 when reasoning was effectively a premium product.
Developer adoption of the o3 family has been shaped by the convergence trend in AI models during 2025. By mid-to-late 2025, reasoning depth, tool use, and conversational quality increasingly lived inside the same flagship model line, with model selection becoming more about cost, latency, and quality tradeoffs than choosing between fundamentally different model families.[14]
For most production applications, o4-mini has emerged as the preferred reasoning model due to its combination of strong performance and low cost. At $1.10/$4.40 per million tokens (input/output), it costs roughly half as much as o3 at its post-cut $2/$8 pricing, and nearly 10x less than o3's original launch pricing, while maintaining competitive accuracy. The Batch API reduces costs further, making o4-mini particularly attractive for high-volume applications.[14]
o3 itself is typically reserved for tasks requiring maximum reasoning depth, such as complex scientific analysis, multi-step mathematical proofs, and sophisticated code generation. The availability of reasoning effort levels allows developers to fine-tune the tradeoff between cost and thoroughness on a per-query basis.
With the release of GPT-5 in August 2025, some developers migrated away from the o-series entirely, preferring GPT-5's unified model that automatically engages reasoning when needed. However, developers working on tasks requiring maximum reasoning depth continue to prefer o3 and o3-pro for their explicit control and dedicated optimization.
As of March 2026, the o3 family represents OpenAI's primary reasoning model lineup. With the April 2025 release, o3 and o4-mini replaced o1 and o3-mini in the ChatGPT model selector for Plus, Pro, and Team users. o3-pro replaced o1-pro for Enterprise and Edu users.[6]
The o-series reasoning models coexist with OpenAI's GPT-5 family, which took over as the default model in ChatGPT. While GPT-5 handles the majority of everyday tasks and can engage its own thinking mode when needed, o3 and o3-pro remain available for users and developers who need dedicated deep reasoning capabilities.
OpenAI has not publicly confirmed a successor to o3, though the existence of o4-mini (released alongside o3) suggests that the o-series numbering will continue. The pattern of releasing a flagship reasoning model alongside a cost-efficient mini variant appears to be OpenAI's standard approach for the o-series going forward.
The broader impact of the o3 family extends beyond OpenAI's own products. The benchmark results, particularly on ARC-AGI and Frontier Math, have pushed competing labs to invest more heavily in reasoning-focused models. Google's Gemini 2.5 Pro, Anthropic's Claude models with extended thinking, and DeepSeek's R1 series all reflect the competitive pressure created by o3's capabilities.