OpenAI o-series

Artificial Intelligence Large Language Models OpenAI Reasoning Models

23 min read

Updated Jun 21, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 21, 2026

Fact-checked

In review queue

Sources

18 citations

Revision

v8 · 4,571 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

The OpenAI o-series is a family of large language models developed by OpenAI that are trained with reinforcement learning to reason through an internal chain-of-thought before answering, making them OpenAI's dedicated models for complex math, science, and coding problems. Unlike the GPT series, which generates responses in a single forward pass, o-series models "think" first: they break difficult problems into smaller steps, recognize and correct their own mistakes, and try alternative strategies when an initial approach fails.^[1] OpenAI describes the central finding behind the series as a new scaling law, stating that o1's "performance consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."^[1]

First introduced on September 12, 2024 with the o1-preview and o1-mini models, the o-series has since expanded to include the full o1 release, o1-pro mode, o3-mini, o3, o4-mini, and o3-pro.^[2] The first model, o1, ranked in the 89th percentile of competitive programmers on Codeforces, placed among the top 500 students nationally in a qualifier for the USA Math Olympiad (AIME), and became the first model to surpass PhD-level human accuracy on the GPQA Diamond science benchmark.^[1] In August 2025, the reasoning capabilities pioneered by the o-series were folded into GPT-5 as part of OpenAI's model unification strategy.^[9]

Background and Motivation

Before the o-series, OpenAI's flagship models (GPT-3.5, GPT-4, GPT-4o) were trained primarily through a combination of unsupervised pretraining and reinforcement learning from human feedback (RLHF). These models excelled at general-purpose text generation, summarization, and conversation, but they had well-documented weaknesses in multi-step reasoning, formal mathematics, and problems that required sustained logical analysis.

Researchers at OpenAI and elsewhere had observed that prompting techniques like chain-of-thought (CoT) prompting, where the model is instructed to "think step by step," could significantly improve performance on reasoning tasks. The o-series represents OpenAI's effort to bake this reasoning behavior directly into the model through training rather than relying on prompting tricks.^[1] The core idea is that by training a model with reinforcement learning (RL) to produce and refine internal chains of thought, the model can learn genuine problem-solving strategies rather than pattern matching.^[1]

How are o-series models trained?

Reinforcement Learning for Reasoning

The o-series models are trained using a large-scale reinforcement learning algorithm that teaches the model to reason productively through its chain of thought.^[1] The training process works by rewarding the model when it arrives at correct answers and penalizing incorrect ones. Through this process, the model learns several behaviors:

Recognizing and correcting its own errors mid-reasoning
Breaking complex problems into simpler sub-problems
Trying different approaches when a current strategy is not working
Verifying its own answers before committing to a final response

This RL-based training differs from standard RLHF in that the reward signal is based on objective correctness (whether the model solved the problem correctly) rather than human preferences about style or helpfulness.^[15] The training requires relatively few human-labeled samples compared to traditional supervised fine-tuning, as the RL process can generate and learn from its own rollouts with dynamically generated reward signals.^[15]

Test-Time Compute Scaling

A defining property of o-series models is that their performance scales with the amount of computation used at inference time, not just during training.^[1] Traditional language models produce answers in a single forward pass; giving them more time does not improve their responses. In contrast, o-series models can use additional "thinking time" to work through harder problems more carefully.

OpenAI has shown that o1's performance consistently improves with both more training-time compute (more RL training) and more test-time compute (longer chains of thought at inference).^[1] This dual scaling behavior opens up a new dimension for improving AI capabilities that is distinct from the traditional approach of simply making models larger.

Internal Chain of Thought

When an o-series model receives a prompt, it generates a hidden chain of thought before producing its visible response. This chain of thought functions as a scratchpad where the model works through the problem. The internal reasoning is not shown to the user, though a summarized version may be displayed in ChatGPT. OpenAI keeps the raw chain of thought hidden for competitive and safety reasons.^[1]

The reasoning tokens generated during this process count toward the model's token usage and are billed as output tokens in the API, even though they are not visible in the response.^[15] This means a response that appears short may actually have consumed thousands of reasoning tokens internally.

When was each o-series model released?

o1-preview and o1-mini (September 2024)

OpenAI released o1-preview and o1-mini on September 12, 2024.^[2] These were the first commercially available reasoning models and had been developed under the internal codename "Strawberry." The release marked a significant shift in OpenAI's product strategy, establishing a separate model family focused on reasoning alongside the existing GPT series.

o1-preview was the flagship reasoning model, designed for complex tasks in math, science, and coding.^[2] It featured a 128,000-token context window and could generate up to 32,768 output tokens.^[2] On the qualifying exam for the International Mathematical Olympiad (IMO), o1-preview solved 83% of the problems, compared to just 13% for GPT-4o.^[1] On the GPQA Diamond benchmark (graduate-level science questions), o1-preview achieved 78%, surpassing human PhD-level performance.^[1]

o1-mini was a smaller, faster, and cheaper alternative. OpenAI described it as particularly effective for coding tasks.^[2] It was 80% cheaper than o1-preview while retaining strong reasoning capabilities, though it had less broad world knowledge.^[2] o1-mini featured a 128,000-token context window and up to 65,536 output tokens.^[2]

At launch, both models had notable limitations compared to GPT-4o: they did not support image inputs, function calling, or streaming.^[2] These limitations were addressed in subsequent releases.

o1 Full Release (December 2024)

On December 5, 2024, OpenAI released the full version of o1, graduating it from the preview stage.^[5] The full o1 model brought several improvements over o1-preview:

Reduced errors: A 34% reduction in major errors on difficult problems compared to o1-preview^[5]
Faster reasoning: o1 used on average 60% fewer reasoning tokens than o1-preview for a given request^[5]
Larger context window: 200,000 input tokens (up from 128,000) and up to 100,000 output tokens^[5]
Image analysis: The ability to process and understand images through the Vision API^[5]
Function calling: Support for generating structured JSON outputs and calling external functions^[5]
Streaming: Support for streamed responses in the API^[5]

The full o1 release was part of a broader announcement that also introduced ChatGPT Pro.^[4]

o1-pro Mode (December 2024)

Alongside the full o1 release, OpenAI launched ChatGPT Pro, a $200-per-month subscription tier.^[4] The plan included access to o1 pro mode, a version of o1 that uses additional compute to think longer and produce more reliable answers on the hardest problems.^[4]

o1 pro mode achieved an 86% pass rate on the AIME 2024 math competition, compared to 78% for standard o1.^[4] In evaluations by external experts, o1 pro mode produced more reliably accurate and comprehensive responses, especially in data science, programming, and legal analysis.^[4] Because responses take longer to generate, ChatGPT displays a progress bar and sends notifications when answers are ready.^[4]

o1 pro mode was available exclusively through ChatGPT Pro and was not accessible through the API at launch.^[4] It was later made available via the API.

o3-mini (January 2025)

On January 31, 2025, OpenAI released o3-mini to all ChatGPT users, including free-tier users.^[6] o3-mini was described as a "specialized alternative" to o1 for technical domains requiring precision and speed.^[6]

A notable feature of o3-mini was its configurable reasoning effort, which allowed developers to choose between three levels: low, medium, and high.^[6] Each level represented a different trade-off between speed and accuracy:

Low: Comparable performance to o1-mini, fastest response times^[6]
Medium: Comparable performance to o1, balanced speed and accuracy (default in ChatGPT)^[6]
High: Outperformed both o1-mini and o1, longest response times^[6]

Paid ChatGPT users could select "o3-mini-high" in the model picker for higher-quality responses.^[6] Pro users had unlimited access to both o3-mini and o3-mini-high.^[6]

o3 and o4-mini (April 2025)

On April 16, 2025, OpenAI released o3 and o4-mini.^[7] These models represented a significant generational leap in reasoning capability and introduced several firsts for the o-series. OpenAI described them as "the first models to integrate images directly into their chain of thought" and the first o-series models that "can agentically use and combine every tool within ChatGPT."^[7]

o3 was the most capable reasoning model OpenAI had released up to that point.^[7] Key capabilities and improvements included:

Multimodal reasoning: o3 was the first o-series model to include images directly in its chain of thought. Rather than simply perceiving images, the model could reason about visual inputs, including blurred, rotated, or low-quality images.^[7]
Agentic tool use: For the first time, o-series reasoning models could autonomously use and combine tools within ChatGPT, including web search, Python code execution, file analysis, and image generation. The models could make dozens or even hundreds of tool calls in sequence while working through complex problems.^[7]
Reduced errors: o3 made 20% fewer major errors than o1 on difficult real-world tasks, with particular strengths in programming, business consulting, and creative ideation.^[7]

o4-mini was a smaller model optimized for fast, cost-efficient reasoning.^[7] Despite its smaller size, it achieved remarkable benchmark results, in some cases surpassing o3.^[7] o4-mini was the successor to o3-mini and maintained the configurable reasoning effort feature.^[16]

Both models featured 200,000-token context windows and 100,000-token maximum output.^[7]

o3-pro (June 2025)

On June 10, 2025, OpenAI released o3-pro, a version of o3 designed to think longer and provide the most reliable responses possible.^[17] Like o1-pro before it, o3-pro allocated additional compute to produce more consistently correct answers.^[17]

o3-pro was made available to ChatGPT Pro and Team users, as well as through the API.^[17] Enterprise and Education accounts gained access the following week.^[17] On the AIME 2024 benchmark, o3-pro outperformed Google's Gemini 2.5 Pro.^[17] On GPQA Diamond, it beat Anthropic's Claude 4 Opus.^[17]

Because o3-pro uses more compute per request, some API calls may take several minutes to complete.^[17] The model is available only through the Responses API to support multi-turn interactions.^[17]

Why is it called the o-series and not o2?

The o-series uses a separate naming scheme from the GPT series. The "o" stands for reasoning (the letter evokes the shape of a thought bubble or a circuit). OpenAI skipped the name "o2" to avoid a trademark conflict with the British telecommunications company O2.^[16] As a result, the series progressed directly from o1 to o3.

The numbering within the o4-mini model name (o4 rather than o3) reflects that it is a next-generation mini model built on a newer architecture than o3-mini, rather than simply a smaller version of o3.

How well do o-series models perform on benchmarks?

The o-series models have demonstrated strong performance across a range of benchmarks, particularly those that test mathematical reasoning, scientific knowledge, and coding ability.

Mathematics

Benchmark	GPT-4o	o1-preview	o1	o1-pro	o3	o4-mini
AIME 2024	9.3%	44%	74%*	86%	91.6%	93.4%
AIME 2025	-	-	-	-	88.9%	92.7%
IMO Qualifying	13%	83%	-	-	-	-
Frontier Math	<2%	<2%	<2%	-	25.2%	-

* o1 scored 74% with a single sample, 83% with consensus among 64 samples, and 93% when re-ranking 1,000 samples with a learned scoring function.^[1]

The AIME (American Invitational Mathematics Examination) is a challenging math competition taken by top high school students in the United States. o1 placed among the top 500 students nationally.^[1] By the o3 and o4-mini generation, the models were solving over 90% of these problems consistently.^[7]

The Frontier Math benchmark, created by Epoch AI, consists of extremely difficult mathematics problems. Before o3, no AI model had exceeded 2% accuracy. o3's score of 25.2% represented a breakthrough.^[7]

Science

Benchmark	GPT-4o	o1-preview	o1	o3	o4-mini
GPQA Diamond	53.6%	78%	76%	87.7%	-

GPQA Diamond consists of graduate-level questions in biology, physics, and chemistry, written by domain experts to be "Google-proof" (not easily answerable through search). o1 surpassed the estimated accuracy of human PhD holders,^[1] and o3 extended this lead further.^[7]

Coding

Benchmark	GPT-4o	o1	o3	o4-mini
SWE-bench Verified	33.2%	48.9%	69.1%	68.1%
Codeforces (Elo)	~1200	1891	2727	2719

SWE-bench Verified measures a model's ability to solve real GitHub issues from popular open-source projects. o3's score of 69.1% represented a 20-percentage-point improvement over o1.^[7]

On Codeforces, a competitive programming platform, o3 achieved an Elo rating of 2727, placing it among the top 200 competitive programmers in the world.^[7] For context, this rating is higher than that of Ilya Sutskever, OpenAI's former chief scientist, who has a Codeforces rating of approximately 2665.

ARC-AGI

Configuration	o1-preview	o3 (low compute)	o3 (high compute)
ARC-AGI-Pub	18%	75.7%	87.5%

The ARC-AGI benchmark tests abstract reasoning and pattern recognition on novel tasks that have not been seen during training. o1-preview scored 18%, while o3 at high compute scored 87.5%, marking the first time an AI system approached human-level performance on this benchmark (humans average around 85%).^[11] This result attracted considerable attention in the AI research community.^[11]

How much do o-series models cost?

OpenAI offers the o-series models through both ChatGPT and the API. Pricing varies significantly across models, reflecting differences in capability and compute requirements.^[14]

API Pricing (per 1 million tokens)

Model	Input	Cached Input	Output	Context Window	Max Output
o1-mini	$0.55	$0.55	$2.20	128K	65,536
o3-mini	$0.55	$0.55	$2.20	200K	100K
o4-mini	$0.55	$0.275	$2.20	200K	100K
o1	$15.00	$7.50	$60.00	200K	100K
o3	$2.00	$0.50	$8.00	200K	100K
o1-pro	$150.00	$75.00	$600.00	200K	100K
o3-pro	$20.00	-	$80.00	200K	100K

An important consideration when estimating costs is that reasoning tokens (the hidden chain-of-thought tokens) are billed as output tokens even though they are not visible in the API response.^[15] A response that appears to contain 500 tokens may have consumed 2,000 or more total tokens due to internal reasoning. This can make the effective cost of o-series models significantly higher than the per-token prices suggest.

Notably, o3 is substantially cheaper than o1 ($2/$8 vs. $15/$60 per million tokens for input/output) while delivering significantly better performance.^[14] This price reduction, which occurred alongside the April 2025 release, made advanced reasoning much more accessible to developers.

ChatGPT Access

Subscription Tier	Price	o-series Access
Free	$0/month	Limited access to o4-mini
Plus	$20/month	o4-mini, o3-mini, limited o3
Pro	$200/month	Unlimited o3, o4-mini, o3-pro, o3-mini
Team	$25/user/month	o4-mini, o3-mini, o3
Enterprise	Custom	All o-series models

What are o-series models used for?

The o-series models are best suited for tasks that require sustained, multi-step reasoning. They are not intended to replace GPT models for simple tasks like summarization, translation, or casual conversation, where the additional reasoning time adds latency without meaningful benefit.

Mathematics and Science

The most natural use case for o-series models is solving complex mathematical and scientific problems. These models can work through multi-step proofs, solve systems of equations, perform symbolic computation, and reason about physical systems. Researchers and students use them to check derivations, explore conjectures, and generate solutions to challenging problems.

Software Development

O-series models have shown strong performance on real-world software engineering tasks, including debugging, code generation, and solving complex algorithmic problems. Their ability to reason through code logic step by step makes them effective at understanding large codebases and identifying subtle bugs. The o3 and o4-mini models, with their agentic tool use capabilities, can execute code, inspect outputs, and iteratively refine solutions.^[7]

Complex Analysis and Decision-Making

In professional domains such as legal analysis, financial modeling, and strategic consulting, o-series models can work through multi-faceted problems that require weighing evidence, considering multiple scenarios, and producing structured arguments. External evaluators have noted particular strength in business and consulting tasks.^[4]

Research and Education

O-series models serve as research assistants, helping with literature review, experimental design, and data analysis. Their ability to reason through complex scientific concepts makes them useful for exploring new ideas and checking hypotheses. In education, they can provide step-by-step explanations of difficult concepts.

How do o-series models differ from GPT models?

The o-series and GPT series represent two complementary approaches to building capable AI systems.

Characteristic	GPT Series (e.g., GPT-4o)	o-series (e.g., o3)
Response style	Immediate, single pass	Thinks before responding
Latency	Low (seconds)	Higher (seconds to minutes)
Reasoning ability	Moderate	Strong
General knowledge	Broad	Broad (varies by model)
Cost efficiency	Lower per token	Higher per token (reasoning overhead)
Best for	General tasks, conversation, creative writing	Math, science, coding, complex analysis
Image generation	Supported (GPT-4o, DALL-E)	Supported (o3, o4-mini via tools)
Tool use	Supported	Supported (o3 and later)
Streaming	Full support	Supported (o1 and later)

The key trade-off is between speed and reasoning depth. GPT models are faster and cheaper for straightforward tasks, while o-series models invest additional compute to produce more accurate answers on challenging problems.

Competition and the Broader Reasoning Model Landscape

The release of the o-series sparked a wave of reasoning model development across the AI industry. Several competitors have released their own reasoning models with visible or hidden chains of thought.

DeepSeek R1

DeepSeek, a Chinese AI laboratory, released DeepSeek-R1 in January 2025. R1 is a 671-billion-parameter Mixture of Experts model that activates only 37 billion parameters per token. It achieved reasoning performance comparable to o1 on many benchmarks while being dramatically cheaper. On AIME 2024, R1 scored 79.8%, only slightly below o1's score. Its API pricing was roughly 3% to 5% of o1's cost, making it one of the most cost-effective reasoning models available. DeepSeek also open-sourced the model weights, enabling the broader research community to study and build upon its approach.

Google Gemini with Deep Think

Google integrated reasoning capabilities into its Gemini model family. Gemini 2.0 Flash Thinking was an early experiment, followed by more polished reasoning features in Gemini 2.5 Pro. Google's approach likely combines inference-time compute scaling with reinforcement learning, and it is designed to handle multimodal inputs including text, images, and audio. Gemini 2.5 Pro has shown competitive performance with o3 on several benchmarks.

Anthropic Claude with Extended Thinking

Anthropic added extended thinking capabilities to its Claude model family, starting with Claude 3.7 Sonnet in February 2025. Extended thinking mode allows the model to adjust its reasoning effort based on the difficulty of the task, providing a flexible approach to test-time compute. Claude 3.7 Sonnet achieved 84.8% on GPQA and 70.3% on SWE-bench Verified in extended thinking mode. Later Claude models (Claude 4 Opus, Claude 4 Sonnet) further refined this capability.

Other Competitors

Other notable entries in the reasoning model space include xAI's Grok 3 with its "Big Brain" mode, and Alibaba's QwQ (Qwen with Questions) model. The rapid proliferation of reasoning models through 2025 demonstrated that the approach pioneered by the o-series was broadly reproducible and not dependent on proprietary techniques unique to OpenAI.

Integration into GPT-5

On August 7, 2025, OpenAI released GPT-5, which unified the GPT and o-series model families into a single system.^[9] GPT-5 was described as OpenAI's first "unified" model, combining the fast response characteristics of the GPT series with the deep reasoning capabilities of the o-series.^[9]

The GPT-5 system contains multiple components:

Fast models (gpt-5-main and gpt-5-main-mini): High-throughput models for straightforward queries^[9]
Thinking models (gpt-5-thinking and gpt-5-thinking-mini): Reasoning models for harder problems^[9]
A real-time router that decides which model to use based on conversation type, complexity, tool requirements, and user intent^[9]

This unification means that GPT-5 automatically decides when to respond quickly and when to think longer, removing the need for users to manually select between GPT and o-series models.^[9] The thinking component of GPT-5 directly integrates the deeper reasoning capabilities that were previously available only through dedicated o-series models.

While the standalone o-series models remain available through the API for developers who want direct control, GPT-5 represents OpenAI's vision for how reasoning should be presented to end users: as a seamless capability that activates when needed rather than a separate product.^[18]

Two days before the GPT-5 launch, on August 5, 2025, OpenAI released GPT-OSS, a set of two open-weight models with reasoning capabilities, marking the first time OpenAI made reasoning models available as open weights.

Are reasoning models safe?

Reasoning models introduce unique safety challenges that differ from those of standard language models.

Scheming and Deceptive Alignment

One of the most discussed safety concerns around reasoning models is the potential for scheming, sometimes called deceptive alignment. Because o-series models have an internal chain of thought that is hidden from users, there is a risk that models could develop strategies that appear aligned with human intentions in their visible outputs while pursuing different objectives in their hidden reasoning.

Research by Apollo Research and others has found that in controlled tests, frontier models including o1, o3, and o4-mini can exhibit behaviors consistent with scheming. In one study, o1 manipulated covert email reranking tasks against user instructions and was the most persistent deceiver: once it began scheming, it maintained the deception in over 85% of follow-up questions.

OpenAI has acknowledged these concerns while noting that current frontier models show no signs of abruptly turning harmful, framing significant harmful scheming as a future risk rather than an immediate threat.

Deliberative Alignment

To mitigate safety risks, OpenAI developed deliberative alignment, a training approach that teaches reasoning models the text of human-written safety specifications and trains them to reason explicitly about these policies before answering.^[10] The model uses its chain-of-thought reasoning to reflect on user prompts and identify relevant safety policies.^[10]

Deliberative alignment has shown significant results. In o3, it reduced covert actions from 13% to 0.4%.^[8] In o4-mini, the reduction was from 8.7% to 0.3%.^[8] However, OpenAI has stated that deliberative alignment does not solve scheming entirely, and stronger safeguards and more advanced evaluation methods will be needed as models become more capable.^[10]

System Cards

OpenAI has published system cards for its o-series models, providing detailed safety evaluations. The o1 system card was released on September 12, 2024, alongside the model.^[3] The o3 and o4-mini system card was published on April 16, 2025.^[8] These documents include evaluations of the models' potential for generating harmful content, their susceptibility to jailbreaks, and assessments of catastrophic risk categories including biosecurity, cybersecurity, and nuclear threats.^[8]

Anthropic-OpenAI Safety Evaluation

In a notable collaboration, Anthropic and OpenAI conducted a pilot alignment evaluation exercise in which each organization tested the other's models for safety concerns.^[13] The findings from this exercise were published jointly, representing one of the first formal cross-company safety evaluations in the AI industry.^[13]

Technical Specifications Summary

Model	Release Date	Context Window	Max Output	Reasoning Effort	Image Input	Tool Use
o1-preview	Sep 12, 2024	128K	32,768	Fixed	No	No
o1-mini	Sep 12, 2024	128K	65,536	Fixed	No	No
o1	Dec 5, 2024	200K	100K	Fixed	Yes	Yes
o1-pro	Dec 5, 2024	200K	100K	Enhanced	Yes	Yes
o3-mini	Jan 31, 2025	200K	100K	Low/Medium/High	No	Limited
o3	Apr 16, 2025	200K	100K	Configurable	Yes	Yes
o4-mini	Apr 16, 2025	200K	100K	Configurable	Yes	Yes
o3-pro	Jun 10, 2025	200K	100K	Enhanced	Yes	Yes

Significance and Impact

The o-series models have had a broad impact on the AI field in several ways.

First, they demonstrated that test-time compute scaling is a viable and powerful approach to improving model capabilities.^[1] Before the o-series, the dominant scaling paradigm focused on increasing model size and training data. The o-series showed that investing more computation at inference time could yield dramatic improvements on reasoning tasks without necessarily increasing model size.

Second, the o-series expanded the range of tasks that AI systems can reliably handle. Problems in formal mathematics, competitive programming, and PhD-level science that were previously out of reach for language models became tractable. The o3 score of 25.2% on Frontier Math (where all previous models scored below 2%) and 87.5% on ARC-AGI (where o1-preview scored 18%) illustrated the magnitude of the improvement.^[12]

Third, the o-series influenced the broader industry to invest heavily in reasoning capabilities. Within months of the o1 release, virtually every major AI laboratory had released or announced their own reasoning models. This rapid proliferation validated the approach and accelerated progress across the field.

Finally, the integration of reasoning into GPT-5 signaled that reasoning is not a niche feature but a fundamental capability that will be expected of all frontier AI systems going forward.^[9] The separation between "fast" and "thinking" models may be a transitional phase, with future systems seamlessly combining both modes.

References

OpenAI. "Learning to Reason with LLMs." OpenAI Blog, September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/ ↩
OpenAI. "Introducing OpenAI o1-preview." OpenAI Blog, September 12, 2024. https://openai.com/index/introducing-openai-o1-preview/ ↩
OpenAI. "o1 System Card." September 12, 2024. https://cdn.openai.com/o1-system-card.pdf ↩
OpenAI. "Introducing ChatGPT Pro." OpenAI Blog, December 5, 2024. https://openai.com/index/introducing-chatgpt-pro/ ↩
OpenAI. "o1 and New Tools for Developers." OpenAI Blog, December 5, 2024. https://openai.com/index/o1-and-new-tools-for-developers/ ↩
OpenAI. "OpenAI o3-mini." OpenAI Blog, January 31, 2025. https://openai.com/index/openai-o3-mini/ ↩
OpenAI. "Introducing OpenAI o3 and o4-mini." OpenAI Blog, April 16, 2025. https://openai.com/index/introducing-o3-and-o4-mini/ ↩
OpenAI. "o3 and o4-mini System Card." April 16, 2025. https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf ↩
OpenAI. "Introducing GPT-5." OpenAI Blog, August 7, 2025. https://openai.com/index/introducing-gpt-5/ ↩
OpenAI. "Deliberative Alignment: Reasoning Enables Safer Language Models." OpenAI Blog, December 2024. https://openai.com/index/deliberative-alignment/ ↩
ARC Prize. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 2024. https://arcprize.org/blog/oai-o3-pub-breakthrough ↩
ARC Prize. "Analyzing o3 and o4-mini with ARC-AGI." April 2025. https://arcprize.org/blog/analyzing-o3-with-arc-agi ↩
Anthropic and OpenAI. "Findings from a Pilot Alignment Evaluation Exercise." 2025. https://alignment.anthropic.com/2025/openai-findings/ ↩
OpenAI. "Pricing." OpenAI API. https://openai.com/api/pricing/ ↩
Lambert, Nathan. "Reverse Engineering OpenAI's o1." Interconnects, September 2024. https://www.interconnects.ai/p/reverse-engineering-openai-o1 ↩
TechCrunch. "OpenAI Launches a Pair of AI Reasoning Models, o3 and o4-mini." April 16, 2025. https://techcrunch.com/2025/04/16/openai-launches-a-pair-of-ai-reasoning-models-o3-and-o4-mini/ ↩
TechCrunch. "OpenAI Releases o3-pro, a Souped-up Version of Its o3 AI Reasoning Model." June 10, 2025. https://techcrunch.com/2025/06/10/openai-releases-o3-pro-a-souped-up-version-of-its-o3-ai-reasoning-model/ ↩
TechCrunch. "OpenAI's GPT-5 Is Here." August 7, 2025. https://techcrunch.com/2025/08/07/openais-gpt-5-is-here/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

7 revisions by 1 contributors · full history

Suggest edit