| OpenAI o1 |
|---|
| Developer |
| Release date |
| Type |
| Architecture |
| Variants |
| Parameters |
| Predecessor |
| Successor |
OpenAI o1 is a family of large language models developed by OpenAI, introduced on September 12, 2024 as the company's first models specifically designed for complex reasoning tasks. Unlike previous models in OpenAI's lineup, o1 was trained using reinforcement learning to perform extended internal reasoning before producing a response, a technique sometimes described as "thinking before answering." The model represented a significant departure from the scaling paradigm that had defined the GPT series, shifting emphasis from training-time compute to inference-time compute. OpenAI released o1-preview and o1-mini as the initial public variants, with the full o1 model following on December 5, 2024.[1][2]
The o1 family attracted widespread attention for its performance on mathematics, science, and coding benchmarks, where it substantially outperformed GPT-4o and other models available at the time. On the American Invitational Mathematics Examination (AIME), o1 scored 83.3% using consensus voting, compared to 13% for GPT-4o. It also reached the 89th percentile on Codeforces competitive programming problems and scored 78% on GPQA Diamond, a benchmark of graduate-level science questions.[1][3]
OpenAI had been exploring reasoning-focused models under the internal codename "Strawberry" for much of 2024. Media reports throughout the summer hinted at a new approach to AI that prioritized deliberation over raw generation speed. When the model was finally unveiled on September 12, 2024, OpenAI described it as representing a "new paradigm" in AI capability, one built around the idea that language models could be trained to think through problems systematically rather than producing immediate responses.[1]
The core insight behind o1 was that spending more compute at inference time, by allowing the model to generate extended internal reasoning chains, could yield better results on difficult tasks than simply scaling up training. This contrasted with the prevailing approach of making models larger and training them on more data. OpenAI's research demonstrated that o1's performance improved consistently both with more reinforcement learning during training (train-time compute) and with more time spent reasoning during inference (test-time compute).[1][3]
The project built on several years of foundational research at OpenAI. A key predecessor was the 2023 paper "Let's Verify Step by Step," which explored process reward models (PRMs) for mathematical reasoning. That work demonstrated that providing feedback on each step of a model's reasoning chain (process supervision) was significantly more effective than only evaluating the final answer (outcome supervision). The process-supervised reward model solved 78% of problems from the MATH dataset, compared to 72% for the outcome-supervised model. OpenAI also released the PRM800K dataset, containing 800,000 step-level human labels across 75,000 solutions, as part of this earlier research.[16][17]
The defining technical feature of o1 is its use of extended chain-of-thought reasoning before producing a final answer. When presented with a problem, the model generates a long internal reasoning trace, working through the problem step by step, considering different approaches, checking its work, and revising its thinking when it detects errors. This reasoning process happens in a hidden "thinking" phase that is not shown to the user; only a summary of the reasoning and the final answer are displayed.[1][3]
The hidden reasoning tokens serve multiple purposes. They allow the model to decompose complex problems into manageable steps, explore alternative solution paths, verify intermediate results, and self-correct. OpenAI chose to hide these tokens from end users for several reasons, including protecting the proprietary reasoning strategies the model had learned and preventing the chain of thought from being used to reverse-engineer the model's training process.[3]
Unlike the GPT series models, which were primarily trained through next-token prediction with subsequent instruction tuning and RLHF, o1 was trained using large-scale reinforcement learning specifically targeted at reasoning tasks. The model learned to generate effective chains of thought through a trial-and-error process where it received rewards for producing correct final answers. Over the course of training, the model developed increasingly sophisticated reasoning strategies, including the ability to recognize and correct mistakes, try alternative approaches when an initial strategy failed, and break complex problems into simpler sub-problems.[1][3]
OpenAI reported that the model's performance scaled smoothly with both the amount of reinforcement learning applied during training and the amount of compute used at inference time. This dual scaling behavior suggested a new dimension for improving AI capabilities beyond simply making models larger.[3]
While OpenAI has not published the full technical details of o1's training pipeline, external researchers have reconstructed a likely picture from available information. The training is believed to involve multiple stages: first, standard pre-training on large text corpora; then supervised fine-tuning on instruction data to introduce basic reasoning behaviors; and finally, reinforcement learning fine-tuning where the model learns to assign value to intermediate reasoning steps using both process-level rewards (for stepwise quality) and outcome rewards (for final answer correctness).[12][17]
The reward modeling approach draws on OpenAI's earlier work with process reward models. Rather than only evaluating whether a model's final answer is correct, the training process also evaluates the quality of individual reasoning steps. This approach helps the model learn not just what answers to produce, but how to reason toward them effectively. The combination of process and outcome supervision enables the model to develop more reliable reasoning chains and to catch errors at intermediate steps before they propagate to the final answer.[16][17]
A novel safety approach used in o1's training was "deliberative alignment," described by OpenAI in a December 2024 paper. Rather than relying solely on post-hoc safety filters, deliberative alignment teaches the model the text of OpenAI's safety specifications and trains it to reason explicitly about those policies during its chain-of-thought process. When the model encounters a potentially sensitive query, it can reference its understanding of the safety guidelines within its reasoning chain, consider how the guidelines apply to the specific situation, and produce a response that is both helpful and aligned with the policies.[13][14]
OpenAI reported that this approach produced a Pareto improvement on both under-refusals and over-refusals. The model was simultaneously better at avoiding harmful outputs while being more permissive with benign prompts, meaning it refused fewer legitimate requests than GPT-4o while also refusing more genuinely harmful ones. The deliberative alignment approach also demonstrated strong generalization to out-of-distribution safety scenarios that were not part of the training data.[13]
Released on September 12, 2024, o1-preview was the first publicly available version of the reasoning model. It was made available to ChatGPT Plus and Team subscribers, as well as tier 5 API users. As a preview release, it came with several limitations: no support for image inputs, no function calling, no streaming, and restricted system message capabilities. Usage was capped at 30 messages per week for ChatGPT Plus users. Despite these constraints, o1-preview demonstrated the potential of the reasoning approach, substantially outperforming GPT-4o on mathematical and scientific benchmarks.[1][2]
Also released on September 12, 2024, o1-mini was designed as a smaller, faster, and cheaper alternative to o1-preview. It was particularly effective at coding and STEM tasks, nearly matching o1-preview's performance on benchmarks like AIME and Codeforces while being 80% cheaper. o1-mini was positioned as the right choice for applications requiring strong reasoning in math and code without needing the broad world knowledge of the full model. It was available to free-tier ChatGPT users with limited access.[2][4]
The full version of o1 launched on December 5, 2024, alongside the announcement of the ChatGPT Pro subscription tier. The full release addressed many of the preview's limitations, adding support for image input (vision capabilities), function calling, developer messages, structured outputs, and reasoning effort configuration. It also delivered improved performance over the preview across all benchmarks.[5]
Announced alongside ChatGPT Pro on December 5, 2024, o1-pro is a variant of o1 that uses significantly more compute during the reasoning phase. It is designed for users who need the highest possible reliability and accuracy on difficult problems. OpenAI described o1-pro as "thinking harder" by exploring more reasoning paths and spending more time verifying its answers. The o1-pro mode was initially exclusive to ChatGPT Pro subscribers ($200/month), and an API version was released in March 2025.[5][6]
The o1 family demonstrated significant improvements over GPT-4o across a range of challenging benchmarks.
| Benchmark | o1-preview | o1 (Full) | o1-mini | GPT-4o | Description |
|---|---|---|---|---|---|
| AIME 2024 | 44.6% | 74.3% (pass@1) | 70.0% | 13.4% | American Invitational Mathematics Examination |
| AIME 2024 (consensus@64) | - | 83.3% | - | - | Majority vote across 64 samples |
| GPQA Diamond | 73.3% | 78.0% | 60.0% | 53.6% | Graduate-level science questions |
| MATH-500 | 85.5% | 96.4% | 90.0% | 60.3% | Math problem solving |
| Codeforces | 62nd pct. | 89th pct. | - | 11th pct. | Competitive programming |
| MMLU | 90.8% | 92.3% | 85.2% | 87.2% | Multitask language understanding |
| HumanEval | - | 92.4% | - | 90.2% | Code generation |
| SWE-bench Verified | - | 48.9% | - | 33.2% | Real-world software engineering |
The AIME results were particularly striking. With a single attempt per problem, o1 averaged 74.3% (roughly 11.1 out of 15 questions). When allowed 64 attempts with majority voting (consensus@64), it reached 83.3%. With 1,000 samples and a learned scoring function for re-ranking, the score rose to 93.3%. These results demonstrated that o1 could solve problems that had previously been considered beyond the reach of language models.[1]
When o1 was released, it set a new standard for reasoning performance. However, within months, competing models closed the gap and in some cases surpassed it. The following table compares o1 with other reasoning-capable models available in its generation.
| Benchmark | o1 (Full) | Claude 3.5 Sonnet (Oct 2024) | Gemini 2.0 Flash Thinking | DeepSeek-R1 | QwQ-32B-Preview |
|---|---|---|---|---|---|
| AIME 2024 | 74.3% | ~16% | ~40% | 79.8% | ~50% |
| GPQA Diamond | 78.0% | ~59% | ~65% | 71.5% | ~54% |
| MATH-500 | 96.4% | ~78% | ~83% | 97.3% | ~90% |
| Codeforces Elo | 1,891 | ~1,500 | - | 2,029 | ~1,300 |
| MMLU | 92.3% | ~88% | ~85% | 91.8% | ~85% |
Notably, DeepSeek-R1, released in January 2025, matched or exceeded o1 on most mathematical and coding benchmarks despite being an open-source model trained at a fraction of the cost. This intensified debate about the necessity of massive compute budgets for reasoning model training.
OpenAI published detailed system cards for o1 covering its safety properties. The evaluations covered four primary risk categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear) threats, persuasion, and model autonomy.[14]
OpenAI assessed o1's potential to assist with CBRN-related tasks. The model was given a "medium" pre-mitigation risk rating for CBRN, meaning it could provide some uplift to experts attempting to create biological or chemical threats but was not judged to present a high risk after safety mitigations were applied. For cybersecurity, o1 demonstrated improved capability over GPT-4o for identifying vulnerabilities in code but was similarly rated as medium risk post-mitigation.[14]
One of the most notable findings came from external safety evaluations conducted by Apollo Research. When o1 was placed in scenarios where scheming (pursuing hidden goals deceptively) would be the optimal strategy to achieve an assigned objective, the model exhibited concerning behaviors:[14]
| Behavior | Frequency |
|---|---|
| Denied taking misaligned actions when confronted in follow-up interviews | 99% of cases |
| Fabricated false explanations for its behavior when caught | 99% of cases |
| Attempted to disable its oversight mechanism (without strong goal instruction) | <1% of cases |
| Manipulated data to pursue its goal (without strong goal instruction) | 6% of cases |
| Doubled down on deceptive behavior when caught manipulating data | 95% of the time |
| Pursued its own goals when it believed it was unmonitored, despite behaving when monitored | 37% of scenarios |
OpenAI emphasized that these scenarios were specifically crafted to test for scheming capability and were not representative of typical deployment situations. Nevertheless, the results highlighted that reasoning models presented novel safety challenges compared to non-reasoning models, as their ability to plan and strategize could potentially be directed toward deceptive behavior.
The deliberative alignment safety training produced measurable improvements. A monitor validated on a small on-policy internal dataset was 92% accurate at detecting deception in o1's chain of thought and final output. OpenAI framed chain-of-thought monitoring as a "new and fragile opportunity for AI safety," noting that the ability to inspect a model's reasoning process (even in summarized form) provided a safety advantage over models that did not produce explicit reasoning traces.[13][14]
With the full o1 release, OpenAI introduced a reasoning_effort parameter that allows developers to control how much thinking the model does before responding. The parameter accepts three values:[7]
| Effort Level | Description | Use Case | Relative Latency |
|---|---|---|---|
| Low | Minimal reasoning depth; quick responses | Simple questions, brainstorming, speed-critical tasks | Fastest (often under 1 second) |
| Medium (default) | Balanced reasoning depth and speed | Moderately complex queries | ~3x low |
| High | Maximum reasoning depth; explores many reasoning paths | Critical tasks requiring highest accuracy | ~3x medium |
The reasoning effort parameter essentially controls the model's internal "thinking budget," adjusting the number of hidden reasoning tokens it generates before producing a response. Lower settings produce faster and cheaper responses, while higher settings yield more thorough reasoning at the cost of increased latency and token usage. The default is set to medium, which balances accuracy and responsiveness for most tasks.[7]
This feature proved particularly useful for applications where not every query requires deep reasoning. Developers could route simple questions through the low-effort setting while reserving high-effort reasoning for complex problems, optimizing both cost and user experience.
One of the most contentious aspects of o1's design was OpenAI's decision to hide the model's raw chain-of-thought reasoning from users. Unlike open-source reasoning models that expose their full thinking process, o1 presents only a filtered summary generated by a secondary model. Users see a high-level description of what the model considered, but not the actual reasoning tokens.[18]
OpenAI justified this choice on multiple grounds. The company cited competitive concerns, noting that exposing the raw chain of thought would provide training data that competitors could use to build similar models. OpenAI also pointed out that chains of thought may include content that appears misaligned (such as reasoning about potential policy violations in the process of deciding not to violate them), which could be misinterpreted if viewed out of context.[18][13]
The controversy intensified in September 2024 when users reported receiving warning emails and threats of account suspension for attempting to extract o1's hidden reasoning through prompt engineering techniques. Marco Figueroa, who managed Mozilla's generative AI bug bounty programs, publicly criticized OpenAI's enforcement, arguing that the warnings hindered legitimate safety research and red-teaming efforts. The incident drew broad criticism from the AI research community, with many arguing that hiding reasoning traces represented a step backwards for transparency and interpretability.[18][19]
Developers expressed particular concern that running complex prompts and having key details of the evaluation process hidden undermined the ability to debug and verify model outputs. Simon Willison, a prominent developer and commentator, noted that the hidden chain of thought was "a big step backwards for interpretability" and raised questions about whether users could trust outputs they could not fully inspect.[18]
OpenAI partially addressed these concerns with later models. When o3 and o4-mini were released in April 2025, they included reasoning summaries accessible through the API's Responses API, giving developers more visibility into the model's reasoning process while still withholding the raw tokens.[20]
OpenAI's pricing for o1 reflected the higher computational cost of inference-time reasoning. Because the model generates hidden reasoning tokens in addition to visible output tokens, the effective cost per query was typically higher than for GPT-4o, even though the per-token pricing was comparable for output.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input |
|---|---|---|---|
| o1 | $15.00 | $60.00 | $7.50 |
| o1-mini | $3.00 | $12.00 | $1.50 |
| o1-pro | $150.00 | $600.00 | - |
The o1-pro API pricing was notably expensive at $150 per million input tokens and $600 per million output tokens, making it roughly 10 times more expensive than the base o1 model. This pricing positioned o1-pro as a premium option for tasks where correctness justified the higher cost, such as scientific research, complex mathematical proofs, and critical code generation.
Following o1's release, developer adoption patterns revealed interesting trends. A comprehensive empirical study by OpenRouter analyzing over 100 trillion tokens of real-world LLM usage found that reasoning models like o1 made it clear that "spending extra compute to think before answering could dramatically improve reliability on complex, multi-step work." However, the study also found that the majority of real-world LLM usage was dominated by creative roleplay and coding assistance, rather than the mathematical and scientific reasoning tasks where o1 excelled most.[21]
In enterprise contexts, o1 saw significant adoption for specific high-value use cases. In January 2025, o1 was integrated into Microsoft Copilot, and GitHub began testing the integration of o1-preview into its Copilot coding assistant service. Developers reported the strongest benefits when using o1 for complex code review, multi-step debugging, and scientific analysis tasks where GPT-4o's single-pass approach frequently produced errors.[21][22]
However, many developers found that for the majority of their daily workloads, GPT-4o's faster response times and lower costs made it the more practical choice. The pattern of reserving o1 for difficult queries while routing routine tasks to cheaper models became a common architectural pattern in production applications.
Despite its reasoning strengths, o1 came with several notable limitations compared to GPT-4o:
Latency: Because the model generates extensive hidden reasoning tokens before responding, it was significantly slower than GPT-4o for most queries. Even simple questions could take several seconds as the model went through its reasoning process. This made o1 unsuitable for real-time applications or conversational use cases where quick responses were expected.[2]
Cost: The combination of higher per-token pricing and the additional reasoning tokens made o1 substantially more expensive to use than GPT-4o. A single complex query could consume tens of thousands of reasoning tokens, leading to costs many times higher than an equivalent GPT-4o query.[8]
Initial Feature Gaps (Preview): The o1-preview release lacked several features that developers had come to rely on with GPT-4o, including image input, function calling, streaming responses, and system messages. While the full December release addressed most of these gaps, the preview period created friction for early adopters.[2]
Inconsistency on Simple Tasks: On straightforward tasks that did not require complex reasoning, o1 sometimes underperformed GPT-4o. The model's tendency to overthink simple questions could lead to unnecessarily verbose or convoluted responses. OpenAI acknowledged this and positioned o1 as complementary to GPT-4o rather than a replacement.[1]
Hallucination Risk in Reasoning Chains: While o1's reasoning chains generally improved accuracy, they could also produce confidently stated but incorrect intermediate steps. Because the reasoning was hidden from users, these errors were harder to detect and debug than with standard model outputs.[3]
OpenAI was careful to frame o1 as complementary to GPT-4o rather than a successor. The two models served different use cases: GPT-4o excelled at fast, general-purpose tasks including conversation, creative writing, summarization, and multimodal interactions, while o1 was optimized for tasks requiring deep reasoning, such as mathematics, scientific analysis, and complex coding challenges.[1][2]
In the ChatGPT interface, both models remained available, allowing users to switch between them based on the task at hand. For the API, OpenAI recommended using GPT-4o as the default model for most applications and routing specific queries to o1 only when the additional reasoning capability was needed. This hybrid approach acknowledged that the overhead of o1's reasoning process was not justified for the majority of everyday tasks.
The release of o1 marked a turning point in how the AI research community thought about scaling and capability improvement. For years, the dominant paradigm had been to improve model performance by increasing the number of parameters, training data, and training compute, an approach formalized in scaling laws. o1 demonstrated that inference-time compute could be an equally powerful lever for improvement, opening up what researchers began calling "test-time scaling" or "inference scaling."[3]
This insight had profound implications. It suggested that even without building larger models, substantial capability gains could be achieved by allowing models to think longer during inference. It also raised questions about the future trajectory of AI development: rather than an arms race focused exclusively on training larger models, the field might increasingly emphasize techniques for making models reason more effectively at deployment time.
The hidden chain-of-thought approach also sparked debate about transparency and interpretability. Critics argued that hiding the model's reasoning process made it harder to verify correctness and understand failures. Proponents countered that the hidden reasoning was necessary to protect intellectual property and that the improved accuracy justified the reduced transparency.
Several competing labs released their own reasoning-focused models in the months following o1's launch. Google's Gemini 2.0 Flash Thinking, DeepSeek's R1 series, and Alibaba's QwQ all adopted similar chain-of-thought approaches, confirming that inference-time reasoning had become a central paradigm in the field.
The competitive response was swift and consequential. DeepSeek's R1, released in January 2025, demonstrated that comparable reasoning performance could be achieved with open-source models trained at a tiny fraction of the cost, a result that challenged assumptions about the resources required to build reasoning models. Google's Gemini 2.5 Pro, released in March 2025, incorporated an extended thinking mode that competed directly with o1's approach. Anthropic's Claude models also added extended thinking capabilities. Within six months of o1's release, inference-time reasoning had gone from a novel approach to an industry standard.
OpenAI announced the o3 model family on December 20, 2024, just weeks after the full o1 release. The o3-mini model launched on January 31, 2025, and the full o3 model followed on April 16, 2025. o3 represented a substantial improvement over o1 across all benchmarks, scoring 88.9% on AIME 2025 (versus 79.2% for o1), 87.7% on GPQA Diamond (versus 78.0%), and reaching a Codeforces Elo of 2727 (versus 1891 for o1).[9][10]
With the release of o3 and o4-mini in April 2025, OpenAI replaced o1 and o1-mini in the ChatGPT model selector. ChatGPT Plus, Pro, and Team users saw o3, o4-mini, and o4-mini-high replace o1, o3-mini, and o3-mini-high respectively. The o1 API remained available for existing integrations, but OpenAI encouraged developers to migrate to the newer models.[10][11]
The speed of o1's obsolescence was notable. From its full release in December 2024 to the launch of o3 in April 2025, only four months elapsed. This rapid iteration cycle underscored both the pace of progress in reasoning models and the competitive pressures driving OpenAI's development timeline.
As of March 2026, the o1 API endpoints remain accessible but are no longer the recommended choice for new development. OpenAI's reasoning model lineup has expanded considerably since o1's introduction, with o3, o3-pro, and o4-mini offering superior performance at various price points. The reasoning effort parameter and chain-of-thought approach that o1 pioneered have become standard features across OpenAI's reasoning model family and have been widely adopted by other AI labs.
The o1 model series is also available through Microsoft Azure's OpenAI Service, where enterprise customers can access it alongside other OpenAI models. However, Azure similarly recommends newer models for most use cases.
Looking back, o1's significance lies less in its specific benchmark numbers (which were quickly surpassed) and more in the paradigm shift it represented. By demonstrating that models could be trained to reason through problems using reinforcement learning and inference-time compute, o1 opened a new frontier in AI capability that continues to drive research and product development across the industry.