OpenAI o3-mini
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,077 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,077 words
Add missing citations, update stale details, or suggest a clearer explanation.
OpenAI o3-mini is a reasoning-focused large language model released by OpenAI on January 31, 2025, the second commercial member of the o-series after OpenAI o1 and a smaller, cheaper, faster sibling to the full o3 model that was previewed alongside it.[1][2] The model was the first OpenAI reasoning model to support function calling, structured outputs, and developer messages from launch, and it introduced a user-selectable reasoning_effort parameter with low, medium, and high settings.[2] At launch its API price was $1.10 per million input tokens and $4.40 per million output tokens, roughly 63% cheaper than o1-mini and competitive with DeepSeek-R1, which had been released six days earlier on January 20, 2025.[2][3] OpenAI positioned o3-mini as the workhorse reasoning model for coding, mathematics, and science tasks, citing benchmark results of about 83% on AIME 2024 and 77% on GPQA Diamond for the high-effort setting and a Codeforces rating near 2073 Elo.[2] It was eventually superseded by o4-mini on April 16, 2025 and entered a deprecation window in the OpenAI API later that year.[4][5]
| Field | Value |
|---|---|
| Developer | OpenAI |
| Announced | December 20, 2024 (preview)[6] |
| General availability | January 31, 2025[2] |
| API model ID | o3-mini and o3-mini-2025-01-31[7] |
| Context window | 200,000 tokens (max 100,000 output)[7] |
| Knowledge cutoff | October 1, 2023[7] |
| Modalities at launch | Text only (image input added in ChatGPT on February 13, 2025)[8] |
| Reasoning effort levels | low, medium, high[2] |
| API price (input / cached / output) | $1.10 / $0.55 / $4.40 per 1M tokens[2] |
| Replaced by | o4-mini (April 16, 2025)[4] |
OpenAI introduced the o3 family during the final day of its "12 Days of OpenAI" livestream series on December 20, 2024. CEO Sam Altman and researchers Mark Chen, Hongyu Ren, and others demonstrated the full o3 model and announced a sibling called o3-mini, billed as a smaller, faster variant suited to coding and STEM tasks where the full model would be excessive.[6] OpenAI skipped the "o2" name to avoid a trademark conflict with the British telecommunications brand O2, a detail Altman discussed during the stream.[6] The December event opened a safety testing window: external red teamers and safety researchers could apply for early access to both o3 and o3-mini, with deployment dates contingent on the outcome of those evaluations.[6]
On January 31, 2025, OpenAI shipped o3-mini in both ChatGPT and the API. The launch followed by ten days the public release of DeepSeek-R1 on January 20, 2025, which had upended pricing expectations for reasoning models by offering comparable benchmark performance at a fraction of the cost.[3][9] OpenAI's announcement page described o3-mini as "the newest, most cost-efficient model in our reasoning series" and emphasised that it was production-ready out of the gate thanks to function calling, structured outputs, and developer messages.[2] Pro subscribers received unlimited access to both o3-mini and a higher-thinking variant called o3-mini-high; Plus and Team subscribers received 150 messages per day, triple the cap that had previously applied to o1-mini.[2] Enterprise and Education customers received access "in a week," and free ChatGPT users could trigger the model by selecting a "Reason" button in the message composer, the first time a reasoning model had reached the free tier.[2]
On February 12 and 13, 2025, OpenAI quietly enabled image upload in ChatGPT for o3-mini accounts, though vision inputs remained unsupported in the public API at that time.[8] In subsequent weeks the company doubled the daily quota for the o3-mini-high variant to 50 messages per day for Plus subscribers as part of a broader uplift announced alongside expanded reasoning-model access.[10]
The full o3 reasoning model and a new mini sibling, o4-mini, launched on April 16, 2025, and OpenAI's product page described the change in ChatGPT as a one-to-one upgrade in which o4-mini replaced o3-mini in the model picker.[4][5] OpenAI kept o3-mini available in the API to preserve stable behaviour for production users that had built on it.[5] On June 20, 2025, GitHub announced the deprecation of o1, GPT-4.5, o3-mini, and GPT-4o on GitHub Models, with o3-mini scheduled for retirement on July 18, 2025; that timeline was later revised after customer pushback, and OpenAI continued listing o3-mini in its API catalogue through the second half of 2025.[11]
The most prominent design decision in o3-mini is the introduction of an explicit reasoning_effort parameter, accepting the string values low, medium, and high. The parameter trades latency for accuracy by allowing the model to allocate more or fewer hidden reasoning tokens before producing the final answer.[2] OpenAI's launch post describes the choice as a way to "think harder when tackling complex challenges or prioritize speed when latency is a concern."[2] Medium effort is the default in ChatGPT, while developers can override the setting per API call.[2] At high effort, OpenAI reported that o3-mini exceeded o1 on the headline math, science, and coding benchmarks while remaining cheaper to run.[2]
The three effort tiers showed a roughly monotonic improvement on benchmark scores. On the 2024 American Invitational Mathematics Examination, OpenAI reported pass@1 figures of about 60% at low effort, 79% at medium, and 83% at high effort with no Python tool use; the high-effort number reached approximately 87% when the same evaluation was rerun with a separate sample of the questions.[2] Codeforces ratings rose from roughly 1831 (low) and 1997 (medium) to 2073 (high), placing o3-mini-high above 99% of human competitors in OpenAI's evaluation harness.[2]
In ChatGPT, OpenAI exposed only two tiers to consumers: regular o3-mini (medium effort) and a separately badged o3-mini-high variant that ran at high effort and consumed a smaller daily quota.[2]
Unlike its predecessor o1-mini, o3-mini natively supports the function-calling API that OpenAI shipped for GPT-4 in 2023, the Structured Outputs feature added for GPT-4o in August 2024, and the new developer role that OpenAI introduced for the o-series in late 2024.[2] The launch post called these "highly requested developer features" and framed o3-mini as "production-ready out of the gate," in contrast to o1 and o1-mini, both of which had launched without function calling or strict JSON-schema output enforcement.[2]
Structured Outputs in o3-mini lets developers pass a JSON Schema in the API request and receive a response whose structure is guaranteed to validate against the schema, removing the need for retry-on-parse-error loops common in LLM pipelines.[2] Function calling allows the model to emit machine-readable tool invocations against a developer-supplied list of available functions, enabling tool use, agentic workflows, and retrieval-augmented generation.[2] Developer messages occupy a system-message-like slot in the request payload but are designed for the o-series, where models trained to reason internally treat them as high-trust instructions distinct from end-user content.[2]
At the January 31, 2025 launch, o3-mini accepted only text input and produced only text output, with no vision or audio support.[2] OpenAI's launch documentation acknowledged that the model "does not support vision capabilities, so developers should continue using OpenAI o1 for visual reasoning tasks."[2] Two weeks later, on February 12 to 13, 2025, image upload became available for o3-mini chats in the ChatGPT web app, with OpenAI confirming the rollout publicly the following day; the API still rejected image_url payloads against the o3-mini endpoint at that time.[8]
OpenAI shipped a prototype web-search integration alongside o3-mini in ChatGPT. The model can fetch up-to-date information and surface citations to source pages, though OpenAI labelled the feature as an early prototype while it worked to integrate search across the reasoning-model family.[2] Full agentic web-search and tool use, including Python execution and image generation from within the chain of thought, was reserved for the later o4-mini and o3 launches in April 2025.[4]
The o3-mini model has a 200,000-token context window with a maximum output of 100,000 tokens, including the hidden reasoning tokens consumed at higher effort settings.[7] The knowledge cutoff is October 1, 2023, the same as o1.[7]
OpenAI's launch post for o3-mini included evaluation results across mathematics, science, coding, and tool-use benchmarks. Unless stated otherwise the figures below are pass@1 with no external tools and follow OpenAI's reported numbers for the three effort levels.[2]
On the 2024 American Invitational Mathematics Examination competition problems, OpenAI reported scores of approximately 60% at low effort, 79% at medium, and 83% at high effort.[2] An additional evaluation that drew a separate set of AIME questions and used a different sampling configuration placed o3-mini-high at roughly 87.3%, matching the figure most often cited in coverage of the launch.[12]
On the MATH dataset, o3-mini scored above 95% at high effort, slightly exceeding o1 and far above GPT-4o.[2] On the FrontierMath research-math benchmark introduced by Epoch AI in late 2024, OpenAI reported that o3-mini at high effort with access to a Python tool solved over 32% of problems on the first attempt, including more than 28% of the "T3" hardest problems.[2] For context, prior to the o-series no public model had exceeded roughly 2% on FrontierMath in its initial release.[6]
On GPQA Diamond, the 198-question subset of GPQA covering graduate-level physics, chemistry, and biology questions, OpenAI reported pass@1 scores of about 60% at low effort, 71% at medium, and 77% at high effort for o3-mini.[2] The high-effort score of roughly 77% slightly exceeded o1's 75.7% on the same benchmark.[2]
On the Codeforces competitive-programming Elo ladder, OpenAI's harness assigned ratings of approximately 1831, 1997, and 2073 to o3-mini at low, medium, and high effort respectively.[2] The high-effort 2073 rating placed o3-mini above roughly 99% of human Codeforces competitors but well below the 2727 rating later reported for the full o3 model.[6]
On SWE-bench Verified, the human-curated 500-problem subset of SWE-bench drawn from real GitHub pull requests, o3-mini scored 39.0% at low effort and 49.3% at high effort.[2] The high-effort 49.3% figure compared favourably to o1's 48.9% on the same evaluation.[2] On the LiveBench coding category, OpenAI reported that o3-mini outperformed o1 across effort tiers.[13]
A notable trade-off in o3-mini was its weaker performance on broad knowledge benchmarks. On the MMLU multitask language understanding benchmark, o3-mini at high effort scored 86.9%, several points below o1's reported 92.3%, reflecting the model's relative narrowing of training toward STEM reasoning rather than encyclopedic recall.[13]
| Benchmark | o3-mini (high) | o1 | DeepSeek-R1 | Source |
|---|---|---|---|---|
| AIME 2024 (pass@1) | ~83% (extended eval ~87%) | 83.3% | 79.8% | [2][9][14] |
| GPQA Diamond | ~77% | 75.7% | 71.5% | [2][9] |
| MATH-500 | ~97% | ~96% | 97.3% | [9][13] |
| Codeforces Elo | 2073 | 1891 | 2029 | [2][9] |
| SWE-bench Verified | 49.3% | 48.9% | 49.2% | [2][9] |
| FrontierMath (with Python) | >32% | not reported | not reported | [2] |
| MMLU | 86.9% | 92.3% | 90.8% | [13][9] |
OpenAI also reported the results of an "external expert tester" study in which evaluators preferred o3-mini responses to o1-mini responses 56% of the time and judged that o3-mini reduced "major mistakes" on real-world hard questions by 39%. The same study found average response latency was 24% faster than o1-mini, at 7.7 seconds versus 10.16 seconds.[2]
OpenAI has not published a detailed architectural report for o3-mini. The o3-mini system card describes the model as a member of the o-series trained with large-scale reinforcement learning to produce long internal chain-of-thought traces before responding.[15] The system card states that o3-mini is "the latest model in our reasoning series" and that its training continues the recipe introduced with o1, in which reinforcement learning teaches the model to reason and refine its strategies during inference rather than at training time.[15] Like the rest of the o-series, o3-mini consumes a configurable number of hidden reasoning tokens during inference, and OpenAI bills those reasoning tokens at the output rate even though they are not visible in the final response.[7]
Beyond reasoning training, OpenAI applied a safety-training procedure called deliberative alignment to o3-mini. Introduced in a paper published alongside the December 2024 o3 preview, deliberative alignment teaches a model to explicitly reason through OpenAI's safety policy text before producing an answer, rather than relying only on examples of compliant and refusal behaviour.[16] OpenAI's deliberative-alignment paper lists o1-preview, o1, and o3-mini as the production models trained with the technique, and the o3-mini system card credits deliberative alignment for improvements in jailbreak resistance and policy compliance.[16][15]
The o3-mini system card, dated January 31, 2025 and revised on February 10, 2025, evaluates o3-mini against OpenAI's Preparedness Framework in four risk categories: Cybersecurity, CBRN (chemical, biological, radiological, and nuclear uplift), Persuasion, and Model Autonomy.[15] OpenAI's internal Safety Advisory Group assigned the pre-mitigation model an overall rating of Medium, with Medium ratings in Persuasion, CBRN, and Model Autonomy and a Low rating in Cybersecurity.[15] The system card highlights that o3-mini is the first OpenAI model to reach Medium risk in Model Autonomy, a classification driven by improvements in software-engineering performance on benchmarks such as SWE-bench Verified.[15]
On jailbreak evaluations, the system card reports that o3-mini performs comparably to or better than o1 on suites including StrongREJECT and on internal disallowed-content suites, and that it shows large gains over GPT-4o on adversarial prompts.[15] OpenAI attributes those gains primarily to deliberative alignment.[16] Within days of public release, a CyberArk researcher reported coaxing o3-mini into producing exploitation guidance for the Windows LSASS process, illustrating that the model's safety improvements did not eliminate residual vulnerabilities.[17]
At launch on January 31, 2025, OpenAI listed o3-mini at $1.10 per million input tokens, $0.55 per million cached input tokens, and $4.40 per million output tokens.[2] The Batch API offered an additional 50% discount on the same rates.[13] Compared to o1-mini's $3 per million input and $12 per million output tokens, the o3-mini launch price represented a roughly 63% reduction in cost.[2] Compared to the full o1 at $15 per million input and $60 per million output tokens, o3-mini's input pricing was roughly 93% lower.[18]
The headline output price of $4.40 per million tokens matched the price of DeepSeek-R1 on third-party hosts, although DeepSeek's official API at the time offered the same model at lower rates (approximately $0.55 input and $2.19 output per million tokens), and DeepSeek-R1's open-weight licence allowed self-hosting at marginal cost.[9] OpenAI's pricing decision was widely interpreted in trade press as a competitive response to DeepSeek's January 20 release.[3][19]
API access at launch was restricted to developers in OpenAI's usage tiers 3 through 5, which require historical spend thresholds.[13] Standard, real-time, batch, and Assistants API surfaces all supported the model.[2] Streaming was supported, as were function calling and Structured Outputs.[2] Image inputs were unsupported at launch.[2]
In ChatGPT, o3-mini was available to Plus, Team, and Pro subscribers from launch day, with Enterprise and Education access following within a week.[2] Plus and Team subscribers received 150 messages per day, three times the previous o1-mini limit of 50.[2] Pro subscribers received unlimited access to both o3-mini and o3-mini-high.[2] Free ChatGPT users could invoke the model by selecting "Reason" in the message composer or by regenerating a response, with a smaller daily quota; this was the first reasoning model available to non-paying ChatGPT users.[2] The o3-mini-high variant, exposed in the ChatGPT model picker for Plus subscribers, allowed a small number of high-effort messages per day; OpenAI later doubled the Plus quota for o3-mini-high from an initial 25 to 50 messages per day.[10]
Trade press coverage of the January 31 launch concentrated on three themes: cost, the competitive context with DeepSeek-R1, and OpenAI's expansion of reasoning-model access to free ChatGPT users. TechCrunch's launch report described o3-mini as OpenAI's "latest reasoning model" and emphasised that it was OpenAI's first reasoning model to support function calling and Structured Outputs, calling it "production-ready out of the gate."[3] VentureBeat framed the release as "OpenAI's o3-mini advanced reasoning model arrives to counter DeepSeek's rise."[19] MacRumors's headline characterised the model as "rivaling DeepSeek," a framing some Hacker News commenters disputed, pointing out that DeepSeek-R1 remained roughly an order of magnitude cheaper at the API level and was distributed under an open licence.[20]
The DeepLearning.AI Batch newsletter noted that o3-mini "trades broad knowledge for domain-specific reasoning strength," pointing to the lower MMLU score relative to o1 as evidence of the trade-off.[13] The newsletter also highlighted that o3-mini's design made it possible to dial reasoning effort up or down via a single API parameter, a feature most other lab models did not expose.[13]
Independent evaluators on the Artificial Analysis benchmark suite recorded o3-mini at output speeds around 168 tokens per second, well above the median of comparable reasoning models, but with a time-to-first-token of about 6.84 seconds, reflecting the latency tax of internal reasoning before the first visible token.[21]
Within a week of launch, security researchers had reported jailbreaks of the model. A CyberArk principal vulnerability researcher demonstrated obtaining instructions for an exploit of the Windows LSASS process, an episode that received coverage in Dark Reading and prompted further discussion of deliberative alignment's robustness.[17]
OpenAI o1 launched in September 2024 and was the first publicly released model in the o-series.[22] o3-mini at medium effort approximately matched o1's headline performance on AIME 2024, GPQA Diamond, and SWE-bench Verified at substantially lower cost and latency, while o3-mini at high effort modestly exceeded o1 on most reasoning benchmarks at the cost of additional thinking time.[2] o3-mini's main advantages over o1-mini were function calling, Structured Outputs, developer messages, higher math and coding benchmark scores, and lower price.[2] Its main disadvantage relative to both o1 and o1-mini at launch was the absence of vision input, which OpenAI explicitly recommended developers use o1 for instead.[2]
OpenAI o4-mini launched on April 16, 2025, and superseded o3-mini in the ChatGPT model picker. OpenAI characterised o4-mini as a direct, one-to-one upgrade in ChatGPT and added image-reasoning ability, fuller tool use, and agentic web search.[4][5] On AIME 2025 OpenAI reported o4-mini scoring 93.4% versus a comparable o3-mini score in the high-80s, and on Codeforces o4-mini reached approximately 2719 Elo versus o3-mini-high's 2073.[4][14] The o4-mini API price at launch matched o3-mini's $1.10 input and $4.40 output per million tokens.[4]
DeepSeek-R1, released by DeepSeek-AI on January 20, 2025, was the immediate competitive backdrop for o3-mini's launch.[9] On AIME 2024, DeepSeek-R1 scored 79.8% versus o3-mini-high's ~83%; on Codeforces, DeepSeek-R1's reported Elo of 2029 was close to o3-mini-high's 2073; on MATH-500 DeepSeek-R1 scored 97.3% versus o3-mini's roughly 97%.[9] DeepSeek-R1 came with open weights under the MIT licence and a hosted API priced near $0.55 input and $2.19 output per million tokens, making it materially cheaper than o3-mini on hosted inference.[9] Coverage of the launches typically framed o3-mini as faster, more reliable, and stronger at coding and tool use, while DeepSeek-R1 was framed as cheaper, open, and stronger at pure mathematics.[14]
Claude 3.5 Sonnet from Anthropic, updated in October 2024, was the dominant commercial coding model at the time of o3-mini's launch. On SWE-bench Verified, Claude 3.5 Sonnet's reported score of about 49.0% was within a percentage point of o3-mini's 49.3%, although Sonnet did not perform explicit reasoning traces in the same style as the o-series.[2] Claude 3.7 Sonnet, released by Anthropic on February 24, 2025, less than four weeks after o3-mini, introduced an extended-thinking mode that brought reasoning-style behaviour to the Claude family and shifted the competitive landscape again.
o3-mini sat in the middle of OpenAI's 2025 reasoning lineup. Above it was the full o3 model, which OpenAI had previewed in December 2024 but did not release in general availability until April 16, 2025.[4] Below it, on a feature level, was o1-mini, which o3-mini replaced in the ChatGPT model picker on the launch day.[2] The full o1 remained available for visual reasoning tasks until o4-mini and o3 added image input in the April 16 release.[4]
OpenAI's broader 2025 roadmap moved on quickly. The full o4-mini and o3 launched on April 16, 2025, and OpenAI's o-series page identifies them as the new generation.[4] Later in the year the company released GPT-5 in August 2025, with subsequent point updates (GPT-5, GPT-5.1, GPT-5.2, and so on), each consolidating reasoning behaviour into the main chat model and reducing the visibility of the separate o-series brand.[23]
OpenAI exposes o3-mini in the API under two identifiers: o3-mini (an alias that may be updated) and the dated snapshot o3-mini-2025-01-31.[7] Requests accept a reasoning_effort parameter with values low, medium (default), or high; a developer role on the messages payload; a response_format with type: "json_schema" for Structured Outputs; and a tools array for function calling.[2][7]
The model card on the OpenAI developer documentation lists a context window of 200,000 tokens, maximum output tokens of 100,000, and a knowledge cutoff of October 1, 2023.[7] Reasoning tokens are billed at the output rate and counted against the maximum-output budget.[7]
OpenAI did not deprecate o3-mini at the time of the o4-mini release on April 16, 2025; the company said in customer communications that o3-mini would remain in the API to preserve stable behaviour for production users that had built atop it.[5] On June 20, 2025, GitHub announced a coordinated deprecation of o1, GPT-4.5, o3-mini, and GPT-4o on GitHub Models, with o3-mini scheduled for retirement on July 18, 2025.[11] That timeline was subsequently revised after customer feedback, and OpenAI continued to list o3-mini in its API model catalogue through the second half of 2025 even as o4-mini became the default reasoning-mini model in ChatGPT.[11][5]