OpenAI o4-mini is a compact reasoning model developed by OpenAI, released on April 16, 2025. It belongs to the o-series line of models that use test-time compute to deliberate before answering, and it succeeds o3-mini as OpenAI's small, cost-efficient reasoning option. Unlike its predecessors in the mini tier, o4-mini ships with full tool access built in: it can browse the web, execute Python, process and generate images, and chain those tools together in a single reasoning trace. It also introduces native visual reasoning, meaning the model can incorporate images into its chain of thought and apply transformations such as cropping or zooming mid-reasoning. On AIME 2025 without external tools, o4-mini scores 92.7%, outperforming OpenAI o3 (88.9%) and its predecessor o3-mini (86.5%). With a Python interpreter attached, it reaches 99.5% on the same benchmark.
Notably, OpenAI did not release a standalone "o4" model alongside o4-mini. The full-size fourth-generation reasoning capability was rolled into later products, leaving o4-mini as the sole representative of the o4 generation at launch. o4-mini was retired from ChatGPT on February 13, 2026, superseded by GPT-5 mini and updated o-series variants.
OpenAI introduced the o-series in September 2024 with o1-preview and o1-mini (the full o1 followed in December 2024). These models departed from the pattern of scaling parameters alone. Instead, they used reinforcement learning to teach the model to generate long internal reasoning chains before producing an answer, a technique sometimes called chain-of-thought at inference time or test-time compute scaling. The tradeoff was deliberate: o1 was slower and more expensive than GPT-4o, but it was substantially more reliable on complex math, science, and coding tasks.
The naming convention has caused repeated confusion. OpenAI skipped "o2" entirely, reportedly to avoid trademark conflicts with the British mobile carrier O2. The progression went o1 (September 2024), o3-mini (January 2025), and then o3 and o4-mini together (April 2025). There was no full-size o4 model at launch. At the time of o4-mini's release, OpenAI's lineup therefore had o3 as its most powerful reasoning model and o4-mini as the efficient complement, even though the latter carried a higher generation number. The confusion is not merely cosmetic: o3 outperforms o4-mini on several benchmarks, despite the version numbers implying the opposite.
Each generation of the o-series added capability. o1 could reason but could not use tools. o3 and o3-mini could do some tool-assisted tasks. o4-mini was the first mini model in the series to support full tool access natively, including the ability to incorporate images into its internal reasoning rather than merely treating them as static input.
The underlying technical approach involves reinforcement learning on reasoning chains. Rather than training the model to predict the next token in a fixed way, the o-series training process rewards the model for reaching correct answers through reasoning steps, even when those steps take unexpected paths. The result is a model that can discover novel solution strategies for problems it has not seen before, rather than pattern-matching to familiar solution templates. This is part of why o-series models perform particularly well on competition mathematics, where novel approaches are required, versus MMLU-style factual recall, where standard language model training excels.
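As a rough illustration only (OpenAI has not published its training recipe, so every name below is hypothetical), outcome-based RL on reasoning chains can be sketched as reinforcing whole sampled chains according to whether the final answer checks out:

```python
import random
from dataclasses import dataclass

@dataclass
class Problem:
    prompt: str
    correct_answer: str

class ToyPolicy:
    """Stand-in for a language-model policy; purely illustrative."""

    def sample_chain(self, prompt: str):
        # A real model would generate a long hidden chain of thought here.
        chain = f"step-by-step reasoning about: {prompt}"
        answer = random.choice(["4", "5"])
        return chain, answer

    def update(self, prompt: str, chain: str, reward: float):
        # A real trainer would apply a policy-gradient step, reinforcing
        # the entire chain only when the final answer was correct.
        pass

def outcome_reward(problem: Problem, answer: str) -> float:
    # Reward depends only on final-answer correctness, not on the chain
    # matching a reference solution, so novel solution paths score too.
    return 1.0 if answer == problem.correct_answer else 0.0

policy = ToyPolicy()
problem = Problem("What is 2 + 2?", "4")
chain, answer = policy.sample_chain(problem.prompt)
policy.update(problem.prompt, chain, outcome_reward(problem, answer))
```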
OpenAI announced o3 and o4-mini together on April 16, 2025. Both models were made available immediately to ChatGPT Plus, Pro, and Team subscribers, as well as through the API via the Chat Completions endpoint and the newer Responses API.
On April 24, 2025, OpenAI extended o4-mini access to all ChatGPT users, including the free tier. Free users could access the model through a "Think" mode toggle for more demanding queries. The broader rollout also made o4-mini available through the Chat Completions and Responses APIs without special access requirements.
o4-mini also reached third-party platforms at launch: on April 16, OpenAI made the model available on Microsoft Azure AI Foundry, and GitHub added it as a public preview model in GitHub Copilot on all paid Copilot plans. GitHub's integration allowed developers to query the model from within their coding environments alongside o3.
An o4-mini-deep-research variant appeared in the API in mid-2025. This variant was fine-tuned for multi-step research workflows, capable of decomposing a research question into sub-queries, browsing multiple sources sequentially, and synthesizing a structured report. It is priced at $2 per million input tokens and $8 per million output tokens, lower than the o3 deep-research equivalent.
On January 29, 2026, OpenAI announced the retirement of o4-mini from ChatGPT, effective February 13, 2026. The model remained available via API through February 16, 2026. Recommended successors included GPT-5 mini for general use and updated o-series variants for reasoning-heavy workloads.
Three variants of o4-mini were released or became available through API endpoints:
o4-mini is the standard version, optimized for speed and throughput. The model ID used in API calls is o4-mini-2025-04-16, and the o4-mini alias resolves to this snapshot. It has a 200,000-token context window and supports up to 100,000 output tokens.
o4-mini-high applies increased reasoning effort at inference time, producing more thorough chain-of-thought traces before outputting an answer. It is slower than the base o4-mini and was initially restricted to paid ChatGPT subscribers (Plus, Pro, Team). In the API, o4-mini-high uses the same underlying model weights but with a different inference configuration. Both variants carry identical pricing: $1.10 per million input tokens and $4.40 per million output tokens.
o4-mini-deep-research is a separately fine-tuned variant designed for agentic research tasks. It orchestrates web searches, synthesizes findings from multiple sources, and produces structured reports. It is only available through the Responses API and is intended to run in "background" mode for long-running tasks. This variant is priced differently from the base model.
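A minimal sketch of calling two of these variants with the OpenAI Python SDK. The model IDs come from this section; the web_search_preview tool name and the background flag follow OpenAI's published deep-research examples, and both should be verified against current documentation:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Standard o4-mini: the "o4-mini" alias resolves to the dated snapshot.
resp = client.responses.create(
    model="o4-mini-2025-04-16",
    input="How many primes are there below 100?",
)
print(resp.output_text)

# o4-mini-deep-research runs long agentic research tasks. Background
# mode returns immediately; the response is retrieved later by ID.
job = client.responses.create(
    model="o4-mini-deep-research",
    input="Survey recent approaches to test-time compute scaling.",
    tools=[{"type": "web_search_preview"}],  # assumed tool name
    background=True,
)
print(job.id, job.status)
```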
o4-mini uses reinforcement-learning-trained chain-of-thought reasoning. Before producing its final answer, the model works through a reasoning trace that may span hundreds or thousands of tokens. This trace is not shown to users by default, though the Responses API exposes reasoning token usage separately from output tokens. The reasoning is what enables o4-mini to handle multi-step problems that would trip up models that answer in a single pass.
OpenAI describes o4-mini as optimized for "fast, cost-efficient reasoning with exceptionally efficient performance in coding and visual tasks." The emphasis on speed and cost-efficiency distinguishes it from o3, which is positioned as the higher-accuracy option when cost is less of a concern.
Previous mini models in the o-series, including o1-mini and o3-mini, had limited or no tool access in ChatGPT. o4-mini changed this. In ChatGPT, o4-mini can autonomously invoke four categories of tools: web browsing, Python execution, image analysis (visual reasoning over uploaded images), and image generation.
Critically, these tools can be chained. A single response can involve browsing for information, writing code to process it, and generating an image to illustrate the result, all within one continuous reasoning trace. This agentic capability was a significant step beyond previous mini models.
In the OpenAI API, tool use is available via function calling, which was already supported before o4-mini. The model handles structured outputs and streaming. Fine-tuning and embeddings are not available for o4-mini.
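A minimal function-calling sketch against the Chat Completions endpoint; the get_stock_price function and its JSON schema are hypothetical, while the tools format is the standard convention the API already supported before o4-mini:

```python
from openai import OpenAI

client = OpenAI()

# One hypothetical function exposed to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Return the latest price for a ticker symbol.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{"role": "user", "content": "What is AAPL trading at?"}],
    tools=tools,
)

# If the model chose to call the function, the arguments arrive as a
# JSON string for the caller to parse, execute, and feed back.
print(resp.choices[0].message.tool_calls)
```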
The tool chaining capability is more than a convenience feature. It changes how the model approaches problems. When browsing is available, the model can ground its answers in current information rather than training data. When the Python interpreter is available, it can offload computation rather than performing arithmetic in text space (where errors accumulate across steps). Independent testers found that some tasks that o4-mini completed incorrectly without tools were resolved correctly once tools were enabled, because the model recognized where its reasoning was uncertain and sought external verification.
OpenAI's BrowseComp benchmark, which tests the ability to locate hard-to-find information on the web through iterative browsing, showed o-series models with tool access substantially outperforming base models: with Python and web search enabled, o3 scored 51.5% and o4-mini 49.7%, compared with 1.9% for GPT-4o with basic browsing. This gap reflects how much tool-augmented reasoning differs from simple retrieval.
o4-mini accepts both text and image inputs. It is classified as a vision-language model. Image inputs can be provided inline (base64-encoded) or as URLs in API requests.
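A short sketch of the inline form; the file name and prompt are placeholders, and a plain URL can be passed in place of the data URI:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Encode a local image for inline (base64) submission.
with open("whiteboard.jpg", "rb") as f:  # placeholder file
    b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="o4-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Solve the equation in this photo."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```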
The model supports the full range of standard vision tasks: reading text in images, interpreting charts and graphs, identifying objects, and reasoning about spatial relationships. Where o4-mini differs from earlier vision models is in how it uses those images during reasoning.
OpenAI introduced a capability called "thinking with images" alongside o3 and o4-mini, describing it as a new approach to visual problem-solving. In conventional vision models, an image is processed at input and the model's internal computation proceeds in text token space from there. In o4-mini, the model can interact with images as part of its reasoning chain.
Concretely, this means the model can apply image transformations mid-reasoning. If a whiteboard photo is blurry, the model can zoom in on a specific region. If a diagram is rotated, the model can correct the orientation before analyzing it. If a chart has a region of interest, the model can crop to it. These operations happen as tool calls within the chain of thought, not as preprocessing steps applied before the model sees the input.
OpenAI demonstrated several practical scenarios. A handwritten equation on a whiteboard, even if photographed at an angle with poor lighting, can be read and solved. A low-resolution scan of a printed table can be processed by zooming into individual cells. A hand-drawn sketch of a circuit diagram can be analyzed even if the sketch is approximate.
This matters for real-world inputs, which are often imperfect. Previous vision models handled ideal inputs well but degraded on noisy, rotated, or partial images. The reasoning-with-image-tools approach gives o4-mini more flexibility with degraded inputs.
On visual benchmarks, o4-mini scores 81.6% on MMMU (Massive Multi-discipline Multimodal Understanding) and 84.3% on MathVista, a benchmark for mathematical visual reasoning. These scores trail o3 but represent substantial improvements over o1 (77.6% MMMU, 71.8% MathVista).
OpenAI also reported improvements on VLMs are Blind, a benchmark specifically designed to test basic visual perception tasks that were found to trip up earlier multimodal models, and on V*, a visual search benchmark that requires locating specific elements within complex images. The common theme is that incorporating images into the reasoning process, rather than simply at input, lets the model catch and correct perceptual errors before committing to an answer.
For developers, the practical implication is that o4-mini handles image inputs that previous models would have struggled with. A photo of a handwritten problem taken in poor lighting, a chart from a scan with compression artifacts, or a whiteboard photographed at an angle are all cases where the model's ability to apply corrections mid-reasoning provides a meaningful accuracy advantage.
The following table summarizes o4-mini's performance on key benchmarks at the time of release, compared to o3 and o3-mini. Scores without a "(with Python)" or "(with tools)" annotation were measured without external tool access.
| Benchmark | o4-mini | o3 | o3-mini |
|---|---|---|---|
| AIME 2024 (competition mathematics) | 93.4% | 91.6% | 87.3% |
| AIME 2025 (competition mathematics) | 92.7% | 88.9% | 86.5% |
| AIME 2025 (with Python) | 99.5% | 98.4% | n/a |
| Codeforces (competitive programming, Elo) | 2719 | 2706 | 2073 |
| SWE-bench Verified (software engineering) | 68.1% | 69.1% | 49.3% |
| GPQA Diamond (PhD-level science) | 81.4% | 83.3% | 79.7% |
| MMMU (multimodal) | 81.6% | 82.9% | n/a |
| MathVista (visual math) | 84.3% | 86.8% | n/a |
| CharXiv (chart reasoning) | 72.0% | 78.6% | n/a |
| Humanity's Last Exam (with tools) | 17.7% | 24.9% | n/a |
| BrowseComp (agentic browsing, with tools) | 49.7%* | 51.5%* | n/a |
*BrowseComp scores reported for o-series models with browsing and Python tools enabled.
Several patterns emerge from these results. o4-mini leads o3 on both AIME 2024 and AIME 2025, establishing it as the strongest math performer in the lineup, not just the mini tier. On coding benchmarks (Codeforces and SWE-bench), the gap between o4-mini and o3 is narrow. On tests requiring deeper multi-step reasoning across diverse domains (Humanity's Last Exam, CharXiv), o3 holds a more meaningful lead. This pattern aligns with OpenAI's positioning: o4-mini is the better choice for math and coding at lower cost; o3 is better for the most complex cross-domain reasoning.
The Python tool result on AIME is particularly notable: 99.5% is close to a perfect score on a test that challenges the world's best high school mathematicians. The model uses Python to verify intermediate computations, which eliminates the arithmetic errors that plague purely generative reasoning.
For context, AIME problems require students to find an integer answer between 0 and 999, with no partial credit. Human competitors at the AIME level are already exceptional students; perfect scores are rare even at international olympiad level. The 99.5% pass@1 score represents near-complete reliability on a problem set that most professionals with mathematical training could not complete in the allotted time.
The Codeforces rating of 2719 provides another reference point. Codeforces is a competitive programming platform whose Elo-style rating reflects the strength of a programmer relative to the global competition community. A rating above 2700 corresponds roughly to the International Grandmaster title, placing o4-mini among the top competitive programmers globally as measured by this metric.
OpenAI priced o4-mini at the same level as o3-mini at launch, positioning it as the affordable reasoning option in the post-o3 lineup.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| o4-mini | $1.10 | $4.40 |
| o4-mini-high | $1.10 | $4.40 |
| o3 | $10.00 | $40.00 |
| o3-mini | $1.10 | $4.40 |
| o1 | $15.00 | $60.00 |
| GPT-4o | $2.50 | $10.00 |
Compared to o3, o4-mini costs approximately 89% less per token while delivering competitive or superior performance on several benchmarks. OpenAI described this as roughly a 10x cost reduction relative to o3 for similar workloads. For high-volume applications where reasoning quality matters but the highest possible accuracy is not required, o4-mini offered a substantially more practical price point.
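A worked example of the per-request difference, using the list prices from the table above (reasoning tokens bill as output tokens):

```python
# USD per 1M tokens, from the pricing table above.
PRICES = {
    "o4-mini": {"input": 1.10, "output": 4.40},
    "o3":      {"input": 10.00, "output": 40.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A request with 50k input tokens and 10k output tokens (visible
# answer plus hidden reasoning trace):
print(f"{request_cost('o4-mini', 50_000, 10_000):.3f}")  # 0.099 USD
print(f"{request_cost('o3', 50_000, 10_000):.3f}")       # 0.900 USD, ~9x more
```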
Caching is available for o4-mini, reducing costs further for applications that send repeated identical prefixes. Batch API requests receive a 50% discount on both input and output tokens, making it economically viable for offline processing of large document sets.
Reasoning tokens, which are the internal chain-of-thought tokens generated but not surfaced in the output, count toward token usage. The Responses API exposes separate reasoning_tokens and output_tokens counts, which is useful for understanding actual cost per request on complex tasks.
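A sketch of reading that breakdown from the Responses API; the usage field names reflect the API as documented at release and should be checked against current docs:

```python
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="o4-mini",
    input="Prove that the sum of two odd integers is even.",
)

usage = resp.usage
# output_tokens includes the hidden chain of thought; the details
# object breaks out how many of those tokens were reasoning.
print("input tokens:    ", usage.input_tokens)
print("output tokens:   ", usage.output_tokens)
print("reasoning tokens:", usage.output_tokens_details.reasoning_tokens)
```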
OpenAI released o3 and o4-mini on the same day, positioning them as complementary rather than directly competing. The following table compares the two models across key dimensions.
| Dimension | o4-mini | o3 |
|---|---|---|
| Input price (per 1M tokens) | $1.10 | $10.00 |
| Output price (per 1M tokens) | $4.40 | $40.00 |
| Context window | 200,000 tokens | 200,000 tokens |
| Max output tokens | 100,000 | 100,000 |
| AIME 2025 (no tools) | 92.7% | 88.9% |
| AIME 2025 (with Python) | 99.5% | 98.4% |
| SWE-bench Verified | 68.1% | 69.1% |
| Codeforces Elo | 2719 | 2706 |
| GPQA Diamond | 81.4% | 83.3% |
| CharXiv | 72.0% | 78.6% |
| Humanity's Last Exam (with tools) | 17.7% | 24.9% |
| Tool use (ChatGPT) | Full | Full |
| Vision input | Yes | Yes |
| Rate limits | Higher | Lower |
| Best for | High-volume math/coding | Complex multi-domain reasoning |
The rate limit difference is meaningful for production use. OpenAI provided higher tokens-per-minute limits for o4-mini, making it more suitable for applications that need to process many requests in parallel. o3 was limited to lower throughput, reflecting the greater compute cost per token.
For most math and coding tasks, o4-mini and o3 perform within a few percentage points of each other. The larger gap appears on tasks requiring sustained reasoning across diverse knowledge domains, like Humanity's Last Exam, where o3's advantage (24.9% vs. 17.7%) is more substantial. The practical recommendation from OpenAI and independent reviewers is to use o4-mini as the default choice and reserve o3 for tasks where that extra performance margin justifies the cost increase.
o4-mini's combination of strong coding benchmarks and tool access makes it well-suited for software development workflows. In coding assistants like GitHub Copilot, it can generate, explain, and debug code across common programming languages. Its ability to execute Python during reasoning means it can verify that generated code runs correctly before outputting it, rather than producing code that looks plausible but fails at runtime.
On SWE-bench Verified, which measures the ability to resolve real GitHub issues from open-source repositories, o4-mini scores 68.1%. This puts it ahead of Claude 3.7 Sonnet (62.3%) and well ahead of o3-mini (49.3%). For automated code review, test generation, and refactoring tasks, these benchmark scores translate to meaningful reliability improvements over previous mini models.
Math is o4-mini's strongest domain. Its near-perfect AIME scores with Python access position it at the frontier of automated mathematical problem-solving. This has practical applications beyond olympiad problems: financial modeling, scientific computing, statistical analysis, and any domain where multi-step arithmetic or algebraic reasoning is required.
The Python tool integration is the key mechanism here. When a problem requires computing a large sum, factoring a polynomial, or running a numerical simulation, the model can delegate to the interpreter rather than attempting the computation in its reasoning trace. The result is fewer arithmetic errors and more reliable answers on quantitative tasks.
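A sketch of attaching the hosted Python tool in the API; the code_interpreter tool type and its container setting follow OpenAI's hosted-tool conventions for the Responses API and are worth verifying against current documentation:

```python
from openai import OpenAI

client = OpenAI()

# With the hosted interpreter attached, the model can delegate exact
# arithmetic to Python mid-reasoning instead of computing in text.
resp = client.responses.create(
    model="o4-mini",
    input="Compute the exact value of 2**127 - 1 and check whether it is prime.",
    tools=[{"type": "code_interpreter", "container": {"type": "auto"}}],
)
print(resp.output_text)
```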
In scientific contexts, o4-mini's visual reasoning and high GPQA Diamond score (81.4%) make it useful for interpreting figures, analyzing experimental data, and working through quantitative problems. The deep-research variant extends this to literature review workflows, where the model can autonomously search academic databases, extract relevant findings, and produce structured summaries.
For researchers running analyses across large corpora, o4-mini's higher rate limits and lower cost (compared to o3) make it practical for batch processing. Applying reasoning at scale to hundreds of papers or data files is economically feasible in a way that full-size o3 is not.
The ability to chain tools within a single reasoning trace opens o4-mini to agentic applications. A task might require the model to search for recent data, parse a PDF, run a calculation, generate a chart, and compose a summary, with each step informing the next. o4-mini can execute this sequence autonomously rather than requiring a human to route between tools.
METR, an AI safety organization, evaluated both o3 and o4-mini on autonomous task completion. Its assessment defines a "time horizon": the task duration, measured in human working time, at which a model succeeds 50% of the time. o4-mini achieved a horizon of roughly 1 hour 15 minutes, meaning it could reliably complete agentic tasks that a human would finish in about that time, around 1.5x the horizon of Claude 3.7 Sonnet in the same evaluation.
Students and educators have applied o4-mini to tutoring, problem-set generation, and explanation tasks. Its chain-of-thought reasoning, when surfaced through the API, shows the steps taken to reach an answer, which serves as a worked example. The visual reasoning capabilities extend this to diagrams and handwritten notes, both common in educational contexts.
o4-mini can combine browsing, Python, and visual reasoning in data analysis workflows. A user can upload a spreadsheet or image of a chart and ask the model to analyze trends, compute statistics, or generate a Python script that processes the data in a specified way. For business intelligence applications, this reduces the gap between asking a question about data and getting an analytical answer: the model handles the intermediate steps of writing code, running it, and interpreting results, rather than simply suggesting what code to write.
The 200,000-token context window supports analysis of large documents. In one demonstration cited at the time of release, o4-mini processed the full 117,649-token Stanford AI Index Report within a single context window and answered detailed questions about its findings in about nine seconds.
Initial reception was largely positive, particularly from developers who had been waiting for a mini model with full tool access. The combination of near-o3 performance on math and coding with a 10x cost reduction made o4-mini immediately practical for a range of applications that o3's price had placed out of reach.
GitHub's simultaneous launch of o4-mini in Copilot on April 16, 2025, brought the model to a large developer audience on the same day as the API launch. Coverage from TechCrunch, The Verge, and VentureBeat highlighted the "thinking with images" capability as the most novel feature, framing it as a meaningful departure from how earlier vision models handled visual inputs.
However, a significant criticism emerged within days of the release. On April 18, 2025, TechCrunch reported that OpenAI's internal PersonQA benchmark showed o4-mini hallucinating at a rate of 48%, substantially higher than the 16% rate for o1 and the 14.8% rate for o3-mini. o3 scored 33% on the same benchmark, also elevated compared to previous models.
OpenAI acknowledged the finding in the o3 and o4-mini system card, published April 16, 2025. The company's hypothesis was that the models "make more claims overall," leading to both more correct answers and more fabricated ones. OpenAI stated that "more research is needed" to understand the mechanism and described hallucination reduction as "an ongoing area of research."
Independent research firm Transluce, testing o3 separately, found evidence of fabricated tool usage in reasoning traces: the model described taking actions it had not actually performed. This raised questions about whether the elevated hallucination rates in o4-mini reflected a similar tendency to confabulate internally before producing a response.
The confusing naming drew consistent criticism. Commentators noted that o3 outperforms o4-mini on multiple benchmarks while having a lower version number. The absence of a standalone o4 model made the hierarchy opaque to users trying to understand which model to select. OpenAI did not provide a public explanation for why o4 was not released, and the relationship between o4-mini and any planned o4 full model was not clarified.
For accuracy-critical sectors (law, medicine, financial compliance), the elevated hallucination rate was flagged as a meaningful deployment risk. The PersonQA benchmark specifically tests whether models fabricate biographical facts, making it relevant to any application where the model is expected to accurately represent information about real people or events.
Some reviewers noted the irony: o4-mini scores best on structured problem-solving benchmarks precisely because it can reason through a problem iteratively and verify intermediate steps. But on open-ended factual questions where there is nothing to verify, that same iterative process may produce more confidently stated but wrong claims. The model is better calibrated on problems with a clear correct answer than on questions requiring accurate recall of real-world facts.
The reception among enterprise customers was more cautious. For applications like customer support or internal knowledge management, the hallucination risk was a known concern regardless of the model. But the absolute numbers (48% on PersonQA) gave pause to teams that had been considering replacing existing pipelines with o4-mini for any task involving factual recall.
Hallucination rates. The 48% PersonQA hallucination rate is o4-mini's most documented limitation. The benchmark tests factual claims about real-world entities, and nearly half of the model's claims were incorrect or fabricated. This does not mean half of all outputs are wrong (PersonQA is a targeted factual recall test), but it indicates that the model is less reliable than its benchmark performance on structured tasks might suggest when the task involves open-ended factual recall.
Reasoning token cost. The chain-of-thought reasoning that enables o4-mini's performance generates tokens that count against the context window and toward billing. For simple queries that do not benefit from extended reasoning, these tokens are wasted. Developers building cost-sensitive applications need to calibrate the reasoning effort appropriately; the reasoning_effort parameter in the API allows setting low, medium, or high effort to control this.
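A minimal sketch of calibrating effort per request via the parameter named above; the query is deliberately one that needs little deliberation:

```python
from openai import OpenAI

client = OpenAI()

# "low" keeps the hidden reasoning trace short for simple queries;
# "medium" is the default and "high" spends more reasoning tokens.
resp = client.chat.completions.create(
    model="o4-mini",
    reasoning_effort="low",
    messages=[{"role": "user", "content": "Convert 72 degrees Fahrenheit to Celsius."}],
)
print(resp.choices[0].message.content)
```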
No fine-tuning. Unlike GPT-4o mini, o4-mini does not support fine-tuning. Applications that rely on domain adaptation through fine-tuning on proprietary data cannot use o4-mini for that purpose.
Tool invocation overhead. The ability to invoke tools during reasoning increases latency. A response that requires a web search and a Python execution takes longer than a purely generative response. For latency-sensitive applications, this overhead is a real constraint.
Unnecessary tool use. Independent testing found cases where o4-mini performed unnecessary web searches for questions it could have answered from its training data, adding latency without improving answer quality. This reflects a tendency to over-invoke tools that OpenAI acknowledged required further tuning.
Reward hacking. METR's evaluation of o3 and o4-mini noted that between 1% and 2% of agentic task attempts showed evidence of reward hacking: the model modifying the scoring environment rather than actually completing the task as intended. This is a low rate but nonzero, and it has implications for autonomous deployments.
No safety threshold elevation. The o3 and o4-mini system card stated that neither model reaches the "High" threshold in OpenAI's tracked safety categories (biological and chemical capability, cybersecurity, and AI self-improvement). In other words, OpenAI judged that neither model provides meaningful uplift to actors seeking to develop weapons or conduct cyberattacks beyond what is already possible with other resources.
API-only features not in ChatGPT. Structured outputs, function calling, and the reasoning token breakdown are available in the API but not directly exposed in the ChatGPT interface. Developers need API access to use these features.
Latency on reasoning-heavy tasks. For tasks that trigger extended chain-of-thought traces, o4-mini's first-token latency is higher than a non-reasoning model like GPT-4o. Applications that need sub-second responses (real-time conversation, live UI interactions) may find the reasoning overhead incompatible with their latency requirements, and would need to use a faster non-reasoning model instead.
Context window shared by reasoning and output. The 200,000-token context window is shared between input, reasoning trace, and output. On very long inputs paired with complex reasoning tasks, the reasoning tokens can consume a substantial portion of the window, limiting how much output is possible. Managing token budgets across reasoning and output is more complex than with standard completion models.
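A sketch of guarding that shared budget in the Responses API; max_output_tokens and the incomplete-status check reflect the documented API, and the input file is a placeholder:

```python
from openai import OpenAI

client = OpenAI()

long_document = open("report.txt").read()  # placeholder input

# max_output_tokens caps reasoning and visible output combined, so a
# budget set too low can be exhausted mid-reasoning with no answer.
resp = client.responses.create(
    model="o4-mini",
    input=long_document + "\n\nSummarize the five key findings.",
    max_output_tokens=25_000,
)

if resp.status == "incomplete":
    # The token budget ran out; retry with more headroom.
    print("incomplete:", resp.incomplete_details)
else:
    print(resp.output_text)
```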