GPT-5-Codex
Last reviewed
May 17, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 · 3,074 words
GPT-5-Codex is a coding-specialised variant of OpenAI's GPT-5 model, announced on September 15, 2025 and tuned for agentic software engineering inside the OpenAI Codex product family.[1][2] The model is the default engine for cloud tasks and code review in Codex, and developers can also select it for local tasks through the OpenAI Codex CLI and the Codex IDE extension.[1][3] Compared with the base GPT-5, GPT-5-Codex received further reinforcement learning on real-world engineering work such as building projects from scratch, adding tests, fixing bugs, conducting large-scale refactors, and running code reviews. It also allocates reasoning time dynamically, so that simple edits return in seconds while complex agent runs can extend to roughly seven hours.[1][4][5]
GPT-5-Codex was the first member of a new Codex-specific model line for OpenAI. Subsequent variants in the same family include GPT-5.1-Codex, GPT-5.1-Codex-Max, GPT-5.2-Codex, and GPT-5.3-Codex, each pushing further on long-horizon agentic coding benchmarks while keeping the same general product surface.[6][7] At launch, OpenAI positioned the model directly against rival agentic coders such as Claude Code, Cursor, and GitHub Copilot, and made it available to ChatGPT Plus, Pro, Business, Edu, and Enterprise subscribers, with API access following on September 23, 2025.[2][3]
The Codex brand at OpenAI has a layered history. The original Codex was a code-specialised descendant of GPT-3 released in July 2021, which powered the first generation of GitHub Copilot before being retired as a standalone API in March 2023.[8] OpenAI revived the Codex name in April 2025 for an agentic coding push that initially comprised Codex CLI, the cloud product at chatgpt.com/codex now branded as Codex Cloud, and an IDE extension for VS Code, Cursor, and Windsurf.[8][9] The original revival used variants of OpenAI's o-series reasoning models, and was upgraded throughout the summer of 2025 to use the newly launched GPT-5 family.[10]
GPT-5 itself was released on August 7, 2025 as OpenAI's flagship general-purpose model, with both a reasoning router and a long context window.[11] GPT-5 already produced competitive results on public coding benchmarks, scoring around 74.9 percent on SWE-bench Verified, but OpenAI observed that real software engineering work demanded different behaviour from chat-style coding: longer-running tasks, more careful reading of large codebases, and tighter adherence to repository conventions.[5][11] GPT-5-Codex was OpenAI's first attempt to ship a model whose post-training was tuned specifically for that workflow rather than for chat or general assistance.
The launch on September 15, 2025 was part of a broader Codex upgrade announcement that also covered improvements to the CLI, the IDE extension, the cloud environment, and integrations with GitHub for pull request review.[1][3] OpenAI product lead Alexander Embiricos framed the model as the centrepiece of a wider effort to make Codex the default surface for agentic coding work, both pair-programming sessions and unattended long-running jobs.[2]
GPT-5-Codex is described by OpenAI as "a version of GPT-5 further optimised for agentic coding in Codex."[1] It is not a separately pretrained model but a fine-tune that starts from the GPT-5 base and applies additional reinforcement learning, primarily on coding environments.[1][5] OpenAI released an addendum to the GPT-5 system card on the same day as the model, describing the safety mitigations layered on top of the base GPT-5 release rather than restating them from scratch.[5]
The key behavioural differences between GPT-5-Codex and the base GPT-5 are in how the model spends compute, how it follows code-style instructions, and how it reasons about long sequences of edits. OpenAI states that GPT-5-Codex "spends its thinking time more dynamically" than previous models, ranging from a few seconds for a small bug fix to roughly seven hours of continuous work on a complex multi-file refactor or feature build.[1][12] In practice this means the same model identifier serves very different workflows, and the model adjusts its effort mid-task rather than relying on a separate router or model swap.[2]
A second specialisation is steerability for code style and project conventions. GPT-5-Codex was trained to read project-specific guidance files such as AGENTS.md, which act as a Codex-readable rulebook for the repository, and to follow concise human instructions about formatting, testing, and review preferences without long system prompts.[1][4] OpenAI also reports that the model produces fewer redundant or low-value review comments and is more likely to flag genuine issues, after explicit training on a curated set of high-quality engineering reviews.[1][2]
Finally, GPT-5-Codex is tuned for code review and large-repository refactors as first-class workloads rather than as side effects of general coding ability. The reference workflows in OpenAI's announcement include navigating a codebase, reasoning through dependencies, running tests, and writing actionable pull request comments, all of which were targets of the model's reinforcement learning.[1][3]
The headline capability of GPT-5-Codex is what OpenAI calls dynamic or adaptive reasoning, in which the model itself decides how much hidden thinking to allocate per request. Short interactions such as renaming a variable, fixing a small bug, or answering a quick question return quickly, sometimes in seconds, while ambitious agentic tasks can sustain reasoning and tool use for up to about seven hours of wall-clock time.[1][12] OpenAI says this is learned at training time rather than configured through a separate parameter, so users do not have to pick between a fast and a slow mode.[1]
A practical consequence is that GPT-5-Codex is unusually well suited to long-horizon agent runs in Codex Cloud, where the model is allowed to keep iterating against a sandbox, run tests, read failures, and revise its work without a human in the loop. In OpenAI's reported pilot uses, the model successfully handled a pull request spanning 232 files and 3,541 lines of changes in a single uninterrupted run.[12]
GPT-5-Codex was specifically tuned for systematic, multi-step edits across a codebase. On an internal refactoring benchmark, OpenAI reports the model reaches 51.3 percent accuracy compared to 33.9 percent for the base GPT-5, a roughly 17-point absolute gap that the company attributes to the model's ability to plan changes, validate them with dependency analysis, and run tests during the edit.[1][4]
For large refactors, GPT-5-Codex follows a workflow similar to that of a careful human engineer. It walks the repository, builds an internal picture of how modules interact, proposes a target state, then makes the edits while running and re-running tests until they pass. The model is also trained to defer when the requested change is ambiguous and to ask clarifying questions instead of guessing, which OpenAI describes as a deliberate trade-off against the more eager default behaviour of GPT-5.[1]
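The edit-and-retest part of that workflow can be illustrated with a minimal sketch. This is not OpenAI's implementation: the function `iterate_until_green`, its `max_rounds` budget, and the toy test and fix callables are all hypothetical stand-ins chosen to show the shape of the loop.

```python
from typing import Callable, List


def iterate_until_green(
    run_tests: Callable[[], List[str]],
    apply_fix: Callable[[List[str]], None],
    max_rounds: int = 5,
) -> bool:
    """Re-run the tests after each fix until none fail or the budget runs out."""
    for _ in range(max_rounds):
        failures = run_tests()
        if not failures:
            return True
        apply_fix(failures)
    # One final check after the last permitted fix.
    return not run_tests()


# Toy stand-ins (hypothetical): a set of failing test names,
# where each "fix" resolves the first reported failure.
failing = {"test_parse", "test_render"}


def fake_run_tests() -> List[str]:
    return sorted(failing)


def fake_apply_fix(failures: List[str]) -> None:
    failing.discard(failures[0])


all_green = iterate_until_green(fake_run_tests, fake_apply_fix)
```

In a real harness `run_tests` would invoke the project's test command and `apply_fix` would be a model-driven edit; the round budget is one simple way to bound an otherwise open-ended run.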
OpenAI trained GPT-5-Codex to act as a structured code reviewer in addition to being a code author. When attached to a GitHub repository through Codex, the model can run on pull requests, read the diff, examine the surrounding code, run tests, and produce review comments that focus on substantive issues rather than nits.[1][3] OpenAI says the training set included real-world reviews assessed by experienced engineers, with the explicit goal of cutting low-value or incorrect comments and concentrating on bugs, regressions, and unsafe patterns.[1][2]
GPT-5-Codex is designed to operate inside a sandbox, with shell access, network policies, and file-system permissions managed by the surrounding Codex product. In the CLI and IDE extension, the sandbox runs locally with operating-system level isolation; in Codex Cloud, OpenAI provisions an ephemeral environment per task.[3] The model is trained to use the tools that the Codex surface exposes, including shell commands, file editing, and a browser-style preview for web apps, and to recover from common failure modes such as failing tests, missing dependencies, or syntax errors.[1][4]
GPT-5-Codex pays attention to the AGENTS.md convention, a markdown file at the top of a repository that describes how the codebase wants to be modified. This includes preferred package managers, test commands, formatting conventions, branch naming, and disallowed actions.[1] In OpenAI's words, the model is "more steerable, adheres better to instructions, and produces higher-quality code" than the base GPT-5 when given short, project-specific guidance.[1] This is intended to reduce the long, hand-tuned system prompts that early agentic coding setups often required.
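As an illustration of what such a file might contain, the sketch below writes a small AGENTS.md covering the five topics listed above. Every command, path, and rule in it is a hypothetical example (including the choice of `pnpm`), and the helper names `write_agents_md` and `section_titles` are invented for this sketch rather than taken from any Codex tooling.

```python
from pathlib import Path
import tempfile

# Hypothetical AGENTS.md contents; every command and rule below is illustrative.
AGENTS_MD = """\
# AGENTS.md

## Package manager
- Use pnpm; do not add a second lockfile.

## Tests
- Run `pnpm test` before proposing a change.

## Formatting
- Run `pnpm lint --fix` on touched files.

## Branch naming
- Use `feature/<short-description>` or `fix/<short-description>`.

## Disallowed actions
- Do not edit files under `vendor/`.
- Do not force-push.
"""


def write_agents_md(repo_root: Path) -> Path:
    """Write the sample guidance file at the top of the repository."""
    path = repo_root / "AGENTS.md"
    path.write_text(AGENTS_MD, encoding="utf-8")
    return path


def section_titles(text: str) -> list[str]:
    """Return the second-level headings, i.e. the topics the file covers."""
    return [line[3:].strip() for line in text.splitlines() if line.startswith("## ")]


# Write the file into a throwaway directory and read back its topics.
with tempfile.TemporaryDirectory() as tmp:
    written = write_agents_md(Path(tmp))
    titles = section_titles(written.read_text(encoding="utf-8"))
```

The point of the format is brevity: a few short, imperative bullets per topic stand in for the long system prompts mentioned above.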
At launch, OpenAI emphasised both standard benchmarks and internal evaluations. The public framing places GPT-5-Codex above the base GPT-5 on agentic coding tasks, while keeping similar performance on raw code generation. The independent llm-stats comparison and OpenAI's own materials report the following figures for GPT-5-Codex against GPT-5 at the September 2025 release.
| Benchmark | GPT-5 (Aug 2025) | GPT-5-Codex (Sept 2025) | Notes |
|---|---|---|---|
| SWE-bench Verified | 74.9% | 74.5% | Python-only agentic patch generation on real GitHub issues[5][13] |
| Internal refactoring eval | 33.9% | 51.3% | Multi-file refactors with dependency analysis[1][4] |
| Code review usefulness | Lower | Higher | Human-rated comment quality, internal panel[1][2] |
| Token usage on easy tasks | Baseline | About 93.7% fewer | Same-quality output on the simplest 10% of requests[12] |
On SWE-bench Verified the two models are essentially tied, with GPT-5-Codex at 74.5 percent and GPT-5 at 74.9 percent.[13] OpenAI's explanation is that SWE-bench Verified, while widely cited, is a Python-only benchmark of bounded GitHub issues and does not stress the long-horizon, multi-language refactor work that Codex training is aimed at.[1][12] The larger gaps appear on internal evaluations that more closely match real engineering workloads: the roughly 17-point refactoring lift and the 93.7 percent reduction in tokens on the simplest requests.[1][4][12]
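The arithmetic behind these comparisons is easy to check. The short sketch below uses only the figures in the table above; the 1,000-token baseline is a hypothetical number chosen to make the token reduction concrete.

```python
# Figures as given in the benchmark table.
swe_gpt5, swe_codex = 74.9, 74.5
refactor_gpt5, refactor_codex = 33.9, 51.3
token_reduction = 0.937  # "about 93.7% fewer" tokens on the simplest requests

# Absolute gaps, in percentage points.
swe_gap = round(swe_codex - swe_gpt5, 1)                 # -0.4
refactor_gap = round(refactor_codex - refactor_gpt5, 1)  # 17.4

# Relative gain on the refactoring eval, as a percentage of the GPT-5 score.
refactor_relative = round(refactor_gap / refactor_gpt5 * 100, 1)  # 51.3

# Hypothetical baseline: an easy request that costs GPT-5 1,000 tokens.
baseline_tokens = 1000
codex_tokens = round(baseline_tokens * (1 - token_reduction))  # 63
```

Read this way, the refactoring gap is about 17 points in absolute terms but roughly a 51 percent improvement relative to the GPT-5 score, while the SWE-bench Verified difference is under half a point.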
Later models in the Codex line have built on these numbers. GPT-5.2-Codex, released in December 2025, raised SWE-bench Verified into the high seventies on the new agentic harness, and GPT-5.3-Codex, released in February 2026, reported roughly 80 percent on SWE-bench Verified along with 56.8 percent on the harder SWE-bench Pro and 77.3 percent on Terminal-Bench 2.0.[6][7] These successors confirmed the direction set by GPT-5-Codex but did not retroactively change its September 2025 figures.
GPT-5-Codex was released as the default model across the Codex product family on September 15, 2025. At launch it was accessible to ChatGPT Plus, Pro, Business, Edu, and Enterprise subscribers, who could use it through the Codex CLI, the IDE extension, the Codex Cloud web app at chatgpt.com/codex, and the GitHub pull request review integration.[1][2] Eight days later, on September 23, 2025, OpenAI opened the model to developers using Codex via API keys, making gpt-5-codex a callable model identifier for programmatic use alongside the rest of the GPT-5 family.[2][3]
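For programmatic use, a request naming the model might look like the sketch below. The only detail taken from the text above is the `gpt-5-codex` identifier; the endpoint URL, the `model` and `input` body fields, and the bearer-token header are assumptions about OpenAI's Responses API and should be checked against the current API reference. The code builds the request without sending it, so it runs without a key or network access.

```python
import json
import os
import urllib.request

# Assumption: the Responses API endpoint and its "model"/"input" body fields.
API_URL = "https://api.openai.com/v1/responses"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that names gpt-5-codex."""
    body = json.dumps({"model": "gpt-5-codex", "input": prompt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(
    "Write a unit test for parse_date().",  # illustrative prompt
    os.environ.get("OPENAI_API_KEY", "sk-placeholder"),
)
payload = json.loads(request.data)
# To send it, pass `request` to urllib.request.urlopen with a real key.
```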
At the September 2025 launch, OpenAI did not publish a separate price card for GPT-5-Codex distinct from GPT-5. Codex usage for paying ChatGPT subscribers was metered through the same plan-level rate limits as other reasoning models, and API pricing followed the GPT-5 family schedule. Industry observers later noted that GPT-5-Codex tended to use significantly fewer output tokens for the same task than competing models, which made it cheaper in practice on terminal-heavy workloads even before accounting for unit pricing.[14]
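GPT-5-Codex is intentionally restricted to surfaces that are designed around it, rather than being offered as a drop-in chat model. The main surfaces are the Codex CLI, the Codex IDE extension, the Codex Cloud web app at chatgpt.com/codex, the GitHub pull request review integration, and, since September 23, 2025, the API.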
OpenAI does not offer GPT-5-Codex through the standard ChatGPT chat interface in the same way as general-purpose GPT-5 variants. The model expects access to a Codex-style toolchain, including a working filesystem and shell, and is not optimised for free-form conversational use.[1]
The most direct competitor to GPT-5-Codex in the weeks around its release was Anthropic's Claude Sonnet 4.5, which launched two weeks later, on September 29, 2025, with explicit positioning around coding and computer use through Claude Code. Independent reviewers and benchmark aggregators reported a mixed picture: GPT-5-Codex was generally rated stronger on long-running backend refactors, large multi-file changes, and token efficiency, while Claude Sonnet 4.5 was preferred for frontend work, design-sensitive UI generation, and conversational ambiguity handling.[14][15]
| Dimension | GPT-5-Codex | Claude Sonnet 4.5 |
|---|---|---|
| Primary surface | Codex CLI, Codex Cloud, IDE extension | Claude Code CLI, IDE plugins, API |
| Long-running autonomous runs | Up to about 7 hours | Up to about 30 hours |
| Multi-file refactor | Strong, optimised by training | Strong, more conversational |
| Frontend and design tasks | Solid | Often preferred by reviewers |
| Token efficiency per task | Very high, fewer output tokens | Comparable but typically more verbose |
| Pricing posture at launch | GPT-5 family rates, low effective cost | $3 / $15 per million tokens (input / output) |
GPT-5-Codex was also frequently compared against Claude Code as a product, given that the two are direct workflow rivals: each is a CLI-first agentic coding tool wrapped around a frontier model. In head-to-head reviews after launch, GPT-5-Codex was often described as the execution engine that produced more concise diffs and used fewer tokens per task, while Claude Code under Sonnet 4.5 was described as the reasoning partner that explained more and pushed back more in ambiguous cases.[14] At the time of the original release, no single model dominated all categories, and many teams used both depending on the task.
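GPT-5-Codex was positioned for several specific workflows: building projects from scratch, adding features and tests, fixing bugs, conducting large-scale refactors, reviewing pull requests, and running unattended long-horizon agent jobs.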
In its launch materials, OpenAI explicitly contrasted these workloads with chat-style coding assistance, where the user types in one question at a time and expects a quick suggestion. GPT-5-Codex is faster than earlier reasoning models on those short tasks, but it is also designed to behave very differently when given a high-level goal and a sandboxed environment.[1][2]
OpenAI published an addendum to the GPT-5 system card describing the safety mitigations specific to GPT-5-Codex.[5] Because the Codex variant inherits the GPT-5 base, the addendum focuses on changes from that baseline rather than restating shared mitigations. The main additions concern code-specific abuse patterns, such as generating malware or exploit code, and behaviours that arise when the model is given persistent shell and network access.
OpenAI reports that GPT-5-Codex was trained against a curated "golden set" of safety test cases developed by internal policy experts, with edge cases and adversarial examples added to test the model's refusal and dual-use handling.[5] At the product level, the Codex surfaces enforce sandboxing, configurable approval policies for shell commands, and network restrictions, so that even if the model attempts an unsafe action, the surrounding harness can block or require human confirmation.[3][5] The September 2025 system card addendum frames safety for GPT-5-Codex as a combination of model-level steering and product-level constraints, rather than as a property of the model in isolation.
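The product-level half of that arrangement can be pictured as a gate that inspects each shell command before it runs. The sketch below is purely illustrative and does not reflect how Codex implements its approval policies: the `Decision` values, the command sets, and the `review_command` function are all hypothetical.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"  # run without asking
    ASK = "ask"      # require human confirmation
    DENY = "deny"    # block outright


# Hypothetical rule sets; a real harness would make these configurable.
READ_ONLY = {"ls", "cat", "grep"}
NETWORK = {"curl", "wget", "ssh"}
DESTRUCTIVE = {"rm", "dd", "mkfs"}


def review_command(command: str, network_allowed: bool = False) -> Decision:
    """Classify a shell command by its program name before the harness runs it."""
    tokens = shlex.split(command)
    if not tokens:
        return Decision.DENY
    program = tokens[0]
    if program in DESTRUCTIVE:
        return Decision.DENY
    if program in NETWORK:
        return Decision.ALLOW if network_allowed else Decision.DENY
    if program in READ_ONLY:
        return Decision.ALLOW
    # Anything unrecognised goes to a human for confirmation.
    return Decision.ASK


decisions = {
    cmd: review_command(cmd)
    for cmd in ["ls -la", "curl https://example.com", "pytest -q", "rm -rf build"]
}
```

Matching on the program name alone is deliberately simplistic; the point is only that the decision is made by the harness, outside the model, which is why an unsafe attempt by the model need not result in an unsafe action.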
Initial reactions to GPT-5-Codex were broadly positive. Press coverage in VentureBeat, TechCrunch, and InfoQ highlighted the seven-hour autonomous run profile and the refactoring benchmark gain over GPT-5 as the most novel features, and framed the model as OpenAI's clearest response yet to the rise of agentic coding tools.[2][4][12] Developer reviews on technical blogs and on aggregator sites such as llm-stats noted that the model was unusually token-efficient and well behaved on large diffs, with some criticism that its output was sometimes too terse and that frontend work still leaned on Claude.[13][14]
Within OpenAI's own roadmap, GPT-5-Codex was quickly superseded by a fast-moving cadence of Codex-specific successors: GPT-5.1-Codex and GPT-5.1-Codex-Max in November 2025, GPT-5.2-Codex in December 2025, and GPT-5.3-Codex in February 2026, each closing further gaps on agentic coding benchmarks and broadening the workloads handled by the Codex surface.[6][7] Despite that pace, the September 2025 release is widely recognised as the moment at which OpenAI committed to a dedicated coding model line, distinct from its general-purpose GPT-5 chat models.