GPT-5 Codex

Updated Jun 24, 2026

GPT-5-Codex is a coding-specialised variant of OpenAI's GPT-5 model, announced on September 15, 2025 and tuned for agentic software engineering inside the OpenAI Codex product family.^[1]^[2] OpenAI describes it as "a version of GPT-5 further optimized for agentic coding in Codex" that can run independently for more than seven hours on a single complex task, iterating on its own implementation, fixing test failures, and delivering a working result without human intervention.^[1]^[12] The model is the default engine for cloud tasks and code review in Codex, and developers can also select it for local tasks through the OpenAI Codex CLI and the Codex IDE extension.^[1]^[3] Compared with the base GPT-5, GPT-5-Codex received further reinforcement-learning training on real-world engineering work: building projects from scratch, adding tests, fixing bugs, carrying out large-scale refactors, and running code reviews.^[1]^[4]^[5] It also allocates reasoning time dynamically, so that simple edits return in seconds while complex agent runs can extend to roughly seven hours.^[1]^[4]^[5]

GPT-5-Codex was the first member of a new Codex-specific model line for OpenAI. Subsequent variants in the same family include GPT-5.1-Codex and GPT-5.1-Codex-Max (both November 2025), GPT-5.2-Codex (December 2025), and GPT-5.3-Codex (February 2026), each pushing further on long-horizon agentic coding benchmarks while keeping the same general product surface.^[6]^[7]^[16] At launch, OpenAI positioned the model directly against rival agentic coders such as Claude Code, Cursor, and GitHub Copilot, and made it available to ChatGPT Plus, Pro, Business, Edu, and Enterprise subscribers, with API access following on September 23, 2025.^[2]^[3]

Background

The Codex brand at OpenAI has a layered history. The original Codex was a code-specialised descendant of GPT-3 released in July 2021, which powered the first generation of GitHub Copilot before being retired as a standalone API in March 2023.^[8] OpenAI revived the Codex name in April 2025 for an agentic coding push that initially comprised Codex CLI, the cloud product at chatgpt.com/codex now branded as Codex Cloud, and an IDE extension for VS Code, Cursor, and Windsurf.^[8]^[9] The original revival used variants of OpenAI's o-series reasoning models, and was upgraded throughout the summer of 2025 to use the newly launched GPT-5 family.^[10]

GPT-5 itself was released on August 7, 2025 as OpenAI's flagship general-purpose model, with both a reasoning router and a long context window.^[11] GPT-5 already produced competitive results on coding benchmarks, scoring around 74.9 percent on the public SWE-bench Verified benchmark, but OpenAI observed that real software engineering work demanded different behaviour from chat-style coding: longer-running tasks, more careful reading of large codebases, and tighter adherence to repository conventions.^[5]^[11] GPT-5-Codex was OpenAI's first attempt to ship a model whose post-training was tuned specifically for that workflow rather than for chat or general assistance.

The launch on September 15, 2025 was part of a broader Codex upgrade announcement that also covered improvements to the CLI, the IDE extension, the cloud environment, and integrations with GitHub for pull request review.^[1]^[3] OpenAI product lead Alexander Embiricos framed the model as the centrepiece of a wider effort to make Codex the default surface for agentic coding work, both pair-programming sessions and unattended long-running jobs.^[2]

How does GPT-5-Codex differ from GPT-5?

GPT-5-Codex is described by OpenAI as "a version of GPT-5 further optimized for agentic coding in Codex."^[1] It is not a separately pretrained model but a fine-tune that starts from the GPT-5 base and applies additional reinforcement learning, primarily on coding environments.^[1]^[5] OpenAI released an addendum to the GPT-5 system card on the same day as the model, describing the safety mitigations layered on top of the base GPT-5 release rather than restating them from scratch.^[5]

The key behavioural differences between GPT-5-Codex and the base GPT-5 lie in how the model spends compute, how it follows code-style instructions, and how it reasons about long sequences of edits. OpenAI states that GPT-5-Codex "adapts how much time it spends thinking more dynamically based on the complexity of the task," ranging from a few seconds for a small bug fix to roughly seven hours of continuous reasoning on a complex multi-file refactor or feature build.^[1]^[12] In practice this means the same model id serves very different workflows, and the model adjusts its own reasoning effort mid-task rather than relying on a separate router or model swap.^[2] OpenAI quantified the spread: for the simplest 10 percent of requests, GPT-5-Codex used 93.7 percent fewer tokens than GPT-5, while for the most complex 10 percent it spent roughly twice as long reasoning, editing, testing, and iterating.^[1]^[12]
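To make the 93.7 percent figure concrete, the short sketch below applies it to a hypothetical baseline. The 2,000-token baseline is an invented number for illustration only; OpenAI did not publish per-request token counts.

```python
def tokens_after_reduction(baseline_tokens: int, percent_fewer: float) -> float:
    """Return the token count left after cutting `percent_fewer` percent."""
    return baseline_tokens * (1 - percent_fewer / 100)

# Hypothetical baseline: an easy request that costs GPT-5 2,000 tokens.
baseline = 2_000
codex_tokens = tokens_after_reduction(baseline, 93.7)

# Ratio of baseline usage to reduced usage.
ratio = baseline / codex_tokens

print(f"{codex_tokens:.0f} tokens, about {ratio:.0f}x fewer")
```

Under that assumed baseline, the same easy request would cost on the order of 126 tokens, roughly a sixteenth of the original.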

A second specialisation is steerability for code style and project conventions. GPT-5-Codex was trained to read project-specific guidance files such as AGENTS.md, which act as a Codex-readable rulebook for the repository, and to follow concise human instructions about formatting, testing, and review preferences without long system prompts.^[1]^[4] OpenAI also reports that the model produces fewer redundant or low-value review comments and is more likely to flag genuine issues, after explicit training on a curated set of high-quality engineering reviews.^[1]^[2]

Finally, GPT-5-Codex is tuned for code review and large-repository refactors as first-class workloads rather than as side effects of general coding ability. The reference workflows in OpenAI's announcement include navigating a codebase, reasoning through dependencies, running tests, and writing actionable pull request comments, all of which are reinforcement-learned end to end.^[1]^[3]

What can GPT-5-Codex do?

How does dynamic reasoning time work?

The headline capability of GPT-5-Codex is what OpenAI calls dynamic or adaptive reasoning, in which the model itself decides how much hidden thinking to allocate per request. Short interactions such as renaming a variable, fixing a small bug, or answering a quick question return quickly, sometimes in seconds, while ambitious agentic tasks can sustain reasoning and tool use for up to about seven hours of wall-clock time.^[1]^[12] OpenAI says this is learned at training time rather than configured through a separate parameter, so users do not have to pick between a fast and a slow mode.^[1]

A practical consequence is that GPT-5-Codex is unusually well suited to long-horizon agent runs in Codex Cloud, where the model is allowed to keep iterating against a sandbox, run tests, read failures, and revise its work without a human in the loop. In OpenAI's reported pilot uses, the model successfully handled a Gitea pull request that threaded a context variable through multiple layers of the application, spanning 232 files and more than 3,500 lines of changes, in a single uninterrupted run.^[12]

Refactoring and multi-file edits

GPT-5-Codex was specifically tuned for systematic, multi-step edits across a codebase. On an internal refactoring benchmark, OpenAI reports the model reaches 51.3 percent accuracy compared to 33.9 percent for the base GPT-5, a roughly 17-point absolute gap that the company attributes to the model's ability to plan changes, validate them with dependency analysis, and run tests during the edit.^[1]^[4]

For large refactors, GPT-5-Codex follows a workflow similar to that of a careful human engineer. It walks the repository, builds an internal picture of how modules interact, proposes a target state, then makes the edits while running and re-running tests until they pass. The model is also trained to defer when the requested change is ambiguous and to ask clarifying questions instead of guessing, which OpenAI describes as a deliberate trade-off against the more eager default behaviour of GPT-5.^[1]
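The edit-and-retest cycle described above can be sketched as a simple loop. This is an illustrative sketch only, not OpenAI's implementation: `run_tests` and `propose_patch` are hypothetical callables standing in for the sandbox test runner and the model, and the toy functions exist just so the example runs.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: neither callable is part of any OpenAI API.
RunTests = Callable[[str], Tuple[bool, str]]   # code -> (passed, failure log)
ProposePatch = Callable[[str, str], str]       # (code, failure log) -> new code

def iterate_until_green(code: str, run_tests: RunTests,
                        propose_patch: ProposePatch,
                        max_patches: int = 5) -> Tuple[str, List[str]]:
    """Revise the code until the tests pass or the patch budget is spent."""
    history: List[str] = []
    passed, log = run_tests(code)
    while not passed:
        if len(history) == max_patches:
            raise RuntimeError("tests still failing after max_patches")
        history.append(log)              # keep each failure log for review
        code = propose_patch(code, log)  # revise using the latest failure
        passed, log = run_tests(code)    # always re-test the revised code
    return code, history

# Toy stand-ins so the sketch runs end to end.
def toy_tests(code: str) -> Tuple[bool, str]:
    return ("return a + b" in code, "expected add(2, 3) == 5")

def toy_patch(code: str, log: str) -> str:
    return code.replace("a - b", "a + b")

fixed, failures = iterate_until_green("def add(a, b):\n    return a - b",
                                      toy_tests, toy_patch)
print(fixed)
print(f"patches needed: {len(failures)}")
```

Bounding the number of patches is a design choice in this sketch; a real harness would more likely budget by wall-clock time or tokens.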

How good is GPT-5-Codex at code review?

OpenAI trained GPT-5-Codex to act as a structured code reviewer in addition to being a code author. When attached to a GitHub repository through Codex, the model can run on pull requests, read the diff, examine the surrounding code, run tests, and produce review comments that focus on substantive issues rather than nits.^[1]^[3] To measure quality, OpenAI had experienced software engineers grade review comments on recent commits from popular open-source repositories. On that evaluation, GPT-5-Codex left incorrect comments only 4.4 percent of the time, down from 13.7 percent for GPT-5, while the share of comments rated high-impact rose from 39.4 percent for GPT-5 to 52.4 percent for GPT-5-Codex.^[2]^[13]^[14] OpenAI summarised the result by stating, "We find that comments by GPT-5-Codex are less likely to be incorrect or unimportant."^[1] The training set included real-world reviews assessed by senior engineers, with the explicit goal of cutting low-value or incorrect comments and concentrating on bugs, regressions, and unsafe patterns.^[1]^[2]

Tool use and sandbox execution

GPT-5-Codex is designed to operate inside a sandbox, with shell access, network policies, and file-system permissions managed by the surrounding Codex product. In the CLI and IDE extension, the sandbox runs locally with operating-system level isolation; in Codex Cloud, OpenAI provisions an ephemeral environment per task.^[3] The model is trained to use the tools that the Codex surface exposes, including shell commands, file editing, and a browser-style preview for web apps, and to recover from common failure modes such as failing tests, missing dependencies, or syntax errors.^[1]^[4]

Steerability and project conventions

GPT-5-Codex pays attention to the AGENTS.md convention, a markdown file at the top of a repository that describes how the codebase wants to be modified. This includes preferred package managers, test commands, formatting conventions, branch naming, and disallowed actions.^[1] In OpenAI's words, the model is "more steerable, adheres better to instructions, and produces higher-quality code" than the base GPT-5 when given short, project-specific guidance.^[1] This is intended to reduce the long, hand-tuned system prompts that early agentic coding setups often required.
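As an illustration of the kind of guidance such a file might hold, the sketch below renders a small AGENTS.md from a dictionary. The section names, commands, and paths are invented examples, and the layout is only one plausible way to organise the file, not a required schema.

```python
# Hypothetical project conventions; every value is an invented example.
conventions = {
    "Package manager": ["Use pnpm, not npm or yarn."],
    "Tests": ["Run `pnpm test` before proposing a change."],
    "Formatting": ["Run `pnpm lint --fix` on touched files."],
    "Disallowed actions": ["Do not edit files under `vendor/`."],
}

def render_agents_md(sections: dict[str, list[str]]) -> str:
    """Render each section as a Markdown heading followed by a bullet list."""
    parts = ["# AGENTS.md"]
    for title, rules in sections.items():
        parts.append(f"\n## {title}")
        parts.extend(f"- {rule}" for rule in rules)
    return "\n".join(parts) + "\n"

agents_md = render_agents_md(conventions)
print(agents_md)
```

Keeping each rule to a single short line mirrors the point made above: the file is meant to replace long system prompts, not reproduce them.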

How does GPT-5-Codex score on benchmarks?

At launch, OpenAI emphasised both standard benchmarks and internal evaluations. The public framing places GPT-5-Codex above the base GPT-5 on agentic coding tasks, while keeping similar performance on raw code generation. The independent llm-stats comparison and OpenAI's own materials report the following figures for GPT-5-Codex against GPT-5 at the September 2025 release.

| Benchmark | GPT-5 (Aug 2025) | GPT-5-Codex (Sept 2025) | Notes |
|---|---|---|---|
| SWE-bench Verified | 74.9% | 74.5% | Python-only agentic patch generation on real GitHub issues^[5]^[13] |
| Internal refactoring eval | 33.9% | 51.3% | Multi-file refactors with dependency analysis^[1]^[4] |
| Incorrect review comments | 13.7% | 4.4% | Lower is better; engineer-graded review quality^[2]^[13]^[14] |
| High-impact review comments | 39.4% | 52.4% | Higher is better; engineer-graded review quality^[2]^[13]^[14] |
| Token usage on easy tasks | Baseline | About 93.7% fewer | Same-quality output on the simplest 10% of requests^[1]^[12] |

On SWE-bench Verified the two models are essentially tied, with GPT-5-Codex at 74.5 percent and GPT-5 at 74.9 percent.^[13] OpenAI's explanation is that SWE-bench Verified, while widely cited, is a Python-only benchmark of bounded GitHub issues and does not stress the long-horizon, multi-language refactor work that Codex training is aimed at.^[1]^[12] The larger gaps appear on internal evaluations that more closely match real engineering workloads, such as the 17-point refactoring lift, the drop in incorrect review comments from 13.7 to 4.4 percent, and the dramatic token-efficiency improvement on routine tasks.^[1]^[4]^[12]
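The size of each gap is easier to compare when absolute and relative changes are computed side by side. The sketch below does that arithmetic on the figures quoted in the comparison table; the helper name `deltas` is just a label chosen for this example.

```python
# Figures copied from the comparison table: (GPT-5, GPT-5-Codex), in percent.
results = {
    "SWE-bench Verified": (74.9, 74.5),
    "Internal refactoring eval": (33.9, 51.3),
    "Incorrect review comments": (13.7, 4.4),
    "High-impact review comments": (39.4, 52.4),
}

def deltas(base: float, codex: float) -> tuple[float, float]:
    """Return (absolute change in points, relative change in percent)."""
    absolute = codex - base
    relative = absolute / base * 100
    return round(absolute, 1), round(relative, 1)

for name, (base, codex) in results.items():
    absolute, relative = deltas(base, codex)
    print(f"{name}: {absolute:+.1f} points ({relative:+.1f}% relative)")
```

On these numbers the refactoring eval moves by 17.4 points, a relative gain of about 51 percent, while incorrect review comments fall by 9.3 points, a relative drop of about 68 percent. The SWE-bench Verified difference is 0.4 points in GPT-5's favour.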

Later models in the Codex line have built on these numbers. GPT-5.2-Codex, released on December 18, 2025, reached state-of-the-art results on the harder SWE-bench Pro (56.4 percent) and Terminal-Bench 2.0 (64.0 percent) agentic harnesses, and GPT-5.3-Codex, released in February 2026, reported roughly 80 percent on SWE-bench Verified along with 56.8 percent on SWE-bench Pro and 77.3 percent on Terminal-Bench 2.0.^[6]^[7]^[16] These successors confirmed the direction set by GPT-5-Codex but did not retroactively change its September 2025 figures.

When was GPT-5-Codex released, and how do you access it?

GPT-5-Codex was released as the default model across the Codex product family on September 15, 2025. At launch it was accessible to ChatGPT Plus, Pro, Business, Edu, and Enterprise subscribers, who could use it through the Codex CLI, the IDE extension, the Codex Cloud web app at chatgpt.com/codex, and the GitHub pull request review integration.^[1]^[2] Eight days later, on September 23, 2025, OpenAI opened the model to developers using Codex via API keys, making gpt-5-codex a callable model identifier for programmatic use alongside the rest of the GPT-5 family.^[2]^[3]
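For programmatic access, a request might look like the sketch below. It assumes the official `openai` Python package, its Responses API, and an `OPENAI_API_KEY` environment variable; the helper names and the prompt text are hypothetical. The network call is wrapped in a function that is not invoked here, so the payload can be inspected without sending anything.

```python
def build_request(task: str) -> dict:
    """Assemble keyword arguments for a Responses API call."""
    return {
        "model": "gpt-5-codex",  # model identifier named in the article
        "input": task,
    }

def run_task(task: str) -> str:
    """Send the request. Requires the `openai` package and an API key."""
    from openai import OpenAI  # imported lazily so the sketch loads without it
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    response = client.responses.create(**build_request(task))
    return response.output_text

# Hypothetical task text, shown only to illustrate the payload shape.
request = build_request("Add a unit test for parse_config() in config.py")
print(request)
```

A bare call like this gives the model no filesystem or shell, so a custom agent would still need to supply its own tools and sandbox around it.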

At the September 2025 launch, OpenAI did not publish a separate price card for GPT-5-Codex distinct from GPT-5. Codex usage for paying ChatGPT subscribers was metered through the same plan-level rate limits as other reasoning models, and API pricing followed the GPT-5 family schedule. Industry observers later noted that GPT-5-Codex tended to use significantly fewer output tokens for the same task than competing models, which made it cheaper in practice on terminal-heavy workloads even before accounting for unit pricing.^[14]

Where can you use GPT-5-Codex?

GPT-5-Codex is intentionally restricted to surfaces that are designed around it, rather than being offered as a drop-in chat model. The main surfaces are:

Codex CLI. A terminal-based agent that runs on the developer's machine and executes shell, edit, and test commands inside a local sandbox. GPT-5-Codex is the recommended model for local agentic coding sessions through the CLI.^[1]^[3]
Codex IDE extension. An editor extension for VS Code, Cursor, and Windsurf that embeds Codex in the IDE and supports both pair programming and longer agent runs.^[1]^[3]
Codex Cloud. OpenAI's hosted Codex environment, where GPT-5-Codex is the default model for new tasks and runs in an ephemeral cloud sandbox per session. This is the surface most aligned with the model's seven-hour autonomous-run profile.^[1]^[3]
GitHub code review. Codex can be wired into pull requests, with GPT-5-Codex acting as the default reviewer, posting comments through the GitHub PR API.^[1]^[3]
API. From September 23, 2025, the model is available as a callable id through OpenAI's developer platform, intended for teams building custom coding agents and CI/CD integrations.^[2]^[3]

OpenAI does not offer GPT-5-Codex through the standard ChatGPT chat interface in the same way as general-purpose GPT-5 variants. The model expects access to a Codex-style toolchain, including a working filesystem and shell, and is not optimised for free-form conversational use.^[1]

How does GPT-5-Codex compare with Claude?

The most direct competitor to GPT-5-Codex in the weeks after its release was Anthropic's Claude Sonnet 4.5, which launched on September 29, 2025, two weeks after GPT-5-Codex, with explicit positioning around coding and computer use through Claude Code. Independent reviewers and benchmark aggregators reported a mixed picture: GPT-5-Codex was generally rated stronger on long-running backend refactors, large multi-file changes, and token efficiency, while Claude Sonnet 4.5 was preferred for frontend work, design-sensitive UI generation, and conversational ambiguity handling.^[14]^[15]

| Dimension | GPT-5-Codex | Claude Sonnet 4.5 |
|---|---|---|
| Primary surface | Codex CLI, Codex Cloud, IDE extension | Claude Code CLI, IDE plugins, API |
| Long-running autonomous runs | Up to about 7 hours | Up to about 30 hours |
| Multi-file refactor | Strong, optimised by training | Strong, more conversational |
| Frontend and design tasks | Solid | Often preferred by reviewers |
| Token efficiency per task | Very high, fewer output tokens | Comparable but typically more verbose |
| Pricing posture at launch | GPT-5 family rates, low effective cost | $3 / $15 per million tokens (input / output) |

GPT-5-Codex was also frequently compared against Claude Code as a product, since the two are direct workflow rivals: each is a CLI-first agentic coding tool wrapped around a frontier model. In head-to-head reviews after launch, GPT-5-Codex was often described as the execution engine that produced more concise diffs and used fewer tokens per task, while Claude Code under Sonnet 4.5 was described as the reasoning partner that explained more and pushed back more in ambiguous cases.^[14] In those early comparisons, no single model dominated all categories, and many teams used both depending on the task.

What is GPT-5-Codex used for?

GPT-5-Codex was positioned for several specific workflows:

Greenfield project scaffolding. Generating a new project layout, dependencies, and initial test suite from a short specification, with the model deciding folder structure and tooling.^[1]
Bug triage and fix. Reading a failing test or stack trace, locating the root cause across files, and producing a minimal patch with tests.^[1]^[4]
Large-scale refactor. Renaming or restructuring functions, modules, or APIs across hundreds of files with dependency analysis and test re-runs.^[1]^[12]
Pull request review. Running on every PR as an automated reviewer that focuses on bugs, regressions, and policy violations rather than style nits.^[1]^[3]
Autonomous overnight work. Running multi-hour Codex Cloud tasks that complete a feature, with the engineer reviewing the diff in the morning rather than supervising every step.^[12]

In its launch materials, OpenAI explicitly contrasted these workloads with chat-style coding assistance, where the user types in one question at a time and expects a quick suggestion. GPT-5-Codex is faster than earlier reasoning models on those short tasks, but it is also designed to behave very differently when given a high-level goal and a sandboxed environment.^[1]^[2]

Is GPT-5-Codex safe to run autonomously?

OpenAI published an addendum to the GPT-5 system card describing the safety mitigations specific to GPT-5-Codex.^[5] Because the Codex variant inherits the GPT-5 base, the addendum focuses on changes from that baseline rather than restating shared mitigations. The main additions concern code-specific abuse patterns, such as generating malware or exploit code, and behaviours that arise when the model is given persistent shell and network access.

OpenAI reports that GPT-5-Codex was trained against a curated "golden set" of safety test cases developed by internal policy experts, with edge cases and adversarial examples added to test the model's refusal and dual-use handling.^[5] At the product level, the Codex surfaces enforce sandboxing, configurable approval policies for shell commands, and network restrictions, so that even if the model attempts an unsafe action, the surrounding harness can block or require human confirmation.^[3]^[5] The September 2025 system card addendum frames safety for GPT-5-Codex as a combination of model-level steering and product-level constraints, rather than as a property of the model in isolation.
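The product-level layer can be pictured as a gate that sits between the model and the shell. The sketch below is a simplified illustration, not Codex's actual policy engine: the `Decision` names and both command lists are assumptions invented for this example.

```python
import shlex
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"
    BLOCK = "block"

# Hypothetical policy: invented lists, not Codex defaults.
AUTO_APPROVED = {"ls", "cat", "pytest"}
ALWAYS_BLOCKED = {"curl", "wget", "ssh"}  # stand-in for a no-network rule

def review_command(command: str) -> Decision:
    """Classify a shell command before the harness executes it."""
    tokens = shlex.split(command)
    if not tokens:
        return Decision.BLOCK          # nothing to run
    program = tokens[0]
    if program in ALWAYS_BLOCKED:
        return Decision.BLOCK
    if program in AUTO_APPROVED:
        return Decision.ALLOW
    return Decision.ASK_HUMAN          # anything unknown needs confirmation

for cmd in ["pytest -q", "rm -rf build", "curl https://example.com"]:
    print(cmd, "->", review_command(cmd).value)
```

Defaulting unknown commands to human confirmation reflects the framing above: the harness, rather than the model alone, decides what actually runs.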

Reception

Initial reactions to GPT-5-Codex were broadly positive. Press coverage in VentureBeat, TechCrunch, and InfoQ highlighted the seven-hour autonomous run profile and the refactoring benchmark gain over GPT-5 as the most novel features, and framed the model as OpenAI's clearest response yet to the rise of agentic coding tools.^[2]^[4]^[12] Developer reviews on technical blogs and on aggregator sites such as llm-stats noted that the model was unusually token-efficient and well behaved on large diffs, with some criticism that its output was sometimes too terse and that frontend work still leaned on Claude.^[13]^[14]

Within OpenAI's own roadmap, GPT-5-Codex was quickly superseded by a rapid series of Codex-specific successors: GPT-5.1-Codex and GPT-5.1-Codex-Max in November 2025, GPT-5.2-Codex in December 2025, and GPT-5.3-Codex in February 2026, each closing further gaps on agentic coding benchmarks and broadening the workloads handled by the Codex surface.^[6]^[7]^[16] Despite that pace, the September 2025 release is widely recognised as the moment at which OpenAI committed to a dedicated coding model line, distinct from its general-purpose GPT-5 chat models.

References

1. OpenAI. "Introducing upgrades to Codex." openai.com, September 15, 2025.
2. Wiggers, Kyle. "OpenAI upgrades Codex with a new version of GPT-5." TechCrunch, September 15, 2025.
3. OpenAI Developers. "Codex changelog and documentation." developers.openai.com, accessed 2026.
4. Olcott, Eleanor. "OpenAI releases GPT-5-Codex optimized for complex code refactoring and code reviews." InfoQ, September 22, 2025.
5. OpenAI. "Addendum to GPT-5 system card: GPT-5-Codex." openai.com, September 15, 2025.
6. OpenAI. "Introducing GPT-5.2-Codex." openai.com, December 18, 2025.
7. OpenAI. "Introducing GPT-5.3-Codex." openai.com, February 2026.
8. OpenAI. "OpenAI Codex." Help Center model release notes, accessed 2026.
9. OpenAI. "Codex CLI repository and release notes." GitHub, April 2025.
10. OpenAI. "Codex updates with GPT-5." Codex changelog, August 2025.
11. OpenAI. "Introducing GPT-5." openai.com, August 7, 2025.
12. Franzen, Carl. "OpenAI's new agentic coding model GPT-5-Codex can run independently for 7 hours at a time." VentureBeat, September 15, 2025.
13. llm-stats. "GPT-5 vs GPT-5-Codex comparison." llm-stats.com, accessed 2026.
14. Composio. "Claude Sonnet 4.5 vs GPT-5-Codex: best model for agentic coding." composio.dev, 2025.
15. Builder.io. "Codex vs Claude Code: which is the better AI coding agent?" builder.io, 2025.
16. OpenAI. "Building more with GPT-5.1-Codex-Max." openai.com, November 19, 2025.
