OpenAI Codex
Last reviewed
May 8, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 6,589 words
OpenAI Codex is the name OpenAI has used for two distinct generations of code-focused artificial intelligence products. The first generation, launched in private beta in August 2021, was a large language model descended from GPT-3 and fine-tuned on billions of lines of public source code. That original Codex powered the early versions of GitHub Copilot and introduced the now-canonical HumanEval benchmark for evaluating code generation, before OpenAI deprecated the underlying API in March 2023 in favor of newer general-purpose models. The second generation, launched in two waves in April and May 2025, is an entirely different product family: a suite of agentic coding tools branded Codex CLI and Codex Cloud, built on specialized variants of OpenAI's reasoning models (codex-1, codex-mini-latest, GPT-5-Codex, GPT-5.2-Codex, GPT-5.3-Codex, and GPT-5.4-Codex), that compete directly with Claude Code, Cursor, and Devin for the autonomous coding agent market [1][2][3].
The two Codex generations share a name and a broad mission of helping developers write software, but they differ in almost every implementation detail. The 2021 Codex was a static autoregressive language model that turned natural-language prompts into code completions, accessed through a REST API or embedded in editor extensions like the original GitHub Copilot. The 2025 Codex is an interactive software-engineering agent that runs in a sandboxed environment, reads and edits files, executes commands, runs tests, opens pull requests, and reports back to the user, all under explicit human approval policies. The shared brand reflects OpenAI's view that both systems address the same underlying goal of reducing the friction between describing software and producing working code, but the technical lineage runs through different research programs and the user experience differs profoundly [3][4].
As of April 2026, the modern Codex platform reports more than two million weekly active users across its CLI, web, IDE, and mobile surfaces, with usage rising roughly five-fold since January 2026, and the open-source Codex CLI repository has accumulated more than 75,000 stars on GitHub [3]. The product has become one of the most visible examples of a broader industry shift from inline autocomplete tooling toward fully agentic coding workflows, in which the human operator delegates entire tasks ("add OAuth login to this app," "fix the failing CI checks," "port this module to TypeScript") rather than writing code line by line with assistance.
The Codex name has been reused rather than used continuously. OpenAI originally registered the brand in 2021 for a model that did one thing: take a prompt, return code. After the 2023 deprecation, the name was effectively retired for two years. When OpenAI brought it back in April 2025, the engineering, product surface, and underlying research program had all moved on. The decision to reuse the name has occasionally caused confusion in academic papers and developer documentation that still cite "Codex" without specifying which era they mean.
A short timeline of the brand:
| Date | Event |
|---|---|
| June 29, 2021 | GitHub Copilot launches in technical preview using a Codex variant |
| July 7, 2021 | Codex paper posted on arXiv [1] |
| August 10, 2021 | Codex API private beta announced [5] |
| March 23, 2023 | OpenAI announces deprecation of code-davinci-002 and code-cushman-001 [11] |
| April 16, 2025 | Codex CLI launches as an open-source terminal agent [2] |
| May 16, 2025 | Codex Cloud launches as a research preview in ChatGPT [4] |
| June 3, 2025 | Codex Cloud rolls out to ChatGPT Plus subscribers |
| September 23, 2025 | GPT-5-Codex becomes available to API customers |
| Late 2025 | Codex graduates from research preview to general availability [9] |
| January 14, 2026 | GPT-5.2-Codex released |
| February 5, 2026 | GPT-5.3-Codex released |
| March 5, 2026 | GPT-5.4-Codex released alongside GPT-5.4 family |
The original Codex was first announced by OpenAI on August 10, 2021, alongside the launch of a private beta of the Codex API [5]. The model itself had been described in a research paper posted on arXiv on July 7, 2021, titled Evaluating Large Language Models Trained on Code, with Mark Chen and Jerry Tworek as lead authors and a long list of contributors from across OpenAI [1]. That paper is now one of the foundational references for code-generation language models and it introduced both the Codex model family and the HumanEval evaluation set that has been used to benchmark every subsequent code model.
Codex was constructed by taking the GPT-3 architecture and continuing pretraining on a code corpus collected from public GitHub repositories. The raw corpus was collected in May 2020 from 54 million public repositories and contained roughly 179 gigabytes of unique Python files under one megabyte in size. OpenAI then filtered out files that were likely auto-generated, had an average line length above 100 characters, had a maximum line length above 1,000 characters, or contained only a small percentage of alphanumeric characters, leaving a final dataset of 159 gigabytes [1]. The fine-tuning corpus was therefore a curated slice of human-authored Python rather than the much larger sweep of all repository content.
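The line-length and alphanumeric heuristics described above can be illustrated with a short sketch. This is not OpenAI's actual pipeline: the function name is hypothetical, and the alphanumeric cutoff of 25 percent is an assumption chosen for illustration, since the paper only says "a small percentage". Deduplication, the file-size limit, and auto-generated-file detection are not modeled.

```python
def keep_source_file(text: str, min_alnum_fraction: float = 0.25) -> bool:
    """Return True if a file passes the line-length and alphanumeric heuristics.

    min_alnum_fraction is a hypothetical threshold; the source paper does not
    state the exact value it used.
    """
    lines = text.splitlines()
    if not lines:
        return False  # empty files carry no training signal

    avg_line_length = sum(len(line) for line in lines) / len(lines)
    max_line_length = max(len(line) for line in lines)
    alnum_fraction = sum(ch.isalnum() for ch in text) / len(text)

    if avg_line_length > 100:
        return False  # average line length above 100 characters
    if max_line_length > 1000:
        return False  # some line longer than 1,000 characters
    if alnum_fraction < min_alnum_fraction:
        return False  # mostly punctuation or whitespace
    return True
```

Under these assumptions a small hand-written function passes, while a file containing a single 2,000-character string literal or a file made only of separator characters is rejected.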
OpenAI trained two main families of Codex models. The larger family, code-davinci, peaked at approximately 12 billion parameters and was the basis for the production Codex API; the smaller family, code-cushman, was a lighter variant tuned for lower-latency completion and was the model OpenAI shipped to GitHub for the first version of Copilot [6]. The Chen et al. paper reports results across model scales from 12 million to 12 billion parameters, finding that performance on HumanEval scales smoothly with parameter count and that even 12-million-parameter models have meaningful coding ability after fine-tuning. A successor model, code-davinci-002, was released in 2022 with approximately 14.8 billion parameters and became the base for the GPT-3.5 family.
The Chen et al. paper introduced HumanEval, an evaluation set consisting of 164 hand-written Python programming problems, each with a function signature, a docstring describing the desired behavior, a reference solution, and a set of unit tests. The benchmark measures functional correctness using the pass@k metric: a model is allowed to generate k samples per problem, and a problem is counted as solved if at least one sample passes all the unit tests. Pass@1 measures the probability that a single sample is correct, while pass@10 and pass@100 measure the probability that at least one of ten or one hundred samples is correct.
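In practice pass@k is usually estimated by drawing n ≥ k samples per problem, counting the c samples that pass, and computing 1 − C(n−c, k) / C(n, k), which is the unbiased estimator given in the Codex paper. The sketch below shows that calculation; the function names and the input layout are illustrative choices rather than part of any official harness.

```python
import math
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains no passing sample.
    all_fail = math.comb(n - c, k) / math.comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average per-problem pass@k over (n, c) pairs, one pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

For a problem where 3 of 10 samples pass, `pass_at_k(10, 3, 1)` evaluates to 0.3, matching the intuition that pass@1 is simply the fraction of correct samples.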
The headline result from the original paper was that Codex-12B solved 28.8 percent of the HumanEval problems on pass@1, while GPT-3 of comparable scale solved 0 percent and the contemporaneous open model GPT-J solved 11.4 percent [1]. With repeated sampling and an oracle that could pick the correct answer from many candidates, the pass@100 score rose to 70.2 percent, demonstrating that the model often had a working solution somewhere in its output distribution even when it was not consistent enough to produce one on the first try. That gap between pass@1 and pass@100 motivated subsequent work on best-of-n sampling, code execution feedback loops, and reranking with verifier models.
HumanEval has become one of the most cited benchmarks in code generation despite well-documented limitations: only 164 problems, all in Python, all relatively short and self-contained, and all written by humans rather than scraped from real codebases. The numbers have largely saturated as model capabilities have grown, with frontier models scoring above 90 percent and the strongest 2025 systems scoring above 95 percent. Newer benchmarks such as MBPP, APPS, SWE-bench, LiveCodeBench, and the SWE-bench Verified subset have largely supplanted HumanEval for serious comparison of frontier coding systems, but HumanEval remains the standard introductory benchmark and is still routinely reported in model release notes.
The table below summarizes pass@1 HumanEval scores reported for several models across the years.
| Model | Year | Reported pass@1 | Notes |
|---|---|---|---|
| GPT-3 (175B) | 2021 | 0.0% | No code-specific fine-tuning; baseline reported in Codex paper [1] |
| GPT-J (6B) | 2021 | 11.4% | Open-source baseline reported in Codex paper [1] |
| Codex-12B | 2021 | 28.8% | Original Codex API model [1] |
| code-davinci-002 | 2022 | ~47% | OpenAI Codex successor, base of GPT-3.5 |
| GPT-4 | 2023 | 67.0% | OpenAI technical report |
| GPT-4 (with reflection) | 2023 | 91.0% | With execution feedback and self-correction |
| Claude 3 Opus | 2024 | 84.9% | Anthropic blog |
| Llama 3.1 405B | 2024 | 89.0% | Meta release |
| GPT-4o | 2024 | 90.2% | OpenAI blog |
| Claude Opus 4 | 2025 | 94.5% | Anthropic blog |
| Claude Sonnet 4 | 2025 | 95.1% | Anthropic blog |
The most visible deployment of the original Codex was inside GitHub Copilot, a code completion service launched as a technical preview in June 2021 and made generally available in June 2022. Copilot integrated a fine-tuned Codex variant directly into VS Code, JetBrains IDEs, Neovim, and Visual Studio, offering inline ghost-text completions as the user typed. The Codex model running inside Copilot was a smaller, lower-latency version optimized for responsiveness in the editor, since users expected suggestions to appear within a few hundred milliseconds.
Copilot quickly became the dominant commercial AI coding product, with reports of over a million paying subscribers within two years of launch. It also became one of the highest-profile demonstrations that pretrained language models could be productized into a real developer tool with measurable productivity benefits. A widely cited GitHub-internal study from 2022 found that developers completed a JavaScript task 55 percent faster with Copilot than without it. Independent academic studies have produced more mixed results, with some showing significant speedups and others showing little or no improvement and a measurable increase in security defects in Copilot-assisted code.
GitHub Copilot moved away from the original Codex model after OpenAI deprecated the underlying API in March 2023. By the time Copilot X was announced in March 2023, Copilot Chat was running on GPT-4 and the inline completion model had been replaced by newer OpenAI variants. Subsequent versions of Copilot have run on a rotating set of models from OpenAI, Anthropic, and Google, including GPT-5 variants and Claude models, exposed to users through a model picker. The Copilot product is now operationally independent of the OpenAI Codex brand even though it grew out of the same research program.
In March 2023, OpenAI announced that the original Codex models (code-davinci-001, code-davinci-002, code-cushman-001, and related variants) would be deprecated from the API and that customers should migrate to the newly released GPT-3.5-Turbo and GPT-4 models for code-related tasks. The deprecation reflected the broader trend that general-purpose chat models had caught up to and surpassed dedicated code models on most benchmarks, including HumanEval, while also offering far better instruction following, multi-turn dialogue, and integration with new product features like function calling.
The sudden deprecation drew criticism from researchers who had been using code-davinci-002 as a stable baseline for academic experiments. Many published reproductions and ablation studies depended on a specific snapshot of the model, and once the API endpoints were turned off the experiments could no longer be re-run by other researchers. OpenAI extended the timeline several times and eventually preserved access for some classes of researcher, but the episode highlighted the friction between commercial model lifecycle and the reproducibility expectations of academic computer science.
The modern Codex product family, which shares a brand with the 2021 model but no direct technical lineage, was launched in two waves in 2025. On April 16, 2025, OpenAI announced Codex CLI, an open-source command-line coding agent that runs locally on the user's machine and can read, write, and execute code in the working directory [2][7]. Roughly a month later, on May 16, 2025, OpenAI followed with Codex Cloud, a research preview of a hosted agent that runs entirely in an isolated container on OpenAI infrastructure, accepting tasks through ChatGPT and returning pull requests for human review [4]. Both products were initially powered by codex-1, a fine-tuned variant of the o3 reasoning model adapted specifically for software engineering work.
Codex CLI is a free, open-source binary distributed under the Apache 2.0 license (initially released under MIT) that installs via npm or Homebrew and exposes a single command, codex, to launch an interactive terminal user interface [7][8]. Inside the TUI, the user describes a task in natural language and the agent plans, edits files, runs shell commands, and reports progress, with explicit approval prompts for any operation that touches the file system or executes a process. The CLI began life as a Node.js application but was rewritten in Rust during the second half of 2025; by early 2026 the codebase was approximately 95 percent Rust, and the Rust binary became the default distribution for performance and memory-safety reasons [12].
Several motivations drove the Rust rewrite. The original Node.js distribution required users to install Node.js as a separate dependency, and the runtime overhead made startup slow on weaker machines. The Rust binary ships as a single static executable with no garbage collector, uses far less memory, and lets OpenAI call OS sandbox primitives directly rather than going through a Node.js shim [12].
The CLI supports several approval modes:
| Mode | File edits | Shell commands |
|---|---|---|
| Suggest (manual) | Asks before each | Asks before each |
| Auto-edit | Applies without asking | Asks before each |
| Full-auto | Applies without asking | Runs without asking inside a sandbox |
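The approval table above amounts to a small decision function. The sketch below encodes it directly; the enum and function names are hypothetical and are not the CLI's actual configuration keys or internal implementation.

```python
from enum import Enum


class ApprovalMode(Enum):
    SUGGEST = "suggest"
    AUTO_EDIT = "auto-edit"
    FULL_AUTO = "full-auto"


class Action(Enum):
    FILE_EDIT = "file_edit"
    SHELL_COMMAND = "shell_command"


def needs_user_approval(mode: ApprovalMode, action: Action) -> bool:
    """Return True if the agent should pause and ask before performing action."""
    if mode is ApprovalMode.SUGGEST:
        # Manual mode: ask before every edit and every command.
        return True
    if mode is ApprovalMode.AUTO_EDIT:
        # Edits are applied automatically; shell commands still need approval.
        return action is Action.SHELL_COMMAND
    if mode is ApprovalMode.FULL_AUTO:
        # Nothing is gated on approval; commands run inside the sandbox instead.
        return False
    raise ValueError(f"unknown approval mode: {mode!r}")
```

Keeping the policy in one pure function like this makes the three modes easy to test exhaustively, which matters because the policy is the main safety boundary between the agent and the user's machine.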
The sandbox enforcement uses macOS sandbox-exec (Apple Seatbelt), Linux Landlock and seccomp, and Windows job objects to restrict file system writes to the working directory and disable network access by default. Users can also configure the CLI to run inside Docker containers for stricter isolation, which is the recommended setup for production use on shared development machines.
A distinguishing feature of Codex CLI is its support for multimodal input: users can paste screenshots, mock-ups, or low-fidelity sketches into the TUI and ask the agent to implement the shown design [2]. The CLI also supports the Model Context Protocol for connecting to external tools and data sources, and it can act as both an MCP client and an MCP server. The latter is unusual: running codex mcp-server exposes the Codex agent as a tool that other MCP-compatible agents (including Claude Code, Cursor, and various third-party tools) can call into. The bidirectional MCP support has made Codex CLI a popular building block for hybrid agent stacks where one agent delegates subtasks to another.
The SDK released in 2025 allows developers to embed the same agent loop into their own applications. As of April 2026, the npm package reports approximately 14.5 million monthly downloads. To encourage adoption, OpenAI launched a $1 million API credit program at the same time as the CLI announcement, awarding $25,000 blocks of API credits to selected open-source projects that integrated or built on the new tool [2].
Codex Cloud, announced May 16, 2025, takes a fundamentally different approach to running coding agents. Rather than executing on the developer's local machine, each Codex Cloud task runs inside an isolated container on OpenAI infrastructure that is preloaded with a clone of the user's GitHub repository and any pre-installed dependencies declared in a per-repo configuration file [4]. Internet access was disabled by default during task execution at launch, so the agent could only interact with the code and tooling that had been explicitly provisioned. OpenAI later added an opt-in switch that gives Codex Cloud network access for tasks that need to fetch packages or run integration tests against external services. Tasks complete asynchronously in roughly one to thirty minutes, after which the agent returns a summary, a diff, terminal logs, and an optional pull request that the user can review and merge.
The asynchronous, fire-and-forget design contrasts with the interactive paradigm of Claude Code (which keeps the user in a continuous dialogue with the agent in their terminal) and with the IDE-embedded paradigm of Cursor (which surfaces the agent inside an editor where the user can watch and steer it in real time). Codex Cloud is closer in spirit to Devin, in that the user delegates a complete task and walks away while the agent works, but with stronger sandboxing, no persistent agent memory across tasks, and explicit human review at the pull-request stage.
Codex Cloud was initially available to ChatGPT Pro, Business, and Enterprise subscribers, with ChatGPT Plus access added on June 3, 2025 and Education plan access shortly after. The product graduated from research preview to general availability in late 2025, at which point OpenAI added a Slack integration that lets team members tag a Codex bot in a channel to delegate tasks, a public Codex SDK that exposes the same agent loop used by the CLI and web interface, and admin controls that give workspace administrators visibility and policy controls over how Codex is used by their organizations [9].
One quirk of the cloud workflow worth flagging: because each task runs in a fresh container with no persistent state, Codex Cloud has no equivalent of a long-running pair-programming session. Every task is its own world. Practitioners who want continuity across a long-running project either reach for the CLI (which keeps state on the local machine) or split work into discrete pull-request-sized chunks. That constraint pushed Codex toward a particular usage style: scoped, well-defined tickets rather than open-ended exploration.
The modern Codex products have run on a steady cadence of new models since the codex-1 launch, with each release shaving time off agentic tasks and pushing the SWE-bench Verified numbers higher.
| Model | Released | Used in | Notes |
|---|---|---|---|
| codex-1 | May 16, 2025 | Codex Cloud | Fine-tune of o3 reasoning model for software engineering [4] |
| codex-mini-latest | May 2025 | Codex CLI default | Smaller, faster variant for interactive use; $1.50 / $6.00 per 1M tokens |
| GPT-5-Codex | Sept 23, 2025 | Codex CLI and Cloud | Codex-tuned variant of GPT-5 family; first available to API customers |
| GPT-5.2-Codex | Jan 14, 2026 | All Codex surfaces | Adds context compaction, stronger Windows support, cybersecurity improvements |
| GPT-5.3-Codex | Feb 5, 2026 | All Codex surfaces | Tops SWE-Bench Pro and Terminal-Bench 2.0; ~25% faster than 5.2-Codex [3] |
| GPT-5.3-Codex-Spark | Feb 12, 2026 | Real-time interactive coding | Approximately 15x faster than earlier Codex models [3] |
| GPT-5.4-Codex | Mar 5, 2026 | All Codex surfaces | First general-purpose model with native computer-use; supports 1M-token context |
The codex-1 system message, which is the persistent prompt that defines the agent's default behavior, was published by OpenAI on the codex-1 launch day to help developers understand how the agent reasons by default and to enable customization in their own deployments. The published system message emphasized careful planning before action, asking clarifying questions when the task is ambiguous, running tests after making changes, and being explicit about what it has and has not verified. This level of public documentation of an agent's behavioral specification was unusual at the time and has since been imitated by other agentic-coding products.
codex-1 itself was a fine-tune of the o3 reasoning model with reinforcement learning on real-world coding tasks, training the model to plan, execute, observe, and revise inside agent loops rather than just predict the next token. Each successor inherits the same general recipe but adds more code data, longer contexts, and more aggressive RL on harder tasks. GPT-5.3-Codex specifically pushed Terminal-Bench 2.0 scores to roughly 77 percent and was the first Codex model to compete head-to-head with Anthropic's Claude Opus 4 on extended multi-file changes. GPT-5.4-Codex added native computer-use, meaning the agent can drive a browser or desktop GUI as part of a task, which closed a gap with OpenAI Operator for work that mixes coding with clicking through web UIs.
The modern Codex platform handles a fairly wide range of software-engineering tasks, all anchored around the basic loop of read code, propose change, run tests, report back. The capabilities most often cited in OpenAI marketing and in third-party reviews include:
| Capability | Details |
|---|---|
| Whole-repo reading and editing | Navigates the working directory, applies multi-file edits in a single task, summarizes large codebases into compact context during planning |
| Running tests and shell commands | Executes test suites after each edit, parses output, uses failures as feedback to iterate |
| Opening pull requests | Codex Cloud's default output is a GitHub PR with a summary, terminal logs, and follow-up notes |
| Citing sources | Inline citations of the specific files and lines the agent examined while making a change |
| Parallel tasks | Multiple Codex Cloud tasks can run at once against the same or different repositories, each in its own container |
| Code review on incoming PRs | Configurable per-repo, posts GitHub review comments on P0 and P1 issues, matches PR intent against the actual diff |
| Multimodal input | The CLI accepts pasted images and screenshots, useful for error screenshots or UI mock-ups as task descriptions |
| Computer use | From GPT-5.4-Codex onward, the agent can drive a sandboxed browser or desktop session |
| Internet access (opt-in) | Tasks that need to fetch packages or hit a third-party API can be granted controlled network access |
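The basic loop of proposing a change, running tests, and feeding failures back can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Codex's actual agent loop: `propose_edit` and `apply_and_test` are hypothetical callables standing in for the model and the sandboxed test runner.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CheckResult:
    passed: bool
    output: str  # test runner output, used as feedback on failure


@dataclass
class TaskReport:
    succeeded: bool
    attempts: int
    logs: List[str]


def run_edit_test_loop(
    propose_edit: Callable[[Optional[str]], str],
    apply_and_test: Callable[[str], CheckResult],
    max_attempts: int = 3,
) -> TaskReport:
    """Propose an edit, test it, and retry with the failure output as feedback."""
    logs: List[str] = []
    feedback: Optional[str] = None  # no feedback before the first attempt
    for attempt in range(1, max_attempts + 1):
        edit = propose_edit(feedback)
        result = apply_and_test(edit)
        logs.append(f"attempt {attempt}: {'pass' if result.passed else 'fail'}")
        if result.passed:
            return TaskReport(True, attempt, logs)
        feedback = result.output  # failures drive the next proposal
    return TaskReport(False, max_attempts, logs)
```

The report returned at the end plays the role of the summary and logs that the agent hands back to the user; a real agent would also attach the diff.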
A notable behavioral detail is that Codex defaults to verbosity. The agent narrates what it is doing, what it found, and what it changed, in a way that other agents (notably Claude Code) tend not to. That default has been criticized as cluttering the output for experienced developers but praised as a debugging aid for users still learning how to delegate tasks to an agent.
Codex usage is bundled into the standard ChatGPT subscription tiers rather than priced as a separate per-token product for individual users. The table below summarizes the consumer plans as of April 2026.
| Plan | Price | Codex usage |
|---|---|---|
| ChatGPT Free | $0 | No Codex access |
| ChatGPT Plus | $20/month | Baseline allotment of Codex tasks per week |
| ChatGPT Pro (intermediate) | $100/month | Five times the Plus allotment; introduced April 2026 |
| ChatGPT Pro (full) | $200/month | Roughly twenty times the Plus allotment for high-effort sessions |
| ChatGPT Business | $30/user/month | Team-level usage and admin controls |
| ChatGPT Enterprise | Custom | Enterprise SSO, audit logs, custom retention |
| ChatGPT Edu | Custom | Codex access included for verified institutions |
In April 2026, OpenAI introduced a $100/month Pro tier positioned between Plus and the original $200 Pro tier, providing five times the Plus Codex usage at half the price of the top tier [10]. For developers who want to call the underlying models directly through the API, the codex-mini-latest model is priced at $1.50 per million input tokens and $6 per million output tokens with a 75 percent prompt-caching discount, and GPT-5-Codex variants are priced through the standard chat-completions API at the rates set for the broader GPT-5 family. The Codex SDK is open-source and free to use. For education customers, OpenAI runs a free-tier Codex program for verified students that has been adopted by several universities as their default AI pair programmer recommendation in computer science curricula.
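The codex-mini-latest API prices quoted above translate into a simple cost estimate. The sketch below assumes that the 75 percent caching discount applies only to input tokens served from the prompt cache; the function name and that billing interpretation are assumptions, not a statement of OpenAI's billing rules.

```python
INPUT_USD_PER_MILLION = 1.50   # codex-mini-latest input price quoted above
OUTPUT_USD_PER_MILLION = 6.00  # codex-mini-latest output price quoted above
CACHE_DISCOUNT = 0.75          # assumed to apply to cached input tokens only


def estimate_cost_usd(
    input_tokens: int, output_tokens: int, cached_input_tokens: int = 0
) -> float:
    """Estimate the API cost in US dollars for one workload."""
    if min(input_tokens, output_tokens, cached_input_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_input_tokens > input_tokens:
        raise ValueError("cached tokens cannot exceed total input tokens")

    uncached = input_tokens - cached_input_tokens
    uncached_cost = uncached * INPUT_USD_PER_MILLION / 1_000_000
    cached_cost = (
        cached_input_tokens * INPUT_USD_PER_MILLION * (1 - CACHE_DISCOUNT) / 1_000_000
    )
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return uncached_cost + cached_cost + output_cost
```

Under these assumptions, a workload with 2 million input tokens (1.5 million of them cached) and 200,000 output tokens costs about $2.51: $0.75 for uncached input, about $0.56 for cached input, and $1.20 for output.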
The Codex platform spans several interaction surfaces beyond the CLI and the ChatGPT web interface. The first-party VS Code extension, released in late 2025, embeds the Codex Cloud workflow inside the editor, letting the user start agent tasks against the current repository and review returned diffs without leaving VS Code. JetBrains and Xcode plug-ins followed in early 2026 with similar patterns. A Codex desktop app for Windows and macOS launched in February 2026, providing a kanban-style board of in-flight tasks across multiple repositories, and a mobile companion app gives users the ability to launch tasks and review pull requests from a phone.
In October 2025, OpenAI announced a partnership with GitHub to make Codex one of the agents available through GitHub's Agent HQ system, which exposes a unified interface for managing agent runs from multiple vendors inside GitHub repositories. Codex also supports tools through the Model Context Protocol, including a Figma connector that lets the agent read design files directly and security-scanning integrations that became the basis for OpenAI's separate Codex Security product launched in March 2026.
The Slack integration is particularly popular for small fixes and dependency upgrades. Once enabled, team members can mention @codex in a channel and assign a task; Codex picks up the message, runs the work in a Codex Cloud container, and reports back in-thread with a link to the pull request.
Codex integrates into GitHub Actions through a published action (openai/codex-action) that lets repository maintainers run Codex as a step in their CI pipeline. Common patterns include having Codex automatically attempt to fix failing builds, running Codex as a code reviewer on incoming pull requests, and using Codex to keep generated documentation in sync with code changes. The action accepts the same configuration as the CLI.
The code review configuration has been a flagship feature [14]. With it enabled, Codex posts a standard GitHub review on every new pull request without needing an explicit @codex review comment. The review focuses on serious issues only (P0 and P1 in OpenAI's internal severity taxonomy), and Codex matches the stated intent of the PR against the actual diff before flagging anything. Unlike traditional static analysis tools, Codex can reason over the entire codebase and dependencies and execute code and tests to validate behavior, so its reviews catch subtle behavioral regressions that lint and type-check miss. Repository maintainers can shape the review style by adding AGENTS.md files describing repository conventions, security expectations, and review priorities, and Codex picks these up automatically.
The table below contrasts the two generations of OpenAI Codex on the dimensions where they differ most clearly.
| Dimension | Codex (2021) | Codex (2025) |
|---|---|---|
| Product type | Hosted code-completion language model | Agentic coding platform with CLI, cloud, IDE, mobile |
| Underlying model | code-davinci, code-cushman (~12B params) | codex-1, GPT-5-Codex, GPT-5.3-Codex, GPT-5.4-Codex |
| Architecture | Static autoregressive language model | Reasoning model in agent loop with tool use |
| Interaction | Single-turn completion via REST API | Multi-turn agent with file edit, command exec, planning |
| Primary surface | Editor extensions (Copilot) and API | Terminal CLI, ChatGPT web, IDE plug-ins, mobile app |
| Sandboxing | Not applicable (model only) | OS-level sandbox locally, isolated container in cloud |
| Open source | No (proprietary API) | Yes (CLI under Apache 2.0, SDK open) |
| Pricing | Per-token API billing | Bundled with ChatGPT subscription tiers |
| Status | Deprecated March 2023 | Generally available, actively developed |
Codex's modern model lineage shares the same general recipe across versions: take a frontier base model, then post-train it on coding tasks using reinforcement learning where the reward signal comes from whether tests pass, code runs, builds succeed, and (in later versions) human reviewers accept the agent's pull request. That regime is fundamentally different from how the 2021 Codex was built, which relied solely on next-token prediction over GitHub source code.
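The key technical differences between the two generations follow from that shift: the 2021 Codex was a static autoregressive language model serving single-turn completions, whereas the modern Codex models are reasoning models that run in an agent loop with tool use, file edits, and command execution.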
The relationship to the broader GPT-5 family is one of branching rather than direct lineage. GPT-5.3-Codex is a fine-tune of GPT-5.3 specifically optimized for coding agents; it is not the same model with a different system prompt. The Codex variants typically score lower than the base models on general-purpose benchmarks (creative writing, abstract reasoning) and higher on coding benchmarks. OpenAI has confirmed in release notes that this is by design.
The modern Codex sits in a crowded and rapidly evolving market for AI coding assistance. Direct competitors include Anthropic's Claude Code, launched in February 2025 as a terminal-based coding agent with a similar approval-prompt model; Cursor, an IDE built as a fork of VS Code that integrates LLM-powered editing into a familiar editor experience; and Devin, an autonomous software engineer from Cognition AI marketed as a fully independent agent that can take complete tickets end to end. Other notable competitors include Aider, Cline, Continue, Windsurf (formerly Codeium), Amazon Kiro, Google's Jules, and GitHub Copilot, which has evolved from its original Codex-powered inline completion product into a multi-model agentic platform of its own.
A frequently quoted comparison comes from the artificial-analysis benchmark suite, which scores these tools on SWE-bench Verified and Terminal-Bench. Claude Code with Opus has historically led on SWE-bench Verified with scores around 80 percent, while Codex CLI has led on Terminal-Bench with scores around 77 percent and on token efficiency, reportedly using roughly a quarter of the tokens per task that Claude Code uses on equivalent assignments. Cursor leads on raw user experience for developers who prefer a graphical editor. Devin remains the most expensive option but is often credited with the most ambitious autonomy, advertising end-to-end resolution of GitHub issues without human steering.
The table below summarizes the competitive landscape on the dimensions that matter most for choosing among these tools.
| Tool | Vendor | Primary surface | Sandbox model | Default model | Pricing model |
|---|---|---|---|---|---|
| Codex CLI | OpenAI | Terminal | OS sandbox or Docker | GPT-5.4-Codex | Bundled with ChatGPT |
| Codex Cloud | OpenAI | ChatGPT web | Isolated cloud container | GPT-5.4-Codex | Bundled with ChatGPT |
| Claude Code | Anthropic | Terminal | Sandbox plus per-command approval | Claude Opus 4 | Bundled with Claude Pro/Max |
| Cursor | Anysphere | VS Code fork | Editor-level | User-selected (Claude, GPT-5, Gemini) | $20/mo Pro |
| Devin | Cognition AI | Web dashboard | Cloud sandbox with shell + browser | Proprietary | $500/mo team |
| GitHub Copilot | GitHub / Microsoft | IDE plug-ins | Editor-level plus optional sandbox | Multi-vendor (model picker) | $10-39/user/mo |
| Aider | Open source | Terminal | None (uses local git) | User-selected | Free (BYOK) |
| Cline | Open source | VS Code extension | Editor-level | User-selected | Free (BYOK) |
| Amazon Kiro | Amazon | IDE | Sandbox in AWS | Bedrock-hosted models | AWS pricing |
| Google Jules | Google | Web + GitHub | Cloud sandbox | Gemini variants | Bundled with Gemini |
Usage share data published by various third parties suggests that, between September 2025 and January 2026, Codex usage rose from approximately five percent of Claude Code's volume to approximately forty percent, reflecting both the maturation of the Codex product and OpenAI's aggressive bundling of Codex usage into ChatGPT subscriptions [3]. Claude Code retained the lead among professional developers writing complex multi-file changes, while Cursor remained dominant for users who preferred a visual IDE workflow, and Codex captured a growing share of users who wanted asynchronous, fire-and-forget delegation of coding tasks.
Codex differs from its closest peers mainly in architectural choices such as primary surface and sandbox model, as summarized in the table above. None of these choices is strictly better. The pick between Codex, Claude Code, Cursor, and Devin tends to come down to whether the team's workflow is editor-centric (Cursor, Copilot), terminal-interactive (Claude Code, Aider), or PR-asynchronous (Codex Cloud, Devin), and which vendor relationship already exists.
The modern Codex has had a warmer reception than the 2021 model, partly because it arrived in a market that was already used to AI coding tools. Reviews in the developer press in mid-to-late 2025 consistently praised the Rust-based CLI for its speed, the Cloud product for its parallel-task model, and the tight GitHub integration. Common criticisms in the same period centered on the agent's verbosity (developers complained that Codex narrated more than necessary), occasional overconfidence on tasks that required more domain knowledge, and the friction of learning yet another agent's quirks for developers already deep into Claude Code or Cursor.
The Codex variants have consistently topped Terminal-Bench, frequently led on SWE-bench Verified after each new release, and posted strong numbers on LiveCodeBench. Independent third-party benchmarks have generally placed Codex in the top three coding agents on every dimension they measure, with a particular advantage in token efficiency. A recurring theme in user reviews is the contrast with Claude Code: the two products are similar enough that experienced users tend to use both, picking whichever is better suited to a given task. Claude Code is generally praised for thoughtful, conservative edits and stronger refusal behavior on ambiguous instructions; Codex is praised for fast, decisive action and tight GitHub integration.
Both Codex generations have faced the standard set of concerns associated with AI coding tools. The original 2021 Codex was the subject of a class-action lawsuit, Doe v. GitHub, filed in November 2022, which alleged that Codex and Copilot had been trained on open-source code in violation of the licenses on that code and that the models reproduced licensed snippets verbatim without attribution. The lawsuit was largely dismissed in stages between 2023 and 2024, but the underlying questions about copyright and attribution in code generation have remained legally and ethically unresolved.
The 2025 agentic Codex products have raised a different set of safety concerns. Because the agents can execute arbitrary commands and modify files, an attacker who can inject instructions into a code repository (for example, through a malicious comment in a dependency) could in principle hijack an agent into running unauthorized commands. Codex's defaults are intended to mitigate this risk: network access is disabled during task execution, file writes are restricted to the working directory, and commands require explicit per-command approval. Even so, security researchers have published proof-of-concept prompt-injection attacks against Codex and other agentic coding tools, and OpenAI has issued several security advisories addressing specific vulnerability classes since the product launched.
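The three defaults described above (no network, writes confined to the working directory, per-command approval) can be illustrated with a minimal Python sketch. This is not Codex's actual implementation: the `Policy` class, the `write_allowed` and `review_command` functions, and the `auto_approve` allow-list are all hypothetical names invented for illustration, and a real sandbox enforces these rules at the operating-system level rather than in application code.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Policy:
    """Hypothetical approval policy; not Codex's real configuration."""
    workdir: Path                      # only directory the agent may write to
    network_allowed: bool = False      # network is off by default
    auto_approve: frozenset = frozenset({"ls", "cat"})  # assumed read-only commands


def write_allowed(policy: Policy, target: str) -> bool:
    """Return True only if target resolves to a path inside the working directory."""
    root = policy.workdir.resolve()
    resolved = (root / target).resolve()  # collapses ".." and absolute paths
    return resolved == root or root in resolved.parents


def review_command(policy: Policy, argv: list[str], needs_network: bool = False) -> str:
    """Classify a proposed command as 'deny', 'allow', or 'ask' (human approval)."""
    if not argv:
        return "deny"
    if needs_network and not policy.network_allowed:
        return "deny"
    if argv[0] in policy.auto_approve:
        return "allow"
    return "ask"  # everything else requires explicit per-command approval


if __name__ == "__main__":
    policy = Policy(workdir=Path("/tmp/agent_workdir"))
    print(write_allowed(policy, "src/main.py"))
    print(write_allowed(policy, "../outside.txt"))
    print(review_command(policy, ["rm", "-rf", "build"]))
```

The point of the sketch is the ordering of checks: the network rule is evaluated before the allow-list, so an injected instruction cannot reach the network even through a command the user has pre-approved.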
The Codex Cloud variant has its own privacy and data-residency considerations. Tasks run on OpenAI infrastructure, which means a copy of the repository code passes through OpenAI's servers during the task. OpenAI has published documentation on how data is handled for Codex Cloud (the short version is that customer code is not used to train models and is deleted after the task completes), but enterprise customers have continued to ask for stronger contractual guarantees, and several large organizations have insisted on running only the local Codex CLI rather than allowing repository data into the cloud product.
A second concern is reliability. Even the best agentic coding tools as of 2026 score below 90 percent on SWE-bench Verified, meaning that the agent fails to resolve more than one in ten of the benchmark's tasks, which are drawn from real GitHub issues, without human intervention. Long-horizon tasks involving large codebases, ambiguous requirements, or external integration points remain particularly challenging. Practitioners who use Codex and similar tools in production typically describe a "trust but verify" workflow: they read every diff before merging and treat the agent as a junior engineer whose output requires review rather than as an autonomous senior engineer.
A third concern is the impact on software-engineering practice and the developer labor market. Agentic tools shift the bottleneck from typing speed to specification quality, code review, and architectural judgment. The longer-term effects are still being debated, but early empirical work suggests that agentic tools can substantially accelerate well-specified work without consistently improving outcomes on poorly scoped tasks, and that the value of senior human judgment in code review may grow rather than shrink as more code is produced by agents.
A fourth, narrower concern is the carbon and compute cost of agent loops. A typical Codex Cloud task can run for tens of minutes and consume far more inference budget than a single-shot completion, and as Codex usage has scaled into the millions of weekly active users, the aggregate compute draw has become a topic among sustainability-focused engineering teams.