OpenAI Codex is the name OpenAI has used for two distinct generations of code-focused artificial intelligence products. The first generation, launched in private beta in August 2021, was a large language model descended from GPT-3 and fine-tuned on billions of lines of public source code. That original Codex powered the first two years of GitHub Copilot and introduced the now-canonical HumanEval benchmark for evaluating code generation, before OpenAI deprecated the underlying API in March 2023 in favor of newer general-purpose models. The second generation, launched in April 2025, is an entirely different product family: a suite of agentic coding tools called Codex CLI and Codex Cloud, built on specialized variants of OpenAI's reasoning models (codex-1, codex-mini-latest, GPT-5-Codex, and later GPT-5.3-Codex and GPT-5.4-Codex), that compete directly with Claude Code, Cursor, and Devin for the autonomous coding agent market [1][2][3].
The two Codex generations share a name and a broad mission of helping developers write software, but they differ in almost every implementation detail. The 2021 Codex was a static autoregressive language model that turned natural-language prompts into code completions, accessed through a REST API or embedded in editor extensions like the original GitHub Copilot. The 2025 Codex is an interactive software-engineering agent that runs in a sandbox, reads and edits files, executes commands, runs tests, opens pull requests, and reports back to the user, all under explicit human approval policies. The shared brand reflects OpenAI's view that both systems address the same underlying goal of reducing the friction between describing software and producing working code, but the technical lineage runs through different research programs and the user experience differs profoundly [3][4].
As of April 2026, the modern Codex platform reports more than two million weekly active users across its CLI, web, IDE, and mobile surfaces, with usage rising roughly five-fold since January 2026, and the open-source Codex CLI repository has accumulated more than 75,000 stars on GitHub [3]. The product has come to be one of the most visible examples of a broader industry shift from inline autocomplete tooling toward fully agentic coding workflows, in which the human operator delegates entire tasks ("add OAuth login to this app," "fix the failing CI checks," "port this module to TypeScript") rather than writing code line by line with assistance.
The original Codex was first announced by OpenAI on August 10, 2021, alongside the launch of a private beta of the Codex API [5]. The model itself had been described in a research paper posted on arXiv on July 7, 2021, titled Evaluating Large Language Models Trained on Code, with Mark Chen and Jerry Tworek as lead authors and a long list of contributors from across OpenAI [1]. That paper is now one of the foundational references for code-generation language models and it introduced both the Codex model family and the HumanEval evaluation set that has been used to benchmark every subsequent code model.
Codex was constructed by taking the GPT-3 architecture and continuing pretraining on a code corpus collected from public GitHub repositories. The training set was assembled in May 2020 from 54 million public repositories and initially contained 179 gigabytes of unique Python files under one megabyte in size; after filtering to remove files that were likely auto-generated, had an average line length above 100 characters, had a maximum line length above 1,000 characters, or contained a small percentage of alphanumeric characters, the final corpus totaled 159 gigabytes [1]. The fine-tuning corpus was therefore a curated slice of human-authored Python rather than the much larger sweep of all repository content.
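A minimal sketch of those file-level filters, assuming straightforward implementations of the stated thresholds (the paper does not quantify the alphanumeric cutoff, so the 25 percent figure below is an illustrative assumption, and none of this is OpenAI's actual pipeline):

```python
def keep_file(source: str, max_bytes: int = 1_000_000) -> bool:
    """Apply the Codex paper's file-level filters to one Python source file."""
    if len(source.encode("utf-8")) > max_bytes:
        return False                                  # keep files under 1 MB
    lines = source.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    # Reject likely auto-generated or minified files: long average or maximum
    # line lengths, or a small share of alphanumeric characters.
    return avg_len <= 100 and max_len <= 1000 and alnum_frac >= 0.25
```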
OpenAI trained two main families of Codex models. The larger family, code-davinci, was the basis for the production Codex API; the smaller family, code-cushman, was a roughly 12-billion-parameter variant tuned for lower-latency completion and was the model OpenAI shipped to GitHub for the first version of Copilot [6]. The Chen et al. paper reports results across model scales from 12 million to 12 billion parameters, finding that performance on HumanEval scales smoothly with parameter count and that even 12-million-parameter models show measurable coding ability after fine-tuning. A successor model, code-davinci-002, was released in 2022 and became the base model for the GPT-3.5 family.
The Chen et al. paper introduced HumanEval, an evaluation set consisting of 164 hand-written Python programming problems, each with a function signature, a docstring describing the desired behavior, a reference solution, and a set of unit tests. The benchmark measures functional correctness using the pass@k metric: a model is allowed to generate k samples per problem, and a problem is counted as solved if at least one sample passes all the unit tests. Pass@1 measures the probability that a single sample is correct, while pass@10 and pass@100 measure the probability that at least one of ten or one hundred samples is correct.
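Computing pass@k naively by averaging over every size-k subset of samples is unnecessary; the Chen et al. paper gives a numerically stable, unbiased estimator for a problem where n samples were drawn and c of them passed, reproduced below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for the problem
    c: number of those samples that pass all unit tests
    k: sampling budget being scored (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k = 1 the estimator reduces to the raw success rate c/n:
print(pass_at_k(200, 58, 1))    # ≈ 0.29
print(pass_at_k(200, 58, 100))  # probability at least 1 of 100 draws passes
```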
The headline result from the original paper was that Codex-12B solved 28.8 percent of the HumanEval problems on pass@1, while GPT-3 solved 0 percent and the contemporaneous open model GPT-J solved 11.4 percent [1]. With 100 samples per problem, the pass@100 score rose to 70.2 percent, demonstrating that the model often had a working solution somewhere in its output distribution even when it could not reliably produce one on the first try; exploiting that headroom in practice requires an oracle, such as a unit-test suite, to pick the correct candidate from the many samples. That gap between pass@1 and pass@100 motivated subsequent work on best-of-n sampling, code-execution feedback loops, and reranking with verifier models.
HumanEval has become one of the most cited benchmarks in code generation despite well-documented limitations: only 164 problems, all in Python, all relatively short and self-contained, and all hand-written rather than scraped from real codebases. Scores have largely saturated as model capabilities have grown, with frontier models above 90 percent and the strongest 2025 systems above 95 percent. Benchmarks such as MBPP, APPS, SWE-bench, LiveCodeBench, and the SWE-bench Verified subset have largely supplanted HumanEval for serious comparison of frontier coding systems, but HumanEval remains the standard introductory benchmark and is still routinely reported in model release notes.
The table below summarizes pass@1 HumanEval scores reported for several models across the years.
| Model | Year | Reported pass@1 | Notes |
|---|---|---|---|
| GPT-3 (175B) | 2021 | 0.0% | No code-specific fine-tuning [1] |
| GPT-J (6B) | 2021 | 11.4% | Open-source baseline reported in Codex paper [1] |
| Codex-12B | 2021 | 28.8% | Original Codex API model [1] |
| code-davinci-002 | 2022 | ~47% | OpenAI Codex successor, base of GPT-3.5 |
| GPT-4 | 2023 | 67.0% | OpenAI technical report |
| GPT-4 (with reflection) | 2023 | 91.0% | With execution feedback and self-correction |
| Claude 3 Opus | 2024 | 84.9% | Anthropic blog |
| Llama 3.1 405B | 2024 | 89.0% | Meta release |
| GPT-4o | 2024 | 90.2% | OpenAI blog |
| Claude Opus 4 | 2025 | 94.5% | Anthropic blog |
| Claude Sonnet 4 | 2025 | 95.1% | Anthropic blog |
The most visible deployment of the original Codex was inside GitHub Copilot, a code completion service launched as a technical preview in June 2021 and made generally available in June 2022. Copilot integrated a fine-tuned Codex variant directly into Visual Studio Code, JetBrains IDEs, Neovim, and Visual Studio, offering inline ghost-text completions as the user typed. The Codex model running inside Copilot was a smaller, lower-latency version optimized for responsiveness in the editor, since users expected suggestions to appear within a few hundred milliseconds.
Copilot quickly became the dominant commercial AI coding product, with reports of over a million paying subscribers within two years of launch. It also became one of the highest-profile demonstrations that pretrained language models could be productized into a real developer tool with measurable productivity benefits. A widely cited GitHub-internal study from 2022 found that developers completed a JavaScript task 55 percent faster with Copilot than without it. Independent academic studies have produced more mixed results, with some showing significant speedups and others showing little or no improvement and a measurable increase in security defects in Copilot-assisted code.
GitHub Copilot moved away from the original Codex model after OpenAI deprecated the underlying API in March 2023. By the time Copilot X was announced in March 2023, Copilot Chat was running on GPT-4 and the inline completion model had been replaced by newer OpenAI variants. Subsequent versions of Copilot have run on a rotating set of models from OpenAI, Anthropic, and Google, including GPT-5 variants and Claude models, exposed to users through a model picker. The Copilot product is now operationally independent of the OpenAI Codex brand even though it grew out of the same research program.
In March 2023, OpenAI announced that the original Codex models (code-davinci-001, code-davinci-002, code-cushman-001, and related variants) would be deprecated from the API and that customers should migrate to the newly released GPT-3.5-Turbo and GPT-4 models for code-related tasks. The deprecation reflected the broader trend that general-purpose chat models had caught up to and surpassed dedicated code models on most benchmarks, including HumanEval, while also offering far better instruction following, multi-turn dialogue, and integration with new product features like function calling.
The sudden deprecation drew criticism from researchers who had been using code-davinci-002 as a stable baseline for academic experiments. Many published reproductions and ablation studies depended on a specific snapshot of the model, and once the API endpoints were turned off the experiments could no longer be re-run by other researchers. OpenAI extended the timeline several times and eventually preserved access for some classes of researcher, but the episode highlighted the friction between commercial model lifecycle and the reproducibility expectations of academic computer science.
The modern Codex product family, which shares a brand with the 2021 model but no direct technical lineage, was launched in two waves in 2025. On April 16, 2025, OpenAI announced Codex CLI, an open-source command-line coding agent that runs locally on the user's machine and can read, write, and execute code in the working directory [2][7]. On May 16, 2025, OpenAI followed with Codex Cloud, a research preview of a hosted agent that runs entirely in an isolated container on OpenAI infrastructure, accepting tasks through ChatGPT and returning pull requests for human review [4]. Both products were initially powered by codex-1, a fine-tuned variant of the o3 reasoning model adapted specifically for software engineering work.
Codex CLI is a free, open-source binary distributed under the Apache 2.0 license (initially released under MIT) that installs via npm or Homebrew and exposes a single command, `codex`, to launch an interactive terminal user interface [7][8]. Inside the TUI, the user describes a task in natural language and the agent plans, edits files, runs shell commands, and reports progress, with explicit approval prompts for any operation that touches the file system or executes a process. The CLI began life as a Node.js application but was rewritten in Rust during the second half of 2025; by early 2026 the codebase was approximately 95 percent Rust, and the Rust binary became the default distribution for performance and memory-safety reasons.
The CLI supports several approval modes: a fully manual mode in which the user approves each command before it runs, an auto-edit mode in which file edits are applied without confirmation but shell commands still require approval, and a full-auto mode in which the agent runs without supervision inside a sandbox. The sandbox enforcement uses macOS sandbox-exec, Linux Landlock and seccomp, and Windows job objects to restrict file system writes to the working directory and disable network access by default. Users can also configure the CLI to run inside Docker containers for stricter isolation, which is the recommended setup for production use on shared development machines.
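As a conceptual sketch only (the mode names mirror the modes described above, but the types and function below are hypothetical illustrations, not Codex's actual internals), the gating logic amounts to a small decision function:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str    # "edit" (file modification) or "exec" (shell command)
    detail: str  # path being edited or command line being run

def requires_approval(action: Action, mode: str) -> bool:
    """Decide whether the user must confirm an action under each mode."""
    if mode == "manual":
        return True                   # every action is confirmed by hand
    if mode == "auto-edit":
        return action.kind == "exec"  # edits auto-apply, commands still ask
    if mode == "full-auto":
        return False                  # the sandbox is the only guardrail
    raise ValueError(f"unknown approval mode: {mode}")

# e.g. in auto-edit mode a shell command still prompts for approval:
print(requires_approval(Action("exec", "pytest -q"), "auto-edit"))  # True
```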
A distinguishing feature of Codex CLI is its support for multimodal input: users can paste screenshots, mock-ups, or low-fidelity sketches into the TUI and ask the agent to implement the shown design [2]. The CLI also supports the Model Context Protocol for connecting to external tools and data sources, and the SDK released in 2025 allows developers to embed the same agent loop into their own applications. As of April 2026, the npm package reports approximately 14.5 million monthly downloads and the Homebrew formula has been one of the most-installed developer tools in the cask repository.
Codex Cloud, announced May 16, 2025, takes a fundamentally different approach to running coding agents. Rather than executing on the developer's local machine, each Codex Cloud task runs inside an isolated container on OpenAI infrastructure that is preloaded with a clone of the user's GitHub repository and any dependencies declared in a per-repo configuration file [4]. Internet access is disabled during task execution, so the agent can only interact with the code and tooling that has been explicitly provisioned. Tasks complete asynchronously in roughly one to thirty minutes, after which the agent returns a summary, a diff, terminal logs, and an optional pull request that the user can review and merge.
The asynchronous, fire-and-forget design contrasts with the interactive paradigm of Claude Code (which keeps the user in a continuous dialogue with the agent in their terminal) and with the IDE-embedded paradigm of Cursor (which surfaces the agent inside an editor where the user can watch and steer it in real time). Codex Cloud is closer in spirit to Devin, in that the user delegates a complete task and walks away while the agent works, but with stronger sandboxing, no persistent agent memory across tasks, and explicit human review at the pull-request stage.
Codex Cloud was initially available to ChatGPT Pro, Business, and Enterprise subscribers, with ChatGPT Plus access added in June 2025 and Education plan access shortly after. The product graduated from research preview to general availability in late 2025, at which point OpenAI added a Slack integration that lets team members tag a Codex bot in a channel to delegate tasks, a public Codex SDK that exposes the same agent loop used by the CLI and web interface, and admin controls that give workspace administrators visibility and policy controls over how Codex is used by their organizations [9].
The modern Codex products have run on a steady cadence of new models since the codex-1 launch.
| Model | Released | Used in | Notes |
|---|---|---|---|
| codex-1 | May 2025 | Codex Cloud | Fine-tune of o3 reasoning model for software engineering [4] |
| codex-mini-latest | May 2025 | Codex CLI default | Smaller, faster variant for interactive use; $1.50 input / $6.00 output per 1M tokens |
| GPT-5-Codex | Aug 2025 | Codex CLI and Cloud | Codex-tuned variant of GPT-5 family |
| GPT-5.3-Codex | Feb 5, 2026 | All Codex surfaces | Major intelligence step over GPT-5-Codex [3] |
| GPT-5.3-Codex-Spark | Feb 12, 2026 | Real-time interactive coding | Approximately 15x faster than earlier Codex models [3] |
| GPT-5.4-Codex | Mar 5, 2026 | All Codex surfaces | Latest stable Codex model series [3] |
The codex-1 system message, which is the persistent prompt that defines the agent's default behavior, was published by OpenAI on the codex-1 launch day to help developers understand how the agent reasons by default and to enable customization in their own deployments. The published system message emphasized careful planning before action, asking clarifying questions when the task is ambiguous, running tests after making changes, and being explicit about what it has and has not verified. This level of public documentation of an agent's behavioral specification was unusual at the time and has since been imitated by other agentic-coding products.
Codex usage is bundled into the standard ChatGPT subscription tiers rather than priced as a separate per-token product for individual users. ChatGPT Plus ($20/month) includes a baseline allotment of Codex usage; ChatGPT Pro ($200/month) includes roughly twenty times the Plus allotment for high-effort sessions; ChatGPT Business ($30/user/month) and Enterprise (custom pricing) provide team-level usage and admin controls [10]. In April 2026, OpenAI introduced a $100/month Pro tier intermediate between Plus and the original $200 Pro, providing five times the Plus Codex usage at half the price of the top tier, in response to user demand for a price point that reflected heavy individual coding use without enterprise features.
For developers who want to call the underlying models directly through the API, the codex-mini-latest model is available at $1.50 per million input tokens and $6 per million output tokens with a 75 percent prompt-caching discount, and GPT-5-Codex variants are priced through the standard chat-completions API at the rates set for the broader GPT-5 family. The Codex SDK is open-source and free to use.
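At those rates, per-task cost is straightforward to estimate. The sketch below uses hypothetical token counts and assumes the 75 percent discount applies only to the cached portion of the input:

```python
INPUT_RATE = 1.50 / 1_000_000    # dollars per input token (codex-mini-latest)
OUTPUT_RATE = 6.00 / 1_000_000   # dollars per output token
CACHE_DISCOUNT = 0.75            # cached input tokens billed at 25% of base

def task_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent task; cached_tokens <= input_tokens."""
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT_RATE
            + cached_tokens * INPUT_RATE * (1 - CACHE_DISCOUNT)
            + output_tokens * OUTPUT_RATE)

# A session with 80k input tokens (60k of them cache hits) and 20k output:
print(f"${task_cost(80_000, 60_000, 20_000):.4f}")  # $0.1725
```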
The Codex platform spans several interaction surfaces beyond the CLI and the ChatGPT web interface. A first-party Visual Studio Code extension, released in late 2025, embeds the Codex Cloud workflow inside the editor, letting the user start agent tasks against the current repository and review returned diffs without leaving VS Code. JetBrains and Xcode plug-ins followed in early 2026, with similar patterns of letting the user kick off and review tasks from the IDE. A Codex desktop app for Windows and macOS launched in February 2026, providing a stand-alone interface for managing parallel agent tasks across multiple repositories. A mobile companion app gives users the ability to launch tasks and review pull requests from a phone, which has been popular for tasks like merging dependency updates while traveling.
In October 2025, OpenAI announced a partnership with GitHub to make Codex one of the agents available through GitHub's Agent HQ system, which exposes a unified interface for managing agent runs from multiple vendors inside GitHub repositories. Codex also supports external tools through the Model Context Protocol, including a Figma connector that lets the agent read design files directly, and integrations with security-scanning tooling that became the basis for OpenAI's separate Codex Security product launched in March 2026.
The table below contrasts the two generations of OpenAI Codex on the dimensions where they differ most clearly.
| Dimension | Codex (2021) | Codex (2025) |
|---|---|---|
| Product type | Hosted code-completion language model | Agentic coding platform with CLI, cloud, IDE, mobile |
| Underlying model | code-davinci, code-cushman (~12B params) | codex-1, GPT-5-Codex, GPT-5.3/5.4-Codex |
| Architecture | Static autoregressive language model | Reasoning model in agent loop with tool use |
| Interaction | Single-turn completion via REST API | Multi-turn agent with file edit, command exec, planning |
| Primary surface | Editor extensions (Copilot) and API | Terminal CLI, ChatGPT web, IDE plug-ins, mobile app |
| Sandboxing | Not applicable (model only) | OS-level sandbox locally, isolated container in cloud |
| Open source | No (proprietary API) | Yes (CLI under Apache 2.0, SDK open) |
| Pricing | Per-token API billing | Bundled with ChatGPT subscription tiers |
| Status | Deprecated March 2023 | Generally available, actively developed |
The modern Codex sits in a crowded and rapidly evolving market for AI coding assistance. Direct competitors include Anthropic's Claude Code, launched in February 2025 as a terminal-based coding agent with a similar approval-prompt model; Cursor, an IDE built as a fork of VS Code that integrates LLM-powered editing into a familiar editor experience; and Devin, an autonomous software engineer from Cognition AI marketed as a fully independent agent that can take complete tickets end to end. Other notable competitors include Aider, Continue, Windsurf (formerly Codeium), Amazon Kiro, Google's Jules, and GitHub Copilot, which has evolved from its original Codex-powered inline completion product into a multi-model agentic platform of its own.
A frequently quoted comparison comes from the Artificial Analysis benchmark suite, which scores these tools on SWE-bench Verified and Terminal-Bench. Claude Code with Opus has historically led on SWE-bench Verified with scores around 80 percent, while Codex CLI has led on Terminal-Bench with scores around 77 percent and on token efficiency, reportedly using roughly a quarter of the tokens per task that Claude Code consumes on equivalent assignments. Cursor leads on raw user experience for developers who prefer a graphical editor. Devin remains the most expensive option but is often credited with the most ambitious autonomy, advertising end-to-end resolution of GitHub issues without human steering.
The table below summarizes the competitive landscape on the dimensions that matter most for choosing among these tools.
| Tool | Vendor | Primary surface | Sandbox model | Default model |
|---|---|---|---|---|
| Codex CLI | OpenAI | Terminal | OS sandbox or Docker | GPT-5.4-Codex |
| Codex Cloud | OpenAI | ChatGPT web | Isolated cloud container | GPT-5.4-Codex |
| Claude Code | Anthropic | Terminal | Sandbox plus per-command approval | Claude Opus 4 |
| Cursor | Anysphere | VS Code fork | Editor-level | User-selected (Claude, GPT-5, Gemini) |
| Devin | Cognition AI | Web dashboard | Cloud sandbox with shell + browser | Proprietary |
| GitHub Copilot | GitHub / Microsoft | IDE plug-ins | Editor-level plus optional sandbox | Multi-vendor (model picker) |
| Aider | Open source | Terminal | None (uses local git) | User-selected |
| Amazon Kiro | Amazon | IDE | Sandbox in AWS | Bedrock-hosted models |
| Jules | Google | Web + GitHub | Cloud sandbox | Gemini variants |
Usage share data published by various third parties suggests that, between September 2025 and January 2026, Codex usage rose from approximately five percent of Claude Code's volume to approximately forty percent, reflecting both the maturation of the Codex product and OpenAI's aggressive bundling of Codex usage into ChatGPT subscriptions [3]. Claude Code retained the lead among professional developers writing complex multi-file changes, while Cursor remained dominant for users who preferred a visual IDE workflow, and Codex captured a growing share of users who wanted asynchronous, fire-and-forget delegation of coding tasks.
Both Codex generations have faced the standard set of concerns associated with AI coding tools. The original 2021 Codex was the subject of a class-action lawsuit, Doe v. GitHub, filed in November 2022, which alleged that Codex and Copilot had been trained on open-source code in violation of the licenses on that code and that the models reproduced licensed snippets verbatim without attribution. The lawsuit was largely dismissed in stages between 2023 and 2024, but the underlying questions about copyright and attribution in code generation have remained legally and ethically unresolved.
The 2025 agentic Codex products have raised a different set of safety concerns. Because the agents can execute arbitrary commands and modify files, an attacker who can inject instructions into a code repository (for example, through a malicious comment in a dependency) could in principle hijack an agent into running unauthorized commands. Codex's defaults of disabling network access during task execution, restricting file writes to the working directory, and requiring explicit per-command approval are intended to mitigate this risk, but security researchers have published proof-of-concept prompt-injection attacks against Codex and other agentic coding tools, and OpenAI has issued several security advisories addressing specific vulnerability classes since the product launched.
A second concern is reliability. Even the best agentic coding tools as of 2026 score below 90 percent on SWE-bench Verified, meaning that more than one in ten real-world software-engineering tasks results in a failure that the agent cannot solve without human intervention. Long-horizon tasks involving large codebases, ambiguous requirements, or external integration points remain particularly challenging. Practitioners who use Codex and similar tools in production typically report a workflow of "trust but verify," reading every diff before merging and treating the agent as a junior engineer whose output requires review rather than as an autonomous senior engineer.
A third concern is the impact on software-engineering practice and on the labor market for developers. Several observers have noted that agentic tools shift the bottleneck of software development from typing speed to specification quality, code review, and architectural judgment. The longer-term effects on developer productivity, code quality, security, and employment are still being studied and debated. Early empirical work suggests that agentic tools can substantially accelerate well-specified work but do not always improve outcomes on poorly scoped tasks, and that the value of senior human judgment in code review may grow rather than shrink as more code is produced by agents.