OpenAI Codex is the name OpenAI has used for two distinct generations of code-focused artificial intelligence products. The first generation, launched in private beta in August 2021, was a large language model descended from GPT-3 and fine-tuned on billions of lines of public source code. That original Codex powered the first two years of GitHub Copilot and introduced the now-canonical HumanEval benchmark for evaluating code generation, before OpenAI deprecated the underlying API in March 2023 in favor of newer general-purpose models. The second generation, launched in April 2025, is an entirely different product family: a suite of agentic coding tools called Codex CLI and Codex Cloud, built on specialized variants of OpenAI's reasoning models (codex-1, codex-mini-latest, GPT-5-Codex, and later GPT-5.3-Codex and GPT-5.4-Codex), that compete directly with Claude Code, Cursor, and Devin for the autonomous coding agent market [1][2][3].
The two Codex generations share a name and a broad mission of helping developers write software, but they differ in almost every implementation detail. The 2021 Codex was a static autoregressive language model that turned natural-language prompts into code completions, accessed through a REST API or embedded in editor extensions like the original GitHub Copilot. The 2025 Codex is an interactive software-engineering agent that runs in a sandbox, reads and edits files, executes commands, runs tests, opens pull requests, and reports back to the user, all under explicit human approval policies. The shared brand reflects OpenAI's view that both systems address the same underlying goal of reducing the friction between describing software and producing working code, but the technical lineage runs through different research programs and the user experience differs profoundly [3][4].
As of April 2026, the modern Codex platform reports more than two million weekly active users across its CLI, web, IDE, and mobile surfaces, with usage rising roughly five-fold since January 2026, and the open-source Codex CLI repository has accumulated more than 75,000 stars on GitHub [3]. The product has come to be one of the most visible examples of a broader industry shift from inline autocomplete tooling toward fully agentic coding workflows, in which the human operator delegates entire tasks ("add OAuth login to this app," "fix the failing CI checks," "port this module to TypeScript") rather than writing code line by line with assistance.
The original Codex was first announced by OpenAI on August 10, 2021, alongside the launch of a private beta of the Codex API [5]. The model itself had been described in a research paper posted on arXiv on July 7, 2021, titled Evaluating Large Language Models Trained on Code, with Mark Chen and Jerry Tworek as lead authors and a long list of contributors from across OpenAI [1]. That paper is now one of the foundational references for code-generation language models and it introduced both the Codex model family and the HumanEval evaluation set that has been used to benchmark every subsequent code model.
Codex was constructed by taking the GPT-3 architecture and continuing pretraining on a code corpus collected from public GitHub repositories. The training set was assembled in May 2020 from 54 million public repositories and initially contained 179 gigabytes of unique Python files under one megabyte in size; after filtering to remove files that were likely auto-generated, had an average line length above 100 characters, had a maximum line length above 1,000 characters, or contained a small percentage of alphanumeric characters, the final corpus totaled 159 gigabytes [1]. The fine-tuning corpus was therefore a curated slice of human-authored Python rather than the much larger sweep of all repository content.
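A minimal sketch of those file-level filters, assuming straightforward implementations of the stated thresholds (the paper does not quantify the alphanumeric cutoff, so the 25 percent figure below is an illustrative assumption, and none of this is OpenAI's actual pipeline):

```python
def keep_file(source: str, max_bytes: int = 1_000_000) -> bool:
    """Apply the Codex paper's file-level filters to one Python source file."""
    if len(source.encode("utf-8")) > max_bytes:
        return False                                  # keep files under 1 MB
    lines = source.splitlines() or [""]
    avg_len = sum(len(line) for line in lines) / len(lines)
    max_len = max(len(line) for line in lines)
    alnum_frac = sum(ch.isalnum() for ch in source) / max(len(source), 1)
    # Reject likely auto-generated or minified files: long average or maximum
    # line lengths, or a small share of alphanumeric characters.
    return avg_len <= 100 and max_len <= 1000 and alnum_frac >= 0.25
```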
OpenAI trained two main families of Codex models. The larger family, code-davinci, was the basis for the production Codex API; the smaller family, code-cushman, was a roughly 12-billion-parameter variant tuned for lower-latency completion and was the model OpenAI shipped to GitHub for the first version of Copilot [6]. The Chen et al. paper reports results across model scales from 12 million to 12 billion parameters, finding that performance on HumanEval scales smoothly with parameter count and that even 12-million-parameter models show measurable coding ability after fine-tuning. A successor model, code-davinci-002, was released in 2022 and became the base model for the GPT-3.5 family.
The Chen et al. paper introduced HumanEval, an evaluation set consisting of 164 hand-written Python programming problems, each with a function signature, a docstring describing the desired behavior, a reference solution, and a set of unit tests. The benchmark measures functional correctness using the pass@k metric: a model is allowed to generate k samples per problem, and a problem is counted as solved if at least one sample passes all the unit tests. Pass@1 measures the probability that a single sample is correct, while pass@10 and pass@100 measure the probability that at least one of ten or one hundred samples is correct.
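Computing pass@k naively by averaging over every size-k subset of samples is unnecessary; the Chen et al. paper gives a numerically stable, unbiased estimator for a problem where n samples were drawn and c of them passed, reproduced below:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for the problem
    c: number of those samples that pass all unit tests
    k: sampling budget being scored (k <= n)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# For k = 1 the estimator reduces to the raw success rate c/n:
print(pass_at_k(200, 58, 1))    # ≈ 0.29
print(pass_at_k(200, 58, 100))  # probability at least 1 of 100 draws passes
```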
The headline result from the original paper was that Codex-12B solved 28.8 percent of the HumanEval problems on pass@1, while GPT-3 solved 0 percent and the contemporaneous open model GPT-J solved 11.4 percent [1]. With 100 samples per problem, the pass@100 score rose to 70.2 percent, demonstrating that the model often had a working solution somewhere in its output distribution even when it could not reliably produce one on the first try; exploiting that headroom in practice requires an oracle, such as a unit-test suite, to pick the correct candidate from the many samples. That gap between pass@1 and pass@100 motivated subsequent work on best-of-n sampling, code-execution feedback loops, and reranking with verifier models.
HumanEval has become one of the most cited benchmarks in code generation despite well-documented limitations: only 164 problems, all in Python, all relatively short and self-contained, and all hand-written rather than scraped from real codebases. Scores have largely saturated as model capabilities have grown, with frontier models above 90 percent and the strongest 2025 systems above 95 percent. Benchmarks such as MBPP, APPS, SWE-bench, LiveCodeBench, and the SWE-bench Verified subset have largely supplanted HumanEval for serious comparison of frontier coding systems, but HumanEval remains the standard introductory benchmark and is still routinely reported in model release notes.
The table below summarizes pass@1 HumanEval scores reported for several models across the years.
| Model | Year | Reported pass@1 | Notes |
|---|---|---|---|
| GPT-3 (175B) | 2021 | 0.0% | No code-specific fine-tuning [1] |
| GPT-J (6B) | 2021 | 11.4% | Open-source baseline reported in Codex paper [1] |
| Codex-12B | 2021 | 28.8% | Original Codex API model [1] |
| code-davinci-002 | 2022 | ~47% | OpenAI Codex successor, base of GPT-3.5 |
| GPT-4 | 2023 | 67.0% | OpenAI technical report |
| GPT-4 (with reflection) | 2023 | 91.0% | With execution feedback and self-correction |
| Claude 3 Opus | 2024 | 84.9% | Anthropic blog |
| Llama 3.1 405B | 2024 | 89.0% | Meta release |
| GPT-4o | 2024 | 90.2% | OpenAI blog |
| Claude Opus 4 | 2025 | 94.5% | Anthropic blog |
| Claude Sonnet 4 | 2025 | 95.1% | Anthropic blog |
The most visible deployment of the original Codex was inside GitHub Copilot, a code completion service launched as a technical preview in June 2021 and made generally available in June 2022. Copilot integrated a fine-tuned Codex variant directly into Visual Studio Code, JetBrains IDEs, Neovim, and Visual Studio, offering inline ghost-text completions as the user typed. The Codex model running inside Copilot was a smaller, lower-latency version optimized for responsiveness in the editor, since users expected suggestions to appear within a few hundred milliseconds.
Copilot quickly became the dominant commercial AI coding product, with reports of over a million paying subscribers within two years of launch. It also became one of the highest-profile demonstrations that pretrained language models could be productized into a real developer tool with measurable productivity benefits. A widely cited GitHub-internal study from 2022 found that developers completed a JavaScript task 55 percent faster with Copilot than without it. Independent academic studies have produced more mixed results, with some showing significant speedups and others showing little or no improvement and a measurable increase in security defects in Copilot-assisted code.
GitHub Copilot moved away from the original Codex model after OpenAI deprecated the underlying API in March 2023. By the time Copilot X was announced in March 2023, Copilot Chat was running on GPT-4 and the inline completion model had been replaced by newer OpenAI variants. Subsequent versions of Copilot have run on a rotating set of models from OpenAI, Anthropic, and Google, including GPT-5 variants and Claude models, exposed to users through a model picker. The Copilot product is now operationally independent of the OpenAI Codex brand even though it grew out of the same research program.
In March 2023, OpenAI announced that the original Codex models (code-davinci-001, code-davinci-002, code-cushman-001, and related variants) would be deprecated from the API and that customers should migrate to the newly released GPT-3.5-Turbo and GPT-4 models for code-related tasks. The deprecation reflected the broader trend that general-purpose chat models had caught up to and surpassed dedicated code models on most benchmarks, including HumanEval, while also offering far better instruction following, multi-turn dialogue, and integration with new product features like function calling.
The sudden deprecation drew criticism from researchers who had been using code-davinci-002 as a stable baseline for academic experiments. Many published reproductions and ablation studies depended on a specific snapshot of the model, and once the API endpoints were turned off the experiments could no longer be re-run by other researchers. OpenAI extended the timeline several times and eventually preserved access for some classes of researcher, but the episode highlighted the friction between commercial model lifecycle and the reproducibility expectations of academic computer science.
The modern Codex product family, which shares a brand with the 2021 model but no direct technical lineage, was launched in two waves in 2025. On April 16, 2025, OpenAI announced Codex CLI, an open-source command-line coding agent that runs locally on the user's machine and can read, write, and execute code in the working directory [2][7]. On May 16, 2025, OpenAI followed with Codex Cloud, a research preview of a hosted agent that runs entirely in an isolated container on OpenAI infrastructure, accepting tasks through ChatGPT and returning pull requests for human review [4]. Both products were initially powered by codex-1, a fine-tuned variant of the o3 reasoning model adapted specifically for software engineering work.
Codex CLI is a free, open-source binary distributed under the Apache 2.0 license (initially released under MIT) that installs via npm or Homebrew and exposes a single command, `codex`, to launch an interactive terminal user interface [7][8]. Inside the TUI, the user describes a task in natural language and the agent plans, edits files, runs shell commands, and reports progress, with explicit approval prompts for any operation that touches the file system or executes a process. The CLI began life as a Node.js application but was rewritten in Rust during the second half of 2025; by early 2026 the codebase was approximately 95 percent Rust, and the Rust binary became the default distribution for performance and memory-safety reasons.
The CLI supports several approval modes: a fully manual mode in which the user approves each command before it runs, an auto-edit mode in which file edits are applied without confirmation but shell commands still require approval, and a full-auto mode in which the agent runs without supervision inside a sandbox. The sandbox enforcement uses macOS sandbox-exec, Linux Landlock and seccomp, and Windows job objects to restrict file system writes to the working directory and disable network access by default. Users can also configure the CLI to run inside Docker containers for stricter isolation, which is the recommended setup for production use on shared development machines.
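As a conceptual sketch only (the mode names mirror the modes described above, but the types and function below are hypothetical illustrations, not Codex's actual internals), the gating logic amounts to a small decision function:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str    # "edit" (file modification) or "exec" (shell command)
    detail: str  # path being edited or command line being run

def requires_approval(action: Action, mode: str) -> bool:
    """Decide whether the user must confirm an action under each mode."""
    if mode == "manual":
        return True                   # every action is confirmed by hand
    if mode == "auto-edit":
        return action.kind == "exec"  # edits auto-apply, commands still ask
    if mode == "full-auto":
        return False                  # the sandbox is the only guardrail
    raise ValueError(f"unknown approval mode: {mode}")

# e.g. in auto-edit mode a shell command still prompts for approval:
print(requires_approval(Action("exec", "pytest -q"), "auto-edit"))  # True
```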
A distinguishing feature of Codex CLI is its support for multimodal input: users can paste screenshots, mock-ups, or low-fidelity sketches into the TUI and ask the agent to implement the shown design [2]. The CLI also supports the Model Context Protocol for connecting to external tools and data sources, and the SDK released in 2025 allows developers to embed the same agent loop into their own applications. As of April 2026, the npm package reports approximately 14.5 million monthly downloads and the Homebrew formula has been one of the most-installed developer tools in the cask repository.
Codex Cloud, announced May 16, 2025, takes a fundamentally different approach to running coding agents. Rather than executing on the developer's local machine, each Codex Cloud task runs inside an isolated container on OpenAI infrastructure that is preloaded with a clone of the user's GitHub repository and any dependencies declared in a per-repo configuration file [4]. Internet access is disabled during task execution, so the agent can only interact with the code and tooling that has been explicitly provisioned. Tasks complete asynchronously in roughly one to thirty minutes, after which the agent returns a summary, a diff, terminal logs, and an optional pull request that the user can review and merge.
The asynchronous, fire-and-forget design contrasts with the interactive paradigm of Claude Code (which keeps the user in a continuous dialogue with the agent in their terminal) and with the IDE-embedded paradigm of Cursor (which surfaces the agent inside an editor where the user can watch and steer it in real time). Codex Cloud is closer in spirit to Devin, in that the user delegates a complete task and walks away while the agent works, but with stronger sandboxing, no persistent agent memory across tasks, and explicit human review at the pull-request stage.
Codex Cloud was initially available to ChatGPT Pro, Business, and Enterprise subscribers, with ChatGPT Plus access added in June 2025 and Education plan access shortly after. The product graduated from research preview to general availability in late 2025, at which point OpenAI added a Slack integration that lets team members tag a Codex bot in a channel to delegate tasks, a public Codex SDK that exposes the same agent loop used by the CLI and web interface, and admin controls that give workspace administrators visibility and policy controls over how Codex is used by their organizations [9].
The modern Codex products have run on a steady cadence of new models since the codex-1 launch.
| Model | Released | Used in | Notes |
|---|---|---|---|
| codex-1 | May 2025 | Codex Cloud | Fine-tune of o3 reasoning model for software engineering [4] |
| codex-mini-latest | May 2025 | Codex CLI default | Smaller, faster variant for interactive use; $1.50 input / $6.00 output per 1M tokens |
| GPT-5-Codex | Aug 2025 | Codex CLI and Cloud | Codex-tuned variant of GPT-5 family |
| GPT-5.3-Codex | Feb 5, 2026 | All Codex surfaces | Major intelligence step over GPT-5-Codex [3] |
| GPT-5.3-Codex-Spark | Feb 12, 2026 | Real-time interactive coding | Approximately 15x faster than earlier Codex models [3] |
| GPT-5.4-Codex | Mar 5, 2026 | All Codex surfaces | Latest stable Codex model series [3] |
The codex-1 system message, which is the persistent prompt that defines the agent's default behavior, was published by OpenAI on the codex-1 launch day to help developers understand how the agent reasons by default and to enable customization in their own deployments. The published system message emphasized careful planning before action, asking clarifying questions when the task is ambiguous, running tests after making changes, and being explicit about what it has and has not verified. This level of public documentation of an agent's behavioral specification was unusual at the time and has since been imitated by other agentic-coding products.
Codex usage is bundled into the standard ChatGPT subscription tiers rather than priced as a separate per-token product for individual users. ChatGPT Plus ($20/month) includes a baseline allotment of Codex usage; ChatGPT Pro ($200/month) includes roughly twenty times the Plus allotment for high-effort sessions; ChatGPT Business ($30/user/month) and Enterprise (custom pricing) provide team-level usage and admin controls [10]. In April 2026, OpenAI introduced a $100/month Pro tier intermediate between Plus and the original $200 Pro, providing five times the Plus Codex usage at half the price of the top tier, in response to user demand for a price point that reflected heavy individual coding use without enterprise features.
For developers who want to call the underlying models directly through the API, the codex-mini-latest model is available at $1.50 per million input tokens and $6 per million output tokens with a 75 percent prompt-caching discount, and GPT-5-Codex variants are priced through the standard chat-completions API at the rates set for the broader GPT-5 family. The Codex SDK is open-source and free to use.
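At those rates, per-task cost is straightforward to estimate. The sketch below uses hypothetical token counts and assumes the 75 percent discount applies only to the cached portion of the input:

```python
INPUT_RATE = 1.50 / 1_000_000    # dollars per input token (codex-mini-latest)
OUTPUT_RATE = 6.00 / 1_000_000   # dollars per output token
CACHE_DISCOUNT = 0.75            # cached input tokens billed at 25% of base

def task_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent task; cached_tokens <= input_tokens."""
    fresh = input_tokens - cached_tokens
    return (fresh * INPUT_RATE
            + cached_tokens * INPUT_RATE * (1 - CACHE_DISCOUNT)
            + output_tokens * OUTPUT_RATE)

# A session with 80k input tokens (60k of them cache hits) and 20k output:
print(f"${task_cost(80_000, 60_000, 20_000):.4f}")  # $0.1725
```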
The Codex platform spans several interaction surfaces beyond the CLI and the ChatGPT web interface. A first-party Visual Studio Code extension, released in late 2025, embeds the Codex Cloud workflow inside the editor, letting the user start agent tasks against the current repository and review returned diffs without leaving VS Code. JetBrains and Xcode plug-ins followed in early 2026, with similar patterns of letting the user kick off and review tasks from the IDE. A Codex desktop app for Windows and macOS launched in February 2026, providing a stand-alone interface for managing parallel agent tasks across multiple repositories. A mobile companion app gives users the ability to launch tasks and review pull requests from a phone, which has been popular for tasks like merging dependency updates while traveling.
In October 2025, OpenAI announced a partnership with GitHub to make Codex one of the agents available through GitHub's Agent HQ system, which exposes a unified interface for managing agent runs from multiple vendors inside GitHub repositories. Codex also supports external tools through the Model Context Protocol, including a Figma connector that lets the agent read design files directly, and integrations with security-scanning tooling that became the basis for OpenAI's separate Codex Security product launched in March 2026.
The table below contrasts the two generations of OpenAI Codex on the dimensions where they differ most clearly.
| Dimension | Codex (2021) | Codex (2025) |
|---|---|---|
| Product type | Hosted code-completion language model | Agentic coding platform with CLI, cloud, IDE, mobile |
| Underlying model | code-davinci, code-cushman (~12B params) | codex-1, GPT-5-Codex, GPT-5.3/5.4-Codex |
| Architecture | Static autoregressive language model | Reasoning model in agent loop with tool use |
| Interaction | Single-turn completion via REST API | Multi-turn agent with file edit, command exec, planning |
| Primary surface | Editor extensions (Copilot) and API | Terminal CLI, ChatGPT web, IDE plug-ins, mobile app |
| Sandboxing | Not applicable (model only) | OS-level sandbox locally, isolated container in cloud |
| Open source | No (proprietary API) | Yes (CLI under Apache 2.0, SDK open) |
| Pricing | Per-token API billing | Bundled with ChatGPT subscription tiers |
| Status | Deprecated March 2023 | Generally available, actively developed |
The modern Codex sits in a crowded and rapidly evolving market for AI coding assistance. Direct competitors include Anthropic's Claude Code, launched in February 2025 as a terminal-based coding agent with a similar approval-prompt model; Cursor, an IDE built as a fork of VS Code that integrates LLM-powered editing into a familiar editor experience; and Devin, an autonomous software engineer from Cognition AI marketed as a fully independent agent that can take complete tickets end to end. Other notable competitors include Aider, Continue, Windsurf (formerly Codeium), Amazon Kiro, Google's Jules, and GitHub Copilot, which has evolved from its original Codex-powered inline completion product into a multi-model agentic platform of its own.
A frequently quoted comparison comes from the Artificial Analysis benchmark suite, which scores these tools on SWE-bench Verified and Terminal-Bench. Claude Code with Opus has historically led on SWE-bench Verified with scores around 80 percent, while Codex CLI has led on Terminal-Bench with scores around 77 percent and on token efficiency, reportedly using roughly a quarter of the tokens per task that Claude Code consumes on equivalent assignments. Cursor leads on raw user experience for developers who prefer a graphical editor. Devin remains the most expensive option but is often credited with the most ambitious autonomy, advertising end-to-end resolution of GitHub issues without human steering.
The table below summarizes the competitive landscape on the dimensions that matter most for choosing among these tools.
| Tool | Vendor | Primary surface | Sandbox model | Default model |
|---|---|---|---|---|
| Codex CLI | OpenAI | Terminal | OS sandbox or Docker | GPT-5.4-Codex |
| Codex Cloud | OpenAI | ChatGPT web | Isolated cloud container | GPT-5.4-Codex |
| Claude Code | Anthropic | Terminal | Sandbox plus per-command approval | Claude Opus 4 |
| Cursor | Anysphere | VS Code fork | Editor-level | User-selected (Claude, GPT-5, Gemini) |
| Devin | Cognition AI | Web dashboard | Cloud sandbox with shell + browser | Proprietary |
| GitHub Copilot | GitHub / Microsoft | IDE plug-ins | Editor-level plus optional sandbox | Multi-vendor (model picker) |
| Aider | Open source | Terminal | None (uses local git) | User-selected |
| Amazon Kiro | Amazon | IDE | Sandbox in AWS | Bedrock-hosted models |
| Jules | Google | Web + GitHub | Cloud sandbox | Gemini variants |
Usage share data published by various third parties suggests that, between September 2025 and January 2026, Codex usage rose from approximately five percent of Claude Code's volume to approximately forty percent, reflecting both the maturation of the Codex product and OpenAI's aggressive bundling of Codex usage into ChatGPT subscriptions [3]. Claude Code retained the lead among professional developers writing complex multi-file changes, while Cursor remained dominant for users who preferred a visual IDE workflow, and Codex captured a growing share of users who wanted asynchronous, fire-and-forget delegation of coding tasks.
Both Codex generations have faced the standard set of concerns associated with AI coding tools. The original 2021 Codex was the subject of a class-action lawsuit, Doe v. GitHub, filed in November 2022, which alleged that Codex and Copilot had been trained on open-source code in violation of the licenses on that code and that the models reproduced licensed snippets verbatim without attribution. The lawsuit was largely dismissed in stages between 2023 and 2024, but the underlying questions about copyright and attribution in code generation have remained legally and ethically unresolved.
The 2025 agentic Codex products have raised a different set of safety concerns. Because the agents can execute arbitrary commands and modify files, an attacker who can inject instructions into a code repository (for example, through a malicious comment in a dependency) could in principle hijack an agent into running unauthorized commands. Codex's defaults of disabling network access during task execution, restricting file writes to the working directory, and requiring explicit per-command approval are intended to mitigate this risk, but security researchers have published proof-of-concept prompt-injection attacks against Codex and other agentic coding tools, and OpenAI has issued several security advisories addressing specific vulnerability classes since the product launched.
A second concern is reliability. Even the best agentic coding tools as of 2026 score below 90 percent on SWE-bench Verified, meaning that more than one in ten real-world software-engineering tasks results in a failure that the agent cannot solve without human intervention. Long-horizon tasks involving large codebases, ambiguous requirements, or external integration points remain particularly challenging. Practitioners who use Codex and similar tools in production typically report a workflow of "trust but verify," reading every diff before merging and treating the agent as a junior engineer whose output requires review rather than as an autonomous senior engineer.
A third concern is the impact on software-engineering practice and on the labor market for developers. Several observers have noted that agentic tools shift the bottleneck of software development from typing speed to specification quality, code review, and architectural judgment. The longer-term effects on developer productivity, code quality, security, and employment are still being studied and debated. Early empirical work suggests that agentic tools can substantially accelerate well-specified work but do not always improve outcomes on poorly scoped tasks, and that the value of senior human judgment in code review may grow rather than shrink as more code is produced by agents.