Codex is a code generation large language model developed by OpenAI, first introduced in July 2021 through a research paper and released as an API (initially in private beta) in August 2021. Built by fine-tuning GPT-3 on billions of lines of publicly available source code from GitHub, Codex was the first widely deployed AI system capable of translating natural language descriptions into functional code. It powered the original version of GitHub Copilot, the AI pair programming tool that would go on to become one of the most commercially successful AI products in software development. Alongside the model itself, the Codex research team created the HumanEval benchmark, which became the standard evaluation framework for measuring code generation ability in language models. Although the original Codex API was deprecated in March 2023 and its capabilities were absorbed into subsequent OpenAI models, the name was revived in 2025 for an agentic software engineering product built on entirely different technology. That new product, powered first by the codex-1 model (a fine-tuned variant of o3) and later by the GPT-5 Codex family of models, has grown into one of the most widely used AI coding agents in the world.
By 2020, large language models like GPT-3 had demonstrated an unexpected aptitude for generating code, despite being trained primarily on natural language text. Researchers at OpenAI observed that GPT-3 could produce simple Python functions when prompted appropriately, but its performance was unreliable and limited to straightforward tasks. This observation motivated a focused effort to create a model specifically optimized for code generation.
The core hypothesis behind Codex was simple: if a large language model trained on text could write some code, then a similar model fine-tuned on a large corpus of actual source code should be substantially better at it. This proved correct. The resulting model, described in the paper "Evaluating Large Language Models Trained on Code" by Mark Chen and colleagues at OpenAI (published on arXiv on July 7, 2021), demonstrated a significant leap in code generation ability over the base GPT-3 model [1].
Codex is built on the GPT-3 transformer architecture, a decoder-only autoregressive language model. The largest version of Codex contains 12 billion parameters, making it smaller than GPT-3's full 175 billion parameter configuration. OpenAI trained multiple versions of Codex at different scales, but the 12B model was the primary focus of the research paper and the basis for the publicly released API [1].
As an autoregressive, left-to-right language model, Codex generates code one token at a time, predicting the next token based on all preceding tokens in the context. This left-to-right generation approach is naturally well-suited to programming tasks like code completion, where the model extends partially written code, and program synthesis, where it generates entire functions from natural language descriptions.
Codex was fine-tuned on a dataset of 159 gigabytes of Python code collected from 54 million public GitHub repositories [1]. The dataset consisted of unique Python files under 1 megabyte in size, with filtering applied to remove auto-generated code, files with excessively long lines, and other low-quality content. While the initial research focused on Python, subsequent versions of Codex were trained on code in multiple programming languages.
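The filtering described above can be sketched as a simple predicate. The 1 MB size cap comes from the source; the per-line threshold and the auto-generation marker check below are illustrative stand-ins, not OpenAI's exact rules:

```python
def keep_file(text: str, max_bytes: int = 1_000_000, max_line_len: int = 1000) -> bool:
    """Heuristic file filter in the spirit of the Codex data pipeline.

    The 1 MB cap matches the published description; the line-length
    threshold and generated-code marker are illustrative guesses.
    """
    if len(text.encode("utf-8")) >= max_bytes:
        return False  # oversized files are excluded
    for line in text.splitlines():
        if len(line) > max_line_len:
            return False  # very long lines suggest minified or generated code
    if "auto-generated" in text[:200].lower():
        return False  # crude check for an auto-generation header comment
    return True
```

A real pipeline would also deduplicate files and apply language-specific heuristics; this sketch only shows the shape of the per-file decision.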
The training process involved two stages:

1. **Natural language pre-training.** The model started from GPT-3's weights, trained on a large corpus of internet text, giving it broad natural language understanding.
2. **Code fine-tuning.** The model was then fine-tuned on the filtered corpus of public GitHub code, adapting its next-token predictions to the syntax and semantics of real programs.
This two-stage approach was crucial. The natural language pre-training gave Codex the ability to understand English-language descriptions of programming tasks, while the code fine-tuning taught it to produce syntactically and semantically correct programs.
The research paper also described Codex-S, a variant that was further fine-tuned on a curated dataset of correct solutions to programming problems. Codex-S significantly outperformed the base Codex model on the HumanEval benchmark, achieving 37.7% pass@1 compared to Codex's 28.8% [1]. This demonstrated that additional supervised fine-tuning on high-quality, correct code could meaningfully improve a model's ability to generate working solutions.
One of the most lasting contributions of the Codex project is the HumanEval benchmark, which was created by the same team and released alongside the model [1]. Before HumanEval, there was no widely accepted standard for measuring the functional correctness of code generated by language models. Existing benchmarks focused on token-level metrics like BLEU score or exact match, which correlated poorly with whether generated code actually worked.
HumanEval consists of 164 hand-crafted programming problems in Python. Each problem includes:

- A function signature
- A docstring describing the task, typically with example inputs and outputs
- A canonical reference solution
- A set of unit tests (an average of about 7.7 tests per problem)
The benchmark measures functional correctness by executing the model's generated code against the unit tests. A solution is considered correct only if it passes all test cases for a given problem.
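Execution-based scoring can be illustrated with a toy problem in the HumanEval style. The problem below is invented, not one of the 164, and a real harness sandboxes execution, since model-generated code is untrusted:

```python
# Prompt as it would be given to the model: signature plus docstring.
PROMPT = '''def incr_list(l):
    """Return a copy of l with every element incremented by 1."""
'''

# A candidate completion, i.e. what the model generates after the prompt.
COMPLETION = "    return [x + 1 for x in l]\n"

# Unit tests used to judge functional correctness.
CHECK = '''
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []
'''

namespace = {}
exec(PROMPT + COMPLETION + CHECK, namespace)  # assemble and run the program
namespace["check"](namespace["incr_list"])    # raises AssertionError on failure
print("all tests passed")
```

The completion is judged purely on behavior: any implementation that passes `check` counts as correct, regardless of how it is written.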
The primary metric introduced with HumanEval is pass@k, which measures the probability that at least one correct solution appears in k generated samples. The key metric, pass@1, measures the fraction of problems that the model solves correctly with a single attempt. pass@10 and pass@100 measure performance with 10 and 100 attempts respectively.
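The Codex paper computes pass@k with an unbiased estimator rather than by literally resampling: generate n ≥ k samples per problem, count the c that pass all tests, and evaluate 1 − C(n−c, k)/C(n, k), the probability that a random size-k subset contains at least one correct sample. A direct translation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: total samples generated for the problem
    c: number of samples that passed all unit tests
    k: attempt budget being evaluated
    """
    if n - c < k:
        # Fewer incorrect samples than k: every size-k subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k` over all 164 problems yields the benchmark score; for example, 1 correct sample out of 2 gives pass@1 = 0.5.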
| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| GPT-3 (175B) | 0.0% | 0.0% | 0.0% |
| GPT-J (6B) | 11.4% | 15.7% | 27.7% |
| Codex (12B) | 28.8% | 46.8% | 72.3% |
| Codex-S (12B) | 37.7% | 55.5% | 77.5% |
The finding that GPT-3, despite its 175 billion parameters, scored 0% on HumanEval while the 12B Codex model solved nearly 29% of problems highlighted the importance of domain-specific training data. It also showed that repeated sampling was a surprisingly effective strategy: with 100 samples per problem, Codex solved over 70% of the benchmark [1].
HumanEval quickly became the de facto standard for evaluating code generation models. Virtually every subsequent code-capable language model, including GPT-4, Claude, Gemini, StarCoder, and Code Llama, has been benchmarked against HumanEval. Extended versions of the benchmark, such as HumanEval+ (with expanded test suites) and MultiPL-E (which ports the problems to other programming languages), have further cemented its role in the field. The HumanEval dataset is publicly available on GitHub under OpenAI's repositories [3].
The most significant commercial application of Codex was GitHub Copilot, an AI-powered code completion tool developed through a partnership between OpenAI and GitHub (a subsidiary of Microsoft). Copilot was first announced as a technical preview on June 29, 2021, just weeks before the Codex paper was published, and was made generally available on June 21, 2022 [4].
A distinct production version of Codex, optimized for real-time code completion, powered GitHub Copilot's initial release. When a developer typed code in their editor (primarily Visual Studio Code), the extension sent the surrounding code context to the Codex model via an API call. The model would generate suggested completions, which appeared as grayed-out text that the developer could accept with a single keystroke.
Copilot went beyond simple autocomplete. It could:

- Generate entire functions from a descriptive comment or docstring
- Suggest idiomatic boilerplate for common patterns and frameworks
- Propose unit tests for existing code
- Work across dozens of programming languages, with the strongest performance in well-represented ones like Python and JavaScript
GitHub Copilot became one of the earliest and most commercially successful AI developer tools. Priced at $10/month for individual developers ($19 per user per month for Copilot for Business), it attracted over 1.3 million paying subscribers by early 2024. GitHub reported that, for users who had it enabled, Copilot was generating an average of 46% of their code across all programming languages [5].
The success of Copilot validated the commercial viability of AI-assisted software development and inspired a wave of competing products, including Amazon CodeWhisperer (later Amazon Q Developer), Google's code assistance features in Android Studio and various IDEs, Cursor, Codeium, and Tabnine.
GitHub Copilot did not remain tied to the original Codex model. As OpenAI developed more capable models, Copilot transitioned to using newer model versions. By 2023, Copilot was reportedly using models from the GPT-3.5 and GPT-4 families, and subsequent updates have incorporated even newer OpenAI models. The Copilot X announcement in March 2023 explicitly referenced GPT-4 as the underlying technology for advanced features like Copilot Chat [6].
OpenAI launched the Codex API in private beta on August 10, 2021, making it available to developers through the OpenAI API platform [7]. The API exposed the Codex model through the Completions endpoint, where developers could send natural language prompts or partial code and receive generated code completions. The API supported multiple programming languages, including Python, JavaScript, TypeScript, Ruby, Go, and several others.
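A request to the (now-retired) Completions endpoint was an ordinary JSON body. The payload below is a representative sketch; the prompt and stop sequences are illustrative, not taken from OpenAI's documentation:

```python
import json

# Body of a POST to /v1/completions with a Codex model
# (endpoint deprecated in March 2023; shown for historical illustration).
request = {
    "model": "code-davinci-002",
    # Natural-language intent plus partial code for the model to complete.
    "prompt": "# Return the nth Fibonacci number.\ndef fib(n):",
    "max_tokens": 64,
    "temperature": 0,                # favor deterministic completions
    "stop": ["\ndef ", "\nclass "],  # stop before starting a new definition
}
print(json.dumps(request, indent=2))
```

Stop sequences like these were a common way to keep the model from continuing past the requested function into unrelated code.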
The primary model identifiers available through the API were:
| Model | Description | Max Tokens |
|---|---|---|
| code-davinci-002 | Most capable Codex model; strongest at translating natural language to code | 8,001 |
| code-davinci-001 | Earlier version of the Codex model | 8,001 |
| code-cushman-001 | Faster but less capable; good for real-time applications | 2,048 |
code-davinci-002 was the most widely used variant and represented the highest-capability version of Codex available through the API. It supported up to 8,001 tokens of context, allowing developers to include substantial code context alongside their prompts.
Remarkably, the Codex API was offered as a free limited beta for its entire lifespan. OpenAI never transitioned Codex to a paid tier before deprecating it, meaning that all API usage of the Codex models was provided at no cost to developers. This was unusual for OpenAI's API products and reflected the company's interest in gathering usage data and driving adoption of AI-assisted coding.
On March 23, 2023, OpenAI announced the discontinuation of the Codex API [8]. The company recommended that developers migrate to the Chat Completions API using GPT-3.5 Turbo or GPT-4, which offered superior code generation capabilities alongside general-purpose language understanding. The code-davinci-002 and code-cushman-001 models were shut down, and no direct replacement was offered under the Codex branding.
The deprecation reflected a broader shift in OpenAI's strategy. Rather than maintaining separate specialized models for code, the company incorporated code generation capabilities into its general-purpose models. GPT-3.5 Turbo and GPT-4 could both generate code at or above the level of the original Codex models while also handling natural language tasks, making the dedicated Codex models redundant.
To appreciate the significance of Codex's capabilities at the time of its release, it is useful to compare its HumanEval results with those of subsequent models. The rapid improvement in code generation performance across the industry illustrates both the impact of Codex as a proving ground and the pace of progress that followed.
| Model | Year | HumanEval pass@1 |
|---|---|---|
| GPT-3 (175B) | 2020 | 0.0% |
| Codex (12B) | 2021 | 28.8% |
| Codex-S (12B) | 2021 | 37.7% |
| code-davinci-002 | 2022 | ~47% |
| GPT-3.5 Turbo | 2023 | ~48% |
| GPT-4 | 2023 | 67.0% |
| GPT-4 Turbo | 2024 | ~79% |
| Claude 3.5 Sonnet | 2024 | 92.0% |
| GPT-4o | 2024 | ~90% |
This progression shows that Codex's initial 28.8% pass@1 was a breakthrough that opened the door, but the field advanced rapidly. Within three years of Codex's release, multiple models exceeded 90% on the same benchmark, a level of performance that would have seemed implausible in 2021.
In May 2025, OpenAI reintroduced the "Codex" name for an entirely new product: an agentic software engineering tool integrated into the ChatGPT interface [9]. The 2025 Codex is not a continuation of the 2021 model; it is a cloud-based coding agent powered by codex-1, a version of OpenAI's o3 reasoning model optimized for software engineering tasks.
The codex-1 model that powered the initial Codex agent launch was not simply o3 with a new label. OpenAI trained codex-1 using reinforcement learning on real-world coding tasks across a variety of environments. The training objective focused on generating code that closely mirrors human style and pull request conventions, follows instructions precisely, and iteratively runs tests until it achieves passing results [9]. This reinforcement learning approach differentiated codex-1 from earlier code models, which relied primarily on next-token prediction during training.
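The "iteratively runs tests until it achieves passing results" behavior can be caricatured as a retry loop. The callables here are toy stand-ins for the model, the repository, and the test harness; this is not OpenAI's training or inference code:

```python
def solve_task(generate_patch, apply_patch, run_tests, max_attempts=8):
    """Toy agent loop: propose a patch, run the tests, feed failures back.

    generate_patch(feedback) -> patch   (stand-in for the model)
    apply_patch(patch)                  (stand-in for editing the repo)
    run_tests() -> (passed, feedback)   (stand-in for the test harness)
    """
    feedback = None
    for _ in range(max_attempts):
        patch = generate_patch(feedback)  # failures inform the next attempt
        apply_patch(patch)
        passed, feedback = run_tests()
        if passed:
            return patch  # stop as soon as the suite is green
    return None  # attempt budget exhausted
```

With stubs that fail once and then pass, the loop returns the second patch; the reinforcement learning described above rewards exactly this kind of test-driven convergence.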
On the SWE-bench Verified benchmark (a test suite of real GitHub issues from popular open-source repositories), codex-1 scored 72.1% at pass@1, surpassing the base o3 model's 71.7% score. At pass@8 (where the agent attempts a problem up to eight times), codex-1 reached 83.86% accuracy [11]. OpenAI tested codex-1 at a maximum context length of 192,000 tokens with medium reasoning effort, and excluded 23 SWE-bench Verified samples that were not runnable on their internal infrastructure [9].
Alongside codex-1, OpenAI released a smaller companion model called codex-mini (available in the API as codex-mini-latest). This model was a version of o4-mini specifically optimized for use in the Codex CLI. It was designed for faster workflows, low-latency code Q&A, and editing tasks, while retaining the same strengths in instruction following and coding style as the full codex-1 model [12]. At launch, codex-mini-latest was priced at $1.50 per million input tokens and $6.00 per million output tokens through the API, with a 75% discount for cached prompts.
On November 17, 2025, OpenAI notified developers that the codex-mini-latest model would be deprecated and removed from the API on February 12, 2026, as it was superseded by newer GPT-5 Codex family models [13].
The 2025 Codex agent operates through a sandboxed, cloud-based architecture designed for both security and reliability. When a user assigns a task to Codex, the system performs the following steps:

1. Spins up an isolated container preloaded with the target repository and its configured development environment
2. Reads any AGENTS.md files and relevant source code to build context for the task
3. Edits files and runs commands, including test suites, linters, and type checkers, iterating until the change works
4. Presents the resulting diff along with citations of terminal logs and test outputs, so the user can verify the work before opening a pull request
This architecture allows multiple tasks to run in parallel, since each operates in its own isolated container. Users can assign several issues at once and receive results independently as each completes.
To help the Codex agent understand project-specific conventions, OpenAI introduced the AGENTS.md file format. Placed in a repository's root directory (or in subdirectories for more specific instructions), AGENTS.md serves as a set of instructions that the agent reads before starting any work. It typically includes coding standards, build and test commands, linting rules, naming conventions, and architectural guidance. Codex reads AGENTS.md files hierarchically: it checks a global configuration directory first, then walks from the project root down to the current working directory, merging instructions as it goes [14].
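A minimal AGENTS.md might look like the following; the commands and rules are illustrative, not drawn from any particular repository:

```markdown
# AGENTS.md

## Build and test
- Install dependencies with `pip install -e ".[dev]"`
- Run `pytest -q` and make sure the suite is green before finishing

## Conventions
- Follow PEP 8; run `ruff check .` and fix any warnings
- New public functions need type hints and docstrings

## Boundaries
- Never edit files under `vendor/` or regenerate lockfiles by hand
```

Because files are merged hierarchically, a subdirectory's AGENTS.md can tighten or override these rules for just that part of the codebase.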
The Codex agent launched as a research preview on May 16, 2025, initially available to ChatGPT Pro and Team subscribers [9]. On June 3, 2025, OpenAI expanded access to ChatGPT Plus users and enabled internet access during task execution (which had been disabled in the initial preview) [13].
On April 16, 2025, approximately one month before the cloud-based Codex agent, OpenAI released the Codex CLI as an open-source project [12]. The Codex CLI is a lightweight coding agent that runs directly in a developer's terminal rather than in the cloud; initially written in TypeScript, it was later rewritten in Rust for speed and efficiency.
The CLI can be installed via npm (npm i -g @openai/codex) or Homebrew (brew install --cask codex). It reads and edits files in the user's local directory, executes shell commands, and uses the codex-mini model (later upgraded to GPT-5 Codex family models) through the OpenAI API. Unlike the cloud-based Codex agent, the CLI operates on the developer's own machine, giving full access to local tools, environment variables, and file systems.
The Codex CLI supports multiple approval modes that control how much autonomy the agent has. In the most restrictive mode, the user must approve every file edit and command execution. In the most permissive mode, the agent can make changes and run commands freely. This granular control addresses concerns about autonomous code agents modifying files without oversight [12].
After the initial launch with codex-1, OpenAI rapidly iterated on the underlying models powering the Codex agent. Each new model brought improvements in coding capability, speed, context handling, and reliability.
On September 23, 2025, OpenAI released GPT-5-Codex, a version of GPT-5 further optimized for agentic coding in Codex. This model was trained with a focus on real-world software engineering work and was described as equally proficient at quick interactive sessions and at independently handling long, complex tasks. GPT-5-Codex became the default model for cloud-based Codex tasks and code review. It was also made available through the Responses API at the same pricing as the base GPT-5 model ($1.25 per million input tokens, $10.00 per million output tokens) [15].
Alongside GPT-5-Codex, OpenAI introduced new features, among them Skills: reusable, named instruction sets that users can invoke explicitly by typing `$skill-name`, or let Codex select the appropriate skill automatically based on the prompt [15].

On November 19, 2025, OpenAI released GPT-5.1-Codex-Max, the first model natively trained to operate across multiple context windows through a process called compaction. Compaction allows the model to prune its conversation history while preserving the most important context, enabling it to work coherently over millions of tokens in a single task. This was particularly valuable for complex refactors, large codebase migrations, and long-running agent loops that would have previously failed due to context window limits [16].
GPT-5.1-Codex-Max also became the first Codex model trained to operate in Windows environments. On SWE-bench Verified, it scored 77.9% (evaluated at extra-high reasoning effort), compared to 73.7% for the standard GPT-5.1-Codex model. On Terminal-Bench 2.0, it achieved 58.1%. The model used 30% fewer thinking tokens than GPT-5.1-Codex at the same reasoning effort level while delivering better performance [16].
Released on December 18, 2025, GPT-5.2-Codex brought further improvements in long-context understanding, reliable tool calling, improved factuality, and native compaction. Key advances included stronger performance on large code changes (refactors, migrations), improved Windows environment support, and significantly stronger cybersecurity capabilities [17].
| Benchmark | GPT-5.1-Codex | GPT-5.1-Codex-Max | GPT-5.2-Codex |
|---|---|---|---|
| SWE-bench Verified | 73.7% | 77.9% | N/A |
| SWE-bench Pro | 50.8% | N/A | 56.4% |
| Terminal-Bench 2.0 | 52.8% | 58.1% | 64.0% |
| CVE-Bench | N/A | N/A | 87.0% |
On cybersecurity evaluations, GPT-5.2-Codex scored 87% on CVE-Bench (a benchmark for identifying and exploiting known vulnerabilities), 79% on Network Attack Simulation challenges, and 80% on Vulnerability Research and Exploitation challenges [17].
Released on February 5, 2026, GPT-5.3-Codex combined the frontier coding performance of GPT-5.2-Codex with the general reasoning and professional knowledge capabilities of GPT-5.2, all in a single model that was also 25% faster than its predecessor. OpenAI described it as "the most capable agentic coding model to date" [18].
GPT-5.3-Codex topped both SWE-Bench Pro and Terminal-Bench 2.0 at the time of release. It was also the first model that was instrumental in creating itself: the Codex team used early versions of GPT-5.3-Codex to debug its own training, manage its own deployment, and diagnose test results and evaluations [18].
With GPT-5.3-Codex, Codex began providing frequent interactive updates during task execution rather than only delivering a final output. Users could interact with the agent in real time: asking questions, discussing approaches, and steering the agent toward preferred solutions as it worked [18].
On February 12, 2026, OpenAI released GPT-5.3-Codex-Spark, a text-only research preview model optimized for near-instant, real-time coding iteration. Spark was designed for interactive work where latency matters as much as intelligence: making targeted edits, reshaping logic, and refining interfaces with immediate feedback [19].
Spark achieved its speed by running on the Cerebras Wafer-Scale Engine, delivering over 1,000 tokens per second. This made it OpenAI's first model running on non-NVIDIA hardware. Despite its speed focus, Spark demonstrated strong performance on SWE-Bench Pro and Terminal-Bench 2.0, completing tasks in a fraction of the time compared to GPT-5.3-Codex [19]. At launch, Spark had a 128,000-token context window and was available exclusively to ChatGPT Pro subscribers as a research preview.
On March 17, 2026, OpenAI released GPT-5.4 mini, a fast, efficient model for lighter coding tasks and subagent work. It improves over GPT-5 mini across coding, reasoning, image understanding, and tool use while running more than 2x faster. GPT-5.4 mini uses only 30% of included rate limits compared to larger models, effectively extending usage 3.3x longer for the same subscription tier [20]. GPT-5.4, the flagship frontier model, brings together the industry-leading coding capabilities of GPT-5.3-Codex with stronger reasoning, tool use, and agentic workflows [20].
On February 2, 2026, OpenAI launched the Codex app, a dedicated desktop application for macOS designed to manage multiple coding agents simultaneously. The app was later extended to Windows on March 4, 2026 [21].
The Codex app differs from the CLI and ChatGPT web interface by providing a purpose-built interface for parallel agent management: developers can launch several agent threads at once, monitor their progress side by side, and review each thread's output as it completes.
By March 2026, the Codex app supports access to all current models including GPT-5.4, GPT-5.4 mini, and GPT-5.3-Codex, with custom reasoning level configuration and model selection per thread [21].
As of March 2026, Codex is available across multiple interfaces (web, CLI, IDE extension, iOS, and the desktop app) and is bundled with ChatGPT subscriptions rather than sold as a separate product.
| Plan | Price | Codex Features |
|---|---|---|
| ChatGPT Plus | $20/month | Access to all Codex interfaces; latest models including GPT-5.4 and GPT-5.3-Codex |
| ChatGPT Pro | $200/month | Everything in Plus; priority processing; GPT-5.3-Codex-Spark access; 6x higher usage limits; 10x more cloud code reviews |
| ChatGPT Business | $30/user/month | Larger virtual machines; admin controls; SAML SSO; MFA; no training on business data |
| ChatGPT Enterprise | Contact sales | Enterprise-grade security (SCIM, EKM, RBAC); audit logs; usage monitoring |
| API Key | Token-based pricing | CLI, SDK, and IDE extension access; standard API rates |
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-5.4 | Standard API rates | Standard API rates |
| GPT-5.4 mini | Standard API rates | Standard API rates |
| GPT-5.3-Codex | Standard API rates | Standard API rates |
| GPT-5.2-Codex | $1.75 | $14.00 |
| GPT-5.1-Codex | $1.25 | $10.00 |
| GPT-5-Codex | $1.25 | $10.00 |
Codex usage within ChatGPT subscriptions is measured in credits. Average credit costs vary by model and task type:
| Task Type | GPT-5.4 | GPT-5.4 mini | GPT-5.3-Codex |
|---|---|---|---|
| Local task | ~7 credits | ~2 credits | ~5 credits |
| Cloud task | ~34 credits | N/A | ~25 credits |
Credits enable continued usage beyond included limits without requiring a plan upgrade [22].
The 2025-2026 Codex agent exists within a competitive landscape of AI-powered software engineering tools. Each product takes a different approach to agent architecture, user interaction, and pricing.
| Feature | Codex (OpenAI) | Devin (Cognition) | Claude Code (Anthropic) | Cursor (Anysphere) |
|---|---|---|---|---|
| Launch date | May 2025 | March 2024 (GA: December 2024) | February 2025 (preview); May 2025 (GA) | 2023 (Agent mode: 2025) |
| Primary interface | Web, CLI, desktop app, IDE extension | Web-based IDE | Terminal (CLI) | Desktop IDE |
| Execution environment | Cloud sandbox (isolated containers) | Cloud sandbox (own IDE, browser, terminal) | Local machine (developer's terminal) | Local machine (IDE) |
| Autonomy level | High (assigns tasks, returns PRs) | Highest (plans, codes, tests, deploys) | Medium (developer-in-the-loop) | Medium (developer-in-the-loop) |
| Parallel agents | Yes (multiple tasks simultaneously) | Yes (multiple Devins in parallel) | Yes (via subagents) | Yes (8 parallel agents in Composer) |
| Internet access during execution | Disabled by default (sandboxed) | Full access (browser, terminal) | Full access (local environment) | Full access (local environment) |
| Underlying model | GPT-5.3-Codex / GPT-5.4 | Proprietary (multi-model) | Claude Opus 4.5 | Multiple (OpenAI, Anthropic, etc.) |
| Benchmark result | 77.3% (Terminal-Bench 2.0, GPT-5.3-Codex) | N/A (proprietary benchmarks) | 80.9% (SWE-bench Verified, Opus 4.5) | N/A |
| Speed (tokens/sec) | 240+ tok/s (GPT-5.3-Codex) | Varies | Varies | Varies |
| Starting price | $20/month (ChatGPT Plus) | $20/month + $2.25/ACU | $20/month (Claude Pro) | Free (Hobby); $20/month (Pro) |
In practice, many developers use multiple tools depending on the task. A common pattern is to use Codex for high-volume, parallelizable tasks where speed matters, Claude Code for problems requiring deep reasoning, and Cursor for work that benefits from tight IDE integration [23].
By early 2026, OpenAI reported that more than one million developers were using the Codex agent weekly, with usage increasing fivefold since the start of 2025 [15]. Major enterprises adopted Codex for their development workflows, including Cisco, Virgin Atlantic, Vanta, and Duolingo.
In February 2026, Apple released Xcode 26.3 with native support for agentic coding, integrating both OpenAI's Codex and Anthropic's Claude Agent directly into Apple's development environment through the Model Context Protocol (MCP). This allowed iOS and macOS developers to use Codex within Xcode for tasks like creating files, examining project structure, building projects, running tests, and accessing Apple's developer documentation [24].
The Codex agent's adoption was further bolstered by its integration with GitHub for code review, Slack for team notifications, and various CI/CD pipelines for automated workflows.
The original 2021 Codex holds a distinctive place in the history of AI and software development. Its contributions extend well beyond the model itself.
Proving that LLMs could write code. Before Codex, the idea that a neural network could reliably generate functional code from natural language descriptions was largely theoretical. Codex provided the first large-scale empirical evidence that this was possible, catalyzing an entire subfield of AI research and a new category of developer tools.
Creating the evaluation framework. The HumanEval benchmark gave the research community a shared, rigorous way to measure progress in code generation. By defining success as functional correctness rather than textual similarity, HumanEval set a higher and more meaningful bar that has driven genuine improvements in model capability.
Launching the AI coding assistant market. GitHub Copilot, powered by Codex, was the first AI coding assistant to achieve widespread commercial adoption. Its success demonstrated that developers were willing to pay for AI tools that meaningfully improved their productivity, creating a market that has since grown to include dozens of products and billions of dollars in annual revenue.
Influencing model development strategy. Codex demonstrated that fine-tuning a general-purpose language model on domain-specific data could produce dramatic improvements in task performance. This insight influenced the development of subsequent specialized models across many domains, from scientific research to legal analysis.
Shifting the coding education conversation. The emergence of Codex and Copilot sparked widespread discussion about the future of programming education, the evolving role of software developers, and the extent to which AI would automate routine coding tasks. These conversations continue to shape how the software industry thinks about the intersection of human expertise and AI capability.
Defining the agentic coding paradigm. The 2025 relaunch of Codex as an autonomous agent marked a shift from AI as an autocomplete tool to AI as a collaborator capable of independently completing engineering tasks. The sandboxed execution model, the AGENTS.md configuration format, and the parallel task architecture introduced by Codex have influenced how the broader industry designs coding agents.
The journey from the original 12B parameter Codex model scoring 28.8% on HumanEval in 2021 to the GPT-5.3-Codex agent autonomously engineering software and helping build itself in 2026 encapsulates the extraordinary pace of progress in AI-assisted software development. What began as a research project demonstrating that language models could write code has evolved into a fundamental shift in how software is built.