GPT-5.1-Codex-Max
Last reviewed
May 31, 2026
Sources
19 citations
Review status
Source-backed
Revision
v4 · 1,913 words
GPT-5.1-Codex-Max is a frontier agentic coding model from OpenAI, released on November 19, 2025 for Codex, OpenAI's software engineering agent. [1][2] It is the first model OpenAI trained natively to work across multiple context windows, using a technique it calls compaction, which lets the model keep working coherently over millions of tokens and several hours in a single task. [1][3] It is also the first Codex model trained to operate in Windows environments, and it adds a new reasoning effort level called xhigh for more thorough responses. [1][4]
The model sits at the top of the GPT-5.1 coding lineup. OpenAI shipped GPT-5.1-Codex alongside GPT-5.1 in mid-November 2025, then released GPT-5.1-Codex-Max about a week later as a higher-capability tier aimed at long, project-scale work rather than quick edits. [1][5] It descends from the earlier GPT-5-Codex, which itself was a coding-tuned variant of GPT-5. [6] OpenAI later replaced it with GPT-5.2-Codex and then GPT-5.3-Codex, so GPT-5.1-Codex-Max occupied the frontier slot for only a short window. [7]
GPT-5.1-Codex-Max is a version of OpenAI's foundational reasoning model that was further trained on agentic tasks across software engineering, math, and research. [1] Like other coding agents built on large language models, it does not just autocomplete code. It plans, runs commands, reads test output, edits files, and iterates inside a harness until a task is done. OpenAI built it specifically for long-running agent work, where a single request might touch dozens of files or span an entire afternoon. [1][3]
OpenAI is explicit that this is a specialized model. The system card says GPT-5.1-Codex-Max is optimized for agentic coding in Codex or Codex-like environments, and OpenAI recommends using it only for those workflows rather than as a general-purpose chat model. [8] In practice you reach it through a coding product (the Codex CLI, an IDE extension, Codex Cloud, or code review), not through the regular ChatGPT model picker for everyday questions. [1][8]
The headline feature is compaction. Every model has a context window, a hard limit on how much text it can hold in working memory at once. On a long task an agent fills that window with its own history: the files it read, the commands it ran, the errors it hit. Once the window is full, older work normally has to be dropped, and the agent loses the thread.
GPT-5.1-Codex-Max handles this differently. As it nears the context limit, it compacts its own session, pruning the history down to the parts that still matter and carrying that forward into a fresh window. [1][3] Then it keeps going, and it repeats the cycle as many times as the task needs. Because the model was trained to do this rather than having it bolted on by the surrounding software, OpenAI describes it as the first model trained to operate natively across multiple context windows, coherently working over millions of tokens in a single task. [1][2]
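The compaction cycle can be pictured with a toy harness. The sketch below is not OpenAI's implementation, which is trained into the model and not published as code. Every name in it (`Session`, `CONTEXT_LIMIT`, `COMPACT_THRESHOLD`, the word-count tokenizer, the placeholder summary string) is a hypothetical stand-in chosen only to show the shape of the loop: fill the window, prune older history into a summary, keep going.

```python
from dataclasses import dataclass, field

CONTEXT_LIMIT = 1_000     # hypothetical window size, in tokens
COMPACT_THRESHOLD = 0.8   # hypothetical: compact once the window is 80% full


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class Session:
    history: list[str] = field(default_factory=list)
    compactions: int = 0

    def tokens_used(self) -> int:
        return sum(count_tokens(entry) for entry in self.history)

    def append(self, entry: str) -> None:
        """Record one step, compacting first-in history if the window is nearly full."""
        self.history.append(entry)
        if self.tokens_used() >= CONTEXT_LIMIT * COMPACT_THRESHOLD:
            self.compact()

    def compact(self, keep_recent: int = 3) -> None:
        """Replace older entries with a placeholder summary and keep the newest ones."""
        old, recent = self.history[:-keep_recent], self.history[-keep_recent:]
        if not old:
            return
        # A real system would produce a meaningful summary here; this is a placeholder.
        summary = f"[summary of {len(old)} earlier steps]"
        self.history = [summary] + recent
        self.compactions += 1


session = Session()
for step in range(50):
    session.append("step " + "word " * 48)
```

After the loop the session has compacted several times and still fits inside the hypothetical window, which is the property the article describes: the task outlives any single context window.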
This is what lets the model run for a long time. OpenAI says compaction allows GPT-5.1-Codex-Max to finish work that earlier models would have abandoned once they ran out of context, such as large refactors and long-running agent loops. [1][9] In internal testing OpenAI observed the model working independently for more than 24 hours on a complex engineering task, iterating on its implementation and fixing test failures until it delivered a result. [1][10]
It is worth keeping the claim in proportion. Compaction as a general idea is not new. Coding agents like Claude Code already summarize earlier turns when context runs low, a point Simon Willison made when the model launched, noting the confusion on Hacker News about what was actually novel here. [3] What is new is that OpenAI trained the model itself to manage the boundary between windows, instead of leaving that entirely to the harness.
GPT-5.1-Codex-Max uses the same reasoning effort control that arrived with GPT-5.1, tuned for coding agents. [1] Reasoning effort sets how many thinking tokens the model spends before it commits to an answer. Lower settings are faster and cheaper and suit latency-sensitive work. Higher settings trade speed for thoroughness.
The model adds a new top setting called xhigh, short for extra high, for cases where you want the most thorough response and do not mind waiting longer. [1][4] On the SWE-bench Verified benchmark, xhigh edges out the high setting, so the extra thinking does buy a small accuracy gain. [9] OpenAI's general guidance is to use the standard levels for interactive use and reserve xhigh for hard, open-ended problems. [1]
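As a rough illustration of how an effort level might be selected in a request, the sketch below builds a request payload without sending it. The field names (`model`, `input`, `reasoning.effort`) follow the general shape of OpenAI's Responses API as commonly documented, but the model identifier string, the exact set of accepted effort values, and the helper `build_request` are assumptions here and should be checked against the current API reference.

```python
# Assumed set of levels: the article names medium, high, and xhigh; "low" is
# an assumption standing in for the "lower settings" it mentions.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh")


def build_request(prompt: str, effort: str = "medium") -> dict:
    """Return a request payload (hypothetical helper); nothing is sent over the network."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "gpt-5.1-codex-max",      # assumed model identifier
        "input": prompt,
        "reasoning": {"effort": effort},   # assumed field shape
    }


interactive = build_request("Rename this function across the module.")
deep = build_request("Find the root cause of the flaky integration test.", effort="xhigh")
```

Defaulting to a standard level and opting in to `xhigh` per request mirrors the guidance above: the slower setting is reserved for the problems that justify the wait.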
OpenAI reported the following results, comparing GPT-5.1-Codex-Max against its sibling GPT-5.1-Codex. The SWE-bench Verified numbers are on a 500-issue subset. The GPT-5.1-Codex column is at high reasoning effort, and the GPT-5.1-Codex-Max numbers for SWE-Lancer and Terminal-Bench are at xhigh. [1][9][11]
| Benchmark | GPT-5.1-Codex (high) | GPT-5.1-Codex-Max (high) | GPT-5.1-Codex-Max (xhigh) |
|---|---|---|---|
| SWE-bench Verified (n=500) | 73.7% | 76.5% | 77.9% |
| SWE-Lancer IC SWE | 66.3% | -- | 79.9% |
| Terminal-Bench 2.0 | 52.8% | -- | 58.1% |
SWE-bench Verified measures whether a model can resolve real GitHub issues by producing a patch that passes the repository's tests. SWE-Lancer scores models on real freelance software tasks drawn from Upwork, and Terminal-Bench tests agentic work in a command line environment. [9][11] Across all three, GPT-5.1-Codex-Max comes out ahead of GPT-5.1-Codex, with the largest gap on SWE-Lancer. [1][9]
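The size of each gap can be checked with simple arithmetic on the table. The snippet below copies the reported scores and computes the percentage-point gain of GPT-5.1-Codex-Max over GPT-5.1-Codex; the dictionary layout and the helper name `point_gain` are illustrative choices, not anything OpenAI publishes.

```python
# Scores copied from the table above; None marks cells reported as "--".
SCORES = {
    "SWE-bench Verified": {"codex_high": 73.7, "max_high": 76.5, "max_xhigh": 77.9},
    "SWE-Lancer IC SWE": {"codex_high": 66.3, "max_high": None, "max_xhigh": 79.9},
    "Terminal-Bench 2.0": {"codex_high": 52.8, "max_high": None, "max_xhigh": 58.1},
}


def point_gain(benchmark: str, column: str = "max_xhigh"):
    """Percentage-point gain over GPT-5.1-Codex (high), or None if not reported."""
    row = SCORES[benchmark]
    if row[column] is None:
        return None
    return round(row[column] - row["codex_high"], 1)


gains = {name: point_gain(name) for name in SCORES}
# gains == {"SWE-bench Verified": 4.2, "SWE-Lancer IC SWE": 13.6, "Terminal-Bench 2.0": 5.3}
largest = max(gains, key=gains.get)
```

On these numbers the xhigh gains are 4.2, 13.6, and 5.3 points, and at matched high effort the SWE-bench Verified gain is 2.8 points.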
As always with vendor-reported numbers, the scores come from OpenAI's own harness and reasoning settings, so they describe the model under conditions OpenAI chose. They are still useful for tracking progress against the previous Codex model on the same setup.
Long agent runs are expensive, because every step adds tokens, and tokens cost money and time. OpenAI's pitch is that GPT-5.1-Codex-Max is not just more capable but more efficient. At medium reasoning effort it used about 30% fewer thinking tokens than GPT-5.1-Codex while matching or beating its accuracy. [1][9] Fewer thinking tokens means lower cost and lower latency for the same quality of result.
The API card lists a 400,000 token context window and up to 128,000 output tokens, with a knowledge cutoff of September 30, 2024. [12][13] API pricing is $1.25 per million input tokens and $10.00 per million output tokens, with cached input billed at $0.125 per million tokens. [12][13]
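To see how these rates add up over a long session, the sketch below estimates cost from token counts. It assumes that cached tokens are a subset of the input total and that thinking tokens are billed at the output rate; both are assumptions to confirm against current billing documentation. The token counts in the example are made up for illustration.

```python
# Rates in US dollars per million tokens, as listed above.
PRICE_PER_MILLION = {"input": 1.25, "cached_input": 0.125, "output": 10.00}


def estimate_cost(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimate cost in dollars; cached tokens are assumed to be part of input_tokens."""
    uncached = input_tokens - cached_input_tokens
    if uncached < 0:
        raise ValueError("cached_input_tokens cannot exceed input_tokens")
    total = (
        uncached * PRICE_PER_MILLION["input"]
        + cached_input_tokens * PRICE_PER_MILLION["cached_input"]
        + output_tokens * PRICE_PER_MILLION["output"]
    ) / 1_000_000
    return round(total, 4)


# Hypothetical session totals: 2M input tokens (1.5M of them cached), 200k output tokens.
baseline = estimate_cost(2_000_000, 200_000, cached_input_tokens=1_500_000)
# Same session with 30% fewer output tokens, as a rough proxy for reduced thinking.
leaner = estimate_cost(2_000_000, 140_000, cached_input_tokens=1_500_000)
```

In this made-up example the output side dominates the bill, which is why a reduction in thinking tokens matters more than the raw input price would suggest.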
At launch GPT-5.1-Codex-Max became the default model in Codex across its surfaces: the Codex CLI, the IDE extension, the cloud agent, and code review. [1][8] It was available to ChatGPT Plus, Pro, Business, Edu, and Enterprise users, with API access for developers following shortly after. [1][12]
The model also reached GitHub Copilot. GitHub put GPT-5.1-Codex-Max into public preview for Copilot on December 4, 2025, then made it generally available to Copilot Enterprise, Business, Pro, and Pro+ on December 17, 2025, accessible through the chat model picker in Visual Studio Code, Visual Studio, JetBrains IDEs, Xcode, Eclipse, GitHub Mobile, and the web. [14][15] As newer Codex models shipped, GPT-5.1-Codex-Max was deprecated across Copilot in early April 2026. [16]
OpenAI published a system card for GPT-5.1-Codex-Max dated November 18, 2025. [11] Under OpenAI's Preparedness Framework, OpenAI says the model is very capable in cybersecurity but does not reach the High capability threshold in that domain, while noting it expects models to cross the High cyber threshold in the near future. [8][17] As with other recent OpenAI models, it is treated as High capability in biology and deployed with the same suite of safeguards used for GPT-5. [8][17] OpenAI also says the model was trained to ignore prompt injection attacks in the coding environment, following its instruction hierarchy, and that the Codex sandbox disables network access and restricts file edits to the current workspace by default, with users able to widen those permissions if they choose. [8]
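The "edits restricted to the current workspace" default can be illustrated as a path policy. The check below is only a sketch of that policy; it is not how the Codex sandbox is implemented, and the function name `is_edit_allowed` is hypothetical.

```python
from pathlib import Path


def is_edit_allowed(target: str, workspace: str) -> bool:
    """Return True only if target resolves to a path inside the workspace directory."""
    root = Path(workspace).resolve()
    # Joining with an absolute target discards root, so absolute paths are checked as-is.
    candidate = (root / target).resolve()
    return root in candidate.parents


inside = is_edit_allowed("src/main.py", "/tmp/project")
escaped = is_edit_allowed("../secrets.txt", "/tmp/project")
absolute = is_edit_allowed("/etc/passwd", "/tmp/project")
```

Resolving both paths before comparing them is the important design choice: it normalizes `..` segments and symlinks so that a relative path cannot step outside the workspace unnoticed.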
The independent research group METR ran a pre-deployment evaluation over roughly two weeks in early November 2025, with access to multiple model checkpoints and, for some tests, visible reasoning traces. [18] METR estimated a 50% time horizon of about 2 hours 40 minutes, meaning the length of task at which the model succeeds about half the time, with a 95% confidence interval of 75 minutes to 5 hours 50 minutes. [18][19] At an 80% success rate the time horizon dropped to about 30 minutes. [18] The 50% figure was the longest point estimate METR had published to that point, up from roughly 2 hours 17 minutes for GPT-5, and METR described it as on-trend, landing between its longer seven-month and more recent four-month doubling-time projections. [18][19] METR concluded that the overall risk of AI takeover scenarios from on-trend agents in the following six months was very low. [18]
For safety, OpenAI points to model training, scaled-up monitoring, and account-level enforcement against misuse, and it recommends sandboxing and human oversight for autonomous, long-running tasks. [8] Because the model uses tools and can act on its own for long stretches, prompt injection and unintended actions remain live risks that the sandbox defaults and oversight are meant to contain. [8]
GPT-5.1-Codex-Max is narrow by design. OpenAI states the model is not intended for deployment in any general-purpose chat application and will not be deployed by OpenAI in any such context, so it is not the right tool for open-ended chat or non-coding work. [8] The benchmark numbers come from OpenAI's own harness and settings, which limits how directly they compare to other vendors' figures. The same long-horizon skill that helps with large refactors also raises offensive security capability, which is why OpenAI watches the cyber threshold closely even though the model has not crossed it. [8][17] And like every model that runs unattended for hours, it carries the practical risks of any autonomous agent, including prompt injection and unintended actions, which is why OpenAI stresses oversight rather than full automation. [8]
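To make the doubling-time framing concrete, the sketch below applies plain exponential growth to the 2 hour 40 minute point estimate. This is illustrative arithmetic only, not METR's methodology or a METR forecast; the function name and the six-month look-ahead are assumptions chosen for the example.

```python
def projected_horizon(current_minutes: float, months_ahead: float, doubling_months: float) -> float:
    """Extrapolate a time horizon assuming it doubles every doubling_months months."""
    return current_minutes * 2 ** (months_ahead / doubling_months)


CURRENT_MINUTES = 160  # about 2 hours 40 minutes, the 50% point estimate above

slow = projected_horizon(CURRENT_MINUTES, months_ahead=6, doubling_months=7)
fast = projected_horizon(CURRENT_MINUTES, months_ahead=6, doubling_months=4)
```

Under these assumptions the six-month projection lands at roughly 4.8 hours with a seven-month doubling time and roughly 7.5 hours with a four-month doubling time, which shows how sensitive such extrapolations are to the assumed trend.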