GPT-5.2 is a large language model developed by OpenAI and released on December 11, 2025. It is the second major update to the GPT-5 series, succeeding GPT-5.1, which had been released about a month earlier. GPT-5.2 introduced a three-tier structure consisting of GPT-5.2 Instant, GPT-5.2 Thinking, and GPT-5.2 Pro, each targeting a different workload profile, from low-latency ChatGPT interactions to extended agentic reasoning sessions. A specialized coding variant, GPT-5.2-Codex, followed on December 18, 2025.
The release was framed publicly as a competitive response to Google's Gemini 3 Pro and Anthropic's Claude Opus 4.5, launched against the backdrop of an internal OpenAI "code red" memo from CEO Sam Altman warning of declining ChatGPT traffic and lost market share to Google. GPT-5.2 delivered its most dramatic gain on abstract reasoning, roughly tripling GPT-5.1's score on ARC-AGI-2 from 17.6% to 52.9%, and set a new state of the art on SWE-bench Pro for software engineering. The release also addressed safety concerns around mental health interactions, advanced the knowledge cutoff from September 2024 to August 2025, and introduced new developer tools for agentic workflows.
Initial reception was mixed. Enterprise users and coding platforms reported meaningful capability gains and several large software vendors integrated the model on day one, while consumer users and some developers criticized Thinking mode's slow token generation, the 40% price increase relative to GPT-5.1, and a perceived gap between benchmark performance and practical usability.
The GPT-5 series is OpenAI's fifth generation of generative pre-trained transformer models, the family that has underpinned ChatGPT and the OpenAI API since 2020. GPT-5 was released in August 2025 and introduced a unified model architecture that surpassed prior generations such as GPT-4, GPT-4.1, and GPT-4o across most evaluated benchmarks. It established the 400,000-token context window, 128,000-token output capacity, and multimodal text-and-vision architecture that subsequent point releases carried forward.
GPT-5.1 was released on November 12, 2025 and brought incremental improvements to coding reliability and reduced hallucination rates. Its standard pricing was set at $1.25 per million input tokens and $10.00 per million output tokens. GPT-5.1's performance on abstract reasoning tasks lagged competing models. Its ARC-AGI-2 score of 17.6% in Thinking mode was substantially below what evaluators considered necessary for general-purpose reasoning claims, and its knowledge cutoff of September 30, 2024 was over a year old by the time GPT-5.2 launched. GPT-5.1's Instant variant (gpt-5.1-chat-latest) had a 128,000-token context window, while Thinking and Pro used the full 400K context.
GPT-5.2 was internally code-named "Garlic" during development. TechCrunch reported in December 2025 that an internal memo from Sam Altman issued earlier in the month had warned of declining ChatGPT traffic and competitive pressure from Google, framing the December release as a strategic priority. The memo, characterized in press coverage as a "code red," reportedly shifted internal priorities away from advertising features toward improving the core ChatGPT experience.
Google's Gemini 3 Pro had taken the top position on LMArena's text leaderboard across most general benchmarks outside coding by mid-November 2025, and had integrated tightly into Google Cloud products through managed Model Context Protocol (MCP) servers that exposed services like Maps and BigQuery to AI agents. Anthropic's Claude Opus 4.5 had achieved a narrow lead on SWE-bench Verified for software engineering tasks. The competitive backdrop pushed OpenAI to ship GPT-5.2 less than four weeks after GPT-5.1, an unusually short cadence for a major model update.
Fidji Simo, OpenAI's CEO of applications, told reporters that the company had "been working on this model's release for months," while acknowledging that the code red and additional resources allocated to ChatGPT had been "helpful" in finalizing the deployment. Aidan Clark, OpenAI's vice president of research (training), described GPT-5.2 as targeting "everyday professional work, long-running agents, and science workloads" during the announcement, but declined to detail the training methods used to improve performance over GPT-5.1. Simo also cited concrete improvements in spreadsheet creation, presentation building, code writing, and multi-step project execution.
Some internal tension accompanied the release. TechCrunch and the Wall Street Journal reported that certain OpenAI employees had requested a delay for further development time, a claim OpenAI did not address publicly.
GPT-5.2 launched on December 11, 2025 through both the Responses API and the Chat Completions API. Initial rollout began with paid ChatGPT plans (Plus, Pro, Go, Business, Enterprise), with free-tier users receiving access at lower message limits shortly after. The Instant variant appeared in ChatGPT as the default for paid users, while Thinking mode was accessible through an explicit reasoning effort selector on the same interface.
OpenAI published an updated GPT-5 System Card alongside the launch covering safety evaluations including mental health benchmarks, hallucination metrics, and cybersecurity assessments. A separate GPT-5.2 Prompting Guide was published for developers, emphasizing structured tool calls, persistent instructions via preambles, and patterns for managing the new context-compaction endpoint.
GPT-5.2-Codex launched on December 18, 2025 and was available immediately to all paid ChatGPT users across Codex CLI, IDE extensions for Visual Studio Code and JetBrains, the ChatGPT web and mobile interfaces, and GitHub code review integrations. OpenAI stated that API access would follow in subsequent weeks, with an invite-only security pilot for vetted professionals running in parallel.
GPT-5.1 was not immediately deprecated at launch. OpenAI stated it would remain available for approximately three months to allow developer migration time. The dated snapshot variant gpt-5.2-2025-12-11 was made available for researchers requiring reproducible results.
GPT-5.2 launched in four configurations targeting distinct workload profiles.
| Variant | API model ID | Primary use case | Reasoning |
|---|---|---|---|
| GPT-5.2 Instant | gpt-5.2-chat-latest | Low-latency everyday tasks | None |
| GPT-5.2 Thinking | gpt-5.2 | Complex reasoning, coding, analysis | Adjustable (none / low / medium / high / xhigh) |
| GPT-5.2 Pro | gpt-5.2-pro | Maximum accuracy, up to 30-minute tasks | Extended (Responses API only) |
| GPT-5.2 Dated snapshot | gpt-5.2-2025-12-11 | Reproducible research snapshot | Mirrors Thinking |
The Instant variant uses a 128,000-token context window with 16,384 maximum output tokens, mirroring the prior generation's chat-tier configuration. Thinking and Pro use the full 400,000-token context window with up to 128,000 output tokens. Pro is accessible through the Responses API only and can sustain task sessions lasting up to 30 minutes, targeting enterprise agentic workflows that require extended autonomous processing without operator intervention.
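The Pro tier is reachable only through the Responses API, and a minimal call looks like the sketch below. The model ID comes from the variant table above; the prompt is illustrative, and the use of background mode for long-running sessions is an assumption rather than a documented requirement.

```python
# Minimal sketch: invoking GPT-5.2 Pro via the Responses API for a long task.
# Model ID taken from the variant table; background mode is an assumed pattern
# for sessions that may run up to the 30-minute limit.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.2-pro",
    input="Audit this repository's dependency graph and propose an upgrade plan.",
    background=True,  # assumed: poll the response ID rather than blocking
)
print(response.id, response.status)
```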
GPT-5.2-Codex carries a separate model ID (gpt-5-2-codex in API parlance) with the same 400,000-token input and 128,000-token output specification as Thinking, and adds native context compaction tailored to long coding sessions. A gpt-5.2-search model surfaced on LMArena's Search leaderboard in mid-December and powers the search-grounded ChatGPT experience.
OpenAI's variant structure reflects what some commentators have described as a "latency arbitrage" philosophy: simple tasks route to Instant for fast, cheap responses, while Thinking and Pro are reserved for tasks where additional inference compute genuinely improves outcomes. This tiering allows the same underlying model family to serve both consumer chat and enterprise agentic pipelines without forcing all queries through the most expensive route.
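As an illustration of that routing idea, a caller might dispatch requests between tiers with a heuristic like the sketch below; the keyword heuristic and decision rule are placeholders rather than OpenAI's actual routing logic, and only the model IDs come from the table above.

```python
# Illustrative tier routing: cheap, fast Instant for simple queries,
# Thinking for work that benefits from extra inference compute.
# The keyword heuristic is a placeholder, not OpenAI's routing logic.
def pick_model(prompt: str, needs_tools: bool = False) -> str:
    complex_markers = ("refactor", "prove", "analyze", "migrate", "plan")
    if needs_tools or any(marker in prompt.lower() for marker in complex_markers):
        return "gpt-5.2"             # Thinking tier (adjustable reasoning effort)
    return "gpt-5.2-chat-latest"     # Instant tier (low latency, no reasoning)


print(pick_model("What's the capital of France?"))          # gpt-5.2-chat-latest
print(pick_model("Plan a migration of our billing code"))   # gpt-5.2
```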
| Specification | GPT-5.2 Thinking / Pro | GPT-5.2 Instant | GPT-5.2-Codex |
|---|---|---|---|
| Context window | 400,000 tokens | 128,000 tokens | 400,000 tokens |
| Max output tokens | 128,000 | 16,384 | 128,000 |
| Knowledge cutoff | August 31, 2025 | August 31, 2025 | August 31, 2025 |
| Multimodal | Yes (text + vision) | Yes (text + vision) | Yes (text + vision) |
| Architecture | Transformer (proprietary) | Transformer (proprietary) | Transformer (proprietary) |
| Reasoning modes | none, low, medium, high, xhigh | None | Extended |
| Audio support | No | No | No |
| Image generation | No | No | No |
| Response compaction | Yes (via /responses/compact) | No | Yes (native) |
| Tool calling | Yes | Yes | Yes |
| Function calling | Yes | Yes | Yes |
The knowledge cutoff advanced from September 30, 2024 (GPT-5.1) to August 31, 2025, an 11-month refresh that substantially updated the model's factual knowledge base. Vision capabilities improved over GPT-5.1, with OpenAI reporting roughly half the previous error rate on chart reasoning tasks and noticeably better performance on interface understanding tasks such as reading UI screenshots. Simon Willison's hands-on review noted successful OCR runs and the model's ability to draw a recognizable pelican on demand, an informal benchmark he had used across previous releases.
A new server-side /responses/compact endpoint was introduced for the Thinking and Pro variants to handle workflows that push against the 400,000-token context limit. The endpoint performs a loss-aware compression pass over prior conversation state, returning encrypted tokens that preserve task-relevant information while reducing footprint. This mechanism allows the model to continue reasoning across extended, tool-heavy sessions without losing context. GPT-5.2-Codex handles the same compaction natively without requiring an explicit API call.
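How a client might call the compaction endpoint can be sketched over raw HTTP; only the /responses/compact path comes from the description above, while the request body, the previous_response_id field, and the response shape are assumptions for illustration.

```python
# Hedged sketch: compacting a long-running session before it approaches the
# 400K-token context limit. The endpoint path follows the text above; the
# payload fields and the shape of the returned object are assumptions.
import os
import requests

API_BASE = "https://api.openai.com/v1"
headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

resp = requests.post(
    f"{API_BASE}/responses/compact",
    headers=headers,
    json={
        "model": "gpt-5.2",
        "previous_response_id": "resp_abc123",  # hypothetical session to compress
    },
    timeout=60,
)
compacted = resp.json()
# The compacted state would then replace the full history in subsequent
# responses.create(...) calls, keeping the session under the context limit.
print(compacted.get("id"), compacted.get("usage"))
```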
Additional developer tools introduced at launch included an apply_patch tool for producing structured file diffs rather than full file rewrites (which reduces output token consumption during code editing), a local_shell tool for executing shell commands in sandboxed environments, and support for preambles, a mechanism for injecting persistent instructions that survive context compaction in long-running agent sessions.
OpenAI published benchmark scores alongside the December 11 announcement. Third-party measurements from Vellum, Vals.ai, Artificial Analysis, and LMArena provided independent corroboration on selected benchmarks, though some discrepancies exist between vendor-reported and independently measured numbers. Vals.ai's measurement of SWE-bench Verified at 75.4% was notably lower than OpenAI's stated 80.0%.
| Benchmark | GPT-5.2 Thinking | GPT-5.2 Pro | GPT-5.1 |
|---|---|---|---|
| AIME 2025 (math) | 100% | 100% | ~80% |
| GPQA Diamond (grad-level science) | 92.4% | 93.2% | ~82% |
| ARC-AGI-2 (abstract reasoning) | 52.9% | 54.2% | 17.6% |
| ARC-AGI-1 | ~88% | 90.5% | ~65% |
| SWE-bench Verified (coding) | 80.0% | 80.0% | ~68% |
| SWE-bench Pro (coding) | 55.6% | 55.6% | ~35% |
| FrontierMath (Tiers 1-3) | 40.3% | 40.3% | 31.0% |
| GDPval (professional knowledge work) | 70.9% win-or-tie vs experts | 70.9% | 38.8% |
| MMMU-Pro (multimodal understanding) | 86.5% | 86.5% | ~78% |
| Video-MMMU | 90.5% | 90.5% | ~82% |
| Tau-bench Telecom (Tau2) | 94.5% | 94.5% | ~88% |
| Humanity's Last Exam | 34.5% | 36.6% | ~22% |
| CharXiv with Python | 88.7% | 88.7% | ~79% |
| ScreenSpot Pro (UI understanding) | 86.3% | 86.3% | 64.2% |
| MRCRv2 (4-needle, 256K tokens) | 98% | 98% | ~92% |
| MRCRv2 (8-needle, 128K tokens) | 85% | 85% | ~75% |
The ARC-AGI-2 result drew the most attention from researchers. ARC-AGI-2 is the second iteration of the Abstraction and Reasoning Corpus designed by François Chollet to resist pattern memorization and test novel problem-solving. GPT-5.2's jump from 17.6% to 52.9% was the largest single-generation improvement on the test since the benchmark's introduction. The Introl analysis noted that GPT-5.2 was the first commercially released model to cross 50% on ARC-AGI-2, positioning it as a potential inflection point in inference demand for reasoning-capable systems. GPT-5.2 Pro's 90.5% on the original ARC-AGI-1 came with roughly a 390x improvement in computational efficiency compared to the o3 (High) score from one year prior, reflecting infrastructure improvements alongside raw capability gains. The original o1 reasoning model, by comparison, had been the first OpenAI release to demonstrate test-time compute scaling on these benchmarks in late 2024.
On GDPval, a benchmark measuring performance across 44 distinct professional knowledge domains against human domain experts, GPT-5.2 Thinking achieved a 70.9% win-or-tie rate, up from 38.8% for GPT-5.1. OpenAI also reported that GPT-5.2 Thinking completed tasks at over 11 times the speed of expert professionals at less than 1% of human labor costs on the same benchmark, though those figures do not account for verification overhead.
Hallucination rates declined substantially over GPT-5.1. According to OpenAI's system card, GPT-5.2 Thinking has an average hallucination rate of 10.9%, compared with 16.8% for GPT-5 Thinking and 12.7% for GPT-5.1 Thinking. With browsing enabled, the rate dropped to 5.8%. The error rate on GDPval dropped from 8.8% to 6.2%. The system card acknowledged, however, that the GDPval error rate rose back to approximately 8.4% when reasoning effort was set to its lowest setting, meaning the headline improvement depended on using at least some reasoning compute.
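Because the reported rates depend on the reasoning effort setting, the relevant API knob is worth showing. The sketch below assumes the effort values listed in the variant table (none through xhigh); the prompt is illustrative.

```python
# Minimal sketch: requesting higher reasoning effort, which the system card
# associates with the lower hallucination figures. Effort values beyond the
# usual low/medium/high ("none", "xhigh") are taken from the variant table.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.2",
    reasoning={"effort": "high"},  # at "none", the GDPval error rate climbs back to ~8.4%
    input="Summarize the key liabilities disclosed in this 10-K filing.",
)
print(response.output_text)
```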
Third-party testers corroborated the gains with real-world data. Box CEO Aaron Levie stated the model scored "7 points better than GPT-5.1" on Box's proprietary knowledge work assessments. Data science platforms Databricks, Hex, and Triple Whale reported improved performance in agentic data science workflows. Notion, Shopify, Harvey, and Zoom also cited gains in long-horizon reasoning and tool-calling for production deployments.
On LM Arena, GPT-5.2 was added to the WebDev leaderboard on December 11, 2025 and to the broader Text leaderboard on December 18, 2025. GPT-5.2-high debuted at #2 on the WebDev leaderboard with a score of 1486, behind only Claude Opus 4.5 thinking-32k and ahead of Claude Opus 4.5 standard by three points. The standard GPT-5.2 model placed at #6 on WebDev with a score of 1399. On the Text Arena, Gemini 3 Pro retained the top position at 1492 across more than 15,000 votes through December 2025, while GPT-5.2's Text Arena standing remained preliminary with lower vote volume in the first weeks after launch. A gpt-5.2-search variant appeared separately on the Search leaderboard.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) | Cached input | Batch API input | Batch API output |
|---|---|---|---|---|---|
| GPT-5.2 Thinking / Instant | $1.75 | $14.00 | $0.175 | $0.875 | $7.00 |
| GPT-5.2 Pro | $21.00 | $168.00 | $2.10 | $10.50 | $84.00 |
| GPT-5.1 (prior generation) | $1.25 | $10.00 | $0.125 | $0.625 | $5.00 |
The standard Thinking and Instant pricing reflects a 1.4x increase over GPT-5.1. Cached inputs carry a 90% discount relative to standard input pricing, making GPT-5.2 viable for applications that process repeated or overlapping context such as long system prompts and large code repositories. Batch API pricing offers a 50% discount for non-time-sensitive workloads. GPT-5.2 Pro pricing is approximately 12x the standard tier, comparable to o1 Pro and GPT-4.5, targeting enterprise applications where maximum accuracy on difficult long-horizon tasks justifies the cost.
ChatGPT consumer plan access is structured separately. The plans below reflect the rollout configuration as of December 11, 2025.
| Plan | Monthly price | Context access | Message rate |
|---|---|---|---|
| Free | $0 | 8,000 tokens | 10 messages per 5 hours |
| Plus | $20 | 32,000 tokens | 160 messages per 3 hours |
| Go | $10 | 32,000 tokens | 160 messages per 3 hours |
| Pro | $200 | 400,000 tokens | Unlimited |
| Business / Enterprise | Custom | 400,000 tokens | Unlimited |
Pro plan subscribers received full 400K context access and no message rate limiting, making the Pro tier the practical requirement for heavy users working with large documents or long agentic sessions. Plus and Business users could manually select GPT-5.2 Thinking from the model picker with a usage limit of up to 3,000 messages per week.
The 40% price increase relative to GPT-5.1 drew criticism from some developers and consumers, particularly given complaints about practical performance gaps between benchmark scores and real-world output quality. Cost analyses circulated by Kilo.ai estimated that a project generating 10 million output tokens monthly would cost approximately $140 with GPT-5.2 Thinking, compared to $250 with Claude Opus 4.5 and $120 with Gemini 3 Pro at base rates.
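The quoted estimates reduce to per-token arithmetic, as in the sketch below; GPT-5.2's $14 per million output tokens comes from the pricing table above, while the competitor output rates are assumptions back-calculated from the quoted monthly figures.

```python
# Reproducing the quoted monthly-cost comparison from output-token rates.
# Only the GPT-5.2 rate is from the pricing table above; the Claude and
# Gemini rates are assumptions implied by the quoted $250 and $120 estimates.
OUTPUT_PRICE_PER_M = {
    "gpt-5.2-thinking": 14.00,
    "claude-opus-4.5": 25.00,   # assumed
    "gemini-3-pro": 12.00,      # assumed
}

monthly_output_tokens = 10_000_000

for model, price in OUTPUT_PRICE_PER_M.items():
    cost = monthly_output_tokens / 1_000_000 * price
    print(f"{model}: ${cost:,.0f}/month")
# gpt-5.2-thinking: $140/month; claude-opus-4.5: $250/month; gemini-3-pro: $120/month
```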
At the time of release, the primary competing frontier models were Claude Opus 4.5 from Anthropic, Gemini 3 Pro from Google DeepMind, and DeepSeek V3.2. The table below summarizes benchmark scores across all four models on shared evaluations as reported at the time of GPT-5.2's launch.
| Benchmark | GPT-5.2 Thinking | Claude Opus 4.5 | Gemini 3 Pro | DeepSeek V3.2 |
|---|---|---|---|---|
| AIME 2025 | 100% | ~94% | 95.0% | ~90% |
| GPQA Diamond | 92.4% | 87.0% | 91.9% | ~88% |
| ARC-AGI-2 | 52.9% | 37.6% | 31.1%-45.1% | ~25% |
| SWE-bench Verified | 80.0% | 80.9% | 76.2% | ~72% |
| SWE-bench Pro | 55.6% | ~45% | 43.4% | ~38% |
| Humanity's Last Exam | 34.5% | 25.2% | 37.5% | ~28% |
| Video-MMMU | 90.5% | ~82% | 87.6% | ~79% |
| Terminal-Bench 2.0 | ~58% | 59.3% | ~50% | ~45% |
| Tau2-bench Telecom | 94.5% | 98.2% | ~90% | ~85% |
| Context window | 400K | 200K | 1M | 128K |
| Input price (per 1M tokens) | $1.75 | ~$3.00 | ~$1.25 | ~$0.45 |
GPT-5.2 led on ARC-AGI-2 by a significant margin over all competitors at launch, and held the top SWE-bench Pro score. Claude Opus 4.5 maintained a narrow lead on SWE-bench Verified (80.9% vs 80.0%) and outperformed GPT-5.2 on tool-use benchmarks: Tau2-bench Telecom (98.2% vs 94.5%) and Terminal-Bench 2.0 (59.3% vs ~58%). Claude Opus 4.5 was also noted in developer testing to be more likely to deliver complete, working implementations on a first attempt, which offset GPT-5.2's roughly 17% lower per-run cost for some workloads.
Gemini 3 Pro led on Humanity's Last Exam (37.5% with tools vs GPT-5.2's 34.5%) and offered the largest context window at 1 million tokens, roughly 2.5x GPT-5.2's 400K limit. Gemini 3's tight integration with Google Cloud services, including managed MCP servers for Maps and BigQuery, gave it practical advantages for workflows built on Google infrastructure. Some developers chose Gemini 3 Pro for tasks involving broad multimodal workflows or very long documents that exceeded GPT-5.2's context limit.
DeepSeek V3.2 was the lowest-cost option at roughly $0.45 per million input tokens, but it trailed GPT-5.2 by meaningful margins on most benchmarks, with the gap narrowing to as little as 0.2 percentage points only on a few mathematical evaluations. Its 128K context window also constrained its applicability to long-context tasks.
GitHub Copilot's position as a widely adopted IDE integration made it a practical venue where the three major frontier models competed directly. By the time of GPT-5.2's launch, GitHub Copilot offered both GPT-5.2 and Claude Opus 4.5 for enterprise customers via bring-your-own-key arrangements. Developers reported choosing between them on a task-by-task basis, often using Claude for architecture decisions and GPT-5.2-Codex for long-running implementation tasks.
GPT-5.2-Codex is a variant of GPT-5.2 optimized for agentic software engineering tasks. Released on December 18, 2025, it is a distinct fine-tune of GPT-5.2 Thinking trained on additional coding-specific data, with context compaction built in natively to support multi-hour and multi-day task sessions. OpenAI described it as "the most advanced agentic coding model yet for complex, real-world software engineering."
The release continued the Codex brand that OpenAI had revived in 2025, after retiring it in 2023 when the original code-completion model series was discontinued. The revived Codex brand encompasses a broader agentic coding platform rather than a standalone API model.
GPT-5.2-Codex extends base GPT-5.2 Thinking with targeted improvements for software engineering workflows.
Automatic context compaction allows the model to sustain coherent work across sessions spanning millions of tokens. The model compacts sessions natively when approaching context limits, preserving task-relevant information without the explicit /responses/compact API call required by the base model. This solved a fundamental constraint of earlier Codex variants, which would lose task context mid-refactor or terminate when hitting token limits.
Native Windows environment support gives the model reliable performance in PowerShell and Windows-specific development contexts. Prior Codex variants had been predominantly optimized for Unix-based environments, creating friction for teams working on Windows-first codebases.
Vision capabilities allow GPT-5.2-Codex to interpret screenshots and technical diagrams during coding sessions, letting it act on visual context such as error dialogs, browser screenshots, or design mockups without the developer needing to describe the visual content in text.
Sustained multi-step execution supports tasks lasting 7 hours or more in a single session, covering complex workflows such as codebase-wide refactors, full feature builds, and data migrations. Rakuten reported completing a 7-hour autonomous refactoring session without human intervention using GPT-5.2-Codex.
Long-context reasoning across large repositories improved substantially. GPT-5.2-Codex handled code migrations requiring comprehensive cross-file reference updates and multi-day feature development while maintaining coherent understanding of system architecture.
| Benchmark | GPT-5.2-Codex | GPT-5.2 Thinking | GPT-5.1 Codex | Claude Opus 4.5 |
|---|---|---|---|---|
| SWE-bench Pro | 56.4% | 55.6% | 50.8% | ~45% |
| Terminal-Bench 2.0 | 64.0% | ~58% | ~52% | 59.3% |
| SWE-bench Verified | ~80% | 80.0% | ~68% | 80.9% |
| AIME 2025 | 100% | 100% | ~80% | ~94% |
GPT-5.2-Codex achieved 56.4% on SWE-bench Pro, surpassing both GPT-5.2 Thinking (55.6%) and all other publicly benchmarked coding models at the time of release. On Terminal-Bench 2.0, which tests agentic performance across realistic terminal environments with diverse task types, GPT-5.2-Codex scored 64.0%, overtaking Claude Opus 4.5's 59.3% and marking a substantial improvement over prior Codex variants.
OpenAI assessed GPT-5.2-Codex under the Preparedness Framework's cybersecurity evaluation criteria. The system card addendum published December 18, 2025 stated that GPT-5.2-Codex had "significantly stronger cybersecurity capabilities than any model released so far" but did not reach a "High" level of cyber capability under the framework's definitions. (The successor GPT-5.3-Codex would later become the first OpenAI model to reach "High" on this rubric in February 2026.)
The model performed well on professional Capture-the-Flag challenges involving multi-step security tasks, including fuzzing, test environment setup, and attack surface analysis. OpenAI noted that the same capabilities enabling defensive security work also create dual-use risk, and that deployment was designed with future capability growth in mind.
A documented real-world case study involved a security researcher who used the predecessor model to investigate React Server Components vulnerabilities, discovering an initial critical flaw and subsequently uncovering three additional CVEs (CVE-2025-55183, CVE-2025-55184, CVE-2025-67779). GPT-5.2-Codex's enhanced capabilities for this type of defensive vulnerability research were cited as a justification for the invite-only security pilot at launch.
OpenAI published deployment recommendations alongside the release, advising organizations to implement tracked disclosures for AI-assisted vulnerability research, integrate AI testing into secure development lifecycles with mandatory human validation, apply least-privilege access and network segmentation for advanced AI tools, establish governance frameworks with acceptable-use policies and audit logging, and enforce secure prompt handling with data redaction and sandboxing.
GPT-5.2-Codex launched on December 18, 2025 for paid ChatGPT users (Plus, Pro, Business, Enterprise, Edu) across Codex CLI, IDE extensions, web, mobile, and GitHub code review. API access with model ID gpt-5-2-codex followed in subsequent weeks. A security pilot for vetted professionals ran in parallel under invite-only access.
| Tier | Input (per 1M tokens) | Output (per 1M tokens) | Cached input |
|---|---|---|---|
| GPT-5.2-Codex | $1.75 | $14.00 | $0.175 |
GPT-5.2-Codex carries the same per-token pricing as GPT-5.2 Thinking, representing a 1.4x increase over the prior Codex variant.
GPT-5.2 launched alongside several developer-facing improvements across OpenAI's product surfaces.
On the API side, the apply_patch tool allowed the model to produce structured file diffs rather than full file rewrites during code editing tasks, cutting output token consumption for large repositories. The local_shell tool enabled execution of shell commands in sandboxed environments, allowing agents to run tests, build commands, and scripts within controlled containers. Preambles gave developers a mechanism for injecting persistent instructions that survive context compaction, maintaining consistent behavior across long-running agent sessions without re-stating instructions in every turn.
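A hedged sketch of wiring these pieces into a single request is shown below; the tool type names follow the descriptions above, while the exact tool schemas and the preamble-style instructions text are assumptions rather than documented parameters.

```python
# Hedged sketch: a coding-agent request combining the launch-day tools.
# Tool type names follow the text above; exact schemas are assumptions.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5.2",
    instructions=(
        "You are a repository maintenance agent. "           # preamble-style,
        "Prefer apply_patch diffs over full-file rewrites."   # persistent instructions
    ),
    tools=[
        {"type": "apply_patch"},   # structured diffs instead of full rewrites
        {"type": "local_shell"},   # sandboxed shell for tests and builds
    ],
    input="Run the test suite and fix any failing import paths.",
)
print(response.output_text)
```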
Microsoft provided zero-day availability on Azure AI Foundry. Features specific to the Foundry deployment included geographic data zone options for regulatory compliance, integration with Foundry IQ (a retrieval-augmented generation engine using GPT-5.2's reasoning to reduce hallucinations in document retrieval), and content safety screening applied at the gateway level before model processing. Visual Studio Code gained Agent Mode for autonomous multi-step tasks, including self-correcting refactors and test generation.
GitHub Copilot added GPT-5.2 as a public preview option via bring-your-own-key arrangements for enterprise customers, making it available directly within developer IDEs alongside Claude Opus 4.5. The dual-model offering in Copilot let teams choose models per task without switching tools.
Enterprise software companies Notion, Box, Shopify, Harvey, and Zoom reported adopting GPT-5.2 for production workflows, citing its long-horizon reasoning and tool-calling capabilities. Coding platforms Windsurf, CharlieCode, Cognition, Warp, JetBrains, and Augment Code highlighted GPT-5.2's agentic coding performance in integration announcements. Data science platforms Databricks, Hex, and Triple Whale reported using GPT-5.2 in agentic analytical workflows. Enterprise users reported anecdotal productivity gains, including one documented case where complex financial document extraction time dropped from 46 seconds to 12 seconds.
OpenAI and third-party analysts documented several areas where GPT-5.2 showed practical value at launch.
Professional knowledge work was the central positioning. The GDPval result (70.9% win-or-tie rate against domain experts across 44 fields) framed GPT-5.2 as viable for tasks such as financial modeling, research synthesis, legal document review, and scientific literature analysis. Reported productivity gains from enterprise users ranged from 40 to 60 minutes saved daily for routine users, with power users reporting savings exceeding 10 hours per week on document-heavy workflows.
Enterprise agentic pipelines benefited from the 400K context window, response compaction, and 30-minute Pro task sessions. These enabled multi-step automation workflows including data pipeline construction, document transformation, and cross-system orchestration that required sustained context over extended periods.
Software development covered a spectrum from interactive coding assistance in IDEs to long-running autonomous agents completing full feature builds and repository-scale migrations. GPT-5.2-Codex specifically targeted the latter: complex refactors, code migrations, multi-day feature development, and cross-platform security audits.
Scientific and mathematical research tasks were supported by the GPQA Diamond score (92.4%), FrontierMath performance (40.3%), and perfect AIME 2025 score. These positioned the model for research-adjacent tasks in mathematics, physics, chemistry, and biology, though OpenAI cautioned that high benchmark performance did not guarantee correctness on novel problems outside the training distribution.
Long document processing was enabled by 98% accuracy on a 4-needle MRCRv2 test at 256K tokens, cited in the system card. This supported reliable processing of large legal documents, technical manuals, regulatory filings, and multi-document research corpora.
Secure coding and vulnerability research, via the GPT-5.2-Codex cybersecurity pilot, opened a path for AI-assisted penetration testing, CVE discovery, and integration into secure development lifecycles for vetted professionals.
The December 11 system card update highlighted several improvements over GPT-5.1 in safe completion behavior.
Mental health and self-harm handling improved measurably. OpenAI reported fewer undesirable responses to prompts indicating signs of suicide or self-harm, mental health distress, and emotional reliance on the model. On a mental health safety benchmark, GPT-5.2 scored 91.5% on appropriate handling of sensitive mental health inquiries. Resistance to enabling unhealthy emotional dependence was measured at 95.5%. These improvements applied across both the Instant and Thinking variants and addressed prior public scrutiny over GPT-5's handling of emotionally vulnerable users.
OpenAI announced plans to introduce automatic content protections for users under 18, slated for the first quarter of 2026, following scrutiny over the previous model's handling of age-sensitive conversations.
GPT-5.2-Codex received a separate safety assessment under the Preparedness Framework's cybersecurity rubric, given its enhanced vulnerability-discovery capabilities. The addendum noted that while the model did not reach the "High" threshold for cyber capability, ongoing monitoring was planned as capabilities continued to grow.
OpenAI also addressed benchmark reliability in the system card, acknowledging that hallucination rates varied substantially with reasoning effort level and that low-effort mode largely negated the hallucination improvements achieved in higher-effort modes.
The system card also documented stronger jailbreak and prompt-injection robustness compared with prior generations, though OpenAI did not publish the specific evaluation suites used to derive those numbers.
Initial reception was divided between enterprise and consumer audiences.
Enterprise and developer adoption was broadly positive. Coding platforms Windsurf and CharlieCode described "state-of-the-art agent coding performance" on complex multi-step workflows. Enterprise customers reported measurable gains on document-heavy tasks. Box CEO Aaron Levie cited a 7-point improvement over GPT-5.1 on internal knowledge work assessments. Some teams reported significant speed improvements on document extraction tasks, with one reported case showing a drop from 46 seconds to 12 seconds on a complex financial document. Simon Willison's hands-on review described a four-hour Python-to-JavaScript porting task that completed without errors, though he noted GPT-5.2's vision capabilities were the most clearly improved aspect of the release.
Consumer and developer feedback raised several criticisms. Thinking mode was widely described as slow, with some users on the OpenAI Developer Community forum reporting token generation speeds as low as 4 tokens per second in extended thinking mode, compared to faster performance in GPT-5.1. The Instant variant drew complaints of being bland, overly formal, and "robotic" compared to earlier GPT versions. Some developers noted that the model triggered safety content filters on routine conversations that prior versions had handled without issue.
A gap between benchmark performance and practical usability was a recurring theme. Vals.ai's independent SWE-bench Verified measurement of 75.4% was lower than OpenAI's stated 80.0%. Developers on the OpenAI Developer Community forum noted that Pro mode sometimes became stuck when navigating conflicting developer and user instructions, spending several minutes deliberating before failing to complete straightforward tasks. Some consumers characterized the higher pricing as unjustified given these practical limitations.
A July 2025 METR study, frequently cited in coverage of GPT-5.2-Codex, had found that experienced developers using AI tools took 19% longer than without them on certain tasks, contradicting developers' own predictions of 24% time savings. The study added context to the broader debate about whether benchmark-driven capability claims translate into real productivity gains for senior engineers working on familiar codebases.
A Guardian investigation in January 2026 reported that GPT-5.2 had cited Grokipedia, an encyclopedia associated with Elon Musk's xAI, as a source in some responses, drawing criticism from researchers concerned about source quality and factual reliability. The Guardian found Grokipedia citations across more than a dozen test queries, including on sensitive topics involving Iranian government affiliations and Holocaust-related historiography. OpenAI told the Guardian that GPT-5.2 searches "a broad range of publicly available sources and viewpoints" while applying "safety filters to reduce the risk of surfacing links associated with high-severity harms," but did not commit to removing Grokipedia from its source set.
On LMArena, GPT-5.2-high's #2 debut on the WebDev leaderboard was widely cited as evidence the model had genuinely closed the gap with Claude Opus 4.5 on coding tasks, though its preliminary Text Arena standing was less commanding.
Documented limitations at launch included the following.
Audio support was absent at launch. GPT-5.2 accepted text and image inputs but did not support audio input or output. Earlier OpenAI products such as GPT-4o had offered native audio modalities, so this represented a feature regression for some users.
Image generation was not included. Unlike some competing offerings, GPT-5.2 had no native image generation capability at launch. Sam Altman had publicly emphasized image generation as a strategic priority during the code red period, but no new image generator shipped with the December release.
Canvas features were unavailable in the Pro variant.
Context handling limitations remained for contradictory information. Users testing contradictory statements within long contexts found the model sometimes failed to resolve conflicts correctly despite the larger context window.
Hallucination rates rose at low reasoning effort. OpenAI's system card acknowledged that GPT-5.2 Thinking with reasoning effort set to "none" exhibited an 8.4% hallucination rate on GDPval, comparable to or slightly above GPT-5.1's baseline. The advertised hallucination reductions applied primarily to medium and higher effort settings.
The Instant variant's 128,000-token context window, while adequate for most chat interactions, was substantially smaller than the Thinking variant's 400K and well under Gemini 3 Pro's 1 million token offering.
GPT-5.2-Codex API access was delayed at launch, limiting developers to Codex surface deployments for the initial weeks before general API availability.
GPT-5.2-Codex was succeeded by GPT-5.3-Codex, released February 5, 2026. GPT-5.3-Codex was the first OpenAI model assessed at "High" on the Preparedness Framework's cybersecurity rubric, prompting Sam Altman to flag it publicly as the first model OpenAI believed could meaningfully enable real-world cyber harm. A smaller variant, GPT-5.3-Codex-Spark, followed on February 12, 2026 as OpenAI's first real-time coding model.
The broader GPT-5.3 line shipped in early 2026 and replaced GPT-5.2 as the default for paid ChatGPT users. GPT-5.4 followed on March 5, 2026 with native computer-use capabilities, achieving 75% on OSWorld-Verified compared with GPT-5.2's 47.3% on the same benchmark, and OpenAI reported a 33% reduction in factual errors over GPT-5.2. GPT-5.5 launched on April 23, 2026, expanding the API context window to 1 million tokens and updating the knowledge cutoff to December 2025. GPT-5.5 Instant became the new default ChatGPT model on May 5, 2026, with paid users retaining access to GPT-5.3 Instant for a three-month transition period.