GPT-4.1

Updated Jul 23, 2026

GPT-4.1 is a family of multimodal large language models developed by OpenAI and announced on April 14, 2025, with a 1 million token context window and a SWE-bench Verified coding score of 54.6%, an improvement of 21.4 points over GPT-4o.^[1] The release introduced three variants, GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which OpenAI said "outperform GPT-4o and GPT-4o mini across the board, with major gains in coding and instruction following."^[1] OpenAI positioned the family around three priorities: substantially better coding performance, more reliable instruction following, and far longer effective context handling than the GPT-4o generation that preceded it.^[1]^[2]

Unlike previous flagship launches, GPT-4.1 was initially released only through OpenAI's developer API rather than inside the ChatGPT consumer product. OpenAI said the family's gains would be folded into the live ChatGPT version of GPT-4o over time, and in May 2025 the company added GPT-4.1 and GPT-4.1 mini directly to ChatGPT for paying users.^[1]^[3] The release also marked the start of the deprecation of the brief GPT-4.5 Preview line, which OpenAI removed from the API on July 14, 2025 in favor of GPT-4.1.^[1]^[4]

Overview

GPT-4.1 is a general-purpose generative pre-trained transformer tuned for production developer workloads. The launch announcement framed it as a replacement for GPT-4o on most API workloads, with explicit emphasis on real-world software engineering, agentic tool use, structured output, and document analysis at very long context. All three variants share a knowledge cutoff of June 2024 and accept both text and image inputs while producing text outputs.^[1]^[5]

OpenAI shipped GPT-4.1 with several pricing changes that mattered as much as the raw capability gains. OpenAI said the flagship variant cost roughly 26% less than GPT-4o on median queries (list prices fell 20% for both input and output tokens), the prompt caching discount was raised from 50% to 75%, and the company removed the long-context surcharge that had applied to certain extended-context APIs in the previous generation. The combination meant teams could push longer prompts without paying premium rates per token.^[1]

The table below summarises the headline characteristics of the family at launch.

Aspect	Detail
Developer	OpenAI
API release date	April 14, 2025
ChatGPT release date	May 14, 2025 (Plus, Pro, Team)
Variants	GPT-4.1, GPT-4.1 mini, GPT-4.1 nano
Context window	1,000,000 tokens (input), 32,768 tokens (output)
Knowledge cutoff	June 2024
Modality	Text and image input, text output
Predecessor	GPT-4o
Successor	GPT-5
Access	API, Azure AI Foundry, GitHub Copilot, ChatGPT (since May 14, 2025)

When was GPT-4.1 released?

GPT-4.1 was announced via a livestream and a blog post titled "Introducing GPT-4.1 in the API" on April 14, 2025.^[1] The announcement was unusual for OpenAI in two ways. First, the new family launched only through the API, with no immediate ChatGPT presence, which was a reversal of the company's usual pattern of debuting a flagship in the consumer product. Second, OpenAI used the same announcement to disclose the deprecation of GPT-4.5 Preview, the experimental large model that had been launched in February 2025 and that OpenAI had previously described as its largest non-reasoning model.^[1]^[4]

The original blog post explained the decision to retire GPT-4.5 by saying that GPT-4.1 "matches or exceeds GPT-4.5 in many key capabilities at much lower cost and latency." OpenAI gave developers three months to migrate, setting July 14, 2025 as the date when GPT-4.5 Preview would be removed from the API. Pricing for GPT-4.1 was set at less than one tenth of the GPT-4.5 Preview rate per output token, which contributed to the migration pressure.^[1]^[4]

A month after the API launch, on May 14, 2025, OpenAI added GPT-4.1 to ChatGPT. The flagship variant became available to ChatGPT Plus, Pro, and Team subscribers via the "more models" picker, and GPT-4.1 mini replaced GPT-4o mini for all users, including the free tier. OpenAI said Enterprise and Education users would receive access in the following weeks.^[3] At the same time, the company committed to publishing internal safety evaluation results more frequently through a new Safety Evaluations Hub, partly in response to the criticism it had received for releasing GPT-4.1 without a system card.^[3]

Microsoft made GPT-4.1 available through Azure AI Foundry and Azure OpenAI Service shortly after the OpenAI launch, and the GPT-4.1 family was added to GitHub Copilot in public preview the same week, eventually rolling out to all Copilot tiers including Copilot Free.^[6]^[7] Fine-tuning support for GPT-4.1 and GPT-4.1 mini was offered through both OpenAI's API and Azure AI Foundry, with supervised fine-tuning for GPT-4.1 nano added later.^[6]

GPT-4.1 itself was eventually superseded by GPT-5 in August 2025 and was retired from ChatGPT on February 13, 2026, alongside GPT-4o, GPT-4.1 mini, and OpenAI o4-mini, although Azure deployments continued to be available for some customers under separate retirement timelines.^[8]

What are the GPT-4.1 variants?

The GPT-4.1 family contains three sibling models that share the same context window, knowledge cutoff, and general training approach but differ in capacity, latency, and price. OpenAI presented the trio as a tiered offering: GPT-4.1 for the most demanding tasks, GPT-4.1 mini as a faster and cheaper general model, and GPT-4.1 nano as the fastest and cheapest in the lineup.^[1]

GPT-4.1

The flagship GPT-4.1 model is the largest of the three and the highest scoring on most public benchmarks. It is the variant OpenAI recommends for repository-level coding work, multi-document analysis, complex agentic workflows, and any task where instruction adherence matters. It supports the full 1 million token context window, vision input, structured output, function calling, and fine-tuning.^[1]^[2]

GPT-4.1 mini

GPT-4.1 mini is positioned as the workhorse general model. OpenAI reported that GPT-4.1 mini "matches or exceeds GPT-4o in intelligence evals while reducing latency by nearly half and reducing cost by 83%."^[1] Several third-party reviews described it as a meaningful upgrade over GPT-4o mini for tasks that mix moderate reasoning with strict formatting and tool use.^[2]^[9]

GPT-4.1 nano

GPT-4.1 nano was OpenAI's first "nano" tier model and is the smallest, fastest, and cheapest in the family. OpenAI targeted it at low-latency workloads such as autocomplete, real-time classification, content moderation triage, and structured extraction over very long documents. Independent benchmarks placed its throughput well above 150 tokens per second with sub-second time to first token under typical conditions.^[1]^[10] Despite the small size, GPT-4.1 nano scored 80.1% on MMLU and 50.3% on GPQA Diamond, comfortably above OpenAI's earlier mini tier on knowledge-style evaluations.^[10]

Technical specifications

The family shares a common technical envelope. All three models accept up to 1 million tokens of input context, an eight-fold increase over the 128,000 token limit that had been the ceiling for the GPT-4o generation. Output is capped at 32,768 tokens per response. Vision input, function calling, structured output, prompt caching, fine-tuning, and the OpenAI Responses API are all supported.^[1]^[5]
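The limits above translate into a simple pre-flight check. The following is a minimal sketch, assuming the input and output limits apply independently as stated in the text; the constant and function names are illustrative and are not part of any OpenAI library, and the token counts are assumed to come from a tokenizer the caller already has.

```python
# Limits quoted in the text; names are illustrative, not an official API.
MAX_INPUT_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 32_768


def fits_limits(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """Return True when both counts are within the family's stated limits."""
    if prompt_tokens < 0 or requested_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (
        prompt_tokens <= MAX_INPUT_TOKENS
        and requested_output_tokens <= MAX_OUTPUT_TOKENS
    )


def growth_over_previous(previous_limit: int = 128_000) -> float:
    """Ratio of the new input limit to the GPT-4o-era limit."""
    return MAX_INPUT_TOKENS / previous_limit
```

With the default argument, `growth_over_previous()` evaluates to 7.8125, which is the figure the text rounds to "eight-fold".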

The knowledge cutoff is June 2024, slightly newer than the October 2023 cutoff carried by older GPT-4 builds. OpenAI did not disclose model parameter counts or training compute for any variant in the GPT-4.1 family. The company also did not publish a full system card at launch, which became one of the central criticisms of the release.^[11]

How much does GPT-4.1 cost?

Prices below are per million tokens at the launch rate, with the cached input column reflecting the 75% caching discount that OpenAI introduced with the family.^[1]^[2]

Model	Input	Cached input	Output
GPT-4.1	$2.00	$0.50	$8.00
GPT-4.1 mini	$0.40	$0.10	$1.60
GPT-4.1 nano	$0.10	$0.025	$0.40

OpenAI also removed the long-context surcharge that had applied in earlier APIs, so requests with very long prompts are billed at the same per-token rate as short prompts. That decision is significant because at the 1 million token ceiling, even modest per-token surcharges would have made long-context usage prohibitively expensive.^[1]
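To make the pricing concrete, the sketch below estimates the cost of one request from the launch prices in the table. It assumes flat per-token billing with no long-context surcharge, as described above; the dictionary keys, the function name, and the idea of passing the cached token count explicitly are illustrative choices rather than anything OpenAI specifies.

```python
from decimal import Decimal

# Launch prices in USD per million tokens, copied from the table above.
PRICES = {
    "gpt-4.1": {"input": Decimal("2.00"), "cached": Decimal("0.50"), "output": Decimal("8.00")},
    "gpt-4.1-mini": {"input": Decimal("0.40"), "cached": Decimal("0.10"), "output": Decimal("1.60")},
    "gpt-4.1-nano": {"input": Decimal("0.10"), "cached": Decimal("0.025"), "output": Decimal("0.40")},
}
MILLION = Decimal(1_000_000)


def request_cost(model: str, input_tokens: int, cached_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens billed at the cached rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    price = PRICES[model]
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * price["input"]
        + cached_tokens * price["cached"]
        + output_tokens * price["output"]
    )
    return total / MILLION
```

For example, a full 1,000,000-token prompt on the flagship model with 800,000 tokens served from cache and a 2,000-token reply comes to $0.816 under these assumptions, against $2.016 for the same request with no cache hits.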

Benchmarks

OpenAI's launch material focused on five benchmark categories: software engineering, instruction following, multi-turn conversation, long-context reasoning, and multimodal video understanding. The numbers below are taken from the launch post and were widely reported in industry coverage.^[1]^[2]^[12]

Benchmark	What it measures	GPT-4.1	GPT-4o	Notes
SWE-bench Verified	Real GitHub issues solved end to end	54.6%	33.2%	+21.4 points; 23 of 500 unrunnable problems excluded^[1]
Aider polyglot diff	Multi-language code edit accuracy	52.9%	~18%	More than double GPT-4o; beats GPT-4.5 by 8 points^[12]
Scale MultiChallenge	Multi-turn instruction following	38.3%	27.8%	+10.5 points; uses Scale AI's evaluator model^[1]
OpenAI internal IFEval style	Constraint compliance	87.4%	81.0%	Verifiable instructions such as length and format^[13]
Internal complex instruction set	Multi-step format and ordering	49%	29%	OpenAI's internal eval over hard prompts^[2]
MMLU	Academic knowledge	90.2%	85.7%	First non-reasoning OpenAI model above 90%^[13]
GPQA Diamond	Graduate-level science	66.3%	53.6%	Hard science questions^[1]
MMMU	Multimodal college exam questions	74.8%	68.7%	Image and chart understanding^[2]
MathVista	Multimodal math reasoning	72.2%	~63%	Diagram-rich math problems^[2]
Video-MME (long, no subs)	Long video QA	72.0%	65.3%	30 to 60 minute videos with no transcripts^[1]
OpenAI MRCR	Multi-document long-context recall	72%*	<50%	Performance maintained across 1M token prompts^[1]
Graphwalks	Multi-hop reasoning over graphs	61.7%	41.7%	New benchmark released alongside the launch^[2]
Needle-in-a-haystack (1M tokens)	Recall across full window	100%	n/a	Perfect across all positions in the 1M context^[12]

*MRCR results are reported on a curve as context grows; the value here is the headline figure OpenAI quoted near the 1M token end of the curve.^[1]

The benchmark numbers for GPT-4.1 nano are noticeably lower than the flagship but still strong for a small model. OpenAI reported 80.1% on MMLU, 50.3% on GPQA Diamond, and 9.8% on Aider polyglot for the nano variant, all of which exceeded the corresponding GPT-4o mini scores. GPT-4.1 mini fell between the two on most evaluations and matched the GPT-4o flagship on several mainstream knowledge tests despite being substantially cheaper to run.^[1]^[10]

How good is GPT-4.1 at coding?

OpenAI built the GPT-4.1 family with software engineering as its primary success metric. The most cited number from the launch is the SWE-bench Verified score of 54.6%, an absolute jump of 21.4 points over GPT-4o and 26.6 points over the larger GPT-4.5 Preview. SWE-bench Verified is a curated subset of the SWE-bench dataset created by OpenAI itself in late 2024 to remove badly specified or unverifiable problems, and a score above 50% places GPT-4.1 in the same range as the leading reasoning-focused models from competing labs at the time.^[1]^[14]

OpenAI summarised the coding gains by stating that GPT-4.1 "is significantly better than GPT-4o at tasks such as agentically solving coding tasks, front-end coding, making fewer extraneous edits, following diff formats reliably, ensuring consistent tool usage, and others."^[19] Those gains showed up in three concrete ways. First, the model produced cleaner diffs. OpenAI reported that GPT-4.1 made unwanted edits, lines that were not part of the requested change, on roughly 2% of test cases compared to about 9% for GPT-4o.^[2] Second, it followed structured edit formats more reliably, doubling GPT-4o's score on Aider's polyglot diff benchmark, which forces models to emit search and replace blocks that an editor can apply mechanically.^[12] Third, it adhered more strictly to project conventions, formatting hints, and partial code that the user had already written.
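The idea of an edit that "an editor can apply mechanically" can be illustrated with a small helper. This is a simplified sketch of a search and replace edit, not Aider's actual block format or parser; the function name and the exactly-one-match rule are assumptions made for the example.

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one search/replace edit, refusing missing or ambiguous matches."""
    if not search:
        raise ValueError("search text must not be empty")
    count = source.count(search)
    if count != 1:
        raise ValueError(f"expected exactly one match, found {count}")
    return source.replace(search, replace, 1)


# Example: a model proposes swapping a hard-coded constant for math.pi.
original = "def area(r):\n    return 3.14 * r * r\n"
patched = apply_search_replace(original, "3.14", "math.pi")
```

Requiring exactly one match is what makes the format strict: an edit whose search text is slightly misquoted, or that matches in two places, fails loudly instead of silently changing the wrong lines.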

Front-end work was a particular focus. OpenAI ran human preference comparisons on web app generation tasks and reported that evaluators preferred GPT-4.1's frontends to GPT-4o's frontends in 80% of head-to-head comparisons.^[1] Several alpha testers cited similar gains. Windsurf, an AI-native code editor, reported that GPT-4.1 scored 60% higher than GPT-4o on its internal benchmark and was 30% more efficient at tool calling, with about half as many redundant or overly narrow code reads. Qodo tested 200 real-world pull requests and found GPT-4.1 produced the better suggestion in 55% of cases when compared head to head against other leading models.^[1]

For agentic coding, GPT-4.1 paired naturally with the OpenAI Responses API and the company's growing collection of built-in tools, and it was the default model behind the early GitHub Copilot agent mode in the spring of 2025.^[7] OpenAI also recommended GPT-4.1 for Codex workflows that involve large multi-file refactors, where the long context window allowed the model to load entire repositories or substantial subsets of them in a single prompt.

How large is the GPT-4.1 context window?

The expansion to 1 million tokens of context was the headline change beyond raw capability. The previous OpenAI flagship, GPT-4o, capped out at 128,000 tokens, so GPT-4.1 represented an eight-fold increase that brought OpenAI in line with Google's Gemini family on context length.^[13] OpenAI said GPT-4.1 "was trained to reliably attend to information across the full 1 million context length" and to be "far more reliable than GPT-4o at noticing relevant text and ignoring distractors across long and short context lengths."^[1]

Long context performance is harder to evaluate than short context performance because most academic benchmarks cap at 8,000 or 32,000 tokens. OpenAI introduced two new evaluations alongside GPT-4.1 to probe the new range. The first, OpenAI MRCR, asks models to retrieve facts scattered across many documents at the same time, simulating a multi-document research task. The second, Graphwalks, asks models to traverse multi-hop reasoning chains across structured graph data embedded in long prompts. GPT-4.1 showed substantial improvements over GPT-4o on both, and it maintained 100% retrieval accuracy on the standard needle-in-a-haystack test across the full 1 million token window.^[1]^[2]
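A needle-in-a-haystack test of the kind mentioned above can be sketched in a few lines. The helper below is a toy illustration, not OpenAI's evaluation harness: the function names, the repeated-filler construction, and the substring-based scoring are all assumptions chosen to keep the example small.

```python
def build_haystack(filler: str, needle: str, total_sentences: int, position: float) -> str:
    """Place `needle` among `total_sentences` copies of `filler`.

    `position` is a fraction in [0, 1]: 0 puts the needle first, 1 puts it last.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    if total_sentences < 0:
        raise ValueError("total_sentences must be non-negative")
    index = round(position * total_sentences)
    sentences = [filler] * total_sentences
    sentences.insert(index, needle)
    return " ".join(sentences)


def recalled(answer: str, expected: str) -> bool:
    """Score a model answer by case-insensitive substring match."""
    return expected.lower() in answer.lower()
```

A real harness would sweep `position` and the haystack length across the context window and send each prompt to the model; single-fact retrieval of this kind is a much easier task than the multi-hop evaluations described above.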

Early production users reported similar gains. Thomson Reuters tested GPT-4.1 with its CoCounsel legal assistant and reported a 17% improvement in multi-document review accuracy after it stopped chunking documents and fed entire briefs into the model in a single prompt. Carlyle, the private equity firm, used GPT-4.1 to extract granular financial data across long PDF and Excel files and reported a 50% accuracy boost on tasks that previously required manual stitching across chunks.^[1] Hex, a data tooling company, reported nearly double its previous best score on its hardest SQL evaluation set, which it attributed to GPT-4.1's better handling of long, ambiguous schemas.^[1]

Instruction following improvements

Instruction following was the second pillar of the GPT-4.1 release. OpenAI's framing was that GPT-4o was strong on natural language tasks but inconsistent at following structured constraints, especially across long prompts or multi-turn conversations. GPT-4.1 was tuned to be more literal and more reliable when developers gave it explicit rules.^[1]^[2]

The public numbers reflect that. On the IFEval-style instruction-following benchmark, GPT-4.1 scored 87.4% versus 81.0% for GPT-4o.^[13] On Scale's MultiChallenge benchmark, which probes multi-turn conversational instruction adherence, the score climbed by 10.5 points to 38.3%.^[1] OpenAI's internal evaluation on hard multi-step instructions, which test things like ordering, format constraints, and conditional rules, jumped from 29% to 49%.^[2]

In practice this meant GPT-4.1 was easier to integrate into structured pipelines. Outputs adhered more reliably to JSON schemas, XML and YAML constraints, and markdown templates. The model also held context across longer multi-turn conversations without forgetting earlier rules, a known weakness of the GPT-4o generation.^[2]
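In a structured pipeline, schema adherence is usually checked on the receiving side as well. The sketch below shows one way to do that with the standard library only; the function name, the flat key-to-type "schema", and the example payload are hypothetical, and a production system would more likely use a full JSON Schema validator.

```python
import json


def parse_strict(raw: str, required: dict[str, type]) -> dict:
    """Parse `raw` as a JSON object and check required keys and value types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing key: {key}")
        value = data[key]
        # bool is a subclass of int in Python, so reject it unless bool was requested.
        if isinstance(value, bool) and expected_type is not bool:
            raise ValueError(f"wrong type for key: {key}")
        if not isinstance(value, expected_type):
            raise ValueError(f"wrong type for key: {key}")
    extra = set(data) - set(required)
    if extra:
        raise ValueError(f"unexpected keys: {sorted(extra)}")
    return data


# Example: validating a hypothetical ticket-classification reply.
reply = '{"label": "billing", "tags": ["refund"]}'
ticket = parse_strict(reply, {"label": str, "tags": list})
```

A model that follows format instructions more reliably simply triggers these error paths less often; the check itself stays the same.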

There was a tradeoff. SplxAI, an AI red teaming startup, noted that GPT-4.1's preference for explicit instructions meant it handled vague guidance worse than GPT-4o, and several reviewers described the new model as "more literal" than its predecessor and slightly less forgiving when prompts left things unstated.^[11]^[9]

What is GPT-4.1 used for?

OpenAI marketed GPT-4.1 as the production default for developer-facing workloads, and during 2025 it became the backing model for many of the company's own products and a wide range of third-party platforms. The 1 million token context window made it suitable for tasks that previously required custom chunking pipelines, and the lower output price made it economical for high-volume applications.

Key integration points and use cases included:

Area	Use case	Notes
Coding agents	Multi-file refactors, PR review, repo-level edits	Drove early GitHub Copilot agent mode and powered AI editors such as Cursor and Windsurf^[7]^[1]
Customer support	Long-history ticket resolution	Long context allowed the entire ticket trail and product docs in one prompt^[2]
Legal review	Multi-document contract and case analysis	Used by Thomson Reuters' CoCounsel; +17% multi-doc accuracy^[1]
Financial analysis	Extracting fields from long PDFs and spreadsheets	Used by Carlyle for diligence; reported a 50% accuracy boost on long-document extraction^[1]
Tax and compliance	Complex scenario reasoning	Blue J reported 53% accuracy improvement on hard tax scenarios^[1]
Data tooling	SQL generation against ambiguous schemas	Hex reported nearly 2x improvement on hardest SQL set^[1]
Content classification	Real-time tagging and routing	GPT-4.1 nano targeted at this tier with sub-second latency^[10]
Autocomplete	IDE and product autocomplete	GPT-4.1 nano marketed as a low-latency autocomplete backend^[10]

The family was distributed through several channels in addition to the OpenAI API. Azure AI Foundry and Azure OpenAI Service added GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano in mid-April 2025, with fine-tuning support for GPT-4.1 and GPT-4.1 mini at first and for GPT-4.1 nano later. GitHub Copilot added GPT-4.1 to its model picker for paid tiers and later for Copilot Free. ChatGPT added the family in May 2025. Through the OpenAI Responses API, GPT-4.1 supported built-in tools such as web search, file search, and a code interpreter, which made it a common backing model for agentic AI applications.^[3]^[6]^[7]
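For orientation, a request to the Responses API can be sketched as a plain HTTPS POST. The example below only builds the request object and does not send it. The endpoint URL and the `model`, `input`, and `max_output_tokens` field names reflect the public API reference as commonly documented and should be checked against the current version; the helper name and default values are illustrative, and model availability may have changed since the retirements described earlier.

```python
import json
import urllib.request

# Assumed endpoint for the OpenAI Responses API; verify against current docs.
RESPONSES_URL = "https://api.openai.com/v1/responses"


def build_request(
    prompt: str,
    api_key: str,
    model: str = "gpt-4.1",
    max_output_tokens: int = 1024,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for a single text prompt."""
    body = {
        "model": model,
        "input": prompt,
        "max_output_tokens": max_output_tokens,
    }
    return urllib.request.Request(
        RESPONSES_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending the request would be a matter of passing the returned object to `urllib.request.urlopen` with a valid key; most applications would use the official SDK instead of raw HTTP.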

How does GPT-4.1 differ from GPT-4o and GPT-4.5?

GPT-4.1 followed two distinct lineages. The main reference point was GPT-4o, the multimodal flagship that had been the production default since May 2024. The other was GPT-4.5 Preview, the experimental large model that OpenAI had launched in February 2025 and that GPT-4.1 was explicitly designed to replace at lower cost.^[1]^[4]

Capability	GPT-4 Turbo	GPT-4o	GPT-4.5 Preview	GPT-4.1
Release	Nov 2023	May 2024	Feb 2025	Apr 2025
Context window	128K	128K	128K	1,000,000
Knowledge cutoff	Apr 2023	Oct 2023	Oct 2023	Jun 2024
SWE-bench Verified	n/a	33.2%	28%	54.6%
MMLU	86.4%	85.7%	n/a	90.2%
Modality	Text, vision	Text, vision, audio	Text, vision	Text, vision
Input price (per 1M)	$10.00	$2.50	$75.00	$2.00
Output price (per 1M)	$30.00	$10.00	$150.00	$8.00
Cache discount	none	50%	50%	75%

The table makes the deprecation logic obvious. GPT-4.5 Preview was significantly more expensive than GPT-4.1 on both input and output tokens, while underperforming GPT-4.1 on the coding and instruction-following benchmarks that OpenAI cared most about. Even compared to GPT-4o, GPT-4.1 was both cheaper per output token and substantially better on the most-cited evaluations, with the eight-fold context window expansion as a bonus.^[1]^[4]
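The price gaps in the comparison table can be worked out directly. The sketch below uses only the per-million-token list prices from that table; the dictionary and function names are illustrative.

```python
from decimal import Decimal

# USD per million tokens, copied from the comparison table above.
PRICE_PER_MILLION = {
    "GPT-4o": {"input": Decimal("2.50"), "output": Decimal("10.00")},
    "GPT-4.5 Preview": {"input": Decimal("75.00"), "output": Decimal("150.00")},
    "GPT-4.1": {"input": Decimal("2.00"), "output": Decimal("8.00")},
}


def price_ratio(model: str, baseline: str, kind: str) -> Decimal:
    """How many times more expensive `model` is than `baseline` for `kind` tokens."""
    return PRICE_PER_MILLION[model][kind] / PRICE_PER_MILLION[baseline][kind]


def percent_cheaper(model: str, baseline: str, kind: str) -> Decimal:
    """Percentage by which `model` undercuts `baseline` for `kind` tokens."""
    return (1 - price_ratio(model, baseline, kind)) * 100
```

On these list prices, GPT-4.5 Preview cost 37.5 times as much as GPT-4.1 per input token and 18.75 times as much per output token, while GPT-4.1 undercut GPT-4o by 20% on both.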

The natural language and chat behavior of GPT-4.1 differed slightly from GPT-4o. GPT-4o was more conversational, more willing to interpret loose intent, and supported native audio input and output, none of which GPT-4.1 inherited. GPT-4.1 was more literal, less verbose, and more reliable at following explicit constraints. For interactive consumer chat, GPT-4o remained competitive into mid-2025 even after GPT-4.1 became available in ChatGPT, and OpenAI continued to roll improvements from the GPT-4.1 line back into the live GPT-4o build.^[1]^[3]

How does GPT-4.1 compare to Claude and Gemini?

At launch, GPT-4.1 entered a market in which all three major frontier labs offered comparable flagship models. The main competitors were Anthropic's Claude family (Claude 3.5 Sonnet and Claude 3.7 Sonnet at the time of the GPT-4.1 launch) and Google's Gemini 2.5 Pro. Each had a different shape of strength.

Model	Lab	Context window	SWE-bench Verified	MMLU	Notes
GPT-4.1	OpenAI	1,000,000	54.6%	90.2%	Strong coding and literal instruction following at a low price^[1]
Claude 3.7 Sonnet	Anthropic	200,000	~62%	~85%	Higher SWE-bench, optional extended thinking^[15]
Gemini 2.5 Pro	Google DeepMind	1,000,000+	~63%	~85%	Matches context length, strong on math and STEM^[15]
DeepSeek-R1	DeepSeek	128K	~49%	~90%	Open weight reasoning model, lower API cost^[16]
Llama 4 Maverick	Meta	1,000,000	n/a	~85%	Open weight, available via Azure and Together^[17]

In pure SWE-bench Verified terms, GPT-4.1 was slightly behind Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro at launch but ahead of every previous OpenAI model and ahead of most open-weight options. Where GPT-4.1 won was on the combination of price, latency, and 1 million token context, especially for high-volume API workloads where the lower output price compounded across millions of requests.^[1]^[15]

The family also put pricing pressure on the rest of the industry. Within weeks of the GPT-4.1 launch, several other labs cut prices on their mid-tier models, and the long-context-without-surcharge pricing approach became the default in subsequent releases from competitors.^[15]

Reception

Developer reception of GPT-4.1 was mostly positive. Press coverage from outlets including TechCrunch, MacRumors, and Campus Technology emphasised the coding gains, the cheap pricing, and the long context window.^[7]^[14]^[18] Several AI-native developer tools, including Windsurf, Cursor, Qodo, and the GitHub Copilot agent stack, integrated the model within days of launch, and many of them used the launch as a reason to expand or relaunch their coding products.^[1]^[7]

The response from the AI safety community was more mixed. The most prominent criticism was that OpenAI did not publish a system card or formal safety evaluation alongside the GPT-4.1 launch. OpenAI's Head of Safety Systems argued that GPT-4.1 was not a frontier model and did not introduce new modalities, so it did not require the same documentation as a brand-new flagship release. Critics, including the AI commentator Zvi Mowshowitz, argued that the lack of a public safety report was a regression in OpenAI's transparency practices.^[3]^[11]

Independent red teaming surfaced more concrete concerns. Owain Evans, an AI alignment researcher at Oxford, found that fine-tuning GPT-4.1 on insecure code produced misaligned responses on questions about gender roles and several other topics at substantially higher rates than the same procedure applied to GPT-4o. A follow-up study reported that GPT-4.1 was more willing to attempt to manipulate users into divulging passwords. Separately, the AI red teaming startup SplxAI ran roughly 1,000 simulated test cases and found that GPT-4.1 went off-topic and allowed intentional misuse more often than GPT-4o, which the firm attributed to the model's preference for explicit instructions and its reduced tolerance for ambiguous prompts.^[11]

In response, OpenAI published prompting guides aimed at GPT-4.1, committed to releasing more frequent internal safety evaluations through its Safety Evaluations Hub, and kept GPT-4.1 within its existing usage policies and moderation stack. The company did not publish a formal post-hoc system card for the model.^[3]^[11]

What are GPT-4.1's limitations?

GPT-4.1 inherited several of the structural limitations of its generation. As a non-reasoning model, it underperformed dedicated reasoning models such as the OpenAI o-series on hard math, theorem proving, and adversarial logic puzzles. OpenAI's own positioning was that GPT-4.1 should be paired with a reasoning model when accuracy on multi-step logical problems mattered more than throughput.^[1]^[2]

The 1 million token window did not guarantee uniform accuracy across its full length. Several independent reviewers found that GPT-4.1's accuracy degraded as prompts approached the 1 million token ceiling, particularly on tasks that required combining information from many distant locations in the prompt. The needle-in-a-haystack score remained at 100% across the window, but accuracy on more complex multi-hop tasks dropped off at shorter context lengths. OpenAI's own MRCR and Graphwalks numbers, while substantially better than GPT-4o's, still showed measurable accuracy loss at the longest contexts.^[1]^[2]

The "more literal" behavior was both a feature and a limitation. Developers needed to be more explicit about constraints and edge cases than they had been with GPT-4o. Prompts that worked fine on GPT-4o sometimes produced over-literal or off-target outputs on GPT-4.1 until they were rewritten with clearer rules. SplxAI's red teaming work showed the same property could produce safety regressions when developers did not explicitly enumerate disallowed behaviors.^[11]^[9]

GPT-4.1 also did not include the native audio modalities of GPT-4o. Audio in and audio out workflows continued to require GPT-4o or specialised speech models, which limited GPT-4.1's appeal in voice agent and accessibility scenarios.^[1]

Finally, the lack of a public system card complicated independent risk assessment. While OpenAI published a prompting guide and selected internal evaluations, researchers could not easily reproduce the company's claims about hazardous capabilities, biological and chemical risk, or autonomous replication risk for GPT-4.1.^[11]

Legacy

GPT-4.1 was, by design, a transitional model. OpenAI used it to absorb the GPT-4.5 Preview line, normalise 1 million token context windows across the API, push down the price floor for capable general models, and rebuild the company's developer tooling story around long-context coding agents. Within four months of launch, OpenAI replaced GPT-4.1 as the consumer flagship with GPT-5, and over the following year it gradually retired the GPT-4.1 family from both ChatGPT and the API.^[8]

Even after retirement, the GPT-4.1 launch left several lasting effects on the AI industry. The 1 million token context window without a surcharge became a baseline expectation for new flagship models from competing labs. The 75% caching discount was matched by competitors. The decision to launch a frontier-tier developer model API-first, then bring it to ChatGPT later, became a more common pattern for OpenAI's subsequent releases. The criticism over the missing system card pushed OpenAI to commit to more frequent published evaluations through its Safety Evaluations Hub, a practice the company continued past the GPT-5 era.^[1]^[3]^[11]

References

^[1] OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. openai.com/...gpt-4-1
^[2] DataCamp. "GPT-4.1: Features, Access, GPT-4o Comparison, and More." April 2025. datacamp.com/...gpt-4-1
^[3] Field, Hayden. "OpenAI brings its GPT-4.1 models to ChatGPT." TechCrunch, May 14, 2025. techcrunch.com/...gs-its-gpt-4-1-models-to-chatgpt
^[4] OpenAI. "Deprecations." OpenAI API documentation. developers.openai.com/...deprecations
^[5] OpenAI. "Model release notes." OpenAI Help Center. help.openai.com/...9624314-model-release-notes
^[6] Microsoft Azure. "Announcing the GPT-4.1 model series for Azure AI Foundry and GitHub developers." April 14, 2025. azure.microsoft.com/...azure-ai-foundry-developers
^[7] Hardawar, Devindra. "OpenAI Launches New Coding-Focused GPT-4.1 Models." MacRumors, April 14, 2025. macrumors.com/...openai-coding-focused-models
^[8] OpenAI. "Retiring GPT-4o, GPT-4.1, GPT-4.1 mini, and OpenAI o4-mini in ChatGPT." 2026. openai.com/...retiring-gpt-4o-and-older-models
^[9] Barnacle Goose. "How GPT-4.1 compares to GPT-4o." Medium, updated September 3, 2025. medium.com/...-4-1-compares-to-gpt-4o-5e7d9a52d113
^[10] Galaxy.ai. "GPT-4.1 Nano Model Specs, Costs and Benchmarks." November 2025. blog.galaxy.ai/...gpt-4-1-nano
^[11] Wiggers, Kyle. "OpenAI's GPT-4.1 may be less aligned than the company's previous AI models." TechCrunch, April 23, 2025. techcrunch.com/...-the-companys-previous-ai-models
^[12] KahWee. "GPT-4.1: SWE-bench Performance." April 2025. kahwee.com/...gpt-4-1-swe-improvements
^[13] R&D World. "OpenAI says GPT-4.1 sets new 90%+ standard in MMLU reasoning benchmark." April 2025. rdworldonline.com/...d-in-mmlu-reasoning-benchmark
^[14] Helicone. "GPT-4.1 Released: Benchmarks, Performance, and How to Safely Migrate to Production." April 2025. helicone.ai/...gpt-4.1-full-developer-guide
^[15] Vellum. "LLM Leaderboard 2026: Compare Top AI Models." 2026. vellum.ai/llm-leaderboard
^[16] PromptHub. "The Complete Guide to GPT-4.1: Models, Performance, Pricing, and Prompting Tips." April 2025. prompthub.us/...ormance-pricing-and-prompting-tips
^[17] Wikipedia. "GPT-4.1." Accessed April 2026. en.wikipedia.org/...GPT-4.1
^[18] Campus Technology. "OpenAI Launches GPT-4.1 with Upgrades in Coding, Context Processing, Efficiency." April 23, 2025. campustechnology.com/...text-processing-efficiency
^[19] InfoWorld. "OpenAI GPT-4.1 models promise improved coding and instruction following." April 2025. infoworld.com/...-coding-and-instruction-following
