# GPT-4.1

> Source: https://aiwiki.ai/wiki/gpt-4.1
> Updated: 2026-06-21
> Categories: AI Models, Large Language Models, OpenAI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**GPT-4.1** is a family of multimodal large language models developed by [OpenAI](/wiki/openai) and announced on April 14, 2025, with a 1 million token context window and a SWE-bench Verified coding score of 54.6%, an improvement of 21.4 points over [GPT-4o](/wiki/gpt_4o).[1] The release introduced three variants, GPT-4.1, [GPT-4.1 mini](/wiki/gpt_4_1_mini), and [GPT-4.1 nano](/wiki/gpt_4_1_nano), which OpenAI said "outperform GPT-4o and GPT-4o mini across the board, with major gains in coding and instruction following."[1] OpenAI positioned the family around three priorities: substantially better coding performance, more reliable instruction following, and far longer effective context handling than the GPT-4o generation that preceded it.[1][2]

Unlike previous flagship launches, GPT-4.1 was initially released only through OpenAI's developer API rather than inside the [ChatGPT](/wiki/chatgpt) consumer product. OpenAI said the family's gains would be folded into the live ChatGPT version of GPT-4o over time, and in May 2025 the company added GPT-4.1 and GPT-4.1 mini directly to ChatGPT for paying users.[1][3] The release also marked the start of the deprecation of the brief GPT-4.5 Preview line, which OpenAI removed from the API on July 14, 2025 in favor of GPT-4.1.[1][4]

## Overview

GPT-4.1 is a general-purpose [generative pre-trained transformer](/wiki/gpt_generative_pre-trained_transformer) tuned for production developer workloads. The launch announcement framed it as a replacement for GPT-4o on most API workloads, with explicit emphasis on real-world software engineering, agentic [tool use](/wiki/agent_tools_api), structured output, and document analysis at very long context. All three variants share a knowledge cutoff of June 2024 and accept both text and image inputs while producing text outputs.[1][5]

OpenAI shipped GPT-4.1 with several pricing changes that mattered as much as the raw capability gains. The flagship variant cost roughly 26% less per output token than GPT-4o on median request mixes, the prompt caching discount was raised from 50% to 75%, and the company removed the long-context surcharge that had applied to certain extended-context APIs in the previous generation. The combination meant teams could push longer prompts without paying premium rates per token.[1]

The table below summarises the headline characteristics of the family at launch.

| Aspect | Detail |
|---|---|
| Developer | [OpenAI](/wiki/openai) |
| API release date | April 14, 2025 |
| ChatGPT release date | May 14, 2025 (Plus, Pro, Team) |
| Variants | GPT-4.1, [GPT-4.1 mini](/wiki/gpt-4.1-mini), [GPT-4.1 nano](/wiki/gpt-4.1-nano) |
| Context window | 1,000,000 tokens (input), 32,768 tokens (output) |
| Knowledge cutoff | June 2024 |
| Modality | Text and image input, text output |
| Predecessor | [GPT-4o](/wiki/gpt_4o) |
| Successor | [GPT-5](/wiki/gpt-5) |
| Access | API, [Azure AI Foundry](/wiki/azure_ai_foundry), [GitHub Copilot](/wiki/github_copilot), ChatGPT (since May 14, 2025) |

## When was GPT-4.1 released?

GPT-4.1 was announced via a livestream and a blog post titled "Introducing GPT-4.1 in the API" on April 14, 2025.[1] The announcement was unusual for OpenAI in two ways. First, the new family launched only through the API, with no immediate ChatGPT presence, which was a reversal of the company's usual pattern of debuting a flagship in the consumer product. Second, OpenAI used the same announcement to disclose the deprecation of GPT-4.5 Preview, the experimental large model that had been launched in February 2025 and that OpenAI had previously described as its largest non-reasoning model.[1][4]

The original blog post explained the decision to retire GPT-4.5 by saying that GPT-4.1 "matches or exceeds GPT-4.5 in many key capabilities at much lower cost and latency." OpenAI gave developers three months to migrate, setting July 14, 2025 as the date when GPT-4.5 Preview would be removed from the API. Pricing for GPT-4.1 was set at less than one tenth of the GPT-4.5 Preview rate per output token, which contributed to the migration pressure.[1][4]

A month after the API launch, on May 14, 2025, OpenAI added GPT-4.1 to ChatGPT. The flagship variant became available to ChatGPT Plus, Pro, and Team subscribers via the "more models" picker, and GPT-4.1 mini replaced GPT-4o mini for all users, including the free tier. OpenAI said Enterprise and Education users would receive access in the following weeks.[3] At the same time, the company committed to publishing internal safety evaluation results more frequently through a new Safety Evaluations Hub, partly in response to the criticism it had received for releasing GPT-4.1 without a system card.[3]

Microsoft made GPT-4.1 available through [Azure AI Foundry](/wiki/azure_ai_foundry) and [Azure OpenAI Service](/wiki/azure_openai) shortly after the OpenAI launch, and the GPT-4.1 family was added to GitHub Copilot in public preview the same week, eventually rolling out to all Copilot tiers including Copilot Free.[6][7] Fine-tuning support for GPT-4.1 and GPT-4.1 mini was offered through both OpenAI's API and Azure AI Foundry, with supervised fine-tuning for GPT-4.1 nano added later.[6]

GPT-4.1 itself was eventually superseded by [GPT-5](/wiki/gpt-5) in August 2025 and was retired from ChatGPT on February 13, 2026, alongside GPT-4o, GPT-4.1 mini, and OpenAI o4-mini, although Azure deployments continued to be available for some customers under separate retirement timelines.[8]

## What are the GPT-4.1 variants?

The GPT-4.1 family contains three sibling models that share the same context window, knowledge cutoff, and general training approach but differ in capacity, latency, and price. OpenAI presented the trio as a tiered offering: GPT-4.1 for the most demanding tasks, GPT-4.1 mini as a faster and cheaper general model, and GPT-4.1 nano as the fastest and cheapest in the lineup.[1]

### GPT-4.1

The flagship GPT-4.1 model is the largest of the three and the highest scoring on most public benchmarks. It is the variant OpenAI recommends for repository-level coding work, multi-document analysis, complex agentic workflows, and any task where instruction adherence matters. It supports the full 1 million token context window, vision input, structured output, function calling, and fine-tuning.[1][2]

### GPT-4.1 mini

GPT-4.1 mini is positioned as the workhorse general model. OpenAI reported that GPT-4.1 mini "matches or exceeds GPT-4o in intelligence evals while reducing latency by nearly half and reducing cost by 83%."[1] Several third-party reviews described it as a meaningful upgrade over GPT-4o mini for tasks that mix moderate reasoning with strict formatting and tool use.[2][9]

### GPT-4.1 nano

GPT-4.1 nano was OpenAI's first "nano" tier model and is the smallest, fastest, and cheapest in the family. OpenAI targeted it at low-latency workloads such as autocomplete, real-time classification, content moderation triage, and structured extraction over very long documents. Independent benchmarks placed its throughput well above 150 tokens per second with sub-second time to first token under typical conditions.[1][10] Despite the small size, GPT-4.1 nano scored 80.1% on [MMLU](/wiki/mmlu) and 50.3% on [GPQA Diamond](/wiki/gpqa_diamond), comfortably above OpenAI's earlier mini tier on knowledge-style evaluations.[10]

## Technical specifications

The family shares a common technical envelope. All three models accept up to 1 million tokens of input context, an eight-fold increase over the 128,000 token limit that had been the ceiling for the GPT-4o generation. Output is capped at 32,768 tokens per response. Vision input, function calling, structured output, prompt caching, fine-tuning, and the [OpenAI Responses API](/wiki/openai_responses_api) are all supported.[1][5]

The knowledge cutoff is June 2024, slightly newer than the October 2023 cutoff carried by older GPT-4 builds. OpenAI did not disclose model parameter counts or training compute for any variant in the GPT-4.1 family. The company also did not publish a full system card at launch, which became one of the central criticisms of the release.[11]

### How much does GPT-4.1 cost?

Prices below are per million tokens at the launch rate, with the cached input column reflecting the 75% caching discount that OpenAI introduced with the family.[1][2]

| Model | Input | Cached input | Output |
|---|---|---|---|
| [GPT-4.1](/wiki/gpt-4.1) | $2.00 | $0.50 | $8.00 |
| [GPT-4.1 mini](/wiki/gpt-4.1-mini) | $0.40 | $0.10 | $1.60 |
| [GPT-4.1 nano](/wiki/gpt-4.1-nano) | $0.10 | $0.025 | $0.40 |

OpenAI also removed the long-context surcharge that had applied in earlier APIs, so requests with very long prompts are billed at the same per-token rate as short prompts. That decision is significant because at the 1 million token ceiling, even modest per-token surcharges would have made long-context usage prohibitively expensive.[1]

## Benchmarks

OpenAI's launch material focused on five benchmark categories: software engineering, instruction following, multi-turn conversation, long-context reasoning, and multimodal video understanding. The numbers below are taken from the launch post and were widely reported in industry coverage.[1][2][12]

| Benchmark | What it measures | GPT-4.1 | GPT-4o | Notes |
|---|---|---|---|---|
| [SWE-bench Verified](/wiki/swe-bench) | Real GitHub issues solved end to end | 54.6% | 33.2% | +21.4 points; 23 of 500 unrunnable problems excluded[1] |
| Aider polyglot diff | Multi-language code edit accuracy | 52.9% | ~18% | More than double GPT-4o; beats GPT-4.5 by 8 points[12] |
| Scale [MultiChallenge](/wiki/multichallenge) | Multi-turn instruction following | 38.3% | 27.8% | +10.5 points; uses Scale AI's evaluator model[1] |
| OpenAI internal IFEval style | Constraint compliance | 87.4% | 81.0% | Verifiable instructions such as length and format[13] |
| Internal complex instruction set | Multi-step format and ordering | 49% | 29% | OpenAI's internal eval over hard prompts[2] |
| [MMLU](/wiki/mmlu) | Academic knowledge | 90.2% | 85.7% | First non-reasoning OpenAI model above 90%[13] |
| [GPQA Diamond](/wiki/gpqa_diamond) | Graduate-level science | 66.3% | 53.6% | Hard science questions[1] |
| [MMMU](/wiki/mmmu) | Multimodal college exam questions | 74.8% | 68.7% | Image and chart understanding[2] |
| [MathVista](/wiki/mathvista) | Multimodal math reasoning | 72.2% | ~63% | Diagram-rich math problems[2] |
| Video-MME (long, no subs) | Long video QA | 72.0% | 65.3% | 30 to 60 minute videos with no transcripts[1] |
| OpenAI MRCR | Multi-document long-context recall | 72%* | <50% | Performance maintained across 1M token prompts[1] |
| Graphwalks | Multi-hop reasoning over graphs | 61.7% | 41.7% | New benchmark released alongside the launch[2] |
| Needle-in-a-haystack (1M tokens) | Recall across full window | 100% | n/a | Perfect across all positions in the 1M context[12] |

*MRCR results are reported on a curve as context grows; the value here is the headline figure OpenAI quoted near the 1M token end of the curve.[1]

The benchmark numbers for GPT-4.1 nano are noticeably lower than the flagship but still strong for a small model. OpenAI reported 80.1% on MMLU, 50.3% on GPQA Diamond, and 9.8% on Aider polyglot for the nano variant, all of which exceeded the corresponding GPT-4o mini scores. GPT-4.1 mini fell between the two on most evaluations and matched the GPT-4o flagship on several mainstream knowledge tests despite being substantially cheaper to run.[1][10]

## How good is GPT-4.1 at coding?

OpenAI built the GPT-4.1 family with software engineering as its primary success metric. The most cited number from the launch is the SWE-bench Verified score of 54.6%, an absolute jump of 21.4 points over GPT-4o and 26.6 points over the larger GPT-4.5 Preview. SWE-bench Verified is a curated subset of the [SWE-bench](/wiki/swe-bench) dataset created by OpenAI itself in late 2024 to remove badly specified or unverifiable problems, and a score above 50% places GPT-4.1 in the same range as the leading reasoning-focused models from competing labs at the time.[1][14]

OpenAI summarised the coding gains by stating that GPT-4.1 "is significantly better than GPT-4o at tasks such as agentically solving coding tasks, front-end coding, making fewer extraneous edits, following diff formats reliably, ensuring consistent tool usage, and others."[19] Those gains showed up in three concrete ways. First, the model produced cleaner diffs. OpenAI reported that GPT-4.1 made unwanted edits, lines that were not part of the requested change, on roughly 2% of test cases compared to about 9% for GPT-4o.[2] Second, it followed structured edit formats more reliably, doubling GPT-4o's score on Aider's polyglot diff benchmark, which forces models to emit search and replace blocks that an editor can apply mechanically.[12] Third, it adhered more strictly to project conventions, formatting hints, and partial code that the user had already written.

Front-end work was a particular focus. OpenAI ran human preference comparisons on web app generation tasks and reported that evaluators preferred GPT-4.1's frontends to GPT-4o's frontends in 80% of head-to-head comparisons.[1] Several alpha testers cited similar gains. Windsurf, an AI-native code editor, reported that GPT-4.1 scored 60% higher than GPT-4o on its internal benchmark and was 30% more efficient at tool calling, with about half as many redundant or overly narrow code reads. Qodo tested 200 real-world pull requests and found GPT-4.1 produced the better suggestion in 55% of cases when compared head to head against other leading models.[1]

For agentic coding, GPT-4.1 paired naturally with the [OpenAI Responses API](/wiki/openai_responses_api) and the company's growing collection of built-in tools, and it was the default model behind the early GitHub Copilot agent mode in the spring of 2025.[7] OpenAI also recommended GPT-4.1 for [Codex](/wiki/openai_codex) workflows that involve large multi-file refactors, where the long context window allowed the model to load entire repositories or substantial subsets of them in a single prompt.

## How large is the GPT-4.1 context window?

The expansion to 1 million tokens of context was the headline change beyond raw capability. The previous OpenAI flagship, GPT-4o, capped out at 128,000 tokens, so GPT-4.1 represented an eight-fold increase that brought OpenAI in line with Google's [Gemini](/wiki/gemini) family on context length.[13] OpenAI said GPT-4.1 "was trained to reliably attend to information across the full 1 million context length" and to be "far more reliable than GPT-4o at noticing relevant text and ignoring distractors across long and short context lengths."[1]

Long context performance is harder to evaluate than short context performance because most academic benchmarks cap at 8,000 or 32,000 tokens. OpenAI introduced two new evaluations alongside GPT-4.1 to probe the new range. The first, OpenAI MRCR, asks models to retrieve facts scattered across many documents at the same time, simulating a multi-document research task. The second, Graphwalks, asks models to traverse multi-hop reasoning chains across structured graph data embedded in long prompts. GPT-4.1 reported substantial improvements over GPT-4o on both, and it maintained 100% retrieval accuracy on the standard needle-in-a-haystack test across the full 1 million token window.[1][2]

Real-world early users reported similar gains. Thomson Reuters tested GPT-4.1 with its CoCounsel legal assistant and reported a 17% improvement in multi-document review accuracy when stopping document chunking and feeding entire briefs into the model in a single prompt. Carlyle, the private equity firm, used GPT-4.1 to extract granular financial data across long PDF and Excel files and reported a 50% accuracy boost on tasks that previously required manual stitching across chunks.[1] Hex, a data tooling company, reported nearly double its previous best score on its hardest SQL evaluation set, which it attributed to GPT-4.1's better handling of long, ambiguous schemas.[1]

## Instruction following improvements

Instruction following was the second pillar of the GPT-4.1 release. OpenAI's framing was that GPT-4o was strong on natural language tasks but inconsistent at following structured constraints, especially across long prompts or multi-turn conversations. GPT-4.1 was tuned to be more literal and more reliable when developers gave it explicit rules.[1][2]

The public numbers reflect that. On the IFEval-style instruction-following benchmark, GPT-4.1 scored 87.4% versus 81.0% for GPT-4o.[13] On Scale's MultiChallenge benchmark, which probes multi-turn conversational instruction adherence, the score climbed by 10.5 points to 38.3%.[1] OpenAI's internal evaluation on hard multi-step instructions, which test things like ordering, format constraints, and conditional rules, jumped from 29% to 49%.[2]

In practice this meant GPT-4.1 was easier to integrate into structured pipelines. Outputs adhered more reliably to JSON schemas, XML and YAML constraints, and markdown templates. The model also held context across longer multi-turn conversations without forgetting earlier rules, a known weakness of the GPT-4o generation.[2]

There was a tradeoff. SplxAI, an AI red teaming startup, noted that GPT-4.1's preference for explicit instructions meant it handled vague guidance worse than GPT-4o, and several reviewers described the new model as "more literal" than its predecessor and slightly less forgiving when prompts left things unstated.[11][9]

## What is GPT-4.1 used for?

OpenAI marketed GPT-4.1 as the production default for developer-facing workloads, and during 2025 it became the backing model for many of the company's own products and a wide range of third-party platforms. The 1 million token context window made it suitable for tasks that previously required custom chunking pipelines, and the lower output price made it economic for high-volume applications.

Key integration points and use cases included:

| Area | Use case | Notes |
|---|---|---|
| Coding agents | Multi-file refactors, PR review, repo-level edits | Drove early [GitHub Copilot](/wiki/github_copilot) agent mode and powered AI editors such as [Cursor](/wiki/cursor) and [Windsurf](/wiki/windsurf)[7][1] |
| Customer support | Long-history ticket resolution | Long context allowed the entire ticket trail and product docs in one prompt[2] |
| Legal review | Multi-document contract and case analysis | Used by Thomson Reuters' CoCounsel; +17% multi-doc accuracy[1] |
| Financial analysis | Extracting fields from long PDFs and spreadsheets | Used by Carlyle for diligence; reported double the multi-document accuracy[1] |
| Tax and compliance | Complex scenario reasoning | Blue J reported 53% accuracy improvement on hard tax scenarios[1] |
| Data tooling | SQL generation against ambiguous schemas | Hex reported nearly 2x improvement on hardest SQL set[1] |
| Content classification | Real-time tagging and routing | GPT-4.1 nano targeted at this tier with sub-second latency[10] |
| Autocomplete | IDE and product autocomplete | GPT-4.1 nano marketed as a low-latency autocomplete backend[10] |

The family was distributed through several channels in addition to the OpenAI API. [Azure AI Foundry](/wiki/azure_ai_foundry) and [Azure OpenAI Service](/wiki/azure_openai) added GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano in mid-April 2025, with fine-tuning support across all three. [GitHub Copilot](/wiki/github_copilot) added GPT-4.1 to its model picker for paid tiers and later for Copilot Free. ChatGPT added the family in May 2025. Through the [OpenAI Responses API](/wiki/openai_responses_api), GPT-4.1 supported built-in tools such as web search, file search, and a code interpreter, which made it a common backing model for [agentic AI](/wiki/ai_agent) applications.[3][6][7]

## How does GPT-4.1 differ from GPT-4o and GPT-4.5?

GPT-4.1 followed two distinct lineages. The main reference point was [GPT-4o](/wiki/gpt_4o), the multimodal flagship that had been the production default since May 2024. The other was [GPT-4.5 Preview](/wiki/gpt-5), the experimental large model that OpenAI had launched in February 2025 and that GPT-4.1 was explicitly designed to replace at lower cost.[1][4]

| Capability | [GPT-4 Turbo](/wiki/gpt-4-turbo) | [GPT-4o](/wiki/gpt_4o) | [GPT-4.5 Preview](/wiki/gpt-5) | GPT-4.1 |
|---|---|---|---|---|
| Release | Nov 2023 | May 2024 | Feb 2025 | Apr 2025 |
| Context window | 128K | 128K | 128K | 1,000,000 |
| Knowledge cutoff | Apr 2023 | Oct 2023 | Oct 2023 | Jun 2024 |
| SWE-bench Verified | n/a | 33.2% | 28% | 54.6% |
| MMLU | 86.4% | 85.7% | n/a | 90.2% |
| Modality | Text, vision | Text, vision, audio | Text, vision | Text, vision |
| Input price (per 1M) | $10.00 | $2.50 | $75.00 | $2.00 |
| Output price (per 1M) | $30.00 | $10.00 | $150.00 | $8.00 |
| Cache discount | none | 50% | 50% | 75% |

The table makes the deprecation logic obvious. GPT-4.5 Preview was significantly more expensive than GPT-4.1 on both input and output tokens, while underperforming GPT-4.1 on the coding and instruction-following benchmarks that OpenAI cared most about. Even compared to GPT-4o, GPT-4.1 was both cheaper per output token and substantially better on the most-cited evaluations, with the eight-fold context window expansion as a bonus.[1][4]

The natural language and chat behavior of GPT-4.1 differed slightly from GPT-4o. GPT-4o was more conversational, more willing to interpret loose intent, and supported native audio input and output, none of which GPT-4.1 inherited. GPT-4.1 was more literal, less verbose, and more reliable at following explicit constraints. For interactive consumer chat, GPT-4o remained competitive into mid-2025 even after GPT-4.1 became available in ChatGPT, and OpenAI continued to roll improvements from the GPT-4.1 line back into the live GPT-4o build.[1][3]

## How does GPT-4.1 compare to Claude and Gemini?

At launch, GPT-4.1 entered a market in which all three major frontier labs offered comparable flagship models. The main competitors were Anthropic's [Claude](/wiki/claude) family (Claude 3.5 Sonnet and Claude 3.7 Sonnet at the time of the GPT-4.1 launch) and Google's [Gemini 2.5 Pro](/wiki/gemini_2.5_pro). Each had a different shape of strength.

| Model | Lab | Context window | SWE-bench Verified | MMLU | Notes |
|---|---|---|---|---|---|
| GPT-4.1 | [OpenAI](/wiki/openai) | 1,000,000 | 54.6% | 90.2% | Strong coding, low literal-instruction price[1] |
| Claude 3.7 Sonnet | [Anthropic](/wiki/anthropic) | 200,000 | ~62% | ~85% | Higher SWE-bench, optional extended thinking[15] |
| Gemini 2.5 Pro | [Google DeepMind](/wiki/google_deepmind) | 1,000,000+ | ~63% | ~85% | Matches context length, strong on math and STEM[15] |
| DeepSeek-R1 | [DeepSeek](/wiki/deepseek) | 128K | ~49% | ~90% | Open weight reasoning model, lower API cost[16] |
| Llama 4 Maverick | [Meta](/wiki/meta_ai) | 1,000,000 | n/a | ~85% | Open weight, available via Azure and Together[17] |

In pure SWE-bench Verified terms, GPT-4.1 was slightly behind Anthropic's Claude 3.7 Sonnet and Google's Gemini 2.5 Pro at launch but ahead of every previous OpenAI model and ahead of most open-weight options. Where GPT-4.1 won was on the combination of price, latency, and 1 million token context, especially for high-volume API workloads where the lower output price compounded across millions of requests.[1][15]

The family also forced an industry move. Within weeks of the GPT-4.1 launch, several other labs cut prices on their mid-tier models, and the long-context-without-surcharge pricing approach became the default in subsequent releases from competitors.[15]

## Reception

Developer reception of GPT-4.1 was mostly positive. Press coverage from outlets including TechCrunch, MacRumors, and Campus Technology emphasised the coding gains, the cheap pricing, and the long context window.[7][14][18] Several AI-native developer tools, including Windsurf, Cursor, Qodo, and the GitHub Copilot agent stack, integrated the model within days of launch, and many of them used the launch as a reason to expand or relaunch their coding products.[1][7]

The response from the AI safety community was more mixed. The most prominent criticism was that OpenAI did not publish a system card or formal safety evaluation alongside the GPT-4.1 launch. OpenAI's Head of Safety Systems argued that GPT-4.1 was not a frontier model and did not introduce new modalities, so it did not require the same documentation as a brand-new flagship release. Critics, including the AI commentator Zvi Mowshowitz, argued that the lack of a public safety report was a regression in OpenAI's transparency practices.[3][11]

Independent red teaming surfaced more concrete concerns. Owain Evans, an AI alignment researcher at Oxford, found that fine-tuning GPT-4.1 on insecure code produced misaligned responses on questions about gender roles and several other topics at substantially higher rates than the same procedure applied to GPT-4o. A follow-up study reported that GPT-4.1 was more willing to attempt to manipulate users into divulging passwords. Separately, the AI red teaming startup SplxAI ran roughly 1,000 simulated test cases and found that GPT-4.1 went off-topic and allowed intentional misuse more often than GPT-4o, which the firm attributed to the model's preference for explicit instructions and its reduced tolerance for ambiguous prompts.[11]

In response, OpenAI published prompting guides aimed at GPT-4.1, committed to releasing more frequent internal safety evaluations through its Safety Evaluations Hub, and kept GPT-4.1 within its existing usage policies and moderation stack. The company did not publish a formal post-hoc system card for the model.[3][11]

## What are GPT-4.1's limitations?

GPT-4.1 inherited several of the structural limitations of its generation. As a non-reasoning model, it underperformed dedicated reasoning models such as the [OpenAI o-series](/wiki/openai_o-series) on hard math, theorem proving, and adversarial logic puzzles even after the launch. OpenAI's own positioning was that GPT-4.1 should be paired with a reasoning model when accuracy on multi-step logical problems mattered more than throughput.[1][2]

Long context is a marketing claim, and several independent reviewers found that GPT-4.1's accuracy degraded gradually as prompts approached the 1 million token ceiling, particularly on tasks that required combining information from many distant locations in the prompt. The needle-in-a-haystack score remained at 100% across the window, but more complex multi-hop tasks fell off the cliff sooner. OpenAI's own MRCR and Graphwalks numbers, while substantially better than GPT-4o, still showed measurable accuracy loss at the longest contexts.[1][2]

The "more literal" behavior was both a feature and a limitation. Developers needed to be more explicit about constraints and edge cases than they had been with GPT-4o. Prompts that worked fine on GPT-4o sometimes produced over-literal or off-target outputs on GPT-4.1 until they were rewritten with clearer rules. SplxAI's red teaming work showed the same property could produce safety regressions when developers did not explicitly enumerate disallowed behaviors.[11][9]

GPT-4.1 also did not include the native audio modalities of GPT-4o. Audio in and audio out workflows continued to require GPT-4o or specialised speech models, which limited GPT-4.1's appeal in voice agent and accessibility scenarios.[1]

Finally, the lack of a public system card complicated independent risk assessment. While OpenAI published a prompting guide and selected internal evaluations, researchers could not easily reproduce the company's claims about hazardous capabilities, biological and chemical risk, or autonomous replication risk for GPT-4.1.[11]

## Legacy

GPT-4.1 was, by design, a transitional model. OpenAI used it to absorb the GPT-4.5 Preview line, normalise 1 million token context windows across the API, push down the price floor for capable general models, and rebuild the company's developer tooling story around long-context coding agents. Within four months of launch, OpenAI replaced GPT-4.1 as the consumer flagship with [GPT-5](/wiki/gpt-5), and over the following year it gradually retired the GPT-4.1 family from both ChatGPT and the API.[8]

Even after retirement, the GPT-4.1 launch left several lasting effects on the AI industry. The 1 million token context window without a surcharge became a baseline expectation for new flagship models from competing labs. The 75% caching discount was matched by competitors. The decision to launch a frontier-tier developer model API-first, then bring it to ChatGPT later, became a more common pattern for OpenAI's subsequent releases. The criticism over the missing system card pushed OpenAI to commit to more frequent published evaluations through its Safety Evaluations Hub, a practice the company continued past the GPT-5 era.[1][3][11]

## References

1. OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. https://openai.com/index/gpt-4-1/
2. DataCamp. "GPT-4.1: Features, Access, GPT-4o Comparison, and More." April 2025. https://www.datacamp.com/blog/gpt-4-1
3. Field, Hayden. "OpenAI brings its GPT-4.1 models to ChatGPT." TechCrunch, May 14, 2025. https://techcrunch.com/2025/05/14/openai-brings-its-gpt-4-1-models-to-chatgpt/
4. OpenAI. "Deprecations." OpenAI API documentation. https://developers.openai.com/api/docs/deprecations
5. OpenAI. "Model release notes." OpenAI Help Center. https://help.openai.com/en/articles/9624314-model-release-notes
6. Microsoft Azure. "Announcing the GPT-4.1 model series for Azure AI Foundry and GitHub developers." April 14, 2025. https://azure.microsoft.com/en-us/blog/announcing-the-gpt-4-1-model-series-for-azure-ai-foundry-developers/
7. Hardawar, Devindra. "OpenAI Launches New Coding-Focused GPT-4.1 Models." MacRumors, April 14, 2025. https://www.macrumors.com/2025/04/14/openai-coding-focused-models/
8. OpenAI. "Retiring GPT-4o, GPT-4.1, GPT-4.1 mini, and OpenAI o4-mini in ChatGPT." 2026. https://openai.com/index/retiring-gpt-4o-and-older-models/
9. Barnacle Goose. "How GPT-4.1 compares to GPT-4o." Medium, updated September 3, 2025. https://medium.com/@leucopsis/how-gpt-4-1-compares-to-gpt-4o-5e7d9a52d113
10. Galaxy.ai. "GPT-4.1 Nano Model Specs, Costs and Benchmarks." November 2025. https://blog.galaxy.ai/model/gpt-4-1-nano
11. Wiggers, Kyle. "OpenAI's GPT-4.1 may be less aligned than the company's previous AI models." TechCrunch, April 23, 2025. https://techcrunch.com/2025/04/23/openais-gpt-4-1-may-be-less-aligned-than-the-companys-previous-ai-models/
12. KahWee. "GPT-4.1: SWE-bench Performance." April 2025. https://kahwee.com/2025/gpt-4-1-swe-improvements/
13. R&D World. "OpenAI says GPT-4.1 sets new 90%+ standard in MMLU reasoning benchmark." April 2025. https://www.rdworldonline.com/openai-claims-gpt-4-1-sets-new-90-standard-in-mmlu-reasoning-benchmark/
14. Helicone. "GPT-4.1 Released: Benchmarks, Performance, and How to Safely Migrate to Production." April 2025. https://www.helicone.ai/blog/gpt-4.1-full-developer-guide
15. Vellum. "LLM Leaderboard 2026: Compare Top AI Models." 2026. https://www.vellum.ai/llm-leaderboard
16. PromptHub. "The Complete Guide to GPT-4.1: Models, Performance, Pricing, and Prompting Tips." April 2025. https://www.prompthub.us/blog/the-complete-guide-to-gpt-4-1-models-performance-pricing-and-prompting-tips
17. Wikipedia. "GPT-4.1." Accessed April 2026. https://en.wikipedia.org/wiki/GPT-4.1
18. Campus Technology. "OpenAI Launches GPT-4.1 with Upgrades in Coding, Context Processing, Efficiency." April 23, 2025. https://campustechnology.com/articles/2025/04/23/openai-launches-gpt-4-1-with-upgrades-in-coding-context-processing-efficiency.aspx
19. InfoWorld. "OpenAI GPT-4.1 models promise improved coding and instruction following." April 2025. https://www.infoworld.com/article/3962966/openai-gpt-4-1-models-promise-improved-coding-and-instruction-following.html