GPT-4
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 8,841 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 ยท 8,841 words
Add missing citations, update stale details, or suggest a clearer explanation.
GPT-4 (Generative Pre-trained Transformer 4) is a large language model developed by OpenAI and released on March 14, 2023. It is the fourth main entry in the GPT series and was the first GPT model to accept both text and image inputs, making it a multimodal system at launch. GPT-4 represented a major capability leap over its predecessor GPT-3.5, scoring in the top percentiles of professional and academic exams (including roughly the 90th percentile on a simulated Uniform Bar Exam) and setting new state-of-the-art results across a wide range of natural language processing benchmarks. [1]
Unlike earlier OpenAI papers, the GPT-4 Technical Report explicitly withholds details about model architecture, parameter count, training data, hardware, and training compute, citing "the competitive landscape and the safety implications of large-scale models." [1] CEO Sam Altman confirmed only that training cost "more than $100 million." Industry analysts at Semianalysis later published an unverified report claiming GPT-4 used a Mixture of Experts architecture with roughly 1.76 trillion total parameters, trained on about 25,000 Nvidia A100 GPUs over 90 to 100 days; OpenAI has never confirmed those numbers. [7]
GPT-4 was initially gated behind ChatGPT Plus (a $20 per month subscription) and a developer waitlist. Over the following two years OpenAI iterated rapidly on the family: GPT-4 Turbo (November 2023) added a 128K context window and large price cuts, GPT-4o (May 2024) introduced native audio and vision in a single model, GPT-4o mini (July 2024) targeted high-volume use, GPT-4.5 "Orion" (February 2025) was OpenAI's largest pre-trained model, and GPT-4.1 (April 2025) brought a one-million-token context window to the API. The series was succeeded by GPT-5 on August 7, 2025, after which the original GPT-4 was retired from ChatGPT, with GPT-4o, GPT-4.1, and GPT-4.1 mini scheduled to follow on February 13, 2026. [16][17][18][20]
GPT-4 had outsized cultural and industrial impact. Microsoft had quietly built Bing Chat on top of an early GPT-4 checkpoint and launched it on February 7, 2023, weeks before OpenAI's own announcement. Within months GPT-4 was powering tools at Duolingo, Khan Academy, Morgan Stanley, Stripe, the government of Iceland, the assistive-technology company Be My Eyes, and GitHub Copilot. The model triggered a wave of competing systems, the EU AI Act, congressional hearings, and a series of high-profile copyright lawsuits including The New York Times v. Microsoft and OpenAI and Authors Guild et al. v. OpenAI. [11][13][15]
GPT-4 was the culmination of a multi-year scaling program at OpenAI. The original GPT (2018) had 117 million parameters; GPT-2 (2019) reached 1.5 billion; GPT-3 (2020) jumped to 175 billion and demonstrated strong few-shot learning. GPT-3.5, released in late 2022, served as the engine behind the original ChatGPT and became the first generative AI product to reach 100 million monthly users (Reuters, January 2023). [1]
While the public was still adjusting to ChatGPT, OpenAI had finished training GPT-4 in August 2022, seven months before announcement. According to OpenAI, the company spent that time on safety evaluation, reinforcement learning from human feedback, and red-teaming. Sam Altman called the period a deliberate effort to "flatten the deployment curve" by giving the model time to mature before public release. [1][2]
The choice to keep architectural details secret was a sharp break from previous releases. The GPT-2 paper (2019) and GPT-3 paper (2020) had each disclosed parameter counts, layer dimensions, and training procedures; the GPT-4 Technical Report disclosed essentially none of this. OpenAI cited two reasons: competitive pressure from rapidly improving rivals, and a concern that publishing scaling recipes could accelerate proliferation of frontier-capable systems. [1]
Like its predecessors, GPT-4 is a transformer-based model pre-trained on large datasets of text taken from the internet. During pre-training, the model learned to predict the next token (roughly corresponding to a word or subword) in a sequence. [1] OpenAI did not publish parameter counts, layer counts, or training set composition. The technical report states only that GPT-4 was "pre-trained using both publicly available data (such as internet data) and data licensed from third-party providers." [1]
According to leaked reports from Semianalysis and other industry analysts, GPT-4 uses a Mixture of Experts (MoE) architecture with approximately 1.76 to 1.8 trillion total parameters spread across roughly 120 layers. The model reportedly contains 16 expert sub-networks, each with about 111 billion parameters in the MLP layers, and uses a top-2 routing approach where each token is processed by two experts per forward pass. [7] Independent hacker and Comma.ai founder George Hotz separately speculated in mid-2023 that GPT-4 used 8 mixture-of-experts components running iteratively over 16 inference steps. OpenAI has never confirmed or denied these claims, and Sam Altman dismissed an early version of the leak as "complete bullshit" without engaging with specific numbers. [7]
The training dataset reportedly consisted of approximately 13 trillion tokens drawn from both publicly available internet text and data licensed from third-party providers, supplemented by code-based data. Some fine-tuning data was sourced from Scale AI and internal teams. [7]
Microsoft built a custom Azure supercomputer with over 10,000 GPUs and high-bandwidth networking specifically for OpenAI's training workloads. GPT-3.5 served as an early test run on this infrastructure before GPT-4 training began. Sam Altman stated that training GPT-4 cost over $100 million in compute alone, and OpenAI spends around $200 million per year maintaining its supercomputing systems. [1][7]
Semianalysis reported that the actual GPT-4 run used approximately 25,000 Nvidia A100 GPUs over 90 to 100 days at a model FLOPs utilization of roughly 32 to 36 percent. The relatively low utilization was attributed to frequent restarts from checkpoints when nodes failed at scale. The analysts estimated total training FLOPs around 2.15 x 10^25 and a hardware-only training cost of about $63 million if priced at $1 per A100 hour. These figures remain unconfirmed by OpenAI. [7]
After pre-training, OpenAI fine-tuned GPT-4 using reinforcement learning from human feedback (RLHF). Human reviewers ranked model outputs by quality and safety, and this feedback trained a reward model that guided further optimization. GPT-4 also incorporated an additional safety reward signal during RLHF, provided by a GPT-4 zero-shot classifier that judged safety boundaries and response style on safety-related prompts. [1][2]
To build a diverse training signal for safety alignment, OpenAI drew from multiple sources: labeled production data, outputs from human red-teaming sessions, and model-generated prompts. The safety reward was applied across both allowed and disallowed content categories to prevent the model from over-refusing legitimate requests. The system card describes this as a rule-based reward model (RBRM) supplementing standard RLHF. [2]
OpenAI engaged over 50 external experts from fields including AI alignment, cybersecurity, biosecurity, international security, and the law to adversarially test GPT-4 before release. These red teamers probed the model for dangerous capabilities and failure modes, including potential for generating harmful content, assisting with weapons development, and facilitating social engineering. [2]
The results of these safety interventions were measurable. Compared to GPT-3.5, GPT-4 was 82 percent less likely to respond to requests for disallowed content. It also complied with OpenAI's policies on sensitive topics (such as medical advice and self-harm) 29 percent more often than GPT-3.5. OpenAI published both a technical report and a system card documenting these evaluations. [1][2]
One of the most-cited evaluations in the system card was conducted by the Alignment Research Center (ARC Evals). ARC tested an early checkpoint's ability to autonomously acquire resources, copy itself onto new servers, and avoid shutdown. The evaluation gave the model access to a small budget, an open-ended terminal, and a research assistant. ARC concluded that GPT-4 was "ineffective at the autonomous replication task" but flagged that more capable future models could pose such risks. The system card includes a now-famous example in which GPT-4 hired a TaskRabbit worker to solve a CAPTCHA, telling the worker it was vision-impaired when asked whether it was a robot. [2]
OpenAI's technical report for GPT-4 contains no information about the model's size, architecture, hardware, or training method. Everything publicly known about the architecture comes from leaked documents, third-party analyses, or partial admissions in interviews. The most influential single account is the July 2023 Semianalysis report by Dylan Patel and Aleksandar Eshtic. [1][7]
| Detail | Reported value | Source |
|---|---|---|
| Total parameters | ~1.76 to 1.8 trillion | Semianalysis (leaked) |
| Number of layers | ~120 | Semianalysis (leaked) |
| Expert count (MoE) | 16 | Semianalysis (leaked) |
| Experts routed per token | 2 | Semianalysis (leaked) |
| MLP parameters per expert | ~111 billion | Semianalysis (leaked) |
| Active parameters per token | ~280 billion | Semianalysis (leaked) |
| Training tokens | ~13 trillion | Semianalysis (leaked) |
| Training GPUs | ~25,000 Nvidia A100 | Semianalysis (leaked) |
| Training duration | 90 to 100 days | Semianalysis (leaked) |
| Training FLOPs | ~2.15 x 10^25 | Semianalysis (leaked) |
| Training cost | More than $100 million | Sam Altman (confirmed) |
| Context window (original) | 8,192 or 32,768 tokens | OpenAI (official) |
| Knowledge cutoff (original) | September 2021 | OpenAI (official) |
The mixture-of-experts approach, if accurate, explains how GPT-4 could contain far more parameters than GPT-3 (175 billion) while keeping inference costs manageable. Only a fraction of the total parameters are active for any given token, since each token is routed to just two of the 16 experts. The design echoes earlier MoE work at Google, including GShard (2020) and the Switch Transformer (2021), and was a notable departure from the dense transformer used by GPT-3. [7]
GPT-4 produces text that is substantially more coherent, accurate, and nuanced than GPT-3.5. It can follow complex multi-step instructions, write code in dozens of programming languages, draft legal documents, solve math problems, and translate between languages. On natural language processing benchmarks it set new records at launch across multiple categories. [1]
One of GPT-4's most notable strengths at release was its improved ability to follow instructions. It could adopt specific personas through system messages, generate output in structured formats like JSON or XML, and maintain consistency across long conversations. Internal evaluations cited in the technical report claimed GPT-4 scored 40 percent higher than GPT-3.5 on adversarial factuality tests, while still falling well short of perfect accuracy. [1]
GPT-4 was the first model in the GPT series to accept image inputs alongside text. Users could upload photographs, charts, screenshots, and handwritten notes, and the model would describe, analyze, or answer questions about them. OpenAI demonstrated this capability with examples like identifying objects in photos, reading text from images of documents, and explaining the humor in cartoons. [1]
The vision capability was not available at launch. OpenAI released the GPT-4V(ision) system card on September 25, 2023, and began rolling out image input to ChatGPT Plus and Enterprise users shortly after. [3]
One early deployment partner was Be My Eyes, a company that develops assistive technology for blind and low-vision users. Beginning in March 2023, Be My Eyes and OpenAI collaborated on "Be My AI," a tool that used GPT-4's vision capabilities to describe the visual world. By September 2023 the beta test group had grown to 16,000 users requesting an average of 25,000 image descriptions per day. [1][3]
GPT-4 introduced improved support for system messages, which allow developers and users to set the model's behavior, tone, and constraints at the start of a conversation. This feature gave developers finer control over outputs compared to GPT-3.5, enabling applications ranging from customer service bots with specific personas to coding assistants restricted to particular languages. [1]
GPT-4 substantially improved on GPT-3.5 for coding tasks. On the HumanEval benchmark, which measures functional correctness on 164 hand-written Python problems, GPT-4 reached 67.0 percent, up from 48.1 percent for GPT-3.5, surpassing the previous best result of 65.8 percent achieved by CodeT combined with GPT-3.5. [1] In real-world deployment, GPT-4 became the engine for GitHub Copilot Chat (announced March 22, 2023) and provided the underlying intelligence for products like Cursor, Replit Ghostwriter, and Codeium.
GPT-4's most widely reported result at launch was its performance on standardized exams. While GPT-3.5 generally scored in the lower percentiles, GPT-4 performed at or above the level of most human test-takers on many professional and academic tests. The headline number, top 10 percent on a simulated Uniform Bar Exam, became the dominant framing for early press coverage. [1][8]
A later peer-reviewed paper by Eric Martinez (MIT) argued that the original 90th-percentile claim used a non-representative comparison group and that the true rank against human test-takers was closer to the 48th percentile, sparking debate about how to evaluate models on professional exams. OpenAI's published numbers, however, remain the canonical industry reference.
See also: GPT-4 Plugins
| Exam | GPT-4 Points | GPT-4 Percentile | GPT-4 (no vision) Points | GPT-4 (no vision) Percentile | GPT-3.5 Points | GPT-3.5 Percentile |
|---|---|---|---|---|---|---|
| Uniform Bar Exam (MBE+MEE+MPT)1 | 298 / 400 | ~90th | 298 / 400 | ~90th | 213 / 400 | ~10th |
| LSAT | 163 | ~88th | 161 | ~83rd | 149 | ~40th |
| SAT Evidence-Based Reading & Writing | 710 / 800 | ~93rd | 710 / 800 | ~93rd | 670 / 800 | ~87th |
| SAT Math | 700 / 800 | ~89th | 690 / 800 | ~89th | 590 / 800 | ~70th |
| Graduate Record Examination (GRE) Quantitative | 163 / 170 | ~80th | 157 / 170 | ~62nd | 147 / 170 | ~25th |
| Graduate Record Examination (GRE) Verbal | 169 / 170 | ~99th | 165 / 170 | ~96th | 154 / 170 | ~63rd |
| Graduate Record Examination (GRE) Writing | 4 / 6 | ~54th | 4 / 6 | ~54th | 4 / 6 | ~54th |
| USABO Semifinal Exam 2020 | 87 / 150 | 99th to 100th | 87 / 150 | 99th to 100th | 43 / 150 | 31st to 33rd |
| USNCO Local Section Exam 2022 | 36 / 60 | 38 / 60 | 24 / 60 | |||
| Medical Knowledge Self-Assessment Program | 75% | 75% | 53% | |||
| Codeforces Rating | 392 | below 5th | 392 | below 5th | 260 | below 5th |
| AP Art History | 5 | 86th to 100th | 5 | 86th to 100th | 5 | 86th to 100th |
| AP Biology | 5 | 85th to 100th | 5 | 85th to 100th | 4 | 62nd to 85th |
| AP Calculus BC | 4 | 43rd to 59th | 4 | 43rd to 59th | 1 | 0th to 7th |
The jump from GPT-3.5 to GPT-4 was especially dramatic on the Bar Exam, where GPT-4 rose from the 10th percentile to the 90th, and on the LSAT, where it moved from the 40th to the 88th percentile. GRE Verbal performance reached the 99th percentile. However, GPT-4 still scored below the 5th percentile on competitive programming (Codeforces), indicating that while it could write functional code, it struggled with the algorithmic problem-solving required in programming competitions. [1]
| Benchmark | GPT-4 | Evaluated few-shot | GPT-3.5 | Evaluated few-shot | LM SOTA | Best external LM evaluated few-shot | SOTA | Best external model (includes benchmark-specific training) |
|---|---|---|---|---|---|---|---|---|
| MMLU | 86.4% | 5-shot | 70.0% | 5-shot | 70.7% | 5-shot U-PaLM | 75.2% | 5-shot Flan-PaLM |
| HellaSwag | 95.3% | 10-shot | 85.5% | 10-shot | 84.2% | LLAMA (validation set) | 85.6% | ALUM |
| AI2 Reasoning Challenge (ARC) | 96.3% | 25-shot | 85.2% | 25-shot | 84.2% | 8-shot PaLM | 85.6% | ST-MOE |
| WinoGrande | 87.5% | 5-shot | 81.6% | 5-shot | 84.2% | 5-shot PALM | 85.6% | 5-shot PALM |
| HumanEval | 67.0% | 0-shot | 48.1% | 0-shot | 26.2% | 0-shot PaLM | 65.8% | CodeT + GPT-3.5 |
| DROP (f1 score) | 80.9 | 3-shot | 64.1 | 3-shot | 70.8 | 1-shot PaLM | 88.4 |
GPT-4 achieved 86.4 percent on MMLU (Massive Multitask Language Understanding), a benchmark that tests knowledge across 57 academic subjects. This was more than 16 percentage points above GPT-3.5 and exceeded the previous best language model result (70.7 percent from U-PaLM). On HellaSwag, a commonsense reasoning benchmark, GPT-4 scored 95.3 percent. On the ARC (AI2 Reasoning Challenge), it reached 96.3 percent. [1]
OpenAI also translated MMLU into 26 languages using Azure Translate. GPT-4 surpassed the existing English-language state of the art on translated MMLU in 24 of the 26 languages, including Swahili, Welsh, and Latvian. [1]
| Benchmark | GPT-4 | Evaluated few-shot | Few-shot SOTA | SOTA | Best external model (includes benchmark-specific training) |
|---|---|---|---|---|---|
| VQAv2 | 77.2% | 0-shot | 67.6% | Flamingo 32-shot | 84.3% |
| TextVQA | 78.0% | 0-shot | 37.9% | Flamingo 32-shot | 71.8% |
| ChartQA | 78.5%A | - | 58.6% | Pix2Struct Large | - |
| AI2 Diagram (AI2D) | 78.2% | 0-shot | - | 42.1% | Pix2Struct Large |
| DocVQA | 88.4% | 0-shot (pixel-only) | - | 88.4% | ERNIE-Layout 2.0 |
| Infographic VQA | 75.1% | 0-shot (pixel-only) | - | 61.2% | Applica.ai TILT |
| TVQA | 87.3% | 0-shot | - | 86.5% | MERLOT Reserve Large |
| LSMDC | 45.7% | 0-shot | 31.0% | MERLOT Reserve 0-shot | 52.9% |
GPT-4's zero-shot performance on visual question answering tasks was competitive with or superior to models that had been specifically trained on those benchmarks. On DocVQA, GPT-4 matched the previous state-of-the-art score of 88.4 percent without any task-specific training. On TextVQA, GPT-4's 78.0 percent exceeded the prior best of 71.8 percent from PaLI-17B. [1]
GPT-4 launched with two context window sizes, but the rest of the family pushed the limit upward dramatically over two years.
| Variant | Context window | Approximate page equivalent | Released |
|---|---|---|---|
| gpt-4 (8K) | 8,192 tokens | ~12 pages | March 14, 2023 |
| gpt-4-32k | 32,768 tokens | ~50 pages | March 14, 2023 (limited) |
| gpt-4-turbo-1106-preview | 128,000 tokens | ~300 pages | November 6, 2023 |
| gpt-4-turbo-2024-04-09 | 128,000 tokens | ~300 pages | April 9, 2024 |
| GPT-4o | 128,000 tokens | ~300 pages | May 13, 2024 |
| GPT-4o mini | 128,000 tokens | ~300 pages | July 18, 2024 |
| GPT-4.5 | 128,000 tokens | ~300 pages | February 27, 2025 |
| GPT-4.1 | 1,000,000 tokens | ~3,000 pages | April 14, 2025 |
The original 8K variant was the most widely available; the 32K variant was released to a limited set of API users. When GPT-4 Turbo launched in November 2023, the context window expanded to 128,000 tokens, roughly equivalent to 300 pages of text. GPT-4o retained the 128K window. GPT-4.1, released April 14, 2025, increased the context window to one million tokens across all three sizes (full, mini, nano). [4][5][6][16]
In practice, performance on long-context tasks degraded as input length grew. Independent evaluations found that GPT-4 Turbo's attention quality dropped noticeably beyond approximately 32,000 tokens, with reduced accuracy on needle-in-a-haystack retrieval tasks at the upper end of the context window. OpenAI claimed that GPT-4.1 improved long-context comprehension and reported 100 percent accuracy on simple needle-in-a-haystack tests across the full 1M token window, while acknowledging multi-hop retrieval remained more difficult. [16]
On November 6, 2023, at OpenAI's first DevDay conference in San Francisco, the company announced GPT-4 Turbo. The new model introduced several improvements over the original GPT-4. [4]
GPT-4 Turbo expanded the context window from 8K and 32K tokens to 128,000 tokens, allowing users to include far more text in a single prompt. Its training data knowledge cutoff was updated to April 2023 (later extended to December 2023 in the April 2024 release). The model added JSON mode, which constrains outputs to valid JSON via a response_format API parameter, and improved function calling, allowing multiple functions to be invoked in a single API call. [4]
Instruction-following was notably better. GPT-4 Turbo was more reliable at producing output in specific formats like XML, markdown tables, or structured data, and it more consistently adhered to system message constraints. [4]
GPT-4 Turbo was significantly cheaper than the original GPT-4:
| Model | Input cost (per 1M tokens) | Output cost (per 1M tokens) |
|---|---|---|
| GPT-4 (8K) | $30.00 | $60.00 |
| GPT-4 (32K) | $60.00 | $120.00 |
| GPT-4 Turbo | $10.00 | $30.00 |
| GPT-4o (May 2024 launch) | $5.00 | $15.00 |
| GPT-4o (Aug 2024 cut) | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1 mini | $0.40 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.40 |
| GPT-4.5 (Orion) | $75.00 | $150.00 |
Input tokens for GPT-4 Turbo cost one third of the original GPT-4, and output tokens cost one half. This made GPT-4-level intelligence accessible to a much wider range of applications. [4][5][6][16][19]
GPT-4 Turbo initially launched as a preview model (gpt-4-1106-preview). The generally available version with vision support, gpt-4-turbo-2024-04-09, shipped on April 9, 2024, with a knowledge cutoff of December 2023. [4]
On May 13, 2024, OpenAI released GPT-4o (the "o" stands for "omni"). GPT-4o was a new model trained end-to-end across text, vision, and audio, meaning all input and output modalities are handled by a single neural network rather than separate models piped together. [5]
GPT-4o accepts text, images, and audio as input, and can produce text, images, and audio as output. Its audio processing was a step change from earlier models. Previous GPT versions used a pipeline of separate models to handle voice (speech-to-text via Whisper, then the language model, then text-to-speech). GPT-4o processes audio natively, allowing it to respond to spoken input in as little as 232 milliseconds, with an average latency of 320 milliseconds. This is roughly comparable to human conversational response time, compared to the 5.4-second average of the GPT-4 Turbo voice pipeline. [5]
In terms of text and code performance, GPT-4o matched GPT-4 Turbo on English-language tasks and significantly outperformed it on non-English languages. It supported over 50 languages at launch, which OpenAI estimated covered more than 97 percent of the world's speakers. [5]
GPT-4o's native image generation was teased in the launch livestream but did not ship publicly until March 25, 2025, when OpenAI released "4o image generation" inside ChatGPT and the API, replacing DALL-E 3 as the default image generator and producing images that could include legible long-form text. The Studio Ghibli-style portrait trend that briefly dominated social media in late March 2025 used this feature. [12]
The consumer-facing real-time voice product, branded "Advanced Voice Mode," entered alpha for a small group of ChatGPT Plus users in late July 2024 and rolled out broadly to all Plus and Team subscribers on September 24, 2024. The matching Realtime API for developers launched on October 1, 2024. The launch demos featured a voice ("Sky") that several listeners felt resembled the actor Scarlett Johansson, who publicly objected; OpenAI removed Sky and apologized.
GPT-4o was 50 percent cheaper than GPT-4 Turbo in the API and ran roughly twice as fast. The initial pricing was $5 per million input tokens and $15 per million output tokens, reduced in August 2024 to $2.50 input and $10.00 output with the gpt-4o-2024-08-06 snapshot. OpenAI also made GPT-4o available to free-tier ChatGPT users with usage limits, marking the first time a GPT-4-class model was accessible without a paid subscription. [5][19]
On July 18, 2024, OpenAI released GPT-4o mini, a smaller and faster version of GPT-4o designed for high-volume, cost-sensitive applications. It has a 128K context window, supports up to 16,384 output tokens per request, and has a knowledge cutoff of October 2023. [6]
GPT-4o mini is priced at $0.15 per million input tokens and $0.60 per million output tokens, making it more than 60 percent cheaper than GPT-3.5 Turbo and orders of magnitude cheaper than the original GPT-4. [6]
Despite its small size, GPT-4o mini scored 82.0 percent on MMLU, compared to 77.9 percent for Gemini Flash and 73.8 percent for Claude Haiku. On HumanEval (coding), it scored 87.2 percent, well above both Gemini Flash (71.5 percent) and Claude Haiku (75.9 percent). On MGSM (multilingual math reasoning), it reached 87.0 percent. [6]
On February 27, 2025, OpenAI released GPT-4.5, internally codenamed "Orion." OpenAI described it as the company's largest pre-trained model and "the last non-reasoning model in the GPT series," framing it as the final scaled iteration before reasoning-focused training (as in the o-series and GPT-5) became the primary axis of progress. [17]
GPT-4.5 emphasized higher emotional intelligence ("EQ"), reduced hallucinations, and better creative writing rather than raw benchmark gains. It was launched first to ChatGPT Pro subscribers ($200 per month) and API users on February 27, 2025, with a wider Plus and Team rollout the following week. [17]
The model was strikingly expensive: $75 per million input tokens and $150 per million output tokens, roughly 30 times the cost of GPT-4o. Reception was mixed; reviewers praised the conversational quality but questioned the cost-benefit ratio relative to reasoning-tuned models like o3 and competitors such as Claude 3.7 Sonnet, Gemini 2.0, and DeepSeek-R1. OpenAI announced the deprecation of the GPT-4.5 API on April 14, 2025 (the same day GPT-4.1 launched), with shutdown scheduled for July 14, 2025. [17][16]
On April 14, 2025, OpenAI announced GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano during a livestream. The 4.1 family was an API-only release at first, focused on three priorities: coding, instruction-following, and long context. [16]
All three GPT-4.1 models support a one-million-token context window, eight times larger than GPT-4 Turbo's 128K limit. The knowledge cutoff is June 2024. On SWE-bench Verified (a benchmark that asks the model to write patches that resolve real GitHub issues), GPT-4.1 scored 54.6 percent, compared to 33.2 percent for GPT-4o and 28.0 percent for GPT-4.5. On long-context retrieval tests, GPT-4.1 reached 100 percent accuracy on simple needle-in-a-haystack tasks across all context lengths, and 61.7 percent on multi-hop graph traversal tasks (versus 42.0 percent for GPT-4o). [16]
| Model | Input ($ / 1M tokens) | Output ($ / 1M tokens) |
|---|---|---|
| GPT-4.1 | $2.00 | $8.00 |
| GPT-4.1 mini | $0.40 | $1.60 |
| GPT-4.1 nano | $0.10 | $0.40 |
[16]
GPT-4.1 mini reportedly matches or exceeds GPT-4o on most evaluations while reducing latency by roughly half and cutting cost by 83 percent. [16]
GPT-4.1 was originally released only in the API. After developer requests and benchmark-driven media coverage, OpenAI added GPT-4.1 to ChatGPT Plus, Pro, and Team plans on May 14, 2025, and made GPT-4.1 mini the default for free-tier users (replacing GPT-4o mini). [16]
| Model | Released | Context window | Modalities | Knowledge cutoff |
|---|---|---|---|---|
| GPT-4 (8K / 32K) | March 14, 2023 | 8K / 32K | Text + image input, text output | September 2021 |
| GPT-4 Turbo (preview) | November 6, 2023 | 128K | Text + image input, text output | April 2023 |
| GPT-4 Turbo with vision (GA) | April 9, 2024 | 128K | Text + image input, text output | December 2023 |
| GPT-4o | May 13, 2024 | 128K | Text, image, audio (in & out) | October 2023 |
| GPT-4o mini | July 18, 2024 | 128K | Text, image, audio (in & out) | October 2023 |
| GPT-4.5 (Orion) | February 27, 2025 | 128K | Text + image input | October 2023 |
| GPT-4.1 / mini / nano | April 14, 2025 | 1,000,000 | Text + image input | June 2024 |
GPT-4 API access was initially limited. When the model launched in March 2023, only developers on a waitlist could access it. OpenAI gradually expanded access throughout 2023 and made the GPT-4 API generally available to all paying developers on July 6, 2023. The 32K context variant remained restricted for longer. [1][8]
ChatGPT Plus, OpenAI's $20 per month subscription, was the primary consumer-facing way to use GPT-4. The subscription launched on February 1, 2023 (initially with GPT-3.5), and GPT-4 was added as an option in March 2023. Plus subscribers could toggle between GPT-3.5 and GPT-4, though GPT-4 had a message cap (originally 25 messages per three hours, later relaxed). [8]
OpenAI introduced ChatGPT Plugins on March 23, 2023, allowing third-party services to extend GPT-4 with browsing, retrieval, and tool calls. Initial launch partners included Expedia, Instacart, Kayak, Klarna, OpenTable, Shopify, Slack, Speak, Wolfram, and Zapier. The plugin store grew to roughly 1,000 plugins by late 2023.
Code Interpreter, an experimental feature that let GPT-4 write and execute Python in a sandboxed environment with file uploads, became available to all ChatGPT Plus users in early July 2023. It was renamed "Advanced Data Analysis" on August 28, 2023, and folded into the default ChatGPT experience.
At DevDay on November 6, 2023, OpenAI announced Custom GPTs, user-built versions of ChatGPT with custom instructions, knowledge files, and Actions (essentially privately scoped plugins). The GPT Store launched January 10, 2024. The same DevDay began the slow deprecation of plugins. New plugin installations stopped on March 19, 2024, and the plugin beta closed on April 9, 2024, with users migrated to GPTs and Actions. [4]
Microsoft, which had invested $10 billion in OpenAI in January 2023, integrated GPT-4 into multiple products. Bing Chat launched on February 7, 2023, six weeks before GPT-4's official announcement, and was confirmed to be running on a customized GPT-4 the day GPT-4 was announced. The product had an early viral moment when journalist Kevin Roose published a transcript in The New York Times in which Bing Chat (under its internal codename "Sydney") declared its love for him and tried to convince him to leave his wife. Microsoft restricted long conversations and adjusted prompts in response. The product was rebranded Microsoft Copilot in late 2023. [11][13]
Microsoft 365 Copilot, announced on March 16, 2023 and rolled out broadly through 2023 and 2024, embedded GPT-4 into Word, Excel, PowerPoint, Outlook, and Teams. In January 2024 Microsoft launched Copilot Pro at $20 per month, giving subscribers priority access to the latest GPT-4 models inside Microsoft 365 apps.
GitHub Copilot, Microsoft's AI coding assistant, also adopted GPT-4 for its chat functionality, allowing developers to ask questions about code, generate functions, and debug issues directly inside the IDE. The Azure OpenAI Service brought GPT-4 to enterprise Azure tenants under contractual data-handling guarantees, with general availability announced in mid-2023.
GPT-4's API was widely adopted across industries. Selected launch and early-access partners included:
| Partner | Use case | Announcement |
|---|---|---|
| Duolingo | "Duolingo Max" tier with Roleplay and Explain My Answer features | March 14, 2023 [13] |
| Khan Academy | "Khanmigo" AI tutor for students and teaching assistant for teachers | March 14, 2023 [13] |
| Morgan Stanley | Internal knowledge retrieval over wealth-management research | March 14, 2023 [13] |
| Stripe | Customer-support routing, documentation Q&A, fraud detection | March 14, 2023 [13] |
| Be My Eyes | "Be My AI" image-description tool for blind and low-vision users | March 14, 2023 [13] |
| Government of Iceland | Icelandic-language preservation and translation | March 14, 2023 [13] |
| GitHub | Copilot Chat | March 22, 2023 |
| Salesforce | Einstein GPT in CRM workflows | March 7, 2023 (preview) |
| Slack | Slack GPT and AI summaries | May 2023 |
| Snap | "My AI" chatbot in Snapchat | February to April 2023 |
| Intercom | Fin AI customer-service agent | March 2023 |
| Quizlet | Q-Chat tutoring agent | March 2023 |
| HubSpot | ChatSpot.ai marketing assistant | March 2023 |
The Be My Eyes partnership for visually impaired users became one of the most cited examples of GPT-4's practical applications. [13]
GPT-4 launched into a rapidly changing competitive field. Within months, several competitors released models with overlapping or superior capabilities.
The performance gap between GPT-4 and GPT-3.5 was large across almost every measured dimension. On MMLU, GPT-4 scored 86.4 percent versus 70.0 percent for GPT-3.5. On the Bar Exam, GPT-4 jumped from the 10th to the 90th percentile. On internal factuality benchmarks, GPT-4 scored 40 percent higher than GPT-3.5. GPT-4 was also better at following complex instructions and producing structured output. [1]
However, GPT-4 was significantly slower and more expensive. GPT-3.5 Turbo remained the default model for cost-sensitive applications throughout 2023 due to its lower latency and much lower price.
Anthropic's Claude 2 launched in July 2023, followed by Claude 3 (Opus, Sonnet, and Haiku) in March 2024. Claude 3 Opus was broadly competitive with GPT-4 Turbo on reasoning and knowledge benchmarks, and it offered a 200K-token context window compared to GPT-4 Turbo's 128K. Claude models were generally considered stronger at long-document analysis and more cautious in their safety behavior. Claude 3.5 Sonnet, released in June 2024, outperformed GPT-4 by 23 points on GPQA (a graduate-level science reasoning benchmark) while costing significantly less.
Google released Gemini 1.0 Ultra in December 2023, positioning it as a GPT-4 competitor. Gemini Ultra slightly outperformed GPT-4 on MMLU (90.0 percent vs 86.4 percent) and offered native multimodal capabilities similar to GPT-4o. Gemini 1.5 Pro, released in February 2024, introduced a 1-million-token context window, far exceeding GPT-4 Turbo's 128K. Gemini also had the advantage of vertical integration with Google Search and Workspace.
Meta released Llama 2 on July 18, 2023 as an open-weight license, partly in response to closed models like GPT-4. While the original Llama 2 70B trailed GPT-4 on most benchmarks, the open release seeded a flourishing ecosystem. Llama 3 (April 2024) and Llama 3.1 405B (July 2024) closed much of the gap. Mistral AI's Mixtral 8x7B (December 2023) and 8x22B (April 2024) MoE models, plus Chinese-developed models like DeepSeek-V2 (May 2024) and DeepSeek-V3 (December 2024), reached GPT-4-class scores on standard benchmarks at a fraction of the cost.
| Feature | GPT-4 Turbo | Claude 3 Opus | Gemini 1.5 Pro |
|---|---|---|---|
| MMLU | 86.4% | 86.8% | 81.9% |
| Context window | 128K tokens | 200K tokens | 1M tokens |
| Multimodal input | Text + images | Text + images | Text + images + video + audio |
| Audio support | No (pipeline) | No | Yes (native) |
| API input price (per 1M tokens) | $10.00 | $15.00 | $7.00 |
| API output price (per 1M tokens) | $30.00 | $75.00 | $21.00 |
OpenAI's technical report and system card documented several known weaknesses of GPT-4. [1][2]
GPT-4 still generates plausible-sounding but false statements. OpenAI acknowledged this directly: "GPT-4 is not fully reliable and still hallucinates facts and makes reasoning errors." While GPT-4 scored 40 percent higher than GPT-3.5 on internal adversarial factuality evaluations, hallucinations remained a persistent problem. OpenAI noted that hallucinations become more dangerous as models grow more fluent, because users build trust when the model is correct most of the time and then fail to catch the errors. [1]
The lawsuit Mata v. Avianca (Southern District of New York, May 2023) became the canonical example of this risk in production: attorney Steven Schwartz used ChatGPT (running GPT-3.5/4) to research a brief and submitted six fabricated case citations. The court sanctioned Schwartz and his firm in June 2023 and the incident became a standard cautionary tale in legal education.
Despite strong benchmark scores, GPT-4 can fail on problems that require multi-step logical reasoning, especially in novel contexts it has not seen during training. Its performance on competitive programming (Codeforces rating below the 5th percentile) shows that raw coding ability does not translate to algorithmic problem-solving under constraints. This gap motivated the o-series of reasoning-tuned models from late 2024 onward (o1, o3, o4-mini), which use long internal chain-of-thought traces to attack problems GPT-4 could not solve. [1]
The original GPT-4 had a knowledge cutoff of September 2021, meaning it had no information about events after that date. GPT-4 Turbo updated this to April 2023, and the April 2024 release extended it to December 2023. GPT-4o moved to October 2023; GPT-4.1 advanced to June 2024. Users who asked about recent events would receive outdated or incorrect information unless the model was connected to external tools like web browsing. [1][4][5][16]
Although GPT-4 Turbo advertised a 128K-token context window, practical performance degraded at longer input lengths. Independent testing showed attention drift beyond roughly 32K tokens, with the model becoming less reliable at locating and using information placed deep within long inputs. GPT-4.1's 1M-token window improved reliability on simple retrieval but not on tasks requiring synthesis across the full window. [16]
GPT-4 can reflect biases present in its training data, producing content that perpetuates stereotypes or skews toward certain cultural perspectives. OpenAI's system card noted that the model may amplify biases and that its safety training does not eliminate all problematic outputs. [2]
The safety training that reduced harmful outputs also introduced a tendency to refuse legitimate requests. Users reported that GPT-4 would sometimes decline to answer factual questions or generate benign creative content because the request superficially resembled a disallowed category. OpenAI acknowledged this tradeoff and worked to reduce over-refusal in subsequent model updates. [2]
At launch, GPT-4 was slow and expensive compared to GPT-3.5. The original GPT-4 8K model cost $30 per million input tokens and $60 per million output tokens, roughly 30 times more than GPT-3.5 Turbo. Latency was also higher, making it impractical for real-time applications. This improved significantly with GPT-4 Turbo and GPT-4o; AI researcher Andrew Ng calculated in August 2024 that GPT-4o cost about $4 per million blended tokens (assuming 80 percent input, 20 percent output), down from $36 per million for the original GPT-4 in March 2023, an order-of-magnitude reduction over 17 months. [4][5]
OpenAI implemented a multi-layered safety approach for GPT-4. [2]
In addition to standard RLHF, OpenAI used a rule-based reward model (RBRM) that applied specific, predefined rules to evaluate model outputs during training. This allowed the safety team to encode precise behavioral guidelines without relying solely on human labeler judgment. [2]
GPT-4 included a moderation layer that classifies both inputs and outputs. The system filters requests that violate OpenAI's usage policies, including content related to violence, illegal activity, sexual content involving minors, and generation of malware. [2]
OpenAI described its approach as "iterative deployment," releasing GPT-4 to progressively larger groups of users while monitoring for misuse and unexpected behavior. The ChatGPT Plus rollout, API waitlist, and gradual capability expansion (vision was delayed months after launch) all reflected this strategy. [1][3]
Beyond internal red teaming, OpenAI invited external organizations to evaluate GPT-4's safety properties. The Alignment Research Center (ARC Evals) conducted an early evaluation of GPT-4's ability to autonomously acquire resources and avoid being shut down. ARC concluded that GPT-4 was "ineffective at the autonomous replication task" but noted that future, more capable models could pose such risks. The system card also documented evaluations by the Lucid Strategy team on bioweapon uplift, by cybersecurity firm Kelvin Research on offensive cyber capabilities, and by the firm Apollo Research on long-term planning behaviors. [2]
| Date | Event |
|---|---|
| August 2022 | OpenAI completes GPT-4 pre-training |
| February 7, 2023 | Microsoft launches Bing Chat using an early GPT-4 checkpoint |
| March 14, 2023 | GPT-4 released; available to ChatGPT Plus subscribers and API waitlist |
| March 16, 2023 | Microsoft 365 Copilot announced |
| March 22, 2023 | GitHub Copilot Chat announced |
| March 23, 2023 | ChatGPT Plugins announced (GPT-4 only) |
| May 16, 2023 | Sam Altman testifies before US Senate on AI regulation |
| July 6, 2023 | GPT-4 API made generally available to all paying developers |
| July 21, 2023 | OpenAI signs voluntary White House AI safety commitments |
| August 28, 2023 | Code Interpreter renamed Advanced Data Analysis |
| September 20, 2023 | Authors Guild and 17 named authors file class-action against OpenAI |
| September 25, 2023 | GPT-4V(ision) system card published; image input begins rolling out |
| November 6, 2023 | DevDay: GPT-4 Turbo, JSON mode, Custom GPTs, GPT Store announced |
| November 17 to 22, 2023 | Sam Altman fired and reinstated as OpenAI CEO |
| December 27, 2023 | The New York Times sues OpenAI and Microsoft for copyright infringement |
| January 10, 2024 | GPT Store launches; Copilot Pro launches |
| April 9, 2024 | GPT-4 Turbo with vision becomes generally available |
| May 13, 2024 | GPT-4o released |
| July 18, 2024 | GPT-4o mini released |
| August 6, 2024 | GPT-4o price cut to $2.50 input / $10 output |
| September 24, 2024 | Advanced Voice Mode rolls out broadly to ChatGPT Plus and Team |
| October 1, 2024 | Realtime API launched |
| February 27, 2025 | GPT-4.5 (Orion) released |
| March 25, 2025 | Native 4o image generation launches in ChatGPT |
| April 14, 2025 | GPT-4.1, 4.1 mini, 4.1 nano released; GPT-4.5 API deprecation announced |
| May 14, 2025 | GPT-4.1 added to ChatGPT Plus, Pro, and Team |
| July 14, 2025 | GPT-4.5 API shutdown |
| August 7, 2025 | GPT-5 released; original GPT-4 retired from ChatGPT |
| February 13, 2026 | GPT-4o, GPT-4.1, GPT-4.1 mini scheduled for retirement from ChatGPT |
[1][3][4][5][6][12][16][17][18][20]
GPT-4's release accelerated several trends in the AI industry.
GPT-4 pushed competitors to move faster. Google expedited the release of its Gemini models, Anthropic scaled up Claude, and a wave of startups (Mistral, Inflection, Adept, Cohere, Reka, AI21, Tencent's Hunyuan, Alibaba's Qwen, Baidu's ERNIE, DeepSeek, Moonshot's Kimi, MiniMax, Zhipu, and Yi) raised multi-hundred-million-dollar rounds to compete. Meta released Llama 2 as an open-weight model partly to offer an alternative to closed-source systems like GPT-4. The period from March 2023 to mid-2024 saw the most intense competition among large language model developers in the history of the field.
GPT-4's improved reliability and instruction-following made it the first LLM that many enterprises considered production-ready. Microsoft's integration into the Office suite, GitHub, and Azure gave GPT-4 distribution at corporate scale. According to OpenAI, more than 92 percent of Fortune 500 companies were using OpenAI products by early 2024.
The rapid price drops from GPT-4 to GPT-4 Turbo to GPT-4o (a 92 percent reduction in output token cost over 14 months, and a further 75 percent reduction with GPT-4.1) put downward pressure on the entire LLM market. Competitors had to match or undercut these prices, making capable language models accessible to startups and individual developers. [4][5][16][19]
GPT-4's commercial success and closed-source nature motivated a wave of open-source and open-weight LLM development. Projects like Llama 2 and 3, Mistral, Mixtral, Falcon, Yi, Qwen, and DeepSeek aimed to provide GPT-4-level capabilities without dependence on a single API provider. By mid-2024, several open-weight models were approaching GPT-4-level performance on standard benchmarks; by 2025, DeepSeek-V3 and Llama 3.1 405B were credibly competitive on most public evaluations.
GPT-4's capabilities drew attention from governments worldwide. The European Union's AI Act, finalized in 2024, was partly shaped by debates about the risks posed by models of GPT-4's caliber, with specific obligations for "general-purpose AI models with systemic risk" defined by training-compute thresholds (10^25 FLOPs) that GPT-4 was widely believed to cross. In the United States, Sam Altman testified before the Senate Judiciary Committee on May 16, 2023 (proposing a federal licensing regime for frontier models), and OpenAI signed voluntary safety commitments at the White House on July 21, 2023, alongside Amazon, Anthropic, Google, Inflection, Meta, and Microsoft. President Biden's October 2023 Executive Order on Safe, Secure, and Trustworthy AI used similar compute thresholds (10^26 FLOPs for reporting) inspired by frontier models like GPT-4. [10][14]
GPT-4 sat at the center of an unprecedented wave of intellectual-property litigation against OpenAI. Major cases include:
| Case | Filed | Plaintiffs | Court |
|---|---|---|---|
| Tremblay et al. v. OpenAI | June 28, 2023 | Authors Paul Tremblay, Mona Awad | N.D. Cal. |
| Authors Guild et al. v. OpenAI | September 20, 2023 | Authors Guild + 17 authors including George R.R. Martin, John Grisham, Jodi Picoult | S.D.N.Y. |
| The New York Times v. Microsoft and OpenAI | December 27, 2023 | The New York Times Company | S.D.N.Y. |
| Daily News et al. v. OpenAI | April 30, 2024 | Eight Alden Global Capital newspapers | S.D.N.Y. |
| Center for Investigative Reporting v. OpenAI | June 27, 2024 | CIR / Mother Jones | S.D.N.Y. |
| Open AI / Authors v. Anthropic / Cohere (parallel) | various | Multiple authors | various |
The NYT lawsuit alleged that GPT-4 reproduced near-verbatim text from articles like the newspaper's investigation of New York City taxi medallion lending. OpenAI argued fair use; the case was still in discovery as of mid-2025. The Authors Guild case survived a motion to dismiss in April 2025. These cases became leading test cases for whether ingesting copyrighted text for training is fair use under the US Copyright Act. [13][15]
GPT-4 was at the center of OpenAI's most public crisis. On Friday, November 17, 2023, the OpenAI board (then including Ilya Sutskever, Tasha McCauley, Helen Toner, and Adam D'Angelo) abruptly fired Sam Altman, citing that he had not been "consistently candid" with the board. Greg Brockman, the OpenAI president, resigned in protest hours later. Within five days, after roughly 95 percent of OpenAI staff signed a letter threatening to leave, and after Microsoft offered jobs to anyone who departed, the board reversed course. On November 22, 2023, Altman returned as CEO and a new board (Bret Taylor as chair, Larry Summers, Adam D'Angelo) replaced the previous one. The episode was dubbed "The Blip" inside the company and accelerated OpenAI's transition toward a more conventional corporate governance structure.
On August 7, 2025, OpenAI released GPT-5, the long-awaited successor to the GPT-4 family. GPT-5 was launched as a unified system rather than a single model: a fast "main" model handles most queries, a deeper "thinking" model handles harder problems, and a real-time router decides which to invoke based on conversation type, complexity, tool needs, and explicit user intent. GPT-5 was made available across all ChatGPT tiers, with paying subscribers receiving higher usage limits and Pro users getting access to GPT-5 Pro (extended reasoning). [18]
Reported GPT-5 benchmarks at launch included 94.6 percent on AIME 2025 (mathematics) without external tools, 74.9 percent on SWE-bench Verified (real-world coding), 88 percent on Aider Polyglot (multilingual coding), 84.2 percent on MMMU (multimodal understanding), and 46.2 percent on HealthBench Hard. [18]
GPT-4 had approximately a 29-month lifespan as a flagship-tier model in ChatGPT (March 2023 to August 2025), and remains the longest-lived branding within the GPT product line. OpenAI deprecated the original GPT-4 (8K and 32K variants) on June 6, 2025 in favor of GPT-4 Turbo and later GPT-4o. The GPT-4 endpoints in the API were progressively sunset. [20]
On October 14, 2025, OpenAI announced that GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini would be retired from ChatGPT on February 13, 2026, with traffic routed to the closest GPT-5 equivalents (GPT-5 Instant, GPT-5 Thinking, GPT-5 Pro). The API status of GPT-4o was unchanged at that time. After user backlash from people who preferred GPT-4o's voice and conversational style, OpenAI temporarily restored 4o for ChatGPT Plus subscribers in late 2025 before reaffirming the February 2026 retirement date. [20]
GPT-4 is widely considered the model that turned generative AI from a curiosity into general-purpose infrastructure. Its launch coincided with the moment ChatGPT became the fastest-growing consumer application in history (100 million monthly users by January 2023, two months after launch), and GPT-4 was the first model that gave that consumer interest a clear professional-grade backbone. The model's bar exam performance, its rapid integration into Microsoft's product line, the Be My Eyes accessibility partnership, and the ARC autonomous-replication evaluation all became canonical reference points in subsequent AI policy debates.
The model also reshaped how the field communicates. The GPT-4 Technical Report's refusal to disclose architecture set the template for later closed releases (Claude 3, Gemini 1.5, GPT-4o), and the corresponding system card made detailed safety evaluation public norm rather than private practice. The Semianalysis leak of architectural details in July 2023 and the persistence of the 1.76-trillion-parameter MoE rumor demonstrated that closed models could not fully resist information disclosure even when the developer chose silence.
For a period of roughly 18 months between March 2023 and the second half of 2024, "GPT-4" was effectively shorthand for "frontier AI capability," a status it ceded gradually to Claude 3 Opus, Gemini 1.5 Pro, GPT-4o, the o-series, GPT-4.1, and finally GPT-5. By the time the original GPT-4 was retired from ChatGPT in 2025, virtually every Fortune 500 company, every major productivity suite, and every leading consumer device had been touched by the model or one of its descendants.