GPT
Last reviewed
Apr 30, 2026
Sources
30 citations
Review status
Source-backed
Revision
v6 · 4,538 words
GPT, short for Generative Pre-trained Transformer, is a family of large language models developed by OpenAI and built on the Transformer architecture introduced by Vaswani et al. in 2017. The first GPT was described in a 2018 paper by Alec Radford and colleagues titled "Improving Language Understanding by Generative Pre-Training," which showed that a single decoder-only Transformer pre-trained on a large unlabeled corpus could be fine-tuned to beat task-specific models on a wide range of benchmarks. Each subsequent generation (GPT-2 in 2019, GPT-3 in 2020, GPT-3.5 and ChatGPT in 2022, GPT-4 in 2023, GPT-4o in 2024, GPT-4.1 in April 2025, and GPT-5 in August 2025) has scaled the same basic recipe of next-token prediction on internet-scale text, with refinements to data, alignment, and compute.[1][2][3]
The acronym has spread well beyond OpenAI. Researchers and companies use "GPT" as a name or descriptor for decoder-only Transformers trained with the generative pre-training objective, as in BloombergGPT, EleutherAI's GPT-J and GPT-NeoX, and many domain-specific models; comparable systems such as Baidu's ERNIE family and its Wenxin Yiyan chatbot are often described in the same terms. The U.S. Patent and Trademark Office twice rejected OpenAI's attempt to register "GPT" as a trademark, ruling in February 2024 that the term was "merely descriptive" of a functional class of models rather than a unique brand.[4][5]
GPT is the system that triggered the modern wave of generative AI. ChatGPT, the dialogue interface OpenAI launched on November 30, 2022, reached 1 million users in five days and roughly 100 million monthly users within two months, the fastest consumer software adoption recorded at the time, and it pushed every major technology company into shipping its own competing assistant.[6]
The table below summarizes the headline OpenAI GPT releases. Parameter counts for GPT-4 and later are not officially disclosed; figures shown are widely reported industry estimates and are noted as such.
| Model | Release date | Parameters | Context window | Notes |
|---|---|---|---|---|
| GPT-1 | June 11, 2018 | 117 million | 512 tokens | Introduced in Improving Language Understanding by Generative Pre-Training; trained on the BookCorpus dataset. |
| GPT-2 | February 14, 2019 (small); November 5, 2019 (full) | Up to 1.5 billion | 1,024 tokens | Initially withheld due to misuse concerns; full 1.5B weights released in stages. Trained on WebText (40 GB). |
| GPT-3 | June 11, 2020 (paper May 28, 2020) | 175 billion | 2,048 tokens | Paper "Language Models are Few-Shot Learners" demonstrated in-context learning; trained on roughly 300 billion tokens. NeurIPS 2020 Best Paper. |
| InstructGPT | January 27, 2022 | 1.3B to 175B (multiple sizes) | 2,048 tokens | First public OpenAI model trained with reinforcement learning from human feedback (RLHF). |
| GPT-3.5 | March 15, 2022 (text-davinci-002) | 175 billion (estimated) | 4,096 tokens | Improved instruction-following; basis for the original ChatGPT launch. |
| ChatGPT | November 30, 2022 | 175 billion (GPT-3.5) | 4,096 tokens | Free conversational web app; reached ~100 million MAUs in two months. |
| GPT-4 | March 14, 2023 | Not disclosed; rumored ~1.8T total parameters in a mixture-of-experts configuration | 8K and 32K variants | First GPT to accept image inputs alongside text. Passed a simulated bar exam in roughly the top 10%. |
| GPT-4 Turbo | November 6, 2023 | Not disclosed | 128,000 tokens | Cheaper, faster GPT-4 with vision; data freshness extended to April 2023. |
| GPT-4o | May 13, 2024 | Not disclosed | 128,000 tokens | "Omni" multimodal model: real-time text, image, and audio in a single neural network. |
| GPT-4.1 | April 14, 2025 | Not disclosed | 1,000,000 tokens | API-only release focused on coding, instruction following, and long-context reasoning. |
| GPT-5 | August 7, 2025 | Not disclosed | 256,000 tokens (input), 128,000 output | Unified system with a router that switches between fast and "thinking" reasoning models; mini, nano, and pro variants. |
Sources for this table: OpenAI announcements, the original GPT papers, and Wikipedia articles on each model.[1][2][3][7][8][9][10][11][12]
Every GPT model from GPT-1 onward is a decoder-only Transformer, meaning it consists of a stack of identical Transformer blocks that use masked self-attention so each token can only attend to tokens earlier in the sequence. There is no encoder, and there is no bidirectional context the way BERT has. The model takes a sequence of tokens, runs them through token embeddings and positional embeddings, then through the Transformer stack, and finally through a linear layer plus softmax that produces a probability distribution over the vocabulary for the next token.[13]
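The masking rule described above can be illustrated with a minimal single-head sketch in plain Python. It is a toy under stated assumptions, not any production implementation: the function names and the tiny input are hypothetical, the learned query, key, and value projections are omitted, and real models run many heads on hardware-accelerated tensors.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors (lists of floats), one per token.
    Position t may only attend to positions 0..t.
    """
    d = len(queries[0])
    n = len(queries)
    outputs, weights = [], []
    for t, q in enumerate(queries):
        # Score only keys at positions <= t; later positions are masked out.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        # Pad with zeros so every row has one weight per token.
        weights.append(w + [0.0] * (n - t - 1))
        outputs.append([
            sum(w[s] * values[s][j] for s in range(t + 1))
            for j in range(len(values[0]))
        ])
    return outputs, weights

# Hypothetical 3-token input with 2-dimensional vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(x, x, x)
for row in attn:
    print([round(a, 3) for a in row])
```

Each printed row is the attention distribution for one position; entries to the right of the diagonal are zero, which is the property that lets the model be trained to predict every next token in parallel without seeing it.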
The core building block of each layer combines a multi-head causal self-attention sublayer with a position-wise feed-forward network. Residual connections and layer normalization sit around both sublayers. Position information is injected through learned positional embeddings in GPT-1 and GPT-2 and through more elaborate schemes such as rotary position embeddings in many later open-source GPT variants. Vocabulary is encoded with a byte pair encoding tokenizer; modern OpenAI models use a tokenizer family called tiktoken.
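As a rough illustration of how byte pair encoding builds a vocabulary, the sketch below performs a single merge step on a toy corpus. It works on characters rather than bytes, and the corpus and function names are hypothetical; production tokenizers differ in many details.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word is a tuple of symbols mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("l", "o", "t"): 3}
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```

Repeating this step many thousands of times yields a vocabulary in which frequent character sequences become single tokens while rare strings remain decomposable into smaller pieces.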
GPT-1 used 12 Transformer blocks and 12 attention heads in each block, with a 768-dimensional hidden state. GPT-2's largest variant scaled this to 48 blocks, 25 heads, and a 1,600-dimensional hidden state. GPT-3 took the same recipe to 96 blocks, 96 heads, and a 12,288-dimensional hidden state. The headline parameter counts (117 million for GPT-1, 1.5 billion for GPT-2, 175 billion for GPT-3) come almost entirely from the matrices inside attention and feed-forward layers; the embedding tables are large but a relatively small share of the total at scale.[1][2][3]
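A common back-of-the-envelope estimate makes the link between these shapes and the headline parameter counts concrete. The sketch assumes about 12 × d² weights per block (4 × d² for the attention projections and 8 × d² for a feed-forward network with a 4 × d hidden layer) and ignores biases, layer normalization, and embeddings; the formula is a standard approximation, not a figure taken from the papers.

```python
def approx_block_params(n_layers, d_model):
    # Assumption: per block, 4*d^2 for attention (Q, K, V, output projections)
    # plus 8*d^2 for a feed-forward network with a 4*d hidden layer.
    return 12 * n_layers * d_model ** 2

# (layers, hidden size) as quoted in the text above.
configs = {
    "GPT-1": (12, 768),
    "GPT-2 (largest)": (48, 1600),
    "GPT-3": (96, 12288),
}
for name, (layers, d) in configs.items():
    billions = approx_block_params(layers, d) / 1e9
    print(f"{name}: ~{billions:.2f}B non-embedding parameters")
```

The estimate gives roughly 0.08 billion, 1.47 billion, and 173.95 billion parameters. For GPT-3 this accounts for nearly all of the 175 billion, while for GPT-1 the gap to 117 million is mostly embedding tables, consistent with embeddings mattering less as models grow.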
Decoder-only architectures became the dominant design for generative AI for two practical reasons. They are simple to scale because every layer has the same shape, and every token in a training document contributes a gradient signal because next-token prediction targets every position in the sequence. That makes training data-efficient relative to encoder-decoder setups. The cost is that the model cannot look at future context, so tasks like document classification or extractive question answering, where bidirectional models such as BERT once dominated, are framed in GPT as text-in text-out problems.[13][14]
From GPT-4 onward, OpenAI is widely believed to use mixture-of-experts (MoE) routing inside the feed-forward layers, which lets the model store many parameters but only activate a fraction of them per token. Industry leaks suggest GPT-4 has roughly 1.8 trillion total parameters, split across either 8 experts of about 220 billion parameters each or 16 experts of about 111 billion each, and that two experts are routed per forward pass. OpenAI has never confirmed these numbers, and the GPT-4 technical report explicitly omits architecture, hardware, training compute, and dataset details.[15][16]
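The general idea of top-2 expert routing can be sketched as follows. This is a generic illustration of the technique, not a description of OpenAI's undisclosed design: the experts here are trivial scalar functions, and the router logits, which a real model would compute with a learned layer, are passed in directly as hypothetical values.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    gates = top_k_route(router_logits, k)
    # Only the selected experts run; the others cost no compute for this token.
    return sum(g * experts[i](x) for i, g in gates.items())

# Four toy "experts" that simply scale their input (hypothetical stand-ins
# for full feed-forward networks).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gates = top_k_route([0.1, 2.0, 0.3, 2.0])
y = moe_layer(10.0, experts, router_logits=[0.1, 2.0, 0.3, 2.0])
print(gates, y)
```

With two of four experts active, the layer stores four experts' worth of parameters but spends only two experts' worth of compute per token, which is the trade-off the paragraph above describes.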
Context window length, the number of tokens the model can attend to at once, has expanded across generations. GPT-1 had a 512-token window, GPT-2 had 1,024, GPT-3 had 2,048, and GPT-3.5 had 4,096. GPT-4 launched in 8,192-token and 32,768-token variants, GPT-4 Turbo extended this to 128,000, GPT-4.1 jumped to 1 million tokens for input, and GPT-5 ships with 256,000 input and 128,000 output by default. Long-context training requires special engineering for attention scaling, including techniques such as FlashAttention, grouped-query attention, and various sparse and linear-attention variants. Most public details on these techniques come from open-source models because OpenAI does not document its production architecture.
For multimodal GPTs (GPT-4 with vision, GPT-4o, GPT-5), the model also accepts image and audio inputs through dedicated encoders that project those modalities into the same token embedding space the language model already understands. GPT-4o is described as a single end-to-end neural network that processes audio, vision, and text in one model rather than relying on separate speech-to-text and text-to-speech pipelines, which is why it can respond to spoken input in roughly 320 milliseconds, close to human conversational latency.[9]
GPT models are trained in two or three main stages.
The pre-training stage uses self-supervised next-token prediction, also known as causal language modeling. The model sees a token sequence drawn from a large corpus and learns to predict the next token at every position. The loss is the negative log likelihood of the correct token under the model's predicted distribution, summed over the sequence. There are no human labels at this stage; the labels come from the text itself.[1]
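The loss described above can be written out directly. The sketch below uses a hypothetical three-token vocabulary and hand-written probability distributions in place of real model outputs.

```python
import math

def next_token_nll(token_probs, targets):
    """Sum of -log p(correct next token) over all positions."""
    return -sum(math.log(probs[t]) for probs, t in zip(token_probs, targets))

# One predicted distribution over the toy vocabulary per position.
predicted = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
]
targets = [0, 1, 2]  # index of the token that actually came next
loss = next_token_nll(predicted, targets)
print(round(loss, 4))
```

The loss shrinks toward zero as the model puts more probability on the tokens that actually occur, and training adjusts the weights to reduce it across the whole corpus.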
Pre-training data scaled rapidly across generations. GPT-1 used the 4.5 GB BookCorpus. GPT-2 used 40 GB of WebText scraped from outbound Reddit links with a karma threshold. GPT-3 used roughly 570 GB of filtered Common Crawl, plus WebText2, Books1, Books2, and English Wikipedia; the combined mixture held roughly 500 billion tokens, of which about 300 billion were sampled during training. GPT-4 and later models are trained on undisclosed mixtures that almost certainly include licensed data, code repositories, and synthetic data generated by earlier models.[2][3]
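After pre-training, GPT models go through alignment so they follow instructions and avoid clearly unsafe outputs. There are usually two phases: supervised fine-tuning on human-written demonstrations of the desired behavior, followed by reinforcement learning from human feedback (RLHF), in which a reward model trained on human preference rankings guides further optimization of the model.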
The InstructGPT paper reported that a 1.3 billion parameter RLHF-tuned model produced outputs that human raters preferred to those of the 175 billion parameter GPT-3, so alignment alone offset a more than 100x difference in parameter count. RLHF is widely credited as the technique that made ChatGPT feel as useful as it does.[17]
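RLHF depends on a reward model trained from human comparisons between pairs of outputs. A minimal sketch of the pairwise objective commonly used for this purpose is shown below; the scalar reward values are hypothetical stand-ins for a reward model's outputs, and the subsequent policy-optimization step is not covered.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(d)) equals log(1 + exp(-d)).
    return math.log1p(math.exp(-diff))

# Hypothetical reward scores for a preferred and a rejected response.
agrees = preference_loss(2.0, 0.0)     # reward model ranks the pair correctly
disagrees = preference_loss(0.0, 2.0)  # reward model ranks the pair wrongly
print(round(agrees, 4), round(disagrees, 4))
```

The loss is small when the reward model scores the human-preferred response higher and large when it does not, so minimizing it pushes the reward model toward the raters' rankings.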
Later GPT models add further alignment stages such as Constitutional-AI-style critique loops, rule-based reward shaping, and tool-use pre-training where the model learns to call code interpreters, web search, and image generators. GPT-5 introduced a router model that decides at inference time whether to send a query to a fast "main" model or a slower "thinking" model that allocates more inference compute to reasoning.[12]
The scaling behavior of GPT-style models was formalized in two influential papers. Kaplan et al. (2020) at OpenAI showed that test loss falls as a power law in three quantities: parameter count, dataset size, and training compute. Their advice was to scale parameters faster than data within a fixed compute budget, which justified the parameter-heavy GPT-3 recipe.[18]
Hoffmann et al. (2022) at DeepMind, in the Chinchilla paper, re-ran the experiments more carefully and concluded that for a given compute budget, parameters and tokens should scale roughly in equal proportion. Many large language models trained before Chinchilla, including GPT-3, were therefore over-parameterized for the amount of data they had seen. Later GPT models are believed to follow a more Chinchilla-style data-to-parameter ratio, although exact numbers are not public.[19]
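The equal-proportion rule can be turned into a quick calculation. The sketch below relies on two widely used rules of thumb that do not appear in the text above and should be read as assumptions: training compute of roughly 6 × parameters × tokens, and a compute-optimal ratio of about 20 tokens per parameter. The budget value is illustrative.

```python
import math

def compute_optimal(compute_flops, tokens_per_param=20.0):
    # Assumptions: C ≈ 6 * N * D and D ≈ tokens_per_param * N,
    # so N = sqrt(C / (6 * tokens_per_param)).
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.76e23)  # illustrative compute budget in FLOPs
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.2f}T tokens")
```

For this budget the rule suggests about 69 billion parameters and 1.39 trillion tokens, in the neighborhood of the 70 billion parameters and 1.4 trillion tokens reported for the Chinchilla model itself. Under the same heuristic, a 175 billion parameter model would call for several trillion training tokens.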
One of the surprises of GPT-3 was that the model could perform new tasks at inference time given only a few demonstrations in the prompt, with no parameter updates. The paper called this few-shot learning, and the broader phenomenon is now usually called in-context learning.[3]
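A few-shot prompt is ordinary text with demonstrations placed ahead of the query. The sketch below assembles one; the "Input:" and "Output:" labels and the helper name are illustrative choices, not a format prescribed by the paper.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and an unanswered query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("sea otter", "loutre de mer")],
    "peppermint",
)
print(prompt)
```

The model receives the whole string as context and continues after the final "Output:", so the demonstrations steer its behavior without any change to its weights.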
In-context learning is what made prompt engineering a distinct skill: changing the wording, ordering, or examples in the prompt can shift accuracy on a benchmark by tens of percentage points. Later work on chain-of-thought prompting (Wei et al. 2022) showed that asking the model to think step by step before answering improved performance on math word problems and logic puzzles, especially in larger models. The capability is sometimes described as emergent because it appears to switch on around a certain scale rather than improving smoothly.
In 2024 and 2025 OpenAI released a separate "reasoning" line, including o1, o3, and the GPT-5 thinking variants, that uses reinforcement learning to teach the model to spend more inference compute on chains of thought before producing a final answer. These models score much higher on math and coding benchmarks than non-reasoning siblings of comparable size.[20]
OpenAI launched ChatGPT as a free "research preview" on November 30, 2022. The interface was a simple chat box wrapped around a fine-tuned GPT-3.5 model. Within five days it reached 1 million users; within two months it crossed 100 million monthly users, surpassing TikTok's nine months and Instagram's two-and-a-half years to that mark.[6]
The launch is often called the "ChatGPT moment" because it forced the rest of the industry to react. Microsoft, which had invested $1 billion in OpenAI in 2019, expanded that to a multi-year, multibillion-dollar partnership in January 2023 and built ChatGPT-derived features into Bing, Office, GitHub Copilot, and Windows. Google issued a "code red" and accelerated its own Bard (later Gemini) chatbot. Anthropic released Claude in March 2023, and Baidu unveiled Ernie Bot the same month.
ChatGPT itself has gone through many backend models. The default has cycled through GPT-3.5, GPT-4, GPT-4 Turbo, GPT-4o, and GPT-5, with paid users gaining access to newer models first. By April 2025, ChatGPT reportedly had more than 500 million weekly active users, making it among the most-used software products in the world.
On November 6, 2023, at OpenAI's first DevDay, Sam Altman announced "GPTs," customizable versions of ChatGPT that any user could build with natural language instructions, retrieval over uploaded files, and optional connections to external APIs. Builders did not need to write code; they configured a system prompt, tools, and knowledge files through a chat-based builder.[21]
On January 10, 2024, OpenAI opened the GPT Store, a marketplace where ChatGPT Plus, Team, and Enterprise subscribers could browse and use Custom GPTs published by other users. By that date users had already created more than 3 million Custom GPTs. Featured launches included Khan Academy's tutoring GPT, AllTrails' trail finder, Canva's design assistant, and the academic search GPT Consensus.[22]
In early 2024, OpenAI announced a revenue program that pays U.S. builders based on user engagement with their GPTs, although the precise rates and eligibility have shifted over time. The Custom GPT format was the first widely deployed example of an agent-like layer on top of a base model: each GPT bundles a persistent persona, retrieval, and tool calls into a single shareable artifact.
Because OpenAI never secured a trademark on the acronym, dozens of organizations ship models, applications, and even unrelated products under "GPT" branding. The list below is not exhaustive but covers some of the most cited examples.
| Model or product | Organization | Year | Notes |
|---|---|---|---|
| GPT-J-6B | EleutherAI | 2021 | 6-billion-parameter open-weight decoder-only Transformer; an early attempt to replicate GPT-3 publicly. |
| GPT-NeoX-20B | EleutherAI | 2022 | 20-billion-parameter open-weight model. |
| BloombergGPT | Bloomberg L.P. | March 2023 | 50-billion-parameter financial LLM trained on 363 billion tokens of Bloomberg's proprietary financial data plus 345 billion general tokens. |
| ERNIE / Wenxin Yiyan | Baidu | 2019 onward | Chinese decoder-only Transformer family; the public chatbot launched March 16, 2023, and was renamed Wenxiaoyan in 2024. Reported 200 million users by April 2024. |
| BioGPT | Microsoft Research | 2022 | Domain-specific GPT trained on biomedical literature. |
| FinGPT | Multiple academic groups | 2023 | Open-source financial LLM project. |
| Cerebras-GPT | Cerebras | March 2023 | Open-weight family of seven Chinchilla-scaled GPT models. |
| AssistGPT, AutoGPT, AgentGPT | Various | 2023 | Agent frameworks built on top of OpenAI APIs; not standalone models. |
In its February 2024 ruling against OpenAI's trademark application, the USPTO specifically pointed to this proliferation as evidence that "GPT" had become a generic descriptor.[4]
GPT is one of several major closed and open large language model families that emerged after the original Transformer paper. The table below sketches how the GPT line compares to its main competitors as of April 2026.
| Family | Developer | License | Notable strength | Recent flagship |
|---|---|---|---|---|
| GPT | OpenAI | Closed (API and ChatGPT only) | Broadest ecosystem, strong general reasoning, large third-party tooling, GPT Store. | GPT-5 (August 2025) |
| Claude | Anthropic | Closed (API and chat) | Long-form writing, careful reasoning, lower hallucination rates on many evaluations. | Claude 4.5 Opus and Sonnet (2025) |
| Gemini | Google DeepMind | Closed (API, Gemini app, Workspace) | Native multimodality, very long context windows, integration with Google Search. | Gemini 2.5 Pro (2025) |
| LLaMA | Meta AI | Open weights with a community license | Strong base models that anyone can self-host or fine-tune. | LLaMA 4 (2025) |
| DeepSeek | DeepSeek (China) | Open weights | Cost-efficient training; competitive reasoning at a fraction of the inference cost. | DeepSeek-V3 and R1 (2024 to 2025) |
| Mistral | Mistral AI | Open weights and proprietary | Compact European-built models, strong multilingual coverage. | Mistral Large 2 (2025) |
All of these systems use decoder-only Transformer cores, so the architectural distance between them is small. The differences come from training data, alignment recipes, safety tuning, and deployment surfaces. Independent reviewers tend to describe GPT models as the broadest "all-purpose" choice, Claude as the strongest at long-form writing and careful document work, Gemini as the best fit for users in Google's ecosystem, and LLaMA as the leading open-weight option.[23]
What a GPT can actually do depends on the version and the deployment, but the broad envelope is consistent across the family.
GPT models share the failure modes of all current large language models.
The research community received GPT-3 as a watershed paper; "Language Models are Few-Shot Learners" won a Best Paper award at NeurIPS 2020 and reframed how researchers think about scale and emergent behavior.[3] ChatGPT's launch in November 2022 is the moment when most of the public, regulators, and the broader software industry took notice. Within a year, generative AI had become a fixture of national policy debates, including the U.S. executive order on AI of October 2023, the EU AI Act of 2024, and the UK AI Safety Summit at Bletchley Park.
Economic impact has been concrete. GitHub Copilot, built on OpenAI Codex (a GPT-3 derivative) and later GPT-4 class models, was reported to be in use by tens of millions of developers by 2025. Independent productivity studies, including a 2023 study by Brynjolfsson, Li, and Raymond on customer support agents, measured roughly 14% more issues resolved per hour for workers using a GPT-based assistant, with the largest gains for the least experienced staff. Many companies have built internal GPT deployments through Microsoft Azure OpenAI Service, ChatGPT Enterprise, and the OpenAI API, citing customer support, code authoring, and document workflows as primary use cases.
Reception has not been uniformly positive. Critics have pointed to the environmental cost of training and serving frontier models, the displacement of certain knowledge-work jobs, the concentration of frontier compute in a small number of U.S. and Chinese firms, the legal status of training on copyrighted work, and the safety risks of deploying systems whose internal reasoning is not interpretable. GPT-5's launch in August 2025 drew particular pushback from longtime ChatGPT users, who said the new default model felt "flat" compared to GPT-4o and complained that the automatic router sent them to a smaller model than they wanted; OpenAI subsequently adjusted defaults and exposed model selection more directly.[12]
"GPT" has entered everyday language as shorthand for "AI chatbot," similar to how "Google" became a verb for web search. The phrase "according to ChatGPT" appears in news articles, court filings, classroom syllabi, and political speeches. The acronym is referenced in books, television, and stand-up routines, and "the model" or "the GPT" is often invoked the way "the algorithm" was a few years earlier. Many writers and artists have organized against generative AI tools, especially after the 2023 Writers Guild of America strike, which secured contractual protections against the unconsented use of GPT-style systems in television and film writing.
In academia, GPTs have prompted rapid changes to assessment practice. Many universities re-introduced in-class exams or oral defenses after 2023, and journals such as Nature and Science updated their editorial policies in early 2023 to require disclosure of any GPT use in submitted manuscripts. Detection tools that claim to identify GPT output have struggled to keep up; OpenAI itself shut down its public AI text classifier in mid-2023, citing low accuracy.
The legal system has also had to adapt. In the New York case Mata v. Avianca (2023), two attorneys were sanctioned after submitting a brief that cited fictional cases hallucinated by ChatGPT. Multiple court systems in the United States, the United Kingdom, and Australia have since issued standing orders that require lawyers to disclose any use of GPT-style tools in filings and to verify every citation independently. Several state bars have published advisory opinions on the duty of competence in working with generative AI.
Beginning with GPT-4, OpenAI has accompanied each GPT release with a public safety document called a system card or model card. These documents describe the alignment training, red-teaming, capability evaluations, and known failure modes for the model. They are not peer-reviewed and have been criticized for omitting key technical details, but they have also become a de facto industry standard; competitors including Anthropic, Google DeepMind, and Meta now publish similar documents for their own flagship models.
Alignment research on GPT-class models is an active field. Major threads include scalable oversight (training models to do tasks humans cannot easily evaluate), interpretability (understanding what circuits inside the network are doing), jailbreaking defenses (preventing users from bypassing safety training), and evaluation of dangerous capabilities such as biosecurity, cybersecurity, and autonomous replication. The Frontier Model Forum, founded in July 2023 by OpenAI, Anthropic, Google, and Microsoft, coordinates some of this work across labs.
Policy interest in GPT has accelerated in parallel. The U.S. AI Executive Order of October 2023 set reporting thresholds for models trained with more than 10^26 floating-point operations, a bar that GPT-4 and its successors are believed to clear. The EU AI Act, finalized in 2024 and entering into force in stages through 2026, defines a category of general-purpose AI models with "systemic risk" and imposes additional transparency, evaluation, and incident reporting obligations on their providers. China requires generative AI services to undergo security assessments and content filtering, which is why models such as Wenxin Yiyan went through an approval process before public release. The United Kingdom established the AI Safety Institute, which conducts pre-deployment evaluations of frontier models, including GPT-class systems, under voluntary agreements with the labs.