GPT-2 is the second model in the GPT (Generative Pre-trained Transformer) series, released by OpenAI in February 2019. Described in the paper "Language Models are Unsupervised Multitask Learners" by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, GPT-2 scaled up the architecture of GPT-1 by a factor of roughly 10x, reaching 1.5 billion parameters in its largest variant [1]. The model demonstrated that a sufficiently large language model trained on diverse internet text could perform a wide variety of natural language processing tasks in a zero-shot setting, without any task-specific fine-tuning.
GPT-2 is also remembered for the controversy surrounding its release. OpenAI initially withheld the full model, citing concerns that it could be misused to generate convincing fake text at scale. The decision to implement a staged release over nine months sparked a heated debate in the AI research community about publication norms, responsible disclosure, and the balance between openness and safety [2]. The episode became a defining moment in the broader conversation about dual-use risks in artificial intelligence research.
GPT-1 had shown that generative pre-training on a large text corpus, followed by supervised fine-tuning, could produce strong results on NLP benchmarks. However, GPT-1 still required task-specific fine-tuning and labeled data for each downstream task. The central question motivating GPT-2 was whether scaling the model and training data would be sufficient for the model to learn tasks implicitly, without needing separate fine-tuning steps [1].
The hypothesis was grounded in the idea that a language model trained on a sufficiently large and diverse corpus would encounter many different tasks embedded naturally within the text. For example, a passage might contain a question followed by an answer, a document followed by its summary, or text in one language followed by a translation. If the model could learn to perform these tasks as part of its language modeling objective, it would effectively become a multitask learner without explicit supervision [1].
This framing was captured in the paper's title: "Language Models are Unsupervised Multitask Learners." The claim was bold. It suggested that the path to general-purpose NLP systems did not require ever-more-sophisticated task-specific architectures, but rather ever-larger language models trained on ever-more-diverse data. The results of GPT-2 provided the first strong evidence for this claim, and the subsequent GPT-3 would confirm it decisively.
The broader context of early 2019 is worth noting. At the time, the dominant paradigm in NLP was the pre-train-then-fine-tune approach popularized by BERT, which had been released by Google in October 2018. BERT used a bidirectional transformer encoder and achieved state-of-the-art results across many NLP benchmarks after task-specific fine-tuning. GPT-2's contribution was to show that a unidirectional, decoder-only model could match or exceed these results on certain tasks without any fine-tuning at all, simply by leveraging scale and diverse training data.
GPT-2 follows the same decoder-only transformer architecture as GPT-1, with several minor but important modifications. The model is autoregressive, meaning it generates text one token at a time by predicting the next token conditioned on all preceding tokens. This causal structure is enforced through masked self-attention, which prevents the model from attending to future positions in the sequence.
Layer normalization was moved to the input of each sub-block (pre-norm), rather than being applied after the sub-block as in GPT-1. This pre-normalization approach improves training stability, particularly for deep networks, by ensuring that the inputs to each attention and feed-forward sublayer are well-scaled. An additional layer normalization was added after the final self-attention block. The weights of residual layers were scaled at initialization by a factor of 1/sqrt(N), where N is the number of residual layers. This modified initialization helped stabilize training at greater depths [1].
The vocabulary was expanded from 40,000 byte-pair encoding (BPE) merges to 50,257 tokens using a byte-level BPE scheme. This allowed the model to encode any string of text without encountering unknown tokens, an important practical improvement [1]. The context length was doubled from 512 to 1,024 tokens, allowing the model to condition on longer passages when generating text or making predictions.
GPT-2 uses the GELU (Gaussian Error Linear Unit) activation function in its feed-forward layers, following GPT-1. Each transformer block contains a multi-head self-attention layer followed by a position-wise feed-forward network. The feed-forward network expands the hidden dimension by a factor of four (for example, from 768 to 3,072 in the small model) before projecting back down to the original dimension [1].
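The following sketch (PyTorch, not OpenAI's released TensorFlow code) illustrates these architectural choices in one place: the causal mask, the pre-norm layer placement, the GELU feed-forward network with 4x expansion, and the 1/sqrt(N)-scaled initialization of residual projections. Dimensions default to the small model; this is a simplified illustration, not a faithful reimplementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12, max_len=1024):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)      # output projection (residual path)
        # Lower-triangular mask enforces the autoregressive structure:
        # position t may only attend to positions <= t.
        mask = torch.tril(torch.ones(max_len, max_len)).view(1, 1, max_len, max_len)
        self.register_buffer("mask", mask)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = F.softmax(att, dim=-1) @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """Pre-norm GPT-2 block: layer normalization precedes each sublayer (unlike GPT-1's post-norm)."""
    def __init__(self, d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # 4x expansion, e.g. 768 -> 3072
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Residual-path projection weights are scaled at init by 1/sqrt(N),
        # where N counts the residual layers (two residual additions per block).
        for p in (self.attn.proj.weight, self.mlp[2].weight):
            nn.init.normal_(p, std=0.02 / math.sqrt(2 * n_layers))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```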
GPT-2 was released in four sizes, allowing researchers to study how performance scaled with model capacity:
| Variant | Parameters | Layers | Hidden Dimension | Attention Heads | Feed-Forward Dimension | Context Length |
|---|---|---|---|---|---|---|
| Small | 124M | 12 | 768 | 12 | 3,072 | 1,024 |
| Medium | 355M | 24 | 1,024 | 16 | 4,096 | 1,024 |
| Large | 774M | 36 | 1,280 | 20 | 5,120 | 1,024 |
| XL | 1.5B | 48 | 1,600 | 25 | 6,400 | 1,024 |
The smallest GPT-2 model is roughly equivalent in size to GPT-1, while the largest is approximately 12 times bigger. All four variants share the same architecture and differ only in depth and width [1]. The original paper reported the small model as having 117M parameters and the medium as 345M, but later counting methods that include all embedding and bias terms place these at 124M and 355M respectively [9].
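A back-of-the-envelope check of these figures can be computed directly from the configurations in the table, counting all embedding, weight, and bias terms (GPT-2 ties the output projection to the token embedding, so no separate output matrix is counted). This is an illustrative estimate, not an official accounting.

```python
def gpt2_param_count(n_layers, d_model, vocab_size=50_257, n_ctx=1_024):
    """Approximate GPT-2 parameter count, including all embeddings and biases."""
    embed = vocab_size * d_model + n_ctx * d_model      # token + position embeddings
    attn = d_model * 3 * d_model + 3 * d_model          # fused QKV projection + bias
    attn += d_model * d_model + d_model                 # attention output projection + bias
    mlp = d_model * 4 * d_model + 4 * d_model           # feed-forward expansion + bias
    mlp += 4 * d_model * d_model + d_model              # feed-forward contraction + bias
    layer_norms = 2 * (2 * d_model)                     # two LayerNorms (gain + bias) per block
    per_block = attn + mlp + layer_norms
    final_ln = 2 * d_model                              # final layer normalization
    return embed + n_layers * per_block + final_ln

print(gpt2_param_count(12, 768))     # ~124.4M  (Small)
print(gpt2_param_count(48, 1600))    # ~1.56B   (XL)
```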
The availability of multiple model sizes proved valuable for the research community. It allowed systematic study of how performance scales with parameter count, and it made the smaller variants accessible to researchers and hobbyists who lacked the computational resources to run the full 1.5B model.
GPT-2 uses byte-level Byte Pair Encoding (BPE), a significant departure from the character-level or word-level BPE used by most previous models. Standard BPE operates on Unicode characters, but GPT-2's byte-level variant operates on raw bytes. This means the base vocabulary consists of just 256 byte values, and the BPE merges build up from there to create a final vocabulary of 50,257 tokens [1].
The practical advantage of byte-level BPE is that the model can encode any string of text, including rare characters, mathematical symbols, code, and non-English text, without ever encountering an out-of-vocabulary token. Previous tokenization schemes often replaced rare or unseen characters with a special unknown token, which degraded performance on text containing unusual formatting, names, or non-Latin scripts.
To prevent BPE from producing many redundant tokens that mix character categories (for example, a common word fused with its trailing punctuation, as in "dog.", "dog!", and "dog?"), the authors added a rule preventing merges across character categories such as letters, digits, and punctuation, with an exception for spaces. This improved the quality of the tokenization, particularly for rare words and proper nouns [1]. The resulting tokenizer handles multilingual text, code, and special characters with reasonable efficiency, although it was primarily optimized for English.
The GPT-2 tokenizer became widely adopted beyond its original model. Many subsequent language models, including early versions of GPT-3, used the same tokenizer or close variants of it. The tokenizer is available as a standalone component through libraries such as the Hugging Face Transformers library and the tiktoken library released by OpenAI.
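A brief usage sketch of the GPT-2 encoding through both libraries mentioned above; the example strings are arbitrary, and the point is simply that byte-level BPE never needs an unknown token and decodes back to the original text.

```python
import tiktoken
from transformers import AutoTokenizer

# OpenAI's tiktoken library ships the original GPT-2 encoding.
enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)            # 50257

# Byte-level BPE can represent any string -- accented text, emoji, code --
# without falling back to an <unk> token, and decoding round-trips exactly.
text = "naïve café 🙂 print('hello')"
ids = enc.encode(text)
assert enc.decode(ids) == text

# The same tokenizer is also distributed with the model on the Hugging Face Hub.
hf_tok = AutoTokenizer.from_pretrained("gpt2")
print(hf_tok.vocab_size)                  # 50257
print(hf_tok.tokenize("Counterfactual"))  # rare words split into several subword pieces
```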
A key difference between GPT-2 and its predecessor was the training dataset. While GPT-1 was trained on the BooksCorpus (approximately 800 million words from about 7,000 unpublished books), GPT-2 was trained on WebText, a new dataset created specifically for the project [1].
WebText was constructed by scraping the text content of all outbound links from Reddit posts that received at least 3 karma (upvotes minus downvotes). The rationale was that Reddit's voting mechanism serves as a crude quality filter: links that humans found interesting or informative were more likely to contain high-quality text. Only links posted prior to December 2017 were included. After deduplication and cleaning, the dataset contained about 8 million documents totaling roughly 40 GB of text [1].
The resulting dataset was far more diverse than BooksCorpus, spanning news articles, blog posts, scientific papers, fiction, forums, and many other genres. This diversity was intentional. The authors hypothesized that a model trained on broadly representative internet text would develop more general-purpose capabilities than one trained on a single domain [1].
Wikipedia was explicitly removed from the WebText dataset to ensure that test set contamination would not inflate benchmark scores, since many NLP benchmarks draw from Wikipedia-sourced content [1].
The scale of the dataset was a deliberate choice. GPT-1 had been trained on roughly 800 million words; WebText contained approximately 10 billion words, an order of magnitude increase. The authors hypothesized, correctly, that more diverse training data would produce a model with broader capabilities. The WebText approach also demonstrated that high-quality training data could be assembled without manual curation, using social signals (in this case, Reddit upvotes) as a proxy for quality.
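The WebText pipeline itself was never released; the sketch below is a loose, hypothetical illustration of the filtering logic described above (karma threshold, cutoff date, deduplication), with a caller-supplied function standing in for the scraping and text-extraction steps.

```python
from datetime import datetime

KARMA_THRESHOLD = 3
CUTOFF = datetime(2017, 12, 1)   # links posted prior to December 2017

def build_webtext_like_corpus(reddit_posts, fetch_text):
    """reddit_posts: iterable of dicts with 'url', 'karma', 'created' keys (hypothetical schema).
    fetch_text: function that downloads and extracts plain text from a URL (not provided here)."""
    seen_hashes = set()
    documents = []
    for post in reddit_posts:
        # Reddit karma acts as a crude human quality filter.
        if post["karma"] < KARMA_THRESHOLD or post["created"] >= CUTOFF:
            continue
        text = fetch_text(post["url"])
        if not text:
            continue
        # Exact-duplicate removal by content hash; the real pipeline also applied
        # heuristic cleaning and removed Wikipedia pages.
        h = hash(text)
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        documents.append(text)
    return documents
```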
OpenAI did not publicly release the WebText dataset. In response, Aaron Gokaslan and Vanya Cohen, then graduate students at Brown University, created OpenWebText, an open-source replication that followed the same methodology. OpenWebText became one of the most widely used training datasets in NLP research, adopted by projects including Meta's RoBERTa model. The dataset has been downloaded millions of times from the Hugging Face Hub and ranked as one of the most popular datasets on the platform [10].
The 1.5B-parameter model was trained on WebText using a batch size of 512 sequences, each with a context length of 1,024 tokens. The learning rate of each model was tuned manually for the best perplexity on a held-out sample of WebText [1]. According to the Hugging Face model card, training was conducted on 256 cloud TPU v3 cores [11]. The total compute used was not publicly disclosed in detail in the paper, though it was substantially less than what would later be used for GPT-3 [1].
The training objective was standard causal language modeling: given a sequence of tokens, predict the next token at each position. The loss function was cross-entropy over the vocabulary. No auxiliary objectives, task-specific heads, or supervised signals were used during training. The model learned everything from the single objective of next-token prediction on WebText.
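The objective fits in a few lines of PyTorch; this is a generic sketch of next-token cross-entropy, not OpenAI's training code.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits, tokens):
    """logits: (batch, seq_len, vocab) model outputs; tokens: (batch, seq_len) input ids.
    Each position is trained to predict the *next* token, so targets are the inputs shifted left."""
    shift_logits = logits[:, :-1, :]    # predictions for positions 0 .. T-2
    shift_targets = tokens[:, 1:]       # the tokens that actually follow them
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```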
This simplicity was part of the paper's point. The authors argued that a single, general-purpose training objective, applied at sufficient scale, could produce a model capable of performing many tasks that had previously required dedicated systems.
The defining feature of GPT-2 was its zero-shot performance: the ability to perform tasks without any task-specific training, simply by conditioning the model on an appropriate natural language prompt. Rather than fine-tuning on labeled examples, GPT-2 was evaluated by framing tasks as text completion problems [1].
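In practice this looks like ordinary text completion. A minimal example using the Hugging Face `pipeline` API with the released weights; the prompt and decoding settings are illustrative, not those used in the paper's evaluations.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # 124M small model

# A question-answering "task" framed purely as a completion problem.
prompt = "Q: What is the capital of France?\nA:"
out = generator(prompt, max_new_tokens=10, do_sample=False)
print(out[0]["generated_text"])
```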
GPT-2 achieved state-of-the-art results on 7 out of 8 language modeling benchmarks in a zero-shot setting, outperforming models that had been specifically trained on each domain [1].
| Dataset | Metric | GPT-2 (1.5B) | Previous SOTA |
|---|---|---|---|
| Penn Treebank | Perplexity | 35.76 | 46.54 |
| WikiText-103 | Perplexity | 17.48 | 18.3 |
| WikiText-2 | Perplexity | 18.34 | 39.14 |
| LAMBADA | Perplexity | 8.6 | 99.8 |
| LAMBADA | Accuracy | 63.24% | 59.23% |
| Children's Book Test (CN) | Accuracy | 93.3% | 85.7% |
| Children's Book Test (NE) | Accuracy | 89.1% | 82.3% |
| 1 Billion Word | Perplexity | 42.16 | 23.7 |
The only dataset where GPT-2 failed to surpass the previous state of the art was the 1 Billion Word benchmark, which measures performance on shuffled sentences rather than coherent documents. The authors attributed this to the distributional mismatch between WebText's long, coherent documents and the benchmark's shuffled sentence format [1].
The improvement on LAMBADA was particularly striking. LAMBADA tests a model's ability to predict the final word of a passage, which requires understanding long-range context. GPT-2 reduced perplexity from 99.8 to 8.6 and improved accuracy from 19% to 52.66% (or 63.24% with a stop-word filter applied) [1].
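For reference, the perplexity values quoted throughout this section are the exponential of the model's average per-token negative log-likelihood on the benchmark text, so lower is better:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(x_t \mid x_{<t}\right)\right)$$

A drop from 99.8 to 8.6 on LAMBADA therefore corresponds to a large reduction in the model's average uncertainty about each held-out token.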
On the Winograd Schema Challenge, which tests commonsense reasoning through pronoun resolution, GPT-2 achieved 70.70% accuracy, improving the state of the art by 7 percentage points. This was a notable result because the Winograd Schema Challenge is specifically designed to be difficult for statistical methods and to require genuine understanding of context and world knowledge [1].
On the CoQA benchmark (conversational question answering), GPT-2 achieved an F1 score of 55, matching or exceeding several baseline systems that were built specifically for the task. The model had no access to training examples from CoQA; it was simply conditioned on the document and question text and asked to generate an answer [1].
When prompted with "TL;DR:" after an article, GPT-2 generated rudimentary but coherent summaries. While these summaries were not competitive with dedicated summarization systems, the fact that the model could produce them at all from a simple text prompt was considered significant [1].
By conditioning the model on a handful of example pairs in the format "english sentence = french sentence", followed by a final English sentence and "=", GPT-2 achieved 5 BLEU on the WMT-14 English-to-French translation task. While far from competitive with dedicated translation systems, this was notable for a model that had never been deliberately trained on parallel text [1].
On the Natural Questions dataset, GPT-2 correctly answered 4.1% of questions. While modest in absolute terms, this was a non-trivial result for a model with no access to a knowledge retrieval system and no task-specific training [1].
Performance consistently improved with model size. On nearly every benchmark, the 1.5B-parameter model outperformed the 774M, which outperformed the 355M, which outperformed the 124M. This scaling behavior suggested that even larger models would achieve even better results, a prediction borne out by GPT-3 [1].
The paper also noted that the largest model still showed no signs of saturating on the WebText training objective, indicating that the model was underfitting its training data. This was an important observation: it meant that performance gains from scaling had not yet reached diminishing returns, and further increases in model size would likely yield continued improvements.
Beyond benchmarks, GPT-2 attracted widespread attention for the quality of its generated text. When given a prompt, the model could produce multiple paragraphs of coherent, stylistically consistent prose that was often difficult to distinguish from human writing at first glance [2].
OpenAI demonstrated this capability with a now-famous example: given a prompt about unicorns discovered in the Andes, GPT-2 generated a convincing multi-paragraph news article complete with fabricated quotes from fictional scientists. While the text contained factual errors and logical inconsistencies on close inspection, its surface-level fluency was far beyond what previous language models could produce [2].
The model proved capable of adapting to different styles and formats depending on the prompt. Given the opening of a news article, it would continue in journalistic prose. Given the start of a story, it would produce narrative fiction. Given a technical question, it would attempt an explanatory response. This stylistic flexibility, combined with the overall fluency of the output, was what made GPT-2 feel qualitatively different from prior language models.
However, GPT-2's generated text had clear limitations. The model would sometimes contradict itself within a single passage, introduce factual errors, lose coherence over long generations, or repeat phrases and ideas. It had no mechanism for grounding its outputs in verified facts, so it would confidently generate plausible-sounding but entirely fabricated information. These limitations became more apparent the longer the generated text became.
This generation quality was both the model's most impressive feature and the source of its controversy.
On February 14, 2019, OpenAI announced GPT-2 but broke with standard practice by withholding the full model. The organization stated that due to concerns about potential misuse, including the generation of fake news, spam, impersonation, and automated disinformation, it would not release the trained 1.5B-parameter model [2].
OpenAI demonstrated specific misuse scenarios, including a version of GPT-2 fine-tuned to generate convincing positive or negative product reviews on demand. The organization argued that the combination of fluent text generation and easy fine-tuning could lower the barrier for producing misleading content at scale [2].
Instead, OpenAI implemented a staged release strategy over nine months:
| Date | Release | Model Size | Accompanying Materials |
|---|---|---|---|
| February 14, 2019 | Paper and small model | 124M (Small) | Technical paper, blog post |
| May 2019 | Medium model | 355M (Medium) | Output dataset for detection research |
| August 20, 2019 | Large model | 774M (Large) | 6-month follow-up report, legal agreement |
| November 5, 2019 | Full model, code, and weights | 1.5B (XL) | Final release report |
At each stage, OpenAI published analyses of observed risks and monitored for evidence of misuse before proceeding to the next release [3][4].
The decision drew sharply divided reactions from the AI community.
Critics argued that the withholding was primarily a publicity stunt designed to generate media attention. Many machine learning researchers pointed out that the techniques behind GPT-2 were well-understood and easily reproducible, making the withholding of model weights a largely symbolic gesture. Anima Anandkumar, a professor at Caltech and director of machine learning research at Nvidia, stated that there was no evidence GPT-2 had the capabilities to pose the threats described by OpenAI. A February 2019 article in The Verge argued that the threat had been exaggerated. Some researchers argued that the decision harmed academic norms of openness and could slow legitimate safety research [5].
Supporters, particularly in AI policy and governance circles, welcomed the effort to develop norms around responsible publication of dual-use research. They argued that even if GPT-2 itself was not uniquely dangerous, establishing precedents for staged releases and risk assessment would be valuable as models grew more capable [6].
In its final release announcement in November 2019, OpenAI stated that it had seen "no strong evidence of misuse so far" during the staged rollout. The organization concluded that the staged release had been useful for building awareness and developing partnerships, even if the specific model had not been widely abused [4].
Regardless of whether GPT-2 itself posed significant risks, the controversy it generated had a lasting impact on the AI safety debate. The episode raised questions that the field continues to grapple with: When should AI research be restricted? Who gets to decide? How should the community balance the benefits of open science against the risks of dual-use technology? These questions became even more pressing with the release of more capable models like GPT-3 and GPT-4 [6].
The GPT-2 release strategy also influenced the broader AI community's approach to publication norms. Several organizations subsequently adopted staged release or limited access strategies for their own large language models, citing the precedent set by OpenAI's approach. Google, Meta, and other labs have since engaged in similar deliberations when releasing powerful models.
After the full model was released in November 2019, GPT-2 became one of the most widely used language models in the research community and beyond. Its availability through the Hugging Face Transformers library made it accessible to a broad audience, from academic researchers to hobbyists and application developers.
GPT-2 became one of the most downloaded models on the Hugging Face Hub. As of early 2026, the base GPT-2 model (124M) receives over 11 million downloads per month and has spawned over 2,100 fine-tuned variants, 1,600+ adapter models, and nearly 90 quantized versions on the platform [11]. The model's popularity on Hugging Face helped establish that platform as the standard repository for sharing pre-trained language models. The availability of GPT-2 weights through an easy-to-use Python API made it straightforward for developers to integrate the model into their own projects, contributing to the rapid growth of the NLP practitioner community.
The community developed a rich ecosystem of tools for fine-tuning GPT-2 on custom datasets:
| Tool | Creator | Description |
|---|---|---|
| gpt-2-simple | Max Woolf | Python package wrapping fine-tuning code with a simple interface for Google Colab |
| aitextgen | Max Woolf | Successor to gpt-2-simple, built on PyTorch and Hugging Face Transformers |
| Hugging Face Transformers | Hugging Face | Full-featured library supporting GPT-2 training, fine-tuning, and inference |
| Neil Shepperd's fork | Neil Shepperd | Early fork of OpenAI's repo enabling fine-tuning on custom data |
These tools lowered the barrier to entry for working with large language models. Researchers and developers could fine-tune GPT-2 on domain-specific corpora in a matter of hours using a single GPU, producing specialized models for tasks ranging from creative writing to code generation to customer service chatbots.
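A condensed sketch of that workflow using the Hugging Face Transformers Trainer; the corpus file, sequence length, and hyperparameters are placeholders rather than a recommended recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Any plain-text corpus works; "my_corpus.txt" is a placeholder for a domain-specific dataset.
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=5e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM, not masked LM
)
trainer.train()
```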
GPT-2 found its way into a variety of applications both before and after the full model release.
After the full model was released, the community produced hundreds of fine-tuned GPT-2 variants for specific tasks and domains, including creative writing, code generation, dialogue, poetry, legal text, medical summaries, and domain-specific text generation. These fine-tuned models demonstrated the versatility of the GPT-2 architecture and helped establish the practice of building specialized models on top of general-purpose pre-trained foundations.
The sheer volume of community fine-tuning activity around GPT-2 foreshadowed the much larger ecosystem that would develop around later open-weight models like LLaMA and Mistral.
GPT-2's most direct legacy was its role as the foundation for GPT-3, released by OpenAI in June 2020. GPT-3 used essentially the same architecture as GPT-2 but scaled it dramatically, from 1.5 billion to 175 billion parameters, and trained on a much larger and more diverse dataset [12].
Several specific findings from GPT-2 directly shaped the development of GPT-3.
The clear relationship between model size and performance observed in GPT-2 helped motivate the scaling laws research that followed. In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models," which formalized the empirical observation that model performance follows predictable power-law curves as a function of model size, dataset size, and compute [8]. This research was conducted in parallel with GPT-3's development and directly informed decisions about how large to make the model and how much data to train it on.
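In its simplest form, the paper models the language-modeling loss as a power law in each resource when the other resources are not the bottleneck (the exponents and constants are empirical fits reported by Kaplan et al., not derived quantities):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $N$ is the number of (non-embedding) parameters, $D$ the number of training tokens, and $C$ the training compute [8].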
The scaling laws paper cited GPT-2's results as motivating evidence, and the consistent log-linear improvement across GPT-2's four model sizes was among the empirical patterns the scaling laws sought to explain and predict.
The withholding of GPT-2, and later the restriction of GPT-3 to API-only access, motivated several independent efforts to create open alternatives.
Because OpenAI did not release the WebText training dataset, Aaron Gokaslan and Vanya Cohen of Brown University created OpenWebText, a faithful replication of the WebText methodology. They extracted Reddit post URLs, filtered for pages with at least 3 karma, downloaded the linked web pages, and applied deduplication and language filtering. The resulting dataset closely approximated WebText in both scale and composition and was released publicly. OpenWebText was subsequently used to train many research models, including Meta's RoBERTa [10].
EleutherAI, a grassroots collective of machine learning researchers formed in mid-2020, was motivated in part by the restricted access to GPT-2 and GPT-3. The group created GPT-Neo (2.7B parameters, March 2021), GPT-J (6B parameters, June 2021), and GPT-NeoX-20B (20B parameters, February 2022) as open-source alternatives. These models used architectures closely related to GPT-2 and GPT-3, and were trained on The Pile, a curated 800 GB dataset. EleutherAI's work demonstrated that the techniques behind GPT-2 and GPT-3 could be replicated by independent researchers with sufficient compute and determination [7].
Andrej Karpathy's "nanoGPT" project, released in early 2023, provided a minimal, clean reimplementation of GPT-2 training: roughly 300 lines of PyTorch for the model definition and roughly 300 more for the training loop. The project became widely used as an educational tool, helping thousands of students and practitioners understand the transformer training pipeline from scratch. Its popularity reflected GPT-2's role as a canonical reference model for understanding large language models.
The release of GPT-2 spurred early research into detecting machine-generated text. OpenAI released a detection tool alongside the model, noting that even simple statistical classifiers could distinguish GPT-2 output from human writing with reasonable accuracy for the smallest model, but that detection became harder as model size increased [3].
This observation foreshadowed a persistent challenge in the field. As language models improved, the gap between human and machine-generated text narrowed, making detection increasingly difficult. Researchers explored various approaches, including statistical analysis of token probabilities, watermarking techniques, and trained classifiers, but reliable detection of AI-generated text remains an unsolved problem.
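One of the simplest statistical signals referenced above is how predictable a passage looks to a language model itself. The sketch below scores a text by its GPT-2 perplexity, a crude heuristic (machine-generated text tends to look less surprising to the model), not a reliable detector.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def gpt2_perplexity(text):
    """Perplexity of `text` under GPT-2. Lower values mean the text is more predictable
    to the model; the signal is weak and easily fooled, especially for short passages."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # mean next-token cross-entropy
    return torch.exp(loss).item()

print(gpt2_perplexity("The quick brown fox jumps over the lazy dog."))
```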
OpenAI partnered with the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism (CTEC) during the staged release period to study the potential for GPT-2 to be used in generating extremist content. The resulting analyses informed subsequent decisions about model release and contributed to the growing body of research on AI misuse risks [3].
To place GPT-2 in context, the following table compares it with other major language models from the same period:
| Model | Organization | Date | Parameters | Architecture | Training Data |
|---|---|---|---|---|---|
| BERT Base | Google | Oct 2018 | 110M | Encoder-only | BooksCorpus + Wikipedia (16 GB) |
| BERT Large | Google | Oct 2018 | 340M | Encoder-only | BooksCorpus + Wikipedia (16 GB) |
| GPT-2 XL | OpenAI | Feb 2019 | 1.5B | Decoder-only | WebText (40 GB) |
| XLNet | Google/CMU | Jun 2019 | 340M | Autoregressive | BooksCorpus + Wikipedia + more (33 GB) |
| RoBERTa | Meta | Jul 2019 | 355M | Encoder-only | Multiple datasets (160 GB) |
| Megatron-LM | NVIDIA | Sep 2019 | 8.3B | Decoder-only | Wikipedia + other (174 GB) |
At 1.5 billion parameters, GPT-2 was the largest transformer language model announced at the time, although the full model was not publicly released until November 2019. It held that distinction until NVIDIA's 8.3-billion-parameter Megatron-LM in September 2019, and later the 11-billion-parameter T5 model from Google.
An often-overlooked finding from the GPT-2 paper was that even the largest model still underfit the WebText training data. Held-out perplexity on WebText continued to improve with both model size and training time, showing no sign of convergence even for the 1.5B-parameter model, which meant that a model with more parameters or more training would likely perform better [1].
This was a significant result. It meant that the scaling curve had not plateaued, and that the improvements from scaling that the paper documented were not yet approaching their limits. This finding directly supported the case for building much larger models, and it was one of the key pieces of evidence cited by OpenAI when justifying the scale of GPT-3.
GPT-2 occupies a unique place in the history of artificial intelligence. It was not the first large language model, nor the most capable by later standards, but it arrived at a moment when its combination of fluent text generation and zero-shot task performance changed how the field thought about what language models could do.
Several of GPT-2's contributions proved durable: the byte-level BPE tokenizer, the framing of diverse tasks as zero-shot text completion, the empirical case for scaling, and the precedent of staged model releases.
As of 2026, GPT-2 remains widely used as a baseline model, an educational tool, and a subject of interpretability research. Its relatively small size (by modern standards) makes it practical to run on consumer hardware, and its well-understood architecture makes it a standard reference point for studying transformer language models. The model's continued popularity on platforms like Hugging Face, seven years after its release, reflects its enduring importance to the field.