GPT-1 is the first model in the GPT (Generative Pre-trained Transformer) series, released by OpenAI in June 2018. Introduced in the paper "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, GPT-1 demonstrated that unsupervised learning through generative pre-training on a large corpus of text, followed by supervised fine-tuning on specific tasks, could produce strong results across a wide range of natural language processing (NLP) benchmarks [1]. With 117 million parameters, the model was modest by later standards, but it served as the proof-of-concept that launched a paradigm shift in NLP research and laid the foundation for the much larger models that followed.
Before GPT-1, most NLP systems relied heavily on supervised learning, where labeled datasets were required for every task. This approach had clear limitations: labeled data was expensive and time-consuming to create, and models trained on one task did not generalize well to others. Researchers had long explored ways to leverage the vast amounts of unlabeled text available on the internet, but previous attempts at semi-supervised or unsupervised pre-training had produced inconsistent results [1].
The core insight behind GPT-1 was that a language model trained to predict the next word in a sequence (an autoregressive objective) could learn general-purpose linguistic representations. These representations could then be transferred to downstream tasks with minimal architectural modification. This two-stage approach, unsupervised pre-training followed by supervised fine-tuning, distinguished GPT-1 from earlier methods that typically trained task-specific architectures from scratch [1].
The release of GPT-1 coincided with a broader trend in AI research toward transfer learning, where models trained on one task are adapted for another. In computer vision, pre-trained convolutional neural networks had already shown the power of this approach. GPT-1 brought similar benefits to NLP.
Prior work in NLP transfer learning included word-level embeddings such as Word2Vec and GloVe, which captured semantic relationships between individual words but could not represent sentence-level or document-level meaning. ELMo (Embeddings from Language Models), developed by researchers at the Allen Institute for AI in early 2018, took a step further by generating context-dependent word representations using bidirectional LSTMs. ULMFiT, proposed by Jeremy Howard and Sebastian Ruder, demonstrated that pre-trained language models could be effectively fine-tuned through techniques like discriminative learning rates and gradual unfreezing. GPT-1 built on this lineage but replaced recurrent architectures entirely with the transformer, which proved more effective at capturing long-range dependencies and was substantially faster to train due to its parallel processing capabilities [1][2].
These three models (ELMo, ULMFiT, and GPT-1) all appeared in 2018 and collectively established the first wave of transfer learning for NLP. However, they differed in how they transferred knowledge. ELMo produced contextual word embeddings that were fed as features into task-specific architectures. ULMFiT fine-tuned a pre-trained LSTM-based language model end-to-end. GPT-1 combined the fine-tuning approach with the transformer architecture, a combination that proved especially powerful and scalable [1].
GPT-1 uses a decoder-only transformer architecture, based on the transformer design introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need" [2]. Unlike the original transformer, which includes both encoder and decoder stacks, GPT-1 uses only the decoder portion with masked self-attention. This means that when generating or processing text, each token can only attend to tokens that came before it in the sequence, not tokens that come after.
| Component | Specification |
|---|---|
| Layers | 12 transformer decoder blocks |
| Hidden dimension | 768 |
| Attention heads | 12 (64-dimensional states each) |
| Feedforward dimension | 3,072 |
| Parameters | 117 million |
| Context length | 512 tokens |
| Vocabulary | 40,000 BPE merges |
| Activation function | GELU (Gaussian Error Linear Unit) |
| Position encoding | Learned positional embeddings |
| Dropout rate | 0.1 |
Each transformer block contains a masked multi-head self-attention layer followed by a position-wise feedforward neural network, with a residual connection around each sublayer and layer normalization applied after each sublayer, following the post-norm arrangement of the original transformer. The model uses learned positional embeddings rather than the sinusoidal encodings from the original transformer paper. Input and output token embeddings are tied (weight tying), meaning the same weight matrix is used for both the input embedding layer and the output prediction layer. This parameter-sharing technique reduces the total number of trainable parameters and helps the model learn more consistent representations [1].
The masked self-attention mechanism is central to the model's operation. In each attention head, the model computes query, key, and value vectors for every token in the sequence. The attention scores between tokens are calculated as the scaled dot product of queries and keys, but a mask is applied to prevent any token from attending to future positions. This ensures the model can only use information from earlier in the sequence when predicting the next token, preserving the autoregressive property needed for language modeling [2].
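As a concrete illustration, the following minimal sketch (not OpenAI's implementation) computes a single head of masked self-attention in NumPy; GPT-1 runs 12 such heads in parallel within each of its 12 layers:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention (illustrative sketch).
    x: (seq_len, d_model) token representations.
    Wq, Wk, Wv: (d_model, d_head) projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # queries, keys, values
    scores = q @ k.T / np.sqrt(q.shape[-1])           # scaled dot-product scores
    # Causal mask: position i may only attend to positions j <= i
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of value vectors
```

In the full model, the outputs of the 12 heads are concatenated and projected back to the 768-dimensional hidden space before the feedforward sublayer.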
The GELU (Gaussian Error Linear Unit) activation function was used in the feedforward layers instead of the more common ReLU. GELU provides a smoother approximation that weights inputs by their magnitude, and it became widely adopted in subsequent transformer models [1].
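For reference, GELU weights each input x by the standard normal CDF Φ(x), and it is commonly implemented with the tanh approximation that appears in OpenAI's released code:

```latex
\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right]\right)
```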
The choice of a decoder-only architecture was significant. While bidirectional models can access context from both directions, the autoregressive (left-to-right) approach used by GPT-1 is naturally suited to text generation tasks and provides a straightforward language modeling objective for pre-training. This architectural decision set the GPT series apart from later models like BERT, which used an encoder-only design with bidirectional attention.
GPT-1 uses Byte Pair Encoding (BPE) for tokenization, a subword segmentation method originally developed as a data compression algorithm. BPE works by iteratively merging the most frequent pairs of characters or character sequences in the training corpus until a target vocabulary size is reached. The GPT-1 vocabulary consists of 40,000 BPE merges [1].
The BPE approach offers several advantages over word-level and character-level tokenization. Word-level tokenization requires a fixed vocabulary and cannot handle out-of-vocabulary words. Character-level tokenization can represent any text but produces very long sequences that are difficult for models to process. BPE strikes a balance: common words are represented as single tokens, while rare words are broken into meaningful subword units. This allows the model to handle novel or rare words by composing them from familiar subword pieces [1].
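The merge-learning procedure can be illustrated with a toy sketch (illustrative only; this is not GPT-1's actual tokenizer code, which also handles word boundaries and text preprocessing):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(word) for word in words)    # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break                                      # nothing left to merge
        best = max(pairs, key=pairs.get)               # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():           # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# e.g. learn_bpe_merges(["lower", "lowest", "newer", "wider"], 10)
```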
The pre-training phase used a standard language modeling objective: given a sequence of tokens, predict the next token. Formally, the model maximizes the likelihood of each token given all preceding tokens in the sequence. This objective requires no labeled data and can be applied to any text corpus [1].
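In the paper's notation, given an unlabeled corpus of tokens u_1, ..., u_n and a context window of size k, pre-training maximizes the log-likelihood

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\bigl(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\bigr)
```

where Θ denotes the parameters of the transformer language model.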
GPT-1 was pre-trained on the BooksCorpus dataset, a collection of approximately 7,000 unpublished books spanning multiple genres including adventure, fantasy, and romance [3]. BooksCorpus was chosen because it contained long, contiguous stretches of text, which allowed the model to learn long-range dependencies and contextual relationships that shorter documents (such as individual web pages) might not provide. The dataset totaled roughly 800 million words.
The use of unpublished books, rather than published literature or web text, was a deliberate choice: they offered diverse writing styles and narrative structures, and their long-form nature meant the model was exposed to coherent narratives spanning thousands of tokens, which helped it learn discourse-level patterns that shorter texts could not provide [3].
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam (beta1=0.9, beta2=0.999) |
| Maximum learning rate | 2.5 x 10^-4 |
| Learning rate schedule | Linear warmup over 2,000 steps, then cosine annealing to zero |
| Batch size | 64 sequences of 512 tokens |
| Training epochs | 100 |
| Dropout | 0.1 |
| Training hardware | 8 GPUs |
| Training duration | Approximately 1 month |
| Compute | Approximately 0.96 petaflop/s-days |
The training used the Adam optimizer with a maximum learning rate of 2.5 x 10^-4. The learning rate was warmed up linearly over the first 2,000 updates and then annealed to zero using a cosine schedule. The model was trained for 100 epochs with mini-batches of 64 sequences, each 512 tokens long. A dropout rate of 0.1 was applied for regularization. The model achieved a token-level perplexity of 18.4 on the BooksCorpus test set [1].
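A minimal sketch of this schedule as a function of the update step (assuming the total number of updates is known in advance; the paper does not spell out every implementation detail):

```python
import math

def gpt1_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup to max_lr over the first 2,000 updates,
    then cosine annealing to zero over the remaining updates."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```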
By modern standards, the computational requirements for training GPT-1 were extremely modest. The entire training run consumed roughly 0.96 petaflop/s-days of compute, a figure that would grow by orders of magnitude with each successive GPT model. For comparison, GPT-3 required approximately 3,640 petaflop/s-days, nearly 4,000 times more compute than GPT-1 [5].
After pre-training, GPT-1 was fine-tuned on each downstream task using labeled data. The fine-tuning process required minimal changes to the model architecture. For most tasks, the approach involved adding a simple linear output layer on top of the pre-trained transformer and training the entire model end-to-end on the task-specific dataset [1].
| Hyperparameter | Value |
|---|---|
| Learning rate | 6.25 x 10^-5 |
| Batch size | 32 |
| Epochs | 3 |
| Auxiliary LM loss weight (lambda) | 0.5 |
| Linear warmup | Over 0.2% of training steps |
| Dropout | 0.1 |
The fine-tuning learning rate of 6.25 x 10^-5 was considerably lower than the pre-training rate of 2.5 x 10^-4, reflecting the standard practice of using smaller learning rates during fine-tuning to avoid catastrophically overwriting the knowledge acquired during pre-training. Only three epochs of fine-tuning were needed for most tasks, demonstrating the efficiency of the transfer learning approach [1].
To handle different task formats, the authors used input transformations that converted structured inputs into ordered sequences that the pre-trained model could process:
| Task Type | Input Transformation | Example Tasks |
|---|---|---|
| Classification | Sequence of tokens fed directly, with start and extract tokens | SST-2, CoLA |
| Textual entailment | Premise and hypothesis concatenated with a delimiter token | MNLI, SNLI, QNLI, RTE, SciTail |
| Similarity | Both orderings of the two sentences processed separately, then combined | STS-B, QQP, MRPC |
| Multiple choice | Each answer option concatenated with the context, scored independently | RACE, Story Cloze |
This approach was elegant in its simplicity. Rather than designing task-specific architectures, the authors converted every task into a sequence format compatible with the pre-trained model's input expectations. A special start token, delimiter token, and extract token were added to the vocabulary to structure the inputs. The linear output layer was the only new component added during fine-tuning [1].
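As an illustration, the transformations for entailment and multiple choice might look like the following sketch; the helper names and token-ID arguments are placeholders rather than the paper's code:

```python
def entailment_input(premise_ids, hypothesis_ids, start_id, delim_id, extract_id):
    """Textual entailment: <start> premise <delim> hypothesis <extract>."""
    return [start_id] + premise_ids + [delim_id] + hypothesis_ids + [extract_id]

def multiple_choice_inputs(context_ids, options_ids, start_id, delim_id, extract_id):
    """Multiple choice: each option is concatenated with the context and
    scored independently; the highest-scoring option is the prediction."""
    return [
        [start_id] + context_ids + [delim_id] + option_ids + [extract_id]
        for option_ids in options_ids
    ]
```

During fine-tuning, the new linear output layer reads the transformer's activation at the final extract token to produce the task prediction.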
During fine-tuning, an auxiliary language modeling objective was added alongside the task-specific objective with a weight of lambda = 0.5. This regularization technique helped improve generalization and accelerated convergence. The authors found that including the language modeling loss during fine-tuning improved performance on most tasks, particularly those with larger datasets [1].
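In the paper's notation, with L1 the language modeling objective and L2 the supervised objective computed over the labeled dataset C, fine-tuning optimizes the combined loss

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}), \qquad \lambda = 0.5
```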
GPT-1 was evaluated on 12 datasets across four categories of NLP tasks: natural language inference, question answering, semantic similarity, and text classification. The model achieved new state-of-the-art results on 9 of the 12 benchmarks [1].
Natural language inference:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| MNLI (matched) | Multi-genre natural language inference | 82.1% | 80.6% | +1.5% |
| MNLI (mismatched) | Cross-genre NLI | 81.4% | 80.1% | +1.3% |
| SNLI | Stanford natural language inference | 89.9% | 89.3% | +0.6% |
| SciTail | Science entailment | 88.3% | 83.3% | +5.0% |
| QNLI | Question NLI (derived from SQuAD) | 88.1% | 82.3% | +5.8% |
| RTE | Recognizing textual entailment | 56.0% | -- | -- |
Question answering and commonsense reasoning:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| RACE | Reading comprehension from exams | 59.0% | 53.3% | +5.7% |
| Story Cloze | Commonsense story completion | 86.5% | 77.6% | +8.9% |
Semantic similarity and classification:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| SST-2 | Binary sentiment analysis | 91.3% | 90.2% | +1.1% |
| CoLA | Linguistic acceptability (Matthews corr.) | 45.4 | 35.0 | +10.4 |
| STS-B | Semantic textual similarity | 82.0 | 81.0 | +1.0 |
| QQP | Quora question pair similarity | 70.3% | 66.1% | +4.2% |
| MRPC | Microsoft paraphrase detection (F1) | 82.3 | -- | -- |
| GLUE (overall) | Multi-task benchmark average | 72.8 | 68.9 | +3.9 |
The improvements were particularly large on tasks requiring commonsense reasoning (Story Cloze, +8.9%) and linguistic acceptability (CoLA, +10.4%). These tasks benefit from the broad world knowledge captured during pre-training on a diverse book corpus [1].
The three datasets where GPT-1 did not achieve state-of-the-art were tasks where existing supervised models had been heavily optimized with task-specific architectures. Even on these tasks, GPT-1 remained competitive, demonstrating that a single general-purpose architecture could approach or match the performance of specialized systems across diverse NLP problems [1].
GPT-1 also demonstrated promising zero-shot transfer capabilities. Without any task-specific fine-tuning, the model showed reasonable performance on several tasks, suggesting that the pre-training process captured broadly useful linguistic knowledge. The authors observed that zero-shot performance generally improved over the course of pre-training, indicating that the language model was gradually acquiring the ability to perform NLP tasks as a byproduct of learning to predict text [1].
The zero-shot experiments used heuristic methods to convert tasks into a format the language model could handle without any fine-tuning. For example, for sentiment analysis, the authors appended a token indicating positive or negative sentiment and measured the language model's probability of generating the correct completion. Performance was modest compared to the fine-tuned results, but the steady improvement during pre-training was a significant observation. It suggested that generative pre-training does not merely learn syntactic patterns but gradually acquires task-relevant knowledge [1].
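A rough sketch of this kind of completion-scoring heuristic, written against a modern causal-language-model interface (such as the Hugging Face model loading shown later in this article) rather than the paper's original code:

```python
import torch

def zero_shot_sentiment(model, tokenizer, review):
    """Score 'positive' vs 'negative' as continuations and return the label
    the language model finds more likely. The prompt template is an
    illustrative choice, not the exact heuristic used in the paper."""
    scores = {}
    for label in ("positive", "negative"):
        ids = tokenizer(f"{review} This review is {label}.", return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=ids the model returns the average next-token
            # cross-entropy over the sequence; lower loss = more likely text.
            loss = model(ids, labels=ids).loss
        scores[label] = -loss.item()
    return max(scores, key=scores.get)
```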
This zero-shot ability, while modest in GPT-1, foreshadowed the much stronger zero-shot performance of later models. The GPT-2 paper would later show that scaling up the model and training data could produce surprisingly capable zero-shot performance, and GPT-3 would demonstrate that very large language models could rival fine-tuned systems on many tasks with no gradient updates at all.
The GPT-1 paper included a rigorous set of ablation experiments that isolated the contribution of each component in the system. These experiments provided evidence for three key design choices: the use of pre-training, the transformer architecture, and the auxiliary language modeling objective during fine-tuning [1].
Removing pre-training entirely and training the transformer from scratch on the supervised tasks resulted in a dramatic performance drop. Without pre-training, the average score across all tasks decreased by 14.8%, confirming that the representations learned during unsupervised pre-training were critical to the model's success. This was the largest single factor in the ablation analysis, underscoring the central role of generative pre-training in the paper's approach [1].
Replacing the transformer architecture with a single-layer 2048-unit LSTM while keeping the same pre-training and fine-tuning framework resulted in an average score drop of 5.6 points across the evaluated tasks. The LSTM only outperformed the transformer on a single dataset (MRPC). This comparison demonstrated that the transformer architecture was a better fit for transfer learning, likely due to its ability to capture long-range dependencies through self-attention and its more structured memory compared to recurrent networks [1].
Removing the auxiliary language modeling loss during fine-tuning (setting lambda to 0) hurt performance on most tasks, particularly the NLI tasks and QQP. The auxiliary objective acted as a regularizer, preventing the model from overfitting to the small fine-tuning datasets while maintaining the general linguistic knowledge from pre-training. The authors found that the benefit was most pronounced on larger datasets, while smaller datasets showed more variable results [1].
| Configuration | Average Score Change |
|---|---|
| Full model (Transformer + pre-training + auxiliary LM) | Baseline |
| Without pre-training | -14.8% |
| LSTM instead of Transformer | -5.6 points |
| Without auxiliary LM objective | Small decrease on most tasks |
The paper also investigated how many layers of the pre-trained model needed to be transferred for effective fine-tuning. The authors found that transferring additional transformer layers provided incremental improvements. Even transferring only the embedding layers (without any transformer blocks) produced some benefit, and performance improved steadily as more layers were included: on MultiNLI, transferring the full stack improved performance by up to 9% compared with transferring the embeddings alone [1]. This result indicated that every layer in the 12-layer model captured useful representations, with lower layers encoding more general linguistic features and upper layers encoding more task-relevant information.
BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018, just four months after GPT-1, took a fundamentally different approach to transformer-based pre-training [4]. The two models together established the pre-training and fine-tuning paradigm that dominated NLP research for years.
| Feature | GPT-1 | BERT |
|---|---|---|
| Architecture | Decoder-only transformer | Encoder-only transformer |
| Directionality | Unidirectional (left-to-right) | Bidirectional |
| Pre-training objective | Next token prediction | Masked language modeling + next sentence prediction |
| Parameters | 117M | 110M (Base), 340M (Large) |
| Training data | BooksCorpus (800M words) | BooksCorpus + English Wikipedia (3.3B words) |
| Release date | June 2018 | October 2018 |
| Primary strength | Text generation | Text understanding and classification |
| Tokenization | BPE (40,000 merges) | WordPiece (30,000 tokens) |
BERT's bidirectional attention allowed it to consider context from both directions simultaneously, which gave it an advantage on many understanding tasks. BERT trained with a masked language modeling objective: random tokens in the input were replaced with a special [MASK] token, and the model learned to predict the original token using context from both sides. A secondary "next sentence prediction" objective trained BERT to determine whether two sentences appeared consecutively in the original text. This approach allowed BERT to build richer contextual representations for classification and extraction tasks. BERT surpassed GPT-1's results on several benchmarks shortly after its release [4].
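A toy sketch of the 15%/80-10-10 masking scheme described in the BERT paper (simplified, and not tied to any particular tokenizer):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", p=0.15):
    """Corrupt a token sequence for masked language modeling: each token is
    selected with probability p; a selected token becomes [MASK] 80% of the
    time, a random vocabulary token 10% of the time, and is left unchanged
    10% of the time. The model is trained to recover the original tokens."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < p:
            targets.append(tok)                    # prediction target at this position
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            targets.append(None)                   # position not scored
            corrupted.append(tok)
    return corrupted, targets
```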
On the GLUE benchmark, BERT Base achieved an average score of 79.6, substantially higher than GPT-1's 72.8. BERT Large pushed this further to 82.1. The improvement was especially notable on tasks such as RTE and MRPC, where GPT-1 had lagged behind the fine-tuned state of the art. BERT's bidirectional context was particularly helpful for tasks that required understanding the relationship between two text segments [4].
However, GPT-1's autoregressive approach was better suited for generative tasks and proved more scalable in the long run, as demonstrated by the subsequent GPT series. BERT's bidirectional architecture, while powerful for understanding, cannot straightforwardly generate text because it lacks the left-to-right sequential dependency that autoregressive models rely on [4].
The philosophical difference between the two approaches, generation versus understanding, shaped the direction of NLP research. While BERT-style models dominated leaderboards in 2019 and 2020, the GPT approach of scaling up autoregressive models ultimately proved to be the path toward more general-purpose AI systems. The success of GPT-3, ChatGPT, and GPT-4 validated the autoregressive strategy, as these models demonstrated that sufficiently large decoder-only models could match or exceed encoder-based models on understanding tasks while also excelling at generation.
To place GPT-1 in context, it is useful to compare it with other prominent pre-training approaches that appeared in the same period.
| Model | Authors / Lab | Date | Architecture | Pre-training Objective | Parameters | Key Innovation |
|---|---|---|---|---|---|---|
| ELMo | Peters et al. / Allen AI | Feb 2018 | Bidirectional LSTM | Forward + backward LM | 94M | Contextual word embeddings |
| ULMFiT | Howard & Ruder | Jan 2018 | AWD-LSTM | Language modeling | ~24M | Discriminative fine-tuning, gradual unfreezing |
| GPT-1 | Radford et al. / OpenAI | Jun 2018 | Decoder-only Transformer | Next token prediction | 117M | Transformer-based generative pre-training |
| BERT | Devlin et al. / Google | Oct 2018 | Encoder-only Transformer | Masked LM + NSP | 110M/340M | Bidirectional context for pre-training |
ELMo used bidirectional LSTMs to produce context-sensitive word embeddings, but these embeddings were used as input features to separate, task-specific models rather than being fine-tuned end-to-end. ULMFiT demonstrated effective fine-tuning of LSTM-based language models and introduced training techniques (discriminative learning rates, slanted triangular learning rates) that influenced subsequent work. GPT-1 combined the fine-tuning paradigm with the transformer architecture, and BERT showed that bidirectional pre-training could outperform the unidirectional approach on understanding tasks [1][4].
All four models contributed to the rapid shift away from training task-specific models from scratch. By the end of 2018, the pre-train-then-fine-tune paradigm had become the dominant methodology in NLP research.
GPT-1's most lasting contribution was not its benchmark scores, which were quickly surpassed, but its demonstration that generative pre-training could produce transferable language representations. This insight had several important consequences.
First, it established the pre-training and fine-tuning paradigm as the standard approach in NLP. Before GPT-1, training a separate model from scratch for each task was common practice. After GPT-1 (and BERT), pre-training became the default starting point for virtually all NLP work [1].
Second, GPT-1 suggested that performance would improve with scale. The authors noted that larger models and more data would likely yield better results, a hypothesis dramatically confirmed by GPT-2 (1.5 billion parameters), GPT-3 (175 billion parameters), and GPT-4 [5].
| Model | Release | Parameters | Training Data | Key Advance |
|---|---|---|---|---|
| GPT-1 | June 2018 | 117M | BooksCorpus (800M words) | Generative pre-training + fine-tuning |
| GPT-2 | February 2019 | 1.5B | WebText (40GB) | Zero-shot task transfer, larger scale |
| GPT-3 | June 2020 | 175B | Common Crawl + books (570GB) | Few-shot learning via prompting |
| GPT-4 | March 2023 | Undisclosed (est. >1T) | Undisclosed | Multimodal input, professional-level reasoning |
Third, the model's zero-shot transfer results, while modest, hinted at the possibility of building general-purpose language systems that would not need task-specific fine-tuning at all. This vision was more fully realized in GPT-2 and GPT-3, where prompting replaced fine-tuning for many applications.
GPT-1 also influenced the broader AI research community's approach to deep learning. The success of unsupervised pre-training on large text corpora helped accelerate the trend toward training ever-larger models on ever-larger datasets, a trend that continues to define the field. The paper has been cited over 17,000 times and remains a foundational reference in the study of large language models.
OpenAI released the GPT-1 model weights and code publicly through their GitHub repository (openai/finetune-transformer-lm), implemented in TensorFlow. This open release allowed the research community to reproduce the paper's results, experiment with the model, and build upon it. The model is also available through the Hugging Face Transformers library under the identifier "openai-gpt". The open availability of GPT-1 contributed to its influence, as researchers worldwide could directly experiment with the pre-trained weights rather than training from scratch [5].
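For example, the pre-trained weights can be loaded and sampled from in a few lines with the Transformers library (assuming `transformers` and `torch` are installed):

```python
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

inputs = tokenizer("the history of natural language processing", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```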
One of the most consequential aspects of GPT-1, recognized only in hindsight, is how it foreshadowed the scaling laws that would become central to AI research. The paper's ablation studies showed that more pre-training improved downstream performance, and the authors explicitly noted that larger models and more data were likely to produce further gains. This observation was formalized in later work by Kaplan et al. (2020), who showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute budget. GPT-1 was the first data point on what would become one of the most important empirical findings in modern AI research.
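Kaplan et al. reported, for example, that with sufficient data and compute the loss falls off as a power law in the number of non-embedding parameters N, roughly of the form

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
```

with analogous power laws in dataset size and training compute.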
In hindsight, GPT-1 can be understood as the starting point of a research trajectory that, within five years, led to systems capable of passing professional examinations, writing functional code, and holding extended conversations. The 117-million-parameter model that achieved 72.8% on GLUE in 2018 was the first step toward the billion- and trillion-parameter models that would reshape both AI research and the technology industry.