GPT-1 is the first model in the GPT (Generative Pre-trained Transformer) series, released by OpenAI in June 2018. Introduced in the paper "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, GPT-1 demonstrated that unsupervised learning through generative pre-training on a large corpus of text, followed by supervised fine-tuning on specific tasks, could produce strong results across a wide range of natural language processing (NLP) benchmarks [1]. With 117 million parameters, the model was modest by later standards, but it served as the proof-of-concept that launched a paradigm shift in NLP research and laid the foundation for the much larger models that followed.
Before GPT-1, most NLP systems relied heavily on supervised learning, where labeled datasets were required for every task. This approach had clear limitations: labeled data was expensive and time-consuming to create, and models trained on one task did not generalize well to others. Researchers had long explored ways to leverage the vast amounts of unlabeled text available on the internet, but previous attempts at semi-supervised or unsupervised pre-training had produced inconsistent results [1].
The core insight behind GPT-1 was that a language model trained to predict the next word in a sequence (an autoregressive objective) could learn general-purpose linguistic representations. These representations could then be transferred to downstream tasks with minimal architectural modification. This two-stage approach, unsupervised pre-training followed by supervised fine-tuning, distinguished GPT-1 from earlier methods that typically trained task-specific architectures from scratch [1].
The release of GPT-1 coincided with a broader trend in AI research toward transfer learning, where models trained on one task are adapted for another. In computer vision, pre-trained convolutional neural networks had already shown the power of this approach. GPT-1 brought similar benefits to NLP.
Prior work in NLP transfer learning included word-level embeddings such as Word2Vec and GloVe, which captured semantic relationships between individual words but could not represent sentence-level or document-level meaning. ELMo (Embeddings from Language Models), developed by researchers at the Allen Institute for AI in early 2018, took a step further by generating context-dependent word representations using bidirectional LSTMs. ULMFiT, proposed by Jeremy Howard and Sebastian Ruder, demonstrated that pre-trained language models could be effectively fine-tuned through techniques like discriminative learning rates and gradual unfreezing. GPT-1 built on this lineage but replaced recurrent architectures entirely with the transformer, which proved more effective at capturing long-range dependencies and was substantially faster to train due to its parallel processing capabilities [1][2].
These three models (ELMo, ULMFiT, and GPT-1) all appeared in 2018 and collectively established the first wave of transfer learning for NLP. However, they differed in how they transferred knowledge. ELMo produced contextual word embeddings that were fed as features into task-specific architectures. ULMFiT fine-tuned a pre-trained LSTM-based language model end-to-end. GPT-1 combined the fine-tuning approach with the transformer architecture, a combination that proved especially powerful and scalable [1].
GPT-1 uses a decoder-only transformer architecture, based on the transformer design introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need" [2]. Unlike the original transformer, which includes both encoder and decoder stacks, GPT-1 uses only the decoder portion with masked self-attention. This means that when generating or processing text, each token can only attend to tokens that came before it in the sequence, not tokens that come after.
| Component | Specification |
|---|---|
| Layers | 12 transformer decoder blocks |
| Hidden dimension | 768 |
| Attention heads | 12 (64-dimensional states each) |
| Feedforward dimension | 3,072 |
| Parameters | 117 million |
| Context length | 512 tokens |
| Vocabulary | 40,000 BPE merges |
| Activation function | GELU (Gaussian Error Linear Unit) |
| Position encoding | Learned positional embeddings |
| Dropout rate | 0.1 |
Each transformer block contains a masked multi-head self-attention layer followed by a position-wise feedforward neural network, with a residual connection around each sublayer and layer normalization applied after each sublayer, following the post-norm arrangement of the original transformer. The model uses learned positional embeddings rather than the sinusoidal encodings from the original transformer paper. Input and output token embeddings are tied (weight tying), meaning the same weight matrix is used for both the input embedding layer and the output prediction layer. This parameter-sharing technique reduces the total number of trainable parameters and helps the model learn more consistent representations [1].
The masked self-attention mechanism is central to the model's operation. In each attention head, the model computes query, key, and value vectors for every token in the sequence. The attention scores between tokens are calculated as the scaled dot product of queries and keys, but a mask is applied to prevent any token from attending to future positions. This ensures the model can only use information from earlier in the sequence when predicting the next token, preserving the autoregressive property needed for language modeling [2].
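As a concrete illustration, the following minimal sketch (not OpenAI's implementation) computes a single head of masked self-attention in NumPy; GPT-1 runs 12 such heads in parallel within each of its 12 layers:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head masked self-attention (illustrative sketch).
    x: (seq_len, d_model) token representations.
    Wq, Wk, Wv: (d_model, d_head) projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # queries, keys, values
    scores = q @ k.T / np.sqrt(q.shape[-1])           # scaled dot-product scores
    # Causal mask: position i may only attend to positions j <= i
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[future] = -1e9
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ v                                # weighted sum of value vectors
```

In the full model, the outputs of the 12 heads are concatenated and projected back to the 768-dimensional hidden space before the feedforward sublayer.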
The GELU (Gaussian Error Linear Unit) activation function was used in the feedforward layers instead of the more common ReLU. GELU provides a smoother approximation that weights inputs by their magnitude, and it became widely adopted in subsequent transformer models [1].
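For reference, GELU weights each input x by the standard normal CDF Φ(x), and it is commonly implemented with the tanh approximation that appears in OpenAI's released code:

```latex
\mathrm{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\bigl(x + 0.044715\,x^{3}\bigr)\right]\right)
```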
The choice of a decoder-only architecture was significant. While bidirectional models can access context from both directions, the autoregressive (left-to-right) approach used by GPT-1 is naturally suited to text generation tasks and provides a straightforward language modeling objective for pre-training. This architectural decision set the GPT series apart from later models like BERT, which used an encoder-only design with bidirectional attention.
GPT-1 uses Byte Pair Encoding (BPE) for tokenization, a subword segmentation method originally developed as a data compression algorithm. BPE works by iteratively merging the most frequent pairs of characters or character sequences in the training corpus until a target vocabulary size is reached. The GPT-1 vocabulary consists of 40,000 BPE merges [1].
The BPE approach offers several advantages over word-level and character-level tokenization. Word-level tokenization requires a fixed vocabulary and cannot handle out-of-vocabulary words. Character-level tokenization can represent any text but produces very long sequences that are difficult for models to process. BPE strikes a balance: common words are represented as single tokens, while rare words are broken into meaningful subword units. This allows the model to handle novel or rare words by composing them from familiar subword pieces [1].
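The merge-learning procedure can be illustrated with a toy sketch (illustrative only; this is not GPT-1's actual tokenizer code, which also handles word boundaries and text preprocessing):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merges: repeatedly merge the most frequent adjacent symbol pair."""
    corpus = Counter(tuple(word) for word in words)    # each word as a tuple of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break                                      # nothing left to merge
        best = max(pairs, key=pairs.get)               # most frequent adjacent pair
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():           # apply the merge everywhere
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# e.g. learn_bpe_merges(["lower", "lowest", "newer", "wider"], 10)
```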
The pre-training phase used a standard language modeling objective: given a sequence of tokens, predict the next token. Formally, the model maximizes the likelihood of each token given all preceding tokens in the sequence. This objective requires no labeled data and can be applied to any text corpus [1].
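In the paper's notation, given an unlabeled corpus of tokens u_1, ..., u_n and a context window of size k, pre-training maximizes the log-likelihood

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\bigl(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\bigr)
```

where Θ denotes the parameters of the transformer language model.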
GPT-1 was pre-trained on the BooksCorpus dataset, a collection of approximately 7,000 unpublished books spanning multiple genres including adventure, fantasy, and romance [3]. BooksCorpus was chosen because it contained long, contiguous stretches of text, which allowed the model to learn long-range dependencies and contextual relationships that shorter documents (such as individual web pages) might not provide. The dataset totaled roughly 800 million words.
The use of unpublished books, rather than published literature or web text, was a deliberate choice: they offered diverse writing styles and narrative structures, and their long-form nature meant the model was exposed to coherent narratives spanning thousands of tokens, which helped it learn discourse-level patterns that shorter texts could not provide [3].
| Hyperparameter | Value |
|---|---|
| Optimizer | Adam (beta1=0.9, beta2=0.999) |
| Maximum learning rate | 2.5 x 10^-4 |
| Learning rate schedule | Linear warmup over 2,000 steps, then cosine annealing to zero |
| Batch size | 64 sequences of 512 tokens |
| Training epochs | 100 |
| Dropout | 0.1 |
| Training hardware | 8 GPUs |
| Training duration | Approximately 1 month |
| Compute | Approximately 0.96 petaflop/s-days |
The training used the Adam optimizer with a maximum learning rate of 2.5 x 10^-4. The learning rate was warmed up linearly over the first 2,000 updates and then annealed to zero using a cosine schedule. The model was trained for 100 epochs with mini-batches of 64 sequences, each 512 tokens long. A dropout rate of 0.1 was applied for regularization. The model achieved a token-level perplexity of 18.4 on the BooksCorpus test set [1].
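A minimal sketch of this schedule as a function of the update step (assuming the total number of updates is known in advance; the paper does not spell out every implementation detail):

```python
import math

def gpt1_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup to max_lr over the first 2,000 updates,
    then cosine annealing to zero over the remaining updates."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```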
By modern standards, the computational requirements for training GPT-1 were extremely modest. The entire training run consumed roughly 0.96 petaflop/s-days of compute, a figure that would grow by orders of magnitude with each successive GPT model. For comparison, GPT-3 required approximately 3,640 petaflop/s-days, nearly 4,000 times more compute than GPT-1 [5].
After pre-training, GPT-1 was fine-tuned on each downstream task using labeled data. The fine-tuning process required minimal changes to the model architecture. For most tasks, the approach involved adding a simple linear output layer on top of the pre-trained transformer and training the entire model end-to-end on the task-specific dataset [1].
| Hyperparameter | Value |
|---|---|
| Learning rate | 6.25 x 10^-5 |
| Batch size | 32 |
| Epochs | 3 |
| Auxiliary LM loss weight (lambda) | 0.5 |
| Linear warmup | Over 0.2% of training steps |
| Dropout | 0.1 |
The fine-tuning learning rate of 6.25 x 10^-5 was considerably lower than the pre-training rate of 2.5 x 10^-4, reflecting the standard practice of using smaller learning rates during fine-tuning to avoid catastrophically overwriting the knowledge acquired during pre-training. Only three epochs of fine-tuning were needed for most tasks, demonstrating the efficiency of the transfer learning approach [1].
To handle different task formats, the authors used input transformations that converted structured inputs into ordered sequences that the pre-trained model could process:
| Task Type | Input Transformation | Example Tasks |
|---|---|---|
| Classification | Sequence of tokens fed directly, with start and extract tokens | SST-2, CoLA |
| Textual entailment | Premise and hypothesis concatenated with a delimiter token | MNLI, SNLI, QNLI, RTE, SciTail |
| Similarity | Both orderings of the two sentences processed separately, then combined | STS-B, QQP, MRPC |
| Multiple choice | Each answer option concatenated with the context, scored independently | RACE, Story Cloze |
This approach was elegant in its simplicity. Rather than designing task-specific architectures, the authors converted every task into a sequence format compatible with the pre-trained model's input expectations. A special start token, delimiter token, and extract token were added to the vocabulary to structure the inputs. The linear output layer was the only new component added during fine-tuning [1].
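As an illustration, the transformations for entailment and multiple choice might look like the following sketch; the helper names and token-ID arguments are placeholders rather than the paper's code:

```python
def entailment_input(premise_ids, hypothesis_ids, start_id, delim_id, extract_id):
    """Textual entailment: <start> premise <delim> hypothesis <extract>."""
    return [start_id] + premise_ids + [delim_id] + hypothesis_ids + [extract_id]

def multiple_choice_inputs(context_ids, options_ids, start_id, delim_id, extract_id):
    """Multiple choice: each option is concatenated with the context and
    scored independently; the highest-scoring option is the prediction."""
    return [
        [start_id] + context_ids + [delim_id] + option_ids + [extract_id]
        for option_ids in options_ids
    ]
```

During fine-tuning, the new linear output layer reads the transformer's activation at the final extract token to produce the task prediction.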
During fine-tuning, an auxiliary language modeling objective was added alongside the task-specific objective with a weight of lambda = 0.5. This regularization technique helped improve generalization and accelerated convergence. The authors found that including the language modeling loss during fine-tuning improved performance on most tasks, particularly those with larger datasets [1].
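In the paper's notation, with L1 the language modeling objective and L2 the supervised objective computed over the labeled dataset C, fine-tuning optimizes the combined loss

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C}), \qquad \lambda = 0.5
```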
GPT-1 was evaluated on 12 datasets across four categories of NLP tasks: natural language inference, question answering, semantic similarity, and text classification. The model achieved new state-of-the-art results on 9 of the 12 benchmarks [1].
Natural language inference:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| MNLI (matched) | Multi-genre natural language inference | 82.1% | 80.6% | +1.5% |
| MNLI (mismatched) | Cross-genre NLI | 81.4% | 80.1% | +1.3% |
| SNLI | Stanford natural language inference | 89.9% | 89.3% | +0.6% |
| SciTail | Science entailment | 88.3% | 83.3% | +5.0% |
| QNLI | Question NLI (derived from SQuAD) | 88.1% | 82.3% | +5.8% |
| RTE | Recognizing textual entailment | 56.0% | -- | -- |
Question answering and commonsense reasoning:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| RACE | Reading comprehension from exams | 59.0% | 53.3% | +5.7% |
| Story Cloze | Commonsense story completion | 86.5% | 77.6% | +8.9% |
Semantic similarity and classification:
| Dataset | Task Description | GPT-1 Score | Previous Best | Improvement |
|---|---|---|---|---|
| SST-2 | Binary sentiment analysis | 91.3% | 90.2% | +1.1% |
| CoLA | Linguistic acceptability (Matthews corr.) | 45.4 | 35.0 | +10.4 |
| STS-B | Semantic textual similarity | 82.0 | 81.0 | +1.0 |
| QQP | Quora question pair similarity | 70.3% | 66.1% | +4.2% |
| MRPC | Microsoft paraphrase detection (F1) | 82.3 | -- | -- |
| GLUE (overall) | Multi-task benchmark average | 72.8 | 68.9 | +3.9 |
The improvements were particularly large on tasks requiring commonsense reasoning (Story Cloze, +8.9%) and linguistic acceptability (CoLA, +10.4%). These tasks benefit from the broad world knowledge captured during pre-training on a diverse book corpus [1].
The three datasets where GPT-1 did not achieve state-of-the-art were tasks where existing supervised models had been heavily optimized with task-specific architectures. Even on these tasks, GPT-1 remained competitive, demonstrating that a single general-purpose architecture could approach or match the performance of specialized systems across diverse NLP problems [1].
GPT-1 also demonstrated promising zero-shot transfer capabilities. Without any task-specific fine-tuning, the model showed reasonable performance on several tasks, suggesting that the pre-training process captured broadly useful linguistic knowledge. The authors observed that zero-shot performance generally improved over the course of pre-training, indicating that the language model was gradually acquiring the ability to perform NLP tasks as a byproduct of learning to predict text [1].
The zero-shot experiments used heuristic methods to convert tasks into a format the language model could handle without any fine-tuning. For example, for sentiment analysis, the authors appended a token indicating positive or negative sentiment and measured the language model's probability of generating the correct completion. Performance was modest compared to the fine-tuned results, but the steady improvement during pre-training was a significant observation. It suggested that generative pre-training does not merely learn syntactic patterns but gradually acquires task-relevant knowledge [1].
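A rough sketch of this kind of completion-scoring heuristic, written against a modern causal-language-model interface (such as the Hugging Face model loading shown later in this article) rather than the paper's original code:

```python
import torch

def zero_shot_sentiment(model, tokenizer, review):
    """Score 'positive' vs 'negative' as continuations and return the label
    the language model finds more likely. The prompt template is an
    illustrative choice, not the exact heuristic used in the paper."""
    scores = {}
    for label in ("positive", "negative"):
        ids = tokenizer(f"{review} This review is {label}.", return_tensors="pt").input_ids
        with torch.no_grad():
            # With labels=ids the model returns the average next-token
            # cross-entropy over the sequence; lower loss = more likely text.
            loss = model(ids, labels=ids).loss
        scores[label] = -loss.item()
    return max(scores, key=scores.get)
```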
This zero-shot ability, while modest in GPT-1, foreshadowed the much stronger zero-shot performance of later models. The GPT-2 paper would later show that scaling up the model and training data could produce surprisingly capable zero-shot performance, and GPT-3 would demonstrate that very large language models could rival fine-tuned systems on many tasks with no gradient updates at all.
The GPT-1 paper included a rigorous set of ablation experiments that isolated the contribution of each component in the system. These experiments provided evidence for three key design choices: the use of pre-training, the transformer architecture, and the auxiliary language modeling objective during fine-tuning [1].
Removing pre-training entirely and training the transformer from scratch on the supervised tasks resulted in a dramatic performance drop. Without pre-training, the average score across all tasks decreased by 14.8%, confirming that the representations learned during unsupervised pre-training were critical to the model's success. This was the largest single factor in the ablation analysis, underscoring the central role of generative pre-training in the paper's approach [1].
Replacing the transformer architecture with a single-layer 2048-unit LSTM while keeping the same pre-training and fine-tuning framework resulted in an average score drop of 5.6 points across the evaluated tasks. The LSTM only outperformed the transformer on a single dataset (MRPC). This comparison demonstrated that the transformer architecture was a better fit for transfer learning, likely due to its ability to capture long-range dependencies through self-attention and its more structured memory compared to recurrent networks [1].
Removing the auxiliary language modeling loss during fine-tuning (setting lambda to 0) hurt performance on most tasks, particularly the NLI tasks and QQP. The auxiliary objective acted as a regularizer, preventing the model from overfitting to the small fine-tuning datasets while maintaining the general linguistic knowledge from pre-training. The authors found that the benefit was most pronounced on larger datasets, while smaller datasets showed more variable results [1].
| Configuration | Average Score Change |
|---|---|
| Full model (Transformer + pre-training + auxiliary LM) | Baseline |
| Without pre-training | -14.8% |
| LSTM instead of Transformer | -5.6 points |
| Without auxiliary LM objective | Small decrease on most tasks |
The paper also investigated how many layers of the pre-trained model needed to be transferred for effective fine-tuning. The authors found that transferring additional transformer layers provided incremental improvements. Even transferring only the embedding layers (without any transformer blocks) produced some benefit, and performance improved steadily as more layers were included: on MultiNLI, transferring the full stack improved performance by up to 9% compared with transferring the embeddings alone [1]. This result indicated that every layer in the 12-layer model captured useful representations, with lower layers encoding more general linguistic features and upper layers encoding more task-relevant information.
BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018, just four months after GPT-1, took a fundamentally different approach to transformer-based pre-training [4]. The two models together established the pre-training and fine-tuning paradigm that dominated NLP research for years.
| Feature | GPT-1 | BERT |
|---|---|---|
| Architecture | Decoder-only transformer | Encoder-only transformer |
| Directionality | Unidirectional (left-to-right) | Bidirectional |
| Pre-training objective | Next token prediction | Masked language modeling + next sentence prediction |
| Parameters | 117M | 110M (Base), 340M (Large) |
| Training data | BooksCorpus (800M words) | BooksCorpus + English Wikipedia (3.3B words) |
| Release date | June 2018 | October 2018 |
| Primary strength | Text generation | Text understanding and classification |
| Tokenization | BPE (40,000 merges) | WordPiece (30,000 tokens) |
BERT's bidirectional attention allowed it to consider context from both directions simultaneously, which gave it an advantage on many understanding tasks. BERT trained with a masked language modeling objective: random tokens in the input were replaced with a special [MASK] token, and the model learned to predict the original token using context from both sides. A secondary "next sentence prediction" objective trained BERT to determine whether two sentences appeared consecutively in the original text. This approach allowed BERT to build richer contextual representations for classification and extraction tasks. BERT surpassed GPT-1's results on several benchmarks shortly after its release [4].
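A toy sketch of the 15%/80-10-10 masking scheme described in the BERT paper (simplified, and not tied to any particular tokenizer):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", p=0.15):
    """Corrupt a token sequence for masked language modeling: each token is
    selected with probability p; a selected token becomes [MASK] 80% of the
    time, a random vocabulary token 10% of the time, and is left unchanged
    10% of the time. The model is trained to recover the original tokens."""
    corrupted, targets = [], []
    for tok in tokens:
        if random.random() < p:
            targets.append(tok)                    # prediction target at this position
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            targets.append(None)                   # position not scored
            corrupted.append(tok)
    return corrupted, targets
```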
On the GLUE benchmark, BERT Base achieved an average score of 79.6, substantially higher than GPT-1's 72.8. BERT Large pushed this further to 82.1. The improvement was especially notable on tasks such as RTE and MRPC, where GPT-1 had lagged behind the fine-tuned state of the art. BERT's bidirectional context was particularly helpful for tasks that required understanding the relationship between two text segments [4].
However, GPT-1's autoregressive approach was better suited for generative tasks and proved more scalable in the long run, as demonstrated by the subsequent GPT series. BERT's bidirectional architecture, while powerful for understanding, cannot straightforwardly generate text because it lacks the left-to-right sequential dependency that autoregressive models rely on [4].
The philosophical difference between the two approaches, generation versus understanding, shaped the direction of NLP research. While BERT-style models dominated leaderboards in 2019 and 2020, the GPT approach of scaling up autoregressive models ultimately proved to be the path toward more general-purpose AI systems. The success of GPT-3, ChatGPT, and GPT-4 validated the autoregressive strategy, as these models demonstrated that sufficiently large decoder-only models could match or exceed encoder-based models on understanding tasks while also excelling at generation.
To place GPT-1 in context, it is useful to compare it with other prominent pre-training approaches that appeared in the same period.
| Model | Authors / Lab | Date | Architecture | Pre-training Objective | Parameters | Key Innovation |
|---|---|---|---|---|---|---|
| ELMo | Peters et al. / Allen AI | Feb 2018 | Bidirectional LSTM | Forward + backward LM | 94M | Contextual word embeddings |
| ULMFiT | Howard & Ruder | Jan 2018 | AWD-LSTM | Language modeling | ~24M | Discriminative fine-tuning, gradual unfreezing |
| GPT-1 | Radford et al. / OpenAI | Jun 2018 | Decoder-only Transformer | Next token prediction | 117M | Transformer-based generative pre-training |
| BERT | Devlin et al. / Google | Oct 2018 | Encoder-only Transformer | Masked LM + NSP | 110M/340M | Bidirectional context for pre-training |
ELMo used bidirectional LSTMs to produce context-sensitive word embeddings, but these embeddings were used as input features to separate, task-specific models rather than being fine-tuned end-to-end. ULMFiT demonstrated effective fine-tuning of LSTM-based language models and introduced training techniques (discriminative learning rates, slanted triangular learning rates) that influenced subsequent work. GPT-1 combined the fine-tuning paradigm with the transformer architecture, and BERT showed that bidirectional pre-training could outperform the unidirectional approach on understanding tasks [1][4].
All four models contributed to the rapid shift away from training task-specific models from scratch. By the end of 2018, the pre-train-then-fine-tune paradigm had become the dominant methodology in NLP research.
GPT-1's most lasting contribution was not its benchmark scores, which were quickly surpassed, but its demonstration that generative pre-training could produce transferable language representations. This insight had several important consequences.
First, it established the pre-training and fine-tuning paradigm as the standard approach in NLP. Before GPT-1, training a separate model from scratch for each task was common practice. After GPT-1 (and BERT), pre-training became the default starting point for virtually all NLP work [1].
Second, GPT-1 suggested that performance would improve with scale. The authors noted that larger models and more data would likely yield better results, a hypothesis dramatically confirmed by GPT-2 (1.5 billion parameters), GPT-3 (175 billion parameters), and GPT-4 [5].
| Model | Release | Parameters | Training Data | Key Advance |
|---|---|---|---|---|
| GPT-1 | June 2018 | 117M | BooksCorpus (800M words) | Generative pre-training + fine-tuning |
| GPT-2 | February 2019 | 1.5B | WebText (40GB) | Zero-shot task transfer, larger scale |
| GPT-3 | June 2020 | 175B | Common Crawl + books (570GB) | Few-shot learning via prompting |
| GPT-4 | March 2023 | Undisclosed (est. >1T) | Undisclosed | Multimodal input, professional-level reasoning |
Third, the model's zero-shot transfer results, while modest, hinted at the possibility of building general-purpose language systems that would not need task-specific fine-tuning at all. This vision was more fully realized in GPT-2 and GPT-3, where prompting replaced fine-tuning for many applications.
GPT-1 also influenced the broader AI research community's approach to deep learning. The success of unsupervised pre-training on large text corpora helped accelerate the trend toward training ever-larger models on ever-larger datasets, a trend that continues to define the field. The paper has been cited over 17,000 times and remains a foundational reference in the study of large language models.
OpenAI released the GPT-1 model weights and code publicly through their GitHub repository (openai/finetune-transformer-lm), implemented in TensorFlow. This open release allowed the research community to reproduce the paper's results, experiment with the model, and build upon it. The model is also available through the Hugging Face Transformers library under the identifier "openai-gpt". The open availability of GPT-1 contributed to its influence, as researchers worldwide could directly experiment with the pre-trained weights rather than training from scratch [5].
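For example, the pre-trained weights can be loaded and sampled from in a few lines with the Transformers library (assuming `transformers` and `torch` are installed):

```python
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")

inputs = tokenizer("the history of natural language processing", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```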
One of the most consequential aspects of GPT-1, recognized only in hindsight, is how it foreshadowed the scaling laws that would become central to AI research. The paper's ablation studies showed that more pre-training improved downstream performance, and the authors explicitly noted that larger models and more data were likely to produce further gains. This observation was formalized in later work by Kaplan et al. (2020), who showed that language model performance follows predictable power-law relationships with model size, dataset size, and compute budget. GPT-1 was the first data point on what would become one of the most important empirical findings in modern AI research.
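Kaplan et al. reported, for example, that with sufficient data and compute the loss falls off as a power law in the number of non-embedding parameters N, roughly of the form

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076
```

with analogous power laws in dataset size and training compute.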
In hindsight, GPT-1 can be understood as the starting point of a research trajectory that, within five years, led to systems capable of passing professional examinations, writing functional code, and holding extended conversations. The 117-million-parameter model that achieved 72.8% on GLUE in 2018 was the first step toward the billion- and trillion-parameter models that would reshape both AI research and the technology industry.