XLNet is a generalized autoregressive pretraining method for natural language processing that combines the strengths of autoregressive and autoencoding language models. Introduced in June 2019 by researchers at Carnegie Mellon University and Google Brain, XLNet proposed a novel permutation language modeling objective that captures bidirectional context without relying on input corruption (masking). The paper, titled "XLNet: Generalized Autoregressive Pretraining for Language Understanding," was authored by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. It was presented at NeurIPS 2019 and quickly became one of the most discussed advances in the field, outperforming BERT on 20 benchmark tasks at the time of its release.
XLNet's core insight was that existing pretraining approaches fell into two camps, each with notable drawbacks. Autoregressive (AR) models like GPT and GPT-2 could only condition on left-side context, limiting their ability to capture bidirectional dependencies. Autoencoding (AE) models like BERT used masked language modeling to capture bidirectional context but introduced a pretrain-finetune discrepancy and assumed independence among masked tokens. XLNet bridged these two paradigms by using permutation-based training on top of the Transformer-XL architecture, achieving state-of-the-art results across a wide range of NLP benchmarks.
By mid-2019, the NLP community had witnessed rapid progress in transfer learning for language understanding. BERT, released in October 2018, demonstrated that pretraining a deep bidirectional transformer encoder on masked language modeling (MLM) and next sentence prediction (NSP) could produce rich contextual representations that transferred well to downstream tasks. However, BERT's approach had known limitations that the XLNet authors sought to address.
Autoregressive language models, such as GPT and GPT-2, factorize the joint probability of a text sequence as a product of conditional probabilities from left to right. Given a sequence of tokens x1, x2, ..., xT, the model estimates:
p(x) = p(x1) * p(x2|x1) * p(x3|x1, x2) * ... * p(xT|x1, ..., xT-1)
This left-to-right factorization means each token can only attend to preceding tokens, not to tokens that come after it. For tasks that require understanding the full context of a sentence (such as question answering or natural language inference), this unidirectional constraint is a significant limitation. The model cannot jointly reason about context on both sides of a given token.
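As a toy illustration of the chain-rule factorization (the `cond_prob` interface here is mine, not from any paper), the sequence log-probability is just a sum of conditional log-probabilities over growing prefixes:

```python
import math

def ar_log_prob(tokens, cond_prob):
    """Sum of log p(x_t | x_1..x_{t-1}) under the left-to-right factorization.

    cond_prob(token, prefix) -> conditional probability; any model fits here.
    """
    logp = 0.0
    prefix = []
    for tok in tokens:
        logp += math.log(cond_prob(tok, tuple(prefix)))
        prefix.append(tok)
    return logp

# Trivial "model": uniform over a 4-token vocabulary, ignoring the prefix.
uniform = lambda tok, prefix: 0.25
print(ar_log_prob(["a", "b", "c"], uniform))  # 3 * log(0.25) ≈ -4.159
```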
BERT addressed the unidirectional limitation by using masked language modeling: randomly replacing 15% of input tokens with a special [MASK] token and training the model to predict the original tokens from the surrounding bidirectional context. While effective, this approach introduced two problems.
First, the pretrain-finetune discrepancy: during pretraining, the model sees artificial [MASK] tokens that never appear in real text during fine-tuning or inference. BERT partially mitigated this by replacing selected tokens with [MASK] only 80% of the time (using a random token 10% of the time and keeping the original token 10% of the time), but the mismatch remained.
Second, the independence assumption: BERT predicts all masked tokens simultaneously and independently of each other, conditioned only on the unmasked tokens. This means the model does not capture dependencies between the masked tokens themselves. In natural language, masked tokens are often correlated. For example, in the sentence "New York is a city," if both "New" and "York" are masked, BERT predicts each one independently, ignoring the strong dependency between them. An autoregressive model, by contrast, would naturally condition the prediction of "York" on having already predicted "New."
XLNet proposed permutation language modeling as a way to get the best of both worlds: the bidirectional context awareness of autoencoding models and the product-rule factorization of autoregressive models. Instead of predicting tokens from left to right in a fixed order, XLNet trains on all possible permutations of the factorization order. For each training instance, a random permutation of the token positions is sampled, and the model is trained to predict each token conditioned on the tokens that precede it in that particular permutation. Because all permutations are considered in expectation, every token eventually learns to use context from both sides.
Critically, the actual input sequence is never reordered. The positional encodings remain fixed to the original positions in the sentence. The permutation is implemented through attention masks that control which tokens each position can attend to during the forward pass.
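A minimal sketch (function and variable names are mine, not from the XLNet codebase) of how a sampled factorization order becomes an attention mask while the input order stays fixed:

```python
import numpy as np

def content_mask(perm):
    """mask[i, j] = True iff position i may attend to position j, i.e. j comes
    no later than i in the sampled factorization order `perm`. The tokens and
    their positional encodings are never reordered; only this mask changes."""
    T = len(perm)
    rank = np.empty(T, dtype=int)          # rank[pos]: step at which pos is predicted
    rank[list(perm)] = np.arange(T)
    return rank[None, :] <= rank[:, None]  # broadcast pairwise comparison

# Original positions 0..3; sampled order: predict 2, then 0, then 3, then 1.
print(content_mask([2, 0, 3, 1]).astype(int))
```

Averaged over many sampled orders, every position ends up seeing context from both sides, even though each individual forward pass is autoregressive.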
XLNet builds on two key architectural components: the Transformer-XL backbone and a novel two-stream self-attention mechanism.
Transformer-XL was introduced by Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov in a paper published at ACL 2019, titled "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." Many of these authors also co-authored the XLNet paper. Transformer-XL addressed two critical limitations of standard transformers: the fixed-length context problem and context fragmentation.
Segment-level recurrence mechanism: Standard transformers process fixed-length segments of text independently, which means they cannot capture dependencies that span beyond the segment boundary. Transformer-XL introduces a recurrence mechanism at the segment level. When processing a new segment, the model caches the hidden states computed for the previous segment and reuses them as extended context. This allows information to flow across segment boundaries without recomputation, effectively extending the model's receptive field far beyond a single segment.
During training, the hidden state sequence from the previous segment is fixed (gradients are not propagated back through it) and concatenated with the current segment's hidden states as keys and values in the self-attention computation. This creates a sliding window of memory that grows linearly with the number of layers, enabling the model to capture dependencies spanning hundreds or even thousands of tokens.
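A toy single-head version of this idea (my own simplification; real Transformer-XL uses multiple heads, causal masking, and relative position terms): the cached memory joins the keys and values, but queries come only from the current segment.

```python
import numpy as np

def attend_with_memory(h_curr, memory, W_q, W_k, W_v):
    """Attention where the previous segment's cached states extend keys/values."""
    ctx = np.concatenate([memory, h_curr], axis=0)  # [mem_len + seg_len, d]
    q = h_curr @ W_q                                # queries: current segment only
    k, v = ctx @ W_k, ctx @ W_v
    scores = q @ k.T / np.sqrt(k.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)               # softmax over mem + segment
    return w @ v

rng = np.random.default_rng(0)
d, seg_len, mem_len = 8, 4, 3
h = rng.normal(size=(seg_len, d))
mem = rng.normal(size=(mem_len, d))   # cached and detached: no gradients flow back
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attend_with_memory(h, mem, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output per current-segment position
```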
Relative positional encoding: Standard transformers use absolute positional embeddings, which assign a fixed vector to each position in the input. This works for single segments but creates problems when segments are concatenated via the recurrence mechanism, because two tokens from different segments could end up with the same absolute position index. Transformer-XL replaces absolute positional embeddings with a relative positional encoding scheme. Instead of encoding the absolute position of each token, the model encodes the relative distance between pairs of tokens in the attention computation. This allows the model to generalize to longer sequences at inference time than it saw during training.
The relative positional encoding decomposes the standard attention score into four components: content-to-content, content-to-position, position-to-content, and position-to-position terms. In practice, two of these terms are simplified using global bias vectors shared across all positions.
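In the Transformer-XL paper's notation, the four terms can be written as follows, where E are token embeddings, R holds relative-position encodings, and u, v are the learned global bias vectors that simplify terms (c) and (d):

```latex
A^{\mathrm{rel}}_{i,j} =
  \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j}}_{(a)\ \text{content--content}}
+ \underbrace{E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{i-j}}_{(b)\ \text{content--position}}
+ \underbrace{u^{\top} W_{k,E}\, E_{x_j}}_{(c)\ \text{position--content}}
+ \underbrace{v^{\top} W_{k,R}\, R_{i-j}}_{(d)\ \text{position--position}}
```

Because the position information enters only through the relative offset i − j, the same parameters apply at any absolute position, which is what allows extrapolation to longer sequences.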
Transformer-XL demonstrated that it could model dependencies 80% longer than RNNs and 450% longer than vanilla transformers, while also achieving significant speed improvements during evaluation (up to 1,800 times faster than vanilla transformers for long sequences).
XLNet integrates both the segment-level recurrence and the relative positional encoding from Transformer-XL into its pretraining framework. This gives XLNet the ability to handle long-range dependencies far more effectively than BERT, which is limited to a fixed context window of 512 tokens with absolute positional embeddings.
In addition to relative positional encoding, XLNet introduces a relative segment encoding scheme. Given two positions i and j in the input, if they belong to the same segment, the model uses a learnable embedding s+; if they belong to different segments, it uses a different learnable embedding s-. These segment embeddings are added to the attention score computation. This is more flexible than BERT's absolute segment embeddings (which assign segment A or segment B labels) because it generalizes naturally to inputs with more than two segments.
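A scalar toy version of the idea (the real model learns embedding vectors s+ and s− and combines them with the query; the function name and scalar biases here are mine):

```python
import numpy as np

def relative_segment_bias(seg_ids, s_same=1.0, s_diff=0.0):
    """Pairwise attention bias that depends only on whether positions i and j
    share a segment, never on which segment each one is in absolutely."""
    seg = np.asarray(seg_ids)
    same = seg[:, None] == seg[None, :]
    return np.where(same, s_same, s_diff)

# Generalizes past two segments with no change, unlike absolute A/B labels:
print(relative_segment_bias([0, 0, 1, 2]))
```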
Implementing permutation language modeling with a standard transformer presents a fundamental contradiction. When predicting a token at position t in a given permutation order, the model needs to know the position of the target token (to produce an appropriate representation for that position) but must not see the content of the target token (otherwise the prediction becomes trivial). A standard transformer's hidden representation for position t encodes both the position and the content, making it impossible to satisfy both requirements simultaneously.
XLNet resolves this with a two-stream self-attention mechanism that maintains two separate sets of hidden representations for each token.
Content stream (h): This stream functions like a standard transformer hidden state. The content representation for position t encodes both the content of the token at position t and all contextual information from tokens that precede it in the current permutation order. The content stream can attend to all tokens that come before it in the permutation, including the token at its own position.
Query stream (g): This stream encodes the contextual information and the positional information for position t, but it does not have access to the content of the token at position t. The query stream can attend to all tokens that precede it in the permutation order, but it cannot attend to the token at its own position. This is the representation used to make predictions.
During pretraining, both streams are computed in parallel at each layer. The query stream uses the query vector at position t as its query, and the content vectors of all preceding positions (excluding t) as keys and values. The content stream uses the content vector at position t as its query, and the content vectors of all preceding positions (including t) as keys and values.
During fine-tuning, the query stream is dropped and only the content stream is used, making the model behave like a standard Transformer-XL. This means XLNet's fine-tuning procedure is straightforward and does not introduce additional complexity compared to BERT.
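The two visibility rules described above differ only on the diagonal, which a short sketch makes concrete (names are mine, not from the XLNet implementation):

```python
import numpy as np

def two_stream_masks(perm):
    """Returns (content_mask, query_mask) for one factorization order.

    Content stream: attend to earlier-in-permutation positions AND itself.
    Query stream:   attend to earlier-in-permutation positions only, so a
                    token's own content stays hidden from its prediction."""
    T = len(perm)
    rank = np.empty(T, dtype=int)
    rank[list(perm)] = np.arange(T)
    earlier = rank[None, :] < rank[:, None]
    content = earlier | np.eye(T, dtype=bool)
    return content, earlier

content, query = two_stream_masks([2, 0, 3, 1])
# The query mask never lets a position see its own content:
print(bool(np.any(np.diag(query))))  # False
```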
Sampling all possible permutations during training would be computationally prohibitive, and permutation language modeling converges slowly because the model must predict tokens in many different orders. To improve efficiency, XLNet uses a partial prediction strategy. For each permutation, only the last few tokens in the permutation order are selected as prediction targets, while the earlier tokens serve as context. Specifically, a hyperparameter K controls the ratio: approximately 1/K of the tokens are selected for prediction. The authors found K = 6 to work well in practice, meaning roughly 1 out of every 6 tokens (about 16.7%) is predicted, which is close to BERT's 15% masking rate.
The partial prediction strategy also uses span-based target selection. Instead of selecting individual random tokens, the model selects spans of 1 to 5 consecutive tokens as targets. This is controlled by the hyperparameters mask_alpha (set to 6) and mask_beta (set to 1), with num_predict set to 85 tokens per sequence.
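These hyperparameters are mutually consistent: with K = 6 and 512-token pretraining sequences, roughly 1/K of the positions become targets, which is where the num_predict value of 85 comes from.

```python
# Relation between the partial-prediction ratio K and targets per sequence.
seq_len, K = 512, 6
targets = seq_len // K
print(targets, f"{targets / seq_len:.1%}")  # 85 16.6%
```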
XLNet was released in two sizes, mirroring the BERT model configurations.
| Configuration | Layers | Hidden size | Attention heads | Feed-forward size | Parameters |
|---|---|---|---|---|---|
| XLNet-Base | 12 | 768 | 12 | 3,072 | 110M |
| XLNet-Large | 24 | 1,024 | 16 | 4,096 | 340M |
XLNet-Large shares the same architecture hyperparameters as BERT-Large: 24 layers, 1,024 hidden units, and 16 attention heads. XLNet-Base matches the configuration of BERT-Base. The parameter counts are similar to their BERT counterparts, allowing for direct comparison.
Both configurations use a head dimension of 64 (hidden_size / num_heads), a maximum sequence length of 512 during pretraining, and SentencePiece tokenization.
A significant factor in XLNet's strong performance was its much larger pretraining dataset compared to BERT. While BERT was trained on BooksCorpus (800M words) and English Wikipedia (2,500M words) for a total of roughly 13GB of text, XLNet used five data sources:
| Dataset | Text size | Subword tokens |
|---|---|---|
| English Wikipedia | Part of 13GB (with BooksCorpus) | 2.78B |
| BooksCorpus | Part of 13GB (with Wikipedia) | 1.09B |
| Giga5 (English Gigaword) | 16GB | 4.75B |
| ClueWeb 2012-B | 19GB (after filtering) | 4.30B |
| Common Crawl | 110GB (after aggressive filtering) | 19.97B |
| Total | ~158GB | 32.89B |
After tokenization with SentencePiece, the combined dataset contained 32.89 billion subword tokens. This is roughly 10 times the amount of data used to train BERT.
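The per-source token counts in the table are consistent with the stated total, which a quick check confirms:

```python
# Subword token counts from the table above, in billions.
sources = {"Wikipedia": 2.78, "BooksCorpus": 1.09, "Giga5": 4.75,
           "ClueWeb 2012-B": 4.30, "Common Crawl": 19.97}
total = sum(sources.values())
print(round(total, 2))  # 32.89
```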
The ClueWeb and Common Crawl data underwent aggressive filtering and deduplication to remove low-quality text. Despite these efforts, the inclusion of web-crawled data was a departure from BERT's cleaner training set of curated book text and encyclopedia articles.
XLNet-Large was pretrained on 512 TPU v3 chips for 500,000 steps. The training used the Adam optimizer with weight decay and a linear learning rate decay schedule. With a batch size of 8,192, the full pretraining took approximately 5.5 days.
For a fair comparison with BERT (which used a smaller batch size), the XLNet team also trained a configuration with a batch size of 2,048, which took approximately 2.5 days on the same hardware.
Key pretraining hyperparameters for XLNet-Large:
| Hyperparameter | Value |
|---|---|
| Sequence length | 512 |
| Memory length | 384 |
| Reuse length | 256 |
| Permutation size | 256 |
| Training batch size | 2,048 (or 8,192) |
| Learning rate | Varies with linear decay |
| Optimizer | Adam with weight decay |
| Training steps | 500,000 |
| Hardware | 512 TPU v3 chips |
| Training time | ~2.5 days (batch 2,048) / ~5.5 days (batch 8,192) |
The memory length of 384 tokens meant that during pretraining, each token could attend to up to 384 tokens from the previous segment in addition to the 512 tokens in the current segment, for a total effective context of 896 tokens.
Notably, the authors reported that XLNet-Large was still undertrained at the end of pretraining and that further training would likely improve results.
XLNet did not use BERT's next sentence prediction (NSP) objective. Ablation studies in the paper showed that NSP did not contribute to improved downstream performance, consistent with findings from RoBERTa and other later work.
XLNet set new state-of-the-art results on multiple benchmarks upon its release, outperforming BERT on 20 tasks. The improvements were particularly large on tasks that required reasoning over longer text passages.
The General Language Understanding Evaluation (GLUE) benchmark consists of nine sentence-level and sentence-pair classification tasks. XLNet-Large achieved strong results across all GLUE subtasks.
Single-task, single-model dev set results for XLNet-Large:
| Task | Metric | XLNet-Large (dev) | BERT-Large |
|---|---|---|---|
| MNLI (matched/mismatched) | Accuracy | 90.8 / 90.8 | 86.7 / 85.9 |
| QNLI | Accuracy | 94.9 | 92.7 |
| QQP | F1 | 92.3 | 72.1 |
| RTE | Accuracy | 85.9 | 70.1 |
| SST-2 | Accuracy | 97.0 | 94.9 |
| MRPC | F1 | 90.8 | 89.3 |
| CoLA | Matthews Corr. | 69.0 | 60.5 |
| STS-B | Spearman Corr. | 92.5 | 86.5 |
XLNet achieved notable gains over BERT on RTE (85.9 vs. 70.1), CoLA (69.0 vs. 60.5), and MNLI (90.8 vs. 86.7). Note that the BERT-Large figures above match the test-set numbers reported in the BERT paper rather than dev-set results, so the dev-vs-test comparison is indicative rather than exact; the gap is especially inflated for QQP, where 72.1 is BERT's test-set F1. On the multi-task, multi-model GLUE test set, XLNet achieved state-of-the-art results on 7 of the 9 tasks.
On the Stanford Question Answering Dataset (SQuAD), XLNet-Large posted strong improvements over BERT.
On SQuAD 2.0, which includes unanswerable questions, XLNet-Large's results represented a significant improvement over BERT-Large, which scored EM 80.0 and F1 83.1 on the test set at the time of BERT's release.
On SQuAD 1.1, the official GitHub repository reports an expected dev set F1 of approximately 88.6 (median of multiple runs) for XLNet-Large using the TPU training scripts. The test set results were delayed due to leaderboard submission issues, with the authors reporting scores from an older model version.
The RACE (ReAding Comprehension from Examinations) dataset consists of reading comprehension questions from Chinese middle school and high school English exams. This benchmark requires reasoning over passages averaging more than 300 tokens, making it a good test of long-range context modeling.
| Model | Middle | High | Overall |
|---|---|---|---|
| BERT-Large | - | - | 72.0 |
| XLNet-Large | 85.45 | 80.21 | 81.75 |
XLNet-Large achieved an overall accuracy of 81.75% on RACE, a substantial improvement over BERT-Large's 72.0%. The strong performance on RACE highlighted the benefit of Transformer-XL's segment recurrence mechanism for tasks requiring comprehension of longer passages.
XLNet also outperformed BERT on several text classification benchmarks, measured by error rates (lower is better):
| Dataset | BERT-Large (error %) | XLNet-Large (error %) |
|---|---|---|
| IMDB | 4.51 | 3.79 |
| Yelp-2 | 1.89 | 1.55 |
| Yelp-5 | 29.32 | 27.80 |
| DBpedia | 0.64 | 0.62 |
| Amazon-2 | 2.63 | 2.40 |
| Amazon-5 | 34.17 | 32.26 |
The improvements were consistent across all six datasets, with the largest relative gains on sentiment analysis tasks (IMDB, Yelp, Amazon).
XLNet also showed improvements on document ranking tasks. Combined with performance on question answering, natural language inference, sentiment analysis, and document ranking, XLNet outperformed BERT on a total of 20 tasks across these categories.
The XLNet paper included ablation studies to isolate the contribution of each component. These experiments used XLNet-Base (with the partial prediction hyperparameter K=6) and evaluated on four benchmarks.
Key findings from the ablation studies:
- Permutation language modeling was a major contributor to XLNet's superior performance: replacing it with standard left-to-right autoregressive modeling significantly degraded results.
- The Transformer-XL backbone (segment recurrence and relative positional encoding) was another major factor: swapping in a standard transformer architecture reduced performance, especially on tasks involving longer text sequences like RACE.
- Memory caching (the segment-level recurrence) helped most on longer-context tasks but had smaller effects on short-context tasks.
- Span-based prediction (predicting contiguous spans rather than individual random tokens) improved convergence and final performance.
- Next sentence prediction was confirmed not to be beneficial and was excluded from XLNet's final training objective.
- A bidirectional data pipeline (as opposed to unidirectional input data) improved results across the board.
XLNet occupies a unique position in the landscape of pretrained language models, combining elements of both autoregressive and autoencoding approaches.
| Feature | GPT-2 | BERT | XLNet |
|---|---|---|---|
| Type | Autoregressive | Autoencoding | Generalized autoregressive |
| Architecture | Transformer decoder | Transformer encoder | Transformer-XL (encoder-style) |
| Directionality | Unidirectional (left-to-right) | Bidirectional (via masking) | Bidirectional (via permutation) |
| Pretraining objective | Next token prediction | Masked LM + Next Sentence Prediction | Permutation language modeling |
| Input corruption | None | Replaces tokens with [MASK] | None |
| Pretrain-finetune discrepancy | None | Yes ([MASK] tokens absent at finetune) | None |
| Independence assumption | No (autoregressive factorization) | Yes (masked tokens predicted independently) | No (autoregressive factorization) |
| Long-range dependencies | Fixed context window | Fixed context window (512 tokens) | Segment recurrence (extended context) |
| Positional encoding | Learned absolute | Learned absolute | Relative (from Transformer-XL) |
| Parameters (large) | 1.5B | 340M | 340M |
| Pretraining data | 40GB (WebText) | 13GB (Books + Wiki) | ~158GB (Books + Wiki + Giga5 + ClueWeb + CC) |
| Release date | February 2019 | October 2018 | June 2019 |
| Developer | OpenAI | Google | CMU + Google Brain |
Compared with BERT, XLNet offered several advantages. No pretrain-finetune discrepancy: Because XLNet does not use [MASK] tokens or any form of input corruption, the data distribution seen during pretraining matches what the model encounters during fine-tuning and inference.
No independence assumption: XLNet's autoregressive formulation naturally factorizes the joint probability of target tokens using the product rule. When predicting multiple tokens, the prediction of each token conditions on the predictions of previously generated tokens in the permutation order, capturing inter-dependencies among target tokens.
Longer context: Through the Transformer-XL backbone, XLNet can attend to context from previous segments, effectively extending its receptive field beyond the 512-token limit that constrains BERT.
BERT, in turn, retained practical advantages of its own. Simplicity: BERT's masked language modeling objective is conceptually simpler and easier to implement than permutation language modeling with two-stream attention.
Training efficiency: BERT converges faster during pretraining because it predicts masked tokens in a fixed order rather than requiring the model to learn across all possible permutation orders.
Ecosystem: BERT was released earlier and built up a larger ecosystem of pretrained models, fine-tuned variants, and community resources.
GPT-2, released by OpenAI in February 2019, demonstrated that scaling up autoregressive language models could produce remarkably fluent text generation. However, GPT-2 was primarily designed for generation tasks and used unidirectional context, making it less suitable for bidirectional understanding tasks like question answering and natural language inference. XLNet achieved the bidirectional understanding of BERT while retaining the autoregressive properties of GPT-2, positioning it as a bridge between the two paradigms.
It is worth noting that GPT-2's largest variant (1.5 billion parameters) was significantly larger than XLNet-Large (340 million parameters), and the two models were designed with different primary objectives in mind.
One important caveat in evaluating XLNet's improvements over BERT is the difference in training data. XLNet used roughly 10 times more pretraining data than BERT. The XLNet team acknowledged this by running a fair comparison experiment: they trained an XLNet-Base model using only BooksCorpus and Wikipedia (the same data used by BERT) with the same number of training steps. Even under these controlled conditions, XLNet-Base outperformed BERT-Base on RACE, SQuAD 2.0, and several other benchmarks, confirming that the architectural and objective innovations contributed to the improvements independently of the larger dataset.
However, the full XLNet-Large model that set state-of-the-art records did use the larger dataset, making it difficult to isolate how much of the overall improvement came from the modeling innovations versus the additional data.
XLNet holds an important place in the history of natural language processing and pretrained language models. It was the first model to successfully bridge the gap between autoregressive and autoencoding approaches to pretraining, demonstrating that the advantages of both paradigms could be combined in a single framework.
The paper's analysis of BERT's limitations, specifically the pretrain-finetune discrepancy and the independence assumption, shaped how the research community thought about pretraining objectives. These insights influenced subsequent work, even in models that ultimately adopted different solutions.
XLNet also demonstrated the importance of the Transformer-XL architecture for handling long-range dependencies, an insight that influenced later models designed for long-document understanding.
The XLNet paper brought together researchers with deep expertise in transformer architectures and language modeling. Zhilin Yang and Zihang Dai were PhD students at Carnegie Mellon University. Yiming Yang was a professor in CMU's Language Technologies Institute. Jaime Carbonell, who co-authored both the Transformer-XL and XLNet papers, was the Allen Newell Professor of Computer Science at CMU and the founder of CMU's Language Technologies Institute. Carbonell passed away on February 28, 2020, having made lasting contributions to machine translation, machine learning, and natural language processing over a career spanning more than three decades. Ruslan Salakhutdinov was a professor at CMU known for his work on deep learning and probabilistic models. Quoc V. Le was a principal scientist at Google Brain, known for his work on sequence-to-sequence models and neural architecture search.
Despite its strong benchmark results and theoretical elegance, XLNet did not become the dominant pretrained model in practice. Several factors contributed to this.
RoBERTa (Robustly Optimized BERT Pretraining Approach), published by Facebook AI Research in July 2019 (just weeks after XLNet), demonstrated that BERT's original pretraining was significantly undertrained. By making simple changes to the training procedure (removing NSP, using dynamic masking, training on more data with larger batches for longer), RoBERTa matched or exceeded XLNet's performance on the GLUE benchmark and other tasks. The key insight was that BERT's architecture and masked language modeling objective were not the bottleneck; the training recipe was.
RoBERTa's advantage was its simplicity. It used the same BERT architecture with no changes to the model or the pretraining objective, making it a drop-in replacement for BERT in existing codebases. XLNet, by contrast, required a fundamentally different architecture (two-stream attention, Transformer-XL recurrence) that was harder to implement, debug, and deploy.
ALBERT (A Lite BERT), published by Google Research and the Toyota Technological Institute at Chicago in September 2019, introduced parameter-reduction techniques (factorized embeddings and cross-layer parameter sharing) that allowed larger model configurations with far fewer parameters. An ALBERT-xxlarge configuration with 235M parameters outperformed XLNet-Large (340M parameters) on several benchmarks. ALBERT showed that efficiency improvements could be just as valuable as architectural innovations.
ELECTRA, published in 2020, replaced masked language modeling with a replaced token detection objective that is defined over all input tokens rather than just the 15% that are masked. This made ELECTRA far more sample-efficient: ELECTRA-Large matched XLNet-Large's performance while using less than a quarter of the compute.
XLNet's permutation language modeling objective made training significantly slower than BERT's masked language modeling. Sampling permutation orders and computing two separate attention streams added overhead, and XLNet took considerably longer to converge than BERT or RoBERTa, making it less practical for researchers and organizations with limited computational budgets.
The two-stream self-attention mechanism also increased the implementation complexity. While libraries like Hugging Face Transformers eventually provided XLNet implementations, the model was never as widely adopted as BERT or its simpler successors.
The period from 2019 to 2020 saw a rapid succession of pretrained models, each claiming state-of-the-art on various benchmarks. After XLNet came RoBERTa, ALBERT, ELECTRA, DeBERTa, and others. The community gradually converged on the view that training methodology (data, batch size, training duration, hyperparameter tuning) mattered as much as or more than architectural innovations. Models that achieved strong performance with simpler architectures tended to see wider adoption.
Meanwhile, the decoder-only GPT paradigm continued to scale. GPT-3, released in 2020 with 175 billion parameters, showed that very large autoregressive models could perform remarkably well on a wide range of tasks through in-context learning, without any fine-tuning at all. This shift toward very large generative models further reduced interest in encoder-only models like XLNet for new research, though encoder models remain widely used in production for classification, retrieval, and extraction tasks.
XLNet's original implementation in TensorFlow was released on GitHub at github.com/zihangdai/xlnet under the Apache 2.0 license. The model is also available through the Hugging Face Transformers library in PyTorch and TensorFlow under identifiers such as xlnet-base-cased and xlnet-large-cased. Both the pretrained weights and the fine-tuning scripts are publicly available.
Fine-tuning XLNet-Large requires significant GPU memory. On a single 16GB GPU, XLNet-Large with a sequence length of 512 can only fit a batch size of 1, while XLNet-Base with the same sequence length supports a batch size of 8. For practical fine-tuning, gradient accumulation or multi-GPU setups are typically needed.
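A framework-agnostic toy of the gradient-accumulation idea (pure Python, my own example, not XLNet's fine-tuning code): average gradients over several micro-batches before one optimizer step, reproducing the update a single larger batch would have produced.

```python
def grad(w, x, y):
    """dL/dw for one example with loss L = (w*x - y)^2."""
    return 2 * (w * x - y) * x

def accumulated_step(w, lr, micro_batches):
    """One parameter update using gradients averaged over all micro-batches."""
    g = sum(grad(w, x, y) for batch in micro_batches for x, y in batch)
    g /= sum(len(batch) for batch in micro_batches)
    return w - lr * g

# Two micro-batches of one example each behave like one batch of two:
w_accum = accumulated_step(0.0, 0.1, [[(1.0, 2.0)], [(2.0, 2.0)]])
w_full = accumulated_step(0.0, 0.1, [[(1.0, 2.0), (2.0, 2.0)]])
print(w_accum == w_full)  # True
```

The same principle lets a 16GB GPU simulate a larger effective batch for XLNet-Large at the cost of proportionally more forward/backward passes per update.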
XLNet's contributions extend beyond its benchmark scores. The permutation language modeling idea demonstrated that there were viable alternatives to masked language modeling for bidirectional pretraining. The theoretical analysis of BERT's pretrain-finetune discrepancy and independence assumption became standard reference points in the NLP literature. The integration of Transformer-XL's segment recurrence into pretraining influenced subsequent work on long-context models.
The XLNet paper has been cited thousands of times and remains required reading in NLP courses and research groups. While the model itself has been largely superseded by later architectures for new projects, its ideas continue to inform the design of modern language models.