# Unidirectional language model

> Source: https://aiwiki.ai/wiki/unidirectional_language_model
> Updated: 2026-06-27
> Categories: Large Language Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **unidirectional language model** is a [language model](/wiki/language_model) that predicts each token using only the tokens that come before it in the sequence (the left context), via causal (autoregressive) masking, so the probability of the whole sequence factorizes by the chain rule into a product of next-token conditionals. It is also called a **causal language model**, an **[autoregressive model](/wiki/autoregressive_model)**, or a **left-to-right language model**, and it is the architecture behind the [GPT](/wiki/gpt) family and almost every modern generative large language model. The defining contrast is with a **[bidirectional language model](/wiki/bidirectional_language_model)** such as BERT, which conditions on tokens on both sides and is trained to fill in masked words rather than to predict the next one [1][2].

This simple constraint has unusually large consequences. It is the reason every modern generative large language model (GPT-4, [Claude](/wiki/claude), [Gemini](/wiki/gemini), [Llama](/wiki/llama) 3, Mistral, DeepSeek-V3, Qwen, and the other systems that get called "LLMs" in 2026) is built the way it is. Because the model can only look left, generating text is native to it: there is no future information to leak, so producing a sequence is just sampling its own next-token distribution forward one step at a time. The same constraint is also the reason these models are weaker than BERT-family encoders at tasks that benefit from seeing both sides of a token, and the reason an entire ecosystem of sampling tricks, decoding strategies, and inference optimizations has grown up around them.

## What is a unidirectional language model?

Given a sequence of tokens x_1, x_2, ..., x_n, a unidirectional language model assigns a probability to the whole sequence by writing it as

P(x_1, x_2, ..., x_n) = P(x_1) * P(x_2 | x_1) * P(x_3 | x_1, x_2) * ... * P(x_n | x_1, ..., x_{n-1}).

Each factor on the right depends only on tokens at earlier positions. The model never conditions on a token to its right. Training maximizes the log-likelihood of this product on a corpus, which reduces to a per-token cross-entropy loss against the next token in the sequence. Inference samples from these conditionals one token at a time, feeding each generated token back as part of the context for the next step.
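
The factorization and the sampling loop can be sketched on a toy model. The table `COND`, its probabilities, and the `<s>` / `</s>` boundary symbols below are hypothetical, and the toy conditional looks only at the last token for brevity; a real model would condition on the whole prefix.

```python
import math
import random

# Hypothetical toy conditionals; the numbers are made up for illustration.
COND = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"</s>": 1.0},
}

def next_token_probs(context):
    """P(next token | context). This toy version only looks at the last token."""
    return COND[context[-1] if context else "<s>"]

def sequence_log_prob(tokens):
    """Chain rule in log space: sum of log P(x_t | x_1..x_{t-1})."""
    return sum(
        math.log(next_token_probs(tokens[:t])[tokens[t]])
        for t in range(len(tokens))
    )

def generate(rng, max_len=10):
    """Sample left to right, feeding each sampled token back in as context."""
    tokens = []
    while len(tokens) < max_len and (not tokens or tokens[-1] != "</s>"):
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens

print(math.exp(sequence_log_prob(["the", "cat", "sat", "</s>"])))
print(generate(random.Random(0)))
```

Working in log space avoids numerical underflow when many small conditionals are multiplied together.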

The contrast with [bidirectional language models](/wiki/bidirectional_language_model) is structural. A BERT-style model learns a representation for each token that depends on the entire surrounding sequence, including tokens at later positions, and is trained with a masked language model objective rather than a next-token objective [2]. The two design choices serve different purposes. Unidirectionality is required if you want the model to generate text. Bidirectionality is preferred if you want the model to embed, classify, tag, or score existing text.

## When did unidirectional language models develop?

The idea of treating language as a sequence of probabilities you can predict from earlier context is older than neural networks. Claude Shannon discussed it explicitly in his 1948 paper "A Mathematical Theory of Communication," where he used letter and word n-gram statistics from English to estimate the entropy of the language [3]. For the next half century the dominant approach was the **n-gram model**, which approximates P(x_t | x_1, ..., x_{t-1}) by P(x_t | x_{t-n+1}, ..., x_{t-1}), counting how often each (n-1)-gram is followed by each token in a training corpus. By the 1990s n-gram language models had become standard infrastructure for speech recognition and machine translation, with sophisticated smoothing techniques (Kneser-Ney, Good-Turing, Katz back-off) doing most of the practical work. They were unidirectional by construction: each token's probability depends on the n-1 tokens immediately before it.
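
As a rough illustration of the counting approach, here is a minimal bigram (n = 2) model with unsmoothed maximum-likelihood estimates. The three-sentence corpus and the function names are invented for the example; production n-gram systems relied on the smoothing methods mentioned above, which this sketch omits.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token follows each one-token context (n = 2)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()  # "<s>" is an assumed start symbol
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    """Maximum-likelihood estimate count(prev, nxt) / count(prev, *), no smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

corpus = ["the cat sat", "the dog sat", "the cat ran"]
counts = train_bigram(corpus)
print(bigram_prob(counts, "the", "cat"))  # "cat" follows "the" in 2 of 3 cases
```

Unseen contexts get probability zero here, which is exactly the problem that back-off and smoothing were designed to address.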

The first widely cited neural language model was Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin's 2003 paper "A Neural Probabilistic Language Model" in the *Journal of Machine Learning Research* [4]. Their architecture learned a distributed representation (a continuous vector) for each word and used a feedforward neural network to compute the probability of the next word given a fixed window of previous words. This was still unidirectional and still windowed, but it improved significantly on state-of-the-art n-gram models in perplexity and started the line of work that would eventually lead to GPT.

Tomas Mikolov and colleagues replaced the fixed window with a recurrent network in 2010, in their Interspeech paper "Recurrent Neural Network Based Language Model" [5]. The [RNN](/wiki/rnn) read tokens one at a time and maintained a hidden state that, in principle, could carry information from arbitrarily far back. Their model cut perplexity roughly in half compared to a strong back-off n-gram baseline and reduced word error rate on Wall Street Journal speech recognition by about 18% [5]. Two years later, Martin Sundermeyer, Ralf Schluter, and Hermann Ney showed in "LSTM Neural Networks for Language Modeling" (Interspeech 2012) that swapping the RNN cell for an [LSTM](/wiki/lstm) gave another 8% relative improvement in perplexity by handling long-range dependencies that the vanilla RNN forgot [6].

LSTM language models held the field for the next five years. The decisive jump came when Vaswani and colleagues introduced the [transformer](/wiki/transformer) in 2017 in "Attention Is All You Need" [7]. The transformer's decoder block uses **causal self-attention**, an attention layer with a mask that zeros out attention to future positions, so that the prediction at position i depends only on positions 1 through i. A stack of such blocks, used without the encoder, is therefore a unidirectional language model.

Alec Radford and colleagues at OpenAI took this decoder block, scaled it up, and trained it on a large corpus of unlabeled text. The June 2018 paper "Improving Language Understanding by Generative Pre-Training" introduced what is now called **GPT-1**: a 12-layer decoder-only transformer with 117 million parameters, pretrained as a unidirectional language model and then finetuned on downstream tasks [8]. The authors framed the appeal of the approach directly: "we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning" [8]. [GPT-2](/wiki/gpt-2) (2019) scaled the same architecture to 1.5 billion parameters and a much larger web crawl [9]. [GPT-3](/wiki/gpt-3) (Brown et al., "Language Models are Few-Shot Learners," NeurIPS 2020) reached 175 billion parameters, described by its authors as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model," and showed that a sufficiently large unidirectional language model could perform new tasks from a few examples in its prompt without any gradient updates [1]. After GPT-3, decoder-only autoregressive models became the standard recipe for everything generative.

| Year | Model | Authors / org | Parameters | Notable change |
|---|---|---|---|---|
| 1948 | n-gram (theory) | Shannon | n/a | Information-theoretic framing of language |
| 1990s | n-gram + smoothing | Many | n/a | Workhorse of speech recognition and MT |
| 2003 | Feedforward NPLM | Bengio et al. | ~10M | Distributed word representations |
| 2010 | RNN-LM | Mikolov et al. | ~10M | Unbounded left context via recurrence |
| 2012 | LSTM-LM | Sundermeyer et al. | ~10M | Better long-range dependencies |
| 2017 | Transformer decoder | Vaswani et al. | ~65M (base) | Causal self-attention |
| 2018 | GPT-1 | Radford et al. (OpenAI) | 117M | Decoder-only pretraining + finetuning |
| 2019 | GPT-2 | Radford et al. (OpenAI) | 1.5B | Zero-shot transfer at scale |
| 2020 | GPT-3 | Brown et al. (OpenAI) | 175B | In-context few-shot learning |
| 2022 | Chinchilla, PaLM | DeepMind, Google | 70B, 540B | Compute-optimal scaling laws |
| 2023 | Llama, Llama 2 | Meta | 7B-70B | Open-weights decoder-only |
| 2024-2026 | GPT-4 / 4o, Claude 3-4, Gemini 2-3, Llama 3, DeepSeek-V3, Qwen | Various | Trillions (sparse), 100B-700B (dense) | All decoder-only autoregressive |

## How does causal masking work?

In a vanilla self-attention layer, each token at position i attends to every token in the sequence, itself included, by computing scaled dot products between query and key vectors and taking a softmax over the resulting logits [7]. To make the model unidirectional, the transformer adds a **causal mask** to those logits before the softmax. The mask is a matrix with zeros on and below the diagonal and negative infinities strictly above it. After the softmax, the negative infinities become zero attention weight, so position i can only attend to positions 1 through i. Combined with the convention that the predicted token at position i is matched against the *next* token in the input (the labels are the inputs shifted by one), this gives a model that produces, at every position, a distribution over the next token conditional on everything to the left.
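
A minimal sketch of the masking step, written with plain Python lists to stay dependency-free. The function names and the single-head, no-projection setup are simplifying assumptions; real implementations operate on batched tensors.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """Additive mask: 0.0 where j <= i (visible), -inf where j > i (future)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(q, k):
    """Scaled dot-product logits plus the causal mask, then a row-wise softmax."""
    n, d = len(q), len(q[0])
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) + mask[i][j]
            for j in range(n)
        ]
        weights.append(softmax(logits))
    return weights

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]
for row in causal_attention_weights(q, k):
    print([round(w, 3) for w in row])
```

Each printed row has zeros to the right of the diagonal, and changing a later key leaves every earlier row untouched.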

A crucial efficiency property falls out of this. Because position i's prediction only depends on positions 1 through i, the model can process an entire training sequence in a single forward pass and compute the loss at every position simultaneously. Training is fully parallelized across positions. This is one reason transformer language models scaled so much faster than RNN language models: the RNN had to step through tokens serially during training, while the transformer could chew through a 4096-token sequence in parallel.

At inference time the parallelism disappears. To generate the next token the model needs the previous token, which it just generated, so generation is inherently sequential. Modern systems hide most of this cost with a **key-value cache** that stores the keys and values from previous positions and reuses them at each step, so generating token n only requires computing one new query, one new key, one new value, and one attention pass at the new position rather than recomputing everything from scratch.
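
The caching idea can be sketched for a single attention head. The class name `KVCache` and the assumption that per-token query, key, and value vectors arrive already projected are illustrative choices, not a description of any particular serving stack.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query over the given keys and values."""
    d = len(query)
    logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    dim_v = len(values[0])
    return [sum(e / total * v[c] for e, v in zip(exps, values)) for c in range(dim_v)]

class KVCache:
    """Stores keys and values of positions that have already been processed."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append the new position's key and value, then attend with one new query.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

def full_recompute(qs, ks, vs):
    """Reference: recompute causal attention for every position from scratch."""
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]

qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
ks = [[1.0, 1.0], [0.0, 1.0], [2.0, -1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
reference_outputs = full_recompute(qs, ks, vs)
```

The cached path does work proportional to the current length at each step, while recomputing from scratch at every step would repeat all earlier positions; the trade-off is memory that grows with the sequence.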

## How is a unidirectional language model trained?

A unidirectional language model is trained by minimizing the average cross-entropy between the model's predicted distribution at each position and the one-hot distribution of the true next token. This is exactly equivalent to maximizing the log-likelihood of the training corpus under the chain-rule factorization. There is no auxiliary objective in standard pretraining: the loss at every position is just the negative log probability the model assigned to the correct next token.
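
A small sketch of this loss, assuming the model has already produced one vector of logits per input position. The function names and the list-of-lists layout are illustrative; frameworks compute the same quantity on batched tensors.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of token t+1 given the logits at position t."""
    input_logits = logits_per_position[:-1]  # the last position has no next token
    targets = token_ids[1:]                  # labels are the inputs shifted by one
    nll = [-log_softmax(row)[t] for row, t in zip(input_logits, targets)]
    return sum(nll) / len(nll)

# With uniform logits over a 4-token vocabulary the loss equals log(4).
uniform = [[0.0] * 4 for _ in range(4)]
print(next_token_loss(uniform, [0, 1, 2, 3]), math.log(4))
```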

During training the input at each position is the ground-truth token from the corpus, even though at inference the model would receive its own previous prediction. This setup is called **teacher forcing**. It is fast, stable, and lets the loss at every position be computed in parallel. The downside is **exposure bias**: at inference the model has to consume its own (sometimes wrong) predictions, but at training it never had to. Errors in the early part of a generation can compound, drifting the conditional distributions away from anything the model saw during training. Samy Bengio and colleagues proposed **scheduled sampling** in 2015 as a fix, gradually replacing some teacher-forced inputs with the model's own samples during training, but in practice modern LLMs still train with pure teacher forcing because the simplicity and parallelism are too valuable to give up [10].

## How does a unidirectional model generate text?

The unidirectional factorization means the model produces a probability distribution over the vocabulary at every step. To turn that distribution into an actual sequence, you have to choose a **decoding** strategy. The choice has a much larger effect on output quality than is sometimes acknowledged. The major options:

| Method | What it does | Typical use |
|---|---|---|
| Greedy decoding | At each step pick the token with the highest probability | Deterministic short outputs, classification-style tasks |
| [Beam search](/wiki/beam_search) | Maintain k partial sequences ranked by total log probability, expand each by one token, keep the best k | Machine translation, summarization, anything where average log prob correlates with quality |
| Pure sampling | Sample directly from the model's full predicted distribution | Rarely used; tends to produce incoherent text because of long low-probability tail |
| [Temperature sampling](/wiki/temperature_sampling) | Divide logits by a temperature T, then sample; T<1 sharpens, T>1 flattens | Controls overall randomness |
| Top-k sampling | Restrict the sample to the k highest-probability tokens, renormalize, then sample | Removes the tail; introduced by Fan et al. 2018 with k=10, and k=40 was used for GPT-2 samples |
| Nucleus (top-p) sampling | Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9), renormalize, then sample | Default for most chat LLMs; introduced by Holtzman et al. 2019 |
| Speculative decoding | A small "draft" model proposes several tokens; the large model verifies them in parallel; accepted tokens commit, rejected ones are corrected | Inference acceleration, no change to output distribution; introduced by Leviathan, Kalman, Matias 2023 |
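
To make the difference between the first two rows concrete, the sketch below runs greedy decoding and beam search on a hypothetical two-step model. The table `TOY` and its probabilities are invented so that the locally best first token does not lead to the most probable full sequence.

```python
import math

# Hypothetical two-step toy model: P(next | prefix), keyed by the prefix tuple.
TOY = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy(steps=2):
    """Pick the single most probable token at every step."""
    seq = ()
    for _ in range(steps):
        probs = TOY[seq]
        seq += (max(probs, key=probs.get),)
    return seq

def beam_search(k, steps=2):
    """Keep the k partial sequences with the highest total log probability."""
    beams = [(0.0, ())]  # (total log probability, sequence)
    for _ in range(steps):
        candidates = [
            (score + math.log(p), seq + (tok,))
            for score, seq in beams
            for tok, p in TOY[seq].items()
        ]
        beams = sorted(candidates, reverse=True)[:k]
    return beams[0][1]

print(greedy())        # follows "A" because it is the best first token
print(beam_search(2))  # keeps "B" alive and finds the more probable sequence
```

Greedy decoding commits to "A" and ends with a sequence of probability 0.33, while a beam of width 2 recovers the sequence starting with "B", which has probability 0.36. A beam of width 1 behaves like greedy decoding.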

Holtzman, Buys, Du, Forbes, and Choi's 2019 paper "The Curious Case of Neural Text Degeneration" pushed the field toward nucleus sampling [11]. They showed that beam search and pure greedy decoding tend to produce repetitive, degenerate text from a sufficiently large model (the model gets stuck in loops because high-probability continuations of high-probability text are themselves high probability), while pure sampling produces incoherent text because the very long low-probability tail of the distribution accumulates real probability mass. Top-p sampling was their proposed fix: dynamically truncate the distribution to its high-probability "nucleus" and sample from inside it [11]. The trick stuck.
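
The truncation strategies are easy to express as filters over a probability vector. The following is a minimal sketch with hypothetical function names; it treats the nucleus as the smallest set whose cumulative mass reaches p, and library implementations may differ in tie-breaking and boundary handling.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution, above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    """Draw one token index from the (possibly truncated) distribution."""
    return rng.choices(range(len(probs)), weights=probs)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
print(sample(top_p_filter(probs, 0.9), random.Random(0)))
```

Unlike top-k, the nucleus changes size from step to step: a peaked distribution keeps only a handful of tokens, while a flat one keeps many.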

Leviathan, Kalman, and Matias's 2023 ICML paper "Fast Inference from Transformers via Speculative Decoding" attacks a different bottleneck [12]. Generation is sequential, so latency scales linearly with output length. The insight is that easy tokens (a comma, the second half of a common word, the end of a function name) can be predicted by a much smaller model and only need to be confirmed by the large one. A small draft model proposes K tokens autoregressively, the large model scores all of them in a single parallel forward pass, and each drafted token is then accepted or rejected by a sampling rule that compares the two models' probabilities. The accepted prefix is committed in one step, and the first rejected token is resampled from a corrected distribution, so the output distribution is mathematically identical to standard sampling from the large model. On T5-XXL the original paper reported 2x to 3x speedups [12]; later work (Medusa, EAGLE, lookahead decoding) has pushed this further and speculative decoding is now standard in production LLM serving stacks.
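
The accept-or-resample rule can be sketched for a single drafted token. Here `p` stands for the large model's next-token distribution and `q` for the draft model's; both are given as explicit hypothetical lists rather than computed by real models, and the multi-token bookkeeping is omitted.

```python
import random

def residual(p, q):
    """Normalized max(0, p - q): the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff]

def speculative_sample(p, q, rng):
    """Draw one token distributed according to p, using a draft token from q."""
    ids = range(len(p))
    x = rng.choices(ids, weights=q)[0]        # the draft model proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):  # accept with probability min(1, p/q)
        return x
    # Rejected: resample from the part of p that q under-represents.
    return rng.choices(ids, weights=residual(p, q))[0]

p = [0.5, 0.3, 0.2]  # hypothetical target distribution
q = [0.2, 0.2, 0.6]  # hypothetical draft distribution
rng = random.Random(0)
n = 20000
counts = [0, 0, 0]
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
print([c / n for c in counts])  # empirical frequencies, close to p
```

Tokens the draft model over-proposes are rejected more often, and the residual distribution fills in exactly the mass the draft under-proposes, which is why the combined procedure still samples from `p`.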

## How is a unidirectional language model evaluated?

The canonical intrinsic metric for a unidirectional language model is **[perplexity](/wiki/perplexity)**, defined as the exponential of the per-token cross-entropy loss on held-out text. Concretely, if the model assigns probability p_i to the i-th token of an N-token test sequence, the perplexity is exp(-1/N * sum log p_i), which is the geometric mean of 1/p_i. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options at every step. Lower is better.
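
The definition translates directly into a few lines of code. This sketch assumes the per-token probabilities of the true tokens have already been collected into a list; the function name is illustrative.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log probability assigned to the true tokens."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that always assigns 1/20 to the correct token has perplexity 20.
print(perplexity([1 / 20] * 50))
# Geometric mean of 1/p: sqrt(2 * 4) for probabilities 0.5 and 0.25.
print(perplexity([0.5, 0.25]))
```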

Perplexity is comparable across models only when they share the same tokenization and the same test set, because both the per-token denominator and the difficulty of the test text matter. Researchers report **bits per byte** or **bits per character** for cross-tokenizer comparisons. Perplexity correlates well with downstream quality up to a point but stops being a reliable proxy for instruction-following or reasoning quality at the LLM scale, which is why modern evaluations rely heavily on benchmarks like MMLU, HumanEval, GSM8K, and the LMSYS Arena.
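
One way to compute bits per byte, assuming the model's total negative log-likelihood over the text is available in nats (the unit produced by a natural-log cross-entropy). The function name and arguments are hypothetical.

```python
import math

def bits_per_byte(total_nll_nats, text):
    """Total negative log-likelihood converted to bits, per UTF-8 byte of text."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / n_bytes

# 8 bits of total loss spread over a 4-byte string gives 2 bits per byte.
print(bits_per_byte(8 * math.log(2), "abcd"))
```

Because the denominator counts bytes rather than tokens, two models with different tokenizers are normalized by the same quantity for the same text.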

## What is the difference between unidirectional and bidirectional models?

The family of architectures sits on a spectrum of how much bidirectional information any token can see. The two pure endpoints are decoder-only (fully causal, GPT) and encoder-only (fully bidirectional, BERT). BERT's authors describe their model as one that is "designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" [2], and they frame the GPT-style alternative as a limitation: standard language models are "unidirectional, and this limits the choice of architectures that can be used during pre-training" [2]. The interesting middle is occupied by encoder-decoder models like T5 and BART, which use a bidirectional encoder for the input and a causal decoder for the output, and by **prefix LMs** like UniLM and the UL2 family, which let an initial portion of the sequence (the prompt) attend bidirectionally and then generate the rest causally. XLNet's permutation language model (Yang et al., NeurIPS 2019) tried to get bidirectional context inside an autoregressive framework by training on random factorization orders; it works but never displaced standard left-to-right pretraining [13].

| Architecture | Attention pattern | Pretraining objective | Best at | Examples |
|---|---|---|---|---|
| Decoder-only (unidirectional) | Causal mask everywhere | Next-token prediction | Generation | [GPT](/wiki/gpt), [GPT-3](/wiki/gpt-3), [Llama](/wiki/llama), Claude, Gemini, Mistral, DeepSeek, Qwen |
| Encoder-only (bidirectional) | No mask | Masked language model (MLM) | Embedding, classification, tagging | BERT, RoBERTa, DeBERTa, ELECTRA |
| Encoder-decoder | Bidirectional encoder, causal decoder + cross-attention | Span corruption, denoising | Translation, summarization, structured generation | T5, BART, mT5 |
| Prefix LM | Bidirectional on prefix, causal on suffix | Mixed, often span corruption + LM | Generation with strong conditioning | UniLM, UL2 |
| Permutation LM | Causal but over a permuted order | Permuted next-token | Hybrid understanding + generation | XLNet |
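
The attention patterns in the table differ only in which positions each token may see. The sketch below builds boolean visibility matrices for three of them; the function name, the `kind` strings, and the `prefix_len` parameter are illustrative assumptions.

```python
def attention_mask(n, kind, prefix_len=0):
    """Return an n x n matrix; entry [i][j] is True if position i may attend to j."""
    if kind == "bidirectional":
        return [[True] * n for _ in range(n)]
    if kind == "causal":
        return [[j <= i for j in range(n)] for i in range(n)]
    if kind == "prefix":
        # Every position sees the whole prefix; positions after it are causal.
        return [[j < prefix_len or j <= i for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown kind: {kind}")

for kind in ("bidirectional", "causal", "prefix"):
    print(kind)
    for row in attention_mask(4, kind, prefix_len=2):
        print("".join("x" if visible else "." for visible in row))
```

With `prefix_len=0` the prefix pattern reduces to the causal one, and with `prefix_len=n` it becomes fully bidirectional, which is one way to see the architectures as points on a single spectrum.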

The practical lesson of the last five years is that decoder-only unidirectional models scale better than the alternatives once you care about generation quality at trillion-token, billion-parameter regimes. The reasons are partly empirical (the largest and most capable generative systems all use this recipe) and partly architectural (a single uniform stack with one objective is easier to scale than encoder-decoder pipelines with two objectives and a cross-attention bridge).

## What are unidirectional language models good at?

The unidirectional formulation has several properties that explain why it took over generative AI:

- **Generation is native.** The model is trained to predict the next token from the left; producing text is just sampling from that distribution one token at a time. No surgery is required to convert a trained model into a generator, the way it is for BERT.
- **Training parallelizes across positions.** A single forward pass over a sequence yields the loss for every position at once. This is what makes pretraining at trillion-token scale feasible.
- **The objective is the same as the eval.** Intrinsic evaluation (perplexity) is simply the exponential of the pretraining loss (cross-entropy on the next token), so you can read training progress directly off the loss curve.
- **Inference can stream.** The model can emit a token as soon as it has been sampled, which is why ChatGPT-style interfaces can show text as it is being produced rather than waiting for the whole reply.
- **In-context learning emerges with scale.** GPT-3 showed that a sufficiently large unidirectional model can learn new tasks from a few examples shown in its prompt, with no gradient updates [1]. This property has not been demonstrated cleanly for bidirectional models.

## What are the limitations of unidirectional language models?

The same factorization that makes unidirectional models great generators creates real problems elsewhere:

- **No right context.** Tasks that benefit from seeing both sides of a token (named entity recognition, sentence embedding, sentiment classification, retrieval) generally do better with a bidirectional encoder, all else equal [2]. Decoder-only models can fake bidirectionality by reading the whole input first, but this is awkward and not what they were trained to do.
- **Exposure bias.** Teacher forcing during training and own-sample feedback at inference are different distributions [10]. The model can drift into out-of-distribution territory mid-generation. Sampling tricks like nucleus and temperature exist partly to keep generations from collapsing.
- **Sequential inference.** Generation cost grows linearly with output length; latency is hard to hide. Speculative decoding helps but does not change the fundamental serialization [12].
- **Quadratic attention over long contexts.** Standard self-attention costs O(n^2) in the sequence length n, so context windows of millions of tokens require sparse, linear, or hybrid attention schemes. This is not unique to unidirectional models, but it bites them harder because they are the ones used for long-form generation.
- **No native masked-token completion.** Filling in a missing word in the middle of a passage is what BERT was trained to do [2]. A unidirectional model can be coaxed into it (with prefix-suffix prompting or fill-in-the-middle pretraining) but it is not the natural objective.

## Are modern LLMs unidirectional?

As of 2026, every widely deployed generative LLM is a unidirectional decoder-only model. GPT-4, GPT-4o, GPT-4.1, Claude 3.5, Claude 4, Gemini 2 and 3, Llama 3 and 4, Mistral Large, DeepSeek-V3, Qwen 2.5 and 3, Grok, and the open-weights ecosystem around them all share the same basic recipe: a stack of transformer decoder blocks with causal self-attention, trained to predict the next token, finetuned with reinforcement learning from human feedback or similar techniques, and typically served with KV caching and, in many stacks, speculative decoding [14]. The architectural details vary (rotary position embeddings, RMSNorm, SwiGLU, grouped-query attention, mixture-of-experts at the FFN), but the underlying probabilistic structure is the same one Bengio described in 2003 [4] and the same one Shannon would have recognized in 1948 [3].

The few notable exceptions sit in specialized niches. Encoder-only models remain widely used for text embeddings (BGE, E5, Sentence-BERT), although embedders adapted from decoder-only LLMs, such as NV-Embed, now compete with them. Encoder-decoder T5 variants remain common in academic NLP and in some commercial translation systems. Diffusion language models have produced research demonstrations but have so far seen little production deployment. Mamba, RWKV, and other state-space and linear-attention architectures are also unidirectional in the same sense as transformer decoders, just with a different mixing layer. The unidirectional autoregressive frame has, so far, absorbed every serious challenger.

## Explain like I'm 5

Imagine reading a story one word at a time, and after every word you have to guess what the next word will be. You only know the words you have already read; you cannot peek ahead. The more stories you read this way, the better your guesses get, until eventually you can finish a sentence somebody else started just by knowing how the first part went. A unidirectional language model is a computer trained to play exactly this game on a huge pile of text. When it writes new text, it just keeps playing the game forwards, picking each next word from its own guesses.

## References

1. Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." *NeurIPS 2020*. https://arxiv.org/abs/2005.14165
2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." https://arxiv.org/abs/1810.04805
3. Shannon, C. E. (1948). "A Mathematical Theory of Communication." *Bell System Technical Journal*, 27, 379-423.
4. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research*, 3, 1137-1155. https://www.jmlr.org/papers/v3/bengio03a.html
5. Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., & Khudanpur, S. (2010). "Recurrent Neural Network Based Language Model." *Interspeech 2010*, 1045-1048. https://www.isca-archive.org/interspeech_2010/mikolov10_interspeech.html
6. Sundermeyer, M., Schluter, R., & Ney, H. (2012). "LSTM Neural Networks for Language Modeling." *Interspeech 2012*, 194-197. https://www.isca-archive.org/interspeech_2012/sundermeyer12_interspeech.html
7. Vaswani, A., et al. (2017). "Attention Is All You Need." *NeurIPS 2017*. https://arxiv.org/abs/1706.03762
8. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI Technical Report. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
9. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Technical Report.
10. Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks." *NeurIPS 2015*. https://arxiv.org/abs/1506.03099
11. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). "The Curious Case of Neural Text Degeneration." *ICLR 2020*. https://arxiv.org/abs/1904.09751
12. Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." *ICML 2023*. https://arxiv.org/abs/2211.17192
13. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q. V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding." *NeurIPS 2019*. https://arxiv.org/abs/1906.08237
14. Meta AI (2024). "Introducing Meta Llama 3: The most capable openly available LLM to date." https://ai.meta.com/blog/meta-llama-3/

