See also: Machine learning terms
A unidirectional language model is a language model that, when computing the probability of each token, conditions only on the tokens that appear in one direction of the sequence, almost always the tokens that appear earlier (the left context). The same family of models is also called a causal language model, an autoregressive language model, or, less commonly, a left-to-right language model. The defining property is the factorization: the joint probability of a sequence is decomposed by the chain rule into a product of next-token conditionals, each of which depends only on the tokens that came before.
This simple constraint has unusually large consequences. It is the reason every modern generative large language model (GPT-4, Claude, Gemini, Llama 3, Mistral, DeepSeek-V3, Qwen, and the other systems that get called "LLMs" in 2026) is built the way it is. It is also the reason these models are bad at certain things that BERT-family encoders do well, and the reason an entire ecosystem of sampling tricks, decoding strategies, and inference optimizations has grown up around them.
Given a sequence of tokens x_1, x_2, ..., x_n, a unidirectional language model assigns a probability to the whole sequence by writing it as
P(x_1, x_2, ..., x_n) = P(x_1) * P(x_2 | x_1) * P(x_3 | x_1, x_2) * ... * P(x_n | x_1, ..., x_{n-1}).
Each factor on the right depends only on tokens at earlier positions. The model never conditions on a token to its right. Training maximizes the log-likelihood of this product on a corpus, which reduces to a per-token cross-entropy loss against the next token in the sequence. Inference samples from these conditionals one token at a time, feeding each generated token back as part of the context for the next step.
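The factorization translates directly into code. The sketch below scores a sequence by summing log conditionals; `next_token_distribution` is a hypothetical stand-in for whatever model supplies P(x_t | x_1, ..., x_{t-1}), here a toy uniform distribution over a three-word vocabulary.

```python
import math

# A minimal sketch of the chain-rule factorization, assuming a hypothetical
# next_token_distribution(context) that returns a dict mapping each vocabulary
# token to its conditional probability given the left context.
def sequence_log_prob(tokens, next_token_distribution):
    """Sum of log P(x_t | x_1, ..., x_{t-1}) over the whole sequence."""
    total = 0.0
    for t in range(len(tokens)):
        conditionals = next_token_distribution(tokens[:t])  # left context only
        total += math.log(conditionals[tokens[t]])
    return total

# Toy stand-in model: uniform over a three-word vocabulary.
vocab = ["the", "cat", "sat"]
uniform = lambda context: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob(["the", "cat", "sat"], uniform))  # 3 * log(1/3) ≈ -3.296
```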
The contrast with bidirectional language models is structural. A BERT-style model learns a representation for each token that depends on the entire surrounding sequence, including tokens at later positions, and is trained with a masked language model objective rather than a next-token objective. The two design choices serve different purposes. Unidirectionality is required if you want the model to generate text. Bidirectionality is preferred if you want the model to embed, classify, tag, or score existing text.
The idea of treating language as a sequence of probabilities you can predict from earlier context is older than neural networks. Claude Shannon discussed it explicitly in his 1948 paper "A Mathematical Theory of Communication," where he used letter and word n-gram statistics from English to estimate the entropy of the language. For the next half century the dominant approach was the n-gram model, which approximates P(x_t | x_1, ..., x_{t-1}) by P(x_t | x_{t-n+1}, ..., x_{t-1}), counting how often each (n-1)-gram is followed by each token in a training corpus. By the 1990s n-gram language models had become standard infrastructure for speech recognition and machine translation, with sophisticated smoothing techniques (Kneser-Ney, Good-Turing, Katz back-off) doing most of the practical work. They were unidirectional by construction: each token's probability depends on the n-1 tokens immediately before it.
The first widely cited neural language model was Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin's 2003 paper "A Neural Probabilistic Language Model" in the Journal of Machine Learning Research. Their architecture learned a distributed representation (a continuous vector) for each word and used a feedforward neural network to compute the probability of the next word given a fixed window of previous words. This was still unidirectional and still windowed, but it dramatically outperformed n-gram models on perplexity and started the line of work that would eventually lead to GPT.
Tomas Mikolov and colleagues replaced the fixed window with a recurrent network in 2010, in their Interspeech paper "Recurrent Neural Network Based Language Model." The RNN read tokens one at a time and maintained a hidden state that, in principle, could carry information from arbitrarily far back. Their model cut perplexity roughly in half compared to a strong back-off n-gram baseline and reduced word error rate on Wall Street Journal speech recognition by about 18%. Two years later, Martin Sundermeyer, Ralf Schluter, and Hermann Ney showed in "LSTM Neural Networks for Language Modeling" (Interspeech 2012) that swapping the RNN cell for an LSTM gave another 8% relative improvement in perplexity by handling long-range dependencies that the vanilla RNN forgot.
LSTM language models held the field for the next five years. The decisive jump came when Vaswani and colleagues introduced the transformer in 2017 in "Attention Is All You Need." The transformer's decoder block uses causal self-attention, an attention layer with a mask that zeros out attention to future positions, so that the prediction at position i depends only on positions 1 through i. This makes the entire stack a unidirectional language model.
Alec Radford and colleagues at OpenAI took this decoder block, scaled it up, and trained it on a large corpus of unlabeled text. The 2018 paper "Improving Language Understanding by Generative Pre-Training" introduced what is now called GPT-1: a 12-layer decoder-only transformer with 117 million parameters, pretrained as a unidirectional language model and then finetuned on downstream tasks. GPT-2 (2019) scaled the same architecture to 1.5 billion parameters and a much larger web crawl. GPT-3 (Brown et al., "Language Models are Few-Shot Learners," NeurIPS 2020) reached 175 billion parameters and showed that a sufficiently large unidirectional language model could perform new tasks from a few examples in its prompt without any gradient updates. After GPT-3, decoder-only autoregressive models became the standard recipe for everything generative.
| Year | Model | Authors / org | Parameters | Notable change |
|---|---|---|---|---|
| 1948 | n-gram (theory) | Shannon | n/a | Information-theoretic framing of language |
| 1990s | n-gram + smoothing | Many | n/a | Workhorse of speech recognition and MT |
| 2003 | Feedforward NPLM | Bengio et al. | ~10M | Distributed word representations |
| 2010 | RNN-LM | Mikolov et al. | ~10M | Unbounded left context via recurrence |
| 2012 | LSTM-LM | Sundermeyer et al. | ~10M | Better long-range dependencies |
| 2017 | Transformer decoder | Vaswani et al. | ~65M (base) | Causal self-attention |
| 2018 | GPT-1 | Radford et al. (OpenAI) | 117M | Decoder-only pretraining + finetuning |
| 2019 | GPT-2 | Radford et al. (OpenAI) | 1.5B | Zero-shot transfer at scale |
| 2020 | GPT-3 | Brown et al. (OpenAI) | 175B | In-context few-shot learning |
| 2022 | Chinchilla, PaLM | DeepMind, Google | 70B, 540B | Compute-optimal scaling laws |
| 2023 | Llama, Llama 2 | Meta | 7B-70B | Open-weights decoder-only |
| 2024-2026 | GPT-4 / 4o, Claude 3-4, Gemini 2-3, Llama 3, DeepSeek-V3, Qwen | Various | Trillions (sparse), 100B-700B (dense) | All decoder-only autoregressive |
In a vanilla self-attention layer, each token at position i attends to every token in the sequence by computing scaled dot products between query and key vectors and taking a softmax over the resulting logits. To make the model unidirectional, the transformer adds a causal mask to those logits before the softmax. The mask is zero on and below the diagonal and negative infinity strictly above it. After the softmax, the negative infinities become zero attention weight, so position i can only attend to positions 1 through i. Combined with the convention that the predicted token at position i is matched against the next token in the input (the labels are the inputs shifted by one), this gives a model that produces, at every position, a distribution over the next token conditional on everything to the left.
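A minimal NumPy sketch of the masking step, using toy query and key matrices rather than a full transformer layer:

```python
import numpy as np

# A minimal sketch of causal (masked) self-attention weights. Q and K are toy
# arrays standing in for the query and key projections of a real layer.
def causal_attention_weights(Q, K):
    """Q, K: (seq_len, d) arrays. Returns (seq_len, seq_len) attention weights
    where position i has zero weight on every position j > i."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                            # scaled dot products
    future = np.triu(np.ones_like(logits), k=1).astype(bool)  # True strictly above the diagonal
    logits = np.where(future, -np.inf, logits)                # -inf on future positions
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax; -inf becomes 0
    return weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
print(np.round(causal_attention_weights(Q, K), 3))  # upper triangle is all zeros
```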
A crucial efficiency property falls out of this. Because position i's prediction only depends on positions 1 through i, the model can process an entire training sequence in a single forward pass and compute the loss at every position simultaneously. Training is fully parallelized across positions. This is one reason transformer language models scaled so much faster than RNN language models: the RNN had to step through tokens serially during training, while the transformer could chew through a 4096-token sequence in parallel.
At inference time the parallelism disappears. To generate the next token the model needs the previous token, which it just generated, so generation is inherently sequential. Modern systems hide most of this cost with a key-value cache that stores the keys and values from previous positions and reuses them at each step, so generating token n only requires computing one new query, one new key, one new value, and one attention pass at the new position rather than recomputing everything from scratch.
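The sketch below mimics that generate loop against a made-up `TinyModel`; `prefill` and `decode_step` are illustrative names rather than any particular library's API, and the "cache" here is just a list of seen token ids standing in for the per-layer key and value tensors a real serving stack would store.

```python
import numpy as np

# A schematic sketch of autoregressive decoding with a key-value cache.
# TinyModel is a stand-in, not a real transformer: its cache is the list of
# previously seen token ids, playing the role of the stored keys and values.
class TinyModel:
    def __init__(self, vocab_size=5, seed=0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def prefill(self, prompt_ids):
        cache = list(prompt_ids)               # real models store K/V tensors per layer
        return self.rng.standard_normal(self.vocab_size), cache

    def decode_step(self, token_id, cache):
        cache.append(token_id)                 # only the single new position is processed
        return self.rng.standard_normal(self.vocab_size), cache

def generate(model, prompt_ids, max_new_tokens):
    logits, cache = model.prefill(prompt_ids)  # one parallel pass over the prompt
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))       # greedy choice; sampling also works here
        out.append(next_id)
        logits, cache = model.decode_step(next_id, cache)
    return out

print(generate(TinyModel(), [1, 2, 3], max_new_tokens=4))
```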
A unidirectional language model is trained by minimizing the average cross-entropy between the model's predicted distribution at each position and the one-hot distribution of the true next token. This is exactly equivalent to maximizing the log-likelihood of the training corpus under the chain-rule factorization. There is no auxiliary objective in standard pretraining: the loss at every position is just the negative log probability the model assigned to the correct next token.
During training the input at each position is the ground-truth token from the corpus, even though at inference the model would receive its own previous prediction. This setup is called teacher forcing. It is fast, stable, and lets the loss at every position be computed in parallel. The downside is exposure bias: at inference the model has to consume its own (sometimes wrong) predictions, but at training it never had to. Errors in the early part of a generation can compound, drifting the conditional distributions away from anything the model saw during training. Bengio and colleagues proposed scheduled sampling in 2015 as a fix, gradually replacing some teacher-forced inputs with the model's own samples during training, but in practice modern LLMs still train with pure teacher forcing because the simplicity and parallelism are too valuable to give up.
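A minimal sketch of the teacher-forced objective, with random logits standing in for a model's output: the labels are the inputs shifted by one position, and every position's loss is computed at once.

```python
import numpy as np

# A minimal sketch of the next-token cross-entropy loss under teacher forcing.
# The logits are random stand-ins for what a model would produce from the
# ground-truth left context at each position.
def next_token_cross_entropy(logits, token_ids):
    """logits: (seq_len, vocab) scores; token_ids: (seq_len,) ground-truth tokens.
    Returns the average negative log probability of each next token."""
    targets = token_ids[1:]                    # position t is scored against token t+1
    logits = logits[:-1]                       # the last position has no next token to predict
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()                          # average negative log-likelihood per token

rng = np.random.default_rng(0)
token_ids = np.array([4, 1, 3, 0, 2])
logits = rng.standard_normal((len(token_ids), 8))   # pretend vocabulary of 8 tokens
print(next_token_cross_entropy(logits, token_ids))
```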
The unidirectional factorization means the model produces a probability distribution over the vocabulary at every step. To turn that distribution into an actual sequence, you have to choose a decoding strategy. The choice has a much larger effect on output quality than is sometimes acknowledged. The major options:
| Method | What it does | Typical use |
|---|---|---|
| Greedy decoding | At each step pick the token with the highest probability | Deterministic short outputs, classification-style tasks |
| Beam search | Maintain k partial sequences ranked by total log probability, expand each by one token, keep the best k | Machine translation, summarization, anything where average log prob correlates with quality |
| Pure sampling | Sample directly from the model's full predicted distribution | Rarely used; tends to produce incoherent text because of the long low-probability tail |
| Temperature sampling | Divide logits by a temperature T, then sample; T<1 sharpens, T>1 flattens | Controls overall randomness |
| Top-k sampling | Restrict the sample to the k highest-probability tokens, renormalize, then sample | Removes the tail; introduced by Fan et al. 2018, with k=40 later popularized by GPT-2 |
| Nucleus (top-p) sampling | Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9), renormalize, then sample | Default for most chat LLMs; introduced by Holtzman et al. 2019 |
| Speculative decoding | A small "draft" model proposes several tokens; the large model verifies them in parallel; accepted tokens commit, rejected ones are corrected | Inference acceleration, no change to output distribution; introduced by Leviathan, Kalman, Matias 2023 |
Holtzman, Buys, Du, Forbes, and Choi's 2019 paper "The Curious Case of Neural Text Degeneration" was the paper that pushed the field toward nucleus sampling. They showed that beam search and pure greedy decoding tend to produce repetitive, degenerate text from a sufficiently large model (the model gets stuck in loops because high-probability continuations of high-probability text are themselves high probability), while pure sampling produces incoherent text because the very long low-probability tail of the distribution accumulates real probability mass. Top-p sampling was their proposed fix: dynamically truncate the distribution to its high-probability "nucleus" and sample from inside it. The trick stuck.
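A minimal sketch of that truncation for a single decoding step, using made-up logits; temperature is applied before the nucleus is formed, which is a common way of composing the two.

```python
import numpy as np

# A minimal sketch of temperature plus nucleus (top-p) sampling from one step's
# logits. The logit values below are illustrative, not from a real model.
def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax
    order = np.argsort(probs)[::-1]                        # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1            # smallest set with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

print(sample_top_p([2.0, 1.0, 0.5, -1.0, -3.0], p=0.9, rng=np.random.default_rng(0)))
```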
Leviathan, Kalman, and Matias's 2023 ICML paper "Fast Inference from Transformers via Speculative Decoding" attacks a different bottleneck. Generation is sequential, so latency scales linearly with output length. The insight is that easy tokens (a comma, the second half of a common word, the end of a function name) can be predicted by a much smaller model and only need to be confirmed by the large one. A small draft model proposes K tokens in parallel, the large model evaluates them all in a single forward pass, and any prefix that matches the large model's distribution is accepted in one step. The output distribution is mathematically identical to standard sampling from the large model. On T5-XXL the original paper reported 2x to 3x speedups; later work (Medusa, EAGLE, lookahead decoding) has pushed this further and speculative decoding is now standard in production LLM serving stacks.
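The acceptance rule at the heart of the method can be sketched in a few lines. The arrays below are toy distributions rather than real draft and target models, and the extra token the full algorithm samples when every draft token is accepted is omitted for brevity: each proposed token is accepted with probability min(1, p/q), and the first rejection is replaced by a sample from the renormalized residual max(0, p - q).

```python
import numpy as np

# A toy sketch of the speculative-decoding verification step: accept each draft
# token with probability min(1, p/q); on the first rejection, resample from the
# residual distribution max(0, p - q) renormalized, then stop.
def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """draft_tokens: K proposed token ids; q_dists, p_dists: (K, vocab) probabilities
    from the draft and target models at each proposed position."""
    accepted = []
    for k, x in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_dists[k][x] / q_dists[k][x]):
            accepted.append(int(x))                     # keeps the target model's distribution
        else:
            residual = np.maximum(p_dists[k] - q_dists[k], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                             # stop at the first rejection
    return accepted                                     # all K draft tokens accepted

rng = np.random.default_rng(0)
vocab = 4
q = np.full((3, vocab), 1.0 / vocab)                    # draft model: uniform
p = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (3, 1))     # target model: peaked
print(verify_draft([0, 1, 2], q, p, rng))
```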
The canonical intrinsic metric for a unidirectional language model is perplexity, defined as the exponential of the per-token cross-entropy loss on held-out text. Concretely, if the model assigns probability p_i to the i-th token of an N-token test sequence, the perplexity is exp(-1/N * sum log p_i), which is the geometric mean of 1/p_i. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options at every step. Lower is better.
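In code, the definition is a one-liner over per-token probabilities; the numbers below are made up to illustrate the formula rather than taken from any model.

```python
import math

# Perplexity from per-token probabilities, following the definition above.
def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(perplexity([0.05] * 10))        # exactly 20.0: as uncertain as a uniform choice among 20
print(perplexity([0.5, 0.25, 0.125])) # lower probabilities on average -> higher perplexity
```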
Perplexity is comparable across models only when they share the same tokenization and the same test set, because both the per-token denominator and the difficulty of the test text matter. Researchers report bits per byte or bits per character for cross-tokenizer comparisons. Perplexity correlates well with downstream quality up to a point but stops being a reliable proxy for instruction-following or reasoning quality at the LLM scale, which is why modern evaluations rely heavily on benchmarks like MMLU, HumanEval, GSM8K, and the LMSYS Arena.
The family of architectures sits on a spectrum of how much bidirectional information any token can see. The two pure endpoints are decoder-only (fully causal, GPT) and encoder-only (fully bidirectional, BERT). The interesting middle is occupied by encoder-decoder models like T5 and BART, which use a bidirectional encoder for the input and a causal decoder for the output, and by prefix LMs like UniLM and the UL2 family, which let an initial portion of the sequence (the prompt) attend bidirectionally and then generate the rest causally. XLNet's permutation language model (Yang et al., NeurIPS 2019) tried to get bidirectional context inside an autoregressive framework by training on random factorization orders; it works but never displaced standard left-to-right pretraining.
| Architecture | Attention pattern | Pretraining objective | Best at | Examples |
|---|---|---|---|---|
| Decoder-only (unidirectional) | Causal mask everywhere | Next-token prediction | Generation | GPT, GPT-3, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen |
| Encoder-only (bidirectional) | No mask | Masked language model (MLM) | Embedding, classification, tagging | BERT, RoBERTa, DeBERTa, ELECTRA |
| Encoder-decoder | Bidirectional encoder, causal decoder + cross-attention | Span corruption, denoising | Translation, summarization, structured generation | T5, BART, mT5 |
| Prefix LM | Bidirectional on prefix, causal on suffix | Mixed, often span corruption + LM | Generation with strong conditioning | UniLM, UL2 |
| Permutation LM | Causal but over a permuted order | Permuted next-token | Hybrid understanding + generation | XLNet |
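The difference between the pure causal pattern and the prefix-LM pattern in the table above comes down to which entries of the attention mask are switched on. A small boolean-mask sketch (True meaning "may attend"), with an illustrative sequence length and prefix length:

```python
import numpy as np

# A minimal sketch contrasting a causal mask with a prefix-LM mask. Positions
# inside the prefix attend to the whole prefix in both directions; positions
# after it attend causally, exactly as in a decoder-only model.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))        # True on and below the diagonal

def prefix_lm_mask(n, prefix_len):
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True               # prefix tokens see each other fully
    return mask

print(causal_mask(5).astype(int))
print(prefix_lm_mask(5, prefix_len=3).astype(int))
```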
The practical lesson of the last five years is that decoder-only unidirectional models scale better than the alternatives once you care about generation quality at trillion-token, billion-parameter regimes. The reasons are partly empirical (next-token pretraining alone has proven sufficient for the capabilities that matter downstream) and partly architectural (a single uniform stack with one objective is easier to scale than encoder-decoder pipelines with two objectives and a cross-attention bridge).
The unidirectional formulation has several properties that explain why it took over generative AI. Generation falls directly out of the factorization: sampling from the next-token conditionals, one step at a time, produces text. Training parallelizes fully across positions, because every position's loss depends only on the ground-truth left context. And the objective needs nothing but raw text, so a single uniform recipe scales to web-sized corpora without labels.
The same factorization that makes unidirectional models great generators creates real problems elsewhere. Each token's representation ignores everything to its right, which is why encoder-style bidirectional models remain better at embedding, classification, and tagging. Inference is inherently sequential, so latency grows with output length and has to be clawed back with KV caching and speculative decoding. And exposure bias means a model trained entirely with teacher forcing must consume its own, sometimes wrong, predictions at generation time, letting early errors compound.
As of 2026, every widely deployed generative LLM is a unidirectional decoder-only model. GPT-4, GPT-4o, GPT-4.1, Claude 3.5, Claude 4, Gemini 2 and 3, Llama 3 and 4, Mistral Large, DeepSeek-V3, Qwen 2.5 and 3, Grok, and the open-weights ecosystem around them all share the same basic recipe: a stack of transformer decoder blocks with causal self-attention, trained to predict the next token, finetuned with reinforcement learning from human feedback or similar techniques, and served with KV caching and speculative decoding. The architectural details vary (rotary position embeddings, RMSNorm, SwiGLU, grouped-query attention, mixture-of-experts at the FFN), but the underlying probabilistic structure is the same one Bengio described in 2003 and the same one Shannon would have recognized in 1948.
The few notable exceptions sit in specialized niches. Encoder-only models still dominate text embeddings (BGE, E5, Sentence-BERT, NV-Embed). Encoder-decoder T5 variants remain common in academic NLP and in some commercial translation systems. Diffusion language models have produced research demonstrations but no production systems. Mamba, RWKV, and other state-space and linear-attention architectures are also unidirectional in the same sense as transformer decoders, just with a different mixing layer. The unidirectional autoregressive frame has, so far, absorbed every serious challenger.
Imagine reading a story one word at a time, and after every word you have to guess what the next word will be. You only know the words you have already read; you cannot peek ahead. The more stories you read this way, the better your guesses get, until eventually you can finish a sentence somebody else started just by knowing how the first part went. A unidirectional language model is a computer trained to play exactly this game on a huge pile of text. When it writes new text, it just keeps playing the game forwards, picking each next word from its own guesses.