See also: Machine learning terms
A unidirectional language model is a language model that, when computing the probability of each token, conditions only on the tokens that appear in one direction of the sequence, almost always the tokens that appear earlier (the left context). The same family of models is also called a causal language model, an autoregressive language model, or, less commonly, a left-to-right language model. The defining property is the factorization: the joint probability of a sequence is decomposed by the chain rule into a product of next-token conditionals, each of which depends only on the tokens that came before.
This simple constraint has unusually large consequences. It is the reason every modern generative large language model (GPT-4, Claude, Gemini, Llama 3, Mistral, DeepSeek-V3, Qwen, and the other systems that get called "LLMs" in 2026) is built the way it is. It is also the reason these models are bad at certain things that BERT-family encoders do well, and the reason an entire ecosystem of sampling tricks, decoding strategies, and inference optimizations has grown up around them.
Given a sequence of tokens x_1, x_2, ..., x_n, a unidirectional language model assigns a probability to the whole sequence by writing it as
P(x_1, x_2, ..., x_n) = P(x_1) * P(x_2 | x_1) * P(x_3 | x_1, x_2) * ... * P(x_n | x_1, ..., x_{n-1}).
Each factor on the right depends only on tokens at earlier positions. The model never conditions on a token to its right. Training maximizes the log-likelihood of this product on a corpus, which reduces to a per-token cross-entropy loss against the next token in the sequence. Inference samples from these conditionals one token at a time, feeding each generated token back as part of the context for the next step.
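The factorization translates directly into code. The sketch below scores a sequence by summing log conditionals; `next_token_distribution` is a hypothetical stand-in for whatever model supplies P(x_t | x_1, ..., x_{t-1}), here a toy uniform distribution over a three-word vocabulary.

```python
import math

# A minimal sketch of the chain-rule factorization, assuming a hypothetical
# next_token_distribution(context) that returns a dict mapping each vocabulary
# token to its conditional probability given the left context.
def sequence_log_prob(tokens, next_token_distribution):
    """Sum of log P(x_t | x_1, ..., x_{t-1}) over the whole sequence."""
    total = 0.0
    for t in range(len(tokens)):
        conditionals = next_token_distribution(tokens[:t])  # left context only
        total += math.log(conditionals[tokens[t]])
    return total

# Toy stand-in model: uniform over a three-word vocabulary.
vocab = ["the", "cat", "sat"]
uniform = lambda context: {w: 1.0 / len(vocab) for w in vocab}
print(sequence_log_prob(["the", "cat", "sat"], uniform))  # 3 * log(1/3) ≈ -3.296
```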
The contrast with bidirectional language models is structural. A BERT-style model learns a representation for each token that depends on the entire surrounding sequence, including tokens at later positions, and is trained with a masked language model objective rather than a next-token objective. The two design choices serve different purposes. Unidirectionality is required if you want the model to generate text. Bidirectionality is preferred if you want the model to embed, classify, tag, or score existing text.
The idea of treating language as a sequence of probabilities you can predict from earlier context is older than neural networks. Claude Shannon discussed it explicitly in his 1948 paper "A Mathematical Theory of Communication," where he used letter and word n-gram statistics from English to estimate the entropy of the language. For the next half century the dominant approach was the n-gram model, which approximates P(x_t | x_1, ..., x_{t-1}) by P(x_t | x_{t-n+1}, ..., x_{t-1}), counting how often each (n-1)-gram is followed by each token in a training corpus. By the 1990s n-gram language models had become standard infrastructure for speech recognition and machine translation, with sophisticated smoothing techniques (Kneser-Ney, Good-Turing, Katz back-off) doing most of the practical work. They were unidirectional by construction: each token's probability depends on the n-1 tokens immediately before it.
The first widely cited neural language model was Yoshua Bengio, Rejean Ducharme, Pascal Vincent, and Christian Jauvin's 2003 paper "A Neural Probabilistic Language Model" in the Journal of Machine Learning Research. Their architecture learned a distributed representation (a continuous vector) for each word and used a feedforward neural network to compute the probability of the next word given a fixed window of previous words. This was still unidirectional and still windowed, but it dramatically outperformed n-gram models on perplexity and started the line of work that would eventually lead to GPT.
Tomas Mikolov and colleagues replaced the fixed window with a recurrent network in 2010, in their Interspeech paper "Recurrent Neural Network Based Language Model." The RNN read tokens one at a time and maintained a hidden state that, in principle, could carry information from arbitrarily far back. Their model cut perplexity roughly in half compared to a strong back-off n-gram baseline and reduced word error rate on Wall Street Journal speech recognition by about 18%. Two years later, Martin Sundermeyer, Ralf Schluter, and Hermann Ney showed in "LSTM Neural Networks for Language Modeling" (Interspeech 2012) that swapping the RNN cell for an LSTM gave another 8% relative improvement in perplexity by handling long-range dependencies that the vanilla RNN forgot.
LSTM language models held the field for the next five years. The decisive jump came when Vaswani and colleagues introduced the transformer in 2017 in "Attention Is All You Need." The transformer's decoder block uses causal self-attention, an attention layer with a mask that zeros out attention to future positions, so that the prediction at position i depends only on positions 1 through i. This makes the entire stack a unidirectional language model.
Alec Radford and colleagues at OpenAI took this decoder block, scaled it up, and trained it on a large corpus of unlabeled text. The 2018 paper "Improving Language Understanding by Generative Pre-Training" introduced what is now called GPT-1: a 12-layer decoder-only transformer with 117 million parameters, pretrained as a unidirectional language model and then finetuned on downstream tasks. GPT-2 (2019) scaled the same architecture to 1.5 billion parameters and a much larger web crawl. GPT-3 (Brown et al., "Language Models are Few-Shot Learners," NeurIPS 2020) reached 175 billion parameters and showed that a sufficiently large unidirectional language model could perform new tasks from a few examples in its prompt without any gradient updates. After GPT-3, decoder-only autoregressive models became the standard recipe for everything generative.
| Year | Model | Authors / org | Parameters | Notable change |
|---|---|---|---|---|
| 1948 | n-gram (theory) | Shannon | n/a | Information-theoretic framing of language |
| 1990s | n-gram + smoothing | Many | n/a | Workhorse of speech recognition and MT |
| 2003 | Feedforward NPLM | Bengio et al. | ~10M | Distributed word representations |
| 2010 | RNN-LM | Mikolov et al. | ~10M | Unbounded left context via recurrence |
| 2012 | LSTM-LM | Sundermeyer et al. | ~10M | Better long-range dependencies |
| 2017 | Transformer decoder | Vaswani et al. | ~65M (base) | Causal self-attention |
| 2018 | GPT-1 | Radford et al. (OpenAI) | 117M | Decoder-only pretraining + finetuning |
| 2019 | GPT-2 | Radford et al. (OpenAI) | 1.5B | Zero-shot transfer at scale |
| 2020 | GPT-3 | Brown et al. (OpenAI) | 175B | In-context few-shot learning |
| 2022 | Chinchilla, PaLM | DeepMind, Google | 70B, 540B | Compute-optimal scaling laws |
| 2023 | Llama, Llama 2 | Meta | 7B-70B | Open-weights decoder-only |
| 2024-2026 | GPT-4 / 4o, Claude 3-4, Gemini 2-3, Llama 3, DeepSeek-V3, Qwen | Various | Trillions (sparse), 100B-700B (dense) | All decoder-only autoregressive |
In a vanilla self-attention layer, each token at position i attends to every token in the sequence by computing scaled dot products between query and key vectors and taking a softmax over the resulting logits. To make the model unidirectional, the transformer adds a causal mask to those logits before the softmax. The mask is zero on and below the diagonal and negative infinity strictly above it. After the softmax, the negative infinities become zero attention weight, so position i can only attend to positions 1 through i. Combined with the convention that the predicted token at position i is matched against the next token in the input (the labels are the inputs shifted by one), this gives a model that produces, at every position, a distribution over the next token conditional on everything to the left.
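A minimal NumPy sketch of the masking step, using toy query and key matrices rather than a full transformer layer:

```python
import numpy as np

# A minimal sketch of causal (masked) self-attention weights. Q and K are toy
# arrays standing in for the query and key projections of a real layer.
def causal_attention_weights(Q, K):
    """Q, K: (seq_len, d) arrays. Returns (seq_len, seq_len) attention weights
    where position i has zero weight on every position j > i."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                            # scaled dot products
    future = np.triu(np.ones_like(logits), k=1).astype(bool)  # True strictly above the diagonal
    logits = np.where(future, -np.inf, logits)                # -inf on future positions
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)            # row-wise softmax; -inf becomes 0
    return weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
print(np.round(causal_attention_weights(Q, K), 3))  # upper triangle is all zeros
```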
A crucial efficiency property falls out of this. Because position i's prediction only depends on positions 1 through i, the model can process an entire training sequence in a single forward pass and compute the loss at every position simultaneously. Training is fully parallelized across positions. This is one reason transformer language models scaled so much faster than RNN language models: the RNN had to step through tokens serially during training, while the transformer could chew through a 4096-token sequence in parallel.
At inference time the parallelism disappears. To generate the next token the model needs the previous token, which it just generated, so generation is inherently sequential. Modern systems hide most of this cost with a key-value cache that stores the keys and values from previous positions and reuses them at each step, so generating token n only requires computing one new query, one new key, one new value, and one attention pass at the new position rather than recomputing everything from scratch.
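The sketch below mimics that generate loop against a made-up `TinyModel`; `prefill` and `decode_step` are illustrative names rather than any particular library's API, and the "cache" here is just a list of seen token ids standing in for the per-layer key and value tensors a real serving stack would store.

```python
import numpy as np

# A schematic sketch of autoregressive decoding with a key-value cache.
# TinyModel is a stand-in, not a real transformer: its cache is the list of
# previously seen token ids, playing the role of the stored keys and values.
class TinyModel:
    def __init__(self, vocab_size=5, seed=0):
        self.vocab_size = vocab_size
        self.rng = np.random.default_rng(seed)

    def prefill(self, prompt_ids):
        cache = list(prompt_ids)               # real models store K/V tensors per layer
        return self.rng.standard_normal(self.vocab_size), cache

    def decode_step(self, token_id, cache):
        cache.append(token_id)                 # only the single new position is processed
        return self.rng.standard_normal(self.vocab_size), cache

def generate(model, prompt_ids, max_new_tokens):
    logits, cache = model.prefill(prompt_ids)  # one parallel pass over the prompt
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = int(np.argmax(logits))       # greedy choice; sampling also works here
        out.append(next_id)
        logits, cache = model.decode_step(next_id, cache)
    return out

print(generate(TinyModel(), [1, 2, 3], max_new_tokens=4))
```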
A unidirectional language model is trained by minimizing the average cross-entropy between the model's predicted distribution at each position and the one-hot distribution of the true next token. This is exactly equivalent to maximizing the log-likelihood of the training corpus under the chain-rule factorization. There is no auxiliary objective in standard pretraining: the loss at every position is just the negative log probability the model assigned to the correct next token.
During training the input at each position is the ground-truth token from the corpus, even though at inference the model would receive its own previous prediction. This setup is called teacher forcing. It is fast, stable, and lets the loss at every position be computed in parallel. The downside is exposure bias: at inference the model has to consume its own (sometimes wrong) predictions, but at training it never had to. Errors in the early part of a generation can compound, drifting the conditional distributions away from anything the model saw during training. Bengio and colleagues proposed scheduled sampling in 2015 as a fix, gradually replacing some teacher-forced inputs with the model's own samples during training, but in practice modern LLMs still train with pure teacher forcing because the simplicity and parallelism are too valuable to give up.
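A minimal sketch of the teacher-forced objective, with random logits standing in for a model's output: the labels are the inputs shifted by one position, and every position's loss is computed at once.

```python
import numpy as np

# A minimal sketch of the next-token cross-entropy loss under teacher forcing.
# The logits are random stand-ins for what a model would produce from the
# ground-truth left context at each position.
def next_token_cross_entropy(logits, token_ids):
    """logits: (seq_len, vocab) scores; token_ids: (seq_len,) ground-truth tokens.
    Returns the average negative log probability of each next token."""
    targets = token_ids[1:]                    # position t is scored against token t+1
    logits = logits[:-1]                       # the last position has no next token to predict
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # log-softmax
    nll = -log_probs[np.arange(len(targets)), targets]
    return nll.mean()                          # average negative log-likelihood per token

rng = np.random.default_rng(0)
token_ids = np.array([4, 1, 3, 0, 2])
logits = rng.standard_normal((len(token_ids), 8))   # pretend vocabulary of 8 tokens
print(next_token_cross_entropy(logits, token_ids))
```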
The unidirectional factorization means the model produces a probability distribution over the vocabulary at every step. To turn that distribution into an actual sequence, you have to choose a decoding strategy. The choice has a much larger effect on output quality than is sometimes acknowledged. The major options:
| Method | What it does | Typical use |
|---|---|---|
| Greedy decoding | At each step pick the token with the highest probability | Deterministic short outputs, classification-style tasks |
| Beam search | Maintain k partial sequences ranked by total log probability, expand each by one token, keep the best k | Machine translation, summarization, anything where average log prob correlates with quality |
| Pure sampling | Sample directly from the model's full predicted distribution | Rarely used; tends to produce incoherent text because of the long low-probability tail |
| Temperature sampling | Divide logits by a temperature T, then sample; T<1 sharpens, T>1 flattens | Controls overall randomness |
| Top-k sampling | Restrict the sample to the k highest-probability tokens, renormalize, then sample | Removes the tail; introduced by Fan et al. 2018, with k=40 later popularized by GPT-2 |
| Nucleus (top-p) sampling | Keep the smallest set of tokens whose cumulative probability exceeds p (e.g., p=0.9), renormalize, then sample | Default for most chat LLMs; introduced by Holtzman et al. 2019 |
| Speculative decoding | A small "draft" model proposes several tokens; the large model verifies them in parallel; accepted tokens commit, rejected ones are corrected | Inference acceleration, no change to output distribution; introduced by Leviathan, Kalman, Matias 2023 |
Holtzman, Buys, Du, Forbes, and Choi's 2019 paper "The Curious Case of Neural Text Degeneration" was the paper that pushed the field toward nucleus sampling. They showed that beam search and pure greedy decoding tend to produce repetitive, degenerate text from a sufficiently large model (the model gets stuck in loops because high-probability continuations of high-probability text are themselves high probability), while pure sampling produces incoherent text because the very long low-probability tail of the distribution accumulates real probability mass. Top-p sampling was their proposed fix: dynamically truncate the distribution to its high-probability "nucleus" and sample from inside it. The trick stuck.
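A minimal sketch of that truncation for a single decoding step, using made-up logits; temperature is applied before the nucleus is formed, which is a common way of composing the two.

```python
import numpy as np

# A minimal sketch of temperature plus nucleus (top-p) sampling from one step's
# logits. The logit values below are illustrative, not from a real model.
def sample_top_p(logits, p=0.9, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax
    order = np.argsort(probs)[::-1]                        # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1            # smallest set with mass >= p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

print(sample_top_p([2.0, 1.0, 0.5, -1.0, -3.0], p=0.9, rng=np.random.default_rng(0)))
```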
Leviathan, Kalman, and Matias's 2023 ICML paper "Fast Inference from Transformers via Speculative Decoding" attacks a different bottleneck. Generation is sequential, so latency scales linearly with output length. The insight is that easy tokens (a comma, the second half of a common word, the end of a function name) can be predicted by a much smaller model and only need to be confirmed by the large one. A small draft model proposes K tokens in parallel, the large model evaluates them all in a single forward pass, and any prefix that matches the large model's distribution is accepted in one step. The output distribution is mathematically identical to standard sampling from the large model. On T5-XXL the original paper reported 2x to 3x speedups; later work (Medusa, EAGLE, lookahead decoding) has pushed this further and speculative decoding is now standard in production LLM serving stacks.
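The acceptance rule at the heart of the method can be sketched in a few lines. The arrays below are toy distributions rather than real draft and target models, and the extra token the full algorithm samples when every draft token is accepted is omitted for brevity: each proposed token is accepted with probability min(1, p/q), and the first rejection is replaced by a sample from the renormalized residual max(0, p - q).

```python
import numpy as np

# A toy sketch of the speculative-decoding verification step: accept each draft
# token with probability min(1, p/q); on the first rejection, resample from the
# residual distribution max(0, p - q) renormalized, then stop.
def verify_draft(draft_tokens, q_dists, p_dists, rng):
    """draft_tokens: K proposed token ids; q_dists, p_dists: (K, vocab) probabilities
    from the draft and target models at each proposed position."""
    accepted = []
    for k, x in enumerate(draft_tokens):
        if rng.random() < min(1.0, p_dists[k][x] / q_dists[k][x]):
            accepted.append(int(x))                     # keeps the target model's distribution
        else:
            residual = np.maximum(p_dists[k] - q_dists[k], 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted                             # stop at the first rejection
    return accepted                                     # all K draft tokens accepted

rng = np.random.default_rng(0)
vocab = 4
q = np.full((3, vocab), 1.0 / vocab)                    # draft model: uniform
p = np.tile(np.array([0.7, 0.1, 0.1, 0.1]), (3, 1))     # target model: peaked
print(verify_draft([0, 1, 2], q, p, rng))
```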
The canonical intrinsic metric for a unidirectional language model is perplexity, defined as the exponential of the per-token cross-entropy loss on held-out text. Concretely, if the model assigns probability p_i to the i-th token of an N-token test sequence, the perplexity is exp(-1/N * sum log p_i), which is the geometric mean of 1/p_i. A perplexity of 20 means the model is, on average, as uncertain as if it were choosing uniformly among 20 options at every step. Lower is better.
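In code, the definition is a one-liner over per-token probabilities; the numbers below are made up to illustrate the formula rather than taken from any model.

```python
import math

# Perplexity from per-token probabilities, following the definition above.
def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(perplexity([0.05] * 10))        # exactly 20.0: as uncertain as a uniform choice among 20
print(perplexity([0.5, 0.25, 0.125])) # lower probabilities on average -> higher perplexity
```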
Perplexity is comparable across models only when they share the same tokenization and the same test set, because both the per-token denominator and the difficulty of the test text matter. Researchers report bits per byte or bits per character for cross-tokenizer comparisons. Perplexity correlates well with downstream quality up to a point but stops being a reliable proxy for instruction-following or reasoning quality at the LLM scale, which is why modern evaluations rely heavily on benchmarks like MMLU, HumanEval, GSM8K, and the LMSYS Arena.
The family of architectures sits on a spectrum of how much bidirectional information any token can see. The two pure endpoints are decoder-only (fully causal, GPT) and encoder-only (fully bidirectional, BERT). The interesting middle is occupied by encoder-decoder models like T5 and BART, which use a bidirectional encoder for the input and a causal decoder for the output, and by prefix LMs like UniLM and the UL2 family, which let an initial portion of the sequence (the prompt) attend bidirectionally and then generate the rest causally. XLNet's permutation language model (Yang et al., NeurIPS 2019) tried to get bidirectional context inside an autoregressive framework by training on random factorization orders; it works but never displaced standard left-to-right pretraining.
| Architecture | Attention pattern | Pretraining objective | Best at | Examples |
|---|---|---|---|---|
| Decoder-only (unidirectional) | Causal mask everywhere | Next-token prediction | Generation | GPT, GPT-3, Llama, Claude, Gemini, Mistral, DeepSeek, Qwen |
| Encoder-only (bidirectional) | No mask | Masked language model (MLM) | Embedding, classification, tagging | BERT, RoBERTa, DeBERTa, ELECTRA |
| Encoder-decoder | Bidirectional encoder, causal decoder + cross-attention | Span corruption, denoising | Translation, summarization, structured generation | T5, BART, mT5 |
| Prefix LM | Bidirectional on prefix, causal on suffix | Mixed, often span corruption + LM | Generation with strong conditioning | UniLM, UL2 |
| Permutation LM | Causal but over a permuted order | Permuted next-token | Hybrid understanding + generation | XLNet |
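The difference between the pure causal pattern and the prefix-LM pattern in the table above comes down to which entries of the attention mask are switched on. A small boolean-mask sketch (True meaning "may attend"), with an illustrative sequence length and prefix length:

```python
import numpy as np

# A minimal sketch contrasting a causal mask with a prefix-LM mask. Positions
# inside the prefix attend to the whole prefix in both directions; positions
# after it attend causally, exactly as in a decoder-only model.
def causal_mask(n):
    return np.tril(np.ones((n, n), dtype=bool))        # True on and below the diagonal

def prefix_lm_mask(n, prefix_len):
    mask = causal_mask(n)
    mask[:prefix_len, :prefix_len] = True               # prefix tokens see each other fully
    return mask

print(causal_mask(5).astype(int))
print(prefix_lm_mask(5, prefix_len=3).astype(int))
```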
The practical lesson of the last five years is that decoder-only unidirectional models scale better than the alternatives once you care about generation quality at trillion-token, billion-parameter regimes. The reasons are partly empirical (next-token pretraining alone has proven sufficient for the capabilities that matter downstream) and partly architectural (a single uniform stack with one objective is easier to scale than encoder-decoder pipelines with two objectives and a cross-attention bridge).
The unidirectional formulation has several properties that explain why it took over generative AI. Generation falls directly out of the factorization: sampling from the next-token conditionals, one step at a time, produces text. Training parallelizes fully across positions, because every position's loss depends only on the ground-truth left context. And the objective needs nothing but raw text, so a single uniform recipe scales to web-sized corpora without labels.
The same factorization that makes unidirectional models great generators creates real problems elsewhere. Each token's representation ignores everything to its right, which is why encoder-style bidirectional models remain better at embedding, classification, and tagging. Inference is inherently sequential, so latency grows with output length and has to be clawed back with KV caching and speculative decoding. And exposure bias means a model trained entirely with teacher forcing must consume its own, sometimes wrong, predictions at generation time, letting early errors compound.
As of 2026, every widely deployed generative LLM is a unidirectional decoder-only model. GPT-4, GPT-4o, GPT-4.1, Claude 3.5, Claude 4, Gemini 2 and 3, Llama 3 and 4, Mistral Large, DeepSeek-V3, Qwen 2.5 and 3, Grok, and the open-weights ecosystem around them all share the same basic recipe: a stack of transformer decoder blocks with causal self-attention, trained to predict the next token, finetuned with reinforcement learning from human feedback or similar techniques, and served with KV caching and speculative decoding. The architectural details vary (rotary position embeddings, RMSNorm, SwiGLU, grouped-query attention, mixture-of-experts at the FFN), but the underlying probabilistic structure is the same one Bengio described in 2003 and the same one Shannon would have recognized in 1948.
The few notable exceptions sit in specialized niches. Encoder-only models still dominate text embeddings (BGE, E5, Sentence-BERT, NV-Embed). Encoder-decoder T5 variants remain common in academic NLP and in some commercial translation systems. Diffusion language models have produced research demonstrations but no production systems. Mamba, RWKV, and other state-space and linear-attention architectures are also unidirectional in the same sense as transformer decoders, just with a different mixing layer. The unidirectional autoregressive frame has, so far, absorbed every serious challenger.
Imagine reading a story one word at a time, and after every word you have to guess what the next word will be. You only know the words you have already read; you cannot peek ahead. The more stories you read this way, the better your guesses get, until eventually you can finish a sentence somebody else started just by knowing how the first part went. A unidirectional language model is a computer trained to play exactly this game on a huge pile of text. When it writes new text, it just keeps playing the game forwards, picking each next word from its own guesses.