A causal language model (CLM), also called an autoregressive language model, is a type of language model that generates text by predicting the next token in a sequence based solely on the tokens that precede it. Because each prediction depends only on the past context (and never on future tokens), the modeling direction is strictly left-to-right, mirroring the causal flow of natural writing and speech. Causal language models form the backbone of virtually all modern large language models, including the GPT family, LLaMA, Claude, and Gemini.
At its core, a causal language model learns a probability distribution over sequences of tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_{t-1}$, the model estimates the conditional probability of the next token:
$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$
The full probability of a sequence is then the product of these conditional probabilities (by the chain rule of probability):
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$
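For instance, for a three-token sequence the factorization expands to

$$P(x_1, x_2, x_3) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1, x_2)$$

so the probability of "the cat sat" is the probability of "the", multiplied by the probability of "cat" given "the", multiplied by the probability of "sat" given "the cat".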
This autoregressive factorization means the model generates text one token at a time, always conditioning on everything produced so far. During inference, each newly generated token is appended to the context and fed back into the model to predict the subsequent token. This cycle repeats until a stopping condition is met, such as reaching a special end-of-sequence token or a maximum length.
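This loop can be sketched in a few lines of Python. Here `model` is a stand-in for any implementation that returns next-token probabilities given the context so far, `token_ids` is a plain list of integer token IDs, and greedy selection is used purely for simplicity (see the decoding strategies discussed later).

```python
def generate(model, token_ids, eos_id, max_new_tokens=50):
    """Minimal sketch of autoregressive (causal) generation.

    `model(token_ids)` is assumed to return a list of next-token
    probabilities over the vocabulary, conditioned on `token_ids`.
    """
    for _ in range(max_new_tokens):
        probs = model(token_ids)                                      # P(x_t | x_1, ..., x_{t-1})
        next_token = max(range(len(probs)), key=probs.__getitem__)   # greedy pick for simplicity
        token_ids = token_ids + [next_token]                          # append and feed back in
        if next_token == eos_id:                                      # stop at end-of-sequence
            break
    return token_ids
```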
Most causal language models are built on the decoder-only variant of the Transformer architecture introduced by Vaswani et al. in 2017. The architecture consists of a stack of identical layers, each containing two main sub-layers: a masked multi-head self-attention mechanism and a position-wise feed-forward network.
Residual connections and layer normalization are applied around each sub-layer. A positional encoding scheme (sinusoidal, learned, or rotary) provides the model with information about token order, since the attention mechanism is otherwise permutation-invariant.
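As a concrete example of one of these options, a minimal NumPy sketch of the original sinusoidal scheme (assuming an even model dimension) is shown below; real implementations typically precompute this matrix once and add it to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as described in Vaswani et al. (2017).

    Each position receives a d_model-dimensional vector of sines and cosines
    at geometrically spaced frequencies (assumes d_model is even).
    """
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                   # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)     # (seq_len, d_model / 2)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return encoding
```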
The defining architectural feature of a causal language model is causal masking (sometimes called the "causal attention mask" or "look-ahead mask"). In standard self-attention, every token can attend to every other token in the sequence. Causal masking restricts this so that position $t$ can only attend to positions $1$ through $t$.
Mechanically, this is implemented by adding a mask matrix to the attention scores before the softmax operation. The mask places negative infinity ($-\infty$) in every position above the diagonal (a strictly upper-triangular pattern). After the softmax, those entries become exactly zero, preventing any information flow from future positions. The resulting attention weight matrix is lower-triangular, ensuring that each token's representation is computed only from the current and previous tokens.
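A minimal NumPy sketch of this masked-softmax step follows; the function name is illustrative, and real implementations operate on batched, multi-head tensors.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal (look-ahead) mask to raw attention scores.

    `scores` is a (seq_len, seq_len) matrix of query-key dot products.
    Positions above the diagonal are set to -inf so that, after softmax,
    each token attends only to itself and earlier positions.
    """
    seq_len = scores.shape[-1]
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)   # strictly upper-triangular
    masked = np.where(mask, -np.inf, scores)
    # Row-wise softmax; the -inf entries become exactly 0.
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```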
This masking serves two purposes: it prevents the model from "cheating" during training by looking at the very tokens it is supposed to predict, and it keeps training consistent with inference, where future tokens simply do not exist yet.
The standard training objective for causal language models is next-token prediction, optimized using cross-entropy loss (equivalently, negative log-likelihood). For a training sequence of length $T$, the loss is:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \ldots, x_{t-1})$$
where $P_{\theta}$ is the model's predicted probability distribution parameterized by weights $\theta$. Minimizing this loss is equivalent to maximizing the likelihood of the training data under the model.
During training, the model receives a full sequence and, thanks to causal masking, computes the loss for every position in parallel. This technique, called teacher forcing, is far more efficient than generating tokens one at a time, because the ground-truth tokens (rather than the model's own predictions) are used as inputs for each step.
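In a typical PyTorch-style implementation, this parallel loss is obtained by shifting the targets one position relative to the logits. The sketch below assumes the model has already produced logits under a causal mask.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    """Next-token prediction loss, computed for all positions in parallel.

    logits:    (batch, seq_len, vocab_size) model outputs under causal masking
    token_ids: (batch, seq_len) the input sequence itself serves as the target,
               shifted by one position (teacher forcing).
    """
    # The prediction at position t is scored against the token at position t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```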
To stabilize training, prevent overfitting, and improve generalization, several techniques are commonly applied:
| Technique | Description |
|---|---|
| Dropout | Randomly zeroes a fraction of activations during training to prevent co-adaptation |
| Weight decay | Adds an L2 penalty on model parameters to the loss function |
| Learning rate scheduling | Uses warmup followed by cosine or linear decay of the learning rate |
| Gradient clipping | Caps gradient magnitudes to stabilize training |
| Layer normalization | Normalizes activations within each layer to reduce internal covariate shift |
Modern causal language models are typically trained using the AdamW optimizer with mixed-precision (FP16 or BF16) arithmetic to reduce memory usage and accelerate computation.
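A condensed PyTorch-style sketch of a single training step shows how several of the techniques above fit together: weight decay via AdamW, gradient clipping, a cosine learning rate schedule, and FP16 mixed precision. The hyperparameters are illustrative, a warmup phase would normally precede the cosine decay, and a GPU is assumed for the mixed-precision parts.

```python
import torch

# Illustrative setup; real values depend on model and data scale.
model = torch.nn.Linear(1024, 1024)  # stand-in for a transformer language model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
scaler = torch.cuda.amp.GradScaler()  # loss scaling for FP16 mixed precision

def training_step(batch, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                    # forward pass in reduced precision
        loss = loss_fn(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()                                   # learning rate schedule
    return loss.item()
```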
The GPT (Generative Pre-trained Transformer) series from OpenAI is the most well-known line of causal language models. Each generation has dramatically scaled up in size and capability.
| Model | Year | Parameters | Training Data | Key Contribution |
|---|---|---|---|---|
| GPT-1 | 2018 | 117 million | BooksCorpus (~800M words) | Demonstrated generative pre-training followed by discriminative fine-tuning |
| GPT-2 | 2019 | 1.5 billion | WebText (40 GB) | Showed unsupervised multitask learning; zero-shot task transfer |
| GPT-3 | 2020 | 175 billion | ~300 billion tokens | Introduced in-context few-shot learning without fine-tuning |
| GPT-4 | 2023 | ~1.8 trillion (estimated) | Undisclosed | Multimodal capabilities (text and image input); strong reasoning |
Beyond GPT, many other causal language models have been developed, including LLaMA (Meta), PaLM (Google), Falcon (TII), Mistral (Mistral AI), and Claude (Anthropic).
Causal language models are often compared with masked language models (MLMs) such as BERT. The two paradigms differ fundamentally in their training objectives, attention patterns, and downstream strengths.
| Feature | Causal Language Model (e.g., GPT) | Masked Language Model (e.g., BERT) |
|---|---|---|
| Attention direction | Unidirectional (left-to-right) | Bidirectional (full context) |
| Training objective | Predict the next token | Predict randomly masked tokens |
| Context used | Only preceding tokens | Both preceding and following tokens |
| Primary strength | Text generation | Text understanding and classification |
| Architecture | Decoder-only transformer | Encoder-only transformer |
| Example use cases | Chatbots, code generation, story writing | Sentiment analysis, NER, question answering |
Because masked language models have access to bidirectional context, they tend to produce richer representations for understanding tasks. However, they are not naturally suited for text generation, since they cannot predict text sequentially. Causal language models, by contrast, are inherently generative and have become the dominant paradigm for building conversational AI systems, code assistants, and general-purpose large language models.
A prefix language model (PrefixLM) occupies a middle ground between causal and masked language models. In a PrefixLM, an initial "prefix" portion of the input is processed with bidirectional (non-causal) attention, while the remaining tokens are generated autoregressively with causal masking.
This design is useful for tasks where a fixed input (the prefix) must be fully understood before generating an output. For example, in a translation task, the source sentence can serve as the prefix with full bidirectional attention, and the target sentence is generated causally. Research has shown that PrefixLMs can outperform pure causal language models on in-context learning tasks because they allow demonstration examples within the prefix to attend to one another freely, rather than being restricted by left-to-right masking.
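A minimal NumPy sketch of building such a hybrid mask, with `True` meaning "position i may attend to position j" (the function name is illustrative):

```python
import numpy as np

def prefix_lm_mask(prefix_len, seq_len):
    """Attention mask for a prefix language model.

    Returns a boolean (seq_len, seq_len) matrix. Prefix tokens attend to the
    whole prefix bidirectionally; the remaining tokens attend causally.
    """
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # standard causal mask
    allowed[:, :prefix_len] = True   # every position may see the full prefix
    return allowed
```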
Models that use or have explored PrefixLM objectives include the prefix-LM architectural variant studied in the T5 paper, UL2, and certain configurations of PaLM.
A causal language model produces a probability distribution over the vocabulary at each generation step. The method used to select the next token from this distribution is called a decoding strategy (or sampling strategy). Different strategies trade off between output quality, diversity, and computational cost.
| Strategy | How It Works | Characteristics |
|---|---|---|
| Greedy decoding | Always selects the highest-probability token | Fast but often produces repetitive, generic text |
| Beam search | Maintains the top-$k$ most probable partial sequences and selects the best overall | Better than greedy for short outputs; can still be repetitive |
| Top-$k$ sampling | Samples from the $k$ most probable tokens | Introduces diversity; the fixed $k$ may be too narrow or too wide |
| Top-$p$ (nucleus) sampling | Samples from the smallest set of tokens whose cumulative probability exceeds $p$ | Dynamically adapts the candidate set; widely used in practice |
| Temperature scaling | Divides logits by a temperature parameter $\tau$ before softmax | $\tau < 1$ sharpens the distribution (more deterministic); $\tau > 1$ flattens it (more random) |
| Min-$p$ sampling | Filters tokens below a minimum probability relative to the top token | A recent alternative that combines benefits of top-$k$ and top-$p$ |
In practice, modern systems often combine several of these strategies. For instance, a chatbot might use top-$p$ sampling with a moderate temperature and a repetition penalty to balance coherence and creativity.
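Such a combination can be sketched compactly. The NumPy example below applies temperature scaling followed by nucleus (top-$p$) sampling to a vector of raw logits; the parameter values are illustrative only.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature-scaled nucleus (top-p) sampling over next-token logits."""
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature    # temperature scaling
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                                      # softmax
    order = np.argsort(probs)[::-1]                           # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1      # smallest set whose mass exceeds top_p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()     # renormalize within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))
```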
One of the most significant discoveries about large causal language models is their ability to perform tasks without explicit fine-tuning. The GPT-3 paper (Brown et al., 2020) demonstrated three paradigms: zero-shot (the task is described in the prompt with no examples), one-shot (a single worked example is provided), and few-shot (a handful of examples are provided before the query).
As model scale increases, few-shot performance improves much more rapidly than zero-shot performance, suggesting that larger models become better "in-context learners." This emergent capability has been one of the primary drivers of interest in scaling up causal language models and has led to the widespread use of prompt engineering as an alternative to traditional fine-tuning.
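As an illustration, a hypothetical few-shot prompt for sentiment classification might be assembled as plain text and handed to the model to complete; the reviews, labels, and formatting are prompt-engineering choices, not anything fixed by the model.

```python
# Hypothetical few-shot prompt for sentiment classification.
few_shot_prompt = """Review: The plot was dull and the acting was worse.
Sentiment: negative

Review: A delightful surprise from start to finish.
Sentiment: positive

Review: I'd happily watch it again tomorrow.
Sentiment:"""
# The causal LM simply continues the text; the continuation (here, "positive")
# is read off as the model's answer.
```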
Causal language models exhibit remarkably predictable scaling laws. Research by Kaplan et al. (2020) at OpenAI showed that a model's cross-entropy loss on held-out text follows a power-law relationship with three variables: the number of model parameters, the size of the training dataset, and the amount of compute used for training.
The loss decreases smoothly and predictably as any of these quantities increases, with trends spanning more than seven orders of magnitude.
In 2022, DeepMind's Chinchilla study (Hoffmann et al.) refined these findings, demonstrating that many existing models were over-parameterized relative to their training data. The Chinchilla-optimal ratio suggests approximately 20 training tokens per parameter for a given compute budget. This insight shifted the field's focus toward training smaller models on more data (for example, LLaMA 7B was trained on 1 trillion tokens, far exceeding the Chinchilla-optimal ratio for its size).
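A quick back-of-the-envelope check of that ratio:

```python
def chinchilla_optimal_tokens(num_parameters, tokens_per_parameter=20):
    """Rough Chinchilla-style estimate of training tokens for a given model size."""
    return num_parameters * tokens_per_parameter

# A 7-billion-parameter model is "Chinchilla-optimal" at roughly 140 billion tokens,
# so the ~1 trillion tokens used for LLaMA 7B far exceed that ratio.
print(f"{chinchilla_optimal_tokens(7e9):.2e}")  # 1.40e+11
```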
These scaling laws have become essential planning tools for organizations building large causal language models, allowing them to predict downstream performance and allocate compute budgets efficiently.
Since the release of GPT-3 in 2020, the causal language modeling paradigm has become the dominant approach for building general-purpose AI systems. Several factors drive this dominance: the next-token objective requires no labeled data and scales predictably with parameters, data, and compute; sufficiently large models acquire in-context learning abilities that let a single model handle many tasks through prompting alone; and the autoregressive formulation is inherently generative, which suits conversational and creative applications directly.
Today, nearly every leading large language model, whether used for chatbots, search, coding assistance, or scientific research, is built on the causal language modeling framework.
Causal language models have been deployed across a wide range of natural language processing tasks, including conversational chatbots, code generation and coding assistance, story and creative writing, question answering, search, and scientific research assistance.
Imagine you are playing a word game where you have to guess the next word in a sentence. Your friend says "The cat sat on the..." and you guess "mat" because that makes the most sense based on the words you already heard. A causal language model works the same way. It reads words from left to right, one at a time, and tries to guess which word comes next. It never peeks ahead at words it has not seen yet. The more sentences it practices with, the better it gets at guessing. This is how computers learn to write stories, answer questions, and even have conversations.