Causal Language Model
Last reviewed
May 9, 2026
Sources
23 citations
Review status
Source-backed
Revision
v3 · 6,277 words
A causal language model (CLM), also called an autoregressive language model or a decoder-only language model, is a type of language model that generates text by predicting the next token in a sequence based solely on the tokens that precede it. Because each prediction depends only on the past context (and never on future tokens), the modeling direction is strictly left-to-right, mirroring the causal flow of natural writing and speech. Causal language models form the backbone of virtually all modern large language models, including the GPT family, LLaMA, Claude, Gemini, and Mistral.
The term "causal" refers to the directional structure of the model rather than to any notion of causality in the philosophical or statistical sense. In a causal language model, information at position $t$ depends only on positions $1$ through $t-1$, just as a cause must precede an effect in time. This restriction is enforced architecturally through a triangular attention mask, which is why these models are sometimes called "masked attention" or "look-ahead masked" transformers. Despite the simplicity of the underlying objective (predict the next token), causal language models trained at scale have proven capable of translation, summarization, code generation, multi-step reasoning, and open-ended dialogue, making the paradigm the workhorse of contemporary generative AI.
At its core, a causal language model learns a probability distribution over sequences of tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_{t-1}$, the model estimates the conditional probability of the next token:
$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$
The full probability of a sequence is then the product of these conditional probabilities (by the chain rule of probability):
$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$
This autoregressive factorization means the model generates text one token at a time, always conditioning on everything produced so far. During inference, each newly generated token is appended to the context and fed back into the model to predict the subsequent token. This cycle repeats until a stopping condition is met, such as reaching a special end-of-sequence token, a stop string supplied by the application, or a maximum length set by the caller.
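The factorization and the generation loop can be sketched with a toy model. Everything below is illustrative: `TOY_MODEL` is a hypothetical hand-written table that conditions only on the previous token, whereas a real causal language model conditions on the entire prefix.

```python
import math
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous
# token only (a real causal LM would condition on the whole prefix).
TOY_MODEL = {
    "<bos>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "<eos>": 0.2},
    "sat": {"<eos>": 1.0},
    "mat": {"<eos>": 1.0},
}

def next_token_probs(prefix):
    """Return P(x_t | x_1..x_{t-1}) under the toy model."""
    return TOY_MODEL[prefix[-1]]

def sequence_log_prob(tokens):
    """Chain rule: sum the log conditional probability of each token."""
    prefix, total = ["<bos>"], 0.0
    for tok in tokens:
        total += math.log(next_token_probs(prefix)[tok])
        prefix.append(tok)
    return total

def generate(max_new_tokens=10, seed=0):
    """Sample one token at a time, feeding each sample back as context."""
    rng = random.Random(seed)
    prefix = ["<bos>"]
    for _ in range(max_new_tokens):
        probs = next_token_probs(prefix)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        prefix.append(tok)
        if tok == "<eos>":  # stopping condition: end-of-sequence token
            break
    return prefix[1:]

print(math.exp(sequence_log_prob(["the", "cat", "sat", "<eos>"])))
print(generate())
```

The sequence probability is the product 0.9 × 0.7 × 0.8 × 1.0, computed in log space to avoid underflow on long sequences.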
The choice of factorization is not arbitrary. Although the chain rule of probability holds for any ordering of variables, left-to-right factorization aligns with how humans produce language and allows efficient training using dense supervision. Every position in the input acts simultaneously as both an input and a target, so a single forward pass over a sequence of length $T$ produces $T$ training signals. Right-to-left or arbitrary-order factorizations are mathematically valid but rarely used because they break this convenient structure and complicate downstream generation.
Most causal language models are built on the decoder-only variant of the transformer architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The architecture consists of a stack of identical layers, each containing two main sub-layers: a masked (causal) multi-head self-attention sub-layer and a position-wise feed-forward network.
Residual connections and layer normalization are applied around each sub-layer. A positional encoding scheme (sinusoidal, learned, or rotary) provides the model with information about token order, since the attention mechanism is otherwise permutation-invariant. Recent models such as LLaMA and Mistral use rotary positional embeddings (RoPE), which encode position by rotating query and key vectors in pairs, allowing the model to extrapolate (with some loss) beyond training context lengths.
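A minimal sketch of the rotary idea, assuming the interleaved pair layout and a frequency base of 10000 (both are assumptions here; implementations differ in how they pair dimensions). It shows the property that makes RoPE attractive: the query-key dot product depends only on the offset between the two positions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # hypothetical query vector
k = [0.2, -1.0, 0.7, 0.1]   # hypothetical key vector

# Both pairs of positions are 3 apart, so the two scores agree.
print(dot(rope(q, 5), rope(k, 2)))
print(dot(rope(q, 105), rope(k, 102)))
```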
The original transformer paper described an encoder-decoder model for machine translation in which an encoder built bidirectional representations of the source sentence and a decoder generated the target sentence one token at a time. The decoder-only causal language model strips away the encoder, leaving only the autoregressive component, and feeds the entire input directly into the stacked decoder layers. This simpler design has come to dominate the field because it scales cleanly and unifies understanding and generation into a single network.
The defining architectural feature of a causal language model is causal masking (also called the "causal attention mask," "look-ahead mask," or "autoregressive mask"). In standard self-attention, every token can attend to every other token in the sequence. Causal masking restricts this so that position $t$ can only attend to positions $1$ through $t$.
Mechanically, the mask is implemented by adding a mask matrix to the attention scores before the softmax operation. The mask is zero on and below the diagonal and negative infinity ($-\infty$) strictly above it. When the masked scores pass through softmax, the $-\infty$ entries become zero weights, preventing any information flow from future positions. The resulting attention weight matrix is lower-triangular, ensuring that each token's representation is computed only from the current and previous tokens.
For a sequence of length $T$, the unmasked attention weight matrix would have shape $T \times T$ with all entries potentially nonzero. After causal masking, only the lower triangle (including the diagonal) contains nonzero weights, giving the matrix a stair-step appearance:
$$\begin{bmatrix} a_{11} & 0 & 0 & 0 \\ a_{21} & a_{22} & 0 & 0 \\ a_{31} & a_{32} & a_{33} & 0 \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix}$$
This masking serves two purposes:
In practice, modern transformer implementations skip the explicit mask matrix when possible. Libraries such as FlashAttention compute causal attention directly using fused kernels that simply do not load the future tokens, which avoids the wasted memory of materialising a triangular mask and yields significant speedups on long sequences.
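The explicit additive-mask formulation can be sketched in a few lines of plain Python. This is an illustration of the mechanism only, with uniform made-up scores; it is not how fused kernels are written.

```python
import math

def causal_mask(T):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

def softmax(row):
    m = max(row)  # the diagonal entry is always finite, so m is finite
    exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the mask to the raw scores, then apply a row-wise softmax."""
    T = len(scores)
    mask = causal_mask(T)
    return [softmax([scores[i][j] + mask[i][j] for j in range(T)]) for i in range(T)]

scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, for illustration
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row $t$ spreads its weight evenly over the first $t$ positions and assigns exactly zero to every later one.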
Before a causal language model can process text, raw strings must be converted into a sequence of integer token IDs. Modern systems use subword tokenization algorithms, most commonly byte-pair encoding (BPE), WordPiece, or the unigram language model (the first and last are often applied through the SentencePiece library), which strike a balance between vocabulary size and the ability to represent rare words. A vocabulary of 32,000 to 200,000 subword units is typical for production models. Token IDs index into a learned embedding matrix that maps each ID to a dense vector of size $d_{\text{model}}$, typically 768 to 12,288 dimensions depending on model scale. The embedding matrix is often tied to the output projection ("weight tying"), which reduces parameter count and slightly improves perplexity.
After the final transformer layer, a linear projection (the language modeling head) maps each hidden state to a vector of vocabulary-sized logits. A softmax over these logits produces a probability distribution over the next token. Some systems compute the softmax only over a subset of the vocabulary during inference (a technique known as "vocabulary pruning") or replace it with sampled softmax variants during training to reduce compute, but the standard recipe applies a full softmax at every position.
The standard training objective for causal language models is next-token prediction, optimized using cross-entropy loss (equivalently, negative log-likelihood). For a training sequence of length $T$, the loss is:
$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \ldots, x_{t-1})$$
where $P_{\theta}$ is the model's predicted probability distribution parameterized by weights $\theta$. Minimizing this loss is equivalent to maximizing the likelihood of the training data under the model. The mean cross-entropy across a held-out corpus is often reported in two equivalent forms: the loss itself in nats (or bits, when log base 2 is used) and the perplexity, defined as $\exp(\mathcal{L})$. A model with a lower perplexity assigns higher probability to the held-out text and is generally considered a stronger language model.
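As a small worked example of the loss and its two reporting forms, the sketch below uses hypothetical probabilities that a model assigned to the correct next token at four positions.

```python
import math

def cross_entropy(target_probs):
    """Mean negative log-likelihood, in nats, of the observed tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(target_probs):
    """Perplexity is the exponential of the mean loss in nats."""
    return math.exp(cross_entropy(target_probs))

# Hypothetical probabilities assigned to each correct next token.
probs = [0.5, 0.25, 0.125, 0.5]
loss_nats = cross_entropy(probs)
loss_bits = loss_nats / math.log(2)  # change of base: nats to bits

print(loss_nats, loss_bits, perplexity(probs))
```

Here the mean loss is 1.75 bits per token, so the perplexity is $2^{1.75}$, roughly 3.36: the model is about as uncertain as if it were choosing uniformly among a little over three tokens at each step.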
During training, the model receives a full sequence and, thanks to causal masking, computes the loss for every position in parallel. This technique, called teacher forcing, is far more efficient than generating tokens one at a time, because the ground-truth tokens (rather than the model's own predictions) are used as inputs for each step. Teacher forcing introduces a known mismatch between training and inference (called "exposure bias"), since at inference time the model conditions on its own (possibly imperfect) outputs rather than ground-truth tokens. In practice this mismatch is small enough at scale that pure teacher forcing remains the default training recipe.
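The input/target alignment behind teacher forcing can be sketched as a one-position shift. The token IDs are made up, and `visible_context` is a hypothetical helper that only illustrates what the causal mask lets each position see.

```python
def teacher_forcing_pairs(token_ids):
    """Inputs are tokens 0..T-2; targets are the same tokens shifted by one."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return inputs, targets

def visible_context(inputs, position):
    """Under a causal mask, the prediction at `position` sees inputs[0..position]."""
    return inputs[: position + 1]

ids = [101, 7, 42, 9, 102]  # hypothetical token IDs for one training sequence
inputs, targets = teacher_forcing_pairs(ids)

# Every position yields one (ground-truth context -> next token) training signal.
for t, target in enumerate(targets):
    print(visible_context(inputs, t), "->", target)
```

All of these context/target pairs are scored in one forward pass; the contexts are always the ground-truth tokens, never the model's own samples.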
A recent line of research replaces single-token next-token prediction with multi-token prediction (MTP), in which the model is trained to predict the next several tokens simultaneously using parallel output heads. DeepSeek used a variant of MTP in DeepSeek-V3 (2024) to densify the training signal and improve sample efficiency. Multi-token prediction also pairs naturally with speculative decoding at inference time, where the additional heads can be reused as a built-in draft model.
Several techniques are commonly applied to regularize training, keep optimization stable, and use hardware efficiently:
| Technique | Description |
|---|---|
| Dropout | Randomly zeroes a fraction of activations during training to prevent co-adaptation |
| Weight decay | Adds an L2 penalty on model parameters to the loss function |
| Learning rate scheduling | Uses warmup followed by cosine or linear decay of the learning rate |
| Gradient clipping | Caps gradient magnitudes to stabilize training |
| Layer normalization | Normalizes activations within each layer to reduce internal covariate shift |
| Mixed precision | Runs most computation in BF16 or FP16, usually alongside FP32 master weights, to cut memory use and raise throughput |
| Sequence packing | Concatenates multiple short documents into one long training sequence to avoid padding waste |
Modern causal language models are typically trained using the AdamW optimizer with mixed-precision (FP16 or BF16) arithmetic to reduce memory usage and accelerate computation. At very large scales, additional tricks become necessary: gradient checkpointing trades recomputation for memory, ZeRO sharding partitions optimizer state across devices, and pipeline parallelism splits the model depth-wise across nodes. Training a 100-billion-parameter causal language model from scratch typically requires hundreds of thousands to millions of GPU- or TPU-hours and a carefully tuned data-parallel, tensor-parallel, and pipeline-parallel topology.
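Two of the techniques in the table, learning rate scheduling and gradient clipping, are simple enough to sketch directly. The hyperparameter values below are illustrative assumptions, not recommendations, and clipping by the global norm is one common variant.

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step))
print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```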
Language modeling predates deep learning by decades. Early n-gram models estimated $P(x_t \mid x_{t-n+1}, \ldots, x_{t-1})$ by counting word occurrences in a training corpus, with smoothing techniques (such as Kneser-Ney) to handle unseen contexts. n-gram models are causal in the trivial sense that they only condition on prior tokens, but they are limited to short, fixed-length contexts and cannot share statistical strength across semantically related contexts.
In 2003, Bengio et al. proposed the first widely cited neural probabilistic language model, which embedded words into a continuous vector space and used a feed-forward neural network to predict the next word. This work introduced the idea that distributed representations could overcome the data sparsity problem that plagued n-gram models.
The breakthrough that enabled long-range causal language modeling was the recurrent neural network (RNN). Mikolov et al. (2010) showed that RNN language models substantially outperformed traditional n-gram baselines on speech recognition benchmarks. RNN language models are inherently causal because their hidden state at time $t$ is computed only from prior hidden states and the current input. The introduction of long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs, Cho et al., 2014) further improved the ability of recurrent models to capture long-range dependencies.
In 2017, Vaswani et al. introduced the transformer in "Attention Is All You Need." The original transformer was an encoder-decoder model for machine translation, but the decoder component (with causal self-attention) provided a template for a new class of language models. Unlike RNNs, transformers process every token position in parallel, which makes them dramatically more efficient on modern accelerators. Causal masking provides the autoregressive structure that RNNs got from their recurrence.
In June 2018, OpenAI published "Improving Language Understanding by Generative Pre-Training" (Radford et al.), which introduced the first Generative Pre-trained Transformer, retroactively named GPT-1. The model was a 117-million-parameter decoder-only transformer pre-trained on the BooksCorpus dataset using the standard next-token-prediction objective, then fine-tuned for downstream tasks. The combination of unsupervised pre-training plus supervised fine-tuning achieved state-of-the-art on many natural language understanding benchmarks. Although BERT (Devlin et al., 2018) launched four months later and dominated discriminative benchmarks for several years, the GPT line continued to push generative pre-training and would eventually carry the field.
GPT-2, released in February 2019, scaled the GPT-1 architecture to 1.5 billion parameters and trained on 40 GB of web text scraped from outbound Reddit links (the WebText dataset). The model demonstrated strong zero-shot performance on tasks it had never been explicitly trained for, sparking a wave of interest in unsupervised multitask learning. OpenAI's controversial decision to delay the release of the largest GPT-2 weights (citing misuse concerns) marked the beginning of public debate about the safety of releasing large language models.
GPT-3, introduced in May 2020, took the same architecture to 175 billion parameters and trained on roughly 300 billion tokens drawn from Common Crawl, WebText2, two book corpora, and English Wikipedia. The headline finding of the GPT-3 paper (Brown et al., 2020) was in-context learning: with no gradient updates, the model could perform new tasks given only a few demonstrations in the prompt. This shifted the field's focus from supervised fine-tuning to prompt-based interaction.
GPT-3.5 (the InstructGPT family, late 2022) added reinforcement learning from human feedback (RLHF) to align the base model with human preferences. The chat-tuned variant powered the public launch of ChatGPT in November 2022, which reached an estimated 100 million users within two months and triggered the modern wave of generative AI investment.
GPT-4, released in March 2023, introduced multimodal input (images and text), substantially better reasoning, and longer context windows. OpenAI did not disclose architectural details, but external sources have estimated parameter counts in the trillions and a mixture-of-experts structure. GPT-4o (May 2024) added native voice and image generation; GPT-4.5 and GPT-5 continued the line through the mid-2020s.
The causal language modeling paradigm spread quickly to other organizations. Notable open-weight families include:
| Model family | Organization | First release | Defining contribution |
|---|---|---|---|
| LLaMA | Meta | 2023 | Trained 7B-65B models far past Chinchilla-optimal token counts; released openly |
| Mistral | Mistral AI | 2023 | Sliding-window attention and grouped-query attention in a 7B model |
| Mixtral | Mistral AI | 2023 | Open-weight sparse mixture of experts |
| Falcon | TII | 2023 | Open 40B/180B models trained on the RefinedWeb dataset |
| Qwen | Alibaba | 2023 | Strong multilingual and Chinese-language performance |
| DeepSeek | DeepSeek | 2023 | Multi-token prediction and aggressive MoE scaling |
| Gemma | Google | 2024 | Lightweight open variants of Gemini-class architectures |
Proprietary causal language models such as Claude (Anthropic), Gemini (Google DeepMind), and Grok (xAI) follow the same decoder-only template, with proprietary modifications to attention, training data, and post-training pipelines.
The GPT (Generative Pre-trained Transformer) series from OpenAI is the most well-known line of causal language models. Each generation has dramatically scaled up in size and capability.
| Model | Year | Parameters | Training Data | Key Contribution |
|---|---|---|---|---|
| GPT-1 | 2018 | 117 million | BooksCorpus (~800M words) | Demonstrated generative pre-training followed by discriminative fine-tuning |
| GPT-2 | 2019 | 1.5 billion | WebText (40 GB) | Showed unsupervised multitask learning; zero-shot task transfer |
| GPT-3 | 2020 | 175 billion | ~300 billion tokens | Introduced in-context few-shot learning without fine-tuning |
| GPT-3.5 / InstructGPT | 2022 | ~175 billion | ~300B tokens plus RLHF data | Added RLHF for instruction following; powered ChatGPT |
| GPT-4 | 2023 | Undisclosed | Undisclosed | Multimodal capabilities (text and image input); strong reasoning |
| GPT-4o | 2024 | Undisclosed | Undisclosed | Natively multimodal with audio and image generation |
| GPT-5 | 2025 | Undisclosed | Undisclosed | Unified reasoning and chat in a single model |
Beyond GPT, many other causal language models have been developed, including LLaMA (Meta), PaLM (Google), Falcon (TII), Mistral (Mistral AI), and Claude (Anthropic).
Causal language models are often compared with masked language models (MLMs) such as BERT and with prefix language models (PrefixLMs) such as T5. The three paradigms differ fundamentally in their training objectives, attention patterns, and downstream strengths.
| Feature | Causal LM (e.g., GPT) | Masked LM (e.g., BERT) | Prefix LM (e.g., T5, UL2) |
|---|---|---|---|
| Attention pattern | Lower-triangular (causal) | Full bidirectional | Bidirectional on prefix, causal on target |
| Training objective | Predict the next token | Predict randomly masked tokens | Predict the target span given the prefix |
| Context used | Only preceding tokens | Both preceding and following tokens | Bidirectional within prefix; causal within target |
| Architecture | Decoder-only transformer | Encoder-only transformer | Encoder-decoder, or unified prefix masking |
| Primary strength | Open-ended text generation | Text understanding and classification | Conditional generation (translation, summarization) |
| Example use cases | Chatbots, code generation, story writing | Sentiment analysis, NER, retrieval | Translation, structured generation |
Because masked language models have access to bidirectional context, they tend to produce richer representations for understanding tasks. However, they are not naturally suited for text generation, since they cannot predict text sequentially. Causal language models, by contrast, are inherently generative and have become the dominant paradigm for building conversational AI systems, code assistants, and general-purpose large language models.
Through roughly 2019 to 2021, encoder-decoder models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2019) were widely viewed as state-of-the-art for conditional generation tasks. By 2023, however, decoder-only causal language models had largely displaced them. Several factors drove this shift:
A prefix language model (PrefixLM) occupies a middle ground between causal and masked language models. In a PrefixLM, an initial "prefix" portion of the input is processed with bidirectional (non-causal) attention, while the remaining tokens are generated autoregressively with causal masking.
This design is useful for tasks where a fixed input (the prefix) must be fully understood before generating an output. For example, in a translation task, the source sentence can serve as the prefix with full bidirectional attention, and the target sentence is generated causally. Research has shown that PrefixLMs can outperform pure causal language models on in-context learning tasks because they allow demonstration examples within the prefix to attend to one another freely, rather than being restricted by left-to-right masking.
Models that use or have explored PrefixLM objectives include T5, UL2 (Tay et al., 2022), and certain configurations of PaLM.
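The PrefixLM attention pattern differs from the causal one only inside the prefix block. A minimal sketch, using a boolean "may attend" matrix (the function name and representation are illustrative choices):

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when position i may attend to position j.

    Prefix positions see the whole prefix (bidirectional); target positions
    see the prefix plus earlier target positions (causal).
    """
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]

for row in prefix_lm_mask(seq_len=4, prefix_len=2):
    print(["x" if allowed else "." for allowed in row])
```

Setting `prefix_len=0` recovers the ordinary lower-triangular causal mask.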
A causal language model produces a probability distribution over the vocabulary at each generation step. The method used to select the next token from this distribution is called a decoding strategy (or sampling strategy). Different strategies trade off between output quality, diversity, and computational cost.
| Strategy | How it works | Characteristics |
|---|---|---|
| Greedy decoding | Always selects the highest-probability token | Fast but often produces repetitive, generic text |
| Beam search | Maintains the top-$k$ most probable partial sequences and selects the best overall | Better than greedy for short outputs; can still be repetitive |
| Top-$k$ sampling | Samples from the $k$ most probable tokens | Introduces diversity; the fixed $k$ may be too narrow or too wide |
| Top-$p$ (nucleus) sampling | Samples from the smallest set of tokens whose cumulative probability exceeds $p$ | Dynamically adapts the candidate set; widely used in practice |
| Temperature scaling | Divides logits by a temperature parameter $\tau$ before softmax | $\tau < 1$ sharpens the distribution; $\tau > 1$ flattens it |
| Min-$p$ sampling | Filters tokens below a minimum probability relative to the top token | A recent alternative that combines benefits of top-$k$ and top-$p$ |
| Mirostat | Targets a specified perplexity by dynamically adjusting truncation | Aims for stable, surprise-controlled output |
| Contrastive search | Penalizes tokens whose representations are similar to recent context | Reduces degenerate repetition without aggressive truncation |
| Typical sampling | Selects tokens whose information content is close to the conditional entropy | Aims for "locally typical" generations |
In practice, modern systems often combine several of these strategies. For instance, a chatbot might use top-$p$ sampling with a moderate temperature and a repetition penalty to balance coherence and creativity. Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration," provided the canonical justification for nucleus sampling by showing that high-likelihood beam-search outputs collapse into repetitive loops while truncated stochastic sampling avoids this failure mode.
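Several rows of the table can be sketched as small filters over a probability vector. The logits are made up, and the nucleus filter here stops once the cumulative mass reaches `p` (implementations differ on whether the boundary is inclusive).

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature scaling: divide logits by the temperature before softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over a 4-token vocabulary
greedy = max(range(len(logits)), key=lambda i: logits[i])
nucleus = top_p_filter(softmax(logits, temperature=0.7), p=0.9)
print(greedy, nucleus, sample(nucleus, random.Random(0)))
```

Chaining the functions, as in the last lines, mirrors how a serving stack might combine a moderate temperature with nucleus truncation before drawing the sample.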
Decoding is often combined with auxiliary penalties to shape outputs further. Repetition penalties rescale the logit of any token already in the context by a fixed factor (dividing positive logits and, in common implementations, multiplying negative ones) so that the token becomes less likely, discouraging verbatim loops. Frequency and presence penalties, exposed by the OpenAI API, reduce logits proportionally to how often or whether a token has appeared. Logit bias lets a developer add or subtract a fixed amount to specific token IDs, which is useful for forbidding certain words or steering toward a particular format. Constrained decoding (using grammars or regular expressions) restricts the model to outputs that conform to a schema, which is widely used for structured tool use and JSON generation.
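A minimal sketch of how such penalties might be applied to a logit vector before sampling. `apply_penalties` and its parameter names are hypothetical; the handling of negative logits in the repetition penalty (multiplying rather than dividing) is an assumption based on common open-source practice.

```python
from collections import Counter

def apply_penalties(logits, context_ids, repetition=1.0, frequency=0.0,
                    presence=0.0, logit_bias=None):
    """Return a penalized copy of `logits` (indexed by token ID)."""
    counts = Counter(context_ids)
    out = list(logits)
    for tok, count in counts.items():
        # Repetition penalty: push a seen token down whatever its logit's sign.
        out[tok] = out[tok] / repetition if out[tok] > 0 else out[tok] * repetition
        # Frequency penalty grows with the count; presence penalty is a flat charge.
        out[tok] -= frequency * count + presence
    # Logit bias: fixed additive offsets for chosen token IDs.
    for tok, bias in (logit_bias or {}).items():
        out[tok] += bias
    return out

logits = [2.0, -1.0, 0.5]   # hypothetical 3-token vocabulary
context = [0, 0, 1]         # token 0 appeared twice, token 1 once
print(apply_penalties(logits, context, repetition=2.0))
print(apply_penalties(logits, context, frequency=0.5, presence=0.1,
                      logit_bias={2: -100.0}))
```

A large negative bias, as on token 2 in the second call, effectively forbids that token.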
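Serving a causal language model in production has two distinct computational phases: prefill, in which the entire prompt is processed in a single parallel forward pass, and decode, in which output tokens are generated one at a time.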
The practical consequence is that prefill latency scales with prompt length while per-token decode latency is roughly constant for a given model and hardware. Time-to-first-token (TTFT) is dominated by prefill; tokens-per-second after the first token reflects decode throughput.
During decode, each transformer layer computes a query, key, and value only for the newest token and attends over the keys and values of all earlier tokens. To avoid recomputing those, modern inference engines maintain a KV cache: a per-layer buffer that stores key and value tensors for every position processed so far. Memory use grows linearly with context length, and at long contexts the KV cache can dwarf the model weights themselves. A 70-billion-parameter model with 80 layers, a head dimension of 128, and 8 key-value heads (grouped-query attention over 64 query heads) consumes roughly 320 KB of KV cache per token in BF16, so a 100,000-token context occupies about 32 GB just for the cache. With a separate key and value for each of the 64 heads, the figure would be eight times larger, about 2.5 MB per token.
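The per-token figure can be checked with a little arithmetic. The sketch assumes the cache holds one key and one value vector per layer and per key-value head at 2 bytes per value (BF16). The 8-KV-head configuration is an assumption (grouped-query attention) under which the result comes out at 320 KiB; caching all 64 heads separately gives 2.5 MiB per token.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) for every layer and KV head, for one token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed configurations for an 80-layer model with head dimension 128.
gqa = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
mha = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
context = 100_000

print(gqa / 1024, "KiB per token with 8 KV heads")
print(mha / 1024**2, "MiB per token with 64 KV heads")
print(gqa * context / 1e9, "GB for a 100,000-token context (8 KV heads)")
```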
Several architectural tricks reduce KV cache pressure:
Speculative decoding (Leviathan et al., 2022, arXiv 2211.17192) accelerates causal language model inference by running a small "draft" model and a large "target" model in tandem. The draft model proposes several tokens autoregressively, then the target model verifies them all in a single parallel forward pass. Tokens that match the target distribution (according to a rejection-sampling criterion) are accepted; the first mismatch is replaced with a sample from the corrected distribution and the process repeats. The output distribution is provably identical to greedy or sampled decoding from the target model alone, but throughput can roughly double because the target's expensive forward pass produces several output tokens at once.
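The accept/reject rule for a single drafted token can be sketched as follows. This is a simplified single-token illustration with made-up distributions, not a full multi-token implementation; `empirical_output` is a hypothetical helper that checks by simulation that the outputs follow the target distribution.

```python
import random

def verify_draft_token(token, p_target, q_draft, rng):
    """One accept/reject step of speculative sampling.

    `token` was sampled from q_draft. Returns (accepted, output_token).
    """
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return True, token
    # Rejected: resample from the residual max(0, p - q); rng.choices
    # normalizes the weights, so no explicit renormalization is needed.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return False, rng.choices(range(len(residual)), weights=residual)[0]

def empirical_output(p_target, q_draft, trials=20000, seed=0):
    """Estimate the output distribution of draft-then-verify by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    for _ in range(trials):
        draft = rng.choices(range(len(q_draft)), weights=q_draft)[0]
        _, out = verify_draft_token(draft, p_target, q_draft, rng)
        counts[out] += 1
    return [c / trials for c in counts]

p = [0.6, 0.3, 0.1]  # hypothetical target-model distribution
q = [0.2, 0.5, 0.3]  # hypothetical draft-model distribution
print(empirical_output(p, q))
```

Even though the draft distribution is quite different from the target, the empirical frequencies land close to `p`, which is the property that makes the speed-up lossless.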
Medusa (Cai et al., 2024) modifies the target model itself by adding extra prediction heads that each forecast a token several positions ahead. The model can then propose tokens without an external draft network. EAGLE (Li et al., 2024) trains a lightweight head on intermediate hidden states to predict next-token distributions more accurately, raising the acceptance rate of speculative decoding. Lookahead decoding uses Jacobi-style fixed-point iteration over a window of tokens, removing the draft model entirely.
Because user requests arrive at different times and finish at different times, naive batching wastes compute by waiting for the longest request in a batch. Continuous batching (also called "in-flight batching," introduced by Orca and popularized by vLLM) lets new requests join an in-flight batch at the next decode step and lets finished requests leave immediately. Combined with PagedAttention, continuous batching dramatically increases throughput on shared production servers.
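A toy simulation illustrates the throughput gap. It makes strong simplifying assumptions: all requests are queued at time zero, each running request produces one token per decode step, and prefill and memory limits are ignored. The function names are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size]) for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot at once; queued requests join next step."""
    queue, running, steps = deque(lengths), [], 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decode step: every running request emits a token.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

lengths = [2, 10, 3, 9, 1, 8]  # hypothetical output lengths, in tokens
print(static_batching_steps(lengths, batch_size=2))
print(continuous_batching_steps(lengths, batch_size=2))
```

In this example the static schedule needs 27 decode steps and the continuous one 19, because short requests no longer hold a slot idle while a long neighbor finishes.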
A modern causal language model is rarely deployed as a raw next-token predictor. After pre-training on web-scale text, the base model is refined through a multi-stage post-training pipeline:
Post-training does not change the underlying causal language modeling objective; the model still predicts one token at a time conditioned on prior tokens. What changes is the conditional distribution: a post-trained model is much more likely to produce helpful, safe, and well-structured outputs in response to user prompts.
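One of the most significant discoveries about large causal language models is their ability to perform tasks without explicit fine-tuning. The GPT-3 paper (Brown et al., 2020) demonstrated three paradigms: zero-shot (a natural-language task description only), one-shot (a single demonstration in the prompt), and few-shot (a handful of demonstrations in the prompt).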
As model scale increases, few-shot performance improves much more rapidly than zero-shot performance, suggesting that larger models become better "in-context learners." This emergent capability has been one of the primary drivers of interest in scaling up causal language models and has led to the widespread use of prompt engineering as an alternative to traditional fine-tuning. Chain-of-thought prompting (Wei et al., 2022) showed that including worked reasoning steps in the few-shot demonstrations substantially improves accuracy on multi-step problems, especially for sufficiently large models; Kojima et al. (2022) found that even appending "Let's think step by step" to a zero-shot prompt yields much of the same benefit.
Causal language models exhibit remarkably predictable scaling laws. Research by Kaplan et al. (2020) at OpenAI showed that a model's cross-entropy loss on held-out text follows a power-law relationship with three variables: the number of model parameters $N$, the dataset size in tokens $D$, and the training compute $C$.
The loss decreases smoothly and predictably as any of these quantities increases, with trends spanning more than seven orders of magnitude. The Kaplan paper expressed the relationship as $L(N) \propto N^{-\alpha_N}$, $L(D) \propto D^{-\alpha_D}$, and $L(C) \propto C^{-\alpha_C}$, with empirically fit exponents.
In 2022, DeepMind's Chinchilla study (Hoffmann et al.) refined these findings, demonstrating that many existing models were over-parameterized relative to their training data. The Chinchilla-optimal ratio suggests approximately 20 training tokens per parameter for a given compute budget, contrary to Kaplan's original estimate of about 1.7 tokens per parameter. This insight shifted the field's focus toward training smaller models on more data: LLaMA 7B was trained on 1 trillion tokens, far exceeding the Chinchilla-optimal ratio for its size, which gave it the inference economics of a small model with the quality of a much larger one.
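The arithmetic behind the LLaMA 7B comparison can be sketched directly. The 20 tokens-per-parameter ratio and the 1 trillion token count come from the text above; the $C \approx 6ND$ compute estimate is a widely used rule of thumb added here as an assumption.

```python
def chinchilla_optimal_tokens(num_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a model of this size."""
    return num_params * tokens_per_param

def approx_training_flops(num_params, num_tokens):
    """Rule-of-thumb estimate C ~= 6 * N * D (an assumption, not from the text)."""
    return 6 * num_params * num_tokens

params = 7e9            # LLaMA 7B
trained_tokens = 1e12   # 1 trillion training tokens
optimal = chinchilla_optimal_tokens(params)

print(optimal / 1e9, "billion tokens would be compute-optimal")
print(trained_tokens / params, "tokens per parameter actually used")
print(trained_tokens / optimal, "times the compute-optimal token count")
print(approx_training_flops(params, trained_tokens), "approximate training FLOPs")
```

Under these numbers the model saw roughly seven times the compute-optimal amount of data, about 143 tokens per parameter.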
These scaling laws have become essential planning tools for organizations building large causal language models, allowing them to predict downstream performance and allocate compute budgets efficiently. Subsequent work (such as DeepMind's "Approach to scaling" follow-ups and Anthropic's research on overtraining) has further refined the curves and explored what happens when models are pushed beyond compute-optimal data ratios.
Since the release of GPT-3 in 2020, the causal language modeling paradigm has become the dominant approach for building general-purpose AI systems. Several factors drive this dominance:
Today, nearly every leading large language model, whether used for chatbots, search, coding assistance, or scientific research, is built on the causal language modeling framework.
Causal language models have been deployed across a wide range of natural language processing tasks:
Despite their dominance, causal language models have well-known limitations:
These open problems are active areas of research, and progress on each tends to ripple back into the broader landscape of language model design.
Imagine you are playing a word game where you have to guess the next word in a sentence. Your friend says "The cat sat on the..." and you guess "mat" because that makes the most sense based on the words you already heard. A causal language model works the same way. It reads words from left to right, one at a time, and tries to guess which word comes next. It never peeks ahead at words it has not seen yet. The more sentences it practices with, the better it gets at guessing. This is how computers learn to write stories, answer questions, and even have conversations.