# Causal Language Model

> Source: https://aiwiki.ai/wiki/causal_language_model
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **causal language model** (CLM), also called an **autoregressive language model** or a **decoder-only language model**, is a [language model](/wiki/language_model) that predicts the next [token](/wiki/token) in a sequence using only the tokens that precede it, so generation runs strictly left-to-right and never conditions on future tokens. Hugging Face states the definition plainly: "Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens."[24] This unidirectionality is enforced architecturally by a causal (triangular) attention mask, and it is what distinguishes causal language models from bidirectional [masked language models](/wiki/masked_language_model) such as [BERT](/wiki/bert).

Causal language models form the backbone of virtually all modern [large language models](/wiki/large_language_model), including the [GPT](/wiki/gpt) family, [LLaMA](/wiki/llama), [Claude](/wiki/claude), [Gemini](/wiki/gemini), and [Mistral](/wiki/mistral). The paradigm scaled into the mainstream with [GPT-3](/wiki/gpt-3) (2020), described by its authors as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model" that could perform new tasks "without any gradient updates or fine-tuning."[4] Despite the simplicity of the underlying objective (predict the next token, scored with cross-entropy loss), causal language models trained at scale have proven capable of translation, summarization, code generation, multi-step reasoning, and open-ended dialogue, making the paradigm the workhorse of contemporary [generative AI](/wiki/generative_ai).

The term "causal" refers to the directional structure of the model rather than to any notion of causality in the philosophical or statistical sense. In a causal language model, the prediction for the token at position $t$ depends only on positions $1$ through $t-1$, just as a cause must precede an effect in time. This restriction is enforced architecturally through a triangular attention mask, which is why these models are sometimes called "masked attention" or "look-ahead masked" transformers.

## How does causal language modeling work?

At its core, a causal language model learns a probability distribution over sequences of tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_{t-1}$, the model estimates the conditional probability of the next token:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

The full probability of a sequence is then the product of these conditional probabilities (by the chain rule of probability):

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

This [autoregressive](/wiki/autoregressive_model) factorization means the model generates text one token at a time, always conditioning on everything produced so far. During inference, each newly generated token is appended to the context and fed back into the model to predict the subsequent token. This cycle repeats until a stopping condition is met, such as reaching a special end-of-sequence token, a stop string supplied by the application, or a maximum length set by the caller.
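
The factorization and the generation loop can be sketched with a toy model. The bigram table below is a hypothetical stand-in for a real network (it looks only at the last token, whereas a causal language model conditions on the whole prefix), and the names `BIGRAM`, `next_token_probs`, `sequence_log_prob`, and `generate` are illustrative assumptions.

```python
import math
import random

# Hypothetical toy model: next-token probabilities keyed by the previous token.
BIGRAM = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.7, "dog": 0.3},
    "cat": {"sleeps": 0.8, "<eos>": 0.2},
    "dog": {"sleeps": 0.6, "<eos>": 0.4},
    "sleeps": {"<eos>": 1.0},
}

def next_token_probs(prefix):
    """Return P(x_t | x_1..x_{t-1}); this toy only inspects the last token."""
    return BIGRAM[prefix[-1]]

def sequence_log_prob(tokens):
    """Chain rule: add up the log conditional probability of each token."""
    prefix = ["<bos>"]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_probs(prefix)[tok])
        prefix.append(tok)
    return total

def generate(max_new_tokens=10, seed=0):
    """Sample one token at a time, feeding each choice back into the context."""
    rng = random.Random(seed)
    prefix = ["<bos>"]
    for _ in range(max_new_tokens):      # stopping condition: maximum length
        probs = next_token_probs(prefix)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        prefix.append(tok)
        if tok == "<eos>":               # stopping condition: end-of-sequence
            break
    return prefix[1:]

# Probability of one sequence: 0.6 * 0.5 * 0.8 * 1.0
print(math.exp(sequence_log_prob(["the", "cat", "sleeps", "<eos>"])))
print(generate())
```

Working in log space, as `sequence_log_prob` does, avoids numerical underflow when the product runs over thousands of tokens.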

The choice of factorization is not arbitrary. Although the chain rule of probability holds for any ordering of variables, left-to-right factorization aligns with how humans produce language and allows efficient training using dense supervision. Every position in the input acts simultaneously as both an input and a target, so a single forward pass over a sequence of length $T$ produces $T$ training signals. Right-to-left or arbitrary-order factorizations are mathematically valid but rarely used because they break this convenient structure and complicate downstream generation.

## Architecture

### Decoder-only transformers

Most causal language models are built on the decoder-only variant of the [transformer](/wiki/transformer) architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need."[1] The architecture consists of a stack of identical layers, each containing two main sub-layers:

1. **Masked (causal) self-[attention](/wiki/attention):** Computes contextual representations by allowing each token position to attend only to itself and all preceding positions.
2. **Position-wise feed-forward network:** Applies a two-layer [neural network](/wiki/neural_network) independently to each position, typically with a hidden width 4 times the model dimension.

Residual connections and layer normalization are applied around each sub-layer. A [positional encoding](/wiki/positional_encoding) scheme (sinusoidal, learned, or rotary) provides the model with information about token order, since the attention mechanism by itself has no built-in notion of position (without masking it is permutation-equivariant). Recent models such as [LLaMA](/wiki/llama) and [Mistral](/wiki/mistral) use [rotary positional embeddings](/wiki/rotary_positional_embedding) (RoPE), which encode position by rotating query and key vectors in pairs, allowing the model to extrapolate (with some loss) beyond training context lengths.

The original transformer paper described an encoder-decoder model for machine translation in which an encoder built bidirectional representations of the source sentence and a decoder generated the target sentence one token at a time.[1] The decoder-only causal language model strips away the encoder, leaving only the autoregressive component, and feeds the entire input directly into the stacked decoder layers. This simpler design has come to dominate the field because it scales cleanly and unifies understanding and generation into a single network.

### What is causal masking in self-attention?

The defining architectural feature of a causal language model is **causal masking** (also called the "causal attention mask," "look-ahead mask," or "autoregressive mask"). In standard [self-attention](/wiki/self_attention), every token can attend to every other token in the sequence. Causal masking restricts this so that position $t$ can only attend to positions $1$ through $t$. The transformer authors framed the goal directly: "We need to prevent leftward information flow in the decoder to preserve the auto-regressive property."[1]

Mechanically, the mask is implemented by adding a mask matrix to the attention scores before the [softmax](/wiki/softmax) operation. As the original paper puts it, "We implement this inside of scaled dot-product attention by masking out (setting to minus infinity) all values in the input of the softmax which correspond to illegal connections."[1] The mask holds negative infinity ($-\infty$) strictly above the diagonal and zero everywhere else. When the masked scores pass through softmax, the $-\infty$ entries become zero weights, effectively preventing any information flow from future positions. The resulting attention weight matrix is lower-triangular, ensuring that each token's representation is computed only from the current and previous tokens.

For a sequence of length $T$, the unmasked attention weight matrix would have shape $T \times T$ with all entries potentially nonzero. After causal masking, only the lower triangle (including the diagonal) contains nonzero weights, giving the matrix a stair-step appearance:

```
[ a11   0    0    0  ]
[ a21  a22   0    0  ]
[ a31  a32  a33   0  ]
[ a41  a42  a43  a44 ]
```

This masking serves two purposes:

- **During training**, it allows the model to process an entire sequence in a single forward pass while still making independent predictions for each position. Without the mask, the model could "cheat" by looking at the token it is supposed to predict.
- **During inference**, it ensures that the model's generation process is coherent and consistent with its training regime. A token generated at step $t$ will only ever depend on the prefix it actually saw during training.
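
A minimal sketch of the masking step in plain Python follows. The function name `causal_attention_weights` is an assumption made for illustration, and production code would use batched tensor operations or fused kernels rather than nested lists.

```python
import math

def causal_attention_weights(scores):
    """Mask a T x T score matrix causally, then softmax each row.

    scores[i][j] is the raw (pre-softmax) score of query i against key j.
    """
    T = len(scores)
    weights = []
    for i in range(T):
        # Keys at positions j > i are future tokens: replace their score by -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(T)]
        row_max = max(masked)                 # finite, the diagonal is never masked
        exps = [math.exp(s - row_max) for s in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]        # uniform scores for readability
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
```

With uniform scores, row $t$ spreads its weight evenly over the first $t$ positions, which reproduces the lower-triangular pattern shown in the matrix above.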

In practice, modern transformer implementations skip the explicit mask matrix when possible. Libraries such as [FlashAttention](/wiki/flash_attention) compute causal attention directly using fused kernels that simply do not load the future tokens, which avoids the wasted memory of materialising a triangular mask and yields significant speedups on long sequences.

### Tokenization and embeddings

Before a causal language model can process text, raw strings must be converted into a sequence of integer token IDs. Modern systems use subword [tokenization](/wiki/tokenization) algorithms, most commonly **byte-pair encoding** (BPE), **WordPiece**, or the unigram language model (BPE and unigram are both implemented by the **SentencePiece** toolkit). These algorithms strike a balance between vocabulary size and the ability to represent rare words. A vocabulary of 32,000 to 200,000 subword units is typical for production models. Token IDs index into a learned embedding matrix that maps each ID to a dense vector of size $d_{\text{model}}$, typically 768 to 12,288 dimensions depending on model scale. The embedding matrix is often tied to the output projection ("weight tying"), which reduces parameter count and slightly improves perplexity.

### Output head

After the final transformer layer, a linear projection (the **language modeling head**) maps each hidden state to a vector of vocabulary-sized logits. A softmax over these logits produces a probability distribution over the next token. Some systems compute the softmax only over a subset of the vocabulary during inference (a technique known as "vocabulary pruning") or replace it with sampled softmax variants during training to reduce compute, but the standard recipe applies a full softmax at every position.
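
The path from token ID to next-token distribution can be illustrated with a deliberately tiny example. Everything here is hypothetical: a 4-token vocabulary, $d_{\text{model}} = 3$, made-up embedding values, and no transformer layers at all (the embedding itself stands in for the final hidden state). The head reuses the embedding matrix, as in weight tying.

```python
import math

# Hypothetical embedding matrix: 4 vocabulary entries, d_model = 3.
EMBEDDING = [
    [0.1, 0.3, -0.2],
    [0.0, -0.1, 0.4],
    [0.5, 0.2, 0.1],
    [-0.3, 0.2, 0.0],
]

def embed(token_ids):
    """Look up one d_model-sized vector per token ID."""
    return [EMBEDDING[i] for i in token_ids]

def lm_head(hidden):
    """Weight tying: the output projection reuses the embedding matrix.

    Returns one logit per vocabulary entry (dot product with each row).
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in EMBEDDING]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

hidden = embed([2])[0]            # stand-in for the last layer's hidden state
probs = softmax(lm_head(hidden))  # distribution over the 4-token vocabulary
print(len(probs), sum(probs))
```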

## What is the training objective of a causal language model?

### Next-token prediction and cross-entropy loss

The standard training objective for causal language models is **[next-token prediction](/wiki/next_token_prediction)**, optimized using **cross-entropy loss** (equivalently, negative log-likelihood). For a training sequence of length $T$, the loss is:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \ldots, x_{t-1})$$

where $P_{\theta}$ is the model's predicted probability distribution parameterized by weights $\theta$. Minimizing this loss is equivalent to maximizing the likelihood of the training data under the model. The mean cross-entropy across a held-out corpus is often reported in two equivalent forms: the loss itself in nats (or bits, when log base 2 is used) and the [perplexity](/wiki/perplexity), defined as $\exp(\mathcal{L})$. A model with a lower perplexity assigns higher probability to the held-out text and is generally considered a stronger language model.
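
As a small worked example of the loss and its perplexity, assume the model assigned the listed probabilities to the correct token at each position (the numbers are invented for illustration):

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood in nats.

    token_probs[t] is the probability the model gave to the true token x_t.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(token_probs))

uniform_guess = [0.25, 0.25, 0.25, 0.25]   # always 1-in-4 on the right token
confident = [0.9, 0.8, 0.95, 0.7]          # hypothetical stronger model

loss_nats = cross_entropy(uniform_guess)
print(loss_nats, loss_nats / math.log(2))  # the same loss in nats and in bits
print(perplexity(uniform_guess), perplexity(confident))
```

A model that always puts probability 1/4 on the correct token has a perplexity of 4, which matches the informal reading of perplexity as the effective number of equally likely choices per token.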

During training, the model receives a full sequence and, thanks to causal masking, computes the loss for every position in parallel. Feeding the ground-truth tokens (rather than the model's own predictions) as the input at every step is called **teacher forcing**; combined with causal masking, it is far more efficient than generating tokens one at a time. Teacher forcing introduces a known mismatch between training and inference (called "exposure bias"), since at inference time the model conditions on its own (possibly imperfect) outputs rather than ground-truth tokens. In practice this mismatch is small enough at scale that pure teacher forcing remains the default training recipe.
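
In code, teacher forcing amounts to shifting the sequence by one position to obtain inputs and targets. The sketch below uses made-up token IDs and a hypothetical helper name, `teacher_forcing_pairs`.

```python
def teacher_forcing_pairs(token_ids):
    """Split one sequence into input and target streams shifted by one."""
    inputs = token_ids[:-1]    # ground-truth tokens fed to the model
    targets = token_ids[1:]    # the token each position must predict
    return inputs, targets

tokens = [101, 7, 42, 9, 102]  # hypothetical token IDs
inputs, targets = teacher_forcing_pairs(tokens)

for t, target in enumerate(targets):
    visible = inputs[: t + 1]  # what the causal mask exposes at position t
    print(visible, "->", target)
```

All rows of this printout are computed in one forward pass during training; the causal mask is what keeps each row from seeing its own target.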

### Multi-token prediction

A recent line of research replaces single-token next-token prediction with **multi-token prediction** (MTP), in which the model is trained to predict the next several tokens simultaneously using parallel output heads. DeepSeek used a variant of MTP in [DeepSeek-V3](/wiki/deepseek_v3) (2024) to densify the training signal and improve sample efficiency. Multi-token prediction also pairs naturally with [speculative decoding](/wiki/speculative_decoding) at inference time, where the additional heads can be reused as a built-in draft model.

### Regularization and optimization

To prevent [overfitting](/wiki/overfitting) and improve generalization, several techniques are commonly applied:

| Technique | Description |
|---|---|
| [Dropout](/wiki/dropout_regularization) | Randomly zeroes a fraction of activations during training to prevent co-adaptation |
| Weight decay | Shrinks parameters toward zero at every step; equivalent to an L2 penalty under plain SGD, and applied as a decoupled update in AdamW |
| [Learning rate](/wiki/learning_rate) scheduling | Uses warmup followed by cosine or linear decay of the learning rate |
| [Gradient clipping](/wiki/gradient_clipping) | Caps gradient magnitudes to stabilize training |
| [Layer normalization](/wiki/batch_normalization) | Normalizes activations within each layer to reduce internal covariate shift |
| Mixed precision | Runs most computation in BF16 or FP16 (usually alongside FP32 master weights) to cut memory use and raise throughput |
| Sequence packing | Concatenates multiple short documents into one long training sequence to avoid padding waste |

Modern causal language models are typically trained using the AdamW optimizer with mixed-precision (FP16 or BF16) arithmetic to reduce memory usage and accelerate computation. At very large scales, additional tricks become necessary: gradient checkpointing trades recomputation for memory, ZeRO sharding partitions optimizer state across devices, and pipeline parallelism splits the model depth-wise across nodes. Training a 100-billion-parameter causal language model from scratch typically requires hundreds of thousands to millions of GPU- or TPU-hours and a carefully tuned data-parallel, tensor-parallel, and pipeline-parallel topology.
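
Two rows of the table above, learning rate scheduling and gradient clipping, are easy to sketch directly. The default values below (peak rate, warmup length, total steps, floor, clipping threshold) are illustrative assumptions, not settings taken from any particular model.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_at(step))
print(clip_by_global_norm([3.0, 4.0]))   # norm 5 is scaled down to norm 1
```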

## A brief history

### From n-grams to neural language models

Language modeling predates deep learning by decades. Early **n-gram models** estimated $P(x_t \mid x_{t-n+1}, \ldots, x_{t-1})$ by counting word occurrences in a training corpus, with smoothing techniques (such as Kneser-Ney) to handle unseen contexts. n-gram models are causal in the trivial sense that they only condition on prior tokens, but they are limited to short, fixed-length contexts and cannot share statistical strength across semantically related contexts.

In 2003, Bengio et al. proposed the first widely cited **neural probabilistic language model**, which embedded words into a continuous vector space and used a feed-forward neural network to predict the next word.[12] This work introduced the idea that distributed representations could overcome the data sparsity problem that plagued n-gram models.

The breakthrough that enabled long-range causal language modeling was the **recurrent neural network** (RNN). Mikolov et al. (2010) showed that RNN language models substantially outperformed traditional n-gram baselines on speech recognition benchmarks.[11] RNN language models are inherently causal because their hidden state at time $t$ is computed only from prior hidden states and the current input. The introduction of [long short-term memory](/wiki/lstm) (LSTM) cells (Hochreiter and Schmidhuber, 1997)[13] and gated recurrent units (GRUs, Cho et al., 2014) further improved the ability of recurrent models to capture long-range dependencies.

### The transformer revolution

In 2017, Vaswani et al. introduced the [transformer](/wiki/transformer) in "Attention Is All You Need." The original transformer was an encoder-decoder model for machine translation, but the decoder component (with causal self-attention) provided a template for a new class of language models. Unlike RNNs, transformers process every token position in parallel, which makes them dramatically more efficient on modern accelerators. Causal masking provides the autoregressive structure that RNNs got from their recurrence.

In June 2018, OpenAI published "Improving Language Understanding by Generative Pre-Training" (Radford et al.), which introduced the first **Generative Pre-trained Transformer**, retroactively named **GPT-1**.[2] The model was a 117-million-parameter decoder-only transformer pre-trained on the BooksCorpus dataset using the standard next-token-prediction objective, then fine-tuned for downstream tasks. The combination of unsupervised pre-training plus supervised fine-tuning achieved state-of-the-art on many natural language understanding benchmarks.[2] Although [BERT](/wiki/bert) (Devlin et al., 2018) launched four months later and dominated discriminative benchmarks for several years, the GPT line continued to push generative pre-training and would eventually carry the field.[7]

### Scaling and the GPT family

[GPT-2](/wiki/gpt2), released in February 2019, scaled the GPT-1 architecture to 1.5 billion parameters and trained on 40 GB of web text scraped from outbound Reddit links (the WebText dataset). The model demonstrated strong zero-shot performance on tasks it had never been explicitly trained for, sparking a wave of interest in unsupervised multitask learning.[3] OpenAI's controversial decision to delay the release of the largest GPT-2 weights (citing misuse concerns) marked the beginning of public debate about the safety of releasing large language models.

[GPT-3](/wiki/gpt-3), introduced in May 2020, took the same architecture to 175 billion parameters and trained on roughly 300 billion tokens drawn from Common Crawl, WebText2, two book corpora, and English Wikipedia. The headline finding of the GPT-3 paper (Brown et al., 2020) was **in-context learning**: with no gradient updates, the model could perform new tasks given only a few demonstrations in the prompt.[4] This shifted the field's focus from supervised fine-tuning to prompt-based interaction.

**InstructGPT** (early 2022) and the **GPT-3.5** series that followed later that year added [reinforcement learning from human feedback](/wiki/rlhf) (RLHF) to align the base model with human preferences. A chat-tuned GPT-3.5 variant powered the public launch of [ChatGPT](/wiki/chatgpt) on November 30, 2022. Within two months ChatGPT was estimated to have reached 100 million monthly active users, which a widely cited UBS analyst note called the fastest-growing consumer application in history, triggering the modern wave of generative AI investment.[25]

[GPT-4](/wiki/gpt-4), released in March 2023, introduced multimodal input (images and text), substantially better reasoning, and longer context windows. OpenAI did not disclose architectural details, but external sources have estimated parameter counts in the trillions and a mixture-of-experts structure. **GPT-4o** (May 2024) added native voice and image generation; **GPT-4.5** and **GPT-5** continued the line through the mid-2020s.

### Beyond GPT

The causal language modeling paradigm spread quickly to other organizations. Notable open-weight families include:

| Model family | Organization | First release | Defining contribution |
|---|---|---|---|
| [LLaMA](/wiki/llama) | Meta | 2023 | Trained 7B-65B models far past Chinchilla-optimal token counts; released openly |
| [Mistral](/wiki/mistral) | Mistral AI | 2023 | Sliding-window attention and grouped-query attention in a 7B model |
| Mixtral | Mistral AI | 2023 | Open-weight sparse [mixture of experts](/wiki/mixture_of_experts) |
| [Falcon](/wiki/falcon) | TII | 2023 | Open 40B/180B models trained on the RefinedWeb dataset |
| [Qwen](/wiki/qwen) | Alibaba | 2023 | Strong multilingual and Chinese-language performance |
| [DeepSeek](/wiki/deepseek) | DeepSeek | 2023 | Multi-token prediction and aggressive MoE scaling |
| [Gemma](/wiki/gemma) | Google | 2024 | Lightweight open variants of Gemini-class architectures |

Proprietary causal language models such as [Claude](/wiki/claude) (Anthropic), [Gemini](/wiki/gemini) (Google DeepMind), and [Grok](/wiki/grok) (xAI) follow the same decoder-only template, with proprietary modifications to attention, training data, and post-training pipelines.

## The GPT family in detail

The [GPT](/wiki/gpt) (Generative Pre-trained Transformer) series from OpenAI is the best-known line of causal language models. Each generation has scaled up dramatically in size and capability.

| Model | Year | Parameters | Training Data | Key Contribution |
|---|---|---|---|---|
| [GPT-1](/wiki/gpt1) | 2018 | 117 million | BooksCorpus (~800M words) | Demonstrated generative pre-training followed by discriminative fine-tuning |
| [GPT-2](/wiki/gpt2) | 2019 | 1.5 billion | WebText (40 GB) | Showed unsupervised multitask learning; zero-shot task transfer |
| [GPT-3](/wiki/gpt-3) | 2020 | 175 billion | ~300 billion tokens | Introduced in-context few-shot learning without fine-tuning |
| GPT-3.5 / InstructGPT | 2022 | ~175 billion | ~300B tokens plus RLHF data | Added RLHF for instruction following; powered ChatGPT |
| [GPT-4](/wiki/gpt-4) | 2023 | Undisclosed | Undisclosed | Multimodal capabilities (text and image input); strong reasoning |
| GPT-4o | 2024 | Undisclosed | Undisclosed | Natively multimodal with audio and image generation |
| GPT-5 | 2025 | Undisclosed | Undisclosed | Unified reasoning and chat in a single model |

Beyond GPT, many other causal language models have been developed, including [LLaMA](/wiki/llama) (Meta), [PaLM](/wiki/palm) (Google), [Falcon](/wiki/falcon) (TII), [Mistral](/wiki/mistral) (Mistral AI), and [Claude](/wiki/claude) (Anthropic).

## How does a causal LM differ from a masked LM (BERT) and a prefix LM?

Causal language models are often compared with [masked language models](/wiki/masked_language_model) (MLMs) such as [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) and with prefix language models (PrefixLMs) such as [T5](/wiki/t5). The three paradigms differ fundamentally in their training objectives, attention patterns, and downstream strengths. The clearest dividing line is the attention mask: a causal LM uses a lower-triangular mask and sees only the past, whereas a masked LM uses full bidirectional attention and sees the entire input at once.[24]

| Feature | Causal LM (e.g., GPT) | Masked LM (e.g., BERT) | Prefix LM (e.g., T5, UL2) |
|---|---|---|---|
| Attention pattern | Lower-triangular (causal) | Full bidirectional | Bidirectional on prefix, causal on target |
| Training objective | Predict the next token | Predict randomly masked tokens | Predict the target span given the prefix |
| Context used | Only preceding tokens | Both preceding and following tokens | Bidirectional within prefix; causal within target |
| Architecture | Decoder-only transformer | Encoder-only transformer | Encoder-decoder, or unified prefix masking |
| Primary strength | Open-ended text generation | Text understanding and classification | Conditional generation (translation, summarization) |
| Example use cases | Chatbots, code generation, story writing | [Sentiment analysis](/wiki/sentiment_analysis), [NER](/wiki/named_entity_recognition), retrieval | Translation, structured generation |

Because masked language models have access to bidirectional context, they tend to produce richer representations for understanding tasks. However, they are not naturally suited for text generation, since they cannot predict text sequentially. Causal language models, by contrast, are inherently generative and have become the dominant paradigm for building conversational AI systems, code assistants, and general-purpose [large language models](/wiki/large_language_model).

### Why did decoder-only "win"?

Through roughly 2019 to 2021, encoder-decoder models such as [T5](/wiki/t5) (Raffel et al., 2020)[10] and BART (Lewis et al., 2019) were widely viewed as state-of-the-art for conditional generation tasks. By 2023, however, decoder-only causal language models had largely displaced them. Several factors drove this shift:

- **Single objective:** Next-token prediction works for any text-to-text task once both input and output are concatenated, removing the need for task-specific encoder-decoder pretraining.
- **Unified scaling:** A single decoder stack can be scaled with a single set of hyperparameters, whereas encoder-decoder models split parameters across two stacks.
- **In-context learning:** GPT-3 demonstrated that decoder-only models could learn new tasks from prompt examples without fine-tuning.[4] Encoder-decoder models lagged on this capability because their decoder did not see prefix examples through bidirectional attention.
- **KV cache:** Decoder-only inference reuses keys and values from previous tokens, making generation efficient. Encoder-decoder models recompute encoder activations for every new prompt, which wastes work in chat settings.
- **Tooling momentum:** Once the major industrial labs aligned on decoder-only architectures, libraries, accelerators, and inference engines were optimised primarily for that case, creating a self-reinforcing ecosystem.

### Prefix language models

A **prefix language model** (PrefixLM) occupies a middle ground between causal and masked language models. In a PrefixLM, an initial "prefix" portion of the input is processed with bidirectional (non-causal) attention, while the remaining tokens are generated autoregressively with causal masking.

This design is useful for tasks where a fixed input (the prefix) must be fully understood before generating an output. For example, in a translation task, the source sentence can serve as the prefix with full bidirectional attention, and the target sentence is generated causally. Research has shown that PrefixLMs can outperform pure causal language models on in-context learning tasks because they allow demonstration examples within the prefix to attend to one another freely, rather than being restricted by left-to-right masking.
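
The three attention patterns differ only in which key positions each query may see. The sketch below builds all three as 0/1 matrices; the function name `attention_mask` and its `kind` strings are assumptions made for illustration.

```python
def attention_mask(seq_len, kind, prefix_len=0):
    """Return mask[i][j] == 1 where query position i may attend to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if kind == "bidirectional":      # masked LM: everything is visible
                allowed = True
            elif kind == "causal":           # causal LM: current and past only
                allowed = j <= i
            elif kind == "prefix":           # prefix LM: whole prefix, then causal
                allowed = j <= i or j < prefix_len
            else:
                raise ValueError(f"unknown mask kind: {kind}")
            row.append(int(allowed))
        mask.append(row)
    return mask

for kind in ("causal", "bidirectional", "prefix"):
    print(kind)
    for row in attention_mask(5, kind, prefix_len=2):
        print(" ", row)
```

In the prefix case with `prefix_len=2`, the first two rows see both prefix tokens, while rows 2 to 4 see the prefix plus only the earlier target tokens.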

Models that use or have explored PrefixLM objectives include [T5](/wiki/t5)[10], UL2 (Tay et al., 2022)[23], and certain configurations of [PaLM](/wiki/palm).

## Decoding strategies for text generation

A causal language model produces a probability distribution over the vocabulary at each generation step. The method used to select the next token from this distribution is called a **decoding strategy** (or sampling strategy). Different strategies trade off between output quality, diversity, and computational cost.

| Strategy | How it works | Characteristics |
|---|---|---|
| Greedy decoding | Always selects the highest-probability token | Fast but often produces repetitive, generic text |
| Beam search | Maintains the top-$k$ most probable partial sequences and selects the best overall | Better than greedy for short outputs; can still be repetitive |
| Top-$k$ sampling | Samples from the $k$ most probable tokens | Introduces diversity; the fixed $k$ may be too narrow or too wide |
| Top-$p$ (nucleus) sampling | Samples from the smallest set of tokens whose cumulative probability exceeds $p$ | Dynamically adapts the candidate set; widely used in practice |
| [Temperature](/wiki/temperature) scaling | Divides logits by a temperature parameter $\tau$ before softmax | $\tau < 1$ sharpens the distribution; $\tau > 1$ flattens it |
| Min-$p$ sampling | Filters tokens below a minimum probability relative to the top token | A recent alternative that combines benefits of top-$k$ and top-$p$ |
| Mirostat | Targets a specified perplexity by dynamically adjusting truncation | Aims for stable, surprise-controlled output |
| Contrastive search | Penalizes tokens whose representations are similar to recent context | Reduces degenerate repetition without aggressive truncation |
| Typical sampling | Selects tokens whose information content is close to the conditional entropy | Aims for "locally typical" generations |

In practice, modern systems often combine several of these strategies. For instance, a chatbot might use top-$p$ sampling with a moderate temperature and a repetition penalty to balance coherence and creativity. Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration," provided the canonical justification for nucleus sampling by showing that high-likelihood beam-search outputs collapse into repetitive loops while truncated stochastic sampling avoids this failure mode.[8]
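
Temperature scaling, top-$k$, and top-$p$ filtering can be sketched in a few lines of plain Python. The helper names are hypothetical, and the top-$p$ filter here stops as soon as the cumulative probability reaches $p$, one of several slightly different conventions found in practice.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def renormalize(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    z = sum(probs[i] for i in keep)
    return [probs[i] / z if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]                      # hypothetical model output
probs = top_p_filter(softmax(logits, temperature=0.8), p=0.9)
token_id = rng.choices(range(len(probs)), weights=probs)[0]
print(probs, token_id)
```

Greedy decoding is the special case `top_k_filter(probs, 1)`, which leaves all mass on the single most probable token.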

### Penalties and constraints

Decoding is often combined with auxiliary penalties to shape outputs further. **Repetition penalties** scale down the score of any token already in the context by a fixed factor (dividing positive logits and multiplying negative ones, so the token always becomes less likely), discouraging verbatim loops. **Frequency and presence penalties**, exposed by the OpenAI API, reduce logits proportionally to how often or whether a token has appeared. **Logit bias** lets a developer add or subtract a fixed amount to specific token IDs, which is useful for forbidding certain words or steering toward a particular format. **Constrained decoding** (using grammars or regular expressions) restricts the model to outputs that conform to a schema, which is widely used for structured tool use and JSON generation.
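
The logit adjustments can be sketched as small pure functions. The function names and default values are assumptions for illustration. One detail is also an assumption about convention: the repetition penalty below divides positive logits but multiplies negative ones, because dividing a negative logit would make the token more likely rather than less.

```python
from collections import Counter

def apply_repetition_penalty(logits, context_ids, penalty=1.2):
    """Make every token already in the context less likely."""
    out = list(logits)
    for i in set(context_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

def apply_presence_frequency(logits, context_ids, presence=0.0, frequency=0.0):
    """Subtract a flat amount for any appearance plus an amount per occurrence."""
    out = list(logits)
    for i, count in Counter(context_ids).items():
        out[i] -= presence + frequency * count
    return out

def apply_logit_bias(logits, bias):
    """Add a fixed offset to chosen token IDs (very negative values ban them)."""
    out = list(logits)
    for i, offset in bias.items():
        out[i] += offset
    return out

logits = [2.0, -1.0, 0.5, 0.0]     # hypothetical scores for a 4-token vocabulary
context = [0, 1, 0]                # token 0 appeared twice, token 1 once
print(apply_repetition_penalty(logits, context, penalty=2.0))
print(apply_presence_frequency(logits, context, presence=0.5, frequency=0.1))
print(apply_logit_bias(logits, {3: -100.0}))
```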

## Inference: prefill, decode, and the KV cache

### Two phases of inference

Serving a causal language model in production has two distinct computational phases:

1. **Prefill (prompt processing):** The model ingests the user's prompt of length $n$ in parallel and computes attention keys and values for every position. This phase is compute-bound on most accelerators because it performs $O(n^2)$ attention work in a single pass.
2. **Decode (token generation):** The model generates output tokens one at a time. Each new token requires only $O(n)$ attention work because the keys and values for prior tokens are reused from the cache. This phase is typically memory-bandwidth-bound: the bottleneck is reading model weights and cached keys and values from HBM rather than performing computations.

The practical consequence is that prefill latency scales with prompt length while per-token decode latency is roughly constant for a given model and hardware. Time-to-first-token (TTFT) is dominated by prefill; tokens-per-second after the first token reflects decode throughput.

### The KV cache

During decode, every transformer layer computes a query, key, and value for the new token only and reuses the keys and values of all previous tokens. To avoid recomputing those, modern inference engines maintain a **[KV cache](/wiki/kv_cache)**: a per-layer buffer that stores key and value tensors for every position processed so far. Memory use grows linearly with context length, and at long contexts the KV cache can dwarf the model weights themselves. A 70-billion-parameter model with 80 layers, a head dimension of 128, and 64 query heads that share 8 key-value heads (grouped-query attention) consumes roughly 320 KB of KV per token in BF16 ($2 \times 80 \times 8 \times 128 \times 2$ bytes), so a 100,000-token context occupies about 32 GB just for the cache. With full multi-head attention (64 key-value heads), the same model would need about 2.6 MB per token.
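
A toy single-head cache shows the mechanism. The class name `KVCache`, the helper `attend`, and the two-dimensional query, key, and value vectors are all made up for illustration; a real engine keeps one such buffer per layer and per key-value head, stored as tensors on the accelerator.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]

class KVCache:
    """Buffer of keys and values for every position processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

# Hypothetical (query, key, value) projections for three token positions.
steps = [
    ([1.0, 0.0], [0.5, 0.5], [1.0, 2.0]),
    ([0.0, 1.0], [1.0, 0.0], [3.0, 4.0]),
    ([1.0, 1.0], [0.0, 1.0], [5.0, 6.0]),
]

cache = KVCache()
outputs = []
for query, key, value in steps:
    cache.append(key, value)   # stored once, reused by every later step
    outputs.append(attend(query, cache.keys, cache.values))

print(len(cache.keys), outputs[-1])
```

Because the cache only ever holds positions up to the current one, causality is satisfied without an explicit mask during decode.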

Several architectural tricks reduce KV cache pressure:

- **Multi-query attention** (Shazeer, 2019) shares a single key and value head across all query heads, cutting KV size by a factor equal to the number of heads.[19]
- **Grouped-query attention** (GQA, Ainslie et al., 2023) shares keys and values across small groups of query heads, balancing quality and memory.[20]
- **Multi-head latent attention** (used in DeepSeek-V2 and V3) projects keys and values through a low-rank latent space, reducing storage further.
- **PagedAttention** (vLLM, Kwon et al., 2023) manages the cache using fixed-size blocks similar to virtual memory pages, eliminating fragmentation across concurrent requests.[18]
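
The effect of sharing key-value heads is simple arithmetic: the cache stores one key and one value vector per layer, per key-value head, per token. The sketch below applies that formula to an 80-layer model with head dimension 128 in BF16 (2 bytes per value); the function name and the three head counts are illustrative assumptions.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

variants = {
    "multi-head attention (64 KV heads)": 64,
    "grouped-query attention (8 KV heads)": 8,
    "multi-query attention (1 KV head)": 1,
}

context_tokens = 100_000
for name, n_kv_heads in variants.items():
    per_token = kv_bytes_per_token(n_layers=80, n_kv_heads=n_kv_heads, head_dim=128)
    total_gb = per_token * context_tokens / 1e9
    print(f"{name}: {per_token / 1024:.0f} KiB per token, "
          f"{total_gb:.1f} GB at {context_tokens:,} tokens")
```

Under these assumptions, going from 64 to 8 key-value heads cuts the per-token cost from 2,560 KiB to 320 KiB, and a single shared head brings it down to 40 KiB.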

### Speculative decoding

**[Speculative decoding](/wiki/speculative_decoding)** (Leviathan et al., 2022, arXiv 2211.17192) accelerates causal language model inference by running a small "draft" model and a large "target" model in tandem. The draft model proposes several tokens autoregressively, then the target model scores them all in a single parallel forward pass. Each drafted token is accepted with a probability set by a rejection-sampling criterion that compares the target and draft probabilities; the first rejected token is replaced with a sample from a corrected distribution, any later drafted tokens are discarded, and the process repeats. The output distribution is provably identical to greedy or sampled decoding from the target model alone, but throughput can roughly double because the target's expensive forward pass produces several output tokens at once.[16]
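
The accept-or-resample rule for a single drafted token can be sketched as follows, assuming both models expose full next-token distributions (`p_target` and `q_draft` are made-up numbers). A drafted token $x$ is accepted with probability $\min(1, p(x)/q(x))$; on rejection, a replacement is drawn from the normalized positive part of $p - q$. The second function computes the resulting output distribution exactly, which makes the "identical to the target" property checkable.

```python
import random

def speculative_sample(p_target, q_draft, rng):
    """Draft one token from q, then accept it or resample; returns (token, accepted)."""
    n = len(p_target)
    x = rng.choices(range(n), weights=q_draft)[0]          # draft proposal
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):  # acceptance test
        return x, True
    # Rejected: sample from the residual distribution max(0, p - q), renormalized.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(n), weights=residual)[0], False

def output_distribution(p_target, q_draft):
    """Exact distribution of the token emitted by one accept-or-resample step."""
    n = len(p_target)
    accept = [q_draft[i] * min(1.0, p_target[i] / q_draft[i]) if q_draft[i] > 0 else 0.0
              for i in range(n)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, p_target[i] - q_draft[i]) for i in range(n)]
    z = sum(residual)
    if z > 0:
        residual = [r / z for r in residual]
    return [accept[i] + reject_mass * residual[i] for i in range(n)]

p_target = [0.6, 0.3, 0.1]   # hypothetical target-model distribution
q_draft = [0.3, 0.5, 0.2]    # hypothetical draft-model distribution
print(output_distribution(p_target, q_draft))   # matches p_target up to rounding
print(speculative_sample(p_target, q_draft, random.Random(0)))
```

The closer the draft distribution is to the target, the higher the acceptance rate and the more tokens each target forward pass yields.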

**Medusa** (Cai et al., 2024) modifies the target model itself by adding extra prediction heads that each forecast a token several positions ahead. The model can then propose tokens without an external draft network.[17] **EAGLE** (Li et al., 2024) trains a lightweight head on intermediate hidden states to predict next-token distributions more accurately, raising the acceptance rate of speculative decoding. **Lookahead decoding** uses Jacobi-style fixed-point iteration over a window of tokens, removing the draft model entirely.

### Continuous batching

Because user requests arrive at different times and finish at different times, naive batching wastes compute by waiting for the longest request in a batch. **Continuous batching** (also called "in-flight batching," introduced by Orca and popularized by vLLM) lets new requests join an in-flight batch at the next decode step and lets finished requests leave immediately. Combined with PagedAttention, continuous batching dramatically increases throughput on shared production servers.[18]
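A toy simulation makes the saving concrete. The sketch below assumes all requests are already queued, counts only decode steps (no prefill cost or arrival times), and uses hypothetical function names; it is not how Orca or vLLM implement scheduling.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a freed slot is refilled before the next step."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token per in-flight request;
        # requests with a single token left finish and free their slot.
        running = [r - 1 for r in running if r > 1]
        steps += 1
    return steps


# One long request followed by seven single-token requests (made-up workload).
output_lengths = [8, 1, 1, 1, 1, 1, 1, 1]
print(static_batching_steps(output_lengths, batch_size=2))      # 11
print(continuous_batching_steps(output_lengths, batch_size=2))  # 8
```

In the static case every short request that shares a batch with the long one leaves its slot idle; in the continuous case the short requests stream through the second slot while the long request keeps decoding.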

## Post-training

A modern causal language model is rarely deployed as a raw next-token predictor. After pre-training on web-scale text, the base model is refined through a multi-stage **post-training** pipeline:

1. **Supervised fine-tuning (SFT):** The model is fine-tuned on a curated set of instruction-response pairs written by humans or distilled from a stronger model. SFT teaches the model the format of helpful responses.
2. **Reward modeling:** A separate model is trained on human comparisons ("response A is better than response B") to predict a scalar quality score.
3. **[Reinforcement learning from human feedback](/wiki/rlhf) (RLHF):** The SFT model is further fine-tuned to maximize the reward model's score, typically using proximal policy optimization (PPO). Direct preference optimization (DPO) is a common alternative that skips the explicit reward model and optimizes directly on the preference pairs. Learning from human preference comparisons was introduced for deep reinforcement learning agents by Christiano et al. (2017)[22] and applied to instruction following with language models in OpenAI's InstructGPT paper (Ouyang et al., 2022).[14]
4. **Reinforcement learning from AI feedback (RLAIF) and constitutional AI:** Anthropic's [Constitutional AI](/wiki/constitutional_ai) recipe (Bai et al., 2022) replaces some human comparisons with judgments from a constitution-following model, scaling alignment data more cheaply.[21]
5. **Reasoning training:** Models such as OpenAI's o1 and o3 and DeepSeek-R1 add a further stage of reinforcement learning that rewards correct multi-step reasoning, which substantially improves performance on math, science, and code benchmarks.

Post-training does not change the underlying causal language modeling objective; the model still predicts one token at a time conditioned on prior tokens. What changes is the conditional distribution: a post-trained model is much more likely to produce helpful, safe, and well-structured outputs in response to user prompts.
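The reward-modeling step is commonly trained with a pairwise comparison loss of the form $-\log \sigma(r_\text{chosen} - r_\text{rejected})$, where $r$ is the scalar score the reward model assigns to a response. The sketch below illustrates that loss on plain floats; the function name is hypothetical, and a real pipeline would compute the scores with a neural network and average the loss over a batch.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(pairwise_reward_loss(2.0, 0.0))  # small loss: ranking agrees with the label
print(pairwise_reward_loss(0.0, 0.0))  # log(2): the model is undecided
print(pairwise_reward_loss(0.0, 2.0))  # large loss: ranking contradicts the label
```

Minimizing this loss pushes the score of the preferred response above the score of the rejected one, which is all the subsequent reinforcement learning stage needs from the reward model.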

## What are the zero-shot and few-shot capabilities of causal language models?

One of the most significant discoveries about large causal language models is their ability to perform tasks without explicit fine-tuning. The GPT-3 paper (Brown et al., 2020) demonstrated three paradigms:[4]

- **Zero-shot:** The model receives only a natural-language description of a task and must complete it without any examples.
- **One-shot:** A single input-output example is provided alongside the task description.
- **Few-shot (in-context learning):** Several examples (typically 10 to 100) are included in the prompt before the query.

As model scale increases, few-shot performance improves much more rapidly than zero-shot performance, suggesting that larger models become better "in-context learners." This emergent capability has been one of the primary drivers of interest in scaling up causal language models and has led to the widespread use of [prompt engineering](/wiki/prompt_engineering) as an alternative to traditional fine-tuning. **Chain-of-thought prompting** (Wei et al., 2022) showed that including worked, step-by-step reasoning in the few-shot examples substantially improves accuracy on multi-step problems, especially for sufficiently large models.[15] A later zero-shot variant (Kojima et al., 2022) obtains a similar effect by simply appending a phrase such as "Let's think step by step" to the prompt.
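The three settings differ only in how many worked examples are placed in the prompt before the query. A minimal sketch of prompt assembly follows; the `Input:`/`Output:` layout and the function name `build_prompt` are illustrative assumptions, not a format prescribed by the GPT-3 paper.

```python
def build_prompt(task_description, examples, query):
    """Assemble a zero-, one-, or few-shot prompt from (input, output) pairs."""
    lines = [task_description, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # The final "Output:" is left open for the model to complete.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


examples = [("cheese", "fromage"), ("house", "maison")]
task = "Translate English to French."

zero_shot = build_prompt(task, [], "cat")
one_shot = build_prompt(task, examples[:1], "cat")
few_shot = build_prompt(task, examples, "cat")
print(few_shot)
```

No weights change between the three cases: the model is conditioned on a longer or shorter prefix and continues it with ordinary next-token prediction.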

## Scaling properties

Causal language models exhibit remarkably predictable **[scaling laws](/wiki/scaling_laws)**. Research by Kaplan et al. (2020) at OpenAI showed that a model's cross-entropy loss on held-out text follows a power-law relationship with three variables:[5]

1. **Number of parameters** ($N$)
2. **Size of the training dataset** ($D$, measured in tokens)
3. **Amount of training compute** ($C$, measured in FLOPs)

The loss decreases smoothly and predictably as any of these quantities increases, provided the other two are not the bottleneck, with some trends spanning more than seven orders of magnitude. The Kaplan paper expressed the relationship as $L(N) \propto N^{-\alpha_N}$, $L(D) \propto D^{-\alpha_D}$, and $L(C) \propto C^{-\alpha_C}$, with empirically fit exponents.[5]
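A power law is a straight line in log-log space, so an exponent such as $\alpha_N$ can be estimated with ordinary least squares on the logarithms. The sketch below fits synthetic, noise-free data generated with made-up constants; the numbers are not values from the Kaplan paper.

```python
import math


def fit_power_law(xs, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(x); returns (a, alpha)."""
    log_x = [math.log(x) for x in xs]
    log_l = [math.log(loss) for loss in losses]
    mean_x = sum(log_x) / len(log_x)
    mean_l = sum(log_l) / len(log_l)
    slope = (sum((x - mean_x) * (l - mean_l) for x, l in zip(log_x, log_l))
             / sum((x - mean_x) ** 2 for x in log_x))
    return math.exp(mean_l - slope * mean_x), -slope


# Synthetic data: loss = 5.0 * N^(-0.08), with constants invented for the demo.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

Once the exponent is fitted on small runs, the same line can be extrapolated to estimate the loss of a larger model before it is trained, which is how such fits are used for planning.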

In 2022, DeepMind's Chinchilla study (Hoffmann et al.) refined these findings, demonstrating that many existing models were undertrained: they had too many parameters for the amount of data they were trained on.[6] The study trained over 400 models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, and concluded that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion. In practice this works out to approximately 20 training tokens per parameter, far more than the roughly 1.7 tokens per parameter implied by Kaplan's earlier estimate.[6] To validate the recipe, the authors trained a 70-billion-parameter model, Chinchilla, on 1.4 trillion tokens using the same compute budget as the 280-billion-parameter Gopher, and reported that Chinchilla "uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B)."[6] This insight shifted the field's focus toward training smaller models on more data. [LLaMA](/wiki/llama) 7B, for example, was trained on 1 trillion tokens (roughly 140 tokens per parameter), far exceeding the Chinchilla-optimal ratio for its size; the extra training buys quality well above what a compute-optimal 7B model would reach while keeping the inference cost of a small model.[9]
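The 20-tokens-per-parameter rule can be turned into a back-of-the-envelope calculator. The sketch below additionally assumes the widely used approximation that training costs about $6ND$ FLOPs for $N$ parameters and $D$ tokens; that approximation is an assumption of this example, and the function names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20       # approximate Chinchilla ratio quoted above
FLOPS_PER_PARAM_TOKEN = 6   # assumed approximation: C ~ 6 * N * D


def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense model."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops):
    """Split a FLOP budget into (parameters, tokens) with tokens = 20 * parameters."""
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


budget = training_flops(70e9, 1.4e12)  # Chinchilla's reported configuration
n_params, n_tokens = compute_optimal(budget)
print(f"{budget:.2e} FLOPs -> {n_params / 1e9:.0f}B parameters, "
      f"{n_tokens / 1e12:.1f}T tokens")

# Tokens per parameter for LLaMA 7B trained on 1 trillion tokens.
print(f"LLaMA 7B: {1e12 / 7e9:.0f} tokens per parameter")
```

Feeding Chinchilla's own budget back through the rule recovers its 70-billion-parameter, 1.4-trillion-token configuration, while the LLaMA 7B ratio shows how far deliberately overtrained models sit from the compute-optimal point.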

These scaling laws have become essential planning tools for organizations building large causal language models, allowing them to forecast pre-training loss (and, more roughly, downstream performance) before committing to a run and to allocate compute budgets efficiently. Subsequent work has further refined the fitted curves and explored what happens when models are trained well beyond compute-optimal data ratios.

## Causal language models as the foundation of modern LLMs

Since the release of GPT-3 in 2020, the causal language modeling paradigm has become the dominant approach for building general-purpose AI systems. Several factors drive this dominance:

- **Simplicity of the training objective:** Next-token prediction requires no labeled data, enabling training on massive unlabeled corpora scraped from the internet.
- **Scalability:** The decoder-only architecture scales efficiently to hundreds of billions (and now trillions) of parameters.
- **Emergent abilities:** As models scale, they develop capabilities not explicitly trained for, including reasoning, translation, code generation, and mathematical problem-solving.
- **Adaptability through fine-tuning:** Pre-trained causal language models can be further adapted with [reinforcement learning from human feedback](/wiki/reinforcement_learning) (RLHF), parameter-efficient fine-tuning (such as [LoRA](/wiki/lora)), or supervised [fine-tuning](/wiki/fine_tuning) for specific applications.
- **Universal interface:** A single text-in, text-out API can be wrapped around any task, including image, audio, and video tasks once the relevant tokenizers are added.

Today, nearly every leading [large language model](/wiki/large_language_model), whether used for chatbots, search, coding assistance, or scientific research, is built on the causal language modeling framework.

## What are causal language models used for?

Causal language models have been deployed across a wide range of natural language processing tasks:

- **Text generation:** Producing coherent, contextually appropriate text for creative writing, summarization, and content drafting.
- **Conversational AI:** Powering chatbots and virtual assistants such as [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [Gemini](/wiki/gemini).
- **Code generation:** Writing, completing, and debugging source code in tools like [GitHub Copilot](/wiki/github_copilot) and [Cursor](/wiki/cursor).
- **[Machine translation](/wiki/machine_translation):** Translating text between languages by generating target-language tokens conditioned on source-language input.
- **Reasoning and problem-solving:** Answering complex questions, solving math problems, and performing multi-step logical reasoning.
- **[Text summarization](/wiki/text_summarization):** Condensing long documents into concise summaries while retaining essential information.
- **Tool use and agents:** Driving [agents](/wiki/ai_agent) that call external APIs, run code, and operate browsers, with the causal language model planning each step.
- **Embeddings and retrieval:** Recent decoder-only models are competitive with encoder models when fine-tuned for [retrieval](/wiki/information_retrieval), powering [retrieval-augmented generation](/wiki/retrieval_augmented_generation) systems.
- **Multimodal generation:** Causal language models with vision tokenizers can describe images, answer visual questions, and generate captions; with audio tokens they can transcribe and synthesize speech.

## Limitations and open problems

Despite their dominance, causal language models have well-known limitations:

- **Hallucination:** Because the training objective rewards plausible continuations rather than truthful ones, models can generate confident but incorrect statements. Mitigations include retrieval augmentation, tool use, and reinforcement learning against verifiable rewards.
- **Quadratic attention cost:** Standard causal attention costs $O(n^2)$ compute in the context length $n$, which limits practical context windows. Linear-attention variants, sparse attention, and recurrent alternatives such as [Mamba](/wiki/mamba) and RWKV aim to relax this constraint.
- **No native bidirectional context:** Tasks that require reading the full input before producing output (such as deep coreference resolution or token classification) can be harder for pure causal models than for encoder-based or prefix models, although in practice large enough decoder models close most of the gap.
- **Exposure bias and error accumulation:** Generation conditions on the model's own past outputs, so an early mistake can propagate. Reasoning training and self-consistency sampling help mitigate this.
- **Data quality and contamination:** Web-scale training data inevitably includes errors, duplicates, and copyright-sensitive material; benchmark contamination can inflate reported scores.
- **Energy and cost:** Training and serving frontier causal language models consumes large amounts of energy and GPU capacity, raising sustainability and concentration concerns.
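The quadratic term in the attention bullet above follows from counting query-key pairs under the causal mask. A minimal sketch, counting pairs for a single attention head and ignoring constant factors such as head dimension:

```python
def causal_attention_pairs(n):
    """Number of (query, key) pairs one causal attention head scores for n tokens."""
    # Position i attends to positions 0..i, so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


for n in (4096, 8192, 16384):
    print(n, causal_attention_pairs(n))

# Doubling the context length roughly quadruples the work.
ratio = causal_attention_pairs(8192) / causal_attention_pairs(4096)
print(f"{ratio:.2f}")
```

The causal mask halves the work relative to full bidirectional attention, but the growth rate stays quadratic, which is what the linear and recurrent alternatives try to avoid.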

These open problems are active areas of research, and progress on each tends to ripple back into the broader landscape of language model design.

## Explain like I'm 5 (ELI5)

Imagine you are playing a word game where you have to guess the next word in a sentence. Your friend says "The cat sat on the..." and you guess "mat" because that makes the most sense based on the words you already heard. A causal language model works the same way. It reads words from left to right, one at a time, and tries to guess which word comes next. It never peeks ahead at words it has not seen yet. The more sentences it practices with, the better it gets at guessing. This is how computers learn to write stories, answer questions, and even have conversations.

## See also

- [Language model](/wiki/language_model)
- [Large language model](/wiki/large_language_model)
- [Transformer](/wiki/transformer)
- [Self-attention](/wiki/self_attention)
- [Next-token prediction](/wiki/next_token_prediction)
- [Autoregressive model](/wiki/autoregressive_model)
- [Masked language model](/wiki/masked_language_model)
- [BERT](/wiki/bert)
- [GPT](/wiki/gpt)
- [GPT-3](/wiki/gpt-3)
- [GPT-4](/wiki/gpt-4)
- [Llama](/wiki/llama)
- [Mistral](/wiki/mistral)
- [KV cache](/wiki/kv_cache)
- [Speculative decoding](/wiki/speculative_decoding)
- [Scaling laws](/wiki/scaling_laws)
- [RLHF](/wiki/rlhf)

## References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30. arXiv:1706.03762.
2. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." *OpenAI*.
3. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." *OpenAI*.
4. Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33. arXiv:2005.14165.
5. Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
6. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.
7. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *NAACL-HLT*. arXiv:1810.04805.
8. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." *ICLR*. arXiv:1904.09751.
9. Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
10. Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research*, 21(140), 1-67.
11. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., & Khudanpur, S. (2010). "Recurrent neural network based language model." *Interspeech*.
12. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research*, 3, 1137-1155.
13. Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
14. Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*, 35. arXiv:2203.02155.
15. Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." *NeurIPS*. arXiv:2201.11903.
16. Leviathan, Y., Kalman, M., & Matias, Y. (2022). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192.
17. Cai, T., Li, Y., Geng, Z., et al. (2024). "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv:2401.10774.
18. Kwon, W., Li, Z., Zhuang, S., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP*. arXiv:2309.06180.
19. Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150.
20. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., & Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245.
21. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073.
22. Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). "Deep Reinforcement Learning from Human Preferences." *NeurIPS*. arXiv:1706.03741.
23. Tay, Y., Dehghani, M., Tran, V. Q., et al. (2022). "UL2: Unifying Language Learning Paradigms." arXiv:2205.05131.
24. Hugging Face. "Causal language modeling." *Transformers Documentation*. https://huggingface.co/docs/transformers/en/tasks/language_modeling
25. Hu, K. (2023). "ChatGPT sets record for fastest-growing user base - analyst note." *Reuters*, February 1, 2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/

