Causal Language Model

A causal language model (CLM), also called an autoregressive language model or a decoder-only language model, is a type of language model that generates text by predicting the next token in a sequence based solely on the tokens that precede it. Because each prediction depends only on the past context (and never on future tokens), the modeling direction is strictly left-to-right, mirroring the causal flow of natural writing and speech. Causal language models form the backbone of virtually all modern large language models, including the GPT family, LLaMA, Claude, Gemini, and Mistral.

The term "causal" refers to the directional structure of the model rather than to any notion of causality in the philosophical or statistical sense. In a causal language model, information at position $t$ depends only on positions $1$ through $t-1$, just as a cause must precede an effect in time. This restriction is enforced architecturally through a triangular attention mask, which is why these models are sometimes called "masked attention" or "look-ahead masked" transformers. Despite the simplicity of the underlying objective (predict the next token), causal language models trained at scale have proven capable of translation, summarization, code generation, multi-step reasoning, and open-ended dialogue, making the paradigm the workhorse of contemporary generative AI.

How causal language modeling works

At its core, a causal language model learns a probability distribution over sequences of tokens. Given a sequence of tokens $x_1, x_2, \ldots, x_{t-1}$, the model estimates the conditional probability of the next token:

$$P(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

The full probability of a sequence is then the product of these conditional probabilities (by the chain rule of probability):

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$

This autoregressive factorization means the model generates text one token at a time, always conditioning on everything produced so far. During inference, each newly generated token is appended to the context and fed back into the model to predict the subsequent token. This cycle repeats until a stopping condition is met, such as reaching a special end-of-sequence token, a stop string supplied by the application, or a maximum length set by the caller.
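
As a rough illustration of the generation loop and the chain-rule factorization described above, the following Python sketch replaces the trained network with a hand-written stand-in. The vocabulary, the `toy_next_token_probs` rule, and all function names are hypothetical and exist only for this example.

```python
import numpy as np

VOCAB = ["<eos>", "the", "cat", "sat"]  # toy vocabulary (hypothetical)
EOS_ID = 0

def toy_next_token_probs(prefix):
    """Stand-in for a trained model: returns P(x_t | prefix) over VOCAB.

    Hypothetical rule: strongly prefer the chain the -> cat -> sat -> <eos>.
    """
    probs = np.full(len(VOCAB), 0.05)
    last = prefix[-1] if prefix else None
    target = {None: 1, 1: 2, 2: 3, 3: EOS_ID}.get(last, EOS_ID)
    probs[target] = 0.85
    return probs / probs.sum()

def generate(max_new_tokens=10):
    """Greedy autoregressive loop: predict, append, repeat until EOS or the length cap."""
    tokens = []
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(tokens)
        next_id = int(np.argmax(probs))
        if next_id == EOS_ID:      # stopping condition: end-of-sequence token
            break
        tokens.append(next_id)     # feed the new token back in as context
    return tokens

def sequence_log_prob(tokens):
    """Chain rule: log P(x_1..x_T) is the sum over t of log P(x_t | x_<t)."""
    return sum(
        float(np.log(toy_next_token_probs(tokens[:t])[tokens[t]]))
        for t in range(len(tokens))
    )

generated = generate()
print([VOCAB[i] for i in generated])
print(sequence_log_prob(generated))
```

A real system would replace `toy_next_token_probs` with a forward pass through the network and would usually sample rather than take the argmax.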

The choice of factorization is not arbitrary. Although the chain rule of probability holds for any ordering of variables, left-to-right factorization aligns with how humans produce language and allows efficient training using dense supervision. Every position in the input acts simultaneously as both an input and a target, so a single forward pass over a sequence of length $T$ produces $T$ training signals. Right-to-left or arbitrary-order factorizations are mathematically valid but rarely used because they break this convenient structure and complicate downstream generation.

Architecture

Decoder-only transformers

Most causal language models are built on the decoder-only variant of the transformer architecture introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The architecture consists of a stack of identical layers, each containing two main sub-layers:

Masked (causal) self-attention: Computes contextual representations by allowing each token position to attend only to itself and all preceding positions.
Position-wise feed-forward network: Applies a two-layer neural network independently to each position, typically with a hidden width 4 times the model dimension.

Residual connections and layer normalization are applied around each sub-layer. A positional encoding scheme (sinusoidal, learned, or rotary) provides the model with information about token order, since the attention operation itself carries no built-in notion of where a token sits in the sequence. Recent models such as LLaMA and Mistral use rotary positional embeddings (RoPE), which encode position by rotating query and key vectors in pairs, allowing the model to extrapolate (with some loss) beyond training context lengths.
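
The pairwise rotation behind rotary embeddings can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the base of 10000 and the choice to pair adjacent dimensions are assumptions of the sketch (real implementations differ in how they pair dimensions), and the function name `rope` is hypothetical.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of a 1-D vector by position-dependent angles."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one rotation frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin      # standard 2-D rotation of each pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The query-key score depends only on the relative offset between positions:
score_a = float(rope(q, 5) @ rope(k, 3))     # offset of 2
score_b = float(rope(q, 12) @ rope(k, 10))   # same offset, shifted positions
print(score_a, score_b)
```

Because the attention score depends only on the difference between the two positions, the same rotation scheme applies at any absolute offset, which is the property the paragraph above relies on.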

The original transformer paper described an encoder-decoder model for machine translation in which an encoder built bidirectional representations of the source sentence and a decoder generated the target sentence one token at a time. The decoder-only causal language model strips away the encoder, leaving only the autoregressive component, and feeds the entire input directly into the stacked decoder layers. This simpler design has come to dominate the field because it scales cleanly and unifies understanding and generation into a single network.

Causal masking in self-attention

The defining architectural feature of a causal language model is causal masking (also called the "causal attention mask," "look-ahead mask," or "autoregressive mask"). In standard self-attention, every token can attend to every other token in the sequence. Causal masking restricts this so that position $t$ can only attend to positions $1$ through $t$.

Mechanically, the mask is implemented by adding a mask matrix to the attention scores before the softmax operation. The mask is zero on and below the main diagonal and negative infinity ($-\infty$) strictly above it. When the masked scores pass through softmax, the $-\infty$ entries become zero weights, preventing any information flow from future positions. The resulting attention weight matrix is lower-triangular, ensuring that each token's representation is computed only from the current and previous tokens.

For a sequence of length $T$, the unmasked attention weight matrix would have shape $T \times T$ with all entries potentially nonzero. After causal masking, only the lower triangle (including the diagonal) contains nonzero weights, giving the matrix a stair-step appearance:

[ a11   0    0    0  ]
[ a21  a22   0    0  ]
[ a31  a32  a33   0  ]
[ a41  a42  a43  a44 ]
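
The same stair-step pattern can be reproduced numerically. The following NumPy sketch applies a causal mask to random attention scores for a sequence of length 4; the function name and the random inputs are illustrative assumptions, and a production kernel would not materialize the mask this way.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to a (T, T) score matrix and normalize rows with softmax."""
    T = scores.shape[0]
    # True strictly above the diagonal: the "future" positions for each row.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax; exp(-inf) evaluates to exactly 0.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))  # lower-triangular, each row sums to 1
```

The first row always has a single weight of 1 on the diagonal, because the first token has nothing earlier to attend to.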

This masking serves two purposes:

During training, it allows the model to process an entire sequence in a single forward pass while still making independent predictions for each position. Without the mask, the model could "cheat" by looking at the token it is supposed to predict.
During inference, it ensures that the model's generation process is coherent and consistent with its training regime. A token generated at step $t$ will only ever depend on the prefix it actually saw during training.

In practice, modern transformer implementations skip the explicit mask matrix when possible. Libraries such as FlashAttention compute causal attention directly using fused kernels that simply do not load the future tokens, which avoids the wasted memory of materialising a triangular mask and yields significant speedups on long sequences.

Tokenization and embeddings

Before a causal language model can process text, raw strings must be converted into a sequence of integer token IDs. Modern systems use subword tokenization algorithms, most commonly byte-pair encoding (BPE), WordPiece, or the unigram language model (the latter two BPE and unigram variants are often used through the SentencePiece library), which strike a balance between vocabulary size and the ability to represent rare words. A vocabulary of 32,000 to 200,000 subword units is typical for production models. Token IDs index into a learned embedding matrix that maps each ID to a dense vector of size $d_{\text{model}}$, typically 768 to 12,288 dimensions depending on model scale. The embedding matrix is often tied to the output projection ("weight tying"), which reduces parameter count and slightly improves perplexity.

Output head

After the final transformer layer, a linear projection (the language modeling head) maps each hidden state to a vector of vocabulary-sized logits. A softmax over these logits produces a probability distribution over the next token. Some systems compute the softmax only over a subset of the vocabulary during inference (a technique known as "vocabulary pruning") or replace it with sampled softmax variants during training to reduce compute, but the standard recipe applies a full softmax at every position.
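
To make the path from token IDs to next-token probabilities concrete, the sketch below performs an embedding lookup and then reuses the same matrix as the output projection (weight tying). The sizes, the random weights, and the omission of the transformer layers are all simplifying assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 4            # toy sizes, far smaller than production models
embedding = rng.normal(size=(vocab_size, d_model))

token_ids = np.array([3, 1, 7])
hidden = embedding[token_ids]          # embedding lookup, shape (T, d_model)

# The transformer layers would transform `hidden` here; omitted in this sketch.

logits = hidden @ embedding.T          # weight tying: the embedding doubles as the LM head
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(logits.shape)                    # one row of vocabulary-sized logits per position
print(probs.sum(axis=-1))              # each row is a probability distribution
```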

Training objective

Next-token prediction and cross-entropy loss

The standard training objective for causal language models is next-token prediction, optimized using cross-entropy loss (equivalently, negative log-likelihood). For a training sequence of length $T$, the loss is:

$$\mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \ldots, x_{t-1})$$

where $P_{\theta}$ is the model's predicted probability distribution parameterized by weights $\theta$. Minimizing this loss is equivalent to maximizing the likelihood of the training data under the model. The mean cross-entropy across a held-out corpus is often reported in two equivalent forms: the loss itself in nats (or bits, when log base 2 is used) and the perplexity, defined as $\exp(\mathcal{L})$. A model with a lower perplexity assigns higher probability to the held-out text and is generally considered a stronger language model.
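
A small worked example may help connect the loss, its value in bits, and perplexity. The sketch below assumes a hypothetical model that is uniformly unsure over an 8-token vocabulary; the function name and inputs are illustrative only.

```python
import numpy as np

def cross_entropy_nats(probs, targets):
    """Mean negative log-likelihood of the observed tokens, in nats.

    probs[t] is the model's distribution P(. | x_1..x_{t-1}) over the vocabulary;
    targets[t] is the token x_t that actually occurred at that position.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=int)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.log(picked).mean())

# Toy case: every position assigns probability 1/8 to every token.
uniform = np.full((5, 8), 1.0 / 8.0)
loss = cross_entropy_nats(uniform, [3, 1, 4, 1, 5])
loss_bits = loss / np.log(2)
perplexity = float(np.exp(loss))

print(loss, loss_bits, perplexity)  # ln(8) nats, 3 bits, perplexity 8
```

The result matches the usual reading of perplexity as an effective branching factor: a model that is uniformly unsure among 8 options has perplexity 8.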

During training, the model receives a full sequence and, thanks to causal masking, computes the loss for every position in parallel. This technique, called teacher forcing, is far more efficient than generating tokens one at a time, because the ground-truth tokens (rather than the model's own predictions) are used as inputs for each step. Teacher forcing introduces a known mismatch between training and inference (called "exposure bias"), since at inference time the model conditions on its own (possibly imperfect) outputs rather than ground-truth tokens. In practice this mismatch is small enough at scale that pure teacher forcing remains the default training recipe.

Multi-token prediction

A recent line of research replaces single-token next-token prediction with multi-token prediction (MTP), in which the model is trained to predict the next several tokens simultaneously using parallel output heads. DeepSeek used a variant of MTP in DeepSeek-V3 (2024) to densify the training signal and improve sample efficiency. Multi-token prediction also pairs naturally with speculative decoding at inference time, where the additional heads can be reused as a built-in draft model.

Regularization and optimization

Several techniques are commonly applied to reduce overfitting, stabilize optimization, and make training more efficient:

Technique	Description
Dropout	Randomly zeroes a fraction of activations during training to prevent co-adaptation
Weight decay	Shrinks parameters toward zero at each step, either as an L2 penalty on the loss or as the decoupled decay used by AdamW
Learning rate scheduling	Uses warmup followed by cosine or linear decay of the learning rate
Gradient clipping	Caps gradient magnitudes to stabilize training
Layer normalization	Normalizes activations within each layer to reduce internal covariate shift
Mixed precision	Runs most computation in BF16 or FP16, usually alongside FP32 master weights, to cut memory use and raise throughput
Sequence packing	Concatenates multiple short documents into one long training sequence to avoid padding waste
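
The learning rate scheduling row can be illustrated with a linear warmup followed by cosine decay. This is a minimal sketch: the peak rate, floor, and step counts are placeholder values chosen for the example, not recommendations from any specific training run.

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=10000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return min_lr + (peak_lr - min_lr) * cosine

for step in (0, 500, 1000, 5500, 10000):
    print(step, lr_at_step(step))
```

Warmup keeps early updates small while optimizer statistics are still noisy, and the decay lets the model settle as training ends.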

Modern causal language models are typically trained using the AdamW optimizer with mixed-precision (FP16 or BF16) arithmetic to reduce memory usage and accelerate computation. At very large scales, additional tricks become necessary: gradient checkpointing trades recomputation for memory, ZeRO sharding partitions optimizer state across devices, and pipeline parallelism splits the model depth-wise across nodes. Training a 100-billion-parameter causal language model from scratch typically requires on the order of a million or more GPU- or TPU-hours and a carefully tuned data-parallel, tensor-parallel, and pipeline-parallel topology.

A brief history

From n-grams to neural language models

Language modeling predates deep learning by decades. Early n-gram models estimated $P(x_t \mid x_{t-n+1}, \ldots, x_{t-1})$ by counting word occurrences in a training corpus, with smoothing techniques (such as Kneser-Ney) to handle unseen contexts. n-gram models are causal in the trivial sense that they only condition on prior tokens, but they are limited to short, fixed-length contexts and cannot share statistical strength across semantically related contexts.

In 2003, Bengio et al. proposed the first widely cited neural probabilistic language model, which embedded words into a continuous vector space and used a feed-forward neural network to predict the next word. This work introduced the idea that distributed representations could overcome the data sparsity problem that plagued n-gram models.

The breakthrough that enabled long-range causal language modeling was the recurrent neural network (RNN). Mikolov et al. (2010) showed that RNN language models substantially outperformed traditional n-gram baselines on speech recognition benchmarks. RNN language models are inherently causal because their hidden state at time $t$ is computed only from prior hidden states and the current input. The introduction of long short-term memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs, Cho et al., 2014) further improved the ability of recurrent models to capture long-range dependencies.

The transformer revolution

In 2017, Vaswani et al. introduced the transformer in "Attention Is All You Need." The original transformer was an encoder-decoder model for machine translation, but the decoder component (with causal self-attention) provided a template for a new class of language models. Unlike RNNs, transformers process every token position in parallel, which makes them dramatically more efficient on modern accelerators. Causal masking provides the autoregressive structure that RNNs got from their recurrence.

In June 2018, OpenAI published "Improving Language Understanding by Generative Pre-Training" (Radford et al.), which introduced the first Generative Pre-trained Transformer, retroactively named GPT-1. The model was a 117-million-parameter decoder-only transformer pre-trained on the BooksCorpus dataset using the standard next-token-prediction objective, then fine-tuned for downstream tasks. The combination of unsupervised pre-training plus supervised fine-tuning achieved state-of-the-art on many natural language understanding benchmarks. Although BERT (Devlin et al., 2018) launched four months later and dominated discriminative benchmarks for several years, the GPT line continued to push generative pre-training and would eventually carry the field.

Scaling and the GPT family

GPT-2, released in February 2019, scaled the GPT-1 architecture to 1.5 billion parameters and trained on 40 GB of web text scraped from outbound Reddit links (the WebText dataset). The model demonstrated strong zero-shot performance on tasks it had never been explicitly trained for, sparking a wave of interest in unsupervised multitask learning. OpenAI's controversial decision to delay the release of the largest GPT-2 weights (citing misuse concerns) marked the beginning of public debate about the safety of releasing large language models.

GPT-3, introduced in May 2020, took the same architecture to 175 billion parameters and trained on roughly 300 billion tokens drawn from Common Crawl, WebText2, two book corpora, and English Wikipedia. The headline finding of the GPT-3 paper (Brown et al., 2020) was in-context learning: with no gradient updates, the model could perform new tasks given only a few demonstrations in the prompt. This shifted the field's focus from supervised fine-tuning to prompt-based interaction.

InstructGPT (early 2022) and the GPT-3.5 series that followed later that year added reinforcement learning from human feedback (RLHF) to align the base model with human preferences. A chat-tuned GPT-3.5 variant powered the public launch of ChatGPT in November 2022, which reached an estimated 100 million users within two months and triggered the modern wave of generative AI investment.

GPT-4, released in March 2023, introduced multimodal input (images and text), substantially better reasoning, and longer context windows. OpenAI did not disclose architectural details, but external sources have estimated parameter counts in the trillions and a mixture-of-experts structure. GPT-4o (May 2024) added native voice and image generation; GPT-4.5 and GPT-5 continued the line through the mid-2020s.

Beyond GPT

The causal language modeling paradigm spread quickly to other organizations. Notable open-weight families include:

Model family	Organization	First release	Defining contribution
LLaMA	Meta	2023	Trained 7B-65B models far past Chinchilla-optimal token counts; released openly
Mistral	Mistral AI	2023	Sliding-window attention and grouped-query attention in a 7B model
Mixtral	Mistral AI	2023	Open-weight sparse mixture of experts
Falcon	TII	2023	Open 40B/180B models trained on the RefinedWeb dataset
Qwen	Alibaba	2023	Strong multilingual and Chinese-language performance
DeepSeek	DeepSeek	2023	Multi-token prediction and aggressive MoE scaling
Gemma	Google	2024	Lightweight open variants of Gemini-class architectures

Proprietary causal language models such as Claude (Anthropic), Gemini (Google DeepMind), and Grok (xAI) follow the same decoder-only template, with proprietary modifications to attention, training data, and post-training pipelines.

The GPT family in detail

The GPT (Generative Pre-trained Transformer) series from OpenAI is the most well-known line of causal language models. Each generation has dramatically scaled up in size and capability.

Model	Year	Parameters	Training Data	Key Contribution
GPT-1	2018	117 million	BooksCorpus (~800M words)	Demonstrated generative pre-training followed by discriminative fine-tuning
GPT-2	2019	1.5 billion	WebText (40 GB)	Showed unsupervised multitask learning; zero-shot task transfer
GPT-3	2020	175 billion	~300 billion tokens	Introduced in-context few-shot learning without fine-tuning
GPT-3.5 / InstructGPT	2022	~175 billion	~300B tokens plus RLHF data	Added RLHF for instruction following; powered ChatGPT
GPT-4	2023	Undisclosed	Undisclosed	Multimodal capabilities (text and image input); strong reasoning
GPT-4o	2024	Undisclosed	Undisclosed	Natively multimodal with audio and image generation
GPT-5	2025	Undisclosed	Undisclosed	Unified reasoning and chat in a single model

Beyond GPT, many other causal language models have been developed, including LLaMA (Meta), PaLM (Google), Falcon (TII), Mistral (Mistral AI), and Claude (Anthropic).

Causal vs masked vs prefix language models

Causal language models are often compared with masked language models (MLMs) such as BERT and with prefix language models (PrefixLMs) such as T5. The three paradigms differ fundamentally in their training objectives, attention patterns, and downstream strengths.

Feature	Causal LM (e.g., GPT)	Masked LM (e.g., BERT)	Prefix LM (e.g., T5, UL2)
Attention pattern	Lower-triangular (causal)	Full bidirectional	Bidirectional on prefix, causal on target
Training objective	Predict the next token	Predict randomly masked tokens	Predict the target span given the prefix
Context used	Only preceding tokens	Both preceding and following tokens	Bidirectional within prefix; causal within target
Architecture	Decoder-only transformer	Encoder-only transformer	Encoder-decoder, or unified prefix masking
Primary strength	Open-ended text generation	Text understanding and classification	Conditional generation (translation, summarization)
Example use cases	Chatbots, code generation, story writing	Sentiment analysis, NER, retrieval	Translation, structured generation

Because masked language models have access to bidirectional context, they tend to produce richer representations for understanding tasks. However, they are not naturally suited for text generation, since they cannot predict text sequentially. Causal language models, by contrast, are inherently generative and have become the dominant paradigm for building conversational AI systems, code assistants, and general-purpose large language models.

Why decoder-only "won"

Through roughly 2019 to 2021, encoder-decoder models such as T5 (Raffel et al., 2020) and BART (Lewis et al., 2019) were widely viewed as state-of-the-art for conditional generation tasks. By 2023, however, decoder-only causal language models had largely displaced them. Several factors drove this shift:

Single objective: Next-token prediction works for any text-to-text task once both input and output are concatenated, removing the need for task-specific encoder-decoder pretraining.
Unified scaling: A single decoder stack can be scaled with a single set of hyperparameters, whereas encoder-decoder models split parameters across two stacks.
In-context learning: GPT-3 demonstrated that decoder-only models could learn new tasks from prompt examples without fine-tuning. Encoder-decoder models lagged on this capability because their decoder did not see prefix examples through bidirectional attention.
KV cache: Decoder-only inference reuses keys and values from previous tokens, making generation efficient. Encoder-decoder models recompute encoder activations for every new prompt, which wastes work in chat settings.
Tooling momentum: Once the major industrial labs aligned on decoder-only architectures, libraries, accelerators, and inference engines were optimised primarily for that case, creating a self-reinforcing ecosystem.

Prefix language models

A prefix language model (PrefixLM) occupies a middle ground between causal and masked language models. In a PrefixLM, an initial "prefix" portion of the input is processed with bidirectional (non-causal) attention, while the remaining tokens are generated autoregressively with causal masking.

This design is useful for tasks where a fixed input (the prefix) must be fully understood before generating an output. For example, in a translation task, the source sentence can serve as the prefix with full bidirectional attention, and the target sentence is generated causally. Research has shown that PrefixLMs can outperform pure causal language models on in-context learning tasks because they allow demonstration examples within the prefix to attend to one another freely, rather than being restricted by left-to-right masking.

Models that use or have explored PrefixLM objectives include T5, UL2 (Tay et al., 2022), and certain configurations of PaLM.

Decoding strategies for text generation

A causal language model produces a probability distribution over the vocabulary at each generation step. The method used to select the next token from this distribution is called a decoding strategy (or sampling strategy). Different strategies trade off between output quality, diversity, and computational cost.

Strategy	How it works	Characteristics
Greedy decoding	Always selects the highest-probability token	Fast but often produces repetitive, generic text
Beam search	Maintains the top-$k$ most probable partial sequences and selects the best overall	Better than greedy for short outputs; can still be repetitive
Top-$k$ sampling	Samples from the $k$ most probable tokens	Introduces diversity; the fixed $k$ may be too narrow or too wide
Top-$p$ (nucleus) sampling	Samples from the smallest set of tokens whose cumulative probability exceeds $p$	Dynamically adapts the candidate set; widely used in practice
Temperature scaling	Divides logits by a temperature parameter $\tau$ before softmax	$\tau < 1$ sharpens the distribution; $\tau > 1$ flattens it
Min-$p$ sampling	Filters tokens below a minimum probability relative to the top token	A recent alternative that combines benefits of top-$k$ and top-$p$
Mirostat	Targets a specified perplexity by dynamically adjusting truncation	Aims for stable, surprise-controlled output
Contrastive search	Penalizes tokens whose representations are similar to recent context	Reduces degenerate repetition without aggressive truncation
Typical sampling	Selects tokens whose information content is close to the conditional entropy	Aims for "locally typical" generations

In practice, modern systems often combine several of these strategies. For instance, a chatbot might use top-$p$ sampling with a moderate temperature and a repetition penalty to balance coherence and creativity. Holtzman et al. (2020), "The Curious Case of Neural Text Degeneration," provided the canonical justification for nucleus sampling by showing that high-likelihood beam-search outputs collapse into repetitive loops while truncated stochastic sampling avoids this failure mode.
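
Several of the strategies in the table compose naturally. The sketch below applies temperature scaling, then top-$k$ and top-$p$ truncation, and renormalizes what remains before sampling. The function names and the four-token distribution are hypothetical, and the order in which the filters are applied is one reasonable choice rather than a standard.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def filter_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return the renormalized next-token distribution after the chosen filters."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    order = np.argsort(-probs)               # token ids by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False          # drop everything outside the k best
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Smallest prefix of the sorted tokens whose cumulative mass exceeds top_p.
        cutoff = int(np.searchsorted(cumulative, top_p, side="right")) + 1
        keep[order[cutoff:]] = False
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

logits = np.log([0.5, 0.3, 0.15, 0.05])      # toy distribution over four tokens
nucleus = filter_probs(logits, top_p=0.7)    # keeps the two most likely tokens
sharp = filter_probs(logits, temperature=0.5)

rng = np.random.default_rng(0)
next_token = int(rng.choice(len(nucleus), p=nucleus))
print(nucleus, sharp, next_token)
```

With `top_p=0.7` the first token alone (mass 0.5) does not exceed the threshold, so the second is added and the pair is renormalized to 0.625 and 0.375.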

Penalties and constraints

Decoding is often combined with auxiliary penalties to shape outputs further. Repetition penalties divide the logit of any token already in the context by a fixed factor, discouraging verbatim loops. Frequency and presence penalties, exposed by the OpenAI API, reduce logits proportionally to how often or whether a token has appeared. Logit bias lets a developer add or subtract a fixed amount to specific token IDs, which is useful for forbidding certain words or steering toward a particular format. Constrained decoding (using grammars or regular expressions) restricts the model to outputs that conform to a schema, which is widely used for structured tool use and JSON generation.
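
The penalties described above are simple adjustments to the logits before sampling. The following sketch is an assumption-laden illustration rather than any provider's exact implementation: the function name is hypothetical, the handling of negative logits (multiplying instead of dividing, so the penalty always lowers probability) is a detail added here that the description above does not specify, and the frequency and presence terms are written as a count-proportional and a flat subtraction respectively.

```python
import numpy as np

def apply_penalties(logits, context_ids, repetition=1.0, frequency=0.0, presence=0.0):
    """Return a penalized copy of the logits given the token ids already in context."""
    logits = np.array(logits, dtype=float)   # work on a copy
    counts = np.bincount(np.asarray(context_ids, dtype=int), minlength=len(logits))
    seen = counts > 0
    # Repetition penalty: divide positive logits, multiply negative ones,
    # so that a seen token always becomes less likely.
    logits[seen] = np.where(logits[seen] > 0,
                            logits[seen] / repetition,
                            logits[seen] * repetition)
    # Frequency penalty grows with the count; presence penalty is a flat cost.
    logits -= frequency * counts + presence * seen
    return logits

raw = [2.0, -1.0, 0.5, 3.0]
context = [0, 0, 1]                          # token 0 appeared twice, token 1 once

print(apply_penalties(raw, context, repetition=2.0))
print(apply_penalties(raw, context, frequency=0.5, presence=0.1))
```

Tokens 2 and 3 never appeared in the context, so their logits are unchanged in both calls.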

Inference: prefill, decode, and the KV cache

Two phases of inference

Serving a causal language model in production has two distinct computational phases:

Prefill (prompt processing): The model ingests the user's prompt of length $n$ in parallel and computes attention keys and values for every position. This phase is compute-bound on most accelerators because it performs $O(n^2)$ attention work in a single pass.
Decode (token generation): The model generates output tokens one at a time. Each new token requires only $O(n)$ attention work because the keys and values for prior tokens are reused from the cache. This phase is typically memory-bandwidth-bound: the bottleneck is reading model weights and cached keys and values from HBM rather than performing computations.

The practical consequence is that prefill latency scales with prompt length while per-token decode latency is roughly constant for a given model and hardware. Time-to-first-token (TTFT) is dominated by prefill; tokens-per-second after the first token reflects decode throughput.

The KV cache

During decode, every transformer layer computes a fresh attention query for the new token but reuses the keys and values from previous tokens. To avoid recomputing them, modern inference engines maintain a KV cache: a per-layer buffer that stores key and value tensors for every position processed so far. Memory use grows linearly with context length, and at long contexts the KV cache can dwarf the model weights themselves. A 70-billion-parameter model with 80 layers, a head dimension of 128, and 8 key-value heads (grouped-query attention over 64 query heads) consumes roughly 320 KB of KV per token in BF16, so a 100,000-token context occupies about 32 GB just for the cache. With 64 full key-value heads the same model would need about 2.5 MB per token, or roughly 260 GB at that context length.
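
The per-token cache size follows from a short product: two tensors (keys and values) per layer, times the number of key-value heads, the head dimension, and the bytes per value. The sketch below works through that arithmetic for an 80-layer model with head dimension 128 in BF16; the choice of 8 key-value heads (a grouped-query configuration) versus 64 (full multi-head attention) is an assumption made here to show how strongly the result depends on it.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token; the factor 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

context_len = 100_000

per_token_gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)
per_token_mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)

print(per_token_gqa / 1024, "KiB per token with 8 KV heads")
print(per_token_mha / 1024, "KiB per token with 64 KV heads")
print(per_token_gqa * context_len / 1e9, "GB at 100k tokens with 8 KV heads")
print(per_token_mha * context_len / 1e9, "GB at 100k tokens with 64 KV heads")
```

The eightfold gap between the two configurations is why sharing key-value heads is one of the most effective ways to shrink the cache.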

Several architectural tricks reduce KV cache pressure:

Multi-query attention (Shazeer, 2019) shares a single key and value head across all query heads, cutting KV size by the number of heads.
Grouped-query attention (GQA, Ainslie et al., 2023) shares keys and values across small groups of query heads, balancing quality and memory.
Multi-head latent attention (used in DeepSeek-V2 and V3) projects keys and values through a low-rank latent space, reducing storage further.
PagedAttention (vLLM, Kwon et al., 2023) manages the cache using fixed-size blocks similar to virtual memory pages, eliminating fragmentation across concurrent requests.

Speculative decoding

Speculative decoding (Leviathan et al., 2022, arXiv 2211.17192) accelerates causal language model inference by running a small "draft" model and a large "target" model in tandem. The draft model proposes several tokens autoregressively, then the target model verifies them all in a single parallel forward pass. Tokens that match the target distribution (according to a rejection-sampling criterion) are accepted; the first mismatch is replaced with a sample from the corrected distribution and the process repeats. The output distribution is provably identical to greedy or sampled decoding from the target model alone, but throughput can roughly double because the target's expensive forward pass produces several output tokens at once.
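
The accept-or-resample rule for a single drafted token can be sketched as follows. This is a simplified illustration under stated assumptions: it handles one position at a time with fixed toy distributions `p` (target) and `q` (draft), whereas a real system verifies several drafted positions per target forward pass; the function name is hypothetical.

```python
import numpy as np

def verify_draft_token(p_target, q_draft, draft_token, rng):
    """One accept/reject step. Returns (token, accepted)."""
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p_target[draft_token] / q_draft[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True
    # On rejection, resample from the corrected distribution max(p - q, 0), renormalized.
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(residual), p=residual)), False

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # target model's next-token distribution (toy values)
q = np.array([0.3, 0.5, 0.2])   # draft model's next-token distribution (toy values)

samples = []
for _ in range(20000):
    draft = int(rng.choice(len(q), p=q))          # the draft model proposes a token
    token, _ = verify_draft_token(p, q, draft, rng)
    samples.append(token)

empirical = np.bincount(samples, minlength=len(p)) / len(samples)
print(empirical)   # should be close to p, not q
```

The empirical frequencies track the target distribution even though every proposal came from the draft model, which is the sense in which the method leaves the output distribution unchanged.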

Medusa (Cai et al., 2024) modifies the target model itself by adding extra prediction heads that each forecast a token several positions ahead. The model can then propose tokens without an external draft network. EAGLE (Li et al., 2024) trains a lightweight head on intermediate hidden states to predict next-token distributions more accurately, raising the acceptance rate of speculative decoding. Lookahead decoding uses Jacobi-style fixed-point iteration over a window of tokens, removing the draft model entirely.

Continuous batching

Because user requests arrive at different times and finish at different times, naive batching wastes compute by waiting for the longest request in a batch. Continuous batching (also called "in-flight batching," introduced by Orca and popularized by vLLM) lets new requests join an in-flight batch at the next decode step and lets finished requests leave immediately. Combined with PagedAttention, continuous batching dramatically increases throughput on shared production servers.

Post-training

A modern causal language model is rarely deployed as a raw next-token predictor. After pre-training on web-scale text, the base model is refined through a multi-stage post-training pipeline:

Supervised fine-tuning (SFT): The model is fine-tuned on a curated set of instruction-response pairs written by humans or distilled from a stronger model. SFT teaches the model the format of helpful responses.
Reward modeling: A separate model is trained on human comparisons ("response A is better than response B") to predict a scalar quality score.
Reinforcement learning from human feedback (RLHF): The SFT model is further fine-tuned to maximize the reward model's score, typically using proximal policy optimization (PPO). Direct preference optimization (DPO) is a common alternative that optimizes the policy directly on the preference pairs without a separate reward model. Learning from human preference comparisons was introduced for deep reinforcement learning by Christiano et al. (2017) and applied to instruction following by OpenAI's InstructGPT paper (Ouyang et al., 2022).
Reinforcement learning from AI feedback (RLAIF) and constitutional AI: Anthropic's Constitutional AI recipe (Bai et al., 2022) replaces some human comparisons with judgments from a constitution-following model, scaling alignment data more cheaply.
Reasoning training: Models such as OpenAI's o1, o3, and DeepSeek-R1 add a final stage of reinforcement learning that rewards correct multi-step reasoning, dramatically improving performance on math, science, and code benchmarks.

Post-training does not change the underlying causal language modeling objective; the model still predicts one token at a time conditioned on prior tokens. What changes is the conditional distribution: a post-trained model is much more likely to produce helpful, safe, and well-structured outputs in response to user prompts.
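
The reward modeling stage is commonly trained with a pairwise objective on the human comparisons. The sketch below assumes a Bradley-Terry style loss, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$; this specific form is an assumption of the example rather than something stated above, and the function name and scores are hypothetical.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log sigmoid(score_chosen - score_rejected), written as log(1 + exp(-margin))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

print(pairwise_preference_loss(0.0, 0.0))   # no preference expressed: log(2)
print(pairwise_preference_loss(2.0, 0.0))   # correct ranking: small loss
print(pairwise_preference_loss(0.0, 2.0))   # wrong ranking: large loss
```

Minimizing this loss pushes the score of the preferred response above the score of the rejected one, which yields the scalar quality signal used in the reinforcement learning stage.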

Zero-shot and few-shot capabilities

One of the most significant discoveries about large causal language models is their ability to perform tasks without explicit fine-tuning. The GPT-3 paper (Brown et al., 2020) demonstrated three paradigms:

Zero-shot: The model receives only a natural-language description of a task and must complete it without any examples.
One-shot: A single input-output example is provided alongside the task description.
Few-shot (in-context learning): Several examples (typically 10 to 100) are included in the prompt before the query.

As model scale increases, few-shot performance improves much more rapidly than zero-shot performance, suggesting that larger models become better "in-context learners." This emergent capability has been one of the primary drivers of interest in scaling up causal language models and has led to the widespread use of prompt engineering as an alternative to traditional fine-tuning. Chain-of-thought prompting (Wei et al., 2022) showed that including worked, step-by-step reasoning in the few-shot examples substantially improves accuracy on multi-step problems, especially for sufficiently large models; Kojima et al. (2022) found that simply appending "Let's think step by step" to the prompt yields a similar benefit in the zero-shot setting.
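
In practice, the three paradigms differ only in how the prompt string is assembled. The sketch below builds zero-, one-, or few-shot prompts from the same pieces; the `Input:`/`Output:` template and the function name are hypothetical conventions chosen for this example, not a format prescribed by any model.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a prompt from a task description, demonstrations, and a final query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    lines += [f"Input: {query}", "Output:"]   # the model continues from here
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("house", "maison")]
prompt = build_few_shot_prompt("Translate English to French.", examples, "cat")
zero_shot = build_few_shot_prompt("Translate English to French.", [], "cat")
print(prompt)
```

Passing an empty list of examples gives the zero-shot form, and a single pair gives the one-shot form; no model weights change in any of the three cases.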

Scaling properties

Causal language models exhibit remarkably predictable scaling laws. Research by Kaplan et al. (2020) at OpenAI showed that a model's cross-entropy loss on held-out text follows a power-law relationship with three variables:

Number of parameters ($N$)
Size of the training dataset ($D$, measured in tokens)
Amount of training compute ($C$, measured in FLOPs)

The loss decreases smoothly and predictably as any of these quantities increases, with trends spanning more than seven orders of magnitude. The Kaplan paper expressed the relationship as $L(N) \propto N^{-\alpha_N}$, $L(D) \propto D^{-\alpha_D}$, and $L(C) \propto C^{-\alpha_C}$, with empirically fit exponents.
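A power law appears as a straight line on log-log axes, which is how such exponents are typically estimated. The sketch below generates synthetic losses from $L(N) = (N_c / N)^{\alpha_N}$ and recovers the exponent with a least-squares fit in log space. The scale constant `N_C` and the value of `ALPHA_N` are illustrative placeholders, not the values fitted in the paper.

```python
import math


def power_law_loss(n, n_c, alpha):
    """Loss predicted by L(N) = (N_c / N) ** alpha."""
    return (n_c / n) ** alpha


def fit_exponent(ns, losses):
    """Least-squares slope of log(loss) against log(N), negated to give alpha."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var


# Illustrative constants only; not the values fitted by Kaplan et al.
ALPHA_N = 0.08
N_C = 1e14

model_sizes = [1e6, 1e7, 1e8, 1e9, 1e10]
synthetic_losses = [power_law_loss(n, N_C, ALPHA_N) for n in model_sizes]
recovered_alpha = fit_exponent(model_sizes, synthetic_losses)
print(round(recovered_alpha, 4))
```

On real training runs the points scatter around the line, and the fit is only trusted within the range of scales actually measured.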

In 2022, DeepMind's Chinchilla study (Hoffmann et al.) refined these findings, demonstrating that many existing models were over-parameterized relative to their training data, that is, undertrained for their size. The Chinchilla-optimal ratio suggests approximately 20 training tokens per parameter for a given compute budget, in contrast to the roughly 1.7 tokens per parameter commonly attributed to the Kaplan et al. results. This insight shifted the field's focus toward training smaller models on more data. LLaMA 7B, for example, was trained on 1 trillion tokens (roughly 140 tokens per parameter), far exceeding the Chinchilla-optimal ratio for its size; this gave it the inference cost of a small model with quality closer to that of a much larger one.
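The 20-tokens-per-parameter rule of thumb can be turned into a rough sizing calculation. The sketch below additionally assumes the common approximation that training compute is about $C \approx 6ND$ FLOPs; that approximation is not stated in this article and is labeled as an assumption in the code. The budget value is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20      # rule of thumb quoted above
FLOPS_PER_PARAM_TOKEN = 6  # assumption: training compute C is roughly 6 * N * D


def compute_optimal(compute_flops):
    """Split a FLOP budget into (parameters, tokens) with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


budget = 5.88e23  # hypothetical training budget in FLOPs
n, d = compute_optimal(budget)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
print(f"LLaMA 7B ratio ~ {tokens_per_param(7e9, 1e12):.0f} tokens per parameter")
```

Under these assumptions, parameters and tokens both grow with the square root of the compute budget, so quadrupling compute roughly doubles each.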

These scaling laws have become essential planning tools for organizations building large causal language models, allowing them to predict downstream performance and allocate compute budgets efficiently. Subsequent work (such as DeepMind's "Approach to scaling" follow-ups and Anthropic's research on overtraining) has further refined the curves and explored what happens when models are pushed beyond compute-optimal data ratios.

Causal language models as the foundation of modern LLMs

Since the release of GPT-3 in 2020, the causal language modeling paradigm has become the dominant approach for building general-purpose AI systems. Several factors drive this dominance:

Simplicity of the training objective: Next-token prediction requires no labeled data, enabling training on massive unlabeled corpora scraped from the internet.
Scalability: The decoder-only architecture scales efficiently to hundreds of billions (and now trillions) of parameters.
Emergent abilities: As models scale, they develop capabilities not explicitly trained for, including reasoning, translation, code generation, and mathematical problem-solving.
Adaptability through fine-tuning: Pre-trained causal language models can be further adapted with reinforcement learning from human feedback (RLHF), parameter-efficient fine-tuning (such as LoRA), or supervised fine-tuning for specific applications.
Universal interface: A single text-in, text-out API can be wrapped around any task, including image, audio, and video tasks once the relevant tokenizers are added.

Today, nearly every leading large language model, whether used for chatbots, search, coding assistance, or scientific research, is built on the causal language modeling framework.

Applications

Causal language models have been deployed across a wide range of natural language processing tasks:

Text generation: Producing coherent, contextually appropriate text for creative writing and content drafting.
Conversational AI: Powering chatbots and virtual assistants such as ChatGPT, Claude, and Gemini.
Code generation: Writing, completing, and debugging source code in tools like GitHub Copilot and Cursor.
Machine translation: Translating text between languages by generating target-language tokens conditioned on source-language input.
Reasoning and problem-solving: Answering complex questions, solving math problems, and performing multi-step logical reasoning.
Text summarization: Condensing long documents into concise summaries while retaining essential information.
Tool use and agents: Driving agents that call external APIs, run code, and operate browsers, with the causal language model planning each step.
Embeddings and retrieval: Recent decoder-only models are competitive with encoder models when fine-tuned for retrieval, powering retrieval-augmented generation systems.
Multimodal generation: Causal language models with vision tokenizers can describe images, answer visual questions, and generate captions; with audio tokens they can transcribe and synthesize speech.

Limitations and open problems

Despite their dominance, causal language models have well-known limitations:

Hallucination: Because the training objective rewards plausible continuations rather than truthful ones, models can generate confident but incorrect statements. Mitigations include retrieval augmentation, tool use, and reinforcement learning against verifiable rewards.
Quadratic attention cost: Standard causal attention scales as $O(n^2)$ in context length, which limits practical context windows. Linear-attention variants, sparse attention, and recurrent alternatives such as Mamba and RWKV aim to relax this constraint.
No native bidirectional context: Tasks that require reading the full input before producing output (such as deep coreference resolution or token classification) can be harder for pure causal models than for encoder-based or prefix models, although in practice large enough decoder models close most of the gap.
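Exposure bias and error accumulation: During training the model conditions on ground-truth prefixes, but during generation it conditions on its own past outputs, so an early mistake can propagate through the rest of the sequence. Reasoning training and self-consistency sampling help mitigate this.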
Data quality and contamination: Web-scale training data inevitably includes errors, duplicates, and copyright-sensitive material; benchmark contamination can inflate reported scores.
Energy and cost: Training and serving frontier causal language models consumes large amounts of energy and GPU capacity, raising sustainability and concentration concerns.
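To make the quadratic attention cost listed above concrete, the following sketch counts the query-key scores that contribute to attention for a given context length, per head and per layer. The helper name `attention_score_entries` is hypothetical, and the count ignores implementation details such as kernels that compute the full matrix before masking.

```python
def attention_score_entries(context_length, causal=True):
    """Number of query-key scores that contribute, per head and per layer."""
    n = context_length
    # With a causal mask, position t attends only to positions 1..t,
    # giving n * (n + 1) / 2 entries instead of the full n * n.
    return n * (n + 1) // 2 if causal else n * n


for n in (1_000, 10_000, 100_000):
    print(n, attention_score_entries(n))
```

The causal mask roughly halves the count but does not change the growth rate: a tenfold longer context still needs about a hundred times as many scores.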

These open problems are active areas of research, and progress on each tends to ripple back into the broader landscape of language model design.

Explain like I'm 5 (ELI5)

Imagine you are playing a word game where you have to guess the next word in a sentence. Your friend says "The cat sat on the..." and you guess "mat" because that makes the most sense based on the words you already heard. A causal language model works the same way. It reads words from left to right, one at a time, and tries to guess which word comes next. It never peeks ahead at words it has not seen yet. The more sentences it practices with, the better it gets at guessing. This is how computers learn to write stories, answer questions, and even have conversations.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30. arXiv:1706.03762.
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." *OpenAI*.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." *OpenAI*.
Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33. arXiv:2005.14165.
Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *NAACL-HLT*. arXiv:1810.04805.
Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." *ICLR*. arXiv:1904.09751.
Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
Raffel, C., Shazeer, N., Roberts, A., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research*, 21(140), 1-67.
Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., & Khudanpur, S. (2010). "Recurrent neural network based language model." *Interspeech*.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). "A Neural Probabilistic Language Model." *Journal of Machine Learning Research*, 3, 1137-1155.
Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." *Neural Computation*, 9(8), 1735-1780.
Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training language models to follow instructions with human feedback." *Advances in Neural Information Processing Systems*, 35. arXiv:2203.02155.
Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." *Advances in Neural Information Processing Systems*, 35. arXiv:2201.11903.
Leviathan, Y., Kalman, M., & Matias, Y. (2022). "Fast Inference from Transformers via Speculative Decoding." arXiv:2211.17192.
Cai, T., Li, Y., Geng, Z., et al. (2024). "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." arXiv:2401.10774.
Kwon, W., Li, Z., Zhuang, S., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP*. arXiv:2309.06180.
Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150.
Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245.
Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073.
Christiano, P., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). "Deep Reinforcement Learning from Human Preferences." *NeurIPS*. arXiv:1706.03741.
Tay, Y., Dehghani, M., Tran, V. Q., et al. (2022). "UL2: Unifying Language Learning Paradigms." arXiv:2205.05131.
