# Decoder

> Source: https://aiwiki.ai/wiki/decoder
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Encoder](/wiki/encoder), [Transformer](/wiki/transformer), [Machine learning terms](/wiki/machine_learning_terms)*

A **decoder** is the component of a [neural network](/wiki/neural_network) that turns an internal, compressed, or abstract representation into a desired output, such as a translated sentence, a generated image, a reconstructed input, or a segmentation mask. Decoders appear across many [deep learning](/wiki/deep_learning) architectures, including sequence-to-sequence models, [transformers](/wiki/transformer), autoencoders, and U-Nets, and the decoder is the part that actually produces the result a user sees. In modern [large language models](/wiki/large_language_model) such as the [GPT](/wiki/gpt) family, a stack of Transformer decoder layers generates text one [token](/wiki/token) at a time, predicting each next token from all preceding tokens, an approach called autoregressive generation.[5]

The concept of the decoder is tightly coupled with that of the [encoder](/wiki/encoder). In most architectures, the encoder compresses input data into a latent or hidden representation, and the decoder reverses this process. However, decoder-only architectures (such as the GPT family) demonstrate that a standalone decoder, without a separate encoder, can serve as a powerful generative model in its own right. Decoder-only Transformers have become the dominant paradigm for large language models since around 2020, largely because of their simplicity and scalability.[6]

## What does a decoder do?

At the most general level, a decoder maps from a representation space back to an output space. In an encoder-decoder pair, the encoder answers "what is in the input?" by producing a hidden representation, and the decoder answers "what should the output be?" by expanding that representation into structured data. The output may be a sequence (text, code, audio tokens), a grid of pixels (an image), or a per-pixel label map (a segmentation mask). What unifies these uses is that the decoder is generative or reconstructive: it builds an output rather than classifying or scoring an input.

## History and origins: when was the decoder introduced?

The idea of using paired encoding and decoding networks traces back to early work on autoencoders in the 1980s and 1990s, where researchers trained networks to compress and then reconstruct data. The modern concept of the decoder in sequence modeling took shape with the rise of sequence-to-sequence (seq2seq) architectures in 2014. Two landmark papers introduced the encoder-decoder framework for machine translation.

Cho et al. (2014) proposed the Encoder-Decoder architecture using recurrent neural networks with a novel hidden unit called the gated recurrent unit (GRU).[2] Independently, Sutskever, Vinyals, and Le (2014) at Google published "Sequence to Sequence Learning with Neural Networks," which used a multilayered LSTM as both the encoder and decoder, achieving strong results on English-to-French translation.[1]

A critical limitation of early seq2seq models was the information bottleneck: the entire input sequence had to be compressed into a single fixed-length context vector. In 2015, Bahdanau, Cho, and Bengio introduced the [attention](/wiki/attention) mechanism in their paper "Neural Machine Translation by Jointly Learning to Align and Translate" (published at ICLR 2015).[3] Instead of forcing the decoder to rely on a single vector, attention allowed the decoder to selectively focus on different parts of the encoder's output at each generation step. This innovation dramatically improved translation quality, particularly for longer sentences.

The introduction of the [Transformer](/wiki/transformer) architecture by Vaswani et al. in 2017 ("Attention Is All You Need") replaced recurrence entirely with self-attention and cross-attention mechanisms, leading to much faster training and better performance.[4] The Transformer big model reached 28.4 BLEU on the WMT 2014 English-to-German task and a then state-of-the-art 41.8 BLEU on English-to-French after training for 3.5 days on eight GPUs.[4] The Transformer decoder became the foundation for modern [large language models](/wiki/large_language_model).

## Decoder in sequence-to-sequence models

In sequence-to-sequence (seq2seq) models, the decoder generates an output sequence based on a context representation produced by the encoder. The encoder processes the input sequence (for instance, a sentence in English) and produces either a single context vector or a sequence of hidden states. The decoder then uses this representation to produce the output sequence (for instance, the same sentence translated into French) one [token](/wiki/token) at a time.

### Architecture

The decoder in a classical seq2seq model is typically a recurrent neural network such as an LSTM or GRU. At each time step, the decoder receives three inputs:

1. The previous hidden state of the decoder
2. The previously generated output token (or a start-of-sequence token at the first step)
3. Context information from the encoder

The decoder produces a hidden state at each step, which is then passed through a linear layer followed by a [softmax](/wiki/softmax) function to generate a probability distribution over the output vocabulary. The token with the highest probability (or a token sampled from the distribution) becomes the output for that step and is fed back as input to the next step.
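The projection-and-softmax step described above can be sketched in a few lines. This is a minimal illustration, not a trained model: the vocabulary, hidden state, weights, and bias are hypothetical toy values, and the final selection is the greedy (highest-probability) choice.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project_to_vocab(hidden, weights, bias):
    """Linear layer: produce one logit per vocabulary entry."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

# Hypothetical toy values: 2-dimensional hidden state, 3-token vocabulary.
vocab = ["<eos>", "chat", "chien"]
hidden = [1.0, -0.5]
weights = [[0.1, 0.2], [1.5, -1.0], [0.3, 0.8]]
bias = [0.0, 0.0, 0.0]

logits = project_to_vocab(hidden, weights, bias)
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice
```

In a real decoder, `next_token` would be embedded and fed back as the input to the next time step.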

### Teacher forcing

During training, a technique called teacher forcing is commonly used. Instead of feeding the decoder's own previous prediction as input at each step, the ground-truth token from the target sequence is provided. This speeds up convergence and stabilizes training, but it can create a discrepancy between training and inference conditions, since at inference time the model must rely on its own predictions. Scheduled sampling, introduced by Bengio et al. (2015), addresses this mismatch by gradually transitioning from teacher forcing to using the model's own predictions during training.
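As a small illustration of teacher forcing, the decoder inputs are usually the ground-truth target shifted right by one position, so that the label at each step is the token that follows the input. The `<bos>` and `<eos>` marker names below are assumptions chosen for this sketch.

```python
BOS, EOS = "<bos>", "<eos>"  # hypothetical start/end markers

def teacher_forcing_pairs(target_tokens):
    """Return (decoder_inputs, labels) for one ground-truth target sequence."""
    decoder_inputs = [BOS] + target_tokens  # what the decoder is fed
    labels = target_tokens + [EOS]          # what it should predict at each step
    return decoder_inputs, labels

inputs, labels = teacher_forcing_pairs(["le", "chat", "dort"])
# inputs → ["<bos>", "le", "chat", "dort"]
# labels → ["le", "chat", "dort", "<eos>"]
```

At inference time there is no ground truth, so each input after `<bos>` is the model's own previous prediction instead.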

### The attention mechanism

The basic seq2seq decoder suffers from the bottleneck problem: all information about the input must be compressed into a fixed-length context vector. For long input sequences, this vector cannot capture all relevant details, and translation quality degrades.

The attention mechanism solves this by letting the decoder look at all encoder hidden states at every generation step. At each decoder time step, the mechanism computes a set of attention weights over the encoder hidden states, producing a weighted sum called the context vector. This context vector changes at each step, allowing the decoder to focus on different parts of the input as it generates different parts of the output.

There are several variants of attention:

| Attention type | Description | Introduced by |
|---|---|---|
| Additive (Bahdanau) attention | Uses a learned feed-forward network to compute alignment scores between decoder state and each encoder hidden state | Bahdanau et al. (2015) |
| Multiplicative (Luong) attention | Computes alignment scores as a dot product (or general bilinear product) between decoder and encoder states | Luong et al. (2015) |
| Scaled dot-product attention | Dot product attention scaled by the square root of the key dimension; used in the Transformer | Vaswani et al. (2017) |
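The computation of attention weights and the context vector can be sketched as follows. This minimal example uses dot-product scoring (optionally scaled by the square root of the dimension); the additive variant would replace the dot product with a small learned network. The vectors are hypothetical toy values.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states, scale=True):
    """Return (attention weights, context vector) for one decoder step."""
    d = len(decoder_state)
    # Alignment score between the decoder state and each encoder hidden state.
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    if scale:
        scores = [s / math.sqrt(d) for s in scores]
    weights = softmax(scores)
    # Context vector: weighted sum of the encoder hidden states.
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

Because the weights are recomputed from the current decoder state, a different decoder state yields a different context vector at each step.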

## Decoder in the Transformer architecture

The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," redefined how decoders work.[4] The Transformer decoder replaces recurrence with stacked layers of self-attention, cross-attention, and feed-forward networks. This design allows for much greater parallelism during training compared to recurrent decoders. In the original paper, both the encoder and the decoder are stacks of N = 6 identical layers.[4]

### Structure of a Transformer decoder layer

Each layer (or "block") in the Transformer decoder contains three sub-layers:

1. **Masked (causal) self-attention.** The decoder attends to previously generated tokens in the output sequence. A causal mask prevents each position from attending to future positions, preserving the autoregressive property. Attention scores corresponding to future tokens are set to negative infinity before the softmax, effectively zeroing them out.
2. **Cross-attention (encoder-decoder attention).** The decoder attends to the full output of the encoder. The queries come from the decoder's self-attention output, while the keys and values come from the encoder. Because the encoder output is fully computed before decoding begins, no masking is needed in this sub-layer.
3. **Position-wise feed-forward network.** A fully connected feed-forward network is applied independently to each position. It typically consists of two linear transformations with a ReLU or GELU activation function in between.

In the original Transformer, each sub-layer is wrapped in a residual connection followed by layer normalization. For regularization, dropout is applied to the output of each sub-layer before it is added to the residual.

### Positional encoding

Since the Transformer has no inherent notion of sequence order (unlike RNNs), positional encoding is added to the input embeddings. The original Transformer uses sinusoidal positional encodings, while later models adopt learned positional embeddings or rotary positional embeddings (RoPE).
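For reference, the sinusoidal scheme of the original Transformer assigns each position a vector whose even indices use a sine and whose odd indices use a cosine, with wavelengths that grow geometrically across dimensions. A minimal sketch, assuming an even model dimension:

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        vec.append(math.sin(angle))  # even index i
        vec.append(math.cos(angle))  # odd index i + 1
    return vec

first = sinusoidal_position(0, 4)   # → [0.0, 1.0, 0.0, 1.0]
second = sinusoidal_position(1, 4)
```

The resulting vector is added element-wise to the token embedding at that position.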

### How does autoregressive generation work?

During inference, the Transformer decoder generates output tokens one at a time in an autoregressive fashion. At each step, it takes all previously generated tokens as input, applies the masked self-attention and cross-attention layers, and produces a probability distribution over the vocabulary for the next token. The chosen token is appended to the sequence, and the process repeats until an end-of-sequence token is produced or a maximum length is reached.

During training, however, the Transformer decoder processes all target tokens in parallel using teacher forcing. The causal mask ensures that each position can only attend to earlier positions, so the model still learns the correct autoregressive distribution while benefiting from parallelized computation.
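The inference loop can be sketched independently of any particular model. In the example below, `toy_next_token_logits` is a hypothetical stand-in for a trained decoder (it simply prefers the next token id in sequence), and the token ids and `EOS_ID` are assumptions made for illustration.

```python
EOS_ID = 0
VOCAB_SIZE = 5

def toy_next_token_logits(tokens):
    """Hypothetical stand-in for a decoder forward pass over all tokens so far."""
    last = tokens[-1]
    preferred = last + 1 if last + 1 < VOCAB_SIZE else EOS_ID
    return [1.0 if i == preferred else 0.0 for i in range(VOCAB_SIZE)]

def generate(prompt, max_new_tokens):
    """Greedy autoregressive generation: append one token per step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == EOS_ID:  # stop at end-of-sequence
            break
    return tokens

full = generate([1], max_new_tokens=10)      # → [1, 2, 3, 4, 0]
truncated = generate([1], max_new_tokens=2)  # → [1, 2, 3]
```

The two stopping conditions in the loop correspond to the end-of-sequence token and the maximum-length limit.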

## What is causal masking and why does it matter?

Causal masking (also called the "look-ahead mask") is the mechanism that enforces the autoregressive property in decoder models. It prevents each position in the sequence from attending to any future position, ensuring that predictions for a given token depend only on previously observed tokens. The original Transformer paper describes the decoder's self-attention as allowing "each position in the decoder to attend to all positions in the decoder up to and including that position," and it preserves the autoregressive property by "masking out (setting to negative infinity) all values in the input of the softmax which correspond to illegal connections."[4]

### How does causal masking work?

In standard self-attention, every position can attend to every other position, allowing the model to incorporate context from both directions. This is useful for understanding tasks (as in BERT), but it creates a fundamental problem for generation. During generation, tokens that come after the current position do not exist yet. If the model learned to rely on future context during training, it would fail at inference time.

Causal masking solves this by applying a lower-triangular mask to the attention scores before the softmax operation. Scores on or below the diagonal (the current and earlier positions) are left unchanged, while scores above the diagonal, which correspond to future tokens, are set to negative infinity, so after the softmax these positions receive zero attention weight. The result is that each token's representation is computed using only information from itself and all preceding tokens.

Formally, for a sequence of length $$n$$, the causal mask $$M$$ is defined as:

- $$M[i][j] = 0$$ if $$j \le i$$ (allowed: attend to current and past positions)
- $$M[i][j] = -\infty$$ if $$j > i$$ (blocked: future positions)

The masked attention scores are computed as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right) V
$$
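The formula can be checked with a small pure-Python sketch. It builds the additive mask $$M$$, adds it to the scaled scores, and applies the softmax row by row; the input matrix is a hypothetical toy example, and a practical implementation would use a tensor library instead of nested lists.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 where j <= i, and -inf where j > i."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k) + M) V for lists of row vectors."""
    n, d_k = len(Q), len(Q[0])
    M = causal_mask(n)
    weights = []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k) + M[i][j]
            for j in range(n)
        ]
        weights.append(softmax(scores))
    output = [
        [sum(weights[i][j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        for i in range(n)
    ]
    return output, weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = causal_attention(X, X, X)
```

In this example the first row of `weights` is `[1.0, 0.0, 0.0]`: the first position can attend only to itself, so its output equals its own value vector.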

### Why causal masking matters

Without causal masking, a model trained on complete sequences could "cheat" by peeking at future tokens. Even though this would reduce the training loss, the model would learn to depend on information that is unavailable during inference. Causal masking forces the model to develop genuine predictive capabilities, making the training distribution match the inference distribution.

Causal masking also enables efficient parallel training. Rather than generating tokens one at a time during training (which would be extremely slow), the model processes an entire sequence in a single forward pass. The mask ensures that each position still only "sees" the tokens before it, preserving the autoregressive property while allowing all positions to be computed simultaneously.

## Decoder-only models

Decoder-only Transformer models remove the encoder and the cross-attention sub-layer entirely, keeping only the masked self-attention and feed-forward sub-layers. These models process a single sequence: they take an input prompt and generate a continuation autoregressively. The input prompt effectively replaces the role of the encoder by providing context through the self-attention mechanism.

Decoder-only architectures have become the dominant paradigm for large language models.[5] They are trained as causal [language models](/wiki/language_model), predicting the next token given all preceding tokens, using self-supervised learning on massive text corpora. The scaling of this single objective produced striking emergent abilities: GPT-3 was described by its authors as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model," able to perform new tasks from a few text examples "without any gradient updates or fine-tuning."[6]

### Prominent decoder-only models

| Model | Developer | Year | Key features |
|---|---|---|---|
| GPT | OpenAI | 2018 | First large-scale decoder-only Transformer; 117M parameters; demonstrated effectiveness of pretraining plus fine-tuning |
| GPT-2 | OpenAI | 2019 | 1.5B parameters; showed strong zero-shot task transfer without fine-tuning |
| GPT-3 | OpenAI | 2020 | 175B parameters; popularized in-context learning and few-shot prompting |
| GPT-4 | OpenAI | 2023 | Multimodal (text and image input); significantly improved reasoning |
| LLaMA | Meta | 2023 | Open-weight models (7B to 65B parameters); uses RMSNorm, SwiGLU activation, and rotary positional embeddings (RoPE) |
| LLaMA 2 | Meta | 2023 | Trained on 2 trillion tokens; includes chat-optimized variants with RLHF |
| LLaMA 3 | Meta | 2024 | Released at 8B and 70B parameters; the Llama 3.1 update added a 405B model and a 128K token context window |
| Mistral 7B | Mistral AI | 2023 | Uses grouped-query attention and sliding window attention for efficiency |
| PaLM | Google | 2022 | 540B parameters; used Pathways distributed training system |

### Architectural variations

While all decoder-only models share the same basic structure, they differ in implementation details:

- **Normalization.** The original Transformer uses post-normalization (applying layer normalization after the residual addition). Most modern models use pre-normalization (applying normalization before the attention or feed-forward sub-layer), which improves training stability. LLaMA uses RMSNorm instead of standard layer normalization.
- **Activation functions.** The original Transformer uses ReLU. Many newer models use GELU or SwiGLU, which have been shown to improve performance.
- **Positional encoding.** GPT-1 and GPT-2 use learned absolute positional embeddings. More recent models such as LLaMA use rotary positional encoding (RoPE), which encodes relative positions and scales better to long sequences.[12]
- **Attention optimizations.** Grouped-query attention (GQA) reduces memory usage by sharing key-value heads across multiple query heads. Flash Attention improves the computational efficiency of the attention operation through memory-aware tiling.

## How does a decoder differ from an encoder?

The three main Transformer-based paradigms each use different portions of the original Transformer architecture and are suited to different types of tasks.

| Property | Encoder-only (BERT) | Decoder-only (GPT) | Encoder-decoder (T5) |
|---|---|---|---|
| Attention direction | Bidirectional | Unidirectional (causal) | Bidirectional encoder, causal decoder |
| Training objective | Masked language model (MLM) | Next-token prediction | Span corruption / text-to-text |
| Output mechanism | Classification head on hidden states | Autoregressive token generation | Autoregressive token generation with cross-attention |
| Typical tasks | Text classification, named entity recognition, question answering (extractive) | Text generation, chatbots, code completion, summarization | Translation, summarization, question answering (abstractive) |
| Notable models | BERT, RoBERTa, ELECTRA, DeBERTa | GPT series, LLaMA, Mistral, PaLM | T5, BART, mBART, Flan-T5 |

Encoder-only models like BERT are strong at tasks that require understanding the full context of an input (bidirectional attention), but they need task-specific classification heads and cannot generate text natively. Decoder-only models like GPT excel at generative tasks and have proven highly scalable.[6] Encoder-decoder models like T5 combine bidirectional encoding with autoregressive decoding, making them well-suited for tasks that map one sequence to another, such as machine translation and summarization.[16]

In practice, the decoder-only architecture has come to dominate large-scale language modeling since around 2020, largely because of its simplicity and scalability.[6] However, encoder-decoder models remain competitive for specific tasks, and recent research (such as the work by Yi Tay and collaborators) has argued that encoder-decoder architectures may be underexplored at very large scales.

## Decoder in autoencoders

In autoencoders, the decoder reconstructs the original input from a compressed latent representation produced by the encoder. Autoencoders are unsupervised models used for dimensionality reduction, feature learning, denoising, and anomaly detection.

### Standard autoencoders

In a standard autoencoder, the encoder maps the input x to a lower-dimensional latent representation z, and the decoder maps z back to a reconstruction x' of the original input. The entire network is trained end-to-end by minimizing a reconstruction loss, typically mean squared error (MSE) for continuous data or binary cross-entropy for binary data.

The decoder's architecture typically mirrors the encoder's architecture in reverse. If the encoder uses a series of linear layers that progressively reduce dimensionality, the decoder uses linear layers that progressively increase dimensionality. For image data, where the encoder uses convolutional neural network layers with pooling, the decoder commonly uses transposed convolutions (sometimes called deconvolutions) or other upsampling layers to bring the feature maps back to the original resolution.

### Variational autoencoders (VAEs)

The decoder plays a particularly important role in variational autoencoders (VAEs), which are generative models introduced by Kingma and Welling in 2013.[7] Unlike standard autoencoders, where the encoder produces a single point in the latent space, a VAE encoder outputs the parameters (mean and variance) of a probability distribution, typically a Gaussian.

During training, a latent vector z is sampled from this distribution using the reparameterization trick (which allows gradients to flow through the sampling step). The decoder then takes z and generates an output. The VAE is trained by jointly minimizing two losses:

1. **Reconstruction loss.** Measures how well the decoder's output matches the original input.
2. **KL divergence.** Measures how much the learned latent distribution deviates from a standard normal prior, encouraging a smooth and well-structured latent space.

Because the VAE decoder learns to generate outputs from any point in the latent space (not just from encoded inputs), it can be used as a standalone generative model. After training, new samples can be generated by sampling z from the prior distribution and passing it through the decoder. The smooth structure of the latent space also enables useful operations like interpolation between data points: by linearly interpolating between two latent vectors and decoding the intermediate points, one can observe a gradual transition between the corresponding outputs.
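The sampling step, the KL term, and latent interpolation can each be written in a few lines. This sketch assumes a diagonal Gaussian whose encoder outputs the mean and the log-variance (a common parameterization, assumed here rather than taken from the text); the decoder network itself is omitted.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), so z stays a function of mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence from N(mu, sigma^2) to the standard normal prior."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def interpolate(z_a, z_b, t):
    """Linear interpolation between two latent vectors, with t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, 0.0], rng)
kl_at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # → 0.0
midpoint = interpolate([0.0, 0.0], [2.0, 4.0], 0.5)          # → [1.0, 2.0]
```

The KL term is zero exactly when the encoder's distribution matches the prior, which is why it pulls the latent space toward a standard normal. Decoding each interpolated vector would produce the gradual transition described above.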

### Denoising autoencoders

In denoising autoencoders, the encoder receives a corrupted version of the input (with added noise or random masking), and the decoder must reconstruct the clean, original input. This forces the network to learn robust features rather than simply copying the input. The decoder in this setting must be powerful enough to infer the missing or corrupted information from the latent representation.

## Image decoders

Image decoders are responsible for converting compressed or abstract feature representations back into spatial image data. They appear in autoencoders, generative adversarial networks (GANs), segmentation models, and diffusion pipelines. The central challenge for an image decoder is upsampling: increasing the spatial resolution of feature maps while preserving (or generating) fine-grained details.

### Upsampling techniques

Several methods are used for spatial upsampling in image decoders:

| Technique | Description | Advantages | Disadvantages |
|---|---|---|---|
| Transposed convolution | Learnable operation that combines upsampling and convolution; inserts zeros between input elements and applies a convolution | Fully learnable; can capture complex spatial patterns | Prone to checkerboard artifacts if kernel size and stride are mismatched |
| Nearest-neighbor upsampling + convolution | Repeats each pixel value to increase resolution, then applies a standard convolution to refine | Avoids checkerboard artifacts; simple to implement | Two-step process; slightly less parameter-efficient |
| Bilinear interpolation + convolution | Uses bilinear interpolation for smooth upsampling, followed by convolution | Produces smoother results than nearest-neighbor | Less flexible than transposed convolution |
| Pixel shuffle (sub-pixel convolution) | Rearranges elements from depth (channels) to spatial dimensions | Computationally efficient; avoids artifacts | Requires channel count to be a multiple of the upscale factor squared |

Transposed convolutions are the most common learned upsampling approach. They work by inserting zeros between input feature map elements, padding the borders, and then applying a standard convolution. The kernel weights are learned during training, allowing the network to discover the best way to reconstruct spatial detail. However, when the kernel size is not evenly divisible by the stride, transposed convolutions can produce checkerboard artifacts in the output. A popular alternative is to use nearest-neighbor or bilinear upsampling followed by a standard convolution, which separates the spatial enlargement step from the feature transformation step.
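Two of the parameter-free rearrangements in the table are easy to show directly. The sketch below implements nearest-neighbor upsampling and pixel shuffle on nested lists; the channel ordering used for pixel shuffle is one common convention and is an assumption of this example. In a real decoder, each operation would be paired with learned convolutions.

```python
def nearest_neighbor_upsample(image, factor):
    """Repeat every pixel `factor` times along both height and width."""
    out = []
    for row in image:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def pixel_shuffle(channels, r):
    """Rearrange C*r*r channels of size H x W into C channels of size (H*r) x (W*r)."""
    if len(channels) % (r * r) != 0:
        raise ValueError("channel count must be a multiple of r squared")
    h, w = len(channels[0]), len(channels[0][0])
    out_c = len(channels) // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(out_c)]
    for c in range(out_c):
        for i in range(r):
            for j in range(r):
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out

upsampled = nearest_neighbor_upsample([[1, 2], [3, 4]], 2)
shuffled = pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2)  # → [[[1, 2], [3, 4]]]
```

Neither operation overlaps its output contributions unevenly, which is why both avoid the checkerboard pattern associated with mismatched transposed convolutions.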

### Decoder in U-Net and image segmentation

The U-Net architecture, introduced by Ronneberger, Fischer, and Brox in 2015 for biomedical image segmentation, features a distinctive encoder-decoder structure with skip connections.[8]

The U-Net decoder (sometimes called the "expanding path" or "upsampling path") progressively increases the spatial resolution of feature maps while decreasing the number of channels. At each level, the decoder performs:

1. **Upsampling.** A transposed convolution (up-convolution) doubles the spatial dimensions of the feature map.
2. **Concatenation via skip connections.** The upsampled feature map is concatenated with the corresponding feature map from the encoder at the same resolution level. These skip connections are the defining feature of U-Net, allowing the decoder to access high-resolution spatial information from the encoder that would otherwise be lost during downsampling.
3. **Convolution.** Two 3x3 convolutions (each followed by a ReLU activation and optionally batch normalization) refine the concatenated features.

The final layer of the decoder is a 1x1 convolution that maps the feature channels to the desired number of output classes for pixel-wise classification.

Without skip connections, the decoder would have to reconstruct fine spatial details purely from the low-resolution, high-level features at the bottom of the network. Skip connections provide a shortcut for spatial information to flow directly from the encoder to the decoder, enabling precise localization. This design has proven so effective that encoder-to-decoder shortcuts have been adopted in many subsequent architectures, including Feature Pyramid Networks (FPN) and variants like U-Net++ and Attention U-Net. SegNet takes a related approach, passing max-pooling indices rather than full feature maps from the encoder to the decoder.

### Decoder in diffusion models

In latent diffusion models such as Stable Diffusion, a VAE decoder plays a critical role in the generation pipeline. The diffusion process operates in a compressed latent space (encoded by a VAE encoder) rather than in pixel space, which makes training and inference much more efficient.[11] After the U-Net denoises the latent representation through iterative reverse diffusion steps, the VAE decoder converts the final denoised latent back into a full-resolution pixel image. This decoder runs only once at the end of the generation process.

## Decoding strategies for text generation

Once a decoder model has been trained, the method used to select tokens during inference significantly affects the quality, diversity, and coherence of the generated text. These methods are called decoding strategies (or decoding algorithms). Holtzman et al. (2020) found that purely "maximization-based decoding methods such as beam search lead to degeneration: output text that is bland, incoherent, or gets stuck in repetitive loops," which motivated the move toward sampling-based strategies for open-ended generation.[9]

### Greedy decoding

Greedy decoding selects the single most probable token at each step. It is computationally cheap and deterministic, but it often produces repetitive or suboptimal text because it never considers alternative paths. A locally optimal choice at each step does not guarantee a globally optimal sequence.

### Beam search

Beam search maintains a set of the k most probable partial sequences (called "beams") at each step, expanding each beam with every possible next token and keeping only the k highest-scoring candidates. It explores a broader set of possibilities than greedy decoding while remaining tractable. Beam search is widely used in machine translation and speech recognition, where fidelity to a well-defined target matters more than diversity. The beam width k (typically 4 to 10) controls the trade-off between computation and search thoroughness. However, beam search tends to produce generic, high-probability text and can still be repetitive. For open-ended generation tasks (such as creative writing or dialogue), sampling-based methods are generally preferred.
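A minimal beam search can be sketched over a toy model. The `toy_model` function and its probability table are hypothetical; they are chosen so that the greedy first choice ("A") leads to a less probable full sequence than the alternative ("B"). This sketch omits length normalization, which practical implementations often add.

```python
import math

def toy_model(tokens):
    """Hypothetical next-token log-probabilities given the sequence so far."""
    table = {
        "<s>": {"A": 0.6, "B": 0.4},
        "A": {"x": 0.5, "y": 0.5},
        "B": {"z": 0.9, "w": 0.1},
    }
    probs = table.get(tokens[-1], {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(step_log_probs, start, beam_width, max_len, eos="</s>"):
    """Keep the `beam_width` highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:  # finished beams are carried over unchanged
                candidates.append((tokens, score))
                continue
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == eos for tokens, _ in beams):
            break
    return beams

greedy_tokens, greedy_score = beam_search(toy_model, "<s>", 1, 5)[0]
best_tokens, best_score = beam_search(toy_model, "<s>", 2, 5)[0]
```

With a beam width of 1 the search reduces to greedy decoding and commits to "A" (total probability 0.30), whereas a width of 2 keeps "B" alive and finds the sequence through "z" (total probability 0.36).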

### Top-k sampling

Top-k sampling restricts the candidate pool at each step to the k most probable tokens, renormalizes their probabilities, and samples from this truncated distribution. This prevents the model from selecting very low-probability (and often incoherent) tokens while introducing controlled randomness. Fan et al. (2018) popularized this approach.[14] The parameter k is fixed across all steps; common values are 40 or 50.
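A minimal sketch of the truncate, renormalize, and sample steps, using hypothetical logits over a four-token vocabulary:

```python
import math
import random

def top_k_filter(logits, k):
    """Renormalized probabilities with everything outside the k largest logits set to zero."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = set(order[:k])
    m = max(logits)
    exps = [math.exp(x - m) if i in kept else 0.0 for i, x in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = top_k_filter([2.0, 1.0, 0.5, -1.0], k=2)
token_id = sample(probs, random.Random(0))
```

Only the two highest-scoring tokens can ever be drawn here; the remaining tokens have probability exactly zero after filtering.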

### Top-p (nucleus) sampling

Top-p sampling (also called nucleus sampling), introduced by Holtzman et al. (2020), dynamically adjusts the candidate pool size.[9] Instead of fixing k, it selects the smallest set of tokens whose cumulative probability exceeds a threshold p (typically 0.9 or 0.95). When the model is confident, the nucleus is small; when the distribution is flat, it is larger. This adaptive behavior makes top-p sampling more flexible than top-k and generally produces more natural-sounding text.
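The adaptive nucleus size can be seen in a short sketch. The function below keeps the most probable tokens until their cumulative probability reaches the threshold and then renormalizes; the two example distributions are hypothetical.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    kept_set = set(kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]

confident = top_p_filter([0.9, 0.05, 0.05], p=0.8)   # nucleus of 1 token
flat = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.8)  # nucleus of 4 tokens
```

With the same threshold, the peaked distribution collapses to a single candidate while the flat one retains all four, which is the behavior that a fixed k cannot reproduce.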

### Temperature

Temperature is a parameter $$T$$ that rescales the logits (the raw scores before softmax) by dividing them by $$T$$ before computing the probability distribution. A temperature below 1.0 sharpens the distribution, making the model more confident and deterministic. A temperature above 1.0 flattens the distribution, increasing randomness and diversity. In the limit, a temperature approaching 0 converges to greedy decoding, and a very high temperature produces near-uniform random sampling. Temperature is often combined with top-k or top-p sampling. For instance, a common configuration for creative text generation might use temperature 0.8 with top-p 0.95.
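The effect of temperature can be illustrated by dividing a fixed set of hypothetical logits by different temperature values before the softmax:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the greedy limit")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.5)   # sharper than the base distribution
base = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)    # flatter than the base distribution
```

The probability of the top token decreases monotonically from `cold` to `base` to `hot`, while each distribution still sums to one.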

### Comparison of decoding strategies

| Strategy | Deterministic? | Diversity | Common use cases | Key parameter |
|---|---|---|---|---|
| Greedy | Yes | Low | Quick prototyping, simple tasks | None |
| Beam search | Yes (for a given beam width) | Low to moderate | Machine translation, speech recognition, summarization | Beam width (k) |
| Top-k sampling | No | Moderate | Open-ended text generation | k (e.g., 40) |
| Top-p (nucleus) sampling | No | Moderate to high | Dialogue, creative writing | p (e.g., 0.9) |
| Temperature scaling | Depends on base method | Adjustable | Combined with sampling methods | Temperature (e.g., 0.7) |

### Advanced decoding methods

Recent research has introduced several more sophisticated decoding techniques:

- **Contrastive decoding.** Introduced by Li et al. (2023), this method uses the difference between a large model's logits and a smaller "amateur" model's logits to guide generation.[15] The intuition is that tokens favored by the large model but not by the small model are more likely to be high-quality.
- **Minimum Bayes risk (MBR) decoding.** Instead of selecting the highest-probability sequence, MBR decoding generates multiple candidate sequences and selects the one that minimizes the expected loss (e.g., the candidate most similar to all others under a chosen metric like BLEU).
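The selection step of MBR decoding can be sketched as picking the candidate with the highest total similarity to the other candidates. The word-set overlap used below is a hypothetical stand-in for a real metric such as BLEU, and the candidate strings are invented for illustration.

```python
def overlap(a, b):
    """Hypothetical similarity: Jaccard overlap of word sets (a stand-in for BLEU)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def mbr_select(candidates, similarity=overlap):
    """Return the candidate with the highest summed similarity to all other candidates."""
    def expected_utility(idx):
        return sum(similarity(candidates[idx], other)
                   for j, other in enumerate(candidates) if j != idx)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = [
    "the cat sleeps",
    "the cat sleeps now",
    "the cat is sleeping",
    "a dog barks",
]
choice = mbr_select(candidates)
```

The outlier candidate shares no words with the others and therefore receives the lowest utility, while the candidate closest to the consensus is selected.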

## KV cache

The key-value (KV) cache is an optimization technique that dramatically accelerates autoregressive decoding in Transformer models. During autoregressive generation, the model produces tokens one at a time, and at each step the self-attention mechanism must compute over all previous tokens. Without caching, this means the key and value projections for every earlier token would be recomputed at every step, resulting in redundant computation that grows quadratically with sequence length.

The KV cache stores the key (K) and value (V) matrices from previous time steps so they can be reused. At each new generation step, only the key and value vectors for the newly generated token need to be computed and appended to the cache. The query (Q) vector for the current token is then scored against the full set of cached keys, and the resulting attention weights are applied to the cached values. This reduces the per-step attention cost from $$O(n^2)$$ to $$O(n)$$, where $$n$$ is the current sequence length.
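The following single-head sketch contrasts a cached decoding step with one that recomputes every key and value from scratch. The 2x2 projection matrices and token embeddings are hypothetical toy values; the point is only that both paths produce the same attention output while the cached path projects one new token per step.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for a single query over all keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    w = softmax(scores)
    return [sum(wi * v[c] for wi, v in zip(w, values)) for c in range(len(values[0]))]

# Hypothetical projection matrices for one attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, -1.0], [0.3, 0.4]]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append to the cache, then attend."""
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

def step_without_cache(xs):
    """Recompute keys and values for every token seen so far."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in embeddings]
uncached_outputs = [step_without_cache(embeddings[: t + 1])
                    for t in range(len(embeddings))]
```

The trade-off is memory: the cache holds one key and one value vector per token, per layer, and per head for as long as generation continues.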

### Memory considerations

The KV cache's memory footprint scales linearly with the batch size, the sequence length, the number of layers, and the hidden dimension of the model (more precisely, the number of key-value heads times the head dimension). For large models with long context windows, the KV cache can consume substantial GPU memory. For example, a LLaMA 2 7B model (32 layers, hidden size 4,096) at 16-bit precision with a batch size of 1 stores about 0.5 MB of keys and values per token, or approximately 2 GB at its full 4,096-token context. At larger batch sizes or longer contexts, this figure grows rapidly.
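The arithmetic behind such estimates is straightforward. The sketch below assumes standard multi-head attention (every query head has its own key-value head) and uses the published LLaMA 2 7B shape of 32 layers and a hidden size of 4,096; the 4,096-token sequence length is an assumption made here to reproduce the 2 GB figure.

```python
def kv_cache_bytes(num_layers, hidden_size, seq_len, batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values, assuming one KV head per query head."""
    keys_and_values = 2  # one K vector and one V vector per token per layer
    return (keys_and_values * num_layers * hidden_size
            * seq_len * batch_size * bytes_per_value)

per_token = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=1)
full_context = kv_cache_bytes(num_layers=32, hidden_size=4096, seq_len=4096)

per_token_mib = per_token / 1024**2        # → 0.5
full_context_gib = full_context / 1024**3  # → 2.0
```

Doubling either the batch size or the sequence length doubles the result, which is why techniques that shrink the number of key-value heads or the bytes per value have such a direct effect.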

Several techniques have been developed to reduce KV cache memory usage:

| Optimization technique | Description |
|---|---|
| Multi-query attention (MQA) | Shares a single key-value head across all query heads, dramatically reducing cache size |
| Grouped-query attention (GQA) | Shares key-value heads among groups of query heads; a middle ground between full multi-head attention and MQA |
| KV cache quantization | Reduces the numerical precision of cached values (e.g., from 16-bit to 4-bit) to save memory with minimal quality impact |
| Cache eviction strategies | Selectively removes less important entries from the cache, keeping only the most relevant tokens |
| Sliding window attention | Limits the attention context to a fixed window of recent tokens rather than the full sequence |
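
As an illustration of how head sharing shrinks the cache, the sketch below stores keys and values for only `n_kv_heads` heads and expands them to the query heads at attention time. All sizes and tensors are hypothetical toy values; setting `n_kv_heads = 1` gives the MQA case and `n_kv_heads = n_q_heads` gives ordinary multi-head attention.

```python
import numpy as np

n_q_heads, n_kv_heads, head_dim, seq_len = 8, 2, 4, 6
group = n_q_heads // n_kv_heads  # query heads served by each KV head

rng = np.random.default_rng(0)
q = rng.normal(size=(n_q_heads, head_dim))  # queries for the current token
k_cache = rng.normal(size=(n_kv_heads, seq_len, head_dim))
v_cache = rng.normal(size=(n_kv_heads, seq_len, head_dim))

# Expand the cached heads so each group of query heads reads the same K/V.
k = np.repeat(k_cache, group, axis=0)  # (n_q_heads, seq_len, head_dim)
v = np.repeat(v_cache, group, axis=0)

scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("ht,htd->hd", weights, v)

print(out.shape)                # (8, 4)
print(k.size // k_cache.size)   # 4: cache is 4x smaller than with 8 KV heads
```

Only `k_cache` and `v_cache` need to persist between steps; the expanded copies are temporary, so the stored cache shrinks by the ratio of query heads to key-value heads.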

The size of the speedup depends on the model, hardware, and sequence length. In one reported benchmark, generating 1,000 tokens took roughly 11 to 12 seconds with KV caching, compared to over 56 seconds without it, roughly a fivefold difference. This illustrates why the technique is considered essential for real-time inference in production systems.

## What is speculative decoding?

Speculative decoding is an inference acceleration technique that speeds up autoregressive generation without changing the output distribution. It was introduced by Leviathan, Kalman, and Matias (2023) and independently by Chen et al. (2023).[10] Leviathan et al. describe it as a way "to make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models," and report a 2x to 3x acceleration over a standard implementation.[10]

### How speculative decoding works

The core idea is to use a smaller, faster "draft" model to generate a sequence of candidate tokens, and then verify these candidates in a single forward pass of the larger "target" model. Because the target model can process multiple tokens in parallel (just as it does during training), verification is much cheaper than generating each token individually.

The algorithm proceeds as follows:

1. The draft model generates gamma candidate tokens autoregressively (where gamma is typically 3 to 8).
2. The target model processes all candidate tokens in one forward pass, producing probability distributions for each position.
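3. Each candidate token $$x$$ is accepted with probability $$\min(1, p(x)/q(x))$$, where $$p$$ is the target model's probability and $$q$$ is the draft model's probability for that token. Accepted tokens are kept; at the first rejection, a replacement token is sampled from a corrected distribution proportional to $$\max(0, p - q)$$ and the remaining candidates are discarded. If every candidate is accepted, one additional token is sampled from the target model's distribution.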
4. The process repeats from the new sequence position.

Because the acceptance/rejection step uses a specific mathematical correction, the output distribution of speculative decoding is provably identical to that of standard autoregressive decoding with the target model alone. This means speculative decoding is a lossless acceleration technique.
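
The accept/reject step can be sketched as follows, based on the sampling rule described by Leviathan et al.: accept a drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p - q. The function name `speculative_step` and the fixed toy distributions are assumptions for illustration; in a real system the rows of `p_target` and `q_draft` come from model forward passes and depend on the preceding tokens.

```python
import numpy as np


def speculative_step(p_target, q_draft, draft_tokens, rng):
    """One verification round over gamma drafted tokens.

    p_target: (gamma + 1, vocab) target-model distributions per position.
    q_draft:  (gamma, vocab) draft-model distributions the tokens came from.
    Returns the tokens emitted this round (between 1 and gamma + 1 of them).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # First rejection: resample from the corrected distribution and stop.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out
    # Every draft token was accepted: take one extra token from the target.
    out.append(int(rng.choice(p_target.shape[1], p=p_target[len(draft_tokens)])))
    return out


# Empirical check with gamma = 1 and a 3-token vocabulary (toy numbers).
rng = np.random.default_rng(0)
p = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
q = np.array([[0.3, 0.3, 0.4]])
n = 20000
counts = np.zeros(3)
for _ in range(n):
    drafted = rng.choice(3, p=q[0])
    counts[speculative_step(p, q, [drafted], rng)[0]] += 1
print(counts / n)  # should be close to p[0], not q[0]
```

Even though every proposal is drawn from the draft distribution, the emitted tokens follow the target distribution, which is the sense in which the method is lossless.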

### Performance and adoption

Speculative decoding typically delivers 2x to 3x speedups over standard autoregressive decoding, depending on the similarity between the draft and target models and the nature of the task. With draft lengths of gamma = 5 and gamma = 7, Leviathan et al. measured speedups of 2.3x to 3.4x on translation and summarization tasks.[10] It has been adopted in production systems across the industry. Google has deployed speculative decoding in several products, and Apple has developed variants such as Mirror Speculative Decoding and Speculative Streaming, which aim to further reduce the serial bottleneck in generation.

Recent developments (2024 and 2025) include techniques for handling vocabulary mismatches between draft and target models, multi-model speculation approaches, and methods like SpecEE that use early-exit strategies rather than separate draft models.

## Training the decoder

Regardless of the specific architecture, training a decoder generally involves optimizing it to produce outputs that match ground-truth targets, whether those are token sequences, reconstructed inputs, or label maps.

### Loss functions

For sequence generation tasks (translation, text generation), the standard loss function is cross-entropy loss computed at each output position. Given a target sequence of tokens, the loss at each step is the negative log-probability of the correct token under the decoder's predicted distribution. The total loss is the sum (or average) over all positions.
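
A minimal NumPy sketch of this per-position loss is shown below. The function name and the toy logits are hypothetical; deep learning frameworks provide fused, numerically stable equivalents.

```python
import numpy as np


def sequence_cross_entropy(logits: np.ndarray, targets: np.ndarray):
    """Cross-entropy for one sequence.

    logits:  (T, V) unnormalized scores for T positions over a vocabulary of V.
    targets: (T,) index of the correct token at each position.
    Returns (mean loss, per-position losses).
    """
    # Log-softmax, shifted by the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-probability of the correct token at each position.
    per_token = -log_probs[np.arange(len(targets)), targets]
    return per_token.mean(), per_token


# With uniform logits over 4 tokens, every position costs log(4).
loss, per_token = sequence_cross_entropy(np.zeros((2, 4)), np.array([0, 3]))
print(loss)  # about 1.386
```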

For reconstruction tasks (autoencoders), the loss is typically mean squared error (MSE) for continuous data or binary cross-entropy for binary data.
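
Both reconstruction losses are short enough to write out directly. The sketch below uses hypothetical helper names and toy inputs; the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import numpy as np


def mse(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error between an input and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))


def binary_cross_entropy(x: np.ndarray, x_hat: np.ndarray, eps: float = 1e-12) -> float:
    """Binary cross-entropy; x holds 0/1 targets, x_hat predicted probabilities."""
    x_hat = np.clip(x_hat, eps, 1 - eps)
    return float(-np.mean(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat)))


print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))                   # 2.0
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # about 0.693
```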

### Optimization

Decoders are trained with backpropagation and a gradient-based optimizer (typically Adam or AdamW). In Transformer-based models, learning rate warmup followed by a cosine or inverse-square-root decay schedule is standard practice. Large decoder models may require mixed-precision training, gradient checkpointing, and distributed training across multiple GPUs or TPUs.
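
A warmup-then-cosine schedule can be expressed as a small function of the step count. The peak learning rate, warmup length, and total step count below are illustrative assumptions, not recommended values, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(
    step: int,
    peak_lr: float = 3e-4,
    warmup_steps: int = 1000,
    total_steps: int = 10000,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for s in (0, 999, 5500, 10000):
    print(s, lr_at(s))
```

The rate rises linearly during the first 1,000 steps, reaches its peak at the end of warmup, falls to half the peak midway through the decay phase, and ends at `min_lr`.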

### Regularization

Dropout is applied within decoder layers to reduce overfitting. Label smoothing, which replaces hard one-hot targets with softened distributions, is another common regularization technique for decoder training. Weight decay, which is closely related to L2 regularization, is also standard.
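
Label smoothing is easy to show concretely. The sketch below implements one common variant, which mixes the one-hot target with a uniform distribution over the vocabulary; the smoothing factor of 0.1 and the function name are assumptions for illustration.

```python
import numpy as np


def smooth_labels(targets: np.ndarray, vocab_size: int, epsilon: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with a uniform distribution over the vocabulary."""
    one_hot = np.eye(vocab_size)[targets]
    return one_hot * (1.0 - epsilon) + epsilon / vocab_size


smoothed = smooth_labels(np.array([2]), vocab_size=4)
print(smoothed)  # correct class gets 0.925, each other class 0.025
```

Training against these softened targets penalizes the model for becoming fully confident in any single token.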

## Evaluation

The quality of a decoder's output is evaluated differently depending on the task:

- **Machine translation.** BLEU, METEOR, chrF, and COMET are common metrics. BLEU measures n-gram precision between the generated translation and one or more reference translations, with a penalty for outputs that are too short.
- **Text generation.** Perplexity is the exponential of the average negative log-likelihood per token on a held-out test corpus, so it measures how well the model predicts that text. Lower perplexity indicates a better fit. Human evaluation is often used alongside automatic metrics.
- **Image reconstruction (autoencoders).** Mean squared error, structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR) are standard.
- **Image segmentation (U-Net).** Intersection over Union (IoU), Dice coefficient, and pixel accuracy are common metrics.
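
Several of these metrics reduce to a few lines of arithmetic. The sketch below computes perplexity from per-token log-probabilities and IoU and Dice from binary masks; the function names and inputs are hypothetical toy values, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
import numpy as np


def perplexity(token_log_probs: np.ndarray) -> float:
    """exp of the average negative log-likelihood (natural log) per token."""
    return float(np.exp(-np.mean(token_log_probs)))


def iou_and_dice(pred: np.ndarray, target: np.ndarray):
    """Intersection over Union and Dice coefficient for two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity(np.log(np.full(3, 0.25))))

pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [1, 0]])
print(iou_and_dice(pred, target))  # IoU = 1/3, Dice = 0.5
```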

## What is a decoder used for?

Decoders are used across a wide range of deep learning applications:

- **Machine translation.** The decoder generates the translated sentence in the target language, one token at a time.
- **Text generation and chatbots.** Decoder-only models such as the GPT family, which powers ChatGPT, generate fluent, contextually relevant text for dialogue, content creation, code generation, and more.
- **Speech recognition.** Models like Whisper use an encoder-decoder architecture where the decoder transcribes audio features into text.
- **Image captioning.** The decoder produces a natural language description of an image, conditioned on visual features extracted by an image encoder.
- **Image segmentation.** The U-Net decoder produces pixel-level classification maps for medical imaging, satellite imagery, and other applications.
- **Image generation.** VAE decoders and diffusion model decoders produce images from latent representations.
- **Text summarization.** Encoder-decoder models generate concise summaries of longer documents.
- **Code generation.** Decoder-only models trained on code (such as Codex and Code LLaMA) generate programming code from natural language descriptions.

## Explain Like I'm 5 (ELI5)

Imagine you have a box of building blocks, and you build a really cool castle. Now, you want to send the castle to your friend, but it is too big to mail. So you take a photo of the castle and send that instead. The photo is much smaller, but it captures the important details.

Your friend gets the photo and uses it to rebuild the castle with their own blocks. They might not get every single detail exactly right, but they can rebuild something very close to your original castle.

In machine learning, the "encoder" is like taking the photo: it squishes big, complicated information down into something small and compact. The "decoder" is like your friend rebuilding the castle from the photo: it takes that compact information and turns it back into something full-sized and useful, like a translated sentence, a generated image, or an answer to a question.

Some decoders are extra clever. Instead of needing a photo from someone else, they can create entirely new castles on their own, one block at a time, by deciding what block to place next based on all the blocks they have already placed. That is how chatbots like ChatGPT work: they generate text one word at a time, always looking at what they have written so far to decide the next word.

## References

1. Sutskever, I., Vinyals, O., & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." *Advances in Neural Information Processing Systems 27 (NIPS 2014)*. arXiv:1409.3215.
2. Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." *Proceedings of EMNLP 2014*. arXiv:1406.1078.
3. Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." *Proceedings of ICLR 2015*. arXiv:1409.0473.
4. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*. arXiv:1706.03762.
5. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI.
6. Brown, T. B., et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems 33 (NeurIPS 2020)*. arXiv:2005.14165.
7. Kingma, D. P., & Welling, M. (2014). "Auto-Encoding Variational Bayes." *Proceedings of ICLR 2014*. arXiv:1312.6114.
8. Ronneberger, O., Fischer, P., & Brox, T. (2015). "U-Net: Convolutional Networks for Biomedical Image Segmentation." *Proceedings of MICCAI 2015*. arXiv:1505.04597.
9. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." *Proceedings of ICLR 2020*. arXiv:1904.09751.
10. Leviathan, Y., Kalman, M., & Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." *Proceedings of ICML 2023*. arXiv:2211.17192.
11. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." *Proceedings of CVPR 2022*. arXiv:2112.10752.
12. Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971.
13. Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." *Proceedings of EMNLP 2015*. arXiv:1508.04025.
14. Fan, A., Lewis, M., & Dauphin, Y. (2018). "Hierarchical Neural Story Generation." *Proceedings of ACL 2018*. arXiv:1805.04833.
15. Li, X. L., Holtzman, A., Fried, D., Liang, P., Eisner, J., Hashimoto, T., Zettlemoyer, L., & Lewis, M. (2023). "Contrastive Decoding: Open-ended Text Generation as Optimization." *Proceedings of ACL 2023*. arXiv:2210.15097.
16. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research*, 21(140), 1-67. arXiv:1910.10683.