See also: Encoder, Transformer, Machine learning terms
In neural networks, a decoder is a component that transforms an internal, compressed, or abstract representation back into a desired output format. Decoders appear across many deep learning architectures, including sequence-to-sequence models, transformers, autoencoders, and U-Nets. While the specific design of a decoder varies by architecture, its core function remains consistent: it takes a learned representation and produces a structured output, whether that output is a translated sentence, a generated image, a reconstructed input, or a segmentation mask.
The concept of the decoder is tightly coupled with that of the encoder. In most architectures, the encoder compresses input data into a latent or hidden representation, and the decoder reverses this process. However, decoder-only architectures (such as the GPT family) demonstrate that a standalone decoder, without a separate encoder, can serve as a powerful generative model in its own right.
The idea of using paired encoding and decoding networks traces back to early work on autoencoders in the 1980s and 1990s, where researchers trained networks to compress and then reconstruct data. The modern concept of the decoder in sequence modeling took shape with the rise of sequence-to-sequence (seq2seq) architectures in 2014. Two landmark papers introduced the encoder-decoder framework for machine translation.
Cho et al. (2014) proposed the Encoder-Decoder architecture using recurrent neural networks with a novel hidden unit called the gated recurrent unit (GRU). Independently, Sutskever, Vinyals, and Le (2014) at Google published "Sequence to Sequence Learning with Neural Networks," which used a multilayered LSTM as both the encoder and decoder, achieving strong results on English-to-French translation.
A critical limitation of early seq2seq models was the information bottleneck: the entire input sequence had to be compressed into a single fixed-length context vector. In 2015, Bahdanau, Cho, and Bengio introduced the attention mechanism in their paper "Neural Machine Translation by Jointly Learning to Align and Translate" (published at ICLR 2015). Instead of forcing the decoder to rely on a single vector, attention allowed the decoder to selectively focus on different parts of the encoder's output at each generation step. This innovation dramatically improved translation quality, particularly for longer sentences.
The introduction of the Transformer architecture by Vaswani et al. in 2017 ("Attention Is All You Need") replaced recurrence entirely with self-attention and cross-attention mechanisms, leading to much faster training and better performance. The Transformer decoder became the foundation for modern large language models.
In sequence-to-sequence (seq2seq) models, the decoder generates an output sequence based on a context representation produced by the encoder. The encoder processes the input sequence (for instance, a sentence in English) and produces either a single context vector or a sequence of hidden states. The decoder then uses this representation to produce the output sequence (for instance, the same sentence translated into French) one token at a time.
The decoder in a classical seq2seq model is typically a recurrent neural network such as an LSTM or GRU. At each time step, the decoder receives three inputs:

- its hidden state from the previous time step;
- the previously generated output token (or, during training with teacher forcing, the previous ground-truth token);
- the context representation produced by the encoder, which in the simplest models is the fixed context vector used to initialize or condition the decoder.
The decoder produces a hidden state at each step, which is then passed through a linear layer followed by a softmax function to generate a probability distribution over the output vocabulary. The token with the highest probability (or a token sampled from the distribution) becomes the output for that step and is fed back as input to the next step.
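Below is a minimal PyTorch-style sketch of one such decoder step, assuming a GRU-based decoder; the class and parameter names (RNNDecoder, vocab_size, embed_dim, hidden_dim) are illustrative rather than drawn from any particular library. In the simplest (attention-free) setup, the encoder's final hidden state initializes prev_hidden and stands in for the context vector.

```python
import torch
import torch.nn as nn

class RNNDecoder(nn.Module):
    """Illustrative GRU decoder: embed previous token, update hidden state, project to vocabulary."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_token, prev_hidden):
        # prev_token: (batch,) index of the previously generated token
        # prev_hidden: (batch, hidden_dim) hidden state from the previous step
        embedded = self.embedding(prev_token)      # (batch, embed_dim)
        hidden = self.gru(embedded, prev_hidden)   # new hidden state
        logits = self.out(hidden)                  # scores over the output vocabulary
        return logits, hidden                      # softmax(logits) gives the next-token distribution
```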
During training, a technique called teacher forcing is commonly used. Instead of feeding the decoder's own previous prediction as input at each step, the ground-truth token from the target sequence is provided. This speeds up convergence and stabilizes training, but it can create a discrepancy between training and inference conditions, since at inference time the model must rely on its own predictions. Scheduled sampling, introduced by Bengio et al. (2015), addresses this mismatch by gradually transitioning from teacher forcing to using the model's own predictions during training.
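A sketch of a teacher-forced decoding loop built on the hypothetical RNNDecoder above; `criterion` is assumed to be a token-level cross-entropy loss and `enc_final` the encoder's final hidden state.

```python
def decode_with_teacher_forcing(decoder, target, enc_final, criterion):
    # target: (batch, seq_len) ground-truth token indices; enc_final: (batch, hidden_dim)
    hidden = enc_final
    loss = 0.0
    for t in range(target.size(1) - 1):
        # Feed the ground-truth token at position t rather than the model's own prediction.
        logits, hidden = decoder.step(target[:, t], hidden)
        loss = loss + criterion(logits, target[:, t + 1])   # predict the next ground-truth token
    return loss / (target.size(1) - 1)
```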
The basic seq2seq decoder suffers from the bottleneck problem: all information about the input must be compressed into a fixed-length context vector. For long input sequences, this vector cannot capture all relevant details, and translation quality degrades.
The attention mechanism solves this by letting the decoder look at all encoder hidden states at every generation step. At each decoder time step, the mechanism computes a set of attention weights over the encoder hidden states, producing a weighted sum called the context vector. This context vector changes at each step, allowing the decoder to focus on different parts of the input as it generates different parts of the output.
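As a concrete illustration, the sketch below computes a dot-product (Luong-style) context vector for a single decoder step; the function name and tensor shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def attention_context(decoder_state, encoder_states):
    # decoder_state: (batch, hidden_dim); encoder_states: (batch, src_len, hidden_dim)
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # alignment scores
    weights = F.softmax(scores, dim=-1)                                          # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)         # weighted sum
    return context, weights
```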
There are several variants of attention:
| Attention type | Description | Introduced by |
|---|---|---|
| Additive (Bahdanau) attention | Uses a learned feed-forward network to compute alignment scores between decoder state and each encoder hidden state | Bahdanau et al. (2015) |
| Multiplicative (Luong) attention | Computes alignment scores as a dot product (or general bilinear product) between decoder and encoder states | Luong et al. (2015) |
| Scaled dot-product attention | Dot product attention scaled by the square root of the key dimension; used in the Transformer | Vaswani et al. (2017) |
The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," redefined how decoders work. The Transformer decoder replaces recurrence with stacked layers of self-attention, cross-attention, and feed-forward networks. This design allows for much greater parallelism during training compared to recurrent decoders.
Each layer (or "block") in the Transformer decoder contains three sub-layers:
Each sub-layer is wrapped with a residual connection and layer normalization. Dropout is applied after each sub-layer for regularization.
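A condensed PyTorch-style sketch of one post-norm decoder block is shown below; it is illustrative only (PyTorch's own nn.TransformerDecoderLayer provides a production implementation), and `causal_mask` is assumed to be an additive mask with −∞ above the diagonal, as described in the causal masking discussion below.

```python
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative Transformer decoder block: masked self-attention, cross-attention, feed-forward."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, memory, causal_mask):
        # 1. Masked self-attention over previously generated positions.
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # 2. Cross-attention: queries from the decoder, keys/values from the encoder output.
        attn_out, _ = self.cross_attn(x, memory, memory)
        x = self.norm2(x + self.dropout(attn_out))
        # 3. Position-wise feed-forward network.
        x = self.norm3(x + self.dropout(self.ff(x)))
        return x
```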
Since the Transformer has no inherent notion of sequence order (unlike RNNs), positional encoding is added to the input embeddings. The original Transformer uses sinusoidal positional encodings, while later models adopt learned positional embeddings or rotary positional embeddings (RoPE).
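For reference, a short sketch of the original sinusoidal encoding (assuming an even d_model):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)                                    # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions use sine
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions use cosine
    return pe                                      # added to token embeddings before the first block
```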
During inference, the Transformer decoder generates output tokens one at a time in an autoregressive fashion. At each step, it takes all previously generated tokens as input, applies the masked self-attention and cross-attention layers, and produces a probability distribution over the vocabulary for the next token. The chosen token is appended to the sequence, and the process repeats until an end-of-sequence token is produced or a maximum length is reached.
During training, however, the Transformer decoder processes all target tokens in parallel using teacher forcing. The causal mask ensures that each position can only attend to earlier positions, so the model still learns the correct autoregressive distribution while benefiting from parallelized computation.
Causal masking (also called the "look-ahead mask") is the mechanism that enforces the autoregressive property in decoder models. It prevents each position in the sequence from attending to any future position, ensuring that predictions for a given token depend only on previously observed tokens.
In standard self-attention, every position can attend to every other position, allowing the model to incorporate context from both directions. This is useful for understanding tasks (as in BERT), but it creates a fundamental problem for generation. During generation, tokens that come after the current position do not exist yet. If the model learned to rely on future context during training, it would fail at inference time.
Causal masking solves this by applying a lower-triangular binary matrix to the attention scores before the softmax operation. Positions that correspond to future tokens are set to negative infinity, so after the softmax these positions receive zero attention weight. The result is that each token's representation is computed using only information from itself and all preceding tokens.
Formally, for a sequence of length n, the causal mask M is an n x n matrix with M_ij = 0 if j ≤ i and M_ij = −∞ if j > i, where i indexes the query (current) position and j indexes the key (attended-to) position.
The masked attention scores are computed as: Attention(Q, K, V) = softmax((QK^T / sqrt(d_k)) + M) * V
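A minimal sketch of constructing the mask and applying it inside scaled dot-product attention (tensor shapes are illustrative):

```python
import torch

def causal_mask(n: int) -> torch.Tensor:
    # 0 on and below the diagonal, -inf above it (future positions).
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

def masked_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # raw attention scores
    scores = scores + causal_mask(q.size(-2)).to(scores.device)
    weights = torch.softmax(scores, dim=-1)                  # future positions receive zero weight
    return weights @ v
```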
Without causal masking, a model trained on complete sequences could "cheat" by peeking at future tokens. Even though this would reduce the training loss, the model would learn to depend on information that is unavailable during inference. Causal masking forces the model to develop genuine predictive capabilities, making the training distribution match the inference distribution.
Causal masking also enables efficient parallel training. Rather than generating tokens one at a time during training (which would be extremely slow), the model processes an entire sequence in a single forward pass. The mask ensures that each position still only "sees" the tokens before it, preserving the autoregressive property while allowing all positions to be computed simultaneously.
Decoder-only Transformer models remove the encoder and the cross-attention sub-layer entirely, keeping only the masked self-attention and feed-forward sub-layers. These models process a single sequence: they take an input prompt and generate a continuation autoregressively. The input prompt effectively replaces the role of the encoder by providing context through the self-attention mechanism.
Decoder-only architectures have become the dominant paradigm for large language models. They are trained as causal language models, predicting the next token given all preceding tokens, using self-supervised learning on massive text corpora.
| Model | Developer | Year | Key features |
|---|---|---|---|
| GPT | OpenAI | 2018 | First large-scale decoder-only Transformer; 117M parameters; demonstrated effectiveness of pretraining plus fine-tuning |
| GPT-2 | OpenAI | 2019 | 1.5B parameters; demonstrated strong zero-shot task transfer across diverse NLP tasks |
| GPT-3 | OpenAI | 2020 | 175B parameters; popularized in-context learning and few-shot prompting |
| GPT-4 | OpenAI | 2023 | Multimodal (text and image input); significantly improved reasoning |
| LLaMA | Meta | 2023 | Open-weight models (7B to 65B parameters); uses RMSNorm, SwiGLU activation, and rotary positional embeddings (RoPE) |
| LLaMA 2 | Meta | 2023 | Trained on 2 trillion tokens; includes chat-optimized variants with RLHF |
| LLaMA 3 | Meta | 2024 | 8B and 70B parameters at launch; the Llama 3.1 release added a 405B model and a 128K token context window |
| Mistral 7B | Mistral AI | 2023 | Uses grouped-query attention and sliding window attention for efficiency |
| PaLM | Google | 2022 | 540B parameters; used Pathways distributed training system |
While all decoder-only models share the same basic structure, they differ in implementation details such as the normalization scheme (LayerNorm versus RMSNorm), activation function (e.g., GeLU versus SwiGLU), positional encoding (learned embeddings versus rotary embeddings), and attention variant (standard multi-head, grouped-query, or sliding window attention).
The three main Transformer-based paradigms each use different portions of the original Transformer architecture and are suited to different types of tasks.
| Property | Encoder-only (BERT) | Decoder-only (GPT) | Encoder-decoder (T5) |
|---|---|---|---|
| Attention direction | Bidirectional | Unidirectional (causal) | Bidirectional encoder, causal decoder |
| Training objective | Masked language model (MLM) | Next-token prediction | Span corruption / text-to-text |
| Output mechanism | Classification head on hidden states | Autoregressive token generation | Autoregressive token generation with cross-attention |
| Typical tasks | Text classification, named entity recognition, question answering (extractive) | Text generation, chatbots, code completion, summarization | Translation, summarization, question answering (abstractive) |
| Notable models | BERT, RoBERTa, ELECTRA, DeBERTa | GPT series, LLaMA, Mistral, PaLM | T5, BART, mBART, Flan-T5 |
Encoder-only models like BERT are strong at tasks that require understanding the full context of an input (bidirectional attention), but they need task-specific classification heads and cannot generate text natively. Decoder-only models like GPT excel at generative tasks and have proven highly scalable. Encoder-decoder models like T5 combine bidirectional encoding with autoregressive decoding, making them well-suited for tasks that map one sequence to another, such as machine translation and summarization.
In practice, the decoder-only architecture has come to dominate large-scale language modeling since around 2020, largely because of its simplicity and scalability. However, encoder-decoder models remain competitive for specific tasks, and recent research (such as the work by Yi Tay and collaborators) has argued that encoder-decoder architectures may be underexplored at very large scales.
In autoencoders, the decoder reconstructs the original input from a compressed latent representation produced by the encoder. Autoencoders are unsupervised models used for dimensionality reduction, feature learning, denoising, and anomaly detection.
In a standard autoencoder, the encoder maps the input x to a lower-dimensional latent representation z, and the decoder maps z back to a reconstruction x' of the original input. The entire network is trained end-to-end by minimizing a reconstruction loss, typically mean squared error (MSE) for continuous data or binary cross-entropy for binary data.
The decoder's architecture mirrors the encoder's architecture in reverse. If the encoder uses a series of linear layers that progressively reduce dimensionality, the decoder uses linear layers that progressively increase dimensionality. For image data, where the encoder uses convolutional neural network layers with pooling, the decoder uses transposed convolutions (sometimes called deconvolutions) to upsample the feature maps back to the original resolution.
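A minimal sketch of this mirrored structure for flattened image vectors, with illustrative layer sizes:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim: int = 784, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),                  # compress to the latent code z
        )
        self.decoder = nn.Sequential(                    # mirrors the encoder in reverse
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),     # outputs in [0, 1] for pixel data
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)                           # reconstruction x'
```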
The decoder plays a particularly important role in variational autoencoders (VAEs), which are generative models introduced by Kingma and Welling in 2013. Unlike standard autoencoders, where the encoder produces a single point in the latent space, a VAE encoder outputs the parameters (mean and variance) of a probability distribution, typically a Gaussian.
During training, a latent vector z is sampled from this distribution using the reparameterization trick (which allows gradients to flow through the sampling step). The decoder then takes z and generates an output. The VAE is trained by jointly minimizing two losses:

- a reconstruction loss (such as MSE or binary cross-entropy) that measures how closely the decoder's output matches the original input;
- a Kullback-Leibler (KL) divergence term that pulls the encoder's distribution toward the prior (typically a standard Gaussian), keeping the latent space smooth and well-structured.
Because the VAE decoder learns to generate outputs from any point in the latent space (not just from encoded inputs), it can be used as a standalone generative model. After training, new samples can be generated by sampling z from the prior distribution and passing it through the decoder. The smooth structure of the latent space also enables useful operations like interpolation between data points: by linearly interpolating between two latent vectors and decoding the intermediate points, one can observe a gradual transition between the corresponding outputs.
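The sketch below illustrates the reparameterization trick and generation from the prior; `decoder` is assumed to be any module mapping latent vectors to outputs, and the function names are illustrative.

```python
import torch
import torch.nn as nn

def reparameterize(mu, logvar):
    # mu, logvar: parameters of q(z|x) produced by the encoder.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)        # noise sampled outside the computation graph
    return mu + eps * std              # gradients flow through mu and logvar

def generate(decoder: nn.Module, latent_dim: int, n_samples: int = 16):
    z = torch.randn(n_samples, latent_dim)   # sample z from the standard normal prior
    with torch.no_grad():
        return decoder(z)                    # the decoder alone acts as a generator
```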
In denoising autoencoders, the encoder receives a corrupted version of the input (with added noise or random masking), and the decoder must reconstruct the clean, original input. This forces the network to learn robust features rather than simply copying the input. The decoder in this setting must be powerful enough to infer the missing or corrupted information from the latent representation.
Image decoders are responsible for converting compressed or abstract feature representations back into spatial image data. They appear in autoencoders, generative adversarial networks (GANs), segmentation models, and diffusion pipelines. The central challenge for an image decoder is upsampling: increasing the spatial resolution of feature maps while preserving (or generating) fine-grained details.
Several methods are used for spatial upsampling in image decoders:
| Technique | Description | Advantages | Disadvantages |
|---|---|---|---|
| Transposed convolution | Learnable operation that combines upsampling and convolution; inserts zeros between input elements and applies a convolution | Fully learnable; can capture complex spatial patterns | Prone to checkerboard artifacts if kernel size and stride are mismatched |
| Nearest-neighbor upsampling + convolution | Repeats each pixel value to increase resolution, then applies a standard convolution to refine | Avoids checkerboard artifacts; simple to implement | Two-step process; slightly less parameter-efficient |
| Bilinear interpolation + convolution | Uses bilinear interpolation for smooth upsampling, followed by convolution | Produces smoother results than nearest-neighbor | Less flexible than transposed convolution |
| Pixel shuffle (sub-pixel convolution) | Rearranges elements from depth (channels) to spatial dimensions | Computationally efficient; avoids artifacts | Requires channel count to be a multiple of the upscale factor squared |
Transposed convolutions are the most common learned upsampling approach. They work by inserting zeros between input feature map elements, padding the borders, and then applying a standard convolution. The kernel weights are learned during training, allowing the network to discover the best way to reconstruct spatial detail. However, when the kernel size is not evenly divisible by the stride, transposed convolutions can produce checkerboard artifacts in the output. A popular alternative is to use nearest-neighbor or bilinear upsampling followed by a standard convolution, which separates the spatial enlargement step from the feature transformation step.
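The two approaches can be written as follows (channel counts are illustrative):

```python
import torch.nn as nn

# (a) Learned transposed convolution: doubles spatial resolution in a single operation.
up_transposed = nn.ConvTranspose2d(in_channels=128, out_channels=64, kernel_size=2, stride=2)

# (b) Nearest-neighbor upsampling followed by a standard convolution, which separates
#     spatial enlargement from feature transformation and avoids checkerboard artifacts.
up_nn_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(in_channels=128, out_channels=64, kernel_size=3, padding=1),
)
```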
The U-Net architecture, introduced by Ronneberger, Fischer, and Brox in 2015 for biomedical image segmentation, features a distinctive encoder-decoder structure with skip connections.
The U-Net decoder (sometimes called the "expanding path" or "upsampling path") progressively increases the spatial resolution of feature maps while decreasing the number of channels. At each level, the decoder performs:

1. Upsampling of the feature maps, typically with a 2x2 transposed convolution ("up-convolution") that halves the number of channels.
2. Concatenation with the corresponding feature maps from the encoder, delivered by a skip connection.
3. Two 3x3 convolutions, each followed by a ReLU activation, to refine the combined features.
The final layer of the decoder is a 1x1 convolution that maps the feature channels to the desired number of output classes for pixel-wise classification.
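A sketch of one such decoder level (assuming the skip connection already matches the upsampled spatial size; the original U-Net crops the encoder feature maps to fit):

```python
import torch
import torch.nn as nn

class UNetDecoderBlock(nn.Module):
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)   # 2x upsampling
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)                      # increase spatial resolution
        x = torch.cat([x, skip], dim=1)     # concatenate the encoder's skip connection
        return self.conv(x)                 # refine the combined features
```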
Without skip connections, the decoder would have to reconstruct fine spatial details purely from the low-resolution, high-level features at the bottom of the network. Skip connections provide a shortcut for spatial information to flow directly from the encoder to the decoder, enabling precise localization. This design has proven so effective that skip connections have been adopted in many subsequent architectures, including SegNet, Feature Pyramid Networks (FPN), and variants like U-Net++ and Attention U-Net.
In latent diffusion models such as Stable Diffusion, a VAE decoder plays a critical role in the generation pipeline. The diffusion process operates in a compressed latent space (encoded by a VAE encoder) rather than in pixel space, which makes training and inference much more efficient. After the U-Net denoises the latent representation through iterative reverse diffusion steps, the VAE decoder converts the final denoised latent back into a full-resolution pixel image. This decoder runs only once at the end of the generation process.
Once a decoder model has been trained, the method used to select tokens during inference significantly affects the quality, diversity, and coherence of the generated text. These methods are called decoding strategies (or decoding algorithms).
Greedy decoding selects the single most probable token at each step. It is computationally cheap and deterministic, but it often produces repetitive or suboptimal text because it never considers alternative paths. A locally optimal choice at each step does not guarantee a globally optimal sequence.
Beam search maintains a set of the top-k most probable partial sequences (called "beams") at each step, expanding each beam with every possible next token and keeping only the top-k scoring candidates. It explores a broader set of possibilities than greedy decoding while remaining tractable. Beam search is widely used in machine translation and speech recognition, where output quality matters and some diversity is acceptable. The beam width (typically 4 to 10) controls the trade-off between computation and search thoroughness. However, beam search tends to produce generic, high-probability text and can still be repetitive. For open-ended generation tasks (such as creative writing or dialogue), sampling-based methods are preferred.
Top-k sampling restricts the candidate pool at each step to the k most probable tokens, then samples from this truncated distribution after renormalization. This prevents the model from selecting very low-probability (and often incoherent) tokens while introducing controlled randomness. Fan et al. (2018) popularized this approach. The parameter k is fixed; a common value is 40 or 50.
Top-p sampling (also called nucleus sampling), introduced by Holtzman et al. (2020), dynamically adjusts the candidate pool size. Instead of fixing k, it selects the smallest set of tokens whose cumulative probability exceeds a threshold p (typically 0.9 or 0.95). When the model is confident, the nucleus is small; when the distribution is flat, it is larger. This adaptive behavior makes top-p sampling more flexible than top-k and generally produces more natural-sounding text.
Temperature is a parameter that scales the logits (the raw scores before softmax) before computing the probability distribution. A temperature below 1.0 sharpens the distribution, making the model more confident and deterministic. A temperature above 1.0 flattens the distribution, increasing randomness and diversity. At the extreme, a temperature approaching 0 is equivalent to greedy decoding, and a very high temperature produces near-uniform random sampling. Temperature is often combined with top-k or top-p sampling. For instance, a common configuration for creative text generation might use temperature 0.8 with top-p 0.95.
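The sketch below combines temperature scaling with top-k and top-p filtering for a single decoding step; the function name and default values are illustrative rather than taken from a specific library.

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95):
    # logits: (vocab_size,) raw scores for the next token.
    logits = logits / temperature                                  # sharpen (<1) or flatten (>1)

    # Top-k: keep only the k highest-scoring tokens.
    kth_value = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_value] = float("-inf")

    # Top-p (nucleus): keep the smallest set whose cumulative probability exceeds p.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()     # shift so the token crossing the threshold is kept
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()          # sample from the truncated distribution
```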
| Strategy | Deterministic? | Diversity | Common use cases | Key parameter |
|---|---|---|---|---|
| Greedy | Yes | Low | Quick prototyping, simple tasks | None |
| Beam search | Yes (for a given beam width) | Low to moderate | Machine translation, speech recognition, summarization | Beam width (k) |
| Top-k sampling | No | Moderate | Open-ended text generation | k (e.g., 40) |
| Top-p (nucleus) sampling | No | Moderate to high | Dialogue, creative writing | p (e.g., 0.9) |
| Temperature scaling | Depends on base method | Adjustable | Combined with sampling methods | Temperature (e.g., 0.7) |
Beyond these standard strategies, recent research has introduced more sophisticated decoding techniques, such as contrastive decoding, typical sampling, and constrained (grammar-guided) decoding, which modify how the next token is selected in order to improve coherence or enforce structural requirements on the output.
The key-value (KV) cache is an optimization technique that dramatically accelerates autoregressive decoding in Transformer models. During autoregressive generation, the model produces tokens one at a time, and at each step the self-attention mechanism must compute over all previous tokens. Without caching, this means the key and value projections for every earlier token would be recomputed at every step, resulting in redundant computation that grows quadratically with sequence length.
The KV cache stores the key (K) and value (V) matrices from previous time steps so they can be reused. At each new generation step, only the key and value vectors for the newly generated token need to be computed and appended to the cache. The query (Q) vector for the current token is then computed against the full set of cached keys and values. This reduces the per-step computation from O(n^2) to O(n), where n is the current sequence length.
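A simplified single-head sketch of the idea (real implementations cache per layer and per head, and batch these operations):

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new, k_new, v_new: (1, d_k) projections for the newly generated token only.
    if cache is None:
        k_all, v_all = k_new, v_new
    else:
        k_all = torch.cat([cache["k"], k_new], dim=0)    # reuse all previously computed keys
        v_all = torch.cat([cache["v"], v_new], dim=0)    # reuse all previously computed values
    cache = {"k": k_all, "v": v_all}

    scores = q_new @ k_all.T / k_all.size(-1) ** 0.5     # (1, tokens_so_far)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_all, cache                        # attention output and updated cache
```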
The KV cache's memory footprint scales linearly with the sequence length, the number of layers, and the hidden dimension of the model. For large models with long context windows, the KV cache can consume substantial GPU memory. For example, a LLaMA 2 7B model at 16-bit precision with a batch size of 1 requires approximately 2 GB of KV cache memory at its full 4,096-token context length. At larger batch sizes or longer contexts, this figure grows rapidly.
Several techniques have been developed to reduce KV cache memory usage:
| Optimization technique | Description |
|---|---|
| Multi-query attention (MQA) | Shares a single set of key-value heads across all query heads, dramatically reducing cache size |
| Grouped-query attention (GQA) | Shares key-value heads among groups of query heads; a middle ground between full multi-head attention and MQA |
| KV cache quantization | Reduces the numerical precision of cached values (e.g., from 16-bit to 4-bit) to save memory with minimal quality impact |
| Cache eviction strategies | Selectively removes less important entries from the cache, keeping only the most relevant tokens |
| Sliding window attention | Limits the attention context to a fixed window of recent tokens rather than the full sequence |
In practical benchmarks, generating 1,000 tokens with KV caching takes roughly 11 to 12 seconds, compared to over 56 seconds without it. This illustrates the technique's critical importance for real-time inference in production systems.
Speculative decoding is an inference acceleration technique that speeds up autoregressive generation without changing the output distribution. It was introduced by Leviathan, Kalman, and Matias (2023) and independently by Chen et al. (2023).
The core idea is to use a smaller, faster "draft" model to generate a sequence of candidate tokens, and then verify these candidates in a single forward pass of the larger "target" model. Because the target model can process multiple tokens in parallel (just as it does during training), verification is much cheaper than generating each token individually.
The algorithm proceeds as follows:

1. The draft model autoregressively generates k candidate tokens.
2. The target model processes the existing sequence plus all k candidates in a single forward pass, producing its own next-token distribution at each drafted position.
3. Each drafted token is accepted with probability min(1, p/q), where p is the target model's probability of that token and q is the draft model's probability.
4. At the first rejection, a replacement token is sampled from the corrected residual distribution (proportional to max(0, p − q)), and the remaining drafted tokens are discarded; if every candidate is accepted, one additional token is sampled from the target model's distribution at the final position.
5. The process repeats from the end of the newly extended sequence.
Because the acceptance/rejection step uses a specific mathematical correction, the output distribution of speculative decoding is provably identical to that of standard autoregressive decoding with the target model alone. This means speculative decoding is a lossless acceleration technique.
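The sketch below shows the accept/reject step under the stated assumptions: `drafted` holds the k candidate tokens, and `draft_probs[i]` / `target_probs[i]` are the two models' full next-token distributions at drafted position i. For brevity it omits the extra token sampled when every candidate is accepted.

```python
import torch

def verify_draft(drafted, draft_probs, target_probs):
    accepted = []
    for i, token in enumerate(drafted):
        p = target_probs[i][token]       # target model's probability of the drafted token
        q = draft_probs[i][token]        # draft model's probability (nonzero, since it sampled the token)
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(token)       # accept: statistically identical to target-model sampling
        else:
            # Reject: resample from the corrected residual distribution max(0, p - q), renormalized.
            residual = torch.clamp(target_probs[i] - draft_probs[i], min=0.0)
            residual = residual / residual.sum()
            accepted.append(torch.multinomial(residual, 1).item())
            break                        # discard all later drafted tokens
    return accepted
```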
Speculative decoding typically delivers 2x to 3x speedups over standard autoregressive decoding, depending on the similarity between the draft and target models and the nature of the task. It has been adopted in production systems across the industry. Google has deployed speculative decoding in several products, and Apple has developed variants such as Mirror Speculative Decoding and Speculative Streaming, which aim to further reduce the serial bottleneck in generation.
Recent developments (2024 and 2025) include techniques for handling vocabulary mismatches between draft and target models, multi-model speculation approaches, and methods like SpecEE that use early-exit strategies rather than separate draft models.
Regardless of the specific architecture, training a decoder generally involves optimizing it to produce output sequences that match ground-truth target sequences.
For sequence generation tasks (translation, text generation), the standard loss function is cross-entropy loss computed at each output position. Given a target sequence of tokens, the loss at each step is the negative log-probability of the correct token under the decoder's predicted distribution. The total loss is the sum (or average) over all positions.
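A short sketch, assuming the target tokens are already shifted so that position t holds the token the decoder should predict at step t, and that a hypothetical `pad_id` marks padding positions:

```python
import torch.nn.functional as F

def sequence_loss(logits, targets, pad_id=0):
    # logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) token indices.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten to (batch * seq_len, vocab_size)
        targets.reshape(-1),                   # flatten to (batch * seq_len,)
        ignore_index=pad_id,                   # padding positions do not contribute to the loss
    )
```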
For reconstruction tasks (autoencoders), the loss is typically mean squared error (MSE) for continuous data or binary cross-entropy for binary data.
Decoders are trained using gradient descent (typically Adam or AdamW) with backpropagation. In Transformer-based models, learning rate warmup followed by a cosine or inverse-square-root decay schedule is standard practice. Large decoder models may require mixed-precision training, gradient checkpointing, and distributed training across multiple GPUs or TPUs.
Dropout is applied within decoder layers to prevent overfitting. Label smoothing, which replaces hard one-hot targets with softened distributions, is another common regularization technique for decoder training. Weight decay (L2 regularization) is also standard.
The quality of a decoder's output is evaluated differently depending on the task:

- Machine translation: BLEU score (and related metrics such as chrF or COMET) against reference translations.
- Summarization: ROUGE scores measuring overlap with reference summaries.
- Language modeling and open-ended generation: perplexity, often supplemented by human evaluation of fluency and coherence.
- Image generation: Fréchet Inception Distance (FID) and Inception Score.
- Image segmentation: Intersection over Union (IoU) and the Dice coefficient.
- Reconstruction (autoencoders): reconstruction error such as MSE or peak signal-to-noise ratio (PSNR).
Decoders are used across a wide range of deep learning applications, including machine translation, chatbots and dialogue systems, code completion, text summarization, image generation (VAEs, GANs, and latent diffusion models), semantic segmentation, image captioning, and speech recognition and synthesis.
Imagine you have a box of building blocks, and you build a really cool castle. Now, you want to send the castle to your friend, but it is too big to mail. So you take a photo of the castle and send that instead. The photo is much smaller, but it captures the important details.
Your friend gets the photo and uses it to rebuild the castle with their own blocks. They might not get every single detail exactly right, but they can rebuild something very close to your original castle.
In machine learning, the "encoder" is like taking the photo: it squishes big, complicated information down into something small and compact. The "decoder" is like your friend rebuilding the castle from the photo: it takes that compact information and turns it back into something full-sized and useful, like a translated sentence, a generated image, or an answer to a question.
Some decoders are extra clever. Instead of needing a photo from someone else, they can create entirely new castles on their own, one block at a time, by deciding what block to place next based on all the blocks they have already placed. That is how chatbots like ChatGPT work: they generate text one word at a time, always looking at what they have written so far to decide the next word.