The Transformer is a deep learning architecture introduced by eight researchers at Google in the 2017 paper "Attention Is All You Need". It uses attention as the sole mechanism for modeling relationships between elements of a sequence, removing the recurrence found in earlier sequence models such as recurrent neural networks (RNNs) and long short-term memory (LSTM) networks. The architecture allows training to be parallelized across positions, scales well with model size and data, and now underpins almost every large language model in production, including the GPT series, BERT, Claude, Gemini, LLaMA, Mistral, and Qwen, as well as image, audio, and protein-structure models.
The paper was submitted to arXiv on June 12, 2017, and presented at the 31st Conference on Neural Information Processing Systems (NeurIPS) in December 2017. The eight authors were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, all working at Google Brain or Google Research at the time. The paper noted that all authors contributed equally and that the listing order was randomized.
The original goal was practical. Sequence-to-sequence machine translation models built on RNNs and LSTMs had to process tokens one at a time during training, which made it difficult to use the parallelism of GPUs efficiently and made very long-range dependencies hard to capture. Earlier work on additive attention by Bahdanau, Cho, and Bengio (2014) and on convolutional sequence models by Gehring et al. at Facebook (2017) showed that attention could shorten the path between distant tokens. The Transformer pushed that idea to its limit by removing the recurrent and convolutional backbones entirely and relying only on attention plus simple feedforward layers.
On the WMT 2014 English-to-German translation task, the large Transformer reached 28.4 BLEU, beating the previous best ensemble by more than 2 BLEU. On WMT 2014 English-to-French, a single model reached 41.8 BLEU after 3.5 days of training on eight NVIDIA P100 GPUs. The model also generalized to English constituency parsing. The 2017 paper has since been cited well over 170,000 times and is among the most cited research papers of the 21st century.
Recurrent models compute the hidden state at position t from the hidden state at position t minus 1. That sequential dependency creates two problems. First, training cannot be fully parallelized across the positions of a single sequence, because each step waits on the previous one. Second, gradients have to travel through many time steps to relate distant tokens, which causes vanishing or exploding gradients in practice. LSTMs and gated recurrent units soften the second problem but do not eliminate it.
Attention sidesteps both issues. Every output position can look directly at every input position in a single matrix multiplication. The path length between any two tokens is constant, regardless of how far apart they are, so long-range dependencies become a matter of soft retrieval rather than long-distance error propagation. The same operation can run as a single batched matrix multiplication on a GPU or TPU, which is exactly the kind of workload modern accelerators are built for.
The original Transformer uses an encoder-decoder layout. The encoder reads the source sequence and produces a stack of contextual representations. The decoder reads the encoder output along with the partially generated target sequence and produces the next token. Both halves are stacks of identical layers, each built from a small number of standard pieces.
The paper used six encoder layers and six decoder layers, an embedding dimension of 512 (1024 in the large model), eight attention heads (16 in the large model), and a feedforward inner dimension of 2048 (4096 in the large model). The full base model has roughly 65 million parameters. The large model has roughly 213 million.
Each encoder layer has two sublayers:
- multi-head self-attention over the layer's input, and
- a position-wise feedforward network applied to each position independently.
A residual connection wraps each sublayer, followed by layer normalization. In the original paper, normalization is applied after the residual addition ("post-LN"). Most modern implementations apply it before the sublayer ("pre-LN") because pre-LN trains more stably without a learning rate warmup.
Each decoder layer has three sublayers:
- masked multi-head self-attention over the target tokens generated so far,
- cross-attention, in which queries come from the decoder and keys and values come from the encoder output, and
- a position-wise feedforward network.
Residual connections and layer normalization wrap each sublayer just as in the encoder.
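To make the layer structure concrete, here is a minimal sketch of a single pre-LN encoder layer in PyTorch. It is illustrative only, not the original implementation; the dimensions follow the base model described above, and `nn.MultiheadAttention` stands in for a hand-written attention module.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Minimal pre-LN Transformer encoder layer (illustrative sketch)."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, wrapped in a residual connection.
        h = self.norm1(x)  # pre-LN: normalize before the sublayer, not after the addition
        x = x + self.dropout(self.attn(h, h, h, need_weights=False)[0])
        # Sublayer 2: position-wise feedforward, also wrapped in a residual connection.
        x = x + self.dropout(self.ffn(self.norm2(x)))
        return x
```

A post-LN layer, as in the original paper, would instead apply `LayerNorm` after each residual addition.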
Input tokens are mapped to vectors through an embedding matrix. The same matrix is often shared with the output projection that produces logits over the vocabulary, a trick called weight tying that reduces parameter count and tends to improve perplexity. A final softmax over the logits gives the next-token probability distribution.
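Weight tying amounts to one line in most frameworks. The snippet below is a small sketch with hypothetical sizes, not any particular model's configuration.

```python
import torch.nn as nn

vocab_size, d_model = 32000, 512
embed = nn.Embedding(vocab_size, d_model)             # token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)  # hidden state -> logits over vocabulary
lm_head.weight = embed.weight                          # weight tying: one shared matrix
```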
The attention operation at the heart of the architecture is straightforward to write down. Given a set of queries Q, keys K, and values V (each a matrix of vectors stacked row by row), the output is:
Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V
where d_k is the dimension of each key vector. The dot products Q K^T measure how well each query matches each key. Dividing by sqrt(d_k) keeps the magnitudes from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients. The softmax turns the scores into a distribution, and multiplying by V produces a weighted sum of value vectors.
In self-attention, Q, K, and V are all linear projections of the same input. In cross-attention, Q comes from one sequence and K, V come from another. In masked self-attention, an additive mask sets the scores for forbidden positions (such as future tokens during decoding) to negative infinity before the softmax.
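The formula translates directly into a few lines of code. This is a minimal sketch of scaled dot-product attention with an optional causal mask, assuming PyTorch tensors; it is not an optimized kernel.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; q, k, v have shape (..., seq_len, d_k)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5          # (..., len_q, len_k)
    if causal:
        # Set scores for future positions to -inf before the softmax.
        len_q, len_k = scores.shape[-2:]
        mask = torch.triu(torch.ones(len_q, len_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                    # weighted sum of value vectors
```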
A single attention operation can only encode one set of relationships at a time. Multi-head attention runs h parallel attention operations on different learned projections of the input, then concatenates the results and projects them back to the model dimension. With model dimension d and h heads, each head usually has key and value dimension d/h, so the total compute is similar to a single attention with full dimension.
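The sketch below shows the project-split-attend-merge pattern explicitly. It is an illustrative implementation, not the original code; biases, dropout, and masking are omitted.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative multi-head self-attention: project, split into heads, attend, merge."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, t, _ = x.shape
        # Each projection is reshaped into n_heads vectors of size d_head.
        q, k, v = (proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for proj in (self.q_proj, self.k_proj, self.v_proj))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v             # (b, heads, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, -1)         # concatenate the heads
        return self.out_proj(out)
```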
Different heads tend to specialize. Some look at adjacent tokens, some track syntactic structure such as subject-verb agreement, some attend to specific token types like punctuation or rare nouns. The original paper visualized several heads that captured anaphora resolution and long-range dependencies in English sentences.
A common modern variant is multi-query attention, in which all heads share a single key and value projection while keeping per-head queries. Grouped-query attention is a middle ground that groups heads to share keys and values. Both reduce the size of the key-value cache during autoregressive decoding, which is the main memory bottleneck for long-context inference.
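The memory saving of grouped-query attention comes from storing and caching only a small number of key-value heads. The snippet below uses hypothetical shapes to show the idea: eight query heads share two KV heads, which are expanded only at compute time.

```python
import torch

# Hypothetical shapes: 8 query heads sharing 2 key/value heads (grouped-query attention).
batch, seq, d_head, n_heads, n_kv_heads = 1, 16, 64, 8, 2
q = torch.randn(batch, n_heads, seq, d_head)
k = torch.randn(batch, n_kv_heads, seq, d_head)   # only these 2 heads go into the KV cache
v = torch.randn(batch, n_kv_heads, seq, d_head)

group = n_heads // n_kv_heads                     # 4 query heads per KV head
k = k.repeat_interleave(group, dim=1)             # expand KV heads to match query heads
v = v.repeat_interleave(group, dim=1)
scores = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
out = scores @ v                                  # (batch, n_heads, seq, d_head)
```

Multi-query attention is the special case with a single KV head.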
Attention is permutation-invariant. Without extra information, a Transformer would treat "the cat sat on the mat" and "the mat sat on the cat" the same way. Positional encoding injects the order of tokens into the model.
The original paper used fixed sinusoidal positional encodings:
PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )
for position pos and embedding dimension index i. The encoding is added to the token embedding before the first attention layer. Sine and cosine were chosen so that PE(pos + k) is a linear function of PE(pos) for any fixed k, which the authors argued might help the model learn relative offsets.
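A small sketch of the encoding as defined above, written with PyTorch tensors; the table layout (even indices sine, odd indices cosine) follows the two formulas directly.

```python
import torch

def sinusoidal_pe(max_len, d_model):
    """Fixed sinusoidal positional encodings, one row per position (illustrative)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)       # (max_len, 1)
    i = torch.arange(d_model // 2, dtype=torch.float32).unsqueeze(0)    # (1, d_model/2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even embedding indices
    pe[:, 1::2] = torch.cos(angles)   # odd embedding indices
    return pe                         # added to the token embeddings before the first layer
```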
Later work has produced several alternatives:
| Encoding | Year | Used in | Idea |
|---|---|---|---|
| Sinusoidal | 2017 | Original Transformer | Fixed sin/cos values added to embeddings |
| Learned absolute | 2018 | BERT, GPT-2 | Trainable position vectors per index |
| Relative position | 2018 | Transformer-XL, T5 | Bias attention scores by relative distance |
| RoPE (Rotary Position Embedding) | 2021 | LLaMA, GPT-NeoX, PaLM, Mistral | Rotate Q and K vectors by an angle proportional to position |
| ALiBi (Attention with Linear Biases) | 2021 | BLOOM, MPT | Add a per-head linear bias to attention logits |
RoPE in particular has become the default for most large language models trained after 2022 because it encodes relative position cleanly and can extrapolate to longer contexts than were seen during training, especially when combined with techniques like position interpolation or NTK-aware scaling.
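Below is a minimal sketch of one common RoPE formulation (the "rotate-half" style used in several open implementations); frequency choices and channel pairing vary between codebases, so treat this as illustrative.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate channel pairs of x (batch, heads, seq, d_head) by position-dependent angles."""
    b, h, t, d = x.shape
    half = d // 2
    freqs = 1.0 / base ** (torch.arange(half, dtype=torch.float32) / half)       # theta_i
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]      # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    # Rotating each (x1, x2) pair encodes absolute position; dot products between
    # rotated queries and keys then depend only on the relative offset.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

The rotation is applied to Q and K before the attention scores are computed; V is left untouched.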
After attention, each position passes through a feedforward block applied independently:
FFN(x) = phi( x W1 + b1 ) W2 + b2
The original paper used a two-layer fully connected network with a ReLU between them and an inner dimension four times the model dimension. Modern variants almost always replace ReLU with a smoother activation (GELU in BERT and GPT-2, SwiGLU or GeGLU in LLaMA, PaLM, and most current open models). The feedforward layers hold the majority of the model's parameters, often more than 60 percent in large language models.
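For concreteness, here is the original two-layer ReLU block alongside a gated SwiGLU variant of the kind used in recent open models. Both are illustrative sketches; real models differ in biases, dimensions, and initialization.

```python
import torch.nn as nn
import torch.nn.functional as F

# Original position-wise FFN: two linear layers with a ReLU in between.
def make_ffn(d_model=512, d_ff=2048):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

class SwiGLUFFN(nn.Module):
    """Gated feedforward variant (illustrative); gated models often shrink d_ff
    below 4 * d_model to keep the parameter count comparable."""
    def __init__(self, d_model=512, d_ff=1536):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # Elementwise product of a SiLU-gated branch and a linear branch.
        return self.down(F.silu(self.gate(x)) * self.up(x))
```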
Residual connections add the sublayer input to its output, which keeps gradient signals strong and lets very deep networks train without degrading. Layer normalization stabilizes the activations across the embedding dimension. RMSNorm, a simpler variant that omits the mean subtraction, is now common in LLaMA and other production models.
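RMSNorm is short enough to write out in full; this sketch normalizes by the root-mean-square of the activations and keeps only a learned scale, with no mean subtraction or bias.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square over the embedding dimension."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```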
For sequence-to-sequence translation, the Transformer is trained with teacher forcing: the decoder input at time t is the ground-truth token at time t, not the model's own previous prediction. Cross-entropy loss is computed between the model's predicted distribution and the actual next token, summed across positions, and minimized with a variant of stochastic gradient descent. The Adam optimizer with the warmup-then-decay learning rate schedule from the original paper became the default for years. Modern training usually uses AdamW with cosine decay and label smoothing of around 0.1.
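The warmup-then-decay schedule from the original paper is a one-line formula: the rate rises linearly for the first warmup_steps updates and then decays as the inverse square root of the step number.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate schedule from the original paper: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```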
Language models are trained with self-supervised objectives. Decoder-only models predict the next token. Encoder-only models predict masked tokens given the rest of the sequence. Encoder-decoder models such as T5 use a span corruption objective in which a contiguous span of tokens is replaced with a sentinel and the decoder is trained to reconstruct it.
Text is split into subword tokens before entering the model. Common schemes include Byte-Pair Encoding (BPE), used by GPT and most open models; WordPiece, used by BERT; and SentencePiece, a language-agnostic library that supports both. The vocabulary size is usually between 30,000 and 200,000 tokens.
In 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models". They found that test loss falls as a power law in three quantities: model size N, dataset size D, and training compute C. Architectural details such as depth and width matter much less than the totals, within a wide range. The result was a recipe for spending compute: train very large models on relatively small amounts of data and stop well before convergence. This reasoning informed the design of GPT-3, which used 175 billion parameters trained on around 300 billion tokens.
In 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models", known as the Chinchilla paper. By training more than 400 models from 70 million to 16 billion parameters on 5 to 500 billion tokens, they found that for a fixed compute budget, model size and dataset size should scale roughly equally: every doubling of parameters should be matched by a doubling of training tokens. They demonstrated this by training Chinchilla, a 70B-parameter model on 1.4 trillion tokens, which outperformed the much larger Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of evaluations.
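A rough back-of-the-envelope comparison makes the shift concrete. Using the common approximation that training costs about 6 FLOPs per parameter per token (an approximation from the scaling-law literature, not a figure from either paper), Chinchilla spent a similar order of compute to GPT-3 but at roughly 20 training tokens per parameter instead of fewer than 2.

```python
# Rough compute accounting with the common C ~ 6 * N * D approximation (FLOPs).
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

gpt3       = train_flops(175e9, 300e9)    # ~3.2e23 FLOPs, ~1.7 tokens per parameter
chinchilla = train_flops(70e9, 1.4e12)    # ~5.9e23 FLOPs, ~20 tokens per parameter
print(f"GPT-3: {gpt3:.1e} FLOPs, Chinchilla: {chinchilla:.1e} FLOPs")
```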
The Chinchilla finding shifted the field. Later models, including LLaMA 1, LLaMA 2, and most open-source releases since 2023, have used compute-optimal or even data-heavy training ratios.
Transformers come in three main flavors based on which halves of the original architecture are kept.
| Variant | Examples | Typical use | Pretraining objective |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, ALBERT, DeBERTa, ELECTRA | Classification, retrieval, sentence embeddings | Masked language modeling |
| Decoder-only | GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Claude, Qwen, Falcon | Text generation, chat, code | Next-token prediction |
| Encoder-decoder | Original Transformer, T5, BART, mT5, FLAN-T5 | Translation, summarization, structured generation | Span corruption or denoising |
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google in 2018, was the first widely adopted encoder-only Transformer. BERT-Base has 12 layers, 768 hidden dimensions, 12 heads, and 110 million parameters. BERT-Large has 24 layers, 1024 hidden dimensions, 16 heads, and 340 million parameters. It was pretrained with masked language modeling (predicting 15 percent of tokens that are randomly masked) and next-sentence prediction. BERT pushed the GLUE benchmark to 80.5, lifted SQuAD v1.1 F1 to 93.2, and was deployed in Google Search starting in October 2019. RoBERTa (Facebook AI, 2019) showed that BERT was undertrained and improved scores by removing next-sentence prediction and training longer on more data. DeBERTa added disentangled attention for content and position, and ELECTRA replaced the masked-token objective with replaced-token detection.
The first GPT, presented in "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI in 2018, used a 12-layer decoder-only Transformer with 768 hidden dimensions, 12 attention heads, 3072 inner FFN dimension, and roughly 117 million parameters. It was pretrained on the BookCorpus and then fine-tuned discriminatively for each downstream task. GPT-2 (2019) scaled this to 1.5 billion parameters and showed that a single model could perform many tasks zero-shot. GPT-3 (2020) reached 175 billion parameters and demonstrated few-shot learning. GPT-4 and later models follow the same decoder-only blueprint with mixtures of experts, longer context windows, and multimodal inputs added on top.
Most open-weight large language models released since 2023 also use decoder-only Transformers: LLaMA, LLaMA 2, LLaMA 3, Mistral, Mixtral, Qwen, Yi, Falcon, DeepSeek, and Gemma all follow the same recipe with variations in normalization, position encoding, attention heads, and FFN activations.
T5 (Text-to-Text Transfer Transformer), released by Google in 2019, framed every NLP task as text-in, text-out and used the original encoder-decoder layout pretrained on the C4 corpus with a span corruption objective. BART, from Facebook AI in 2019, used a similar layout with a more general denoising autoencoder objective and excelled at summarization. mT5 and FLAN-T5 extended the recipe to more languages and instruction tuning.
The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google Research in the 2020 paper "An Image Is Worth 16x16 Words," treats an image as a sequence of fixed-size patches. A standard ViT splits a 224 by 224 image into a 14 by 14 grid of 16 by 16 pixel patches (196 patches in total), projects each patch into an embedding, prepends a learnable [CLS] token, adds positional embeddings, and runs the result through a stack of standard Transformer encoder layers. With enough pretraining data, ViT matched or exceeded the best convolutional neural networks on ImageNet classification. Variants followed quickly: DeiT made ViT trainable on ImageNet-1k alone with distillation, Swin Transformer added shifted-window attention to give a hierarchical, CNN-like inductive bias, and DETR used a Transformer encoder-decoder for object detection.
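The patchification step is a simple reshape. The sketch below uses hypothetical shapes to show how a 224 by 224 image becomes 196 patch vectors of length 768 (3 channels times 16 times 16), ready for a linear projection into the model dimension.

```python
import torch

images = torch.randn(8, 3, 224, 224)          # (batch, channels, height, width)
patch = 16
# Cut the image into non-overlapping 16x16 patches, then flatten each patch.
patches = images.unfold(2, patch, patch).unfold(3, patch, patch)     # (8, 3, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(8, 14 * 14, 3 * patch * patch)
print(patches.shape)                           # torch.Size([8, 196, 768])
# A linear layer then maps each 768-dim patch vector to the model dimension,
# a learnable [CLS] token is prepended, and positional embeddings are added.
```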
Multimodal models pair Transformers across modalities. CLIP (OpenAI, 2021) trains a text Transformer and an image Transformer jointly with a contrastive objective on 400 million image-text pairs. Flamingo (DeepMind, 2022) interleaves vision and language tokens for few-shot visual question answering. Modern frontier models (GPT-4o, Gemini, Claude 3) accept images, audio, and video as token streams alongside text.
Standard attention has time and memory complexity O(n^2) in the sequence length n, which becomes the bottleneck for long contexts. Many lines of work try to reduce that cost.
| Method | Year | Approach |
|---|---|---|
| Sparse Transformer | 2019 | Restrict attention to fixed sparse patterns |
| Reformer | 2020 | Locality-sensitive hashing groups similar queries and keys |
| Linformer | 2020 | Project keys and values to a fixed lower dimension |
| Longformer | 2020 | Combine local sliding-window attention with a few global tokens |
| Performer | 2020 | Approximate softmax attention with random feature kernels |
| BigBird | 2020 | Random plus local plus global attention pattern |
| FlashAttention | 2022 | Exact attention reordered to minimize GPU memory I/O |
| FlashAttention-2 / 3 | 2023, 2024 | Better parallelism, work partitioning, and FP8 support |
FlashAttention, from Tri Dao and colleagues at Stanford, is the most widely adopted of these. It does not change the math of attention. It restructures the computation in tiles that fit in fast on-chip SRAM, cutting reads and writes to GPU high-bandwidth memory. It produced a roughly 3x speedup on GPT-2 with 1k context and a 2.4x speedup on long-range arena benchmarks at 1k to 4k context. FlashAttention is now the default attention kernel in PyTorch, JAX, and most inference frameworks.
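In PyTorch 2.x the fused kernel is exposed through `torch.nn.functional.scaled_dot_product_attention`; on supported GPUs, dtypes, and shapes it can dispatch to a FlashAttention-style implementation without changing the attention math. A minimal usage sketch (the shapes here are arbitrary, and a CUDA device with half precision is assumed for the fast path):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq, d_head); fp16 on a CUDA device is assumed for the fused fast path.
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Same math as softmax(QK^T / sqrt(d_k)) V with a causal mask, computed in tiles on-chip.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```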
KV caching is a complementary trick used during autoregressive decoding. Once a token has been processed, its key and value vectors do not change, so they are stored and reused for every later step. This reduces inference cost from quadratic to linear in sequence length per step.
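A decoding-loop sketch with hypothetical shapes shows why the cost per step is linear: only the newest token's query attends, and past keys and values are appended to a growing cache rather than recomputed.

```python
import torch

d_head = 64
k_cache, v_cache = [], []   # grows by one entry per generated token

for step in range(8):
    # Project only the newest token (stand-in random tensors here); reuse the cache for the rest.
    q_new = torch.randn(1, 1, d_head)            # (batch, 1, d_head)
    k_cache.append(torch.randn(1, 1, d_head))
    v_cache.append(torch.randn(1, 1, d_head))
    k = torch.cat(k_cache, dim=1)                # (batch, step + 1, d_head)
    v = torch.cat(v_cache, dim=1)
    scores = torch.softmax(q_new @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1)
    out = scores @ v                             # attention cost is linear in cache length
```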
Transformers are now the default architecture in almost every domain that involves sequences or sets.
Transformers have well-known weaknesses.
The quadratic cost of attention in sequence length is the most obvious. Even with FlashAttention and KV caching, very long contexts (hundreds of thousands or millions of tokens) require special techniques such as ring attention, sliding-window attention, or state-space hybrids.
Autoregressive Transformers can produce confident but false statements, a behavior usually called hallucination. The model is trained to predict the most likely next token, not to verify facts.
Large Transformers are expensive to train. GPT-3 reportedly cost several million dollars in compute alone; frontier models in 2024 and 2025 are estimated to cost hundreds of millions of dollars per training run. The hardware required (tens of thousands of high-end GPUs or TPUs) is concentrated in a small number of companies and labs.
Interpretability remains hard. Researchers can visualize attention weights and probe individual neurons, but understanding why a 70-billion-parameter model produces a particular output is largely an open problem and the central concern of the mechanistic interpretability research program.
Finally, Transformers are not the only game in town. State-space models such as Mamba (Gu and Dao, 2023) achieve linear-time inference and competitive quality on language modeling. Hybrid architectures such as Jamba combine Mamba blocks with Transformer blocks. Whether one of these alternatives eventually displaces the Transformer is an open question.