Transformers
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 6,988 words
Note: This article is about the neural network architecture introduced in 2017. For the open-source Python library by Hugging Face, see Hugging Face Transformers.
The Transformer is a deep learning architecture introduced by eight researchers at Google in the 2017 paper "Attention Is All You Need"[1]. It uses attention as the sole mechanism for modeling relationships between elements of a sequence, removing the recurrence found in earlier sequence models such as the recurrent neural network and the long short-term memory network. The architecture allows training to be parallelized across positions, scales well with model size and data, and now underpins almost every large language model in production, including the GPT series, BERT, Claude, Gemini, LLaMA, Mistral, and Qwen. Beyond language, Transformers have become a standard architecture for image classification (Vision Transformer), object detection (DETR), speech recognition (Whisper), protein structure prediction (AlphaFold), and class-conditional image and video generation (Diffusion Transformer)[2][3][4][5][6].
The paper was submitted to arXiv on June 12, 2017, and presented at the 31st Conference on Neural Information Processing Systems (NeurIPS) in December 2017[1]. The eight authors were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, all working at Google Brain or Google Research at the time. The paper noted that all authors contributed equally and that the listing order was randomized[1].
The original goal was practical. Sequence-to-sequence machine translation models built on RNNs and LSTMs had to process tokens one at a time during training, which made it difficult to use the parallelism of GPUs efficiently and made very long-range dependencies hard to capture. Earlier work on additive attention by Bahdanau, Cho, and Bengio (2014) and on convolutional sequence models by Gehring et al. at Facebook (2017) showed that attention could shorten the path between distant tokens[7]. The Transformer pushed that idea to its limit by removing the recurrent and convolutional backbones entirely and relying only on attention plus simple feedforward layers.
On the WMT 2014 English-to-German translation task, the large Transformer reached 28.4 BLEU, beating the previous best ensemble by more than 2 BLEU. On WMT 2014 English-to-French, a single model reached 41.8 BLEU after 3.5 days of training on eight NVIDIA P100 GPUs[1]. The model also generalized to English constituency parsing. The 2017 paper has since been cited well over 170,000 times and is among the most cited research papers of the 21st century.
Recurrent models compute the hidden state at position t from the hidden state at position t minus 1. That sequential dependency creates two problems. First, training cannot be fully parallelized across the positions of a single sequence, because each step waits on the previous one. Second, gradients have to travel through many time steps to relate distant tokens, which causes vanishing or exploding gradients in practice. LSTMs and gated recurrent units soften the second problem but do not eliminate it.
Attention sidesteps both issues. Every output position can look directly at every input position in a single matrix multiplication. The path length between any two tokens is constant, regardless of how far apart they are, so long-range dependencies become a matter of soft retrieval rather than long-distance error propagation. The same operation can run as a single batched matrix multiplication on a GPU or TPU, which is exactly the kind of workload modern accelerators are built for.
The Bahdanau attention mechanism of 2014 already used a small attention computation between an LSTM encoder and an LSTM decoder, mainly to soften the bottleneck of compressing a whole source sentence into one vector[7]. By 2017 the field had spent three years incrementally adding attention to recurrent systems. The contribution of the Transformer was to remove the recurrent backbone entirely and let attention carry the full sequence-modeling load.
The original Transformer uses an encoder-decoder layout. The encoder reads the source sequence and produces a stack of contextual representations. The decoder reads the encoder output along with the partially generated target sequence and produces the next token. Both halves are stacks of identical layers, each built from a small number of standard pieces.
The paper used six encoder layers and six decoder layers, an embedding dimension of 512 (1024 in the large model), eight attention heads (16 in the large model), and a feedforward inner dimension of 2048 (4096 in the large model)[1]. The full base model has roughly 65 million parameters. The large model has roughly 213 million.
Each encoder layer has two sublayers: multi-head self-attention over the layer's input, followed by a position-wise feedforward network.
A residual connection wraps each sublayer, followed by layer normalization. In the original paper, normalization is applied after the residual addition ("post-LN"). Most modern implementations apply it before the sublayer ("pre-LN") because pre-LN trains more stably without a learning rate warmup.
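The difference between the two placements is easiest to see in code. The following is a minimal NumPy sketch, not taken from any particular implementation: `layer_norm` omits the learned gain and bias for brevity, and `sublayer` is a hypothetical stand-in for either attention or the feedforward network.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance
    # (learned gain and bias are omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Original 2017 arrangement: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Common modern arrangement: normalize the sublayer input only,
    # so the residual path carries x through unchanged.
    return x + sublayer(layer_norm(x))

x = np.arange(12.0).reshape(3, 4)
double = lambda t: 2.0 * t          # hypothetical sublayer for illustration
y_post = post_ln(x, double)
y_pre = pre_ln(x, double)
```

In the pre-LN form the input reaches the output through an identity path, which is one common explanation for its more stable training.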
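Each decoder layer has three sublayers: masked multi-head self-attention over the target tokens generated so far, multi-head cross-attention over the encoder output, and a position-wise feedforward network.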
Residual connections and layer normalization wrap each sublayer just as in the encoder.
Input tokens are mapped to vectors through an embedding matrix. The same matrix is often shared with the output projection that produces logits over the vocabulary, a trick called weight tying that reduces parameter count and tends to improve perplexity. A final softmax over the logits gives the next-token probability distribution.
The attention operation at the heart of the architecture is straightforward to write down. Given a set of queries Q, keys K, and values V (each a matrix of vectors stacked row by row), the output is:
Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V
where d_k is the dimension of each key vector[1]. The dot products Q K^T measure how well each query matches each key. Dividing by sqrt(d_k) keeps the magnitudes from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients. The softmax turns the scores into a distribution, and multiplying by V produces a weighted sum of value vectors.
In self-attention, Q, K, and V are all linear projections of the same input. In cross-attention, Q comes from one sequence and K, V come from another. In masked self-attention, an additive mask sets the scores for forbidden positions (such as future tokens during decoding) to negative infinity before the softmax.
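The formula and the masking rule above can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions: there is a single head, no batch dimension, and the input is used directly as Q, K, and V instead of passing through learned projections. The function names are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability; exp(-inf) becomes 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional boolean (n_q, n_k) array, True where attention is allowed."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Forbidden positions get a score of -inf, so their weight is 0.
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

def causal_mask(n):
    # Lower-triangular: position i may attend to positions 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # 4 tokens, dimension 8
out, w = scaled_dot_product_attention(x, x, x, mask=causal_mask(4))
```

Each row of `w` sums to 1, and with the causal mask the first token can only attend to itself, so `out[0]` equals `x[0]`.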
A single attention operation can only encode one set of relationships at a time. Multi-head attention runs h parallel attention operations on different learned projections of the input, then concatenates the results and projects them back to the model dimension. With model dimension d and h heads, each head usually has key and value dimensions d divided by h, so the total compute is similar to a single attention with full dimension[1].
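A compact way to see the split-attend-concatenate pattern is the NumPy sketch below. It assumes square projection matrices of shape (d, d), no biases, no masking, and no batch dimension; the weight names `W_q`, `W_k`, `W_v`, and `W_o` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (n, d). W_q, W_k, W_v, W_o: (d, d). h: number of heads, must divide d."""
    n, d = x.shape
    d_head = d // h

    def split(t):
        # (n, d) -> (h, n, d_head): each head gets its own slice of the projection.
        return t.reshape(n, h, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (h, n, n)
    heads = softmax(scores, axis=-1) @ v                    # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)         # back to (n, d)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
x = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))
y = multi_head_self_attention(x, W_q, W_k, W_v, W_o, h)
```

Because each head works in dimension d divided by h, the projections have the same total size as for a single full-width head, which matches the compute argument above.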
Different heads tend to specialize. Some look at adjacent tokens, some track syntactic structure such as verb-subject agreement, some attend to specific token types like punctuation or rare nouns. The original paper visualized several heads that captured anaphora resolution and long-range dependencies in English sentences[1].
A common modern variant is multi-query attention, in which all heads share a single key and value projection while keeping per-head queries. Grouped-query attention (GQA), introduced by Ainslie et al. at Google Research in May 2023, is a middle ground in which each group of query heads shares one key-value head, so it interpolates between standard multi-head attention and multi-query attention[8]. Both variants reduce the size of the key-value cache during autoregressive decoding, which is the main memory bottleneck for long-context inference. Meta adopted GQA for the larger LLaMA 2 models (with 8 KV heads for the 70B model) and retained it through LLaMA 3[8][9].
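The sharing pattern can be illustrated with a short NumPy sketch. It assumes the queries, keys, and values have already been projected and split into heads, and it uses no masking; the function name and shapes are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (h_q, n, d_head); k, v: (h_kv, n, d_head), with h_q divisible by h_kv.
    h_kv == h_q gives multi-head attention, h_kv == 1 gives multi-query attention."""
    h_q, h_kv = q.shape[0], k.shape[0]
    group = h_q // h_kv
    # Each key-value head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 4))     # 8 query heads
k = rng.normal(size=(2, 5, 4))     # 2 shared key heads
v = rng.normal(size=(2, 5, 4))     # 2 shared value heads
out = grouped_query_attention(q, k, v)
```

Only the two key-value heads would need to be cached during decoding here, a fourfold reduction compared with caching one per query head.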
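Self-attention on its own has no notion of order: permuting the input tokens simply permutes the outputs in the same way (the operation is permutation-equivariant). Without extra information, a Transformer would build the same representation for each word in "the cat sat on the mat" and "the mat sat on the cat". Positional encoding injects the order of tokens into the model.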
The original paper used fixed sinusoidal positional encodings:
PE(pos, 2i) = sin( pos / 10000^(2i/d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i/d_model) )
for position pos and embedding dimension index i. The encoding is added to the token embedding before the first attention layer[1]. Sine and cosine were chosen so that PE(pos + k) is a linear function of PE(pos) for any fixed k, which the authors argued might help the model learn relative offsets.
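The two formulas translate directly into NumPy. The sketch below assumes an even `d_model`; the function name is chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Returns an (n_positions, d_model) array; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]            # (n, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(50, 16)
```

The linearity property follows from the angle-addition identities: for a fixed offset k, each (sin, cos) pair at position pos + k is a rotation of the pair at position pos by an angle that depends only on k and the dimension index.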
Later work has produced several alternatives:
| Encoding | Year | Used in | Idea |
|---|---|---|---|
| Sinusoidal | 2017 | Original Transformer | Fixed sin/cos values added to embeddings |
| Learned absolute | 2018 | BERT, GPT-2 | Trainable position vectors per index |
| Relative position | 2018 | Transformer-XL, T5 | Bias attention scores by relative distance |
| RoPE (Rotary Position Embedding) | 2021 | LLaMA, GPT-NeoX, PaLM, Mistral | Rotate Q and K vectors by an angle proportional to position |
| ALiBi (Attention with Linear Biases) | 2021 | BLOOM, MPT | Add a per-head linear bias to attention logits |
RoPE, introduced by Jianlin Su and colleagues in April 2021 in the RoFormer paper, encodes the absolute position by rotating the query and key vectors with a rotation matrix whose angle depends on position, which has the side effect that the dot product between two rotated vectors only depends on their relative offset[10]. RoPE has become the default for most large language models trained after 2022 because it encodes relative position cleanly and can extrapolate to longer contexts than were seen during training, especially when combined with techniques like position interpolation, NTK-aware scaling, and YaRN. RoPE is the default positional strategy in LLaMA, LLaMA 2, LLaMA 3, Gemma, Mistral, and Code Llama[10].
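The relative-offset property can be checked numerically. The sketch below is a minimal NumPy version that rotates adjacent pairs of dimensions, which is one common convention (some implementations pair the first and second halves of the vector instead). The base of 10000 and the function name are assumptions for illustration.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """x: (n, d) float array with d even; positions: (n,) integer array."""
    n, d = x.shape
    freqs = base ** (-np.arange(0, d, 2) / d)          # (d / 2,)
    angles = positions[:, None] * freqs[None, :]       # (n, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    # Rotate each (x1, x2) pair by its position-dependent angle.
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    # Attention logit between a query at position m and a key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T)[0, 0]

same_offset = np.isclose(score(3, 1), score(10, 8))    # both have offset 2
```

Since rotations preserve length, the vector norms are unchanged and only the angle between query and key carries positional information.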
ALiBi, from Press, Smith, and Lewis (2021), adds a fixed linear bias to each attention score that grows with distance, biasing the model toward closer tokens and supporting length extrapolation without any learned parameters[11].
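A sketch of the bias tensor for causal attention follows. It assumes the number of heads is a power of two, in which case the ALiBi paper uses the geometric slope sequence 2^(-8/h), 2^(-16/h), and so on; the function name is chosen for this example.

```python
import numpy as np

def alibi_bias(n, n_heads):
    """Returns an (n_heads, n, n) additive bias for causal attention logits.
    Assumes n_heads is a power of two."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(n)[:, None]            # query positions
    j = np.arange(n)[None, :]            # key positions
    distance = i - j                     # >= 0 for keys at or before the query
    bias = -slopes[:, None, None] * distance[None, :, :]
    # Future positions (j > i) are masked out entirely.
    return np.where(distance[None, :, :] >= 0, bias, -np.inf)

bias = alibi_bias(4, 8)                  # 8 heads, 4 positions
```

The result is added to the scaled dot-product scores before the softmax. Heads with steep slopes focus on nearby tokens, while heads with shallow slopes can still see far back.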
After attention, each position passes through a feedforward block applied independently:
FFN(x) = phi( x W1 + b1 ) W2 + b2
The original paper used a two-layer fully connected network with a ReLU between them and an inner dimension four times the model dimension[1]. Modern variants almost always replace ReLU with a smoother activation (GELU in BERT and GPT-2, SwiGLU or GeGLU in LLaMA, PaLM, and most current open models). The feedforward layers hold the majority of the model's parameters, often more than 60 percent in large language models.
Residual connections add the sublayer input to its output, which keeps gradient signals strong and lets very deep networks train without degrading. Layer normalization stabilizes the activations across the embedding dimension. RMSNorm, a simpler variant that omits the mean subtraction, is now common in LLaMA and other production models.
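The pieces named in this section are small enough to write out. The NumPy sketch below shows the original ReLU feedforward block, a SwiGLU variant without biases (as is common in LLaMA-style models), and RMSNorm. Weight names and shapes are illustrative assumptions, and the inner dimension of 32 is simply four times the toy model dimension of 8.

```python
import numpy as np

def ffn_relu(x, W1, b1, W2, b2):
    # FFN(x) = ReLU(x W1 + b1) W2 + b2, applied to every position independently.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W_gate, W_up, W_down):
    # Gated variant: Swish(x W_gate) multiplied elementwise with (x W_up).
    gate = x @ W_gate
    swish = gate / (1.0 + np.exp(-gate))
    return (swish * (x @ W_up)) @ W_down

def rms_norm(x, gain, eps=1e-6):
    # Like layer normalization but without subtracting the mean.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.normal(size=(3, d))
W1, W2 = rng.normal(size=(d, d_ff)), rng.normal(size=(d_ff, d))
b1, b2 = np.zeros(d_ff), rng.normal(size=d)
W_gate, W_up, W_down = W1, rng.normal(size=(d, d_ff)), W2

y_relu = x + ffn_relu(rms_norm(x, np.ones(d)), W1, b1, W2, b2)   # pre-norm residual
y_swiglu = x + ffn_swiglu(rms_norm(x, np.ones(d)), W_gate, W_up, W_down)
```

The two weight matrices of the ReLU block hold 2 x d x d_ff parameters, against roughly 4 x d x d for the attention projections, which is why the feedforward layers dominate the parameter count when d_ff is four times d.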
For sequence-to-sequence translation, the Transformer is trained with teacher forcing: the decoder input at step t is the ground-truth token from step t minus 1 (the target sequence shifted right by one position), not the model's own previous prediction. Cross-entropy loss is computed between the model's predicted distribution and the actual next token, summed across positions, and minimized with a variant of stochastic gradient descent. The Adam optimizer with the warmup-then-decay learning rate schedule from the original paper, which also applied label smoothing of 0.1, became the default for years[1]. Modern large-scale training usually uses AdamW with a short warmup followed by cosine decay.
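The warmup-then-decay schedule from the original paper is a one-line formula: the rate rises linearly for `warmup_steps` steps and then decays with the inverse square root of the step number. The sketch below uses the paper's base-model values (d_model of 512 and 4,000 warmup steps) as defaults; the function name is chosen for this example.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimizer step (steps are counted from 1)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)            # the two branches meet at the end of warmup
halfway_up = transformer_lr(2000)      # linear warmup: half the peak
later = transformer_lr(16000)          # inverse-sqrt decay: half the peak again
```

With these defaults the peak is about 7 x 10^-4, reached exactly at step 4,000.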
Language models are trained with self-supervised objectives. Decoder-only models predict the next token. Encoder-only models predict masked tokens given the rest of the sequence. Encoder-decoder models such as T5 use a span corruption objective in which a contiguous span of tokens is replaced with a sentinel and the decoder is trained to reconstruct it[12].
Text is split into subword tokens before entering the model. Common schemes include Byte-Pair Encoding (BPE), used by GPT and most open models; WordPiece, used by BERT; and SentencePiece, a language-agnostic library that works on raw text and implements both BPE and unigram language model tokenization[13]. The vocabulary size is usually between 30,000 and 200,000 tokens. LLaMA 3 uses a vocabulary of 128,000 tokens, a substantial increase from the 32,000 used by LLaMA 1 and LLaMA 2, which Meta reports gave a non-trivial improvement in per-token efficiency and downstream quality[9].
The earliest empirical evidence that loss scales predictably with data, model, and compute came in 2017 from Joel Hestness and colleagues at Baidu Research. Their paper "Deep Learning Scaling is Predictable, Empirically" found power-law generalization error scaling across machine translation, language modeling, image processing, and speech recognition, and noted that model size grows sublinearly with data size[14].
In 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models"[15]. They found that test loss falls as a power law in three quantities: model size N, dataset size D, and training compute C. Architectural details such as depth and width matter much less than the totals, within a wide range. The result was a recipe for spending compute: train very large models on relatively small amounts of data and stop well before convergence. This reasoning informed the design of GPT-3, which used 175 billion parameters trained on around 300 billion tokens[15][16].
In 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models", known as the Chinchilla paper[17]. By training more than 400 models from 70 million to 16 billion parameters on 5 to 500 billion tokens, they found that for a fixed compute budget, model size and dataset size should scale roughly equally: every doubling of parameters should be matched by a doubling of training tokens. They demonstrated this by training Chinchilla, a 70B-parameter model on 1.4 trillion tokens, which outperformed the much larger Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of evaluations[17].
The Chinchilla finding shifted the field. Later models, including LLaMA 1, LLaMA 2, LLaMA 3, and most open-source releases since 2023, have used compute-optimal or even data-heavy training ratios. LLaMA 3 in particular was pretrained on over 15 trillion tokens, far beyond the Chinchilla-optimal ratio for its 8B and 70B sizes, on the empirical grounds that smaller models keep improving past the compute-optimal point if extra data is available[9].
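The ratios discussed above are easy to check with back-of-the-envelope arithmetic. The sketch below uses the widely cited approximation that training a dense Transformer costs about 6 FLOPs per parameter per token; that rule of thumb is an assumption of this example, and the parameter and token counts are the rounded figures quoted in this article.

```python
def train_flops(n_params, n_tokens):
    # Rule of thumb: about 6 FLOPs per parameter per training token.
    return 6 * n_params * n_tokens

def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params

models = {
    "GPT-3":      (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA 3 8B": (8e9, 15e12),
}

for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_param(n, d):.1f} tokens/param, "
          f"{train_flops(n, d):.2e} FLOPs")
```

GPT-3 comes out at under 2 tokens per parameter, Chinchilla at 20, and LLaMA 3 8B at nearly 1,900, which illustrates how far recent small models are trained past the Chinchilla ratio. The GPT-3 estimate of about 3.15 x 10^23 FLOPs is close to the 3.14 x 10^23 figure cited earlier.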
Transformers come in three main flavors based on which halves of the original architecture are kept.
| Variant | Examples | Typical use | Pretraining objective |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, ALBERT, DeBERTa, ELECTRA | Classification, retrieval, sentence embeddings | Masked language modeling |
| Decoder-only | GPT-2, GPT-3, GPT-4, LLaMA, Mistral, Claude, Qwen, Falcon | Text generation, chat, code | Next-token prediction |
| Encoder-decoder | Original Transformer, T5, BART, mT5, FLAN-T5 | Translation, summarization, structured generation | Span corruption or denoising |
BERT (Bidirectional Encoder Representations from Transformers), introduced by Jacob Devlin and colleagues at Google in 2018, was the first widely adopted encoder-only Transformer[18]. BERT-Base has 12 layers, 768 hidden dimensions, 12 heads, and 110 million parameters. BERT-Large has 24 layers, 1024 hidden dimensions, 16 heads, and 340 million parameters[18]. It was pretrained with masked language modeling (predicting 15 percent of tokens that are randomly masked) and next-sentence prediction. BERT pushed the GLUE benchmark to 80.5, lifted SQuAD v1.1 F1 to 93.2, and was deployed in Google Search starting in October 2019. RoBERTa (Facebook AI, 2019) showed that BERT was undertrained and improved scores by removing next-sentence prediction and training longer on more data. DeBERTa added disentangled attention for content and position, and ELECTRA replaced the masked-token objective with replaced-token detection.
The first GPT, presented in "Improving Language Understanding by Generative Pre-Training" by Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever at OpenAI in 2018, used a 12-layer decoder-only Transformer with 768 hidden dimensions, 12 attention heads, 3072 inner FFN dimension, and roughly 117 million parameters[19]. It was pretrained on the BookCorpus and then fine-tuned discriminatively for each downstream task. GPT-2 (2019) scaled this to 1.5 billion parameters and was trained on WebText, a 40 GB corpus of 8 million web pages linked from Reddit posts with at least three karma[20]. GPT-2 showed that a single model could perform many tasks zero-shot. GPT-3 (2020) reached 175 billion parameters and demonstrated few-shot learning, requiring an estimated 3.14 x 10^23 FLOPs to train[16]. GPT-4 and later models keep the decoder-only blueprint, adding longer context windows and multimodal inputs and, according to third-party reports, mixture-of-experts layers.
Most open-weight large language models released since 2023 also use decoder-only Transformers: LLaMA, LLaMA 2, LLaMA 3, Mistral, Mixtral, Qwen, Yi, Falcon, DeepSeek, and Gemma all follow the same recipe with variations in normalization, position encoding, attention heads, and FFN activations.
T5 (Text-to-Text Transfer Transformer), released by Google in 2019, framed every NLP task as text-in, text-out and used the original encoder-decoder layout pretrained on the C4 corpus with a span corruption objective[12]. BART, from Facebook AI in 2019, used a similar layout with a more general denoising autoencoder objective and excelled at summarization[21]. mT5 and FLAN-T5 extended the recipe to more languages and instruction tuning.
Several variants emerged specifically to handle sequences longer than the few thousand tokens that fit in a base Transformer. Transformer-XL (Dai et al., 2019) introduced segment-level recurrence in which hidden states from the previous segment are cached and reused, letting attention reach across segment boundaries[22]. The authors reported that Transformer-XL captures dependencies 80 percent longer than RNNs and 450 percent longer than vanilla Transformers, and is up to 1,800 times faster than vanilla Transformers during evaluation[22].
For frontier-scale long-context inference, Gemini 1.5 Pro (Google DeepMind, 2024) used a sparse mixture-of-experts architecture and launched with a 1 million token context window, later expanded to 2 million tokens in production. Its technical report describes near-perfect recall (over 99.7 percent) on needle-in-a-haystack retrieval tests up to 1 million tokens, and reasonable performance when contexts were extended to 10 million tokens for text, 9.7 million for audio, and 9.9 million for video[23]. Claude and GPT-4 likewise extended context lengths into the hundreds of thousands of tokens through 2024 and 2025.
The Vision Transformer (ViT), introduced by Alexey Dosovitskiy and colleagues at Google Research in the 2020 paper "An Image Is Worth 16x16 Words," treats an image as a sequence of fixed-size patches[2]. A standard ViT splits a 224 by 224 image into a 14 by 14 grid of 196 patches, each 16 by 16 pixels, projects each patch into an embedding, prepends a learnable [CLS] token, adds positional embeddings, and runs the result through a stack of standard Transformer encoder layers. With enough pretraining data, ViT matched or exceeded the best convolutional neural networks on ImageNet classification[2]. Variants followed quickly. DeiT made ViT trainable on ImageNet-1k alone with distillation. Swin Transformer (Liu et al., 2021) added shifted-window attention to give a hierarchical, CNN-like inductive bias, won the ICCV 2021 Marr Prize, and reached 87.3 percent top-1 accuracy on ImageNet-1K and 58.7 box AP on COCO[24]. DETR used a Transformer encoder-decoder for object detection[3].
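The patch-splitting step in ViT is a pure reshape. The NumPy sketch below assumes a channels-last image whose height and width are divisible by the patch size; the learned linear projection, the [CLS] token, and the positional embeddings are left out, and the function name is chosen for this example.

```python
import numpy as np

def patchify(image, patch=16):
    """image: (H, W, C) with H and W divisible by `patch`.
    Returns (num_patches, patch * patch * C), patches in row-major order."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)                         # shape (196, 768)
```

Each of the 196 rows is one flattened 16 by 16 RGB patch of 768 values, which the model then maps to its embedding dimension with a single linear layer.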
Multimodal models pair Transformers across modalities. CLIP (OpenAI, 2021) trains a text Transformer and an image Transformer jointly with a contrastive objective on 400 million image-text pairs[25]. Flamingo (DeepMind, 2022) interleaves vision and language tokens for few-shot visual question answering. Modern frontier models (GPT-4o, Gemini, Claude 3) accept images, audio, and video as token streams alongside text.
Whisper, released by OpenAI in September 2022, is an encoder-decoder Transformer trained on 680,000 hours of multilingual and multitask supervised audio scraped from the web[4]. Input audio is split into 30-second chunks, converted to a log-Mel spectrogram, and passed through two 1D convolutional layers that downsample along time before entering the Transformer encoder. The decoder is conditioned on special tokens that specify the task (transcription, translation, voice-activity detection, language identification), so a single set of weights handles many speech tasks. The large-v3 model has 32 encoder layers, a model dimension of 1280, and 20 attention heads[4].
Earlier, wav2vec 2.0 (Baevski et al., Meta AI, 2020) introduced a self-supervised contrastive objective for speech representations, masking parts of a latent representation of the audio and learning to identify the masked content[26]. Using all labeled data of LibriSpeech it achieves 1.8/3.3 WER on the clean/other test sets; with just ten minutes of labeled data and pretraining on 53,000 hours of unlabeled audio it still reaches 4.8/8.2 WER[26]. AudioLM and MusicLM extended Transformers to audio generation; the speech recognition lineage culminated in Whisper.
DeepMind's AlphaFold 2 (Jumper et al., 2021) used a deep attention-based module called the Evoformer that integrates multiple sequence alignments with residue-residue pair representations, plus a structure module that includes Invariant Point Attention specifically designed for 3D point clouds[5]. AlphaFold 2 won the CASP14 protein structure prediction competition by a wide margin and is widely credited with substantially solving the single-domain protein folding problem. AlphaFold 3 (2024) replaced the Evoformer with a Pairformer and added a diffusion-based structure decoder, extending coverage to protein-protein and protein-ligand complexes.
Meta's Evolutionary Scale Modeling (ESM) line produced protein language models trained on hundreds of millions of amino-acid sequences with a masked-language-modeling objective. ESMFold (Lin et al., 2022) used ESM-2 as a backbone to predict structure from a single sequence without the multiple sequence alignment that AlphaFold relies on. EvolutionaryScale, founded by a team of former Meta researchers including Alexander Rives, released ESM3 in June 2024, a generative model that reasons jointly over sequence, structure, and function.
The Diffusion Transformer (DiT), introduced by William Peebles (UC Berkeley) and Saining Xie (NYU) at ICCV 2023, replaced the U-Net backbone of latent diffusion models with a Transformer operating on patches of the diffusion latent[6]. DiT incorporates timestep and class label as embeddings and uses adaptive layer normalization (adaLN) to inject conditioning. The largest DiT-XL/2 reached a Fréchet inception distance of 2.27 on class-conditional ImageNet 256x256, then state of the art, with FID continuing to drop as the model's training compute (Gflops) increases[6]. DiT became the basis of Stable Diffusion 3, Stable Diffusion 3.5, FLUX.1, OpenAI's Sora text-to-video model, and many other 2024-2025 generative systems[27]. Sora itself treats video as sequences of spacetime patches passed through a Transformer-based denoiser that does not constrain input resolution or duration[27].
Decision Transformer, introduced by Lili Chen, Kevin Lu, and colleagues at UC Berkeley and Facebook AI Research in June 2021, cast offline reinforcement learning as a sequence modeling problem[28]. By conditioning a causal Transformer on the desired return, past states, and past actions, the model can generate future actions that achieve the target return without any value-function or policy-gradient machinery. It matches or exceeds state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks[28]. DeepMind's Gato (2022) trained a single 1.2-billion-parameter Transformer to play Atari, caption images, chat, and control a robot arm by tokenizing every modality into a shared sequence. Robotics Transformer (RT-1, RT-2) extended the same idea to physical manipulation.
Mixture of experts (MoE) replaces the dense feedforward block in some Transformer layers with a sparse layer of expert subnetworks and a learned router that sends each token to a small subset of experts, usually one or two. This decouples total parameter count from per-token compute. The first large-scale MoE Transformer was GShard (Lepikhin et al., Google, 2020), which scaled a multilingual neural machine translation Transformer to 600 billion parameters across 2048 experts and trained it efficiently on 2048 TPU v3 chips in 4 days, beating dense baselines on 100-language to English translation[29]. Switch Transformer (Fedus et al., Google, 2021) simplified the router to pick the single top expert, scaled to 1.6 trillion parameters, and reached the quality of dense T5 baselines with substantially less training compute[30].
Mixtral 8x7B (Mistral AI, released in December 2023 with a paper in January 2024) was an influential open-weight MoE: each layer routes tokens to two of eight feedforward experts, giving 47 billion total parameters with 13 billion active per token, and matched or beat LLaMA 2 70B and GPT-3.5 across most benchmarks[31]. DeepSeek V3 (December 2024) pushed open MoE further with 671 billion total parameters and 37 billion active per token, using 256 routed experts plus 1 shared expert (8 routed plus 1 shared activated per token), Multi-head Latent Attention, and FP8 mixed-precision training[32]. Google's Gemini 1.5 series, OpenAI's GPT-4 (per third-party reports), and other frontier models also adopted MoE for the same compute-vs-quality tradeoff.
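The routing step can be sketched as a top-k selection followed by a weighted sum of expert outputs. The NumPy example below is a simplified illustration: it loops over tokens instead of batching by expert, renormalizes the gate with a softmax over the selected logits (one common convention), and omits load-balancing losses and capacity limits. The experts here are hypothetical toy functions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def moe_layer(x, W_router, experts, k=2):
    """x: (n, d). W_router: (d, n_experts). experts: list of callables (d,) -> (d,)."""
    logits = x @ W_router                               # (n, n_experts)
    top = np.argsort(logits, axis=-1)[:, -k:]           # indices of the k largest logits
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        selected = top[t]
        gate = softmax(logits[t, selected])             # weights over chosen experts only
        for g, e in zip(gate, selected):
            out[t] += g * experts[e](x[t])              # only k experts run per token
    return out, top

rng = np.random.default_rng(0)
n, d, n_experts = 6, 8, 8
x = rng.normal(size=(n, d))
W_router = rng.normal(size=(d, n_experts))
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_experts)]
experts = [lambda v, W=W: np.tanh(v @ W) for W in weights]   # toy experts
y, chosen = moe_layer(x, W_router, experts, k=2)
```

With eight experts and k set to 2, each token touches a quarter of the expert parameters, which is the source of the gap between total and active parameter counts quoted above.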
Standard attention has time and memory complexity O(n^2) in the sequence length n, which becomes the bottleneck for long contexts. Many lines of work try to reduce that cost.
| Method | Year | Approach |
|---|---|---|
| Sparse Transformer | 2019 | Restrict attention to fixed sparse patterns (strided and fixed factorizations) |
| Reformer | 2020 | Locality-sensitive hashing groups similar queries and keys; reversible residuals reduce memory |
| Linformer | 2020 | Project keys and values along the sequence axis to a fixed length |
| Longformer | 2020 | Combine local sliding-window attention with a few global tokens |
| Performer | 2020 | Approximate softmax attention with random feature kernels |
| BigBird | 2020 | Random plus local plus global attention pattern |
| FlashAttention | 2022 | Exact attention reordered to minimize GPU memory I/O |
| FlashAttention-2 | 2023 | Better parallelism, work partitioning across thread blocks and warps |
| FlashAttention-3 | 2024 | Asynchrony on Hopper Tensor Cores, FP8 support |
The Sparse Transformer (Child, Gray, Radford, Sutskever at OpenAI, 2019) introduced factorized attention patterns that reduce the cost from O(n^2) to O(n sqrt(n)), and set state-of-the-art density estimation results on CIFAR-10, Enwik8, and ImageNet 64 while modeling sequences tens of thousands of tokens long[33]. Reformer (Kitaev, Kaiser, Levskaya, 2020) used locality-sensitive hashing to bucket similar queries and keys, reducing attention to O(n log n) and pairing it with reversible residual layers; the authors demonstrated context windows of up to one million tokens on a single 16 GB accelerator[34]. Longformer (Beltagy, Peters, Cohan at AI2, 2020) combined sliding window, dilated sliding window, and task-specific global attention. Linformer (Wang et al., Facebook, 2020) projected the keys and values along the sequence-length axis down to a fixed dimension. Performer (Choromanski et al., Google, 2020) approximated the softmax kernel with positive orthogonal random features. BigBird (Zaheer et al., Google, 2020) combined random, window, and global attention, proved that the resulting sparse attention is a universal approximator and Turing-complete, and supported sequences up to 8x longer than full attention on the same hardware[35].
FlashAttention, from Tri Dao and colleagues at Stanford in 2022, is the most widely adopted of these efficiency techniques. Unlike the others, it computes exact softmax attention rather than an approximation. It restructures the computation into tiles that fit in fast on-chip SRAM, cutting reads and writes to GPU high-bandwidth memory[36]. FlashAttention-style kernels are now available in PyTorch, JAX, and most inference frameworks, often as the default attention path. FlashAttention-2 (Dao, July 2023) reworked parallelism and partitioning between thread blocks and warps, reaching 50 to 73 percent of theoretical peak FLOPs on A100 GPUs and up to 225 TFLOPs/s end-to-end for GPT-style training (72 percent model FLOPs utilization)[37]. FlashAttention-3 (Shah et al., July 2024) targets NVIDIA Hopper H100 GPUs with asynchronous Tensor Core scheduling, warp specialization, interleaved matmul and softmax, and FP8 block quantization. It reaches roughly 740 TFLOPs/s on FP16 (75 percent of H100 peak) and nearly 1.2 PFLOPs/s on FP8, while keeping FP8 numerical error 2.6x smaller than the baseline FP8 attention[38].
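The reason tiling can stay exact is the online softmax trick: a running maximum and running sum are kept per query row, and earlier partial results are rescaled whenever the maximum changes. The NumPy sketch below illustrates only that numerical idea. It tiles over keys and values but not over queries, runs on the CPU, and says nothing about the actual GPU kernel; the function names and block size are assumptions for this example.

```python
import numpy as np

def naive_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Exact attention computed one key/value block at a time,
    without ever materializing the full (n, n) score matrix."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full((n, 1), -np.inf)       # running max of scores per query
    row_sum = np.zeros((n, 1))               # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                            # (n, block)
        new_max = np.maximum(row_max, s.max(axis=1, keepdims=True))
        scale = np.exp(row_max - new_max)                    # rescale earlier partials
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=1, keepdims=True)
        out = out * scale + p @ Vb
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
match = np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V))
```

The tiled result agrees with the naive computation up to floating-point rounding, which is what distinguishes this approach from the approximate methods in the table above.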
KV caching is a complementary trick used during autoregressive decoding. Once a token has been processed, its key and value vectors do not change, so they are stored and reused for every later step. Each decoding step then only computes attention for the one new query against the cached keys and values, which reduces the per-step cost from quadratic to linear in sequence length. The price is memory proportional to the number of layers times context length times number of key-value heads times head dimension, counted twice (once for keys, once for values). Multi-query and grouped-query attention specifically target the size of this cache.
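As a rough illustration of the cache size, the sketch below counts bytes for one sequence. The configuration (80 layers, head dimension 128, 64 query heads, 16-bit values, 4,096-token context) is an assumed example in the range of a 70B-parameter model, not a specification of any particular release.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

GIB = 2 ** 30
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, context_len=4096)
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)
mqa = kv_cache_bytes(n_layers=80, n_kv_heads=1, head_dim=128, context_len=4096)

print(f"multi-head:    {mha / GIB:.2f} GiB")
print(f"grouped-query: {gqa / GIB:.2f} GiB")
print(f"multi-query:   {mqa / GIB:.3f} GiB")
```

Under these assumptions, going from 64 to 8 key-value heads shrinks the per-sequence cache from 10 GiB to 1.25 GiB, and the saving grows linearly with context length and batch size.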
For sequences that no longer fit on a single accelerator, Ring Attention (Liu, Zaharia, Abbeel at UC Berkeley, October 2023) distributes blocks of keys and values around a conceptual ring of devices and overlaps the inter-device communication with on-device computation, scaling context length linearly with the number of devices and supporting millions of tokens[39].
Transformers are now the default architecture in almost every domain that involves sequences or sets.
The Transformer's takeover of the field can be tracked by year.
Transformers have well-known weaknesses.
Quadratic attention cost. The quadratic growth of self-attention cost with sequence length is the most obvious weakness. Even with FlashAttention and KV caching, very long contexts (hundreds of thousands or millions of tokens) require special techniques such as Ring Attention, sliding-window attention, sparse attention, or state-space hybrids[36][39]. Many efficient-attention papers from 2019 to 2021 promised linear or sub-quadratic alternatives in theory; in practice FlashAttention's exact O(n^2) kernel with optimized memory hierarchy outperformed most of them for sequences under tens of thousands of tokens, leaving the long-context regime as the main place where approximate methods remain competitive.
Hallucination. Autoregressive Transformers can produce confident but false statements, a behavior usually called hallucination. The model is trained to predict the most likely next token, not to verify facts. Mitigations include retrieval-augmented generation, reinforcement learning from human feedback, tool use, and explicit chain-of-thought, but no current technique eliminates hallucination.
Training cost and concentration of capability. Large Transformers are expensive to train. GPT-3 (175B parameters) reportedly cost several million dollars in compute alone[16]. PaLM 540B was trained on 6144 TPU v4 chips[40]. Meta reports roughly 7.7 million H100 GPU-hours of pretraining compute for the LLaMA 3 8B and 70B models combined, and trained LLaMA 3 405B on up to 16,000 H100 GPUs[9]. Frontier models in 2025 and 2026 are estimated to cost hundreds of millions to over a billion dollars per training run. The hardware required (tens of thousands of high-end GPUs or TPUs) is concentrated in a small number of companies and labs.
Energy and environmental impact. Strubell, Ganesh, and McCallum (ACL 2019) estimated that training one large NLP Transformer with neural architecture search could emit roughly 626,000 pounds (about 284 metric tons) of CO2-equivalent, comparable to the lifetime emissions of five average American cars[42]. They argued for transparent reporting of training compute and emissions. Since then, the field has shifted toward training efficiency (compute-optimal scaling, MoE, FP8, FlashAttention), and the largest model providers now run on partially renewable electricity, but the absolute scale of frontier training continues to grow.
Interpretability. Researchers can visualize attention weights and probe individual neurons, but understanding why a 70-billion-parameter model produces a particular output is largely an open problem and the central concern of the mechanistic interpretability research program. Attention weights in particular are often misleading: they show what positions a head attends to, not what computations the head is performing.
Other failure modes. Transformers can be brittle to prompt phrasing, fall into repetition loops, struggle with arithmetic and symbolic manipulation at longer lengths than seen in training, and fail to learn algorithms that recurrent networks handle naturally (such as counting parentheses to arbitrary depth). They are also susceptible to jailbreaks, prompt injection, and adversarial inputs.
Competing architectures. Transformers are not the only game in town. State-space models such as Mamba (Gu and Dao, 2023) achieve linear-time inference and competitive quality on language modeling; the Mamba-3B model outperformed Transformers of the same size on standard benchmarks[41]. RWKV reframes the recurrent structure of an RNN with parallelizable training, claiming Transformer-level quality with linear time and constant memory inference[43]. Hybrid architectures such as Jamba interleave state-space and attention blocks, while Mamba-2 refines the state-space layer itself. Whether one of these alternatives eventually displaces the Transformer for general-purpose language modeling is an open question; as of 2026 most frontier models remain attention-based.