See also: Machine learning terms
In machine learning, the Transformer is a deep learning architecture based entirely on attention mechanisms, dispensing with recurrence and convolutions. It was introduced in 2017 by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin in the paper "Attention Is All You Need" [1]. All eight authors contributed equally; the listed order was randomized. At the time of publication, the authors were affiliated with Google Brain and Google Research, apart from Gomez, who was listed with the University of Toronto. The paper was submitted to arXiv on June 12, 2017, and presented at the 31st Conference on Neural Information Processing Systems (NIPS, since renamed NeurIPS) in December 2017. As of 2025, the paper has been cited more than 173,000 times, placing it among the ten most-cited papers of the 21st century.
The Transformer was originally designed for machine translation but has since become the dominant architecture across nearly all areas of artificial intelligence. Every major frontier large language model as of early 2026 (including OpenAI's GPT series, Anthropic's Claude, Google's Gemini, and Meta's LLaMA) is built on the Transformer or a close variant of it [2]. The architecture has also expanded far beyond natural language processing, powering breakthroughs in computer vision, protein structure prediction, speech recognition, music generation, and robotics.
Before the Transformer, sequence modeling in NLP was dominated by recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These architectures process tokens one at a time in sequence, which creates two major problems. First, training cannot be parallelized across time steps because each step depends on the hidden state from the previous step. This makes RNNs slow to train on long sequences. Second, despite gating mechanisms, RNNs still struggle with very long-range dependencies because information must pass through many sequential operations to travel between distant positions.
The attention mechanism itself predates the Transformer. Bahdanau et al. introduced additive attention for machine translation in 2014 [3], allowing the decoder to "attend" to different parts of the source sentence at each decoding step. Luong et al. proposed a simplified multiplicative (dot-product) attention in 2015 [4]. These mechanisms were used alongside RNNs, not as replacements for them.
The key insight of Vaswani et al. was that attention alone, without any recurrence, could serve as the entire computational backbone of a sequence model. This allowed full parallelization during training and gave the model direct access to all positions in the input sequence, regardless of distance.
The original Transformer follows an encoder-decoder design. The encoder reads an input sequence and produces a sequence of continuous representations. The decoder then generates an output sequence one token at a time, attending to both the encoder output and its own previously generated tokens.
Input tokens are first converted into dense vectors through a learned embedding layer. Since the Transformer has no recurrence or convolution, it has no inherent sense of token order. To provide positional information, the authors add positional encodings to the input embeddings before they enter the encoder or decoder stacks.
The original paper uses fixed sinusoidal positional encodings defined by:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
where pos is the position in the sequence, i is the dimension index, and d_model is the model's embedding dimension. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, ranging from 2pi to 10000 * 2pi. The authors chose this scheme because it allows the model to learn to attend by relative position, since for any fixed offset k, PE(pos+k) can be expressed as a linear function of PE(pos).
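This scheme is straightforward to implement directly from the formulas above. A minimal NumPy sketch (function name and shapes are illustrative):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Fixed sinusoidal positional encodings from the original paper."""
    positions = np.arange(seq_len)[:, None]         # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2) even indices
    angles = positions / base ** (dims / d_model)   # one frequency per pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(50, 128)   # added elementwise to the token embeddings
```

At position 0 every sine dimension is 0 and every cosine dimension is 1, and each pair of dimensions traces a sinusoid of a different wavelength as pos grows.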
The core computation in the Transformer is scaled dot-product attention. Given a set of queries Q, keys K, and values V (all matrices), the attention output is computed as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where d_k is the dimension of the key vectors. The division by sqrt(d_k) is a scaling factor that prevents the dot products from growing too large in magnitude as d_k increases, which would push the softmax function into regions where it has extremely small gradients. Without this scaling, the softmax would become highly peaked, concentrating almost all weight on a single key and making learning difficult.
In practice, the queries, keys, and values are derived from the input through learned linear projections:

Q = X W_Q,   K = X W_K,   V = X W_V
where X is the input matrix and W_Q, W_K, W_V are learned parameter matrices.
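Putting the projections and the attention formula together gives a compact implementation. The following is a minimal single-head NumPy sketch (random weights stand in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k) scaled similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted average of values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                  # 6 tokens, d_model = 16
W_Q, W_K, W_V = (rng.normal(size=(16, 16)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each output row is a convex combination of the value vectors, with mixing weights determined by query-key similarity.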
Rather than performing a single attention function with d_model-dimensional keys, values, and queries, the Transformer uses multi-head attention. The input is projected into h separate sets of queries, keys, and values using different learned linear projections, attention is computed independently for each "head," and the results are concatenated and projected once more:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q * W_Q_i, K * W_K_i, V * W_V_i)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this ability. For instance, one head may capture short-range syntactic dependencies while another tracks broader semantic context. In the original paper, the authors used h = 8 parallel attention heads, each operating on a dimension of d_k = d_v = d_model / h = 64.
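The splitting, per-head attention, and final concatenation can be sketched as follows. This is an illustrative NumPy version that exploits the fact that slicing the columns of one d_model × d_model projection is equivalent to h separate per-head projections:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head attention with h heads of dimension d_model // h."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # reshape (n, d_model) -> (h, n, d_k): one projection slice per head
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)         # stable softmax
    w = np.exp(scores); w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                                       # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                  # final output projection

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 64))                             # 5 tokens, d_model = 64
Ws = [rng.normal(size=(64, 64)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, h=8)                  # 8 heads of d_k = 8
```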
The Transformer uses attention in three distinct ways:
Encoder self-attention: Each position in the encoder attends to all positions in the previous encoder layer. The queries, keys, and values all come from the output of the previous encoder layer.
Decoder self-attention: Each position in the decoder attends to all positions in the decoder up to and including that position. Future positions are masked (set to negative infinity before the softmax) to prevent the decoder from "seeing ahead" during training. This is called causal masking or masked self-attention.
Encoder-decoder cross-attention: The queries come from the previous decoder layer, while the keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
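The causal mask used in decoder self-attention is simply a lower-triangular pattern of zeros with minus infinity above the diagonal, added to the scores before the softmax. A small NumPy illustration (uniform scores are used so the effect of the mask is easy to see):

```python
import numpy as np

n = 5
# mask[i, j] = 0 where position i may attend to j (j <= i), -inf otherwise
mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)

scores = np.zeros((n, n)) + mask        # toy uniform scores plus the mask
weights = np.exp(scores)                # exp(-inf) = 0: future positions vanish
weights /= weights.sum(axis=-1, keepdims=True)
# row i now spreads its attention uniformly over positions 0..i only
```

Position 0 can only attend to itself (weight 1.0), position 3 spreads weight 0.25 over positions 0 through 3, and all entries above the diagonal are exactly zero.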
Each layer in both the encoder and decoder contains a fully connected feed-forward network (FFN) applied independently to each position. It consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2
The inner dimension of the FFN (d_ff) is larger than d_model. In the base model, d_model = 512 and d_ff = 2048, giving an expansion factor of 4x. This expand-and-contract design lets the network learn richer intermediate representations before projecting back to d_model dimensions. The parameters of the FFN differ from layer to layer but are shared across all positions within a given layer.
Each sub-layer (self-attention, cross-attention, or feed-forward) is wrapped with a residual connection followed by layer normalization. The output of each sub-layer is:
LayerNorm(x + Sublayer(x))
Residual connections, first introduced by He et al. in 2015 for deep convolutional networks [5], allow gradients to flow directly through the network without passing through nonlinear activations at every layer. This is essential for training deep networks with many stacked layers. Layer normalization stabilizes the activations by normalizing across the feature dimension.
The original paper places normalization after the residual addition ("Post-LN"). Later research found that placing normalization before each sub-layer ("Pre-LN") leads to more stable training for very deep Transformers [6]. Most modern Transformer implementations use Pre-LN, and many have switched from standard layer normalization to RMSNorm, which omits the mean-centering step for computational efficiency.
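The difference between the two placements is just where the normalization sits relative to the residual addition. A toy NumPy sketch (the layer_norm here omits the learned gain and bias, and a simple scaling function stands in for the attention/FFN sub-layer):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize across the feature dimension (no learned gain/bias here)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    return layer_norm(x + sublayer(x))   # original: normalize AFTER the residual

def pre_ln_block(x, sublayer):
    return x + sublayer(layer_norm(x))   # modern: normalize INSIDE the residual

sub = lambda x: 0.5 * x                  # toy stand-in for attention or FFN
x = np.random.default_rng(3).normal(size=(4, 8))
y_post = post_ln_block(x, sub)
y_pre = pre_ln_block(x, sub)
```

In the Pre-LN form the identity path x + ... is never normalized away, which is the usual explanation for its more stable gradients in very deep stacks.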
The encoder consists of a stack of N identical layers. Each encoder layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network, each with residual connections and layer normalization.
The decoder also consists of N identical layers. Each decoder layer contains three sub-layers: masked multi-head self-attention, multi-head encoder-decoder cross-attention, and a position-wise feed-forward network.
In the original paper, both stacks use N = 6 layers. The final decoder output is passed through a linear layer and a softmax to produce output token probabilities.
The paper describes two model sizes:
| Configuration | Layers (N) | d_model | d_ff | Heads (h) | d_k | Parameters | Training time |
|---|---|---|---|---|---|---|---|
| Transformer (base) | 6 | 512 | 2048 | 8 | 64 | 65 million | 12 hours |
| Transformer (big) | 6 | 1024 | 4096 | 16 | 64 | 213 million | 3.5 days |
Both models were trained on 8 NVIDIA P100 GPUs. The base model was trained for 100,000 steps (about 0.4 seconds per step), while the big model was trained for 300,000 steps (about 1.0 seconds per step) [1].
The Transformer achieved state-of-the-art results on the WMT 2014 English-to-German and English-to-French translation benchmarks, with the big model reaching 28.4 BLEU on English-to-German and 41.8 BLEU on English-to-French [1].
These results were achieved at a fraction of the training cost of competing models. The big Transformer model required only 3.5 days of training on 8 GPUs, while previous state-of-the-art systems required many more GPU-days.
The Transformer's dominance over recurrent architectures comes down to several concrete advantages:
Parallelization. RNNs process tokens sequentially, meaning the computation for position t depends on the result at position t-1. The Transformer computes attention over all positions simultaneously. This maps efficiently to modern GPU and TPU hardware, which excels at large matrix multiplications. Training speed improvements are dramatic: tasks that took weeks with RNNs can often be completed in days with Transformers.
Long-range dependencies. In an RNN, information between two tokens separated by n positions must pass through O(n) sequential operations. In a Transformer, any two positions are connected through a single attention operation, requiring only O(1) sequential operations. This makes it far easier to learn dependencies between distant tokens.
Constant path length. The maximum path length between any two positions in a Transformer is O(1), compared to O(n) for RNNs and O(log n) for convolutional models. Shorter paths make gradient flow and learning more efficient.
Scalability. Transformers scale more predictably with increased compute, data, and parameters. This property, formalized through scaling laws, has been the primary driver behind the rapid progress in large language models.
The main trade-off is computational cost: self-attention has O(n^2) complexity with respect to sequence length, compared to O(n) for RNNs. For very long sequences this becomes expensive, motivating research into efficient attention variants.
Since 2017, researchers have explored three major structural variants of the Transformer, each suited to different tasks.
Encoder-only models use only the encoder portion of the Transformer and process the entire input bidirectionally (every token can attend to every other token). The most prominent example is BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018 [7]. BERT is trained with a masked language modeling (MLM) objective: 15% of input tokens are randomly masked, and the model learns to predict them from surrounding context. BERT also uses a next-sentence prediction (NSP) objective during pre-training.
Encoder-only models excel at tasks that require understanding input text rather than generating new text: classification, named entity recognition, extractive question answering, and semantic similarity. RoBERTa (2019, Facebook AI) improved on BERT by training longer with more data and dropping the NSP objective [8]. ELECTRA (2020, Google) replaced the masking objective with a replaced-token-detection task for more sample-efficient training.
Decoder-only models use only the decoder portion (with causal masking so each token can only attend to previous tokens) and are trained with a causal language modeling objective: predict the next token. The GPT (Generative Pre-trained Transformer) series from OpenAI pioneered this approach, starting with GPT-1 in June 2018 [9].
Decoder-only models are naturally suited for text generation. GPT-2 (2019) demonstrated surprisingly coherent long-form generation from a 1.5 billion parameter model. GPT-3 (2020, 175 billion parameters) showed that decoder-only models could perform many tasks through in-context learning without any fine-tuning [10]. As of 2026, the decoder-only architecture dominates the large language model space. GPT-4, Claude, Gemini, LLaMA, Mistral, and DeepSeek all use decoder-only Transformer architectures.
Encoder-decoder models retain the full original Transformer structure. Google's T5 (Text-to-Text Transfer Transformer, 2019) treats every NLP task as a text-to-text problem: the input and output are both text strings, whether the task is translation, summarization, classification, or question answering [11]. BART (2019, Facebook AI) combines a bidirectional encoder with an autoregressive decoder and is trained by corrupting text and learning to reconstruct the original [12].
Encoder-decoder models are well suited for tasks where the input and output are both sequences of variable length, such as translation and summarization.
| Variant | Attention pattern | Training objective | Typical tasks | Notable models |
|---|---|---|---|---|
| Encoder-only | Bidirectional (full) | Masked language modeling | Classification, NER, extractive QA | BERT, RoBERTa, ELECTRA |
| Decoder-only | Causal (left-to-right) | Next-token prediction | Text generation, chat, code, reasoning | GPT series, Claude, LLaMA, Gemini |
| Encoder-decoder | Encoder: bidirectional; Decoder: causal + cross-attention | Denoising, span corruption | Translation, summarization | T5, BART, mBART, Flan-T5 |
Positional encoding is one of the most actively researched components of the Transformer. Since the self-attention mechanism is permutation-invariant, positional encodings are the sole mechanism that provides the model with information about token order. Several approaches have been developed since the original sinusoidal scheme.
The original Transformer uses deterministic sine and cosine functions of different frequencies to encode absolute position. These encodings are added directly to the token embeddings at the input layer. The scheme is parameter-free and theoretically allows the model to extrapolate to longer sequences than seen during training, although practical extrapolation remains limited [1].
BERT and the GPT series replace the sinusoidal functions with a learned embedding table that maps each absolute position index to a d_model-dimensional vector [7][9]. This approach is simple but fixes the maximum sequence length at training time, since positions beyond the table length have no representation.
RoPE, proposed by Su et al. in 2021 [13], encodes position by rotating the query and key vectors in the complex plane. The d_model features are organized as d_model/2 pairs, where each pair is treated as a coordinate in a 2D plane and rotated by an angle proportional to the token's position. After rotation, the dot product between a query at position m and a key at position n depends only on the relative distance (m - n), making RoPE inherently a relative positional encoding. RoPE is parameter-free, naturally captures relative position, and scales gracefully to long sequences. It has become the standard positional encoding for most modern LLMs, used in LLaMA, Mistral, Qwen, and many others [14].
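The pairwise rotation is easy to express directly. A minimal NumPy sketch of the scheme as described above (not a library API; real implementations typically cache the cos/sin tables and operate per attention head):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each consecutive feature pair of x (shape (n, d)) by an
    angle proportional to the token's position."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)     # one frequency per pair
    theta = np.arange(n)[:, None] * freqs[None, :]   # (n, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]                  # the two pair components
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # standard 2-D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(4)
q, k = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
qr, kr = rope(q), rope(k)
# dot products between qr and kr rows now depend only on relative offsets
```

Because rotations preserve length, the vector norms are unchanged, and a token at position 0 is left untouched (rotation by angle zero).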
ALiBi, introduced by Press et al. in 2021 (published at ICLR 2022), takes a different approach by removing positional embeddings from the input entirely [15]. Instead, it adds a static, non-learned bias to the attention scores before the softmax operation. The bias is proportional to the distance between the query and key positions, with each attention head receiving a different fixed slope. This simple penalty causes the model to naturally attend more strongly to nearby tokens. ALiBi demonstrates strong extrapolation to sequence lengths longer than those seen during training, achieving comparable perplexity to sinusoidal models trained on longer sequences while using 11% less memory and training 11% faster.
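The ALiBi bias tensor can be built in a few lines. A NumPy sketch of the causal form, with the geometric head slopes used in the paper (1/2, 1/4, ..., for 8 heads):

```python
import numpy as np

def alibi_bias(n, num_heads):
    """Static per-head bias added to attention scores before the softmax.
    Entry [h, i, j] = slope_h * (j - i) for j <= i, -inf for future j."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    dist = (np.arange(n)[None, :] - np.arange(n)[:, None]).astype(float)
    dist = np.where(dist > 0, -np.inf, dist)   # causal mask on future positions
    return slopes[:, None, None] * dist        # (num_heads, n, n)

bias = alibi_bias(6, 8)   # added directly to QK^T / sqrt(d_k) per head
```

Each query position gets bias 0 toward itself, an increasingly negative bias toward more distant past positions, and minus infinity toward the future, so the softmax naturally concentrates on nearby tokens.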
| Method | Type | Parameters | Relative position | Length extrapolation | Used in |
|---|---|---|---|---|---|
| Sinusoidal | Fixed, absolute | None | Implicit via linear functions | Limited | Original Transformer |
| Learned | Absolute | d_model * max_len | No | None (fixed length) | BERT, GPT-1/2/3 |
| RoPE | Rotation-based, relative | None | Yes (via rotation) | Moderate, improved with NTK-aware scaling | LLaMA, Mistral, Qwen, Gemma |
| ALiBi | Bias-based, relative | None | Yes (via linear penalty) | Strong | BLOOM, MPT |
One of the most consequential discoveries about Transformers is that their performance improves predictably as a function of model size, dataset size, and compute budget. Kaplan et al. at OpenAI published the first systematic study of neural scaling laws in January 2020 [16], finding that loss decreases as a power law in each of these three factors, with model size being the most important.
In March 2022, Hoffmann et al. at DeepMind published the "Chinchilla" paper, which refined these findings [17]. By training over 400 language models ranging from 70 million to over 16 billion parameters, they showed that for a fixed compute budget, model size and training data should be scaled roughly equally. Specifically, they recommended approximately 20 training tokens per parameter for compute-optimal training. This finding suggested that many existing large models (including the 280 billion parameter Gopher) were significantly undertrained relative to their size. The resulting 70 billion parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed the much larger Gopher.
More recent practice has moved well beyond Chinchilla-optimal ratios. LLaMA (2023) trained a 65 billion parameter model on 1.4 trillion tokens. By 2025, some models pushed this even further: Alibaba's Qwen3-0.6B was trained on 36 trillion tokens, giving a tokens-to-parameters ratio of 60,000:1, far exceeding the Chinchilla recommendation [18]. The motivation is that smaller, heavily trained models are cheaper to deploy at inference time.
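The Chinchilla recipe reduces to simple arithmetic under the common approximation that training compute is C ≈ 6·N·D FLOPs for N parameters and D tokens. A sketch (the 6·N·D estimate and the 20-tokens-per-parameter rule are the approximations from the literature, not exact laws):

```python
def chinchilla_optimal(compute_flops):
    """Given a FLOP budget C, return a roughly compute-optimal
    (parameters, tokens) pair under C = 6*N*D with D = 20*N."""
    n_params = (compute_flops / 120.0) ** 0.5   # solve 6 * N * (20 * N) = C
    return n_params, 20.0 * n_params

# Chinchilla's own budget: 70B parameters trained on 1.4T tokens
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
```

Plugging Chinchilla's training budget back in recovers its 70B-parameter, 1.4T-token configuration, where the ratio D/N is exactly 20.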
The growth in Transformer model sizes has been dramatic, though the trend has shown signs of reversing as efficiency techniques mature.
| Model | Year | Parameters | Architecture | Training data |
|---|---|---|---|---|
| Original Transformer (big) | 2017 | 213M | Encoder-decoder | WMT En-De/En-Fr |
| BERT-Large | 2018 | 340M | Encoder-only | 3.3B words |
| GPT-2 | 2019 | 1.5B | Decoder-only | 40GB WebText |
| T5-11B | 2019 | 11B | Encoder-decoder | 750GB C4 |
| GPT-3 | 2020 | 175B | Decoder-only | 300B tokens |
| PaLM | 2022 | 540B | Decoder-only | 780B tokens |
| LLaMA 65B | 2023 | 65B | Decoder-only | 1.4T tokens |
| LLaMA 3 405B | 2024 | 405B | Decoder-only | 15T tokens |
| DeepSeek-V3 | 2024 | 671B total (37B active) | Decoder-only MoE | 14.8T tokens |
A notable recent trend is the shift toward smaller but more heavily trained models. As of 2025, the best-performing open-weight models (such as Llama 3.3 70B and Mistral Large 2 at 123B parameters) are considerably smaller than the largest models from 2022-2023, reflecting improvements in data quality, training recipes, and architectural refinements.
The basic Transformer architecture has been augmented with numerous improvements since 2017. These innovations have enabled scaling to far larger models and longer context windows.
Standard attention implementations are memory-bound: they write the full N x N attention matrix to GPU high-bandwidth memory (HBM), which is slow. FlashAttention, introduced by Tri Dao et al. in 2022 [19], restructures the attention computation using tiling and kernel fusion to avoid materializing the full attention matrix in HBM. It performs all computation in fast on-chip SRAM, dramatically reducing memory usage and wall-clock time. FlashAttention makes exact (not approximate) attention faster and more memory-efficient, enabling practical training with context lengths of 64k tokens and beyond.
FlashAttention-2 (2023) further optimized parallelism and work partitioning across GPU thread blocks. FlashAttention-3 (2024) was specifically designed for NVIDIA Hopper GPUs (H100) and introduced three key techniques: exploiting asynchrony between Tensor Cores and the Tensor Memory Accelerator (TMA) for overlapping computation and data movement via warp-specialization; interleaving block-wise matrix multiplication and softmax operations; and block quantization with incoherent processing for FP8 low-precision computation [20]. FlashAttention-3 achieves up to 740 TFLOPs/s in FP16 (75% utilization of the H100, up from 35% with FlashAttention-2) and close to 1.2 PFLOPs/s with FP8, while maintaining 2.6x lower numerical error than naive FP8 attention by keeping intermediate results in FP32. FlashAttention has been a major contributor to the expansion of LLM context lengths, from 2-4K tokens (GPT-3, OPT) to 128K (GPT-4) and 1M+ (Gemini 1.5 Pro).
In standard multi-head attention, each head has its own set of query, key, and value projections. During autoregressive inference, keys and values from all previous tokens must be stored in a "KV cache," which grows large and becomes a memory bottleneck. Grouped-query attention, introduced by Ainslie et al. in 2023 [21], shares key and value heads across groups of query heads while keeping separate query projections. For example, with 32 query heads and 8 KV head groups, each group of 4 query heads shares a single set of key-value heads. This reduces KV cache size by a factor of 4 with minimal quality loss (typically less than 1% degradation on standard benchmarks).
GQA can be understood as a generalization that spans the spectrum between standard multi-head attention (MHA) and multi-query attention (MQA, where all heads share a single KV pair). Meta adopted GQA for LLaMA 2 (July 2023) and retained it in LLaMA 3 (2024). By 2025, GQA has displaced classic multi-head attention as the default configuration for most large open-weight models, including Qwen, Gemma, and Mistral.
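The head-sharing at the core of GQA amounts to broadcasting each KV head across its group of query heads. A minimal NumPy sketch with the 32-query-head, 8-KV-head example from above (head-major layout and shapes are illustrative):

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_kv_heads):
    """GQA sketch: Q has n_q heads; K and V have only n_kv_heads heads,
    each shared by a group of n_q // n_kv_heads query heads."""
    n_q_heads, n, d_k = Q.shape
    group = n_q_heads // n_kv_heads
    K = np.repeat(K, group, axis=0)          # broadcast each KV head to its group
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores); w /= w.sum(axis=-1, keepdims=True)
    return w @ V                             # (n_q_heads, n, d_k)

rng = np.random.default_rng(5)
Q = rng.normal(size=(32, 4, 64))             # 32 query heads, 4 tokens, d_k = 64
K = rng.normal(size=(8, 4, 64))              # only 8 KV heads: 4x smaller KV cache
V = rng.normal(size=(8, 4, 64))
out = grouped_query_attention(Q, K, V, n_kv_heads=8)
```

Only K and V need to be cached during autoregressive decoding, so storing 8 heads instead of 32 cuts the cache to a quarter while the per-head attention math is unchanged.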
DeepSeek-V2 (May 2024) introduced Multi-Head Latent Attention, which compresses key-value states into a low-rank latent vector [22]. Instead of caching full key and value matrices, MLA caches a much smaller latent representation from which keys and values can be reconstructed. This reduces the KV cache size even further than GQA, enabling efficient inference for models with very long contexts.
Mixture of experts is a technique where each Transformer layer contains multiple parallel feed-forward networks ("experts"), and a routing mechanism selects only a small subset of experts for each token. This allows the total parameter count to grow without proportionally increasing compute per token. Sparse MoE Transformers were explored in Switch Transformer by Google in 2021 [23]. The approach gained mainstream adoption with Mixtral 8x7B from Mistral AI (December 2023), which has 46.7 billion total parameters but activates only about 12.9 billion per token [24]. Mixtral 8x22B expanded to 141 billion total parameters with 39 billion active.
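The routing idea can be sketched compactly: score every expert per token, keep the top k, renormalize their scores, and run only those experts. An illustrative NumPy version (a real MoE layer adds load-balancing losses and batched expert dispatch; the expert functions here are toy linear maps):

```python
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """Sparse MoE sketch: each token is routed to its top-k experts,
    whose outputs are combined with renormalized router weights."""
    logits = x @ W_gate                            # (tokens, n_experts) scores
    top = np.argsort(logits, axis=-1)[:, -k:]      # indices of top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        scores = np.exp(scores - scores.max())
        weights = scores / scores.sum()            # softmax over the selected k
        for w, e in zip(weights, top[t]):
            out[t] += w * experts[e](x[t])         # only k experts run per token
    return out

rng = np.random.default_rng(6)
d, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(d, d)) * 0.1: v @ W
           for _ in range(n_experts)]              # 8 toy "expert" FFNs
W_gate = rng.normal(size=(d, n_experts))
y = moe_layer(rng.normal(size=(4, d)), W_gate, experts, k=2)
```

With k = 2 of 8 experts active, only a quarter of the expert parameters contribute compute for any given token, which is the sense in which total parameters grow faster than per-token FLOPs.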
DeepSeek-V3 (December 2024) scaled MoE to 671 billion total parameters with 37 billion active per token, using 256 fine-grained experts with an auxiliary-loss-free load balancing strategy. It was trained on 14.8 trillion tokens for 2.788 million H800 GPU hours [25]. DeepSeek-R1 (January 2025), built on the V3 architecture with reinforcement learning for reasoning, matched or exceeded many closed-source models on benchmarks while being open-weight.
Meta's Llama 4 (April 2025) marked Meta's first adoption of MoE, with Llama 4 Scout using 16 experts (109B total, 17B active) and Llama 4 Maverick using 128 experts (400B total, 17B active). Jamba by AI21 Labs (March 2024) combined MoE with a hybrid Transformer-Mamba architecture, drawing on 12B of its 52B parameters at inference. Its successor Jamba 1.5 scaled to 398B total with 94B active parameters.
The O(n^2) cost of standard attention has motivated research into sub-quadratic alternatives. Longformer (2020) uses a combination of local windowed attention and task-specific global attention [26]. BigBird (2020) combines random, windowed, and global attention patterns. Performer (2020) uses random feature maps (FAVOR+) to approximate softmax attention with linear complexity, though at a quality cost.
Linear attention methods replace the softmax with kernel functions to achieve O(n) complexity. Gated Linear Attention (GLA), published at ICML 2024, pairs linear attention with a data-dependent gating mechanism and uses a hardware-efficient training algorithm called FlashLinearAttention that is faster than FlashAttention-2 as a standalone layer [27]. The GLA Transformer performs competitively with standard Transformer architectures and subquadratic baselines like RetNet and Mamba.
Several other modifications have become common in modern Transformers, including the Pre-LN placement and RMSNorm discussed above, rotary positional embeddings in place of the original sinusoidal scheme, and gated activation functions in the feed-forward network.
Training large Transformer models requires distributing computation across thousands of accelerators. The field has converged on a combination of parallelism strategies, often referred to as multi-dimensional or "4D" parallelism.
Data parallelism (DP) replicates the entire model on each device, splits the training batch across devices, and synchronizes gradients after each step. Fully Sharded Data Parallelism (FSDP) extends this by sharding model parameters, gradients, and optimizer states across devices rather than replicating them, reducing per-device memory requirements.
Tensor parallelism (TP) splits individual weight matrices across multiple GPUs and computes matrix multiplications in parallel. For example, the attention projection matrices and feed-forward layers can be sliced horizontally or vertically, with each GPU computing on its shard and synchronizing results. TP is most effective within a single node because it requires frequent all-reduce communications.
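The core operation of tensor parallelism, sharding one matmul across devices, can be illustrated in-process. Here two NumPy slices stand in for two GPUs holding column shards of the same weight matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(4, 16))         # activations, replicated on both "devices"
W = rng.normal(size=(16, 32))        # full weight matrix (never materialized
                                     # on any single device in real TP)

W0, W1 = W[:, :16], W[:, 16:]        # column shard per "device"
Y0 = X @ W0                          # each device multiplies its shard...
Y1 = X @ W1
Y_sharded = np.concatenate([Y0, Y1], axis=1)   # ...and results are gathered
```

Concatenating the shard outputs reproduces the unsharded result exactly; in a real system the gather (or a subsequent all-reduce for row-sharded layers) is the communication step that keeps TP most effective within a single node.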
Pipeline parallelism (PP) assigns different Transformer layers to different GPUs, creating a pipeline where micro-batches flow through stages. This reduces memory per device at the cost of some "pipeline bubbles" (idle time). Techniques like interleaved scheduling and virtual pipeline stages help minimize these bubbles.
Context parallelism (CP) splits the input token sequence along the sequence-length dimension, distributing different portions of a long sequence across GPUs. This is essential for training with extremely long contexts (100K+ tokens). LLaMA 3 used an all-gather-based context parallelism that gathers key and value tensors before attention computation [30].
LLaMA 3 405B was trained on 16,384 H100 GPUs using all four dimensions simultaneously, consuming approximately 3.8 x 10^25 FLOPs over several months. NVIDIA's Megatron-LM framework and Meta's internal infrastructure are the primary codebases supporting these parallelism combinations.
The hardware demands have grown enormously since the original 8-GPU setup.
| Model | Year | Parameters | Training hardware | Approximate cost |
|---|---|---|---|---|
| Original Transformer (big) | 2017 | 213M | 8 NVIDIA P100 GPUs | Negligible |
| BERT-Large | 2018 | 340M | 16 TPU v3 chips | ~$7,000 |
| GPT-3 | 2020 | 175B | Thousands of V100 GPUs | ~$4.6 million |
| LLaMA 65B | 2023 | 65B | 2,048 A100 GPUs | ~$2.4 million |
| LLaMA 3 405B | 2024 | 405B | 16,384 H100 GPUs | ~$30+ million |
| DeepSeek-V3 | 2024 | 671B (37B active) | 2,048 H800 GPUs | ~$5.6 million |
As of 2025, NVIDIA H100 GPUs (80 GB HBM3) are the standard for training, with cloud rental costs of $1.50 to $6.00 per GPU-hour depending on provider and availability. The H200, with 141 GB of HBM3e memory, has become widely available for inference workloads. For inference, a rough guideline is that a model requires approximately 2 bytes of VRAM per parameter at FP16 precision: a 7 billion parameter model needs about 14 GB, while a 70 billion parameter model needs about 140 GB. Quantization techniques (4-bit, 8-bit) can reduce these requirements by 2-4x.
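The 2-bytes-per-parameter guideline is a one-line calculation. A sketch that covers weights only (KV cache and activations add more on top; the function name is illustrative):

```python
def inference_vram_gb(n_params, bytes_per_param=2.0):
    """Rule-of-thumb weight memory for inference: ~2 bytes/param at FP16,
    ~1 at 8-bit, ~0.5 at 4-bit quantization. Weights only."""
    return n_params * bytes_per_param / 1e9

w7 = inference_vram_gb(7e9)                          # 7B model at FP16
w70 = inference_vram_gb(70e9)                        # 70B model at FP16
w70_q4 = inference_vram_gb(70e9, bytes_per_param=0.5)  # 70B at 4-bit
```

This reproduces the figures above: about 14 GB for a 7B model and 140 GB for a 70B model at FP16, dropping to roughly 35 GB for the 70B model under 4-bit quantization.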
The Vision Transformer (ViT), introduced by Dosovitskiy et al. at Google in October 2020, demonstrated that a pure Transformer (with no convolutions) could match or exceed state-of-the-art convolutional neural networks on image classification [31]. ViT divides an image into fixed-size patches (typically 16x16 pixels), flattens each patch into a vector, adds positional embeddings, and feeds the resulting sequence into a standard Transformer encoder. The paper was published at ICLR 2021.
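The patch-extraction step that turns an image into a token sequence is a pure reshape. A NumPy sketch of ViT's input pipeline (a real model then applies a learned linear projection and adds a class token and positional embeddings):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into a sequence of flattened
    patch x patch patches, as in ViT's input processing."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch          # patch grid dimensions
    return (image[:gh * patch, :gw * patch]  # crop any remainder
            .reshape(gh, patch, gw, patch, c)
            .transpose(0, 2, 1, 3, 4)        # group the two patch axes together
            .reshape(gh * gw, patch * patch * c))

img = np.random.default_rng(7).random((224, 224, 3))
seq = patchify(img)   # 14 x 14 = 196 "tokens", each of dimension 16*16*3 = 768
```

A standard 224x224 RGB image thus becomes a sequence of 196 vectors of dimension 768, which a stock Transformer encoder can consume unchanged.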
DeiT (Data-efficient Image Transformers), published by Touvron et al. at Facebook AI in 2021, addressed ViT's reliance on massive datasets (ViT was pre-trained on JFT-300M). DeiT introduced a distillation token that learns from a CNN teacher model through the attention mechanism, achieving 84.2% top-1 accuracy on ImageNet-1K when trained on a single 8-GPU server over three days, without any external data [32]. The use of a CNN teacher was found to be more effective than a Transformer teacher, as it transfers convolutional spatial inductive biases that Transformers naturally lack.
The Swin Transformer, introduced by Liu et al. at Microsoft Research in 2021, addressed the computational cost of applying attention globally to image patches by using a hierarchical architecture with shifted windows [33]. Instead of computing attention across all patches, Swin Transformer computes self-attention within local, non-overlapping windows, then shifts the window partition between layers to allow cross-window connections. This design achieves linear computational complexity with respect to image size (rather than quadratic), making it practical for high-resolution images and dense prediction tasks like object detection and segmentation. The Swin Transformer achieved 87.3% top-1 accuracy on ImageNet-1K, 58.7 box AP on COCO object detection, and won the ICCV 2021 Best Paper Award (Marr Prize).
Since these foundational works, Transformers have been applied broadly in computer vision: object detection (DETR by Facebook AI, 2020), image segmentation (Segment Anything by Meta, 2023), image generation (diffusion transformers, or DiT, used in Stable Diffusion 3 and DALL-E 3), and video understanding. By 2025, most state-of-the-art vision systems either use Transformers or hybrid CNN-Transformer architectures.
AlphaFold 2 by DeepMind (2020) used a modified Transformer architecture called the "Evoformer" to predict protein 3D structures with near-experimental accuracy [34]. AlphaFold 3 (2024) introduced a simplified variant called the "Pairformer" to handle prediction of protein-ligand, protein-DNA, and protein-RNA complexes. These systems have transformed structural biology and drug discovery.
Transformers have been adopted across many additional fields, including speech recognition, music generation, code generation, reinforcement learning, and robotics.
Despite the Transformer's dominance, its O(n^2) attention cost and large memory footprint have motivated research into alternative sequence modeling approaches. As of 2025, no alternative has decisively overtaken the Transformer on broad benchmarks, but several promising directions have emerged.
Structured state space models process sequences with linear complexity by maintaining a fixed-size hidden state that is updated recurrently. S4 (Structured State Space sequence model; Gu et al., 2021) demonstrated strong performance on the Long Range Arena benchmark. Mamba (December 2023) introduced selective state spaces, allowing the model to dynamically filter information based on the input, achieving Transformer-competitive performance on language modeling with linear-time inference.
Mamba-2 (May 2024) revealed a deep connection between state space models and attention through the State Space Duality (SSD) framework, showing that Transformers and SSMs are two sides of the same mathematical coin. The SSD framework enabled a 2-8x speedup in Mamba-2's core layer while remaining competitive with Transformers [35].
In May 2024, Beck et al. (including Sepp Hochreiter, co-inventor of the original LSTM) published xLSTM, which revisits the LSTM architecture with modern techniques [36]. xLSTM introduces exponential gating with appropriate normalization and two modified memory variants: sLSTM with scalar memory, and mLSTM with a matrix memory and covariance update rule that is fully parallelizable. In benchmarks at scales from 125M to 1.3B parameters, xLSTM performed favorably compared to both Transformers and state space models, with faster inference. The startup NXAI, founded by Hochreiter, is developing commercial xLSTM-based language models.
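The mLSTM update can be sketched as follows. This is a heavily simplified illustration of the ideas in Beck et al. (exponential gating, a matrix memory updated with outer products, and a normalizer state); the input projections, per-head dimensions, and the numerical stabilization used in practice are omitted, and the variable names are this sketch's own:

```python
import numpy as np

def mlstm_step(C, n, k, v, q, f_pre, i_pre):
    """One simplified mLSTM-style step: the matrix memory C accumulates
    outer products v k^T under exponential gates, and a normalizer state n
    keeps the readout bounded."""
    f = np.exp(f_pre)                   # exponential forget gate
    i = np.exp(i_pre)                   # exponential input gate
    C = f * C + i * np.outer(v, k)      # covariance-style matrix memory update
    n = f * n + i * k                   # normalizer state
    h = (C @ q) / max(abs(n @ q), 1.0)  # normalized readout for query q
    return C, n, h

d = 4
C = np.zeros((d, d)); n = np.zeros(d)
rng = np.random.default_rng(0)
k, v, q = rng.normal(size=(3, d))
C, n, h = mlstm_step(C, n, k, v, q, f_pre=0.0, i_pre=0.0)
```

Because the memory update does not depend on the previous step's output h, the recurrence can be unrolled and computed in parallel across time steps, which is the property that makes mLSTM training parallelizable.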
Rather than replacing Transformers entirely, the most practical approach has been to combine Transformer attention layers with efficient recurrent or SSM layers in a single model. Jamba (AI21 Labs, March 2024) interleaves Transformer and Mamba layers at a ratio of roughly one attention layer per eight total layers, combined with MoE. This design fits on a single 80 GB GPU while supporting 256K-token contexts and matching pure Transformer quality [37]. Jamba 1.5 (2024) scaled to 398B total parameters (94B active) across 72 layers, demonstrating that hybrid designs can work at large scale.
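The interleaving can be expressed as a simple layer plan. In the sketch below, only the one-in-eight ratio comes from the Jamba description above; the exact positions of the attention layers, and the function and label names, are illustrative assumptions:

```python
def hybrid_layer_plan(n_layers: int, attn_every: int = 8):
    """Sketch of a Jamba-style layer schedule: one attention layer per
    `attn_every` total layers, the remaining layers Mamba blocks.
    Placing attention at the end of each group is an arbitrary choice here."""
    return ["attention" if i % attn_every == attn_every - 1 else "mamba"
            for i in range(n_layers)]

layers = hybrid_layer_plan(32)   # 32-layer toy model: 4 attention, 28 Mamba
```

The appeal of the schedule is that memory and compute are dominated by the linear-time Mamba layers, while the sparse attention layers preserve the global token-to-token interactions that pure SSM stacks struggle to match.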
In early 2026, the Allen Institute for AI released OLMo Hybrid, combining Transformer attention with recurrent layers for improved data efficiency. NVIDIA's research on hybrid models showed that including even a small fraction of full attention layers (12.5-25%) alongside SSM layers is sufficient to match pure Transformer performance on most benchmarks.
Despite the activity in alternative architectures, the Transformer remains firmly dominant as of early 2026. An analysis of the top-ranked models on the LMSYS Chatbot Arena showed that no model in the top 10 uses a sub-quadratic or hybrid architecture; all rely on full attention. The practical benefits of the Transformer ecosystem, including optimized hardware, training infrastructure, and accumulated engineering knowledge, give it a durable advantage. Alternative architectures are most competitive in long-context and latency-sensitive settings, while the Transformer continues to excel when compute is not the binding constraint.
As of early 2026, the Transformer remains the foundation of virtually all frontier AI systems. The architecture has not been replaced; it has been refined. The basic recipe of self-attention, feed-forward layers, residual connections, and layer normalization continues to work remarkably well at scale.
Key trends include longer context windows, mixture-of-experts scaling, and hybrid designs that interleave attention with recurrent or state space layers.
Despite active research into alternatives, no architecture has convincingly surpassed the Transformer across a broad range of tasks. The Transformer's combination of expressive power, scalability, and compatibility with modern hardware continues to make it the default choice for both research and production AI systems.
Imagine you are building a puzzle with your friends. Each piece of the puzzle is a word, and the completed puzzle is a sentence. The Transformer is like a smart helper that looks at all the pieces at the same time and figures out which pieces are most important for understanding the picture.
Older helpers (called RNNs) had to look at the pieces one by one, left to right. If an important piece was far away from the one they were working on, they might forget about it by the time they got there. The Transformer can look at all the pieces at once, so it never forgets.
The Transformer's special trick is called "attention." It works like this: for each puzzle piece, the Transformer asks, "Which other pieces should I pay attention to?" It figures out the answer and uses that information to understand what each piece means in context. It does this many times from different angles (called "heads"), so it can notice different kinds of connections.
This ability to look at everything at once and focus on what matters is why the Transformer became the engine behind chatbots, translation apps, image generators, and many other AI tools we use today.