Attention Is All You Need (Transformer)
Last reviewed
May 8, 2026
Sources
16 citations
Review status
Source-backed
Revision
v5 · 5,565 words
"Attention Is All You Need" is a 2017 research paper by eight researchers at Google Brain and Google Research that introduced the Transformer architecture. The paper was first posted to arXiv on June 12, 2017, and presented at the 31st Conference on Neural Information Processing Systems (NeurIPS) in Long Beach, California in December 2017. It is one of the most consequential papers in the history of artificial intelligence: every modern frontier large language model, including GPT-4, Claude, Gemini, and LLaMA, descends from the architecture it described.
| Field | Value |
|---|---|
| Title | Attention Is All You Need |
| Authors | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin |
| Affiliations at time of publication | Google Brain (seven authors); Google Research (Polosukhin) |
| Venue | NeurIPS 2017 (31st Conference on Neural Information Processing Systems) |
| arXiv ID | 1706.03762 |
| First arXiv submission | June 12, 2017 |
| Latest arXiv revision | Version 7, August 2, 2023 |
| Original task | Machine translation (WMT 2014 English-German and English-French) |
| Citations (early 2026) | More than 173,000 |
| Reference codebase | Tensor2Tensor (TensorFlow) |
The authors all contributed equally; the listed order was randomized. A footnote in the paper states that each author wishes to be considered a first author.
Vaswani et al. published "Attention Is All You Need" at the 2017 Conference on Neural Information Processing Systems (NeurIPS). The paper introduced the Transformer, a novel neural network architecture for natural language processing (NLP) tasks. As of early 2026, it has been cited more than 173,000 times, making it one of the most cited papers in the history of computer science and one of the ten most-cited scientific papers of the 21st century. At the time of submission, the authors were researchers at Google Brain, an AI research division of Google, and at Google Research.
The new neural network architecture was based on a self-attention mechanism designed for language understanding. Traditionally, sequence transduction models were based on a recurrent neural network or a convolutional neural network that included an encoder and decoder. The top performing models also connected the encoder and decoder via an attention mechanism. With the Transformer, the researchers built a simple network architecture based only on attention mechanisms, with no recurrence and no convolutions.
The experimental results demonstrated that the new model was "superior in quality while being more parallelizable and requiring significantly less time to train." The Transformer also generalized well to other tasks. According to the authors, "The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs."
The paper's main contribution was demonstrating that a stack of attention layers, with no recurrence or convolution, was sufficient to set a new state of the art on machine translation. Within eighteen months, GPT-1 (June 2018) and BERT (October 2018) had adapted the architecture to general language generation and understanding. Within seven years, every leading commercial AI system, including ChatGPT, Google Search's generative summaries, DALL-E, and Microsoft Copilot, was running on Transformer descendants. Transformer models have also been applied to chemical structures, protein structure prediction, and large-scale medical data analysis.
All eight authors were at Google when the paper was written. The first footnote on the paper notes their individual contributions. Vaswani drove much of the experimentation, Shazeer designed key efficiency improvements, Parmar handled implementation, Uszkoreit conceived the original idea of replacing recurrence with attention, Jones built the initial codebase and visualizations, Gomez (a 20-year-old intern at the time) helped implement the model in Tensor2Tensor, Kaiser pushed the engineering, and Polosukhin helped design the model.
| Author | Affiliation in 2017 | Role on the paper |
|---|---|---|
| Ashish Vaswani | Google Brain | Designed and implemented core architecture; ran experiments |
| Noam Shazeer | Google Brain | Proposed scaled dot-product attention and multi-head attention; tuned model |
| Niki Parmar | Google Brain | Designed and implemented variants; led ablation experiments |
| Jakob Uszkoreit | Google Brain | Originated idea of replacing recurrence with attention; led research effort |
| Llion Jones | Google Research (Tokyo) | Built initial codebase, efficient inference, attention visualizations |
| Aidan N. Gomez | Google Brain (intern) | Built and refined model in Tensor2Tensor; ran experiments |
| Lukasz Kaiser | Google Brain | Co-designed and implemented Tensor2Tensor; mentored team |
| Illia Polosukhin | Google Research | Helped design the model and contributed to early prototypes |
The randomized author order is one of several signals that the project was a true team effort. Inside Google, the work grew out of the Brain Frontiers and Translate research teams. The reference implementation was published in Tensor2Tensor, a TensorFlow-based library for sequence modeling that Lukasz Kaiser had been developing along with Aidan Gomez.
The project that produced the paper was not originally about killing recurrence. It started with several smaller efforts to use attention more aggressively inside LSTM-based translation models. Jakob Uszkoreit had been arguing internally for a while that attention alone, with no recurrence at all, ought to be enough for translation. Most colleagues were skeptical. Recurrence had defined sequence modeling for two decades.
The breakthrough came when the team began stripping pieces out of their model and watched the BLEU scores go up rather than down. Llion Jones has said in later interviews that the team was throwing bits of the model away to see how much worse it would get and that, to their surprise, it started getting better. The team replaced recurrence with self-attention, then added multi-head attention and scaled dot-product attention as cleaner formulations of ideas already floating around the literature. Noam Shazeer is generally credited with the formulation of multi-head attention and the sqrt(d_k) scaling factor.
The name "Transformer" was chosen by Jakob Uszkoreit because, in his own words, he liked the sound of the word. An early internal draft was titled "Transformers: Iterative Self-Attention and Processing for Various Tasks" and included illustrations from the Transformers franchise. The paper title "Attention Is All You Need" is a play on the Beatles song "All You Need Is Love." Llion Jones has been credited as the author of the catchier final title. The phrase has since spawned a long lineage of papers: by 2025, more than 700 arXiv preprints used "All You Need" in their titles, almost all post-dating 2017.
The NeurIPS 2017 deadline gave the project its final push. The paper was submitted to arXiv on June 12, 2017, the same week the team finalized results. It was accepted to NeurIPS as a poster, not an oral presentation. The poster session in Long Beach was crowded, but few in the audience that week understood that the architecture would dominate machine learning for at least the next decade.
The parent article on the Transformer covers the architecture in detail. This section focuses on the specific choices made in the 2017 paper, several of which were later revised by the broader community.
The original Transformer is an encoder-decoder model designed for sequence transduction. The encoder is a stack of six identical layers; each layer contains a multi-head self-attention sub-layer followed by a position-wise feed-forward sub-layer. The decoder is also a stack of six layers, with three sub-layers each: a masked multi-head self-attention sub-layer, an encoder-decoder cross-attention sub-layer, and a feed-forward sub-layer. Every sub-layer is wrapped in a residual connection followed by layer normalization. The paper places normalization after the residual addition, a configuration later known as Post-LN.
Vaswani et al. (2017) describe attention as follows: "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key."
The paper introduces scaled dot-product attention, defined as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
The sqrt(d_k) scaling was a small but important detail. Without it, the dot products grow large in magnitude as d_k increases, which pushes the softmax into regions of vanishing gradient. Shazeer noticed this during early experiments and proposed the fix.
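The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names are hypothetical, matrices are lists of row vectors, and masking and batching are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    d_v = len(values[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs = []
    for q in queries:
        # Compatibility of this query with every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs
```

When every key is identical, the weights are uniform and the output is simply the mean of the value vectors, which makes a convenient sanity check.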
Multi-head attention runs h attention functions in parallel, each on a different learned linear projection of the queries, keys, and values. The outputs are concatenated and projected once more. The base model uses h = 8 heads with d_k = d_v = d_model / h = 64. The big model uses h = 16 heads.
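A rough sketch of the multi-head computation, assuming self-attention over a single unbatched sequence, no biases, no masking, and randomly initialized weights. The helper names (`matmul`, `init_params`, `multi_head_self_attention`) are illustrative and are not taken from Tensor2Tensor; the small demo sizes are chosen only so that pure Python stays fast.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    scale = 1.0 / math.sqrt(len(k[0]))
    outputs = []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * x for w, x in zip(weights, col)) for col in zip(*v)])
    return outputs


def init_params(d_model, num_heads, rng):
    """Hypothetical initializer: small Gaussian weights, no biases."""
    d_k = d_model // num_heads  # d_k = d_v = d_model / h

    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    heads = [(rand(d_model, d_k), rand(d_model, d_k), rand(d_model, d_k))
             for _ in range(num_heads)]
    return {"heads": heads, "w_o": rand(num_heads * d_k, d_model)}


def multi_head_self_attention(x, params):
    """Map an n x d_model input to an n x d_model output."""
    head_outputs = []
    for w_q, w_k, w_v in params["heads"]:
        # Each head attends over its own learned projection of the same input.
        head_outputs.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads along the feature axis, then project once more.
    concat = [[value for head in head_outputs for value in head[i]]
              for i in range(len(x))]
    return matmul(concat, params["w_o"])


rng = random.Random(0)
params = init_params(d_model=16, num_heads=4, rng=rng)
x = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(5)]
y = multi_head_self_attention(x, params)
print(len(y), len(y[0]))  # 5 16
```

Because each head works in d_model / h dimensions, the total cost is similar to that of one full-width attention head, which is the trade-off the paper points out.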
Self-attention (also known as intra-attention) "is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations."
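Because the Transformer has no recurrence and no convolution, it has no inherent sense of token order. The paper adds fixed sinusoidal positional encodings to the input embeddings:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Here pos is the token position and i indexes pairs of embedding dimensions.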
The authors chose this scheme over learned positional embeddings because it allowed the model to attend by relative position (since for any fixed offset k, PE(pos+k) is a linear function of PE(pos)) and because it could in principle generalize to sequences longer than those seen in training. They reported that learned positional embeddings produced nearly identical results, and most subsequent encoder-only and decoder-only models (including BERT and the GPT series) used learned positional embeddings instead. Sinusoidal encodings are now mostly of historical interest; modern LLMs use rotary position embeddings (RoPE) or attention biases (ALiBi) instead.
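As a sketch of the sinusoidal scheme, the table below fills each even dimension with sin(pos / 10000^(2i / d_model)) and the following odd dimension with the matching cosine. The function name is illustrative, and the code assumes nothing beyond the standard library.

```python
import math


def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for dim in range(0, d_model, 2):
            # `dim` plays the role of 2i in the paper's notation, so each
            # sine/cosine pair shares one wavelength.
            angle = pos / (10000 ** (dim / d_model))
            row[dim] = math.sin(angle)
            if dim + 1 < d_model:
                row[dim + 1] = math.cos(angle)
        table.append(row)
    return table
```

The relative-position property follows from the angle-addition identities: for a fixed offset k, each (sin, cos) pair at position pos + k is a rotation of the pair at position pos.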
The paper presents two model sizes: a base model used for ablation and the headline big model used for the leaderboard results.
| Hyperparameter | Base | Big |
|---|---|---|
| Layers (N) | 6 | 6 |
| Model dimension (d_model) | 512 | 1024 |
| Feed-forward dimension (d_ff) | 2048 | 4096 |
| Heads (h) | 8 | 16 |
| Per-head key/value dimension (d_k = d_v) | 64 | 64 |
| Dropout rate | 0.1 | 0.3 (En-De), 0.1 (En-Fr) |
| Label smoothing (epsilon) | 0.1 | 0.1 |
| Parameters | 65 million | 213 million |
These hyperparameters are remarkably close to what the field still considers reasonable nine years later for models of similar size. Some choices have been revised: most modern Transformers use Pre-LN (normalization inside the residual block) rather than Post-LN, replace ReLU with SwiGLU or GeGLU activations in the feed-forward layer, and use RMSNorm instead of layer normalization. The h = 8 head count and d_model = 512 dimensions, however, still show up in modern small models.
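The parameter counts in the table can be roughly reconstructed from the other hyperparameters. The estimate below is a back-of-the-envelope sketch, not the paper's accounting: it ignores biases and layer-normalization parameters and assumes a single 37,000-entry embedding matrix shared between the input embeddings, output embeddings, and pre-softmax projection.

```python
def transformer_weight_count(d_model, d_ff, num_layers, vocab_size):
    """Rough weight count for the encoder-decoder model (no biases or norms)."""
    attention = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O across all heads
    feed_forward = 2 * d_model * d_ff  # two linear maps
    encoder_layer = attention + feed_forward
    decoder_layer = 2 * attention + feed_forward  # self- plus cross-attention
    embeddings = vocab_size * d_model  # assumed shared (see lead-in)
    return num_layers * (encoder_layer + decoder_layer) + embeddings


base = transformer_weight_count(512, 2048, 6, 37000)
big = transformer_weight_count(1024, 4096, 6, 37000)
print(f"base ~ {base / 1e6:.1f}M, big ~ {big / 1e6:.1f}M")
# prints: base ~ 63.0M, big ~ 214.0M
```

Under these assumptions the estimate lands within a few percent of the reported 65 million and 213 million parameters.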
The paper trained the Transformer on the WMT 2014 English-to-German and WMT 2014 English-to-French translation benchmarks, the same datasets used by previous state-of-the-art systems. The English-to-German set had about 4.5 million sentence pairs, encoded using byte-pair encoding with a shared source-target vocabulary of about 37,000 tokens. The English-to-French set was much larger at 36 million sentence pairs, with a 32,000 word-piece vocabulary.
| Model | Hardware | Steps | Wall-clock time | FLOPs (approx.) |
|---|---|---|---|---|
| Base | 8 NVIDIA P100 GPUs | 100,000 | 12 hours | 3.3 x 10^18 |
| Big | 8 NVIDIA P100 GPUs | 300,000 | 3.5 days | 2.3 x 10^19 |
This was cheap by 2017 standards. By the paper's own estimates, the base model used about a third of the training FLOPs of ConvS2S and about a seventh of those of the Google Neural Machine Translation system on English-to-German, while scoring higher than both. On English-to-French, the paper reports that the big Transformer reached its score at less than a quarter of the training cost of the previous state-of-the-art model. By 2026 standards the entire training run would fit comfortably in the warm-up phase of a frontier LLM run.
The paper used the Adam optimizer with beta1 = 0.9, beta2 = 0.98, and epsilon = 10^-9, plus a custom warmup schedule. Learning rate was increased linearly for the first 4,000 steps and then decayed proportionally to the inverse square root of the step number. This warmup schedule was widely copied for the next several years of Transformer research and is still common in modern training recipes, often with cosine decay substituted for the inverse square root tail.
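The schedule described above corresponds to the paper's formula lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). A small sketch follows; the function name and the guard against step 0 are additions for illustration.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)  # guard added here so step 0 does not divide by zero
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

With the base settings, the rate peaks at step 4,000 at roughly 7e-4, is half that value at step 2,000 (still warming up), and is half that value again at step 16,000 (decaying).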
Label smoothing of 0.1 was applied during training. The paper notes this hurt perplexity but improved BLEU and accuracy, an early example of a regularization technique that traded one metric for another.
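One common way to implement label smoothing is sketched below. It assumes the smoothing mass epsilon is spread uniformly over the whole vocabulary, including the true token; implementations differ on this detail, and the paper does not spell it out, so treat the exact distribution as an assumption.

```python
import math


def smoothed_targets(true_index, vocab_size, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the vocabulary."""
    targets = [epsilon / vocab_size] * vocab_size
    targets[true_index] += 1.0 - epsilon
    return targets


def cross_entropy(targets, probabilities):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(targets, probabilities))


# A very confident (and correct) prediction over a toy 5-token vocabulary.
confident = [0.001, 0.001, 0.996, 0.001, 0.001]
hard_loss = cross_entropy([0.0, 0.0, 1.0, 0.0, 0.0], confident)
smooth_loss = cross_entropy(smoothed_targets(2, 5), confident)
```

Here `smooth_loss` comes out larger than `hard_loss`: the smoothed objective penalizes over-confident predictions, which is consistent with the paper's observation that perplexity gets worse even as BLEU improves.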
For decoding the paper used beam search with a beam size of 4 and a length penalty of alpha = 0.6. The authors averaged the last 5 (base) or 20 (big) checkpoints to obtain a single inference model, a technique that produced a small BLEU bump. Maximum output length was set to input length plus 50 tokens, with early termination when possible.
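Checkpoint averaging amounts to an element-wise mean of the saved parameters. The sketch below assumes each checkpoint is a plain dictionary mapping a parameter name to a flat list of floats; that representation is hypothetical and chosen only to keep the example self-contained.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of checkpoints stored as {name: [floats]} dicts."""
    count = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(checkpoint[name] for checkpoint in checkpoints))
        averaged[name] = [sum(values) / count for values in columns]
    return averaged
```

For the base model this would be called with the last 5 checkpoints, and for the big model with the last 20.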
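On English-to-German, both models outperformed all previously published systems, including ensembles. On English-to-French, the big model outperformed all previously published single models, while the base model trailed the strongest earlier systems.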
| Model | EN-DE BLEU | EN-FR BLEU | Training cost (FLOPs, EN-DE) |
|---|---|---|---|
| Transformer (base) | 27.3 | 38.1 | 3.3 x 10^18 |
| Transformer (big) | 28.4 | 41.0 (41.8 with averaging) | 2.3 x 10^19 |
| ConvS2S (Gehring et al., 2017) | 25.16 | 40.46 | 9.6 x 10^18 |
| GNMT + RL (Wu et al., 2016) | 24.6 | 39.92 | 2.3 x 10^19 |
| ByteNet (Kalchbrenner et al., 2017) | 23.75 | - | - |
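A gain of 2 BLEU points on WMT EN-DE was a substantial improvement at the time. More striking was the cost: on EN-FR, the paper reports that the big Transformer achieved its score at less than a quarter of the training cost of the previous state-of-the-art model, and the base model beat every earlier EN-DE system in the table at a fraction of their training FLOPs.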
The paper did not invent attention. The idea had been circulating in the machine translation literature for several years before 2017. The paper's contribution was to push attention to the point where it became the entire model, not just a helper.
| Year | Work | Contribution |
|---|---|---|
| 2014 | Bahdanau, Cho, Bengio: Neural Machine Translation by Jointly Learning to Align and Translate | Introduced additive attention for sequence-to-sequence translation; let the decoder attend to encoder states |
| 2015 | Luong, Pham, Manning: Effective Approaches to Attention-based Neural Machine Translation | Simplified to multiplicative (dot-product) attention; introduced local attention |
| 2016 | Cheng, Dong, Lapata: Long Short-Term Memory-Networks for Machine Reading | Used self-attention (intra-attention) inside an LSTM for sentence representation |
| 2017 | Gehring et al.: Convolutional Sequence to Sequence Learning (ConvS2S) | Replaced recurrence with convolutions and gated linear units |
| 2017 | Kalchbrenner et al.: Neural Machine Translation in Linear Time (ByteNet) | Used dilated convolutions for translation |
| 2017 | Parikh et al.: A Decomposable Attention Model for Natural Language Inference | Used attention without recurrence for NLI; an early hint that attention alone could carry a model |
The paper's introduction explicitly cites these predecessors. ConvS2S and ByteNet, both convolutional alternatives to recurrence, were the immediate competition. Both had shown that you could move beyond recurrence; the Transformer showed that you could replace it entirely with attention and beat the convolutional alternatives at the same time.
The paper's reference code was released as part of Tensor2Tensor (T2T), a TensorFlow library for sequence-to-sequence models that Lukasz Kaiser had been developing with Aidan Gomez and others before the paper. T2T provided the full training pipeline, including the model definition, the WMT data preprocessing, and the BLEU evaluation scripts.
T2T was widely used in 2018 and 2019 but was eventually superseded by other libraries. Google moved its production machine translation work to Lingvo. The research community gradually migrated to PyTorch, and Hugging Face's Transformers library, launched in 2018 and growing through 2019, became the dominant open-source implementation. Tensor2Tensor was officially deprecated in 2020. Today the original Tensor2Tensor repository is mostly read-only on GitHub and exists primarily as a historical artifact.
For researchers and students learning the architecture from scratch, the standard reference is no longer the Tensor2Tensor code but "The Annotated Transformer" by Sasha Rush at Harvard, originally published in 2018 as a Jupyter notebook with PyTorch code interleaved with the paper text in roughly the same reading order. A community refresh in 2022 updated it to modern PyTorch idioms. The annotated version implements the model in roughly 400 lines of code and remains one of the most widely cited educational resources in machine learning.
Other notable open-source implementations include:
| Implementation | Framework | Notes |
|---|---|---|
| Tensor2Tensor (Google) | TensorFlow | Original reference; deprecated 2020 |
| The Annotated Transformer (Sasha Rush, Harvard) | PyTorch | Educational; ~400 lines |
| Hugging Face Transformers | PyTorch, JAX, TensorFlow | De facto standard for production and research |
| fairseq (Meta AI) | PyTorch | Includes the original Transformer plus many variants |
| Trax (Google) | JAX, TensorFlow | Successor to Tensor2Tensor for some research |
| Flaxformer (Google) | JAX/Flax | Used internally for T5 and PaLM training |
For a paper that has been read by hundreds of thousands of people, "Attention Is All You Need" is unusually short and clearly written. The arXiv version is 15 pages including references, with about 11 pages of body text, two pages of references, and two pages of supplementary material. The structure is conventional but worth noting because so much of the paper has become canonical.
| Section | Pages | Content |
|---|---|---|
| 1. Introduction | 1 | Frames the limitations of recurrence and introduces the Transformer at a high level |
| 2. Background | 1 | Reviews convolutional alternatives (ConvS2S, ByteNet) and prior attention work |
| 3. Model Architecture | 4 | Introduces encoder-decoder stacks, scaled dot-product attention, multi-head attention, position-wise feed-forward, embeddings, and positional encoding |
| 4. Why Self-Attention | 1.5 | Compares self-attention with recurrence and convolution in terms of complexity, parallelization, and path length |
| 5. Training | 1 | Details on data, hardware, optimizer, regularization |
| 6. Results | 2 | BLEU scores, model variations, English constituency parsing transfer experiment |
| 7. Conclusion | 0.5 | Brief summary and future work |
| References | 2 | About 40 citations, mostly to neural translation work from 2014-2017 |
Section 4 ("Why Self-Attention") is sometimes overlooked in summaries of the paper but is one of its most quoted pieces. The authors construct a now-famous table comparing self-attention, recurrent, and convolutional layers along three dimensions: complexity per layer, sequential operations, and maximum path length between any two positions. Self-attention costs O(n^2 * d) per layer but needs only O(1) sequential operations and has an O(1) maximum path length. A recurrent layer costs O(n * d^2) with O(n) sequential operations and an O(n) path length; a convolutional layer with kernel width k costs O(k * n * d^2) with O(1) sequential operations and an O(log_k(n)) path length when dilated convolutions are used. This table is the formal justification for replacing recurrence with attention.
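To make the comparison concrete, the sketch below plugs numbers into the per-layer expressions from the paper's table: n^2 * d for self-attention, n * d^2 for a recurrent layer, and k * n * d^2 for a convolutional layer. Constant factors are dropped, so the results are only meaningful relative to each other, and the function names are illustrative.

```python
def conv_path_length(n, k):
    """Smallest number of layers L such that k**L >= n (dilated convolutions)."""
    layers, reach = 0, 1
    while reach < n:
        reach *= k
        layers += 1
    return layers


def layer_costs(n, d, k=3):
    """Operation counts (up to constants), sequential steps, and path lengths."""
    return {
        "self-attention": {"ops": n * n * d, "sequential": 1, "path": 1},
        "recurrent": {"ops": n * d * d, "sequential": n, "path": n},
        "convolutional": {"ops": k * n * d * d, "sequential": 1,
                          "path": conv_path_length(n, k)},
    }


short = layer_costs(n=50, d=512)     # sentence-length input, base model width
long = layer_costs(n=1024, d=512)    # sequence longer than the model width
```

For a 50-token sentence with d = 512, self-attention needs about a tenth of the operations of a recurrent layer; once n exceeds d, as in the second case, the quadratic term makes self-attention the more expensive of the two.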
Section 6 also includes one of the more underrated experiments in the paper: a transfer experiment to English constituency parsing on the Wall Street Journal corpus. With minimal task-specific tuning, the Transformer outperformed several strong parsers, hinting at the architecture's generality. This experiment took up barely a page in the paper but was an early signal that the model would transfer well across NLP tasks.
Reception at NeurIPS 2017 was respectful but not euphoric. The poster drew steady traffic, and the BLEU numbers were impressive, but most attendees still considered LSTM-based seq2seq the safer bet for the next year of work. Citation count grew steadily through 2018 as researchers ported the architecture to other tasks.
Three events in 2018 changed everything. In June, OpenAI released GPT-1, demonstrating that the decoder of the Transformer, scaled up and pretrained on unlabeled text, was a strong general language model. In October, Google released BERT, demonstrating that the encoder of the Transformer, pretrained with masked language modeling, was state of the art on essentially every NLP benchmark. In November, the original Transformer hit roughly 1,000 citations. By 2019 it was clear that the architecture was generalizable, transferable, and scalable in a way that recurrent models had never been.
From there the citation count compounded. The paper passed 10,000 citations by 2020, 50,000 by mid-2022, 100,000 by early 2024, and more than 173,000 by early 2026. It is one of fewer than ten papers in the history of computer science to exceed 100,000 citations on Google Scholar. Most lists of the most-cited scientific papers of the 21st century rank it in the top ten across all fields.
A few cultural artifacts stand out. The phrase "Attention Is All You Need" became a recurring meme in the machine learning community. Researchers began naming their own papers in the same template: "Convolution Is All You Need," "Diffusion Is All You Need," "Tokenization Is All You Need," and so on. By 2025, more than 700 arXiv preprints had "All You Need" in their titles, with submission counts continuing to grow each year.
The paper's influence extends well beyond NLP. Vision Transformers (ViT, 2020), AlphaFold 2 (2020), DALL-E (2021), Whisper (2022), and a long list of multimodal and scientific models all use the basic encoder or decoder block from the 2017 paper as their computational core. As of 2026, no architecture has decisively replaced the Transformer in any major domain.
The paper's milestones have prompted retrospectives in the technical press.
The paper has gone through seven public versions on arXiv since the first June 2017 submission. The differences between versions are small: most are typo fixes and figure adjustments rather than substantive changes. The latest version (v7), posted on August 2, 2023, was made by Vaswani primarily to fix typographical errors that readers had pointed out over the years.
| Version | Date | Notes |
|---|---|---|
| v1 | June 12, 2017 | Original arXiv submission |
| v2 | June 12, 2017 | Same-day correction |
| v3 | June 30, 2017 | NeurIPS submission revision |
| v4 | December 6, 2017 | Camera-ready version for NeurIPS |
| v5 | December 6, 2017 | Minor fixes |
| v6 | July 24, 2023 | Minor formatting updates |
| v7 | August 2, 2023 | Final typographical corrections |
The NeurIPS proceedings version (December 2017) is functionally equivalent to arXiv v4. Most modern citations point to either the arXiv preprint or the NeurIPS proceedings PDF; the former is more commonly used because it is the freely accessible canonical version.
A small but interesting historical note: the residual connection equation in early versions of the paper has a known typo (the position of the residual addition relative to the layer normalization is described inconsistently across the figure and the text). Several reproductions in 2018 had to look at the Tensor2Tensor reference code to figure out which interpretation was intended. The 2023 v7 update corrected most of these inconsistencies, though by then the field had long since moved on to Pre-LN anyway.
Within a few years of the paper, almost all of the authors had left Google. Most went on to found AI companies. The collective impact of those companies, taken together, is one of the more remarkable second-order effects of the paper itself.
| Author | After Google | Role as of 2026 |
|---|---|---|
| Ashish Vaswani | Co-founded Adept AI (2022), then Essential AI (2023) | Co-founder and CEO of Essential AI |
| Noam Shazeer | Co-founded Character.AI (2021); rejoined Google August 2024 in $2.7B reverse-acquihire deal | Technical lead on Gemini at Google DeepMind |
| Niki Parmar | Co-founded Adept AI, then Essential AI | Co-founder of Essential AI |
| Jakob Uszkoreit | Founded Inceptive in 2020 to apply AI to mRNA medicine design | Co-founder and CEO of Inceptive |
| Llion Jones | Left Google in 2023 | Co-founder and CTO of Sakana AI, Tokyo |
| Aidan N. Gomez | Co-founded Cohere in 2019 with Ivan Zhang and Nick Frosst | Co-founder and CEO of Cohere |
| Lukasz Kaiser | Joined OpenAI in 2021 | Member of technical staff at OpenAI; contributor to o-series reasoning models |
| Illia Polosukhin | Co-founded NEAR.AI in 2017, which became NEAR Protocol | Co-founder of NEAR Protocol |
Shazeer's return to Google in August 2024 was particularly notable. Google paid roughly $2.7 billion in a structured deal that licensed Character.AI's technology and brought Shazeer and a small team of co-workers back into Google DeepMind. The deal drew antitrust scrutiny from the U.S. Department of Justice. Shazeer became one of three technical leads on Gemini, along with Jeff Dean and Oriol Vinyals.
Llion Jones's exit to co-found Sakana AI in Tokyo with David Ha and Ren Ito was framed publicly as a deliberate move away from Transformer-only thinking. Sakana AI's stated mission is to research alternative, nature-inspired architectures, drawing on swarm-intelligence and evolutionary methods. The company raised about $135 million by late 2025, valuing it at $2.65 billion.
Illia Polosukhin's path is the most unusual of the eight. He left Google in 2017 to co-found NEAR.AI, which pivoted to become NEAR Protocol, a sharded smart-contract blockchain. By the early 2020s NEAR had become a meaningful player in crypto and Web3. Polosukhin has since spoken publicly about bringing AI capabilities back into NEAR.
Google holds patent US10,452,978 B2, titled "Attention-based sequence transduction neural networks," filed in May 2018 with Noam Shazeer, Aidan Gomez, Lukasz Kaiser, Niki Parmar, Illia Polosukhin, Jakob Uszkoreit, Llion Jones, and Ashish Vaswani as inventors. The patent covers aspects of multi-head attention applied to sequence transduction. It was granted in October 2019. Google has filed several continuation applications that extend related claims.
In practice, Google has not enforced the patent, and the broader research and industry community has built freely on top of the architecture. There are several reasons this has not produced legal conflict. First, most modern systems use only the encoder (BERT family) or only the decoder (GPT family), and the patent's claims focus on the encoder-decoder transduction setup of the original paper. Second, Google has signaled that the work was published as research and that the patent serves a defensive purpose. Third, enforcing the patent against the entire field would be commercially damaging given Google's own reliance on the broader AI ecosystem.
A U.S. utility patent normally runs for 20 years from its filing date, so a patent filed in May 2018 would not expire until the late 2030s. This state of affairs is therefore unlikely to change soon for legal reasons.
Not every choice in the 2017 paper has aged well. The architecture itself has held up remarkably; the surrounding details have been steadily revised.
Self-attention has O(n^2) cost in sequence length. The 2017 paper trained on sentences of about 25 to 50 tokens, where this was not a problem. For modern context windows of 100K to 1M+ tokens it is the dominant cost of training and inference. A long line of work has tried to bring this cost down without sacrificing the quality that full attention provides: sparse-attention models such as Longformer and BigBird, various linear-attention schemes, and FlashAttention, which keeps attention exact but restructures the computation to reduce memory traffic.
The sinusoidal scheme has largely been superseded. BERT and the GPT series moved to learned positional embeddings; modern LLMs use rotary position embeddings (RoPE) or attention biases (ALiBi). Sinusoidal encodings are now mostly of historical interest.
The paper places layer normalization after the residual addition (Post-LN). This makes very deep Transformers difficult to train without careful warmup schedules. Subsequent work, particularly Xiong et al. (2020), found that placing normalization inside the residual block (Pre-LN) is more stable and trains more reliably. Almost all modern LLMs use Pre-LN, often with RMSNorm instead of standard layer normalization.
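The difference between the two placements is easiest to see side by side. The sketch below works on a single vector, omits the learned gain and bias of layer normalization, and treats `sublayer` as any function from a vector to a vector (attention or feed-forward); all names are illustrative.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]


def post_ln(x, sublayer):
    """2017 paper: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln(x, sublayer):
    """Later convention: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the Pre-LN form the residual path carries `x` through unchanged, so a stack of such blocks always contains an identity route from input to output. In the Post-LN form every block renormalizes the residual stream, which is one intuition for why deep Post-LN stacks are more sensitive to warmup.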
The original feed-forward block uses a ReLU activation. Modern Transformers commonly use SwiGLU or GeGLU instead, following Shazeer's own 2020 paper "GLU Variants Improve Transformer." These activations consistently improve loss across model sizes.
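A minimal sketch of the two feed-forward variants, applied to a single token vector. It assumes weight matrices stored as lists of rows (inputs by outputs) and omits biases in both variants for brevity; the helper names are hypothetical.

```python
import math


def linear(x, weight):
    """Apply a weight matrix (rows = inputs, columns = outputs) to vector x."""
    return [sum(xi * w for xi, w in zip(x, column)) for column in zip(*weight)]


def relu_ffn(x, w1, w2):
    """Original block (biases omitted): max(0, x W1) W2."""
    hidden = [max(0.0, h) for h in linear(x, w1)]
    return linear(hidden, w2)


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def swiglu_ffn(x, w_gate, w_up, w2):
    """SwiGLU block: (Swish(x W_gate) * (x W_up)) W2, element-wise product."""
    gate = [swish(g) for g in linear(x, w_gate)]
    up = linear(x, w_up)
    return linear([g * u for g, u in zip(gate, up)], w2)
```

The gated variant uses three weight matrices instead of two, so implementations typically shrink the hidden width to keep the parameter count comparable.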
The specific decoding choices in the paper, beam size 4 with length penalty 0.6, are particular to the WMT translation task. Modern decoder-only LLMs typically use sampling with temperature and top-p, not beam search. Label smoothing is rarely used in modern pre-training; it is more common in fine-tuning or distillation.
The paper itself is unusually reproducible for its time. Tensor2Tensor was released alongside the paper, and the WMT datasets were already public benchmarks. Several groups reproduced the headline numbers within a few months of publication. Modern reproductions, including those in Hugging Face Transformers and fairseq, generally land within 0.2 to 0.5 BLEU of the paper's reported numbers.
It is hard to overstate how much of modern AI flows from this paper.
Whether the Transformer will remain dominant in another decade is an open question. The authors themselves, at the 2024 GTC reunion, agreed that the field probably needs something better. Mamba and other state-space models, hybrid architectures, and renewed interest in LSTM variants like xLSTM are all candidates. None has yet displaced the Transformer at the frontier.
For now, "Attention Is All You Need" stands as one of the very few research papers whose title has become folklore, whose authors have all become founders or technical leads at the companies defining the field, and whose architecture quietly underpins essentially every major AI system in the world.