"Attention Is All You Need" is a landmark research paper published in June 2017 by eight researchers at Google Brain and Google Research: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin [1]. The paper introduced the transformer architecture, a novel neural network design based entirely on attention mechanisms, dispensing with the recurrence and convolutions that had previously dominated sequence modeling. Presented at the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), the paper has since accumulated over 173,000 citations, making it one of the ten most-cited scientific papers of the 21st century and the most influential publication in the history of artificial intelligence research [2].
The transformer architecture proposed in this paper serves as the foundation for virtually all modern large language models (LLMs), including GPT, BERT, PaLM, LLaMA, and Claude. It also underpins vision transformers, protein structure prediction models like AlphaFold, speech recognition systems, music generation tools, and many other AI applications. The paper's title, a reference to the Beatles song "All You Need Is Love," has itself become one of the most recognizable phrases in the field.
Before the transformer, the dominant architectures for sequence-to-sequence tasks in natural language processing were recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These models processed sequences one token at a time, maintaining a hidden state that carried information from previous time steps. While effective, this sequential nature created two fundamental problems.
First, RNNs were difficult to parallelize during training. Because each time step depended on the output of the previous step, the computation could not be distributed across multiple processors in the time dimension. This made training on long sequences slow and expensive.
Second, RNNs struggled with long-range dependencies. Despite the gating mechanisms in LSTMs and GRUs, information from distant positions in a sequence tended to decay as it passed through many sequential processing steps. Techniques like attention (introduced by Bahdanau et al. in 2014 [3]) had been added on top of RNNs to address this limitation. The Bahdanau attention mechanism allowed decoders to directly attend to any position in the encoder's output, dramatically improving translation quality. However, these hybrid models still relied on recurrent computation for encoding the input sequence itself.
The Google Brain team's key insight was to ask whether attention alone, without any recurrence, could be sufficient for sequence modeling. Their answer was the transformer.
The transformer follows an encoder-decoder structure, where the encoder maps an input sequence to a continuous representation, and the decoder generates an output sequence one token at a time using that representation.
The encoder consists of a stack of N identical layers (N = 6 in the original paper). Each encoder layer has two sub-layers:

1. A multi-head self-attention mechanism, in which every position attends to all positions in the previous encoder layer
2. A position-wise fully connected feed-forward network
Each sub-layer is wrapped with a residual connection and followed by layer normalization. The output of each sub-layer can be expressed as LayerNorm(x + Sublayer(x)).
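The post-norm residual wrapping can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation; the stand-in sublayer function is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_block(x, sublayer):
    """Residual connection followed by normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

# 10 positions, d_model = 512; a trivial stand-in sublayer for illustration
x = np.random.default_rng(0).normal(size=(10, 512))
y = sublayer_block(x, lambda t: 0.1 * t)
assert y.shape == x.shape
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)  # normalized per position
```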
The decoder also consists of a stack of N = 6 identical layers. Each decoder layer has three sub-layers:

1. A masked multi-head self-attention mechanism over the decoder's own output, where masking prevents positions from attending to subsequent positions
2. A multi-head attention mechanism over the output of the encoder stack
3. A position-wise fully connected feed-forward network
As with the encoder, residual connections and layer normalization wrap each sub-layer.
The core computational unit of the transformer is scaled dot-product attention. Given a set of queries (Q), keys (K), and values (V), the attention function computes:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where d_k is the dimension of the keys. The dot product QK^T computes a similarity score between each query and all keys. Dividing by sqrt(d_k) prevents the dot products from growing too large in magnitude as the dimensionality increases. Without this scaling, the softmax function would be pushed into regions with extremely small gradients, slowing or stalling learning. After the scaled softmax produces attention weights (each row summing to 1), these weights determine how much each value contributes to the output.
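The formula above can be sketched directly in NumPy. This is a minimal illustration of scaled dot-product attention, not the paper's implementation; shapes and values are toy examples.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to every key
    # Numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights          # weighted sum of values, plus the weights

# Toy example: 3 query positions, 4 key/value positions, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (3, 8)
assert np.allclose(w.sum(axis=-1), 1.0)  # each row of attention weights sums to 1
```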
The authors chose dot-product attention over additive attention (as used by Bahdanau et al.) because dot-product attention is faster and more space-efficient in practice, benefiting from highly optimized matrix multiplication implementations.
Rather than performing a single attention function with full-dimensional keys, values, and queries, the paper proposes multi-head attention. The model linearly projects the queries, keys, and values h times into lower-dimensional spaces (where h = 8 in the base model), performs scaled dot-product attention on each projection independently, and then concatenates and projects the results:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
With h = 8 heads and d_model = 512, each head operates on d_k = d_v = 64 dimensions. This allows different heads to attend to different aspects of the input: some heads might focus on syntactic relationships, others on semantic similarity, and still others on positional patterns.
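The head-splitting arithmetic can be sketched as follows. This is an illustrative NumPy sketch with randomly initialized projection matrices, not a trained model; the small scaling factor on the weights is an arbitrary choice for the example.

```python
import numpy as np

d_model, h = 512, 8
d_k = d_model // h                       # 64 dimensions per head, as in the base model
rng = np.random.default_rng(0)

x = rng.normal(size=(10, d_model))       # a sequence of 10 token representations
# One projection matrix per head for Q, K, V, plus the output projection W^O
W_q = rng.normal(size=(h, d_model, d_k)) * 0.02
W_k = rng.normal(size=(h, d_model, d_k)) * 0.02
W_v = rng.normal(size=(h, d_model, d_k)) * 0.02
W_o = rng.normal(size=(h * d_k, d_model)) * 0.02

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for i in range(h):                       # each head attends in its own 64-d subspace
    Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)

out = np.concatenate(heads, axis=-1) @ W_o   # Concat(head_1, ..., head_h) W^O
assert out.shape == (10, d_model)
```

Note that because each head works in a d_k = 64 subspace, the total cost is similar to single-head attention at full dimensionality.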
Multi-head attention is used in three different ways within the transformer:

1. **Encoder self-attention.** Queries, keys, and values all come from the output of the previous encoder layer, so each position can attend to all positions in the input sequence.
2. **Decoder self-attention.** Each position in the decoder attends to all positions up to and including itself; a mask blocks attention to subsequent positions, preserving the autoregressive property.
3. **Encoder-decoder attention.** Queries come from the previous decoder layer, while keys and values come from the encoder output, allowing every decoder position to attend over the entire input sequence.
Because the transformer contains no recurrence or convolution, it has no inherent notion of the order of tokens in a sequence. To inject positional information, the authors add positional encodings to the input embeddings. They use sinusoidal functions of varying frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position in the sequence and i is the dimension index. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, ranging from 2 pi to 10000 * 2 pi. The authors chose sinusoidal encodings because they hypothesized it would allow the model to learn relative position relationships, since PE(pos + k) can be represented as a linear function of PE(pos) for any fixed offset k.
The paper notes that learned positional embeddings produced nearly identical results, but sinusoidal encodings were preferred because they might generalize to sequence lengths longer than those seen during training.
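The sinusoidal encoding above can be sketched in NumPy; this is an illustrative implementation of the published formulas, with toy dimensions.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return pe

pe = positional_encoding(50, 512)
assert pe.shape == (50, 512)
assert np.allclose(pe[0, 0::2], 0.0)   # sin(0) = 0 at position 0
assert np.allclose(pe[0, 1::2], 1.0)   # cos(0) = 1 at position 0
```

In practice the resulting matrix is simply added to the token embeddings before the first encoder or decoder layer.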
Each encoder and decoder layer contains a fully connected feed-forward network applied independently to each position. The network consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
The inner dimension of the feed-forward network is d_ff = 2,048 in the base model (four times the model dimension of 512). Although the same function is applied at every position, different layers use different learned parameters. This component can be viewed as applying two 1x1 convolutions with a nonlinearity in between.
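The position-wise feed-forward network is a two-matrix multiply with a ReLU in between; a minimal NumPy sketch (random weights, base-model dimensions, not a trained network):

```python
import numpy as np

d_model, d_ff = 512, 2048                # base-model dimensions from the paper
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(x):
    """FFN(x) = max(0, xW_1 + b_1)W_2 + b_2, applied independently per position."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))       # 10 positions in a sequence
y = ffn(x)
assert y.shape == (10, d_model)          # expand to 2048, project back to 512
```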
The paper defined two model configurations for evaluation.
| Parameter | Transformer Base | Transformer Big |
|---|---|---|
| Encoder layers (N) | 6 | 6 |
| Decoder layers (N) | 6 | 6 |
| Model dimension (d_model) | 512 | 1,024 |
| Feed-forward dimension (d_ff) | 2,048 | 4,096 |
| Attention heads (h) | 8 | 16 |
| Key/Value dimension (d_k = d_v) | 64 | 64 |
| Parameters | ~65M | ~213M |
| Dropout rate (P_drop) | 0.1 | 0.3 |
The models were trained on the standard WMT 2014 English-German and English-French machine translation benchmarks.
For English-German, the training set contained approximately 4.5 million sentence pairs. For English-French, the substantially larger dataset contained approximately 36 million sentence pairs. Sentences were encoded using byte pair encoding, with a shared source-target vocabulary of approximately 37,000 tokens for English-German and separate vocabularies for English-French.
The authors used the Adam optimizer with beta_1 = 0.9, beta_2 = 0.98, and epsilon = 10^-9. They employed a custom learning rate schedule that combined a linear warmup with an inverse square root decay:
lrate = d_model^(-0.5) * min(step_num^(-0.5), step_num * warmup_steps^(-1.5))
This schedule linearly increases the learning rate during the first warmup_steps training steps (4,000 for the base model), then decreases it proportionally to the inverse square root of the step number. The warmup phase prevents instability in the early stages of training when the model's parameters are randomly initialized.
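The schedule is simple to express directly from the formula; this sketch uses the base-model values (d_model = 512, warmup_steps = 4000) from the paper.

```python
def transformer_lrate(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))."""
    step = max(step, 1)                  # the formula is undefined at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Rises linearly during warmup, peaks at warmup_steps, then decays as 1/sqrt(step)
assert transformer_lrate(2000) < transformer_lrate(4000)
assert transformer_lrate(4000) > transformer_lrate(16000)
# On the decay branch, quadrupling the step count halves the learning rate
assert abs(transformer_lrate(16000) - 0.5 * transformer_lrate(4000)) < 1e-9
```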
Three forms of regularization were applied:

1. **Residual dropout**, applied to the output of each sub-layer before it is added to the sub-layer input and normalized
2. **Embedding dropout**, applied to the sums of the token embeddings and positional encodings in both the encoder and decoder stacks
3. **Label smoothing** with epsilon_ls = 0.1, which hurt perplexity but improved accuracy and BLEU scores
The base model was trained for 100,000 steps (approximately 12 hours) on 8 NVIDIA P100 GPUs. The big model was trained for 300,000 steps (approximately 3.5 days) on the same hardware. These training times were dramatically shorter than those reported for competing RNN and CNN architectures.
The transformer achieved state-of-the-art results on both machine translation benchmarks tested.
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| ByteNet (2017) | 23.75 | - | - |
| GNMT + RL (Google, 2016) | 24.6 | 39.92 | 1.4 x 10^20 |
| ConvS2S (Gehring et al., 2017) | 25.16 | 40.46 | 1.5 x 10^20 |
| MoE (Shazeer et al., 2017) | 26.03 | 40.56 | 1.2 x 10^20 |
| Transformer Base | 27.3 | 38.1 | 3.3 x 10^18 |
| Transformer Big | 28.4 | 41.0 | 2.3 x 10^19 |
| Transformer Big (EN-FR, single model) | - | 41.8 | 2.3 x 10^19 |
The Transformer Big achieved 28.4 BLEU on WMT 2014 English-to-German translation, improving over the previous best results (including ensembles) by more than 2 BLEU points. On WMT 2014 English-to-French, the single model achieved 41.0 BLEU (41.8 with checkpoint averaging), establishing a new state-of-the-art at a fraction of the training cost of previous models [1].
Notably, the Transformer Big required approximately 10 to 100 times fewer floating-point operations to train than the competing models it outperformed. The ability to parallelize attention computations across all positions simultaneously, rather than processing them sequentially as in RNNs, was the primary driver of this efficiency.
To demonstrate the transformer's generality beyond translation, the authors also evaluated it on English constituency parsing (predicting the syntactic tree structure of a sentence). The model achieved competitive results, demonstrating that the transformer could generalize to structured prediction tasks outside of its original machine translation domain.
Several aspects of the paper represented significant novelties that contributed to its lasting impact.
The most radical aspect of the paper was completely eliminating recurrent computation. Prior work had used attention as a supplement to RNNs; the transformer showed that attention alone was not only sufficient but superior. This enabled full parallelization across sequence positions during training, drastically reducing wall-clock training time.
While attention between encoder and decoder had been explored before (Bahdanau et al., 2014), the systematic use of self-attention within both the encoder and the decoder was a key contribution. Self-attention allows every position in a sequence to directly attend to every other position in a single computation step, giving the transformer a constant-path-length connection between any two positions (O(1)), compared to O(n) for RNNs and O(log n) for dilated convolutions.
The transformer architecture exhibited unusually favorable scaling properties. Subsequent work demonstrated that transformer performance continues to improve predictably as model size, dataset size, and training compute increase. This property, formalized in the neural scaling laws identified by Kaplan et al. (2020) [4], ultimately motivated the development of increasingly large language models.
All eight authors were listed with an asterisk denoting "equal contribution," with the listing order randomized. Each has gone on to significant roles in the AI industry, and their collective post-paper trajectories underscore the transformer's centrality to the modern AI landscape.
| Author | Affiliation (2017) | Current Role (2025) | Notable Post-Paper Venture |
|---|---|---|---|
| Ashish Vaswani | Google Brain | Co-founder, Essential AI | Co-founded Adept AI (2022), then Essential AI (2023) |
| Noam Shazeer | Google Brain | VP Engineering, Google DeepMind (Gemini co-lead) | Founded Character.AI (2021); returned to Google via $2.7B deal (2024) |
| Niki Parmar | Google Brain | Co-founder, Essential AI | Co-founded Adept AI (2022) with Vaswani, then Essential AI |
| Jakob Uszkoreit | Google Research | Co-founder, Inceptive | Founded Inceptive (2020), using AI for mRNA-based medicine design |
| Llion Jones | Google Research | Co-founder, Sakana AI | Founded Sakana AI (2023) in Tokyo with David Ha |
| Aidan Gomez | Google Brain (intern) | CEO, Cohere | Co-founded Cohere (2019); raised $1.6B+ at $7B+ valuation |
| Lukasz Kaiser | Google Brain | Research Scientist, OpenAI | Joined OpenAI (2021); the only author who did not start a company |
| Illia Polosukhin | Google Research | Co-founder, NEAR Protocol | Co-founded NEAR Protocol (2017), a blockchain/AI platform |
Ashish Vaswani served as the lead researcher on the transformer project at Google Brain. After leaving Google, he co-founded Adept AI in 2022 alongside Niki Parmar, which focused on building AI agents for enterprise software. He subsequently co-founded Essential AI in 2023, which develops enterprise AI infrastructure, reportedly raising over $100 million in funding.
Noam Shazeer was already one of Google's most prolific AI researchers before the transformer paper, with key contributions to mixture of experts and other techniques. After Google declined to launch a chatbot he had developed with Daniel De Freitas, Shazeer left in 2021 to co-found Character.AI, a conversational AI platform. Character.AI grew rapidly, attracting millions of users. In August 2024, Google struck a deal reported at $2.7 billion to license Character.AI's technology and bring Shazeer back to Google DeepMind, where he was appointed to co-lead the Gemini project alongside Jeff Dean and Oriol Vinyals [5].
Aidan Gomez was an undergraduate intern at Google Brain during the development of the transformer. He co-founded Cohere in 2019 alongside Ivan Zhang and Nick Frosst, building enterprise-focused large language models. By 2025, Cohere had raised approximately $1.6 billion at a valuation exceeding $7 billion, with Gomez publicly stating the company was considering an IPO [6].
Illia Polosukhin co-founded NEAR Protocol in 2017, originally as a machine learning platform called NEAR.ai before pivoting to a blockchain platform. As of 2025, Polosukhin continues to lead NEAR with a focus on the intersection of AI and decentralized computation, presenting research at NVIDIA GTC 2025 on confidential, decentralized AI systems.
After over a decade at Google, Llion Jones co-founded Sakana AI in Tokyo in 2023 alongside David Ha (another former Google researcher). Sakana AI focuses on developing novel AI architectures, drawing inspiration from biological systems and evolutionary approaches. Jones has described current AI research as overly focused on scaling existing architectures and has advocated for more fundamental architectural innovation.
Jakob Uszkoreit co-founded Inceptive in 2020, a biotechnology company that applies deep learning to the design of RNA molecules for therapeutic applications. Inceptive uses transformer-based models to predict RNA structure and function, applying the same class of architecture Uszkoreit helped create to biological sequence design.
Lukasz Kaiser joined OpenAI in 2021 and has contributed to the organization's research on reasoning and multimodal models. He remains the only one of the eight authors who did not found a startup, instead continuing as a research scientist.
Niki Parmar co-founded Adept AI in 2022 alongside Ashish Vaswani and subsequently co-founded Essential AI in 2023. Her research interests span attention mechanisms, generative models, and AI systems for enterprise automation.
The influence of "Attention Is All You Need" extends across nearly every subfield of AI and well beyond.
Every major large language model deployed as of 2025 is built on the transformer architecture or a close derivative. GPT-1 through GPT-4 (OpenAI), BERT and T5 (Google), PaLM and Gemini (Google DeepMind), LLaMA (Meta), Claude (Anthropic), Mistral, and DeepSeek all use transformer-based designs. The decoder-only variant (used by GPT) and the encoder-only variant (used by BERT) were both derived directly from the architecture described in this paper.
The transformer has proven to be a remarkably general-purpose architecture, extending far beyond its original machine translation application:

- **Computer vision**, through vision transformers applied to image classification and generation
- **Biology**, including protein structure prediction models such as AlphaFold
- **Speech**, in modern recognition and synthesis systems
- **Music and other creative domains**, in generation tools built on transformer backbones
With over 173,000 citations as of 2025, "Attention Is All You Need" ranks among the top ten most-cited papers of the 21st century across all scientific disciplines. Its citation count continues to grow as the transformer architecture finds new applications.
The paper's impact extends beyond technology into the structure of the AI industry itself. The eight authors collectively founded or co-led companies valued at tens of billions of dollars. The transformer's scalability properties enabled the modern approach to AI development: training increasingly large models on increasingly large datasets, a strategy that underpins the business models of OpenAI, Google DeepMind, Anthropic, Cohere, and many other AI companies.
The transformer as described in the 2017 paper has been refined and extended in numerous ways.
| Variant | Architecture Type | Key Modification | Example Models |
|---|---|---|---|
| Encoder-only | Bidirectional | Removes decoder; bidirectional attention | BERT, RoBERTa |
| Decoder-only | Autoregressive | Removes encoder; causal masking only | GPT, LLaMA, Claude |
| Encoder-decoder | Seq2seq | Original design | T5, BART, mBART |
| Mixture of Experts | Sparse | Routes tokens to specialized sub-networks | Switch Transformer, Mixtral |
| State Space | Hybrid | Replaces or supplements attention with SSM layers | Mamba, Jamba |
Subsequent research introduced numerous improvements to transformer training:

- **Pre-layer normalization**, which applies layer normalization before each sub-layer rather than after, stabilizing the training of very deep stacks
- **Alternative position encodings**, including relative position representations and rotary position embeddings (RoPE)
- **Alternative activations**, such as GELU and SwiGLU, replacing ReLU in the feed-forward network
- **Hardware-aware attention kernels**, such as FlashAttention, which reduce memory traffic without changing the attention computation
The transformer's O(n^2) attention complexity with respect to sequence length remains its primary computational bottleneck. Processing a sequence of length n requires computing attention scores between all n^2 pairs of positions. This quadratic scaling limits the maximum sequence length that can be efficiently processed. Numerous approaches have been proposed to address this, including sparse attention patterns, linear attention approximations, and hybrid architectures combining attention with state space models.
As of early 2026, the transformer remains the dominant architecture in AI. While researchers continue to explore alternatives (state space models like Mamba, hybrid architectures, and novel attention mechanisms), no single architecture has displaced the transformer across the breadth of tasks where it excels.
The paper's ideas continue to evolve. Modern transformers bear the same fundamental relationship to the 2017 design as a modern automobile does to the Ford Model T: the core principles remain, but virtually every component has been refined. Attention patterns, normalization schemes, position encodings, activation functions, and training procedures have all been improved. Yet the basic framework of stacking self-attention and feed-forward layers, processing sequences in parallel, and learning contextual representations through attention remains the backbone of the AI systems that define the current era.
The paper is also notable for its role in accelerating the broader AI boom. The transformer's parallelizability made it possible to efficiently scale models to billions and then trillions of parameters using modern GPU and TPU clusters, enabling the emergent capabilities observed in large language models. In this sense, "Attention Is All You Need" did not just propose a new architecture; it unlocked the scaling trajectory that produced ChatGPT, Gemini, Claude, and the broader wave of AI applications that have reshaped industry and society.