Attention Is All You Need

Updated Jul 11, 2026

"Attention Is All You Need" is the 2017 research paper that introduced the transformer, the neural network architecture that underpins virtually every modern large language model. It was published in June 2017 by eight researchers at Google Brain and Google Research: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin ^[1]. The transformer is a design based entirely on attention mechanisms, dispensing with the recurrence and convolutions that had dominated sequence modeling. As the authors state in the paper's abstract, "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely" ^[1]. Presented at the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), the paper had accumulated more than 173,000 citations as of 2025, ranking seventh on Nature's tabulation of the most-cited scientific papers of the 21st century, and it is widely regarded as one of the most influential publications in the history of artificial intelligence research ^[2].

Language models built on the architecture include GPT, BERT, PaLM, LLaMA, and Claude. The transformer also underpins vision transformers, protein structure prediction models such as AlphaFold, speech recognition systems, music generation tools, and many other AI applications. The paper's title, a reference to the Beatles song "All You Need Is Love," has itself become one of the most recognizable phrases in the field ^[3].

Background

Before the transformer, the dominant architectures for sequence-to-sequence tasks in natural language processing were recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These models processed sequences one token at a time, maintaining a hidden state that carried information from previous time steps. While effective, this sequential nature created two fundamental problems.

First, RNNs were difficult to parallelize during training. Because each time step depended on the output of the previous step, the computation could not be distributed across multiple processors in the time dimension. This made training on long sequences slow and expensive.

Second, RNNs struggled with long-range dependencies. Despite the gating mechanisms in LSTMs and GRUs, information from distant positions in a sequence tended to decay as it passed through many sequential processing steps. The attention mechanism introduced by Bahdanau, Cho, and Bengio in 2014 partially addressed this limitation ^[4]. The Bahdanau attention mechanism allowed decoders in neural machine translation systems to directly attend to any position in the encoder's output, dramatically improving translation quality. However, these hybrid models still relied on recurrent computation for encoding the input sequence itself.

Intellectual precursors

Several pieces of work in 2016 directly preceded the transformer. Cheng, Dong, and Lapata proposed Long Short-Term Memory-Networks (LSTMNs) for machine reading, augmenting LSTMs with an intra-attention (or self-attention) mechanism so that the model could store contextual representations of each input token ^[5]. Parikh, Tackstrom, Das, and Uszkoreit (the same Jakob Uszkoreit who later co-authored "Attention Is All You Need") published "A Decomposable Attention Model for Natural Language Inference," which applied self-attention to feedforward networks and achieved state-of-the-art results on textual entailment while using roughly an order of magnitude fewer parameters than competing LSTM models ^[6]. That paper convinced Uszkoreit that attention without recurrence could be sufficient for language tasks more broadly. His father, the computational linguist Hans Uszkoreit, was reportedly skeptical of the hypothesis ^[7].

The Google Brain team's key question was whether attention alone, without any recurrence, could be sufficient for sequence modeling. Their answer was the transformer.

Origin story at Google

Uszkoreit proposed replacing RNNs with self-attention and started the effort to evaluate this idea at Google. Ashish Vaswani and Illia Polosukhin designed and implemented the first transformer models. Noam Shazeer, already a senior Google engineer with extensive prior work on neural language modeling and mixture of experts, joined the project and proposed scaled dot-product attention, multi-head attention, and the parameter-free positional representation. Niki Parmar designed, implemented, tuned, and evaluated countless model variants in the original codebase and in tensor2tensor, the open-source TensorFlow library Google Brain released in June 2017 ^[8]. Llion Jones contributed to model exploration and the visualizations that appeared in the paper. Lukasz Kaiser and Aidan Gomez (then a 20-year-old undergraduate intern from the University of Toronto) spent considerable time designing and implementing parts of tensor2tensor, which dramatically accelerated the experimental cycle ^[9]^[10].

The name "Transformer" was chosen by Jakob Uszkoreit, who liked the way the word sounded; an early internal design document featured artwork from the Transformers franchise ^[3]. The title "Attention Is All You Need" was a deliberate echo of the Beatles song "All You Need Is Love." It seeded a long-running scientific meme: an analysis of arXiv preprints found 717 papers with "All You Need" in their titles between 2009 and 2025, with exponential growth following the 2017 publication ^[11].

The paper was first posted to arXiv on June 12, 2017 and has been revised several times since; the current version, v7, dates from August 2023 ^[1]. Even before the preprint appeared, Peter Liu and co-authors at Google were applying a decoder-only variant of the architecture to generate fictitious Wikipedia articles, work published in 2018 as "Generating Wikipedia by Summarizing Long Sequences" ^[12].

What is the transformer architecture?

The transformer follows an encoder-decoder structure, where the encoder maps an input sequence to a continuous representation and the decoder generates an output sequence one token at a time conditioned on that representation.

Overall structure

The encoder consists of a stack of $N$ identical layers ( $N = 6$ in the original paper). Each encoder layer has two sub-layers:

A multi-head self-attention mechanism
A position-wise fully connected feed-forward network

Each sub-layer is wrapped with a residual connection and followed by layer normalization. The output of each sub-layer is expressed as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ . To support these residual connections, all sub-layers in the model, and the embedding layers, produce outputs of dimension $d_{\text{model}} = 512$ (base model) or 1,024 (big model) ^[1].

The decoder also consists of a stack of $N = 6$ identical layers. Each decoder layer has three sub-layers:

A masked multi-head self-attention mechanism (preventing positions from attending to subsequent positions)
A multi-head attention mechanism that attends to the encoder's output (encoder-decoder attention, or "cross-attention")
A position-wise feed-forward network

As with the encoder, residual connections and layer normalization wrap each sub-layer. Inputs to both the encoder and decoder are first converted into vectors via a learned embedding layer scaled by $\sqrt{d_{\text{model}}}$ , with positional encodings added to inject sequence order information.
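
The wrapping $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$ described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the helper names `layer_norm` and `post_norm_sublayer` are hypothetical, the learned gain and bias of layer normalization and the dropout step are omitted, and the epsilon value is an assumption.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (row) to zero mean and unit variance.

    The learned gain and bias of layer normalization are omitted for brevity;
    eps is an assumed value that guards against division by zero.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_sublayer(x, sublayer):
    """Apply LayerNorm(x + Sublayer(x)): residual connection, then normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))                    # 10 positions, d_model = 512
w = rng.normal(size=(512, 512)) / np.sqrt(512)    # stand-in for any sub-layer
y = post_norm_sublayer(x, lambda h: h @ w)
print(y.shape)
```

Because the residual sum requires `x` and `sublayer(x)` to have the same shape, every sub-layer has to emit $d_{\text{model}}$-dimensional outputs, which is the constraint the text above describes.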

Scaled dot-product attention

The core computational unit of the transformer is scaled dot-product attention. Given a set of queries ( $Q$ ), keys ( $K$ ), and values ( $V$ ), where $Q$ is an $n \times d_k$ matrix, $K$ is an $m \times d_k$ matrix, and $V$ is an $m \times d_v$ matrix, the attention function computes:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V

where $d_k$ is the dimension of the keys. The dot product $QK^\top$ produces an $n \times m$ matrix of similarity scores between each query and all keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing in magnitude as the dimensionality increases. The intuition is that if the components of a query and a key are independent random variables with mean 0 and variance 1, their $d_k$-dimensional dot product has mean 0 and variance $d_k$; without the scaling, large scores would push the softmax into regions with extremely small gradients, slowing or stalling learning. The softmax turns the scaled scores into attention weights (each row summing to 1), and these weights determine how much each value vector contributes to the output ^[1].
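
The variance figure follows from a short calculation. As a sketch, assume the components $q_i$ and $k_i$ of a query and a key are mutually independent with mean 0 and variance 1 (the zero-mean condition is the paper's assumption). Each product $q_i k_i$ then has mean 0 and variance $\mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1$, and the variances of the independent terms add:

\mathrm{Var}(q \cdot k) = \mathrm{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = d_k

Dividing the score by $\sqrt{d_k}$ therefore rescales its variance to $d_k / (\sqrt{d_k})^2 = 1$, independent of the head width.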

The authors chose dot-product attention over the additive attention used by Bahdanau et al. because dot-product attention is faster and more space-efficient in practice, benefiting from highly optimized matrix multiplication kernels on accelerators. The paper noted that the two formulations have similar theoretical complexity but that dot-product attention is the more practical choice once scaling is included. In experiments without scaling, the authors observed that for large $d_k$ the additive variant outperformed dot-product attention, which is why the $1/\sqrt{d_k}$ factor was essential rather than optional ^[1].
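
The attention formula above translates almost directly into NumPy. The following is a minimal sketch for a single unbatched sequence; the function names and the optional boolean `mask` argument are illustrative choices rather than the paper's implementation, and the large negative constant used for masking is an assumption.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    q: (n, d_k), k: (m, d_k), v: (m, d_v).
    mask: optional boolean (n, m) array; False entries are blocked.
    Returns the (n, d_v) output and the (n, m) attention weights.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)              # (n, m) similarity scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # blocked pairs get ~zero weight
    weights = softmax(scores)                    # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 64))   # n = 3 queries
k = rng.normal(size=(5, 64))   # m = 5 keys
v = rng.normal(size=(5, 64))   # m = 5 values
output, weights = scaled_dot_product_attention(q, k, v)
print(output.shape, weights.sum(axis=-1))
```

The whole computation is two matrix multiplications around a softmax, which is why it maps well onto optimized matrix-multiplication kernels.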

Multi-head attention

Rather than performing a single attention function with full-dimensional keys, values, and queries, the paper proposed multi-head attention. The model linearly projects the queries, keys, and values $h$ times into lower-dimensional spaces (where $h = 8$ in the base model and $h = 16$ in the big model), performs scaled dot-product attention on each projection independently in parallel, and then concatenates and projects the results:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O

where

\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)

With $h = 8$ heads and $d_{\text{model}} = 512$, each head operates on $d_k = d_v = d_{\text{model}}/h = 64$ dimensions. Because the per-head dimensionality is reduced, the total computational cost of multi-head attention is similar to that of single-head attention with full dimensionality. Different heads can attend to different aspects of the input: in the paper's visualizations, some heads appear to track syntactic relationships, while others appear to follow semantic or positional patterns ^[1].

Multi-head attention is used in three different ways within the transformer:

Encoder self-attention: Queries, keys, and values all come from the encoder's previous layer output. Every position in the encoder attends to every other position.
Decoder self-attention: Similar, but with masking to ensure each position can only attend to earlier positions, preserving the autoregressive property.
Encoder-decoder attention: Queries come from the decoder, while keys and values come from the encoder output. This allows the decoder to attend to all positions in the input sequence.
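
The three uses listed above differ only in where the queries, keys, and values come from and in whether a causal mask is applied. The NumPy sketch below illustrates this for a single unbatched sequence. It makes several simplifying assumptions: the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ are stored as column blocks of one $d_{\text{model}} \times d_{\text{model}}$ matrix each, the same random weights are reused for all three calls (a real model learns separate parameters for every attention sub-layer), and the function name `multi_head_attention` is hypothetical.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o, h, causal=False):
    """Multi-head attention for one sequence.

    x_q: (n, d_model) source of queries; x_kv: (m, d_model) source of keys/values.
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices.
    """
    n, d_model = x_q.shape
    m = x_kv.shape[0]
    d_k = d_model // h
    # Project, then split the feature axis into h heads: (h, length, d_k).
    q = (x_q @ w_q).reshape(n, h, d_k).transpose(1, 0, 2)
    k = (x_kv @ w_k).reshape(m, h, d_k).transpose(1, 0, 2)
    v = (x_kv @ w_v).reshape(m, h, d_k).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, m)
    if causal:
        # Position i may attend only to positions j <= i.
        allowed = np.tril(np.ones((n, m), dtype=bool))
        scores = np.where(allowed, scores, -1e9)
    heads = softmax(scores) @ v                          # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o                                  # output projection W^O

rng = np.random.default_rng(0)
d_model, h = 512, 8
# Shared random weights for brevity only; real layers have their own parameters.
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
enc = rng.normal(size=(7, d_model))   # encoder states, 7 source positions
dec = rng.normal(size=(4, d_model))   # decoder states, 4 target positions

enc_self = multi_head_attention(enc, enc, w_q, w_k, w_v, w_o, h)
dec_self = multi_head_attention(dec, dec, w_q, w_k, w_v, w_o, h, causal=True)
cross = multi_head_attention(dec, enc, w_q, w_k, w_v, w_o, h)
print(enc_self.shape, dec_self.shape, cross.shape)
```

With the causal mask in place, changing a later decoder position leaves the outputs at earlier positions unchanged, which is the autoregressive property the mask is meant to preserve.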

Positional encoding

Because the transformer contains no recurrence or convolution, it has no inherent notion of the order of tokens in a sequence. To inject positional information, the authors add positional encodings to the input embeddings. They use sinusoidal functions of varying frequencies:

\mathrm{PE}(\text{pos}, 2i) = \sin\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

\mathrm{PE}(\text{pos}, 2i+1) = \cos\!\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)

where $\text{pos}$ is the position in the sequence and $i$ is the dimension index. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, ranging from $2\pi$ to $10000 \times 2\pi$ . The authors chose sinusoidal encodings because they hypothesized it would allow the model to learn relative position relationships, since $\mathrm{PE}(\text{pos} + k)$ can be represented as a linear function of $\mathrm{PE}(\text{pos})$ for any fixed offset $k$ ^[1].

The paper notes that learned positional embeddings produced nearly identical results, but sinusoidal encodings were preferred because they might generalize to sequence lengths longer than those seen during training.
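
The sinusoidal formulas above, and the relative-position property the authors relied on, can be checked numerically. The sketch below is illustrative: the function name, the sequence length of 50, and the chosen offset and frequency index are arbitrary assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                  # PE(pos, 2i + 1)
    return pe

d_model = 512
pe = sinusoidal_positional_encoding(max_len=50, d_model=d_model)

# Relative-position property: for a fixed offset k, each (sin, cos) pair at
# position pos + k is a fixed rotation of the pair at position pos.
k, i = 5, 3
omega = 1.0 / 10000.0 ** (2 * i / d_model)
rotation = np.array([[np.cos(k * omega), np.sin(k * omega)],
                     [-np.sin(k * omega), np.cos(k * omega)]])
shifted = pe[:-k, 2 * i:2 * i + 2] @ rotation.T
print(np.allclose(shifted, pe[k:, 2 * i:2 * i + 2]))
```

The rotation matrix depends only on the offset $k$ and the frequency, not on the position, which is the sense in which $\mathrm{PE}(\text{pos} + k)$ is a linear function of $\mathrm{PE}(\text{pos})$.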

Position-wise feed-forward networks

Each encoder and decoder layer contains a fully connected feed-forward network applied independently to each position. The network consists of two linear transformations with a ReLU activation in between:

\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2

The inner dimension of the feed-forward network is $d_{\text{ff}} = 2{,}048$ in the base model (four times the model dimension of 512) and $d_{\text{ff}} = 4{,}096$ in the big model. Although the same function is applied at every position, different layers use different learned parameters. The component can be viewed as applying two $1 \times 1$ convolutions with a nonlinearity in between ^[1].
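
A minimal NumPy sketch of the feed-forward sub-layer, using the base-model sizes, shows what "position-wise" means in practice. The random initialization scale of 0.02 and the function name are assumptions for illustration only.

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row of x."""
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
w1 = rng.normal(size=(d_model, d_ff)) * 0.02   # assumed init scale
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))             # 10 positions
y = position_wise_ffn(x, w1, b1, w2, b2)

# Each row is processed independently, so feeding one position alone
# gives the same result as feeding the whole sequence.
y_row3 = position_wise_ffn(x[3:4], w1, b1, w2, b2)
print(y.shape, np.allclose(y[3:4], y_row3))
```

Unlike attention, this sub-layer never mixes information across positions; all cross-position interaction in a layer happens in the attention sub-layers.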

Why self-attention instead of recurrence or convolution?

The paper's "Why Self-Attention" section compares self-attention to recurrent and convolutional layers across three desiderata: total computational complexity per layer, the amount of computation that can be parallelized (measured by the minimum number of sequential operations required), and the maximum path length between any two positions in the network. Self-attention has $O(n^2 \cdot d)$ complexity per layer (where $n$ is the sequence length and $d$ is the representation dimension), $O(1)$ sequential operations (everything can be done in parallel), and an $O(1)$ maximum path length between any two positions. Recurrent layers, by contrast, require $O(n)$ sequential operations and have $O(n)$ path length. Convolutional layers have $O(\log_k(n))$ path length with dilated convolutions. The constant-path-length property is the architectural reason transformers learn long-range dependencies more effectively than RNNs ^[1].

In a worked comparison from the paper, for the typical case where $n$ is smaller than $d$ (sequence length shorter than representation dimension), self-attention is also faster per layer than recurrent layers in raw operation count. When the sequence is much longer than $d$ , the paper notes that "restricted self-attention," in which each position attends only to a neighborhood of size $r$ , can be used to reduce complexity to $O(r \cdot n \cdot d)$ at the cost of increasing maximum path length to $O(n/r)$ . The authors also note a secondary benefit: self-attention models can yield more interpretable representations, since the learned attention weights often reflect syntactic or semantic structure ^[1]. The abstract frames the practical payoff plainly, reporting that the models are "superior in quality while being more parallelizable and requiring significantly less time to train" ^[1].
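
The crossover between self-attention and recurrence can be made concrete by plugging numbers into the leading-order terms. In the sketch below, the self-attention and restricted self-attention terms come from the text above, while the recurrent $O(n \cdot d^2)$ and convolutional $O(k \cdot n \cdot d^2)$ terms are taken from the paper's comparison table and are not stated above. Constant factors are ignored, and the function name and example sizes are assumptions.

```python
def per_layer_ops(n, d, k=3, r=None):
    """Leading-order operation counts per layer (constant factors ignored).

    n: sequence length, d: representation dimension,
    k: convolution kernel width, r: neighborhood size for restricted attention.
    """
    ops = {
        "self_attention": n * n * d,       # O(n^2 * d)
        "recurrent": n * d * d,            # O(n * d^2)
        "convolutional": k * n * d * d,    # O(k * n * d^2)
    }
    if r is not None:
        ops["restricted_self_attention"] = r * n * d   # O(r * n * d)
    return ops

short = per_layer_ops(n=50, d=512)             # sentence-length input, n < d
long = per_layer_ops(n=2048, d=512, r=128)     # long input, n > d

print(short["recurrent"] / short["self_attention"])   # d / n = 10.24
print(long["self_attention"] / long["recurrent"])     # n / d = 4.0
```

The ratio between the two costs is simply $d/n$, so self-attention is cheaper per layer whenever the sequence is shorter than the representation dimension and more expensive once it is longer.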

Model configurations

The paper defined two model configurations for evaluation.

Parameter	Transformer Base	Transformer Big
Encoder layers ( $N$ )	6	6
Decoder layers ( $N$ )	6	6
Model dimension ( $d_{\text{model}}$ )	512	1,024
Feed-forward dimension ( $d_{\text{ff}}$ )	2,048	4,096
Attention heads ( $h$ )	8	16
Key/Value dimension ( $d_k = d_v$ )	64	64
Parameters	~65M	~213M
Dropout rate ( $P_{\text{drop}}$ )	0.1	0.3 (EN-DE) / 0.1 (EN-FR)
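
The parameter counts in the table can be roughly reproduced from the other rows. The estimate below is a back-of-the-envelope sketch under stated assumptions, not the paper's accounting: a single embedding matrix shared between source, target, and the output projection, a vocabulary of 37,000 tokens, attention projections without bias terms, feed-forward layers with biases, and a gain and bias vector for each layer normalization. The function name is hypothetical.

```python
def estimate_transformer_params(d_model, d_ff, n_layers=6, vocab=37_000):
    """Rough parameter count for an encoder-decoder transformer."""
    attention = 4 * d_model * d_model                 # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff + d_ff + d_model         # two linear maps + biases
    layer_norm = 2 * d_model                          # gain and bias
    encoder_layer = attention + ffn + 2 * layer_norm  # 2 sub-layers
    decoder_layer = 2 * attention + ffn + 3 * layer_norm  # 3 sub-layers
    embedding = vocab * d_model                       # assumed fully shared
    return embedding + n_layers * (encoder_layer + decoder_layer)

base = estimate_transformer_params(d_model=512, d_ff=2048)
big = estimate_transformer_params(d_model=1024, d_ff=4096)
print(f"base: {base / 1e6:.1f}M, big: {big / 1e6:.1f}M")

# Per-head width is d_model / h in both configurations.
print(512 // 8, 1024 // 16)
```

Under these assumptions the estimate comes out at about 63 million parameters for the base model and about 214 million for the big model, within a few percent of the reported ~65M and ~213M; the remaining gap depends on details such as vocabulary size and bias terms.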

How was the transformer trained?

The models were trained on the standard WMT 2014 English-German and English-French machine translation benchmarks ^[1].

Dataset

For English-German, the training set contained approximately 4.5 million sentence pairs. For English-French, the substantially larger dataset contained approximately 36 million sentence pairs. Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of approximately 37,000 tokens for English-German and a 32,000-token word-piece vocabulary for English-French. Training batches contained sentence pairs grouped by approximate sequence length, with each batch holding about 25,000 source tokens and 25,000 target tokens ^[1].

Optimizer and learning rate

The authors used the Adam optimizer with $\beta_1 = 0.9$ , $\beta_2 = 0.98$ , and $\epsilon = 10^{-9}$ . They employed a custom learning rate schedule that combined a linear warmup with an inverse square root decay:

\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\!\left(\text{step\_num}^{-0.5},\, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right)

This schedule increases the learning rate linearly during the first warmup_steps training steps (set to 4,000 in the paper), then decreases it in proportion to the inverse square root of the step number. The warmup phase prevents instability in the early stages of training, when the model's parameters are still close to their random initialization ^[1]. The need for warmup was later traced to the post-norm placement of layer normalization in the original transformer; subsequent work showed that the pre-norm variant trains stably without warmup ^[13].
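
The schedule is easy to evaluate directly. The sketch below implements the formula as written, with the base-model values as defaults; the function name and the requirement that steps start at 1 are assumptions.

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """Learning rate with linear warmup followed by inverse-square-root decay."""
    if step_num < 1:
        raise ValueError("step_num is assumed to start at 1")
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

peak = transformer_lrate(4000)         # the two branches meet at warmup_steps
halfway = transformer_lrate(2000)      # linear warmup: half the peak
later = transformer_lrate(16000)       # 4x the warmup steps: half the peak again
print(f"{peak:.3e} {halfway:.3e} {later:.3e}")
```

For the base model the peak rate works out to $(512 \cdot 4000)^{-0.5} \approx 7.0 \times 10^{-4}$, reached exactly at the end of warmup.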

Regularization

Three forms of regularization were applied:

Residual dropout: Dropout with rate $P_{\text{drop}} = 0.1$ (base) was applied to the output of each sub-layer before it was added to the sub-layer input and normalized, and also to the sums of the embeddings and positional encodings. The big model used a rate of 0.3, except for the EN-FR big model, which used 0.1.
Attention dropout: Dropout was also applied to the attention weights themselves.
Label smoothing: During training, the target distribution used a label smoothing value of $\epsilon_{\text{ls}} = 0.1$ , distributing a small amount of probability mass uniformly across non-target tokens. This hurt perplexity (the model became less confident about its predictions) but improved accuracy and BLEU scores.
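
Label smoothing as described above can be sketched as follows. This version gives the correct token probability $1 - \epsilon_{\text{ls}}$ and spreads $\epsilon_{\text{ls}}$ evenly over the other tokens; some implementations instead spread it over the full vocabulary, so treat the exact split as an assumption. The function names and the toy vocabulary of five tokens are illustrative.

```python
import numpy as np

def smooth_labels(target_ids, vocab_size, eps=0.1):
    """Return smoothed target distributions, shape (len(target_ids), vocab_size)."""
    target_ids = np.asarray(target_ids)
    # Every non-target token receives an equal share of eps.
    dist = np.full((len(target_ids), vocab_size), eps / (vocab_size - 1))
    dist[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return dist

def cross_entropy(log_probs, target_dist):
    """Mean cross-entropy between target distributions and model log-probabilities."""
    return float(-(target_dist * log_probs).sum(axis=-1).mean())

targets = smooth_labels([2, 0], vocab_size=5)
print(targets[0])          # 0.9 on token 2, 0.025 on each other token

uniform_log_probs = np.log(np.full((2, 5), 0.2))
print(cross_entropy(uniform_log_probs, targets))
```

Because the smoothed target is never a one-hot vector, the loss cannot reach zero, which is consistent with the observation above that perplexity gets worse even as accuracy and BLEU improve.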

Compute and training time

The base model was trained for 100,000 steps (approximately 12 hours, roughly 0.4 seconds per step) on 8 NVIDIA P100 GPUs in a single machine. The big model was trained for 300,000 steps (approximately 3.5 days, roughly 1.0 second per step) on the same hardware. These training times were dramatically shorter than those reported for competing RNN and convolutional sequence-to-sequence architectures: the big model achieved state-of-the-art results at an estimated training cost of $2.3 \times 10^{19}$ FLOPs, several times less than the costs listed for the GNMT and ConvS2S systems it surpassed ^[1]. Final translation models were obtained by averaging the last 5 checkpoints (base) or 20 checkpoints (big), saved at 10-minute intervals during training.
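
The figures in this section are consistent with one another, as a short calculation shows. The paper estimated training cost by multiplying training time, the number of GPUs, and an assumed sustained single-precision throughput per GPU; the value of 9.5 TFLOPS for the P100 used below is taken from the paper's footnote and is not stated in the text above. Variable and function names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR
P100_SUSTAINED_FLOPS = 9.5e12    # assumed sustained throughput per GPU
NUM_GPUS = 8

def wall_clock_seconds(steps, seconds_per_step):
    """Total training time implied by a step count and a per-step time."""
    return steps * seconds_per_step

def training_flops(train_seconds, num_gpus=NUM_GPUS,
                   flops_per_gpu=P100_SUSTAINED_FLOPS):
    """Estimated cost: training time x number of GPUs x sustained FLOPS per GPU."""
    return train_seconds * num_gpus * flops_per_gpu

base_hours = wall_clock_seconds(100_000, 0.4) / SECONDS_PER_HOUR   # about 11.1
big_days = wall_clock_seconds(300_000, 1.0) / SECONDS_PER_DAY      # about 3.47
base_flops = training_flops(12 * SECONDS_PER_HOUR)                 # about 3.3e18
big_flops = training_flops(3.5 * SECONDS_PER_DAY)                  # about 2.3e19
print(f"{base_hours:.1f} h, {big_days:.2f} days")
print(f"{base_flops:.2e} FLOPs, {big_flops:.2e} FLOPs")
```

These are hardware-time estimates rather than counts of arithmetic actually performed, so they are best read as an upper-bound style comparison between systems trained on different hardware.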

The original implementation lived inside Google's research codebase and was ported into tensor2tensor, the open-source TensorFlow library released by the Google Brain team in June 2017, the same month the paper appeared on arXiv ^[8]. tensor2tensor exposed reference implementations of the transformer along with shared datasets, vocabularies, and tokenization pipelines, which substantially lowered the barrier to reproducing the paper's results. Within months, multiple independent re-implementations appeared in PyTorch and other frameworks. tensor2tensor itself was later deprecated in favor of its successor, Trax, but its initial release was instrumental in spreading the architecture quickly through the research community ^[9].

What results did the transformer achieve?

The transformer achieved state-of-the-art results on both machine translation benchmarks tested. During inference, the authors used beam search with a beam size of 4 and a length penalty $\alpha = 0.6$ .

Machine translation BLEU scores

Model	EN-DE BLEU	EN-FR BLEU	Training cost (FLOPs)
ByteNet (2017)	23.75	-	-
GNMT + RL (Google, 2016)	24.6	39.92	$1.4 \times 10^{20}$
ConvS2S (Gehring et al., 2017)	25.16	40.46	$1.5 \times 10^{20}$
MoE (Shazeer et al., 2017)	26.03	40.56	$1.2 \times 10^{20}$
Transformer Base	27.3	38.1	$3.3 \times 10^{18}$
Transformer Big	28.4	41.0	$2.3 \times 10^{19}$
Transformer Big (EN-FR, single model)	-	41.8	$2.3 \times 10^{19}$

The Transformer Big achieved 28.4 BLEU on WMT 2014 English-to-German translation, improving over the previous best results (including ensembles) by more than 2 BLEU points. On WMT 2014 English-to-French, the paper's results discussion credits the single big model with 41.0 BLEU, while its abstract reports 41.8; either figure was a new single-model state of the art at a fraction of the training cost of previous models ^[1]. The abstract summarizes the headline result directly: "Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU" and "establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature" ^[1].

By the training costs listed in the table above, the Transformer Big needed roughly five to seven times fewer floating-point operations than the single-model baselines it outperformed. The ability to parallelize attention computations across all positions simultaneously, rather than processing them sequentially as in RNNs, was the primary driver of this efficiency ^[1].

English constituency parsing

To demonstrate the transformer's generality beyond translation, the authors evaluated a 4-layer transformer with $d_{\text{model}} = 1024$ on English constituency parsing (predicting the syntactic tree structure of a sentence). Trained on the Wall Street Journal portion of the Penn Treebank in a supervised setting, the model achieved an F1 score of 91.3, comparable to specialized parsing systems despite minimal task-specific tuning. With semi-supervised training on the larger high-confidence and BerkeleyParser corpora, the score rose to 92.7 F1, surpassing all but one of the discriminative parsers compared in the paper. The choice of constituency parsing as a generalization probe followed Vinyals, Kaiser, and colleagues' 2015 "Grammar as a Foreign Language" line of work, which had cast parsing as a sequence-to-sequence problem ^[1]^[40].

Attention visualizations

The appendix of the paper presents several attention pattern visualizations from the trained models. Different heads were observed to specialize: some attended primarily to neighboring tokens, others tracked syntactic relations such as verb-object or preposition-object dependencies, and a few appeared to capture long-range coreference. The visualizations did not constitute a rigorous interpretability claim, but they suggested that multi-head attention learns to decompose linguistic structure across heads. This observation seeded an entire subfield of "BERTology" and transformer probing, with hundreds of subsequent papers analyzing what attention heads learn in larger models ^[1].

What were the key innovations?

Several aspects of the paper represented significant departures from prior work and contributed to its lasting impact.

Replacing recurrence entirely

The most radical aspect of the paper was completely eliminating recurrent computation. Prior work had used attention as a supplement to RNNs; the transformer showed that attention alone was not only sufficient but superior. This enabled full parallelization across sequence positions during training, drastically reducing wall-clock training time ^[1].

Systematic self-attention

While attention between encoder and decoder had been explored before (Bahdanau et al., 2014), the systematic use of self-attention within both the encoder and the decoder was a key contribution. Self-attention allows every position in a sequence to directly attend to every other position in a single computation step, giving the transformer a constant-path-length connection between any two positions ( $O(1)$ ), compared to $O(n)$ for RNNs and $O(\log n)$ for dilated convolutions ^[1].

Scaling properties

The transformer architecture has unusually favorable scaling properties. Subsequent work demonstrated that transformer performance continues to improve predictably as model size, dataset size, and training compute increase. This property, formalized in the neural scaling laws identified by Kaplan et al. (2020) ^[14] and refined in the Chinchilla compute-optimal analysis by Hoffmann et al. (2022) ^[15], ultimately motivated the development of increasingly large language models.

The eight authors and their subsequent paths

All eight authors were listed with an asterisk denoting "equal contribution," with the listing order randomized. Each has gone on to significant roles in the AI industry, and their collective post-paper trajectories underscore the transformer's centrality to the modern AI landscape. As of 2024 reporting, seven of the eight founded or co-founded companies that have collectively raised more than $2 billion in venture capital ^[16].

Author	Affiliation (2017)	Current role (2026)	Notable post-paper venture
Ashish Vaswani	Google Brain	Co-founder and CEO, Essential AI	Co-founded Adept AI (2022), then Essential AI (2023)
Noam Shazeer	Google Brain	VP Engineering and co-lead of Gemini, Google DeepMind	Founded Character.AI (2021); returned to Google via $2.7B licensing deal (2024)
Niki Parmar	Google Brain	Co-founder, Essential AI	Co-founded Adept AI (2022) with Vaswani, then Essential AI
Jakob Uszkoreit	Google Research	Co-founder and CEO, Inceptive	Founded Inceptive (2021), applying generative AI to RNA medicine design
Llion Jones	Google Research	Co-founder and CTO, Sakana AI	Founded Sakana AI (2023) in Tokyo with David Ha and Ren Ito
Aidan Gomez	Google Brain (intern)	Co-founder and CEO, Cohere	Co-founded Cohere (2019); $7B+ valuation as of 2025
Lukasz Kaiser	Google Brain	Research scientist, OpenAI	Joined OpenAI (2021); only author who did not start a company
Illia Polosukhin	Google Research	Co-founder, NEAR Protocol	Co-founded NEAR Protocol (2017), an AI/blockchain platform

Ashish Vaswani

Ashish Vaswani served as a lead researcher on the transformer project at Google Brain and is listed first on the paper. After leaving Google in 2021, he co-founded Adept AI alongside Niki Parmar and David Luan in 2022, working on AI agents for enterprise software. He and Parmar departed Adept in 2022 to pursue a research-led foundation-model thesis and in 2023 co-founded Essential AI, which emerged from stealth in December 2023 with a $56.5 million Series A led by March Capital and including Google, Nvidia, AMD, KB Investment, Franklin Venture Partners, and Thrive Capital ^[17].

Noam Shazeer

Noam Shazeer joined Google in 2000 and was one of Google's most prolific AI researchers before the transformer paper, with key contributions to spelling correction, large-scale language modeling, mixture of experts, and the 2017 "Outrageously Large Neural Networks" paper that introduced sparsely-gated MoE layers ^[18]^[39]. Within the transformer project, Shazeer specifically proposed scaled dot-product attention, multi-head attention, and the parameter-free positional representation that gave the architecture much of its empirical strength ^[1]. After Google declined to launch a chatbot he had developed with Daniel De Freitas, Shazeer left in 2021 and co-founded Character.AI, a conversational AI platform that grew to millions of users. In August 2024, Google struck a deal reported at $2.7 billion to license Character.AI's technology and bring Shazeer back to Google DeepMind, where he was appointed to co-lead the Gemini project alongside Jeff Dean and Oriol Vinyals. The U.S. Department of Justice subsequently opened an inquiry into whether the deal's structure circumvented antitrust review ^[19].

Aidan Gomez

Aidan Gomez was a 20-year-old undergraduate intern at Google Brain during the development of the transformer, on leave from the University of Toronto. He co-founded Cohere in 2019 alongside Ivan Zhang and Nick Frosst, building enterprise-focused large language models. By September 2025, Cohere had raised approximately $1.6 billion in total funding, with a $500 million Series E round in August 2025 followed by a $100 million extension in September 2025 valuing the company at roughly $7 billion. Annualized revenue crossed $100 million in May 2025 and $150 million by October 2025, and Gomez publicly stated Cohere was preparing for an initial public offering ^[20].

Illia Polosukhin

Illia Polosukhin co-founded NEAR Protocol in 2017, originally as a machine learning platform called NEAR.ai before pivoting to a blockchain platform. As of 2025, Polosukhin continues to lead NEAR with a focus on the intersection of AI and decentralized computation. At NVIDIA GTC 2025 he presented NEAR's research on confidential, decentralized AI inference, the second consecutive year he had spoken at NVIDIA's flagship conference ^[21].

Llion Jones

After more than a decade at Google, Llion Jones co-founded Sakana AI in Tokyo in 2023 alongside David Ha (another former Google researcher) and Ren Ito. Sakana AI develops novel AI architectures with inspiration from biological systems and evolutionary approaches, with a particular focus on Japan-optimized models. The company raised a $30 million seed round in January 2024, a $214 million Series A in September 2024 at a $1.5 billion valuation that made it Japan's first AI unicorn, and a $135 million Series B in November 2025 at a $2.65 billion valuation ^[22]^[23].

Jakob Uszkoreit

Jakob Uszkoreit co-founded Inceptive in 2021, a biotechnology company that applies transformer-style deep learning to the design of RNA molecules for therapeutic and vaccine applications. Inceptive raised $100 million in 2023 led by Andreessen Horowitz and NVIDIA, in addition to earlier rounds bringing total funding to roughly $120 million. The company has partnered with at least one major European pharmaceutical company on a new infectious-disease mRNA vaccine ^[24]. Uszkoreit's father, the German computational linguist Hans Uszkoreit, was initially skeptical of the self-attention hypothesis ^[7].

Lukasz Kaiser

Lukasz Kaiser joined OpenAI in 2021 and has contributed to that organization's research on reasoning and multimodal models. He spent considerable time at Google on tensor2tensor and on the design of the original transformer codebase, and remains the only one of the eight authors who did not found a startup, instead continuing as a research scientist ^[16].

Niki Parmar

Niki Parmar co-founded Adept AI in 2022 alongside Ashish Vaswani and subsequently co-founded Essential AI in 2023. Her research interests span attention mechanisms, generative models, and AI systems for enterprise automation. She also co-authored the 2018 Image Transformer paper, an early extension of self-attention to image generation ^[25].

The GTC 2024 reunion

In March 2024, seven of the eight authors gathered for the first time as a group at NVIDIA's GTC conference in San Jose for a panel moderated by NVIDIA CEO Jensen Huang. Niki Parmar was unable to attend. The panel drew a packed ballroom audience of over 2,000 attendees, who heard the researchers reflect on the original paper and on where they believe transformer-style architectures are heading. Huang presented each panelist with a framed cover plate of the NVIDIA DGX-1 AI supercomputer signed "You transformed the world" ^[26].

Several authors used the panel to argue that the field needs better architectures. Aidan Gomez said "I think the world needs something better than the transformer." Jakob Uszkoreit emphasized adaptive computation, arguing that models should spend more compute on hard problems and less on easy ones. Llion Jones described the discovery process as one of subtraction, recalling "we had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better" ^[16]^[27].

How influential is Attention Is All You Need?

The paper received an oral presentation at NeurIPS 2017 in Long Beach, California. Reception in the broader research community was initially modest but accelerated rapidly once OpenAI's GPT (June 2018) and Google's BERT (October 2018) demonstrated that pretraining transformer models on large unlabeled corpora produced powerful general-purpose representations across many NLP tasks ^[28]^[29]. BERT in particular achieved state-of-the-art results on eleven NLP benchmarks at once and was integrated into Google Search in late 2019, putting the transformer architecture into production at internet scale ^[29]. By 2020, "Attention Is All You Need" was the most-cited paper in the deep learning subfield. By 2025, citation counts on Google Scholar exceeded 173,000, placing the paper seventh on Nature's tabulation of the most-cited scientific papers of the 21st century, behind work in deep residual learning, COVID-19 epidemiology, and meta-analysis methodology, and ahead of the original AlphaFold paper ^[2].

Andrej Karpathy, a prominent AI educator and former director of AI at Tesla, has emphasized that "Attention Is All You Need" did not invent attention; the operator itself dates to Bahdanau, Cho, and Bengio's 2014 paper on neural machine translation. In December 2024, Karpathy published correspondence from Dzmitry Bahdanau describing the origin of the term "attention," noting that the 2014 paper receives roughly a thousand times fewer citations than the 2017 transformer paper despite introducing the central mechanism. Karpathy's commentary frames the transformer paper as the moment attention was pushed to its limit (used exclusively, rather than as an add-on to RNNs), rather than as the moment attention itself was invented ^[38].

The paper's cultural footprint is unusually large for a technical machine learning paper. An analysis of arXiv titles found 717 papers containing "All You Need" between 2009 and 2025, with exponential growth following the 2017 publication; "Attention" remains the most frequently claimed necessity among the imitators ^[11].

What is the paper's impact and legacy?

The influence of "Attention Is All You Need" extends across nearly every subfield of AI and into structural biology, drug discovery, robotics, weather forecasting, and beyond.

Foundation of modern language models

Every major large language model deployed as of 2026 is built on the transformer architecture or a close derivative. GPT-1 through GPT-5 (OpenAI), BERT and T5 (Google), PaLM and Gemini (Google DeepMind), LLaMA (Meta), Claude (Anthropic), Mistral, and DeepSeek all use transformer-based designs. The decoder-only variant (used by GPT) and the encoder-only variant (used by BERT) were both derived directly from the architecture described in this paper ^[28]^[29]. The shift from task-specific architectures to a single transformer family that could be pretrained once and fine-tuned on many tasks was made concrete by benchmarks like GLUE (2018), which BERT topped immediately after release ^[41], and by GPT-3's 2020 demonstration that scaling decoder-only transformers to 175 billion parameters enabled "few-shot" in-context learning across many tasks without any task-specific fine-tuning ^[42].

Beyond NLP

The transformer is a remarkably general-purpose architecture, extending far beyond its original machine translation application:

Computer vision: The Vision Transformer (ViT, 2020) and its variants (Swin Transformer, DeiT) brought transformers to image classification, object detection, and segmentation ^[30].
Protein structure: AlphaFold 2 (DeepMind, 2020) uses transformer components in its structure prediction module and drove a revolution in structural biology ^[31].
Image and video generation: Diffusion transformers (DiT) replaced U-Net backbones in generative systems including Stable Diffusion 3 (images) and Sora (video).
Speech and audio: Models like Whisper (OpenAI) and MusicGen (Meta) apply transformers to speech recognition and music generation.
Robotics: Transformer-based policies power robotic control systems with architectures that process multimodal sensory inputs.
Science: Transformers are used for weather forecasting, drug discovery, materials science, and mathematical theorem proving.

One of the most-cited AI papers

With more than 173,000 citations as of 2025, "Attention Is All You Need" ranks among the top ten most-cited papers of the 21st century across all scientific disciplines ^[2]. Its citation count continues to grow as the transformer architecture finds new applications.

Influence on AI industry structure

The paper's impact extends beyond technology into the structure of the AI industry. The eight authors collectively founded or co-led companies valued at tens of billions of dollars. The transformer's scalability enabled the modern approach to AI development: training increasingly large models on increasingly large datasets, a strategy that underpins the business models of OpenAI, Google DeepMind, Anthropic, Cohere, and many other AI companies.

The post-Google diaspora is unusual in the history of computer science. As of 2026, the companies founded by transformer co-authors include Cohere ($7B+ valuation), Character.AI (the subject of a $2.7B licensing deal with Google), Essential AI, Adept AI, Sakana AI ($2.65B valuation), Inceptive, and NEAR Protocol. Several operate in distinct parts of the AI ecosystem: Cohere on enterprise LLM APIs, Essential AI on foundation model research, Sakana AI on nature-inspired models for the Japanese market, Inceptive on RNA therapeutic design, and NEAR Protocol on decentralized AI infrastructure. The transformer paper thus functioned both as a technical breakthrough and as the founding document of a venture portfolio that has helped define the post-2020 AI startup landscape ^[16].

Subsequent developments

The transformer as described in the 2017 paper has been refined and extended in numerous ways since publication.

What are the main transformer variants?

Variant	Architecture type	Key modification	Example models
Encoder-only	Bidirectional	Removes decoder; bidirectional attention	BERT, RoBERTa
Decoder-only	Autoregressive	Removes encoder; causal masking only	GPT, LLaMA, Claude
Encoder-decoder	Seq2seq	Original design	T5, BART, mBART
Mixture of experts	Sparse	Routes tokens to specialized sub-networks	Switch Transformer, Mixtral
State space	SSM or hybrid	Replaces or supplements attention with SSM layers	Mamba, Jamba
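
The difference between the bidirectional attention of encoder-only models and the causal masking of decoder-only models in the table above comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python; the helper names `bidirectional_mask`, `causal_mask`, and `show` are illustrative and do not come from the paper or any library.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 1 (allowed) and 0 (blocked)."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)


if __name__ == "__main__":
    print("bidirectional:")
    print(show(bidirectional_mask(4)))
    print("causal:")
    print(show(causal_mask(4)))
```

In a real model the blocked entries are typically set to a large negative value before the softmax so that they receive zero attention weight; the boolean form here only shows the pattern.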

Training improvements

Subsequent research introduced many improvements to transformer training:

Pre-LayerNorm: Moving layer normalization before (rather than after) each sub-layer improved training stability and removed the need for an explicit warmup schedule. Xiong et al. (2020) analyzed the gradient dynamics that explain this difference ^[13]. Pre-norm became the dominant choice in subsequent large language models.
Rotary Position Embeddings (RoPE): Replacing sinusoidal with rotary encodings improved relative-position modeling and enabled better length generalization, introduced by Su et al. in RoFormer ^[32].
FlashAttention: Hardware-aware implementations of attention that reduce memory usage from $O(n^2)$ to $O(n)$ and dramatically speed up training and inference, introduced by Dao et al. ^[33].
Grouped-Query Attention (GQA): Reducing the number of key-value heads relative to query heads decreases memory bandwidth requirements during inference while preserving most of the quality, introduced by Ainslie et al. ^[34].
KV-cache optimization: Techniques for efficiently caching key and value representations during autoregressive generation, reducing redundant computation across decoding steps.
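
To make the relative-position property of rotary embeddings from the list above concrete, the sketch below rotates consecutive pairs of a vector by position-dependent angles and checks that the query-key dot product depends only on the offset between the two positions. It is a simplified illustration, assuming the interleaved-pair convention and a base of 10000; the function names `rope` and `dot` are hypothetical, and production implementations differ in layout and batching.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that scale with `position`."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("rotary embeddings need an even dimension")
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), so later pairs rotate more slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.0]
    # Both pairs of positions are 2 apart, so the scores should agree.
    near = dot(rope(q, 3), rope(k, 1))
    far = dot(rope(q, 10), rope(k, 8))
    print(abs(near - far) < 1e-9)
```

Because each pair is only rotated, vector norms are unchanged, and shifting both positions by the same amount leaves the attention score the same up to floating-point error.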

Efficiency concerns and alternatives

The transformer's $O(n^2)$ attention complexity with respect to sequence length remains its primary computational bottleneck. Processing a sequence of length $n$ requires computing attention scores between all $n^2$ pairs of positions. At inference time, the KV cache that stores keys and values for previously processed tokens adds a memory cost that grows linearly in $n$. Numerous approaches have been proposed to address this, including sparse attention patterns, linear attention approximations, and hybrid architectures combining attention with state space models such as Mamba ^[35].
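
A rough back-of-the-envelope calculation shows how differently the two costs scale. The sketch below counts attention scores and cached key/value numbers for a single stack of self-attention layers; the defaults (8 heads, 6 layers, model dimension 512) follow the base model, while the function names and the choice to ignore cross-attention and batch size are simplifying assumptions.

```python
def attention_score_count(n, num_heads=8, num_layers=6):
    """Query-key scores computed in one full self-attention pass over n tokens."""
    return n * n * num_heads * num_layers


def kv_cache_entries(n, d_model=512, num_layers=6):
    """Numbers stored when caching one key and one value vector per token per layer."""
    return 2 * n * d_model * num_layers


if __name__ == "__main__":
    for n in (512, 1024, 2048):
        print(n, attention_score_count(n), kv_cache_entries(n))
```

Doubling the sequence length quadruples the number of attention scores but only doubles the cache, which is why the quadratic term dominates for long sequences.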

Documented limitations

Beyond compute cost, the transformer has documented theoretical and empirical limitations. Hahn (2020) proved that self-attention models with a fixed number of layers and heads cannot robustly recognize simple formal languages such as parity or balanced parentheses as inputs grow long, and later work has reported related weaknesses in function composition ^[36]. Empirical work on hallucination demonstrates that transformer language models can produce confident outputs incompatible with their training data or input. Critics including Yann LeCun and Richard Sutton have argued that transformer-based LLMs are approaching the limits of what scaling alone can achieve, and that fundamental architectural innovation will be needed to support robust reasoning and planning ^[37].

Worked example dimensions

For concreteness, a single forward pass through the base transformer on a 50-token English-to-German translation example would proceed as follows. The 50-token input is embedded into a $50 \times 512$ matrix, positional encodings are added, and the result enters the first encoder layer. Multi-head self-attention projects this matrix into 8 sets of queries, keys, and values, each of shape $50 \times 64$; computes 8 attention matrices, each of shape $50 \times 50$; multiplies each by its $50 \times 64$ value matrix to produce 8 head outputs of shape $50 \times 64$; concatenates them into a $50 \times 512$ matrix; applies an output projection; and adds the residual, followed by layer normalization. The position-wise feed-forward sub-layer then expands the $50 \times 512$ matrix to $50 \times 2048$, applies ReLU, projects back to $50 \times 512$, and again adds the residual and normalizes. After six encoder layers, the $50 \times 512$ representation flows into the decoder, which generates the German output one token at a time using masked self-attention over already-emitted tokens and cross-attention against the encoder output. The base model has approximately 65 million parameters spread across token embeddings, six encoder layers, six decoder layers, and the output projection ^[1].
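
The shapes in this walkthrough, and a rough version of the parameter total, can be reproduced with a few lines of arithmetic. The sketch below is an estimate under stated assumptions rather than the authors' accounting: it assumes a single embedding matrix shared by the encoder input, decoder input, and output projection, a shared vocabulary of about 37,000 tokens as in the paper's English-German setup, attention projections without bias terms, and feed-forward layers with biases. The function names are illustrative.

```python
D_MODEL, N_HEADS, D_FF, N_LAYERS = 512, 8, 2048, 6
D_HEAD = D_MODEL // N_HEADS  # 64 dimensions per head


def encoder_layer_shapes(n_tokens):
    """List (step, shape) pairs for one encoder layer applied to n_tokens inputs."""
    return [
        ("input", (n_tokens, D_MODEL)),
        ("per-head Q, K, V", (N_HEADS, n_tokens, D_HEAD)),
        ("attention weights", (N_HEADS, n_tokens, n_tokens)),
        ("per-head output", (N_HEADS, n_tokens, D_HEAD)),
        ("concatenated heads", (n_tokens, N_HEADS * D_HEAD)),
        ("output projection + residual", (n_tokens, D_MODEL)),
        ("feed-forward hidden (ReLU)", (n_tokens, D_FF)),
        ("layer output", (n_tokens, D_MODEL)),
    ]


def estimate_parameters(vocab_size=37000):
    """Rough parameter count for the base model under the assumptions above."""
    attention = 4 * D_MODEL * D_MODEL                    # W_Q, W_K, W_V, W_O
    feed_forward = 2 * D_MODEL * D_FF + D_FF + D_MODEL   # two matrices plus biases
    layer_norm = 2 * D_MODEL                             # gain and bias vectors
    encoder_layer = attention + feed_forward + 2 * layer_norm
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm
    embeddings = vocab_size * D_MODEL                    # assumed shared matrix
    return embeddings + N_LAYERS * (encoder_layer + decoder_layer)


if __name__ == "__main__":
    for step, shape in encoder_layer_shapes(50):
        print(f"{step:32s} {shape}")
    print(f"estimated parameters: {estimate_parameters():,}")
```

Under these assumptions the estimate comes to about 63 million, in the same range as the roughly 65 million reported for the base model; the gap depends on details such as the exact vocabulary size and which bias terms are included.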

Is the transformer still relevant in 2025-2026?

As of early 2026, the transformer remains the dominant architecture in AI. Researchers continue to explore alternatives, including state space models like Mamba, hybrid models like Jamba (state space and attention combined), and novel attention variants. None has displaced the transformer across the breadth of tasks where it excels ^[35].

The paper's ideas continue to evolve. Modern transformers bear the same fundamental relationship to the 2017 design as a modern automobile does to the Ford Model T: the core principles remain, but virtually every component has been refined. Attention patterns, normalization schemes, position encodings, activation functions (with SwiGLU and similar gated variants replacing ReLU in most large models), and training procedures have all been improved. The basic framework of stacking self-attention and feed-forward layers, processing sequences in parallel, and learning contextual representations through attention remains the backbone of the AI systems that define the current era.

The paper is also notable for its role in accelerating the broader AI boom. The transformer's parallelizability made it possible to efficiently scale models to billions and then trillions of parameters on modern GPU and TPU clusters, enabling the emergent capabilities observed in large language models. "Attention Is All You Need" did not just propose a new architecture; it unlocked the scaling trajectory that produced ChatGPT, Gemini, Claude, and the broader wave of AI applications that have reshaped industry and society.

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I., "Attention Is All You Need," NeurIPS 2017 / arXiv:1706.03762, 2017-06-12 (last revised 2023-08-02). https://arxiv.org/abs/1706.03762. Accessed 2026-05-24.
Pearson, H., et al., "Exclusive: the most-cited papers of the twenty-first century," Nature, 2025-04. https://www.nature.com/articles/d41586-025-01125-9. Accessed 2026-05-24.
Wikipedia, "Attention Is All You Need." https://en.wikipedia.org/wiki/Attention_Is_All_You_Need. Accessed 2026-05-24.
Bahdanau, D., Cho, K., Bengio, Y., "Neural Machine Translation by Jointly Learning to Align and Translate," ICLR 2015 / arXiv:1409.0473, 2014-09-01. https://arxiv.org/abs/1409.0473. Accessed 2026-05-24.
Cheng, J., Dong, L., Lapata, M., "Long Short-Term Memory-Networks for Machine Reading," EMNLP 2016 / arXiv:1601.06733, 2016-01-25. https://arxiv.org/abs/1601.06733. Accessed 2026-05-24.
Parikh, A., Tackstrom, O., Das, D., Uszkoreit, J., "A Decomposable Attention Model for Natural Language Inference," EMNLP 2016 / arXiv:1606.01933, 2016-06-06. https://arxiv.org/abs/1606.01933. Accessed 2026-05-24.
Olshansky, J., "A Father of Modern AI Wants to Reinvent Biology," Crazy Stupid Tech, 2025-01-12. https://crazystupidtech.com/2025/01/12/a-father-of-modern-ai-wants-to-reinvent-biology/. Accessed 2026-05-24.
Vaswani, A., et al., "Accelerating Deep Learning Research with the Tensor2Tensor Library," Google Research Blog, 2017-06-19. https://research.google/blog/accelerating-deep-learning-research-with-the-tensor2tensor-library/. Accessed 2026-05-24.
tensorflow/tensor2tensor GitHub repository. https://github.com/tensorflow/tensor2tensor. Accessed 2026-05-24.
Wikipedia, "Aidan Gomez." https://en.wikipedia.org/wiki/Aidan_Gomez. Accessed 2026-05-24.
Nizar, B., et al., "All You Need Is Not All You Need for a Paper Title: On the Origins of a Scientific Meme," arXiv:2512.19700, 2025. https://arxiv.org/pdf/2512.19700. Accessed 2026-05-24.
Liu, P.J., et al., "Generating Wikipedia by Summarizing Long Sequences," ICLR 2018 / arXiv:1801.10198, 2018-01-30. https://arxiv.org/abs/1801.10198. Accessed 2026-05-24.
Xiong, R., Yang, Y., He, D., et al., "On Layer Normalization in the Transformer Architecture," ICML 2020 / arXiv:2002.04745, 2020-02-12. https://arxiv.org/abs/2002.04745. Accessed 2026-05-24.
Kaplan, J., McCandlish, S., Henighan, T., et al., "Scaling Laws for Neural Language Models," arXiv:2001.08361, 2020-01-23. https://arxiv.org/abs/2001.08361. Accessed 2026-05-24.
Hoffmann, J., Borgeaud, S., Mensch, A., et al., "Training Compute-Optimal Large Language Models," arXiv:2203.15556, 2022-03-29. https://arxiv.org/abs/2203.15556. Accessed 2026-05-24.
"All You Need Is Attention: 7 Years Later," indieresearch.net, 2024-02-18. https://indieresearch.net/2024/02/18/is-attention-all-you-need-to-become-an-entrepreneur/. Accessed 2026-05-24.
"Essential AI emerges from stealth with $56.5M Series A," Essential AI press release / LinkedIn, 2023-12. https://www.linkedin.com/company/essentialai. Accessed 2026-05-24.
Shazeer, N., Mirhoseini, A., Maziarz, K., et al., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer," ICLR 2017 / arXiv:1701.06538, 2017-01-23. https://arxiv.org/abs/1701.06538. Accessed 2026-05-24.
"Google's $2.7B AI deal with Noam Shazeer's Character.AI draws DOJ attention," Calcalist Tech, 2024-12. https://www.calcalistech.com/ctechnews/article/sy06wllflg. Accessed 2026-05-24.
Wiggers, K., "Cohere hits $7B valuation a month after its last raise, partners with AMD," TechCrunch, 2025-09-24. https://techcrunch.com/2025/09/24/cohere-hits-7b-valuation-a-month-after-its-last-raise-partners-with-amd/. Accessed 2026-05-24.
"NEAR Co-Founder Illia Polosukhin Presents New AI Research at NVIDIA GTC 2025," NEAR Protocol blog, 2025-03. https://pages.near.org/blog/nvidia-gtc-2025/. Accessed 2026-05-24.
Sakana AI, "We raised $30M to develop nature-inspired AI in Japan," 2024-01-16. https://sakana.ai/seed-round/. Accessed 2026-05-24.
Wiggers, K., "Sakana AI raises $135M Series B at a $2.65B valuation," TechCrunch, 2025-11-17. https://techcrunch.com/2025/11/17/sakana-ai-raises-135m-series-b-at-a-2-65b-valuation-to-continue-building-ai-models-for-japan/. Accessed 2026-05-24.
Capoot, A., "After leaving Google, Jakob Uszkoreit started Inceptive to apply AI to drug development," CNBC, 2024-07-12. https://www.cnbc.com/2024/07/12/inceptive-ceo-jakob-uszkoreit-says-ai-will-transform-pharmaceuticals.html. Accessed 2026-05-24.
Parmar, N., Vaswani, A., Uszkoreit, J., et al., "Image Transformer," ICML 2018 / arXiv:1802.05751, 2018-02-15. https://arxiv.org/abs/1802.05751. Accessed 2026-05-24.
"'You Transformed the World,' NVIDIA CEO Tells Researchers Behind Landmark AI Paper," NVIDIA Blog, 2024-03-20. https://blogs.nvidia.com/blog/gtc-2024-transformer-ai-research-panel-jensen/. Accessed 2026-05-24.
Goldman, S., "'Attention is All You Need' creators look beyond Transformers for AI at Nvidia GTC," VentureBeat, 2024-03-20. https://venturebeat.com/ai/attention-is-all-you-need-creators-look-beyond-transformers-at-nvidia-gtc-the-world-needs-something-better/. Accessed 2026-05-24.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., "Improving Language Understanding by Generative Pre-Training," OpenAI technical report, 2018-06. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf. Accessed 2026-05-24.
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL 2019 / arXiv:1810.04805, 2018-10-11. https://arxiv.org/abs/1810.04805. Accessed 2026-05-24.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR 2021 / arXiv:2010.11929, 2020-10-22. https://arxiv.org/abs/2010.11929. Accessed 2026-05-24.
Jumper, J., Evans, R., Pritzel, A., et al., "Highly accurate protein structure prediction with AlphaFold," Nature, 2021-07-15. https://www.nature.com/articles/s41586-021-03819-2. Accessed 2026-05-24.
Su, J., Lu, Y., Pan, S., et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding," arXiv:2104.09864, 2021-04-20. https://arxiv.org/abs/2104.09864. Accessed 2026-05-24.
Dao, T., Fu, D.Y., Ermon, S., Rudra, A., Re, C., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," NeurIPS 2022 / arXiv:2205.14135, 2022-05-27. https://arxiv.org/abs/2205.14135. Accessed 2026-05-24.
Ainslie, J., Lee-Thorp, J., de Jong, M., et al., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," EMNLP 2023 / arXiv:2305.13245, 2023-05-22. https://arxiv.org/abs/2305.13245. Accessed 2026-05-24.
Gu, A., Dao, T., "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, 2023-12-01. https://arxiv.org/abs/2312.00752. Accessed 2026-05-24.
Hahn, M., "Theoretical Limitations of Self-Attention in Neural Sequence Models," TACL 2020 / arXiv:1906.06755, 2019-06-16. https://arxiv.org/abs/1906.06755. Accessed 2026-05-24.
Roose, K., "AI 'Godfather' Yann LeCun: LLMs Are Nearing the End, but Better AI Is Coming," Newsweek, 2025. https://www.newsweek.com/nw-ai/ai-impact-interview-yann-lecun-llm-limitations-analysis-2054255. Accessed 2026-05-24.
Karpathy, A., "The (true) story of development and inspiration behind the 'attention' operator," X (Twitter), 2024-12-03. https://x.com/karpathy/status/1864023344435380613. Accessed 2026-05-24.
Wikipedia, "Noam Shazeer." https://en.wikipedia.org/wiki/Noam_Shazeer. Accessed 2026-05-24.
Vinyals, O., Kaiser, L., Koo, T., et al., "Grammar as a Foreign Language," NeurIPS 2015 / arXiv:1412.7449, 2014-12-23. https://arxiv.org/abs/1412.7449. Accessed 2026-05-24.
Wang, A., Singh, A., Michael, J., et al., "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," ICLR 2019 / arXiv:1804.07461, 2018-04-20. https://arxiv.org/abs/1804.07461. Accessed 2026-05-24.
Brown, T.B., Mann, B., Ryder, N., et al., "Language Models are Few-Shot Learners," NeurIPS 2020 / arXiv:2005.14165, 2020-05-28. https://arxiv.org/abs/2005.14165. Accessed 2026-05-24.
