Attention Is All You Need
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 7,133 words
"Attention Is All You Need" is a landmark research paper published in June 2017 by eight researchers at Google Brain and Google Research: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin [1]. The paper introduced the transformer architecture, a neural network design based entirely on attention mechanisms, dispensing with the recurrence and convolutions that had dominated sequence modeling. Presented at the 31st Conference on Neural Information Processing Systems (NeurIPS 2017), the paper has accumulated more than 173,000 citations as of 2025, ranking seventh on Nature's tabulation of the most-cited scientific papers of the 21st century and the most influential publication in the history of artificial intelligence research [2].
The transformer architecture proposed in this paper is the foundation for virtually all modern large language models (LLMs), including GPT, BERT, PaLM, LLaMA, and Claude. It also underpins vision transformers, protein structure prediction models like AlphaFold, speech recognition systems, music generation tools, and many other AI applications. The paper's title, a reference to the Beatles song "All You Need Is Love," has itself become one of the most recognizable phrases in the field [3].
Before the transformer, the dominant architectures for sequence-to-sequence tasks in natural language processing were recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These models processed sequences one token at a time, maintaining a hidden state that carried information from previous time steps. While effective, this sequential nature created two fundamental problems.
First, RNNs were difficult to parallelize during training. Because each time step depended on the output of the previous step, the computation could not be distributed across multiple processors in the time dimension. This made training on long sequences slow and expensive.
Second, RNNs struggled with long-range dependencies. Despite the gating mechanisms in LSTMs and GRUs, information from distant positions in a sequence tended to decay as it passed through many sequential processing steps. The attention mechanism introduced by Bahdanau, Cho, and Bengio in 2014 partially addressed this limitation [4]. The Bahdanau attention mechanism allowed decoders in neural machine translation systems to directly attend to any position in the encoder's output, dramatically improving translation quality. However, these hybrid models still relied on recurrent computation for encoding the input sequence itself.
Several pieces of work in 2016 directly preceded the transformer. Cheng, Dong, and Lapata proposed Long Short-Term Memory-Networks (LSTMNs) for machine reading, augmenting LSTMs with an intra-attention (or self-attention) mechanism so that the model could store contextual representations of each input token [5]. Parikh, Tackstrom, Das, and Uszkoreit (the same Jakob Uszkoreit who later co-authored "Attention Is All You Need") published "A Decomposable Attention Model for Natural Language Inference," which applied self-attention to feedforward networks and achieved state-of-the-art results on textual entailment while using roughly an order of magnitude fewer parameters than competing LSTM models [6]. That paper convinced Uszkoreit that attention without recurrence could be sufficient for language tasks more broadly. His father, the computational linguist Hans Uszkoreit, was reportedly skeptical of the hypothesis [7].
The Google Brain team's key question was whether attention alone, without any recurrence, could be sufficient for sequence modeling. Their answer was the transformer.
Uszkoreit proposed replacing RNNs with self-attention and started the effort to evaluate this idea at Google. Ashish Vaswani and Illia Polosukhin designed and implemented the first transformer models. Noam Shazeer, already a senior Google engineer with extensive prior work on neural language modeling and mixture of experts, joined the project and proposed scaled dot-product attention, multi-head attention, and the parameter-free positional representation. Niki Parmar designed, implemented, tuned, and evaluated countless model variants in the original codebase and in tensor2tensor, the open-source TensorFlow library Google Brain released in June 2017 [8]. Llion Jones contributed to model exploration and the visualizations that appeared in the paper. Lukasz Kaiser and Aidan Gomez (then a 20-year-old undergraduate intern from the University of Toronto) spent considerable time designing and implementing parts of tensor2tensor, which dramatically accelerated the experimental cycle [9][10].
The name "Transformer" was chosen by Jakob Uszkoreit, who liked the way the word sounded; an early internal design document featured artwork from the Transformers franchise [3]. The title "Attention Is All You Need" was a deliberate echo of the Beatles song "All You Need Is Love." It seeded a long-running scientific meme: an analysis of arXiv preprints found 717 papers with "All You Need" in their titles between 2009 and 2025, with exponential growth following the 2017 publication [11].
The paper was first posted to arXiv on June 12, 2017 and has since been revised six times, with the latest version (v7) posted in August 2023 [1]. Even before the preprint appeared, Peter Liu and co-authors at Google were applying a decoder-only variant of the architecture to generate fictitious Wikipedia articles, work published in 2018 as "Generating Wikipedia by Summarizing Long Sequences" [12].
The transformer follows an encoder-decoder structure, where the encoder maps an input sequence to a continuous representation and the decoder generates an output sequence one token at a time conditioned on that representation.
The encoder consists of a stack of N identical layers (N = 6 in the original paper). Each encoder layer has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
Each sub-layer is wrapped with a residual connection and followed by layer normalization. The output of each sub-layer is expressed as LayerNorm(x + Sublayer(x)). To support these residual connections, all sub-layers in the model, and the embedding layers, produce outputs of dimension d_model = 512 (base model) or 1,024 (big model) [1].
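The wrapping LayerNorm(x + Sublayer(x)) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: the learned gain and bias of layer normalization are omitted, and the random matrix `W` is a hypothetical stand-in for a real sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's feature vector to zero mean and unit variance.
    The learned gain and bias of full layer normalization are omitted (assumption)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_sublayer(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)): residual connection, then normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d_model = 512
x = rng.normal(size=(10, d_model))               # 10 positions, d_model features each
W = rng.normal(size=(d_model, d_model)) * 0.02   # hypothetical stand-in sub-layer weights
out = post_norm_sublayer(x, lambda h: h @ W)
print(out.shape)
```

Because the residual sum requires `x` and `sublayer(x)` to have the same shape, every sub-layer has to emit d_model features, which is the constraint the paragraph above describes.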
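The decoder also consists of a stack of N = 6 identical layers. Each decoder layer has three sub-layers: a masked multi-head self-attention mechanism, which prevents each position from attending to later positions; a multi-head attention mechanism over the output of the encoder stack; and a position-wise fully connected feed-forward network.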
As with the encoder, residual connections and layer normalization wrap each sub-layer. Inputs to both the encoder and decoder are first converted into vectors via a learned embedding layer scaled by sqrt(d_model), with positional encodings added to inject sequence order information.
The core computational unit of the transformer is scaled dot-product attention. Given a set of queries (Q), keys (K), and values (V), where Q is an n×d_k matrix, K is an m×d_k matrix, and V is an m×d_v matrix, the attention function computes:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
where d_k is the dimension of the keys. The dot product QK^T produces an n×m matrix of similarity scores between each query and all keys. Dividing by sqrt(d_k) prevents the dot products from growing too large in magnitude as the dimensionality increases. The intuition is that if the components of a d_k-dimensional query and key are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance d_k; without the scaling, the softmax function would be pushed into regions with extremely small gradients, slowing or stalling learning. After the scaled softmax produces attention weights (each row summing to 1), these weights determine how much each value vector contributes to the output [1].
The authors chose dot-product attention over the additive attention used by Bahdanau et al. because dot-product attention is faster and more space-efficient in practice, benefiting from highly optimized matrix multiplication kernels on accelerators. The paper noted that the two formulations have similar theoretical complexity but that dot-product attention is the more practical choice once scaling is included. It also reported, citing earlier empirical work, that without scaling the additive variant outperforms dot-product attention for large d_k, which is why the 1/sqrt(d_k) factor was treated as essential rather than optional [1].
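The attention formula above translates almost directly into NumPy. The following is a minimal sketch for a single unbatched set of queries, keys, and values; the function names and the random inputs are illustrative assumptions, and masking is left out. The last few lines check the variance argument empirically.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v). Returns the output and the weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n, m) similarity scores
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V, weights            # output has shape (n, d_v)

rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 64, 64
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))
output, weights = scaled_dot_product_attention(Q, K, V)

# Empirical check of the variance argument, with components drawn i.i.d. from N(0, 1).
q = rng.normal(size=(20000, d_k))
k = rng.normal(size=(20000, d_k))
raw = (q * k).sum(axis=-1)
raw_var = raw.var()                         # close to d_k
scaled_var = (raw / np.sqrt(d_k)).var()     # close to 1
print(output.shape, round(raw_var, 1), round(scaled_var, 2))
```

With d_k = 64 the unscaled dot products have a variance near 64, while the scaled ones stay near 1, which keeps the softmax inputs in a range where gradients remain usable.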
Rather than performing a single attention function with full-dimensional keys, values, and queries, the paper proposed multi-head attention. The model linearly projects the queries, keys, and values h times into lower-dimensional spaces (where h = 8 in the base model and h = 16 in the big model), performs scaled dot-product attention on each projection independently in parallel, and then concatenates and projects the results:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
With h = 8 heads and d_model = 512, each head operates on d_k = d_v = 64 dimensions. Because the per-head dimensionality is reduced, the total computational cost of multi-head attention is similar to a single full-dimensional attention with the same width. Different heads can attend to different aspects of the input: some heads track syntactic relationships, others track semantic similarity, and still others track positional patterns [1].
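A loop-based sketch makes the project, attend, concatenate, and project sequence concrete. It assumes the base configuration (h = 8, d_model = 512, d_k = d_v = 64), uses randomly initialized matrices in place of learned ones, and favors readability over speed; practical implementations batch the heads into a single tensor operation.

```python
import numpy as np

def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k: (h, d_model, d_k); W_v: (h, d_model, d_v); W_o: (h * d_v, d_model)."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i            # project into one head's subspace
        weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # scaled dot-product attention
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1) @ W_o            # concatenate heads, then project

rng = np.random.default_rng(0)
h, d_model = 8, 512
d_k = d_v = d_model // h                                   # 64 dimensions per head
scale = d_model ** -0.5                                    # illustrative initialization scale
W_q = rng.normal(size=(h, d_model, d_k)) * scale
W_k = rng.normal(size=(h, d_model, d_k)) * scale
W_v = rng.normal(size=(h, d_model, d_v)) * scale
W_o = rng.normal(size=(h * d_v, d_model)) * scale

x = rng.normal(size=(5, d_model))                          # 5 positions
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)    # self-attention: Q = K = V = x
print(out.shape)
```

Each head works on 64 of the 512 dimensions, so the eight per-head products together cost about as much as one full-width attention.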
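Multi-head attention is used in three different ways within the transformer: in encoder self-attention, where queries, keys, and values all come from the previous encoder layer; in masked decoder self-attention, where each position attends only to positions up to and including itself; and in encoder-decoder attention, where queries come from the previous decoder layer and keys and values come from the encoder output.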
Because the transformer contains no recurrence or convolution, it has no inherent notion of the order of tokens in a sequence. To inject positional information, the authors add positional encodings to the input embeddings. They use sinusoidal functions of varying frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
where pos is the position in the sequence and i indexes the sine/cosine pair, so that 2i and 2i+1 are the even and odd dimensions of the encoding. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, and the wavelengths form a geometric progression from 2π to 10000 × 2π. The authors chose sinusoidal encodings because they hypothesized that this would allow the model to learn to attend by relative position, since for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos) [1].
The paper notes that learned positional embeddings produced nearly identical results, but sinusoidal encodings were preferred because they might generalize to sequence lengths longer than those seen during training.
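The sinusoidal encoding can be generated in a few lines. The sketch below assumes an even d_model and also checks the relative-position property numerically: for one sine/cosine pair, shifting the position by a fixed offset k amounts to a fixed 2×2 rotation. The function name and the chosen values of k and i are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of encodings; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]        # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

max_len, d_model = 50, 512
pe = sinusoidal_positional_encoding(max_len, d_model)

# Relative-position check for one sine/cosine pair (index i) and offset k.
k, i = 7, 3
w = 1.0 / 10000 ** (2 * i / d_model)                     # angular frequency of pair i
rotation = np.array([[np.cos(k * w), np.sin(k * w)],
                     [-np.sin(k * w), np.cos(k * w)]])
pair = pe[:, 2 * i:2 * i + 2]                            # columns: sin, cos
shifted = pair[:-k] @ rotation.T                         # linear map applied to PE(pos)
print(np.allclose(shifted, pair[k:]))                    # matches PE(pos + k)
```

The rotation matrix depends only on the offset k and the frequency, not on pos, which is what makes the shift a linear function of the original encoding.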
Each encoder and decoder layer contains a fully connected feed-forward network applied independently to each position. The network consists of two linear transformations with a ReLU activation in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
The inner dimension of the feed-forward network is d_ff = 2,048 in the base model (four times the model dimension of 512) and d_ff = 4,096 in the big model. Although the same function is applied at every position, different layers use different learned parameters. The component can be viewed as applying two 1x1 convolutions with a nonlinearity in between [1].
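A direct NumPy rendering of the FFN formula, using the base-model sizes and randomly initialized weights as placeholders for learned parameters, shows that each position is transformed independently of the others. The function name is illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048                      # base-model sizes
W1 = rng.normal(size=(d_model, d_ff)) * 0.02   # placeholder weights, not trained values
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.normal(size=(10, d_model))             # 10 positions
out = position_wise_ffn(x, W1, b1, W2, b2)

# Running a single position on its own gives the same result as running the whole sequence.
single = position_wise_ffn(x[2:3], W1, b1, W2, b2)
print(out.shape, np.allclose(single, out[2:3]))
```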
The paper's "Why Self-Attention" section compares self-attention to recurrent and convolutional layers across three desiderata: total computational complexity per layer, the amount of computation that can be parallelized (measured by the minimum number of sequential operations required), and the maximum path length between any two positions in the network. Self-attention has O(n^2 · d) complexity per layer (where n is the sequence length and d is the representation dimension), O(1) sequential operations (everything can be done in parallel), and an O(1) maximum path length between any two positions. Recurrent layers, by contrast, require O(n) sequential operations and have O(n) path length. Convolutional layers have O(log_k(n)) path length with dilated convolutions. The constant-path-length property is the architectural reason transformers learn long-range dependencies more effectively than RNNs [1].
The paper's comparison also covers raw operation counts. Recurrent layers cost O(n · d^2) per layer, so in the typical case where n is smaller than d (sequence length shorter than representation dimension), self-attention at O(n^2 · d) is also cheaper per layer than a recurrent layer. When the sequence is much longer than d, the paper notes that "restricted self-attention," in which each position attends only to a neighborhood of size r, can be used to reduce complexity to O(r · n · d) at the cost of increasing maximum path length to O(n/r). The authors also note a secondary benefit: self-attention models can yield more interpretable representations, since the learned attention weights often reflect syntactic or semantic structure [1].
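The trade-off is easy to see by plugging numbers into the asymptotic expressions. The sketch below drops constant factors, so the results are rough proportional counts rather than measured FLOPs. The recurrent term n · d^2 is taken from the complexity table in the paper, and the sequence lengths and neighborhood size are arbitrary illustrative choices.

```python
def per_layer_ops(n, d, r=None):
    """Proportional per-layer operation counts, ignoring constant factors."""
    costs = {
        "self_attention": n * n * d,   # O(n^2 * d)
        "recurrent": n * d * d,        # O(n * d^2)
    }
    if r is not None:
        costs["restricted_self_attention"] = r * n * d   # O(r * n * d)
    return costs

short = per_layer_ops(n=50, d=512)             # sentence-length input, n < d
long = per_layer_ops(n=4096, d=512, r=128)     # long input, n > d

print(short)
print(long)
```

For n = 50 and d = 512, self-attention needs about a tenth of the operations of a recurrent layer; at n = 4096 the relationship reverses, and restricting attention to r = 128 neighbors brings the count back below the recurrent layer's.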
The paper defined two model configurations for evaluation.
| Parameter | Transformer Base | Transformer Big |
|---|---|---|
| Encoder layers (N) | 6 | 6 |
| Decoder layers (N) | 6 | 6 |
| Model dimension (d_model) | 512 | 1,024 |
| Feed-forward dimension (d_ff) | 2,048 | 4,096 |
| Attention heads (h) | 8 | 16 |
| Key/Value dimension (d_k = d_v) | 64 | 64 |
| Parameters | ~65M | ~213M |
| Dropout rate (P_drop) | 0.1 | 0.3 (EN-DE) / 0.1 (EN-FR) |
The models were trained on the standard WMT 2014 English-German and English-French machine translation benchmarks [1].
For English-German, the training set contained approximately 4.5 million sentence pairs. For English-French, the substantially larger dataset contained approximately 36 million sentence pairs. Sentences were encoded using byte-pair encoding, with a shared source-target vocabulary of approximately 37,000 tokens for English-German and a 32,000-token word-piece vocabulary for English-French. Training batches contained sentence pairs grouped by approximate sequence length, with each batch holding about 25,000 source tokens and 25,000 target tokens [1].
The authors used the Adam optimizer with β_1 = 0.9, β_2 = 0.98, and ε = 10^-9. They employed a custom learning rate schedule that combined a linear warmup with an inverse square root decay:
lrate = d_model^(-0.5) · min(step_num^(-0.5), step_num · warmup_steps^(-1.5))
This schedule linearly increases the learning rate during the first warmup_steps training steps (4,000 for the base model) then decreases it proportionally to the inverse square root of the step number. The warmup phase prevents instability in the early stages of training when the model's parameters are randomly initialized [1]. The need for warmup was later traced to the post-norm placement of layer normalization in the original transformer; subsequent work showed that the pre-norm variant trains stably without warmup [13].
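The schedule is a one-line function. The sketch below uses the base-model values d_model = 512 and warmup_steps = 4,000 as defaults and assumes step numbers start at 1, since the formula is undefined at step 0.

```python
import math

def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5).
    step_num is assumed to start at 1."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

warmup_rate = transformer_lrate(1000)    # still rising linearly
peak_rate = transformer_lrate(4000)      # both branches of min() meet here
decayed_rate = transformer_lrate(16000)  # inverse square root decay

print(f"{warmup_rate:.6f} {peak_rate:.6f} {decayed_rate:.6f}")
```

The two branches of `min` are equal exactly at `step_num == warmup_steps`, which is where the rate peaks (at roughly 7 × 10^-4 for these defaults). Quadrupling the step count after that point halves the rate.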
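Three forms of regularization were applied: dropout on the output of each sub-layer, before it is added to the sub-layer input and normalized; dropout on the sums of the embeddings and the positional encodings in both the encoder and decoder stacks; and label smoothing of value ε_ls = 0.1, which hurts perplexity but improves accuracy and BLEU score [1].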
The base model was trained for 100,000 steps (approximately 12 hours, roughly 0.4 seconds per step) on 8 NVIDIA P100 GPUs in a single machine. The big model was trained for 300,000 steps (approximately 3.5 days, roughly 1.0 second per step) on the same hardware. These training times were dramatically shorter than those reported for competing RNN and convolutional sequence-to-sequence architectures: the big model achieved state-of-the-art results at an estimated training cost of 2.3 × 10^19 FLOPs, roughly an order of magnitude less than the GNMT and ConvS2S systems it surpassed [1]. Final translation models were obtained by averaging the last 5 checkpoints (base) or 20 checkpoints (big), saved at 10-minute intervals during training.
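Checkpoint averaging can be sketched as an element-wise mean over saved parameter sets. This assumes each checkpoint is a dictionary mapping parameter names to arrays, and that averaging means a plain unweighted mean; the paper does not spell out the procedure, so both points are assumptions, and the parameter names are hypothetical.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Return the element-wise mean of each named parameter across checkpoints.
    checkpoints: list of dicts mapping parameter name -> np.ndarray (assumed format)."""
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }

# Five toy checkpoints whose parameters are filled with 0.0, 1.0, ..., 4.0.
checkpoints = [
    {"ffn/W_1": np.full((2, 2), float(step)), "ffn/b_1": np.full(2, float(step))}
    for step in range(5)
]
averaged = average_checkpoints(checkpoints)
print(averaged["ffn/W_1"])
```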
The original implementation lived inside Google's research codebase and was ported into tensor2tensor, the open-source TensorFlow library released by the Google Brain team in June 2017, the same month the paper appeared on arXiv [8]. tensor2tensor exposed reference implementations of the transformer along with shared datasets, vocabularies, and tokenization pipelines, which substantially lowered the barrier to reproducing the paper's results. Within months, multiple independent re-implementations appeared in PyTorch and other frameworks. tensor2tensor itself was later deprecated in favor of its successor, Trax, but its initial release was instrumental in spreading the architecture quickly through the research community [9].
The transformer achieved state-of-the-art results on both machine translation benchmarks tested. During inference, the authors used beam search with a beam size of 4 and a length penalty α = 0.6.
| Model | EN-DE BLEU | EN-FR BLEU | Training cost (FLOPs) |
|---|---|---|---|
| ByteNet (2017) | 23.75 | – | – |
| GNMT + RL (Google, 2016) | 24.6 | 39.92 | 1.4 × 10^20 |
| ConvS2S (Gehring et al., 2017) | 25.16 | 40.46 | 1.5 × 10^20 |
| MoE (Shazeer et al., 2017) | 26.03 | 40.56 | 1.2 × 10^20 |
| Transformer Base | 27.3 | 38.1 | 3.3 × 10^18 |
| Transformer Big | 28.4 | 41.0 | 2.3 × 10^19 |
| Transformer Big (EN-FR, single model) | – | 41.8 | 2.3 × 10^19 |
The Transformer Big achieved 28.4 BLEU on WMT 2014 English-to-German translation, improving over the previous best results (including ensembles) by more than 2 BLEU points. On WMT 2014 English-to-French, the single model achieved 41.0 BLEU (41.8 with checkpoint averaging), a new state-of-the-art at a fraction of the training cost of previous models [1].
By the training-cost figures in the table above, the Transformer Big required roughly five to seven times fewer floating-point operations to train than the competing models it outperformed. The ability to parallelize attention computations across all positions simultaneously, rather than processing them sequentially as in RNNs, was the primary driver of this efficiency [1].
To demonstrate the transformer's generality beyond translation, the authors evaluated a 4-layer transformer with d_model = 1024 on English constituency parsing (predicting the syntactic tree structure of a sentence). Trained on the Wall Street Journal portion of the Penn Treebank in a supervised setting, the model achieved an F1 score of 91.3, comparable to specialized parsing systems despite minimal task-specific tuning. With semi-supervised training using the larger high-confidence and BerkeleyParser corpora, the score rose to 92.7 F1, surpassing all but one of the discriminative parsers compared in the paper. The choice of constituency parsing as a generalization probe followed Vinyals, Kaiser, and colleagues' 2015 "Grammar as a Foreign Language" line of work, which had cast parsing as a sequence-to-sequence problem [1][40].
The appendix of the paper presents several attention pattern visualizations from the trained models. Different heads were observed to specialize: some attended primarily to neighboring tokens, others tracked syntactic relations such as verb-object or preposition-object dependencies, and a few appeared to capture long-range coreference. The visualizations did not constitute a rigorous interpretability claim, but they suggested that multi-head attention learns to decompose linguistic structure across heads. This observation seeded an entire subfield of "BERTology" and transformer probing, with hundreds of subsequent papers analyzing what attention heads learn in larger models [1].
Several aspects of the paper represented significant departures from prior work and contributed to its lasting impact.
The most radical aspect of the paper was completely eliminating recurrent computation. Prior work had used attention as a supplement to RNNs; the transformer showed that attention alone was not only sufficient but superior. This enabled full parallelization across sequence positions during training, drastically reducing wall-clock training time [1].
While attention between encoder and decoder had been explored before (Bahdanau et al., 2014), the systematic use of self-attention within both the encoder and the decoder was a key contribution. Self-attention allows every position in a sequence to directly attend to every other position in a single computation step, giving the transformer a constant-path-length connection between any two positions (O(1)), compared to O(n) for RNNs and O(log n) for dilated convolutions [1].
The transformer architecture has unusually favorable scaling properties. Subsequent work demonstrated that transformer performance continues to improve predictably as model size, dataset size, and training compute increase. This property, formalized in the neural scaling laws identified by Kaplan et al. (2020) [14] and refined in the Chinchilla compute-optimal analysis by Hoffmann et al. (2022) [15], ultimately motivated the development of increasingly large language models.
All eight authors were listed with an asterisk denoting "equal contribution," with the listing order randomized. Each has gone on to significant roles in the AI industry, and their collective post-paper trajectories underscore the transformer's centrality to the modern AI landscape. As of 2024 reporting, seven of the eight founded or co-founded companies that have collectively raised more than $2 billion in venture capital [16].
| Author | Affiliation (2017) | Current role (2026) | Notable post-paper venture |
|---|---|---|---|
| Ashish Vaswani | Google Brain | Co-founder and CEO, Essential AI | Co-founded Adept AI (2022), then Essential AI (2023) |
| Noam Shazeer | Google Brain | VP Engineering and co-lead of Gemini, Google DeepMind | Founded Character.AI (2021); returned to Google via $2.7B licensing deal (2024) |
| Niki Parmar | Google Brain | Co-founder, Essential AI | Co-founded Adept AI (2022) with Vaswani, then Essential AI |
| Jakob Uszkoreit | Google Research | Co-founder and CEO, Inceptive | Founded Inceptive (2021), applying generative AI to RNA medicine design |
| Llion Jones | Google Research | Co-founder and CTO, Sakana AI | Founded Sakana AI (2023) in Tokyo with David Ha and Ren Ito |
| Aidan Gomez | Google Brain (intern) | Co-founder and CEO, Cohere | Co-founded Cohere (2019); $7B+ valuation as of 2025 |
| Lukasz Kaiser | Google Brain | Research scientist, OpenAI | Joined OpenAI (2021); only author who did not start a company |
| Illia Polosukhin | Google Research | Co-founder, NEAR Protocol | Co-founded NEAR Protocol (2017), an AI/blockchain platform |
Ashish Vaswani served as a lead researcher on the transformer project at Google Brain and is listed first on the paper. After leaving Google in 2021, he co-founded Adept AI alongside Niki Parmar and David Luan in 2022, working on AI agents for enterprise software. He and Parmar departed Adept in 2022 to pursue a research-led foundation-model thesis and in 2023 co-founded Essential AI, which emerged from stealth in December 2023 with a $56.5 million Series A led by March Capital and including Google, Nvidia, AMD, KB Investment, Franklin Venture Partners, and Thrive Capital [17].
Noam Shazeer joined Google in 2000 and was one of Google's most prolific AI researchers before the transformer paper, with key contributions to spelling correction, large-scale language modeling, mixture of experts, and the 2017 "Outrageously Large Neural Networks" paper that introduced sparsely-gated MoE layers [18][39]. Within the transformer project, Shazeer specifically proposed scaled dot-product attention, multi-head attention, and the parameter-free positional representation that gave the architecture much of its empirical strength [1]. After Google declined to launch a chatbot he had developed with Daniel De Freitas, Shazeer left in 2021 and co-founded Character.AI, a conversational AI platform that grew to millions of users. In August 2024, Google struck a deal reported at $2.7 billion to license Character.AI's technology and bring Shazeer back to Google DeepMind, where he was appointed to co-lead the Gemini project alongside Jeff Dean and Oriol Vinyals. The U.S. Department of Justice subsequently opened an inquiry into whether the deal's structure circumvented antitrust review [19].
Aidan Gomez was a 20-year-old undergraduate intern at Google Brain during the development of the transformer, on leave from the University of Toronto. He co-founded Cohere in 2019 alongside Ivan Zhang and Nick Frosst, building enterprise-focused large language models. By September 2025, Cohere had raised approximately $1.6 billion in total funding, with a $500 million Series E round in August 2025 followed by a $100 million extension in September 2025 valuing the company at roughly $7 billion. Annualized revenue crossed $100 million in May 2025 and $150 million by October 2025, and Gomez publicly stated Cohere was preparing for an initial public offering [20].
Illia Polosukhin co-founded NEAR Protocol in 2017, originally as a machine learning platform called NEAR.ai before pivoting to a blockchain platform. As of 2025, Polosukhin continues to lead NEAR with a focus on the intersection of AI and decentralized computation. At NVIDIA GTC 2025 he presented NEAR's research on confidential, decentralized AI inference, the second consecutive year he had spoken at NVIDIA's flagship conference [21].
After more than a decade at Google, Llion Jones co-founded Sakana AI in Tokyo in 2023 alongside David Ha (another former Google researcher) and Ren Ito. Sakana AI develops novel AI architectures with inspiration from biological systems and evolutionary approaches, with a particular focus on Japan-optimized models. The company raised a $30 million seed round in January 2024, a $214 million Series A in September 2024 at a $1.5 billion valuation that made it Japan's first AI unicorn, and a $135 million Series B in November 2025 at a $2.65 billion valuation [22][23].
Jakob Uszkoreit co-founded Inceptive in 2021, a biotechnology company that applies transformer-style deep learning to the design of RNA molecules for therapeutic and vaccine applications. Inceptive raised $100 million in 2023 led by Andreessen Horowitz and NVIDIA, in addition to earlier rounds bringing total funding to roughly $120 million. The company has partnered with at least one major European pharmaceutical company on a new infectious-disease mRNA vaccine [24]. Uszkoreit's father, the German computational linguist Hans Uszkoreit, was initially skeptical of the self-attention hypothesis [7].
Lukasz Kaiser joined OpenAI in 2021 and has contributed to that organization's research on reasoning and multimodal models. He spent considerable time at Google on tensor2tensor and on the design of the original transformer codebase, and remains the only one of the eight authors who did not found a startup, instead continuing as a research scientist [16].
Niki Parmar co-founded Adept AI in 2022 alongside Ashish Vaswani and subsequently co-founded Essential AI in 2023. Her research interests span attention mechanisms, generative models, and AI systems for enterprise automation. She also co-authored the 2018 Image Transformer paper, an early extension of self-attention to image generation [25].
In March 2024, seven of the eight authors gathered for the first time as a group at NVIDIA's GTC conference in San Jose for a panel moderated by NVIDIA CEO Jensen Huang. Niki Parmar was unable to attend. The panel drew a packed ballroom audience of over 2,000 attendees, who heard the researchers reflect on the original paper and on where they believe transformer-style architectures are heading. Huang presented each panelist with a framed cover plate of the NVIDIA DGX-1 AI supercomputer signed "You transformed the world" [26].
Several authors used the panel to argue that the field needs better architectures. Aidan Gomez said "I think the world needs something better than the transformer." Jakob Uszkoreit emphasized adaptive computation, arguing that models should spend more compute on hard problems and less on easy ones. Llion Jones described the discovery process as one of subtraction, recalling "we had very recently started throwing bits of the model away, just to see how much worse it would get. And to our surprise it started getting better" [16][27].
The paper received an oral presentation at NeurIPS 2017 in Long Beach, California. Reception in the broader research community was initially modest but accelerated rapidly once OpenAI's GPT (June 2018) and Google's BERT (October 2018) demonstrated that pretraining transformer models on large unlabeled corpora produced powerful general-purpose representations across many NLP tasks [28][29]. BERT in particular achieved state-of-the-art results on eleven NLP benchmarks at once and was integrated into Google Search in late 2019, putting the transformer architecture into production at internet scale [29]. By 2020, "Attention Is All You Need" was the most-cited paper in the deep learning subfield. By 2025, citation counts on Google Scholar exceeded 173,000, placing the paper seventh on Nature's tabulation of the most-cited scientific papers of the 21st century, behind work in deep residual learning, COVID-19 epidemiology, and meta-analysis methodology, and ahead of the original AlphaFold paper [2].
Andrej Karpathy, a prominent AI educator and former director of AI at Tesla, has emphasized that "Attention Is All You Need" did not invent attention; the operator itself dates to Bahdanau, Cho, and Bengio's 2014 paper on neural machine translation. In December 2024, Karpathy published correspondence from Dzmitry Bahdanau describing the origin of the term "attention," noting that the 2014 paper receives roughly a thousand times fewer citations than the 2017 transformer paper despite introducing the central mechanism. Karpathy's commentary frames the transformer paper as the moment attention was pushed to its limit (used exclusively, rather than as an add-on to RNNs), rather than as the moment attention itself was invented [38].
The paper's cultural footprint is unusually large for a technical machine learning paper. An analysis of arXiv titles found 717 papers containing "All You Need" between 2009 and 2025, with exponential growth following the 2017 publication; "Attention" remains the most frequently claimed necessity among the imitators [11].
The influence of "Attention Is All You Need" extends across nearly every subfield of AI and into structural biology, drug discovery, robotics, weather forecasting, and beyond.
Every major large language model deployed as of 2026 is built on the transformer architecture or a close derivative. GPT-1 through GPT-5 (OpenAI), BERT and T5 (Google), PaLM and Gemini (Google DeepMind), LLaMA (Meta), Claude (Anthropic), Mistral, and DeepSeek all use transformer-based designs. The decoder-only variant (used by GPT) and the encoder-only variant (used by BERT) were both derived directly from the architecture described in this paper [28][29]. The shift from task-specific architectures to a single transformer family that could be pretrained once and fine-tuned on many tasks was made concrete by benchmarks like GLUE (2018), which BERT topped immediately after release [41], and by GPT-3's 2020 demonstration that scaling decoder-only transformers to 175 billion parameters enabled "few-shot" in-context learning across many tasks without any task-specific fine-tuning [42].
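The transformer is a remarkably general-purpose architecture, extending far beyond its original machine translation application to vision transformers, protein structure prediction models such as AlphaFold, speech recognition systems, and music generation tools.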
With more than 173,000 citations as of 2025, "Attention Is All You Need" ranks among the top ten most-cited papers of the 21st century across all scientific disciplines [2]. Its citation count continues to grow as the transformer architecture finds new applications.
The paper's impact extends beyond technology into the structure of the AI industry. The eight authors collectively founded or co-led companies valued at tens of billions of dollars. The transformer's scalability enabled the modern approach to AI development: training increasingly large models on increasingly large datasets, a strategy that underpins the business models of OpenAI, Google DeepMind, Anthropic, Cohere, and many other AI companies.
The post-Google diaspora is unusual in the history of computer science. As of 2026, the companies founded by transformer co-authors include Cohere ($7B+ valuation), Character.AI (the subject of a $2.7B licensing deal with Google), Essential AI, Adept AI, Sakana AI ($2.65B valuation), Inceptive, and NEAR Protocol. These companies occupy distinct parts of the AI ecosystem: for example, Cohere focuses on enterprise LLM APIs, Essential AI on foundation model research, Sakana AI on biologically inspired models for the Japanese market, Inceptive on RNA therapeutic design, and NEAR Protocol on decentralized AI infrastructure. The transformer paper thus functioned both as a technical breakthrough and as the founding document of a venture portfolio that has helped define the post-2020 AI startup landscape [16].
The transformer as described in the 2017 paper has been refined and extended in numerous ways since publication.
| Variant | Architecture type | Key modification | Example models |
|---|---|---|---|
| Encoder-only | Bidirectional | Removes decoder; bidirectional attention | BERT, RoBERTa |
| Decoder-only | Autoregressive | Removes encoder; causal masking only | GPT, LLaMA, Claude |
| Encoder-decoder | Seq2seq | Original design | T5, BART, mBART |
| Mixture of experts | Sparse | Routes tokens to specialized sub-networks | Switch Transformer, Mixtral |
| State space | Hybrid | Replaces or supplements attention with SSM layers | Mamba, Jamba |
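The difference between the encoder-only and decoder-only rows of the table comes down largely to which positions each token may attend to. The following is a minimal NumPy sketch of that idea, not code from any of the listed models; the helper names `attention_mask` and `masked_softmax` are hypothetical, and the all-zero scores are placeholders chosen so the effect of the mask is easy to see.

```python
import numpy as np

def attention_mask(n: int, causal: bool) -> np.ndarray:
    """Return an n x n boolean mask; True means query i may attend to key j."""
    full = np.ones((n, n), dtype=bool)
    # np.tril keeps the lower triangle, so position i sees positions 0..i only.
    return np.tril(full) if causal else full

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax after setting disallowed positions to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)  # exp(-inf) is exactly 0.0
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores: every pair equally similar

# Encoder-only style (bidirectional): every token attends to every token.
encoder_weights = masked_softmax(scores, attention_mask(4, causal=False))

# Decoder-only style (causal): token i attends only to tokens 0..i.
decoder_weights = masked_softmax(scores, attention_mask(4, causal=True))
```

With uniform scores, every row of `encoder_weights` is spread evenly over all four positions, while row `i` of `decoder_weights` is spread evenly over positions `0..i` and is zero elsewhere.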
Subsequent research introduced many improvements to transformer training.
The transformer's O(n^2) attention complexity with respect to sequence length remains its primary computational bottleneck. Processing a sequence of length n requires computing attention scores for all n^2 pairs of positions, so the compute and memory spent on the score matrix grow quadratically in n. At inference time, the key-value (KV) cache that stores the keys and values of earlier tokens grows linearly in n. Numerous approaches have been proposed to address this, including sparse attention patterns, linear attention approximations, and hybrid architectures combining attention with state space models such as Mamba [35].
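The two growth rates can be made concrete with simple counting. The sketch below is illustrative arithmetic only: the function names are hypothetical, the defaults (8 heads, 6 layers, model width 512) are the base-model settings, and the counts ignore batch size, numeric precision, and cross-attention.

```python
def attention_score_entries(n: int, heads: int = 8) -> int:
    """Entries in the attention score matrices of one layer: heads * n * n."""
    return heads * n * n

def kv_cache_entries(n: int, layers: int = 6, d_model: int = 512) -> int:
    """Cached key and value entries across layers: 2 * layers * n * d_model."""
    return 2 * layers * n * d_model

# Doubling the sequence length quadruples the score entries ...
print(attention_score_entries(50))   # 20000
print(attention_score_entries(100))  # 80000

# ... but only doubles the KV cache.
print(kv_cache_entries(50))   # 307200
print(kv_cache_entries(100))  # 614400
```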
Beyond compute cost, the transformer has documented theoretical and empirical limitations. Hahn (2020) proved that self-attention models with a fixed number of layers and heads cannot reliably recognize simple formal languages such as parity or balanced parentheses as input length grows, a result that has been extended to show inherent weaknesses in function composition [36]. Empirical work on hallucination demonstrates that transformer language models can produce confident outputs that are inconsistent with their training data or input. Critics including Yann LeCun and Richard Sutton have argued that transformer-based LLMs are approaching the limits of what scaling alone can achieve, and that fundamental architectural innovation will be needed to support robust reasoning and planning [37].
For concreteness, a single forward pass through the base transformer on a 50-token English-to-German translation example would proceed as follows. The 50-token input is embedded into a 50 × 512 matrix; positional encodings are added; the result enters the first encoder layer. Multi-head self-attention projects this matrix into 8 sets of queries, keys, and values each of shape 50 × 64, computes 8 attention matrices each of shape 50 × 50, multiplies each by its 50 × 64 value matrix to produce 8 head outputs of shape 50 × 64, concatenates them into a 50 × 512 matrix, applies an output projection, and adds the residual. The position-wise feed-forward sub-layer then expands the 50 × 512 matrix to 50 × 2048, applies ReLU, projects back to 50 × 512, and again adds the residual. After six encoder layers, the 50 × 512 representation flows into the decoder, which generates the German output one token at a time using masked self-attention over already-emitted tokens and cross-attention against the encoder output. The base model has approximately 65 million parameters spread across token embeddings, six encoder layers, six decoder layers, and the output projection [1].
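The shapes in this walk-through can be checked with a short NumPy sketch of a single encoder layer. This is a simplified illustration rather than the paper's implementation: the weights are random stand-ins for trained parameters, the input `x` stands in for token embeddings plus positional encodings, the attention projections omit biases, and the layer normalization applied after each residual connection (as in the original design) omits its learned gain and bias. All function and variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, h, d_ff = 50, 512, 8, 2048
d_k = d_model // h  # 64 dimensions per head

def init(*shape):
    """Random stand-in for trained weights."""
    return rng.normal(scale=0.02, size=shape)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Simplified: no learned gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def split_heads(x):
    # (n, d_model) -> (h, n, d_k)
    return x.reshape(x.shape[0], h, d_k).transpose(1, 0, 2)

W_q, W_k, W_v, W_o = (init(d_model, d_model) for _ in range(4))
W_1, b_1 = init(d_model, d_ff), np.zeros(d_ff)
W_2, b_2 = init(d_ff, d_model), np.zeros(d_model)

def encoder_layer(x):
    shapes = {}
    # Multi-head self-attention
    q, k, v = (split_heads(x @ W) for W in (W_q, W_k, W_v))
    shapes["q"] = q.shape                                   # (8, 50, 64)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)
    shapes["scores"] = scores.shape                         # (8, 50, 50)
    heads = softmax(scores) @ v
    shapes["heads"] = heads.shape                           # (8, 50, 64)
    concat = heads.transpose(1, 0, 2).reshape(x.shape[0], d_model)
    shapes["concat"] = concat.shape                         # (50, 512)
    x = layer_norm(x + concat @ W_o)                        # residual + norm
    # Position-wise feed-forward sub-layer
    hidden = np.maximum(0.0, x @ W_1 + b_1)                 # ReLU
    shapes["hidden"] = hidden.shape                         # (50, 2048)
    out = layer_norm(x + hidden @ W_2 + b_2)                # residual + norm
    return out, shapes

x = init(n, d_model)  # stand-in for embeddings + positional encodings
out, shapes = encoder_layer(x)
print(out.shape)  # (50, 512)
```

Because the output has the same 50 × 512 shape as the input, six such layers can be stacked directly, which is what allows the encoder to be described as a stack of identical layers.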
As of early 2026, the transformer remains the dominant architecture in AI. Researchers continue to explore alternatives, including state space models like Mamba, hybrid models like Jamba (state space and attention combined), and novel attention variants. None has displaced the transformer across the breadth of tasks where it excels [35].
The paper's ideas continue to evolve. Modern transformers bear the same fundamental relationship to the 2017 design as a modern automobile does to the Ford Model T: the core principles remain, but virtually every component has been refined. Attention patterns, normalization schemes, position encodings, activation functions (with SwiGLU and similar gated variants replacing ReLU in most large models), and training procedures have all been improved. The basic framework of stacking self-attention and feed-forward layers, processing sequences in parallel, and learning contextual representations through attention remains the backbone of the AI systems that define the current era.
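As one example of such a refinement, the change from a ReLU feed-forward sub-layer to a SwiGLU-style gated one can be sketched in a few lines of NumPy. This is an illustrative sketch with random weights and no biases, not the code of any particular model; the function names and the hidden width of 1365 (roughly two thirds of 2048, a choice that keeps the parameter count close to the ReLU version) are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_ff, d_gated = 50, 512, 2048, 1365

def relu_ffn(x, W_1, W_2):
    """2017-style feed-forward: expand, apply ReLU, project back."""
    return np.maximum(0.0, x @ W_1) @ W_2

def silu(x):
    """SiLU (Swish with beta = 1): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated feed-forward: a SiLU-activated gate multiplies a linear branch."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.normal(size=(n, d_model))

relu_out = relu_ffn(
    x,
    rng.normal(scale=0.02, size=(d_model, d_ff)),
    rng.normal(scale=0.02, size=(d_ff, d_model)),
)
swiglu_out = swiglu_ffn(
    x,
    rng.normal(scale=0.02, size=(d_model, d_gated)),
    rng.normal(scale=0.02, size=(d_model, d_gated)),
    rng.normal(scale=0.02, size=(d_gated, d_model)),
)
print(relu_out.shape, swiglu_out.shape)  # (50, 512) (50, 512)
```

The gated variant uses three weight matrices instead of two, which is why its hidden width is usually reduced when the two designs are compared at similar size.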
The paper is also notable for its role in accelerating the broader AI boom. The transformer's parallelizability made it possible to efficiently scale models to billions and then trillions of parameters on modern GPU and TPU clusters, enabling the emergent capabilities observed in large language models. "Attention Is All You Need" did not just propose a new architecture; it unlocked the scaling trajectory that produced ChatGPT, Gemini, Claude, and the broader wave of AI applications that have reshaped industry and society.