Ashish Vaswani

Updated Jun 27, 2026

Ashish Vaswani (born 1986) is an Indian-American computer scientist who is the first-listed author of the 2017 paper "Attention Is All You Need," the work that introduced the transformer architecture now underlying nearly every modern large language model.^[1]^[2] He carried out that work as a research scientist at Google Brain, where the paper's footnote credits him, together with Illia Polosukhin, with having "designed and implemented the first Transformer models."^[3] Vaswani co-founded Adept AI in 2021 and is now co-founder and chief executive officer of Essential AI, a San Francisco foundation-model startup that he and Niki Parmar launched in 2023 and that raised a $56.5 million Series A round in December of that year.^[4]^[5]^[7]

The transformer paper had accumulated more than 257,000 citations on Google Scholar by May 2026, the bulk of Vaswani's roughly 291,000 total citations, and is one of the most-cited papers in the history of artificial intelligence.^[11] Reflecting on its downstream impact, Vaswani has said: "There is going to be a time before Chat-GPT and a time after Chat-GPT."^[9]


Born	1986, India^[1]
Education	B.Tech., Birla Institute of Technology, Mesra; Ph.D., University of Southern California (2014)^[1]^[9]
Doctoral advisors	David Chiang and Liang Huang^[9]^[10]
Known for	First author of "Attention Is All You Need" (2017); transformer architecture^[3]
Employers	Google Brain (2016 to 2021); Adept AI (2021 to 2022); Essential AI (2023 to present)^[1]^[4]^[5]^[7]
Citations	291,427 (Google Scholar, May 2026); h-index 56^[11]

Who is Ashish Vaswani?

Ashish Vaswani is a computer scientist and AI entrepreneur best known as the lead author of "Attention Is All You Need," the 2017 paper that introduced the transformer, the neural network architecture that replaced recurrence and convolution with multi-head self-attention and became the foundation for systems such as BERT, GPT-3, and ChatGPT.^[3]^[9] He completed the work as a research scientist at Google Brain, holds a Ph.D. from the University of Southern California, and has since co-founded two AI companies, Adept AI (2021) and Essential AI (2023), where he serves as chief executive officer.^[1]^[4]^[7]

Early life and education

Vaswani was born in 1986 in India.^[1] He completed a Bachelor of Technology in computer science at the Birla Institute of Technology, Mesra (BIT Mesra), one of India's older engineering schools whose computer science department dates to 1983.^[1]^[12] He moved to the United States for graduate study and joined the University of Southern California (USC) as a computer science doctoral student in 2004, working at the USC Information Sciences Institute (ISI), a research center in Marina del Rey known for its long-standing program in machine translation and natural language processing.^[9]^[10] ISI in the mid-2000s was a major hub for statistical machine translation research and the host institution for the GIZA++ word-alignment toolkit and the Joshua decoder, both widely used at the time.^[10]

At ISI he worked in the natural language processing group led by Kevin Knight, with primary doctoral advising from David Chiang (now at the University of Notre Dame) and Liang Huang (now at Oregon State University).^[9]^[10] His doctoral research focused on statistical machine translation and, increasingly, on neural language modeling.^[9]^[10] His Ph.D. thesis, defended in 2014, was titled Smaller, Faster, and Accurate Models for Statistical Machine Translation and dealt with reducing the parameter and decoding cost of large translation systems while preserving BLEU scores.^[1]

David Chiang has stated that Vaswani was "my first Ph.D. student and one of the very first people to see the potential for deep learning in natural language processing back in 2011."^[9]^[10] Liang Huang has described Vaswani during this period as enthusiastic and unusual in pursuing GPU-based methods before the broader machine translation community took deep learning seriously.^[9] Vaswani has described the ISI group of that era as "a vibrant, tremendous research group pursuing bold ideas, and that's rare."^[9]

While still at USC, Vaswani co-authored two papers that anticipated his later work at Google. The first, "Decoding with Large-Scale Neural Language Models Improves Translation" (2013), integrated a feedforward neural language model into a statistical machine translation decoder and reported gains over n-gram baselines on Arabic-to-English and Chinese-to-English tasks.^[9]^[10] The second, "Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies" (2016), described an efficient training procedure for recurrent neural language models with very large output vocabularies.^[9]^[10]

What did Ashish Vaswani do at Google Brain?

Vaswani joined Google as a research scientist at Google Brain in 2016, after a postdoctoral period at USC ISI.^[1] At Google he worked on sequence models for translation and generation in collaboration with Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, all of whom would later be co-authors on the transformer paper.^[3] He remained at Google Brain until late 2021, leaving in November of that year to co-found Adept AI.^[4]^[13]

The Google Brain group during this period was structured around small, fluid teams pursuing both production-relevant translation work (Google's neural machine translation system, which had launched in 2016) and longer-horizon research on attention and sequence modeling.^[14] Vaswani's collaborators included Shazeer, a longtime Google researcher who had worked on mixture-of-experts and parameter scaling; Parmar, who had joined Google Brain after a master's degree at USC; Uszkoreit, whose research group had explored attention-based models for text without recurrence; and Polosukhin, who later left Google to co-found the NEAR Protocol blockchain project.^[3]^[14]

During his Google tenure Vaswani contributed to the Google Brain Tensor2Tensor (T2T) project, an open-source library that included a reference implementation of the transformer and was used widely by external researchers.^[14] He also worked on extensions of self-attention to vision and on image generation, producing the 2018 paper "Image Transformer" (Parmar, Vaswani, Uszkoreit, Kaiser, Shazeer, Ku, Tran), which adapted the transformer to autoregressive image generation, and the 2018 paper "Self-Attention with Relative Position Representations" (Shaw, Uszkoreit, Vaswani), which proposed relative position encodings later adopted by many downstream models.^[11]

What is Attention Is All You Need?

The paper "Attention Is All You Need," posted to arXiv on 12 June 2017 and presented at NeurIPS 2017, listed Vaswani as the first author and described the transformer, a neural network architecture for sequence transduction that replaces recurrence and convolution with multi-head self-attention.^[3]^[15] The paper's equal-contribution footnote records that "Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work."^[3] The full author list, in published order, is Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Vaswani, Shazeer, and Kaiser are listed at Google Brain and Parmar, Uszkoreit, and Jones at Google Research; Gomez is listed at the University of Toronto and Polosukhin without a Google address, each with a note that the work was performed while at Google.^[3]

The transformer paper reported a BLEU score of 28.4 on the WMT 2014 English-to-German translation task and 41.8 on English-to-French, both improving over the best prior published results while requiring substantially less training time on eight P100 GPUs.^[3] By May 2026 the paper had accumulated more than 257,000 citations on Google Scholar, the bulk of Vaswani's 291,427 total citations.^[11] The transformer became the basis for almost every subsequent large language model, including BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and later systems from OpenAI, Google DeepMind, Anthropic, Meta, and others.^[3] Vaswani has framed his own motivation for the work in terms of a long-term goal: "For me, personally, I was seeking a universal model. A single model that would consolidate all modalities and exchange information between them, just like the human brain."^[9]

The original paper described two key building blocks. Scaled dot-product attention computes attention weights as the softmax of query-key dot products divided by the square root of the key dimension, then uses those weights to take a weighted sum of value vectors.^[3] Multi-head attention runs this operation in parallel across multiple independently-projected subspaces, allowing the network to attend to information from different representation subspaces at the same time.^[3] The encoder-decoder transformer in the paper had six identical layers in each stack, with the decoder layers including an additional cross-attention sublayer over the encoder output.^[3] Position information was injected through sinusoidal positional encodings added to the token embeddings.^[3] Subsequent work refined many of these choices: BERT used encoder-only stacks, GPT used decoder-only stacks, and later models replaced sinusoidal positions with learned or relative position encodings.^[3]^[11]
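
The two mechanisms described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: it omits the learned query, key, and value projections and the split into multiple heads, and the function names and toy inputs are hypothetical choices made for this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, all_weights = [], []
    for q in queries:
        # One scaled dot product per key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


def sinusoidal_positional_encoding(num_positions, d_model):
    """Even dimensions use sin(pos / 10000^(2i/d_model)); odd ones use cos."""
    table = []
    for pos in range(num_positions):
        row = []
        for dim in range(d_model):
            pair_index = dim // 2
            angle = pos / (10000 ** (2 * pair_index / d_model))
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Hypothetical toy input: 3 tokens with d_model = 4.
embeddings = [[1.0, 0.0, 1.0, 0.0],
              [0.0, 2.0, 0.0, 2.0],
              [1.0, 1.0, 1.0, 1.0]]
positions = sinusoidal_positional_encoding(len(embeddings), 4)
# Position information is added to the token embeddings.
x = [[e + p for e, p in zip(erow, prow)]
     for erow, prow in zip(embeddings, positions)]
# Self-attention: queries, keys, and values all come from the same sequence.
outputs, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` is a probability distribution over the input positions, so every output vector is a convex combination of the value vectors. In the full architecture, each head would apply this same routine to its own learned projections of the inputs.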

What is Adept AI?

Vaswani left Google in late 2021 to co-found Adept AI alongside Niki Parmar and David Luan, the latter a former OpenAI engineering vice president and Google large-model program lead.^[4]^[13] Adept emerged from stealth on 26 April 2022 with a $65 million Series A funding round led by Greylock and Addition, with participation from Root Ventures and angel investors including Andrej Karpathy, Jaan Tallinn, and Chris Ré.^[4] At launch Luan was chief executive officer, Parmar was chief technology officer, and Vaswani was chief scientist.^[4]^[13]

Adept's stated objective was to train a neural network to operate existing software tools, with the company describing an "AI teammate" capable of taking natural-language instructions and performing tasks in productivity applications.^[13] The founding team included several other researchers from Google Brain and DeepMind.^[13]

Vaswani and Parmar both departed Adept in November 2022, less than a year after the company emerged from stealth.^[5]^[6] Reporting on their exit indicated that the departure was sudden and stemmed in part from differences with investors over the company's research direction.^[5] The remaining Adept team continued to operate under Luan and raised a $350 million Series B in March 2023 led by General Catalyst and Spark Capital.^[16] In June 2024 Amazon hired Luan and several other Adept co-founders into its AGI team in an arrangement widely described as an acquihire, with Adept itself remaining as a separate licensed entity under a new chief executive.^[17]

What is Essential AI?

After leaving Adept, Vaswani and Parmar founded Essential AI in 2023, with Vaswani as chief executive officer and Parmar as chief technology officer.^[7]^[8] The San Francisco-based company emerged from stealth on 13 December 2023 with a $56.5 million Series A funding round led by March Capital and including AMD, Franklin Venture Partners, Google, KB Investment, Nvidia, and Thrive Capital.^[7] An earlier $8.3 million seed round had been led by Thrive Capital, bringing total disclosed funding at the Series A to roughly $65 million.^[18]

At launch the company described its mission as "deepening the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today" and said it would develop full-stack large-language-model products to automate workflows.^[7] Over the following two years the company's public focus shifted toward open pretraining research, with a stated mission of "building an open platform to accelerate the science and engineering of deep learning" through pretraining work on frontier STEM and code capabilities.^[19]

By mid-2025 Essential AI had begun releasing open datasets and research artifacts. Essential-Web v1.0, a 24-trillion-token web corpus organized for pretraining, was released on Hugging Face.^[19] In May 2025 the company published "Practical Efficiency of Muon for Pretraining" (arXiv:2505.02222), a study of the Muon optimizer for large-batch pretraining, reporting that Muon expanded the Pareto frontier over AdamW on the compute-time tradeoff and remained more data-efficient at large batch sizes.^[20] The Muon study ran experiments at model sizes up to 4 billion parameters and batch sizes up to 16 million tokens and presented a "telescoping" algorithm for combining Muon with the maximal-update parameterization (muP) to transfer hyperparameters from small to large models.^[20] In December 2025 Essential AI released the Rnj-1 (Ramanujan) language model line, including base and instruction-tuned variants, listing Vaswani as an author on the announcement.^[19]

Essential AI's public positioning over 2024 and 2025 thus moved from the enterprise-automation product framing of the December 2023 launch toward open pretraining research and dataset publication, and the company still described itself in those open-platform terms in 2026.^[7]^[19] As of May 2026 Vaswani is listed as chief executive officer of Essential AI; the speaker page for The Montgomery Summit identifies him in that role.^[21]

What were Ashish Vaswani's main research contributions?

Transformer architecture

Vaswani's central contribution is the transformer, the neural network architecture introduced in "Attention Is All You Need" and now the standard backbone for sequence modeling in natural language processing, computer vision, audio, and protein structure prediction.^[3] The transformer uses self-attention layers (specifically scaled dot-product attention with multiple heads), residual connections, layer normalization, and position-wise feedforward networks, dispensing with the recurrent connections of LSTMs and the local receptive fields of convolutional networks.^[3] In the architecture, attention weights are computed as the softmax of scaled query-key dot products and used to aggregate value vectors, allowing every position in a sequence to attend to every other position in a single layer.^[3]
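
For reference, the components named above can be written compactly in the notation of the original paper. The label SublayerOutput is introduced here only to name the residual-plus-normalization pattern and does not appear in the paper.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{FFN}(x) &= \max(0,\, x W_1 + b_1)\, W_2 + b_2 \\
\mathrm{SublayerOutput}(x) &= \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))
\end{align}
```

Here d_k is the key dimension, h is the number of heads, and Sublayer stands for either the multi-head attention or the position-wise feedforward network.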

In their footnote describing the division of labor, the authors record that Uszkoreit proposed replacing recurrent networks with self-attention and started the effort to evaluate the idea, that Vaswani and Polosukhin designed and implemented the first transformer models, and that Shazeer proposed scaled dot-product attention, multi-head attention, and the parameter-free position representation; Parmar, Jones, Kaiser, and Gomez are credited with model variants, the original codebase, and the Tensor2Tensor implementation.^[3] After the initial publication, Vaswani co-led further work showing that the architecture generalized beyond translation.^[11]

Image Transformer and self-attention with relative positions

In 2018 Vaswani co-authored "Image Transformer" (Parmar, Vaswani, Uszkoreit, Kaiser, Shazeer, Ku, Tran), which applied self-attention to autoregressive image generation by treating image pixels as tokens and restricting attention to local neighborhoods to control memory cost.^[11] The same year he co-authored "Self-Attention with Relative Position Representations" (Shaw, Uszkoreit, Vaswani), which proposed an alternative to the original sinusoidal position encoding by injecting learned representations of the relative offset between tokens directly into the attention computation; this design influenced later position-encoding schemes including those in T5 and Transformer-XL.^[11]
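
The idea of injecting relative offsets into the attention computation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: it covers only the key-side term, skips the learned query and key projections, and uses a hand-written table where the paper would use learned vectors. All names and values are hypothetical.

```python
import math


def clip_offset(offset, max_distance):
    """Clamp a relative offset j - i to [-max_distance, max_distance]."""
    return max(-max_distance, min(max_distance, offset))


def relative_attention_weights(queries, keys, rel_key_table, max_distance):
    """Attention weights where each key is shifted by a vector chosen by offset.

    rel_key_table[c + max_distance] holds the vector for clipped offset c.
    """
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        logits = []
        for j, k in enumerate(keys):
            rel = rel_key_table[clip_offset(j - i, max_distance) + max_distance]
            # Logit: q . (k + rel) / sqrt(d), so position enters the score directly.
            logits.append(
                sum(qv * (kv + rv) for qv, kv, rv in zip(q, k, rel)) / math.sqrt(d)
            )
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical toy values: 4 identical tokens, d = 2, offsets clipped to [-1, 1].
tokens = [[1.0, 0.0]] * 4
rel_table = [[-1.0, 0.0],  # clipped offset -1: key lies to the left
             [0.0, 0.0],   # offset 0: same position
             [1.0, 0.0]]   # clipped offset +1: key lies to the right
w = relative_attention_weights(tokens, tokens, rel_table, max_distance=1)
```

Because the tokens are identical, any difference between the weights in a row comes from the relative-offset vectors alone. Clipping means that all keys more than `max_distance` positions away in the same direction share one vector, which keeps the table small regardless of sequence length.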

Bottleneck transformers

In 2021 Vaswani co-authored "Bottleneck Transformers for Visual Recognition" (Srinivas, Lin, Parmar, Shlens, Abbeel, Vaswani), which replaced the spatial convolutions in the final stage of a ResNet with multi-head self-attention blocks (BoT blocks) and reported gains on ImageNet classification and COCO instance segmentation.^[11] The paper was an early example of hybrid convolution-attention vision backbones, a design family that influenced subsequent architectures including ConvNeXt and the Swin Transformer.^[11]

Graph networks survey

Vaswani was a co-author on the 2018 survey "Relational Inductive Biases, Deep Learning, and Graph Networks" led by Peter W. Battaglia at DeepMind, which proposed graph networks as a unifying framework for relational reasoning over structured data and argued for the importance of inductive biases in deep learning architectures.^[11] The paper accumulated more than 5,100 citations by May 2026.^[11]

Tensor2Tensor

While at Google Brain, Vaswani was the first-listed author of the paper describing Tensor2Tensor, a TensorFlow-based deep-learning toolkit that included reference implementations of the transformer and supporting datasets.^[14] Tensor2Tensor was used by external researchers as a common implementation of the transformer in the period immediately following the 2017 paper and supported reproducibility of the original results.^[14]

Pretraining research at Essential AI

At Essential AI, Vaswani's published research has focused on pretraining efficiency. The 2025 paper "Practical Efficiency of Muon for Pretraining" (arXiv:2505.02222) studied the Muon second-order optimizer at model sizes up to 4 billion parameters and batch sizes up to 16 million tokens, presenting evidence that Muon retained data efficiency at large batch sizes beyond the so-called critical batch size and that it combined effectively with the maximal-update parameterization (muP) for hyperparameter transfer.^[20] The associated experimental artifacts were released on Hugging Face.^[20] Essential AI also published Essential-Web v1.0, a 24-trillion-token organized pretraining corpus.^[19] The company's research output between 2024 and 2026 included work on dataset organization for pretraining, on optimizer behavior at scale, and on the relationship between learning-rate schedules and grokking phenomena.^[19]^[20]

Reception of the transformer work

The transformer paper has been described as one of the most consequential publications in machine learning of the 2010s, with its impact framed in retrospect through the explosion of large language models built on transformer backbones in the years that followed.^[9]^[10] Vaswani has called the arrival of ChatGPT "a clear landmark in the arc of AI" and has said, "We're seeing the beginnings of profound tools for thought that will eventually make us much more capable in the digital world."^[9] USC Viterbi and ISI both profiled Vaswani's role in their 2023 alumni features tied to the rise of ChatGPT.^[9]^[10] In coverage of Essential AI's December 2023 launch, multiple outlets identified Vaswani and Parmar as co-creators of the transformer when describing the new company's pedigree.^[7]^[8]^[18]

Awards and recognition

Vaswani's recognition derives primarily from the citation count and downstream impact of the transformer paper, which by May 2026 had accumulated more than 257,000 citations on Google Scholar.^[11] He was listed among the speakers at The Montgomery Summit, a technology investment conference, where his biography described him as a co-creator of the transformer and chief executive officer of Essential AI.^[21] USC Viterbi profiled him in 2023 as part of a feature on USC alumni whose work paved the path to ChatGPT.^[9]

Selected publications

The papers below are listed in approximate order of citation count, with citation counts as of Google Scholar in May 2026.^[11]

Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30. 257,428 citations.^[3]^[11]
Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; et al., including Vaswani (2018). "Relational Inductive Biases, Deep Learning, and Graph Networks." arXiv:1806.01261. 5,137 citations.^[11]
Shaw, P.; Uszkoreit, J.; Vaswani, A. (2018). "Self-Attention with Relative Position Representations." NAACL-HLT. 3,934 citations.^[11]
Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, Ł.; Shazeer, N.; Ku, A.; Tran, D. (2018). "Image Transformer." ICML. 2,629 citations.^[11]
Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. (2021). "Bottleneck Transformers for Visual Recognition." CVPR. 1,728 citations.^[11]
Vaswani, A.; Bengio, S.; Brevdo, E.; Chollet, F.; Gomez, A. N.; Gouws, S.; Jones, L.; Kaiser, Ł.; Kalchbrenner, N.; Parmar, N.; Sepassi, R.; Shazeer, N.; Uszkoreit, J. (2018). "Tensor2Tensor for Neural Machine Translation." arXiv:1803.07416.^[14]
Vaswani, A.; Zhao, Y.; Fossum, V.; Chiang, D. (2013). "Decoding with Large-Scale Neural Language Models Improves Translation." EMNLP.^[9]^[10]
Essential AI (Vaswani et al.) (2025). "Practical Efficiency of Muon for Pretraining." arXiv:2505.02222.^[20]
Vaswani, A. (2014). Smaller, Faster, and Accurate Models for Statistical Machine Translation (Ph.D. thesis), University of Southern California.^[1]

References

[1] Wikipedia, "Ashish Vaswani", Wikipedia, 2026-05-01. https://en.wikipedia.org/wiki/Ashish_Vaswani. Accessed 2026-05-25.
[2] Vaswani, A. et al., "Attention Is All You Need", NeurIPS, 2017-12-04. https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf. Accessed 2026-05-25.
[3] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; Polosukhin, I., "Attention Is All You Need", arXiv, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-25.
[4] BusinessWire, "AI Transformer Inventors Launch Adept with $65M to Lend a Hand to Knowledge Workers", BusinessWire, 2022-04-26. https://www.businesswire.com/news/home/20220426005963/en/AI-Transformer-Inventors-Launch-Adept-with-$65M-to-Lend-a-Hand-to-Knowledge-Workers. Accessed 2026-05-25.
[5] The Information, "Two Co-Founders of Adept, an OpenAI Rival, Suddenly Left to Start Another Company", The Information, 2022-12. https://www.theinformation.com/briefings/two-co-founders-of-adept-an-openai-rival-suddenly-left-to-start-another-company. Accessed 2026-05-25.
[6] Computerworld, "Essential AI reveals funding, development of full-stack AI automation tools", Computerworld, 2023-12-13. https://www.computerworld.com/article/1611301/essential-ai-reveals-funding-development-of-full-stack-ai-automation-tools.html. Accessed 2026-05-25.
[7] BusinessWire, "Essential AI Raises $56.5M Series A to Build the Enterprise Brain", BusinessWire, 2023-12-11. https://www.businesswire.com/news/home/20231211867788/en/Essential-AI-Raises-$56.5M-Series-A-to-Build-the-Enterprise-Brain. Accessed 2026-05-25.
[8] HPCwire / AIwire, "Essential AI Raises $56.5M Series A to Build the Enterprise Brain", AIwire, 2023-12-12. https://www.hpcwire.com/aiwire/2023/12/12/essential-ai-raises-56-5m-series-a-to-build-the-enterprise-brain/. Accessed 2026-05-25.
[9] USC Viterbi School of Engineering, "Attention Is All You Need: USC Alumni Paved Path for ChatGPT", USC Viterbi News, 2023-03. https://viterbischool.usc.edu/news/2023/03/attention-is-all-you-need-usc-alumni-paved-path-for-chatgpt/. Accessed 2026-05-25.
[10] USC Information Sciences Institute, "Attention Is All You Need: USC Alumni Paved Path for ChatGPT", USC ISI News, 2023-03. https://www.isi.edu/news/54564/attention-is-all-you-need-usc-alumni-paved-path-for-chatgpt/. Accessed 2026-05-25.
[11] Vaswani, A., "Google Scholar profile", Google Scholar, 2026-05-25. https://scholar.google.com/citations?user=oR9sCGYAAAAJ&hl=en. Accessed 2026-05-25.
[12] Birla Institute of Technology Mesra, "Department of Computer Science and Engineering", BIT Mesra, 2026-05. https://bitmesra.irins.org/faculty/index/Department+of+Computer+Science+Engineering. Accessed 2026-05-25.
[13] The Register, "Ex-Googlers to build 'general intelligence' at Adept AI", The Register, 2022-04-27. https://www.theregister.com/2022/04/27/adept_ai_google/. Accessed 2026-05-25.
[14] Vaswani, A. et al., "Tensor2Tensor for Neural Machine Translation", arXiv, 2018-03-16. https://arxiv.org/abs/1803.07416. Accessed 2026-05-25.
[15] Vaswani, A. et al., "Attention Is All You Need (HTML version v7)", arXiv, 2023-08-02. https://arxiv.org/html/1706.03762v7. Accessed 2026-05-25.
[16] TechCrunch, "Adept, a startup training AI to use existing software and APIs, raises $350M", TechCrunch, 2023-03-15. https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/. Accessed 2026-05-25.
[17] TechCrunch, "Amazon hires founders away from AI startup Adept", TechCrunch, 2024-06-28. https://techcrunch.com/2024/06/28/amazon-hires-founders-away-from-ai-startup-adept/. Accessed 2026-05-25.
[18] Maginative, "Essential AI Secures $56.5M in Series A Funding to Build 'Enterprise Brain'", Maginative, 2023-12-13. https://www.maginative.com/article/essential-ai-secures-56-5m-in-series-a-funding-to-build-enterprise-brain/. Accessed 2026-05-25.
[19] Essential AI, "Essential AI homepage and research", Essential AI, 2026-05. https://essential.ai/. Accessed 2026-05-25.
[20] Essential AI, "Practical Efficiency of Muon for Pretraining", arXiv, 2025-05-04. https://arxiv.org/abs/2505.02222. Accessed 2026-05-25.
[21] The Montgomery Summit, "Ashish Vaswani speaker page", The Montgomery Summit, 2026-05. https://montgomerysummit.com/speakers/ashish-vaswani/. Accessed 2026-05-25.
