Ashish Vaswani
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,224 words
Ashish Vaswani (born 1986) is an Indian-American computer scientist known for his work on neural sequence models and as the first-listed author of the 2017 paper "Attention Is All You Need," which introduced the transformer architecture.[1][2] At the time of that publication he was a research scientist at Google Brain, where he and Illia Polosukhin designed and implemented the first transformer models.[3] Vaswani left Google in late 2021 to co-found Adept AI with fellow transformer co-author Niki Parmar and former OpenAI engineering lead David Luan. He and Parmar departed Adept in November 2022 and co-founded Essential AI in 2023.[4][5][6] Essential AI, based in San Francisco, builds open foundation models and pretraining tools; it raised a $56.5 million Series A in December 2023, led by March Capital with participation from Google, Nvidia, AMD, and Thrive Capital.[7][8]
| Born | 1986, India[1] |
| Education | B.Tech., Birla Institute of Technology, Mesra; Ph.D., University of Southern California (2014)[1][9] |
| Doctoral advisors | David Chiang and Liang Huang[9][10] |
| Known for | Co-author of "Attention Is All You Need" (2017); transformer architecture[3] |
| Employers | Google Brain (2016 to 2021); Adept AI (2021 to 2022); Essential AI (2023 to present)[1][4][5][7] |
| Citations | 291,427 (Google Scholar, May 2026); h-index 56[11] |
Vaswani was born in 1986 in India.[1] He completed a Bachelor of Technology in computer science at the Birla Institute of Technology, Mesra (BIT Mesra), one of India's older engineering schools whose computer science department dates to 1983.[1][12] He moved to the United States for graduate study and joined the University of Southern California (USC) as a computer science doctoral student in 2004, working at the USC Information Sciences Institute (ISI), a research center in Marina del Rey known for its long-standing program in machine translation and natural language processing.[9][10] ISI in the mid-2000s was a major hub for statistical machine translation research and the host institution for the GIZA++ word-alignment toolkit and the Joshua decoder, both widely used at the time.[10]
At ISI he worked in the natural language processing group led by Kevin Knight, with primary doctoral advising from David Chiang (now at the University of Notre Dame) and Liang Huang (now at Oregon State University).[9][10] His doctoral research focused on statistical machine translation and, increasingly, on neural language modeling.[9][10] His Ph.D. thesis, defended in 2014, was titled Smaller, Faster, and Accurate Models for Statistical Machine Translation and dealt with reducing the parameter and decoding cost of large translation systems while preserving BLEU scores.[1]
David Chiang has stated that Vaswani was "my first Ph.D. student and one of the very first people to see the potential for deep learning in natural language processing back in 2011."[9][10] Liang Huang has described Vaswani during this period as enthusiastic and unusual in pursuing GPU-based methods before the broader machine translation community took deep learning seriously.[9] In a retrospective interview after the success of the transformer paper, Vaswani himself described the ISI group as a research environment pursuing bold ideas.[10]
While still at USC, Vaswani co-authored two papers that anticipated his later work at Google. The first, "Decoding with Large-Scale Neural Language Models Improves Translation" (2013), integrated a feedforward neural language model into a statistical machine translation decoder and reported gains over n-gram baselines on Arabic-to-English and Chinese-to-English tasks.[9][10] The second, "Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies" (2016), described an efficient training procedure for recurrent neural language models with very large output vocabularies.[9][10]
Vaswani joined Google as a research scientist at Google Brain in 2016, after a postdoctoral period at USC ISI.[1] At Google he worked on sequence models for translation and generation in collaboration with Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, all of whom would later be co-authors on the transformer paper.[3] He remained at Google Brain until late 2021, leaving to co-found Adept AI in November of that year.[4][13]
The Google Brain group during this period was structured around small, fluid teams pursuing both production-relevant translation work (Google's neural machine translation system, which had launched in 2016) and longer-horizon research on attention and sequence modeling.[14] Vaswani's collaborators included Shazeer, a longtime Google researcher who had worked on mixture-of-experts and parameter scaling; Parmar, who had joined Google Brain after a master's degree at USC; Uszkoreit, whose research group had explored attention-based models for text without recurrence; and Polosukhin, who later left Google to co-found the NEAR Protocol blockchain project.[3][14]
During his Google tenure Vaswani contributed to the Google Brain Tensor2Tensor (T2T) project, an open-source library that included a reference implementation of the transformer and was used widely by external researchers.[14] He also worked on extensions of self-attention to vision and on image generation, producing the 2018 paper "Image Transformer" (Parmar, Vaswani, Uszkoreit, Kaiser, Shazeer, Ku, Tran), which adapted the transformer to autoregressive image generation, and the 2018 paper "Self-Attention with Relative Position Representations" (Shaw, Uszkoreit, Vaswani), which proposed relative position encodings later adopted by many downstream models.[11]
The paper "Attention Is All You Need," posted to arXiv on 12 June 2017 and presented at NeurIPS 2017, listed Vaswani as the first author and described the transformer, a neural network architecture for sequence transduction that replaces recurrence and convolution with multi-head self-attention.[3][15] The paper's equal-contribution footnote records that "Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work."[3] The full author list, in published order, is Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, with Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, and Kaiser at Google Brain or Google Research and Polosukhin then at Google.[3]
The transformer paper reported a BLEU score of 28.4 on the WMT 2014 English-to-German translation task and 41.8 on English-to-French, both improving over the best prior published results while requiring substantially less training time on eight P100 GPUs.[3] By May 2026 the paper had accumulated more than 257,000 citations on Google Scholar, the bulk of Vaswani's 291,427 total citations.[11] The transformer became the basis for almost every subsequent large language model, including BERT (Devlin et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and later systems from OpenAI, Google DeepMind, Anthropic, Meta, and others.[3]
The original paper described two key building blocks. Scaled dot-product attention computes attention weights as the softmax of query-key dot products divided by the square root of the key dimension, then uses those weights to take a weighted sum of value vectors.[3] Multi-head attention runs this operation in parallel across multiple independently-projected subspaces, allowing the network to attend to information from different representation subspaces at the same time.[3] The encoder-decoder transformer in the paper had six identical layers in each stack, with the decoder layers including an additional cross-attention sublayer over the encoder output.[3] Position information was injected through sinusoidal positional encodings added to the token embeddings.[3] Subsequent work refined many of these choices: BERT used encoder-only stacks, GPT used decoder-only stacks, and later models replaced sinusoidal positions with learned or relative position encodings.[3][11]
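The building blocks described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's reference implementation: the function names, the per-head weight layout, and the requirement that `d_model` be even are choices made here for brevity, and masking, dropout, and batching are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for queries q, keys k, and values v.

    Shapes: q is (n_queries, d_k), k is (n_keys, d_k), v is (n_keys, d_v).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # query-key dot products, scaled
    weights = softmax(scores, axis=-1)  # each row sums to 1 over the keys
    return weights @ v, weights         # weighted sum of value vectors


def multi_head_attention(x, w_q, w_k, w_v, w_o):
    """Self-attention over x with one projection triple per head.

    Assumed layout (hypothetical, for illustration):
    w_q, w_k, w_v are (n_heads, d_model, d_head); w_o is (n_heads * d_head, d_model).
    """
    heads = [
        scaled_dot_product_attention(x @ wq, x @ wk, x @ wv)[0]
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the per-head outputs and project back to d_model.
    return np.concatenate(heads, axis=-1) @ w_o


def sinusoidal_positions(n_positions, d_model):
    """Sinusoidal positional encodings; d_model is assumed to be even."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, even_dims / d_model)
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angles)  # even dimensions use sine
    enc[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return enc
```

In use, the positional encodings would be added to a `(sequence_length, d_model)` matrix of token embeddings, and the result passed as `x` to `multi_head_attention` together with projection matrices that a real model would learn during training.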
Vaswani left Google in late 2021 to co-found Adept AI alongside Niki Parmar and David Luan, the latter a former OpenAI engineering vice president and Google large-model program lead.[4][13] Adept emerged from stealth on 26 April 2022 with a $65 million Series A funding round led by Greylock and Addition, with participation from Root Ventures and angel investors including Andrej Karpathy, Jaan Tallinn, and Chris Ré.[4] At launch Luan was chief executive officer, Parmar was chief technology officer, and Vaswani was chief scientist.[4][13]
Adept's stated objective was to train a neural network to operate existing software tools, with the company describing an "AI teammate" capable of taking natural-language instructions and performing tasks in productivity applications.[13] The founding team included several other researchers from Google Brain and DeepMind.[13]
Vaswani and Parmar both departed Adept in November 2022, less than a year after the company emerged from stealth.[5][6] Reporting on their exit indicated that the departure was sudden and stemmed in part from differences with investors over the company's research direction.[5] The remaining Adept team continued to operate under Luan and raised a $350 million Series B in March 2023 led by General Catalyst and Spark Capital.[16] In June 2024 Amazon hired Luan and several other Adept co-founders into its AGI team in an arrangement widely described as an acquihire, with Adept itself remaining as a separate licensed entity under a new chief executive.[17]
After leaving Adept, Vaswani and Parmar founded Essential AI in 2023, with Vaswani as chief executive officer and Parmar as chief technology officer.[7][8] The San Francisco-based company emerged from stealth on 13 December 2023 with a $56.5 million Series A funding round led by March Capital and including AMD, Franklin Venture Partners, Google, KB Investment, Nvidia, and Thrive Capital.[7] An earlier $8.3 million seed round had been led by Thrive Capital, bringing total disclosed funding at the Series A to roughly $65 million.[18]
At launch the company described its mission as "deepening the partnership between humans and computers, unlocking collaborative capabilities that far exceed what could be achieved today" and said it would develop full-stack large-language-model products to automate workflows.[7] Over the following two years the company's public focus shifted toward open pretraining research, with a stated mission of "building an open platform to accelerate the science and engineering of deep learning" through pretraining work on frontier STEM and code capabilities.[19]
By mid-2025 Essential AI had begun releasing open datasets and research artifacts. Essential-Web v1.0, a 24-trillion-token web corpus organized for pretraining, was released on Hugging Face.[19] In May 2025 the company published "Practical Efficiency of Muon for Pretraining" (arXiv:2505.02222), a study of the Muon optimizer for large-batch pretraining, reporting that Muon expanded the Pareto frontier over AdamW on the compute-time tradeoff and remained more data-efficient at large batch sizes.[20] The Muon study ran experiments at model sizes up to 4 billion parameters and batch sizes up to 16 million tokens and presented a "telescoping" algorithm for combining Muon with the maximal-update parameterization (muP) to transfer hyperparameters from small to large models.[20] In December 2025 Essential AI released the Rnj-1 (Ramanujan) language model line, including base and instruction-tuned variants, listing Vaswani as an author on the announcement.[19]
The company's public statements over 2024 and 2025 reflect this move away from the enterprise-automation product framing of the December 2023 launch and toward open pretraining research and dataset publication.[7][19] As of May 2026 Vaswani is listed as chief executive officer of Essential AI; the speaker page for The Montgomery Summit identifies him in that role.[21]
Vaswani's central contribution is the transformer, the neural network architecture introduced in "Attention Is All You Need" and now the standard backbone for sequence modeling in natural language processing, computer vision, audio, and protein structure prediction.[3] The transformer uses self-attention layers (specifically scaled dot-product attention with multiple heads), residual connections, layer normalization, and position-wise feedforward networks, dispensing with the recurrent connections of LSTMs and the local receptive fields of convolutional networks.[3] In the architecture, attention weights are computed as the softmax of scaled query-key dot products and used to aggregate value vectors, allowing every position in a sequence to attend to every other position in a single layer.[3]
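For reference, the components named above are commonly written as follows. The notation follows the usual presentation of the 2017 paper, with $d_k$ the key dimension, $h$ the number of heads, and $W$ and $b$ learned parameters; the symbols are supplied here for illustration and do not appear elsewhere in this article.

```latex
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{FFN}(x) &= \max(0,\; x W_1 + b_1)\, W_2 + b_2 \\
\text{sublayer output} &= \mathrm{LayerNorm}\!\left(x + \mathrm{Sublayer}(x)\right)
\end{aligned}
```

Because the product $Q K^{\top}$ scores every query position against every key position, a single layer connects all pairs of positions, at a cost that grows quadratically with sequence length.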
In their footnote describing the division of labor, the authors record that Vaswani and Polosukhin designed and implemented the first transformer models, that Shazeer proposed scaled dot-product attention, multi-head attention, and the parameter-free position representation, and that subsequent refinements were contributed by other authors.[3] After the initial publication, Vaswani co-led further work showing that the architecture generalized beyond translation.[11]
In 2018 Vaswani co-authored "Image Transformer" (Parmar, Vaswani, Uszkoreit, Kaiser, Shazeer, Ku, Tran), which applied self-attention to autoregressive image generation by treating image pixels as tokens and restricting attention to local neighborhoods to control memory cost.[11] The same year he co-authored "Self-Attention with Relative Position Representations" (Shaw, Uszkoreit, Vaswani), which proposed an alternative to the original sinusoidal position encoding by injecting learned representations of the relative offset between tokens directly into the attention computation; this design influenced later position-encoding schemes including those in T5 and Transformer-XL.[11]
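The idea of injecting relative-offset representations into the attention computation can be sketched in NumPy as below. This is a simplified illustration that covers only a key-side offset term for a single head; the function name, the table layout, and the clipping parameter `max_dist` are assumptions made for this example, and in a real model the table `rel_k` would be learned rather than supplied by hand.

```python
import numpy as np


def relative_attention_logits(q, k, rel_k, max_dist):
    """Attention logits with a relative-position term added on the key side.

    q, k:   (n, d) query and key vectors for one head.
    rel_k:  (2 * max_dist + 1, d) table with one embedding per clipped offset
            (hypothetical layout: row 0 is offset -max_dist).
    """
    n, d = q.shape
    # offsets[i, j] = j - i, the signed distance from query i to key j.
    offsets = np.arange(n)[None, :] - np.arange(n)[:, None]
    # Clip long-range offsets so that distant tokens share one embedding.
    idx = np.clip(offsets, -max_dist, max_dist) + max_dist
    a = rel_k[idx]                             # (n, n, d) offset embeddings
    content = q @ k.T                          # q_i . k_j
    position = np.einsum("id,ijd->ij", q, a)   # q_i . a_ij
    return (content + position) / np.sqrt(d)
```

Applying a row-wise softmax to these logits would give the attention weights. Because the added term depends only on the clipped offset between two positions, the same table serves sequences of any length.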
In 2021 Vaswani co-authored "Bottleneck Transformers for Visual Recognition" (Srinivas, Lin, Parmar, Shlens, Abbeel, Vaswani), which replaced the spatial convolutions in the final stage of a ResNet with multi-head self-attention blocks (BoT blocks) and reported gains on ImageNet classification and COCO instance segmentation.[11] The paper was an early example of hybrid convolution-attention vision backbones, a design family that influenced subsequent architectures including ConvNeXt and the Swin Transformer.[11]
Vaswani was a co-author on the 2018 survey "Relational Inductive Biases, Deep Learning, and Graph Networks" led by Peter W. Battaglia at DeepMind, which proposed graph networks as a unifying framework for relational reasoning over structured data and argued for the importance of inductive biases in deep learning architectures.[11] The paper accumulated more than 5,100 citations by May 2026.[11]
While at Google Brain, Vaswani was a co-author on the Tensor2Tensor library, a TensorFlow-based deep-learning toolkit that included reference implementations of the transformer and supporting datasets.[14] Tensor2Tensor was used by external researchers as a common implementation of the transformer in the period immediately following the 2017 paper and supported reproducibility of the original results.[14]
At Essential AI, Vaswani's published research has focused on pretraining efficiency. The 2025 paper "Practical Efficiency of Muon for Pretraining" (arXiv:2505.02222) studied the Muon second-order optimizer at model sizes up to 4 billion parameters and batch sizes up to 16 million tokens, presenting evidence that Muon retained data efficiency at large batch sizes beyond the so-called critical batch size and that it combined effectively with the maximal-update parameterization (muP) for hyperparameter transfer.[20] The associated experimental artifacts were released on Hugging Face.[20] Essential AI also published Essential-Web v1.0, a 24-trillion-token organized pretraining corpus.[19] The company's research output between 2024 and 2026 included work on dataset organization for pretraining, on optimizer behavior at scale, and on the relationship between learning-rate schedules and grokking phenomena.[19][20]
The transformer paper has been described as one of the most consequential publications in machine learning of the 2010s, with its impact framed in retrospect through the explosion of large language models built on transformer backbones in the years that followed.[9][10] USC Viterbi and ISI both profiled Vaswani's role in their 2023 alumni features tied to the rise of ChatGPT.[9][10] In coverage of Essential AI's December 2023 launch, multiple outlets identified Vaswani and Parmar as co-creators of the transformer when describing the new company's pedigree.[7][8][18]
Vaswani's recognition derives primarily from the citation count and downstream impact of the transformer paper, which by May 2026 had accumulated more than 257,000 citations on Google Scholar.[11] He was listed among the speakers at The Montgomery Summit, a technology investment conference, where his biography described him as a co-creator of the transformer and chief executive officer of Essential AI.[21] USC Viterbi profiled him in 2023 as part of a feature on USC alumni whose work paved the path to ChatGPT.[9]
The papers below are listed in approximate order of citation count, with citation counts as of Google Scholar in May 2026.[11]