Noam Shazeer
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,995 words
Noam Shazeer (born 1975 or 1976) is an American computer scientist, mathematician, and software engineer who is one of the most prolific and influential researchers in the modern era of deep learning. He is best known as a co-author of the 2017 paper Attention Is All You Need, which introduced the Transformer architecture that underlies essentially all contemporary large language models, and as the inventor or co-inventor of a long list of foundational techniques, including the sparsely-gated Mixture-of-Experts layer, the Mesh-TensorFlow framework for model parallelism, multi-query attention, GLU activation variants such as SwiGLU and GEGLU, the Adafactor optimizer, the T5 text-to-text transfer transformer, and the Switch Transformer.[1][2][3]
Shazeer spent roughly two decades at Google after joining in 2000, working as an early engineer on advertising and search infrastructure before moving to the Google Brain research team in 2012. He left Google in October 2021, reportedly after the company declined to publicly release a conversational chatbot he had built with Daniel De Freitas, and co-founded the consumer AI company Character.AI.[4][5] In August 2024, in one of the highest-profile "reverse acqui-hire" transactions in Silicon Valley history, Google paid approximately $2.7 billion to license Character.AI's technology and rehired Shazeer, who became co-technical lead of the Gemini model effort alongside Jeff Dean and Oriol Vinyals.[6][7][8]
In 2023 Time magazine named Shazeer one of the 100 most influential people in artificial intelligence, and in February 2026 he was elected to the U.S. National Academy of Engineering as part of its Class of 2026.[9][10] His record of single-author papers, the breadth of his foundational contributions, and his role at the head of one of the leading frontier-model programs have made him one of the most-discussed individual figures in the post-ChatGPT period of artificial-intelligence development.
| Field | Detail |
|---|---|
| Born | 1975 or 1976 (Philadelphia, Pennsylvania, U.S.)[1] |
| Nationality | American |
| Education | Duke University, B.S. in mathematics and computer science (1994–1998)[1] |
| Known for | Co-author of Attention Is All You Need; Mixture-of-Experts; T5; Mesh-TensorFlow; multi-query attention; Adafactor; SwiGLU/GEGLU; Switch Transformer[1][2] |
| Affiliations | Google (2000–2021), Character.AI (2021–2024), Google / Google DeepMind (2024–present)[1][5] |
| Current role | VP of Engineering, co-technical lead of Google Gemini[7][11] |
| Honors | Time 100 in AI (2023); National Academy of Engineering (2026)[9][10] |
Shazeer was born in Philadelphia, Pennsylvania, in 1975 or 1976. He attended grade school at Cohen Hillel Academy in Marblehead, Massachusetts, and Swampscott High School in Swampscott, Massachusetts.[1] As a high school student he competed on the U.S. team at the 1994 International Mathematical Olympiad in Hong Kong, where he won a gold medal with a perfect score.[1][12]
From 1994 to 1998 he studied mathematics and computer science at Duke University, where he held an Angier B. Duke Memorial Scholarship (Duke's most selective merit scholarship) and was a star member of Duke's prize-winning Putnam team. In his first semester he placed sixth in the nation on the William Lowell Putnam Mathematical Competition, and over his undergraduate career he helped lead Duke to first-place and second-place finishes at the Putnam in 1996 and 1997 respectively, making Duke the only school besides Harvard to win the team competition during the 1990s.[1][13] He earned a Bachelor of Science degree in mathematics and computer science from Duke in 1998, and briefly entered a graduate program at the University of California, Berkeley, before leaving without completing a doctorate.[1] His Olympiad and Putnam track record, taken together with his later research output, has been frequently cited as evidence that elite competitive-mathematics ability can translate directly into productive machine-learning research.
Shazeer joined Google in 2000, when the company was a young start-up with roughly two hundred employees.[1] One of his earliest contributions was a substantial rewrite of Google's web-search spell-checker. The new spell-checker used statistical models trained on the web's own text to detect and correct misspellings, famously suggesting "Britney Spears" when a user typed "pritany spears," and became one of the search engine's most-used auxiliary services.[1][14]
He later co-developed, with fellow early Google engineer Georges Harik, a probabilistic page-classification system known internally as PHIL (typically expanded as "Probabilistic Hierarchical Inferential Learner"), which categorized web pages by topic by learning the co-occurrence patterns of terms and concepts.[1][14][29] PHIL was used to match contextually relevant ads to publisher pages and, according to Steven Levy's In the Plex, was the in-house technology that actually powered Google's content-targeted advertising product, AdSense, even though the AdSense brand had been inherited from the externally acquired Applied Semantics company.[29] Shazeer's contributions in this period helped him earn a reputation inside Google as an unusually productive engineer with a deep grasp of large-scale statistical modeling, and he was promoted into Google's senior technical ranks long before the company's modern research arm existed.
In 2012 Shazeer moved to the newly formed Google Brain research group, where Jeff Dean and others were experimenting with applying very large neural networks to speech, vision, and language tasks.[1][14] Over the following decade he became one of Brain's most prolific authors and was widely regarded as the team's principal architect of large-scale neural-language models.
Across the 2010s Shazeer's research output focused on three intertwined themes: (1) increasing model capacity through conditional computation and sparsity; (2) increasing model throughput through better attention mechanisms, optimizers, and parallelism primitives; and (3) unifying NLP tasks under a single sequence-to-sequence formulation. Many of the techniques he and his collaborators introduced are now standard components of frontier large language models. In a 2025 retrospective interview on the Dwarkesh Podcast, Shazeer and Jeff Dean discussed his role across this era; Dwarkesh Patel summarized that Shazeer had "invented or co-invented all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh-TensorFlow, to Gemini and many other things."[30]
His first major Brain-era contribution was Exploring the Limits of Language Modeling (Józefowicz, Vinyals, Schuster, Shazeer & Wu, 2016), an empirical study of scaling LSTM language models on the One Billion Word Benchmark that prefigured the systematic scaling work he would do later in the decade.
In January 2017 Shazeer and collaborators (Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean) published Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.[2] The paper introduced a layer consisting of up to thousands of small feed-forward expert sub-networks plus a trainable gating network that selects only a small sparse subset of experts to evaluate for each input. By activating only a few experts per token, the layer makes it possible to expand a model's parameter count by orders of magnitude while keeping its per-example computational cost roughly fixed.[2]
Applied between stacked LSTM layers, the technique produced language and translation models with up to 137 billion parameters that significantly outperformed dense state-of-the-art models at comparable compute.[2] The Sparsely-Gated MoE layer is the direct ancestor of modern Mixture-of-Experts language models: from Google's GShard, Switch Transformer and GLaM, to Mixtral, DeepSeek's MoE models, the MoE variants of Gemini, and (according to unconfirmed reports) GPT-4. The paper also established two ideas that have proven durable across subsequent MoE work: (i) using a learned, sparse top-k gate to route tokens to a small subset of experts, and (ii) adding auxiliary load-balancing losses to keep expert utilisation roughly uniform during training.[2]
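The routing idea can be illustrated with a minimal sketch in plain Python. It is an assumption-laden toy rather than the paper's implementation: it omits the noise term and the load-balancing losses, and the names `top_k_gate` and `moe_layer`, as well as the toy experts, are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Keep the k largest gate logits and renormalise them with a softmax.

    Returns {expert_index: gate_weight}; every other expert has weight zero.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_layer(x, gate_w, experts, k=2):
    """x: input vector; gate_w: one gating weight row per expert;
    experts: list of callables mapping a vector to a vector of the same size."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for index, weight in gates.items():
        y = experts[index](x)  # only the k selected experts are ever evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, gates

# Hypothetical toy experts: each one just scales its input by a constant.
toy_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
toy_gate_w = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]]
output, chosen = moe_layer([1.0, 0.0], toy_gate_w, toy_experts, k=2)
```

Because only `k` experts run per input, adding more experts increases the parameter count without increasing the per-input compute, which is the property the paper exploits.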
In June 2017 Shazeer was one of eight co-authors of Attention Is All You Need, published on arXiv on 12 June 2017 and presented at NeurIPS later that year.[3][15] The paper introduced the Transformer architecture, a sequence-to-sequence neural network that replaces recurrence and convolution with a stack of scaled dot-product self-attention layers and position-wise feed-forward networks.[3]
All eight authors (Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin) are credited as equal contributors and listed in randomized order.[3][15] In the paper's contribution footnote, Shazeer is specifically credited with proposing scaled dot-product attention, multi-head attention, and the parameter-free position representation, and with being "the other person involved in nearly every detail."[15] The paper has since become one of the most cited works in computer science and is the foundational reference for modern large language models.
Several of the design choices the paper attributes to Shazeer, including scaled dot-product attention (dividing the dot-product by the square root of the head dimension to stabilise softmax gradients), splitting attention into multiple parallel heads, and using sinusoidal position encodings, are now textbook material; subsequent work on attention variants (multi-query attention, grouped-query attention, FlashAttention, rotary position embeddings, etc.) has refined rather than replaced this basic structure.[3][15]
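As an illustration, scaled dot-product attention for a single head can be written in a few lines of plain Python. This is a minimal sketch, not the paper's code: it omits masking, the learned projections, and the multi-head split, and the function names are chosen here for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """queries: list of d_k-vectors; keys: list of d_k-vectors;
    values: list of d_v-vectors, one per key."""
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)  # the 1/sqrt(d_k) factor described above
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)  # one weight per key, summing to 1
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# With identical keys every value gets equal weight, so the output is their mean.
mean_out = scaled_dot_product_attention(
    [[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]])
```

Without the scale factor, dot products grow with the head dimension and push the softmax toward a one-hot distribution, where its gradients become very small.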
Training models with billions of parameters quickly outgrew the memory of any single accelerator. In November 2018 Shazeer and a Google Brain team (Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman) released Mesh-TensorFlow: Deep Learning for Supercomputers, which was presented at NeurIPS 2018.[16] Mesh-TensorFlow is a language for specifying distributed tensor computations in which the user can declare that any tensor dimension is split across any dimension of a multi-dimensional mesh of processors. This allowed model parallelism (splitting individual weight tensors across many chips) to be expressed in the same way as the more common data parallelism (splitting the batch across chips). The paper demonstrated Mesh-TensorFlow on Transformer models with up to 5 billion parameters on TPU meshes of up to 512 cores.[16] Mesh-TensorFlow was an important conceptual precursor to today's large-model parallelism systems such as GSPMD, JAX's pjit / shard_map, and DeepSpeed.
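The idea of splitting named tensor dimensions across mesh dimensions can be simulated on a single machine. The sketch below is plain Python and does not use the Mesh-TensorFlow API; the helper names are hypothetical. It splits the batch dimension of an input across one simulated mesh axis and the output dimension of a weight matrix across another, then checks that the reassembled result matches an unsharded matrix multiply.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def split_rows(m, parts):
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

def split_columns(m, parts):
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def sharded_matmul(x, w, batch_parts, model_parts):
    """Simulate a (batch_parts x model_parts) processor mesh.

    Rows of x (the batch dimension) are split over the first mesh axis,
    which is data parallelism; columns of w (the output dimension) are
    split over the second axis, which is model parallelism.
    """
    out_rows = []
    for x_shard in split_rows(x, batch_parts):
        # Each (x_shard, w_shard) pair stands in for one processor's local work.
        blocks = [matmul(x_shard, w_shard)
                  for w_shard in split_columns(w, model_parts)]
        for r in range(len(x_shard)):
            out_rows.append([v for block in blocks for v in block[r]])
    return out_rows

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 0], [0, 1, 0, 2]]
assert sharded_matmul(x, w, batch_parts=2, model_parts=2) == matmul(x, w)
```

In this layout no processor ever holds the full weight matrix, which is what lets models larger than a single accelerator's memory be trained.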
Together with Mitchell Stern, Shazeer published Adafactor: Adaptive Learning Rates with Sublinear Memory Cost at ICML 2018.[17] Standard adaptive optimizers such as Adam maintain two extra tensors per parameter (first and second moments), so the optimizer state alone takes twice as much memory as the weights themselves. Adafactor maintains only per-row and per-column statistics of each weight matrix's squared gradients, reconstructing per-parameter second-moment estimates from these factorized statistics. Combined with update clipping and an increasing decay-rate schedule for the second-moment estimate, the optimizer achieves results comparable to Adam while using sublinear auxiliary memory, making it possible to train much larger Transformer models on the same hardware.[17] Adafactor became the default optimizer for T5 and many later very-large-model training runs.
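The factorization can be sketched for a single gradient matrix as follows. This is a simplified illustration in plain Python, not a full optimizer: it handles one step only and leaves out the moving averages, update clipping, and learning-rate scaling, and the function names are hypothetical.

```python
def factored_second_moment(grad, eps=1e-30):
    """Return (row_sums, col_sums, approx) for the squared gradient matrix.

    approx[i][j] = row_sums[i] * col_sums[j] / total is a rank-1 estimate
    of grad[i][j] ** 2 built from only n_rows + n_cols stored numbers.
    """
    squared = [[g * g + eps for g in row] for row in grad]
    row_sums = [sum(row) for row in squared]
    col_sums = [sum(col) for col in zip(*squared)]
    total = sum(row_sums)
    approx = [[r * c / total for c in col_sums] for r in row_sums]
    return row_sums, col_sums, approx

def optimizer_state_sizes(n_rows, n_cols):
    """Numbers stored per weight matrix: (Adam, factored second moment)."""
    adam = 2 * n_rows * n_cols   # full first- and second-moment tensors
    factored = n_rows + n_cols   # one statistic per row and per column
    return adam, factored

rows, cols, approx = factored_second_moment([[1.0, 2.0], [2.0, 4.0]])
adam_size, factored_size = optimizer_state_sizes(4096, 4096)
```

For a 4096 × 4096 weight matrix the factored statistics need 8,192 numbers, against 33,554,432 for Adam's two full moment tensors. The reconstruction is exact when the squared-gradient matrix has rank one and approximate otherwise.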
In October 2019, Colin Raffel, Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu published Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, introducing the T5 model family and the C4 (Colossal Clean Crawled Corpus) pre-training dataset.[18] T5 recast every NLP task (translation, summarization, classification, question answering) as a text-to-text problem: a string in, a string out. This unified formulation, combined with a systematic empirical study of pre-training objectives, model sizes, and dataset sizes, established a new state of the art on a wide range of NLP benchmarks and influenced the design of virtually every subsequent encoder-decoder language model.[18]
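The text-to-text framing amounts to prepending a task prefix to the input and treating the label as an output string. The sketch below uses the task prefixes shown in the T5 paper's overview figure; the function name and the `task` keys are hypothetical and are not part of any T5 library.

```python
def to_text_to_text(task, **fields):
    """Format one example as the input string a text-to-text model would see."""
    if task == "translate_en_de":
        return "translate English to German: " + fields["text"]
    if task == "summarize":
        return "summarize: " + fields["text"]
    if task == "cola":  # grammatical acceptability classification
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":  # sentence similarity, with the score emitted as text
        return "stsb sentence1: {} sentence2: {}".format(
            fields["sentence1"], fields["sentence2"])
    raise ValueError("unknown task: " + task)

# Every task shares one interface: a string in, and a string target out.
examples = [
    (to_text_to_text("translate_en_de", text="That is good."), "Das ist gut."),
    (to_text_to_text("cola", sentence="The course is jumping well."),
     "not acceptable"),
]
```

Because classification labels and even regression scores are emitted as text, the same model, loss, and decoding procedure can be used for every task.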
In November 2019 Shazeer single-authored Fast Transformer Decoding: One Write-Head Is All You Need, which proposed multi-query attention (MQA).[19] In a standard multi-head attention layer, each attention head maintains its own keys and values; during autoregressive decoding this requires loading a separate key/value cache per head per layer, and the resulting memory bandwidth becomes the dominant cost of inference. MQA shares a single set of keys and values across all heads, shrinking the key/value cache by a factor equal to the number of heads and substantially improving decoding throughput, with only minor quality degradation.[19] MQA, and its later generalization grouped-query attention (GQA), are now standard in production large language models such as PaLM 2, LLaMA, Mistral, and the Gemini family.
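The saving can be checked with simple arithmetic. The model configuration below (32 layers, 32 query heads, head dimension 128, 16-bit values) is a hypothetical example chosen for round numbers and does not describe any particular model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for autoregressive decoding."""
    keys_and_values = 2  # one key tensor and one value tensor per layer
    return (keys_and_values * layers * kv_heads * head_dim
            * seq_len * batch * bytes_per_value)

MIB = 1024 ** 2

# Hypothetical configuration with a 4,096-token context.
multi_head = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
grouped = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
multi_query = kv_cache_bytes(layers=32, kv_heads=1, head_dim=128, seq_len=4096)

print(multi_head // MIB, grouped // MIB, multi_query // MIB)  # prints: 2048 512 64
```

Under these assumptions the per-sequence cache falls from 2 GiB with one key/value set per head to 64 MiB with a single shared set, and grouped-query attention with eight key/value heads sits in between at 512 MiB.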
In February 2020 Shazeer published a short single-author paper, GLU Variants Improve Transformer, which explored replacing the standard ReLU/GELU activation in the Transformer feed-forward sub-layer with variants of the Gated Linear Unit (GLU).[20] Among the variants introduced were GEGLU (using GELU as the gating non-linearity) and SwiGLU (using Swish). The paper showed that GEGLU and SwiGLU produced consistent improvements in perplexity over ReLU and GELU baselines.[20] SwiGLU was subsequently adopted as the standard feed-forward activation in PaLM, LLaMA, Mistral, Gemini and many other modern large language models.
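A gated feed-forward sub-layer of this kind can be sketched in plain Python. The code follows the bias-free form FFN(x) = (act(xW) ⊙ xV)W₂, with Swish giving SwiGLU and GELU giving GEGLU; the helper names and the tiny weights in the usage lines are illustrative assumptions.

```python
import math

def swish(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def vecmat(x, m):
    """Multiply row vector x (length n) by matrix m (n rows, any number of columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, m)) for j in range(len(m[0]))]

def glu_ffn(x, w, v, w2, activation=swish):
    """Gated feed-forward layer: activation=swish is SwiGLU, activation=gelu is GEGLU."""
    gate = [activation(h) for h in vecmat(x, w)]     # non-linear gating branch
    value = vecmat(x, v)                             # linear branch
    hidden = [g * u for g, u in zip(gate, value)]    # element-wise product
    return vecmat(hidden, w2)                        # project back to model width

swiglu_out = glu_ffn([1.0], [[1.0]], [[2.0]], [[1.0]])
geglu_out = glu_ffn([1.0], [[1.0]], [[2.0]], [[1.0]], activation=gelu)
```

A gated layer has three weight matrices where a standard feed-forward layer has two, so the paper shrinks the hidden width by a factor of two-thirds to keep parameter counts comparable.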
In January 2021, William Fedus, Barret Zoph, and Shazeer published Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity.[21] The Switch Transformer simplified the Mixture-of-Experts routing scheme by sending each token to exactly one expert (top-1 routing), which substantially reduced the communication and load-balancing costs of MoE training.[21] Built on top of T5, the Switch Transformer achieved up to seven-times speed-ups in pre-training over dense baselines at the same compute budget and was scaled to over one trillion total parameters, among the first publicly described language models to reach that scale. The paper was published in the Journal of Machine Learning Research in 2022.[21] The Switch Transformer's simplified routing and load-balancing scheme influenced later production MoE models, although several of them, including Mixtral (which routes each token to two of eight experts) and DeepSeek-MoE, send each token to more than one expert rather than using strict top-1 routing.
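Top-1 routing and the accompanying load-balancing term can be sketched as follows. The auxiliary loss uses the form α · N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens sent to expert i and Pᵢ is the mean router probability for expert i. This plain-Python sketch omits expert capacity limits and token dropping, and the function name is hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def switch_route(router_logits, alpha=0.01):
    """router_logits: one list of per-expert logits for each token.

    Returns (assignment, aux_loss), where assignment[t] is the single
    expert chosen for token t and aux_loss penalises uneven expert use.
    """
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    assignment = [p.index(max(p)) for p in probs]  # top-1: one expert per token
    frac_tokens = [assignment.count(i) / n_tokens for i in range(n_experts)]
    mean_prob = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    aux_loss = alpha * n_experts * sum(
        f * p for f, p in zip(frac_tokens, mean_prob))
    return assignment, aux_loss

balanced = switch_route([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]])
collapsed = switch_route([[2.0, 0.0]] * 4)
```

When tokens are spread evenly the auxiliary loss equals `alpha`, its minimum; when the router collapses onto one expert, as in `collapsed`, the loss is larger, which pushes training back toward balanced expert use.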
Shazeer was one of dozens of contributors to Google's PaLM: Scaling Language Modeling with Pathways, the 540-billion-parameter dense Transformer that briefly held the title of largest publicly described language model and that demonstrated strong few-shot performance on hundreds of language and reasoning benchmarks.[22] PaLM was trained on 6,144 TPU v4 chips across two pods using Google's then-new Pathways system; many of its architectural choices, including SwiGLU feed-forwards, multi-query attention, parallel attention/feed-forward layers, and rotary positional embeddings, build directly on Shazeer-era Brain research, with the SwiGLU activation and multi-query attention components originating in Shazeer's own single-author papers.[22][19][20] PaLM's release in April 2022 was widely seen as Google's headline counter to OpenAI's GPT-3 and was the immediate technical antecedent of the Bard / Gemini effort.
Inside Google, Shazeer and Daniel De Freitas led the development of a large open-domain dialogue model originally called Meena and later renamed LaMDA (Language Model for Dialogue Applications).[4][5][23] Meena, announced in a 2020 Google AI blog post, was at the time the largest open-domain chatbot model ever trained, with 2.6 billion parameters; it was followed by LaMDA, a Transformer-based 137-billion-parameter model fine-tuned for dialogue.[23] The team built a system capable of strikingly fluent multi-turn conversation; however, Google's leadership declined to release the model publicly, citing concerns about safety, fairness, and the reputational risk of consumer-facing chatbots. According to The Wall Street Journal, Shazeer and De Freitas argued that releasing the chatbot would generate the user feedback necessary to improve safety in production, while senior leadership preferred a more cautious enterprise rollout. The disagreement over public release was widely reported as a principal reason for Shazeer's and De Freitas's eventual departure from the company in 2021.[4][5]
Shazeer left Google in October 2021.[5] Together with Daniel De Freitas he co-founded Character Technologies, Inc., doing business as Character.AI, which incorporated in November 2021.[24] Character.AI's product was a consumer web (and later mobile) application that let users create and chat with AI characters, including original personas, fictional figures, historical figures, or assistants, built on top of a custom large language model trained in-house.[24]
The company raised an initial $43 million in seed funding shortly after incorporation, and in March 2023 closed a $150 million Series A round led by Andreessen Horowitz at an approximately $1 billion valuation, with participation from Nat Friedman, Elad Gil, SV Angel, and A.Capital.[25][26] Character.AI's public beta launched in September 2022, its iOS and Android apps launched in May 2023 (and were downloaded more than 1.7 million times in their first week), and the optional Character.AI+ subscription tier was introduced the same month.[24]
By early 2024 Character.AI had become one of the most-used consumer AI products on the web, with millions of daily active users, the majority of them under 30; the company reported roughly 3.5 million daily visitors as of January 2024.[24] Throughout this period Shazeer served as Character.AI's chief executive officer and chief technical officer; De Freitas served as president and co-led research.[24]
Shazeer's stated thesis for Character.AI was that conversational AI should be a mass-market consumer product rather than purely an enterprise tool, and that a small team able to ship directly to end users would iterate faster on safety, personality, and product fit than a research division embedded inside a large incumbent. In media appearances during 2022–2024 he repeatedly argued that compute, not data or algorithms, was the binding constraint on progress, and that scaling existing Transformer-based architectures would continue to produce qualitative gains in capability.[14]
On 2 August 2024 Google and Character.AI jointly announced an arrangement in which Google would license Character.AI's underlying model technology on a non-exclusive basis, provide Character.AI with substantial additional funding, and hire Shazeer, De Freitas, and roughly 30 members of Character.AI's research team into Google.[6] Press reporting from Reuters, The Information, the Wall Street Journal and others described the transaction as a "reverse acqui-hire" worth approximately $2.7 billion, paid out as licensing fees rather than as an acquisition price, in a structure that some analysts argued was designed to avoid antitrust review.[7][8][27] Subsequent reporting indicated that the U.S. Department of Justice opened an inquiry into the deal's structure.[8] Press accounts estimated Shazeer's personal share of the proceeds at $750 million–$1 billion.[1]
At Google, Shazeer was given the title of Vice President of Engineering and was announced (in an internal memo from Google DeepMind chief Demis Hassabis that was reported by The Information) as one of three co-technical leads of the Gemini model effort, working alongside Jeff Dean (Google's chief scientist) and Oriol Vinyals.[11] In its public statement, Google described Shazeer as "a preeminent researcher in machine learning."[6]
Since rejoining Google, Shazeer has been associated with the development of the Gemini 2.x model family, including Gemini 2.5 Pro and Gemini 2.5 Flash, and with Gemini 3, which launched in November 2025 and was deployed into Google Search on the same day.[28] He continues to advocate publicly for further scaling of training compute, for mixture-of-experts architectures, and for direct consumer deployment of frontier conversational AI, positions that closely mirror the ones that led to his 2021 departure.[14]
In a February 2025 Dwarkesh Podcast interview alongside Jeff Dean, Shazeer described his ambition to build a single mixture-of-experts model that could be continuously grown and updated, a research vision that closely matches the Pathways architecture Google has been describing publicly since 2021, and forecast that automated AI research itself would soon become a dominant source of progress.[30] Press coverage of his return has framed Shazeer as one of a small number of senior individual researchers whose presence is considered strategically necessary for a frontier-model effort, comparable in influence to Ilya Sutskever at OpenAI or Yann LeCun at Meta.[14]