# Noam Shazeer

> Source: https://aiwiki.ai/wiki/noam_shazeer
> Updated: 2026-06-22
> Categories: AI Research, Google, People
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Noam Shazeer** (born 1975 or 1976) is an American computer scientist and software engineer who is one of the most prolific and influential researchers in the modern era of deep learning, best known as a co-author of the 2017 paper *Attention Is All You Need* and as the inventor or co-inventor of many of the core techniques inside today's large language models. The paper, which he co-wrote with seven Google colleagues, introduced the [Transformer](/wiki/transformer) architecture that now underlies essentially all contemporary large language models, and Shazeer is separately credited with the sparsely-gated [Mixture-of-Experts](/wiki/mixture_of_experts) layer, the Mesh-TensorFlow framework for model parallelism, multi-query attention, GLU activation variants such as SwiGLU and GEGLU, the Adafactor optimizer, the [T5](/wiki/t5) text-to-text transfer transformer, and the [Switch Transformer](/wiki/switch_transformer).[^1][^2][^3]

Shazeer spent roughly two decades at Google (joining in 2000), where he was an early engineer on advertising and search infrastructure before moving to the [Google Brain](/wiki/google_brain) research team in 2012. He left Google in October 2021, reportedly after the company declined to publicly release a conversational chatbot he had built with Daniel De Freitas, and co-founded the consumer AI company [Character.AI](/wiki/character_ai).[^4][^5] In August 2024, in one of the highest-profile "reverse acqui-hire" transactions in Silicon Valley history, Google paid approximately $2.7 billion to license Character.AI's technology and brought Shazeer back, where he became co-technical lead of the [Gemini](/wiki/gemini) model effort alongside Jeff Dean and Oriol Vinyals.[^6][^7][^8] Less than two years later, on 18 June 2026, Shazeer announced that he was leaving Google to join [OpenAI](/wiki/openai) as its Lead for Architecture Research.[^31][^32]

In 2023 *Time* magazine named Shazeer one of the 100 most influential people in artificial intelligence, and in February 2026 he was elected to the U.S. National Academy of Engineering as part of its Class of 2026.[^9][^10] His record of single-author papers, unusually broad set of foundational contributions, and successive senior roles in two leading frontier-model programs have made him one of the most-discussed individual figures in the post-ChatGPT period of artificial-intelligence development.

## Key facts

| Field | Detail |
| --- | --- |
| Born | 1975 or 1976 (Philadelphia, Pennsylvania, U.S.)[^1] |
| Nationality | American |
| Education | Duke University, B.S. in mathematics and computer science (1994-1998)[^1] |
| Known for | Co-author of *Attention Is All You Need*; Mixture-of-Experts; T5; Mesh-TensorFlow; multi-query attention; Adafactor; SwiGLU/GEGLU; Switch Transformer[^1][^2] |
| Affiliations | Google (2000-2021), Character.AI (2021-2024), Google / Google DeepMind (2024-2026), OpenAI (2026-present)[^1][^5][^31] |
| Current role | Lead for Architecture Research, OpenAI (from June 2026)[^31][^32] |
| Honors | *Time* 100 in AI (2023); National Academy of Engineering (2026)[^9][^10] |

## Early life and education

Shazeer was born in Philadelphia, Pennsylvania, in 1975 or 1976. He attended grade school at Cohen Hillel Academy in Marblehead, Massachusetts, and Swampscott High School in Swampscott, Massachusetts.[^1] As a high school student he competed on the U.S. team at the 1994 International Mathematical Olympiad in Hong Kong, where he won a gold medal with a perfect score.[^1][^12]

From 1994 to 1998 he studied mathematics and computer science at Duke University, where he held an Angier B. Duke Memorial Scholarship (Duke's most selective merit scholarship) and was a star member of Duke's prize-winning Putnam team. In his first semester he placed sixth in the nation on the William Lowell Putnam Mathematical Competition, and over his undergraduate career he helped lead Duke to first-place and second-place finishes at the Putnam in 1996 and 1997 respectively, making Duke the only school besides Harvard to win the team competition during the 1990s.[^1][^13] He earned a Bachelor of Science degree in mathematics and computer science from Duke in 1998, and briefly entered a graduate program at the University of California, Berkeley, before leaving without completing a doctorate.[^1] His Olympiad and Putnam track record, taken together with his later research output, has been frequently cited as evidence that elite competitive-mathematics ability can translate directly into productive machine-learning research.

## What did Noam Shazeer do at Google (2000-2021)?

### Early years: search, spell-checking, and PHIL/AdSense

Shazeer joined Google in 2000, when the company was a young start-up with roughly two hundred employees.[^1] One of his earliest contributions was a substantial rewrite of Google's web-search spell-checker. The new spell-checker used statistical models trained on the web's own text to detect and correct misspellings, famously suggesting "Britney Spears" when a user typed "pritany spears," and became one of the search engine's most-used auxiliary services.[^1][^14]

He later co-developed, with fellow early Google engineer Georges Harik, a probabilistic page-classification system known internally as **PHIL** (typically expanded as "Probabilistic Hierarchical Inferential Learner"), which categorized web pages by topic by learning the co-occurrence patterns of terms and concepts.[^1][^14][^29] PHIL was used to match contextually relevant ads to publisher pages and, according to Steven Levy's *In the Plex*, was the in-house technology that actually powered Google's content-targeted advertising product, AdSense, even though the AdSense brand had been inherited from the externally acquired Applied Semantics company.[^29] Shazeer's contributions in this period helped him earn a reputation inside Google as an unusually productive engineer with a deep grasp of large-scale statistical modeling, and he was promoted into Google's senior technical ranks long before the company's modern research arm existed.

### Google Brain (2012-2021)

In 2012 Shazeer moved to the newly formed Google Brain research group, where Jeff Dean and others were experimenting with applying very large neural networks to speech, vision, and language tasks.[^1][^14] Over the following decade he became one of Brain's most prolific authors and was widely regarded as the team's principal architect of large-scale neural-language models.

Across the 2010s Shazeer's research output focused on three intertwined themes: (1) increasing model *capacity* through conditional computation and sparsity; (2) increasing model *throughput* through better attention mechanisms, optimizers, and parallelism primitives; and (3) unifying NLP tasks under a single sequence-to-sequence formulation. Many of the techniques he and his collaborators introduced are now standard components of frontier large language models. In a 2025 retrospective interview on the *Dwarkesh Podcast*, Shazeer and Jeff Dean discussed his role across this era; Dwarkesh Patel summarized that Shazeer had "invented or co-invented all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh-TensorFlow, to Gemini and many other things."[^30]

His first major Brain-era contribution was *Exploring the Limits of Language Modeling* (Józefowicz, Vinyals, Schuster, Shazeer & Wu, 2016), an empirical study of scaling LSTM language models on the One Billion Word Benchmark that prefigured the systematic scaling work he would do later in the decade.

## What are Noam Shazeer's main research contributions?

### Sparsely-gated Mixture-of-Experts (2017)

In January 2017 Shazeer and collaborators (Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean) published *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer*.[^2] The paper introduced a layer consisting of up to thousands of small feed-forward expert sub-networks plus a trainable gating network that selects only a small subset of experts to evaluate for each input. Because only a few experts are active per token, the layer makes it possible to expand a model's parameter count by orders of magnitude while keeping its per-example computational cost roughly fixed.[^2]

Applied between stacked LSTM layers, the technique produced language and translation models with up to 137 billion parameters that significantly outperformed dense state-of-the-art models at comparable compute.[^2] The Sparsely-Gated MoE layer is the direct ancestor of modern [Mixture-of-Experts](/wiki/mixture_of_experts) language models, from Google's GShard, [Switch Transformer](/wiki/switch_transformer) and GLaM to Mixtral, DeepSeek's MoE models, and the MoE variants of Gemini; GPT-4 is also widely reported, though not officially confirmed, to use an MoE design. The paper also established two ideas that have proven durable in subsequent MoE work: (i) using a learned, sparse top-*k* gate to route tokens to a small subset of experts, and (ii) adding auxiliary load-balancing losses to keep expert utilisation roughly uniform during training.[^2]
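The two ideas above can be illustrated with a minimal, dependency-free sketch. It is a simplification and not the paper's implementation: the published gate adds trainable noise before the top-*k* selection, which is omitted here, and the names `top_k_gate`, `moe_layer`, and `cv_squared` are hypothetical. The balance measure shown is the squared coefficient of variation of per-expert load, which is zero when load is perfectly uniform.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest gate logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)                      # subtract max for stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_layer(x, gate_logits, experts, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g > 0.0:                                      # unselected experts are skipped
            out = [o + g * y for o, y in zip(out, expert(x))]
    return out, gates

def cv_squared(load):
    """Squared coefficient of variation of per-expert load (0 means uniform)."""
    mean = sum(load) / len(load)
    var = sum((v - mean) ** 2 for v in load) / len(load)
    return var / (mean ** 2 + 1e-10)

# Four toy "experts" that simply scale their input by a constant (illustrative only).
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, gates = moe_layer([1.0, -1.0], [0.1, 2.0, 2.0, -1.0], experts, k=2)
```

With `k=2`, only two of the four experts run for this input, so the compute per example stays fixed no matter how many experts the layer holds.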

### Attention Is All You Need (2017)

In June 2017 Shazeer was one of eight co-authors of *Attention Is All You Need*, published on arXiv on 12 June 2017 and presented at NeurIPS later that year.[^3][^15] The paper introduced the [Transformer](/wiki/transformer) architecture, a sequence-to-sequence neural network that replaces recurrence and convolution with a stack of scaled dot-product self-attention layers and position-wise feed-forward networks.[^3]

All eight authors (Ashish Vaswani, Noam Shazeer, Niki Parmar, [Jakob Uszkoreit](/wiki/jakob_uszkoreit), Llion Jones, [Aidan N. Gomez](/wiki/aidan_gomez), [Łukasz Kaiser](/wiki/lukasz_kaiser), and [Illia Polosukhin](/wiki/illia_polosukhin)) are credited as equal contributors and listed in randomized order.[^3][^15] In the paper's contribution footnote, Shazeer is specifically credited with proposing scaled dot-product attention, multi-head attention, and the parameter-free positional representation, and with being "the other person involved in nearly every detail."[^15] The paper has since become one of the most cited works in computer science and is the foundational reference for modern large language models. For the full architecture and history, see [Attention Is All You Need](/wiki/attention_is_all_you_need).

Several of the design choices the paper attributes to Shazeer, including scaled dot-product attention (dividing the dot-product by the square root of the head dimension to stabilise softmax gradients), splitting attention into multiple parallel heads, and using sinusoidal position encodings, are now textbook material; subsequent work on attention variants (multi-query attention, grouped-query attention, FlashAttention, rotary position embeddings, etc.) has refined rather than replaced this basic structure.[^3][^15]
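Scaled dot-product attention computes softmax(QKᵀ / √d_k)·V. The following is a minimal single-head sketch in plain Python lists, with no masking, batching, or learned projections; the function names are illustrative rather than taken from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: [n_q][d_k], K: [n_kv][d_k], V: [n_kv][d_v] -> output [n_q][d_v]."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)     # keeps score magnitude independent of d_k
    out = []
    for q in Q:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two identical keys: weights are 0.5 each,
# so the output is the average of the two value rows.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[0.0, 2.0], [4.0, 6.0]]
result = scaled_dot_product_attention(Q, K, V)
```

Multi-head attention runs several such computations in parallel on separately projected inputs and concatenates the results.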

### Mesh-TensorFlow (2018)

Training models with billions of parameters quickly outgrew the memory of any single accelerator. In a paper released in November 2018 and presented at NeurIPS 2018, Shazeer and a Google Brain team (Youlong Cheng, Niki Parmar, Dustin Tran, [Ashish Vaswani](/wiki/ashish_vaswani), Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman) introduced *Mesh-TensorFlow: Deep Learning for Supercomputers*.[^16] Mesh-TensorFlow is a language for specifying distributed tensor computations in which the user can declare that any tensor dimension is split across any dimension of a multi-dimensional mesh of processors. This enabled a clean expression of *model parallelism* (splitting individual weight tensors across many chips) in addition to the more common *data parallelism* (splitting the batch across chips). The paper demonstrated Mesh-TensorFlow on Transformer models with up to 5 billion parameters on TPU meshes of up to 512 cores.[^16] Mesh-TensorFlow was an important conceptual precursor to today's large-model parallelism systems such as GSPMD, JAX's `pjit` / `shard_map`, and DeepSpeed.
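The difference between the two kinds of parallelism can be shown with a toy simulation. This sketch does not use the Mesh-TensorFlow API; it only mimics, on a single machine with plain lists, what happens when either the weight matrix's column dimension (model parallelism) or the batch dimension (data parallelism) is split across simulated devices. All function names are hypothetical.

```python
def matmul(A, B):
    """Plain dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def split_columns(W, parts):
    """Split W's columns into equal contiguous shards, one per simulated device."""
    assert len(W[0]) % parts == 0
    step = len(W[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in W] for p in range(parts)]

def model_parallel_matmul(X, W, parts):
    """Each device holds a column shard of W and computes a slice of the output."""
    partial = [matmul(X, shard) for shard in split_columns(W, parts)]
    # Concatenate the per-device output slices along the column dimension.
    return [sum((p[r] for p in partial), []) for r in range(len(X))]

def data_parallel_matmul(X, W, parts):
    """Each device holds all of W and processes a slice of the batch."""
    assert len(X) % parts == 0
    step = len(X) // parts
    out = []
    for p in range(parts):
        out.extend(matmul(X[p * step:(p + 1) * step], W))
    return out

X = [[1, 2], [3, 4]]                 # batch of 2 examples
W = [[1, 0, 2, 1], [0, 1, 1, 3]]     # 2 x 4 weight matrix
full = matmul(X, W)
```

Both layouts reproduce the unsharded result; they differ in which tensor each device must store, which is why splitting the weights matters once a single weight matrix no longer fits on one chip.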

### Adafactor optimizer (2018)

Together with Mitchell Stern, Shazeer published *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost* at ICML 2018.[^17] Standard adaptive optimizers such as Adam maintain two extra values per parameter (first- and second-moment estimates), roughly tripling the memory needed for parameters and optimizer state. Adafactor maintains only per-row and per-column statistics of weight matrices, reconstructing per-parameter second-moment estimates from these factorized statistics. Combined with update clipping and a gradually increasing second-moment decay rate, the optimizer achieves results comparable to Adam's while using sublinear auxiliary memory, making it possible to train much larger Transformer models on the same hardware.[^17] Adafactor became the default optimizer for [T5](/wiki/t5) and many later very-large-model training runs.
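The factorization can be sketched for a single step. For an n × m matrix of squared gradients, only the n row sums and m column sums are stored, and each entry is estimated as (row sum × column sum) / total. This is a simplified illustration with hypothetical function names: the actual optimizer keeps exponential moving averages of these sums and adds update clipping, both omitted here.

```python
def factored_stats(sq_grads):
    """Keep only row sums and column sums of the squared-gradient matrix."""
    row = [sum(r) for r in sq_grads]
    col = [sum(c) for c in zip(*sq_grads)]
    return row, col

def reconstruct(row, col):
    """Rank-1 estimate of every entry from the stored row and column sums."""
    total = sum(row)
    return [[r * c / total for c in col] for r in row]

def aux_memory(n, m):
    """Stored values for (full second moment, factored second moment)."""
    return n * m, n + m

grads = [[1.0, 2.0], [2.0, 4.0]]
sq_grads = [[g * g for g in r] for r in grads]    # [[1, 4], [4, 16]]
row, col = factored_stats(sq_grads)
estimate = reconstruct(row, col)
```

When the squared-gradient matrix happens to be rank 1, as in this toy case, the reconstruction is exact; in general it is an approximation. For a 4096 × 4096 weight matrix the stored statistics drop from about 16.8 million values to 8,192.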

### T5: Text-to-Text Transfer Transformer (2019)

In October 2019, Colin Raffel, Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu published *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*, introducing the [T5](/wiki/t5) model family and the C4 (Colossal Clean Crawled Corpus) pre-training dataset.[^18] T5 recast every NLP task (translation, summarization, classification, question answering) as a text-to-text problem: a string in, a string out. This unified formulation, combined with a systematic empirical study of pre-training objectives, model sizes, and dataset sizes, established a new state of the art on a wide range of NLP benchmarks and influenced the design of virtually every subsequent encoder-decoder language model.[^18]
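A minimal sketch of the text-to-text framing follows: each task becomes an (input string, target string) pair, distinguished only by a task prefix. The prefixes follow the style of the examples given in the T5 paper; the `TEMPLATES` table and `to_text_to_text` helper are hypothetical and not part of any T5 codebase.

```python
# Task prefixes in the style of the T5 paper's examples; the table itself is illustrative.
TEMPLATES = {
    "translation": "translate English to German: {text}",
    "summarization": "summarize: {text}",
    "acceptability": "cola sentence: {text}",
}

def to_text_to_text(task, text, target):
    """Turn any supported task into a plain (input string, target string) pair."""
    return TEMPLATES[task].format(text=text), str(target)

examples = [
    to_text_to_text("translation", "That is good.", "Das ist gut."),
    to_text_to_text("acceptability", "The course is jumping well.", "not acceptable"),
]
```

Because even classification labels are emitted as text, the same model, loss, and decoding procedure can be used for every task.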

### Multi-query attention (2019)

In November 2019 Shazeer single-authored *Fast Transformer Decoding: One Write-Head Is All You Need*, which proposed **multi-query attention** (MQA).[^19] In a standard multi-head attention layer, each attention head maintains its own keys and values; during autoregressive decoding this requires loading a separate key/value cache per head per layer, and the resulting memory bandwidth becomes the dominant cost of inference. MQA shares a single set of keys and values across all heads, shrinking the key/value cache by a factor equal to the number of heads and substantially improving decoding throughput, with only minor quality degradation.[^19] MQA and its later generalization, grouped-query attention (GQA), are now standard in production large language models such as PaLM 2, LLaMA, Mistral, and the Gemini family.
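The saving is easy to quantify with back-of-the-envelope arithmetic. The sketch below counts key/value cache bytes for a hypothetical model configuration (32 layers, 32 query heads, head dimension 128, 4,096-token context, 16-bit values); these numbers are assumptions chosen for illustration and do not describe any specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical configuration, for illustration only.
config = dict(layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(kv_heads=32, **config)   # one K/V set per query head
gqa = kv_cache_bytes(kv_heads=8, **config)    # K/V shared within groups of 4 heads
mqa = kv_cache_bytes(kv_heads=1, **config)    # a single K/V set for all heads

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache shrinks from 2 GiB per sequence with full multi-head attention to 64 MiB with MQA, with GQA at 512 MiB in between.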

### GLU Variants Improve Transformer / SwiGLU (2020)

In February 2020 Shazeer published a short single-author paper, *GLU Variants Improve Transformer*, which explored replacing the standard ReLU/GELU activation in the Transformer feed-forward sub-layer with variants of the Gated Linear Unit (GLU).[^20] Among the variants introduced were GEGLU (using GELU as the gating non-linearity) and SwiGLU (using Swish). The paper showed that GEGLU and SwiGLU produced consistent improvements in perplexity over ReLU and GELU baselines.[^20] SwiGLU was subsequently adopted as the standard feed-forward activation in [PaLM](/wiki/palm), LLaMA, Mistral, Gemini and many other modern large language models.
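In the paper's notation the SwiGLU feed-forward layer is FFN(x) = (Swish(xW) ⊗ xV)·W₂, with no bias terms: one projection is passed through Swish and used to gate a second, linear projection element-wise. The sketch below applies it to a single input vector using plain lists; the helper names are illustrative.

```python
import math

def swish(x):
    """Swish with beta = 1 (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def vec_mat(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * w for xi, w in zip(x, col)) for col in zip(*W)]

def ffn_swiglu(x, W, V, W2):
    """(Swish(xW) * xV) W2, with element-wise gating and no biases."""
    gate = [swish(h) for h in vec_mat(x, W)]     # non-linear gate branch
    value = vec_mat(x, V)                        # linear value branch
    hidden = [g * v for g, v in zip(gate, value)]
    return vec_mat(hidden, W2)

# Tiny example: d_model = 1, d_ff = 1.
y = ffn_swiglu([1.0], W=[[1.0]], V=[[2.0]], W2=[[1.0]])
```

Because the layer has three weight matrices instead of two, the paper reduces the hidden width so that parameter count and compute match the baseline feed-forward layer.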

### Switch Transformer (2021)

In January 2021, William Fedus, Barret Zoph, and Shazeer published *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity*.[^21] The Switch Transformer simplified the Mixture-of-Experts routing scheme by sending each token to exactly one expert (top-1 routing), which substantially reduced the communication and load-balancing costs of MoE training.[^21] Built on top of T5, the Switch Transformer achieved up to seven-times speed-ups in pre-training over dense baselines at the same compute budget and was scaled to over one trillion total parameters, among the first publicly described language models to reach that scale. The paper was published in the *Journal of Machine Learning Research* in 2022.[^21] Many later MoE models route each token to more than one expert (Mixtral, for example, selects two of eight experts per token), but the Switch Transformer's simplified routing and load-balancing formulation remains a standard reference point for sparse-model design.
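Top-1 routing can be sketched as follows. Each token goes to its highest-probability expert, subject to a per-expert capacity; tokens that overflow are marked `None` here (in the paper they skip the expert and pass through the residual connection). The auxiliary loss follows the paper's form α·N·Σᵢ fᵢ·Pᵢ, where fᵢ is the fraction of tokens whose top choice is expert i and Pᵢ is the mean router probability for expert i. The function names and toy logits are illustrative, and real implementations operate on batched tensors rather than Python lists.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def top1(probs_row):
    """Index of the highest-probability expert for one token."""
    return max(range(len(probs_row)), key=lambda i: probs_row[i])

def switch_route(router_logits, capacity_factor=1.0):
    """Assign each token to one expert; None marks a token dropped by overflow."""
    num_tokens, num_experts = len(router_logits), len(router_logits[0])
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    probs = [softmax(row) for row in router_logits]
    load = [0] * num_experts
    assignments = []
    for p in probs:
        e = top1(p)
        if load[e] < capacity:
            load[e] += 1
            assignments.append(e)
        else:
            assignments.append(None)   # expert is full
    return assignments, probs

def load_balance_loss(probs, alpha=0.01):
    """alpha * N * sum_i f_i * P_i; equals alpha under perfectly uniform routing."""
    n_tokens, n_experts = len(probs), len(probs[0])
    f = [0.0] * n_experts
    for p in probs:
        f[top1(p)] += 1.0 / n_tokens
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

# Four tokens, two experts: three tokens prefer expert 0, but its capacity is 2.
logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
assignments, probs = switch_route(logits)
balanced = load_balance_loss([softmax([1.0, 0.0]), softmax([0.0, 1.0])])
```

Raising `capacity_factor` above 1.0 leaves headroom so fewer tokens are dropped, at the cost of extra computation and memory per expert.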

### PaLM and other late-Brain contributions (2021-2022)

Shazeer was one of dozens of contributors to Google's *PaLM: Scaling Language Modeling with Pathways*, the 540-billion-parameter dense Transformer that was, at its release, among the largest publicly described dense language models and that demonstrated strong few-shot performance on hundreds of language and reasoning benchmarks.[^22] [PaLM](/wiki/palm) was trained on 6,144 TPU v4 chips across two pods using Google's then-new Pathways system; its architectural choices include SwiGLU feed-forward layers, multi-query attention, parallel attention/feed-forward layers, and rotary positional embeddings, with the SwiGLU activation and multi-query attention components originating in Shazeer's own single-author papers.[^22][^19][^20] PaLM's release in April 2022 was widely seen as Google's headline counter to OpenAI's GPT-3 and was the immediate technical antecedent of the Bard / Gemini effort.

### LaMDA / Meena and the path to public chatbots

Inside Google, Shazeer and Daniel De Freitas led the development of a large open-domain dialogue model originally called **Meena** and later renamed **LaMDA** (Language Model for Dialogue Applications).[^4][^5][^23] Meena, announced in a 2020 Google AI blog post, was at the time the largest open-domain chatbot model ever trained, with 2.6 billion parameters; it was followed by LaMDA, a Transformer-based 137-billion-parameter model fine-tuned for dialogue.[^23] The team built a system capable of strikingly fluent multi-turn conversation; however, Google's leadership declined to release the model publicly, citing concerns about safety, fairness, and the reputational risk of consumer-facing chatbots. According to *The Wall Street Journal*, Shazeer and De Freitas argued that releasing the chatbot would generate the user feedback necessary to improve safety in production, while senior leadership preferred a more cautious enterprise rollout. The disagreement over public release was widely reported as a principal reason for Shazeer's and De Freitas's eventual departure from the company in 2021.[^4][^5]

## Why did Shazeer leave Google to found Character.AI (2021-2024)?

Shazeer left Google in October 2021.[^5] Together with [Daniel De Freitas](/wiki/daniel_de_freitas) he co-founded **Character Technologies, Inc.**, doing business as [Character.AI](/wiki/character_ai), which incorporated in November 2021.[^24] Character.AI's product was a consumer web (and later mobile) application that let users create and chat with AI characters, including original personas, fictional figures, historical figures, or assistants, built on top of a custom large language model trained in-house.[^24]

The company raised an initial $43 million in seed funding shortly after incorporation, and in March 2023 closed a $150 million Series A round led by Andreessen Horowitz at an approximately $1 billion valuation, with participation from Nat Friedman, Elad Gil, SV Angel, and A.Capital.[^25][^26] Character.AI's public beta launched in September 2022, its iOS and Android apps launched in May 2023 (and were downloaded more than 1.7 million times in their first week), and the optional Character.AI+ subscription tier was introduced the same month.[^24]

By early 2024 Character.AI had become one of the most-used consumer AI products on the web, with millions of daily active users, the majority of them under 30; the company reported roughly 3.5 million daily visitors as of January 2024.[^24] Throughout this period Shazeer served as Character.AI's chief executive officer and chief technical officer; De Freitas served as president and co-led research.[^24]

Shazeer's stated thesis for Character.AI was that conversational AI should be a mass-market consumer product rather than purely an enterprise tool, and that a small team able to ship directly to end users would iterate faster on safety, personality, and product fit than a research division embedded inside a large incumbent. In media appearances during 2022-2024 he repeatedly argued that compute, not data or algorithms, was the binding constraint on progress, and that scaling existing Transformer-based architectures would continue to produce qualitative gains in capability.[^14]

## How did Shazeer return to Google to lead Gemini (August 2024)?

On 2 August 2024 Google and Character.AI jointly announced an arrangement in which Google would license Character.AI's underlying model technology on a non-exclusive basis, provide Character.AI with substantial additional funding, and hire Shazeer, De Freitas, and roughly 30 members of Character.AI's research team into Google.[^6] Press reporting from Reuters, *The Information*, the *Wall Street Journal* and others described the transaction as a "reverse acqui-hire" worth approximately **$2.7 billion**, paid out as licensing fees rather than as an acquisition price, in a structure that some analysts argued was designed to avoid antitrust review.[^7][^8][^27] Subsequent reporting indicated that the U.S. Department of Justice opened an inquiry into the deal's structure.[^8] Press accounts estimated Shazeer's personal share of the proceeds at $750 million-$1 billion.[^1]

At Google, Shazeer was given the title of Vice President of Engineering and was announced (in an internal memo from Google DeepMind chief Demis Hassabis that was reported by *The Information*) as one of three co-technical leads of the [Gemini](/wiki/gemini) model effort, working alongside Jeff Dean (Google's chief scientist) and Oriol Vinyals.[^11] In its public statement, Google described Shazeer as "a preeminent researcher in machine learning."[^6]

After rejoining Google, Shazeer was associated with the development of the Gemini 2 model family, including [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) and Gemini 2.5 Flash, and with [Gemini 3](/wiki/gemini_3), which launched in November 2025 and was deployed into Google Search on the same day.[^28] He continued to advocate publicly for further scaling of training compute, for mixture-of-experts architectures, and for direct consumer deployment of frontier conversational AI, positions that closely mirror the ones that led to his 2021 departure.[^14]

In a February 2025 *Dwarkesh Podcast* interview alongside Jeff Dean, Shazeer described his ambition to build a single mixture-of-experts model that could be continuously grown and updated, a research vision that closely matches the Pathways architecture Google has been describing publicly since 2021. He also forecast that automated AI research itself would soon become a dominant source of progress.[^30] Press coverage of his return framed Shazeer as one of a small number of senior individual researchers whose presence is considered strategically necessary for a frontier-model effort, comparable in influence to Ilya Sutskever at OpenAI or Yann LeCun at Meta.[^14]

## When did Noam Shazeer join OpenAI?

On 18 June 2026, less than two years after his $2.7 billion return to Google, Shazeer announced that he was leaving Google DeepMind to join [OpenAI](/wiki/openai) as its Lead for Architecture Research, a role focused on next-generation model architectures and the continued evolution of the Transformer.[^31][^32] In a post on X he wrote, "I'm excited to share that I'll be joining OpenAI and look forward to working with the exceptional team there," and OpenAI chief executive Sam Altman publicly welcomed him as "one of the people I have most wanted to work with since the very beginning of openai."[^32]

CNBC and other outlets framed the move as a major talent shift in the frontier-AI race, noting that Shazeer co-designed the architecture underlying every major AI assistant, from ChatGPT to Gemini, and that his departure unwound, in under two years, the costly reverse acqui-hire Google had used to bring him back.[^31][^32] The hire reinforced OpenAI's bet on architectural research at a moment when the company was reported to be moving toward an initial public offering.[^32]

## Recognition and honors

* 1994: gold medal (perfect score), 35th International Mathematical Olympiad, Hong Kong (USA team).[^12]
* 1996 and 1997: top-ranked Putnam Mathematical Competition individual finishes; led Duke University to first- and second-place team finishes.[^13]
* 2023: *Time* 100 in AI list.[^9]
* 2026: Elected member of the U.S. National Academy of Engineering.[^10]

## Selected publications

* Shazeer, N. et al. *Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer*. ICLR 2017. arXiv:1701.06538.[^2]
* Vaswani, A.; Shazeer, N. et al. *Attention Is All You Need*. NeurIPS 2017. arXiv:1706.03762.[^3]
* Shazeer, N. et al. *Mesh-TensorFlow: Deep Learning for Supercomputers*. NeurIPS 2018. arXiv:1811.02084.[^16]
* Shazeer, N.; Stern, M. *Adafactor: Adaptive Learning Rates with Sublinear Memory Cost*. ICML 2018. arXiv:1804.04235.[^17]
* Raffel, C.; Shazeer, N. et al. *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. JMLR 2020. arXiv:1910.10683.[^18]
* Shazeer, N. *Fast Transformer Decoding: One Write-Head Is All You Need*. arXiv:1911.02150 (2019).[^19]
* Shazeer, N. *GLU Variants Improve Transformer*. arXiv:2002.05202 (2020).[^20]
* Fedus, W.; Zoph, B.; Shazeer, N. *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity*. JMLR 2022. arXiv:2101.03961.[^21]
* Chowdhery, A. et al. (incl. Shazeer, N.). *PaLM: Scaling Language Modeling with Pathways*. JMLR 2023. arXiv:2204.02311.[^22]

## See also

* [Transformer](/wiki/transformer)
* [Attention Is All You Need](/wiki/attention_is_all_you_need)
* [Mixture-of-Experts](/wiki/mixture_of_experts)
* [T5](/wiki/t5)
* [Switch Transformer](/wiki/switch_transformer)
* [PaLM](/wiki/palm)
* [Gemini](/wiki/gemini)
* [Gemini 2.5 Pro](/wiki/gemini_2_5_pro)
* [Gemini 3](/wiki/gemini_3)
* [Character.AI](/wiki/character_ai)
* [OpenAI](/wiki/openai)
* [Google Brain](/wiki/google_brain)
* [Google DeepMind](/wiki/google_deepmind)
* [LaMDA](/wiki/lamda)
* [Ashish Vaswani](/wiki/ashish_vaswani)
* [Aidan Gomez](/wiki/aidan_gomez)
* [Daniel De Freitas](/wiki/daniel_de_freitas)

## References

[^1]: "Noam Shazeer", Wikipedia. https://en.wikipedia.org/wiki/Noam_Shazeer
[^2]: Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J., "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer", arXiv:1701.06538, 23 January 2017. https://arxiv.org/abs/1701.06538
[^3]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I., "Attention Is All You Need", arXiv:1706.03762, 12 June 2017. https://arxiv.org/abs/1706.03762
[^4]: M. Olson and B. Bensinger, "Google Refused to Release Chatbot, So Two of Its Top Engineers Quit", *The Wall Street Journal*, summarized in Interesting Engineering. https://interestingengineering.com/culture/google-built-chatgpt-like-ai-years-ago
[^5]: K. Wiggers and I. Lunden, "Exclusive: Character.AI CEO Noam Shazeer returns to Google as the tech giant invests in the AI company", *TechCrunch*, 2 August 2024. https://techcrunch.com/2024/08/02/character-ai-ceo-noam-shazeer-returns-to-google/
[^6]: Google and Character.AI joint announcement, reported in *TechCrunch*, 2 August 2024. https://techcrunch.com/2024/08/02/character-ai-ceo-noam-shazeer-returns-to-google/
[^7]: "Google Confirms $2.7 Billion Deal to Hire Character Co-Founders", *The Information*. https://www.theinformation.com/briefings/google-confirms-2-7-billion-deal-to-hire-character-co-founders
[^8]: "Google's $2.7B AI deal with Noam Shazeer's Character.AI draws DOJ attention", Calcalist (Ctech). https://www.calcalistech.com/ctechnews/article/sy06wllflg
[^9]: "Noam Shazeer: The 100 Most Influential People in AI 2023", *Time*. https://time.com/collection/time100-ai/6310599/noam-shazeer/
[^10]: "National Academy of Engineering Elects 130 Members and 28 International Members" (Class of 2026), National Academy of Engineering. https://www.nae.edu/345149/NAENewClass2026
[^11]: E. Woo, "Memo: Noam Shazeer, the ex-CEO of Character.AI who rejoined Google this month, will be Gemini's co-technical lead, working alongside Jeff Dean and Oriol Vinyals", *The Information*, summarized on Techmeme, 22 August 2024. https://www.techmeme.com/240822/p33
[^12]: "Noam Shazeer", individual results page, International Mathematical Olympiad. https://www.imo-official.org/participant_r.aspx?id=1144
[^13]: "Noam Shazeer", Google Wiki (Fandom), summarizing Duke Putnam record. https://google.fandom.com/wiki/Noam_Shazeer
[^14]: "Noam Shazeer: After 20 years at Google, he walked away, then came back for $2.7 billion", *Gulf News*. https://gulfnews.com/special-reports/noam-shazeer-after-20-years-at-google-he-walked-away-then-came-back-for-27-billion-1.1734670875849
[^15]: "Attention Is All You Need", Wikipedia (author credits and contribution footnote). https://en.wikipedia.org/wiki/Attention_Is_All_You_Need
[^16]: Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., Hechtman, B., "Mesh-TensorFlow: Deep Learning for Supercomputers", arXiv:1811.02084, NeurIPS 2018. https://arxiv.org/abs/1811.02084
[^17]: Shazeer, N., Stern, M., "Adafactor: Adaptive Learning Rates with Sublinear Memory Cost", arXiv:1804.04235, ICML 2018. https://arxiv.org/abs/1804.04235
[^18]: Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P. J., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", arXiv:1910.10683, JMLR 2020. https://arxiv.org/abs/1910.10683
[^19]: Shazeer, N., "Fast Transformer Decoding: One Write-Head Is All You Need", arXiv:1911.02150, 6 November 2019. https://arxiv.org/abs/1911.02150
[^20]: Shazeer, N., "GLU Variants Improve Transformer", arXiv:2002.05202, 12 February 2020. https://arxiv.org/abs/2002.05202
[^21]: Fedus, W., Zoph, B., Shazeer, N., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity", arXiv:2101.03961; *Journal of Machine Learning Research* 23 (2022). https://arxiv.org/abs/2101.03961
[^22]: Chowdhery, A. et al., "PaLM: Scaling Language Modeling with Pathways", arXiv:2204.02311, 5 April 2022. https://arxiv.org/abs/2204.02311
[^23]: Google AI Blog, "LaMDA: our breakthrough conversation technology". https://blog.google/technology/ai/lamda/
[^24]: "Character.ai", Wikipedia. https://en.wikipedia.org/wiki/Character.ai
[^25]: "Personalized Superintelligence Platform Character.AI Secures $150M in Series A Funding Led by Andreessen Horowitz", BusinessWire, 23 March 2023. https://www.businesswire.com/news/home/20230323005299/en/Personalized-Superintelligence-Platform-Character.AI-Secures-$150M-in-Series-A-Funding-Led-by-Andreessen-Horowitz
[^26]: J. Vanian, "Ex-Google employees' A.I. chatbot startup valued at $1 billion after Andreessen Horowitz funding", *CNBC*, 23 March 2023. https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html
[^27]: "Google Reportedly Spent $2.7 Billion to Rehire Character.AI Founder", PYMNTS. https://www.pymnts.com/artificial-intelligence-2/2024/google-reportedly-spent-2-7-billion-to-rehire-character-ai-founder/
[^28]: "Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long-Context", Google DeepMind technical report. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
[^29]: B. Slawski, "Google's Second Most Important Algorithm? Before Google's Panda, there was Phil", SEO by the Sea, summarizing Levy, S., *In the Plex: How Google Thinks, Works, and Shapes Our Lives* (Simon & Schuster, 2011). https://www.seobythesea.com/2011/07/googles-second-most-important-algorithm-before-googles-panda-there-was-phil/
[^30]: D. Patel, "Jeff Dean & Noam Shazeer: 25 years at Google: from PageRank to AGI", *Dwarkesh Podcast*, 12 February 2025. https://www.dwarkesh.com/p/jeff-dean-and-noam-shazeer
[^31]: "Google Gemini co-lead Noam Shazeer leaves for OpenAI", *CNBC*, 18 June 2026. https://www.cnbc.com/2026/06/18/google-gemini-co-lead-noam-shazeer-leaves-for-openai.html
[^32]: "Two years after a $2.7 billion return to Google, AI pioneer Noam Shazeer is leaving for OpenAI", Calcalist (CTech), 18 June 2026. https://www.calcalistech.com/ctechnews/article/r1je3bzzze