PaLM
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 6,693 words
PaLM (Pathways Language Model) is a family of large language models developed by Google Research. The original PaLM, announced on April 4, 2022, was a 540-billion-parameter dense decoder-only transformer trained using Google's Pathways system across 6,144 TPU v4 chips spanning two pods.[1][2] At the time of its release it was the largest publicly disclosed dense language model ever trained, and it achieved state-of-the-art few-shot results on hundreds of language understanding and generation benchmarks.[1][2] PaLM is best known for demonstrating the effectiveness of chain-of-thought prompting at scale, for validating an architectural recipe (SwiGLU activations, parallel transformer blocks, multi-query attention, and rotary position embeddings) that became standard in subsequent large models, and for serving as the foundation for specialized variants including PaLM-E, Med-PaLM, and Sec-PaLM.[1][13]
PaLM was succeeded by PaLM 2, announced at Google I/O on May 10, 2023, which used compute-optimal scaling and a smaller but more capable architecture.[3][4] PaLM 2 powered the Bard chatbot from May 2023 until Bard transitioned to the Gemini family in December 2023.[4][5] Google decommissioned the public PaLM API on August 15, 2024, directing developers to migrate to the Gemini API.[6]
PaLM was built to demonstrate the capabilities of Pathways, a distributed machine learning system that Google had been developing since 2021. Google Senior Fellow Jeff Dean introduced the broader Pathways vision in an October 2021 blog post, framing it as a next-generation architecture for "general-purpose intelligent systems" able to handle multiple tasks and modalities with sparsely activated, dynamically routed computation.[7] The accompanying systems paper, "Pathways: Asynchronous Distributed Dataflow for ML," followed in 2022 and described an orchestrator that could schedule computation across many accelerators connected over a data-center network.[8]
A central engineering target for Pathways was the ability to scale a single training job beyond the boundaries of one TPU pod. Until that point, most of the largest language models had either been trained within a single TPU system (as with Google's LaMDA and GLaM) or had relied on pipeline parallelism to scale across GPU clusters (Microsoft/NVIDIA's Megatron-Turing NLG) or multiple TPU v3 pods (DeepMind's Gopher), because the bandwidth between pods over a standard data-center network is far lower than that of the dedicated interconnects inside a pod.[1][9] Pathways was designed to span pods efficiently by gang-scheduling asynchronous computations across groups of accelerators and coordinating the data transfers between them.[8] PaLM was the first large model trained end-to-end on Pathways, and the paper presents PaLM's training run as the headline validation of the system.[1]
The Pathways paper itself, authored by Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, and 12 other Google engineers, was posted to arXiv on March 23, 2022 (arXiv:2203.12533) and presented at the 5th MLSys Conference.[8] It describes a single-controller model in which a Python "client" program dispatches a sharded dataflow graph of asynchronous operators that consume and produce futures, and gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects.[8] Pathways reported performance parity with state-of-the-art SPMD systems running over 2,048 TPU chips, and comparable throughput for Transformer models pipelined across 16 stages or sharded across two islands of accelerators connected over a data-center network, the configuration ultimately used for PaLM.[8]
The PaLM project was deliberately framed as a test of dense scaling. By early 2022, several teams, including Google's own Brain group with GLaM (a 1.2-trillion-parameter sparsely activated mixture-of-experts model), had argued that sparse, expert-routed architectures could match or exceed dense models at a fraction of the inference cost.[1][10] GLaM contained 64 experts per MoE layer across 32 MoE layers but activated only roughly 97 billion parameters per token, achieving better zero-shot and one-shot performance than GPT-3 on 29 NLP tasks while consuming about one-third of GPT-3's training energy.[10] PaLM took the opposite bet, asking how far one could push a single, densely activated transformer using better infrastructure. The paper explicitly positions PaLM relative to dense baselines such as GPT-3 (175B), Gopher (280B), Megatron-Turing NLG (530B), and Chinchilla (70B), all of which had appeared in the preceding two years.[1][9][11][12]
The PaLM paper, titled PaLM: Scaling Language Modeling with Pathways, was co-led by Aakanksha Chowdhery and Sharan Narang and credits more than 60 co-authors at Google Research, including Jacob Devlin, Jeff Dean, Noah Fiedel, and Slav Petrov.[1][2] It was posted to arXiv as preprint 2204.02311 on April 5, 2022,[1] one day after the Google AI blog announcement on April 4, 2022.[2] A revised version (v5) followed in October 2022, and the work was later published in the Journal of Machine Learning Research (volume 24, 2023) as an 87-page article.[1]
PaLM is a dense, decoder-only transformer trained with an autoregressive next-token prediction objective. The architecture combines several modifications that had been individually studied in prior work but, before PaLM, had not been demonstrated together at the 540-billion-parameter scale.[1]
In place of the ReLU or GELU activations used in earlier transformers, PaLM uses the SwiGLU activation in its feed-forward sublayers. SwiGLU was proposed by Noam Shazeer in 2020 and is defined as SwiGLU(x) = Swish(xW) * xV, where Swish(z) = z * sigmoid(z). The gated formulation requires three matrix multiplications in each feed-forward block rather than the usual two, but ablations in the PaLM paper found a meaningful quality improvement at matched compute.[1] PaLM sets the feed-forward hidden dimension to four times the model dimension (d_ff = 4 * d_model), rather than the 8/3 ratio sometimes used in later models that adopt SwiGLU.[1]
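The definition above can be sketched in a few lines of NumPy. This is a minimal illustration rather than PaLM's implementation: the toy dimensions, random weights, and the names `swiglu_ffn` and `W_out` are assumptions made for the example, and biases are omitted.

```python
import numpy as np

def swish(z):
    """Swish(z) = z * sigmoid(z), written as z / (1 + exp(-z))."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W, V, W_out):
    """Feed-forward sublayer with a SwiGLU gate (hypothetical helper).

    x:     (tokens, d_model) input activations
    W, V:  (d_model, d_ff)   gate and linear projections (matmuls 1 and 2)
    W_out: (d_ff, d_model)   output projection (matmul 3)
    """
    hidden = swish(x @ W) * (x @ V)  # SwiGLU(x) = Swish(xW) * xV
    return hidden @ W_out

# Toy sizes chosen for illustration; d_ff = 4 * d_model as in the text.
rng = np.random.default_rng(0)
d_model = 8
d_ff = 4 * d_model
x = rng.standard_normal((3, d_model))
W = rng.standard_normal((d_model, d_ff)) * 0.1
V = rng.standard_normal((d_model, d_ff)) * 0.1
W_out = rng.standard_normal((d_ff, d_model)) * 0.1

y = swiglu_ffn(x, W, V, W_out)
print(y.shape)
```

The three weight matrices make the extra matrix multiplication of the gated formulation explicit.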
PaLM rearranges each transformer block so that the attention sublayer and the feed-forward sublayer are computed in parallel from the same input rather than sequentially, with their outputs summed:
y = x + Attention(LayerNorm(x)) + FFN(LayerNorm(x))
This formulation, sometimes called the "parallel" or "fused" block, fuses the matrix multiplications that produce the queries/keys/values and the feed-forward up-projection, yielding roughly a 15% training speedup at large scale.[1] Ablations reported in the PaLM paper showed a small quality regression at 8B parameters but no degradation at 62B, from which the authors extrapolated that the parallel formulation would be quality-neutral at 540B; it was therefore used for all three sizes.[1] The parallel-block design was a key contributor to PaLM's high accelerator utilization, since it enabled the XLA compiler to fuse adjacent matrix multiplications and amortize collective communication across the attention and feed-forward computations.[1]
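The difference between the two block layouts can be sketched as follows. This is a toy illustration under stated assumptions: `toy_attention` and `toy_ffn` are hypothetical stand-ins (simple matrix maps, not real attention or SwiGLU), and `layer_norm` has no learned scale.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def serial_block(x, attention, ffn):
    """Standard layout: the FFN consumes the output of the attention sublayer."""
    h = x + attention(layer_norm(x))
    return h + ffn(layer_norm(h))

def parallel_block(x, attention, ffn):
    """Parallel layout: y = x + Attention(LayerNorm(x)) + FFN(LayerNorm(x))."""
    n = layer_norm(x)  # computed once and shared by both branches
    return x + attention(n) + ffn(n)

# Hypothetical stand-ins for the two sublayers, for shape checking only.
rng = np.random.default_rng(0)
d_model = 8
A = rng.standard_normal((d_model, d_model)) * 0.1
B = rng.standard_normal((d_model, d_model)) * 0.1
toy_attention = lambda n: n @ A
toy_ffn = lambda n: np.tanh(n @ B)

x = rng.standard_normal((4, d_model))
y_serial = serial_block(x, toy_attention, toy_ffn)
y_parallel = parallel_block(x, toy_attention, toy_ffn)
print(y_serial.shape, y_parallel.shape)
```

Because both branches read the same normalized input, their input projections can be concatenated into one larger matrix multiplication, which is the source of the fusion opportunity the text describes.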
Standard multi-head attention gives each head its own query, key, and value projections. PaLM uses multi-query attention (MQA), introduced by Shazeer in 2019, in which the key and value projections are shared across all heads while each head retains its own query projection. The paper reports that MQA is quality-neutral and only marginally slower in training, but it dramatically shrinks the key-value cache during autoregressive decoding and therefore makes inference significantly cheaper.[1] PaLM was the first model at this scale to commit to MQA, helping cement the design as a default for later dense LLMs and a direct predecessor of the grouped-query attention used in Llama 2 and other open-weight families.
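A minimal NumPy sketch of multi-query attention is shown below, assuming toy dimensions and random weights; the function name `multi_query_attention` and all weight names are hypothetical. The closing arithmetic uses the 540B configuration (48 heads, head dimension 256) to compare per-token, per-layer cache sizes.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Causal attention with per-head queries and ONE shared key/value head.

    x: (T, d_model); Wq: (d_model, n_heads * d_head);
    Wk, Wv: (d_model, d_head); Wo: (n_heads * d_head, d_model).
    """
    T = x.shape[0]
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(T, n_heads, d_head).transpose(1, 0, 2)  # (H, T, d_head)
    k = x @ Wk  # (T, d_head), shared by every head
    v = x @ Wv  # (T, d_head), shared by every head
    scores = q @ k.T / np.sqrt(d_head)                # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # mask out future tokens
    out = softmax(scores) @ v                         # (H, T, d_head)
    return out.transpose(1, 0, 2).reshape(T, n_heads * d_head) @ Wo

rng = np.random.default_rng(0)
T, d_model, n_heads, d_head = 5, 16, 4, 4
x = rng.standard_normal((T, d_model))
Wq = rng.standard_normal((d_model, n_heads * d_head)) * 0.1
Wk = rng.standard_normal((d_model, d_head)) * 0.1
Wv = rng.standard_normal((d_model, d_head)) * 0.1
Wo = rng.standard_normal((n_heads * d_head, d_model)) * 0.1
out = multi_query_attention(x, Wq, Wk, Wv, Wo, n_heads)

# Cached key/value entries per token per layer for the 540B shape.
mha_cache = 2 * 48 * 256  # separate K and V for each of 48 heads
mqa_cache = 2 * 256       # one shared K and one shared V
print(out.shape, mha_cache // mqa_cache)
```

With one key/value head instead of 48, the cache shrinks by a factor equal to the head count, which is why the savings show up at decoding time rather than in training.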
PaLM uses rotary position embeddings (RoPE), introduced by Su et al. in 2021, instead of absolute or learned relative position embeddings.[1] RoPE encodes absolute position by rotating the query and key vectors by a position-dependent angle, so that the dot product between a query and a key depends only on their relative offset; this lets relative position information enter the attention computation directly. The PaLM paper cites RoPE's better performance on long sequence lengths as the reason for the choice,[1] and RoPE has since become a standard design across open-source LLMs such as LLaMA. The combination of RoPE, MQA, SwiGLU, and parallel transformer blocks adopted by PaLM has been widely characterized as the "modern dense LLM recipe" that subsequent open-weight models inherited.
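The rotation can be illustrated with a short NumPy sketch. It assumes the base of 10,000 from the original RoPE paper and an interleaved pairing of dimensions; real implementations differ in how they pair dimensions, and the function name `rope` is hypothetical.

```python
import numpy as np

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    `base` and the interleaved pairing are assumptions for this sketch.
    """
    d = vec.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative offset (2) at two different absolute positions.
score_near = rope(q, 5) @ rope(k, 3)
score_far = rope(q, 105) @ rope(k, 103)
print(np.isclose(score_near, score_far))
```

The two scores agree because rotating both vectors by a common extra angle leaves their dot product unchanged, so only the offset between positions matters.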
To study scaling behavior, the team trained three models from scratch on the same data:
| Configuration | PaLM 8B | PaLM 62B | PaLM 540B |
|---|---|---|---|
| Parameters | 8.63 billion | 62.50 billion | 540.35 billion |
| Layers | 32 | 64 | 118 |
| Model dimension (d_model) | 4,096 | 8,192 | 18,432 |
| Attention heads | 16 | 32 | 48 |
| Head dimension | 256 | 256 | 256 |
| Feed-forward dimension | 16,384 | 32,768 | 73,728 |
| Vocabulary size | 256,000 | 256,000 | 256,000 |
The head dimension is held fixed at 256 across all three sizes, and the feed-forward hidden dimension is always 4 * d_model.[1]
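The parameter counts in the table can be approximately reproduced from the other rows. The sketch below assumes multi-query attention (one shared key and value head), a three-matrix SwiGLU feed-forward block, no bias terms, and a single embedding matrix shared between input and output; layer-norm parameters are ignored. The function name `palm_param_count` is hypothetical.

```python
def palm_param_count(layers, d_model, heads, head_dim, d_ff, vocab):
    """Approximate parameter count under the assumptions stated above."""
    attn_q = d_model * heads * head_dim    # per-head query projections
    attn_kv = 2 * d_model * head_dim       # one shared key and one shared value head
    attn_out = heads * head_dim * d_model  # attention output projection
    ffn = 3 * d_model * d_ff               # SwiGLU: gate, linear, and output matrices
    per_layer = attn_q + attn_kv + attn_out + ffn
    embeddings = vocab * d_model           # assumed shared input/output embedding
    return layers * per_layer + embeddings

# (layers, d_model, heads, head_dim, d_ff, vocab) taken from the table.
configs = {
    "PaLM 8B": (32, 4096, 16, 256, 16384, 256_000),
    "PaLM 62B": (64, 8192, 32, 256, 32768, 256_000),
    "PaLM 540B": (118, 18432, 48, 256, 73728, 256_000),
}

for name, cfg in configs.items():
    print(f"{name}: {palm_param_count(*cfg) / 1e9:.2f}B")
```

Under these assumptions the estimates come out at about 8.63B, 62.50B, and 540.35B, in line with the table. Note that for the 540B model the attention width (48 x 256 = 12,288) is smaller than d_model (18,432).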
PaLM was trained on 780 billion tokens drawn from a high-quality mixture based on the corpora previously assembled for LaMDA and GLaM.[1] The mixture is summarized below:
| Data source | Share of training tokens |
|---|---|
| Social media conversations (multilingual) | 50% |
| Filtered web pages (multilingual) | 27% |
| Books (English) | 13% |
| GitHub source code | 5% |
| Wikipedia (multilingual) | 4% |
| News articles (English) | 1% |
About 78% of the tokens are English, with the remaining 22% spread across more than 100 languages.[1] Each model is trained for exactly one epoch over the dataset; no training example is repeated. The code component, which makes up the 5% GitHub allocation, spans 24 programming languages drawn from open-source GitHub repositories and totals roughly 196 gigabytes of raw text; despite its small share, it is critical to PaLM's code-generation results discussed below.[1] The corpus was filtered for quality using a logistic-regression classifier (trained to distinguish Wikipedia, books, and selected web pages from generic web crawl) and was further de-duplicated to reduce memorization.[1]
The 540B model was trained on 6,144 TPU v4 chips, distributed across two TPU v4 pods of 3,072 chips each, with 768 hosts per pod.[1][2] The training job used a hybrid parallelism strategy that combined model and data parallelism within each pod with data parallelism across the two pods.
A single Python "client" running on Pathways dispatched half of each training batch to each pod; the pods executed forward and backward passes in parallel, exchanged gradients over the data-center network (DCN), and accumulated both local and remote gradients before applying a bit-identical parameter update on each pod.[1][8] Intermediate activations were rematerialized (recomputed during the backward pass instead of stored), which lets the system use a larger effective batch size given the available memory. With these techniques the training run achieved a 57.8% hardware FLOPs utilization (HFU) and a 46.2% model FLOPs utilization (MFU), at the time of publication the highest reported figures for any LLM at this scale, with PaLM 540B sustaining an average training throughput of 238,300 tokens per second at the largest batch size.[1][13] By comparison, Megatron-Turing NLG 530B had reported 30.2% MFU.[1] The authors framed both numbers as a vindication of the Pathways system's ability to span pods without prohibitive bandwidth penalties.[1]
The full training of the 540B configuration used 6,144 TPU v4 chips for about 1,200 hours plus 3,072 chips for an additional 336 hours, for a final-run compute cost of approximately 2.56 x 10^24 floating-point operations.[1][14] At list cloud-rental prices the run was estimated to cost between roughly 9 million and 23 million US dollars in compute, depending on assumed TPU pricing and downtime.[14]
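The utilization and compute figures quoted above can be cross-checked with a back-of-the-envelope calculation. The sketch assumes a per-token training cost of 6N FLOPs for the dense matrix multiplications plus 12 x layers x heads x head_dim x sequence_length for attention, a sequence length of 2,048 tokens, and a peak of 275 TFLOP/s per TPU v4 chip. The peak figure and the cost formula are assumptions of this example, not values stated in this article.

```python
PARAMS = 540.35e9
LAYERS, HEADS, HEAD_DIM = 118, 48, 256
SEQ_LEN = 2048                # assumption: tokens per training sequence
TOKENS_PER_SEC = 238_300      # reported average throughput
CHIPS = 6144
PEAK_FLOPS_PER_CHIP = 275e12  # assumption: TPU v4 peak bf16 throughput
TRAIN_TOKENS = 780e9

# Forward + backward cost per token: dense matmuls plus self-attention.
flops_per_token = 6 * PARAMS + 12 * LAYERS * HEADS * HEAD_DIM * SEQ_LEN

# Model FLOPs utilization: useful FLOPs per second over peak FLOPs per second.
mfu = TOKENS_PER_SEC * flops_per_token / (CHIPS * PEAK_FLOPS_PER_CHIP)

# Total training compute and the chip-hours quoted for the final run.
total_flops = flops_per_token * TRAIN_TOKENS
chip_hours = 6144 * 1200 + 3072 * 336

print(f"MFU         = {mfu:.1%}")
print(f"total FLOPs = {total_flops:.2e}")
print(f"chip-hours  = {chip_hours:,}")
```

Under these assumptions the estimate lands near the reported 46.2% MFU and 2.56 x 10^24 FLOPs, which suggests the quoted figures are mutually consistent.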
PaLM was trained with the Adafactor optimizer used without factorization, which the paper notes is effectively Adam with "parameter scaling" that scales the learning rate by the root-mean-square of each parameter matrix.[1] The batch size was warmed up in stages: for PaLM 540B, the team started at 512 sequences (1M tokens) until step 50k, doubled to 1,024 sequences (2M tokens) until step 115k, and doubled again to 2,048 sequences (4M tokens) for the remainder of training, ending at step 255k.[1] A peak learning rate of 0.01 was held constant for 10,000 steps before being decayed proportionally to the inverse square root of the step number, and gradients were clipped to a global norm of 1.0.[1]
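The schedule described above can be written down directly. This sketch assumes 2,048 tokens per sequence (implied by 512 sequences being about 1M tokens) and takes the decay to be exactly 1/sqrt(step), which makes the learning rate continuous at step 10,000; the exact stage boundaries and the function names are assumptions.

```python
import math

SEQ_LEN = 2048  # assumption: 512 sequences ~ 1M tokens

# (first step, end step (exclusive), sequences per batch) for PaLM 540B.
BATCH_STAGES = [
    (0, 50_000, 512),
    (50_000, 115_000, 1024),
    (115_000, 255_000, 2048),
]

def batch_size(step):
    """Sequences per batch at a given training step."""
    for start, end, size in BATCH_STAGES:
        if start <= step < end:
            return size
    raise ValueError("step is outside the 255k-step training run")

def learning_rate(step):
    """Constant 0.01 for the first 10k steps, then inverse-square-root decay."""
    return 0.01 if step <= 10_000 else 1.0 / math.sqrt(step)

total_tokens = sum((end - start) * size * SEQ_LEN for start, end, size in BATCH_STAGES)

print(batch_size(60_000), learning_rate(40_000))
print(f"{total_tokens / 1e9:.0f}B tokens")
```

Summing the three stages gives roughly 776 billion tokens, consistent with the 780 billion tokens of the single training epoch.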
Training was not entirely smooth. The paper documents about twenty loss spikes during the 540B run; for each one, the team restarted from a checkpoint about 100 steps before the spike and skipped roughly 200 to 500 batches of data, after which the loss returned to its trend.[1] The authors attribute these spikes to specific batches interacting badly with the current parameter values rather than to a systemic instability or to bad data alone: replaying the same batches from a different checkpoint did not reproduce the spike, which they take as evidence of a fragile interaction between the model state and particular data.[1][15]
The PaLM paper accompanies its architecture and infrastructure description with one of the broadest evaluation suites ever reported for a single language model at the time. The headline claims are summarized below.
On a curated suite of 29 widely used English NLP benchmarks spanning question answering, cloze completion, in-context reasoning, Winograd-style coreference, common-sense reasoning, and SuperGLUE, PaLM 540B exceeded the few-shot state of the art on 28 of 29 tasks, with comparisons against GPT-3 175B, Megatron-Turing NLG 530B, Gopher 280B, Chinchilla 70B, and LaMDA 137B.[1] On several tasks (such as NaturalQuestions and TriviaQA) the few-shot PaLM 540B numbers also exceeded the previously best published fine-tuned results.[1]
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative collection of more than 150 diverse tasks designed to probe the limits of large language models.[16] On 58 common BIG-bench tasks evaluated with 5-shot prompting, PaLM 540B beat the average human rater baseline on aggregate, a milestone earlier dense models had not reached, and outperformed both GPT-3 and Gopher on the same subset by large margins.[1][2]
The paper also highlights "discontinuous" jumps in performance between the 62B and 540B configurations on certain BIG-bench tasks (for example, distinguishing cause and effect, recognizing logical inference patterns, and certain forms of compositional generalization), feeding the broader conversation about emergent abilities in language models.[1] In log-linear scaling plots, the paper argues that returns from scale have not plateaued at 540B, contrary to several earlier predictions that capability gains would saturate beyond a few hundred billion parameters.[1][2]
Concurrent work by Jason Wei and colleagues introduced chain-of-thought prompting, in which the few-shot exemplars include step-by-step reasoning traces.[17] Combined with PaLM, this technique produced some of the paper's most striking results.
These results are widely cited as the moment that chain-of-thought reasoning became a credible general technique rather than a niche prompting trick.[17][18]
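The format of a chain-of-thought prompt, in which each few-shot exemplar carries its reasoning before the final answer, can be sketched as follows. The exemplar, the "The answer is" suffix, and the helper `build_cot_prompt` are illustrative assumptions, not text taken from the PaLM or chain-of-thought papers.

```python
def build_cot_prompt(exemplars, question):
    """Assemble a few-shot prompt whose exemplars include reasoning traces.

    exemplars: list of (question, reasoning, answer) tuples.
    """
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    parts.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(parts)

# Hypothetical exemplar written for this sketch.
exemplars = [
    (
        "A box holds 3 red and 4 blue marbles. How many marbles are in 2 boxes?",
        "Each box holds 3 + 4 = 7 marbles. Two boxes hold 2 * 7 = 14 marbles.",
        "14",
    ),
]

prompt = build_cot_prompt(exemplars, "A shelf has 5 rows of 6 books. How many books are there?")
print(prompt)
```

A standard few-shot prompt would contain only the question and final answer in each exemplar; the reasoning sentence is the only difference.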
Despite only 5% of pre-training data being code, PaLM 540B's few-shot code-generation performance was competitive with OpenAI Codex 12B, the model behind the original GitHub Copilot, which had been specifically fine-tuned on a much larger code corpus. The PaLM paper reports that PaLM matched Codex's performance while seeing roughly 50 times less Python code in pre-training.[1][2] A fine-tuned variant called PaLM-Coder reached an 82.1% compile rate on the DeepFix bug-repair benchmark, against a prior state of the art of 71.7%.[1] PaLM was also evaluated on HumanEval, MBPP, and the TransCoder cross-language translation benchmark, on each of which it either matched or exceeded contemporary specialized models.[1]
Even though only 22% of training tokens were non-English, PaLM produced strong results on multilingual benchmarks. On WMT machine translation tasks PaLM was particularly effective at translating into English, and the paper reports the strongest results among LLMs trained on non-parallel multilingual corpora at the time.[1] Multilingual summarization and question-answering also benefited from PaLM's large 256k vocabulary, which kept token counts moderate even for morphologically rich languages.[1]
PaLM popularized a class of "qualitative" demonstrations that became routine in subsequent LLM releases: the model could explain jokes, write analogies, identify cause-and-effect relationships in short stories, and produce step-by-step justifications for its answers when asked to "think through" a problem.[2][19] The blog post accompanying the paper highlighted joke explanation in particular as evidence that scale was unlocking new behaviors, and the paper appendix collects a set of two-shot prompts in which the model dissects the structure of a punch line.[2] These examples became templates for public-facing demonstrations of emergent abilities and were widely circulated outside the research community.[19]
The PaLM paper devotes an extensive section to bias and toxicity analyses, including evaluation on the BBQ social bias benchmark and toxicity classification on RealToxicityPrompts.[1] PaLM 540B exhibited measurable identity-group bias on several dimensions, and its toxicity rate (probability of generating toxic continuations of toxic prompts) increased with model scale, a result the authors flag as a serious limitation.[1] The paper also studies training data memorization, showing that memorization rates grow logarithmically with model scale, consistent with concurrent observations by Carlini and colleagues on GPT-style models.[1]
The original PaLM family included three pre-training configurations and a small number of fine-tuned derivatives.
PaLM 8B is the smallest pre-trained PaLM. It was used primarily for scaling-law and ablation studies in the original paper, where its results on benchmarks such as Lambada, TriviaQA, and BIG-bench bracket the lower end of the curve.[1] PaLM 8B is also the configuration in which the parallel-block ablation showed a small but consistent quality penalty, while at 62B the penalty vanishes.[1]
PaLM 62B is a middle configuration that already matched or beat earlier large models (such as GPT-3 175B) on many tasks while being roughly three times smaller.[1] It is the model against which many of the "scale ablations" in the paper, such as the parallel-block quality study and the discontinuous BIG-bench jumps, are calibrated.
PaLM 540B is the headline configuration and the one most often referred to simply as "PaLM." Unless otherwise specified, the BIG-bench, chain-of-thought, and code-generation results reported above are for this model.[1]
In late 2022, the Google Brain "Flan" team applied instruction tuning to PaLM, fine-tuning the 540B base model on a mixture of more than 1,800 instruction-formatted tasks (the "Flan Collection"), producing Flan-PaLM 540B.[20] Instruction tuning substantially improved few-shot performance on held-out evaluations: Flan-PaLM 540B outperformed PaLM 540B by an average of about 9.4% across the evaluation suite and reached 75.2% on five-shot MMLU, the leading score on the benchmark at the time of release.[20] Flan-PaLM became the workhorse base for downstream fine-tuning efforts, including Med-PaLM.[20][21]
U-PaLM is a second-stage continuation of PaLM pre-training that uses the UL2 "mixture-of-denoisers" objective. It was introduced in October 2022 in the paper Transcending Scaling Laws with 0.1% Extra Compute by Yi Tay and colleagues.[22] By continuing PaLM training for a small amount of additional compute (about 0.1% of the original budget) under UL2's combination of causal and span-corruption objectives, U-PaLM matched the performance of the final PaLM 540B at roughly half the total compute and improved chain-of-thought performance, BIG-Bench performance, and multilingual results.[22] U-PaLM models were trained at the 8B, 62B, and 540B scales.[22]
PaLM-Coder is a code-focused fine-tune that was used for the results on DeepFix and other program-repair benchmarks discussed in the PaLM paper.[1] PaLM-Coder retained the PaLM architecture but was further trained on additional code data, and its 82.1% DeepFix compile rate (versus the prior 71.7% state of the art) is presented as a strong validation of PaLM's transfer learning from natural language to code.[1]
PaLM-E ("Embodied") is a multimodal extension of PaLM aimed at robotics and embodied reasoning. It was introduced in the paper PaLM-E: An Embodied Multimodal Language Model posted to arXiv on March 6, 2023, with Danny Driess as the first author and co-authors from Google, Robotics at Google, and TU Berlin.[23] The work was later presented at ICML 2023.[23]
PaLM-E injects continuous sensor inputs, including images, robot state estimates, and 3-D scene representations, directly into the language embedding space of a pre-trained PaLM. Learned encoders project each modality into the same vector space as the text token embeddings, producing "multimodal sentences" that interleave language tokens and sensor features.[23] The training objective remains a single autoregressive next-token loss, so the same model handles language-only, vision-language, and robotic-action prediction tasks without per-task heads.[23]
The flagship configuration, PaLM-E-562B, combines the 540B-parameter PaLM language model with the 22-billion-parameter Vision Transformer (ViT-22B), for a total of about 562 billion parameters, the largest visual-language model reported at the time of its release.[23] PaLM-E-562B set a new state of the art on the OK-VQA visual question-answering benchmark while remaining competitive on language-only tasks, leading the authors to argue that sufficient scale lets a single backbone retain its generalist behavior even after multimodal fine-tuning (an effect they termed "catastrophic-forgetting avoidance").[23] PaLM-E was demonstrated controlling real robot arms with natural-language task instructions and visual context, and the paper reports positive transfer from internet-scale visual-language pre-training to physical manipulation tasks, both on a mobile manipulator in a kitchen setting and on a tabletop manipulation rig.[23]
PaLM-SayCan is an interpretable approach to instructing robots in natural language that combines PaLM with learned affordance functions. The paper Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, by Michael Ahn and 44 co-authors, was posted to arXiv on April 4, 2022 (arXiv:2204.01691), the same day as the PaLM announcement.[24]
SayCan separates "Say," the language model's judgment of whether a candidate skill is useful for a goal, from "Can," a learned affordance value function that estimates whether the skill is executable in the current state. The product of the two scores selects each action.[24] When the underlying language model was upgraded from FLAN to PaLM and the system was renamed PaLM-SayCan, Google reported a planning success rate of 84% and an end-to-end execution success rate of 74% across 101 real-world kitchen tasks with a mobile manipulator, roughly halving the error rate of FLAN-SayCan and of PaLM without affordance grounding.[24][25] The Google Research blog described PaLM-SayCan as the first demonstration that improvements in a frontier language model translated cleanly into improvements in physical-robot task performance.[25]
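The selection rule, in which each candidate skill is scored by the product of its "Say" and "Can" values, can be sketched as below. The skill names and probabilities are made up for illustration, and `select_skill` is a hypothetical helper rather than code from the SayCan release.

```python
def select_skill(say_scores, can_scores):
    """Pick the skill with the highest product of usefulness and feasibility.

    say_scores: language-model score that a skill helps with the instruction.
    can_scores: affordance value that the skill can succeed in the current state.
    """
    combined = {skill: say_scores[skill] * can_scores[skill] for skill in say_scores}
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores for the instruction "wipe up the spill",
# in a state where no sponge is currently within reach.
say = {"find a sponge": 0.50, "pick up the sponge": 0.40, "go to the table": 0.10}
can = {"find a sponge": 0.90, "pick up the sponge": 0.10, "go to the table": 0.80}

best, scores = select_skill(say, can)
print(best)
```

In this made-up example "pick up the sponge" looks useful to the language model but is unlikely to succeed, so the product favors "find a sponge" as the next step.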
Minerva is a quantitative-reasoning specialization of PaLM described in the paper Solving Quantitative Reasoning Problems with Language Models by Aitor Lewkowycz and colleagues, posted to arXiv on June 29, 2022 (arXiv:2206.14858) and presented at NeurIPS 2022.[26]
Minerva extends each of the three PaLM sizes (8B, 62B, and 540B) with additional pre-training on a 118-gigabyte technical corpus drawn from arXiv preprints and from web pages containing mathematical notation (with care taken to preserve LaTeX and MathJax markup).[26] The 540B-parameter Minerva variant was trained for an additional 26 billion tokens on top of the PaLM 540B checkpoint.[26] Combined with chain-of-thought prompting and majority-vote decoding, Minerva 540B reached 50.3% on the MATH competition-mathematics benchmark, up from prior state-of-the-art results in the single digits, and 78.5% on GSM8K, demonstrating that domain-targeted continued pre-training combined with prompting could deliver an order-of-magnitude improvement on quantitative tasks.[26]
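Majority-vote decoding, as mentioned above, samples several chain-of-thought solutions and keeps the most common final answer. A minimal sketch follows, assuming the final answers have already been extracted from each sample as strings; the sample values and the helper `majority_vote` are hypothetical.

```python
from collections import Counter

def majority_vote(final_answers):
    """Return the most frequent final answer, ignoring samples with no answer."""
    counts = Counter(a for a in final_answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical final answers extracted from six sampled solutions;
# None marks a sample where no answer could be parsed.
samples = ["42", "42", "41", None, "42", "7"]
print(majority_vote(samples))
```

The reasoning paths may differ between samples; only the final answers are compared, so an occasional arithmetic slip in one sample is outvoted by the others.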
Med-PaLM is a series of medical-domain language models built on top of the PaLM foundation by researchers and clinicians at Google Research and DeepMind.
Med-PaLM was introduced in December 2022 in the paper Large Language Models Encode Clinical Knowledge, with Karan Singhal as the first author.[27] Med-PaLM was produced by applying instruction prompt tuning to Flan-PaLM 540B, which was itself evaluated with a combination of prompting strategies including few-shot prompting, chain-of-thought reasoning, and self-consistency decoding. On the MedQA benchmark of USMLE-style multiple-choice questions, Flan-PaLM 540B reached 67.6% accuracy, exceeding the approximate 60% USMLE passing threshold and becoming the first AI system to clear it.[27] Human evaluation of Med-PaLM's free-form answers, however, found that they still trailed clinicians on factual alignment, completeness, and likelihood of potential harm. The paper was peer-reviewed and published in Nature in July 2023.[27] The same paper introduced the MultiMedQA benchmark suite, a curated combination of MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, MMLU clinical-topics splits, and a new HealthSearchQA dataset, which has since become a standard evaluation suite for medical LLMs.[27]
Med-PaLM 2, announced at Google Health's The Check Up event in March 2023, was built on top of PaLM 2 and used improved fine-tuning techniques including ensemble refinement.[28] It reached 86.5% accuracy on MedQA, the first LLM to perform at "expert" level on USMLE-style questions, improving over Med-PaLM by more than 18 percentage points.[28] Human evaluation found that 92.6% of long-form Med-PaLM 2 answers aligned with scientific consensus, with a lower rate of potentially harmful content than a panel of clinicians, and physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine clinical axes in a pairwise study covering 1,066 consumer medical questions.[28] Med-PaLM 2 was later made available to selected Google Cloud customers under the MedLM family of healthcare foundation models, which Google announced in December 2023 as generally available to allowlisted U.S. Google Cloud customers through Vertex AI.[29]
A multimodal extension, Med-PaLM M (or "Med-PaLM Multimodal"), was introduced in July 2023 in the paper Towards Generalist Biomedical AI by Tao Tu and colleagues.[30] Building on PaLM-E, Med-PaLM M encodes and interprets clinical language, medical imaging (chest X-ray, mammography, dermatology, pathology, retinal imaging), and genomic variant data within a single set of model weights, and is evaluated on a new MultiMedBench suite of 14 biomedical tasks.[30] In a side-by-side ranking on 246 retrospective chest X-rays, clinicians expressed a pairwise preference for Med-PaLM M-generated reports over those written by board-certified radiologists in up to 40.5% of cases.[30]
Sec-PaLM is a cybersecurity-focused variant of PaLM, announced at the RSA Conference in April 2023 as the backbone of Google Cloud's Security AI Workbench.[31] Sec-PaLM was fine-tuned on Google's security telemetry and Mandiant's threat-intelligence corpus. Public applications include VirusTotal Code Insight, which generates natural-language explanations of potentially malicious scripts, and integrations with Chronicle Security Operations that summarize incidents and generate search queries for analysts.[31] Google subsequently positioned a refreshed version of the model, Sec-PaLM 2, as the foundation for the broader Security AI Workbench platform announced at the RSAC 2023 keynote.[31] Google did not disclose Sec-PaLM's parameter count.
PaLM 2 is treated in its own article; this section gives the context that matters for understanding PaLM's trajectory. PaLM 2 was announced at Google I/O on May 10, 2023, with a technical report posted to arXiv (2305.10403) on May 17, 2023, led by Rohan Anil and Andrew M. Dai.[3][4]
Compared with the original PaLM, PaLM 2 adopted compute-optimal scaling, a smaller but more capable model, and a mixture of pre-training objectives.
PaLM 2 also extends multilingual training data to hundreds of languages and includes substantially more code and mathematics, and the technical report independently re-derived Chinchilla-style 1:1 scaling for very large training budgets.[3] The PaLM 2 technical report explicitly withholds training-data sources, model architecture details (parameter counts, depth, width), and training-hardware information, in contrast to PaLM's detailed 87-page paper, a decision that received critical commentary from the open-research community.[3][32]
The most public deployment of the PaLM family was inside Bard, Google's consumer chatbot. Bard launched in March 2023 backed by a lightweight version of LaMDA, but its reception was tepid and Google CEO Sundar Pichai signaled at the end of March 2023 that Bard would soon be upgraded to PaLM.[5][33]
At Google I/O on May 10, 2023, Google announced that Bard was now running on PaLM 2 and that the chatbot was simultaneously being made available without a waitlist in 180 countries and territories, initially in English with rollouts to Japanese and Korean (and a path to 40 languages over the coming months).[4][5] PaLM 2 was credited with Bard's improvements in coding, reasoning, and multilingual response quality.[4][5] PaLM 2 continued to power Bard through the second half of 2023 and was used as the underlying model for a series of Bard upgrades, including the Bard Extensions announced in September 2023.[34]
On December 6, 2023, Google announced Gemini 1.0 and confirmed that Bard would be powered by Gemini Pro going forward; Gemini Ultra followed in early 2024. In February 2024 Google rebranded Bard to Gemini, completing PaLM 2's exit from the consumer product line.[35]
PaLM had a substantial architectural impact on the open-source LLM ecosystem. The combination of SwiGLU, parallel transformer blocks (in some variants), multi-query (or grouped-query) attention, RoPE, and a no-dropout/no-bias formulation was adopted with minor variations by Meta's LLaMA family (2023) and by many subsequent open-weight models including Mistral 7B, Falcon, and Qwen.[1] The Multi-Query Attention design in particular, which had been published years earlier but seldom used, became standard for inference-efficient decoding largely on the strength of PaLM's deployment experience.[1]
PaLM was, by Google's framing, the proof point for two pieces of Google infrastructure: the Pathways orchestration system and the TPU v4 generation. PaLM's 6,144-chip job remained one of the most cited examples of Google's accelerator scale until the TPU v5p generation and the Gemini Ultra training run took over that role in 2023 to 2024.[1][2][35]
Google announced the merger of DeepMind and Google Brain into Google DeepMind in April 2023, partly to accelerate development of a unified next-generation model after OpenAI's GPT-4.[35] On December 6, 2023, Google CEO Sundar Pichai and Google DeepMind CEO Demis Hassabis announced Gemini 1.0 as the explicit successor to PaLM 2, launching in three sizes (Gemini Ultra, Gemini Pro, Gemini Nano) and described as natively multimodal from the start.[35] Gemini's technical report frames it as the spiritual successor to both PaLM 2 (its text capabilities) and PaLM-E (its multimodal grounding), with the underlying training infrastructure descended from Pathways and the TPU v4/v5 generations that PaLM had originally validated.[35]
The public PaLM API, which had been opened in March 2023 through Google AI for Developers (initially via MakerSuite) and Google Cloud's Vertex AI, was put on a deprecation schedule in early 2024.[36] On August 15, 2024, the Google AI PaLM API was decommissioned: from that date no new requests, no fine-tunes, and no inference on PaLM-tuned models were accepted, and developers were directed to migrate to the Gemini API (with the same API key flow).[6] The Vertex AI PaLM API followed shortly afterwards, and by late 2024 PaLM was fully retired from Google's external product surface.[6] The MedLM models continued to ship through Vertex AI as PaLM 2-derived endpoints into 2024 before being replaced by Gemini-based medical models, completing the transition of every public PaLM-family endpoint to a Gemini successor.[29]
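The PaLM and PaLM 2 papers were both transparent about the families' limitations. The most frequently cited concerns include measurable identity-group bias, toxic continuations of toxic prompts, and memorization of training data, all of which the PaLM paper analyzes.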
The following table summarizes PaLM 540B against the dense and sparse frontier models cited in its paper.
| Model | Year | Parameters | Training tokens | Notes |
|---|---|---|---|---|
| GPT-3 (OpenAI) | 2020 | 175B (dense) | 300B | Established few-shot in-context learning at scale.[1][37] |
| Gopher (DeepMind) | 2021 | 280B (dense) | 300B | Trained on the MassiveText corpus; beat SOTA on 100 of 124 tasks.[9] |
| Megatron-Turing NLG (Microsoft/NVIDIA) | 2022 | 530B (dense) | 270B | Largest dense model before PaLM; trained on a GPU cluster, MFU 30.2%.[11] |
| GLaM (Google) | 2022 | 1,200B total (97B active, sparse) | 1.6T | First trillion-scale MoE LLM.[10] |
| Chinchilla (DeepMind) | 2022 | 70B (dense) | 1.4T | Compute-optimal scaling, outperformed Gopher on benchmarks.[12] |
| PaLM 540B | 2022 | 540B (dense) | 780B | Trained across two TPU v4 pods without pipeline parallelism; MFU 46.2%.[1] |
| PaLM 2 (largest) | 2023 | ~340B (estimated) | ~3.6T (estimated) | Compute-optimal, mixture-of-objectives, undisclosed details.[3][32] |