PaLM

Updated Jun 20, 2026

PaLM (Pathways Language Model) is a family of large language models developed by Google Research. The original PaLM, announced on April 4, 2022, was a 540-billion-parameter dense decoder-only transformer trained using Google's Pathways system across 6,144 TPU v4 chips spanning two pods.^[1]^[2] At the time of its release it was the largest publicly disclosed dense language model ever trained, and it achieved state-of-the-art few-shot results on hundreds of language understanding and generation benchmarks.^[1]^[2] PaLM is best known for demonstrating the effectiveness of chain-of-thought prompting at scale, for validating an architectural recipe (SwiGLU activations, parallel transformer blocks, multi-query attention, and rotary position embeddings) that became standard in subsequent large models, and for serving as the foundation for specialized variants including PaLM-E, Med-PaLM, and Sec-PaLM.^[1]^[13]

The paper's own framing of its central result is direct: "On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark."^[1] PaLM was the first dense language model both to span two TPU pods in a single training job and to beat the average human rater on aggregate across the BIG-bench suite.^[1]^[2]

PaLM was succeeded by PaLM 2, announced at Google I/O on May 10, 2023, which used compute-optimal scaling and a smaller but more capable architecture.^[3]^[4] PaLM 2 powered the Bard chatbot from May 2023 until Bard transitioned to the Gemini family in December 2023.^[4]^[5] Google decommissioned the public PaLM API on August 15, 2024, directing developers to migrate to the Gemini API.^[6]

PaLM 540B at a glance

Attribute	Value
Full name	Pathways Language Model (PaLM)
Developer	Google Research
Announced	April 4, 2022 (Google AI blog); preprint arXiv:2204.02311 on April 5, 2022
Architecture	Dense, decoder-only transformer (autoregressive)
Parameters	540 billion (flagship); also trained at 8B and 62B
Context length	2,048 tokens
Training tokens	780 billion (one epoch)
Training hardware	6,144 TPU v4 chips across two pods
Model FLOPs utilization	46.2% (57.8% hardware FLOPs utilization)
Headline results	28/29 English NLP tasks SOTA; beat average human on BIG-bench; 58% on GSM8K with 8-shot chain-of-thought
Key descendants	PaLM-E, Minerva, Flan-PaLM, Med-PaLM, Sec-PaLM, PaLM 2
Public API retired	August 15, 2024 (migrate to Gemini API)

Background

The Pathways system

PaLM was built to demonstrate the capabilities of Pathways, a distributed machine learning system that Google had been developing since 2021. Google Senior Fellow Jeff Dean introduced the broader Pathways vision in an October 2021 blog post, framing it as a next-generation architecture for "general-purpose intelligent systems" able to handle multiple tasks and modalities with sparsely activated, dynamically routed computation.^[7] The accompanying systems paper, "Pathways: Asynchronous Distributed Dataflow for ML," followed in 2022 and described an orchestrator that could schedule computation across many accelerators connected over a data-center network.^[8]

A central engineering target for Pathways was the ability to scale a single training job beyond the boundaries of one TPU pod. Until that point, almost all of the largest dense models (including Google's LaMDA and Microsoft/NVIDIA's Megatron-Turing NLG) were trained inside a single pod, because the bandwidth between pods over a standard data-center network is far lower than the dedicated interconnects inside a pod.^[1]^[9] Pathways was designed to span pods efficiently using a combination of asynchronous gangs of accelerators and clever placement of pipeline stages.^[8] PaLM was the first large model trained end-to-end on Pathways, and the paper presents PaLM's training run as the headline validation of the system.^[1]

The Pathways paper itself, authored by Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, and 12 other Google engineers, was posted to arXiv on March 23, 2022 (arXiv:2203.12533) and presented at the 5th MLSys Conference.^[8] It describes a single-controller model in which a Python "client" program dispatches a sharded dataflow graph of asynchronous operators that consume and produce futures, and gang-schedules heterogeneous parallel computations on thousands of accelerators while coordinating data transfers over their dedicated interconnects.^[8] Pathways reported performance parity with state-of-the-art SPMD systems running over 2,048 TPU chips, and comparable throughput for Transformer models pipelined across 16 stages or sharded across two islands of accelerators connected over a data-center network, the configuration ultimately used for PaLM.^[8]

Dense versus sparse models

The PaLM project was deliberately framed as a test of dense scaling. By early 2022, several teams, including Google's own Brain group with GLaM (a 1.2-trillion-parameter sparsely activated mixture-of-experts model), had argued that sparse, expert-routed architectures could match or exceed dense models at a fraction of the inference cost.^[1]^[10] GLaM contained 64 experts per MoE layer across 32 MoE layers but activated only roughly 97 billion parameters per token, achieving better zero-shot and one-shot performance than GPT-3 on 29 NLP tasks while consuming about one-third of GPT-3's training energy.^[10] PaLM took the opposite bet, asking how far one could push a single, densely activated transformer using better infrastructure. The paper explicitly positions PaLM relative to dense baselines such as GPT-3 (175B), Gopher (280B), Megatron-Turing NLG (530B), and Chinchilla (70B), all of which had been published between 2020 and early 2022.^[1]^[9]^[11]^[12]

Authors and publication history

The PaLM paper, titled PaLM: Scaling Language Modeling with Pathways, was co-led by Aakanksha Chowdhery and Sharan Narang and credits more than 60 co-authors at Google Research, including Jacob Devlin, Jeff Dean, Noah Fiedel, and Slav Petrov.^[1]^[2] It was posted to arXiv as preprint 2204.02311 on April 5, 2022,^[1] one day after the Google AI blog announcement on April 4, 2022.^[2] A revised version (v5) followed in October 2022, and the work was later published in the Journal of Machine Learning Research (volume 24, 2023) as an 87-page article.^[1]

Architecture

PaLM is a dense, decoder-only transformer trained with an autoregressive next-token prediction objective. The architecture combines several modifications that had been individually studied in prior work but, before PaLM, had not been demonstrated together at the 540-billion-parameter scale.^[1]

SwiGLU feed-forward layers

In place of the ReLU or GELU activations used in earlier transformers, PaLM uses the SwiGLU activation in its feed-forward sublayers. SwiGLU was proposed by Noam Shazeer in 2020 and is defined as SwiGLU(x) = Swish(xW) * xV, where Swish(z) = z * sigmoid(z). The gated formulation requires three matrix multiplications in each feed-forward block rather than the usual two, but ablations in the PaLM paper found a meaningful quality improvement at matched compute.^[1] PaLM sets the feed-forward hidden dimension to four times the model dimension (d_ff = 4 * d_model), rather than the 8/3 ratio sometimes used in later models that adopt SwiGLU.^[1]
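
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not PaLM's implementation: the helper names (`swish`, `matvec`, `swiglu_ffn`), the toy dimensions, and the weight values are all hypothetical, and only the formula SwiGLU(x) = Swish(xW) * xV and the 4x hidden-size ratio come from the text.

```python
import math

def swish(z):
    """Swish(z) = z * sigmoid(z), written as z / (1 + exp(-z))."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward sublayer: (Swish(x W) * (x V)) projected back to d_model."""
    gate = [swish(g) for g in matvec(x, w_gate)]   # matmul 1: Swish(xW)
    up = matvec(x, w_up)                           # matmul 2: xV
    hidden = [g * u for g, u in zip(gate, up)]     # elementwise gating
    return matvec(hidden, w_down)                  # matmul 3: back to d_model

D_MODEL = 2
D_FF = 4 * D_MODEL  # the 4x ratio described in the text

# Hypothetical toy weights; in a real model these are learned.
w_gate = [[0.1 * (i + j) for j in range(D_FF)] for i in range(D_MODEL)]
w_up = [[0.05 * (i - j) for j in range(D_FF)] for i in range(D_MODEL)]
w_down = [[0.01 * (i + 1) for _ in range(D_MODEL)] for i in range(D_FF)]

y = swiglu_ffn([1.0, -2.0], w_gate, w_up, w_down)
print(len(y), y)
```

The three `matvec` calls correspond to the three matrix multiplications mentioned above, compared with two in an ungated ReLU or GELU feed-forward block.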

Parallel transformer blocks

PaLM rearranges each transformer block so that the attention sublayer and the feed-forward sublayer are computed in parallel from the same input rather than sequentially, with their outputs summed:

y = x + Attention(LayerNorm(x)) + FFN(LayerNorm(x))

This formulation, sometimes called the "parallel" or "fused" block, fuses the matrix multiplications that produce the queries/keys/values and the feed-forward up-projection, yielding roughly a 15% training speedup at large scale.^[1] Ablations reported in the PaLM paper showed a small quality regression at 8B parameters but no measurable degradation at 62B or 540B, so the parallel formulation was used for all three sizes.^[1] The parallel-block design was a key contributor to PaLM's high accelerator utilization, since it enabled the XLA compiler to fuse adjacent matrix multiplications and amortize collective communication across the attention and feed-forward computations.^[1]
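
As a rough illustration of the structural difference, the sketch below contrasts a conventional serial block with the parallel form given above. The sublayers `toy_attention` and `toy_ffn` are hypothetical stand-ins (simple elementwise functions, not real attention or feed-forward layers), and the LayerNorm omits its learned scale for brevity.

```python
def layer_norm(x, eps=1e-6):
    """LayerNorm without learned scale or bias (a simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def serial_block(x, attention, ffn):
    """Conventional block: the feed-forward sublayer sees the attention output."""
    h = [xi + a for xi, a in zip(x, attention(layer_norm(x)))]
    return [hi + f for hi, f in zip(h, ffn(layer_norm(h)))]

def parallel_block(x, attention, ffn):
    """Parallel block: y = x + Attention(LayerNorm(x)) + FFN(LayerNorm(x))."""
    normed = layer_norm(x)  # computed once and shared by both branches
    return [xi + a + f for xi, a, f in zip(x, attention(normed), ffn(normed))]

# Hypothetical stand-in sublayers, used only to make the sketch runnable.
def toy_attention(v):
    return [abs(t) for t in v]

def toy_ffn(v):
    return [t * t for t in v]

x = [1.0, 2.0, 4.0]
print(serial_block(x, toy_attention, toy_ffn))
print(parallel_block(x, toy_attention, toy_ffn))
```

Because both branches of the parallel form read the same normalized input, their input projections can be computed together, which is the property the fusion described above relies on.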

Multi-query attention

Standard multi-head attention gives each head its own query, key, and value projections. PaLM uses multi-query attention (MQA), introduced by Shazeer in 2019, in which the key and value projections are shared across all heads while each head retains its own query projection. The paper reports that MQA is quality-neutral and only marginally slower in training, but it dramatically shrinks the key-value cache during autoregressive decoding and therefore makes inference significantly cheaper.^[1] PaLM was the first model at this scale to commit to MQA, helping cement the design as a default for later dense LLMs and a direct predecessor of the grouped-query attention used in Llama 2 and other open-weight families.
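
The size of the cache saving can be estimated with simple arithmetic. The sketch below uses the PaLM 540B shape from the configuration table in this article (118 layers, 48 heads, head dimension 256); the function name and the choice of 2 bytes per stored value (a 16-bit format) are assumptions made for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache added per generated token.

    The leading factor 2 counts one key vector and one value vector
    per layer and per key/value head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

LAYERS, HEADS, HEAD_DIM = 118, 48, 256  # PaLM 540B shape from the table

multi_head = kv_cache_bytes_per_token(LAYERS, HEADS, HEAD_DIM)  # one K/V per head
multi_query = kv_cache_bytes_per_token(LAYERS, 1, HEAD_DIM)     # one shared K/V

print(multi_head, multi_query, multi_head // multi_query)
```

Under these assumptions the cache shrinks by a factor equal to the number of heads, since only the query projections remain per-head.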

Rotary position embeddings

PaLM uses rotary position embeddings (RoPE), introduced by Su et al. in 2021, instead of absolute or learned relative position embeddings.^[1] RoPE encodes absolute position by rotating the query and key vectors by a position-dependent angle, so that the dot product between a query and a key depends only on their relative offset and relative position information enters the attention computation directly. This choice is credited with better behavior on long sequences and has since become a standard design across open-source LLMs such as LLaMA. The combination of RoPE, MQA, SwiGLU, and parallel transformer blocks adopted by PaLM has been widely characterized as the "modern dense LLM recipe" that subsequent open-weight models inherited.
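
A small sketch can make the rotation concrete. It follows the general RoPE construction (rotating pairs of coordinates by an angle proportional to the token position); the frequency base of 10000, the pairing of adjacent coordinates, and the function names are assumptions for illustration and are not taken from the PaLM paper.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (2) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 12), rope(k, 10))
print(score_near, score_far)
```

The two scores agree up to floating-point error, which is the sense in which an absolute rotation yields a relative-position attention score.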

Other design choices

Vocabulary. PaLM uses a SentencePiece tokenizer with a 256,000-token vocabulary built over the multilingual training corpus. The tokenizer is "lossless": it preserves all whitespace (important for code), splits out-of-vocabulary Unicode into bytes, and represents numbers as individual digits.^[1]
No bias terms or dropout. Bias terms are removed from all dense layers and layer norms, and dropout is not applied during pre-training. The authors report that bias-free layers improve training stability at scale.^[1]
Shared input/output embeddings. The input token embedding matrix and the output classifier weight matrix are shared.^[1]
Sequence length. A context window of 2,048 tokens is used for all configurations.^[1]
Pre-normalization. PaLM applies LayerNorm before each sublayer (the "pre-LN" placement), without trainable bias parameters, a choice that, together with the no-bias and no-dropout decisions, is credited with improving optimization stability at 540B scale.^[1]

Model configurations

To study scaling behavior, the team trained three models from scratch on the same data:

Configuration	PaLM 8B	PaLM 62B	PaLM 540B
Parameters	8.63 billion	62.50 billion	540.35 billion
Layers	32	64	118
Model dimension (d_model)	4,096	8,192	18,432
Attention heads	16	32	48
Head dimension	256	256	256
Feed-forward dimension	16,384	32,768	73,728
Vocabulary size	256,000	256,000	256,000

The head dimension is held fixed at 256 across all three sizes, and the feed-forward hidden dimension is always 4 * d_model.^[1]
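
The parameter counts in the table can be approximately reproduced from the other rows. The sketch below is an estimate under stated assumptions rather than the paper's own accounting: it counts per-head query and output projections, a single shared key and value projection (multi-query attention), three SwiGLU feed-forward matrices, and one shared embedding matrix, and it ignores the small LayerNorm scale vectors. The function and dictionary names are hypothetical.

```python
VOCAB = 256_000

CONFIGS = {
    "PaLM 8B": dict(layers=32, d_model=4096, heads=16, head_dim=256),
    "PaLM 62B": dict(layers=64, d_model=8192, heads=32, head_dim=256),
    "PaLM 540B": dict(layers=118, d_model=18432, heads=48, head_dim=256),
}

def approx_params(layers, d_model, heads, head_dim, vocab=VOCAB):
    d_ff = 4 * d_model                           # feed-forward width from the table
    q_and_out = 2 * d_model * heads * head_dim   # per-head query + output projections
    shared_kv = 2 * d_model * head_dim           # one key and one value projection (MQA)
    ffn = 3 * d_model * d_ff                     # three SwiGLU matrices, no biases
    embedding = vocab * d_model                  # shared input/output embedding
    return layers * (q_and_out + shared_kv + ffn) + embedding

for name, cfg in CONFIGS.items():
    print(name, round(approx_params(**cfg) / 1e9, 2), "billion")
```

Under these assumptions the estimates land on the same rounded figures as the Parameters row, which suggests the table's shapes are internally consistent.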

Training

What data was PaLM trained on?

PaLM was trained on 780 billion tokens drawn from a high-quality mixture based on the corpora previously assembled for LaMDA and GLaM.^[1] The mixture is summarized below:

Data source	Share of training tokens
Social media conversations (multilingual)	50%
Filtered web pages (multilingual)	27%
Books (English)	13%
GitHub source code	5%
Wikipedia (multilingual)	4%
News articles (English)	1%

About 78% of the tokens are English, with the remaining 22% spread across more than 100 languages.^[1] Each model is trained for exactly one epoch over the dataset; no training example is repeated. The code component spans 24 programming languages drawn from open-source GitHub repositories, totals roughly 196 gigabytes of raw text, and makes up the 5% code allocation; despite that small share, it is critical to PaLM's code-generation results discussed below.^[1] The corpus was filtered for quality using a logistic-regression classifier (trained to distinguish Wikipedia, books, and selected web pages from generic web crawl) and was further de-duplicated to reduce memorization.^[1]
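
For a sense of scale, the percentages in the mixture table can be converted into approximate token counts. This is plain arithmetic on the figures quoted in this section (780 billion tokens and the listed shares); the variable names are illustrative, and the real per-source counts may differ slightly from these rounded shares.

```python
TOTAL_TOKENS = 780_000_000_000

MIXTURE_PERCENT = {
    "Social media conversations": 50,
    "Filtered web pages": 27,
    "Books": 13,
    "GitHub source code": 5,
    "Wikipedia": 4,
    "News articles": 1,
}

# The shares should account for the whole corpus.
assert sum(MIXTURE_PERCENT.values()) == 100

tokens_by_source = {
    source: TOTAL_TOKENS * percent // 100
    for source, percent in MIXTURE_PERCENT.items()
}

for source, tokens in tokens_by_source.items():
    print(f"{source}: {tokens / 1e9:.1f} billion tokens")
```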

Infrastructure and Pathways

The 540B model was trained on 6,144 TPU v4 chips, distributed across two TPU v4 pods of 3,072 chips each, with 768 hosts per pod.^[1]^[2] The training job used a hybrid parallelism strategy:

Data parallelism between the two pods, over the data-center network (DCN).
A combination of model parallelism and within-pod data parallelism inside each pod, over the high-bandwidth intra-pod interconnect (ICI).

A single Python "client" running on Pathways dispatched half of each training batch to each pod. The pods executed forward and backward passes in parallel, exchanged gradients over the DCN, and accumulated both local and remote gradients before applying a bit-identical parameter update on each pod.^[1]^[8] Activations were rematerialized (recomputed during the backward pass instead of stored), which let the system use a larger effective batch size within the available memory. With these techniques the training run achieved a 57.8% hardware FLOPs utilization (HFU) and a 46.2% model FLOPs utilization (MFU), at the time of publication the highest reported figures for any LLM at this scale; PaLM 540B sustained an average training throughput of 238,300 tokens per second at the largest batch size.^[1]^[13] By comparison, Megatron-Turing NLG 530B had reported 30.2% MFU.^[1] The authors framed both numbers as a vindication of the Pathways system's ability to span pods without prohibitive bandwidth penalties.^[1]

The full training of the 540B configuration used 6,144 TPU v4 chips for about 1,200 hours plus 3,072 chips for an additional 336 hours, for a final-run compute cost of approximately 2.56 x 10^24 floating-point operations.^[1]^[14] At list cloud-rental prices the run was estimated to cost between roughly 9 million and 23 million US dollars in compute, depending on assumed TPU pricing and downtime.^[14]
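
The utilization and compute figures quoted in this section can be cross-checked with a back-of-the-envelope calculation. The sketch below assumes a per-token training cost of 6N + 12·L·H·Q·T FLOPs (parameters N, layers L, heads H, head dimension Q, sequence length T) and a peak of 275 teraFLOP/s per TPU v4 chip; both are assumptions brought in for illustration, while the remaining numbers come from the text and the configuration table.

```python
N_PARAMS = 540.35e9
LAYERS, HEADS, HEAD_DIM, SEQ_LEN = 118, 48, 256, 2048
TRAINING_TOKENS = 780e9
OBSERVED_TOKENS_PER_SEC = 238_300

CHIPS = 6144
PEAK_FLOPS_PER_CHIP = 275e12  # assumed TPU v4 peak, not stated in this article

# Assumed cost model: dense matmuls (6N) plus the attention score/value term.
flops_per_token = 6 * N_PARAMS + 12 * LAYERS * HEADS * HEAD_DIM * SEQ_LEN

# MFU: observed throughput divided by the throughput at 100% of peak.
max_tokens_per_sec = CHIPS * PEAK_FLOPS_PER_CHIP / flops_per_token
mfu = OBSERVED_TOKENS_PER_SEC / max_tokens_per_sec

# Total compute for one pass over the data, and chip-hours for the two phases.
total_flops = flops_per_token * TRAINING_TOKENS
chip_hours = 6144 * 1200 + 3072 * 336

print(f"MFU ~ {mfu:.1%}")
print(f"total training FLOPs ~ {total_flops:.2e}")
print(f"chip-hours = {chip_hours:,}")
```

Under these assumptions the estimate comes out close to the reported 46.2% MFU and to the roughly 2.56 x 10^24 FLOPs quoted above, so the published throughput, model shape, and compute figures are mutually consistent.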

Optimizer and training schedule

PaLM was trained with the Adafactor optimizer used without factorization, which the paper notes is effectively Adam with "parameter scaling" that scales the learning rate by the root-mean-square of each parameter matrix.^[1] The batch size was warmed up in stages: for PaLM 540B, the team started at 512 sequences (1M tokens) until step 50k, doubled to 1,024 sequences (2M tokens) until step 115k, and doubled again to 2,048 sequences (4M tokens) for the remainder of training, ending at step 255k.^[1] A peak learning rate of 0.01 was held constant for 10,000 steps before being decayed proportionally to the inverse square root of the step number, and gradients were clipped to a global norm of 1.0.^[1]
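
The schedule described above can be written down compactly. The sketch below is one reading of the text, not the training code: it assumes each stage boundary is inclusive, assumes 2,048-token sequences, and expresses the learning rate as 1/sqrt(max(step, 10,000)), which equals 0.01 for the first 10,000 steps and decays with the inverse square root of the step afterwards. All function names are hypothetical.

```python
import math

SEQ_LEN = 2048

# (last step of the stage, sequences per batch) for PaLM 540B, as described above.
BATCH_STAGES = [(50_000, 512), (115_000, 1024), (255_000, 2048)]

def batch_size(step):
    """Sequences per batch at a given training step."""
    for last_step, sequences in BATCH_STAGES:
        if step <= last_step:
            return sequences
    raise ValueError("step is past the end of training")

def learning_rate(step, warm_steps=10_000):
    """Constant 0.01 up to warm_steps, then proportional to 1/sqrt(step)."""
    return 1.0 / math.sqrt(max(step, warm_steps))

def total_tokens():
    """Tokens consumed over the whole schedule."""
    total, previous_end = 0, 0
    for last_step, sequences in BATCH_STAGES:
        total += (last_step - previous_end) * sequences * SEQ_LEN
        previous_end = last_step
    return total

print(batch_size(60_000), learning_rate(40_000), total_tokens())
```

Summing the three stages gives a total just under 780 billion tokens, in line with the single-epoch token budget stated earlier.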

Training was not entirely smooth. The paper documents about twenty loss spikes during the 540B run; for each one, the team restarted from a checkpoint about 100 steps before the spike and skipped roughly 200 to 500 batches of data, after which the loss returned to its trend.^[1] The authors attribute these spikes to specific batches interacting badly with the current parameter values rather than to a systemic instability. Replaying the same batch from a different checkpoint did not consistently reproduce the spike, which they take as evidence of a fragile interaction between the optimizer state and rare token sequences.^[1]^[15]

Capabilities and benchmarks

The PaLM paper accompanies its architecture and infrastructure description with one of the broadest evaluation suites ever reported for a single language model at the time. The headline claims are summarized below.

English NLP tasks

On a curated suite of 29 widely used English NLP benchmarks spanning question answering, cloze completion, in-context reasoning, Winograd-style coreference, common-sense reasoning, and SuperGLUE, PaLM 540B exceeded the few-shot state of the art on 28 of 29 tasks, with comparisons against GPT-3 175B, Megatron-Turing NLG 530B, Gopher 280B, Chinchilla 70B, and LaMDA 137B.^[1] On several tasks (such as NaturalQuestions and TriviaQA) the few-shot PaLM 540B numbers also exceeded the previously best published fine-tuned results.^[1]

BIG-bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative collection of more than 150 diverse tasks designed to probe the limits of large language models.^[16] On 58 common BIG-bench tasks evaluated with 5-shot prompting, PaLM 540B beat the average human rater baseline on aggregate, a milestone earlier dense models had not reached, and outperformed both GPT-3 and Gopher on the same subset by large margins.^[1]^[2] As the Google Research announcement put it, "PaLM 540B 5-shot also does better than the average performance of people asked to solve the same tasks."^[2]

The paper also highlights "discontinuous" jumps in performance between the 62B and 540B configurations on certain BIG-bench tasks (for example, distinguishing cause and effect, recognizing logical inference patterns, and certain forms of compositional generalization), feeding the broader conversation about emergent abilities in language models.^[1] In its own words, the paper reports that "a significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model."^[1] In log-linear scaling plots, the paper argues that returns from scale have not plateaued at 540B, contrary to several earlier predictions that capability gains would saturate beyond a few hundred billion parameters.^[1]^[2]

Chain-of-thought reasoning

Concurrent work by Jason Wei and colleagues introduced chain-of-thought prompting, in which the few-shot exemplars include step-by-step reasoning traces.^[17] Combined with PaLM, this technique produced some of the paper's most striking results:

On the GSM8K grade-school math benchmark, PaLM 540B with 8-shot chain-of-thought prompting reached 58% accuracy, surpassing the previous best (55%) which had been achieved by a fine-tuned GPT-3 augmented with an external calculator and a separately trained verifier.^[1]^[2]^[17] PaLM 540B with standard prompting reached only 17.9%, so the chain-of-thought gain was more than 40 percentage points.^[17]
On the harder BIG-Bench Hard suite (23 BIG-bench tasks where prior LMs had failed to match the average human rater), PaLM with chain-of-thought prompting beat the average human rater on 10 of 23 tasks.^[18]
The paper popularized the use of self-consistency decoding (sampling multiple chains and majority-voting on the answer), introduced by Wang et al. in 2022, as a complementary technique that pushes GSM8K accuracy further still.^[1]
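
The majority-voting step of self-consistency decoding is simple enough to sketch directly. In the example below the sampled reasoning chains are hard-coded, hypothetical strings standing in for model samples, and the "The answer is" marker is an assumed output convention; only the idea of sampling several chains and voting on the final answer comes from the text.

```python
from collections import Counter

def extract_answer(chain, marker="The answer is"):
    """Take whatever follows the last occurrence of the marker as the final answer."""
    return chain.rsplit(marker, 1)[-1].strip().rstrip(".")

def self_consistent_answer(chains):
    """Majority vote over the final answers of several sampled reasoning chains."""
    votes = Counter(extract_answer(chain) for chain in chains)
    return votes.most_common(1)[0][0]

# Hypothetical samples for one question; a real system would sample these
# from the model with a nonzero temperature.
sampled_chains = [
    "16 - 3 - 4 = 9 eggs left. 9 * 2 = 18. The answer is 18.",
    "She sells 16 - 7 = 9 eggs at $2 each, so 18. The answer is 18.",
    "16 - 3 = 13, and 13 * 2 = 26. The answer is 26.",
    "9 eggs remain, earning 9 * 2 dollars. The answer is 18.",
]

print(self_consistent_answer(sampled_chains))
```

A chain that goes wrong tends to disagree with the others, so voting over final answers filters out some reasoning errors that a single greedy sample would keep.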

These results are widely cited as the moment that chain-of-thought reasoning became a credible general technique rather than a niche prompting trick.^[17]^[18]

Code

Despite only 5% of pre-training data being code, PaLM 540B's few-shot code-generation performance was competitive with OpenAI Codex 12B, the model behind the original GitHub Copilot, which had been specifically fine-tuned on a much larger code corpus. The PaLM paper reports that comparable performance was achieved with roughly 50 times less Python code in its training data.^[1]^[2] A fine-tuned variant called PaLM-Coder reached 82.1% compile rate on the DeepFix bug-repair benchmark, against a prior state of the art of 71.7%.^[1] PaLM was also evaluated on HumanEval, MBPP, and the TransCoder cross-language translation benchmark, on each of which it either matched or exceeded contemporary specialized models.^[1]

Multilingual capabilities

Even though only 22% of training tokens were non-English, PaLM produced strong results on multilingual benchmarks. On WMT machine translation tasks PaLM was particularly effective at translating into English, and the paper reports the strongest results among LLMs trained on non-parallel multilingual corpora at the time.^[1] Multilingual summarization and question-answering also benefited from PaLM's large 256k vocabulary, which kept token counts moderate even for morphologically rich languages.^[1]

Qualitative behaviors

PaLM popularized a class of "qualitative" demonstrations that became routine in subsequent LLM releases: the model could explain jokes, write analogies, identify cause-and-effect relationships in short stories, and produce step-by-step justifications for its answers when asked to "think through" a problem.^[2]^[19] The blog post accompanying the paper highlighted joke explanation in particular as evidence that scale was unlocking new behaviors, and the paper appendix collects a set of two-shot prompts in which the model dissects the structure of a punch line.^[2] These examples became templates for public-facing demonstrations of emergent abilities and were widely circulated outside the research community.^[19]

Bias, toxicity, and memorization

The PaLM paper devotes an extensive section to bias and toxicity analyses, including evaluation on the BBQ social bias benchmark and toxicity classification on RealToxicityPrompts.^[1] PaLM 540B exhibited measurable identity-group bias on several dimensions, and its toxicity rate (probability of generating toxic continuations of toxic prompts) increased with model scale, a result the authors flag as a serious limitation.^[1] The paper also studies training data memorization, showing that memorization rates grow logarithmically with model scale, consistent with concurrent observations by Carlini and colleagues on GPT-style models.^[1]

Variants of the original PaLM

The original PaLM family included three pre-training configurations and a small number of fine-tuned derivatives.

PaLM 8B

The smallest pre-trained PaLM. It is used primarily for scaling-law and ablation studies in the original paper, where its results on benchmarks such as LAMBADA, TriviaQA, and BIG-bench bracket the lower end of the curve.^[1] PaLM 8B is also the configuration in which the parallel-block ablation showed a small but consistent quality penalty, while at 62B and 540B the penalty vanishes.^[1]

PaLM 62B

A middle configuration that already matched or beat earlier large models (such as GPT-3 175B) on many tasks at roughly one-third of GPT-3's size.^[1] PaLM 62B also serves as the reference point for several of the paper's scale comparisons, such as the parallel-block quality study and the discontinuous BIG-bench jumps between 62B and 540B.

PaLM 540B

The headline configuration and the one most often referred to simply as "PaLM." Unless otherwise specified, the BIG-bench, GSM8K, and code-generation numbers reported above are for this model.^[1]

Flan-PaLM

In late 2022, the Google Brain "Flan" team applied instruction tuning to PaLM, fine-tuning the 540B base model on a mixture of more than 1,800 instruction-formatted tasks (the "Flan Collection"), producing Flan-PaLM 540B.^[20] Instruction tuning substantially improved few-shot performance on held-out evaluations: Flan-PaLM 540B outperformed PaLM 540B by an average of about 9.4% across the evaluation suite and reached 75.2% on five-shot MMLU, the leading score on the benchmark at the time of release.^[20] Flan-PaLM became the workhorse base for downstream fine-tuning efforts, including Med-PaLM.^[20]^[21]

U-PaLM

A second-stage pre-training continuation of PaLM trained with the UL2 "mixture-of-denoisers" objective, U-PaLM was introduced in October 2022 in the paper Transcending Scaling Laws with 0.1% Extra Compute by Yi Tay and colleagues.^[22] By continuing PaLM training for an additional sliver of compute (about 0.1% of the original budget) under UL2's combination of causal and span-corruption objectives, U-PaLM achieved the same loss as PaLM 540B at roughly half the total compute and improved chain-of-thought performance, BIG-bench performance, and multilingual results.^[22] U-PaLM models were reported at 8B, 62B, and 540B scales.^[22]

PaLM-Coder

A code-focused fine-tune that was used for results on DeepFix and other program-repair benchmarks discussed in the PaLM paper.^[1] PaLM-Coder retained the PaLM architecture but was further trained on additional code data, and its 82.1% DeepFix compile rate (versus the prior 71.7% state of the art) is presented as a strong validation of PaLM's transfer learning from natural language to code.^[1]

PaLM-E

PaLM-E ("Embodied") is a multimodal extension of PaLM aimed at robotics and embodied reasoning. It was introduced in the paper PaLM-E: An Embodied Multimodal Language Model posted to arXiv on March 6, 2023, with Danny Driess as the first author and co-authors from Google, Robotics at Google, and TU Berlin.^[23] The work was later presented at ICML 2023.^[23]

PaLM-E injects continuous sensor inputs, including images, robot state estimates, and 3-D scene representations, directly into the language embedding space of a pre-trained PaLM. Learned encoders project each modality into the same vector space as the text token embeddings, producing "multimodal sentences" that interleave language tokens and sensor features.^[23] The training objective remains a single autoregressive next-token loss, so the same model handles language-only, vision-language, and robotic-action prediction tasks without per-task heads.^[23]

The flagship configuration, PaLM-E-562B, combines the 540B-parameter PaLM language model with the 22-billion-parameter Vision Transformer (ViT-22B), for a total of about 562 billion parameters, the largest visual-language model reported at the time of its release.^[23] PaLM-E-562B set a new state of the art on the OK-VQA visual question-answering benchmark while remaining competitive on language-only tasks, leading the authors to argue that sufficient scale lets a single backbone retain its generalist behavior even after multimodal fine-tuning (an effect they termed "catastrophic-forgetting avoidance").^[23] PaLM-E was demonstrated controlling real robot arms with natural-language task instructions and visual context, and the paper reports positive transfer from internet-scale visual-language pre-training to physical manipulation tasks, both on a mobile manipulator in a kitchen setting and on a tabletop manipulation rig.^[23]

PaLM-SayCan

PaLM-SayCan is an interpretable approach to instructing robots in natural language that combines PaLM with learned affordance functions. The paper Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, by Michael Ahn and 44 co-authors, was posted to arXiv on April 4, 2022 (arXiv:2204.01691), the same day as the PaLM announcement.^[24]

SayCan separates "Say," the language model's judgment of whether a candidate skill is useful for a goal, from "Can," a learned affordance value function that estimates whether the skill is executable in the current state. At each step, the skill with the highest product of the two scores is selected.^[24] When the underlying language model was upgraded from FLAN to PaLM and the system was renamed PaLM-SayCan, Google reported a planning success rate of 84% and an end-to-end execution success rate of 74% across 101 real-world kitchen tasks with a mobile manipulator, roughly halving the error rate of FLAN-SayCan and of PaLM without affordance grounding.^[24]^[25] The Google Research blog described PaLM-SayCan as the first demonstration that improvements in a frontier language model translated cleanly into improvements in physical-robot task performance.^[25]
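
The scoring rule can be illustrated with a toy example. The skill names and all scores below are invented for illustration; in the real system the "Say" scores come from the language model and the "Can" scores from learned value functions, neither of which is modeled here.

```python
def select_skill(say_scores, can_scores):
    """Pick the skill maximizing usefulness ("Say") times feasibility ("Can")."""
    return max(say_scores, key=lambda skill: say_scores[skill] * can_scores[skill])

# Hypothetical scores for the instruction "clean up the spilled drink".
say = {  # how useful the language model judges each skill to be
    "find a sponge": 0.5,
    "pick up the sponge": 0.4,
    "go to the table": 0.1,
}
can = {  # how likely each skill is to succeed from the current state
    "find a sponge": 0.9,
    "pick up the sponge": 0.1,  # no sponge is in view yet
    "go to the table": 0.8,
}

print(select_skill(say, can))
```

Multiplying the two scores lets a skill that is useful but currently infeasible (picking up a sponge that has not been found) lose to one that is both relevant and executable.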

Minerva

Minerva is a quantitative-reasoning specialization of PaLM described in the paper Solving Quantitative Reasoning Problems with Language Models by Aitor Lewkowycz and colleagues, posted to arXiv on June 29, 2022 (arXiv:2206.14858) and presented at NeurIPS 2022.^[26]

Minerva extends each of the three PaLM sizes (8B, 62B, and 540B) with additional pre-training on a 118-gigabyte technical corpus drawn from arXiv preprints and from web pages containing mathematical notation (with care taken to preserve LaTeX and MathJax markup).^[26] The 540B-parameter Minerva variant was trained for an additional 26 billion tokens on top of the PaLM 540B checkpoint.^[26] Combined with chain-of-thought prompting and majority-vote decoding, Minerva 540B reached 50.3% on the MATH competition-mathematics benchmark, up from prior state-of-the-art results in the single digits, and 78.5% on GSM8K, demonstrating that domain-targeted continued pre-training combined with prompting could deliver an order-of-magnitude improvement on quantitative tasks.^[26]

Med-PaLM and Med-PaLM 2

Med-PaLM is a series of medical-domain language models built on top of the PaLM foundation by researchers and clinicians at Google Research and DeepMind.

Med-PaLM

Med-PaLM was introduced in December 2022 in the paper Large Language Models Encode Clinical Knowledge, with Karan Singhal as the first author.^[27] The work started from Flan-PaLM 540B, evaluated with a combination of prompting strategies that included few-shot prompting, chain-of-thought reasoning, and self-consistency decoding; Med-PaLM itself was then derived from Flan-PaLM through instruction prompt tuning on a small set of exemplars. On the MedQA benchmark of USMLE-style multiple-choice questions, Flan-PaLM 540B reached 67.6% accuracy, exceeding the approximate 60% USMLE passing threshold and becoming the first AI system to clear it.^[27] Human evaluation of Med-PaLM's free-form answers, however, found that they still trailed clinicians on factual alignment, completeness, and likelihood of potential harm. The paper was peer-reviewed and published in Nature in July 2023.^[27] The same paper introduced the MultiMedQA benchmark suite, a curated combination of MedQA, MedMCQA, PubMedQA, LiveQA, MedicationQA, MMLU clinical-topics splits, and a new HealthSearchQA dataset, which has since become a standard evaluation suite for medical LLMs.^[27]

Med-PaLM 2

Med-PaLM 2, announced at Google Health's The Check Up event in March 2023, was built on top of PaLM 2 and used improved fine-tuning techniques including ensemble refinement.^[28] It reached 86.5% accuracy on MedQA, improving over Med-PaLM by more than 18 percentage points and making it the first LLM to perform at "expert" level on USMLE-style questions.^[28] Human evaluation found that 92.6% of long-form Med-PaLM 2 answers aligned with scientific consensus, with a lower rate of potentially harmful content than a panel of clinicians, and physicians preferred Med-PaLM 2 answers to physician-written answers on eight of nine clinical axes in a pairwise study covering 1,066 consumer medical questions.^[28] Med-PaLM 2 was later made available to selected Google Cloud customers under the MedLM family of healthcare foundation models, which Google announced in December 2023 as generally available to allowlisted U.S. Google Cloud customers through Vertex AI.^[29]

Med-PaLM M

A multimodal extension, Med-PaLM M (or "Med-PaLM Multimodal"), was introduced in July 2023 in the paper Towards Generalist Biomedical AI by Tao Tu and colleagues.^[30] Building on PaLM-E, Med-PaLM M encodes and interprets clinical language, medical imaging (chest X-ray, mammography, dermatology, pathology, retinal imaging), and genomic variant data within a single set of model weights, and is evaluated on a new MultiMedBench suite of 14 biomedical tasks.^[30] In a side-by-side ranking on 246 retrospective chest X-rays, clinicians expressed a pairwise preference for Med-PaLM M-generated reports over those written by board-certified radiologists in up to 40.5% of cases.^[30]

Sec-PaLM

Sec-PaLM is a cybersecurity-focused variant of PaLM, announced at the RSA Conference in April 2023 as the backbone of Google Cloud's Security AI Workbench.^[31] Sec-PaLM was fine-tuned on Google's security telemetry and Mandiant's threat-intelligence corpus. Public applications include VirusTotal Code Insight, which generates natural-language explanations of potentially malicious scripts, and integrations with Chronicle Security Operations that summarize incidents and generate search queries for analysts.^[31] Google subsequently positioned a refreshed version of the model, Sec-PaLM 2, as the foundation for the broader Security AI Workbench platform.^[31] Google did not disclose Sec-PaLM's parameter count.

PaLM 2

PaLM 2 is treated in its own article; this section gives the context that matters for understanding PaLM's trajectory. PaLM 2 was announced at Google I/O on May 10, 2023, with a technical report posted to arXiv (2305.10403) on May 17, 2023, led by Rohan Anil and Andrew M. Dai.^[3]^[4]

Compared with the original PaLM, PaLM 2 made three large changes:

Compute-optimal scaling. Informed by DeepMind's Chinchilla scaling laws, PaLM 2 grows model size and training-token count in roughly equal (1:1) proportion as the compute budget increases. Public reporting based on internal Google documents put the largest PaLM 2 variant at approximately 340 billion parameters, substantially smaller than PaLM 540B, trained on roughly 3.6 trillion tokens, nearly five times PaLM's 780 billion.^[32]
Mixture of training objectives. Where PaLM was trained with a single causal language-modeling objective, PaLM 2 uses a tuned mixture of pre-training objectives, although the report withholds the exact composition.^[3]
Four named sizes. PaLM 2 ships as a family of models named after animals in increasing size: Gecko, Otter, Bison, and Unicorn.^[4] Gecko is small enough to run on flagship smartphones; Bison was the workhorse size exposed through the PaLM API and Vertex AI; Unicorn is the largest.

PaLM 2 also extends multilingual training data to hundreds of languages and includes substantially more code and mathematics, and the technical report independently re-derived Chinchilla-style 1:1 scaling for very large training budgets.^[3] The PaLM 2 technical report explicitly withholds training-data sources, model architecture details (parameter counts, depth, width), and training-hardware information, in contrast to PaLM's detailed 87-page paper, a decision that received critical commentary from the open-research community.^[3]^[32]
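
The shift toward more data per parameter can be made concrete with a little arithmetic. The Python sketch below uses only the figures quoted above; the PaLM 2 numbers are press estimates rather than official disclosures, and the helper name `tokens_per_parameter` is illustrative. Note that the 1:1 rule describes how parameters and tokens should grow together as compute increases, not that the two counts are equal.

```python
# (parameters, training tokens) as quoted in the text above.
# The PaLM 2 entry is an estimate from press reporting, not an official figure.
models = {
    "PaLM 540B": (540e9, 780e9),
    "PaLM 2 (largest, estimated)": (340e9, 3.6e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in models.items()}

# How much larger the estimated PaLM 2 training set is than PaLM's.
token_growth = models["PaLM 2 (largest, estimated)"][1] / models["PaLM 540B"][1]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
print(f"Training-token growth: {token_growth:.1f}x")
```

Under these figures the estimated PaLM 2 model sees several times more tokens per parameter than PaLM 540B did, which is the practical meaning of the compute-optimal change.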

Bard usage (May 2023 to December 2023)

The most public deployment of the PaLM family was inside Bard, Google's consumer chatbot. Bard launched in March 2023 backed by a lightweight version of LaMDA, but its reception was tepid and Google CEO Sundar Pichai signaled at the end of March 2023 that Bard would soon be upgraded to PaLM.^[5]^[33]

At Google I/O on May 10, 2023, Google announced that Bard was now running on PaLM 2 and that the chatbot was simultaneously being made available without a waitlist in 180 countries and territories, initially in English with rollouts to Japanese and Korean (and a path to 40 languages over the coming months).^[4]^[5] PaLM 2 was credited with Bard's improvements in coding, reasoning, and multilingual response quality.^[4]^[5] PaLM 2 continued to power Bard through the second half of 2023 and was used as the underlying model for a series of Bard upgrades, including the Bard Extensions announced in September 2023.^[34]

On December 6, 2023, Google announced Gemini 1.0 and confirmed that Bard would be powered by Gemini Pro going forward; Gemini Ultra followed in early 2024. In February 2024 Google rebranded Bard to Gemini, completing PaLM 2's exit from the consumer product line.^[35]

Legacy and deprecation

How did PaLM influence later language models?

PaLM had a substantial architectural impact on the open-source LLM ecosystem. The combination of SwiGLU, parallel transformer blocks (in some variants), multi-query (or grouped-query) attention, RoPE, and a no-dropout/no-bias formulation was adopted with minor variations by Meta's LLaMA family (2023) and by many subsequent open-weight models including Mistral 7B, Falcon, and Qwen.^[1] Multi-query attention in particular, which had been published years earlier but seldom used, became standard for inference-efficient decoding largely on the strength of PaLM's deployment experience.^[1]
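
One way to see why multi-query attention helps at inference time is to compare key/value cache sizes. In standard multi-head attention every head stores its own keys and values, whereas multi-query attention shares a single key/value head across all query heads. The Python sketch below uses a hypothetical configuration (32 layers, 16 query heads, head dimension 128, a 2,048-token context, 16-bit values), not PaLM's actual hyperparameters, and the function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int, seq_len: int,
                   batch_size: int = 1, bytes_per_value: int = 2) -> int:
    """Size of the decoding-time key/value cache in bytes."""
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=32, head_dim=128, seq_len=2048)
n_query_heads = 16

mha_bytes = kv_cache_bytes(n_kv_heads=n_query_heads, **cfg)  # one K/V head per query head
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **cfg)              # one shared K/V head

print(f"multi-head cache:  {mha_bytes / 2**20:.0f} MiB")
print(f"multi-query cache: {mqa_bytes / 2**20:.0f} MiB")
print(f"reduction factor:  {mha_bytes // mqa_bytes}x")
```

The cache shrinks by a factor equal to the number of query heads, which reduces memory traffic per generated token. Grouped-query attention sits between the two extremes by setting `n_kv_heads` to a value between 1 and the number of query heads.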

Pathways and TPU v4 validation

PaLM was, by Google's framing, the proof point for two pieces of Google infrastructure: the Pathways orchestration system and the TPU v4 generation. PaLM's 6,144-chip job remained one of the most cited examples of Google's accelerator scale until the TPU v5p generation and the Gemini Ultra training run took over that role in 2023 and 2024.^[1]^[2]^[35]

What replaced PaLM?

Google announced the merger of DeepMind and Google Brain into Google DeepMind in April 2023, partly to accelerate development of a unified next-generation model after OpenAI's GPT-4.^[35] On December 6, 2023, Google CEO Sundar Pichai and Google DeepMind CEO Demis Hassabis announced Gemini 1.0 as the explicit successor to PaLM 2, launching in three sizes (Gemini Ultra, Gemini Pro, Gemini Nano) and described as natively multimodal from the start.^[35] Gemini's technical report frames it as the spiritual successor to both PaLM 2 (its text capabilities) and PaLM-E (its multimodal grounding), with the underlying training infrastructure descended from Pathways and the TPU v4/v5 generations that PaLM had originally validated.^[35]

When was the PaLM API shut down?

The public PaLM API, which had been opened in March 2023 through Google AI for Developers (initially via MakerSuite) and Google Cloud's Vertex AI, was put on a deprecation schedule in early 2024.^[36] On August 15, 2024, the Google AI PaLM API was decommissioned: from that date no new requests, no fine-tunes, and no inference on PaLM-tuned models were accepted, and developers were directed to migrate to the Gemini API (with the same API key flow).^[6] The Vertex AI PaLM API followed shortly afterwards, and by late 2024 PaLM was fully retired from Google's external product surface.^[6] The MedLM models continued to ship through Vertex AI as PaLM 2-derived endpoints into 2024 before being replaced by Gemini-based medical models, completing the transition of every public PaLM-family endpoint to a Gemini successor.^[29]

Limitations and criticisms

Both the PaLM paper and the PaLM 2 technical report discuss the models' limitations at length. The most frequently cited concerns are:

Bias and toxicity. Both papers include extensive bias and toxicity analyses; both models were shown to amplify stereotypes present in their training data, and toxicity rates rose with model scale on RealToxicityPrompts continuations.^[1]^[3]
Hallucination. Like all large LMs, PaLM produces confident but incorrect statements at non-trivial rates; the Med-PaLM evaluations explicitly highlight this as the main obstacle to clinical deployment.^[27]
English skew. With roughly 78% of its training tokens in English, PaLM remained noticeably weaker on low-resource languages than on English. PaLM 2 was designed in part to close this gap.^[1]^[4]
Closed weights. Neither PaLM nor PaLM 2 was released as open weights; external researchers had only API access, which limited reproducibility.^[1]^[3]
Compute requirements. Training PaLM 540B occupied 6,144 TPU v4 chips for a single pass over its 780-billion-token training set; the PaLM 2 technical report does not disclose its training compute, but third-party estimates indicate a similar order of magnitude.^[3]^[14]
Withheld details in PaLM 2. The PaLM 2 technical report was widely criticized for omitting architectural specifications and training data sources, a notable retreat from the openness of the original PaLM paper.^[3]^[32]
Memorization. The PaLM paper's memorization analysis shows that memorization of training examples grows with scale; the 540B model can reproduce verbatim passages from copyrighted books and code when given short prefixes, a behavior subsequently exploited in dataset-extraction attacks.^[1]

Comparison with contemporary frontier models

How does PaLM compare to GPT-3, Gopher, and Chinchilla?

The following table summarizes PaLM 540B against the dense and sparse frontier models cited in its paper, with PaLM 2 added for reference.

Model	Year	Parameters	Training tokens	Notes
GPT-3 (OpenAI)	2020	175B (dense)	300B	Established few-shot in-context learning at scale.^[1]^[37]
Gopher (DeepMind)	2021	280B (dense)	300B	Trained on the MassiveText corpus; beat prior SOTA on 100 of 124 compared tasks.^[9]
Megatron-Turing NLG (Microsoft/NVIDIA)	2022	530B (dense)	270B	Largest dense model before PaLM; trained on NVIDIA A100 GPUs, MFU 30.2%.^[11]
GLaM (Google)	2022	1,200B total (97B active, sparse)	1.6T	First trillion-scale MoE LLM.^[10]
Chinchilla (DeepMind)	2022	70B (dense)	1.4T	Compute-optimal scaling, outperformed Gopher on benchmarks.^[12]
PaLM 540B	2022	540B (dense)	780B	First dense model to span two TPU pods; MFU 46.2%.^[1]
PaLM 2 (largest)	2023	~340B (estimated)	~3.6T (estimated)	Compute-optimal, mixture-of-objectives, undisclosed details.^[3]^[32]
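
The parameter and token counts in the table can be turned into rough training-compute estimates with the common rule of thumb C ≈ 6·N·D (about six floating-point operations per parameter per training token). That approximation is an assumption brought in here for illustration; it is not taken from the table's sources and it ignores attention and other overheads. The Python sketch below covers only the dense rows with disclosed figures, leaving out GLaM (sparse) and PaLM 2 (estimated). The function name `approx_train_flops` is illustrative.

```python
# (model, parameters N, training tokens D) copied from the dense rows of the table.
rows = [
    ("GPT-3", 175e9, 300e9),
    ("Gopher", 280e9, 300e9),
    ("Megatron-Turing NLG", 530e9, 270e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("PaLM 540B", 540e9, 780e9),
]


def approx_train_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense transformer: C ~ 6 * N * D."""
    return 6.0 * params * tokens


flops = {name: approx_train_flops(n, d) for name, n, d in rows}

for name, c in flops.items():
    print(f"{name:22s} ~{c:.2e} FLOPs")

# Relative compute of PaLM 540B versus Chinchilla under this approximation.
palm_vs_chinchilla = flops["PaLM 540B"] / flops["Chinchilla"]
print(f"PaLM 540B / Chinchilla: {palm_vs_chinchilla:.1f}x")
```

Under this approximation PaLM 540B used several times the compute of Chinchilla, while Gopher and Chinchilla land within the same order of magnitude despite a fourfold difference in parameter count.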

Frequently asked questions

Is PaLM open source?

No. Neither PaLM nor PaLM 2 was ever released as open weights. External access was available only through the PaLM API and Google Cloud's Vertex AI, and that API was decommissioned on August 15, 2024.^[1]^[3]^[6] The lack of open weights was a recurring criticism because it limited independent reproduction of the paper's results.^[1]^[3]

Why is PaLM called "Pathways Language Model"?

PaLM is named after Pathways, the distributed machine learning system Google built to train a single model across multiple TPU v4 pods. PaLM was the first large model trained end-to-end on Pathways, and the paper presents the training run as the headline validation of that system.^[1]^[8]

Is PaLM still available?

No. The original PaLM and PaLM 2 have been fully retired from Google's public product surface. The PaLM API was shut down on August 15, 2024, Bard was rebranded to Gemini in February 2024, and the PaLM 2-derived MedLM endpoints were later replaced by Gemini-based medical models.^[6]^[29]^[35] Google directs all former PaLM users to the Gemini API.^[6]

References

1. Chowdhery, A., Narang, S., Devlin, J., *et al.* "PaLM: Scaling Language Modeling with Pathways." *Journal of Machine Learning Research*, 24(240):1-113, 2023. arXiv:2204.02311. <https://arxiv.org/abs/2204.02311>. Accessed 2026-05-24.
2. Narang, S., and Chowdhery, A. "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance." Google Research blog, 2022-04-04. <https://research.google/blog/pathways-language-model-palm-scaling-to-540-billion-parameters-for-breakthrough-performance/>. Accessed 2026-05-24.
3. Anil, R., Dai, A. M., Firat, O., *et al.* "PaLM 2 Technical Report." arXiv:2305.10403, 2023-05-17. <https://arxiv.org/abs/2305.10403>. Accessed 2026-05-24.
4. Google. "Google AI: What to Know About the PaLM 2 Large Language Model." Google Blog, 2023-05-10. <https://blog.google/technology/ai/google-palm-2-ai-large-language-model/>. Accessed 2026-05-24.
5. Wiggers, K. "Google Launches PaLM 2, Its Next-Gen Large Language Model." TechCrunch, 2023-05-10. <https://techcrunch.com/2023/05/10/google-launches-palm-2-its-next-gen-large-language-model/>. Accessed 2026-05-24.
6. Google. "PaLM API Deprecation." Google AI for Developers (PaLM API decommissioned 2024-08-15). <https://ai.google.dev/palm_docs/deprecation>. Accessed 2026-05-24.
7. Dean, J. "Introducing Pathways: A Next-Generation AI Architecture." Google blog, 2021-10-28. <https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/>. Accessed 2026-05-24.
8. Barham, P., Chowdhery, A., Dean, J., *et al.* "Pathways: Asynchronous Distributed Dataflow for ML." *Proceedings of the 5th MLSys Conference*, 2022. arXiv:2203.12533. <https://arxiv.org/abs/2203.12533>. Accessed 2026-05-24.
9. Rae, J. W., Borgeaud, S., Cai, T., *et al.* "Scaling Language Models: Methods, Analysis & Insights from Training Gopher." arXiv:2112.11446, 2021-12-08. <https://arxiv.org/abs/2112.11446>. Accessed 2026-05-24.
10. Du, N., Huang, Y., Dai, A. M., *et al.* "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." *Proceedings of the 39th International Conference on Machine Learning (ICML)*, 2022. arXiv:2112.06905. <https://arxiv.org/abs/2112.06905>. Accessed 2026-05-24.
11. Smith, S., Patwary, M., Norick, B., *et al.* "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model." arXiv:2201.11990, 2022-01-28. <https://arxiv.org/abs/2201.11990>. Accessed 2026-05-24.
12. Hoffmann, J., Borgeaud, S., Mensch, A., *et al.* "Training Compute-Optimal Large Language Models." *NeurIPS 2022*. arXiv:2203.15556. <https://arxiv.org/abs/2203.15556>. Accessed 2026-05-24.
13. Google. "Benchmarking FLOPs Utilization on TPU v4." Google services blog, 2022. <https://services.google.com/fh/files/blogs/tpu_v4_benchmarking.pdf>. Accessed 2026-05-24.
14. Heim, L. "Estimating PaLM's Training Cost." blog.heim.xyz, 2022-04. <https://blog.heim.xyz/palm-training-cost/>. Accessed 2026-05-24.
15. Molybog, I., Albert, P., Chen, M., *et al.* "A Theory on Adam Instability in Large-Scale Machine Learning." arXiv:2304.09871, 2023-04-19. <https://arxiv.org/abs/2304.09871>. Accessed 2026-05-24.
16. Srivastava, A., Rastogi, A., Rao, A., *et al.* "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models." *Transactions on Machine Learning Research*, 2023. arXiv:2206.04615. <https://arxiv.org/abs/2206.04615>. Accessed 2026-05-24.
17. Wei, J., Wang, X., Schuurmans, D., *et al.* "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." *Advances in Neural Information Processing Systems (NeurIPS)*, 2022. arXiv:2201.11903. <https://arxiv.org/abs/2201.11903>. Accessed 2026-05-24.
18. Suzgun, M., Scales, N., Scharli, N., *et al.* "Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them." arXiv:2210.09261, 2022-10-17. <https://arxiv.org/abs/2210.09261>. Accessed 2026-05-24.
19. Heaven, W. D. (et al.). "Google's PaLM Giant Language AI Can Explain Jokes." The Decoder, 2022-04-06. <https://the-decoder.com/google-palm-giant-language-ai-can-explain-jokes/>. Accessed 2026-05-24.
20. Chung, H. W., Hou, L., Longpre, S., *et al.* "Scaling Instruction-Finetuned Language Models." arXiv:2210.11416, 2022-10-20. <https://arxiv.org/abs/2210.11416>. Accessed 2026-05-24.
21. Longpre, S., Hou, L., Vu, T., *et al.* "The Flan Collection: Designing Data and Methods for Effective Instruction Tuning." arXiv:2301.13688, 2023-01-31. <https://arxiv.org/abs/2301.13688>. Accessed 2026-05-24.
22. Tay, Y., Wei, J., Chung, H. W., *et al.* "Transcending Scaling Laws with 0.1% Extra Compute." arXiv:2210.11399, 2022-10-20. <https://arxiv.org/abs/2210.11399>. Accessed 2026-05-24.
23. Driess, D., Xia, F., Sajjadi, M. S. M., *et al.* "PaLM-E: An Embodied Multimodal Language Model." *Proceedings of the 40th International Conference on Machine Learning (ICML)*, 2023. arXiv:2303.03378. <https://arxiv.org/abs/2303.03378>. Accessed 2026-05-24.
24. Ahn, M., Brohan, A., Brown, N., *et al.* "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances." arXiv:2204.01691, 2022-04-04. <https://arxiv.org/abs/2204.01691>. Accessed 2026-05-24.
25. Ichter, B., and Hausman, K. "Towards Helpful Robots: Grounding Language in Robotic Affordances." Google Research blog, 2022-08-16. <https://research.google/blog/towards-helpful-robots-grounding-language-in-robotic-affordances/>. Accessed 2026-05-24.
26. Lewkowycz, A., Andreassen, A., Dohan, D., *et al.* "Solving Quantitative Reasoning Problems with Language Models." *NeurIPS 2022*. arXiv:2206.14858. <https://arxiv.org/abs/2206.14858>. Accessed 2026-05-24.
27. Singhal, K., Azizi, S., Tu, T., *et al.* "Large Language Models Encode Clinical Knowledge." *Nature*, 620:172-180, 2023-07-12. <https://www.nature.com/articles/s41586-023-06291-2>. Accessed 2026-05-24.
28. Singhal, K., Tu, T., Gottweis, J., *et al.* "Towards Expert-Level Medical Question Answering with Large Language Models." arXiv:2305.09617, 2023-05-16. <https://arxiv.org/abs/2305.09617>. Accessed 2026-05-24.
29. Google Cloud. "Introducing MedLM for the Healthcare Industry." Google Cloud blog, 2023-12-13. <https://cloud.google.com/blog/topics/healthcare-life-sciences/introducing-medlm-for-the-healthcare-industry>. Accessed 2026-05-24.
30. Tu, T., Azizi, S., Driess, D., *et al.* "Towards Generalist Biomedical AI." arXiv:2307.14334, 2023-07-26. <https://arxiv.org/abs/2307.14334>. Accessed 2026-05-24.
31. Potti, S., and Venables, P. "Supercharging Security with Generative AI." Google Cloud Blog, 2023-04-24. <https://cloud.google.com/blog/products/identity-security/rsa-google-cloud-security-ai-workbench-generative-ai>. Accessed 2026-05-24.
32. Bastian, M. "Google's PaLM 2 Uses Nearly Five Times More Text Data Than Predecessor." CNBC, 2023-05-16 (reporting on internal Google documents citing 340B parameters and 3.6T training tokens). <https://www.cnbc.com/2023/05/16/googles-palm-2-uses-nearly-five-times-more-text-data-than-predecessor.html>. Accessed 2026-05-24.
33. Sherr, I. "Google Bard Will Soon Switch Language Models from LaMDA to PaLM." Neowin, 2023-03-31. <https://www.neowin.net/news/google-bard-will-soon-switch-language-models-from-lamda-to-palm-to-compete-with-bing-chat/>. Accessed 2026-05-24.
34. Hsiao, S. "Bard's Latest Update: More Features, Languages and Countries." Google blog, 2023-09-19. <https://blog.google/products/bard/google-bard-new-features-update-sept-2023/>. Accessed 2026-05-24.
35. Pichai, S., and Hassabis, D. "Introducing Gemini: Our Largest and Most Capable AI Model." Google blog, 2023-12-06. <https://blog.google/technology/ai/google-gemini-ai/>. Accessed 2026-05-24.
36. Wright, S., and Nyaga, J. "PaLM API & MakerSuite: An Approachable Way to Start Prototyping and Building Generative AI Applications." Google Developers Blog, 2023-03-14. <https://developers.googleblog.com/2023/03/announcing-palm-api-and-makersuite.html>. Accessed 2026-05-24.
37. Brown, T., Mann, B., Ryder, N., *et al.* "Language Models are Few-Shot Learners." *NeurIPS 2020*. arXiv:2005.14165. <https://arxiv.org/abs/2005.14165>. Accessed 2026-05-24.
