PaLM (Pathways Language Model) is a family of large language models developed by Google Research. The original PaLM, introduced in April 2022, was a 540-billion parameter dense decoder-only transformer trained using Google's Pathways system across 6,144 TPU v4 chips. At the time of its release, PaLM was the largest dense language model ever trained and achieved state-of-the-art results on hundreds of language understanding and generation benchmarks. The PaLM family later expanded to include PaLM 2, PaLM-E, Med-PaLM, and Sec-PaLM, each targeting different domains. Google eventually succeeded the PaLM line with its Gemini model family in late 2023.
The PaLM project grew out of Google's investment in scaling language models to unprecedented sizes. Google had already built several large models, including LaMDA (137 billion parameters, 2021) and GLaM (1.2 trillion sparse parameters, 2022). PaLM built on these efforts by combining an extremely large dense model with a new distributed training system called Pathways.
The Pathways system, first described by Jeff Dean, then head of Google AI, in a 2021 blog post, was designed to train a single model across thousands of accelerators spread over multiple TPU pods. Before PaLM, most large models were trained within a single pod. PaLM was the first model to use Pathways for large-scale training, distributing computation across two TPU v4 Pods connected over a data center network.
The PaLM paper, led by Aakanksha Chowdhery along with more than 60 co-authors from Google Research, was first released as an arXiv preprint on April 5, 2022. It was subsequently published in the Journal of Machine Learning Research (JMLR), Volume 24, in 2023.
| Date | Event |
|---|---|
| April 2022 | PaLM (540B) announced via research paper and blog post |
| December 2022 | Med-PaLM released; first AI to pass USMLE-style medical exam questions |
| March 2023 | PaLM-E (562B) released for embodied multimodal reasoning and robotics |
| March 2023 | PaLM API made publicly available for developers |
| April 2023 | Sec-PaLM announced at RSA Conference for cybersecurity applications |
| May 2023 | PaLM 2 announced at Google I/O; deployed in Bard chatbot |
| May 2023 | Med-PaLM 2 released with 86.5% accuracy on USMLE questions |
| December 2023 | Gemini 1.0 announced as successor to PaLM 2 |
| February 2024 | Bard rebranded as Gemini |
| August 2024 | PaLM API officially decommissioned |
PaLM uses a dense decoder-only transformer architecture, meaning every input token attends only to previous tokens (autoregressive generation). The model incorporates several architectural modifications that were individually studied in prior work but had not been combined together at this scale.
SwiGLU Activation Function. Instead of the standard ReLU or GELU activations used in most transformers at the time, PaLM uses SwiGLU, defined as SwiGLU(x) = Swish(xW) * xV. This activation function requires three matrix multiplications in the feed-forward network rather than two, but experiments showed it significantly improves model quality at equivalent compute budgets. PaLM keeps the feed-forward hidden dimension at 4 times the model dimension (d_model); later SwiGLU models such as LLaMA instead shrink it to roughly 8/3 of d_model to offset the parameter cost of the third matrix multiplication.
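As a sketch, the SwiGLU feed-forward block might look like the following (a minimal NumPy illustration under the definitions above; the weight names `W`, `V`, and `W_out` are hypothetical stand-ins, not PaLM's actual implementation):

```python
import numpy as np

def swish(z):
    # Swish (SiLU): z * sigmoid(z).
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W, V, W_out):
    # SwiGLU feed-forward: (Swish(xW) * xV) @ W_out.
    # Three matrix multiplications instead of the usual two.
    return (swish(x @ W) * (x @ V)) @ W_out

# Toy dimensions with the 4 * d_model hidden size used by PaLM.
d_model, d_ff = 8, 4 * 8
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))          # (batch, d_model)
W = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))
y = swiglu_ffn(x, W, V, W_out)             # (batch, d_model)
```

The second projection `xV` acts as a learned gate on the Swish-activated path, which is the "GLU" part of the name.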
Parallel Attention and Feed-Forward Layers. Standard transformers compute self-attention and the feed-forward network sequentially within each block. PaLM instead uses a "parallel" formulation where both the attention and feed-forward computations are performed simultaneously and their outputs are summed. This allows the input matrix multiplications for both sublayers to be fused, resulting in roughly a 15% speedup during training at large scales. Ablation experiments showed a minor quality degradation at the 8B parameter scale but no measurable degradation at 62B or 540B scale.
Rotary Position Embeddings (RoPE). Rather than absolute or relative position embeddings, PaLM uses RoPE, which encodes absolute position through a rotation matrix and incorporates relative position information directly into the self-attention computation. This approach combines the advantages of both absolute and relative positioning methods.
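A minimal NumPy sketch of RoPE applied to one attention head follows (this pairs the first and second halves of the head dimension; real implementations often rotate adjacent interleaved pairs instead):

```python
import numpy as np

def rope(x, base=10000.0):
    # Apply rotary position embeddings to x of shape (seq_len, head_dim),
    # pairing dimension j with dimension j + head_dim // 2 and rotating
    # each pair by an angle proportional to the token's position.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)
```

Because each position applies a pure rotation, the dot product between a rotated query and a rotated key depends only on the *distance* between their positions, which is how an absolute encoding produces relative-position behavior in attention.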
Multi-Query Attention. In standard multi-head attention, each attention head has its own key, value, and query projections. PaLM uses multi-query attention, where the key and value projections are shared across all heads while each head retains its own query projection. This has a neutral effect on model quality and training speed but provides large savings during autoregressive decoding (inference), since the key-value cache is much smaller.
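Multi-query attention can be sketched as follows (toy NumPy code without causal masking or batching; the weight shapes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_attention(x, Wq, Wk, Wv):
    # x: (seq, d_model). Wq: (n_heads, d_model, d_head) gives each head
    # its own query projection; Wk and Wv: (d_model, d_head) are a single
    # key and value projection shared by every head.
    k = x @ Wk                              # (seq, d_head), one K for all heads
    v = x @ Wv                              # (seq, d_head), one V for all heads
    heads = []
    for Wq_h in Wq:                         # per-head query projections
        q = x @ Wq_h                        # (seq, d_head)
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        heads.append(weights @ v)           # (seq, d_head)
    return np.concatenate(heads, axis=-1)   # (seq, n_heads * d_head)
```

During incremental decoding, the key-value cache holds a single `(seq, d_head)` pair of tensors instead of one pair per head, shrinking the cache by a factor of `n_heads` and reducing the memory bandwidth that dominates inference cost.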
Vocabulary. PaLM uses a SentencePiece tokenizer with a vocabulary of 256,000 tokens. This large vocabulary was chosen to handle the diverse multilingual training corpus, which spans over 100 languages. The vocabulary is designed to be "lossless," preserving all whitespace (important for code), splitting out-of-vocabulary Unicode characters into bytes, and tokenizing numbers into individual digits.
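The digit-splitting and byte-fallback conventions can be illustrated with a toy character-level function (this is not the real SentencePiece model, which operates on learned subword pieces; it only demonstrates the two rules):

```python
def illustrate_lossless_pieces(text, vocab):
    # Toy illustration of two PaLM vocabulary conventions: numbers are
    # split into individual digits, and characters not covered by the
    # vocabulary fall back to their UTF-8 bytes, so no input is ever lost.
    pieces = []
    for ch in text:
        if ch.isdigit():
            pieces.append(ch)                                  # one piece per digit
        elif ch in vocab:
            pieces.append(ch)                                  # in-vocabulary symbol
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))  # byte fallback
    return pieces
```

Whitespace characters are ordinary vocabulary symbols here, mirroring the "lossless" property that matters for source code.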
No Bias Terms or Dropout. PaLM removes bias terms from all dense layers and layer norms throughout the network. Dropout is also not used during training.
The research team trained three model sizes to study scaling behavior:
| Configuration | PaLM 8B | PaLM 62B | PaLM 540B |
|---|---|---|---|
| Parameters | 8.63 billion | 62.50 billion | 540.35 billion |
| Layers | 32 | 64 | 118 |
| Model Dimension (d_model) | 4,096 | 8,192 | 18,432 |
| Attention Heads | 16 | 32 | 48 |
| Head Dimension | 256 | 256 | 256 |
| Feed-Forward Dimension | 16,384 | 32,768 | 73,728 |
| Vocabulary Size | 256,000 | 256,000 | 256,000 |
The head dimension is fixed at 256 across all three sizes, and the feed-forward dimension is always 4 times the model dimension.
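Under a few assumptions consistent with the description above — multi-query attention with a shared key/value projection of width 256, a SwiGLU feed-forward with three weight matrices and hidden size 4 × d_model, shared input/output embeddings, and no bias terms — the reported totals can be closely reproduced:

```python
def palm_param_count(layers, d_model, n_heads, d_head=256, vocab=256_000):
    # Approximate parameter count for a PaLM configuration.
    attn_dim = n_heads * d_head
    # Per-head queries and output projection, plus one shared K and V
    # projection of width d_head (multi-query attention), no biases.
    attention = 2 * d_model * attn_dim + 2 * d_model * d_head
    # SwiGLU feed-forward: two input projections and one output projection.
    ffn = 3 * d_model * (4 * d_model)
    embeddings = vocab * d_model          # input/output embeddings shared
    return layers * (attention + ffn) + embeddings

configs = {"8B": (32, 4096, 16), "62B": (64, 8192, 32), "540B": (118, 18432, 48)}
totals = {name: palm_param_count(*cfg) / 1e9 for name, cfg in configs.items()}
```

This reproduces the table's 8.63, 62.50, and 540.35 billion to two decimal places, which suggests the stated counts were derived from exactly these components.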
PaLM was trained on a high-quality corpus of 780 billion tokens drawn from a mixture of sources based on datasets previously used for LaMDA and GLaM:
| Data Source | Percentage of Training Corpus |
|---|---|
| Social media conversations | 50% |
| Filtered web pages | 27% |
| Books (English) | 13% |
| GitHub source code | 5% |
| Wikipedia (multilingual) | 4% |
| News articles (English) | 1% |
Approximately 78% of the training tokens are English, with the remaining 22% covering a multilingual corpus spanning over 100 languages. Each model was trained for exactly one epoch over the dataset, meaning no training example was repeated.
PaLM 540B was the first model to be trained using Google's Pathways system at such a large scale. The training infrastructure consisted of:

- 6,144 TPU v4 chips spanning two TPU v4 Pods, connected over a data center network
- A combination of model parallelism within each pod and data parallelism across the two pods, with no pipeline parallelism
- 57.8% hardware FLOPs utilization, the highest reported for a language model of this scale at the time
The high utilization was achieved through a combination of the parallel attention/feed-forward formulation, XLA TPU compiler optimizations, and the use of rematerialization (recomputing certain activations during the backward pass instead of storing them, which allows for larger batch sizes).
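Rematerialization can be illustrated with a toy chain of scalar layers, where the backward pass recomputes activations from the nearest stored checkpoint instead of keeping every activation in memory (a pedagogical sketch, not PaLM's actual implementation):

```python
import numpy as np

def layer(x, w):
    # A toy scalar "layer".
    return np.tanh(w * x)

def dlayer_dx(x, w):
    # Derivative of the toy layer with respect to its input.
    return w * (1.0 - np.tanh(w * x) ** 2)

def grad_input_checkpointed(x0, ws, every=2):
    # Forward pass: store only every `every`-th activation (checkpoints).
    acts = {0: x0}
    x = x0
    for i, w in enumerate(ws, start=1):
        x = layer(x, w)
        if i % every == 0:
            acts[i] = x
    # Backward pass: rematerialize the activations inside each segment
    # from the nearest checkpoint instead of having stored them all.
    g = 1.0
    for i in range(len(ws), 0, -1):
        start = (i - 1) // every * every
        a = acts[start]
        for j in range(start, i - 1):
            a = layer(a, ws[j])          # recompute layer i's input
        g *= dlayer_dx(a, ws[i - 1])     # chain rule, layer by layer
    return g
```

The memory saved by discarding intermediate activations can then be spent on larger batch sizes, trading extra forward-pass compute for throughput.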
PaLM 540B demonstrated breakthrough performance across a wide range of benchmarks.
Language Understanding. On a suite of 29 widely-used English NLP tasks (including question answering, cloze completion, Winograd-style challenges, and SuperGLUE), PaLM surpassed the few-shot performance of all prior large models on 28 of 29 tasks. This included comparisons with GPT-3, Megatron-Turing NLG, Gopher, Chinchilla, and LaMDA.
BIG-Bench. The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark consisting of more than 150 diverse language tasks. On 58 common BIG-bench tasks, PaLM 540B with 5-shot prompting outperformed the average performance of human raters who were asked to solve the same tasks. This was a significant milestone, as earlier models had generally fallen below human-level performance on this benchmark. PaLM also showed "discontinuous improvements" on certain tasks as scale increased from 62B to 540B, suggesting the emergence of new capabilities at larger scales.
Chain-of-Thought Reasoning. One of PaLM's most important contributions was demonstrating the power of chain-of-thought prompting at scale. When prompted to show its reasoning step by step, PaLM 540B achieved 58% accuracy on GSM8K (a benchmark of grade-school-level math word problems), surpassing the previous best score of 55%, which had been achieved using a fine-tuned GPT-3 model with an external calculator tool. A subsequent study by Google Research and Stanford applied chain-of-thought prompting to 23 BIG-Bench Hard tasks (tasks where language models had previously failed to outperform average human raters). Using this approach, PaLM surpassed human-level performance on 10 of 23 tasks.
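A chain-of-thought prompt for a GSM8K-style problem looks roughly like this (an illustrative sketch using the well-known tennis-ball exemplar from the chain-of-thought literature; the exact prompts used with PaLM may differ):

```python
# One-shot chain-of-thought prompt. The model is shown a worked example
# so that it produces intermediate reasoning steps before the final
# answer, rather than answering directly.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
)
question = "Q: <a new grade-school math word problem>\nA:"
prompt = exemplar + "\n" + question
```

The trailing "A:" invites the model to continue with its own step-by-step reasoning, and the final answer is parsed from the "The answer is ..." pattern established by the exemplar.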
Code Generation. Despite having only 5% code in its training data, PaLM's few-shot code generation performance was competitive with Codex 12B (the model behind GitHub Copilot), which was specifically fine-tuned for code. PaLM achieved this while using roughly 50 times less Python code during training. A fine-tuned variant called PaLM-Coder achieved an 82.1% compile rate on the DeepFix code repair benchmark, surpassing the prior state of the art of 71.7%.
Multilingual Capabilities. Even though only 22% of training tokens were non-English, PaLM achieved strong results on multilingual NLP benchmarks including machine translation, summarization, and question answering across many languages. PaLM was particularly effective at translating from other languages into English. It achieved the strongest machine translation results among LLMs trained on non-parallel multilingual corpora at the time.
Natural Language Generation. PaLM demonstrated strong abilities in explaining jokes, generating analogies, and performing commonsense reasoning. These qualitative capabilities, combined with the quantitative benchmark results, highlighted the breadth of capabilities that emerged at the 540B parameter scale.
PaLM 2 was announced at Google I/O on May 10, 2023, and described in a technical report released the same month. It represented a substantial advancement over the original PaLM, achieving stronger performance across reasoning, multilingual understanding, and code generation despite being a smaller model.
PaLM 2 is a transformer-based language model, though Google withheld many architectural specifics in its technical report. Key publicly known details include:

- Compute-optimal scaling: following updated scaling-law analysis, PaLM 2 trades model size for training data, making it substantially smaller than PaLM 540B while training on far more tokens (reported to be roughly 3.6 trillion, versus PaLM's 780 billion).
- A more multilingual and diverse pre-training mixture, with a higher proportion of non-English text, source code, and mathematical content than the original PaLM corpus.
- A reported size of roughly 340 billion parameters for the largest variant, based on media reporting rather than official disclosure.
PaLM 2 comes in four sizes, named after animals in increasing order of capability:
| Model | Description |
|---|---|
| Gecko | The smallest variant, designed to run on mobile devices. Capable of processing roughly 20 tokens per second on a flagship smartphone, making it suitable for on-device applications even without an internet connection. |
| Otter | A mid-range model positioned between Gecko and Bison. Limited public information is available about this variant. |
| Bison | A larger model suitable for a wide range of general-purpose tasks. This was one of the most commonly used sizes through the PaLM API and Google Cloud's Vertex AI. |
| Unicorn | The largest and most capable model in the PaLM 2 family. |
Google did not disclose exact parameter counts for each variant.
Compared to the original PaLM, PaLM 2 showed significant improvements in several areas:

- Reasoning: stronger results on logic, common-sense reasoning, and mathematics benchmarks.
- Multilinguality: improved understanding, translation, and generation across more than 100 languages, including idioms, nuanced text, and riddles.
- Coding: better code generation in popular languages such as Python and JavaScript, as well as in lower-resource languages such as Prolog, Fortran, and Verilog.
- Efficiency: comparable or better quality from a smaller model, making it cheaper and faster to serve.
PaLM 2 was immediately deployed across more than 25 Google products and features after its announcement. Most notably, it powered the Bard chatbot (later renamed Gemini), replacing the earlier LaMDA-based version. PaLM 2 was also available through the PaLM API for external developers and through Google Cloud's Vertex AI platform for enterprise customers.
PaLM-E ("Embodied") is a multimodal language model designed for robotics and embodied AI tasks. It was introduced in a paper released on March 6, 2023, by researchers at Google, Robotics at Google, and TU Berlin. The paper was published at the International Conference on Machine Learning (ICML) 2023.
PaLM-E takes the PaLM language model and extends it to accept continuous sensor inputs alongside text. The core architectural idea is to inject embodied observations (such as images, robot state estimates, or 3D scene representations) into the language embedding space of a pre-trained PaLM model. These continuous inputs are encoded by learned encoders and projected into the same vector space as the text token embeddings, creating "multimodal sentences" that interleave text and sensor data.
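The "multimodal sentence" idea can be sketched as follows (toy NumPy code; the encoder is reduced to a single hypothetical projection matrix, whereas PaLM-E uses learned ViT and state encoders):

```python
import numpy as np

d_model = 16  # toy language-model embedding width

def embed_text(token_ids, embedding_table):
    # Ordinary text-token embedding lookup.
    return embedding_table[token_ids]            # (n_tokens, d_model)

def encode_observation(features, projection):
    # Continuous sensor features are projected into the same vector
    # space as the text token embeddings.
    return features @ projection                 # (n_patches, d_model)

def multimodal_sentence(prefix_ids, features, suffix_ids,
                        embedding_table, projection):
    # Interleave text and sensor embeddings into one input sequence,
    # e.g. "Q: What is happening in <observation> ?"
    return np.concatenate([
        embed_text(prefix_ids, embedding_table),
        encode_observation(features, projection),
        embed_text(suffix_ids, embedding_table),
    ], axis=0)
```

The language model then processes this mixed sequence exactly as it would a sequence of ordinary token embeddings, which is what lets a pre-trained PaLM be extended rather than retrained from scratch.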
The largest version, PaLM-E-562B, combines the 540-billion-parameter PaLM language model with a 22-billion-parameter Vision Transformer (ViT-22B). At 562 billion total parameters, it was the largest vision-language model reported at the time of its release.
PaLM-E was evaluated across multiple domains:

- Robotic manipulation and planning: generating step-by-step plans for tabletop manipulation and mobile robot tasks from camera input.
- Visual-language tasks: question answering and captioning, including state-of-the-art results on the OK-VQA benchmark without task-specific fine-tuning.
- General language tasks: the model retained broad language competence, with larger scale reducing the loss of language ability (catastrophic forgetting) during multimodal training.
A central result of the PaLM-E research was that training on diverse visual-language data (not just robotics data) significantly improved the model's performance on robotic tasks. This positive transfer suggests that large-scale multimodal pre-training creates representations that are broadly useful for embodied reasoning.
Med-PaLM is a series of medical domain language models developed by Google Research and DeepMind, built on top of the PaLM foundation.
Med-PaLM was released in December 2022. It was built by applying a combination of prompt engineering techniques to Flan-PaLM, an instruction-tuned version of PaLM 540B. The prompting strategy combined few-shot learning, chain-of-thought reasoning, and self-consistency decoding.
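Of the three prompting techniques, self-consistency decoding is the most mechanical and can be sketched independently of any particular model (a minimal illustration; `sample_fn` is a hypothetical stand-in for sampling one chain-of-thought completion and extracting its final answer):

```python
from collections import Counter

def self_consistency(sample_fn, prompt, n_samples=11):
    # Self-consistency decoding: sample several chain-of-thought
    # completions at nonzero temperature, then return the most common
    # final answer (majority vote over reasoning paths).
    answers = [sample_fn(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```

The intuition is that many distinct reasoning paths converging on the same answer is stronger evidence than a single greedy decode.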
Med-PaLM was the first AI system to achieve a passing score on the United States Medical Licensing Examination (USMLE)-style questions from the MedQA dataset, scoring 67.6% accuracy. This crossed the approximate 60% passing threshold for the exam. While a landmark achievement, human evaluation revealed that Med-PaLM's long-form answers still had room for improvement in terms of factual alignment with scientific consensus and clinical utility.
The research paper, "Large Language Models Encode Clinical Knowledge," was published in Nature in July 2023.
Med-PaLM 2 was introduced at Google Health's annual event, The Check Up, in March 2023. It was built on top of PaLM 2 and used improved fine-tuning techniques including ensemble refinement.
Key results for Med-PaLM 2:
| Metric | Med-PaLM | Med-PaLM 2 |
|---|---|---|
| MedQA (USMLE-style) Accuracy | 67.6% | 86.5% |
| Scientific Consensus Alignment | Not reported | 92.6% |
| Expert-Level Performance | No (passing only) | Yes (first LLM to reach expert level) |
Med-PaLM 2 was the first large language model to perform at expert-level on USMLE-style medical questions. A team of clinicians evaluated its long-form responses across criteria including scientific factuality, precision, medical consensus, reasoning, potential for bias, and likelihood of harm. The evaluation found that 92.6% of Med-PaLM 2's responses aligned with scientific consensus, and the rate of potentially harmful answers was lower than that of human medical professionals.
Med-PaLM 2 was later made available through Google Cloud as part of the MedLM family of foundation models for healthcare customers.
Sec-PaLM is a security-focused variant of PaLM, announced at the RSA Conference in April 2023. It powers Google Cloud's Security AI Workbench.
Sec-PaLM was fine-tuned specifically for cybersecurity use cases, incorporating Google's security intelligence and Mandiant's frontline threat intelligence. The training data includes information about vulnerabilities, malware signatures, threat indicators, and behavioral profiles of threat actors.
Key applications of Sec-PaLM include:

- Analyzing and explaining the behavior of potentially malicious scripts, as in VirusTotal Code Insight.
- Summarizing threat intelligence and alerts into plain-language reports for security analysts.
- Assisting incident response by helping teams search, triage, and investigate security findings across Google Cloud security products.
The following table summarizes the major models in the PaLM family:
| Model | Release Date | Parameters | Key Features |
|---|---|---|---|
| PaLM | April 2022 | 540B | Dense decoder-only transformer; trained on 780B tokens across 6,144 TPU v4 chips; breakthrough chain-of-thought reasoning; state-of-the-art on 28 of 29 NLP tasks |
| Med-PaLM | December 2022 | 540B (fine-tuned) | First AI to pass USMLE-style medical exam (67.6%); built on Flan-PaLM with specialized medical prompting |
| PaLM-E | March 2023 | 562B | Embodied multimodal model combining PaLM 540B with ViT-22B; designed for robotic manipulation and visual reasoning |
| Sec-PaLM | April 2023 | Not disclosed | Security-specialized variant for cybersecurity threat analysis and incident response |
| PaLM 2 | May 2023 | ~340B | Compute-optimal training on 3.6T tokens; improved multilingual, reasoning, and coding; four sizes (Gecko, Otter, Bison, Unicorn) |
| Med-PaLM 2 | May 2023 | Based on PaLM 2 | Expert-level USMLE performance (86.5%); 92.6% alignment with scientific consensus |
PaLM and PaLM 2 were released during a period of rapid advancement in large language models. The following table provides a high-level comparison with other major models from the same era:
| Feature | PaLM 2 (May 2023) | GPT-4 (March 2023) | LLaMA 2 70B (July 2023) | Claude 2 (July 2023) |
|---|---|---|---|---|
| Developer | Google | OpenAI | Meta | Anthropic |
| Parameters | ~340B | Not disclosed | 70B | Not disclosed |
| Open Source | No | No | Yes | No |
| Training Tokens | ~3.6 trillion | Not disclosed | 2 trillion | Not disclosed |
| Multilingual Strength | Strong (100+ languages) | Strong | Moderate | Moderate |
| Context Window (at launch) | 8,192 tokens | 8,192 / 32,768 tokens | 4,096 tokens | 100,000 tokens |
| Notable Strengths | Multilingual tasks, translation, reasoning | General reasoning, multimodal input | Open-source, strong for size | Long context, safety alignment |
| API/Access | PaLM API, Vertex AI | OpenAI API | Open weights (Meta license) | Anthropic API |
On benchmarks such as HellaSwag, GPT-4 scored 95.3 compared to PaLM 2's 86.8. On ARC-E, GPT-4 achieved 96.3 versus PaLM 2's 89.7. However, PaLM 2 showed competitive or superior results on certain mathematical and multilingual tasks, including WinoGrande and DROP. Direct comparisons are complicated by differences in evaluation methodology, prompt formatting, and the number of shots used.
The PaLM model family served as a foundational stepping stone toward Google's next-generation model family, Gemini.
In April 2023, Google merged its DeepMind and Google Brain AI research divisions into a single unit called Google DeepMind. This organizational change was partly motivated by the desire to accelerate AI development in response to the rapid progress of competitors, particularly OpenAI's ChatGPT and GPT-4.
On December 6, 2023, Google DeepMind CEO Demis Hassabis and Google CEO Sundar Pichai announced Gemini 1.0, which was explicitly described as the successor to PaLM 2. Gemini was built from the ground up as a natively multimodal model, capable of understanding and generating text, images, audio, video, and code. It launched in three sizes: Gemini Ultra (for the most complex tasks), Gemini Pro (for general-purpose use), and Gemini Nano (for on-device applications).
In February 2024, Google rebranded the Bard chatbot (which had been powered by PaLM 2) as Gemini, completing the public-facing transition. The PaLM API was officially decommissioned on August 15, 2024, with developers directed to migrate to the Gemini API.
While Google has not published full architectural details of Gemini, the PaLM line contributed several techniques that likely influenced its design:

- SwiGLU activations, rotary position embeddings, and bias-free layers, validated at extreme scale.
- Multi-query attention for efficient autoregressive decoding, a precursor to the grouped-query attention used in many later models.
- Pathways-based training distributed across multiple TPU pods, an approach that Gemini's even larger training runs built upon.
These architectural choices, first validated at scale in PaLM, became common practice across the industry. Many subsequent open-source models, including LLaMA, adopted SwiGLU activations, RoPE embeddings, and multi-query (or grouped-query) attention, reflecting PaLM's broad influence on language model design.
PaLM made several lasting contributions to the field of artificial intelligence:
Scaling Laws and Emergent Abilities. PaLM provided some of the strongest evidence at the time for discontinuous capability improvements as models scale. Certain BIG-bench tasks showed sudden jumps in performance between the 62B and 540B parameter versions, suggesting that some abilities "emerge" only at sufficient scale. This finding fueled significant research interest in emergent abilities of large language models.
Efficient Distributed Training. By achieving 57.8% hardware FLOPs utilization across 6,144 TPU chips spanning two pods, PaLM set a new standard for large-scale distributed training efficiency. The Pathways system demonstrated that training could be effectively distributed across multiple pods connected by relatively lower-bandwidth data center networks.
Architectural Innovations at Scale. PaLM validated that combining parallel attention/feed-forward layers, SwiGLU activations, RoPE, and multi-query attention could work effectively at the 540-billion-parameter scale. These architectural choices were subsequently adopted by many other research groups.
Domain-Specific Adaptation. The PaLM family demonstrated a successful strategy for building domain-specific AI systems: start with a large general-purpose foundation model, then adapt it through fine-tuning and specialized prompting for specific fields like medicine (Med-PaLM), cybersecurity (Sec-PaLM), and robotics (PaLM-E).
Despite its achievements, PaLM had several recognized limitations:

- Bias and toxicity: like other models trained on web-scale data, PaLM can reproduce social stereotypes and generate toxic language, as documented in the fairness analyses of the original paper.
- Memorization: the model can reproduce portions of its training data verbatim, with memorization rates increasing at larger scales.
- Data efficiency: PaLM 540B was trained on only 780 billion tokens, far below the compute-optimal data-to-parameter ratio suggested by DeepMind's Chinchilla results; PaLM 2's compute-optimal training directly addressed this.
- English-centric data: with roughly 78% of training tokens in English, performance in lower-resource languages lagged behind English.
- Cost and accessibility: training and serving a 540-billion-parameter dense model requires thousands of accelerators, placing reproduction and deployment out of reach for most organizations.