PaLM-E (short for Pathways Language Model, Embodied) is an embodied multimodal large language model introduced by Google and TU Berlin in March 2023. The model was described in the paper PaLM-E: An Embodied Multimodal Language Model by Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, and Pete Florence, posted to arXiv on March 6, 2023 and later presented at ICML 2023.[1][2]
The largest variant, PaLM-E-562B, combined the 540-billion-parameter PaLM language model with the 22-billion-parameter Vision Transformer (ViT-22B) into a single 562B-parameter model that took interleaved text, images, and continuous robot state as input and produced natural-language plans, visual question answers, and robot commands as output.[1][3] At the time of release it was the largest vision-language model ever publicly described, and it set a new state of the art on the OK-VQA benchmark while also acting as the high-level planner for several physical robot platforms.[1][3]
PaLM-E is generally remembered for three results: it showed that a single generalist multimodal model could plan and act across multiple distinct robot embodiments, it demonstrated positive transfer between internet-scale vision-language data and robot data, and it provided the architectural template that Google's later RT-2 and Gemini Robotics systems would inherit.[1][4][5]
Before PaLM-E, there were two largely separate research lines on the boundary between language and embodied action. One line, exemplified by SayCan (Ahn et al., 2022), used a PaLM language model as a planner that picked among predefined robot skills based on their textual descriptions, but the language model itself never saw the robot's camera images and could not reason about novel objects in the scene.[1][6] The other line, including the original RT-1 (Brohan et al., 2022), trained a smaller transformer policy directly on robot demonstrations, but those policies struggled to generalize to instructions or objects far outside their training distribution.[7]
Driess and colleagues set out to merge these two threads. The goal was to inject continuous sensor observations directly into the input stream of a pretrained language model so that the model could plan, ground, and act in one forward pass, rather than relying on a separate perception module to first translate the world into text. The team also wanted to test whether the scaling laws and broad knowledge of frontier language models would carry over to embodied tasks.[1]
A contemporaneous Google line of work on multimodal models, including the 22-billion-parameter Vision Transformer (ViT-22B) by Dehghani et al. (2023), provided the visual encoder that PaLM-E would use as its image backbone. PaLM-E was developed jointly across Google Research, Google Robotics, and Marc Toussaint's Learning and Intelligent Systems group at Technische Universität Berlin.[1][3]
The PaLM-E paper has 22 authors split between Google and TU Berlin. The lead author, Danny Driess, was at the time a PhD student at TU Berlin completing a research collaboration with Google's Robotics at Google team in Mountain View. The authors and their affiliations at the time of publication are listed below.[1]
| Author | Affiliation |
|---|---|
| Danny Driess (lead) | Robotics at Google, TU Berlin |
| Fei Xia | Robotics at Google |
| Mehdi S. M. Sajjadi | Google Research |
| Corey Lynch | Robotics at Google |
| Aakanksha Chowdhery | Google Research |
| Brian Ichter | Robotics at Google |
| Ayzaan Wahid | Robotics at Google |
| Jonathan Tompson | Robotics at Google |
| Quan Vuong | Robotics at Google |
| Tianhe Yu | Robotics at Google |
| Wenlong Huang | Robotics at Google |
| Yevgen Chebotar | Robotics at Google |
| Pierre Sermanet | Robotics at Google |
| Daniel Duckworth | Google Research |
| Sergey Levine | Robotics at Google |
| Vincent Vanhoucke | Robotics at Google |
| Karol Hausman | Robotics at Google |
| Marc Toussaint | TU Berlin |
| Klaus Greff | Google Research |
| Andy Zeng | Robotics at Google |
| Igor Mordatch | Google Research |
| Pete Florence | Robotics at Google |
The paper was posted to arXiv as 2303.03378 on March 6, 2023, accepted at the 40th International Conference on Machine Learning (ICML 2023) in Honolulu, and later given an oral presentation in the conference's robotics track.[1][2]
PaLM-E is built on a simple but consequential idea: any continuous observation can be projected into the same embedding space as language tokens, so the language model can attend to it without any architectural change. The paper calls the resulting interleaved inputs "multimodal sentences" and treats images, robot state vectors, and object-centric scene representations as sequences of vectors that are fed into the transformer alongside text tokens.[1]
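The mechanism can be illustrated in a few lines. The following is a minimal sketch, not the released implementation: a continuous robot-state vector is mapped by a learned projection (random weights here) into a handful of vectors that live in the same space as the word embeddings, and the two are concatenated into one sequence for the transformer. All dimensions, token ids, and variable names are illustrative assumptions.

```python
import numpy as np

D_MODEL = 512        # toy embedding width (far smaller than PaLM's)
N_OBS_TOKENS = 4     # number of "soft tokens" one observation becomes
STATE_DIM = 21       # e.g. flattened poses of a few tabletop objects
VOCAB = 1_000        # toy vocabulary size

rng = np.random.default_rng(0)

# A learned encoder phi: R^STATE_DIM -> R^(N_OBS_TOKENS x D_MODEL);
# a random matrix stands in for trained weights in this sketch.
W_phi = rng.normal(scale=0.02, size=(STATE_DIM, N_OBS_TOKENS * D_MODEL))

def encode_state(state: np.ndarray) -> np.ndarray:
    """Project a continuous robot-state vector into token-embedding space."""
    return (state @ W_phi).reshape(N_OBS_TOKENS, D_MODEL)

# Word embeddings for the text part of the "multimodal sentence".
E_word = rng.normal(scale=0.02, size=(VOCAB, D_MODEL))
text_token_ids = np.array([17, 204, 981])      # placeholder token ids

# Interleave: the transformer receives one sequence of D_MODEL-dim vectors
# and attends over text and observation embeddings in exactly the same way.
obs = rng.normal(size=STATE_DIM)
sequence = np.concatenate([E_word[text_token_ids], encode_state(obs)], axis=0)
print(sequence.shape)                           # (3 + N_OBS_TOKENS, D_MODEL)
```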
The language backbone is the PaLM decoder-only transformer family, with the largest version using the 540-billion-parameter PaLM-540B trained on 780 billion tokens of internet text and code. The vision backbone is a Vision Transformer; the largest configuration uses ViT-22B, the 22-billion-parameter image encoder introduced by Dehghani et al. (2023). Smaller PaLM-E variants pair smaller PaLM checkpoints with smaller ViT encoders such as ViT-4B.[1][3]
The paper studies several ways of turning visual input into a sequence of language-aligned vectors. The simplest is a frozen ViT followed by a learned linear projection that maps each ViT patch token into the PaLM token embedding space. The paper also evaluates an Object Scene Representation Transformer (OSRT) encoder that produces object-centric slot representations, and a state vector encoder that takes the position and pose of objects in a scene and projects them into language tokens.[1] Across encoders, only the projection layer is trained from scratch; the ViT can be either frozen or co-trained with the rest of the model.
A prompt to PaLM-E might look like "Describe what is happening in <img>.", where <img> is replaced at runtime by the sequence of visual tokens produced by the encoder. For robotics, the prompt usually includes a goal and a partial action history, with images showing the current scene at each step. The model's output is decoded autoregressively as text, which may be a high-level plan, an answer to a visual question, or a textual action command interpreted by a downstream controller.[1]
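A hedged sketch of how such a prompt could be assembled is shown below: the text around each <img> placeholder is embedded as usual, and each placeholder is replaced in place by that image's projected visual tokens. The tokenizer, ViT, and projection are random stand-ins, and the function names (build_multimodal_prompt, encode_image) are illustrative rather than the paper's API.

```python
import numpy as np

D_MODEL = 512            # toy embedding width
TOKENS_PER_IMAGE = 256   # assumed number of visual tokens per image

rng = np.random.default_rng(0)

def embed_text(text: str) -> np.ndarray:
    """Stand-in for tokenizer + word-embedding lookup (random in this sketch)."""
    n_tokens = max(1, len(text.split()))
    return rng.normal(scale=0.02, size=(n_tokens, D_MODEL))

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for ViT + learned projection into the LLM embedding space."""
    return rng.normal(scale=0.02, size=(TOKENS_PER_IMAGE, D_MODEL))

def build_multimodal_prompt(template: str, images: list) -> np.ndarray:
    """Replace each <img> placeholder with that image's projected visual tokens."""
    pieces, img_iter = template.split("<img>"), iter(images)
    chunks = [embed_text(pieces[0])]
    for piece in pieces[1:]:
        chunks.append(encode_image(next(img_iter)))
        chunks.append(embed_text(piece))
    return np.concatenate(chunks, axis=0)

# A robotics-style prompt: goal, then alternating observations and actions.
prompt = ("Task: bring me the green bag. Step 1: <img> Action: open the drawer. "
          "Step 2: <img> Action:")
seq = build_multimodal_prompt(prompt, [np.zeros((224, 224, 3))] * 2)
print(seq.shape)    # (n_text_tokens + 2 * TOKENS_PER_IMAGE, D_MODEL)
```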
PaLM-E is trained end-to-end with the standard next-token prediction cross-entropy loss on a mixture of robotics and vision-language data. The robotics data includes both image observations and target action sequences expressed as text. The vision-language data includes web-scale image-text pairs, captioning data, and visual question answering. The language model parameters can be either frozen or fine-tuned; the paper finds that for the largest model, freezing PaLM and only training the visual projection works almost as well as full fine-tuning, with significantly less catastrophic forgetting on language-only benchmarks.[1]
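The training setup can be sketched as follows, with toy stand-in modules in place of the pretrained ViT and PaLM backbones. The sketch shows the frozen-backbone variant: only the visual projection receives gradients, and the cross-entropy loss is computed only over the target text, exactly as in ordinary next-token prediction. All sizes and module choices here are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy-sized stand-ins; the real model uses pretrained ViT and PaLM checkpoints.
D_VIT, D_MODEL, VOCAB, N_IMG_TOKENS = 256, 512, 32_000, 16

vit = nn.Linear(3 * 32 * 32, N_IMG_TOKENS * D_VIT)   # stands in for a ViT encoder
projection = nn.Linear(D_VIT, D_MODEL)               # the freshly initialised piece
llm_embed = nn.Embedding(VOCAB, D_MODEL)             # stands in for PaLM's embeddings
llm_body = nn.TransformerEncoder(                    # stands in for PaLM's decoder stack
    nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True), num_layers=2)
lm_head = nn.Linear(D_MODEL, VOCAB)

# "Frozen LLM" variant: only the projection receives gradient updates.
for module in (vit, llm_embed, llm_body, lm_head):
    for p in module.parameters():
        p.requires_grad_(False)
optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)

image = torch.randn(1, 3 * 32 * 32)                  # placeholder observation
prompt_ids = torch.randint(0, VOCAB, (1, 12))        # instruction tokens
target_ids = torch.randint(0, VOCAB, (1, 8))         # plan / answer tokens (as text)

visual = projection(vit(image).view(1, N_IMG_TOKENS, D_VIT))
text = llm_embed(torch.cat([prompt_ids, target_ids], dim=1))
logits = lm_head(llm_body(torch.cat([visual, text], dim=1)))

# Standard next-token cross-entropy, computed only on the target text.
n_prefix = N_IMG_TOKENS + prompt_ids.shape[1]
pred = logits[:, n_prefix - 1:-1, :]                 # positions that predict the targets
loss = F.cross_entropy(pred.reshape(-1, VOCAB), target_ids.reshape(-1))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```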
The paper reports results across a range of sizes, from a 12-billion-parameter version that combines an 8B PaLM with a 4B ViT, to the flagship PaLM-E-562B that pairs PaLM-540B with ViT-22B. Smaller variants exist primarily to enable controlled scaling experiments.[1]
| Variant | Language backbone | Vision backbone | Total parameters |
|---|---|---|---|
| PaLM-E-12B | PaLM-8B | ViT-4B | ~12B |
| PaLM-E-84B | PaLM-62B | ViT-22B | ~84B |
| PaLM-E-562B | PaLM-540B | ViT-22B | ~562B |
PaLM-E was trained on a mix of three robot embodiments and a large set of internet-scale vision-language datasets. The paper deliberately chose embodiments that differ in action space and sensor configuration so that the multi-embodiment training claim could be tested rigorously.[1]
| Domain | Description | Data source |
|---|---|---|
| Task and Motion Planning (TAMP) | Simulated tabletop with up to 8 objects; outputs symbolic plans solved by an underlying TAMP solver | Demonstrations generated by an oracle planner |
| Language Table | Tabletop robot manipulating colored blocks on a flat surface based on natural-language commands | Lynch et al. (2023) Language Table dataset |
| Mobile manipulator | Single-arm Everyday Robots mobile manipulator performing kitchen and office tasks | SayCan-style demonstrations collected at Google |
| Internet vision-language | Image captioning and visual question answering | WebLI, Conceptual Captions, VQAv2, OK-VQA, COCO Captions |
| Internet text | General language modeling | PaLM pretraining data |
The robot data for the Language Table embodiment was collected as part of the Interactive Language work by Lynch et al. (2023) and consists of natural-language demonstrations of pushing, sliding, and arranging colored blocks. The mobile manipulator data was inherited from the SayCan project, which had previously been used to evaluate language models as planners on the same hardware platform.[1][6][8]
The paper organizes PaLM-E's outputs into three families: high-level robot planning, embodied visual question answering and dialogue, and general internet-scale visual question answering and captioning.
For the TAMP, Language Table, and mobile manipulator embodiments, PaLM-E takes the current scene image and a natural-language goal as input and emits a sequence of high-level subgoals or actions. The plan is then executed by a low-level controller specific to the embodiment, with PaLM-E re-prompted at each step with a new image. The paper shows that this closed-loop setup allows the model to recover from intermediate failures, such as a dropped object or a perturbation introduced by a human.[1]
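The structure of that closed loop can be sketched as follows; the model call, camera, and low-level controller are placeholders, and the prompt format is an illustrative assumption rather than the paper's exact template.

```python
import numpy as np

def palm_e(prompt: str, image: np.ndarray) -> str:
    """Stand-in for the multimodal model; returns the next subgoal as text."""
    return "pick up the green block"          # placeholder output

def get_camera_image() -> np.ndarray:
    return np.zeros((224, 224, 3))            # placeholder observation

def execute_skill(subgoal: str) -> bool:
    """Stand-in for the low-level, embodiment-specific controller."""
    print(f"executing: {subgoal}")
    return True                               # pretend the skill succeeded

def run_task(goal: str, max_steps: int = 10) -> None:
    history = []
    for _ in range(max_steps):
        image = get_camera_image()            # fresh observation every step
        prompt = f"Task: {goal}. Steps so far: {history}. <img> Next step:"
        subgoal = palm_e(prompt, image)
        if subgoal.strip().lower() == "done":
            break
        execute_skill(subgoal)                # failures simply show up in the
        history.append(subgoal)               # next image, so the model can replan

run_task("stack the blocks by color")
```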
PaLM-E can answer questions about the robot's environment, such as how many objects of a given color are visible, whether a specific object can be picked up given the current arm configuration, or how the robot should rearrange the scene to satisfy a constraint. The model can also engage in multi-turn dialogue about the scene because the image and prior outputs are simply additional tokens in the multimodal sentence.[1]
Because PaLM-E retains the language modeling capabilities of PaLM and gains image understanding from the ViT, it inherits general vision-language abilities. The paper reports zero-shot performance on chain-of-thought visual reasoning, multi-image relational reasoning, and code generation conditioned on images, none of which were explicit training objectives.[1]
The paper's central scientific finding is that joint training across embodiments and across internet vision-language data improves performance on each individual task relative to single-task baselines. The authors call this positive transfer: a single PaLM-E instance trained on all three robot domains plus internet data outperforms separate models trained on each domain alone. The effect grows with model size, and is most pronounced for the largest 562B configuration.[1][9]
PaLM-E was evaluated on a mix of robot benchmarks and standard vision-language benchmarks. The most cited results are summarized below; full numbers are reported in the paper.[1][3]
| Benchmark | Metric | PaLM-E result | Comparison |
|---|---|---|---|
| OK-VQA | Accuracy | 66.1% (PaLM-E-562B) | New state of the art at the time of publication; previous SOTA was 60.6% from PaLI-17B[1][3] |
| VQAv2 | Accuracy | 80.0% (PaLM-E-562B, fine-tuned) | Competitive with specialized VQA models |
| COCO Captions | CIDEr | 138.7 (PaLM-E-562B, fine-tuned) | Comparable to dedicated captioning models |
| TAMP environment | Plan success rate | Around 95% on held-out tasks for PaLM-E-562B | Strong improvement over single-task baselines[1] |
| Language Table (long horizon) | Task success rate | About 65% on long-horizon stacking tasks for PaLM-E-12B with co-training | Compared to about 35% without co-training[1] |
| Mobile manipulator (SayCan tasks) | Affordance plan success | Higher than SayCan baseline using same skill library | Demonstrated robustness to scene perturbations during execution[1] |
| Multi-image reasoning | Zero-shot accuracy | Outperforms single-image baselines | First demonstration that PaLM-E generalizes to multi-image inputs without explicit training[1] |
The OK-VQA result attracted particular attention because PaLM-E-562B achieved it as a single generalist model rather than a specialist tuned on the OK-VQA training set. The model was also reported to retain most of PaLM-540B's language-only benchmark performance, which had been a concern with earlier approaches that fine-tuned the language model on multimodal data and saw significant catastrophic forgetting.[1][3]
A notable finding in the paper's analysis is that catastrophic forgetting of language capabilities decreases with model scale. PaLM-E-12B suffers roughly an 87% relative drop on language-only benchmarks when fine-tuned on multimodal data, while PaLM-E-562B suffers only about a 3.9% drop. This led the authors to argue that freezing the language model during multimodal training becomes unnecessary at sufficient scale, because the larger model has enough capacity to absorb the new modalities without overwriting its language priors.[1][9]
PaLM-E was announced through a Google Research blog post on March 10, 2023 and a project page at palm-e.github.io that hosted video demonstrations of the mobile manipulator following multi-step natural-language instructions in a kitchen.[3][10] The release attracted broad coverage in the technology press because it was published only weeks after the launch of GPT-4 and was widely framed as Google's response on the multimodal-and-embodied frontier.[11][12]
Ars Technica called PaLM-E "a multimodal language model that controls a robot" and noted that the most striking demonstration was an instruction such as "bring me the rice chips from the drawer", which the robot completed despite the chips being in a different drawer than expected and an experimenter physically moving the bag during execution.[12] The Verge highlighted the 562-billion-parameter scale and the multi-embodiment generalization, and IEEE Spectrum framed PaLM-E as evidence that foundation models could play a central role in robotics rather than only in language.[11][13]
The ICML 2023 review process accepted PaLM-E as an oral paper, and the authors gave a keynote-style presentation in the robotics track at the conference in Honolulu.[2]
PaLM-E directly seeded a sequence of robotics foundation models at Google and beyond. The most direct successor is RT-2 (Brohan et al., 2023), which adopted the same recipe of fine-tuning a vision-language model on robot demonstrations but added the key step of representing robot actions as text tokens, allowing the model to be trained jointly on internet image-text pairs and robot trajectories using the same next-token prediction objective. RT-2 used the PaLI-X and PaLM-E backbones and demonstrated significantly stronger generalization than RT-1 to novel objects, backgrounds, and instructions.[4][14]
| Successor | Year | Relationship to PaLM-E |
|---|---|---|
| RT-2 | July 2023 | Built on PaLM-E and PaLI-X backbones; introduced action tokenization to fold robot actions into the language model output[4][14] |
| Open X-Embodiment / RT-X | October 2023 | Cross-institutional dataset and RT-1-X / RT-2-X models trained on data from 22 robot embodiments; built on lessons about positive transfer first demonstrated by PaLM-E[5][15] |
| PaLI-3 | October 2023 | Smaller, more efficient vision-language model that inherited several PaLM-E design choices for the visual projection layer |
| AutoRT | January 2024 | Used vision-language models including PaLM-E descendants to scale up data collection on a fleet of mobile manipulators |
| Gemini and Gemini Robotics | December 2023 / March 2025 | Followed the same multimodal-encoder-into-language-model template at much larger scale; Gemini Robotics inherited the embodiment-flexible training paradigm and the two-model planner / policy split that PaLM-E and SayCan first explored[16] |
| OpenVLA | June 2024 | Open-source 7B vision-language-action model trained on Open X-Embodiment data; explicitly cites PaLM-E and RT-2 as inspiration |
A 2024 survey of vision-language-action models in IEEE Transactions on Robotics described PaLM-E as the inflection point at which vision-language models became viable robot planners, and identified action tokenization (introduced shortly afterward by RT-2) as the missing piece that turned the planning interface into a unified vision-language-action model.[17]
PaLM-E shifted the framing of robot policy learning. Before its release, the dominant approach was to train a small policy network from scratch on a single robot embodiment, with the language model (if any) acting as an external planner. After PaLM-E, the dominant approach in industrial labs became to take a frontier vision-language model, fine-tune it on robot data from many embodiments, and let positive transfer from internet data do the heavy lifting on generalization.[5][17]
The paper also legitimized the use of internet-scale pretraining as the foundation for embodied AI. Subsequent work at Google (RT-2, RT-X, Gemini Robotics), Stanford (Mobile ALOHA, OpenVLA), NVIDIA (GR00T), and Physical Intelligence (Pi-0) all assume that embodied policies should inherit perception and reasoning from large multimodal foundation models, with robot-specific data serving as a fine-tuning signal rather than the primary source of capability.[5][16][18]
Finally, PaLM-E provided concrete empirical support for the generalist agent hypothesis: a single sufficiently large model trained on many tasks and embodiments can outperform specialist models on each individual task. This claim, originally made for language by GPT-3 and for general environments by Gato (Reed et al., 2022), was extended to physical robots by PaLM-E and remains a guiding assumption of the foundation model approach to robotics.[1][9]
The authors acknowledge several limitations of PaLM-E in the paper itself and in subsequent talks.
In particular, they note that some Language Table results required co-training with internet data to reach high accuracy, suggesting that the small robot dataset was not sufficient on its own and that the multimodal pretraining was doing a substantial amount of the generalization work.[1]