RT-2
Last reviewed
May 4, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,081 words
RT-2 (Robotic Transformer 2) is a vision-language-action model developed by Google DeepMind that enables robots to execute novel tasks by transferring knowledge from internet-scale vision-language pretraining into low-level robot control. Introduced in July 2023, RT-2 was the first widely publicized system to repurpose a large pretrained vision-language model (VLM) as the backbone of a robot policy, treating discretized robot actions as ordinary text tokens predicted autoregressively by the same head used for natural language.
The model was built by fine-tuning two existing VLMs, PaLI-X and PaLM-E, on a mixture of web-scale vision-language data and a robot demonstration dataset originally collected for its predecessor RT-1. The resulting policies could be conditioned on natural-language instructions and produce six-degree-of-freedom end-effector commands for a mobile manipulator. Because the action vocabulary was embedded inside the language model's token space, RT-2 inherited semantic and visual knowledge from the web. This produced what the authors called "emergent" capabilities such as understanding novel objects, basic spatial reasoning, and chain-of-thought-style multi-step instruction following without explicit robot training data for those skills.
RT-2 was widely covered in mainstream technology and science press as an early demonstration of a "foundation model for robotics." It set a template that subsequent VLA systems, including Open X-Embodiment's RT-X, OpenVLA, and Physical Intelligence's pi0, would build on. Google DeepMind did not release RT-2's weights or code; it remains a closed research artifact.
| Field | Value |
|---|---|
| Developer | Google DeepMind |
| Released | July 28, 2023 (arXiv preprint) |
| Architecture | Vision-language-action model based on PaLI-X and PaLM-E |
| Variants | RT-2-PaLI-X-5B, RT-2-PaLI-X-55B, RT-2-PaLM-E-12B |
| Largest variant | RT-2-PaLI-X 55B |
| Predecessor | RT-1 (December 2022) |
| Successors / related | RT-X (October 2023), AutoRT (2024), Open X-Embodiment |
| Training data | RT-1 robot demonstration data (Everyday Robots fleet) combined with web-scale vision-language data |
| License | Closed; weights and code not publicly released |
| Paper | arXiv:2307.15818 |
RT-2's direct predecessor, RT-1, was published by a large team at Google Research and Everyday Robots in December 2022. RT-1 was a 35-million-parameter transformer that consumed a short history of camera images and a natural-language instruction, then produced discretized end-effector actions at roughly 3 Hz. It was trained from scratch on approximately 130,000 robot demonstrations covering more than 700 tasks, gathered over seventeen months by a fleet of thirteen Everyday Robots mobile manipulators in Google's office buildings. RT-1 demonstrated that a single transformer policy could absorb a large multi-task demonstration dataset and execute kitchen and office manipulation tasks reliably, but it was limited to behaviors and objects represented in the robot data.
In parallel, Google had been scaling its vision-language and embodied multimodal models. PaLM-E, published in March 2023, injected images and other sensor observations into the embedding space of the PaLM large language model and showed that a single embodied multimodal model up to 562 billion parameters could perform high-level robotic planning, visual question answering, and language tasks. PaLI-X, published in May 2023, was a 55-billion-parameter multilingual vision-and-language model that achieved state-of-the-art results across more than twenty-five image and video benchmarks. Both models showed that scaling multimodal pretraining produced positive transfer to embodied reasoning, but neither produced low-level continuous control directly.
On July 28, 2023, Google DeepMind posted the preprint "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (arXiv:2307.15818) and a companion blog post and project website. The paper carries 54 named authors led by Anthony Brohan and Brianna Zitkovich, with co-authors including Yevgen Chebotar, Krzysztof Choromanski, Tianli Ding, Danny Driess, Chelsea Finn, Pete Florence, Karol Hausman, Brian Ichter, Sergey Levine, Lisa Lee, Yao Lu, Igor Mordatch, Karl Pertsch, Pierre Sermanet, Quan Vuong, Fei Xia, Ted Xiao, and Tianhe Yu. The work was a collaboration between several Google DeepMind robotics groups; many contributors had previously worked at Google Brain or Google Research before the 2023 merger that created Google DeepMind.
The paper framed RT-2 as a synthesis of two trends. First, the existing robot-learning literature, exemplified by RT-1, showed that scaling up imitation data on a single robot could produce broadly useful policies. Second, the vision-language model literature showed that internet-scale pretraining could endow models with broad world knowledge. RT-2 asked whether that knowledge could be channeled into the moment-to-moment motor commands that a robot needs to act.
The central technical idea of RT-2 is to express robot actions as ordinary text tokens that the underlying language model can emit. Each action consists of a discrete "terminate episode" flag, six positional and rotational deltas for the robot's end effector, and a gripper open/close command. Continuous values are uniformly discretized into 256 bins, and each bin is mapped to a distinct token in the vocabulary. With the PaLI-X backbone, the authors reuse the tokens that already correspond to the integers 0 through 255; with PaLM-E, which has no such ready-made integer tokens, they overwrite the 256 least frequently used tokens. An entire eight-dimensional action is then a short fixed-length string such as "1 128 91 241 5 101 127 217", which the language head produces autoregressively given an image and an instruction.
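The discretization described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `to_bin` and `encode_action` and the normalization bounds `LOW` and `HIGH` are assumptions made for the example; the text specifies only uniform discretization into 256 bins and an eight-token action string.

```python
NUM_BINS = 256  # from the text: 256 uniform bins per continuous dimension

# Hypothetical normalization bounds; real bounds depend on the robot.
LOW, HIGH = -1.0, 1.0


def to_bin(value, low=LOW, high=HIGH):
    """Map one continuous value to an integer bin index in 0..255."""
    clipped = min(max(value, low), high)
    unit = (clipped - low) / (high - low)  # rescale to [0, 1]
    return min(int(unit * NUM_BINS), NUM_BINS - 1)


def encode_action(terminate, continuous):
    """Build the action string: terminate flag, then 7 binned values
    (6 end-effector deltas followed by the gripper command)."""
    if len(continuous) != 7:
        raise ValueError("expected 6 end-effector deltas plus 1 gripper value")
    tokens = [int(terminate)] + [to_bin(v) for v in continuous]
    return " ".join(str(t) for t in tokens)


print(encode_action(0, [0.0] * 7))
```

Under these assumed bounds, a zero displacement lands in the middle bin (128), and out-of-range values are clipped to bins 0 or 255.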
Because actions are tokens, the same pretrained head that predicts the next word of a caption now predicts the next dimension of a robot action. No new output decoder is added, and no separate action prediction loss is introduced; the standard next-token cross-entropy loss handles both modalities. At deployment, the model produces a complete action string, the string is parsed back into floating-point values, and the values are sent to the underlying low-level controller.
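The deployment step, parsing an emitted action string back into floating-point values, might look like the following sketch. The bounds, the bin-centre convention, and the name `decode_action` are assumptions for illustration; the text does not say how bin indices are mapped back to continuous values.

```python
NUM_BINS = 256

# Hypothetical normalization bounds; must match whatever was used to encode.
LOW, HIGH = -1.0, 1.0


def decode_action(text):
    """Parse an 8-token action string into (terminate_flag, 7 floats)."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8:
        raise ValueError("expected 8 action tokens")
    if any(not 0 <= t < NUM_BINS for t in tokens):
        raise ValueError("token outside the 0..255 action range")
    width = (HIGH - LOW) / NUM_BINS
    # Assumption: each bin decodes to its centre value.
    continuous = [LOW + (t + 0.5) * width for t in tokens[1:]]
    return tokens[0], continuous


flag, values = decode_action("1 128 91 241 5 101 127 217")
print(flag, [round(v, 3) for v in values])
```

Validating the token count and range before decoding is a cheap guard against malformed generations reaching the low-level controller.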
RT-2 was built on two families of vision-language models. The full set of variants reported in the paper is summarized below.
| Variant | Backbone | Approximate parameters | Notes |
|---|---|---|---|
| RT-2-PaLI-X-5B | PaLI-X (UL2-3B + ViT-22B at smaller scale) | ~5B | Fits on a single host; faster inference |
| RT-2-PaLI-X-55B | PaLI-X (full scale) | ~55B | Largest variant; strongest emergent skills |
| RT-2-PaLM-E-12B | PaLM-E | ~12B | Inherits PaLM-E's embodied reasoning training |
For the 55B variant, inference runs on a multi-host TPU pod accessed remotely from the robot, supporting a control loop of roughly 1 to 3 Hz. The 5B variant runs at approximately 5 Hz, somewhat above RT-1's control rate of roughly 3 Hz.
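To make these control rates concrete, the time available per action is the reciprocal of the control frequency. The short calculation below uses only the rates quoted above; the helper name `step_budget_ms` is illustrative.

```python
def step_budget_ms(rate_hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / rate_hz


for label, hz in [("55B, low end", 1.0), ("55B, high end", 3.0), ("5B", 5.0)]:
    print(f"{label}: {step_budget_ms(hz):.0f} ms per action")
```

Each budget has to cover image capture, the network round trip to the remote accelerators, autoregressive decoding of the action tokens, and dispatch to the controller.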
RT-2 is not trained from scratch. Each variant starts from a pretrained vision-language model checkpoint and is then co-fine-tuned on a mixture of the original web-scale vision-language data used during VLM pretraining and the robot demonstration data. The authors found that this co-fine-tuning recipe substantially outperformed fine-tuning purely on robot data, because keeping the original VL data in the mixture preserves the semantic and visual knowledge that gives RT-2 its emergent abilities. Removing the web data and fine-tuning only on robot trajectories causes the model to forget concepts that were never seen on the robot.
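One way to picture co-fine-tuning is as a sampler that draws each training example from either the web vision-language pool or the robot pool. The sketch below is an illustration under stated assumptions, not the paper's data pipeline: `mixed_batch`, `robot_fraction`, and the toy examples (including the made-up action tokens) are hypothetical, and the text above does not state the mixing ratio.

```python
import random


def mixed_batch(web_examples, robot_examples, batch_size, robot_fraction, rng):
    """Draw a batch in which each item comes from the robot pool with
    probability `robot_fraction`, and from the web pool otherwise."""
    batch = []
    for _ in range(batch_size):
        pool = robot_examples if rng.random() < robot_fraction else web_examples
        batch.append(rng.choice(pool))
    return batch


# Toy data: both sources share a (prompt, target) text format, so a single
# next-token loss can cover captions and action strings alike.
web = [{"prompt": "What is in the image?", "target": "a red apple"}]
robot = [{"prompt": "pick up the orange can",
          "target": "0 131 120 98 127 127 127 255"}]  # made-up action tokens

batch = mixed_batch(web, robot, batch_size=8, robot_fraction=0.5,
                    rng=random.Random(0))
print(sum(ex in robot for ex in batch), "robot examples out of", len(batch))
```

Setting `robot_fraction` to 1.0 corresponds to the robot-only fine-tuning baseline that, according to the authors, forgets concepts never seen on the robot.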
The model is conditioned on a natural-language instruction in the standard VLM prompt format and on a single recent camera image, and it outputs a sequence of action tokens. It does not use action chunking in the sense of predicting several future actions as one block: each emitted string represents one full action vector for the current time step, and the policy is queried again at the next step. Training trajectories supply this per-step supervision across many time steps.
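Putting the inputs and outputs together, a closed-loop rollout could be sketched as below. Everything here is a stand-in: `StubPolicy`, `get_image`, and `send_to_controller` are hypothetical placeholders for the VLM, the camera, and the low-level controller, and treating a first token of 1 as "terminate" is an assumption about the flag's encoding.

```python
class StubPolicy:
    """Placeholder for the VLA model: replays canned action strings."""

    def __init__(self, scripted):
        self._scripted = list(scripted)

    def predict(self, image, instruction):
        return self._scripted.pop(0)


def run_episode(policy, instruction, get_image, send_to_controller,
                max_steps=100):
    """Query the policy once per step with the latest image only.

    Returns the number of actions sent before termination.
    """
    steps = 0
    for _ in range(max_steps):
        image = get_image()  # single most recent frame, no history
        tokens = [int(t) for t in policy.predict(image, instruction).split()]
        if tokens[0] == 1:  # assumed meaning: 1 = terminate episode
            break
        send_to_controller(tokens[1:])  # 7 binned values for this step
        steps += 1
    return steps


sent = []
policy = StubPolicy([
    "0 128 128 128 128 128 128 255",
    "0 130 128 120 128 128 128 255",
    "1 128 128 128 128 128 128 128",
])
n = run_episode(policy, "pick up the orange can", lambda: None, sent.append)
print(n, "actions sent")
```

Because each call yields exactly one action, the achievable control rate is bounded by the latency of a single model query.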
RT-2 was trained on a combination of two data sources.
The first is the web-scale vision-language pretraining mixture inherited from the backbone. For the PaLI-X variants, this includes the WebLI image-text dataset, captioning data, visual question answering data, and OCR data spanning many languages. For the PaLM-E variant, the mixture includes the original PaLM language pretraining data plus the multimodal corpora used to train PaLM-E.
The second is robot demonstration data collected on the Everyday Robots fleet of mobile manipulators. This is the same dataset that was used to train RT-1: roughly 130,000 episodes covering more than 700 distinct task instructions, recorded by human teleoperators across a small set of office kitchen environments. Each episode pairs an instruction such as "pick up the orange can" with a sequence of images and the corresponding end-effector actions. The Bridge Data dataset and other smaller manipulation datasets were also used in some experiments, foreshadowing the multi-embodiment training that would appear later in RT-X and Open X-Embodiment.
The most discussed result of the RT-2 paper is the appearance of capabilities that were not directly demonstrated to the robot but that the model nonetheless executes correctly because they are present in the web pretraining data. The authors group these into three rough categories: symbol understanding, reasoning, and human recognition.
A representative set of qualitative demonstrations from the paper and project website includes:
Quantitatively, the paper reports that on a held-out evaluation suite of unseen objects, backgrounds, and environments, RT-2 achieves an average success rate of approximately 62 percent, compared with about 32 percent for RT-1 on the same evaluation. The authors also report a roughly threefold improvement over baselines on a dedicated emergent-skill evaluation that explicitly tests symbol understanding, reasoning, and human recognition. On the Language Table simulated benchmark used in earlier work, RT-2 reaches around 90 percent success on a set of long-horizon tasks. The paper also includes a chain-of-thought variant in which the model is prompted to first emit a short natural-language plan and then the action tokens; this version improves performance on multi-step instructions. See chain-of-thought prompting for the broader pattern.
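For the chain-of-thought variant, the model's output contains a natural-language plan followed by action tokens, so the action has to be separated from the plan before execution. The sketch below assumes an output of the form `Plan: ... Action: ...`; that layout, the example text, and the function name `split_plan_and_action` are assumptions for illustration rather than details given in the text above.

```python
import re

# Assumed output layout: "Plan: <free text> Action: <space-separated ints>"
_PATTERN = re.compile(
    r"Plan:\s*(?P<plan>.*?)\s*Action:\s*(?P<action>[\d\s]+)$",
    re.DOTALL,
)


def split_plan_and_action(output):
    """Return (plan_text, action_tokens) from a chain-of-thought output."""
    match = _PATTERN.search(output.strip())
    if match is None:
        raise ValueError("output does not match the assumed Plan/Action layout")
    tokens = [int(t) for t in match.group("action").split()]
    return match.group("plan"), tokens


plan, tokens = split_plan_and_action(
    "Plan: pick up the apple. Action: 1 128 91 241 5 101 127 217"
)
print(plan)
print(tokens)
```

Only the action tokens are sent on to the controller; the plan text acts as an intermediate step that the action tokens are conditioned on.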
In ablations, the authors show that model size matters: the 55B PaLI-X variant generalizes better than the 5B variant, and both outperform the 12B PaLM-E variant on most generalization axes, although the PaLM-E variant is sometimes stronger on tasks that benefit from richer language reasoning. Co-fine-tuning with web data is critical; pure robot fine-tuning loses most of the emergent capabilities.
RT-2 has several practical and conceptual limitations that the authors acknowledge.
The model cannot acquire genuinely new low-level motor skills that are absent from its training distribution. Although it can compose known skills with novel objects, it cannot, for example, learn to fold a shirt simply because the concept of folding appears in web text. The action distribution that the model can produce is constrained to the small set of behaviors represented in the RT-1 demonstration data: pick, place, push, open, close, and a few others on a single end effector.
Inference latency is significant. The 55B variant requires a TPU pod and yields a 1 to 3 Hz control loop, which is sufficient for the slow tabletop manipulation tasks demonstrated but inadequate for dynamic or contact-rich behavior. The 5B variant is faster but still relies on networked compute.
The physical platform is also a constraint. RT-2 was demonstrated on the Everyday Robots mobile manipulator, the same hardware used for RT-1. Alphabet shut down Everyday Robots as a separate project in February 2023 during company-wide cost cutting, with some staff and technology absorbed into Google Research. The RT-2 paper appeared after this shutdown, and the demonstrations rely on the existing fleet rather than a continuing hardware program.
Finally, RT-2 is closed. Google DeepMind did not release weights, training code, fine-tuning data, or the specific evaluation suites used in the paper. This complicates external reproduction and contributed to demand for open VLA systems that followed.
RT-2 received broad coverage in mainstream and trade press, including The New York Times, Wired, IEEE Spectrum, MIT Technology Review, and InfoQ, with most articles framing the model as a step toward generalist robot intelligence and as evidence that large multimodal models can be redirected from passive perception to physical action. Within the research community, RT-2 was widely cited as the canonical proof of concept that an internet-pretrained VLM can serve as the backbone of a robot policy without sacrificing its semantic knowledge.
In the months and years that followed, the basic recipe of "start from a pretrained VLM, tokenize actions, co-fine-tune on robot data" became a common pattern. Google DeepMind and 21 collaborating institutions released the Open X-Embodiment dataset and the RT-X policies in October 2023, scaling RT-2-style training to data from 22 different robot embodiments. AutoRT, also from Google DeepMind, used VLMs as task proposers for autonomous data collection. Stanford's OpenVLA, released in June 2024, provided a 7-billion-parameter open-source VLA built on the Llama 2 family and the Open X-Embodiment data; it was explicitly framed as an open counterpart to RT-2. UC Berkeley's Octo, also from 2024, was another open generalist policy. Physical Intelligence, founded in 2024 by a group that included several RT-2 co-authors, released pi0, a 3.3-billion-parameter open VLA with a flow-matching action head. NVIDIA's GR00T N1, released in 2025, applied a similar VLA approach to humanoid platforms.
RT-2 also influenced the broader narrative around embodied AI. Where earlier embodied systems had relied on either narrow imitation learning or multi-stage planning pipelines built around frozen large multimodal models, RT-2 made the case that the perception, reasoning, and action layers could share a single autoregressive model.
| Model | Developer | Released | Approximate parameters | Notes |
|---|---|---|---|---|
| RT-1 | Google Research, Everyday Robots | December 2022 | ~35M | Robot transformer trained from scratch on robot demonstrations only |
| PaLM-E | Google | March 2023 | up to 562B | Embodied multimodal language model, high-level planning |
| RT-2 | Google DeepMind | July 2023 | up to 55B | First widely cited VLA built on a pretrained VLM |
| RT-X / Open X-Embodiment | Google DeepMind and 21 partners | October 2023 | varies | Multi-embodiment dataset and policies |
| AutoRT | Google DeepMind | January 2024 | varies | VLM-driven autonomous data collection at scale |
| OpenVLA | Stanford, UC Berkeley, TRI, Google DeepMind | June 2024 | 7B | Open-source VLA on Open X-Embodiment data |
| Octo | UC Berkeley and collaborators | 2024 | up to 93M | Open generalist robot policy with diffusion action head |
| pi0 | Physical Intelligence | October 2024 | 3.3B | Open VLA with flow-matching action expert |
| GR00T N1 | NVIDIA | 2025 | varies | VLA aimed at humanoid robots |
For earlier work that informed RT-2, see also PaLI for the multilingual VLM lineage that produced PaLI-X, and Gato for DeepMind's earlier generalist agent.
RT-2 has 54 named authors, drawn primarily from Google DeepMind robotics teams in Mountain View. Recurring contributors across the RT-1, PaLM-E, RT-2, RT-X, and AutoRT series include Anthony Brohan, Noah Brown, Yevgen Chebotar, Krzysztof Choromanski, Tianli Ding, Danny Driess, Chelsea Finn, Pete Florence, Karol Hausman, Brian Ichter, Alex Irpan, Dmitry Kalashnikov, Sergey Levine, Lisa Lee, Tsang-Wei Edward Lee, Yao Lu, Igor Mordatch, Karl Pertsch, Kanishka Rao, Pannag Sanketi, Pierre Sermanet, Vincent Vanhoucke, Quan Vuong, Fei Xia, Ted Xiao, Tianhe Yu, and Brianna Zitkovich.
Several of these researchers later moved to other VLA-focused roles. Karol Hausman, Brian Ichter, and Quan Vuong became co-founders of Physical Intelligence in 2024, joining Sergey Levine (Chief Scientist) and Chelsea Finn there. Karl Pertsch went on to lead OpenVLA at Stanford and UC Berkeley. Other contributors moved into adjacent VLA and robot-learning programs at NVIDIA, Tesla, Skild AI, Figure, and academic groups.
RT-2 established the playbook that most subsequent generalist VLAs follow with variations:
Later systems diverge from RT-2 in important ways. OpenVLA keeps the discrete-token action representation but uses an open-weights Llama 2 backbone, so the entire stack can be reproduced and fine-tuned by external groups. Octo replaces the autoregressive action head with a diffusion model. pi0 adds a separate "action expert" that operates on continuous actions through flow matching, retaining a pretrained VLM as the backbone but moving away from token-level action prediction. GR00T N1 applies a similar dual-system architecture to humanoid robots with high-frequency control. Helix from Figure AI likewise splits the policy into a slower system-2 VLM and a faster system-1 controller. In each case the underlying claim, that internet-scale multimodal pretraining is the right starting point for a robot policy, traces back to the demonstration that RT-2 made.
For a broader view of how the family relates, see vision-language-action model, robotics, and imitation learning.