# RT-2

> Source: https://aiwiki.ai/wiki/rt_2
> Updated: 2026-06-21
> Categories: AI Models, Google DeepMind, Robotics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**RT-2** (Robotic Transformer 2) is a [vision-language-action model](/wiki/vision_language_action_model) developed by [Google DeepMind](/wiki/google_deepmind) that enables robots to execute novel tasks by transferring knowledge from internet-scale vision-language pretraining into low-level robot control.[1] Introduced on July 28, 2023, RT-2 was the first widely publicized system to repurpose a large pretrained vision-language model (VLM) as the backbone of a robot policy, treating discretized robot actions as ordinary text tokens predicted autoregressively by the same head used for natural language.[1][2] On held-out tasks with unseen objects, backgrounds, and environments, RT-2 roughly doubled the generalization success rate of its predecessor, from about 32 percent for RT-1 to about 62 percent.[1][2]

The paper's central claim, stated in its abstract, is that the authors "study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning."[1] RT-2 was built by fine-tuning two existing VLMs, PaLI-X and [PaLM-E](/wiki/palm_e), on a mixture of web-scale vision-language data and a robot demonstration dataset originally collected for its predecessor [RT-1](/wiki/rt_1).[1] The resulting policies could be conditioned on natural-language instructions and produce six-degree-of-freedom end-effector commands for a mobile manipulator. Because the action vocabulary was embedded inside the language model's token space, RT-2 inherited semantic and visual knowledge from the web. This produced what the authors called "emergent" capabilities such as understanding novel objects, basic spatial reasoning, and chain-of-thought-style multi-step instruction following without explicit robot training data for those skills.[1]

RT-2 was widely covered in mainstream technology and science press as an early demonstration of a [foundation model for robotics](/wiki/robot_foundation_model).[12] It set a template that subsequent VLA systems, including [Open X-Embodiment](/wiki/open_x_embodiment)'s RT-X, [OpenVLA](/wiki/openvla), and Physical Intelligence's [pi0](/wiki/pi0), would build on.[7][8][9] Google DeepMind did not release RT-2's weights or code; it remains a closed research artifact.[2]

## Infobox

| Field | Value |
|---|---|
| Developer | Google DeepMind |
| Released | July 28, 2023 (arXiv preprint) |
| Architecture | Vision-language-action model based on PaLI-X and PaLM-E |
| Variants | RT-2-PaLI-X-5B, RT-2-PaLI-X-55B, RT-2-PaLM-E-12B |
| Largest variant | RT-2-PaLI-X-55B |
| Predecessor | RT-1 (December 2022) |
| Successors / related | RT-X (October 2023), AutoRT (2024), Open X-Embodiment |
| Training data | RT-1 robot demonstration data (Everyday Robots fleet) combined with web-scale vision-language data |
| License | Closed; weights and code not publicly released |
| Paper | arXiv:2307.15818 |

## What is RT-2?

RT-2 is a single autoregressive model that takes a camera image and a natural-language instruction and outputs a robot action encoded as text tokens. Google DeepMind described it in its launch post as "a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalised instructions for robotic control."[2] In other words, the same network that can answer a question about an image can also drive a robot arm, because the arm's commands are expressed in the model's own token vocabulary rather than handled by a separate control stack.

## When was RT-2 released, and how did it follow from RT-1?

### From RT-1 to RT-2

RT-2's direct predecessor, RT-1, was published by a large team at Google Research and [Everyday Robots](/wiki/everyday_robots) in December 2022.[4] RT-1 was a 35-million-parameter [transformer](/wiki/transformer) that consumed a short history of camera images and a natural-language instruction, then produced discretized end-effector actions at roughly 3 Hz. It was trained from scratch on approximately 130,000 robot demonstrations covering more than 700 tasks, gathered over seventeen months by a fleet of thirteen Everyday Robots mobile manipulators in Google's office buildings.[4] RT-1 demonstrated that a single transformer policy could absorb a large multi-task demonstration dataset and execute kitchen and office [manipulation](/wiki/manipulation) tasks reliably, but it was limited to behaviors and objects represented in the robot data.[4]

In parallel, Google had been scaling its vision-language and embodied multimodal models. PaLM-E, published in March 2023, injected images and other sensor observations into the embedding space of the [PaLM](/wiki/palm) [large language model](/wiki/large_language_model) and showed that a single embodied multimodal model up to 562 billion parameters could perform high-level robotic planning, visual question answering, and language tasks.[5] PaLI-X, published in May 2023, was a 55-billion-parameter multilingual vision-and-language model that achieved state-of-the-art results across more than twenty-five image and video benchmarks.[6] Both models showed that scaling multimodal pretraining produced positive transfer to embodied reasoning, but neither produced low-level continuous control directly.

### The RT-2 paper

On July 28, 2023, Google DeepMind posted the preprint "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control" (arXiv:2307.15818) alongside a companion blog post and project website.[1][2][3] The paper carries 54 named authors, listed alphabetically from Anthony Brohan to Brianna Zitkovich, with co-authors including Yevgen Chebotar, Krzysztof Choromanski, Tianli Ding, Danny Driess, Chelsea Finn, Pete Florence, [Karol Hausman](/wiki/karol_hausman), [Brian Ichter](/wiki/brian_ichter), Sergey Levine, Lisa Lee, Yao Lu, Igor Mordatch, Karl Pertsch, Pierre Sermanet, Quan Vuong, Fei Xia, Ted Xiao, and Tianhe Yu.[1] The work was a collaboration between several Google DeepMind robotics groups; many contributors had previously worked at Google Brain or Google Research before the 2023 merger that created Google DeepMind. The evaluation behind the paper's claims spanned roughly 6,000 robot trials.[1]

The paper framed RT-2 as a synthesis of two trends. First, the existing robot-learning literature, exemplified by RT-1, showed that scaling up imitation data on a single robot could produce broadly useful policies.[4] Second, the vision-language model literature showed that internet-scale pretraining could endow models with broad world knowledge. RT-2 asked whether that knowledge could be channeled into the moment-to-moment motor commands that a robot needs to act.[1]

## How does RT-2 work?

### Action tokenization

The central technical idea of RT-2 is to express robot actions as ordinary text tokens that the underlying language model can emit, integrating them into training "in the same way as natural language tokens."[1] Each action consists of a discrete "terminate episode" flag, six positional and rotational deltas for the robot's end effector, and a gripper open/close command. Continuous values are uniformly discretized into 256 bins, and each bin is mapped to a distinct token in the vocabulary. With the PaLI-X backbone, whose vocabulary already contains a dedicated token for each small integer, each bin is mapped to the token for the corresponding integer between 0 and 255; with PaLM-E, which lacks such tokens, the authors overwrite the 256 least frequently used tokens. An entire eight-dimensional action is then a short fixed-length string such as "1 128 91 241 5 101 127 217", which the language head produces autoregressively given an image and an instruction.[1]

Because actions are tokens, the same pretrained head that predicts the next word of a caption now predicts the next dimension of a robot action. No new output decoder is added, and no separate action prediction loss is introduced; the standard next-token cross-entropy loss handles both modalities.[1] At deployment, the model produces a complete action string, the string is parsed back into floating-point values, and the values are sent to the underlying low-level controller.
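The round trip described above, from continuous values to a token string and back, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the shared `[-1, 1]` bounds, and the choice to decode to bin centers are all assumptions made for the example.

```python
NUM_BINS = 256  # each continuous value is discretized into 256 uniform bins

# Hypothetical bounds shared by all seven continuous dimensions
# (six end-effector deltas plus gripper); the real ranges are not given here.
LOW, HIGH = -1.0, 1.0


def encode_action(terminate: int, values: list[float]) -> str:
    """Map one action to a space-separated string of eight integer tokens."""
    tokens = [int(terminate)]
    for v in values:
        clipped = min(max(v, LOW), HIGH)
        scaled = (clipped - LOW) / (HIGH - LOW)  # now in [0, 1]
        tokens.append(min(int(scaled * NUM_BINS), NUM_BINS - 1))
    return " ".join(str(t) for t in tokens)


def decode_action(text: str) -> tuple[int, list[float]]:
    """Parse a token string back into a terminate flag and bin-center values."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8 or not all(0 <= t < NUM_BINS for t in tokens):
        raise ValueError(f"malformed action string: {text!r}")
    width = (HIGH - LOW) / NUM_BINS
    return tokens[0], [LOW + (t + 0.5) * width for t in tokens[1:]]


action_string = encode_action(0, [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.9])
flag, recovered = decode_action(action_string)
```

Under these assumptions the recovered values differ from the originals by at most half a bin width, which is the precision cost of representing each dimension with a single token.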

### Backbones and variants

RT-2 was built on two families of vision-language models. The full set of variants reported in the paper is summarized below.

| Variant | Backbone | Approximate parameters | Notes |
|---|---|---|---|
| RT-2-PaLI-X-5B | PaLI-X (UL2-3B + ViT-22B at smaller scale) | ~5B | Fits on a single host; faster inference |
| RT-2-PaLI-X-55B | PaLI-X (full scale) | ~55B | Largest variant; strongest emergent skills |
| RT-2-PaLM-E-12B | PaLM-E | ~12B | Inherits PaLM-E's embodied reasoning training |

For the 55B variant, inference is performed on a multi-host TPU pod accessed remotely from the robot, supporting a control loop of roughly 1 to 3 Hz.[1] The 5B variant runs at approximately 5 Hz,[1] somewhat faster than the roughly 3 Hz of RT-1.[4]
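To make these rates concrete, the short sketch below converts a control rate into the time available for one full control step. The helper name is hypothetical, and the calculation is simple arithmetic rather than anything reported in the paper.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given loop rate."""
    if control_hz <= 0:
        raise ValueError("control rate must be positive")
    return 1000.0 / control_hz


# Rates quoted in the text: 1 to 3 Hz for the 55B variant, about 5 Hz for the 5B variant.
budgets = {hz: round(step_budget_ms(hz), 1) for hz in (1.0, 3.0, 5.0)}
```

At 3 Hz each step has roughly a third of a second, and that window has to cover sending the image to the remote model, generating the eight action tokens one after another, and returning the result to the robot.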

### Co-fine-tuning

RT-2 is not trained from scratch. Each variant starts from a pretrained vision-language model checkpoint and is then co-fine-tuned on a mixture of the original web-scale vision-language data used during VLM pretraining and the robot demonstration data.[1] The authors found that this co-fine-tuning recipe substantially outperformed fine-tuning purely on robot data, because keeping the original VL data in the mixture preserves the semantic and visual knowledge that gives RT-2 its emergent abilities.[1] Removing the web data and fine-tuning only on robot trajectories causes the model to forget concepts that were never seen on the robot.[1]
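One simple way to picture co-fine-tuning is as a sampler that fills every training batch from both sources. The sketch below is an illustration under stated assumptions: the function name, the 0.5 mixing ratio, and the toy examples are hypothetical, and images are omitted for brevity.

```python
import random


def sample_mixed_batch(robot_examples, web_examples, robot_fraction, batch_size, rng):
    """Draw one batch where each slot comes from robot data with probability robot_fraction."""
    batch = []
    for _ in range(batch_size):
        source = robot_examples if rng.random() < robot_fraction else web_examples
        batch.append(rng.choice(source))
    return batch


# Both sources share one (prompt, target) text format, so a single
# next-token loss can cover them. The examples below are made up.
robot_examples = [("pick up the orange can", "1 128 91 241 5 101 127 217")]
web_examples = [("What is in the image?", "a kitchen counter")]

rng = random.Random(0)
batch = sample_mixed_batch(robot_examples, web_examples,
                           robot_fraction=0.5, batch_size=8, rng=rng)
```

Setting `robot_fraction` to 1.0 corresponds to the robot-only fine-tuning that the authors found causes forgetting; any value below 1.0 keeps some web data in every batch on average.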

The model is conditioned on a natural-language instruction in the standard VLM prompt format and on a single recent camera image, and its output is a sequence of action tokens. RT-2 does not use action chunking in the sense of predicting several future actions as one block: each emitted string represents one full action vector for the current time step, and supervision across many time steps comes from the training trajectories.
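The per-step behavior can be sketched as a loop that queries the policy once per time step and receives exactly one action string each time. Everything here is illustrative: the prompt template, the function names, the stub policy, and the convention that a nonzero first token ends the episode are assumptions, not details confirmed by this article.

```python
from typing import Callable, List


def build_prompt(instruction: str) -> str:
    # Hypothetical question-answer template; the exact wording is an assumption.
    return f"Q: what action should the robot take to {instruction}? A:"


def run_episode(policy: Callable[[object, str], str],
                get_image: Callable[[], object],
                instruction: str,
                max_steps: int = 50) -> List[List[int]]:
    """Query the policy once per time step; each reply is one 8-token action."""
    prompt = build_prompt(instruction)
    executed = []
    for _ in range(max_steps):
        tokens = [int(t) for t in policy(get_image(), prompt).split()]
        if len(tokens) != 8:
            raise ValueError("expected exactly eight action tokens")
        terminate, command = tokens[0], tokens[1:]
        if terminate:  # assumption: a nonzero first token ends the episode
            break
        executed.append(command)  # a real system would send this to the controller
    return executed


# Stub standing in for the model: replays canned action strings.
canned = iter(["0 128 91 241 5 101 127 217",
               "0 130 90 240 5 101 127 217",
               "1 128 128 128 128 128 128 128"])
actions = run_episode(lambda image, prompt: next(canned), lambda: None,
                      "pick up the orange can")
```

With the stub above, two commands are executed before the third reply signals termination. The loop makes one model call per action, which is why inference latency directly bounds the control rate.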

## What data was RT-2 trained on?

RT-2 was trained on a combination of two data sources.[1]

The first is the web-scale vision-language pretraining mixture inherited from the backbone. For the PaLI-X variants, this includes the WebLI image-text dataset, captioning data, visual question answering data, and OCR data spanning many languages.[6] For the PaLM-E variant, the mixture includes the original PaLM language pretraining data plus the multimodal corpora used to train PaLM-E.[5]

The second is robot demonstration data collected on the Everyday Robots fleet of mobile manipulators. This is the same dataset that was used to train RT-1: roughly 130,000 episodes covering more than 700 distinct task instructions, recorded by human teleoperators across a small set of office kitchen environments.[4] Each episode pairs an instruction such as "pick up the orange can" with a sequence of images and end-effector actions. The Bridge Data dataset and other smaller manipulation datasets were also used in some experiments, foreshadowing the multi-embodiment training that would appear later in RT-X and Open X-Embodiment.[7]
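As a rough picture of how such an episode might be represented and flattened into training examples, consider the sketch below. The class and field names are hypothetical, and the actual storage format of the dataset is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Step:
    image: bytes         # encoded camera frame (placeholder type)
    action_tokens: str   # e.g. "1 128 91 241 5 101 127 217"


@dataclass
class Episode:
    instruction: str
    steps: List[Step] = field(default_factory=list)


def to_training_examples(episode: Episode) -> Iterator[Tuple[bytes, str, str]]:
    """Flatten an episode into per-step (image, instruction, target) triples."""
    for step in episode.steps:
        yield step.image, episode.instruction, step.action_tokens


episode = Episode(
    instruction="pick up the orange can",
    steps=[Step(b"frame-0", "0 128 91 241 5 101 127 217"),
           Step(b"frame-1", "1 128 128 128 128 128 128 128")],
)
examples = list(to_training_examples(episode))
```

Each time step becomes its own example with the same instruction repeated, so an episode of N steps contributes N image-conditioned text targets to the training mixture.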

## What are RT-2's emergent abilities?

The most discussed result of the RT-2 paper is the appearance of capabilities that were not directly demonstrated to the robot but that the model nonetheless executes correctly because they are present in the web pretraining data. The authors group these into three rough categories: symbol understanding, reasoning, and human recognition.[1] As Google DeepMind summarized it, "RT-2 shows improved generalisation capabilities and semantic and visual understanding beyond the robotic data it was exposed to."[2]

A representative set of qualitative demonstrations from the paper and project website includes:[3]

- "Pick up the extinct animal" from a tabletop including various toys, where the robot selects a plastic dinosaur.
- "Move the can to Taylor Swift" from a table with several cans and several celebrity photographs, where the robot identifies the correct portrait.
- "Move the apple to the can that has the same color as the table."
- "Pick up the bag about to fall off the table."
- Selecting an object whose label is given as a small arithmetic problem ("the number that is 1 plus 2").
- Figuring out which available object would work best as an improvised hammer (a rock), a chain-of-thought example highlighted in the paper.[1]

Quantitatively, the paper reports that on a held-out evaluation suite of unseen objects, backgrounds, and environments, RT-2 achieves an average success rate of approximately 62 percent, compared with about 32 percent for RT-1 on the same evaluation.[1][2] The authors also report a roughly threefold improvement over baselines on a dedicated emergent-skill evaluation that explicitly tests symbol understanding, reasoning, and human recognition.[1] On the Language Table simulated benchmark used in earlier work, RT-2 reaches around 90 percent success on a set of long-horizon tasks, up from about 74 percent for RT-1.[2] The paper also includes a chain-of-thought variant in which the model is prompted to first emit a short natural-language plan and then the action tokens; this version improves performance on multi-step instructions.[1] See [chain-of-thought](/wiki/chain_of_thought) prompting for the broader pattern.
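For the chain-of-thought variant, the model's reply contains both a plan and an action, so the action has to be separated out before it can be executed. The sketch below assumes a hypothetical "Plan: ... Action: ..." layout; the markers and the function name are illustrative and not taken from this article.

```python
def parse_cot_output(text: str) -> tuple[str, list[int]]:
    """Split a 'Plan: ... Action: ...' reply into the plan text and action tokens."""
    plan_part, sep, action_part = text.partition("Action:")
    if not sep:
        raise ValueError("no 'Action:' marker found")
    plan = plan_part.replace("Plan:", "", 1).strip()
    tokens = [int(t) for t in action_part.split()]
    return plan, tokens


plan, tokens = parse_cot_output(
    "Plan: pick up the rock. Action: 1 128 91 241 5 101 127 217"
)
```

Because the plan is ordinary generated text, it comes from the same autoregressive pass as the action tokens; only the trailing integers are forwarded to the robot.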

In ablations, the authors show that model size matters: the 55B PaLI-X variant generalizes better than the 5B variant, and both outperform the 12B PaLM-E variant on most generalization axes, although the PaLM-E variant is sometimes stronger on tasks that benefit from richer language reasoning.[1] Co-fine-tuning with web data is critical; pure robot fine-tuning loses most of the emergent capabilities.[1]

## What are RT-2's limitations?

RT-2 has several practical and conceptual limitations that the authors acknowledge.[1]

The model cannot acquire genuinely new low-level motor skills that are absent from its training distribution. Although it can compose known skills with novel objects, it cannot, for example, learn to fold a shirt simply because the concept of folding appears in web text. The action distribution that the model can produce is constrained to the small set of behaviors represented in the RT-1 demonstration data: pick, place, push, open, close, and a few others on a single end effector.[1]

Inference latency is significant. The 55B variant requires a TPU pod and yields a 1 to 3 Hz control loop, which is sufficient for the slow tabletop manipulation tasks demonstrated but inadequate for dynamic or contact-rich behavior.[1] The 5B variant is faster but still relies on networked compute.

The physical platform is also a constraint. RT-2 was demonstrated on the Everyday Robots mobile manipulator, the same hardware used for RT-1. Alphabet shut down Everyday Robots as a separate project in February 2023 during company-wide cost cutting, with some staff and technology absorbed into Google Research.[10][11] The RT-2 paper appeared after this shutdown, and the demonstrations rely on the existing fleet rather than a continuing hardware program.

Finally, RT-2 is closed. Google DeepMind did not release weights, training code, fine-tuning data, or the specific evaluation suites used in the paper.[2] This complicates external reproduction and contributed to demand for open VLA systems that followed.

## Reception and impact

RT-2 received broad coverage in mainstream and trade press, including The New York Times, Wired, IEEE Spectrum, MIT Technology Review, and InfoQ, with most articles framing the model as a step toward generalist robot intelligence and as evidence that large multimodal models can be redirected from passive perception to physical action.[12] Within the research community, RT-2 was widely cited as the canonical proof of concept that an internet-pretrained VLM can serve as the backbone of a robot policy without sacrificing its semantic knowledge.

In the months and years that followed, the basic recipe of "start from a pretrained VLM, tokenize actions, co-fine-tune on robot data" became a common pattern. Google DeepMind and 21 collaborating institutions released the Open X-Embodiment dataset and the RT-X policies in October 2023, scaling RT-2-style training to data from 22 different robot embodiments.[7] The resulting RT-2-X model outperformed RT-2 by roughly 3x on emergent-skill evaluations and showed better spatial understanding, according to the project, demonstrating positive transfer across robot bodies.[7] AutoRT, also from Google DeepMind, used VLMs as task proposers for autonomous data collection. Stanford's [OpenVLA](/wiki/openvla), released in June 2024, provided a 7-billion-parameter open-source VLA built on the Llama 2 family and the Open X-Embodiment data; it was explicitly framed as an open counterpart to RT-2.[8] UC Berkeley's Octo, also from 2024, was another open generalist policy. Physical Intelligence, founded in 2024 by a group that included several RT-2 co-authors, released [pi0](/wiki/pi0), a 3.3-billion-parameter open VLA with a flow-matching action head.[9] NVIDIA's GR00T N1, released in 2025, applied a similar VLA approach to humanoid platforms.

RT-2 also influenced the broader narrative around [embodied AI](/wiki/embodied_ai). Where earlier embodied systems had relied on either narrow imitation learning or multi-stage planning pipelines built around frozen [large multimodal models](/wiki/large_multimodal_model), RT-2 made the case that the perception, reasoning, and action layers could share a single autoregressive model.[1]

## Related models

| Model | Developer | Released | Approximate parameters | Notes |
|---|---|---|---|---|
| RT-1 | Google Research, Everyday Robots | December 2022 | ~35M | Robot transformer trained from scratch on robot demonstrations only |
| PaLM-E | Google | March 2023 | up to 562B | Embodied multimodal language model, high-level planning |
| RT-2 | Google DeepMind | July 2023 | up to 55B | First widely cited VLA built on a pretrained VLM |
| RT-X / Open X-Embodiment | Google DeepMind and 21 partners | October 2023 | varies | Multi-embodiment dataset and policies |
| AutoRT | Google DeepMind | January 2024 | varies | VLM-driven autonomous data collection at scale |
| OpenVLA | Stanford, UC Berkeley, TRI, Google DeepMind | June 2024 | 7B | Open-source VLA on Open X-Embodiment data |
| Octo | UC Berkeley and collaborators | 2024 | up to 93M | Open generalist robot policy with diffusion action head |
| pi0 | Physical Intelligence | October 2024 | 3.3B | Open VLA with flow-matching action expert |
| GR00T N1 | NVIDIA | 2025 | varies | VLA aimed at humanoid robots |

For earlier work that informed RT-2, see also [PaLI](/wiki/pali) for the multilingual VLM lineage that produced PaLI-X, and [Gato](/wiki/gato) for DeepMind's earlier generalist agent.

## Notable contributors

RT-2 has 54 named authors, drawn primarily from Google DeepMind robotics teams in Mountain View.[1] Recurring contributors across the RT-1, PaLM-E, RT-2, RT-X, and AutoRT series include Anthony Brohan, Noah Brown, Yevgen Chebotar, Krzysztof Choromanski, Tianli Ding, Danny Driess, Chelsea Finn, Pete Florence, Karol Hausman, Brian Ichter, Alex Irpan, Dmitry Kalashnikov, Sergey Levine, Lisa Lee, Tsang-Wei Edward Lee, Yao Lu, Igor Mordatch, Karl Pertsch, Kanishka Rao, Pannag Sanketi, Pierre Sermanet, Vincent Vanhoucke, Quan Vuong, Fei Xia, Ted Xiao, Tianhe Yu, and Brianna Zitkovich.

Several of these researchers later moved to other VLA-focused roles. Karol Hausman, Brian Ichter, and Quan Vuong became co-founders of Physical Intelligence in 2024, joining Sergey Levine (Chief Scientist) and Chelsea Finn there.[9] Karl Pertsch went on to co-lead OpenVLA at Stanford and UC Berkeley.[8] Other contributors moved into adjacent VLA and robot-learning programs at NVIDIA, Tesla, Skild AI, Figure, and academic groups.

## How did RT-2 shape later VLA research?

RT-2 established the playbook that most subsequent generalist VLAs follow with variations:[1]

1. Start from a pretrained vision-language model rather than training from scratch.
2. Express actions as discrete tokens or as the output of a small action head attached to the backbone.
3. Co-fine-tune on a mixture of robot demonstrations and web data, or carefully balance new robot data against the original pretraining distribution to preserve semantic knowledge.
4. Evaluate not just in-distribution success rates but generalization to novel objects, instructions, and environments.

Later systems diverge from RT-2 in important ways. OpenVLA keeps the discrete-token action representation but uses an open-weights Llama 2 backbone, so the entire stack can be reproduced and fine-tuned by external groups.[8] Octo replaces the autoregressive action head with a diffusion model. pi0 adds a separate "action expert" that operates on continuous actions through flow matching, retaining a pretrained VLM as the backbone but moving away from token-level action prediction.[9] GR00T N1 applies a similar dual-system architecture to humanoid robots with high-frequency control. Helix from Figure AI likewise splits a slower VLM-based "System 2" from a faster "System 1" control policy. In each case the underlying claim, that internet-scale multimodal pretraining is the right starting point for a robot policy, traces back to the demonstration that RT-2 made.[1]

For a broader view of how the family relates, see [vision-language-action model](/wiki/vision_language_action_model), [robotics](/wiki/robotics), and [imitation learning](/wiki/imitation_learning).

## References

1. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., et al. (2023). "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818. https://arxiv.org/abs/2307.15818
2. Google DeepMind. (2023, July 28). "RT-2: New model translates vision and language into action." https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action/
3. RT-2 Project Website. https://robotics-transformer2.github.io/
4. Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., et al. (2022). "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv:2212.06817. https://arxiv.org/abs/2212.06817
5. Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., et al. (2023). "PaLM-E: An Embodied Multimodal Language Model." arXiv:2303.03378. https://arxiv.org/abs/2303.03378
6. Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., et al. (2023). "PaLI-X: On Scaling up a Multilingual Vision and Language Model." arXiv:2305.18565. https://arxiv.org/abs/2305.18565
7. Open X-Embodiment Collaboration. (2023). "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864. https://arxiv.org/abs/2310.08864
8. Kim, M. J., Pertsch, K., Karamcheti, S., Xiao, T., et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246. https://arxiv.org/abs/2406.09246
9. Black, K., Brown, N., Driess, D., Esmail, A., et al. (2024). "pi0: A Vision-Language-Action Flow Model for General Robot Control." Physical Intelligence. https://www.pi.website/blog/pi0
10. Heater, B. (2023, February 24). "Alphabet shutters Everyday Robots." The Robot Report. https://www.therobotreport.com/alphabet-closes-everyday-robots-among-layoffs/
11. Vincent, J. (2023, February 24). "Google parent Alphabet shuts down yet another robot project." The Verge. https://www.theverge.com/2023/2/24/23613214/everyday-robots-google-alphabet-shut-down
12. Edwards, B. (2023, July 28). "Google's RT-2 AI model brings us a step closer to WALL-E." Ars Technica. https://arstechnica.com/information-technology/2023/07/googles-rt-2-ai-model-brings-us-one-step-closer-to-wall-e/