# RoboCat

> Source: https://aiwiki.ai/wiki/robocat
> Updated: 2026-06-03
> Categories: Embodied AI, Google DeepMind, Robotics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

RoboCat is a self-improving foundation agent for robotic manipulation developed by [Google DeepMind](/wiki/google_deepmind). First described in a paper and blog post released on 20 June 2023, it is a single neural network that can operate several different real and simulated robot arms, learn new manipulation tasks from as few as 100 demonstrations, and then generate its own training data to improve further [1][2]. RoboCat was presented as a research system rather than a product. DeepMind described it as the first agent to solve and adapt to multiple tasks across different real robots while improving autonomously [2].

The work was later published in the journal Transactions on Machine Learning Research in December 2023 [3]. The title differs between sources: the arXiv preprint is titled "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation," while DeepMind's blog and its official publications listing use "Self-Improving Foundation Agent" [1][2][3]. Both refer to the same system.

## Background (Gato lineage)

RoboCat is built on [Gato](/wiki/gato), DeepMind's multimodal "generalist" model introduced in 2022. Gato treats language, images, and actions as a single stream of tokens, letting one transformer handle tasks as varied as captioning images, playing Atari, and controlling a robot arm. The name is a small joke: gato is Spanish for "cat," and RoboCat extends the lineage [2][4].

What RoboCat inherits from Gato is the idea that a sequence model can absorb experience from many embodiments at once without needing a shared, hand-engineered action or observation format. Architecturally, the paper describes RoboCat as a visual goal-conditioned decision transformer that consumes action-labelled visual experience [1][3]. Because a transformer can take in and emit variable-length sequences depending on context, the model can handle arms with different cameras, joint counts, and grippers natively. Where Gato was a broad proof of concept spanning hundreds of unrelated tasks, RoboCat narrows the focus to manipulation and adds the piece Gato lacked: a loop for generating its own data and improving over time [1][2].

## How RoboCat works

The model takes images of a scene together with a goal image showing the desired outcome, and outputs the low-level actions needed to drive a robot arm toward that goal. It was trained on a large and diverse dataset of manipulation behaviours, ranging from coarse pushing and stacking to precise, dexterous tasks, collected across both simulated and physical robots [1][2].
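
To make this input/output contract concrete, the sketch below builds goal-conditioned training examples from one recorded trajectory by using its final frame as the goal image. Labelling goals this way is an assumption made for illustration, and the names `Trajectory`, `Example`, and `goal_conditioned_examples` are hypothetical; none of it is taken from DeepMind's code.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical stand-ins: a "frame" is any image-like nested sequence, and an
# action is a list of floats whose length depends on the robot.
Frame = Sequence[Sequence[float]]
Action = List[float]


@dataclass
class Trajectory:
    frames: List[Frame]    # frames[t] is the image seen before actions[t], plus one final frame
    actions: List[Action]  # len(actions) == len(frames) - 1


@dataclass
class Example:
    observation: Frame
    goal: Frame
    action: Action


def goal_conditioned_examples(traj: Trajectory) -> List[Example]:
    """Pair every (observation, action) step with the trajectory's last frame as the goal."""
    if len(traj.frames) != len(traj.actions) + 1:
        raise ValueError("expected exactly one more frame than actions")
    goal = traj.frames[-1]
    return [
        Example(observation=obs, goal=goal, action=act)
        for obs, act in zip(traj.frames[:-1], traj.actions)
    ]
```

A policy trained on such examples learns a mapping from (current image, goal image) to the next action, which is the interface described above.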

Because the training data spans arms with different degrees of freedom and different observation setups, RoboCat does not require a common action representation. The transformer reads each robot's inputs and produces appropriately shaped outputs based on the context it is given. This is the mechanism that lets a single set of weights drive multiple, structurally different machines [1][3].
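
One way to see why no shared action format is needed: if each continuous action dimension becomes one token, arms with different numbers of controllable inputs simply produce sequences of different lengths. The sketch below assumes uniform binning into 1,024 bins over the range [-1, 1]. The bin count, the value range, and the function names are illustrative choices, not details reported for RoboCat.

```python
from typing import List

NUM_BINS = 1024        # assumed vocabulary size for action tokens
LOW, HIGH = -1.0, 1.0  # assumed normalised action range


def action_to_tokens(action: List[float]) -> List[int]:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)  # position within the range, in [0, 1]
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def tokens_to_action(tokens: List[int]) -> List[float]:
    """Approximately invert the binning by returning the centre of each bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (token + 0.5) * width for token in tokens]
```

With this scheme a 4-dimensional action yields 4 tokens and an 8-dimensional action yields 8, so a sequence model can emit either without a fixed-size output layer per robot.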

DeepMind reported that the agent shows cross-task transfer: as the training data grew larger and more varied, RoboCat not only performed better on tasks it had seen, but also became more efficient at picking up entirely new ones. In other words, breadth of prior experience translated into faster adaptation [1][2].

## The self-improvement cycle

The most distinctive part of RoboCat is its self-improvement loop. According to DeepMind, the agent learns each new task by following roughly five steps [2]:

| Step | What happens |
| --- | --- |
| 1 | Humans tele-operate a robot arm to collect 100 to 1,000 demonstrations of a new task |
| 2 | RoboCat is fine-tuned on this data, producing a specialised "spin-off" agent for the task |
| 3 | The spin-off practises the task autonomously, about 10,000 times on average, generating fresh data |
| 4 | The human demonstrations and the self-generated data are folded back into RoboCat's training set |
| 5 | A new version of RoboCat is trained on the larger, more diverse dataset |

Each pass through this cycle expands the dataset and, in principle, makes the next round of adaptation easier. DeepMind characterised the trained model's ability to generate data for subsequent training iterations as a basic building block for an autonomous improvement loop [1][2]. The final agent described in the paper was trained on millions of trajectories drawn from real and simulated arms, including data the system had produced for itself [2].
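
The five steps can be sketched as a single function. Everything here is a placeholder: `train` and `fine_tune` are stubs that only do bookkeeping, and `Agent`, `Dataset`, and `self_improvement_round` are hypothetical names. The point is the data flow of the cycle, not how DeepMind implemented it.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Episode = str  # placeholder for one recorded trajectory


@dataclass
class Agent:
    version: int
    trained_on: int  # number of episodes in the training set it was trained on


@dataclass
class Dataset:
    episodes: List[Episode] = field(default_factory=list)


def train(dataset: Dataset, version: int) -> Agent:
    """Stub standing in for training the generalist on the full dataset."""
    return Agent(version=version, trained_on=len(dataset.episodes))


def fine_tune(agent: Agent, demos: List[Episode]) -> Callable[[int], List[Episode]]:
    """Stub: returns a 'spin-off' that produces n practice episodes on request.

    The demos are ignored here; a real spin-off would be fitted to them.
    """
    def practise(n: int) -> List[Episode]:
        return [f"self-v{agent.version}-{i}" for i in range(n)]
    return practise


def self_improvement_round(
    agent: Agent,
    dataset: Dataset,
    demos: List[Episode],            # step 1: collected by human tele-operation
    practice_episodes: int = 10_000,
) -> Agent:
    spin_off = fine_tune(agent, demos)                # step 2: specialise on the new task
    self_generated = spin_off(practice_episodes)      # step 3: autonomous practice
    dataset.episodes.extend(demos)                    # step 4: fold in human demos
    dataset.episodes.extend(self_generated)           # step 4: fold in self-generated data
    return train(dataset, version=agent.version + 1)  # step 5: retrain the generalist
```

Calling `self_improvement_round` repeatedly with demonstrations for different tasks grows `dataset` on every pass, which is the sense in which each round is meant to make the next one easier.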

## Multi-embodiment

A central claim of the work is that one model can control several physically different robots. RoboCat was trained and evaluated across four different types of robots and a range of robotic arms, in both simulation and the real world [2]. The paper reports evaluation on three distinct real robot platforms alongside simulation [3].

The clearest demonstration of this flexibility involved grippers. RoboCat had been trained mostly on arms with simple two-pronged grippers. DeepMind showed it could adapt to a more complex arm fitted with a three-fingered gripper that had roughly twice as many controllable inputs. After observing 1,000 human-controlled demonstrations, collected in just a few hours, the adapted agent could successfully pick up gears with the unfamiliar three-fingered arm about 86% of the time [2]. The paper also reports that the agent can generalise to new tasks and robots zero-shot in some cases, and otherwise adapt using only 100 to 1,000 examples for the target task [1][3].

## Results

The final RoboCat was trained on 253 tasks and benchmarked on a held-out set of 141 variations of those tasks, spanning both simulation and physical hardware [2]. The headline efficiency result is the demonstration count: where many robot-learning systems need tens of thousands of trials, RoboCat could begin handling a new task from as few as 100 demonstrations, and could adapt to previously unseen tasks with 500 to 1,000 demonstrations gathered in a matter of hours [1][2].

The self-improvement loop produced a measurable gain. DeepMind reported that an early version of RoboCat succeeded on previously unseen tasks just 36% of the time after learning from 500 demonstrations per task. A later version, trained on a greater diversity of tasks and self-generated data, more than doubled that figure to 74% on the same tasks [2].
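
As a quick check of the "more than doubled" claim, the snippet below works out the absolute and relative change between the two reported success rates. The variable names are illustrative.

```python
early, later = 0.36, 0.74  # success rates reported in the text above

absolute_gain = later - early    # difference in success rate
relative_factor = later / early  # how many times higher the later rate is

print(f"absolute gain: {absolute_gain * 100:.0f} percentage points")  # 38
print(f"relative factor: {relative_factor:.2f}x")                     # 2.06x
```

The gain is 38 percentage points, and the later rate is about 2.06 times the early one, which is just over double.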

| Metric | Reported value |
| --- | --- |
| Announcement | 20 June 2023 (TMLR publication December 2023) |
| Tasks trained on | 253 |
| Task variations benchmarked | 141 |
| Demonstrations to learn a new task | as few as 100 (typically 100 to 1,000) |
| Average autonomous practice per spin-off | about 10,000 episodes |
| Success on unseen tasks (500 demos per task), early vs later model | 36% vs 74% |
| Gears pick-up, three-fingered arm, after 1,000 demos | about 86% |

Individual task results varied widely. Press coverage of the paper noted success rates ranging from roughly 13% on the hardest tasks to as high as 99% on the easiest with 1,000 demonstrations, with lower performance when fewer demonstrations were available [4].

## Significance

RoboCat sat at the intersection of two ideas that were gaining momentum in 2023: foundation models that learn general-purpose representations from large, varied datasets, and robot learning that aims to reduce the punishing data requirements of training physical machines. Its contribution was to show that a single agent could span multiple real embodiments and bootstrap its own data, turning a few hours of human demonstration into a self-sustaining improvement process [1][2].

DeepMind framed the data efficiency as the practical payoff: faster adaptation could make it more feasible to train robots for many real-world jobs without collecting enormous bespoke datasets for each one. The team also stated a longer-term goal of cutting the number of demonstrations needed to fewer than ten [4]. The line of research continued in DeepMind's later robotics work, including the [Gemini Robotics](/wiki/gemini_robotics) models that bring vision-language-action capabilities to physical robots.

## References

1. Bousmalis, K. et al. "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation." arXiv preprint, 20 June 2023. [https://arxiv.org/abs/2306.11706](https://arxiv.org/abs/2306.11706)
2. Google DeepMind. "RoboCat: A self-improving robotic agent." DeepMind blog, 20 June 2023. [https://deepmind.google/blog/robocat-a-self-improving-robotic-agent/](https://deepmind.google/blog/robocat-a-self-improving-robotic-agent/)
3. Google DeepMind. "RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation." Publication listing, Transactions on Machine Learning Research, December 2023. [https://deepmind.google/research/publications/35829/](https://deepmind.google/research/publications/35829/)
4. Wiggers, K. "DeepMind's RoboCat learns to perform a range of robotics tasks." TechCrunch, 21 June 2023. [https://techcrunch.com/2023/06/21/deepminds-robocat-learns-to-perform-a-range-of-robotics-tasks/](https://techcrunch.com/2023/06/21/deepminds-robocat-learns-to-perform-a-range-of-robotics-tasks/)