Gato (DeepMind)
Last reviewed
Jun 3, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,412 words
Gato is a single generalist agent built by DeepMind and described in the May 2022 paper "A Generalist Agent." It is one neural network, with one fixed set of weights, that can play Atari games, caption images, hold a chat conversation, and stack blocks with a real robot arm, among other things. Rather than train a separate model for each problem, the researchers serialized every kind of input and output into a common stream of tokens and trained one transformer over all of it. The work became a flashpoint in the debate over whether simply scaling up such models is a route to artificial general intelligence.
By 2022 the recipe behind large language models was well established: take a transformer, feed it a vast corpus serialized as tokens, and train it to predict the next token. Models such as GPT-3 had shown that this single objective produced systems that could translate, summarize, write code, and answer questions without task-specific architectures. DeepMind's question was whether the same recipe could reach beyond text into perception and physical control, so that a single network could act across many environments and bodies rather than excel at one [1][2].
This framing distinguished Gato from DeepMind's earlier landmark systems. AlphaGo and AlphaZero were narrow specialists, superb at board games but unable to do anything else, and AlphaZero had to be retrained from scratch for each new game. Gato was instead meant to be multi-modal, multi-task, and multi-embodiment at once, retaining many skills inside one model [1][3].
Gato is a single decoder-only transformer sequence model. The central design idea is that anything which can be flattened into a sequence of tokens can be modeled the same way, so the network does not need hand-built components for vision, language, or control [1].
The paper specifies how each data type is converted into tokens before training:
| Data type | Tokenization |
|---|---|
| Text | SentencePiece subwords, vocabulary of 32,000, mapped to integers in the range [0, 32000) |
| Images | Split into non-overlapping 16x16 patches in raster order; pixel values normalized to [-1, 1] and divided by the square root of the patch size (4) |
| Discrete values (e.g. Atari button presses) | Flattened in row-major order into integers in the range [0, 1024) |
| Continuous values (e.g. joint torques, proprioception) | Mu-law encoded, then discretized into 1024 uniform bins shifted to [32000, 33024) |
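The discrete and continuous rows of the table can be illustrated with a short sketch. This is a minimal illustration, not DeepMind's code: the function names are hypothetical, and the mu-law constants `MU = 100` and `M = 256` are assumed values that should be checked against the paper. Only the bin count (1024) and the offset (32000) come from the table above.

```python
import math

TEXT_VOCAB_SIZE = 32_000  # text tokens occupy [0, 32000), per the table
NUM_BINS = 1_024          # uniform bins for continuous values, per the table
MU = 100.0                # assumed mu-law parameter (not stated in this article)
M = 256.0                 # assumed mu-law parameter (not stated in this article)


def mu_law_encode(x: float, mu: float = MU, m: float = M) -> float:
    """Compress x logarithmically; inputs with |x| <= m land in [-1, 1]."""
    magnitude = math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0)
    return math.copysign(magnitude, x)


def tokenize_continuous(values):
    """Mu-law encode, clip to [-1, 1], bin into 1024 bins, shift past text IDs."""
    tokens = []
    for x in values:
        y = max(-1.0, min(1.0, mu_law_encode(x)))
        # Map [-1, 1] onto bin indices 0..1023; the top edge folds into the last bin.
        bin_index = min(NUM_BINS - 1, int((y + 1.0) / 2.0 * NUM_BINS))
        tokens.append(TEXT_VOCAB_SIZE + bin_index)
    return tokens


def tokenize_discrete(rows):
    """Flatten a 2-D list of integers in row-major order, checking the range."""
    flat = [value for row in rows for value in row]
    for value in flat:
        if not 0 <= value < NUM_BINS:
            raise ValueError(f"discrete value {value} outside [0, {NUM_BINS})")
    return flat


continuous_tokens = tokenize_continuous([-256.0, 0.0, 0.5, 256.0])
discrete_tokens = tokenize_discrete([[1, 2], [3, 4]])
```

Under these assumptions every continuous token falls in [32000, 33024), so it cannot collide with a text token, while discrete tokens stay in [0, 1024).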
Once everything is a flat sequence of integers, training is ordinary autoregressive next-token prediction. At run time the same model reads its context and decides what kind of token to emit next, whether that is a word, a button press, or a continuous control value. As DeepMind put it, the same network with the same weights decides "based on its context whether to output text, joint torques, button presses, or other tokens" [1]. The training loss is masked so that it applies only to text and action tokens, not to observation tokens such as image patches, so the model learns to produce behavior rather than to predict what it will observe next.
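A masked next-token objective of this kind can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name, the toy vocabulary of four tokens, and the rule that the mask is 1 at text and action positions and 0 at observation positions are all illustrative choices.

```python
import math


def masked_next_token_loss(logits, targets, loss_mask):
    """Mean cross-entropy over the positions where loss_mask is 1.

    logits:    one list of unnormalized scores per position (length = vocab size)
    targets:   the correct next-token ID at each position
    loss_mask: 1 to train on a position, 0 to skip it (assumed convention)
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, loss_mask):
        if not keep:
            continue  # skipped positions contribute nothing to the loss
        peak = max(scores)  # subtract the max for numerical stability
        log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_normalizer - scores[target]  # negative log-likelihood
        count += 1
    return total / count if count else 0.0


# Toy sequence: two observation positions (masked out), then two action positions.
logits = [
    [9.0, -3.0, 0.5, 2.0],  # observation position, ignored
    [0.1, 7.0, -2.0, 0.0],  # observation position, ignored
    [0.0, 0.0, 0.0, 0.0],   # action position, uniform prediction
    [0.0, 0.0, 0.0, 0.0],   # action position, uniform prediction
]
targets = [0, 1, 2, 3]
loss_mask = [0, 0, 1, 1]
loss = masked_next_token_loss(logits, targets, loss_mask)
```

Because the two trained positions predict uniformly over four tokens, the loss equals the natural logarithm of 4 regardless of what the masked positions contain.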
Gato was trained on 604 distinct tasks spanning very different modalities, observation formats, and action specifications [1][4][5]. The training mix combined control environments with vision and language data. Reported sources include Atari games from the Arcade Learning Environment, DeepMind Lab, the DM Control Suite, Meta-World, BabyAI, the Procgen benchmark, and a real and simulated robotics setup known as RGB Stacking. Vision and language came from datasets such as MS-COCO captions, Conceptual Captions, VQAv2, and OKVQA [1].
On the control side, the paper reports that Gato exceeded 50 percent of expert performance on more than 450 of the 604 tasks [4][5]. The headline demonstrations were deliberately varied to show breadth rather than peak skill: playing many Atari titles, captioning photographs, engaging in dialogue, and physically manipulating objects. With a real robot arm the system completed block-stacking only about 60 percent of the time, a figure that DeepMind reported plainly and that critics later cited [2].
The intended capability list, drawn directly from the paper's abstract, is to "play Atari, caption images, chat, stack blocks with a real robot arm and much more" [1]. The achievement was not that Gato was best at any one of these, but that one model with shared parameters could do all of them.
Gato is small by the standards of contemporary language models. The largest version has about 1.18 billion parameters, often rounded to 1.2 billion, and the paper also trained 79-million and 364-million parameter variants to study how performance scaled with size [4][6]. For comparison, GPT-3 had 175 billion parameters, making Gato roughly two orders of magnitude smaller [2][7].
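The size gap can be made concrete with a quick calculation. This is a rough sketch: the Gato figures come from the paragraph above, while the GPT-3 figure uses its commonly cited size of 175 billion parameters, which should be treated as an assumption here. The variable names are illustrative.

```python
import math

GATO_VARIANTS = {          # parameter counts reported for the three Gato sizes
    "small": 79e6,
    "medium": 364e6,
    "large": 1.18e9,
}
GPT3_PARAMS = 175e9        # assumption: commonly cited GPT-3 parameter count

# How many times larger GPT-3 is than the largest Gato.
size_ratio = GPT3_PARAMS / GATO_VARIANTS["large"]

# The same gap expressed in orders of magnitude (powers of ten).
orders_of_magnitude = math.log10(size_ratio)

# Span of the scaling study: largest Gato relative to the smallest.
study_span = GATO_VARIANTS["large"] / GATO_VARIANTS["small"]

print(f"GPT-3 / Gato: {size_ratio:.0f}x ({orders_of_magnitude:.2f} orders of magnitude)")
print(f"Largest / smallest Gato: {study_span:.1f}x")
```

On these figures GPT-3 is roughly 150 times larger than the largest Gato, a little over two orders of magnitude, while the scaling study itself spans only about a factor of 15.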
The small size was a deliberate constraint rather than a limitation of method. The team wrote that they focused training "at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters" [4]. Controlling a physical arm requires the model to emit actions fast enough to close the control loop, and a much larger network would have been too slow to run live. The paper also reported that in-distribution performance improved predictably as model size and the number of training tokens grew, consistent with the kind of power-law scaling already observed in language models [4][5]. That scaling evidence is what fueled the argument that a bigger Gato would be a more capable Gato.
Gato drew heavy press attention, much of it framed around artificial general intelligence. The lightning rod was a tweet by Nando de Freitas, a lead researcher on the project, who wrote: "My opinion: It's all about scale now! The Game is Over!" He went on to argue that the remaining work was about making models bigger, safer, more compute-efficient, faster at sampling, equipped with better memory, and trained on more modalities and innovative data, and he dismissed the need for new symbolic machinery, claiming "big nets" could already create and manipulate symbols [8][9].
The claim was contested immediately, including by some of de Freitas's own co-authors, who were more cautious about predicting timelines [10]. The sharpest external critic was Gary Marcus, who attacked what he labeled "Scaling Uber Alles," arguing that systems like Gato remain brittle and prone to moments of complete incomprehension, and that scale alone would not deliver robust reasoning [9][10]. Writing in MIT Technology Review, Will Douglas Heaven argued that the AGI hype obscured what was genuinely interesting about Gato. In his account, the notable result was that one model could learn many tasks without catastrophically forgetting earlier ones, a real advance over systems that had to be wiped and retrained for each new job [10]. Other commentators noted that Gato resembled a capable multi-task model more than a system showing emergent general intelligence, and that it could not easily handle tasks far outside its training distribution [2][7].
Gato's lasting contribution was less a benchmark result than a demonstration of unification. It showed that a single transformer, trained with one objective on tokenized data, could span text, images, and embodied control without bespoke architectures for each. That "tokenize everything" stance, alongside DeepMind's near-contemporaneous vision-language model Flamingo, prefigured the broadly multimodal direction the field took afterward, including the multimodal Gemini family developed by Google DeepMind [3][10]. The accompanying debate also crystallized a divide that would define the next several years of AI discourse: whether scale alone is the path to general intelligence, or whether brittleness and the limits of next-token prediction demand something more. Gato did not settle that question, but it gave both camps a concrete artifact to argue over.