Gato (DeepMind)

Updated Jun 27, 2026

Gato is a generalist AI agent built by DeepMind and described in the May 2022 paper "A Generalist Agent" (arXiv:2205.06175). It is one neural network of about 1.2 billion parameters, with one fixed set of weights, that was trained on 604 distinct tasks and can play Atari games, caption images, hold a chat conversation, and stack blocks with a real robot arm ^[1]^[4]. Rather than train a separate model for each problem, the researchers serialized every kind of input and output into a common stream of tokens and trained one transformer over all of it. The work became a flashpoint in the debate over whether simply scaling up such models is a route to artificial general intelligence.

What is Gato?

Gato is a multimodal, multi-task, multi-embodiment agent: a single decoder-only transformer that performs more than 600 different tasks across text, images, and physical control using one set of weights ^[1]^[3]. DeepMind described it as a "generalist agent" because the same network handles vision, language, and robotics without a separate architecture for each. According to the paper's abstract, "the same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens" ^[1]^[4]. The achievement was not that Gato was best at any one of these tasks, but that one model with shared parameters could do all of them.

Background

By 2022 the recipe behind large language models was well established: take a transformer, feed it a vast corpus serialized as tokens, and train it to predict the next token. Models such as GPT-3 had shown that this single objective produced systems that could translate, summarize, write code, and answer questions without task-specific architectures. DeepMind's question was whether the same recipe could reach beyond text into perception and physical control, so that a single network could act across many environments and bodies rather than excel at one ^[1]^[2].

This framing distinguished Gato from DeepMind's earlier landmark systems. AlphaGo and AlphaZero were narrow specialists, superb at board games but unable to do anything else, and AlphaZero had to be retrained from scratch for each new game. Gato was instead meant to be multi-modal, multi-task, and multi-embodiment at once, retaining many skills inside one model ^[1]^[3].

How does Gato work?

Gato is a single decoder-only transformer sequence model with a context window of 1,024 tokens ^[1]^[5]. The central design idea is that anything which can be flattened into a sequence of tokens can be modeled the same way, so the network does not need hand-built components for vision, language, or control ^[1].

The paper specifies how each data type is converted into tokens before training:

Data type	Tokenization
Text	SentencePiece subwords, vocabulary of 32,000, mapped to integers in the range [0, 32000)
Images	Split into non-overlapping 16x16 patches in raster order; pixel values normalized to [-1, 1] and divided by the square root of the patch size (√16 = 4)
Discrete values (e.g. Atari button presses)	Flattened in row-major order into integers in the range [0, 1024)
Continuous values (e.g. joint torques, proprioception)	Mu-law encoded, then discretized into 1024 uniform bins shifted to [32000, 33024)
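The discrete and continuous rows of the table, plus the raster-order patch split, can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not DeepMind's code: the mu-law constants `mu=100` and `m=256` are assumed values that should be checked against the paper, and the function names, the clipping step, and the choice to fold the value 1.0 into the last bin are all hypothetical.

```python
import math

TEXT_VOCAB = 32_000   # text tokens occupy [0, 32000)
NUM_BINS = 1024       # continuous values are discretized into 1024 bins


def mu_law(x, mu=100.0, m=256.0):
    """Compress x toward [-1, 1]. mu and m are assumed constants."""
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))  # clip anything that still falls outside


def tokenize_continuous(values):
    """Map floats to integer tokens in [32000, 33024)."""
    tokens = []
    for v in values:
        y = mu_law(v)
        # Uniform bins over [-1, 1]; y == 1.0 is folded into the last bin.
        bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
        tokens.append(TEXT_VOCAB + bin_index)
    return tokens


def tokenize_discrete(grid):
    """Flatten a 2D grid of integers in row-major order."""
    flat = [v for row in grid for v in row]
    if any(v < 0 or v >= NUM_BINS for v in flat):
        raise ValueError("discrete values must lie in [0, 1024)")
    return flat


def patch_origins(height, width, patch=16):
    """Top-left corners of non-overlapping patches, listed in raster order."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]


if __name__ == "__main__":
    print(tokenize_continuous([-0.5, 0.0, 0.5]))
    print(tokenize_discrete([[3, 0], [1, 2]]))
    print(len(patch_origins(64, 80)))
```

Shifting the continuous bins by 32,000 keeps them from colliding with the text vocabulary, which is why the two ranges in the table sit end to end.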

Once everything is a flat sequence of tokens, training is ordinary autoregressive next-token prediction. At run time the same model reads its context and decides what kind of token to emit next, whether that is a word, a button press, or a continuous control value ^[1]. The training loss is masked so that it applies only to text and action tokens, not to observation tokens such as image patches, so the model learns to produce behavior rather than to predict its own inputs. This "tokenize everything" approach is what lets one set of weights span perception, language, and embodied action.
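The masking can be pictured as a next-token loss that counts only some positions. The sketch below is a simplified stand-in, not the paper's training code: it assumes per-token log-probabilities have already been computed by some model, uses hypothetical kind labels (`"text"`, `"action"`, `"observation"`), and averages over the counted tokens, which is a choice made here for readability.

```python
import math


def masked_nll(token_log_probs, token_kinds, trained_kinds=("text", "action")):
    """Mean negative log-likelihood over tokens whose kind is trained on.

    token_log_probs: log p(token | previous tokens), one float per position.
    token_kinds: a label per position; other kinds are masked out of the loss.
    """
    total = 0.0
    count = 0
    for log_prob, kind in zip(token_log_probs, token_kinds):
        if kind in trained_kinds:   # mask value 1: this token contributes
            total -= log_prob
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # Hypothetical three-token episode: one observation, one action, one word.
    log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
    kinds = ["observation", "action", "text"]
    print(masked_nll(log_probs, kinds))
```

Observation tokens still sit in the context and influence every later prediction; they simply are not prediction targets.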

What tasks can Gato do?

Gato was trained on 604 distinct tasks spanning very different modalities, observation formats, and action specifications ^[1]^[4]^[5]. The training mix combined control environments with vision and language data. Control sources include Atari games from the Arcade Learning Environment, DeepMind Lab, the DM Control Suite, Meta-World, BabyAI, the Procgen benchmark, and a real and simulated robotics setup known as RGB Stacking. Vision and language data came from datasets such as MS-COCO captions, Conceptual Captions, VQAv2, and OKVQA ^[1].

On the control side, the paper reports that "Gato performs over 450 out of 604 tasks at over a 50% expert score threshold" ^[1]^[4]^[5]. The headline demonstrations were deliberately varied to show breadth rather than peak skill: playing many Atari titles, captioning photographs, engaging in dialogue, and physically manipulating objects. With a real robot arm the system successfully completed block-stacking only about 60 percent of the time, a figure DeepMind reported plainly and which critics later cited ^[2].
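A threshold such as "50% expert score" implies that raw returns are rescaled per task. One common convention, assumed here rather than taken from the paper's code, puts a random policy at 0 and the task expert at 100. The task names and raw scores below are hypothetical and are not results from the paper.

```python
def expert_normalized(score, random_score, expert_score):
    """Percent of expert performance: 0 = random policy, 100 = expert."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


def count_at_threshold(results, threshold=50.0):
    """Count tasks whose normalized score meets or exceeds the threshold."""
    return sum(
        1
        for score, random_score, expert_score in results.values()
        if expert_normalized(score, random_score, expert_score) >= threshold
    )


# Hypothetical (agent, random, expert) raw scores; not results from the paper.
example = {
    "task_a": (80.0, 0.0, 100.0),    # 80% of expert
    "task_b": (30.0, 20.0, 120.0),   # 10% of expert
    "task_c": (-5.0, -20.0, 0.0),    # 75% of expert
}

if __name__ == "__main__":
    print(count_at_threshold(example))
```

Normalizing per task is what makes a single threshold meaningful across environments whose raw reward scales differ by orders of magnitude.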

How big is Gato (parameters and scaling)?

Gato is small by the standards of contemporary language models. The largest version has about 1.18 billion parameters, often rounded to 1.2 billion, and the team also trained two smaller variants to study how performance scaled with size ^[1]^[4]^[6]. As the paper states, "we evaluated 3 different model sizes (measured in parameter count): a 79M model, a 364M model, and a 1.18B model (Gato)" ^[1]. For comparison, GPT-3 has 175 billion parameters, roughly 150 times as many as Gato ^[2]^[7].

Model variant	Parameters
Smallest	79 million
Middle	364 million
Gato (largest)	1.18 billion (~1.2B)

The small size was a deliberate constraint rather than a limitation of method. The team wrote that they focused training "at the operating point of model scale that allows real-time control of real-world robots, currently around 1.2B parameters" ^[4]. Controlling a physical arm requires the model to emit actions fast enough to close the control loop, and a much larger network would have been too slow to run live. The paper also reported that in-distribution performance improved predictably as model size and the number of training tokens grew, consistent with the kind of power-law scaling already observed in language models ^[4]^[5]. That scaling evidence is what fueled the argument that a bigger Gato would be a more capable Gato.
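To make "power-law scaling" concrete: a power law of the form loss ≈ a · N^(−b) becomes a straight line on log-log axes, so its exponent can be recovered with ordinary least squares on the logarithms. The sketch below uses the three parameter counts reported in the paper, but the loss values are synthetic, generated from a made-up law purely to show the fit. They are not measurements from Gato, and the function name is hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by least squares in log-log space.

    Returns (a, b). Requires at least two distinct sizes.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


if __name__ == "__main__":
    sizes = [79e6, 364e6, 1.18e9]                  # model sizes from the paper
    synthetic = [5.0 * s ** -0.1 for s in sizes]   # made-up law, NOT Gato data
    a, b = fit_power_law(sizes, synthetic)
    print(round(a, 3), round(b, 3))
```

With only three model sizes, any such fit is loosely constrained, which is one reason extrapolating to much larger models remained contested.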

Why was Gato significant, and what was the AGI debate?

Gato drew heavy press attention, much of it framed around artificial general intelligence. The lightning rod was a tweet by Nando de Freitas, a lead researcher on the project, who wrote on May 14, 2022: "My opinion: It's all about scale now! The Game is Over!" ^[8]^[9]. He went on to argue that the remaining work was about making models bigger, safer, more compute-efficient, faster at sampling, equipped with better memory, and trained on more modalities and innovative data, and he dismissed the need for new symbolic machinery, claiming "big nets" could already create and manipulate symbols ^[8]^[9].

The claim was contested immediately, including by researchers who were more cautious about predicting timelines ^[10]. The sharpest external critic was Gary Marcus, who attacked what he labeled "Scaling Uber Alles," arguing that systems like Gato remain brittle and prone to moments of complete incomprehension, and that scale alone would not deliver robust reasoning ^[9]^[10]. Writing in MIT Technology Review, Will Douglas Heaven argued that the AGI hype obscured what was genuinely interesting about Gato. In his account, the notable result was that one model could learn many tasks without catastrophically forgetting earlier ones, a real advance over systems that had to be wiped and retrained for each new job ^[10]. Other commentators noted that Gato resembled a capable multi-task model more than a system showing emergent general intelligence, and that it could not easily handle tasks far outside its training distribution ^[2]^[7].

What is Gato's legacy?

Gato's lasting contribution was less a benchmark result than a demonstration of unification. It showed that a single transformer, trained with one objective on tokenized data, could span text, images, and embodied control without bespoke architectures for each ^[1]. That "tokenize everything" stance prefigured the broadly multimodal direction the field took afterward, including the multimodal Gemini family developed by Google DeepMind; DeepMind's vision-language model Flamingo, published a few weeks earlier in April 2022, reflected the same trend ^[3]^[10]. The accompanying debate also crystallized a divide that would define the next several years of AI discourse: whether scale alone is the path to general intelligence, or whether brittleness and the limits of next-token prediction demand something more. Gato did not settle that question, but it gave both camps a concrete artifact to argue over.

References

1. Reed, Scott, et al. "A Generalist Agent." arXiv:2205.06175 (May 12, 2022). https://arxiv.org/abs/2205.06175
2. Wiggers, Kyle. "DeepMind's new AI can perform over 600 tasks, from playing games to controlling robots." TechCrunch (May 13, 2022). https://techcrunch.com/2022/05/13/deepminds-new-ai-can-perform-over-600-tasks-from-playing-games-to-controlling-robots/
3. "DeepMind Introduces Gato: A Generalist, Multi-Modal, Multi-Task, Multi-Embodiment Agent." Synced (May 18, 2022). https://syncedreview.com/2022/05/18/deepmind-introduces-gato-a-generalist-multi-modal-multi-task-multi-embodiment-agent/
4. "A Generalist Agent." Google DeepMind blog (May 12, 2022). https://deepmind.google/blog/a-generalist-agent/
5. Reed, Scott, et al. "A Generalist Agent" (full text). ar5iv / arXiv. https://ar5iv.labs.arxiv.org/html/2205.06175
6. "GATO - A New Generalist Artificial Intelligence Agent." Analytics Vidhya (June 2022). https://www.analyticsvidhya.com/blog/2022/06/gato-a-new-generalist-artificial-intelligence-agent/
7. "Deepmind: Is 'Gato' a precursor for general artificial intelligence?" The Decoder (May 2022). https://the-decoder.com/deepmind-is-gato-a-precursor-for-general-artificial-intelligence/
8. de Freitas, Nando. Post on X/Twitter (May 14, 2022). https://x.com/NandoDF/status/1525397036325019649
9. Greene, Tristan. "DeepMind researcher claims new AI could lead to AGI, says 'game is over'." TNW / The Next Web (May 16, 2022). https://thenextweb.com/news/deepmind-researcher-claims-new-gato-ai-could-lead-to-agi-says-game-is-over
10. Heaven, Will Douglas. "The hype around DeepMind's new AI model misses what's actually cool about it." MIT Technology Review (May 23, 2022). https://www.technologyreview.com/2022/05/23/1052627/deepmind-gato-ai-model-hype/
