SIMA (DeepMind)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,887 words
SIMA (Scalable Instructable Multiworld Agent) is a family of generalist embodied AI agents developed by Google DeepMind that follow free-form natural-language instructions to act in a wide range of three-dimensional virtual environments, including commercial video games and research worlds.[^1] The system observes the same on-screen pixels a human player would see and produces keyboard-and-mouse actions, so it can in principle plug into any 3D title that accepts standard inputs without access to game source code, internal state, or bespoke APIs.[^2] The first technical report, "Scaling Instructable Agents Across Many Simulated Worlds," appeared on arXiv in March 2024 and was revised in October 2024; it described an agent trained on nine commercial games and several research environments via partnerships with publishers.[^1][^3] A successor, SIMA 2, was announced on 13 November 2025 and detailed in a December 2025 arXiv paper; it replaces the bespoke SIMA 1 policy with a fine-tuned Gemini backbone, adds explicit reasoning and self-improvement loops, and was evaluated inside worlds generated by the Genie 3 world model.[^4][^5][^6]
| Aspect | SIMA 1 (March 2024) | SIMA 2 (November 2025) |
|---|---|---|
| Developer | Google DeepMind | Google DeepMind |
| First public release | 13 March 2024 (research preview) | 13 November 2025 (limited research preview) |
| Foundation backbone | SPARC image encoder, Phenaki video encoder, in-house transformer policy | Fine-tuned Gemini 2.5 Flash-Lite (VLA configuration) |
| Inputs | Screen pixels and free-form text | Screen pixels, free-form text, optional images |
| Outputs | Keyboard and mouse events at human-like rate | Keyboard and mouse events plus structured action tokens; can verbalize reasoning |
| Training games | Nine commercial titles plus research environments | Expanded set including Steamworld Build, Road 96, The Gunk; ASKA and MineDojo held out for evaluation |
| Skills catalogued | ~600 short language-conditioned skills | 600+ skills retained and extended with multi-step tasks |
| Headline benchmark | ~34% task success vs. ~60% for human players on No Man's Sky; full suite of roughly 1,500 evaluation tasks | ~65% task completion vs. ~31% for SIMA 1 and ~71% for human players on the main suite |
| Self-improvement | None reported | Gemini task-setter plus Gemini reward model on self-generated trajectories |
| Generative-world test | Not tested | Genie 3 photorealistic environments |
DeepMind has used games as a research substrate for more than a decade. Its 2013 Deep Q-Network (DQN) learned to play seven Atari arcade games from raw pixels using reinforcement learning, establishing the modern template of pixel-to-action policies trained end to end.[^7] Subsequent milestones, AlphaGo, AlphaZero, MuZero, and the StarCraft II agent AlphaStar, pushed model-based and multi-agent RL into competitive play, but each trained agent was specialized to a single game or a narrow family of games and worked with custom observation and action interfaces.[^8] SIMA explicitly departs from that lineage: rather than maximize win rate inside a closed environment, it is trained to follow arbitrary language instructions across many worlds using only the screen and the human input devices.[^1][^2] In MIT Technology Review's account of the SIMA 2 launch, DeepMind researcher Joe Marino positions games as "a driving force behind agent research for quite a while," while emphasizing that SIMA's open-ended instruction-following is a deliberate departure from the closed scoreboards of AlphaGo and AlphaStar.[^7]
The project also draws on DeepMind's parallel line of open-ended-learning research, including the XLand environments and the Adaptive Agent (AdA) work that demonstrated rapid in-context task adaptation in procedurally generated 3D worlds.[^9] Where AdA learned to adapt to new tasks in minutes inside a controlled simulator, SIMA aimed to scale the underlying behaviour across messy, commercial 3D engines that DeepMind did not build and that expose no special hooks.[^9][^2] The SIMA program can be read as fusing two DeepMind threads that had previously run in parallel: the games-and-RL line (DQN, AlphaStar, MuZero) and the open-ended language-instruction line (XLand, AdA), with the explicit goal of producing one model that operates across many independently developed 3D environments rather than a hand-tuned policy per game.[^7][^9]
Independent commentators framed the original SIMA release as DeepMind "getting closer to its game-playing dream." Jack Clark, writing in Import AI shortly after the March 2024 announcement, observed that DeepMind had "taken some of the results from these advances" in pretrained encoders and transformer policies and used them to build an instructable multiworld agent that, unlike earlier systems, "doesn't get access to a game's source code, nor bespoke APIs."[^13] That property, training only on what a human player can see and do, is the defining engineering constraint of both SIMA versions.[^2][^13]
The first SIMA technical report was uploaded to arXiv on 13 March 2024 (with a revision on 11 October 2024) by a team of more than 90 researchers writing collectively as "SIMA Team, Google DeepMind."[^3] Named authors include Maria Abi Raad, Arun Ahuja, Jane X. Wang, Danilo J. Rezende, Jeff Clune, Volodymyr Mnih, Demis Hassabis, and Shane Legg.[^3] An accompanying Google DeepMind blog post the same day announced collaborations with eight game studios.[^2] SIMA 2 followed on 13 November 2025, also accompanied by a DeepMind blog post and a 4 December 2025 arXiv preprint (2512.04797).[^4][^5][^6] Senior researcher Jane Wang told MIT Technology Review that the SIMA 2 preview was meant to show "what DeepMind has been working on and see what kinds of collaborations and potential uses are possible," signaling that the project remains primarily a research probe rather than a product.[^7]
SIMA 1 is designed around a deliberately minimal interface so that the same policy can plug into many games. It receives a stream of pixel observations from the game's render output and a single natural-language instruction string; it emits sequences of keyboard presses and mouse movements at a rate intended to match a human player.[^1][^2] DeepMind frames this as the "broadest possible" interface: any 3D game playable by a person can in principle be played by SIMA without modification to the game itself.[^2]
The action representation is the actual key codes and mouse deltas that a game expects, not abstract macros. SIMA's policy outputs short rollouts of eight actions at a time, conditioned on the current observation, language instruction, and a memory of past states.[^10]
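The interface described above can be pictured as a small data structure. The following is a minimal sketch only: the `Action` fields, the `key_events` helper, and the key names are hypothetical illustrations and are not taken from the SIMA report; only the chunk length of eight comes from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

CHUNK_LEN = 8  # the text states the policy emits eight actions at a time


@dataclass(frozen=True)
class Action:
    """One low-level step (hypothetical layout): held keys plus a relative mouse move."""
    keys: FrozenSet[str] = frozenset()
    mouse_dx: int = 0
    mouse_dy: int = 0


def key_events(chunk: List[Action]) -> List[Tuple[int, str, str]]:
    """Convert per-step held-key sets into (step, 'down' | 'up', key) transitions."""
    if len(chunk) != CHUNK_LEN:
        raise ValueError(f"expected {CHUNK_LEN} actions, got {len(chunk)}")
    events: List[Tuple[int, str, str]] = []
    held: FrozenSet[str] = frozenset()
    for step, action in enumerate(chunk):
        for key in sorted(action.keys - held):   # newly pressed keys
            events.append((step, "down", key))
        for key in sorted(held - action.keys):   # newly released keys
            events.append((step, "up", key))
        held = action.keys
    return events


# Walk forward for three steps, jump on the fourth, then release everything.
chunk = (
    [Action(keys=frozenset({"w"}))] * 3
    + [Action(keys=frozenset({"w", "space"}))]
    + [Action()] * 4
)
events = key_events(chunk)
```

The point of the sketch is that nothing in the structure is game-specific: the same chunk format would apply to any title that reads a keyboard and mouse.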
SIMA 1 is built around two pretrained perception models plus a transformer-based policy. The image encoder is SPARC (Sparse Fine-grained Contrastive Alignment), a contrastive image-text pretraining method from DeepMind that aligns image patches with caption tokens at a fine-grained level.[^11] The video encoder is Phenaki, a causal video transformer (C-ViViT) originally developed for variable-length text-to-video generation, repurposed here as a temporal frame encoder.[^10][^11] These pretrained backbones are fine-tuned during SIMA training rather than frozen.[^10]
The agent's policy is a multimodal transformer combined with a Transformer-XL recurrence over past memory states, which builds a representation conditioned on the current visual stream, the language instruction, and recent history.[^10] To strengthen the language conditioning, the SIMA team uses classifier-free guidance at action time, computing the policy output once with the instruction and once with a null instruction and combining them with a guidance scale, an idea borrowed from text-to-image diffusion models.[^10] An auxiliary head predicts goal achievement to supplement the primary behavioral-cloning loss.[^10]
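Classifier-free guidance over action logits can be sketched in a few lines. This is an illustration under assumptions, not the SIMA implementation: it uses the convention common in diffusion models, where the guided logits are `uncond + scale * (cond - uncond)`, and the three-action logit vectors are made up for the example. The exact formulation and guidance scale used by the SIMA team are not given here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Combine two policy passes: scale=1 returns cond, scale>1 amplifies the
    difference that the instruction makes relative to the null instruction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Hypothetical logits for three candidate actions.
with_instruction = [2.0, 1.0, 0.0]   # policy pass conditioned on the instruction
null_instruction = [1.0, 1.0, 1.0]   # policy pass with an empty instruction

guided = guided_logits(with_instruction, null_instruction, scale=2.0)
p_plain = softmax(with_instruction)
p_guided = softmax(guided)
```

With these numbers the guided distribution puts more mass on the action the instruction favours than the plain conditioned pass does, which is the intended effect of the technique.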
Rather than rely on a single environment, SIMA 1 was trained on a curated portfolio of nine commercial games and several research environments via formal partnerships with their developers. The DeepMind blog enumerates the commercial titles: No Man's Sky (Hello Games), Teardown (Tuxedo Labs and Saber Interactive), Valheim, Satisfactory and Goat Simulator 3 (Coffee Stain), Hydroneer (Foulball Hangover), Space Engineers (Keen Software House), Wobbly Life (RubberbandGames), and Eco (Strange Loop Games).[^2][^12] The custom research environments included the Unity-built Construction Lab.[^2]
Training combined two main data sources: recordings of solo human gameplay annotated with after-the-fact text instructions, and two-player sessions in which one person played while another typed instructions that the player executed.[^10] DeepMind reports that the resulting corpus covers more than 600 short language-conditioned skills, each typically completable in under ten seconds, spanning navigation ("turn left", "climb the ladder"), object interaction, menu use, resource gathering, vehicle and spacecraft operation, crafting, and basic communicative actions.[^2][^13]
The agent is trained primarily by imitation learning (behavioral cloning) against the human demonstrations, with the auxiliary goal-prediction objective; SIMA 1 is not trained with reinforcement learning in its main configuration.[^10]
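A behavioral-cloning objective with an auxiliary goal-prediction term might look like the sketch below. The discrete action space, the binary "goal reached" label, and the `aux_weight` value are assumptions chosen for illustration; the source does not specify the loss weighting or the exact form of the auxiliary head.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target_index: int) -> float:
    """Negative log-likelihood of the demonstrated action under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def binary_cross_entropy(prob: float, label: int) -> float:
    """Loss for the auxiliary head predicting whether the goal was achieved."""
    eps = 1e-12
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def bc_loss(
    action_logits: List[float],
    demo_action: int,
    goal_prob: float,
    goal_reached: int,
    aux_weight: float = 0.1,  # hypothetical weighting, not from the report
) -> float:
    """Imitation term plus a weighted auxiliary goal-prediction term."""
    imitation = cross_entropy(action_logits, demo_action)
    auxiliary = binary_cross_entropy(goal_prob, goal_reached)
    return imitation + aux_weight * auxiliary


# An untrained policy (uniform logits, 0.5 goal probability) versus a better one.
untrained = bc_loss([0.0, 0.0, 0.0, 0.0], demo_action=1, goal_prob=0.5, goal_reached=1)
improved = bc_loss([0.0, 3.0, 0.0, 0.0], demo_action=1, goal_prob=0.9, goal_reached=1)
```

No environment reward appears anywhere in this objective, which is the sense in which the main configuration is imitation learning rather than reinforcement learning.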
DeepMind evaluated SIMA on roughly 1,500 unique in-game tasks, drawn from across the training and held-out games, with human judges scoring task completion from rendered video alongside automated checks where available.[^14] The headline comparison was between a single multi-environment generalist policy and a set of specialist baselines each trained and evaluated inside a single game.[^14]
The reported results show that, averaged across games, the generalist SIMA outperformed nine separate specialist agents that had been trained only on their respective games, with a quoted improvement of roughly 67% in success rate over those specialists.[^14][^13] Absolute numbers were modest: on the No Man's Sky benchmark, for example, human players completed about 60% of tasks while SIMA reached roughly 34%, still well above non-language baselines.[^15][^13] DeepMind also tested zero-shot transfer to games held out from training and reported "quite decent" but lower performance, indicating that multi-game pretraining transfers but does not yet substitute for direct experience.[^10] Ablation studies showed that removing language inputs sharply degraded behavior, classifier-free guidance gave a smaller but consistent gain, and the auxiliary goal-prediction objective improved final scores.[^10]
The evaluation also probed transfer across the portfolio. Agents trained jointly on eight of the nine games and evaluated on the held-out title performed nearly as well as agents specifically trained on that ninth game, suggesting that the model was learning skills that transferred between environments rather than memorizing game-specific routines.[^2] DeepMind emphasized that the comparison required a careful evaluation protocol: tasks were short enough that a competent human player could finish them in under ten seconds, judges scored task completion against the original instruction, and the same agent had to handle a long list of distinct skills inside the same game session without specialized fine-tuning.[^14][^13] The conclusion drawn in the original blog post was that "a model trained on many games is better than a model that has only learned one," a recognizable echo of the multi-task transfer story for large language models applied here to embodied 3D action.[^2]
SIMA 2 was unveiled in a Google DeepMind blog post on 13 November 2025 and described in detail in the arXiv preprint "SIMA 2: A Generalist Embodied Agent for Virtual Worlds" (2512.04797), submitted on 4 December 2025.[^4][^5][^6] The follow-up keeps the same pixels-in, keyboard-and-mouse-out interface but rebuilds the underlying policy on top of a Gemini backbone.
SIMA 2 is implemented as a vision-language-action model obtained by fine-tuning Gemini 2.5 Flash-Lite, a relatively small multimodal Gemini variant chosen for its low latency.[^16][^17] The model ingests rendered video frames plus a textual instruction (and optionally an accompanying image) and emits structured text tokens that are decoded into keyboard-and-mouse events, retaining the broad-applicability interface of SIMA 1.[^16][^17] DeepMind reports that fine-tuning uses a mixture of human demonstration videos labelled with language captions and additional labels generated by Gemini itself, intended to preserve the base model's general vision, dialogue, and reasoning capabilities while teaching it to act.[^5][^17]
Because the action distribution is produced by a large language model rather than a small policy head, SIMA 2 can also produce a chain-of-thought over the instruction and its own plan before acting, which DeepMind describes as the agent "thinking out loud" about its current goal.[^5][^17] At inference time SIMA 2 can therefore narrate intermediate decisions, hold short dialogues with the user about the task, and accept follow-up corrections; SIMA 1 had no such capability.[^5]
A central novelty of SIMA 2 is open-ended self-improvement. After an initial fine-tuning phase on human demonstrations, the team places the agent in a new environment and uses two auxiliary Gemini models in a closed loop: a task-setter Gemini generates novel language instructions appropriate to the current observation, and a separate Gemini-based reward model scores the agent's rollouts against rubrics calibrated to human preferences.[^17][^18] Trajectories that exceed a quality threshold are added to a self-generated experience bank that is mixed into subsequent training, allowing the agent to improve on tasks it previously failed without any new human demonstrations.[^17][^18]
This procedure casts the system as something close to recursive self-improvement at the level of in-game behaviour: the agent generates its own curriculum and grades its own attempts, using LLM judgment in place of hand-crafted reward functions.[^17] DeepMind also notes that this allows SIMA 2 to bootstrap competence in entirely new games where ground-truth success signals or APIs are unavailable.[^5]
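The closed loop described above (a task setter, an agent rollout, a reward model, and a thresholded experience bank) can be sketched as follows. Everything here is a toy stand-in: the callables `propose_task`, `run_agent`, and `score_rollout`, the 0.8 threshold, and the canned scores are hypothetical and only show the control flow, not DeepMind's pipeline or any Gemini API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    instruction: str
    actions: List[str]
    score: float = 0.0


def self_improvement_round(
    observation: str,
    propose_task: Callable[[str], str],              # stand-in for the task-setter model
    run_agent: Callable[[str, str], List[str]],      # stand-in for the acting agent
    score_rollout: Callable[[Trajectory], float],    # stand-in for the reward model
    threshold: float,
    attempts: int,
    bank: List[Trajectory],
) -> int:
    """Run `attempts` self-set tasks and keep rollouts that score above `threshold`."""
    kept = 0
    for _ in range(attempts):
        instruction = propose_task(observation)
        trajectory = Trajectory(instruction, run_agent(observation, instruction))
        trajectory.score = score_rollout(trajectory)
        if trajectory.score > threshold:
            bank.append(trajectory)  # later mixed into training data
            kept += 1
    return kept


# Toy stand-ins so the sketch runs without any model.
_scores = iter([0.2, 0.9, 0.6, 0.95])
experience_bank: List[Trajectory] = []
kept = self_improvement_round(
    observation="a forest clearing",
    propose_task=lambda obs: f"collect wood in {obs}",
    run_agent=lambda obs, instr: ["w", "w", "mouse_left"],
    score_rollout=lambda traj: next(_scores),
    threshold=0.8,
    attempts=4,
    bank=experience_bank,
)
```

In this toy run two of the four rollouts clear the threshold and enter the bank. Note that the loop never consults the game for a success signal: the only judgment comes from `score_rollout`, which is why the quality of the reward model matters so much.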
SIMA 2 was trained on an expanded portfolio that retains the SIMA 1 titles and adds further commercial games such as Steamworld Build, Road 96, and The Gunk.[^18] DeepMind reports held-out evaluations on the survival game ASKA and on the MineDojo Minecraft research environment, and qualitative tests inside worlds generated on the fly by Genie 3, DeepMind's photorealistic interactive world model.[^5][^17] The Genie tests are positioned as a deliberate stress test of generalization: the agent receives a world it has never seen, generated by another model from a text or image prompt, and is asked to navigate and act within it.[^5][^18]
Across DeepMind's main evaluation suite, SIMA 2 achieves roughly 65% task completion compared with about 31% for SIMA 1 and about 71% for human players, roughly doubling the success rate of the original model and closing most of the gap to humans on the training games.[^16][^19] On the held-out ASKA environment, SIMA 2 is reported to outperform SIMA 1 by more than ten percentage points and to substantially outperform raw Gemini Flash-Lite (around 3% success) and Gemini Pro (around 7%) prompted directly to act.[^6][^19] On MineDojo, SIMA 2 completes tasks in 26 of 50 categories versus 2 for SIMA 1, a much steeper improvement on a held-out domain.[^6]
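The "roughly doubling" and "closing most of the gap" statements follow from simple arithmetic on the approximate headline figures. The short calculation below uses only the rounded percentages quoted above (31, 65, and 71), so the derived values are correspondingly approximate; the helper names are illustrative.

```python
SIMA1, SIMA2, HUMAN = 31.0, 65.0, 71.0  # approximate task-completion percentages


def relative_gain(new: float, old: float) -> float:
    """How many times higher the new success rate is than the old one."""
    return new / old


def gap_closed(new: float, old: float, reference: float) -> float:
    """Fraction of the old-to-reference gap that the new system has closed."""
    return (new - old) / (reference - old)


ratio = relative_gain(SIMA2, SIMA1)       # about 2.1x
closed = gap_closed(SIMA2, SIMA1, HUMAN)  # 34 / 40 = 0.85
remaining = HUMAN - SIMA2                 # 6 percentage points
```

On these figures SIMA 2 closes about 85% of the distance between SIMA 1 and human players on the main suite, with about six percentage points remaining.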
DeepMind characterizes SIMA 2 as having "substantially closed" the gap to human performance on training games while remaining clearly below humans on long-horizon and combat-heavy tasks; the team explicitly notes that fine-grained mouse aim and very long, multi-step quests are still hard.[^5][^6] The ASKA and MineDojo numbers are particularly emphasized in DeepMind's framing because both environments were entirely held out from training: ASKA is a 2024 Viking-themed cooperative survival game, and MineDojo wraps Minecraft as a research environment, neither of which appeared in the SIMA 2 training mix.[^19][^6] The fact that raw prompted Gemini Flash-Lite and Gemini Pro score in low single digits underscores how much of SIMA 2's gain comes from embodied fine-tuning and self-improvement rather than from the language model's prior knowledge alone.[^6]
DeepMind also reports qualitative behaviour that did not exist in SIMA 1. Because the policy is a fine-tuned Gemini model, SIMA 2 can interpret instructions that combine language and reference images, hold short conversations about strategy with the user, and reason explicitly about what to do before executing actions; in the launch material, the team contrasts this with SIMA 1, which had to be "told what to do step by step."[^5][^17] The DeepMind paper claims that the same agent can therefore act as a coherent companion across many titles, an explicit attempt to integrate the AI agent paradigm of LLM-based assistants with the embodied perception-action paradigm of reinforcement learning agents.[^5]
SIMA's training portfolio depends on direct partnerships with game studios that license their titles to DeepMind for research use. The first technical report lists eight collaborating studios, Coffee Stain, Foulball Hangover, Hello Games, Keen Software House, RubberbandGames, Strange Loop Games, Tuxedo Labs, and Saber Interactive, behind the nine commercial games in the SIMA 1 corpus.[^2][^12] DeepMind has emphasized that this collaboration model is intentional: working with publishers gives the team consent to record gameplay and run automated agents inside the games, and the studios receive insight into how an AI agent interacts with their environments.[^2]
Both SIMA 1 and SIMA 2 have been released only as research previews. DeepMind has shared SIMA 2 with selected academic groups and game developers under a limited program, with no broadly available API or downloadable model as of May 2026.[^5][^16]
SIMA sits at the intersection of three research lines that were historically separate: classical RL game agents, LLM-based AI agents, and vision-language-action models for robotics. By using only pixels and key codes, it provides a substrate for studying instruction-following in environments that, unlike sandboxed RL benchmarks, were built by other people to entertain humans and that contain rich, partially observable, often unfair worlds.[^1][^2]
The shift from SIMA 1 to SIMA 2 is also significant as a methodological statement. SIMA 1 used separately pretrained perceptual encoders (SPARC, Phenaki) feeding a custom transformer policy trained by imitation learning, a recognizably "RL-adjacent" stack.[^10][^11] SIMA 2 instead treats a general-purpose multimodal LLM as the agent, fine-tunes it as a VLA, and uses other LLM instances as task generators and reward models for self-improvement.[^17][^18] This positions SIMA inside the same paradigm as RT-2, PaLM-E, and Gemini Robotics in the robotics community: a foundation model reshaped into an AI agent for control rather than a bespoke policy.[^5][^6]
For DeepMind specifically, SIMA represents a public reorientation of the games line of research toward generalists. Where AlphaStar and AlphaGo demonstrated that focused RL could match or exceed top human players inside a single game, SIMA 2 is meant to demonstrate that one model can plausibly act inside many games, including games it has never seen, by combining LLM priors with embodied data.[^4][^7][^8] Multiple commentators have positioned SIMA 2 as a precursor to general embodied AI agents and to potential bridges between game agents and real-world robotic control.[^16][^19]
SIMA also intersects the rapidly developing space of generative interactive world models. By running SIMA 2 inside environments generated by Genie 3, DeepMind effectively pairs a generalist agent with a generalist environment generator: Genie produces a 3D world from a text or image prompt, and SIMA 2 plays it.[^5][^18] If the loop holds up, it suggests a future in which embodied agents are trained against an unbounded stream of model-generated worlds rather than a fixed portfolio of commercial games, sidestepping the licensing bottleneck that constrained SIMA 1.[^16][^17] DeepMind's own framing describes Genie as a potential "endless virtual training dojo" for agents like SIMA, with the explicit ambition of compounding the two systems' generality.[^7]
A further significance is methodological: SIMA 2's self-improvement loop is one of the more visible deployments of LLMs-as-judges in an embodied setting. The reward model is itself a Gemini instance evaluating gameplay videos against a rubric, an approach related to wider research on using language models as preference proxies for reinforcement learning.[^17][^18] If this loop generalizes, it would allow embodied agents to be trained in environments without crafted reward functions or APIs, a property that has historically made commercial games and real robots costly to use as RL substrates.[^5][^17]
SIMA shares a problem statement, "follow language instructions in an open-ended 3D world", with several adjacent systems, though each approaches it differently.
| System | Origin | Environment | Backbone | Action interface | Key distinction from SIMA |
|---|---|---|---|---|---|
| AlphaStar | DeepMind, 2019 | StarCraft II | Custom transformer + LSTM RL agent | Game API actions | Single game; trained by self-play RL to maximize win rate |
| DQN / Atari | DeepMind, 2013 | Atari 2600 suite | Convolutional Q-network | Discrete joystick actions | Single class of games; learns from environment reward, not language instructions |
| PaLM-E | Google, 2023 | Real-world robotics, simulated tabletop | PaLM (LLM) plus image tokens | High-level action tokens | Embodied LLM for robots; not for video games |
| RT-2 | Google DeepMind, 2023 | Real-world robot manipulation | PaLI-X / PaLM-E VLM | Discretized robot actions | A robotic VLA; same paradigm SIMA 2 imports for games |
| Voyager | NVIDIA / academic, 2023 | Minecraft | GPT-4 as outer agent | Mineflayer skill code | Uses LLM to write code; SIMA acts directly via keyboard/mouse |
| AdA | DeepMind, 2023 | XLand procedurally generated 3D worlds | Transformer policy with memory | Game-engine actions | In-context adaptation in one custom simulator, not commercial games |
| Gemini Robotics | Google DeepMind, 2025 | Real robots | Gemini VLA | Continuous robot actions | Same VLA philosophy applied to physical robots |
Within DeepMind's own portfolio, SIMA's closest cousin is AdA, which uses a Transformer-XL policy with memory to adapt to novel tasks in XLand, a custom procedurally generated 3D simulator.[^9] SIMA shares the transformer-with-memory design and the multi-task focus but trades AdA's controlled simulator for the messier surface of off-the-shelf commercial games, and trades AdA's in-context RL for behavioural cloning plus, in SIMA 2, LLM-driven self-improvement.[^10][^17]
Compared to NVIDIA's Voyager Minecraft agent, which uses chain-of-thought prompting of GPT-4 to write JavaScript skills that drive Minecraft via the Mineflayer API, SIMA never writes code and never uses a game-specific API; it presses keys.[^19] This makes SIMA more portable across games but gives it weaker symbolic planning than Voyager's code-writing approach.[^19] SIMA 2 partly closes that gap by exposing chain-of-thought reasoning through its Gemini backbone, but it still emits raw keyboard and mouse events rather than executable plans.[^5]
In the robotics world, PaLM-E and RT-2 are the obvious analogues: both fine-tune a large vision-language model into a control policy that outputs action tokens. SIMA 2 imports this paradigm into video games, replacing robot end-effector actions with key codes and mouse deltas, and adds an LLM-driven self-improvement loop that is closer in spirit to AlphaZero's self-play than to typical robotic data flywheels.[^5][^6] Gemini Robotics is the contemporaneous DeepMind effort that applies the same Gemini-as-VLA idea to physical robots; SIMA 2 and Gemini Robotics together represent one bet on how generalist agents will be built across both virtual and real embodiments.[^16]
Outside DeepMind, earlier projects such as OpenAI Universe, which aimed to expose many games and software environments through a uniform keyboard, mouse, and screen interface, are conceptual ancestors of SIMA. Universe established the idea that the human input interface is itself a general API for software, but the deep-learning techniques available at the time were not sufficient to learn cross-environment behaviour. SIMA can be read as the realization of that idea with modern multimodal transformers and large-scale demonstration data, and with the addition of an LLM-driven self-improvement loop in the SIMA 2 generation.[^2][^17]
It is also worth distinguishing SIMA from the modern computer-use line of work, which uses VLMs to drive desktop applications via screenshots and keyboard or mouse events. The interface conventions are similar, but computer-use agents target productivity software (browsers, spreadsheets, mail clients), while SIMA is specialized to 3D real-time games with continuous control. SIMA's success in action-rich, real-time settings provides indirect evidence about what is feasible for screen-based agents more broadly, even though direct transfer between the two regimes has not been demonstrated publicly.[^5][^16]
DeepMind and external commentators have flagged several limitations of both SIMA versions.
Short horizons. Analyst Jack Clark noted at launch that essentially every SIMA 1 skill takes less than ten seconds to complete, which makes the system a strong short-horizon instruction follower but a weak planner; the original technical report largely confirms this scope.[^13][^10] SIMA 2 extends the horizon by routing decisions through a Gemini reasoner, but the team still reports that "very long-horizon, complex tasks requiring extensive multi-step reasoning" remain difficult.[^6]
Sub-human performance on training games. Even with a Gemini backbone, SIMA 2 averages roughly 65% task completion on training games versus 71% for humans on the same suite, and trails much further on long quests and combat.[^16][^19] On held-out games like MineDojo, SIMA 2 reaches only about 13% versus around 1% for SIMA 1, an improvement of more than an order of magnitude but still far from human levels.[^4][^19]
No real-world embodiment. Both SIMA 1 and SIMA 2 act only inside rendered virtual environments. DeepMind has been careful not to claim that SIMA's policies transfer to physical robots; for that, Google instead points to Gemini Robotics and earlier RT-2 work.[^5][^16] The case for SIMA as a step toward general embodied agents therefore rests on the shared backbone (Gemini) and on the hope that LLM-driven self-improvement loops will transfer across embodiments, rather than on direct evidence that the policies do.[^16]
Partnership dependence. Because SIMA must run inside commercial games that DeepMind does not own, every training environment requires a publisher partnership. This shapes which titles are studied (mostly survival, crafting, and sandbox games with permissive studios) and means independent replication is hard without similar licensing.[^2][^12]
Closed access. Neither SIMA model has been released publicly; both are limited research previews. As of May 2026 there is no SIMA API, downloadable weights file, or evaluation harness open to outside researchers, which makes independent benchmarking difficult.[^5][^16]
Visual fragility and fine control. The SIMA 2 paper explicitly lists robust visual understanding of complex 3D scenes and precise low-level keyboard and mouse control among open problems, particularly fine-grained mouse aim in combat and dense menu navigation in deep game UIs.[^6]
Reward-model honesty. Self-improvement in SIMA 2 depends on a Gemini reward model to score trajectories, which inherits whatever miscalibrations and biases that reward model carries. DeepMind reports calibration against human preferences but acknowledges this as an open area, particularly for tasks where a confident-looking but wrong trajectory might be over-rewarded.[^6][^17]