Genie 2
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,298 words
Genie 2 is a foundation world model developed by Google DeepMind, unveiled on December 4, 2024. It generates action-controllable, interactive three-dimensional environments from a single image prompt, with worlds that can be navigated in real time by a human player or an AI agent using keyboard and mouse inputs.[^1][^2] DeepMind positioned the system as a successor to its February 2024 Genie model, which generated only two-dimensional platformer environments, and framed Genie 2 as a "large-scale foundation world model" intended primarily to address the bottleneck of training data diversity for embodied AI agents.[^1][^3]
Genie 2 sustained coherent simulated worlds for roughly 10 to 20 seconds in most reported demonstrations, with DeepMind claiming consistency of "up to a minute" in some cases.[^4][^5] Built around an autoregressive latent diffusion architecture trained on large-scale video data, the system demonstrated emergent capabilities including object interactions, physics simulation, character animation, lighting and reflection effects, modeling of non-player character behavior, and long-horizon memory of off-screen scene elements.[^1][^2][^3] DeepMind also demonstrated integration with its SIMA (Scalable Instructable Multiworld Agent) system, in which the SIMA agent followed natural-language instructions inside Genie 2 worlds it had never seen before.[^1][^4]
The model was never released to the public. It existed only as a research preview, with access granted selectively to internal researchers and a limited group of external collaborators.[^1][^6][^7] In August 2025, DeepMind announced its successor, Genie 3, which extended the world consistency window to several minutes, raised resolution from 360p to 720p, and added "promptable world events" that allow mid-session text-driven modifications.[^8][^9]
The Genie lineage began with a research paper titled "Genie: Generative Interactive Environments," posted to arXiv on February 23, 2024, by a Google DeepMind team whose authors included Jake Bruce, Michael Dennis, Ashley Edwards, Jack Parker-Holder, and Yuge Shi, with Tim Rocktäschel as senior author.[^10][^11] Genie was the first system the lab explicitly described as a "foundation world model," a phrase intended to evoke the analogy with foundation language models such as the generative pre-trained transformer family.[^10][^12]
Genie 1 had 11 billion parameters and was trained on a curated dataset of approximately 30,000 hours of internet gameplay footage drawn from hundreds of 2D platformer games.[^10][^11] Its architecture consisted of three principal components built around spatiotemporal transformers: a video tokenizer, a latent action model, and an autoregressive dynamics model.[^10] A critical contribution was unsupervised latent action learning. Without any human-labeled action data, Genie 1 recovered a small discrete codebook of "latent actions" purely by analyzing pixel-level transitions between video frames; users selecting a latent action could then control the model's generation frame by frame.[^10][^11]
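The idea of a discrete latent-action codebook can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the codebook entries are fixed by hand rather than learned, frames are reduced to an (x, y) avatar position, and the function names are hypothetical. It shows only the lookup step, in which each frame-to-frame change is assigned to its nearest codebook entry without any action labels.

```python
# Toy sketch of discrete latent actions (hypothetical names and sizes throughout).
# A real latent action model learns both the encoder and the codebook; here the
# codebook is fixed by hand so the lookup step is easy to follow.

CODEBOOK = {
    0: (0.0, 0.0),    # "no-op"-like transition
    1: (1.0, 0.0),    # rightward motion
    2: (-1.0, 0.0),   # leftward motion
    3: (0.0, 1.0),    # upward motion (for example a jump)
}


def transition_embedding(prev_pos, next_pos):
    """Stand-in for a learned encoder: summarise a frame-to-frame change.

    Frames are reduced to an (x, y) avatar position for illustration only.
    """
    return (next_pos[0] - prev_pos[0], next_pos[1] - prev_pos[1])


def quantize(embedding):
    """Return the id of the nearest codebook entry (squared Euclidean distance)."""
    def sq_dist(code):
        cx, cy = CODEBOOK[code]
        return (embedding[0] - cx) ** 2 + (embedding[1] - cy) ** 2
    return min(CODEBOOK, key=sq_dist)


def infer_latent_actions(positions):
    """Label every consecutive pair of frames with a discrete latent action."""
    return [
        quantize(transition_embedding(a, b))
        for a, b in zip(positions, positions[1:])
    ]


# An unlabeled "video": the avatar walks right twice, jumps, then stands still.
video = [(0, 0), (0.9, 0.1), (2.0, 0.0), (2.1, 1.2), (2.1, 1.2)]
latent_actions = infer_latent_actions(video)
print(latent_actions)  # [1, 1, 3, 0]
```

At play time the direction is reversed: a user picks one of the codebook ids, and the dynamics model generates the next frame conditioned on it.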
Despite its parameter count, Genie 1 was severely limited in practical terms. It generated frames at roughly one frame per second, far too slow for interactive play, and its outputs were locked to 2D side-scrolling environments at 256x256 pixel resolution.[^10][^11] Its main contribution was conceptual: it demonstrated that an interactive world model could be trained entirely from passive, unlabeled video, a claim that had been treated skeptically before publication.[^10] The Genie 1 paper later won a Best Paper Award at the International Conference on Machine Learning (ICML) 2024.[^13]
In the months between Genie 1 and Genie 2, DeepMind's research priorities shifted toward generative video models that could function as simulators rather than as pure content generators. The lab hired Tim Brooks, a co-lead of OpenAI's Sora project, in October 2024 to work on video generation and "world simulators," reflecting an industry-wide convergence around the world-model framing.[^4][^14]
A world model in artificial intelligence is a learned system that builds an internal representation of an environment and predicts how that environment changes over time in response to actions.[^15] The framing was popularized by David Ha and Jürgen Schmidhuber's 2018 "World Models" paper, which trained agents inside neural simulators rather than in fixed game engines.[^15] DeepMind's specific phrasing, "foundation world model," is meant to denote a single trained system whose generative breadth and visual generality are sufficient to serve as an environmental substrate for many downstream agents and tasks, mirroring how a foundation model like a large language model can underpin many text applications.[^1][^12]
Researcher Tim Rocktäschel, who led the DeepMind Open-Endedness Team during Genie 1 and continued to support world-model work through Genie 2, articulated this framing on Bluesky shortly after the Genie 2 announcement: "When we started Genie 1 over two years ago, we always imagined a foundation world model will one day be able to generate an endless curriculum for training embodied AGI. Today, we made a big step towards that future."[^4][^16]
The training-curriculum framing is central to the technical motivation. DeepMind's blog post on Genie 2 stated that "training more general embodied agents has been traditionally bottlenecked by the availability of sufficiently rich and diverse training environments," and that Genie 2 was intended to provide "an endless curriculum of novel worlds" for such agents.[^1][^17] The argument is that hand-authored simulation environments are expensive and narrow, while real-world data collection is slow and dangerous, so a generative model that can produce diverse playable worlds on demand could remove the substrate bottleneck even before agents themselves improve.[^1][^15]
Google DeepMind unveiled Genie 2 on December 4, 2024, through a blog post on its corporate website titled "Genie 2: A large-scale foundation world model."[^1] The launch coincided with a broader push by the company into generative simulators that would continue through the following year, including the formation in early January 2025 of a dedicated world-simulator team led by Tim Brooks within DeepMind.[^14]
The announcement consisted of the blog post, a set of pre-recorded demo videos, and a list of more than thirty technical contributors. The project was led by Jack Parker-Holder, with Stephen Spencer serving as technical lead. Key contributors named in the blog post included Philip Ball, Jake Bruce, Vibhavari Dasagi, Kristian Holsheimer, Christos Kaplanis, Alexandre Moufarek, Guy Scully, Jeremy Shar, Jimmy Shi, and Jessica Yung, with additional contributions from Michael Dennis, Sultan Kenjeyev, and Shangbang Long. DeepMind also credited concept artist Max Cant for environment design.[^18] Jack Parker-Holder had previously been a co-author on the Genie 1 paper, providing direct continuity between the projects.[^10][^18]
Genie 2 was not accompanied by a peer-reviewed technical paper at announcement. DeepMind released only the blog post and curated video material, declining to publish detailed architectural specifications, parameter counts, training data sources, or compute usage. This made independent verification of many of Genie 2's claims difficult and contributed to a contemporaneous critique that the lab was leaning heavily on cherry-picked demos rather than systematic evaluation.[^4][^19]
DeepMind described Genie 2 as "an autoregressive latent diffusion model" trained on a large video dataset.[^1] The architecture combines two distinct mechanisms in a single pipeline. An autoencoder compresses raw video frames into a latent representation, reducing computational cost relative to working at full pixel resolution. A large transformer dynamics model then operates over these latent frames, trained with a causal mask similar to those used in generative pre-trained transformer language models, so that each predicted latent frame depends only on previous frames and the action token at that step.[^1][^3][^20]
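The causal masking described above can be sketched in a few lines. This is a generic illustration of the technique, not DeepMind's attention layout, which has not been published; frames are reduced to integer indices and the helper names are hypothetical.

```python
def causal_mask(num_frames):
    """mask[t][s] is True when position t may attend to position s (s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def visible_frames(mask, t):
    """Indices of the frames that position t is allowed to condition on."""
    return [s for s, allowed in enumerate(mask[t]) if allowed]


mask = causal_mask(4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Each row is lower-triangular: the prediction made from position t can draw on frames 0 through t but never on later ones, which is what allows the same model to be run one frame at a time during interactive use.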
At inference time, Genie 2 generates frames sequentially. The system takes the latent representation of the most recent frame, the user's most recent action (a keyboard or mouse input), and the history of previous latents, and predicts the next latent frame. That latent is then decoded back into a pixel frame and displayed to the user.[^1][^3] The next action is then read from the player or agent, and the loop continues. DeepMind employed classifier-free guidance, a technique borrowed from text-to-image diffusion models, to strengthen the conditioning on user actions and improve action controllability during generation.[^1][^3]
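A minimal sketch of this loop, with classifier-free guidance applied at each step, might look as follows. It is heavily simplified and rests on stated assumptions: latents are single floats, the multi-step diffusion sampler is collapsed into one `denoise` call, and every function name and the guidance scale of 1.5 are hypothetical. Only the guidance formula itself, `uncond + scale * (cond - uncond)`, is the standard technique.

```python
def denoise(history, action):
    """Hypothetical stand-in for the dynamics model's prediction.

    `action=None` means the action conditioning has been dropped, which is
    how the unconditional branch of classifier-free guidance is obtained.
    """
    drift = 0.0 if action is None else float(action)
    return history[-1] + drift


def guided_prediction(history, action, guidance_scale):
    """Classifier-free guidance: move the output away from the unconditional
    prediction and towards the action-conditioned one."""
    uncond = denoise(history, None)
    cond = denoise(history, action)
    return uncond + guidance_scale * (cond - uncond)


def decode(latent):
    """Stand-in for the autoencoder's decoder (latent -> displayable frame)."""
    return f"frame@{latent:.1f}"


def rollout(first_latent, actions, guidance_scale=1.5):
    """Generate one latent per action, feeding each result back as history."""
    history = [first_latent]
    frames = []
    for action in actions:
        next_latent = guided_prediction(history, action, guidance_scale)
        history.append(next_latent)
        frames.append(decode(next_latent))
    return history, frames


history, frames = rollout(0.0, actions=[1, 1, -1, 0])
print(history)  # [0.0, 1.5, 3.0, 1.5, 1.5]
print(frames)   # ['frame@1.5', 'frame@3.0', 'frame@1.5', 'frame@1.5']
```

With `guidance_scale=1.0` the guided prediction reduces to the plain conditional one; values above 1.0 exaggerate the effect of the action, which is the sense in which guidance strengthens action conditioning.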
This pairing of components places Genie 2 in a particular niche relative to other generative video systems. Pure text-to-video systems like Sora or Veo 2 produce passive clips that play back as filmed by a virtual camera; they have no notion of user agency. Genie 2's autoregressive structure resembles the next-token loop of a language model applied to video frames, while its action conditioning makes the system more like a learned game engine that produces what a player would see, contingent on what the player does.[^1][^4]
DeepMind disclosed publicly that two model variants existed at launch: an undistilled base model that produced the highest-fidelity demonstrations, and a distilled version that ran fast enough to play in real time but with reduced visual quality.[^1][^3] Most of the curated demonstrations published with the blog post came from the undistilled base model, which was not real-time.
DeepMind stated that Genie 2 was trained on a "large-scale video dataset" but provided minimal further detail.[^1][^4] Reporting by TechCrunch noted that the training data was understood to include playthroughs of popular video game titles, although the lab declined to specify which titles or quantify the corpus.[^4] This opacity continued the trend established with Genie 1, whose paper described a curated 30,000-hour subset of internet gameplay video but did not specify which games or sources were used.[^4][^10]
The reliance on commercial gameplay footage raised contemporaneous questions about copyright and licensing, mirroring debates already underway over text and image training data. DeepMind has not publicly addressed the licensing status of gameplay footage used to train Genie models.[^4]
DeepMind disclosed Genie 1's size of 11 billion parameters but did not publish a parameter count for Genie 2. Available reporting and DeepMind's framing suggest that Genie 2 is substantially larger than Genie 1, but precise figures have not been confirmed.[^4][^10][^18]
A key technical observation about Genie 2 is that it does not encode worlds as meshes, textures, or geometry of any kind. The system has no internal 3D representation that persists across time; what it has are model weights and an autoregressive context window holding recent frames and actions. Consistency between successive frames arises from the transformer's attention to those recent latents rather than from any stored scene state.[^1][^21]
This is sharply different from the way conventional game engines work. A traditional engine has the world fully described in memory as geometry, materials, and physics state, and draws each frame from that store. Genie 2 has to reconstruct what the world should look like at every step from its model weights plus the recent context, which has two consequences: it scales with model capacity rather than world complexity (a model that can render forests can also render mountains, deserts, and cities with no separate engineering), but it has no hard guarantee that the world stays exactly the same when revisited.[^3][^21]
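The contrast with a stored scene can be made concrete with a toy bounded context window. This is a sketch under loud assumptions: the window length of three steps is arbitrary (DeepMind has not published Genie 2's context length), frames are represented by place names, and the class is hypothetical. It shows only why content that falls out of the context has to be regenerated rather than looked up.

```python
from collections import deque


class ContextWindow:
    """Holds only the most recent (frame, action) pairs; older ones are dropped."""

    def __init__(self, max_steps):
        # A deque with maxlen silently evicts the oldest item when full.
        self.steps = deque(maxlen=max_steps)

    def push(self, frame, action):
        self.steps.append((frame, action))

    def remembers(self, frame):
        return any(stored == frame for stored, _ in self.steps)


window = ContextWindow(max_steps=3)
for frame, action in [("courtyard", "W"), ("corridor", "W"),
                      ("stairs", "A"), ("balcony", "D")]:
    window.push(frame, action)

print(window.remembers("balcony"))    # True: still inside the window
print(window.remembers("courtyard"))  # False: evicted, so a revisit is re-generated
```

A conventional engine would instead keep the courtyard's geometry in memory indefinitely, which is why it can guarantee an identical scene on return while a context-based model cannot.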
The headline capability of Genie 2 is that a single image, supplied as a prompt, becomes a navigable 3D environment. DeepMind generated most demonstration prompts using Imagen 3, Google's text-to-image model, in pipelines that allowed researchers to describe a scene in text, sample a still image, and then "step into" that image inside Genie 2.[^1][^17] Real-world photographs and concept art also worked as prompts, demonstrating that the model generalized outside the gameplay video distribution it was trained on.[^1][^7]
In demonstrations, generated worlds responded to keyboard and mouse inputs in real time. The model interpreted the relevant agent within a scene without explicit annotation: DeepMind stated that Genie 2 could "figure out that arrow keys should move a robot and not trees or clouds," choosing the visually plausible movable object as the player avatar.[^4] This action targeting was emergent from training rather than configured per scene.
DeepMind described Genie 2 as having "long horizon memory," meaning the model maintained recognition of off-screen scene elements and reconstructed them coherently when the camera returned to them.[^1][^3] In one demonstration, a player turned a corner, walked away, and then returned to find the original space rendered with broadly the same features. This capability distinguished Genie 2 from earlier generative video systems that effectively forgot any element no longer in frame.[^7][^21]
The memory window in practice was substantially shorter than DeepMind's headline claim of "up to a minute." Independent reporting by Ars Technica and later analysis described typical effective consistency as 10 to 20 seconds, with the longer minute-scale figure achievable only in selected cases.[^5][^22] The 10-to-20-second figure is the one most often cited in comparisons with Genie 2's successor, Genie 3, which extended consistency to several minutes.[^8]
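DeepMind's blog post identified a number of "emergent capabilities" demonstrated in Genie 2 outputs, including object interactions, physics simulation, character animation, lighting and reflection effects, and modeling of non-player character behavior.[^1]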
DeepMind emphasized that none of these properties were explicitly programmed. Each had to be learned during pretraining from video data, the same way a language model learns syntax from text exposure.[^1] The blog framed this as evidence that "scaling" the video-pretraining approach was viable: capabilities that would normally require bespoke physics engines, animation systems, or shader programming emerged purely as consequences of training on a large enough video corpus.[^1][^3]
A capability particularly relevant to agent training was Genie 2's ability to generate divergent trajectories from the same starting frame. By sampling different actions from an identical initial latent, researchers could produce multiple "what if" rollouts of the same scenario, enabling counterfactual evaluation of agent behavior.[^1][^3] DeepMind explicitly highlighted this as one motivation for the system: agents could be evaluated on whether they performed equally well across diverse trajectories, exposing brittleness that fixed environments might hide.[^1]
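Branching from a shared starting frame can be sketched with a toy world in which the state is just a grid position. The dynamics function, key bindings, and plan names below are hypothetical stand-ins for the learned model and real inputs; the point is only that one initial state combined with different action sequences yields several comparable rollouts.

```python
# Hypothetical key bindings mapped to grid moves.
MOVES = {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}


def step(state, action):
    """Toy deterministic dynamics standing in for the world model."""
    dx, dy = MOVES[action]
    return (state[0] + dx, state[1] + dy)


def rollout(start, actions):
    """Return every state visited, beginning with the shared start state."""
    states = [start]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def counterfactual_rollouts(start, plans):
    """Run each named action sequence from the same starting state."""
    return {name: rollout(start, actions) for name, actions in plans.items()}


plans = {"go_left": "AAW", "go_right": "DDW", "stay": ""}
branches = counterfactual_rollouts((0, 0), plans)
for name, states in branches.items():
    print(name, states)
# go_left [(0, 0), (-1, 0), (-2, 0), (-2, 1)]
# go_right [(0, 0), (1, 0), (2, 0), (2, 1)]
# stay [(0, 0)]
```

An evaluation harness could then score an agent on each branch separately and compare the results, which is the kind of brittleness check the passage describes.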
Although Genie 2 was trained largely on video game footage, the model demonstrated transfer to inputs well outside that distribution. DeepMind showed examples where hand-drawn sketches, concept-art renderings, and photographs of real environments such as waterfalls and forests were converted into navigable 3D scenes.[^1][^7] DeepMind referred to this property as "out-of-distribution generalization" and presented it as evidence that the model had learned general principles of 3D scene dynamics rather than memorizing properties of specific games.[^1]
A central demonstration in the Genie 2 announcement was integration with SIMA (Scalable Instructable Multiworld Agent), DeepMind's general-purpose embodied agent.[^1][^23] SIMA had been originally trained on a curated set of commercial video games, learning to follow simple natural-language instructions like "go to the door" or "pick up the cup" by mapping language to mouse and keyboard actions in those games.[^23]
The Genie 2 announcement showed SIMA operating inside environments synthesized by Genie 2 from single image prompts, environments SIMA had never seen during training. In the most cited example, Genie 2 generated a 3D scene containing a blue door and a red door from a prompt image, and an experimenter typed an instruction such as "open the blue door"; SIMA, controlling the avatar through keyboard and mouse inputs while Genie 2 generated the frames, navigated to and opened the correct door.[^1][^4]
The significance of this demonstration was twofold. First, it showed that an agent trained on one set of environments (commercial video games) could be evaluated in entirely new ones (worlds generated by Genie 2), without any retraining. This kind of out-of-distribution evaluation had been difficult to perform at scale before, because each new environment had to be hand-built by engineers.[^1][^23] Second, it gestured at a workflow where Genie 2 could function as an agent training environment in its own right. DeepMind explicitly framed this in the blog: "Genie 2 could enable future agents to be trained and evaluated in a limitless curriculum of novel worlds."[^1]
SIMA's design philosophy, which mapped instructions directly to low-level input actions rather than to environment-specific commands, made it a natural fit for this evaluation paradigm. Any environment that accepts keyboard and mouse input is, in principle, playable by SIMA, and Genie 2 worlds accepted exactly those inputs.[^23]
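The interface argument can be illustrated with a toy agent-environment loop. None of this reflects SIMA's or Genie 2's actual code: the classes, key bindings, and text observations are hypothetical (the real systems exchange pixels, not strings). The sketch shows only that an agent which emits key presses can run in any environment exposing `reset` and `step`.

```python
from typing import Protocol


class KeyboardMouseEnv(Protocol):
    """Anything with these two methods is playable by the agent below."""
    def reset(self) -> str: ...
    def step(self, key: str) -> str: ...


class GeneratedWorld:
    """Toy stand-in for a generated world: a corridor with a door at each end."""

    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self._observe()

    def step(self, key):
        # "A" moves left, "D" moves right, any other key leaves the avatar in place.
        self.position += {"A": -1, "D": 1}.get(key, 0)
        return self._observe()

    def _observe(self):
        if self.position <= -2:
            return "at blue door"
        if self.position >= 2:
            return "at red door"
        return "in corridor"


class InstructionAgent:
    """Hypothetical agent: maps an instruction and observation to a key press."""

    def __init__(self, instruction):
        self.target = "blue" if "blue" in instruction else "red"

    def act(self, observation):
        if self.target in observation:
            return "E"  # interact key: open the door
        return "A" if self.target == "blue" else "D"


def run_episode(env: KeyboardMouseEnv, agent, max_steps=10):
    """Alternate agent key presses and environment steps until the agent interacts."""
    observation = env.reset()
    keys = []
    for _ in range(max_steps):
        key = agent.act(observation)
        keys.append(key)
        if key == "E":
            break
        observation = env.step(key)
    return keys, observation


keys, final_observation = run_episode(GeneratedWorld(), InstructionAgent("open the blue door"))
print(keys, final_observation)  # ['A', 'A', 'E'] at blue door
```

Swapping `GeneratedWorld` for any other class with the same two methods requires no change to the agent, which is the property that made unseen generated worlds usable as evaluation environments.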
The transition from Genie 1 to Genie 2 represented a substantial shift in capability and scope:[^1][^10][^11]
| Aspect | Genie 1 (Feb 2024) | Genie 2 (Dec 2024) |
|---|---|---|
| Dimensionality | 2D side-scrolling platformers | 3D environments |
| Frame rate | ~1 frame per second | Real-time (distilled) / faster than 1 fps (base) |
| Resolution | 256x256 pixels | ~360p |
| World duration | A few seconds at most | 10-20 seconds typical, up to ~1 minute |
| Parameters | 11 billion (published) | Not publicly disclosed |
| Training data | ~30,000 hours of 2D platformer gameplay | "Large-scale video dataset" (unspecified) |
| Action representation | Unsupervised latent actions from video | Keyboard and mouse inputs |
| Physics, lighting, NPCs | Limited / stylized 2D | Emergent 3D physics, lighting, NPC behavior |
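The resolution row can be put in pixel terms with a short calculation. The 16:9 frame sizes used for 360p and 720p (640x360 and 1280x720) are assumptions, since exact frame dimensions are not given here; the 720p figure is the one reported for Genie 3.

```python
def pixels(width, height):
    """Number of pixels in a single frame."""
    return width * height


genie1 = pixels(256, 256)          # 65,536 pixels per frame
genie2_360p = pixels(640, 360)     # 230,400 pixels, assuming 16:9
genie3_720p = pixels(1280, 720)    # 921,600 pixels, assuming 16:9

print(genie2_360p / genie1)        # 3.515625
print(genie3_720p / genie2_360p)   # 4.0
```

Under these assumptions a Genie 2 frame carries roughly 3.5 times as many pixels as a Genie 1 frame, and the later move to 720p quadruples that again.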
Genie 2 retained the conceptual lineage from Genie 1, particularly the use of an autoregressive transformer dynamics model and the principle that video pretraining alone is sufficient to learn world dynamics. It dropped the unsupervised latent-action codebook in favor of direct conditioning on keyboard and mouse actions, simplifying the user-facing interface at the cost of removing one of Genie 1's more conceptually distinctive features.[^1][^10]
Genie 2 launched into a fast-developing space of generative interactive environments. The two most directly comparable systems at the time were Decart's Oasis and World Labs' early prototypes.
Decart Oasis, released in October 2024 by the Israeli startup Decart, was a real-time Minecraft simulator that generated playable frames at low resolution. Coverage by TechCrunch and others contrasted Oasis's tendency to "forget" the layout of levels when the camera looked away with Genie 2's longer-horizon memory.[^4] Oasis was also restricted to a Minecraft-like visual style, whereas Genie 2 generated across a wide range of visual genres.[^4][^7]
World Labs, the startup founded by Fei-Fei Li and incorporated in 2024, demonstrated early prototypes of scene-based interactive 3D generation. Where Genie 2 generated worlds frame-by-frame during interaction, World Labs' early demonstrations produced static 3D scenes that could be navigated after generation. World Labs later released Marble in November 2025 as its first commercial product, focusing on editable, downloadable 3D environments rather than real-time generation.[^4][^24]
A third reference point is DeepMind's own SIMA, which was an agent rather than a world model but which Genie 2 effectively complemented. SIMA needed environments to operate in; Genie 2 needed to demonstrate that its environments were rich enough to challenge a competent agent. The pairing showcased both systems and signaled DeepMind's strategy of co-developing world models and agents as complementary pieces of a broader embodied AI stack.[^1][^23]
Genie 2 also sits within the broader category of generative video models that emerged in 2023-2024, including OpenAI's Sora, Google's own Veo and Veo 2, Runway's Gen models, and others. These systems produce non-interactive video clips from text or image prompts and have no notion of user agency or branching futures.[^4][^14] Tim Brooks's move from Sora to DeepMind in October 2024 and the subsequent formation of DeepMind's dedicated world-simulator team in January 2025 reflect a deliberate strategy of converting passive video generation expertise into interactive simulator expertise, with Genie 2 as the demonstrated early product of that direction.[^4][^14]
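DeepMind announced Genie 3 on August 5, 2025, framing it as "a new frontier for world models." Genie 3 maintained the broad architectural approach of Genie 2 but improved on the most cited limitations: it extended the world consistency window from seconds to several minutes, raised resolution from 360p to 720p, and added "promptable world events" that allow mid-session text-driven modifications.[^8][^9]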
Genie 3 was likewise released as a research preview at first, with broader public access following only when Project Genie launched to Google AI Ultra subscribers in the United States on January 29, 2026.[^25][^26] Genie 2, by contrast, never received public access in any form.
DeepMind's framing of Genie 3 as a "stepping stone toward AGI" represented an explicit escalation of the rhetorical claims made for Genie 2. CEO Demis Hassabis repeatedly emphasized that learned world models like the Genie family were a critical component of embodied AGI strategies, framing the lineage as moving from research demonstration (Genie 1) through proof of generality (Genie 2) toward an actual training substrate (Genie 3).[^8][^27]
Google DeepMind released Genie 2 only as a research preview at announcement, with no public access offered to general users.[^1][^6][^7] DeepMind cited two reasons in its blog post: the substantial computational requirements of running the system, and the need to evaluate risks and refine the model in controlled conditions before broader release.[^1] In practice, access was limited to internal researchers and a small number of external collaborators including selected creatives such as concept artist Max Cant.[^1][^6][^7]
No API, consumer product, or commercial release of Genie 2 was ever made available. The model existed as a research artifact for about eight months (December 4, 2024 to August 5, 2025) before its successor was announced. Users wanting to experiment with the Genie family had to wait until the public launch of Project Genie on January 29, 2026, which incorporated Genie 3 rather than Genie 2.[^25][^26]
This restricted access pattern was consistent with DeepMind's prior practice with frontier capability demonstrations. Several earlier DeepMind systems, including aspects of the AlphaFold lineage and the original SIMA agent, had been released with research-preview gating before any broader availability, when broader availability was offered at all.[^23][^27]
Genie 2 was widely covered by general-purpose technology press on the day of its announcement and in subsequent weeks. Outlets including TechCrunch, Engadget, MIT Technology Review, The Verge, Simon Willison's blog, MarkTechPost, and Heise Online published news pieces or analyses focused on the demonstration videos and the claims in DeepMind's blog post.[^4][^7][^22][^28][^29][^30]
Coverage was generally positive but tempered. The most consistent observation across critical pieces was that DeepMind's "up to one minute" consistency claim was difficult to verify, and that most actual demonstration outputs were closer to 10-20 seconds in length.[^5][^22] Ars Technica's reporting on this point became widely cited as a corrective to DeepMind's headline figure.[^5][^22] Some coverage also questioned the lab's decision to release the system without a technical paper, parameter disclosure, or detailed training data information.[^4][^19]
In a segment of CBS News' 60 Minutes that originally aired in April 2025 and was updated in August 2025, Demis Hassabis and Jack Parker-Holder demonstrated Genie 2 to journalist Scott Pelley.[^31] Pelley was shown a photograph of a California waterfall converted into an explorable first-person environment, a paper plane soaring through a Western landscape with terrain features appearing dynamically, and a knight character ascending stairs in a dungeon while the world generated walls and lighting in real time.[^31]
Asked about the implications, Hassabis emphasized that the larger objective was "building a world model, a model that can understand our world," and that the technology could enable both entertainment applications and, more importantly, simulated training environments for robotics. He observed that simulated environments allowed unlimited data collection, with policies learned in simulation able to be fine-tuned on real-world data afterward.[^31] On the question of whether Genie 2's capabilities had surprised the lab, Hassabis replied that emergent abilities had appeared repeatedly throughout DeepMind's history and that Genie 2's understanding of the physical world had "not been something we were expecting it to be that good at that quickly."[^31]
Genie 2 was widely interpreted as a signal that the world-model approach was viable beyond stylized 2D demonstrations. Where Genie 1's value had been primarily theoretical, demonstrating that interactive worlds could be learned from passive video alone, Genie 2 showed that the same approach could scale into 3D photorealistic generation with persistent memory.[^16][^21] This shifted the conversation in the field from "could world models work?" to "how should world models be built?", with subsequent work from World Labs, Decart, and DeepMind itself iterating on the answers.[^4][^21][^24]
Internally at DeepMind, Genie 2 was treated as evidence that the approach was worth scaling further. The decision to form Tim Brooks's dedicated world-simulator team in January 2025, only weeks after Genie 2's announcement, indicated organizational confidence that the trajectory established by Genie 1 and Genie 2 deserved expanded investment.[^14] Genie 3's release eight months later, with substantially improved consistency and resolution, was consistent with the scaling hypothesis.[^8][^9]
For Tim Rocktäschel and the original Open-Endedness Team, Genie 2 confirmed the bet that had motivated Genie 1 two years earlier: that "endless curriculum" generation was achievable in practice and could form a backbone of embodied AI training infrastructure.[^16]
DeepMind described Genie 2 as being "at the early stages of research and development" and acknowledged several limitations.[^1][^6][^7] Independent reporting and academic analysis identified others.
Short consistency window. As discussed in capability sections above, most Genie 2 simulations lost coherence after 10 to 20 seconds. DeepMind's "up to a minute" figure was a best case rather than a typical one. Visual artifacts accumulated as session length grew, with objects drifting, textures degrading, and spatial geometry losing consistency.[^4][^5][^22]
Low resolution. Genie 2 outputs were typically rendered at approximately 360p, well below the resolution standards of modern consumer video.[^11][^32] The distilled real-time version reduced visual quality further compared to the undistilled base model.[^1][^3]
No text rendering. Like most generative video models of the period, Genie 2 produced illegible or visually noisy text when prompted to render readable signs or text elements within the generated worlds.[^7][^9] This limitation persisted into Genie 3.
No persistent world alterations. Although Genie 2 demonstrated long-horizon memory in the sense of remembering off-screen layouts, the system did not support persistent changes to the world such as a player permanently moving an object, breaking a wall, or carving terrain. The model maintained recognition of visual configurations rather than a state-mutable world model.[^11][^7]
Approximate physics. The physics simulation in Genie 2 was a statistical approximation derived from video patterns, not a ground-truth physics engine. Falling objects, water flow, and collision interactions worked plausibly in most cases but failed in edge cases involving complex multi-body contact, fluid dynamics, or cloth simulation. This limitation made Genie 2 inadequate for sim-to-real robotics applications where transfer fidelity to ground-truth dynamics was required, even as it remained useful for more abstract agent behavior research.[^9][^11]
Computational cost. DeepMind cited computational requirements as one reason for restricting Genie 2's availability to a research preview. The undistilled base model required substantial inference compute, and even the distilled real-time version was not amenable to low-cost or local deployment.[^1][^6]
Lack of transparency. Genie 2 was not accompanied by a technical paper, a model card, a parameter count disclosure, or detailed information about training data sources. This made independent technical analysis difficult and contributed to coverage that relied entirely on DeepMind's curated demonstrations.[^4][^19]
Cherry-picked demos. Several reviewers noted that the polished demonstrations published with the blog post could not be reproduced or verified externally, raising the question of whether typical Genie 2 outputs were as visually consistent as the curated examples. Without public access, this concern could not be empirically resolved.[^4][^19]