# Genie (DeepMind)

> Source: https://aiwiki.ai/wiki/genie
> Updated: 2026-06-03
> Categories: Generative AI, Google DeepMind, World Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Genie** is a generative interactive environment built by [Google DeepMind](/wiki/google_deepmind), described by its creators as the first foundation world model. It was introduced in the paper "Genie: Generative Interactive Environments," posted to arXiv on 23 February 2024 and announced on DeepMind's research site around the same time.[1][2] Genie learns to produce playable, controllable two-dimensional environments from a single prompt, such as a generated image, a photograph, or a hand-drawn sketch, and it does so after training on Internet videos with no action labels at all. This article covers the original model, sometimes referred to informally as "Genie 1" to distinguish it from the later [Genie 2](/wiki/genie_2) and [Genie 3](/wiki/genie_3).

The work was carried out by DeepMind's open-endedness group together with collaborators at the University of British Columbia.[3] The authors argue that, at 11 billion parameters, Genie can be considered a foundation world model, by analogy with the foundation models that had reshaped language and image generation.[1] It was presented as a research result rather than a public product, and was never offered as a consumer tool or playable demo.

## Background

A world model is a learned system that predicts how an environment will change in response to actions, letting an agent "imagine" future states instead of acting only in the real world. The idea has a long history in reinforcement learning, including DeepMind's own [Dreamer](/wiki/dreamer) line of model-based agents. What set Genie apart was the source and form of its supervision. Earlier interactive world models typically required environments with known action sets, or video paired with logged controller inputs. Genie instead learned from raw, unlabelled gameplay footage scraped from the public Internet, inferring both the visual dynamics of a game and a notion of "actions" without ever being told which button a player pressed.[1]

That choice mattered because action-labelled video is scarce, while unlabelled video is abundant. By learning a controllable interface from passive footage, Genie pointed toward a way of building interactive simulators, and ultimately training environments for embodied agents, at the scale of Internet data rather than hand-built games.[1]

## How Genie works

Genie is built from three components that largely share the same backbone: a spatiotemporal transformer (ST-transformer) that attends both across space within a frame and across time between frames.[1]

**Video tokenizer.** A spatiotemporal VQ-VAE compresses raw video frames into a grid of discrete tokens, with the ST-transformer in both encoder and decoder so the codes capture motion rather than treating each frame in isolation. In the paper this tokenizer has roughly 200 million parameters and a codebook of 1,024 entries.[1][4]
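
As a rough illustration of the quantization step in a VQ-VAE, the sketch below maps continuous patch embeddings to the index of their nearest codebook entry. Only the codebook size of 1,024 comes from the paper; the embedding width, the grid size, the random codebook, and the `quantize` helper are assumptions made for illustration. A real tokenizer learns both its encoder and its codebook.

```python
import random

def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
            for v in vectors]

rng = random.Random(0)
CODEBOOK_SIZE = 1024  # from the article
EMBED_DIM = 8         # hypothetical width, chosen only for illustration

# Random stand-in for a learned codebook.
codebook = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)]
            for _ in range(CODEBOOK_SIZE)]

# Stand-in for encoder output on a 3x4 grid of patches (hypothetical grid size).
patches = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(12)]

tokens = quantize(patches, codebook)  # one integer token per patch
print(tokens)
```

Each frame is thus reduced to a small grid of integers, which is the form the dynamics model consumes.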

**Latent action model.** The core trick is an unsupervised latent action model (LAM) that looks at consecutive frames and learns a small set of discrete "actions" that best explain the transition from one frame to the next. The vocabulary is deliberately tiny: the authors restrict it to eight latent actions, which forces the model to discover a compact, meaningful control space. The LAM is used only during training, where its inferred actions condition the dynamics model; at inference the user instead supplies one of the eight latent actions at each step.[1][4]
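
The idea of recovering a small discrete action set from transitions alone can be shown on a toy problem. The sketch below is not Genie's method: it swaps the learned, vector-quantized LAM for plain 1-D k-means, and it uses an invented one-dimensional "game" with three hidden moves rather than eight. The `kmeans_1d` helper, the noise level, and the sample count are all assumptions.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means with evenly spaced initial centres; returns sorted centres."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centres[j]))
            buckets[nearest].append(v)
        # Keep the old centre if a bucket happens to be empty.
        centres = [sum(b) / len(b) if b else c for b, c in zip(buckets, centres)]
    return sorted(centres)

rng = random.Random(0)
true_moves = [-1.0, 0.0, 1.0]  # hidden "buttons" of the toy game, never shown to k-means

# Observed frame-to-frame displacement: a hidden move plus a little noise.
displacements = [rng.choice(true_moves) + rng.gauss(0, 0.05) for _ in range(300)]

latent_actions = kmeans_1d(displacements, k=3)
print([round(c, 2) for c in latent_actions])
```

The clustering sees only how the observation changed, never which move caused it, yet the recovered centres sit close to the hidden moves. Capping the number of clusters plays the same role as Genie's eight-entry action vocabulary: it forces a compact control space.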

**Dynamics model.** A MaskGIT-style transformer takes the past frame tokens together with a chosen latent action and predicts the tokens of the next frame. It is trained by masked-token prediction, as in MaskGIT, and is rolled out autoregressively, one frame at a time, at inference. This is by far the largest part of the system, on the order of 10 billion parameters, which accounts for most of Genie's 11 billion total.[1][4]

At generation time the pipeline runs in a loop. A prompt image is tokenized, the user picks a latent action, the dynamics model predicts the next frame's tokens, and the tokenizer's decoder renders them back into pixels. Repeating this turns a still image into a frame-by-frame playable sequence in which the same latent action produces consistent effects, such as moving left, jumping, or scrolling the scene.[1]
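
The loop described above can be sketched with stand-in components. Everything here is hypothetical scaffolding: `encode`, `decode`, `predict_next`, and `play` are invented names, and the "dynamics model" is a trivial arithmetic rule rather than a transformer. Only the eight-action vocabulary, the 1,024-entry token vocabulary, and the 16-frame context come from the article.

```python
from collections import deque

NUM_ACTIONS = 8       # latent action vocabulary size, from the article
CONTEXT_FRAMES = 16   # frames of context the model can hold, from the article
VOCAB_SIZE = 1024     # tokenizer codebook size, from the article

def encode(frame):
    """Stand-in for the tokenizer's encoder: treats the frame as already tokenized."""
    return tuple(frame)

def decode(tokens):
    """Stand-in for the tokenizer's decoder: returns tokens as a 'frame'."""
    return list(tokens)

def predict_next(history, action):
    """Stand-in dynamics model: shifts the latest frame's tokens by the action id."""
    return tuple((t + action) % VOCAB_SIZE for t in history[-1])

def play(prompt_frame, actions):
    """Roll out one frame per action, keeping a bounded window of past frames."""
    history = deque([encode(prompt_frame)], maxlen=CONTEXT_FRAMES)
    frames = []
    for action in actions:
        if not 0 <= action < NUM_ACTIONS:
            raise ValueError(f"latent action must be in [0, {NUM_ACTIONS})")
        next_tokens = predict_next(history, action)
        history.append(next_tokens)  # the oldest frame drops out once the window is full
        frames.append(decode(next_tokens))
    return frames

frames = play([0, 1, 2, 3], actions=[1, 1, 0, 7])
print(frames[-1])  # [9, 10, 11, 12]
```

The bounded `deque` mirrors the fixed context window: once more than 16 frames have been generated, the earliest ones are no longer visible to the predictor, which is one reason long rollouts can drift.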

| Component | Approx. parameters | Role |
| --- | --- | --- |
| Video tokenizer (ST-transformer VQ-VAE) | ~200M | Compress frames to discrete tokens |
| Latent action model | ~300M | Infer 8 discrete latent actions (training only) |
| Dynamics model (MaskGIT transformer) | ~10B | Predict next-frame tokens from action and past |
| Total | ~11B | Foundation world model |

## Training data

Genie's main model was trained on a filtered set of about 30,000 hours of publicly available Internet gameplay videos drawn from hundreds of different 2D platformer games.[1] DeepMind began from a much larger pool of clips and trimmed it down using a learned classifier, ending with roughly 6.8 million sixteen-second clips. The footage was standardized to a low 160 by 90 resolution at 10 frames per second to keep training tractable at this scale.[4]
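
The figures above are mutually consistent, as a quick back-of-the-envelope check shows. The snippet uses only the numbers quoted in this section; the variable names are illustrative.

```python
CLIPS = 6_800_000       # roughly 6.8 million clips
CLIP_SECONDS = 16       # each clip is sixteen seconds long
FPS = 10                # frames per second after standardization
WIDTH, HEIGHT = 160, 90

total_hours = CLIPS * CLIP_SECONDS / 3600
frames_per_clip = CLIP_SECONDS * FPS
total_frames = CLIPS * frames_per_clip
pixels_per_frame = WIDTH * HEIGHT

print(f"total footage: {total_hours:,.0f} hours")  # about 30,222 hours
print(f"frames per clip: {frames_per_clip}")       # 160
print(f"total frames: {total_frames:,}")           # 1,088,000,000
print(f"pixels per frame: {pixels_per_frame:,}")   # 14,400
```

So 6.8 million sixteen-second clips come to a little over 30,000 hours, matching the headline figure, and amount to roughly 1.1 billion low-resolution frames.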

Crucially, none of these videos carried action labels. The platformer genre was a useful testbed because side-scrolling games share a fairly consistent grammar of movement, which made the latent actions easier to discover. To show the approach was not limited to games, the authors also trained a separate 2.5-billion-parameter model on a robotics dataset (RT-1 style robot demonstrations), where it again recovered distinct, consistent action representations from video alone.[1]

## Capabilities and limitations

Given a single starting frame, Genie can generate an open-ended variety of playable 2D scenes, and it accepts prompts well outside its training distribution, including text-to-image outputs, real photographs, and rough sketches.[1][2] Because the learned latent actions are consistent across different generated worlds, a person (or another model) can "play" the environment in a controllable way. The paper also showed that latent actions learned purely from Internet video could be used to infer policies in unseen reinforcement-learning environments, suggesting a path toward training generalist agents from passive video.[1]

The original Genie was firmly a research prototype. It generated at a very low resolution and ran extremely slowly, on the order of one frame per second, which the team noted was roughly 20 to 30 times slower than interactive play would require.[3] It could only hold about 16 frames of context, so worlds drifted and lost coherence over longer horizons, and the imagery was blocky and prone to artifacts. It was confined to 2D side-scrolling dynamics rather than full 3D scenes. These constraints, especially speed, consistency, and resolution, became the explicit targets of the successor models.

## Reception

Genie drew considerable attention as an early demonstration that controllable, interactive environments could be learned from unlabelled video at foundation-model scale. At the Forty-first International Conference on Machine Learning (ICML 2024), the paper received a Best Paper Award, one of two such awards given that year to DeepMind's open-endedness team.[5][6] Lead authors including Jake Bruce and senior author Tim Rocktäschel publicly noted the recognition.[6] Press coverage emphasized both the novelty of turning a single image into a playable world and the practical limitations of the first version.[2][3]

## Successors

DeepMind extended the line in two further releases. [Genie 2](/wiki/genie_2), announced in December 2024, moved beyond 2D platformers to generate action-controllable 3D environments from a single image, simulating physics, lighting, and object interactions, though it held coherence only for tens of seconds to about a minute.[7] [Genie 3](/wiki/genie_3), unveiled on 5 August 2025, was presented as the first world model to support real-time interaction, generating 720p worlds at 24 frames per second from a text prompt and maintaining consistency for several minutes with a visual memory reaching back roughly one minute.[8] The broader effort is closely tied to DeepMind's work on embodied and instruction-following agents such as [SIMA](/wiki/sima), which the world-model environments are intended to help train and evaluate.

## References

1. Bruce, J., Dennis, M., Edwards, A., Parker-Holder, J., Shi, Y., et al. "Genie: Generative Interactive Environments." arXiv:2402.15391, 23 February 2024. https://arxiv.org/abs/2402.15391
2. "Genie: Generative Interactive Environments." Google DeepMind research publications. https://deepmind.google/research/publications/60474/
3. Yirka, B. "DeepMind demonstrates Genie, an AI app that can generate playable 2D worlds from a single image." Tech Xplore, March 2024. https://techxplore.com/news/2024-03-deepmind-genie-ai-app-generate.html
4. Bruce, J., et al. "Genie: Generative Interactive Environments" (full text). arXiv HTML, 2024. https://arxiv.org/html/2402.15391v1
5. "Congratulations to the ICML 2024 award winners." AIhub, 25 July 2024. https://aihub.org/2024/07/25/congratulations-to-the-icml2024-award-winners/
6. Rocktäschel, T. "We have been awarded two Best Paper Awards at ICML 2024 ..." LinkedIn, July 2024. https://www.linkedin.com/posts/rockt_we-have-been-awarded-two-best-paper-awards-activity-7221414257035227136-0DmX
7. "Genie 2: A large-scale foundation world model." Google DeepMind blog, 4 December 2024. https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/
8. "Genie 3: A new frontier for world models." Google DeepMind blog, 5 August 2025. https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/

