Genie 3 is a foundation world model developed by Google DeepMind, announced on August 5, 2025. It generates photorealistic, interactive three-dimensional environments from text descriptions, runs in real time at 24 frames per second and 720p resolution, and maintains environmental consistency for several minutes of continuous interaction. Google DeepMind describes Genie 3 as "a crucial stepping stone on the path to artificial general intelligence," positioning it primarily as a training substrate for general-purpose AI agents.
Unlike passive video generators such as Sora or Veo 3, Genie 3 responds to user navigation inputs, allowing real-time exploration and modification of the worlds it creates. Users can steer a character through generated landscapes, trigger weather changes, introduce new objects, and revisit previously seen locations, which the model recalls from a roughly one-minute memory window. The system taught itself basic physics entirely from video data, without any hard-coded physics engine.
Genie 3 became publicly accessible through Project Genie on January 29, 2026, initially limited to Google AI Ultra subscribers in the United States.
The Genie series began with a research paper published by Google DeepMind in February 2024, titled "Genie: Generative Interactive Environments." The original Genie model was a foundation world model trained on roughly 30,000 hours of internet gameplay footage from hundreds of 2D platformer games, using a curated dataset of approximately 6.8 million 16-second video clips at 10 frames per second and 160x90 pixel resolution.
Genie 1's architecture had three primary components: a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model. All three components used spatiotemporal transformers as their core building block. The latent action model employed a VQ-VAE approach, learning a discrete codebook of latent actions (such as MOVE_RIGHT) without any labeled action data, a technique called unsupervised action learning. The dynamics model then predicted the next video frame conditioned on the current frame and the inferred latent action.
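The pipeline can be pictured with a minimal sketch. The toy modules below use small linear layers and hypothetical dimensions in place of the paper's spatiotemporal transformers; only the wiring (frame pair → discrete latent action → next-frame prediction) follows the published design.

```python
import torch
import torch.nn as nn

FRAME_DIM = 64    # assumed embedding size of a tokenized frame
NUM_ACTIONS = 8   # Genie 1 learned a small discrete action codebook

class LatentActionModel(nn.Module):
    """VQ-VAE style: encode a frame pair, snap to the nearest codebook entry."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(2 * FRAME_DIM, FRAME_DIM)
        self.codebook = nn.Embedding(NUM_ACTIONS, FRAME_DIM)

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        dists = torch.cdist(z, self.codebook.weight)  # (batch, NUM_ACTIONS)
        action_id = dists.argmin(dim=-1)              # inferred discrete action
        return action_id, self.codebook(action_id)

class DynamicsModel(nn.Module):
    """Predicts the next frame from the current frame plus a latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * FRAME_DIM, FRAME_DIM)

    def forward(self, frame_t, action_vec):
        return self.net(torch.cat([frame_t, action_vec], dim=-1))

# No action labels are supplied anywhere: the action space is discovered
# purely from consecutive frame pairs during training.
lam, dyn = LatentActionModel(), DynamicsModel()
f0, f1 = torch.randn(1, FRAME_DIM), torch.randn(1, FRAME_DIM)
action_id, action_vec = lam(f0, f1)
pred_f1 = dyn(f0, action_vec)  # trained to reconstruct f1
```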
Scaling experiments during Genie 1's development ranged from 40 million to 2.7 billion parameters, and the final published model reached 11 billion. However, it generated worlds at approximately one frame per second, far too slow for interactive play, and operated only in two dimensions. Its primary value was demonstrating that a model could learn controllable world dynamics from unlabeled video footage alone.
Google DeepMind released Genie 2 in December 2024 as a substantial upgrade, extending world generation to three dimensions and improving fidelity. Genie 2 could accept a single image or short text prompt as input and generate a navigable 3D environment from it.
However, Genie 2 had notable constraints. Its effective memory window was approximately 10 to 20 seconds, meaning the world would become inconsistent or visually degrade beyond that point. Ars Technica reported at the time that DeepMind's claim of roughly one minute of consistency was optimistic for most use cases. Resolution was capped at 360p, well below broadcast quality. The system was not available for public use and remained a research demonstration.
Genie 2 was significant nonetheless because it showed that the latent action approach from Genie 1 could generalize to 3D environments and photorealistic imagery, not just stylized 2D game footage.
Genie 3's headline advance over its predecessors is that it runs at interactive speeds. The model generates environments at 24 frames per second and 720p resolution, which is sufficient for fluid gameplay. Users navigate generated worlds with standard keyboard controls, and the system responds within fractions of a second to each input.
This is technically non-trivial. Autoregressive models generate each output token (in this case, each video frame) by conditioning on everything generated previously. As a session continues, the model must reference an ever-growing trajectory of prior frames and actions. Shlomi Fruchter of Google DeepMind described the challenge: "The model has to take into account the previously generated trajectory that grows with time." Achieving real-time throughput under this growing computational burden required substantial engineering work alongside the base model architecture.
The result is a system where each frame is generated on-the-fly based on the player's current action and the accumulated history of the session, rather than being retrieved from a pre-rendered sequence or stored game state.
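In pseudocode terms, the loop looks something like the sketch below. It is a toy illustration: `generate_frame` is a hypothetical stand-in for the model, and the real system must complete each call in under 1000/24 ≈ 42 milliseconds to sustain 24 frames per second.

```python
from typing import List, Tuple

FPS = 24
FRAME_BUDGET_MS = 1000 / FPS  # ~41.7 ms per frame to stay real-time

def generate_frame(history: List[Tuple[str, str]], action: str) -> str:
    # Stand-in for the model: a transformer that attends over every
    # prior (frame, action) pair, so per-call cost grows with history.
    return f"frame_{len(history)}"

history: List[Tuple[str, str]] = []
for tick in range(5):
    action = "MOVE_FORWARD"             # the player's input this tick
    frame = generate_frame(history, action)
    history.append((frame, action))     # the trajectory grows every frame
```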
Genie 3 maintains a visual memory window of approximately one minute. When a user leaves a location and later returns to it, the model reconstructs it in a way that is consistent with what was generated earlier. Objects that were moved or created remain in their new positions. Environmental changes the user triggered persist for as long as the memory window covers the relevant frames.
This one-minute window represents a three- to six-fold improvement over Genie 2's effective limit of 10 to 20 seconds, and it enables a qualitatively different experience: players can walk away from a location, explore nearby areas, and return to find their starting point roughly intact. Prior world models would simply regenerate everything from scratch upon re-entry, producing visible inconsistencies.
The consistency mechanism does not rely on an explicit 3D scene representation. There is no NeRF, no Gaussian splatting, and no stored point cloud. Instead, consistency emerges from the model attending to earlier frames during autoregressive generation. This means the approach scales with model capacity rather than with scene complexity, but it also means that the "memory" degrades gracefully as the window fills rather than maintaining perfect accuracy indefinitely.
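One way to picture the bounded window, assuming a simple frame buffer (DeepMind has not described its actual context-management scheme), is a fixed-capacity queue: at 24 FPS, a one-minute window corresponds to roughly 1,440 frames of attendable context, and anything older silently drops out.

```python
from collections import deque

FPS = 24
MEMORY_SECONDS = 60
context = deque(maxlen=FPS * MEMORY_SECONDS)  # ~1,440 frames

def observe(frame):
    # New frames push the oldest ones out, so consistency fades
    # gradually as the window fills rather than failing abruptly.
    context.append(frame)

for t in range(2000):        # 2,000 frames ≈ 83 seconds of play
    observe(f"frame_{t}")
print(len(context))          # 1440: only the last minute is attendable
```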
Beyond navigation controls, Genie 3 supports mid-session text prompts that alter the generated world. Users can type instructions like "make it snow," "introduce a thunderstorm," "add a fox near the treeline," or "change it to night," and the model modifies the ongoing generation accordingly. Google DeepMind calls this feature promptable world events.
This capability separates Genie 3 from purely action-conditioned video models, which can only respond to movement inputs. Text conditioning during generation allows users to author their worlds dynamically rather than merely explore the output of a fixed prompt. Researchers studying agent behavior can use this to introduce novel stimuli mid-session without restarting from scratch.
The feature has some limitations. Very specific or geometrically precise instructions ("place a red cube exactly three meters to the left of the blue sphere") do not work reliably. The model interprets prompts in a statistical sense, producing plausible interpretations rather than literal executions.
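Conceptually, each generation step therefore takes two conditioning signals rather than one: the navigation action and an optional free-text event. A minimal sketch of that interface follows; all names here are hypothetical, not Google's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StepConditioning:
    action: str                        # navigation input, e.g. "MOVE_FORWARD"
    world_event: Optional[str] = None  # rare mid-session text prompt

def next_frame(cond: StepConditioning) -> str:
    # Stand-in for the model call; real generation would also take
    # the accumulated session history as context.
    if cond.world_event:
        return f"frame conditioned on {cond.action!r} + {cond.world_event!r}"
    return f"frame conditioned on {cond.action!r}"

print(next_frame(StepConditioning("MOVE_FORWARD")))
print(next_frame(StepConditioning("MOVE_FORWARD", "introduce a thunderstorm")))
```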
Genie 3 was not given a physics engine during training. Its understanding of how objects fall, how water behaves, how lighting changes with weather, and how terrain deforms came entirely from watching video data. Google DeepMind researchers described this as the model discovering consistency mechanics without explicit programming, and it reflects a broader principle that large-scale video pretraining encodes substantial physical knowledge.
In practice the physics simulation is approximate. DeepMind's own blog acknowledged inaccuracies, citing errors in how snow moved around a skier as one example. Falling objects and basic rigid-body behavior are handled reasonably, but complex fluid dynamics, cloth simulation, and multi-body contact are not reliably correct. The system is best understood as a plausible physics approximator rather than a ground-truth simulator.
Genie 3 is an autoregressive transformer model. Like large language models that generate text one token at a time, Genie 3 generates video one frame at a time, with each frame conditioned on all prior frames and the most recent user action.
The spatiotemporal transformer architecture inherited from Genie 1 remained central to Genie 3, though Google DeepMind has not published a full technical paper describing the specific modifications made between versions. The model processes both spatial information (what the world looks like in the current frame) and temporal information (how the world has changed over the session) within its attention mechanism, allowing it to maintain coherence across the time dimension.
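A common way to build such a block, and the style the Genie 1 paper describes, is to factorize attention into a spatial pass (across patch tokens within a frame) and a causal temporal pass (across frames at each patch position). The sketch below shows that factorization with hypothetical dimensions; Genie 3's exact variant is unpublished.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Factorized spatiotemporal attention: spatial, then causal temporal."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (time, patches, dim) for a single video clip
        t = x.shape[0]
        x, _ = self.spatial(x, x, x)        # mix patches within each frame
        x = x.transpose(0, 1)               # (patches, time, dim)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        x, _ = self.temporal(x, x, x, attn_mask=causal)  # past frames only
        return x.transpose(0, 1)

block = STBlock()
clip = torch.randn(8, 16, 64)   # 8 frames, 16 patch tokens each
out = block(clip)               # same shape, now coherent across time
```

The causal mask in the temporal pass is what makes the block usable for autoregressive generation: a frame may attend to its past but never to frames that have not been generated yet.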
Training involved large volumes of diverse video footage, extending well beyond the 2D platformer games used for Genie 1. Genie 3 was trained on video spanning many environment types, including natural landscapes, architectural scenes, stylized animation, and fictional settings. This breadth is what allows it to generate "any world," rather than being specialized to a particular visual domain.
Genie 3 does not accept image inputs for world initialization; as of its August 2025 announcement, the model takes only text prompts. This contrasts with Genie 2, which could bootstrap a world from a single input image.
Google DeepMind has not disclosed the exact parameter count for Genie 3. Published reports referencing an 11 billion parameter figure are drawing on Genie 1, whose final published model was that size. Whether Genie 3 uses the same scale or has been scaled further has not been officially confirmed.
The lack of a formal technical paper at launch was noteworthy. Google DeepMind published a full arXiv preprint for Genie 1, which allowed external researchers to reproduce and analyze the architecture in detail. Genie 3's announcement consisted of a model page, a blog post, and a limited research preview program. Community discussion on Hacker News and in AI research forums noted that without a paper, key design decisions, training data composition, evaluation benchmarks, and failure modes cannot be independently assessed. Google DeepMind has not stated whether a technical report will be published.
The inference pipeline runs on Google's internal accelerator infrastructure. Google has not disclosed the hardware configuration used to serve Genie 3 at real-time speeds, though the compute requirements for generating 720p video at 24 FPS autoregressively are substantial. Project Genie's server-side deployment means users access the model through a browser-based interface rather than running it locally, which is consistent with the hardware demands involved.
On January 29, 2026, Google DeepMind launched Project Genie, an experimental research prototype that puts Genie 3 in the hands of end users. Project Genie rolled out to Google AI Ultra subscribers in the United States aged 18 and older. Google AI Ultra costs $249.99 per month, making Project Genie accessible primarily to developers, researchers, and enthusiasts already invested in Google's AI ecosystem.
Project Genie offers three main modes of interaction:
World sketching: Users describe a world with text prompts, optionally augmented with reference images. Google's Nano Banana Pro image model provides a preview of the world before the user enters it, allowing adjustments to the initial prompt.
World exploration: Once inside a generated world, users navigate in real time using WASD keys for movement, arrow keys for camera control, and spacebar for vertical movement; a sketch of this control scheme follows the list. Genie 3 generates the path ahead continuously as the user moves through it.
World remixing: Users can explore a curated gallery of worlds created by other users, remix them with new prompts, and download video recordings of their sessions.
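For illustration, the published controls map onto the kind of small discrete action vocabulary world models condition on. The table below is a hypothetical reconstruction from the control list above, not Google's actual implementation.

```python
# Hypothetical key-to-action table mirroring Project Genie's controls.
KEYMAP = {
    "w": "MOVE_FORWARD",  "s": "MOVE_BACK",
    "a": "STRAFE_LEFT",   "d": "STRAFE_RIGHT",
    "ArrowLeft": "CAMERA_LEFT",  "ArrowRight": "CAMERA_RIGHT",
    "ArrowUp": "CAMERA_UP",      "ArrowDown": "CAMERA_DOWN",
    "Space": "VERTICAL_MOVE",
}

def key_to_action(key: str) -> str:
    # Unmapped keys fall back to a no-op so generation never stalls.
    return KEYMAP.get(key, "NO_OP")

assert key_to_action("w") == "MOVE_FORWARD"
```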
At launch, Project Genie capped individual sessions at 60 seconds of generation, shorter than what Genie 3 is capable of in research settings. Google acknowledged this limit was a product decision rather than a technical ceiling, citing the experimental nature of the prototype. Some capabilities present in the August 2025 Genie 3 announcement, such as full promptable world events, were not included in the initial Project Genie release.
The launch was limited to the US, with Google stating that expansion to other regions would happen "in the future" without specifying a timeline.
Google DeepMind's stated primary motivation for building Genie 3 is to create a substrate for training embodied AI agents. The current bottleneck for embodied agent research is the scarcity of diverse training environments: real-world data collection is slow and expensive, hand-crafted game environments are polished but narrow in scope, and existing 3D simulation platforms like Unity or Unreal require significant engineering work to set up.
Genie 3 can generate an essentially unlimited variety of environments from text prompts, potentially allowing agents to train across far more situations than any manually authored simulator provides. Researchers can specify the type of environment, the physical properties they want the agent to encounter, and then generate hundreds of variations without writing any code.
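As a sketch of what that workflow might look like (the template text and every axis below are invented for illustration), even a trivial script multiplies a few prompt dimensions into a sizable environment set:

```python
import itertools
import random

# Hypothetical prompt template for an agent-training curriculum.
TERRAINS = ["alpine meadow", "desert canyon", "mangrove swamp", "icy fjord"]
WEATHER = ["clear skies", "heavy fog", "driving rain"]
OBSTACLES = ["a rope bridge to cross", "scattered boulders to avoid"]

prompts = [
    f"A {terrain} under {weather}, with {obstacle}."
    for terrain, weather, obstacle
    in itertools.product(TERRAINS, WEATHER, OBSTACLES)
]
random.shuffle(prompts)  # randomize curriculum order for the agent
print(len(prompts), "distinct environments from one template")  # 24
```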
Google DeepMind demonstrated this use case through integration with SIMA (Scalable Instructable Multiworld Agent), their embodied AI agent. SIMA 2, released in late 2025, showed that agents trained partly in Genie 3-generated worlds could transfer their skills to real game environments they had never seen during training. When placed in newly generated Genie worlds, SIMA 2 could orient itself, interpret user instructions, and take meaningful actions toward goals despite the unfamiliar visual context.
The goal is eventually to close the sim-to-real gap: train agents in generated worlds diverse enough that their learned behaviors transfer to physical robots operating in the real world. As of early 2026, this transfer remains an open research problem. Genie 3 environments are photorealistic by consumer standards but not reliably accurate enough for robust real-world deployment.
In February 2026, Waymo announced the Waymo World Model, a specialized derivative of Genie 3 adapted for autonomous driving research. It generates multi-sensor outputs, including camera video and lidar point clouds, which Waymo says it produces at four times the output speed of the base Genie 3 model.
The system can simulate rare edge cases that Waymo's actual vehicle fleet would almost never encounter in normal operations: tornadoes, unusual animal encounters, extreme weather, and other low-frequency but high-importance scenarios. Waymo noted that Genie 3's pretraining on a vast and diverse video corpus gives the Waymo World Model access to visual and physical patterns that Waymo's own data collection would never adequately cover.
By layering Waymo's specialized post-training on top of Genie 3's general world knowledge, the company can generate synthetic training data for situations that are nearly impossible to capture safely at scale in the real world. Bloomberg reported that Waymo believes these simulations will help accelerate the expansion of its robotaxi operations.
Game developers have been among the most interested outside observers of Genie 3. The ability to generate a navigable environment from a text description in seconds compresses what would otherwise be days or weeks of asset creation and level design into a nearly instant prototype.
A survey presented at the 2026 Game Developers Conference found that 52% of gaming professionals viewed AI tools as having a negative impact on the industry (up from 30% in 2025 and 18% in 2024), reflecting tension between AI as a productivity tool and AI as a threat to creative roles. Genie 3 itself attracted a range of reactions. Some game developers saw it as a useful ideation tool, pointing to its ability to quickly rough out visual concepts and spatial layouts. Others noted that a navigable environment generated by Genie 3 is not a game: it has no mechanics, no scoring, no narrative, no optimization for frame delivery, and no tools for iterating on a design.
Julian Togelius, a prominent researcher in AI and games, observed that Genie 3 represents a step toward what he called a "neural game engine" but argued that actual games require far more than consistent world generation. Precise collision detection, deterministic state, and author-controlled pacing are properties that pixel-level prediction systems cannot yet reliably provide.
Google positions Genie 3 for rapid prototyping and concept exploration, not as a replacement for commercial game development pipelines.
DeepMind cited education and creative exploration as additional use cases. Students studying history, geography, or science could in principle navigate generated representations of historical locations, ecosystems, or physical phenomena. Genie 3's imperfect accuracy limits formal educational applications today, but the potential for illustrative and exploratory use is real.
Creatives including filmmakers and concept artists have used early access to Project Genie to rough out environment ideas, generate reference material for settings, and produce short video clips of generated worlds for use in mood reels and pitch materials.
By August 2025, Genie 3 entered a competitive field of world model efforts from both large labs and startups. The approaches differ significantly in technical architecture, output format, and intended use case.
| Model | Developer | Output format | Resolution/FPS | Interactivity | Key differentiator |
|---|---|---|---|---|---|
| Genie 3 | Google DeepMind | Real-time navigable video | 720p @ 24 FPS | Yes, keyboard-controlled | Long memory window, promptable events, agent training |
| Marble | World Labs | 3D scene assets | Exportable 3D | Limited | Downloadable, editable 3D output |
| Oasis | Decart AI / Etched | Real-time video | ~480p @ 20 FPS | Yes | Browser-accessible, Minecraft-style |
| Odyssey-1 | Odyssey | Streaming video | 480p @ 30 FPS | Yes | 40ms latency, interactive video |
| HY World 1.5 | Tencent Hunyuan | Real-time video | 720p @ 24 FPS | Yes | Open source |
| GWM-1 Worlds | Runway | Real-time navigable video | 720p real-time | Yes | Spatial coherence on revisit |
World Labs (Marble): Fei-Fei Li's World Labs took a fundamentally different bet with its commercial product Marble, announced in November 2025. Where Genie 3 generates a continuous video stream that the user navigates, Marble produces downloadable, editable 3D assets and environments from multi-modal inputs including photos, videos, panoramas, and 3D layouts. An analyst summarized the distinction this way: Marble renders what the world looks like; Genie 3 shows how the world changes. Marble's output is persistent and author-controlled, which makes it more useful for content pipelines but less useful for agent training.
Oasis (Decart AI): Decart AI released Oasis in late 2024 in collaboration with hardware startup Etched. Oasis is a playable Minecraft-like world model that runs in real time in the browser. It runs at approximately 20 frames per second and demonstrated that real-time interactive world generation was achievable before Genie 3's announcement. Oasis operates at lower resolution and with less diversity than Genie 3, and its output is visually noisier, but its free public availability made it an influential demo.
Odyssey-1: Odyssey, a startup founded by autonomous vehicle researchers Oliver Cameron and Jeff Hawke, released Odyssey-1 as a playable world model with particularly low latency. The system responds to inputs within roughly 40 milliseconds, streaming interactive video at up to 30 frames per second from clusters of NVIDIA H100 GPUs. Odyssey described the approach as "interactive video" rather than 3D world simulation. At launch the environments were blurry and unstable, with layouts sometimes changing when the user turned around, but the streaming architecture and latency characteristics were notable.
Tencent HY World 1.5: Tencent's Hunyuan team released HY World 1.5 as an open-source world model running at 720p and 24 FPS, matching Genie 3's specifications on paper and providing the first fully open-source option at that resolution. As a research release it lacks Project Genie's interface and tooling but is accessible to developers who want to build on top of a base model.
Academic and technical reception to Genie 3 was largely positive on capability but mixed on transparency. The Hacker News discussion following the August 2025 announcement drew over a thousand comments. Several developers who tested the model expressed surprise that it achieved "consistency over multiple minutes AND runs in real time at 720p" simultaneously, calling it an advance beyond their expectations. Robotics researchers were particularly interested, with multiple commenters observing that world models could have "a bigger part to play in robotics and real world AI" than previously assumed.
Criticisms focused on two areas. First, the announcement lacked an accompanying technical paper or detailed architecture description, which was unusual by DeepMind's standards. Multiple researchers and engineers expressed frustration that "value extraction" appeared to have taken precedence over academic transparency. DeepMind did not publish a preprint in conjunction with the announcement, leaving the community to infer architectural details from a blog post and a model page.
Second, developers who tested the limited research preview reported that physics failures were common, social interaction scenarios did not work, and complex instruction following "fails in surprising and obvious ways." One developer noted that long instruction sequences and simple combinatorial game logic were both unreliable. These reports aligned with DeepMind's own stated limitations but underscored the gap between the polished demo videos and typical session quality.
Media coverage was largely enthusiastic, though DeepMind's "stepping stone toward AGI" framing, highlighted by TechCrunch among others, drew debate. Critics noted that "AGI" is not a well-defined target and that positioning a video generation model as a stepping stone to general intelligence was a stretch that said more about marketing priorities than about the actual research contribution.
The stock market reacted to the Project Genie public launch in January 2026. Unity Software fell 21%, Roblox Corporation fell 15%, Take-Two Interactive fell 9.3%, and CD Projekt fell 8% in the days following the announcement, reflecting investor concern about Genie 3's potential to displace conventional game development tools.
Google DeepMind has been explicit about the current limitations of Genie 3.
Session duration: Interactions are capped at a few minutes of continuous generation. The one-minute memory window and the growing cost of trajectory computation mean that sessions cannot currently extend to hours, which would be needed for robust agent training curricula. Project Genie's 60-second cap at public launch, though described by Google as a product decision, sits well within this practical ceiling.
Action space: User control is limited to navigation. There is no support for fine manipulation, object grasping, or other actions relevant to robotic or embodied agent training beyond locomotion. The range of controllable actions is narrower than in hand-authored game environments.
Multi-agent scenarios: Genie 3 struggles when multiple independent agents share a generated space. Characters introduced through promptable world events do not behave as fully autonomous independent agents; they appear and move plausibly but do not maintain independent goals or memory.
Geographic accuracy: The model cannot reliably recreate specific real-world locations. Asking for "the plaza outside the Louvre" will produce something plausibly Parisian but not accurate to the actual site. This limits uses in architecture, urban planning, and geography education.
Text rendering: Text that appears within generated worlds, such as signs, labels, or written content, is typically illegible or distorted. This is a common failure mode for video generation models trained primarily on visual rather than text-layout data.
Text-only input: As of its August 2025 announcement, Genie 3 accepts only text prompts as input, unlike Genie 2 which could bootstrap from a single image. This restriction limits the ability to anchor generated worlds to a specific visual reference.
Physics accuracy: The learned physics simulation is approximate. Fluid dynamics, cloth behavior, and complex multi-body contacts are not reliably simulated. The model produces plausible-looking physics for most casual interactions but fails on scenarios requiring precise physical accuracy.
Safety and responsibility: Google DeepMind's responsible AI documentation for Genie 3 notes that the open-ended and real-time nature of the system creates novel safety challenges compared to text or image generation. The limited research preview rollout before Project Genie reflected caution about unforeseen uses at scale.