World model
Last reviewed
Sources
17 citations
Review status
Source-backed
Revision
v5 · 5,377 words
A world model is an artificial intelligence system that learns an internal representation of how an environment works, enabling it to predict future states, simulate the consequences of actions, and support planning without needing to interact with the real environment at every step. In the simplest terms, a world model answers the question: "If I take this action in this situation, what will happen next?" The concept is inspired by the cognitive science idea that humans carry mental models of the world in their heads, constantly running internal simulations to anticipate outcomes before acting. The term was popularized for modern AI by a 2018 paper from David Ha and Jürgen Schmidhuber, and by the mid-2020s world models had become the organizing thesis behind systems such as DeepMind's Dreamer and Genie, Meta's V-JEPA, OpenAI's Sora, NVIDIA's Cosmos, and Fei-Fei Li's World Labs [1][3].
World models have become one of the most actively discussed topics in AI research as of 2025-2026. Their appeal is straightforward: an agent that understands the dynamics of its environment can plan ahead, reason about cause and effect, and generalize to new situations far more efficiently than one that relies purely on trial and error. The concept spans multiple subfields, from model-based reinforcement learning (where a learned dynamics model reduces the need for real-world interaction) to video prediction systems (where models forecast future video frames) to interactive world generators (where users can explore AI-generated environments in real time). The term has also become entangled with the debate over whether large video generation models like Sora constitute genuine world models or are simply sophisticated pattern matchers [1].
A world model is a learned, predictive model of an environment's dynamics: given a current state (and optionally an action), it predicts the next state. Unlike a fixed physics engine, a world model is learned from data, usually in a compressed latent space rather than over raw pixels. Three properties distinguish a world model from a plain generative video model: (1) it represents state, often as a compact latent vector; (2) it predicts forward in time; and (3) in its strongest form it is action-conditioned, so an agent can ask counterfactual "what if I do X" questions and use the answers to plan. Systems that only generate a single plausible video from a prompt, without responding to actions, sit at the weak end of this definition.
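The three properties above can be made concrete with a minimal sketch. The `toy_world_model` function and its point-mass dynamics are hypothetical stand-ins written by hand for illustration; an actual world model would learn this mapping from data.

```python
def toy_world_model(state, action):
    """Predict the next state from the current state and an action.

    Hypothetical hand-written dynamics: the state is a compact
    (position, velocity) pair and the action acts as an acceleration.
    """
    position, velocity = state
    velocity = velocity + 0.1 * action
    position = position + velocity
    return (position, velocity)


def rollout(model, state, actions):
    """Predict forward in time by feeding each prediction back into the model."""
    trajectory = [state]
    for action in actions:
        state = model(state, action)
        trajectory.append(state)
    return trajectory


start = (0.0, 0.0)
# Two counterfactual "what if I do X" queries from the same starting state.
push = rollout(toy_world_model, start, [1.0, 1.0, 1.0])
coast = rollout(toy_world_model, start, [0.0, 0.0, 0.0])
```

Comparing `push[-1]` with `coast[-1]` is the kind of action-conditioned comparison a planner relies on; a model that produced a single video with no `action` argument could not answer it.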
The notion that intelligent agents should build internal models of their environments is not new. In control theory, model predictive control (MPC) has used explicit dynamical models for planning since the 1960s. In cognitive science, the idea of a "mental model" for planning dates at least to Kenneth Craik's 1943 book The Nature of Explanation, which argued that organisms carry "small-scale models" of the external world in their heads and use them to try out alternatives before acting. Ha and Schmidhuber framed their own work with a related line from the systems theorist Jay Wright Forrester: "The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system" [3].
In reinforcement learning, the distinction between model-free and model-based methods has been central for decades. Model-free methods (like Q-learning and policy gradient algorithms) learn to act directly from experience without building an explicit model of environment dynamics. Model-based methods learn a transition model (predicting the next state given the current state and action) and use it to plan or generate synthetic training data. Model-based approaches tend to be more sample-efficient because the agent can "imagine" many trajectories without actually taking them, but they are only as good as the accuracy of the learned model [2].
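As a rough illustration of the model-based recipe, the sketch below collects a few real transitions, fits a transition model, and then "imagines" a trajectory without touching the environment again. The linear environment, its coefficients, and all function names are assumptions made for this example, and least squares stands in for whatever learner a real agent would use.

```python
import random


def real_env_step(s, u):
    """Hypothetical ground-truth dynamics, unknown to the agent."""
    return 0.9 * s + 0.5 * u


def fit_transition_model(transitions):
    """Least-squares fit of s_next ~ a*s + b*u via the 2x2 normal equations."""
    sss = sum(s * s for s, u, n in transitions)
    ssu = sum(s * u for s, u, n in transitions)
    suu = sum(u * u for s, u, n in transitions)
    ssn = sum(s * n for s, u, n in transitions)
    sun = sum(u * n for s, u, n in transitions)
    det = sss * suu - ssu * ssu
    a = (ssn * suu - ssu * sun) / det
    b = (sss * sun - ssu * ssn) / det
    return a, b


# 1. Collect a small amount of real experience with random actions.
rng = random.Random(0)
transitions, s = [], 1.0
for _ in range(50):
    u = rng.uniform(-1.0, 1.0)
    s_next = real_env_step(s, u)
    transitions.append((s, u, s_next))
    s = s_next

# 2. Learn the transition model from that experience.
a, b = fit_transition_model(transitions)


# 3. Imagine a trajectory using only the learned model.
def imagine(s, actions):
    trajectory = [s]
    for u in actions:
        s = a * s + b * u
        trajectory.append(s)
    return trajectory


imagined = imagine(1.0, [1.0, 1.0, 1.0])
```

Because the toy environment is noise-free and exactly linear, the fit recovers the true coefficients. With a real environment the fit would be approximate, which is where the accuracy caveat in the paragraph above applies.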
The paper that brought the term "world model" into common usage in the deep learning community was "World Models" by David Ha and Jürgen Schmidhuber, published in 2018. The paper proposed a three-component architecture for RL agents:
| Component | Architecture | Function |
|---|---|---|
| Vision model (V) | Variational autoencoder (VAE) | Compresses raw observations (images) into a compact latent representation (32 dimensions for CarRacing, 64 for VizDoom) |
| Memory model (M) | Recurrent neural network (MDN-RNN) | Predicts future latent states; captures temporal dynamics |
| Controller (C) | Small linear model | Maps the current latent state and memory to actions (just 867 parameters for CarRacing, 1,088 for VizDoom) |
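The CarRacing parameter count in the table can be reproduced with a short calculation. It assumes a 256-unit RNN hidden state and three continuous actions, neither of which this section states, so both figures should be treated as assumptions; the helper name is hypothetical.

```python
def linear_controller_params(latent_dim, hidden_dim, action_dim):
    """Weights plus biases of a single linear layer mapping [z, h] to actions."""
    n_inputs = latent_dim + hidden_dim
    return n_inputs * action_dim + action_dim


# latent_dim=32 comes from the table; hidden_dim and action_dim are assumed.
car_racing_params = linear_controller_params(latent_dim=32, hidden_dim=256, action_dim=3)
# (32 + 256) * 3 + 3 = 867
```

A controller this small is what made it practical to leave almost all of the model capacity in the V and M components.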
The key insight was that the controller could be trained entirely inside the "dream" of the world model: the VAE and RNN together formed a generative model of the environment, and the controller learned a policy by interacting with this internal simulation rather than the real environment. As the authors put it, "We can even train our agent entirely inside of its own dream environment generated by its world model, and transfer this policy back into the actual environment" [3]. Ha and Schmidhuber demonstrated this on the VizDoom: Take Cover task and a car racing task. On CarRacing-v0, the agent scored 906 ± 21, making it the first published method to solve the task (defined as an average score above 900 over 100 trials). Trained purely inside its dream and transferred back, the VizDoom agent scored 1092 ± 556, well above the threshold of 750 considered to solve the environment [3].
The paper also explored a provocative idea: what happens when the agent trains purely in its learned model without ever interacting with the real environment? They showed this was possible but highlighted a limitation. Since the agent could exploit inaccuracies in the learned model (finding "cheats" that work in the dream but not in reality), pure dream training required careful regularization.
Ha and Schmidhuber's work showed that a learned latent dynamics model could be sufficient for control, but it also exposed the cost of modularity: the three components were trained separately, so the system could not fine-tune representations end-to-end. This limitation motivated the next generation of world model research [3].
Danijar Hafner and collaborators at Google and the University of Toronto developed the Dreamer series of world model agents, which addressed the limitations of the Ha-Schmidhuber approach by training all components jointly and using more sophisticated policy optimization.
| Version | Year | Key advance | Notable result |
|---|---|---|---|
| PlaNet and Dreamer (v1) | 2019 | Recurrent State-Space Model (RSSM) for latent dynamics; end-to-end differentiable | Learned control from pixels in DeepMind Control Suite tasks |
| DreamerV2 | 2020 | Discrete latent representations; KL balancing | First world model agent to achieve human-level performance on the 55-game Atari benchmark |
| DreamerV3 | 2023 | Normalization and transformation techniques for stable learning across domains | First algorithm to collect diamonds in Minecraft from scratch without human data; outperformed MuZero with far less compute |
Dreamer works by learning a world model that predicts future latent states from current states and actions. An actor-critic policy is then trained by "imagining" trajectories inside this model, with rewards predicted by the model itself. Because all components are differentiable, gradients flow from the imagined rewards back through the dynamics model and into the policy, enabling efficient end-to-end optimization [4].
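The idea of improving a policy from imagined rewards can be sketched on a scalar toy problem. Everything here is an assumption for illustration: the linear latent dynamics, the quadratic reward, the one-parameter actor, and the hand-derived forward-mode derivatives with a backtracking step size. Dreamer itself uses neural networks, automatic differentiation, and a learned critic.

```python
GAMMA = 0.95
HORIZON = 15


def imagined_return_and_grad(k, z0=0.0):
    """Roll the actor a = k * (1 - z) through the model.

    Returns the discounted imagined return J and dJ/dk, obtained by
    propagating dz/dk through the dynamics alongside z.
    """
    z, dz = z0, 0.0
    ret, dret = 0.0, 0.0
    for t in range(HORIZON):
        a = k * (1.0 - z)                  # actor (hypothetical form)
        da = (1.0 - z) - k * dz
        z, dz = 0.9 * z + 0.5 * a, 0.9 * dz + 0.5 * da  # latent dynamics model
        err = z - 1.0
        r = -err * err                     # reward predicted by the model
        dr = -2.0 * err * dz
        ret += GAMMA ** t * r
        dret += GAMMA ** t * dr
    return ret, dret


k = 0.0
initial_return, _ = imagined_return_and_grad(k)
for _ in range(100):
    ret, grad = imagined_return_and_grad(k)
    step = 0.01
    # Backtracking: shrink the step until the imagined return improves.
    while step > 1e-8 and imagined_return_and_grad(k + step * grad)[0] <= ret:
        step *= 0.5
    if step <= 1e-8:
        break
    k += step * grad
final_return, _ = imagined_return_and_grad(k)
```

No environment step is taken anywhere in the loop: the actor parameter `k` is updated purely from trajectories generated by the model, which is the property the paragraph above describes.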
DreamerV3 was particularly notable for its generality. A single, fixed set of hyperparameters worked across more than 150 tasks spanning continuous control, Atari games, procedurally generated environments, and the open world of Minecraft. Collecting diamonds in Minecraft is a long-horizon challenge that requires finding wood, crafting a pickaxe, mining stone, upgrading tools, locating iron, smelting it, and finally mining diamond ore, all from pixel observations with sparse rewards. No prior algorithm had accomplished this without human demonstrations or hand-crafted reward shaping. In the published runs, 24 of 40 random seeds collected at least one diamond within 100 million environment steps, with the first diamond appearing after about 29 million steps. The DreamerV3 results were published in Nature in 2025, where the authors wrote that Dreamer is "the first algorithm to collect diamonds in Minecraft from scratch without human data" [4].
Yann LeCun, Meta's former VP and Chief AI Scientist, has been one of the most vocal proponents of world models as the path to machine intelligence. LeCun has argued repeatedly that large language models (LLMs), despite their impressive text generation abilities, will never achieve genuine understanding of the physical world because they operate only on discrete tokens and lack the ability to predict continuous, high-dimensional sensory states [5].
LeCun's proposed alternative is the Joint Embedding Predictive Architecture (JEPA). The core idea is that instead of predicting raw pixels or tokens (which is computationally expensive and forces the model to predict every irrelevant detail), a JEPA-based system predicts in a learned abstract representation space. Two encoder networks process inputs (for example, two different views of a scene, or a current frame and a future frame), and a predictor network learns to predict the representation of one from the other.
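The difference between predicting in representation space and predicting raw inputs can be illustrated with a toy example. The encoder and predictor below are fixed, hand-written functions chosen for illustration (in a JEPA they are trained networks), and the four-element "views" are hypothetical: two content values followed by two texture-like nuisance values.

```python
def encode(x):
    """Hypothetical fixed encoder: keeps two content features, drops the rest."""
    return (x[0] + x[1], x[0] - x[1])


def predict(representation):
    """Hypothetical predictor; the identity suffices because both views share content."""
    return representation


def squared_distance(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


# Two views of the same scene: identical content, different nuisance detail.
context_view = [1.0, 2.0, 0.3, -0.7]
target_view = [1.0, 2.0, -0.5, 0.9]

# JEPA-style objective: compare predicted and actual target representations.
latent_loss = squared_distance(predict(encode(context_view)), encode(target_view))

# Pixel-style objective: compare the raw inputs element by element.
pixel_loss = squared_distance(context_view, target_view)
```

Here `latent_loss` is zero while `pixel_loss` is not, because the raw-input objective is penalized for nuisance detail that the representation discards.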
The approach has been implemented in a sequence of published models from Meta:
| Model | Year | Domain | Description |
|---|---|---|---|
| I-JEPA | 2023 | Images | Predicts abstract representations of masked image regions from surrounding context; no pixel-level reconstruction |
| V-JEPA | 2024 | Video | Predicts abstract representations of masked video segments; learns temporal dynamics without generating pixels |
| V-JEPA 2 | 2025 | Video + robot action | 1.2B-parameter world model pre-trained on 1M+ hours of video; an action-conditioned variant enables zero-shot robot planning |
I-JEPA (Image Joint Embedding Predictive Architecture) learns by masking large portions of an image and predicting the representation of the masked region from the visible context. Because it operates in representation space rather than pixel space, it can focus on semantic and structural information rather than low-level textures. V-JEPA extends this to video, learning to predict missing temporal segments in representation space, which forces the model to learn about motion, object permanence, and physical dynamics [6].
Released on June 11, 2025, V-JEPA 2 is a 1.2-billion-parameter video world model pre-trained on more than 1 million hours of internet video plus 1 million images, with no action labels in the first stage. Meta described it as "a world model trained on video that enables state-of-the-art understanding and prediction, as well as zero-shot planning and robot control in new environments" [15]. A second, action-conditioned model, V-JEPA 2-AC, was post-trained on fewer than 62 hours of unlabeled robot video from the open DROID dataset, then deployed zero-shot on Franka robot arms in two different labs to pick and place objects by planning toward image goals. On benchmarks, V-JEPA 2 reported 77.3% top-1 accuracy on Something-Something v2 (motion understanding) and 39.7 recall-at-5 on Epic-Kitchens-100 (human action anticipation) [15]. V-JEPA 2 is the clearest demonstration to date of the JEPA thesis: a model that learns physical dynamics in latent space from passive video, then transfers to a robot with minimal action data.
LeCun views JEPA as a stepping stone toward what he calls Autonomous Machine Intelligence (AMI), a system architecture in which a world model sits at the center, surrounded by modules for perception, memory, cost estimation, and action. In September 2025, LeCun launched AMI Labs at Meta to pursue this vision, representing what has been described as the largest corporate bet on the thesis that the path to general intelligence runs through world models rather than next-token prediction [5].
The term "world model" is used to describe several related but distinct approaches:
Model-based RL world models are dynamics models learned within a reinforcement learning framework. The model takes a state and action as input and predicts the next state (and often the reward). The agent uses this model to plan by simulating future trajectories internally. Examples include the Dreamer series, MuZero (which learns a latent dynamics model for board games and Atari), and various model-based approaches used in robotics.
The strengths of this approach are sample efficiency (fewer real-world interactions needed) and the ability to plan ahead. The weakness is that errors in the model compound over long horizons: a small prediction error at each step can accumulate into a wildly inaccurate trajectory after many steps, a problem known as compounding model error [2].
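A small numerical sketch shows how a modest one-step error grows with the rollout horizon. The growth rates are arbitrary values chosen for illustration, with the learned model assumed to be slightly wrong about a simple multiplicative process.

```python
TRUE_GROWTH = 1.05   # hypothetical real dynamics: x grows 5% per step
MODEL_GROWTH = 1.06  # hypothetical learned model: slightly overestimates


def relative_error_after(horizon, x0=1.0):
    """Relative gap between the model rollout and the true trajectory."""
    true_x, model_x = x0, x0
    for _ in range(horizon):
        true_x *= TRUE_GROWTH
        model_x *= MODEL_GROWTH
    return abs(model_x - true_x) / abs(true_x)


errors = {h: relative_error_after(h) for h in (1, 10, 50)}
```

Under these assumptions the relative error is a little under 1% after one step, about 10% after ten, and about 60% after fifty. This is one reason practical systems often keep imagined rollouts short or replan frequently.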
Video prediction models take a sequence of video frames (and sometimes conditioning signals like text or actions) and generate future frames. These models learn to predict how visual scenes evolve over time, capturing information about object motion, occlusion, and scene dynamics.
Models in this category include:
Whether video prediction models are truly world models is a subject of ongoing debate (discussed in a later section).
Interactive world generators go beyond passive frame prediction by allowing interactive exploration. A user or agent can take actions within the generated world, and the model produces the next state in response, functioning like an AI-generated video game or simulator.
Google DeepMind's Genie and Genie 2 are the most prominent examples. NVIDIA's Cosmos platform represents a commercial approach, providing world foundation models specifically designed for physical AI applications like robotics and autonomous driving.
As described above, JEPA-based models predict in an abstract representation space rather than in pixel space. This avoids the computational burden and noise of pixel-level prediction while (in theory) focusing the model on the aspects of the world that matter for decision-making. With the release of V-JEPA 2 in 2025, this approach moved from pure research demos toward zero-shot robot control, though it remains earlier in deployment than commercial video-generation systems. I-JEPA, V-JEPA, and V-JEPA 2 are the main published examples.
When OpenAI introduced Sora in February 2024, its technical report described the model as a "world simulator," arguing that video generation models trained at sufficient scale would implicitly learn to simulate the physical world. Sora can generate photorealistic videos from text prompts, depicting complex scenes with moving objects, changing lighting, and plausible (if not always physically accurate) interactions [7].
The claim provoked significant debate. Supporters argued that a model capable of generating coherent video must have learned something about how the world works: objects fall when dropped, cars move along roads, water flows downhill. If the model can consistently generate physically plausible outcomes, it has, in some functional sense, learned physics.
Critics offered several counterarguments:
The Physics-IQ authors summarized the gap bluntly: "visual realism does not imply physical understanding" [8]. The debate is not merely academic. If video generation models are genuine world models, then scaling up video generation (as OpenAI, Google, and others are doing) is a path toward AI systems that understand the physical world. If they are not, the field needs fundamentally different architectures.
A useful way to think about the debate is as a spectrum rather than a binary:
| Level | Capability | Example |
|---|---|---|
| Frame interpolation | Predicting the next frame given previous frames; no understanding of physics | Simple video codecs |
| Statistical video generation | Generating plausible video from text or context; learns correlations in visual patterns | Sora, Runway Gen-3 |
| Stylized physics | Understanding that dropped things fall and rolling things move, without precise equations | Current best world models |
| Approximate physical simulation | Predicting outcomes of interactions with reasonable accuracy; responds to action conditioning | Research frontier (Genie 3, advanced model-based RL) |
| Precise physical simulation | Accurate physics with correct equations of motion | Traditional physics engines (not learned) |
Current video generation models operate at the "statistical video generation" level, occasionally reaching "stylized physics." Current model-based RL world models and interactive systems like Genie operate closer to "stylized physics" or "approximate physical simulation" for restricted domains.
Google DeepMind's Genie project represents one of the most ambitious efforts to build interactive world models.
The original Genie, published in February 2024, is an 11-billion-parameter model trained on a filtered set of 30,000 hours of unlabeled internet gameplay video (curated from a much larger pool of public 2D platformer footage). It learned a latent action space (a set of abstract "controls") entirely from watching videos, without any labeled action data. Users could provide a single image (a photo, a sketch, or an AI-generated scene), and Genie would generate an interactive 2D environment that could be explored using the learned controls. DeepMind described Genie as "the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos" [9].
Genie was notable for several reasons. It demonstrated that interactive world models could be learned from passive video without action labels. It showed that a single model could generate diverse 2D platformer-style worlds. And it introduced the idea of "world generation" as distinct from "video generation": the output was not a pre-determined video but an environment that responded to user input.
Genie 2, announced in December 2024, extended the concept to 3D environments. From a single image and optional text description, Genie 2 generates an interactive 3D world that users can explore using a keyboard or mouse. The generated environments include object interactions (opening doors, bursting balloons), animated characters and NPCs, lighting and reflections, and basic physics simulation [10].
Technically, Genie 2 uses an autoregressive latent diffusion model that generates the world frame by frame, simulating the consequences of each user action. It maintains memory of parts of the scene that are not currently visible and renders them accurately when they come back into view. The model was trained on video data and does not use a traditional rendering engine [10].
DeepMind positioned Genie 2 as useful for training and evaluating AI agents: rather than building handcrafted simulation environments, researchers could generate an unlimited curriculum of novel worlds for agents to explore.
Genie 3, released in August 2025, was described by DeepMind as its "first world model to allow interaction in real-time." The post states that "given a text prompt, Genie 3 can generate dynamic worlds that you can navigate in real time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p" [11]. The system maintains visual memory of the environment "extending as far back as one minute ago," so scenes stay consistent as a user looks away and back. Genie 3 also introduced "promptable world events," letting a user change the running world on the fly, for example altering weather or introducing new objects and characters. DeepMind summarized it as "a general purpose world model that can generate an unprecedented diversity of interactive environments," positioning it as a tool for training and evaluating embodied agents [11].
World Labs, co-founded by Fei-Fei Li, pursues a distinct interpretation of world models centered on "spatial intelligence": building models that perceive, generate, reason about, and interact with the 3D world. The company emerged from stealth in September 2024 with $230 million in funding from backers including Andreessen Horowitz, NVIDIA's venture arm NVentures, and Radical Ventures, at a reported valuation of around $1 billion [16]. World Labs describes its products as "large world models."
In November 2025, World Labs launched its first commercial product, Marble, which generates persistent, downloadable 3D environments from a single image or text prompt [17]. Unlike systems such as Genie that synthesize frames on the fly as a user explores, Marble produces a fixed 3D scene up front, which reduces the "morphing" and drift seen in autoregressive video world models and lets users export the result as Gaussian splats, meshes, or video. World Labs thus represents a 3D-native, geometry-first approach to world modeling, contrasting with the video-prediction approaches of Sora and Genie and the latent-prediction approach of JEPA.
NVIDIA launched the Cosmos platform at CES 2025 as a suite of world foundation models (WFMs) designed for physical AI development. Unlike research-oriented projects like Genie, Cosmos is aimed at commercial applications, particularly autonomous driving and robotics.
Cosmos models generate physics-based videos from combinations of text, image, video, robot sensor data, and motion data. They are trained to handle physically based interactions, object permanence, and realistic rendering of industrial environments (warehouses, factories) and driving environments (roads, weather conditions, lighting variations) [12].
For autonomous vehicles, Cosmos integrates with NVIDIA's Omniverse simulation platform. Developers can use Cosmos Transfer to amplify variations of sensor data, turning thousands of real-world miles of driving data into billions of virtually driven miles. This data flywheel approach addresses one of the biggest bottlenecks in autonomous driving development: the need for vast amounts of diverse training data [12].
NVIDIA released Cosmos as an open platform, and early adopters include 1X Technologies, Agility Robotics, Figure AI, Foretellix (for autonomous vehicle testing), Skild AI, and Uber. The company also released Cosmos tokenizers (for converting continuous data into discrete tokens suitable for transformer-based models) and guardrails tools [12].
A major release of Cosmos in March 2025 expanded the model suite and physical AI data tools, coinciding with NVIDIA's broader push into what CEO Jensen Huang calls "physical AI," the application of AI to systems that interact with the physical world [12].
Whether video generation models count as world models has become one of the most contested questions in AI research. The arguments break down roughly as follows:
Arguments that video generation models are (or can become) world models:
Arguments that video generation models are not world models:
A paper published in 2024 by researchers from several institutions, titled "Sora and V-JEPA Have Not Learned The Complete Real World Model," argued that neither generative video models nor JEPA-style models had yet achieved genuine world understanding, and that both approaches had fundamental limitations that needed to be addressed [13].
The truth likely lies between the extremes. Video generation models have learned some aspects of world dynamics (enough to generate plausible videos), but fall short of the kind of accurate, generalizable, action-conditioned physical reasoning that would qualify as a true world model. The field is moving toward hybrid approaches that combine the visual generation capabilities of diffusion models with the interactive, action-conditioned structure of RL-based world models.
The original and most established application of world models is in RL. An agent with an accurate world model can "think ahead" by simulating future trajectories in its model before acting, reducing the amount of real-world interaction needed. This is the approach used by Dreamer, MuZero, and many robotics systems.
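The "think ahead" loop can be sketched as a tiny planner that scores every short action sequence inside the model, executes only the first action of the best one, and then replans. This is an exhaustive-search toy in the spirit of model predictive control, not the planner used by Dreamer or MuZero. The one-dimensional task, the cost function, and the assumption that the model matches the real environment exactly are all simplifications made for the example.

```python
from itertools import product

ACTIONS = (-1, 0, 1)
GOAL = 3
HORIZON = 4


def world_model(position, action):
    """Hypothetical dynamics model, assumed here to be perfectly accurate."""
    return position + action


def plan_cost(position, actions):
    """Simulate an action sequence in the model and sum the distance to the goal."""
    cost = 0
    for action in actions:
        position = world_model(position, action)
        cost += abs(position - GOAL)
    return cost


def plan(position):
    """Return the lowest-cost action sequence among all candidates."""
    return min(product(ACTIONS, repeat=HORIZON),
               key=lambda seq: plan_cost(position, seq))


position = 0
visited = [position]
for _ in range(5):
    first_action = plan(position)[0]
    position = position + first_action  # real step; assumed to match the model
    visited.append(position)

print(visited)  # [0, 1, 2, 3, 3, 3]
```

Each real step here is preceded by 81 imagined rollouts, which is the trade the paragraph describes: more computation inside the model in exchange for fewer interactions with the environment.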
World models for autonomous driving predict how traffic scenes will evolve: where other vehicles will go, how pedestrians will move, and what will happen if the ego vehicle takes a particular action. NVIDIA's Cosmos, Wayve's GAIA-1, and various academic projects pursue this direction. The appeal is that a world model can generate unlimited training scenarios, including rare dangerous situations that are hard to encounter (or safely create) in real-world driving [12].
In robotics, world models help robots predict the outcomes of manipulation actions ("If I push this object, where will it end up?") and plan multi-step tasks. The combination of world models with language models (as in SayCan-style systems) allows robots to plan at multiple levels of abstraction: the language model decomposes a task into steps, and the world model simulates whether each step is likely to succeed. V-JEPA 2-AC's zero-shot pick-and-place on real Franka arms, using fewer than 62 hours of robot video, illustrates how a video-trained world model can be retargeted to physical control with minimal action data [15].
Genie and Genie 2 demonstrate the potential for world models to generate interactive environments for gaming, training, and evaluation. Instead of hand-crafting game levels or simulation scenarios, developers could use world models to generate limitless variations, potentially reducing the cost of content creation and testing.
World models can be applied to any domain where predicting future states from current conditions is valuable: weather forecasting, economic modeling, protein dynamics, and more. Google DeepMind's GraphCast weather model and similar systems share the underlying principle of learning dynamics from data to predict future states.
As of early 2026, several major research groups are pursuing distinct approaches to world models:
| Group | Approach | Philosophy |
|---|---|---|
| Yann LeCun / Meta AMI Labs | JEPA (Joint Embedding Predictive Architecture); V-JEPA 2 | Predict in abstract representation space, not pixel space; LLMs are insufficient for physical intelligence |
| Google DeepMind (Genie team) | Interactive world generation from video data | Learn to generate explorable 3D environments; useful for training and evaluating agents |
| OpenAI (Sora team) | Large-scale video generation as implicit world modeling | Sufficient scale in video generation may yield emergent world understanding |
| NVIDIA (Cosmos) | Commercial world foundation models for physical AI | Practical tools for autonomous driving and robotics; data amplification for training |
| Danijar Hafner et al. (Dreamer) | Model-based RL with learned latent dynamics | Compact, efficient models for planning and policy optimization in RL |
| Fei-Fei Li / World Labs | Spatial intelligence and 3D world understanding (Marble) | 3D-native, persistent scene generation as a foundation for world modeling |
These approaches are not mutually exclusive, and the eventual winning strategy may combine elements of several. The JEPA approach and the Dreamer approach share the idea of operating in learned representation spaces. The Genie approach and the Sora approach share the idea of learning from large-scale video data. Cosmos bridges research and commercial application [14].
World models are at an inflection point. The concept has moved from a niche topic in model-based RL to a central theme in AI research, driven by several converging trends:
The field is likely to evolve rapidly over the next few years. If JEPA-style architectures or Dreamer-style models can be scaled to handle the complexity of the real world, they could enable a new generation of AI agents that genuinely understand their environments. If video generation models prove to be a dead end for world understanding (as some critics predict), the field may pivot toward more structured approaches that incorporate explicit physical reasoning. In either case, world models remain central to the broader goal of building AI systems that can operate effectively in the physical world.