A world model is an artificial intelligence system that learns an internal representation of how an environment works, enabling it to predict future states, simulate the consequences of actions, and support planning without needing to interact with the real environment at every step. In the simplest terms, a world model answers the question: "If I take this action in this situation, what will happen next?" The concept is inspired by the cognitive science idea that humans carry mental models of the world in their heads, constantly running internal simulations to anticipate outcomes before acting.
World models have become one of the most actively discussed topics in AI research as of 2025-2026. Their appeal is straightforward: an agent that understands the dynamics of its environment can plan ahead, reason about cause and effect, and generalize to new situations far more efficiently than one that relies purely on trial and error. The concept spans multiple subfields, from model-based reinforcement learning (where a learned dynamics model reduces the need for real-world interaction) to video prediction systems (where models forecast future video frames) to interactive world generators (where users can explore AI-generated environments in real time). The term has also become entangled with the debate over whether large video generation models like Sora constitute genuine world models or are simply sophisticated pattern matchers [1].
The notion that intelligent agents should build internal models of their environments is not new. In control theory, model predictive control (MPC) has used explicit dynamical models for planning since the 1960s. In AI, the idea of a "mental model" for planning dates at least to Kenneth Craik's 1943 book The Nature of Explanation, which argued that organisms carry "small-scale models" of the external world in their heads and use them to try out alternatives before acting.
In reinforcement learning, the distinction between model-free and model-based methods has been central for decades. Model-free methods (like Q-learning and policy gradient algorithms) learn to act directly from experience without building an explicit model of environment dynamics. Model-based methods learn a transition model (predicting the next state given the current state and action) and use it to plan or generate synthetic training data. Model-based approaches tend to be more sample-efficient because the agent can "imagine" many trajectories without actually taking them, but they are only as good as the accuracy of the learned model [2].
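This trade-off can be made concrete with a Dyna-style tabular sketch (the five-state chain below is a toy example invented for illustration, not from the cited work): the agent stores observed transitions as its "model" and performs extra value updates on transitions replayed from that model, so each real step fuels many imagined ones.

```python
import random

# Dyna-style sketch: a tabular model records observed (state, action) ->
# (next_state, reward) transitions, and the agent "imagines" extra Q-updates
# from the model instead of acting in the real environment.
random.seed(0)

# Toy 5-state chain: action +1 / -1 moves right / left; reward on reaching state 4.
def real_step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (1.0 if s2 == 4 else 0.0)

Q = {(s, a): 0.0 for s in range(5) for a in (-1, 1)}
model = {}  # (s, a) -> (s2, r), learned from real experience

def q_update(s, a, r, s2, alpha=0.5, gamma=0.9):
    best_next = max(Q[(s2, b)] for b in (-1, 1))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

s = 0
for _ in range(200):            # real interaction
    a = random.choice((-1, 1))
    s2, r = real_step(s, a)
    model[(s, a)] = (s2, r)     # learn the transition model
    q_update(s, a, r, s2)
    for _ in range(10):         # planning: replay imagined transitions
        (ms, ma), (ms2, mr) = random.choice(list(model.items()))
        q_update(ms, ma, mr, ms2)
    s = s2 if s2 != 4 else 0    # reset after reaching the goal

# The greedy policy now moves right from every non-terminal state.
print(all(Q[(s, 1)] > Q[(s, -1)] for s in range(4)))
```

Here ten imagined updates follow every real step, which is the sense in which model-based methods stretch limited experience; the caveat in the text applies equally, since replaying a wrong model would propagate its errors into the values.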
The paper that brought the term "world model" into common usage in the deep learning community was "World Models" by David Ha and Jürgen Schmidhuber, published in 2018. The paper proposed a three-component architecture for RL agents:
| Component | Architecture | Function |
|---|---|---|
| Vision model (V) | Variational autoencoder (VAE) | Compresses raw observations (images) into a compact latent representation |
| Memory model (M) | Recurrent neural network (MDN-RNN) | Predicts future latent states; captures temporal dynamics |
| Controller (C) | Small linear model | Maps the current latent state and memory to actions |
The key insight was that the controller could be trained entirely inside the "dream" of the world model: the VAE and RNN together formed a generative model of the environment, and the controller learned a policy by interacting with this internal simulation rather than the real environment. Ha and Schmidhuber demonstrated this on the VizDoom first-person shooter game and a car racing task. The agent could learn effective policies by training inside its own hallucinated version of the environment [3].
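The V-M-C data flow can be caricatured in a few lines (all shapes and weights below are invented stand-ins; the real V is a variational autoencoder and M a mixture-density RNN, neither of which is implemented here):

```python
import numpy as np

# Toy sketch of the Ha-Schmidhuber V-M-C pipeline with random linear layers
# standing in for the trained components.
rng = np.random.default_rng(0)

OBS, Z, H, A = 64, 8, 16, 2          # observation, latent, hidden, action dims

W_enc = rng.normal(0, 0.1, (Z, OBS))          # stand-in for the VAE encoder (V)
W_h   = rng.normal(0, 0.1, (H, H + Z + A))    # stand-in for the MDN-RNN cell (M)
W_c   = rng.normal(0, 0.1, (A, Z + H))        # linear controller (C)

def encode(obs):                  # V: compress raw observation to latent z
    return np.tanh(W_enc @ obs)

def memory_step(h, z, a):         # M: update recurrent state from (h, z, a)
    return np.tanh(W_h @ np.concatenate([h, z, a]))

def act(z, h):                    # C: action from current latent + memory
    return np.tanh(W_c @ np.concatenate([z, h]))

h, a = np.zeros(H), np.zeros(A)
for _ in range(5):                # roll the perceive-act-remember loop
    z = encode(rng.normal(size=OBS))   # random "observations" for illustration
    a = act(z, h)
    h = memory_step(h, z, a)

print(a.shape)   # (2,)
```

The structural point survives the simplification: C sees only the compact `(z, h)` summary, never raw pixels, which is why it can be a tiny linear model, and why V and M together can substitute for the environment during training.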
The paper also explored a provocative idea: what happens when the agent trains purely in its learned model without ever interacting with the real environment? They showed this was possible but highlighted a limitation. Since the agent could exploit inaccuracies in the learned model (finding "cheats" that work in the dream but not in reality), pure dream training required careful regularization.
Ha and Schmidhuber's work demonstrated that a learned latent dynamics model could be sufficient for control, but it also exposed the cost of modularity: the three components were trained separately, so the system could not fine-tune representations end-to-end. This limitation motivated the next generation of world model research [3].
Danijar Hafner and collaborators at Google and the University of Toronto developed the Dreamer series of world model agents, which addressed the limitations of the Ha-Schmidhuber approach by training all components jointly and using more sophisticated policy optimization.
| Version | Year | Key advance | Notable result |
|---|---|---|---|
| Dreamer (PlaNet) | 2019 | Recurrent State-Space Model (RSSM) for latent dynamics; end-to-end differentiable | Learned control from pixels in DeepMind Control Suite tasks |
| DreamerV2 | 2020 | Discrete latent representations; KL balancing | First world model agent to achieve human-level performance on the Atari benchmark |
| DreamerV3 | 2023 | Normalization and transformation techniques for stable learning across domains | First algorithm to collect diamonds in Minecraft from scratch without human data; outperformed MuZero with far less compute |
Dreamer works by learning a world model that predicts future latent states from current states and actions. An actor-critic policy is then trained by "imagining" trajectories inside this model, with rewards predicted by the model itself. Because all components are differentiable, gradients flow from the imagined rewards back through the dynamics model and into the policy, enabling efficient end-to-end optimization [4].
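The imagination-based training loop can be illustrated with a deliberately tiny stand-in (the linear dynamics, reward function, and policy below are invented, and finite differences stand in for backpropagation through the learned RSSM):

```python
# Sketch of Dreamer-style "training in imagination": roll a policy through a
# model's latent dynamics, sum model-predicted rewards, and ascend the gradient
# of that imagined return with respect to the policy parameter.
def imagine_return(theta, s0=2.0, horizon=15, gamma=0.95):
    s, ret, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = theta * s                 # linear policy a = theta * s
        s = 0.9 * s + 0.5 * a         # imagined latent transition (toy model)
        ret += disc * (-s * s)        # model-predicted reward: stay near 0
        disc *= gamma
    return ret

# Gradient ascent on the *imagined* return; no environment steps are taken.
theta, lr, eps = 0.0, 0.02, 1e-4
for _ in range(200):
    g = (imagine_return(theta + eps) - imagine_return(theta - eps)) / (2 * eps)
    theta += lr * g

# The optimum cancels the drift: s' = (0.9 + 0.5*theta) * s, so theta -> -1.8.
print(round(theta, 2))
```

The toy has a closed-form answer (theta = -1.8 makes the imagined latent collapse to zero, the maximum-reward state), so the optimizer's entire learning signal visibly comes from the model, which is exactly the property the text describes.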
DreamerV3 was particularly notable for its generality. A single set of hyperparameters worked across over 150 tasks spanning continuous control, Atari games, procedurally generated environments, and the open world of Minecraft. Collecting diamonds in Minecraft is a long-horizon challenge that requires finding wood, crafting a pickaxe, mining stone, upgrading tools, locating iron, smelting it, and finally mining diamond ore, all from pixel observations with sparse rewards. No prior algorithm had accomplished this without human demonstrations or hand-crafted reward shaping. DreamerV3's success was published in Nature in 2025 [4].
Yann LeCun, Meta's VP and Chief AI Scientist, has been one of the most vocal proponents of world models as the path to machine intelligence. LeCun has argued repeatedly that large language models (LLMs), despite their impressive text generation abilities, will never achieve genuine understanding of the physical world because they operate only on discrete tokens and lack the ability to predict continuous, high-dimensional sensory states [5].
LeCun's proposed alternative is the Joint Embedding Predictive Architecture (JEPA). The core idea is that instead of predicting raw pixels or tokens (which is computationally expensive and forces the model to predict every irrelevant detail), a JEPA-based system predicts in a learned abstract representation space. Two encoder networks process inputs (for example, two different views of a scene, or a current frame and a future frame), and a predictor network learns to predict the representation of one from the other.
The approach has been implemented in two published models from Meta:
| Model | Year | Domain | Description |
|---|---|---|---|
| I-JEPA | 2023 | Images | Predicts abstract representations of masked image regions from surrounding context; no pixel-level reconstruction |
| V-JEPA | 2024 | Video | Predicts abstract representations of masked video segments; learns temporal dynamics without generating pixels |
I-JEPA (Image Joint Embedding Predictive Architecture) learns by masking large portions of an image and predicting the representation of the masked region from the visible context. Because it operates in representation space rather than pixel space, it can focus on semantic and structural information rather than low-level textures. V-JEPA extends this to video, learning to predict missing temporal segments in representation space, which forces the model to learn about motion, object permanence, and physical dynamics [6].
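The representation-space objective can be sketched as follows (the encoder and predictor weights here are random stand-ins; real I-JEPA uses Vision Transformer encoders with an exponential-moving-average target encoder, none of which is reproduced here):

```python
import numpy as np

# Schematic JEPA-style objective: predict the *representation* of hidden
# content from visible context, with the loss in embedding space, not pixels.
rng = np.random.default_rng(0)
D_IN, D_REP = 32, 8

W_ctx  = rng.normal(0, 0.1, (D_REP, D_IN))    # context encoder
W_tgt  = W_ctx.copy()                         # target encoder (frozen copy here)
W_pred = rng.normal(0, 0.1, (D_REP, D_REP))   # predictor

x = rng.normal(size=D_IN)
mask = np.ones(D_IN)
mask[16:] = 0.0                               # hide half the "image"

z_ctx = np.tanh(W_ctx @ (x * mask))           # encode visible context only
z_tgt = np.tanh(W_tgt @ x)                    # encode the full target
z_hat = W_pred @ z_ctx                        # predict the target representation

# The training signal compares 8-dim embeddings, never the 32 raw inputs,
# so nothing forces the model to reconstruct low-level detail.
loss = float(np.mean((z_hat - z_tgt) ** 2))
print(z_hat.shape, loss >= 0.0)
```

The design choice the sketch highlights is the dimensionality of the loss: an 8-dimensional embedding error rather than a 32-dimensional (or, for video, megapixel-scale) reconstruction error, which is the computational argument LeCun makes for predicting in representation space.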
LeCun views JEPA as a stepping stone toward what he calls Autonomous Machine Intelligence (AMI), a system architecture in which a world model sits at the center, surrounded by modules for perception, memory, cost estimation, and action. In September 2025, LeCun launched AMI Labs at Meta to pursue this vision, representing what has been described as the largest corporate bet on the thesis that the path to general intelligence runs through world models rather than next-token prediction [5].
The term "world model" is used to describe several related but distinct approaches:
The first category comprises dynamics models learned within a reinforcement learning framework. The model takes a state and action as input and predicts the next state (and often the reward). The agent uses this model to plan by simulating future trajectories internally. Examples include the Dreamer series, MuZero (which learns a latent dynamics model for board games and Atari), and various model-based approaches used in robotics.
The strengths of this approach are sample efficiency (fewer real-world interactions are needed) and the ability to plan ahead. The weakness is that errors in the model compound over long horizons: a small prediction error at each step can accumulate into a wildly inaccurate trajectory after many steps, a problem known as compounding model error [2].
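This failure mode is easy to see in a toy calculation (the numbers below are illustrative, not from any real system): a persistent 1% per-step bias in a linear dynamics model inflates a 100-step imagined rollout by a factor of roughly e.

```python
# Compounding model error: a small per-step bias in a learned dynamics model
# grows multiplicatively over the length of an imagined rollout.
def rollout(factor, s0=1.0, steps=100):
    s = s0
    for _ in range(steps):
        s *= factor
    return s

true_s  = rollout(1.00)    # true dynamics: the state is preserved
model_s = rollout(1.01)    # learned model overestimates growth by 1% per step

print(round(model_s / true_s, 1))   # 2.7: the error has e-folded after 100 steps
```

This is why model-based agents typically keep imagined horizons short (Dreamer imagines on the order of tens of steps) or re-ground the rollout in real observations.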
Video prediction models take a sequence of video frames (and sometimes conditioning signals like text or actions) and generate future frames. These models learn to predict how visual scenes evolve over time, capturing information about object motion, occlusion, and scene dynamics.
Models in this category include OpenAI's Sora, Runway's Gen-3, and Wayve's GAIA-1, a generative world model for driving scenes.
Whether video prediction models are truly world models is a subject of ongoing debate (discussed in a later section).
Interactive world generators go beyond passive frame prediction by allowing interactive exploration. A user or agent can take actions within the generated world, and the model produces the next state in response, functioning like an AI-generated video game or simulator.
Google DeepMind's Genie and Genie 2 are the most prominent examples. NVIDIA's Cosmos platform represents a commercial approach, providing world foundation models specifically designed for physical AI applications like robotics and autonomous driving.
As described above, JEPA-based models predict in an abstract representation space rather than in pixel space. This avoids the computational burden and noise of pixel-level prediction while (in theory) focusing the model on the aspects of the world that matter for decision-making. This approach is still primarily in the research phase, with I-JEPA and V-JEPA as the main published examples.
When OpenAI introduced Sora in February 2024, its technical report described the model as a "world simulator," arguing that video generation models trained at sufficient scale would implicitly learn to simulate the physical world. Sora can generate photorealistic videos from text prompts, depicting complex scenes with moving objects, changing lighting, and plausible (if not always physically accurate) interactions [7].
The claim provoked significant debate. Supporters argued that a model capable of generating coherent video must have learned something about how the world works: objects fall when dropped, cars move along roads, water flows downhill. If the model can consistently generate physically plausible outcomes, it has, in some functional sense, learned physics.
Critics offered several counterarguments: generated videos routinely contain physical inconsistencies (objects that morph, appear, or vanish between frames); the models are trained to produce a plausible continuation rather than to answer action-conditioned questions of the form "what happens if I do X?"; and plausible-looking output may reflect memorized visual correlations rather than causal structure [1].
The debate is not merely academic. If video generation models are genuine world models, then scaling up video generation (as OpenAI, Google, and others are doing) is a path toward AI systems that understand the physical world. If they are not, the field needs fundamentally different architectures.
A useful way to think about the debate is as a spectrum rather than a binary:
| Level | Capability | Example |
|---|---|---|
| Next-frame prediction | Predicting the next frame from previous frames using low-level motion cues; no understanding of physics | Motion-compensated video codecs |
| Statistical video generation | Generating plausible video from text or context; learns correlations in visual patterns | Sora, Runway Gen-3 |
| Stylized physics | Understanding that dropped things fall and rolling things move, without precise equations | Current best world models |
| Approximate physical simulation | Predicting outcomes of interactions with reasonable accuracy; responds to action conditioning | Research frontier (Genie 3, advanced model-based RL) |
| Precise physical simulation | Accurate physics with correct equations of motion | Traditional physics engines (not learned) |
Current video generation models operate at the "statistical video generation" level, occasionally reaching "stylized physics." Current model-based RL world models and interactive systems like Genie operate closer to "stylized physics" or "approximate physical simulation" for restricted domains.
Google DeepMind's Genie project represents one of the most ambitious efforts to build interactive world models.
The original Genie, published in February 2024, was trained on 200,000 hours of unlabeled internet gameplay video. It learned a latent action space (a set of abstract "controls") entirely from watching videos, without any labeled action data. Users could provide a single image (a photo, a sketch, or an AI-generated scene), and Genie would generate an interactive 2D environment that could be explored using the learned controls [9].
Genie was notable for several reasons. It demonstrated that interactive world models could be learned from passive video without action labels. It showed that a single model could generate diverse 2D platformer-style worlds. And it introduced the idea of "world generation" as distinct from "video generation": the output was not a pre-determined video but an environment that responded to user input.
Genie 2, announced in December 2024, extended the concept to 3D environments. From a single image and optional text description, Genie 2 generates an interactive 3D world that users can explore using a keyboard or mouse. The generated environments include object interactions (opening doors, bursting balloons), animated characters and NPCs, lighting and reflections, and basic physics simulation [10].
Technically, Genie 2 uses an autoregressive latent diffusion model that generates the world frame by frame, simulating the consequences of each user action. It maintains memory of parts of the scene that are not currently visible and renders them accurately when they come back into view. The model was trained on video data and does not use a traditional rendering engine [10].
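The frame-by-frame interaction loop can be reduced to a minimal interface sketch (the class, its names, and its behavior are all invented for illustration; a real system would denoise the next latent frame conditioned on the action and its memory of the scene rather than concatenate strings):

```python
from dataclasses import dataclass, field

# Schematic of an action-conditioned autoregressive world-generation loop:
# state persists across steps, and each user action determines the next frame.
@dataclass
class ToyWorldModel:
    history: list = field(default_factory=list)   # memory of past frames

    def step(self, action: str) -> str:
        # Stand-in for "generate next frame given (history, action)".
        frame = f"frame{len(self.history)}:{action}"
        self.history.append(frame)
        return frame

world = ToyWorldModel()
frames = [world.step(a) for a in ["forward", "forward", "turn_left"]]
print(frames[-1])   # frame2:turn_left
```

The structural contrast with ordinary video generation is the `action` argument and the persistent `history`: the output is not a predetermined clip but a state machine driven by the user, which is what the text means by "world generation" as distinct from "video generation."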
DeepMind positioned Genie 2 as useful for training and evaluating AI agents: rather than building handcrafted simulation environments, researchers could generate an unlimited curriculum of novel worlds for agents to explore.
Genie 3, released in August 2025, was described as the first real-time interactive general-purpose world model. It generates scenes at 720p resolution and 24 frames per second, enabling users to enter and interact with generated scenes in real time. Genie 3 represents the current frontier of interactive world generation from Google DeepMind [11].
NVIDIA launched the Cosmos platform at CES 2025 as a suite of world foundation models (WFMs) designed for physical AI development. Unlike research-oriented projects like Genie, Cosmos is aimed at commercial applications, particularly autonomous driving and robotics.
Cosmos models generate physics-based videos from combinations of text, image, video, robot sensor data, and motion data. They are trained to handle physically based interactions, object permanence, and realistic rendering of industrial environments (warehouses, factories) and driving environments (roads, weather conditions, lighting variations) [12].
For autonomous vehicles, Cosmos integrates with NVIDIA's Omniverse simulation platform. Developers can use Cosmos Transfer to amplify variations of sensor data, turning thousands of real-world miles of driving data into billions of virtually driven miles. This data flywheel approach addresses one of the biggest bottlenecks in autonomous driving development: the need for vast amounts of diverse training data [12].
NVIDIA released Cosmos as an open platform, and early adopters include 1X Technologies, Agility Robotics, Figure AI, Foretellix (for autonomous vehicle testing), Skild AI, and Uber. The company also released Cosmos tokenizers (for converting continuous data into discrete tokens suitable for transformer-based models) and guardrails tools [12].
A major release of Cosmos in March 2025 expanded the model suite and physical AI data tools, coinciding with NVIDIA's broader push into what CEO Jensen Huang calls "physical AI," the application of AI to systems that interact with the physical world [12].
This question has become one of the most contested in AI research. The arguments break down roughly as follows:
Arguments that video generation models are (or can become) world models:
- Generating coherent, physically plausible video requires implicit knowledge of object permanence, motion, and scene dynamics; in a functional sense, the model has learned an approximation of how the world behaves.
- Scaling has repeatedly produced emergent capabilities in other domains, so sufficient scale in video generation may yield emergent world understanding, as OpenAI's Sora report argued [7].
Arguments that video generation models are not world models:
- Generated videos routinely violate basic physics (objects morphing, appearing, or disappearing), suggesting surface-level pattern matching rather than causal modeling.
- Video generators are not action-conditioned: they produce a single plausible continuation rather than predicting the consequences of an agent's chosen intervention, which is the capability planning requires.
- Predicting pixels forces the model to spend capacity on irrelevant visual detail rather than on the variables that govern dynamics, a central point of the JEPA critique [5].
A paper published in 2024 by researchers from several institutions, titled "Sora and V-JEPA Have Not Learned The Complete Real World Model," argued that neither generative video models nor JEPA-style models had yet achieved genuine world understanding, and that both approaches had fundamental limitations that needed to be addressed [13].
The truth likely lies between the extremes. Video generation models have learned some aspects of world dynamics (enough to generate plausible videos), but fall short of the kind of accurate, generalizable, action-conditioned physical reasoning that would qualify as a true world model. The field is moving toward hybrid approaches that combine the visual generation capabilities of diffusion models with the interactive, action-conditioned structure of RL-based world models.
The original and most established application of world models is in RL. An agent with an accurate world model can "think ahead" by simulating future trajectories in its model before acting, reducing the amount of real-world interaction needed. This is the approach used by Dreamer, MuZero, and many robotics systems.
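The "think ahead" step can be sketched as random-shooting planning: sample candidate action sequences, roll each one out inside the model, and execute the first action of the best imagined trajectory (the one-dimensional dynamics and reward below are invented stand-ins for a learned model):

```python
import random

# Random-shooting planning inside a (toy, hand-written) dynamics model.
random.seed(0)

def model_step(s, a):
    # Stand-in for a learned model: move along a line toward a goal at 10,
    # with predicted reward equal to negative distance from the goal.
    s2 = s + a
    return s2, -abs(s2 - 10)

def plan(s, horizon=4, candidates=500):
    best_a, best_ret = None, float("-inf")
    for _ in range(candidates):
        seq = [random.choice((-1, 0, 1)) for _ in range(horizon)]
        st, ret = s, 0.0
        for a in seq:                       # imagine the trajectory in the model
            st, r = model_step(st, a)
            ret += r
        if ret > best_ret:
            best_ret, best_a = ret, seq[0]  # keep the first action of the best plan
    return best_a

print(plan(0))   # starting far below the goal, the planner picks +1
```

No real environment step happens until `plan` returns, which is the sample-efficiency argument in miniature; Dreamer and MuZero replace the random sampling with learned policies and search, but the imagine-then-act structure is the same.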
World models for autonomous driving predict how traffic scenes will evolve: where other vehicles will go, how pedestrians will move, and what will happen if the ego vehicle takes a particular action. NVIDIA's Cosmos, Wayve's GAIA-1, and various academic projects pursue this direction. The appeal is that a world model can generate unlimited training scenarios, including rare dangerous situations that are hard to encounter (or safely create) in real-world driving [12].
In robotics, world models help robots predict the outcomes of manipulation actions ("If I push this object, where will it end up?") and plan multi-step tasks. The combination of world models with language models (as in SayCan-style systems) allows robots to plan at multiple levels of abstraction: the language model decomposes a task into steps, and the world model simulates whether each step is likely to succeed.
Genie and Genie 2 demonstrate the potential for world models to generate interactive environments for gaming, training, and evaluation. Instead of hand-crafting game levels or simulation scenarios, developers could use world models to generate limitless variations, potentially reducing the cost of content creation and testing.
World models can be applied to any domain where predicting future states from current conditions is valuable: weather forecasting, economic modeling, protein dynamics, and more. Google DeepMind's GraphCast weather model and similar systems share the underlying principle of learning dynamics from data to predict future states.
As of early 2026, several major research groups are pursuing distinct approaches to world models:
| Group | Approach | Philosophy |
|---|---|---|
| Yann LeCun / Meta AMI Labs | JEPA (Joint Embedding Predictive Architecture) | Predict in abstract representation space, not pixel space; LLMs are insufficient for physical intelligence |
| Google DeepMind (Genie team) | Interactive world generation from video data | Learn to generate explorable 3D environments; useful for training and evaluating agents |
| OpenAI (Sora team) | Large-scale video generation as implicit world modeling | Sufficient scale in video generation may yield emergent world understanding |
| NVIDIA (Cosmos) | Commercial world foundation models for physical AI | Practical tools for autonomous driving and robotics; data amplification for training |
| Danijar Hafner et al. (Dreamer) | Model-based RL with learned latent dynamics | Compact, efficient models for planning and policy optimization in RL |
| Fei-Fei Li / World Labs | Spatial intelligence and 3D world understanding | 3D-native scene understanding as a foundation for world modeling |
These approaches are not mutually exclusive, and the eventual winning strategy may combine elements of several. The JEPA approach and the Dreamer approach share the idea of operating in learned representation spaces. The Genie approach and the Sora approach share the idea of learning from large-scale video data. Cosmos bridges research and commercial application [14].
World models are at an inflection point. The concept has moved from a niche topic in model-based RL to a central theme in AI research, driven by several converging trends:
- the success of model-based RL agents such as DreamerV3, which showed that learned world models can master long-horizon tasks from pixels [4];
- rapid progress in large-scale video generation and interactive world generation (Sora, Genie), and the ensuing debate about what such models actually learn;
- commercial demand for physical AI in robotics and autonomous driving, exemplified by NVIDIA's Cosmos platform [12]; and
- prominent arguments, notably Yann LeCun's, that next-token prediction alone cannot deliver understanding of the physical world [5].
The field is likely to evolve rapidly over the next few years. If JEPA-style architectures or Dreamer-style models can be scaled to handle the complexity of the real world, they could enable a new generation of AI agents that genuinely understand their environments. If video generation models prove to be a dead end for world understanding (as some critics predict), the field may pivot toward more structured approaches that incorporate explicit physical reasoning. In either case, world models remain central to the broader goal of building AI systems that can operate effectively in the physical world.