World model

Updated Jun 21, 2026

A world model is an artificial intelligence system that learns an internal representation of how an environment works, enabling it to predict future states, simulate the consequences of actions, and plan without interacting with the real environment at every step. In the simplest terms, a world model answers the question: "If I take this action in this situation, what will happen next?" The concept is inspired by the cognitive-science idea that humans carry mental models of the world and constantly run internal simulations to anticipate outcomes before acting. The term was popularized for modern AI by a 2018 paper from David Ha and Jürgen Schmidhuber, and by the mid-2020s world models had become the organizing thesis behind systems such as DeepMind's Dreamer and Genie, Meta's V-JEPA, OpenAI's Sora, NVIDIA's Cosmos, and Fei-Fei Li's World Labs ^[1]^[3].

World models have become one of the most actively discussed topics in AI research as of 2025-2026. Their appeal is straightforward: an agent that understands the dynamics of its environment can plan ahead, reason about cause and effect, and generalize to new situations far more efficiently than one that relies purely on trial and error. The concept spans multiple subfields, from model-based reinforcement learning (where a learned dynamics model reduces the need for real-world interaction) to video prediction systems (where models forecast future video frames) to interactive world generators (where users can explore AI-generated environments in real time). The term has also become entangled with the debate over whether large video generation models like Sora constitute genuine world models or are simply sophisticated pattern matchers ^[1].

What is a world model in AI?

A world model is a learned, predictive model of an environment's dynamics: given a current state (and optionally an action), it predicts the next state. Unlike a fixed physics engine, a world model is learned from data, usually in a compressed latent space rather than over raw pixels. Three properties distinguish a world model from a plain generative video model: (1) it represents state, often as a compact latent vector; (2) it predicts forward in time; and (3) in its strongest form it is action-conditioned, so an agent can ask counterfactual "what if I do X" questions and use the answers to plan. Systems that only generate a single plausible video from a prompt, without responding to actions, sit at the weak end of this definition.
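The three properties above (a state, forward prediction, and action conditioning) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: `LinearWorldModel` is a hypothetical class, and its matrices `A` and `B` stand in for parameters that a real system would learn from data.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LinearWorldModel:
    """Toy action-conditioned dynamics model: z_next = A z + B a.

    A and B are placeholders for learned parameters (illustrative assumption).
    """
    A: np.ndarray  # (d, d) state-transition matrix
    B: np.ndarray  # (d, k) action-effect matrix

    def predict(self, z, a):
        # One step of forward prediction from latent state z and action a.
        return self.A @ z + self.B @ a

    def rollout(self, z0, actions):
        # Apply predict() repeatedly to answer "what if I do this sequence?"
        states = [z0]
        for a in actions:
            states.append(self.predict(states[-1], a))
        return np.stack(states)

model = LinearWorldModel(A=np.eye(2), B=np.eye(2))
z0 = np.zeros(2)

# Two counterfactual futures from the same starting state.
go_right = model.rollout(z0, [np.array([1.0, 0.0])] * 3)
go_up = model.rollout(z0, [np.array([0.0, 1.0])] * 3)
```

The point of the sketch is the interface rather than the dynamics: a prompt-only video generator has no `a` argument, so it cannot be asked to compare `go_right` with `go_up`.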

History and foundational work

Early ideas

The notion that intelligent agents should build internal models of their environments is not new. In control theory, model predictive control (MPC) has used explicit dynamical models for planning since the 1960s. In AI, the idea of a "mental model" for planning dates at least to Kenneth Craik's 1943 book The Nature of Explanation, which argued that organisms carry "small-scale models" of the external world in their heads and use them to try out alternatives before acting. Ha and Schmidhuber framed their own work with a related line from the systems theorist Jay Wright Forrester: "The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system" ^[3].

In reinforcement learning, the distinction between model-free and model-based methods has been central for decades. Model-free methods (like Q-learning and policy gradient algorithms) learn to act directly from experience without building an explicit model of environment dynamics. Model-based methods learn a transition model (predicting the next state given the current state and action) and use it to plan or generate synthetic training data. Model-based approaches tend to be more sample-efficient because the agent can "imagine" many trajectories without actually taking them, but they are only as good as the accuracy of the learned model ^[2].
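The contrast between the two families can be sketched with a Dyna-style tabular agent, in which each real transition is used once for a model-free Q-learning update and then replayed from a learned transition table. This is a minimal sketch under stated assumptions: the five-state chain environment, the function names, and the hyperparameters are all invented for illustration and are not taken from the cited sources.

```python
import random

N_STATES, GOAL = 5, 4          # chain of states 0..4; reward on reaching state 4
ACTIONS = (-1, +1)             # step left or right

def env_step(s, a):
    """The 'real' environment (hypothetical toy chain)."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

def train(planning_steps, episodes=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}                 # learned transition table: (s, a) -> (s2, r)
    real_steps = 0

    def update(s, a, r, s2):
        # Q-learning target; no bootstrapping from the terminal state.
        target = r + gamma * max(q[(s2, b)] for b in ACTIONS) * (s2 != GOAL)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != GOAL:
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:  # greedy with random tie-breaking
                a = max(ACTIONS, key=lambda b: (q[(s, b)], rng.random()))
            s2, r = env_step(s, a)
            real_steps += 1
            update(s, a, r, s2)              # model-free: learn from real experience
            model[(s, a)] = (s2, r)          # model learning: remember the transition
            for _ in range(planning_steps):  # model-based: learn from imagined replays
                ps, pa = rng.choice(list(model))
                ps2, pr = model[(ps, pa)]
                update(ps, pa, pr, ps2)
            s = s2
    return q, real_steps
```

Calling `train(0)` gives a purely model-free learner and `train(10)` adds ten imagined updates per real step; comparing the returned `real_steps` is one way to probe sample efficiency, though the exact numbers depend on the seed and on this toy setup.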

Ha and Schmidhuber (2018): "World Models"

The paper that brought the term "world model" into common usage in the deep learning community was "World Models" by David Ha and Jürgen Schmidhuber, published in 2018. The paper proposed a three-component architecture for RL agents:

Component	Architecture	Function
Vision model (V)	Variational autoencoder (VAE)	Compresses raw observations (images) into a compact latent representation (32 dimensions for CarRacing, 64 for VizDoom)
Memory model (M)	Recurrent neural network (MDN-RNN)	Predicts future latent states; captures temporal dynamics
Controller (C)	Small linear model	Maps the current latent state and memory to actions (just 867 parameters for CarRacing, 1,088 for VizDoom)
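The controller's parameter count for CarRacing can be checked with simple arithmetic. The sketch below assumes a 256-unit recurrent hidden state and a 3-dimensional action (steering, gas, brake); neither figure appears in the table above, so both should be read as assumptions about the CarRacing setup. The weights are random placeholders, not trained values.

```python
import numpy as np

# z_dim comes from the table above; h_dim and action_dim are assumptions.
z_dim, h_dim, action_dim = 32, 256, 3

rng = np.random.default_rng(0)
W = rng.normal(size=(action_dim, z_dim + h_dim))  # placeholder weights
b = np.zeros(action_dim)                          # placeholder biases

def controller(z, h):
    """Single linear layer mapping [z; h] to an action vector."""
    return W @ np.concatenate([z, h]) + b

# (32 + 256) * 3 weights + 3 biases
n_params = W.size + b.size
```

Under these assumptions `n_params` comes to 867, matching the figure quoted for CarRacing. A controller this small is what makes it practical to optimize the policy separately from the much larger V and M components.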

The key insight was that the controller could be trained entirely inside the "dream" of the world model: the VAE and RNN together formed a generative model of the environment, and the controller learned a policy by interacting with this internal simulation rather than the real environment. As the authors put it, "We can even train our agent entirely inside of its own dream environment generated by its world model, and transfer this policy back into the actual environment" ^[3]. Ha and Schmidhuber demonstrated this on the VizDoom: Take Cover task and a car racing task. On CarRacing-v0, the agent reached 906 ± 21, making it the first published method to solve the task (a score above 900 averaged over 100 trials). Trained purely inside its dream and transferred back, the VizDoom agent scored 1092 ± 556, well above the threshold of 750 considered to solve the environment ^[3].

The paper also examined what happens when the controller is trained purely inside the learned model and never acts in the real environment during policy learning. This worked, but it exposed a limitation: the controller could exploit inaccuracies in the learned model, finding "cheats" that succeed in the dream but fail in reality. To counter this, the authors raised the sampling temperature of the MDN-RNN, which makes the dream environment noisier and harder to exploit ^[3].

Ha and Schmidhuber's work showed that a learned latent dynamics model could be sufficient for control on these tasks, but it also exposed the cost of modularity: the three components were trained separately, so the representations could not be fine-tuned end-to-end for the control objective. This limitation motivated the next generation of world model research ^[3].

Dreamer series (2019-2023)

Danijar Hafner and collaborators at Google and the University of Toronto developed the Dreamer series of world model agents, which addressed the limitations of the Ha-Schmidhuber approach by training all components jointly and using more sophisticated policy optimization.

Version	Year	Key advance	Notable result
Dreamer (building on PlaNet)	2019	Recurrent State-Space Model (RSSM) for latent dynamics, introduced in PlaNet; end-to-end differentiable	Learned control from pixels in DeepMind Control Suite tasks
DreamerV2	2020	Discrete latent representations; KL balancing	First world model agent to achieve human-level performance on the 55-game Atari benchmark (200M frames)
DreamerV3	2023	Normalization and transformation techniques for stable learning across domains	First algorithm to collect diamonds in Minecraft from scratch without human data; outperformed MuZero with far less compute

Dreamer works by learning a world model that predicts future latent states from current states and actions. An actor-critic policy is then trained by "imagining" trajectories inside this model, with rewards predicted by the model itself. Because all components are differentiable, gradients flow from the imagined rewards back through the dynamics model and into the policy, enabling efficient end-to-end optimization ^[4].
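One ingredient of this imagination-based training is the value target computed along each imagined trajectory. The sketch below computes λ-returns, which blend predicted rewards with a critic's value estimates. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function name, the indexing convention, and the default `gamma` and `lam` are choices made here, and the inputs would come from the world model's reward predictor and the critic.

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns over an imagined trajectory of length H.

    rewards: predicted rewards r_0..r_{H-1}
    values:  critic estimates v_0..v_H (v_H bootstraps the tail)
    Recursion: R_t = r_t + gamma * ((1 - lam) * v_{t+1} + lam * R_{t+1}),
    with R_H = v_H.
    """
    H = len(rewards)
    returns = np.zeros(H)
    next_return = values[H]
    for t in reversed(range(H)):
        returns[t] = rewards[t] + gamma * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        next_return = returns[t]
    return returns
```

Setting `lam=0` reduces the target to a one-step bootstrap from the critic, while `lam=1` trusts the imagined rewards all the way to the horizon; intermediate values trade off model error against critic error.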

DreamerV3 was particularly notable for its generality. A single, fixed set of hyperparameters worked across more than 150 tasks spanning continuous control, Atari games, procedurally generated environments, and the open world of Minecraft. Collecting diamonds in Minecraft is a long-horizon challenge that requires finding wood, crafting a pickaxe, mining stone, upgrading tools, locating iron, smelting it, and finally mining diamond ore, all from pixel observations with sparse rewards. No prior algorithm had accomplished this without human demonstrations or hand-crafted reward shaping. In the published runs, 24 of 40 random seeds collected at least one diamond within 100 million environment steps, with the first diamond appearing after about 29 million steps. The DreamerV3 results were published in Nature in 2025, where the authors wrote that Dreamer is "the first algorithm to collect diamonds in Minecraft from scratch without human data" ^[4].

What is JEPA and why does Yann LeCun favor it?

Yann LeCun, Meta's former VP and Chief AI Scientist, has been one of the most vocal proponents of world models as the path to machine intelligence. LeCun has argued repeatedly that large language models (LLMs), despite their impressive text generation abilities, will never achieve genuine understanding of the physical world because they operate only on discrete tokens and lack the ability to predict continuous, high-dimensional sensory states ^[5].

LeCun's proposed alternative is the Joint Embedding Predictive Architecture (JEPA). The core idea is that instead of predicting raw pixels or tokens (which is computationally expensive and forces the model to predict every irrelevant detail), a JEPA-based system predicts in a learned abstract representation space. Two encoder networks process inputs (for example, two different views of a scene, or a current frame and a future frame), and a predictor network learns to predict the representation of one from the other.
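The structure described above, two encoders plus a predictor with the loss measured between representations, can be sketched in a few lines. This is an illustrative toy under stated assumptions: the single-layer `tanh` encoders, the dimensions, and the function names are invented here, and keeping the target encoder as a slowly updated copy of the context encoder is a common design in this family of methods that the text above does not itself specify.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_REP = 16, 4                            # input and representation sizes (arbitrary)

W_context = rng.normal(scale=0.1, size=(D_REP, D_IN))  # context encoder weights
W_target = W_context.copy()                    # target encoder starts as a copy
W_pred = np.eye(D_REP)                         # predictor, initialised to identity

def jepa_loss(x_context, x_target):
    """Squared error between predicted and actual target representations."""
    s_context = np.tanh(W_context @ x_context)
    s_target = np.tanh(W_target @ x_target)    # treated as a constant in training
    s_pred = W_pred @ s_context
    return float(np.mean((s_pred - s_target) ** 2))

def ema_update(w_target, w_context, momentum=0.99):
    """Move the target encoder slowly toward the context encoder."""
    return momentum * w_target + (1 - momentum) * w_context
```

Note that the loss never touches pixels: the error is computed between `D_REP`-dimensional vectors, which is what lets the model ignore details of `x_target` that the encoder does not represent.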

The approach has been implemented in a sequence of published models from Meta:

Model	Year	Domain	Description
I-JEPA	2023	Images	Predicts abstract representations of masked image regions from surrounding context; no pixel-level reconstruction
V-JEPA	2024	Video	Predicts abstract representations of masked video segments; learns temporal dynamics without generating pixels
V-JEPA 2	2025	Video + robot action	1.2B-parameter world model pre-trained on 1M+ hours of video; an action-conditioned variant enables zero-shot robot planning

I-JEPA (Image Joint Embedding Predictive Architecture) learns by masking large portions of an image and predicting the representation of the masked region from the visible context. Because it operates in representation space rather than pixel space, it can focus on semantic and structural information rather than low-level textures. V-JEPA extends this to video, learning to predict missing temporal segments in representation space, which forces the model to learn about motion, object permanence, and physical dynamics ^[6].

V-JEPA 2 (June 2025)

Released on June 11, 2025, V-JEPA 2 is a 1.2-billion-parameter video world model pre-trained on more than 1 million hours of internet video plus 1 million images, with no action labels in the first stage. Meta described it as "a world model trained on video that enables state-of-the-art understanding and prediction, as well as zero-shot planning and robot control in new environments" ^[15]. A second, action-conditioned model, V-JEPA 2-AC, was post-trained on fewer than 62 hours of unlabeled robot video from the open DROID dataset, then deployed zero-shot on Franka robot arms in two different labs to pick and place objects by planning toward image goals. On benchmarks, V-JEPA 2 reported 77.3% top-1 accuracy on Something-Something v2 (motion understanding) and 39.7 recall-at-5 on Epic-Kitchens-100 (human action anticipation) ^[15]. V-JEPA 2 is the clearest demonstration to date of the JEPA thesis: a model that learns physical dynamics in latent space from passive video, then transfers to a robot with minimal action data.

LeCun views JEPA as a stepping stone toward what he calls Autonomous Machine Intelligence (AMI), a system architecture in which a world model sits at the center, surrounded by modules for perception, memory, cost estimation, and action. In September 2025, LeCun launched AMI Labs at Meta to pursue this vision, representing what has been described as the largest corporate bet on the thesis that the path to general intelligence runs through world models rather than next-token prediction ^[5].

Types of world models

The term "world model" is used to describe several related but distinct approaches:

Model-based RL world models

These are dynamics models learned within a reinforcement learning framework. The model takes a state and action as input and predicts the next state (and often the reward). The agent uses this model to plan by simulating future trajectories internally. Examples include the Dreamer series, MuZero (which learns a latent dynamics model for board games and Atari), and various model-based approaches used in robotics.

The strengths of this approach are sample efficiency (fewer real-world interactions needed) and the ability to plan ahead. The weakness is that errors in the model compound over long horizons: a small prediction error at each step can accumulate into a wildly inaccurate trajectory after many steps, a problem known as compounding model error ^[2].
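The way small per-step errors grow over a rollout can be seen in a one-dimensional toy system. The numbers below are assumptions chosen for illustration: the true dynamics hold the state constant, and the learned model overestimates the transition coefficient by 1%.

```python
import numpy as np

def rollout(a, x0, steps):
    """Iterate the scalar dynamics x_next = a * x for a number of steps."""
    xs = [x0]
    for _ in range(steps):
        xs.append(a * xs[-1])
    return np.array(xs)

true_a, model_a = 1.00, 1.01        # the model is off by 1% per step (assumed)
truth = rollout(true_a, 1.0, 100)
pred = rollout(model_a, 1.0, 100)

# Absolute prediction error at each horizon.
errors = np.abs(pred - truth)
```

Here the one-step error is 0.01, but after 100 steps the error has grown to roughly 1.7, larger than the true state itself. This is why many planners keep imagined horizons short or replan frequently.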

Video prediction models

Video prediction models take a sequence of video frames (and sometimes conditioning signals like text or actions) and generate future frames. These models learn to predict how visual scenes evolve over time, capturing information about object motion, occlusion, and scene dynamics.

Models in this category include:

SVG (Stochastic Video Generation) by Denton and Fergus (2018), which used variational methods to model uncertainty in future frames.
FitVid and other deterministic video prediction models used for robotic planning.
Sora by OpenAI (2024), a large-scale diffusion-based video generation model that produces high-fidelity video from text prompts.

Whether video prediction models are truly world models is a subject of ongoing debate (discussed in a later section).

Learned simulators and interactive world models

These systems go beyond passive frame prediction by allowing interactive exploration. A user or agent can take actions within the generated world, and the model produces the next state in response, functioning like an AI-generated video game or simulator.

Google DeepMind's Genie and Genie 2 are the most prominent examples. NVIDIA's Cosmos platform represents a commercial approach, providing world foundation models specifically designed for physical AI applications like robotics and autonomous driving.

JEPA-style representation predictors

As described above, JEPA-based models predict in an abstract representation space rather than in pixel space. This avoids the computational burden and noise of pixel-level prediction while (in theory) focusing the model on the aspects of the world that matter for decision-making. With the release of V-JEPA 2 in 2025, this approach moved from pure research demos toward zero-shot robot control, though it remains earlier in deployment than commercial video-generation systems. I-JEPA, V-JEPA, and V-JEPA 2 are the main published examples.

Video generation as world simulation

Sora and the "world simulator" claim

When OpenAI introduced Sora in February 2024, its technical report described the model as a "world simulator," arguing that video generation models trained at sufficient scale would implicitly learn to simulate the physical world. Sora can generate photorealistic videos from text prompts, depicting complex scenes with moving objects, changing lighting, and plausible (if not always physically accurate) interactions ^[7].

The claim provoked significant debate. Supporters argued that a model capable of generating coherent video must have learned something about how the world works: objects fall when dropped, cars move along roads, water flows downhill. If the model can consistently generate physically plausible outcomes, it has, in some functional sense, learned physics.

Critics offered several counterarguments:

Memorization vs. understanding. A 2025 study using the Physics-IQ benchmark (developed by INSAIT and Google DeepMind, comprising 396 real-world test videos) found that Sora produced the most visually realistic output, achieving the best multimodal-LLM realism score of 55.6%. But across all systems tested, the best physical-understanding score was only 24.1% of the maximum (achieved by VideoPoet, not Sora), indicating that even the most realistic-looking models fail to capture underlying physics. The study reported essentially no correlation between visual realism and physical understanding ^[8].
No action conditioning. A true world model should respond to actions: "If I kick the ball, where does it go?" Sora generates video from text prompts but cannot be conditioned on a sequence of actions, so it cannot be used for interactive planning.
Out-of-distribution failure. Research showed that video generation models rely heavily on memorizing patterns from training data rather than learning general physical principles. When presented with scenarios outside their training distribution (such as objects with unusual physical properties), the models failed to generalize, and scaling up data and model size did not improve this ^[8].
Correlation vs. causation. Video models learn correlations in pixel patterns ("objects that look like this tend to move like that") rather than causal models ("gravity accelerates objects at 9.8 m/s²"). This distinction matters for tasks that require counterfactual reasoning or precise prediction.

The Physics-IQ authors summarized the gap bluntly: "visual realism does not imply physical understanding" ^[8]. The debate is not merely academic. If video generation models are genuine world models, then scaling up video generation (as OpenAI, Google, and others are doing) is a path toward AI systems that understand the physical world. If they are not, the field needs fundamentally different architectures.

The spectrum between frame prediction and world understanding

A useful way to think about the debate is as a spectrum rather than a binary:

Level	Capability	Example
Frame prediction	Predicting the next frame given previous frames; no understanding of physics	Simple video codecs
Statistical video generation	Generating plausible video from text or context; learns correlations in visual patterns	Sora, Runway Gen-3
Stylized physics	Understanding that dropped things fall and rolling things move, without precise equations	Current best world models
Approximate physical simulation	Predicting outcomes of interactions with reasonable accuracy; responds to action conditioning	Research frontier (Genie 3, advanced model-based RL)
Precise physical simulation	Accurate physics with correct equations of motion	Traditional physics engines (not learned)

Current video generation models operate at the "statistical video generation" level, occasionally reaching "stylized physics." Current model-based RL world models and interactive systems like Genie operate closer to "stylized physics" or "approximate physical simulation" for restricted domains.

Genie, Genie 2, and Genie 3 (Google DeepMind)

Google DeepMind's Genie project represents one of the most ambitious efforts to build interactive world models.

Genie (February 2024)

The original Genie, published in February 2024, is an 11-billion-parameter model trained on a filtered set of 30,000 hours of unlabeled internet gameplay video (curated from a much larger pool of public 2D platformer footage). It learned a latent action space (a set of abstract "controls") entirely from watching videos, without any labeled action data. Users could provide a single image (a photo, a sketch, or an AI-generated scene), and Genie would generate an interactive 2D environment that could be explored using the learned controls. DeepMind described Genie as "the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos" ^[9].

Genie was notable for several reasons. It demonstrated that interactive world models could be learned from passive video without action labels. It showed that a single model could generate diverse 2D platformer-style worlds. And it introduced the idea of "world generation" as distinct from "video generation": the output was not a pre-determined video but an environment that responded to user input.

Genie 2 (December 2024)

Genie 2, announced in December 2024, extended the concept to 3D environments. From a single image and optional text description, Genie 2 generates an interactive 3D world that users can explore using a keyboard or mouse. The generated environments include object interactions (opening doors, bursting balloons), animated characters and NPCs, lighting and reflections, and basic physics simulation ^[10].

Technically, Genie 2 uses an autoregressive latent diffusion model that generates the world frame by frame, simulating the consequences of each user action. It maintains memory of parts of the scene that are not currently visible and renders them accurately when they come back into view. The model was trained on video data and does not use a traditional rendering engine ^[10].

DeepMind positioned Genie 2 as useful for training and evaluating AI agents: rather than building handcrafted simulation environments, researchers could generate an unlimited curriculum of novel worlds for agents to explore.

Genie 3 (August 2025)

Genie 3, released in August 2025, was described by DeepMind as its "first world model to allow interaction in real-time." The post states that "given a text prompt, Genie 3 can generate dynamic worlds that you can navigate in real time at 24 frames per second, retaining consistency for a few minutes at a resolution of 720p" ^[11]. The system maintains visual memory of the environment "extending as far back as one minute ago," so scenes stay consistent as a user looks away and back. Genie 3 also introduced "promptable world events," letting a user change the running world on the fly, for example altering weather or introducing new objects and characters. DeepMind summarized it as "a general purpose world model that can generate an unprecedented diversity of interactive environments," positioning it as a tool for training and evaluating embodied agents ^[11].

World Labs and spatial intelligence

World Labs, co-founded by Fei-Fei Li, pursues a distinct interpretation of world models centered on "spatial intelligence": building models that perceive, generate, reason about, and interact with the 3D world. The company emerged from stealth in September 2024 with $230 million in funding from backers including Andreessen Horowitz, NVIDIA's venture arm NVentures, and Radical Ventures, at a reported valuation of around $1 billion ^[16]. World Labs describes its products as "large world models."

In November 2025, World Labs launched its first commercial product, Marble, which generates persistent, downloadable 3D environments from a single image or text prompt ^[17]. Unlike systems such as Genie that synthesize frames on the fly as a user explores, Marble produces a fixed 3D scene up front, which reduces the "morphing" and drift seen in autoregressive video world models and lets users export the result as Gaussian splats, meshes, or video. World Labs thus represents a 3D-native, geometry-first approach to world modeling, contrasting with the video-prediction approaches of Sora and Genie and the latent-prediction approach of JEPA.

NVIDIA Cosmos

NVIDIA launched the Cosmos platform at CES 2025 as a suite of world foundation models (WFMs) designed for physical AI development. Unlike research-oriented projects like Genie, Cosmos is aimed at commercial applications, particularly autonomous driving and robotics.

Cosmos models generate physics-based videos from combinations of text, image, video, robot sensor data, and motion data. They are trained to handle physically based interactions, object permanence, and realistic rendering of industrial environments (warehouses, factories) and driving environments (roads, weather conditions, lighting variations) ^[12].

For autonomous vehicles, Cosmos integrates with NVIDIA's Omniverse simulation platform. Developers can use Cosmos Transfer to amplify variations of sensor data, turning thousands of real-world miles of driving data into billions of virtually driven miles. This data flywheel approach addresses one of the biggest bottlenecks in autonomous driving development: the need for vast amounts of diverse training data ^[12].

NVIDIA released Cosmos as an open platform, and early adopters include 1X Technologies, Agility Robotics, Figure AI, Foretellix (for autonomous vehicle testing), Skild AI, and Uber. The company also released Cosmos tokenizers (for converting continuous data into discrete tokens suitable for transformer-based models) and guardrails tools ^[12].

A major release of Cosmos in March 2025 expanded the model suite and physical AI data tools, coinciding with NVIDIA's broader push into what CEO Jensen Huang calls "physical AI," the application of AI to systems that interact with the physical world ^[12].

Are video generation models world models?

This question has become one of the most contested in AI research. The arguments break down roughly as follows:

Arguments that video generation models are (or can become) world models:

Scale may be sufficient. As video models get larger and are trained on more data, they may implicitly learn enough physics to be functionally useful as world models.
Emergence. Just as LLMs exhibit emergent abilities at sufficient scale, video models may develop emergent physical understanding.
Practical utility. Even imperfect physics understanding is useful. A model that "knows" dropped things fall and rolling things continue is more useful for planning than no model at all.

Arguments that video generation models are not world models:

No causal understanding. Video models learn correlations, not causal mechanisms. They cannot answer "what if" questions involving novel interventions.
Physics-IQ results. Empirical testing shows a large gap between visual realism (best score 55.6%) and physical understanding (best score 24.1%), and this gap does not close with scale ^[8].
No action conditioning. Most video generation models cannot be interacted with; they produce a single pre-determined trajectory rather than responding to agent actions.
Generalization failure. Models fail on out-of-distribution physical scenarios, suggesting memorization rather than understanding ^[8].

A paper published in 2024 by researchers from several institutions, titled "Sora and V-JEPA Have Not Learned The Complete Real World Model," argued that neither generative video models nor JEPA-style models had yet achieved genuine world understanding, and that both approaches had fundamental limitations that needed to be addressed ^[13].

The truth likely lies between the extremes. Video generation models have learned some aspects of world dynamics (enough to generate plausible videos), but fall short of the kind of accurate, generalizable, action-conditioned physical reasoning that would qualify as a true world model. The field is moving toward hybrid approaches that combine the visual generation capabilities of diffusion models with the interactive, action-conditioned structure of RL-based world models.

What are world models used for?

Planning in reinforcement learning

The original and most established application of world models is in RL. An agent with an accurate world model can "think ahead" by simulating future trajectories in its model before acting, reducing the amount of real-world interaction needed. This is the approach used by Dreamer, MuZero, and many robotics systems.
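"Thinking ahead" in a learned model can be sketched as a random-shooting planner in the style of model predictive control: sample candidate action sequences, score each by rolling it out in the model, and execute only the first action of the best one. This is a minimal sketch under stated assumptions, not the planner used by Dreamer or MuZero; the function name, the one-dimensional toy dynamics, and the reward function are all hypothetical.

```python
import numpy as np

def plan_random_shooting(z0, dynamics, reward_fn,
                         horizon=10, n_candidates=256, seed=0):
    """Return the first action of the best sampled action sequence."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    best_return, best_first_action = -np.inf, 0.0
    for actions in candidates:
        z, total = z0, 0.0
        for a in actions:            # imagined rollout; no real interaction
            z = dynamics(z, a)
            total += reward_fn(z)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action, best_return

# Hypothetical stand-ins for a learned dynamics model and reward predictor.
dynamics = lambda z, a: z + a            # 1-D position moved by the action
reward_fn = lambda z: -abs(z - 3.0)      # closer to position 3 is better

action, ret = plan_random_shooting(0.0, dynamics, reward_fn)
```

In a control loop the agent would apply `action`, observe the real next state, and replan from there, which limits how far errors in the learned model can accumulate.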

Autonomous driving

World models for autonomous driving predict how traffic scenes will evolve: where other vehicles will go, how pedestrians will move, and what will happen if the ego vehicle takes a particular action. NVIDIA's Cosmos, Wayve's GAIA-1, and various academic projects pursue this direction. The appeal is that a world model can generate unlimited training scenarios, including rare dangerous situations that are hard to encounter (or safely create) in real-world driving ^[12].

Robotics

In robotics, world models help robots predict the outcomes of manipulation actions ("If I push this object, where will it end up?") and plan multi-step tasks. The combination of world models with language models (as in SayCan-style systems) allows robots to plan at multiple levels of abstraction: the language model decomposes a task into steps, and the world model simulates whether each step is likely to succeed. V-JEPA 2-AC's zero-shot pick-and-place on real Franka arms, using fewer than 62 hours of robot video, illustrates how a video-trained world model can be retargeted to physical control with minimal action data ^[15].

Game environments and simulation

Genie and Genie 2 demonstrate the potential for world models to generate interactive environments for gaming, training, and evaluation. Instead of hand-crafting game levels or simulation scenarios, developers could use world models to generate limitless variations, potentially reducing the cost of content creation and testing.

Prediction and forecasting

World models can be applied to any domain where predicting future states from current conditions is valuable: weather forecasting, economic modeling, protein dynamics, and more. Google DeepMind's GraphCast weather model and similar systems share the underlying principle of learning dynamics from data to predict future states.

Competing approaches (2025-2026)

As of early 2026, several major research groups are pursuing distinct approaches to world models:

Group	Approach	Philosophy
Yann LeCun / Meta AMI Labs	JEPA (Joint Embedding Predictive Architecture); V-JEPA 2	Predict in abstract representation space, not pixel space; LLMs are insufficient for physical intelligence
Google DeepMind (Genie team)	Interactive world generation from video data	Learn to generate explorable 3D environments; useful for training and evaluating agents
OpenAI (Sora team)	Large-scale video generation as implicit world modeling	Sufficient scale in video generation may yield emergent world understanding
NVIDIA (Cosmos)	Commercial world foundation models for physical AI	Practical tools for autonomous driving and robotics; data amplification for training
Danijar Hafner et al. (Dreamer)	Model-based RL with learned latent dynamics	Compact, efficient models for planning and policy optimization in RL
Fei-Fei Li / World Labs	Spatial intelligence and 3D world understanding (Marble)	3D-native, persistent scene generation as a foundation for world modeling

These approaches are not mutually exclusive, and the eventual winning strategy may combine elements of several. The JEPA approach and the Dreamer approach share the idea of operating in learned representation spaces. The Genie approach and the Sora approach share the idea of learning from large-scale video data. Cosmos bridges research and commercial application ^[14].

Current state (2025-2026)

World models are at an inflection point. The concept has moved from a niche topic in model-based RL to a central theme in AI research, driven by several converging trends:

Convergence of video generation and world modeling. The enormous investment in video generation (Sora, Runway, Pika, and others) has produced models with impressive visual generation capabilities. Whether these models can be upgraded to true world models, or whether fundamentally different architectures are needed, is a key open question.
Commercial investment. NVIDIA's Cosmos, Meta's AMI Labs, Google DeepMind's Genie, and World Labs' $230 million-plus raise represent billions of dollars of investment in world model research and infrastructure ^[16].
Robotics demand. The boom in humanoid robots and AI robotics is creating urgent demand for world models that can help robots plan and learn in simulation before deployment. V-JEPA 2's zero-shot transfer to robot arms in 2025 is an early concrete result in this direction ^[15].
Genie 3 as a milestone. Google DeepMind's Genie 3 (August 2025) achieved real-time interactive world generation at 720p and 24 fps with minute-scale consistency. It is the first DeepMind world model to support interaction in real time, at speeds fast enough for practical agent training and evaluation ^[11].
Persistent limitations. Despite progress, no system has yet demonstrated a learned world model that can accurately predict physical dynamics across a wide range of scenarios. The gap between visual plausibility and physical accuracy remains large (the best Physics-IQ score is 24.1%), and out-of-distribution generalization continues to be a weakness ^[8].
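To make the Genie 3 figures above concrete, the following back-of-the-envelope calculation shows what 720p at 24 fps with minute-scale consistency implies. It assumes that "720p" means 1280x720 pixels and that "minute-scale" means one minute; neither detail is stated in the text, and the numbers say nothing about how Genie 3 is actually implemented.

```python
FPS = 24                   # frame rate quoted for Genie 3
WIDTH, HEIGHT = 1280, 720  # assumption: "720p" taken as 1280x720
CONSISTENCY_SECONDS = 60   # assumption: "minute-scale" taken as one minute

# Time available to generate each frame while staying real time.
frame_budget_ms = 1000 / FPS

# Number of consecutive frames that must remain mutually consistent.
frames_per_minute = FPS * CONSISTENCY_SECONDS

# Raw pixel throughput the generator has to sustain.
pixels_per_second = WIDTH * HEIGHT * FPS

print(f"per-frame budget: {frame_budget_ms:.1f} ms")
print(f"frames per minute: {frames_per_minute}")
print(f"pixels per second: {pixels_per_second:,}")
```

Under these assumptions each frame must be produced in about 41.7 ms, and a one-minute session spans 1,440 frames that all have to agree with one another.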

The field is likely to evolve rapidly over the next few years. If JEPA-style architectures or Dreamer-style models can be scaled to handle the complexity of the real world, they could enable a new generation of AI agents that genuinely understand their environments. If video generation models prove to be a dead end for world understanding (as some critics predict), the field may pivot toward more structured approaches that incorporate explicit physical reasoning. In either case, world models remain central to the broader goal of building AI systems that can operate effectively in the physical world.

References

1. Themesis. (2026). "World Models: Five Competing Approaches." https://themesis.com/2026/01/07/world-models-five-competing-approaches/
2. Sutton, R. S. & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press.
3. Ha, D. & Schmidhuber, J. (2018). "World Models." https://worldmodels.github.io/
4. Hafner, D., et al. (2025). "Mastering diverse control tasks through world models." *Nature*. https://www.nature.com/articles/s41586-025-08744-2
5. Entropy Town. (2025). "Why Fei-Fei Li, Yann LeCun and DeepMind Are All Betting on World Models." https://entropytown.com/articles/2025-11-13-world-model-lecun-feifei-li/
6. Meta AI. (2024). "V-JEPA: The next step toward advanced machine intelligence." https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-joint-embedding-predictive-architecture/
7. OpenAI. (2024). "Sora: Creating video from text." Referenced in: https://openai.com/sora
8. Motamed, S., et al. (2025). "Do generative video models understand physical principles?" https://arxiv.org/abs/2501.09038
9. Google DeepMind. (2024). "Genie: Generative Interactive Environments." https://arxiv.org/abs/2402.15391
10. Google DeepMind. (2024). "Genie 2: A large-scale foundation world model." https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/
11. Google DeepMind. (2025). "Genie 3: A new frontier for world models." https://deepmind.google/discover/blog/genie-3-a-new-frontier-for-world-models/
12. NVIDIA. (2025). "NVIDIA Launches Cosmos World Foundation Model Platform to Accelerate Physical AI Development." https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development
13. Liu, et al. (2024). "Sora and V-JEPA Have Not Learned The Complete Real World Model." https://www.arxiv.org/pdf/2407.10311
14. Introl. (2026). "World Models Race 2026: How LeCun, DeepMind, and Others Compete." https://introl.com/blog/world-models-race-agi-2026
15. Meta AI. (2025). "Introducing the V-JEPA 2 world model and new benchmarks for physical reasoning." https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
16. TechCrunch. (2024). "Fei-Fei Li's World Labs comes out of stealth with $230M." https://techcrunch.com/2024/09/13/with-230m-in-funding-world-labs-is-building-large-world-models/
17. TechCrunch. (2025). "Fei-Fei Li's World Labs speeds up the world model race with Marble, its first commercial product." https://techcrunch.com/2025/11/12/fei-fei-lis-world-labs-speeds-up-the-world-model-race-with-marble-its-first-commercial-product/
