Dreamer (reinforcement learning)

Updated Jun 28, 2026

Dreamer is a family of model-based reinforcement learning agents that learn a compact world model of their environment and then improve their behavior by "imagining" sequences of future outcomes inside that model rather than acting only on real experience. Developed by Danijar Hafner and collaborators at Google DeepMind and the University of Toronto, the line spans three generations: Dreamer (2019), DreamerV2 (2020), and DreamerV3 (2023), whose peer-reviewed paper appeared in Nature in April 2025. DreamerV3 is, in the authors' words, "the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula," and it does so with the same fixed configuration that it uses across more than 150 tasks.^[1]^[2]^[3]^[4]

A Dreamer agent compresses what it sees into a low-dimensional latent representation, trains a predictive world model on those representations, and uses the model as a fast internal simulator in which an actor-critic policy is optimized. Hafner's collaborators on the papers include Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, and Jurgis Pasukonis. Each generation added a milestone: the first agent, simply called Dreamer, showed that long-horizon behaviors could be learned from images purely by latent imagination; DreamerV2 became the first world-model agent to reach human-level performance on the Atari benchmark; and DreamerV3 extended the approach to many different domains without per-task tuning.^[1]^[2]^[3]^[4]

What is a world model, and why use one?

A world model is a learned predictive model of how an environment evolves: given the current situation and an action, it forecasts what comes next. The appeal for reinforcement learning is sample efficiency. Real interaction with an environment, whether a robot or a game, is slow and costly, so an agent that can learn an accurate internal model can practice in imagination far more cheaply than it can in reality.^[1]^[3]

Dreamer's central design choice is to do this prediction in a compact latent space rather than over raw pixels. An encoder maps high-dimensional observations such as camera frames into low-dimensional latent states, and a recurrent dynamics model predicts how those latent states change over time. Because planning and policy learning happen in this small space, the agent can roll many steps into the future quickly. This lineage builds on earlier latent-dynamics research, including the PlaNet planning system that preceded the first Dreamer agent.^[1]^[6]

How does Dreamer's world model work?

At the core of every Dreamer version is a Recurrent State-Space Model (RSSM), the latent-dynamics backbone first introduced for PlaNet and reused throughout the Dreamer line. An encoder turns each observation into a stochastic latent representation, and a recurrent sequence model predicts the next latent state from the previous state and the action taken. From those latent states the model decodes several quantities: a reconstruction of the observation, the predicted reward, and, from DreamerV2 onward, a flag for whether the episode continues. Training the world model to reconstruct its inputs and predict rewards forces the latent states to capture the information that matters.^[1]^[2]^[3]
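The components described above can be summarized in roughly the notation of the DreamerV3 paper. This is a sketch, not a full specification: x_t is the observation, a_t the action, h_t the recurrent (deterministic) state, z_t the stochastic latent, r_t the reward, and c_t the episode-continuation flag. None of these symbols are defined elsewhere in this article, and the continuation predictor applies from DreamerV2 onward.

```latex
\begin{aligned}
\text{Sequence model:}     \quad & h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Encoder:}            \quad & z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{Dynamics predictor:} \quad & \hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t) \\
\text{Reward predictor:}   \quad & \hat{r}_t \sim p_\phi(\hat{r}_t \mid h_t, z_t) \\
\text{Continue predictor:} \quad & \hat{c}_t \sim p_\phi(\hat{c}_t \mid h_t, z_t) \\
\text{Decoder:}            \quad & \hat{x}_t \sim p_\phi(\hat{x}_t \mid h_t, z_t)
\end{aligned}
```

During imagination only the sequence model and the dynamics predictor are needed to advance the state, since no real observation x_t is available for the encoder.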

Once the world model exists, behavior is learned almost entirely "in the dream." An actor-critic pair is trained on trajectories that the world model imagines: starting from latent states drawn from real experience, the model generates rollouts of latents, actions, and rewards without touching the real environment. A critic learns to estimate long-term returns for each imagined state, and the actor learns to choose actions that maximize those returns. The original Dreamer propagated analytic gradients of the learned value back through the imagined trajectories, an efficient way to assign credit over long horizons. As the 2019 paper put it, Dreamer "efficiently learns behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model."^[1]^[3]
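The critic's training targets in imagination are commonly formed as λ-returns, which blend imagined rewards with the critic's own value estimates; the DreamerV2 and DreamerV3 papers describe targets of this kind. The sketch below computes them for a single imagined rollout in plain Python. The function name, the list-based interface, the indexing convention, and the default discount and λ values are illustrative assumptions, not the authors' code.

```python
def lambda_returns(rewards, values, continues, gamma=0.997, lam=0.95):
    """Compute lambda-return targets for one imagined rollout of length H.

    Assumed convention (hypothetical, chosen for this sketch):
      rewards[t]   : predicted reward for the step taken from state t
      continues[t] : 1.0 if the episode is predicted to continue after that step, else 0.0
      values[t]    : critic estimate for state t, for t = 0..H (values[H] is the bootstrap)
    """
    horizon = len(rewards)
    assert len(values) == horizon + 1
    assert len(continues) == horizon

    returns = [0.0] * horizon
    next_return = values[-1]  # bootstrap from the critic at the end of the rollout
    for t in reversed(range(horizon)):
        discount = gamma * continues[t]
        # Mix the one-step bootstrap with the longer-horizon return.
        next_return = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


# lam=0 reduces to one-step TD targets; lam=1 sums rewards up to the bootstrap value.
td_targets = lambda_returns([1.0, 2.0, 3.0], [5.0, 6.0, 7.0, 8.0], [1.0] * 3, gamma=1.0, lam=0.0)
mc_targets = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 10.0], [1.0] * 3, gamma=1.0, lam=1.0)
```

Working backward from the end of the rollout keeps the computation linear in the horizon, and multiplying the discount by the continuation flag stops value from leaking past a predicted episode end.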

The generations differ mainly in how the latent representation and the learning objective are built. The first Dreamer used continuous Gaussian latents. DreamerV2 switched to discrete representations made of multiple categorical variables, paired with a technique called KL balancing, which pulls the model's predictions toward its representations faster than it pulls the representations toward the predictions; the Google Research team credited these two changes with the jump to human-level Atari play. DreamerV3 kept discrete latents but added a set of robustness techniques, including a "symlog" transformation that compresses large magnitudes of either sign while leaving values near zero almost unchanged, percentile-based normalization of returns, a two-hot reward objective, and free-bits clipping of the model loss. Together these let one configuration learn stably whether rewards are sparse or dense, large or small.^[2]^[3]^[5]
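Two of the DreamerV3 techniques named above are simple enough to sketch directly. The symlog function is sign(x)·ln(|x| + 1), with inverse sign(y)·(exp(|y|) − 1). A two-hot encoding spreads a scalar over its two neighbouring bins so that the weighted bin average reproduces it. The code below is a minimal plain-Python illustration; the bin layout, function names, and example reward are assumptions made for this sketch and are not taken from the Dreamer codebase.

```python
import bisect
import math


def symlog(x):
    """Compress large magnitudes of either sign; near zero it is close to the identity."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog."""
    return math.copysign(math.expm1(abs(y)), y)


def two_hot(value, bins):
    """Spread a scalar over its two neighbouring bins.

    `bins` must be sorted ascending; values outside the range are clamped
    to the first or last bin.
    """
    weights = [0.0] * len(bins)
    if value <= bins[0]:
        weights[0] = 1.0
        return weights
    if value >= bins[-1]:
        weights[-1] = 1.0
        return weights
    upper = bisect.bisect_right(bins, value)  # index of the first bin strictly above value
    lower = upper - 1
    frac = (value - bins[lower]) / (bins[upper] - bins[lower])
    weights[lower] = 1.0 - frac
    weights[upper] = frac
    return weights


# Hypothetical bin layout: 9 evenly spaced bins covering [-4, 4] in symlog space.
BINS = [float(b) for b in range(-4, 5)]

reward = 20.0
target = two_hot(symlog(reward), BINS)  # soft target for a categorical reward head
decoded = symexp(sum(w * b for w, b in zip(target, BINS)))  # maps back to about 20.0
```

Predicting a distribution over fixed bins rather than regressing a raw scalar keeps the size of the learning signal independent of the reward scale, which is one reason a single configuration can cover tasks with very different reward magnitudes.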

How does Dreamer differ from MuZero?

Dreamer's combination of a generative world model and imagination-based policy learning sets it apart from MuZero, another prominent model-based agent. MuZero learns a model that predicts only the quantities needed for planning, such as reward, value, and policy, and does not reconstruct observations; it then plans with explicit tree search. Dreamer instead learns a generative model that reconstructs observations and trains its policy by gradient-based optimization over imagined latent rollouts rather than search.^[3]^[7]

What are the versions of Dreamer?

Version	Paper title	Year / venue	Headline result
Dreamer	Dream to Control: Learning Behaviors by Latent Imagination	arXiv Dec 2019; ICLR 2020	Learned long-horizon behaviors from images by latent imagination; exceeded prior methods on 20 visual control tasks in data efficiency, compute, and final performance
DreamerV2	Mastering Atari with Discrete World Models	arXiv Oct 2020; ICLR 2021	First world-model agent to reach human-level performance on the Atari benchmark of 55 games, using discrete latent representations; outperformed Rainbow and IQN at equal compute
DreamerV3	Mastering Diverse Domains through World Models (arXiv); Mastering diverse control tasks through world models (Nature)	arXiv Jan 2023; Nature Apr 2025	A single fixed configuration that outperformed specialized methods across more than 150 tasks; first algorithm to collect diamonds in Minecraft from scratch

The dates and titles above come from the arXiv records and the Nature paper. The first Dreamer (arXiv:1912.01603) was submitted in December 2019 and presented at ICLR 2020. DreamerV2 (arXiv:2010.02193) was submitted on 5 October 2020 and presented at ICLR 2021. DreamerV3 (arXiv:2301.04104) was first submitted on 10 January 2023, and its peer-reviewed version appeared in Nature on 2 April 2025 under the title "Mastering diverse control tasks through world models," in volume 640, issue 8059, pages 647 to 653.^[1]^[2]^[3]^[4]

What did DreamerV2 achieve on Atari?

DreamerV2 was the first agent to reach human-level performance on the Atari benchmark by learning entirely inside a world model. The paper states plainly that "DreamerV2 constitutes the first agent that achieves human-level performance on the Atari benchmark of 55 tasks by learning behaviors inside a separately trained world model." Given the same computational budget and wall-clock time, DreamerV2 reaches 200 million frames and surpasses the final performance of the top single-GPU agents IQN and Rainbow. The key change from the first Dreamer was the move to discrete categorical latents combined with KL balancing.^[2]^[5]

What did DreamerV3 achieve in Minecraft?

The result that drew the most attention was in Minecraft. Mining a diamond is a long, multi-stage task: a player must gather wood, craft tools, dig down to the right depth, find iron, smelt it, and craft better tools before any diamond can be collected, all while the world is procedurally generated and rewards are sparse. DreamerV3 learned to do this from the reward signal alone, with no human demonstrations, expert data, or hand-designed curricula, which to the authors' knowledge no earlier algorithm had done. The Nature paper describes it as "the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula."^[3]^[4]

The agent learned the full sequence on its own by imagining future scenarios in which its goals were achieved and steering toward them. According to the DreamerV3 project page, the first diamond was reached after roughly 30 million environment steps, on the order of 17 days of in-game play. Press coverage of the Nature paper emphasized that the agent reached an expert level after about nine days of training and that no human gameplay was used to teach it. Because earlier systems that played Minecraft had typically relied on large amounts of human video or staged sub-goals, doing it from scratch was the noteworthy part.^[3]^[4]^[8]
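The two figures from the project page are consistent with each other if one environment step is taken to correspond to one game tick at Minecraft's nominal rate of 20 ticks per second. That step rate is an assumption made here, not something stated in this article; the arithmetic is sketched below.

```python
ENV_STEPS = 30_000_000   # steps to the first diamond, per the project page
STEPS_PER_SECOND = 20    # assumption: one step per Minecraft game tick (20 ticks/s)
SECONDS_PER_DAY = 24 * 60 * 60

play_seconds = ENV_STEPS / STEPS_PER_SECOND
play_days = play_seconds / SECONDS_PER_DAY

print(f"{play_days:.1f} days of in-game play")  # prints "17.4 days of in-game play"
```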

Why does a single configuration matter?

Dreamer's broader contribution is evidence that a single model-based recipe can generalize. The DreamerV3 paper presents "a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration." The agent was evaluated across eight domains and more than 150 tasks spanning Atari, DeepMind Lab, ProcGen, continuous control suites, the Minecraft challenge, and others, using the same hyperparameters throughout. That cuts against the common pattern in reinforcement learning of tuning an algorithm separately for each benchmark, and it suggests world models are a practical route to general agents.^[3]^[4]

The authors framed the work as a step toward systems that can teach themselves to reach goals, including the long-term prospect of robots that learn skills in the real world through internal simulation rather than costly trial and error. The Dreamer codebase has been released publicly, and the approach has influenced later research on reconstruction-free and predictive world models that borrow ideas from both Dreamer and MuZero.^[3]^[7]

References

1. Hafner, D., Lillicrap, T., Ba, J., Norouzi, M. "Dream to Control: Learning Behaviors by Latent Imagination." arXiv:1912.01603 (2019). https://arxiv.org/abs/1912.01603
2. Hafner, D., Lillicrap, T., Norouzi, M., Ba, J. "Mastering Atari with Discrete World Models." arXiv:2010.02193 (2020). https://arxiv.org/abs/2010.02193
3. Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T. "Mastering Diverse Domains through World Models." arXiv:2301.04104 (2023). https://arxiv.org/abs/2301.04104
4. Hafner, D., Pasukonis, J., Ba, J., Lillicrap, T. "Mastering diverse control tasks through world models." Nature 640, 647-653 (2025). https://www.nature.com/articles/s41586-025-08744-2
5. Google Research. "Mastering Atari with Discrete World Models." (2021). https://research.google/blog/mastering-atari-with-discrete-world-models/
6. Hafner, D. "Dream to Control" project page. https://danijar.com/project/dreamer/
7. Hafner, D. "Mastering Diverse Control Tasks through World Models" project page. https://danijar.com/project/dreamerv3/
8. Yirka, B. "Google's AI Dreamer learns how to self-improve over time by mastering Minecraft." Tech Xplore (2025). https://techxplore.com/news/2025-04-google-ai-dreamer-mastering-minecraft.html
