Dreamer (reinforcement learning)
Last reviewed
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,367 words
Dreamer is a family of model-based reinforcement learning agents that learn a model of their environment and then improve their behavior by "imagining" sequences of future outcomes inside that model. Rather than learning only from direct interaction with the environment, a Dreamer agent compresses its observations into a compact latent representation, trains a predictive world model on those representations, and uses the model as a fast internal simulator in which a policy can be optimized. The line of work was led by Danijar Hafner, working with collaborators including Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi, and Jurgis Pasukonis across Google DeepMind and the University of Toronto.[1][2][3]
The project spans three main generations published between 2019 and 2025. The first agent, simply called Dreamer, showed that long-horizon behaviors could be learned from images purely by latent imagination. DreamerV2 became the first world-model agent to reach human-level performance on the Atari benchmark. DreamerV3 introduced a single fixed configuration that worked across many different domains and, notably, was the first algorithm to collect diamonds in the video game Minecraft from scratch.[1][2][3][4]
A world model is a learned predictive model of how an environment evolves: given the current situation and an action, it forecasts what comes next. The appeal for reinforcement learning is sample efficiency. Real interaction with an environment, whether a robot or a game, is slow and costly, so an agent that can learn an accurate internal model can practice in imagination far more cheaply than it can in reality.[1][3]
Dreamer's central design choice is to do this prediction in a compact latent space rather than over raw pixels. An encoder maps high-dimensional observations such as camera frames into low-dimensional latent states, and a recurrent dynamics model predicts how those latent states change over time. Because planning and policy learning happen in this small space, the agent can roll many steps into the future quickly. This lineage builds on earlier latent-dynamics research, including the PlaNet planning system that preceded the first Dreamer agent.[1][6]
At the core of every Dreamer version is a recurrent state-space model. An encoder turns each observation into a stochastic latent representation, and a recurrent sequence model predicts the next latent state from the previous state and the action taken. From those latent states the model decodes several quantities: a reconstruction of the observation, the predicted reward, and, from DreamerV2 onward, a flag for whether the episode continues. Training the world model to reconstruct its inputs and predict rewards forces the latent states to capture the information that matters.[1][2][3]
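The data flow of such a model can be illustrated with a deliberately tiny sketch. It is not the Dreamer implementation: the class name `ToyRSSM`, the scalar states, and every weight are hypothetical stand-ins for trained neural networks, and nothing here is learned. The point is only to show the two ways a latent state is produced: a posterior step that sees an observation, and a prior step that predicts without one.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class State:
    h: float  # deterministic recurrent state
    z: float  # stochastic latent sample


class ToyRSSM:
    """Scalar toy of a recurrent state-space model (weights fixed, not learned)."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        # Hypothetical constants standing in for trained network weights.
        self.w_h, self.w_z, self.w_a, self.w_obs = 0.5, 0.3, 0.8, 0.9
        self.std = 0.1

    def _recur(self, prev: State, action: float) -> float:
        # Sequence model: next recurrent state from previous state and action.
        return math.tanh(self.w_h * prev.h + self.w_z * prev.z + self.w_a * action)

    def prior(self, prev: State, action: float) -> State:
        # Prediction without an observation; this is the step used in imagination.
        h = self._recur(prev, action)
        return State(h, self.rng.gauss(h, self.std))

    def posterior(self, prev: State, action: float, obs: float) -> State:
        # Same recurrence, but the latent is also informed by the encoded observation.
        h = self._recur(prev, action)
        embed = math.tanh(self.w_obs * obs)  # stand-in for the encoder
        return State(h, self.rng.gauss(0.5 * (h + embed), self.std))

    def reward(self, s: State) -> float:
        # Stand-in for the reward head that reads the latent state.
        return s.h + s.z


model = ToyRSSM(seed=0)
state = State(h=0.0, z=0.0)
# Filter two real steps (action, observation), then imagine three steps ahead.
for action, obs in [(1.0, 0.2), (-1.0, 0.4)]:
    state = model.posterior(state, action, obs)
imagined = []
for _ in range(3):
    state = model.prior(state, action=1.0)
    imagined.append(model.reward(state))
```

In a real agent the prior is trained to match the posterior, which is what makes observation-free rollouts trustworthy; the toy skips that training step entirely.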
Once the world model exists, behavior is learned almost entirely "in the dream." An actor-critic pair is trained on trajectories that the world model imagines: starting from latent states drawn from real experience, the model generates rollouts of latents, actions, and rewards without touching the real environment. A critic learns to estimate long-term returns for each imagined state, and the actor learns to choose actions that maximize those returns. The original Dreamer propagated analytic gradients of the learned value back through the imagined trajectories, an efficient way to assign credit over long horizons.[1][3]
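The critic's training targets over an imagined trajectory can be sketched as follows. The recursion is the λ-return form described in the DreamerV2 and DreamerV3 papers; everything else is an assumption made for illustration, including the function names `imagine` and `lambda_returns`, the scalar state, and the toy model and policy. The actor update is omitted.

```python
from typing import Callable, List, Tuple


def imagine(start: float,
            policy: Callable[[float], float],
            model_step: Callable[[float, float], Tuple[float, float]],
            horizon: int) -> Tuple[List[float], List[float]]:
    """Roll a policy forward inside a model; returns horizon+1 states and horizon rewards."""
    states, rewards = [start], []
    for _ in range(horizon):
        action = policy(states[-1])
        next_state, reward = model_step(states[-1], action)
        states.append(next_state)
        rewards.append(reward)
    return states, rewards


def lambda_returns(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Blend one-step bootstrapped targets with longer returns, computed backwards.

    values needs one more entry than rewards; the last entry bootstraps the tail.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    ret = values[-1]
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        ret = rewards[t] + gamma * ((1.0 - lam) * values[t + 1] + lam * ret)
        out[t] = ret
    return out


# Hypothetical toy model and policy: the state shrinks toward zero, reward is -|state|.
def toy_step(state: float, action: float) -> Tuple[float, float]:
    nxt = 0.9 * state + action
    return nxt, -abs(nxt)


def toy_policy(state: float) -> float:
    return -0.1 * state


states, rewards = imagine(1.0, toy_policy, toy_step, horizon=5)
values = [0.0 for _ in states]  # an untrained critic that predicts zero everywhere
targets = lambda_returns(rewards, values)
```

Setting `lam=0` reduces each target to a one-step bootstrap from the critic, while `lam=1` uses the full imagined return with a bootstrap only at the end of the horizon.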
The generations differ mainly in how the latent representation and the learning objective are built. The first Dreamer used continuous Gaussian latents. DreamerV2 switched to discrete representations made of multiple categorical variables, paired with a technique called KL balancing, which pulls the model's predictions toward its representations faster than it pulls the representations toward the predictions; the Google Research team credited these two changes with the jump to human-level Atari play. DreamerV3 kept discrete latents but added a set of robustness techniques, including a "symlog" transformation that compresses large positive and negative magnitudes while leaving values near zero almost unchanged, percentile-based normalization of returns, a two-hot reward objective, and free-bits clipping of the model loss. Together these let one configuration learn stably whether rewards are sparse or dense, large or small.[2][3][5]
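Two of these techniques are simple enough to show directly. The sketch below implements symlog with its inverse and a two-hot encoding over a set of bins. The function names, the evenly spaced bin layout, and the choice to place the bins in symlog space are assumptions made for illustration; the released code may arrange these pieces differently.

```python
import bisect
import math
from typing import List


def symlog(x: float) -> float:
    # sign(x) * ln(|x| + 1): close to the identity near zero, logarithmic for large |x|.
    return math.copysign(math.log1p(abs(x)), x)


def symexp(x: float) -> float:
    # Inverse of symlog: sign(x) * (exp(|x|) - 1).
    return math.copysign(math.expm1(abs(x)), x)


def two_hot(x: float, bins: List[float]) -> List[float]:
    """Spread x over its two neighbouring bins so the weighted bin average equals x."""
    x = min(max(x, bins[0]), bins[-1])  # clamp to the range the bins cover
    k = min(bisect.bisect_right(bins, x) - 1, len(bins) - 2)
    lo, hi = bins[k], bins[k + 1]
    w_hi = (x - lo) / (hi - lo)
    out = [0.0] * len(bins)
    out[k], out[k + 1] = 1.0 - w_hi, w_hi
    return out


def decode(weights: List[float], bins: List[float]) -> float:
    # Expected bin value, mapped back out of symlog space.
    return symexp(sum(w * b for w, b in zip(weights, bins)))


bins = [float(b) for b in range(-20, 21)]  # hypothetical evenly spaced bins
target = two_hot(symlog(1500.0), bins)     # training target for a reward of 1500
recovered = decode(target, bins)           # recovers 1500 up to rounding error
```

Because the target is a distribution over bins rather than a raw number, the size of the training signal does not grow with the size of the reward, which is one reason a single configuration can cope with very different reward scales.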
Dreamer's design differs from that of MuZero, another prominent model-based agent. MuZero learns a model that predicts only the quantities needed for planning, such as reward, value, and policy, and does not reconstruct observations; it then plans with explicit tree search. Dreamer instead learns a generative model that reconstructs observations and trains its policy by gradient-based optimization over imagined latent rollouts rather than search.[3][7]
| Version | Paper title | Year / venue | Headline result |
|---|---|---|---|
| Dreamer | Dream to Control: Learning Behaviors by Latent Imagination | arXiv Dec 2019; ICLR 2020 | Learned long-horizon behaviors from images by latent imagination; exceeded prior methods on 20 visual control tasks in data efficiency, compute, and final performance |
| DreamerV2 | Mastering Atari with Discrete World Models | arXiv Oct 2020; ICLR 2021 | First world-model agent to reach human-level performance on the Atari benchmark of 55 games, using discrete latent representations; outperformed Rainbow and IQN at equal compute |
| DreamerV3 | Mastering Diverse Domains through World Models (arXiv); Mastering diverse control tasks through world models (Nature) | arXiv Jan 2023; Nature Apr 2025 | A single fixed configuration that outperformed specialized methods across more than 150 tasks; first algorithm to collect diamonds in Minecraft from scratch |
The dates and titles above come from the arXiv records and the Nature paper. The peer-reviewed version of DreamerV3 appeared in Nature on 2 April 2025 under the title "Mastering diverse control tasks through world models," in volume 640, pages 647 to 653.[1][2][3][4]
The result that drew the most attention was in Minecraft. Mining a diamond is a long, multi-stage task: a player must gather wood, craft tools, dig down to the right depth, find iron, smelt it, and craft better tools before any diamond can be collected, all while the world is procedurally generated and rewards are sparse. DreamerV3 was, to the authors' knowledge, the first algorithm to collect diamonds in Minecraft entirely from scratch, learning from a reward signal without human demonstrations, expert data, or hand-designed curricula.[3][4]
The agent learned the full sequence on its own by imagining future scenarios in which its goals were achieved and steering toward them. According to the DreamerV3 project page, the first diamond was reached after roughly 30 million environment steps, on the order of 17 days of in-game play. Press coverage of the Nature paper emphasized that the agent reached an expert level after about nine days of training and that no human gameplay was used to teach it. Because earlier systems that played Minecraft had typically relied on large amounts of human video or staged sub-goals, doing it from scratch was the noteworthy part.[3][4][8]
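The two figures quoted for the first diamond are consistent with each other under one assumption that is not stated in the text above: that each environment step corresponds to one game tick at Minecraft's usual rate of 20 ticks per second. A quick check of the arithmetic:

```python
ENV_STEPS = 30_000_000          # step count quoted from the project page
STEPS_PER_SECOND = 20           # assumption: one environment step per 20 Hz game tick
SECONDS_PER_DAY = 24 * 60 * 60

play_seconds = ENV_STEPS / STEPS_PER_SECOND
play_days = play_seconds / SECONDS_PER_DAY
print(f"{play_days:.1f} days of continuous play")  # prints "17.4 days of continuous play"
```

This measures simulated play time, which is a different clock from the wall-clock training time reported in press coverage.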
Dreamer's broader contribution is evidence that a single model-based recipe can generalize. DreamerV3 was evaluated across eight domains and more than 150 tasks spanning Atari, DeepMind Lab, ProcGen, continuous control suites, Crafter, Minecraft, and others, using the same hyperparameters throughout. That cuts against the common pattern in reinforcement learning of tuning an algorithm separately for each benchmark, and it suggests world models are a practical route to general agents.[3][4]
The authors framed the work as a step toward systems that can teach themselves to reach goals, including the long-term prospect of robots that learn skills in the real world through internal simulation rather than costly trial and error. The Dreamer codebase has been released publicly, and the approach has influenced later research on reconstruction-free and predictive world models that borrow ideas from both Dreamer and MuZero.[3][7]