Diffusion Forcing
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,712 words
Diffusion Forcing is a training paradigm for sequence generative modeling, introduced in 2024, that assigns each token in a sequence its own independent, randomly sampled noise level during training [1]. Full-sequence diffusion models denoise an entire sequence at a single shared noise level, and standard autoregressive models process tokens strictly one at a time with clean history. A Diffusion Forcing model is instead trained to denoise tokens that each carry a different amount of corruption. Because the per-token noise levels are independent, the trained model can be steered at sampling time toward many different generation regimes, including pure next-token rollout, full-sequence diffusion, or a mixture of the two [1].
The paradigm was presented in the paper "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" by Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann, all affiliated with MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), and published at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) [1][2]. It is most prominently applied to video generation and to sequential decision-making and planning, where its ability to roll out sequences far beyond the training horizon and to guide sampling toward high-reward trajectories is central [1][3].
Two dominant families of generative models for sequences have complementary strengths and weaknesses [1].
Next-token autoregressive models, trained with teacher forcing, predict each token conditioned on the ground-truth tokens that precede it. They excel at variable-length, causal generation: a model trained on short sequences can in principle continue generating indefinitely, one token at a time. However, when generating continuous signals such as video frames, autoregressive rollout tends to accumulate errors. Small mistakes feed back into the conditioning history, compounding over time until the generation diverges. Standard teacher-forced video models therefore often "blow up" when asked to produce sequences much longer than those seen in training [1]. These models also typically lack a natural mechanism for sequence-level guidance, since they commit to each token before the rest of the sequence exists.
Full-sequence diffusion models instead treat an entire fixed-length sequence as a single object and denoise all tokens jointly from a shared noise level. This yields high sample quality and, crucially, supports guidance: because the whole trajectory is generated together, classifier or reward signals can be applied across the sequence to steer it toward desirable outcomes, as in trajectory planners such as Diffuser and Decision Diffuser [1][3]. The drawbacks are that the horizon is fixed at the architecture's sequence length and that generation is non-causal, so these models do not naturally support open-ended, variable-length rollout.
Diffusion Forcing is motivated by the goal of combining the variable-length, causal rollout of next-token prediction with the quality and guidability of full-sequence diffusion in a single model [1].
The central idea is to combine teacher forcing with diffusion by replacing the usual "noiseless history, predict next token" setup with a scheme in which every token is noised independently [1].
During training, a causal sequence model is given a sequence of tokens (for example, the frames of a video or the steps of a trajectory) in which each token is corrupted to its own noise level, sampled independently and at random. The model is trained to denoise this sequence, predicting the clean tokens from their independently noised versions. Because a token's noise level is decoupled from its neighbors', the network must learn to handle history of arbitrary, mixed reliability rather than assuming the past is either fully clean or fully noised [1].
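The corruption step described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: tokens are treated as plain scalars, the cosine noise schedule is an assumed choice, and the names `cosine_alpha_bar`, `noise_sequence`, `denoising_loss`, and the `predict_eps` callable (a stand-in for the causal sequence model) are all hypothetical.

```python
import math
import random


def cosine_alpha_bar(k, num_levels):
    """Fraction of signal kept at noise level k (0 = clean, num_levels = pure noise).

    The cosine form is an assumption made for this sketch.
    """
    return math.cos(0.5 * math.pi * k / num_levels) ** 2


def noise_sequence(tokens, num_levels, rng):
    """Corrupt each scalar token to its own independently sampled noise level."""
    levels, noisy, eps = [], [], []
    for x in tokens:
        k = rng.randint(0, num_levels)      # independent draw for every token
        a = cosine_alpha_bar(k, num_levels)
        e = rng.gauss(0.0, 1.0)             # Gaussian noise for this token
        levels.append(k)
        eps.append(e)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * e)
    return levels, noisy, eps


def denoising_loss(predict_eps, noisy, levels, eps):
    """Mean squared error between the true noise and the model's prediction.

    `predict_eps(noisy, levels)` is a hypothetical stand-in for a causal
    sequence model that sees every noisy token together with its noise level.
    """
    preds = predict_eps(noisy, levels)
    return sum((p - e) ** 2 for p, e in zip(preds, eps)) / len(eps)


rng = random.Random(0)
levels, noisy, eps = noise_sequence([1.0, 2.0, 3.0, 4.0], num_levels=10, rng=rng)
```

Because `levels` differs from token to token, a model trained on such inputs cannot assume that earlier tokens are clean, which is the property the paragraph above describes.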
The authors frame the noise level of a token as a form of partial, or fractional, masking, a perspective they call "noise as masking." A token at maximum noise carries essentially no information and behaves like a fully masked token, while a token at zero noise is fully observed. This unifies the two limiting regimes:
| Regime | Noise on history | Noise on current token | Equivalent to |
|---|---|---|---|
| Next-token prediction | zero (clean) | high, then denoised | causal autoregressive rollout |
| Full-sequence diffusion | uniform across all tokens | same uniform level | joint sequence diffusion |
| Diffusion Forcing | independent per token | independent per token | flexible mixture of both |
Because training exposes the model to randomly mixed per-token noise levels, the user can choose any schedule of noise levels across the sequence at sampling time. Holding past tokens at low noise while denoising future tokens recovers stable autoregressive generation; using a shared schedule recovers full-sequence diffusion; intermediate, semi-autoregressive schedules trade off compute, quality, and horizon [1].
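The three sampling regimes can be written as noise-level matrices, with one row per sampling step and one column per token. The sketch below is illustrative only: the function names are hypothetical, noise levels are integers from `num_levels` (pure noise) down to 0 (clean), and the staggered layout in `pyramid_schedule` is one assumed way to realize a semi-autoregressive schedule.

```python
def full_sequence_schedule(num_tokens, num_levels):
    """All tokens share one noise level that falls from num_levels to 0."""
    return [[k] * num_tokens for k in range(num_levels, -1, -1)]


def autoregressive_schedule(num_tokens, num_levels):
    """Fully denoise one token before starting the next; the past stays at 0."""
    rows = [[num_levels] * num_tokens]          # start from pure noise
    for t in range(num_tokens):
        for k in range(num_levels - 1, -1, -1):
            future = [num_levels] * (num_tokens - t - 1)
            rows.append([0] * t + [k] + future)
    return rows


def pyramid_schedule(num_tokens, num_levels):
    """Semi-autoregressive: each token trails its predecessor by one level."""
    rows = []
    for step in range(num_levels + num_tokens):
        row = [min(num_levels, max(0, num_levels + t - step))
               for t in range(num_tokens)]
        rows.append(row)
    return rows


for row in pyramid_schedule(num_tokens=3, num_levels=2):
    print(row)
```

Every schedule starts with all tokens at maximum noise and ends with all tokens clean; they differ only in how the noise is distributed across tokens along the way. The staggered variant needs `num_levels + num_tokens` rows, compared with `1 + num_tokens * num_levels` for the strictly autoregressive one, which illustrates the compute tradeoff mentioned above.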
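This flexibility yields several capabilities that neither parent family has alone, among them stable rollout beyond the training horizon, sequence-level guidance, and a tradeoff between compute and sample quality [1].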
The training objective is shown to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution, giving the method a principled probabilistic grounding [1]. The original work implements Diffusion Forcing on causal sequence architectures, including recurrent networks such as a GRU-based RNN, and notes that masked transformers can serve the same role [1].
Diffusion Forcing was validated across several domains spanning continuous video and control [1][3].
On video prediction benchmarks, including the DMLab navigation environment and Minecraft gameplay footage, Diffusion Forcing rolls out coherent video far beyond its training horizon, where comparable autoregressive baselines suffer severe consistency failures or diverge [1]. This positions the method as a building block for video-based world models, where an agent must imagine long, consistent futures.
For sequential decision-making, Diffusion Forcing models trajectories as sequences of interleaved actions and observations, so that planning becomes conditional sequence generation. On the D4RL Maze2D navigation benchmark, evaluated across medium, large, and giant mazes for both point-mass and ant robots, Diffusion Forcing outperforms full-sequence diffusion planners by a wide margin, with the advantage growing in larger mazes where its semi-autoregressive, variable-horizon schedule helps over long horizons [1][3]. Ablations show that removing Monte Carlo Guidance degrades performance, though the method remains competitive even without it [3].
The paradigm was also applied to real-robot visual manipulation and imitation learning. In a tabletop task requiring an arm to rearrange objects across randomized slot configurations, the Diffusion Forcing policy operates from visual observations and proves robust to visual distractions, with the authors reporting performance exceeding strong imitation-learning baselines [1]. The work additionally reports results on time-series prediction, underscoring the generality of per-token noise levels across continuous sequence data [1].
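Diffusion Forcing sits at the intersection of several established ideas, including teacher-forced next-token prediction, full-sequence diffusion, masked modeling, and diffusion-based trajectory planning [1].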
The most direct successor is the Diffusion Forcing Transformer (DFoT), introduced in the 2025 paper "History-Guided Video Diffusion" by Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann (MIT, CMU, and Harvard), published at the International Conference on Machine Learning (ICML 2025) [4][5]. DFoT scales the per-token noise idea to a transformer video architecture and uses it to enable conditioning on a flexible number of history frames. The paper introduces "history guidance," a family of guidance methods that compose history-conditioned scores to improve consistency, compositionality, and motion dynamics. Using these techniques, DFoT stably rolls out very long videos, reporting, for example, hundreds of frames on RealEstate10K and substantially longer rollouts than prior baselines on Kinetics-600, and the authors note it can be applied as a finetunable add-on to existing sequence diffusion models [4][5]. More broadly, the per-token noise level concept has informed a wave of causal and autoregressive video diffusion models aimed at long, interactive generation.
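One simple way to picture "composing history-conditioned scores" is a classifier-free-guidance-style combination, in which a prediction made without history is pushed toward predictions made with different history subsets. The sketch below shows only that general form; whether it matches DFoT's exact formulation is an assumption, and `compose_history_guidance` and its arguments are hypothetical names.

```python
def compose_history_guidance(eps_uncond, eps_conds, weights):
    """Return eps_uncond + sum_i w_i * (eps_cond_i - eps_uncond), elementwise.

    eps_uncond: noise prediction made without any history (list of floats).
    eps_conds:  noise predictions, each conditioned on one history subset.
    weights:    one guidance weight per conditioned prediction.
    """
    if len(eps_conds) != len(weights):
        raise ValueError("need exactly one weight per conditioned prediction")
    out = list(eps_uncond)
    for eps_c, w in zip(eps_conds, weights):
        for j, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            out[j] += w * (c - u)
    return out


# Two history subsets, each pulling the prediction in its own direction.
guided = compose_history_guidance(
    eps_uncond=[0.0, 0.0],
    eps_conds=[[1.0, 0.0], [0.0, 1.0]],
    weights=[1.0, 1.0],
)
```

With a single conditioned prediction and weight 1, the function simply returns that conditioned prediction; larger weights extrapolate further away from the history-free prediction.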
Diffusion Forcing is notable for providing a single, principled training objective that interpolates between next-token prediction and full-sequence diffusion, two paradigms that were previously treated as distinct [1]. By exposing the noise schedule as a free choice at inference time, it lets one trained model behave as an autoregressive generator, a full-sequence diffuser, or anything in between, while gaining stable long-horizon rollout, sequence-level guidance, and a compute-for-quality tradeoff. Its strongest empirical impact has been in video generation and in robot planning and decision-making, and its core mechanism, independent per-token noise levels, has become an influential ingredient in subsequent long-video and interactive world-model research, most directly through the Diffusion Forcing Transformer [1][4].