Diffusion Forcing

Updated Jun 8, 2026

Overview

Diffusion Forcing is a training paradigm for sequence generative modeling introduced in 2024 that assigns each token in a sequence its own independent, randomly sampled noise level during training ^[1]. Rather than denoising an entire sequence at a single shared noise level, as in full sequence diffusion models, or processing tokens strictly one at a time with clean history, as in standard autoregressive models, a Diffusion Forcing model is trained to denoise tokens that each carry a different amount of corruption. Because the per-token noise levels are independent, the trained model can be steered at sampling time toward many different generation regimes, including pure next-token rollout, full-sequence diffusion, or a mixture of the two ^[1].

The paradigm was presented in the paper "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion" by Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann, most of them affiliated with MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), and published at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024) ^[1]^[2]. Its most prominent applications are video generation and sequential decision-making and planning, both of which depend on rolling out sequences far beyond the training horizon and on guiding sampling toward high-reward trajectories ^[1]^[3].

Background: autoregressive versus full-sequence diffusion

Two dominant families of generative models for sequences have complementary strengths and weaknesses ^[1].

Next-token autoregressive models, trained with teacher forcing, predict each token conditioned on the ground-truth tokens that precede it. They excel at variable-length, causal generation: a model trained on short sequences can in principle continue generating indefinitely, one token at a time. However, when generating continuous signals such as video frames, autoregressive rollout tends to accumulate errors. Small mistakes feed back into the conditioning history, compounding over time until the generation diverges. Standard teacher-forced video models therefore often "blow up" when asked to produce sequences much longer than those seen in training ^[1]. These models also typically lack a natural mechanism for sequence-level guidance, since they commit to each token before the rest of the sequence exists.

Full-sequence diffusion models instead treat an entire fixed-length sequence as a single object and denoise all tokens jointly at a shared noise level. This yields high sample quality and, crucially, supports guidance: because the whole trajectory is generated together, classifier or reward signals can be applied across the sequence to steer it toward desirable outcomes, as in trajectory planners such as Diffuser and Decision Diffuser ^[1]^[3]. The drawbacks are that the horizon is fixed at the sequence length the model was trained on and that generation is non-causal, so these models do not naturally support open-ended, variable-length rollout.

Diffusion Forcing is motivated by the goal of combining the variable-length, causal rollout of next-token prediction with the quality and guidability of full-sequence diffusion in a single model ^[1].

How Diffusion Forcing works

The central idea is to combine teacher forcing with diffusion by replacing the usual "noiseless history, predict next token" setup with a scheme in which every token is noised independently ^[1].

Per-token noise levels

During training, a causal sequence model is given a sequence of tokens (for example, the frames of a video or the steps of a trajectory) in which each token is corrupted to its own noise level, sampled independently and at random. The model is trained to denoise this sequence, predicting the clean tokens from their independently noised versions. Because a token's noise level is decoupled from its neighbors', the network must learn to handle history of arbitrary, mixed reliability rather than assuming the past is either fully clean or fully noised ^[1].
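
The corruption step described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the scalar tokens, the cosine schedule in `alpha_bar`, the value of `NUM_LEVELS`, and the noise-prediction loss are all assumptions made here for concreteness, and the network that would produce `predicted_eps` is left out.

```python
import math
import random

NUM_LEVELS = 1000  # assumed number of discrete noise levels; 0 = clean, NUM_LEVELS = pure noise


def alpha_bar(k, num_levels=NUM_LEVELS):
    """Signal fraction kept at level k (cosine schedule, an assumption of this sketch)."""
    return math.cos(0.5 * math.pi * k / num_levels) ** 2


def noise_sequence(tokens, rng):
    """Corrupt each scalar token to its own independently sampled noise level."""
    levels, noisy, eps = [], [], []
    for x in tokens:
        k = rng.randint(0, NUM_LEVELS)  # drawn independently for every token
        e = rng.gauss(0.0, 1.0)
        a = alpha_bar(k)
        levels.append(k)
        eps.append(e)
        noisy.append(math.sqrt(a) * x + math.sqrt(max(0.0, 1.0 - a)) * e)
    return levels, noisy, eps


def denoising_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and true noise, averaged over tokens."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)


rng = random.Random(0)
levels, noisy, eps = noise_sequence([0.5, -1.0, 2.0, 0.0], rng)
print(levels)  # one noise level per token, sampled independently
```

In a full training loop, a causal model would receive `noisy` together with `levels` and be optimized so that its output minimizes `denoising_loss` against `eps`. The only difference from full-sequence diffusion at this step is that `k` is drawn inside the loop instead of once per sequence.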

The authors frame the noise level of a token as a form of partial, or fractional, masking, a perspective they call "noise as masking." A token at maximum noise carries essentially no information and behaves like a fully masked token, while a token at zero noise is fully observed. This unifies the two limiting regimes:

Regime	Noise on history	Noise on current token	Equivalent to
Next-token prediction	zero (clean)	high, then denoised	causal autoregressive rollout
Full-sequence diffusion	uniform across all tokens	same uniform level	joint sequence diffusion
Diffusion Forcing	independent per token	independent per token	flexible mixture of both

Because noise levels are sampled independently for every token, training exposes the model to a broad range of noise-level combinations, and at sampling time the user can choose any schedule of noise levels across the sequence. Keeping past tokens at low noise while denoising future tokens recovers stable autoregressive generation; using a shared schedule recovers full-sequence diffusion; intermediate, semi-autoregressive schedules trade off compute, quality, and horizon ^[1].
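
One way to picture these sampling regimes is as a grid of target noise levels, with one row per sampling iteration and one column per token. The sketch below is illustrative only: the function names and the single `delay` parameter are assumptions of this example, not the paper's API. Each token starts at the maximum level and begins denoising `delay` iterations after its predecessor.

```python
def staggered_schedule(num_tokens, num_levels, delay):
    """Return a grid whose entry [m][t] is the noise level of token t after iteration m + 1.

    Every token starts at num_levels (pure noise) and ends at 0 (clean).
    Token t begins denoising t * delay iterations after token 0.
    """
    total_iters = num_levels + (num_tokens - 1) * delay
    grid = []
    for m in range(1, total_iters + 1):
        row = [min(num_levels, max(0, num_levels - m + t * delay))
               for t in range(num_tokens)]
        grid.append(row)
    return grid


def full_sequence_schedule(num_tokens, num_levels):
    """All tokens share one noise level at every iteration (delay of zero)."""
    return staggered_schedule(num_tokens, num_levels, 0)


def autoregressive_schedule(num_tokens, num_levels):
    """Each token is fully denoised before the next one starts."""
    return staggered_schedule(num_tokens, num_levels, num_levels)


for row in staggered_schedule(num_tokens=3, num_levels=2, delay=1):
    print(row)
```

Setting `delay` to zero gives the full-sequence regime and setting it to `num_levels` gives pure next-token rollout. Values in between produce semi-autoregressive schedules, in which nearer tokens are always cleaner than farther ones.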

Sampling, stability, and guidance

This flexibility yields several capabilities that neither parent family has alone ^[1]:

Stable, arbitrarily long rollout. A causal Diffusion Forcing model can generate future tokens without fully diffusing the past, and can continue past the training horizon where teacher-forced baselines diverge. The authors demonstrate stable rollout well beyond the training length, for example continuing video far past the number of frames seen during training ^[1].
Sequence-level guidance. Because future tokens are generated jointly under diffusion, guidance signals such as expected reward can be applied across the planned horizon. The paper introduces Monte Carlo Guidance (MCG), which biases sampling toward trajectories that maximize expected future reward, something one-shot full-sequence diffusion planners do not natively support ^[1]^[3].
Trading compute for quality. The number of denoising steps and the aggressiveness of the per-token schedule can be tuned at inference, allowing the same trained model to spend more computation for higher quality or less for speed ^[1].
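
The idea behind guiding on expected future reward can be illustrated with a toy example. This is a hedged sketch, not the paper's Monte Carlo Guidance procedure: the random-walk `rollout_final_state` stands in for sampling futures from a trained model, the quadratic `reward`, the finite-difference gradient, and all parameter values are assumptions chosen so that the example runs with the standard library alone.

```python
import random


def rollout_final_state(x, noise_draws):
    """Stand-in for sampling a future from the model: a random walk starting at x."""
    return x + sum(noise_draws)


def reward(final_state, goal):
    """Toy reward: negative squared distance of the final state from the goal."""
    return -(final_state - goal) ** 2


def expected_reward_gradient(x, goal, futures, h=1e-4):
    """Average, over sampled futures, a central finite-difference gradient of reward w.r.t. x."""
    total = 0.0
    for draws in futures:
        up = reward(rollout_final_state(x + h, draws), goal)
        down = reward(rollout_final_state(x - h, draws), goal)
        total += (up - down) / (2.0 * h)
    return total / len(futures)


def guided_update(x, goal, rng, num_futures=32, horizon=5, scale=0.1):
    """Nudge the current token x along the Monte Carlo estimate of the reward gradient."""
    futures = [[rng.gauss(0.0, 0.1) for _ in range(horizon)]
               for _ in range(num_futures)]
    return x + scale * expected_reward_gradient(x, goal, futures)


rng = random.Random(0)
print(guided_update(0.0, 2.0, rng))  # moves from 0.0 toward the goal at 2.0
```

Averaging over several sampled futures, rather than relying on a single one, is what makes the guidance signal an estimate of expected reward. In a real system the gradient would be taken through the model's own rollouts and added to the denoising update of the current token.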

The training objective is shown to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution, giving the method a principled probabilistic grounding ^[1]. The original work implements Diffusion Forcing on causal sequence architectures, including recurrent networks such as a GRU-based RNN, and notes that causally masked transformers can serve the same role ^[1].
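
For orientation, a per-token noise-prediction loss of this kind can be written as follows. The notation is an assumption of this sketch and may differ from the paper's: $x_t$ is the clean token at position $t$, $k_t$ its independently sampled noise level, $\epsilon_t$ the Gaussian noise added to it, $\bar\alpha_{k}$ the signal fraction at level $k$, and $z_{t-1}$ the model's summary of the (noisy) history.

```latex
x_t^{k_t} = \sqrt{\bar\alpha_{k_t}}\, x_t + \sqrt{1 - \bar\alpha_{k_t}}\, \epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, I)

\mathcal{L}(\theta) =
\mathbb{E}_{k_{1:T},\, x_{1:T},\, \epsilon_{1:T}}
\sum_{t=1}^{T}
\left\| \epsilon_t - \epsilon_\theta\!\left(z_{t-1},\, x_t^{k_t},\, k_t\right) \right\|^2
```

The sum runs over tokens, and each term uses that token's own noise level $k_t$, which is where this loss differs from a full-sequence diffusion loss with a single shared level.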

Applications

Diffusion Forcing was validated across several domains spanning continuous video and control ^[1]^[3].

Video generation

On video prediction benchmarks, including the DMLab navigation environment and Minecraft gameplay footage, Diffusion Forcing rolls out coherent video far beyond its training horizon, where comparable autoregressive baselines suffer severe consistency failures or diverge ^[1]. This positions the method as a building block for video-based world models, where an agent must imagine long, consistent futures.

Planning and decision-making

For sequential decision-making, Diffusion Forcing models trajectories as sequences of interleaved actions and observations, so that planning becomes conditional sequence generation. On the D4RL Maze2D navigation benchmark, evaluated on the U-Maze, Medium, and Large layouts in both the single-task Maze2D and multi-task Multi2D variants, Diffusion Forcing is reported to outperform full-sequence diffusion planners by a wide margin, with the advantage growing in larger mazes where its semi-autoregressive, variable-horizon schedule helps over long horizons ^[1]^[3]. Ablations show that removing Monte Carlo Guidance degrades performance, though the method remains competitive even without it ^[3].

Robotic manipulation and imitation learning

The paradigm was also applied to real-robot visual manipulation and imitation learning. In a tabletop task requiring an arm to rearrange objects across randomized slot configurations, the Diffusion Forcing policy operates from visual observations and proves robust to visual distractions, with the authors reporting performance exceeding strong imitation-learning baselines ^[1]. The work additionally reports results on time-series prediction, underscoring the generality of per-token noise levels across continuous sequence data ^[1].

Relationship to other methods

Diffusion Forcing sits at the intersection of several established ideas ^[1]:

Teacher forcing. The name deliberately echoes teacher forcing: training conditions on history, but with independent per-token noise rather than clean ground-truth history, which is what stabilizes long rollout.
Full-sequence diffusion planners. It generalizes trajectory-diffusion methods such as Diffuser and Decision Diffuser by making the model causal and variable-length while retaining their guidance abilities ^[1]^[3].
Autoregressive and masked diffusion. The "noise as masking" view connects it to autoregressive diffusion and to masked / absorbing-state diffusion language models, where masking is the corruption process; Diffusion Forcing instead uses continuous per-token noise as a soft, graded form of masking ^[1].

The most direct successor is the Diffusion Forcing Transformer (DFoT), introduced in the 2025 paper "History-Guided Video Diffusion" by Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, and Vincent Sitzmann (MIT, CMU, and Harvard), published at the International Conference on Machine Learning (ICML 2025) ^[4]^[5]. DFoT scales the per-token noise idea to a transformer video architecture and uses it to enable conditioning on a flexible number of history frames. The paper introduces "history guidance," a family of guidance methods that compose history-conditioned scores to improve consistency, compositionality, and motion dynamics. Using these techniques, DFoT stably rolls out very long videos, reporting, for example, hundreds of frames on RealEstate10K and substantially longer rollouts than prior baselines on Kinetics-600, and the authors note it can be applied as a finetunable add-on to existing sequence diffusion models ^[4]^[5]. More broadly, the per-token noise level concept has informed a wave of causal and autoregressive video diffusion models aimed at long, interactive generation.

Significance

Diffusion Forcing is notable for providing a single, principled training objective that interpolates between next-token prediction and full-sequence diffusion, two paradigms that were previously treated as distinct ^[1]. By exposing the noise schedule as a free choice at inference time, it lets one trained model behave as an autoregressive generator, a full-sequence diffuser, or anything in between, while gaining stable long-horizon rollout, sequence-level guidance, and a compute-for-quality tradeoff. Its strongest empirical impact has been in video generation and in robot planning and decision-making, and its core mechanism, independent per-token noise levels, has become an influential ingredient in subsequent long-video and interactive world-model research, most directly through the Diffusion Forcing Transformer ^[1]^[4].

References

1. Boyuan Chen, Diego Marti Monso, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann. "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion." arXiv:2407.01392, 2024. https://arxiv.org/abs/2407.01392
2. "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion." NeurIPS 2024 (Advances in Neural Information Processing Systems 37). https://proceedings.neurips.cc/paper_files/paper/2024/file/2aee1c4159e48407d68fe16ae8e6e49e-Paper-Conference.pdf
3. Project page, "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion." Scene Representation Group, MIT. https://www.boyuan.space/diffusion-forcing/
4. Kiwhan Song, Boyuan Chen, Max Simchowitz, Yilun Du, Russ Tedrake, Vincent Sitzmann. "History-Guided Video Diffusion." arXiv:2502.06764, 2025. https://arxiv.org/abs/2502.06764
5. "History-Guided Video Diffusion." ICML 2025. https://icml.cc/virtual/2025/poster/44316
