Diffusion Transformer (DiT)

Updated Jun 21, 2026

A Diffusion Transformer (DiT) is a transformer-based neural network backbone for diffusion models that replaces the U-Net with a Vision Transformer operating on patches of an image latent. Introduced by William Peebles and Saining Xie in the December 2022 paper Scalable Diffusion Models with Transformers (arXiv:2212.09748), which was presented as an oral at ICCV 2023, the DiT showed that pure transformer backbones scale more cleanly than convolutional U-Nets and that the largest variant, DiT-XL/2, reaches a state-of-the-art FID of 2.27 on class-conditional ImageNet 256x256.^[1] As the authors put it, "We find that DiTs with higher Gflops, through increased transformer depth/width or increased number of input tokens, consistently have lower FID."^[1] The architecture is now the standard backbone of frontier image and video generators including Sora, Stable Diffusion 3, and FLUX.1.

Peebles (then a PhD student at UC Berkeley with a research stint at Meta AI) and Xie (NYU faculty) built DiT on the Vision Transformer recipe of patchify-embed-attend.^[1] The paper's central empirical claim is that pure transformer backbones scale more cleanly than convolutional U-Nets along compute and parameters, and that this scaling advantage extends to image generation.^[1]

Since late 2023 the architecture has displaced U-Nets across most frontier image and video generation systems. The OpenAI Sora and Sora 2 video models, Stable Diffusion 3 and 3.5, FLUX.1, Google DeepMind's Veo, Veo 2, and Veo 3, Meta's Movie Gen, Runway Gen-3 and Gen-4, Pika 2.0+, PixArt-alpha and Sigma, Hunyuan-DiT, Lumina-T2X, Nvidia Cosmos, CogVideoX, and Latte are all DiT-style designs. The label "DiT" now refers both to the original ImageNet model from Peebles and Xie and to the broader family of transformer-on-latent-patch diffusion architectures that grew out of it.

When was the diffusion transformer introduced?

DiT was introduced in the paper Scalable Diffusion Models with Transformers, posted to arXiv as 2212.09748 in December 2022 and published at ICCV 2023, where it was presented as an oral.^[1] The original code and weights were released through the Facebook Research GitHub repository under a CC-BY-NC license.^[3] The architecture moved from an academic ImageNet result to the backbone of most major frontier image and video models within roughly twenty-four months, with Stable Diffusion 3 and Sora both announced in early 2024.^[11]^[22]

Why it matters

The U-Net introduced in Denoising Diffusion Probabilistic Models (Ho, Jain, and Abbeel, 2020) had been the default backbone for diffusion models for almost three years before DiT appeared.^[4] U-Nets perform reasonably well at small to medium scale, but Peebles and Xie found that scaling them by adding parameters and compute produced inconsistent improvements.^[1] Transformers, in contrast, had already shown predictable scaling behaviour for language models (the Chinchilla scaling laws) and for image classification (Vision Transformer (ViT)).^[8] DiT extended this evidence to image synthesis: as the authors increased the transformer's depth, width, or token count, FID on ImageNet decreased monotonically.^[1]

This matters for two reasons. First, predictable scaling lets practitioners trade compute directly for sample quality, which is how the rest of deep learning has progressed since 2020. Second, transformer infrastructure is mature: flash attention, ZeRO sharding, tensor parallelism, RoPE, and mixed-precision training all transfer cleanly to DiTs. Once DiT made this trade-off explicit, the open-source ecosystem and major labs converged on the design within roughly eighteen months.

Background context

Diffusion models

A diffusion model defines a forward process that gradually adds Gaussian noise to a sample from the data distribution until almost no signal remains, and a reverse process, parameterised by a neural network, that learns to denoise step by step. Training reduces to predicting the noise, the clean sample, or a velocity vector at random noise levels. Sampling starts from pure Gaussian noise and iteratively applies the learned denoiser. Three foundational papers established this framework: Ho, Jain, and Abbeel's DDPM (2020),^[4] Song et al.'s 2021 ICLR paper Score-Based Generative Modeling through Stochastic Differential Equations (which won an Outstanding Paper award and unified score matching with denoising diffusion through SDEs),^[5] and Karras, Aittala, Aila, and Laine's Elucidating the Design Space of Diffusion-Based Generative Models (NeurIPS 2022, often abbreviated EDM), which separated and tuned the design choices in sampling and preconditioning.^[6]
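The closed-form forward process and the noise-prediction loss can be sketched in a few lines of NumPy. This is a minimal illustration rather than any paper's reference code: the linear beta schedule, the 1,000-step horizon, and all function names are assumptions chosen for the example.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule (an assumed, commonly used default)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in closed form; also return the noise used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)       # fraction of signal variance retained

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))     # stand-in for a clean latent
x_t, eps = forward_noise(x0, t=999, alpha_bar=alpha_bar, rng=rng)
```

At the last timestep `alpha_bar` is close to zero, so `x_t` is almost pure Gaussian noise, which is where sampling starts.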

U-Net backbone

The U-Net is an encoder-decoder convolutional network with skip connections between matching resolutions. It was first applied to diffusion in DDPM and remained the standard backbone for Stable Diffusion 1.x, 2.x, and SDXL (which uses a 2.6-billion-parameter U-Net).^[14] U-Nets are biased toward local processing and downsample the spatial grid through several stages, which makes them efficient at small resolutions but awkward to scale: skip connections fix the topology, and adding capacity disproportionately benefits the bottleneck.

Vision Transformer

The Vision Transformer (Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021) treats an image as a sequence of non-overlapping patches, embeds each patch linearly, adds positional encodings, and processes the sequence with a standard transformer encoder using self-attention.^[8] ViTs have no convolutional inductive bias, which makes them weaker at small scales and stronger at large scales relative to convolutional networks. DiT inherits the patchify-and-attend recipe from ViT and applies it to the diffusion denoising task.

Latent diffusion

Rombach, Blattmann, Lorenz, Esser, and Ommer's High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022) trained a variational autoencoder (VAE) to compress images into a low-dimensional latent grid, then ran the diffusion model in that latent space.^[7] The compressed representation is roughly 8x smaller per side than the input image, which makes high-resolution synthesis tractable. This is the basis of Stable Diffusion and most subsequent open image models, and it is also the input space used by DiT.

DiT architecture

DiT operates on the latent grid produced by a VAE encoder. For ImageNet 256x256, Peebles and Xie use a Stable-Diffusion VAE that produces a 32x32x4 latent.^[1] The DiT block diagram, in steps:

Patchify. Split the latent into non-overlapping patches. The original paper studies patch sizes 2, 4, and 8. With patch size 2 on a 32x32 latent, the model processes 256 tokens.^[1]
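Linear embedding plus positional encoding. Each patch is flattened and projected to the hidden dimension d, then a positional embedding is added. The original DiT uses fixed sine-cosine (frequency-based) embeddings, as in ViT; some later variants use learned embeddings or RoPE instead.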
Transformer blocks. A stack of N standard ViT-style blocks, each consisting of multi-head self-attention followed by an MLP, with pre-LayerNorm and residual connections.
Conditioning injection. The diffusion timestep t and a class label or text embedding c are mixed into every block.
Output head. A final LayerNorm and linear projection produce the predicted noise (or a velocity, in flow-matching variants), which is reshaped back into a latent of the original spatial resolution.
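The patchify and un-patchify steps above are pure reshapes, which a short NumPy sketch makes concrete. The function names, the channels-first layout, and the random embedding weights are assumptions for illustration; only the shapes (a 32x32x4 latent, patch size 2, hidden width 1152 for DiT-XL) come from the text.

```python
import numpy as np

def patchify(latent, patch_size):
    """Split a (C, H, W) latent into (num_tokens, patch_size * patch_size * C) rows."""
    c, h, w = latent.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "latent side must be divisible by patch size"
    gh, gw = h // p, w // p
    x = latent.reshape(c, gh, p, gw, p)
    x = x.transpose(1, 3, 2, 4, 0)            # (gh, gw, p, p, c)
    return x.reshape(gh * gw, p * p * c)

def unpatchify(tokens, patch_size, channels):
    """Inverse of patchify for a square token grid."""
    n, dim = tokens.shape
    p = patch_size
    g = int(round(n ** 0.5))
    assert g * g == n and dim == p * p * channels
    x = tokens.reshape(g, g, p, p, channels)
    x = x.transpose(4, 0, 2, 1, 3)            # (c, gh, p, gw, p)
    return x.reshape(channels, g * p, g * p)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))     # 32x32x4 latent, channels first
tokens = patchify(latent, patch_size=2)
print(tokens.shape)                           # (256, 16)

hidden_dim = 1152                             # DiT-XL width
w_embed = rng.standard_normal((tokens.shape[1], hidden_dim)) * 0.02  # illustrative weights
embedded = tokens @ w_embed
print(embedded.shape)                         # (256, 1152)
```

The output head reverses the process: a linear layer maps each token back to `patch_size * patch_size * C` values and `unpatchify` restores the spatial grid.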

Conditioning methods

The paper compares four ways of conditioning on t and c. The differences are summarised below.

Method	How it works	Parameters added	Compute overhead	FID at 400K iters (DiT-XL/2)
In-context	Concatenate t and c as extra tokens at the start of the sequence	None	Negligible	~10.7
Cross-attention	Add a cross-attention layer in each block that attends to t and c tokens	Extra attention layer per block	Highest (~15% more Gflops)	~9.5
Adaptive LayerNorm (adaLN)	Predict scale and shift parameters for each LayerNorm from t and c	Modest MLP	Low	~8.4
adaLN-Zero	adaLN with the residual scale initialised to zero so each block starts as identity	Same as adaLN	Same as adaLN	~5.6

adaLN-Zero is the single most distinctive design choice in DiT. The conditioning vector is fed through a small MLP that outputs six modulation parameters per block: scale and shift for the LayerNorm before the attention sublayer, scale and shift for the LayerNorm before the MLP sublayer, and two residual gates that scale the output of each sublayer. The residual gates are initialised to zero, so at the start of training each DiT block is the identity function. This stabilises optimisation and matches the trick used by Goyal et al. for ResNets in 2017.^[30] Peebles and Xie report that adaLN-Zero almost halves the FID at 400K iterations compared with the in-context baseline.^[1]

The choice of adaLN-Zero turned out to be more than a trick. In a network with dozens of stacked blocks, allowing each one to start as a no-op means that learning has to find a useful direction for each block separately, rather than fighting through a random initial transformation. This is loosely the same idea behind the Fixup and ReZero literature, and it is one reason DiTs train stably at scales where naive transformer initialisations diverge.
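A toy NumPy block shows why zero-initialised gates make each block an identity map at the start of training. This is a sketch under stated assumptions, not the reference implementation: the attention and MLP sublayers are replaced by random linear maps, the conditioning projection is a single linear layer, and the whole projection is zero-initialised as one simple way to obtain zero gates.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm without learned affine parameters (adaLN supplies them)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    """Apply the conditioning-dependent scale and shift."""
    return x * (1.0 + scale) + shift

class AdaLNZeroBlock:
    def __init__(self, dim, rng):
        # Stand-ins for the attention and MLP sublayers (assumption: plain linear maps).
        self.w_attn = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.w_mlp = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # Conditioning projection that emits six modulation vectors per block.
        # Zero initialisation means every gate starts at zero.
        self.w_mod = np.zeros((dim, 6 * dim))
        self.b_mod = np.zeros(6 * dim)

    def __call__(self, x, cond):
        mod = cond @ self.w_mod + self.b_mod
        shift1, scale1, gate1, shift2, scale2, gate2 = np.split(mod, 6, axis=-1)
        x = x + gate1 * (modulate(layer_norm(x), shift1, scale1) @ self.w_attn)
        x = x + gate2 * (modulate(layer_norm(x), shift2, scale2) @ self.w_mlp)
        return x

rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, rng=rng)
x = rng.standard_normal((5, 8))        # 5 tokens, width 8
cond = rng.standard_normal(8)          # timestep plus class embedding, already summed
out = block(x, cond)
print(np.array_equal(out, x))          # True: the block is the identity at init
```

Once training moves the gates away from zero, each sublayer starts contributing to the residual stream.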

Why do transformers outscale U-Nets?

Two properties seem to explain DiT's clean scaling behaviour. The first is that a transformer treats every token uniformly. There is no privileged spatial bottleneck and no hand-coded multi-resolution hierarchy, so adding parameters increases capacity smoothly across the whole network rather than disproportionately at one place. The second is that attention has no fixed inductive bias about which tokens should interact, which means a larger model can in principle learn its own coupling pattern (sparse for textures, dense for global composition). U-Nets, by contrast, fix the receptive field at every level via convolution kernels and pooling, and adding capacity to a U-Net has to fight against that geometry.

This is also why DiTs work well as a substrate for classifier-free guidance and other conditioning tricks: adaLN applies the class label or text embedding to every token in every block, and self-attention lets every token see every other token, so the conditioning signal reaches everywhere by default.

Model variants

The original paper trains four sizes, in line with the ViT naming convention. The patch size appears as a suffix; DiT-XL/2 means an XL model with patch size 2.

Model	Layers (N)	Hidden dim (d)	Heads	Parameters	Gflops at 256x256, patch 2
DiT-S	12	384	6	~33M	~6
DiT-B	12	768	12	~130M	~23
DiT-L	24	1024	16	~458M	~80
DiT-XL	28	1152	16	~675M	~119

DiT-XL/2 reaches an FID-50K of 2.27 on class-conditional ImageNet 256x256 with classifier-free guidance, beating the previous best of 3.60 from latent diffusion.^[1] At 512x512, it reaches FID 3.04 at roughly 525 Gflops per forward pass.^[1] The headline finding from the scaling sweep is that, holding other knobs fixed, increasing transformer compute (depth, width, or token count) consistently lowers FID, with no clear saturation in the regime they explored.^[1] The paper's own framing is direct: the DiT-XL/2 models "outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter."^[1]

The paper also reports a clean log-linear relation between Gflops per forward pass and FID across the variants.^[1] Doubling the per-step compute moves FID down by roughly the same multiplicative amount across the small-to-XL range, which is the signature of a healthy scaling regime. Smaller patch sizes (patch 2 vs patch 4 or 8) increase the token count, which raises compute but also pushes FID down. The authors land on patch size 2 as the sweet spot for ImageNet at 256x256.^[1]
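The parameter counts in the variants table can be roughly reproduced from depth and width alone. The estimate below is a back-of-the-envelope sketch, not the paper's accounting: it assumes an MLP ratio of 4, a single linear layer per block mapping a width-d conditioning vector to the six adaLN modulation vectors, and it ignores biases, the patch embedding, the conditioning embedders, and the output head.

```python
def approx_dit_params(num_layers, hidden_dim, mlp_ratio=4):
    """Rough transformer-block parameter count for a DiT-style model."""
    attn = 4 * hidden_dim ** 2                # Q, K, V and output projections
    mlp = 2 * mlp_ratio * hidden_dim ** 2     # two linear layers
    adaln = 6 * hidden_dim ** 2               # conditioning -> 6 modulation vectors
    return num_layers * (attn + mlp + adaln)

configs = {
    "DiT-S": (12, 384),
    "DiT-B": (12, 768),
    "DiT-L": (24, 1024),
    "DiT-XL": (28, 1152),
}
for name, (layers, width) in configs.items():
    print(f"{name}: ~{approx_dit_params(layers, width) / 1e6:.0f}M")
```

Under these assumptions the estimates come out at about 32M, 127M, 453M, and 669M, each within a few percent of the table, which suggests that almost all of the capacity sits in the repeated blocks.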

Training compute and data

The original DiT paper does not publish a single dollar or watt figure for the headline run, but it does report enough hardware and step counts to back out an estimate.^[1]

Run	Hardware reported	Steps	Approx. compute	Notes
DiT-XL/2 at 256x256	TPU v3 pods (paper) and A100 reproductions	7M	Roughly 2,500 GPU-days on A100 equivalents	Class-conditional ImageNet only
DiT-XL/2 at 512x512	Same	3M	Roughly 1,000 GPU-days	Higher-resolution finetune
Scaling sweep (S/B/L/XL with patches 2/4/8)	Same	400K each	Smaller; used to fit scaling curves	Reported in the main figures

Independent reproductions on A100 nodes have placed the DiT-XL/2 ImageNet run at roughly 2,500 GPU-days, which is modest by 2024 frontier-model standards but large for an academic paper. The run uses standard mixed-precision training, an AdamW optimiser with constant learning rate, and exponential moving averages of the weights for inference.^[1] There is no learning-rate schedule, no gradient clipping, and no warmup.

How does a diffusion transformer generate an image?

DiT inherits its sampling machinery from generic diffusion models. DDIM, DPM-Solver, Euler ancestral, and Heun-style Runge-Kutta solvers all work without changes. Twenty to fifty steps are typical for high-quality output at inference. Classifier-free guidance, where the model is run with and without the conditioning and the predictions are linearly extrapolated, is used to trade diversity for sample quality. Distillation methods such as progressive distillation and consistency models compress the multi-step sampler into a few-step or single-step generator; FLUX.1 schnell relies on distillation of this kind to reach four-step generation.^[20]
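The two core sampling operations, classifier-free guidance and a deterministic DDIM update, are small enough to write out. The sketch below assumes a noise-prediction model and the usual cumulative signal coefficient `alpha_bar`; the function names are illustrative and the denoiser itself is left out.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a noise-prediction model."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

# Sanity check with an oracle: if the predicted noise is exact, stepping to
# alpha_bar_prev = 1 (no noise left) recovers the clean sample.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))
eps = rng.standard_normal(x0.shape)
alpha_bar_t = 0.3
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
recovered = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
print(np.allclose(recovered, x0))   # True
```

A guidance scale of 1 reproduces the conditional prediction, 0 gives the unconditional one, and values above 1 push samples toward the condition at the cost of diversity.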

Memory and inference cost

The quadratic attention cost has practical consequences at inference time. A DiT-XL/2 forward pass on a 256x256 image is cheap (roughly 119 Gflops), but extending to 1024x1024 raises the token count by 16x and the attention cost by 256x. This is why FLUX.1 dev needs roughly 24 GB of VRAM at full precision^[21] and why Stable Diffusion 3.5 Medium was deliberately designed to fit on consumer GPUs at FP8.^[13] Activation checkpointing, sequence-parallel attention, and FlashAttention-3 all help. For video DiTs, the situation is much worse: Sora's full-resolution one-minute generation reportedly required dedicated inference clusters with custom kernels, and most open video DiTs cap themselves at five to fifteen seconds of output.^[22]
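The 16x and 256x figures follow directly from the token-count arithmetic. A short calculation, assuming an 8x VAE downsample and patch size 2 as in the original DiT setup (the helper names are illustrative):

```python
def num_tokens(image_size, vae_downsample=8, patch_size=2):
    """Tokens for a square image after VAE compression and patchification."""
    side = image_size // (vae_downsample * patch_size)
    return side * side

def attention_pairs(image_size, vae_downsample=8, patch_size=2):
    """Query-key pairs per head per layer (quadratic in token count)."""
    n = num_tokens(image_size, vae_downsample, patch_size)
    return n * n

for size in (256, 512, 1024):
    print(size, num_tokens(size), attention_pairs(size))
# 256 256 65536
# 512 1024 1048576
# 1024 4096 16777216
```

Only the attention score computation grows quadratically; the linear projections and MLPs grow linearly with the token count.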

Training

DiTs are trained with the standard noise-prediction objective from DDPM,^[4] the velocity (v-prediction) objective, or a flow-matching loss.^[9] Mixed-precision training, ZeRO sharding (often via DeepSpeed or PyTorch FSDP), flash attention, and gradient checkpointing are standard. The DiT-XL/2 ImageNet model trained for 7 million steps on TPU v3 pods;^[1] later models such as Stable Diffusion 3 trained on hundreds of millions of image-text pairs across thousands of GPUs.^[11]

Larger DiTs need surprisingly few hyperparameter changes from the original recipe. AdamW at a constant learning rate around 1e-4, batch sizes from 256 to 4096, and exponential moving averages of the weights at decay rates between 0.9999 and 0.99999 cover most of the published runs.^[1] The conditioning input changes (a class label for ImageNet, a CLIP plus T5 ensemble for text-to-image, a prompt embedding plus camera pose for video), but the optimisation recipe is mostly fixed. This is one of the underappreciated practical advantages of DiTs: the same training loop ports across modalities with minimal tuning.
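The exponential moving average of the weights is a one-line update applied after each optimiser step. A minimal sketch, with parameters held in a plain dictionary (an assumption for illustration; real training loops update framework tensors in place):

```python
def ema_update(ema_params, params, decay=0.9999):
    """Blend the current weights into the running average, in place."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value

def ema_horizon(decay):
    """Approximate number of recent steps the average remembers."""
    return 1.0 / (1.0 - decay)

ema = {"w": 0.0}
for step in range(3):
    ema_update(ema, {"w": 1.0}, decay=0.9)
print(round(ema["w"], 3))               # 0.271

print(round(ema_horizon(0.9999)))       # 10000
print(round(ema_horizon(0.99999)))      # 100000
```

The horizon explains the range of decay rates: longer runs with more steps can afford an average over a longer window.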

The rectified flow training objective used by SD3 and FLUX deserves a brief note. Standard diffusion learns to predict the noise added at a random timestep, which works but produces curved sampling trajectories that need many steps to integrate cleanly. Rectified flow (Liu et al., 2023)^[10] and flow matching (Lipman et al., 2023)^[9] reformulate the training target so that the optimal sampling path is a straight line in latent space. Empirically this lets models reach high sample quality in fewer steps (often 20 to 28 instead of 50) and trains more stably at the high noise levels that matter for high-resolution synthesis. SD3's paper reports that rectified flow with logit-normal timestep sampling outperforms DDPM-style training across the model sizes they tested.^[11]
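The straight-line construction can be sketched directly. The code below assumes the convention x_t = (1 - t) * x0 + t * noise with velocity target noise - x0, and draws timesteps from a logit-normal distribution; function names and the default mean and standard deviation are illustrative, not taken from any released training code.

```python
import numpy as np

def sample_logit_normal(batch, rng, mean=0.0, std=1.0):
    """Timesteps in (0, 1): a sigmoid applied to Gaussian samples."""
    u = rng.normal(mean, std, size=batch)
    return 1.0 / (1.0 + np.exp(-u))

def rectified_flow_pair(x0, t, rng):
    """Return the interpolated input x_t and the velocity target."""
    eps = rng.standard_normal(x0.shape)
    t_b = t.reshape(-1, *([1] * (x0.ndim - 1)))   # broadcast over non-batch axes
    x_t = (1.0 - t_b) * x0 + t_b * eps
    v_target = eps - x0                            # constant along the straight path
    return x_t, v_target

def euler_sample(velocity_fn, x1, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data)."""
    x = x1
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t) * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 8, 8))
t = sample_logit_normal(batch=2, rng=rng)
x_t, v_target = rectified_flow_pair(x0, t, rng)

# With an oracle velocity the path is exactly straight, so even a
# four-step Euler solve lands on the data sample.
noise = x0 + v_target
sample = euler_sample(lambda x, t: v_target, noise, num_steps=4)
print(np.allclose(sample, x0))   # True
```

A trained network only approximates this velocity, so real trajectories are not perfectly straight, but the closer they are the fewer solver steps are needed.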

What is MM-DiT?

Stability AI's Stable Diffusion 3 introduced the Multimodal Diffusion Transformer, or MM-DiT, in March 2024.^[12] The design splits the transformer into two parallel streams, one for text tokens and one for image tokens, with separate weights but a shared joint attention operation.^[11] The SD3 paper describes the architecture as using "separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings."^[11] Practically, every block does the following:

Compute query, key, and value projections separately for text and for image tokens, with each stream's own parameter matrices.
Concatenate the queries, keys, and values across streams.
Run one joint attention operation in which all tokens attend to all tokens.
Split the result back into text and image streams and apply each stream's own MLP.
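The steps above can be sketched as a single-head joint attention in NumPy. This is a simplified illustration, not the SD3 implementation: it omits multiple heads, QK normalisation, adaLN modulation, output projections, and the per-stream MLPs, and the weight dictionaries are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_weights(dim, rng):
    """Per-stream projection matrices (illustrative random initialisation)."""
    return {k: rng.standard_normal((dim, dim)) / np.sqrt(dim) for k in ("q", "k", "v")}

def joint_attention(text, image, w_text, w_image):
    """text: (Nt, d), image: (Ni, d). Each stream projects with its own weights,
    then all tokens attend to all tokens in one attention operation."""
    q = np.concatenate([text @ w_text["q"], image @ w_image["q"]], axis=0)
    k = np.concatenate([text @ w_text["k"], image @ w_image["k"]], axis=0)
    v = np.concatenate([text @ w_text["v"], image @ w_image["v"]], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    out = softmax(scores, axis=-1) @ v
    n_text = text.shape[0]
    return out[:n_text], out[n_text:]      # split back into the two streams

rng = np.random.default_rng(0)
dim = 8
text = rng.standard_normal((3, dim))       # 3 text tokens
image = rng.standard_normal((5, dim))      # 5 image tokens
w_text, w_image = make_weights(dim, rng), make_weights(dim, rng)

text_out, image_out = joint_attention(text, image, w_text, w_image)
print(text_out.shape, image_out.shape)     # (3, 8) (5, 8)
```

Because the attention matrix spans both streams, changing any text token changes the image outputs and vice versa, which is the bidirectional flow the SD3 paper describes.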

This lets the model allocate parameters asymmetrically (more capacity to the modality that needs it) while still allowing every text token to influence every image patch and vice versa. The Esser et al. paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206) describes scaling MM-DiT from a 15-block 450M-parameter model to a 38-block 8B-parameter model, reporting a smooth decrease in validation loss with both model size and training compute and "no signs of saturation."^[11] The authors state that the architecture "follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations."^[11] Stable Diffusion 3.5 Large uses an 8B-parameter MM-DiT with QK normalisation; the 2.5B-parameter SD 3.5 Medium uses an MMDiT-X variant that adds extra self-attention layers.^[13] FLUX.1 from Black Forest Labs is built on the same recipe but adds parallel attention blocks and a different distillation pipeline.^[20]

How do DiT, MM-DiT, and FLUX differ?

It is easy to lump everything from DiT-XL onwards into one bucket, but the three most influential variants differ in important specifics.

Property	Original DiT (2022)	MM-DiT (SD3, 2024)	FLUX.1 (2024)
Conditioning input	Class label only	Three text encoders (CLIP-G, CLIP-L, T5-XXL) plus pooled embedding	Two text encoders (CLIP-L, T5-XXL) plus pooled embedding
Token streams	Single image stream	Two streams (text and image) with separate weights	Two streams for some blocks, single stream for others ("single-stream" blocks)
Attention pattern	Self-attention over image tokens	Joint attention over concatenated text plus image tokens	Joint attention plus parallel attention blocks (attention and MLP run in parallel)
Conditioning method	adaLN-Zero	adaLN with QK normalisation	adaLN with QK normalisation, modified for parallel blocks
Training objective	Noise prediction (v-prediction in some reproductions)	Rectified flow matching	Rectified flow matching
Position encoding	Fixed 2D sine-cosine absolute	2D learned absolute, with extensions for arbitrary resolution	2D RoPE for image tokens, learned for text
Headline parameter count	675M (DiT-XL/2)	2B and 8B (SD3)	~12B (FLUX.1 dev / pro)
Distilled variants	None in original	SD3.5 Turbo	FLUX.1 schnell (Apache 2.0, 4-step)

The most important conceptual jump is from DiT to MM-DiT: separating text and image into two streams that interact only through joint attention is what made the architecture practical for serious text-to-image work, because it stops the model from having to learn how to keep the two modalities apart inside a single shared MLP. FLUX adds some throughput optimisations (parallel attention, fewer two-stream blocks) on top, and it changes the distillation story, but architecturally it sits on the MM-DiT line rather than introducing a new family. Rectified flow is the training-objective change shared by SD3, FLUX, and most recent video DiTs; it is not part of the architecture itself, but it has become the default for high-resolution generation because it produces straighter sampling trajectories than DDPM-style noise prediction.

Variants and extensions

Image generation

System	Release	Backbone	Parameters	Notes
DiT-XL/2	Dec 2022	DiT, adaLN-Zero	675M	ImageNet 256x256 FID 2.27^[1]
PixArt-alpha	Oct 2023 (ICLR 2024 Spotlight)	DiT with cross-attention	0.6B	Trained for 10.8% of SD 1.5's compute^[15]
PixArt-Sigma	Mar 2024	DiT with cross-attention	0.6B	4K text-to-image; weak-to-strong training from PixArt-alpha^[16]
Hunyuan-DiT	May 14, 2024	DiT with bilingual text encoders	~1.5B	Chinese and English text-to-image^[17]
Lumina-T2X (Flag-DiT)	May 2024	DiT with RoPE, RMSNorm, KQ-norm	up to 7B	Unified images, video, 3D, audio^[18]
Lumina-Next (Next-DiT)	June 2024	DiT with 3D RoPE, sandwich norm	up to 7B	Faster training and inference^[19]
Stable Diffusion 3	March 2024	MM-DiT, rectified flow	800M to 8B	38-block 8B variant^[11]
FLUX.1 (pro/dev/schnell)	August 2024	MM-DiT plus parallel attention, rectified flow	~12B	schnell is 4-step distilled^[20]
Stable Diffusion 3.5 Large	October 22, 2024	MM-DiT with QK norm	8B	Open weights up to $1M revenue^[13]
Stable Diffusion 3.5 Medium	October 29, 2024	MMDiT-X	2.5B	Designed for consumer GPUs^[13]

Video generation

System	Release	Notes
Sora	Feb 15, 2024	OpenAI; spacetime patches; DiT backbone; led by William Peebles and Tim Brooks^[22]
Latte	2024 (TMLR 2025)	Latent video DiT; spatial-temporal token factorisation^[24]
Open-Sora (STDiT)	2024	Spatial-Temporal DiT; up to 720p, 15-second clips^[25]
Veo / Veo 2	2024	Google DeepMind; transformer-based video diffusion
Veo 3	2025	Native audio generation; 4K output; transformer DiT lineage^[29]
Movie Gen	October 2024	Meta; 30B-parameter transformer with synchronized audio^[26]
CogVideoX	August 2024 (arXiv:2408.06072)	"Expert transformer" with expert adaLN; 10-second clips at 768x1360, 16 fps^[23]
HunyuanVideo	December 2024	Open-weight video DiT from Tencent
Allegro	2024	Open-source video DiT
Runway Gen-3 / Gen-4	2024 / 2025	Transformer-based text-and-video diffusion; cinematic focus
Pika 2.0+	2024	Pika Labs; transformer architecture per Pika research notes
Cosmos	January 2025	Nvidia; world-model DiTs for physical AI and robotics^[27]
Sora 2	September 30, 2025	OpenAI; native audio; sharper physics; iOS social app^[28]

Architectural research

Follow-up work explores cheaper attention substitutes (Mamba-DiT, LaMamba-Diff, RWKV-DiT), mixture-of-experts MM-DiTs (often called DiT-MoE), autoregressive image transformers such as VAR and MAR, and continuous-token DiTs. The shared theme is that the backbone family keeps shifting while the patchify-attend-modulate skeleton survives.

DiT-MoE and conditional compute

Mixture-of-experts variants of DiT route each token through a small subset of expert MLPs rather than a single shared one, which lets total parameter count grow without proportionally increasing inference compute. This pattern has shown up in several Stable Diffusion 3.5 research notes and in concurrent work on conditional diffusion, mirroring the path that large language models took with Mixtral and DeepSeek-V2. The catch is that experts are far more sensitive to load balancing in image diffusion than in text models, partly because the same patches recur across noise levels and partly because spatial routing tends to collapse onto a few experts unless you regularise carefully.
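Top-k routing and the load-balance statistic it needs can be illustrated generically. The sketch below is not drawn from any specific DiT-MoE codebase: the router is a single linear layer, the gate weights are a softmax over the selected experts only, and all names are hypothetical.

```python
import numpy as np

def top_k_route(tokens, w_router, k=2):
    """Pick k experts per token and return their indices and gate weights."""
    logits = tokens @ w_router                           # (num_tokens, num_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]        # best k experts per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    top_logits = top_logits - top_logits.max(axis=-1, keepdims=True)
    gates = np.exp(top_logits)
    gates = gates / gates.sum(axis=-1, keepdims=True)    # softmax over the chosen k
    return top_idx, gates

def expert_load(top_idx, num_experts):
    """Fraction of routed slots assigned to each expert."""
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    return counts / top_idx.size

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))      # 16 patch tokens, width 8
w_router = rng.standard_normal((8, 4))     # 4 experts
top_idx, gates = top_k_route(tokens, w_router, k=2)
load = expert_load(top_idx, num_experts=4)
print(load)                                # uniform load would be 0.25 per expert
```

A load vector far from uniform is the collapse symptom described above; auxiliary balancing losses typically penalise that deviation.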

How does a diffusion transformer differ from a U-Net diffusion model?

System	Backbone	Parameters	Year	Latent	Notes
Stable Diffusion 1.5	U-Net	~860M	2022	4-channel SD VAE	Cross-attention to CLIP text encoder
Stable Diffusion 2.1	U-Net	~865M	2022	4-channel SD VAE	OpenCLIP H text encoder
SDXL	U-Net	2.6B	2023	4-channel SD VAE	Two text encoders; refiner stage^[14]
DiT-XL/2	DiT (class-conditional)	675M	2022	SD VAE	ImageNet only; no text^[1]
PixArt-Sigma	DiT (cross-attention)	0.6B	2024	SD VAE	4K output^[16]
Stable Diffusion 3 (8B)	MM-DiT, rectified flow	8B	2024	16-channel SD3 VAE	Three text encoders^[11]
FLUX.1	MM-DiT plus parallel attention, rectified flow	~12B	2024	16-channel FLUX VAE	Open weights^[20]
Sora	DiT (spacetime)	not disclosed	2024	Video VAE	Video; up to 1 minute^[22]

Qualitatively, U-Net SDXL produces strong samples at 2.6B parameters but plateaus when scaled further. DiT-style models keep improving past 8B and into the 12B range, and they handle multimodal conditioning more cleanly because attention is the only mixing operation in the network.

Strengths

DiTs scale predictably. Compute and FID move together along a roughly straight line in log-log space, which is the same regime that language models enjoy.^[1] The architecture is also conceptually clean: there are no long U-Net-style skip connections between encoder and decoder stages, no resolution doubling, and no separate downsampling and upsampling stages. Conditioning is uniform across blocks, which makes it easy to add a text encoder, a class label, a camera pose, or any other auxiliary signal. Transformer infrastructure (flash attention, RoPE, ZeRO, FSDP, tensor parallelism) ports over without modification, so a team that knows how to train large language models can train large DiTs.

There is also a culture argument that gets less attention but matters in practice. The transformer ecosystem has been shaped by language modelling for the better part of a decade, which means tooling for sharding, profiling, debugging, and serving is mature in a way that U-Net-specific tooling never was. Hugging Face's Diffusers, vLLM-style attention kernels, and PyTorch FSDP all assume a transformer-shaped graph. Adopting DiT lets a team plug into that ecosystem instead of maintaining a parallel one.

Limitations

Self-attention is quadratic in sequence length, so high-resolution images and long videos are expensive. A 1024x1024 image with a patch size of 2 on a 1/8 latent gives 64x64 = 4,096 tokens, and the attention cost is roughly 16.8M query-key dot products per head per layer; a one-minute video is dramatically worse. Video DiTs typically factorise attention along space and time, use sliding windows, or adopt linear-attention substitutes like Mamba and RWKV to keep this manageable. Memory pressure during training is also higher than for U-Nets because activations must be retained across a deeper, wider stack. Finally, DiTs are weaker at small scales: a 50M-parameter U-Net often beats a 50M-parameter DiT, and the architecture only earns its keep above roughly 100M to 200M parameters.

Position encoding is another open issue. The original DiT uses absolute sine-cosine embeddings computed for a fixed token grid, which makes it awkward to evaluate at resolutions other than the training resolution.^[1] Lumina-T2X, FLUX, and the SD3 family all switched to 2D RoPE or related schemes for this reason.^[18] Patch size choice carries similar trade-offs: patch 2 gives more tokens and better detail but raises attention cost; patch 4 or patch 8 are cheap but blur fine structure and tend to lose small text or thin lines.

Who created the diffusion transformer?

The paper's two authors took notably different paths after publication.

William Peebles finished his PhD at UC Berkeley in 2023 (advised by Alexei Efros), having done the DiT work as a visiting researcher at Meta AI in 2022. He joined OpenAI in 2023 to lead the Sora team alongside Tim Brooks, and Sora's spacetime-patch architecture is a direct application of the DiT recipe to video.^[22] He has remained at OpenAI through Sora 2 and continues to lead video research there.

Saining Xie is an assistant professor at NYU Courant, where he runs a vision and generative modelling group. Before DiT he was best known as a co-author of ConvNeXt and Masked Autoencoders (MAE), and he previously worked on Aggregated Residual Transformations (ResNeXt) during his PhD at UC San Diego.^[31]^[32] The DiT paper was presented as an oral at ICCV 2023.^[1] He has continued to push representation alignment for diffusion (REPA) and autoregressive image generation since then.

The pairing reflects a small pattern in the field. Peebles brought generative-model experience (his prior work on GANs included GAN-Supervised Dense Visual Alignment) and Xie brought a vision-architecture lineage. The combination produced a paper that reads more like a careful empirical study than a typical "we propose a new model" announcement, which is part of why it has been so influential.

What models use a diffusion transformer?

By the end of 2025, every state-of-the-art open-weight text-to-image and text-to-video model used a DiT-style backbone. The conventional U-Net survives mainly in finetuned models built on Stable Diffusion 1.5 and SDXL.

Model / system	Organisation	Release	Architecture lineage
Stable Diffusion 3 / 3.5	Stability AI	March / October 2024	MM-DiT with rectified flow^[11]
FLUX.1 (pro / dev / schnell)	Black Forest Labs	August 2024	MM-DiT plus parallel attention, rectified flow^[20]
Sora	OpenAI	February 2024 (preview); December 2024 (Sora Turbo)	DiT with spacetime patches^[22]
Sora 2	OpenAI	September 2025	DiT with native audio, sharper physics^[28]
Veo / Veo 2	Google DeepMind	2024	Transformer-based video diffusion
Veo 3	Google DeepMind	2025	Transformer DiT lineage with native audio^[29]
Movie Gen	Meta	October 2024	30B-parameter transformer with synchronized audio^[26]
Runway Gen-3 / Gen-4	Runway	2024 / 2025	Text-and-video diffusion transformer
Pika 2.0+	Pika Labs	2024	Transformer-based diffusion
Hunyuan-DiT / HunyuanVideo	Tencent	May 2024 / December 2024	DiT with bilingual encoders^[17]
PixArt-alpha / Sigma	Huawei Noah's Ark	October 2023 / March 2024	DiT with cross-attention^[15]
Lumina-T2X / Lumina-Next	Shanghai AI Lab	May / June 2024	DiT with RoPE and RMSNorm^[18]
CogVideoX	Zhipu AI	August 2024	Expert transformer with expert adaLN^[23]
Cosmos	Nvidia	January 2025	DiT-based world models for physical AI^[27]
Allegro	Rhymes AI	2024	Open-source video DiT
Open-Sora (STDiT)	HPC-AI Tech	2024	Spatial-temporal DiT^[25]

The pattern is striking. Within twenty-four months of the original paper, the architecture had walked through every major lab and most well-funded startups. There are caveats: FLUX and SD3 are formally rectified flow transformers rather than denoising diffusion transformers in the strict sense, and Sora keeps a lot of its specifics private. But the patchify-attend-modulate skeleton is in all of them.

Code and reproducibility

The original DiT codebase is at github.com/facebookresearch/DiT and is released under a CC-BY-NC license.^[3] It contains training scripts for ImageNet at 256x256 and 512x512, the four model sizes from the paper, evaluation utilities for FID, and Colab demos.^[3] Independent reproductions in PyTorch (Hugging Face Diffusers, lucidrains' implementations) and in JAX have all matched the paper's numbers. PixArt and FLUX both started from forks of this code, which is one of the reasons the design has spread so quickly: the authors made it easy to copy.

The Diffusers library's DiTPipeline, together with StableDiffusion3Pipeline for SD3 and FluxPipeline for FLUX, are the most common community entry points. Most papers that propose new variants release weights for at least one size, often on Hugging Face under a research license.

Influence and history

The Peebles and Xie paper has been cited many thousands of times since 2023 and was presented as an oral at ICCV 2023.^[1] William Peebles went on to lead OpenAI's Sora team, and Sora's spacetime-patch transformer is a direct application of DiT to video.^[22] Robin Rombach, Andreas Blattmann, and Dominik Lorenz, three of the authors of the original latent diffusion paper, founded Black Forest Labs after leaving Stability AI and shipped FLUX.1 in August 2024.^[20]

Recent developments

Several 2024 and 2025 directions are worth tracking. Linear-complexity DiTs replace softmax attention with state-space models or linear-kernel attention to bring the per-layer cost down from O(N^2) to O(N) in token count N; LaMamba-Diff is one example. Mixture-of-experts DiTs route tokens through expert MLPs to expand parameter count without proportionally expanding compute; DiT-MoE and related research papers explore this. Autoregressive image transformers such as VAR (visual autoregressive modelling) and MAR (masked autoregressive) generate an image scale-by-scale or in groups of tokens rather than by denoising the whole latent at once, and challenge the noise-conditioning paradigm itself. Representation autoencoders (RAE-DiT) replace the SD VAE with semantic encoders to improve sample quality. The patchify-attend-modulate skeleton remains stable across most of these.
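The O(N) claim for linear-kernel attention can be made concrete with a small NumPy sketch. It uses the generic kernel-feature-map formulation with an `elu(x) + 1` feature map, which is an assumption chosen for illustration and is not the mechanism of LaMamba-Diff or any specific model named above. Both functions compute the same output; the second never builds the N x N score matrix.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1: a positive feature map often used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def kernel_attention_quadratic(q, k, v):
    """Reference form: materialises the full (N, N) weight matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                        # (N, N), cost O(N^2 d)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v


def kernel_attention_linear(q, k, v):
    """Same result by reassociating the matrix products, cost O(N d^2)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                # (d, d_v) summary of keys/values
    k_sum = phi_k.sum(axis=0)                       # (d,) normaliser summary
    numerator = phi_q @ kv                          # (N, d_v)
    denominator = phi_q @ k_sum                     # (N,)
    return numerator / denominator[:, None]


rng = np.random.default_rng(0)
q, k = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
v = rng.normal(size=(16, 4))
same = np.allclose(kernel_attention_quadratic(q, k, v),
                   kernel_attention_linear(q, k, v))
print("outputs match:", same)
```

The saving comes purely from associativity: the key-value summary `kv` has a size that does not depend on the number of tokens. The trade-off is that the softmax nonlinearity is replaced by a kernel approximation.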

Nvidia's Cosmos platform, announced at CES 2025, builds DiT-style world foundation models for robotics and autonomous-vehicle simulation. The Cosmos-1.0 family includes diffusion (DiT) and autoregressive variants ranging from 4B to 14B parameters and supports text-to-world, video-to-world, and conditional world generation.^[27] This is a different application domain from text-to-image, but the architectural lineage is the same.

Meta's Movie Gen, announced in October 2024, is another large transformer-based diffusion model with synchronized audio generation. It uses a 30B-parameter transformer for video and a 13B-parameter audio model that conditions on the visuals, producing up to 16-second clips at 1080p.^[26] Like Sora 2 and Veo 3, it represents the merger of the DiT recipe with native audio output.

Is Sora a diffusion transformer?

Yes. OpenAI's Sora technical report states plainly: "Sora is a diffusion model," and, more specifically, "Sora is a diffusion transformer."^[22] The report explains that Sora "turns videos into patches" by compressing them into a latent space and decomposing that representation into spacetime patches, which then act as transformer tokens, directly applying the DiT recipe to variable-resolution, variable-duration video.^[22] The report also notes that sample quality improves consistently as training compute grows, the same scaling signature Peebles and Xie reported for images.^[22] Sora was led at OpenAI by William Peebles, the DiT paper's first author, alongside Tim Brooks.^[22]
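The spacetime-patch idea can be sketched in a few lines of NumPy. Sora's actual patch sizes, latent shapes, and tensor layout are not public, so the function name `spacetime_patchify`, the `(T, H, W, C)` layout, and the patch sizes used here are all assumptions for illustration.

```python
import numpy as np


def spacetime_patchify(latent, pt, ph, pw):
    """Split a video latent of shape (T, H, W, C) into spacetime patch tokens.

    Returns an array of shape (N, pt * ph * pw * C), where
    N = (T // pt) * (H // ph) * (W // pw) is the number of tokens.
    """
    T, H, W, C = latent.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Expose the patch grid and the within-patch offsets as separate axes.
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the grid axes to the front: (nt, nh, nw, pt, ph, pw, C).
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # Flatten each patch into a single token vector.
    return x.reshape(-1, pt * ph * pw * C)


# Clips of different duration and resolution yield different token counts
# but the same token width, so one transformer can process both.
short_clip = np.zeros((8, 16, 16, 4))
long_clip = np.zeros((16, 32, 16, 4))
print(spacetime_patchify(short_clip, 2, 2, 2).shape)
print(spacetime_patchify(long_clip, 2, 2, 2).shape)
```

Because only the sequence length changes with duration and resolution, the same transformer weights can in principle handle variable-size inputs, given suitable positional information.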

Theoretical insights

Why has the DiT recipe held up so well across image and video? Two arguments are common in the literature. First, the uniformity argument: a transformer applies the same block to every token at a single resolution, so added depth, width, or tokens increases capacity evenly across the whole input. A U-Net, by contrast, allocates compute non-uniformly across resolutions, so added capacity tends to concentrate in particular stages, often the low-resolution bottleneck. Second, the inductive-bias argument: U-Nets bake in translation equivariance and a multi-resolution hierarchy, both of which are useful at small scales but become a constraint at large scales. Self-attention on its own is permutation-equivariant, and positional encodings are what break that symmetry; this prior is weak enough that, with enough data, the model can learn richer structure.
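The permutation-equivariance point can be checked numerically. The sketch below is a minimal single-head self-attention in NumPy, with random weights standing in for a trained model. It illustrates the property only and is not taken from any DiT implementation.

```python
import numpy as np


def self_attention(x, wq, wk, wv):
    """Single-head softmax self-attention over tokens x of shape (N, d)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
x = rng.normal(size=(n_tokens, dim))
wq, wk, wv = (rng.normal(size=(dim, dim)) for _ in range(3))
perm = rng.permutation(n_tokens)

# Without positional information, shuffling the tokens just shuffles the output.
equivariant = np.allclose(self_attention(x[perm], wq, wk, wv),
                          self_attention(x, wq, wk, wv)[perm])

# Adding a fixed per-position embedding ties each token to its slot,
# so the same shuffle now changes the result.
pos = rng.normal(size=(n_tokens, dim))
still_equivariant = np.allclose(self_attention(x[perm] + pos, wq, wk, wv),
                                self_attention(x + pos, wq, wk, wv)[perm])
print(equivariant, still_equivariant)
```

In other words, everything the model knows about spatial layout has to enter through the positional encodings and be learned from data, which is the sense in which the prior is weak.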

AdaLN-Zero deserves a separate note. The layer that produces each block's residual gates is initialised to zero, so at the start of training every DiT block is the identity function.^[1] Because each branch's output is multiplied by a zero gate, the first gradient updates reach the gate-producing conditioning layer before they reach the attention and MLP weights inside the branch. This decouples learning the conditioning machinery from learning the denoiser and is credited with noticeably more stable training, particularly at large model sizes where naive initialisations can diverge in the first few thousand steps.
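A small NumPy sketch makes the identity-at-initialisation property concrete. It is a simplified stand-in for a DiT block: it has a single MLP branch with no attention, uses ReLU in place of the usual activation, and the class name `AdaLNZeroBlock` and its dimensions are assumptions made for this example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Layer normalisation without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    """One residual MLP branch modulated by shift, scale, and gate."""

    def __init__(self, dim, cond_dim, rng):
        # Branch weights get an ordinary small random initialisation.
        self.w1 = rng.normal(scale=0.02, size=(dim, 4 * dim))
        self.w2 = rng.normal(scale=0.02, size=(4 * dim, dim))
        # The conditioning projection is zero-initialised, so shift, scale,
        # and gate are all zero at the start of training.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)

    def __call__(self, x, cond):
        modulation = cond @ self.w_mod + self.b_mod
        shift, scale, gate = np.split(modulation, 3, axis=-1)
        h = layer_norm(x) * (1.0 + scale) + shift   # adaptive layer norm
        h = np.maximum(h @ self.w1, 0.0) @ self.w2  # MLP branch
        return x + gate * h                         # gated residual connection


rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, cond_dim=4, rng=rng)
tokens = rng.normal(size=(5, 8))
cond = rng.normal(size=4)
print("identity at init:", np.allclose(block(tokens, cond), tokens))
```

Since the output is `x + gate * h` and `gate` starts at zero, the branch weights `w1` and `w2` receive no gradient until the conditioning projection has moved away from zero.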

References

1. Peebles, W. and Xie, S. (2023). *Scalable Diffusion Models with Transformers*. ICCV 2023 (Oral). arXiv:2212.09748. https://arxiv.org/abs/2212.09748
2. Peebles, W. *Scalable Diffusion Models with Transformers* project page. https://www.wpeebles.com/DiT
3. Facebook Research. *DiT GitHub repository*. https://github.com/facebookresearch/DiT
4. Ho, J., Jain, A., and Abbeel, P. (2020). *Denoising Diffusion Probabilistic Models*. NeurIPS 2020. arXiv:2006.11239. https://arxiv.org/abs/2006.11239
5. Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021). *Score-Based Generative Modeling through Stochastic Differential Equations*. ICLR 2021 (Outstanding Paper). arXiv:2011.13456. https://arxiv.org/abs/2011.13456
6. Karras, T., Aittala, M., Aila, T., and Laine, S. (2022). *Elucidating the Design Space of Diffusion-Based Generative Models*. NeurIPS 2022. arXiv:2206.00364. https://arxiv.org/abs/2206.00364
7. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). *High-Resolution Image Synthesis with Latent Diffusion Models*. CVPR 2022. arXiv:2112.10752. https://arxiv.org/abs/2112.10752
8. Dosovitskiy, A. et al. (2021). *An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale*. ICLR 2021. arXiv:2010.11929. https://arxiv.org/abs/2010.11929
9. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., and Le, M. (2023). *Flow Matching for Generative Modeling*. ICLR 2023. arXiv:2210.02747. https://arxiv.org/abs/2210.02747
10. Liu, X., Gong, C., and Liu, Q. (2023). *Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow*. ICLR 2023. arXiv:2209.03003. https://arxiv.org/abs/2209.03003
11. Esser, P., Kulal, S., Blattmann, A. et al. (2024). *Scaling Rectified Flow Transformers for High-Resolution Image Synthesis*. ICML 2024. arXiv:2403.03206. https://arxiv.org/abs/2403.03206
12. Stability AI (2024). *Stable Diffusion 3: Research Paper*. https://stability.ai/news/stable-diffusion-3-research-paper
13. Stability AI (2024). *Introducing Stable Diffusion 3.5*. https://stability.ai/news/introducing-stable-diffusion-3-5
14. Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. (2023). *SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis*. arXiv:2307.01952. https://arxiv.org/abs/2307.01952
15. Chen, J. et al. (2024). *PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis*. ICLR 2024 Spotlight. arXiv:2310.00426. https://arxiv.org/abs/2310.00426
16. Chen, J. et al. (2024). *PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation*. ECCV 2024. https://pixart-alpha.github.io/PixArt-sigma-project/
17. Tencent Hunyuan (2024). *Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding*. arXiv:2405.08748. https://arxiv.org/abs/2405.08748
18. Gao, P. et al. (2024). *Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers*. arXiv:2405.05945. https://arxiv.org/abs/2405.05945
19. Zhuo, L. et al. (2024). *Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT*. NeurIPS 2024. arXiv:2406.18583. https://arxiv.org/abs/2406.18583
20. Black Forest Labs (2024). *Announcing Black Forest Labs and FLUX.1*. https://blackforestlabs.ai/announcing-black-forest-labs/
21. Black Forest Labs. *FLUX.1-dev model card*. https://huggingface.co/black-forest-labs/FLUX.1-dev
22. OpenAI (2024). *Video Generation Models as World Simulators (Sora technical report)*. https://openai.com/index/video-generation-models-as-world-simulators/
23. Yang, Z. et al. (2024). *CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer*. arXiv:2408.06072. https://arxiv.org/abs/2408.06072
24. Ma, X. et al. (2024). *Latte: Latent Diffusion Transformer for Video Generation*. https://github.com/Vchitect/Latte
25. HPC-AI Tech (2024). *Open-Sora: Democratizing Efficient Video Production for All*. https://github.com/hpcaitech/Open-Sora
26. Polyak, A. et al. (2024). *Movie Gen: A Cast of Media Foundation Models*. Meta AI. https://ai.meta.com/research/movie-gen/
27. Nvidia (2025). *Cosmos World Foundation Model Platform for Physical AI*. https://blogs.nvidia.com/blog/cosmos-world-foundation-models/
28. OpenAI (2025). *Sora 2 is here*. https://openai.com/index/sora-2/
29. Google DeepMind (2025). *Veo 3: Generate videos with native audio*. https://deepmind.google/models/veo/
30. Goyal, P. et al. (2017). *Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour*. arXiv:1706.02677. https://arxiv.org/abs/1706.02677
31. Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). *Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)*. CVPR 2017. arXiv:1611.05431. https://arxiv.org/abs/1611.05431
32. Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., and Xie, S. (2022). *A ConvNet for the 2020s (ConvNeXt)*. CVPR 2022. arXiv:2201.03545. https://arxiv.org/abs/2201.03545
