Diffusion Transformer (DiT)
Last reviewed
May 8, 2026
Sources
32 citations
Review status
Source-backed
Revision
v2 · 6,030 words
A Diffusion Transformer, abbreviated DiT, is a transformer-based neural network backbone for diffusion models. It replaces the U-Net that diffusion models had used since 2020 with a Vision Transformer operating on patches of an image latent. DiT was introduced by William Peebles (then a PhD student at UC Berkeley, interning at Meta AI) and Saining Xie (NYU faculty) in the December 2022 paper Scalable Diffusion Models with Transformers (arXiv:2212.09748), which was presented as an oral at ICCV 2023. The paper's central empirical claim is that pure transformer backbones scale more cleanly than convolutional U-Nets along compute and parameters, and that this scaling advantage extends to image generation.
Since late 2023 the architecture has displaced U-Nets across most frontier image and video generation systems. The OpenAI Sora and Sora 2 video models, Stable Diffusion 3 and 3.5, FLUX.1, Google DeepMind's Veo, Veo 2, and Veo 3, Meta's Movie Gen, Runway Gen-3 and Gen-4, Pika 2.0+, PixArt-alpha and Sigma, Hunyuan-DiT, Lumina-T2X, Nvidia Cosmos, CogVideoX, and Latte are all DiT-style designs. The label "DiT" now refers both to the original ImageNet model from Peebles and Xie and to the broader family of transformer-on-latent-patch diffusion architectures that grew out of it.
The U-Net, which Denoising Diffusion Probabilistic Models (Ho, Jain, and Abbeel, 2020) adopted as its denoiser, had been the default backbone for diffusion models for about two and a half years before DiT appeared. U-Nets perform reasonably well at small to medium scale, but their scaling behaviour had not been characterised as systematically as that of transformers. Transformers had already shown predictable scaling behaviour for language models (the Chinchilla scaling laws) and for image classification (Vision Transformer (ViT)). DiT extended this evidence to image synthesis: as the authors increased the transformer's depth, width, or token count, FID on ImageNet fell consistently.
This matters for two reasons. First, predictable scaling lets practitioners trade compute directly for sample quality, which is how the rest of deep learning has progressed since 2020. Second, transformer infrastructure is mature: flash attention, ZeRO sharding, tensor parallelism, RoPE, and mixed-precision training all transfer cleanly to DiTs. Once DiT made this trade-off explicit, the open-source ecosystem and major labs converged on the design within roughly eighteen months.
A diffusion model defines a forward process that gradually adds Gaussian noise to a sample from the data distribution until almost no signal remains, and a reverse process, parameterised by a neural network, that learns to denoise step by step. Training reduces to predicting the noise, the clean sample, or a velocity vector at random noise levels. Sampling starts from pure Gaussian noise and iteratively applies the learned denoiser. Three foundational papers established this framework: Ho, Jain, and Abbeel's DDPM (2020), Song et al.'s 2021 ICLR paper Score-Based Generative Modeling through Stochastic Differential Equations (which won an Outstanding Paper award and unified score matching with denoising diffusion through SDEs), and Karras, Aittala, Aila, and Laine's Elucidating the Design Space of Diffusion-Based Generative Models (NeurIPS 2022, often abbreviated EDM), which separated and tuned the design choices in sampling and preconditioning.
The U-Net is an encoder-decoder convolutional network with skip connections between matching resolutions. It was first applied to diffusion in DDPM and remained the standard backbone for Stable Diffusion 1.x, 2.x, and SDXL (which uses a 2.6-billion-parameter U-Net). U-Nets are biased toward local processing and downsample the spatial grid through several stages, which makes them efficient at small resolutions but awkward to scale: skip connections fix the topology, and adding capacity disproportionately benefits the bottleneck.
The Vision Transformer (Dosovitskiy et al., An Image is Worth 16x16 Words, ICLR 2021) treats an image as a sequence of non-overlapping patches, embeds each patch linearly, adds positional encodings, and processes the sequence with a standard transformer encoder using self-attention. ViTs have no convolutional inductive bias, which makes them weaker at small scales and stronger at large scales relative to convolutional networks. DiT inherits the patchify-and-attend recipe from ViT and applies it to the diffusion denoising task.
Rombach, Blattmann, Lorenz, Esser, and Ommer's High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022) trained a variational autoencoder (VAE) to compress images into a low-dimensional latent grid, then ran the diffusion model in that latent space. The compressed representation is roughly 8x smaller per side than the input image, which makes high-resolution synthesis tractable. This is the basis of Stable Diffusion and most subsequent open image models, and it is also the input space used by DiT.
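DiT operates on the latent grid produced by a VAE encoder. For ImageNet 256x256, Peebles and Xie use a Stable-Diffusion VAE that produces a 32x32x4 latent. The forward pass has four stages: the latent is split into non-overlapping p x p patches, each linearly embedded as a token, and fixed sine-cosine positional embeddings are added; the token sequence passes through N DiT blocks (self-attention plus an MLP), each conditioned on the timestep t and class label c; a final layer norm and linear layer decode each token into a p x p x 2C output holding the predicted noise and a diagonal covariance; and the decoded tokens are rearranged back onto the latent grid.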
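The patchify and unpatchify steps at either end of the network can be illustrated with a minimal NumPy sketch. It only covers the reshaping; the learned linear embedding and positional embeddings are omitted, and the function names `patchify` and `unpatchify` are illustrative rather than taken from any particular codebase.

```python
import numpy as np

def patchify(latent: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W, C) latent into (H/p * W/p) tokens of length p*p*C."""
    h, w, c = latent.shape
    assert h % p == 0 and w % p == 0, "patch size must divide the latent grid"
    gh, gw = h // p, w // p
    x = latent.reshape(gh, p, gw, p, c)   # split rows and columns into patches
    x = x.transpose(0, 2, 1, 3, 4)        # (gh, gw, p, p, c)
    return x.reshape(gh * gw, p * p * c)  # one flattened row per patch

def unpatchify(tokens: np.ndarray, p: int, h: int, w: int, c: int) -> np.ndarray:
    """Inverse of patchify: rebuild the (H, W, C) grid from the token sequence."""
    gh, gw = h // p, w // p
    x = tokens.reshape(gh, gw, p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)        # (gh, p, gw, p, c)
    return x.reshape(h, w, c)

# A random stand-in for the 32x32x4 latent of a 256x256 image.
latent = np.random.default_rng(0).normal(size=(32, 32, 4))
tokens = patchify(latent, p=2)
print(tokens.shape)  # (256, 16)
```

With patch size 2 the 32x32x4 latent becomes 256 tokens of 16 values each, and `unpatchify` restores the original grid exactly.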
The paper compares four ways of conditioning on t and c. The differences are summarised below.
| Method | How it works | Parameters added | Compute overhead | FID at 400K iters (DiT-XL/2) |
|---|---|---|---|---|
| In-context | Concatenate t and c as extra tokens at the start of the sequence | None | Negligible | Highest of the four; roughly double adaLN-Zero |
| Cross-attention | Add a cross-attention layer in each block that attends to t and c tokens | Extra attention layer per block | Highest (~15% more Gflops) | Higher than adaLN-Zero |
| Adaptive LayerNorm (adaLN) | Predict scale and shift parameters for each LayerNorm from t and c | Modest MLP | Low | Higher than adaLN-Zero |
| adaLN-Zero | adaLN with the residual scale initialised to zero so each block starts as identity | Same as adaLN | Same as adaLN | Lowest (~19.5) |
adaLN-Zero is the single most distinctive design choice in DiT. The conditioning vector is fed through a small MLP that outputs six modulation parameters per block: scale and shift for the LayerNorm before the attention sublayer, scale and shift for the LayerNorm before the MLP sublayer, and two residual gates that scale the output of each sublayer. The residual gates are initialised to zero, so at the start of training each DiT block is the identity function. This stabilises optimisation and matches the trick used by Goyal et al. for ResNets in 2017. Peebles and Xie report that adaLN-Zero almost halves the FID at 400K iterations compared with the in-context baseline.
The choice of adaLN-Zero turned out to be more than a trick. In a network with dozens of stacked blocks, allowing each one to start as a no-op means that learning has to find a useful direction for each block separately, rather than fighting through a random initial transformation. This is loosely the same idea behind the Fixup and ReZero literature, and it is one reason DiTs train stably at scales where naive transformer initialisations diverge.
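A small NumPy sketch can make the six modulation parameters and the zero-initialised gates concrete. It is a simplified illustration under stated assumptions, not the authors' implementation: the attention and MLP sublayers are replaced by stand-in functions, there is a single conditioning vector and no batch dimension, and all names (`adaln_zero_block`, `w_mod`, `b_mod`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the feature axis with no learned affine parameters."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def modulate(x, shift, scale):
    """adaLN: scale and shift normalised tokens with conditioning-derived vectors."""
    return x * (1.0 + scale) + shift

def adaln_zero_block(x, cond, w_mod, b_mod, attn_fn, mlp_fn):
    """x: (tokens, d); cond: (d,); w_mod: (d, 6*d); b_mod: (6*d,)."""
    act = cond / (1.0 + np.exp(-cond))            # SiLU on the conditioning vector
    params = act @ w_mod + b_mod                  # six modulation vectors of size d
    shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = np.split(params, 6)
    x = x + gate_a * attn_fn(modulate(layer_norm(x), shift_a, scale_a))
    x = x + gate_m * mlp_fn(modulate(layer_norm(x), shift_m, scale_m))
    return x

rng = np.random.default_rng(0)
d, n = 8, 5
x = rng.normal(size=(n, d))
cond = rng.normal(size=d)
w_a = rng.normal(size=(d, d))
w_m = rng.normal(size=(d, d))
attn_fn = lambda h: h @ w_a          # stand-in for the self-attention sublayer
mlp_fn = lambda h: np.tanh(h @ w_m)  # stand-in for the pointwise MLP sublayer

# Zero-initialised modulation layer: both gates are zero, so the block is the identity.
out_init = adaln_zero_block(x, cond, np.zeros((d, 6 * d)), np.zeros(6 * d), attn_fn, mlp_fn)
print(np.allclose(out_init, x))  # True
```

Once training moves the modulation weights away from zero, the gates open and each sublayer starts contributing to the residual stream.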
Two properties seem to explain DiT's clean scaling behaviour. The first is that a transformer treats every token uniformly. There is no privileged spatial bottleneck and no hand-coded multi-resolution hierarchy, so adding parameters increases capacity smoothly across the whole network rather than disproportionately at one place. The second is that attention has no fixed inductive bias about which tokens should interact, which means a larger model can in principle learn its own coupling pattern (sparse for textures, dense for global composition). U-Nets, by contrast, fix the receptive field at every level via convolution kernels and pooling, and adding capacity to a U-Net has to fight against that geometry.
This is also why DiTs work well as a substrate for classifier-free guidance and other conditioning tricks: adaLN modulates every token in every block, so a class label or text embedding reaches the whole network by default, and self-attention then propagates its effect between tokens.
The original paper trains four sizes, in line with the ViT naming convention. The patch size appears as a suffix; DiT-XL/2 means an XL model with patch size 2.
| Model | Layers (N) | Hidden dim (d) | Heads | Parameters | Gflops at 256x256, patch 2 |
|---|---|---|---|---|---|
| DiT-S | 12 | 384 | 6 | ~33M | ~6 |
| DiT-B | 12 | 768 | 12 | ~130M | ~23 |
| DiT-L | 24 | 1024 | 16 | ~458M | ~80 |
| DiT-XL | 28 | 1152 | 16 | ~675M | ~119 |
DiT-XL/2 reaches an FID-50K of 2.27 on class-conditional ImageNet 256x256 with classifier-free guidance, beating the previous best of 3.60 from latent diffusion. At 512x512, it reaches FID 3.04 at roughly 525 Gflops per forward pass. The headline finding from the scaling sweep is that, holding other knobs fixed, increasing transformer compute (depth, width, or token count) consistently lowers FID, with no clear saturation in the regime they explored.
The paper also reports a clean log-linear relation between Gflops per forward pass and FID across the variants. Doubling the per-step compute moves FID down by roughly the same multiplicative amount across the small-to-XL range, which is the signature of a healthy scaling regime. Smaller patch sizes (patch 2 vs patch 4 or 8) increase the token count, which raises compute but also pushes FID down. The authors land on patch size 2 as the sweet spot for ImageNet at 256x256.
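The effect of patch size on sequence length is simple arithmetic, sketched below. The helper name `dit_token_count` is illustrative, and the 8x VAE downsampling factor is taken from the latent diffusion setup described earlier.

```python
def dit_token_count(image_size: int, patch_size: int, vae_downsample: int = 8) -> int:
    """Number of tokens a DiT sees for a square image of the given size."""
    latent_side = image_size // vae_downsample
    assert latent_side % patch_size == 0, "patch size must divide the latent side"
    return (latent_side // patch_size) ** 2

for p in (8, 4, 2):
    n = dit_token_count(256, p)
    # n * n counts query-key pairs per attention head, the quadratic term in compute.
    print(f"patch {p}: {n} tokens, {n * n} attention pairs")
# patch 8: 16 tokens, 256 attention pairs
# patch 4: 64 tokens, 4096 attention pairs
# patch 2: 256 tokens, 65536 attention pairs
```

Halving the patch size quadruples the token count and multiplies the attention pair count by sixteen, which is why patch 2 is both the most expensive and the best-performing setting in the sweep.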
The original DiT paper does not publish a single dollar or watt figure for the headline run, but it does report enough hardware and step counts to back out an estimate.
| Run | Hardware reported | Steps | Approx. compute | Notes |
|---|---|---|---|---|
| DiT-XL/2 at 256x256 | TPU v3 pods (paper) and A100 reproductions | 7M | Roughly 2,500 GPU-days on A100 equivalents | Class-conditional ImageNet only |
| DiT-XL/2 at 512x512 | Same | 3M | Roughly 1,000 GPU-days | Separate higher-resolution model trained with the same hyperparameters |
| Scaling sweep (S/B/L/XL with patches 2/4/8) | Same | 400K each | Smaller; used to fit scaling curves | Reported in the main figures |
Independent reproductions on A100 nodes have placed the DiT-XL/2 ImageNet run at roughly 2.5K GPU-days, which is modest by 2024 frontier-model standards but large for an academic paper. The run uses standard mixed-precision training, an AdamW optimiser with constant learning rate, and exponential moving averages of the weights for inference. There is no learning-rate schedule, no gradient clipping, and no warmup tricks.
DiT inherits its sampling machinery from generic diffusion models. DDIM, DPM-Solver, Euler ancestral, and Heun-style Runge-Kutta solvers all work without changes. Twenty to fifty steps are typical for high-quality output at inference. Classifier-free guidance, where the model is run with and without the conditioning and the predictions are linearly extrapolated, is used to trade diversity for sample quality. Distillation methods such as progressive distillation, consistency models, and adversarial diffusion distillation compress the multi-step sampler into a few-step or single-step generator; FLUX.1 schnell reaches one-to-four-step generation through latent adversarial diffusion distillation.
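The linear extrapolation in classifier-free guidance is short enough to write out. The sketch below uses plain Python lists in place of tensors and the convention in which a guidance scale of 1 returns the conditional prediction unchanged; the function name `cfg_combine` is illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction towards the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with the conditioning dropped
eps_cond = [1.0, 1.0]     # prediction with the class label or prompt

print(cfg_combine(eps_uncond, eps_cond, 1.0))  # [1.0, 1.0]  (no guidance)
print(cfg_combine(eps_uncond, eps_cond, 4.0))  # [4.0, 1.0]  (pushed past the conditional)
```

Each sampling step therefore needs two forward passes (or one pass on a doubled batch), which is part of the inference cost of guided sampling.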
The quadratic attention cost has practical consequences at inference time. A DiT-XL/2 forward pass on a 256x256 image is cheap (about 119 Gflops), but extending to 1024x1024 raises the token count by 16x and the attention cost by 256x. Model size adds to the pressure: FLUX.1 dev's roughly 12B parameters occupy about 24 GB of VRAM at 16-bit precision, and Stable Diffusion 3.5 Medium was deliberately designed to fit on consumer GPUs at FP8. Activation checkpointing, sequence-parallel attention, and FlashAttention-3 all help. For video DiTs, the situation is much worse: Sora's full-resolution one-minute generation reportedly required dedicated inference clusters with custom kernels, and most open video DiTs cap themselves at five to fifteen seconds of output.
DiTs are trained with the standard noise-prediction objective from DDPM, the velocity (v-prediction) objective, or a flow-matching loss. Mixed-precision training, ZeRO sharding (often via DeepSpeed or PyTorch FSDP), flash attention, and gradient checkpointing are standard. The DiT-XL/2 ImageNet model trained for 7 million steps on TPU v3 pods; later models such as Stable Diffusion 3 trained on hundreds of millions of image-text pairs across thousands of GPUs.
Larger DiTs need surprisingly few hyperparameter changes from the original recipe. AdamW at a constant learning rate around 1e-4, batch sizes from 256 to 4096, and exponential moving averages of the weights at decay rates between 0.9999 and 0.99999 cover most of the published runs. The conditioning input changes (a class label for ImageNet, a CLIP plus T5 ensemble for text-to-image, a prompt embedding plus camera pose for video), but the optimisation recipe is mostly fixed. This is one of the underappreciated practical advantages of DiTs: the same training loop ports across modalities with minimal tuning.
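The exponential moving average of the weights mentioned above is a one-line update applied after every optimiser step. The sketch below treats the weights as a flat list of floats; `ema_update` and `ema_horizon` are illustrative names.

```python
def ema_update(ema_weights, weights, decay=0.9999):
    """Blend the current weights into the running average used for sampling."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

def ema_horizon(decay):
    """Approximate number of recent steps the average effectively covers."""
    return round(1.0 / (1.0 - decay))

print(ema_horizon(0.9999))   # 10000
print(ema_horizon(0.99999))  # 100000
```

The decay range quoted above therefore corresponds to averaging over roughly the last ten thousand to one hundred thousand training steps.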
The rectified flow training objective used by SD3 and FLUX deserves a brief note. Standard diffusion learns to predict the noise added at a random timestep, which works but produces curved sampling trajectories that need many steps to integrate cleanly. Rectified flow (Liu et al., 2023) and flow matching (Lipman et al., 2023) reformulate the training target so that the optimal sampling path is a straight line in latent space. Empirically this lets models reach high sample quality in fewer steps (often 20 to 28 instead of 50) and trains more stably at the high noise levels that matter for high-resolution synthesis. SD3's paper reports that rectified flow with logit-normal timestep sampling outperforms DDPM-style training across the model sizes they tested.
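A minimal sketch of the rectified flow setup follows, assuming the common convention in which t = 0 is data and t = 1 is pure noise, so the noisy sample is a straight-line interpolation and the regression target is the constant velocity along that line. Plain Python lists stand in for latents, and all function names are illustrative.

```python
import math
import random

def logit_normal_t(rng, mean=0.0, std=1.0):
    """Sample a timestep in (0, 1) by squashing a Gaussian through a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

def rectified_flow_pair(x0, noise, t):
    """Return the interpolated sample x_t and the velocity target (noise - x0)."""
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, noise)]
    v = [n - a for a, n in zip(x0, noise)]
    return x_t, v

def euler_sample(velocity_fn, noise, steps):
    """Integrate from t=1 (noise) down to t=0 (data) with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

x0, noise = [1.0, -2.0], [0.5, 0.25]
_, v_true = rectified_flow_pair(x0, noise, t=0.3)
# With the exact (constant) velocity, a single Euler step already lands on the data.
recovered = euler_sample(lambda x, t: v_true, noise, steps=1)
```

A trained network only approximates this velocity and real data distributions make the paths curve somewhat, so practical samplers still use tens of steps, but the straight-line target is what keeps that number low.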
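Stability AI's Stable Diffusion 3 introduced the Multimodal Diffusion Transformer, or MM-DiT, in March 2024. The design splits the transformer into two parallel streams, one for text tokens and one for image tokens, with separate weights but a shared joint attention operation. Practically, in every block each stream applies its own adaLN modulation and its own query, key, and value projections; the two sets of queries, keys, and values are concatenated along the sequence axis and passed through a single attention operation; and the result is split back into text and image parts, each of which goes through its own output projection and MLP.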
This lets the model allocate parameters asymmetrically (more capacity to the modality that needs it) while still allowing every text token to influence every image patch and vice versa. The Esser et al. paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (arXiv:2403.03206) describes scaling MM-DiT from 450 million to 8 billion parameters and reports a smooth decrease in validation loss with both model size and training compute. Stable Diffusion 3.5 Large uses an 8B-parameter MM-DiT with QK normalisation; the 2.5B-parameter SD 3.5 Medium uses an MMDiT-X variant that adds extra self-attention layers in its early blocks. FLUX.1 from Black Forest Labs is built on the same recipe but adds parallel attention blocks and a different distillation pipeline.
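The joint attention at the core of this design can be sketched in a few lines of NumPy. This is a single-head illustration under simplifying assumptions: modulation, QK normalisation, and the per-stream MLPs are left out, and the weight dictionaries and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(txt, img, w_txt, w_img):
    """txt: (n_txt, d); img: (n_img, d); w_*: dicts of (d, d) matrices q, k, v, o."""
    # Each modality projects with its own weights...
    q = np.concatenate([txt @ w_txt["q"], img @ w_img["q"]])
    k = np.concatenate([txt @ w_txt["k"], img @ w_img["k"]])
    v = np.concatenate([txt @ w_txt["v"], img @ w_img["v"]])
    # ...but attention runs once over the concatenated sequence.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    n_txt = txt.shape[0]
    # Split back into streams and apply per-modality output projections.
    return out[:n_txt] @ w_txt["o"], out[n_txt:] @ w_img["o"]

rng = np.random.default_rng(0)
d = 8
w_txt = {name: rng.normal(size=(d, d)) for name in "qkvo"}
w_img = {name: rng.normal(size=(d, d)) for name in "qkvo"}
txt = rng.normal(size=(3, d))    # three text tokens
img = rng.normal(size=(10, d))   # ten image-patch tokens
txt_out, img_out = joint_attention(txt, img, w_txt, w_img)
print(txt_out.shape, img_out.shape)  # (3, 8) (10, 8)
```

Because the attention matrix spans both modalities, perturbing the text tokens changes the image-stream output even though the two streams never share projection weights.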
It is easy to lump everything from DiT-XL onwards into one bucket, but the three most influential variants differ in important specifics.
| Property | Original DiT (2022) | MM-DiT (SD3, 2024) | FLUX.1 (2024) |
|---|---|---|---|
| Conditioning input | Class label only | Three text encoders (CLIP-G, CLIP-L, T5-XXL) plus pooled embedding | Two text encoders (CLIP-L, T5-XXL) plus pooled embedding |
| Token streams | Single image stream | Two streams (text and image) with separate weights | Two streams for some blocks, single stream for others ("single-stream" blocks) |
| Attention pattern | Self-attention over image tokens | Joint attention over concatenated text plus image tokens | Joint attention plus parallel attention blocks (attention and MLP run in parallel) |
| Conditioning method | adaLN-Zero | adaLN with QK normalisation | adaLN with QK normalisation, modified for parallel blocks |
| Training objective | Noise prediction (v-prediction in some reproductions) | Rectified flow matching | Rectified flow matching |
| Position encoding | Fixed 2D sine-cosine absolute | 2D absolute (sine-cosine), adapted for variable resolution | RoPE over image-token positions |
| Headline parameter count | 675M (DiT-XL/2) | 2B and 8B (SD3) | ~12B (FLUX.1 dev / pro) |
| Distilled variants | None in original | SD3.5 Large Turbo | FLUX.1 schnell (Apache 2.0, 4-step) |
The most important conceptual jump is from DiT to MM-DiT: separating text and image into two streams that interact only through joint attention is what made the architecture practical for serious text-to-image work, because it stops the model from having to learn how to keep the two modalities apart inside a single shared MLP. FLUX adds some throughput optimisations (parallel attention, fewer two-stream blocks) on top, and it changes the distillation story, but architecturally it sits on the MM-DiT line rather than introducing a new family. Rectified flow is the training-objective change shared by SD3, FLUX, and most recent video DiTs; it is not part of the architecture itself, but it has become the default for high-resolution generation because it produces straighter sampling trajectories than DDPM-style noise prediction.
| System | Release | Backbone | Parameters | Notes |
|---|---|---|---|---|
| DiT-XL/2 | December 2022 | DiT, adaLN-Zero | 675M | ImageNet 256x256 FID 2.27 |
| PixArt-alpha | October 2023 (ICLR 2024 Spotlight) | DiT with cross-attention | 0.6B | Trained for 10.8% of SD 1.5's compute |
| PixArt-Sigma | March 2024 | DiT with cross-attention | 0.6B | 4K text-to-image; weak-to-strong training from PixArt-alpha |
| Stable Diffusion 3 | March 2024 | MM-DiT, rectified flow | 800M to 8B | 38-block 8B variant |
| Hunyuan-DiT | May 14, 2024 | DiT with bilingual text encoders | ~1.5B | Chinese and English text-to-image |
| Lumina-T2X (Flag-DiT) | May 2024 | DiT with RoPE, RMSNorm, KQ-norm | up to 7B | Unified images, video, 3D, audio |
| Lumina-Next (Next-DiT) | June 2024 | DiT with 3D RoPE, sandwich norm | up to 7B | Faster training and inference |
| FLUX.1 (pro/dev/schnell) | August 2024 | MM-DiT plus parallel attention, rectified flow | ~12B | schnell is 4-step distilled |
| Stable Diffusion 3.5 Large | October 22, 2024 | MM-DiT with QK norm | 8B | Open weights; free commercial use below $1M annual revenue |
| Stable Diffusion 3.5 Medium | October 29, 2024 | MMDiT-X | 2.5B | Designed for consumer GPUs |
| System | Release | Notes |
|---|---|---|
| Sora | Feb 15, 2024 | OpenAI; spacetime patches; DiT backbone; led by William Peebles and Tim Brooks |
| Latte | 2024 (TMLR 2025) | Latent video DiT; spatial-temporal token factorisation |
| Open-Sora (STDiT) | 2024 | Spatial-Temporal DiT; up to 720p, 15-second clips |
| Veo / Veo 2 | 2024 | Google DeepMind; transformer-based video diffusion |
| Veo 3 | 2025 | Native audio generation; 4K output; transformer DiT lineage |
| Movie Gen | October 2024 | Meta; 30B-parameter transformer with synchronized audio |
| CogVideoX | August 2024 (arXiv:2408.06072) | "Expert transformer" with expert adaLN; 10-second clips at 768x1360, 16 fps |
| HunyuanVideo | December 2024 | Open-weight video DiT from Tencent |
| Allegro | 2024 | Open-source video DiT |
| Runway Gen-3 / Gen-4 | 2024 / 2025 | Transformer-based text-and-video diffusion; cinematic focus |
| Pika 2.0+ | 2024 | Pika Labs; transformer architecture per Pika research notes |
| Cosmos | January 2025 | Nvidia; world-model DiTs for physical AI and robotics |
| Sora 2 | September 30, 2025 | OpenAI; native audio; sharper physics; iOS social app |
Follow-up work explores cheaper attention substitutes (Mamba-DiT, LaMamba-Diff, RWKV-DiT), mixture-of-experts MM-DiTs (often called DiT-MoE), autoregressive image transformers such as VAR and MAR, and continuous-token DiTs. The shared theme is that the backbone family keeps shifting while the patchify-attend-modulate skeleton survives.
Mixture-of-experts variants of DiT route each token through a small subset of expert MLPs rather than a single shared one, which lets total parameter count grow without proportionally increasing inference compute. This pattern has shown up in several Stable Diffusion 3.5 research notes and in concurrent work on conditional diffusion, mirroring the path that large language models took with Mixtral and DeepSeek-V2. The catch is that experts are far more sensitive to load balancing in image diffusion than in text models, partly because the same patches recur across noise levels and partly because spatial routing tends to collapse onto a few experts unless you regularise carefully.
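Top-k routing and the load-balance statistic it needs can be sketched as follows. This is a generic mixture-of-experts router written in NumPy for illustration, not the routing scheme of any specific DiT variant; the names `top_k_route`, `expert_load`, and `w_router` are hypothetical.

```python
import numpy as np

def top_k_route(tokens, w_router, k=2):
    """Pick k experts per token and return their indices and mixing weights."""
    logits = tokens @ w_router                           # (n_tokens, n_experts)
    top = np.argsort(-logits, axis=-1)[:, :k]            # k highest-scoring experts
    picked = np.take_along_axis(logits, top, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)
    weights = np.exp(picked) / np.exp(picked).sum(axis=-1, keepdims=True)
    return top, weights

def expert_load(top, n_experts):
    """Fraction of routing slots assigned to each expert (uniform is ideal)."""
    return np.bincount(top.ravel(), minlength=n_experts) / top.size

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 64, 16, 4
tokens = rng.normal(size=(n_tokens, d))
w_router = rng.normal(size=(d, n_experts))
top, weights = top_k_route(tokens, w_router, k=2)
load = expert_load(top, n_experts)
```

Monitoring `load` during training is one simple way to notice the routing collapse described above; auxiliary balancing losses typically penalise its deviation from the uniform distribution.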
| System | Backbone | Parameters | Year | Latent | Notes |
|---|---|---|---|---|---|
| Stable Diffusion 1.5 | U-Net | ~860M | 2022 | 4-channel SD VAE | Cross-attention to CLIP text encoder |
| Stable Diffusion 2.1 | U-Net | ~865M | 2022 | 4-channel SD VAE | OpenCLIP H text encoder |
| SDXL | U-Net | 2.6B | 2023 | 4-channel SD VAE | Two text encoders; refiner stage |
| DiT-XL/2 | DiT (class-conditional) | 675M | 2022 | SD VAE | ImageNet only; no text |
| PixArt-Sigma | DiT (cross-attention) | 0.6B | 2024 | SD VAE | 4K output |
| Stable Diffusion 3 (8B) | MM-DiT, rectified flow | 8B | 2024 | 16-channel SD3 VAE | Three text encoders |
| FLUX.1 | MM-DiT plus parallel attention, rectified flow | ~12B | 2024 | SD3 VAE | Open weights |
| Sora | DiT (spacetime) | not disclosed | 2024 | Video VAE | Video; up to 1 minute |
Qualitatively, SDXL shows that a U-Net can produce strong samples at 2.6B parameters, but no widely used U-Net has been scaled much beyond that size. DiT-style models have been pushed to 8B and into the 12B range, and they handle multimodal conditioning more cleanly because attention is the only token-mixing operation in the network.
DiTs scale predictably. Compute and FID move together along a roughly straight line in log-log space, which is the same regime that language models enjoy. The architecture is also conceptually clean: apart from the per-block residual connections there are no long-range encoder-decoder skip connections, no changes of resolution inside the network, and no separate downsampling and upsampling stages. Conditioning is uniform across blocks, which makes it easy to add a text encoder, a class label, a camera pose, or any other auxiliary signal. Transformer infrastructure (flash attention, RoPE, ZeRO, FSDP, tensor parallelism) ports over without modification, so a team that knows how to train large language models can train large DiTs.
There is also a culture argument that gets less attention but matters in practice. The transformer ecosystem has been shaped by language modelling for the better part of a decade, which means tooling for sharding, profiling, debugging, and serving is mature in a way that U-Net-specific tooling never was. Hugging Face's Diffusers, vLLM-style attention kernels, and PyTorch FSDP all assume a transformer-shaped graph. Adopting DiT lets a team plug into that ecosystem instead of maintaining a parallel one.
Self-attention is quadratic in sequence length, so high-resolution images and long videos are expensive. A 1024x1024 image with a patch size of 2 on a 1/8 latent gives 64x64 = 4,096 tokens, so each attention head computes about 16.8M query-key dot products per layer; a one-minute video is dramatically worse. Video DiTs typically factorise attention along space and time, use sliding windows, or adopt linear-attention substitutes like Mamba and RWKV to keep this manageable. Memory pressure during training is also higher than for U-Nets because activations must be retained across a deeper, wider stack. Finally, DiTs are weaker at small scales: a 50M-parameter U-Net often beats a 50M-parameter DiT, and the architecture only earns its keep above roughly 100M to 200M parameters.
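The saving from factorising attention along space and time can be estimated by counting query-key pairs. The sketch below compares full spatiotemporal attention with one spatial pass plus one temporal pass; the 16-frame, 1,024-tokens-per-frame clip is a hypothetical example, and the function names are illustrative.

```python
def full_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Every token attends to every token in the whole clip."""
    n = frames * tokens_per_frame
    return n * n

def factorised_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Spatial attention within each frame plus temporal attention per position."""
    spatial = frames * tokens_per_frame ** 2
    temporal = tokens_per_frame * frames ** 2
    return spatial + temporal

# The single-image case from the text: 1024x1024, 1/8 latent, patch size 2.
image_tokens = (1024 // 8 // 2) ** 2
print(image_tokens, full_attention_pairs(1, image_tokens))  # 4096 16777216

# Hypothetical short clip: 16 latent frames of 1,024 tokens each.
print(full_attention_pairs(16, 1024))        # 268435456
print(factorised_attention_pairs(16, 1024))  # 17039360
```

For this clip the factorised scheme needs roughly one sixteenth of the pairs, and the gap widens as the number of frames grows, at the cost of tokens no longer attending directly across both space and time in a single layer.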
Position encoding is another open issue. The original DiT uses fixed sine-cosine absolute embeddings sized for the training grid, which makes it awkward to evaluate at resolutions other than the training resolution. Lumina-T2X and FLUX switched to RoPE-based schemes for this reason, while the SD3 family keeps absolute 2D embeddings but adapts them to support multiple resolutions. Patch size choice carries similar trade-offs: patch 2 gives more tokens and better detail but raises attention cost; patch 4 or patch 8 are cheap but blur fine structure and tend to lose small text or thin lines.
The paper's two authors took notably different paths after publication.
William Peebles finished his PhD at UC Berkeley in 2023 (advised by Alexei Efros), having done the DiT work during an internship at Meta AI in 2022. He joined OpenAI in 2023 and went on to lead the Sora team alongside Tim Brooks; Sora's spacetime-patch architecture is a direct application of the DiT recipe to video. He has remained at OpenAI through Sora 2 and continues to lead video research there.
Saining Xie is an assistant professor at NYU Courant, where he runs a vision and generative modelling group. Before DiT he was best known as a co-author of ConvNeXt and Masked Autoencoders (MAE), and he previously worked on Aggregated Residual Transformations (ResNeXt) during his PhD at UC San Diego. The DiT paper was presented as an oral at ICCV 2023. He has continued to push representation-aware diffusion (REPA) and autoregressive image generation since then.
The pairing reflects a small pattern in the field. Peebles brought generative-model experience (his prior work on GANs included GAN-Supervised Dense Visual Alignment) and Xie brought a vision-architecture lineage. The combination produced a paper that reads more like a careful empirical study than a typical "we propose a new model" announcement, which is part of why it has been so influential.
By the end of 2025, every state-of-the-art open-weight text-to-image and text-to-video model used a DiT-style backbone. The conventional U-Net survives mainly in finetuned models built on Stable Diffusion 1.5 and SDXL.
| Model / system | Organisation | Release | Architecture lineage |
|---|---|---|---|
| Stable Diffusion 3 / 3.5 | Stability AI | March / October 2024 | MM-DiT with rectified flow |
| FLUX.1 (pro / dev / schnell) | Black Forest Labs | August 2024 | MM-DiT plus parallel attention, rectified flow |
| Sora | OpenAI | February 2024 (preview); December 2024 (Sora Turbo) | DiT with spacetime patches |
| Sora 2 | OpenAI | September 2025 | DiT with native audio, sharper physics |
| Veo / Veo 2 | Google DeepMind | 2024 | Transformer-based video diffusion |
| Veo 3 | Google DeepMind | 2025 | Transformer DiT lineage with native audio |
| Movie Gen | Meta | October 2024 | 30B-parameter transformer with synchronized audio |
| Runway Gen-3 / Gen-4 | Runway | 2024 / 2025 | Text-and-video diffusion transformer |
| Pika 2.0+ | Pika Labs | 2024 | Transformer-based diffusion |
| Hunyuan-DiT / HunyuanVideo | Tencent | May 2024 / December 2024 | DiT with bilingual encoders |
| PixArt-alpha / Sigma | Huawei Noah's Ark | October 2023 / March 2024 | DiT with cross-attention |
| Lumina-T2X / Lumina-Next | Shanghai AI Lab | May / June 2024 | DiT with RoPE and RMSNorm |
| CogVideoX | Zhipu AI | August 2024 | Expert transformer with expert adaLN |
| Cosmos | Nvidia | January 2025 | DiT-based world models for physical AI |
| Allegro | Rhymes AI | 2024 | Open-source video DiT |
| Open-Sora (STDiT) | HPC-AI Tech | 2024 | Spatial-temporal DiT |
The pattern is striking. Within roughly two years of the original paper, the architecture had been adopted by every major lab and most well-funded startups. There are caveats: FLUX and SD3 are formally rectified flow transformers rather than denoising diffusion transformers in the strict sense, and Sora keeps a lot of its specifics private. But the patchify-attend-modulate skeleton is in all of them.
The official DiT codebase is at github.com/facebookresearch/DiT and is released under a CC-BY-NC license. It is a PyTorch implementation (the paper's experiments were run in JAX) and contains training scripts for ImageNet at 256x256 and 512x512, the four model sizes from the paper, evaluation utilities for FID, and Colab demos. Independent reimplementations (Hugging Face Diffusers, lucidrains' implementations, and JAX ports) report results close to the paper's numbers. PixArt and FLUX both started from forks of this code, which is one of the reasons the design has spread so quickly: the authors made it easy to copy.
The Diffusers library's DiTPipeline, StableDiffusion3Pipeline, and FluxPipeline classes are the most common community entry points. Most papers that propose new variants release weights for at least one size, often on Hugging Face under a research license.
The Peebles and Xie paper has been cited many thousands of times since 2023 and was presented as an oral at ICCV 2023. Robin Rombach, Andreas Blattmann, and Dominik Lorenz, three of the authors of the original latent diffusion paper, founded Black Forest Labs after leaving Stability AI and shipped FLUX.1 in August 2024.
Several 2024 and 2025 directions are worth tracking. Linear-attention DiTs replace softmax attention with state-space models or linear kernels to bring the per-layer cost down to O(N) in token count; LaMamba-Diff is one example. Mixture-of-experts DiTs route tokens through expert MLPs to expand parameter count without proportionally expanding compute; some Stable Diffusion 3.5 variants and concurrent research papers explore this. Autoregressive image transformers such as VAR (visual autoregressive modelling, which predicts successive scales) and MAR (masked autoregressive, which predicts sets of masked tokens) generate images over a sequence of autoregressive steps and challenge the noise-conditioning paradigm itself. Representation autoencoders (RAE-DiT) replace the SD VAE with semantic encoders to improve sample quality. The patchify-attend-modulate skeleton remains stable across all of these.
Nvidia's Cosmos platform, announced at CES 2025, builds DiT-style world foundation models for robotics and autonomous-vehicle simulation. The Cosmos-1.0 family includes diffusion (DiT) and autoregressive variants ranging from 4B to 14B parameters and supports text-to-world, video-to-world, and conditional world generation. This is a different application domain from text-to-image, but the architectural lineage is the same.
Meta's Movie Gen, announced in October 2024, is another large transformer-based diffusion model with synchronized audio generation. It uses a 30B-parameter transformer for video and a 13B-parameter audio model that conditions on the visuals, producing up to 16-second clips at 1080p. Like Sora 2 and Veo 3, it represents the merger of the DiT recipe with native audio output.
Why has the DiT recipe held up so well across image and video? Two arguments are common in the literature. First, the uniformity argument: a transformer's compute is spread evenly over its tokens, so adding parameters or compute scales the model's expressivity uniformly. A U-Net, by contrast, allocates compute non-uniformly across resolutions, and adding capacity to a U-Net often helps only the bottleneck. Second, the inductive-bias argument: U-Nets bake in translation equivariance and a multi-resolution hierarchy, both of which are useful at small scales but become a constraint at large scales. A transformer's only inductive bias is permutation equivariance broken by positional encodings, which is gentle enough that with enough data the model can learn richer structure.
AdaLN-Zero deserves a separate note. Initialising the residual gates to zero ensures that at step zero every transformer block is the identity, so the token sequence passes through the stack unchanged and early gradients are not distorted by dozens of randomly initialised sublayers. The gates then open gradually as the conditioning MLP learns, which produces noticeably more stable training, particularly at large model sizes where naive initialisations often diverge in the first few thousand steps.