MMDiT (Multimodal Diffusion Transformer)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,771 words
MMDiT (Multimodal Diffusion Transformer, sometimes written MM-DiT) is a transformer architecture for text-conditioned image generation introduced by Stability AI in the paper Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (Esser, Kulal, Blattmann et al., March 2024).[^1] The design extends the Diffusion Transformer (DiT) of Peebles and Xie by giving image tokens and text tokens their own separate weight matrices for layer normalization, query/key/value projections, and feedforward layers, while still mixing the two modalities through a single joint self-attention operation per block.[^1] MMDiT is the backbone of Stable Diffusion 3 and Stable Diffusion 3.5, and its dual-stream design was adopted, with modifications, by Black Forest Labs for FLUX.1 and FLUX.2.[^2][^3][^4] In all of these systems it is trained with a rectified flow objective rather than a classical denoising diffusion loss, and used together with a variational autoencoder that maps pixels to a compressed latent space.[^1][^2]
By 2023, transformer backbones had begun displacing U-Net convolutional architectures for several high-end generative image models. The immediate trigger was the Diffusion Transformer paper of William Peebles and Saining Xie, published as arXiv preprint 2212.09748 on December 19, 2022 and presented as an oral at the 2023 International Conference on Computer Vision.[^5][^6] DiT applied a Vision Transformer-style backbone to latent diffusion: a VAE compressed images to a lower-resolution latent grid, the latent was patchified into a sequence of tokens, and a stack of transformer blocks predicted the diffusion noise at each timestep.[^5] DiT-XL/2 reached a Fréchet Inception Distance of 2.27 on class-conditional ImageNet 256x256, a state-of-the-art result for that benchmark at the time, while also demonstrating clean monotonic scaling of FID with model Gflops over four model sizes (DiT-S, DiT-B, DiT-L, DiT-XL).[^5][^6]
Crucially, DiT was a class-conditional model. Conditioning was injected through Adaptive Layer Normalization with zero initialization, abbreviated AdaLN-Zero, in which a small MLP took the timestep and class embedding as input and produced scale and shift parameters that modulated activations inside every transformer block.[^5][^6] DiT did not have to handle long sequences of text tokens with their own structure; a one-hot class label, broadcast as a single embedding vector, was sufficient. The Peebles and Xie paper compared four conditioning schemes (in-context conditioning, cross-attention, plain adaLN, and adaLN-Zero) and reported that adaLN-Zero was the best on FID while also being the most compute-efficient, primarily because it adds no token positions and very few parameters compared to cross-attention.[^5][^6]
Text-to-image models such as the original Stable Diffusion and its successors had instead injected text through cross-attention layers inserted into a U-Net, with the text embedding acting as a static set of key/value pairs that image tokens queried against.[^7] In that design the text representation does not change as a function of image content; only the image side updates. The dominant text-to-image pipelines circa 2023, including Stable Diffusion 1.x and 2.x and SDXL, all followed this template. PixArt-α, which appeared in late 2023, was an early non-Stability transformer-based text-to-image model that adopted a DiT-like backbone but still used cross-attention to inject text.
When Stability AI's team set out to scale a transformer text-to-image model substantially beyond SDXL, they faced two open questions. First, how to extend the AdaLN-Zero conditioning scheme to a setting where text was not a single label but a long token sequence with its own internal structure that needs to be preserved and possibly refined. Second, whether to keep the asymmetry of cross-attention or to give text tokens the ability to update their own representations and exchange information with image tokens in both directions. The MMDiT architecture is the team's answer to both of those questions, packaged together with a rectified flow training objective borrowed from a separate line of work.[^1]
MMDiT operates on latent representations rather than raw pixels. An image is first encoded by a VAE into a latent tensor of shape h x w x c. In the published Stable Diffusion 3 configuration the latent has 16 channels and a spatial downsampling of eight, matching the autoencoder design used for the model.[^7][^8] The latent is patchified by extracting non-overlapping 2x2 patches, which are flattened and linearly projected to produce a sequence of image tokens of length (h/2)(w/2).[^1] Sinusoidal positional embeddings are added; the paper additionally describes a position-grid scheme combining extension and interpolation so that the network can handle aspect ratios different from those seen at the highest pre-training resolution.[^1]
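The patchify step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify`, the 1024x1024 input size, and the hidden width of 1536 are assumptions chosen for the example, and the projection matrix is random rather than learned.

```python
import numpy as np

def patchify(latent: np.ndarray, patch: int = 2) -> np.ndarray:
    """Split an (h, w, c) latent into non-overlapping patch x patch tiles and
    flatten each tile into one row of length patch * patch * c."""
    h, w, c = latent.shape
    assert h % patch == 0 and w % patch == 0, "latent sides must be divisible by the patch size"
    gh, gw = h // patch, w // patch
    tiles = latent.reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return tiles.reshape(gh * gw, patch * patch * c)

# Assumed example: a 1024x1024 image, 8x spatial downsampling, 16 latent channels.
latent = np.zeros((1024 // 8, 1024 // 8, 16), dtype=np.float32)
patches = patchify(latent)
print(patches.shape)  # (4096, 64): (h/2)(w/2) tokens, each of size 2*2*16

# Linear projection to the transformer width (random weights, illustrative only).
hidden = 1536  # assumed width for the example
rng = np.random.default_rng(0)
w_proj = rng.standard_normal((patches.shape[1], hidden)).astype(np.float32)
tokens = patches @ w_proj
print(tokens.shape)  # (4096, 1536)
```

The token count grows with the square of the image side, which is why the image sequence is usually much longer than the text sequence.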
Text is encoded by an ensemble of pre-trained, frozen text encoders. For Stable Diffusion 3 Medium these are OpenAI's CLIP ViT-L/14, OpenCLIP ViT-bigG/14, and Google's T5 v1.1 XXL.[^2][^7] The two CLIP encoders supply both per-token contextual embeddings and a single pooled vector each; the T5-XXL encoder supplies a long sequence of contextual token embeddings.[^7] The CLIP per-token embeddings and the T5 embeddings are concatenated into one text token sequence; the pooled CLIP vectors are concatenated with each other and added to the timestep embedding to form a global conditioning vector.[^7] The model can be run without the T5 encoder at inference time, at a reported cost of roughly four percentage points in text-adherence win rate and a larger drop in typography quality.[^9]
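A shape-level sketch of how the encoder outputs might be combined is given below. The paragraph above does not state the tensor sizes or the exact joining procedure, so everything specific here is an assumption for illustration: 77 tokens per encoder, widths of 768, 1280, and 4096, channel-wise concatenation of the two CLIP streams, and zero-padding to the T5 width. The function name `build_text_conditioning` is hypothetical.

```python
import numpy as np

# Assumed sizes (not given in the text above): tokens per encoder and embedding widths.
N_TOK, D_CLIP_L, D_CLIP_G, D_T5 = 77, 768, 1280, 4096

def build_text_conditioning(clip_l_tok, clip_g_tok, t5_tok, clip_l_pool, clip_g_pool):
    """Return (text token sequence, pooled conditioning vector)."""
    # Assumption: join the two CLIP token streams channel-wise ...
    clip_tok = np.concatenate([clip_l_tok, clip_g_tok], axis=-1)
    # ... then zero-pad the channel axis so it matches the T5 width.
    pad = t5_tok.shape[-1] - clip_tok.shape[-1]
    clip_tok = np.pad(clip_tok, ((0, 0), (0, pad)))
    # CLIP and T5 tokens are concatenated along the sequence axis.
    text_tokens = np.concatenate([clip_tok, t5_tok], axis=0)
    # The pooled CLIP vectors are concatenated with each other.
    pooled = np.concatenate([clip_l_pool, clip_g_pool], axis=-1)
    return text_tokens, pooled

text_tokens, pooled = build_text_conditioning(
    np.zeros((N_TOK, D_CLIP_L)), np.zeros((N_TOK, D_CLIP_G)), np.zeros((N_TOK, D_T5)),
    np.zeros(D_CLIP_L), np.zeros(D_CLIP_G),
)
print(text_tokens.shape, pooled.shape)  # (154, 4096) (2048,)
```

In a full model the pooled vector would pass through a learned projection before being added to the timestep embedding; that step is omitted here.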
A single MMDiT block contains two parallel transformer streams, one consuming image tokens, the other consuming text tokens. Every learned parameter inside the block, including the normalization modulation parameters, the query/key/value projections, the attention output projection, and the two feedforward matrices, is duplicated, with one independent copy per modality.[^1] The image stream's weights are never tied to the text stream's weights.
The two streams are connected by a single joint self-attention operation. After per-modality LayerNorm and per-modality QKV projection, the resulting query, key, and value tensors of the two streams are concatenated along the sequence dimension, producing combined tensors of length N_image + N_text. A standard scaled dot-product attention is computed over this combined sequence. The output is then split back into image-length and text-length pieces, each piece is passed through that modality's output projection, and the streams continue independently into the per-modality feedforward layer.[^1] The Stability AI paper describes this concisely: the design is "equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account."[^1]
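The joint attention step can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the Stable Diffusion 3 code: normalization, modulation, residual connections, and QK-normalization are left out, the weights are random, and the names `joint_attention`, `w_img`, and `w_txt` are made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(img, txt, w_img, w_txt, heads):
    """Joint self-attention over image tokens (n_img, d) and text tokens (n_txt, d).

    w_img and w_txt are separate dicts of (d, d) matrices under keys q, k, v, o.
    """
    n_img, d = img.shape
    head_dim = d // heads

    # Per-modality projections, then concatenation along the sequence axis.
    q = np.concatenate([img @ w_img["q"], txt @ w_txt["q"]], axis=0)
    k = np.concatenate([img @ w_img["k"], txt @ w_txt["k"]], axis=0)
    v = np.concatenate([img @ w_img["v"], txt @ w_txt["v"]], axis=0)

    def split_heads(x):  # (n, d) -> (heads, n, head_dim)
        return x.reshape(x.shape[0], heads, head_dim).transpose(1, 0, 2)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(head_dim)
    out = softmax(scores) @ vh                   # (heads, n_img + n_txt, head_dim)
    out = out.transpose(1, 0, 2).reshape(-1, d)  # merge heads back to (n, d)

    # Split by modality and apply each stream's own output projection.
    return out[:n_img] @ w_img["o"], out[n_img:] @ w_txt["o"]

rng = np.random.default_rng(0)
d, heads = 8, 2

def make_weights():
    return {name: rng.standard_normal((d, d)) / np.sqrt(d) for name in "qkvo"}

w_img, w_txt = make_weights(), make_weights()
img, txt = rng.standard_normal((6, d)), rng.standard_normal((3, d))
img_out, txt_out = joint_attention(img, txt, w_img, w_txt, heads)
print(img_out.shape, txt_out.shape)  # (6, 8) (3, 8)
```

Because every query attends over the full concatenated sequence, changing the image tokens changes the text outputs and vice versa, which is the two-way coupling that a cross-attention design lacks.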
This is a structural change from earlier text-to-image models. In a cross-attention U-Net, text embeddings are queried by image features but text representations are never updated by image content. In MMDiT, both modalities have queries, keys, and values, and both are updated at every block. The paper labels this two-way coupling a "bidirectional flow of information between image and text tokens" and treats it as the central novel ingredient of the architecture compared to DiT.[^1]
MMDiT inherits the AdaLN-Zero conditioning scheme from DiT.[^1][^5] A small MLP consumes the timestep embedding plus the pooled text conditioning vector and outputs a set of modulation parameters per block. For each of the two sublayers in a block (attention and feedforward), these parameters supply a scale and a shift applied to the normalized activations entering the sublayer, and a gate that multiplies the sublayer's output before it is added back to the residual stream. The output gates are initialized to zero, which makes each new block start as an identity function and stabilizes the early phase of training, exactly as in DiT.[^5]
The novelty in MMDiT is that there are now two separate sets of modulation parameters per block, one for the image stream and one for the text stream, generated by the same conditioning MLP but with their own learned projection weights.[^1] Each modality is therefore modulated independently. The shared conditioning vector ensures that both streams see consistent global information about the current timestep and the high-level pooled text description, while the separate modulation matrices preserve the modality-specific statistics.
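The modulation mechanism can be sketched as follows. This is a simplified illustration assuming a single linear projection from the conditioning vector to (shift, scale, gate) per sublayer and per stream; the activation applied before that projection in practice is omitted, `np.tanh` stands in for an attention or feedforward sublayer, and all names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalization without learned affine parameters; modulation supplies them.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def modulated_sublayer(x, cond, w_mod, b_mod, sublayer):
    """Apply one AdaLN-Zero-style sublayer: modulate, transform, gate, add residual."""
    shift, scale, gate = np.split(cond @ w_mod + b_mod, 3)
    h = layer_norm(x) * (1.0 + scale) + shift
    return x + gate * sublayer(h)

rng = np.random.default_rng(0)
d, c = 8, 4                       # token width and conditioning width (illustrative)
cond = rng.standard_normal(c)     # shared conditioning vector (timestep + pooled text)
img, txt = rng.standard_normal((6, d)), rng.standard_normal((3, d))

# Separate, zero-initialized modulation projections for each stream.
w_img_mod, b_img_mod = np.zeros((c, 3 * d)), np.zeros(3 * d)
w_txt_mod, b_txt_mod = np.zeros((c, 3 * d)), np.zeros(3 * d)

img_out = modulated_sublayer(img, cond, w_img_mod, b_img_mod, np.tanh)
txt_out = modulated_sublayer(txt, cond, w_txt_mod, b_txt_mod, np.tanh)
print(np.allclose(img_out, img), np.allclose(txt_out, txt))  # True True

# Once the projection has moved away from zero, the sublayer contributes.
w_trained = rng.standard_normal((c, 3 * d))
img_trained = modulated_sublayer(img, cond, w_trained, b_img_mod, np.tanh)
```

With zero-initialized projections the gate is zero, so each stream passes through unchanged at the start of training; both streams read the same `cond` but through their own weights.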
The motivation given in the paper is statistical. Text token embeddings, even after being projected to a common width, live on a different distribution from image latent token embeddings. They have a different sequence length, a different effective vocabulary, and very different second-order statistics. Sharing one set of LayerNorm parameters between the two would force a single set of scale and shift constants to fit two distributions, which is suboptimal; sharing one feedforward would force the same hidden representations to serve both.[^1] The paper documents an architecture comparison on CC12M in which MMDiT outperforms a plain DiT that simply concatenates text and image tokens with shared weights, as well as a CrossDiT baseline that uses cross-attention and a UViT baseline, across both validation loss and downstream metrics.[^1]
Conceptually, the joint self-attention provides the cross-modal channel that the model needs in order to be text-conditioned at all, while the per-modality weights elsewhere act as a soft inductive bias that lets each stream specialize. Text tokens can learn their own representations through their own feedforward, image tokens through theirs, and only the attention step has to be modality-neutral. The split also means the gradients flowing into the text-side weights are computed against a fundamentally different objective shape than those flowing into the image-side weights, which the paper argues helps preserve linguistic structure in the text stream while the image stream specializes on perceptual signals. By contrast, a fully shared backbone (the DiT baseline that simply concatenates text and image tokens) tends to collapse the text-side representations into something image-like, because the per-token training signal is dominated by the much longer image sequence.[^1]
This statistical argument generalizes. Any time two token streams differ substantially in distribution, length, and downstream use, the MMDiT recipe (separate weights everywhere, joint attention only) provides a structured way to share information without forcing a single set of normalization or feedforward parameters to be a compromise between the two distributions. Several later works in audio, video, and protein generation have used MMDiT-style dual streams for exactly this reason.
For position information, Stable Diffusion 3's MMDiT uses additive sinusoidal embeddings on the image token sequence, computed from a 2D position grid that follows the same 2x2 patch stride as the token sequence.[^1] To support generation at multiple resolutions and aspect ratios, the paper describes a procedure of extending and interpolating the position grid before applying the sinusoidal embedding, which avoids the abrupt frequency mismatch that pure interpolation can cause.[^1] Text tokens carry the positional information already encoded by their respective text encoders, so no additional positional embedding is added on the text side.[^7] In later derivatives such as FLUX.1 the sinusoidal scheme is replaced by Rotary Position Embedding (RoPE), discussed below.[^3][^10]
MMDiT models in the Stable Diffusion 3 paper are described by a single integer, the depth d. Hidden size is set to 64d, the number of attention heads to d (so each head has a fixed width of 64), and the feedforward expansion to four times the hidden size.[^1] The reported scaling runs include depth 15, 18, 24, 30, and 38; the largest, depth 38, has 38 attention heads and corresponds to roughly 8 billion parameters in the joint MMDiT.[^1][^11] These configurations were swept on CC12M at 256x256 latents for half a million steps with a batch size of 4096 in order to study how validation loss decreases as depth increases. The paper reports a smooth, power-law-like decrease with no evidence of saturation in that regime.[^1]
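The depth-based parameterization can be written as a small helper. The function name `mmdit_config` and the dictionary keys are illustrative; only the three rules (hidden size 64d, d heads, 4x feedforward expansion) come from the description above.

```python
def mmdit_config(depth: int) -> dict:
    """Derive the width-related hyperparameters from the depth d."""
    hidden = 64 * depth
    return {
        "depth": depth,
        "hidden_size": hidden,
        "num_heads": depth,
        "head_dim": hidden // depth,  # always 64 under this rule
        "mlp_hidden": 4 * hidden,
    }

for d in (15, 24, 38):
    print(mmdit_config(d))
```

For depth 38 this gives a hidden size of 2432, 38 heads of width 64, and a feedforward width of 9728.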
A useful observation in the scaling study is that the validation loss tracks downstream evaluation metrics (CLIP score, GenEval, T2I-CompBench) tightly: model selection by validation loss alone is reliable. The depth-38 configuration was the one ultimately scaled up for the production 8B Stable Diffusion 3 release.[^1][^9]
MMDiT is trained as a velocity predictor for a rectified flow rather than as a noise predictor for a denoising diffusion model. Rectified flow, introduced by Liu, Gong, and Liu in Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow (arXiv:2209.03003, September 2022), defines a generative model as the solution of an ordinary differential equation whose target trajectories are straight lines connecting samples from a base distribution to samples from the data distribution.[^12] In the simplest formulation, for image x_1 drawn from data and noise x_0 drawn from a standard normal, the linear interpolant x_t = (1 - t) x_0 + t x_1 with t drawn from a distribution on [0, 1] defines a path; the model is trained by regression to predict the velocity x_1 - x_0 given x_t and t.[^12] At inference time the learned velocity field is integrated forward in time from t = 0 (pure noise) to t = 1 (sample) using a numerical ODE solver.
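The training target and the sampling loop can be sketched as below, using the time convention of the paragraph above (t = 0 is noise, t = 1 is data). This is a toy illustration: `training_pair` and `euler_sample` are hypothetical names, and the demo uses an analytic velocity field for a data distribution consisting of a single point instead of a trained network.

```python
import numpy as np

def training_pair(x1, rng, t=None):
    """Build one regression example: input (x_t, t) and target velocity x_1 - x_0."""
    x0 = rng.standard_normal(x1.shape)   # noise sample
    t = rng.uniform() if t is None else t
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path
    return xt, t, x1 - x0

def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
x1 = np.array([2.0, -1.0, 0.5])          # toy "data" sample
xt, t, target = training_pair(x1, rng)

# For a single-point data distribution the exact velocity is (x1 - x) / (1 - t).
def oracle_velocity(x, t):
    return (x1 - x) / (1.0 - t)

sample = euler_sample(oracle_velocity, rng.standard_normal(3), steps=4)
print(np.allclose(sample, x1))  # True
```

In this toy case the trajectories are exactly straight, so even a handful of Euler steps lands on the data point; a learned velocity field over real data is only approximately straight and needs more steps.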
The Stable Diffusion 3 paper contributes two things on top of vanilla rectified flow. First, it shows that the choice of distribution over the training timestep t matters substantially: rather than sampling t uniformly from [0, 1], it biases the sampling toward intermediate timesteps where the prediction problem is harder. The most successful scheme is logit-normal sampling, in which the logit of t is drawn from a standard normal; the paper writes this as rf/lognorm(0.00, 1.00).[^1] In a large head-to-head sweep of 61 formulations covering epsilon prediction, v-prediction, EDM, and rectified flow variants with linear and cosine schedules, the logit-normal rectified flow consistently ranked best on CLIP score and FID.[^1] Two alternative samplers, called mode sampling with heavy tails and CosMap, are also documented in the paper as second-best options.[^1] Second, the paper adopts QK-normalization, applying RMSNorm to the query and key tensors before the attention dot product. This stabilizes mixed-precision training at high resolutions, where unnormalized attention logits can explode, and it later became the default in Stable Diffusion 3.5.[^1][^13]
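Both ingredients are small enough to sketch directly. The code below is an illustration under assumptions rather than the paper's implementation: `sample_logit_normal` and `rms_norm` are hypothetical helper names, and the RMSNorm here has an optional learned weight that is left unset.

```python
import numpy as np

def sample_logit_normal(rng, n, mean=0.0, std=1.0):
    """Draw t in (0, 1) by pushing a normal sample through the logistic function."""
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def rms_norm(x, weight=None, eps=1e-6):
    """Scale each vector to unit root-mean-square along the last axis."""
    y = x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return y if weight is None else y * weight

rng = np.random.default_rng(0)

# Timesteps concentrate around t = 0.5 instead of being spread uniformly.
t = sample_logit_normal(rng, 10_000)
mid_fraction = np.mean((t > 0.25) & (t < 0.75))
print(mid_fraction > 0.5)  # True; a uniform sampler would give about 0.5

# Queries and keys with a very large scale are brought back to unit RMS
# before the dot product, which bounds the size of the attention logits.
q = 1000.0 * rng.standard_normal((5, 64))
k = 1000.0 * rng.standard_normal((5, 64))
logits_raw = q @ k.T / np.sqrt(64)
logits_normed = rms_norm(q) @ rms_norm(k).T / np.sqrt(64)
print(np.abs(logits_normed).max() < np.abs(logits_raw).max())  # True
```

After normalization each logit is bounded in magnitude by the square root of the head width (8 for a width of 64) regardless of how large the raw activations grow.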
The combination of rectified flow with MMDiT, with the logit-normal sampler and QK-normalization, is presented as the recipe that the rest of the paper scales.
Stable Diffusion 3 was first announced as an early preview in February 2024.[^14] The accompanying research paper, Scaling Rectified Flow Transformers for High-Resolution Image Synthesis, was posted on arXiv on March 5, 2024 and later presented at the 2024 International Conference on Machine Learning.[^1][^15] Stable Diffusion 3 was made available through Stability AI's API on April 17, 2024, and open weights for the 2-billion-parameter Stable Diffusion 3 Medium were released on Hugging Face on June 12, 2024.[^14][^7] The original announcement promised model variants from roughly 800 million to 8 billion parameters, all sharing the MMDiT backbone.[^2]
Stable Diffusion 3 Medium uses the depth-24 MMDiT configuration, three frozen text encoders (CLIP-L, OpenCLIP-G, T5-XXL), and a 16-channel VAE.[^7] At inference it can be run with either the full encoder stack or, to save memory, with only the two CLIP encoders.[^7] Stability AI's research blog reports that omitting the T5 encoder costs roughly four percentage points of text-adherence win rate and significantly more in typography evaluations.[^9] Compute requirements scale with the depth of the chosen MMDiT: the 2B Medium fits comfortably on consumer GPUs, while the 8B configuration is reported to take about 34 seconds to generate a 1024x1024 image in 50 sampling steps on an Nvidia RTX 4090 in 24 GB of VRAM.[^9]
Stable Diffusion 3.5 followed on October 22, 2024, with the Medium variant added on October 29, 2024.[^13] The release comprised three models, all retaining the MMDiT backbone but with several training and architecture refinements:[^13]
| Model | Parameters | Notes |
|---|---|---|
| Stable Diffusion 3.5 Large | 8.1 billion | Flagship, targets 1 megapixel output |
| Stable Diffusion 3.5 Large Turbo | 8.1 billion | Distilled variant, four-step sampler |
| Stable Diffusion 3.5 Medium | 2.5 billion | Consumer GPU target, 0.25 to 2 megapixel range |
The headline architectural change in the 3.5 release is the integration of QK-normalization into the published transformer blocks, the same technique used during research for the 8B model in the original paper. Stability AI's release notes describe this as stabilizing training and making fine-tuning easier.[^13] The Medium model also incorporates what Stability AI calls dual attention blocks in its first twelve layers, an extra variant of the MMDiT block, and an extended positional embedding space used during training at 384 latent positions per side to support mixed-scale image training.[^16] The three text encoders carry over from Stable Diffusion 3: CLIP ViT-L, OpenCLIP ViT-bigG, and T5 v1.1 XXL.[^16]
Black Forest Labs was founded in mid-2024 by former Stability AI researchers including Robin Rombach, Patrick Esser, and Andreas Blattmann, who were among the authors of the MMDiT paper. The lab and its FLUX.1 model suite were announced together on August 1, 2024.[^17][^18] The FLUX.1 family consists of three variants sharing a 12-billion-parameter backbone: FLUX.1 [pro], available via API; FLUX.1 [dev], released under a non-commercial open-weight license; and FLUX.1 [schnell], released under Apache 2.0 as a distilled few-step model.[^17][^18]
FLUX.1's transformer combines MMDiT-style dual-stream blocks with additional single-stream blocks placed after them. According to the model architecture documentation, the network contains 19 double-stream MMDiT blocks followed by 38 single-stream blocks.[^10] The double-stream blocks follow the SD3 MMDiT design with separate weights per modality and a joint self-attention.[^10] The single-stream blocks, by contrast, take the concatenated text-plus-image sequence and process it through a single set of weights with parallel attention and feedforward, in a fashion closer to a plain DiT operating on a long mixed sequence.[^10] FLUX.1 also replaces the sinusoidal position embeddings of SD3 with Rotary Position Embedding across the joint sequence.[^3][^10] The model continues to be trained with rectified flow.[^17]
A second-generation variant, FLUX.1 Kontext, was released on May 29, 2025 as a suite focused on in-context image generation and editing.[^19] It keeps the MMDiT plus single-stream backbone but adapts the input pipeline to accept reference images alongside text prompts.[^19]
FLUX.2 was announced by Black Forest Labs on November 25, 2025 as a substantially redesigned successor.[^4][^20] The model scales to approximately 32 billion parameters total, partitioned into a smaller number of double-stream MMDiT-style blocks (eight in the public diffusers integration) and a much larger number of single-stream blocks (forty-eight), shifting the parameter ratio further toward the single-stream end.[^20] FLUX.2 also replaces the multi-encoder text stack of FLUX.1 and SD3 with a single Mistral Small 3.1 vision-language model.[^4][^20] Other reported changes include the removal of bias parameters throughout attention and feedforward layers, fused QKV plus feedforward projections, and shared AdaLN-Zero modulation across blocks.[^20] FLUX.2 ships as FLUX.2 [pro] for API use, FLUX.2 [flex] for parameter-controlled API access, FLUX.2 [dev] as an open-weight model on Hugging Face, and FLUX.2 [klein], an announced size-distilled Apache-2.0 variant.[^4][^20]
The line of descent is clear: SD3 introduces MMDiT, SD3.5 stabilizes it with QK-normalization and adds the dual attention variant, FLUX.1 keeps MMDiT for early blocks and stacks single-stream blocks on top, and FLUX.2 retains the same hybrid topology but rebalances the ratio, drops biases, and switches the text encoder.
The Stable Diffusion 3 paper reports several quantitative results that justify the MMDiT design.[^1] On a CC12M architecture comparison at 256x256 resolution, the MMDiT variant achieves lower validation loss and higher downstream CLIP score than DiT (with concatenated text tokens), CrossDiT (cross-attention), and UViT baselines after equal training compute.[^1] In the depth scaling experiment, validation loss decreases smoothly from depth 15 through depth 38 on CC12M, without a visible plateau.[^1] The paper also reports a sample-efficiency effect: larger MMDiTs lose less quality when run with fewer sampling steps. Its Table 6 lists relative CLIP score decreases of 4.30 percent (depth 15), 3.59 percent (depth 30), and 2.71 percent (depth 38) when sampling with five steps instead of fifty, illustrating that larger MMDiTs are not only better at full sampling budgets but also more robust to aggressive step reductions.[^1][^11]
On the GenEval object-grounded text-to-image benchmark (Ghosh et al., 2023), the depth-38 MMDiT model trained with rectified flow and refined with Direct Preference Optimization (DPO) reports an overall score of 0.71, ahead of DALL-E 3 at 0.67 and SDXL at 0.55 in the configuration evaluated by the authors.[^1] On T2I-CompBench (Huang et al., 2023) the same model reports consistently competitive scores across the color, shape, texture, spatial, non-spatial, and complex sub-tasks.[^1] In a human preference study on Parti-Prompts the authors report win rates of approximately 50 percent for visual aesthetics, 54 percent for prompt following, and 62 percent for typography against DALL-E 3.[^1]
Inference-time experiments also report that omitting the T5-XXL encoder reduces the text-adherence win rate by roughly four percentage points and reduces typography quality more substantially (from 50 to 38 percent on the typography metric in one comparison), illustrating that long-form text understanding is concentrated in the T5 stream while shorter text descriptions are handled adequately by the two CLIP encoders alone.[^9]
The paper is careful to attribute these gains to the combination of MMDiT, rectified flow with logit-normal timestep sampling, and the multi-encoder text stack with high dropout (the individual encoders are dropped with probability 46.3 percent during training so that the model can be queried with arbitrary subsets at inference).[^1] Ablations isolating the contribution of each ingredient are reported in the appendix. The single largest factor for high-resolution quality, by the paper's own ablation, is the noise sampler choice (rectified flow with logit-normal t sampling) rather than the architecture itself, suggesting that MMDiT's principal contribution is to make a high-quality model trainable and steerable at scale rather than to provide an FID gain in isolation.[^1]
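The effect of the 46.3 percent drop rate can be worked out by enumerating encoder subsets. The sketch assumes the three encoders are dropped independently of one another, which the paragraph above does not state explicitly; the function name `subset_probabilities` is illustrative.

```python
from itertools import product

P_DROP = 0.463
ENCODERS = ("CLIP-L", "OpenCLIP-G", "T5-XXL")

def subset_probabilities(p_drop=P_DROP, names=ENCODERS):
    """Probability of each kept-encoder subset, assuming independent drops."""
    probs = {}
    for kept in product((False, True), repeat=len(names)):
        p = 1.0
        for is_kept in kept:
            p *= (1.0 - p_drop) if is_kept else p_drop
        probs[tuple(n for n, is_kept in zip(names, kept) if is_kept)] = p
    return probs

probs = subset_probabilities()
p_none = probs[()]        # all three dropped: fully unconditional example
p_all = probs[ENCODERS]   # all three kept
print(f"none kept: {p_none:.4f}, all kept: {p_all:.4f}")
```

Under the independence assumption, 0.463 cubed is about 0.099, so roughly one training example in ten sees no text at all, while every one of the eight possible encoder subsets is seen with non-trivial frequency.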
MMDiT inherits the general failure modes of large text-to-image diffusion models. It can still misrender hands, faces, and small text; it can struggle with rare object categories and precise compositional relationships; and it is computationally expensive at the largest scale, requiring on the order of tens of gigabytes of GPU memory to run the 8B variant at 1024x1024 in full precision.[^9] The reliance on a frozen T5 encoder for the best results imposes an additional memory cost that can be partially traded off at the price of typography quality.[^9]
From an architectural standpoint, the dual-stream design also doubles the number of weights in every block compared to a same-width single-stream transformer operating on a concatenated sequence; that overhead motivated the hybrid design used in FLUX.1 and FLUX.2, in which only the first few blocks pay the dual-stream cost while the bulk of the parameters live in single-stream blocks shared between modalities.[^10] An academic study titled Demystifying Flux Architecture (arXiv:2507.09595, 2025) confirms this trade-off by reverse-engineering FLUX.1 and noting that the single-stream blocks "expand the model's capacity in a relatively lightweight and efficient manner" relative to adding more double-stream blocks.[^10]
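The weight-doubling claim can be checked with a back-of-envelope count. This sketch counts only the attention and feedforward matrices of a generic transformer block, ignoring biases, normalization, and modulation parameters; it does not model the fused single-stream blocks of FLUX.1, and the width of 1536 is an arbitrary example.

```python
def block_weights(d: int, streams: int, mlp_ratio: int = 4) -> int:
    """Approximate matrix weights in one block with `streams` independent weight sets."""
    attention = 4 * d * d              # Q, K, V, and output projections
    feedforward = 2 * mlp_ratio * d * d  # up- and down-projection
    return streams * (attention + feedforward)

d = 1536  # example width
single_stream = block_weights(d, streams=1)
dual_stream = block_weights(d, streams=2)
print(single_stream, dual_stream, dual_stream / single_stream)
```

The ratio is exactly two at any width, while the attention computation itself covers the same concatenated sequence in both cases, so the dual-stream design buys modality-specific capacity at the cost of parameters rather than attention compute.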
A separate strand of follow-up work has begun to study MMDiT not just as a generator but as a feature extractor. The paper Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing (arXiv:2508.07519, 2025) investigates the internal attention maps and representations of MMDiT-based models to perform localized prompt-based edits, and treats the architecture as the new default reference point for diffusion-based image editing research.[^21]
A final limitation worth flagging is that the dual-stream design is conceptually tied to having exactly two modalities. Extending the same recipe to three or more modalities (image, text, audio, video, depth, or reference images, for example) requires either duplicating weights once per modality, which scales poorly, or introducing some other form of sharing. FLUX.1 Kontext's reference image handling and FLUX.2's multi-image inputs both extend MMDiT in this direction, but the cleanest multi-stream generalization of MMDiT remains an active design question.[^19][^20]
The relationship between MMDiT and the original DiT is best understood as a controlled extension along several axes:
| Aspect | DiT (Peebles & Xie, 2022/2023) | MMDiT (Esser et al., 2024) |
|---|---|---|
| Conditioning input | Class label (one-hot) | Long text sequence plus pooled vector |
| Conditioning mechanism | AdaLN-Zero on a single stream | AdaLN-Zero applied separately to two streams |
| Text/image interaction | Not applicable (no text stream) | Joint self-attention over concatenated sequences |
| Weight sharing across modalities | Single set of weights (one modality) | Separate per-modality weights for LN, QKV, output projection, MLP |
| Position encoding | Sinusoidal on 2D image grid | Sinusoidal on 2D image grid plus interpolation scheme for variable aspect ratios |
| Loss | DDPM-style noise prediction | Rectified flow velocity prediction with logit-normal timestep sampling |
| Reported scale | Up to DiT-XL/2, around 675M parameters | Up to depth-38, around 8B parameters |
The MMDiT block reduces to a DiT block in the limit where the text sequence has length one, the per-modality weights are tied across modalities, and the noise-prediction objective is restored. In that sense MMDiT is a strict generalization rather than a replacement, and the family of designs from DiT through MMDiT, FLUX.1's hybrid, and FLUX.2's rebalanced hybrid traces a single line of architectural experiments around the same core question of how to mix text and image tokens inside a transformer trained with a flow-matching-style objective.[^1][^5][^10][^20]
MMDiT sits at the intersection of three earlier lines of research. The first is the Diffusion Transformer of Peebles and Xie, which established AdaLN-Zero conditioning and demonstrated transformer scaling for class-conditional latent diffusion.[^5] The second is rectified flow, introduced by Liu, Gong, and Liu, which provided the straight-line ODE formulation that MMDiT is trained against.[^12] The third is the multi-encoder text conditioning approach pioneered in earlier Stable Diffusion releases, where multiple CLIP encoders, and later a T5 encoder, were used to provide complementary text representations.[^7]
Compared to U-Net based text-to-image models, MMDiT removes convolutional inductive biases entirely; compared to plain DiT, it introduces a structured asymmetry between modalities while keeping a single attention operation that mixes them. The success of this design has shifted the dominant architecture for open text-to-image diffusion away from U-Nets and toward transformer backbones with dual or hybrid streams, a transition reflected in essentially every major open or proprietary diffusion model released from mid-2024 onward.