Flow Matching
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,526 words
Flow Matching is a simulation-free training framework for generative models that fits a time-dependent velocity field to transport samples from a source distribution (typically a standard Gaussian) to a data distribution along a prescribed probability path. The method was introduced by Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le in the paper "Flow Matching for Generative Modeling" (arXiv:2210.02747, October 2022), which appeared at ICLR 2023.[1] Its practical forms, Conditional Flow Matching (CFM) and the Optimal Transport CFM variant (OT-CFM), train continuous normalizing flows without numerical integration during optimization by regressing a network against an analytically known per-sample target vector field. Flow Matching subsumes diffusion training as a special case and has become the dominant objective for state-of-the-art image, video, audio, and speech generators released between 2023 and 2025, including Stable Diffusion 3, FLUX.1, Voicebox, Audiobox, and Meta Movie Gen.[1][7][8][9][11][15]
Continuous normalizing flows (CNFs) were introduced by Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud in "Neural Ordinary Differential Equations" (arXiv:1806.07366, June 2018), which received the best paper award at NeurIPS 2018.[2] A CNF defines a generative model by integrating an ordinary differential equation dx/dt = v_theta(x, t), where v_theta is a neural network and t runs from 0 to 1. Samples are obtained by drawing x_0 from a simple prior (most commonly a Gaussian) and solving the ODE forward to obtain x_1. The instantaneous change-of-variables formula for ODEs replaces the log-determinant of the Jacobian with a time integral of the trace of the velocity field's Jacobian (that is, its divergence), eliminating the architectural constraints that earlier discrete normalizing flows imposed to keep their Jacobians tractable.[2]
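The sampling and change-of-variables mechanics can be illustrated with a minimal sketch in plain Python. It assumes a hand-written one-dimensional velocity field v(x, t) = A x in place of a neural network; the constant `A` and all function names are illustrative choices, not taken from the Neural ODE paper. The sample and its log-density are integrated together with a fixed-step Euler solver.

```python
import math

A = 0.5  # hypothetical constant; stands in for learned network weights


def velocity(x, t):
    """Toy velocity field v(x, t) = A * x (an assumption for illustration)."""
    return A * x


def divergence(x, t):
    """Divergence of the toy field; in 1D this is just dv/dx = A."""
    return A


def standard_normal_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def integrate_cnf(x0, n_steps=1000):
    """Euler-integrate dx/dt = v and d(log p)/dt = -div v from t=0 to t=1."""
    x = x0
    logp = standard_normal_logpdf(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        logp -= divergence(x, t) * dt
        x += velocity(x, t) * dt
    return x, logp


x1, logp1 = integrate_cnf(1.0)

# For this linear field the pushforward of N(0, 1) is N(0, exp(2 * A)),
# so the tracked log-density can be compared with the closed form.
var1 = math.exp(2.0 * A)
exact_logp1 = -0.5 * x1 * x1 / var1 - 0.5 * math.log(2.0 * math.pi * var1)
print(x1, logp1, exact_logp1)
```

In a real CNF the divergence is not available in closed form and has to be computed or estimated from the network's Jacobian, which is one source of the training cost discussed next.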
The original CNF training recipe relied on maximum-likelihood with adjoint backpropagation through the ODE solver, which proved costly at scale. Each training step required simulating the ODE, computing or estimating divergences, and propagating gradients back through the solver, which made CNFs uncompetitive with the next wave of generative modeling. Score-based models and denoising diffusion probabilistic models (DDPM) instead exploited Gaussian noising processes, learning a denoiser that implicitly approximates the score of the noised data distribution, and trained without simulating any process.[1][14] These simulation-free training objectives, combined with U-Net backbones and large datasets, made diffusion the de facto generative-modeling framework from 2020 through 2023.
Flow Matching was designed to recover the advantages of CNFs (deterministic ODE sampling, faster generation, conceptually simple model classes) while inheriting the simulation-free training that made diffusion practical. Lipman et al. introduced an objective that regresses the network against an analytically constructed vector field, sidestepping ODE simulation during training entirely.[1] The framework generalizes the noising paths used by diffusion to an arbitrary family of probability paths, including paths defined by straight-line displacement interpolation between data and noise.[1]
A probability path is a time-indexed family of densities $p_t$ on the data space $\mathbb{R}^d$, with $t \in [0, 1]$, such that $p_0$ is a tractable source (Gaussian) and $p_1$ equals the data distribution $q$. A vector field $u_t : \mathbb{R}^d \to \mathbb{R}^d$ generates the path $p_t$ if the continuity equation $\partial_t p_t + \nabla \cdot (p_t u_t) = 0$ holds. The flow $\phi_t$ solving $\frac{d}{dt} \phi_t(x) = u_t(\phi_t(x))$ with $\phi_0(x) = x$ then pushes $p_0$ to $p_t$.[1]
Given a target path $p_t$ and target vector field $u_t$, the Flow Matching loss directly regresses a parametric velocity $v_\theta(x, t)$ against $u_t(x)$:
$$\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t,\; x \sim p_t} \left\| v_\theta(x, t) - u_t(x) \right\|^2$$
with $t$ drawn uniformly from $[0, 1]$.[1] At zero loss, $v_\theta$ equals $u_t$ almost everywhere on the support of $p_t$, so integrating $dx/dt = v_\theta(x, t)$ from $t = 0$ to $t = 1$ transports the source to the data distribution. The loss is appealing in form, yet it is intractable to evaluate as written, because constructing a marginal $p_t$ and its generating $u_t$ over the data distribution $q$ in closed form is generally infeasible.[1]
The central trick in Lipman et al. is to write the marginal path as a mixture of per-sample conditional paths. Fix any choice of conditional density $p_t(x \mid x_1)$ such that $p_0(x \mid x_1)$ is a fixed Gaussian and $p_1(x \mid x_1)$ concentrates near $x_1$. Define the marginal path by integrating against the data:
$$p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1$$
and similarly define the marginal vector field as the posterior average $u_t(x) = \mathbb{E}\left[ u_t(x \mid x_1) \mid x_t = x \right]$, where the expectation is over $x_1$ drawn from the posterior $p_t(x_1 \mid x) \propto p_t(x \mid x_1)\, q(x_1)$ and $u_t(x \mid x_1)$ is any vector field that generates $p_t(x \mid x_1)$.[1] Crucially, Lipman et al. prove that the Conditional Flow Matching loss
$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\; x_1 \sim q,\; x \sim p_t(\cdot \mid x_1)} \left\| v_\theta(x, t) - u_t(x \mid x_1) \right\|^2$$
has the same gradient with respect to $\theta$ as $\mathcal{L}_{\mathrm{FM}}$, since the two differ by a constant independent of $\theta$.[1] Training is therefore reduced to sampling a data point $x_1$, sampling a noised version $x$ from a known conditional density, and regressing the network against a closed-form conditional vector field. No ODE simulation occurs during training, and no score or divergence estimation is required.
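The reason the two losses share a gradient can be sketched by expanding both squared norms. This is only an outline, assuming enough regularity to exchange the order of integration:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{FM}}(\theta) &= \mathbb{E}_{t,\,p_t(x)} \|v_\theta(x,t)\|^2 \;-\; 2\,\mathbb{E}_{t,\,p_t(x)} \langle v_\theta(x,t),\, u_t(x) \rangle \;+\; \mathbb{E}_{t,\,p_t(x)} \|u_t(x)\|^2, \\
\mathcal{L}_{\mathrm{CFM}}(\theta) &= \mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)} \|v_\theta(x,t)\|^2 \;-\; 2\,\mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)} \langle v_\theta(x,t),\, u_t(x \mid x_1) \rangle \;+\; \mathbb{E}_{t,\,q(x_1),\,p_t(x \mid x_1)} \|u_t(x \mid x_1)\|^2 .
\end{aligned}
$$

The first terms agree because $p_t(x) = \int p_t(x \mid x_1)\, q(x_1)\, dx_1$. The cross terms agree because the marginal field satisfies $p_t(x)\, u_t(x) = \int u_t(x \mid x_1)\, p_t(x \mid x_1)\, q(x_1)\, dx_1$. The last terms differ, but neither depends on $\theta$, so they drop out of the gradient.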
Lipman et al. instantiate the framework with Gaussian conditional paths $p_t(x \mid x_1) = \mathcal{N}(x;\, \mu_t(x_1),\, \sigma_t(x_1)^2 I)$, with smooth schedules $\mu_t$ and $\sigma_t$ such that $\mu_0 = 0$, $\sigma_0 = 1$ (standard Gaussian prior) and $\mu_1 = x_1$, $\sigma_1 = \sigma_{\min}$ (near-Dirac at the data point).[1] In this family, the conditional vector field has a particularly clean expression involving the derivatives of $\mu_t$ and $\sigma_t$.
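For reference, that expression can be written out from the affine flow that carries the standard Gaussian onto the conditional path. The primes below denote derivatives with respect to $t$; the symbol $\psi_t$ for the conditional flow is notation introduced here for the sketch:

$$
\psi_t(x_0) = \sigma_t(x_1)\, x_0 + \mu_t(x_1)
\quad\Longrightarrow\quad
u_t(x \mid x_1) = \frac{\sigma_t'(x_1)}{\sigma_t(x_1)} \bigl( x - \mu_t(x_1) \bigr) + \mu_t'(x_1).
$$

The formula follows by differentiating $\psi_t(x_0)$ in time and substituting $x_0 = (x - \mu_t(x_1)) / \sigma_t(x_1)$.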
Two choices are notable. Setting $\mu_t(x_1) = \alpha_t x_1$ and $\sigma_t(x_1) = \sqrt{1 - \alpha_t^2}$ recovers variance-preserving (VP) diffusion paths, and Lipman et al. show that Flow Matching with these paths is mathematically equivalent to score-matching training for a VP diffusion model on the same noise schedule.[1] The alternative choice $\mu_t(x_1) = t\, x_1$ and $\sigma_t(x_1) = 1 - (1 - \sigma_{\min})\, t$ produces straight-line interpolation between the noise sample and $x_1$, which Lipman et al. call the Optimal Transport (OT) path.
Under the OT conditional path, a sample is $x = (1 - (1 - \sigma_{\min})\, t)\, x_0 + t\, x_1$ and the per-sample vector field is $u_t(x \mid x_1) = x_1 - (1 - \sigma_{\min})\, x_0$. This target is constant in time along each trajectory and, as $\sigma_{\min} \to 0$, reduces to the displacement vector pointing from the noise sample $x_0$ directly to the data sample $x_1$.[1] Lipman et al. report that the OT path provides straighter trajectories, faster training convergence, more stable optimization, and substantially fewer integration steps at inference compared with VP diffusion paths.[1] On ImageNet 128 by 128 and 256 by 256, Flow Matching with OT paths achieved better FID and negative log-likelihood and required roughly one third to one half as many function evaluations as the equivalent diffusion model to reach comparable quality.[1]
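A small sketch can make the OT-path training recipe concrete. It builds (input, target) pairs exactly as described above, then checks that least-squares regression on the per-sample targets recovers the marginal vector field. Several things here are assumptions made so the answer is checkable by hand: the data are one-dimensional and Gaussian (so the marginal field is linear in x with a known closed form), time is held fixed at t = 0.5, and a straight-line fit stands in for the neural network. All names and constants are hypothetical.

```python
import random

SIGMA_MIN = 1e-4   # small terminal width of the conditional path
DATA_MEAN = 2.0    # hypothetical 1D Gaussian "data" distribution q = N(2, 0.5^2)
DATA_STD = 0.5


def ot_cfm_pair(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, target) for the OT conditional path."""
    k = 1.0 - sigma_min
    x_t = (1.0 - k * t) * x0 + t * x1
    target = x1 - k * x0  # does not depend on t
    return x_t, target


def fit_line(xs, ys):
    """Closed-form 1D least squares: ys ~ slope * xs + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def marginal_field_coeffs(t, sigma_min=SIGMA_MIN):
    """Closed-form marginal field u_t(x) = slope * x + intercept for Gaussian q."""
    k = 1.0 - sigma_min
    var_t = (1.0 - k * t) ** 2 + (t * DATA_STD) ** 2
    slope = (t * DATA_STD ** 2 - k * (1.0 - k * t)) / var_t
    intercept = DATA_MEAN - slope * t * DATA_MEAN
    return slope, intercept


rng = random.Random(0)
t = 0.5
xs, ys = [], []
for _ in range(20000):
    x0 = rng.gauss(0.0, 1.0)                # noise sample
    x1 = rng.gauss(DATA_MEAN, DATA_STD)     # data sample, drawn independently
    x_t, target = ot_cfm_pair(x0, x1, t)
    xs.append(x_t)
    ys.append(target)

fitted = fit_line(xs, ys)
exact = marginal_field_coeffs(t)
print("fitted:", fitted, "closed form:", exact)
```

The individual targets are noisy, yet the regression converges to the posterior average of the conditional fields, which is the marginal field that actually transports the source to the data.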
Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio later generalized the OT idea in "Improving and generalizing flow-based generative models with minibatch optimal transport" (arXiv:2302.00482, February 2023), introducing a broader CFM family and an OT-CFM variant that couples noise to data samples via a minibatch optimal-transport plan instead of independent sampling.[3] Their formulation lifts the assumption that the source distribution must be Gaussian, requires no density evaluation, and approximates dynamic optimal transport when the true OT plan is available. Their experiments reported simpler flows, more stable training, and faster inference relative to the original FM and to diffusion baselines on a range of conditional and unconditional generation tasks.[3]
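The coupling step can be sketched on a toy case. The example below assumes scalar samples, where the optimal pairing under squared-distance cost is obtained simply by sorting both batches; this is a property of one dimension and not how the method is implemented in general. In higher dimensions an assignment or optimal-transport solver would be needed (for instance `scipy.optimize.linear_sum_assignment`), which is not shown here. Function names are hypothetical and this is not the torchcfm API.

```python
import random


def pairing_cost(x0, x1):
    """Total squared-distance cost of pairing x0[i] with x1[i]."""
    return sum((a - b) ** 2 for a, b in zip(x0, x1))


def minibatch_ot_pairs_1d(x0, x1):
    """Re-pair two equal-size 1D batches by rank.

    For scalar samples and squared-distance cost, matching the i-th smallest
    noise sample with the i-th smallest data sample is an optimal assignment.
    """
    return sorted(x0), sorted(x1)


rng = random.Random(0)
batch = 64
noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
data = [rng.gauss(2.0, 0.5) for _ in range(batch)]

independent_cost = pairing_cost(noise, data)           # pairing as sampled
ot_noise, ot_data = minibatch_ot_pairs_1d(noise, data)
ot_cost = pairing_cost(ot_noise, ot_data)               # pairing after OT
print(independent_cost, ot_cost)
```

The re-paired batches are then fed to the usual straight-line interpolation and displacement target; lower pairing cost means fewer crossing trajectories for the network to average over.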
Denoising diffusion probabilistic models train a network to predict noise (or, equivalently, a scaled score, the gradient of log p_t) from a sample noised according to a fixed Markov chain. The training objective is a weighted score-matching loss, and sampling integrates a reverse-time stochastic differential equation or its probability-flow ODE counterpart. Lipman et al. show that Flow Matching with variance-preserving Gaussian conditional paths yields an objective whose minimizer coincides with that of score matching on the same noise schedule, modulo a network reparameterization and an output rescaling.[1] The OT path, in contrast, was not derived from a diffusion process, although with a Gaussian source it is still a Gaussian path and can be read as a particular noise schedule. The framework therefore subsumes diffusion training as one corner of a larger probability-path design space that also admits non-Gaussian sources and paths.[1]
A December 2024 explainer by Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans at Google DeepMind, titled "Diffusion Meets Flow Matching: Two Sides of the Same Coin", noted that rectified-flow and diffusion-based parameterizations are theoretically interchangeable under Gaussian path assumptions, with the practical differences arising from noise schedule choices, loss weighting, and network conditioning rather than from fundamentally distinct algorithms.[4]
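The interchangeability can be made concrete with a short derivation. It assumes a Gaussian path written as $x_t = \alpha_t x_1 + \sigma_t \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$ and $\alpha_t > 0$; the symbols $\alpha_t$, $\sigma_t$, and $\varepsilon$ are notation chosen for this sketch, and dots denote time derivatives. Using $x_1 = (x_t - \sigma_t \varepsilon) / \alpha_t$ and the identity $\nabla_x \log p_t(x) = -\,\mathbb{E}[\varepsilon \mid x_t = x] / \sigma_t$:

$$
\begin{aligned}
u_t(x) &= \mathbb{E}\bigl[ \dot{\alpha}_t x_1 + \dot{\sigma}_t \varepsilon \,\big|\, x_t = x \bigr]
= \frac{\dot{\alpha}_t}{\alpha_t}\, x + \Bigl( \dot{\sigma}_t - \frac{\dot{\alpha}_t \sigma_t}{\alpha_t} \Bigr)\, \mathbb{E}[\varepsilon \mid x_t = x] \\
&= \frac{\dot{\alpha}_t}{\alpha_t}\, x + \sigma_t^2 \Bigl( \frac{\dot{\alpha}_t}{\alpha_t} - \frac{\dot{\sigma}_t}{\sigma_t} \Bigr) \nabla_x \log p_t(x).
\end{aligned}
$$

For the straight-line schedule $\alpha_t = t$, $\sigma_t = 1 - t$ this gives $u_t(x) = \bigl( x + (1 - t)\, \nabla_x \log p_t(x) \bigr) / t$, so a velocity network and a score (or noise) network carry the same information up to a known, time-dependent affine map.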
Michael S. Albergo, Nicholas M. Boffi, and Eric Vanden-Eijnden of NYU's Courant Institute developed a parallel framework called stochastic interpolants. The full paper "Stochastic Interpolants: A Unifying Framework for Flows and Diffusions" appeared as arXiv:2303.08797 in March 2023 and was published in JMLR in 2025.[5] Their construction defines an interpolant I_t (x_0, x_1) between any two endpoint samples, derives both an ODE (probability-flow) and an SDE (diffusion-style) that transport p_0 to p_1, and shows that the drift and score of the SDE are minimizers of simple quadratic regression objectives matching those of CFM.[5] The framework recovers Flow Matching, rectified flow, Schrödinger-bridge constructions, and many score-based models as special cases by varying the interpolant and the diffusion coefficient, and it permits arbitrary source and target distributions rather than fixing one to a Gaussian.[5]
Xingchao Liu, Chengyue Gong, and Qiang Liu of UT Austin proposed Rectified Flow in "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (arXiv:2209.03003, September 2022), one month before the Flow Matching paper.[6] Rectified flow trains a neural ODE to follow straight-line paths connecting samples from a source distribution pi_0 to a target distribution pi_1 by minimizing a nonlinear least-squares loss against the displacement vector $x_1 - x_0$. In the special case of an independent coupling ($x_0$ drawn from pi_0 independently of $x_1$ drawn from pi_1), the per-sample target is identical to that of Lipman et al.'s OT conditional path with $\sigma_{\min} = 0$, so the basic training objectives coincide.[6][1]
Rectified flow's distinct contribution is the reflow procedure, in which the trained model is rolled out to produce a deterministic coupling between noise and data, and a new rectified flow is trained on that coupling. Liu et al. prove that each such rectification step does not increase any convex transport cost and that repeating it tends to produce increasingly straight trajectories, ultimately enabling generation with very few solver steps (often a single Euler step).[6] The straight-path training objective was later adopted by Stable Diffusion 3 and FLUX.1.[7][8]
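The effect of reflow on few-step sampling can be seen in a toy case that has closed-form answers. The sketch below assumes a one-dimensional Gaussian data distribution, and both "trained" models are replaced by their analytic velocity fields for that case, so it illustrates the idea and is not an implementation of the training procedure. All names and constants are hypothetical.

```python
MEAN, STD = 2.0, 0.5  # hypothetical 1D Gaussian data distribution


def first_flow_velocity(x, t):
    """Closed-form marginal velocity for independently paired noise and data.

    Stands in for a trained rectified flow; valid only for this Gaussian toy.
    """
    var_t = (1.0 - t) ** 2 + (t * STD) ** 2
    slope = (t * STD ** 2 - (1.0 - t)) / var_t
    return MEAN + slope * (x - t * MEAN)


def euler(velocity, x0, n_steps):
    """Fixed-step Euler integration of dx/dt = velocity(x, t) over [0, 1]."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x += velocity(x, k * dt) * dt
    return x


def reflow_velocity(x, t):
    """Velocity fitted to the straight lines joining the coupled pairs.

    The rollout coupling of the toy flow is x1 = MEAN + STD * x0, so the
    straight path is x_t = x0 + t * (x1 - x0); invert it to recover x0.
    """
    x0 = (x - t * MEAN) / (1.0 + t * (STD - 1.0))
    return MEAN + (STD - 1.0) * x0


x0 = 1.0
coupled_x1 = euler(first_flow_velocity, x0, n_steps=1000)  # accurate rollout: the reflow pair (x0, x1)
one_step_before = euler(first_flow_velocity, x0, n_steps=1)
one_step_after = euler(reflow_velocity, x0, n_steps=1)
print(coupled_x1, one_step_before, one_step_after)
```

In this toy case a single Euler step of the first flow sends every starting point to the data mean, because the velocity at t = 0 is just the average displacement. After reflow the velocity is constant along each trajectory, so one step lands on the same endpoint as the full rollout.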
The original Flow Matching framework assumes continuous data spaces and Gaussian-style probability paths, which do not extend naturally to discrete domains such as text or graphs. Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman introduced Discrete Flow Matching in arXiv:2407.15595, July 2024.[10] The paper provides a general framework for probability paths interpolating between source and target distributions over a finite vocabulary, with sampling formulas based on probability-denoiser and noise-prediction parameterizations and a flexible scheduler design.[10] At 1.7B parameters, the authors report 6.7% Pass@1 and 13.4% Pass@10 on HumanEval, narrowing the gap between non-autoregressive flow models and autoregressive language models.[10]
By 2024, the dominant family of large open and proprietary generative models for non-text modalities had moved from DDPM-style score matching to flow matching variants, most often with rectified-flow or OT paths.
Stability AI's Stable Diffusion 3 was introduced in the paper "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach (arXiv:2403.03206, March 2024).[7] The training objective is rectified-flow conditional flow matching with straight-line interpolation between data and Gaussian noise, identical in form to the OT conditional path. The paper's central contributions are an improved noise sampling distribution (a logit-normal on t that concentrates training timesteps near 0.5, where the prediction is hardest) and a new Multimodal Diffusion Transformer (MMDiT) backbone in which text and image tokens are processed with separate weight matrices but interact through joint attention.[7] The authors demonstrate predictable scaling and improved human-preference scores relative to diffusion baselines.[7] The model was first announced in preview on February 22, 2024, the research paper followed on March 5, 2024, and the model was released in stages thereafter.[7][12] Stable Diffusion 3.5 extended the recipe in October 2024.
The SD3 paper directly compared several flow-matching and diffusion formulations under a common training budget, evaluating standard rectified flow, EDM-style diffusion, cosine-schedule diffusion, and several intermediate variants. Esser et al. report that rectified flow combined with their logit-normal timestep weighting outperformed every other formulation tested, both in validation loss and in human preference evaluation, and that the advantage grew as model size and dataset size increased.[7] The paper also offers an explanation for the logit-normal choice: variance in the loss is heavily concentrated in the middle of the t schedule, where the velocity-field prediction is hardest because both the noise and the data signal contribute substantially to the input. Concentrating the training budget on that regime reduces gradient variance per update.[7] The SD3 release packaged this objective with a deep VAE-based image latent space, T5-XXL plus CLIP text encoders, and the MMDiT joint-attention backbone, demonstrating that flow matching scales smoothly along the same axes (parameters, tokens, compute) that govern transformer language models.
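Logit-normal timestep sampling is simple to sketch: draw a normal variate and pass it through a sigmoid. The location and scale defaults below (0 and 1) and the function name are assumptions for illustration and should not be read as the settings of any released model.

```python
import math
import random


def sample_logit_normal(rng, loc=0.0, scale=1.0):
    """Draw t in (0, 1) as sigmoid(z) with z ~ Normal(loc, scale)."""
    z = rng.gauss(loc, scale)
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
ts = [sample_logit_normal(rng) for _ in range(10000)]

# Share of training timesteps that fall in the middle half of the schedule.
middle_fraction = sum(0.25 <= t <= 0.75 for t in ts) / len(ts)
print(middle_fraction)
```

Under uniform sampling half of the timesteps would fall in [0.25, 0.75]; with these defaults the expected share is about 73 percent, which is the sense in which the training budget is concentrated on the middle of the schedule. Shifting `loc` moves the emphasis toward one endpoint, and `scale` controls how sharply it is concentrated.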
FLUX.1 was released by Black Forest Labs on August 1, 2024, the same day the company announced its launch.[8] The company's founders include Robin Rombach, Andreas Blattmann, and Dominik Lorenz, members of the team behind the original Stable Diffusion at LMU Munich, who started it after their departure from Stability AI. FLUX.1 is a 12-billion-parameter rectified flow transformer that builds on the SD3 MMDiT and adds parallel attention layers, rotary positional embeddings, and a novel timestep sampling scheme adapted to few-step sampling.[8] The announcement explicitly states the team improved over previous diffusion baselines by "building on flow matching, a general and conceptually simple method for training generative models, which includes diffusion as a special case."[8] The release included FLUX.1 [pro] (API-only), FLUX.1 [dev] (open weights, non-commercial), and FLUX.1 [schnell] (Apache 2.0, distilled to one-to-four steps).[8] The dev variant is guidance-distilled from the pro model, while the schnell variant is a timestep-distilled model (its model card describes training with latent adversarial diffusion distillation) that can generate competitive samples in as few as one to four steps. Few-step sampling of this kind is eased by the straight paths that rectified-flow training encourages: a perfectly straight flow can be integrated in a single step because the velocity along each trajectory is constant. The same approach has been continued in subsequent releases, including FLUX1.1 [pro] and FLUX.2 variants for higher resolution and improved typography.
Meta's Voicebox was the first large-scale speech foundation model trained with flow matching. The paper "Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale" by Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, and Wei-Ning Hsu (arXiv:2306.15687, June 2023) introduced a non-autoregressive flow-matching model trained to infill speech given audio context and text.[9] Voicebox was trained on more than 50,000 hours of speech across multiple languages and reported a word-error rate of 1.9% on English zero-shot text-to-speech, compared with 5.9% for VALL-E, while running up to twenty times faster.[9] The model demonstrated zero-shot generalization to text-to-speech, content editing, noise removal, style transfer, and cross-lingual conversion, with the flow-matching infilling objective playing the central role.[9]
Audiobox, announced on November 30, 2023 and detailed in arXiv:2312.15821 (December 2023), extends the Voicebox architecture and flow-matching training objective from speech to a unified audio model spanning speech, sound effects, music, and environmental audio.[11] Audiobox accepts natural-language prompts and audio examples and supports description-based and example-based conditioning. The paper integrates Bespoke Solvers to accelerate inference, reporting more than 25-fold speedups over default ODE solvers without quality loss, and achieves zero-shot TTS similarity of 0.745 on LibriSpeech and FAD of 0.77 on AudioCaps for text-to-sound.[11]
Stability AI's Stable Audio model family generates music and sound with closely related latent generative models. Stable Audio Open (arXiv:2407.14358, July 2024) by Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, and Jordi Pons is an open-weights text-to-audio model built on a Diffusion Transformer (DiT) backbone operating on a continuous audio latent, with T5 text conditioning, generating up to 47 seconds of stereo audio at 44.1 kHz from approximately 1.21 billion parameters.[13] The paper describes its training objective as latent diffusion with a v-prediction target rather than flow matching, so it is best read as a neighboring design point among Gaussian-path objectives. The model was trained on 7,300 hours of Creative-Commons-licensed audio.[13]
Meta's Movie Gen, described in "Movie Gen: A Cast of Media Foundation Models" (arXiv:2410.13720, October 2024), trains a 30-billion-parameter transformer for text-to-video generation, alongside a separate 13-billion-parameter audio model, using a flow matching objective.[15] The video transformer follows the Transformer block design of Llama 3 with modifications including a cross-attention module inserted between self-attention and the feed-forward network, and operates on a maximum context length of 73,000 video tokens, producing up to 16 seconds at 16 fps at 1080p HD.[15] The team explicitly chose flow matching over diffusion training, noting that flow matching naturally ensures zero terminal signal-to-noise ratio, exhibits greater robustness to the exact choice of noise schedule, and outperformed diffusion losses in their internal comparisons.[15]
Many other late-2024 and 2025 systems followed the same trajectory. Discrete Flow Matching has been adopted for text and graph generation. Open-source implementations are available in Meta's facebookresearch/flow_matching PyTorch library, accompanying the December 2024 reference document "Flow Matching Guide and Code" by Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat (arXiv:2412.06264).[16] The library covers both continuous and discrete flow matching with practical examples for image and text modalities and is positioned as a reference implementation maintained by the original authors. The Tong et al. CFM/OT-CFM authors maintain the torchcfm library, which is widely used in research code.[3]
Beyond images, video, and audio, flow matching has been applied to scientific generative modeling, including protein structure design, molecular conformer generation, and graph generation. The framework's flexibility about source distributions, target distributions, and probability paths is particularly valuable in these settings, where the relevant invariances (rotation, permutation, periodicity) differ from those in image data.
Flow matching and DDPM-style training share a simulation-free regression form, but they differ along several practically important axes.
| Property | DDPM / Score-based diffusion | Flow Matching (OT-CFM, Rectified Flow) |
|---|---|---|
| Probability path | Variance-preserving or variance-exploding SDE | Arbitrary; Gaussian, OT straight-line, or others |
| Training target | Noise prediction or score | Velocity field (displacement) |
| Sampling | Reverse SDE or probability-flow ODE | Forward ODE |
| Curvature of paths | Curved | Straight (with OT/rectified) |
| Few-step sampling | Requires distillation | Native support after reflow |
| Subsumes the other? | No | Yes: diffusion paths are a special case |
| SNR at the pure-noise endpoint (t=0 in this article's convention) | Nonzero in some formulations | Zero by construction in OT path |
Lipman et al.'s ImageNet experiments reported that OT-path Flow Matching achieved better likelihood and FID than VP-path diffusion at the same compute budget, and produced samples with substantially fewer ODE solver evaluations.[1] Esser et al. (SD3) report that rectified flow combined with their logit-normal timestep weighting outperforms eight diffusion and flow-matching baselines they evaluated on text-to-image benchmarks.[7] Meta's Movie Gen team likewise reports that flow matching is "more robust to the exact choice of noise schedules and outperforms diffusion losses" in their ablations.[15] At inference, the most striking advantage is the small number of ODE solver steps required when paths are straight: rectified flow after a reflow pass and FLUX.1 [schnell] can produce competitive samples in one to four solver steps, compared with twenty to fifty for typical diffusion samplers.[6][8]
The practical training stability of flow matching, particularly with OT paths, also benefits from the constant per-sample velocity. The regression target u_t (x | x_1) does not depend on t under the OT path, so the loss does not blow up at the boundaries t=0 and t=1, in contrast to noise-prediction parameterizations whose effective signal-to-noise ratio diverges near the data boundary. This eliminates the need for ad hoc clamping or schedule tuning at the endpoints.
A second practical advantage is the simplicity of the conditioning interface. Because the velocity field is parameterized directly, classifier-free guidance, image-to-image conditioning, and inpainting all reduce to standard conditional regression problems, without the schedule-dependent scaling adjustments that some diffusion parameterizations require. The flow-matching sampler is a forward ODE solve, typically integrated with off-the-shelf methods such as Euler, Heun, or Dormand-Prince (RK45). After rectified-flow training or reflow, the trajectories are straight enough that a fixed-step Euler solver with very few steps suffices, which simplifies deployment compared with samplers that need many steps or adaptive step-size control.
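The sampler itself is a few lines of code. The sketch below compares fixed-step Euler and Heun solvers on a velocity field whose exact endpoint is known. The field is the closed-form marginal velocity for a one-dimensional Gaussian data distribution with independently paired noise and data, used here as a stand-in for a trained network; the constants and function names are assumptions for illustration.

```python
MEAN, STD = 2.0, 0.5  # hypothetical 1D Gaussian data distribution


def velocity(x, t):
    """Closed-form marginal velocity for this Gaussian toy (not a trained model)."""
    var_t = (1.0 - t) ** 2 + (t * STD) ** 2
    slope = (t * STD ** 2 - (1.0 - t)) / var_t
    return MEAN + slope * (x - t * MEAN)


def euler_solve(v, x0, n_steps):
    """First-order solver: one velocity evaluation per step."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x += v(x, k * dt) * dt
    return x


def heun_solve(v, x0, n_steps):
    """Second-order solver: averages the slope at both ends of each step."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        k1 = v(x, t)
        k2 = v(x + dt * k1, t + dt)
        x += 0.5 * dt * (k1 + k2)
    return x


x0 = 1.0
exact = MEAN + STD * x0  # exact endpoint of this flow for the given x0
euler_err = abs(euler_solve(velocity, x0, 10) - exact)
heun_err = abs(heun_solve(velocity, x0, 10) - exact)
print(euler_err, heun_err)
```

Heun spends two velocity evaluations per step, so a fair comparison is at equal evaluation counts rather than equal step counts. On curved trajectories like this one the higher-order method pays off; on nearly straight trajectories the gap narrows, which is why plain Euler is often sufficient after reflow.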
Despite these advantages, flow matching is not strictly dominant over diffusion in all metrics or settings. For very low-resolution data, well-tuned variance-preserving diffusion can still match or exceed OT-CFM in likelihood. The "Diffusion Meets Flow Matching" exposition emphasizes that once schedules, weightings, and network parameterizations are matched, the practical gap is often smaller than headline numbers suggest.[4] The decisive practical wins have come at large scale, where the simpler boundary behavior and straighter sampling trajectories of flow matching translate into measurable savings in compute and serving cost.
Although Flow Matching has displaced diffusion in many production settings, several limitations and open questions remain. The framework as originally formulated requires a continuous data manifold and is not directly applicable to discrete domains; Discrete Flow Matching and continuous-time Markov-chain generalizations address this but introduce new design choices.[10] Sampling still requires solving an ODE, and although straight-path training reduces the number of steps, exact one-step sampling generally requires reflow or distillation. The choice of conditional probability path and timestep weighting remains a nontrivial design decision, and recent work has explored adaptive sampling schedules, curriculum-style training, and consistency-style distillation tailored to flow models.[7][8] Stochastic Interpolants and related unifying frameworks have demonstrated that many of the modeling decisions in FM and DDPM are equivalent up to reparameterization, suggesting that the apparent gap between approaches narrows once schedules, loss weightings, and architectures are matched fairly.[4][5]
Memory requirements during training remain comparable to diffusion, since both store the same activations for the same backbone. Conditional generation with flow matching uses classifier-free guidance largely unchanged from diffusion. Likelihood evaluation for CFM models is possible by integrating the instantaneous change-of-variables formula, but it requires Jacobian-trace estimation that scales poorly to very high-dimensional latent spaces and is rarely used in practice.
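One common choice for the Jacobian-trace estimation mentioned above is a Hutchinson-style stochastic estimator, which needs only Jacobian-vector products. The sketch below is an assumption about how this is typically done rather than something prescribed by the flow matching papers; it uses a small hand-written matrix in place of a network Jacobian, and all names are hypothetical.

```python
import random


def hutchinson_trace(matvec, dim, n_probes, rng):
    """Estimate tr(J) as the average of eps^T (J eps) over Rademacher probes."""
    total = 0.0
    for _ in range(n_probes):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        j_eps = matvec(eps)
        total += sum(e * je for e, je in zip(eps, j_eps))
    return total / n_probes


# Hypothetical 3x3 Jacobian of a velocity field at one point; in practice the
# product J @ eps would come from automatic differentiation, not a stored matrix.
J = [
    [0.5, 0.1, 0.0],
    [0.2, -0.3, 0.1],
    [0.0, 0.4, 0.7],
]


def jacobian_vector_product(v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in J]


rng = random.Random(0)
estimate = hutchinson_trace(jacobian_vector_product, dim=3, n_probes=5000, rng=rng)
exact = sum(J[i][i] for i in range(3))
print(estimate, exact)
```

The estimator is unbiased, but its variance grows with the off-diagonal mass of the Jacobian, and it must be evaluated at every solver step, which is part of why likelihood evaluation is expensive in high-dimensional latent spaces.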
Another active area is the interplay between rectified-flow distillation and consistency-style few-step generation. Reflow-based distillation typically requires generating a large dataset of (noise, sample) pairs from a trained model, then training a new model on that coupling. This is expensive and accumulates approximation errors over multiple reflow rounds. Alternative distillation methods such as score distillation of flow matching models, progressive distillation adapted to ODE samplers, and consistency-like trajectory regularization are being developed to reduce the cost and improve the few-step quality of flow matching models. None has yet become a clear winner in production deployment.
Finally, the theoretical understanding of why straight-path training stabilizes optimization at large scale is incomplete. Empirical reports from the SD3 and Movie Gen papers and the FLUX.1 announcement consistently describe flow matching as less sensitive to noise-schedule design than diffusion, but a quantitative theory of this advantage, including its dependence on data dimension and noise distribution, is still being developed.[7][8][15]