The Frechet Inception Distance (FID) is a metric used to evaluate the quality of images produced by generative models, such as generative adversarial networks (GANs) and diffusion models. Introduced by Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter in their 2017 paper "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium," FID measures the distance between the distribution of real images and the distribution of generated images in a learned feature space. Since its introduction, FID has become the most widely used quantitative benchmark for assessing image generation quality, and it remains the de facto standard in the field as of 2025.
Unlike pixel-level comparison methods, FID operates on high-level feature representations extracted by the Inception-v3 network, a convolutional neural network pretrained on ImageNet. By comparing statistical summaries of these feature distributions rather than individual images, FID captures both the fidelity (quality of individual samples) and the diversity (coverage of the real data distribution) of a generative model's output.
Evaluating generative models has always been a challenging problem. Before FID, the most commonly used automated metric was the Inception Score (IS), proposed by Tim Salimans and colleagues in 2016. The Inception Score evaluates generated images by passing them through an Inception network and measuring two properties: whether the classifier assigns high confidence to each image (indicating quality) and whether the set of generated images spans many different classes (indicating diversity). The IS is computed as the exponential of the expected KL divergence between the conditional class distribution for each image and the marginal class distribution across all images.
However, the Inception Score has several notable shortcomings. It evaluates only the distribution of generated images without comparing them to real data. This means a model could achieve a high IS by producing sharp, recognizable images that look nothing like the target dataset. The IS also depends entirely on the 1,000 ImageNet classes, making it poorly suited for domains where generated content does not map neatly onto those categories. Furthermore, the IS does not detect mode dropping: a model that generates only a single high-quality image from each class would score well, even though it fails to capture the full variety within each class.
These limitations motivated Heusel et al. to develop a metric that compares generated images directly against a reference set of real images, using a continuous feature representation rather than discrete class labels.
The Frechet Inception Distance is based on the Frechet distance between two multivariate Gaussian distributions. The Frechet distance between probability distributions was formalized by Maurice Frechet in 1957. D.C. Dowson and B.V. Landau derived the closed-form expression for the Frechet distance between two multivariate normal distributions in their 1982 paper "The Frechet Distance between Multivariate Normal Distributions," published in the Journal of Multivariate Analysis. For Gaussian distributions, this distance coincides with the 2-Wasserstein distance (the Wasserstein distances are also known as Earth Mover's distances).
Given two multivariate Gaussian distributions fitted to feature vectors extracted from real and generated images, the FID is defined as:
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r * Sigma_g)^(1/2))
Where:
- mu_r and Sigma_r are the mean vector and covariance matrix of the Inception features extracted from the real images,
- mu_g and Sigma_g are the corresponding statistics for the generated images,
- Tr denotes the matrix trace, and (Sigma_r * Sigma_g)^(1/2) is the matrix square root of the product of the two covariance matrices.
The first term captures the difference in the central tendencies of the two distributions, while the second term captures differences in their spread and correlation structure. The formula reduces to zero when both distributions are identical, meaning the generated images have the same statistical properties as the real images in the Inception feature space.
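The formula can be computed directly from the two (mean, covariance) pairs. The following is a minimal numpy/scipy sketch (function and variable names are illustrative, not from any particular library), including the numerical safeguards that common implementations apply around the matrix square root:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Regularize near-singular covariances with a small diagonal offset.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # imaginary parts are numerical round-off
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Passing identical statistics for both arguments returns 0, matching the interpretation below.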
FID values are non-negative, with lower scores indicating greater similarity between the generated and real image distributions. A score of 0 indicates that the two distributions are identical (which would only happen if the generated images perfectly replicate the statistical properties of the real set). In practice, even comparing two disjoint subsets of the same real dataset will produce a small non-zero FID due to sampling variance.
The computation of FID follows a well-defined pipeline with several distinct stages.
Both real and generated images must be preprocessed consistently before being passed through the Inception network. Standard preprocessing involves resizing images to 299 x 299 pixels (the input resolution expected by Inception-v3) and normalizing pixel values. The choice of resizing algorithm and image format (PNG vs. JPEG) can measurably affect FID scores, as demonstrated by Parmar et al. (2022).
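As a small illustration of the normalization step, here is a sketch assuming the feature extractor expects inputs scaled to [-1, 1], as the original TensorFlow Inception-v3 does; the exact range is an assumption to verify against whichever extractor is used:

```python
import numpy as np

def normalize_uint8(images):
    """Scale uint8 pixel values from [0, 255] to [-1, 1].

    The [-1, 1] target range matches the original TensorFlow Inception-v3;
    other implementations may expect [0, 1] or per-channel statistics.
    """
    return images.astype(np.float32) / 127.5 - 1.0
```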
Each image is fed through the Inception-v3 network, pretrained on ImageNet, with the final classification (softmax) layer removed. The activations from the last average pooling layer are extracted, yielding a 2,048-dimensional feature vector for each image. These vectors encode high-level semantic information about image content, including shapes, textures, objects, and scene structure.
The 2,048-dimensional feature vectors are collected for both the real image set and the generated image set. For each set, the sample mean vector (2,048 dimensions) and the sample covariance matrix (2,048 x 2,048) are computed. This step assumes that the feature vectors for each set can be reasonably approximated by a multivariate Gaussian distribution.
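This stage reduces each image set to a single (mean, covariance) pair; a minimal sketch:

```python
import numpy as np

def fit_gaussian(features):
    """Summarize an (N, D) activation matrix by its mean and covariance.

    For Inception-v3 pool features, D is 2,048 and the covariance
    is a 2,048 x 2,048 matrix.
    """
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma
```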
The Frechet distance formula is applied to the two pairs of (mean, covariance) statistics, producing a single scalar value. Computing the matrix square root of the product of the two covariance matrices is the most computationally expensive step and is typically performed using eigenvalue decomposition or the Schur decomposition.
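Because the product of two positive semi-definite matrices has real, non-negative eigenvalues, the trace of the matrix square root can also be computed from those eigenvalues without forming the square root explicitly; a sketch:

```python
import numpy as np

def trace_sqrt_product(sigma1, sigma2):
    """Tr((sigma1 @ sigma2)^(1/2)) via eigenvalues of the covariance product."""
    # Tiny imaginary or negative components are numerical noise and are clipped.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    return float(np.sqrt(np.clip(eigvals.real, 0.0, None)).sum())
```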
The resulting FID value is compared against known benchmarks or competing models. Lower values indicate better generation quality. Results are only meaningful when compared using the same reference dataset, the same number of samples, and identical preprocessing.
FID requires a sufficiently large number of images to produce reliable estimates. Estimating a 2,048 x 2,048 covariance matrix from too few samples leads to high-variance, unreliable scores. The standard practice in the research community is to use at least 10,000 images for both the real and generated sets, with 50,000 being the most common choice for benchmark comparisons (particularly on CIFAR-10 and ImageNet). Using fewer images can produce misleading results, as the estimated covariance may be poorly conditioned.
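The finite-sample inflation is easy to demonstrate: two sets drawn from the same distribution have a true FID of 0, yet the estimate is substantially positive at small N. A low-dimensional sketch (16 dimensions rather than 2,048, for speed):

```python
import numpy as np
from scipy import linalg

def gaussian_fid(x, y):
    """Frechet distance between Gaussians fitted to two sample sets."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

# Both sets come from the same 16-d standard normal, so the true FID is 0;
# the estimate shrinks toward 0 as the number of samples grows.
rng = np.random.default_rng(0)
d = 16
fid_small = gaussian_fid(rng.normal(size=(200, d)), rng.normal(size=(200, d)))
fid_large = gaussian_fid(rng.normal(size=(20000, d)), rng.normal(size=(20000, d)))
```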
The choice of reference (real) dataset significantly impacts FID values. A model's FID is always measured relative to a specific reference set, and scores computed against different reference sets are not comparable. Researchers typically use a fixed, standardized reference set for each benchmark (for example, the CIFAR-10 training set or 50,000 randomly selected ImageNet validation images).
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu demonstrated in their 2022 CVPR paper "On Aliased Resizing and Surprising Subtleties in GAN Evaluation" that seemingly minor differences in image preprocessing can cause large discrepancies in FID scores. They found that:
| Preprocessing Issue | Impact on FID |
|---|---|
| Different resizing libraries (OpenCV, PyTorch, TensorFlow vs. PIL bicubic) | FID difference of 6+ points on FFHQ |
| JPEG compression vs. PNG | Significant FID inflation despite perceptually identical images |
| Bilinear interpolation without anti-aliasing | Inconsistent scores across implementations |
| Different normalization schemes | Measurable score variations |
To address these issues, Parmar et al. released the Clean-FID library, which uses proper anti-aliased resizing following standard signal processing principles. Clean-FID has since become the recommended implementation for fair comparisons.
FID scores vary substantially across datasets and model architectures. The following table provides representative FID scores from notable models, illustrating how the metric has tracked progress in generative modeling over time.
| Model | Year | Dataset | Resolution | FID Score |
|---|---|---|---|---|
| DCGAN | 2015 | CIFAR-10 | 32 x 32 | ~37.0 |
| Progressive GAN | 2018 | CelebA-HQ | 1024 x 1024 | 8.04 |
| BigGAN-deep | 2019 | ImageNet | 256 x 256 | 6.95 |
| StyleGAN2 | 2020 | FFHQ | 1024 x 1024 | 2.84 |
| DDPM | 2020 | CIFAR-10 | 32 x 32 | 3.17 |
| ADM (Diffusion Models Beat GANs) | 2021 | ImageNet | 256 x 256 | 4.59 |
| DiT (Diffusion Transformer) | 2023 | ImageNet | 256 x 256 | 2.27 |
| StyleGAN-XL | 2022 | ImageNet | 256 x 256 | 2.30 |
| EDM2-S (distilled) | 2024 | ImageNet | 512 x 512 | 1.67 |
As a general guideline for interpreting FID scores:
| FID Range | Interpretation |
|---|---|
| 0 to 5 | Excellent generation quality; near state-of-the-art |
| 5 to 15 | Good generation quality; competitive results |
| 15 to 50 | Moderate quality; noticeable differences from real images |
| 50 to 100 | Poor quality; clearly distinguishable from real images |
| 100+ | Very poor quality; significant distribution mismatch |
These ranges are approximate and depend on the specific dataset and resolution. What counts as "excellent" continues to shift as new models push boundaries.
The Inception Score, introduced by Salimans et al. in 2016, was the predecessor to FID. While IS evaluates generated images in isolation (without a real reference set), FID compares generated images against a real distribution. The two metrics capture different aspects of generation quality: models that maximize IS tend to produce the sharpest individual images, while models that minimize FID tend to produce greater sample variety that better matches the real distribution. Although IS is still sometimes reported alongside FID, FID is generally considered the more reliable and informative of the two.
The Kernel Inception Distance, proposed by Binkowski et al. in 2018, uses the Maximum Mean Discrepancy (MMD) with a polynomial kernel instead of the Frechet distance. KID offers several advantages over FID:
- It admits an unbiased estimator, whereas FID is biased at finite sample sizes.
- It makes no Gaussian assumption about the feature distribution.
- It behaves more reliably when only a small number of samples is available, and a standard deviation can be reported across subsets.
However, KID has not displaced FID as the primary metric because it exhibits higher variance, meaning it requires averaging over multiple trials to produce stable estimates. KID is sometimes reported alongside FID as a complementary measure.
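KID is straightforward to sketch in numpy, using the unbiased MMD^2 estimator with the cubic polynomial kernel k(a, b) = (a . b / d + 1)^3 from Binkowski et al. (function names here are illustrative):

```python
import numpy as np

def polynomial_kernel(x, y):
    """Cubic polynomial kernel (a . b / d + 1)^3 over all pairs of rows."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(x, y):
    """Unbiased MMD^2 estimate between two (N, D) feature arrays."""
    m, n = len(x), len(y)
    k_xx = polynomial_kernel(x, x)
    k_yy = polynomial_kernel(y, y)
    k_xy = polynomial_kernel(x, y)
    # Drop diagonal (self-similarity) terms for the unbiased estimator.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())
```

Because the estimator is unbiased, small negative values are possible when the two distributions are very close.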
CMMD was introduced by Jayasumana et al. from Google Research in their 2024 CVPR paper "Rethinking FID: Towards a Better Evaluation Metric for Image Generation." CMMD replaces both core components of FID:
- Inception-v3 features are replaced with CLIP image embeddings, which are trained on far broader data and capture richer semantics than ImageNet classification features.
- The Frechet distance between fitted Gaussians is replaced with the Maximum Mean Discrepancy (MMD) using a Gaussian RBF kernel, which makes no assumption about the shape of the feature distribution.
Jayasumana et al. demonstrated that FID can contradict human judgment in important cases. In one experiment, human raters preferred one model in 92.5% of side-by-side comparisons, yet FID ranked the other model as superior. CMMD correctly agreed with human preference in this case. CMMD is also computationally faster (roughly 100x for the distance calculation step) and more sample-efficient.
Kynkaanniemi et al. (2019) proposed separate precision and recall metrics for generative models. Precision measures the fraction of generated samples that fall within the support of the real distribution (fidelity), while recall measures the fraction of the real distribution that is covered by the generated distribution (diversity). These metrics decompose the single-number summary provided by FID into two interpretable components, making it easier to diagnose whether a model suffers from low quality, low diversity, or both.
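A simplified sketch of k-NN precision and recall in the spirit of Kynkaanniemi et al. (the actual method operates on deep feature embeddings; the names and the brute-force distance computation here are illustrative):

```python
import numpy as np

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # column 0 is the point itself (distance 0)

def coverage_fraction(queries, support_points, radii):
    """Fraction of queries falling inside any support hypersphere."""
    d = np.linalg.norm(queries[:, None, :] - support_points[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, fake, k=3):
    """Precision: fakes inside the real manifold; recall: reals inside the fake manifold."""
    precision = coverage_fraction(fake, real, knn_radii(real, k))
    recall = coverage_fraction(real, fake, knn_radii(fake, k))
    return precision, recall
```

A mode-collapsed model would show high precision but low recall, a diagnosis a single FID number cannot make.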
Despite its widespread adoption, FID has attracted substantial criticism on several fronts.
FID assumes that the Inception feature vectors follow a multivariate Gaussian distribution. Jayasumana et al. (2024) applied three statistical normality tests (Mardia's skewness test, Mardia's kurtosis test, and the Henze-Zirkler test) to Inception embeddings of COCO 30K images. All three tests rejected the normality assumption with p-values of virtually zero. When the true distribution departs from Gaussianity, the closed-form Frechet distance can produce misleading results. In a synthetic experiment, Jayasumana et al. showed that FID remained at zero even as two mixture-of-Gaussians distributions diverged, while distribution-free metrics like MMD correctly detected the divergence.
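This failure mode is easy to reproduce in one dimension: a standard normal and a two-point distribution at +1/-1 have identical mean and variance, so the Frechet distance between the fitted Gaussians is (near) zero even though the distributions are entirely different:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gaussian = rng.normal(size=n)              # N(0, 1)
bimodal = rng.choice([-1.0, 1.0], size=n)  # two point masses: mean 0, variance 1

def fid_1d(x, y):
    """1-d Frechet distance between Gaussians fitted to each sample set."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

matched_moments_fid = fid_1d(gaussian, bimodal)  # near 0 despite very different shapes
```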
Chong and Forsyth demonstrated in their 2020 CVPR paper "Effectively Unbiased FID and Inception Score and Where to Find Them" that FID is a biased estimator. The bias depends on the specific model being evaluated, meaning model A may receive a lower (better) FID than model B simply because model A's bias term is smaller, not because model A produces better images. This bias cannot be corrected by evaluating at a fixed sample size, making direct comparisons between models less reliable than commonly assumed. Chong and Forsyth proposed extrapolation-based corrections (FID-infinity) to mitigate this issue.
As Parmar et al. (2022) documented, FID scores are sensitive to low-level choices that should be irrelevant to image quality:
- the resizing algorithm and library used (aliased vs. anti-aliased interpolation),
- lossy compression of the generated images (JPEG vs. PNG),
- quantization and pixel-normalization conventions.
These sensitivities mean that FID scores from different codebases are often not directly comparable, even when evaluating the same model on the same dataset.
FID relies entirely on features from Inception-v3, a network trained on approximately 1 million ImageNet images spanning 1,000 object categories. This creates several problems:
- The features are tuned to ImageNet's object-centric categories and may poorly represent domains such as faces, medical images, artwork, or rendered text.
- Visual attributes that matter to humans but do not help ImageNet classification may be largely invisible to the metric.
- The representation comes from a 2015-era network trained on a dataset that is small by the standards of modern generative models.
Multiple studies have found cases where FID rankings contradict human preferences. Jayasumana et al. (2024) showed that FID rated distorted images from VQGAN latent-space noise as improvements over less-distorted versions, contradicting what humans could plainly see. FID also failed to reflect the gradual quality improvements during iterative refinement in models like Stable Diffusion and Muse, sometimes suggesting that quality worsened as visual quality clearly improved.
Borji (2022) further documented that FID is insensitive to certain types of image degradation and can be manipulated by adversarial perturbations that are imperceptible to humans but shift FID scores significantly.
FID does not reliably detect mode collapse, a failure mode where a GAN produces only a limited set of outputs. Because FID summarizes the entire distribution with just a mean and covariance, a model that memorizes and reproduces a subset of training images can achieve a low FID while failing to capture the full diversity of the real distribution. Precision-recall metrics are better suited for diagnosing this problem.
The Frechet distance framework has been adapted to evaluate generative models in domains beyond static images.
The Frechet Video Distance, introduced by Unterthiner et al. in 2019, extends the FID concept to video generation. Instead of Inception-v3, FVD uses features from an I3D (Inflated 3D ConvNet) network that captures both spatial appearance and temporal dynamics. FVD measures whether generated videos look realistic on a per-frame basis and whether they exhibit coherent motion over time. Like FID, FVD has been criticized for its Gaussian assumption and sensitivity to implementation choices.
The Frechet Audio Distance, proposed by Kilgour et al. in 2019, applies the same framework to audio and music generation. FAD originally used features from the VGGish audio classification network, though more recent implementations have adopted CLAP, PANNs, and MERT embeddings for domain-adapted feature extraction. FAD serves as the primary automated metric for evaluating music generation and audio enhancement systems.
The Frechet ChemNet Distance extends the concept to molecular generation, using features from ChemNet (a network trained on chemical structures) to assess whether AI-generated molecules are chemically plausible and diverse. FCD is used in drug discovery and materials science applications.
Clean-FID, released by Parmar et al. in 2022, is not a new metric but rather a standardized implementation of FID that addresses preprocessing inconsistencies. It uses proper anti-aliased resizing and consistent normalization to ensure that FID scores are reproducible and comparable across different research groups.
FID-infinity, proposed by Chong and Forsyth in 2020, uses extrapolation to estimate what the FID score would be with an infinite number of samples, effectively removing the finite-sample bias from the standard FID calculation. While more principled than standard FID, it requires computing FID at multiple sample sizes and fitting an extrapolation curve.
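A toy sketch of the extrapolation idea, using a 1-d Frechet distance and two sample sets drawn from the same distribution (so the true, infinite-sample FID is 0):

```python
import numpy as np

def fid_1d(x, y):
    """1-d Frechet distance between Gaussians fitted to each sample set."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

rng = np.random.default_rng(0)
real = rng.normal(size=200_000)
fake = rng.normal(size=200_000)  # same distribution: true FID is 0

# Evaluate FID at several sample sizes, then extrapolate linearly in 1/N;
# the intercept at 1/N -> 0 estimates the infinite-sample FID.
sizes = np.array([1_000, 2_000, 5_000, 10_000, 50_000])
fids = np.array([fid_1d(real[:n], fake[:n]) for n in sizes])
slope, fid_infinity = np.polyfit(1.0 / sizes, fids, 1)
```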
Several open-source libraries provide FID computation:
| Library | Framework | Notes |
|---|---|---|
| pytorch-fid | PyTorch | Widely used; uses bilinear resizing (potentially aliased) |
| clean-fid | PyTorch | Recommended; proper anti-aliased resizing |
| TorchMetrics FID | PyTorch | Part of the PyTorch Lightning ecosystem |
| TF-GAN (tfgan.eval) | TensorFlow | Uses TensorFlow's Inception weights |
| torch-fidelity | PyTorch | Supports FID, IS, and KID in a single package |
Researchers should be aware that different libraries can produce different FID scores for the same model due to the preprocessing sensitivities described above. When reporting results, specifying the exact library, version, and configuration used is important for reproducibility.
Based on the accumulated research on FID's strengths and limitations, the following practices are recommended: