The Frechet Inception Distance (FID) is a metric used to evaluate the quality of images produced by generative models, such as generative adversarial networks (GANs) and diffusion models. Introduced by Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter in their 2017 paper "GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium," FID measures the distance between the distribution of real images and the distribution of generated images in a learned feature space. Since its introduction, FID has become the most widely used quantitative benchmark for assessing image generation quality, and it remains the de facto standard in the field as of 2025.
Unlike pixel-level comparison methods, FID operates on high-level feature representations extracted by the Inception-v3 network, a convolutional neural network pretrained on ImageNet. By comparing statistical summaries of these feature distributions rather than individual images, FID captures both the fidelity (quality of individual samples) and the diversity (coverage of the real data distribution) of a generative model's output.
Evaluating generative models has always been a challenging problem. Before FID, the most commonly used automated metric was the Inception Score (IS), proposed by Tim Salimans and colleagues in 2016. The Inception Score evaluates generated images by passing them through an Inception network and measuring two properties: whether the classifier assigns high confidence to each image (indicating quality) and whether the set of generated images spans many different classes (indicating diversity). The IS is computed as the exponential of the expected KL divergence between the conditional class distribution for each image and the marginal class distribution across all images.
However, the Inception Score has several notable shortcomings. It evaluates only the distribution of generated images without comparing them to real data. This means a model could achieve a high IS by producing sharp, recognizable images that look nothing like the target dataset. The IS also depends entirely on the 1,000 ImageNet classes, making it poorly suited for domains where generated content does not map neatly onto those categories. Furthermore, the IS does not detect mode dropping: a model that generates only a single high-quality image from each class would score well, even though it fails to capture the full variety within each class.
These limitations motivated Heusel et al. to develop a metric that compares generated images directly against a reference set of real images, using a continuous feature representation rather than discrete class labels.
The Frechet Inception Distance is based on the Frechet distance between two multivariate Gaussian distributions. The Frechet distance between probability distributions was formalized by Maurice Frechet in 1957. D.C. Dowson and B.V. Landau derived the closed-form expression for the Frechet distance between two multivariate normal distributions in their 1982 paper "The Frechet Distance between Multivariate Normal Distributions," published in the Journal of Multivariate Analysis. For Gaussian distributions, this distance coincides with the 2-Wasserstein distance (the Wasserstein distances are also known as Earth Mover's distances).
Given two multivariate Gaussian distributions fitted to feature vectors extracted from real and generated images, the FID is defined as:
FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 * (Sigma_r * Sigma_g)^(1/2))
Where:
- mu_r and Sigma_r are the mean vector and covariance matrix of the Inception features extracted from the real images,
- mu_g and Sigma_g are the corresponding statistics for the generated images,
- Tr denotes the matrix trace, and (Sigma_r * Sigma_g)^(1/2) is the matrix square root of the product of the two covariance matrices.
The first term captures the difference in the central tendencies of the two distributions, while the second term captures differences in their spread and correlation structure. The formula reduces to zero when both distributions are identical, meaning the generated images have the same statistical properties as the real images in the Inception feature space.
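The formula can be computed directly from the two (mean, covariance) pairs. The following is a minimal numpy/scipy sketch (function and variable names are illustrative, not from any particular library), including the numerical safeguards that common implementations apply around the matrix square root:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if not np.isfinite(covmean).all():
        # Regularize near-singular covariances with a small diagonal offset.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean, _ = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # imaginary parts are numerical round-off
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Passing identical statistics for both arguments returns 0, matching the interpretation below.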
FID values are non-negative, with lower scores indicating greater similarity between the generated and real image distributions. A score of 0 indicates that the two distributions are identical (which would only happen if the generated images perfectly replicate the statistical properties of the real set). In practice, even comparing two disjoint subsets of the same real dataset will produce a small non-zero FID due to sampling variance.
The computation of FID follows a well-defined pipeline with several distinct stages.
Both real and generated images must be preprocessed consistently before being passed through the Inception network. Standard preprocessing involves resizing images to 299 x 299 pixels (the input resolution expected by Inception-v3) and normalizing pixel values. The choice of resizing algorithm and image format (PNG vs. JPEG) can measurably affect FID scores, as demonstrated by Parmar et al. (2022).
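As a small illustration of the normalization step, here is a sketch assuming the feature extractor expects inputs scaled to [-1, 1], as the original TensorFlow Inception-v3 does; the exact range is an assumption to verify against whichever extractor is used:

```python
import numpy as np

def normalize_uint8(images):
    """Scale uint8 pixel values from [0, 255] to [-1, 1].

    The [-1, 1] target range matches the original TensorFlow Inception-v3;
    other implementations may expect [0, 1] or per-channel statistics.
    """
    return images.astype(np.float32) / 127.5 - 1.0
```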
Each image is fed through the Inception-v3 network, pretrained on ImageNet, with the final classification (softmax) layer removed. The activations from the last average pooling layer are extracted, yielding a 2,048-dimensional feature vector for each image. These vectors encode high-level semantic information about image content, including shapes, textures, objects, and scene structure.
The 2,048-dimensional feature vectors are collected for both the real image set and the generated image set. For each set, the sample mean vector (2,048 dimensions) and the sample covariance matrix (2,048 x 2,048) are computed. This step assumes that the feature vectors for each set can be reasonably approximated by a multivariate Gaussian distribution.
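This stage reduces each image set to a single (mean, covariance) pair; a minimal sketch:

```python
import numpy as np

def fit_gaussian(features):
    """Summarize an (N, D) activation matrix by its mean and covariance.

    For Inception-v3 pool features, D is 2,048 and the covariance
    is a 2,048 x 2,048 matrix.
    """
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma
```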
The Frechet distance formula is applied to the two pairs of (mean, covariance) statistics, producing a single scalar value. Computing the matrix square root of the product of the two covariance matrices is the most computationally expensive step and is typically performed using eigenvalue decomposition or the Schur decomposition.
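Because the product of two positive semi-definite matrices has real, non-negative eigenvalues, the trace of the matrix square root can also be computed from those eigenvalues without forming the square root explicitly; a sketch:

```python
import numpy as np

def trace_sqrt_product(sigma1, sigma2):
    """Tr((sigma1 @ sigma2)^(1/2)) via eigenvalues of the covariance product."""
    # Tiny imaginary or negative components are numerical noise and are clipped.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    return float(np.sqrt(np.clip(eigvals.real, 0.0, None)).sum())
```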
The resulting FID value is compared against known benchmarks or competing models. Lower values indicate better generation quality. Results are only meaningful when compared using the same reference dataset, the same number of samples, and identical preprocessing.
FID requires a sufficiently large number of images to produce reliable estimates. Estimating a 2,048 x 2,048 covariance matrix from too few samples leads to high-variance, unreliable scores. The standard practice in the research community is to use at least 10,000 images for both the real and generated sets, with 50,000 being the most common choice for benchmark comparisons (particularly on CIFAR-10 and ImageNet). Using fewer images can produce misleading results, as the estimated covariance may be poorly conditioned.
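The finite-sample inflation is easy to demonstrate: two sets drawn from the same distribution have a true FID of 0, yet the estimate is substantially positive at small N. A low-dimensional sketch (16 dimensions rather than 2,048, for speed):

```python
import numpy as np
from scipy import linalg

def gaussian_fid(x, y):
    """Frechet distance between Gaussians fitted to two sample sets."""
    mu1, mu2 = x.mean(axis=0), y.mean(axis=0)
    s1 = np.cov(x, rowvar=False)
    s2 = np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2).real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

# Both sets come from the same 16-d standard normal, so the true FID is 0;
# the estimate shrinks toward 0 as the number of samples grows.
rng = np.random.default_rng(0)
d = 16
fid_small = gaussian_fid(rng.normal(size=(200, d)), rng.normal(size=(200, d)))
fid_large = gaussian_fid(rng.normal(size=(20000, d)), rng.normal(size=(20000, d)))
```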
The choice of reference (real) dataset significantly impacts FID values. A model's FID is always measured relative to a specific reference set, and scores computed against different reference sets are not comparable. Researchers typically use a fixed, standardized reference set for each benchmark (for example, the CIFAR-10 training set or 50,000 randomly selected ImageNet validation images).
Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu demonstrated in their 2022 CVPR paper "On Aliased Resizing and Surprising Subtleties in GAN Evaluation" that seemingly minor differences in image preprocessing can cause large discrepancies in FID scores. They found that:
| Preprocessing Issue | Impact on FID |
|---|---|
| Different resizing libraries (OpenCV, PyTorch, TensorFlow vs. PIL bicubic) | FID difference of 6+ points on FFHQ |
| JPEG compression vs. PNG | Significant FID inflation despite perceptually identical images |
| Bilinear interpolation without anti-aliasing | Inconsistent scores across implementations |
| Different normalization schemes | Measurable score variations |
To address these issues, Parmar et al. released the Clean-FID library, which uses proper anti-aliased resizing following standard signal processing principles. Clean-FID has since become the recommended implementation for fair comparisons.
FID scores vary substantially across datasets and model architectures. The following table provides representative FID scores from notable models, illustrating how the metric has tracked progress in generative modeling over time.
| Model | Year | Dataset | Resolution | FID Score |
|---|---|---|---|---|
| DCGAN | 2015 | CIFAR-10 | 32 x 32 | ~37.0 |
| Progressive GAN | 2018 | CelebA-HQ | 1024 x 1024 | 8.04 |
| BigGAN-deep | 2019 | ImageNet | 256 x 256 | 6.95 |
| StyleGAN2 | 2020 | FFHQ | 1024 x 1024 | 2.84 |
| DDPM | 2020 | CIFAR-10 | 32 x 32 | 3.17 |
| ADM (Diffusion Models Beat GANs) | 2021 | ImageNet | 256 x 256 | 4.59 |
| DiT (Diffusion Transformer) | 2023 | ImageNet | 256 x 256 | 2.27 |
| StyleGAN-XL | 2022 | ImageNet | 256 x 256 | 2.30 |
| EDM2-S (distilled) | 2024 | ImageNet | 512 x 512 | 1.67 |
As a general guideline for interpreting FID scores:
| FID Range | Interpretation |
|---|---|
| 0 to 5 | Excellent generation quality; near state-of-the-art |
| 5 to 15 | Good generation quality; competitive results |
| 15 to 50 | Moderate quality; noticeable differences from real images |
| 50 to 100 | Poor quality; clearly distinguishable from real images |
| 100+ | Very poor quality; significant distribution mismatch |
These ranges are approximate and depend on the specific dataset and resolution. What counts as "excellent" continues to shift as new models push boundaries.
The Inception Score, introduced by Salimans et al. in 2016, was the predecessor to FID. While IS evaluates generated images in isolation (without a real reference set), FID compares generated images against a real distribution. The two metrics capture different aspects of generation quality: models that maximize IS tend to produce the sharpest individual images, while models that minimize FID tend to produce greater sample variety that better matches the real distribution. Although IS is still sometimes reported alongside FID, FID is generally considered the more reliable and informative of the two.
The Kernel Inception Distance, proposed by Binkowski et al. in 2018, uses the Maximum Mean Discrepancy (MMD) with a polynomial kernel instead of the Frechet distance. KID offers several advantages over FID:
- It admits an unbiased estimator, whereas FID is biased at finite sample sizes.
- It makes no Gaussian assumption about the feature distribution.
- It behaves more reliably when only a small number of samples is available, and a standard deviation can be reported across subsets.
However, KID has not displaced FID as the primary metric because it exhibits higher variance, meaning it requires averaging over multiple trials to produce stable estimates. KID is sometimes reported alongside FID as a complementary measure.
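KID is straightforward to sketch in numpy, using the unbiased MMD^2 estimator with the cubic polynomial kernel k(a, b) = (a . b / d + 1)^3 from Binkowski et al. (function names here are illustrative):

```python
import numpy as np

def polynomial_kernel(x, y):
    """Cubic polynomial kernel (a . b / d + 1)^3 over all pairs of rows."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3

def kid(x, y):
    """Unbiased MMD^2 estimate between two (N, D) feature arrays."""
    m, n = len(x), len(y)
    k_xx = polynomial_kernel(x, x)
    k_yy = polynomial_kernel(y, y)
    k_xy = polynomial_kernel(x, y)
    # Drop diagonal (self-similarity) terms for the unbiased estimator.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())
```

Because the estimator is unbiased, small negative values are possible when the two distributions are very close.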
CMMD was introduced by Jayasumana et al. from Google Research in their 2024 CVPR paper "Rethinking FID: Towards a Better Evaluation Metric for Image Generation." CMMD replaces both core components of FID:
- Inception-v3 features are replaced with CLIP image embeddings, which are trained on far broader data and capture richer semantics than ImageNet classification features.
- The Frechet distance between fitted Gaussians is replaced with the Maximum Mean Discrepancy (MMD) using a Gaussian RBF kernel, which makes no assumption about the shape of the feature distribution.
Jayasumana et al. demonstrated that FID can contradict human judgment in important cases. In one experiment, human raters preferred one model in 92.5% of side-by-side comparisons, yet FID ranked the other model as superior. CMMD correctly agreed with human preference in this case. CMMD is also computationally faster (roughly 100x for the distance calculation step) and more sample-efficient.
Kynkaanniemi et al. (2019) proposed separate precision and recall metrics for generative models. Precision measures the fraction of generated samples that fall within the support of the real distribution (fidelity), while recall measures the fraction of the real distribution that is covered by the generated distribution (diversity). These metrics decompose the single-number summary provided by FID into two interpretable components, making it easier to diagnose whether a model suffers from low quality, low diversity, or both.
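A simplified sketch of k-NN precision and recall in the spirit of Kynkaanniemi et al. (the actual method operates on deep feature embeddings; the names and the brute-force distance computation here are illustrative):

```python
import numpy as np

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # column 0 is the point itself (distance 0)

def coverage_fraction(queries, support_points, radii):
    """Fraction of queries falling inside any support hypersphere."""
    d = np.linalg.norm(queries[:, None, :] - support_points[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real, fake, k=3):
    """Precision: fakes inside the real manifold; recall: reals inside the fake manifold."""
    precision = coverage_fraction(fake, real, knn_radii(real, k))
    recall = coverage_fraction(real, fake, knn_radii(fake, k))
    return precision, recall
```

A mode-collapsed model would show high precision but low recall, a diagnosis a single FID number cannot make.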
Despite its widespread adoption, FID has attracted substantial criticism on several fronts.
FID assumes that the Inception feature vectors follow a multivariate Gaussian distribution. Jayasumana et al. (2024) applied three statistical normality tests (Mardia's skewness test, Mardia's kurtosis test, and the Henze-Zirkler test) to Inception embeddings of COCO 30K images. All three tests rejected the normality assumption with p-values of virtually zero. When the true distribution departs from Gaussianity, the closed-form Frechet distance can produce misleading results. In a synthetic experiment, Jayasumana et al. showed that FID remained at zero even as two mixture-of-Gaussians distributions diverged, while distribution-free metrics like MMD correctly detected the divergence.
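This failure mode is easy to reproduce in one dimension: a standard normal and a two-point distribution at +1/-1 have identical mean and variance, so the Frechet distance between the fitted Gaussians is (near) zero even though the distributions are entirely different:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gaussian = rng.normal(size=n)              # N(0, 1)
bimodal = rng.choice([-1.0, 1.0], size=n)  # two point masses: mean 0, variance 1

def fid_1d(x, y):
    """1-d Frechet distance between Gaussians fitted to each sample set."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

matched_moments_fid = fid_1d(gaussian, bimodal)  # near 0 despite very different shapes
```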
Chong and Forsyth demonstrated in their 2020 CVPR paper "Effectively Unbiased FID and Inception Score and Where to Find Them" that FID is a biased estimator. The bias depends on the specific model being evaluated, meaning model A may receive a lower (better) FID than model B simply because model A's bias term is smaller, not because model A produces better images. This bias cannot be corrected by evaluating at a fixed sample size, making direct comparisons between models less reliable than commonly assumed. Chong and Forsyth proposed extrapolation-based corrections (FID-infinity) to mitigate this issue.
As Parmar et al. (2022) documented, FID scores are sensitive to low-level choices that should be irrelevant to image quality:
- the resizing algorithm and library used (aliased vs. anti-aliased interpolation),
- lossy compression of the generated images (JPEG vs. PNG),
- quantization and pixel-normalization conventions.
These sensitivities mean that FID scores from different codebases are often not directly comparable, even when evaluating the same model on the same dataset.
FID relies entirely on features from Inception-v3, a network trained on approximately 1 million ImageNet images spanning 1,000 object categories. This creates several problems:
- The features are tuned to ImageNet's object-centric categories and may poorly represent domains such as faces, medical images, artwork, or rendered text.
- Visual attributes that matter to humans but do not help ImageNet classification may be largely invisible to the metric.
- The representation comes from a 2015-era network trained on a dataset that is small by the standards of modern generative models.
Multiple studies have found cases where FID rankings contradict human preferences. Jayasumana et al. (2024) showed that FID rated distorted images from VQGAN latent-space noise as improvements over less-distorted versions, contradicting what humans could plainly see. FID also failed to reflect the gradual quality improvements during iterative refinement in models like Stable Diffusion and Muse, sometimes suggesting that quality worsened as visual quality clearly improved.
Borji (2022) further documented that FID is insensitive to certain types of image degradation and can be manipulated by adversarial perturbations that are imperceptible to humans but shift FID scores significantly.
FID does not reliably detect mode collapse, a failure mode where a GAN produces only a limited set of outputs. Because FID summarizes the entire distribution with just a mean and covariance, a model that memorizes and reproduces a subset of training images can achieve a low FID while failing to capture the full diversity of the real distribution. Precision-recall metrics are better suited for diagnosing this problem.
The Frechet distance framework has been adapted to evaluate generative models in domains beyond static images.
The Frechet Video Distance, introduced by Unterthiner et al. in 2019, extends the FID concept to video generation. Instead of Inception-v3, FVD uses features from an I3D (Inflated 3D ConvNet) network that captures both spatial appearance and temporal dynamics. FVD measures whether generated videos look realistic on a per-frame basis and whether they exhibit coherent motion over time. Like FID, FVD has been criticized for its Gaussian assumption and sensitivity to implementation choices.
The Frechet Audio Distance, proposed by Kilgour et al. in 2019, applies the same framework to audio and music generation. FAD originally used features from the VGGish audio classification network, though more recent implementations have adopted CLAP, PANNs, and MERT embeddings for domain-adapted feature extraction. FAD serves as the primary automated metric for evaluating music generation and audio enhancement systems.
The Frechet ChemNet Distance extends the concept to molecular generation, using features from ChemNet (a network trained on chemical structures) to assess whether AI-generated molecules are chemically plausible and diverse. FCD is used in drug discovery and materials science applications.
Clean-FID, released by Parmar et al. in 2022, is not a new metric but rather a standardized implementation of FID that addresses preprocessing inconsistencies. It uses proper anti-aliased resizing and consistent normalization to ensure that FID scores are reproducible and comparable across different research groups.
FID-infinity, proposed by Chong and Forsyth in 2020, uses extrapolation to estimate what the FID score would be with an infinite number of samples, effectively removing the finite-sample bias from the standard FID calculation. While more principled than standard FID, it requires computing FID at multiple sample sizes and fitting an extrapolation curve.
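A toy sketch of the extrapolation idea, using a 1-d Frechet distance and two sample sets drawn from the same distribution (so the true, infinite-sample FID is 0):

```python
import numpy as np

def fid_1d(x, y):
    """1-d Frechet distance between Gaussians fitted to each sample set."""
    return (x.mean() - y.mean()) ** 2 + (x.std() - y.std()) ** 2

rng = np.random.default_rng(0)
real = rng.normal(size=200_000)
fake = rng.normal(size=200_000)  # same distribution: true FID is 0

# Evaluate FID at several sample sizes, then extrapolate linearly in 1/N;
# the intercept at 1/N -> 0 estimates the infinite-sample FID.
sizes = np.array([1_000, 2_000, 5_000, 10_000, 50_000])
fids = np.array([fid_1d(real[:n], fake[:n]) for n in sizes])
slope, fid_infinity = np.polyfit(1.0 / sizes, fids, 1)
```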
Several open-source libraries provide FID computation:
| Library | Framework | Notes |
|---|---|---|
| pytorch-fid | PyTorch | Widely used; uses bilinear resizing (potentially aliased) |
| clean-fid | PyTorch | Recommended; proper anti-aliased resizing |
| TorchMetrics FID | PyTorch | Part of the PyTorch Lightning ecosystem |
| TF-GAN (tfgan.eval) | TensorFlow | Uses TensorFlow's Inception weights |
| torch-fidelity | PyTorch | Supports FID, IS, and KID in a single package |
Researchers should be aware that different libraries can produce different FID scores for the same model due to the preprocessing sensitivities described above. When reporting results, specifying the exact library, version, and configuration used is important for reproducibility.
Based on the accumulated research on FID's strengths and limitations, the following practices are recommended: