Generative Model
Last reviewed
Sources
16 citations
Review status
Source-backed
Revision
v6 · 4,224 words
See also: Machine learning terms, Discriminative model
A generative model is a statistical or machine learning model that learns the probability distribution P(X) of the observed data, or the joint distribution P(X, Y) of inputs and labels, so that it can generate new samples that resemble the training data. In contrast to discriminative models, which learn the conditional probability P(Y|X) to classify inputs, generative models capture how the data itself is produced.[3] This is the defining difference: a discriminative model learns the boundary between classes, while a generative model learns the full data-generating process and can therefore synthesize images, write text, compose music, design molecules, and augment datasets.
Generative models sit at the heart of the modern AI revolution. Systems such as GPT, DALL-E, and Stable Diffusion are all built on generative modeling principles. The field draws on decades of research in probability theory, Bayesian statistics, information theory, and deep learning, and it has expanded rapidly since 2014 with the introduction of generative adversarial networks (GANs) and variational autoencoders (VAEs).[1][2] The category reached mainstream awareness in late 2022 when ChatGPT attracted an estimated 100 million monthly active users within roughly two months of launch, which a UBS analysis called the fastest-growing consumer application in history.[13]
In probabilistic terms, a generative model specifies a distribution p_model(x) over the data space X, or a joint distribution p_model(x, y) over inputs and labels. The goal during training is to adjust the model parameters theta so that p_model(x; theta) approximates the true data distribution p_data(x) as closely as possible.[9] Once trained, the model can:
The training objective often involves maximizing the log-likelihood of the observed data under the model, though many modern approaches use alternative objectives such as adversarial losses, score matching, or variational lower bounds.[10]
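A short derivation may help show why log-likelihood is a natural objective. It uses only the symbols already introduced (p_data, p_model, theta) and the standard definition of the Kullback-Leibler divergence, and it is a sketch of the usual textbook argument rather than something specific to any one model family:

```latex
\begin{aligned}
D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\text{model}}(\cdot\,;\theta)\right)
  &= \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{data}}(x)\right]
   - \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{model}}(x;\theta)\right] \\
\theta^{*}
  &= \arg\max_{\theta}\; \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log p_{\text{model}}(x;\theta)\right]
   = \arg\min_{\theta}\; D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\text{model}}(\cdot\,;\theta)\right)
\end{aligned}
```

The first expectation does not depend on theta, so maximizing the expected log-likelihood and minimizing this KL divergence select the same parameters. In practice the expectation is replaced by an average over the training set.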
The distinction between generative and discriminative models is one of the most fundamental concepts in machine learning.[3]
A discriminative model learns the decision boundary between classes directly. It models P(Y|X) and answers the question: "Given this input, what is the most likely label?" Examples include logistic regression, support vector machines, and most neural network classifiers.[3]
A generative model learns the full data distribution. It models P(X, Y) or P(X) and answers the question: "How was this data generated?" Because it captures the complete data-generating process, it can also be used for classification through Bayes' rule, but its primary strength lies in generating new data.[3]
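In a landmark 2002 paper, Andrew Ng and Michael Jordan compared generative and discriminative classifiers using naive Bayes (generative) and logistic regression (discriminative) as representative examples.[3] Their key findings were that logistic regression attains a lower asymptotic error as the training set grows, while naive Bayes approaches its own (higher) asymptotic error much faster, so the generative classifier can be the better choice when training data is scarce.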
Ng and Jordan summarized the counterintuitive result directly, writing that "there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better," contrary to the then-widely-held belief that discriminative classifiers are almost always preferable.[3] This analysis remains influential and is frequently cited when practitioners choose between model families based on available data volume.[3]
| Property | Generative Model | Discriminative Model |
|---|---|---|
| What it models | Joint distribution P(X, Y) or P(X) | Conditional distribution P(Y \| X) |
| Primary use | Data generation, density estimation | Classification, regression |
| Can generate new samples | Yes | No |
| Can classify | Yes (via Bayes' rule) | Yes (directly) |
| Sample efficiency | Higher (needs less data) | Lower (needs more data) |
| Asymptotic accuracy | Lower | Higher |
| Classic examples | Naive Bayes, GMM, HMM | Logistic regression, SVM, neural network classifier |
Ian Goodfellow proposed an influential taxonomy that divides generative models based on how they represent the data distribution.[9] The primary split is between models that define an explicit density function and those that use an implicit density approach.
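Explicit density models define a parametric form for p_model(x) and optimize it directly. These are further divided into tractable density models, such as autoregressive models and normalizing flows, whose likelihood can be computed exactly, and approximate density models, such as VAEs and Boltzmann machines, which rely on variational bounds or Markov chain approximations.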
Implicit density models do not define an explicit density function. Instead, they learn to sample from p_model(x) directly without ever computing the likelihood. The most prominent example is the GAN, which trains a generator network to produce samples that fool a discriminator network.[9] Generative stochastic networks (GSNs) also fall into this category.
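Generative adversarial networks were introduced by Ian Goodfellow and seven co-authors in 2014.[1] A GAN consists of two neural networks trained simultaneously in a minimax game: a generator G, which maps random noise z to synthetic samples G(z), and a discriminator D, which estimates the probability that a sample came from the training data rather than from the generator.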
The two networks optimize a single value function: min over G of max over D of V(D, G) = E_{x ~ p_data}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))], where p_z is the noise prior fed to the generator. The discriminator tries to maximize V by classifying real and generated samples correctly, while the generator tries to minimize it by producing samples the discriminator cannot tell apart from real data. The original paper proved that, in the idealized case of unlimited model capacity, this game has a unique global optimum at which the generator recovers the true data distribution and the discriminator outputs 1/2 everywhere, meaning generated samples are indistinguishable from real data.[1] The authors framed the setup as an analogy: the generative model is "analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency."[1]
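To make the value function concrete, here is a minimal sketch that estimates V(D, G) from lists of discriminator outputs. The function name `gan_value` and the sample values are hypothetical, and no networks are trained; the point is only to show how the two expectation terms combine.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# At the optimum, D outputs 1/2 on every input, so V = log(1/2) + log(1/2) = -log 4.
value_at_optimum = gan_value([0.5] * 4, [0.5] * 4)

# A discriminator that still separates real from fake achieves a higher value,
# which is what the generator is trying to push back down.
value_confident_d = gan_value([0.9] * 4, [0.1] * 4)
```

With these inputs the value at the optimum equals -log 4, and the confident discriminator scores higher, which matches the minimax reading of the objective.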
GANs have achieved remarkable results in image generation (StyleGAN, BigGAN), image-to-image translation (pix2pix, CycleGAN), and super-resolution (SRGAN). However, they can be difficult to train due to mode collapse (where the generator produces limited variety) and training instability.[8]
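Variational autoencoders, introduced by Diederik Kingma and Max Welling in a paper submitted on December 20, 2013, combine neural networks with variational Bayesian inference.[2] A VAE consists of an encoder, which maps an input x to the parameters of an approximate posterior q(z | x) over latent variables, and a decoder, which maps a latent sample z to a distribution p(x | z) over the data.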
The model is trained by maximizing the evidence lower bound (ELBO), which balances reconstruction accuracy against a KL divergence regularization term that keeps the latent distribution close to a prior (usually a standard Gaussian). The reparameterization trick, also introduced in the original paper, allows gradients to flow through the sampling operation during backpropagation. The authors showed that "a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods," which is what makes the VAE trainable end to end.[2]
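The two ingredients of the ELBO can be sketched in a few lines, assuming a diagonal Gaussian approximate posterior and a standard Gaussian prior. The function names are illustrative, and the reconstruction log-likelihood is passed in as a plain number because no decoder network is defined here. Plain Python computes only the forward pass; an automatic differentiation framework would be needed to obtain the gradients that the reparameterization makes possible.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def elbo(reconstruction_log_lik, mu, log_var):
    """Single-sample ELBO estimate: log p(x | z) minus the KL regularizer."""
    return reconstruction_log_lik - kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -0.5]   # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)     # randomness enters only through eps
kl = kl_to_standard_normal(mu, log_var)
```

Because the noise eps is drawn independently of mu and log_var, the sample z is a deterministic function of the encoder outputs once eps is fixed, which is what lets gradients pass through the sampling step.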
VAEs produce a smooth, continuous latent space that supports meaningful interpolation between data points. They are widely used in drug discovery, molecular generation, and representation learning.
Autoregressive models decompose the joint distribution into a product of conditional distributions using the chain rule of probability:
p(x) = p(x_1) * p(x_2 | x_1) * p(x_3 | x_1, x_2) * ... * p(x_n | x_1, ..., x_{n-1})
Each element in the sequence is generated one at a time, conditioned on all previous elements. This approach yields tractable, exact likelihoods and has been enormously successful in both text and image generation.
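The chain-rule factorization can be illustrated with a toy model over a two-token vocabulary. The conditional table in `next_token_probs` is made up for this sketch and, as a simplification, looks only at the most recent token; a real autoregressive model would condition on the whole prefix with a neural network.

```python
import math
import random

def next_token_probs(prefix):
    """Toy conditional p(x_t | x_1, ..., x_{t-1}); hypothetical values."""
    if not prefix:
        return {"a": 0.5, "b": 0.5}
    if prefix[-1] == "a":
        return {"a": 0.1, "b": 0.9}
    return {"a": 0.7, "b": 0.3}

def sequence_log_prob(seq):
    """Exact log p(x) as a sum of conditional log-probabilities (chain rule)."""
    return sum(math.log(next_token_probs(seq[:i])[tok])
               for i, tok in enumerate(seq))

def sample_sequence(length, rng):
    """Generate one token at a time, each conditioned on the tokens so far."""
    seq = []
    for _ in range(length):
        probs = next_token_probs(seq)
        tokens = list(probs)
        seq.append(rng.choices(tokens, weights=[probs[t] for t in tokens])[0])
    return seq

log_p = sequence_log_prob(["a", "b", "a"])       # log(0.5 * 0.9 * 0.7)
sample = sample_sequence(5, random.Random(0))
```

The same two operations, scoring a sequence and sampling from it, are what large language models and pixel-level image models perform, only with far larger vocabularies and learned conditionals.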
For text, the Transformer architecture powers modern autoregressive large language models such as the GPT family (GPT-2, GPT-3, GPT-4), LLaMA, and Claude.[12] These models predict the next token in a sequence and can generate coherent paragraphs, code, poetry, and more.
For images, PixelCNN and PixelRNN generate images one pixel at a time. More recent approaches such as VQ-VAE-2 and Parti represent images as sequences of discrete tokens and generate them with autoregressive models.
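Diffusion models (also called score-based generative models) have emerged as one of the most powerful generative frameworks since 2020. The approach involves two processes: a fixed forward process that gradually adds Gaussian noise to the data until it is indistinguishable from pure noise, and a learned reverse process that removes the noise step by step to produce a sample.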
Denoising diffusion probabilistic models (DDPMs), introduced by Jonathan Ho, Ajay Jain, and Pieter Abbeel in 2020, demonstrated that this framework could generate high-quality images.[4] The paper opened by presenting "high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics," and reported a then-state-of-the-art Frechet Inception Distance (FID) of 3.17 and an Inception score of 9.46 on unconditional CIFAR-10.[4] Song and Ermon showed the connection between diffusion models and score matching, where the network learns the gradient (score) of the log data density.[5]
Diffusion models have several advantages over GANs: stable training with a simple mean squared error loss function, far less susceptibility to mode collapse, and a principled likelihood-based framework.[4] They power state-of-the-art image generators such as Stable Diffusion, DALL-E 2 and 3, and Imagen. They have also been extended to video generation (Sora), audio synthesis, and 3D content creation.
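The noising side of a DDPM-style model and its mean squared error objective can be sketched as follows. This assumes the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps with a linear noise schedule; the schedule endpoints are commonly used defaults and are treated here as assumptions. No denoising network is included, so the loss is evaluated only for a hypothetical perfect prediction.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed default endpoints)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t given x_0 in one step; returns (x_t, eps)."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

def mse(pred, target):
    """Training loss: mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [1.0, -1.0, 0.5]                                  # toy "data point"
x_t, eps = add_noise(x0, 500, alpha_bars, random.Random(0))
loss_if_perfect = mse(eps, eps)                        # a perfect noise predictor
```

By the last step almost none of the original signal remains, which is why sampling can start from pure Gaussian noise and run the learned denoiser in reverse.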
Normalizing flows construct a complex probability distribution by applying a sequence of invertible, differentiable transformations to a simple base distribution (typically a Gaussian).[6] Because each transformation is invertible, the exact likelihood can be computed using the change-of-variables formula:
log p(x) = log p(z_0) - sum over i = 1..K of log |det(df_i/dz_{i-1})|
where z_0 is the sample from the base distribution, f_1, ..., f_K are the flow layers with z_i = f_i(z_{i-1}), and x = z_K. Key models in this family include RealNVP (2016), Glow (2018), and Neural Spline Flows (2019).
The main advantage of normalizing flows is exact likelihood computation, which is useful for density estimation and anomaly detection. The main limitation is that the input and output dimensions must match, and the transformations must be carefully designed so that both the function and its Jacobian determinant are efficiently computable.
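A one-dimensional example shows the change-of-variables computation end to end. The layers here are simple affine maps z -> scale * z + shift, chosen for this sketch because their inverse and Jacobian are trivial; practical flows use richer transformations such as coupling layers.

```python
import math

def standard_normal_logpdf(z):
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def flow_log_prob(x, layers):
    """Exact log p(x) for a stack of 1-D affine layers.

    layers: list of (scale, shift), applied in order as z -> scale * z + shift.
    """
    log_det_sum = 0.0
    z = x
    for scale, shift in reversed(layers):   # invert the flow, last layer first
        z = (z - shift) / scale
        log_det_sum += math.log(abs(scale)) # |det Jacobian| of an affine map
    return standard_normal_logpdf(z) - log_det_sum

layers = [(2.0, 1.0), (1.5, -3.0)]          # hypothetical parameters
log_p = flow_log_prob(0.7, layers)

# The composed map is x = 3 * z - 1.5, so x ~ N(-1.5, 3^2) in closed form.
closed_form = -0.5 * ((0.7 + 1.5) / 3.0) ** 2 - 0.5 * math.log(2.0 * math.pi) - math.log(3.0)
```

The flow's log-likelihood agrees with the closed-form Gaussian density, which is the exactness property that makes flows attractive for density estimation.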
Energy-based models assign a scalar energy value E(x) to each data point. Lower energy corresponds to higher probability. The probability distribution is defined as:
p(x) = exp(-E(x)) / Z
where Z is the partition function (normalizing constant). Because Z is typically intractable, training EBMs requires specialized techniques such as contrastive divergence or score matching.[10]
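On a small discrete state space the partition function can be computed by brute force, which makes the definition easy to check. The quadratic energy function is a made-up example; for realistic continuous or high-dimensional data this sum is exactly the part that becomes intractable.

```python
import math

def energy(x):
    """Hypothetical energy over the states 0..4, lowest at x = 2."""
    return (x - 2) ** 2

states = range(5)

# Partition function: sum of unnormalized probabilities over every state.
Z = sum(math.exp(-energy(x)) for x in states)

# Normalized distribution p(x) = exp(-E(x)) / Z.
probs = {x: math.exp(-energy(x)) / Z for x in states}
most_likely = max(probs, key=probs.get)
```

The lowest-energy state receives the highest probability, and states with equal energy receive equal probability.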
EBMs are flexible and expressive, but sampling from them is computationally expensive, often requiring Markov chain Monte Carlo (MCMC) methods. Research by Yann LeCun and others has explored combining EBMs with modern deep learning architectures.
Boltzmann machines, proposed by Geoffrey Hinton and Terry Sejnowski in 1985, are stochastic neural networks inspired by statistical mechanics.[11] They define a joint distribution over visible and hidden units using an energy function, where states with lower energy are more probable, analogous to how a physical system at thermal equilibrium is more likely to occupy low-energy configurations. Restricted Boltzmann Machines (RBMs) simplify the architecture by removing connections between units in the same layer, making training more tractable through contrastive divergence.
RBMs were foundational to early deep learning. Stacking multiple RBMs produces Deep Belief Networks (DBNs), which were among the first successful deep generative models. While largely superseded by VAEs, GANs, and diffusion models for generation tasks, Boltzmann machines remain important in the history of generative modeling and continue to inspire research in energy-based approaches.
Naive Bayes is one of the simplest generative classifiers. It models the joint distribution P(X, Y) by assuming that all features are conditionally independent given the class label:
P(X | Y = c) = product over i of P(x_i | Y = c)
Despite this strong independence assumption, naive Bayes performs surprisingly well in many practical settings, especially text classification.[3] It can also be used generatively by sampling from the learned class-conditional distributions, though the generated samples tend to be less realistic than those from deep generative models.
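A minimal Bernoulli naive Bayes sketch shows both uses of the model: classifying with Bayes' rule and sampling new feature vectors. The tiny dataset, the labels, and the function names are invented for illustration, and Laplace smoothing is an added assumption to avoid zero probabilities.

```python
import math
import random

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate P(Y = c) and P(x_i = 1 | Y = c) with Laplace smoothing alpha."""
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        feature_probs = [(sum(col) + alpha) / (n + 2.0 * alpha)
                         for col in zip(*rows)]
        model[c] = (n / len(y), feature_probs)
    return model

def log_joint(model, x, c):
    """log P(Y = c) + sum over i of log P(x_i | Y = c)."""
    prior, probs = model[c]
    return math.log(prior) + sum(
        math.log(p if xi else 1.0 - p) for xi, p in zip(x, probs))

def predict(model, x):
    """Bayes' rule: pick the class with the largest joint probability."""
    return max(model, key=lambda c: log_joint(model, x, c))

def sample(model, c, rng):
    """Generative use: draw each feature independently given class c."""
    _, probs = model[c]
    return [1 if rng.random() < p else 0 for p in probs]

X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]   # toy binary features
y = ["spam", "spam", "ham", "ham"]
model = fit_bernoulli_nb(X, y)
generated = sample(model, "spam", random.Random(0))
```

Because the posterior P(Y | X) is proportional to the joint P(X, Y), comparing joint probabilities across classes is enough to classify without computing the normalizing constant.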
| Model Family | Year Introduced | Density Type | Likelihood | Training Stability | Sample Quality | Key Use Cases |
|---|---|---|---|---|---|---|
| Naive Bayes | 1960s | Explicit (tractable) | Exact | Very stable | Low | Text classification, spam filtering |
| Boltzmann Machines / RBMs | 1985 | Explicit (approximate) | Approximate (MCMC) | Moderate | Low to moderate | Feature learning, pretraining |
| Gaussian Mixture Models | Classical | Explicit (tractable) | Exact | Stable (EM) | Low | Clustering, density estimation |
| Variational Autoencoders | 2013 | Explicit (approximate) | Lower bound (ELBO) | Stable | Moderate (can be blurry) | Molecule design, representation learning |
| GANs | 2014 | Implicit | Not available | Unstable (mode collapse) | High (sharp images) | Image synthesis, style transfer |
| Normalizing Flows | 2015 | Explicit (tractable) | Exact | Stable | Moderate to high | Density estimation, anomaly detection |
| Autoregressive (GPT, PixelCNN) | 2016 | Explicit (tractable) | Exact | Stable | High | Text generation, language modeling |
| Diffusion Models | 2020 | Explicit (approximate) | Approximate (ELBO) | Very stable | Very high | Image and video generation |
A 2022 comparative review of deep generative models in IEEE Transactions on Pattern Analysis and Machine Intelligence surveyed VAEs, GANs, normalizing flows, energy-based models, and autoregressive models together, noting that no single family dominates on likelihood tractability, sample quality, and training stability at once.[10]
Many generative models introduce latent variables z that represent unobserved factors of variation in the data. The observed data x is assumed to be generated from these hidden factors according to:
p(x) = integral of p(x | z) * p(z) dz
Latent variable models include VAEs, GANs (where the noise input z serves as the latent variable), and classical models like factor analysis and probabilistic PCA. The latent space often captures meaningful, interpretable features of the data. For example, in a face generation model, different latent dimensions might control pose, lighting, or expression.
The key challenge in latent variable models is inference: computing the posterior distribution p(z | x). In VAEs, this is addressed through amortized variational inference using an encoder network.[2] In GANs, the latent-to-data mapping is learned, but inferring z from x typically requires additional techniques such as encoder networks (BiGAN) or optimization-based inversion.
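The marginal likelihood integral can be approximated by sampling z from the prior and averaging p(x | z). The sketch uses a linear Gaussian model, chosen because its marginal is known in closed form and the estimate can be checked; the noise level and sample count are arbitrary assumptions.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def marginal_monte_carlo(x, noise_std, num_samples, rng):
    """Estimate p(x) = E over z ~ N(0, 1) of p(x | z), with x | z ~ N(z, noise_std^2)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)              # draw a latent from the prior
        total += normal_pdf(x, z, noise_std) # likelihood of x under that latent
    return total / num_samples

rng = random.Random(0)
estimate = marginal_monte_carlo(0.5, 0.5, 20000, rng)

# For this model the marginal is N(0, 1 + noise_std^2) exactly.
exact = normal_pdf(0.5, 0.0, math.sqrt(1.0 + 0.5 ** 2))
```

Sampling from the prior works here because the latent is one-dimensional. In high dimensions most prior samples explain the data poorly, which is one motivation for learning an encoder that proposes likely values of z for a given x.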
Different generative model families use different loss functions and training objectives:
Evaluating generative models is challenging because there is no single metric that captures all aspects of generation quality. Common metrics include:
FID, introduced by Heusel et al. in 2017, is the most widely used metric for evaluating generated images.[7] It extracts features from both real and generated images using a pretrained Inception v3 network, fits multivariate Gaussians to each set of features, and computes the Frechet distance between the two distributions. Lower FID indicates higher quality and diversity. The authors introduced FID specifically because it "captures the similarity of generated images to real ones better than the Inception Score," and showed it agrees better with human judgment and with increasing levels of injected image distortion.[7]
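The Fréchet distance at the core of FID can be sketched under a simplifying assumption: diagonal covariances, so that each feature dimension is treated independently. The real metric uses full covariance matrices and a matrix square root, plus Inception v3 features, none of which appear here; the feature vectors are made-up numbers.

```python
import math

def mean_and_variance(features):
    """Per-dimension mean and population variance of a list of feature vectors."""
    n = len(features)
    columns = list(zip(*features))
    means = [sum(col) / n for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / n
                 for col, m in zip(columns, means)]
    return means, variances

def frechet_distance_diagonal(feats_a, feats_b):
    """Frechet distance between two Gaussians with diagonal covariances.

    Full form: |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2)).
    With diagonal S the trace term reduces to a per-dimension sum.
    """
    mu_a, var_a = mean_and_variance(feats_a)
    mu_b, var_b = mean_and_variance(feats_b)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    cov_term = sum(va + vb - 2.0 * math.sqrt(va * vb)
                   for va, vb in zip(var_a, var_b))
    return mean_term + cov_term

real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]            # stand-in "real" features
shifted = [[v + 1.0 for v in row] for row in real]     # same spread, shifted mean

same = frechet_distance_diagonal(real, real)
moved = frechet_distance_diagonal(real, shifted)
```

Identical feature sets give a distance of zero, and shifting every feature by one unit in two dimensions gives a distance of two, coming entirely from the mean term.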
The Inception Score, introduced by Salimans et al. in 2016, evaluates generated images based on two criteria: each image should be clearly classifiable (low entropy in the class prediction), and the set of generated images should cover many classes (high entropy in the marginal class distribution).[8] Higher IS indicates better quality and diversity. However, IS has notable limitations: it does not compare against real data, and it can be fooled by generators that produce high-confidence but unrealistic images.
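The two criteria combine into a single number: the exponential of the average KL divergence between each image's class distribution and the marginal class distribution. The sketch below computes that quantity from class-probability vectors supplied directly, which are invented for illustration; in practice they would come from a pretrained Inception classifier.

```python
import math

def inception_score(class_probs):
    """exp( mean over images of KL( p(y | x) || p(y) ) ).

    class_probs: one probability vector over classes per generated image.
    """
    n = len(class_probs)
    marginal = [sum(col) / n for col in zip(*class_probs)]
    kl_sum = 0.0
    for probs in class_probs:
        # Terms with p = 0 contribute nothing to the KL divergence.
        kl_sum += sum(p * math.log(p / m)
                      for p, m in zip(probs, marginal) if p > 0.0)
    return math.exp(kl_sum / n)

# Confident predictions spread evenly over three classes: the best case.
confident_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Every image looks equally like every class: the worst case.
uninformative = [[1.0 / 3.0] * 3 for _ in range(3)]

best = inception_score(confident_diverse)
worst = inception_score(uninformative)
```

The score ranges from 1 to the number of classes, so the best case here is 3 and the worst case is 1. Real images never enter the computation, which illustrates the limitation noted above.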
For text generation models, perplexity measures how well the model predicts a held-out test set. It is defined as the exponentiation of the average negative log-likelihood per token. Lower perplexity indicates that the model assigns higher probability to the actual text, suggesting better language modeling. Perplexity is a standard intrinsic evaluation metric for language models.[12]
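The definition translates directly into code. The input is assumed to be the probability the model assigned to each actual token of the held-out text; the values used are made up.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that gives every correct token probability 1/4 is, on average,
# as uncertain as a uniform choice among four options.
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])

# Assigning higher probability to the correct tokens lowers perplexity.
better_model = perplexity([0.5, 0.8, 0.6, 0.9])
```

One way to read the result is as an effective branching factor: a perplexity of 4 means the model is about as uncertain as if it were choosing uniformly among four tokens at each step.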
Other evaluation approaches include Kernel Inception Distance (KID), precision and recall metrics for generative models (measuring fidelity and diversity separately), CLIP score (for text-image alignment), and human evaluation studies.
Generative models have found applications across a wide range of fields:
Diffusion models and GANs can generate photorealistic images from text descriptions, perform style transfer, fill in missing regions (inpainting), increase resolution (super-resolution), and edit specific attributes of existing images. Products like Midjourney, DALL-E, and Stable Diffusion have made these capabilities accessible to millions of users.
Autoregressive large language models generate coherent, contextually appropriate text for a vast range of applications: creative writing, code generation, translation, summarization, question answering, and conversational AI. Systems like ChatGPT, Claude, and Gemini represent the current state of the art.
Generative models, particularly VAEs and diffusion models, are used to design novel molecular structures with desired pharmacological properties. These models can explore the vast chemical space more efficiently than traditional high-throughput screening, generating candidates optimized for drug-likeness, synthetic accessibility, and target binding affinity. Real-world successes include the AI-driven discovery of novel antibiotics effective against multidrug-resistant infections.
Generative models create synthetic training data to improve the performance of supervised learning systems, especially when real labeled data is scarce or expensive to collect. This is particularly valuable in medical imaging, where patient privacy concerns and annotation costs limit dataset size.
Models that learn the normal data distribution can flag anomalies as inputs with low likelihood under p_model(x). This approach is used in fraud detection, manufacturing quality control, network intrusion detection, and medical diagnostics.
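A minimal sketch of likelihood-based anomaly detection, assuming a single Gaussian as the density model and a threshold set one unit of log-likelihood below the least likely training point. The sensor readings, the model choice, and the threshold rule are all illustrative assumptions; a deployed system would use a richer model and a validated threshold.

```python
import math

def fit_gaussian(data):
    """Maximum likelihood estimates of the mean and standard deviation."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(var)

def log_likelihood(x, mean, std):
    """log p_model(x) under N(mean, std^2)."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

def is_anomaly(x, mean, std, threshold):
    """Flag inputs whose log-likelihood falls below the threshold."""
    return log_likelihood(x, mean, std) < threshold

normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2]   # hypothetical sensor values
mean, std = fit_gaussian(normal_readings)

# Assumed rule: allow a margin below the least likely training point.
threshold = min(log_likelihood(x, mean, std) for x in normal_readings) - 1.0

typical_flag = is_anomaly(10.05, mean, std, threshold)
outlier_flag = is_anomaly(12.0, mean, std, threshold)
```

A reading near the training data is not flagged, while a reading far outside the observed range receives a very low log-likelihood and is flagged.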
Generative models now create music compositions, voice synthesis, sound effects, and even full video clips. Models like Sora (video), AudioCraft (audio), and Suno (music) demonstrate the breadth of creative applications.
Beyond drug discovery, generative models assist in protein structure prediction, materials science (designing new materials with target properties), weather forecasting, and particle physics simulation.
The period from 2022 onward marked a transformative moment for generative models, bringing them from research labs into mainstream use:
This revolution was driven by three converging factors: advances in model architectures (transformers, diffusion), massive increases in compute and training data, and breakthroughs in alignment and instruction tuning that made models useful for everyday tasks.
Despite remarkable progress, generative models face several ongoing challenges:
Imagine you spend a long time looking at thousands of pictures of cats. Eventually, you get so good at understanding what cats look like that you can close your eyes and draw a brand new cat from your imagination. It would not be a copy of any cat you saw before, but it would still look like a real cat because you learned the "rules" of what makes a cat look like a cat.
That is basically what a generative model does. It looks at tons of examples (pictures, sentences, molecules, or anything else), figures out the hidden patterns and rules behind them, and then uses those rules to create brand new examples that look just like the real ones. Some generative models are really good at making pictures, others are great at writing stories, and some even help scientists invent new medicines.