See also: Machine learning terms, Discriminative model
A generative model is a class of statistical and machine learning models that learns the probability distribution P(X) of the observed data, or the joint distribution P(X, Y) of inputs and labels, in order to generate new data samples that resemble the training distribution. In contrast to discriminative models, which learn the conditional probability P(Y|X) to classify inputs, generative models capture how the data itself is produced. This makes them capable of tasks that go far beyond classification: synthesizing images, writing text, composing music, designing molecules, and augmenting datasets.
Generative models sit at the heart of the modern AI revolution. Systems such as GPT, DALL-E, and Stable Diffusion are all built on generative modeling principles. The field draws on decades of research in probability theory, Bayesian statistics, information theory, and deep learning, and it has expanded rapidly since 2014 with the introduction of generative adversarial networks (GANs) and variational autoencoders (VAEs).
In probabilistic terms, a generative model specifies a distribution p_model(x) over the data space X, or a joint distribution p_model(x, y) over inputs and labels. The goal during training is to adjust the model parameters theta so that p_model(x; theta) approximates the true data distribution p_data(x) as closely as possible. Once trained, the model can draw new samples from p_model(x) and, for explicit-density families, evaluate the likelihood it assigns to unseen data.
The training objective often involves maximizing the log-likelihood of the observed data under the model, though many modern approaches use alternative objectives such as adversarial losses, score matching, or variational lower bounds.
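As a concrete illustration, here is a minimal sketch of likelihood-based training in its simplest form: fitting a one-dimensional Gaussian p_model(x) to toy data, where the maximum-likelihood estimates have a closed form, and then sampling from the fitted model. The data and dimensions are arbitrary choices for illustration.

```python
import numpy as np

# A minimal sketch: fit a 1-D Gaussian p_model(x; mu, sigma) to data by
# maximum likelihood, then draw new samples from the fitted model.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)   # stand-in for p_data(x)

# For a Gaussian, the maximum-likelihood estimates have a closed form:
mu_hat = data.mean()
sigma_hat = data.std()

# Average log-likelihood of the data under the fitted model.
log_lik = -0.5 * np.log(2 * np.pi * sigma_hat**2) - (data - mu_hat) ** 2 / (2 * sigma_hat**2)
print("mean log-likelihood:", log_lik.mean())

# "Generation": sample new points from p_model.
new_samples = rng.normal(mu_hat, sigma_hat, size=5)
print(new_samples)
```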
The distinction between generative and discriminative models is one of the most fundamental concepts in machine learning.
A discriminative model learns the decision boundary between classes directly. It models P(Y|X) and answers the question: "Given this input, what is the most likely label?" Examples include logistic regression, support vector machines, and most neural network classifiers.
A generative model learns the full data distribution. It models P(X, Y) or P(X) and answers the question: "How was this data generated?" Because it captures the complete data-generating process, it can also be used for classification through Bayes' rule, but its primary strength lies in generating new data.
In a landmark 2002 paper, Andrew Ng and Michael Jordan compared generative and discriminative classifiers using naive Bayes (generative) and logistic regression (discriminative) as representative examples. Their key findings include:

- The generative classifier approaches its asymptotic error much faster, so it can outperform the discriminative classifier when training data is scarce.
- The discriminative classifier reaches a lower asymptotic error, so it tends to win as the amount of training data grows.
This analysis remains influential and is frequently cited when practitioners choose between model families based on available data volume.
| Property | Generative Model | Discriminative Model |
|---|---|---|
| What it models | Joint distribution P(X, Y) or P(X) | Conditional distribution P(Y|X) |
| Primary use | Data generation, density estimation | Classification, regression |
| Can generate new samples | Yes | No |
| Can classify | Yes (via Bayes' rule) | Yes (directly) |
| Sample efficiency | Higher (needs less data) | Lower (needs more data) |
| Asymptotic accuracy | Lower | Higher |
| Classic examples | Naive Bayes, GMM, HMM | Logistic regression, SVM, neural network classifier |
Ian Goodfellow proposed an influential taxonomy that divides generative models based on how they represent the data distribution. The primary split is between models that define an explicit density function and those that use an implicit density approach.
Explicit density models define a parametric form for p_model(x) and optimize it directly. These are further divided into tractable-density models, where the likelihood can be computed exactly (autoregressive models such as PixelCNN, and normalizing flows), and approximate-density models, where the likelihood is approximated either variationally (VAEs) or through Markov chains (Boltzmann machines).
Implicit density models do not define an explicit density function. Instead, they learn to sample from p_model(x) directly without ever computing the likelihood. The most prominent example is the GAN, which trains a generator network to produce samples that fool a discriminator network. Generative stochastic networks (GSNs) also fall into this category.
Generative adversarial networks were introduced by Ian Goodfellow and colleagues in 2014. A GAN consists of two neural networks trained simultaneously in a minimax game:

- A generator, which maps random noise vectors to synthetic samples.
- A discriminator, which tries to distinguish real training examples from generated ones.
The generator tries to minimize the discriminator's ability to distinguish real from fake, while the discriminator tries to maximize its classification accuracy. At equilibrium, the generator produces samples indistinguishable from real data.
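The sketch below illustrates this minimax training loop on one-dimensional toy data, using the common non-saturating generator loss; the network sizes and data are illustrative assumptions, and real image GANs use convolutional architectures.

```python
import torch
import torch.nn as nn

# A minimal sketch of the GAN minimax game on 1-D toy data.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))   # generator: noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator: sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0          # stand-in for real data
    fake = G(torch.randn(64, 8))

    # Discriminator step: maximize accuracy on real vs. generated samples.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator (non-saturating loss).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```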
GANs have achieved remarkable results in image generation (StyleGAN, BigGAN), image-to-image translation (pix2pix, CycleGAN), and super-resolution (SRGAN). However, they can be difficult to train due to mode collapse (where the generator produces limited variety) and training instability.
Variational autoencoders, introduced by Kingma and Welling in 2013, combine neural networks with variational Bayesian inference. A VAE consists of:

- An encoder network, which maps each input x to the parameters (mean and variance) of an approximate posterior q(z|x) over latent variables.
- A decoder network, which reconstructs x from latent samples z drawn from that distribution.
The model is trained by maximizing the evidence lower bound (ELBO), which balances reconstruction accuracy against a KL divergence regularization term that keeps the latent distribution close to a prior (usually a standard Gaussian). The reparameterization trick, also introduced in the original paper, allows gradients to flow through the sampling operation during backpropagation.
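A minimal sketch of these ideas is shown below: the encoder outputs a mean and log-variance, the reparameterization trick draws z = mu + sigma * eps, and the loss is the negative ELBO. The layer sizes and the Bernoulli-style reconstruction term are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 128)
        self.mu = nn.Linear(128, z_dim)       # encoder outputs the mean ...
        self.logvar = nn.Linear(128, z_dim)   # ... and log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, x_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                 # reparameterization trick:
        z = mu + eps * torch.exp(0.5 * logvar)     # z = mu + sigma * eps keeps gradients flowing
        return self.dec(z), mu, logvar

def neg_elbo(x, x_recon, mu, logvar):
    # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I)).
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

x = torch.rand(32, 784)            # toy batch of inputs in [0, 1]
model = VAE()
x_recon, mu, logvar = model(x)
loss = neg_elbo(x, x_recon, mu, logvar)
loss.backward()
```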
VAEs produce a smooth, continuous latent space that supports meaningful interpolation between data points. They are widely used in drug discovery, molecular generation, and representation learning.
Autoregressive models decompose the joint distribution into a product of conditional distributions using the chain rule of probability:
p(x) = p(x_1) * p(x_2 | x_1) * p(x_3 | x_1, x_2) * ... * p(x_n | x_1, ..., x_{n-1})
Each element in the sequence is generated one at a time, conditioned on all previous elements. This approach yields tractable, exact likelihoods and has been enormously successful in both text and image generation.
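The sketch below shows ancestral sampling under this factorization, using a toy bigram table in place of a neural network; a real language model would replace the lookup with a forward pass that conditions on the whole prefix.

```python
import numpy as np

# Ancestral sampling under the chain-rule factorization: each symbol is drawn
# from p(x_t | x_1, ..., x_{t-1}).  The "model" here is a toy bigram table,
# an assumption for illustration only.
vocab = ["<s>", "a", "b", "</s>"]
bigram = {
    "<s>": [0.0, 0.6, 0.4, 0.0],
    "a":   [0.0, 0.1, 0.6, 0.3],
    "b":   [0.0, 0.5, 0.1, 0.4],
}

def next_token_probs(context):
    return np.array(bigram[context[-1]])   # toy model: only the last token matters

rng = np.random.default_rng(0)
seq = ["<s>"]
while seq[-1] != "</s>" and len(seq) < 20:
    probs = next_token_probs(seq)
    seq.append(vocab[rng.choice(len(vocab), p=probs)])
print(" ".join(seq))
```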
For text, the Transformer architecture powers modern autoregressive large language models such as the GPT family (GPT-2, GPT-3, GPT-4), LLaMA, and Claude. These models predict the next token in a sequence and can generate coherent paragraphs, code, poetry, and more.
For images, PixelCNN and PixelRNN generate images one pixel at a time. More recent approaches like VQ-VAE-2 and Parti use discrete token representations of images and apply autoregressive models over those tokens to generate them.
Diffusion models (also called score-based generative models) have emerged as one of the most powerful generative frameworks since 2020. The approach involves two processes:

- A forward (diffusion) process that gradually adds Gaussian noise to the data over many steps until only noise remains.
- A reverse (denoising) process, learned by a neural network, that removes the noise step by step, turning random noise into a new sample.
Denoising diffusion probabilistic models (DDPMs), introduced by Ho, Jain, and Abbeel in 2020, demonstrated that this framework could generate high-quality images. Song and Ermon showed the connection between diffusion models and score matching, where the network learns the gradient (score) of the log data density.
Diffusion models have several advantages over GANs: stable training with a simple mean squared error loss function, no mode collapse, and a principled likelihood-based framework. They power state-of-the-art image generators such as Stable Diffusion, DALL-E 2 and 3, and Imagen. They have also been extended to video generation (Sora), audio synthesis, and 3D content creation.
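The sketch below shows a single DDPM-style training step: noise clean data at a random timestep using the closed-form forward process, then regress the network's output onto that noise with an MSE loss. The tiny MLP and the linear beta schedule are illustrative assumptions; production models use U-Nets and carefully tuned schedules.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative product of (1 - beta)

model = nn.Sequential(nn.Linear(2 + 1, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(128, 2) * 0.3 + 1.0                 # stand-in for clean training data
t = torch.randint(0, T, (128,))
noise = torch.randn_like(x0)

# Forward (noising) process in closed form: x_t = sqrt(a_bar)*x_0 + sqrt(1 - a_bar)*eps.
a_bar = alphas_bar[t].unsqueeze(1)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

# Reverse-process training: predict the added noise from (x_t, t).
pred = model(torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1))
loss = ((pred - noise) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```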
Normalizing flows construct a complex probability distribution by applying a sequence of invertible, differentiable transformations to a simple base distribution (typically a Gaussian). Because each transformation is invertible, the exact likelihood can be computed using the change-of-variables formula:
log p(x) = log p(z) - sum over layers i of log |det(df_i/dz_{i-1})|
where z = z_0 is the base distribution sample, z_{i-1} is the input to the i-th layer, and the f_i are the flow layers. Key models in this family include RealNVP (2016), Glow (2018), and Neural Spline Flows (2019).
The main advantage of normalizing flows is exact likelihood computation, which is useful for density estimation and anomaly detection. The main limitation is that the input and output dimensions must match, and the transformations must be carefully designed so that both the function and its Jacobian determinant are efficiently computable.
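The sketch below implements a single RealNVP-style affine coupling layer, whose Jacobian log-determinant is simply the sum of the predicted log-scales; the dimensions are arbitrary, and a full flow would stack many such layers with permutations between them.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=4):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * self.half))

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=1)
        x2 = z2 * torch.exp(s) + t            # transform one half conditioned on the other
        log_det = s.sum(dim=1)                # log |det J| = sum of log scales
        return torch.cat([z1, x2], dim=1), log_det

flow = AffineCoupling()
z = torch.randn(8, 4)                                  # sample from the Gaussian base distribution
x, log_det = flow(z)
base = torch.distributions.Normal(0.0, 1.0)
log_px = base.log_prob(z).sum(dim=1) - log_det         # change-of-variables formula from above
print(log_px)
```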
Energy-based models assign a scalar energy value E(x) to each data point. Lower energy corresponds to higher probability. The probability distribution is defined as:
p(x) = exp(-E(x)) / Z
where Z is the partition function (normalizing constant). Because Z is typically intractable, training EBMs requires specialized techniques such as contrastive divergence or score matching.
EBMs are flexible and expressive, but sampling from them is computationally expensive, often requiring Markov chain Monte Carlo (MCMC) methods. Research by Yann LeCun and others has explored combining EBMs with modern deep learning architectures.
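The sketch below draws approximate samples from a toy energy function with Langevin dynamics, the MCMC procedure most commonly paired with deep EBMs; the quadratic energy is a stand-in for a learned neural network E(x).

```python
import torch

def energy(x):
    return 0.5 * ((x - 2.0) ** 2).sum(dim=1)      # toy energy: lowest near x = 2

x = torch.randn(64, 2, requires_grad=True)         # start the chain from noise
step = 0.01
for _ in range(500):
    e = energy(x).sum()
    grad, = torch.autograd.grad(e, x)
    # Langevin update: move downhill on the energy plus injected Gaussian noise.
    x = (x - step * grad + (2 * step) ** 0.5 * torch.randn_like(x)).detach().requires_grad_(True)
print(x.mean(dim=0))                                # samples concentrate near the low-energy region
```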
Boltzmann machines, proposed by Hinton and Sejnowski in 1985, are stochastic neural networks inspired by statistical mechanics. They define a joint distribution over visible and hidden units using an energy function. Restricted Boltzmann Machines (RBMs) simplify the architecture by removing connections between units in the same layer, making training more tractable through contrastive divergence.
RBMs were foundational to early deep learning. Stacking multiple RBMs produces Deep Belief Networks (DBNs), which were among the first successful deep generative models. While largely superseded by VAEs, GANs, and diffusion models for generation tasks, Boltzmann machines remain important in the history of generative modeling and continue to inspire research in energy-based approaches.
Naive Bayes is one of the simplest generative classifiers. It models the joint distribution P(X, Y) by assuming that all features are conditionally independent given the class label:
P(X | Y = c) = product of P(x_i | Y = c)
Despite this strong independence assumption, naive Bayes performs surprisingly well in many practical settings, especially text classification. It can also be used generatively by sampling from the learned class-conditional distributions, though the generated samples tend to be less realistic than those from deep generative models.
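The sketch below implements a Gaussian naive Bayes classifier from scratch on toy data: it fits per-class, per-feature Gaussians, classifies via Bayes' rule, and generates by sampling from a learned class-conditional distribution. The data and class means are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(100, 2))     # class 0
X1 = rng.normal([3, 3], 1.0, size=(100, 2))     # class 1
X, y = np.vstack([X0, X1]), np.array([0] * 100 + [1] * 100)

# Fit: per-class means, variances, and priors.
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
vars_ = np.array([X[y == c].var(axis=0) for c in (0, 1)])
priors = np.array([(y == c).mean() for c in (0, 1)])

def log_joint(x):
    # log P(X = x | Y = c) + log P(Y = c) under the feature-independence assumption.
    ll = -0.5 * (np.log(2 * np.pi * vars_) + (x - means) ** 2 / vars_).sum(axis=1)
    return ll + np.log(priors)

print(log_joint(np.array([2.5, 2.8])).argmax())     # Bayes' rule: pick the class with the largest joint
print(rng.normal(means[1], np.sqrt(vars_[1])))      # generate a new sample from class 1
```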
| Model Family | Year Introduced | Density Type | Likelihood | Training Stability | Sample Quality | Key Use Cases |
|---|---|---|---|---|---|---|
| Naive Bayes | 1960s | Explicit (tractable) | Exact | Very stable | Low | Text classification, spam filtering |
| Boltzmann Machines / RBMs | 1985 | Explicit (approximate) | Approximate (MCMC) | Moderate | Low to moderate | Feature learning, pretraining |
| Gaussian Mixture Models | Classical | Explicit (tractable) | Exact | Stable (EM) | Low | Clustering, density estimation |
| Variational Autoencoders | 2013 | Explicit (approximate) | Lower bound (ELBO) | Stable | Moderate (can be blurry) | Molecule design, representation learning |
| GANs | 2014 | Implicit | Not available | Unstable (mode collapse) | High (sharp images) | Image synthesis, style transfer |
| Normalizing Flows | 2015 | Explicit (tractable) | Exact | Stable | Moderate to high | Density estimation, anomaly detection |
| Autoregressive (GPT, PixelCNN) | 2016 | Explicit (tractable) | Exact | Stable | High | Text generation, language modeling |
| Diffusion Models | 2020 | Explicit (approximate) | Approximate (ELBO) | Very stable | Very high | Image and video generation |
Many generative models introduce latent variables z that represent unobserved factors of variation in the data. The observed data x is assumed to be generated from these hidden factors according to:
p(x) = integral of p(x | z) * p(z) dz
Latent variable models include VAEs, GANs (where the noise input z serves as the latent variable), and classical models like factor analysis and probabilistic PCA. The latent space often captures meaningful, interpretable features of the data. For example, in a face generation model, different latent dimensions might control pose, lighting, or expression.
The key challenge in latent variable models is inference: computing the posterior distribution p(z | x). In VAEs, this is addressed through amortized variational inference using an encoder network. In GANs, the latent-to-data mapping is learned, but inferring z from x typically requires additional techniques such as encoder networks (BiGAN) or optimization-based inversion.
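The integral above can also be approximated directly by Monte Carlo, averaging p(x|z) over samples from the prior, as in the sketch below; the linear-Gaussian decoder is a toy assumption chosen only to keep the example self-contained.

```python
import numpy as np

# Monte Carlo estimate of the marginal p(x) = integral p(x|z) p(z) dz:
# average p(x|z) over samples z drawn from the prior p(z) = N(0, 1).
rng = np.random.default_rng(0)
z = rng.normal(size=100_000)

def p_x_given_z(x, z, sigma=0.5):
    # Toy decoder likelihood p(x|z) = N(x; 2z, sigma^2).
    return np.exp(-0.5 * ((x - 2 * z) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 1.0
p_x = p_x_given_z(x, z).mean()                    # Monte Carlo estimate of the integral
print(p_x)
```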
Different generative model families use different loss functions and training objectives:

- GANs minimize an adversarial (minimax) loss between generator and discriminator.
- VAEs maximize the evidence lower bound (ELBO) on the log-likelihood.
- Autoregressive models and normalizing flows maximize the exact log-likelihood (cross-entropy over tokens in the discrete case).
- Diffusion models minimize a denoising objective, typically a mean squared error on the predicted noise, derived from a variational bound.
- Energy-based models are trained with techniques such as contrastive divergence or score matching.
Evaluating generative models is challenging because there is no single metric that captures all aspects of generation quality. Common metrics include the Fréchet Inception Distance (FID), the Inception Score (IS), and perplexity, each described below, along with several complementary measures.
FID, introduced by Heusel et al. in 2017, is the most widely used metric for evaluating generated images. It extracts features from both real and generated images using a pretrained Inception v3 network, fits multivariate Gaussians to each set of features, and computes the Fréchet distance between the two distributions. Lower FID indicates higher quality and diversity. FID captures both fidelity (realism) and diversity in a single number.
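The sketch below shows the core Fréchet-distance computation on stand-in feature matrices; in a real FID pipeline the features would come from Inception v3 activations.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(5000, 64))   # stand-in for Inception features of real images
fake_feats = rng.normal(0.1, 1.1, size=(5000, 64))   # stand-in for features of generated images

mu_r, cov_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
mu_f, cov_f = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)

# FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
covmean = sqrtm(cov_r @ cov_f)
if np.iscomplexobj(covmean):                 # numerical noise can produce tiny imaginary parts
    covmean = covmean.real
fid = np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean)
print(fid)
```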
The Inception Score, introduced by Salimans et al. in 2016, evaluates generated images based on two criteria: each image should be clearly classifiable (low entropy in the class prediction), and the set of generated images should cover many classes (high entropy in the marginal class distribution). Higher IS indicates better quality and diversity. However, IS has notable limitations: it does not compare against real data, and it can be fooled by generators that produce high-confidence but unrealistic images.
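The score itself is straightforward to compute once per-image class probabilities are available, as in the sketch below; random probabilities stand in for the outputs of a pretrained classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(10), size=5000)        # p(y | x) for 5000 generated images (rows sum to 1)

marginal = probs.mean(axis=0)                         # p(y), the marginal class distribution
# IS = exp( mean over images of KL( p(y|x) || p(y) ) )
kl = (probs * (np.log(probs) - np.log(marginal))).sum(axis=1)
inception_score = np.exp(kl.mean())
print(inception_score)
```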
For text generation models, perplexity measures how well the model predicts a held-out test set. It is defined as the exponentiation of the average negative log-likelihood per token. Lower perplexity indicates that the model assigns higher probability to the actual text, suggesting better language modeling. Perplexity is the standard evaluation metric for large language models.
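The computation is a one-liner once per-token log-probabilities are available, as in the sketch below with made-up values.

```python
import numpy as np

# Perplexity = exp( average negative log-likelihood per token ).
# token_log_probs stands in for the log-probabilities a language model
# assigns to each token of a held-out text.
token_log_probs = np.log([0.2, 0.05, 0.5, 0.1, 0.3])   # toy values
perplexity = np.exp(-token_log_probs.mean())
print(perplexity)   # lower is better
```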
Other evaluation approaches include Kernel Inception Distance (KID), precision and recall metrics for generative models (measuring fidelity and diversity separately), CLIP score (for text-image alignment), and human evaluation studies.
Generative models have found applications across a wide range of fields:
Diffusion models and GANs can generate photorealistic images from text descriptions, perform style transfer, fill in missing regions (inpainting), increase resolution (super-resolution), and edit specific attributes of existing images. Products like Midjourney, DALL-E, and Stable Diffusion have made these capabilities accessible to millions of users.
Autoregressive large language models generate coherent, contextually appropriate text for a vast range of applications: creative writing, code generation, translation, summarization, question answering, and conversational AI. Systems like ChatGPT, Claude, and Gemini represent the current state of the art.
Generative models, particularly VAEs and diffusion models, are used to design novel molecular structures with desired pharmacological properties. These models can explore the vast chemical space more efficiently than traditional high-throughput screening, generating candidates optimized for drug-likeness, synthetic accessibility, and target binding affinity. Real-world successes include the AI-driven discovery of novel antibiotics effective against multidrug-resistant infections.
Generative models create synthetic training data to improve the performance of supervised learning systems, especially when real labeled data is scarce or expensive to collect. This is particularly valuable in medical imaging, where patient privacy concerns and annotation costs limit dataset size.
Models that learn the normal data distribution can flag anomalies as inputs with low likelihood under p_model(x). This approach is used in fraud detection, manufacturing quality control, network intrusion detection, and medical diagnostics.
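A minimal sketch of this idea, using a Gaussian mixture as the density model and a percentile of training likelihoods as the threshold, is shown below; any density model with a likelihood (or a likelihood proxy) could take its place.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, size=(1000, 2))     # "normal" training data

gmm = GaussianMixture(n_components=3, random_state=0).fit(normal_data)
threshold = np.percentile(gmm.score_samples(normal_data), 1)   # bottom 1% of training log-likelihoods

queries = np.array([[0.1, -0.2], [6.0, 6.0]])                  # in-distribution point vs. outlier
print(gmm.score_samples(queries) < threshold)                   # flag inputs with low likelihood
```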
Generative models now create music compositions, voice synthesis, sound effects, and even full video clips. Models like Sora (video), AudioCraft (audio), and Suno (music) demonstrate the breadth of creative applications.
Beyond drug discovery, generative models assist in protein structure prediction, materials science (designing new materials with target properties), weather forecasting, and particle physics simulation.
The period from 2022 onward marked a transformative moment for generative models, bringing them from research labs into mainstream use:

- Text-to-image systems such as DALL-E 2, Midjourney, and Stable Diffusion (all released in 2022) put high-quality image generation in the hands of the public.
- The launch of ChatGPT in November 2022 made conversational large language models a mass-market product, followed by GPT-4, Claude, and Gemini.
- Video generation models such as Sora, announced in 2024, extended the same techniques to moving images.
This revolution was driven by three converging factors: advances in model architectures (transformers, diffusion), massive increases in compute and training data, and breakthroughs in alignment and instruction tuning that made models useful for everyday tasks.
Despite remarkable progress, generative models face several ongoing challenges:

- The high computational cost and energy consumption of training state-of-the-art models.
- Factual errors ("hallucinations") and biases inherited from training data.
- Copyright, consent, and data-provenance questions around training corpora.
- Potential misuse, including deepfakes and large-scale misinformation.
- The difficulty of evaluation, since no single metric captures quality, diversity, and faithfulness at once.
Imagine you spend a long time looking at thousands of pictures of cats. Eventually, you get so good at understanding what cats look like that you can close your eyes and draw a brand new cat from your imagination. It would not be a copy of any cat you saw before, but it would still look like a real cat because you learned the "rules" of what makes a cat look like a cat.
That is basically what a generative model does. It looks at tons of examples (pictures, sentences, molecules, or anything else), figures out the hidden patterns and rules behind them, and then uses those rules to create brand new examples that look just like the real ones. Some generative models are really good at making pictures, others are great at writing stories, and some even help scientists invent new medicines.