Minimax loss is a loss function rooted in game theory and decision theory that measures the worst-case performance of a strategy, algorithm, or model. In machine learning, the term most commonly refers to the adversarial objective function used to train generative adversarial networks (GANs), where a generator and a discriminator compete in a two-player zero-sum game. More broadly, minimax loss also appears in robust optimization, adversarial training, statistical decision theory, and classical game theory, wherever the goal is to minimize the maximum possible loss under uncertainty or adversarial conditions.
The minimax principle originates with John von Neumann, who proved the minimax theorem in 1928 in his paper "Zur Theorie der Gesellschaftsspiele" (On the Theory of Parlor Games). The theorem states that every finite two-player zero-sum game has optimal mixed strategies for both players; writing f(x, y) for the first player's expected payoff under mixed strategies x and y:
max_x min_y f(x, y) = min_y max_x f(x, y)
Von Neumann considered this result essential to game theory, reportedly stating: "As far as I can see, there could be no theory of games without that theorem." He first presented the theorem in December 1926 to the Göttingen Mathematical Society and later published the full proof in 1928. Jean Ville provided an elegant simplified proof in 1938 that highlighted the role of convexity. Von Neumann and Oskar Morgenstern later incorporated the theorem into their 1944 book Theory of Games and Economic Behavior, which established game theory as a formal discipline.
In statistical decision theory, Abraham Wald extended the minimax principle during the 1940s. Wald framed statistical estimation as a game between a statistician and nature, where the statistician seeks to minimize the maximum risk (expected loss) over all possible states of nature. Leonard Savage introduced a related concept, minimax regret, in 1951, which minimizes the worst-case difference between the loss of the chosen action and the loss of the best possible action in hindsight.
Given a decision rule d chosen from a set of possible rules D, and a set of scenarios or states of nature S, the minimax loss is defined as:
min_{d in D} max_{s in S} L(d, s)
where L(d, s) represents the loss incurred by decision rule d under scenario s. The minimax strategy selects the rule that performs best in the worst case.
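For finite D and S, this computation is a direct scan over a loss matrix. A minimal NumPy sketch (the loss values are made up for illustration):

```python
import numpy as np

# Hypothetical loss matrix: rows are decision rules d, columns are states s.
L = np.array([
    [2.0, 6.0, 1.0],   # rule 0
    [4.0, 3.0, 4.0],   # rule 1
    [1.0, 8.0, 2.0],   # rule 2
])

worst_case = L.max(axis=1)          # max_s L(d, s) for each rule d -> [6. 4. 8.]
minimax_rule = worst_case.argmin()  # argmin_d max_s L(d, s) -> rule 1

print(worst_case, minimax_rule)
```

Rule 1 has the smallest worst-case loss (4), even though other rules do better in particular states; that trade-off is exactly the conservatism the minimax criterion encodes.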
Ian Goodfellow introduced the minimax formulation for GANs in 2014. The objective function is:
min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]
where:
| Symbol | Meaning |
|---|---|
| G | The generator network that maps random noise to synthetic data |
| D | The discriminator network that classifies inputs as real or fake |
| x | Real data samples drawn from the true data distribution p_data |
| z | Random noise vectors drawn from a prior distribution p_z (typically Gaussian or uniform) |
| D(x) | Discriminator's estimated probability that x is real |
| G(z) | Generator's output given noise input z |
| D(G(z)) | Discriminator's estimated probability that the generated sample G(z) is real |
| E | Expected value operator |
The discriminator tries to maximize this objective by correctly distinguishing real samples from generated ones. The generator tries to minimize it by producing samples that the discriminator classifies as real.
For a fixed generator G, the optimal discriminator D* can be derived by taking the functional derivative of V(D, G) with respect to D. The result is:
D*(x) = p_data(x) / (p_data(x) + p_g(x))
where p_g is the distribution of samples produced by the generator. When the generator perfectly matches the data distribution (p_g = p_data), the optimal discriminator outputs 1/2 for every input, meaning it cannot distinguish real from fake.
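The pointwise calculation behind this result is short enough to spell out (the shorthand a, b, and y is ours): at each point x, write a = p_data(x), b = p_g(x), and y = D(x). The integrand of V(D, G) is then a log y + b log(1 - y), and setting its derivative a/y - b/(1 - y) to zero gives y = a / (a + b), which is exactly D*(x).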
Substituting the optimal discriminator D* back into the value function yields:
V(G, D*) = -log(4) + 2 * D_JS(p_data || p_g)
where D_JS is the Jensen-Shannon divergence. This means that when the discriminator is optimal, the generator's objective reduces to minimizing the Jensen-Shannon divergence between the real data distribution and the generated distribution. The global minimum of V(G, D*) is achieved if and only if p_g = p_data, at which point V(G, D*) = -log(4).
This result, proven as Theorem 1 in Goodfellow et al. (2014), provides the theoretical justification for GAN training: the minimax game converges to the point where the generator perfectly replicates the data distribution.
The discriminator is trained to maximize the value function or, equivalently, to minimize its negation. In practice, the discriminator loss is computed using binary cross-entropy:
L_D = -E_{x ~ p_data}[log D(x)] - E_{z ~ p_z}[log(1 - D(G(z)))]
The first term encourages the discriminator to assign high probability to real samples. The second term encourages it to assign low probability to generated (fake) samples.
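As a concrete sketch, assuming a PyTorch setup in which the discriminator ends in a sigmoid (so its outputs are probabilities) and all names are illustrative, L_D is the sum of two binary cross-entropy terms:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, generator, real_batch, noise):
    """Binary cross-entropy form of the minimax discriminator loss L_D."""
    real_scores = discriminator(real_batch)                  # D(x) in (0, 1)
    fake_scores = discriminator(generator(noise).detach())   # D(G(z)); detach so no gradient flows into G

    # -E[log D(x)]: push scores on real samples toward 1
    real_loss = F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
    # -E[log(1 - D(G(z)))]: push scores on fake samples toward 0
    fake_loss = F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores))
    return real_loss + fake_loss
```

Minimizing this loss is the same as maximizing V(D, G), since L_D = -V(D, G).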
In the original minimax formulation, the generator minimizes:
L_G = E_{z ~ p_z}[log(1 - D(G(z)))]
Since the generator has no control over the term involving real data, it influences only the second term of the value function. The generator tries to push D(G(z)) as close to 1 as possible, so that log(1 - D(G(z))) becomes very negative.
Goodfellow noted in the original GAN paper that the minimax generator loss can lead to vanishing gradients early in training. When the generator is poor and D(G(z)) is near 0, the gradient of log(1 - D(G(z))) with respect to G is very small, providing little learning signal.
The recommended alternative is the non-saturating loss, where the generator instead maximizes log(D(G(z))):
L_G = -E_{z ~ p_z}[log D(G(z))]
This formulation provides stronger gradients when D(G(z)) is near 0 (early in training), helping the generator learn more quickly. Both formulations have the same fixed point, but the non-saturating version has better gradient dynamics. In practice, nearly all GAN implementations use this modified loss rather than the original minimax generator loss.
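In code, the two generator losses differ by only a few characters. A sketch under the same illustrative PyTorch setup as above:

```python
import torch
import torch.nn.functional as F

def generator_loss_minimax(discriminator, generator, noise):
    """Original saturating loss: minimize E[log(1 - D(G(z)))]."""
    fake_scores = discriminator(generator(noise))
    return torch.log(1.0 - fake_scores).mean()

def generator_loss_nonsaturating(discriminator, generator, noise):
    """Non-saturating loss: minimize -E[log D(G(z))]; same fixed point, stronger early gradients."""
    fake_scores = discriminator(generator(noise))
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores))
```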
The minimax objective defines a saddle point problem. The equilibrium corresponds to a point that is simultaneously a minimum with respect to G and a maximum with respect to D. Finding such saddle points is generally harder than standard minimization, because gradient descent on minimax problems does not have the same convergence guarantees as gradient descent on convex functions.
In practice, GANs are trained by alternating gradient ascent steps on the discriminator with gradient descent steps on the generator. This approach, gradient descent-ascent (GDA), run either simultaneously or in alternation, can exhibit several pathological behaviors.
Naive application of gradient descent-ascent is known to produce oscillatory behavior or outright divergence, even in simple bilinear settings. The generator and discriminator can cycle around the equilibrium without converging to it. In more complex settings, the generated distribution can oscillate between different modes of the data distribution, never settling on a stable output.
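The bilinear failure case is easy to reproduce. A minimal NumPy sketch of simultaneous GDA on f(x, y) = xy, whose unique saddle point is the origin:

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    gx, gy = y, x                       # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy     # x descends on f, y ascends on f

print(np.hypot(x, y))  # roughly 3.8, up from the initial sqrt(2) ~ 1.41
```

Each simultaneous update multiplies the distance from the origin by sqrt(1 + lr^2) > 1, so the iterates spiral outward instead of converging.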
Heusel et al. (2017) proved that GANs trained using a two time-scale update rule (TTUR), where the discriminator and generator use different learning rates, converge to a stationary local Nash equilibrium under mild assumptions. However, Farnia and Ozdaglar (2020) showed that GAN zero-sum games may not always have local Nash equilibria, and that typical GAN architectures may converge to stationary non-Nash minimax points instead.
These theoretical results highlight a gap between the idealized minimax framework and the practical behavior of trained GANs.
When the discriminator becomes too strong relative to the generator, the discriminator's output for generated samples approaches 0. In the original minimax formulation, the gradient of log(1 - D(G(z))) with respect to G becomes vanishingly small, causing the generator to stop learning. This is a consequence of the Jensen-Shannon divergence saturating when the two distributions have little or no overlap.
Mode collapse occurs when the generator learns to produce only a narrow subset of the possible outputs, focusing on a few modes of the data distribution that are effective at fooling the current discriminator. The generator finds a few outputs that reliably receive high discriminator scores and repeatedly produces variations of those outputs, ignoring the full diversity of the training data.
Mode collapse is partly a consequence of the minimax training dynamics. The generator optimizes against the current discriminator, not against the optimal discriminator. As training alternates, the generator can "chase" the discriminator's weaknesses rather than learning the full distribution.
GAN training is notoriously difficult to stabilize. Small changes in hyperparameters, architecture, or initialization can lead to dramatically different outcomes. The minimax objective does not define a standard loss curve that decreases over time; instead, the discriminator and generator losses fluctuate as they adapt to each other. This makes it difficult to monitor training progress or determine when training has converged.
Several alternative loss functions have been proposed to address the shortcomings of the original minimax objective.
| Loss function | Year | Key idea | Divergence minimized | Main advantage |
|---|---|---|---|---|
| Original minimax | 2014 | Two-player zero-sum game | Jensen-Shannon divergence | Theoretically motivated |
| Non-saturating | 2014 | Generator maximizes log D(G(z)) | Same fixed point as minimax | Stronger gradients early in training |
| Wasserstein loss (WGAN) | 2017 | Critic outputs unbounded scores; uses Earth Mover's distance | Wasserstein-1 distance | Avoids vanishing gradients; more stable training |
| WGAN-GP | 2017 | Gradient penalty replaces weight clipping in WGAN | Wasserstein-1 distance | Better Lipschitz constraint enforcement |
| LSGAN (Least Squares) | 2017 | Uses mean squared error instead of cross-entropy | Pearson chi-squared divergence | Penalizes samples far from decision boundary |
| Hinge loss | 2017 | Margin-based loss for discriminator | Related to reverse KL divergence | Works well with spectral normalization |
| f-GAN | 2016 | Generalizes GAN training to any f-divergence | Any f-divergence (KL, reverse KL, Hellinger, etc.) | Flexible choice of divergence measure |
Arjovsky, Chintala, and Bottou (2017) proposed replacing the Jensen-Shannon divergence with the Wasserstein-1 distance (Earth Mover's distance). The WGAN objective is:
min_G max_{D in 1-Lipschitz} E_{x ~ p_data}[D(x)] - E_{z ~ p_z}[D(G(z))]
The discriminator (called a "critic" in WGAN) outputs real-valued scores rather than probabilities, and must satisfy a 1-Lipschitz constraint. The original WGAN enforced this constraint through weight clipping, which was later replaced by a gradient penalty term (WGAN-GP) by Gulrajani et al. (2017).
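A sketch of the gradient penalty term, following the form in Gulrajani et al. (2017); the function and variable names are ours, and the weight 10 is the paper's default:

```python
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """WGAN-GP: penalize deviations of the critic's gradient norm from 1."""
    # Random interpolation points between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1.0) ** 2).mean()
```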
WGAN provides meaningful loss curves that correlate with sample quality, mitigates mode collapse by allowing the critic to be trained to optimality, and provides non-zero gradients even when the real and generated distributions have disjoint supports.
Mao et al. (2017) proposed using least squares loss instead of cross-entropy for the discriminator. The LSGAN objective penalizes generated samples based on their squared distance from the decision boundary, rather than using a logarithmic penalty. This produces gradients for samples that are far from the boundary, even when they are on the correct side, reducing the vanishing gradient problem.
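A sketch of the least-squares losses with the common 0/1 target coding (names are illustrative; here the discriminator outputs raw scores rather than probabilities):

```python
import torch

def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real targets 1, fake targets 0."""
    return 0.5 * ((real_scores - 1.0) ** 2).mean() + 0.5 * (fake_scores ** 2).mean()

def lsgan_g_loss(fake_scores):
    """Least-squares generator loss: pull fake scores toward the real target 1."""
    return 0.5 * ((fake_scores - 1.0) ** 2).mean()
```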
The hinge loss applies a margin-based criterion to the discriminator. It has been shown to work particularly well when combined with spectral normalization, which constrains the Lipschitz constant of the discriminator by normalizing each layer's weight matrix by its largest singular value. Miyato et al. (2018) demonstrated that spectral normalization with hinge loss produces stable training across a range of architectures.
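The hinge losses are similarly compact; as with WGAN, the discriminator outputs unbounded scores. A sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss: enforce a margin of 1 on both real and fake scores."""
    return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

def hinge_g_loss(fake_scores):
    """Hinge generator loss: raise the discriminator's score on generated samples."""
    return -fake_scores.mean()
```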
Nowozin, Cseke, and Tomioka (2016) showed that the original GAN objective is a special case of variational divergence minimization. Their f-GAN framework allows training with any f-divergence, including KL divergence, reverse KL divergence, Hellinger distance, and others. The f-GAN objective takes the form:
F(theta, omega) = E_{x ~ P}[T_omega(x)] - E_{x ~ Q_theta}[f*(T_omega(x))]
where f* is the Fenchel conjugate of the chosen f-divergence generator function, and T_omega is a variational function parameterized by omega.
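As a concrete instance (computed from the definition of the Fenchel conjugate, not taken from the original paper's notation): for the KL divergence, f(u) = u log u has conjugate f*(t) = exp(t - 1), so the objective becomes E_{x ~ P}[T_omega(x)] - E_{x ~ Q_theta}[exp(T_omega(x) - 1)].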
Beyond generative modeling, minimax loss plays a central role in adversarial robustness. Madry et al. (2018) framed adversarial training as a minimax optimization problem:
min_theta E_{(x,y) ~ D}[max_{delta in S} L(f_theta(x + delta), y)]
where theta represents model parameters, (x, y) are data-label pairs, delta is an adversarial perturbation constrained to lie within a set S (typically an L_p ball), and L is the classification loss. The inner maximization finds the worst-case perturbation (the adversarial attack), while the outer minimization trains the model to be robust against such attacks.
This formulation, known as PGD-based adversarial training, uses projected gradient descent to approximate the inner maximization. It remains one of the most effective methods for training neural networks that are resistant to adversarial examples, though it comes at a computational cost of roughly (k + 1) times the cost of standard training, where k is the number of PGD steps.
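A sketch of the PGD inner maximization for an L_inf constraint set, assuming image inputs scaled to [0, 1]; the function name and the eps, alpha, and step values are illustrative choices (typical for CIFAR-10-scale experiments), not prescribed by the source:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate max_{delta in S} L(f_theta(x + delta), y) for S an L_inf ball."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascent step on the classification loss
            delta.clamp_(-eps, eps)        # project back into the L_inf ball
    return (x + delta).clamp(0, 1).detach()  # keep perturbed inputs in valid range
```

The outer minimization then trains on pgd_attack(model, x, y) in place of x at each step.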
In classical statistical decision theory, the minimax estimator is the decision rule that minimizes the maximum risk over all possible parameter values. For an estimation problem with parameter space Theta, the minimax risk is:
R_minimax = min_d max_{theta in Theta} E[L(d(X), theta)]
where d is an estimator, X is the observed data, and L is a loss function (such as squared error). The minimax estimator is the one that achieves this minimum.
Wald showed that under certain regularity conditions, the minimax estimator corresponds to the Bayes estimator with respect to the least favorable prior. This connection between minimax and Bayesian estimation is a cornerstone of statistical decision theory.
Minimax regret, introduced by Savage (1951), offers a less conservative alternative. Instead of minimizing the maximum absolute loss, it minimizes the maximum regret, defined as the difference between the actual loss and the loss of the best decision in hindsight:
min_d max_{s in S} [L(d, s) - min_{d'} L(d', s)]
Savage described the standard minimax criterion as "ultrapessimistic" and argued that minimax regret provides a more reasonable basis for decision-making under uncertainty.
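Reusing the hypothetical loss matrix from the earlier sketch, minimax regret is one subtraction away from plain minimax:

```python
import numpy as np

L = np.array([
    [2.0, 6.0, 1.0],
    [4.0, 3.0, 4.0],
    [1.0, 8.0, 2.0],
])

regret = L - L.min(axis=0)                    # regret(d, s) = L(d, s) - min_d' L(d', s)
worst_regret = regret.max(axis=1)             # [3. 3. 5.]
minimax_regret_rule = worst_regret.argmin()   # rule 0 (tied with rule 1)

print(worst_regret, minimax_regret_rule)
```

Plain minimax chose rule 1 on this matrix; under the regret criterion, rules 0 and 1 tie, illustrating that the two criteria can rank decisions differently.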
Training GANs with minimax-style objectives requires careful attention to several practical details.
| Technique | Description | Benefit |
|---|---|---|
| Two time-scale update rule (TTUR) | Use different learning rates for the generator and discriminator, typically higher for the discriminator | Provable convergence to local Nash equilibrium |
| Spectral normalization | Normalize discriminator weight matrices by their largest singular value | Stabilizes discriminator; enforces Lipschitz constraint cheaply |
| Gradient penalty | Add a penalty term based on the norm of the discriminator's gradients | Better Lipschitz constraint enforcement than weight clipping |
| Label smoothing | Use soft labels (e.g., 0.9 instead of 1.0 for real samples) | Prevents discriminator overconfidence |
| Multiple discriminator updates | Update the discriminator several times per generator update | Keeps the discriminator closer to optimality |
| Progressive growing | Start with low resolution and progressively increase image size | Stabilizes training for high-resolution generation |
| Batch normalization | Normalize intermediate layer activations | Reduces internal covariate shift; may help or hurt depending on architecture |
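To make the TTUR and multiple-update entries in the table concrete, here is a self-contained toy sketch; the architectures, learning rates, and 1-D data are illustrative choices, not a reference implementation:

```python
import torch
import torch.nn as nn

latent_dim = 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

# TTUR: a higher learning rate for the discriminator than for the generator
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.9))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))

bce = nn.BCELoss()
n_critic = 5  # discriminator updates per generator update

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # toy "real" data: mean 2.0, std 0.5
    noise = torch.randn(64, latent_dim)

    # Discriminator step: minimize the negated value function
    opt_d.zero_grad()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) \
           + bce(discriminator(generator(noise).detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating loss), once every n_critic iterations
    if step % n_critic == 0:
        opt_g.zero_grad()
        g_loss = bce(discriminator(generator(noise)), torch.ones(64, 1))
        g_loss.backward()
        opt_g.step()
```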
The minimax loss connects to several other topics in machine learning and mathematics.
Nash equilibrium. In a two-player zero-sum game, the minimax solution corresponds to a Nash equilibrium. At this point, neither player can improve their outcome by unilaterally changing their strategy. For GANs, the Nash equilibrium is reached when the generator produces the true data distribution and the discriminator outputs 1/2 everywhere.
Saddle point problems. The minimax objective defines a saddle point in the joint parameter space of G and D. The equilibrium is a minimum along the G dimensions and a maximum along the D dimensions. Optimization methods for saddle points, such as the extragradient method, have been applied to improve GAN training.
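On the bilinear game from the earlier sketch, the extragradient method illustrates the difference: it looks ahead with a provisional step, then updates using the gradients at the look-ahead point.

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    x_half, y_half = x - lr * y, y + lr * x     # extrapolation (look-ahead) step
    x, y = x - lr * y_half, y + lr * x_half     # update with look-ahead gradients

print(np.hypot(x, y))  # shrinks toward 0: extragradient converges where plain GDA spiraled out
```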
Robust optimization. Minimax loss is a special case of robust optimization, where the goal is to find a solution that performs well under the worst-case realization of uncertain parameters. This perspective unifies GAN training, adversarial robustness, and distributionally robust optimization.
Support vector machines. SVMs can be viewed through a minimax lens. The SVM objective maximizes the minimum margin over all training examples, seeking the classifier that performs best on the hardest-to-classify points.
Imagine two kids playing a game. One kid, the faker (the generator), is trying to draw fake dollar bills that look real. The other kid, the detective (the discriminator), is trying to tell which bills are real and which are fake.
The faker keeps practicing and getting better at drawing. The detective keeps practicing and getting better at spotting fakes. They keep going back and forth, each one pushing the other to improve.
Minimax loss is the score they use to keep track of the game. The faker wants to make the score as low as possible (meaning the detective is fooled). The detective wants to make the score as high as possible (meaning they catch all the fakes). The game ends when the faker's drawings are so good that the detective has to guess randomly, getting it right only half the time.
In the broader sense, minimax loss is just a way of preparing for the worst. Instead of hoping for the best, you plan for the toughest situation, so no matter what happens, you do at least okay.