Minimax loss is a loss function rooted in game theory and decision theory that measures the worst-case performance of a strategy, algorithm, or model. In machine learning, the term most commonly refers to the adversarial objective function used to train generative adversarial networks (GANs), where a generator and a discriminator compete in a two-player zero-sum game. More broadly, minimax loss also appears in robust optimization, adversarial training, statistical decision theory, and classical game theory, wherever the goal is to minimize the maximum possible loss under uncertainty or adversarial conditions.
The minimax principle originates with John von Neumann, who proved the minimax theorem in 1928 in his paper "Zur Theorie der Gesellschaftsspiele" (On the Theory of Parlor Games). The theorem states that every finite two-player zero-sum game has optimal mixed strategies for both players; writing f(x, y) for the first player's expected payoff under mixed strategies x and y:
max_x min_y f(x, y) = min_y max_x f(x, y)
Von Neumann considered this result essential to game theory, reportedly stating: "As far as I can see, there could be no theory of games without that theorem." He first presented the theorem in December 1926 to the Göttingen Mathematical Society and later published the full proof in 1928. Jean Ville provided an elegant simplified proof in 1938 that highlighted the role of convexity. Von Neumann and Oskar Morgenstern later incorporated the theorem into their 1944 book Theory of Games and Economic Behavior, which established game theory as a formal discipline.
In statistical decision theory, Abraham Wald extended the minimax principle during the 1940s. Wald framed statistical estimation as a game between a statistician and nature, where the statistician seeks to minimize the maximum risk (expected loss) over all possible states of nature. Leonard Savage introduced a related concept, minimax regret, in 1951, which minimizes the worst-case difference between the loss of the chosen action and the loss of the best possible action in hindsight.
Given a decision rule d chosen from a set of possible rules D, and a set of scenarios or states of nature S, the minimax loss is defined as:
min_{d in D} max_{s in S} L(d, s)
where L(d, s) represents the loss incurred by decision rule d under scenario s. The minimax strategy selects the rule that performs best in the worst case.
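For finite D and S, this computation is a direct scan over a loss matrix. A minimal NumPy sketch (the loss values are made up for illustration):

```python
import numpy as np

# Hypothetical loss matrix: rows are decision rules d, columns are states s.
L = np.array([
    [2.0, 6.0, 1.0],   # rule 0
    [4.0, 3.0, 4.0],   # rule 1
    [1.0, 8.0, 2.0],   # rule 2
])

worst_case = L.max(axis=1)          # max_s L(d, s) for each rule d -> [6. 4. 8.]
minimax_rule = worst_case.argmin()  # argmin_d max_s L(d, s) -> rule 1

print(worst_case, minimax_rule)
```

Rule 1 has the smallest worst-case loss (4), even though other rules do better in particular states; that trade-off is exactly the conservatism the minimax criterion encodes.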
Ian Goodfellow introduced the minimax formulation for GANs in 2014. The objective function is:
min_G max_D V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]
where:
| Symbol | Meaning |
|---|---|
| G | The generator network that maps random noise to synthetic data |
| D | The discriminator network that classifies inputs as real or fake |
| x | Real data samples drawn from the true data distribution p_data |
| z | Random noise vectors drawn from a prior distribution p_z (typically Gaussian or uniform) |
| D(x) | Discriminator's estimated probability that x is real |
| G(z) | Generator's output given noise input z |
| D(G(z)) | Discriminator's estimated probability that the generated sample G(z) is real |
| E | Expected value operator |
The discriminator tries to maximize this objective by correctly distinguishing real samples from generated ones. The generator tries to minimize it by producing samples that the discriminator classifies as real.
For a fixed generator G, the optimal discriminator D* can be derived by taking the functional derivative of V(D, G) with respect to D. The result is:
D*(x) = p_data(x) / (p_data(x) + p_g(x))
where p_g is the distribution of samples produced by the generator. When the generator perfectly matches the data distribution (p_g = p_data), the optimal discriminator outputs 1/2 for every input, meaning it cannot distinguish real from fake.
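The pointwise calculation behind this result is short enough to spell out (the shorthand a, b, and y is ours): at each point x, write a = p_data(x), b = p_g(x), and y = D(x). The integrand of V(D, G) is then a log y + b log(1 - y), and setting its derivative a/y - b/(1 - y) to zero gives y = a / (a + b), which is exactly D*(x).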
Substituting the optimal discriminator D* back into the value function yields:
V(G, D*) = -log(4) + 2 * D_JS(p_data || p_g)
where D_JS is the Jensen-Shannon divergence. This means that when the discriminator is optimal, the generator's objective reduces to minimizing the Jensen-Shannon divergence between the real data distribution and the generated distribution. The global minimum of V(G, D*) is achieved if and only if p_g = p_data, at which point V(G, D*) = -log(4).
This result, proven as Theorem 1 in Goodfellow et al. (2014), provides the theoretical justification for GAN training: the minimax game converges to the point where the generator perfectly replicates the data distribution.
The discriminator is trained to maximize the value function or, equivalently, to minimize its negation. In practice, the discriminator loss is computed using binary cross-entropy:
L_D = -E_{x ~ p_data}[log D(x)] - E_{z ~ p_z}[log(1 - D(G(z)))]
The first term encourages the discriminator to assign high probability to real samples. The second term encourages it to assign low probability to generated (fake) samples.
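As a concrete sketch, assuming a PyTorch setup in which the discriminator ends in a sigmoid (so its outputs are probabilities) and all names are illustrative, L_D is the sum of two binary cross-entropy terms:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, generator, real_batch, noise):
    """Binary cross-entropy form of the minimax discriminator loss L_D."""
    real_scores = discriminator(real_batch)                  # D(x) in (0, 1)
    fake_scores = discriminator(generator(noise).detach())   # D(G(z)); detach so no gradient flows into G

    # -E[log D(x)]: push scores on real samples toward 1
    real_loss = F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
    # -E[log(1 - D(G(z)))]: push scores on fake samples toward 0
    fake_loss = F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores))
    return real_loss + fake_loss
```

Minimizing this loss is the same as maximizing V(D, G), since L_D = -V(D, G).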
In the original minimax formulation, the generator minimizes:
L_G = E_{z ~ p_z}[log(1 - D(G(z)))]
Since the generator has no control over the term involving real data, it influences only the second term of the value function. The generator tries to push D(G(z)) as close to 1 as possible, so that log(1 - D(G(z))) becomes very negative.
Goodfellow noted in the original GAN paper that the minimax generator loss can lead to vanishing gradients early in training. When the generator is poor and D(G(z)) is near 0, the gradient of log(1 - D(G(z))) with respect to G is very small, providing little learning signal.
The recommended alternative is the non-saturating loss, where the generator instead maximizes log(D(G(z))):
L_G = -E_{z ~ p_z}[log D(G(z))]
This formulation provides stronger gradients when D(G(z)) is near 0 (early in training), helping the generator learn more quickly. Both formulations have the same fixed point, but the non-saturating version has better gradient dynamics. In practice, nearly all GAN implementations use this modified loss rather than the original minimax generator loss.
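In code, the two generator losses differ by only a few characters. A sketch under the same illustrative PyTorch setup as above:

```python
import torch
import torch.nn.functional as F

def generator_loss_minimax(discriminator, generator, noise):
    """Original saturating loss: minimize E[log(1 - D(G(z)))]."""
    fake_scores = discriminator(generator(noise))
    return torch.log(1.0 - fake_scores).mean()

def generator_loss_nonsaturating(discriminator, generator, noise):
    """Non-saturating loss: minimize -E[log D(G(z))]; same fixed point, stronger early gradients."""
    fake_scores = discriminator(generator(noise))
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores))
```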
The minimax objective defines a saddle point problem. The equilibrium corresponds to a point that is simultaneously a minimum with respect to G and a maximum with respect to D. Finding such saddle points is generally harder than standard minimization, because gradient descent on minimax problems does not have the same convergence guarantees as gradient descent on convex functions.
In practice, GANs are trained by alternating gradient ascent steps on the discriminator with gradient descent steps on the generator. This approach, gradient descent-ascent (GDA), run either simultaneously or in alternation, can exhibit several pathological behaviors.
Naive application of gradient descent-ascent is known to produce oscillatory behavior or outright divergence, even in simple bilinear settings. The generator and discriminator can cycle around the equilibrium without converging to it. In more complex settings, the generated distribution can oscillate between different modes of the data distribution, never settling on a stable output.
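The bilinear failure case is easy to reproduce. A minimal NumPy sketch of simultaneous GDA on f(x, y) = xy, whose unique saddle point is the origin:

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    gx, gy = y, x                       # df/dx = y, df/dy = x
    x, y = x - lr * gx, y + lr * gy     # x descends on f, y ascends on f

print(np.hypot(x, y))  # roughly 3.8, up from the initial sqrt(2) ~ 1.41
```

Each simultaneous update multiplies the distance from the origin by sqrt(1 + lr^2) > 1, so the iterates spiral outward instead of converging.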
Heusel et al. (2017) proved that GANs trained using a two time-scale update rule (TTUR), where the discriminator and generator use different learning rates, converge to a stationary local Nash equilibrium under mild assumptions. However, Farnia and Ozdaglar (2020) showed that GAN zero-sum games may not always have local Nash equilibria, and that typical GAN architectures may converge to stationary non-Nash minimax points instead.
These theoretical results highlight a gap between the idealized minimax framework and the practical behavior of trained GANs.
When the discriminator becomes too strong relative to the generator, the discriminator's output for generated samples approaches 0. In the original minimax formulation, the gradient of log(1 - D(G(z))) with respect to G becomes vanishingly small, causing the generator to stop learning. This is a consequence of the Jensen-Shannon divergence saturating when the two distributions have little or no overlap.
Mode collapse occurs when the generator learns to produce only a narrow subset of the possible outputs, focusing on a few modes of the data distribution that are effective at fooling the current discriminator. The generator finds a few outputs that reliably receive high discriminator scores and repeatedly produces variations of those outputs, ignoring the full diversity of the training data.
Mode collapse is partly a consequence of the minimax training dynamics. The generator optimizes against the current discriminator, not against the optimal discriminator. As training alternates, the generator can "chase" the discriminator's weaknesses rather than learning the full distribution.
GAN training is notoriously difficult to stabilize. Small changes in hyperparameters, architecture, or initialization can lead to dramatically different outcomes. The minimax objective does not define a standard loss curve that decreases over time; instead, the discriminator and generator losses fluctuate as they adapt to each other. This makes it difficult to monitor training progress or determine when training has converged.
Several alternative loss functions have been proposed to address the shortcomings of the original minimax objective.
| Loss function | Year | Key idea | Divergence minimized | Main advantage |
|---|---|---|---|---|
| Original minimax | 2014 | Two-player zero-sum game | Jensen-Shannon divergence | Theoretically motivated |
| Non-saturating | 2014 | Generator maximizes log D(G(z)) | Same fixed point as minimax | Stronger gradients early in training |
| Wasserstein loss (WGAN) | 2017 | Critic outputs unbounded scores; uses Earth Mover's distance | Wasserstein-1 distance | Avoids vanishing gradients; more stable training |
| WGAN-GP | 2017 | Gradient penalty replaces weight clipping in WGAN | Wasserstein-1 distance | Better Lipschitz constraint enforcement |
| LSGAN (Least Squares) | 2017 | Uses mean squared error instead of cross-entropy | Pearson chi-squared divergence | Penalizes samples far from decision boundary |
| Hinge loss | 2017 | Margin-based loss for discriminator | Related to reverse KL divergence | Works well with spectral normalization |
| f-GAN | 2016 | Generalizes GAN training to any f-divergence | Any f-divergence (KL, reverse KL, Hellinger, etc.) | Flexible choice of divergence measure |
Arjovsky, Chintala, and Bottou (2017) proposed replacing the Jensen-Shannon divergence with the Wasserstein-1 distance (Earth Mover's distance). The WGAN objective is:
min_G max_{D in 1-Lipschitz} E_{x ~ p_data}[D(x)] - E_{z ~ p_z}[D(G(z))]
The discriminator (called a "critic" in WGAN) outputs real-valued scores rather than probabilities, and must satisfy a 1-Lipschitz constraint. The original WGAN enforced this constraint through weight clipping, which was later replaced by a gradient penalty term (WGAN-GP) by Gulrajani et al. (2017).
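A sketch of the gradient penalty term, following the form in Gulrajani et al. (2017); the function and variable names are ours, and the weight 10 is the paper's default:

```python
import torch

def gradient_penalty(critic, real, fake, gp_weight=10.0):
    """WGAN-GP: penalize deviations of the critic's gradient norm from 1."""
    # Random interpolation points between real and fake samples
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    scores = critic(interp)
    grads = torch.autograd.grad(scores.sum(), interp, create_graph=True)[0]
    grad_norm = grads.flatten(start_dim=1).norm(2, dim=1)
    return gp_weight * ((grad_norm - 1.0) ** 2).mean()
```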
WGAN provides meaningful loss curves that correlate with sample quality, mitigates mode collapse by allowing the critic to be trained to optimality, and provides non-zero gradients even when the real and generated distributions have disjoint supports.
Mao et al. (2017) proposed using least squares loss instead of cross-entropy for the discriminator. The LSGAN objective penalizes generated samples based on their squared distance from the decision boundary, rather than using a logarithmic penalty. This produces gradients for samples that are far from the boundary, even when they are on the correct side, reducing the vanishing gradient problem.
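A sketch of the least-squares losses with the common 0/1 target coding (names are illustrative; here the discriminator outputs raw scores rather than probabilities):

```python
import torch

def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real targets 1, fake targets 0."""
    return 0.5 * ((real_scores - 1.0) ** 2).mean() + 0.5 * (fake_scores ** 2).mean()

def lsgan_g_loss(fake_scores):
    """Least-squares generator loss: pull fake scores toward the real target 1."""
    return 0.5 * ((fake_scores - 1.0) ** 2).mean()
```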
The hinge loss applies a margin-based criterion to the discriminator. It has been shown to work particularly well when combined with spectral normalization, which constrains the Lipschitz constant of the discriminator by normalizing each layer's weight matrix by its largest singular value. Miyato et al. (2018) demonstrated that spectral normalization with hinge loss produces stable training across a range of architectures.
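The hinge losses are similarly compact; as with WGAN, the discriminator outputs unbounded scores. A sketch with illustrative names:

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(real_scores, fake_scores):
    """Hinge discriminator loss: enforce a margin of 1 on both real and fake scores."""
    return F.relu(1.0 - real_scores).mean() + F.relu(1.0 + fake_scores).mean()

def hinge_g_loss(fake_scores):
    """Hinge generator loss: raise the discriminator's score on generated samples."""
    return -fake_scores.mean()
```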
Nowozin, Cseke, and Tomioka (2016) showed that the original GAN objective is a special case of variational divergence minimization. Their f-GAN framework allows training with any f-divergence, including KL divergence, reverse KL divergence, Hellinger distance, and others. The f-GAN objective takes the form:
F(theta, omega) = E_{x ~ P}[T_omega(x)] - E_{x ~ Q_theta}[f*(T_omega(x))]
where f* is the Fenchel conjugate of the chosen f-divergence generator function, and T_omega is a variational function parameterized by omega.
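As a concrete instance (computed from the definition of the Fenchel conjugate, not taken from the original paper's notation): for the KL divergence, f(u) = u log u has conjugate f*(t) = exp(t - 1), so the objective becomes E_{x ~ P}[T_omega(x)] - E_{x ~ Q_theta}[exp(T_omega(x) - 1)].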
Beyond generative modeling, minimax loss plays a central role in adversarial robustness. Madry et al. (2018) framed adversarial training as a minimax optimization problem:
min_theta E_{(x,y) ~ D}[max_{delta in S} L(f_theta(x + delta), y)]
where theta represents model parameters, (x, y) are data-label pairs, delta is an adversarial perturbation constrained to lie within a set S (typically an L_p ball), and L is the classification loss. The inner maximization finds the worst-case perturbation (the adversarial attack), while the outer minimization trains the model to be robust against such attacks.
This formulation, known as PGD-based adversarial training, uses projected gradient descent to approximate the inner maximization. It remains one of the most effective methods for training neural networks that are resistant to adversarial examples, though it comes at a computational cost of roughly (k + 1) times the cost of standard training, where k is the number of PGD steps.
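A sketch of the PGD inner maximization for an L_inf constraint set, assuming image inputs scaled to [0, 1]; the function name and the eps, alpha, and step values are illustrative choices (typical for CIFAR-10-scale experiments), not prescribed by the source:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Approximate max_{delta in S} L(f_theta(x + delta), y) for S an L_inf ball."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += alpha * grad.sign()   # ascent step on the classification loss
            delta.clamp_(-eps, eps)        # project back into the L_inf ball
    return (x + delta).clamp(0, 1).detach()  # keep perturbed inputs in valid range
```

The outer minimization then trains on pgd_attack(model, x, y) in place of x at each step.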
In classical statistical decision theory, the minimax estimator is the decision rule that minimizes the maximum risk over all possible parameter values. For an estimation problem with parameter space Theta, the minimax risk is:
R_minimax = min_d max_{theta in Theta} E[L(d(X), theta)]
where d is an estimator, X is the observed data, and L is a loss function (such as squared error). The minimax estimator is the one that achieves this minimum.
Wald showed that under certain regularity conditions, the minimax estimator corresponds to the Bayes estimator with respect to the least favorable prior. This connection between minimax and Bayesian estimation is a cornerstone of statistical decision theory.
Minimax regret, introduced by Savage (1951), offers a less conservative alternative. Instead of minimizing the maximum absolute loss, it minimizes the maximum regret, defined as the difference between the actual loss and the loss of the best decision in hindsight:
min_d max_{s in S} [L(d, s) - min_{d'} L(d', s)]
Savage described the standard minimax criterion as "ultrapessimistic" and argued that minimax regret provides a more reasonable basis for decision-making under uncertainty.
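Reusing the hypothetical loss matrix from the earlier sketch, minimax regret is one subtraction away from plain minimax:

```python
import numpy as np

L = np.array([
    [2.0, 6.0, 1.0],
    [4.0, 3.0, 4.0],
    [1.0, 8.0, 2.0],
])

regret = L - L.min(axis=0)                    # regret(d, s) = L(d, s) - min_d' L(d', s)
worst_regret = regret.max(axis=1)             # [3. 3. 5.]
minimax_regret_rule = worst_regret.argmin()   # rule 0 (tied with rule 1)

print(worst_regret, minimax_regret_rule)
```

Plain minimax chose rule 1 on this matrix; under the regret criterion, rules 0 and 1 tie, illustrating that the two criteria can rank decisions differently.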
Training GANs with minimax-style objectives requires careful attention to several practical details.
| Technique | Description | Benefit |
|---|---|---|
| Two time-scale update rule (TTUR) | Use different learning rates for the generator and discriminator, typically higher for the discriminator | Provable convergence to local Nash equilibrium |
| Spectral normalization | Normalize discriminator weight matrices by their largest singular value | Stabilizes discriminator; enforces Lipschitz constraint cheaply |
| Gradient penalty | Add a penalty term based on the norm of the discriminator's gradients | Better Lipschitz constraint enforcement than weight clipping |
| Label smoothing | Use soft labels (e.g., 0.9 instead of 1.0 for real samples) | Prevents discriminator overconfidence |
| Multiple discriminator updates | Update the discriminator several times per generator update | Keeps the discriminator closer to optimality |
| Progressive growing | Start with low resolution and progressively increase image size | Stabilizes training for high-resolution generation |
| Batch normalization | Normalize intermediate layer activations | Reduces internal covariate shift; may help or hurt depending on architecture |
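To make the TTUR and multiple-update entries in the table concrete, here is a self-contained toy sketch; the architectures, learning rates, and 1-D data are illustrative choices, not a reference implementation:

```python
import torch
import torch.nn as nn

latent_dim = 8
generator = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))
discriminator = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

# TTUR: a higher learning rate for the discriminator than for the generator
opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.5, 0.9))
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.9))

bce = nn.BCELoss()
n_critic = 5  # discriminator updates per generator update

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0   # toy "real" data: mean 2.0, std 0.5
    noise = torch.randn(64, latent_dim)

    # Discriminator step: minimize the negated value function
    opt_d.zero_grad()
    d_loss = bce(discriminator(real), torch.ones(64, 1)) \
           + bce(discriminator(generator(noise).detach()), torch.zeros(64, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating loss), once every n_critic iterations
    if step % n_critic == 0:
        opt_g.zero_grad()
        g_loss = bce(discriminator(generator(noise)), torch.ones(64, 1))
        g_loss.backward()
        opt_g.step()
```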
The minimax loss connects to several other topics in machine learning and mathematics.
Nash equilibrium. In a two-player zero-sum game, the minimax solution corresponds to a Nash equilibrium. At this point, neither player can improve their outcome by unilaterally changing their strategy. For GANs, the Nash equilibrium is reached when the generator produces the true data distribution and the discriminator outputs 1/2 everywhere.
Saddle point problems. The minimax objective defines a saddle point in the joint parameter space of G and D. The equilibrium is a minimum along the G dimensions and a maximum along the D dimensions. Optimization methods for saddle points, such as the extragradient method, have been applied to improve GAN training.
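On the bilinear game from the earlier sketch, the extragradient method illustrates the difference: it looks ahead with a provisional step, then updates using the gradients at the look-ahead point.

```python
import numpy as np

x, y, lr = 1.0, 1.0, 0.1
for _ in range(200):
    x_half, y_half = x - lr * y, y + lr * x     # extrapolation (look-ahead) step
    x, y = x - lr * y_half, y + lr * x_half     # update with look-ahead gradients

print(np.hypot(x, y))  # shrinks toward 0: extragradient converges where plain GDA spiraled out
```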
Robust optimization. Minimax loss is a special case of robust optimization, where the goal is to find a solution that performs well under the worst-case realization of uncertain parameters. This perspective unifies GAN training, adversarial robustness, and distributionally robust optimization.
Support vector machines. SVMs can be viewed through a minimax lens. The SVM objective maximizes the minimum margin over all training examples, seeking the classifier that performs best on the hardest-to-classify points.
Imagine two kids playing a game. One kid, the faker (the generator), is trying to draw fake dollar bills that look real. The other kid, the detective (the discriminator), is trying to tell which bills are real and which are fake.
The faker keeps practicing and getting better at drawing. The detective keeps practicing and getting better at spotting fakes. They keep going back and forth, each one pushing the other to improve.
Minimax loss is the score they use to keep track of the game. The faker wants to make the score as low as possible (meaning the detective is fooled). The detective wants to make the score as high as possible (meaning they catch all the fakes). The game ends when the faker's drawings are so good that the detective has to guess randomly, getting it right only half the time.
In the broader sense, minimax loss is just a way of preparing for the worst. Instead of hoping for the best, you plan for the toughest situation, so no matter what happens, you do at least okay.