See also: Machine learning terms, Statistical learning theory, Generalization
Empirical risk minimization (ERM) is the foundational principle of statistical learning theory that says: pick a hypothesis from a hypothesis class that minimizes the average loss on the training data, and use that as a stand-in for picking the hypothesis that minimizes the true (but unknown) expected loss on the underlying data distribution. Almost every supervised learning algorithm in modern machine learning, from ordinary least squares to deep neural networks, can be understood as a form of ERM, sometimes regularized, sometimes not, sometimes with surrogate losses standing in for the loss we actually care about.
ERM was formalized by Vladimir Vapnik and Alexey Chervonenkis in a series of papers and books beginning in the late 1960s, culminating in Vapnik's two influential textbooks: The Nature of Statistical Learning Theory (1995) and Statistical Learning Theory (1998) [1][2]. The principle is deceptively simple. Its depth comes from the question it forces you to confront: when, and under what conditions on the hypothesis class and the sample size, does the hypothesis that does best on training data also do well on data it has never seen? That question is what gave rise to VC dimension, Rademacher complexity, PAC learning, and an enormous body of generalization theory that still drives research on why deep networks work.
Imagine you have a box of different tools, and you want to find the best tool to fix a broken toy. ERM is like a method that helps you pick the best tool from the box by trying each one on a few other broken toys you have. You choose the tool that fixes the most toys as the best one.
However, sometimes the best tool for those few toys might not work well for other broken toys you haven't tested yet. In machine learning, this is called overfitting. To avoid this problem, you can add some rules to make sure you don't pick a tool that's too specialized and can only fix one specific kind of toy. These rules help you find a tool that works well for most toys, even the ones you haven't tested yet. That extra step of adding rules is called regularization, and it turns plain ERM into something safer.
Let (X, Y) be a random pair drawn from an unknown joint distribution D over an input space X and an output space Y. A hypothesis is a function h: X to Y from a hypothesis class H. A loss function L: Y x Y to R+ measures how bad a prediction h(x) is when the true label is y.
The quantity we actually care about is the true risk (also called expected risk or generalization error):
R(h) = E_{(x,y) ~ D} [ L(h(x), y) ]
The Bayes risk is R* = inf_h R(h), the lowest possible expected loss over all measurable functions; the function that achieves it is the Bayes-optimal predictor [3]. We do not have access to D, so we cannot compute R(h) directly. Instead we have a training sample S = {(x_1, y_1), ..., (x_n, y_n)} drawn iid from D, and we compute the empirical risk:
R_emp(h) = (1/n) * sum_{i=1}^{n} L(h(x_i), y_i)
The ERM principle prescribes choosing the hypothesis that minimizes the empirical risk over the chosen class:
h_ERM = argmin_{h in H} R_emp(h)
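As a concrete illustration, the following sketch runs ERM with the 0-1 loss over a small finite class of threshold classifiers on synthetic one-dimensional data (the dataset, the class, and the variable names here are illustrative, not from any standard library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D binary classification data (illustrative only).
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = (x > 0.1).astype(int)
noise = rng.random(n) < 0.05              # flip 5% of the labels
y = np.where(noise, 1 - y, y)

# Finite hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(-1.0, 1.0, 41)

def empirical_risk(t):
    """Average 0-1 loss of the threshold classifier h_t on the sample."""
    return np.mean((x >= t).astype(int) != y)

# ERM: choose the hypothesis with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_erm:.2f}, empirical risk: {risks.min():.3f}")
```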
The entire intellectual content of statistical learning theory is the question of how close R(h_ERM) is to R*, and how that gap depends on n, on the complexity of H, and on the underlying distribution D [3].
For a single fixed hypothesis h, the law of large numbers guarantees that R_emp(h) converges to R(h) as n grows. Hoeffding's inequality makes this quantitative: if the loss is bounded in [0, B], then for any single h,
P( | R_emp(h) - R(h) | > epsilon ) <= 2 * exp( -2 * n * epsilon^2 / B^2 )
The trouble is that h_ERM is not a fixed hypothesis. It is chosen after seeing the data, so it depends on the data, and the simple Hoeffding bound does not apply directly. The standard fix is a union bound. For a finite hypothesis class H, the probability that any h in H has empirical and true risk that disagree by more than epsilon is at most
P( sup_{h in H} | R_emp(h) - R(h) | > epsilon ) <= 2 * |H| * exp( -2 * n * epsilon^2 / B^2 )
This is the basic uniform convergence bound for finite classes [4]. Solving for epsilon shows that the gap shrinks like sqrt( log|H| / n ), which is why "more data helps" and "smaller hypothesis classes are safer" are not just slogans but theorems.
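Solving the finite-class bound for epsilon turns the slogan into numbers; the snippet below is a back-of-the-envelope calculation with illustrative values of |H|, n, and delta (none taken from the text above):

```python
import math

def finite_class_epsilon(n, num_hypotheses, delta=0.05, B=1.0):
    """Gap epsilon such that 2|H| * exp(-2 n eps^2 / B^2) = delta,
    i.e. the uniform-convergence bound for a finite class with loss in [0, B]."""
    return B * math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

# The gap shrinks like sqrt(log|H| / n): more data helps, smaller classes are safer.
for n in (100, 1_000, 10_000):
    print(n, round(finite_class_epsilon(n, num_hypotheses=10_000), 4))
```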
For infinite classes, |H| is useless, and you need a more refined notion of size. That is where VC dimension and Rademacher complexity come in.
If H is rich enough that, for every possible labeling of the training points, some h in H matches it exactly, then h_ERM can drive empirical risk to zero while leaving true risk arbitrarily high. The classical illustration is to take H to be the set of all functions from X to Y. Then ERM picks any function that interpolates the training data, including a function that is constant outside the sample and therefore generalizes terribly. This is overfitting in its purest form.
In between "trivially small H" and "all functions" lives the bulk of practical machine learning, where the hypothesis class is parametric (linear models, neural networks, decision trees) and the worry is whether the class is too expressive for the amount of data you have. The classical bias-variance tradeoff is the picture this generates: simple models miss the signal (bias), complex models chase the noise (variance), and the optimal capacity sits somewhere in the middle. Modern overparameterized neural networks complicate that picture; see the section on benign overfitting below.
Most of statistical learning theory consists of bounds of the form
R(h_ERM) <= R_emp(h_ERM) + complexity_penalty(H, n, delta)
which hold with probability at least 1 - delta over the random sample. The penalty term depends on a measure of how rich H is. Several measures have been developed.
| Bound type | Complexity measure | Typical rate | Reference |
|---|---|---|---|
| Finite class (Hoeffding + union bound) | log \|H\| | sqrt(log \|H\| / n) | Vapnik 1998 [2] |
| VC bound | VC dimension d | sqrt(d log(n/d) / n) | Vapnik and Chervonenkis 1971 [5] |
| Rademacher complexity | R_n(H), data-dependent | R_n(H) + sqrt(log(1/delta) / n) | Bartlett and Mendelson 2002 [6] |
| PAC-Bayes | KL(Q \|\| P) for posterior Q, prior P | sqrt((KL + log(n/delta)) / n) | McAllester 1999 [7] |
| Algorithmic stability | replace-one stability beta | beta + sqrt(1/n) | Bousquet and Elisseeff 2002 [8] |
For binary classifiers, the VC dimension d of a class H is the largest integer such that some set of d points can be shattered by H, meaning all 2^d possible labelings are realizable by some h in H. Vapnik and Chervonenkis showed that finite VC dimension is necessary and sufficient for uniform convergence, and gave the bound
R(h) <= R_emp(h) + sqrt( ( d * (log(2n/d) + 1) - log(eta/4) ) / n )
holding for all h in H with probability at least 1 - eta over a sample of size n [5]. Linear classifiers in R^p have VC dimension p + 1; axis-aligned rectangles in R^2 have VC dimension 4; neural networks have VC dimension polynomial in the number of weights. These results were later sharpened with the growth function and Sauer's lemma, which bounds the number of distinct labelings of n points by a polynomial in n of degree d.
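The definition of shattering can be checked by brute force on tiny examples. The sketch below (illustrative helper functions, not library code) confirms that one-dimensional threshold classifiers h_t(x) = 1[x >= t] shatter any single point but no pair of points, so their VC dimension is 1:

```python
def threshold_labels(points, t):
    """Labels assigned by the threshold classifier h_t(x) = 1[x >= t]."""
    return tuple(int(p >= t) for p in points)

def is_shattered(points):
    """True if threshold classifiers realize all 2^d labelings of these points."""
    pts = sorted(points)
    # It is enough to try one threshold below, between, and above the points.
    candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    realizable = {threshold_labels(points, t) for t in candidates}
    return len(realizable) == 2 ** len(points)

print(is_shattered([0.3]))         # True: VC dimension is at least 1
print(is_shattered([0.3, 0.7]))    # False: the labeling (1, 0) is unrealizable
```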
The Rademacher complexity R_n(H) measures how well functions in H can correlate with random sign noise on a sample of n points. Formally,
R_n(H) = E_S E_sigma [ sup_{h in H} (1/n) * sum_i sigma_i * h(x_i) ]
where sigma_i are independent Rademacher (plus or minus one with equal probability) random variables. Bartlett and Mendelson's 2002 paper established the central inequality that for bounded losses,
R(h) <= R_emp(h) + 2 * R_n(L compose H) + 3 * sqrt( log(2/delta) / (2n) )
with probability at least 1 - delta [6]. The Rademacher complexity is data-dependent and often gives sharper bounds than VC dimension, especially for kernel methods and margin-based classifiers.
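For a finite class the supremum is a maximum, and the expectation over sigma can be estimated by Monte Carlo. The sketch below (illustrative; it fixes a single sample S and uses ±1-valued threshold classifiers) estimates the empirical Rademacher complexity for that sample:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x = rng.uniform(-1.0, 1.0, size=n)                 # one fixed sample S
thresholds = np.linspace(-1.0, 1.0, 41)

# Rows of H are the +/-1 outputs of each threshold classifier on the sample.
H = np.where(x[None, :] >= thresholds[:, None], 1.0, -1.0)

def empirical_rademacher(num_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ] for fixed S."""
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)    # random sign noise
        total += np.max(H @ sigma) / n             # sup over the finite class
    return total / num_draws

print(f"estimated empirical Rademacher complexity: {empirical_rademacher():.3f}")
```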
McAllester's PAC-Bayesian theorems bound the expected risk of a randomized predictor in terms of the KL divergence between a posterior Q over hypotheses (the predictor's randomness) and any data-independent prior P:
E_{h ~ Q} R(h) <= E_{h ~ Q} R_emp(h) + sqrt( ( KL(Q || P) + log(n/delta) ) / (2 * (n - 1)) )
with probability at least 1 - delta over the sample [7]. PAC-Bayes bounds are some of the tightest non-vacuous bounds known for neural networks, and they allow a Bayesian flavor of prior knowledge to enter the bound through P.
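The right-hand side of McAllester's inequality is easy to evaluate once the empirical risk and the KL term are known; the snippet below plugs in illustrative numbers (not taken from any real model) to show how the KL divergence and the sample size trade off:

```python
import math

def mcallester_bound(emp_risk_q, kl_qp, n, delta=0.05):
    """E_Q R_emp + sqrt( (KL(Q||P) + log(n/delta)) / (2(n-1)) ), the PAC-Bayes bound above."""
    return emp_risk_q + math.sqrt((kl_qp + math.log(n / delta)) / (2 * (n - 1)))

# A posterior that stays close to the prior (small KL) gives a much tighter bound.
print(round(mcallester_bound(emp_risk_q=0.05, kl_qp=10.0, n=50_000), 3))
print(round(mcallester_bound(emp_risk_q=0.05, kl_qp=5_000.0, n=50_000), 3))
```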
ERM by itself only minimizes empirical risk. Structural risk minimization (SRM), introduced by Vapnik and Chervonenkis in 1974 and developed in Vapnik's 1995 and 1998 books, extends ERM by trading off empirical risk against capacity [1][2][9]. The user fixes a nested sequence of hypothesis classes
H_1 subset H_2 subset H_3 subset ... subset H
with VC dimensions d_1 < d_2 < d_3 < ..., performs ERM in each H_k, and then picks the k that minimizes a guaranteed risk: empirical risk plus a confidence term that grows with d_k. The intuition is the same as the bias-variance tradeoff cast as an explicit model-selection rule.
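A minimal sketch of the SRM selection rule, assuming ERM has already been run inside each nested class H_k and we know each class's empirical risk and VC dimension (the confidence term reuses the Vapnik bound quoted earlier; all numbers are illustrative):

```python
import math

def vc_confidence(d, n, eta=0.05):
    """Confidence term from the VC bound: sqrt( (d(log(2n/d)+1) - log(eta/4)) / n )."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) - math.log(eta / 4)) / n)

def srm_select(empirical_risks, vc_dims, n):
    """Pick the class index minimizing empirical risk plus the confidence term."""
    guaranteed = [r + vc_confidence(d, n) for r, d in zip(empirical_risks, vc_dims)]
    return min(range(len(guaranteed)), key=guaranteed.__getitem__)

# Illustrative nested classes: richer classes fit better but pay a larger penalty.
emp_risks = [0.30, 0.12, 0.08, 0.07, 0.07]
vc_dims   = [2,    10,   50,   200,  1000]
print("selected class index:", srm_select(emp_risks, vc_dims, n=5_000))
```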
| Principle | Objective minimized | What it controls | Typical use |
|---|---|---|---|
| ERM | R_emp(h) | Training error only | Small or moderate fixed H |
| Regularized ERM | R_emp(h) + lambda * Omega(h) | Training error plus a complexity penalty on h | Ridge, lasso, weight decay, SVM |
| SRM | R_emp(h_k) + confidence(d_k, n, delta) | Training error plus a confidence term on H_k | Model selection across nested classes |
| Bayesian inference | Posterior over h, integrated predictions | Prior, likelihood, evidence | Probabilistic models with calibrated uncertainty |
In practice, the standard recipe is regularized ERM: minimize the empirical risk plus a penalty Omega(h) that grows with the complexity of h:
h = argmin_{h in H} [ R_emp(h) + lambda * Omega(h) ]
The coefficient lambda > 0 controls the strength of the penalty. Common choices for Omega:
| Method | Loss | Penalty Omega(w) | Hypothesis class |
|---|---|---|---|
| Ridge regression | Squared error | \|\|w\|\|_2^2 | Linear functions |
| Lasso | Squared error | \|\|w\|\|_1 | Linear functions, sparse |
| Logistic regression with L2 | Cross-entropy | \|\|w\|\|_2^2 | Linear in features, sigmoid output |
| Soft-margin SVM | Hinge loss | (1/2) * \|\|w\|\|_2^2 | Linear classifiers in feature space |
| Weight decay (deep nets) | Cross-entropy or MSE | \|\|theta\|\|_2^2 | Neural networks |
| Elastic net | Squared error | alpha * \|\|w\|\|_1 + (1 - alpha) * \|\|w\|\|_2^2 | Linear, both sparse and stable |
Regularized ERM is mathematically equivalent to constrained ERM (via Lagrangian duality) and to MAP estimation under a prior whose negative log-density matches the penalty. From the SRM perspective, lambda parameterizes a continuous version of the nested hypothesis classes: small lambda means a large effective class, large lambda means a small one.
The loss we conceptually want to minimize for classification is the 0-1 loss, which counts mistakes. The 0-1 loss is non-convex and non-differentiable, so ERM with the 0-1 loss is computationally intractable. The standard workaround is to substitute a convex surrogate loss that upper-bounds the 0-1 loss and is easy to optimize.
| Loss | Formula | Used by | Optimal score |
|---|---|---|---|
| 0-1 loss | I( y != h(x) ) | The thing you actually care about | Bayes classifier |
| Squared error (MSE) | (y - h(x))^2 | Linear regression, neural net regression | E[Y \| X] |
| Cross-entropy (log loss) | -log p_h(y \| x) | Logistic regression, softmax classifiers | P(Y \| X) |
| Hinge loss | max(0, 1 - y * h(x)) | Support vector machines | sign(2 P(Y=1 \| X) - 1), the Bayes classifier |
| Huber loss | quadratic for small residuals, linear for large | Robust regression | Robust (Huber) location estimate |
| Exponential loss | exp(-y * h(x)) | AdaBoost | (1/2) log( P(Y=1 \| X) / P(Y=-1 \| X) ) |
A loss is classification-calibrated if the population minimizer of the surrogate has the same sign as the Bayes-optimal classifier. Bartlett, Jordan, and McAuliffe proved general conditions for calibration in 2006 [10]. Calibration is what justifies using cross-entropy or hinge loss instead of 0-1 loss; without it, the surrogate could be fooled into bad classifications even with infinite data.
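The upper-bounding relationship is easy to check numerically. The snippet below (illustrative, with labels in {-1, +1} and margin m = y * h(x)) evaluates the 0-1 loss against three common surrogates; the hinge and exponential losses dominate the 0-1 loss pointwise, and the logistic loss does so after rescaling by 1/log(2):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)           # margin m = y * h(x), y in {-1, +1}

zero_one    = (margins <= 0).astype(float)    # the loss we actually care about
hinge       = np.maximum(0.0, 1.0 - margins)  # SVM surrogate
logistic    = np.log1p(np.exp(-margins))      # cross-entropy surrogate (natural log)
exponential = np.exp(-margins)                # AdaBoost surrogate

# Hinge and exponential losses upper-bound the 0-1 loss pointwise;
# the logistic loss does too once rescaled by 1/log(2).
assert np.all(hinge >= zero_one)
assert np.all(exponential >= zero_one)
assert np.all(logistic / np.log(2) >= zero_one)

for row in zip(margins, zero_one, hinge, logistic, exponential):
    print("  ".join(f"{v:6.3f}" for v in row))
```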
Ordinary least squares is ERM with squared loss over the class of linear functions h(x) = w^T x + b:
w_OLS = argmin_w (1/n) * sum_i (y_i - w^T x_i - b)^2
Closed-form solution: w_OLS = (X^T X)^{-1} X^T y when X^T X is invertible (with a constant feature appended to x so that b is absorbed into w). Ridge regression adds an L2 penalty: w_ridge = (X^T X + lambda I)^{-1} X^T y, which always has a unique solution and pulls w toward zero.
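Both closed forms are a few lines of NumPy; the sketch below uses synthetic data and appends a constant column so the intercept is absorbed into w (an assumption of this illustration, not part of the formulas above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data; a column of ones absorbs the intercept.
n, p = 100, 3
X = np.c_[rng.normal(size=(n, p)), np.ones(n)]
w_true = np.array([1.5, -2.0, 0.5, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

# OLS: w = (X^T X)^{-1} X^T y, computed via a linear solve for numerical stability.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^{-1} X^T y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
```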
Logistic regression is ERM with the cross-entropy loss over the class of sigmoid-activated linear models. Given p_h(y = 1 | x) = sigma(w^T x), the empirical risk is the negative log-likelihood
-(1/n) * sum_i [ y_i log p_h(1|x_i) + (1 - y_i) log p_h(0|x_i) ]
This has no closed form but is convex, so it can be minimized by Newton's method (iteratively reweighted least squares) or stochastic gradient descent.
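A minimal full-batch gradient-descent sketch for this convex objective, on synthetic data (illustrative; a practical implementation would use IRLS or a library solver):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic binary classification data with an intercept column.
n, p = 500, 2
X = np.c_[rng.normal(size=(n, p)), np.ones(n)]
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Full-batch gradient descent on the average negative log-likelihood.
w = np.zeros(X.shape[1])
lr = 0.5
for _ in range(2000):
    p_hat = sigmoid(X @ w)
    grad = X.T @ (p_hat - y) / n       # gradient of the empirical risk
    w -= lr * grad

print("estimated weights:", np.round(w, 2))
```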
The soft-margin SVM is regularized ERM with hinge loss and L2 penalty:
min_w (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
The regularization parameter C plays the role of an inverse lambda: larger C weights the hinge-loss term more heavily and regularizes less. The dual formulation enables the kernel trick, replacing inner products with kernel evaluations and giving nonlinear decision boundaries with linear ERM machinery.
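A bare-bones subgradient-descent sketch of the primal objective above, on synthetic data with labels in {-1, +1} (illustrative only; production SVMs use dedicated solvers and usually the dual, kernelized form):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 2-D data with labels in {-1, +1}.
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

C = 1.0
w, b = np.zeros(2), 0.0
lr = 0.001
for _ in range(5000):
    margins = y * (X @ w + b)
    violating = margins < 1                        # points with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i hinge_i with respect to (w, b).
    grad_w = w - C * (y[violating] @ X[violating])
    grad_b = -C * np.sum(y[violating])
    w -= lr * grad_w
    b -= lr * grad_b

print("weights:", np.round(w, 2), "bias:", round(b, 2))
```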
Training a neural network by minimizing the average cross-entropy over a labeled dataset, with weight decay and SGD, is regularized ERM. The hypothesis class is the set of functions representable by the architecture; the loss is a surrogate for 0-1 classification error; the regularizer is the L2 norm of the weights. What makes deep learning theoretically peculiar is that the hypothesis class is enormous, typically expressive enough to fit the training set exactly, yet ERM still generalizes. That puzzle is the next section.
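That framing is visible directly in a standard training loop. The sketch below (PyTorch, with random tensors standing in for a real dataset) minimizes average cross-entropy with SGD and weight decay, i.e. regularized ERM:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random tensors stand in for a real labeled dataset (illustrative only).
X = torch.randn(512, 20)
y = torch.randint(0, 3, (512,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()                  # surrogate for the 0-1 error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                  # empirical risk on the full batch
    loss.backward()
    optimizer.step()

print("final empirical (surrogate) risk:", float(loss))
```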
Classical theory predicts that if your hypothesis class can interpolate the training data, ERM will overfit catastrophically. Modern deep networks routinely interpolate their training data and yet generalize well. The discrepancy has driven a decade of research.
Zhang, Bengio, Hardt, Recht, and Vinyals showed in their 2017 ICLR paper that standard image classifiers can fit random labels to zero training error while still generalizing on real labels [11]. This rules out simple capacity-based explanations: the same network that memorizes random labels also generalizes when trained on real labels, so the architecture alone cannot be doing the regularizing.
Belkin, Hsu, Ma, and Mandal proposed in 2019 that the classical U-shaped bias-variance curve is only the first half of a fuller picture they called the double descent curve [12]. As model capacity grows past the interpolation threshold (the point where the model can fit the training data exactly), test error first rises, peaks at the threshold, then decreases again in the heavily overparameterized regime. This phenomenon has been observed in linear regression, random feature models, and deep networks, and complicates the textbook story about ERM and capacity. See the article on double descent for a fuller treatment.
Bartlett, Long, Lugosi, and Tsigler gave a precise characterization in 2020 of when minimum-norm interpolation in linear regression is benign, meaning ERM finds a zero-training-error solution that nevertheless predicts almost as well as the Bayes-optimal one [13]. Their conditions involve two notions of effective rank of the data covariance: roughly, the ambient dimension must be much larger than the sample size, and the eigenvalue spectrum must decay at a particular rate. In finite-dimensional settings with dimension growing faster than sample size, a wide range of slowly decaying spectra make overfitting benign; in infinite-dimensional settings only a narrow window does.
A leading explanation for benign overfitting is that the optimization algorithm itself, not just the loss function, biases ERM toward well-generalizing solutions. For least squares, gradient descent initialized at zero converges to the minimum-norm solution. For separable classification with logistic loss, gradient descent on overparameterized linear models converges in direction to the maximum-margin (hard SVM) solution [14]. The combination of architecture, initialization, and SGD dynamics produces an implicit preference for low-complexity solutions even when the explicit objective is plain ERM.
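The least-squares case can be checked numerically: gradient descent on an underdetermined problem, started at zero, lands on the same interpolating solution as the Moore-Penrose pseudoinverse, namely the minimum-norm one (illustrative sketch, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Underdetermined problem: more parameters than samples, many interpolating solutions.
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on (1/2n)||Xw - y||^2, initialized at zero.
w = np.zeros(p)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y    # minimum-norm interpolating solution

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```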
ERM relies on a chain of assumptions, and almost every limitation of supervised learning can be traced back to one of them.
| Assumption | What goes wrong if violated | Common remedies |
|---|---|---|
| Training data is iid from D | Sample bias, autocorrelated data, time-series | Cross-validation that respects structure, importance weighting |
| Test distribution equals training distribution | Distribution shift, covariate shift, concept drift | Domain adaptation, distributionally robust optimization, invariant risk minimization |
| Loss measures what we care about | Surrogate loss not calibrated, proxy metric misalignment | Calibrated surrogates, multi-objective training, RLHF |
| Hypothesis class contains a good predictor | High approximation error (bias), underfitting | Larger H, better features, deeper models |
| Sample size large relative to capacity | Overfitting, vacuous bounds | Regularization, SRM, more data |
| Loss is bounded | Heavy tails break standard concentration inequalities | Truncation, robust losses, median-of-means estimators |
Distribution shift is arguably the most consequential failure mode in deployment. ERM optimizes performance on the training distribution; if the deployment distribution differs, no amount of training data fixes the gap. This has motivated robust variants such as distributionally robust ERM (Sinha, Namkoong, and Duchi, 2018) and invariant risk minimization (Arjovsky and others, 2019).
When data is split across many clients or machines, ERM is solved approximately rather than exactly. Federated averaging (FedAvg), introduced by McMahan and others in 2017, runs local SGD steps on each client's data and averages the resulting weights. Under iid data and convex losses this approximates global ERM; under non-iid data it can drift toward client-specific minima, which is one reason federated learning theory is harder than centralized theory. Modern variants such as FedProx, SCAFFOLD, and FedAdam add corrections to keep the implicit objective close to the global empirical risk.
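A minimal FedAvg sketch on a shared least-squares objective (NumPy, synthetic per-client data; real federated systems add client sampling, secure aggregation, and communication constraints that this illustration omits):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic linear-regression data split across clients (illustrative only).
num_clients, n_per_client, p = 5, 50, 10
w_true = rng.normal(size=p)
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(n_per_client, p))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_client)
    clients.append((X, y))

def local_sgd(w, X, y, steps=20, lr=0.01):
    """A few local gradient steps on this client's empirical risk."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# FedAvg: broadcast the global model, run local steps, average the results.
w_global = np.zeros(p)
for round_ in range(100):
    local_models = [local_sgd(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_models, axis=0)

print("distance to w_true:", np.linalg.norm(w_global - w_true))
```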
ERM is one of several inductive principles. Maximum likelihood estimation (MLE) is exactly ERM with the negative log-likelihood as the loss: maximizing the likelihood and minimizing the average negative log-likelihood yield the same estimator. Bayesian inference, by contrast, does not minimize a single loss but instead averages predictions over the posterior, and reduces to regularized ERM (MAP estimation) only when you take the posterior mode. Minimum description length (MDL) and Kolmogorov-complexity-based approaches penalize hypothesis description length explicitly, recovering something like SRM with a code-length penalty.
The asymptotic story is reassuring: under mild conditions, ERM is consistent, meaning R(h_ERM) converges to inf_{h in H} R(h) as n grows. Vapnik and Chervonenkis proved in 1989 that finite VC dimension is necessary and sufficient for the consistency of ERM in the worst case [15]. The non-asymptotic story is what most theory papers spend their time on.
ERM is not a single algorithm but a recipe with five knobs: the hypothesis class, the loss function, the regularizer, the optimization algorithm, and the amount of data. Tuning these is most of what applied machine learning is about.
A useful checklist before deploying an ERM-based model: Is your loss the thing you actually want to minimize, or a surrogate? Is your training distribution close to deployment? Do you have enough data relative to model capacity, or are you relying on implicit regularization that may not transfer? Can you verify generalization with a held-out set drawn from the deployment distribution, not just the training distribution?
The theory of ERM is also a useful lens for diagnosing failures. A model that does great in training but poorly in production is failing one of the assumptions; identifying which one (overfitting, distribution shift, label noise, surrogate misalignment) usually points at the fix.