See also: Machine learning terms, Statistical learning theory, Generalization
Empirical risk minimization (ERM) is the foundational principle of statistical learning theory that says: pick a hypothesis from a hypothesis class that minimizes the average loss on the training data, and use that as a stand-in for picking the hypothesis that minimizes the true (but unknown) expected loss on the underlying data distribution. Almost every supervised learning algorithm in modern machine learning, from ordinary least squares to deep neural networks, can be understood as a form of ERM, sometimes regularized, sometimes not, sometimes with surrogate losses standing in for the loss we actually care about.
ERM was formalized by Vladimir Vapnik and Alexey Chervonenkis in a series of papers and books beginning in the late 1960s, culminating in Vapnik's two influential textbooks: The Nature of Statistical Learning Theory (1995) and Statistical Learning Theory (1998) [1][2]. The principle is deceptively simple. Its depth comes from the question it forces you to confront: when, and under what conditions on the hypothesis class and the sample size, does the hypothesis that does best on training data also do well on data it has never seen? That question is what gave rise to VC dimension, Rademacher complexity, PAC learning, and an enormous body of generalization theory that still drives research on why deep networks work.
Imagine you have a box of different tools, and you want to find the best tool to fix a broken toy. ERM is like a method that helps you pick the best tool from the box by trying each one on a few other broken toys you have. You choose the tool that fixes the most toys as the best one.
However, sometimes the best tool for those few toys might not work well for other broken toys you haven't tested yet. In machine learning, this is called overfitting. To avoid this problem, you can add some rules to make sure you don't pick a tool that's too specialized and can only fix one specific kind of toy. These rules help you find a tool that works well for most toys, even the ones you haven't tested yet. That extra step of adding rules is called regularization, and it turns plain ERM into something safer.
Let (X, Y) be a random pair drawn from an unknown joint distribution D over an input space X and an output space Y. A hypothesis is a function h: X to Y from a hypothesis class H. A loss function L: Y x Y to R+ measures how bad a prediction h(x) is when the true label is y.
The quantity we actually care about is the true risk (also called expected risk or generalization error):
R(h) = E_{(x,y) ~ D} [ L(h(x), y) ]
The Bayes risk is R* = inf_h R(h), the lowest possible expected loss over all measurable functions; the function that achieves it is the Bayes-optimal predictor [3]. We do not have access to D, so we cannot compute R(h) directly. Instead we have a training sample S = {(x_1, y_1), ..., (x_n, y_n)} drawn iid from D, and we compute the empirical risk:
R_emp(h) = (1/n) * sum_{i=1}^{n} L(h(x_i), y_i)
The ERM principle prescribes choosing the hypothesis that minimizes the empirical risk over the chosen class:
h_ERM = argmin_{h in H} R_emp(h)
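As a concrete illustration, the following sketch runs ERM with the 0-1 loss over a small finite class of threshold classifiers on synthetic one-dimensional data (the dataset, the class, and the variable names here are illustrative, not from any standard library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D binary classification data (illustrative only).
n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = (x > 0.1).astype(int)
noise = rng.random(n) < 0.05              # flip 5% of the labels
y = np.where(noise, 1 - y, y)

# Finite hypothesis class: threshold classifiers h_t(x) = 1[x >= t].
thresholds = np.linspace(-1.0, 1.0, 41)

def empirical_risk(t):
    """Average 0-1 loss of the threshold classifier h_t on the sample."""
    return np.mean((x >= t).astype(int) != y)

# ERM: choose the hypothesis with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_erm:.2f}, empirical risk: {risks.min():.3f}")
```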
The entire intellectual content of statistical learning theory is the question of how close R(h_ERM) is to R*, and how that gap depends on n, on the complexity of H, and on the underlying distribution D [3].
For a single fixed hypothesis h, the law of large numbers guarantees that R_emp(h) converges to R(h) as n grows. Hoeffding's inequality makes this quantitative: if the loss is bounded in [0, B], then for any single h,
P( | R_emp(h) - R(h) | > epsilon ) <= 2 * exp( -2 * n * epsilon^2 / B^2 )
The trouble is that h_ERM is not a fixed hypothesis. It is chosen after seeing the data, so it depends on the data, and the simple Hoeffding bound does not apply directly. The standard fix is a union bound. For a finite hypothesis class H, the probability that any h in H has empirical and true risk that disagree by more than epsilon is at most
P( sup_{h in H} | R_emp(h) - R(h) | > epsilon ) <= 2 * |H| * exp( -2 * n * epsilon^2 / B^2 )
This is the basic uniform convergence bound for finite classes [4]. Solving for epsilon shows that the gap shrinks like sqrt( log|H| / n ), which is why "more data helps" and "smaller hypothesis classes are safer" are not just slogans but theorems.
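Solving the finite-class bound for epsilon turns the slogan into numbers; the snippet below is a back-of-the-envelope calculation with illustrative values of |H|, n, and delta (none taken from the text above):

```python
import math

def finite_class_epsilon(n, num_hypotheses, delta=0.05, B=1.0):
    """Gap epsilon such that 2|H| * exp(-2 n eps^2 / B^2) = delta,
    i.e. the uniform-convergence bound for a finite class with loss in [0, B]."""
    return B * math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n))

# The gap shrinks like sqrt(log|H| / n): more data helps, smaller classes are safer.
for n in (100, 1_000, 10_000):
    print(n, round(finite_class_epsilon(n, num_hypotheses=10_000), 4))
```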
For infinite classes, |H| is useless, and you need a more refined notion of size. That is where VC dimension and Rademacher complexity come in.
If H is rich enough that, for every possible labeling of the training points, some h in H matches it exactly, then h_ERM can drive empirical risk to zero while leaving true risk arbitrarily high. The classical illustration is to take H to be the set of all functions from X to Y. Then ERM picks any function that interpolates the training data, including a function that is constant outside the sample and therefore generalizes terribly. This is overfitting in its purest form.
In between "trivially small H" and "all functions" lives the bulk of practical machine learning, where the hypothesis class is parametric (linear models, neural networks, decision trees) and the worry is whether the class is too expressive for the amount of data you have. The classical bias-variance tradeoff is the picture this generates: simple models miss the signal (bias), complex models chase the noise (variance), and the optimal capacity sits somewhere in the middle. Modern overparameterized neural networks complicate that picture; see the section on benign overfitting below.
Most of statistical learning theory consists of bounds of the form
R(h_ERM) <= R_emp(h_ERM) + complexity_penalty(H, n, delta)
which hold with probability at least 1 - delta over the random sample. The penalty term depends on a measure of how rich H is. Several measures have been developed.
| Bound type | Complexity measure | Typical rate | Reference |
|---|---|---|---|
| Finite class (Hoeffding + union bound) | log \|H\| | sqrt(log \|H\| / n) | Vapnik 1998 [2] |
| VC bound | VC dimension d | sqrt(d log(n/d) / n) | Vapnik and Chervonenkis 1971 [5] |
| Rademacher complexity | R_n(H), data-dependent | R_n(H) + sqrt(log(1/delta) / n) | Bartlett and Mendelson 2002 [6] |
| PAC-Bayes | KL(Q \|\| P) for posterior Q, prior P | sqrt((KL + log(n/delta)) / n) | McAllester 1999 [7] |
| Algorithmic stability | replace-one stability beta | beta + sqrt(1/n) | Bousquet and Elisseeff 2002 [8] |
For binary classifiers, the VC dimension d of a class H is the largest integer such that some set of d points can be shattered by H, meaning all 2^d possible labelings are realizable by some h in H. Vapnik and Chervonenkis showed that finite VC dimension is necessary and sufficient for uniform convergence, and gave the bound
R(h) <= R_emp(h) + sqrt( ( d * (log(2n/d) + 1) - log(eta/4) ) / n )
holding for all h in H with probability at least 1 - eta over a sample of size n [5]. Linear classifiers in R^p have VC dimension p + 1; axis-aligned rectangles in R^2 have VC dimension 4; neural networks have VC dimension polynomial in the number of weights. These results were later sharpened with the growth function and Sauer's lemma, which bounds the number of distinct labelings of n points by a polynomial in n of degree d.
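The definition of shattering can be checked by brute force on tiny examples. The sketch below (illustrative helper functions, not library code) confirms that one-dimensional threshold classifiers h_t(x) = 1[x >= t] shatter any single point but no pair of points, so their VC dimension is 1:

```python
def threshold_labels(points, t):
    """Labels assigned by the threshold classifier h_t(x) = 1[x >= t]."""
    return tuple(int(p >= t) for p in points)

def is_shattered(points):
    """True if threshold classifiers realize all 2^d labelings of these points."""
    pts = sorted(points)
    # It is enough to try one threshold below, between, and above the points.
    candidates = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    realizable = {threshold_labels(points, t) for t in candidates}
    return len(realizable) == 2 ** len(points)

print(is_shattered([0.3]))         # True: VC dimension is at least 1
print(is_shattered([0.3, 0.7]))    # False: the labeling (1, 0) is unrealizable
```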
The Rademacher complexity R_n(H) measures how well functions in H can correlate with random sign noise on a sample of n points. Formally,
R_n(H) = E_S E_sigma [ sup_{h in H} (1/n) * sum_i sigma_i * h(x_i) ]
where sigma_i are independent Rademacher (plus or minus one with equal probability) random variables. Bartlett and Mendelson's 2002 paper established the central inequality that for bounded losses,
R(h) <= R_emp(h) + 2 * R_n(L compose H) + 3 * sqrt( log(2/delta) / (2n) )
with probability at least 1 - delta [6]. The Rademacher complexity is data-dependent and often gives sharper bounds than VC dimension, especially for kernel methods and margin-based classifiers.
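For a finite class the supremum is a maximum, and the expectation over sigma can be estimated by Monte Carlo. The sketch below (illustrative; it fixes a single sample S and uses ±1-valued threshold classifiers) estimates the empirical Rademacher complexity for that sample:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x = rng.uniform(-1.0, 1.0, size=n)                 # one fixed sample S
thresholds = np.linspace(-1.0, 1.0, 41)

# Rows of H are the +/-1 outputs of each threshold classifier on the sample.
H = np.where(x[None, :] >= thresholds[:, None], 1.0, -1.0)

def empirical_rademacher(num_draws=2000):
    """Monte Carlo estimate of E_sigma[ sup_h (1/n) sum_i sigma_i h(x_i) ] for fixed S."""
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)    # random sign noise
        total += np.max(H @ sigma) / n             # sup over the finite class
    return total / num_draws

print(f"estimated empirical Rademacher complexity: {empirical_rademacher():.3f}")
```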
McAllester's PAC-Bayesian theorems bound the expected risk of a randomized predictor in terms of the KL divergence between a posterior Q over hypotheses (the predictor's randomness) and any data-independent prior P:
E_{h ~ Q} R(h) <= E_{h ~ Q} R_emp(h) + sqrt( ( KL(Q || P) + log(n/delta) ) / (2 * (n - 1)) )
with probability at least 1 - delta over the sample [7]. PAC-Bayes bounds are some of the tightest non-vacuous bounds known for neural networks, and they allow a Bayesian flavor of prior knowledge to enter the bound through P.
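The right-hand side of McAllester's inequality is easy to evaluate once the empirical risk and the KL term are known; the snippet below plugs in illustrative numbers (not taken from any real model) to show how the KL divergence and the sample size trade off:

```python
import math

def mcallester_bound(emp_risk_q, kl_qp, n, delta=0.05):
    """E_Q R_emp + sqrt( (KL(Q||P) + log(n/delta)) / (2(n-1)) ), the PAC-Bayes bound above."""
    return emp_risk_q + math.sqrt((kl_qp + math.log(n / delta)) / (2 * (n - 1)))

# A posterior that stays close to the prior (small KL) gives a much tighter bound.
print(round(mcallester_bound(emp_risk_q=0.05, kl_qp=10.0, n=50_000), 3))
print(round(mcallester_bound(emp_risk_q=0.05, kl_qp=5_000.0, n=50_000), 3))
```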
ERM by itself only minimizes empirical risk. Structural risk minimization (SRM), introduced by Vapnik and Chervonenkis in 1974 and developed in Vapnik's 1995 and 1998 books, extends ERM by trading off empirical risk against capacity [1][2][9]. The user fixes a nested sequence of hypothesis classes
H_1 subset H_2 subset H_3 subset ... subset H
with VC dimensions d_1 < d_2 < d_3 < ..., performs ERM in each H_k, and then picks the k that minimizes a guaranteed risk: empirical risk plus a confidence term that grows with d_k. The intuition is the same as the bias-variance tradeoff cast as an explicit model-selection rule.
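A minimal sketch of the SRM selection rule, assuming ERM has already been run inside each nested class H_k and we know each class's empirical risk and VC dimension (the confidence term reuses the Vapnik bound quoted earlier; all numbers are illustrative):

```python
import math

def vc_confidence(d, n, eta=0.05):
    """Confidence term from the VC bound: sqrt( (d(log(2n/d)+1) - log(eta/4)) / n )."""
    return math.sqrt((d * (math.log(2 * n / d) + 1) - math.log(eta / 4)) / n)

def srm_select(empirical_risks, vc_dims, n):
    """Pick the class index minimizing empirical risk plus the confidence term."""
    guaranteed = [r + vc_confidence(d, n) for r, d in zip(empirical_risks, vc_dims)]
    return min(range(len(guaranteed)), key=guaranteed.__getitem__)

# Illustrative nested classes: richer classes fit better but pay a larger penalty.
emp_risks = [0.30, 0.12, 0.08, 0.07, 0.07]
vc_dims   = [2,    10,   50,   200,  1000]
print("selected class index:", srm_select(emp_risks, vc_dims, n=5_000))
```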
| Principle | Objective minimized | What it controls | Typical use |
|---|---|---|---|
| ERM | R_emp(h) | Training error only | Small or moderate fixed H |
| Regularized ERM | R_emp(h) + lambda * Omega(h) | Training error plus a complexity penalty on h | Ridge, lasso, weight decay, SVM |
| SRM | R_emp(h_k) + confidence(d_k, n, delta) | Training error plus a confidence term on H_k | Model selection across nested classes |
| Bayesian inference | Posterior over h, integrated predictions | Prior, likelihood, evidence | Probabilistic models with calibrated uncertainty |
In practice, the standard recipe is regularized ERM: minimize the empirical risk plus a penalty Omega(h) that grows with the complexity of h:
h = argmin_{h in H} [ R_emp(h) + lambda * Omega(h) ]
The coefficient lambda > 0 controls the strength of the penalty. Common choices for Omega:
| Method | Loss | Penalty Omega(w) | Hypothesis class |
|---|---|---|---|
| Ridge regression | Squared error | \|\|w\|\|_2^2 | Linear functions |
| Lasso | Squared error | \|\|w\|\|_1 | Linear functions, sparse |
| Logistic regression with L2 | Cross-entropy | \|\|w\|\|_2^2 | Linear in features, sigmoid output |
| Soft-margin SVM | Hinge loss | (1/2) * \|\|w\|\|_2^2 | Linear classifiers in feature space |
| Weight decay (deep nets) | Cross-entropy or MSE | \|\|theta\|\|_2^2 | Neural networks |
| Elastic net | Squared error | alpha * \|\|w\|\|_1 + (1 - alpha) * \|\|w\|\|_2^2 | Linear, both sparse and stable |
Regularized ERM is mathematically equivalent to constrained ERM (via Lagrangian duality) and to MAP estimation under a prior whose negative log-density matches the penalty. From the SRM perspective, lambda parameterizes a continuous version of the nested hypothesis classes: small lambda means a large effective class, large lambda means a small one.
The loss we conceptually want to minimize for classification is the 0-1 loss, which counts mistakes. The 0-1 loss is non-convex and non-differentiable, so ERM with the 0-1 loss is computationally intractable. The standard workaround is to substitute a convex surrogate loss that upper-bounds the 0-1 loss and is easy to optimize.
| Loss | Formula | Used by | Optimal score |
|---|---|---|---|
| 0-1 loss | I( y != h(x) ) | The thing you actually care about | Bayes classifier |
| Squared error (MSE) | (y - h(x))^2 | Linear regression, neural net regression | E[Y \| X] |
| Cross-entropy (log loss) | -log p_h(y \| x) | Logistic regression, softmax classifiers | P(Y \| X) |
| Hinge loss | max(0, 1 - y * h(x)) | Support vector machines | sign(2 P(Y=1 \| X) - 1), the Bayes classifier |
| Huber loss | quadratic for small residuals, linear for large | Robust regression | Robust (Huber) location estimate |
| Exponential loss | exp(-y * h(x)) | AdaBoost | (1/2) log( P(Y=1 \| X) / P(Y=-1 \| X) ) |
A loss is classification-calibrated if the population minimizer of the surrogate has the same sign as the Bayes-optimal classifier. Bartlett, Jordan, and McAuliffe proved general conditions for calibration in 2006 [10]. Calibration is what justifies using cross-entropy or hinge loss instead of 0-1 loss; without it, the surrogate could be fooled into bad classifications even with infinite data.
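The upper-bounding relationship is easy to check numerically. The snippet below (illustrative, with labels in {-1, +1} and margin m = y * h(x)) evaluates the 0-1 loss against three common surrogates; the hinge and exponential losses dominate the 0-1 loss pointwise, and the logistic loss does so after rescaling by 1/log(2):

```python
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)           # margin m = y * h(x), y in {-1, +1}

zero_one    = (margins <= 0).astype(float)    # the loss we actually care about
hinge       = np.maximum(0.0, 1.0 - margins)  # SVM surrogate
logistic    = np.log1p(np.exp(-margins))      # cross-entropy surrogate (natural log)
exponential = np.exp(-margins)                # AdaBoost surrogate

# Hinge and exponential losses upper-bound the 0-1 loss pointwise;
# the logistic loss does too once rescaled by 1/log(2).
assert np.all(hinge >= zero_one)
assert np.all(exponential >= zero_one)
assert np.all(logistic / np.log(2) >= zero_one)

for row in zip(margins, zero_one, hinge, logistic, exponential):
    print("  ".join(f"{v:6.3f}" for v in row))
```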
Ordinary least squares is ERM with squared loss over the class of linear functions h(x) = w^T x + b:
w_OLS = argmin_w (1/n) * sum_i (y_i - w^T x_i - b)^2
Closed-form solution: w_OLS = (X^T X)^{-1} X^T y when X^T X is invertible (with a constant feature appended to x so that b is absorbed into w). Ridge regression adds an L2 penalty: w_ridge = (X^T X + lambda I)^{-1} X^T y, which always has a unique solution and pulls w toward zero.
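Both closed forms are a few lines of NumPy; the sketch below uses synthetic data and appends a constant column so the intercept is absorbed into w (an assumption of this illustration, not part of the formulas above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data; a column of ones absorbs the intercept.
n, p = 100, 3
X = np.c_[rng.normal(size=(n, p)), np.ones(n)]
w_true = np.array([1.5, -2.0, 0.5, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=n)

# OLS: w = (X^T X)^{-1} X^T y, computed via a linear solve for numerical stability.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X^T X + lambda I)^{-1} X^T y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("OLS:  ", np.round(w_ols, 3))
print("Ridge:", np.round(w_ridge, 3))
```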
Logistic regression is ERM with the cross-entropy loss over the class of sigmoid-activated linear models. Given p_h(y = 1 | x) = sigma(w^T x), the empirical risk is the negative log-likelihood
-(1/n) * sum_i [ y_i log p_h(1|x_i) + (1 - y_i) log p_h(0|x_i) ]
This has no closed form but is convex, so it can be minimized by Newton's method (iteratively reweighted least squares) or stochastic gradient descent.
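A minimal full-batch gradient-descent sketch for this convex objective, on synthetic data (illustrative; a practical implementation would use IRLS or a library solver):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic binary classification data with an intercept column.
n, p = 500, 2
X = np.c_[rng.normal(size=(n, p)), np.ones(n)]
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Full-batch gradient descent on the average negative log-likelihood.
w = np.zeros(X.shape[1])
lr = 0.5
for _ in range(2000):
    p_hat = sigmoid(X @ w)
    grad = X.T @ (p_hat - y) / n       # gradient of the empirical risk
    w -= lr * grad

print("estimated weights:", np.round(w, 2))
```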
The soft-margin SVM is regularized ERM with hinge loss and L2 penalty:
min_w (1/2) ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))
The regularization parameter C plays the role of an inverse lambda: larger C weights the hinge-loss term more heavily and regularizes less. The dual formulation enables the kernel trick, replacing inner products with kernel evaluations and giving nonlinear decision boundaries with linear ERM machinery.
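A bare-bones subgradient-descent sketch of the primal objective above, on synthetic data with labels in {-1, +1} (illustrative only; production SVMs use dedicated solvers and usually the dual, kernelized form):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic 2-D data with labels in {-1, +1}.
n = 200
X = rng.normal(size=(n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

C = 1.0
w, b = np.zeros(2), 0.0
lr = 0.001
for _ in range(5000):
    margins = y * (X @ w + b)
    violating = margins < 1                        # points with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i hinge_i with respect to (w, b).
    grad_w = w - C * (y[violating] @ X[violating])
    grad_b = -C * np.sum(y[violating])
    w -= lr * grad_w
    b -= lr * grad_b

print("weights:", np.round(w, 2), "bias:", round(b, 2))
```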
Training a neural network by minimizing the average cross-entropy over a labeled dataset, with weight decay and SGD, is regularized ERM. The hypothesis class is the set of functions representable by the architecture; the loss is a surrogate for 0-1 classification error; the regularizer is the L2 norm of the weights. What makes deep learning theoretically peculiar is that the hypothesis class is enormous, typically expressive enough to fit the training set exactly, yet ERM still generalizes. That puzzle is the next section.
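That framing is visible directly in a standard training loop. The sketch below (PyTorch, with random tensors standing in for a real dataset) minimizes average cross-entropy with SGD and weight decay, i.e. regularized ERM:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Random tensors stand in for a real labeled dataset (illustrative only).
X = torch.randn(512, 20)
y = torch.randint(0, 3, (512,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.CrossEntropyLoss()                  # surrogate for the 0-1 error
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # L2 penalty

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                  # empirical risk on the full batch
    loss.backward()
    optimizer.step()

print("final empirical (surrogate) risk:", float(loss))
```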
Classical theory predicts that if your hypothesis class can interpolate the training data, ERM will overfit catastrophically. Modern deep networks routinely interpolate their training data and yet generalize well. The discrepancy has driven a decade of research.
Zhang, Bengio, Hardt, Recht, and Vinyals showed in their 2017 ICLR paper that standard image classifiers can fit random labels to zero training error while still generalizing on real labels [11]. This rules out simple capacity-based explanations: the same network that memorizes random labels also generalizes when trained on real labels, so the architecture alone cannot be doing the regularizing.
Belkin, Hsu, Ma, and Mandal proposed in 2019 that the classical U-shaped bias-variance curve is only the first half of a fuller picture they called the double descent curve [12]. As model capacity grows past the interpolation threshold (the point where the model can fit the training data exactly), test error first rises, peaks at the threshold, then decreases again in the heavily overparameterized regime. This phenomenon has been observed in linear regression, random feature models, and deep networks, and complicates the textbook story about ERM and capacity. See the article on double descent for a fuller treatment.
Bartlett, Long, Lugosi, and Tsigler gave a precise characterization in 2020 of when minimum-norm interpolation in linear regression is benign, meaning ERM finds a zero-training-error solution that nevertheless predicts almost as well as the Bayes-optimal one [13]. Their conditions involve two notions of effective rank of the data covariance: roughly, the ambient dimension must be much larger than the sample size, and the eigenvalue spectrum must decay at a particular rate. In finite-dimensional settings with dimension growing faster than sample size, a wide range of slowly decaying spectra make overfitting benign; in infinite-dimensional settings only a narrow window does.
A leading explanation for benign overfitting is that the optimization algorithm itself, not just the loss function, biases ERM toward well-generalizing solutions. For least squares, gradient descent initialized at zero converges to the minimum-norm solution. For separable classification with logistic loss, gradient descent on overparameterized linear models converges in direction to the maximum-margin (hard SVM) solution [14]. The combination of architecture, initialization, and SGD dynamics produces an implicit preference for low-complexity solutions even when the explicit objective is plain ERM.
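The least-squares case can be checked numerically: gradient descent on an underdetermined problem, started at zero, lands on the same interpolating solution as the Moore-Penrose pseudoinverse, namely the minimum-norm one (illustrative sketch, synthetic data):

```python
import numpy as np

rng = np.random.default_rng(5)

# Underdetermined problem: more parameters than samples, many interpolating solutions.
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Gradient descent on (1/2n)||Xw - y||^2, initialized at zero.
w = np.zeros(p)
lr = 0.01
for _ in range(20_000):
    w -= lr * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y    # minimum-norm interpolating solution

print("training residual:", np.linalg.norm(X @ w - y))
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
```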
ERM relies on a chain of assumptions, and almost every limitation of supervised learning can be traced back to one of them.
| Assumption | What goes wrong if violated | Common remedies |
|---|---|---|
| Training data is iid from D | Sample bias, autocorrelated data, time-series | Cross-validation that respects structure, importance weighting |
| Test distribution equals training distribution | Distribution shift, covariate shift, concept drift | Domain adaptation, distributionally robust optimization, invariant risk minimization |
| Loss measures what we care about | Surrogate loss not calibrated, proxy metric misalignment | Calibrated surrogates, multi-objective training, RLHF |
| Hypothesis class contains a good predictor | High approximation error (bias), underfitting | Larger H, better features, deeper models |
| Sample size large relative to capacity | Overfitting, vacuous bounds | Regularization, SRM, more data |
| Loss is bounded | Heavy tails break standard concentration inequalities | Truncation, robust losses, median-of-means estimators |
Distribution shift is arguably the most consequential failure mode in deployment. ERM optimizes performance on the training distribution; if the deployment distribution differs, no amount of training data fixes the gap. This has motivated robust variants such as distributionally robust ERM (Sinha, Namkoong, and Duchi, 2018) and invariant risk minimization (Arjovsky and others, 2019).
When data is split across many clients or machines, ERM is solved approximately rather than exactly. Federated averaging (FedAvg), introduced by McMahan and others in 2017, runs local SGD steps on each client's data and averages the resulting weights. Under iid data and convex losses this approximates global ERM; under non-iid data it can drift toward client-specific minima, which is one reason federated learning theory is harder than centralized theory. Modern variants such as FedProx, SCAFFOLD, and FedAdam add corrections to keep the implicit objective close to the global empirical risk.
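A minimal FedAvg sketch on a shared least-squares objective (NumPy, synthetic per-client data; real federated systems add client sampling, secure aggregation, and communication constraints that this illustration omits):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic linear-regression data split across clients (illustrative only).
num_clients, n_per_client, p = 5, 50, 10
w_true = rng.normal(size=p)
clients = []
for _ in range(num_clients):
    X = rng.normal(size=(n_per_client, p))
    y = X @ w_true + 0.1 * rng.normal(size=n_per_client)
    clients.append((X, y))

def local_sgd(w, X, y, steps=20, lr=0.01):
    """A few local gradient steps on this client's empirical risk."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# FedAvg: broadcast the global model, run local steps, average the results.
w_global = np.zeros(p)
for round_ in range(100):
    local_models = [local_sgd(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_models, axis=0)

print("distance to w_true:", np.linalg.norm(w_global - w_true))
```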
ERM is one of several inductive principles. Maximum likelihood estimation (MLE) is exactly ERM with the negative log-likelihood as the loss: maximizing the likelihood and minimizing the average negative log-likelihood yield the same estimator. Bayesian inference, by contrast, does not minimize a single loss but instead averages predictions over the posterior, and reduces to regularized ERM (MAP estimation) only when you take the posterior mode. Minimum description length (MDL) and Kolmogorov-complexity-based approaches penalize hypothesis description length explicitly, recovering something like SRM with a code-length penalty.
The asymptotic story is reassuring: under mild conditions, ERM is consistent, meaning R(h_ERM) converges to inf_{h in H} R(h) as n grows. Vapnik and Chervonenkis proved in 1989 that finite VC dimension is necessary and sufficient for the consistency of ERM in the worst case [15]. The non-asymptotic story is what most theory papers spend their time on.
ERM is not a single algorithm but a recipe with five knobs: the hypothesis class, the loss function, the regularizer, the optimization algorithm, and the amount of data. Tuning these is most of what applied machine learning is about.
A useful checklist before deploying an ERM-based model: Is your loss the thing you actually want to minimize, or a surrogate? Is your training distribution close to deployment? Do you have enough data relative to model capacity, or are you relying on implicit regularization that may not transfer? Can you verify generalization with a held-out set drawn from the deployment distribution, not just the training distribution?
The theory of ERM is also a useful lens for diagnosing failures. A model that does great in training but poorly in production is failing one of the assumptions; identifying which one (overfitting, distribution shift, label noise, surrogate misalignment) usually points at the fix.