Empirical risk minimization (ERM) is a principle in statistical learning theory that provides a foundation for designing and analyzing machine learning algorithms. The core idea is straightforward: because the true probability distribution generating the data is unknown, a learning algorithm should select the hypothesis that minimizes the average loss computed over the observed training data. ERM was formalized by Vladimir Vapnik and Alexey Chervonenkis in the late 1960s as part of their broader work on the theory of pattern recognition and uniform convergence of empirical processes.
The ERM principle underlies many widely used algorithms, including linear regression, logistic regression, and support vector machines. Understanding its theoretical properties, such as when and why minimizing training error leads to good performance on unseen data, is one of the central questions in learning theory.
Imagine you are trying to learn which fruits are sweet and which are sour. You have a basket of fruits that you have already tasted, and you know which ones were sweet and which were sour. Empirical risk minimization says: pick the rule that makes the fewest mistakes on the fruits you have already tasted, and then use that rule for new fruits you have not tried yet.
The word "empirical" means "based on what you have actually seen," and "risk" is just another word for "how many mistakes you make." So empirical risk minimization means "make as few mistakes as possible on the examples you have."
The tricky part is that a rule which perfectly memorizes your basket might not work well on new fruits. If your basket happened to contain only red sweet apples and green sour apples, you might wrongly conclude that color is all that matters. This problem is called overfitting, and much of learning theory is about figuring out when your rule for the basket will also work for new fruits.
The standard supervised learning setup consists of the following components:

- An input space X and an output space Y.
- An unknown probability distribution P(x, y) over X × Y from which examples are drawn.
- A hypothesis class H of candidate functions h: X -> Y.
- A loss function L(h(x), y) measuring the cost of predicting h(x) when the true output is y.
The true risk (also called the expected risk or population risk) of a hypothesis h is defined as:
R(h) = E[L(h(x), y)] = integral of L(h(x), y) dP(x, y)
This quantity represents the expected loss of the hypothesis over the entire data distribution. Since P(x, y) is unknown, the true risk cannot be computed directly.
Given a training sample S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} drawn i.i.d. from P, the empirical risk of a hypothesis h is:
R_emp(h) = (1/n) * sum from i=1 to n of L(h(x_i), y_i)
The empirical risk is simply the average loss over the training sample. By the law of large numbers, for any fixed hypothesis h, the empirical risk converges to the true risk as the sample size n grows.
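The empirical risk is straightforward to compute in code. The following minimal sketch evaluates it for a threshold classifier under 0-1 loss; the hypothesis, data, and helper names are illustrative, not part of any standard library.

```python
# Minimal sketch: empirical risk of a threshold classifier under 0-1 loss.
# The hypothesis h and the toy sample are illustrative assumptions.

def zero_one_loss(prediction, label):
    """0-1 loss: 1 for a mistake, 0 for a correct prediction."""
    return 0.0 if prediction == label else 1.0

def empirical_risk(h, sample):
    """R_emp(h) = (1/n) * sum of L(h(x_i), y_i) over the sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)

# Toy sample: label is +1 when x >= 3, -1 otherwise, with one noisy point.
sample = [(1, -1), (2, -1), (3, +1), (4, +1), (5, -1)]
h = lambda x: +1 if x >= 3 else -1

print(empirical_risk(h, sample))  # misclassifies only (5, -1): 1/5 = 0.2
```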
The ERM principle selects the hypothesis that minimizes the empirical risk:
h_ERM = argmin over h in H of R_emp(h)
The hope is that the hypothesis minimizing the empirical risk will also have a low true risk. Whether this hope is justified depends on properties of the hypothesis class H and the sample size n.
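For a finite hypothesis class, the ERM rule can be realized literally as an argmin by enumeration. The sketch below does this for a small set of threshold classifiers; the candidate set and data are illustrative assumptions.

```python
# Sketch of the ERM rule for a finite hypothesis class: enumerate candidate
# threshold classifiers and keep the one with the lowest empirical risk.

def empirical_risk(h, sample):
    """Fraction of training examples the hypothesis misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """h_ERM = argmin over h in H of R_emp(h)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

def make_threshold(t):
    return lambda x: +1 if x >= t else -1

sample = [(1, -1), (2, -1), (3, +1), (4, +1)]
H = [make_threshold(t) for t in range(6)]      # thresholds 0..5
best = erm(H, sample)
print(empirical_risk(best, sample))  # threshold 3 fits perfectly: 0.0
```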
The foundations of empirical risk minimization trace back to the work of Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control Sciences in Moscow. Their research program, which began in the early 1960s, addressed a fundamental question: under what conditions does minimizing empirical risk lead to minimizing the true risk?
| Year | Event |
|---|---|
| 1964 | Vapnik and Chervonenkis develop the "generalized portrait" method for pattern recognition, an early precursor to support vector machines |
| 1968 | Vapnik and Chervonenkis publish their foundational result on uniform convergence of empirical risk to true risk, introducing conditions based on the capacity of the function class |
| 1971 | An English translation of the 1968 paper becomes available, bringing the results to a wider audience |
| 1974 | Vapnik and Chervonenkis publish Theory of Pattern Recognition, formalizing the structural risk minimization principle |
| 1982 | Vapnik publishes Estimation of Dependences Based on Empirical Data, presenting a systematic treatment of the ERM principle and VC theory |
| 1984 | Leslie Valiant introduces the Probably Approximately Correct (PAC) learning framework independently, providing a complementary computational perspective on learning |
| 1989 | Blumer, Ehrenfeucht, Haussler, and Warmuth connect PAC learning and VC dimension, establishing the fundamental theorem of statistical learning |
| 1995 | Vapnik publishes The Nature of Statistical Learning Theory, which synthesizes decades of work and popularizes these ideas in the machine learning community |
| 1998 | Vapnik's expanded second edition further develops the theory and its connections to kernel methods and support vector machines |
The central question in ERM theory is whether minimizing empirical risk leads to minimizing true risk. The answer depends on a property called uniform convergence: the empirical risk must converge to the true risk simultaneously for all hypotheses in the class H, not just for individual ones.
Uniform convergence means that for any epsilon > 0:
P(sup over h in H of |R_emp(h) - R(h)| > epsilon) -> 0 as n -> infinity
If uniform convergence holds, then the hypothesis selected by ERM will have true risk close to the best achievable true risk in H. This is because, whenever |R_emp(h) - R(h)| <= epsilon for every h in H simultaneously:

R(h_ERM) <= R_emp(h_ERM) + epsilon <= R_emp(h_H*) + epsilon <= R(h_H*) + 2 * epsilon

where h_H* is the minimizer of the true risk over H; the middle inequality holds because h_ERM minimizes the empirical risk.
Vapnik and Chervonenkis proved that uniform convergence is both necessary and sufficient for the consistency of ERM across all possible data distributions.
The VC dimension (Vapnik-Chervonenkis dimension) is a measure of the capacity or expressiveness of a hypothesis class. For binary classification, the VC dimension of a class H is the largest number of points that can be shattered by H, meaning that H can realize all possible labelings of those points.
Examples of VC dimensions:
| Hypothesis class | VC dimension |
|---|---|
| Linear classifiers in R^d | d + 1 |
| Intervals on the real line | 2 |
| Convex polygons with k sides in R^2 | 2k + 1 |
| Finite hypothesis class of size \|H\| | at most log2(\|H\|) |
| Sine functions (sin(theta * x)) | infinite |
| Set of all functions from X to {0,1} | infinite (if X is infinite) |
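The notion of shattering can be checked computationally for small classes. The sketch below enumerates the labelings realizable by intervals on the real line (labeling x as 1 iff a <= x <= b), confirming the table entry that their VC dimension is 2; the function names are mine, for illustration.

```python
# Sketch: checking shattering for intervals on the real line. It suffices
# to try interval endpoints drawn from the point set plus two sentinels,
# since only the induced labeling matters.

def interval_labelings(points):
    """All labelings of `points` realizable by some interval [a, b]."""
    candidates = sorted(points)
    endpoints = [min(candidates) - 1] + candidates + [max(candidates) + 1]
    labelings = set()
    for a in endpoints:
        for b in endpoints:  # a > b yields the all-zero labeling
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

def is_shattered(points):
    """True iff intervals realize all 2^n labelings of the points."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1, 2]))     # True: intervals shatter any 2 points
print(is_shattered([1, 2, 3]))  # False: labeling (1, 0, 1) is unreachable
```

Three points can never be shattered because an interval is convex: it cannot contain the two outer points while excluding the middle one, so the VC dimension is exactly 2, matching the table.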
The VC inequality provides a quantitative bound on the deviation of empirical risk from true risk. For binary classification with 0-1 loss:
P(sup over h in H of |R_emp(h) - R(h)| > epsilon) <= 8 * S(H, n) * exp(-n * epsilon^2 / 32)
where S(H, n) is the growth function (also called the shattering coefficient), which counts the maximum number of distinct labelings that H can produce on n points.
Sauer's lemma bounds the growth function in terms of the VC dimension d:
S(H, n) <= sum from i=0 to d of C(n, i) <= (en/d)^d
for n >= d. This means the growth function is at most polynomial in n when the VC dimension is finite, rather than exponential.
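Sauer's lemma is easy to evaluate numerically, which makes the polynomial-versus-exponential gap concrete. The following sketch computes the bound for an illustrative n and d.

```python
# Sketch: Sauer's lemma numerically. For finite VC dimension d, the growth
# function is bounded by sum_{i=0}^{d} C(n, i), which is polynomial in n,
# far below the exponential 2^n of an unrestricted class.

from math import comb  # Python 3.8+

def sauer_bound(n, d):
    """Upper bound on the growth function S(H, n) when VCdim(H) = d."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 20, 3
print(sauer_bound(n, d))   # 1 + 20 + 190 + 1140 = 1351
print(2 ** n)              # 1048576: exponential without a VC restriction
```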
Combining the VC inequality with properties of ERM yields the following generalization bound. With probability at least 1 - delta over the random draw of the training set of size n:
R(h_ERM) <= R_emp(h_ERM) + O(sqrt((d * log(n/d) + log(1/delta)) / n))
where d is the VC dimension of H. This bound states that the true risk of the ERM solution is at most its empirical risk plus a complexity penalty that decreases as the sample size grows and increases with the VC dimension.
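To see how the complexity penalty behaves, the sketch below evaluates the sqrt term from the bound for growing sample sizes, taking the hidden constant inside the O(.) as 1 for illustration.

```python
# Sketch: evaluating the VC-bound penalty sqrt((d*log(n/d) + log(1/delta))/n)
# for increasing sample sizes. The constant in the O(.) is assumed to be 1.

from math import log, sqrt

def vc_penalty(n, d, delta):
    return sqrt((d * log(n / d) + log(1 / delta)) / n)

for n in (100, 1000, 10000):
    print(n, round(vc_penalty(n, d=10, delta=0.05), 3))
# For fixed d and delta, the penalty shrinks roughly like 1/sqrt(n).
```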
The fundamental theorem of statistical learning, established through the combined work of Vapnik, Chervonenkis, Blumer, Ehrenfeucht, Haussler, and Warmuth, characterizes exactly when learning is possible. For binary classification, the following statements are equivalent:

- H has finite VC dimension.
- H has the uniform convergence property.
- ERM is a consistent learning rule for H.
- H is PAC learnable, and agnostically PAC learnable.
This theorem is remarkable because it shows that a single combinatorial quantity, the VC dimension, completely governs the learnability of binary classification problems under the ERM principle.
The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides a computational perspective on learning. A concept class C is PAC-learnable if there exists an algorithm that, for any target concept c in C, any distribution over the input space, and any parameters epsilon > 0 and delta > 0:

- outputs, with probability at least 1 - delta, a hypothesis whose true risk is at most epsilon, and
- does so using a number of examples (and, in the efficient version, running time) polynomial in 1/epsilon and 1/delta.
The agnostic PAC framework relaxes the assumption that the target concept belongs to the hypothesis class. In this setting, the goal is to find a hypothesis whose true risk is at most epsilon worse than the best hypothesis in H:
R(h) <= [min over h' in H of R(h')] + epsilon
ERM is a natural algorithm for agnostic PAC learning. If H has finite VC dimension d, then ERM is an agnostic PAC learner with sample complexity:
m(epsilon, delta) = O(d / epsilon^2 * log(1/epsilon) + log(1/delta) / epsilon^2)
This means that the number of training examples needed to learn scales linearly with the VC dimension and inversely with the square of the desired accuracy.
The choice of hypothesis class H introduces a fundamental tradeoff between two sources of error, closely related to the bias-variance tradeoff.
The excess risk of the ERM hypothesis can be decomposed as:
R(h_ERM) - R(h*) = [R(h_H*) - R(h*)] + [R(h_ERM) - R(h_H*)]

where h* is the Bayes-optimal predictor (the minimizer of the true risk over all measurable functions) and h_H* is the best hypothesis within H.

The two terms are:
| Error component | Also called | Depends on | Behavior as H grows |
|---|---|---|---|
| R(h_H*) - R(h*) | Approximation error (bias) | Expressiveness of H | Decreases |
| R(h_ERM) - R(h_H*) | Estimation error (variance) | Sample size n and complexity of H | Increases |
The approximation error measures how well the best hypothesis in H can fit the true data-generating process. A larger, more expressive hypothesis class reduces this error. The estimation error measures the cost of estimating the best hypothesis from finite data. A larger hypothesis class makes this harder because there are more candidates to distinguish among.
The goal in practice is to choose H to balance these two sources of error. This is the theoretical justification for model selection techniques and regularization.
Structural risk minimization (SRM) is a principle introduced by Vapnik and Chervonenkis in their 1974 book that directly addresses the approximation-estimation tradeoff. Rather than fixing a single hypothesis class, SRM considers a nested sequence of hypothesis classes:
H_1 subset H_2 subset H_3 subset ...
where each H_k has finite VC dimension d_k, and d_1 <= d_2 <= d_3 <= ...
The SRM principle selects both the hypothesis class and the hypothesis within it by minimizing an upper bound on the true risk:
h_SRM = argmin over k and h in H_k of [R_emp(h) + penalty(d_k, n, delta)]
The penalty term increases with the VC dimension of the chosen class, discouraging the learner from selecting overly complex models unless the data strongly support it.
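The SRM selection rule can be sketched concretely for a toy nested sequence of classes: constant rules inside threshold rules inside interval rules, with d_k = k standing in for the VC dimensions and sqrt(d_k / n) as a simplified stand-in for the VC penalty. All names and the penalty form are illustrative assumptions, not Vapnik's exact construction.

```python
# Sketch of SRM: jointly pick the class H_k and the hypothesis h in H_k
# minimizing R_emp(h) + penalty(d_k, n). Classes are nested: H_1 (constants)
# subset H_2 (adds thresholds) subset H_3 (adds intervals).

from math import sqrt

def emp_risk(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def classes(xs):
    grid = sorted(xs) + [max(xs) + 1]
    H1 = [lambda x: +1, lambda x: -1]
    H2 = [(lambda t: lambda x: +1 if x >= t else -1)(t) for t in grid]
    H3 = [(lambda a, b: lambda x: +1 if a <= x <= b else -1)(a, b)
          for a in grid for b in grid]
    return [(1, H1), (2, H1 + H2), (3, H1 + H2 + H3)]

def srm(sample):
    n = len(sample)
    scored = []
    for d_k, H_k in classes([x for x, _ in sample]):
        h = min(H_k, key=lambda h: emp_risk(h, sample))
        scored.append((emp_risk(h, sample) + sqrt(d_k / n), d_k, h))
    return min(scored, key=lambda s: s[0])

sample = [(1, -1), (2, -1), (3, +1), (4, +1)]
score, d_k, h = srm(sample)
print(d_k)  # 2: the threshold class already fits perfectly, so the richer
            # interval class is rejected by its larger penalty
```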
| Property | ERM | SRM |
|---|---|---|
| Hypothesis class | Fixed | Selected from a nested sequence |
| Objective | Minimize empirical risk | Minimize empirical risk plus complexity penalty |
| Overfitting risk | Higher for complex H | Controlled by the penalty term |
| Requires model selection | External (cross-validation, etc.) | Built into the principle |
| Theoretical guarantee | Consistent if H has finite VC dimension | Consistent even over a growing sequence of classes |
| Practical realization | Least squares, MLE | Regularized methods, SVM with kernel selection |
In many practical settings, the natural loss function is computationally intractable. For binary classification, the 0-1 loss (which counts misclassifications) is the most natural choice, but minimizing empirical risk with 0-1 loss is NP-hard even for linear classifiers. To make optimization feasible, practitioners replace the 0-1 loss with a convex surrogate loss function.
The following table compares common loss functions used in place of the 0-1 loss. Each is expressed as a function of the margin m = y * f(x), where y is in {-1, +1} and f(x) is the real-valued prediction.
| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| 0-1 loss | 1 if m < 0, else 0 | Ideal classification metric | Non-convex, non-differentiable; NP-hard to minimize |
| Hinge loss | max(0, 1 - m) | Support vector machines | Convex, not differentiable at m = 1; yields sparse solutions |
| Logistic loss | log(1 + exp(-m)) | Logistic regression | Convex, differentiable; outputs calibrated probabilities |
| Exponential loss | exp(-m) | AdaBoost (boosting) | Convex, differentiable; sensitive to outliers |
| Squared hinge loss | max(0, 1 - m)^2 | Modified SVM variants | Convex, differentiable; balances hinge and smoothness |
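The table's margin-based losses are one-liners in code. The sketch below evaluates them at a few margins; note that hinge and exponential upper-bound the 0-1 loss pointwise, while logistic loss does so only after rescaling by 1/log 2. The function names are mine, for illustration.

```python
# Sketch: the classification surrogate losses from the table, each as a
# function of the margin m = y * f(x).

from math import exp, log

def zero_one(m): return 1.0 if m < 0 else 0.0
def hinge(m):    return max(0.0, 1.0 - m)
def logistic(m): return log(1.0 + exp(-m))
def expo(m):     return exp(-m)

for m in (-1.0, 0.0, 1.0, 2.0):
    print(m, zero_one(m), hinge(m), round(logistic(m), 3), round(expo(m), 3))
# Hinge and exponential dominate the 0-1 loss at every margin; all three
# convex surrogates penalize negative margins smoothly enough to optimize.
```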
For regression, the loss is instead a function of the residual y - f(x):

| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| Squared loss (L2) | (y - f(x))^2 | Linear regression, ridge regression | Convex, differentiable; sensitive to outliers |
| Absolute loss (L1) | \|y - f(x)\| | Least absolute deviations, quantile regression | Convex, not differentiable at 0; robust to outliers |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex, differentiable; combines advantages of L1 and L2 |
| Epsilon-insensitive loss | max(0, \|y - f(x)\| - epsilon) | Support vector regression | Convex; ignores residuals smaller than epsilon |
A surrogate loss is called Bayes consistent (or classification-calibrated) if minimizing the expected surrogate loss leads to the same classifier as minimizing the 0-1 loss. Zhang (2004) and Bartlett, Jordan, and McAuliffe (2006) established that all the convex surrogates listed above (hinge, logistic, exponential) are Bayes consistent, providing theoretical justification for their use in practice.
Many standard machine learning algorithms can be understood as instances of the ERM principle with specific choices of loss function, hypothesis class, and regularizer.
| Algorithm | Loss function | Hypothesis class | Regularizer | ERM formulation |
|---|---|---|---|---|
| Ordinary least squares | Squared loss | Linear functions | None | min (1/n) sum(y_i - w^T x_i)^2 |
| Ridge regression | Squared loss | Linear functions | L2 penalty (lambda * \|\|w\|\|_2^2) | min (1/n) sum (y_i - w^T x_i)^2 + lambda * \|\|w\|\|_2^2 |
| Lasso | Squared loss | Linear functions | L1 penalty (lambda * \|\|w\|\|_1) | min (1/n) sum (y_i - w^T x_i)^2 + lambda * \|\|w\|\|_1 |
| Logistic regression | Logistic loss | Linear functions | L2 or L1 | min (1/n) sum log(1 + exp(-y_i w^T x_i)) + lambda * R(w) |
| Support vector machines | Hinge loss | Linear or kernel functions | L2 penalty | min (1/n) sum max(0, 1 - y_i w^T x_i) + lambda * \|\|w\|\|_2^2 |
| AdaBoost | Exponential loss | Ensemble of weak learners | Implicit (boosting rounds) | Sequentially minimizes weighted exponential loss |
| Neural networks | Cross-entropy, squared loss, etc. | Compositions of nonlinear layers | Weight decay (L2), dropout | min (1/n) sum L(f_theta(x_i), y_i) + lambda * \|\|theta\|\|_2^2 |
Regularized empirical risk minimization augments the ERM objective with a penalty term that discourages overly complex hypotheses:
h_RERM = argmin over h in H of [(1/n) * sum L(h(x_i), y_i) + lambda * R(h)]
where R(h) is the regularization term and lambda >= 0 is a hyperparameter controlling the tradeoff between fitting the data and maintaining simplicity.
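In the simplest case, one-dimensional ridge regression, the regularized ERM objective has a closed-form minimizer, which the sketch below verifies. Setting the derivative of (1/n) sum (y_i - w*x_i)^2 + lambda*w^2 to zero gives w = (mean of x_i*y_i) / (mean of x_i^2 + lambda); the data are illustrative.

```python
# Sketch: regularized ERM for 1-D ridge regression via its closed form.

def ridge_1d(xs, ys, lam):
    """Minimizer of (1/n) sum (y_i - w*x_i)^2 + lam * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]            # exactly y = 2x
print(ridge_1d(xs, ys, lam=0.0))     # 2.0: unregularized ERM recovers the slope
print(ridge_1d(xs, ys, lam=1.0))     # shrunk toward 0 by the penalty
```

Increasing lambda pulls the solution toward zero, which is the mechanism by which the penalty trades data fit for simplicity.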
| Regularizer | R(h) | Effect | Example algorithms |
|---|---|---|---|
| L2 (Tikhonov / ridge) | w | ||
| L1 (Lasso) | w | ||
| Elastic net | alpha * | w | |
| Early stopping | Number of gradient descent iterations | Implicitly limits model complexity | Neural network training |
| Dropout | Random zeroing of activations during training | Prevents co-adaptation of neurons | Deep learning |
Regularized ERM provides a practical implementation of the structural risk minimization principle. The regularization parameter lambda plays the role of controlling the effective complexity of the hypothesis class, analogous to selecting the nesting level in SRM.
Empirical risk minimization and maximum likelihood estimation (MLE) are closely related. When the loss function is the negative log-likelihood, ERM reduces to MLE.
Specifically, suppose the data are generated from a parametric family of distributions P_theta. The negative log-likelihood loss is:
L(theta; x, y) = -log P_theta(y | x)
Minimizing the empirical risk with this loss:
(1/n) sum from i=1 to n of [-log P_theta(y_i | x_i)]
is equivalent to maximizing the likelihood function. Furthermore, minimizing the expected negative log-likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true distribution and the model distribution.
Adding an L2 regularizer to the negative log-likelihood corresponds to maximum a posteriori (MAP) estimation with a Gaussian prior on the parameters.
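The ERM-MLE correspondence can be checked numerically: for labels y in {-1, +1} and model P_w(y | x) = sigmoid(y * w * x), the per-example negative log-likelihood is exactly the logistic loss at margin y * w * x, since -log sigmoid(m) = log(1 + exp(-m)). The weight and data below are illustrative.

```python
# Sketch: ERM with logistic loss coincides with MLE for a Bernoulli model.

from math import exp, log

def sigmoid(z): return 1.0 / (1.0 + exp(-z))
def logistic_loss(m): return log(1.0 + exp(-m))

w = 0.7
sample = [(1.0, +1), (2.0, -1), (-0.5, +1)]

nll = sum(-log(sigmoid(y * w * x)) for x, y in sample) / len(sample)
emp = sum(logistic_loss(y * w * x) for x, y in sample) / len(sample)
print(abs(nll - emp) < 1e-12)  # True: the two objectives coincide
```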
While VC dimension provides distribution-free generalization bounds, these bounds can be loose in practice because they must hold for the worst-case distribution. Rademacher complexity offers an alternative that adapts to the specific data distribution.
The empirical Rademacher complexity of a hypothesis class H with respect to a sample S = {x_1, ..., x_n} is:
R_S(H) = (1/n) E_sigma [sup over h in H of sum from i=1 to n of sigma_i * h(x_i)]
where sigma_1, ..., sigma_n are independent Rademacher random variables (each taking values +1 or -1 with equal probability). Intuitively, the Rademacher complexity measures how well the hypothesis class can fit random noise.
With probability at least 1 - delta over the draw of the training sample:
R(h) <= R_emp(h) + 2 * R_S(H) + sqrt(log(1/delta) / (2n))
for all h in H simultaneously.
This bound is often tighter than VC-based bounds because the Rademacher complexity can be estimated from the data and adapts to the actual sample distribution. Bartlett and Mendelson (2002) developed this framework, providing data-dependent risk bounds applicable to decision trees, neural networks, and support vector machines.
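The intuition that Rademacher complexity measures fit to random noise can be made concrete with a Monte Carlo estimate over random sign vectors. The sketch below compares a single-hypothesis class against a richer class of thresholds on the same sample; all names and parameters are illustrative.

```python
# Sketch: Monte Carlo estimate of empirical Rademacher complexity for a
# finite class of {-1,+1}-valued threshold classifiers. Richer classes
# correlate better with random signs and score higher.

import random

def rademacher_estimate(hypotheses, xs, trials=2000, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
        total += max(sum(s * h(x) for s, x in zip(sigma, xs))
                     for h in hypotheses) / n
    return total / trials

def make_threshold(t):
    return lambda x: 1 if x >= t else -1

xs = [1, 2, 3, 4, 5]
small = [make_threshold(3)]                       # a single hypothesis
large = [make_threshold(t) for t in range(7)]     # many thresholds

print(rademacher_estimate(small, xs) < rademacher_estimate(large, xs))  # True
```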
The no free lunch theorem, proved by Wolpert (1996) for supervised learning, demonstrates a fundamental limitation of ERM (and any learning algorithm). Without assumptions about the data distribution, no learning algorithm can be universally good.
Formally, for any learning algorithm and any sample size n, there exists a data distribution such that:

- the algorithm's output has large true risk, comparable to random guessing, with non-negligible probability, yet
- some predictor achieves zero true risk on that same distribution.
This result means that ERM can only succeed when the hypothesis class is appropriately chosen relative to the true data-generating process. The restriction to a specific hypothesis class H is not a limitation of ERM but a necessary feature: it encodes the learner's assumptions about which patterns are plausible.
Minimizing empirical risk is not always computationally easy. The difficulty depends on the choice of loss function and hypothesis class.
For binary classification, minimizing the empirical 0-1 loss over the class of linear classifiers is NP-hard. This result, due to multiple researchers including Feldman, Guruswami, Raghavendra, and Wu (2009), shows that even simple hypothesis classes lead to intractable optimization when paired with the natural loss function.
The standard approach to circumvent NP-hardness is to replace the 0-1 loss with a convex surrogate. Because the surrogate loss is convex and typically differentiable (or subdifferentiable), the resulting ERM problem can be solved efficiently using convex optimization techniques such as gradient descent, stochastic gradient descent, or interior point methods.
The excess risk from using a surrogate instead of the true loss can be controlled. For example, Zhang's inequality (2004) shows that for convex, classification-calibrated surrogates, the excess 0-1 risk can be bounded in terms of the excess surrogate risk, ensuring that good performance on the surrogate translates to good performance on the original objective.
For large-scale problems, computing the full empirical risk over all training examples at each iteration is expensive. Stochastic gradient descent (SGD) and its variants (Adam, AdaGrad, RMSProp) approximate the gradient of the empirical risk using mini-batches, enabling training on datasets with millions or billions of examples.
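A bare-bones version of this idea fits in a few lines: each step estimates the empirical-risk gradient from a random mini-batch rather than the full sample. The model, step size, and data below are illustrative.

```python
# Sketch: mini-batch SGD on the empirical risk of a 1-D linear model with
# squared loss. Each step uses the gradient on a random batch instead of
# the full training set.

import random

def sgd(sample, lr=0.01, batch_size=4, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(sample, batch_size)
        # gradient of (1/b) * sum (w*x - y)^2 with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

sample = [(x, 2.0 * x) for x in range(1, 11)]   # noiseless y = 2x
w = sgd(sample)
print(round(w, 2))  # converges to the true slope 2.0
```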
The success of deep learning has raised new questions about the ERM principle. Modern neural networks are often heavily overparameterized, meaning they have far more parameters than training examples. Classical ERM theory predicts that such models should overfit badly, yet in practice they often generalize well.
When a model has enough capacity to fit the training data perfectly (achieving zero empirical risk), it is said to interpolate the data. Classical learning theory suggests that interpolating models should have high true risk due to overfitting. However, empirical observations and recent theoretical work have revealed a more nuanced picture.
The double descent phenomenon, documented by Belkin, Hsu, Ma, and Mandal (2019) and explored further by Nakkiran et al. (2021), describes a non-monotonic relationship between model complexity and test error:
This phenomenon has been observed across multiple model types, including neural networks, random forests, and kernel methods.
One explanation for why overparameterized models generalize despite interpolating is implicit regularization. The optimization algorithm (typically SGD) and its initialization do not find an arbitrary interpolating solution; instead, they converge to solutions with particular structural properties, such as low norm or low rank. These inductive biases effectively restrict the hypothesis class, even though no explicit regularizer is present in the objective.
For example, gradient descent on overparameterized linear models converges to the minimum-norm interpolating solution, which has well-understood generalization properties. For neural networks, the precise characterization of implicit regularization remains an active area of research.
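The linear-model claim above can be verified directly: on an underdetermined system, gradient descent initialized at zero stays in the span of the data and converges to the minimum-norm interpolant. The sketch uses one data point and two weights, an illustrative setup.

```python
# Sketch: gradient descent from zero on an underdetermined linear problem
# converges to the minimum-norm interpolating solution. Any w with
# w . x = y interpolates, but GD picks w parallel to x: y * x / ||x||^2.

def gd(x, y, lr=0.05, steps=500):
    w = [0.0, 0.0]
    for _ in range(steps):
        err = w[0] * x[0] + w[1] * x[1] - y     # residual of w . x = y
        w = [w[0] - lr * 2 * err * x[0],
             w[1] - lr * 2 * err * x[1]]        # update stays in span{x}
    return w

x, y = [1.0, 2.0], 10.0
w = gd(x, y)
min_norm = [y * xi / (x[0] ** 2 + x[1] ** 2) for xi in x]  # [2.0, 4.0]
print([round(wi, 4) for wi in w])  # matches the minimum-norm solution
```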
Tilted empirical risk minimization (TERM), introduced by Li et al. (2021), extends the standard ERM objective by introducing an exponential tilt parameter t that adjusts the influence of individual losses:
R_t(h) = (1/t) log((1/n) sum from i=1 to n of exp(t * L(h(x_i), y_i)))
The parameter t interpolates between different objectives:
| Tilt parameter | Limiting behavior | Effect |
|---|---|---|
| t -> 0 | Standard ERM (arithmetic mean of losses) | Equal weighting of all examples |
| t -> +infinity | Maximum loss (minimax risk) | Focuses on worst-case examples |
| t -> -infinity | Minimum loss | Focuses on easiest examples |
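The limiting behaviors in the table can be observed numerically by evaluating the tilted risk at a small, a large positive, and a large negative t; the loss values below are illustrative.

```python
# Sketch: the tilted risk R_t(h) = (1/t) log((1/n) sum exp(t * L_i)) for a
# fixed set of per-example losses, showing the three limiting regimes.

from math import exp, log

def tilted_risk(losses, t):
    n = len(losses)
    return (1.0 / t) * log(sum(exp(t * L) for L in losses) / n)

losses = [0.1, 0.5, 2.0]
print(round(tilted_risk(losses, 1e-6), 4))   # ~0.8667, the arithmetic mean
print(round(tilted_risk(losses, 300.0), 2))  # ~2.0, dominated by the max loss
print(round(tilted_risk(losses, -300.0), 2)) # ~0.1, dominated by the min loss
```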
TERM provides a unified framework for several practical goals:

- Robustness to outliers: negative t downweights examples with large losses.
- Fairness and worst-case performance: positive t upweights high-loss examples or subgroups.
- A single hyperparameter that smoothly interpolates between average-case and worst-case objectives.
The framework has been applied to problems including fair classification, robust regression, and federated learning.
Despite its theoretical elegance and practical utility, ERM has several well-known limitations:
Distribution-free bounds are often loose. The generalization bounds derived from VC theory hold for all possible data distributions, but this worst-case guarantee comes at the cost of tightness. In practice, the bounds are often too large to be informative.
Overfitting with rich hypothesis classes. When the hypothesis class is too large relative to the sample size, ERM will select a hypothesis that fits the training data well but performs poorly on new data. This is the fundamental motivation for regularized ERM and structural risk minimization.
Computational intractability. For many natural combinations of loss functions and hypothesis classes, the ERM optimization problem is NP-hard. Convex relaxations and approximate algorithms are needed in practice.
Sensitivity to the training distribution. ERM optimizes average performance over the training set, which may not align with the desired objective. If the training data are imbalanced, contain systematic biases, or are not representative of the deployment distribution, the ERM solution may perform poorly on underrepresented subgroups or in distribution-shifted settings.
No free lunch. Without assumptions about the data distribution, no learning algorithm (including ERM) can guarantee good performance. The choice of hypothesis class implicitly encodes assumptions that may or may not match reality.