Empirical risk minimization (ERM) is a principle in statistical learning theory that provides a foundation for designing and analyzing machine learning algorithms. The core idea is straightforward: because the true probability distribution generating the data is unknown, a learning algorithm should select the hypothesis that minimizes the average loss computed over the observed training data. ERM was formalized by Vladimir Vapnik and Alexey Chervonenkis in the late 1960s as part of their broader work on the theory of pattern recognition and uniform convergence of empirical processes.
The ERM principle underlies many widely used algorithms, including linear regression, logistic regression, and support vector machines. Understanding its theoretical properties, such as when and why minimizing training error leads to good performance on unseen data, is one of the central questions in learning theory.
Imagine you are trying to learn which fruits are sweet and which are sour. You have a basket of fruits that you have already tasted, and you know which ones were sweet and which were sour. Empirical risk minimization says: pick the rule that makes the fewest mistakes on the fruits you have already tasted, and then use that rule for new fruits you have not tried yet.
The word "empirical" means "based on what you have actually seen," and "risk" is just another word for "how many mistakes you make." So empirical risk minimization means "make as few mistakes as possible on the examples you have."
The tricky part is that a rule which perfectly memorizes your basket might not work well on new fruits. If your basket happened to contain only red sweet apples and green sour apples, you might wrongly conclude that color is all that matters. This problem is called overfitting, and much of learning theory is about figuring out when your rule for the basket will also work for new fruits.
The standard supervised learning setup consists of the following components:

- An input space X and an output space Y.
- An unknown probability distribution P(x, y) over X × Y from which examples are drawn.
- A hypothesis class H of candidate functions h: X -> Y.
- A loss function L(h(x), y) measuring the cost of predicting h(x) when the true output is y.
The true risk (also called the expected risk or population risk) of a hypothesis h is defined as:
R(h) = E[L(h(x), y)] = integral of L(h(x), y) dP(x, y)
This quantity represents the expected loss of the hypothesis over the entire data distribution. Since P(x, y) is unknown, the true risk cannot be computed directly.
Given a training sample S = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} drawn i.i.d. from P, the empirical risk of a hypothesis h is:
R_emp(h) = (1/n) * sum from i=1 to n of L(h(x_i), y_i)
The empirical risk is simply the average loss over the training sample. By the law of large numbers, for any fixed hypothesis h, the empirical risk converges to the true risk as the sample size n grows.
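The empirical risk is straightforward to compute in code. The following minimal sketch evaluates it for a threshold classifier under 0-1 loss; the hypothesis, data, and helper names are illustrative, not part of any standard library.

```python
# Minimal sketch: empirical risk of a threshold classifier under 0-1 loss.
# The hypothesis h and the toy sample are illustrative assumptions.

def zero_one_loss(prediction, label):
    """0-1 loss: 1 for a mistake, 0 for a correct prediction."""
    return 0.0 if prediction == label else 1.0

def empirical_risk(h, sample):
    """R_emp(h) = (1/n) * sum of L(h(x_i), y_i) over the sample."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)

# Toy sample: label is +1 when x >= 3, -1 otherwise, with one noisy point.
sample = [(1, -1), (2, -1), (3, +1), (4, +1), (5, -1)]
h = lambda x: +1 if x >= 3 else -1

print(empirical_risk(h, sample))  # misclassifies only (5, -1): 1/5 = 0.2
```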
The ERM principle selects the hypothesis that minimizes the empirical risk:
h_ERM = argmin over h in H of R_emp(h)
The hope is that the hypothesis minimizing the empirical risk will also have a low true risk. Whether this hope is justified depends on properties of the hypothesis class H and the sample size n.
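For a finite hypothesis class, the ERM rule can be realized literally as an argmin by enumeration. The sketch below does this for a small set of threshold classifiers; the candidate set and data are illustrative assumptions.

```python
# Sketch of the ERM rule for a finite hypothesis class: enumerate candidate
# threshold classifiers and keep the one with the lowest empirical risk.

def empirical_risk(h, sample):
    """Fraction of training examples the hypothesis misclassifies."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses, sample):
    """h_ERM = argmin over h in H of R_emp(h)."""
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))

def make_threshold(t):
    return lambda x: +1 if x >= t else -1

sample = [(1, -1), (2, -1), (3, +1), (4, +1)]
H = [make_threshold(t) for t in range(6)]      # thresholds 0..5
best = erm(H, sample)
print(empirical_risk(best, sample))  # threshold 3 fits perfectly: 0.0
```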
The foundations of empirical risk minimization trace back to the work of Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control Sciences in Moscow. Their research program, which began in the early 1960s, addressed a fundamental question: under what conditions does minimizing empirical risk lead to minimizing the true risk?
| Year | Event |
|---|---|
| 1964 | Vapnik and Chervonenkis develop the "generalized portrait" method for pattern recognition, an early precursor to support vector machines |
| 1968 | Vapnik and Chervonenkis publish their foundational result on uniform convergence of empirical risk to true risk, introducing conditions based on the capacity of the function class |
| 1971 | An English translation of the 1968 paper becomes available, bringing the results to a wider audience |
| 1974 | Vapnik and Chervonenkis publish Theory of Pattern Recognition, formalizing the structural risk minimization principle |
| 1982 | Vapnik publishes Estimation of Dependences Based on Empirical Data, presenting a systematic treatment of the ERM principle and VC theory |
| 1984 | Leslie Valiant introduces the Probably Approximately Correct (PAC) learning framework independently, providing a complementary computational perspective on learning |
| 1989 | Blumer, Ehrenfeucht, Haussler, and Warmuth connect PAC learning and VC dimension, establishing the fundamental theorem of statistical learning |
| 1995 | Vapnik publishes The Nature of Statistical Learning Theory, which synthesizes decades of work and popularizes these ideas in the machine learning community |
| 1998 | Vapnik's expanded second edition further develops the theory and its connections to kernel methods and support vector machines |
The central question in ERM theory is whether minimizing empirical risk leads to minimizing true risk. The answer depends on a property called uniform convergence: the empirical risk must converge to the true risk simultaneously for all hypotheses in the class H, not just for individual ones.
Uniform convergence means that for any epsilon > 0:
P(sup over h in H of |R_emp(h) - R(h)| > epsilon) -> 0 as n -> infinity
If uniform convergence holds, then the hypothesis selected by ERM will have true risk close to the best achievable true risk in H. This is because, whenever |R_emp(h) - R(h)| <= epsilon for every h in H simultaneously:

R(h_ERM) <= R_emp(h_ERM) + epsilon <= R_emp(h_H*) + epsilon <= R(h_H*) + 2 * epsilon

where h_H* is the minimizer of the true risk over H; the middle inequality holds because h_ERM minimizes the empirical risk.
Vapnik and Chervonenkis proved that uniform convergence is both necessary and sufficient for the consistency of ERM across all possible data distributions.
The VC dimension (Vapnik-Chervonenkis dimension) is a measure of the capacity or expressiveness of a hypothesis class. For binary classification, the VC dimension of a class H is the largest number of points that can be shattered by H, meaning that H can realize all possible labelings of those points.
Examples of VC dimensions:
| Hypothesis class | VC dimension |
|---|---|
| Linear classifiers in R^d | d + 1 |
| Intervals on the real line | 2 |
| Convex polygons with k sides in R^2 | 2k + 1 |
| Finite hypothesis class of size \|H\| | at most log2(\|H\|) |
| Sine functions (sin(theta * x)) | infinite |
| Set of all functions from X to {0,1} | infinite (if X is infinite) |
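The notion of shattering can be checked computationally for small classes. The sketch below enumerates the labelings realizable by intervals on the real line (labeling x as 1 iff a <= x <= b), confirming the table entry that their VC dimension is 2; the function names are mine, for illustration.

```python
# Sketch: checking shattering for intervals on the real line. It suffices
# to try interval endpoints drawn from the point set plus two sentinels,
# since only the induced labeling matters.

def interval_labelings(points):
    """All labelings of `points` realizable by some interval [a, b]."""
    candidates = sorted(points)
    endpoints = [min(candidates) - 1] + candidates + [max(candidates) + 1]
    labelings = set()
    for a in endpoints:
        for b in endpoints:  # a > b yields the all-zero labeling
            labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

def is_shattered(points):
    """True iff intervals realize all 2^n labelings of the points."""
    return len(interval_labelings(points)) == 2 ** len(points)

print(is_shattered([1, 2]))     # True: intervals shatter any 2 points
print(is_shattered([1, 2, 3]))  # False: labeling (1, 0, 1) is unreachable
```

Three points can never be shattered because an interval is convex: it cannot contain the two outer points while excluding the middle one, so the VC dimension is exactly 2, matching the table.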
The VC inequality provides a quantitative bound on the deviation of empirical risk from true risk. For binary classification with 0-1 loss:
P(sup over h in H of |R_emp(h) - R(h)| > epsilon) <= 8 * S(H, n) * exp(-n * epsilon^2 / 32)
where S(H, n) is the growth function (also called the shattering coefficient), which counts the maximum number of distinct labelings that H can produce on n points.
Sauer's lemma bounds the growth function in terms of the VC dimension d:
S(H, n) <= sum from i=0 to d of C(n, i) <= (en/d)^d
for n >= d. This means the growth function is at most polynomial in n when the VC dimension is finite, rather than exponential.
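Sauer's lemma is easy to evaluate numerically, which makes the polynomial-versus-exponential gap concrete. The following sketch computes the bound for an illustrative n and d.

```python
# Sketch: Sauer's lemma numerically. For finite VC dimension d, the growth
# function is bounded by sum_{i=0}^{d} C(n, i), which is polynomial in n,
# far below the exponential 2^n of an unrestricted class.

from math import comb  # Python 3.8+

def sauer_bound(n, d):
    """Upper bound on the growth function S(H, n) when VCdim(H) = d."""
    return sum(comb(n, i) for i in range(d + 1))

n, d = 20, 3
print(sauer_bound(n, d))   # 1 + 20 + 190 + 1140 = 1351
print(2 ** n)              # 1048576: exponential without a VC restriction
```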
Combining the VC inequality with properties of ERM yields the following generalization bound. With probability at least 1 - delta over the random draw of the training set of size n:
R(h_ERM) <= R_emp(h_ERM) + O(sqrt((d * log(n/d) + log(1/delta)) / n))
where d is the VC dimension of H. This bound states that the true risk of the ERM solution is at most its empirical risk plus a complexity penalty that decreases as the sample size grows and increases with the VC dimension.
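To see how the complexity penalty behaves, the sketch below evaluates the sqrt term from the bound for growing sample sizes, taking the hidden constant inside the O(.) as 1 for illustration.

```python
# Sketch: evaluating the VC-bound penalty sqrt((d*log(n/d) + log(1/delta))/n)
# for increasing sample sizes. The constant in the O(.) is assumed to be 1.

from math import log, sqrt

def vc_penalty(n, d, delta):
    return sqrt((d * log(n / d) + log(1 / delta)) / n)

for n in (100, 1000, 10000):
    print(n, round(vc_penalty(n, d=10, delta=0.05), 3))
# For fixed d and delta, the penalty shrinks roughly like 1/sqrt(n).
```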
The fundamental theorem of statistical learning, established through the combined work of Vapnik, Chervonenkis, Blumer, Ehrenfeucht, Haussler, and Warmuth, characterizes exactly when learning is possible. For binary classification, the following statements are equivalent:

- H has finite VC dimension.
- H has the uniform convergence property.
- ERM is a consistent learning rule for H.
- H is PAC learnable, and agnostically PAC learnable.
This theorem is remarkable because it shows that a single combinatorial quantity, the VC dimension, completely governs the learnability of binary classification problems under the ERM principle.
The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides a computational perspective on learning. A concept class C is PAC-learnable if there exists an algorithm that, for any target concept c in C, any distribution over the input space, and any parameters epsilon > 0 and delta > 0:

- outputs, with probability at least 1 - delta, a hypothesis whose true risk is at most epsilon, and
- does so using a number of examples (and, in the efficient version, running time) polynomial in 1/epsilon and 1/delta.
The agnostic PAC framework relaxes the assumption that the target concept belongs to the hypothesis class. In this setting, the goal is to find a hypothesis whose true risk is at most epsilon worse than the best hypothesis in H:
R(h) <= [min over h' in H of R(h')] + epsilon
ERM is a natural algorithm for agnostic PAC learning. If H has finite VC dimension d, then ERM is an agnostic PAC learner with sample complexity:
m(epsilon, delta) = O(d / epsilon^2 * log(1/epsilon) + log(1/delta) / epsilon^2)
This means that the number of training examples needed to learn scales linearly with the VC dimension and inversely with the square of the desired accuracy.
The choice of hypothesis class H introduces a fundamental tradeoff between two sources of error, closely related to the bias-variance tradeoff.
The excess risk of the ERM hypothesis can be decomposed as:
R(h_ERM) - R(h*) = [R(h_H*) - R(h*)] + [R(h_ERM) - R(h_H*)]

where h* is the Bayes-optimal predictor (the minimizer of the true risk over all measurable functions) and h_H* is the best hypothesis within H.

The two terms are:
| Error component | Also called | Depends on | Behavior as H grows |
|---|---|---|---|
| R(h_H*) - R(h*) | Approximation error (bias) | Expressiveness of H | Decreases |
| R(h_ERM) - R(h_H*) | Estimation error (variance) | Sample size n and complexity of H | Increases |
The approximation error measures how well the best hypothesis in H can fit the true data-generating process. A larger, more expressive hypothesis class reduces this error. The estimation error measures the cost of estimating the best hypothesis from finite data. A larger hypothesis class makes this harder because there are more candidates to distinguish among.
The goal in practice is to choose H to balance these two sources of error. This is the theoretical justification for model selection techniques and regularization.
Structural risk minimization (SRM) is a principle introduced by Vapnik and Chervonenkis in their 1974 book that directly addresses the approximation-estimation tradeoff. Rather than fixing a single hypothesis class, SRM considers a nested sequence of hypothesis classes:
H_1 subset H_2 subset H_3 subset ...
where each H_k has finite VC dimension d_k, and d_1 <= d_2 <= d_3 <= ...
The SRM principle selects both the hypothesis class and the hypothesis within it by minimizing an upper bound on the true risk:
h_SRM = argmin over k and h in H_k of [R_emp(h) + penalty(d_k, n, delta)]
The penalty term increases with the VC dimension of the chosen class, discouraging the learner from selecting overly complex models unless the data strongly support it.
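The SRM selection rule can be sketched concretely for a toy nested sequence of classes: constant rules inside threshold rules inside interval rules, with d_k = k standing in for the VC dimensions and sqrt(d_k / n) as a simplified stand-in for the VC penalty. All names and the penalty form are illustrative assumptions, not Vapnik's exact construction.

```python
# Sketch of SRM: jointly pick the class H_k and the hypothesis h in H_k
# minimizing R_emp(h) + penalty(d_k, n). Classes are nested: H_1 (constants)
# subset H_2 (adds thresholds) subset H_3 (adds intervals).

from math import sqrt

def emp_risk(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def classes(xs):
    grid = sorted(xs) + [max(xs) + 1]
    H1 = [lambda x: +1, lambda x: -1]
    H2 = [(lambda t: lambda x: +1 if x >= t else -1)(t) for t in grid]
    H3 = [(lambda a, b: lambda x: +1 if a <= x <= b else -1)(a, b)
          for a in grid for b in grid]
    return [(1, H1), (2, H1 + H2), (3, H1 + H2 + H3)]

def srm(sample):
    n = len(sample)
    scored = []
    for d_k, H_k in classes([x for x, _ in sample]):
        h = min(H_k, key=lambda h: emp_risk(h, sample))
        scored.append((emp_risk(h, sample) + sqrt(d_k / n), d_k, h))
    return min(scored, key=lambda s: s[0])

sample = [(1, -1), (2, -1), (3, +1), (4, +1)]
score, d_k, h = srm(sample)
print(d_k)  # 2: the threshold class already fits perfectly, so the richer
            # interval class is rejected by its larger penalty
```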
| Property | ERM | SRM |
|---|---|---|
| Hypothesis class | Fixed | Selected from a nested sequence |
| Objective | Minimize empirical risk | Minimize empirical risk plus complexity penalty |
| Overfitting risk | Higher for complex H | Controlled by the penalty term |
| Requires model selection | External (cross-validation, etc.) | Built into the principle |
| Theoretical guarantee | Consistent if H has finite VC dimension | Consistent even over a growing sequence of classes |
| Practical realization | Least squares, MLE | Regularized methods, SVM with kernel selection |
In many practical settings, the natural loss function is computationally intractable. For binary classification, the 0-1 loss (which counts misclassifications) is the most natural choice, but minimizing empirical risk with 0-1 loss is NP-hard even for linear classifiers. To make optimization feasible, practitioners replace the 0-1 loss with a convex surrogate loss function.
The following table compares common loss functions used in place of the 0-1 loss. Each is expressed as a function of the margin m = y * f(x), where y is in {-1, +1} and f(x) is the real-valued prediction.
| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| 0-1 loss | 1 if m < 0, else 0 | Ideal classification metric | Non-convex, non-differentiable; NP-hard to minimize |
| Hinge loss | max(0, 1 - m) | Support vector machines | Convex, not differentiable at m = 1; yields sparse solutions |
| Logistic loss | log(1 + exp(-m)) | Logistic regression | Convex, differentiable; outputs calibrated probabilities |
| Exponential loss | exp(-m) | AdaBoost (boosting) | Convex, differentiable; sensitive to outliers |
| Squared hinge loss | max(0, 1 - m)^2 | Modified SVM variants | Convex, differentiable; balances hinge and smoothness |
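The table's margin-based losses are one-liners in code. The sketch below evaluates them at a few margins; note that hinge and exponential upper-bound the 0-1 loss pointwise, while logistic loss does so only after rescaling by 1/log 2. The function names are mine, for illustration.

```python
# Sketch: the classification surrogate losses from the table, each as a
# function of the margin m = y * f(x).

from math import exp, log

def zero_one(m): return 1.0 if m < 0 else 0.0
def hinge(m):    return max(0.0, 1.0 - m)
def logistic(m): return log(1.0 + exp(-m))
def expo(m):     return exp(-m)

for m in (-1.0, 0.0, 1.0, 2.0):
    print(m, zero_one(m), hinge(m), round(logistic(m), 3), round(expo(m), 3))
# Hinge and exponential dominate the 0-1 loss at every margin; all three
# convex surrogates penalize negative margins smoothly enough to optimize.
```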
For regression, the loss is instead a function of the residual y - f(x):

| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| Squared loss (L2) | (y - f(x))^2 | Linear regression, ridge regression | Convex, differentiable; sensitive to outliers |
| Absolute loss (L1) | \|y - f(x)\| | Least absolute deviations, quantile regression | Convex, not differentiable at 0; robust to outliers |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex, differentiable; combines advantages of L1 and L2 |
| Epsilon-insensitive loss | max(0, \|y - f(x)\| - epsilon) | Support vector regression | Convex; ignores residuals smaller than epsilon |
A surrogate loss is called Bayes consistent (or classification-calibrated) if minimizing the expected surrogate loss leads to the same classifier as minimizing the 0-1 loss. Zhang (2004) and Bartlett, Jordan, and McAuliffe (2006) established that all the convex surrogates listed above (hinge, logistic, exponential) are Bayes consistent, providing theoretical justification for their use in practice.
Many standard machine learning algorithms can be understood as instances of the ERM principle with specific choices of loss function, hypothesis class, and regularizer.
| Algorithm | Loss function | Hypothesis class | Regularizer | ERM formulation |
|---|---|---|---|---|
| Ordinary least squares | Squared loss | Linear functions | None | min (1/n) sum(y_i - w^T x_i)^2 |
| Ridge regression | Squared loss | Linear functions | L2 penalty (lambda * \|\|w\|\|_2^2) | min (1/n) sum (y_i - w^T x_i)^2 + lambda * \|\|w\|\|_2^2 |
| Lasso | Squared loss | Linear functions | L1 penalty (lambda * \|\|w\|\|_1) | min (1/n) sum (y_i - w^T x_i)^2 + lambda * \|\|w\|\|_1 |
| Logistic regression | Logistic loss | Linear functions | L2 or L1 | min (1/n) sum log(1 + exp(-y_i w^T x_i)) + lambda * R(w) |
| Support vector machines | Hinge loss | Linear or kernel functions | L2 penalty | min (1/n) sum max(0, 1 - y_i w^T x_i) + lambda * \|\|w\|\|_2^2 |
| AdaBoost | Exponential loss | Ensemble of weak learners | Implicit (boosting rounds) | Sequentially minimizes weighted exponential loss |
| Neural networks | Cross-entropy, squared loss, etc. | Compositions of nonlinear layers | Weight decay (L2), dropout | min (1/n) sum L(f_theta(x_i), y_i) + lambda * \|\|theta\|\|_2^2 |
Regularized empirical risk minimization augments the ERM objective with a penalty term that discourages overly complex hypotheses:
h_RERM = argmin over h in H of [(1/n) * sum L(h(x_i), y_i) + lambda * R(h)]
where R(h) is the regularization term and lambda >= 0 is a hyperparameter controlling the tradeoff between fitting the data and maintaining simplicity.
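In the simplest case, one-dimensional ridge regression, the regularized ERM objective has a closed-form minimizer, which the sketch below verifies. Setting the derivative of (1/n) sum (y_i - w*x_i)^2 + lambda*w^2 to zero gives w = (mean of x_i*y_i) / (mean of x_i^2 + lambda); the data are illustrative.

```python
# Sketch: regularized ERM for 1-D ridge regression via its closed form.

def ridge_1d(xs, ys, lam):
    """Minimizer of (1/n) sum (y_i - w*x_i)^2 + lam * w^2."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys)) / n
    sxx = sum(x * x for x in xs) / n
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]            # exactly y = 2x
print(ridge_1d(xs, ys, lam=0.0))     # 2.0: unregularized ERM recovers the slope
print(ridge_1d(xs, ys, lam=1.0))     # shrunk toward 0 by the penalty
```

Increasing lambda pulls the solution toward zero, which is the mechanism by which the penalty trades data fit for simplicity.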
| Regularizer | R(h) | Effect | Example algorithms |
|---|---|---|---|
| L2 (Tikhonov / ridge) | w | ||
| L1 (Lasso) | w | ||
| Elastic net | alpha * | w | |
| Early stopping | Number of gradient descent iterations | Implicitly limits model complexity | Neural network training |
| Dropout | Random zeroing of activations during training | Prevents co-adaptation of neurons | Deep learning |
Regularized ERM provides a practical implementation of the structural risk minimization principle. The regularization parameter lambda plays the role of controlling the effective complexity of the hypothesis class, analogous to selecting the nesting level in SRM.
Empirical risk minimization and maximum likelihood estimation (MLE) are closely related. When the loss function is the negative log-likelihood, ERM reduces to MLE.
Specifically, suppose the data are generated from a parametric family of distributions P_theta. The negative log-likelihood loss is:
L(theta; x, y) = -log P_theta(y | x)
Minimizing the empirical risk with this loss:
(1/n) sum from i=1 to n of [-log P_theta(y_i | x_i)]
is equivalent to maximizing the likelihood function. Furthermore, minimizing the expected negative log-likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true distribution and the model distribution.
Adding an L2 regularizer to the negative log-likelihood corresponds to maximum a posteriori (MAP) estimation with a Gaussian prior on the parameters.
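The ERM-MLE correspondence can be checked numerically: for labels y in {-1, +1} and model P_w(y | x) = sigmoid(y * w * x), the per-example negative log-likelihood is exactly the logistic loss at margin y * w * x, since -log sigmoid(m) = log(1 + exp(-m)). The weight and data below are illustrative.

```python
# Sketch: ERM with logistic loss coincides with MLE for a Bernoulli model.

from math import exp, log

def sigmoid(z): return 1.0 / (1.0 + exp(-z))
def logistic_loss(m): return log(1.0 + exp(-m))

w = 0.7
sample = [(1.0, +1), (2.0, -1), (-0.5, +1)]

nll = sum(-log(sigmoid(y * w * x)) for x, y in sample) / len(sample)
emp = sum(logistic_loss(y * w * x) for x, y in sample) / len(sample)
print(abs(nll - emp) < 1e-12)  # True: the two objectives coincide
```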
While VC dimension provides distribution-free generalization bounds, these bounds can be loose in practice because they must hold for the worst-case distribution. Rademacher complexity offers an alternative that adapts to the specific data distribution.
The empirical Rademacher complexity of a hypothesis class H with respect to a sample S = {x_1, ..., x_n} is:
R_S(H) = (1/n) E_sigma [sup over h in H of sum from i=1 to n of sigma_i * h(x_i)]
where sigma_1, ..., sigma_n are independent Rademacher random variables (each taking values +1 or -1 with equal probability). Intuitively, the Rademacher complexity measures how well the hypothesis class can fit random noise.
With probability at least 1 - delta over the draw of the training sample:
R(h) <= R_emp(h) + 2 * R_S(H) + sqrt(log(1/delta) / (2n))
for all h in H simultaneously.
This bound is often tighter than VC-based bounds because the Rademacher complexity can be estimated from the data and adapts to the actual sample distribution. Bartlett and Mendelson (2002) developed this framework, providing data-dependent risk bounds applicable to decision trees, neural networks, and support vector machines.
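The intuition that Rademacher complexity measures fit to random noise can be made concrete with a Monte Carlo estimate over random sign vectors. The sketch below compares a single-hypothesis class against a richer class of thresholds on the same sample; all names and parameters are illustrative.

```python
# Sketch: Monte Carlo estimate of empirical Rademacher complexity for a
# finite class of {-1,+1}-valued threshold classifiers. Richer classes
# correlate better with random signs and score higher.

import random

def rademacher_estimate(hypotheses, xs, trials=2000, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]  # random signs
        total += max(sum(s * h(x) for s, x in zip(sigma, xs))
                     for h in hypotheses) / n
    return total / trials

def make_threshold(t):
    return lambda x: 1 if x >= t else -1

xs = [1, 2, 3, 4, 5]
small = [make_threshold(3)]                       # a single hypothesis
large = [make_threshold(t) for t in range(7)]     # many thresholds

print(rademacher_estimate(small, xs) < rademacher_estimate(large, xs))  # True
```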
The no free lunch theorem, proved by Wolpert (1996) for supervised learning, demonstrates a fundamental limitation of ERM (and any learning algorithm). Without assumptions about the data distribution, no learning algorithm can be universally good.
Formally, for any learning algorithm and any sample size n, there exists a data distribution such that:

- the algorithm's output has large true risk, comparable to random guessing, with non-negligible probability, yet
- some predictor achieves zero true risk on that same distribution.
This result means that ERM can only succeed when the hypothesis class is appropriately chosen relative to the true data-generating process. The restriction to a specific hypothesis class H is not a limitation of ERM but a necessary feature: it encodes the learner's assumptions about which patterns are plausible.
Minimizing empirical risk is not always computationally easy. The difficulty depends on the choice of loss function and hypothesis class.
For binary classification, minimizing the empirical 0-1 loss over the class of linear classifiers is NP-hard. This result, due to multiple researchers including Feldman, Guruswami, Raghavendra, and Wu (2009), shows that even simple hypothesis classes lead to intractable optimization when paired with the natural loss function.
The standard approach to circumvent NP-hardness is to replace the 0-1 loss with a convex surrogate. Because the surrogate loss is convex and typically differentiable (or subdifferentiable), the resulting ERM problem can be solved efficiently using convex optimization techniques such as gradient descent, stochastic gradient descent, or interior point methods.
The excess risk from using a surrogate instead of the true loss can be controlled. For example, Zhang's inequality (2004) shows that for convex, classification-calibrated surrogates, the excess 0-1 risk can be bounded in terms of the excess surrogate risk, ensuring that good performance on the surrogate translates to good performance on the original objective.
For large-scale problems, computing the full empirical risk over all training examples at each iteration is expensive. Stochastic gradient descent (SGD) and its variants (Adam, AdaGrad, RMSProp) approximate the gradient of the empirical risk using mini-batches, enabling training on datasets with millions or billions of examples.
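A bare-bones version of this idea fits in a few lines: each step estimates the empirical-risk gradient from a random mini-batch rather than the full sample. The model, step size, and data below are illustrative.

```python
# Sketch: mini-batch SGD on the empirical risk of a 1-D linear model with
# squared loss. Each step uses the gradient on a random batch instead of
# the full training set.

import random

def sgd(sample, lr=0.01, batch_size=4, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        batch = rng.sample(sample, batch_size)
        # gradient of (1/b) * sum (w*x - y)^2 with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

sample = [(x, 2.0 * x) for x in range(1, 11)]   # noiseless y = 2x
w = sgd(sample)
print(round(w, 2))  # converges to the true slope 2.0
```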
The success of deep learning has raised new questions about the ERM principle. Modern neural networks are often heavily overparameterized, meaning they have far more parameters than training examples. Classical ERM theory predicts that such models should overfit badly, yet in practice they often generalize well.
When a model has enough capacity to fit the training data perfectly (achieving zero empirical risk), it is said to interpolate the data. Classical learning theory suggests that interpolating models should have high true risk due to overfitting. However, empirical observations and recent theoretical work have revealed a more nuanced picture.
The double descent phenomenon, documented by Belkin, Hsu, Ma, and Mandal (2019) and explored further by Nakkiran et al. (2021), describes a non-monotonic relationship between model complexity and test error:
This phenomenon has been observed across multiple model types, including neural networks, random forests, and kernel methods.
One explanation for why overparameterized models generalize despite interpolating is implicit regularization. The optimization algorithm (typically SGD) and its initialization do not find an arbitrary interpolating solution; instead, they converge to solutions with particular structural properties, such as low norm or low rank. These inductive biases effectively restrict the hypothesis class, even though no explicit regularizer is present in the objective.
For example, gradient descent on overparameterized linear models converges to the minimum-norm interpolating solution, which has well-understood generalization properties. For neural networks, the precise characterization of implicit regularization remains an active area of research.
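The linear-model claim above can be verified directly: on an underdetermined system, gradient descent initialized at zero stays in the span of the data and converges to the minimum-norm interpolant. The sketch uses one data point and two weights, an illustrative setup.

```python
# Sketch: gradient descent from zero on an underdetermined linear problem
# converges to the minimum-norm interpolating solution. Any w with
# w . x = y interpolates, but GD picks w parallel to x: y * x / ||x||^2.

def gd(x, y, lr=0.05, steps=500):
    w = [0.0, 0.0]
    for _ in range(steps):
        err = w[0] * x[0] + w[1] * x[1] - y     # residual of w . x = y
        w = [w[0] - lr * 2 * err * x[0],
             w[1] - lr * 2 * err * x[1]]        # update stays in span{x}
    return w

x, y = [1.0, 2.0], 10.0
w = gd(x, y)
min_norm = [y * xi / (x[0] ** 2 + x[1] ** 2) for xi in x]  # [2.0, 4.0]
print([round(wi, 4) for wi in w])  # matches the minimum-norm solution
```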
Tilted empirical risk minimization (TERM), introduced by Li et al. (2021), extends the standard ERM objective by introducing an exponential tilt parameter t that adjusts the influence of individual losses:
R_t(h) = (1/t) log((1/n) sum from i=1 to n of exp(t * L(h(x_i), y_i)))
The parameter t interpolates between different objectives:
| Tilt parameter | Limiting behavior | Effect |
|---|---|---|
| t -> 0 | Standard ERM (arithmetic mean of losses) | Equal weighting of all examples |
| t -> +infinity | Maximum loss (minimax risk) | Focuses on worst-case examples |
| t -> -infinity | Minimum loss | Focuses on easiest examples |
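The limiting behaviors in the table can be observed numerically by evaluating the tilted risk at a small, a large positive, and a large negative t; the loss values below are illustrative.

```python
# Sketch: the tilted risk R_t(h) = (1/t) log((1/n) sum exp(t * L_i)) for a
# fixed set of per-example losses, showing the three limiting regimes.

from math import exp, log

def tilted_risk(losses, t):
    n = len(losses)
    return (1.0 / t) * log(sum(exp(t * L) for L in losses) / n)

losses = [0.1, 0.5, 2.0]
print(round(tilted_risk(losses, 1e-6), 4))   # ~0.8667, the arithmetic mean
print(round(tilted_risk(losses, 300.0), 2))  # ~2.0, dominated by the max loss
print(round(tilted_risk(losses, -300.0), 2)) # ~0.1, dominated by the min loss
```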
TERM provides a unified framework for several practical goals:

- Robustness to outliers: negative t downweights examples with large losses.
- Fairness and worst-case performance: positive t upweights high-loss examples or subgroups.
- A single hyperparameter that smoothly interpolates between average-case and worst-case objectives.
The framework has been applied to problems including fair classification, robust regression, and federated learning.
Despite its theoretical elegance and practical utility, ERM has several well-known limitations:
Distribution-free bounds are often loose. The generalization bounds derived from VC theory hold for all possible data distributions, but this worst-case guarantee comes at the cost of tightness. In practice, the bounds are often too large to be informative.
Overfitting with rich hypothesis classes. When the hypothesis class is too large relative to the sample size, ERM will select a hypothesis that fits the training data well but performs poorly on new data. This is the fundamental motivation for regularized ERM and structural risk minimization.
Computational intractability. For many natural combinations of loss functions and hypothesis classes, the ERM optimization problem is NP-hard. Convex relaxations and approximate algorithms are needed in practice.
Sensitivity to the training distribution. ERM optimizes average performance over the training set, which may not align with the desired objective. If the training data are imbalanced, contain systematic biases, or are not representative of the deployment distribution, the ERM solution may perform poorly on underrepresented subgroups or in distribution-shifted settings.
No free lunch. Without assumptions about the data distribution, no learning algorithm (including ERM) can guarantee good performance. The choice of hypothesis class implicitly encodes assumptions that may or may not match reality.