# Empirical Risk Minimization

> Source: https://aiwiki.ai/wiki/empirical_risk_minimization
> Updated: 2026-07-11
> Categories: Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Empirical risk minimization** (ERM) is the foundational principle of [statistical learning theory](/wiki/statistical_learning_theory): because the true risk (the expected [loss](/wiki/loss_function) over the unknown data distribution) cannot be computed, a learning algorithm instead chooses the hypothesis that minimizes the average loss measured on the observed [training data](/wiki/training_data). It is the principle behind essentially all of [supervised learning](/wiki/supervised_learning), from [linear regression](/wiki/linear_regression) and [support vector machines](/wiki/support_vector_machine_svm) to [neural network](/wiki/neural_network) training, where [gradient descent](/wiki/gradient_descent) on a training loss is exactly an ERM procedure. ERM was formalized by Vladimir Vapnik and Alexey Chervonenkis in the late 1960s as part of their broader work on the theory of pattern recognition and uniform convergence of empirical processes [1][2].

Standard treatments state the rule plainly: "the empirical risk minimization principle states that the learning algorithm should choose a hypothesis which minimizes the empirical risk over the hypothesis class" [15]. The central question of learning theory is when this is justified: minimizing training error provably controls true risk when, and only when, the empirical risk converges uniformly to the true risk over the hypothesis class, a condition Vapnik and Chervonenkis tied to a single capacity measure, the [VC dimension](/wiki/vc_dimension) [1][14]. For a class with finite VC dimension d, the gap between training and true risk shrinks at a rate of order $$\sqrt{d / n}$$ in the sample size n [6][14], while leaving the class too rich relative to n produces [overfitting](/wiki/overfitting). The same tension motivates [regularization](/wiki/regularization) and structural risk minimization, the regularized extension of ERM that Vapnik and Chervonenkis introduced in 1974 [3][14].

The ERM principle underlies many widely used algorithms, including [linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and [support vector machines](/wiki/support_vector_machine_svm). Understanding its theoretical properties, such as when and why minimizing training error leads to good performance on unseen data, is one of the central questions in learning theory [2][6].

## ELI5 (Explain like I'm 5)

Imagine you are trying to learn which fruits are sweet and which are sour. You have a basket of fruits that you have already tasted, and you know which ones were sweet and which were sour. Empirical risk minimization says: pick the rule that makes the fewest mistakes on the fruits you have already tasted, and then use that rule for new fruits you have not tried yet.

The word "empirical" means "based on what you have actually seen," and "risk" is just another word for "how many mistakes you make." So empirical risk minimization means "make as few mistakes as possible on the examples you have."

The tricky part is that a rule which perfectly memorizes your basket might not work well on new fruits. If your basket happened to contain only red sweet apples and green sour apples, you might wrongly conclude that color is all that matters. This problem is called [overfitting](/wiki/overfitting), and much of learning theory is about figuring out when your rule for the basket will also work for new fruits.

## What is the formal definition of empirical risk minimization?

### Setup

The standard supervised learning setup consists of the following components:

- An input space **X** and an output space **Y**.
- An unknown probability distribution **P(x, y)** over $$X \times Y$$ from which data are drawn independently and identically distributed (i.i.d.).
- A hypothesis class **H**, which is a set of functions $$h: X \to Y$$ that the learner considers.
- A loss function **L(h(x), y)** that measures the discrepancy between the prediction **h(x)** and the true label **y**.

### True risk

The true risk (also called the expected risk or population risk) of a hypothesis **h** is defined as:

$$
R(h) = \mathbb{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y)
$$

This quantity represents the expected loss of the hypothesis over the entire data distribution. Since **P(x, y)** is unknown, the true risk cannot be computed directly [6][15].

### Empirical risk

Given a training sample $$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$ drawn i.i.d. from **P**, the empirical risk of a hypothesis **h** is:

$$
R_{\text{emp}}(h) = \frac{1}{n} \sum_{i=1}^n L(h(x_i), y_i)
$$

The empirical risk is simply the average loss over the training sample. By the law of large numbers, for any fixed hypothesis **h**, the empirical risk converges to the true risk as the sample size **n** grows [6].

### The ERM principle

The ERM principle selects the hypothesis that minimizes the empirical risk:

$$
h_{\text{ERM}} = \arg\min_{h \in H} R_{\text{emp}}(h)
$$

The hope is that the hypothesis minimizing the empirical risk will also have a low true risk. Whether this hope is justified depends on properties of the hypothesis class **H** and the sample size **n** [1][6].

## Who invented empirical risk minimization?

The foundations of empirical risk minimization trace back to the work of Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control Sciences in Moscow. Their research program, which began in the early 1960s, addressed a fundamental question: under what conditions does minimizing empirical risk lead to minimizing the true risk [1][14]?

| Year | Event |
|------|-------|
| 1964 | Vapnik and Chervonenkis develop the "generalized portrait" method for pattern recognition, an early precursor to support vector machines |
| 1968 | Vapnik and Chervonenkis publish their foundational result on uniform convergence of empirical risk to true risk, introducing conditions based on the capacity of the function class |
| 1971 | An English translation of the 1968 paper becomes available, bringing the results to a wider audience |
| 1974 | Vapnik and Chervonenkis publish *Theory of Pattern Recognition*, formalizing the structural risk minimization principle |
| 1982 | Vapnik publishes *Estimation of Dependences Based on Empirical Data*, presenting a systematic treatment of the ERM principle and VC theory |
| 1984 | Leslie Valiant introduces the Probably Approximately Correct (PAC) learning framework independently, providing a complementary computational perspective on learning |
| 1989 | Blumer, Ehrenfeucht, Haussler, and Warmuth connect PAC learning and VC dimension, establishing the fundamental theorem of statistical learning |
| 1995 | Vapnik publishes *The Nature of Statistical Learning Theory*, which synthesizes decades of work and popularizes these ideas in the machine learning community |
| 1998 | Vapnik's expanded second edition further develops the theory and its connections to kernel methods and [support vector machines](/wiki/support_vector_machine_svm) |

The 1971 English translation of the 1968 result [1] and Vapnik's 1991 NeurIPS paper "Principles of Risk Minimization for Learning Theory" [14] are the most frequently cited statements of the principle in the machine learning literature.

## How does ERM connect to VC theory?

The central question in ERM theory is whether minimizing empirical risk leads to minimizing true risk. The answer depends on a property called **uniform convergence**: the empirical risk must converge to the true risk simultaneously for all hypotheses in the class **H**, not just for individual ones [1][14].

### Uniform convergence

Uniform convergence means that for any $$\epsilon > 0$$:

$$
P\left( \sup_{h \in H} \lvert R_{\text{emp}}(h) - R(h) \rvert > \epsilon \right) \to 0 \text{ as } n \to \infty
$$

If uniform convergence holds, then the hypothesis selected by ERM will have true risk close to the best achievable true risk in **H**. This is because:

1. The ERM solution **h_ERM** has low empirical risk (by construction).
2. Uniform convergence guarantees that empirical risk is close to true risk for all hypotheses.
3. Therefore, **h_ERM** also has low true risk.

Vapnik and Chervonenkis proved that uniform convergence is both necessary and sufficient for the consistency of ERM across all possible data distributions [1][14].

### VC dimension

The [VC dimension](/wiki/vc_dimension) (Vapnik-Chervonenkis dimension) is a measure of the capacity or expressiveness of a hypothesis class. For binary [classification](/wiki/classification), the VC dimension of a class **H** is the largest number of points that can be **shattered** by **H**, meaning that **H** can realize all possible labelings of those points [1][6].

Examples of VC dimensions:

| Hypothesis class | VC dimension |
|---|---|
| Linear classifiers in $$\mathbb{R}^d$$ | $$d + 1$$ |
| Intervals on the real line | 2 |
| Convex polygons with k sides in $$\mathbb{R}^2$$ | $$2k + 1$$ |
| Finite hypothesis class of size $$\lvert H \rvert$$ | at most $$\log_2(\lvert H \rvert)$$ |
| Sine functions ($$\sin(\theta x)$$) | infinite |
| Set of all functions from X to $$\{0, 1\}$$ | infinite (if X is infinite) |

A linear (threshold) classifier in d-dimensional space has VC dimension exactly d + 1 [6].

### The VC inequality

The VC inequality provides a quantitative bound on the deviation of empirical risk from true risk. For binary classification with 0-1 loss:

$$
P\left( \sup_{h \in H} \lvert R_{\text{emp}}(h) - R(h) \rvert > \epsilon \right) \le 8 \, S(H, n) \exp(-n \epsilon^2 / 32)
$$

where **S(H, n)** is the growth function (also called the shattering coefficient), which counts the maximum number of distinct labelings that **H** can produce on **n** points [1][14].

Sauer's lemma bounds the growth function in terms of the VC dimension **d**:

$$
S(H, n) \le \sum_{i=0}^d \binom{n}{i} \le (en/d)^d
$$

for n >= d. This means the growth function is at most polynomial in **n** when the VC dimension is finite, rather than exponential [6].

### Generalization bound for ERM

Combining the VC inequality with properties of ERM yields the following generalization bound. With probability at least 1 - delta over the random draw of the training set of size **n**:

$$
R(h_{\text{ERM}}) \le R_{\text{emp}}(h_{\text{ERM}}) + O\left( \sqrt{(d \log(n/d) + \log(1/\delta)) / n} \right)
$$

where **d** is the VC dimension of **H**. This bound states that the true risk of the ERM solution is at most its empirical risk plus a complexity penalty that decreases as the sample size grows and increases with the VC dimension [6][14]. The penalty shrinks at a rate of order $$\sqrt{d / n}$$, so doubling the effective dimension requires roughly four times as much data to hold the gap fixed.

## What is the fundamental theorem of statistical learning?

The fundamental theorem of statistical learning, established through the combined work of Vapnik, Chervonenkis, Blumer, Ehrenfeucht, Haussler, and Warmuth, characterizes exactly when learning is possible [1][5]. For binary classification, the following statements are equivalent:

1. The hypothesis class **H** has finite VC dimension.
2. **H** is a uniform Glivenko-Cantelli class (uniform convergence of empirical risk to true risk holds).
3. **H** is agnostic PAC learnable.
4. **H** is PAC learnable.
5. ERM over **H** is a consistent learning rule (for any data distribution, ERM converges to the best hypothesis in **H**).

This theorem is remarkable because it shows that a single combinatorial quantity, the VC dimension, completely governs the learnability of binary classification problems under the ERM principle [5][6]. For general (for example, multiclass and unsupervised) learning problems, the equivalence between learnability and uniform convergence can break down, and learnability is instead characterized through notions such as algorithmic stability [13].

## How does ERM relate to PAC learning?

The Probably Approximately Correct (PAC) learning framework, introduced by Leslie Valiant in 1984, provides a computational perspective on learning [4]. A concept class **C** is PAC-learnable if there exists an algorithm that, for any target concept **c** in **C**, any distribution over the input space, and any parameters $$\epsilon > 0$$ and $$\delta > 0$$:

- Uses a number of training examples polynomial in **1/epsilon**, **1/delta**, and the size of the representation.
- Runs in polynomial time.
- Outputs a hypothesis **h** such that, with probability at least $$1 - \delta$$, the true risk satisfies $$R(h) \le \epsilon$$.

### Agnostic PAC learning

The agnostic PAC framework relaxes the assumption that the target concept belongs to the hypothesis class. In this setting, the goal is to find a hypothesis whose true risk is at most epsilon worse than the best hypothesis in **H**:

$$
R(h) \le \min_{h^* \in H} R(h^*) + \epsilon
$$

ERM is a natural algorithm for agnostic PAC learning. If **H** has finite VC dimension **d**, then ERM is an agnostic PAC learner with sample complexity [6]:

$$
m(\epsilon, \delta) = O\left( \frac{d}{\epsilon^2} \log(1/\epsilon) + \frac{\log(1/\delta)}{\epsilon^2} \right)
$$

This means that the number of training examples needed to learn scales linearly with the VC dimension and inversely with the square of the desired accuracy [6].

## What is the approximation-estimation tradeoff?

The choice of hypothesis class **H** introduces a fundamental tradeoff between two sources of error, closely related to the [bias-variance tradeoff](/wiki/bias_variance_tradeoff).

### Error decomposition

The excess risk of the ERM hypothesis can be decomposed as:

$$
R(h_{\text{ERM}}) - R(h^*) = [R(h_H^*) - R(h^*)] + [R(h_{\text{ERM}}) - R(h_H^*)]
$$

where:
- **h*** is the globally optimal hypothesis (the Bayes optimal predictor).
- **h_H*** is the best hypothesis within **H**.

The two terms are:

| Error component | Also called | Depends on | Behavior as H grows |
|---|---|---|---|
| R(h_H*) - R(h*) | Approximation error (bias) | Expressiveness of H | Decreases |
| R(h_ERM) - R(h_H*) | Estimation error (variance) | Sample size n and complexity of H | Increases |

The approximation error measures how well the best hypothesis in **H** can fit the true data-generating process. A larger, more expressive hypothesis class reduces this error. The estimation error measures the cost of estimating the best hypothesis from finite data. A larger hypothesis class makes this harder because there are more candidates to distinguish among [6].

The goal in practice is to choose **H** to balance these two sources of error. This is the theoretical justification for model selection techniques and [regularization](/wiki/regularization).

## What is structural risk minimization?

Structural risk minimization (SRM) is a principle introduced by Vapnik and Chervonenkis in their 1974 book that directly addresses the approximation-estimation tradeoff [3][14]. Rather than fixing a single hypothesis class, SRM considers a nested sequence of hypothesis classes:

$$
H_1 \subset H_2 \subset H_3 \subset \cdots
$$

where each **H_k** has finite VC dimension **d_k**, and $$d_1 \le d_2 \le d_3 \le \cdots$$

The SRM principle selects both the hypothesis class and the hypothesis within it by minimizing an upper bound on the true risk:

$$
h_{\text{SRM}} = \arg\min_{k, \, h \in H_k} [R_{\text{emp}}(h) + \text{penalty}(d_k, n, \delta)]
$$

The penalty term increases with the VC dimension of the chosen class, discouraging the learner from selecting overly complex models unless the data strongly support it [3][14].

### How does ERM differ from structural risk minimization?

| Property | ERM | SRM |
|---|---|---|
| Hypothesis class | Fixed | Selected from a nested sequence |
| Objective | Minimize empirical risk | Minimize empirical risk plus complexity penalty |
| Overfitting risk | Higher for complex H | Controlled by the penalty term |
| Requires model selection | External (cross-validation, etc.) | Built into the principle |
| Theoretical guarantee | Consistent if H has finite VC dimension | Consistent even over a growing sequence of classes |
| Practical realization | Least squares, MLE | Regularized methods, SVM with kernel selection |

## Why does ERM use surrogate loss functions?

In many practical settings, the natural loss function is computationally intractable. For binary classification, the 0-1 loss (which counts misclassifications) is the most natural choice, but minimizing empirical risk with 0-1 loss is NP-hard even for linear classifiers [16]. To make optimization feasible, practitioners replace the 0-1 loss with a convex surrogate loss function.

### Common surrogate losses for classification

The following table compares common loss functions used in place of the 0-1 loss. Each is expressed as a function of the margin $$m = y f(x)$$, where **y** is in {-1, +1} and **f(x)** is the real-valued prediction.

| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| 0-1 loss | 1 if $$m < 0$$, else 0 | Ideal classification metric | Non-convex, non-differentiable; NP-hard to minimize |
| Hinge loss | $$\max(0, 1 - m)$$ | [Support vector machines](/wiki/support_vector_machine_svm) | Convex, not differentiable at m = 1; yields sparse solutions |
| Logistic loss | $$\log(1 + \exp(-m))$$ | [Logistic regression](/wiki/logistic_regression) | Convex, differentiable; outputs calibrated probabilities |
| Exponential loss | $$\exp(-m)$$ | AdaBoost ([boosting](/wiki/boosting)) | Convex, differentiable; sensitive to outliers |
| Squared hinge loss | $$\max(0, 1 - m)^2$$ | Modified SVM variants | Convex, differentiable; balances hinge and smoothness |

### Common loss functions for regression

| Loss function | Formula | Used in | Properties |
|---|---|---|---|
| Squared loss (L2) | $$(y - f(x))^2$$ | [Linear regression](/wiki/linear_regression), ridge [regression](/wiki/regression) | Convex, differentiable; sensitive to outliers |
| Absolute loss (L1) | $$\lvert y - f(x) \rvert$$ | Median regression | Convex; more robust to outliers; not differentiable at 0 |
| Huber loss | Quadratic for small errors, linear for large | Robust regression | Convex, differentiable; combines advantages of L1 and L2 |
| Epsilon-insensitive loss | $$\max(0, \lvert y - f(x) \rvert - \epsilon)$$ | Support vector regression | Convex; ignores errors smaller than epsilon |

### Bayes consistency of surrogates

A surrogate loss is called **Bayes consistent** (or **classification-calibrated**) if minimizing the expected surrogate loss leads to the same classifier as minimizing the 0-1 loss. Zhang (2004) and Bartlett, Jordan, and McAuliffe (2006) established that all the convex surrogates listed above (hinge, logistic, exponential) are Bayes consistent, providing theoretical justification for their use in practice [8][9].

## Which algorithms are instances of ERM?

Many standard [machine learning](/wiki/machine_learning) algorithms can be understood as instances of the ERM principle with specific choices of loss function, hypothesis class, and regularizer [6].

| Algorithm | Loss function | Hypothesis class | Regularizer | ERM formulation |
|---|---|---|---|---|
| Ordinary least squares | Squared loss | Linear functions | None | $$\min (1/n) \sum(y_i - w^\top x_i)^2$$ |
| Ridge regression | Squared loss | Linear functions | L2 penalty ($$\lambda \lVert w \rVert^2$$) | $$\min (1/n) \sum(y_i - w^\top x_i)^2 + \lambda \lVert w \rVert^2$$ |
| Lasso | Squared loss | Linear functions | L1 penalty ($$\lambda \lVert w \rVert_1$$) | $$\min (1/n) \sum(y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_1$$ |
| [Logistic regression](/wiki/logistic_regression) | Logistic loss | Linear functions | L2 or L1 | $$\min (1/n) \sum \log(1 + \exp(-y_i w^\top x_i)) + \lambda R(w)$$ |
| [Support vector machines](/wiki/support_vector_machine_svm) | Hinge loss | Linear or kernel functions | L2 penalty | $$\min (1/n) \sum \max(0, 1 - y_i w^\top x_i) + \lambda \lVert w \rVert^2$$ |
| AdaBoost | Exponential loss | Ensemble of weak learners | Implicit (boosting rounds) | Sequentially minimizes weighted exponential loss |
| [Neural networks](/wiki/neural_network) | Cross-entropy, squared loss, etc. | Compositions of nonlinear layers | Weight decay (L2), dropout | $$\min (1/n) \sum L(f_\theta(x_i), y_i) + \lambda \lVert \theta \rVert^2$$ |

Training a neural network by [gradient descent](/wiki/gradient_descent) on a cross-entropy or squared loss is therefore a direct application of (regularized) ERM: the average loss over the training batch is the empirical risk being minimized [6].

## What is regularized empirical risk minimization?

Regularized empirical risk minimization augments the ERM objective with a penalty term that discourages overly complex hypotheses:

$$
h_{\text{RERM}} = \arg\min_{h \in H} \left[ \frac{1}{n} \sum L(h(x_i), y_i) + \lambda R(h) \right]
$$

where **R(h)** is the regularization term and **lambda >= 0** is a [hyperparameter](/wiki/hyperparameter) controlling the tradeoff between fitting the data and maintaining simplicity.

### Types of regularization

| Regularizer | R(h) | Effect | Example algorithms |
|---|---|---|---|
| L2 (Tikhonov / ridge) | $$\lVert w \rVert^2 = \sum w_j^2$$ | Shrinks all weights toward zero; dense solutions | Ridge regression |
| L1 (Lasso) | $$\lVert w \rVert_1 = \sum \lvert w_j \rvert$$ | Drives some weights exactly to zero; sparse solutions, feature selection | Lasso |
| Elastic net | $$\alpha \lVert w \rVert_1 + (1-\alpha) \lVert w \rVert^2$$ | Combines L1 sparsity with L2 stability | Elastic net regression |
| Early stopping | Number of [gradient descent](/wiki/gradient_descent) iterations | Implicitly limits model complexity | [Neural network](/wiki/neural_network) training |
| Dropout | Random zeroing of activations during training | Prevents co-adaptation of neurons | [Deep learning](/wiki/deep_learning) |

Regularized ERM provides a practical implementation of the structural risk minimization principle. The regularization parameter **lambda** plays the role of controlling the effective complexity of the hypothesis class, analogous to selecting the nesting level in SRM [3][6].

## How does ERM relate to maximum likelihood estimation?

Empirical risk minimization and maximum likelihood estimation (MLE) are closely related. When the loss function is the negative log-likelihood, ERM reduces to MLE [6].

Specifically, suppose the data are generated from a parametric family of distributions **P_theta**. The negative log-likelihood loss is:

$$
L(\theta; x, y) = -\log P_\theta(y \mid x)
$$

Minimizing the empirical risk with this loss:

$$
\frac{1}{n} \sum_{i=1}^n [-\log P_\theta(y_i \mid x_i)]
$$

is equivalent to maximizing the likelihood function. Furthermore, minimizing the expected negative log-likelihood is equivalent to minimizing the Kullback-Leibler (KL) divergence between the true distribution and the model distribution.

Adding an L2 regularizer to the negative log-likelihood corresponds to maximum a posteriori (MAP) estimation with a Gaussian prior on the parameters.

## Rademacher complexity and data-dependent bounds

While VC dimension provides distribution-free generalization bounds, these bounds can be loose in practice because they must hold for the worst-case distribution. Rademacher complexity offers an alternative that adapts to the specific data distribution [7].

### Definition

The empirical Rademacher complexity of a hypothesis class **H** with respect to a sample **S = {x_1, ..., x_n}** is:

$$
R_S(H) = \frac{1}{n} \mathbb{E}_\sigma \left[ \sup_{h \in H} \sum_{i=1}^n \sigma_i h(x_i) \right]
$$

where **sigma_1, ..., sigma_n** are independent Rademacher random variables (each taking values +1 or -1 with equal probability). Intuitively, the Rademacher complexity measures how well the hypothesis class can fit random noise [7].

### Generalization bound via Rademacher complexity

With probability at least **1 - delta** over the draw of the training sample:

$$
R(h) \le R_{\text{emp}}(h) + 2 R_S(H) + \sqrt{\log(1/\delta) / (2n)}
$$

for all **h in H** simultaneously.

This bound is often tighter than VC-based bounds because the Rademacher complexity can be estimated from the data and adapts to the actual sample distribution. Bartlett and Mendelson (2002) developed this framework, providing data-dependent risk bounds applicable to [decision trees](/wiki/decision_tree), neural networks, and support vector machines [7].

## What is the no free lunch theorem?

The no free lunch theorem, proved by Wolpert (1996) for [supervised learning](/wiki/supervised_learning), demonstrates a fundamental limitation of ERM (and any learning algorithm). Without assumptions about the data distribution, no learning algorithm can be universally good [10].

Formally, for any learning algorithm and any sample size **n**, there exists a data distribution such that:

- The algorithm has expected error at least 1/2 on unseen data (no better than random guessing for binary classification).
- Another algorithm achieves zero error on the same distribution.

This result means that ERM can only succeed when the hypothesis class is appropriately chosen relative to the true data-generating process. The restriction to a specific hypothesis class **H** is not a limitation of ERM but a necessary feature: it encodes the learner's assumptions about which patterns are plausible [10].

## What are the computational considerations of ERM?

Minimizing empirical risk is not always computationally easy. The difficulty depends on the choice of loss function and hypothesis class.

### NP-hardness of 0-1 loss

For binary classification, minimizing the empirical 0-1 loss over the class of linear classifiers is NP-hard. As standard references summarize, "empirical risk minimization for a classification problem with a 0-1 loss function is known to be an NP-hard problem even for a relatively simple class of functions such as linear classifiers," yet "it can be solved efficiently when the minimal empirical risk is zero, i.e., data is linearly separable" [15]. The underlying hardness result, strengthened by Feldman, Guruswami, Raghavendra, and Wu (2009), shows that even weak agnostic learning of such classifiers is intractable, so even simple hypothesis classes lead to intractable optimization when paired with the natural loss function [16].

### Convex relaxation

The standard approach to circumvent NP-hardness is to replace the 0-1 loss with a convex surrogate. Because the surrogate loss is convex and typically differentiable (or subdifferentiable), the resulting ERM problem can be solved efficiently using [convex optimization](/wiki/convex_optimization) techniques such as [gradient descent](/wiki/gradient_descent), stochastic gradient descent, or interior point methods.

The excess risk from using a surrogate instead of the true loss can be controlled. For example, Zhang's inequality (2004) shows that for convex, classification-calibrated surrogates, the excess 0-1 risk can be bounded in terms of the excess surrogate risk, ensuring that good performance on the surrogate translates to good performance on the original objective [8].

### Stochastic optimization

For large-scale problems, computing the full empirical risk over all training examples at each iteration is expensive. Stochastic gradient descent (SGD) and its variants (Adam, AdaGrad, RMSProp) approximate the gradient of the empirical risk using mini-batches, enabling training on datasets with millions or billions of examples.

## How does ERM apply to deep learning?

The success of [deep learning](/wiki/deep_learning) has raised new questions about the ERM principle. Modern neural networks are often heavily overparameterized, meaning they have far more parameters than training examples. Classical ERM theory predicts that such models should overfit badly, yet in practice they often [generalize](/wiki/generalization) well [11].

### Interpolation and the double descent phenomenon

When a model has enough capacity to fit the training data perfectly (achieving zero empirical risk), it is said to **interpolate** the data. Classical learning theory suggests that interpolating models should have high true risk due to overfitting. However, empirical observations and recent theoretical work have revealed a more nuanced picture [11].

The **double descent** phenomenon, documented by Belkin, Hsu, Ma, and Mandal in a 2019 paper in the *Proceedings of the National Academy of Sciences* and explored further by Nakkiran et al. (2021), describes a non-monotonic relationship between model complexity and test error [11][17]:

1. In the underparameterized regime (few parameters relative to data), increasing model complexity reduces test error, consistent with classical theory.
2. Near the interpolation threshold (where model capacity equals the number of training examples), test error spikes.
3. In the overparameterized regime (many more parameters than data), test error decreases again, contrary to classical predictions.

This phenomenon has been observed across multiple model types, including neural networks, [random forests](/wiki/random_forest), and kernel methods [11].

### Implicit regularization

One explanation for why overparameterized models generalize despite interpolating is **implicit regularization**. The optimization algorithm (typically SGD) and its initialization do not find an arbitrary interpolating solution; instead, they converge to solutions with particular structural properties, such as low norm or low rank. These inductive biases effectively restrict the hypothesis class, even though no explicit regularizer is present in the objective.

For example, [gradient descent](/wiki/gradient_descent) on overparameterized linear models converges to the minimum-norm interpolating solution, which has well-understood generalization properties. For neural networks, the precise characterization of implicit regularization remains an active area of research.

## Tilted empirical risk minimization

Tilted empirical risk minimization (TERM), introduced by Li et al. (2021), extends the standard ERM objective by introducing an exponential tilt parameter **t** that adjusts the influence of individual losses [12]:

$$
R_t(h) = \frac{1}{t} \log\left( \frac{1}{n} \sum_{i=1}^n \exp(t \, L(h(x_i), y_i)) \right)
$$

The parameter **t** interpolates between different objectives:

| Tilt parameter | Limiting behavior | Effect |
|---|---|---|
| t -> 0 | Standard ERM (arithmetic mean of losses) | Equal weighting of all examples |
| t -> +infinity | Maximum loss (minimax risk) | Focuses on worst-case examples |
| t -> -infinity | Minimum loss | Focuses on easiest examples |

TERM provides a unified framework for several practical goals:

- **Positive tilt (t > 0)**: Upweights high-loss examples, promoting fairness across subgroups and handling class imbalance.
- **Negative tilt (t < 0)**: Downweights high-loss examples, providing robustness to outliers and noisy labels.

The framework has been applied to problems including fair classification, robust regression, and federated learning [12].

## What are the limitations of ERM?

Despite its theoretical elegance and practical utility, ERM has several well-known limitations:

1. **Distribution-free bounds are often loose.** The generalization bounds derived from VC theory hold for all possible data distributions, but this worst-case guarantee comes at the cost of tightness. In practice, the bounds are often too large to be informative [7].

2. **Overfitting with rich hypothesis classes.** When the hypothesis class is too large relative to the sample size, ERM will select a hypothesis that fits the training data well but performs poorly on new data. This is the fundamental motivation for regularized ERM and structural risk minimization [6].

3. **Computational intractability.** For many natural combinations of loss functions and hypothesis classes, the ERM optimization problem is NP-hard. Convex relaxations and approximate algorithms are needed in practice [15][16].

4. **Sensitivity to the training distribution.** ERM optimizes average performance over the training set, which may not align with the desired objective. If the training data are imbalanced, contain systematic biases, or are not representative of the deployment distribution, the ERM solution may perform poorly on underrepresented subgroups or in distribution-shifted settings [12].

5. **No free lunch.** Without assumptions about the data distribution, no learning algorithm (including ERM) can guarantee good performance. The choice of hypothesis class implicitly encodes assumptions that may or may not match reality [10].

## See also

- [Loss function](/wiki/loss_function)
- [Overfitting](/wiki/overfitting)
- [Regularization](/wiki/regularization)
- [Bias-variance tradeoff](/wiki/bias_variance_tradeoff)
- [Generalization](/wiki/generalization)
- [Support vector machine](/wiki/support_vector_machine_svm)
- [Supervised learning](/wiki/supervised_learning)
- [Convex optimization](/wiki/convex_optimization)
- [Gradient descent](/wiki/gradient_descent)

## References

1. Vapnik, V. N., & Chervonenkis, A. Ya. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280. (Original Russian version published 1968.)

2. Vapnik, V. N. (1995). *The Nature of Statistical Learning Theory*. Springer-Verlag.

3. Vapnik, V. N. (1998). *Statistical Learning Theory*. Wiley-Interscience.

4. Valiant, L. G. (1984). "A theory of the learnable." *Communications of the ACM*, 27(11), 1134-1142.

5. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). "Learnability and the Vapnik-Chervonenkis dimension." *Journal of the ACM*, 36(4), 929-965.

6. Shalev-Shwartz, S., & Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press.

7. Bartlett, P. L., & Mendelson, S. (2002). "Rademacher and Gaussian complexities: Risk bounds and structural results." *Journal of Machine Learning Research*, 3, 463-482.

8. Zhang, T. (2004). "Statistical behavior and consistency of classification methods based on convex risk minimization." *The Annals of Statistics*, 32(1), 56-85.

9. Bartlett, P. L., Jordan, M. I., & McAuliffe, J. D. (2006). "Convexity, classification, and risk bounds." *Journal of the American Statistical Association*, 101(473), 138-156.

10. Wolpert, D. H. (1996). "The lack of a priori distinctions between learning algorithms." *Neural Computation*, 8(7), 1341-1390.

11. Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854.

12. Li, T., Beirami, A., Sanjabi, M., & Smith, V. (2021). "Tilted empirical risk minimization." *International Conference on Learning Representations (ICLR)*.

13. Shalev-Shwartz, S., Shamir, O., Srebro, N., & Sridharan, K. (2010). "Learnability, stability and uniform convergence." *Journal of Machine Learning Research*, 11, 2635-2670.

14. Vapnik, V. N. (1991). "Principles of risk minimization for learning theory." *Advances in Neural Information Processing Systems (NeurIPS)*, 4, 831-838.

15. "Empirical risk minimization." *Wikipedia*. Retrieved June 2026. https://en.wikipedia.org/wiki/Empirical_risk_minimization

16. Feldman, V., Guruswami, V., Raghavendra, P., & Wu, Y. (2012). "Agnostic learning of monomials by halfspaces is hard." *SIAM Journal on Computing*, 41(6), 1558-1590. (Earlier version: FOCS 2009.)

17. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). "Deep double descent: Where bigger models and more data hurt." *Journal of Statistical Mechanics: Theory and Experiment*, 2021(12), 124003. (Originally ICLR 2020.)