Structural risk minimization (SRM)

Structural risk minimization (SRM) is an inductive principle in statistical learning theory for selecting a learned model that simultaneously fits the training data and controls the capacity of the hypothesis class. It was developed by Vladimir Vapnik and Alexey Chervonenkis at the Institute of Control Sciences in Moscow during the late 1960s and 1970s as a response to the fact that pure empirical risk minimization (ERM) over a sufficiently rich function class can drive the training error to zero while leaving the true (population) error arbitrarily large. SRM addresses this by ranking hypothesis classes from simple to complex, then choosing the model that minimizes a guaranteed upper bound on the true risk: the empirical risk plus a complexity penalty derived from the VC dimension of the class. The principle gave statistical learning a rigorous framework for model selection, provided the theoretical justification for support vector machines, and is closely connected to ideas of regularization, Bayesian model selection, and the minimum description length principle.

SRM is a non-asymptotic principle: it produces probabilistic guarantees that hold for any finite sample size n, rather than only in the limit as n approaches infinity. This non-asymptotic character distinguishes it from classical statistics, where consistency results often only describe behavior as the sample grows without bound. SRM is also distribution-free, meaning the bounds it relies on hold for every possible distribution generating the data, with no parametric assumption about how the inputs and outputs are related.

ELI5 (Explain like I'm 5)

Imagine you have a huge box of LEGO bricks and you want to build something that looks like a dog you saw in your yard. You only saw the dog for a moment, so you have to guess what its other side looks like. If you use only a few bricks, your dog might be too rough and miss obvious details. If you use thousands of bricks and try to copy every speck of dirt you remember, your model will only look like that single dog at that exact moment, and not like other dogs you might see later.

Structural risk minimization is the rule that says: pick a number of bricks that is just enough to capture the important shape of the dog, but not so many that you start copying random details. To do this, you do not just count your mistakes on the dog you saw. You also count how many bricks you used, and you try to keep both numbers small at once.

In machine learning, the bricks are parameters of a model and the dog is the training data. SRM gives you a formula that adds a penalty for using too many bricks, so the model you choose should still work well on dogs you have not seen yet. This is the same intuition behind regularization, which is the practical day-to-day technique most people use to apply SRM ideas.

Definition

The SRM principle assumes a learning problem in the standard supervised setup. There is an input space X, an output space Y, and an unknown probability distribution P(x, y) on X x Y from which a training sample S = {(x_1, y_1), ..., (x_n, y_n)} is drawn independently and identically distributed (i.i.d.). A loss function L(h(x), y) measures the error of a hypothesis h: X -> Y. The true risk and empirical risk of h are:

R(h) = E[L(h(x), y)] (expectation over P)

R_emp(h) = (1/n) * sum from i=1 to n of L(h(x_i), y_i)

A learner cannot compute R(h) directly because P is unknown. ERM substitutes R_emp(h) for R(h) and minimizes that. SRM sharpens this by also adding a complexity penalty.

Nested structure of hypothesis classes

SRM begins with a structure on the space of hypotheses: an infinite or finite sequence of nested classes

H_1 subset H_2 subset H_3 subset ... subset H_K

with VC dimensions h_1 <= h_2 <= ... <= h_K, where each h_k is finite. Each H_k is a candidate hypothesis class of bounded complexity. The structure can be obtained in many ways: by limiting polynomial degree, by limiting the depth of a decision tree, by capping the norm of a weight vector in a linear model, by restricting the number of hidden units in a neural network, by varying the bandwidth of a kernel, or by adjusting the regularization parameter in a penalized objective.

The SRM objective

SRM selects both an index k and a hypothesis h in H_k by minimizing an upper bound on the true risk:

(k_SRM, h_SRM) = argmin over k and h in H_k of [R_emp(h) + Phi(h_k, n, delta)]

The penalty term Phi(h_k, n, delta) is called a confidence interval, capacity term, or complexity penalty. It depends on the VC dimension h_k of the chosen class, the sample size n, and a confidence parameter delta in (0, 1). A standard form, derived from the VC inequality, is:

Phi(h_k, n, delta) = sqrt((h_k * (log(2n / h_k) + 1) - log(delta / 4)) / n)

The central guarantee of SRM is that with probability at least 1 - delta over the sample,

R(h) <= R_emp(h) + Phi(h_k, n, delta) for every h in H_k and every k

so the SRM solution comes with an explicit, distribution-free upper bound on its true risk. The penalty Phi grows with h_k and shrinks with n, formalizing the intuition that large hypothesis classes need more data to be safely fit.

History

SRM grew out of three decades of work on the foundations of pattern recognition by Vapnik and Chervonenkis. The story is closely tied to the rise and rediscovery of statistical learning theory.

Year	Event
1962	Frank Rosenblatt's Principles of Neurodynamics introduces the perceptron and triggers interest in learning machines
1963	Vapnik and Chervonenkis develop the "generalized portrait" algorithm for pattern recognition, an ancestor of the support vector machine
1968	Vapnik and Chervonenkis prove the uniform convergence theorem and introduce what becomes known as the VC dimension
1971	English translation of the 1968 paper appears in Theory of Probability and Its Applications, exposing Western researchers to the result
1974	Vapnik and Chervonenkis publish the Russian-language Theory of Pattern Recognition, in which the structural risk minimization principle is named and developed
1979	Vapnik publishes the Russian-language Estimation of Dependences Based on Empirical Data, with a systematic treatment of ERM, SRM, and density estimation
1982	English translation of Estimation of Dependences is published by Springer, giving the SRM principle a wider audience
1984	Leslie Valiant introduces the Probably Approximately Correct (PAC) framework, providing an independent computational learning theory
1989	Blumer, Ehrenfeucht, Haussler, and Warmuth connect VC dimension to PAC learnability, establishing the modern fundamental theorem of statistical learning
1991	Vapnik publishes the influential NeurIPS paper Principles of risk minimization for learning theory, comparing ERM, SRM, and Bayesian model selection
1992	Boser, Guyon, and Vapnik introduce the soft-margin support vector machine at COLT, putting the SRM principle directly into a working algorithm
1995	Vapnik publishes The Nature of Statistical Learning Theory with Springer, the book that brings SRM to a broad machine learning audience
1998	Vapnik publishes the comprehensive Statistical Learning Theory with Wiley-Interscience
2000	Second edition of The Nature of Statistical Learning Theory expands the discussion of SVMs and kernel methods
2002	Bartlett and Mendelson develop Rademacher complexity bounds, providing tighter, data-dependent alternatives to the original VC penalties used in SRM

Vapnik moved from the Institute of Control Sciences in Moscow to AT&T Bell Labs in 1990, where his collaboration with Bernhard Boser, Isabelle Guyon, Corinna Cortes, and others led to the support vector machine. The success of SVMs in the late 1990s on tasks such as handwritten digit recognition gave SRM a high-profile demonstration of practical effectiveness, not just theoretical neatness.

Connection to VC theory

SRM cannot be stated cleanly without VC theory, since the penalty term Phi is built from the VC dimension and the underlying generalization bounds.

VC dimension

The VC dimension h of a class H of binary-valued functions is the largest number of points that H can shatter, where shattering means realizing all 2^m possible label patterns on a set of m points. The VC dimension is a combinatorial measure of capacity that does not depend on the data distribution.

Hypothesis class	VC dimension
Threshold functions on the real line	1
Intervals on the real line	2
Axis-aligned rectangles in R^2	4
Half-spaces (linear classifiers) in R^d	d + 1
Convex polygons in R^2 with k vertices	2k + 1
Sigmoid neural networks with W weights	O(W^2) (lower bound: Omega(W))
Sine functions sin(theta * x)	infinite
Multilayer ReLU networks with W weights and L layers	O(W L log W)

Uniform convergence and the VC inequality

The central tool behind the SRM penalty is the VC inequality, a uniform bound on how far the empirical risk can deviate from the true risk for any function in H_k:

Pr(sup over h in H_k of |R_emp(h) - R(h)| > epsilon) <= 4 * S(H_k, 2n) * exp(-n * epsilon^2 / 8)

Here S(H_k, 2n) is the growth function of H_k. Sauer's lemma bounds the growth function in terms of the VC dimension: S(H_k, n) <= (en / h_k)^h_k when n >= h_k. Inverting the VC inequality at confidence 1 - delta yields the Phi(h_k, n, delta) penalty used in the SRM objective.

Necessary and sufficient conditions for consistency

Vapnik and Chervonenkis proved that finite VC dimension is necessary and sufficient for ERM to be consistent across all distributions. SRM extends this guarantee: even if the union of all H_k has infinite VC dimension, SRM is universally consistent provided the structure is rich enough to approximate any reasonable target function and the h_k grow slowly enough relative to n. Concretely, if h_{k(n)} log(n) / n -> 0 while still admitting good approximations of the Bayes classifier, then R(h_SRM) -> R* (the Bayes risk) almost surely.

From ERM to SRM

At the level of objectives, the difference between the two is local versus global model selection.

Aspect	ERM	SRM
Hypothesis class	Single fixed H	Nested family H_1 subset H_2 subset ...
Objective	Minimize R_emp(h)	Minimize R_emp(h) + Phi(h_k, n, delta)
Output	Best fit to training data	Best fit subject to capacity penalty
Risk of overfitting	High when H is rich	Controlled by the penalty term
Model selection	External (cross-validation)	Built in
Consistency	Requires finite VC dim of H	Holds even for unions with infinite VC dim
Bound on R(h)	Only via separate analysis of fixed H	Built into the chosen objective
Practical realization	Maximum likelihood, least squares	SVMs, regularized risk, ridge regression

The SRM principle and algorithm

The practical SRM algorithm follows a straightforward outline.

Define the structure. Choose a sequence H_1 subset H_2 subset ... subset H_K with VC dimensions h_1 <= h_2 <= ... <= h_K, ordered so that simpler classes come first. For example, take polynomials of increasing degree, neural networks of increasing width or depth, decision trees of increasing depth, or kernel models with shrinking regularization parameter.
Train within each class. For each k, run an ERM algorithm restricted to H_k: solve h_k = argmin over h in H_k of R_emp(h)* by least squares, gradient descent, quadratic programming, or whatever optimizer is appropriate.
Compute the SRM objective. For each k, compute the bound B(k) = R_emp(h_k) + Phi(h_k, n, delta)* using the VC dimension of H_k, the sample size, and a chosen confidence delta.
Pick the best level. Select k_SRM = argmin over k of B(k) and return h_SRM = h_{k_SRM}*.

The last step is not trivial: as k increases, R_emp(h_k)* decreases because larger classes can fit more, while Phi(h_k, n, delta) increases because the penalty grows with VC dimension. The optimum lies where these two opposing trends balance, illustrating the same trade-off that drives the bias-variance tradeoff and the model selection literature.

Two practical realizations

Vapnik distinguished two ways of implementing SRM in real algorithms.

Constraint form (hard structure): Train ERM in each H_k separately and pick the best k using the bound. This corresponds to comparing models of different polynomial degrees, different tree depths, or different network sizes.

Penalty form (soft structure): Replace the explicit choice of k with a continuous regularization parameter that controls the effective capacity. Tikhonov regularization, ridge regression, weight decay, and the soft-margin SVM all fit this form. The continuous parameter implicitly indexes a continuum of nested classes, and minimizing the regularized empirical risk approximates SRM with the constraint replaced by a Lagrangian penalty.

Connection to support vector machines

The support vector machine is the most famous algorithm explicitly motivated by SRM. The key bridge is the result, proved by Vapnik and Chervonenkis in 1974 and refined later, that the VC dimension of the class of separating hyperplanes with margin at least gamma in a ball of radius R is bounded by

h <= min(ceil(R^2 / gamma^2), d) + 1

This means that constraining the margin gives a structure on hyperplanes: hyperplanes with larger margins form smaller-VC-dimension classes, even in high-dimensional or infinite-dimensional feature spaces. Maximizing the margin therefore implicitly minimizes the VC-based capacity term in the SRM objective.

Hard-margin SVM as SRM

For linearly separable data, the hard-margin SVM solves

min (1/2) * ||w||^2 subject to y_i (w . x_i + b) >= 1

Minimizing ||w|| maximizes the margin gamma = 1/||w||, which corresponds to selecting the smallest VC-dimension hyperplane class capable of separating the data. The empirical risk is zero by the constraint, so the SRM objective reduces to picking the lowest level k in the structure on which separation is feasible.

Soft-margin SVM as regularized SRM

When the data are not separable or when the user wants to tolerate some training error, the soft-margin SVM solves

min (1/2) ||w||^2 + C * sum from i=1 to n of xi_i

with hinge slack variables xi_i = max(0, 1 - y_i (w . x_i + b)). The first term is the capacity penalty (small ||w|| corresponds to large margin and small VC-dimension class), the second is the empirical hinge loss, and C is the trade-off parameter selecting the level of the structure. This is the penalty form of SRM applied to hyperplane classifiers.

Kernel SVMs and infinite-dimensional spaces

The kernel trick maps inputs into a high or infinite-dimensional reproducing kernel Hilbert space, where linear separators correspond to highly nonlinear classifiers in the original input space. Without margin control, the VC dimension of separators in such spaces could be infinite, which would make ERM fail. The margin-based bound rescues this: even in infinite-dimensional feature spaces, the effective VC dimension of large-margin classifiers is bounded by R^2 / gamma^2, independent of the feature space dimension. SRM is therefore the reason that kernel SVMs do not catastrophically overfit.

Connection to regularization

SRM and Tikhonov-style regularization are closely related but historically distinct ideas.

Tikhonov regularization, introduced by Andrey Tikhonov in 1943 for ill-posed inverse problems, replaces a singular optimization with a penalized variant

min ||A x - b||^2 + lambda ||x||^2

where lambda > 0 stabilizes the solution. In statistics and machine learning, the same idea appears as ridge regression (Hoerl and Kennard, 1970), weight decay in neural network training, and the L2 penalty in logistic regression and SVM objectives.

The link to SRM is that the regularization parameter lambda controls the level k in a continuous structure. For ridge regression with penalty lambda ||w||^2, the constraint ||w|| <= B defines a class H_B with VC dimension bounded in terms of B. Smaller lambda corresponds to looser constraints and higher capacity; larger lambda corresponds to tighter constraints and lower capacity. Choosing lambda by cross-validation or by a generalization bound is the practical implementation of the SRM principle for these models.

Regularizer	Penalty term	Implicit structure	Algorithm
Tikhonov / L2	lambda *		w
L1 / Lasso	lambda *		w
Elastic net	alpha *		w
Early stopping	Effective number of iterations	Implicitly bounded weight norm	Gradient descent on neural networks
Dropout	Random masking during training	Effective ensemble of subnetworks	Deep learning
Spectral norm bound	sigma_max(W_l) <= s	Lipschitz-bounded networks	Spectral normalization for GANs

The modern view, articulated by Bartlett, Bousquet, Mendelson, and others, is that all these techniques implement SRM with different complexity measures. Some use VC dimension explicitly, some use Rademacher complexity, some use covering numbers, but the underlying principle is the same: trade off training error against a capacity term to minimize an upper bound on the true risk.

Bias-variance tradeoff

SRM is the formal counterpart of the bias-variance tradeoff familiar from elementary statistics. Decomposing the excess risk of any hypothesis h_k in H_k gives

R(h_k) - R(h) = [R(h_k_best) - R(h*)] + [R(h_k*) - R(h_k_best)]**

where h* is the unconstrained Bayes optimal predictor and h_k_best is the best hypothesis in H_k. The first term is the approximation error (bias), which decreases as k grows because larger classes can express more functions. The second term is the estimation error (variance), which grows as k grows because larger classes are harder to estimate from finite data. SRM picks the k where the sum of these two errors is smallest.

This is the same balance that motivates the bias-variance decomposition for squared loss, where total error equals bias squared plus variance plus irreducible noise. SRM provides a non-asymptotic, distribution-free version of this trade-off, with concrete bounds replacing the asymptotic intuition.

Comparison to ERM, MDL, and Bayesian methods

SRM is one of several principles for choosing a model from data. Each can be derived from a different starting point but they yield related practical algorithms.

Principle	Foundational view	Objective	Capacity measure	Strengths	Weaknesses
ERM	Frequentist; empirical loss as substitute for true loss	min R_emp(h) over fixed H	None directly	Simple, well-understood; consistent if H has finite VC dim	Overfits on rich H; needs external model selection
SRM	Frequentist; minimize an upper bound on R(h)	min R_emp(h) + Phi(h_k, n, delta)	VC dimension, covering numbers, Rademacher complexity	Distribution-free, non-asymptotic; consistent over unions with infinite VC dim	Bounds often loose; choice of structure can be ad hoc
MDL	Information-theoretic; total description length of data plus model	min L(h) + L(data	h) in bits	Code length of hypothesis	Direct interpretation as compression; works for any countable hypothesis space
Bayesian model selection	Probabilistic; posterior probability of each model	argmax over k of P(model k	data) using marginal likelihood	Model prior P(h) and likelihood	Coherent uncertainty quantification; automatic Occam factor
AIC	Asymptotic information; Kullback-Leibler divergence to truth	min -2 log L + 2 k	Number of parameters k	Easy to compute; widely used in classical statistics	Asymptotic; assumes regular models
BIC	Bayesian large-sample approximation	min -2 log L + k log n	Number of parameters k	Approximates marginal likelihood; consistent for true model selection	Penalizes complexity more than AIC; assumes regular models
Cross-validation	Resampling estimate of generalization	min CV error over models	Implicit through holdout	No bound needed; works in practice	Computationally expensive; high-variance estimates

Several results show these principles agree under particular assumptions. Two-part MDL with a uniform code for the model index reduces to a complexity penalty resembling Phi(h_k, n, delta). The Bayesian Information Criterion (BIC), originally derived as a Laplace approximation to the marginal likelihood, has the same large-sample form as a penalized likelihood with (k log n) / 2 penalty. Stochastic complexity in the MDL framework corresponds to a normalized maximum likelihood that often coincides with Bayesian marginal likelihoods. The PAC-Bayes framework introduced by McAllester (1999) interpolates between Bayesian and frequentist views, deriving generalization bounds that involve a Kullback-Leibler divergence between a posterior and a prior over hypotheses.

PAC learning and Rademacher complexity

SRM sits inside a larger collection of statistical learning frameworks. Three are especially close to it.

Probably Approximately Correct (PAC) learning

The PAC learning framework of Valiant (1984) defines learnability in terms of sample complexity. A class H is PAC learnable if there exists an algorithm that, for any target concept and any distribution, returns a hypothesis with true risk at most epsilon with probability at least 1 - delta, using a number of examples polynomial in 1/epsilon, 1/delta, and the size of the representation. The fundamental theorem of statistical learning, established through Vapnik, Chervonenkis, Blumer, Ehrenfeucht, Haussler, Warmuth and others, says that H is agnostic PAC learnable if and only if its VC dimension is finite, with sample complexity O(h / epsilon^2 + log(1/delta) / epsilon^2). SRM is a meta-algorithm for PAC learning on top of a structure: even if the union of classes is not PAC learnable, SRM achieves consistent learning by selecting the right level for each sample size.

Rademacher complexity

The original SRM penalty uses the VC dimension, which is distribution-free and often loose. Bartlett and Mendelson (2002) introduced Rademacher complexity as a data-dependent capacity measure:

R_S(H) = (1/n) * E_sigma [sup over h in H of sum from i=1 to n of sigma_i * h(x_i)]

where sigma_1, ..., sigma_n are independent Rademacher random variables (uniform on {-1, +1}). The corresponding generalization bound is

R(h) <= R_emp(h) + 2 * R_S(H) + 3 * sqrt(log(2/delta) / (2n))

for all h in H with probability at least 1 - delta. Substituting Rademacher complexity for Phi(h_k, n, delta) in the SRM objective gives a tighter, data-dependent SRM that adapts to the actual sample distribution. This is now the standard tool in modern statistical learning theory.

Local Rademacher complexity and fast rates

Bartlett, Bousquet, and Mendelson (2005) sharpened these bounds further by introducing local Rademacher complexities, which look only at hypotheses with small variance around the optimum. Under low-noise conditions (Mammen-Tsybakov noise condition or Bernstein-type conditions), these yield O(1/n) rates rather than the standard O(1/sqrt(n)), sharpening the practical guarantees that SRM-style algorithms can claim.

Modern relevance and the deep learning puzzle

The success of deep learning has produced an apparent contradiction with classical SRM. Modern neural networks have millions to trillions of parameters and can perfectly fit large random labelings of their training data, as Zhang et al. (2017) demonstrated experimentally. By the original VC bounds, such overparameterized models should overfit catastrophically. In practice they often generalize well, sometimes better than smaller, classically "appropriate" models.

Reconciliation attempts

Several lines of research try to reconcile SRM-style bounds with deep learning practice.

Implicit regularization. Stochastic gradient descent and its initialization push the optimizer toward solutions with low effective capacity, even when the explicit hypothesis class is enormous. The learned classifier behaves as if it lives in a much smaller class than the full network would suggest.

Margin-based and norm-based bounds. Bartlett, Foster, and Telgarsky (2017) and Neyshabur et al. (2018) derived generalization bounds for neural networks based on the spectral norms or path norms of the weights, yielding non-vacuous estimates for some practical models. These are SRM-style bounds with capacity measured by norms rather than counting parameters.

PAC-Bayes for neural networks. Dziugaite and Roy (2017) computed non-vacuous PAC-Bayes bounds for deep networks trained on MNIST, suggesting that with the right complexity measure, classical learning theory can apply even to overparameterized models.

Double descent. Belkin, Hsu, Ma, and Mandal (2019) documented the double descent phenomenon, in which test error rises near the interpolation threshold and falls again in the heavily overparameterized regime. Classical SRM is a single-descent picture; the double descent picture suggests that the structure should be re-indexed in the overparameterized regime, where larger models effectively correspond to a different choice of complexity measure (such as norm rather than parameter count).

Benign overfitting. Bartlett, Long, Lugosi, and Tsigler (2020) introduced the notion of benign overfitting, where interpolating models continue to generalize well. This is consistent with SRM if the effective complexity (Rademacher complexity, norm, or eigenvalue tail of the kernel) is small even when the parameter count is huge.

Practical consequences

The SRM principle is no less true in deep learning than it was in 1974, but the right complexity measure to use in Phi has shifted. Counting parameters is the wrong index for modern networks. Modern theoretical work focuses on data-dependent norms, neural tangent kernel spectra, sharpness measures, and PAC-Bayes posteriors as candidate replacements. Practical regularization (weight decay, dropout, data augmentation, label smoothing, early stopping) can all be understood as enforcing or approximating an SRM trade-off in the chosen complexity measure.

Beyond classification

SRM-style ideas have been extended well beyond binary classification:

Regression with covering-number-based bounds (Pollard 1984, Anthony and Bartlett 1999).
Density estimation through penalized maximum likelihood and bracketing entropy (van de Geer 2000).
Reinforcement learning through structural risk minimization on policy classes (Munos 2003, Antos, Munos, and Szepesvari 2008).
Online learning and regret bounds, where the structural perspective leads to model selection over expert classes (Cesa-Bianchi and Lugosi 2006).
Distribution shift and domain adaptation, where the SRM bound is augmented by a divergence between source and target distributions (Ben-David et al. 2010).

Limitations

SRM has well-known practical and theoretical limitations.

Loose bounds. The original VC-based Phi is often very loose for realistic models. It can predict generalization gaps far larger than what is actually observed, making the bound useful as a qualitative guide rather than a quantitative one.
Choice of structure. SRM does not say how to define the nested sequence H_1 subset H_2 subset ... in the first place. Different structures lead to different chosen models, and there is no universal best choice. Practical model selection often relies on cross-validation as a substitute.
Computability of VC dimension. For many practical hypothesis classes (deep neural networks, decision tree ensembles, kernel models with adaptive features), the exact VC dimension is unknown or hard to compute, so the penalty must be replaced by a surrogate (Rademacher complexity, covering numbers, norms).
Distribution-free guarantees can be too cautious. Real data are not adversarial. Distribution-dependent bounds, including Rademacher and PAC-Bayes, often give much sharper guarantees and are now the standard tools in research.
Misalignment with the deployment objective. SRM, like ERM, optimizes a surrogate of the true expected loss on the same distribution that generated the training data. If the deployment distribution differs (covariate shift, label shift, adversarial perturbation), the SRM bound provides no guarantee.
No guidance for fairness or robustness objectives. Standard SRM is silent on subgroup performance, fairness constraints, or distributional robustness. Modern variants such as distributionally robust optimization and tilted ERM extend the framework to address these concerns.

References

Vapnik, V. N., & Chervonenkis, A. Ya. (1971). "On the uniform convergence of relative frequencies of events to their probabilities." *Theory of Probability and Its Applications*, 16(2), 264-280. (Original Russian version, 1968.)
Vapnik, V. N., & Chervonenkis, A. Ya. (1974). *Theory of Pattern Recognition* (in Russian). Nauka, Moscow.
Vapnik, V. N. (1979). *Estimation of Dependences Based on Empirical Data* (in Russian). Nauka, Moscow.
Vapnik, V. N. (1982). *Estimation of Dependences Based on Empirical Data*. Springer-Verlag, New York.
Vapnik, V. N. (1991). "Principles of risk minimization for learning theory." *Advances in Neural Information Processing Systems*, 4, 831-838.
Vapnik, V. N. (1995). *The Nature of Statistical Learning Theory*. Springer-Verlag.
Vapnik, V. N. (1998). *Statistical Learning Theory*. Wiley-Interscience.
Vapnik, V. N. (2000). *The Nature of Statistical Learning Theory* (2nd edition). Springer.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). "A training algorithm for optimal margin classifiers." *Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT)*, 144-152.
Cortes, C., & Vapnik, V. (1995). "Support-vector networks." *Machine Learning*, 20(3), 273-297.
Valiant, L. G. (1984). "A theory of the learnable." *Communications of the ACM*, 27(11), 1134-1142.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). "Learnability and the Vapnik-Chervonenkis dimension." *Journal of the ACM*, 36(4), 929-965.
Tikhonov, A. N. (1943). "On the stability of inverse problems." *Doklady Akademii Nauk SSSR*, 39(5), 195-198.
Hoerl, A. E., & Kennard, R. W. (1970). "Ridge regression: Biased estimation for nonorthogonal problems." *Technometrics*, 12(1), 55-67.
Bartlett, P. L., & Mendelson, S. (2002). "Rademacher and Gaussian complexities: Risk bounds and structural results." *Journal of Machine Learning Research*, 3, 463-482.
Bartlett, P. L., Bousquet, O., & Mendelson, S. (2005). "Local Rademacher complexities." *The Annals of Statistics*, 33(4), 1497-1537.
McAllester, D. A. (1999). "PAC-Bayesian model averaging." *Proceedings of the Twelfth Annual Conference on Computational Learning Theory*, 164-170.
Rissanen, J. (1978). "Modeling by shortest data description." *Automatica*, 14(5), 465-471.
Schwarz, G. (1978). "Estimating the dimension of a model." *The Annals of Statistics*, 6(2), 461-464.
Akaike, H. (1974). "A new look at the statistical model identification." *IEEE Transactions on Automatic Control*, 19(6), 716-723.
Shalev-Shwartz, S., & Ben-David, S. (2014). *Understanding Machine Learning: From Theory to Algorithms*. Cambridge University Press.
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). *Foundations of Machine Learning* (2nd edition). MIT Press.
Anthony, M., & Bartlett, P. L. (1999). *Neural Network Learning: Theoretical Foundations*. Cambridge University Press.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding deep learning requires rethinking generalization." *International Conference on Learning Representations (ICLR)*.
Bartlett, P. L., Foster, D. J., & Telgarsky, M. J. (2017). "Spectrally-normalized margin bounds for neural networks." *Advances in Neural Information Processing Systems*, 30.
Neyshabur, B., Bhojanapalli, S., McAllester, D., & Srebro, N. (2018). "A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks." *International Conference on Learning Representations*.
Dziugaite, G. K., & Roy, D. M. (2017). "Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data." *Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI)*.
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." *Proceedings of the National Academy of Sciences*, 116(32), 15849-15854.
Bartlett, P. L., Long, P. M., Lugosi, G., & Tsigler, A. (2020). "Benign overfitting in linear regression." *Proceedings of the National Academy of Sciences*, 117(48), 30063-30070.
Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., & Vaughan, J. W. (2010). "A theory of learning from different domains." *Machine Learning*, 79(1-2), 151-175.

ELI5 (Explain like I'm 5)

Definition

Nested structure of hypothesis classes

The SRM objective

History

Connection to VC theory

VC dimension

Uniform convergence and the VC inequality

Necessary and sufficient conditions for consistency

From ERM to SRM

The SRM principle and algorithm

Two practical realizations

Connection to support vector machines

Hard-margin SVM as SRM

Soft-margin SVM as regularized SRM

Kernel SVMs and infinite-dimensional spaces

Connection to regularization

Bias-variance tradeoff

Comparison to ERM, MDL, and Bayesian methods

PAC learning and Rademacher complexity

Probably Approximately Correct (PAC) learning

Rademacher complexity

Local Rademacher complexity and fast rates

Modern relevance and the deep learning puzzle

Reconciliation attempts

Practical consequences

Beyond classification

Limitations

See also

References

Improve this article

Related Articles

ARC-AGI 2

Double Descent

Grokking

Stability

AUC-ROC

Machine learning terms/Clustering

ELI5 (Explain like I'm 5)

Definition

Nested structure of hypothesis classes

The SRM objective

History

Connection to VC theory

VC dimension

Uniform convergence and the VC inequality

Necessary and sufficient conditions for consistency

From ERM to SRM

The SRM principle and algorithm

Two practical realizations

Connection to support vector machines

Hard-margin SVM as SRM

Soft-margin SVM as regularized SRM

Kernel SVMs and infinite-dimensional spaces

Connection to regularization

Bias-variance tradeoff

Comparison to ERM, MDL, and Bayesian methods

PAC learning and Rademacher complexity

Probably Approximately Correct (PAC) learning

Rademacher complexity

Local Rademacher complexity and fast rates

Modern relevance and the deep learning puzzle

Reconciliation attempts

Practical consequences

Beyond classification

Limitations

See also

References

Related Articles

ARC-AGI 2

Double Descent

Grokking

Stability

AUC-ROC

Machine learning terms/Clustering