Statistical learning theory

Machine Learning Mathematics Statistics

25 min read

Updated Jul 11, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 11, 2026

Fact-checked

In review queue

Sources

24 citations

Revision

v4 · 5,040 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Statistical learning theory (SLT) is the mathematical framework that explains when and why machine learning algorithms generalize from a finite training sample to unseen data. It provides theoretical bounds on the gap between a model's performance on its training set and its performance on the underlying data distribution, using complexity measures such as the Vapnik-Chervonenkis (VC) dimension to say when learning from data is possible in principle ^[1]^[3]. The field treats learning as a problem of function estimation from empirical observations and asks: given a hypothesis class, a loss function, and a finite sample, what can we say about the difference between performance measured on the sample and performance on the full distribution?

SLT was developed primarily by Vladimir Vapnik and Alexey Chervonenkis, working in the Soviet Union from the 1960s onward, who introduced the central tools of the field: the VC dimension, growth function bounds, and the principle of structural risk minimisation ^[1]^[3]. Vapnik moved to AT&T Bell Labs in 1991, where the same ideas led directly to the support vector machine ^[3]. A largely independent line of work in the United States, led by Leslie Valiant in 1984, produced the Probably Approximately Correct (PAC) framework, which formalised computational learnability ^[2]. The two strands merged into the modern theory through the 1990s and 2000s, and SLT remains the standard mathematical language for discussing why machine learning works.

What is statistical learning theory?

Statistical learning theory studies the supervised learning problem as function estimation under uncertainty: a learner observes a finite sample drawn from an unknown joint distribution and must output a hypothesis that performs well on future draws from the same distribution. The theory's defining contribution is to relate the generalisation gap, the difference between a hypothesis's expected (true) risk and its empirical (training) risk, to measurable properties of the hypothesis class, most famously its VC dimension ^[1]. When that complexity is finite, the gap shrinks at a predictable rate as the sample grows; when it is infinite, no amount of data can guarantee generalisation in the worst case.

Vapnik framed the practical lesson of the theory as a methodological imperative. In his account of statistical learning theory he wrote: "When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one." ^[3] This principle, that one should estimate only the quantity the task requires (a decision boundary, say, rather than a full probability density), runs through the design of support vector machines and the theory of transductive learning.

When was statistical learning theory developed?

Vapnik and Chervonenkis began collaborating at the Institute of Control Sciences in Moscow in the mid-1960s. Their first major joint result appeared as a draft in 1966 and was announced in 1968 before being published in full Russian-language form in 1971 under the title "On the uniform convergence of relative frequencies of events to their probabilities" ^[1]. That paper introduced what is now called the VC dimension and proved the first rigorous link between a combinatorial property of a function class and the rate at which empirical averages converge to true expectations. Vapnik continued to develop the theory inside the Soviet Union for the next two decades, often in mathematical isolation from Western researchers.

In late 1990 Vapnik moved to the United States and joined the Adaptive Systems Research Department at AT&T Bell Labs in Holmdel, New Jersey ^[21]. There he and his colleagues turned the abstract theory into a concrete algorithm, the support vector machine, which used the maximum margin principle that had been hinted at in his earlier work. His 1995 monograph "The Nature of Statistical Learning Theory" introduced the field to a wider audience and remains one of the standard references ^[3]. The longer 1998 book "Statistical Learning Theory" gave a complete technical treatment ^[4].

Leslie Valiant's 1984 paper "A Theory of the Learnable," published in Communications of the ACM (volume 27, issue 11, pages 1134-1142), came from a different research community ^[2]. Valiant approached learning from theoretical computer science and asked when a learning algorithm could be both sample-efficient and computationally efficient, defining the subject precisely: "we shall say that a program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming." ^[2] His PAC framework was initially studied with finite hypothesis classes and Boolean concepts. By the early 1990s researchers had shown that PAC learnability for binary classification is equivalent to finite VC dimension, unifying the two traditions into what is now called the fundamental theorem of statistical learning. Valiant received the Turing Award in 2010 partly for this work.

How is the learning problem set up?

The standard supervised learning problem in SLT is described by a small set of objects.

Symbol	Object	Meaning
$X$	Input space	Domain of features, often a subset of $\mathbb{R}^d$
$Y$	Output space	Labels, for example $\{0,1\}$ or $\{-1,+1\}$ for classification or $\mathbb{R}$ for regression
$P$	Joint distribution	Probability distribution over $X \times Y$ , unknown to the learner
$H$	Hypothesis class	A set of candidate functions $h: X \to Y$
$L(h(x), y)$	Loss	Penalty for predicting $h(x)$ when the true label is $y$
$R(h)$	Expected risk	$\mathbb{E}_{(X,Y) \sim P}[L(h(X), Y)]$
$\hat{R}_n(h)$	Empirical risk	$\frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i)$ on a sample of size n

The central question is when minimising the empirical risk on a finite sample produces a hypothesis with low expected risk on the underlying distribution. The gap between the two, $R(h)$ minus $\hat{R}_n(h)$ , is called the generalisation gap, and SLT provides quantitative bounds on this gap in terms of properties of H.

What is empirical risk minimisation?

Empirical risk minimisation (ERM) is the basic learning principle of choosing the hypothesis h that minimises R_hat_n(h) on the training sample. ERM is the textbook recipe behind least-squares regression, logistic regression, and most modern classification methods. The principle is simple, but it raises an immediate question: when does the chosen h actually generalise?

A trivial counterexample shows that ERM can fail badly. If H contains every possible function from X to Y, then for any sample we can find an h with zero training error simply by memorising the labels, and this h will have no relationship to the true distribution. ERM works only if the hypothesis class is restricted enough that good performance on the sample implies good performance on the distribution. The job of SLT is to make this intuition rigorous.

The key technical concept is uniform convergence: the empirical risk R_hat_n(h) converges to R(h) uniformly over h in H. If uniform convergence holds at rate $\epsilon$ with high probability, then any ERM minimiser is within $2\epsilon$ of the best h in H. For supervised binary classification under the 0-1 loss, learnability turns out to be equivalent to uniform convergence, which in turn is equivalent to finite VC dimension ^[12]. The general learning setting is more delicate: Shalev-Shwartz, Shamir, Srebro and Sridharan showed in 2010 that there are problems that are learnable without uniform convergence, with stability serving as the right characterisation of learnability in those cases ^[18].

What is the VC dimension?

The Vapnik-Chervonenkis dimension is the central complexity measure of classical SLT. For a class H of binary classifiers, a set of n points x_1, ..., x_n in X is said to be shattered by H if for every possible labelling of those points by {0,1}, there exists some h in H that realises the labelling. The VC dimension $\mathrm{VC}(H)$ is the largest n such that some set of n points is shattered ^[1]^[22]. If H can shatter arbitrarily large sets, the VC dimension is infinite.

The VC dimension captures combinatorial richness independent of how the class is parameterised. A few canonical values:

Hypothesis class	VC dimension
Affine half-spaces in $\mathbb{R}^d$ (linear classifiers with bias)	$d + 1$
Homogeneous half-spaces in $\mathbb{R}^d$ (no bias)	$d$
Axis-aligned rectangles in $\mathbb{R}^2$	4
Intervals on the real line	2
Threshold functions on $\mathbb{R}$	1
Sine functions $\{\mathrm{sign}(\sin(\theta x)) : \theta \in \mathbb{R}\}$	infinite
Decision trees of depth k on n binary features	grows with k and n

The value $d+1$ for affine half-spaces follows from Radon's theorem: any $d+2$ points in $\mathbb{R}^d$ can be split into two subsets with intersecting convex hulls, so they cannot be linearly separated, but the $d$ unit coordinate vectors plus the origin can be shattered.

What is the Sauer-Shelah lemma and the growth function?

The growth function $\Pi_H(n)$ counts the maximum number of distinct labellings that H can realise on any set of n points. By definition $\Pi_H(n)$ is at most $2^n$ , and the VC dimension is the largest n for which equality holds. The Sauer-Shelah lemma, proved independently by Norbert Sauer and Saharon Shelah in 1972 and crucially used by Vapnik and Chervonenkis, states that if $\mathrm{VC}(H) = d$ , then for all $n \ge d$ the growth function is at most the sum of binomials $\sum_{i=0}^{d} \binom{n}{i}$ , which is itself bounded by $(en/d)^d$ ^[16]^[17]. The key consequence is that finite VC dimension forces the growth function to be polynomial in n.

What is the VC bound?

The Vapnik-Chervonenkis uniform convergence bound says that with probability at least $1 - \delta$ over the random sample, every h in H satisfies

R(h) \le \hat{R}_n(h) + \sqrt{\frac{d \log(2n/d) + \log(1/\delta)}{n}}

where $d = \mathrm{VC}(H)$ , with explicit constants depending on the form of the bound. Concrete versions in Vapnik 1998, Mohri et al. 2018, and Shalev-Shwartz and Ben-David 2014 differ slightly in constants but agree on the main rate of order $\sqrt{d/n}$ up to logarithmic factors ^[4]^[11]^[12].

The fundamental theorem of statistical learning ties everything together. For binary classification under 0-1 loss, the following are equivalent: H is PAC learnable, H is agnostic PAC learnable, the uniform convergence property holds for H, any ERM rule succeeds, and H has finite VC dimension ^[12]. The sample complexity for agnostic PAC learning of a class of VC dimension d scales as $\Theta\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$ , which is tight up to constants ^[12].

What is Probably Approximately Correct (PAC) learning?

Valiant's PAC framework asks a slightly different question. A hypothesis class H is PAC learnable if there exists an algorithm A such that for every distribution P, every target concept c in H, and every epsilon, delta > 0, the algorithm draws a sample of size polynomial in 1/epsilon, 1/delta (and other natural parameters) and returns a hypothesis h with $P(h(X) \ne c(X)) \le \epsilon$ with probability at least $1 - \delta$ ^[2]. The original Valiant 1984 setup also required the algorithm to run in polynomial time, which adds a computational dimension absent from the purely statistical Vapnik-Chervonenkis theory.

Variant	Assumption on data	What you bound
Realisable PAC	True labels come from some h* in H	error of returned h
Agnostic PAC	No assumption: labels follow arbitrary P	excess error over $\inf_{h \in H} R(h)$
Computational PAC	Same as above plus polynomial-time algorithm	sample and time complexity
Statistical PAC	Sample complexity only, ignoring runtime	sample complexity

For binary classification a class is statistically PAC learnable if and only if its VC dimension is finite ^[12]. The computational version is strictly harder. Some classes have small VC dimension yet are NP-hard to learn assuming standard cryptographic assumptions: examples include 3-term DNF formulas and certain neural-network architectures. This gap between statistical and computational learnability is one of the durable themes of the field.

What is Rademacher complexity?

The VC dimension is a worst-case combinatorial quantity that does not depend on the data distribution. Peter Bartlett and Shahar Mendelson introduced data-dependent complexity measures called Rademacher and Gaussian complexities in their 2002 JMLR paper, and these now provide the tightest generic bounds for many model classes ^[5].

Given a sample x_1, ..., x_n, the empirical Rademacher complexity of a real-valued function class F is

\hat{R}_n(F) = \mathbb{E}_\sigma\left[\sup_{f \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]

where $\sigma_1, \ldots, \sigma_n$ are independent random variables uniform on $\{-1, +1\}$ . The expected Rademacher complexity is $R_n(F) = \mathbb{E}[\hat{R}_n(F)]$ over the data sample.

The Bartlett-Mendelson generalisation bound states that with probability at least $1 - \delta$ , every f in F (taking values in $[0,1]$ ) satisfies

\mathbb{E}[f(X)] \le \frac{1}{n} \sum_{i=1}^{n} f(x_i) + 2 R_n(F) + \sqrt{\frac{\log(1/\delta)}{2n}}

A fully empirical version replaces $R_n(F)$ with $\hat{R}_n(F)$ at the cost of an extra term of similar order. Several useful properties make Rademacher complexity tractable: it composes nicely with Lipschitz functions, satisfies a contraction inequality (Talagrand 1994), and behaves well under convex hulls and additive combinations. For VC classes the empirical Rademacher complexity is bounded by $\sqrt{\frac{2 \, \mathrm{VC}(H) \log(en/d)}{n}}$ , recovering the VC bound up to constants ^[5]. For margin-based classifiers and kernel methods Rademacher complexity gives sharper bounds that depend on the margin and the norm of the parameter vector rather than the ambient dimension.

What other complexity measures and bound techniques exist?

Classical SLT contains a small zoo of complementary tools.

Technique	Key reference	What it controls
VC dimension	Vapnik and Chervonenkis 1971	Combinatorial richness for binary classes
Growth function	Vapnik and Chervonenkis 1971; Sauer 1972	Number of dichotomies
Covering numbers	Kolmogorov and Tikhomirov 1959	Metric size of H at scale epsilon
Fat-shattering dimension	Kearns and Schapire 1990; Bartlett, Long, Williamson 1996	Real-valued analogue of VC dimension
Rademacher complexity	Bartlett and Mendelson 2002	Data-dependent richness
Gaussian complexity	Bartlett and Mendelson 2002	Continuous analogue used in regression
Algorithmic stability	Bousquet and Elisseeff 2002	Sensitivity of the learner to one example
PAC-Bayes	McAllester 1998-1999; Catoni 2007	Posterior over hypotheses vs prior

What are stability-based bounds?

Olivier Bousquet and Andre Elisseeff's 2002 paper "Stability and Generalization," published in JMLR, took a different route ^[6]. Instead of measuring the size of H, they asked how sensitive the learning algorithm is to perturbations of a single training example. A learning algorithm is uniformly beta-stable if removing or replacing one training example changes the loss of the resulting hypothesis by at most beta. They proved an exponential generalisation bound for uniformly stable algorithms and showed that Tikhonov regularisation, ridge regression, and SVMs are stable with stability controlled by the regularisation parameter ^[6]. The crucial advantage is that stability bounds depend on how the algorithm searches H, not on a global complexity of H itself, so they apply even when the VC dimension is infinite.

What are PAC-Bayes bounds?

David McAllester proposed the first PAC-Bayes bounds in 1998 and 1999 ^[7]. The setup involves a prior distribution P over H chosen before seeing the data and a posterior Q chosen after. McAllester's theorem states that with probability at least $1 - \delta$ , for every posterior Q,

\mathbb{E}_{f \sim Q}[R(f)] \le \mathbb{E}_{f \sim Q}[\hat{R}_n(f)] + \sqrt{\frac{\mathrm{KL}(Q \parallel P) + \log(n / \delta)}{2(n - 1)}}

The Kullback-Leibler divergence between posterior and prior plays the role of a complexity penalty. PAC-Bayes bounds were tightened by Matthias Seeger (2002), Olivier Catoni (2007), and Andreas Maurer (2004). They have become central in the search for non-vacuous bounds for deep networks: Dziugaite and Roy 2017 showed that PAC-Bayes bounds, optimised numerically, can give the first non-vacuous generalisation bounds for stochastic neural networks on MNIST ^[19].

What is margin theory?

A recurring lesson is that the VC dimension is sometimes too crude, especially when the learning algorithm produces a hypothesis with a large margin on the training data. Vapnik's 1995 book argued that the relevant complexity for SVMs is governed by the margin, not the ambient dimension, and that this explains why kernel SVMs in infinite-dimensional feature spaces can still generalise ^[3]. Schapire, Freund, Bartlett, and Lee made a parallel argument for AdaBoost in their 1998 Annals of Statistics paper "Boosting the margin: A new explanation for the effectiveness of voting methods" ^[8]. They showed that the test error of a voting classifier can be bounded in terms of the margin distribution on the training set, independent of the number of boosting rounds, which explained why AdaBoost continues to improve test accuracy long after the training error has reached zero.

How do bias-variance and structural risk minimisation relate?

For squared-error regression the expected risk decomposes as the sum of three non-negative terms: the irreducible noise, the squared bias of the average prediction, and the variance of the prediction. Hastie, Tibshirani, and Friedman discuss this decomposition in chapter 7 of "The Elements of Statistical Learning," and it forms the classical motivation for the bias-variance tradeoff: choosing model complexity to balance bias and variance ^[10].

SLT bounds operationalise this trade-off. Vapnik's structural risk minimisation principle organises a family of hypothesis classes H_1 subset H_2 subset ... by complexity, computes a bound R_hat_n(h) + complexity_term(H_k) for each level, and chooses the level that minimises the bound ^[3]. In practice this reduces to penalised likelihood, ridge regression, lasso, or weight decay, depending on the setting.

How does statistical learning theory connect to specific algorithms?

Most of the algorithms taught in a machine learning course can be tied back to an SLT analysis. The relationship between algorithm and bound is often what motivated the algorithm in the first place.

Algorithm	Best-known bound	Key reference
Perceptron and linear classifiers	VC = d+1; classical VC bound	Vapnik and Chervonenkis 1971
Support vector machines and kernel SVMs	Margin-based Rademacher bound; sample complexity scales with $1/\mathrm{margin}^2$	Vapnik 1995, 1998
Decision trees and CART	Complexity grows with depth; VC dimension proportional to leaves	Breiman, Friedman, Olshen, Stone 1984
k-nearest neighbours	Universally consistent under $k_n \to \infty$ , $k_n / n \to 0$	Stone 1977
AdaBoost and other voting methods	Margin-based bound independent of rounds	Schapire, Freund, Bartlett, Lee 1998
Ridge regression and Tikhonov methods	Stability-based bound with rate $1/(\lambda n)$	Bousquet and Elisseeff 2002
Neural networks	Classical VC bounds vacuous; norm-based and PAC-Bayes bounds	Bartlett, Foster, Telgarsky 2017; Dziugaite and Roy 2017
Online learning algorithms	Regret bounds independent of distribution	Littlestone 1988; Cesa-Bianchi and Lugosi 2006

The Stone result is worth highlighting because k-nearest neighbours falls outside the standard ERM frame: the rule simply averages or votes among nearby labels. Charles Stone's 1977 paper "Consistent Nonparametric Regression" gave conditions under which a class of weighted-average estimators is universally consistent, meaning the expected error converges to the Bayes optimal rate for every distribution ^[9]. As a special case, k-NN classification with $k = k_n$ satisfying $k_n \to \infty$ and $k_n / n \to 0$ is universally consistent. This was the first proof that any concrete algorithm could achieve Bayes-optimal performance in the limit for arbitrary distributions.

Why do deep networks generalise? The generalisation puzzle

Classical SLT bounds for modern deep neural networks are vacuous. A network like ResNet-50 has tens of millions of parameters, more than the size of typical training sets, and its VC dimension is at least linear in the parameter count. The classical bound on the generalisation gap is therefore much larger than 1, which says nothing useful. Yet such networks routinely achieve test accuracy within a percentage point of training accuracy.

The puzzle was sharpened by Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals in their 2017 ICLR paper "Understanding deep learning requires rethinking generalization" ^[13]. They reported that "state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data," achieving zero training error even when the labels carry no signal ^[13]. The same networks, trained on the real labels, then generalise well. Since the network can clearly memorise random data, no uniform convergence argument over the full hypothesis class can explain why it generalises on real data. The paper won an ICLR 2017 best paper award and reframed the field's research agenda.

Researchers have explored several non-mutually-exclusive responses.

Implicit regularisation: stochastic gradient descent does not search over the entire hypothesis class. Among the many parameter settings that fit the data, SGD systematically prefers smooth, low-norm, or flat-minimum solutions. The bound should depend on the actual hypothesis returned, not on the worst case in H.
Norm-based and margin-based bounds: Bartlett, Foster, and Telgarsky's 2017 NeurIPS paper "Spectrally-normalized margin bounds for neural networks" gave a Rademacher-style bound for fully connected and convolutional networks that depends on the product of spectral norms of the weight matrices times a margin term ^[15]. The bound is much smaller than the VC bound, though still typically vacuous on real datasets.
PAC-Bayes for deep nets: Dziugaite and Roy 2017 numerically optimised a PAC-Bayes bound for stochastic neural networks on MNIST and obtained the first non-vacuous bounds, demonstrating that the PAC-Bayes framework can in principle apply to deep models ^[19].
Double descent: Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal's 2019 PNAS paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off" argued that the U-shaped bias-variance curve is just the under-parameterised half of a longer curve ^[14]. As capacity increases past the interpolation threshold, test error first spikes upward, then descends again, producing the double descent that gives the phenomenon its name. The double descent picture is now widely reproduced across linear regression, kernel methods, and deep networks.
Neural tangent kernel: Arthur Jacot, Franck Gabriel, and Clement Hongler's 2018 NeurIPS paper introduced the NTK, showing that in the infinite-width limit a neural network trained by gradient descent is equivalent to kernel regression with a fixed kernel determined by the architecture at initialisation ^[20]. NTK theory provides one rigorous mathematical setting in which deep network training can be analysed.
Other lines of work include the mean-field analysis of two-layer networks, the edge-of-stability phenomenon for full-batch gradient descent, benign overfitting in linear models, and reasoning based on algorithmic information theory.

None of these threads has yet produced bounds that are simultaneously non-vacuous, predictive, and applicable to realistic deep networks across tasks. The deep learning generalisation puzzle remains one of the most active questions at the intersection of theory and practice.

What extensions of PAC learning exist?

The basic PAC framework has been extended in many directions to cover settings closer to real practice.

Extension	Setting
Agnostic PAC	No realisability assumption; bound excess risk over $\inf_{h \in H} R(h)$
Online learning	Sequential prediction with adversarial data; control regret rather than risk; mistake-bound learnability characterised by the Littlestone dimension
Active learning	Learner chooses which examples to label; sample complexity can sometimes be exponentially smaller
Semi-supervised learning	Combines labelled and unlabelled data; bounds depend on cluster or manifold assumptions
Transfer learning	Source and target distributions differ; bounds depend on a divergence between them
Multitask learning	Joint learning of related tasks; shared structure reduces sample complexity per task
Differential privacy	Learner must protect individual examples; private learnability is strictly weaker than PAC for some classes
Robust learning	Adversarial perturbations of inputs; standard VC theory does not directly apply to robust risk
Fairness-constrained learning	Bounds under group-fairness constraints

Nick Littlestone's 1988 paper introduced the mistake-bound model and the Littlestone dimension, which characterises online learnability much as VC dimension characterises PAC learnability ^[24]. Cesa-Bianchi and Lugosi's 2006 book "Prediction, Learning, and Games" gives a comprehensive treatment of regret-minimisation bounds in adversarial online learning ^[23].

How has statistical learning theory influenced practice?

SLT directly shaped the design of several algorithm families. Support vector machines and kernel methods are the clearest example: Vapnik's reasoning about VC dimension and margin bounds led him to formulate the maximum-margin classifier and motivated the use of the kernel trick ^[3]. Regularisation theory, Tikhonov-style penalties, ridge regression, and weight decay all have natural SLT interpretations as bounds on hypothesis complexity. Boosting was reinterpreted through margin theory after the Schapire et al. 1998 paper ^[8].

For deep learning, the relationship is more complicated. Modern deep networks were not designed using SLT bounds, and current bounds remain vacuous on the architectures used in practice. Yet SLT still shapes the way the field thinks about generalisation, motivates the search for the right notion of complexity, and supplies the language for discussions of overfitting, regularisation, and capacity.

Who are the major contributors to statistical learning theory?

Person	Contribution
Vladimir Vapnik	Co-founder of SLT; VC dimension; structural risk minimisation; SVM
Alexey Chervonenkis	Co-founder of SLT; uniform convergence for empirical processes
Leslie Valiant	PAC learning framework; Turing Award 2010
Norbert Sauer and Saharon Shelah	Sauer-Shelah lemma on growth functions
David Haussler	Decision-theoretic generalisations of PAC; agnostic learning
Peter Bartlett	Rademacher complexity; margin bounds; spectral norm bounds for neural networks
Shahar Mendelson	Rademacher complexity; empirical processes
Olivier Bousquet and Andre Elisseeff	Algorithmic stability bounds
David McAllester	PAC-Bayes bounds
Olivier Catoni	Tight PAC-Bayes bounds
Robert Schapire and Yoav Freund	AdaBoost; margin theory of boosting
Nick Littlestone	Online learning; Littlestone dimension
Charles Stone	Universal consistency of k-NN
Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar	Standard graduate textbook (2018)
Shai Shalev-Shwartz and Shai Ben-David	Standard graduate textbook (2014); learnability beyond uniform convergence
Mikhail Belkin	Double descent; benign overfitting

What open problems remain?

A partial list of questions that remain unresolved at the intersection of SLT and current practice:

Tight non-vacuous generalisation bounds for over-parameterised deep networks across realistic datasets and architectures.
A precise theoretical characterisation of when and why over-parameterised models generalise, beyond specific limits like infinite width or NTK.
Generalisation theory for self-supervised learning, where the labelled supervision signal is replaced by pretext tasks and contrastive objectives.
Bounds for foundation models and large language models, where the relevant complexity must somehow account for pretraining data, fine-tuning, and prompt distributions.
A learning-theoretic account of in-context learning and related emergent behaviour in large transformers.
Sharper interplay between optimisation and generalisation: bounds that depend on the trajectory of gradient descent rather than the final hypothesis alone.
Computational lower bounds matching the best statistical upper bounds for natural deep network classes.

References

Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264-280. [English translation of the 1971 Russian original; the foundational paper introducing VC dimension.] ↩
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142. https://dl.acm.org/doi/10.1145/1968.1972 ↩
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. https://link.springer.com/book/10.1007/978-1-4757-3264-1 ↩
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York. ↩
Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463-482. https://www.jmlr.org/papers/v3/bartlett02a.html ↩
Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499-526. https://jmlr.org/papers/v2/bousquet02a.html ↩
McAllester, D. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3), 355-363. https://link.springer.com/article/10.1023/A:1007618624809 ↩
Schapire, R. E., Freund, Y., Bartlett, P. L. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651-1686. https://cseweb.ucsd.edu/~yfreund/papers/BoostingtheMargin.pdf ↩
Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620. https://projecteuclid.org/euclid.aos/1176343886 ↩
Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), chapter 7 on model assessment and selection. Springer. ↩
Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. https://cs.nyu.edu/~mohri/mlbook/ ↩
Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/ ↩
Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations. https://arxiv.org/abs/1611.03530 ↩
Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854. https://arxiv.org/abs/1812.11118 ↩
Bartlett, P. L., Foster, D. J. and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.08498 ↩
Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory A, 13, 145-147. ↩
Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41, 247-261. ↩
Shalev-Shwartz, S., Shamir, O., Srebro, N. and Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2635-2670. https://jmlr.csail.mit.edu/papers/volume11/shalev-shwartz10a/shalev-shwartz10a.pdf ↩
Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep stochastic neural networks with many more parameters than training data. Uncertainty in Artificial Intelligence. https://arxiv.org/abs/1703.11008 ↩
Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1806.07572 ↩
Vladimir Vapnik biography. Wikipedia. https://en.wikipedia.org/wiki/Vladimir_Vapnik ↩
Vapnik-Chervonenkis dimension. Wikipedia. https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension ↩
Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press. ↩
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributors · full history

Suggest edit

What links here

Concept drift Empirical Risk Minimization Independently and Identically Distributed (i.i.d.)Stationarity Structural risk minimization (SRM)Support Vector Machine (SVM)

What is statistical learning theory?

When was statistical learning theory developed?

How is the learning problem set up?

What is empirical risk minimisation?

What is the VC dimension?

What is the Sauer-Shelah lemma and the growth function?

What is the VC bound?

What is Probably Approximately Correct (PAC) learning?

What is Rademacher complexity?

What other complexity measures and bound techniques exist?

What are stability-based bounds?

What are PAC-Bayes bounds?

What is margin theory?

How do bias-variance and structural risk minimisation relate?

How does statistical learning theory connect to specific algorithms?

Why do deep networks generalise? The generalisation puzzle

What extensions of PAC learning exist?

How has statistical learning theory influenced practice?

Who are the major contributors to statistical learning theory?

What open problems remain?

References

Improve this article

Related Articles

Independently and Identically Distributed (i.i.d.)

Log-Odds

Principal Component Analysis (PCA)

Earth Mover's Distance

Curse of Dimensionality

KL Divergence (Kullback-Leibler Divergence)

What links here

Related Articles

Independently and Identically Distributed (i.i.d.)

Log-Odds

Principal Component Analysis (PCA)

Earth Mover's Distance

Curse of Dimensionality

KL Divergence (Kullback-Leibler Divergence)

What links here