# Statistical learning theory

> Source: https://aiwiki.ai/wiki/statistical_learning_theory
> Updated: 2026-06-25
> Categories: Machine Learning, Mathematics, Statistics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Statistical learning theory (SLT)** is the mathematical framework that explains when and why [machine learning](/wiki/machine_learning) algorithms [generalize](/wiki/generalization) from a finite training sample to unseen data. It provides theoretical bounds on the gap between a model's performance on its training set and its performance on the underlying data distribution, using complexity measures such as the Vapnik-Chervonenkis (VC) dimension to say when learning from data is possible in principle [1][3]. The field treats learning as a problem of function estimation from empirical observations and asks: given a hypothesis class, a [loss function](/wiki/loss_function), and a finite sample, what can we say about the difference between performance measured on the sample and performance on the full distribution?

SLT was developed primarily by Vladimir Vapnik and Alexey Chervonenkis, working in the Soviet Union from the 1960s onward, who introduced the central tools of the field: the VC dimension, growth function bounds, and the principle of structural risk minimisation [1][3]. Vapnik moved to AT&T Bell Labs in 1991, where the same ideas led directly to the [support vector machine](/wiki/support_vector_machine) [3]. A largely independent line of work in the United States, led by Leslie Valiant in 1984, produced the Probably Approximately Correct (PAC) framework, which formalised computational learnability [2]. The two strands merged into the modern theory through the 1990s and 2000s, and SLT remains the standard mathematical language for discussing why machine learning works.

## What is statistical learning theory?

Statistical learning theory studies the supervised learning problem as function estimation under uncertainty: a learner observes a finite sample drawn from an unknown joint distribution and must output a hypothesis that performs well on future draws from the same distribution. The theory's defining contribution is to relate the *generalisation gap*, the difference between a hypothesis's expected (true) risk and its empirical (training) risk, to measurable properties of the hypothesis class, most famously its VC dimension [1]. When that complexity is finite, the gap shrinks at a predictable rate as the sample grows; when it is infinite, no amount of data can guarantee generalisation in the worst case.

Vapnik framed the practical lesson of the theory as a methodological imperative. In his account of statistical learning theory he wrote: "When solving a problem of interest, do not solve a more general problem as an intermediate step. Try to get the answer that you really need but not a more general one." [3] This principle, that one should estimate only the quantity the task requires (a decision boundary, say, rather than a full probability density), runs through the design of support vector machines and the theory of transductive learning.

## When was statistical learning theory developed?

Vapnik and Chervonenkis began collaborating at the Institute of Control Sciences in Moscow in the mid-1960s. Their first major joint result appeared as a draft in 1966 and was announced in 1968 before being published in full Russian-language form in 1971 under the title "On the uniform convergence of relative frequencies of events to their probabilities" [1]. That paper introduced what is now called the VC dimension and proved the first rigorous link between a combinatorial property of a function class and the rate at which empirical averages converge to true expectations. Vapnik continued to develop the theory inside the Soviet Union for the next two decades, often in mathematical isolation from Western researchers.

In late 1990 Vapnik moved to the United States and joined the Adaptive Systems Research Department at AT&T Bell Labs in Holmdel, New Jersey [21]. There he and his colleagues turned the abstract theory into a concrete algorithm, the support vector machine, which used the maximum margin principle that had been hinted at in his earlier work. His 1995 monograph "The Nature of Statistical Learning Theory" introduced the field to a wider audience and remains one of the standard references [3]. The longer 1998 book "Statistical Learning Theory" gave a complete technical treatment [4].

Leslie Valiant's 1984 paper "A Theory of the Learnable," published in Communications of the ACM (volume 27, issue 11, pages 1134-1142), came from a different research community [2]. Valiant approached learning from theoretical computer science and asked when a learning algorithm could be both sample-efficient and computationally efficient, defining the subject precisely: "we shall say that a program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming." [2] His PAC framework was initially studied with finite hypothesis classes and Boolean concepts. By the early 1990s researchers had shown that PAC learnability for binary classification is equivalent to finite VC dimension, unifying the two traditions into what is now called the fundamental theorem of statistical learning. Valiant received the Turing Award in 2010 partly for this work.

## How is the learning problem set up?

The standard supervised learning problem in SLT is described by a small set of objects.

| Symbol | Object | Meaning |
| --- | --- | --- |
| X | Input space | Domain of features, often a subset of R^d |
| Y | Output space | Labels, for example {0,1} or {-1,+1} for classification or R for regression |
| P | Joint distribution | Probability distribution over X x Y, unknown to the learner |
| H | Hypothesis class | A set of candidate functions h: X -> Y |
| L(h(x), y) | [Loss](/wiki/loss) | Penalty for predicting h(x) when the true label is y |
| R(h) | Expected risk | E_{(X,Y) ~ P}[L(h(X), Y)] |
| R_hat_n(h) | Empirical risk | (1/n) sum_{i=1..n} L(h(x_i), y_i) on a sample of size n |

The central question is when minimising the empirical risk on a finite sample produces a hypothesis with low expected risk on the underlying distribution. The gap between the two, R(h) minus R_hat_n(h), is called the generalisation gap, and SLT provides quantitative bounds on this gap in terms of properties of H.

## What is empirical risk minimisation?

[Empirical risk minimisation](/wiki/empirical_risk_minimization) (ERM) is the basic learning principle of choosing the hypothesis h that minimises R_hat_n(h) on the training sample. ERM is the textbook recipe behind least-squares regression, logistic regression, and most modern classification methods. The principle is simple, but it raises an immediate question: when does the chosen h actually generalise?

A trivial counterexample shows that ERM can fail badly. If H contains every possible function from X to Y, then for any sample we can find an h with zero training error simply by memorising the labels, and this h will have no relationship to the true distribution. ERM works only if the hypothesis class is restricted enough that good performance on the sample implies good performance on the distribution. The job of SLT is to make this intuition rigorous.

The key technical concept is uniform convergence: the empirical risk R_hat_n(h) converges to R(h) uniformly over h in H. If uniform convergence holds at rate epsilon with high probability, then any ERM minimiser is within 2*epsilon of the best h in H. For supervised binary classification under the 0-1 loss, learnability turns out to be equivalent to uniform convergence, which in turn is equivalent to finite VC dimension [12]. The general learning setting is more delicate: Shalev-Shwartz, Shamir, Srebro and Sridharan showed in 2010 that there are problems that are learnable without uniform convergence, with stability serving as the right characterisation of learnability in those cases [18].

## What is the VC dimension?

The Vapnik-Chervonenkis dimension is the central complexity measure of classical SLT. For a class H of binary classifiers, a set of n points x_1, ..., x_n in X is said to be shattered by H if for every possible labelling of those points by {0,1}, there exists some h in H that realises the labelling. The VC dimension VC(H) is the largest n such that some set of n points is shattered [1][22]. If H can shatter arbitrarily large sets, the VC dimension is infinite.

The VC dimension captures combinatorial richness independent of how the class is parameterised. A few canonical values:

| Hypothesis class | VC dimension |
| --- | --- |
| Affine half-spaces in R^d (linear classifiers with bias) | d + 1 |
| Homogeneous half-spaces in R^d (no bias) | d |
| Axis-aligned rectangles in R^2 | 4 |
| Intervals on the real line | 2 |
| Threshold functions on R | 1 |
| Sine functions {sign(sin(theta x)) : theta in R} | infinite |
| Decision trees of depth k on n binary features | grows with k and n |

The value d+1 for affine half-spaces follows from Radon's theorem: any d+2 points in R^d can be split into two subsets with intersecting convex hulls, so they cannot be linearly separated, but the d unit coordinate vectors plus the origin can be shattered.

### What is the Sauer-Shelah lemma and the growth function?

The growth function Pi_H(n) counts the maximum number of distinct labellings that H can realise on any set of n points. By definition Pi_H(n) is at most 2^n, and the VC dimension is the largest n for which equality holds. The Sauer-Shelah lemma, proved independently by Norbert Sauer and Saharon Shelah in 1972 and crucially used by Vapnik and Chervonenkis, states that if VC(H) = d, then for all n >= d the growth function is at most the sum of binomials sum_{i=0..d} C(n,i), which is itself bounded by (en/d)^d [16][17]. The key consequence is that finite VC dimension forces the growth function to be polynomial in n.

### What is the VC bound?

The Vapnik-Chervonenkis uniform convergence bound says that with probability at least 1 - delta over the random sample, every h in H satisfies

R(h) <= R_hat_n(h) + sqrt( (d * log(2n/d) + log(1/delta)) / n )

where d = VC(H), with explicit constants depending on the form of the bound. Concrete versions in Vapnik 1998, Mohri et al. 2018, and Shalev-Shwartz and Ben-David 2014 differ slightly in constants but agree on the main rate of order sqrt(d/n) up to logarithmic factors [4][11][12].

The fundamental theorem of statistical learning ties everything together. For binary classification under 0-1 loss, the following are equivalent: H is PAC learnable, H is agnostic PAC learnable, the uniform convergence property holds for H, any ERM rule succeeds, and H has finite VC dimension [12]. The sample complexity for agnostic PAC learning of a class of VC dimension d scales as Theta((d + log(1/delta)) / epsilon^2), which is tight up to constants [12].

## What is Probably Approximately Correct (PAC) learning?

Valiant's PAC framework asks a slightly different question. A hypothesis class H is PAC learnable if there exists an algorithm A such that for every distribution P, every target concept c in H, and every epsilon, delta > 0, the algorithm draws a sample of size polynomial in 1/epsilon, 1/delta (and other natural parameters) and returns a hypothesis h with P(h(X) != c(X)) <= epsilon with probability at least 1 - delta [2]. The original Valiant 1984 setup also required the algorithm to run in polynomial time, which adds a computational dimension absent from the purely statistical Vapnik-Chervonenkis theory.

| Variant | Assumption on data | What you bound |
| --- | --- | --- |
| Realisable PAC | True labels come from some h* in H | error of returned h |
| Agnostic PAC | No assumption: labels follow arbitrary P | excess error over inf_{h in H} R(h) |
| Computational PAC | Same as above plus polynomial-time algorithm | sample and time complexity |
| Statistical PAC | Sample complexity only, ignoring runtime | sample complexity |

For binary classification a class is statistically PAC learnable if and only if its VC dimension is finite [12]. The computational version is strictly harder. Some classes have small VC dimension yet are NP-hard to learn assuming standard cryptographic assumptions: examples include 3-term DNF formulas and certain neural-network architectures. This gap between statistical and computational learnability is one of the durable themes of the field.

## What is Rademacher complexity?

The VC dimension is a worst-case combinatorial quantity that does not depend on the data distribution. Peter Bartlett and Shahar Mendelson introduced data-dependent complexity measures called Rademacher and Gaussian complexities in their 2002 JMLR paper, and these now provide the tightest generic bounds for many model classes [5].

Given a sample x_1, ..., x_n, the empirical Rademacher complexity of a real-valued function class F is

R_hat_n(F) = E_sigma [ sup_{f in F} (1/n) sum_{i=1..n} sigma_i f(x_i) ]

where sigma_1, ..., sigma_n are independent random variables uniform on {-1, +1}. The expected Rademacher complexity is R_n(F) = E[R_hat_n(F)] over the data sample.

The Bartlett-Mendelson generalisation bound states that with probability at least 1 - delta, every f in F (taking values in [0,1]) satisfies

E[f(X)] <= (1/n) sum_{i=1..n} f(x_i) + 2 R_n(F) + sqrt( log(1/delta) / (2n) )

A fully empirical version replaces R_n(F) with R_hat_n(F) at the cost of an extra term of similar order. Several useful properties make Rademacher complexity tractable: it composes nicely with Lipschitz functions, satisfies a contraction inequality (Talagrand 1994), and behaves well under convex hulls and additive combinations. For VC classes the empirical Rademacher complexity is bounded by sqrt((2 * VC(H) * log(en/d)) / n), recovering the VC bound up to constants [5]. For margin-based classifiers and kernel methods Rademacher complexity gives sharper bounds that depend on the margin and the norm of the parameter vector rather than the ambient dimension.

## What other complexity measures and bound techniques exist?

Classical SLT contains a small zoo of complementary tools.

| Technique | Key reference | What it controls |
| --- | --- | --- |
| VC dimension | Vapnik and Chervonenkis 1971 | Combinatorial richness for binary classes |
| Growth function | Vapnik and Chervonenkis 1971; Sauer 1972 | Number of dichotomies |
| Covering numbers | Kolmogorov and Tikhomirov 1959 | Metric size of H at scale epsilon |
| Fat-shattering dimension | Kearns and Schapire 1990; Bartlett, Long, Williamson 1996 | Real-valued analogue of VC dimension |
| Rademacher complexity | Bartlett and Mendelson 2002 | Data-dependent richness |
| Gaussian complexity | Bartlett and Mendelson 2002 | Continuous analogue used in regression |
| Algorithmic stability | Bousquet and Elisseeff 2002 | Sensitivity of the learner to one example |
| PAC-Bayes | McAllester 1998-1999; Catoni 2007 | Posterior over hypotheses vs prior |

### What are stability-based bounds?

Olivier Bousquet and Andre Elisseeff's 2002 paper "Stability and Generalization," published in JMLR, took a different route [6]. Instead of measuring the size of H, they asked how sensitive the learning algorithm is to perturbations of a single training example. A learning algorithm is uniformly beta-stable if removing or replacing one training example changes the loss of the resulting hypothesis by at most beta. They proved an exponential generalisation bound for uniformly stable algorithms and showed that Tikhonov regularisation, ridge regression, and SVMs are stable with stability controlled by the regularisation parameter [6]. The crucial advantage is that stability bounds depend on how the algorithm searches H, not on a global complexity of H itself, so they apply even when the VC dimension is infinite.

### What are PAC-Bayes bounds?

David McAllester proposed the first PAC-Bayes bounds in 1998 and 1999 [7]. The setup involves a prior distribution P over H chosen before seeing the data and a posterior Q chosen after. McAllester's theorem states that with probability at least 1 - delta, for every posterior Q,

E_{f ~ Q}[R(f)] <= E_{f ~ Q}[R_hat_n(f)] + sqrt( (KL(Q || P) + log(n / delta)) / (2 (n - 1)) )

The Kullback-Leibler divergence between posterior and prior plays the role of a complexity penalty. PAC-Bayes bounds were tightened by Matthias Seeger (2002), Olivier Catoni (2007), and Andreas Maurer (2004). They have become central in the search for non-vacuous bounds for deep networks: Dziugaite and Roy 2017 showed that PAC-Bayes bounds, optimised numerically, can give the first non-vacuous generalisation bounds for stochastic neural networks on MNIST [19].

### What is margin theory?

A recurring lesson is that the VC dimension is sometimes too crude, especially when the learning algorithm produces a hypothesis with a large margin on the training data. Vapnik's 1995 book argued that the relevant complexity for SVMs is governed by the margin, not the ambient dimension, and that this explains why kernel SVMs in infinite-dimensional feature spaces can still generalise [3]. Schapire, Freund, Bartlett, and Lee made a parallel argument for [AdaBoost](/wiki/adaboost) in their 1998 Annals of Statistics paper "Boosting the margin: A new explanation for the effectiveness of voting methods" [8]. They showed that the test error of a voting classifier can be bounded in terms of the margin distribution on the training set, independent of the number of boosting rounds, which explained why AdaBoost continues to improve test accuracy long after the training error has reached zero.

## How do bias-variance and structural risk minimisation relate?

For squared-error regression the expected risk decomposes as the sum of three non-negative terms: the irreducible noise, the squared bias of the average prediction, and the variance of the prediction. Hastie, Tibshirani, and Friedman discuss this decomposition in chapter 7 of "The Elements of Statistical Learning," and it forms the classical motivation for the [bias-variance tradeoff](/wiki/bias_variance_tradeoff): choosing model complexity to balance bias and variance [10].

SLT bounds operationalise this trade-off. Vapnik's structural risk minimisation principle organises a family of hypothesis classes H_1 subset H_2 subset ... by complexity, computes a bound R_hat_n(h) + complexity_term(H_k) for each level, and chooses the level that minimises the bound [3]. In practice this reduces to penalised likelihood, ridge regression, lasso, or weight decay, depending on the setting.

## How does statistical learning theory connect to specific algorithms?

Most of the algorithms taught in a machine learning course can be tied back to an SLT analysis. The relationship between algorithm and bound is often what motivated the algorithm in the first place.

| Algorithm | Best-known bound | Key reference |
| --- | --- | --- |
| Perceptron and linear classifiers | VC = d+1; classical VC bound | Vapnik and Chervonenkis 1971 |
| [Support vector machines](/wiki/support_vector_machine) and [kernel SVMs](/wiki/kernel_support_vector_machines_ksvms) | Margin-based Rademacher bound; sample complexity scales with 1/margin^2 | Vapnik 1995, 1998 |
| Decision trees and CART | Complexity grows with depth; VC dimension proportional to leaves | Breiman, Friedman, Olshen, Stone 1984 |
| k-nearest neighbours | Universally consistent under k_n -> infinity, k_n / n -> 0 | Stone 1977 |
| AdaBoost and other voting methods | Margin-based bound independent of rounds | Schapire, Freund, Bartlett, Lee 1998 |
| Ridge regression and Tikhonov methods | Stability-based bound with rate 1/(lambda * n) | Bousquet and Elisseeff 2002 |
| Neural networks | Classical VC bounds vacuous; norm-based and PAC-Bayes bounds | Bartlett, Foster, Telgarsky 2017; Dziugaite and Roy 2017 |
| Online learning algorithms | Regret bounds independent of distribution | Littlestone 1988; Cesa-Bianchi and Lugosi 2006 |

The Stone result is worth highlighting because k-nearest neighbours falls outside the standard ERM frame: the rule simply averages or votes among nearby labels. Charles Stone's 1977 paper "Consistent Nonparametric Regression" gave conditions under which a class of weighted-average estimators is universally consistent, meaning the expected error converges to the Bayes optimal rate for every distribution [9]. As a special case, k-NN classification with k = k_n satisfying k_n -> infinity and k_n / n -> 0 is universally consistent. This was the first proof that any concrete algorithm could achieve Bayes-optimal performance in the limit for arbitrary distributions.

## Why do deep networks generalise? The generalisation puzzle

Classical SLT bounds for modern deep neural networks are vacuous. A network like ResNet-50 has tens of millions of parameters, more than the size of typical training sets, and its VC dimension is at least linear in the parameter count. The classical bound on the generalisation gap is therefore much larger than 1, which says nothing useful. Yet such networks routinely achieve test accuracy within a percentage point of training accuracy.

The puzzle was sharpened by Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals in their 2017 ICLR paper "Understanding deep learning requires rethinking generalization" [13]. They reported that "state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data," achieving zero training error even when the labels carry no signal [13]. The same networks, trained on the real labels, then generalise well. Since the network can clearly memorise random data, no uniform convergence argument over the full hypothesis class can explain why it generalises on real data. The paper won an ICLR 2017 best paper award and reframed the field's research agenda.

Researchers have explored several non-mutually-exclusive responses.

- Implicit regularisation: stochastic gradient descent does not search over the entire hypothesis class. Among the many parameter settings that fit the data, SGD systematically prefers smooth, low-norm, or flat-minimum solutions. The bound should depend on the actual hypothesis returned, not on the worst case in H.
- Norm-based and margin-based bounds: Bartlett, Foster, and Telgarsky's 2017 NeurIPS paper "Spectrally-normalized margin bounds for neural networks" gave a Rademacher-style bound for fully connected and convolutional networks that depends on the product of spectral norms of the weight matrices times a margin term [15]. The bound is much smaller than the VC bound, though still typically vacuous on real datasets.
- PAC-Bayes for deep nets: Dziugaite and Roy 2017 numerically optimised a PAC-Bayes bound for stochastic neural networks on MNIST and obtained the first non-vacuous bounds, demonstrating that the PAC-Bayes framework can in principle apply to deep models [19].
- Double descent: Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal's 2019 PNAS paper "Reconciling modern machine-learning practice and the classical bias-variance trade-off" argued that the U-shaped bias-variance curve is just the under-parameterised half of a longer curve [14]. As capacity increases past the interpolation threshold, test error first spikes upward, then descends again, producing the [double descent](/wiki/double_descent) that gives the phenomenon its name. The double descent picture is now widely reproduced across linear regression, kernel methods, and deep networks.
- Neural tangent kernel: Arthur Jacot, Franck Gabriel, and Clement Hongler's 2018 NeurIPS paper introduced the NTK, showing that in the infinite-width limit a neural network trained by gradient descent is equivalent to kernel regression with a fixed kernel determined by the architecture at initialisation [20]. NTK theory provides one rigorous mathematical setting in which deep network training can be analysed.
- Other lines of work include the mean-field analysis of two-layer networks, the edge-of-stability phenomenon for full-batch gradient descent, benign overfitting in linear models, and reasoning based on algorithmic information theory.

None of these threads has yet produced bounds that are simultaneously non-vacuous, predictive, and applicable to realistic deep networks across tasks. The deep learning generalisation puzzle remains one of the most active questions at the intersection of theory and practice.

## What extensions of PAC learning exist?

The basic PAC framework has been extended in many directions to cover settings closer to real practice.

| Extension | Setting |
| --- | --- |
| Agnostic PAC | No realisability assumption; bound excess risk over inf_{h in H} R(h) |
| Online learning | Sequential prediction with adversarial data; control regret rather than risk; mistake-bound learnability characterised by the Littlestone dimension |
| Active learning | Learner chooses which examples to label; sample complexity can sometimes be exponentially smaller |
| Semi-supervised learning | Combines labelled and unlabelled data; bounds depend on cluster or manifold assumptions |
| Transfer learning | Source and target distributions differ; bounds depend on a divergence between them |
| Multitask learning | Joint learning of related tasks; shared structure reduces sample complexity per task |
| Differential privacy | Learner must protect individual examples; private learnability is strictly weaker than PAC for some classes |
| Robust learning | Adversarial perturbations of inputs; standard VC theory does not directly apply to robust risk |
| Fairness-constrained learning | Bounds under group-fairness constraints |

Nick Littlestone's 1988 paper introduced the mistake-bound model and the Littlestone dimension, which characterises online learnability much as VC dimension characterises PAC learnability [24]. Cesa-Bianchi and Lugosi's 2006 book "Prediction, Learning, and Games" gives a comprehensive treatment of regret-minimisation bounds in adversarial online learning [23].

## How has statistical learning theory influenced practice?

SLT directly shaped the design of several algorithm families. Support vector machines and kernel methods are the clearest example: Vapnik's reasoning about VC dimension and margin bounds led him to formulate the maximum-margin classifier and motivated the use of the kernel trick [3]. Regularisation theory, Tikhonov-style penalties, ridge regression, and weight decay all have natural SLT interpretations as bounds on hypothesis complexity. Boosting was reinterpreted through margin theory after the Schapire et al. 1998 paper [8].

For deep learning, the relationship is more complicated. Modern deep networks were not designed using SLT bounds, and current bounds remain vacuous on the architectures used in practice. Yet SLT still shapes the way the field thinks about generalisation, motivates the search for the right notion of complexity, and supplies the language for discussions of overfitting, regularisation, and capacity.

## Who are the major contributors to statistical learning theory?

| Person | Contribution |
| --- | --- |
| Vladimir Vapnik | Co-founder of SLT; VC dimension; structural risk minimisation; SVM |
| Alexey Chervonenkis | Co-founder of SLT; uniform convergence for empirical processes |
| Leslie Valiant | PAC learning framework; Turing Award 2010 |
| Norbert Sauer and Saharon Shelah | Sauer-Shelah lemma on growth functions |
| David Haussler | Decision-theoretic generalisations of PAC; agnostic learning |
| Peter Bartlett | Rademacher complexity; margin bounds; spectral norm bounds for neural networks |
| Shahar Mendelson | Rademacher complexity; empirical processes |
| Olivier Bousquet and Andre Elisseeff | Algorithmic stability bounds |
| David McAllester | PAC-Bayes bounds |
| Olivier Catoni | Tight PAC-Bayes bounds |
| Robert Schapire and Yoav Freund | AdaBoost; margin theory of boosting |
| Nick Littlestone | Online learning; Littlestone dimension |
| Charles Stone | Universal consistency of k-NN |
| Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar | Standard graduate textbook (2018) |
| Shai Shalev-Shwartz and Shai Ben-David | Standard graduate textbook (2014); learnability beyond uniform convergence |
| Mikhail Belkin | Double descent; benign overfitting |

## What open problems remain?

A partial list of questions that remain unresolved at the intersection of SLT and current practice:

- Tight non-vacuous generalisation bounds for over-parameterised deep networks across realistic datasets and architectures.
- A precise theoretical characterisation of when and why over-parameterised models generalise, beyond specific limits like infinite width or NTK.
- Generalisation theory for self-supervised learning, where the labelled supervision signal is replaced by pretext tasks and contrastive objectives.
- Bounds for foundation models and large language models, where the relevant complexity must somehow account for pretraining data, fine-tuning, and prompt distributions.
- A learning-theoretic account of in-context learning and related emergent behaviour in large transformers.
- Sharper interplay between optimisation and generalisation: bounds that depend on the trajectory of gradient descent rather than the final hypothesis alone.
- Computational lower bounds matching the best statistical upper bounds for natural deep network classes.

## References

1. Vapnik, V. N. and Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2), 264-280. [English translation of the 1971 Russian original; the foundational paper introducing VC dimension.]
2. Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134-1142. https://dl.acm.org/doi/10.1145/1968.1972
3. Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer, New York. https://link.springer.com/book/10.1007/978-1-4757-3264-1
4. Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York.
5. Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, 463-482. https://www.jmlr.org/papers/v3/bartlett02a.html
6. Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499-526. https://jmlr.org/papers/v2/bousquet02a.html
7. McAllester, D. (1999). Some PAC-Bayesian theorems. Machine Learning, 37(3), 355-363. https://link.springer.com/article/10.1023/A:1007618624809
8. Schapire, R. E., Freund, Y., Bartlett, P. L. and Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26(5), 1651-1686. https://cseweb.ucsd.edu/~yfreund/papers/BoostingtheMargin.pdf
9. Stone, C. J. (1977). Consistent nonparametric regression. Annals of Statistics, 5(4), 595-620. https://projecteuclid.org/euclid.aos/1176343886
10. Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.), chapter 7 on model assessment and selection. Springer.
11. Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2018). Foundations of Machine Learning (2nd ed.). MIT Press. https://cs.nyu.edu/~mohri/mlbook/
12. Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press. https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/
13. Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. International Conference on Learning Representations. https://arxiv.org/abs/1611.03530
14. Belkin, M., Hsu, D., Ma, S. and Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854. https://arxiv.org/abs/1812.11118
15. Bartlett, P. L., Foster, D. J. and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems 30. https://arxiv.org/abs/1706.08498
16. Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory A, 13, 145-147.
17. Shelah, S. (1972). A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41, 247-261.
18. Shalev-Shwartz, S., Shamir, O., Srebro, N. and Sridharan, K. (2010). Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11, 2635-2670. https://jmlr.csail.mit.edu/papers/volume11/shalev-shwartz10a/shalev-shwartz10a.pdf
19. Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep stochastic neural networks with many more parameters than training data. Uncertainty in Artificial Intelligence. https://arxiv.org/abs/1703.11008
20. Jacot, A., Gabriel, F. and Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems. https://arxiv.org/abs/1806.07572
21. Vladimir Vapnik biography. Wikipedia. https://en.wikipedia.org/wiki/Vladimir_Vapnik
22. Vapnik-Chervonenkis dimension. Wikipedia. https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
23. Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
24. Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285-318.

