Stability in machine learning refers to the property that a learning algorithm produces similar outputs (predictions, parameter values, or loss curves) when its inputs are slightly perturbed. The inputs that get perturbed depend on which kind of stability you mean, and that is where the term gets confusing. The same word covers at least six different ideas: a theoretical property of the learning rule, a numerical property of the floating-point arithmetic, a behavioral property of the optimizer, a robustness property of the trained model, a deployment property under distribution shift, and a property of the explanations the model produces. Most disagreements about whether a system is stable are really disagreements about which of these definitions is being used.
Stability matters because real systems are never run on the exact same data twice. The training data you collected this week is slightly different from the data you would have collected next week. The hyperparameter you chose was somewhat arbitrary. The random seed for initialization could have gone the other way. A stable algorithm gives roughly the same answer regardless. An unstable one does not, and you end up shipping a model whose behavior depends on accidents of the training run.
The table below summarizes the distinct meanings the word carries in machine learning research and practice.
| Meaning | Perturbation considered | Quantity that should stay similar | Typical context |
|---|---|---|---|
| Algorithmic stability | Replace or remove one training example | The learned hypothesis or its loss on a held-out point | Statistical learning theory |
| Numerical stability | Floating-point rounding, precision (FP16, BF16, FP32) | Forward and backward computations | Mixed-precision training, large-model pretraining |
| Training-dynamics stability | Random seed, learning rate, batch size, optimizer state | The loss curve over training iterations | Deep learning practice |
| Robustness to input perturbation | Small changes to test inputs (noise, adversarial examples) | The model's prediction | Robustness, security |
| Out-of-distribution stability | Shift in the test distribution relative to training | Calibration, accuracy, error rates | Production ML, distribution shift |
| Stability of explanations | Small input perturbations | Feature attributions, saliency maps | Interpretability research |
When a paper claims a method "improves stability," the first useful question is which row of this table the authors actually mean.
The theoretical sense of stability is the oldest and the one connected most directly to generalization. Bousquet and Elisseeff formalized it in their 2002 paper "Stability and Generalization" in the Journal of Machine Learning Research. Their idea is straightforward to state. Take a learning algorithm, train it on a dataset, then train it again on a dataset that differs by a single example. If the resulting hypotheses agree closely on every point, the algorithm is stable. The smaller the change, the more stable the algorithm.
Bousquet and Elisseeff defined several precise variants, listed below roughly in order of increasing strength.
| Notion | What it requires | Strength |
|---|---|---|
| Hypothesis stability | Expected loss change is small when one training point is removed | Weakest |
| Pointwise hypothesis stability | Expected change at the specific removed point is small | Intermediate |
| Error stability | Expected empirical error changes little when one point is removed | Intermediate |
| Uniform stability | The loss change is bounded for every input and every dataset of size n | Strongest |
The central theorem links uniform stability to a generalization bound that does not depend on the VC dimension of the hypothesis class. Informally, if removing or replacing one of the n training examples changes the loss by at most beta on every test point, then the gap between training loss and true loss is on the order of beta plus a confidence term that goes to zero as n grows. Bousquet and Elisseeff proved that ridge regression and Support Vector Machines with bounded loss are uniformly stable, with stability controlled by the regularization parameter lambda. This recovers VC-style guarantees for these methods through a different and often tighter route.
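In symbols, writing A_S for the hypothesis trained on a size-n sample S, S^{\i} for S with the i-th point removed, and M for a bound on the loss (notation ours, in slightly simplified form):

```latex
% Uniform stability: removing any one training point changes the loss
% by at most beta, for every sample S, index i, and test point z.
\[
\sup_{S,\, i,\, z}\; \bigl|\, \ell(A_S, z) - \ell(A_{S^{\setminus i}}, z) \,\bigr| \;\le\; \beta
\]

% Generalization (Bousquet-Elisseeff): with probability at least 1 - delta
% over the draw of S, the true risk is bounded by the empirical risk plus
% terms controlled by beta.
\[
R(A_S) \;\le\; \widehat{R}_{\mathrm{emp}}(A_S) \;+\; 2\beta
\;+\; \bigl(4n\beta + M\bigr)\sqrt{\frac{\ln(1/\delta)}{2n}}
\]
```

For ridge regression and SVMs, beta scales on the order of 1/(lambda n), which is why the bound is meaningful even though it never mentions the size of the hypothesis class.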
The theory connects to other foundational ideas. Shalev-Shwartz, Shamir, Srebro and Sridharan (2010) showed that learnability and stability are essentially equivalent under reasonable conditions, putting stability at the center of statistical learning theory rather than at its periphery. A separate line of work, beginning with Dwork, Feldman, Hardt, Pitassi, Reingold and Roth, showed that differential privacy implies a strong form of algorithmic stability and therefore implies generalization. Dwork and Roth's 2014 monograph "The Algorithmic Foundations of Differential Privacy" lays out the privacy framework underlying this connection.
Another influential result extended algorithmic stability to non-convex optimization. In their 2016 ICML paper "Train faster, generalize better: Stability of stochastic gradient descent," Hardt, Recht and Singer proved that stochastic gradient descent on a Lipschitz, smooth loss is uniformly stable, with the stability bound growing with the number of training iterations. The headline message is that running SGD for fewer steps tightens the generalization bound, which gives a theoretical reason behind early stopping, decaying learning rates, and short training schedules. Their analysis also extends to the non-convex case, where the bound degrades faster with iteration count but still favors small step sizes and fewer passes over the data.
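For the convex case, the headline bound is compact enough to quote (stated loosely; L is the Lipschitz constant of the loss, alpha_t the step sizes, and n the sample size):

```latex
% SGD on a convex, L-Lipschitz, smooth loss with step sizes alpha_t
% (each at most 2 / smoothness), run for T steps, is uniformly stable with
\[
\epsilon_{\mathrm{stab}} \;\le\; \frac{2L^2}{n} \sum_{t=1}^{T} \alpha_t
\]
```

Every extra step and every larger step size loosens the guarantee, which is the formal version of "train faster, generalize better."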
Numerical stability is a different concept from a different field, but it shows up constantly in modern ML. The forward pass of a deep network multiplies many activations together. The backward pass multiplies many gradient terms. With finite-precision arithmetic, those products can underflow to zero or overflow to infinity. Vanishing gradients prevent learning from making progress. Exploding gradients send the loss to NaN.
Mixed-precision training has made these issues more prominent. FP16 has only 5 exponent bits, so positive values outside roughly [6e-8, 6.5e4] either round to zero or overflow. BF16, which keeps the FP32 exponent range while sacrificing mantissa precision, was introduced largely to dodge this problem and is now the default for most large transformer pretraining. Loss-scaling tricks, the epsilon term in the Adam optimizer, and careful ordering of operations are all attempts to keep the actual numbers inside the representable range. When practitioners say a training run "diverged" because of FP16, they almost always mean numerical stability rather than the theoretical kind.
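A small numpy sketch makes the failure modes and the loss-scaling fix concrete (the scale factor of 1024 is illustrative; production systems typically pick it dynamically):

```python
import numpy as np

# FP16 underflow: a small but meaningful gradient rounds to exactly zero.
grad = np.float16(1e-8)
print(grad)                       # 0.0 -- below FP16's subnormal range (~6e-8)

# FP16 overflow: products of large activations saturate to infinity.
act = np.float16(300.0)
print(act * act)                  # inf -- above FP16's max normal (~65504)

# Loss scaling: multiply the loss (and hence all gradients) by a constant
# so small gradients survive the FP16 round-trip, then divide the scale
# back out in FP32 before the optimizer step.
scale = np.float32(1024.0)
true_grad_fp32 = np.float32(1e-8)
scaled = np.float16(true_grad_fp32 * scale)    # now representable in FP16
recovered = np.float32(scaled) / scale         # ~1e-8 again in FP32
print(recovered)
```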
Training-dynamics stability is what most deep learning engineers worry about day to day. A stable training run produces a smoothly decreasing loss curve. An unstable run produces oscillations, plateaus, sudden spikes, or outright divergence. The biggest practical influences are the learning rate, the batch size, the choice of optimizer, the weight initialization scheme, and the presence of normalization layers.
The table below lists common sources of training instability and the standard mitigations; a sketch of global-norm clipping follows the table.
| Source of instability | Mechanism | Mitigation |
|---|---|---|
| Learning rate too high | Updates overshoot loss minima | Learning-rate warmup, cosine or step decay |
| Poor initialization | Activations or gradients vanish or explode at depth | Xavier initialization (Glorot and Bengio 2010), He initialization (He et al. 2015) |
| Lack of normalization | Layer-input distributions drift across iterations | Batch normalization, layer normalization, RMSNorm |
| Exploding gradients | Backprop through many layers amplifies updates | Gradient clipping by global norm |
| Vanishing gradients | Saturating activations push derivatives to zero | ReLU and variants, residual connections |
| Loss spikes in LLM pretraining | Rare data batches or numerical artifacts trigger huge updates | Restart from earlier checkpoint, skip implicated batches, adaptive clipping |
| Optimizer state corruption | Momentum or Adam moments accumulate bad statistics | Reset moments after a spike, use SPAM-style spike-aware updates |
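Global-norm clipping, the mitigation in the exploding-gradients row, fits in a few lines of numpy (a sketch of the scheme, not any particular framework's implementation):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their joint L2 norm is at most
    max_norm, preserving the update direction."""
    global_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm

# Example: one huge gradient drags the global norm far above the cap.
grads = [np.array([0.1, -0.2]), np.array([50.0, 80.0])]
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)      # ~94.3 before clipping
print(clipped)   # same direction, joint norm now exactly 1.0
```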
Ioffe and Szegedy's 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" was a turning point. By normalizing each layer's pre-activations within a mini-batch, the technique allowed substantially higher learning rates, less careful initialization, and 14 times fewer training steps to reach the same accuracy on ImageNet. Layer normalization, proposed by Ba, Kiros and Hinton in 2016, applied the same idea per-token rather than per-batch and became standard in transformers. RMSNorm, a simpler variant that drops the mean-centering step, is now used in most modern large language models.
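Both normalizers are small enough to write out; a numpy sketch (the epsilon values follow common defaults but vary across models):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: center and rescale each token's feature vector."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: LayerNorm without the mean-centering step."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(4, 512)            # 4 tokens, 512 features each
out = rms_norm(x, gamma=np.ones(512))
print(np.sqrt(np.mean(out**2)))        # ~1.0: activations kept in a stable range
```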
Large language model pretraining brought a new flavor of instability into focus. The PaLM technical report from Google in 2022 noted that the 540B model's loss spiked roughly 20 times during training even with gradient clipping enabled, sometimes deep into the run. The team's mitigation was empirical and somewhat brute-force: restart from a checkpoint about 100 steps before the spike, skip 200 to 500 batches that included the offending data, and resume. After the skip, the same spike did not recur, suggesting the cause was specific data combined with a particular optimizer state. Subsequent work on adaptive gradient clipping, including ZClip (Kurzynski et al. 2025) and AdaGC, attempts to detect spikes statistically and clip only the offending updates rather than every step.
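The statistical idea behind spike-aware clipping can be sketched in a few lines. This is an illustration of the general approach, not the published ZClip or AdaGC algorithm: track the recent distribution of gradient norms and treat large z-score outliers specially.

```python
import numpy as np

def zscore_clip(grad_norm, history, z_thresh=3.0, min_history=20):
    """Illustrative spike detector: flag a step whose gradient norm is a
    large z-score outlier against recent history and shrink it back to the
    threshold. A real implementation would use a window or an EMA rather
    than an unbounded list, and would not let spikes pollute the history."""
    if len(history) >= min_history:
        mu, sigma = np.mean(history), np.std(history)
        if sigma > 0 and (grad_norm - mu) / sigma > z_thresh:
            return mu + z_thresh * sigma, True    # clip, keep history clean
    history.append(grad_norm)
    return grad_norm, False

history = []
for norm in [1.0, 1.1, 0.9, 1.05] * 10 + [47.0]:   # a spike after 40 calm steps
    allowed, spiked = zscore_clip(norm, history)
    if spiked:
        print(f"spike: raw={norm:.1f}, clipped to {allowed:.2f}")
```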
A model that gives wildly different predictions on slightly different inputs is unstable in a different sense. This is the topic of adversarial robustness. Goodfellow, Shlens and Szegedy's 2014 paper introduced the Fast Gradient Sign Method and showed that imperceptible pixel-level changes flip the predictions of high-accuracy image classifiers. Madry et al.'s 2018 ICLR paper "Towards Deep Learning Models Resistant to Adversarial Attacks" framed the problem as min-max optimization and proposed projected gradient descent (PGD) adversarial training. They showed that networks trained against PGD adversaries were robust to a wide range of first-order attacks, with concrete results on MNIST and CIFAR-10 against L-infinity adversaries bounded by 0.3 (pixels in [0,1]) and 8 (pixels in [0,255]) respectively.
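FGSM itself is one line of math. Here is a sketch on a logistic-regression toy model, where the input gradient has a closed form; the model and numbers are stand-ins for the image classifiers in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast Gradient Sign Method for logistic regression with binary
    cross-entropy loss: the gradient wrt x is (sigmoid(wx + b) - y) * w,
    so the attack steps eps in the sign of that gradient."""
    grad_x = (sigmoid(w @ x + b) - y) * w
    return x + eps * np.sign(grad_x)

rng = np.random.default_rng(0)
w = rng.normal(size=100)
b = 0.0
x = rng.normal(size=100)
y = 1.0

print(sigmoid(w @ x + b))          # model's confidence on the clean input
x_adv = fgsm(x, y, w, b, eps=0.25)
print(sigmoid(w @ x_adv + b))      # confidence in class 1 drops sharply
```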
The Lipschitz constant of a function is a quantitative measure of this kind of stability: a function with Lipschitz constant L cannot change its output by more than L times the change in its input. Spectral normalization, gradient penalties, and architectural constraints are all attempts to control this constant.
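A sketch of the power-iteration estimate behind spectral normalization (20 iterations is an arbitrary choice here; practical implementations often run one iteration per training step and reuse the vectors):

```python
import numpy as np

def spectral_normalize(W, n_iters=20):
    """Divide W by an estimate of its largest singular value (its spectral
    norm), obtained by power iteration, so the layer is roughly 1-Lipschitz
    under the L2 norm."""
    u = np.random.default_rng(0).normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v                  # estimated top singular value
    return W / sigma

W = np.random.default_rng(1).normal(size=(64, 32))
W_sn = spectral_normalize(W)
print(np.linalg.norm(W_sn, 2))         # ~1.0: spectral norm is now controlled
```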
A model can be stable on the data it was trained on and still fall apart in deployment if the world changes. This is the territory of distribution shift: covariate shift (the input distribution changes), label shift (the marginal class frequencies change), and concept drift (the relationship between inputs and labels changes). Calibration, the agreement between predicted confidence and actual accuracy, often degrades faster than raw accuracy under shift. Practical responses include retraining schedules, drift-detection monitors, importance weighting, and domain-adaptation methods.
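One widely used drift monitor is the population stability index (PSI) over a single feature's distribution. A numpy sketch; the 0.2 "investigate" threshold is a common rule of thumb, not a law:

```python
import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index: compare a feature's current distribution
    against a training-time reference, binned by the reference's quantiles."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf       # catch out-of-range values
    p = np.histogram(reference, bins=edges)[0] / len(reference)
    q = np.histogram(current, bins=edges)[0] / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return np.sum((q - p) * np.log(q / p))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 10_000)
prod_feature = rng.normal(0.5, 1.2, 10_000)     # the world shifted
print(psi(train_feature, prod_feature))         # > 0.2 -> investigate
```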
If two nearly identical inputs produce wildly different feature attributions or saliency maps, the explanations are not telling you something stable about the model. Alvarez-Melis and Jaakkola made this point in their 2018 paper "On the Robustness of Interpretability Methods," showing that several popular explanation techniques produce very different explanations for visually indistinguishable inputs. This is the explanation analog of adversarial examples and a reason to be skeptical of single saliency maps as evidence for what a model is doing.
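The instability being measured can be estimated directly: perturb the input slightly and see how far the attribution vector moves. A generic sketch, loosely modeled on a local-Lipschitz estimate; the two saliency functions below are hypothetical toys, not methods from the paper:

```python
import numpy as np

def explanation_instability(attr_fn, x, eps=0.05, n_samples=100, seed=0):
    """Worst relative change in an attribution vector over small random
    input perturbations (a crude local-Lipschitz style estimate)."""
    rng = np.random.default_rng(seed)
    base = attr_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        diff = attr_fn(x + eps * rng.normal(size=x.shape)) - base
        worst = max(worst, np.linalg.norm(diff))
    return worst / np.linalg.norm(base)

# Hypothetical saliency maps: gradients of a smooth and a bumpy function.
smooth_saliency = lambda x: x                  # gradient of ||x||^2 / 2
bumpy_saliency = lambda x: np.cos(50.0 * x)    # gradient of sum(sin(50x))/50

x = np.ones(10)
print(explanation_instability(smooth_saliency, x))  # small: explanation stable
print(explanation_instability(bumpy_saliency, x))   # large (>1): it is not
```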
Many standard machine learning techniques can be reframed as stability-promoting interventions. The table below maps common techniques to the kind of stability they target.
| Technique | Primary target | Brief mechanism |
|---|---|---|
| L2 regularization | Algorithmic stability | Penalizes large weights, bounds influence of any single point |
| Dropout | Algorithmic and dynamics | Random masking averages over many subnetworks |
| Data augmentation | Algorithmic and robustness | Reduces dependence on the exact training set |
| Early stopping | Algorithmic | Limits SGD iterations, tightening the Hardt-Recht-Singer bound |
| Cross-validation | Hyperparameter selection | Empirical estimate of stability across folds |
| Bootstrap aggregation (bagging) | Variance reduction | Trains models on resampled datasets and averages |
| Ensembling | Variance reduction | Averages independently trained models |
| Batch normalization | Dynamics | Normalizes per-layer activations within a mini-batch |
| Layer normalization | Dynamics | Normalizes per-token activations across features |
| Gradient clipping | Dynamics | Caps update magnitude per step |
| Weight initialization (Xavier, He) | Dynamics | Keeps initial activations and gradients in stable ranges |
| Mixed-precision loss scaling | Numerical | Multiplies loss by a constant to keep FP16 gradients in range |
| Adversarial training | Input robustness | Trains on worst-case perturbations within an epsilon ball |
| Differential privacy | Algorithmic | Adds calibrated noise, implies stability and generalization |
Breiman's 1996 paper on bagging made the link between stability and ensembling explicit: bagging helps most for unstable predictors (decision trees, neural networks) and barely helps for stable ones (k-nearest neighbors with k larger than 1). The instability of the base predictor is what bagging exploits.
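A quick scikit-learn illustration of Breiman's observation; exact numbers depend on the dataset and seeds, but the pattern typically reproduces:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging typically lifts the unstable tree noticeably and barely moves
# the stable k-NN predictor.
models = [
    ("tree        ", DecisionTreeClassifier(random_state=0)),
    ("bagged trees", BaggingClassifier(DecisionTreeClassifier(),
                                       n_estimators=50, random_state=0)),
    ("5-NN        ", KNeighborsClassifier(n_neighbors=5)),
    ("bagged 5-NN ", BaggingClassifier(KNeighborsClassifier(n_neighbors=5),
                                       n_estimators=50, random_state=0)),
]
for name, clf in models:
    print(name, np.round(cross_val_score(clf, X, y, cv=5).mean(), 3))
```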
Classical stability theory assumes the algorithm fits the training data without memorizing it. Overparameterized neural networks complicate this. Models with many more parameters than training examples can drive training error to zero on randomly labeled data (Zhang et al. 2017, "Understanding deep learning requires rethinking generalization") and yet generalize well on real labels. The classical uniform-stability bound for SGD predicts generalization should degrade with longer training, but in practice large models often improve over many epochs. Modern refinements use PAC-Bayesian bounds, data-dependent stability, and analyses of implicit regularization to try to close the gap. Feldman and Vondrak (2019) gave tight generalization bounds for uniformly stable algorithms that match the classical lower bound, sharpening but not resolving the deep learning puzzle.
In production, stability also has an organizational meaning. A model that scores well in offline tests but whose predictions swing wildly between weekly retrains is a bad model to depend on. Common practices include shadow deployments, gradual rollouts, A/B tests with stability checks (asking whether the new model agrees with the old one in cases where they should agree), and monitoring for input drift. Drift in calibration is often the first sign that a model is no longer reliable, even before raw accuracy noticeably drops.
In the simplest terms, stability is a model's ability to give roughly the same answers even when small changes are made to its training data, its settings, or its building process. Think of it like building a tower with blocks. If the tower is constructed poorly, even a small breeze can knock it down. If it is built solidly, even a strong wind cannot push it over. Techniques like cross-validation, bootstrapping, regularization, and normalization are different ways of making the tower stronger so that it does not fall over for silly reasons.