See also: Machine learning terms
A termination condition, also called a stopping criterion, convergence criterion, or halting condition, is a rule that decides when an iterative algorithm should stop running. In machine learning, termination conditions appear in almost every iterative procedure: gradient-based optimization, tree induction, clustering, reinforcement learning episodes, evolutionary search, Markov chain Monte Carlo, and the token-by-token decoding loop of a language model. The choice of stopping rule directly shapes how long training takes, how well a model generalizes, and whether results are reproducible from one run to the next.
Most real systems combine several conditions rather than relying on a single rule. A typical training script stops at whichever of these fires first: the maximum number of epochs is reached, validation loss fails to improve for a fixed patience window, or a wall-clock budget runs out. Combining a hard upper bound with an early-exit rule keeps the procedure both safe (it always terminates) and efficient (it does not burn compute past the point where the model has stopped improving).
A badly chosen stopping rule has three concrete consequences. First, training too long invites overfitting: the model begins to memorize noise in the training set, and validation accuracy starts to fall even while training loss keeps dropping. Second, overlong runs waste compute, a real cost when a single experiment can occupy GPUs for hours or days. Third, vague or undocumented stopping rules make experiments hard to reproduce; a paper that says "we trained until convergence" without specifying the criterion leaves a key hyperparameter implicit.
A sensible default in most deep learning workflows is to use both a maximum epoch count and a patience-based early stopping rule on a held-out validation set, then save the best checkpoint encountered along the way. This pattern shows up in nearly every modern training framework.
Different algorithm families have their own conventions. The table below summarizes the most common stopping criteria across optimization paradigms.
| Paradigm | Common termination conditions | Typical implementation |
|---|---|---|
| First-order optimization | Gradient norm below tolerance, loss change below tolerance, max iterations | tol, max_iter |
| Neural network training | Max epochs, validation metric plateau, time budget, learning rate floor | EarlyStopping callback, max_epochs |
| Decision tree induction | Max depth, min samples per leaf or split, min impurity decrease, max leaf nodes | Pre-pruning hyperparameters |
| Clustering | Centroid movement below tolerance, max iterations, inertia change | K-means tol and max_iter |
| Genetic algorithms | Generation count, fitness plateau, diversity collapse | Stagnation window |
| Reinforcement learning | Terminal state reached, max steps per episode, total environment steps budget | terminated / truncated flags |
| Bandits and active learning | Sample budget exhausted, confidence interval width below threshold | Pulls allocated, posterior shrinkage |
| MCMC | Burn-in length plus convergence diagnostics, fixed sample count | R-hat below threshold, ESS target |
| LLM decoding | EOS token, custom stop sequence, max tokens, time budget | stop, max_tokens parameters |
For smooth optimization with gradient descent or its variants, the classical termination rule is to stop when the gradient norm drops below a small tolerance: stop when ||∇L(w)|| < ε. This reflects the first-order optimality condition, since at a stationary point the gradient is zero. Practical implementations usually compute the 2-norm of the gradient vector, although the infinity norm (largest absolute component) is sometimes preferred for high-dimensional problems where it gives a per-coordinate guarantee. Checking the gradient norm adds essentially no overhead, because the gradient computed at the end of one iteration is already needed at the start of the next.
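As a concrete sketch (plain NumPy, with an arbitrary ill-conditioned quadratic standing in for the loss), the classical gradient-norm rule looks like this:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, eps=1e-6, max_iter=10_000):
    """Run gradient descent until ||grad(w)|| < eps or max_iter is hit."""
    w = np.asarray(w0, dtype=float)
    for i in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:        # first-order optimality test
            return w, i, "gradient_norm"   # report which condition fired
        w -= lr * g
    return w, max_iter, "max_iter"

# Example: minimize L(w) = 0.5 * w^T A w with an ill-conditioned A.
A = np.diag([1.0, 10.0])
w_star, n_iters, reason = gradient_descent(lambda w: A @ w, [5.0, 5.0], lr=0.05)
```

Returning a label for the condition that fired costs nothing and pays off later when diagnosing runs.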
Other common rules include stopping when the change in loss between iterations falls below a tolerance, when the change in parameters falls below a tolerance, or when a maximum iteration count is reached. SciPy's scipy.optimize.minimize exposes these criteria through solver options such as gtol, ftol, xtol, and maxiter, with the exact set depending on the chosen method. For stochastic gradient descent the gradient norm is a noisy estimator and is rarely used directly; practitioners instead monitor a smoothed running loss or a held-out validation metric.
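A minimal SciPy example on the same sort of quadratic; gtol and maxiter are standard BFGS options:

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0])
result = minimize(
    fun=lambda w: 0.5 * w @ A @ w,
    x0=np.array([5.0, 5.0]),
    jac=lambda w: A @ w,
    method="BFGS",
    options={"gtol": 1e-8, "maxiter": 1000},  # gradient-norm tolerance and iteration cap
)
print(result.message, result.nit)  # message reports which criterion ended the run
```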
For smooth convex problems, gradient descent needs roughly O(1/ε) iterations to bring the loss within ε of the optimum; with strong convexity this improves to O(log(1/ε)), with constants that depend on conditioning. These bounds are mostly used to choose a reasonable max_iter value rather than as live stopping rules.
In neural network training the dominant stopping rule is early stopping with a patience parameter. After each epoch (or every fixed number of steps) the model is evaluated on a validation set; training continues as long as the monitored metric keeps improving and stops once it has failed to improve for patience consecutive epochs. The original systematic study of this idea is Lutz Prechelt's 1998 paper "Automatic early stopping using cross validation: quantifying the criteria," published in Neural Networks, which compared 14 different stopping rules across 12 tasks and found that more patient criteria produced about 4% better generalization at roughly 4 times the training cost.
Early stopping doubles as an implicit regularizer. By halting before the parameters drift far from initialization, it has roughly the same effect as L2 weight decay in the linear-regression case, a result discussed in detail in Goodfellow, Bengio, and Courville's Deep Learning (2016). The framework-level controls are similar across libraries, although the parameter names differ.
| Framework | API | Key parameters |
|---|---|---|
| Keras | keras.callbacks.EarlyStopping | monitor, patience, min_delta, mode, restore_best_weights, baseline, start_from_epoch |
| PyTorch Lightning | lightning.pytorch.callbacks.EarlyStopping | monitor, patience, min_delta, mode, check_finite, stopping_threshold, divergence_threshold |
| scikit-learn (MLP, SGDClassifier) | constructor flag early_stopping=True | validation_fraction, n_iter_no_change, tol |
| XGBoost | early_stopping_rounds argument to fit or train | rounds without improvement on the eval set |
| LightGBM | early_stopping_round parameter (alias early_stopping_rounds, n_iter_no_change) | rounds without improvement on a validation metric |
A common subtlety in Keras is that restore_best_weights=True only restores weights when the patience criterion actually fires; if training hits max_epochs first, the final weights are kept rather than the best ones. Practitioners often add a ModelCheckpoint callback alongside EarlyStopping to ensure the best weights are written to disk regardless of which condition stopped training.
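A common pairing looks like the following sketch; the monitored metric, file name, and hyperparameter values are illustrative:

```python
import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,              # epochs without improvement before stopping
    min_delta=1e-4,           # smaller changes do not count as improvement
    restore_best_weights=True,
)
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,      # best weights hit disk even if max epochs fires first
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop, checkpoint])
```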
Decision trees build top-down by recursively partitioning the data, and the termination condition controls when a node is declared a leaf. scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor expose several pre-pruning parameters that act as stopping conditions: max_depth caps the depth of every branch, min_samples_split requires a minimum number of samples in a node before it can be split, min_samples_leaf enforces a minimum size for any resulting leaf, min_impurity_decrease requires that a split reduce impurity by at least a given amount, and max_leaf_nodes caps the total number of leaves. Setting these too loose produces a tree that overfits; setting them too tight produces an underfit tree that misses real structure.
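In scikit-learn these stopping conditions are plain constructor arguments; the values below are illustrative rather than recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=5,                 # no branch grows deeper than 5
    min_samples_split=10,        # a node needs >= 10 samples to be split
    min_samples_leaf=4,          # every resulting leaf keeps >= 4 samples
    min_impurity_decrease=1e-3,  # a split must reduce impurity by at least this
    max_leaf_nodes=20,           # global cap on the number of leaves
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```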
Post-pruning techniques such as cost-complexity pruning, exposed in scikit-learn through the ccp_alpha parameter, take a different approach: the tree is grown to its full size and then pruned back according to a complexity penalty. Here the termination condition during growth is essentially "stop when the data in a node is pure or nearly so," with the regularization happening afterward.
Iterative clustering algorithms typically alternate between assignment and update steps and need a rule for when to stop iterating. scikit-learn's KMeans uses two: tol, the relative tolerance with respect to the Frobenius norm of the difference in centroid positions between consecutive iterations (default 1e-4), and max_iter, the maximum number of iterations per run (default 300). Iteration stops as soon as either condition is met. Expectation-maximization for Gaussian mixture models, hierarchical clustering, and DBSCAN-style algorithms have analogous criteria, usually expressed as a tolerance on parameter changes plus a maximum iteration count.
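Comparing the fitted n_iter_ attribute against max_iter reveals which condition stopped the run; in this sketch make_blobs just supplies synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
km = KMeans(n_clusters=5, tol=1e-4, max_iter=300, n_init=10, random_state=0).fit(X)
# n_iter_ < max_iter means the tol condition (centroid movement) stopped it
print(km.n_iter_)
```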
Evolutionary algorithms run for a fixed number of generations or until some stagnation criterion fires. The most common stopping rules are a maximum generation count, a target fitness threshold, and a stagnation window: stop if the best fitness has not improved by more than ε over the last T generations. Some implementations also monitor population diversity and stop when it collapses below a threshold, which usually signals premature convergence to a local optimum. Production genetic algorithm frameworks typically combine a hard generation cap with stagnation detection so that easy problems finish quickly while hard ones still receive a fixed compute budget.
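A stagnation window is straightforward to express directly. In this generic sketch, evolve_one_generation and best_fitness are placeholders for whatever the framework provides:

```python
def run_ga(population, evolve_one_generation, best_fitness,
           max_generations=500, stagnation_window=50, eps=1e-6):
    best, stagnant = float("-inf"), 0
    for gen in range(max_generations):
        population = evolve_one_generation(population)
        fit = best_fitness(population)
        if fit > best + eps:           # counts as improvement only if > eps
            best, stagnant = fit, 0
        else:
            stagnant += 1
        if stagnant >= stagnation_window:
            return population, gen, "stagnation"
    return population, max_generations, "generation_cap"
```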
In reinforcement learning the term "termination" has a specific technical meaning that differs from supervised training. Each episode in a Markov decision process ends in one of two ways: the agent reaches a terminal state, where no further transitions are possible (the game is won or lost, the robot has fallen, the goal is reached), or the episode is truncated by an external time limit even though the underlying process could continue indefinitely.
From Gymnasium version 0.26 onward, the Step API explicitly returns both flags. The previous combined done flag was deprecated because it conflated these two cases and caused subtle bugs in bootstrapping. When an episode terminates the value function for the next state is zero by definition, so the Bellman update uses only the immediate reward. When an episode is truncated the next state still has a meaningful value, so bootstrapping should continue. Treating truncation as termination causes algorithms to learn that any reward grabbed before the time limit is "free," which produces pathological behaviors such as the half-cheetah agent crashing on its head as long as it scored a few extra reward points first.
| Episode end type | Cause | Bellman target | Gymnasium return value |
|---|---|---|---|
| Termination | Goal reached, agent died, environment-defined terminal state | r (no bootstrap) | terminated=True, truncated=False |
| Truncation | Time limit, external cutoff, wrapper such as TimeLimit | r + γ V(s') (still bootstrap) | terminated=False, truncated=True |
| Ongoing | Neither condition met | r + γ V(s') | both False |
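A minimal Gymnasium loop that honors the distinction; value() in the comment stands in for whatever value estimator the agent maintains:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # wrapped in TimeLimit, so truncation can occur
obs, info = env.reset(seed=0)
gamma = 0.99
while True:
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Bootstrap only if the episode did NOT reach a terminal state:
    # target = reward + gamma * (0.0 if terminated else value(next_obs))
    if terminated or truncated:
        break
    obs = next_obs
env.close()
```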
Beyond per-episode termination, RL training itself has a higher-level termination condition. This is usually a total environment-step budget (often hundreds of millions of steps for Atari benchmarks or billions for large-scale robotics), a fixed number of policy update epochs, or a target performance level on a held-out evaluation suite.
Multi-armed bandit algorithms sample arms until either a fixed sample budget runs out or the algorithm becomes confident enough to commit to one arm. Best-arm-identification procedures stop when the confidence interval around the leading arm's estimated value separates from the others by more than a target margin. Active learning and Bayesian optimization use analogous rules, stopping when the expected improvement from another query falls below a threshold or when a labeling budget is reached.
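As a toy sketch of confidence-based stopping, the following uses Hoeffding-style intervals on simulated Bernoulli arms; the true means and the round-robin sampling are purely for the demo:

```python
import math
import random

random.seed(0)
true_means = [0.4, 0.5, 0.7]           # unknown to the algorithm
counts, sums = [0] * 3, [0.0] * 3
budget, delta = 100_000, 0.05

def radius(n, t):
    # Hoeffding-style confidence radius for rewards in [0, 1]
    return math.sqrt(math.log(4 * t ** 2 / delta) / (2 * n))

for t in range(1, budget + 1):
    arm = (t - 1) % 3                  # round-robin sampling for simplicity
    sums[arm] += float(random.random() < true_means[arm])
    counts[arm] += 1
    if min(counts) == 0:
        continue
    means = [s / n for s, n in zip(sums, counts)]
    best = max(range(3), key=lambda a: means[a])
    # Stop early once the best arm's lower bound beats every rival's upper bound
    if all(means[best] - radius(counts[best], t) >
           means[a] + radius(counts[a], t) for a in range(3) if a != best):
        break
print(best, t)  # committed arm and samples used
```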
MCMC samplers do not converge to a single answer; they generate samples from a target distribution. The termination problem is therefore split into two parts. A burn-in phase discards initial samples that depend too strongly on the starting point. After burn-in, the chain runs until enough effectively independent samples have been collected. Convergence diagnostics such as the Gelman-Rubin statistic R-hat (introduced in Gelman and Rubin 1992 and improved by Vehtari et al. 2021) compare the variance within and across multiple chains; values close to 1 indicate the chains have mixed. A common operational rule is to declare convergence when R-hat for every parameter of interest falls below 1.01 or 1.1, depending on how strict the analyst wants to be. Effective sample size (ESS) is the second standard target, with rules of thumb such as ESS > 400 per parameter for stable posterior summaries.
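A basic version of the statistic takes a few lines of NumPy; this sketch omits the rank-normalization and split-chain refinements of Vehtari et al.:

```python
import numpy as np

def rhat(chains):
    """Basic Gelman-Rubin R-hat. chains: array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

# Operational rule: keep sampling until rhat(chains) < 1.01
# for every parameter of interest.
```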
For large language models the training termination story is unusual. Modern LLMs are typically trained for a fixed number of optimizer steps consuming a planned token budget (often in the trillions of tokens) rather than to convergence in the classical sense. Training rarely runs until the loss flattens, both because the loss continues to improve slowly long after diminishing returns set in and because compute, not convergence, is the binding constraint.
At inference time, generation is governed by a different set of stopping rules. The model itself emits an end-of-sequence (EOS) token that the runtime treats as a stop signal. Users can supply additional stop strings that abort generation when matched in the output stream; the OpenAI API allows up to 4 such sequences. A max_tokens parameter caps the worst-case length, and many production deployments add a wall-clock deadline so that a single slow request does not block other traffic. Reasoning models that perform extended internal deliberation, such as OpenAI's o-series or Anthropic's Claude with extended thinking enabled, expose a separate budget on the number of "thinking" tokens the model is allowed to spend before it must commit to a final answer.
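In API terms the controls look like this (a sketch using the OpenAI Python client; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",                   # illustrative model name
    messages=[{"role": "user", "content": "Name three stopping criteria."}],
    max_tokens=200,                        # hard cap on generated tokens
    stop=["\n\n"],                         # custom stop sequence (up to 4 allowed)
)
# finish_reason is "stop" for EOS / stop sequence, "length" for the token cap
print(resp.choices[0].finish_reason)
```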
The most reliable approach is to combine multiple termination conditions and to pick the one that fires first. A typical recipe for supervised learning looks like this: a maximum epoch count high enough that the model reaches its best validation score; a patience-based early stopping rule on a sensible validation metric; a save-best-checkpoint mechanism so the best weights are recoverable; and a wall-clock budget as a final safety net.
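Put together, the recipe fits in a short framework-agnostic loop; train_one_epoch, evaluate, snapshot, and restore are placeholders for whatever the training stack provides:

```python
import time

def train(model, train_one_epoch, evaluate, snapshot, restore,
          max_epochs=200, patience=10, min_delta=1e-4, time_budget_s=8 * 3600):
    best, best_state, bad_epochs = float("-inf"), None, 0
    start, reason = time.monotonic(), "max_epochs"
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)                 # held-out validation metric
        if score > best + min_delta:
            best, best_state, bad_epochs = score, snapshot(model), 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            reason = "patience"
            break
        if time.monotonic() - start > time_budget_s:
            reason = "time_budget"
            break
    if best_state is not None:
        restore(model, best_state)              # recover the best checkpoint
    return best, reason                         # record which condition fired
```

Returning the reason string makes the logging advice below automatic rather than an afterthought.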
A few mistakes show up over and over. Stopping too early produces an underfit model and is often caused by a patience value that does not account for noise in the validation curve. Stopping too late wastes compute and risks overfitting; this is the failure mode when only max_iter is used without any improvement check. Choosing the wrong validation metric is subtle: optimizing F1 while monitoring loss can stop training at the wrong epoch when the two are not perfectly aligned. Reusing the validation set for both early stopping and final reporting inflates the reported score, so a separate held-out test set should be evaluated only after training has stopped.
For RL, the single largest source of bugs is conflating termination and truncation. Always honor the distinction in the Bellman update. For LLM inference, forgetting to set max_tokens (or setting it too high) is the standard way to discover that the model can talk forever. For MCMC, declaring convergence based on a single chain's apparent stationarity is a classic trap; always run multiple chains and check R-hat across them.
Finally, log the termination condition that actually fired. Knowing whether a run stopped because it converged, ran out of patience, hit the iteration cap, or was interrupted manually is essential for diagnosing what went wrong and for reproducing the experiment later.
When a computer is learning from data, it keeps practicing over and over. A termination condition is just a rule that tells the computer when to stop practicing. Maybe it stops after a set number of practice rounds, or when its scores on a quiz stop getting better, or when a timer runs out. Picking a good rule matters: if the computer stops too soon, it will not have learned enough; if it keeps practicing forever, it will start memorizing the practice questions instead of understanding the topic, and it will also burn through a lot of electricity for no reason.