See also: Machine learning terms
A termination condition, also called a stopping criterion, convergence criterion, or halting condition, is a rule that decides when an iterative algorithm should stop running. In machine learning, termination conditions appear in almost every iterative procedure: gradient-based optimization, tree induction, clustering, reinforcement learning episodes, evolutionary search, Markov chain Monte Carlo, and the token-by-token decoding loop of a language model. The choice of stopping rule directly shapes how long training takes, how well a model generalizes, and whether results are reproducible from one run to the next.
Most real systems combine several conditions rather than relying on a single rule. A typical training script stops at whichever of these fires first: the maximum number of epochs is reached, validation loss fails to improve for a fixed patience window, or a wall-clock budget runs out. Combining a hard upper bound with an early-exit rule keeps the procedure both safe (it always terminates) and efficient (it does not burn compute past the point where the model has stopped improving).
A badly chosen stopping rule has three concrete consequences. First, training too long invites overfitting: the model begins to memorize noise in the training set, and validation accuracy starts to fall even while training loss keeps dropping. Second, overlong runs waste compute, a real cost when a single experiment can occupy GPUs for hours or days. Third, vague or undocumented stopping rules make experiments hard to reproduce; a paper that says "we trained until convergence" without specifying the criterion leaves a key hyperparameter implicit.
A sensible default in most deep learning workflows is to use both a maximum epoch count and a patience-based early stopping rule on a held-out validation set, then save the best checkpoint encountered along the way. This pattern shows up in nearly every modern training framework.
Different algorithm families have their own conventions. The table below summarizes the most common stopping criteria across optimization paradigms.
| Paradigm | Common termination conditions | Typical implementation |
|---|---|---|
| First-order optimization | Gradient norm below tolerance, loss change below tolerance, max iterations | tol, max_iter |
| Neural network training | Max epochs, validation metric plateau, time budget, learning rate floor | EarlyStopping callback, max_epochs |
| Decision tree induction | Max depth, min samples per leaf or split, min impurity decrease, max leaf nodes | Pre-pruning hyperparameters |
| Clustering | Centroid movement below tolerance, max iterations, inertia change | K-means tol and max_iter |
| Genetic algorithms | Generation count, fitness plateau, diversity collapse | Stagnation window |
| Reinforcement learning | Terminal state reached, max steps per episode, total environment steps budget | terminated / truncated flags |
| Bandits and active learning | Sample budget exhausted, confidence interval width below threshold | Pulls allocated, posterior shrinkage |
| MCMC | Burn-in length plus convergence diagnostics, fixed sample count | R-hat below threshold, ESS target |
| LLM decoding | EOS token, custom stop sequence, max tokens, time budget | stop, max_tokens parameters |
For smooth optimization with gradient descent or its variants, the classical termination rule is to stop when the gradient norm drops below a small tolerance: stop when ||∇L(w)|| < ε. This reflects the first-order optimality condition, since at a stationary point the gradient is zero. Practical implementations usually compute the 2-norm of the gradient vector, although the infinity norm (largest absolute component) is sometimes preferred for high-dimensional problems where it gives a per-coordinate guarantee. Checking the gradient norm adds essentially no overhead, because the gradient computed at the end of one iteration is already needed at the start of the next.
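As a concrete sketch (plain NumPy, with an arbitrary ill-conditioned quadratic standing in for the loss), the classical gradient-norm rule looks like this:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, eps=1e-6, max_iter=10_000):
    """Run gradient descent until ||grad(w)|| < eps or max_iter is hit."""
    w = np.asarray(w0, dtype=float)
    for i in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:        # first-order optimality test
            return w, i, "gradient_norm"   # report which condition fired
        w -= lr * g
    return w, max_iter, "max_iter"

# Example: minimize L(w) = 0.5 * w^T A w with an ill-conditioned A.
A = np.diag([1.0, 10.0])
w_star, n_iters, reason = gradient_descent(lambda w: A @ w, [5.0, 5.0], lr=0.05)
```

Returning a label for the condition that fired costs nothing and pays off later when diagnosing runs.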
Other common rules include stopping when the change in loss between iterations falls below a tolerance, when the change in parameters falls below a tolerance, or when a maximum iteration count is reached. SciPy's scipy.optimize.minimize exposes these criteria through solver options such as gtol, ftol, xtol, and maxiter, with the exact set depending on the chosen method. For stochastic gradient descent the gradient norm is a noisy estimator and is rarely used directly; practitioners instead monitor a smoothed running loss or a held-out validation metric.
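A minimal SciPy example on the same sort of quadratic; gtol and maxiter are standard BFGS options:

```python
import numpy as np
from scipy.optimize import minimize

A = np.diag([1.0, 10.0])
result = minimize(
    fun=lambda w: 0.5 * w @ A @ w,
    x0=np.array([5.0, 5.0]),
    jac=lambda w: A @ w,
    method="BFGS",
    options={"gtol": 1e-8, "maxiter": 1000},  # gradient-norm tolerance and iteration cap
)
print(result.message, result.nit)  # message reports which criterion ended the run
```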
For smooth convex problems, gradient descent needs roughly O(1/ε) iterations to bring the loss within ε of the optimum; with strong convexity this improves to O(log(1/ε)), with constants that depend on conditioning. These bounds are mostly used to choose a reasonable max_iter value rather than as live stopping rules.
In neural network training the dominant stopping rule is early stopping with a patience parameter. After each epoch (or every fixed number of steps) the model is evaluated on a validation set; training continues as long as the monitored metric keeps improving and stops once it has failed to improve for patience consecutive epochs. The original systematic study of this idea is Lutz Prechelt's 1998 paper "Automatic early stopping using cross validation: quantifying the criteria," published in Neural Networks, which compared 14 different stopping rules across 12 tasks and found that more patient criteria produced about 4% better generalization at roughly 4 times the training cost.
Early stopping doubles as an implicit regularizer. By halting before the parameters drift far from initialization, it has roughly the same effect as L2 weight decay in the linear-regression case, a result discussed in detail in Goodfellow, Bengio, and Courville's Deep Learning (2016). The framework-level controls are similar across libraries, although the parameter names differ.
| Framework | API | Key parameters |
|---|---|---|
| Keras | keras.callbacks.EarlyStopping | monitor, patience, min_delta, mode, restore_best_weights, baseline, start_from_epoch |
| PyTorch Lightning | lightning.pytorch.callbacks.EarlyStopping | monitor, patience, min_delta, mode, check_finite, stopping_threshold, divergence_threshold |
| scikit-learn (MLP, SGDClassifier) | constructor flag early_stopping=True | validation_fraction, n_iter_no_change, tol |
| XGBoost | early_stopping_rounds argument to fit or train | rounds without improvement on the eval set |
| LightGBM | early_stopping_round parameter (alias early_stopping_rounds, n_iter_no_change) | rounds without improvement on a validation metric |
A common subtlety in Keras is that restore_best_weights=True only restores weights when the patience criterion actually fires; if training hits max_epochs first, the final weights are kept rather than the best ones. Practitioners often add a ModelCheckpoint callback alongside EarlyStopping to ensure the best weights are written to disk regardless of which condition stopped training.
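A common pairing looks like the following sketch; the monitored metric, file name, and hyperparameter values are illustrative:

```python
import keras

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=10,              # epochs without improvement before stopping
    min_delta=1e-4,           # smaller changes do not count as improvement
    restore_best_weights=True,
)
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.keras",
    monitor="val_loss",
    save_best_only=True,      # best weights hit disk even if max epochs fires first
)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop, checkpoint])
```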
Decision trees build top-down by recursively partitioning the data, and the termination condition controls when a node is declared a leaf. scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor expose several pre-pruning parameters that act as stopping conditions: max_depth caps the depth of every branch, min_samples_split requires a minimum number of samples in a node before it can be split, min_samples_leaf enforces a minimum size for any resulting leaf, min_impurity_decrease requires that a split reduce impurity by at least a given amount, and max_leaf_nodes caps the total number of leaves. Setting these too loose produces a tree that overfits; setting them too tight produces an underfit tree that misses real structure.
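In scikit-learn these stopping conditions are plain constructor arguments; the values below are illustrative rather than recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    max_depth=5,                 # no branch grows deeper than 5
    min_samples_split=10,        # a node needs >= 10 samples to be split
    min_samples_leaf=4,          # every resulting leaf keeps >= 4 samples
    min_impurity_decrease=1e-3,  # a split must reduce impurity by at least this
    max_leaf_nodes=20,           # global cap on the number of leaves
    random_state=0,
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```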
Post-pruning techniques such as cost-complexity pruning, exposed in scikit-learn through the ccp_alpha parameter, take a different approach: the tree is grown to its full size and then pruned back according to a complexity penalty. Here the termination condition during growth is essentially "stop when the data in a node is pure or nearly so," with the regularization happening afterward.
Iterative clustering algorithms typically alternate between assignment and update steps and need a rule for when to stop iterating. scikit-learn's KMeans uses two: tol, the relative tolerance with respect to the Frobenius norm of the difference in centroid positions between consecutive iterations (default 1e-4), and max_iter, the maximum number of iterations per run (default 300). Iteration stops as soon as either condition is met. Expectation-maximization for Gaussian mixture models, hierarchical clustering, and DBSCAN-style algorithms have analogous criteria, usually expressed as a tolerance on parameter changes plus a maximum iteration count.
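Comparing the fitted n_iter_ attribute against max_iter reveals which condition stopped the run; in this sketch make_blobs just supplies synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
km = KMeans(n_clusters=5, tol=1e-4, max_iter=300, n_init=10, random_state=0).fit(X)
# n_iter_ < max_iter means the tol condition (centroid movement) stopped it
print(km.n_iter_)
```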
Evolutionary algorithms run for a fixed number of generations or until some stagnation criterion fires. The most common stopping rules are a maximum generation count, a target fitness threshold, and a stagnation window: stop if the best fitness has not improved by more than ε over the last T generations. Some implementations also monitor population diversity and stop when it collapses below a threshold, which usually signals premature convergence to a local optimum. Production genetic algorithm frameworks typically combine a hard generation cap with stagnation detection so that easy problems finish quickly while hard ones still receive a fixed compute budget.
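A stagnation window is straightforward to express directly. In this generic sketch, evolve_one_generation and best_fitness are placeholders for whatever the framework provides:

```python
def run_ga(population, evolve_one_generation, best_fitness,
           max_generations=500, stagnation_window=50, eps=1e-6):
    best, stagnant = float("-inf"), 0
    for gen in range(max_generations):
        population = evolve_one_generation(population)
        fit = best_fitness(population)
        if fit > best + eps:           # counts as improvement only if > eps
            best, stagnant = fit, 0
        else:
            stagnant += 1
        if stagnant >= stagnation_window:
            return population, gen, "stagnation"
    return population, max_generations, "generation_cap"
```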
In reinforcement learning the term "termination" has a specific technical meaning that differs from supervised training. Each episode in a Markov decision process ends in one of two ways: the agent reaches a terminal state, where no further transitions are possible (the game is won or lost, the robot has fallen, the goal is reached), or the episode is truncated by an external time limit even though the underlying process could continue indefinitely.
From Gymnasium version 0.26 onward, the Step API explicitly returns both flags. The previous combined done flag was deprecated because it conflated these two cases and caused subtle bugs in bootstrapping. When an episode terminates the value function for the next state is zero by definition, so the Bellman update uses only the immediate reward. When an episode is truncated the next state still has a meaningful value, so bootstrapping should continue. Treating truncation as termination causes algorithms to learn that any reward grabbed before the time limit is "free," which produces pathological behaviors such as the half-cheetah agent crashing on its head as long as it scored a few extra reward points first.
| Episode end type | Cause | Bellman target | Gymnasium return value |
|---|---|---|---|
| Termination | Goal reached, agent died, environment-defined terminal state | r (no bootstrap) | terminated=True, truncated=False |
| Truncation | Time limit, external cutoff, wrapper such as TimeLimit | r + γ V(s') (still bootstrap) | terminated=False, truncated=True |
| Ongoing | Neither condition met | r + γ V(s') | both False |
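A minimal Gymnasium loop that honors the distinction; value() in the comment stands in for whatever value estimator the agent maintains:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # wrapped in TimeLimit, so truncation can occur
obs, info = env.reset(seed=0)
gamma = 0.99
while True:
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, info = env.step(action)
    # Bootstrap only if the episode did NOT reach a terminal state:
    # target = reward + gamma * (0.0 if terminated else value(next_obs))
    if terminated or truncated:
        break
    obs = next_obs
env.close()
```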
Beyond per-episode termination, RL training itself has a higher-level termination condition. This is usually a total environment-step budget (often hundreds of millions of steps for Atari benchmarks or billions for large-scale robotics), a fixed number of policy update epochs, or a target performance level on a held-out evaluation suite.
Multi-armed bandit algorithms sample arms until either a fixed sample budget runs out or the algorithm becomes confident enough to commit to one arm. Best-arm-identification procedures stop when the confidence interval around the leading arm's estimated value separates from the others by more than a target margin. Active learning and Bayesian optimization use analogous rules, stopping when the expected improvement from another query falls below a threshold or when a labeling budget is reached.
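As a toy sketch of confidence-based stopping, the following uses Hoeffding-style intervals on simulated Bernoulli arms; the true means and the round-robin sampling are purely for the demo:

```python
import math
import random

random.seed(0)
true_means = [0.4, 0.5, 0.7]           # unknown to the algorithm
counts, sums = [0] * 3, [0.0] * 3
budget, delta = 100_000, 0.05

def radius(n, t):
    # Hoeffding-style confidence radius for rewards in [0, 1]
    return math.sqrt(math.log(4 * t ** 2 / delta) / (2 * n))

for t in range(1, budget + 1):
    arm = (t - 1) % 3                  # round-robin sampling for simplicity
    sums[arm] += float(random.random() < true_means[arm])
    counts[arm] += 1
    if min(counts) == 0:
        continue
    means = [s / n for s, n in zip(sums, counts)]
    best = max(range(3), key=lambda a: means[a])
    # Stop early once the best arm's lower bound beats every rival's upper bound
    if all(means[best] - radius(counts[best], t) >
           means[a] + radius(counts[a], t) for a in range(3) if a != best):
        break
print(best, t)  # committed arm and samples used
```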
MCMC samplers do not converge to a single answer; they generate samples from a target distribution. The termination problem is therefore split into two parts. A burn-in phase discards initial samples that depend too strongly on the starting point. After burn-in, the chain runs until enough effectively independent samples have been collected. Convergence diagnostics such as the Gelman-Rubin statistic R-hat (introduced in Gelman and Rubin 1992 and improved by Vehtari et al. 2021) compare the variance within and across multiple chains; values close to 1 indicate the chains have mixed. A common operational rule is to declare convergence when R-hat for every parameter of interest falls below 1.01 or 1.1, depending on how strict the analyst wants to be. Effective sample size (ESS) is the second standard target, with rules of thumb such as ESS > 400 per parameter for stable posterior summaries.
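A basic version of the statistic takes a few lines of NumPy; this sketch omits the rank-normalization and split-chain refinements of Vehtari et al.:

```python
import numpy as np

def rhat(chains):
    """Basic Gelman-Rubin R-hat. chains: array of shape (n_chains, n_samples)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)

# Operational rule: keep sampling until rhat(chains) < 1.01
# for every parameter of interest.
```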
For large language models the training termination story is unusual. Modern LLMs are typically trained for a fixed number of optimizer steps consuming a planned token budget (often in the trillions of tokens) rather than to convergence in the classical sense. Training rarely runs until the loss flattens, both because the loss continues to improve slowly long after diminishing returns set in and because compute, not convergence, is the binding constraint.
At inference time, generation is governed by a different set of stopping rules. The model itself emits an end-of-sequence (EOS) token that the runtime treats as a stop signal. Users can supply additional stop strings that abort generation when matched in the output stream; the OpenAI API allows up to 4 such sequences. A max_tokens parameter caps the worst-case length, and many production deployments add a wall-clock deadline so that a single slow request does not block other traffic. Reasoning models that perform extended internal deliberation, such as OpenAI's o-series or Anthropic's Claude with extended thinking enabled, expose a separate budget on the number of "thinking" tokens the model is allowed to spend before it must commit to a final answer.
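In API terms the controls look like this (a sketch using the OpenAI Python client; the model name is illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o-mini",                   # illustrative model name
    messages=[{"role": "user", "content": "Name three stopping criteria."}],
    max_tokens=200,                        # hard cap on generated tokens
    stop=["\n\n"],                         # custom stop sequence (up to 4 allowed)
)
# finish_reason is "stop" for EOS / stop sequence, "length" for the token cap
print(resp.choices[0].finish_reason)
```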
The most reliable approach is to combine multiple termination conditions and to pick the one that fires first. A typical recipe for supervised learning looks like this: a maximum epoch count high enough that the model reaches its best validation score; a patience-based early stopping rule on a sensible validation metric; a save-best-checkpoint mechanism so the best weights are recoverable; and a wall-clock budget as a final safety net.
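Put together, the recipe fits in a short framework-agnostic loop; train_one_epoch, evaluate, snapshot, and restore are placeholders for whatever the training stack provides:

```python
import time

def train(model, train_one_epoch, evaluate, snapshot, restore,
          max_epochs=200, patience=10, min_delta=1e-4, time_budget_s=8 * 3600):
    best, best_state, bad_epochs = float("-inf"), None, 0
    start, reason = time.monotonic(), "max_epochs"
    for epoch in range(max_epochs):
        train_one_epoch(model)
        score = evaluate(model)                 # held-out validation metric
        if score > best + min_delta:
            best, best_state, bad_epochs = score, snapshot(model), 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            reason = "patience"
            break
        if time.monotonic() - start > time_budget_s:
            reason = "time_budget"
            break
    if best_state is not None:
        restore(model, best_state)              # recover the best checkpoint
    return best, reason                         # record which condition fired
```

Returning the reason string makes the logging advice below automatic rather than an afterthought.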
A few mistakes show up over and over. Stopping too early produces an underfit model and is often caused by a patience value that does not account for noise in the validation curve. Stopping too late wastes compute and risks overfitting; this is the failure mode when only max_iter is used without any improvement check. Choosing the wrong validation metric is subtle: optimizing F1 while monitoring loss can stop training at the wrong epoch when the two are not perfectly aligned. Reusing the validation set for both early stopping and final reporting inflates the reported score, so a separate held-out test set should be evaluated only after training has stopped.
For RL, the single largest source of bugs is conflating termination and truncation. Always honor the distinction in the Bellman update. For LLM inference, forgetting to set max_tokens (or setting it too high) is the standard way to discover that the model can talk forever. For MCMC, declaring convergence based on a single chain's apparent stationarity is a classic trap; always run multiple chains and check R-hat across them.
Finally, log the termination condition that actually fired. Knowing whether a run stopped because it converged, ran out of patience, hit the iteration cap, or was interrupted manually is essential for diagnosing what went wrong and for reproducing the experiment later.
When a computer is learning from data, it keeps practicing over and over. A termination condition is just a rule that tells the computer when to stop practicing. Maybe it stops after a set number of practice rounds, or when its scores on a quiz stop getting better, or when a timer runs out. Picking a good rule matters: if the computer stops too soon, it will not have learned enough; if it keeps practicing forever, it will start memorizing the practice questions instead of understanding the topic, and it will also burn through a lot of electricity for no reason.