Early stopping is a regularization technique used in iterative machine learning algorithms to prevent overfitting. It works by monitoring the model's performance on a held-out validation set during training and halting the optimization process when the validation performance stops improving, even if the training loss continues to decrease. Early stopping is one of the most widely used regularization methods in deep learning, gradient boosting, and other iterative learning frameworks because of its simplicity and effectiveness.
Imagine you are practicing spelling words for a test. At first, you keep getting better and better. But after a while, you start memorizing the practice list so well that you only know those exact words and forget how to spell new ones. Early stopping is like having a parent quiz you on different words every so often. When you start getting those quiz words wrong even though you are still acing the practice list, your parent says, "Okay, that's enough practice." That way, you stop at the point where you are best at spelling all kinds of words, not just the ones you practiced.
The core procedure for early stopping involves the following steps:

1. Split the available data into a training set and a held-out validation set.
2. After each training epoch (or each fixed number of iterations), evaluate the model on the validation set.
3. Record the best validation metric observed so far, optionally saving the corresponding model weights.
4. If the metric has not improved by at least a minimum threshold for a set number of consecutive evaluations, halt training.
5. Optionally restore the weights from the best-performing epoch.
This process is illustrated by the typical training curve where training loss decreases monotonically while validation loss initially decreases and then starts to rise. The gap between the two curves indicates overfitting, and the optimal stopping point lies near the minimum of the validation loss curve.
Early stopping behavior is controlled by several hyperparameters. The table below summarizes the most important ones.
| Parameter | Description | Typical values |
|---|---|---|
| monitor | The metric to track on the validation set (e.g., validation loss, validation accuracy) | val_loss, val_accuracy |
| patience | Number of epochs with no improvement before training is stopped | 5 to 20 |
| min_delta | Minimum change in the monitored metric to qualify as an improvement | 0.0001 to 0.001 |
| mode | Whether the monitored metric should be minimized (e.g., loss) or maximized (e.g., accuracy) | min, max, auto |
| restore_best_weights | Whether to reload model weights from the epoch with the best validation metric after stopping | True or False |
| baseline | An absolute threshold the monitored metric must beat before patience counting begins | Problem-dependent |
| start_from_epoch | Number of initial epochs to skip before monitoring begins (useful with learning rate warmup) | 0 to 10 |
Choosing appropriate values for these parameters requires some experimentation. A patience value that is too small may cause premature stopping (especially on noisy validation curves), while a patience value that is too large may negate the computational savings of early stopping.
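The interplay of patience and min_delta can be sketched as a small self-contained loop. The function and the simulated validation curve below are illustrative assumptions, not any framework's API:

```python
# Minimal sketch of a generic early-stopping rule with patience and
# min_delta. Validation losses are simulated so the example runs standalone.

def early_stopping_loop(val_losses, patience=3, min_delta=0.001):
    """Return (best epoch, epoch at which training stopped)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last sufficient improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:   # improvement large enough?
            best_loss = loss
            best_epoch = epoch
            wait = 0
        else:
            wait += 1
            if wait >= patience:           # out of patience: stop
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Simulated validation curve: improves, then starts to overfit.
losses = [0.90, 0.70, 0.55, 0.50, 0.48, 0.49, 0.51, 0.55, 0.60]
best, stopped = early_stopping_loop(losses, patience=3)
print(best, stopped)  # best epoch 4 (loss 0.48), stopped at epoch 7
```

Note that with a larger min_delta, the small improvements late in the curve would not reset the patience counter, and stopping would occur earlier.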
Early stopping is formally classified as a regularization method because it restricts the effective capacity of the model without modifying the loss function or the model architecture. During iterative optimization with gradient descent, the model parameters start from an initial point (often near zero) and gradually move toward a region that minimizes the training loss. If training continues long enough, the model can fit noise in the training data, leading to poor generalization. By stopping the optimization early, the model parameters remain in a region closer to the initialization, which corresponds to a simpler, more regularized solution.
For a simple linear model with a quadratic error function trained using gradient descent, early stopping can be shown to be mathematically equivalent to L2 regularization (weight decay). The argument proceeds as follows.
Consider a model with parameter vector w being optimized with gradient descent at learning rate epsilon on a quadratic loss surface with Hessian matrix H. After tau iterations starting from the origin, the effective parameter vector is constrained in a way that is analogous to adding an L2 penalty. Specifically, the relationship between the number of training iterations tau, the learning rate epsilon, and the L2 regularization coefficient alpha can be approximated as:
1 / alpha ≈ tau × epsilon, or equivalently, alpha ≈ 1 / (tau × epsilon)
This means that allowing fewer iterations (smaller tau) is equivalent to using a larger regularization coefficient (larger alpha), which penalizes large weights more aggressively. Conversely, training for many epochs is equivalent to applying very little regularization.
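This correspondence can be checked numerically for a one-dimensional quadratic loss L(w) = (h/2)(w − w*)². All of the concrete values below (curvature h, minimizer w_star, learning rate eps, step count tau) are illustrative assumptions chosen to stay in the small-step regime where the approximation holds:

```python
# Numerical sanity check of the approximate equivalence between early
# stopping and L2 regularization for a 1-D quadratic loss,
# using the relation alpha ≈ 1 / (tau * eps).

h = 0.1        # curvature (Hessian) of the quadratic loss
w_star = 2.0   # unregularized minimizer
eps = 0.1      # learning rate
tau = 10       # number of gradient steps, starting from w = 0

# Gradient descent from the origin: w_tau = w_star * (1 - (1 - eps*h)**tau)
w_early = w_star * (1 - (1 - eps * h) ** tau)

# Ridge (L2) solution with alpha = 1/(tau*eps): w = h * w_star / (h + alpha)
alpha = 1 / (tau * eps)
w_ridge = h * w_star / (h + alpha)

print(w_early, w_ridge)  # both shrunk well below w_star = 2.0
rel_err = abs(w_early - w_ridge) / w_ridge
assert rel_err < 0.1  # agreement within ~10% in this small-step regime
```

Both solutions are pulled strongly toward the initialization at zero, and they agree closely because eps × h is small; the approximation degrades as the product tau × eps × h grows.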
This equivalence was discussed in detail by Goodfellow, Bengio, and Courville (2016) in Chapter 7 of their Deep Learning textbook. They noted that early stopping has the advantage over explicit weight decay in that it automatically determines the appropriate amount of regularization based on validation performance, rather than requiring a separate hyperparameter search over the penalty coefficient.
For neural networks with nonlinear activation functions, the exact L2 equivalence does not hold, but the qualitative argument remains: early stopping constrains the effective complexity of the model by limiting how far the parameters can move from initialization.
From the perspective of the bias-variance tradeoff, each step of iterative optimization reduces bias (the model fits the training data more closely) but eventually increases variance (the model becomes more sensitive to the specific training sample). Early stopping seeks the point on the training trajectory where the sum of bias and variance is minimized, which corresponds to the best generalization performance.
The idea of stopping training before convergence to improve generalization dates back to early work on neural network training in the late 1980s and early 1990s. Morgan and Bourlard (1990) empirically demonstrated that generalization in feedforward networks can degrade when training runs too long, providing early evidence for the benefits of halting optimization before convergence.
The technique received its most thorough practical treatment in Lutz Prechelt's influential paper "Early Stopping, But When?" (1998), published in Neural Networks: Tricks of the Trade. Prechelt systematically evaluated 14 different automatic stopping criteria across 12 classification and approximation tasks using multi-layer perceptrons. His key finding was that slower (more patient) stopping criteria allowed for modest improvements in generalization (approximately 4% on average) but at a significant computational cost (approximately 4 times longer training). This work provided practical guidance for choosing stopping criteria and remains widely cited.
The theoretical foundations were further strengthened by Goodfellow, Bengio, and Courville (2016), who formally analyzed early stopping as a regularization technique and demonstrated its approximate equivalence to L2 regularization in the linear case. More recent theoretical work by Yao, Rosasco, and Caponnetto (2007) analyzed early stopping in the context of non-parametric regression and spectral regularization methods, placing it alongside Tikhonov regularization and principal component regression.
Early stopping is one of the standard tools in the deep learning practitioner's toolkit. When training deep neural networks, the typical workflow involves:

1. Reserving a validation split, held out from the training data.
2. Training with a generous maximum number of epochs.
3. Monitoring a validation metric after each epoch with an early stopping callback.
4. Restoring the best-performing weights once training halts.
In practice, early stopping is almost always combined with other regularization techniques such as dropout, weight decay, data augmentation, and batch normalization. These methods are complementary: dropout and weight decay constrain the model explicitly, while early stopping constrains it implicitly by limiting training duration.
Early stopping can interact in complex ways with learning rate schedules. When using learning rate warmup (where the learning rate starts small and gradually increases during the first few epochs), the validation loss may initially increase or fluctuate before the model begins learning effectively. If the patience parameter is too small, early stopping may terminate training during this warmup phase before any real learning has occurred. To avoid this, practitioners can use the start_from_epoch parameter to delay the monitoring of the validation metric until warmup is complete.
Similarly, when using cyclical learning rates or cosine annealing schedules, the validation loss may temporarily spike at points where the learning rate increases. In these scenarios, a larger patience value is necessary to allow the optimizer to recover after each learning rate increase.
Gradient boosting methods such as XGBoost, LightGBM, and CatBoost also use early stopping, although the mechanism is slightly different. Instead of stopping gradient descent iterations, early stopping in gradient boosting controls the number of boosting rounds (trees added to the ensemble). If the validation metric does not improve for a specified number of rounds, training is halted and the model retains only the trees up to the best iteration.
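The round-based mechanism can be sketched without any boosting library. The function below simulates the bookkeeping: trees are stand-in strings and the per-round validation errors are invented for illustration:

```python
# Toy sketch of early stopping over boosting rounds: each round appends
# one tree, and the ensemble is truncated to the best validation round.

def boost_with_early_stopping(round_val_errors, stopping_rounds=3):
    """Return (trees kept up to the best round, index of the best round)."""
    trees = []
    best_error = float("inf")
    best_round = -1
    for rnd, err in enumerate(round_val_errors):
        trees.append(f"tree_{rnd}")         # stand-in for a fitted tree
        if err < best_error:
            best_error = err
            best_round = rnd
        elif rnd - best_round >= stopping_rounds:
            break                           # no improvement for N rounds
    return trees[: best_round + 1], best_round

# Simulated per-round validation error: improves, then degrades.
errors = [0.40, 0.30, 0.25, 0.24, 0.26, 0.27, 0.28]
kept, best = boost_with_early_stopping(errors, stopping_rounds=3)
print(len(kept), best)  # 4 trees kept, best round is 3
```

Real frameworks implement the same idea: they keep fitting trees until the patience window is exhausted, then report the best iteration and use only the trees up to that point for prediction.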
The table below compares early stopping parameters across popular gradient boosting frameworks.
| Framework | Parameter name | Description | Default |
|---|---|---|---|
| XGBoost | early_stopping_rounds | Number of rounds without improvement before stopping | None (disabled) |
| LightGBM | early_stopping (callback) | Early stopping via callbacks=[lgb.early_stopping(stopping_rounds=N)] with configurable patience | None (disabled) |
| CatBoost | od_type, od_wait | Overfitting detector type and patience | IncToDec, 20 |
In practice, early stopping in gradient boosting often reduces training time substantially compared to training for a fixed number of rounds, while maintaining equivalent or slightly better predictive performance.
Keras provides a built-in EarlyStopping callback that can be passed to the fit() method:
```python
import keras

callback = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.001,
    restore_best_weights=True,
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=200,
    callbacks=[callback],
)
```
The restore_best_weights=True setting ensures that after training stops, the model uses the weights from the epoch with the lowest validation loss rather than the weights from the final epoch.
PyTorch does not include a built-in early stopping callback in its core library, but PyTorch Lightning provides one:
```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import EarlyStopping

early_stop_callback = EarlyStopping(
    monitor='val_loss',
    patience=10,
    min_delta=0.001,
    mode='min',
)

trainer = Trainer(callbacks=[early_stop_callback])
trainer.fit(model)
```
For users of plain PyTorch without Lightning, early stopping must be implemented manually by tracking the best validation loss and a counter within the training loop.
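Such a manual implementation can be framework-agnostic. The class below is a minimal sketch; the class name `EarlyStopper` and the surrounding loop are illustrative, not part of any PyTorch API:

```python
# Hand-rolled early stopper for a plain training loop: tracks the best
# validation loss and a counter of epochs without sufficient improvement.

class EarlyStopper:
    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0  # epochs without sufficient improvement

    def step(self, val_loss):
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

stopper = EarlyStopper(patience=2, min_delta=0.0)
# In a real loop: for epoch in range(max_epochs): train(); val_loss = validate()
for val_loss in [0.5, 0.4, 0.41, 0.42, 0.39]:
    if stopper.step(val_loss):
        break  # stops after two epochs without improvement
```

In a real training loop one would also checkpoint the model weights whenever `best_loss` improves, so that the best-performing parameters can be restored after stopping.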
XGBoost integrates early stopping directly into the training call:
```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

model = xgb.XGBClassifier(
    n_estimators=1000,
    early_stopping_rounds=10,
)

model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=True,
)

print(f"Best iteration: {model.best_iteration}")
print(f"Best score: {model.best_score}")
```
After early stopping, the model automatically uses the best iteration for predictions.
Early stopping can be viewed as a form of model selection because it implicitly defines a family of models parameterized by the number of training iterations. By choosing the iteration that produces the best validation performance, early stopping selects the model from this family that is expected to generalize best.
This perspective raises an important practical concern: the validation set used for early stopping cannot also be used for unbiased evaluation of the final model. A separate test set, held out from both training and validation, is required to estimate the true generalization performance. When data is limited, nested cross-validation can be used to simultaneously perform early stopping (using an inner validation fold) and model evaluation (using an outer test fold).
Goodfellow, Bengio, and Courville (2016) noted an additional nuance: because early stopping removes a portion of the training data for validation, the final model is trained on fewer examples than available. They proposed two strategies to address this. The first is to retrain the model from scratch on the full dataset for the same number of epochs that were optimal during early stopping. The second is to continue training the early-stopped model on the full dataset and halt when the training loss matches the training loss at the early stopping point.
Despite its wide adoption, early stopping has several notable limitations:

- It requires setting aside a validation set, which reduces the amount of data available for training; this matters most when data is scarce.
- Noisy validation curves can trigger premature stopping unless patience is increased, which in turn erodes the computational savings.
- The patience, min_delta, and monitored metric are themselves hyperparameters that must be chosen, often by experimentation.
- It interacts with learning rate schedules: under warmup or cyclical schedules, the validation metric may fluctuate early in training, requiring start_from_epoch or a sufficiently large patience to accommodate warmup.

Early stopping is often used alongside other techniques, and in some cases one of these alternatives may be preferable.
| Technique | Relationship to early stopping |
|---|---|
| L2 regularization (weight decay) | Explicitly penalizes large weights; approximately equivalent to early stopping for linear models |
| L1 regularization | Encourages sparsity; complementary to early stopping |
| Dropout | Randomly deactivates neurons during training; often used together with early stopping |
| Data augmentation | Increases effective training set size; reduces the need for aggressive early stopping |
| Learning rate scheduling | Controls the optimization trajectory; interacts with early stopping behavior |
| Model checkpointing | Saves the model at regular intervals; can be used independently of early stopping for post-hoc model selection |
| Ensemble methods | Training multiple models and averaging predictions can reduce variance without early stopping |
Early stopping is recommended in the following scenarios:

- Training is iterative and prone to overfitting, as with deep neural networks and gradient boosting.
- A reasonably sized, low-noise validation set is available to monitor.
- Training is computationally expensive, so avoiding unnecessary epochs yields real savings.
- The validation metric reliably reflects the generalization behavior of interest.
Early stopping may be less useful when training data is abundant relative to model complexity, when the training procedure is not iterative (e.g., closed-form solutions), or when the validation metric is too noisy to reliably indicate overfitting.