See also: Machine learning terms, Loss function
A loss curve is a plot that shows the value of a loss function over the course of training a machine learning model. The horizontal axis typically represents training iterations, epochs, or steps, while the vertical axis represents the loss value. By tracking how loss changes during training, practitioners can assess whether a model is learning effectively, diagnose problems such as overfitting or underfitting, and make informed decisions about hyperparameters like the learning rate and model architecture.
Loss curves are one of the most widely used diagnostic tools in deep learning and broader machine learning practice. Nearly every training framework, from TensorFlow to PyTorch, provides built-in support for logging and plotting loss curves.
Imagine you are learning to throw a ball into a bucket. At first, you miss by a lot. But every time you throw, you get a little closer. A loss curve is like a chart that tracks how far your throws land from the bucket each time you practice. When the line on the chart goes down, it means you are getting better. If the line stops going down or starts going back up, it means something might be wrong with how you are practicing, and you need to change your approach.
In supervised learning, a model learns by minimizing a loss function that measures the difference between its predictions and the true labels. For a dataset with N samples, the loss at any point during training can be written as:
L(θ) = (1/N) ∑ ℓ(f(x_i; θ), y_i)
where θ represents the model parameters, f(x_i; θ) is the model's prediction for input x_i, y_i is the true label, and ℓ is the per-sample loss function.
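To make the formula concrete, the sketch below evaluates L(θ) for a toy one-parameter linear model f(x; θ) = θx with squared error as the per-sample loss ℓ; both choices are illustrative assumptions, not prescribed by the definition above:

```python
def mean_loss(theta, xs, ys):
    """Compute L(theta) = (1/N) * sum of per-sample losses over the dataset."""
    def predict(x):
        return theta * x          # toy linear model f(x; theta)
    def sample_loss(y_hat, y):
        return (y_hat - y) ** 2   # squared error as the per-sample loss
    return sum(sample_loss(predict(x), y) for x, y in zip(xs, ys)) / len(xs)

# The data follow y = 2x, so theta = 2 gives zero loss
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(mean_loss(2.0, xs, ys))  # perfect fit -> 0.0
print(mean_loss(1.0, xs, ys))  # underestimates every target -> (1+4+9)/3 ≈ 4.67
```

Plotting `mean_loss` after each optimizer update, rather than for fixed θ values as here, is exactly what produces a training loss curve.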
As gradient descent (or one of its variants) updates the parameters θ at each step, the value of L(θ) changes. Plotting L(θ) as a function of the training step produces the training loss curve. When a separate held-out validation set is used, computing the same loss function on that set at regular intervals produces the validation loss curve.
The choice of loss function affects the shape and scale of the resulting loss curve. Different loss functions are suited to different tasks.
| Loss function | Formula | Typical use case | Curve characteristics |
|---|---|---|---|
| Mean squared error (MSE) | (1/N) ∑(y_i - ŷ_i)² | Regression | Smooth, quadratic decrease; can be slow near optimum |
| Mean absolute error (MAE) | (1/N) ∑\|y_i - ŷ_i\| | Robust regression | Less sensitive to outliers; linear gradient |
| Binary cross-entropy | -(1/N) ∑[y_i log(ŷ_i) + (1-y_i) log(1-ŷ_i)] | Binary classification | Steep initial drop; strong gradients when predictions are wrong |
| Categorical cross-entropy | -(1/N) ∑∑ y_ic log(ŷ_ic) | Multi-class classification | Similar to binary cross-entropy but across C classes |
| Hinge loss | (1/N) ∑ max(0, 1 - y_i ŷ_i) | Support vector machines | Piecewise linear; zero loss for correct margins |
| Huber loss | Quadratic for small errors, linear for large | Regression with outliers | Smooth transition between MSE and MAE behavior |
Cross-entropy loss is generally preferred for classification tasks because it produces stronger gradients when the model makes confident but incorrect predictions, resulting in faster convergence compared to MSE on the same task.
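This gradient argument can be verified numerically. Assuming a single sigmoid output unit (an illustrative setup), the gradient of binary cross-entropy with respect to the pre-activation z simplifies to ŷ − y, while the gradient of squared error picks up an extra ŷ(1 − ŷ) factor that vanishes precisely when the model is confidently wrong:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad(z, y):
    # d/dz of binary cross-entropy through a sigmoid simplifies to (y_hat - y)
    return sigmoid(z) - y

def mse_grad(z, y):
    # d/dz of squared error through a sigmoid keeps the sigmoid derivative factor
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Confident but wrong: true label is 0, pre-activation is strongly positive
z, y = 6.0, 0.0
print(bce_grad(z, y))   # close to 1: strong learning signal
print(mse_grad(z, y))   # close to 0: vanishing gradient, slow learning
```

The near-zero MSE gradient in this regime is why its loss curve tends to descend more slowly than cross-entropy's on classification tasks.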
In practice, multiple loss curves are plotted simultaneously to provide a complete picture of model behavior.
The training loss curve shows the loss computed on the training set at each step or epoch. In a well-configured setup, this curve trends steadily downward, with some step-to-step noise, as the optimizer iteratively reduces the loss. A flat or increasing training loss curve suggests that the model is unable to learn from the data, which may indicate an excessively low learning rate, a bug in the data pipeline, or an inappropriate model architecture.
The validation loss curve is computed on a held-out validation set that the model does not use for parameter updates. It serves as a proxy for how well the model will perform on unseen data (generalization). The validation loss typically decreases alongside the training loss early in training but may diverge later if the model begins memorizing the training data rather than learning generalizable patterns.
After training is complete, loss can be evaluated on a separate test set. While this is not plotted as a curve over time, the test loss serves as the final measure of model performance. The test set should never be used during training or hyperparameter selection.
The relationship between the training and validation loss curves reveals important information about model behavior. The following patterns are the most commonly observed.
A well-fitting model produces training and validation loss curves that both decrease steadily and converge to similar low values. The gap between the two curves remains small throughout training. Both curves eventually flatten out, indicating that the model has learned the underlying patterns in the data and further training would provide diminishing returns.
Overfitting occurs when the training loss continues to decrease while the validation loss stops improving or begins to increase. The growing gap between the two curves indicates that the model is memorizing the training data rather than learning generalizable patterns. This is one of the most common problems identified through loss curve analysis.
Signs of overfitting in loss curves:
- The training loss continues to decrease while the validation loss plateaus or rises
- The gap between the training and validation curves grows over time
- The validation loss reaches a minimum and then trends upward for the rest of training
Underfitting occurs when both the training and validation loss remain high throughout training. The model lacks sufficient capacity or has not been trained long enough to capture the patterns in the data. Another form of underfitting appears when both curves are still decreasing at the end of training, suggesting that additional epochs could improve performance.
Signs of underfitting in loss curves:
- Both the training and validation loss remain high throughout training
- Both curves are still decreasing when training ends, indicating the model has not converged
- The gap between the curves stays small even though the loss values are poor
| Pattern | Training loss | Validation loss | Gap between curves | Diagnosis | Recommended action |
|---|---|---|---|---|---|
| Good fit | Decreases, stabilizes | Decreases, stabilizes | Small and stable | Model is well-tuned | None needed |
| Overfitting | Continues decreasing | Plateaus or increases | Large and growing | Model memorizes training data | Add regularization, get more data, reduce model complexity |
| Underfitting (capacity) | Remains high | Remains high | Small | Model too simple | Increase model complexity, add features |
| Underfitting (duration) | Still decreasing | Still decreasing | Small | Training stopped too early | Train for more epochs |
| High variance | Noisy, erratic | Noisy, erratic | Variable | Unstable training dynamics | Reduce learning rate, increase batch size |
| Divergence | Increases | Increases | N/A | Training has failed | Check learning rate, data, gradients |
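The first rows of the table can be turned into a rough programmatic check. The heuristic below is a deliberate simplification: it looks only at the last two points of each curve, and the `high_loss` threshold is an arbitrary illustrative value (real diagnosis should inspect smoothed curves over many epochs):

```python
def diagnose(train_losses, val_losses, high_loss=1.0):
    """Rough diagnosis from the tails of the training and validation loss curves."""
    t_trend = train_losses[-1] - train_losses[-2]  # negative means still improving
    v_trend = val_losses[-1] - val_losses[-2]
    if t_trend < 0 and v_trend > 0:
        return "overfitting"     # training improves while validation worsens
    if train_losses[-1] > high_loss and val_losses[-1] > high_loss:
        return "underfitting"    # both curves stuck at high loss
    return "ok"

print(diagnose([0.9, 0.5, 0.3, 0.2], [0.9, 0.6, 0.7, 0.8]))  # "overfitting"
print(diagnose([3.0, 2.8, 2.7, 2.6], [3.1, 2.9, 2.8, 2.7]))  # "underfitting"
```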
Several hyperparameters and design choices influence the shape and behavior of loss curves.
The learning rate is the single most influential hyperparameter on the loss curve. It controls the step size that gradient descent takes when updating model parameters.
The batch size determines how many training examples are used to compute each gradient update.
More complex models (more layers, more parameters) have greater capacity to fit the training data. This typically results in lower training loss but may increase the risk of overfitting, which manifests as diverging training and validation loss curves.
Regularization techniques constrain the model to prevent overfitting and directly affect the loss curve.
| Regularization technique | Effect on loss curve |
|---|---|
| L1 regularization | Adds a penalty proportional to the absolute value of weights; may slightly increase training loss but helps validation loss |
| L2 regularization (weight decay) | Adds a penalty proportional to the square of weights; smooths the loss curve and reduces overfitting |
| Dropout | Randomly deactivates neurons during training; training loss may appear higher because the effective model is smaller, but validation loss improves |
| Batch normalization | Normalizes layer inputs; stabilizes and smooths the loss curve, often allowing higher learning rates |
| Early stopping | Halts training when validation loss stops improving; prevents the overfitting phase of the loss curve |
| Data augmentation | Increases effective training set size; can reduce the gap between training and validation curves |
The initial values of model parameters affect the starting point of the loss curve. Poor initialization can lead to extremely high initial loss values, slow convergence, or training instability. Modern initialization methods such as Xavier (Glorot) initialization and He initialization are designed to keep activations and gradients at reasonable scales, leading to smoother initial loss curves.
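Both schemes can be sketched in a few lines, assuming their Gaussian variants (uniform variants also exist); the helpers below are illustrative stand-alone versions, not taken from any framework:

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """He initialization: weights drawn with std = sqrt(2 / fan_in), suited to ReLU nets."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_init(fan_in, fan_out, seed=0):
    """Xavier (Glorot) initialization: std = sqrt(2 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = he_init(256, 128)   # a 256-to-128 layer: 256 rows of 128 weights
```

Scaling the standard deviation by the layer's fan-in (and fan-out, for Xavier) is what keeps activation magnitudes roughly constant from layer to layer, avoiding the huge initial loss values mentioned above.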
Poorly preprocessed data, mislabeled examples, or data that has not been properly shuffled can cause erratic loss curves. For instance, if training examples are ordered by class (e.g., all dog images followed by all cat images), the loss curve may oscillate as the model alternately learns and forgets different classes. Proper random shuffling of training data at each epoch eliminates this issue.
A learning rate schedule adjusts the learning rate during training rather than keeping it fixed. Different schedules produce distinct loss curve shapes.
The simplest approach is to use a fixed learning rate throughout training. The loss curve typically shows rapid initial decrease followed by a plateau. This approach can work well for simple problems but often fails to achieve the best possible performance on complex tasks.
With step decay, the learning rate is reduced by a fixed factor at predetermined intervals (e.g., multiplied by 0.1 every 30 epochs). The loss curve shows distinct "staircase" drops at each reduction point, as the smaller learning rate allows the optimizer to settle into finer-grained regions of the loss surface.
With exponential decay, the learning rate decreases over time according to lr_t = lr_0 * e^(-kt), where k is a decay constant. This produces a smoothly decelerating loss curve.
With cosine annealing, the learning rate follows a cosine curve from its initial value down to near zero. This schedule is widely used in transformer training. The loss curve typically shows steady improvement with a particularly smooth final convergence phase. A variant called cosine annealing with warm restarts periodically resets the learning rate, producing a loss curve with multiple descent phases.
Many modern training pipelines begin with a warmup phase in which the learning rate starts very small and linearly increases to its target value over a set number of steps. Warmup is especially common in transformer training because it prevents early instability caused by large, poorly conditioned gradients before the model parameters have moved away from their initial values. The loss curve during warmup may appear relatively flat or even slightly noisy before the main descent begins.
The warmup-stable-decay (WSD) schedule, used in large language model pretraining, consists of three phases: a warmup phase with linearly increasing learning rate, a stable phase with a constant learning rate, and a decay phase (often cosine) at the end. The loss curve during the stable phase remains relatively flat, with a sharp drop during the final decay phase.
| Schedule | Loss curve shape | Common use cases | Advantages |
|---|---|---|---|
| Constant | Rapid initial drop, then plateau | Simple models, baselines | Simple to implement |
| Step decay | Staircase drops at fixed intervals | CNNs, image classification | Clear improvement at each step |
| Exponential decay | Smooth, decelerating descent | General purpose | No sudden changes |
| Cosine annealing | Smooth S-shaped descent | Transformers, language models | Good final convergence |
| Warmup + cosine decay | Flat start, then smooth descent | Large transformers, LLM pretraining | Prevents early instability |
| Cyclical learning rate | Periodic oscillations with overall descent | Exploration of loss surface | Can escape local minima |
| ReduceLROnPlateau | Drops when loss plateaus | Adaptive training | Responds to actual training dynamics |
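The schedules in the table above can be written as plain functions of the step count. The sketch below implements step decay, exponential decay, and linear warmup followed by cosine decay; the parameter names and default values are illustrative choices, not drawn from any particular framework:

```python
import math

def step_decay(lr0, step, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs (staircase shape)."""
    return lr0 * drop ** (step // every)

def exponential_decay(lr0, step, k=0.05):
    """lr_t = lr_0 * e^(-k t): smooth, decelerating decrease."""
    return lr0 * math.exp(-k * step)

def warmup_cosine(lr0, step, warmup_steps=100, total_steps=1000):
    """Linear warmup to lr0, then cosine decay to zero."""
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr0 * 0.5 * (1 + math.cos(math.pi * progress))

print(step_decay(0.1, 65))       # two drops have occurred: ≈ 0.001
print(warmup_cosine(0.1, 50))    # halfway through warmup: ≈ 0.05
print(warmup_cosine(0.1, 1000))  # end of schedule: 0.0
```

Calling one of these functions with the current step before each optimizer update is all a scheduler does; frameworks such as PyTorch package the same logic as `torch.optim.lr_scheduler` classes.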
The loss curve is a one-dimensional projection of the model's trajectory through a high-dimensional loss surface. Understanding the loss surface helps explain certain loss curve behaviors.
A local minimum is a point on the loss surface where the loss is lower than at all immediately surrounding points but typically higher than at the global minimum. If gradient descent converges to a poor local minimum, the loss curve flattens out at a suboptimal value. In practice, research has shown that most local minima in high-dimensional neural networks have loss values close to the global minimum, so local minima are less of a concern than once believed.
Saddle points are locations where the gradient is zero but the point is a minimum along some directions and a maximum along others. In high-dimensional parameter spaces, saddle points are exponentially more common than local minima. When the optimizer reaches a saddle point, the loss curve may show a prolonged plateau before the model escapes. Optimizers that incorporate momentum or adaptive step sizes (such as Adam) tend to escape saddle points more quickly than plain stochastic gradient descent.
A plateau on the loss curve appears as a flat region where the loss does not change significantly over many steps. Plateaus can occur because of saddle points, because the learning rate is too small, or because the model's current representation has reached a temporary capacity limit. Plateaus are sometimes followed by sudden drops in loss, a pattern referred to as "grokking" in the context of algorithmic learning tasks.
The double descent phenomenon, named by Belkin et al. (2019) and documented at scale in deep networks by researchers at OpenAI (Nakkiran et al., 2019), describes a situation where the test loss follows a non-monotonic pattern as model complexity increases. The test loss initially decreases (classical bias-variance tradeoff behavior), then increases near the interpolation threshold (where the model has just enough capacity to fit the training data perfectly), and then decreases again as the model grows larger still.
Double descent can also appear along the epoch axis (epoch-wise double descent). For a model of fixed size, the test loss may decrease, then increase, and then decrease again as training continues for many epochs. This phenomenon is observed in various architectures, including convolutional neural networks, ResNets, and transformers, particularly when early stopping and regularization are not used.
During the pretraining of large language models, sudden and dramatic increases in loss (loss spikes) can occur. These spikes are caused by gradient explosions, where the norm of the gradient suddenly grows to very large values. Loss spikes can degrade final model performance or, in severe cases, cause training to diverge entirely.
Causes of loss spikes include:
- Abnormal or low-quality data batches that produce unusually large gradients
- Numerical overflow or underflow, particularly in mixed-precision training
- A learning rate that is too high for the current region of the loss surface
Common mitigation strategies include gradient clipping (limiting the gradient norm to a fixed threshold such as 1.0), using mixed-precision training carefully, and restarting from a recent checkpoint while skipping the problematic data batch (a technique used during the training of Google's PaLM model).
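Gradient clipping by global norm, the first mitigation listed above, can be sketched in plain Python. Deep learning frameworks provide built-in equivalents (for example `torch.nn.utils.clip_grad_norm_` in PyTorch); the stand-alone version below is for illustration only:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads  # already within the threshold: leave unchanged

exploding = [30.0, 40.0]           # norm 50, far above the threshold of 1.0
clipped = clip_by_norm(exploding)  # rescaled so the norm is exactly 1.0
print(clipped)                     # [0.6, 0.8]
```

Because clipping rescales rather than zeroes the gradient, the update direction is preserved while the spike's magnitude is capped.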
Research on scaling laws has revealed that the final training loss of neural language models follows a power law relationship with model size, dataset size, and compute budget. The Chinchilla scaling laws, published by DeepMind in 2022, showed that for a given compute budget, the number of model parameters and training tokens should scale proportionally. These findings mean that loss curves for large models are, to some degree, predictable before training begins, which has significant implications for resource allocation in large-scale training.
Stochastic training methods produce noisy loss curves because the gradient is estimated from a random subset of data at each step. Several techniques are used to make loss curves easier to interpret.
The most common smoothing technique applies an exponential moving average to the raw loss values:
L_smooth(t) = β * L_smooth(t-1) + (1 - β) * L(t)
where β is a smoothing factor between 0 and 1 (typically 0.9 to 0.99). Higher values of β produce smoother curves but introduce more lag. Most visualization tools, including TensorBoard, offer built-in EMA smoothing controls.
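This update rule translates directly to code. The sketch below initializes the smoothed series with the first raw value, which is one common convention; visualization tools may instead apply bias correction to the early values:

```python
def ema_smooth(losses, beta=0.9):
    """Exponential moving average: L_smooth(t) = beta * L_smooth(t-1) + (1 - beta) * L(t)."""
    smoothed = []
    for t, loss in enumerate(losses):
        if t == 0:
            smoothed.append(loss)  # initialize with the first raw value
        else:
            smoothed.append(beta * smoothed[-1] + (1 - beta) * loss)
    return smoothed

# A noisy loss sequence becomes a gentler, lagged curve
raw = [1.0, 0.2, 0.9, 0.1, 0.8]
print(ema_smooth(raw, beta=0.9))
```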
A simple moving average computes the mean loss over the last k steps. This approach is straightforward but gives equal weight to all points in the window, which can obscure recent trends.
Reducing the frequency at which loss values are logged (e.g., logging every 100 steps instead of every step) effectively downsamples the curve and reduces visual noise. This is particularly useful during long training runs with millions of steps.
Several tools and libraries are commonly used to visualize loss curves during model training.
| Tool | Framework integration | Features | Cost |
|---|---|---|---|
| TensorBoard | TensorFlow, PyTorch, JAX | Real-time plots, smoothing, comparison of runs, histograms | Free, open source |
| Weights & Biases (W&B) | Framework-agnostic | Interactive dashboards, team collaboration, hyperparameter sweeps, automatic system metrics | Free tier available; paid plans for teams |
| MLflow | Framework-agnostic | Experiment tracking, model registry, metric comparison | Free, open source |
| Neptune.ai | Framework-agnostic | Real-time monitoring, collaboration, metadata management | Free tier available; paid plans |
| Matplotlib / Seaborn | Python | Custom static plots, full control over appearance | Free, open source |
| Comet ML | Framework-agnostic | Real-time tracking, code versioning, hyperparameter optimization | Free tier available; paid plans |
A basic loss curve can be plotted using Matplotlib by recording the loss value at each epoch during training:
```python
import matplotlib.pyplot as plt

# Assuming train_losses and val_losses are lists recorded during training
plt.plot(train_losses, label='Training loss')
plt.plot(val_losses, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.legend()
plt.show()
```
In TensorBoard, loss curves are logged automatically when using the appropriate callback or summary writer. For example, in PyTorch:
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/validation', val_loss, epoch)
writer.close()
```
Early stopping is a regularization technique that monitors the validation loss curve and halts training when the validation loss has not improved for a specified number of epochs, known as the patience parameter.
The patience parameter controls how tolerant the early stopping mechanism is of temporary increases in validation loss.
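A minimal patience-based monitor can be sketched as follows; the `EarlyStopping` class below and its parameter names are illustrative (frameworks such as Keras ship a callback of the same name):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta   # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return True if training should stop after this epoch."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss     # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1     # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67]  # validation loss per epoch
for epoch, val_loss in enumerate(history):
    if stopper.step(val_loss):
        print(f"stopping at epoch {epoch}")  # fires once patience is exhausted
        break
```

A larger `patience` tolerates longer plateaus and temporary upticks in validation loss before halting, at the cost of extra training time spent in the potential overfitting phase.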
Always plot both training and validation loss. A training loss curve alone is insufficient because it does not reveal overfitting.
Use a logarithmic scale for the y-axis when loss values span several orders of magnitude. This makes it easier to see improvements in later stages of training.
Compare multiple runs. When tuning hyperparameters, overlay loss curves from different configurations to identify which settings produce the best convergence.
Watch for sudden spikes. A sudden increase in loss often indicates a data issue (corrupted batch, mislabeled examples) or a numerical problem (exploding gradients).
Do not rely solely on the final loss value. The shape of the entire curve provides more information than any single number. A model that converges smoothly to a loss of 0.3 is often preferable to one that oscillates between 0.1 and 0.5.
Apply smoothing judiciously. Over-smoothing can hide important details like loss spikes or sudden changes in training dynamics.
Monitor additional metrics alongside loss. Accuracy, F1 score, and other task-specific metrics can provide complementary information. A decreasing loss does not always correspond to improving task performance, especially when the loss function is a poor proxy for the actual objective.
Be aware of the bias-variance tradeoff. The gap between training and validation curves directly reflects this tradeoff. A large gap indicates high variance (overfitting), while both curves being high indicates high bias (underfitting).
In image classification with convolutional neural networks, loss curves typically show rapid initial improvement followed by gradual refinement. Data augmentation is commonly used to reduce the gap between training and validation curves. Architectures like ResNet and EfficientNet often converge within 90 to 300 epochs, with step decay learning rate schedules producing characteristic staircase-shaped loss curves.
In natural language processing, transformer-based models are typically trained with warmup followed by cosine or linear decay. Loss curves for language model pretraining can span hundreds of thousands or millions of steps. The loss tends to decrease following a power law, consistent with the neural scaling laws observed by Kaplan et al. (2020) and Hoffmann et al. (2022).
Loss curves in reinforcement learning tend to be much noisier than in supervised learning because the training data distribution changes as the agent's policy improves. It is common to plot both the loss and the cumulative reward curve, as the loss alone may not be informative about the agent's actual performance.
The concept of plotting learning progress over time dates back to early work in mathematical psychology and animal learning research in the late 19th century. Hermann Ebbinghaus published his "forgetting curve" in 1885, which plotted memory retention over time. The idea of using similar plots to track the performance of computational learning algorithms was adopted as machine learning emerged as a field in the mid-20th century.
The term "learning curve" has been used in machine learning since at least the 1990s, with early analyses appearing in the work of Cortes et al. (1993) and others who studied how model error changes as a function of training set size. The modern usage of "loss curve" as a real-time diagnostic tool during neural network training became standard practice with the advent of deep learning frameworks and visualization tools like TensorBoard (released as part of TensorFlow in 2015).