See also: Machine learning terms, Loss function
A loss curve is a plot that shows the value of a loss function over the course of training a machine learning model. The horizontal axis typically represents training iterations, epochs, or steps, while the vertical axis represents the loss value. By tracking how loss changes during training, practitioners can assess whether a model is learning effectively, diagnose problems such as overfitting or underfitting, and make informed decisions about hyperparameters like the learning rate and model architecture.
Loss curves are one of the most widely used diagnostic tools in deep learning and broader machine learning practice. Nearly every training framework, from TensorFlow to PyTorch, provides built-in support for logging and plotting loss curves.
Imagine you are learning to throw a ball into a bucket. At first, you miss by a lot. But every time you throw, you get a little closer. A loss curve is like a chart that tracks how far your throws land from the bucket each time you practice. When the line on the chart goes down, it means you are getting better. If the line stops going down or starts going back up, it means something might be wrong with how you are practicing, and you need to change your approach.
In supervised learning, a model learns by minimizing a loss function that measures the difference between its predictions and the true labels. For a dataset with N samples, the loss at any point during training can be written as:
L(θ) = (1/N) ∑ ℓ(f(x_i; θ), y_i)
where θ represents the model parameters, f(x_i; θ) is the model's prediction for input x_i, y_i is the true label, and ℓ is the per-sample loss function.
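To make the formula concrete, the sketch below evaluates L(θ) for a toy one-parameter linear model f(x; θ) = θx with squared error as the per-sample loss ℓ; both choices are illustrative assumptions, not prescribed by the definition above:

```python
def mean_loss(theta, xs, ys):
    """Compute L(theta) = (1/N) * sum of per-sample losses over the dataset."""
    def predict(x):
        return theta * x          # toy linear model f(x; theta)
    def sample_loss(y_hat, y):
        return (y_hat - y) ** 2   # squared error as the per-sample loss
    return sum(sample_loss(predict(x), y) for x, y in zip(xs, ys)) / len(xs)

# The data follow y = 2x, so theta = 2 gives zero loss
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(mean_loss(2.0, xs, ys))  # perfect fit -> 0.0
print(mean_loss(1.0, xs, ys))  # underestimates every target -> (1+4+9)/3 ≈ 4.67
```

Plotting `mean_loss` after each optimizer update, rather than for fixed θ values as here, is exactly what produces a training loss curve.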
As gradient descent (or one of its variants) updates the parameters θ at each step, the value of L(θ) changes. Plotting L(θ) as a function of the training step produces the training loss curve. When a separate held-out validation set is used, computing the same loss function on that set at regular intervals produces the validation loss curve.
The choice of loss function affects the shape and scale of the resulting loss curve. Different loss functions are suited to different tasks.
| Loss function | Formula | Typical use case | Curve characteristics |
|---|---|---|---|
| Mean squared error (MSE) | (1/N) ∑(y_i - ŷ_i)² | Regression | Smooth, quadratic decrease; can be slow near optimum |
| Mean absolute error (MAE) | (1/N) ∑\|y_i - ŷ_i\| | Robust regression | Less sensitive to outliers; linear gradient |
| Binary cross-entropy | -(1/N) ∑[y_i log(ŷ_i) + (1-y_i) log(1-ŷ_i)] | Binary classification | Steep initial drop; strong gradients when predictions are wrong |
| Categorical cross-entropy | -(1/N) ∑∑ y_ic log(ŷ_ic) | Multi-class classification | Similar to binary cross-entropy but across C classes |
| Hinge loss | (1/N) ∑ max(0, 1 - y_i ŷ_i) | Support vector machines | Piecewise linear; zero loss for correct margins |
| Huber loss | Quadratic for small errors, linear for large | Regression with outliers | Smooth transition between MSE and MAE behavior |
Cross-entropy loss is generally preferred for classification tasks because it produces stronger gradients when the model makes confident but incorrect predictions, resulting in faster convergence compared to MSE on the same task.
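This gradient argument can be verified numerically. Assuming a single sigmoid output unit (an illustrative setup), the gradient of binary cross-entropy with respect to the pre-activation z simplifies to ŷ − y, while the gradient of squared error picks up an extra ŷ(1 − ŷ) factor that vanishes precisely when the model is confidently wrong:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad(z, y):
    # d/dz of binary cross-entropy through a sigmoid simplifies to (y_hat - y)
    return sigmoid(z) - y

def mse_grad(z, y):
    # d/dz of squared error through a sigmoid keeps the sigmoid derivative factor
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)

# Confident but wrong: true label is 0, pre-activation is strongly positive
z, y = 6.0, 0.0
print(bce_grad(z, y))   # close to 1: strong learning signal
print(mse_grad(z, y))   # close to 0: vanishing gradient, slow learning
```

The near-zero MSE gradient in this regime is why its loss curve tends to descend more slowly than cross-entropy's on classification tasks.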
In practice, multiple loss curves are plotted simultaneously to provide a complete picture of model behavior.
The training loss curve shows the loss computed on the training set at each step or epoch. In a well-configured setup, this curve trends steadily downward, with some step-to-step noise, as the optimizer iteratively reduces the loss. A flat or increasing training loss curve suggests that the model is unable to learn from the data, which may indicate an excessively low learning rate, a bug in the data pipeline, or an inappropriate model architecture.
The validation loss curve is computed on a held-out validation set that the model does not use for parameter updates. It serves as a proxy for how well the model will perform on unseen data (generalization). The validation loss typically decreases alongside the training loss early in training but may diverge later if the model begins memorizing the training data rather than learning generalizable patterns.
After training is complete, loss can be evaluated on a separate test set. While this is not plotted as a curve over time, the test loss serves as the final measure of model performance. The test set should never be used during training or hyperparameter selection.
The relationship between the training and validation loss curves reveals important information about model behavior. The following patterns are the most commonly observed.
A well-fitting model produces training and validation loss curves that both decrease steadily and converge to similar low values. The gap between the two curves remains small throughout training. Both curves eventually flatten out, indicating that the model has learned the underlying patterns in the data and further training would provide diminishing returns.
Overfitting occurs when the training loss continues to decrease while the validation loss stops improving or begins to increase. The growing gap between the two curves indicates that the model is memorizing the training data rather than learning generalizable patterns. This is one of the most common problems identified through loss curve analysis.
Signs of overfitting in loss curves:
- The training loss continues to decrease while the validation loss plateaus or rises
- The gap between the training and validation curves grows over time
- The validation loss reaches a minimum and then trends upward for the rest of training
Underfitting occurs when both the training and validation loss remain high throughout training. The model lacks sufficient capacity or has not been trained long enough to capture the patterns in the data. Another form of underfitting appears when both curves are still decreasing at the end of training, suggesting that additional epochs could improve performance.
Signs of underfitting in loss curves:
- Both the training and validation loss remain high throughout training
- Both curves are still decreasing when training ends, indicating the model has not converged
- The gap between the curves stays small even though the loss values are poor
| Pattern | Training loss | Validation loss | Gap between curves | Diagnosis | Recommended action |
|---|---|---|---|---|---|
| Good fit | Decreases, stabilizes | Decreases, stabilizes | Small and stable | Model is well-tuned | None needed |
| Overfitting | Continues decreasing | Plateaus or increases | Large and growing | Model memorizes training data | Add regularization, get more data, reduce model complexity |
| Underfitting (capacity) | Remains high | Remains high | Small | Model too simple | Increase model complexity, add features |
| Underfitting (duration) | Still decreasing | Still decreasing | Small | Training stopped too early | Train for more epochs |
| High variance | Noisy, erratic | Noisy, erratic | Variable | Unstable training dynamics | Reduce learning rate, increase batch size |
| Divergence | Increases | Increases | N/A | Training has failed | Check learning rate, data, gradients |
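The first rows of the table can be turned into a rough programmatic check. The heuristic below is a deliberate simplification: it looks only at the last two points of each curve, and the `high_loss` threshold is an arbitrary illustrative value (real diagnosis should inspect smoothed curves over many epochs):

```python
def diagnose(train_losses, val_losses, high_loss=1.0):
    """Rough diagnosis from the tails of the training and validation loss curves."""
    t_trend = train_losses[-1] - train_losses[-2]  # negative means still improving
    v_trend = val_losses[-1] - val_losses[-2]
    if t_trend < 0 and v_trend > 0:
        return "overfitting"     # training improves while validation worsens
    if train_losses[-1] > high_loss and val_losses[-1] > high_loss:
        return "underfitting"    # both curves stuck at high loss
    return "ok"

print(diagnose([0.9, 0.5, 0.3, 0.2], [0.9, 0.6, 0.7, 0.8]))  # "overfitting"
print(diagnose([3.0, 2.8, 2.7, 2.6], [3.1, 2.9, 2.8, 2.7]))  # "underfitting"
```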
Several hyperparameters and design choices influence the shape and behavior of loss curves.
The learning rate is the single most influential hyperparameter on the loss curve. It controls the step size that gradient descent takes when updating model parameters.
The batch size determines how many training examples are used to compute each gradient update.
More complex models (more layers, more parameters) have greater capacity to fit the training data. This typically results in lower training loss but may increase the risk of overfitting, which manifests as diverging training and validation loss curves.
Regularization techniques constrain the model to prevent overfitting and directly affect the loss curve.
| Regularization technique | Effect on loss curve |
|---|---|
| L1 regularization | Adds a penalty proportional to the absolute value of weights; may slightly increase training loss but helps validation loss |
| L2 regularization (weight decay) | Adds a penalty proportional to the square of weights; smooths the loss curve and reduces overfitting |
| Dropout | Randomly deactivates neurons during training; training loss may appear higher because the effective model is smaller, but validation loss improves |
| Batch normalization | Normalizes layer inputs; stabilizes and smooths the loss curve, often allowing higher learning rates |
| Early stopping | Halts training when validation loss stops improving; prevents the overfitting phase of the loss curve |
| Data augmentation | Increases effective training set size; can reduce the gap between training and validation curves |
The initial values of model parameters affect the starting point of the loss curve. Poor initialization can lead to extremely high initial loss values, slow convergence, or training instability. Modern initialization methods such as Xavier (Glorot) initialization and He initialization are designed to keep activations and gradients at reasonable scales, leading to smoother initial loss curves.
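Both schemes can be sketched in a few lines, assuming their Gaussian variants (uniform variants also exist); the helpers below are illustrative stand-alone versions, not taken from any framework:

```python
import math
import random

def he_init(fan_in, fan_out, seed=0):
    """He initialization: weights drawn with std = sqrt(2 / fan_in), suited to ReLU nets."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

def xavier_init(fan_in, fan_out, seed=0):
    """Xavier (Glorot) initialization: std = sqrt(2 / (fan_in + fan_out))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]

weights = he_init(256, 128)   # a 256-to-128 layer: 256 rows of 128 weights
```

Scaling the standard deviation by the layer's fan-in (and fan-out, for Xavier) is what keeps activation magnitudes roughly constant from layer to layer, avoiding the huge initial loss values mentioned above.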
Poorly preprocessed data, mislabeled examples, or data that has not been properly shuffled can cause erratic loss curves. For instance, if training examples are ordered by class (e.g., all dog images followed by all cat images), the loss curve may oscillate as the model alternately learns and forgets different classes. Proper random shuffling of training data at each epoch eliminates this issue.
A learning rate schedule adjusts the learning rate during training rather than keeping it fixed. Different schedules produce distinct loss curve shapes.
The simplest approach is to use a fixed learning rate throughout training. The loss curve typically shows rapid initial decrease followed by a plateau. This approach can work well for simple problems but often fails to achieve the best possible performance on complex tasks.
With step decay, the learning rate is reduced by a fixed factor at predetermined intervals (e.g., multiplied by 0.1 every 30 epochs). The loss curve shows distinct "staircase" drops at each reduction point, as the smaller learning rate allows the optimizer to settle into finer-grained regions of the loss surface.
With exponential decay, the learning rate decreases over time according to lr_t = lr_0 * e^(-kt), where k is a decay constant. This produces a smoothly decelerating loss curve.
With cosine annealing, the learning rate follows a cosine curve from its initial value down to near zero. This schedule is widely used in transformer training. The loss curve typically shows steady improvement with a particularly smooth final convergence phase. A variant called cosine annealing with warm restarts periodically resets the learning rate, producing a loss curve with multiple descent phases.
Many modern training pipelines begin with a warmup phase in which the learning rate starts very small and linearly increases to its target value over a set number of steps. Warmup is especially common in transformer training because it prevents early instability caused by large, poorly conditioned gradients before the model parameters have moved away from their initial values. The loss curve during warmup may appear relatively flat or even slightly noisy before the main descent begins.
The warmup-stable-decay (WSD) schedule, used in large language model pretraining, consists of three phases: a warmup phase with linearly increasing learning rate, a stable phase with a constant learning rate, and a decay phase (often cosine) at the end. The loss curve during the stable phase remains relatively flat, with a sharp drop during the final decay phase.
| Schedule | Loss curve shape | Common use cases | Advantages |
|---|---|---|---|
| Constant | Rapid initial drop, then plateau | Simple models, baselines | Simple to implement |
| Step decay | Staircase drops at fixed intervals | CNNs, image classification | Clear improvement at each step |
| Exponential decay | Smooth, decelerating descent | General purpose | No sudden changes |
| Cosine annealing | Smooth S-shaped descent | Transformers, language models | Good final convergence |
| Warmup + cosine decay | Flat start, then smooth descent | Large transformers, LLM pretraining | Prevents early instability |
| Cyclical learning rate | Periodic oscillations with overall descent | Exploration of loss surface | Can escape local minima |
| ReduceLROnPlateau | Drops when loss plateaus | Adaptive training | Responds to actual training dynamics |
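The schedules in the table above can be written as plain functions of the step count. The sketch below implements step decay, exponential decay, and linear warmup followed by cosine decay; the parameter names and default values are illustrative choices, not drawn from any particular framework:

```python
import math

def step_decay(lr0, step, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs (staircase shape)."""
    return lr0 * drop ** (step // every)

def exponential_decay(lr0, step, k=0.05):
    """lr_t = lr_0 * e^(-k t): smooth, decelerating decrease."""
    return lr0 * math.exp(-k * step)

def warmup_cosine(lr0, step, warmup_steps=100, total_steps=1000):
    """Linear warmup to lr0, then cosine decay to zero."""
    if step < warmup_steps:
        return lr0 * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr0 * 0.5 * (1 + math.cos(math.pi * progress))

print(step_decay(0.1, 65))       # two drops have occurred: ≈ 0.001
print(warmup_cosine(0.1, 50))    # halfway through warmup: ≈ 0.05
print(warmup_cosine(0.1, 1000))  # end of schedule: 0.0
```

Calling one of these functions with the current step before each optimizer update is all a scheduler does; frameworks such as PyTorch package the same logic as `torch.optim.lr_scheduler` classes.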
The loss curve is a one-dimensional projection of the model's trajectory through a high-dimensional loss surface. Understanding the loss surface helps explain certain loss curve behaviors.
A local minimum is a point on the loss surface where the loss is lower than at all immediately surrounding points but typically higher than at the global minimum. If gradient descent converges to a poor local minimum, the loss curve flattens out at a suboptimal value. In practice, research has shown that most local minima in high-dimensional neural networks have loss values close to the global minimum, so local minima are less of a concern than once believed.
Saddle points are locations where the gradient is zero but the point is a minimum along some directions and a maximum along others. In high-dimensional parameter spaces, saddle points are exponentially more common than local minima. When the optimizer reaches a saddle point, the loss curve may show a prolonged plateau before the model escapes. Optimizers that incorporate momentum or adaptive step sizes (such as Adam) tend to escape saddle points more quickly than plain stochastic gradient descent.
A plateau on the loss curve appears as a flat region where the loss does not change significantly over many steps. Plateaus can occur because of saddle points, because the learning rate is too small, or because the model's current representation has reached a temporary capacity limit. Plateaus are sometimes followed by sudden drops in loss, a pattern referred to as "grokking" in the context of algorithmic learning tasks.
The double descent phenomenon, named by Belkin et al. (2019) and documented at scale in deep networks by researchers at OpenAI (Nakkiran et al., 2019), describes a situation where the test loss follows a non-monotonic pattern as model complexity increases. The test loss initially decreases (classical bias-variance tradeoff behavior), then increases near the interpolation threshold (where the model has just enough capacity to fit the training data perfectly), and then decreases again as the model grows larger still.
Double descent can also appear along the epoch axis (epoch-wise double descent). For a model of fixed size, the test loss may decrease, then increase, and then decrease again as training continues for many epochs. This phenomenon is observed in various architectures, including convolutional neural networks, ResNets, and transformers, particularly when early stopping and regularization are not used.
During the pretraining of large language models, sudden and dramatic increases in loss (loss spikes) can occur. These spikes are caused by gradient explosions, where the norm of the gradient suddenly grows to very large values. Loss spikes can degrade final model performance or, in severe cases, cause training to diverge entirely.
Causes of loss spikes include:
- Abnormal or low-quality data batches that produce unusually large gradients
- Numerical overflow or underflow, particularly in mixed-precision training
- A learning rate that is too high for the current region of the loss surface
Common mitigation strategies include gradient clipping (limiting the gradient norm to a fixed threshold such as 1.0), using mixed-precision training carefully, and restarting from a recent checkpoint while skipping the problematic data batch (a technique used during the training of Google's PaLM model).
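Gradient clipping by global norm, the first mitigation listed above, can be sketched in plain Python. Deep learning frameworks provide built-in equivalents (for example `torch.nn.utils.clip_grad_norm_` in PyTorch); the stand-alone version below is for illustration only:

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return grads  # already within the threshold: leave unchanged

exploding = [30.0, 40.0]           # norm 50, far above the threshold of 1.0
clipped = clip_by_norm(exploding)  # rescaled so the norm is exactly 1.0
print(clipped)                     # [0.6, 0.8]
```

Because clipping rescales rather than zeroes the gradient, the update direction is preserved while the spike's magnitude is capped.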
Research on scaling laws has revealed that the final training loss of neural language models follows a power law relationship with model size, dataset size, and compute budget. The Chinchilla scaling laws, published by DeepMind in 2022, showed that for a given compute budget, the number of model parameters and training tokens should scale proportionally. These findings mean that loss curves for large models are, to some degree, predictable before training begins, which has significant implications for resource allocation in large-scale training.
Stochastic training methods produce noisy loss curves because the gradient is estimated from a random subset of data at each step. Several techniques are used to make loss curves easier to interpret.
The most common smoothing technique applies an exponential moving average to the raw loss values:
L_smooth(t) = β * L_smooth(t-1) + (1 - β) * L(t)
where β is a smoothing factor between 0 and 1 (typically 0.9 to 0.99). Higher values of β produce smoother curves but introduce more lag. Most visualization tools, including TensorBoard, offer built-in EMA smoothing controls.
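This update rule translates directly to code. The sketch below initializes the smoothed series with the first raw value, which is one common convention; visualization tools may instead apply bias correction to the early values:

```python
def ema_smooth(losses, beta=0.9):
    """Exponential moving average: L_smooth(t) = beta * L_smooth(t-1) + (1 - beta) * L(t)."""
    smoothed = []
    for t, loss in enumerate(losses):
        if t == 0:
            smoothed.append(loss)  # initialize with the first raw value
        else:
            smoothed.append(beta * smoothed[-1] + (1 - beta) * loss)
    return smoothed

# A noisy loss sequence becomes a gentler, lagged curve
raw = [1.0, 0.2, 0.9, 0.1, 0.8]
print(ema_smooth(raw, beta=0.9))
```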
A simple moving average computes the mean loss over the last k steps. This approach is straightforward but gives equal weight to all points in the window, which can obscure recent trends.
Reducing the frequency at which loss values are logged (e.g., logging every 100 steps instead of every step) effectively downsamples the curve and reduces visual noise. This is particularly useful during long training runs with millions of steps.
Several tools and libraries are commonly used to visualize loss curves during model training.
| Tool | Framework integration | Features | Cost |
|---|---|---|---|
| TensorBoard | TensorFlow, PyTorch, JAX | Real-time plots, smoothing, comparison of runs, histograms | Free, open source |
| Weights & Biases (W&B) | Framework-agnostic | Interactive dashboards, team collaboration, hyperparameter sweeps, automatic system metrics | Free tier available; paid plans for teams |
| MLflow | Framework-agnostic | Experiment tracking, model registry, metric comparison | Free, open source |
| Neptune.ai | Framework-agnostic | Real-time monitoring, collaboration, metadata management | Free tier available; paid plans |
| Matplotlib / Seaborn | Python | Custom static plots, full control over appearance | Free, open source |
| Comet ML | Framework-agnostic | Real-time tracking, code versioning, hyperparameter optimization | Free tier available; paid plans |
A basic loss curve can be plotted using Matplotlib by recording the loss value at each epoch during training:
```python
import matplotlib.pyplot as plt

# Assuming train_losses and val_losses are lists recorded during training
plt.plot(train_losses, label='Training loss')
plt.plot(val_losses, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.legend()
plt.show()
```
In TensorBoard, loss curves are logged automatically when using the appropriate callback or summary writer. For example, in PyTorch:
```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/validation', val_loss, epoch)
writer.close()
```
Early stopping is a regularization technique that monitors the validation loss curve and halts training when the validation loss has not improved for a specified number of epochs, known as the patience parameter.
The patience parameter controls how tolerant the early stopping mechanism is of temporary increases in validation loss.
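A minimal patience-based monitor can be sketched as follows; the `EarlyStopping` class below and its parameter names are illustrative (frameworks such as Keras ship a callback of the same name):

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta   # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Return True if training should stop after this epoch."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss     # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1     # no improvement this epoch
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
history = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67]  # validation loss per epoch
for epoch, val_loss in enumerate(history):
    if stopper.step(val_loss):
        print(f"stopping at epoch {epoch}")  # fires once patience is exhausted
        break
```

A larger `patience` tolerates longer plateaus and temporary upticks in validation loss before halting, at the cost of extra training time spent in the potential overfitting phase.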
Always plot both training and validation loss. A training loss curve alone is insufficient because it does not reveal overfitting.
Use a logarithmic scale for the y-axis when loss values span several orders of magnitude. This makes it easier to see improvements in later stages of training.
Compare multiple runs. When tuning hyperparameters, overlay loss curves from different configurations to identify which settings produce the best convergence.
Watch for sudden spikes. A sudden increase in loss often indicates a data issue (corrupted batch, mislabeled examples) or a numerical problem (exploding gradients).
Do not rely solely on the final loss value. The shape of the entire curve provides more information than any single number. A model that converges smoothly to a loss of 0.3 is often preferable to one that oscillates between 0.1 and 0.5.
Apply smoothing judiciously. Over-smoothing can hide important details like loss spikes or sudden changes in training dynamics.
Monitor additional metrics alongside loss. Accuracy, F1 score, and other task-specific metrics can provide complementary information. A decreasing loss does not always correspond to improving task performance, especially when the loss function is a poor proxy for the actual objective.
Be aware of the bias-variance tradeoff. The gap between training and validation curves directly reflects this tradeoff. A large gap indicates high variance (overfitting), while both curves being high indicates high bias (underfitting).
In image classification with convolutional neural networks, loss curves typically show rapid initial improvement followed by gradual refinement. Data augmentation is commonly used to reduce the gap between training and validation curves. Architectures like ResNet and EfficientNet often converge within 90 to 300 epochs, with step decay learning rate schedules producing characteristic staircase-shaped loss curves.
In natural language processing, transformer-based models are typically trained with warmup followed by cosine or linear decay. Loss curves for language model pretraining can span hundreds of thousands or millions of steps. The loss tends to decrease following a power law, consistent with the neural scaling laws observed by Kaplan et al. (2020) and Hoffmann et al. (2022).
Loss curves in reinforcement learning tend to be much noisier than in supervised learning because the training data distribution changes as the agent's policy improves. It is common to plot both the loss and the cumulative reward curve, as the loss alone may not be informative about the agent's actual performance.
The concept of plotting learning progress over time dates back to early work in mathematical psychology and animal learning research in the late 19th century. Hermann Ebbinghaus published his "forgetting curve" in 1885, which plotted memory retention over time. The idea of using similar plots to track the performance of computational learning algorithms was adopted as machine learning emerged as a field in the mid-20th century.
The term "learning curve" has been used in machine learning since at least the 1990s, with early analyses appearing in the work of Cortes et al. (1993) and others who studied how model error changes as a function of training set size. The modern usage of "loss curve" as a real-time diagnostic tool during neural network training became standard practice with the advent of deep learning frameworks and visualization tools like TensorBoard (released as part of TensorFlow in 2015).