# Loss Curve

> Source: https://aiwiki.ai/wiki/loss_curve
> Updated: 2026-07-12
> Categories: Deep Learning, Machine Learning, Model Evaluation, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Loss function](/wiki/loss_function)*

A **loss curve** is a plot that shows the value of a [loss function](/wiki/loss_function) over the course of training a [machine learning](/wiki/machine_learning) model. The horizontal axis typically represents training iterations, [epochs](/wiki/epoch), or steps, while the vertical axis represents the loss value. By tracking how loss changes during training, practitioners can assess whether a model is learning effectively, diagnose problems such as [overfitting](/wiki/overfitting) or [underfitting](/wiki/underfitting), and make informed decisions about [hyperparameters](/wiki/hyperparameters) like the [learning rate](/wiki/learning_rate) and model architecture. The single most informative reading comes from plotting two curves together, the training loss and the validation loss: a small and stable gap between them signals a good fit, while a training loss that keeps falling as the validation loss flattens or turns upward is the classic signature of overfitting.

Loss curves are one of the most widely used diagnostic tools in [deep learning](/wiki/deep_learning) and broader machine learning practice. Nearly every training framework, from [TensorFlow](/wiki/tensorflow) to [PyTorch](/wiki/pytorch), provides built-in support for logging and plotting loss curves.

## Explain like I'm 5 (ELI5)

Imagine you are learning to throw a ball into a bucket. At first, you miss by a lot. But every time you throw, you get a little closer. A loss curve is like a chart that tracks how far your throws land from the bucket each time you practice. When the line on the chart goes down, it means you are getting better. If the line stops going down or starts going back up, it means something might be wrong with how you are practicing, and you need to change your approach.

## Definition and mathematical background

In [supervised learning](/wiki/supervised_learning), a model learns by minimizing a loss function that measures the difference between its predictions and the true labels. For a dataset with N samples, the loss at any point during training can be written as:

**L(θ) = (1/N) Σ_i ℓ(f(x_i; θ), y_i)**

where θ represents the model parameters, f(x_i; θ) is the model's prediction for input x_i, y_i is the true label, ℓ is the per-sample loss function, and the sum runs over all N samples.

As [gradient descent](/wiki/gradient_descent) (or one of its variants) updates the parameters θ at each step, the value of L(θ) changes. Plotting L(θ) as a function of the training step produces the training loss curve.[1] When a separate held-out [validation set](/wiki/validation_set) is used, computing the same loss function on that set at regular intervals produces the validation loss curve.
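As a minimal sketch of how a training loss curve arises, the following pure-Python example fits a one-parameter model f(x; θ) = θx with full-batch gradient descent under a squared-error loss and records L(θ) at every step. The dataset, learning rate, and step count are made-up assumptions chosen only for illustration.

```python
# Toy dataset (made up for illustration): y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def mean_loss(theta):
    """L(theta) = (1/N) * sum of per-sample squared errors."""
    return sum((theta * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mean_grad(theta):
    """Derivative of mean_loss with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in zip(xs, ys)) / len(xs)

theta = 0.0           # initial parameter value
learning_rate = 0.05  # assumed step size
loss_curve = []       # one loss value per training step

for step in range(20):
    loss_curve.append(mean_loss(theta))         # record the loss before the update
    theta -= learning_rate * mean_grad(theta)   # gradient descent update

print([round(value, 4) for value in loss_curve[:5]])
```

Plotting `loss_curve` against the step index gives the training loss curve for this toy problem.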

### Common loss functions and their curves

The choice of loss function affects the shape and scale of the resulting loss curve. Different loss functions are suited to different tasks.

| Loss function | Formula | Typical use case | Curve characteristics |
|---|---|---|---|
| Mean squared error (MSE) | (1/N) Σ(y_i - ŷ_i)² | [Regression](/wiki/linear_regression) | Smooth; gradients shrink as errors shrink, so progress can slow near the optimum |
| Mean absolute error (MAE) | (1/N) Σ&#124;y_i - ŷ_i&#124; | Robust regression | Less sensitive to outliers; constant-magnitude gradient |
| Binary cross-entropy | -(1/N) Σ[y_i log(ŷ_i) + (1-y_i) log(1-ŷ_i)] | Binary [classification](/wiki/binary_classification) | Steep initial drop; strong gradients when predictions are wrong |
| Categorical cross-entropy | -(1/N) ΣΣ y_ic log(ŷ_ic) | Multi-class classification | Similar to binary cross-entropy but across C classes |
| Hinge loss | (1/N) Σ max(0, 1 - y_i ŷ_i) | Support vector machines (labels y_i in {-1, +1}) | Piecewise linear; zero loss once the margin is satisfied |
| Huber loss | Quadratic for small errors, linear for large | Regression with outliers | Smooth transition between MSE and MAE behavior |

Cross-entropy loss is generally preferred for classification tasks because it produces stronger gradients when the model makes confident but incorrect predictions, which typically leads to faster convergence than MSE on the same task.
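The gradient argument can be checked numerically. The sketch below assumes a single example with true label 1 and a sigmoid output unit (an assumption, since the text does not fix the output layer), and compares the gradient of each loss with respect to the pre-sigmoid logit z. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad_wrt_logit(z, y):
    """d/dz of -[y log p + (1 - y) log(1 - p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

def mse_grad_wrt_logit(z, y):
    """d/dz of (p - y)^2 with p = sigmoid(z)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

y_true = 1.0
# z = -5.0 is a confident but wrong prediction (p is close to 0 while y is 1).
for z in (-5.0, -2.0, 0.0):
    print(z, round(bce_grad_wrt_logit(z, y_true), 4),
          round(mse_grad_wrt_logit(z, y_true), 4))
```

For the confident wrong prediction, the MSE gradient is scaled down by the factor p(1 - p), which is close to zero, while the cross-entropy gradient stays close to -1.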

## What are the main types of loss curves?

In practice, multiple loss curves are plotted simultaneously to provide a complete picture of model behavior.

### Training loss curve

The training loss curve shows the loss computed on the [training set](/wiki/training_set) at each step or epoch. In a well-configured setup, this curve trends downward, with some step-to-step noise, as the [optimizer](/wiki/optimizer) iteratively reduces the loss. A flat training loss curve suggests that the model is unable to learn from the data, which may indicate an excessively low learning rate, a bug in the data pipeline, or an inappropriate model architecture; a steadily increasing curve more often points to a learning rate that is too high or to numerical instability.

### Validation loss curve

The validation loss curve is computed on a held-out validation set that the model does not use for parameter updates. It serves as a proxy for how well the model will perform on unseen data ([generalization](/wiki/generalization)). The validation loss typically decreases alongside the training loss early in training but may diverge later if the model begins memorizing the training data rather than learning generalizable patterns.

### Test loss

After training is complete, loss can be evaluated on a separate [test set](/wiki/test_set). While this is not plotted as a curve over time, the test loss serves as the final measure of model performance. The test set should never be used during training or hyperparameter selection.

## How do you interpret a loss curve?

The relationship between the training and validation loss curves reveals important information about model behavior. The following patterns are the most commonly observed. Google's Machine Learning Crash Course notes that a textbook run is rarer than beginners expect, and that an oscillating loss usually calls for a lower learning rate: "reducing learning rate is often a good idea when debugging a training problem."[7] Practitioner guides such as Brownlee (2019) group these shapes into underfit, overfit, and good-fit signatures, which are sometimes called diagnostic "learning curves."[8]

### Good fit

A well-fitting model produces training and validation loss curves that both decrease steadily and converge to similar low values. The gap between the two curves remains small throughout training. Both curves eventually flatten out, indicating that the model has learned the underlying patterns in the data and further training would provide diminishing returns.

### Overfitting

Overfitting occurs when the training loss continues to decrease while the validation loss stops improving or begins to increase. The growing gap between the two curves indicates that the model is memorizing the training data rather than learning generalizable patterns. This is one of the most common problems identified through loss curve analysis.

Signs of overfitting in loss curves:

- Training loss continues to drop while validation loss plateaus or rises
- A large and growing gap between training and validation loss
- Validation loss reaches a minimum and then increases, forming a "U" shape

### Underfitting

Underfitting occurs when both the training and validation loss remain high throughout training. The model lacks sufficient capacity or has not been trained long enough to capture the patterns in the data. Another form of underfitting appears when both curves are still decreasing at the end of training, suggesting that additional epochs could improve performance.

Signs of underfitting in loss curves:

- Both training and validation loss remain high and relatively flat
- Both curves are still decreasing when training ends
- The gap between training and validation loss is small but both values are unacceptably high

### Summary of diagnostic patterns

| Pattern | Training loss | Validation loss | Gap between curves | Diagnosis | Recommended action |
|---|---|---|---|---|---|
| Good fit | Decreases, stabilizes | Decreases, stabilizes | Small and stable | Model is well-tuned | None needed |
| Overfitting | Continues decreasing | Plateaus or increases | Large and growing | Model memorizes training data | Add [regularization](/wiki/regularization), get more data, reduce model complexity |
| Underfitting (capacity) | Remains high | Remains high | Small | Model too simple | Increase model complexity, add features |
| Underfitting (duration) | Still decreasing | Still decreasing | Small | Training stopped too early | Train for more epochs |
| Noisy or oscillating | Noisy, erratic | Noisy, erratic | Variable | Unstable training dynamics | Reduce learning rate, increase [batch size](/wiki/batch_size) |
| Divergence | Increases | Increases | N/A | Training has failed | Check learning rate, data, gradients |
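The patterns in the table can be turned into a crude automated check. The following is a heuristic sketch only: the function name `diagnose`, the thresholds `high_loss` and `tol`, and the order of the rules are all assumptions, and real curves usually need smoothing and task-specific thresholds before rules like these are reliable.

```python
def diagnose(train, val, high_loss, tol=0.05):
    """Classify per-epoch training and validation loss lists (each of length >= 2).

    high_loss: hypothetical task-specific level above which a loss counts as "high".
    tol: hypothetical minimum change that counts as a real rise or fall.
    """
    if train[-1] > train[0]:
        return "divergence"
    best_epoch = val.index(min(val))
    # Validation loss rose after its minimum while training loss kept falling.
    if val[-1] - val[best_epoch] > tol and train[-1] < train[best_epoch]:
        return "overfitting"
    # Both curves were still dropping noticeably in the final epoch.
    if train[-2] - train[-1] > tol and val[-2] - val[-1] > tol:
        return "underfitting (still improving)"
    if train[-1] > high_loss:
        return "underfitting (loss stays high)"
    return "good fit"

print(diagnose([1.0, 0.6, 0.3, 0.15, 0.05], [1.1, 0.7, 0.6, 0.75, 0.9], high_loss=0.5))
```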

## What affects the shape of a loss curve?

Several hyperparameters and design choices influence the shape and behavior of loss curves.

### Learning rate

The learning rate is often the most influential hyperparameter for the shape of the loss curve. It controls the step size that gradient descent takes when updating model parameters.

- **Too high**: The loss curve oscillates wildly or diverges (increases without bound). Large learning rates cause the optimizer to overshoot minima on the loss surface.
- **Too low**: The loss curve decreases very slowly and may get stuck in suboptimal regions of the loss surface. Training takes much longer than necessary.
- **Well-tuned**: The loss curve decreases smoothly and converges to a low value within a reasonable number of steps.
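The three regimes can be reproduced on a deliberately simple problem. This sketch runs gradient descent on the one-dimensional loss L(θ) = θ², an assumed toy objective; the specific learning rates are illustrative and do not transfer to real models.

```python
def loss_curve(learning_rate, steps=30, theta0=1.0):
    """Gradient descent on L(theta) = theta^2, whose gradient is 2 * theta."""
    theta, curve = theta0, []
    for _ in range(steps):
        curve.append(theta ** 2)
        theta -= learning_rate * 2 * theta
    return curve

too_high = loss_curve(1.1)     # |1 - 2 * lr| > 1, so the loss grows every step
too_low = loss_curve(0.001)    # the loss shrinks by only about 0.4% per step
well_tuned = loss_curve(0.3)   # the loss shrinks by a factor of 0.16 per step

for name, curve in [("too high", too_high), ("too low", too_low), ("well-tuned", well_tuned)]:
    print(name, curve[-1])
```

Each update multiplies θ by (1 - 2·lr), so the loss is multiplied by (1 - 2·lr)² per step, which is what separates divergence from slow and fast convergence here.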

### Batch size

The batch size determines how many training examples are used to compute each gradient update.

- **Small batch sizes** (e.g., 1 to 32): Produce noisy loss curves because each gradient estimate has high variance. However, this noise can help the model escape shallow local minima. A batch size of 1 corresponds to classic [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd).
- **Large batch sizes** (e.g., 256 to 4096): Produce smoother loss curves because the gradient estimates are more accurate. However, large batches can converge to sharper minima that generalize less well, and they require more memory.
- **Common practice**: Batch sizes of 32, 64, 128, or 256 are frequently used as a balance between gradient noise and computational efficiency.
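The link between batch size and noise can be illustrated with synthetic numbers. In this sketch the "per-example gradients" are random draws rather than gradients from a real model (an assumption made to keep the example self-contained), and the spread of the mini-batch average is measured for several batch sizes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic per-example gradients: true mean 1.0, standard deviation 2.0.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]

def batch_gradient_spread(batch_size, num_batches=200):
    """Standard deviation of the mini-batch gradient estimate across batches."""
    estimates = [
        statistics.fmean(random.sample(per_example_grads, batch_size))
        for _ in range(num_batches)
    ]
    return statistics.stdev(estimates)

for batch_size in (4, 32, 256):
    print(batch_size, round(batch_gradient_spread(batch_size), 3))
```

The spread falls roughly in proportion to one over the square root of the batch size, which is why larger batches give visibly smoother curves.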

### Model complexity

More complex models (more layers, more parameters) have greater capacity to fit the training data. This typically results in lower training loss but may increase the risk of overfitting, which manifests as diverging training and validation loss curves.

### Regularization

Regularization techniques constrain the model to prevent overfitting and directly affect the loss curve.

| Regularization technique | Effect on loss curve |
|---|---|
| L1 regularization | Adds a penalty proportional to the absolute value of weights; may slightly increase training loss but helps validation loss |
| L2 regularization (weight decay) | Adds a penalty proportional to the square of weights; smooths the loss curve and reduces overfitting |
| [Dropout](/wiki/dropout) | Randomly deactivates neurons during training; training loss may appear higher (sometimes even above validation loss) because dropout is active during training but disabled at evaluation time, while validation loss typically improves |
| [Batch normalization](/wiki/batch_normalization) | Normalizes layer inputs; stabilizes and smooths the loss curve, often allowing higher learning rates |
| [Early stopping](/wiki/early_stopping) | Halts training when validation loss stops improving; prevents the overfitting phase of the loss curve |
| [Data augmentation](/wiki/data_augmentation) | Increases effective training set size; can reduce the gap between training and validation curves |

### Weight initialization

The initial values of model parameters affect the starting point of the loss curve. Poor initialization can lead to extremely high initial loss values, slow convergence, or training instability. Modern initialization methods such as Xavier (Glorot) initialization[10] and He initialization[11] are designed to keep activations and gradients at reasonable scales, leading to smoother initial loss curves.

### Data quality and preprocessing

Poorly preprocessed data, mislabeled examples, or data that has not been properly shuffled can cause erratic loss curves. For instance, if training examples are ordered by class (e.g., all dog images followed by all cat images), the loss curve may oscillate as the model alternately learns and forgets different classes. Proper random shuffling of training data at each epoch eliminates this issue.

## How do learning rate schedules change the loss curve?

A learning rate schedule adjusts the learning rate during training rather than keeping it fixed. Different schedules produce distinct loss curve shapes.

### Constant learning rate

The simplest approach is to use a fixed learning rate throughout training. The loss curve typically shows rapid initial decrease followed by a plateau. This approach can work well for simple problems but often fails to achieve the best possible performance on complex tasks.

### Step decay

The learning rate is reduced by a fixed factor at predetermined intervals (e.g., multiplied by 0.1 every 30 epochs). The loss curve shows distinct "staircase" drops at each reduction point, as the smaller learning rate allows the optimizer to settle into finer-grained regions of the loss surface.
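A minimal sketch of step decay using the numbers from the example above (a factor of 0.1 every 30 epochs). The base learning rate of 0.1 and the function name are assumptions.

```python
def step_decay(epoch, base_lr=0.1, drop_factor=0.1, epochs_per_drop=30):
    """Learning rate after multiplying by drop_factor every epochs_per_drop epochs."""
    return base_lr * drop_factor ** (epoch // epochs_per_drop)

for epoch in (0, 29, 30, 59, 60):
    print(epoch, step_decay(epoch))
```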

### Exponential decay

The learning rate decreases exponentially over time according to lr_t = lr_0 * e^(-kt), where k is a decay constant. This produces a smoothly decelerating loss curve.
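The formula translates directly into code. In this sketch the initial rate `lr0` and decay constant `k` are illustrative values, not recommendations.

```python
import math

def exponential_decay(t, lr0=0.1, k=0.05):
    """lr_t = lr_0 * e^(-k * t)."""
    return lr0 * math.exp(-k * t)

# The learning rate halves every ln(2) / k steps, regardless of lr0.
half_life = math.log(2) / 0.05
print(round(half_life, 2), exponential_decay(half_life))
```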

### Cosine annealing

The learning rate follows a cosine curve from its initial value down to near zero. This schedule is widely used in [transformer](/wiki/transformer) training. The loss curve typically shows steady improvement with a particularly smooth final convergence phase. A variant called cosine annealing with warm restarts, introduced by Loshchilov and Hutter (2017) as SGDR, periodically resets the learning rate, producing a loss curve with multiple descent phases.[6]
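A sketch of the cosine schedule and a simplified warm-restart variant. It assumes a fixed cycle length, whereas SGDR also allows cycles to grow over time; the parameter names and default values are illustrative.

```python
import math

def cosine_annealing(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def cosine_with_restarts(step, cycle_steps, lr_max=0.1, lr_min=0.0):
    """Same curve, reset to lr_max every cycle_steps steps (fixed cycle length)."""
    return cosine_annealing(step % cycle_steps, cycle_steps, lr_max, lr_min)

for step in (0, 50, 99, 100, 150):
    print(step, round(cosine_with_restarts(step, cycle_steps=100), 5))
```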

### Warmup

Many modern training pipelines begin with a warmup phase in which the learning rate starts very small and linearly increases to its target value over a set number of steps. Warmup is especially common in transformer training because it prevents early instability caused by large, poorly conditioned gradients before the model parameters have moved away from their initial values. The loss curve during warmup may appear relatively flat or even slightly noisy before the main descent begins.
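A minimal sketch of linear warmup. The warmup length and target learning rate below are assumed example values, and the rate is simply held constant after warmup; in practice a decay schedule usually follows.

```python
def linear_warmup(step, warmup_steps=1000, target_lr=3e-4):
    """Linearly ramp the learning rate up to target_lr, then hold it."""
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps
    return target_lr

for step in (0, 499, 999, 5000):
    print(step, linear_warmup(step))
```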

### Warmup-stable-decay (WSD)

Warmup-stable-decay is a schedule used in [large language model](/wiki/large_language_model) pretraining that consists of three phases: a warmup phase with linearly increasing learning rate, a stable phase with constant learning rate, and a decay phase (often cosine) at the end. Introduced with the MiniCPM models by Hu et al. (2024), WSD does not require the total number of training steps to be fixed in advance, which makes it convenient for very large or continual runs.[19] The loss curve during the stable phase remains relatively flat, and can even sit higher than a cosine schedule would, before dropping sharply during the final decay phase.[19]
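The three phases can be sketched as a single function. The cosine shape for the decay phase is one possible choice, and the phase lengths and peak rate are hypothetical values for illustration, not settings taken from the MiniCPM paper.

```python
import math

def wsd_lr(step, warmup_steps, stable_steps, decay_steps, peak_lr=1e-3, min_lr=0.0):
    """Warmup-stable-decay learning rate at a given step."""
    if step < warmup_steps:                   # warmup: linear increase
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:    # stable: constant learning rate
        return peak_lr
    # decay: cosine from peak_lr down to min_lr over decay_steps
    progress = min(step - warmup_steps - stable_steps, decay_steps) / decay_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 100, 1099, 1200, 1300):
    print(step, wsd_lr(step, warmup_steps=100, stable_steps=1000, decay_steps=200))
```

Because the stable phase is constant, `stable_steps` can be extended after training has started without changing any learning rate already used, which matches the property described above.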

### Comparison of learning rate schedules

| Schedule | Loss curve shape | Common use cases | Advantages |
|---|---|---|---|
| Constant | Rapid initial drop, then plateau | Simple models, baselines | Simple to implement |
| Step decay | Staircase drops at fixed intervals | CNNs, image classification | Clear improvement at each step |
| Exponential decay | Smooth, decelerating descent | General purpose | No sudden changes |
| Cosine annealing | Smooth S-shaped descent | Transformers, language models | Good final convergence |
| Warmup + cosine decay | Flat start, then smooth descent | Large transformers, LLM pretraining | Prevents early instability |
| Cyclical learning rate | Periodic oscillations with overall descent | Exploration of loss surface | Can escape local minima |
| ReduceLROnPlateau | Drops when loss plateaus | Adaptive training | Responds to actual training dynamics |

## Loss surface, local minima, and saddle points

The loss curve is a one-dimensional projection of the model's trajectory through a high-dimensional loss surface. Understanding the loss surface helps explain certain loss curve behaviors. Researchers have developed methods to render these high-dimensional surfaces in two or three dimensions, which help explain why some architectures, such as those with skip connections, train more smoothly than others.[14]

### Local minima

A local minimum is a point on the loss surface where the loss is lower than all immediately surrounding points but higher than the global minimum. If gradient descent converges to a local minimum, the loss curve will flatten out at a suboptimal value. In practice, research suggests that most local minima in high-dimensional [neural networks](/wiki/neural_network) have loss values close to the global minimum, so local minima are less of a concern than once believed.

### Saddle points

Saddle points are locations where the gradient is zero but the point is a minimum in some directions and a maximum in others. In high-dimensional parameter spaces, saddle points are exponentially more common than local minima, as argued by Dauphin et al. (2014) using results from random matrix theory.[17] When the optimizer passes near a saddle point, the loss curve may show a prolonged plateau before the model escapes. Optimizers with [momentum](/wiki/momentum) or adaptive step sizes (such as the [Adam optimizer](/wiki/adam_optimizer)) often move through these flat regions faster than plain gradient descent. Later adaptive optimizers such as AdaBelief adapt each step size "by the belief in observed gradients," which changes how smoothly the loss descends.[5]

### Plateaus

A plateau on the loss curve appears as a flat region where the loss does not change significantly over many steps. Plateaus can occur because of saddle points, because the learning rate is too small, or because the model's current representation has reached a temporary capacity limit. Plateaus are sometimes followed by sudden drops in loss. A related pattern, "grokking," was first reported and named by Power et al. (2022) at OpenAI on algorithmic learning tasks: the validation loss stays high long after the training loss has converged and then drops abruptly.[18]

## Advanced loss curve phenomena

### Double descent

The double descent phenomenon describes a situation where the test loss follows a non-monotonic pattern as model complexity increases. The term "double descent" was coined by Belkin et al. (2019), who unified the classical bias-variance curve and the behavior of modern overparameterized models within a single plot.[16] Building on this, Nakkiran et al. (2019), in work carried out with OpenAI, documented "deep double descent" in modern deep networks: the test loss initially decreases (the classical bias-variance regime), then increases near the interpolation threshold (where the model has just enough capacity to perfectly fit the training data), and then decreases again as the model becomes even larger.[4]

Double descent can also appear along the epoch axis (epoch-wise double descent). For a model of fixed size, the test loss may decrease, then increase, and then decrease again as training continues for many epochs. This phenomenon is observed in various architectures, including [convolutional neural networks](/wiki/convolutional_neural_network), ResNets, and transformers, particularly when early stopping and regularization are not used.[4]

### Loss spikes in large-scale training

During the pretraining of large language models, sudden and dramatic increases in loss (loss spikes) can occur. These spikes are often associated with gradient explosions, where the norm of the gradient suddenly grows to very large values. Loss spikes can degrade final model performance or, in severe cases, cause training to diverge entirely.

Causes of loss spikes include:

- Numerical instability in parameter norms
- Noisy or corrupted training data batches
- Non-uniform scaling of parameters across layers
- Interactions between large learning rates and model architecture

Common mitigation strategies include gradient clipping (limiting the gradient norm to a fixed threshold such as 1.0), using mixed-precision training carefully, and restarting from a recent checkpoint while skipping the problematic data batch. This last technique was used during the training of Google's 540-billion-parameter [PaLM](/wiki/palm) model, whose authors reported that they "observed spikes in the loss roughly 20 times during training, despite the fact that gradient clipping was enabled."[15] Their fix was to restart from a checkpoint roughly 100 steps before each spike and skip the 200 to 500 data batches that were seen just before and during it, after which the loss did not spike again at the same point.[15] Later work such as "Spike No More" (Takase et al., 2023) proposed initialization and normalization schemes that reduce how often these spikes occur.[9]
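Gradient clipping by norm can be sketched in a few lines. This version works on a flat list of gradient values, whereas real frameworks operate on tensors across all parameters; the function name is an assumption, and the threshold of 1.0 is the example value mentioned above.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient values so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm   # already within the threshold
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(original_norm, clipped)
```

Clipping rescales the whole gradient vector, so its direction is preserved and only its length changes. Logging the pre-clipping norm alongside the loss can help show whether a spike coincides with a jump in gradient norm.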

### Scaling laws

Research on [scaling laws](/wiki/scaling_laws) has revealed that the loss of neural language models follows a power law relationship with model size, dataset size, and compute budget. Kaplan et al. (2020) found that "The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude."[2] The Chinchilla scaling laws, published by DeepMind in 2022, refined this by showing that for a given compute budget the number of model parameters and training tokens should scale in equal proportion, so that "for every doubling of model size the number of training tokens should also be doubled" (roughly 20 training tokens per parameter).[3] To demonstrate the point, DeepMind trained [Chinchilla](/wiki/chinchilla) (70 billion parameters on 1.4 trillion tokens) using the same compute budget as the 280-billion-parameter Gopher; Chinchilla outperformed Gopher, reaching 67.5% average accuracy on MMLU, an improvement of more than 7 percentage points.[3] These findings mean that loss curves for large models are, to some degree, predictable before training begins, which has significant implications for resource allocation in large-scale training.
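The predictability comes from the fact that a power law is a straight line in log-log space. The sketch below fits such a line to synthetic points and extrapolates to a larger model size; the constants 5.0 and 0.08 are arbitrary assumptions used to generate the data, not values from the cited papers.

```python
import math

# Synthetic points generated from loss = 5.0 * N ** -0.08 (arbitrary constants).
model_sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * n ** -0.08 for n in model_sizes]

# Least-squares line in log-log space: log L = log a - alpha * log N.
xs = [math.log(n) for n in model_sizes]
ys = [math.log(loss) for loss in losses]
x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
alpha = -slope
a = math.exp(y_mean + alpha * x_mean)

def predicted_loss(n):
    """Extrapolate the fitted power law to a model size outside the fit."""
    return a * n ** -alpha

print(round(alpha, 4), round(a, 4), round(predicted_loss(1e10), 4))
```

With real measurements the points would scatter around the line, so the extrapolation carries uncertainty that this noise-free example does not show.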

## How do you smooth a noisy loss curve?

Stochastic training methods produce noisy loss curves because the gradient is estimated from a random subset of data at each step. Several techniques are used to make loss curves easier to interpret.

### Exponential moving average (EMA)

The most common smoothing technique applies an exponential moving average to the raw loss values:

**L_smooth(t) = β * L_smooth(t-1) + (1 - β) * L(t)**

where β is a smoothing factor between 0 and 1 (typically 0.9 to 0.99). Higher values of β produce smoother curves but introduce more lag. Most visualization tools, including TensorBoard, offer built-in EMA smoothing controls.
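A direct implementation of the formula, as a sketch. Seeding the average with the first raw value is an assumption, since the formula does not specify L_smooth before the first step, and visualization tools may initialize or debias differently.

```python
def ema_smooth(losses, beta=0.9):
    """Apply L_smooth(t) = beta * L_smooth(t-1) + (1 - beta) * L(t)."""
    if not losses:
        return []
    smoothed = []
    previous = losses[0]  # assumption: seed the average with the first raw value
    for value in losses:
        previous = beta * previous + (1.0 - beta) * value
        smoothed.append(previous)
    return smoothed

raw = [1.0, 0.0, 0.0, 0.0]
print(ema_smooth(raw, beta=0.5))
print(ema_smooth(raw, beta=0.9))
```

After the raw loss drops from 1.0 to 0.0, the curve smoothed with the larger smoothing factor takes longer to follow, which is the lag described above.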

### Moving window average

A simple moving average computes the mean loss over the last k steps. This approach is straightforward but gives equal weight to all points in the window, which can obscure recent trends.
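A sketch of the moving window average. Using a shorter window for the first k - 1 points is an assumption made so that the output has the same length as the input.

```python
def moving_average(losses, k=3):
    """Mean of the last k values at each step (shorter window at the start)."""
    smoothed = []
    for i in range(len(losses)):
        window = losses[max(0, i - k + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

print(moving_average([3.0, 1.0, 2.0, 6.0], k=2))
```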

### Logging frequency

Reducing the frequency at which loss values are logged (e.g., logging every 100 steps instead of every step) effectively downsamples the curve and reduces visual noise. This is particularly useful during long training runs with millions of steps.

## What tools plot loss curves?

Several tools and libraries are commonly used to visualize loss curves during model training.

| Tool | Framework integration | Features | Cost |
|---|---|---|---|
| TensorBoard | TensorFlow, PyTorch, JAX | Real-time plots, smoothing, comparison of runs, histograms | Free, open source |
| Weights & Biases (W&B) | Framework-agnostic | Interactive dashboards, team collaboration, hyperparameter sweeps, automatic system metrics | Free tier available; paid plans for teams |
| MLflow | Framework-agnostic | Experiment tracking, model registry, metric comparison | Free, open source |
| Neptune.ai | Framework-agnostic | Real-time monitoring, collaboration, metadata management | Free tier available; paid plans |
| Matplotlib / Seaborn | Python | Custom static plots, full control over appearance | Free, open source |
| Comet ML | Framework-agnostic | Real-time tracking, code versioning, hyperparameter optimization | Free tier available; paid plans |

TensorBoard, which shipped with the open-source release of TensorFlow in 2015, was among the first widely adopted tools for watching loss curves update in real time during training.[20]

### Plotting loss curves in Python

A basic loss curve can be plotted using Matplotlib by recording the loss value at each epoch during training:

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration; in practice, append one loss per epoch during training
train_losses = [0.92, 0.61, 0.45, 0.36, 0.31]
val_losses = [0.95, 0.68, 0.55, 0.50, 0.49]

plt.plot(train_losses, label='Training loss')
plt.plot(val_losses, label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Curve')
plt.legend()
plt.show()
```

With TensorBoard, loss values are logged either through a framework callback (such as the Keras TensorBoard callback) or by writing scalars explicitly with a summary writer. For example, in PyTorch:

```python
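from torch.utils.tensorboard import SummaryWriter

# model, train_loader, val_loader, optimizer, and num_epochs are assumed to be defined;
# train_one_epoch and evaluate are user-written helpers that return a float loss.
writer = SummaryWriter()  # writes event files under ./runs by default
for epoch in range(num_epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, val_loader)
    # The shared 'Loss/' prefix groups both scalars in the same TensorBoard section
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/validation', val_loss, epoch)
writer.close()  # flush pending events to disk; view with: tensorboard --logdir runs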
```

## How does early stopping use the loss curve?

Early stopping is a regularization technique that monitors the validation loss curve and halts training when the validation loss has not improved for a specified number of epochs, known as the patience parameter. The question of precisely when to stop has been studied since at least Prechelt (1998), whose classic treatment weighed the tradeoff between wasted training time and lost generalization.[13]

### How early stopping works

1. At the end of each epoch, the validation loss is computed.
2. If the validation loss improves (decreases below the best observed value), the model checkpoint is saved.
3. If the validation loss does not improve for a number of consecutive epochs equal to the patience value, training is stopped.
4. The model weights from the epoch with the lowest validation loss are restored.
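The steps above can be sketched as a function that replays a recorded validation loss curve. It only tracks epoch indices; saving and restoring real model checkpoints is left out, and the function name and example values are illustrative assumptions.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a validation loss curve.

    Epochs are 0-indexed; stop_epoch is the last epoch that was run.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Step 2: improvement, so a checkpoint would be saved here.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Step 3: patience exhausted, stop training.
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

val_curve = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping(val_curve, patience=3)
print(best_epoch, stop_epoch)
```

In a real training loop, step 4 corresponds to reloading the weights saved at `best_epoch` once the loop exits.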

### Choosing the patience parameter

The patience parameter controls how tolerant the early stopping mechanism is of temporary increases in validation loss.

- **Too low** (e.g., 1 to 3): May stop training prematurely, especially if the validation loss temporarily plateaus before improving.
- **Too high** (e.g., 50 or more): May allow the model to overfit significantly before stopping.
- **Typical values**: 5 to 20, depending on the problem and the expected shape of the validation loss curve.

## Practical tips for working with loss curves

1. **Always plot both training and validation loss.** A training loss curve alone is insufficient because it does not reveal overfitting.

2. **Use a logarithmic scale for the y-axis** when loss values span several orders of magnitude. This makes it easier to see improvements in later stages of training.

3. **Compare multiple runs.** When tuning hyperparameters, overlay loss curves from different configurations to identify which settings produce the best convergence.

4. **Watch for sudden spikes.** A sudden increase in loss often indicates a data issue (corrupted batch, mislabeled examples) or a numerical problem ([exploding gradients](/wiki/exploding_gradient_problem)).

5. **Do not rely solely on the final loss value.** The shape of the entire curve provides more information than any single number. A model that converges smoothly to a loss of 0.3 is often preferable to one that oscillates between 0.1 and 0.5.

6. **Apply smoothing judiciously.** Over-smoothing can hide important details like loss spikes or sudden changes in training dynamics.

7. **Monitor additional metrics alongside loss.** Accuracy, F1 score, and other task-specific metrics can provide complementary information. A decreasing loss does not always correspond to improving task performance, especially when the loss function is a poor proxy for the actual objective.

8. **Be aware of the [bias-variance tradeoff](/wiki/bias_variance_tradeoff).** The gap between training and validation curves directly reflects this tradeoff. A large gap indicates high variance (overfitting), while both curves being high indicates high bias (underfitting).

## How do loss curves differ across domains?

### Computer vision

In image classification with [convolutional neural networks](/wiki/convolutional_neural_network), loss curves typically show rapid initial improvement followed by gradual refinement. Data augmentation is commonly used to reduce the gap between training and validation curves. Architectures like ResNet and EfficientNet are often trained for roughly 90 to 300 epochs, with step decay learning rate schedules producing characteristic staircase-shaped loss curves.

### Natural language processing

In [natural language processing](/wiki/natural_language_processing), transformer-based models are typically trained with warmup followed by cosine or linear decay. Loss curves for language model pretraining can span hundreds of thousands or millions of steps. The loss tends to decrease following a power law, consistent with the neural scaling laws observed by Kaplan et al. (2020)[2] and Hoffmann et al. (2022).[3]

### Reinforcement learning

Loss curves in [reinforcement learning](/wiki/reinforcement_learning) tend to be much noisier than in supervised learning because the training data distribution changes as the agent's policy improves. It is common to plot both the loss and the cumulative reward curve, as the loss alone may not be informative about the agent's actual performance.

## History

The concept of plotting learning progress over time dates back to early work in mathematical psychology and animal learning research in the late 19th century. Hermann Ebbinghaus published his "forgetting curve" in 1885, which plotted memory retention over time using experiments on himself with nonsense syllables.[21] The idea of using similar plots to track the performance of computational learning algorithms was adopted as machine learning emerged as a field in the mid-20th century.

The term "learning curve" has been used in machine learning since at least the 1990s, with early analyses appearing in the work of Cortes et al. (1993) and others who studied how model error changes as a function of training set size.[12] The modern usage of "loss curve" as a real-time diagnostic tool during neural network training became standard practice with the advent of deep learning frameworks and visualization tools like TensorBoard, which was released as part of TensorFlow in 2015.[20]

## See also

- [Self-Taught Evaluator](/wiki/self_taught_evaluator)

## References

1. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 8: Optimization for Training Deep Models.
2. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., et al. (2020). "Scaling Laws for Neural Language Models." arXiv:2001.08361.
3. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." arXiv:2203.15556. (Chinchilla scaling laws.)
4. Nakkiran, P., Kaplun, G., Bansal, Y., et al. (2019). "Deep Double Descent: Where Bigger Models and More Data Hurt." arXiv:1912.02292.
5. Zhuang, J., Tang, T., Ding, Y., et al. (2020). "AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients." arXiv:2010.07468.
6. Loshchilov, I. and Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." Proceedings of ICLR 2017.
7. Google Developers. "Overfitting: Interpreting loss curves." Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course/overfitting/interpreting-loss-curves
8. Brownlee, J. (2019). "How to use Learning Curves to Diagnose Machine Learning Model Performance." Machine Learning Mastery. https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
9. Takase, S., Kiyono, S., Kobayashi, S., and Suzuki, J. (2023). "Spike No More: Stabilizing the Pre-training of Large Language Models." arXiv:2312.16903.
10. Glorot, X. and Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." Proceedings of AISTATS 2010.
11. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." arXiv:1502.01852.
12. Cortes, C., Jackel, L. D., Solla, S. A., Vapnik, V., and Denker, J. S. (1993). "Learning Curves: Asymptotic Values and Rate of Convergence." Advances in Neural Information Processing Systems 6 (NIPS 1993).
13. Prechelt, L. (1998). "Early Stopping - But When?" In *Neural Networks: Tricks of the Trade*, Springer.
14. Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. (2018). "Visualizing the Loss Landscape of Neural Nets." Advances in Neural Information Processing Systems 31 (NeurIPS 2018).
15. Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311.
16. Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off." Proceedings of the National Academy of Sciences 116(32): 15849-15854.
17. Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). "Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization." Advances in Neural Information Processing Systems 27 (NIPS 2014).
18. Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv:2201.02177.
19. Hu, S., Tu, Y., Han, X., et al. (2024). "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies." arXiv:2404.06395.
20. TensorFlow. "TensorBoard: TensorFlow's visualization toolkit." https://www.tensorflow.org/tensorboard
21. Ebbinghaus, H. (1885). *Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie*. Duncker & Humblot. Translated as *Memory: A Contribution to Experimental Psychology* (1913).