# Epoch

> Source: https://aiwiki.ai/wiki/epoch
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

An **epoch** in [machine learning](/wiki/machine_learning) is one complete pass through the entire [training](/wiki/training) [dataset](/wiki/dataset), during which every example is presented to the model exactly once to compute gradients and update parameters before the cycle repeats.[1] The number of epochs is a key [hyperparameter](/wiki/hyperparameter) controlling how many times a model revisits its data: too few causes [underfitting](/wiki/underfitting), too many causes [overfitting](/wiki/overfitting). Classical models are trained for many epochs (image classifiers on [ImageNet](/wiki/imagenet) typically run 90 or more), whereas frontier [large language models](/wiki/large_language_model) are usually trained for roughly a single epoch over corpora of trillions of tokens, so the dominant unit of training progress has shifted from epochs to total tokens and steps.[3][7]

The most-cited modern result on repeating epochs comes from Muennighoff et al. (2023), who found that "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data" for a fixed compute budget.[3] The compute-optimal [Chinchilla scaling](/wiki/chinchilla_scaling) recipe (Hoffmann et al., 2022) assumes each token is seen once and recommends roughly 20 training tokens per model parameter; for example, a 70 billion parameter model would be trained on about 1.4 trillion tokens.[2]

Outside of large-scale pre-training, a single epoch is rarely sufficient to learn all the patterns in a dataset. Most training procedures run for multiple epochs so the model can iteratively refine its [weights](/wiki/weights) and [biases](/wiki/bias) based on the full distribution of training examples.[13] However, training for too many epochs risks [overfitting](/wiki/overfitting), where the model memorizes the training data rather than learning generalizable patterns.[1]

## What is the relationship between epochs, batches, iterations, and steps?

An epoch does not process the entire dataset at once. Instead, the dataset is divided into smaller subsets called batches (or mini-batches). Each time the model processes one batch and updates its weights through a forward pass and backward pass, that constitutes one **iteration** (also called a **step** in most frameworks).[4] Understanding how an epoch, a [batch](/wiki/batch), and an iteration fit together is essential for configuring training correctly.

- **Epoch**: One full pass through the entire training dataset.
- **[Batch](/wiki/batch) (mini-batch)**: A subset of the training data used in a single forward and backward pass. The [batch size](/wiki/batch_size) determines how many examples are in each batch.
- **Iteration (step)**: One parameter update, corresponding to processing one batch.

The relationship between these concepts is straightforward:

**Iterations per epoch = Total training examples / [Batch size](/wiki/batch_size)** (rounded up when the final partial batch is kept)

For example, consider a dataset of 50,000 training images with a [batch size](/wiki/batch_size) of 256:

- **Iterations per epoch** = 50,000 / 256 = 195.3, rounded up to 196 iterations
- If training runs for 90 epochs, the total number of iterations = 196 x 90 = 17,640 weight updates[5]

A smaller example: with 10,000 training examples and a batch size of 100, each epoch contains 10,000 / 100 = 100 iterations; after 5 epochs the model has seen each example 5 times and performed 500 total parameter updates.
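The arithmetic in both examples can be checked with a few lines of Python. This is a minimal sketch: the helper name `iterations_per_epoch` is illustrative, and it assumes the final partial batch is kept, hence the ceiling.

```python
import math

def iterations_per_epoch(num_examples: int, batch_size: int) -> int:
    """Parameter updates in one epoch, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size)

# 50,000 images, batch size 256, 90 epochs
per_epoch = iterations_per_epoch(50_000, 256)
print(per_epoch, per_epoch * 90)  # 196 17640

# 10,000 examples, batch size 100, 5 epochs
print(iterations_per_epoch(10_000, 100) * 5)  # 500
```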

The following table summarizes these three concepts:

| Concept | Definition | Scope | Example (50,000 samples, batch size 256) |
|---|---|---|---|
| [Batch](/wiki/batch) | A subset of training data processed together in one forward/backward pass | A fraction of the dataset | 256 samples |
| Iteration (step) | One weight update after processing a single batch | One batch | 1 of 196 per epoch |
| Epoch | One complete pass through the entire dataset | The full dataset | All 196 iterations |

And, framed around the smaller 10,000-sample example:

| Term | What it measures | Example (10,000 samples, batch size 100) |
|---|---|---|
| Batch size | Number of samples per update | 100 |
| Iteration (step) | One forward + backward pass + update | 1 iteration = 100 samples processed |
| Epoch | One full pass through the dataset | 1 epoch = 100 iterations = 10,000 samples |
| Total iterations | Epochs multiplied by iterations per epoch | 5 epochs = 500 iterations |

When the total number of samples is not evenly divisible by the batch size, the last batch in each epoch will be smaller than the others. Most training frameworks handle this automatically, either by using the smaller batch as-is or by dropping the last incomplete batch.
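The two ways of handling an incomplete final batch can be sketched in plain Python. The function `batch_sizes` is a hypothetical helper written for this illustration; its `drop_last` flag is named after the option of the same name in PyTorch's `DataLoader`, but the code does not call any framework.

```python
def batch_sizes(num_examples: int, batch_size: int, drop_last: bool = False) -> list:
    """Sizes of the batches produced in one epoch."""
    full_batches, remainder = divmod(num_examples, batch_size)
    sizes = [batch_size] * full_batches
    if remainder and not drop_last:
        sizes.append(remainder)  # keep the smaller final batch as-is
    return sizes

kept = batch_sizes(50_000, 256)
dropped = batch_sizes(50_000, 256, drop_last=True)
print(len(kept), kept[-1])        # 196 80
print(len(dropped), dropped[-1])  # 195 256
```

Dropping the last batch keeps every update the same size, at the cost of skipping a few examples in each epoch (80 of them here).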

In most frameworks, the terms "iteration" and "step" are interchangeable. TensorFlow and [Keras](/wiki/keras) use `steps_per_epoch`, while [PyTorch](/wiki/pytorch) users typically speak in terms of iterations.

## Why are multiple epochs needed?

A single pass through the data is usually insufficient for a model to converge to a good solution. [Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) and its variants update model weights based on noisy gradient estimates computed from individual mini-batches.[4] These updates point roughly toward the optimum but require many repeated passes to accumulate enough learning signal for convergence.[1]

### Underfitting with too few epochs

When a model trains for too few epochs, it has not had enough exposure to the data to learn the underlying patterns. The result is [underfitting](/wiki/underfitting): high error on both training and [validation](/wiki/validation_set) data. The [loss](/wiki/loss) curve will still be trending downward when training stops, indicating that additional epochs would improve performance.

### Overfitting with too many epochs

Training for too many epochs is one of the most common causes of overfitting. As training progresses, the model's capacity to fit the training data increases. Early in training, the model learns general patterns that apply to both training and validation data, and both losses decrease together. After a certain point, however, the model begins to memorize specifics of the training set: noise, outliers, and idiosyncrasies that do not generalize. Training loss continues to decrease while validation loss begins to increase. This gap signals [overfitting](/wiki/overfitting), and the point of divergence marks approximately where training should have stopped.[9]

The relationship between epochs and overfitting can be visualized as a learning curve:

- **Early epochs**: Both training and validation loss decrease. The model is learning useful patterns.
- **Middle epochs**: Validation loss reaches its minimum. The model has extracted the generalizable information from the data.
- **Late epochs**: Training loss keeps falling but validation loss rises. The model is memorizing training data.

The optimal number of epochs corresponds to the point where validation loss is minimized, which is exactly what early stopping is designed to detect.

## How do you choose the number of epochs?

There is no universal formula for the optimal number of epochs. The optimal count depends on the dataset size, model complexity, [learning rate](/wiki/learning_rate), and other hyperparameters. Several strategies and guidelines help practitioners make this decision.

### General principles

**Larger datasets generally need fewer epochs.** When a dataset is large and diverse, the model can learn a good representation with fewer passes because each epoch exposes it to a wide variety of examples. Conversely, small datasets may require more epochs because each pass provides limited information.

**More complex models may need more epochs.** A [deep neural network](/wiki/neural_network) with millions of parameters takes longer to train because the [optimizer](/wiki/optimizer) must navigate a higher-dimensional parameter space. Simple models like [logistic regression](/wiki/logistic_regression) often converge in just a few epochs.

**The learning rate affects how many epochs are needed.** A higher learning rate causes larger parameter updates per step, which can lead to convergence in fewer epochs (if the rate is well-chosen) or divergence (if it is too high). A lower learning rate requires more epochs because each update makes only a small change.[13]

### Early stopping

[Early stopping](/wiki/early_stopping) is the most widely used technique for determining when to end training.[9] Rather than fixing the epoch count in advance, the practitioner sets a large maximum number of epochs and monitors the model's performance on a held-out [validation set](/wiki/validation_set). Training halts when the validation metric (typically validation loss) has not improved for a specified number of consecutive epochs, known as the **patience** parameter.

The process works as follows:

1. After each epoch (or at regular intervals), evaluate the model on the validation set.
2. Track the best validation performance observed so far.
3. If the validation performance does not improve for a specified number of consecutive epochs (the "patience" parameter), stop training.
4. Restore the model weights from the epoch with the best validation performance.

For example, with a patience of 10, training continues as long as there has been at least one improvement in validation loss within the last 10 epochs. Once 10 consecutive epochs pass without improvement, training halts.
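The procedure above can be sketched framework-free by replaying a list of per-epoch validation losses. The function name `early_stopping_epoch` and the example loss values are made up for illustration; a real training loop would also save and restore model weights at the best epoch.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is where training halts; best_epoch is the epoch whose
    weights would be restored.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses), best_epoch

losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.75]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

With a patience of 3, training stops after epoch 6 because epochs 4, 5, and 6 all failed to beat the loss reached at epoch 3.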

Early stopping acts as a form of [regularization](/wiki/regularization) because it limits how far the model's weights can move from their initial values. This constraint on the effective complexity of the model helps prevent overfitting.[1]

Common early stopping settings include:

| Parameter | Typical value | Purpose |
|---|---|---|
| Patience | 5 to 10 epochs | How many epochs without improvement before stopping |
| Min delta | 0.001 to 0.01 | Minimum change to count as an improvement |
| Restore best weights | True | Revert to the weights from the best epoch |

In [Keras](/wiki/keras), early stopping is implemented as a callback:

```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    min_delta=0.001,
    restore_best_weights=True
)

# val_loss is only reported when validation data is supplied
model.fit(X_train, y_train, epochs=200, validation_split=0.2, callbacks=[early_stop])
```

### Learning curves

Plotting learning curves (training loss and validation loss against epoch number) is essential for diagnosing whether a model needs more or fewer epochs. There are three typical patterns:

- **Both curves still decreasing**: The model is underfitting. Train for more epochs.
- **Training curve decreasing, validation curve increasing**: The model is overfitting. Reduce epochs or apply [regularization](/wiki/regularization).
- **Both curves plateaued**: The model has converged. Additional epochs will not help.
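The three patterns can be turned into a rough automatic check. The sketch below is only a heuristic: the function `diagnose`, the look-back `window`, and the tolerance `tol` are assumptions chosen for illustration, and noisy real-world curves usually need smoothing before a rule like this is reliable.

```python
def diagnose(train_losses, val_losses, window=3, tol=1e-3):
    """Classify the recent trend of two loss curves (a rough heuristic)."""
    def trend(values):
        # Compare the latest loss with the loss `window` epochs earlier.
        change = values[-1] - values[-1 - window]
        if change < -tol:
            return "down"
        if change > tol:
            return "up"
        return "flat"

    train, val = trend(train_losses), trend(val_losses)
    if train == "down" and val == "down":
        return "underfitting: train for more epochs"
    if train == "down" and val == "up":
        return "overfitting: reduce epochs or regularize"
    if train == "flat" and val == "flat":
        return "converged: more epochs will not help"
    return "inconclusive"

print(diagnose([1.0, 0.8, 0.6, 0.5, 0.4], [1.1, 0.9, 0.7, 0.6, 0.5]))
print(diagnose([0.5, 0.4, 0.3, 0.2, 0.1], [0.6, 0.55, 0.6, 0.65, 0.7]))
```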

### Heuristic guidelines

As a starting point, typical epoch counts vary by task:

| Task | Typical epoch range | Notes |
|---|---|---|
| Image classification (from scratch, e.g., ImageNet) | 90 to 300 | Standard [ResNet](/wiki/resnet) on [ImageNet](/wiki/imagenet) uses 90 epochs with step decay; longer schedules sometimes improve results |
| [Fine-tuning](/wiki/fine_tuning) pretrained models | 2 to 10 | Pretrained weights already encode useful features; fewer epochs suffice |
| [NLP](/wiki/natural_language_processing) classification | 3 to 10 | [BERT](/wiki/bert) fine-tuning typically uses 3 to 4 epochs[6] |
| NLP pre-training (BERT, GPT) | 1 to 40 over the corpus | BERT was pre-trained for roughly 40 epochs over its 3.3 billion word corpus[6]; models trained on much larger corpora see most data only once |
| Small tabular datasets | 50 to 500 | More epochs needed due to limited data |
| [GAN](/wiki/generative_adversarial_network) training | 100 to 1000+ | GANs are notoriously unstable and may require long training to stabilize adversarial dynamics |
| LLM pre-training | 1 to 3 | Most data is seen only once; measured in tokens rather than epochs |

These are rough guidelines, not rules. The right number of epochs should always be verified empirically using validation performance.

In modern [large language model](/wiki/large_language_model) training, the concept of epochs has shifted. Models like [GPT-3](/wiki/gpt-3) were trained for less than one epoch over much of their training corpora, meaning not every example is seen even once.[7] The total number of training tokens (rather than epochs) becomes the primary measure of training duration. [Chinchilla scaling laws](/wiki/chinchilla_scaling) (Hoffmann et al., 2022) suggest that the compute-optimal number of training tokens scales roughly in proportion to model size. The authors fit this relationship to over 400 trained models, and the result is commonly summarized as about 20 tokens per parameter.[2]

## Why is training data reshuffled between epochs?

Reshuffling the training data at the beginning of each epoch is standard practice and has important theoretical and practical benefits. When the data is presented in the same order every epoch, the model can develop sensitivity to the sequence of examples rather than learning from the data itself, which is especially problematic when data is sorted by class or some other systematic ordering. By randomly permuting the order of samples each epoch, the sequence of gradient updates varies, which helps the optimizer escape local minima and achieve better generalization.[1]

Without shuffling, the model would see the same sequence of batches every epoch. This can create biases in the gradient estimates, particularly if consecutive batches contain similar examples. Shuffling breaks these correlations and improves the quality of gradient updates.

Research on random reshuffling in [SGD](/wiki/stochastic_gradient_descent_sgd) has shown that it can converge faster than sampling with replacement, and it avoids the order dependence of processing data in a single fixed order (shuffling only once before training is sometimes called "shuffle once").[10] Modern deep learning frameworks make per-epoch reshuffling easy, but the defaults differ: `model.fit()` in Keras ([TensorFlow](/wiki/tensorflow)) shuffles by default, whereas the [PyTorch](/wiki/pytorch) `DataLoader` defaults to `shuffle=False` and must be asked to shuffle explicitly.

Some training frameworks use a [data loader](/wiki/data_loader) that samples batches with replacement from the dataset rather than iterating through the data in a fixed order. In this case, the concept of an epoch is approximate: after seeing N batches (where N equals the dataset size divided by the batch size), the model has seen approximately one epoch's worth of data, though some examples may have been seen more than once and others not at all.
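How approximate this "epoch" is can be estimated with a small simulation. The sketch below draws as many indices as there are examples, with replacement, and measures the share of examples never drawn; the function name `fraction_unseen` and the dataset size are illustrative. For large datasets the share approaches 1/e, or about 37 percent.

```python
import random

def fraction_unseen(num_examples: int, seed: int = 0) -> float:
    """Draw num_examples indices with replacement; return the share never drawn."""
    rng = random.Random(seed)
    seen = {rng.randrange(num_examples) for _ in range(num_examples)}
    return 1 - len(seen) / num_examples

# One "epoch's worth" of draws over 100,000 examples
print(round(fraction_unseen(100_000), 3))
```

By contrast, iterating over a shuffled permutation of the dataset leaves no example unseen within an epoch.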

In PyTorch, the `DataLoader` class provides a `shuffle` parameter:

```python
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(num_epochs):
    for batch in train_loader:  # Data is reshuffled each epoch
        # training step
        pass
```

## How are epochs expressed in different frameworks?

The way epochs are expressed in code varies between frameworks.

### PyTorch

PyTorch does not have a built-in "epochs" parameter in its core API. Instead, the training loop is written explicitly by the programmer, with an outer loop over epochs and an inner loop over batches from a `DataLoader`:

```python
for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

This explicit loop gives full control over what happens at each epoch boundary (logging, checkpointing, learning rate adjustments).

### TensorFlow / Keras

[Keras](/wiki/keras) abstracts the epoch loop into `model.fit()`, where the `epochs` parameter specifies the total number of passes:

```python
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```

Keras automatically handles batching, shuffling, and per-epoch callbacks such as [early stopping](/wiki/early_stopping), learning rate scheduling, and checkpointing.

## How does epoch differ from step in modern training?

In classical machine learning, the epoch was the natural unit of training progress. Datasets were relatively small, models were trained for many epochs, and evaluation checkpoints were placed at epoch boundaries.

Modern large-scale training has shifted toward using **steps** (iterations) as the primary unit of measurement. There are several reasons for this:

**Dataset size**: When training on datasets with billions of examples (e.g., [Common Crawl](/wiki/common_crawl) for language models or [LAION](/wiki/laion)-5B for image models), a single epoch may take days or weeks of compute. Tracking progress in steps rather than epochs provides finer-grained monitoring.

**Sub-epoch training**: Many modern models do not complete even one full epoch. GPT-3 was trained on approximately 300 billion tokens sampled from a corpus of several hundred billion tokens, but with non-uniform sampling weights. Smaller, higher-quality sources were seen multiple times, while the larger Common Crawl portion was seen less than once.[7] In this setting, the concept of an "epoch" loses its original meaning.

**Learning rate schedules**: Modern [learning rate](/wiki/learning_rate) schedulers (cosine annealing, linear warmup + decay) are typically defined in terms of total training steps rather than epochs.[14] For instance, "warmup for 2,000 steps then decay over 500,000 steps" is more common than "warmup for 1 epoch."
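A step-based schedule of this kind can be written as a pure function of the step counter. The sketch below combines linear warmup with cosine decay, using the 2,000-step warmup and 500,000-step decay from the example above; the function name `lr_at_step` and the peak and minimum learning rates are placeholder assumptions, not recommended values.

```python
import math

def lr_at_step(step, peak_lr=3e-4, min_lr=3e-5,
               warmup_steps=2_000, decay_steps=500_000):
    """Linear warmup to peak_lr, then cosine decay to min_lr, all in steps."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, capped at 1.0.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 1_999, 252_000, 502_000):
    print(step, lr_at_step(step))
```

Because nothing in the function refers to the dataset size, the same schedule applies unchanged whether the run covers a fraction of an epoch or several epochs.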

**Evaluation and checkpointing**: In large-scale training, models are evaluated and checkpointed every N steps (e.g., every 1,000 or 5,000 steps) rather than at epoch boundaries. This provides more frequent feedback on training progress.

**Reproducibility**: Specifying training duration in steps makes it easier to reproduce results across different hardware configurations. Changing the number of GPUs changes how many examples are processed in each forward and backward pass, but if gradient accumulation is adjusted to keep the effective batch size constant, the total number of optimizer steps stays the same.
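The bookkeeping behind this can be sketched as follows. The helper `accumulation_steps` and the batch sizes are hypothetical; the point is only that the effective batch size is the product of per-device batch size, device count, and accumulation steps.

```python
def accumulation_steps(effective_batch, per_device_batch, num_devices):
    """Gradient-accumulation steps needed to keep the effective batch size fixed."""
    per_pass = per_device_batch * num_devices
    if effective_batch % per_pass:
        raise ValueError("effective batch must be a multiple of "
                         "per_device_batch * num_devices")
    return effective_batch // per_pass

# The same effective batch of 1,024 sequences on two hardware setups
print(accumulation_steps(1024, 16, 8))   # 8
print(accumulation_steps(1024, 16, 64))  # 1
```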

| Unit | Traditional ML | Modern large-scale training |
|---|---|---|
| Primary progress unit | Epoch | Step (iteration) |
| Dataset passes | Many (10-300+ epochs) | Often less than 1 epoch |
| Evaluation frequency | Once per epoch | Every N steps |
| LR schedule defined by | Epoch milestones | Step counts |
| Typical dataset size | Thousands to millions of examples | Billions of examples or tokens |

Despite this shift, the epoch remains a useful concept for smaller-scale training, fine-tuning, and educational settings where the dataset is small enough that multiple passes are both practical and necessary.

### Epoch vs. step in LLM training pipelines

In [large language model](/wiki/large_language_model) pre-training, the distinction between epochs and steps becomes especially pronounced. Training budgets are specified in tokens (e.g., "train on 2 trillion tokens") rather than epochs, and progress is tracked by the number of tokens consumed. A single "step" processes a fixed number of tokens determined by the [batch size](/wiki/batch_size) and sequence length. For example, with a batch size of 1,024 sequences and a sequence length of 4,096 tokens, each step processes roughly 4.2 million tokens.

The total number of steps is then:

`Total steps = Total training tokens / (batch size * sequence length)`
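As a quick check of this formula, the sketch below plugs in the 2 trillion token budget and the batch configuration from the example above. The function name `total_steps` is illustrative, and integer division is used on the assumption that a final partial step is dropped.

```python
def total_steps(total_tokens, batch_size, seq_len):
    """Return (optimizer steps, tokens per step) for a token budget."""
    tokens_per_step = batch_size * seq_len
    return total_tokens // tokens_per_step, tokens_per_step

steps, tokens_per_step = total_steps(2_000_000_000_000, 1_024, 4_096)
print(tokens_per_step)  # 4194304
print(steps)            # 476837
```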

All scheduling decisions, including [learning rate](/wiki/learning_rate) warmup, decay, and checkpointing intervals, are anchored to these step counts. This convention carries through to [fine-tuning](/wiki/fine-tuning) as well, where practitioners typically specify training in terms of steps rather than epochs, particularly when using [gradient accumulation](/wiki/gradient_accumulation) to achieve large effective batch sizes.

## How do epochs differ in online learning vs. batch learning?

The concept of an epoch is most meaningful in **batch learning** (also called offline learning), where the full dataset is available before training begins and can be iterated over multiple times. In this setting, one epoch is a well-defined unit: a single pass through the finite dataset.

In **online learning**, data arrives as a continuous stream and each example is typically used only once before being discarded. The notion of an epoch does not naturally apply because there is no fixed dataset to "pass through." Instead, online learning systems measure progress in terms of the number of samples processed or the number of weight updates performed. Online learning is common in scenarios with non-stationary data distributions, such as recommendation systems and financial trading models.

Some hybrid approaches exist where streaming data is accumulated into buffers and mini-epochs are run over the buffer before it is refreshed with new data.

## Epochs in large-scale LLM training

Training modern [large language models](/wiki/large_language_model) (LLMs) has shifted the emphasis from epochs to total **tokens seen** during training. Because LLM pre-training datasets can contain trillions of tokens, many models are trained for roughly a single epoch, meaning most tokens are encountered about once.

### Tokens seen vs. epochs

For LLMs, reporting the number of tokens processed is more informative than the number of epochs. Key examples include:

| Model | Parameters | Training tokens | Approximate epochs |
|---|---|---|---|
| [GPT-3](/wiki/gpt-3) | 175B | 300B tokens | Less than 1 overall (some data subsets repeated) |
| [LLaMA](/wiki/llama) 1 | 7B to 65B | 1.0T to 1.4T tokens | ~1 |
| [LLaMA](/wiki/llama) 2 | 7B to 70B | 2.0T tokens | ~1 |
| [LLaMA](/wiki/llama) 3.1 | 8B to 405B | 15T+ tokens | ~1 |
| [Chinchilla](/wiki/chinchilla) | 70B | 1.4T tokens | ~1 |

The reason single-epoch training dominates at this scale is that pre-training corpora are large enough that one pass already provides sufficient gradient updates.[8] Repeating data yields diminishing returns and can degrade model quality.[3]

## Should training data be repeated across multiple epochs?

A major question in modern ML is whether training data should be repeated across multiple epochs, or whether each training example should be seen only once. This debate has become especially relevant as the supply of high-quality text data for [large language models](/wiki/large_language_model) approaches its limits.

### The single-epoch paradigm

Large language models, including those trained according to the [Chinchilla](/wiki/chinchilla) scaling laws (Hoffmann et al., 2022), have typically been trained for roughly one epoch.[2] The reasoning was straightforward: with enough data available, there was no need to repeat examples. Repeated data offers diminishing returns because the model has already extracted signal from previously seen examples, and each repetition provides progressively less new information.

### Chinchilla scaling laws and data repetition

The Chinchilla scaling laws (Hoffmann et al., 2022) established that compute-optimal training requires scaling model size and training data equally. Drawing on more than 400 models trained across a range of sizes and token budgets, the recommended ratio is approximately 20 tokens per parameter: a 70 billion parameter model should be trained on roughly 1.4 trillion tokens.[2] This framework implicitly assumes each token is seen only once.
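The 20 tokens per parameter rule of thumb is easy to apply directly. The snippet below is a minimal sketch of that approximation only, not of the fitted scaling law itself; the constant and function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # approximate ratio quoted above

def compute_optimal_tokens(num_params: int) -> int:
    """Rule-of-thumb training token count for a given parameter count."""
    return TOKENS_PER_PARAM * num_params

print(compute_optimal_tokens(70_000_000_000))  # 1400000000000
```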

### Data-constrained scaling laws

When unique data is limited, multi-epoch training becomes necessary. Muennighoff et al. (2023) conducted a systematic study of multi-epoch training for language models. Their paper, "Scaling Data-Constrained Language Models," reported 400 training runs spanning models from 10 million to 9 billion parameters, with up to 1,500 epochs of repetition and training budgets of up to roughly 900 billion tokens.[3] Key findings included:

- For a fixed compute budget, up to **4 epochs** of repetition is nearly as good as unique data. The authors state plainly: "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data."[3]
- Beyond 4 epochs, the value of additional compute allocated to repeated data decayed progressively, eventually approaching zero benefit.
- When data is constrained, compute should be allocated to both more parameters and more epochs, with epochs scaling slightly faster than parameters.

Their findings suggest that when data is scarce, additional compute is better spent training a smaller model for more epochs than a larger model for fewer epochs, a departure from the standard Chinchilla recommendation of scaling parameters and data equally. If your data budget is limited, repeating data up to about 4 times is nearly free in terms of model quality.

| Epochs of repetition | Impact on loss (vs. unique data) | Practical recommendation |
|---|---|---|
| 1 (no repetition) | Baseline | Ideal if sufficient data is available |
| 2-4 | Negligible degradation | Safe to use when data is constrained |
| 4-16 | Modest degradation | Acceptable with diminishing returns |
| 16+ | Significant degradation | Avoid unless data is extremely scarce |
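The table can be read as a lookup from the number of passes to an expected regime. The sketch below computes the epochs implied by a token budget and a stock of unique tokens, then applies the table's thresholds; the function name `repetition_regime` and the example figures are hypothetical, and the cut-offs are the coarse bands from the table rather than values from a fitted law.

```python
def repetition_regime(target_tokens, unique_tokens):
    """Return (epochs, regime) using the coarse bands from the table above."""
    epochs = target_tokens / unique_tokens
    if epochs <= 1:
        regime = "baseline (no repetition)"
    elif epochs <= 4:
        regime = "negligible degradation"
    elif epochs <= 16:
        regime = "modest degradation"
    else:
        regime = "significant degradation"
    return epochs, regime

# A 1.4 trillion token budget with only 500 billion unique tokens
print(repetition_regime(1_400_000_000_000, 500_000_000_000))
```

In this example the budget implies 2.8 passes over the data, which falls in the band where repetition costs very little.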

### Multi-epoch training in fine-tuning

While pre-training typically uses one or a few epochs, [fine-tuning](/wiki/fine-tuning) and post-training commonly use multiple epochs over smaller, curated datasets. Recent examples include:

- **[LLaMA 3](/wiki/llama)**: Multi-epoch supervised fine-tuning on instruction data
- **[DeepSeek-R1](/wiki/deepseek)**: 2-3 epochs of fine-tuning for reasoning capabilities
- **LIMO (2025)**: 15 epochs on a small, highly curated reasoning dataset, demonstrating that repeated training on high-quality data can be beneficial

Recent research (2025) has shown that training for many epochs on a small, carefully selected subset of fine-tuning data can actually outperform training for one epoch on a much larger dataset, especially for reasoning tasks. The LIMO study (Ye et al., COLM 2025) fine-tuned for 15 epochs on just 817 curated examples and reached 63.3 percent on AIME 2024 and 95.6 percent on MATH500, surpassing models trained on roughly 100 times more data.[15] This challenges the assumption that more unique data is always better and suggests that data quality matters more than quantity in the post-training phase.

### Running out of training data

[Epoch AI](/wiki/epoch_ai) estimates that the total effective stock of human-generated public text data is on the order of 300 trillion tokens, with a 90 percent confidence interval of 100 trillion to 1,000 trillion, and that leading labs may exhaust this supply within the next several years.[11][12] In their updated analysis, the researchers write that their "80% confidence interval is that the data stock will be fully utilized at some point between 2026 and 2032," depending on data quality thresholds and the degree of model overtraining.[12] This projection has made multi-epoch training research increasingly urgent, as future models may have no choice but to train on repeated data, use synthetic data generation, or develop more data-efficient training methods.

## How is training progress monitored per epoch?

Tracking metrics at the end of each epoch provides essential insight into training dynamics. The most commonly monitored quantities include:

- **Training loss**: Should decrease steadily across epochs. A plateau suggests the model has converged or the learning rate is too small.
- **Validation loss**: Should decrease alongside training loss. An increase indicates the onset of overfitting.
- **Accuracy or other task-specific metrics**: Provide a more interpretable measure of model quality than raw loss values.
- **Learning rate**: If using a learning rate schedule, logging the current rate per epoch helps diagnose training behavior.

Tools such as TensorBoard, Weights & Biases, and [MLflow](/wiki/mlflow) provide real-time visualization of per-epoch metrics, making it straightforward to detect problems and decide when to stop training.

## Computational cost considerations

The number of epochs directly affects training cost. Total compute can be estimated as:

**Total FLOPs = FLOPs per iteration × Iterations per epoch × Number of epochs**

Several practical trade-offs are involved:

| Factor | Effect of increasing | Trade-off |
|---|---|---|
| Number of epochs | More weight updates, longer training time | Better learning vs. overfitting risk and higher cost |
| [Batch size](/wiki/batch_size) | Fewer iterations per epoch, better GPU utilization | Faster wall-clock time per epoch but potentially worse generalization |
| [Learning rate](/wiki/learning_rate) | Faster convergence per step | May overshoot or diverge if too large |
| Dataset size | More iterations per epoch | More data per epoch but each epoch takes longer |

For large-scale training, the cost is often measured in **GPU hours**: the product of the number of GPUs used and the elapsed training time. For instance, training [ResNet](/wiki/resnet)-50 on [ImageNet](/wiki/imagenet) for 90 epochs has been reported to take roughly 29 hours on 8 NVIDIA P100 GPUs.[5] LLM pre-training can require millions of GPU hours.
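Both cost measures in this section reduce to simple products. In the sketch below, the function names are illustrative and the FLOPs-per-iteration figure is a made-up placeholder; only the 8 GPU, 29 hour, 196 iteration, and 90 epoch figures come from examples in this article.

```python
def total_flops(flops_per_iteration, iterations_per_epoch, num_epochs):
    """Total training compute as the product given in this section."""
    return flops_per_iteration * iterations_per_epoch * num_epochs

def gpu_hours(num_gpus, elapsed_hours):
    """GPU hours: number of GPUs multiplied by elapsed training time."""
    return num_gpus * elapsed_hours

# Hypothetical 1e12 FLOPs per iteration, 196 iterations per epoch, 90 epochs
print(total_flops(10**12, 196, 90))  # 17640000000000000

# 8 GPUs running for 29 hours
print(gpu_hours(8, 29))  # 232
```

Halving the number of epochs halves both quantities, which is why stopping early is such a direct cost saving.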

Reducing the number of epochs (by using early stopping or better learning rate schedules) is one of the most straightforward ways to reduce training cost without architectural changes.

## ELI5: what is an epoch?

Imagine you have a stack of 100 flashcards for studying vocabulary. Going through all 100 flashcards once, from the first to the last, is one epoch. After finishing, you shuffle the cards and go through all 100 again. That second pass is a second epoch. Each time you complete the full stack, you have finished one more epoch.

You need multiple passes because you will not memorize everything on the first try. But if you study the same cards hundreds of times, you might memorize the exact cards rather than truly understanding the words, which is like overfitting. The trick is finding the right number of passes: enough to learn well, but not so many that you only memorize rather than understand.

Another way to think about it: training is like playing a memory game in which you try to match pairs of picture cards. Each attempt at matching the whole set is one epoch. The more rounds you play, the better you become, but play the same set too often and you master those particular cards so well that you make mistakes when given new ones.

## References

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 8: Optimization for Training Deep Models. [https://www.deeplearningbook.org/](https://www.deeplearningbook.org/)
2. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." *Advances in Neural Information Processing Systems ([NeurIPS](/wiki/neurips)) 2022*. [https://arxiv.org/abs/2203.15556](https://arxiv.org/abs/2203.15556)
3. Muennighoff, N., Rush, A. M., Barak, B., et al. (2023). "Scaling Data-Constrained Language Models." *NeurIPS 2023*. [https://arxiv.org/abs/2305.16264](https://arxiv.org/abs/2305.16264)
4. Bottou, L. (2010). "Large-Scale Machine Learning with Stochastic Gradient Descent." *Proceedings of COMPSTAT 2010*.
5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR 2016*. [https://arxiv.org/abs/1512.03385](https://arxiv.org/abs/1512.03385)
6. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *NAACL-HLT 2019*. [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805)
7. Brown, T. B., Mann, B., Ryder, N., et al. (2020). "Language Models are Few-Shot Learners." *NeurIPS 2020*. [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165)
8. Touvron, H., Lavril, T., Izacard, G., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971)
9. Prechelt, L. (1998). "Early Stopping - But When?" In *Neural Networks: Tricks of the Trade*, Springer. [https://doi.org/10.1007/3-540-49430-8_3](https://doi.org/10.1007/3-540-49430-8_3)
10. Mishchenko, K., Khaled, A., & Richtarik, P. (2020). "Random Reshuffling: Simple Analysis with Vast Improvements." *NeurIPS 2020*. [https://arxiv.org/abs/2006.05988](https://arxiv.org/abs/2006.05988)
11. Villalobos, P., Sevilla, J., Heim, L., et al. (2022). "Will We Run Out of Data? An Analysis of the Limits of Scaling Datasets in Machine Learning." *Epoch AI*. [https://arxiv.org/abs/2211.04325](https://arxiv.org/abs/2211.04325)
12. Villalobos, P., Ho, A., Sevilla, J., et al. (2024). "Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data." *Epoch AI / ICML 2024*. [https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data](https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data)
13. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." *Neural Networks: Tricks of the Trade*. Springer. [https://arxiv.org/abs/1206.5533](https://arxiv.org/abs/1206.5533)
14. Smith, L.N. (2017). "Cyclical Learning Rates for Training Neural Networks." *IEEE Winter Conference on Applications of Computer Vision (WACV)*. [https://arxiv.org/abs/1506.01186](https://arxiv.org/abs/1506.01186)
15. Ye, Y., Huang, Z., Xiao, Y., Chern, E., Xia, S., & Liu, P. (2025). "LIMO: Less is More for Reasoning." *COLM 2025*. [https://arxiv.org/abs/2502.03387](https://arxiv.org/abs/2502.03387)