See also: Machine learning terms
An epoch in machine learning refers to one complete pass through the entire training dataset during the model training process. Over the course of a single epoch, every example in the dataset is presented to the model exactly once to compute gradients and update model parameters, allowing the algorithm to learn from all available data before repeating the cycle. The number of epochs is a key hyperparameter that practitioners set to control how long training continues and how many times the model revisits the data.
In practice, training a model for just one epoch is rarely sufficient to learn all the patterns in a dataset. Most training procedures involve running for multiple epochs so the model can iteratively refine its weights and biases based on the full distribution of training examples. However, training for too many epochs risks overfitting, where the model memorizes the training data rather than learning generalizable patterns.
An epoch does not process the entire dataset at once. Instead, the dataset is divided into smaller subsets called batches (or mini-batches). Each time the model processes one batch and updates its weights through a forward pass and backward pass, that constitutes one iteration (also called a step in most frameworks). Understanding how an epoch, a batch, and an iteration fit together is essential for configuring training correctly.
The relationship between these concepts is straightforward:
Iterations per epoch = Total training examples / Batch size
For example, consider a dataset of 50,000 training images with a batch size of 256: each epoch then comprises 50,000 / 256 ≈ 195.3, which rounds up to 196 iterations, the last of which processes only 80 images.
A smaller example: with 10,000 training examples and a batch size of 100, each epoch contains 10,000 / 100 = 100 iterations; after 5 epochs the model has seen each example 5 times and performed 500 total parameter updates.
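This arithmetic can be restated in a few lines of code. The sketch below simply reuses the figures from the examples above; it is illustrative rather than tied to any particular dataset:

```python
import math

num_examples = 50_000   # total training examples
batch_size = 256        # examples per batch
num_epochs = 5

# Round up so the smaller final batch still counts as an iteration
iterations_per_epoch = math.ceil(num_examples / batch_size)   # 196
total_updates = num_epochs * iterations_per_epoch             # 980

print(iterations_per_epoch, total_updates)
```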
The following table summarizes these three concepts:
| Concept | Definition | Scope | Example (50,000 samples, batch size 256) |
|---|---|---|---|
| Batch | A subset of training data processed together in one forward/backward pass | A fraction of the dataset | 256 samples |
| Iteration (step) | One weight update after processing a single batch | One batch | 1 of 196 per epoch |
| Epoch | One complete pass through the entire dataset | The full dataset | All 196 iterations |
And, framed around the smaller 10,000-sample example:
| Term | What it measures | Example (10,000 samples, batch size 100) |
|---|---|---|
| Batch size | Number of samples per update | 100 |
| Iteration (step) | One forward + backward pass + update | 1 iteration = 100 samples processed |
| Epoch | One full pass through the dataset | 1 epoch = 100 iterations = 10,000 samples |
| Total iterations | Epochs multiplied by iterations per epoch | 5 epochs = 500 iterations |
When the total number of samples is not evenly divisible by the batch size, the last batch in each epoch will be smaller than the others. Most training frameworks handle this automatically, either by using the smaller batch as-is or by dropping the last incomplete batch.
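In PyTorch, for example, this behavior is controlled by the DataLoader's drop_last flag; the snippet below assumes a dataset object already exists:

```python
from torch.utils.data import DataLoader

# With 50,000 samples and a batch size of 256, the 196th batch holds only 80 samples.
# drop_last=False (the default) keeps that smaller batch; drop_last=True discards it,
# leaving 195 full batches per epoch.
loader = DataLoader(dataset, batch_size=256, drop_last=False)
```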
In most frameworks, the terms "iteration" and "step" are interchangeable. TensorFlow and Keras use "steps_per_epoch" while PyTorch users typically speak in terms of iterations.
A single pass through the data is usually insufficient for a model to converge to a good solution. Stochastic gradient descent (SGD) and its variants update model weights based on noisy gradient estimates computed from individual mini-batches. These updates point roughly toward the optimum but require many repeated passes to accumulate enough learning signal for convergence.
When a model trains for too few epochs, it has not had enough exposure to the data to learn the underlying patterns. The result is underfitting: high error on both training and validation data. The loss curve will still be trending downward when training stops, indicating that additional epochs would improve performance.
Training for too many epochs is one of the most common causes of overfitting. As training progresses, the model's capacity to fit the training data increases. Early in training, the model learns general patterns that apply to both training and validation data, and both losses decrease together. After a certain point, however, the model begins to memorize specifics of the training set: noise, outliers, and idiosyncrasies that do not generalize. Training loss continues to decrease while validation loss begins to increase. This gap signals overfitting, and the point of divergence marks approximately where training should have stopped.
The relationship between epochs and overfitting can be visualized as a learning curve: training loss keeps falling as epochs accumulate, while validation loss falls at first, levels off, and then begins to rise once overfitting sets in.
The optimal number of epochs corresponds to the point where validation loss is minimized, which is exactly what early stopping is designed to detect.
There is no universal formula for the optimal number of epochs. The optimal count depends on the dataset size, model complexity, learning rate, and other hyperparameters. Several strategies and guidelines help practitioners make this decision.
Larger datasets generally need fewer epochs. When a dataset is large and diverse, the model can learn a good representation with fewer passes because each epoch exposes it to a wide variety of examples. Conversely, small datasets may require more epochs because each pass provides limited information.
More complex models may need more epochs. A deep neural network with millions of parameters takes longer to train because the optimizer must navigate a higher-dimensional parameter space. Simple models like logistic regression often converge in just a few epochs.
The learning rate affects how many epochs are needed. A higher learning rate causes larger parameter updates per step, which can lead to convergence in fewer epochs (if the rate is well-chosen) or divergence (if it is too high). A lower learning rate requires more epochs because each update makes only a small change.
Early stopping is the most widely used technique for determining when to end training. Rather than fixing the epoch count in advance, the practitioner sets a large maximum number of epochs and monitors the model's performance on a held-out validation set. Training halts when the validation metric (typically validation loss) has not improved for a specified number of consecutive epochs, known as the patience parameter.
The process works as follows:
1. After each epoch, evaluate the model on a held-out validation set.
2. If the validation metric improves by at least the minimum delta, record the new best value (and typically save the current weights) and reset the patience counter.
3. If it does not improve, increment the counter.
4. When the counter reaches the patience value, stop training and, optionally, restore the weights from the best epoch.
For example, with a patience of 10, training continues as long as there has been at least one improvement in validation loss within the last 10 epochs. Once 10 consecutive epochs pass without improvement, training halts.
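A minimal sketch of this patience logic, written independently of any framework (train_one_epoch, validate, and save_checkpoint are hypothetical helpers, not library functions):

```python
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 10
min_delta = 0.001

for epoch in range(max_epochs):
    train_one_epoch(model)        # hypothetical helper: one full pass over the training set
    val_loss = validate(model)    # hypothetical helper: loss on the held-out validation set

    if val_loss < best_val_loss - min_delta:   # improvement large enough to count
        best_val_loss = val_loss
        epochs_without_improvement = 0
        save_checkpoint(model)    # hypothetical helper: remember the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                 # stop: no improvement for `patience` consecutive epochs
```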
Early stopping acts as a form of regularization because it limits how far the model's weights can move from their initial values. This constraint on the effective complexity of the model helps prevent overfitting.
Common early stopping settings include:
| Parameter | Typical value | Purpose |
|---|---|---|
| Patience | 5 to 10 epochs | How many epochs without improvement before stopping |
| Min delta | 0.001 to 0.01 | Minimum change to count as an improvement |
| Restore best weights | True | Revert to the weights from the best epoch |
In Keras, early stopping is implemented as a callback:
```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    min_delta=0.001,
    restore_best_weights=True
)

# A validation set is required so that val_loss can be monitored
model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])
```
Plotting learning curves (training loss and validation loss against epoch number) is essential for diagnosing whether a model needs more or fewer epochs. There are three typical patterns:
- Underfitting: both losses are still decreasing when training stops, so more epochs (or a higher-capacity model) would help.
- Good fit: both losses decrease and then plateau at similar values.
- Overfitting: training loss keeps decreasing while validation loss bottoms out and starts rising, so training ran too long.
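As an illustration, the curves can be plotted directly from the history object that Keras returns from model.fit (assuming training was run with a validation split):

```python
import matplotlib.pyplot as plt

# history = model.fit(X_train, y_train, epochs=50, validation_split=0.2)
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```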
As a starting point, typical epoch counts vary by task:
| Task | Typical epoch range | Notes |
|---|---|---|
| Image classification (from scratch, e.g., ImageNet) | 90 to 300 | Standard ResNet on ImageNet uses 90 epochs with step decay; longer schedules sometimes improve results |
| Fine-tuning pretrained models | 2 to 10 | Pretrained weights already encode useful features; fewer epochs suffice |
| NLP classification | 3 to 10 | BERT fine-tuning typically uses 3 to 4 epochs |
| NLP pre-training (BERT, GPT) | 1 to 3 over the corpus | Large corpora; some data may be seen only once |
| Small tabular datasets | 50 to 500 | More epochs needed due to limited data |
| GAN training | 100 to 1000+ | GANs are notoriously unstable and may require long training to stabilize adversarial dynamics |
| LLM pre-training | 1 to 3 | Most data is seen only once; measured in tokens rather than epochs |
These are rough guidelines, not rules. The right number of epochs should always be verified empirically using validation performance.
In modern large language model training, the concept of epochs has shifted. Models like GPT-3 and GPT-4 are often trained for less than one epoch over their massive training corpora, meaning not every example is seen even once. The total number of training tokens (rather than epochs) becomes the primary measure of training duration. Chinchilla scaling laws (Hoffmann et al., 2022) suggest that the optimal number of training tokens scales roughly proportionally to model size.
Reshuffling the training data at the beginning of each epoch is standard practice and has important theoretical and practical benefits. When the data is presented in the same order every epoch, the model can develop sensitivity to the sequence of examples rather than learning from the data itself, which is especially problematic when data is sorted by class or some other systematic ordering. By randomly permuting the order of samples each epoch, the sequence of gradient updates varies, which helps the optimizer escape local minima and achieve better generalization.
Without shuffling, the model would see the same sequence of batches every epoch. This can create biases in the gradient estimates, particularly if consecutive batches contain similar examples. Shuffling breaks these correlations and improves the quality of gradient updates.
Research on random reshuffling in SGD has shown that it converges faster than sampling with replacement, and it performs at least as well as shuffling the data only once before training (sometimes called "shuffle once"). Modern deep learning frameworks make per-epoch reshuffling straightforward: Keras shuffles the training data by default in model.fit(), and PyTorch's DataLoader reshuffles every epoch when shuffle=True is set.
Some training frameworks use a data loader that samples batches with replacement from the dataset rather than iterating through the data in a fixed order. In this case, the concept of an epoch is approximate: after seeing N batches (where N equals the dataset size divided by the batch size), the model has seen approximately one epoch's worth of data, though some examples may have been seen more than once and others not at all.
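One way to express this in PyTorch is a RandomSampler with replacement enabled, sized to one nominal epoch (the dataset object is assumed to exist):

```python
from torch.utils.data import DataLoader, RandomSampler

# Draw len(dataset) samples with replacement: roughly one "epoch" of data,
# but some examples may appear several times and others not at all.
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```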
In PyTorch, the DataLoader class provides a shuffle parameter:
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(num_epochs):
    for batch in train_loader:  # data is reshuffled at the start of each epoch
        # training step
        pass
```
The way epochs are expressed in code varies between frameworks.
PyTorch does not have a built-in "epochs" parameter in its core API. Instead, the training loop is written explicitly by the programmer, with an outer loop over epochs and an inner loop over batches from a DataLoader:
```python
for epoch in range(num_epochs):              # outer loop: one pass over the dataset per epoch
    for inputs, targets in train_loader:     # inner loop: one iteration (step) per batch
        outputs = model(inputs)              # forward pass
        loss = criterion(outputs, targets)   # compute the loss for this batch
        optimizer.zero_grad()                # clear gradients from the previous step
        loss.backward()                      # backward pass: compute gradients
        optimizer.step()                     # update the model parameters
```
This explicit loop gives full control over what happens at each epoch boundary (logging, checkpointing, learning rate adjustments).
Keras abstracts the epoch loop into model.fit(), where the epochs parameter specifies the total number of passes:
```python
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```
Keras automatically handles batching, shuffling, and per-epoch callbacks such as early stopping, learning rate scheduling, and checkpointing.
In classical machine learning, the epoch was the natural unit of training progress. Datasets were relatively small, models were trained for many epochs, and evaluation checkpoints were placed at epoch boundaries.
Modern large-scale training has shifted toward using steps (iterations) as the primary unit of measurement. There are several reasons for this:
Dataset size: When training on datasets with billions of examples (e.g., Common Crawl for language models or LAION-5B for image models), a single epoch may take days or weeks of compute. Tracking progress in steps rather than epochs provides finer-grained monitoring.
Sub-epoch training: Many modern models do not complete even one full epoch. GPT-3 was trained on approximately 300 billion tokens sampled from a corpus of several hundred billion tokens, but with non-uniform sampling weights. Some documents were seen multiple times while others were seen only once. In this setting, the concept of an "epoch" loses its original meaning.
Learning rate schedules: Modern learning rate schedulers (cosine annealing, linear warmup + decay) are typically defined in terms of total training steps rather than epochs. For instance, "warmup for 2,000 steps then decay over 500,000 steps" is more common than "warmup for 1 epoch" (a step-based schedule is sketched after this list).
Evaluation and checkpointing: In large-scale training, models are evaluated and checkpointed every N steps (e.g., every 1,000 or 5,000 steps) rather than at epoch boundaries. This provides more frequent feedback on training progress.
Reproducibility: Specifying training duration in steps makes it easier to reproduce results across different hardware configurations. Changing the number of GPUs or the batch size changes the number of steps per epoch but not the total number of steps if gradient accumulation is adjusted accordingly.
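A minimal sketch of such a step-based schedule in PyTorch, using the warmup and decay figures quoted above (the model and the choice of AdamW are placeholders):

```python
import math
import torch

warmup_steps = 2_000
total_steps = 500_000

def lr_lambda(step):
    # Linear warmup followed by cosine decay, both defined in steps
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # `model` assumed to exist
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# scheduler.step() is then called once per training step, not once per epoch
```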
| Unit | Traditional ML | Modern large-scale training |
|---|---|---|
| Primary progress unit | Epoch | Step (iteration) |
| Dataset passes | Many (10-300+ epochs) | Often less than 1 epoch |
| Evaluation frequency | Once per epoch | Every N steps |
| LR schedule defined by | Epoch milestones | Step counts |
| Typical dataset size | Thousands to millions of examples | Billions of examples or tokens |
Despite this shift, the epoch remains a useful concept for smaller-scale training, fine-tuning, and educational settings where the dataset is small enough that multiple passes are both practical and necessary.
In large language model pre-training, the distinction between epochs and steps becomes especially pronounced. Training budgets are specified in tokens (e.g., "train on 2 trillion tokens") rather than epochs, and progress is tracked by the number of tokens consumed. A single "step" processes a fixed number of tokens determined by the batch size and sequence length. For example, with a batch size of 1,024 sequences and a sequence length of 4,096 tokens, each step processes roughly 4.2 million tokens.
The total number of steps is then:
Total steps = Total training tokens / (batch size * sequence length)
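For instance, with the batch size and sequence length above and an assumed 2-trillion-token budget:

```python
batch_size = 1_024        # sequences per step
seq_len = 4_096           # tokens per sequence
tokens_per_step = batch_size * seq_len            # 4,194,304 ≈ 4.2 million tokens

total_tokens = 2_000_000_000_000                  # assumed 2-trillion-token budget
total_steps = total_tokens // tokens_per_step     # ≈ 476,837 steps
```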
All scheduling decisions, including learning rate warmup, decay, and checkpointing intervals, are anchored to these step counts. This convention carries through to fine-tuning as well, where practitioners typically specify training in terms of steps rather than epochs, particularly when using gradient accumulation to achieve large effective batch sizes.
The concept of an epoch is most meaningful in batch learning (also called offline learning), where the full dataset is available before training begins and can be iterated over multiple times. In this setting, one epoch is a well-defined unit: a single pass through the finite dataset.
In online learning, data arrives as a continuous stream and each example is typically used only once before being discarded. The notion of an epoch does not naturally apply because there is no fixed dataset to "pass through." Instead, online learning systems measure progress in terms of the number of samples processed or the number of weight updates performed. Online learning is common in scenarios with non-stationary data distributions, such as recommendation systems and financial trading models.
Some hybrid approaches exist where streaming data is accumulated into buffers and mini-epochs are run over the buffer before it is refreshed with new data.
Training modern large language models (LLMs) has shifted the emphasis from epochs to total tokens seen during training. Because LLM pre-training datasets can contain trillions of tokens, many models are trained for roughly one epoch or less, so most tokens are encountered only once.
For LLMs, reporting the number of tokens processed is more informative than the number of epochs. Key examples include:
| Model | Parameters | Training tokens | Approximate epochs |
|---|---|---|---|
| GPT-3 | 175B | 300B tokens | ~1 (some data subsets repeated) |
| LLaMA 1 | 7B to 65B | 1.0T to 1.4T tokens | ~1 |
| LLaMA 2 | 7B to 70B | 2.0T tokens | ~1 |
| LLaMA 3.1 | 8B to 405B | 15T+ tokens | ~1 |
| Chinchilla | 70B | 1.4T tokens | ~1 |
The reason single-epoch training dominates at this scale is that pre-training corpora are large enough that one pass already provides sufficient gradient updates. Repeating data yields diminishing returns and can degrade model quality.
A major question in modern ML is whether training data should be repeated across multiple epochs, or whether each training example should be seen only once. This debate has become especially relevant as the supply of high-quality text data for large language models approaches its limits.
Early large language models, following the Chinchilla scaling laws (Hoffmann et al., 2022), were typically trained for roughly one epoch. The reasoning was straightforward: with enough data available, there was no need to repeat examples. Repeated data offers diminishing returns because the model has already extracted signal from previously seen examples, and each repetition provides progressively less new information.
The Chinchilla scaling laws (Hoffmann et al., 2022) established that compute-optimal training requires scaling model size and training data equally. The recommended ratio is approximately 20 tokens per parameter: a 70 billion parameter model should be trained on roughly 1.4 trillion tokens. This framework implicitly assumes each token is seen only once.
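In code form, the 20-tokens-per-parameter rule of thumb from the paragraph above reads:

```python
params = 70_000_000_000                        # 70 billion parameters
tokens_per_param = 20                          # approximate Chinchilla-optimal ratio
optimal_tokens = params * tokens_per_param     # 1.4 trillion tokens
```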
When unique data is limited, multi-epoch training becomes necessary. Muennighoff et al. (2023) conducted the first systematic study of multi-epoch training for language models. Their paper, "Scaling Data-Constrained Language Models," trained over 400 models ranging from 10 million to 9 billion parameters for up to 1,500 epochs. Key findings included:
- Repeating data for up to about 4 epochs yields a loss almost indistinguishable from training on entirely unique data.
- Beyond roughly 4 epochs, returns diminish, and after around 16 epochs additional repetition contributes little.
- Under a fixed data budget, the compute-optimal strategy shifts toward smaller models trained for more epochs rather than larger models trained for fewer.
Their findings suggest that when data is scarce, it is better to train a smaller model for more epochs rather than a larger model for fewer epochs, a reversal of the standard Chinchilla recommendation. When the data budget is limited, repeating data up to about four times is nearly free in terms of model quality.
| Epochs of repetition | Impact on loss (vs. unique data) | Practical recommendation |
|---|---|---|
| 1 (no repetition) | Baseline | Ideal if sufficient data is available |
| 2-4 | Negligible degradation | Safe to use when data is constrained |
| 4-16 | Modest degradation | Acceptable with diminishing returns |
| 16+ | Significant degradation | Avoid unless data is extremely scarce |
While pre-training typically uses one or a few epochs, fine-tuning and post-training commonly use multiple epochs over smaller, curated datasets.
Recent research (2025) has shown that training for many epochs on a small, carefully selected subset of fine-tuning data can actually outperform training for one epoch on a much larger dataset, especially for reasoning tasks. This challenges the assumption that more unique data is always better and suggests that data quality matters more than quantity in the post-training phase.
Epoch AI estimates that available high-quality web text amounts to roughly 300 trillion tokens and has projected that this supply could be exhausted for LLM training purposes between 2026 and 2032, depending on data quality thresholds and model scaling trajectories. This projection has made multi-epoch training research increasingly urgent, as future models may have no choice but to train on repeated data, use synthetic data generation, or develop more data-efficient training methods.
Tracking metrics at the end of each epoch provides essential insight into training dynamics. The most commonly monitored quantities include training loss, validation loss, task metrics such as accuracy, the current learning rate, and the wall-clock time per epoch.
Tools such as TensorBoard, Weights & Biases, and MLflow provide real-time visualization of per-epoch metrics, making it straightforward to detect problems and decide when to stop training.
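For example, per-epoch values can be logged to TensorBoard with PyTorch's SummaryWriter (the training and validation helpers here are hypothetical placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")   # hypothetical log directory

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model)   # hypothetical helper: average training loss
    val_loss = validate(model)            # hypothetical helper: validation loss
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/val", val_loss, epoch)

writer.close()
```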
The number of epochs directly affects training cost. Total compute is proportional to:
Total FLOPs = FLOPs per iteration x Iterations per epoch x Number of epochs
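As a rough worked example with assumed numbers (the per-iteration cost here is purely illustrative):

```python
flops_per_iteration = 2.5e12    # assumed cost of one forward + backward pass on one batch
iterations_per_epoch = 196      # 50,000 samples / batch size 256, rounded up
num_epochs = 90

total_flops = flops_per_iteration * iterations_per_epoch * num_epochs
print(f"{total_flops:.2e} FLOPs")   # ≈ 4.41e+16
```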
Several practical trade-offs are involved:
| Factor | Effect of increasing | Trade-off |
|---|---|---|
| Number of epochs | More weight updates, longer training time | Better learning vs. overfitting risk and higher cost |
| Batch size | Fewer iterations per epoch, better GPU utilization | Faster wall-clock time per epoch but potentially worse generalization |
| Learning rate | Faster convergence per step | May overshoot or diverge if too large |
| Dataset size | More iterations per epoch | More data per epoch but each epoch takes longer |
For large-scale training, the cost is often measured in GPU hours: the product of the number of GPUs used and the elapsed training time. For instance, training ResNet-50 on ImageNet for 90 epochs takes roughly 29 hours on 8 NVIDIA V100 GPUs. LLM pre-training can require millions of GPU hours.
Reducing the number of epochs (by using early stopping or better learning rate schedules) is one of the most straightforward ways to reduce training cost without architectural changes.
Imagine you have a stack of 100 flashcards for studying vocabulary. Going through all 100 flashcards once, from the first to the last, is one epoch. After finishing, you shuffle the cards and go through all 100 again. That second pass is a second epoch. Each time you complete the full stack, you have finished one more epoch.
You need multiple passes because you will not memorize everything on the first try. But if you study the same cards hundreds of times, you might memorize the exact cards rather than truly understanding the words, which is like overfitting. The trick is finding the right number of passes: enough to learn well, but not so many that you only memorize rather than understand.
Another way to think about it: machine learning epochs are like playing a memory game where you have cards with pictures and try to match them all together. Each attempt at matching all of the cards is one epoch. The more times you play the game, the better you become, but playing too often could lead to mastering the old cards so well that you make mistakes when dealing with new ones.