See also: Machine learning terms
An epoch in machine learning refers to one complete pass through the entire training dataset during the model training process. Over the course of a single epoch, every example in the dataset is presented to the model exactly once to compute gradients and update model parameters, allowing the algorithm to learn from all available data before repeating the cycle. The number of epochs is a key hyperparameter that practitioners set to control how long training continues and how many times the model revisits the data.
In practice, training a model for just one epoch is rarely sufficient to learn all the patterns in a dataset. Most training procedures involve running for multiple epochs so the model can iteratively refine its weights and biases based on the full distribution of training examples. However, training for too many epochs risks overfitting, where the model memorizes the training data rather than learning generalizable patterns.
An epoch does not process the entire dataset at once. Instead, the dataset is divided into smaller subsets called batches (or mini-batches). Each time the model processes one batch and updates its weights through a forward pass and backward pass, that constitutes one iteration (also called a step in most frameworks). Understanding how an epoch, a batch, and an iteration fit together is essential for configuring training correctly.
The relationship between these concepts is straightforward:
Iterations per epoch = Total training examples / Batch size
For example, consider a dataset of 50,000 training images with a batch size of 256: each epoch then comprises 50,000 / 256 ≈ 195.3, which rounds up to 196 iterations, the last of which processes only 80 images.
A smaller example: with 10,000 training examples and a batch size of 100, each epoch contains 10,000 / 100 = 100 iterations; after 5 epochs the model has seen each example 5 times and performed 500 total parameter updates.
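This arithmetic can be restated in a few lines of code. The sketch below simply reuses the figures from the examples above; it is illustrative rather than tied to any particular dataset:

```python
import math

num_examples = 50_000   # total training examples
batch_size = 256        # examples per batch
num_epochs = 5

# Round up so the smaller final batch still counts as an iteration
iterations_per_epoch = math.ceil(num_examples / batch_size)   # 196
total_updates = num_epochs * iterations_per_epoch             # 980

print(iterations_per_epoch, total_updates)
```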
The following table summarizes these three concepts:
| Concept | Definition | Scope | Example (50,000 samples, batch size 256) |
|---|---|---|---|
| Batch | A subset of training data processed together in one forward/backward pass | A fraction of the dataset | 256 samples |
| Iteration (step) | One weight update after processing a single batch | One batch | 1 of 196 per epoch |
| Epoch | One complete pass through the entire dataset | The full dataset | All 196 iterations |
And, framed around the smaller 10,000-sample example:
| Term | What it measures | Example (10,000 samples, batch size 100) |
|---|---|---|
| Batch size | Number of samples per update | 100 |
| Iteration (step) | One forward + backward pass + update | 1 iteration = 100 samples processed |
| Epoch | One full pass through the dataset | 1 epoch = 100 iterations = 10,000 samples |
| Total iterations | Epochs multiplied by iterations per epoch | 5 epochs = 500 iterations |
When the total number of samples is not evenly divisible by the batch size, the last batch in each epoch will be smaller than the others. Most training frameworks handle this automatically, either by using the smaller batch as-is or by dropping the last incomplete batch.
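In PyTorch, for example, this behavior is controlled by the DataLoader's drop_last flag; the snippet below assumes a dataset object already exists:

```python
from torch.utils.data import DataLoader

# With 50,000 samples and a batch size of 256, the 196th batch holds only 80 samples.
# drop_last=False (the default) keeps that smaller batch; drop_last=True discards it,
# leaving 195 full batches per epoch.
loader = DataLoader(dataset, batch_size=256, drop_last=False)
```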
In most frameworks, the terms "iteration" and "step" are interchangeable. TensorFlow and Keras use "steps_per_epoch" while PyTorch users typically speak in terms of iterations.
A single pass through the data is usually insufficient for a model to converge to a good solution. Stochastic gradient descent (SGD) and its variants update model weights based on noisy gradient estimates computed from individual mini-batches. These updates point roughly toward the optimum but require many repeated passes to accumulate enough learning signal for convergence.
When a model trains for too few epochs, it has not had enough exposure to the data to learn the underlying patterns. The result is underfitting: high error on both training and validation data. The loss curve will still be trending downward when training stops, indicating that additional epochs would improve performance.
Training for too many epochs is one of the most common causes of overfitting. As training progresses, the model's capacity to fit the training data increases. Early in training, the model learns general patterns that apply to both training and validation data, and both losses decrease together. After a certain point, however, the model begins to memorize specifics of the training set: noise, outliers, and idiosyncrasies that do not generalize. Training loss continues to decrease while validation loss begins to increase. This gap signals overfitting, and the point of divergence marks approximately where training should have stopped.
The relationship between epochs and overfitting can be visualized as a learning curve: training loss keeps falling as epochs accumulate, while validation loss falls at first, levels off, and then begins to rise once overfitting sets in.
The optimal number of epochs corresponds to the point where validation loss is minimized, which is exactly what early stopping is designed to detect.
There is no universal formula for the optimal number of epochs. The optimal count depends on the dataset size, model complexity, learning rate, and other hyperparameters. Several strategies and guidelines help practitioners make this decision.
Larger datasets generally need fewer epochs. When a dataset is large and diverse, the model can learn a good representation with fewer passes because each epoch exposes it to a wide variety of examples. Conversely, small datasets may require more epochs because each pass provides limited information.
More complex models may need more epochs. A deep neural network with millions of parameters takes longer to train because the optimizer must navigate a higher-dimensional parameter space. Simple models like logistic regression often converge in just a few epochs.
The learning rate affects how many epochs are needed. A higher learning rate causes larger parameter updates per step, which can lead to convergence in fewer epochs (if the rate is well-chosen) or divergence (if it is too high). A lower learning rate requires more epochs because each update makes only a small change.
Early stopping is the most widely used technique for determining when to end training. Rather than fixing the epoch count in advance, the practitioner sets a large maximum number of epochs and monitors the model's performance on a held-out validation set. Training halts when the validation metric (typically validation loss) has not improved for a specified number of consecutive epochs, known as the patience parameter.
The process works as follows:
1. After each epoch, evaluate the model on a held-out validation set.
2. If the validation metric improves by at least the minimum delta, record the new best value (and typically save the current weights) and reset the patience counter.
3. If it does not improve, increment the counter.
4. When the counter reaches the patience value, stop training and, optionally, restore the weights from the best epoch.
For example, with a patience of 10, training continues as long as there has been at least one improvement in validation loss within the last 10 epochs. Once 10 consecutive epochs pass without improvement, training halts.
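A minimal sketch of this patience logic, written independently of any framework (train_one_epoch, validate, and save_checkpoint are hypothetical helpers, not library functions):

```python
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 10
min_delta = 0.001

for epoch in range(max_epochs):
    train_one_epoch(model)        # hypothetical helper: one full pass over the training set
    val_loss = validate(model)    # hypothetical helper: loss on the held-out validation set

    if val_loss < best_val_loss - min_delta:   # improvement large enough to count
        best_val_loss = val_loss
        epochs_without_improvement = 0
        save_checkpoint(model)    # hypothetical helper: remember the best weights so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                 # stop: no improvement for `patience` consecutive epochs
```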
Early stopping acts as a form of regularization because it limits how far the model's weights can move from their initial values. This constraint on the effective complexity of the model helps prevent overfitting.
Common early stopping settings include:
| Parameter | Typical value | Purpose |
|---|---|---|
| Patience | 5 to 10 epochs | How many epochs without improvement before stopping |
| Min delta | 0.001 to 0.01 | Minimum change to count as an improvement |
| Restore best weights | True | Revert to the weights from the best epoch |
In Keras, early stopping is implemented as a callback:
```python
from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    min_delta=0.001,
    restore_best_weights=True
)

# A validation set is required so that val_loss can be monitored
model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])
```
Plotting learning curves (training loss and validation loss against epoch number) is essential for diagnosing whether a model needs more or fewer epochs. There are three typical patterns:
- Underfitting: both losses are still decreasing when training stops, so more epochs (or a higher-capacity model) would help.
- Good fit: both losses decrease and then plateau at similar values.
- Overfitting: training loss keeps decreasing while validation loss bottoms out and starts rising, so training ran too long.
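As an illustration, the curves can be plotted directly from the history object that Keras returns from model.fit (assuming training was run with a validation split):

```python
import matplotlib.pyplot as plt

# history = model.fit(X_train, y_train, epochs=50, validation_split=0.2)
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```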
As a starting point, typical epoch counts vary by task:
| Task | Typical epoch range | Notes |
|---|---|---|
| Image classification (from scratch, e.g., ImageNet) | 90 to 300 | Standard ResNet on ImageNet uses 90 epochs with step decay; longer schedules sometimes improve results |
| Fine-tuning pretrained models | 2 to 10 | Pretrained weights already encode useful features; fewer epochs suffice |
| NLP classification | 3 to 10 | BERT fine-tuning typically uses 3 to 4 epochs |
| NLP pre-training (BERT, GPT) | 1 to 3 over the corpus | Large corpora; some data may be seen only once |
| Small tabular datasets | 50 to 500 | More epochs needed due to limited data |
| GAN training | 100 to 1000+ | GANs are notoriously unstable and may require long training to stabilize adversarial dynamics |
| LLM pre-training | 1 to 3 | Most data is seen only once; measured in tokens rather than epochs |
These are rough guidelines, not rules. The right number of epochs should always be verified empirically using validation performance.
In modern large language model training, the concept of epochs has shifted. Models like GPT-3 and GPT-4 are often trained for less than one epoch over their massive training corpora, meaning not every example is seen even once. The total number of training tokens (rather than epochs) becomes the primary measure of training duration. Chinchilla scaling laws (Hoffmann et al., 2022) suggest that the optimal number of training tokens scales roughly proportionally to model size.
Reshuffling the training data at the beginning of each epoch is standard practice and has important theoretical and practical benefits. When the data is presented in the same order every epoch, the model can develop sensitivity to the sequence of examples rather than learning from the data itself, which is especially problematic when data is sorted by class or some other systematic ordering. By randomly permuting the order of samples each epoch, the sequence of gradient updates varies, which helps the optimizer escape local minima and achieve better generalization.
Without shuffling, the model would see the same sequence of batches every epoch. This can create biases in the gradient estimates, particularly if consecutive batches contain similar examples. Shuffling breaks these correlations and improves the quality of gradient updates.
Research on random reshuffling in SGD has shown that it converges faster than sampling with replacement, and it performs at least as well as shuffling the data only once before training (sometimes called "shuffle once"). Modern deep learning frameworks make per-epoch reshuffling straightforward: Keras shuffles the training data by default in model.fit(), and PyTorch's DataLoader reshuffles every epoch when shuffle=True is set.
Some training frameworks use a data loader that samples batches with replacement from the dataset rather than iterating through the data in a fixed order. In this case, the concept of an epoch is approximate: after seeing N batches (where N equals the dataset size divided by the batch size), the model has seen approximately one epoch's worth of data, though some examples may have been seen more than once and others not at all.
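One way to express this in PyTorch is a RandomSampler with replacement enabled, sized to one nominal epoch (the dataset object is assumed to exist):

```python
from torch.utils.data import DataLoader, RandomSampler

# Draw len(dataset) samples with replacement: roughly one "epoch" of data,
# but some examples may appear several times and others not at all.
sampler = RandomSampler(dataset, replacement=True, num_samples=len(dataset))
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```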
In PyTorch, the DataLoader class provides a shuffle parameter:
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

for epoch in range(num_epochs):
    for batch in train_loader:  # data is reshuffled at the start of each epoch
        # training step
        pass
```
The way epochs are expressed in code varies between frameworks.
PyTorch does not have a built-in "epochs" parameter in its core API. Instead, the training loop is written explicitly by the programmer, with an outer loop over epochs and an inner loop over batches from a DataLoader:
```python
for epoch in range(num_epochs):              # outer loop: one pass over the dataset per epoch
    for inputs, targets in train_loader:     # inner loop: one iteration (step) per batch
        outputs = model(inputs)              # forward pass
        loss = criterion(outputs, targets)   # compute the loss for this batch
        optimizer.zero_grad()                # clear gradients from the previous step
        loss.backward()                      # backward pass: compute gradients
        optimizer.step()                     # update the model parameters
```
This explicit loop gives full control over what happens at each epoch boundary (logging, checkpointing, learning rate adjustments).
Keras abstracts the epoch loop into model.fit(), where the epochs parameter specifies the total number of passes:
```python
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```
Keras automatically handles batching, shuffling, and per-epoch callbacks such as early stopping, learning rate scheduling, and checkpointing.
In classical machine learning, the epoch was the natural unit of training progress. Datasets were relatively small, models were trained for many epochs, and evaluation checkpoints were placed at epoch boundaries.
Modern large-scale training has shifted toward using steps (iterations) as the primary unit of measurement. There are several reasons for this:
Dataset size: When training on datasets with billions of examples (e.g., Common Crawl for language models or LAION-5B for image models), a single epoch may take days or weeks of compute. Tracking progress in steps rather than epochs provides finer-grained monitoring.
Sub-epoch training: Many modern models do not complete even one full epoch. GPT-3 was trained on approximately 300 billion tokens sampled from a corpus of several hundred billion tokens, but with non-uniform sampling weights. Some documents were seen multiple times while others were seen only once. In this setting, the concept of an "epoch" loses its original meaning.
Learning rate schedules: Modern learning rate schedulers (cosine annealing, linear warmup + decay) are typically defined in terms of total training steps rather than epochs. For instance, "warmup for 2,000 steps then decay over 500,000 steps" is more common than "warmup for 1 epoch" (a step-based schedule is sketched after this list).
Evaluation and checkpointing: In large-scale training, models are evaluated and checkpointed every N steps (e.g., every 1,000 or 5,000 steps) rather than at epoch boundaries. This provides more frequent feedback on training progress.
Reproducibility: Specifying training duration in steps makes it easier to reproduce results across different hardware configurations. Changing the number of GPUs or the batch size changes the number of steps per epoch but not the total number of steps if gradient accumulation is adjusted accordingly.
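A minimal sketch of such a step-based schedule in PyTorch, using the warmup and decay figures quoted above (the model and the choice of AdamW are placeholders):

```python
import math
import torch

warmup_steps = 2_000
total_steps = 500_000

def lr_lambda(step):
    # Linear warmup followed by cosine decay, both defined in steps
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # `model` assumed to exist
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# scheduler.step() is then called once per training step, not once per epoch
```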
| Unit | Traditional ML | Modern large-scale training |
|---|---|---|
| Primary progress unit | Epoch | Step (iteration) |
| Dataset passes | Many (10-300+ epochs) | Often less than 1 epoch |
| Evaluation frequency | Once per epoch | Every N steps |
| LR schedule defined by | Epoch milestones | Step counts |
| Typical dataset size | Thousands to millions of examples | Billions of examples or tokens |
Despite this shift, the epoch remains a useful concept for smaller-scale training, fine-tuning, and educational settings where the dataset is small enough that multiple passes are both practical and necessary.
In large language model pre-training, the distinction between epochs and steps becomes especially pronounced. Training budgets are specified in tokens (e.g., "train on 2 trillion tokens") rather than epochs, and progress is tracked by the number of tokens consumed. A single "step" processes a fixed number of tokens determined by the batch size and sequence length. For example, with a batch size of 1,024 sequences and a sequence length of 4,096 tokens, each step processes roughly 4.2 million tokens.
The total number of steps is then:
Total steps = Total training tokens / (batch size * sequence length)
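For instance, with the batch size and sequence length above and an assumed 2-trillion-token budget:

```python
batch_size = 1_024        # sequences per step
seq_len = 4_096           # tokens per sequence
tokens_per_step = batch_size * seq_len            # 4,194,304 ≈ 4.2 million tokens

total_tokens = 2_000_000_000_000                  # assumed 2-trillion-token budget
total_steps = total_tokens // tokens_per_step     # ≈ 476,837 steps
```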
All scheduling decisions, including learning rate warmup, decay, and checkpointing intervals, are anchored to these step counts. This convention carries through to fine-tuning as well, where practitioners typically specify training in terms of steps rather than epochs, particularly when using gradient accumulation to achieve large effective batch sizes.
The concept of an epoch is most meaningful in batch learning (also called offline learning), where the full dataset is available before training begins and can be iterated over multiple times. In this setting, one epoch is a well-defined unit: a single pass through the finite dataset.
In online learning, data arrives as a continuous stream and each example is typically used only once before being discarded. The notion of an epoch does not naturally apply because there is no fixed dataset to "pass through." Instead, online learning systems measure progress in terms of the number of samples processed or the number of weight updates performed. Online learning is common in scenarios with non-stationary data distributions, such as recommendation systems and financial trading models.
Some hybrid approaches exist where streaming data is accumulated into buffers and mini-epochs are run over the buffer before it is refreshed with new data.
Training modern large language models (LLMs) has shifted the emphasis from epochs to total tokens seen during training. Because LLM pre-training datasets can contain trillions of tokens, many models are trained for roughly one epoch or less, so most tokens are encountered only once.
For LLMs, reporting the number of tokens processed is more informative than the number of epochs. Key examples include:
| Model | Parameters | Training tokens | Approximate epochs |
|---|---|---|---|
| GPT-3 | 175B | 300B tokens | ~1 (some data subsets repeated) |
| LLaMA 1 | 7B to 65B | 1.0T to 1.4T tokens | ~1 |
| LLaMA 2 | 7B to 70B | 2.0T tokens | ~1 |
| LLaMA 3.1 | 8B to 405B | 15T+ tokens | ~1 |
| Chinchilla | 70B | 1.4T tokens | ~1 |
The reason single-epoch training dominates at this scale is that pre-training corpora are large enough that one pass already provides sufficient gradient updates. Repeating data yields diminishing returns and can degrade model quality.
A major question in modern ML is whether training data should be repeated across multiple epochs, or whether each training example should be seen only once. This debate has become especially relevant as the supply of high-quality text data for large language models approaches its limits.
Early large language models, following the Chinchilla scaling laws (Hoffmann et al., 2022), were typically trained for roughly one epoch. The reasoning was straightforward: with enough data available, there was no need to repeat examples. Repeated data offers diminishing returns because the model has already extracted signal from previously seen examples, and each repetition provides progressively less new information.
The Chinchilla scaling laws (Hoffmann et al., 2022) established that compute-optimal training requires scaling model size and training data equally. The recommended ratio is approximately 20 tokens per parameter: a 70 billion parameter model should be trained on roughly 1.4 trillion tokens. This framework implicitly assumes each token is seen only once.
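In code form, the 20-tokens-per-parameter rule of thumb from the paragraph above reads:

```python
params = 70_000_000_000                        # 70 billion parameters
tokens_per_param = 20                          # approximate Chinchilla-optimal ratio
optimal_tokens = params * tokens_per_param     # 1.4 trillion tokens
```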
When unique data is limited, multi-epoch training becomes necessary. Muennighoff et al. (2023) conducted the first systematic study of multi-epoch training for language models. Their paper, "Scaling Data-Constrained Language Models," trained over 400 models ranging from 10 million to 9 billion parameters for up to 1,500 epochs. Key findings included:
- Repeating data for up to about 4 epochs yields a loss almost indistinguishable from training on entirely unique data.
- Beyond roughly 4 epochs, returns diminish, and after around 16 epochs additional repetition contributes little.
- Under a fixed data budget, the compute-optimal strategy shifts toward smaller models trained for more epochs rather than larger models trained for fewer.
Their findings suggest that when data is scarce, it is better to train a smaller model for more epochs rather than a larger model for fewer epochs, a reversal of the standard Chinchilla recommendation. When the data budget is limited, repeating data up to about four times is nearly free in terms of model quality.
| Epochs of repetition | Impact on loss (vs. unique data) | Practical recommendation |
|---|---|---|
| 1 (no repetition) | Baseline | Ideal if sufficient data is available |
| 2-4 | Negligible degradation | Safe to use when data is constrained |
| 4-16 | Modest degradation | Acceptable with diminishing returns |
| 16+ | Significant degradation | Avoid unless data is extremely scarce |
While pre-training typically uses one or a few epochs, fine-tuning and post-training commonly use multiple epochs over smaller, curated datasets.
Recent research (2025) has shown that training for many epochs on a small, carefully selected subset of fine-tuning data can actually outperform training for one epoch on a much larger dataset, especially for reasoning tasks. This challenges the assumption that more unique data is always better and suggests that data quality matters more than quantity in the post-training phase.
Epoch AI estimates that available high-quality web text amounts to roughly 300 trillion tokens and has projected that this supply could be exhausted for LLM training purposes between 2026 and 2032, depending on data quality thresholds and model scaling trajectories. This projection has made multi-epoch training research increasingly urgent, as future models may have no choice but to train on repeated data, use synthetic data generation, or develop more data-efficient training methods.
Tracking metrics at the end of each epoch provides essential insight into training dynamics. The most commonly monitored quantities include training loss, validation loss, task metrics such as accuracy, the current learning rate, and the wall-clock time per epoch.
Tools such as TensorBoard, Weights & Biases, and MLflow provide real-time visualization of per-epoch metrics, making it straightforward to detect problems and decide when to stop training.
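For example, per-epoch values can be logged to TensorBoard with PyTorch's SummaryWriter (the training and validation helpers here are hypothetical placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/example")   # hypothetical log directory

for epoch in range(num_epochs):
    train_loss = train_one_epoch(model)   # hypothetical helper: average training loss
    val_loss = validate(model)            # hypothetical helper: validation loss
    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/val", val_loss, epoch)

writer.close()
```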
The number of epochs directly affects training cost. Total compute is proportional to:
Total FLOPs = FLOPs per iteration x Iterations per epoch x Number of epochs
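As a rough worked example with assumed numbers (the per-iteration cost here is purely illustrative):

```python
flops_per_iteration = 2.5e12    # assumed cost of one forward + backward pass on one batch
iterations_per_epoch = 196      # 50,000 samples / batch size 256, rounded up
num_epochs = 90

total_flops = flops_per_iteration * iterations_per_epoch * num_epochs
print(f"{total_flops:.2e} FLOPs")   # ≈ 4.41e+16
```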
Several practical trade-offs are involved:
| Factor | Effect of increasing | Trade-off |
|---|---|---|
| Number of epochs | More weight updates, longer training time | Better learning vs. overfitting risk and higher cost |
| Batch size | Fewer iterations per epoch, better GPU utilization | Faster wall-clock time per epoch but potentially worse generalization |
| Learning rate | Faster convergence per step | May overshoot or diverge if too large |
| Dataset size | More iterations per epoch | More data per epoch but each epoch takes longer |
For large-scale training, the cost is often measured in GPU hours: the product of the number of GPUs used and the elapsed training time. For instance, training ResNet-50 on ImageNet for 90 epochs takes roughly 29 hours on 8 NVIDIA V100 GPUs. LLM pre-training can require millions of GPU hours.
Reducing the number of epochs (by using early stopping or better learning rate schedules) is one of the most straightforward ways to reduce training cost without architectural changes.
Imagine you have a stack of 100 flashcards for studying vocabulary. Going through all 100 flashcards once, from the first to the last, is one epoch. After finishing, you shuffle the cards and go through all 100 again. That second pass is a second epoch. Each time you complete the full stack, you have finished one more epoch.
You need multiple passes because you will not memorize everything on the first try. But if you study the same cards hundreds of times, you might memorize the exact cards rather than truly understanding the words, which is like overfitting. The trick is finding the right number of passes: enough to learn well, but not so many that you only memorize rather than understand.
Another way to think about it: machine learning epochs are like playing a memory game where you have cards with pictures and try to match them all together. Each attempt at matching all of the cards is one epoch. The more times you play the game, the better you become, but playing too often could lead to mastering the old cards so well that you make mistakes when dealing with new ones.