# Iteration

> Source: https://aiwiki.ai/wiki/iteration
> Updated: 2026-06-25
> Categories: Deep Learning, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

An **iteration** in machine learning is a single update of a model's parameters during training, performed by processing one [batch](/wiki/batch) of data: the model makes predictions on the batch (a forward pass), computes the [loss](/wiki/loss), calculates [gradients](/wiki/gradient) through [backpropagation](/wiki/backpropagation), and adjusts its [weights](/wiki/weights) and [biases](/wiki/biases) once. Google's Machine Learning Glossary defines an iteration as "a single update of a model's parameters ... during training" and adds that "the batch size determines how many examples the model processes in a single iteration" [16]. The number of iterations in one [epoch](/wiki/epoch) equals the dataset size divided by the [batch size](/wiki/batch_size), rounded up: a dataset of 50,000 examples with a batch size of 100 yields ceil(50,000 / 100) = 500 iterations per epoch [16][17].

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## What is an iteration in machine learning?

In [machine learning](/wiki/machine_learning) and [deep learning](/wiki/deep_learning), an **iteration** (also called a **training step** or **update step**) is one complete cycle of processing a single [batch](/wiki/batch) of data, computing the [loss](/wiki/loss), calculating [gradients](/wiki/gradient), and updating the [model](/wiki/model)'s [parameters](/wiki/parameter). Each iteration moves the model slightly closer to an optimal set of [weights](/wiki/weights) and [biases](/wiki/biases) by applying one round of [gradient descent](/wiki/gradient_descent) or another [optimization](/wiki/optimizer) algorithm.

The number of training examples the model processes in each iteration is determined by the [hyperparameter](/wiki/hyperparameter) known as [batch size](/wiki/batch_size). If the batch size is 64, the model processes 64 examples, computes the average loss across those examples, and performs one parameter update. That entire sequence counts as one iteration.

Iterations are the fundamental heartbeat of [training](/wiki/training). Every model, from a simple [linear regression](/wiki/linear_regression) to a 400-billion-parameter [large language model](/wiki/large_language_model), learns through repeated iterations. Understanding how iterations relate to [epochs](/wiki/epoch), batches, and total training compute is essential for configuring training runs, debugging [convergence](/wiki/convergence) issues, and interpreting training logs.

Beyond the narrow context of neural network training, the term iteration carries a broader meaning across mathematics, computer science, and statistics. An iterative algorithm is any procedure that repeatedly applies an update rule to a current state, generating a sequence of states that ideally converges to a solution. Many of the most important methods in numerical computation, including [k-means](/wiki/k-means) clustering, [expectation-maximization](/wiki/expectation_maximization), [policy iteration](/wiki/policy_iteration), Newton's method, and Gauss-Seidel relaxation, are iterative in this general sense. The training of a neural network is one specific instance of this larger pattern.

## How does an iteration differ from an epoch and a batch?

Three closely related terms appear constantly in training configurations. They are distinct concepts, but they connect through a simple formula. The Google Machine Learning Glossary draws the distinction precisely: a batch is "the set of examples used in one training iteration," while an epoch is "a full training pass over the entire training set such that each example has been processed once" [16].

| Term | Definition | Scope |
|------|-----------|-------|
| **[Batch](/wiki/batch)** | A subset of the training dataset used in a single iteration | A group of training examples |
| **Iteration** (training step) | One forward pass + backward pass + parameter update on one batch | One parameter update |
| **[Epoch](/wiki/epoch)** | One complete pass through the entire training dataset | All batches processed once |

### How many iterations are in an epoch?

The number of iterations required to complete one epoch is calculated as the dataset size divided by the batch size, rounded up to the nearest whole number:

**Iterations per epoch = ceil(Dataset size / Batch size)**

The ceiling (round-up) is needed because, when the dataset does not divide evenly by the batch size, the final batch of an epoch is smaller than the rest but still counts as one iteration. Deep learning frameworks compute the value the same way: Keras returns `int(np.ceil(len(self.x) / float(self.batch_size)))` as the number of steps per epoch [17].

For example, if a dataset contains 10,000 training examples and the [batch size](/wiki/batch_size) is 100, then one epoch consists of ceil(10,000 / 100) = **100 iterations**. Over 50 epochs, the model would perform 100 x 50 = **5,000 total iterations** (also called total training steps).

A second worked example makes the rounding explicit. With 50,000 training examples and a batch size of 100, one epoch is ceil(50,000 / 100) = **500 iterations**; 50,000 divides evenly, so no rounding is needed [16]. If the dataset instead held 50,050 examples at the same batch size, one epoch would be ceil(50,050 / 100) = **501 iterations**, with the last batch containing only 50 examples. That short final batch is processed like any other unless the training framework is configured to drop it (TensorFlow exposes this as `drop_remainder=True`), in which case the epoch has 500 iterations [17].
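The rounding rule can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `iterations_per_epoch` and its `drop_last` flag are illustrative names chosen here, not part of any framework.

```python
import math

def iterations_per_epoch(dataset_size: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of parameter updates in one pass over the dataset.

    `drop_last` is a hypothetical flag for this sketch that mimics a framework
    configured to discard an incomplete final batch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        return dataset_size // batch_size          # incomplete final batch is discarded
    return math.ceil(dataset_size / batch_size)    # incomplete final batch still counts

print(iterations_per_epoch(50_000, 100))                  # 500
print(iterations_per_epoch(50_050, 100))                  # 501
print(iterations_per_epoch(50_050, 100, drop_last=True))  # 500
```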

### Numerical example

| Parameter | Value |
|-----------|-------|
| Dataset size | 60,000 examples |
| [Batch size](/wiki/batch_size) | 256 |
| Iterations per [epoch](/wiki/epoch) | ceil(60,000 / 256) = 235 |
| Number of epochs | 20 |
| **Total iterations** | **235 x 20 = 4,700** |

In this scenario, the model's parameters are updated 4,700 times over the course of training.

### Common terminology confusion

In practice, the words *iteration*, *step*, *update*, and even *batch* are often used interchangeably in research papers, blog posts, and framework documentation. The same paper may switch between *training step* and *iteration* in adjacent sentences. A few clarifications help when reading the literature:

- An **iteration** and a **training step** almost always mean the same thing in supervised deep learning: one optimizer update.
- A **batch** is a noun describing the data, not the action. Saying "after one batch" usually means "after one iteration" but the phrase is ambiguous when [gradient accumulation](/wiki/gradient_accumulation) is in use.
- An **epoch** is the only term whose meaning is essentially fixed: one sweep through the whole training set.
- The word **update** is sometimes reserved for the parameter change itself, in which case multiple forward passes can precede a single update (see gradient accumulation below).

## What happens during a single iteration?

Each iteration in a [neural network](/wiki/neural_network) training loop follows a well-defined sequence of operations:

1. **Batch sampling.** A batch of training examples is drawn from the dataset (or served by a data loader).
2. **Forward pass.** The input data passes through every [layer](/wiki/layer) of the network, producing predictions. The [loss function](/wiki/loss_function) compares these predictions against the true labels or targets to compute a scalar loss value.
3. **Backward pass ([backpropagation](/wiki/backpropagation)).** The loss is differentiated with respect to each learnable parameter in the network. This produces a gradient for every weight and bias, indicating the direction and magnitude of change that would reduce the loss.
4. **Parameter update.** The [optimizer](/wiki/optimizer) (such as [SGD](/wiki/stochastic_gradient_descent_sgd), Adam, or AdaGrad) uses the gradients and the [learning rate](/wiki/learning_rate) to adjust each parameter. For plain SGD the update rule is **parameter = parameter - learning_rate x gradient**; adaptive optimizers such as Adam and AdaGrad rescale the gradient per parameter before applying it.
5. **Metric logging.** Training frameworks typically record the loss, [accuracy](/wiki/accuracy), learning rate, and other metrics at each iteration or at fixed intervals.

After step 4, the model has completed one iteration and is ready to process the next batch.
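To make the sequence concrete, the sketch below runs the forward pass, backward pass, and plain-SGD update by hand for a one-dimensional linear model. It is an illustration under simplifying assumptions: the model `y = w * x + b`, the mean squared error loss, the toy batch, and the function name `one_iteration` are all chosen here for the example.

```python
def one_iteration(w, b, batch, learning_rate):
    """Run one training iteration of the model y = w * x + b with mean squared error."""
    n = len(batch)
    # Forward pass: predictions and the scalar loss averaged over the batch.
    preds = [w * x + b for x, _ in batch]
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
    # Backward pass: gradients of the loss with respect to w and b.
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, batch)) / n
    # Parameter update: parameter = parameter - learning_rate * gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    return w, b, loss   # loss is the value measured before the update

batch = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # toy batch drawn from y = 2x + 1
w, b = 0.0, 0.0
for step in range(3):
    w, b, loss = one_iteration(w, b, batch, learning_rate=0.05)
    print(f"step {step}: loss before update = {loss:.4f}")
```

Each call performs exactly one parameter update, so the three calls in the loop are three iterations on the same batch.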

### How long does one iteration take?

The wall-clock time of a single iteration depends on the model architecture, hardware, and software stack. For a typical [transformer](/wiki/transformer)-based language model trained on modern accelerators, the dominant costs in one iteration are:

| Operation | Approximate share of iteration time |
|-----------|-------------------------------------|
| Forward pass (matmuls, attention) | 30 to 40 percent |
| Backward pass | 60 to 70 percent (roughly twice the forward cost) |
| Optimizer step (e.g., AdamW updates) | 5 to 10 percent |
| Communication (in distributed training) | Variable; can dominate at extreme scale |
| Data loading and preprocessing | Often hidden by overlap, 0 to 5 percent |

On a modern H100 GPU running a 7-billion-parameter model with a sequence length of 4,096 and a per-device batch of 4, a single iteration takes on the order of one second. For trillion-parameter training on thousands of GPUs, an iteration may take ten to thirty seconds even with extensive parallelism, because more time is spent on cross-node communication and pipeline bubbles. The aggregate iteration count therefore needs to be planned alongside the wall-clock budget, not just the FLOP budget.

## How does an iteration work across gradient descent variants?

The concept of an iteration is central to all variants of [gradient descent](/wiki/gradient_descent). What differs across variants is how much data each iteration consumes.

| Variant | Data per iteration | Characteristics |
|---------|-------------------|----------------|
| **[Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD)** | 1 example | Very frequent updates; noisy gradients; low memory usage; can escape shallow local minima |
| **Mini-batch gradient descent** | A subset (e.g., 32, 64, 256 examples) | Balances update frequency with gradient stability; standard practice in modern deep learning |
| **Batch (full-batch) gradient descent** | Entire dataset | One iteration per [epoch](/wiki/epoch); stable but slow gradients; high memory cost; impractical for large datasets |

In practice, nearly all modern [deep learning](/wiki/deep_learning) training uses mini-batch gradient descent. When practitioners refer to "one iteration," they almost always mean processing one mini-batch.
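The three variants differ only in the batch size handed to the same loop. The sketch below counts the iterations each one performs in a single epoch over ten toy examples; `iter_batches` is a hypothetical helper written for this illustration, and shuffling is omitted for brevity.

```python
def iter_batches(examples, batch_size):
    """Yield consecutive slices of `examples`; the last slice may be shorter."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

dataset = list(range(10))   # ten toy examples

for name, batch_size in [("SGD", 1), ("mini-batch", 4), ("full-batch", len(dataset))]:
    updates = sum(1 for _ in iter_batches(dataset, batch_size))
    print(f"{name}: batch size {batch_size}, {updates} iterations per epoch")
```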

### How many iterations does convergence require?

In classical [convex optimization](/wiki/convex_optimization), the number of iterations needed for an algorithm to reach a target accuracy is studied as a function of problem properties such as the condition number, the dimension, and the smoothness of the objective. The most-cited bounds for first-order methods on a convex objective whose gradient is Lipschitz continuous with constant `L` (a smooth objective) and, where stated, strongly convex with parameter `mu` are summarized in the table below. The condition number is defined as `kappa = L / mu`.

| Method | Iterations to reach error `epsilon` |
|--------|------------------------------------|
| Full-batch gradient descent (smooth, convex) | `O(L / epsilon)` |
| Full-batch gradient descent (strongly convex) | `O(kappa log(1 / epsilon))` |
| Nesterov accelerated gradient (smooth, convex) | `O(sqrt(L / epsilon))` |
| Nesterov accelerated gradient (strongly convex) | `O(sqrt(kappa) log(1 / epsilon))` |
| Stochastic gradient descent (strongly convex) | `O(1 / epsilon)` |

These bounds, due in large part to Nesterov, Polyak, and others, were assembled in the form most familiar to modern practitioners by Boyd and Vandenberghe and by Nesterov's textbook on convex optimization [9][10]. They are worst-case bounds and rarely match the empirical iteration count needed in practice. They do, however, capture two important facts: ill-conditioned problems (large `kappa`) require many more iterations, and accelerated methods cut the iteration count substantially.
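The dependence on the condition number is easy to observe on a toy problem. The sketch below runs full-batch gradient descent on the quadratic `f(x, y) = 0.5 * (mu * x**2 + L * y**2)` with step size `1 / L` and counts iterations until the objective falls below a tolerance. The objective, starting point, and tolerance are assumptions picked for illustration; the counts are empirical, not the worst-case bounds from the table.

```python
def gd_iterations(mu, L, epsilon=1e-6, max_iters=1_000_000):
    """Count gradient-descent iterations on f(x, y) = 0.5 * (mu * x**2 + L * y**2).

    Uses the fixed step size 1 / L and stops once f drops below epsilon.
    """
    x, y = 1.0, 1.0
    step = 1.0 / L
    for k in range(max_iters):
        if 0.5 * (mu * x * x + L * y * y) < epsilon:
            return k
        x -= step * mu * x   # gradient of f with respect to x is mu * x
        y -= step * L * y    # gradient of f with respect to y is L * y
    return max_iters

for kappa in (10, 100, 1000):
    print(f"kappa = {kappa}: {gd_iterations(mu=1.0, L=float(kappa))} iterations")
```

The iteration count grows roughly in proportion to `kappa`, which is the behavior the strongly convex gradient descent bound describes.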

For non-convex objectives such as the loss landscapes of deep neural networks, there are no general bounds on iterations to reach a global optimum. Modern training instead aims at finding a good local minimum or a flat region of low loss, and the iteration count is chosen empirically.

### Iteration with gradient accumulation

[Gradient accumulation](/wiki/gradient_accumulation) is a technique that decouples the *forward and backward* pass count from the *parameter update* count. Instead of updating parameters after every batch, the gradients from several consecutive forward and backward passes are summed (accumulated) into a single gradient buffer, and the optimizer step is taken only after `K` of these mini-batches have been processed. The effect is to simulate a batch size that is `K` times larger without exceeding GPU memory.

With gradient accumulation, the meaning of iteration becomes ambiguous. Two conventions are common:

- **Step-based counting.** An iteration is one optimizer step. Each iteration corresponds to `K` micro-batches.
- **Micro-batch counting.** Each forward and backward pass on a micro-batch is counted as a separate iteration, even though no parameter update occurs after most of them.

Most large-scale training frameworks (DeepSpeed, Megatron-LM, FSDP) use the step-based convention because the optimizer step is the meaningful unit for [learning rate](/wiki/learning_rate) schedules and checkpoints.
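The step-based convention can be sketched in plain Python. The example below accumulates gradients from `K` micro-batches before each update of a one-parameter model `y = w * x`. Dividing each micro-batch gradient by `K` so that the buffer holds an average rather than a raw sum is a design choice of this sketch, and the data, learning rate, and names are illustrative assumptions.

```python
def grad_w(w, batch):
    """Gradient of mean squared error for the model y = w * x, averaged over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x) for x in range(1, 13)]                 # 12 toy examples from y = 3x
micro_batches = [data[i:i + 3] for i in range(0, len(data), 3)]    # 4 micro-batches of 3
K = 2                                                              # accumulation factor
w, lr = 0.0, 0.001
optimizer_steps = 0
accumulated = 0.0

for i, micro_batch in enumerate(micro_batches, start=1):
    # Forward and backward pass on one micro-batch; scaling by 1/K makes the
    # buffer equal to the average gradient over the K micro-batches.
    accumulated += grad_w(w, micro_batch) / K
    if i % K == 0:
        w -= lr * accumulated      # one optimizer step per K micro-batches
        accumulated = 0.0
        optimizer_steps += 1

print(f"{len(micro_batches)} micro-batches, {optimizer_steps} optimizer steps")
```

With equal-sized micro-batches, the averaged buffer matches the gradient of one batch `K` times larger, which is why step-based counting treats the `K` passes as a single iteration.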

## How do training logs track iterations?

Most training frameworks maintain a **global step counter** that increments by one after each iteration, regardless of the current epoch. This counter serves several purposes:

- **Learning rate scheduling.** Many learning rate schedules (warmup, cosine decay, step decay) are defined in terms of the global step rather than the epoch number, giving finer-grained control over the learning rate trajectory.
- **Checkpoint saving.** Models are often saved at regular step intervals (for example, every 1,000 iterations) so that training can be resumed from the most recent checkpoint if a failure occurs.
- **Logging and visualization.** Tools such as TensorBoard, Weights & Biases, and [MLflow](/wiki/mlflow) plot loss curves against the global step. Logging per iteration (rather than per epoch) provides higher-resolution insight into training dynamics, making it easier to spot instabilities, gradient spikes, or plateaus early.

In [PyTorch](/wiki/pytorch), the global step is typically tracked manually in the training loop. In [TensorFlow](/wiki/tensorflow)/[Keras](/wiki/keras), callbacks and the built-in training loop expose the step count automatically.
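As an illustration of a schedule defined on the global step, the sketch below implements linear warmup followed by cosine decay to zero. The function name and the default values for `base_lr`, `warmup_steps`, and `total_steps` are assumptions chosen for the example, not recommendations from any particular framework.

```python
import math

def learning_rate_at(step, base_lr=3e-4, warmup_steps=1_000, total_steps=10_000):
    """Linear warmup followed by cosine decay to zero, indexed by global step."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)   # hold at zero if training runs past total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

for step in (0, 999, 5_500, 10_000):
    print(step, learning_rate_at(step))
```

Because the schedule depends only on the step counter, resuming from a checkpoint with the correct global step restores the learning rate automatically.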

### A minimal PyTorch iteration

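Each pass through the body of a supervised PyTorch training loop is one iteration. The toy model and random data below are stand-ins so that the loop runs end to end:

```
import torch
# Toy stand-ins so the loop runs end to end; swap in a real model and DataLoader.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
dataloader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
global_step = 0                            # counts optimizer updates across epochs
for batch_x, batch_y in dataloader:        # 1. batch sampling
    optimizer.zero_grad()                  #    clear stale gradients
    predictions = model(batch_x)           # 2. forward pass
    loss = loss_fn(predictions, batch_y)   #    loss computation
    loss.backward()                        # 3. backward pass
    optimizer.step()                       # 4. parameter update
    global_step += 1                       # 5. logging / scheduling
```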

This five-line core has remained essentially unchanged since the early days of PyTorch, despite enormous changes in model architectures and scale [14]. The same five-step pattern is implicit in `tf.keras.Model.fit` and in the `Trainer` class of Hugging Face Transformers, even though those higher-level APIs hide the loop behind a single `fit` or `train` call.

## How many iterations does training a large language model take?

The scale of modern [large language model](/wiki/large_language_model) (LLM) training is often described in terms of total training tokens rather than iterations, but the relationship between the two is straightforward:

**Total tokens = Total iterations x Batch size (in tokens)**

### Examples from published models

| Model | Total training tokens | Batch size (tokens) | Approximate total iterations |
|-------|----------------------|--------------------|--------------------------|
| GPT-3 175B (OpenAI, 2020) | 300 billion | 3.2 million | ~93,750 |
| LLaMA 1 65B (Meta, 2023) | 1.4 trillion | 4 million | ~350,000 |
| Chinchilla 70B (DeepMind, 2022) | 1.4 trillion | 3 million | ~470,000 |
| LLaMA 2 70B (Meta, 2023) | 2 trillion | 4 million | ~500,000 |
| LLaMA 3 405B (Meta, 2024) | 15 trillion | 16 million | ~937,500 |

These numbers illustrate the enormous scale of LLM training. GPT-3's 175-billion-parameter model required roughly 94,000 iterations, each processing 3.2 million tokens. By 2024, LLaMA 3's flagship model ran nearly one million iterations with batches of 16 million tokens per step [1][2][3].

Training budgets for LLMs are often planned in terms of total compute (measured in FLOPs) rather than iteration count alone. The Chinchilla scaling laws suggested that compute-optimal training uses roughly 20 tokens per parameter, but subsequent models like LLaMA 3 have trained far beyond that ratio (over 1,800 tokens per parameter for the 8B variant), demonstrating that "over-training" smaller models on more data can yield strong performance at lower inference cost [4].
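The iteration counts in the table and the tokens-per-parameter ratio follow from simple division. The sketch below reproduces them from the token and batch-size figures quoted above; `total_iterations` is a name introduced here for illustration, and the results are approximations because real runs may vary the batch size during training.

```python
def total_iterations(total_tokens: int, batch_tokens: int) -> int:
    """Approximate optimizer steps: total training tokens divided by tokens per batch."""
    return round(total_tokens / batch_tokens)

runs = {   # (total training tokens, batch size in tokens), figures from the table above
    "GPT-3 175B": (300_000_000_000, 3_200_000),
    "LLaMA 1 65B": (1_400_000_000_000, 4_000_000),
    "Chinchilla 70B": (1_400_000_000_000, 3_000_000),
    "LLaMA 2 70B": (2_000_000_000_000, 4_000_000),
    "LLaMA 3 405B": (15_000_000_000_000, 16_000_000),
}
for name, (tokens, batch) in runs.items():
    print(f"{name}: ~{total_iterations(tokens, batch):,} iterations")

# Tokens per parameter for an 8-billion-parameter model trained on 15 trillion tokens.
print(15_000_000_000_000 / 8_000_000_000)   # 1875.0
```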

### Batch size ramping and warmup

Large-scale training runs often begin with a smaller batch size and gradually increase it during the first few thousand iterations. This technique, sometimes called batch size warmup, can stabilize early training when the model's parameters are still far from any useful solution. The technique is closely related to [learning rate](/wiki/learning_rate) warmup, in which the learning rate ramps up linearly over the first few hundred to a few thousand iterations before settling into its main schedule. Both warmups address the same underlying problem: at iteration 0, gradients are large and noisy because the model is initialized randomly, and a full-strength optimizer step at that point can destabilize training.

### Wall-clock time vs iteration count

Iteration count alone is not a faithful measure of training cost. Two runs with the same iteration count can differ by a factor of ten in wall-clock time and dollars spent, depending on:

- **Hardware.** A100, H100, and TPUv5 chips have different per-iteration throughput.
- **Model size.** A larger model spends more time per iteration in matrix multiplications and attention.
- **Sequence length.** Doubling the context length more than doubles the per-iteration time for transformers because of the quadratic attention cost.
- **Batch size.** Larger batches amortize fixed overheads but increase memory pressure.
- **Parallelism strategy.** Tensor, pipeline, sequence, and data parallelism each carry their own communication overhead.

For this reason, papers describing large-scale training runs typically report total compute (in FLOPs or GPU-hours) and total tokens, not just the iteration count. The iteration count is most useful for tracking progress within a fixed configuration, not for comparing across configurations.

## What iterative algorithms exist beyond gradient descent?

The concept of iteration extends well beyond neural network training. Many classical machine learning algorithms are inherently iterative.

### K-means clustering

[K-means](/wiki/k-means) alternates between two steps in each iteration: (1) assigning every data point to its nearest cluster center, and (2) recomputing each cluster center as the mean of its assigned points. The algorithm converges when assignments stop changing. At each iteration, the within-cluster sum of squares (WCSS) is guaranteed to decrease or stay the same, ensuring convergence to a local optimum [5].

In typical scikit-learn defaults, k-means runs at most 300 iterations and stops early if the centroid shift between iterations falls below a tolerance of 1e-4. For most well-conditioned datasets, convergence happens in 10 to 50 iterations [15]. Hard cases with many clusters or pathological initializations can take much longer, which is one reason that k-means++ initialization is the default in modern implementations.
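The two alternating steps and the stopping rule can be sketched for one-dimensional data in plain Python. This is a simplified illustration, not scikit-learn's implementation: the function `kmeans_1d`, the absolute center-shift test, and the handling of empty clusters are assumptions made for the example.

```python
def kmeans_1d(points, centers, max_iter=300, tol=1e-4):
    """Lloyd's algorithm on 1-D data; returns (centers, iterations_run)."""
    centers = list(centers)
    for iteration in range(1, max_iter + 1):
        # Step 1: assign every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Step 2: recompute each center as the mean of its assigned points
        # (an empty cluster keeps its old center in this sketch).
        new_centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:   # centers barely moved: treat as converged
            break
    return centers, iteration

centers, n_iter = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], centers=[0.0, 5.0])
print(centers, n_iter)   # [2.0, 11.0] 3
```

The third iteration changes nothing, which is how the loop detects convergence: one extra pass is needed to confirm that the centers have stopped moving.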

### Expectation-maximization (EM)

The [expectation-maximization](/wiki/expectation_maximization) (EM) algorithm fits probabilistic models with latent variables (such as [Gaussian mixture models](/wiki/gaussian_mixture_model)) by iterating between an expectation step (E-step) and a maximization step (M-step). Each iteration increases the data log-likelihood, or at least never decreases it, until convergence. K-means can be viewed as a special case of EM with hard assignments [6].

The E-step computes the posterior probability that each data point belongs to each latent component, given the current parameter estimates. The M-step then maximizes the expected complete-data log-likelihood with respect to the parameters, treating the posteriors from the E-step as fixed. Iterating these two steps generates a non-decreasing sequence of log-likelihood values, and the algorithm terminates when the increase per iteration falls below a tolerance.

Dempster, Laird, and Rubin showed in their landmark 1977 paper that EM is guaranteed to converge to a local maximum of the likelihood (or to a saddle point in pathological cases) [6]. Like k-means, EM is sensitive to initialization, and multiple restarts with different random seeds are common in practice.
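The E-step and M-step can be written out for a one-dimensional Gaussian mixture using only the standard library. The sketch below is an illustration under stated assumptions: the toy data, the initial parameters, and the small variance floor of `1e-6` (added to avoid division by zero) are choices made here, and the recorded trace lets the non-decreasing log-likelihood be inspected directly.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, max_iter=100, tol=1e-8):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and the log-likelihood trace."""
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(data)
    trace = []
    for _ in range(max_iter):
        # E-step: posterior responsibility of each component for each point.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([p / total for p in joint])
        trace.append(log_lik)   # log-likelihood under the parameters used in this E-step
        if len(trace) > 1 and trace[-1] - trace[-2] < tol:
            break
        # M-step: re-estimate parameters with the responsibilities held fixed.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)   # variance floor: an assumption of this sketch
    return means, variances, weights, trace

data = [0.8, 1.0, 1.2, 1.1, 0.9, 4.9, 5.0, 5.2, 5.1, 4.8]
means, variances, weights, trace = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print([round(m, 3) for m in means], len(trace))
```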

### Policy iteration and value iteration

In [reinforcement learning](/wiki/reinforcement_learning), [policy iteration](/wiki/policy_iteration) and value iteration are two classical methods for solving Markov decision processes. Both are iterative.

[Policy iteration](/wiki/policy_iteration) alternates between **policy evaluation** (computing the value function for the current policy by solving the Bellman equation, often itself by iteration) and **policy improvement** (updating the policy to act greedily with respect to the new values). The algorithm terminates when an entire pass of policy improvement leaves the policy unchanged, at which point the policy is provably optimal in the tabular setting [11].

Value iteration collapses these two phases into a single iteration. At each step, every state's value is updated by applying the Bellman optimality operator: the new value of a state is the maximum over actions of the expected immediate reward plus the discounted value of the resulting state. Value iteration converges to the optimal value function as the number of iterations grows, with a contraction rate determined by the discount factor `gamma`. The number of iterations needed to reach a target accuracy `epsilon` scales as `O(log(1 / epsilon) / (1 - gamma))`, so problems with high discount factors close to 1 can require many iterations [11].
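A minimal sketch of value iteration on a two-state problem shows both the Bellman backup and the effect of the discount factor on the iteration count. The example assumes a deterministic MDP (so the expectation over next states reduces to a single term), and the dictionary layout, state names, and rewards are hypothetical choices for illustration.

```python
def value_iteration(transitions, gamma, epsilon=1e-8, max_iter=10_000):
    """Tabular value iteration for a deterministic MDP.

    `transitions[state][action]` is a (reward, next_state) pair.
    Returns the value table and the number of sweeps performed.
    """
    values = {s: 0.0 for s in transitions}
    for sweep in range(1, max_iter + 1):
        # Bellman optimality backup for every state, using the previous sweep's values.
        new_values = {
            s: max(r + gamma * values[s2] for r, s2 in actions.values())
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < epsilon:
            break
    return values, sweep

mdp = {
    "A": {"stay": (0.0, "A"), "go": (1.0, "B")},
    "B": {"stay": (2.0, "B")},
}
values, sweeps = value_iteration(mdp, gamma=0.5)
print(values, sweeps)   # values approach A = 3.0 and B = 4.0
print(value_iteration(mdp, gamma=0.9)[1], "sweeps at gamma = 0.9")
```

With `gamma = 0.5` the fixed point is `V(B) = 2 / (1 - 0.5) = 4` and `V(A) = 1 + 0.5 * 4 = 3`; raising `gamma` to 0.9 slows the contraction and requires noticeably more sweeps.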

Generalized policy iteration, a unifying view introduced by Sutton and Barto, observes that almost every modern RL algorithm (including Q-learning, SARSA, and actor-critic methods) interleaves some form of policy evaluation and policy improvement at each iteration, even when the iterations are not labeled as such [11].

### Newton's method

Newton's method is a second-order iterative algorithm for finding a zero of a function or, by extension, a stationary point of an objective. Each iteration uses the gradient and the Hessian (or an approximation) to take a step that, in a quadratic neighborhood of the optimum, reaches the optimum exactly. The update rule for minimizing a function `f` is:

**x_new = x_old - H_inverse(x_old) g(x_old)**

where `g` is the gradient and `H` is the Hessian. Near a non-degenerate optimum, Newton's method exhibits **quadratic convergence**, meaning the error roughly squares at each iteration. This is dramatically faster than gradient descent's linear or sublinear convergence, but each iteration is more expensive because it requires the Hessian (or a solve against it). For modern deep learning models with billions of parameters, the full Hessian is intractable, so quasi-Newton methods such as L-BFGS and other approximate second-order optimizers such as K-FAC and Shampoo are used to capture curvature information at lower cost per iteration [9].
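In one dimension the update reduces to dividing the first derivative by the second, which makes the quadratic convergence easy to observe. The sketch below minimizes `f(x) = exp(x) - 2x`, an objective chosen here for illustration because its minimizer `ln 2` is known exactly; the function name and tolerances are likewise assumptions.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-12, max_iter=50):
    """1-D Newton iteration: x_new = x_old - grad(x_old) / hess(x_old)."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        history.append(x)
        if abs(step) < tol:
            break
    return x, history

# Minimize f(x) = exp(x) - 2x, whose unique minimizer is x = ln 2.
x_star, history = newton_minimize(grad=lambda x: math.exp(x) - 2.0,
                                  hess=lambda x: math.exp(x),
                                  x0=0.0)
for k, x in enumerate(history):
    print(k, abs(x - math.log(2.0)))   # the error shrinks roughly quadratically
```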

### Gauss-Seidel and coordinate descent

Gauss-Seidel is a classical iterative method for solving systems of linear equations `Ax = b`. At each iteration, it updates one component of the solution vector at a time, using the most recently computed values for the other components. For diagonally dominant systems, Gauss-Seidel typically converges faster than the related Jacobi method, in which each iteration uses only values from the previous iteration. The convergence rate depends on the spectral radius of the iteration matrix and can be improved by successive over-relaxation (SOR), which scales the update by a relaxation factor; the gain depends on choosing that factor well.
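The difference between the two methods is a single choice: whether each component update reads from the vector being updated or from the previous iterate. The sketch below implements both behind one hypothetical `in_place` flag and compares iteration counts on a small diagonally dominant system chosen for illustration.

```python
def solve_iteratively(A, b, in_place, tol=1e-10, max_iter=1_000):
    """Jacobi (in_place=False) or Gauss-Seidel (in_place=True) for A x = b."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iter + 1):
        previous = list(x)
        # Gauss-Seidel reads from the vector it is updating, so new components
        # are used immediately; Jacobi reads only from the previous iterate.
        source = x if in_place else previous
        for i in range(n):
            off_diagonal = sum(A[i][j] * source[j] for j in range(n) if j != i)
            x[i] = (b[i] - off_diagonal) / A[i][i]
        if max(abs(x[i] - previous[i]) for i in range(n)) < tol:
            break
    return x, iteration

A = [[4.0, 1.0], [2.0, 3.0]]   # strictly diagonally dominant
b = [1.0, 2.0]                 # exact solution is x = [0.1, 0.6]
print("Jacobi:      ", solve_iteratively(A, b, in_place=False))
print("Gauss-Seidel:", solve_iteratively(A, b, in_place=True))
```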

Coordinate descent generalizes the same idea to non-linear optimization: at each iteration, one parameter (or a block of parameters) is updated to minimize the objective with the others held fixed. Modern variants of coordinate descent are used in solving large [LASSO](/wiki/lasso) and elastic-net regression problems, where they are competitive with first-order methods on sparse data.

### Other iterative methods

- **Power iteration** computes the dominant eigenvector of a matrix by repeatedly multiplying a starting vector by the matrix and renormalizing. It is the simplest example of an iterative algorithm that converges to a fixed point under mild conditions.
- **Iterative closest point (ICP)** for aligning 3D point clouds in [computer vision](/wiki/computer_vision) alternates between a correspondence step and a transformation step.
- **PageRank** iteratively updates page importance scores until the values converge, and is itself an instance of power iteration applied to a damped, column-normalized version of the web graph's link matrix.
- **Conjugate gradient** is an iterative solver for symmetric positive definite linear systems that, in exact arithmetic, converges in at most `n` iterations for an `n` by `n` system.
- **Iterative methods in classical information retrieval.** Rocchio relevance feedback, used in early IR systems, modifies a query vector based on user feedback over several iterations to bring it closer to relevant documents and away from non-relevant ones.

All of these algorithms share the same core pattern: apply a fixed update rule, check for convergence, and repeat.
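Power iteration is small enough to show the whole pattern in one function. The sketch below applies the update rule (multiply and renormalize), checks convergence through the change in the Rayleigh-quotient eigenvalue estimate, and repeats. The stopping test and the symmetric example matrix are assumptions chosen for illustration.

```python
import math

def power_iteration(matrix, tol=1e-12, max_iter=1_000):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n   # arbitrary unit-length starting vector
    eigenvalue = 0.0
    for iteration in range(1, max_iter + 1):
        # Apply the update rule: multiply by the matrix, then renormalize.
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
        # Rayleigh quotient of the new unit vector estimates the eigenvalue.
        new_eigenvalue = sum(
            w[i] * sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n))
        converged = abs(new_eigenvalue - eigenvalue) < tol   # check for convergence
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    return eigenvalue, v, iteration

eigenvalue, vector, n_iter = power_iteration([[2.0, 1.0], [1.0, 3.0]])
print(eigenvalue, vector, n_iter)   # eigenvalue approaches (5 + sqrt(5)) / 2
```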

## How does iteration differ from recursion?

In the broader sense used in computer science, *iteration* is one of the two main control structures for repeated computation, the other being *recursion*. An iterative procedure uses a loop construct (`for`, `while`, `do-while`) and explicit state variables that change with each pass. A recursive procedure expresses repetition by having a function call itself with reduced arguments until a base case is reached.

The distinction matters in practice for several reasons. Iteration uses constant stack space, while a recursive procedure uses stack space proportional to the depth of recursion (unless tail-call optimization is applied). Some algorithms are easier to express recursively (tree traversal, divide-and-conquer sorts), while others are more natural as iterations (matrix multiplication, gradient descent loops). In machine learning training code, iteration is the universal convention because it interacts more cleanly with hardware accelerators, profilers, and distributed runtimes.
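The contrast is easiest to see when the same computation is written both ways. The two hypothetical functions below sum a list: the first uses a loop with an explicit accumulator, the second calls itself on a shrinking suffix. CPython does not apply tail-call optimization, so the recursive version is limited by the interpreter's recursion depth (about 1,000 frames by default).

```python
def sum_iterative(values):
    """Loop with an explicit accumulator; stack depth stays constant."""
    total = 0
    for v in values:
        total += v
    return total

def sum_recursive(values, start=0):
    """Same result by self-call; stack depth grows with the number of values."""
    if start == len(values):   # base case
        return 0
    return values[start] + sum_recursive(values, start + 1)

data = list(range(100))
print(sum_iterative(data), sum_recursive(data))   # both print 4950
```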

This usage of *iteration* is closely related to but distinct from the training-loop usage. A training loop is a specific instance of an iterative procedure: the loop variable is the global step count, the state is the model's parameter vector, and the update rule is the optimizer step.

## What does iteration mean in AI development?

Outside the narrow technical meaning in training loops, "iteration" also describes the broader cycle of building and improving AI systems. [Andrew Ng](https://www.deeplearning.ai/the-batch/iteration-in-ai-development/) and other practitioners emphasize that AI development is fundamentally iterative: rather than designing a perfect model on the first attempt, teams cycle through data collection, labeling, model training, error analysis, and deployment. Each pass through this cycle is an iteration at the project level [7].

In software engineering more broadly, iterative development methodologies (such as Agile sprints) decompose a large project into small, testable increments. Each increment provides feedback that guides the next. This philosophy aligns naturally with machine learning workflows, where a first model is trained quickly, its errors are analyzed, and targeted improvements are made in subsequent iterations [8].

The iteration mindset extends to scientific research as well. The progression from AlexNet (2012) to ResNet (2015), Transformer (2017), GPT-2 (2019), GPT-3 (2020), and the current generation of frontier models is a chain of project-level iterations, each building on lessons learned from the previous one. The same is true at the level of a single research project: a series of experiments, each an iteration, refines the hypothesis until the final published result emerges.

## How do you choose the right number of iterations?

Selecting how many iterations to train for involves balancing several factors:

| Factor | Effect on iteration count |
|--------|-------------------------|
| Dataset size | Larger datasets produce more iterations per epoch |
| [Batch size](/wiki/batch_size) | Larger batch sizes reduce iterations per epoch |
| Number of [epochs](/wiki/epoch) | More epochs multiply total iterations |
| [Early stopping](/wiki/early_stopping) | Halts training when validation performance stops improving, capping the total iteration count |
| Compute budget | Fixed GPU-hours or FLOP budgets impose an upper bound |
| [Learning rate](/wiki/learning_rate) schedule | Schedules tied to total steps (e.g., cosine decay to zero) require specifying the total iteration count in advance |
| Convergence behavior | Some objectives plateau early; others continue to improve for many additional iterations |
| Generalization gap | Training too long can produce [overfitting](/wiki/overfitting) even while the training loss continues to decrease |

### Early stopping

[Early stopping](/wiki/early_stopping) is the most widely used technique for choosing the iteration count adaptively. The idea is to monitor a validation metric (validation loss, validation accuracy, BLEU, or similar) at fixed intervals and stop training when the metric has not improved for a *patience* window of consecutive checks. The number of iterations at which training stops is therefore data-dependent rather than fixed in advance. Early stopping serves as a form of regularization in addition to its compute-saving role: by halting before the model begins to overfit, it produces a model that generalizes better than one trained for the full pre-set iteration budget [12].
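The patience rule can be expressed as a short function over a sequence of validation losses. This is a minimal sketch: the name `early_stopping_index`, the `min_delta` threshold, and the example losses are assumptions made for illustration, and a real training loop would evaluate the metric incrementally rather than from a finished list.

```python
def early_stopping_index(val_losses, patience=3, min_delta=0.0):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops at the first check where the loss has failed to improve
    by more than `min_delta` for `patience` consecutive checks.
    """
    best_loss = float("inf")
    best_index = -1
    checks_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_index = loss, i
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return i, best_index
    return len(val_losses) - 1, best_index   # budget exhausted without triggering

losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.49]
print(early_stopping_index(losses, patience=3))   # (6, 3)
```

In this example training halts at check 6 and the checkpoint from check 3 is kept, so the later improvement at check 7 is never seen; a larger patience trades extra compute for a lower chance of stopping too soon.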

For large language model pre-training, classical early stopping is rarely used because the validation loss typically continues to decrease for the entire training budget. Instead, the iteration count is fixed in advance based on the compute budget and the Chinchilla-style scaling laws.

### Practical guidelines

- Start with a proven configuration from the literature and adjust based on validation metrics.
- Monitor loss per iteration (not just per epoch) to detect problems early.
- Use learning rate finders or short exploratory runs to determine a good learning rate before committing to a full training run.
- Save checkpoints at regular iteration intervals so that the best-performing snapshot can be selected after training.
- For LLM training, plan the iteration count from the compute budget and scaling laws; for smaller-scale supervised learning, use early stopping on a held-out validation set.
- Do not equate total iterations with total compute when comparing runs across different model sizes, batch sizes, or hardware.
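The last guideline can be made concrete with a small comparison. The sketch below uses two hypothetical runs with identical iteration counts but different batch sizes; the `Run` class and all numbers are illustrative assumptions. Examples processed is itself only a proxy for compute, since model size and hardware also matter.

```python
from dataclasses import dataclass


@dataclass
class Run:
    name: str
    iterations: int
    batch_size: int

    @property
    def examples_processed(self) -> int:
        # Each iteration consumes one batch of examples.
        return self.iterations * self.batch_size


# Two hypothetical runs with the same number of iterations.
run_a = Run("small-batch", iterations=10_000, batch_size=32)
run_b = Run("large-batch", iterations=10_000, batch_size=1_024)

for run in (run_a, run_b):
    print(f"{run.name}: {run.iterations} iterations, "
          f"{run.examples_processed} examples processed")

ratio = run_b.examples_processed // run_a.examples_processed
print(f"Same iteration count, {ratio}x the data processed")
```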

## What are common iteration pitfalls?

A few iteration-related mistakes show up repeatedly in practice:

- **Forgetting to zero gradients.** In PyTorch, `optimizer.zero_grad()` must be called before each backward pass (or, with deliberate gradient accumulation, after each optimizer step); otherwise, gradients accumulate across iterations. This is a frequent source of mysteriously diverging loss curves [14].
- **Logging at the wrong cadence.** Logging every iteration produces a high-resolution loss curve but can slow training because of synchronous I/O. Logging every 50 to 100 iterations is a good default.
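- **Mismatched scheduler step count.** When using a learning rate scheduler tied to total steps, the scheduler's total step count must match the number of times it is actually advanced, taking any gradient accumulation factor into account. With `K`-step accumulation, 100,000 micro-batches yield only 100,000 / `K` optimizer updates: a scheduler sized for 100,000 / `K` steps but advanced on every micro-batch reaches learning rate zero `K` times too early, while one sized for 100,000 steps but advanced once per update completes only 1 / `K` of its decay.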
- **Off-by-one errors at epoch boundaries.** When the dataset size is not divisible by the batch size, the final batch of each epoch is smaller. Forgetting to account for this can lead to incorrect loss averaging and confusing reports.
- **Iteration count drift in distributed training.** When a long-running or distributed job is restarted from a checkpoint, the global step counter and the data sampler must both be restored to avoid replaying or skipping data. Frameworks like PyTorch Lightning and DeepSpeed provide checkpointing support that handles much of this, but custom training loops often do not.
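Two of the pitfalls above, the smaller final batch and the scheduler step count under gradient accumulation, come down to simple arithmetic. The following is a minimal sketch in plain Python; the helper names and the loss values are hypothetical, and `optimizer_steps` assumes that a trailing partial group of micro-batches still triggers a parameter update.

```python
import math
from typing import List, Sequence


def batch_sizes(dataset_size: int, batch_size: int) -> List[int]:
    """Sizes of the batches in one epoch; the last one may be smaller."""
    full, remainder = divmod(dataset_size, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


def epoch_mean_loss(batch_mean_losses: Sequence[float],
                    sizes: Sequence[int]) -> float:
    """Per-example mean loss over an epoch, weighting each batch by its size."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, sizes))
    return total / sum(sizes)


def optimizer_steps(micro_batches: int, accumulation_steps: int) -> int:
    """Parameter updates when gradients are accumulated over K micro-batches.

    Assumes a trailing partial group still triggers an update.
    """
    return math.ceil(micro_batches / accumulation_steps)


# 1,000 examples with batch size 64: 15 full batches plus one batch of 40.
sizes = batch_sizes(1_000, 64)
losses = [1.0] * 15 + [2.0]  # made-up per-batch mean losses

naive = sum(losses) / len(losses)         # treats the short batch as full
weighted = epoch_mean_loss(losses, sizes)  # correct per-example average
print(len(sizes), sizes[-1], naive, weighted)

# With 4-step accumulation, 100,000 micro-batches give 25,000 updates,
# which is the step count a per-update scheduler should be sized for.
print(optimizer_steps(100_000, 4))
```

The unweighted average overstates the influence of the short final batch; weighting by batch size gives the true per-example mean for the epoch.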

## Explain like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into a hoop. Each time you throw the ball, you see whether it went too far left, too far right, too high, or too low. Then you adjust your next throw based on what you learned. That single throw-and-adjust cycle is one **iteration**.

Now imagine you have a bucket of 100 balls. You grab 10 balls at a time (that is your **batch**), throw them, and then adjust your aim. After you have thrown all 100 balls, you have finished one **epoch** (one pass through the whole bucket). Since it took 10 rounds of 10 balls each to empty the bucket, you did 10 **iterations** in that epoch.

A computer learning from data works the same way. It looks at a small group of examples, checks how wrong its answers are, fixes itself a little bit, and then moves on to the next group. Each time it fixes itself is one iteration. After thousands or even millions of iterations, the computer gets really good at its task.

## See also

- [Epoch](/wiki/epoch)
- [Batch](/wiki/batch)
- [Batch size](/wiki/batch_size)
- [Gradient descent](/wiki/gradient_descent)
- [Stochastic gradient descent](/wiki/stochastic_gradient_descent)
- [Learning rate](/wiki/learning_rate)
- [Backpropagation](/wiki/backpropagation)
- [Early stopping](/wiki/early_stopping)
- [K-means](/wiki/k-means)
- [Expectation-maximization](/wiki/expectation_maximization)
- [Policy iteration](/wiki/policy_iteration)
- [Convergence](/wiki/convergence)
- [Optimizer](/wiki/optimizer)

## References

1. Brown, T., et al. "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 2020. (GPT-3 paper)
2. Touvron, H., et al. "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint arXiv:2302.13971*, 2023.
3. Dubey, A., et al. "The Llama 3 Herd of Models." *arXiv preprint arXiv:2407.21783*, 2024.
4. Hoffmann, J., et al. "Training Compute-Optimal Large Language Models." *arXiv preprint arXiv:2203.15556*, 2022. (Chinchilla scaling laws)
5. Lloyd, S. "Least Squares Quantization in PCM." *IEEE Transactions on Information Theory*, 28(2):129-137, 1982.
6. Dempster, A. P., Laird, N. M., and Rubin, D. B. "Maximum Likelihood from Incomplete Data via the EM Algorithm." *Journal of the Royal Statistical Society, Series B*, 39(1):1-38, 1977.
7. Ng, A. "Iteration in AI Development." *The Batch*, DeepLearning.AI. https://www.deeplearning.ai/the-batch/iteration-in-ai-development/
8. Ruder, S. "An Overview of Gradient Descent Optimization Algorithms." *arXiv preprint arXiv:1609.04747*, 2016.
9. Boyd, S., and Vandenberghe, L. *Convex Optimization*. Cambridge University Press, 2004.
10. Nesterov, Y. *Introductory Lectures on Convex Optimization: A Basic Course*. Springer, 2004.
11. Sutton, R. S., and Barto, A. G. *Reinforcement Learning: An Introduction*, second edition. MIT Press, 2018.
12. Goodfellow, I., Bengio, Y., and Courville, A. *Deep Learning*. MIT Press, 2016.
13. Google Developers. "Machine Learning Crash Course: Gradient Descent." https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent
14. PyTorch documentation. "Training a classifier." https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
15. scikit-learn documentation. "KMeans." https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
16. Google Developers. "Machine Learning Glossary: Fundamentals" (definitions of iteration, epoch, batch, and batch size). https://developers.google.com/machine-learning/glossary/fundamentals
17. Keras documentation. "Training and evaluation with the built-in methods" (steps per epoch and drop_remainder). https://keras.io/guides/training_with_built_in_methods/

