Iteration
Last reviewed
May 9, 2026
Sources
15 citations
Review status
Source-backed
Revision
v4 · 4,910 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 9, 2026
Sources
15 citations
Review status
Source-backed
Revision
v4 · 4,910 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Machine learning terms
In machine learning and deep learning, an iteration (also called a training step or update step) is one complete cycle of processing a single batch of data, computing the loss, calculating gradients, and updating the model's parameters. Each iteration moves the model slightly closer to an optimal set of weights and biases by applying one round of gradient descent or another optimization algorithm.
The number of training examples the model processes in each iteration is determined by the hyperparameter known as batch size. If the batch size is 64, the model processes 64 examples, computes the average loss across those examples, and performs one parameter update. That entire sequence counts as one iteration.
Iterations are the fundamental heartbeat of training. Every model, from a simple linear regression to a 400-billion-parameter large language model, learns through repeated iterations. Understanding how iterations relate to epochs, batches, and total training compute is essential for configuring training runs, debugging convergence issues, and interpreting training logs.
Beyond the narrow context of neural network training, the term iteration carries a broader meaning across mathematics, computer science, and statistics. An iterative algorithm is any procedure that repeatedly applies an update rule to a current state, generating a sequence of states that ideally converges to a solution. Many of the most important methods in numerical computation, including k-means clustering, expectation-maximization, policy iteration, Newton's method, and Gauss-Seidel relaxation, are iterative in this general sense. The training of a neural network is one specific instance of this larger pattern.
Three closely related terms appear constantly in training configurations. They are distinct concepts, but they connect through a simple formula.
| Term | Definition | Scope |
|---|---|---|
| Batch | A subset of the training dataset used in a single iteration | A group of training examples |
| Iteration (training step) | One forward pass + backward pass + parameter update on one batch | One parameter update |
| Epoch | One complete pass through the entire training dataset | All batches processed once |
The number of iterations required to complete one epoch is calculated as:
Iterations per epoch = Dataset size / Batch size
For example, if a dataset contains 10,000 training examples and the batch size is 100, then one epoch consists of 10,000 / 100 = 100 iterations. Over 50 epochs, the model would perform 100 x 50 = 5,000 total iterations (also called total training steps).
If the dataset size is not evenly divisible by the batch size, the last batch in an epoch is simply smaller than the rest, unless the training framework is configured to drop the incomplete batch.
| Parameter | Value |
|---|---|
| Dataset size | 60,000 examples |
| Batch size | 256 |
| Iterations per epoch | 60,000 / 256 = 235 (rounded up) |
| Number of epochs | 20 |
| Total iterations | 235 x 20 = 4,700 |
In this scenario, the model's parameters are updated 4,700 times over the course of training.
In practice, the words iteration, step, update, and even batch are often used interchangeably in research papers, blog posts, and framework documentation. The same paper may switch between training step and iteration in adjacent sentences. A few clarifications help when reading the literature:
Each iteration in a neural network training loop follows a well-defined sequence of operations:
After step 4, the model has completed one iteration and is ready to process the next batch.
The wall-clock time of a single iteration depends on the model architecture, hardware, and software stack. For a typical transformer-based language model trained on modern accelerators, the dominant costs in one iteration are:
| Operation | Approximate share of iteration time |
|---|---|
| Forward pass (matmuls, attention) | 30 to 40 percent |
| Backward pass | 60 to 70 percent (roughly twice the forward cost) |
| Optimizer step (e.g., AdamW updates) | 5 to 10 percent |
| Communication (in distributed training) | Variable; can dominate at extreme scale |
| Data loading and preprocessing | Often hidden by overlap, 0 to 5 percent |
On a modern H100 GPU running a 7-billion-parameter model with a sequence length of 4,096 and a per-device batch of 4, a single iteration takes on the order of one second. For trillion-parameter training on thousands of GPUs, an iteration may take ten to thirty seconds even with extensive parallelism, because more time is spent on cross-node communication and pipeline bubbles. The aggregate iteration count therefore needs to be planned alongside the wall-clock budget, not just the FLOP budget.
The concept of an iteration is central to all variants of gradient descent. What differs across variants is how much data each iteration consumes.
| Variant | Data per iteration | Characteristics |
|---|---|---|
| Stochastic gradient descent (SGD) | 1 example | Very frequent updates; noisy gradients; low memory usage; can escape shallow local minima |
| Mini-batch gradient descent | A subset (e.g., 32, 64, 256 examples) | Balances update frequency with gradient stability; standard practice in modern deep learning |
| Batch (full-batch) gradient descent | Entire dataset | One iteration per epoch; stable but slow gradients; high memory cost; impractical for large datasets |
In practice, nearly all modern deep learning training uses mini-batch gradient descent. When practitioners refer to "one iteration," they almost always mean processing one mini-batch.
In classical convex optimization, the number of iterations needed for an algorithm to reach a target accuracy is studied as a function of problem properties such as the condition number, the dimension, and the smoothness of the objective. The most-cited bounds for first-order methods on a smooth convex objective with Lipschitz constant L and strong convexity parameter mu are summarized in the table below. The condition number is defined as kappa = L / mu.
| Method | Iterations to reach error epsilon |
|---|---|
| Full-batch gradient descent (smooth, convex) | O(L / epsilon) |
| Full-batch gradient descent (strongly convex) | O(kappa log(1 / epsilon)) |
| Nesterov accelerated gradient (smooth, convex) | O(sqrt(L / epsilon)) |
| Nesterov accelerated gradient (strongly convex) | O(sqrt(kappa) log(1 / epsilon)) |
| Stochastic gradient descent (strongly convex) | O(1 / epsilon) |
These bounds, due in large part to Nesterov, Polyak, and others, were assembled in the form most familiar to modern practitioners by Boyd and Vandenberghe and by Nesterov's textbook on convex optimization [9][10]. They are worst-case bounds and rarely match the empirical iteration count needed in practice. They do, however, capture two important facts: ill-conditioned problems (large kappa) require many more iterations, and accelerated methods cut the iteration count substantially.
For non-convex objectives such as the loss landscapes of deep neural networks, there are no general bounds on iterations to reach a global optimum. Modern training instead aims at finding a good local minimum or a flat region of low loss, and the iteration count is chosen empirically.
Gradient accumulation is a technique that decouples the forward and backward pass count from the parameter update count. Instead of updating parameters after every batch, the gradients from several consecutive forward and backward passes are summed (accumulated) into a single gradient buffer, and the optimizer step is taken only after K of these mini-batches have been processed. The effect is to simulate a batch size that is K times larger without exceeding GPU memory.
With gradient accumulation, the meaning of iteration becomes ambiguous. Two conventions are common:
K micro-batches.Most large-scale training frameworks (DeepSpeed, Megatron-LM, FSDP) use the step-based convention because the optimizer step is the meaningful unit for learning rate schedules and checkpoints.
Most training frameworks maintain a global step counter that increments by one after each iteration, regardless of the current epoch. This counter serves several purposes:
In PyTorch, the global step is typically tracked manually in the training loop. In TensorFlow/Keras, callbacks and the built-in training loop expose the step count automatically.
A single iteration of a supervised PyTorch training loop in pseudocode looks like this:
for batch_x, batch_y in dataloader: # 1. batch sampling
optimizer.zero_grad() # clear stale gradients
predictions = model(batch_x) # 2. forward pass
loss = loss_fn(predictions, batch_y) # loss computation
loss.backward() # 3. backward pass
optimizer.step() # 4. parameter update
global_step += 1 # 5. logging / scheduling
This five-line core has remained essentially unchanged since the early days of PyTorch, despite enormous changes in model architectures and scale. The same five-step pattern is implicit in tf.keras.Model.fit and in the Trainer class of Hugging Face Transformers, even though those higher-level APIs hide the loop behind a single fit or train call.
The scale of modern large language model (LLM) training is often described in terms of total training tokens rather than iterations, but the relationship between the two is straightforward:
Total tokens = Total iterations x Batch size (in tokens)
| Model | Total training tokens | Batch size (tokens) | Approximate total iterations |
|---|---|---|---|
| GPT-3 175B (OpenAI, 2020) | 300 billion | 3.2 million | ~93,750 |
| LLaMA 1 65B (Meta, 2023) | 1.4 trillion | 4 million | ~350,000 |
| Chinchilla 70B (DeepMind, 2022) | 1.4 trillion | 3 million | ~470,000 |
| LLaMA 2 70B (Meta, 2023) | 2 trillion | 4 million | ~500,000 |
| LLaMA 3 405B (Meta, 2024) | 15 trillion | 16 million | ~937,500 |
These numbers illustrate the enormous scale of LLM training. GPT-3's 175-billion-parameter model required roughly 94,000 iterations, each processing 3.2 million tokens. By 2024, LLaMA 3's flagship model ran nearly one million iterations with batches of 16 million tokens per step [1][2][3].
Training budgets for LLMs are often planned in terms of total compute (measured in FLOPs) rather than iteration count alone. The Chinchilla scaling laws suggested that compute-optimal training uses roughly 20 tokens per parameter, but subsequent models like LLaMA 3 have trained far beyond that ratio (over 1,800 tokens per parameter for the 8B variant), demonstrating that "over-training" smaller models on more data can yield strong performance at lower inference cost [4].
Large-scale training runs often begin with a smaller batch size and gradually increase it during the first few thousand iterations. This technique, sometimes called batch size warmup, can stabilize early training when the model's parameters are still far from any useful solution. The technique is closely related to learning rate warmup, in which the learning rate ramps up linearly over the first few hundred to a few thousand iterations before settling into its main schedule. Both warmups address the same underlying problem: at iteration 0, gradients are large and noisy because the model is initialized randomly, and a full-strength optimizer step at that point can destabilize training.
Iteration count alone is not a faithful measure of training cost. Two runs with the same iteration count can differ by a factor of ten in wall-clock time and dollars spent, depending on:
For this reason, papers describing large-scale training runs typically report total compute (in FLOPs or GPU-hours) and total tokens, not just the iteration count. The iteration count is most useful for tracking progress within a fixed configuration, not for comparing across configurations.
The concept of iteration extends well beyond neural network training. Many classical machine learning algorithms are inherently iterative.
K-means alternates between two steps in each iteration: (1) assigning every data point to its nearest cluster center, and (2) recomputing each cluster center as the mean of its assigned points. The algorithm converges when assignments stop changing. At each iteration, the within-cluster sum of squares (WCSS) is guaranteed to decrease or stay the same, ensuring convergence to a local optimum [5].
In typical scikit-learn defaults, k-means runs at most 300 iterations and stops early if the centroid shift between iterations falls below a tolerance of 1e-4. For most well-conditioned datasets, convergence happens in 10 to 50 iterations. Hard cases with many clusters or pathological initializations can take much longer, which is one reason that k-means++ initialization is the default in modern implementations.
The expectation-maximization (EM) algorithm fits probabilistic models with latent variables (such as Gaussian mixture models) by iterating between an expectation step (E-step) and a maximization step (M-step). Each iteration increases the data log-likelihood until convergence. K-means can be viewed as a special case of EM with hard assignments [6].
The E-step computes the posterior probability that each data point belongs to each latent component, given the current parameter estimates. The M-step then maximizes the expected complete-data log-likelihood with respect to the parameters, treating the posteriors from the E-step as fixed. Iterating these two steps generates a non-decreasing sequence of log-likelihood values, and the algorithm terminates when the increase per iteration falls below a tolerance.
Dempster, Laird, and Rubin showed in their landmark 1977 paper that EM is guaranteed to converge to a local maximum of the likelihood (or to a saddle point in pathological cases) [6]. Like k-means, EM is sensitive to initialization, and multiple restarts with different random seeds are common in practice.
In reinforcement learning, policy iteration and value iteration are two classical methods for solving Markov decision processes. Both are iterative.
Policy iteration alternates between policy evaluation (computing the value function for the current policy by solving the Bellman equation, often itself by iteration) and policy improvement (updating the policy to act greedily with respect to the new values). The algorithm terminates when an entire pass of policy improvement leaves the policy unchanged, at which point the policy is provably optimal in the tabular setting [11].
Value iteration collapses these two phases into a single iteration. At each step, every state's value is updated by applying the Bellman optimality operator: the new value of a state is the maximum over actions of the expected immediate reward plus the discounted value of the resulting state. Value iteration converges to the optimal value function as the number of iterations grows, with a contraction rate determined by the discount factor gamma. The number of iterations needed to reach a target accuracy epsilon scales as O(log(1 / epsilon) / (1 - gamma)), so problems with high discount factors close to 1 can require many iterations [11].
Generalized policy iteration, a unifying view introduced by Sutton and Barto, observes that almost every modern RL algorithm (including Q-learning, SARSA, and actor-critic methods) interleaves some form of policy evaluation and policy improvement at each iteration, even when the iterations are not labeled as such [11].
Newton's method is a second-order iterative algorithm for finding a zero of a function or, by extension, a stationary point of an objective. Each iteration uses the gradient and the Hessian (or an approximation) to take a step that, in a quadratic neighborhood of the optimum, reaches the optimum exactly. The update rule for minimizing a function f is:
x_new = x_old - H_inverse(x_old) g(x_old)
where g is the gradient and H is the Hessian. Near a non-degenerate optimum, Newton's method exhibits quadratic convergence, meaning the error roughly squares at each iteration. This is dramatically faster than gradient descent's linear or sublinear convergence, but each iteration is more expensive because it requires the Hessian (or a solve against it). For modern deep learning models with billions of parameters, the full Hessian is intractable, so quasi-Newton methods such as L-BFGS, K-FAC, and Shampoo are used to approximate the second-order information at lower cost per iteration [9].
Gauss-Seidel is a classical iterative method for solving systems of linear equations Ax = b. At each iteration, it updates one component of the solution vector at a time, using the most recently computed values for the other components. For diagonally dominant systems, Gauss-Seidel converges faster than the related Jacobi method, in which each iteration uses only values from the previous iteration. The convergence rate depends on the spectral radius of the iteration matrix and improves with successive over-relaxation (SOR), which scales the update by a relaxation factor.
Coordinate descent generalizes the same idea to non-linear optimization: at each iteration, one parameter (or a block of parameters) is updated to minimize the objective with the others held fixed. Modern variants of coordinate descent are used in solving large LASSO and elastic-net regression problems, where they are competitive with first-order methods on sparse data.
n iterations for an n by n system.All of these algorithms share the same core pattern: apply a fixed update rule, check for convergence, and repeat.
In the broader sense used in computer science, iteration is one of the two main control structures for repeated computation, the other being recursion. An iterative procedure uses a loop construct (for, while, do-while) and explicit state variables that change with each pass. A recursive procedure expresses repetition by having a function call itself with reduced arguments until a base case is reached.
The distinction matters in practice for several reasons. Iteration uses constant stack space, while a recursive procedure uses stack space proportional to the depth of recursion (unless tail-call optimization is applied). Some algorithms are easier to express recursively (tree traversal, divide-and-conquer sorts), while others are more natural as iterations (matrix multiplication, gradient descent loops). In machine learning training code, iteration is the universal convention because it interacts more cleanly with hardware accelerators, profilers, and distributed runtimes.
This usage of iteration is closely related to but distinct from the training-loop usage. A training loop is a specific instance of an iterative procedure: the loop variable is the global step count, the state is the model's parameter vector, and the update rule is the optimizer step.
Outside the narrow technical meaning in training loops, "iteration" also describes the broader cycle of building and improving AI systems. Andrew Ng and other practitioners emphasize that AI development is fundamentally iterative: rather than designing a perfect model on the first attempt, teams cycle through data collection, labeling, model training, error analysis, and deployment. Each pass through this cycle is an iteration at the project level [7].
In software engineering more broadly, iterative development methodologies (such as Agile sprints) decompose a large project into small, testable increments. Each increment provides feedback that guides the next. This philosophy aligns naturally with machine learning workflows, where a first model is trained quickly, its errors are analyzed, and targeted improvements are made in subsequent iterations [8].
The iteration mindset extends to scientific research as well. The progression from AlexNet (2012) to ResNet (2015), Transformer (2017), GPT-2 (2019), GPT-3 (2020), and the current generation of frontier models is a chain of project-level iterations, each building on lessons learned from the previous one. The same is true at the level of a single research project: a series of experiments, each an iteration, refines the hypothesis until the final published result emerges.
Selecting how many iterations to train for involves balancing several factors:
| Factor | Effect on iteration count |
|---|---|
| Dataset size | Larger datasets produce more iterations per epoch |
| Batch size | Larger batch sizes reduce iterations per epoch |
| Number of epochs | More epochs multiply total iterations |
| Early stopping | Halts training when validation performance stops improving, capping the total iteration count |
| Compute budget | Fixed GPU-hours or FLOP budgets impose an upper bound |
| Learning rate schedule | Schedules tied to total steps (e.g., cosine decay to zero) require specifying the total iteration count in advance |
| Convergence behavior | Some objectives plateau early; others continue to improve for many additional iterations |
| Generalization gap | Training too long can produce overfitting even after the training loss continues to decrease |
Early stopping is the most widely used technique for choosing the iteration count adaptively. The idea is to monitor a validation metric (validation loss, validation accuracy, BLEU, or similar) at fixed intervals and stop training when the metric has not improved for a patience window of consecutive checks. The number of iterations at which training stops is therefore data-dependent rather than fixed in advance. Early stopping serves as a form of regularization in addition to its compute-saving role: by halting before the model begins to overfit, it produces a model that generalizes better than one trained for the full pre-set iteration budget.
For large language model pre-training, classical early stopping is rarely used because the validation loss typically continues to decrease for the entire training budget. Instead, the iteration count is fixed in advance based on the compute budget and the Chinchilla-style scaling laws.
A few iteration-related mistakes show up repeatedly in practice:
optimizer.zero_grad() must be called before each backward pass, otherwise gradients accumulate across iterations. This is a frequent source of mysteriously diverging loss curves.K times too early if K-step accumulation is used.Imagine you are learning to throw a basketball into a hoop. Each time you throw the ball, you see whether it went too far left, too far right, too high, or too low. Then you adjust your next throw based on what you learned. That single throw-and-adjust cycle is one iteration.
Now imagine you have a bucket of 100 balls. You grab 10 balls at a time (that is your batch), throw them, and then adjust your aim. After you have thrown all 100 balls, you have finished one epoch (one pass through the whole bucket). If you needed 10 throws of 10 balls each to empty the bucket, you did 10 iterations in that epoch.
A computer learning from data works the same way. It looks at a small group of examples, checks how wrong its answers are, fixes itself a little bit, and then moves on to the next group. Each time it fixes itself is one iteration. After thousands or even millions of iterations, the computer gets really good at its task.