See also: Machine learning terms
In machine learning and deep learning, an iteration (also called a training step or update step) is one complete cycle of processing a single batch of data, computing the loss, calculating gradients, and updating the model's parameters. Each iteration applies one round of gradient descent (or another optimization algorithm), nudging the model's weights and biases toward lower loss.
The number of training examples the model processes in each iteration is determined by the hyperparameter known as batch size. If the batch size is 64, the model processes 64 examples, computes the average loss across those examples, and performs one parameter update. That entire sequence counts as one iteration.
Iterations are the fundamental heartbeat of training. Every model, from a simple linear regression to a 400-billion-parameter large language model, learns through repeated iterations. Understanding how iterations relate to epochs, batches, and total training compute is essential for configuring training runs, debugging convergence issues, and interpreting training logs.
Three closely related terms appear constantly in training configurations. They are distinct concepts, but they connect through a simple formula.
| Term | Definition | Scope |
|---|---|---|
| Batch | A subset of the training dataset used in a single iteration | A group of training examples |
| Iteration (training step) | One forward pass + backward pass + parameter update on one batch | One parameter update |
| Epoch | One complete pass through the entire training dataset | All batches processed once |
The number of iterations required to complete one epoch is calculated as:
Iterations per epoch = Dataset size / Batch size
For example, if a dataset contains 10,000 training examples and the batch size is 100, then one epoch consists of 10,000 / 100 = 100 iterations. Over 50 epochs, the model would perform 100 x 50 = 5,000 total iterations (also called total training steps).
If the dataset size is not evenly divisible by the batch size, the last batch in an epoch is simply smaller than the rest, unless the training framework is configured to drop the incomplete batch.
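This calculation, including the uneven final batch, can be sketched in a few lines of Python. The `drop_last` flag mirrors the common framework option for discarding the incomplete batch (e.g., PyTorch's `DataLoader(drop_last=True)`):

```python
import math

def iterations_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of iterations (parameter updates) in one epoch."""
    if drop_last:
        # The incomplete final batch is discarded.
        return dataset_size // batch_size
    # The incomplete final batch is kept, so round up.
    return math.ceil(dataset_size / batch_size)

print(iterations_per_epoch(10_000, 100))                  # 100
print(iterations_per_epoch(60_000, 256))                  # 235 (last batch has 96 examples)
print(iterations_per_epoch(60_000, 256, drop_last=True))  # 234
```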
| Parameter | Value |
|---|---|
| Dataset size | 60,000 examples |
| Batch size | 256 |
| Iterations per epoch | 60,000 / 256 = 235 (rounded up) |
| Number of epochs | 20 |
| Total iterations | 235 x 20 = 4,700 |
In this scenario, the model's parameters are updated 4,700 times over the course of training.
Each iteration in a neural network training loop follows a well-defined sequence of operations:
1. Forward pass: the batch is fed through the network to produce predictions.
2. Loss computation: the predictions are compared against the true labels to produce a single scalar loss, typically averaged over the batch.
3. Backward pass: gradients of the loss with respect to every trainable parameter are computed via backpropagation.
4. Parameter update: the optimizer adjusts the parameters using the gradients (e.g., one step of gradient descent).
After step 4, the model has completed one iteration and is ready to process the next batch.
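The forward/backward/update cycle can be sketched in plain Python for a toy one-parameter linear model. This is an illustrative sketch, not framework code; the data, learning rate, and iteration count are arbitrary assumptions, and each iteration here uses the full (tiny) dataset as its batch for simplicity:

```python
# A toy one-parameter model y = w * x trained with mean squared error.
# Data, learning rate, and step count are illustrative assumptions.

def run_iteration(w, batch, lr):
    # 1. Forward pass: predictions for every example in the batch.
    preds = [w * x for x, _ in batch]
    # 2. Loss computation: average squared error over the batch.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / len(batch)
    # 3. Backward pass: gradient of the loss with respect to w.
    grad = sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
    # 4. Parameter update: one step of gradient descent.
    return w - lr * grad, loss

data = [(x, 3.0 * x) for x in range(1, 9)]  # true relationship: y = 3x
w = 0.0
for step in range(20):  # 20 iterations, i.e., 20 parameter updates
    w, loss = run_iteration(w, data, lr=0.01)
print(round(w, 3))  # close to 3.0
```

Each call to `run_iteration` is exactly one iteration: one forward pass, one loss, one gradient, one update.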
The concept of an iteration is central to all variants of gradient descent. What differs across variants is how much data each iteration consumes.
| Variant | Data per Iteration | Characteristics |
|---|---|---|
| Stochastic Gradient Descent (SGD) | 1 example | Very frequent updates; noisy gradients; low memory usage; can escape shallow local minima |
| Mini-batch Gradient Descent | A subset (e.g., 32, 64, 256 examples) | Balances update frequency with gradient stability; standard practice in modern deep learning |
| Batch (Full-batch) Gradient Descent | Entire dataset | One iteration per epoch; stable but slow gradients; high memory cost; impractical for large datasets |
In practice, nearly all modern deep learning training uses mini-batch gradient descent. When practitioners refer to "one iteration," they almost always mean processing one mini-batch.
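The three variants differ only in how the dataset is sliced per iteration. A minimal sketch of the slicing (assuming a simple list-like dataset): SGD corresponds to `batch_size=1`, full-batch to `batch_size=len(dataset)`:

```python
def minibatches(dataset, batch_size, drop_last=False):
    """Yield consecutive mini-batches from a list-like dataset."""
    n = len(dataset)
    end = n - n % batch_size if drop_last else n
    for start in range(0, end, batch_size):
        yield dataset[start:start + batch_size]

data = list(range(10))
# Mini-batch: several updates per epoch; the last batch may be smaller.
print([len(b) for b in minibatches(data, 4)])   # [4, 4, 2]
# SGD: batch size 1 -> one update per example, 10 iterations per epoch.
print(len(list(minibatches(data, 1))))          # 10
# Full-batch: the whole dataset -> one iteration per epoch.
print(len(list(minibatches(data, len(data)))))  # 1
```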
Most training frameworks maintain a global step counter that increments by one after each iteration, regardless of the current epoch. This counter serves several purposes:
- Driving learning rate schedules that are defined in terms of steps rather than epochs
- Triggering periodic actions such as logging, evaluation, and checkpointing (e.g., every 1,000 steps)
- Resuming training from a checkpoint at the exact iteration where it stopped
In PyTorch, the global step is typically tracked manually in the training loop. In TensorFlow/Keras, callbacks and the built-in training loop expose the step count automatically.
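A hedged sketch of manual global-step tracking in a PyTorch-style loop (the warmup schedule, loop sizes, and learning rate values are illustrative assumptions; the actual model and optimizer calls are elided):

```python
def lr_at(step, base_lr=0.1, warmup_steps=5):
    """Linear warmup then constant: one common use of the global step."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

global_step = 0
for epoch in range(2):
    for batch in range(3):  # 3 iterations per epoch in this toy run
        lr = lr_at(global_step)
        # ... forward pass, backward pass, optimizer step would go here ...
        global_step += 1

print(global_step)  # 6: the counter keeps counting across epoch boundaries
```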
The scale of modern large language model (LLM) training is often described in terms of total training tokens rather than iterations, but the relationship between the two is straightforward:
Total tokens = Total iterations x Batch size (in tokens)
| Model | Total Training Tokens | Batch Size (tokens) | Approximate Total Iterations |
|---|---|---|---|
| GPT-3 175B (OpenAI, 2020) | 300 billion | 3.2 million | ~93,750 |
| LLaMA 1 65B (Meta, 2023) | 1.4 trillion | 4 million | ~350,000 |
| LLaMA 3 405B (Meta, 2024) | 15 trillion | 16 million | ~937,500 |
These numbers illustrate the enormous scale of LLM training. GPT-3's 175-billion-parameter model required roughly 94,000 iterations, each processing 3.2 million tokens. By 2024, LLaMA 3's flagship model ran nearly one million iterations with batches of 16 million tokens per step [1][2][3].
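The formula can be checked directly against the table's published token counts (a trivial sketch; the figures are rounded, so the results are approximate iteration counts):

```python
def total_iterations(total_tokens, batch_size_tokens):
    # Total tokens = total iterations x batch size (in tokens), rearranged.
    return total_tokens / batch_size_tokens

print(total_iterations(300e9, 3.2e6))   # GPT-3: 93750.0
print(total_iterations(1.4e12, 4e6))    # LLaMA 1 65B: 350000.0
print(total_iterations(15e12, 16e6))    # LLaMA 3 405B: 937500.0
```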
Training budgets for LLMs are often planned in terms of total compute (measured in FLOPs) rather than iteration count alone. The Chinchilla scaling laws suggested that compute-optimal training uses roughly 20 tokens per parameter, but subsequent models like LLaMA 3 have trained far beyond that ratio (over 1,800 tokens per parameter for the 8B variant), demonstrating that "over-training" smaller models on more data can yield strong performance at lower inference cost [4].
Large-scale training runs often begin with a smaller batch size and gradually increase it during the first few thousand iterations. This technique, sometimes called batch size warmup, can stabilize early training when the model's parameters are still far from any useful solution.
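One possible warmup schedule can be sketched as a linear ramp (the start size, target size, and warmup length below are illustrative assumptions, not values from any published training run):

```python
def batch_size_at(step, start=256, target=4096, warmup_iters=1000):
    """Linear batch-size warmup over the first warmup_iters iterations."""
    if step >= warmup_iters:
        return target
    return int(start + (step / warmup_iters) * (target - start))

print(batch_size_at(0))     # 256
print(batch_size_at(500))   # 2176 (halfway through warmup)
print(batch_size_at(5000))  # 4096 (warmup finished)
```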
The concept of iteration extends well beyond neural network training. Many classical machine learning algorithms are inherently iterative:
K-means alternates between two steps in each iteration: (1) assigning every data point to its nearest cluster center, and (2) recomputing each cluster center as the mean of its assigned points. The algorithm converges when assignments stop changing. At each iteration, the within-cluster sum of squares (WCSS) is guaranteed to decrease or stay the same, ensuring convergence to a local optimum [5].
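The two alternating steps can be sketched for one-dimensional data (a minimal illustration; real implementations handle multiple dimensions, better initialization, and explicit convergence checks):

```python
def kmeans_1d(points, centers, n_iters=10):
    """Minimal 1-D k-means; each iteration alternates two steps."""
    for _ in range(n_iters):
        # (1) Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # (2) Update step: each center moves to the mean of its points.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(pts, centers=[0.0, 5.0]))  # centers converge near 1.0 and 10.0
```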
The EM algorithm fits probabilistic models with latent variables (such as Gaussian mixture models) by iterating between an expectation step (E-step) and a maximization step (M-step). Each iteration increases the data log-likelihood until convergence. K-means can be viewed as a special case of EM with hard assignments [6].
All of these algorithms share the same core pattern: apply a fixed update rule, check for convergence, and repeat.
Outside the narrow technical meaning in training loops, "iteration" also describes the broader cycle of building and improving AI systems. Andrew Ng and other practitioners emphasize that AI development is fundamentally iterative: rather than designing a perfect model on the first attempt, teams cycle through data collection, labeling, model training, error analysis, and deployment. Each pass through this cycle is an iteration at the project level [7].
In software engineering more broadly, iterative development methodologies (such as Agile sprints) decompose a large project into small, testable increments. Each increment provides feedback that guides the next. This philosophy aligns naturally with machine learning workflows, where a first model is trained quickly, its errors are analyzed, and targeted improvements are made in subsequent iterations [8].
Selecting how many iterations to train for involves balancing several factors:
| Factor | Effect on Iteration Count |
|---|---|
| Dataset size | Larger datasets produce more iterations per epoch |
| Batch size | Larger batch sizes reduce iterations per epoch |
| Number of epochs | More epochs multiply total iterations |
| Early stopping | Halts training when validation performance stops improving, capping the total iteration count |
| Compute budget | Fixed GPU-hours or FLOP budgets impose an upper bound |
| Learning rate schedule | Schedules tied to total steps (e.g., cosine decay to zero) require specifying the total iteration count in advance |
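As an example of a schedule tied to total steps, a cosine decay to zero can be sketched as follows. The base learning rate is an illustrative assumption; the 4,700-step figure reuses the worked example earlier in this article:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.001):
    """Cosine decay from base_lr to zero over a fixed number of steps."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

total = 4_700  # total iterations from the worked example above
print(cosine_lr(0, total))      # starts at base_lr
print(cosine_lr(total, total))  # decays to (approximately) zero
```

Because the schedule takes `total_steps` as an input, the total iteration count must be decided before training begins.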
Practical guidelines:
- Prefer early stopping on a held-out validation set over guessing a fixed iteration count in advance.
- If the learning rate schedule depends on the total number of steps (such as cosine decay), estimate the total iteration count before training starts and keep it fixed.
- Monitor training and validation loss per iteration; a widening gap between them suggests that further iterations will mainly overfit.
Imagine you are learning to throw a basketball into a hoop. Each time you throw the ball, you see whether it went too far left, too far right, too high, or too low. Then you adjust your next throw based on what you learned. That single throw-and-adjust cycle is one iteration.
Now imagine you have a bucket of 100 balls. You grab 10 balls at a time (that is your batch), throw them, and then adjust your aim. After you have thrown all 100 balls, you have finished one epoch (one pass through the whole bucket). If you needed 10 throws of 10 balls each to empty the bucket, you did 10 iterations in that epoch.
A computer learning from data works the same way. It looks at a small group of examples, checks how wrong its answers are, fixes itself a little bit, and then moves on to the next group. Each time it fixes itself is one iteration. After thousands or even millions of iterations, the computer gets really good at its task.