Iteration

Introduction

In machine learning and deep learning, an iteration (also called a training step or update step) is one complete cycle of processing a single batch of data, computing the loss, calculating gradients, and updating the model's parameters. Each iteration moves the model slightly closer to an optimal set of weights and biases by applying one round of gradient descent or another optimization algorithm.

The number of training examples the model processes in each iteration is determined by the hyperparameter known as batch size. If the batch size is 64, the model processes 64 examples, computes the average loss across those examples, and performs one parameter update. That entire sequence counts as one iteration.

Iterations are the fundamental heartbeat of training. Every model, from a simple linear regression to a 400-billion-parameter large language model, learns through repeated iterations. Understanding how iterations relate to epochs, batches, and total training compute is essential for configuring training runs, debugging convergence issues, and interpreting training logs.

Beyond the narrow context of neural network training, the term iteration carries a broader meaning across mathematics, computer science, and statistics. An iterative algorithm is any procedure that repeatedly applies an update rule to a current state, generating a sequence of states that ideally converges to a solution. Many of the most important methods in numerical computation, including k-means clustering, expectation-maximization, policy iteration, Newton's method, and Gauss-Seidel relaxation, are iterative in this general sense. The training of a neural network is one specific instance of this larger pattern.

Iteration vs. epoch vs. batch

Three closely related terms appear constantly in training configurations. They are distinct concepts, but they connect through a simple formula.

Term	Definition	Scope
Batch	A subset of the training dataset used in a single iteration	A group of training examples
Iteration (training step)	One forward pass + backward pass + parameter update on one batch	One parameter update
Epoch	One complete pass through the entire training dataset	All batches processed once

The formula

The number of iterations required to complete one epoch is calculated as:

Iterations per epoch = Dataset size / Batch size

For example, if a dataset contains 10,000 training examples and the batch size is 100, then one epoch consists of 10,000 / 100 = 100 iterations. Over 50 epochs, the model would perform 100 x 50 = 5,000 total iterations (also called total training steps).

If the dataset size is not evenly divisible by the batch size, the last batch in an epoch is simply smaller than the rest, unless the training framework is configured to drop the incomplete batch.

Numerical example

Parameter	Value
Dataset size	60,000 examples
Batch size	256
Iterations per epoch	60,000 / 256 = 235 (rounded up)
Number of epochs	20
Total iterations	235 x 20 = 4,700

In this scenario, the model's parameters are updated 4,700 times over the course of training.

Common terminology confusion

In practice, the words iteration, step, update, and even batch are often used interchangeably in research papers, blog posts, and framework documentation. The same paper may switch between training step and iteration in adjacent sentences. A few clarifications help when reading the literature:

An iteration and a training step almost always mean the same thing in supervised deep learning: one optimizer update.
A batch is a noun describing the data, not the action. Saying "after one batch" usually means "after one iteration" but the phrase is ambiguous when gradient accumulation is in use.
An epoch is the only term whose meaning is essentially fixed: one sweep through the whole training set.
The word update is sometimes reserved for the parameter change itself, in which case multiple forward passes can precede a single update (see gradient accumulation below).

What happens during a single iteration

Each iteration in a neural network training loop follows a well-defined sequence of operations:

Batch sampling. A batch of training examples is drawn from the dataset (or served by a data loader).
Forward pass. The input data passes through every layer of the network, producing predictions. The loss function compares these predictions against the true labels or targets to compute a scalar loss value.
Backward pass (backpropagation). The loss is differentiated with respect to each learnable parameter in the network. This produces a gradient for every weight and bias, indicating the direction and magnitude of change that would reduce the loss.
Parameter update. The optimizer (such as SGD, Adam, or AdaGrad) uses the gradients and the learning rate to adjust each parameter. The core update rule is: parameter = parameter - learning_rate x gradient.
Metric logging. Training frameworks typically record the loss, accuracy, learning rate, and other metrics at each iteration or at fixed intervals.

After step 4, the model has completed one iteration and is ready to process the next batch.

Compute cost per iteration

The wall-clock time of a single iteration depends on the model architecture, hardware, and software stack. For a typical transformer-based language model trained on modern accelerators, the dominant costs in one iteration are:

Operation	Approximate share of iteration time
Forward pass (matmuls, attention)	30 to 40 percent
Backward pass	60 to 70 percent (roughly twice the forward cost)
Optimizer step (e.g., AdamW updates)	5 to 10 percent
Communication (in distributed training)	Variable; can dominate at extreme scale
Data loading and preprocessing	Often hidden by overlap, 0 to 5 percent

On a modern H100 GPU running a 7-billion-parameter model with a sequence length of 4,096 and a per-device batch of 4, a single iteration takes on the order of one second. For trillion-parameter training on thousands of GPUs, an iteration may take ten to thirty seconds even with extensive parallelism, because more time is spent on cross-node communication and pipeline bubbles. The aggregate iteration count therefore needs to be planned alongside the wall-clock budget, not just the FLOP budget.

Iteration in gradient descent variants

The concept of an iteration is central to all variants of gradient descent. What differs across variants is how much data each iteration consumes.

Variant	Data per iteration	Characteristics
Stochastic gradient descent (SGD)	1 example	Very frequent updates; noisy gradients; low memory usage; can escape shallow local minima
Mini-batch gradient descent	A subset (e.g., 32, 64, 256 examples)	Balances update frequency with gradient stability; standard practice in modern deep learning
Batch (full-batch) gradient descent	Entire dataset	One iteration per epoch; stable but slow gradients; high memory cost; impractical for large datasets

In practice, nearly all modern deep learning training uses mini-batch gradient descent. When practitioners refer to "one iteration," they almost always mean processing one mini-batch.

Convergence theory

In classical convex optimization, the number of iterations needed for an algorithm to reach a target accuracy is studied as a function of problem properties such as the condition number, the dimension, and the smoothness of the objective. The most-cited bounds for first-order methods on a smooth convex objective with Lipschitz constant L and strong convexity parameter mu are summarized in the table below. The condition number is defined as kappa = L / mu.

Method	Iterations to reach error `epsilon`
Full-batch gradient descent (smooth, convex)	`O(L / epsilon)`
Full-batch gradient descent (strongly convex)	`O(kappa log(1 / epsilon))`
Nesterov accelerated gradient (smooth, convex)	`O(sqrt(L / epsilon))`
Nesterov accelerated gradient (strongly convex)	`O(sqrt(kappa) log(1 / epsilon))`
Stochastic gradient descent (strongly convex)	`O(1 / epsilon)`

These bounds, due in large part to Nesterov, Polyak, and others, were assembled in the form most familiar to modern practitioners by Boyd and Vandenberghe and by Nesterov's textbook on convex optimization ^[9]^[10]. They are worst-case bounds and rarely match the empirical iteration count needed in practice. They do, however, capture two important facts: ill-conditioned problems (large kappa) require many more iterations, and accelerated methods cut the iteration count substantially.

For non-convex objectives such as the loss landscapes of deep neural networks, there are no general bounds on iterations to reach a global optimum. Modern training instead aims at finding a good local minimum or a flat region of low loss, and the iteration count is chosen empirically.

Iteration with gradient accumulation

Gradient accumulation is a technique that decouples the forward and backward pass count from the parameter update count. Instead of updating parameters after every batch, the gradients from several consecutive forward and backward passes are summed (accumulated) into a single gradient buffer, and the optimizer step is taken only after K of these mini-batches have been processed. The effect is to simulate a batch size that is K times larger without exceeding GPU memory.

With gradient accumulation, the meaning of iteration becomes ambiguous. Two conventions are common:

Step-based counting. An iteration is one optimizer step. Each iteration corresponds to K micro-batches.
Micro-batch counting. Each forward and backward pass on a micro-batch is counted as a separate iteration, even though no parameter update occurs after most of them.

Most large-scale training frameworks (DeepSpeed, Megatron-LM, FSDP) use the step-based convention because the optimizer step is the meaningful unit for learning rate schedules and checkpoints.

Global step counter and training logs

Most training frameworks maintain a global step counter that increments by one after each iteration, regardless of the current epoch. This counter serves several purposes:

Learning rate scheduling. Many learning rate schedules (warmup, cosine decay, step decay) are defined in terms of the global step rather than the epoch number, giving finer-grained control over the learning rate trajectory.
Checkpoint saving. Models are often saved at regular step intervals (for example, every 1,000 iterations) so that training can be resumed from the most recent checkpoint if a failure occurs.
Logging and visualization. Tools such as TensorBoard, Weights & Biases, and MLflow plot loss curves against the global step. Logging per iteration (rather than per epoch) provides higher-resolution insight into training dynamics, making it easier to spot instabilities, gradient spikes, or plateaus early.

In PyTorch, the global step is typically tracked manually in the training loop. In TensorFlow/Keras, callbacks and the built-in training loop expose the step count automatically.

A minimal PyTorch iteration

A single iteration of a supervised PyTorch training loop in pseudocode looks like this:

for batch_x, batch_y in dataloader:        # 1. batch sampling
    optimizer.zero_grad()                  #    clear stale gradients
    predictions = model(batch_x)           # 2. forward pass
    loss = loss_fn(predictions, batch_y)   #    loss computation
    loss.backward()                        # 3. backward pass
    optimizer.step()                       # 4. parameter update
    global_step += 1                       # 5. logging / scheduling

This five-line core has remained essentially unchanged since the early days of PyTorch, despite enormous changes in model architectures and scale. The same five-step pattern is implicit in tf.keras.Model.fit and in the Trainer class of Hugging Face Transformers, even though those higher-level APIs hide the loop behind a single fit or train call.

Training iterations in large language models

The scale of modern large language model (LLM) training is often described in terms of total training tokens rather than iterations, but the relationship between the two is straightforward:

Total tokens = Total iterations x Batch size (in tokens)

Examples from published models

Model	Total training tokens	Batch size (tokens)	Approximate total iterations
GPT-3 175B (OpenAI, 2020)	300 billion	3.2 million	~93,750
LLaMA 1 65B (Meta, 2023)	1.4 trillion	4 million	~350,000
Chinchilla 70B (DeepMind, 2022)	1.4 trillion	3 million	~470,000
LLaMA 2 70B (Meta, 2023)	2 trillion	4 million	~500,000
LLaMA 3 405B (Meta, 2024)	15 trillion	16 million	~937,500

These numbers illustrate the enormous scale of LLM training. GPT-3's 175-billion-parameter model required roughly 94,000 iterations, each processing 3.2 million tokens. By 2024, LLaMA 3's flagship model ran nearly one million iterations with batches of 16 million tokens per step ^[1]^[2]^[3].

Training budgets for LLMs are often planned in terms of total compute (measured in FLOPs) rather than iteration count alone. The Chinchilla scaling laws suggested that compute-optimal training uses roughly 20 tokens per parameter, but subsequent models like LLaMA 3 have trained far beyond that ratio (over 1,800 tokens per parameter for the 8B variant), demonstrating that "over-training" smaller models on more data can yield strong performance at lower inference cost ^[4].

Batch size ramping and warmup

Large-scale training runs often begin with a smaller batch size and gradually increase it during the first few thousand iterations. This technique, sometimes called batch size warmup, can stabilize early training when the model's parameters are still far from any useful solution. The technique is closely related to learning rate warmup, in which the learning rate ramps up linearly over the first few hundred to a few thousand iterations before settling into its main schedule. Both warmups address the same underlying problem: at iteration 0, gradients are large and noisy because the model is initialized randomly, and a full-strength optimizer step at that point can destabilize training.

Wall-clock time vs iteration count

Iteration count alone is not a faithful measure of training cost. Two runs with the same iteration count can differ by a factor of ten in wall-clock time and dollars spent, depending on:

Hardware. A100, H100, and TPUv5 chips have different per-iteration throughput.
Model size. A larger model spends more time per iteration in matrix multiplications and attention.
Sequence length. Doubling the context length more than doubles the per-iteration time for transformers because of the quadratic attention cost.
Batch size. Larger batches amortize fixed overheads but increase memory pressure.
Parallelism strategy. Tensor, pipeline, sequence, and data parallelism each carry their own communication overhead.

For this reason, papers describing large-scale training runs typically report total compute (in FLOPs or GPU-hours) and total tokens, not just the iteration count. The iteration count is most useful for tracking progress within a fixed configuration, not for comparing across configurations.

Iterative algorithms beyond gradient descent

The concept of iteration extends well beyond neural network training. Many classical machine learning algorithms are inherently iterative.

K-means clustering

K-means alternates between two steps in each iteration: (1) assigning every data point to its nearest cluster center, and (2) recomputing each cluster center as the mean of its assigned points. The algorithm converges when assignments stop changing. At each iteration, the within-cluster sum of squares (WCSS) is guaranteed to decrease or stay the same, ensuring convergence to a local optimum ^[5].

In typical scikit-learn defaults, k-means runs at most 300 iterations and stops early if the centroid shift between iterations falls below a tolerance of 1e-4. For most well-conditioned datasets, convergence happens in 10 to 50 iterations. Hard cases with many clusters or pathological initializations can take much longer, which is one reason that k-means++ initialization is the default in modern implementations.

Expectation-maximization (EM)

The expectation-maximization (EM) algorithm fits probabilistic models with latent variables (such as Gaussian mixture models) by iterating between an expectation step (E-step) and a maximization step (M-step). Each iteration increases the data log-likelihood until convergence. K-means can be viewed as a special case of EM with hard assignments ^[6].

The E-step computes the posterior probability that each data point belongs to each latent component, given the current parameter estimates. The M-step then maximizes the expected complete-data log-likelihood with respect to the parameters, treating the posteriors from the E-step as fixed. Iterating these two steps generates a non-decreasing sequence of log-likelihood values, and the algorithm terminates when the increase per iteration falls below a tolerance.

Dempster, Laird, and Rubin showed in their landmark 1977 paper that EM is guaranteed to converge to a local maximum of the likelihood (or to a saddle point in pathological cases) ^[6]. Like k-means, EM is sensitive to initialization, and multiple restarts with different random seeds are common in practice.

Policy iteration and value iteration

In reinforcement learning, policy iteration and value iteration are two classical methods for solving Markov decision processes. Both are iterative.

Policy iteration alternates between policy evaluation (computing the value function for the current policy by solving the Bellman equation, often itself by iteration) and policy improvement (updating the policy to act greedily with respect to the new values). The algorithm terminates when an entire pass of policy improvement leaves the policy unchanged, at which point the policy is provably optimal in the tabular setting ^[11].

Value iteration collapses these two phases into a single iteration. At each step, every state's value is updated by applying the Bellman optimality operator: the new value of a state is the maximum over actions of the expected immediate reward plus the discounted value of the resulting state. Value iteration converges to the optimal value function as the number of iterations grows, with a contraction rate determined by the discount factor gamma. The number of iterations needed to reach a target accuracy epsilon scales as O(log(1 / epsilon) / (1 - gamma)), so problems with high discount factors close to 1 can require many iterations ^[11].

Generalized policy iteration, a unifying view introduced by Sutton and Barto, observes that almost every modern RL algorithm (including Q-learning, SARSA, and actor-critic methods) interleaves some form of policy evaluation and policy improvement at each iteration, even when the iterations are not labeled as such ^[11].

Newton's method

Newton's method is a second-order iterative algorithm for finding a zero of a function or, by extension, a stationary point of an objective. Each iteration uses the gradient and the Hessian (or an approximation) to take a step that, in a quadratic neighborhood of the optimum, reaches the optimum exactly. The update rule for minimizing a function f is:

x_new = x_old - H_inverse(x_old) g(x_old)

where g is the gradient and H is the Hessian. Near a non-degenerate optimum, Newton's method exhibits quadratic convergence, meaning the error roughly squares at each iteration. This is dramatically faster than gradient descent's linear or sublinear convergence, but each iteration is more expensive because it requires the Hessian (or a solve against it). For modern deep learning models with billions of parameters, the full Hessian is intractable, so quasi-Newton methods such as L-BFGS, K-FAC, and Shampoo are used to approximate the second-order information at lower cost per iteration ^[9].

Gauss-Seidel and coordinate descent

Gauss-Seidel is a classical iterative method for solving systems of linear equations Ax = b. At each iteration, it updates one component of the solution vector at a time, using the most recently computed values for the other components. For diagonally dominant systems, Gauss-Seidel converges faster than the related Jacobi method, in which each iteration uses only values from the previous iteration. The convergence rate depends on the spectral radius of the iteration matrix and improves with successive over-relaxation (SOR), which scales the update by a relaxation factor.

Coordinate descent generalizes the same idea to non-linear optimization: at each iteration, one parameter (or a block of parameters) is updated to minimize the objective with the others held fixed. Modern variants of coordinate descent are used in solving large LASSO and elastic-net regression problems, where they are competitive with first-order methods on sparse data.

Other iterative methods

Power iteration computes the dominant eigenvector of a matrix by repeatedly multiplying a starting vector by the matrix and renormalizing. It is the simplest example of an iterative algorithm that converges to a fixed point under mild conditions.
Iterative closest point (ICP) for aligning 3D point clouds in computer vision alternates between a correspondence step and a transformation step.
PageRank iteratively updates page importance scores until the values converge, and is itself an instance of power iteration applied to the web graph's adjacency matrix.
Conjugate gradient is an iterative solver for symmetric positive definite linear systems that, in exact arithmetic, converges in at most n iterations for an n by n system.
Iterative methods in classical information retrieval. Rocchio relevance feedback, used in early IR systems, modifies a query vector based on user feedback over several iterations to bring it closer to relevant documents and away from non-relevant ones.

All of these algorithms share the same core pattern: apply a fixed update rule, check for convergence, and repeat.

Iteration vs recursion

In the broader sense used in computer science, iteration is one of the two main control structures for repeated computation, the other being recursion. An iterative procedure uses a loop construct (for, while, do-while) and explicit state variables that change with each pass. A recursive procedure expresses repetition by having a function call itself with reduced arguments until a base case is reached.

The distinction matters in practice for several reasons. Iteration uses constant stack space, while a recursive procedure uses stack space proportional to the depth of recursion (unless tail-call optimization is applied). Some algorithms are easier to express recursively (tree traversal, divide-and-conquer sorts), while others are more natural as iterations (matrix multiplication, gradient descent loops). In machine learning training code, iteration is the universal convention because it interacts more cleanly with hardware accelerators, profilers, and distributed runtimes.

This usage of iteration is closely related to but distinct from the training-loop usage. A training loop is a specific instance of an iterative procedure: the loop variable is the global step count, the state is the model's parameter vector, and the update rule is the optimizer step.

Iteration in AI development and software engineering

Outside the narrow technical meaning in training loops, "iteration" also describes the broader cycle of building and improving AI systems. Andrew Ng and other practitioners emphasize that AI development is fundamentally iterative: rather than designing a perfect model on the first attempt, teams cycle through data collection, labeling, model training, error analysis, and deployment. Each pass through this cycle is an iteration at the project level ^[7].

In software engineering more broadly, iterative development methodologies (such as Agile sprints) decompose a large project into small, testable increments. Each increment provides feedback that guides the next. This philosophy aligns naturally with machine learning workflows, where a first model is trained quickly, its errors are analyzed, and targeted improvements are made in subsequent iterations ^[8].

The iteration mindset extends to scientific research as well. The progression from AlexNet (2012) to ResNet (2015), Transformer (2017), GPT-2 (2019), GPT-3 (2020), and the current generation of frontier models is a chain of project-level iterations, each building on lessons learned from the previous one. The same is true at the level of a single research project: a series of experiments, each an iteration, refines the hypothesis until the final published result emerges.

Choosing the right number of iterations

Selecting how many iterations to train for involves balancing several factors:

Factor	Effect on iteration count
Dataset size	Larger datasets produce more iterations per epoch
Batch size	Larger batch sizes reduce iterations per epoch
Number of epochs	More epochs multiply total iterations
Early stopping	Halts training when validation performance stops improving, capping the total iteration count
Compute budget	Fixed GPU-hours or FLOP budgets impose an upper bound
Learning rate schedule	Schedules tied to total steps (e.g., cosine decay to zero) require specifying the total iteration count in advance
Convergence behavior	Some objectives plateau early; others continue to improve for many additional iterations
Generalization gap	Training too long can produce overfitting even after the training loss continues to decrease

Early stopping

Early stopping is the most widely used technique for choosing the iteration count adaptively. The idea is to monitor a validation metric (validation loss, validation accuracy, BLEU, or similar) at fixed intervals and stop training when the metric has not improved for a patience window of consecutive checks. The number of iterations at which training stops is therefore data-dependent rather than fixed in advance. Early stopping serves as a form of regularization in addition to its compute-saving role: by halting before the model begins to overfit, it produces a model that generalizes better than one trained for the full pre-set iteration budget.

For large language model pre-training, classical early stopping is rarely used because the validation loss typically continues to decrease for the entire training budget. Instead, the iteration count is fixed in advance based on the compute budget and the Chinchilla-style scaling laws.

Practical guidelines

Start with a proven configuration from the literature and adjust based on validation metrics.
Monitor loss per iteration (not just per epoch) to detect problems early.
Use learning rate finders or short exploratory runs to determine a good learning rate before committing to a full training run.
Save checkpoints at regular iteration intervals so that the best-performing snapshot can be selected after training.
For LLM training, plan the iteration count from the compute budget and scaling laws; for smaller-scale supervised learning, use early stopping on a held-out validation set.
Do not equate total iterations with total compute when comparing runs across different model sizes, batch sizes, or hardware.

Common pitfalls

A few iteration-related mistakes show up repeatedly in practice:

Forgetting to zero gradients. In PyTorch, optimizer.zero_grad() must be called before each backward pass, otherwise gradients accumulate across iterations. This is a frequent source of mysteriously diverging loss curves.
Logging at the wrong cadence. Logging every iteration produces a high-resolution loss curve but can slow training because of synchronous I/O. Logging every 50 to 100 iterations is a good default.
Mismatched scheduler step count. When using a learning rate scheduler tied to total steps, the scheduler must be told the correct total iteration count, including any gradient accumulation factor. A scheduler configured for 100,000 micro-batches will reach learning rate zero K times too early if K-step accumulation is used.
Off-by-one errors at epoch boundaries. When the dataset size is not divisible by the batch size, the final batch of each epoch is smaller. Forgetting to account for this can lead to incorrect loss averaging and confusing reports.
Iteration count drift in distributed training. When training is restarted from a checkpoint, the global step counter and the data sampler must both be restored to avoid replaying or skipping data. Frameworks like PyTorch Lightning and DeepSpeed handle this, but custom training loops often do not.

Explain like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into a hoop. Each time you throw the ball, you see whether it went too far left, too far right, too high, or too low. Then you adjust your next throw based on what you learned. That single throw-and-adjust cycle is one iteration.

Now imagine you have a bucket of 100 balls. You grab 10 balls at a time (that is your batch), throw them, and then adjust your aim. After you have thrown all 100 balls, you have finished one epoch (one pass through the whole bucket). If you needed 10 throws of 10 balls each to empty the bucket, you did 10 iterations in that epoch.

A computer learning from data works the same way. It looks at a small group of examples, checks how wrong its answers are, fixes itself a little bit, and then moves on to the next group. Each time it fixes itself is one iteration. After thousands or even millions of iterations, the computer gets really good at its task.

References

Brown, T., et al. "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 2020. (GPT-3 paper)
Touvron, H., et al. "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint arXiv:2302.13971*, 2023.
Dubey, A., et al. "The Llama 3 Herd of Models." *arXiv preprint arXiv:2407.21783*, 2024.
Hoffmann, J., et al. "Training Compute-Optimal Large Language Models." *arXiv preprint arXiv:2203.15556*, 2022. (Chinchilla scaling laws)
Lloyd, S. "Least Squares Quantization in PCM." *IEEE Transactions on Information Theory*, 28(2):129-137, 1982.
Dempster, A. P., Laird, N. M., and Rubin, D. B. "Maximum Likelihood from Incomplete Data via the EM Algorithm." *Journal of the Royal Statistical Society, Series B*, 39(1):1-38, 1977.
Ng, A. "Iteration in AI Development." *The Batch*, DeepLearning.AI. https://www.deeplearning.ai/the-batch/iteration-in-ai-development/
Ruder, S. "An Overview of Gradient Descent Optimization Algorithms." *arXiv preprint arXiv:1609.04747*, 2016.
Boyd, S., and Vandenberghe, L. *Convex Optimization*. Cambridge University Press, 2004.
Nesterov, Y. *Introductory Lectures on Convex Optimization: A Basic Course*. Springer, 2004.
Sutton, R. S., and Barto, A. G. *Reinforcement Learning: An Introduction*, second edition. MIT Press, 2018.
Goodfellow, I., Bengio, Y., and Courville, A. *Deep Learning*. MIT Press, 2016.
Google Developers. "Machine Learning Crash Course: Gradient Descent." https://developers.google.com/machine-learning/crash-course/linear-regression/gradient-descent
PyTorch documentation. "Training a classifier." https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
scikit-learn documentation. "KMeans." https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Introduction

Iteration vs. epoch vs. batch

The formula

Numerical example

Common terminology confusion

What happens during a single iteration

Compute cost per iteration

Iteration in gradient descent variants

Convergence theory

Iteration with gradient accumulation

Global step counter and training logs

A minimal PyTorch iteration

Training iterations in large language models

Examples from published models

Batch size ramping and warmup

Wall-clock time vs iteration count

Iterative algorithms beyond gradient descent

K-means clustering

Expectation-maximization (EM)

Policy iteration and value iteration

Newton's method

Gauss-Seidel and coordinate descent

Other iterative methods

Iteration vs recursion

Iteration in AI development and software engineering

Choosing the right number of iterations

Early stopping

Practical guidelines

Common pitfalls

Explain like I'm 5 (ELI5)

See also

References

Improve this article

Related Articles

ARC-AGI 2

GELU (Gaussian Error Linear Unit)

LeNet

Context window

Multi-head Latent Attention

OCR Models

Introduction

Iteration vs. epoch vs. batch

The formula

Numerical example

Common terminology confusion

What happens during a single iteration

Compute cost per iteration

Iteration in gradient descent variants

Convergence theory

Iteration with gradient accumulation

Global step counter and training logs

A minimal PyTorch iteration

Training iterations in large language models

Examples from published models

Batch size ramping and warmup

Wall-clock time vs iteration count

Iterative algorithms beyond gradient descent

K-means clustering

Expectation-maximization (EM)

Policy iteration and value iteration

Newton's method

Gauss-Seidel and coordinate descent

Other iterative methods

Iteration vs recursion

Iteration in AI development and software engineering

Choosing the right number of iterations

Early stopping

Practical guidelines

Common pitfalls

Explain like I'm 5 (ELI5)

See also

References

Related Articles

ARC-AGI 2

GELU (Gaussian Error Linear Unit)

LeNet

Context window

Multi-head Latent Attention

OCR Models