See also: Machine learning terms
In machine learning, the word step is overloaded. Depending on context it can mean a single optimizer update, one transition through a reinforcement learning environment, one denoising operation in a diffusion model, the generation of a single token by a language model, an intermediate reasoning move in a chain of thought, or even a stage of a scikit-learn pipeline. The same English word is doing a lot of work, and conflating these meanings is a common source of confusion when reading research papers, training logs, or library documentation.
This article covers the principal meanings, the math and APIs behind each one, how they relate to neighboring concepts like epoch and iteration, and how step counts factor into modern compute budgeting for large models.
| Context | What a "step" means | Typical scale |
|---|---|---|
| Supervised training | One parameter update on one mini-batch | Millions per training run |
| Reinforcement learning | One environment transition (state, action, reward, next state) | Millions to billions per agent |
| Diffusion sampling | One reverse denoising operation in the generation chain | 1 to 1000 per image |
| LLM inference | One token generated by an autoregressive decoder | Hundreds per response |
| Reasoning chains | One intermediate logical move in a chain of thought | A handful to dozens per problem |
| sklearn Pipeline | One named (transformer, estimator) stage applied in sequence | A small number per pipeline |
| Step decay schedule | A discrete drop in learning rate at a fixed boundary | A few per training run |
| Step (activation) | The Heaviside threshold used in the original perceptron | Historical |
The rest of the article works through each row in detail.
In supervised and self-supervised learning, a training step is one update of the model parameters using one mini-batch of data. The optimizer reads a batch, computes the loss, runs backpropagation to get gradients, and applies the update rule (vanilla SGD, Adam, AdamW, etc.). When the update finishes, the global step counter increments by one.
Frameworks expose this counter explicitly. PyTorch Lightning, for example, has a Trainer.global_step attribute that tracks the total number of optimizer steps taken across the entire training run, and that counter does not advance during validation passes or when gradients are accumulated without an update. With accumulate_grad_batches=4, only one global step is logged per four forward and backward passes, because there is only one actual optimizer call (Lightning AI, 2024).
A single training step usually involves the following work:

1. Fetch the next mini-batch from the data loader.
2. Run the forward pass and compute the loss.
3. Run backpropagation to obtain gradients.
4. Apply the optimizer's update rule to the parameters.
5. Increment global_step, log scalars, possibly run a learning rate scheduler step().

The term iteration is usually a synonym for step in this sense, though some older papers and textbooks use "iteration" to mean a full epoch. When in doubt, treat "iteration" as "step" unless context says otherwise.
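The interaction between gradient accumulation and the global step counter can be made concrete with a toy accounting loop. This is an illustrative sketch, not a real framework API: the counters stand in for what `loss.backward()` and `optimizer.step()` would do in an actual training script.

```python
# Toy accounting for gradient accumulation: the global step counter
# advances once per optimizer update, not once per forward/backward pass.

def train(num_microbatches: int, accumulate_grad_batches: int) -> tuple[int, int]:
    forward_backward_passes = 0
    global_step = 0
    for i in range(1, num_microbatches + 1):
        forward_backward_passes += 1        # loss.backward() would run here
        if i % accumulate_grad_batches == 0:
            global_step += 1                # optimizer.step(); optimizer.zero_grad()
    return forward_backward_passes, global_step

passes, steps = train(num_microbatches=20, accumulate_grad_batches=4)
print(passes, steps)  # 20 forward/backward passes, 5 optimizer steps
```

With `accumulate_grad_batches=4`, twenty micro-batches produce only five global steps, matching the Lightning behavior described above.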
These three terms are easy to mix up because the field never settled on consistent usage. The cleanest modern convention:
| Term | Definition | Counts |
|---|---|---|
| Epoch | One full pass over the training dataset | 1 epoch = (dataset_size / batch_size) steps |
| Step | One optimizer update on one mini-batch | 1 step processes 1 batch of B examples |
| Iteration | Usually a synonym for step in modern usage | Same as step |
| Global step | Cumulative step count across all epochs | Increases monotonically through training |
A worked example: if the training set has 32,000 examples and the batch size is 32, then one epoch contains 32000 / 32 = 1000 steps. A 10-epoch training run takes 10,000 total steps, and global_step goes from 0 to 10,000.
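The worked example translates directly into arithmetic:

```python
# Epoch/step bookkeeping for the worked example above.
dataset_size = 32_000
batch_size = 32
epochs = 10

steps_per_epoch = dataset_size // batch_size   # 1000 steps per epoch
total_steps = epochs * steps_per_epoch         # 10,000 steps for the full run
print(steps_per_epoch, total_steps)
```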
When the dataset is gigantic (web-scale corpora used for pretraining LLMs), "epoch" stops being a useful unit because the model may never see the same example twice, or only sees a small fraction of the data once. In that regime, training is described purely in steps or in tokens consumed, not epochs.
Most modern training recipes describe the learning rate as a function of the step counter rather than the epoch counter. Two common patterns:
Linear warmup over N steps. The learning rate increases linearly from 0 to the peak value over the first warmup_steps, then decays. The formula is lr(step) = lr_peak * step / warmup_steps for step < warmup_steps. Typical warmup lengths run from a few hundred steps for small models to tens of thousands of steps for the largest LLMs (Hugging Face Transformers documentation, 2024). GPT, BERT, T5, and LLaMA all use some form of warmup.
Step decay. The learning rate is held constant, then multiplied by a factor like 0.1 or 0.5 at predefined boundaries (every N steps or at fixed epochs). Step decay produces a staircase pattern in the learning rate over time. It was the dominant schedule in computer vision through the mid 2010s but has been largely replaced by cosine decay or inverse-square-root decay in modern transformer training (Kaplan et al., 2020; Hoffmann et al., 2022).
Cosine decay and inverse square root schedules are both step-indexed. The Noam scheduler from the original Transformer paper, for example, scales the learning rate as lr ~ d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)) (Vaswani et al., 2017).
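The Noam schedule can be written directly as a function of the step counter. The `d_model` and `warmup_steps` defaults below are the values used in the original Transformer paper; this is a minimal sketch rather than a framework scheduler.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule rises linearly during warmup, peaks at step == warmup_steps,
# then decays proportionally to step^-0.5.
assert noam_lr(2000) < noam_lr(4000) > noam_lr(8000)
```

Note that the whole schedule is indexed by the optimizer step, not by the epoch, which is why `scheduler.step()` is typically called once per training step in transformer recipes.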
In reinforcement learning, "step" almost always refers to one transition in the underlying Markov decision process. The agent observes a state, selects an action, the environment returns a new observation and a reward, and that whole cycle is one step.
The Gymnasium API (the maintained successor to OpenAI Gym) makes this concrete. Calling env.step(action) returns five values:
| Return value | Meaning |
|---|---|
| observation | The next observation from the environment's observation_space |
| reward | The scalar reward produced by taking the action |
| terminated | True if the agent reached a terminal MDP state |
| truncated | True if a time limit or out-of-bounds condition ended the episode |
| info | A dictionary of auxiliary diagnostic information |
The split between terminated and truncated was introduced in Gymnasium 0.26 to make bootstrapping algorithms unambiguous: when an episode ends because of a time limit rather than a true terminal state, value-based methods should still bootstrap from the next state's value (Farama Foundation, Gymnasium documentation, 2024).
An episode is a sequence of environment steps from a reset to a terminal or truncated state. Total environment steps are the standard compute budget metric for deep RL papers. "Trained for 10M steps" or "200M frames" tells you how much interaction data the agent consumed, which matters more than wall-clock time for fair comparisons across hardware.
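The episode loop and the terminated/truncated split can be illustrated with a toy environment. This is a hand-rolled stand-in that mimics the Gymnasium step contract, not the real `gymnasium` library; the environment, goal, and time limit are invented for the example.

```python
import random

class ToyEnv:
    """Random walk on a line, mimicking the Gymnasium (obs, reward,
    terminated, truncated, info) contract. Illustrative only."""
    GOAL, TIME_LIMIT = 5, 50

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self._pos, self._t = 0, 0
        return self._pos, {}

    def step(self, action):                      # action: -1 or +1
        self._pos += action
        self._t += 1
        terminated = self._pos >= self.GOAL      # true terminal MDP state
        truncated = self._t >= self.TIME_LIMIT   # time limit, not terminal
        reward = 1.0 if terminated else 0.0
        return self._pos, reward, terminated, truncated, {}

env = ToyEnv()
obs, info = env.reset(seed=0)
env_steps = 0
terminated = truncated = False
while not (terminated or truncated):
    action = 1 if env.rng.random() < 0.7 else -1
    obs, reward, terminated, truncated, info = env.step(action)
    env_steps += 1
print(env_steps, terminated, truncated)
```

A value-based learner would bootstrap from the next state's value when `truncated` is True but treat the next-state value as zero when `terminated` is True.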
Care is needed because RL has two distinct step counters that often coexist in the same training script:
| Counter | What it tracks |
|---|---|
| Environment steps | Number of env.step() calls (interaction data) |
| Gradient steps | Number of optimizer updates on the policy or value network |
| Episodes | Number of resets between terminal or truncated states |
In off-policy methods like DQN or SAC, the ratio of gradient steps to environment steps (sometimes called the "update-to-data" or UTD ratio) is a sensitive hyperparameter.
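The divergence between the two counters is easy to see in a sketch of an off-policy loop. All names here are illustrative (`learning_starts` is a common warmup hyperparameter before updates begin); the comments mark where the real environment and optimizer calls would go.

```python
# Sketch of the two step counters in an off-policy RL loop.
# With utd_ratio gradient updates per environment step, the counters diverge.

def run_loop(total_env_steps: int, utd_ratio: int, learning_starts: int = 100):
    env_steps = 0
    gradient_steps = 0
    for _ in range(total_env_steps):
        env_steps += 1                      # env.step(action); push to replay buffer
        if env_steps >= learning_starts:
            for _ in range(utd_ratio):
                gradient_steps += 1         # sample batch; optimizer update
    return env_steps, gradient_steps

print(run_loop(1000, utd_ratio=2))  # (1000, 1802)
```

A paper reporting "10M steps" for such an agent almost always means environment steps; the gradient-step count follows from the UTD ratio and the warmup period.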
In a diffusion model, "step" refers to one denoising operation in the reverse generation chain. The original DDPM paper (Ho et al., 2020) used T = 1000 timesteps with a linear variance schedule that increased the noise variance from 1e-4 to 0.02 as the forward process progressed. Generating an image required calling the U-Net 1000 times, once per reverse step.
The high step count became a bottleneck, and most subsequent work focused on producing comparable samples in fewer steps:
| Sampler | Typical steps to a usable image | Notes |
|---|---|---|
| DDPM (Ho et al., 2020) | ~1000 | Original ancestral sampler |
| DDIM (Song et al., 2020) | 50 to 100 | Deterministic non-Markovian sampler, 10x to 50x faster |
| DPM-Solver / DPM++ | 20 to 30 | High-order ODE solvers for the reverse process |
| Consistency models (Song et al., 2023) | 1 to 4 | Trained or distilled to map noise to data in one shot |
| Latent Consistency Models, SDXL Turbo | 1 to 4 | Step distillation applied to large text-to-image models |
DDIM (Song et al., 2020) reframed the reverse process as a deterministic non-Markovian chain, allowing samples to be drawn in 50 to 100 steps with quality close to the full 1000-step DDPM run. Consistency models (Song et al., 2023) take this further by training the network so that any point on the diffusion path maps to the same final sample, enabling one-step or few-step generation. As of 2025, these step-distillation techniques are what make interactive, near-real-time text-to-image generation feasible.
A confusing terminology overlap: the same network is conditioned on a "timestep" embedding t that says how noisy the input is. Each call to the network during sampling is also called a step. So "100 inference steps" means the network is called 100 times, each time with a different t value drawn from the schedule.
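The distinction between the sampling-step index and the timestep value t can be made concrete. The uniform stride below is one common spacing choice; real samplers use various spacings, so treat this as an illustration rather than any particular library's schedule.

```python
# 50 inference steps selected from a 1000-step training schedule.
# The step INDEX i runs 0..49; the timestep VALUE t jumps in strides of 20.
T_train = 1000
num_inference_steps = 50
stride = T_train // num_inference_steps          # 20

# Each sampling step i calls the network once, conditioned on timestep t:
timesteps = [T_train - 1 - i * stride for i in range(num_inference_steps)]
print(timesteps[:3], timesteps[-1])  # [999, 979, 959] ... 19
```

So "step 0" of sampling is conditioned on t = 999, and the network never sees most of the 1000 training timesteps during a 50-step run.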
During autoregressive decoding, an LLM generates text one token at a time. Each forward pass through the model that emits one token is an inference step or decoding step. The total wall-clock time to produce a response is roughly:
latency = time_to_first_token + (num_output_tokens - 1) * time_per_step
time_per_step is the inter-token latency, often in the 10 to 100 millisecond range for large frontier models. Reducing this number is the goal of techniques like KV cache reuse, paged attention, and quantization.
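Plugging illustrative numbers into the latency formula shows why inter-token latency dominates for long responses (the specific timings below are invented for the example):

```python
# Back-of-envelope decode latency from the formula above.
time_to_first_token = 0.200   # seconds (prefill of the prompt)
time_per_step = 0.030         # seconds of inter-token latency
num_output_tokens = 500

latency = time_to_first_token + (num_output_tokens - 1) * time_per_step
print(f"{latency:.2f} s")  # 0.200 + 499 * 0.030 = 15.17 s
```

Nearly all of the 15 seconds comes from the per-step term, which is why the optimizations listed above target time_per_step rather than time_to_first_token.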
Speculative decoding breaks the one-token-per-step assumption (Leviathan et al., 2023). A small, fast draft model proposes several future tokens, and the large target model verifies them in a single forward pass, accepting the longest matching prefix. Multiple tokens are emitted per target-model forward pass, which on decode-heavy workloads can yield up to 3x throughput on AWS Trainium (AWS Machine Learning Blog, 2024) without changing output quality. NVIDIA, vLLM, and most production inference stacks now ship some variant of this technique.
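The accept-a-prefix idea can be sketched in a few lines. This is a simplified greedy-matching variant for illustration; the actual scheme in Leviathan et al. (2023) compares draft and target token probabilities with a rejection-sampling rule, and the token values below are arbitrary.

```python
# Toy illustration of the accept-longest-prefix rule in speculative decoding.
# The "target model" is stubbed as a precomputed list of its own next tokens.

def accept_prefix(draft_tokens, target_tokens):
    """Accept the longest prefix where draft and target agree, then take
    the target's token at the first disagreement (so >=1 token per pass)."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # the target model's correction
            break
    return accepted

print(accept_prefix([5, 7, 9, 2], [5, 7, 3, 8]))  # [5, 7, 3]
```

Here one target-model forward pass emits three tokens (two accepted draft tokens plus one correction) instead of one, which is the source of the throughput gain.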
In chain-of-thought prompting, a reasoning step is one intermediate inference the model writes down before the final answer. Step-level analysis of these chains has become its own subfield.
OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) compared two ways of training a reward model for math problems:
| Supervision type | What gets a reward | Result on MATH |
|---|---|---|
| Outcome supervision (ORM) | Only the final answer | Lower accuracy |
| Process supervision (PRM) | Each individual reasoning step | 78% on a representative MATH subset |
The authors released PRM800K, a dataset of 800,000 step-level human correctness labels on model-generated solutions. Process Reward Models trained on step-level annotations outperform outcome reward models that only see the final answer, and step-level reward signals appear to be part of the recipe behind reasoning models like the OpenAI o-series.
Step counts feed directly into compute budgeting for large model training. The standard rule of thumb from Kaplan et al. (2020) is that training a transformer costs about 6 FLOPs per parameter per token:
C ~ 6 * N * D
where C is total compute in FLOPs, N is the number of non-embedding parameters, and D is the number of training tokens consumed. Because each step processes a batch of B sequences of length sequence_length, tokens_per_batch = B * sequence_length and the relationship to steps is:
D = steps * tokens_per_batch
C ~ 6 * N * steps * tokens_per_batch
This means "how many steps to train" is fixed once you choose model size, batch size, and total compute budget.
Hoffmann et al. (2022), the Chinchilla paper, refined the picture by showing that for a fixed compute budget, parameters and training tokens should be scaled roughly equally. The headline heuristic is approximately 20 training tokens per parameter for compute-optimal training. A 70B-parameter Chinchilla-optimal model is therefore trained on roughly 1.4 trillion tokens. Translating that into steps requires dividing by the per-step token count: at a batch size of 1024 sequences of length 2048, that is about 670,000 steps.
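The Chinchilla arithmetic above works out as follows:

```python
# Chinchilla-style budget arithmetic from the text above.
N = 70e9                      # parameters
D = 20 * N                    # ~20 tokens per parameter -> 1.4e12 tokens
C = 6 * N * D                 # total training compute in FLOPs

batch_size, seq_len = 1024, 2048
tokens_per_step = batch_size * seq_len        # 2,097,152 tokens per step
steps = D / tokens_per_step                   # ~670,000 steps
print(f"{D:.2e} tokens, {C:.2e} FLOPs, {steps:,.0f} steps")
```

Once N, the token budget D, and the per-step token count are fixed, the step count is fully determined, which is the sense in which "how many steps to train" falls out of the compute budget.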
Later work (LLaMA, Llama 2, Llama 3) deliberately overtrains relative to the Chinchilla ratio because inference cost is paid forever and training cost is paid once, so smaller-but-more-trained models with the same loss ship better in production. "More trained" here means more steps at the same model size, hitting D values much higher than the Chinchilla optimum (Touvron et al., 2023; Scaling laws).
In the original perceptron (Rosenblatt, 1957), the activation function is the Heaviside step:
H(x) = 1 if x >= 0
H(x) = 0 otherwise
A perceptron classifies an input as 1 if the weighted sum of inputs plus bias is non-negative, and 0 otherwise. The unit using this activation is called a threshold logic unit. The Heaviside function is discontinuous at the origin and has zero derivative everywhere else, so it provides no useful gradient and cannot be trained by gradient descent. ADALINE (Widrow and Hoff, 1960) replaced the step with a continuous identity activation specifically to enable a least-squares update rule, and modern networks use smooth activations like sigmoid, tanh, ReLU, GELU, and SwiGLU instead. The step function survives mainly as a textbook reference and as the conceptual ancestor of all the smoother activations.
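The Rosenblatt-style forward pass is short enough to write out in full. The AND-gate weights below are hand-picked for illustration:

```python
# A perceptron with the Heaviside step activation.
def heaviside(x: float) -> int:
    return 1 if x >= 0 else 0

def perceptron(inputs, weights, bias):
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(pre_activation)

# AND gate: fires only when both inputs are 1 (weighted sum 2.0 - 1.5 >= 0).
w, b = [1.0, 1.0], -1.5
print([perceptron([a, c], w, b) for a in (0, 1) for c in (0, 1)])  # [0, 0, 0, 1]
```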
In scikit-learn, a Pipeline is built from an ordered list of (name, estimator) tuples. Each tuple is a step. The convention is enforced by the API: every non-final step must implement fit and transform, and the final step must implement fit (and may implement predict, transform, or score depending on its role).
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
estimators = [("reduce_dim", PCA()), ("clf", SVC())]
pipe = Pipeline(estimators)
In this example the pipeline has two steps named reduce_dim and clf. Calling pipe.fit(X, y) runs PCA.fit_transform on X, then passes the reduced features into SVC.fit. Steps can be accessed by name (pipe.named_steps["clf"]) or by integer index (pipe.steps[0]), which is convenient when grid-searching hyperparameters with GridSearchCV using the step_name__param_name syntax (scikit-learn documentation, 2024).
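The step_name__param_name convention amounts to splitting the key on the double underscore and routing the remainder to the named step. The sketch below shows just that routing rule with plain dictionaries; it is an illustration of the naming convention, not scikit-learn's internals.

```python
# How step_name__param_name keys route to pipeline steps (illustrative).
params = {"reduce_dim__n_components": 10, "clf__C": 1.0}

routed = {}
for key, value in params.items():
    step_name, _, param_name = key.partition("__")
    routed.setdefault(step_name, {})[param_name] = value

print(routed)  # {'reduce_dim': {'n_components': 10}, 'clf': {'C': 1.0}}
```

A GridSearchCV param_grid for the pipeline above would use exactly these keys, e.g. `{"reduce_dim__n_components": [5, 10], "clf__C": [0.1, 1.0]}`.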
A few recurring pitfalls are worth restating:

- Total steps in a run equal epochs * (dataset_size / batch_size). Increasing the batch size cuts the step count proportionally, which is why "large batch training" needs more aggressive learning rates and longer warmups.
- With gradient accumulation, the step counter (global_step) increments only after the accumulated gradients are applied, not on every forward and backward pass.
- In diffusion sampling, the timestep t that conditions the network is not the same thing as the index of the sampling step. With 50-step DDIM sampling, the indices t may be spaced unevenly across the original 1000-step schedule.

| Meaning | Where it appears | Key reference |
|---|---|---|
| Optimizer update on a mini-batch | Supervised and self-supervised training | Bottou (2010); standard practice |
| Environment transition | Reinforcement learning | Sutton and Barto (2018); Gymnasium docs |
| Reverse denoising operation | Diffusion model sampling | Ho et al. (2020); Song et al. (2020) |
| One generated token | LLM autoregressive decoding | Vaswani et al. (2017); Leviathan et al. (2023) |
| Intermediate reasoning move | Chain-of-thought, PRMs | Lightman et al. (2023) |
| Pipeline stage | scikit-learn Pipeline | scikit-learn docs |
| Discrete LR drop | Step decay schedule | Bengio (2012); standard CV practice |
| Heaviside threshold | Original perceptron | Rosenblatt (1957) |
When reading a paper, log file, or library doc, the safest move is to identify which of these meanings is in scope before doing any arithmetic. "100K steps" can mean very different things to a vision model trainer, an RL researcher, and someone tuning a diffusion sampler.