See also: Machine learning terms
In machine learning, the word step is overloaded. Depending on context it can mean a single optimizer update, one transition through a reinforcement learning environment, one denoising operation in a diffusion model, the generation of a single token by a language model, an intermediate reasoning move in a chain of thought, or even a stage of a scikit-learn pipeline. The same English word is doing a lot of work, and conflating these meanings is a common source of confusion when reading research papers, training logs, or library documentation.
This article covers the principal meanings, the math and APIs behind each one, how they relate to neighboring concepts like epoch and iteration, and how step counts factor into modern compute budgeting for large models.
| Context | What a "step" means | Typical scale |
|---|---|---|
| Supervised training | One parameter update on one mini-batch | Millions per training run |
| Reinforcement learning | One environment transition (state, action, reward, next state) | Millions to billions per agent |
| Diffusion sampling | One reverse denoising operation in the generation chain | 1 to 1000 per image |
| LLM inference | One token generated by an autoregressive decoder | Hundreds per response |
| Reasoning chains | One intermediate logical move in a chain of thought | A handful to dozens per problem |
| sklearn Pipeline | One named (transformer, estimator) stage applied in sequence | A small number per pipeline |
| Step decay schedule | A discrete drop in learning rate at a fixed boundary | A few per training run |
| Step (activation) | The Heaviside threshold used in the original perceptron | Historical |
The rest of the article works through each row in detail.
In supervised and self-supervised learning, a training step is one update of the model parameters using one mini-batch of data. The optimizer reads a batch, computes the loss, runs backpropagation to get gradients, and applies the update rule (vanilla SGD, Adam, AdamW, etc.). When the update finishes, the global step counter increments by one.
Frameworks expose this counter explicitly. PyTorch Lightning, for example, has a Trainer.global_step attribute that tracks the total number of optimizer steps taken across the entire training run, and that counter does not advance during validation passes or when gradients are accumulated without an update. With accumulate_grad_batches=4, only one global step is logged per four forward and backward passes, because there is only one actual optimizer call (Lightning AI, 2024).
A single training step usually involves the following work:

1. Fetch the next mini-batch from the data loader.
2. Run the forward pass and compute the loss.
3. Run backpropagation to obtain gradients.
4. Apply the optimizer's update rule to the parameters.
5. Increment global_step, log scalars, possibly run a learning rate scheduler step().

The term iteration is usually a synonym for step in this sense, though some older papers and textbooks use "iteration" to mean a full epoch. When in doubt, treat "iteration" as "step" unless context says otherwise.
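The interaction between gradient accumulation and the global step counter can be made concrete with a toy accounting loop. This is an illustrative sketch, not a real framework API: the counters stand in for what `loss.backward()` and `optimizer.step()` would do in an actual training script.

```python
# Toy accounting for gradient accumulation: the global step counter
# advances once per optimizer update, not once per forward/backward pass.

def train(num_microbatches: int, accumulate_grad_batches: int) -> tuple[int, int]:
    forward_backward_passes = 0
    global_step = 0
    for i in range(1, num_microbatches + 1):
        forward_backward_passes += 1        # loss.backward() would run here
        if i % accumulate_grad_batches == 0:
            global_step += 1                # optimizer.step(); optimizer.zero_grad()
    return forward_backward_passes, global_step

passes, steps = train(num_microbatches=20, accumulate_grad_batches=4)
print(passes, steps)  # 20 forward/backward passes, 5 optimizer steps
```

With `accumulate_grad_batches=4`, twenty micro-batches produce only five global steps, matching the Lightning behavior described above.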
These three terms are easy to mix up because the field never settled on consistent usage. The cleanest modern convention:
| Term | Definition | Counts |
|---|---|---|
| Epoch | One full pass over the training dataset | 1 epoch = (dataset_size / batch_size) steps |
| Step | One optimizer update on one mini-batch | 1 step processes 1 batch of B examples |
| Iteration | Usually a synonym for step in modern usage | Same as step |
| Global step | Cumulative step count across all epochs | Increases monotonically through training |
A worked example: if the training set has 32,000 examples and the batch size is 32, then one epoch contains 32000 / 32 = 1000 steps. A 10-epoch training run takes 10,000 total steps, and global_step goes from 0 to 10,000.
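The worked example translates directly into arithmetic:

```python
# Epoch/step bookkeeping for the worked example above.
dataset_size = 32_000
batch_size = 32
epochs = 10

steps_per_epoch = dataset_size // batch_size   # 1000 steps per epoch
total_steps = epochs * steps_per_epoch         # 10,000 steps for the full run
print(steps_per_epoch, total_steps)
```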
When the dataset is gigantic (web-scale corpora used for pretraining LLMs), "epoch" stops being a useful unit because the model may never see the same example twice, or only sees a small fraction of the data once. In that regime, training is described purely in steps or in tokens consumed, not epochs.
Most modern training recipes describe the learning rate as a function of the step counter rather than the epoch counter. Two common patterns:
Linear warmup over N steps. The learning rate increases linearly from 0 to the peak value over the first warmup_steps, then decays. The formula is lr(step) = lr_peak * step / warmup_steps for step < warmup_steps. Typical warmup lengths run from a few hundred steps for small models to tens of thousands of steps for the largest LLMs (Hugging Face Transformers documentation, 2024). GPT, BERT, T5, and LLaMA all use some form of warmup.
Step decay. The learning rate is held constant, then multiplied by a factor like 0.1 or 0.5 at predefined boundaries (every N steps or at fixed epochs). Step decay produces a staircase pattern in the learning rate over time. It was the dominant schedule in computer vision through the mid 2010s but has been largely replaced by cosine decay or inverse-square-root decay in modern transformer training (Kaplan et al., 2020; Hoffmann et al., 2022).
Cosine decay and inverse square root schedules are both step-indexed. The Noam scheduler from the original Transformer paper, for example, scales the learning rate as lr ~ d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)) (Vaswani et al., 2017).
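The Noam schedule can be written directly as a function of the step counter. The `d_model` and `warmup_steps` defaults below are the values used in the original Transformer paper; this is a minimal sketch rather than a framework scheduler.

```python
def noam_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate from Vaswani et al. (2017):
    lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The schedule rises linearly during warmup, peaks at step == warmup_steps,
# then decays proportionally to step^-0.5.
assert noam_lr(2000) < noam_lr(4000) > noam_lr(8000)
```

Note that the whole schedule is indexed by the optimizer step, not by the epoch, which is why `scheduler.step()` is typically called once per training step in transformer recipes.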
In reinforcement learning, "step" almost always refers to one transition in the underlying Markov decision process. The agent observes a state, selects an action, the environment returns a new observation and a reward, and that whole cycle is one step.
The Gymnasium API (the maintained successor to OpenAI Gym) makes this concrete. Calling env.step(action) returns five values:
| Return value | Meaning |
|---|---|
| observation | The next observation from the environment's observation_space |
| reward | The scalar reward produced by taking the action |
| terminated | True if the agent reached a terminal MDP state |
| truncated | True if a time limit or out-of-bounds condition ended the episode |
| info | A dictionary of auxiliary diagnostic information |
The split between terminated and truncated was introduced in Gymnasium 0.26 to make bootstrapping algorithms unambiguous: when an episode ends because of a time limit rather than a true terminal state, value-based methods should still bootstrap from the next state's value (Farama Foundation, Gymnasium documentation, 2024).
An episode is a sequence of environment steps from a reset to a terminal or truncated state. Total environment steps are the standard compute budget metric for deep RL papers. "Trained for 10M steps" or "200M frames" tells you how much interaction data the agent consumed, which matters more than wall-clock time for fair comparisons across hardware.
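The episode loop and the terminated/truncated split can be illustrated with a toy environment. This is a hand-rolled stand-in that mimics the Gymnasium step contract, not the real `gymnasium` library; the environment, goal, and time limit are invented for the example.

```python
import random

class ToyEnv:
    """Random walk on a line, mimicking the Gymnasium (obs, reward,
    terminated, truncated, info) contract. Illustrative only."""
    GOAL, TIME_LIMIT = 5, 50

    def reset(self, seed=None):
        self.rng = random.Random(seed)
        self._pos, self._t = 0, 0
        return self._pos, {}

    def step(self, action):                      # action: -1 or +1
        self._pos += action
        self._t += 1
        terminated = self._pos >= self.GOAL      # true terminal MDP state
        truncated = self._t >= self.TIME_LIMIT   # time limit, not terminal
        reward = 1.0 if terminated else 0.0
        return self._pos, reward, terminated, truncated, {}

env = ToyEnv()
obs, info = env.reset(seed=0)
env_steps = 0
terminated = truncated = False
while not (terminated or truncated):
    action = 1 if env.rng.random() < 0.7 else -1
    obs, reward, terminated, truncated, info = env.step(action)
    env_steps += 1
print(env_steps, terminated, truncated)
```

A value-based learner would bootstrap from the next state's value when `truncated` is True but treat the next-state value as zero when `terminated` is True.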
Care is needed because RL has two distinct step counters that often coexist in the same training script:
| Counter | What it tracks |
|---|---|
| Environment steps | Number of env.step() calls (interaction data) |
| Gradient steps | Number of optimizer updates on the policy or value network |
| Episodes | Number of resets between terminal or truncated states |
In off-policy methods like DQN or SAC, the ratio of gradient steps to environment steps (sometimes called the "update-to-data" or UTD ratio) is a sensitive hyperparameter.
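The divergence between the two counters is easy to see in a sketch of an off-policy loop. All names here are illustrative (`learning_starts` is a common warmup hyperparameter before updates begin); the comments mark where the real environment and optimizer calls would go.

```python
# Sketch of the two step counters in an off-policy RL loop.
# With utd_ratio gradient updates per environment step, the counters diverge.

def run_loop(total_env_steps: int, utd_ratio: int, learning_starts: int = 100):
    env_steps = 0
    gradient_steps = 0
    for _ in range(total_env_steps):
        env_steps += 1                      # env.step(action); push to replay buffer
        if env_steps >= learning_starts:
            for _ in range(utd_ratio):
                gradient_steps += 1         # sample batch; optimizer update
    return env_steps, gradient_steps

print(run_loop(1000, utd_ratio=2))  # (1000, 1802)
```

A paper reporting "10M steps" for such an agent almost always means environment steps; the gradient-step count follows from the UTD ratio and the warmup period.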
In a diffusion model, "step" refers to one denoising operation in the reverse generation chain. The original DDPM paper (Ho et al., 2020) used T = 1000 timesteps with a linear variance schedule that increased the noise variance from 1e-4 to 0.02 as the forward process progressed. Generating an image required calling the U-Net 1000 times, once per reverse step.
The high step count became a bottleneck, and most subsequent work focused on producing comparable samples in fewer steps:
| Sampler | Typical steps to a usable image | Notes |
|---|---|---|
| DDPM (Ho et al., 2020) | ~1000 | Original ancestral sampler |
| DDIM (Song et al., 2020) | 50 to 100 | Deterministic non-Markovian sampler, 10x to 50x faster |
| DPM-Solver / DPM++ | 20 to 30 | High-order ODE solvers for the reverse process |
| Consistency models (Song et al., 2023) | 1 to 4 | Trained or distilled to map noise to data in one shot |
| Latent Consistency Models, SDXL Turbo | 1 to 4 | Step distillation applied to large text-to-image models |
DDIM (Song et al., 2020) reframed the reverse process as a deterministic non-Markovian chain, allowing samples to be drawn in 50 to 100 steps with quality close to the full 1000-step DDPM run. Consistency models (Song et al., 2023) take this further by training the network so that any point on the diffusion path maps to the same final sample, enabling one-step or few-step generation. As of 2025, these step-distillation techniques are what make interactive, near-real-time text-to-image generation feasible.
A confusing terminology overlap: the same network is conditioned on a "timestep" embedding t that says how noisy the input is. Each call to the network during sampling is also called a step. So "100 inference steps" means the network is called 100 times, each time with a different t value drawn from the schedule.
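The distinction between the sampling-step index and the timestep value t can be made concrete. The uniform stride below is one common spacing choice; real samplers use various spacings, so treat this as an illustration rather than any particular library's schedule.

```python
# 50 inference steps selected from a 1000-step training schedule.
# The step INDEX i runs 0..49; the timestep VALUE t jumps in strides of 20.
T_train = 1000
num_inference_steps = 50
stride = T_train // num_inference_steps          # 20

# Each sampling step i calls the network once, conditioned on timestep t:
timesteps = [T_train - 1 - i * stride for i in range(num_inference_steps)]
print(timesteps[:3], timesteps[-1])  # [999, 979, 959] ... 19
```

So "step 0" of sampling is conditioned on t = 999, and the network never sees most of the 1000 training timesteps during a 50-step run.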
During autoregressive decoding, an LLM generates text one token at a time. Each forward pass through the model that emits one token is an inference step or decoding step. The total wall-clock time to produce a response is roughly:
latency = time_to_first_token + (num_output_tokens - 1) * time_per_step
time_per_step is the inter-token latency, often in the 10 to 100 millisecond range for large frontier models. Reducing this number is the goal of techniques like KV cache reuse, paged attention, and quantization.
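Plugging illustrative numbers into the latency formula shows why inter-token latency dominates for long responses (the specific timings below are invented for the example):

```python
# Back-of-envelope decode latency from the formula above.
time_to_first_token = 0.200   # seconds (prefill of the prompt)
time_per_step = 0.030         # seconds of inter-token latency
num_output_tokens = 500

latency = time_to_first_token + (num_output_tokens - 1) * time_per_step
print(f"{latency:.2f} s")  # 0.200 + 499 * 0.030 = 15.17 s
```

Nearly all of the 15 seconds comes from the per-step term, which is why the optimizations listed above target time_per_step rather than time_to_first_token.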
Speculative decoding breaks the one-token-per-step assumption (Leviathan et al., 2023). A small, fast draft model proposes several future tokens, and the large target model verifies them in a single forward pass, accepting the longest matching prefix. Multiple tokens are emitted per target-model forward pass, which on decode-heavy workloads can yield up to 3x throughput on AWS Trainium (AWS Machine Learning Blog, 2024) without changing output quality. NVIDIA, vLLM, and most production inference stacks now ship some variant of this technique.
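The accept-a-prefix idea can be sketched in a few lines. This is a simplified greedy-matching variant for illustration; the actual scheme in Leviathan et al. (2023) compares draft and target token probabilities with a rejection-sampling rule, and the token values below are arbitrary.

```python
# Toy illustration of the accept-longest-prefix rule in speculative decoding.
# The "target model" is stubbed as a precomputed list of its own next tokens.

def accept_prefix(draft_tokens, target_tokens):
    """Accept the longest prefix where draft and target agree, then take
    the target's token at the first disagreement (so >=1 token per pass)."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)   # the target model's correction
            break
    return accepted

print(accept_prefix([5, 7, 9, 2], [5, 7, 3, 8]))  # [5, 7, 3]
```

Here one target-model forward pass emits three tokens (two accepted draft tokens plus one correction) instead of one, which is the source of the throughput gain.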
In chain-of-thought prompting, a reasoning step is one intermediate inference the model writes down before the final answer. Step-level analysis of these chains has become its own subfield.
OpenAI's "Let's Verify Step by Step" (Lightman et al., 2023) compared two ways of training a reward model for math problems:
| Supervision type | What gets a reward | Result on MATH |
|---|---|---|
| Outcome supervision (ORM) | Only the final answer | Lower accuracy |
| Process supervision (PRM) | Each individual reasoning step | 78% on a representative MATH subset |
The authors released PRM800K, a dataset of 800,000 step-level human correctness labels on model-generated solutions. Process Reward Models trained on step-level annotations outperform outcome reward models that only see the final answer, and step-level reward signals appear to be part of the recipe behind reasoning models like the OpenAI o-series.
Step counts feed directly into compute budgeting for large model training. The standard rule of thumb from Kaplan et al. (2020) is that training a transformer costs about 6 FLOPs per parameter per token:
C ~ 6 * N * D
where C is total compute in FLOPs, N is the number of non-embedding parameters, and D is the number of training tokens consumed. Because each step processes a batch of B sequences of length sequence_length, tokens_per_batch = B * sequence_length and the relationship to steps is:
D = steps * tokens_per_batch
C ~ 6 * N * steps * tokens_per_batch
This means "how many steps to train" is fixed once you choose model size, batch size, and total compute budget.
Hoffmann et al. (2022), the Chinchilla paper, refined the picture by showing that for a fixed compute budget, parameters and training tokens should be scaled roughly equally. The headline heuristic is approximately 20 training tokens per parameter for compute-optimal training. A 70B-parameter Chinchilla-optimal model is therefore trained on roughly 1.4 trillion tokens. Translating that into steps requires dividing by the per-step token count: at a batch size of 1024 sequences of length 2048, that is about 670,000 steps.
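The Chinchilla arithmetic above works out as follows:

```python
# Chinchilla-style budget arithmetic from the text above.
N = 70e9                      # parameters
D = 20 * N                    # ~20 tokens per parameter -> 1.4e12 tokens
C = 6 * N * D                 # total training compute in FLOPs

batch_size, seq_len = 1024, 2048
tokens_per_step = batch_size * seq_len        # 2,097,152 tokens per step
steps = D / tokens_per_step                   # ~670,000 steps
print(f"{D:.2e} tokens, {C:.2e} FLOPs, {steps:,.0f} steps")
```

Once N, the token budget D, and the per-step token count are fixed, the step count is fully determined, which is the sense in which "how many steps to train" falls out of the compute budget.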
Later work (LLaMA, Llama 2, Llama 3) deliberately overtrains relative to the Chinchilla ratio because inference cost is paid forever and training cost is paid once, so smaller-but-more-trained models with the same loss ship better in production. "More trained" here means more steps at the same model size, hitting D values much higher than the Chinchilla optimum (Touvron et al., 2023; Scaling laws).
In the original perceptron (Rosenblatt, 1957), the activation function is the Heaviside step:
H(x) = 1 if x >= 0
H(x) = 0 otherwise
A perceptron classifies an input as 1 if the weighted sum of inputs plus bias is non-negative, and 0 otherwise. The unit using this activation is called a threshold logic unit. The Heaviside function is discontinuous at the origin and has zero derivative everywhere else, so it provides no useful gradient and cannot be trained by gradient descent. ADALINE (Widrow and Hoff, 1960) replaced the step with a continuous identity activation specifically to enable a least-squares update rule, and modern networks use smooth activations like sigmoid, tanh, ReLU, GELU, and SwiGLU instead. The step function survives mainly as a textbook reference and as the conceptual ancestor of all the smoother activations.
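The Rosenblatt-style forward pass is short enough to write out in full. The AND-gate weights below are hand-picked for illustration:

```python
# A perceptron with the Heaviside step activation.
def heaviside(x: float) -> int:
    return 1 if x >= 0 else 0

def perceptron(inputs, weights, bias):
    pre_activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return heaviside(pre_activation)

# AND gate: fires only when both inputs are 1 (weighted sum 2.0 - 1.5 >= 0).
w, b = [1.0, 1.0], -1.5
print([perceptron([a, c], w, b) for a in (0, 1) for c in (0, 1)])  # [0, 0, 0, 1]
```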
In scikit-learn, a Pipeline is built from an ordered list of (name, estimator) tuples. Each tuple is a step. The convention is enforced by the API: every non-final step must implement fit and transform, and the final step must implement fit (and may implement predict, transform, or score depending on its role).
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.svm import SVC
estimators = [("reduce_dim", PCA()), ("clf", SVC())]
pipe = Pipeline(estimators)
In this example the pipeline has two steps named reduce_dim and clf. Calling pipe.fit(X, y) runs PCA.fit_transform on X, then passes the reduced features into SVC.fit. Steps can be accessed by name (pipe.named_steps["clf"]) or by integer index (pipe.steps[0]), which is convenient when grid-searching hyperparameters with GridSearchCV using the step_name__param_name syntax (scikit-learn documentation, 2024).
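The step_name__param_name convention amounts to splitting the key on the double underscore and routing the remainder to the named step. The sketch below shows just that routing rule with plain dictionaries; it is an illustration of the naming convention, not scikit-learn's internals.

```python
# How step_name__param_name keys route to pipeline steps (illustrative).
params = {"reduce_dim__n_components": 10, "clf__C": 1.0}

routed = {}
for key, value in params.items():
    step_name, _, param_name = key.partition("__")
    routed.setdefault(step_name, {})[param_name] = value

print(routed)  # {'reduce_dim': {'n_components': 10}, 'clf': {'C': 1.0}}
```

A GridSearchCV param_grid for the pipeline above would use exactly these keys, e.g. `{"reduce_dim__n_components": [5, 10], "clf__C": [0.1, 1.0]}`.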
A few recurring pitfalls are worth restating:

- Total steps in a run equal epochs * (dataset_size / batch_size). Increasing the batch size cuts the step count proportionally, which is why "large batch training" needs more aggressive learning rates and longer warmups.
- With gradient accumulation, the step counter (global_step) increments only after the accumulated gradients are applied, not on every forward and backward pass.
- In diffusion sampling, the timestep t that conditions the network is not the same thing as the index of the sampling step. With 50-step DDIM sampling, the indices t may be spaced unevenly across the original 1000-step schedule.

| Meaning | Where it appears | Key reference |
|---|---|---|
| Optimizer update on a mini-batch | Supervised and self-supervised training | Bottou (2010); standard practice |
| Environment transition | Reinforcement learning | Sutton and Barto (2018); Gymnasium docs |
| Reverse denoising operation | Diffusion model sampling | Ho et al. (2020); Song et al. (2020) |
| One generated token | LLM autoregressive decoding | Vaswani et al. (2017); Leviathan et al. (2023) |
| Intermediate reasoning move | Chain-of-thought, PRMs | Lightman et al. (2023) |
| Pipeline stage | scikit-learn Pipeline | scikit-learn docs |
| Discrete LR drop | Step decay schedule | Bengio (2012); standard CV practice |
| Heaviside threshold | Original perceptron | Rosenblatt (1957) |
When reading a paper, log file, or library doc, the safest move is to identify which of these meanings is in scope before doing any arithmetic. "100K steps" can mean very different things to a vision model trainer, an RL researcher, and someone tuning a diffusion sampler.