See also: learning rate, parameter update, Adam, SGD
In machine learning, the step size (also called the learning rate, usually written as the Greek letter eta or alpha) is the scalar that controls how far a model's parameters move on each update during gradient-based optimization. It is the single hyperparameter that most strongly determines whether training converges, how fast it converges, and how well the final model generalizes. Practitioners often discover that a model that fails to learn at one step size trains cleanly at a value an order of magnitude smaller or larger, which is why the step size is usually the first thing tuned and the first thing blamed when a run goes wrong.
Given a loss function L(theta) over model parameters theta, plain gradient descent updates the parameters by subtracting a scaled gradient:
theta_{t+1} = theta_t - eta * grad L(theta_t)
Here eta is the step size. Each iteration moves the parameters a distance proportional to eta in the direction of steepest descent. In stochastic and mini-batch settings the true gradient is replaced by a noisy estimate computed on a batch of examples, but the role of eta does not change.
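As a toy illustration of how the choice of eta plays out, the sketch below runs the update above on a one-dimensional quadratic loss (an invented example, not tied to any library): a very small eta crawls toward the minimum, a moderate eta converges quickly, and an eta past the quadratic's stability threshold diverges.

```python
# Gradient descent on L(theta) = 0.5 * a * theta**2.
# The gradient is a * theta, so each update is theta <- theta * (1 - eta * a);
# the iteration converges only when 0 < eta < 2 / a (here 2 / 4 = 0.5).

def gradient_descent(eta, a=4.0, theta0=1.0, steps=20):
    theta = theta0
    for _ in range(steps):
        grad = a * theta            # grad L(theta)
        theta = theta - eta * grad  # the update from the formula above
    return theta

for eta in (0.01, 0.1, 0.4, 0.6):   # 0.6 exceeds 2/a and diverges
    print(f"eta={eta}: theta after 20 steps = {gradient_descent(eta):.4g}")
```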
The term "step size" comes from the optimization literature, where the per-iteration parameter movement is literally the length of a step taken across the loss landscape. In the deep learning community the same quantity is almost always called the learning rate. The two names are interchangeable, although "step size" is more common in convex optimization and theoretical papers, while "learning rate" dominates practical guides and code APIs.
In deep learning, varying the step size by a factor of two often changes final accuracy more than swapping architectures or doubling the model size. The reasons trace back to the geometry of the loss surface: steps that are too large overshoot along sharp, high-curvature directions and can diverge, while steps that are too small make almost no progress along flat directions and leave the model at a worse solution than it could otherwise reach.
Leslie Smith's 2018 report on disciplined hyperparameter tuning argues that the learning rate is the dominant control knob: tuning it well lets you set most other hyperparameters from sensible defaults.
For plain SGD, the value passed to the optimizer is also the actual displacement per gradient unit. With momentum and adaptive optimizers, the effective step size diverges from the nominal value: with momentum mu, repeated gradients compound until the steady-state displacement approaches eta / (1 - mu), and adaptive methods such as Adam divide each coordinate by a running estimate of the gradient's magnitude, so the realized step differs per parameter.
This distinction matters when porting hyperparameters between optimizers or papers. A learning rate of 1e-3 in Adam is not comparable to 1e-3 in SGD or 1e-3 in Lion.
The step size does not need to stay constant. Almost every modern training run varies eta over time according to a schedule. The schedule typically combines a warmup phase (eta starts small and grows) with a decay phase (eta shrinks toward zero). The table below summarizes the schedules in common use.
| Schedule | Formula | Where it is used | Notes |
|---|---|---|---|
| Constant | eta_t = eta_0 | Online learning, simple baselines | Easy to debug, rarely optimal for deep nets |
| Step decay | eta_t = eta_0 * gamma^{floor(t / s)} | Classic ImageNet ResNets | Cut by 10x at chosen epoch boundaries |
| Exponential decay | eta_t = eta_0 * gamma^t | Older RL and online setups | Smooth analog of step decay |
| Polynomial decay | eta_t = eta_0 * (1 - t/T)^p | BERT pretraining (p=1, linear) | Linear-to-zero is a common LLM default |
| Inverse square root | eta_t proportional to 1 / sqrt(t) | Original Transformer | Combined with linear warmup |
| Cosine annealing | eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T)) | Modern vision and LLM pretraining | Smooth, single-cycle, no extra hyperparameters |
| Cosine with warm restarts (SGDR) | Cosine over a cycle, then restart | Snapshot ensembles, fine-tuning | Periodic restarts can escape sharp minima |
| One-cycle | Triangular up then down, plus a final tail | Fast.ai super-convergence regime | Allows much larger peak LR than usual |
| Cyclical | Triangular oscillation between eta_min and eta_max | CV experiments | Avoids tuning a single peak value |
| Linear warmup + cosine decay | Linear ramp for K steps, then cosine | GPT-3, Llama, Chinchilla | The default for LLM pretraining |
| ReduceLROnPlateau | Drop eta when validation loss stops improving | Older PyTorch workflows | Reactive, not deterministic |
The two schedules dominating modern practice are linear warmup plus cosine decay (for large pretraining runs) and one-cycle (for shorter runs that benefit from aggressive maximum learning rates).
Warmup linearly increases eta from zero (or a small value) to a peak over the first K steps of training. Goyal and colleagues introduced gradual warmup in 2017 to make large-batch SGD work on ImageNet, and it is now standard for transformers and any model trained with adaptive optimizers at scale. Without warmup, adaptive optimizers like Adam apply unstable updates in the first few hundred steps, when the second-moment estimate is built from only a handful of noisy gradients and the per-parameter denominator can be far too small for some coordinates. Warmup gives the optimizer time to stabilize before applying full-strength updates.
Typical warmup lengths range from a few hundred steps for small models to ten thousand steps or more for the largest LLMs. GPT-3 used 375 million tokens of warmup, and Llama 3 405B used 8,000 steps.
The cosine annealing schedule was proposed by Loshchilov and Hutter in their 2017 ICLR paper SGDR (Stochastic Gradient Descent with Warm Restarts). Within a cycle of length T_i, the learning rate at step T_cur is:
eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i))
The value starts at eta_max, follows a half-cosine curve, and ends at eta_min. The schedule has no plateau and no abrupt drops, and in practice it tends to generalize better than step decay. The original paper also proposed warm restarts: after T_i steps, reset T_cur to zero and start a new cycle, optionally with a longer T_i. Modern LLM training usually uses a single cosine cycle, with eta_min set to roughly 10% of eta_max (the choice popularized by Chinchilla).
The original Transformer paper by Vaswani and colleagues used a custom schedule combining linear warmup with inverse-square-root decay:
lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
This schedule auto-scales with the model dimension d_model and crosses smoothly at step = warmup_steps. The base Transformer used 4,000 warmup steps. The schedule still appears in some NLP codebases, although cosine has largely replaced it in newer work.
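A direct transcription of the formula as a plain Python function; the defaults mirror the base Transformer configuration described above, and the function name is arbitrary:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule from the original Transformer paper.

    Rises linearly for the first warmup_steps, then decays as 1/sqrt(step);
    the two branches meet exactly at step == warmup_steps.
    """
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```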
Leslie Smith introduced cyclical learning rates in 2017 and the one-cycle policy in 2018. The one-cycle schedule rises from a low value to a maximum over the first half of training, then falls symmetrically back, with a short final tail well below the starting point. Smith and Topin showed that this schedule enables "super-convergence": ResNets that normally need 80 epochs can converge in 10 with one-cycle and a much larger peak LR than constant-rate training would tolerate. The schedule is implemented in PyTorch as torch.optim.lr_scheduler.OneCycleLR and is widely used in fastai workflows.
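A minimal usage sketch of the built-in scheduler; the stand-in model, peak rate, and step count below are placeholders rather than recommendations:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)   # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One cycle over the whole run: the LR ramps up to max_lr over the first
# portion of the steps (pct_start, 30% by default), then anneals back down
# to a value well below the starting point.
scheduler = OneCycleLR(optimizer, max_lr=0.4, total_steps=10_000)

# Inside the training loop, call scheduler.step() once per optimizer step.
```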
Smith's 2017 cyclical paper also introduced the LR range test, a quick way to find a sensible peak learning rate for a new model and dataset. The recipe: start from a very small learning rate, increase it steadily (linearly or exponentially) after every mini-batch while recording the training loss, then plot loss against learning rate and choose a peak somewhat below the point where the loss stops falling or begins to blow up.
The LR finder takes a single short run (often a few hundred batches) and tends to give a usable peak learning rate without hand tuning. It is the default in fastai and is implemented in PyTorch Lightning's Tuner.lr_find.
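A hand-rolled version of the test looks roughly like the sketch below; `model`, `criterion`, and `train_loader` are assumed to exist, and the sweep bounds and batch count are typical but arbitrary choices.

```python
import torch

def lr_range_test(model, criterion, train_loader,
                  start_lr=1e-7, end_lr=10.0, num_batches=200):
    """Exponentially sweep the learning rate, recording the loss at each step.

    Pick a peak LR somewhat below where the loss stops falling or blows up.
    Note the sweep perturbs the model's weights, so reload them afterward.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    gamma = (end_lr / start_lr) ** (1.0 / num_batches)  # multiplicative LR step
    lrs, losses, lr = [], [], start_lr
    for i, (x, y) in enumerate(train_loader):
        if i >= num_batches:
            break
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= gamma
        for group in optimizer.param_groups:  # raise the LR for the next batch
            group["lr"] = lr
    return lrs, losses
```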
A second family of methods sidesteps schedule tuning by adapting the step size per parameter using gradient statistics. The user still sets a global eta, but each parameter gets a different effective step.
| Optimizer | Year | Key idea | Typical eta |
|---|---|---|---|
| AdaGrad | Duchi, Hazan, Singer 2011 | Divide by sqrt of historical sum of squared gradients | 0.01 |
| RMSProp | Hinton 2012 (Coursera lecture) | Replace AdaGrad sum with exponential moving average | 1e-3 |
| Adam | Kingma & Ba 2014 | Combine momentum and RMSProp with bias correction | 1e-3 default, 1e-4 to 3e-4 for transformers |
| AdamW | Loshchilov & Hutter ICLR 2019 | Decouple weight decay from the gradient update | 1e-4 to 3e-4 |
| Adafactor | Shazeer & Stern 2018 | Factor the second-moment matrix to save memory | Often runs without explicit eta |
| LAMB | You et al. 2019 | Layer-wise adaptive scaling for huge batches | Used to train BERT in 76 minutes |
| Lion | Chen et al. 2023 | Sign of smoothed gradient; one momentum buffer | 3e-5 to 1e-4 (3 to 10x lower than AdamW) |
| Sophia | Liu et al. 2023 | Diagonal Hessian estimate as preconditioner | About 2x faster than Adam in steps |
AdaGrad's accumulated denominator grows monotonically, so the effective step size shrinks toward zero over time. This works well in convex problems but is too aggressive for deep nets, which is why RMSProp replaced the running sum with an exponential moving average. Adam combines RMSProp's adaptive scaling with momentum and adds bias-correction terms that account for the small initial values of the moving averages. Its defaults (eta = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) work surprisingly well across many tasks, which is why Adam became the default for almost everything between 2015 and 2018.
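Those mechanics are compact enough to write out directly. The sketch below is a simplified single-tensor version of the Adam update using the default values quoted above, not the library implementation:

```python
import numpy as np

def adam_step(param, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    param = param - eta * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```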
AdamW fixed a subtle bug in how Adam handled L2 regularization. In Adam, weight decay was being scaled by the same per-parameter denominator as the gradient, which weakened the regularization for parameters with small gradients. AdamW decouples weight decay from the gradient step so that decay applies uniformly. This change matters most for overfitting-prone models and is the reason essentially every transformer trained after 2019 uses AdamW rather than Adam.
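The decoupling is easiest to see in code. The sketch below mirrors the Adam step above but applies the decay directly to the weights; a comment notes what coupled L2 regularization would do instead.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update: weight decay acts on the weights directly, so it is
    never rescaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive_step = m_hat / (np.sqrt(v_hat) + eps)
    param = param - eta * (adaptive_step + weight_decay * param)
    # Adam with L2 regularization would instead build the moments from
    # (grad + weight_decay * param), letting the denominator weaken the decay.
    return param, m, v
```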
Lion (Evolved Sign Momentum) was discovered by Google's symbolic search over optimizer programs in 2023. Its update is the sign of a momentum-smoothed gradient, so each parameter moves by exactly eta in absolute value. Because the update has uniform magnitude, Lion needs a learning rate three to ten times smaller than AdamW; the authors recommend pairing this with a weight decay three to ten times larger. Lion uses about half the optimizer memory of AdamW (one buffer instead of two) and performs comparably or better on language modeling and image classification.
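A sketch of the Lion update under the same conventions; the betas are the paper's published defaults as recalled here, and the weight decay value is only illustrative of the "three to ten times larger" guidance above.

```python
import numpy as np

def lion_step(param, grad, m, eta=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.3):
    """One Lion update: only the sign of the interpolated momentum is used,
    so every parameter moves by exactly eta (plus the decay term)."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # uniform-magnitude step
    param = param - eta * (update + weight_decay * param)
    m = beta2 * m + (1 - beta2) * grad                # single momentum buffer
    return param, m
```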
Sophia, from Stanford and HKU in 2023, uses a cheap diagonal Hessian estimate as a preconditioner. On GPT-style language models from 125M to 1.5B parameters, Sophia reaches the same perplexity as AdamW in roughly half as many steps. Its adoption is still limited compared to AdamW.
The following are starting points, not final values. Always tune for your dataset and architecture.
| Task / setup | Optimizer | Peak step size | Schedule |
|---|---|---|---|
| ResNet-50 on ImageNet, batch 256 | SGD + momentum 0.9 | 0.1 | Step decay at epoch 30, 60, 90 |
| ResNet-50 on ImageNet, batch 8192 | SGD + momentum 0.9 | 3.2 (linear scaling from 0.1) | 5 epoch warmup + step decay |
| Vision transformer | AdamW | 1e-3 | Linear warmup + cosine |
| BERT pretraining | AdamW | 1e-4 | Linear warmup + linear decay |
| GPT-3 175B pretraining | Adam | 6e-5 | Cosine to 10% over 300B tokens |
| Llama 3 405B pretraining | AdamW | 8e-5 | Cosine to 8e-7 over 1.2M steps |
| Llama 3 8B pretraining | AdamW | 3e-4 | Cosine, 2,000 step warmup |
| Fine-tuning a pretrained LLM | AdamW | 1e-5 to 5e-5 | Linear warmup + linear decay |
| LoRA adapter fine-tuning | AdamW | 1e-4 to 5e-4 | Constant or linear |
| Diffusion model training | AdamW or Lion | 1e-4 (AdamW), 3e-5 (Lion) | Constant or cosine |
| Reinforcement learning (PPO) | Adam | 3e-4 | Linear decay over total steps |
Several patterns are visible. Larger models use smaller peak learning rates: GPT-3's 175B model trained at 6e-5 while smaller GPT-3 variants used up to 6e-4. Fine-tuning uses much smaller rates than pretraining (typically one to two orders of magnitude lower) because the pretrained weights are already good and aggressive updates would erase what the model already knows. Adapter methods like LoRA use higher rates than full fine-tuning because they update a smaller, randomly initialized parameter set.
Step size and batch size are tightly coupled. The two best-known rules are the linear scaling rule (multiply eta by k when the batch size is multiplied by k, as in the ResNet-50 rows of the table above) and the more conservative square-root scaling rule (multiply eta by sqrt(k)).
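A small helper applying either rule when the batch size changes (the function name and its rule argument are purely illustrative):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate when the batch size changes.

    'linear' multiplies eta by the batch-size ratio; 'sqrt' is the more
    conservative square-root rule.
    """
    ratio = new_batch / base_batch
    return base_lr * ratio if rule == "linear" else base_lr * ratio ** 0.5

# ResNet-50 rows from the table above: 0.1 at batch 256 -> 3.2 at batch 8192.
print(scale_lr(0.1, 256, 8192, rule="linear"))  # 3.2
```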
The linear rule eventually breaks down at very large batches. McCandlish, Kaplan and colleagues at OpenAI introduced the gradient noise scale in 2018 to predict where this breakdown happens. The noise scale measures the ratio of gradient variance to gradient magnitude. Below the noise scale, doubling the batch lets you double eta and halve training steps; above the noise scale, returns diminish quickly. The noise scale grows as a model trains, so the optimal batch size grows too. This insight informed the schedule of warming up batch size during GPT-3 pretraining.
Smith and Le (2018) argued from a Bayesian perspective that the relevant quantity is the ratio eta / batch_size, which they call the "noise scale" of SGD. Doubling eta has the same effect on that implicit noise as halving the batch size. This is why "don't decay the learning rate, increase the batch size" can be a viable schedule in distributed training.
Most training problems trace back to the step size. The diagnostic table:
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN or Inf within a few hundred steps | eta far too high | Cut eta by 10x, add or lengthen warmup |
| Loss explodes after a long calm period | eta still too high or schedule misaligned | Check gradient norms, add gradient clipping |
| Loss decreases then plateaus very early | eta too low | Try 3x or 10x larger eta |
| Loss oscillates between values without trending down | eta too high in noisy direction | Lower eta, raise batch size |
| Loss looks fine, validation accuracy stays poor | eta probably fine, check regularization | Adjust weight decay, dropout, augmentation |
| Loss collapses to a constant (model predicts one class) | Sometimes too-low eta combined with bad init | Reset, try a different seed, use LR finder |
| Training diverges only with adaptive optimizer | Missing warmup | Add 1,000 to 10,000 step linear warmup |
| Loss spikes near end of training | Cosine schedule decayed too aggressively | Set eta_min above zero (10% of peak is common) |
Gradient clipping (typically clip_grad_norm to 1.0) often saves runs that the scheduled step size alone would not. Modern LLM pretraining recipes routinely combine warmup, cosine decay, gradient clipping, and AdamW with weight decay 0.1.
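In a PyTorch training loop the combination looks roughly like this; `model`, `optimizer`, `scheduler`, `criterion`, and `train_loader` are assumed to be defined elsewhere:

```python
import torch

for x, y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip the global gradient norm to 1.0 before the update; this is the
    # clip_grad_norm step mentioned above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # per-step (not per-epoch) schedules advance here
```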
Large language model pretraining has converged on a fairly narrow recipe for the step size: AdamW with betas around (0.9, 0.95) and weight decay 0.1, a linear warmup of a few thousand steps, a single cosine decay to roughly 10% of the peak (or lower), and gradient clipping at a norm of 1.0.
GPT-3 175B used Adam with peak 6e-5 and cosine to 10% over 300B tokens. Llama 3 405B used AdamW with peak 8e-5, cosine to 8e-7 over 1.2 million steps, weight decay 0.1, and 8,000 warmup steps. Smaller Llama 3 variants used peak 3e-4 to 4e-4 with 2,000 warmup steps. The pattern (smaller model, larger peak LR) is consistent across published recipes.
For instruction tuning and supervised fine-tuning of LLMs, peak learning rates drop by an order of magnitude or more, typically 1e-5 to 5e-5. LoRA and QLoRA fine-tuning use higher rates (1e-4 to 5e-4) because only a small adapter is being trained from random initialization while the base weights stay frozen.
Research on neural scaling laws has shown that the optimal learning rate depends on model size, dataset size, and training horizon. The rough rules visible in published recipes: the peak rate shrinks as the model grows, the decay phase stretches to cover the full training horizon, and the decay floor is kept at a small but nonzero fraction of the peak.
Every major framework provides scheduler classes. The PyTorch lineup, all in torch.optim.lr_scheduler:
| Class | Schedule |
|---|---|
| LambdaLR | Arbitrary function of step |
| MultiplicativeLR | Multiply eta by a returned factor |
| StepLR | Cut by gamma every step_size epochs |
| MultiStepLR | Cut by gamma at chosen milestones |
| ExponentialLR | Multiply by gamma each epoch |
| CosineAnnealingLR | Cosine to eta_min over T_max epochs |
| CosineAnnealingWarmRestarts | SGDR with restarts |
| OneCycleLR | Smith's one-cycle policy |
| CyclicLR | Smith's triangular cyclical schedule |
| LinearLR | Linear interpolation between two factors |
| PolynomialLR | Polynomial decay |
| ReduceLROnPlateau | Drop on validation plateau |
| SequentialLR | Compose multiple schedulers |
| ChainedScheduler | Apply multiple schedulers in parallel |
In TensorFlow and Keras, the equivalent classes live under tf.keras.optimizers.schedules (for example CosineDecay, PolynomialDecay, PiecewiseConstantDecay). Hugging Face's transformers.Trainer exposes high-level scheduler types via the lr_scheduler_type argument, with linear, cosine, cosine_with_restarts, polynomial, constant, and constant_with_warmup among the supported choices, plus a warmup_steps argument that prepends a linear warmup to any of them. JAX users typically build schedules from optax.warmup_cosine_decay_schedule and similar functions.
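Outside the Trainer, recent versions of transformers expose the same schedules through get_scheduler; a brief sketch, assuming `model` is defined and with placeholder step counts:

```python
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=3e-4)

# "cosine" is one of the lr_scheduler_type names listed above; the warmup
# argument prepends a linear ramp to the cosine decay.
scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=2_000,
    num_training_steps=100_000,
)
```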
A minimal PyTorch example that reproduces the modern LLM recipe:
```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# `model` is assumed to be defined already.
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step):
    """Return a multiplier on the base lr: linear warmup, then cosine to 10%."""
    warmup = 2_000
    total = 100_000
    if step < warmup:
        return step / warmup                       # linear ramp from 0 to 1
    progress = (step - warmup) / (total - warmup)  # 0 after warmup, 1 at the end
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))  # cosine 1.0 -> 0.1

scheduler = LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step, not once per epoch.
```
A short list of rules that survive most projects: tune the step size first, searching on a logarithmic grid; use an LR range test to pick an initial peak; add warmup whenever you use an adaptive optimizer or a large batch; decay toward a small but nonzero floor rather than all the way to zero; keep gradient clipping on for long runs; and retune the step size whenever you change the optimizer, the batch size, or the model scale.
Imagine you are walking down a hill blindfolded, trying to find the lowest spot. The step size is how big each of your steps is. If you take huge leaps, you might fly right over the bottom and end up climbing the next hill. If you take tiny shuffles, you will get closer and closer to the bottom but it will take all day. The trick is to pick steps that are big enough to make progress but small enough that you do not jump past the goal. Most people start with bigger steps and slow down as they think they are getting close to the bottom.