See also: learning rate, parameter update, Adam, SGD
In machine learning, the step size (also called the learning rate, usually written as the Greek letter eta or alpha) is the scalar that controls how far a model's parameters move on each update during gradient-based optimization. It is the single hyperparameter that most strongly determines whether training converges, how fast it converges, and how well the final model generalizes. Practitioners often discover that a model that fails to learn at one step size trains cleanly at a value an order of magnitude smaller or larger, which is why the step size is usually the first thing tuned and the first thing blamed when a run goes wrong.
Given a loss function L(theta) over model parameters theta, plain gradient descent updates the parameters by subtracting a scaled gradient:
theta_{t+1} = theta_t - eta * grad L(theta_t)
Here eta is the step size. Each iteration moves the parameters a distance proportional to eta in the direction of steepest descent. In stochastic and mini-batch settings the true gradient is replaced by a noisy estimate computed on a batch of examples, but the role of eta does not change.
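As a toy illustration of how the choice of eta plays out, the sketch below runs the update above on a one-dimensional quadratic loss (an invented example, not tied to any library): a very small eta crawls toward the minimum, a moderate eta converges quickly, and an eta past the quadratic's stability threshold diverges.

```python
# Gradient descent on L(theta) = 0.5 * a * theta**2.
# The gradient is a * theta, so each update is theta <- theta * (1 - eta * a);
# the iteration converges only when 0 < eta < 2 / a (here 2 / 4 = 0.5).

def gradient_descent(eta, a=4.0, theta0=1.0, steps=20):
    theta = theta0
    for _ in range(steps):
        grad = a * theta            # grad L(theta)
        theta = theta - eta * grad  # the update from the formula above
    return theta

for eta in (0.01, 0.1, 0.4, 0.6):   # 0.6 exceeds 2/a and diverges
    print(f"eta={eta}: theta after 20 steps = {gradient_descent(eta):.4g}")
```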
The term "step size" comes from the optimization literature, where the per-iteration parameter movement is literally the length of a step taken across the loss landscape. In the deep learning community the same quantity is almost always called the learning rate. The two names are interchangeable, although "step size" is more common in convex optimization and theoretical papers, while "learning rate" dominates practical guides and code APIs.
In deep learning, varying the step size by a factor of two often changes final accuracy more than swapping architectures or doubling the model size. The reasons trace back to the geometry of the loss surface: steps that are too large overshoot along sharp, high-curvature directions and can diverge, while steps that are too small make almost no progress along flat directions and leave the model at a worse solution than it could otherwise reach.
Leslie Smith's 2018 report on disciplined hyperparameter tuning argues that the learning rate is the dominant control knob: tuning it well lets you set most other hyperparameters from sensible defaults.
For plain SGD, the value passed to the optimizer is also the actual displacement per gradient unit. With momentum and adaptive optimizers, the effective step size diverges from the nominal value: with momentum mu, repeated gradients compound until the steady-state displacement approaches eta / (1 - mu), and adaptive methods such as Adam divide each coordinate by a running estimate of the gradient's magnitude, so the realized step differs per parameter.
This distinction matters when porting hyperparameters between optimizers or papers. A learning rate of 1e-3 in Adam is not comparable to 1e-3 in SGD or 1e-3 in Lion.
The step size does not need to stay constant. Almost every modern training run varies eta over time according to a schedule. The schedule typically combines a warmup phase (eta starts small and grows) with a decay phase (eta shrinks toward zero). The table below summarizes the schedules in common use.
| Schedule | Formula | Where it is used | Notes |
|---|---|---|---|
| Constant | eta_t = eta_0 | Online learning, simple baselines | Easy to debug, rarely optimal for deep nets |
| Step decay | eta_t = eta_0 * gamma^{floor(t / s)} | Classic ImageNet ResNets | Cut by 10x at chosen epoch boundaries |
| Exponential decay | eta_t = eta_0 * gamma^t | Older RL and online setups | Smooth analog of step decay |
| Polynomial decay | eta_t = eta_0 * (1 - t/T)^p | BERT pretraining (p=1, linear) | Linear-to-zero is a common LLM default |
| Inverse square root | eta_t proportional to 1 / sqrt(t) | Original Transformer | Combined with linear warmup |
| Cosine annealing | eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T)) | Modern vision and LLM pretraining | Smooth, single-cycle, no extra hyperparameters |
| Cosine with warm restarts (SGDR) | Cosine over a cycle, then restart | Snapshot ensembles, fine-tuning | Periodic restarts can escape sharp minima |
| One-cycle | Triangular up then down, plus a final tail | Fast.ai super-convergence regime | Allows much larger peak LR than usual |
| Cyclical | Triangular oscillation between eta_min and eta_max | CV experiments | Avoids tuning a single peak value |
| Linear warmup + cosine decay | Linear ramp for K steps, then cosine | GPT-3, Llama, Chinchilla | The default for LLM pretraining |
| ReduceLROnPlateau | Drop eta when validation loss stops improving | Older PyTorch workflows | Reactive, not deterministic |
The two schedules dominating modern practice are linear warmup plus cosine decay (for large pretraining runs) and one-cycle (for shorter runs that benefit from aggressive maximum learning rates).
Warmup linearly increases eta from zero (or a small value) to a peak over the first K steps of training. Goyal and colleagues introduced gradual warmup in 2017 to make large-batch SGD work on ImageNet, and it is now standard for transformers and any model trained with adaptive optimizers at scale. Without warmup, adaptive optimizers like Adam apply unstable updates in the first few hundred steps, when the second-moment estimate is built from only a handful of noisy gradients and the per-parameter denominator can be far too small for some coordinates. Warmup gives the optimizer time to stabilize before applying full-strength updates.
Typical warmup lengths range from a few hundred steps for small models to ten thousand steps or more for the largest LLMs. GPT-3 used 375 million tokens of warmup, and Llama 3 405B used 8,000 steps.
The cosine annealing schedule was proposed by Loshchilov and Hutter in their 2017 ICLR paper SGDR (Stochastic Gradient Descent with Warm Restarts). Within a cycle of length T_i, the learning rate at step T_cur is:
eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i))
The value starts at eta_max, follows a half-cosine curve, and ends at eta_min. The schedule has no plateau and no abrupt drops, and in practice it tends to generalize better than step decay. The original paper also proposed warm restarts: after T_i steps, reset T_cur to zero and start a new cycle, optionally with a longer T_i. Modern LLM training usually uses a single cosine cycle, with eta_min set to roughly 10% of eta_max (the choice popularized by Chinchilla).
The original Transformer paper by Vaswani and colleagues used a custom schedule combining linear warmup with inverse-square-root decay:
lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})
This schedule auto-scales with the model dimension d_model and crosses smoothly at step = warmup_steps. The base Transformer used 4,000 warmup steps. The schedule still appears in some NLP codebases, although cosine has largely replaced it in newer work.
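A direct transcription of the formula as a plain Python function; the defaults mirror the base Transformer configuration described above, and the function name is arbitrary:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root schedule from the original Transformer paper.

    Rises linearly for the first warmup_steps, then decays as 1/sqrt(step);
    the two branches meet exactly at step == warmup_steps.
    """
    step = max(step, 1)  # avoid step ** -0.5 blowing up at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```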
Leslie Smith introduced cyclical learning rates in 2017 and the one-cycle policy in 2018. The one-cycle schedule rises from a low value to a maximum over the first half of training, then falls symmetrically back, with a short final tail well below the starting point. Smith and Topin showed that this schedule enables "super-convergence": ResNets that normally need 80 epochs can converge in 10 with one-cycle and a much larger peak LR than constant-rate training would tolerate. The schedule is implemented in PyTorch as torch.optim.lr_scheduler.OneCycleLR and is widely used in fastai workflows.
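A minimal usage sketch of the built-in scheduler; the stand-in model, peak rate, and step count below are placeholders rather than recommendations:

```python
import torch
from torch.optim.lr_scheduler import OneCycleLR

model = torch.nn.Linear(10, 2)   # stand-in model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One cycle over the whole run: the LR ramps up to max_lr over the first
# portion of the steps (pct_start, 30% by default), then anneals back down
# to a value well below the starting point.
scheduler = OneCycleLR(optimizer, max_lr=0.4, total_steps=10_000)

# Inside the training loop, call scheduler.step() once per optimizer step.
```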
Smith's 2017 cyclical paper also introduced the LR range test, a quick way to find a sensible peak learning rate for a new model and dataset. The recipe: start from a very small learning rate, increase it steadily (linearly or exponentially) after every mini-batch while recording the training loss, then plot loss against learning rate and choose a peak somewhat below the point where the loss stops falling or begins to blow up.
The LR finder takes a single short run (often a few hundred batches) and tends to give a usable peak learning rate without hand tuning. It is the default in fastai and is implemented in PyTorch Lightning's Tuner.lr_find.
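A hand-rolled version of the test looks roughly like the sketch below; `model`, `criterion`, and `train_loader` are assumed to exist, and the sweep bounds and batch count are typical but arbitrary choices.

```python
import torch

def lr_range_test(model, criterion, train_loader,
                  start_lr=1e-7, end_lr=10.0, num_batches=200):
    """Exponentially sweep the learning rate, recording the loss at each step.

    Pick a peak LR somewhat below where the loss stops falling or blows up.
    Note the sweep perturbs the model's weights, so reload them afterward.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=start_lr)
    gamma = (end_lr / start_lr) ** (1.0 / num_batches)  # multiplicative LR step
    lrs, losses, lr = [], [], start_lr
    for i, (x, y) in enumerate(train_loader):
        if i >= num_batches:
            break
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= gamma
        for group in optimizer.param_groups:  # raise the LR for the next batch
            group["lr"] = lr
    return lrs, losses
```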
A second family of methods sidesteps schedule tuning by adapting the step size per parameter using gradient statistics. The user still sets a global eta, but each parameter gets a different effective step.
| Optimizer | Year | Key idea | Typical eta |
|---|---|---|---|
| AdaGrad | Duchi, Hazan, Singer 2011 | Divide by sqrt of historical sum of squared gradients | 0.01 |
| RMSProp | Hinton 2012 (Coursera lecture) | Replace AdaGrad sum with exponential moving average | 1e-3 |
| Adam | Kingma & Ba 2014 | Combine momentum and RMSProp with bias correction | 1e-3 default, 1e-4 to 3e-4 for transformers |
| AdamW | Loshchilov & Hutter ICLR 2019 | Decouple weight decay from the gradient update | 1e-4 to 3e-4 |
| Adafactor | Shazeer & Stern 2018 | Factor the second-moment matrix to save memory | Often runs without explicit eta |
| LAMB | You et al. 2019 | Layer-wise adaptive scaling for huge batches | Used to train BERT in 76 minutes |
| Lion | Chen et al. 2023 | Sign of smoothed gradient; one momentum buffer | 3e-5 to 1e-4 (3 to 10x lower than AdamW) |
| Sophia | Liu et al. 2023 | Diagonal Hessian estimate as preconditioner | About 2x faster than Adam in steps |
AdaGrad's accumulated denominator grows monotonically, so the effective step size shrinks toward zero over time. This works well in convex problems but is too aggressive for deep nets, which is why RMSProp replaced the running sum with an exponential moving average. Adam combines RMSProp's adaptive scaling with momentum and adds bias-correction terms that account for the small initial values of the moving averages. Its defaults (eta = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) work surprisingly well across many tasks, which is why Adam became the default for almost everything between 2015 and 2018.
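Those mechanics are compact enough to write out directly. The sketch below is a simplified single-tensor version of the Adam update using the default values quoted above, not the library implementation:

```python
import numpy as np

def adam_step(param, grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the running first/second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # momentum (first moment)
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)              # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    param = param - eta * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```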
AdamW fixed a subtle bug in how Adam handled L2 regularization. In Adam, weight decay was being scaled by the same per-parameter denominator as the gradient, which weakened the regularization for parameters with small gradients. AdamW decouples weight decay from the gradient step so that decay applies uniformly. This change matters most for overfitting-prone models and is the reason essentially every transformer trained after 2019 uses AdamW rather than Adam.
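The decoupling is easiest to see in code. The sketch below mirrors the Adam step above but applies the decay directly to the weights; a comment notes what coupled L2 regularization would do instead.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update: weight decay acts on the weights directly, so it is
    never rescaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive_step = m_hat / (np.sqrt(v_hat) + eps)
    param = param - eta * (adaptive_step + weight_decay * param)
    # Adam with L2 regularization would instead build the moments from
    # (grad + weight_decay * param), letting the denominator weaken the decay.
    return param, m, v
```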
Lion (Evolved Sign Momentum) was discovered by Google's symbolic search over optimizer programs in 2023. Its update is the sign of a momentum-smoothed gradient, so each parameter moves by exactly eta in absolute value. Because the update has uniform magnitude, Lion needs a learning rate three to ten times smaller than AdamW; the authors recommend pairing this with a weight decay three to ten times larger. Lion uses about half the optimizer memory of AdamW (one buffer instead of two) and performs comparably or better on language modeling and image classification.
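A sketch of the Lion update under the same conventions; the betas are the paper's published defaults as recalled here, and the weight decay value is only illustrative of the "three to ten times larger" guidance above.

```python
import numpy as np

def lion_step(param, grad, m, eta=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.3):
    """One Lion update: only the sign of the interpolated momentum is used,
    so every parameter moves by exactly eta (plus the decay term)."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)  # uniform-magnitude step
    param = param - eta * (update + weight_decay * param)
    m = beta2 * m + (1 - beta2) * grad                # single momentum buffer
    return param, m
```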
Sophia, from Stanford and HKU in 2023, uses a cheap diagonal Hessian estimate as a preconditioner. On GPT-style language models from 125M to 1.5B parameters, Sophia reaches the same perplexity as AdamW in roughly half as many steps. Its adoption is still limited compared to AdamW.
The following are starting points, not final values. Always tune for your dataset and architecture.
| Task / setup | Optimizer | Peak step size | Schedule |
|---|---|---|---|
| ResNet-50 on ImageNet, batch 256 | SGD + momentum 0.9 | 0.1 | Step decay at epoch 30, 60, 90 |
| ResNet-50 on ImageNet, batch 8192 | SGD + momentum 0.9 | 3.2 (linear scaling from 0.1) | 5 epoch warmup + step decay |
| Vision transformer | AdamW | 1e-3 | Linear warmup + cosine |
| BERT pretraining | AdamW | 1e-4 | Linear warmup + linear decay |
| GPT-3 175B pretraining | Adam | 6e-5 | Cosine to 10% over 300B tokens |
| Llama 3 405B pretraining | AdamW | 8e-5 | Cosine to 8e-7 over 1.2M steps |
| Llama 3 8B pretraining | AdamW | 3e-4 | Cosine, 2,000 step warmup |
| Fine-tuning a pretrained LLM | AdamW | 1e-5 to 5e-5 | Linear warmup + linear decay |
| LoRA adapter fine-tuning | AdamW | 1e-4 to 5e-4 | Constant or linear |
| Diffusion model training | AdamW or Lion | 1e-4 (AdamW), 3e-5 (Lion) | Constant or cosine |
| Reinforcement learning (PPO) | Adam | 3e-4 | Linear decay over total steps |
Several patterns are visible. Larger models use smaller peak learning rates: GPT-3's 175B model trained at 6e-5 while smaller GPT-3 variants used up to 6e-4. Fine-tuning uses much smaller rates than pretraining (typically one to two orders of magnitude lower) because the pretrained weights are already good and aggressive updates would erase what the model already knows. Adapter methods like LoRA use higher rates than full fine-tuning because they update a smaller, randomly initialized parameter set.
Step size and batch size are tightly coupled. The two best-known rules are the linear scaling rule (multiply eta by k when the batch size is multiplied by k, as in the ResNet-50 rows of the table above) and the more conservative square-root scaling rule (multiply eta by sqrt(k)).
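A small helper applying either rule when the batch size changes (the function name and its rule argument are purely illustrative):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate when the batch size changes.

    'linear' multiplies eta by the batch-size ratio; 'sqrt' is the more
    conservative square-root rule.
    """
    ratio = new_batch / base_batch
    return base_lr * ratio if rule == "linear" else base_lr * ratio ** 0.5

# ResNet-50 rows from the table above: 0.1 at batch 256 -> 3.2 at batch 8192.
print(scale_lr(0.1, 256, 8192, rule="linear"))  # 3.2
```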
The linear rule eventually breaks down at very large batches. McCandlish, Kaplan and colleagues at OpenAI introduced the gradient noise scale in 2018 to predict where this breakdown happens. The noise scale measures the ratio of gradient variance to gradient magnitude. Below the noise scale, doubling the batch lets you double eta and halve training steps; above the noise scale, returns diminish quickly. The noise scale grows as a model trains, so the optimal batch size grows too. This insight informed the schedule of warming up batch size during GPT-3 pretraining.
Smith and Le (2018) argued from a Bayesian perspective that the relevant quantity is the ratio eta / batch_size, which they call the "noise scale" of SGD. Doubling eta has the same effect on that implicit noise as halving the batch size. This is why "don't decay the learning rate, increase the batch size" can be a viable schedule in distributed training.
Most training problems trace back to the step size. The diagnostic table:
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN or Inf within a few hundred steps | eta far too high | Cut eta by 10x, add or lengthen warmup |
| Loss explodes after a long calm period | eta still too high or schedule misaligned | Check gradient norms, add gradient clipping |
| Loss decreases then plateaus very early | eta too low | Try 3x or 10x larger eta |
| Loss oscillates between values without trending down | eta too high in noisy direction | Lower eta, raise batch size |
| Loss looks fine, validation accuracy stays poor | eta probably fine, check regularization | Adjust weight decay, dropout, augmentation |
| Loss collapses to a constant (model predicts one class) | Sometimes too-low eta combined with bad init | Reset, try a different seed, use LR finder |
| Training diverges only with adaptive optimizer | Missing warmup | Add 1,000 to 10,000 step linear warmup |
| Loss spikes near end of training | Cosine schedule decayed too aggressively | Set eta_min above zero (10% of peak is common) |
Gradient clipping (typically clip_grad_norm to 1.0) often saves runs that the scheduled step size alone would not. Modern LLM pretraining recipes routinely combine warmup, cosine decay, gradient clipping, and AdamW with weight decay 0.1.
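In a PyTorch training loop the combination looks roughly like this; `model`, `optimizer`, `scheduler`, `criterion`, and `train_loader` are assumed to be defined elsewhere:

```python
import torch

for x, y in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip the global gradient norm to 1.0 before the update; this is the
    # clip_grad_norm step mentioned above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # per-step (not per-epoch) schedules advance here
```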
Large language model pretraining has converged on a fairly narrow recipe for the step size: AdamW with betas around (0.9, 0.95) and weight decay 0.1, a linear warmup of a few thousand steps, a single cosine decay to roughly 10% of the peak (or lower), and gradient clipping at a norm of 1.0.
GPT-3 175B used Adam with peak 6e-5 and cosine to 10% over 300B tokens. Llama 3 405B used AdamW with peak 8e-5, cosine to 8e-7 over 1.2 million steps, weight decay 0.1, and 8,000 warmup steps. Smaller Llama 3 variants used peak 3e-4 to 4e-4 with 2,000 warmup steps. The pattern (smaller model, larger peak LR) is consistent across published recipes.
For instruction tuning and supervised fine-tuning of LLMs, peak learning rates drop by an order of magnitude or more, typically 1e-5 to 5e-5. LoRA and QLoRA fine-tuning use higher rates (1e-4 to 5e-4) because only a small adapter is being trained from random initialization while the base weights stay frozen.
Research on neural scaling laws has shown that the optimal learning rate depends on model size, dataset size, and training horizon. The rough rules visible in published recipes: the peak rate shrinks as the model grows, the decay phase stretches to cover the full training horizon, and the decay floor is kept at a small but nonzero fraction of the peak.
Every major framework provides scheduler classes. The PyTorch lineup, all in torch.optim.lr_scheduler:
| Class | Schedule |
|---|---|
| LambdaLR | Arbitrary function of step |
| MultiplicativeLR | Multiply eta by a returned factor |
| StepLR | Cut by gamma every step_size epochs |
| MultiStepLR | Cut by gamma at chosen milestones |
| ExponentialLR | Multiply by gamma each epoch |
| CosineAnnealingLR | Cosine to eta_min over T_max epochs |
| CosineAnnealingWarmRestarts | SGDR with restarts |
| OneCycleLR | Smith's one-cycle policy |
| CyclicLR | Smith's triangular cyclical schedule |
| LinearLR | Linear interpolation between two factors |
| PolynomialLR | Polynomial decay |
| ReduceLROnPlateau | Drop on validation plateau |
| SequentialLR | Compose multiple schedulers |
| ChainedScheduler | Apply multiple schedulers in parallel |
In TensorFlow and Keras, the equivalent classes live under tf.keras.optimizers.schedules (for example CosineDecay, PolynomialDecay, PiecewiseConstantDecay). Hugging Face's transformers.Trainer exposes high-level scheduler types via the lr_scheduler_type argument, with linear, cosine, cosine_with_restarts, polynomial, constant, and constant_with_warmup among the supported choices, plus a warmup_steps argument that prepends a linear warmup to any of them. JAX users typically build schedules from optax.warmup_cosine_decay_schedule and similar functions.
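Outside the Trainer, recent versions of transformers expose the same schedules through get_scheduler; a brief sketch, assuming `model` is defined and with placeholder step counts:

```python
from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=3e-4)

# "cosine" is one of the lr_scheduler_type names listed above; the warmup
# argument prepends a linear ramp to the cosine decay.
scheduler = get_scheduler(
    "cosine",
    optimizer=optimizer,
    num_warmup_steps=2_000,
    num_training_steps=100_000,
)
```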
A minimal PyTorch example that reproduces the modern LLM recipe:
```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# `model` is assumed to be defined already.
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step):
    """Return a multiplier on the base lr: linear warmup, then cosine to 10%."""
    warmup = 2_000
    total = 100_000
    if step < warmup:
        return step / warmup                       # linear ramp from 0 to 1
    progress = (step - warmup) / (total - warmup)  # 0 after warmup, 1 at the end
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))  # cosine 1.0 -> 0.1

scheduler = LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step, not once per epoch.
```
A short list of rules that survive most projects: tune the step size first, searching on a logarithmic grid; use an LR range test to pick an initial peak; add warmup whenever you use an adaptive optimizer or a large batch; decay toward a small but nonzero floor rather than all the way to zero; keep gradient clipping on for long runs; and retune the step size whenever you change the optimizer, the batch size, or the model scale.
Imagine you are walking down a hill blindfolded, trying to find the lowest spot. The step size is how big each of your steps is. If you take huge leaps, you might fly right over the bottom and end up climbing the next hill. If you take tiny shuffles, you will get closer and closer to the bottom but it will take all day. The trick is to pick steps that are big enough to make progress but small enough that you do not jump past the goal. Most people start with bigger steps and slow down as they think they are getting close to the bottom.