See also: Machine learning terms
In machine learning, a parameter update is the rule that adjusts a model's trainable parameters θ after a gradient of the loss has been computed. It is the final stage of every training iteration: the forward pass produces predictions, the loss is computed, backpropagation and automatic differentiation yield the gradient ∇L(θ), and the parameter update consumes that gradient to produce a new value of θ. Modern training pipelines, including those used to train large language models, repeat this loop hundreds of thousands to millions of times, often across thousands of GPUs.
The generic form of an update is:
θ_new = θ_old − update(g, state)
where g is the current gradient and state holds optional running statistics such as momentum buffers, second-moment estimates, or per-parameter learning rates. Different optimizers differ only in how they compute that update term.
A single training step in a typical deep learning framework follows the same five operations:
1. Run the forward pass to produce predictions.
2. Compute the loss.
3. Run the backward pass to obtain gradients.
4. Apply the optimizer step, i.e. the parameter update.
5. Reset the gradients for the next iteration.
In PyTorch the last three operations are loss.backward(), optimizer.step(), and optimizer.zero_grad(). In TensorFlow and Keras the equivalent call is optimizer.apply_gradients(zip(grads, vars)). JAX and Optax instead expose optimizers as pure functions that take the gradient and the previous state and return a new state and an update vector; the user then applies that vector with optax.apply_updates(params, updates).
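A minimal PyTorch sketch of one such step; the model, data, and hyperparameters are toy stand-ins for illustration:

```python
import torch
from torch import nn

# Toy model, data, and loss; all names here are illustrative.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

prediction = model(x)          # 1. forward pass
loss = loss_fn(prediction, y)  # 2. loss computation
loss.backward()                # 3. backward pass populates p.grad
optimizer.step()               # 4. parameter update consumes p.grad
optimizer.zero_grad()          # 5. reset gradients for the next iteration
```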
The distinction between computing the gradient and applying the update matters because both operations can be modified independently. Mixed-precision training, gradient accumulation, gradient clipping, and weight decay all live in this gap between the backward pass and the optimizer step.
The simplest update rule is plain gradient descent, which scales the gradient by a constant learning rate η and subtracts it from the current parameters:
θ ← θ − η ∇L(θ)
When the gradient is computed on a single example or a mini-batch instead of the full dataset, the same rule is called stochastic gradient descent or mini-batch SGD. Vanilla SGD has no internal state: the only thing the optimizer remembers between steps is the step size itself, which may be adjusted by a separate learning-rate schedule.
The lack of state is both a strength and a weakness. SGD uses no extra memory beyond the gradient, but it converges slowly on ill-conditioned problems and oscillates in narrow ravines of the loss surface. Almost every modern optimizer adds buffers to address one or both of these issues.
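As a sketch, the entire stateless update fits in a few lines of PyTorch, equivalent in effect to torch.optim.SGD with default arguments:

```python
import torch

def sgd_update(params, lr):
    # Vanilla SGD: the optimizer keeps no state beyond the step size itself.
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad   # θ ← θ − η ∇L(θ)
```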
Polyak (1964) introduced the heavy-ball method, also known as classical momentum, which augments SGD with a velocity buffer v that accumulates past gradients with exponential decay μ:
v ← μ v + ∇L(θ)
θ ← θ − η v
The physical analogy is a ball rolling down the loss surface: the velocity carries it through small bumps and damps oscillations across narrow valleys. Typical values of μ in deep learning are 0.9 or 0.99.
Nesterov (1983) refined this with Nesterov accelerated gradient, which evaluates the gradient at the look-ahead point θ − μv rather than at θ itself:
v ← μ v + ∇L(θ − μ v)
θ ← θ − η v
In theory, Nesterov momentum gives an O(1/T²) convergence rate on smooth convex problems, compared with O(1/T) for plain gradient descent. In practice the difference for non-convex deep networks is smaller, but PyTorch and most other frameworks expose it as a one-line option (nesterov=True).
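A sketch of both variants as a single in-place function; note that the Nesterov branch here uses the reformulation frameworks apply in practice (stepping along g + μv after the velocity update) rather than the literal look-ahead evaluation:

```python
import torch

def momentum_step(p, v, lr=0.01, mu=0.9, nesterov=False):
    # One heavy-ball (or Nesterov) update on parameter p with velocity buffer v.
    with torch.no_grad():
        g = p.grad
        v.mul_(mu).add_(g)                 # v ← μ v + g
        if nesterov:
            p.add_(g + mu * v, alpha=-lr)  # θ ← θ − η (g + μ v)
        else:
            p.add_(v, alpha=-lr)           # θ ← θ − η v
```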
A second family of optimizers gives every parameter its own effective learning rate based on the history of its gradients.
Duchi, Hazan, and Singer (2011) proposed AdaGrad, which divides the gradient by the square root of the sum of all past squared gradients. Parameters that receive large gradients early on get progressively smaller updates, which works well for sparse features but eventually freezes learning altogether on dense problems.
Hinton (2012) addressed that decay-to-zero behavior with RMSProp, introduced in his Coursera lecture notes. RMSProp replaces the running sum with an exponential moving average of squared gradients, controlled by a decay parameter ρ (often 0.9):
E[g²] ← ρ E[g²] + (1 − ρ) g²
θ ← θ − η g / √(E[g²] + ε)
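A sketch of one RMSProp update matching the formulas above; sq_avg is a caller-maintained buffer initialized to zeros:

```python
import torch

def rmsprop_step(p, sq_avg, lr=1e-3, rho=0.9, eps=1e-8):
    # One RMSProp update; sq_avg holds the running average E[g²].
    with torch.no_grad():
        g = p.grad
        sq_avg.mul_(rho).addcmul_(g, g, value=1 - rho)   # E[g²] ← ρ E[g²] + (1−ρ) g²
        p.addcdiv_(g, (sq_avg + eps).sqrt(), value=-lr)  # θ ← θ − η g / √(E[g²] + ε)
```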
Kingma and Ba (2015) combined RMSProp's adaptive scaling with momentum to produce Adam, which has dominated deep learning since its publication. The full update at step t is:
m ← β₁ m + (1 − β₁) g
v ← β₂ v + (1 − β₂) g²
m̂ = m / (1 − β₁ᵗ)
v̂ = v / (1 − β₂ᵗ)
θ ← θ − η m̂ / (√v̂ + ε)
The two correction factors (1 − β₁ᵗ) and (1 − β₂ᵗ) compensate for the fact that m and v start at zero and would otherwise be biased toward zero in the first few steps. Common defaults are β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
Loshchilov and Hutter (2019) showed that Adam's standard L2 regularization interacts badly with the per-parameter scaling: large-gradient weights effectively receive less regularization than small-gradient weights. They proposed AdamW, which decouples weight decay from the gradient and applies it directly to the parameters:
θ ← θ − η m̂ / (√v̂ + ε) − η λ θ
This tiny change improves generalization enough that AdamW has displaced Adam in nearly every modern training recipe, including BERT, GPT, T5, Llama, and DeepSeek.
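The following sketch implements the Adam equations above, with an optional decoupled-decay line that turns it into AdamW; the decay is applied before the gradient step, matching the pattern PyTorch's implementation uses:

```python
import torch

def adamw_step(p, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    # One Adam step on parameter p at step count t (1-indexed);
    # weight_decay > 0 gives the decoupled AdamW variant.
    with torch.no_grad():
        g = p.grad
        if weight_decay:
            p.mul_(1 - lr * weight_decay)                # decoupled decay (AdamW)
        m.mul_(beta1).add_(g, alpha=1 - beta1)           # m ← β₁ m + (1−β₁) g
        v.mul_(beta2).addcmul_(g, g, value=1 - beta2)    # v ← β₂ v + (1−β₂) g²
        m_hat = m / (1 - beta1 ** t)                     # bias corrections
        v_hat = v / (1 - beta2 ** t)
        p.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)  # θ ← θ − η m̂/(√v̂ + ε)
```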
The table below summarizes the most common rules.
| Optimizer | Year | State per parameter | Update rule (sketch) | Notes |
|---|---|---|---|---|
| SGD | 1951 | none | θ ← θ − η g | Robbins and Monro |
| Momentum | 1964 | velocity v | θ ← θ − η (μ v + g) | Polyak heavy-ball |
| Nesterov | 1983 | velocity v | gradient at θ − μ v | Accelerated method |
| AdaGrad | 2011 | sum of g² | η / √Σg² scaling | Duchi et al. |
| RMSProp | 2012 | EMA of g² | η / √EMA(g²) scaling | Hinton lecture notes |
| Adam | 2015 | m, v | bias-corrected first and second moments | Kingma and Ba |
| AdamW | 2019 | m, v | Adam + decoupled weight decay | Loshchilov and Hutter |
| LAMB | 2019 | m, v | layer-wise normalized AdamW | Trains BERT in 76 minutes |
| Lion | 2023 | momentum only | sign(β₁ m + (1 − β₁) g) | Discovered by symbolic search |
| Sophia | 2023 | m, diagonal Hessian | clipped m / max(diag-H, ε) | Light-weight second-order |
More recent work continues to push the design space. Chen et al. (2023) discovered Lion ("EvoLved Sign Momentum") through automated symbolic search; it tracks only momentum, applies the sign function elementwise, and uses roughly half the optimizer memory of Adam while training Vision Transformers up to 5× faster on JFT and matching or exceeding Adam on diffusion models. Liu et al. (2023) introduced Sophia, which estimates the diagonal of the Hessian every few steps and clips the resulting Newton-style update; on GPT models from 125M to 1.5B parameters, Sophia reaches the same perplexity as Adam in roughly half the steps.
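A sketch of the Lion update as described by Chen et al. (2023); the hyperparameter values follow the paper's suggested defaults, and this implementation is an illustration rather than the authors' code:

```python
import torch

def lion_step(p, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # One Lion step: the applied update is the elementwise sign of an
    # interpolation between the momentum buffer and the current gradient.
    with torch.no_grad():
        g = p.grad
        if weight_decay:
            p.mul_(1 - lr * weight_decay)                    # decoupled decay
        p.add_((beta1 * m + (1 - beta1) * g).sign(), alpha=-lr)
        m.mul_(beta2).add_(g, alpha=1 - beta2)               # m ← β₂ m + (1−β₂) g
```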
Several techniques sit between the backward pass and the optimizer step and modify either the gradient or the parameter delta before the update is applied.
| Technique | What it modifies | Why it is used |
|---|---|---|
| Gradient clipping (norm) | rescales g if ‖g‖ > c | Prevents exploding gradients in RNNs and transformers; LLM training typically uses c = 1.0 |
| Gradient clipping (value) | clips each gᵢ to [−c, c] | Cheaper but less common than norm clipping |
| Gradient accumulation | sums g across micro-batches before stepping | Simulates a large effective batch on limited GPU memory |
| Mixed precision | computes gradients in BF16 or FP16 with a loss scale | Halves memory and roughly doubles throughput on modern GPUs |
| Decoupled weight decay | θ ← θ (1 − η λ) before the gradient step | Restores the original meaning of weight decay under adaptive optimizers (AdamW) |
| Exponential moving average of weights | maintains a separate θ_EMA ← ρ θ_EMA + (1 − ρ) θ | Used as the inference checkpoint, especially for diffusion models and segmentation |
| LARS | per-layer LR ∝ ‖θ_layer‖ / ‖g_layer‖ | Enables large-batch SGD for ResNet (You et al. 2017) |
| LAMB | per-layer LR scaling around AdamW | Used to train BERT in 76 minutes with batch size 32,768 (You et al. 2019) |
Gradient clipping deserves special attention because it is one of the few techniques that almost every modern training recipe uses unchanged. Pascanu, Mikolov, and Bengio (2013) showed that the loss surface of recurrent networks contains "cliffs" where the gradient magnitude can grow by orders of magnitude in a single step, and that simply rescaling the gradient when its norm exceeds a fixed threshold is enough to make training stable. The same trick is now standard for transformers, with PyTorch exposing it as torch.nn.utils.clip_grad_norm_(params, max_norm=1.0).
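A sketch combining gradient accumulation and norm clipping in PyTorch; the model, data, and accumulation count are illustrative:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                        # toy model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4   # micro-batches per optimizer step
max_norm = 1.0    # gradient-norm threshold typical of LLM recipes

for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps   # scale so accumulated grads average
    loss.backward()                             # grads sum into p.grad across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```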
When a model is trained across multiple GPUs or nodes, the parameter update has to be coordinated. The dominant strategies are summarized below.
| Strategy | What is shared | Communication |
|---|---|---|
| Data parallel (DDP) | full parameters, gradients, and optimizer state on every GPU | All-reduce of gradients per step |
| ZeRO stage 1 | parameters and gradients replicated; optimizer state sharded | Reduce-scatter of gradients, all-gather of updated weights |
| ZeRO stage 2 | parameters replicated; gradients and optimizer state sharded | Same as stage 1 with smaller buffers |
| ZeRO stage 3 / FSDP | parameters, gradients, and optimizer state all sharded | All-gather parameters before each layer's compute, reduce-scatter gradients |
| Pipeline parallel | layers split across devices | Activations and gradients passed between stages; micro-batches stagger updates |
| Tensor parallel | individual matrices split across devices | All-reduce inside each layer |
Rajbhandari et al. (2020) introduced ZeRO (Zero Redundancy Optimizer) in DeepSpeed. Their key observation was that under standard data parallelism the optimizer state, which for Adam is roughly 8 bytes per parameter for the moments plus 4 bytes for an FP32 master copy, dominates GPU memory at scale. ZeRO partitions that state across the data-parallel workers so each rank holds only 1/N of it, and the same idea extends to gradients (stage 2) and parameters themselves (stage 3). PyTorch's Fully Sharded Data Parallel (FSDP) is the upstream equivalent: it all-gathers parameter shards just before each forward and backward sub-graph runs, then reshards immediately after, and applies the optimizer step to its local shard only.
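A minimal FSDP sketch, assuming a one-process-per-GPU launch (for example via torchrun) and a placeholder model; the key point is that the optimizer is constructed after wrapping, so optimizer.step() operates only on this rank's local shards:

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")   # one process per GPU, e.g. launched by torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Transformer().cuda()   # placeholder model for illustration
model = FSDP(model)               # shards parameters, gradients, optimizer state

# Build the optimizer *after* wrapping so it sees only the local shards.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```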
Synchronization style is a separate axis. Synchronous training, used by almost all production LLM runs, makes every worker wait for the all-reduce to finish before stepping; the result is reproducible given a fixed seed, a fixed topology, and deterministic kernels. Asynchronous training, pioneered by Niu et al.'s 2011 Hogwild! scheme, lets each worker write to shared parameters without locks and accepts that some gradients will be stale. Hogwild! is provably efficient when updates are sparse (it was demonstrated on the Netflix matrix-completion problem), but it is rarely used for dense neural networks today because the noise it introduces hurts convergence at scale.
For large language models the parameter update is heavily standardized. The recipe used by Llama 1, 2, and 3 and DeepSeek V1 through V3 is essentially the same: AdamW with β₁ = 0.9, β₂ = 0.95, ε between 10⁻⁸ and 10⁻⁵ depending on the model, weight decay λ = 0.1, gradient norm clipped at 1.0, a cosine learning-rate schedule with linear warmup over a few thousand steps, and BF16 mixed precision with an FP32 master copy of the weights. Batch sizes range from 1M to 16M tokens, achieved through gradient accumulation across thousands of GPUs.
The lower β₂ = 0.95 (compared with Adam's default 0.999) gives the second-moment estimate a shorter effective memory of about 20 steps rather than 1,000, so the optimizer adapts faster as the gradient distribution shifts during pretraining. Weight decay 0.1 is large by historical standards; without decoupling, the same value applied as L2 regularization in plain Adam would give very different effective regularization for parameters with different gradient magnitudes.
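A sketch of this recipe's optimizer and schedule in PyTorch; the peak learning rate, step counts, and final-decay floor are illustrative, since those vary by model:

```python
import math
import torch
from torch import nn

model = nn.Linear(10, 1)   # stand-in for the actual network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,   # peak lr: illustrative
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)

warmup_steps, total_steps = 2_000, 100_000   # illustrative counts

def lr_lambda(step):
    # Linear warmup, then cosine decay to 10% of the peak learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once after each optimizer.step().
```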
Optimizer state often dominates GPU memory at this scale. A 70B-parameter model in FP32 takes 280 GB just for the weights, and an Adam-style optimizer adds another 560 GB for the two moment buffers. ZeRO stage 3 or FSDP is therefore not optional but required, splitting that footprint across the data-parallel group.
| Framework | Optimizer call | Notes |
|---|---|---|
| PyTorch | optimizer.step() then optimizer.zero_grad(set_to_none=True) | Stateful optimizers; in-place ops like param.add_(grad, alpha=-lr) are used internally |
| TensorFlow / Keras | optimizer.apply_gradients(zip(grads, vars)) | tf.GradientTape produces the gradients |
| JAX / Optax | updates, state = tx.update(grads, state, params); params = optax.apply_updates(params, updates) | Pure functional API; state is explicit |
| DeepSpeed | model_engine.step() | Combines optimizer step, gradient sync, and ZeRO bookkeeping |
| Hugging Face Accelerate | accelerator.backward(loss); optimizer.step() | Device-agnostic wrapper around PyTorch |
Under the hood every framework boils down to fused element-wise operations on contiguous parameter tensors. PyTorch's reference Adam implementation uses torch.add_, torch.mul_, and torch.addcmul_ to mutate parameter tensors in place; CUDA kernels fuse these operations to avoid extra memory traffic.
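For contrast with the stateful PyTorch pattern, here is a sketch of the pure-functional Optax flow from the table above, using a toy parameter pytree:

```python
import jax
import jax.numpy as jnp
import optax

params = {"w": jnp.ones((10, 1)), "b": jnp.zeros(1)}   # toy parameter pytree

def loss_fn(params, x, y):
    return jnp.mean((x @ params["w"] + params["b"] - y) ** 2)

tx = optax.adamw(learning_rate=1e-3, weight_decay=0.1)
opt_state = tx.init(params)

x, y = jnp.ones((4, 10)), jnp.zeros((4, 1))
grads = jax.grad(loss_fn)(params, x, y)
updates, opt_state = tx.update(grads, opt_state, params)  # pure: state is explicit
params = optax.apply_updates(params, updates)             # θ_new = θ_old + updates
```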
The parameter update rule does not exist in isolation. The learning rate controls how aggressively the update is applied, and modern LLM recipes use a warmup-then-cosine schedule rather than a fixed value. Regularization, particularly weight decay, is applied as part of the update rather than as a separate loss term, and the decoupled form used by AdamW changes the effective regularization strength compared with the equivalent L2 penalty added to the loss.
A model has lots of little knobs called parameters. After looking at some examples, the model figures out how it was wrong and which way each knob should turn to be a little less wrong next time. The parameter update is the moment we actually turn the knobs. Different optimizers turn them in different ways: some take big confident steps, some take small careful steps, and some remember which way they have been going so they keep moving in that direction.
Further reading: torch.optim and torch.distributed.fsdp, pytorch.org/docs.