Parameter update

See also: Machine learning terms

In machine learning, a parameter update is the rule that adjusts a model's trainable parameters θ after a gradient of the loss has been computed. It is the final stage of every training iteration: the forward pass produces predictions, the loss is computed, backpropagation (reverse-mode automatic differentiation) yields the gradient ∇L(θ), and the parameter update consumes that gradient to produce a new value of θ. Most modern training pipelines, including those used to train large language models, repeat this loop hundreds of thousands of times or more, often across thousands of GPUs.

The generic form of an update is:

θ_new = θ_old − update(g, state)

where g is the current gradient and state holds optional running statistics such as momentum buffers, second-moment estimates, or per-parameter learning rates. Optimizers differ mainly in how they compute that update term and in what state they keep to do so.

the training loop

A single training step in a typical deep learning framework follows the same five operations:

Sample a mini-batch of training examples.
Run a forward pass to compute the model's predictions and the scalar loss L(θ).
Run a backward pass that uses automatic differentiation to fill the gradient ∇L(θ).
Call the optimizer's update routine, which applies the update rule to every parameter in place.
Reset the gradient buffers to zero for the next step.

In PyTorch the last three operations are loss.backward(), optimizer.step(), and optimizer.zero_grad(). In TensorFlow and Keras the gradients come from a tf.GradientTape and the update call is optimizer.apply_gradients(zip(grads, vars)); no separate reset is needed because the tape returns fresh gradient tensors instead of accumulating into buffers. JAX and Optax instead expose optimizers as pure functions that take the gradient and the previous state and return an update vector and a new state; the user then applies that vector with optax.apply_updates(params, updates).
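
The five operations can be sketched without any framework. The example below is a minimal illustration, assuming a one-parameter linear model y = w·x with a squared-error loss and a hand-derived gradient standing in for automatic differentiation; the names train, data, and lr are illustrative and do not come from any library.

```python
import random

def train(data, lr=0.05, steps=200, batch_size=4, seed=0):
    """Fit y = w * x by mini-batch SGD and return the learned w."""
    rng = random.Random(seed)
    w = 0.0      # the single trainable parameter
    grad = 0.0   # gradient buffer, accumulated into like a framework's .grad
    for _ in range(steps):
        # 1. Sample a mini-batch.
        batch = rng.sample(data, batch_size)
        # 2. Forward pass: predictions and the scalar loss (kept for monitoring).
        preds = [w * x for x, _ in batch]
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / batch_size
        # 3. Backward pass: dL/dw, derived by hand for this toy model.
        grad += sum(2 * (p - y) * x for p, (x, y) in zip(preds, batch)) / batch_size
        # 4. Parameter update (plain SGD).
        w -= lr * grad
        # 5. Reset the gradient buffer for the next step.
        grad = 0.0
    return w

# Toy data generated from the true relationship y = 3x.
data = [(x, 3.0 * x) for x in [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]]
w_fit = train(data)
```

Swapping the line under step 4 for a different rule changes the optimizer without touching the rest of the loop.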

The distinction between computing the gradient and applying the update matters because both operations can be modified independently. Mixed-precision training, gradient accumulation, gradient clipping, and weight decay all live in this gap between the backward pass and the optimizer step.

vanilla stochastic gradient descent

The simplest update rule is plain gradient descent, which scales the gradient by a constant learning rate η and subtracts it from the current parameters:

θ ← θ − η ∇L(θ)

When the gradient is computed on a single example or a mini-batch instead of the full dataset, the same rule is called stochastic gradient descent or mini-batch SGD. Vanilla SGD has no internal state: the only thing the optimizer remembers between steps is the step size itself, which may be adjusted by a separate learning-rate schedule.

The lack of state is both a strength and a weakness. SGD uses no extra memory beyond the gradient, but it converges slowly on ill-conditioned problems and oscillates in narrow ravines of the loss surface. Almost every modern optimizer adds buffers to address one or both of these issues.
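
The ill-conditioning problem can be seen on a toy quadratic. The sketch below assumes a two-parameter loss L = ½(θ₀² + 100 θ₁²), whose curvatures differ by a factor of 100; the function names and the step size 0.019 are choices made for this illustration only.

```python
def sgd_step(theta, grad, lr):
    """theta <- theta - lr * grad, applied elementwise."""
    return [t - lr * g for t, g in zip(theta, grad)]

def quad_grad(theta, curvatures=(1.0, 100.0)):
    """Gradient of L = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvatures, theta)]

theta = [1.0, 1.0]
for _ in range(100):
    # The step size must stay below 2 / 100 or the steep coordinate diverges.
    theta = sgd_step(theta, quad_grad(theta), lr=0.019)
```

The steep coordinate is multiplied by 1 − 0.019·100 = −0.9 on every step, so it flips sign while shrinking, which is the oscillation described above. The shallow coordinate is multiplied by only 0.981 per step and has fallen to about 0.15 after 100 steps.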

momentum and its variants

Polyak (1964) introduced the heavy-ball method, also known as classical momentum, which augments SGD with a velocity buffer v that accumulates past gradients with exponential decay μ:

v ← μ v + ∇L(θ)

θ ← θ − η v

The physical analogy is a ball rolling down the loss surface: the velocity carries it through small bumps and damps oscillations across narrow valleys. Typical values of μ in deep learning are 0.9 or 0.99.
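
A minimal sketch of the two-line rule above, run on an assumed toy quadratic L = ½(θ₀² + 100 θ₁²); the helper names and the step size are illustrative.

```python
def momentum_step(theta, v, grad, lr, mu=0.9):
    """Heavy-ball update: v <- mu*v + grad, then theta <- theta - lr*v."""
    v = [mu * vi + gi for vi, gi in zip(v, grad)]
    theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta, v

def quad_grad(theta, curvatures=(1.0, 100.0)):
    """Gradient of L = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvatures, theta)]

theta, v = [1.0, 1.0], [0.0, 0.0]   # velocity buffer starts at zero
for _ in range(100):
    theta, v = momentum_step(theta, v, quad_grad(theta), lr=0.019)
```

On this problem both coordinates end up well below 0.05 in magnitude after 100 steps, whereas plain gradient descent with the same step size leaves the shallow coordinate near 0.15.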

Nesterov (1983) refined this with Nesterov accelerated gradient, which evaluates the gradient at the look-ahead point θ − η μ v, the position the accumulated velocity alone would carry the parameters to, rather than at θ itself:

v ← μ v + ∇L(θ − η μ v)

θ ← θ − η v

In theory, Nesterov momentum gives an O(1/T²) convergence rate on smooth convex problems, compared with O(1/T) for plain gradient descent. In practice the difference for non-convex deep networks is smaller, but PyTorch and most other frameworks expose it as a one-line option (nesterov=True).
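
The look-ahead evaluation can be sketched as follows. The example assumes the velocity is stored without the learning rate folded in, so the look-ahead point is θ − η μ v, and it uses an assumed toy quadratic for the loss. Framework implementations usually rearrange the computation so that no second parameter copy is needed.

```python
def nesterov_step(theta, v, grad_fn, lr, mu=0.9):
    """Nesterov momentum with the gradient taken at the look-ahead point."""
    # Where the existing velocity alone would carry the parameters.
    lookahead = [t - lr * mu * vi for t, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v = [mu * vi + gi for vi, gi in zip(v, g)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v

def quad_grad(theta, curvatures=(1.0, 100.0)):
    """Gradient of L = 0.5 * sum(c_i * theta_i**2)."""
    return [c * t for c, t in zip(curvatures, theta)]

theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(100):
    theta, v = nesterov_step(theta, v, quad_grad, lr=0.01)
```

Unlike the classical form, the step function takes a gradient function instead of a precomputed gradient, because the gradient must be evaluated at a point that depends on the current velocity.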

adaptive update rules

A second family of optimizers gives every parameter its own effective learning rate based on the history of its gradients.

Duchi, Hazan, and Singer (2011) proposed AdaGrad, which divides the gradient by the square root of the sum of all past squared gradients. Parameters that receive large gradients early on get progressively smaller updates, which works well for sparse features. On dense problems, however, the ever-growing sum can shrink the effective learning rate so far that learning stalls.
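
The shrinking step size is easy to reproduce. The sketch below assumes a constant gradient of 1.0 and places a small ε outside the square root for numerical safety; both are choices made for this illustration.

```python
import math

def adagrad_step(theta, acc, grad, lr=0.1, eps=1e-8):
    """AdaGrad: accumulate squared gradients, divide the step by their root."""
    acc = [a + g * g for a, g in zip(acc, grad)]
    theta = [t - lr * g / (math.sqrt(a) + eps)
             for t, a, g in zip(theta, acc, grad)]
    return theta, acc

theta, acc = [0.0], [0.0]
step_sizes = []
for _ in range(4):
    new_theta, acc = adagrad_step(theta, acc, grad=[1.0])
    step_sizes.append(theta[0] - new_theta[0])  # distance moved this step
    theta = new_theta
```

With a constant gradient the step at iteration t is lr / √t, so the fourth step is half the size of the first and the sequence keeps shrinking for as long as training continues.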

Hinton (2012) addressed that decay-to-zero behavior in his Coursera lectures with RMSProp, which replaces the running sum with an exponential moving average of squared gradients controlled by a decay parameter ρ (often 0.9):

E[g²] ← ρ E[g²] + (1 − ρ) g²

θ ← θ − η g / √(E[g²] + ε)
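
A minimal sketch of these two lines, again assuming a constant gradient of 1.0 so the behavior can be compared directly with AdaGrad; the function name and default values are illustrative.

```python
import math

def rmsprop_step(theta, avg_sq, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of g**2 scales the step."""
    avg_sq = [rho * a + (1 - rho) * g * g for a, g in zip(avg_sq, grad)]
    theta = [t - lr * g / math.sqrt(a + eps)
             for t, a, g in zip(theta, avg_sq, grad)]
    return theta, avg_sq

theta, avg_sq = [0.0], [0.0]
for _ in range(100):
    prev = theta[0]
    theta, avg_sq = rmsprop_step(theta, avg_sq, grad=[1.0])
last_step = prev - theta[0]   # distance moved on the final iteration
```

Because the moving average converges to the squared gradient instead of growing without bound, the step size settles near lr rather than decaying toward zero.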

Kingma and Ba (2015) combined RMSProp's adaptive scaling with momentum to produce Adam, which has dominated deep learning since its publication. The full update at step t is:

m ← β₁ m + (1 − β₁) g

v ← β₂ v + (1 − β₂) g²

m̂ = m / (1 − β₁ᵗ)

v̂ = v / (1 − β₂ᵗ)

θ ← θ − η m̂ / (√v̂ + ε)

The two correction factors (1 − β₁ᵗ) and (1 − β₂ᵗ) compensate for the fact that m and v start at zero and would otherwise be biased toward zero in the first few steps. Common defaults are β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
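
The five lines translate almost directly into code. The sketch below is an illustration on plain Python lists, not a framework implementation; the gradient values are chosen to show the effect of bias correction on the first step.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias-corrected first moment
    v_hat = [vi / (1 - beta2 ** t) for vi in v]   # bias-corrected second moment
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Two parameters whose gradients differ by six orders of magnitude.
theta, m, v = adam_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                        grad=[1e-3, 1e3], t=1)
```

On the first step m̂ / √v̂ equals g / |g|, so both parameters move by almost exactly lr despite the huge difference in gradient scale. Without the correction factors the same step would be scaled by (1 − β₁) / √(1 − β₂), about 3.2 with the default values.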

Loshchilov and Hutter (2019) showed that Adam's standard L2 regularization interacts badly with the per-parameter scaling: large-gradient weights effectively receive less regularization than small-gradient weights. They proposed AdamW, which decouples weight decay from the gradient and applies it directly to the parameters:

θ ← θ − η m̂ / (√v̂ + ε) − η λ θ

This small change improves generalization enough that AdamW has displaced Adam in most modern training recipes, including those of BERT, GPT, Llama, and DeepSeek.
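
The difference between the two forms of regularization shows up even on a single scalar parameter. The sketch below assumes a zero loss gradient so that only the regularizer acts; the helper names adam_direction, adamw_step, and adam_l2_step are hypothetical.

```python
import math

def adam_direction(m, v, grad, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, m, v, grad, t, lr, wd):
    """Decoupled decay: the term lr * wd * theta bypasses the adaptive scaling."""
    direction, m, v = adam_direction(m, v, grad, t)
    return theta - lr * direction - lr * wd * theta, m, v

def adam_l2_step(theta, m, v, grad, t, lr, wd):
    """L2 penalty: wd * theta is added to the gradient and then rescaled."""
    direction, m, v = adam_direction(m, v, grad + wd * theta, t)
    return theta - lr * direction, m, v

theta_w, _, _ = adamw_step(2.0, 0.0, 0.0, grad=0.0, t=1, lr=0.1, wd=0.1)
theta_l2, _, _ = adam_l2_step(2.0, 0.0, 0.0, grad=0.0, t=1, lr=0.1, wd=0.1)
```

AdamW shrinks the weight by exactly η λ θ = 0.02, giving 1.98. The L2 version passes the penalty through the adaptive denominator, which normalizes it to a step of about η = 0.1 and gives roughly 1.9, so the value of λ no longer sets the strength of the decay on its own.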

The table below summarizes the most common rules.

Optimizer	Year	State per parameter	Update rule (sketch)	Notes
SGD	1951	none	θ ← θ − η g	Robbins and Monro
Momentum	1964	velocity v	θ ← θ − η (μ v + g)	Polyak heavy-ball
Nesterov	1983	velocity v	gradient at θ − η μ v	Accelerated method
AdaGrad	2011	sum of g²	η / √Σg² scaling	Duchi et al.
RMSProp	2012	EMA of g²	η / √EMA(g²) scaling	Hinton lecture notes
Adam	2014	m, v	bias-corrected first and second moments	Kingma and Ba
AdamW	2017	m, v	Adam + decoupled weight decay	Loshchilov and Hutter
LAMB	2019	m, v	layer-wise normalized AdamW	Trains BERT in 76 minutes
Lion	2023	momentum only	sign(β₁ m + (1 − β₁) g)	Discovered by symbolic search
Sophia	2023	m, diagonal Hessian	clipped m / max(diag-H, ε)	Light-weight second-order

More recent work continues to push the design space. Chen et al. (2023) discovered Lion ("EvoLved Sign Momentum") through automated symbolic search; it tracks only momentum, applies the sign function elementwise, and uses roughly half the optimizer memory of Adam. The authors report up to a 5× reduction in pre-training compute for Vision Transformers on JFT and results that match or exceed Adam on diffusion models. Liu et al. (2023) introduced Sophia, which estimates the diagonal of the Hessian every few steps and clips the resulting Newton-style update elementwise; on GPT models from 125M to 1.5B parameters, they report that Sophia reaches the same perplexity as Adam in roughly half the steps.
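
A sketch of the Lion rule on plain Python lists. The interpolation inside the sign function matches the table above; the second coefficient β₂ used to refresh the momentum, and the decoupled weight-decay term, follow the published algorithm as summarized here and should be treated as assumptions of this illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: the step direction is a sign, so |step| is lr everywhere."""
    new_theta, new_m = [], []
    for p, mi, g in zip(theta, m, grad):
        c = beta1 * mi + (1 - beta1) * g              # interpolated direction
        new_theta.append(p - lr * (sign(c) + wd * p))  # sign step plus decay
        new_m.append(beta2 * mi + (1 - beta2) * g)     # momentum refresh
    return new_theta, new_m

theta, m = lion_step([1.0, 1.0, 1.0], [0.0, 0.0, 0.0],
                     grad=[1e-6, -50.0, 0.0], lr=0.01)
```

Every parameter with a nonzero direction moves by exactly lr regardless of gradient magnitude, which is why Lion is typically run with a smaller learning rate than Adam. The only buffer carried between steps is m.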

modifications to the update

Several techniques sit between the backward pass and the optimizer step and modify either the gradient or the parameter delta before the update is applied.

Technique	What it modifies	Why it is used
Gradient clipping (norm)	rescales g if ‖g‖ > c	Prevents exploding gradients in RNNs and transformers; LLM training typically uses c = 1.0
Gradient clipping (value)	clips each gᵢ to [−c, c]	Cheaper but less common than norm clipping
Gradient accumulation	sums g across micro-batches before stepping	Simulates a large effective batch on limited GPU memory
Mixed precision	computes gradients in BF16 or FP16 (FP16 usually needs a loss scale to avoid underflow)	Roughly halves activation and gradient memory and can substantially raise throughput on modern GPUs
Decoupled weight decay	θ ← θ (1 − η λ) before the gradient step	Restores the original meaning of weight decay under adaptive optimizers (AdamW)
Exponential moving average of weights	maintains a separate θ_EMA = ρ θ_EMA + (1 − ρ) θ	Used as the inference checkpoint, especially for diffusion models and segmentation
LARS	per-layer LR ∝ ‖θ_layer‖ / ‖g_layer‖	Enables large-batch SGD for ResNet (You et al. 2017)
LAMB	per-layer LR scaling around AdamW	Used to train BERT in 76 minutes with batch size 32,768 (You et al. 2019)
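
The gradient accumulation row can be checked numerically. The sketch below assumes a one-parameter model y = w·x with a mean squared-error loss and a batch that divides evenly into micro-batches; it scales each micro-batch gradient so the accumulated result is a mean and not a raw sum.

```python
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)**2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients one at a time, as on a small GPU."""
    n_micro = len(batch) // micro_batch_size
    total = 0.0
    for i in range(n_micro):
        micro = batch[i * micro_batch_size:(i + 1) * micro_batch_size]
        total += grad_mse(w, micro) / n_micro   # divide so the sum is a mean
    return total

batch = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
full = grad_mse(0.5, batch)
accum = accumulated_grad(0.5, batch, micro_batch_size=2)
```

The accumulated value equals the full-batch gradient, so a single optimizer step after the last micro-batch behaves as if the whole batch had fit in memory at once.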

Gradient clipping deserves special attention because it is one of the few techniques that almost every modern training recipe uses unchanged. Pascanu, Mikolov, and Bengio (2013) showed that the loss surface of recurrent networks contains "cliffs" where the gradient magnitude can grow by orders of magnitude in a single step, and that simply rescaling the gradient when its norm exceeds a fixed threshold is enough to make training stable. The same trick is now standard for transformers, with PyTorch exposing it as torch.nn.utils.clip_grad_norm_(params, max_norm=1.0).
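
Norm clipping amounts to a single rescaling. The sketch below operates on a flat list of gradient values; the small constant in the denominator is an assumption of this illustration, added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Scale factor is 1.0 when the norm is already within the threshold.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
```

The first gradient has norm 5 and is scaled down to norm 1 with its direction preserved. The second is already inside the threshold and passes through unchanged.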

distributed parameter updates

When a model is trained across multiple GPUs or nodes, the parameter update has to be coordinated. The dominant strategies are summarized below.

Strategy	What each device holds	Communication
Data parallel (DDP)	full parameters, gradients, and optimizer state on every GPU	All-reduce of gradients per step
ZeRO stage 1	parameters and gradients replicated; optimizer state sharded	Reduce-scatter of gradients, all-gather of updated weights
ZeRO stage 2	parameters replicated; gradients and optimizer state sharded	Same as stage 1 with smaller buffers
ZeRO stage 3 / FSDP	parameters, gradients, and optimizer state all sharded	All-gather parameters before each layer's compute, reduce-scatter gradients
Pipeline parallel	layers split across devices	Activations and gradients passed between stages; micro-batches stagger updates
Tensor parallel	individual matrices split across devices	All-reduce inside each layer

Rajbhandari et al. (2020) introduced ZeRO (Zero Redundancy Optimizer) in DeepSpeed. Their key observation was that under standard data parallelism the optimizer state, which for Adam is roughly 8 bytes per parameter for the moments plus 4 bytes for an FP32 master copy, dominates GPU memory at scale. ZeRO partitions that state across the data-parallel workers so each rank holds only 1/N of it, and the same idea extends to gradients (stage 2) and parameters themselves (stage 3). PyTorch's Fully Sharded Data Parallel (FSDP) is the upstream equivalent: it all-gathers parameter shards just before each forward and backward sub-graph runs, then reshards immediately after, and applies the optimizer step to its local shard only.
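
The memory effect of each stage can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and the 12 bytes of optimizer state described above; it ignores activations, buffers, and fragmentation, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(n_params, n_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stage 0 (plain DDP) to 3."""
    weights = 2 * n_params     # FP16 working copy of the parameters
    grads = 2 * n_params       # FP16 gradients
    opt_state = 12 * n_params  # FP32 master weights plus two Adam moments
    if stage >= 1:
        opt_state /= n_gpus    # stage 1 shards the optimizer state
    if stage >= 2:
        grads /= n_gpus        # stage 2 also shards the gradients
    if stage >= 3:
        weights /= n_gpus      # stage 3 also shards the parameters
    return weights + grads + opt_state

# Per-GPU footprint in GB for a 7.5B-parameter model on 64 GPUs.
footprint_gb = {s: zero_bytes_per_gpu(7.5e9, 64, s) / 1e9 for s in range(4)}
```

For this configuration the estimate falls from 120 GB with plain data parallelism to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.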

Synchronization style is a separate axis. Synchronous training, used by almost all production LLM runs, makes every worker wait for the all-reduce to finish before stepping, so every replica applies the same update; given a fixed seed, a fixed topology, and deterministic kernels, a run is reproducible. Asynchronous training, pioneered by Niu et al.'s 2011 Hogwild! scheme, lets each worker write to shared parameters without locks and accepts that some gradients will be stale. Hogwild! is provably efficient when updates are sparse (it was demonstrated on the Netflix matrix completion problem), but is rarely used for dense neural networks today because the noise it introduces hurts convergence at scale.
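
Synchronous data parallelism can be simulated in a single process. In the sketch below, all_reduce_mean is a stand-in for a real collective operation, each "worker" is just a list entry, and the model is an assumed one-parameter fit of y = w·x; none of the names come from a distributed library.

```python
def local_grad(w, shard):
    """Gradient of the mean squared error on one worker's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_step(replicas, shards, lr):
    """Each worker computes a local gradient, then all apply the averaged one."""
    grads = [local_grad(w, s) for w, s in zip(replicas, shards)]
    synced = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(replicas, synced)]

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # y = 2x
replicas = [0.0, 0.0]   # identical initial weights on both workers
for _ in range(50):
    replicas = sync_step(replicas, shards, lr=0.05)
```

Because every worker applies the same averaged gradient to the same starting value, the replicas never drift apart, which is the property that asynchronous schemes give up.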

modern LLM training in practice

For large language models the parameter update is heavily standardized. The recipes reported for Llama 1, 2, and 3 and for the DeepSeek models are broadly similar: AdamW with β₁ = 0.9, β₂ = 0.95, a small ε (10⁻⁸ in several recipes, larger in others), weight decay λ = 0.1, gradient norm clipped at 1.0, a learning-rate schedule with linear warmup over a few thousand steps followed by cosine or step-wise decay, and BF16 mixed precision with an FP32 master copy of the weights. Batch sizes range from roughly 1M to 16M tokens, reached by combining data parallelism across thousands of GPUs with gradient accumulation.
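
A warmup followed by cosine decay can be written as a pure function of the step count. The sketch below is one common formulation; the peak rate, warmup length, total steps, and floor are hypothetical values chosen for illustration and are not taken from any particular model's recipe.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)   # hold at min_lr after total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Hypothetical settings: 2,000 warmup steps out of 100,000, decaying to 10% of peak.
schedule = [warmup_cosine_lr(s, 3e-4, 2000, 100_000, min_lr=3e-5)
            for s in (0, 1999, 2000, 100_000)]
```

The value returned for the current step replaces η in the AdamW update, so the schedule and the update rule stay independent of each other.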

The lower β₂ = 0.95 (compared with Adam's default 0.999) gives the second-moment estimate a shorter effective memory of about 20 steps rather than 1,000, so the optimizer adapts faster as the gradient distribution shifts during pretraining. Weight decay 0.1 is large by historical standards; without decoupling, the same value applied as L2 regularization in plain Adam would give very different effective regularization for parameters with different gradient magnitudes.
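
The figures of 20 and 1,000 steps come from the usual rule of thumb that an exponential moving average with coefficient β has an effective window of 1 / (1 − β). A short check, with helper names chosen for this illustration:

```python
def ema_horizon(beta):
    """Rule-of-thumb averaging window of an EMA with decay beta."""
    return 1.0 / (1.0 - beta)

def ema_weight_within(beta, n_steps):
    """Fraction of the total EMA weight carried by the latest n_steps values."""
    return 1.0 - beta ** n_steps

horizon_llm = ema_horizon(0.95)       # beta2 used in LLM recipes
horizon_default = ema_horizon(0.999)  # Adam's default beta2
share = ema_weight_within(0.95, 20)   # weight inside one horizon
```

The window is a convention and not a hard cutoff: the most recent 20 gradients carry a little under two thirds of the total weight when β₂ = 0.95, with the remainder spread over older steps.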

Model and optimizer state dominate GPU memory at this scale. A 70B-parameter model in FP32 takes 280 GB just for the weights (70 × 10⁹ parameters × 4 bytes), and an Adam-style optimizer adds another 560 GB for the two FP32 moment buffers. That is far more than a single accelerator can hold, so some form of sharding such as ZeRO stage 3 or FSDP is effectively required to split the footprint across the data-parallel group.

framework implementations

Framework	Optimizer call	Notes
PyTorch	`optimizer.step()` then `optimizer.zero_grad(set_to_none=True)`	Stateful optimizers; in-place ops like `param.add_(grad, alpha=-lr)` are used internally
TensorFlow / Keras	`optimizer.apply_gradients(zip(grads, vars))`	`tf.GradientTape` produces the gradients
JAX / Optax	`updates, state = tx.update(grads, state, params); params = optax.apply_updates(params, updates)`	Pure functional API; state is explicit
DeepSpeed	`model_engine.step()`	Combines optimizer step, gradient sync, and ZeRO bookkeeping
Hugging Face Accelerate	`accelerator.backward(loss); optimizer.step()`	Device-agnostic wrapper around PyTorch

Under the hood, each framework's optimizer step reduces to element-wise operations on the parameter tensors. PyTorch's reference Adam implementation uses in-place tensor methods such as mul_, add_, addcmul_, and addcdiv_ to mutate the state and parameter tensors; fused CUDA kernels combine these operations to avoid extra memory traffic.

relationship to regularization and learning-rate scheduling

The parameter update rule does not exist in isolation. The learning rate controls how aggressively the update is applied, and modern LLM recipes use a warmup-then-cosine schedule rather than a fixed value. Regularization, particularly weight decay, is applied as part of the update rather than as a separate loss term, and the decoupled form used by AdamW changes the effective regularization strength compared with the equivalent L2 penalty added to the loss.

explain like I'm 5 (ELI5)

A model has lots of little knobs called parameters. After looking at some examples, the model figures out how it was wrong and which way each knob should turn to be a little less wrong next time. The parameter update is the moment we actually turn the knobs. Different optimizers turn them in different ways: some take big confident steps, some take small careful steps, and some remember which way they have been going so they keep moving in that direction.

references

Robbins, H., and Monro, S. (1951). "A Stochastic Approximation Method." Annals of Mathematical Statistics, 22(3), 400-407.
Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." USSR Computational Mathematics and Mathematical Physics, 4(5), 1-17.
Nesterov, Y. (1983). "A method for solving the convex programming problem with convergence rate O(1/k²)." Soviet Mathematics Doklady, 27, 372-376.
Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." Journal of Machine Learning Research, 12, 2121-2159.
Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent." NeurIPS 2011. arXiv:1106.5730.
Hinton, G. (2012). "Lecture 6e: rmsprop: Divide the gradient by a running average of its recent magnitude." Coursera: Neural Networks for Machine Learning.
Pascanu, R., Mikolov, T., and Bengio, Y. (2013). "On the difficulty of training Recurrent Neural Networks." ICML 2013. arXiv:1211.5063.
Kingma, D. P., and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. arXiv:1412.6980.
You, Y., Gitman, I., and Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." arXiv:1708.03888.
Loshchilov, I., and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. arXiv:1711.05101.
You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." ICLR 2020. arXiv:1904.00962.
Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20.
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. V. (2023). "Symbolic Discovery of Optimization Algorithms." arXiv:2302.06675.
Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. (2023). "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." arXiv:2305.14342.
Zhao, Y. et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." arXiv:2304.11277.
PyTorch documentation, torch.optim and torch.distributed.fsdp. pytorch.org/docs.
DeepSpeed ZeRO tutorial. deepspeed.ai/tutorials/zero.
Optax documentation. optax.readthedocs.io.

Related Articles

AdaGrad

Gradient clipping

Momentum

Step

Adam optimizer

Staged training
