Parameter update
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v4 · 3,381 words
See also: Machine learning terms
A parameter update is the step in neural-network training where a model's trainable weights are adjusted using the gradient of the loss, following the rule w_new = w_old - learning_rate * gradient. It is the final stage of every training iteration and the core of the optimizer's job: the forward pass produces predictions, the loss is computed, backpropagation and automatic differentiation yield the gradient, and the parameter update consumes that gradient to produce a new value of the weights. In the language of gradient descent, the update moves each parameter a small distance in the direction that locally decreases the loss, with the learning rate setting how far. Modern training pipelines, including those used to train large language models, repeat this loop billions of times across thousands of GPUs.
Writing the trainable parameters as the vector theta, the generic form of an update is:
theta_new = theta_old - update(g, state)
where g is the current gradient of the loss and state holds optional running statistics such as momentum buffers, second-moment estimates, or per-parameter learning rates. Optimizers differ only in how they compute the update term; subtracting that term from the current weights is common to all of them.
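A single training step in a typical deep learning framework follows the same five operations: a forward pass that produces predictions, computation of the loss, a backward pass that computes the gradients, the optimizer step that applies the parameter update, and clearing the gradients before the next iteration.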
In PyTorch the last three operations are loss.backward(), optimizer.step(), and optimizer.zero_grad(). The PyTorch documentation describes the division of labor plainly: loss.backward() runs reverse-mode automatic differentiation to compute the gradients, and optimizer.step() then "takes the gradients stored in param.grad and applies them to the model parameters according to the chosen optimization algorithm" [16]. Because PyTorch accumulates gradients by default, optimizer.zero_grad() must clear them between steps or successive backward passes will add together [16]. In TensorFlow and Keras the equivalent update call is optimizer.apply_gradients(zip(grads, vars)). JAX and Optax instead expose optimizers as pure functions that take the gradient and the previous state and return a new state and an update vector; the user then applies that vector with optax.apply_updates(params, updates) [18].
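The five operations can be sketched without any framework. The example below is a minimal illustration in plain Python, assuming a one-parameter model y = w * x, a mean-squared-error loss, and a hand-derived gradient in place of automatic differentiation; the names xs, ys, w, grad, and lr are illustrative and are not part of any library API.

```python
# Fit y = w * x on toy data with plain gradient descent.
# All names here (xs, ys, w, grad, lr) are illustrative, not framework API.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by the true weight w = 2

w = 0.0      # trainable parameter
grad = 0.0   # gradient buffer, playing the role of param.grad
lr = 0.05    # learning rate

for step in range(200):
    # 1. Forward pass: compute predictions.
    preds = [w * x for x in xs]
    # 2. Loss: mean squared error over the batch.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: d(loss)/dw, accumulated into the buffer.
    grad += sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Parameter update: w_new = w_old - lr * gradient.
    w -= lr * grad
    # 5. Clear the buffer so the next backward pass starts from zero.
    grad = 0.0
```

Because the buffer is accumulated with += in the backward pass, skipping the fifth operation would add each new gradient to the previous ones, which is the behavior the zero_grad call exists to prevent.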
The distinction between computing the gradient and applying the update matters because both operations can be modified independently. Mixed-precision training, gradient accumulation, gradient clipping, and weight decay all live in this gap between the backward pass and the optimizer step.
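Gradient accumulation is a simple example of work done in that gap. The sketch below assumes equal-sized micro-batches and a mean-reduced loss, and uses a hypothetical helper mse_grad for the toy model y = w * x; under those assumptions the accumulated gradient matches the gradient of one large batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# One large batch of four examples.
full_grad = mse_grad(w, xs, ys)

# Two micro-batches of two examples each, accumulated before stepping.
accum_steps = 2
accumulated = 0.0
for i in range(accum_steps):
    micro_x = xs[2 * i: 2 * i + 2]
    micro_y = ys[2 * i: 2 * i + 2]
    # Dividing by accum_steps turns the sum of micro-batch means
    # into the mean over the full batch.
    accumulated += mse_grad(w, micro_x, micro_y) / accum_steps

# The optimizer step would run once here, using `accumulated`.
```

Only the number of backward passes per optimizer step changes; the update rule itself is untouched.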
The simplest update rule is plain gradient descent, which scales the gradient by a constant learning rate eta and subtracts it from the current parameters:
theta <- theta - eta * g
This is the literal w := w - learning_rate * gradient that defines the parameter update. When the gradient is computed on a single example or a mini-batch instead of the full dataset, the same rule is called stochastic gradient descent or mini-batch SGD. The idea traces to Robbins and Monro's 1951 stochastic approximation method [1]. Vanilla SGD has no internal state: the only thing the optimizer remembers between steps is the step size itself, which may be adjusted by a separate learning-rate schedule.
Update frequency is a separate choice from the rule itself. Batch (full-batch) gradient descent computes one update per pass over the entire dataset; stochastic gradient descent updates after every single example; and mini-batch gradient descent, the dominant choice in deep learning, updates once per mini-batch (commonly tens to thousands of examples) so the gradient is a noisy but cheap estimate of the full-dataset gradient. More frequent updates mean noisier steps but far more steps per pass over the data.
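The trade-off is easy to quantify. The following sketch counts updates per pass over the data for a hypothetical dataset of 50,000 examples, assuming the final partial batch is kept rather than dropped.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the data."""
    return math.ceil(num_examples / batch_size)

n = 50_000  # hypothetical dataset size

full_batch = updates_per_epoch(n, n)     # 1 update per pass
mini_batch = updates_per_epoch(n, 128)   # 391 updates per pass
stochastic = updates_per_epoch(n, 1)     # 50,000 updates per pass
```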
The lack of state in vanilla SGD is both a strength and a weakness. SGD uses no extra memory beyond the gradient, but it converges slowly on ill-conditioned problems and oscillates in narrow ravines of the loss surface. Almost every modern optimizer adds buffers to address one or both of these issues.
Polyak (1964) introduced the heavy-ball method, also known as classical momentum, which augments SGD with a velocity buffer v that accumulates past gradients with exponential decay mu [2]:
v <- mu * v + g
theta <- theta - eta * v
The physical analogy is a ball rolling down the loss surface: the velocity carries it through small bumps and damps oscillations across narrow valleys. Typical values of mu in deep learning are 0.9 or 0.99.
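A minimal sketch of the heavy-ball rule for a single scalar parameter follows. The function name momentum_step and the constant gradient are illustrative assumptions; the point is that under a steady gradient the velocity builds up toward g / (1 - mu), so the effective step is up to 1 / (1 - mu) times larger than plain SGD's.

```python
def momentum_step(theta, v, g, lr=0.1, mu=0.9):
    """One heavy-ball update: v <- mu * v + g, then theta <- theta - lr * v."""
    v = mu * v + g
    theta = theta - lr * v
    return theta, v

# First step from rest behaves like plain SGD.
first_theta, first_v = momentum_step(0.0, 0.0, g=1.0)

# Under a constant gradient the velocity approaches g / (1 - mu) = 10.
theta, v = 0.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, g=1.0)
```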
Nesterov (1983) refined this with Nesterov accelerated gradient, which evaluates the gradient at the look-ahead point theta - eta * mu * v, the position the momentum part of the step alone would reach, rather than at theta itself [3]:
v <- mu * v + grad at (theta - eta * mu * v)
theta <- theta - eta * v
In theory, Nesterov momentum gives an O(1/T^2) convergence rate on smooth convex problems, compared with O(1/T) for plain gradient descent [3]. In practice the difference for non-convex deep networks is smaller, but PyTorch and most other frameworks expose it as a one-line option (nesterov=True).
A second family of optimizers gives every parameter its own effective learning rate based on the history of its gradients.
Duchi, Hazan, and Singer (2011) proposed AdaGrad, which divides the gradient by the square root of the sum of all past squared gradients [4]. Parameters that receive large gradients early on get progressively smaller updates, which works well for sparse features but eventually freezes learning altogether on dense problems.
Hinton (2012) addressed that decay-to-zero behavior in his Coursera lectures with RMSProp, which replaces the running sum with an exponential moving average of squared gradients controlled by a decay parameter rho (often 0.9) [7]:
E[g^2] <- rho * E[g^2] + (1 - rho) * g^2
theta <- theta - eta * g / sqrt(E[g^2] + epsilon)
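The difference between the two accumulators shows up clearly under a constant gradient. The sketch below is illustrative only: the helper names are hypothetical, and placing epsilon inside the square root for AdaGrad is an assumption made to mirror the RMSProp formula above.

```python
import math

def adagrad_step_size(g, steps, lr=0.1, eps=1e-8):
    """Magnitude of AdaGrad's update after `steps` identical gradients."""
    total = 0.0
    for _ in range(steps):
        total += g * g                       # running sum never decays
    return lr * abs(g) / math.sqrt(total + eps)

def rmsprop_step_size(g, steps, lr=0.1, rho=0.9, eps=1e-8):
    """Magnitude of RMSProp's update after `steps` identical gradients."""
    avg = 0.0
    for _ in range(steps):
        avg = rho * avg + (1 - rho) * g * g  # exponential moving average
    return lr * abs(g) / math.sqrt(avg + eps)

adagrad_100 = adagrad_step_size(1.0, 100)        # about lr / 10
adagrad_10000 = adagrad_step_size(1.0, 10_000)   # about lr / 100
rmsprop_100 = rmsprop_step_size(1.0, 100)        # stays close to lr
```

AdaGrad's step shrinks like 1 / sqrt(t) because its sum only grows, while RMSProp's moving average settles near g^2 and keeps the step near the nominal learning rate.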
Kingma and Ba (2015) combined RMSProp's adaptive scaling with momentum to produce Adam, which has dominated deep learning since its publication. The paper introduces it as "an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments" [8]. The full update at step t is:
m <- beta1 * m + (1 - beta1) * g
v <- beta2 * v + (1 - beta2) * g^2
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)
theta <- theta - eta * m_hat / (sqrt(v_hat) + epsilon)
The two correction factors (1 - beta1^t) and (1 - beta2^t) compensate for the fact that m and v start at zero and would otherwise be biased toward zero in the first few steps [8]. Common defaults are beta1 = 0.9, beta2 = 0.999, epsilon = 10^-8 [8].
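The update can be written as a short function for one scalar parameter. This is a sketch under the defaults quoted above, not a reproduction of any library's implementation; the function name adam_step and the learning rate of 10^-3 are illustrative choices.

```python
import math

def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step with gradients six orders of magnitude apart.
small, _, _ = adam_step(0.0, 0.0, 0.0, g=1e-3, t=1)
large, _, _ = adam_step(0.0, 0.0, 0.0, g=1e+3, t=1)
```

At t = 1 the bias correction gives m_hat = g and v_hat = g^2, so both parameters move by almost exactly lr despite the difference in gradient scale. Without the correction, the first step would instead be scaled by (1 - beta1) / sqrt(1 - beta2), which is about 3.2 with the defaults.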
Loshchilov and Hutter (2019) showed that Adam's standard L2 regularization interacts badly with the per-parameter scaling: large-gradient weights effectively receive less regularization than small-gradient weights [10]. They proposed AdamW, which decouples weight decay from the gradient and applies it directly to the parameters:
theta <- theta - eta * m_hat / (sqrt(v_hat) + epsilon) - eta * lambda * theta
This small change improves generalization enough that AdamW has displaced Adam in most modern training recipes, including those for BERT, GPT, Llama, and DeepSeek.
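The difference between the two forms of regularization can be isolated by setting the loss gradient to zero, so that only the regularizer moves the weight. The sketch below uses hypothetical helper names (adam_direction, adam_l2_step, adamw_step) and illustrative values lr = 0.01 and lambda = 0.1.

```python
import math

def adam_direction(m, v, g, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return Adam's bias-corrected direction and the updated moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v

def adam_l2_step(theta, m, v, g, t, lr, lam):
    # L2 regularization: the penalty gradient lam * theta is folded into g,
    # so it is rescaled by the adaptive denominator like any other gradient.
    d, m, v = adam_direction(m, v, g + lam * theta, t)
    return theta - lr * d, m, v

def adamw_step(theta, m, v, g, t, lr, lam):
    # Decoupled weight decay: the moments see only the loss gradient,
    # and the shrinkage lr * lam * theta is applied directly to the weight.
    d, m, v = adam_direction(m, v, g, t)
    return theta - lr * d - lr * lam * theta, m, v

# With a zero loss gradient, only the regularizer moves the weight.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, g=0.0, t=1, lr=0.01, lam=0.1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, g=0.0, t=1, lr=0.01, lam=0.1)
```

Under L2 the weight moves by about lr (to roughly 0.99) whatever the value of lambda, because the adaptive scaling normalizes the penalty away. Under AdamW it shrinks by exactly lr * lambda * theta (to 0.999), which is the behavior the name weight decay originally described.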
The table below summarizes the most common rules.
| Optimizer | Year | State per parameter | Update rule (sketch) | Notes |
|---|---|---|---|---|
| SGD | 1951 | none | theta <- theta - eta g | Robbins and Monro |
| Momentum | 1964 | velocity v | theta <- theta - eta (mu v + g) | Polyak heavy-ball |
| Nesterov | 1983 | velocity v | gradient at theta - eta mu v | Accelerated method |
| AdaGrad | 2011 | sum of g^2 | eta / sqrt(sum g^2) scaling | Duchi et al. |
| RMSProp | 2012 | EMA of g^2 | eta / sqrt(EMA(g^2)) scaling | Hinton lecture notes |
| Adam | 2014 | m, v | bias-corrected first and second moments | Kingma and Ba |
| AdamW | 2017 | m, v | Adam + decoupled weight decay | Loshchilov and Hutter |
| LAMB | 2019 | m, v | layer-wise normalized AdamW | Trains BERT in 76 minutes |
| Lion | 2023 | momentum only | sign(beta1 m + (1 - beta1) g) | Discovered by symbolic search |
| Sophia | 2023 | m, diagonal Hessian | clipped m / max(diag-H, epsilon) | Light-weight second-order |
More recent work continues to push the design space. Chen et al. (2023) discovered Lion ("EvoLved Sign Momentum") through automated symbolic search; it tracks only momentum and applies the sign function elementwise, so it uses roughly half the optimizer memory of Adam [13]. The paper reports that Lion "saves up to 5x the pre-training compute on JFT" and "boosts the accuracy of ViT by up to 2% on ImageNet", reaching 88.3% zero-shot and 91.1% fine-tuning ImageNet accuracy and cutting diffusion-model training compute by up to 2.3x while achieving better FID scores [13]. Liu et al. (2023) introduced Sophia, which estimates the diagonal of the Hessian every few steps and clips the resulting Newton-style update; on GPT models from 125M to 1.5B parameters, Sophia "achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time", reaching the same perplexity in roughly half the steps [14].
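The sign-based update is short enough to sketch for a single scalar parameter. The direction term follows the summary in the table above; the separate decay beta2 for the stored momentum and the default values shown are assumptions based on the published algorithm, and weight decay is omitted for brevity.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99):
    """One Lion-style update for a scalar parameter (weight decay omitted)."""
    c = beta1 * m + (1 - beta1) * g   # interpolation used only for the direction
    theta = theta - lr * sign(c)      # the step has magnitude lr regardless of |g|
    m = beta2 * m + (1 - beta2) * g   # the only state carried to the next step
    return theta, m

big_theta, big_m = lion_step(1.0, 0.0, g=1000.0)
tiny_theta, tiny_m = lion_step(1.0, 0.0, g=1e-6)
```

Both calls move the parameter by the same amount, which is why Lion is usually run with a smaller learning rate than Adam, and the single buffer m is the source of the memory saving.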
Several techniques sit between the backward pass and the optimizer step and modify either the gradient or the parameter delta before the update is applied.
| Technique | What it modifies | Why it is used |
|---|---|---|
| Gradient clipping (norm) | rescales g if norm(g) > c | Prevents exploding gradients in RNNs and transformers; LLM training typically uses c = 1.0 |
| Gradient clipping (value) | clips each g_i to [-c, c] | Cheaper but less common than norm clipping |
| Gradient accumulation | sums g across micro-batches before stepping | Simulates a large effective batch on limited GPU memory |
| Mixed precision | computes gradients in BF16 or FP16, with loss scaling for FP16 | Reduces memory for activations and gradients and can roughly double throughput on modern GPUs |
| Decoupled weight decay | theta <- theta (1 - eta lambda) before the gradient step | Restores the original meaning of weight decay under adaptive optimizers (AdamW) |
| Exponential moving average of weights | maintains a separate theta_EMA = rho theta_EMA + (1 - rho) theta | Used as the inference checkpoint, especially for diffusion models and segmentation |
| LARS | per-layer LR proportional to norm(theta_layer) / norm(g_layer) | Enables large-batch SGD for ResNet (You et al. 2017) |
| LAMB | per-layer LR scaling around AdamW | Used to train BERT in 76 minutes with batch size 32,868 (You et al. 2019) |
Gradient clipping deserves special attention because it is one of the few techniques that almost every modern training recipe uses unchanged. Pascanu, Mikolov, and Bengio (2013) showed that the loss surface of recurrent networks contains "cliffs" where the gradient magnitude can grow by orders of magnitude in a single step, and that simply rescaling the gradient when its norm exceeds a fixed threshold (a procedure they call norm clipping) is enough to make training stable [6]. Their insight was geometric: the gradients point in the right direction but are unreasonably long, so the fix is to limit the length without changing the direction [6]. The same trick is now standard for transformers, with PyTorch exposing it as torch.nn.utils.clip_grad_norm_(params, max_norm=1.0).
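Norm clipping is a few lines of arithmetic. The sketch below operates on a flat list of gradient values and is an illustration of the technique rather than PyTorch's implementation, which works on tensors and, as an aside from outside this article, adds a small constant to the norm before dividing.

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    """Rescale gradient values so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # same direction, shorter length
    return grads, total_norm

# A gradient of norm 5 is scaled down to norm 1.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)

# A gradient already inside the threshold is returned unchanged.
untouched, small_norm = clip_grad_norm([0.3, 0.4], max_norm=1.0)
```

The ratio between components is preserved (3:4 before and after), which is the geometric point made by Pascanu et al.: the direction is kept and only the length is limited.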
Layer-wise large-batch optimizers also operate in this gap between gradient and update. You et al. (2017) introduced LARS to scale SGD to large batches for ResNet [9], and You et al. (2019) generalized the idea to AdamW as LAMB, which trained BERT in 76 minutes (down from 3 days) at a batch size of 32,868, requiring just 8,599 iterations instead of roughly one million [11].
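The layer-wise scaling can be sketched as a ratio of norms. This is a simplified illustration, not the published LARS or LAMB algorithm: the helper names, the trust_coef value, and the omission of weight decay and momentum are all assumptions made for brevity.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(x * x for x in values))

def layerwise_lr(weights, grads, base_lr=1.0, trust_coef=0.001, eps=1e-12):
    """Learning rate for one layer, scaled by norm(weights) / norm(grads)."""
    return base_lr * trust_coef * l2_norm(weights) / (l2_norm(grads) + eps)

weights = [3.0, 4.0]                            # norm 5
lr_small_grad = layerwise_lr(weights, [0.3, 0.4])    # gradient norm 0.5
lr_large_grad = layerwise_lr(weights, [30.0, 40.0])  # gradient norm 50

# The size of the resulting update, lr * norm(g), is the same in both cases.
step_small = lr_small_grad * l2_norm([0.3, 0.4])
step_large = lr_large_grad * l2_norm([30.0, 40.0])
```

Each layer's update ends up as a fixed fraction of that layer's weight norm, so layers with unusually large or small gradients do not destabilize training when the batch size and learning rate are scaled up.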
When a model is trained across multiple GPUs or nodes, the parameter update has to be coordinated. The dominant strategies are summarized below.
| Strategy | What is shared | Communication |
|---|---|---|
| Data parallel (DDP) | full parameters, gradients, and optimizer state on every GPU | All-reduce of gradients per step |
| ZeRO stage 1 | parameters and gradients replicated; optimizer state sharded | Reduce-scatter of gradients, all-gather of updated weights |
| ZeRO stage 2 | parameters replicated; gradients and optimizer state sharded | Same as stage 1 with smaller buffers |
| ZeRO stage 3 / FSDP | parameters, gradients, and optimizer state all sharded | All-gather parameters before each layer's compute, reduce-scatter gradients |
| Pipeline parallel | layers split across devices | Activations and gradients passed between stages; micro-batches stagger updates |
| Tensor parallel | individual matrices split across devices | All-reduce inside each layer |
Rajbhandari et al. (2020) introduced ZeRO (Zero Redundancy Optimizer) in DeepSpeed [12]. Their key observation was that under standard data parallelism the optimizer state, which for Adam is roughly 8 bytes per parameter for the two moment buffers plus 4 bytes for an FP32 master copy of the weights, dominates GPU memory at scale. The DeepSpeed tutorial illustrates the problem with a 1.5B-parameter GPT-2 model where "the Adam optimizer states for the model consume 18GB, a significant portion of the 32GB RAM" [17]. ZeRO partitions that state across the data-parallel workers so each rank holds only 1/N of it, and the same idea extends to gradients (stage 2) and parameters themselves (stage 3). PyTorch's Fully Sharded Data Parallel (FSDP) is the upstream equivalent: it all-gathers parameter shards just before each forward and backward sub-graph runs, then reshards immediately after, and applies the optimizer step to its local shard only [15].
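The memory arithmetic behind these stages can be worked through directly. The sketch below takes the 12 bytes per parameter of Adam state from the paragraph above and additionally assumes 2-byte FP16 weights and 2-byte FP16 gradients, as in mixed-precision training; the function name and the choice of 64 ranks are illustrative, and activations and temporary buffers are ignored.

```python
def zero_bytes_per_rank(num_params, num_ranks, stage):
    """Approximate model-state bytes held by one data-parallel rank."""
    params = 2 * num_params       # FP16 weights (assumed)
    grads = 2 * num_params        # FP16 gradients (assumed)
    opt_state = 12 * num_params   # FP32 master weights + two Adam moments
    if stage >= 1:
        opt_state /= num_ranks    # stage 1 shards the optimizer state
    if stage >= 2:
        grads /= num_ranks        # stage 2 also shards the gradients
    if stage >= 3:
        params /= num_ranks       # stage 3 also shards the parameters
    return params + grads + opt_state

GB = 1e9
psi = 1.5e9  # the 1.5B-parameter GPT-2 example

optimizer_only_gb = 12 * psi / GB                        # 18.0
baseline_gb = zero_bytes_per_rank(psi, 64, stage=0) / GB # 24.0
stage1_gb = zero_bytes_per_rank(psi, 64, stage=1) / GB   # 6.28125
stage2_gb = zero_bytes_per_rank(psi, 64, stage=2) / GB   # 3.328125
stage3_gb = zero_bytes_per_rank(psi, 64, stage=3) / GB   # 0.375
```

The 18 GB figure matches the DeepSpeed tutorial's number for the optimizer state alone, and the per-rank totals show why sharding the optimizer state first gives the largest single saving.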
Synchronization style is a separate axis. Synchronous training, used by almost all production LLM runs, makes every worker wait for the all-reduce to finish before stepping, so every replica applies the same update; given a fixed seed, a fixed topology, and deterministic kernels, the run is reproducible. Asynchronous training, pioneered by Niu et al.'s 2011 Hogwild! scheme, lets each worker write to shared parameters without locks and accepts that some gradients will be stale [5]. Hogwild! is provably efficient when updates are sparse (it was demonstrated on a matrix completion problem), but is rarely used for dense neural networks today because the noise it introduces hurts convergence at scale.
For large language models the parameter update is heavily standardized. The recipes used by Llama 1, 2, and 3 and DeepSeek V1 through V3 are broadly similar: AdamW with beta1 = 0.9, beta2 = 0.95, a decaying learning-rate schedule (typically cosine) with linear warmup over a few thousand steps, weight decay lambda = 0.1, gradient norm clipped at 1.0, and BF16 mixed precision with an FP32 master copy of the weights. The Llama 2 technical report specifies AdamW with beta1 = 0.9, beta2 = 0.95, epsilon = 10^-5, weight decay 0.1, gradient clipping at 1.0, and a 2,000-step warmup followed by cosine decay to 10% of the peak learning rate. Batch sizes range from roughly 1M to 16M tokens, achieved through data parallelism across thousands of GPUs, often combined with gradient accumulation.
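A warmup-then-cosine schedule of the kind described can be sketched as a function of the step index. The 2,000-step warmup and the 10% floor come from the Llama 2 description above; the peak learning rate of 3e-4, the total of 100,000 steps, and the function name lr_at are hypothetical values chosen for illustration.

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    min_lr = peak_lr * min_ratio
    return min_lr + (peak_lr - min_lr) * cosine

PEAK, WARMUP, TOTAL = 3e-4, 2_000, 100_000  # hypothetical run settings

lr_start = lr_at(0, PEAK, WARMUP, TOTAL)        # small first step
lr_peak = lr_at(1_999, PEAK, WARMUP, TOTAL)     # end of warmup
lr_mid = lr_at(51_000, PEAK, WARMUP, TOTAL)     # halfway through the decay
lr_end = lr_at(100_000, PEAK, WARMUP, TOTAL)    # 10% of the peak
```

The schedule feeds the eta in the update rule; the optimizer itself is unchanged.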
The lower beta2 = 0.95 (compared with Adam's default 0.999) gives the second-moment estimate a shorter effective memory of about 20 steps rather than 1,000, so the optimizer adapts faster as the gradient distribution shifts during pretraining. Weight decay 0.1 is large by historical standards; without decoupling, the same value applied as L2 regularization in plain Adam would give very different effective regularization for parameters with different gradient magnitudes.
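The figures of 20 and 1,000 steps follow from the usual rule of thumb that an exponential moving average with decay beta averages over roughly 1 / (1 - beta) recent steps. A short check, with an illustrative function name:

```python
def ema_horizon(beta):
    """Approximate number of recent steps an EMA with decay beta averages over."""
    return 1.0 / (1.0 - beta)

llm_horizon = ema_horizon(0.95)       # about 20 steps
default_horizon = ema_horizon(0.999)  # about 1,000 steps

# Relative weight of a squared gradient that is one horizon old.
llm_weight = 0.95 ** 20
default_weight = 0.999 ** 1000
```

In both cases a gradient that is one horizon old has its weight reduced to roughly 1/e, about 0.36 to 0.37 of the weight given to the newest gradient, which is why the horizon is a reasonable measure of the estimator's memory.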
Optimizer state often dominates GPU memory at this scale. A 70B-parameter model in FP32 takes 280 GB (70 billion parameters at 4 bytes each) just for the weights, and an Adam-style optimizer adds another 560 GB for the two moment buffers. Sharding that footprint across the data-parallel group, as ZeRO stage 3 and FSDP do, is therefore a practical necessity rather than an optimization.
| Framework | Optimizer call | Notes |
|---|---|---|
| PyTorch | optimizer.step() then optimizer.zero_grad(set_to_none=True) | Stateful optimizers; in-place ops like param.add_(grad, alpha=-lr) are used internally |
| TensorFlow / Keras | optimizer.apply_gradients(zip(grads, vars)) | tf.GradientTape produces the gradients |
| JAX / Optax | updates, state = tx.update(grads, state, params); params = optax.apply_updates(params, updates) | Pure functional API; state is explicit |
| DeepSpeed | model_engine.step() | Combines optimizer step, gradient sync, and ZeRO bookkeeping |
| Hugging Face Accelerate | accelerator.backward(loss); optimizer.step() | Device-agnostic wrapper around PyTorch |
Under the hood, every framework reduces the update to element-wise operations on parameter tensors. PyTorch's reference Adam implementation mutates its state and parameter tensors in place with tensor methods such as add_, mul_, and addcmul_; fused CUDA implementations combine these operations to avoid extra memory traffic [16].
The parameter update rule does not exist in isolation. The learning rate controls how aggressively the update is applied, and modern LLM recipes use a warmup-then-cosine schedule rather than a fixed value. Regularization, particularly weight decay, is applied as part of the update rather than as a separate loss term, and the decoupled form used by AdamW changes the effective regularization strength compared with the equivalent L2 penalty added to the loss [10].
A model has lots of little knobs called parameters. After looking at some examples, the model figures out how it was wrong and which way each knob should turn to be a little less wrong next time. The parameter update is the moment we actually turn the knobs. Different optimizers turn them in different ways: some take big confident steps, some take small careful steps, and some remember which way they have been going so they keep moving in that direction.