# Parameter update

> Source: https://aiwiki.ai/wiki/parameter_update
> Updated: 2026-07-11
> Categories: Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **parameter update** is the step in neural-network training where a model's trainable weights are adjusted using the gradient of the loss, following the rule $$w_{\text{new}} = w_{\text{old}} - \text{learning\_rate} \cdot \text{gradient}$$. It is the final stage of every training iteration and the core of the optimizer's job: the forward pass produces predictions, the loss is computed, [backpropagation](/wiki/backpropagation) and [automatic differentiation](/wiki/automatic_differentiation) yield the gradient, and the parameter update consumes that gradient to produce a new value of the weights. In the language of [gradient descent](/wiki/gradient_descent), the update moves each parameter a small distance in the direction that locally decreases the loss, with the [learning rate](/wiki/learning_rate) setting how far. Modern training pipelines, including those used to train [large language models](/wiki/llm), repeat this loop hundreds of thousands to millions of times, often across thousands of GPUs.

Writing the trainable parameters as the vector $$\theta$$, the generic form of an update is:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \mathrm{update}(g, \mathrm{state})
$$

where $$g$$ is the current gradient of the loss and $$\mathrm{state}$$ holds optional running statistics such as momentum buffers, second-moment estimates, or per-parameter learning rates. Optimizers differ only in how they compute that update term; subtracting the update from the current weights is common to all of them.

## Where does the parameter update sit in the training loop?

A single training step in a typical deep learning framework follows the same five operations:

1. Sample a mini-batch of training examples.
2. Run a forward pass to compute the model's predictions and the scalar loss $$L(\theta)$$.
3. Run a backward pass that uses [backpropagation](/wiki/backpropagation) and [automatic differentiation](/wiki/automatic_differentiation) to fill the gradient of the loss with respect to every parameter.
4. Call the optimizer's update routine, which applies the update rule to every parameter in place. This is the parameter update.
5. Reset the gradient buffers to zero for the next step.

In PyTorch the last three operations are `loss.backward()`, `optimizer.step()`, and `optimizer.zero_grad()`. The PyTorch documentation describes the division of labor plainly: `loss.backward()` runs reverse-mode automatic differentiation to compute the gradients, and `optimizer.step()` then "takes the gradients stored in param.grad and applies them to the model parameters according to the chosen optimization algorithm" [16]. Because PyTorch accumulates gradients by default, `optimizer.zero_grad()` must clear them between steps or successive backward passes will add together [16]. In TensorFlow and Keras the equivalent update call is `optimizer.apply_gradients(zip(grads, vars))`. JAX and Optax instead expose optimizers as pure functions that take the gradient and the previous state and return a new state and an update vector; the user then applies that vector with `optax.apply_updates(params, updates)` [18].

The distinction between computing the gradient and applying the update matters because both operations can be modified independently. Mixed-precision training, gradient accumulation, gradient clipping, and weight decay all live in this gap between the backward pass and the optimizer step.
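
As a rough illustration, the five steps can be traced in plain Python on a one-parameter model $$y \approx w x$$. The function name `train_linear`, the toy data, and the hand-derived gradient (standing in for automatic differentiation) are assumptions made for this sketch and do not correspond to any framework API.

```python
import random

def train_linear(xs, ys, lr=0.1, steps=200, batch_size=4, seed=0):
    """Fit y ~ w * x with mini-batch SGD, one pass through the five steps per iteration."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)                     # 1. sample a mini-batch
        preds = [w * x for x, _ in batch]                        # 2. forward pass
        loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / batch_size
        grad = sum(2 * (p - y) * x                               # 3. backward pass (by hand)
                   for p, (x, y) in zip(preds, batch)) / batch_size
        w = w - lr * grad                                        # 4. parameter update
        grad = 0.0                                               # 5. reset the gradient buffer
    return w

xs = [0.5, 1.0, 1.5, 2.0, -1.0, -0.5, 0.25, 0.75]
ys = [3.0 * x for x in xs]       # toy data generated with a true slope of 3
w_fit = train_linear(xs, ys)
```

Only step 4 changes the model; everything before it exists to produce the gradient that the update consumes.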

## What is the basic SGD update rule?

The simplest update rule is plain [gradient descent](/wiki/gradient_descent), which scales the gradient by a constant [learning rate](/wiki/learning_rate) $$\eta$$ and subtracts it from the current parameters:

$$
\theta \leftarrow \theta - \eta g
$$

This is the literal $$w := w - \text{learning\_rate} \cdot \text{gradient}$$ that defines the parameter update. When the gradient is computed on a single example or a mini-batch instead of the full dataset, the same rule is called [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) or [mini-batch SGD](/wiki/mini-batch_stochastic_gradient_descent). The idea traces to Robbins and Monro's 1951 stochastic approximation method [1]. Vanilla SGD has no internal state: the only thing the optimizer remembers between steps is the [step size](/wiki/step_size) itself, which may be adjusted by a separate learning-rate schedule.
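
A minimal sketch of this stateless rule, assuming parameters and gradients are held in plain Python lists (the helper name `sgd_step` is hypothetical):

```python
def sgd_step(params, grads, lr):
    """Return new parameters after one plain SGD update: theta - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]

theta = [1.0, -2.0]
grads = [0.5, -1.0]
theta = sgd_step(theta, grads, lr=0.1)
```

The function takes no state argument and returns none, which is exactly what "no internal state" means in practice.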

Update frequency is a separate choice from the rule itself. **Batch (full-batch) gradient descent** computes one update per pass over the entire dataset; **stochastic gradient descent** updates after every single example; and **mini-batch gradient descent**, the dominant choice in deep learning, updates once per mini-batch (commonly tens to thousands of examples) so the gradient is a noisy but cheap estimate of the full-dataset gradient. More frequent updates mean noisier steps but far more steps per pass over the data.
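
To make the trade-off concrete, the number of updates per pass over the data can be counted directly. The dataset size of 50,000 and the batch size of 128 below are illustrative assumptions, and `updates_per_epoch` is a hypothetical helper.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of parameter updates in one pass over the data (last batch may be partial)."""
    return math.ceil(num_examples / batch_size)

n = 50_000
full_batch = updates_per_epoch(n, n)      # batch gradient descent
per_example = updates_per_epoch(n, 1)     # stochastic gradient descent
mini_batch = updates_per_epoch(n, 128)    # mini-batch gradient descent
```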

The lack of state in vanilla SGD is both a strength and a weakness. SGD uses no extra memory beyond the gradient, but it converges slowly on ill-conditioned problems and oscillates in narrow ravines of the loss surface. Almost every modern optimizer adds buffers to address one or both of these issues.

## How does momentum change the update?

Polyak (1964) introduced the **heavy-ball method**, also known as classical [momentum](/wiki/momentum), which augments SGD with a velocity buffer $$v$$ that accumulates past gradients with exponential decay $$\mu$$ [2]:

$$
v \leftarrow \mu v + g
$$

$$
\theta \leftarrow \theta - \eta v
$$

The physical analogy is a ball rolling down the loss surface: the velocity carries it through small bumps and damps oscillations across narrow valleys. Typical values of $$\mu$$ in deep learning are 0.9 or 0.99.
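
The effect can be seen on a small ill-conditioned quadratic. The sketch below assumes the loss $$\tfrac{1}{2}(x^2 + 100 y^2)$$, a learning rate of 0.01, and 100 steps; these choices, and the helper names, are illustrative only. Setting $$\mu = 0$$ recovers plain gradient descent.

```python
CURVATURE = (1.0, 100.0)   # one flat direction, one steep direction

def grad(theta):
    return [c * t for c, t in zip(CURVATURE, theta)]

def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))

def run(mu, lr=0.01, steps=100):
    """Heavy-ball updates: v <- mu * v + g, theta <- theta - lr * v."""
    theta = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [t - lr * vi for t, vi in zip(theta, v)]
    return loss(theta)

loss_sgd = run(mu=0.0)        # no velocity: slow progress along the flat direction
loss_momentum = run(mu=0.9)   # velocity builds up along the flat direction
```

With the same learning rate, the momentum run ends at a lower loss because the velocity keeps growing along the low-curvature direction where plain gradient descent crawls.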

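Nesterov (1983) refined this with **Nesterov accelerated gradient**, which evaluates the gradient at a look-ahead point rather than at $$\theta$$ itself [3]. With the velocity convention used above, where $$v$$ accumulates raw gradients and the step is $$\eta v$$, that point is $$\theta - \eta \mu v$$:

$$
v \leftarrow \mu v + \nabla L(\theta - \eta \mu v)
$$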

$$
\theta \leftarrow \theta - \eta v
$$

In theory, Nesterov momentum gives an $$O(1/T^2)$$ convergence rate on smooth convex problems, compared with $$O(1/T)$$ for plain gradient descent [3]. In practice the difference for non-convex deep networks is smaller, but PyTorch and most other frameworks expose it as a one-line option (`nesterov=True`).
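
A minimal sketch of one Nesterov step, assuming the velocity accumulates raw gradients and the step is $$\eta v$$, so that the look-ahead point is where the decayed velocity alone would carry the parameters, $$\theta - \eta \mu v$$. The function names and the quadratic test problem are hypothetical.

```python
def nesterov_step(theta, v, grad_fn, lr, mu):
    """One Nesterov update with the gradient evaluated at the look-ahead point."""
    lookahead = [t - lr * mu * vi for t, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v = [mu * vi + gi for vi, gi in zip(v, g)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v

def quad_grad(theta):
    """Gradient of 0.5 * theta^2, used here only as a toy objective."""
    return list(theta)

theta, v = [1.0], [0.0]
theta, v = nesterov_step(theta, v, quad_grad, lr=0.1, mu=0.9)
first_theta = theta[0]
for _ in range(199):
    theta, v = nesterov_step(theta, v, quad_grad, lr=0.1, mu=0.9)
```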

## What are adaptive optimizers (AdaGrad, RMSProp, Adam)?

A second family of optimizers gives every parameter its own effective learning rate based on the history of its gradients.

Duchi, Hazan, and Singer (2011) proposed **AdaGrad**, which divides the gradient by the square root of the sum of all past squared gradients [4]. Parameters that receive large gradients early on get progressively smaller updates, which works well for sparse features. Because the sum only grows, however, the effective learning rate keeps shrinking, and learning can stall on dense problems.
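
The shrinking step can be sketched for a single scalar parameter. The helper `adagrad_step`, the constant gradient of 1, and the placement of $$\epsilon$$ outside the square root are assumptions of this example.

```python
import math

def adagrad_step(theta, accum, grad, lr=0.1, eps=1e-8):
    """AdaGrad for one scalar: divide by the root of the running SUM of squared gradients."""
    accum = accum + grad * grad
    theta = theta - lr * grad / (math.sqrt(accum) + eps)
    return theta, accum

theta, accum = 0.0, 0.0
step_sizes = []
for _ in range(4):
    new_theta, accum = adagrad_step(theta, accum, grad=1.0)
    step_sizes.append(theta - new_theta)   # size of the step just taken
    theta = new_theta
```

With a constant gradient the step at iteration $$t$$ is $$\eta / \sqrt{t}$$, so it halves by the fourth step and never recovers.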

Hinton (2012) addressed that decay-to-zero behavior in his Coursera lectures with **RMSProp**, which replaces the running sum with an exponential moving average of squared gradients controlled by a decay parameter $$\rho$$ (often 0.9) [7]:

$$
E[g^2] \leftarrow \rho E[g^2] + (1 - \rho) g^2
$$

$$
\theta \leftarrow \theta - \frac{\eta g}{\sqrt{E[g^2] + \epsilon}}
$$
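
A scalar sketch of the two equations above, with hypothetical helper names. It runs the rule with a constant gradient until the moving average settles and records the size of the last step.

```python
import math

def rmsprop_step(theta, avg_sq, grad, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp for one scalar: divide by the root of an EMA of squared gradients."""
    avg_sq = rho * avg_sq + (1 - rho) * grad * grad
    theta = theta - lr * grad / math.sqrt(avg_sq + eps)
    return theta, avg_sq

def settled_step_size(grad, steps=200):
    """Step size after the moving average has converged for a constant gradient."""
    theta, avg_sq = 0.0, 0.0
    prev = theta
    for _ in range(steps):
        prev = theta
        theta, avg_sq = rmsprop_step(theta, avg_sq, grad)
    return prev - theta

step_small_grad = settled_step_size(2.0)
step_large_grad = settled_step_size(200.0)
```

Both runs settle at a step of about $$\eta$$: the gradient's scale is divided out, and unlike AdaGrad the step does not keep decaying.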

Kingma and Ba (2015) combined RMSProp's adaptive scaling with momentum to produce **Adam**, which has dominated deep learning since its publication. The paper introduces it as "an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments" [8]. The full update at step $$t$$ is:

$$
m \leftarrow \beta_1 m + (1 - \beta_1) g
$$

$$
v \leftarrow \beta_2 v + (1 - \beta_2) g^2
$$

$$
\hat{m} = \frac{m}{1 - \beta_1^t}
$$

$$
\hat{v} = \frac{v}{1 - \beta_2^t}
$$

$$
\theta \leftarrow \theta - \frac{\eta \hat{m}}{\sqrt{\hat{v}} + \epsilon}
$$

The two correction factors $$(1 - \beta_1^t)$$ and $$(1 - \beta_2^t)$$ compensate for the fact that $$m$$ and $$v$$ start at zero and would otherwise be biased toward zero in the first few steps [8]. Common defaults are $$\beta_1 = 0.9$$, $$\beta_2 = 0.999$$, $$\epsilon = 10^{-8}$$ [8].
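
The five equations translate almost line for line into code. This is a scalar sketch with a hypothetical function name, not a framework implementation; the starting gradient of 5 is arbitrary.

```python
import math

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad * grad     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = 1.0
theta1, m, v = adam_step(theta0, 0.0, 0.0, grad=5.0, t=1)
first_step = theta0 - theta1
```

On the very first step the corrected moments are $$\hat{m} = g$$ and $$\hat{v} = g^2$$, so the step is almost exactly $$\eta$$ whatever the gradient's magnitude. Without the correction the ratio $$m / \sqrt{v}$$ would be about $$0.1 / \sqrt{0.001} \approx 3.2$$ times larger.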

Loshchilov and Hutter (2019) showed that Adam's standard L2 regularization interacts badly with the per-parameter scaling: large-gradient weights effectively receive less regularization than small-gradient weights [10]. They proposed [AdamW](/wiki/adamw), which decouples weight decay from the gradient and applies it directly to the parameters:

$$
\theta \leftarrow \theta - \frac{\eta \hat{m}}{\sqrt{\hat{v}} + \epsilon} - \eta \lambda \theta
$$

This small change improves generalization enough that AdamW has displaced Adam in most modern training recipes, including those for BERT, GPT, Llama, and DeepSeek.
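
The difference between the two forms of regularization shows up even with a zero loss gradient. The sketch below uses a hypothetical scalar helper, `adam_like_step`, with a `decoupled` flag that switches between AdamW-style decay and an L2 term folded into the gradient; the values $$\eta = 10^{-3}$$ and $$\lambda = 0.1$$ are illustrative.

```python
import math

def adam_like_step(theta, m, v, grad, t, lr=1e-3, wd=0.1, decoupled=True,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """Scalar Adam step with either decoupled weight decay or a coupled L2 penalty."""
    if not decoupled:
        grad = grad + wd * theta                  # L2 penalty enters the moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    new_theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_theta = new_theta - lr * wd * theta   # decay applied directly to the weight
    return new_theta, m, v

theta_adamw, _, _ = adam_like_step(1.0, 0.0, 0.0, grad=0.0, t=1, decoupled=True)
theta_l2, _, _ = adam_like_step(1.0, 0.0, 0.0, grad=0.0, t=1, decoupled=False)
```

With decoupled decay the weight shrinks by $$\eta \lambda \theta = 10^{-4}$$. With the coupled penalty the adaptive normalization rescales the decay term like any other gradient, and the weight moves by roughly $$\eta = 10^{-3}$$ regardless of $$\lambda$$.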

The table below summarizes the most common rules.

| Optimizer | Year | State per parameter | Update rule (sketch) | Notes |
|---|---|---|---|---|
| [SGD](/wiki/stochastic_gradient_descent_sgd) | 1951 | none | $$\theta \leftarrow \theta - \eta g$$ | Robbins and Monro |
| Momentum | 1964 | velocity $$v$$ | $$\theta \leftarrow \theta - \eta (\mu v + g)$$ | [Polyak](/wiki/momentum) heavy-ball |
| Nesterov | 1983 | velocity $$v$$ | gradient at look-ahead point $$\theta - \eta \mu v$$ | Accelerated method |
| [AdaGrad](/wiki/adagrad) | 2011 | sum of $$g^2$$ | $$\eta / \sqrt{\sum g^2}$$ scaling | Duchi et al. |
| RMSProp | 2012 | EMA of $$g^2$$ | $$\eta / \sqrt{\mathrm{EMA}(g^2)}$$ scaling | Hinton lecture notes |
| [Adam](/wiki/adam_optimizer) | 2014 | m, v | bias-corrected first and second moments | Kingma and Ba |
| [AdamW](/wiki/adamw) | 2017 | m, v | Adam + decoupled weight decay | Loshchilov and Hutter |
| LAMB | 2019 | m, v | layer-wise normalized AdamW | Trains BERT in 76 minutes |
| Lion | 2023 | momentum only | $$\operatorname{sign}(\beta_1 m + (1 - \beta_1) g)$$ | Discovered by symbolic search |
| Sophia | 2023 | m, diagonal Hessian | clipped $$m / \max(\text{diag-}H, \epsilon)$$ | Light-weight second-order |

More recent work continues to push the design space. Chen et al. (2023) discovered **Lion** ("EvoLved Sign Momentum") through automated symbolic search; it tracks only momentum and applies the sign function elementwise, so it uses roughly half the optimizer memory of Adam [13]. The paper reports that Lion "saves up to 5x the pre-training compute on JFT" and "boosts the accuracy of ViT by up to 2% on ImageNet", reaching 88.3% zero-shot and 91.1% fine-tuning ImageNet accuracy and cutting diffusion-model training compute by up to 2.3x while achieving better FID scores [13]. Liu et al. (2023) introduced **Sophia**, which estimates the diagonal of the Hessian every few steps and clips the resulting Newton-style update; on GPT models from 125M to 1.5B parameters, Sophia "achieves a 2x speed-up compared with Adam in the number of steps, total compute, and wall-clock time", reaching the same perplexity in roughly half the steps [14].
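
A scalar sketch of the Lion update as described in the paper: the direction is the sign of an interpolation between the momentum and the current gradient, and the momentum is then refreshed with a second coefficient. The function names are hypothetical, and the default values shown ($$\beta_1 = 0.9$$, $$\beta_2 = 0.99$$, learning rate $$10^{-4}$$) are assumptions for illustration.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a scalar parameter; only the momentum m is kept as state."""
    direction = sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (direction + wd * theta)   # sign update plus decoupled decay
    m = beta2 * m + (1 - beta2) * grad
    return theta, m

theta_small, _ = lion_step(0.0, 0.0, grad=1e-3)
theta_large, _ = lion_step(0.0, 0.0, grad=1e3)
```

Because only the sign is used, every coordinate moves by exactly $$\eta$$ per step, which is why the two gradients above, six orders of magnitude apart, produce the same update.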

## What modifies the gradient before the update?

Several techniques sit between the backward pass and the optimizer step and modify either the gradient or the parameter delta before the update is applied.

| Technique | What it modifies | Why it is used |
|---|---|---|
| Gradient clipping (norm) | rescales $$g$$ if $$\lVert g \rVert > c$$ | Prevents exploding gradients in RNNs and transformers; LLM training typically uses $$c = 1.0$$ |
| Gradient clipping (value) | clips each $$g_i$$ to $$[-c, c]$$ | Cheaper but less common than norm clipping |
| Gradient accumulation | sums $$g$$ across micro-batches before stepping | Simulates a large effective batch on limited GPU memory |
| Mixed precision | computes gradients in BF16 or FP16, with loss scaling for FP16 | Roughly halves activation and gradient memory and can roughly double throughput on modern GPUs |
| Decoupled weight decay | $$\theta \leftarrow \theta (1 - \eta \lambda)$$ before the gradient step | Restores the original meaning of weight decay under adaptive optimizers (AdamW) |
| Exponential moving average of weights | maintains a separate $$\theta_{\text{EMA}} = \rho \theta_{\text{EMA}} + (1 - \rho) \theta$$ | Used as the inference checkpoint, especially for diffusion models and segmentation |
| LARS | per-layer LR proportional to $$\lVert \theta_{\text{layer}} \rVert / \lVert g_{\text{layer}} \rVert$$ | Enables large-batch SGD for ResNet (You et al. 2017) |
| LAMB | per-layer LR scaling around AdamW | Used to train BERT in 76 minutes with batch sizes of 65,536 and 32,768 in its two training stages (You et al. 2019) |
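
Gradient accumulation from the table can be checked numerically. The sketch below assumes a one-parameter model with squared-error loss; `mse_grad` and `accumulated_grad` are hypothetical helpers, and each micro-batch gradient is weighted by its share of the full batch so that uneven splits still match.

```python
def mse_grad(w, batch):
    """Mean gradient of (w * x - y)^2 with respect to w over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_size):
    """Accumulate micro-batch gradients without updating w in between."""
    total = 0.0
    for start in range(0, len(batch), micro_size):
        micro = batch[start:start + micro_size]
        total += mse_grad(w, micro) * len(micro) / len(batch)
    return total

batch = [(0.5, 1.0), (1.0, 2.5), (1.5, 2.0), (2.0, 5.0), (-1.0, -2.0), (3.0, 7.0)]
g_full = mse_grad(0.3, batch)
g_accum_even = accumulated_grad(0.3, batch, micro_size=2)
g_accum_uneven = accumulated_grad(0.3, batch, micro_size=4)
```

Up to floating-point rounding, the accumulated gradient equals the full-batch gradient, so a single optimizer step after accumulation behaves like a step on the larger batch.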

Gradient clipping deserves special attention because it is one of the few techniques that almost every modern training recipe uses unchanged. Pascanu, Mikolov, and Bengio (2013) showed that the loss surface of recurrent networks contains "cliffs" where the gradient magnitude can grow by orders of magnitude in a single step, and that simply rescaling the gradient when its norm exceeds a fixed threshold (a procedure they call norm clipping) is enough to make training stable [6]. Their insight was geometric: the gradients point in the right direction but are unreasonably long, so the fix is to limit the length without changing the direction [6]. The same trick is now standard for transformers, with PyTorch exposing it as `torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)`.
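
Norm clipping is short enough to sketch in full. The helper below is a plain-Python illustration on a flat list of gradient values, not the PyTorch implementation; the small constant in the denominator is an assumption added to avoid division by zero.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their global L2 norm is at most max_norm; direction is unchanged."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)   # norm 5, gets rescaled
untouched, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)           # norm 0.5, left alone
norm_after = math.sqrt(sum(g * g for g in clipped))
```

All components share one scale factor, which is what preserves the direction; value clipping, by contrast, clips each component independently and can rotate the gradient.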

Layer-wise large-batch optimizers build on this gap between gradient and update. You et al. (2017) introduced LARS to scale SGD to large batches for ResNet [9], and You et al. (2019) generalized the idea to AdamW as **LAMB**, which trained BERT in 76 minutes (down from 3 days) with batch sizes of 65,536 and 32,768 in its two training stages, requiring just 8,599 iterations instead of roughly one million [11].
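
The layer-wise idea can be sketched with a simplified trust ratio. This example keeps only the $$\lVert \theta_{\text{layer}} \rVert / \lVert g_{\text{layer}} \rVert$$ scaling and omits the momentum and weight-decay terms of the published LARS algorithm; the function names and the trust coefficient of 0.001 are assumptions.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def lars_layer_update(theta, grad, lr=0.1, trust=0.001):
    """Simplified LARS step for one layer: scale the step by ||theta|| / ||grad||."""
    w_norm, g_norm = l2_norm(theta), l2_norm(grad)
    ratio = trust * w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
    return [t - lr * ratio * g for t, g in zip(theta, grad)]

layer = [3.0, 4.0]                                    # weight norm 5
after_small = lars_layer_update(layer, [0.001, 0.0])  # tiny gradient
after_large = lars_layer_update(layer, [1000.0, 0.0]) # huge gradient
step_small = l2_norm([a - b for a, b in zip(layer, after_small)])
step_large = l2_norm([a - b for a, b in zip(layer, after_large)])
```

Both steps have length $$\eta \cdot 0.001 \cdot \lVert \theta \rVert = 5 \times 10^{-4}$$: each layer moves by a fixed fraction of its own weight norm, independent of how large its gradient happens to be.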

## How are parameter updates done across many GPUs?

When a model is trained across multiple GPUs or nodes, the parameter update has to be coordinated. The dominant strategies are summarized below.

| Strategy | How state is placed | Communication |
|---|---|---|
| Data parallel (DDP) | full parameters, gradients, and optimizer state on every GPU | All-reduce of gradients per step |
| ZeRO stage 1 | parameters and gradients replicated; optimizer state sharded | Reduce-scatter of gradients, all-gather of updated weights |
| ZeRO stage 2 | parameters replicated; gradients and optimizer state sharded | Same as stage 1 with smaller buffers |
| ZeRO stage 3 / FSDP | parameters, gradients, and optimizer state all sharded | All-gather parameters before each layer's compute, reduce-scatter gradients |
| Pipeline parallel | layers split across devices | Activations and gradients passed between stages; micro-batches stagger updates |
| Tensor parallel | individual matrices split across devices | All-reduce inside each layer |

Rajbhandari et al. (2020) introduced **ZeRO** (Zero Redundancy Optimizer) in DeepSpeed [12]. Their key observation was that under standard data parallelism the optimizer state, which for Adam is roughly 8 bytes per parameter for the two moment buffers plus 4 bytes for an FP32 master copy of the weights, dominates GPU memory at scale. The DeepSpeed tutorial illustrates the problem with a 1.5B-parameter GPT-2 model where "the Adam optimizer states for the model consume 18GB, a significant portion of the 32GB RAM" [17]. ZeRO partitions that state across the data-parallel workers so each rank holds only 1/N of it, and the same idea extends to gradients (stage 2) and parameters themselves (stage 3). PyTorch's **Fully Sharded Data Parallel** (FSDP) is the upstream equivalent: it all-gathers parameter shards just before each forward and backward sub-graph runs, then reshards immediately after, and applies the optimizer step to its local shard only [15].
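
The 18 GB figure follows from the per-parameter byte counts given above. The helper names and the world size of 64 are assumptions for illustration, and the calculation uses decimal gigabytes.

```python
def adam_state_bytes(num_params, moment_bytes=4, master_bytes=4):
    """Bytes for two FP32 moment buffers plus an FP32 master copy of the weights."""
    return num_params * (2 * moment_bytes + master_bytes)

def per_rank_bytes(num_params, world_size):
    """Optimizer-state bytes per data-parallel rank when the state is sharded evenly."""
    return adam_state_bytes(num_params) / world_size

gpt2_params = 1_500_000_000
replicated_gb = adam_state_bytes(gpt2_params) / 1e9    # every rank holds all of it
sharded_gb = per_rank_bytes(gpt2_params, 64) / 1e9     # each of 64 ranks holds 1/64
```

Sharding the state 64 ways brings the per-GPU cost from 18 GB to about 0.28 GB, at the price of the extra collective communication listed in the table.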

Synchronization style is a separate axis. **Synchronous** training, used by almost all production LLM runs, makes every worker wait for the all-reduce to finish before stepping, so all replicas apply the same update and the run is reproducible in principle given a fixed seed, topology, and deterministic kernels. **Asynchronous** training, pioneered by Niu et al.'s 2011 **Hogwild!** scheme, lets each worker write to shared parameters without locks and accepts that some gradients will be stale [5]. Hogwild! is provably efficient when updates are sparse (its demonstrations included matrix completion), but is rarely used for dense neural networks today because the noise it introduces hurts convergence at scale.
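
Synchronous data parallelism can be simulated in a single process. In this sketch each "worker" is just a list entry, `all_reduce_mean` stands in for the real collective, and the model is a one-parameter regression; all names and data are hypothetical.

```python
def local_grad(w, shard):
    """Mean gradient of (w * x - y)^2 over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_step(replicas, shards, lr):
    grads = [local_grad(w, s) for w, s in zip(replicas, shards)]
    grads = all_reduce_mean(grads)          # barrier: nobody steps before this finishes
    return [w - lr * g for w, g in zip(replicas, grads)]

shards = [[(1.0, 2.0), (2.0, 4.5)], [(0.5, 1.0), (1.5, 2.5)], [(3.0, 6.5), (-1.0, -2.0)]]
replicas = [0.0, 0.0, 0.0]                  # identical initial weights on every worker
for _ in range(50):
    replicas = sync_step(replicas, shards, lr=0.05)
```

Because every worker starts from the same weights and applies the same averaged gradient, the replicas never diverge, and with equal-sized shards the result matches a single process training on the combined batch.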

## How do modern LLMs configure the parameter update?

For large language models the parameter update is heavily standardized. The recipe used by Llama 1, 2, and 3 and DeepSeek V1 through V3 is essentially the same: AdamW with $$\beta_1 = 0.9$$, $$\beta_2 = 0.95$$, a cosine learning-rate schedule with linear warmup over a few thousand steps, weight decay $$\lambda = 0.1$$, gradient norm clipped at 1.0, and BF16 mixed precision with an FP32 master copy of the weights. The Llama 2 technical report specifies AdamW with $$\beta_1 = 0.9$$, $$\beta_2 = 0.95$$, $$\epsilon = 10^{-5}$$, weight decay 0.1, gradient clipping at 1.0, and a 2,000-step warmup followed by cosine decay to 10% of the peak learning rate. Batch sizes range from roughly 1M to 16M tokens per step, reached by combining data parallelism across thousands of GPUs with gradient accumulation where needed.
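
A warmup-then-cosine schedule of the kind described here can be sketched as a pure function of the step index. The 2,000 warmup steps and the 10% floor follow the Llama 2 description above; the peak learning rate of $$3 \times 10^{-4}$$, the total of 100,000 steps, and the function name `lr_at` are assumptions for illustration.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes from 1 down to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

lr_start = lr_at(0)
lr_peak = lr_at(2000)
lr_end = lr_at(100_000)
```

The optimizer reads this value at every step and uses it as $$\eta$$ in the update rule; the schedule itself holds no state beyond the step counter.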

The lower $$\beta_2 = 0.95$$ (compared with Adam's default 0.999) gives the second-moment estimate a shorter effective memory of about 20 steps rather than 1,000, so the optimizer adapts faster as the gradient distribution shifts during pretraining. Weight decay 0.1 is large by historical standards; without decoupling, the same value applied as L2 regularization in plain Adam would give very different effective regularization for parameters with different gradient magnitudes.
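
The "effective memory" can be made concrete by asking how much of the moving average's total weight falls on the most recent gradients. In an exponential moving average the gradient from $$k$$ steps ago carries weight $$(1 - \beta)\beta^k$$, so the most recent $$n$$ gradients together carry $$1 - \beta^n$$. The helper name below is hypothetical.

```python
def recent_weight_share(beta, window):
    """Share of total EMA weight carried by the most recent `window` gradients."""
    return 1.0 - beta ** window

share_llm = recent_weight_share(0.95, 20)       # about 0.64
share_default = recent_weight_share(0.999, 20)  # about 0.02
horizon_llm = 1.0 / (1.0 - 0.95)                # about 20 steps
horizon_default = 1.0 / (1.0 - 0.999)           # about 1,000 steps
```

With $$\beta_2 = 0.95$$ roughly two thirds of the estimate comes from the last 20 steps, while with $$\beta_2 = 0.999$$ those same steps contribute only about 2%.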

Optimizer state often dominates GPU memory at this scale. A 70B-parameter model stored in FP32 takes 280 GB for the weights alone, and an Adam-style optimizer adds another 560 GB for the two FP32 moment buffers. No single GPU holds that much, so sharding the footprint across devices, for example with ZeRO stage 3 or FSDP over the data-parallel group, is effectively required.

## How is the update implemented in each framework?

| Framework | Optimizer call | Notes |
|---|---|---|
| PyTorch | `optimizer.step()` then `optimizer.zero_grad(set_to_none=True)` | Stateful optimizers; in-place ops like `param.add_(grad, alpha=-lr)` are used internally |
| TensorFlow / Keras | `optimizer.apply_gradients(zip(grads, vars))` | `tf.GradientTape` produces the gradients |
| JAX / Optax | `updates, state = tx.update(grads, state, params); params = optax.apply_updates(params, updates)` | Pure functional API; state is explicit |
| DeepSpeed | `model_engine.step()` | Combines optimizer step, gradient sync, and ZeRO bookkeeping |
| Hugging Face Accelerate | `accelerator.backward(loss); optimizer.step()` | Device-agnostic wrapper around PyTorch |

Under the hood every framework boils down to element-wise operations on parameter tensors. PyTorch's reference Adam implementation uses in-place tensor methods such as `add_`, `mul_`, and `addcmul_` to mutate parameter tensors, and its optional multi-tensor and fused CUDA implementations batch these operations to avoid extra memory traffic and kernel launches [16].
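
The stateful and functional styles in the table can be contrasted with a toy momentum optimizer written both ways. Neither version is a real library API: the class `StatefulMomentum` and the functions `momentum_update` and `apply_updates` are hypothetical stand-ins that only mimic the shape of the PyTorch-style and Optax-style interfaces.

```python
class StatefulMomentum:
    """Object-style optimizer: owns its velocity and mutates the parameters in place."""
    def __init__(self, params, lr=0.1, mu=0.9):
        self.params, self.lr, self.mu = params, lr, mu
        self.velocity = [0.0] * len(params)

    def step(self, grads):
        for i, g in enumerate(grads):
            self.velocity[i] = self.mu * self.velocity[i] + g
            self.params[i] -= self.lr * self.velocity[i]

def momentum_update(grads, state, lr=0.1, mu=0.9):
    """Functional style: return (updates, new_state) and mutate nothing."""
    new_state = [mu * v + g for v, g in zip(state, grads)]
    updates = [-lr * v for v in new_state]
    return updates, new_state

def apply_updates(params, updates):
    return [p + u for p, u in zip(params, updates)]

grad_sequence = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]

params_a = [1.0, 1.0]
opt = StatefulMomentum(params_a)
for g in grad_sequence:
    opt.step(g)

params_b, state = [1.0, 1.0], [0.0, 0.0]
for g in grad_sequence:
    updates, state = momentum_update(g, state)
    params_b = apply_updates(params_b, updates)
```

Both versions compute the same numbers. The functional form makes the optimizer state an explicit value that the caller threads through the loop, which is what lets frameworks like JAX transform and compile the whole step.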

## How does the update relate to learning-rate scheduling and regularization?

The parameter update rule does not exist in isolation. The [learning rate](/wiki/learning_rate) controls how aggressively the update is applied, and modern LLM recipes use a warmup-then-cosine schedule rather than a fixed value. [Regularization](/wiki/regularization), particularly weight decay, is applied as part of the update rather than as a separate loss term, and the decoupled form used by AdamW changes the effective regularization strength compared with the equivalent L2 penalty added to the loss [10].

## Explain like I'm 5 (ELI5)

A model has lots of little knobs called parameters. After looking at some examples, the model figures out how it was wrong and which way each knob should turn to be a little less wrong next time. The parameter update is the moment we actually turn the knobs. Different optimizers turn them in different ways: some take big confident steps, some take small careful steps, and some remember which way they have been going so they keep moving in that direction.

## References

1. Robbins, H., and Monro, S. (1951). "A Stochastic Approximation Method." *Annals of Mathematical Statistics*, 22(3), 400-407.
2. Polyak, B. T. (1964). "Some methods of speeding up the convergence of iteration methods." *USSR Computational Mathematics and Mathematical Physics*, 4(5), 1-17.
3. Nesterov, Y. (1983). "A method for solving the convex programming problem with convergence rate O(1/k^2)." *Soviet Mathematics Doklady*, 27, 372-376.
4. Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." *Journal of Machine Learning Research*, 12, 2121-2159.
5. Niu, F., Recht, B., Ré, C., and Wright, S. J. (2011). "HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent." NeurIPS 2011. [arXiv:1106.5730](https://arxiv.org/abs/1106.5730).
6. Pascanu, R., Mikolov, T., and Bengio, Y. (2013). "On the difficulty of training Recurrent Neural Networks." ICML 2013. [arXiv:1211.5063](https://arxiv.org/abs/1211.5063).
7. Hinton, G. (2012). "Lecture 6e: rmsprop: Divide the gradient by a running average of its recent magnitude." Coursera: Neural Networks for Machine Learning.
8. Kingma, D. P., and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." ICLR 2015. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980).
9. You, Y., Gitman, I., and Ginsburg, B. (2017). "Large Batch Training of Convolutional Networks." [arXiv:1708.03888](https://arxiv.org/abs/1708.03888).
10. Loshchilov, I., and Hutter, F. (2019). "Decoupled Weight Decay Regularization." ICLR 2019. [arXiv:1711.05101](https://arxiv.org/abs/1711.05101).
11. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 minutes." ICLR 2020. [arXiv:1904.00962](https://arxiv.org/abs/1904.00962).
12. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." SC20. [arXiv:1910.02054](https://arxiv.org/abs/1910.02054).
13. Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., Lu, Y., and Le, Q. V. (2023). "Symbolic Discovery of Optimization Algorithms." [arXiv:2302.06675](https://arxiv.org/abs/2302.06675).
14. Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. (2023). "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." [arXiv:2305.14342](https://arxiv.org/abs/2305.14342).
15. Zhao, Y. et al. (2023). "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel." [arXiv:2304.11277](https://arxiv.org/abs/2304.11277).
16. PyTorch documentation, `torch.optim` and `torch.distributed.fsdp`. [pytorch.org/docs](https://pytorch.org/docs/stable/optim.html).
17. DeepSpeed ZeRO tutorial. [deepspeed.ai/tutorials/zero](https://www.deepspeed.ai/tutorials/zero/).
18. Optax documentation. [optax.readthedocs.io](https://optax.readthedocs.io/).