Gradient Descent
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v7 · 5,352 words
Gradient descent is a first-order iterative optimization algorithm that minimizes a differentiable loss function by repeatedly stepping in the direction of the negative gradient. It is the foundational optimization method of modern machine learning and deep learning, forming the algorithmic backbone of how neural networks learn from data. At each iteration, parameters are updated by subtracting a positive multiple of the gradient evaluated at the current point; with appropriate step sizes this monotonically reduces the objective value when the function is sufficiently smooth.[1]
Gradient descent and its stochastic, momentum-based, and adaptive variants are responsible for training virtually every modern deep learning model, from convolutional neural networks for image recognition to transformers underlying large language models. The dominant optimizer for modern large language model training is AdamW, a 2017 refinement of Adam that decouples weight decay from the adaptive update.[2]
Imagine you are standing on a hilly field and you want to get to the lowest point, but you are blindfolded. You can feel the ground under your feet to figure out which direction slopes downward. At each step, you move in the steepest downhill direction you can feel. You keep doing this over and over, and eventually you reach the bottom of the valley.
That is what gradient descent does with math. A computer has a big math problem (called a "loss function") and it wants to find the answer that makes the result as small as possible. It checks which direction would make the answer go down the fastest, takes a small step that way, and then checks again. It repeats this many times until it finds a good answer. The size of each step is called the "learning rate." If steps are too big, you might jump right over the lowest point. If steps are too small, it takes forever to get there.
The method of gradient descent was first described by the French mathematician Augustin-Louis Cauchy in 1847. In his short note Méthode générale pour la résolution des systèmes d'équations simultanées, presented to the Académie des Sciences and published in Comptes Rendus, Cauchy proposed an iterative process for solving systems of equations by minimizing a sum of squared residuals: at each step, variables are updated by subtracting a positive multiple of the partial derivatives.[3] Cauchy's motivation was practical astronomy and geodesy; he treated the algorithm as a tool for very large linear systems where direct elimination was infeasible.
Jacques Hadamard independently described a similar gradient-following procedure in 1908.[4] The first rigorous convergence analysis for nonlinear functions was given by Haskell Curry in 1944, establishing the basic theoretical foundation for the algorithm under classical smoothness assumptions.[5]
In 1951, Herbert Robbins and Sutton Monro introduced the stochastic approximation framework in their paper A Stochastic Approximation Method, published in the Annals of Mathematical Statistics.[6] Their algorithm finds roots of an expected function using noisy evaluations, and is mathematically the ancestor of stochastic gradient descent (SGD). Robbins and Monro proved convergence under conditions that have been used ever since: the step sizes must sum to infinity but their squares must be summable.
Boris Polyak introduced the momentum method (also called the "heavy ball" method) in 1964, drawing an analogy to a heavy ball rolling through a potential field with friction.[7] Yurii Nesterov proposed his accelerated gradient method in 1983, achieving the optimal O(1/k^2) convergence rate for first-order methods on smooth convex functions, matching the lower bound from Nemirovski and Yudin.[8]
The link between gradient descent and neural network training was solidified in 1986, when David Rumelhart, Geoffrey Hinton, and Ronald Williams demonstrated that backpropagation could efficiently compute gradients for multi-layer networks, enabling practical training of deep architectures by gradient-based methods.[9] The chain rule had been used for neural learning earlier (notably by Paul Werbos in 1974), but the 1986 paper popularized the approach and inaugurated the modern era of gradient-based neural network training.
The development of adaptive learning rate methods accelerated in the 2010s: AdaGrad (Duchi, Hazan, and Singer, 2011),[10] RMSProp (Hinton, 2012 Coursera lecture),[11] and Adam (Kingma and Ba, 2014, arXiv:1412.6980)[12] each introduced ways to automatically adjust per-parameter learning rates. AdamW (Loshchilov and Hutter, 2017, arXiv:1711.05101)[2] decoupled weight decay from the adaptive gradient step and became the default optimizer for transformer training. More recently the Lion optimizer (Chen et al., 2023, arXiv:2302.06675)[13] was discovered by automated symbolic search and offers competitive performance with lower memory footprint than Adam.
Gradient descent solves the unconstrained minimization problem
minimize L(theta) over theta in R^n
where L: R^n -> R is a differentiable function and theta is a vector of n parameters. In machine learning, L is typically a loss function, for example mean squared error for regression or cross-entropy for classification, averaged over a training dataset of N examples:
L(theta) = (1/N) * sum_{i=1}^{N} l(f(x_i; theta), y_i)
The algorithm only uses the first derivative (the gradient), placing it firmly in the family of first-order methods. Second-order methods such as Newton's method additionally use curvature (Hessian) information; they converge in fewer steps but each step is dramatically more expensive.
The canonical update rule of gradient descent at iteration t is:
theta_{t+1} = theta_t - eta * grad L(theta_t)
where eta > 0 is the learning rate (also denoted alpha) and grad L(theta_t) is the gradient of the loss with respect to the parameters, evaluated at theta_t. The gradient points in the direction of steepest local increase in L; moving in the opposite direction decreases L most rapidly within a small neighborhood.
The learning rate eta controls how far each step moves. It is generally regarded as the most important hyperparameter to tune in deep learning.
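For a function whose gradient is Lipschitz continuous with constant L (in this context L denotes the Lipschitz constant rather than the loss; informally, the gradient does not change too fast), choosing eta <= 1/L guarantees that the loss does not increase at any step.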
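The update rule and the step-size bound can be illustrated with a minimal sketch in plain Python. The quadratic objective, the starting point, and the helper names (`loss`, `grad`, `gradient_descent`) are illustrative assumptions, not part of any library. The gradient of this quadratic has Lipschitz constant 10 (its largest curvature), so the step size is set to 1/10.

```python
def loss(theta):
    """Quadratic bowl 0.5 * (x^2 + 10 * y^2); the curvatures are 1 and 10."""
    x, y = theta
    return 0.5 * (x * x + 10.0 * y * y)


def grad(theta):
    """Gradient of the quadratic above."""
    x, y = theta
    return [x, 10.0 * y]


def gradient_descent(theta0, eta, steps):
    """Apply theta <- theta - eta * grad(theta) and record the loss at each iterate."""
    theta = list(theta0)
    history = [loss(theta)]
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
        history.append(loss(theta))
    return theta, history


# The largest curvature is 10, so eta = 1/10 satisfies the step-size bound.
theta_final, losses = gradient_descent([4.0, 1.0], eta=0.1, steps=50)

# A step size above 2/10 makes the steep direction grow instead of shrink.
_, diverging = gradient_descent([4.0, 1.0], eta=0.25, steps=20)
```

With `eta=0.1` the recorded losses never increase. With `eta=0.25` the iterate overshoots along the steep coordinate at every step and the loss grows, which is the failure mode the step-size bound rules out.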
In practice the gradient is computed by backpropagation, which applies reverse-mode automatic differentiation to the computational graph that defines the model. For a multi-layer neural network, backpropagation reuses the intermediate activations stored during the forward pass to compute the gradient of the scalar loss with respect to all parameters at a cost within a small constant factor of one forward pass.[9] Backpropagation supplies the gradient; gradient descent specifies how that gradient is used to update the parameters.
Near a point theta_t, the loss can be approximated by its first-order Taylor expansion:
L(theta) ~= L(theta_t) + grad L(theta_t)^T (theta - theta_t)
Choosing theta to minimize this linear approximation within a small trust region yields the gradient descent step. The step size eta must be small enough that the linear approximation remains accurate over the proposed move.
In practice, the choice of how much training data is used to estimate the gradient at each step defines three primary variants of gradient descent.
Batch gradient descent (also called vanilla gradient descent or full-batch gradient descent) computes the gradient of the loss with respect to the parameters using the entire training dataset:
theta_{t+1} = theta_t - eta * grad L(theta_t)
where L is the full-dataset average loss. Because it uses the full dataset for each update, batch gradient descent produces smooth, deterministic convergence trajectories and, for a smooth convex loss with a sufficiently small step size, is guaranteed to converge to a global minimum. However, it becomes computationally expensive and memory-intensive for large datasets, since every single parameter update requires a full pass through the data, and it cannot incorporate streaming data.
Stochastic gradient descent (SGD) updates the parameters using the gradient computed on a single randomly chosen training example (x_i, y_i):
theta_{t+1} = theta_t - eta * grad l(theta_t; x_i, y_i)
SGD is much faster per iteration than batch gradient descent and naturally handles very large or streaming datasets. The noise introduced by single-example gradient estimates causes the loss to fluctuate during training; this noise can actually help the optimizer escape shallow local minima and saddle points. The Robbins-Monro convergence conditions require the step size to satisfy sum eta_t = infinity and sum eta_t^2 < infinity, which means the learning rate must be decreased over time for asymptotic convergence in the classical regime.[6]
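A small worked case shows what the Robbins-Monro conditions buy. The sketch below assumes a single-example loss l(theta; x) = 0.5 * (theta - x)^2 and the schedule eta_t = 1/t, which satisfies both conditions; the function name and the synthetic Gaussian samples are illustrative assumptions. Under this schedule, single-example SGD reproduces the running sample mean.

```python
import random


def sgd_running_mean(samples):
    """SGD on l(theta; x) = 0.5 * (theta - x)^2 with step size eta_t = 1 / t."""
    theta = 0.0
    for t, x in enumerate(samples, start=1):
        eta_t = 1.0 / t          # sum of eta_t diverges, sum of eta_t^2 converges
        g = theta - x            # gradient of the single-example loss
        theta -= eta_t * g
    return theta


rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(1000)]   # synthetic data, true mean 3.0
estimate = sgd_running_mean(samples)
sample_mean = sum(samples) / len(samples)
```

Here `estimate` agrees with `sample_mean` up to floating-point rounding. A constant step size would instead leave the iterate fluctuating around the mean indefinitely.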
Mini-batch gradient descent, usually just called "SGD" in modern deep learning, is the practical compromise that combines the benefits of both extremes. It computes gradients over small randomly sampled subsets (mini-batches) of training data, typically between 32 and several thousand examples:
theta_{t+1} = theta_t - eta * (1/b) * sum_{j=1}^{b} grad l(theta_t; x_{i_j}, y_{i_j})
where b is the batch size. Mini-batches reduce gradient variance compared to single-example SGD and exploit hardware parallelism on GPUs and TPUs through batched matrix operations. Almost all modern neural network training uses mini-batch SGD as the underlying iteration, often wrapped by momentum and adaptive schemes.
| Variant | Data per update | Gradient noise | Memory | Convergence stability | Hardware efficiency |
|---|---|---|---|---|---|
| Batch | Entire dataset | None | Very high | Very stable | Poor at large N |
| SGD (single example) | 1 example | Very high | Very low | Unstable | Poor (no parallelism) |
| Mini-batch SGD | b (32-8192) | Moderate | Moderate | Balanced | Excellent (GPU/TPU) |
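The three variants differ only in how many examples feed each gradient estimate, which a single loop can express. The following is a minimal sketch on a one-dimensional linear regression; the function name, the noise-free toy data, and the hyperparameter values are illustrative assumptions. Setting `batch_size=1` gives single-example SGD and `batch_size=len(data)` gives batch gradient descent.

```python
import random


def minibatch_sgd(data, eta, batch_size, epochs, seed=0):
    """Fit y ~ w * x + b by mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)                      # new random order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients of 0.5 * (w * x + b - y)^2.
            gw = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= eta * gw
            b -= eta * gb
    return w, b


# Noise-free toy data drawn from y = 2x + 1 (an assumption for illustration).
points = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(-20, 21)]
w_mini, b_mini = minibatch_sgd(points, eta=0.1, batch_size=8, epochs=200)
w_full, b_full = minibatch_sgd(points, eta=0.1, batch_size=len(points), epochs=200)
```

Both runs recover a slope near 2 and an intercept near 1. The mini-batch run performs six updates per pass over the data where the full-batch run performs one.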
Plain gradient descent can be slow when the loss surface has regions where the curvature is much steeper in one direction than another (so-called "ravines"). In these settings, the gradient oscillates across the narrow dimension while making slow progress along the long dimension. Momentum methods accumulate a velocity vector that smooths out these oscillations and accelerates progress along consistent directions.
The heavy ball method introduced by Polyak[7] adds a fraction of the previous update vector to the current update:
v_t = mu * v_{t-1} + grad L(theta_t)
theta_{t+1} = theta_t - eta * v_t
where mu is the momentum coefficient, typically set to 0.9 or 0.99. The momentum term accelerates descent in directions where the gradient consistently points the same way, while dampening oscillations in directions where the gradient frequently changes sign. Physically, this is analogous to a ball rolling downhill that gains speed and can roll through small bumps and plateaus.
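The effect on a ravine-shaped loss can be sketched as follows. The quadratic with curvatures 1 and 100, the starting point, and the function names are illustrative assumptions; setting `mu=0.0` recovers plain gradient descent for comparison.

```python
CURVATURES = (1.0, 100.0)   # one shallow and one steep direction: a "ravine"


def quad_loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))


def quad_grad(theta):
    return [c * t for c, t in zip(CURVATURES, theta)]


def heavy_ball(theta0, eta, mu, steps):
    """Polyak momentum: v <- mu * v + g, then theta <- theta - eta * v."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = quad_grad(theta)
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [t - eta * vi for t, vi in zip(theta, v)]
    return theta


plain = heavy_ball([1.0, 1.0], eta=0.01, mu=0.0, steps=100)      # plain GD
momentum = heavy_ball([1.0, 1.0], eta=0.01, mu=0.9, steps=100)   # heavy ball
```

After 100 steps the plain run has made little progress along the shallow direction, while the momentum run has reduced the loss by several further orders of magnitude with the same learning rate.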
Nesterov momentum improves on classical momentum by computing the gradient at a "lookahead" position rather than the current position:[8]
v_t = mu * v_{t-1} + grad L(theta_t - eta * mu * v_{t-1})
theta_{t+1} = theta_t - eta * v_t
By evaluating the gradient at the anticipated next position (theta - eta * mu * v), Nesterov's scheme provides a correction factor that reduces overshooting. For smooth convex functions it achieves the optimal first-order convergence rate of O(1/k^2), versus O(1/k) for plain gradient descent. Sutskever et al. (2013) gave a reformulation that simplified implementation in deep learning frameworks.[14] In practice, Nesterov momentum often converges faster and is less prone to overshooting than classical momentum, though the empirical gap is small.
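A minimal sketch of the lookahead update, written to match the two equations above. The test quadratic, the hyperparameters, and the function names are illustrative assumptions; the only difference from classical momentum is the point at which the gradient is evaluated.

```python
CURVATURES = (1.0, 100.0)   # illustrative ill-conditioned quadratic


def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURES, theta))


def grad(theta):
    return [c * t for c, t in zip(CURVATURES, theta)]


def nesterov(theta0, eta, mu, steps):
    """Nesterov momentum: the gradient is taken at theta - eta * mu * v."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        lookahead = [t - eta * mu * vi for t, vi in zip(theta, v)]
        g = grad(lookahead)                       # gradient at the anticipated position
        v = [mu * vi + gi for vi, gi in zip(v, g)]
        theta = [t - eta * vi for t, vi in zip(theta, v)]
    return theta


baseline = nesterov([1.0, 1.0], eta=0.01, mu=0.0, steps=100)   # mu = 0 is plain GD
accelerated = nesterov([1.0, 1.0], eta=0.01, mu=0.9, steps=100)
```

On this quadratic the accelerated run ends at a much lower loss than the baseline for the same number of gradient evaluations.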
A widely read interactive Distill article by Goh, Why Momentum Really Works (2017), provides a geometric explanation of why momentum accelerates ill-conditioned descent and visualizes the dynamics under simple quadratic losses.[15]
A major limitation of vanilla gradient descent is that a single scalar learning rate is applied to every parameter. Adaptive methods maintain per-parameter learning rates that are adjusted based on the history of gradients seen for that parameter.
AdaGrad (Adaptive Gradient) accumulates the squared gradients for each parameter and scales the learning rate inversely by the square root of this sum:[10]
G_t = G_{t-1} + g_t^2 (element-wise)
theta_{t+1} = theta_t - (eta / (sqrt(G_t) + epsilon)) * g_t
where g_t = grad L(theta_t) and epsilon is a small constant (typically 10^-8) for numerical stability.
AdaGrad performs larger updates for infrequent parameters and smaller updates for frequent ones, making it well suited for sparse data and NLP tasks where rare features carry information. Its main weakness is that the accumulated squared gradients grow monotonically, so the effective learning rate shrinks continuously and eventually becomes vanishingly small, stopping learning prematurely.
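Both properties, larger steps for rarely updated parameters and a steadily shrinking step overall, show up in a short simulation. The sketch below assumes a synthetic gradient stream in which the first coordinate receives a gradient of 1.0 at every step and the second only at every fifth step; the function name and values are illustrative.

```python
import math


def adagrad_step(theta, g, G, eta=0.1, eps=1e-8):
    """One AdaGrad update; G is the running sum of squared gradients."""
    G = [Gi + gi * gi for Gi, gi in zip(G, g)]
    theta = [t - eta / (math.sqrt(Gi) + eps) * gi
             for t, Gi, gi in zip(theta, G, g)]
    return theta, G


theta, G = [0.0, 0.0], [0.0, 0.0]
step_sizes = []
for t in range(1, 11):
    g = [1.0, 1.0 if t % 5 == 0 else 0.0]    # second coordinate is "sparse"
    new_theta, G = adagrad_step(theta, g, G)
    step_sizes.append([abs(n - o) for n, o in zip(new_theta, theta)])
    theta = new_theta

dense_steps = [s[0] for s in step_sizes]     # shrinks like eta / sqrt(t)
```

At the tenth step the sparse coordinate moves more than twice as far as the dense one, and the dense coordinate's step has shrunk at every iteration even though its gradient never changed.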
RMSProp (Root Mean Square Propagation) was proposed by Geoffrey Hinton in lecture 6e of his 2012 Coursera course Neural Networks for Machine Learning.[11] It addresses AdaGrad's diminishing learning rate problem by replacing the running sum with an exponentially weighted moving average:
E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
theta_{t+1} = theta_t - (eta / sqrt(E[g^2]_t + epsilon)) * g_t
Hinton recommended rho = 0.9 and eta = 0.001. RMSProp prevents the learning rate from decaying too aggressively while still adapting per parameter; it was widely used for training recurrent neural networks and remains a reliable choice. Tieleman and Hinton's lecture slides are the canonical reference; there was never a peer-reviewed journal paper.
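The contrast with AdaGrad can be sketched by feeding a constant gradient to the update above. The function name and the constant-gradient setup are illustrative assumptions; epsilon is placed inside the square root to match the formula as written here.

```python
import math


def rmsprop_step(theta, g, avg_sq, eta=0.001, rho=0.9, eps=1e-8):
    """One RMSProp update; avg_sq is the moving average E[g^2]."""
    avg_sq = [rho * a + (1.0 - rho) * gi * gi for a, gi in zip(avg_sq, g)]
    theta = [t - eta / math.sqrt(a + eps) * gi
             for t, a, gi in zip(theta, avg_sq, g)]
    return theta, avg_sq


theta, avg_sq = [0.0], [0.0]
step_sizes = []
for _ in range(200):
    new_theta, avg_sq = rmsprop_step(theta, [1.0], avg_sq)   # constant gradient
    step_sizes.append(abs(new_theta[0] - theta[0]))
    theta = new_theta
```

The step size settles at eta rather than decaying toward zero, because the moving average forgets old gradients instead of summing them. The early steps are larger than eta because the average starts at zero and this formulation has no bias correction.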
Adam (Adaptive Moment Estimation) combines momentum (first moment) with adaptive learning rates (second moment).[12] It maintains exponential moving averages of both the gradient and the squared gradient:
m_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t (first moment)
v_t = beta_2 * v_{t-1} + (1 - beta_2) * g_t^2 (second moment)
Because m and v are initialized to zero, they are biased toward zero in early iterations. Adam applies bias correction:
m_hat_t = m_t / (1 - beta_1^t)
v_hat_t = v_t / (1 - beta_2^t)
The parameter update is:
theta_{t+1} = theta_t - eta * m_hat_t / (sqrt(v_hat_t) + epsilon)
Default hyperparameters are beta_1 = 0.9, beta_2 = 0.999, epsilon = 10^-8, and a learning rate of 1e-3, which in practice is often tuned per task in the range 3e-4 to 1e-3. Adam works well across a wide range of problems and requires little hyperparameter tuning, making it one of the most popular optimizers in deep learning.[12] Reddi, Kale, and Kumar (2018) showed that the original Adam may fail to converge for some convex problems and proposed AMSGrad as a fix.[16]
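The moment updates, bias correction, and parameter update can be combined into one step function. This is a minimal sketch on plain Python lists; the function name and the example gradients are illustrative assumptions, and `t` is the 1-based iteration count.

```python
import math


def adam_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [ti - eta * mh / (math.sqrt(vh) + eps)
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# Two coordinates whose gradients differ by six orders of magnitude.
theta, m, v = adam_step([0.0, 0.0], [1000.0, 0.001], [0.0, 0.0], [0.0, 0.0], t=1)
```

After the first step both coordinates have moved by almost exactly eta, regardless of gradient scale. Without bias correction the first step would instead be scaled by (1 - beta_1) / sqrt(1 - beta_2), about 3.2 with the default values.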
AdamW fixes a subtle but consequential issue with how weight decay interacts with adaptive learning rate methods.[2] In standard Adam with L2 regularization, the regularization term is folded into the gradient before the adaptive scaling, so parameters with historically large gradients see effectively less weight decay. AdamW instead applies weight decay as a separate, decoupled step:
theta_{t+1} = theta_t - eta * (m_hat_t / (sqrt(v_hat_t) + epsilon) + lambda * theta_t)
where lambda is the weight decay coefficient applied directly to the parameters. This decoupling makes the optimal weight decay value largely independent of the learning rate and, in the experiments reported by Loshchilov and Hutter, improves generalization.[2] AdamW is now the dominant optimizer for transformer training, including essentially all open large language models.
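The difference between the two forms of weight decay is easiest to see when the loss gradient is zero, so that only the decay term acts. The sketch below works on a single scalar parameter for brevity; the function name, the `decoupled` flag, and the values are illustrative assumptions.

```python
import math


def adam_step(theta, g, m, v, t, eta=1e-3, lam=0.1, decoupled=True,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter with either style of weight decay."""
    if not decoupled:
        g = g + lam * theta            # L2 penalty folded into the gradient
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        update += lam * theta          # AdamW: decay bypasses the adaptive scaling
    return theta - eta * update, m, v


# Zero loss gradient, theta = 1.0: only the decay term moves the parameter.
adamw_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=True)
l2_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, decoupled=False)
```

The decoupled step shrinks the parameter by eta * lambda * theta (here 1e-4), so its size tracks lambda. The L2 variant moves by roughly eta (here 1e-3) whatever the value of lambda, because the adaptive normalization rescales the penalty gradient.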
Lion (Evolved Sign Momentum) was discovered by Google researchers via automated symbolic search and published in 2023.[13] Its update is strikingly simple:
c_t = beta_1 * m_{t-1} + (1 - beta_1) * g_t
theta_{t+1} = theta_t - eta * (sign(c_t) + lambda * theta_t)
m_t = beta_2 * m_{t-1} + (1 - beta_2) * g_t
Lion uses only a single momentum buffer (rather than Adam's two buffers), cutting optimizer state memory roughly in half. The sign operation makes all update magnitudes uniform, which is why Lion typically requires a learning rate 3-10x smaller than Adam at the same task. Reported results show Lion matching or exceeding AdamW on image classification, vision-language pretraining, and language modeling at scale, while saving optimizer memory.
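The three update lines translate directly into code. The following is a minimal sketch; the function names, the default values of eta, beta_1, and beta_2, and the example gradients are illustrative assumptions rather than recommended settings.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(theta, g, m, eta=1e-4, beta1=0.9, beta2=0.99, lam=0.0):
    """One Lion update using a single momentum buffer m."""
    c = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
    theta = [t - eta * (sign(ci) + lam * t) for t, ci in zip(theta, c)]
    m = [beta2 * mi + (1.0 - beta2) * gi for mi, gi in zip(m, g)]
    return theta, m


# Gradients of very different magnitude produce steps of identical size.
theta, m = lion_step([0.0, 0.0], [1000.0, -0.001], [0.0, 0.0])
```

Both coordinates move by exactly eta, in opposite directions, since only the sign of the interpolated gradient enters the update. The momentum buffer itself still tracks gradient magnitudes.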
| Optimizer | Year | Key innovation | Typical use |
|---|---|---|---|
| SGD + momentum | 1964 | Velocity buffer | CNNs |
| Nesterov | 1983 | Lookahead momentum | Convex optimization, CNNs |
| AdaGrad | 2011 | Per-parameter accumulated squared gradients | Sparse data, NLP |
| RMSProp | 2012 | EMA of squared gradients | RNNs |
| Adam | 2014 | Momentum + adaptive rates with bias correction | General default |
| AdamW | 2017 | Decoupled weight decay | Transformers, LLMs |
| Lion | 2023 | Sign-based momentum | Vision and language models, memory-efficient |
Modern deep learning training almost always varies the learning rate over the course of training rather than holding it fixed.
Linear warmup ramps the learning rate from 0 (or a tiny value) up to a target peak over the first W steps, typically 1-5% of total training steps. Warmup is essentially mandatory for transformer training and large-batch training of any kind: at the very start of training the loss surface is poorly conditioned because weights are random, and applying a large learning rate immediately can cause divergence.[17]
Cosine annealing decays the learning rate smoothly from a peak to a minimum following a cosine curve:
eta(t) = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T))
where T is the total number of training steps. Originally proposed by Loshchilov and Hutter in the SGDR paper as a periodic schedule with warm restarts,[17] it is now ubiquitous as a single half-cosine cycle in LLM training. The combination of linear warmup followed by cosine decay to about 10% of the peak rate is the default for most large language models.
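Linear warmup followed by a single half-cosine can be written as one function of the step index. In this sketch the cosine phase is measured from the end of warmup rather than from step 0, which is one common convention and an assumption here; the function name and the example values (peak 3e-4, floor at 10% of peak) are illustrative.

```python
import math


def warmup_cosine(step, total_steps, warmup_steps, eta_max, eta_min):
    """Learning rate at a given step: linear ramp from 0, then half-cosine decay."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))


rates = [warmup_cosine(s, total_steps=1000, warmup_steps=100,
                       eta_max=3e-4, eta_min=3e-5) for s in range(1001)]
```

The rate starts at 0, reaches the peak at step 100, and ends at `eta_min` on the final step.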
Polynomial decay drops the learning rate as eta(t) = eta_max * (1 - t/T)^p for some power p. With p=1 this is linear decay. Other common choices include step decay (multiply by gamma every N epochs), exponential decay, and constant schedules.
| Schedule | Formula (simplified) | Common usage |
|---|---|---|
| Constant | eta_0 | Simple baselines |
| Step decay | eta_0 * gamma^floor(t/N) | CNNs (ResNet, VGG) |
| Exponential | eta_0 * exp(-lambda * t) | General |
| Linear decay | eta_0 * (1 - t/T) | Many LLM variants |
| Cosine | half-cosine to eta_min | Transformers, modern CNNs |
| Linear warmup + cosine | ramp then half-cosine | LLMs (standard) |
| Cosine warm restarts (SGDR) | repeated cosine cycles | Image classification[17] |
| 1-cycle policy | one ramp up and down | Super-convergence (Smith) |
In deep networks, especially recurrent ones, gradients can occasionally be very large. Gradient clipping rescales gradients whose norm exceeds a threshold c so that the rescaled gradient has norm c:
if ||g||_2 > c then g <- c * g / ||g||_2
This norm-clipping strategy was introduced by Pascanu, Mikolov, and Bengio in "On the difficulty of training Recurrent Neural Networks" (arXiv:1211.5063, ICML 2013).[18] It is now standard for transformer training, where a clipping threshold of c = 1.0 is typical. An alternative is value clipping, which clips each coordinate to a range, but norm clipping is preferred because it preserves the gradient's direction.
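The two clipping strategies can be compared directly. This is a minimal sketch on plain Python lists; the function names and the example gradient are illustrative assumptions.

```python
import math


def clip_by_norm(g, c):
    """Rescale g so its L2 norm is at most c; the direction is unchanged."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm > c:
        return [c * gi / norm for gi in g]
    return list(g)


def clip_by_value(g, c):
    """Clamp each coordinate to [-c, c]; this can change the direction."""
    return [max(-c, min(c, gi)) for gi in g]


g = [3.0, 4.0]                      # L2 norm is 5
norm_clipped = clip_by_norm(g, 1.0)
value_clipped = clip_by_value(g, 1.0)
```

Norm clipping returns a vector close to [0.6, 0.8], which keeps the 3:4 ratio between coordinates. Value clipping returns [1.0, 1.0], which points in a different direction.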
L2 regularization adds a term (lambda / 2) * ||theta||^2 to the loss, which contributes a gradient of lambda * theta. Weight decay instead modifies the update rule directly by subtracting a multiple of theta from the parameters at every step. For plain SGD without adaptive rates, the two are mathematically equivalent up to a rescaling of lambda by the learning rate. For adaptive optimizers like Adam, they are not, which is precisely the issue that AdamW fixes by decoupling the weight decay term from the gradient-based update.[2] In modern transformer training the term "weight decay" is essentially always understood in the decoupled (AdamW) sense.
Some methods apply different learning rates to different parts of the network. LARS (You, Gitman, and Ginsburg, 2017)[19] and LAMB (You et al., 2019)[20] scale per-layer learning rates by the ratio of weight norms to gradient norms, which stabilizes large-batch training; LAMB enabled the well-known result of training BERT in 76 minutes on 1024 TPUs. Layer-wise rates are also common in transfer learning: lower layers are often given smaller learning rates than higher layers when fine-tuning a pretrained model.
The convergence rate of gradient descent depends strongly on the structure of the objective function.
| Function class | Method | Convergence rate | Iterations to epsilon |
|---|---|---|---|
| Convex, L-smooth | GD with eta = 1/L | O(1/k) on L(theta_k) - L* | O(1/epsilon) |
| Strongly convex, L-smooth | GD with eta = 1/L | O((1 - mu/L)^k) | O(kappa log(1/epsilon)) |
| Convex, L-smooth | Nesterov | O(1/k^2) | O(1/sqrt(epsilon)) |
| Non-convex, L-smooth | GD | min_k ‖grad L(theta_k)‖^2 = O(1/k) | O(1/epsilon^2) to reach ‖grad L‖ <= epsilon |
| Convex, L-smooth | SGD with decaying eta | O(1/sqrt(k)) in expectation | O(1/epsilon^2) |
For convex functions with an L-Lipschitz gradient, gradient descent with step size 1/L achieves an O(1/k) rate. With additional strong convexity (parameter mu), it converges linearly with rate (1 - mu/L)^k, where kappa = L/mu is the condition number.[1] Nesterov's accelerated method achieves the optimal first-order rate of O(1/k^2) on smooth convex problems.[8]
For non-convex problems, including essentially all deep learning objectives, gradient descent can only guarantee convergence to a stationary point where the gradient is approximately zero. This may be a local minimum, a saddle point, or even a local maximum. SGD on convex problems has a slower O(1/sqrt(k)) rate because of gradient noise. For non-convex SGD, finding an epsilon-stationary point takes O(1/epsilon^4) samples in the worst case.
A striking empirical observation about deep learning is that gradient descent variants find solutions that generalize well, often better than would be predicted from classical learning theory.
SGD has an implicit regularization effect: even without explicit penalty terms it tends to find low-norm or "simple" solutions among the many that fit the training data. Soudry et al. (2018) proved that gradient descent on logistic regression on linearly separable data implicitly converges to the maximum-margin solution, even without explicit margin terms.[21] More general results show that for over-parameterized networks the SGD trajectory is biased toward certain "flat" minimizers.
Hochreiter and Schmidhuber (1997) suggested that flat minima, broad regions of low loss, generalize better than sharp minima (narrow troughs).[22] Keskar et al. (2017) provided empirical evidence that large batch sizes lead to sharper minima and worse generalization, while small-batch SGD tends to find flatter minima.[23] These observations motivate practical training recipes such as moderate batch sizes, learning rate warmup, and methods like Sharpness-Aware Minimization (SAM) that explicitly bias the optimizer toward flat regions.
Other surprising phenomena observed in gradient-descent training of deep networks include double descent (test loss is non-monotonic in model size; Belkin et al. 2019)[24] and the lottery ticket hypothesis (small subnetworks trained from their initial weights can match the full network; Frankle and Carbin 2019)[25]. Both highlight that the interaction between gradient descent and over-parameterized networks departs from classical statistical intuition.
Training a large language model consists of running mini-batch SGD with adaptive optimizers on hundreds of billions or trillions of tokens. The dominant recipe is remarkably consistent across published large models including GPT-3, LLaMA, Mistral, Qwen, and DeepSeek.[26][27]
AdamW is essentially the universal optimizer for LLM pretraining and supervised fine-tuning. Typical hyperparameters for LLM training: beta_1 = 0.9, beta_2 = 0.95 (lower than the 0.999 default to reduce instability at large scale), epsilon = 1e-8, weight decay = 0.1, peak learning rate around 1e-4 to 3e-4 for 1B-100B parameter models (smaller for larger models), with linear warmup over the first 0.1-2% of steps followed by cosine decay to about 10% of the peak.[26]
LLMs are trained almost exclusively in mixed precision. The two dominant formats are FP16 (half precision, 5-bit exponent) and BF16 (bfloat16, 8-bit exponent matching FP32's range). BF16 has become the standard for transformer training because its wider exponent range avoids the underflow/overflow problems that plague FP16 at scale; FP16 requires loss scaling to maintain dynamic range.[28] The model parameters and optimizer states are typically maintained in FP32 (the "master weights"), while activations and gradients are computed in BF16 or FP16 to halve memory and double arithmetic throughput on supported hardware.
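The underflow problem and the effect of loss scaling can be reproduced with the standard library, which can round a float to IEEE half precision through the `struct` format code `e`. The helper name, the example gradient value, and the scale factor of 1024 are illustrative assumptions; BF16 is not covered because `struct` has no format code for it.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8                            # below the smallest positive FP16 value
lost = to_fp16(tiny_grad)                   # underflows to zero

loss_scale = 1024.0                         # illustrative power-of-two scale
scaled = to_fp16(tiny_grad * loss_scale)    # large enough to survive in FP16
recovered = scaled / loss_scale             # unscale in higher precision
```

The unscaled gradient is stored as zero, while the scaled copy survives and is recovered to within about one percent after unscaling. Frameworks that implement loss scaling typically adjust the scale dynamically rather than fixing it.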
AdamW stores two optimizer state tensors per parameter, the first moment m and the second moment v, each typically in FP32. For a model with N parameters, the FP32 master weights (4N bytes), the FP32 gradients (4N bytes), and the two FP32 moments (8N bytes) together require 16N bytes of training state, of which 8N bytes are the optimizer moments; any low-precision working copy of the weights comes on top of this. For a 7B parameter model this amounts to about 112 GB of GPU memory before activations are counted, which motivates memory-saving variants such as 8-bit Adam (Dettmers et al., 2022),[29] sharded optimizers like ZeRO (Rajbhandari et al., 2020),[30] and Lion (which uses only one moment buffer).
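The 16N figure is simple arithmetic, sketched below. The function name and its parameters are illustrative assumptions; the single-buffer case assumes that the one moment buffer is also kept in FP32.

```python
def training_state_bytes(n_params, bytes_master=4, bytes_grad=4,
                         bytes_moment=4, n_moments=2):
    """Bytes for master weights, gradients, and optimizer moment buffers."""
    return n_params * (bytes_master + bytes_grad + n_moments * bytes_moment)


n_params = 7_000_000_000
adamw_gb = training_state_bytes(n_params) / 1e9                    # two moment buffers
single_buffer_gb = training_state_bytes(n_params, n_moments=1) / 1e9
moments_only_gb = n_params * 2 * 4 / 1e9                           # m and v alone
```

For 7 billion parameters this gives 112 GB with two moment buffers, 84 GB with one, and 56 GB for the two Adam moments on their own.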
When the desired effective batch size exceeds what fits in GPU memory, gradients from several forward-backward passes are summed (or averaged) before a single optimizer step. This decouples the micro-batch (what fits on a device) from the global batch used for the update. Most modern LLM training uses gradient accumulation in combination with data parallelism, pipeline parallelism, and tensor parallelism to reach global batch sizes of several million tokens per step.
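That accumulation reproduces the large-batch gradient can be checked on a toy problem. The sketch below assumes a one-parameter least-squares model and hypothetical helper names; each micro-batch gradient is weighted by its share of the examples so that a smaller final micro-batch is still handled correctly.

```python
def mse_grad(w, batch):
    """Gradient with respect to w of the mean of 0.5 * (w * x - y)^2 over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients into the gradient of the full batch."""
    total = 0.0
    for start in range(0, len(batch), micro_batch_size):
        micro = batch[start:start + micro_batch_size]
        # Weight by the micro-batch's share of the examples.
        total += mse_grad(w, micro) * len(micro) / len(batch)
    return total


batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.9), (6.0, 12.1), (7.0, 13.8)]
full = mse_grad(0.5, batch)
accumulated = accumulated_grad(0.5, batch, micro_batch_size=2)
```

The two values agree up to floating-point rounding, so a single optimizer step taken after accumulation matches a step on the full batch, provided the model has no batch-dependent layers such as batch normalization.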
Norm-based gradient clipping with c = 1.0 is universal in LLM training. Without it, occasional spike batches (often related to badly conditioned data) cause loss spikes from which training may not recover.
In high-dimensional spaces, saddle points (where the gradient is zero but the point is neither a local minimum nor a local maximum) vastly outnumber local minima.[31] Some directions curve upward and some curve downward. Plain gradient descent slows dramatically near saddle points because the gradient vanishes; in practice momentum-based methods and noise from stochastic gradients usually let the optimizer escape.
In non-convex optimization, gradient descent may converge to a local minimum rather than the global minimum. However, theoretical and empirical work on over-parameterized networks suggests that in the regime where the number of parameters greatly exceeds the number of training examples, most local minima have loss close to the global minimum and qualitatively similar generalization performance.[32]
Gradient-based optimizers, especially SGD with momentum, can be extremely sensitive to learning rate, batch size, weight decay, warmup length, and schedule shape. Adaptive methods like AdamW are noticeably more robust to misspecified learning rates, which is part of why they dominate LLM training where the cost of a misspecified run is enormous.
The memory footprint of optimizer states grows linearly with the number of parameters. For AdamW this is up to four times the size of the model in FP32 (master weights + gradients + two moments). For 100B+ parameter models, optimizer states often dominate GPU memory, motivating ZeRO sharding, CPU offloading, and Lion-style single-moment methods.
When different directions of the loss surface have very different curvatures (large condition number), plain gradient descent zigzags across narrow valleys and progresses slowly. Batch normalization, proper weight initialization, and adaptive methods all help reduce the practical impact of ill-conditioning.
In very deep networks, gradients can become extremely small (vanishing) or extremely large (exploding) as they propagate backward through many layers. Standard mitigations include gradient clipping, residual connections (as in ResNet), careful initialization (Xavier/Glorot, He), normalization layers (batch normalization, layer normalization, RMSNorm), and activation functions like ReLU that do not saturate.