See also: Machine learning terms, Optimizer, Gradient descent
Momentum is an optimization technique that accelerates gradient descent by accumulating an exponentially weighted moving average of past gradients into a velocity vector and using that velocity, rather than the raw gradient, to update model parameters. The accumulated velocity smooths out high-frequency oscillations in the gradient signal, builds speed along directions of consistent descent, and helps the optimizer push through flat regions and shallow local minima of the loss surface.
The technique was first introduced by Soviet mathematician Boris Polyak in 1964 as the "heavy ball method," drawing on an analogy from classical mechanics in which a heavy ball rolls under gravity across a curved surface, retaining inertia from previous motion [1]. Yurii Nesterov later proposed an accelerated variant in 1983 that evaluates the gradient at a look-ahead position and achieves a provably optimal convergence rate for smooth convex problems [2]. Both ideas were brought into the modern deep learning era by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton in their 2013 paper "On the Importance of Initialization and Momentum in Deep Learning," which showed that carefully tuned momentum schedules let plain stochastic gradient descent reach performance levels previously achievable only with expensive second-order methods [3].
In 2026, almost every popular optimizer used to train neural networks contains momentum somewhere inside it. SGD with momentum is still the workhorse for many computer vision pipelines. Adam and AdamW treat their first-moment estimate as a momentum buffer with decay rate beta_1 (typically 0.9 or 0.95). NAdam adds Nesterov look-ahead on top of Adam. The Lion optimizer keeps only a single momentum buffer and takes the sign of an interpolation between the current gradient and that buffer. Even Schedule-Free Adam, introduced by Defazio and colleagues in 2024, can be viewed as replacing the explicit momentum buffer with an interpolation between an averaged iterate and a fresh one [4]. Momentum is the connective tissue of modern optimization.
Boris Polyak introduced the heavy ball method in a 1964 paper titled "Some methods of speeding up the convergence of iteration methods," published in USSR Computational Mathematics and Mathematical Physics [1]. The motivation was concrete: pure gradient descent converges painfully slowly on ill-conditioned quadratic problems, where the loss surface looks like a long narrow valley. Polyak observed that adding inertia, by including a term proportional to the previous step in the new step, allows the iterate to behave like a ball with mass rolling under gravity. The ball overshoots the bottom of the valley once or twice but quickly settles, instead of zigzagging across the walls forever.
For a smooth strongly convex quadratic with condition number kappa equal to L over mu (the ratio of the largest to smallest eigenvalue of the Hessian), Polyak proved that the heavy ball method achieves a linear convergence rate of approximately ((sqrt(kappa) - 1) / (sqrt(kappa) + 1))^k, where k is the iteration count [5]. Vanilla gradient descent on the same problem converges only at the rate ((kappa - 1) / (kappa + 1))^k. The improvement is roughly the square root of the condition number, which is enormous for the ill-conditioned problems that arise in scientific computing and, later, in deep learning.
The optimal momentum coefficient and step size for a quadratic with smoothness L and strong convexity mu (here mu denotes the strong convexity constant, not the momentum coefficient) are alpha_star = 4 / (sqrt(L) + sqrt(mu))^2 and beta_star = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))^2 [5]. In practice the exact eigenvalues of the Hessian are not known, so the momentum coefficient beta is treated as a hyperparameter and tuned, but the formulas explain why values close to 1 are optimal for very ill-conditioned problems.
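As a concrete illustration, here is a standalone NumPy sketch (not code from the cited papers) that runs optimally tuned gradient descent and heavy ball on an ill-conditioned diagonal quadratic with kappa = 100:

```python
import numpy as np

# Diagonal quadratic L(theta) = 0.5 * sum_i h_i * theta_i^2 with Hessian
# eigenvalues between mu_sc = 1 (strong convexity) and L_sm = 100 (smoothness).
mu_sc, L_sm = 1.0, 100.0
h = np.array([mu_sc, 3.0, 10.0, 30.0, L_sm])

def grad(theta):
    return h * theta

# Polyak's optimal heavy ball coefficients for this problem.
alpha = 4.0 / (np.sqrt(L_sm) + np.sqrt(mu_sc)) ** 2
beta = ((np.sqrt(L_sm) - np.sqrt(mu_sc)) / (np.sqrt(L_sm) + np.sqrt(mu_sc))) ** 2

theta_gd = np.ones(5)
theta_hb = np.ones(5)
theta_prev = np.ones(5)
for _ in range(200):
    # Gradient descent with its own optimal step size 2 / (L + mu).
    theta_gd = theta_gd - (2.0 / (L_sm + mu_sc)) * grad(theta_gd)
    # Heavy ball: gradient step plus an inertia term from the previous move.
    theta_next = theta_hb - alpha * grad(theta_hb) + beta * (theta_hb - theta_prev)
    theta_prev, theta_hb = theta_hb, theta_next

print(np.linalg.norm(theta_gd))  # ~1e-2: GD is still far from the optimum at 0
print(np.linalg.norm(theta_hb))  # many orders of magnitude smaller
```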
In 1983, Yurii Nesterov, then at the Central Economic Mathematical Institute in Moscow, published "A method for solving the convex programming problem with convergence rate O(1/k^2)" in Doklady Akademii Nauk SSSR [2]. Nesterov modified Polyak's update by evaluating the gradient at a look-ahead point instead of at the current iterate. This small change is what gives the method its name: Nesterov accelerated gradient (NAG).
For smooth convex (not necessarily strongly convex) functions, Nesterov proved that NAG converges in objective value at rate O(1/k^2), while plain gradient descent converges only at O(1/k) [6]. This rate is provably optimal for any first-order method on this class of problems, in the sense that no algorithm using only gradients (and not curvature) can do better in the worst case. Nesterov's result is one of the foundational theorems of convex optimization and is the reason "acceleration" usually refers to look-ahead momentum methods rather than to ordinary momentum.
Momentum entered mainstream machine learning through David Rumelhart, Geoffrey Hinton, and Ronald Williams's 1986 paper on backpropagation, "Learning representations by back-propagating errors," published in Nature [7]. The authors included a momentum term in the SGD update rule used for training their networks, citing it as a practical trick to speed up learning and dampen oscillations in the loss. Momentum quickly became a standard component of every neural network training recipe in the late 1980s and 1990s.
The role of momentum in deep network training was systematically reexamined by Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton in 2013, in their ICML paper "On the Importance of Initialization and Momentum in Deep Learning" [3]. The paper made two main empirical contributions. First, it showed that carefully tuned classical and Nesterov momentum, combined with a well-designed random initialization (notably the sparse initialization scheme they proposed), could train deep networks and recurrent networks on tasks that previously required Hessian-free second-order methods. Second, it argued that for stochastic optimization Nesterov's look-ahead variant tends to be more stable than classical momentum, especially when the momentum coefficient mu is large.
A practical recommendation from the paper was a momentum schedule that increases mu from a smaller starting value (around 0.5) up to a value close to 1 (often 0.99) over the course of training. This schedule mirrors the role of warmup in modern transformer training: while the velocity is small and the gradient direction is poorly informed, only modest momentum is helpful; as training stabilizes, larger momentum extracts more information from the accumulated gradient history.
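A minimal sketch of such a schedule (the linear ramp, endpoints, and helper name here are illustrative; the paper's own schedule has a different shape):

```python
def momentum_schedule(step, total_steps, mu_start=0.5, mu_end=0.99):
    """Ramp the momentum coefficient from mu_start to mu_end over training.
    Illustrative linear ramp, not the exact schedule from Sutskever et al."""
    frac = min(step / total_steps, 1.0)
    return mu_start + frac * (mu_end - mu_start)

# With a PyTorch SGD optimizer, the coefficient can be updated in place:
# for group in optimizer.param_groups:
#     group["momentum"] = momentum_schedule(step, total_steps)
```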
Sutskever's paper is also notable as one of the early signals that careful first-order methods can match second-order ones for deep learning. Within a year, Adam had been published, and within a few years momentum-based optimizers had displaced Hessian-free methods almost entirely in mainstream practice.
Let theta_t denote the parameter vector at iteration t, let g_t = nabla L(theta_t) denote the gradient of the loss function, let eta denote the learning rate, and let mu denote the momentum coefficient. The classical (Polyak) momentum update introduces a velocity vector v_t and replaces the standard SGD step with two coupled updates.
A standard form of classical momentum, written with the learning rate kept outside the velocity, is:
```
v_t = mu * v_(t-1) + g_t
theta_t = theta_(t-1) - eta * v_t
```
The velocity v_t accumulates an exponentially decaying sum of past gradients. Unrolling the recurrence gives:
```
v_t = g_t + mu * g_(t-1) + mu^2 * g_(t-2) + mu^3 * g_(t-3) + ...
```
A gradient from k steps in the past contributes with weight mu^k. With mu = 0.9, the contribution of a gradient drops below 5% of its initial value after about 30 steps. With mu = 0.99, the same drop takes about 300 steps. Some references write the same update with an explicit (1 - mu) factor in front of g_t (so v_t becomes a true exponential moving average that is bounded in expectation), which is the convention used inside Adam and most recent optimizers; the two forms are mathematically equivalent up to a rescaling of the learning rate. The differences are summarized in the implementation table below.
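The equivalence is easy to check numerically; a standalone sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.normal(size=100)   # an arbitrary fixed sequence of scalar gradients
mu, lr = 0.9, 0.1

v_raw, v_ema = 0.0, 0.0
for g in grads:
    v_raw = mu * v_raw + g               # raw accumulation (PyTorch-style)
    v_ema = mu * v_ema + (1 - mu) * g    # EMA form (Adam-style first moment)
    # The EMA velocity is exactly (1 - mu) times the raw velocity, so the
    # same parameter step results once the learning rate is rescaled.
    assert np.isclose(lr * v_raw, (lr / (1 - mu)) * v_ema)
```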
Nesterov's variant evaluates the gradient at a look-ahead point that anticipates where momentum will take the parameters next:
```
theta_lookahead = theta_(t-1) - eta * mu * v_(t-1)
v_t = mu * v_(t-1) + g(theta_lookahead)
theta_t = theta_(t-1) - eta * v_t
```
The gradient g(theta_lookahead) is evaluated at theta_(t-1) shifted by the momentum term, so the velocity update sees what the gradient looks like at the position momentum is about to drag the iterate to. If that point is past a minimum, the gradient at the look-ahead position points back toward the minimum and corrects the velocity before the overshoot becomes severe. Sutskever, Martens, Dahl, and Hinton (2013) reformulated this update via a change of variables so that the gradient is evaluated at the current parameters rather than at a shifted point, which is the form used by PyTorch and most modern frameworks [3].
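A minimal sketch of the reformulated update in the PyTorch convention (one gradient evaluation at the current parameters per step; the function name is illustrative):

```python
def nesterov_sgd_step(theta, grad, buf, lr=0.01, mu=0.9):
    """One Nesterov momentum step, PyTorch-style reformulation: the velocity
    is updated with the gradient at the current point, and the look-ahead is
    folded into the step direction grad + mu * buf."""
    buf = mu * buf + grad
    step = grad + mu * buf
    return theta - lr * step, buf
```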
All major frameworks implement momentum in slightly different but mathematically equivalent ways. The differences matter when comparing recipes between codebases or when interpreting hyperparameters trained in one framework and reused in another.
| Framework | Velocity update | Parameter update | Notes |
|---|---|---|---|
| Sutskever et al. 2013 (textbook) | v <- mu * v + lr * g | theta <- theta - v | Learning rate folded into velocity; changing lr also changes effective momentum |
| PyTorch torch.optim.SGD | v <- mu * v + g | theta <- theta - lr * v | Learning rate kept outside velocity; momentum stays invariant when lr changes |
| TensorFlow tf.keras.optimizers.SGD | v <- mu * v + g | theta <- theta - lr * v | Matches PyTorch convention |
| Optax optax.sgd | v <- mu * v + g | theta <- theta - lr * v | Matches PyTorch convention |
| Exponential moving average form | v <- mu * v + (1 - mu) * g | theta <- theta - lr * v | Used inside Adam's first moment estimate |
PyTorch's documentation explicitly notes that its update differs from Sutskever et al., and the difference is sometimes subtle enough to cause confusion when porting code [8]. PyTorch also initializes the momentum buffer to the first observed gradient rather than to zero, which avoids a "warm-up" effect on the first step. The exponential moving average form is the one used by Adam, which keeps the magnitude of v bounded by the expected magnitude of the gradient and makes bias correction necessary at the start of training.
Momentum is named for the physical quantity. The most common pedagogical analogy compares the optimizer to a heavy ball rolling across a surface defined by the loss function under the influence of gravity. The gradient g_t plays the role of the local force on the ball, the velocity v_t plays the role of the ball's actual momentum, and the momentum coefficient mu plays the role of one minus the friction coefficient. With mu = 0 the ball is fully damped and stops at every step (this is plain gradient descent). With mu close to 1 there is almost no friction, and the ball can roll for a long time on its accumulated speed.
A more precise interpretation comes from viewing the momentum update as a discretization of a continuous-time ordinary differential equation. The classical heavy ball method corresponds, in the small-step-size limit, to a damped second-order ODE of the form theta'' + a * theta' + nabla L(theta) = 0, where a is a friction coefficient determined by mu. This is the equation of motion of a particle of unit mass moving in a potential L under viscous drag, which makes the physical analogy literal rather than just illustrative. Nesterov's method corresponds to a closely related ODE with time-varying friction whose decay rate produces the O(1/k^2) acceleration in the discrete algorithm [9].
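To see the correspondence, substitute finite differences with step size h into the ODE: (theta_(t+1) - 2*theta_t + theta_(t-1)) / h^2 for theta'' and (theta_t - theta_(t-1)) / h for theta'. Rearranging gives theta_(t+1) = theta_t - h^2 * nabla L(theta_t) + (1 - a*h) * (theta_t - theta_(t-1)), which is exactly the heavy ball update with eta = h^2 and mu = 1 - a*h: low friction a corresponds to momentum close to 1.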
One consequence of the physical picture is that momentum can be understood as a low-pass filter on the gradient signal. High-frequency oscillations (which often correspond to the optimizer bouncing across a narrow ravine) cancel out in the velocity, while low-frequency components (which correspond to a consistent direction of descent down the valley) reinforce. This is the same intuition that explains why physical heavy balls do not oscillate when rolled into a smooth bowl but instead settle to the bottom after a few overshoots.
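The filtering claim is easy to check numerically; a standalone sketch with a synthetic gradient signal:

```python
import numpy as np

# Synthetic gradient stream: a constant descent component plus a large
# high-frequency oscillation, as in a narrow ravine.
t = np.arange(500)
grads = 1.0 + 5.0 * np.sin(t)
mu = 0.9

v = np.zeros_like(grads)
for i in range(1, len(grads)):
    v[i] = mu * v[i - 1] + (1 - mu) * grads[i]   # EMA form of the velocity

print(grads.std(), v.std())    # the oscillation is attenuated by roughly 10x
print(grads.mean(), v.mean())  # the consistent component passes through
```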
Momentum addresses several specific weaknesses of plain gradient descent that show up in deep learning loss surfaces.
If the loss surface curves much more sharply along one direction than another, the gradient is dominated by the steep direction. Plain gradient descent oscillates back and forth across the steep axis of the valley while making slow progress along the gentle axis. Momentum cancels these oscillatory components in the velocity (because successive gradients along the steep axis point in opposite directions) while reinforcing the consistent component along the gentle axis. The net effect is faster progress along the valley floor and smaller side-to-side jitter.
When the gradient points in roughly the same direction for several consecutive steps, the velocity accumulates that direction additively. After many such steps, the magnitude of v can be much larger than the magnitude of any single gradient. The effective step size in that direction grows like a geometric series with ratio mu, which sums to roughly 1 / (1 - mu) times the per-step contribution. This is the source of the heuristic that high momentum amplifies the effective learning rate.
If the optimizer enters a flat plateau where the gradient is very small, plain SGD makes only tiny steps and may stall. With momentum, the velocity built up before the plateau carries the iterate across it. The same intuition applies to small bumps and shallow local minima: the inertia of the velocity can carry the iterate past a saddle point or a poor local minimum that would otherwise trap it.
For a strongly convex quadratic with condition number kappa, plain gradient descent needs O(kappa * log(1/epsilon)) iterations to reach an epsilon-accurate solution. Heavy ball momentum reduces this to O(sqrt(kappa) * log(1/epsilon)), and Nesterov's accelerated variant reaches the same rate with simpler tuning [5][6]. In effect, momentum takes the square root of the condition number. For loss surfaces with kappa in the millions (which is typical of deep networks), this is a difference of three orders of magnitude in iteration count.
For a constant gradient g, the steady-state velocity in the Polyak update is v_inf = g / (1 - mu), and the corresponding effective step is eta * g / (1 - mu). With mu = 0.9 the effective step is roughly 10 times the per-iteration learning rate, and with mu = 0.99 it is roughly 100 times. This factor explains why increasing momentum often requires decreasing the learning rate to keep training stable [10].
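The steady state is easy to verify directly (illustrative snippet):

```python
mu, g = 0.9, 1.0
v = 0.0
for _ in range(200):
    v = mu * v + g        # Polyak velocity update under a constant gradient
print(v, g / (1 - mu))    # both ~10.0: the velocity saturates at g / (1 - mu)
```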
On smooth convex objectives, the convergence rates of gradient descent, heavy ball momentum, and Nesterov acceleration form a clean hierarchy.
| Method | Smooth convex (rate in objective value) | Smooth strongly convex (linear rate per iteration) |
|---|---|---|
| Gradient descent | O(1/k) | (kappa - 1) / (kappa + 1) |
| Polyak heavy ball | O(1/k) globally; matches Nesterov on quadratics | (sqrt(kappa) - 1) / (sqrt(kappa) + 1) |
| Nesterov accelerated gradient | O(1/k^2) | 1 - 1/sqrt(kappa) (asymptotically the same) |
The numbers in this table are worst-case rates; problems with structure can be much easier in practice. Polyak's method does not enjoy a universal O(1/k^2) rate on all smooth convex problems (it achieves it only on quadratics and other problems with extra structure), while Nesterov's method does. This is the technical reason Nesterov's algorithm is referred to as "accelerated" while Polyak's is not, even though the practical performance of the two methods is often very close on real problems [9].
Deep learning loss surfaces are non-convex, and worst-case rates for non-convex problems are weaker. Recent theoretical work has shown that SGD with momentum converges to a stationary point on smooth non-convex problems under mild assumptions, with rates that depend on the noise model. None of these guarantees imply that momentum finds a global minimum, and in practice the question of whether momentum-based optimizers find good local minima is mostly empirical. The empirical answer, across thousands of published deep learning experiments, is yes: momentum is part of essentially every well-performing recipe.
A sharper way to state the empirical evidence comes from Sutskever et al. (2013), who observed that on several non-convex deep learning problems, momentum was necessary (not just helpful) to reach good performance; without momentum, even with carefully tuned learning rates, the same architectures failed to train. The interaction with initialization is tight: a poor initialization can make momentum unstable, and a good initialization can make momentum decisive [3].
Momentum adds one hyperparameter to gradient descent: the coefficient mu (also called beta or alpha in different sources). Choosing a value is usually straightforward, but it interacts with the learning rate in ways that matter for stability.
| mu value | Effective gradient window | Effective step amplification | Typical use |
|---|---|---|---|
| 0 | 1 | 1x | Pure gradient descent; baseline only |
| 0.5 | 2 | 2x | Very noisy or unstable training |
| 0.8 | 5 | 5x | Light momentum, transition setting |
| 0.9 | 10 | 10x | Standard default for SGD-momentum and Adam beta_1 |
| 0.95 | 20 | 20x | Common in LLM pretraining (beta_1 = 0.95 sometimes) |
| 0.99 | 100 | 100x | Well-conditioned problems, large batch, long training |
| 0.999 | 1,000 | 1000x | Rare; only for exceptionally smooth problems |
A few heuristics, drawn from optimization theory and modern training practice, help in choosing mu.
For SGD with momentum, mu = 0.9 is a strong default, used throughout most of the ResNet, VGG, and modern image classification literature. mu = 0.99 is occasionally used on harder problems (especially with very large batch sizes), but it almost always requires reducing the learning rate proportionally and using gradient clipping.
For Adam and AdamW, beta_1 = 0.9 is the default proposed in the original paper [11]. For large transformer pretraining, the convention has shifted toward beta_1 = 0.9 with beta_2 = 0.95 (rather than the textbook 0.999). DeepSeek V3 uses (beta_1, beta_2) = (0.9, 0.95) with weight decay 0.1 [12]. Llama 2 uses the same setting [13]. The shift is partly because the gradient distribution changes substantially during pretraining of a large model, so a shorter second-moment window is preferred; the first-moment beta_1 = 0.9 is essentially unchanged from defaults.
Learning rate warmup is almost always combined with high momentum. In the first few hundred to few thousand steps, the velocity buffer is small and the gradient signal is unreliable. A linear or cosine warmup phase from zero up to the peak learning rate gives the velocity time to stabilize before the optimizer starts taking large effective steps. This is the same reason RAdam (which corrects the variance of the adaptive step size at the start of training) often removes the need for explicit warmup.
A more aggressive pattern, popularized by Leslie Smith's 1cycle policy, varies momentum and learning rate inversely: as the learning rate ramps up, momentum decreases (often from 0.95 to 0.85), and as the learning rate ramps down, momentum increases back. The intuition is that high learning rate plus high momentum is unstable, so the two should not peak at the same time.
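A sketch of such an inverse schedule (the cosine shape, endpoints, and function name are illustrative; real 1cycle implementations differ in detail):

```python
import math

def one_cycle(step, total_steps, lr_max=0.1, lr_min=1e-4,
              mu_max=0.95, mu_min=0.85):
    """Illustrative 1cycle-style schedule: lr and momentum move inversely."""
    half = total_steps / 2
    if step < half:                       # first half: lr up, momentum down
        frac = step / half
    else:                                 # second half: lr down, momentum up
        frac = (total_steps - step) / half
    ramp = (1 - math.cos(math.pi * frac)) / 2   # smooth 0 -> 1 along the ramp
    lr = lr_min + (lr_max - lr_min) * ramp
    mu = mu_max - (mu_max - mu_min) * ramp
    return lr, mu
```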
The simplest and most widely used form is plain SGD with momentum, which adds a single velocity buffer to vanilla SGD. It dominated deep learning training from the late 1980s until the rise of Adam around 2015 and is still the preferred optimizer in many computer vision pipelines, particularly for training CNNs like ResNet. SGD with momentum is generally considered to find solutions that generalize as well as or better than Adam on image classification benchmarks, although the gap has narrowed substantially since AdamW decoupled weight decay from the adaptive normalization.
Adam (Kingma and Ba, 2015) maintains two exponential moving averages: one of the gradient itself (the first moment m_t with decay beta_1) and one of the squared gradient (the second moment v_t with decay beta_2) [11]. The first moment plays exactly the role of momentum, and the default beta_1 = 0.9 corresponds to the same effective gradient window as SGD with mu = 0.9. The second moment provides per-parameter adaptive learning rate scaling, which is what differentiates Adam from plain SGD with momentum. Both moments use the (1 - beta) form of the EMA, and both are bias-corrected to compensate for their initialization at zero.
AdamW (Loshchilov and Hutter, 2019) is the variant of Adam used by essentially every modern LLM and vision transformer recipe. It decouples weight decay from the adaptive parameter update, applying weight decay directly to the parameters after the Adam step rather than adding it to the loss as an L2 penalty. The momentum mechanism (the first-moment EMA) is unchanged from Adam.
NAdam (Nesterov-accelerated Adaptive Moment Estimation) was introduced by Timothy Dozat in a workshop paper at ICLR 2016 [14]. It modifies Adam to use Nesterov's look-ahead step inside the first-moment update rather than a classical momentum step. Concretely, the bias-corrected first moment is replaced by an interpolation that incorporates the current gradient before applying the parameter update. NAdam often produces slightly faster convergence than vanilla Adam at the same hyperparameters, with no extra memory, but in modern deep learning AdamW has overshadowed it.
Lion (EvoLved Sign Momentum) was introduced by Xiangning Chen and colleagues at Google in 2023 and was discovered through an automated program search over the space of possible optimizer update rules [15]. Lion stores only a momentum buffer (no second moment), updates that buffer with classical momentum, and then applies the sign of an interpolation between the current gradient and the momentum to update the parameters. Because the parameter update has unit magnitude per coordinate, the actual step size is determined entirely by the learning rate and weight decay. Lion uses about half the optimizer memory of Adam and has been shown to match or outperform AdamW on diffusion model training, image classification, and language model pretraining at comparable budget.
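A sketch of the Lion update rule as described in the paper (the beta defaults follow the paper's suggested values; the function name is illustrative, and lr and wd are placeholders):

```python
import numpy as np

def lion_step(theta, grad, m, lr, wd, beta1=0.9, beta2=0.99):
    """One Lion update: take the sign of an interpolation between the
    current gradient and the momentum buffer, then refresh the buffer."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + wd * theta)   # sign step + decoupled decay
    m = beta2 * m + (1 - beta2) * grad           # the single momentum buffer
    return theta, m
```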
Schedule-Free optimization, introduced by Aaron Defazio and colleagues in 2024 in "The Road Less Scheduled," replaces the explicit momentum buffer with a combination of interpolation and iterate averaging that does not require a learning rate schedule [4]. The Schedule-Free variant of AdamW won the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge in the self-tuning track. The mechanism is closely related to Nesterov-style averaging and is best understood as another point in the design space of momentum-like accumulation strategies, not as the elimination of momentum. The optimizer is available in PyTorch via the open source schedulefree package.
Momentum or its analogues appear inside almost every adaptive optimizer in common use, including RMSProp (which includes momentum as an optional term), AdaDelta, Adafactor, AdaBelief, LAMB, LARS, and Sophia. The single common thread is that all of them maintain at least one running average of past gradients to smooth the optimizer's trajectory.
Momentum is exposed by every major deep learning framework as a parameter on the SGD optimizer, with a separate flag for Nesterov.
In PyTorch:
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Classical Polyak momentum
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    weight_decay=1e-4,
)

# Nesterov accelerated gradient
optimizer_nag = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True,
    weight_decay=1e-4,
)
```
In TensorFlow / Keras:
```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True,
)
```
In JAX using Optax:
```python
import optax

optimizer = optax.sgd(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True,
)
```
In the Schedule-Free PyTorch package:
```python
import schedulefree

optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=1e-2,
)
```
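One practical detail of the schedulefree package: its documentation requires switching the optimizer between training and evaluation modes explicitly, because it maintains interpolated parameter sequences. A sketch of the intended usage (model, train_loader, and compute_loss are placeholders):

```python
optimizer.train()   # switch to training mode before taking optimizer steps
for batch in train_loader:
    optimizer.zero_grad()
    loss = compute_loss(model, batch)   # placeholder loss computation
    loss.backward()
    optimizer.step()

optimizer.eval()    # switch to eval mode before validation or checkpointing
```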
Key parameters in PyTorch's SGD are summarized below.
| Parameter | Description | Default |
|---|---|---|
| lr | Learning rate | Required |
| momentum | Momentum coefficient mu | 0 (no momentum) |
| dampening | Dampening factor on the gradient term | 0 |
| nesterov | Use Nesterov look-ahead (requires momentum > 0, dampening = 0) | False |
| weight_decay | Coupled L2 regularization (use AdamW for decoupled) | 0 |
The dampening parameter is unusual and appears mostly for backwards compatibility. It scales the gradient term in the velocity update by (1 - dampening) instead of 1. With dampening = 0 (the default) the update matches the standard PyTorch convention; with dampening = mu the velocity becomes a true exponential moving average of the gradient. Most modern recipes leave dampening at 0.
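The scale difference between the two settings is easy to see (illustrative snippet):

```python
mu, g = 0.9, 1.0
v_default, v_ema = 0.0, 0.0
for _ in range(100):
    v_default = mu * v_default + g        # dampening = 0: raw accumulation
    v_ema = mu * v_ema + (1 - mu) * g     # dampening = mu: true EMA
print(v_default, v_ema)   # ~10.0 vs ~1.0: the EMA stays on the gradient's scale
```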
Different kinds of training tend to settle on different momentum values and pairings with learning rate. The table below summarizes typical settings observed in published recipes for major models and benchmarks.
| Task | Optimizer | Learning rate | Momentum (mu / beta_1) | Notes |
|---|---|---|---|---|
| ImageNet ResNet training | SGD-momentum | 0.1 (linear scaled with batch) | 0.9 | Step or cosine LR decay |
| ResNet on CIFAR-10/100 | SGD-momentum | 0.1 | 0.9 | Multistep LR drops |
| Vision Transformer pretraining | AdamW | 1e-3 to 3e-3 | 0.9 (beta_2 = 0.999) | Cosine schedule with warmup |
| BERT-style MLM pretraining | AdamW | 1e-4 | 0.9 (beta_2 = 0.999) | Linear LR decay |
| LLM pretraining (GPT-3, PaLM, Llama, DeepSeek) | AdamW | 1e-4 to 6e-4 | 0.9 (beta_2 = 0.95) | Warmup plus cosine; gradient clip 1.0 |
| LLM fine-tuning / RLHF | AdamW | 1e-5 to 5e-5 | 0.9 (beta_2 = 0.999) | Short warmup, often constant or linear |
| Diffusion model training | AdamW or Lion | 1e-4 (Lion: 1e-5) | 0.9 | EMA of weights also common |
| GAN training (DCGAN family) | Adam | 1e-4 to 2e-4 | 0.5 (beta_2 = 0.999) | Lower beta_1 for stability |
| Reinforcement learning (PPO, DQN) | Adam | 3e-4 | 0.9 | Often grad clipping at 0.5 |
| Large-batch CNN training | SGD-momentum or LARS | scaled with batch | 0.9 | Linear scaling rule plus warmup |
The DeepSeek V3 technical report explicitly lists AdamW with (beta_1, beta_2) = (0.9, 0.95), epsilon = 1e-8, and weight decay 0.1, with a peak learning rate of 2.2e-4 [12]. Llama 2 uses the same beta values with peak learning rate around 3e-4 for the 7B model and 1.5e-4 for the 70B model [13]. The shared pattern across modern LLM recipes is that the first-moment momentum stays at the conventional 0.9, while the second-moment beta is lowered to 0.95 for faster adaptation.
Momentum and adaptive learning rate scaling are complementary, not substitutes. The choice between SGD with momentum and an adaptive method like Adam or AdamW comes down to a few practical considerations.
| Consideration | SGD with momentum | Adam / AdamW |
|---|---|---|
| Memory per parameter (fp32) | 4 bytes (1 buffer) | 8 bytes (2 buffers) |
| Generalization on CNN benchmarks | Often slightly better | Comparable with AdamW |
| Convergence speed (early epochs) | Slower | Faster |
| Tolerance to learning rate misspecification | Lower | Higher |
| Default works out of the box | No (needs tuning) | Yes |
| Use in LLM pretraining | Rare | Universal |
| Robustness to noisy gradients | Lower | Higher |
The historical generalization advantage of SGD over Adam has narrowed substantially since the introduction of AdamW, which decouples weight decay and removes most of the implicit difference in regularization. For transformer-based architectures, AdamW has effectively displaced SGD with momentum entirely. For convolutional networks, both remain in active use, with the choice often coming down to convention rather than measured performance.
Momentum is not free. The most direct cost is one additional hyperparameter (mu), which couples to the learning rate in a way that requires care during tuning. For high values of mu, the effective step size grows by a factor of 1 / (1 - mu), so a learning rate that is stable at mu = 0 may diverge at mu = 0.99. Practitioners typically search learning rate and momentum jointly when both are non-default.
Momentum also amplifies the response to noisy gradients. In the steady state, the velocity contains contributions from many past mini-batches, and a single anomalous batch can produce a large transient in the velocity that takes O(1 / (1 - mu)) steps to decay. This is why batch normalization and gradient clipping are commonly paired with high momentum in modern recipes.
Near an optimum, momentum can produce "stale" velocity that points in a now-incorrect direction. This is a special case of the more general overshoot behavior of underdamped dynamical systems. Nesterov's look-ahead alleviates the issue but does not eliminate it; in stochastic settings the look-ahead advantage is partially lost to gradient noise, which is why the practical gap between classical and Nesterov momentum is often small in deep learning.
Finally, classical and Nesterov momentum on their own do not provide per-parameter adaptive scaling, which is why purely momentum-based methods like SGD-momentum often need carefully designed learning rate schedules to perform well on heterogeneous architectures. Adaptive optimizers like Adam and AdamW combine momentum with per-parameter scaling and tend to be more forgiving as a result.
As of 2026, momentum is a near-universal feature of deep learning optimizers. Plain SGD with momentum remains a strong default for image classification on CNNs and is still found in production training pipelines at major industrial labs. AdamW (which contains a momentum buffer as its first moment) is the dominant optimizer for LLM pretraining, fine-tuning, and most transformer-based vision and multimodal models. Lion offers a memory-efficient alternative that retains the momentum mechanism while replacing per-parameter scaling with a sign update. Schedule-Free Adam and AdEMAMix push the momentum idea in different directions, the former by subsuming momentum into iterate averaging and the latter by mixing two momentum buffers with different decay rates.
The trend in optimizer research has been to keep momentum as a building block while experimenting with what surrounds it: per-parameter scaling, sign updates, iterate averaging, second-order curvature estimates, and learning-rate-free formulations. Sixty years after Polyak's original paper, the heavy ball is still rolling.
Imagine you are riding a sled down a snowy hill that is full of bumps. Without momentum, your sled is so light that every little bump throws it sideways, and you spend most of your energy zigzagging instead of going down. With momentum, your sled is heavy: when you start moving in one direction, you keep moving that way for a while even if you hit small bumps, and you can coast across the flat patches without having to push. In machine learning, the computer is trying to roll downhill on a bumpy landscape (the loss function) to find the lowest point (the best answer). Momentum lets the computer remember which way it was already going, so it does not get distracted by every wiggle in the gradient and can slide smoothly toward the bottom much faster.