See also: Machine learning terms, Optimizer, Gradient Descent
Momentum is an optimization technique that accelerates gradient descent by accumulating a velocity vector from past gradients. Rather than updating model parameters based solely on the current gradient, momentum maintains a running average of previous gradients and uses it to guide the direction and magnitude of each update step. This helps the optimizer move faster along dimensions where the gradient consistently points in the same direction, while dampening oscillations in dimensions where the gradient frequently changes sign.
The technique was first introduced by Boris Polyak in 1964 as the "heavy ball method," drawing on an analogy from classical mechanics. In the decades since, momentum has become a standard component of nearly every modern optimization algorithm used to train neural networks, including SGD with momentum and Adam.
Soviet mathematician Boris Polyak introduced the heavy ball method in his 1964 paper on methods for minimizing functionals. The approach was inspired by physics: the iteration models the motion of a heavy ball rolling across a surface defined by the loss function, subject to friction. The friction term prevents the ball from accelerating indefinitely, while the inertia from past motion helps the ball pass over small bumps and flat regions.
Polyak showed that for smooth, strongly convex quadratic functions, the heavy ball method's convergence depends on the condition number only through its square root, a significant improvement over standard gradient descent. Specifically, while gradient descent contracts the error by a factor of (kappa - 1) / (kappa + 1) per iteration, the heavy ball method achieves a factor of (sqrt(kappa) - 1) / (sqrt(kappa) + 1), where kappa is the condition number of the Hessian matrix.
Momentum entered the machine learning mainstream through the work of David Rumelhart, Geoffrey Hinton, and Ronald Williams on backpropagation in 1986. Their influential paper on learning representations by back-propagating errors included momentum as part of the SGD update rule, and it quickly became a standard practice in training neural networks.
Yurii Nesterov, a Russian mathematician, proposed an important variant known as Nesterov Accelerated Gradient (NAG) in his 1983 paper. Nesterov showed that his method achieves an optimal convergence rate of O(1/k^2) for convex functions, compared to O(1/k) for standard gradient descent, where k is the number of iterations. Ilya Sutskever and colleagues later popularized NAG for deep learning in their 2013 paper "On the Importance of Initialization and Momentum in Deep Learning."
The classical momentum (also called Polyak momentum) update rule introduces a velocity vector v that accumulates past gradients:
v_t = beta * v_(t-1) + eta * nabla_f(theta_t)
theta_(t+1) = theta_t - v_t
Where:
| Symbol | Meaning |
|---|---|
| v_t | Velocity (momentum) vector at time step t |
| beta | Momentum coefficient (typically 0.9) |
| eta | Learning rate |
| nabla_f(theta_t) | Gradient of the loss function with respect to parameters theta at step t |
| theta_(t+1) | Updated parameter vector |
The velocity v_t is an exponentially weighted moving average of past gradients. At each step, the old velocity is scaled by beta (retaining a fraction of the previous direction) and a new gradient term is added. The parameter update then moves in the direction of this accumulated velocity rather than just the raw gradient.
Unrolling the recurrence relation reveals that the velocity at step t is a weighted sum of all past gradients:
v_t = eta * nabla_f(theta_t) + beta * eta * nabla_f(theta_(t-1)) + beta^2 * eta * nabla_f(theta_(t-2)) + ...
The weight assigned to a gradient from k steps ago decays exponentially as beta^k. With beta = 0.9, the effective window covers roughly the last 10 gradients (since 0.9^10 is approximately 0.35). With beta = 0.99, the effective window extends to roughly 100 past gradients.
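The recurrence and its unrolled form can be checked numerically. The following is a minimal sketch in plain Python (the gradient values and hyperparameters are arbitrary illustrations):

```python
beta, eta = 0.9, 0.1

def momentum_step(theta, v, g):
    # v_t = beta * v_(t-1) + eta * g;  theta_(t+1) = theta_t - v_t
    v = beta * v + eta * g
    return theta - v, v

# Three arbitrary gradient values, oldest first.
grads = [1.0, 0.5, -0.2]

theta, v = 0.0, 0.0
for g in grads:
    theta, v = momentum_step(theta, v, g)

# Unrolled form: v_3 = eta*g_3 + beta*eta*g_2 + beta^2*eta*g_1
v_unrolled = sum(beta**k * eta * g for k, g in enumerate(reversed(grads)))
assert abs(v - v_unrolled) < 1e-12
```

The assertion confirms that the iterative update and the unrolled weighted sum produce the same velocity.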
Nesterov momentum modifies the classical approach by evaluating the gradient at a "lookahead" position rather than the current position. The update rule is:
theta_lookahead = theta_t - beta * v_t
v_(t+1) = beta * v_t + eta * nabla_f(theta_lookahead)
theta_(t+1) = theta_t - v_(t+1)
The key difference is in the gradient computation. Classical momentum computes the gradient at the current position theta_t, while Nesterov momentum computes the gradient at the anticipated future position theta_t - beta * v_t. This lookahead mechanism provides a form of correction: if the velocity is about to carry the parameters past a minimum, the gradient at the lookahead position will point backward, slowing or reversing the momentum before the overshoot occurs.
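The difference in gradient evaluation points can be sketched on a toy one-dimensional quadratic (the loss function, hyperparameters, and step count below are illustrative choices, not recommendations):

```python
# Toy loss f(theta) = 0.5 * theta**2, whose gradient is simply theta.
grad = lambda theta: theta

def classical_step(theta, v, beta=0.9, eta=0.1):
    # Gradient evaluated at the current position theta.
    v = beta * v + eta * grad(theta)
    return theta - v, v

def nesterov_step(theta, v, beta=0.9, eta=0.1):
    # Gradient evaluated at the lookahead position theta - beta * v.
    v = beta * v + eta * grad(theta - beta * v)
    return theta - v, v

theta_c = theta_n = 1.0
v_c = v_n = 0.0
for _ in range(300):
    theta_c, v_c = classical_step(theta_c, v_c)
    theta_n, v_n = nesterov_step(theta_n, v_n)
# Both converge toward the minimum at theta = 0.
```

The only structural change between the two functions is the argument passed to `grad`, which is exactly the distinction described above.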
The following table summarizes the key differences between the two approaches:
| Property | Classical Momentum | Nesterov Momentum |
|---|---|---|
| Gradient evaluation point | Current position theta_t | Lookahead position theta_t - beta * v_t |
| Correction behavior | Reactive (corrects after overshooting) | Proactive (anticipates and prevents overshooting) |
| Convergence rate (convex) | O(1/k) | O(1/k^2) |
| Convergence rate (strongly convex) | (sqrt(kappa) - 1) / (sqrt(kappa) + 1) | (sqrt(kappa) - 1) / (sqrt(kappa) + 1) |
| Stability at high beta | Can oscillate | More stable |
| Benefit in stochastic setting | Significant | Modest (noise limits lookahead advantage) |
In practice, Nesterov momentum tends to perform as well as or better than classical momentum, especially in deterministic or near-deterministic optimization. In the stochastic setting (mini-batch training), the advantage of the lookahead step is partially negated by gradient noise, though Nesterov momentum still provides some benefit.
Momentum addresses several fundamental challenges in gradient-based optimization:
In regions of the loss landscape where the surface curves much more steeply in one direction than another (high condition number), vanilla gradient descent oscillates back and forth across the narrow valley while making slow progress along the valley floor. Momentum solves this problem by accumulating velocity. The oscillatory gradient components point in alternating directions across successive steps, so they partially cancel out in the velocity vector. Meanwhile, the consistent gradient component along the valley floor reinforces itself in the velocity, producing larger effective steps in that direction.
When the gradient is small but consistent (as in relatively flat regions of the loss surface), momentum builds up speed over several steps. This allows the optimizer to traverse plateaus more quickly than vanilla gradient descent, which would take very small steps due to the small gradient magnitude.
The accumulated velocity can carry the optimizer past shallow local minima and saddle points. Even if the gradient at a particular point is zero or near zero, the inertia from previous updates continues to push the parameters forward.
From a theoretical perspective, momentum reduces the effective condition number of the optimization problem to its square root. For a quadratic loss surface with condition number kappa, gradient descent requires O(kappa * log(1/epsilon)) iterations to reach epsilon-accuracy, while momentum requires only O(sqrt(kappa) * log(1/epsilon)) iterations. This can represent an enormous speedup for ill-conditioned problems.
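This speedup can be observed on a small ill-conditioned quadratic. The sketch below uses textbook tunings for each method (step size 2/(mu + L) for gradient descent and the classical optimal heavy ball tuning for momentum); the choice kappa = 100 is an arbitrary illustration:

```python
import numpy as np

# Quadratic loss f(x) = 0.5 * (mu * x1^2 + L * x2^2), condition number kappa = L / mu.
mu, L = 1.0, 100.0
grad = lambda x: np.array([mu * x[0], L * x[1]])

def run(step_fn, tol=1e-6, max_iters=100_000):
    # Iterate until the distance to the minimum at the origin drops below tol.
    x, v, t = np.array([1.0, 1.0]), np.zeros(2), 0
    while np.linalg.norm(x) > tol and t < max_iters:
        x, v = step_fn(x, v)
        t += 1
    return t

# Gradient descent with the optimal fixed step size 2 / (mu + L).
eta_gd = 2 / (mu + L)
gd_step = lambda x, v: (x - eta_gd * grad(x), v)

# Heavy ball with the classical optimal tuning for quadratics.
eta_hb = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
def hb_step(x, v):
    v = beta * v + eta_hb * grad(x)
    return x - v, v

iters_gd, iters_hb = run(gd_step), run(hb_step)
# On this problem, momentum reaches the tolerance in far fewer iterations.
```

The iteration counts differ by roughly the sqrt(kappa) factor predicted by the theory.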
The momentum coefficient beta (also written as mu or alpha in some formulations) controls how much influence past gradients have on the current update. Choosing an appropriate value is important for good training performance.
| beta Value | Effective Window | Behavior | Typical Use Case |
|---|---|---|---|
| 0 | 1 gradient | No momentum; equivalent to vanilla SGD | Baseline comparisons |
| 0.5 - 0.8 | 2 - 5 gradients | Light momentum, faster response to gradient changes | Noisy or unstable training |
| 0.9 | ~10 gradients | Standard default for most problems | General-purpose training |
| 0.95 - 0.99 | 20 - 100 gradients | Smoother trajectories, higher inertia | Well-behaved landscapes, large-batch training |
The most common default is beta = 0.9, which works well across a wide range of tasks. Higher values like 0.95 or 0.99 can be beneficial for problems with smooth, well-conditioned loss surfaces but may cause instability or delayed convergence on noisy or rugged landscapes.
A practical guideline from optimization theory: set beta as close to 1 as possible, then find the highest learning rate that still allows convergence. This combination tends to yield the fastest training.
Stochastic gradient descent with momentum is the simplest and most widely used form. At each step, the velocity is updated as an exponential moving average of the mini-batch gradients, and the parameters are moved in the direction of the velocity. Despite the availability of more sophisticated optimizers, SGD with momentum remains competitive for training deep neural networks, particularly in computer vision, where it is often preferred for its generalization properties.
Adam (Adaptive Moment Estimation) extends the momentum concept by maintaining two exponential moving averages:
| Component | What It Tracks | Default Decay Rate | Purpose |
|---|---|---|---|
| First moment (m_t) | Mean of gradients | beta_1 = 0.9 | Provides momentum (direction) |
| Second moment (v_t) | Mean of squared gradients | beta_2 = 0.999 | Provides per-parameter adaptive learning rates |
The first moment estimate in Adam functions identically to classical momentum, smoothing the gradient direction over time. The second moment estimate, borrowed from the RMSProp optimizer, scales the update for each parameter inversely by the root mean square of recent gradients. Adam also applies bias correction to both moment estimates to account for their initialization at zero.
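The two moment updates and the bias correction can be sketched in a few lines. This is an illustrative re-derivation of the published update rule, not library code; the hyperparameter defaults follow the table above:

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a scalar parameter; t is the step count, starting at 1.
    m = beta1 * m + (1 - beta1) * g       # first moment: momentum on the gradient
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=1.0, m=m, v=v, t=1)
# Thanks to bias correction, the very first update has magnitude ~lr
# even though both moments start at zero.
```

Without the bias correction terms, the first few updates would be shrunk toward zero because m and v are initialized at zero.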
Other optimizers that incorporate momentum include AdamW, Nadam (which combines Adam with Nesterov momentum), and LAMB; many RMSProp implementations also accept an optional momentum term. AdaGrad, by contrast, adapts per-parameter learning rates but does not itself use momentum.
PyTorch implements momentum through torch.optim.SGD. The following example shows how to use SGD with momentum to train a model:
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
criterion = nn.CrossEntropyLoss()

# SGD with classical momentum (beta = 0.9)
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9
)

# SGD with Nesterov momentum
optimizer_nesterov = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True
)

# Training loop (dataloader is assumed to yield (inputs, targets) mini-batches)
for inputs, targets in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
```
Key parameters in PyTorch's SGD:
| Parameter | Description | Default |
|---|---|---|
| lr | Learning rate | Required |
| momentum | Momentum factor (beta) | 0 |
| dampening | Dampening for momentum | 0 |
| nesterov | Enables Nesterov momentum | False |
| weight_decay | L2 regularization penalty | 0 |
An important implementation detail: PyTorch applies the learning rate at update time (v = mu * v + g; p = p - lr * v), which differs from some other frameworks that fold the learning rate into the velocity (v = mu * v + lr * g; p = p - v). Both formulations are mathematically equivalent but can produce different behavior when the learning rate changes during training.
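The two conventions can be compared side by side (the function names below are illustrative, not library APIs):

```python
def step_update_time_lr(p, v, g, lr, mu):
    # lr applied at update time: v = mu*v + g; p = p - lr*v
    v = mu * v + g
    return p - lr * v, v

def step_folded_lr(p, v, g, lr, mu):
    # lr folded into the velocity: v = mu*v + lr*g; p = p - v
    v = mu * v + lr * g
    return p - v, v

# With a constant learning rate, the two trajectories coincide.
p1 = p2 = 1.0
v1 = v2 = 0.0
for g in [0.5, -0.2, 0.3, 0.1]:
    p1, v1 = step_update_time_lr(p1, v1, g, lr=0.1, mu=0.9)
    p2, v2 = step_folded_lr(p2, v2, g, lr=0.1, mu=0.9)
assert abs(p1 - p2) < 1e-12

# But if lr changes mid-training, the folded form keeps the old lr
# baked into the velocity, and the trajectories diverge.
p1, v1 = step_update_time_lr(p1, v1, 0.5, lr=0.01, mu=0.9)
p2, v2 = step_folded_lr(p2, v2, 0.5, lr=0.01, mu=0.9)
```

This is why learning rate schedules can behave differently across frameworks even when the nominal hyperparameters are identical.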
Several practical considerations affect how momentum is used in real training pipelines:
When using high momentum values, it is common to combine them with learning rate warmup. The velocity vector starts at zero and needs several steps to build up, so a gradually increasing learning rate prevents large, poorly directed updates during the early phase of training.
Some training regimes adjust the momentum coefficient during training. For example, the 1cycle policy (introduced by Leslie Smith) increases momentum from 0.85 to 0.95 as the learning rate decreases, and vice versa. This inverse relationship between learning rate and momentum can speed up training significantly.
Momentum and learning rate interact closely. A higher momentum effectively amplifies the learning rate by a factor of approximately 1 / (1 - beta). With beta = 0.9, the effective learning rate is roughly 10 times the nominal rate; with beta = 0.99, it is roughly 100 times. This means that when increasing beta, it is often necessary to decrease the learning rate proportionally to maintain stable training.
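The 1 / (1 - beta) factor follows from the geometric series: under a constant gradient g, the velocity converges to eta * g * (1 + beta + beta^2 + ...) = eta * g / (1 - beta). A quick numeric check (values arbitrary):

```python
beta, eta, g = 0.9, 0.01, 1.0

v = 0.0
for _ in range(500):
    v = beta * v + eta * g  # constant gradient drives v toward its fixed point

# Geometric series limit: eta * g / (1 - beta); here 10x the nominal step eta * g.
steady_state = eta * g / (1 - beta)
assert abs(v - steady_state) < 1e-9
```

With beta = 0.99 the same calculation gives a 100x amplification, matching the rule of thumb above.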
Larger batch sizes produce less noisy gradient estimates, which allows momentum to be more effective. In large-batch training (batch sizes of 8,192 or more), high momentum values like 0.9 to 0.99 are common and beneficial. With small batch sizes, lower momentum values may be preferable to avoid amplifying gradient noise.
Imagine you are riding a sled down a snowy hill. If the hill curves left, your sled does not turn instantly because it has speed built up going straight. That built-up speed is like momentum. In machine learning, the computer is trying to find the lowest point in a bumpy landscape (the best answer). Without momentum, it takes tiny, shaky steps and might get stuck wiggling back and forth. With momentum, the computer remembers which direction it was already going and keeps some of that speed. This makes it slide smoothly toward the lowest point instead of zigzagging, so it finds the answer much faster.