See also: Machine learning terms, Optimizer, Gradient Descent
Momentum is an optimization technique that accelerates gradient descent by accumulating a velocity vector from past gradients. Rather than updating model parameters based solely on the current gradient, momentum maintains a running average of previous gradients and uses it to guide the direction and magnitude of each update step. This helps the optimizer move faster along dimensions where the gradient consistently points in the same direction, while dampening oscillations in dimensions where the gradient frequently changes sign.
The technique was first introduced by Boris Polyak in 1964 as the "heavy ball method," drawing on an analogy from classical mechanics. In the decades since, momentum has become a standard component of nearly every modern optimization algorithm used to train neural networks, including SGD with momentum and Adam.
Soviet mathematician Boris Polyak introduced the heavy ball method in his 1964 paper on methods for minimizing functionals. The approach was inspired by physics: the iteration models the motion of a heavy ball rolling across a surface defined by the loss function, subject to friction. The friction term prevents the ball from accelerating indefinitely, while the inertia from past motion helps the ball pass over small bumps and flat regions.
Polyak showed that for smooth, strongly convex quadratic functions, the heavy ball method's convergence depends on the condition number only through its square root, a significant improvement over standard gradient descent. Specifically, while gradient descent contracts the error by a factor of (kappa - 1) / (kappa + 1) per iteration, the heavy ball method achieves a factor of (sqrt(kappa) - 1) / (sqrt(kappa) + 1), where kappa is the condition number of the Hessian matrix.
Momentum entered the machine learning mainstream through the work of David Rumelhart, Geoffrey Hinton, and Ronald Williams on backpropagation in 1986. Their influential paper on learning representations by back-propagating errors included momentum as part of the SGD update rule, and it quickly became a standard practice in training neural networks.
Yurii Nesterov, a Russian mathematician, proposed an important variant known as Nesterov Accelerated Gradient (NAG) in his 1983 paper. Nesterov showed that his method achieves an optimal convergence rate of O(1/k^2) for convex functions, compared to O(1/k) for standard gradient descent, where k is the number of iterations. Ilya Sutskever and colleagues later popularized NAG for deep learning in their 2013 paper "On the Importance of Initialization and Momentum in Deep Learning."
The classical momentum (also called Polyak momentum) update rule introduces a velocity vector v that accumulates past gradients:
v_t = beta * v_(t-1) + eta * nabla_f(theta_t)
theta_(t+1) = theta_t - v_t
Where:
| Symbol | Meaning |
|---|---|
| v_t | Velocity (momentum) vector at time step t |
| beta | Momentum coefficient (typically 0.9) |
| eta | Learning rate |
| nabla_f(theta_t) | Gradient of the loss function with respect to parameters theta at step t |
| theta_(t+1) | Updated parameter vector |
The velocity v_t is an exponentially weighted moving average of past gradients. At each step, the old velocity is scaled by beta (retaining a fraction of the previous direction) and a new gradient term is added. The parameter update then moves in the direction of this accumulated velocity rather than just the raw gradient.
Unrolling the recurrence relation reveals that the velocity at step t is a weighted sum of all past gradients:
v_t = eta * nabla_f(theta_t) + beta * eta * nabla_f(theta_(t-1)) + beta^2 * eta * nabla_f(theta_(t-2)) + ...
The weight assigned to a gradient from k steps ago decays exponentially as beta^k. With beta = 0.9, the effective window covers roughly the last 10 gradients (since 0.9^10 is approximately 0.35). With beta = 0.99, the effective window extends to roughly 100 past gradients.
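The recurrence and its unrolled form can be checked numerically. The following is a minimal sketch in plain Python (the gradient values and hyperparameters are arbitrary illustrations):

```python
beta, eta = 0.9, 0.1

def momentum_step(theta, v, g):
    # v_t = beta * v_(t-1) + eta * g;  theta_(t+1) = theta_t - v_t
    v = beta * v + eta * g
    return theta - v, v

# Three arbitrary gradient values, oldest first.
grads = [1.0, 0.5, -0.2]

theta, v = 0.0, 0.0
for g in grads:
    theta, v = momentum_step(theta, v, g)

# Unrolled form: v_3 = eta*g_3 + beta*eta*g_2 + beta^2*eta*g_1
v_unrolled = sum(beta**k * eta * g for k, g in enumerate(reversed(grads)))
assert abs(v - v_unrolled) < 1e-12
```

The assertion confirms that the iterative update and the unrolled weighted sum produce the same velocity.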
Nesterov momentum modifies the classical approach by evaluating the gradient at a "lookahead" position rather than the current position. The update rule is:
theta_lookahead = theta_t - beta * v_t
v_(t+1) = beta * v_t + eta * nabla_f(theta_lookahead)
theta_(t+1) = theta_t - v_(t+1)
The key difference is in the gradient computation. Classical momentum computes the gradient at the current position theta_t, while Nesterov momentum computes the gradient at the anticipated future position theta_t - beta * v_t. This lookahead mechanism provides a form of correction: if the velocity is about to carry the parameters past a minimum, the gradient at the lookahead position will point backward, slowing or reversing the momentum before the overshoot occurs.
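The difference in gradient evaluation points can be sketched on a toy one-dimensional quadratic (the loss function, hyperparameters, and step count below are illustrative choices, not recommendations):

```python
# Toy loss f(theta) = 0.5 * theta**2, whose gradient is simply theta.
grad = lambda theta: theta

def classical_step(theta, v, beta=0.9, eta=0.1):
    # Gradient evaluated at the current position theta.
    v = beta * v + eta * grad(theta)
    return theta - v, v

def nesterov_step(theta, v, beta=0.9, eta=0.1):
    # Gradient evaluated at the lookahead position theta - beta * v.
    v = beta * v + eta * grad(theta - beta * v)
    return theta - v, v

theta_c = theta_n = 1.0
v_c = v_n = 0.0
for _ in range(300):
    theta_c, v_c = classical_step(theta_c, v_c)
    theta_n, v_n = nesterov_step(theta_n, v_n)
# Both converge toward the minimum at theta = 0.
```

The only structural change between the two functions is the argument passed to `grad`, which is exactly the distinction described above.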
The following table summarizes the key differences between the two approaches:
| Property | Classical Momentum | Nesterov Momentum |
|---|---|---|
| Gradient evaluation point | Current position theta_t | Lookahead position theta_t - beta * v_t |
| Correction behavior | Reactive (corrects after overshooting) | Proactive (anticipates and prevents overshooting) |
| Convergence rate (convex) | O(1/k) | O(1/k^2) |
| Convergence rate (strongly convex) | (sqrt(kappa) - 1) / (sqrt(kappa) + 1) | (sqrt(kappa) - 1) / (sqrt(kappa) + 1) |
| Stability at high beta | Can oscillate | More stable |
| Benefit in stochastic setting | Significant | Modest (noise limits lookahead advantage) |
In practice, Nesterov momentum tends to perform as well as or better than classical momentum, especially in deterministic or near-deterministic optimization. In the stochastic setting (mini-batch training), the advantage of the lookahead step is partially negated by gradient noise, though Nesterov momentum still provides some benefit.
Momentum addresses several fundamental challenges in gradient-based optimization:
In regions of the loss landscape where the surface curves much more steeply in one direction than another (high condition number), vanilla gradient descent oscillates back and forth across the narrow valley while making slow progress along the valley floor. Momentum solves this problem by accumulating velocity. The oscillatory gradient components point in alternating directions across successive steps, so they partially cancel out in the velocity vector. Meanwhile, the consistent gradient component along the valley floor reinforces itself in the velocity, producing larger effective steps in that direction.
When the gradient is small but consistent (as in relatively flat regions of the loss surface), momentum builds up speed over several steps. This allows the optimizer to traverse plateaus more quickly than vanilla gradient descent, which would take very small steps due to the small gradient magnitude.
The accumulated velocity can carry the optimizer past shallow local minima and saddle points. Even if the gradient at a particular point is zero or near zero, the inertia from previous updates continues to push the parameters forward.
From a theoretical perspective, momentum reduces the effective condition number of the optimization problem to its square root. For a quadratic loss surface with condition number kappa, gradient descent requires O(kappa * log(1/epsilon)) iterations to reach epsilon-accuracy, while momentum requires only O(sqrt(kappa) * log(1/epsilon)) iterations. This can represent an enormous speedup for ill-conditioned problems.
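This speedup can be observed on a small ill-conditioned quadratic. The sketch below uses textbook tunings for each method (step size 2/(mu + L) for gradient descent and the classical optimal heavy ball tuning for momentum); the choice kappa = 100 is an arbitrary illustration:

```python
import numpy as np

# Quadratic loss f(x) = 0.5 * (mu * x1^2 + L * x2^2), condition number kappa = L / mu.
mu, L = 1.0, 100.0
grad = lambda x: np.array([mu * x[0], L * x[1]])

def run(step_fn, tol=1e-6, max_iters=100_000):
    # Iterate until the distance to the minimum at the origin drops below tol.
    x, v, t = np.array([1.0, 1.0]), np.zeros(2), 0
    while np.linalg.norm(x) > tol and t < max_iters:
        x, v = step_fn(x, v)
        t += 1
    return t

# Gradient descent with the optimal fixed step size 2 / (mu + L).
eta_gd = 2 / (mu + L)
gd_step = lambda x, v: (x - eta_gd * grad(x), v)

# Heavy ball with the classical optimal tuning for quadratics.
eta_hb = 4 / (np.sqrt(L) + np.sqrt(mu)) ** 2
beta = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2
def hb_step(x, v):
    v = beta * v + eta_hb * grad(x)
    return x - v, v

iters_gd, iters_hb = run(gd_step), run(hb_step)
# On this problem, momentum reaches the tolerance in far fewer iterations.
```

The iteration counts differ by roughly the sqrt(kappa) factor predicted by the theory.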
The momentum coefficient beta (also written as mu or alpha in some formulations) controls how much influence past gradients have on the current update. Choosing an appropriate value is important for good training performance.
| beta Value | Effective Window | Behavior | Typical Use Case |
|---|---|---|---|
| 0 | 1 gradient | No momentum; equivalent to vanilla SGD | Baseline comparisons |
| 0.5 - 0.8 | 2 - 5 gradients | Light momentum, faster response to gradient changes | Noisy or unstable training |
| 0.9 | ~10 gradients | Standard default for most problems | General-purpose training |
| 0.95 - 0.99 | 20 - 100 gradients | Smoother trajectories, higher inertia | Well-behaved landscapes, large-batch training |
The most common default is beta = 0.9, which works well across a wide range of tasks. Higher values like 0.95 or 0.99 can be beneficial for problems with smooth, well-conditioned loss surfaces but may cause instability or delayed convergence on noisy or rugged landscapes.
A practical guideline from optimization theory: set beta as close to 1 as possible, then find the highest learning rate that still allows convergence. This combination tends to yield the fastest training.
Stochastic gradient descent with momentum is the simplest and most widely used form. At each step, the velocity is updated as an exponential moving average of the mini-batch gradients, and the parameters are moved in the direction of the velocity. Despite the availability of more sophisticated optimizers, SGD with momentum remains competitive for training deep neural networks, particularly in computer vision, where it is often preferred for its generalization properties.
Adam (Adaptive Moment Estimation) extends the momentum concept by maintaining two exponential moving averages:
| Component | What It Tracks | Default Decay Rate | Purpose |
|---|---|---|---|
| First moment (m_t) | Mean of gradients | beta_1 = 0.9 | Provides momentum (direction) |
| Second moment (v_t) | Mean of squared gradients | beta_2 = 0.999 | Provides per-parameter adaptive learning rates |
The first moment estimate in Adam functions identically to classical momentum, smoothing the gradient direction over time. The second moment estimate, borrowed from the RMSProp optimizer, scales the update for each parameter inversely by the root mean square of recent gradients. Adam also applies bias correction to both moment estimates to account for their initialization at zero.
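The two moment updates and the bias correction can be sketched in a few lines. This is an illustrative re-derivation of the published update rule, not library code; the hyperparameter defaults follow the table above:

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update for a scalar parameter; t is the step count, starting at 1.
    m = beta1 * m + (1 - beta1) * g       # first moment: momentum on the gradient
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 0.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=1.0, m=m, v=v, t=1)
# Thanks to bias correction, the very first update has magnitude ~lr
# even though both moments start at zero.
```

Without the bias correction terms, the first few updates would be shrunk toward zero because m and v are initialized at zero.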
Other optimizers that incorporate momentum include AdamW, Nadam (which combines Adam with Nesterov momentum), and LAMB; many RMSProp implementations also accept an optional momentum term. AdaGrad, by contrast, adapts per-parameter learning rates but does not itself use momentum.
PyTorch implements momentum through torch.optim.SGD. The following example shows how to use SGD with momentum to train a model:
```python
import torch
import torch.nn as nn
import torch.optim as optim

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
criterion = nn.CrossEntropyLoss()

# SGD with classical momentum (beta = 0.9)
optimizer = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9
)

# SGD with Nesterov momentum
optimizer_nesterov = optim.SGD(
    model.parameters(),
    lr=0.01,
    momentum=0.9,
    nesterov=True
)

# Training loop (dataloader is assumed to yield (inputs, targets) mini-batches)
for inputs, targets in dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
```
Key parameters in PyTorch's SGD:
| Parameter | Description | Default |
|---|---|---|
| lr | Learning rate | Required |
| momentum | Momentum factor (beta) | 0 |
| dampening | Dampening for momentum | 0 |
| nesterov | Enables Nesterov momentum | False |
| weight_decay | L2 regularization penalty | 0 |
An important implementation detail: PyTorch applies the learning rate at update time (v = mu * v + g; p = p - lr * v), which differs from some other frameworks that fold the learning rate into the velocity (v = mu * v + lr * g; p = p - v). Both formulations are mathematically equivalent but can produce different behavior when the learning rate changes during training.
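The two conventions can be compared side by side (the function names below are illustrative, not library APIs):

```python
def step_update_time_lr(p, v, g, lr, mu):
    # lr applied at update time: v = mu*v + g; p = p - lr*v
    v = mu * v + g
    return p - lr * v, v

def step_folded_lr(p, v, g, lr, mu):
    # lr folded into the velocity: v = mu*v + lr*g; p = p - v
    v = mu * v + lr * g
    return p - v, v

# With a constant learning rate, the two trajectories coincide.
p1 = p2 = 1.0
v1 = v2 = 0.0
for g in [0.5, -0.2, 0.3, 0.1]:
    p1, v1 = step_update_time_lr(p1, v1, g, lr=0.1, mu=0.9)
    p2, v2 = step_folded_lr(p2, v2, g, lr=0.1, mu=0.9)
assert abs(p1 - p2) < 1e-12

# But if lr changes mid-training, the folded form keeps the old lr
# baked into the velocity, and the trajectories diverge.
p1, v1 = step_update_time_lr(p1, v1, 0.5, lr=0.01, mu=0.9)
p2, v2 = step_folded_lr(p2, v2, 0.5, lr=0.01, mu=0.9)
```

This is why learning rate schedules can behave differently across frameworks even when the nominal hyperparameters are identical.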
Several practical considerations affect how momentum is used in real training pipelines:
When using high momentum values, it is common to combine them with learning rate warmup. The velocity vector starts at zero and needs several steps to build up, so a gradually increasing learning rate prevents large, poorly directed updates during the early phase of training.
Some training regimes adjust the momentum coefficient during training. For example, the 1cycle policy (introduced by Leslie Smith) increases momentum from 0.85 to 0.95 as the learning rate decreases, and vice versa. This inverse relationship between learning rate and momentum can speed up training significantly.
Momentum and learning rate interact closely. A higher momentum effectively amplifies the learning rate by a factor of approximately 1 / (1 - beta). With beta = 0.9, the effective learning rate is roughly 10 times the nominal rate; with beta = 0.99, it is roughly 100 times. This means that when increasing beta, it is often necessary to decrease the learning rate proportionally to maintain stable training.
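The 1 / (1 - beta) factor follows from the geometric series: under a constant gradient g, the velocity converges to eta * g * (1 + beta + beta^2 + ...) = eta * g / (1 - beta). A quick numeric check (values arbitrary):

```python
beta, eta, g = 0.9, 0.01, 1.0

v = 0.0
for _ in range(500):
    v = beta * v + eta * g  # constant gradient drives v toward its fixed point

# Geometric series limit: eta * g / (1 - beta); here 10x the nominal step eta * g.
steady_state = eta * g / (1 - beta)
assert abs(v - steady_state) < 1e-9
```

With beta = 0.99 the same calculation gives a 100x amplification, matching the rule of thumb above.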
Larger batch sizes produce less noisy gradient estimates, which allows momentum to be more effective. In large-batch training (batch sizes of 8,192 or more), high momentum values like 0.9 to 0.99 are common and beneficial. With small batch sizes, lower momentum values may be preferable to avoid amplifying gradient noise.
Imagine you are riding a sled down a snowy hill. If the hill curves left, your sled does not turn instantly because it has speed built up going straight. That built-up speed is like momentum. In machine learning, the computer is trying to find the lowest point in a bumpy landscape (the best answer). Without momentum, it takes tiny, shaky steps and might get stuck wiggling back and forth. With momentum, the computer remembers which direction it was already going and keeps some of that speed. This makes it slide smoothly toward the lowest point instead of zigzagging, so it finds the answer much faster.