# RMSProp

> Source: https://aiwiki.ai/wiki/rmsprop
> Updated: 2026-07-11
> Categories: Deep Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**RMSProp** (Root Mean Square Propagation) is an adaptive learning-rate [optimizer](/wiki/optimizer) that divides each parameter's gradient by a running root-mean-square of that parameter's recent gradients, so every weight takes a step of roughly comparable size even when raw gradients differ in scale by orders of magnitude. It was introduced by [Geoffrey Hinton](/wiki/geoffrey_hinton) in 2012 in lecture 6e of his Coursera course *Neural Networks for Machine Learning*, was never published as a stand-alone paper, and is the direct precursor of [Adam](/wiki/adam_optimizer), which is simply RMSProp plus [momentum](/wiki/momentum) plus bias correction.[1][4]

RMSProp trains [neural networks](/wiki/neural_network) with mini-batch [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd). It keeps an exponentially decaying average of squared gradients (default decay 0.9) and divides the gradient by the square root of that average, keeping the effective step size roughly the same across parameters.[1] The canonical reference is the lecture slide deck itself: "Tieleman, T. and Hinton, G. (2012). Lecture 6.5 of *Neural Networks for Machine Learning*, Coursera," because Hinton's then-PhD-student Tijmen Tieleman wrote the implementation students used in the course assignments. There was no journal or conference paper, which is why the technique is universally cited to a course handout even though it became one of the most widely used adaptive optimizers of the early deep-learning era.[1][17]

RMSProp sits in the early-2010s wave of adaptive optimizers. AdaGrad came first (Duchi, Hazan & Singer 2011) and worked well for sparse problems but accumulated squared gradients without ever forgetting them, so the effective [learning rate](/wiki/learning_rate) eventually crashed to zero.[2] RMSProp fixed that by replacing the running sum with an exponentially decaying average. AdaDelta (Zeiler, December 2012) used the same trick independently.[3] [Adam](/wiki/adam_optimizer) (Kingma & Ba 2014) folded RMSProp together with [momentum](/wiki/momentum) and added bias correction, and Adam plus its descendant [AdamW](/wiki/adamw) ate most of RMSProp's market share for general [deep learning](/wiki/deep_learning) work.[4] RMSProp is still around, mostly in older deep [reinforcement learning](/wiki/reinforcement_learning) code (the original [DQN](/wiki/dqn) is the most cited example) and in [GAN](/wiki/gan) settings where Adam's momentum term makes training less stable.

## Who invented RMSProp and when?

In the fall of 2012, Hinton taught a free Coursera course called *Neural Networks for Machine Learning*, one of the first MOOCs aimed at deep learning, running at about the same time that AlexNet won ILSVRC 2012. Lecture 6 covered how to make learning go faster; one of its slides introduced RMSProp in roughly two bullet points. In Hinton's own framing on the slide, RMSProp is to "keep a moving average of the squared gradient for each weight" and then "divide the gradient by sqrt(MeanSquare(w,t))" before taking a step.[1] He noted that an earlier method called rprop (Riedmiller & Braun 1993) used only the sign of the gradient, which works for full-batch training but breaks down for mini-batches because magnitude information matters when batches are noisy.[13] RMSProp keeps rprop's per-parameter scaling but uses a smooth running average of squared gradients instead of a per-step sign flip.

### From rprop to RMSProp

Riedmiller and Braun's rprop algorithm assigned each parameter a separate step size, increased it whenever the sign of the gradient stayed the same across consecutive iterations, and decreased it whenever the sign flipped. The procedure ignored gradient magnitude entirely. That is fine when gradients are computed over the whole training set, because two consecutive full-batch gradients have comparable magnitudes by construction. With mini-batches, two consecutive gradients can have wildly different magnitudes simply because the batches contain different examples; rprop's sign-only update tends to thrash. Hinton's lecture framed RMSProp as the natural fix: replace the sign-comparison heuristic with a per-parameter scale derived from a smoothly accumulated estimate of recent squared gradients, so that the update divides the raw gradient by something close to its typical magnitude.[1]

Tijmen Tieleman, then a PhD student in Hinton's group, wrote the implementation that students used in the Coursera assignments. There was no journal or conference paper; the canonical citation is "Tieleman, T. and Hinton, G. (2012). Lecture 6.5 of *Neural Networks for Machine Learning*, Coursera."[1] Zeiler's AdaDelta paper, posted to arXiv in December 2012, confirmed that others had landed on roughly the same construction.[3] Yann Dauphin and others working on RNN training in 2013 and 2014 also began citing the lecture handout in workshop papers as a way to refer to the technique, which is largely how it entered the literature.

For several years RMSProp was the default optimizer when something better than vanilla SGD was needed. The most prominent demonstration was the DQN paper (Mnih et al., *Nature* 2015).[7] After Adam landed in 2015, most new work moved over.

### Adoption timeline

| Year | Event |
|---|---|
| 1993 | Riedmiller & Braun publish rprop, the conceptual predecessor. |
| 2011 | Duchi, Hazan & Singer publish AdaGrad. |
| 2012 (Oct) | Hinton lectures on RMSProp in Coursera *Neural Networks for Machine Learning*, lecture 6e. |
| 2012 (Dec) | Zeiler posts AdaDelta on arXiv, independently arriving at an exponentially decayed accumulator. |
| 2013 | Alex Graves uses centered RMSProp for handwriting and text generation RNNs. |
| 2014 | Kingma & Ba post the Adam preprint, framing it as RMSProp with momentum and bias correction. |
| 2015 | Mnih et al. publish *Human-level control through deep reinforcement learning* in *Nature*; DQN uses RMSProp. |
| 2016 | Mnih et al. release A3C; the asynchronous version uses a shared RMSProp accumulator across workers. |
| 2017 | Arjovsky et al. publish Wasserstein GAN; explicitly recommends RMSProp over Adam for the critic. |
| 2017 (Nov) | Loshchilov & Hutter post AdamW, after which AdamW becomes the standard for transformer training. |
| 2018+ | RMSProp recedes as a default for new architectures but remains common for reinforcement learning baselines and reproductions of older papers. |

## How does the RMSProp update rule work?

RMSProp is a per-parameter optimizer. For each scalar parameter θ with gradient g at the current step, it keeps a running estimate v of the squared gradient and uses the square root of that estimate to scale the step.

### Standard form

Let $$\theta_t$$ be the parameter at step t, $$g_t$$ the gradient of the loss with respect to $$\theta_t$$, $$\rho$$ a decay coefficient (often called $$\gamma$$ or $$\beta_2$$), $$\alpha$$ the global learning rate, and $$\epsilon$$ a small constant for numerical stability.

$$
\begin{aligned}
v_t &= \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2 \\
\theta_{t+1} &= \theta_t - \alpha \cdot \frac{g_t}{\sqrt{v_t} + \epsilon}
\end{aligned}
$$

$$v_t$$ is the exponential moving average of squared gradients. It plays the same role as the AdaGrad accumulator, except the contribution of any single past gradient decays geometrically over time instead of staying in the sum forever. The square root $$\sqrt{v_t}$$ is the running root-mean-square (hence the name), and dividing $$g_t$$ by it produces a step that has roughly unit magnitude in expectation, regardless of whether the parameter usually sees large or small gradients. The Cornell optimization textbook summarizes the method in one line: "keep moving average of the squared gradients for each weight. And then we divide the gradient by square root the mean square," with default values $$\beta = 0.9$$ and $$\eta = 0.001$$.[18]

### Intuition

If one parameter consistently sees gradients of magnitude 100 and another sees gradients of magnitude 0.01, vanilla SGD has to pick one global step size that works for both, and either the first parameter overshoots or the second barely moves. RMSProp's per-parameter denominator scales each update so the actual step in parameter space is comparable across parameters, no matter what scale the gradients live on. The exponential decay means that scale is computed from recent history, so it can shift over the course of training without permanently shrinking the effective learning rate the way AdaGrad does. The $$\epsilon$$ term keeps things from blowing up early in training when $$v_t$$ is close to zero; common library defaults are 1e-7 (Keras) and 1e-8 (PyTorch).

### Where epsilon goes

The placement of ε matters more than it looks. The original lecture wrote the update as `g / (sqrt(v) + ε)`, with ε added outside the square root. Some implementations and some Adam-style variants place it inside the square root: `g / sqrt(v + ε)`. The two are not algebraically equal. With ε outside, the denominator is floored at ε itself, so ε only matters once sqrt(v) falls to around ε. With ε inside, the denominator is floored at sqrt(ε), so ε already matters once v falls to around ε; for ε = 1e-8 that floor is 1e-4 rather than 1e-8, which caps the update at a much smaller size when gradients are tiny. Both forms give a finite step even when v is exactly zero. PyTorch's `torch.optim.RMSprop` uses outside-sqrt ε to match Hinton's original lecture; TensorFlow has used both formulations across its history. When porting code across frameworks, this is a real source of subtle numerical drift.
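The gap between the two placements can be checked numerically. The following is a minimal sketch, not any library's code; the function names `step_eps_outside` and `step_eps_inside` are made up for illustration, and the values of `g` and `v` are arbitrary.

```python
import math

def step_eps_outside(g, v, lr=1e-3, eps=1e-8):
    """Update size with epsilon added outside the square root."""
    return lr * g / (math.sqrt(v) + eps)

def step_eps_inside(g, v, lr=1e-3, eps=1e-8):
    """Update size with epsilon added inside the square root."""
    return lr * g / math.sqrt(v + eps)

# When v is far above eps, the two forms agree to many digits.
big_out = step_eps_outside(0.1, 0.01)
big_in = step_eps_inside(0.1, 0.01)
print(big_out, big_in)

# When v is far below eps, the denominators are floored differently:
# at eps (1e-8) for the outside form, at sqrt(eps) (1e-4) for the inside form.
small_out = step_eps_outside(1e-7, 1e-14)
small_in = step_eps_inside(1e-7, 1e-14)
print(small_out, small_in)
```

In the second case the outside form takes a step several hundred times larger than the inside form, which is the kind of drift that shows up when the same ε is carried between frameworks.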

### Centered RMSProp

A variant from Alex Graves's 2013 paper *Generating Sequences With Recurrent Neural Networks* (arXiv:1308.0850) also tracks a running mean of the gradient and subtracts its square from v before taking the square root, so the denominator becomes the running standard deviation rather than the running RMS:[6]

$$
\begin{aligned}
m_t &= \rho \cdot m_{t-1} + (1 - \rho) \cdot g_t \\
v_t &= \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2 \\
\theta_{t+1} &= \theta_t - \alpha \cdot \frac{g_t}{\sqrt{v_t - m_t^2} + \epsilon}
\end{aligned}
$$

Graves reported that this made training more stable when gradients had a strong directional bias. The intuition is that if the gradient has a consistent sign, the standard RMSProp denominator will report a large magnitude even though the parameter is moving in a single coherent direction. Centered RMSProp removes the bias before taking the square root, so a parameter that is steadily decreasing gets a denominator close to zero rather than close to the gradient's typical absolute value, which makes the per-step move larger when there is a clear direction to move in. Modern libraries expose it as a `centered=True` flag, off by default.
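To make the difference between the two denominators concrete, here is a small sketch that feeds two synthetic gradient sequences through both accumulators. The helper name `rmsprop_denominators` is hypothetical, and the clamp at zero is an added guard against round-off rather than part of the formula above.

```python
import numpy as np

def rmsprop_denominators(grads, rho=0.9, eps=1e-8):
    """Return (plain, centered) denominators after a sequence of scalar gradients."""
    m = 0.0  # running mean of the gradient
    v = 0.0  # running mean of the squared gradient
    for g in grads:
        m = rho * m + (1 - rho) * g
        v = rho * v + (1 - rho) * g * g
    plain = np.sqrt(v) + eps
    # v - m**2 is a running variance estimate; clamp at zero against round-off.
    centered = np.sqrt(max(v - m * m, 0.0)) + eps
    return plain, centered

# Consistent-sign gradients: the centered denominator is much smaller.
steady = rmsprop_denominators([1.0] * 50)
# Alternating-sign gradients: the running mean is near zero, so the two nearly match.
noisy = rmsprop_denominators([1.0, -1.0] * 25)
print(steady, noisy)
```

With a steady gradient the centered denominator shrinks well below the plain one, so the step grows; with sign-alternating gradients the running mean is close to zero and the two variants behave almost identically.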

### With momentum

Many implementations also support an optional momentum term that smooths the parameter update itself, separate from the squared-gradient running average:

$$
\begin{aligned}
v_t &= \rho \cdot v_{t-1} + (1 - \rho) \cdot g_t^2 \\
b_t &= \mu \cdot b_{t-1} + \frac{g_t}{\sqrt{v_t} + \epsilon} \\
\theta_{t+1} &= \theta_t - \alpha \cdot b_t
\end{aligned}
$$

With μ = 0.9 this is conceptually close to Adam, although Adam's bias-correction step makes the early-iteration behavior slightly different. PyTorch's `torch.optim.RMSprop` exposes this as the `momentum` argument and sets it to 0 by default. Some reinforcement learning codebases turned this on with values around 0.9 or 0.95 to smooth out very noisy policy-gradient updates.
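The three recurrences above translate almost line for line into NumPy. This is a sketch under the same notation; the function name `rmsprop_momentum_step` and the toy loss are illustrative assumptions.

```python
import numpy as np

def rmsprop_momentum_step(theta, g, v, b, lr=1e-3, rho=0.9, mu=0.9, eps=1e-8):
    """One RMSProp-with-momentum step; returns updated (theta, v, b)."""
    v = rho * v + (1 - rho) * g * g        # squared-gradient running average
    b = mu * b + g / (np.sqrt(v) + eps)    # momentum buffer of normalized gradients
    theta = theta - lr * b                 # parameter step
    return theta, v, b

theta = np.array([1.0, -2.0])
v = np.zeros_like(theta)
b = np.zeros_like(theta)
for _ in range(3):
    g = 2.0 * theta                        # gradient of the toy loss sum(theta**2)
    theta, v, b = rmsprop_momentum_step(theta, g, v, b)
print(theta)
```

Note that the buffer accumulates the already-normalized gradient, so momentum here smooths the update direction without changing how the per-parameter scale is estimated.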

### A worked toy example

A two-parameter quadratic illustrates the per-parameter scaling clearly. Consider the loss `L(θ₁, θ₂) = 50 θ₁² + θ₂²`. The gradient is `g = (100 θ₁, 2 θ₂)`. Starting from `(θ₁, θ₂) = (1, 1)` with vanilla SGD at learning rate 0.01, the first step in θ₁ is `-0.01 × 100 = -1.0`, which carries θ₁ the full distance to its minimum at 0 in a single move, while the first step in θ₂ is `-0.01 × 2 = -0.02`, which barely moves. Raising the learning rate to speed up θ₂ is not an option, because any value above 0.02 makes θ₁ diverge. RMSProp with ρ = 0.9 instead computes `v₁ = (1 - 0.9) × 100² = 1000` and `v₂ = (1 - 0.9) × 2² = 0.4`. The first updates become `-0.01 × 100 / sqrt(1000) ≈ -0.0316` and `-0.01 × 2 / sqrt(0.4) ≈ -0.0316`. Both parameters move by the same amount in parameter space, so neither is held hostage by the other's gradient scale. That is the entire point of the algorithm.
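The arithmetic in this example can be reproduced in a few lines of NumPy. This is only a check of the first step, starting from v = 0 and using the same learning rate and decay as the text.

```python
import numpy as np

theta = np.array([1.0, 1.0])
grad = np.array([100.0, 2.0]) * theta     # gradient of 50*theta1**2 + theta2**2
lr, rho, eps = 0.01, 0.9, 1e-8

sgd_step = -lr * grad                     # raw gradient step

v = (1 - rho) * grad**2                   # first accumulator update, starting from v = 0
rmsprop_step = -lr * grad / (np.sqrt(v) + eps)

print(sgd_step)       # the two SGD steps differ by a factor of 50
print(rmsprop_step)   # both RMSProp steps are about -0.0316
```

On the very first step from v = 0, the gradient cancels against its own square root, so every parameter with a nonzero gradient moves by about α / sqrt(1 - ρ) regardless of gradient scale.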

## How does RMSProp differ from AdaGrad?

[AdaGrad](/wiki/adagrad) (Duchi, Hazan, Singer 2011, *JMLR*) uses the same general structure: each parameter has its own learning-rate scale computed from past squared gradients. The difference is how that scale is accumulated.[2]

```
# AdaGrad
G_t   = G_{t-1} + g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(G_t) + ε)

# RMSProp
v_t   = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t) + ε)
```

The AdaGrad accumulator Gₜ grows monotonically. For sparse problems this is a feature: rare features get large updates because G stays small. For dense problems it is a bug: G grows roughly linearly in t, so the effective learning rate shrinks like 1/sqrt(t) and learning eventually stops. RMSProp's exponential decay is the explicit fix for exactly this AdaGrad failure mode, discarding ancient squared gradients so the denominator stays bounded.[14] With ρ = 0.9, only the last 10 to 20 steps contribute meaningfully, so the adaptive rate tracks the current geometry of the loss surface.

There is a clean way to think about the difference. AdaGrad's denominator estimates the cumulative L2 norm of all past gradients for that parameter. RMSProp's denominator estimates the running root-mean-square of recent gradients, which is bounded as long as gradients themselves are bounded. AdaGrad's effective learning rate is monotonically non-increasing, so once it has decayed, it cannot recover, even if the loss landscape changes (for instance after a learning-rate warmup, a curriculum shift, or a fine-tuning phase). RMSProp's effective learning rate can grow back as soon as recent gradients become small. That difference is what makes RMSProp viable for the long, multi-stage training runs typical of modern deep learning, while AdaGrad mostly stayed in the convex-optimization and sparse-feature literature where its monotonicity is the right thing to want.
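A constant-gradient experiment makes the contrast visible. The sketch below tracks the effective learning rate (α divided by the denominator) for both accumulators; the function name `effective_rates` and the choice of a constant gradient of 1 are illustrative assumptions.

```python
import numpy as np

def effective_rates(num_steps, g=1.0, lr=0.1, rho=0.9, eps=1e-8):
    """Track lr / denominator for AdaGrad and RMSProp under a constant gradient."""
    G, v = 0.0, 0.0
    adagrad, rmsprop = [], []
    for _ in range(num_steps):
        G += g * g                          # AdaGrad: running sum, never forgets
        v = rho * v + (1 - rho) * g * g     # RMSProp: exponentially decaying average
        adagrad.append(lr / (np.sqrt(G) + eps))
        rmsprop.append(lr / (np.sqrt(v) + eps))
    return np.array(adagrad), np.array(rmsprop)

adagrad, rmsprop = effective_rates(1000)
print(adagrad[[0, 99, 999]])   # shrinks like 1/sqrt(t)
print(rmsprop[[0, 99, 999]])   # settles near lr / |g|
```

AdaGrad's rate keeps falling for as long as training runs, while RMSProp's rate levels off once the accumulator has warmed up.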

## How does RMSProp differ from AdaDelta?

AdaDelta (Zeiler 2012, arXiv:1212.5701) was developed independently and posted about two months after Hinton's lecture.[3] It uses the same exponentially decaying mean of squared gradients but tries to eliminate the learning rate α entirely by also keeping a running average of squared parameter updates and using the ratio of those two RMS quantities as the step size:

```
# AdaDelta
E[g²]_t = ρ · E[g²]_{t-1} + (1 - ρ) · g_t²
Δθ_t   = -(sqrt(E[Δθ²]_{t-1} + ε) / sqrt(E[g²]_t + ε)) · g_t
E[Δθ²]_t = ρ · E[Δθ²]_{t-1} + (1 - ρ) · Δθ_t²
θ_{t+1}  = θ_t + Δθ_t
```

In theory this makes AdaDelta hyperparameter-free apart from ρ and ε; in practice most implementations still expose a learning-rate multiplier. AdaDelta is still a built-in optimizer in [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [Keras](/wiki/keras), but sees less use since Adam.

The key motivation in Zeiler's paper is unit consistency. He argued that in plain RMSProp the update `α · g / sqrt(v)` does not have the same units as the parameter being updated, because the gradient has units of `1 / units(θ)` and the denominator does not cancel them out, so the user-supplied learning rate α has to absorb that mismatch. AdaDelta replaces α with the running RMS of past parameter updates, which has the same units as θ, making the update dimensionally consistent. This is a clean theoretical observation, and on some tasks it does remove the need to tune a learning rate from scratch. In modern practice, however, the simplicity of just trying a few learning rates with Adam usually wins.
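For comparison with the RMSProp update, the AdaDelta recurrences can be written as a runnable NumPy sketch. The function name `adadelta_step`, the toy loss, and the values ρ = 0.95 and ε = 1e-6 are illustrative choices here, not settings taken from this article.

```python
import numpy as np

def adadelta_step(theta, g, Eg2, Edx2, rho=0.95, eps=1e-6):
    """One AdaDelta step; returns updated (theta, Eg2, Edx2)."""
    Eg2 = rho * Eg2 + (1 - rho) * g * g                   # running mean of squared gradients
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g    # RMS(update) / RMS(gradient) scaling
    Edx2 = rho * Edx2 + (1 - rho) * dx * dx               # running mean of squared updates
    return theta + dx, Eg2, Edx2

theta0 = np.array([1.0, -2.0])
theta = theta0.copy()
Eg2 = np.zeros_like(theta)
Edx2 = np.zeros_like(theta)
for _ in range(100):
    g = 2.0 * theta                                       # gradient of the toy loss sum(theta**2)
    theta, Eg2, Edx2 = adadelta_step(theta, g, Eg2, Edx2)
print(theta)
```

No learning rate appears anywhere in the step: the numerator `sqrt(Edx2 + eps)` plays that role, and early on it is set almost entirely by ε, which is why AdaDelta tends to start slowly.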

## How is Adam related to RMSProp?

Adam (Kingma & Ba 2014, arXiv:1412.6980, ICLR 2015) is essentially RMSProp with momentum on the gradient itself plus bias correction.[4] Adam keeps two running averages, one of the gradient (mₜ) and one of the squared gradient (vₜ), each with its own decay (β₁, β₂):

```
# Adam
m_t = β₁ · m_{t-1} + (1 - β₁) · g_t
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²
m̂_t = m_t / (1 - β₁ᵗ)         # bias correction
v̂_t = v_t / (1 - β₂ᵗ)         # bias correction
θ_{t+1} = θ_t - α · m̂_t / (sqrt(v̂_t) + ε)
```

With β₁ = 0 and bias correction off, Adam reduces to RMSProp with ρ = β₂; the second-moment estimate vₜ is identical to RMSProp's running average, and Adam's defaults of β₁ = 0.9, β₂ = 0.999 simply set RMSProp's decay to 0.999 and add a 0.9-momentum first moment.[4] The first moment mₜ does the same job as heavy-ball momentum in [SGD with momentum](/wiki/momentum): it averages out noise in successive minibatch gradients. The single feature RMSProp lacks is bias correction. Kingma and Ba single out exactly this gap: RMSProp with momentum "lacks a bias-correction term; this matters most in case of a value of β₂ close to 1 (required in case of sparse gradients), since in that case not correcting the bias leads to very large stepsizes and often divergence."[4] Adam's descendant [AdamW](/wiki/adamw) (Adam with decoupled weight decay) is the default optimizer for almost every transformer and modern vision model.[5]
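The reduction can be verified directly. In the sketch below, `rmsprop_step` and `adam_step` are hand-written helpers following the formulas in this article, and the `bias_correction` switch is a teaching device rather than an option any particular library is claimed to expose.

```python
import numpy as np

def rmsprop_step(theta, g, v, lr=1e-3, rho=0.999, eps=1e-8):
    """Plain RMSProp step; returns updated (theta, v)."""
    v = rho * v + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(v) + eps), v

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=True):
    """Adam step with an optional bias-correction switch; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1**t) if bias_correction else m
    v_hat = v / (1 - beta2**t) if bias_correction else v
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

theta0 = np.array([1.0, -2.0])
g = 2.0 * theta0                      # gradient of the toy loss sum(theta**2)
zeros = np.zeros(2)

# beta1 = 0 and no bias correction: the same update as RMSProp with rho = beta2.
theta_r, _ = rmsprop_step(theta0, g, zeros)
theta_a, _, _ = adam_step(theta0, g, zeros, zeros, t=1, beta1=0.0, bias_correction=False)

# Default Adam: bias correction brings the first step down to about lr.
theta_d, _, _ = adam_step(theta0, g, zeros, zeros, t=1)

print(np.abs(theta_r - theta0), np.abs(theta_d - theta0))
```

With a decay of 0.999 and no correction, the first step is about α / sqrt(1 - 0.999), roughly 32 times α, which is the "very large stepsizes" effect Kingma and Ba describe.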

### Convergence and the AMSGrad issue

Kingma and Ba's original Adam paper offered a regret-bound proof for the optimizer in the convex online setting. Reddi, Kale & Kumar ("On the Convergence of Adam and Beyond," ICLR 2018) showed that the proof had a gap and constructed counter-examples on which Adam fails to converge to the optimum even in simple convex problems.[12] The same construction technically affects RMSProp, since the troublesome term comes from the exponentially decaying squared-gradient accumulator that Adam inherits from RMSProp. Reddi and coauthors proposed AMSGrad, which keeps the running maximum of vₜ rather than the raw running average, restoring convergence guarantees. AMSGrad sees occasional use but never replaced Adam in practice, because the failure cases are rare on the kinds of non-convex objectives that come up in deep learning. The same caveats apply to RMSProp: the Reddi et al. counter-examples mean its convergence story is a little less clean than the AdaGrad analysis, but on real neural-network training problems this has not turned into a practical problem.

## What are RMSProp's default hyperparameters?

RMSProp has a small number of knobs. The defaults below are the ones used by the major libraries.

| Hyperparameter | Symbol | Common default | Notes |
|---|---|---|---|
| Learning rate | $$\alpha$$ | 0.001 (Keras), 0.01 (PyTorch) | DQN used 0.00025 |
| Decay (squared gradient) | $$\rho, \gamma$$ | 0.9 (Keras), 0.99 (PyTorch) | Called alpha in PyTorch, rho in Keras |
| Epsilon | $$\epsilon$$ | 1e-7 (Keras), 1e-8 (PyTorch) | 1e-6 in some RL papers |
| Momentum | $$\mu$$ | 0 | Optional |
| Centered | flag | False | Subtract running mean (Graves 2013) |
| Weight decay | $$\lambda$$ | 0 | L2 penalty added to gradient |

The 0.9 default for ρ corresponds to an effective averaging window of roughly 10 steps, and matches the value Hinton used in the original lecture examples.[1][18] For DQN specifically, the published learning rate of 0.00025 reflects how noisy bootstrap-target updates are; using the standard 0.001 there tends to make Q-values diverge.

### Tuning guidance

Learning rate is by far the most important RMSProp hyperparameter, just as it is for SGD. Common practice is to sweep α across roughly three orders of magnitude on a log scale (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2) and pick the largest one that does not diverge in the first few hundred steps. RMSProp's per-parameter scaling makes the optimizer more forgiving of a slightly-too-large learning rate than vanilla SGD, but it does not eliminate the cliff entirely; once α crosses some task-dependent threshold the loss will still go to NaN within a few iterations.

The decay ρ is rarely worth tuning. Both 0.9 and 0.99 are reasonable. 0.9 makes the running average track changes in gradient scale more aggressively, which helps when the loss landscape changes rapidly; 0.99 (PyTorch's default) is smoother and slightly more stable. The original Hinton lecture used a value of 0.9 in its examples.[1]

ε acts as a soft floor on the denominator and therefore as a soft ceiling on the per-step update size. On problems where gradients are very small (typical for the late stages of large-scale training), increasing ε by several orders of magnitude (for instance from 1e-8 to 1e-4) can prevent updates from blowing up when sqrt(vₜ) is also tiny. The DQN default of ε = 0.01 is a deliberately large value chosen for exactly this reason; in deep RL the squared-gradient running average can drop to genuinely small values whenever the policy briefly stops exploring, and a tiny ε will then turn a small gradient into a huge step.[7]

Weight decay in RMSProp implementations is added to the gradient before computing vₜ, so it is L2 regularization in the classical sense, not the decoupled weight decay used by AdamW. If you want decoupled weight decay with an RMSProp-style update, you have to either roll it yourself or use Optax, where the gradient transformation pipeline lets you apply weight decay independently of the squared-gradient normalization.
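The difference between the two forms of weight decay shows up clearly when the loss gradient is zero. The sketch below is a hand-rolled illustration; `rmsprop_l2_step` and `rmsprop_decoupled_step` are hypothetical helper names, and the decoupled variant is an AdamW-style construction assumed here rather than a built-in option of any library.

```python
import numpy as np

def rmsprop_l2_step(theta, g, v, lr=1e-3, rho=0.9, eps=1e-8, wd=1e-2):
    """Classical L2: the penalty gradient wd * theta passes through the normalizer."""
    g = g + wd * theta
    v = rho * v + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(v) + eps), v

def rmsprop_decoupled_step(theta, g, v, lr=1e-3, rho=0.9, eps=1e-8, wd=1e-2):
    """Decoupled: the decay term bypasses the squared-gradient normalizer."""
    v = rho * v + (1 - rho) * g * g
    return theta - lr * g / (np.sqrt(v) + eps) - lr * wd * theta, v

theta = np.array([10.0])
g = np.array([0.0])                      # no loss gradient: only the decay acts

theta_l2, _ = rmsprop_l2_step(theta, g, np.zeros(1))
theta_dec, _ = rmsprop_decoupled_step(theta, g, np.zeros(1))
print(theta - theta_l2, theta - theta_dec)
```

In the L2 form the decay gradient is normalized like any other gradient, so the shrinkage on this first step is about α / sqrt(1 - ρ) no matter how small `wd` is. In the decoupled form the shrinkage is exactly α · λ · θ.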

## Implementation

RMSProp is built in to all the major deep-learning libraries. The update rule is essentially the same across them; the main things to watch when porting code are the default learning rate, the default decay, and where ε sits in the denominator.[16]

### PyTorch

`torch.optim.RMSprop` defaults: lr 0.01, decay (`alpha`) 0.99, eps 1e-8, momentum 0, centered False. The example below overrides the learning rate and decay with the values Keras uses by default.

```python
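import torch

# Placeholder model, loss, and data so the snippet runs on its own.
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(10)]

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.9,        # squared-gradient decay (rho in this article)
    eps=1e-8,
)

for x, y in loader:
    optimizer.zero_grad()              # clear gradients from the previous step
    loss_fn(model(x), y).backward()    # backpropagate the mini-batch loss
    optimizer.step()                   # apply the RMSProp update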
```

PyTorch names the squared-gradient decay coefficient `alpha`, which collides with the symbol α used for the learning rate in most papers. The learning rate argument is just `lr`. The library also exposes a `centered` flag for centered RMSProp and a `momentum` argument that adds the explicit momentum buffer described above.

### TensorFlow / Keras

`tf.keras.optimizers.RMSprop` defaults: learning rate 0.001, rho 0.9, momentum 0.0, epsilon 1e-7.

```python
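import tensorflow as tf

# Placeholder one-layer model so the snippet runs on its own.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3, rho=0.9, epsilon=1e-7)
model.compile(optimizer=optimizer, loss="mse")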
```

Keras uses `rho` for the squared-gradient decay, which is closer to the conventional symbol. Note that the TensorFlow default learning rate (0.001) differs from PyTorch's (0.01) by a factor of ten. Hyperparameters tuned on one library do not transfer directly to the other without checking this.

### JAX (Optax)

In the [JAX](/wiki/jax) ecosystem the standard implementation is `optax.rmsprop`, composable with the rest of the optax gradient-transformation pipeline.

```python
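import jax
import jax.numpy as jnp
import optax

# Placeholder parameters and gradients so the snippet runs on its own.
params = {"w": jnp.ones(3)}
grads = jax.grad(lambda p: jnp.sum(p["w"] ** 2))(params)

optimizer = optax.rmsprop(learning_rate=1e-3, decay=0.9, eps=1e-8)
opt_state = optimizer.init(params)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)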
```

Optax separates the optimizer logic into `init` and `update` calls without any hidden state on the optimizer object, which makes it straightforward to combine RMSProp with gradient clipping, weight decay, learning-rate schedules, and other transformations using `optax.chain`. For example, a typical training pipeline might be `optax.chain(optax.clip_by_global_norm(1.0), optax.rmsprop(1e-3))`.

### From scratch in NumPy

A reference implementation in pure NumPy fits in a few lines:

```python
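import numpy as np

class RMSProp:
    def __init__(self, params, lr=1e-3, rho=0.9, eps=1e-8):
        self.lr, self.rho, self.eps = lr, rho, eps
        # One squared-gradient accumulator per parameter array, starting at zero.
        self.v = [np.zeros_like(p) for p in params]

    def step(self, params, grads):
        """Update each float array in params in place from the matching gradient."""
        for i, (p, g) in enumerate(zip(params, grads)):
            # Exponential moving average of squared gradients.
            self.v[i] = self.rho * self.v[i] + (1 - self.rho) * g * g
            # Scale the gradient by its running RMS, with epsilon outside the sqrt.
            p -= self.lr * g / (np.sqrt(self.v[i]) + self.eps)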
```

That is the entire algorithm. Real implementations add weight decay, momentum, gradient clipping, and bookkeeping, but the inner loop is exactly the two lines that update v and step the parameter.

## Optimizer comparison

The following table summarizes how RMSProp relates to the other first-order optimizers it shares lineage with.

| Optimizer | Year | Per-parameter scale | Momentum | Bias correction | Notes |
|---|---|---|---|---|---|
| SGD | 1951 (Robbins & Monro) | No | No | No | Single global step size |
| SGD with momentum | 1964 (Polyak) | No | Yes | No | Heavy-ball momentum |
| Nesterov momentum | 1983 (Nesterov) | No | Yes | No | Lookahead momentum variant |
| AdaGrad | 2011 (Duchi et al.) | Yes (sum of g²) | No | No | Learning rate decays to zero |
| RMSProp | 2012 (Hinton) | Yes (EMA of g²) | Optional | No | Fixes AdaGrad's decay issue |
| AdaDelta | 2012 (Zeiler) | Yes (EMA of g²) | No | No | Eliminates explicit learning rate |
| Adam | 2014 (Kingma & Ba) | Yes (EMA of g²) | Yes (EMA of g) | Yes | RMSProp + momentum + bias correction |
| AdamW | 2017 (Loshchilov & Hutter) | Yes (EMA of g²) | Yes (EMA of g) | Yes | Adam with decoupled [weight decay](/wiki/weight_decay) |
| AMSGrad | 2018 (Reddi et al.) | Yes (max of g² EMA) | Yes | No | Adam variant with restored convergence proof |
| AdaBelief | 2020 (Zhuang et al.) | Yes (EMA of (g - m)²) | Yes | Yes | Adam variant tracking gradient variance |

### Memory and compute cost

All first-order adaptive optimizers cost extra memory because they have to store auxiliary state per parameter. SGD has zero extra state, SGD with momentum has one buffer per parameter, RMSProp has one buffer (two with either momentum or centering, three with both), Adam has two, and AMSGrad has three. For a transformer with 7 billion parameters in fp32, that is 28 GB of optimizer state for RMSProp, 56 GB for Adam, and 84 GB for AMSGrad, which is why optimizer state offloading and 8-bit Adam exist. RMSProp's relatively modest state footprint, half of Adam's, was historically one of its small practical advantages on memory-constrained hardware, though in the era of distributed training across hundreds of GPUs that consideration has mostly faded.
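The figures above follow from a one-line calculation. A small sketch, assuming fp32 state (4 bytes per value), decimal gigabytes, and the buffer counts stated in the paragraph; the names `STATE_BUFFERS` and `optimizer_state_gb` are illustrative.

```python
BYTES_PER_FP32 = 4

# Per-parameter state buffers for each optimizer in its basic configuration.
STATE_BUFFERS = {
    "sgd": 0,
    "sgd_momentum": 1,
    "rmsprop": 1,
    "adam": 2,
    "amsgrad": 3,
}

def optimizer_state_gb(num_params, optimizer):
    """Optimizer-state size in decimal gigabytes, assuming fp32 state."""
    return num_params * BYTES_PER_FP32 * STATE_BUFFERS[optimizer] / 1e9

for name in STATE_BUFFERS:
    print(name, optimizer_state_gb(7_000_000_000, name))
```

These numbers cover optimizer state only; the weights and gradients themselves add another buffer each.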

## What is RMSProp used for?

RMSProp shows up in specific corners of the deep-learning literature, mostly from the 2013 to 2016 window when it was the default for sequence models and RL.

### Reinforcement learning

The DeepMind paper that put deep RL on the map (Mnih et al., "Human-level control through deep reinforcement learning", *Nature* 518, 2015) trained the Q-network with a variant of RMSProp that adds a momentum term to the standard update.[7] Published settings from Extended Data Table 1: learning rate 0.00025, gradient momentum 0.95, squared-gradient momentum (decay) 0.95, and min squared gradient (epsilon) 0.01.[7] A lot of follow-on work (DQN variants, Rainbow, Ape-X) inherited those settings even after Adam became standard elsewhere. The 2016 *Asynchronous Methods for Deep Reinforcement Learning* paper introduced [A3C](/wiki/a3c), which used a shared RMSProp accumulator across asynchronous actor-learner workers; the squared-gradient running average was held in shared memory and updated atomically by every worker, giving each worker the benefit of a population-level estimate of gradient scale without any explicit synchronization.[8] That trick became standard in distributed RL implementations for several years. Several other reinforcement-learning algorithms from the same era (TRPO baselines, certain ACER configurations, the original IMPALA reference) also defaulted to RMSProp. As of 2026, deep RL libraries typically expose both Adam and RMSProp and most new agents pick Adam, but the legacy DQN and A3C settings remain the canonical reference points for benchmarks on the Arcade Learning Environment.

### Recurrent neural networks

For [recurrent neural networks](/wiki/recurrent_neural_network), Alex Graves's character-level RNN work (Graves 2013) used centered RMSProp.[6] *Recurrent Batch Normalization* (Cooijmans et al. 2016) used RMSProp on language modeling and sequence [MNIST](/wiki/mnist).[9] Several early seq2seq systems used RMSProp before switching to Adam. The motivation for RMSProp on RNNs was practical: gradient magnitudes in long-sequence backpropagation through time vary dramatically across parameters, especially in the recurrent matrices, and per-parameter scaling helps prevent the small subset of weights that experience the largest gradients from dominating the update. Once Adam took over, the same property carried over, so the switch had little qualitative effect on training dynamics for most RNN setups.

### Generative adversarial networks

For GANs, the Wasserstein GAN paper ([WGAN](/wiki/wgan), Arjovsky, Chintala & Bottou 2017) used RMSProp for both critic and generator and explicitly recommended against Adam, on the grounds that Adam's momentum term plus their weight-clipping scheme made the critic loss less reliable.[10] The original WGAN code uses RMSProp with learning rate 5e-5. The follow-on WGAN-GP paper (Gulrajani et al. 2017) reverted to Adam after replacing weight clipping with a gradient penalty, which suggests that the WGAN preference for RMSProp had as much to do with the specific weight-clipping mechanism as with any general property of GAN training.[11] A few other adversarial setups, including some early attempts at adversarial training for robustness, also reached for RMSProp on the theory that momentum makes saddle-point dynamics worse, but this is more folklore than measured fact.

### Other notable uses

RMSProp was a common default for smaller models in the 2014 to 2016 era, including character-level neural language models in Andrej Karpathy's widely circulated `char-rnn` codebase, several speech recognition baselines, and the original style-transfer implementations. For supervised image classification it was always less common than SGD with momentum, which delivered better final accuracy on ImageNet-scale benchmarks. Once Adam, and later AdamW, displaced both, RMSProp gradually became a niche choice for new work outside reinforcement learning.

## When should you use RMSProp instead of Adam?

For most tasks, AdamW is a better default. The momentum term usually helps, the bias correction makes the first few hundred steps less twitchy, and decoupled weight decay does the right thing for [regularization](/wiki/regularization).[5] On transformers, large CNNs, and diffusion models, AdamW with cosine learning-rate scheduling and a short warmup is the standard recipe and almost always at least as good as RMSProp.

Reproducing an older paper is the most common reason to still use RMSProp. DQN, WGAN, and a chunk of the 2013 to 2016 deep-learning literature use it with specific hyperparameters, and if you want your numbers to match, you use the same optimizer with the same settings. Some adversarial training setups (WGAN being the classic one) also work better without first-moment momentum because the loss landscape is non-stationary by design. For a brand-new project with no prior art, start with AdamW.

### Decision checklist

In rough order of priority, the situations where reaching for RMSProp over Adam still makes sense:

1. The reference implementation you are reproducing uses RMSProp and you want directly comparable numbers. This covers most Atari DQN and A3C work.
2. The optimization landscape is adversarial or otherwise non-stationary, and the gradient mean is genuinely uninformative or actively misleading, so adding first-moment momentum hurts. WGAN's critic is the canonical example.
3. You are training a small recurrent model and want to match the hyperparameters from a 2013-vintage paper without retuning.
4. You are working in a memory-constrained environment where halving the optimizer state from Adam to RMSProp matters.
5. You are running a unit test or sanity check and want a simpler, fewer-knob optimizer than Adam.

For every other modern setting, AdamW is the safer default. Some practitioners prefer SGD with momentum and a long cosine schedule for image classification on ConvNets, but that is a separate debate that has nothing to do with RMSProp specifically.
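
The memory argument in item 4 of the checklist can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes one state value per parameter per buffer, 4-byte (float32) state, and a hypothetical 100M-parameter model; the buffer counts reflect the usual layout (plain RMSProp keeps only the squared-gradient average, Adam keeps first and second moments), and the helper name `optimizer_state_bytes` is invented for this example.

```python
def optimizer_state_bytes(num_params, buffers_per_param, bytes_per_value=4):
    """Bytes of optimizer state, assuming one value per parameter per buffer."""
    return num_params * buffers_per_param * bytes_per_value


# Buffers kept per parameter (assumed layout, mirroring common implementations).
BUFFERS = {
    "rmsprop": 1,                    # running average of squared gradients
    "rmsprop_momentum": 2,           # plus a momentum buffer
    "rmsprop_centered_momentum": 3,  # plus a running average of gradients
    "adam": 2,                       # first- and second-moment estimates
}

num_params = 100_000_000  # hypothetical model size
state_bytes = {name: optimizer_state_bytes(num_params, k) for name, k in BUFFERS.items()}

for name, size in state_bytes.items():
    print(f"{name:28s} {size / 2**20:8.1f} MiB")
```

Note that the saving only applies to plain RMSProp: turning on momentum brings the state back to Adam's size, and adding the centered variant on top exceeds it.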

## Common pitfalls

A short list of things that trip up engineers using RMSProp in production:

1. **Default learning rate mismatch.** PyTorch defaults to lr 0.01 (matching Hinton's original lecture); Keras and TensorFlow default to lr 0.001 (matching what most papers use). The smoothing constants differ too: PyTorch's `alpha` defaults to 0.99, while Keras's `rho` defaults to 0.9. Code ported between the two libraries often diverges or stalls because of this. Always set these values explicitly when porting or writing reference implementations.
2. **Epsilon placement.** As noted above, `g / (sqrt(v) + ε)` and `g / sqrt(v + ε)` are not equal. Frameworks differ. When epsilon dominates the denominator (early in training or in regions of vanishing gradient), the two formulations can produce noticeably different updates.
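3. **No bias correction.** Unlike Adam, RMSProp does not bias-correct vₜ, so when vₜ is initialized to zero the running average is biased toward zero early in training: for a few dozen steps at decay 0.9, and for a few hundred at decay 0.99. The denominator is therefore too small and the first updates are larger than steady-state behavior would predict, by a factor of up to 1/√(1 − decay) on the very first step (about 3.2 at 0.9, about 10 at 0.99). The practical effect is usually small, but for very short training runs it can matter.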
4. **Centered flag confusion.** The `centered=True` flag silently changes the algorithm. It uses more memory, runs more slowly, and can subtly change convergence on some problems. Do not flip it on without rerunning your hyperparameter sweep.
5. **Weight decay is L2, not decoupled.** If you set the `weight_decay` argument in PyTorch's RMSprop, you are getting classical L2 regularization mixed into the gradient before the squared-gradient running average, not the decoupled form used by AdamW. This is usually fine but can interact badly with very large weight-decay coefficients.
6. **Large ε in deep RL.** The DQN setting of ε = 0.01 looks unreasonably large next to the usual 1e-8 or so, but it is a deliberate choice to handle vanishing-gradient regions in Q-learning. Reproducing DQN-style experiments without setting ε to its published value tends to make Q-values diverge.
7. **Numerical drift across hardware.** RMSProp's denominator is sensitive to the order of floating-point operations in the squared-gradient running average. Mixed-precision training, distributed reductions, and TPU vs. GPU implementations can produce slightly different trajectories from the same initial conditions. This rarely matters for final accuracy but can complicate exact reproducibility.
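
Pitfalls 2 and 3 are easy to check numerically. The following is a minimal scalar sketch, not any framework's implementation: `rmsprop_step` and its `eps_inside_sqrt` switch are hypothetical names, the running average is assumed to start at zero, and momentum, centering, and weight decay are left out.

```python
import math


def rmsprop_step(grad, v, lr=0.01, rho=0.9, eps=1e-8, eps_inside_sqrt=False):
    """One scalar RMSProp update; returns (parameter delta, new running average)."""
    v = rho * v + (1.0 - rho) * grad * grad
    if eps_inside_sqrt:
        denom = math.sqrt(v + eps)   # g / sqrt(v + eps)
    else:
        denom = math.sqrt(v) + eps   # g / (sqrt(v) + eps)
    return -lr * grad / denom, v


# Pitfall 2: with a tiny gradient and a large epsilon, epsilon dominates the
# denominator and the two placements give very different step sizes.
delta_outside, _ = rmsprop_step(1e-4, 0.0, eps=0.01, eps_inside_sqrt=False)
delta_inside, _ = rmsprop_step(1e-4, 0.0, eps=0.01, eps_inside_sqrt=True)
placement_ratio = delta_outside / delta_inside

# Pitfall 3: with v starting at zero and no bias correction, the first step is
# larger than the steady-state step for the same constant gradient.
first_delta, v = rmsprop_step(1.0, 0.0)
steady_delta = first_delta
for _ in range(200):
    steady_delta, v = rmsprop_step(1.0, v)
warmup_ratio = first_delta / steady_delta

print(f"epsilon placement ratio:   {placement_ratio:.2f}")
print(f"first step / steady state: {warmup_ratio:.2f}")
```

Both ratios depend on the chosen `rho`, `eps`, and gradient magnitude, so the numbers are illustrative; the point is only that neither effect is negligible when epsilon is large or the run is short.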

## See also

[Optimizer](/wiki/optimizer), [gradient descent](/wiki/gradient_descent), [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd), [momentum](/wiki/momentum), [AdaGrad](/wiki/adagrad), [Adam](/wiki/adam_optimizer), [AdamW](/wiki/adamw), [backpropagation](/wiki/backpropagation), [learning rate](/wiki/learning_rate), [A3C](/wiki/a3c), [DQN](/wiki/dqn), [WGAN](/wiki/wgan).

## References

1. Tieleman, T. and Hinton, G. (2012). *Lecture 6.5: rmsprop, COURSERA: Neural Networks for Machine Learning.* University of Toronto. https://www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf
2. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. *Journal of Machine Learning Research*, 12, 2121-2159.
3. Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.
4. Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. ICLR 2015.
5. Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. arXiv:1711.05101. ICLR 2019.
6. Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850.
7. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature* 518, 529-533.
8. Mnih, V., Badia, A. P., Mirza, M., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. *Proceedings of ICML 2016*. arXiv:1602.01783.
9. Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent Batch Normalization. arXiv:1603.09025.
10. Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv:1701.07875.
11. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. arXiv:1704.00028.
12. Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.
13. Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. *IEEE International Conference on Neural Networks*.
14. Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
15. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press, Section 8.5.
16. PyTorch documentation: `torch.optim.RMSprop`. TensorFlow documentation: `tf.keras.optimizers.RMSprop`. Optax documentation: `optax.rmsprop`.
17. Wikipedia contributors. "Stochastic gradient descent" (section on RMSProp). Wikipedia.
18. Cornell University Computational Optimization Open Textbook. "RMSProp." https://optimization.cbe.cornell.edu/index.php?title=RMSProp