RMSProp

RMSProp (Root Mean Square Propagation) is an adaptive learning-rate optimizer for training neural networks with mini-batch stochastic gradient descent. It divides each parameter's gradient by a running root-mean-square estimate of recent gradient magnitudes for that parameter, keeping the effective step size roughly the same across parameters even when raw gradients differ in scale by orders of magnitude.

The method was proposed by Geoffrey Hinton in 2012 in lecture 6e of his Coursera course Neural Networks for Machine Learning ("rmsprop: Divide the gradient by a running average of its recent magnitude"). It was implemented in the course's Octave examples by Tijmen Tieleman, which is why it is often cited as Tieleman & Hinton 2012. There was no stand-alone paper; the canonical citation in the literature is the lecture slide deck itself. Hinton has joked in talks since that he "got tired of writing papers," so the method only ever lived inside a course handout, even as it became one of the most widely used adaptive optimizers of the early deep-learning era.

RMSProp sits in the early-2010s wave of adaptive optimizers. AdaGrad came first (Duchi, Hazan & Singer 2011) and worked well for sparse problems but accumulated squared gradients without ever forgetting them, so the effective learning rate eventually crashed to zero. RMSProp fixed that by replacing the running sum with an exponentially decaying average. AdaDelta (Zeiler, December 2012) used the same trick independently. Adam (Kingma & Ba 2014) folded RMSProp together with momentum and added bias correction, and Adam plus its descendant AdamW ate most of RMSProp's market share for general deep learning work. RMSProp is still around, mostly in older deep reinforcement learning code (the original DQN is the most cited example) and in GAN settings where Adam's momentum term makes training less stable.

History

In the fall of 2012, Hinton taught a free Coursera course called Neural Networks for Machine Learning, one of the first MOOCs aimed at deep learning, running at around the same time that AlexNet won ILSVRC 2012. Lecture 6 covered how to make learning go faster; one of its slides introduced RMSProp in roughly two bullet points. Hinton noted that an earlier method called rprop (Riedmiller & Braun 1993) used only the sign of the gradient, which works for full-batch training but breaks down for mini-batches because magnitude information matters when batches are noisy. RMSProp keeps rprop's per-parameter scaling but uses a smooth running average of squared gradients instead of a per-step sign flip.

From rprop to RMSProp

Riedmiller and Braun's rprop algorithm assigned each parameter a separate step size, increased it whenever the sign of the gradient stayed the same across consecutive iterations, and decreased it whenever the sign flipped. The procedure ignored gradient magnitude entirely. That is fine when gradients are computed over the whole training set, because two consecutive full-batch gradients have comparable magnitudes by construction. With mini-batches, two consecutive gradients can have wildly different magnitudes simply because the batches contain different examples; rprop's sign-only update tends to thrash. Hinton's lecture framed RMSProp as the natural fix: replace the sign-comparison heuristic with a per-parameter scale derived from a smoothly accumulated estimate of recent squared gradients, so that the update divides the raw gradient by something close to its typical magnitude.

Tijmen Tieleman, then a PhD student in Hinton's group, wrote the implementation that students used in the Coursera assignments. There was no journal or conference paper; the canonical citation is "Tieleman, T. and Hinton, G. (2012). Lecture 6.5 of Neural Networks for Machine Learning, Coursera." Zeiler's AdaDelta paper, posted to arXiv in December 2012, confirmed that others had landed on roughly the same construction. Yann Dauphin and others working on RNN training in 2013 and 2014 also began citing the lecture handout in workshop papers as a way to refer to the technique, which is largely how it entered the literature.

For several years RMSProp was the default optimizer when something better than vanilla SGD was needed. Its most prominent demonstration was the DQN paper (Mnih et al., Nature 2015). After Adam was published at ICLR 2015, most new work moved over.

Adoption timeline

Year	Event
1993	Riedmiller & Braun publish rprop, the conceptual predecessor.
2011	Duchi, Hazan & Singer publish AdaGrad.
2012 (Oct)	Hinton lectures on RMSProp in Coursera Neural Networks for Machine Learning, lecture 6e.
2012 (Dec)	Zeiler posts AdaDelta on arXiv, independently arriving at an exponentially decayed accumulator.
2013	Alex Graves uses centered RMSProp for handwriting and text generation RNNs.
2014	Kingma & Ba post the Adam preprint, framing it as RMSProp with momentum and bias correction.
2015	Mnih et al. publish Human-level control through deep reinforcement learning in Nature; DQN uses RMSProp.
2016	Mnih et al. release A3C; the asynchronous version uses a shared RMSProp accumulator across workers.
2017	Arjovsky et al. publish Wasserstein GAN; explicitly recommends RMSProp over Adam for the critic.
2017 (Nov)	Loshchilov & Hutter post AdamW, after which AdamW becomes the standard for transformer training.
2018+	RMSProp recedes as a default for new architectures but remains common for reinforcement learning baselines and reproductions of older papers.

The update rule

RMSProp is a per-parameter optimizer. For each scalar parameter θ with gradient g at the current step, it keeps a running estimate v of the squared gradient and uses the square root of that estimate to scale the step.

Standard form

Let θₜ be the parameter at step t, gₜ the gradient of the loss with respect to θₜ, ρ a decay coefficient (often called γ or β₂), α the global learning rate, and ε a small constant for numerical stability.

v_t   = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t) + ε)

vₜ is the exponential moving average of squared gradients. It plays the same role as the AdaGrad accumulator, except the contribution of any single past gradient decays geometrically over time instead of staying in the sum forever. The square root sqrt(vₜ) is the running root-mean-square (hence the name), and dividing gₜ by it produces a step that has roughly unit magnitude in expectation, regardless of whether the parameter usually sees large or small gradients.

Intuition

If one parameter consistently sees gradients of magnitude 100 and another sees gradients of magnitude 0.01, vanilla SGD has to pick one global step size that works for both, and either the first parameter overshoots or the second barely moves. RMSProp's per-parameter denominator scales each update so the actual step in parameter space is comparable across parameters, no matter what scale the gradients live on. The exponential decay means that scale is computed from recent history, so it can shift over the course of training without permanently shrinking the effective learning rate the way AdaGrad does. The ε term keeps things from blowing up early in training when vₜ is close to zero; common defaults are 1e-7 or 1e-8 depending on the library.

Where epsilon goes

The placement of ε matters more than it looks. The original lecture wrote the update as g / (sqrt(v) + ε), with ε added outside the square root. Some implementations and some Adam-style variants place it inside the square root: g / sqrt(v + ε). The two are not algebraically equal. Both give a finite step when v is exactly zero, but the floor on the denominator differs: ε in the outside form and sqrt(ε) in the inside form, so for the same ε the inside form caps the step far more tightly (a floor of 1e-4 versus 1e-8 when ε = 1e-8). The outside form starts to matter once sqrt(v) approaches ε; the inside form starts to matter once v approaches ε. PyTorch's torch.optim.RMSprop uses outside-sqrt ε to match Hinton's original lecture; TensorFlow has used both formulations across its history. When porting code across frameworks, this is a real source of subtle numerical drift.
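A short numeric check makes the gap between the two placements concrete. This is a minimal sketch in plain Python; the function names `step_eps_outside` and `step_eps_inside` are illustrative and do not correspond to any library API.

```python
import math

def step_eps_outside(g, v, lr=1e-3, eps=1e-8):
    """Update size with epsilon added after the square root."""
    return lr * g / (math.sqrt(v) + eps)

def step_eps_inside(g, v, lr=1e-3, eps=1e-8):
    """Update size with epsilon added under the square root."""
    return lr * g / math.sqrt(v + eps)

# Large v: epsilon is negligible and the two forms agree closely.
big_out = step_eps_outside(g=1.0, v=1.0)
big_in = step_eps_inside(g=1.0, v=1.0)

# v == 0: the denominator floor is eps outside, but sqrt(eps) inside.
zero_out = step_eps_outside(g=1e-6, v=0.0)  # 1e-3 * 1e-6 / 1e-8
zero_in = step_eps_inside(g=1e-6, v=0.0)    # 1e-3 * 1e-6 / 1e-4
```

With the same ε = 1e-8, the two steps at v = 0 differ by a factor of 10,000. When porting between the two conventions, squaring the outside-form ε gives roughly the same denominator floor in the inside form.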

Centered RMSProp

A variant from Alex Graves's 2013 paper Generating Sequences With Recurrent Neural Networks (arXiv:1308.0850) also tracks a running mean of the gradient and subtracts its square from v before taking the square root, so the denominator becomes the running standard deviation rather than the running RMS:

m_t = ρ · m_{t-1} + (1 - ρ) · g_t
v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t - m_t²) + ε)

Graves reported that this made training more stable when gradients had a strong directional bias. The intuition is that if the gradient has a consistent sign, the standard RMSProp denominator will report a large magnitude even though the parameter is moving in a single coherent direction. Centered RMSProp removes the bias before taking the square root, so a parameter that is steadily decreasing gets a denominator close to zero rather than close to the gradient's typical absolute value, which makes the per-step move larger when there is a clear direction to move in. Modern libraries expose it as a centered=True flag, off by default.
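The effect of centering on the denominator can be seen with two synthetic gradient streams. The sketch below is plain Python under simplifying assumptions (scalar parameter, hand-picked gradient sequences); the helper name `denominators` is illustrative.

```python
import math

def denominators(grads, rho=0.9, eps=1e-8):
    """Return (plain, centered) RMSProp denominators after consuming grads."""
    m = 0.0  # running mean of the gradient
    v = 0.0  # running mean of the squared gradient
    for g in grads:
        m = rho * m + (1 - rho) * g
        v = rho * v + (1 - rho) * g * g
    plain = math.sqrt(v) + eps
    # max(..., 0.0) guards against tiny negative values from rounding.
    centered = math.sqrt(max(v - m * m, 0.0)) + eps
    return plain, centered

# Gradient with a consistent sign: the variance estimate is tiny, so the
# centered denominator is far smaller than the plain one.
steady_plain, steady_centered = denominators([1.0] * 200)

# Alternating-sign gradient: the running mean stays near zero, so the two
# denominators nearly agree.
noisy_plain, noisy_centered = denominators([1.0, -1.0] * 100)
```

In the steady case the centered denominator collapses toward ε, which is why centered RMSProp is more sensitive to the choice of ε than the plain variant.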

With momentum

Many implementations also support an optional momentum term that smooths the parameter update itself, separate from the squared-gradient running average:

v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
b_t = μ · b_{t-1} + g_t / (sqrt(v_t) + ε)
θ_{t+1} = θ_t - α · b_t

With μ = 0.9 this is conceptually close to Adam, although Adam's bias-correction step makes the early-iteration behavior slightly different. PyTorch's torch.optim.RMSprop exposes this as the momentum argument and sets it to 0 by default. Some reinforcement learning codebases turned this on with values around 0.9 or 0.95 to smooth out very noisy policy-gradient updates.
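The three-line update above translates directly into code. The following is a minimal scalar sketch, not any library's implementation; the function name `rmsprop_momentum` and the constant-gradient test input are assumptions made for illustration.

```python
import math

def rmsprop_momentum(grads, theta=0.0, lr=0.01, rho=0.9, mu=0.9, eps=1e-8):
    """Apply RMSProp with a momentum buffer to a scalar parameter."""
    v = 0.0  # running mean of squared gradients
    b = 0.0  # momentum buffer of normalized gradients
    for g in grads:
        v = rho * v + (1 - rho) * g * g
        b = mu * b + g / (math.sqrt(v) + eps)
        theta -= lr * b
    return theta

constant = [1.0] * 300
theta_plain = rmsprop_momentum(constant, mu=0.0)  # momentum disabled
theta_mom = rmsprop_momentum(constant, mu=0.9)    # momentum enabled
```

Under a constant gradient the momentum buffer settles near 1 / (1 - μ), so with μ = 0.9 the steady-state step is about ten times larger than without momentum. That amplification is one reason learning rates are usually lowered when momentum is switched on.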

A worked toy example

A two-parameter quadratic illustrates the per-parameter scaling clearly. Consider the loss L(θ₁, θ₂) = 50 θ₁² + θ₂². The gradient is g = (100 θ₁, 2 θ₂). Starting from (θ₁, θ₂) = (1, 1) with vanilla SGD at learning rate 0.01, the first step in θ₁ is -0.01 × 100 = -1.0, which lands exactly on the minimum only because 0.01 happens to equal the inverse curvature 1/100; any larger learning rate overshoots, and anything above 0.02 diverges along θ₁. The first step in θ₂ is -0.01 × 2 = -0.02, which barely moves. No single learning rate suits both directions. RMSProp with ρ = 0.9 instead computes the per-parameter accumulators v₁ = (1 - 0.9) × 100² = 1000 for θ₁ and v₂ = (1 - 0.9) × 2² = 0.4 for θ₂. The first updates become -0.01 × 100 / sqrt(1000) ≈ -0.0316 and -0.01 × 2 / sqrt(0.4) ≈ -0.0316. Both parameters move by the same amount in parameter space, so neither overshoots and neither stalls. That is the entire point of the algorithm.
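The arithmetic in this example can be reproduced in a few lines of plain Python. The variable names below are chosen for illustration; the accumulators start at zero, as in the text.

```python
import math

def grad(theta1, theta2):
    """Gradient of L = 50*theta1**2 + theta2**2."""
    return 100.0 * theta1, 2.0 * theta2

lr, rho, eps = 0.01, 0.9, 1e-8
g1, g2 = grad(1.0, 1.0)

# Vanilla SGD: the step is proportional to the raw gradient.
sgd_step1 = -lr * g1
sgd_step2 = -lr * g2

# RMSProp, first step from a zero-initialized accumulator.
v1 = (1 - rho) * g1 ** 2
v2 = (1 - rho) * g2 ** 2
rms_step1 = -lr * g1 / (math.sqrt(v1) + eps)
rms_step2 = -lr * g2 / (math.sqrt(v2) + eps)
```

On the very first step the ratio g / sqrt((1 - ρ) g²) equals 1 / sqrt(1 - ρ) for any nonzero gradient, which is why the two RMSProp steps match so closely here; on later steps they are only approximately equal.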

Comparison to AdaGrad

AdaGrad (Duchi, Hazan, Singer 2011, JMLR) uses the same general structure: each parameter has its own learning-rate scale computed from past squared gradients. The difference is how that scale is accumulated.

# AdaGrad
G_t   = G_{t-1} + g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(G_t) + ε)

# RMSProp
v_t   = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t) + ε)

The AdaGrad accumulator Gₜ grows monotonically. For sparse problems this is a feature: rare features get large updates because G stays small. For dense problems it is a bug: G grows roughly linearly in t, so the effective learning rate shrinks like 1/sqrt(t) and learning eventually stops. RMSProp's exponential decay fixes this. With ρ = 0.9, only the last 10 to 20 steps contribute meaningfully, so the adaptive rate tracks the current geometry of the loss surface.

There is a clean way to think about the difference. AdaGrad's denominator estimates the cumulative L2 norm of all past gradients for that parameter. RMSProp's denominator estimates the running root-mean-square of recent gradients, which is bounded as long as gradients themselves are bounded. AdaGrad's effective learning rate is monotonically non-increasing, so once it has decayed, it cannot recover, even if the loss landscape changes (for instance after a learning-rate warmup, a curriculum shift, or a fine-tuning phase). RMSProp's effective learning rate can grow back as soon as recent gradients become small. That difference is what makes RMSProp viable for the long, multi-stage training runs typical of modern deep learning, while AdaGrad mostly stayed in the convex-optimization and sparse-feature literature where its monotonicity is the right thing to want.
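The contrast between a growing sum and a decaying average is easy to see under a constant gradient. The sketch below is a simplified scalar illustration; `effective_rates` is a hypothetical helper, and "effective rate" here means α divided by the optimizer's denominator.

```python
import math

def effective_rates(num_steps, g=1.0, lr=0.01, rho=0.9, eps=1e-8):
    """Effective per-step learning rates under a constant gradient g."""
    G = 0.0  # AdaGrad: running sum of squared gradients
    v = 0.0  # RMSProp: exponential moving average of squared gradients
    adagrad, rmsprop = [], []
    for _ in range(num_steps):
        G += g * g
        v = rho * v + (1 - rho) * g * g
        adagrad.append(lr / (math.sqrt(G) + eps))
        rmsprop.append(lr / (math.sqrt(v) + eps))
    return adagrad, rmsprop

adagrad_rates, rmsprop_rates = effective_rates(10_000)
```

After 10,000 steps the AdaGrad rate has fallen by a factor of sqrt(10,000) = 100 from its initial value, while the RMSProp rate has settled at α / |g|.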

Comparison to AdaDelta

AdaDelta (Zeiler 2012, arXiv:1212.5701) was developed independently and posted about two months after Hinton's lecture. It uses the same exponentially decaying mean of squared gradients but tries to eliminate the learning rate α entirely by also keeping a running average of squared parameter updates and using the ratio of those two RMS quantities as the step size:

# AdaDelta
E[g²]_t = ρ · E[g²]_{t-1} + (1 - ρ) · g_t²
Δθ_t   = -(sqrt(E[Δθ²]_{t-1} + ε) / sqrt(E[g²]_t + ε)) · g_t
E[Δθ²]_t = ρ · E[Δθ²]_{t-1} + (1 - ρ) · Δθ_t²
θ_{t+1}  = θ_t + Δθ_t

In theory this makes AdaDelta hyperparameter-free apart from ρ and ε; in practice most implementations still expose a learning-rate multiplier. AdaDelta is still a built-in optimizer in PyTorch, TensorFlow, and Keras, but sees less use since Adam.

The key motivation in Zeiler's paper is unit consistency. He argued that updates in SGD and AdaGrad-style methods do not have the same units as the parameter being updated, and the same reasoning applies to RMSProp's update α · g / sqrt(v): the gradient has units of 1 / units(θ), dividing by sqrt(v) cancels those units and leaves a dimensionless ratio, so the user-supplied learning rate α has to supply the units of θ. AdaDelta replaces α with the running RMS of past parameter updates, which has the same units as θ, making the update dimensionally consistent. This is a clean theoretical observation, and on some tasks it does remove the need to tune a learning rate from scratch. In modern practice, however, the simplicity of just trying a few learning rates with Adam usually wins.
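A single AdaDelta step for a scalar parameter shows how the ratio of two RMS quantities stands in for the learning rate. This is a minimal sketch following the update equations given earlier in this section; the function name `adadelta_step`, the starting point, and the test objective f(θ) = θ² are assumptions for illustration.

```python
import math

def adadelta_step(theta, g, avg_sq_grad, avg_sq_delta, rho=0.95, eps=1e-6):
    """One AdaDelta update; note that no learning rate appears."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
    # The step is a ratio of two RMS values times the gradient, so it
    # carries the units of theta rather than the units of the gradient.
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * g
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return theta + delta, avg_sq_grad, avg_sq_delta

# Ten steps on f(theta) = theta**2, whose gradient is 2 * theta.
theta, eg2, ed2 = 1.0, 0.0, 0.0
first_theta = None
for _ in range(10):
    theta, eg2, ed2 = adadelta_step(theta, 2.0 * theta, eg2, ed2)
    if first_theta is None:
        first_theta = theta
```

Because the update accumulator starts at zero, the first steps are on the order of sqrt(ε), which is why AdaDelta tends to start slowly and why ε has more influence on its behavior than it does in RMSProp.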

Comparison to Adam

Adam (Kingma & Ba 2014, arXiv:1412.6980, ICLR 2015) is essentially RMSProp with momentum on the gradient itself plus bias correction. Adam keeps two running averages, one of the gradient (mₜ) and one of the squared gradient (vₜ), each with its own decay (β₁, β₂):

# Adam
m_t = β₁ · m_{t-1} + (1 - β₁) · g_t
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²
m̂_t = m_t / (1 - β₁ᵗ)         # bias correction
v̂_t = v_t / (1 - β₂ᵗ)         # bias correction
θ_{t+1} = θ_t - α · m̂_t / (sqrt(v̂_t) + ε)

With β₁ = 0 and bias correction off, Adam reduces to RMSProp with ρ = β₂. The first moment mₜ does the same job as heavy-ball momentum in SGD with momentum: it averages out noise in successive minibatch gradients. Bias correction matters because mₜ and vₜ start at zero. Adam and its descendant AdamW (decoupled weight decay) are the default optimizers for almost every transformer and modern vision model.
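The reduction from Adam to RMSProp can be checked numerically. The sketch below implements both updates for a scalar parameter in plain Python; the `bias_correction` switch and the function names are illustrative additions, not part of any library's Adam API.

```python
import math

def rmsprop_step(theta, g, v, lr=1e-3, rho=0.999, eps=1e-8):
    """One RMSProp update for a scalar parameter."""
    v = rho * v + (1 - rho) * g * g
    theta = theta - lr * g / (math.sqrt(v) + eps)
    return theta, v

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, bias_correction=True):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        m_hat, v_hat = m, v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

grads = [0.5, -1.0, 2.0, 0.1, -0.3]
theta_r, v_r = 1.0, 0.0
theta_a, m_a, v_a = 1.0, 0.0, 0.0
for t, g in enumerate(grads, start=1):
    theta_r, v_r = rmsprop_step(theta_r, g, v_r)
    theta_a, m_a, v_a = adam_step(theta_a, g, m_a, v_a, t,
                                  beta1=0.0, bias_correction=False)

# With bias correction on, Adam's first step has magnitude close to lr.
theta_first, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1)
```

The two trajectories coincide when β₁ = 0 and bias correction is disabled. With bias correction enabled, the first Adam step has magnitude close to α regardless of the gradient's scale, whereas uncorrected RMSProp takes a first step of about α / sqrt(1 - ρ).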

Convergence and the AMSGrad issue

Kingma and Ba's original Adam paper offered a regret-bound proof for the optimizer in the convex online setting. Reddi, Kale & Kumar ("On the Convergence of Adam and Beyond," ICLR 2018) showed that the proof had a gap and constructed counter-examples on which Adam fails to converge to the optimum even in simple convex problems. The same construction technically affects RMSProp, since the troublesome term comes from the exponentially decaying squared-gradient accumulator that Adam inherits from RMSProp. Reddi and coauthors proposed AMSGrad, which keeps the running maximum of vₜ rather than the raw running average, restoring convergence guarantees. AMSGrad sees occasional use but never replaced Adam in practice, because the failure cases are rare on the kinds of non-convex objectives that come up in deep learning. The same caveats apply to RMSProp: the paper construction means the convergence story is a little less clean than the AdaGrad analysis, but on real neural-network training problems it has not turned into a practical problem.
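The difference between the decaying average and AMSGrad's running maximum shows up clearly after a single rare, large gradient. The following plain-Python sketch tracks only the second-moment term, not a full optimizer; the helper name and the gradient sequence are made up for illustration.

```python
def second_moment_traces(grads, beta2=0.9):
    """Track the plain EMA of g^2 and AMSGrad's running maximum of it."""
    v = 0.0      # exponential moving average of squared gradients
    v_max = 0.0  # AMSGrad: largest value v has taken so far
    ema_trace, max_trace = [], []
    for g in grads:
        v = beta2 * v + (1 - beta2) * g * g
        v_max = max(v_max, v)
        ema_trace.append(v)
        max_trace.append(v_max)
    return ema_trace, max_trace

# One rare large gradient followed by many small ones.
grads = [10.0] + [0.1] * 50
ema_trace, max_trace = second_moment_traces(grads)
```

The decaying average forgets the large gradient within a few dozen steps, so the effective learning rate rises again; the running maximum never decreases, which is what makes the effective learning rate non-increasing and restores the convergence argument.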

Hyperparameters

RMSProp has a small number of knobs. The defaults below are the ones used by the major libraries.

Hyperparameter	Symbol	Common default	Notes
Learning rate	α	0.001 (Keras), 0.01 (PyTorch)	DQN used 0.00025
Decay (squared gradient)	ρ, γ	0.9 (Keras), 0.99 (PyTorch)	Called alpha in PyTorch, rho in Keras
Epsilon	ε	1e-7 (Keras), 1e-8 (PyTorch)	1e-6 in some RL papers
Momentum	μ	0	Optional
Centered	flag	False	Subtract running mean (Graves 2013)
Weight decay	λ	0	L2 penalty added to gradient

The 0.9 default for ρ corresponds to an effective averaging window of roughly 10 steps. For DQN specifically, the published learning rate of 0.00025 reflects how noisy bootstrap-target updates are; using the standard 0.001 there tends to make Q-values diverge.
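The "effective window" figure follows from the geometric weights of the moving average: the gradient from k steps ago carries weight (1 - ρ) ρᵏ. A small sketch, with illustrative function names, makes the numbers explicit.

```python
def effective_window(rho):
    """Rule-of-thumb averaging horizon of an EMA with decay rho."""
    return 1.0 / (1.0 - rho)

def ema_weight_in_last(n, rho):
    """Fraction of total EMA weight carried by the most recent n gradients.

    The weights (1 - rho) * rho**k for k = 0..n-1 sum to 1 - rho**n.
    """
    return 1.0 - rho ** n

window_09 = effective_window(0.9)          # about 10 steps
window_099 = effective_window(0.99)        # about 100 steps
share_10 = ema_weight_in_last(10, 0.9)     # weight in the last 10 steps
share_20 = ema_weight_in_last(20, 0.9)     # weight in the last 20 steps
```

For ρ = 0.9, roughly 65% of the weight sits in the last 10 gradients and roughly 88% in the last 20, which is consistent with the "10 to 20 steps" rule of thumb.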

Tuning guidance

Learning rate is by far the most important RMSProp hyperparameter, just as it is for SGD. Common practice is to sweep α across roughly three orders of magnitude on a log scale (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2) and pick the largest one that does not diverge in the first few hundred steps. RMSProp's per-parameter scaling makes the optimizer more forgiving of a slightly-too-large learning rate than vanilla SGD, but it does not eliminate the cliff entirely; once α crosses some task-dependent threshold the loss will still go to NaN within a few iterations.

The decay ρ is rarely worth tuning. Both 0.9 and 0.99 are reasonable. 0.9 makes the running average track changes in gradient scale more aggressively, which helps when the loss landscape changes rapidly; 0.99 (PyTorch's default) is smoother and slightly more stable. The original Hinton lecture used a value of 0.9 in its examples.

ε acts as a soft floor on the denominator and therefore as a soft ceiling on the per-step update size. On problems where gradients are very small (typical for the late stages of large-scale training), increasing ε by a couple of orders of magnitude (for instance from 1e-8 to 1e-4) can prevent updates from blowing up when sqrt(vₜ) is also tiny. The DQN default of ε = 0.01 is a deliberately large value chosen for exactly this reason; in deep RL the squared-gradient running average can drop to genuinely small values whenever the policy briefly stops exploring, and a tiny ε will then turn a small gradient into a huge step.

Weight decay in RMSProp implementations is added to the gradient before computing vₜ, so it is L2 regularization in the classical sense, not the decoupled weight decay used by AdamW. If you want decoupled weight decay with an RMSProp-style update, you have to either roll it yourself or use Optax, where the gradient transformation pipeline lets you apply weight decay independently of the squared-gradient normalization.
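The distinction is easiest to see when the loss gradient is zero and only the decay term acts. The sketch below compares the classical L2 form with a hand-rolled decoupled form; `rmsprop_decoupled_step` is a hypothetical function written for this illustration (modeled on the AdamW idea), not something the major libraries ship under that name.

```python
import math

def rmsprop_l2_step(theta, g, v, lr=0.01, rho=0.9, eps=1e-8, weight_decay=0.1):
    """Classical L2: the penalty is folded into the gradient, then normalized."""
    g = g + weight_decay * theta
    v = rho * v + (1 - rho) * g * g
    theta = theta - lr * g / (math.sqrt(v) + eps)
    return theta, v

def rmsprop_decoupled_step(theta, g, v, lr=0.01, rho=0.9, eps=1e-8,
                           weight_decay=0.1):
    """Hypothetical decoupled form: shrink theta directly, AdamW-style."""
    theta = theta * (1 - lr * weight_decay)
    v = rho * v + (1 - rho) * g * g
    theta = theta - lr * g / (math.sqrt(v) + eps)
    return theta, v

# Zero loss gradient: only the weight-decay term moves the parameter.
l2_theta, _ = rmsprop_l2_step(theta=1.0, g=0.0, v=0.0)
dec_theta, _ = rmsprop_decoupled_step(theta=1.0, g=0.0, v=0.0)
```

In the L2 form the penalty passes through the RMS normalization, so on this first step the parameter moves by α / sqrt(1 - ρ) regardless of the decay coefficient. In the decoupled form the shrinkage is exactly α · λ · θ, independent of the gradient statistics.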

Implementation

RMSProp is built in to all the major deep-learning libraries. The core update is the same across them; the main things to watch when porting code are the default learning rate, the default decay, and the placement of ε.

PyTorch

torch.optim.RMSprop defaults: lr 0.01, alpha (decay) 0.99, eps 1e-8, momentum 0, centered False. The decay default of 0.99 is higher than the 0.9 used in the examples in Hinton's lecture.

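import torch
from torch import nn

# Placeholder model, loss, and data so the snippet runs on its own.
model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
loader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.9,        # squared-gradient decay (rho in this article)
    eps=1e-8,
)

for x, y in loader:
    optimizer.zero_grad()             # clear gradients from the previous step
    loss_fn(model(x), y).backward()   # compute gradients for this batch
    optimizer.step()                  # apply the RMSProp update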

PyTorch names the squared-gradient decay coefficient alpha, which collides with the symbol α used for the learning rate in most papers. The learning rate argument is just lr. The library also exposes a centered flag for centered RMSProp and a momentum argument that adds the explicit momentum buffer described above.

TensorFlow / Keras

tf.keras.optimizers.RMSprop defaults: learning rate 0.001, rho 0.9, momentum 0.0, epsilon 1e-7.

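import tensorflow as tf

# Placeholder model so the snippet runs on its own.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])

optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3, rho=0.9, epsilon=1e-7)
model.compile(optimizer=optimizer, loss="mse")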

Keras uses rho for the squared-gradient decay, which is closer to the conventional symbol. Note that the TensorFlow default learning rate (0.001) differs from PyTorch's (0.01) by a factor of ten. Hyperparameters tuned on one library do not transfer directly to the other without checking this.

JAX (Optax)

In the JAX ecosystem the standard implementation is optax.rmsprop, composable with the rest of the optax gradient-transformation pipeline.

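import jax.numpy as jnp
import optax

# Placeholder parameters and gradients so the snippet runs on its own.
params = {"w": jnp.ones(3)}
grads = {"w": jnp.array([0.1, -0.2, 0.3])}

optimizer = optax.rmsprop(learning_rate=1e-3, decay=0.9, eps=1e-8)
opt_state = optimizer.init(params)                       # explicit optimizer state
updates, opt_state = optimizer.update(grads, opt_state)  # compute scaled updates
params = optax.apply_updates(params, updates)            # apply them to params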

Optax separates the optimizer logic into init and update calls without any hidden state on the optimizer object, which makes it straightforward to combine RMSProp with gradient clipping, weight decay, learning-rate schedules, and other transformations using optax.chain. For example, a typical training pipeline might be optax.chain(optax.clip_by_global_norm(1.0), optax.rmsprop(1e-3)).

From scratch in NumPy

A reference implementation in pure NumPy fits in a few lines:

import numpy as np

class RMSProp:
    def __init__(self, params, lr=1e-3, rho=0.9, eps=1e-8):
        self.lr, self.rho, self.eps = lr, rho, eps
        self.v = [np.zeros_like(p) for p in params]

    def step(self, params, grads):
        for i, (p, g) in enumerate(zip(params, grads)):
            self.v[i] = self.rho * self.v[i] + (1 - self.rho) * g * g
            p -= self.lr * g / (np.sqrt(self.v[i]) + self.eps)

That is the entire algorithm. Real implementations add weight decay, momentum, gradient clipping, and bookkeeping, but the inner loop is exactly the two lines that update v and step the parameter.

Optimizer comparison

The following table summarizes how RMSProp relates to the other first-order optimizers it shares lineage with.

Optimizer	Year	Per-parameter scale	Momentum	Bias correction	Notes
SGD	1951 (Robbins & Monro)	No	No	No	Single global step size
SGD with momentum	1964 (Polyak)	No	Yes	No	Heavy-ball momentum
Nesterov momentum	1983 (Nesterov)	No	Yes	No	Lookahead momentum variant
AdaGrad	2011 (Duchi et al.)	Yes (sum of g²)	No	No	Learning rate decays to zero
RMSProp	2012 (Hinton)	Yes (EMA of g²)	Optional	No	Fixes AdaGrad's decay issue
AdaDelta	2012 (Zeiler)	Yes (EMA of g²)	No	No	Eliminates explicit learning rate
Adam	2014 (Kingma & Ba)	Yes (EMA of g²)	Yes (EMA of g)	Yes	RMSProp + momentum + bias correction
AdamW	2017 (Loshchilov & Hutter)	Yes (EMA of g²)	Yes (EMA of g)	Yes	Adam with decoupled weight decay
AMSGrad	2018 (Reddi et al.)	Yes (max of g² EMA)	Yes	No	Adam variant with restored convergence proof
AdaBelief	2020 (Zhuang et al.)	Yes (EMA of (g - m)²)	Yes	Yes	Adam variant tracking gradient variance

Memory and compute cost

All first-order adaptive optimizers cost extra memory because they have to store auxiliary state per parameter. SGD has zero extra state, SGD with momentum has one buffer per parameter, RMSProp has one buffer (two with either momentum or centering, three with both), Adam has two, and AMSGrad has three. For a transformer with 7 billion parameters in fp32, that is 28 GB of optimizer state for plain RMSProp, 56 GB for Adam, and 84 GB for AMSGrad, which is why optimizer state offloading and 8-bit Adam exist. RMSProp's relatively modest state footprint, half of Adam's, was historically one of its small practical advantages on memory-constrained hardware, though in the era of distributed training across hundreds of GPUs that consideration has mostly faded.
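The figures above are simple multiplication, sketched here so they can be recomputed for other model sizes or precisions. The buffer counts are the ones stated in this section for the plain variants; the dictionary and function names are illustrative.

```python
BYTES_PER_FP32 = 4

# Per-parameter state buffers for the plain variant of each optimizer.
STATE_BUFFERS = {
    "sgd": 0,
    "sgd_momentum": 1,
    "rmsprop": 1,
    "adam": 2,
    "amsgrad": 3,
}

def optimizer_state_gb(num_params, optimizer, bytes_per_value=BYTES_PER_FP32):
    """Optimizer state size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * STATE_BUFFERS[optimizer] * bytes_per_value / 1e9

seven_b = 7_000_000_000
rmsprop_gb = optimizer_state_gb(seven_b, "rmsprop")
adam_gb = optimizer_state_gb(seven_b, "adam")
amsgrad_gb = optimizer_state_gb(seven_b, "amsgrad")
```

These numbers cover optimizer state only; the parameters themselves, gradients, and activations come on top.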

Where it has been used

RMSProp shows up in specific corners of the deep-learning literature, mostly from the 2013 to 2016 window when it was the default for sequence models and RL.

Reinforcement learning

The DeepMind paper that put deep RL on the map (Mnih et al., "Human-level control through deep reinforcement learning", Nature 518, 2015) used RMSProp to train the Q-network. Published settings: learning rate 0.00025, gradient momentum 0.95, squared-gradient momentum 0.95, and minimum squared gradient (the ε term) 0.01. A lot of follow-on work (DQN variants, Rainbow, Ape-X) inherited those settings even after Adam became standard elsewhere. The 2016 Asynchronous Methods for Deep Reinforcement Learning paper introduced A3C, which used a shared RMSProp accumulator across asynchronous actor-learner workers; the squared-gradient running average was held in shared memory and updated asynchronously and without locking by every worker, giving each worker the benefit of a population-level estimate of gradient scale without any explicit synchronization. That trick became standard in distributed RL implementations for several years. Several other reinforcement-learning algorithms from the same era (TRPO baselines, certain ACER configurations, the original IMPALA reference) also defaulted to RMSProp. As of 2026, deep RL libraries typically expose both Adam and RMSProp and most new agents pick Adam, but the legacy DQN and A3C settings remain the canonical reference points for benchmarks on the Arcade Learning Environment.

Recurrent neural networks

For recurrent neural networks, Alex Graves's character-level RNN work (Graves 2013) used centered RMSProp. Recurrent Batch Normalization (Cooijmans et al. 2016) used RMSProp on language modeling and sequence MNIST. Several early seq2seq systems used RMSProp before switching to Adam. The motivation for RMSProp on RNNs was practical: gradient magnitudes in long-sequence backpropagation through time vary dramatically across parameters, especially in the recurrent matrices, and per-parameter scaling helps prevent the small subset of weights that experience the largest gradients from dominating the update. Once Adam took over, the same property carried over, so the switch had little qualitative effect on training dynamics for most RNN setups.

Generative adversarial networks

For GANs, the Wasserstein GAN paper (WGAN, Arjovsky, Chintala & Bottou 2017) used RMSProp for both critic and generator and explicitly recommended against Adam, on the grounds that Adam's momentum term plus their weight-clipping scheme made the critic loss less reliable. The original WGAN code uses RMSProp with learning rate 5e-5. The follow-on WGAN-GP paper (Gulrajani et al. 2017) reverted to Adam after replacing weight clipping with a gradient penalty, which suggests that the WGAN preference for RMSProp had as much to do with the specific weight-clipping mechanism as with any general property of GAN training. A few other adversarial setups, including some early attempts at adversarial training for robustness, also reached for RMSProp on the theory that momentum makes saddle-point dynamics worse, but this is more folklore than measured fact.

Other notable uses

RMSProp was a common default for smaller models in the 2014 to 2016 era, including character-level neural language models in Andrej Karpathy's widely circulated char-rnn codebase, several speech recognition baselines, and the original style-transfer implementations. For supervised image classification it was always less common than SGD with momentum, which delivered better final accuracy on ImageNet-scale benchmarks. Once Adam, and later AdamW, displaced both, RMSProp gradually became a niche choice for new work outside reinforcement learning.

When (not) to use it today

For most tasks, AdamW is a better default. The momentum term usually helps, the bias correction makes the first few hundred steps less twitchy, and decoupled weight decay does the right thing for regularization. On transformers, large CNNs, and diffusion models, AdamW with cosine learning-rate scheduling and a short warmup is the standard recipe and almost always at least as good as RMSProp.

Reproducing an older paper is the most common reason to still use RMSProp. DQN, WGAN, and a chunk of the 2013 to 2016 deep-learning literature use it with specific hyperparameters, and if you want your numbers to match, you use the same optimizer with the same settings. Some adversarial training setups (WGAN being the classic one) also work better without first-moment momentum because the loss landscape is non-stationary by design. For a brand-new project with no prior art, start with AdamW.

Decision checklist

In rough order of priority, the situations where reaching for RMSProp over Adam still makes sense:

The reference implementation you are reproducing uses RMSProp and you want closely comparable numbers. This covers most Atari DQN and A3C work.
The optimization landscape is adversarial or otherwise non-stationary, and the gradient mean is genuinely uninformative or actively misleading, so adding first-moment momentum hurts. WGAN's critic is the canonical example.
You are training a small recurrent model and want to match the hyperparameters from a 2013-vintage paper without retuning.
You are working in a memory-constrained environment where halving the optimizer state from Adam to RMSProp matters.
You are running a unit test or sanity check and want a simpler, fewer-knob optimizer than Adam.

For every other modern setting, AdamW is the safer default. Some practitioners prefer SGD with momentum and a long cosine schedule for image classification on ConvNets, but that is a separate debate that has nothing to do with RMSProp specifically.

Common pitfalls

A short list of things that trip up engineers using RMSProp in production:

Default learning rate mismatch. PyTorch defaults to lr 0.01; Keras and TensorFlow default to lr 0.001 (matching what most papers use). The default decay differs as well (0.99 in PyTorch, 0.9 in Keras). Code ported between the two libraries often diverges or stalls because of this. Always set the learning rate and decay explicitly when reading or writing reference implementations.
Epsilon placement. As noted above, g / (sqrt(v) + ε) and g / sqrt(v + ε) are not equal. Frameworks differ. When epsilon dominates the denominator (early in training or in regions of vanishing gradient), the two formulations can produce noticeably different updates.
No bias correction. Unlike Adam, RMSProp does not bias-correct vₜ, so the running average is biased toward zero for the first few multiples of 1 / (1 - ρ) steps: a few tens of steps at ρ = 0.9, a few hundred at ρ = 0.99. The practical effect is usually small but it does mean that the first updates are larger than the steady-state behavior would predict. For very short training runs this can matter.
Centered flag confusion. The centered=True flag silently changes the algorithm. It uses more memory, runs more slowly, and can subtly change convergence on some problems. Do not flip it on without rerunning your hyperparameter sweep.
Weight decay is L2, not decoupled. If you set the weight_decay argument in PyTorch's RMSprop, you are getting classical L2 regularization mixed into the gradient before the squared-gradient running average, not the decoupled form used by AdamW. This is usually fine but can interact badly with very large weight-decay coefficients.
Large ε in deep RL. The DQN setting of ε = 0.01 looks unreasonable to most deep-learning practitioners, but it is a deliberate choice to handle vanishing-gradient regions in Q-learning. Reproducing DQN-style experiments without setting ε to its published value tends to make Q-values diverge.
Numerical drift across hardware. RMSProp's denominator is sensitive to the order of floating-point operations in the squared-gradient running average. Mixed-precision training, distributed reductions, and TPU vs. GPU implementations can produce slightly different trajectories from the same initial conditions. This rarely matters for final accuracy but can complicate exact reproducibility.
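The missing bias correction from the list above can be quantified under a constant gradient, where the normalized step g / sqrt(vₜ) equals 1 / sqrt(1 - ρᵗ). The sketch below is plain Python with an illustrative helper name; ε is omitted because the gradient is nonzero.

```python
import math

def normalized_step_sizes(num_steps, rho, g=1.0):
    """Return |g| / sqrt(v_t) for each step, starting from v = 0."""
    v = 0.0
    sizes = []
    for _ in range(num_steps):
        v = rho * v + (1 - rho) * g * g
        sizes.append(abs(g) / math.sqrt(v))
    return sizes

sizes_09 = normalized_step_sizes(50, rho=0.9)
sizes_099 = normalized_step_sizes(50, rho=0.99)
```

With ρ = 0.9 the first step is about 3.2 times the steady-state size and the inflation is essentially gone after 50 steps. With ρ = 0.99 the first step is about 10 times larger and the 50th step is still roughly 1.6 times larger, so a short learning-rate warmup is a common workaround when the larger decay is used.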

References

Tieleman, T. and Hinton, G. (2012). *Lecture 6.5: rmsprop, COURSERA: Neural Networks for Machine Learning.* University of Toronto.
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. *Journal of Machine Learning Research*, 12, 2121-2159.
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701.
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980. ICLR 2015.
Loshchilov, I., & Hutter, F. (2017). Decoupled Weight Decay Regularization. arXiv:1711.05101. ICLR 2019.
Graves, A. (2013). Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850.
Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature* 518, 529-533.
Mnih, V., Badia, A. P., Mirza, M., et al. (2016). Asynchronous Methods for Deep Reinforcement Learning. *Proceedings of ICML 2016*. arXiv:1602.01783.
Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., & Courville, A. (2016). Recurrent Batch Normalization. arXiv:1603.09025.
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein GAN. arXiv:1701.07875.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. (2017). Improved Training of Wasserstein GANs. arXiv:1704.00028.
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the Convergence of Adam and Beyond. ICLR 2018.
Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. *IEEE International Conference on Neural Networks*.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press, Section 8.5.
PyTorch documentation: `torch.optim.RMSprop`. TensorFlow documentation: `tf.keras.optimizers.RMSprop`. Optax documentation: `optax.rmsprop`.
Wikipedia contributors. "Stochastic gradient descent" (section on RMSProp). Wikipedia.
