RMSProp
Last reviewed
May 9, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 5,402 words
RMSProp (Root Mean Square Propagation) is an adaptive learning-rate optimizer for training neural networks with mini-batch stochastic gradient descent. It divides each parameter's gradient by a running root-mean-square estimate of recent gradient magnitudes for that parameter, keeping the effective step size roughly the same across parameters even when raw gradients differ in scale by orders of magnitude.
The method was proposed by Geoffrey Hinton in 2012 in lecture 6e of his Coursera course Neural Networks for Machine Learning ("rmsprop: Divide the gradient by a running average of its recent magnitude"). It was implemented in the course's Octave examples by Tijmen Tieleman, which is why it is often cited as Tieleman & Hinton 2012. There was no stand-alone paper; the canonical citation in the literature is the lecture slide deck itself. Hinton has joked in talks since that he "got tired of writing papers," so the method only ever lived inside a course handout, even as it became one of the most widely used adaptive optimizers of the early deep-learning era.
RMSProp sits in the early-2010s wave of adaptive optimizers. AdaGrad came first (Duchi, Hazan & Singer 2011) and worked well for sparse problems but accumulated squared gradients without ever forgetting them, so the effective learning rate eventually crashed to zero. RMSProp fixed that by replacing the running sum with an exponentially decaying average. AdaDelta (Zeiler, December 2012) used the same trick independently. Adam (Kingma & Ba 2014) folded RMSProp together with momentum and added bias correction, and Adam plus its descendant AdamW ate most of RMSProp's market share for general deep learning work. RMSProp is still around, mostly in older deep reinforcement learning code (the original DQN is the most cited example) and in GAN settings where Adam's momentum term makes training less stable.
In the fall of 2012, Hinton taught a free Coursera course called Neural Networks for Machine Learning, one of the first MOOCs aimed at deep learning, running at about the same time that AlexNet won ILSVRC. Lecture 6 covered how to make learning go faster; one of its slides introduced RMSProp in roughly two bullet points. Hinton noted that an earlier method called rprop (Riedmiller & Braun 1993) used only the sign of the gradient, which works for full-batch training but breaks down for mini-batches because magnitude information matters when batches are noisy. RMSProp keeps rprop's per-parameter scaling but uses a smooth running average of squared gradients instead of a per-step sign flip.
Riedmiller and Braun's rprop algorithm assigned each parameter a separate step size, increased it whenever the sign of the gradient stayed the same across consecutive iterations, and decreased it whenever the sign flipped. The procedure ignored gradient magnitude entirely. That is fine when gradients are computed over the whole training set, because two consecutive full-batch gradients have comparable magnitudes by construction. With mini-batches, two consecutive gradients can have wildly different magnitudes simply because the batches contain different examples; rprop's sign-only update tends to thrash. Hinton's lecture framed RMSProp as the natural fix: replace the sign-comparison heuristic with a per-parameter scale derived from a smoothly accumulated estimate of recent squared gradients, so that the update divides the raw gradient by something close to its typical magnitude.
Tijmen Tieleman, then a PhD student in Hinton's group, wrote the implementation that students used in the Coursera assignments. There was no journal or conference paper; the canonical citation is "Tieleman, T. and Hinton, G. (2012). Lecture 6.5 of Neural Networks for Machine Learning, Coursera." Zeiler's AdaDelta paper, posted to arXiv in December 2012, confirmed that others had landed on roughly the same construction. Yann Dauphin and others working on RNN training in 2013 and 2014 also began citing the lecture handout in workshop papers as a way to refer to the technique, which is largely how it entered the literature.
For several years RMSProp was the default optimizer when something better than vanilla SGD was needed. The first major demonstration was the DQN paper (Mnih et al., Nature 2015). After Adam landed in 2015, most new work moved over.
| Year | Event |
|---|---|
| 1993 | Riedmiller & Braun publish rprop, the conceptual predecessor. |
| 2011 | Duchi, Hazan & Singer publish AdaGrad. |
| 2012 (Oct) | Hinton lectures on RMSProp in Coursera Neural Networks for Machine Learning, lecture 6e. |
| 2012 (Dec) | Zeiler posts AdaDelta on arXiv, independently arriving at an exponentially decayed accumulator. |
| 2013 | Alex Graves uses centered RMSProp for handwriting and text generation RNNs. |
| 2014 | Kingma & Ba post the Adam preprint, framing it as RMSProp with momentum and bias correction. |
| 2015 | Mnih et al. publish Human-level control through deep reinforcement learning in Nature; DQN uses RMSProp. |
| 2016 | Mnih et al. release A3C; the asynchronous version uses a shared RMSProp accumulator across workers. |
| 2017 | Arjovsky et al. publish Wasserstein GAN; explicitly recommends RMSProp over Adam for the critic. |
| 2017 (Nov) | Loshchilov & Hutter post AdamW, after which AdamW becomes the standard for transformer training. |
| 2018+ | RMSProp recedes as a default for new architectures but remains common for reinforcement learning baselines and reproductions of older papers. |
RMSProp is a per-parameter optimizer. For each scalar parameter θ with gradient g at the current step, it keeps a running estimate v of the squared gradient and uses the square root of that estimate to scale the step.
Let θₜ be the parameter at step t, gₜ the gradient of the loss with respect to θₜ, ρ a decay coefficient (often called γ or β₂), α the global learning rate, and ε a small constant for numerical stability.
```
v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t) + ε)
```
vₜ is the exponential moving average of squared gradients. It plays the same role as the AdaGrad accumulator, except the contribution of any single past gradient decays geometrically over time instead of staying in the sum forever. The square root sqrt(vₜ) is the running root-mean-square (hence the name), and dividing gₜ by it produces a step that has roughly unit magnitude in expectation, regardless of whether the parameter usually sees large or small gradients.
If one parameter consistently sees gradients of magnitude 100 and another sees gradients of magnitude 0.01, vanilla SGD has to pick one global step size that works for both, and either the first parameter overshoots or the second barely moves. RMSProp's per-parameter denominator scales each update so the actual step in parameter space is comparable across parameters, no matter what scale the gradients live on. The exponential decay means that scale is computed from recent history, so it can shift over the course of training without permanently shrinking the effective learning rate the way AdaGrad does. The ε term keeps things from blowing up early in training when vₜ is close to zero; common defaults are 1e-7 or 1e-8 depending on the library.
The placement of ε matters more than it looks. The original lecture wrote the update as g / (sqrt(v) + ε), with ε added outside the square root. Some implementations and some Adam-style variants place it inside the square root: g / sqrt(v + ε). The two are not algebraically equal. With ε outside, the denominator is floored at ε, so the correction only matters once sqrt(v) falls to the order of ε. With ε inside, the floor is sqrt(ε), which for ε = 1e-8 is 1e-4, so the correction takes effect much earlier, once v itself falls to the order of ε. Both forms give a finite step when v is exactly zero, but for the same ε they can differ by orders of magnitude when gradients are small. PyTorch's torch.optim.RMSprop uses outside-sqrt ε to match Hinton's original lecture; TensorFlow has used both formulations across its history. When porting code across frameworks, this is a real source of subtle numerical drift.
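The gap between the two placements is easy to see numerically. The following is a minimal sketch in plain Python; the function names eps_outside and eps_inside are made up for illustration and do not correspond to any library API. Each test gradient is set to sqrt(v), so the ideal normalized step is 1.

```python
import math

def eps_outside(g, v, eps=1e-8):
    """Normalized gradient with epsilon added after the square root."""
    return g / (math.sqrt(v) + eps)

def eps_inside(g, v, eps=1e-8):
    """Normalized gradient with epsilon added under the square root."""
    return g / math.sqrt(v + eps)

for v in (1e-2, 1e-8, 1e-12):
    g = math.sqrt(v)  # gradient at its own typical magnitude
    print(f"v={v:.0e}  outside={eps_outside(g, v):.4f}  inside={eps_inside(g, v):.4f}")
```

With v = 1e-12 the outside form still returns a value close to 1, while the inside form returns roughly 0.01, because its denominator cannot fall below sqrt(1e-8) = 1e-4.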
A variant from Alex Graves's 2013 paper Generating Sequences With Recurrent Neural Networks (arXiv:1308.0850) also tracks a running mean of the gradient and subtracts its square from v before taking the square root, so the denominator becomes the running standard deviation rather than the running RMS:
```
m_t = ρ · m_{t-1} + (1 - ρ) · g_t
v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t - m_t²) + ε)
```
Graves reported that this made training more stable when gradients had a strong directional bias. The intuition is that if the gradient has a consistent sign, the standard RMSProp denominator will report a large magnitude even though the parameter is moving in a single coherent direction. Centered RMSProp removes the bias before taking the square root, so a parameter that is steadily decreasing gets a denominator close to zero rather than close to the gradient's typical absolute value, which makes the per-step move larger when there is a clear direction to move in. Modern libraries expose it as a centered=True flag, off by default.
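The effect of centering on a gradient with a consistent sign can be sketched for a single scalar parameter. This is an illustrative sketch, not any library's implementation: the function name is hypothetical, and the clamp of the variance estimate at zero is an added assumption to guard against floating-point rounding, not part of the formula above.

```python
import math

def centered_rmsprop_step(theta, g, m, v, lr=1e-3, rho=0.9, eps=1e-8):
    """One centered-RMSProp update for a scalar parameter."""
    m = rho * m + (1 - rho) * g        # running mean of the gradient
    v = rho * v + (1 - rho) * g * g    # running mean of the squared gradient
    var = max(v - m * m, 0.0)          # assumption: clamp tiny negative values from rounding
    theta = theta - lr * g / (math.sqrt(var) + eps)
    return theta, m, v

# Feed a constant gradient of 1.0 and watch the per-step move grow.
theta_c, m, v = 0.0, 0.0, 0.0
for _ in range(50):
    prev = theta_c
    theta_c, m, v = centered_rmsprop_step(theta_c, 1.0, m, v)
last_step = abs(theta_c - prev)
print(last_step)  # more than ten times lr, since the variance estimate shrinks toward zero
```

Under a perfectly constant gradient the variance estimate keeps shrinking, so the step keeps growing until ε takes over, which is one reason to re-tune the learning rate after switching the flag on.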
Many implementations also support an optional momentum term that smooths the parameter update itself, separate from the squared-gradient running average:
```
v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
b_t = μ · b_{t-1} + g_t / (sqrt(v_t) + ε)
θ_{t+1} = θ_t - α · b_t
```
With μ = 0.9 this is conceptually close to Adam, although Adam's bias-correction step makes the early-iteration behavior slightly different. PyTorch's torch.optim.RMSprop exposes this as the momentum argument and sets it to 0 by default. Some reinforcement learning codebases turned this on with values around 0.9 or 0.95 to smooth out very noisy policy-gradient updates.
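The three momentum equations above translate almost line for line into code. The sketch below handles one scalar parameter; the function name and the constant-gradient demo are illustrative assumptions rather than a reproduction of any framework's internals.

```python
import math

def rmsprop_momentum_step(theta, g, v, b, lr=1e-3, rho=0.9, mu=0.9, eps=1e-8):
    """One RMSProp-with-momentum update for a scalar parameter."""
    v = rho * v + (1 - rho) * g * g          # squared-gradient running average
    b = mu * b + g / (math.sqrt(v) + eps)    # momentum buffer over normalized gradients
    theta = theta - lr * b
    return theta, v, b

# With a steady gradient the buffer approaches 1 / (1 - mu) times the normalized gradient.
theta_m, v_m, b_m = 0.0, 0.0, 0.0
for _ in range(100):
    theta_m, v_m, b_m = rmsprop_momentum_step(theta_m, 1.0, v_m, b_m)
print(b_m)
```

The demo shows why turning momentum on usually calls for a smaller learning rate: with μ = 0.9 the steady-state step is about ten times larger than without the buffer.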
A two-parameter quadratic illustrates the per-parameter scaling clearly. Consider the loss L(θ₁, θ₂) = 50 θ₁² + θ₂². The gradient is g = (100 θ₁, 2 θ₂). Starting from (θ₁, θ₂) = (1, 1) with vanilla SGD at learning rate 0.01, the first step in θ₁ is -0.01 × 100 = -1.0, a jump as large as the parameter itself, while the first step in θ₂ is -0.01 × 2 = -0.02, which barely moves. This rate happens to land θ₁ exactly on its minimum, but it is already half the stability limit of 0.02 for that coordinate: any larger rate overshoots, and anything above 0.02 diverges, even though θ₂ would tolerate a rate fifty times higher. RMSProp with ρ = 0.9 and accumulators initialized to zero instead computes (1 - 0.9) × 100² = 1000 for θ₁ and (1 - 0.9) × 2² = 0.4 for θ₂. The first updates become -0.01 × 100 / sqrt(1000) ≈ -0.0316 and -0.01 × 2 / sqrt(0.4) ≈ -0.0316. Both parameters move by the same amount in parameter space, α / sqrt(1 - ρ), so neither takes an outsized jump and neither stalls. That is the entire point of the algorithm.
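The arithmetic in this example can be checked in a few lines. The helper name first_steps is hypothetical, and the sketch assumes the accumulator starts at zero, as in the formulas above.

```python
import math

def first_steps(theta, lr=0.01, rho=0.9, eps=1e-8):
    """First SGD and RMSProp steps on L = 50*t1^2 + t2^2 from the point theta."""
    grads = [100.0 * theta[0], 2.0 * theta[1]]   # analytic gradient of the loss
    sgd = [-lr * g for g in grads]
    v = [(1 - rho) * g * g for g in grads]       # accumulator after one update from zero
    rms = [-lr * g / (math.sqrt(vi) + eps) for g, vi in zip(grads, v)]
    return sgd, rms

sgd_step, rms_step = first_steps([1.0, 1.0])
print("SGD steps:    ", sgd_step)
print("RMSProp steps:", rms_step)
```

The SGD steps differ by a factor of fifty, while the two RMSProp steps coincide at about 0.0316.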
AdaGrad (Duchi, Hazan, Singer 2011, JMLR) uses the same general structure: each parameter has its own learning-rate scale computed from past squared gradients. The difference is how that scale is accumulated.
```
# AdaGrad
G_t = G_{t-1} + g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(G_t) + ε)

# RMSProp
v_t = ρ · v_{t-1} + (1 - ρ) · g_t²
θ_{t+1} = θ_t - α · g_t / (sqrt(v_t) + ε)
```
The AdaGrad accumulator Gₜ grows monotonically. For sparse problems this is a feature: rare features get large updates because G stays small. For dense problems it is a bug: G grows roughly linearly in t, so the effective learning rate shrinks like 1/sqrt(t) and learning eventually stops. RMSProp's exponential decay fixes this. With ρ = 0.9, only the last 10 to 20 steps contribute meaningfully, so the adaptive rate tracks the current geometry of the loss surface.
There is a clean way to think about the difference. AdaGrad's denominator estimates the cumulative L2 norm of all past gradients for that parameter. RMSProp's denominator estimates the running root-mean-square of recent gradients, which is bounded as long as gradients themselves are bounded. AdaGrad's effective learning rate is monotonically non-increasing, so once it has decayed, it cannot recover, even if the loss landscape changes (for instance after a learning-rate warmup, a curriculum shift, or a fine-tuning phase). RMSProp's effective learning rate can grow back as soon as recent gradients become small. That difference is what makes RMSProp viable for the long, multi-stage training runs typical of modern deep learning, while AdaGrad mostly stayed in the convex-optimization and sparse-feature literature where its monotonicity is the right thing to want.
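The recovery behaviour is easy to reproduce on a synthetic gradient sequence. The sketch below tracks the effective learning rate α / (sqrt(accumulator) + ε) for both methods; the two-phase gradient schedule (magnitude 1.0, then 0.1) is invented purely for illustration.

```python
import math

def effective_rates(grads, lr=0.01, rho=0.9, eps=1e-8):
    """Effective per-step learning rate of AdaGrad and RMSProp for one parameter."""
    G, v = 0.0, 0.0
    adagrad, rmsprop = [], []
    for g in grads:
        G += g * g                           # AdaGrad: running sum, never forgets
        v = rho * v + (1 - rho) * g * g      # RMSProp: exponentially decaying average
        adagrad.append(lr / (math.sqrt(G) + eps))
        rmsprop.append(lr / (math.sqrt(v) + eps))
    return adagrad, rmsprop

# Phase 1: gradients of magnitude 1.0; phase 2: gradients shrink to 0.1.
ada, rms = effective_rates([1.0] * 1000 + [0.1] * 1000)
print("AdaGrad:", ada[999], "->", ada[-1])
print("RMSProp:", rms[999], "->", rms[-1])
```

AdaGrad's rate keeps falling through the second phase, while RMSProp's rate rises by a factor of ten once the large gradients age out of the average.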
AdaDelta (Zeiler 2012, arXiv:1212.5701) was developed independently about three months after Hinton's lecture. It uses the same exponentially decaying mean of squared gradients but tries to eliminate the learning rate α entirely by also keeping a running average of squared parameter updates and using the ratio of those two RMS quantities as the step size:
```
# AdaDelta
E[g²]_t = ρ · E[g²]_{t-1} + (1 - ρ) · g_t²
Δθ_t = -(sqrt(E[Δθ²]_{t-1} + ε) / sqrt(E[g²]_t + ε)) · g_t
E[Δθ²]_t = ρ · E[Δθ²]_{t-1} + (1 - ρ) · Δθ_t²
θ_{t+1} = θ_t + Δθ_t
```
In theory this makes AdaDelta hyperparameter-free apart from ρ and ε; in practice most implementations still expose a learning-rate multiplier. AdaDelta is still a built-in optimizer in PyTorch, TensorFlow, and Keras, but sees less use since Adam.
The key motivation in Zeiler's paper is unit consistency. Applied to an RMSProp-style update α · g / sqrt(v), the argument runs as follows: g and sqrt(v) carry the same units, so their ratio is dimensionless, and the update has the units of θ only if the user-supplied learning rate α supplies them. AdaDelta replaces α with the running RMS of past parameter updates, which has the same units as θ, making the update dimensionally consistent. This is a clean theoretical observation, and on some tasks it does remove the need to tune a learning rate from scratch. In modern practice, however, the simplicity of just trying a few learning rates with Adam usually wins.
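A scalar sketch of the AdaDelta equations shows how the step size emerges without a learning rate. The function name is hypothetical, and the values ρ = 0.95 and ε = 1e-6 are chosen here for illustration rather than taken from any particular library default.

```python
import math

def adadelta_step(theta, g, eg2, edx2, rho=0.95, eps=1e-6):
    """One AdaDelta update for a scalar parameter; note there is no learning rate."""
    eg2 = rho * eg2 + (1 - rho) * g * g                        # running average of g^2
    dx = -(math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps)) * g   # ratio of the two RMS terms
    edx2 = rho * edx2 + (1 - rho) * dx * dx                    # running average of dx^2
    return theta + dx, eg2, edx2

theta_a, eg2, edx2 = 1.0, 0.0, 0.0
theta_a, eg2, edx2 = adadelta_step(theta_a, 2.0, eg2, edx2)
print(theta_a)
```

On the very first step the numerator is sqrt(ε), so ε effectively sets the initial step scale; this is part of why implementations still tend to expose a learning-rate multiplier.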
Adam (Kingma & Ba 2014, arXiv:1412.6980, ICLR 2015) is essentially RMSProp with momentum on the gradient itself plus bias correction. Adam keeps two running averages, one of the gradient (mₜ) and one of the squared gradient (vₜ), each with its own decay (β₁, β₂):
```
# Adam
m_t = β₁ · m_{t-1} + (1 - β₁) · g_t
v_t = β₂ · v_{t-1} + (1 - β₂) · g_t²
m̂_t = m_t / (1 - β₁ᵗ)   # bias correction
v̂_t = v_t / (1 - β₂ᵗ)   # bias correction
θ_{t+1} = θ_t - α · m̂_t / (sqrt(v̂_t) + ε)
```
With β₁ = 0 and bias correction off, Adam reduces to RMSProp with ρ = β₂. The first moment mₜ does the same job as heavy-ball momentum in SGD with momentum: it averages out noise in successive minibatch gradients. Bias correction matters because mₜ and vₜ start at zero. Adam and its descendant AdamW (decoupled weight decay) are the default optimizers for almost every transformer and modern vision model.
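The reduction can be verified directly. In the sketch below, both update functions are simplified scalar versions written for this comparison (the names and the bias_correction switch are illustrative assumptions, not a framework API).

```python
import math

def rmsprop_update(g, v, lr=1e-3, rho=0.9, eps=1e-8):
    """Return (parameter delta, new v) for scalar RMSProp."""
    v = rho * v + (1 - rho) * g * g
    return -lr * g / (math.sqrt(v) + eps), v

def adam_update(g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                bias_correction=True):
    """Return (parameter delta, new m, new v) for scalar Adam at step t (1-based)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m, v
    if bias_correction:
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    return -lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Adam with beta1 = 0 and no bias correction tracks RMSProp with rho = beta2.
v_r, m_a, v_a, max_gap = 0.0, 0.0, 0.0, 0.0
for t, g in enumerate([0.5, -1.2, 3.0, 0.1], start=1):
    d_r, v_r = rmsprop_update(g, v_r, rho=0.9)
    d_a, m_a, v_a = adam_update(g, m_a, v_a, t, beta1=0.0, beta2=0.9,
                                bias_correction=False)
    max_gap = max(max_gap, abs(d_r - d_a))
print(max_gap)
```

With bias correction on and the usual β values, Adam's first step has magnitude close to α regardless of the gradient's size, whereas RMSProp's first step is α / sqrt(1 - ρ).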
Kingma and Ba's original Adam paper offered a regret-bound proof for the optimizer in the convex online setting. Reddi, Kale & Kumar ("On the Convergence of Adam and Beyond," ICLR 2018) showed that the proof had a gap and constructed counter-examples on which Adam fails to converge to the optimum even in simple convex problems. The same construction technically affects RMSProp, since the troublesome term comes from the exponentially decaying squared-gradient accumulator that Adam inherits from RMSProp. Reddi and coauthors proposed AMSGrad, which keeps the running maximum of vₜ rather than the raw running average, restoring convergence guarantees. AMSGrad sees occasional use but never replaced Adam in practice, because the failure cases are rare on the kinds of non-convex objectives that come up in deep learning. The same caveats apply to RMSProp: Reddi et al.'s counter-examples mean its convergence story is a little less clean than the AdaGrad analysis, but on real neural-network training problems this has not turned into a practical problem.
RMSProp has a small number of knobs. The defaults below are the ones used by the major libraries.
| Hyperparameter | Symbol | Common default | Notes |
|---|---|---|---|
| Learning rate | α | 0.001 (Keras), 0.01 (PyTorch) | DQN used 0.00025 |
| Decay (squared gradient) | ρ, γ | 0.9 (Keras), 0.99 (PyTorch) | Called alpha in PyTorch, rho in Keras |
| Epsilon | ε | 1e-7 (Keras), 1e-8 (PyTorch) | 1e-6 in some RL papers |
| Momentum | μ | 0 | Optional |
| Centered | flag | False | Subtract running mean (Graves 2013) |
| Weight decay | λ | 0 | L2 penalty added to gradient |
The 0.9 default for ρ corresponds to an effective averaging window of roughly 10 steps. For DQN specifically, the published learning rate of 0.00025 reflects how noisy bootstrap-target updates are; using the standard 0.001 there tends to make Q-values diverge.
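The window estimate follows from the geometric weights of the moving average: the gradient from k steps back carries weight (1 - ρ) · ρᵏ. A small helper (the name recent_weight is made up for illustration) shows how much of the total weight the most recent n gradients hold.

```python
def recent_weight(rho, n):
    """Fraction of the moving average's total weight held by the latest n gradients."""
    return sum((1 - rho) * rho ** k for k in range(n))

for rho, n in [(0.9, 10), (0.9, 20), (0.99, 100)]:
    print(f"rho={rho}, last {n} steps: {recent_weight(rho, n):.3f}")
```

The sum equals 1 - ρⁿ, so a window of 1 / (1 - ρ) steps always holds roughly 63 to 65 percent of the weight, and doubling the window brings that to about 88 percent for ρ = 0.9.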
Learning rate is by far the most important RMSProp hyperparameter, just as it is for SGD. Common practice is to sweep α across roughly three orders of magnitude on a log scale (1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2) and pick the largest one that does not diverge in the first few hundred steps. RMSProp's per-parameter scaling makes the optimizer more forgiving of a slightly-too-large learning rate than vanilla SGD, but it does not eliminate the cliff entirely; once α crosses some task-dependent threshold the loss will still go to NaN within a few iterations.
The decay ρ is rarely worth tuning. Both 0.9 and 0.99 are reasonable. 0.9 makes the running average track changes in gradient scale more aggressively, which helps when the loss landscape changes rapidly; 0.99 (PyTorch's default) is smoother and slightly more stable. The original Hinton lecture used a value of 0.9 in its examples.
ε acts as a soft floor on the denominator and therefore as a soft ceiling on the per-step update size. Because vₜ always contains the term (1 - ρ) · gₜ², the ratio |gₜ| / sqrt(vₜ) never exceeds 1 / sqrt(1 - ρ), so a single step cannot grow without bound. The practical risk is different: when gradients are very small (typical for the late stages of large-scale training), sqrt(vₜ) is tiny as well, and a small ε lets noise-level gradients be rescaled into steps of order α. Increasing ε by several orders of magnitude (for instance from 1e-8 to 1e-4) damps those updates whenever sqrt(vₜ) falls below ε. The DQN default of ε = 0.01 is a deliberately large value chosen for exactly this reason; in deep RL the squared-gradient running average can drop to genuinely small values whenever the policy briefly stops exploring, and a tiny ε would then turn those small gradients into full-size steps.
Weight decay in RMSProp implementations is added to the gradient before computing vₜ, so it is L2 regularization in the classical sense, not the decoupled weight decay used by AdamW. If you want decoupled weight decay with an RMSProp-style update, you have to either roll it yourself or use Optax, where the gradient transformation pipeline lets you apply weight decay independently of the squared-gradient normalization.
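The difference between the two forms of weight decay shows up most clearly when the loss gradient is zero and only the decay term acts. The sketch below uses hypothetical scalar helpers; the decoupled variant follows the AdamW pattern of applying the decay outside the normalization and is an illustration, not an existing library optimizer.

```python
import math

def rmsprop_l2_step(theta, g, v, lr=1e-3, rho=0.9, eps=1e-8, wd=0.1):
    """L2 form: the penalty gradient wd * theta passes through the normalization."""
    g = g + wd * theta
    v = rho * v + (1 - rho) * g * g
    return theta - lr * g / (math.sqrt(v) + eps), v

def rmsprop_decoupled_step(theta, g, v, lr=1e-3, rho=0.9, eps=1e-8, wd=0.1):
    """Decoupled form: the accumulator sees only the loss gradient."""
    v = rho * v + (1 - rho) * g * g
    step = g / (math.sqrt(v) + eps)
    return theta - lr * (step + wd * theta), v

theta_l2, _ = rmsprop_l2_step(1.0, 0.0, 0.0)
theta_dec, _ = rmsprop_decoupled_step(1.0, 0.0, 0.0)
print(theta_l2, theta_dec)
```

In the L2 form the normalization rescales the penalty gradient, so the first shrink is about lr / sqrt(1 - ρ) no matter how small wd is. In the decoupled form the shrink is lr · wd · θ, proportional to the coefficient as intended.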
RMSProp is built in to all the major deep-learning libraries. The core update is the same everywhere, but the default learning rate, decay, and ε differ between libraries, as can the placement of ε, so check all of them when porting code.
torch.optim.RMSprop defaults: lr 0.01, alpha (decay) 0.99, eps 1e-8, momentum 0, centered False. The decay default differs from the 0.9 used in the original lecture.
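```python
import torch

# Placeholder model, loss, and data so the snippet runs on its own.
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
loader = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(10)]

optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=1e-3,
    alpha=0.9,  # this is rho
    eps=1e-8,
)

for x, y in loader:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```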
PyTorch names the squared-gradient decay coefficient alpha, which collides with the symbol α used for the learning rate in most papers. The learning rate argument is just lr. The library also exposes a centered flag for centered RMSProp and a momentum argument that adds the explicit momentum buffer described above.
tf.keras.optimizers.RMSprop defaults: learning rate 0.001, rho 0.9, momentum 0.0, epsilon 1e-7.
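```python
import tensorflow as tf
# Placeholder one-layer model so the snippet runs on its own.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-3, rho=0.9, epsilon=1e-7)
model.compile(optimizer=optimizer, loss="mse")
```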
Keras uses rho for the squared-gradient decay, which is closer to the conventional symbol. Note that the TensorFlow default learning rate (0.001) differs from PyTorch's (0.01) by a factor of ten. Hyperparameters tuned on one library do not transfer directly to the other without checking this.
In the JAX ecosystem the standard implementation is optax.rmsprop, composable with the rest of the optax gradient-transformation pipeline.
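```python
import jax.numpy as jnp
import optax

# Placeholder parameters and gradients so the snippet runs on its own.
params = {"w": jnp.ones(3)}
grads = {"w": jnp.full((3,), 0.5)}
optimizer = optax.rmsprop(learning_rate=1e-3, decay=0.9, eps=1e-8)
opt_state = optimizer.init(params)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)
```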
Optax separates the optimizer logic into init and update calls without any hidden state on the optimizer object, which makes it straightforward to combine RMSProp with gradient clipping, weight decay, learning-rate schedules, and other transformations using optax.chain. For example, a typical training pipeline might be optax.chain(optax.clip_by_global_norm(1.0), optax.rmsprop(1e-3)).
A reference implementation in pure NumPy fits in a few lines:
```python
import numpy as np

class RMSProp:
    def __init__(self, params, lr=1e-3, rho=0.9, eps=1e-8):
        self.lr, self.rho, self.eps = lr, rho, eps
        # One squared-gradient accumulator per parameter array.
        self.v = [np.zeros_like(p) for p in params]

    def step(self, params, grads):
        for i, (p, g) in enumerate(zip(params, grads)):
            self.v[i] = self.rho * self.v[i] + (1 - self.rho) * g * g
            # In-place update so the caller's arrays change.
            p -= self.lr * g / (np.sqrt(self.v[i]) + self.eps)
```
That is the entire algorithm. Real implementations add weight decay, momentum, gradient clipping, and bookkeeping, but the inner loop is exactly the two lines that update v and step the parameter.
The following table summarizes how RMSProp relates to the other first-order optimizers it shares lineage with.
| Optimizer | Year | Per-parameter scale | Momentum | Bias correction | Notes |
|---|---|---|---|---|---|
| SGD | 1951 (Robbins & Monro) | No | No | No | Single global step size |
| SGD with momentum | 1964 (Polyak) | No | Yes | No | Heavy-ball momentum |
| Nesterov momentum | 1983 (Nesterov) | No | Yes | No | Lookahead momentum variant |
| AdaGrad | 2011 (Duchi et al.) | Yes (sum of g²) | No | No | Learning rate decays to zero |
| RMSProp | 2012 (Hinton) | Yes (EMA of g²) | Optional | No | Fixes AdaGrad's decay issue |
| AdaDelta | 2012 (Zeiler) | Yes (EMA of g²) | No | No | Eliminates explicit learning rate |
| Adam | 2014 (Kingma & Ba) | Yes (EMA of g²) | Yes (EMA of g) | Yes | RMSProp + momentum + bias correction |
| AdamW | 2017 (Loshchilov & Hutter) | Yes (EMA of g²) | Yes (EMA of g) | Yes | Adam with decoupled weight decay |
| AMSGrad | 2018 (Reddi et al.) | Yes (max of g² EMA) | Yes | No | Adam variant with restored convergence proof |
| AdaBelief | 2020 (Zhuang et al.) | Yes (EMA of (g - m)²) | Yes | Yes | Adam variant tracking gradient variance |
All first-order adaptive optimizers cost extra memory because they have to store auxiliary state per parameter. SGD has zero extra state, SGD with momentum has one buffer per parameter, RMSProp has one buffer (two with either momentum or centering, three with both), Adam has two, and AMSGrad has three. For a transformer with 7 billion parameters in fp32, that is 28 GB of optimizer state for plain RMSProp, 56 GB for Adam, and 84 GB for AMSGrad, which is why optimizer state offloading and 8-bit Adam exist. RMSProp's relatively modest state footprint, half of Adam's, was historically one of its small practical advantages on memory-constrained hardware, though in the era of distributed training across hundreds of GPUs that consideration has mostly faded.
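The memory figures come from multiplying the parameter count by the number of state buffers and the bytes per value. A small sketch makes the calculation explicit; the helper name and the use of decimal gigabytes (1e9 bytes) are assumptions of this example.

```python
def optimizer_state_gb(n_params, n_buffers, bytes_per_value=4):
    """Optimizer state size in decimal gigabytes (fp32 by default)."""
    return n_params * n_buffers * bytes_per_value / 1e9

n_params = 7_000_000_000
for name, buffers in [("RMSProp", 1), ("Adam", 2), ("AMSGrad", 3)]:
    print(f"{name}: {optimizer_state_gb(n_params, buffers):.0f} GB")
```

Passing n_buffers=2 or 3 for RMSProp gives the footprint with momentum, centering, or both enabled.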
RMSProp shows up in specific corners of the deep-learning literature, mostly from the 2013 to 2016 window when it was the default for sequence models and RL.
The DeepMind paper that put deep RL on the map (Mnih et al., "Human-level control through deep reinforcement learning", Nature 518, 2015) used RMSProp to train the Q-network. Published settings: lr 0.00025, decay 0.95, momentum 0, epsilon 0.01. A lot of follow-on work (DQN variants, Rainbow, Ape-X) inherited those settings even after Adam became standard elsewhere. The 2016 Asynchronous Methods for Deep Reinforcement Learning paper introduced A3C, which used a shared RMSProp accumulator across asynchronous actor-learner workers; the squared-gradient running average was held in shared memory and updated atomically by every worker, giving each worker the benefit of a population-level estimate of gradient scale without any explicit synchronization. That trick became standard in distributed RL implementations for several years. Several other reinforcement-learning algorithms from the same era (TRPO baselines, certain ACER configurations, the original IMPALA reference) also defaulted to RMSProp. As of 2026, deep RL libraries typically expose both Adam and RMSProp and most new agents pick Adam, but the legacy DQN and A3C settings remain the canonical reference points for benchmarks on the Arcade Learning Environment.
For recurrent neural networks, Alex Graves's character-level RNN work (Graves 2013) used centered RMSProp. Recurrent Batch Normalization (Cooijmans et al. 2016) used RMSProp on language modeling and sequence MNIST. Several early seq2seq systems used RMSProp before switching to Adam. The motivation for RMSProp on RNNs was practical: gradient magnitudes in long-sequence backpropagation through time vary dramatically across parameters, especially in the recurrent matrices, and per-parameter scaling helps prevent the small subset of weights that experience the largest gradients from dominating the update. Once Adam took over, the same property carried over, so the switch had little qualitative effect on training dynamics for most RNN setups.
For GANs, the Wasserstein GAN paper (WGAN, Arjovsky, Chintala & Bottou 2017) used RMSProp for both critic and generator and explicitly recommended against Adam, on the grounds that Adam's momentum term plus their weight-clipping scheme made the critic loss less reliable. The original WGAN code uses RMSProp with learning rate 5e-5. The follow-on WGAN-GP paper (Gulrajani et al. 2017) reverted to Adam after replacing weight clipping with a gradient penalty, which suggests that the WGAN preference for RMSProp had as much to do with the specific weight-clipping mechanism as with any general property of GAN training. A few other adversarial setups, including some early attempts at adversarial training for robustness, also reached for RMSProp on the theory that momentum makes saddle-point dynamics worse, but this is more folklore than measured fact.
RMSProp was a common default for smaller models in the 2014 to 2016 era, including character-level neural language models in Andrej Karpathy's widely circulated char-rnn codebase, several speech recognition baselines, and the original style-transfer implementations. For supervised image classification it was always less common than SGD with momentum, which delivered better final accuracy on ImageNet-scale benchmarks. Once Adam, and later AdamW, displaced both, RMSProp gradually became a niche choice for new work outside reinforcement learning.
For most tasks, AdamW is a better default. The momentum term usually helps, the bias correction makes the first few hundred steps less twitchy, and decoupled weight decay does the right thing for regularization. On transformers, large CNNs, and diffusion models, AdamW with cosine learning-rate scheduling and a short warmup is the standard recipe and almost always at least as good as RMSProp.
Reproducing an older paper is the most common reason to still use RMSProp. DQN, WGAN, and a chunk of the 2013 to 2016 deep-learning literature use it with specific hyperparameters, and if you want your numbers to match, you use the same optimizer with the same settings. Some adversarial training setups (WGAN being the classic one) also work better without first-moment momentum because the loss landscape is non-stationary by design. For a brand-new project with no prior art, start with AdamW.
In rough order of priority, the situations where reaching for RMSProp over Adam still makes sense:
For every other modern setting, AdamW is the safer default. Some practitioners prefer SGD with momentum and a long cosine schedule for image classification on ConvNets, but that is a separate debate that has nothing to do with RMSProp specifically.
A short list of things that trip up engineers using RMSProp in production:
- g / (sqrt(v) + ε) and g / sqrt(v + ε) are not equal. Frameworks differ. When epsilon dominates the denominator (early in training or in regions of vanishing gradient), the two formulations can produce noticeably different updates.
- The centered=True flag silently changes the algorithm. It uses more memory, runs more slowly, and can subtly change convergence on some problems. Do not flip it on without rerunning your hyperparameter sweep.
- If you use the weight_decay argument in PyTorch's RMSprop, you are getting classical L2 regularization mixed into the gradient before the squared-gradient running average, not the decoupled form used by AdamW. This is usually fine but can interact badly with very large weight-decay coefficients.

See also: Optimizer, gradient descent, stochastic gradient descent, momentum, AdaGrad, Adam, AdamW, backpropagation, learning rate, A3C, DQN, WGAN.