See also: Machine learning terms
AdaGrad (short for Adaptive Gradient Algorithm) is an optimizer for gradient descent-based machine learning that adapts the learning rate individually for each parameter during training. Introduced by John Duchi, Elad Hazan, and Yoram Singer in their 2011 paper "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," published in the Journal of Machine Learning Research, AdaGrad was one of the first widely adopted adaptive learning rate methods [1]. The same family of algorithms was discovered concurrently by H. Brendan McMahan and Matthew Streeter, whose paper "Adaptive Bound Optimization for Online Convex Optimization" appeared at the 23rd Annual Conference on Learning Theory (COLT 2010) [2]. AdaGrad assigns larger effective learning rates to parameters associated with infrequent features and smaller ones to parameters associated with frequent features, which makes it particularly effective for problems involving sparse features such as natural language processing, recommendation systems, and large-scale online ad click prediction.
Although AdaGrad has been largely superseded by Adam-family optimizers for general deep learning work, it remains the conceptual ancestor of every modern adaptive optimizer. The core idea, dividing each gradient component by the square root of an accumulator that summarizes how much that parameter has moved in the past, runs through RMSProp, AdaDelta, Adam, and AdamW. Understanding AdaGrad is therefore the cleanest way to understand the entire adaptive-rate optimizer family.
In standard stochastic gradient descent (SGD), a single global learning rate is applied uniformly to every parameter at every step. This creates a fundamental problem: the optimal step size for one parameter may be far too large or too small for another. Real datasets contain features that appear at vastly different frequencies (some words occur in millions of documents, others in only a handful) and a uniform learning rate treats all of them identically.
Parameters tied to rare features receive gradient signal infrequently. Under a fixed decay schedule such as the classical Robbins-Monro rule, their learning rate is annealed just as aggressively as that of parameters updated on nearly every step. The net effect is that rare-but-predictive features (the "needles in haystacks" referenced in the Duchi paper [1]) get drowned out. AdaGrad was designed to fix this imbalance by using each parameter's own gradient history to scale its step size.
The insight behind AdaGrad is that the magnitude of past gradients in a given coordinate is a useful proxy for the local geometry of the loss function along that coordinate. Coordinates that have accumulated large gradients tend to lie in well-explored, steep directions and can afford smaller steps. Coordinates with small accumulated gradients are in flatter, less-explored directions and benefit from larger steps. The algorithm formalizes this intuition by giving each parameter its own time-varying learning rate that depends only on the history observed for that one coordinate.
Duchi, Hazan, and Singer derived the method from the framework of online convex optimization with regularized dual averaging and composite mirror descent [1]. McMahan and Streeter arrived at the same diagonal preconditioner from a different angle, by adaptively choosing the regularizer in an online convex optimization algorithm to match the geometry of the data [2]. Both groups proved tight regret bounds and noted that the bounds could be exponentially better than non-adaptive methods on sparse problems. The two papers were presented in the same conference cycle and are now treated as co-discoveries.
The AdaGrad update rule operates on a per-parameter basis. Given a parameter vector w at time step t, the algorithm proceeds element-wise.
At each time step t:
Compute the gradient (or subgradient) of the loss with respect to the current parameters:
g_t = nabla_w L(y_t, f(x_t, w_(t-1)))
Accumulate the element-wise squared gradients into a running sum:
s_t = s_(t-1) + g_t^2
where s_0 = 0 (initialized to zeros) and the squaring is performed element-wise. Some implementations call this accumulator G_t instead of s_t.
Update the parameters:
w_t = w_(t-1) - (eta / (sqrt(s_t) + epsilon)) * g_t
where eta is the initial (global) learning rate and epsilon is a small smoothing constant (typically 1e-7 or 1e-10) that prevents division by zero on coordinates that have not yet seen any gradient.
All operations (squaring, square root, division, multiplication) are performed element-wise. Each parameter w_i therefore has its own effective learning rate of eta / (sqrt(s_(t,i)) + epsilon), which can only decrease over the course of training as more gradient mass accumulates in s_i.
A common alternative formulation places the epsilon inside the square root: eta / sqrt(s_(t,i) + epsilon). The two versions are mathematically slightly different but practically indistinguishable. PyTorch and JAX/Optax use the outside-the-square-root form by default; some textbook derivations use the inside form.
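As a quick illustration (a throwaway NumPy check, not tied to any framework), the two placements only differ noticeably on coordinates whose accumulator is still near zero:

```python
import numpy as np

eta, eps = 0.01, 1e-10
s = np.array([1e-12, 1e-4, 1.0, 100.0])      # accumulated squared gradients

lr_eps_outside = eta / (np.sqrt(s) + eps)    # form used by PyTorch and Optax
lr_eps_inside = eta / np.sqrt(s + eps)       # form used in some derivations

# Agreement is essentially exact once s dominates eps; only the first
# coordinate (accumulator comparable to or below eps) shows a visible gap.
print(lr_eps_outside)
print(lr_eps_inside)
```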
The original Duchi paper actually presents two versions of the algorithm. The full-matrix version uses the matrix
G_t = sum_(tau=1)^t g_tau g_tau^T
the outer product matrix of all observed gradients, and updates parameters via
w_(t+1) = w_t - eta * G_t^(-1/2) * g_t
This full-matrix update captures correlations between parameters and is closely related to second-order methods that use a preconditioner derived from the empirical Fisher information matrix. However, it has O(d^2) memory and O(d^3) per-step time complexity for d-dimensional models, which is prohibitive for any model bigger than a few thousand parameters. The diagonal version (using only the diagonal entries of G_t) is what every practical implementation refers to as "AdaGrad," and it is what every modern framework provides out of the box.
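To make the G_t^(-1/2) preconditioner concrete, here is a toy NumPy sketch of one full-matrix step via an explicit eigendecomposition. It is illustrative only: the function name and structure are made up for this example, and the O(d^2) memory plus O(d^3) eigendecomposition are exactly what makes the approach impractical at scale.

```python
import numpy as np

def full_matrix_adagrad_step(w, g, G_sum, eta=0.01, eps=1e-10):
    """One toy full-matrix AdaGrad step; G_sum accumulates outer products g g^T."""
    G_sum = G_sum + np.outer(g, g)                     # O(d^2) memory
    eigvals, eigvecs = np.linalg.eigh(G_sum)           # O(d^3) per step
    inv_sqrt_diag = 1.0 / (np.sqrt(np.maximum(eigvals, 0.0)) + eps)
    G_inv_sqrt = eigvecs @ np.diag(inv_sqrt_diag) @ eigvecs.T
    w = w - eta * G_inv_sqrt @ g                       # w_(t+1) = w_t - eta * G_t^(-1/2) g_t
    return w, G_sum

d = 3
w, G_sum = np.zeros(d), np.zeros((d, d))
w, G_sum = full_matrix_adagrad_step(w, np.array([0.5, -0.1, 0.0]), G_sum)
```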
Later work has revisited the full-matrix form. Agarwal, Bullins, Chen, Hazan, Singh, and Zhang's 2019 paper "Efficient Full-Matrix Adaptive Regularization" [3] showed that approximate full-matrix AdaGrad can be made tractable for moderately sized problems and can outperform the diagonal version, though the diagonal form remains overwhelmingly dominant in practice.
For each individual parameter w_i, the effective learning rate at step t is
eta_eff(i, t) = eta / (sqrt(sum_(tau=1)^t g_(tau,i)^2) + epsilon)
The table below shows what this implies for different gradient histories. Frequently updated coordinates have effective learning rates that drop quickly; rarely updated coordinates retain comparatively large step sizes.
| Gradient history | Accumulated sum (s_i) | Effective learning rate | Interpretation |
|---|---|---|---|
| Large, frequent gradients | Grows quickly | Decreases rapidly | Coordinate is well-explored; take smaller steps |
| Small or infrequent gradients | Grows slowly | Remains relatively large | Coordinate needs more exploration; take larger steps |
| Mixed (large early, small later) | Plateaus at the level set by the early large gradients | Stays small permanently | Vanishing-rate problem; the algorithm cannot recover |
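A tiny worked example of the first two rows (numbers chosen purely for illustration): with eta = 0.01 over 10,000 steps, a coordinate that receives a unit gradient on every step ends up with an effective learning rate roughly ten times smaller than one that receives a unit gradient on only 1% of steps.

```python
import numpy as np

eta, eps, steps = 0.01, 1e-10, 10_000

s_frequent = steps * 1.0**2          # unit gradient on every step
s_rare = 0.01 * steps * 1.0**2       # unit gradient on 1% of steps

print(eta / (np.sqrt(s_frequent) + eps))   # ~1e-4
print(eta / (np.sqrt(s_rare) + eps))       # ~1e-3, ten times larger
```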
AdaGrad's behaviour differs sharply between sparse and dense problems, which is the single most important fact for understanding when to use it.
| Property | Sparse gradients (e.g. NLP, ad CTR) | Dense gradients (e.g. CNN training) |
|---|---|---|
| Per-coordinate gradient frequency | Most coordinates see g_i = 0 most steps | Almost every coordinate sees a non-zero gradient every step |
| Behaviour of accumulator s_i | Grows slowly for rare coordinates | Grows roughly linearly with t for every coordinate |
| Effective learning rate over time | Stays usable for rare coordinates indefinitely | Decays as O(1 / sqrt(t)) for every coordinate |
| Theoretical regret bound | Can be far smaller than sqrt(d * T); scales like sqrt(k * T) when only k coordinates carry signal | Comparable to the sqrt(d * T) scaling of plain online subgradient descent |
| Empirical outcome | Strongly preferred over plain SGD | Often beaten by SGD-momentum or Adam after a moderate number of epochs |
This table also explains why AdaGrad survived even after Adam took over deep learning: the sparse-gradient regime is exactly where AdaGrad has a provable advantage, and that regime appears in recommender systems, online ads, and certain NLP pipelines.
The complete AdaGrad algorithm can be summarized as follows.
| Step | Action |
|---|---|
| 1 | Initialize parameter vector w_0 and set the gradient accumulator s_0 = 0 |
| 2 | Choose a global learning rate eta (commonly 0.01) and smoothing constant epsilon (commonly 1e-8 or 1e-10) |
| 3 | For each training iteration t = 1, 2, ..., T: |
| 3a | Sample a minibatch (or single example, in the online setting) and compute the gradient g_t of the loss with respect to w |
| 3b | Update the accumulator: s_t = s_(t-1) + g_t^2 (element-wise) |
| 3c | Compute the parameter update: w_t = w_(t-1) - (eta / (sqrt(s_t) + epsilon)) * g_t |
| 4 | Return the final parameter vector w_T |
In code, the same loop can be written in a few lines on top of an autograd library; a minimal sketch using JAX:

import jax
import jax.numpy as jnp

def adagrad(loss_fn, w0, eta, epsilon, num_steps):
    w = w0
    s = jnp.zeros_like(w)                          # per-coordinate accumulator, s_0 = 0
    for _ in range(num_steps):
        g = jax.grad(loss_fn)(w)                   # gradient of the loss at the current w
        s = s + g * g                              # element-wise squared-gradient sum
        w = w - eta * g / (jnp.sqrt(s) + epsilon)  # per-coordinate scaled step
    return w
AdaGrad enjoys strong theoretical guarantees in the convex online learning setting, which is the setting both the JMLR and COLT papers analyzed.
In the online convex optimization framework, an adversary picks a sequence of convex loss functions f_1, ..., f_T, and the learner picks parameters w_t before seeing f_t. Regret is defined as
R_T = sum_(t=1)^T f_t(w_t) - min_(w*) sum_(t=1)^T f_t(w*)
that is, the total excess loss compared to the best fixed comparator chosen in hindsight.
Duchi et al. and McMahan and Streeter both proved that AdaGrad achieves a regret bound of order O(sqrt(T)) under standard assumptions on the convexity and bounded subgradients [1][2]. More precisely, the bound has the form
R_T <= O(D_infinity * sum_i sqrt(sum_t g_(t,i)^2))
where D_infinity bounds the diameter of the comparator set. The crucial feature is that the sum over coordinates i depends on the per-coordinate gradient history. When the gradient is sparse (most g_(t,i) are zero), this bound can be exponentially smaller than the corresponding bound for plain online subgradient descent, which scales with sqrt(d * T) regardless of sparsity. On a problem with d dimensions of which only k typically carry signal, AdaGrad's bound scales like sqrt(k * T) rather than sqrt(d * T).
This sparsity-adaptive regret bound was the headline theoretical result and the main reason AdaGrad gained immediate adoption in NLP and recommendation pipelines. It provided a rigorous explanation for the empirical observation that adaptive per-feature learning rates outperform uniform schedules on text and click data.
Via standard online-to-batch conversion, the O(sqrt(T)) regret bound translates into a convergence rate of O(1/sqrt(T)) on the average iterate for stochastic convex optimization with bounded subgradients [1]. This matches the optimal rate for first-order stochastic methods on non-smooth convex problems.
AdaGrad's original analysis assumed convex losses, but neural network training is famously non-convex. Ward, Wu, and Bottou's 2019 paper "AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes" [4] addressed this gap. They proved that the norm version of AdaGrad (AdaGrad-Norm, where the same scalar stepsize is used for all coordinates) converges to a stationary point at rate O(log(N) / sqrt(N)) in the stochastic non-convex setting and O(1/N) in the deterministic case. Their analysis did not require knowledge of the loss's Lipschitz constant or the noise level, which is unusual for non-convex SGD analyses and which they took as evidence that AdaGrad's adaptivity is genuinely useful even outside its original convex setting.
Later work by Defossez, Bottou, Bach, and Usunier [5] gave a unified proof for both Adam and AdaGrad in the non-convex stochastic case, confirming that adaptive methods do converge under reasonable assumptions even though their behaviour on non-convex landscapes is less understood than on convex ones.
The full-matrix version of AdaGrad is closely related to natural gradient descent and to other second-order optimization methods. Natural gradient descent uses the inverse Fisher information matrix as a preconditioner, which produces updates that are invariant to reparameterization. AdaGrad's full-matrix preconditioner G_t^(-1/2) is built from the outer products of the observed gradients, which is closely related to the empirical Fisher information matrix (the Fisher under the empirical data distribution rather than the model distribution).
The diagonal AdaGrad accumulator s_i is the diagonal of the empirical second-moment matrix of the gradients. Dividing by sqrt(s_i) is therefore a coordinate-wise normalization by an estimate of gradient standard deviation, which can be viewed as a crude diagonal approximation to a Fisher- or Hessian-based preconditioner. This gives AdaGrad a loose interpretation as a memory-cheap, first-order substitute for genuine second-order methods, an idea that Adam later inherited and that connects through to recent work like Shampoo and second-order LLM optimizers.
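A quick numerical check of that identity with synthetic gradients (values are arbitrary and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 synthetic gradients over 5 coordinates with very different scales
grads = rng.normal(size=(1000, 5)) * np.array([5.0, 1.0, 0.2, 0.05, 0.01])

G = sum(np.outer(g, g) for g in grads)   # full-matrix accumulator: sum of g g^T
s = (grads**2).sum(axis=0)               # diagonal AdaGrad accumulator

print(np.allclose(np.diag(G), s))        # True: s_i is the diagonal of G_t
print(np.sqrt(s / len(grads)))           # per-coordinate RMS gradient estimate
```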
AdaGrad offers several benefits that explain its impact and its continued role in certain niches.
The most important strength is that the algorithm automatically scales the learning rate per parameter without any manual schedule. A practitioner still needs to set the global learning rate eta, but the per-coordinate adaptation eliminates the need for elaborate per-feature tuning. In sparse settings this often replaces what would otherwise be a feature-engineering effort to balance frequencies.
AdaGrad excels when gradients are sparse or non-uniformly distributed across features. In NLP with bag-of-words or hashed-feature representations, most words in any given example receive a zero gradient. Parameters for rare words accumulate slowly, so their effective learning rate stays large. Parameters for common words accumulate quickly and shrink. This per-feature adaptivity is exactly the right behaviour for sparse high-dimensional problems and is the regime where AdaGrad has a provable advantage over uniform schedules.
AdaGrad was developed in the online learning framework, where data arrives sequentially and the model must update after each observation. Its adaptive nature makes it well-suited to environments where the data distribution shifts gradually, since the per-parameter learning rates continuously adjust based on the observed gradients. Vowpal Wabbit, for example, uses AdaGrad as the default optimizer for its online learners [6].
AdaGrad maintains a single accumulator vector of the same shape as the parameter vector, i.e. one extra copy of the model parameters in memory. Adam and AdamW require two extra copies (the first and second moment estimates), doubling the optimizer state. For a 7-billion-parameter LLM, the difference is roughly 28 GB of fp32 optimizer state (one accumulator tensor versus Adam's two), which can matter when fitting a training run onto a given hardware budget. In practice this advantage is rarely the deciding factor, because Adam's other benefits usually dominate, but it is real.
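The arithmetic behind that figure, assuming fp32 optimizer state throughout and ignoring sharding or mixed-precision schemes:

```python
params = 7e9                  # 7-billion-parameter model
bytes_per_value = 4           # fp32

adagrad_state = 1 * params * bytes_per_value   # one accumulator tensor
adam_state = 2 * params * bytes_per_value      # first- and second-moment tensors

print(adagrad_state / 1e9)                     # ~28 GB
print((adam_state - adagrad_state) / 1e9)      # ~28 GB difference vs Adam
```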
The algorithm is short. There are no momentum buffers, no bias correction terms, no warmup schedules, no decoupled weight decay branches. A from-scratch AdaGrad fits in roughly five lines of code on top of an autograd library, which makes it easy to debug and easy to verify against published derivations.
The single most discussed weakness of AdaGrad is the vanishing learning rate. Because the accumulator s_t grows monotonically with t (gradients squared are always non-negative), the effective learning rate eta / sqrt(s_t) decreases monotonically and can shrink toward zero as training progresses. For long training runs on dense problems this means the optimizer eventually stops making meaningful progress, even when the loss has not yet plateaued.
Zeiler's AdaDelta paper, which introduced an algorithm designed specifically to fix this problem, demonstrated empirically that AdaGrad performs well for the first several epochs of neural network training but then slows down sharply because the accumulator denominators continue to grow without bound [7]. This single observation motivated the entire successor family (RMSProp, AdaDelta, Adam) to replace the cumulative sum with an exponential moving average.
In online settings where the data distribution shifts, AdaGrad cannot fully "forget" old gradient information. Once a coordinate has accumulated significant gradient mass, its effective learning rate stays small permanently, even if subsequent data suggests the parameter should now move quickly. RMSProp's exponential moving average solves this by giving older gradients geometrically less weight.
Although AdaGrad reduces the need for per-coordinate tuning, the choice of the global eta still matters. The initial gradient magnitudes heavily influence the early values of s_t, which then determine effective learning rates for the remainder of training. A global eta that is too large leads to large early accumulator values that cripple learning later; one that is too small leaves the model under-trained.
AdaGrad doubles the memory required to hold parameters compared to plain SGD (one parameter copy for w, one for s). For very large models or tightly memory-constrained settings (mobile, edge inference fine-tuning) this is an honest cost, although it remains half the cost of Adam.
For most dense neural network training, especially anything that runs for more than a few thousand steps, AdaGrad is empirically dominated by Adam, AdamW, or even well-tuned SGD-momentum. The vanishing learning rate is the proximate cause, and it is essentially unfixable without turning the algorithm into RMSProp or AdaDelta. For this reason, AdaGrad is rarely the right choice for modern dense deep learning and is essentially never used for LLM pretraining.
Several variants and extensions of AdaGrad exist beyond the basic diagonal form.
In parallel with the JMLR version, Duchi, Hazan, and Singer also derived a regularized dual averaging (RDA) variant of AdaGrad. AdagradDA replaces the proximal mirror descent style update with Nesterov's dual averaging, which leads to sparser solutions in the presence of L1 regularization. TensorFlow's AdagradDAOptimizer is a direct implementation.
A scalar version of AdaGrad maintains a single scalar accumulator equal to the running sum of the squared gradient norms, rather than per-coordinate accumulators. This is the version analyzed in the Ward, Wu, and Bottou non-convex convergence paper [4] and is sometimes preferred when per-coordinate adaptivity is undesirable, for example when one wants the adaptive scheme purely for stepsize selection rather than per-feature reweighting.
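A minimal sketch of AdaGrad-Norm as described above; the function and parameter names are illustrative rather than taken from any library:

```python
import numpy as np

def adagrad_norm(grad_fn, w0, eta=0.01, eps=1e-10, num_steps=1000):
    """AdaGrad-Norm: one shared scalar stepsize driven by squared gradient norms."""
    w = np.asarray(w0, dtype=float).copy()
    b2 = 0.0                                   # scalar running sum of ||g_t||^2
    for _ in range(num_steps):
        g = grad_fn(w)
        b2 += float(np.dot(g, g))              # accumulate the squared norm
        w -= eta * g / (np.sqrt(b2) + eps)     # same stepsize for every coordinate
    return w
```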
The more general framework Duchi, Hazan, and Singer worked in is composite mirror descent, in which AdaGrad arises as a specific choice of adaptive proximal function. Under this lens, AdaGrad sits in a family of algorithms that includes variants for L1-regularized problems, group-sparse problems, and matrix-structured problems.
RMSProp, AdaDelta, Adam, and AdamW are not strictly variants of AdaGrad but direct algorithmic descendants. They are covered in their own articles; the next section places them in chronological context.
AdaGrad is the start of a lineage that effectively defines modern adaptive deep learning optimization. The table below summarizes the main timeline, with each successor reacting to a specific limitation of its predecessor.
| Year | Optimizer | Authors | Innovation over predecessor | Source |
|---|---|---|---|---|
| 2010 | Adaptive bound optimization (later AdaGrad) | McMahan, Streeter | Adaptive regularizer based on observed gradients | COLT 2010 |
| 2011 | AdaGrad | Duchi, Hazan, Singer | Diagonal preconditioner with cumulative squared gradient sum | JMLR 12 |
| 2012 | RMSProp | Tieleman, Hinton | Exponential moving average of squared gradients (forgets old gradients) | Coursera lecture 6e |
| 2012 | AdaDelta | Zeiler | EMA of squared gradients plus EMA of squared parameter updates (no global LR) | arXiv 1212.5701 |
| 2014 | Adam | Kingma, Ba | RMSProp plus momentum plus bias correction | arXiv 1412.6980 / ICLR 2015 |
| 2014 | Adamax | Kingma, Ba | Infinity-norm variant of Adam | Same paper |
| 2018 | AMSGrad | Reddi, Kale, Kumar | Forces v_t to be non-decreasing; fixes a convergence proof bug | ICLR 2018 |
| 2019 | AdamW | Loshchilov, Hutter | Decoupled weight decay from gradient-scaled L2 | ICLR 2019 |
| 2020s | LAMB, Lion, Sophia, Shampoo, Adafactor | Various | Large-batch friendly, low-memory, second-order, or matrix-aware variants | Various |
Reading top to bottom, every later algorithm inherits the per-parameter-adaptive idea from AdaGrad. The main subsequent innovations have been replacing the cumulative sum with an EMA (RMSProp, AdaDelta, Adam), adding momentum (Adam), bias correction (Adam), decoupling regularization (AdamW), and bounding memory (Adafactor, Lion). The conceptual core (divide each coordinate's update by the square root of an estimate of its second moment) traces back unchanged to the 2011 paper.
| Optimizer | Gradient state | Decay scheme | Momentum | Bias correction | Memory cost (state per parameter) | Best at |
|---|---|---|---|---|---|---|
| SGD | None | Fixed or manual schedule | Optional | n/a | 0 (or 1 with momentum) | Well-tuned dense problems |
| AdaGrad | Cumulative sum of g^2 | Monotone decay (1/sqrt(t)) | No | No | 1 | Sparse online learning |
| RMSProp | EMA of g^2 | Bounded by EMA window | Optional in some implementations | No | 1 | Recurrent nets, RL, non-stationary signals |
| AdaDelta | EMA of g^2 plus EMA of (delta w)^2 | Adaptive, no global LR needed | No | No | 2 | Avoiding LR tuning |
| Adam | EMA of g and EMA of g^2 | Bounded by EMA windows | Yes (first moment) | Yes | 2 | General deep learning, default for most cases |
| AdamW | EMA of g and EMA of g^2 plus decoupled weight decay | Bounded by EMA windows | Yes | Yes | 2 | LLM and Transformer pretraining |
Despite being out of fashion for general deep learning, AdaGrad still appears in production systems where its specific strengths matter.
AdaGrad has been widely used in NLP tasks where sparse feature representations are common. Word embedding training, text classification, and language models that use large vocabularies all benefit from AdaGrad's tendency to keep effective learning rates high for infrequent tokens. The original GloVe word embedding algorithm by Pennington, Socher, and Manning (Stanford, EMNLP 2014) was trained using AdaGrad [8]. The Stanford GloVe codebase ships with AdaGrad as the default optimizer because the algorithm's per-coordinate adaptivity matches the heavy-tailed frequency distribution of word co-occurrence data.
Large-scale recommendation models work over very high-dimensional sparse feature spaces (user IDs, item IDs, hashed feature crosses). AdaGrad naturally handles this imbalance by adapting learning rates based on per-feature update frequency. Google's Wide and Deep Learning for Recommender Systems paper (Cheng et al., 2016) [9] reports the production setup used at Google Play: the wide linear part was trained with FTRL-Proximal with L1 regularization (a relative of AdaGrad), and the deep neural network part was trained with AdaGrad. The paper specifically attributes the choice of AdaGrad on the deep side to its adaptive behaviour on sparse embedding tables. This pattern (FTRL on the wide head, AdaGrad on the deep towers) became a standard recipe in recommendation literature for several years.
Vowpal Wabbit, the open-source online learning library originally built at Yahoo Research and maintained at Microsoft Research, uses AdaGrad as the default optimizer for its online learners, including its contextual bandit pipeline [6]. Online learning is exactly the regime AdaGrad was designed for: data arrives one example at a time, parameters update after each example, and per-feature adaptivity matters because feature frequencies are unknown a priori.
For convex problems and linear models such as logistic regression, AdaGrad's monotonically decreasing learning rate is actually desirable because it ensures convergence to the optimum. Industrial-scale convex classifiers in advertising and search ranking have used AdaGrad and its dual-averaging variant AdagradDA successfully.
A modern niche where AdaGrad survives is training embedding tables for users, items, or vocabulary tokens where the access pattern is power-law sparse. PyTorch's torch.optim.SparseAdam and similar sparse-update variants of Adam exist precisely because dense Adam wastes work updating embeddings that received no gradient. AdaGrad has a natural sparse-update form (only update accumulator entries for the coordinates whose gradient is non-zero), which makes it efficient for embedding-heavy models even when other parts of the model use Adam.
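A hedged sketch of that pattern in PyTorch; the table size, batch of IDs, and toy loss are placeholders for a real embedding-heavy model:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Large embedding table with sparse gradients; only rows touched by a batch
# produce gradient entries.
table = nn.Embedding(num_embeddings=1_000_000, embedding_dim=64, sparse=True)
optimizer = optim.Adagrad(table.parameters(), lr=0.01)

ids = torch.randint(0, 1_000_000, (32,))   # a minibatch of item IDs
loss = table(ids).pow(2).sum()             # toy loss standing in for the real one
loss.backward()                            # gradient arrives as a sparse tensor
optimizer.step()                           # only the touched rows and their
optimizer.zero_grad()                      # accumulator entries are updated
```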
A short list of regimes where AdaGrad is not the right choice:
- Long, dense training runs (large vision models, LLM pretraining), where the monotonically shrinking learning rate eventually stalls progress.
- Non-stationary or drifting data streams, where old gradient mass should be forgotten; RMSProp-style exponential averaging is a better fit.
- General deep learning workloads where a well-tuned Adam, AdamW, or SGD-momentum baseline is already available.
AdaGrad is available in every major deep learning framework.
In PyTorch, AdaGrad is implemented as torch.optim.Adagrad [10]:
import torch
import torch.optim as optim
model = MyModel()
optimizer = optim.Adagrad(
model.parameters(),
lr=0.01,
lr_decay=0,
weight_decay=0,
initial_accumulator_value=0,
eps=1e-10,
)
for input, target in dataset:
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
The PyTorch defaults are lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, and eps=1e-10. The lr_decay parameter implements an additional polynomial decay on top of AdaGrad's intrinsic adaptivity, which is rarely useful. The initial_accumulator_value parameter sets the starting value of s and is typically left at 0. PyTorch also provides a sparse-gradient code path for cases where most gradient entries are zero.
In TensorFlow, AdaGrad is available through the Keras API:
import tensorflow as tf
model = tf.keras.Sequential([...])
optimizer = tf.keras.optimizers.Adagrad(
learning_rate=0.001,
initial_accumulator_value=0.1,
epsilon=1e-7,
)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)
The Keras defaults are learning_rate=0.001, initial_accumulator_value=0.1, and epsilon=1e-7. Note the difference from PyTorch: TensorFlow defaults to a non-zero initial accumulator value of 0.1, which gives a more conservative (smaller) initial effective learning rate and improves numerical stability in early steps. This default difference can lead to noticeably different training dynamics if a script is ported between frameworks without re-tuning.
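One way to avoid surprises when porting is to set these hyperparameters explicitly rather than relying on defaults; for example, configuring PyTorch's Adagrad to mirror the Keras defaults quoted above:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model for illustration
optimizer = optim.Adagrad(
    model.parameters(),
    lr=0.001,                        # Keras default learning rate
    initial_accumulator_value=0.1,   # Keras default accumulator floor
    eps=1e-7,                        # Keras default epsilon
)
```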
In JAX with Optax, AdaGrad is optax.adagrad:
import optax
optimizer = optax.adagrad(
learning_rate=0.01,
initial_accumulator_value=0.1,
eps=1e-7,
)
opt_state = optimizer.init(params)
for batch in dataset:
grads = grad_fn(params, batch)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)
Optax's defaults follow TensorFlow's conventions (initial accumulator 0.1, eps 1e-7).
A few practical points worth knowing:
- Defaults differ across frameworks: PyTorch uses lr=0.01, initial_accumulator_value=0, eps=1e-10, while Keras and Optax use learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-7. Porting a script between frameworks without re-tuning can change training dynamics noticeably.
- A non-zero initial accumulator value acts as a floor on the denominator, giving smaller and more stable steps early in training.
- Implementations also differ on whether epsilon sits inside or outside the square root; the difference matters only in the first few steps on coordinates with near-zero accumulators.
AdaGrad is no longer the default optimizer for general deep learning. Adam and AdamW dominate that role. However, the algorithm remains relevant for several reasons.
First, AdaGrad is the conceptual ancestor of every modern adaptive optimizer. RMSProp simply replaces its cumulative sum with an EMA. Adam adds momentum and bias correction on top of RMSProp. AdamW decouples weight decay. Each of these starts from AdaGrad's per-coordinate normalization and improves on the specific ways AdaGrad falls short. Anyone who reads the Adam paper without first understanding AdaGrad is missing half the motivation.
Second, AdaGrad still wins on sparse online learning with bounded horizons. Vowpal Wabbit, FTRL-style click prediction systems, and the Stanford GloVe codebase are all examples of production systems where AdaGrad continues to be a sensible default.
Third, AdaGrad's regret bound is one of the cleanest theoretical results in online optimization, and it provides the standard baseline against which adaptive methods are still benchmarked. The Duchi-Hazan-Singer paper has more than 20,000 citations on Google Scholar (as of 2025) and is one of the most influential papers in machine learning optimization of the 2010s.
Fourth, the diagonal preconditioner perspective on AdaGrad opens the door to second-order and natural-gradient methods. Recent low-memory second-order optimizers (Shampoo, K-FAC, Adafactor) can all be understood as more sophisticated takes on the same idea AdaGrad introduced: precondition the gradient by something derived from gradient history.
In 2026, an instructor teaching adaptive optimizers in a deep learning course will still derive AdaGrad first, then RMSProp, then Adam, then AdamW. That pedagogical sequence reflects both the historical order and the conceptual one. Even where AdaGrad is no longer the optimizer one would actually run, it remains the right one to learn first.
Imagine you are learning to throw a basketball into different hoops. Some hoops you practice on every day, so you are already pretty good at those. Other hoops are in unusual spots, so you rarely get to try them.
A regular coach would have you practice every hoop the same way, spending equal effort on each one. But that is not very smart, because you are already good at some hoops and bad at others.
AdaGrad is like a smart coach. It watches you practice and keeps track of how much practice you have already done on each hoop. For the hoops you have practiced a lot, the coach says, "You are good at this one, so let's only make small adjustments." For the hoops you have barely tried, the coach says, "You need more help here, so let's make bigger changes."
The one problem with this smart coach is that after a very long time, it starts thinking you have practiced every hoop so much that it barely lets you change anything at all. That is when you need a different coach, like RMSProp or Adam, who does not remember every single practice session from the beginning of time but only the recent ones.