See also: Machine learning terms
AdaGrad (short for Adaptive Gradient Algorithm) is an optimizer for gradient descent-based machine learning that adapts the learning rate individually for each parameter during training. Introduced by John Duchi, Elad Hazan, and Yoram Singer in their 2011 paper "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," published in the Journal of Machine Learning Research, AdaGrad was one of the first widely adopted adaptive learning rate methods [1]. The same family of algorithms was discovered concurrently by H. Brendan McMahan and Matthew Streeter, whose paper "Adaptive Bound Optimization for Online Convex Optimization" appeared at the 23rd Annual Conference on Learning Theory (COLT 2010) [2]. AdaGrad assigns larger effective learning rates to parameters associated with infrequent features and smaller ones to parameters associated with frequent features, which makes it particularly effective for problems involving sparse features such as natural language processing, recommendation systems, and large-scale online ad click prediction.
Although AdaGrad has been largely superseded by Adam-family optimizers for general deep learning work, it remains the conceptual ancestor of every modern adaptive optimizer. The core idea, dividing each gradient component by the square root of an accumulator that summarizes how much that parameter has moved in the past, runs through RMSProp, AdaDelta, Adam, and AdamW. Understanding AdaGrad is therefore the cleanest way to understand the entire adaptive-rate optimizer family.
In standard stochastic gradient descent (SGD), a single global learning rate is applied uniformly to every parameter at every step. This creates a fundamental problem: the optimal step size for one parameter may be far too large or too small for another. Real datasets contain features that appear at vastly different frequencies (some words occur in millions of documents, others in only a handful) and a uniform learning rate treats all of them identically.
Parameters tied to rare features receive gradient signal infrequently. Under a fixed decay schedule such as the classical Robbins-Monro rule, their learning rate is annealed just as aggressively as that of parameters updated on nearly every step. The net effect is that rare-but-predictive features (the "needles in haystacks" referenced in the Duchi paper [1]) get drowned out. AdaGrad was designed to fix this imbalance by using each parameter's own gradient history to scale its step size.
The insight behind AdaGrad is that the magnitude of past gradients in a given coordinate is a useful proxy for the local geometry of the loss function along that coordinate. Coordinates that have accumulated large gradients tend to lie in well-explored, steep directions and can afford smaller steps. Coordinates with small accumulated gradients are in flatter, less-explored directions and benefit from larger steps. The algorithm formalizes this intuition by giving each parameter its own time-varying learning rate that depends only on the history observed for that one coordinate.
Duchi, Hazan, and Singer derived the method from the framework of online convex optimization with regularized dual averaging and composite mirror descent [1]. McMahan and Streeter arrived at the same diagonal preconditioner from a different angle, by adaptively choosing the regularizer in an online convex optimization algorithm to match the geometry of the data [2]. Both groups proved tight regret bounds and noted that the bounds could be exponentially better than non-adaptive methods on sparse problems. The two papers were presented in the same conference cycle and are now treated as co-discoveries.
The AdaGrad update rule operates on a per-parameter basis. Given a parameter vector w at time step t, the algorithm proceeds element-wise.
At each time step t:
Compute the gradient (or subgradient) of the loss with respect to the current parameters:
g_t = nabla_w L(y_t, f(x_t, w_(t-1)))
Accumulate the element-wise squared gradients into a running sum:
s_t = s_(t-1) + g_t^2
where s_0 = 0 (initialized to zeros) and the squaring is performed element-wise. Some implementations call this accumulator G_t instead of s_t.
Update the parameters:
w_t = w_(t-1) - (eta / (sqrt(s_t) + epsilon)) * g_t
where eta is the initial (global) learning rate and epsilon is a small smoothing constant (typically 1e-7 or 1e-10) that prevents division by zero on coordinates that have not yet seen any gradient.
All operations (squaring, square root, division, multiplication) are performed element-wise. Each parameter w_i therefore has its own effective learning rate of eta / (sqrt(s_(t,i)) + epsilon), which can only decrease over the course of training as more gradient mass accumulates in s_i.
A common alternative formulation places the epsilon inside the square root: eta / sqrt(s_(t,i) + epsilon). The two versions are mathematically slightly different but practically indistinguishable. PyTorch and JAX/Optax use the outside-the-square-root form by default; some textbook derivations use the inside form.
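As a quick illustration (a throwaway NumPy check, not tied to any framework), the two placements only differ noticeably on coordinates whose accumulator is still near zero:

```python
import numpy as np

eta, eps = 0.01, 1e-10
s = np.array([1e-12, 1e-4, 1.0, 100.0])      # accumulated squared gradients

lr_eps_outside = eta / (np.sqrt(s) + eps)    # form used by PyTorch and Optax
lr_eps_inside = eta / np.sqrt(s + eps)       # form used in some derivations

# Agreement is essentially exact once s dominates eps; only the first
# coordinate (accumulator comparable to or below eps) shows a visible gap.
print(lr_eps_outside)
print(lr_eps_inside)
```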
The original Duchi paper actually presents two versions of the algorithm. The full-matrix version uses the matrix
G_t = sum_(tau=1)^t g_tau g_tau^T
the outer product matrix of all observed gradients, and updates parameters via
w_(t+1) = w_t - eta * G_t^(-1/2) * g_t
This full-matrix update captures correlations between parameters and is closely related to second-order methods that use a preconditioner derived from the empirical Fisher information matrix. However, it has O(d^2) memory and O(d^3) per-step time complexity for d-dimensional models, which is prohibitive for any model bigger than a few thousand parameters. The diagonal version (using only the diagonal entries of G_t) is what every practical implementation refers to as "AdaGrad," and it is what every modern framework provides out of the box.
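To make the G_t^(-1/2) preconditioner concrete, here is a toy NumPy sketch of one full-matrix step via an explicit eigendecomposition. It is illustrative only: the function name and structure are made up for this example, and the O(d^2) memory plus O(d^3) eigendecomposition are exactly what makes the approach impractical at scale.

```python
import numpy as np

def full_matrix_adagrad_step(w, g, G_sum, eta=0.01, eps=1e-10):
    """One toy full-matrix AdaGrad step; G_sum accumulates outer products g g^T."""
    G_sum = G_sum + np.outer(g, g)                     # O(d^2) memory
    eigvals, eigvecs = np.linalg.eigh(G_sum)           # O(d^3) per step
    inv_sqrt_diag = 1.0 / (np.sqrt(np.maximum(eigvals, 0.0)) + eps)
    G_inv_sqrt = eigvecs @ np.diag(inv_sqrt_diag) @ eigvecs.T
    w = w - eta * G_inv_sqrt @ g                       # w_(t+1) = w_t - eta * G_t^(-1/2) g_t
    return w, G_sum

d = 3
w, G_sum = np.zeros(d), np.zeros((d, d))
w, G_sum = full_matrix_adagrad_step(w, np.array([0.5, -0.1, 0.0]), G_sum)
```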
Later work has revisited the full-matrix form. Agarwal, Bullins, Chen, Hazan, Singh, and Zhang's 2019 paper "Efficient Full-Matrix Adaptive Regularization" [3] showed that approximate full-matrix AdaGrad can be made tractable for moderately sized problems and can outperform the diagonal version, though the diagonal form remains overwhelmingly dominant in practice.
For each individual parameter w_i, the effective learning rate at step t is
eta_eff(i, t) = eta / (sqrt(sum_(tau=1)^t g_(tau,i)^2) + epsilon)
The table below shows what this implies for different gradient histories. Frequently updated coordinates have effective learning rates that drop quickly; rarely updated coordinates retain comparatively large step sizes.
| Gradient history | Accumulated sum (s_i) | Effective learning rate | Interpretation |
|---|---|---|---|
| Large, frequent gradients | Grows quickly | Decreases rapidly | Coordinate is well-explored; take smaller steps |
| Small or infrequent gradients | Grows slowly | Remains relatively large | Coordinate needs more exploration; take larger steps |
| Mixed (large early, small later) | Plateaus at the level set by the early large gradients | Stays small permanently | Vanishing-rate problem; the algorithm cannot recover |
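A tiny worked example of the first two rows (numbers chosen purely for illustration): with eta = 0.01 over 10,000 steps, a coordinate that receives a unit gradient on every step ends up with an effective learning rate roughly ten times smaller than one that receives a unit gradient on only 1% of steps.

```python
import numpy as np

eta, eps, steps = 0.01, 1e-10, 10_000

s_frequent = steps * 1.0**2          # unit gradient on every step
s_rare = 0.01 * steps * 1.0**2       # unit gradient on 1% of steps

print(eta / (np.sqrt(s_frequent) + eps))   # ~1e-4
print(eta / (np.sqrt(s_rare) + eps))       # ~1e-3, ten times larger
```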
AdaGrad's behaviour differs sharply between sparse and dense problems, which is the single most important fact for understanding when to use it.
| Property | Sparse gradients (e.g. NLP, ad CTR) | Dense gradients (e.g. CNN training) |
|---|---|---|
| Per-coordinate gradient frequency | Most coordinates see g_i = 0 most steps | Almost every coordinate sees a non-zero gradient every step |
| Behaviour of accumulator s_i | Grows slowly for rare coordinates | Grows roughly linearly with t for every coordinate |
| Effective learning rate over time | Stays usable for rare coordinates indefinitely | Decays as O(1 / sqrt(t)) for every coordinate |
| Theoretical regret bound | Can be far smaller than sqrt(d * T); scales like sqrt(k * T) when only k coordinates carry signal | Comparable to the sqrt(d * T) scaling of plain online subgradient descent |
| Empirical outcome | Strongly preferred over plain SGD | Often beaten by SGD-momentum or Adam after a moderate number of epochs |
This table also explains why AdaGrad survived even after Adam took over deep learning: the sparse-gradient regime is exactly where AdaGrad has a provable advantage, and that regime appears in recommender systems, online ads, and certain NLP pipelines.
The complete AdaGrad algorithm can be summarized as follows.
| Step | Action |
|---|---|
| 1 | Initialize parameter vector w_0 and set the gradient accumulator s_0 = 0 |
| 2 | Choose a global learning rate eta (commonly 0.01) and smoothing constant epsilon (commonly 1e-8 or 1e-10) |
| 3 | For each training iteration t = 1, 2, ..., T: |
| 3a | Sample a minibatch (or single example, in the online setting) and compute the gradient g_t of the loss with respect to w |
| 3b | Update the accumulator: s_t = s_(t-1) + g_t^2 (element-wise) |
| 3c | Compute the parameter update: w_t = w_(t-1) - (eta / (sqrt(s_t) + epsilon)) * g_t |
| 4 | Return the final parameter vector w_T |
In code, the same loop can be written in a few lines on top of an autograd library; a minimal sketch using JAX:

import jax
import jax.numpy as jnp

def adagrad(loss_fn, w0, eta, epsilon, num_steps):
    w = w0
    s = jnp.zeros_like(w)                          # per-coordinate accumulator, s_0 = 0
    for _ in range(num_steps):
        g = jax.grad(loss_fn)(w)                   # gradient of the loss at the current w
        s = s + g * g                              # element-wise squared-gradient sum
        w = w - eta * g / (jnp.sqrt(s) + epsilon)  # per-coordinate scaled step
    return w
AdaGrad enjoys strong theoretical guarantees in the convex online learning setting, which is the setting both the JMLR and COLT papers analyzed.
In the online convex optimization framework, an adversary picks a sequence of convex loss functions f_1, ..., f_T, and the learner picks parameters w_t before seeing f_t. Regret is defined as
R_T = sum_(t=1)^T f_t(w_t) - min_(w*) sum_(t=1)^T f_t(w*)
that is, the total excess loss compared to the best fixed comparator chosen in hindsight.
Duchi et al. and McMahan and Streeter both proved that AdaGrad achieves a regret bound of order O(sqrt(T)) under standard assumptions on the convexity and bounded subgradients [1][2]. More precisely, the bound has the form
R_T <= O(D_infinity * sum_i sqrt(sum_t g_(t,i)^2))
where D_infinity bounds the diameter of the comparator set. The crucial feature is that the sum over coordinates i depends on the per-coordinate gradient history. When the gradient is sparse (most g_(t,i) are zero), this bound can be exponentially smaller than the corresponding bound for plain online subgradient descent, which scales with sqrt(d * T) regardless of sparsity. On a problem with d dimensions of which only k typically carry signal, AdaGrad's bound scales like sqrt(k * T) rather than sqrt(d * T).
This sparsity-adaptive regret bound was the headline theoretical result and the main reason AdaGrad gained immediate adoption in NLP and recommendation pipelines. It provided a rigorous explanation for the empirical observation that adaptive per-feature learning rates outperform uniform schedules on text and click data.
Via standard online-to-batch conversion, the O(sqrt(T)) regret bound translates into a convergence rate of O(1/sqrt(T)) on the average iterate for stochastic convex optimization with bounded subgradients [1]. This matches the optimal rate for first-order stochastic methods on non-smooth convex problems.
AdaGrad's original analysis assumed convex losses, but neural network training is famously non-convex. Ward, Wu, and Bottou's 2019 paper "AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes" [4] addressed this gap. They proved that the norm version of AdaGrad (AdaGrad-Norm, where the same scalar stepsize is used for all coordinates) converges to a stationary point at rate O(log(N) / sqrt(N)) in the stochastic non-convex setting and O(1/N) in the deterministic case. Their analysis did not require knowledge of the loss's Lipschitz constant or the noise level, which is unusual for non-convex SGD analyses and which they took as evidence that AdaGrad's adaptivity is genuinely useful even outside its original convex setting.
Later work by Defossez, Bottou, Bach, and Usunier [5] gave a unified proof for both Adam and AdaGrad in the non-convex stochastic case, confirming that adaptive methods do converge under reasonable assumptions even though their behaviour on non-convex landscapes is less understood than on convex ones.
The full-matrix version of AdaGrad is closely related to natural gradient descent and to other second-order optimization methods. Natural gradient descent uses the inverse Fisher information matrix as a preconditioner, which produces updates that are invariant to reparameterization. AdaGrad's full-matrix preconditioner G_t^(-1/2) is built from the outer products of the observed gradients, which is closely related to the empirical Fisher information matrix (the Fisher under the empirical data distribution rather than the model distribution).
The diagonal AdaGrad accumulator s_i is the diagonal of the empirical second-moment matrix of the gradients. Dividing by sqrt(s_i) is therefore a coordinate-wise normalization by an estimate of gradient standard deviation, which can be viewed as a crude diagonal approximation to a Fisher- or Hessian-based preconditioner. This gives AdaGrad a loose interpretation as a memory-cheap, first-order substitute for genuine second-order methods, an idea that Adam later inherited and that connects through to recent work like Shampoo and second-order LLM optimizers.
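A quick numerical check of that identity with synthetic gradients (values are arbitrary and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 synthetic gradients over 5 coordinates with very different scales
grads = rng.normal(size=(1000, 5)) * np.array([5.0, 1.0, 0.2, 0.05, 0.01])

G = sum(np.outer(g, g) for g in grads)   # full-matrix accumulator: sum of g g^T
s = (grads**2).sum(axis=0)               # diagonal AdaGrad accumulator

print(np.allclose(np.diag(G), s))        # True: s_i is the diagonal of G_t
print(np.sqrt(s / len(grads)))           # per-coordinate RMS gradient estimate
```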
AdaGrad offers several benefits that explain its impact and its continued role in certain niches.
The most important strength is that the algorithm automatically scales the learning rate per parameter without any manual schedule. A practitioner still needs to set the global learning rate eta, but the per-coordinate adaptation eliminates the need for elaborate per-feature tuning. In sparse settings this often replaces what would otherwise be a feature-engineering effort to balance frequencies.
AdaGrad excels when gradients are sparse or non-uniformly distributed across features. In NLP with bag-of-words or hashed-feature representations, most words in any given example receive a zero gradient. Parameters for rare words accumulate slowly, so their effective learning rate stays large. Parameters for common words accumulate quickly and shrink. This per-feature adaptivity is exactly the right behaviour for sparse high-dimensional problems and is the regime where AdaGrad has a provable advantage over uniform schedules.
AdaGrad was developed in the online learning framework, where data arrives sequentially and the model must update after each observation. Its adaptive nature makes it well-suited to environments where the data distribution shifts gradually, since the per-parameter learning rates continuously adjust based on the observed gradients. Vowpal Wabbit, for example, uses AdaGrad as the default optimizer for its online learners [6].
AdaGrad maintains a single accumulator vector of the same shape as the parameter vector, i.e. one extra copy of the model parameters in memory. Adam and AdamW require two extra copies (the first and second moment estimates), doubling the optimizer state. For a 7-billion-parameter LLM, the difference is roughly 28 GB of fp32 optimizer state (one accumulator tensor versus Adam's two), which can matter when fitting a training run onto a given hardware budget. In practice this advantage is rarely the deciding factor, because Adam's other benefits usually dominate, but it is real.
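The arithmetic behind that figure, assuming fp32 optimizer state throughout and ignoring sharding or mixed-precision schemes:

```python
params = 7e9                  # 7-billion-parameter model
bytes_per_value = 4           # fp32

adagrad_state = 1 * params * bytes_per_value   # one accumulator tensor
adam_state = 2 * params * bytes_per_value      # first- and second-moment tensors

print(adagrad_state / 1e9)                     # ~28 GB
print((adam_state - adagrad_state) / 1e9)      # ~28 GB difference vs Adam
```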
The algorithm is short. There are no momentum buffers, no bias correction terms, no warmup schedules, no decoupled weight decay branches. A from-scratch AdaGrad fits in roughly five lines of code on top of an autograd library, which makes it easy to debug and easy to verify against published derivations.
The single most discussed weakness of AdaGrad is the vanishing learning rate. Because the accumulator s_t grows monotonically with t (gradients squared are always non-negative), the effective learning rate eta / sqrt(s_t) decreases monotonically and can shrink toward zero as training progresses. For long training runs on dense problems this means the optimizer eventually stops making meaningful progress, even when the loss has not yet plateaued.
Zeiler's AdaDelta paper, which introduced an algorithm designed specifically to fix this problem, demonstrated empirically that AdaGrad performs well for the first several epochs of neural network training but then slows down sharply because the accumulator denominators continue to grow without bound [7]. This single observation motivated the entire successor family (RMSProp, AdaDelta, Adam) to replace the cumulative sum with an exponential moving average.
In online settings where the data distribution shifts, AdaGrad cannot fully "forget" old gradient information. Once a coordinate has accumulated significant gradient mass, its effective learning rate stays small permanently, even if subsequent data suggests the parameter should now move quickly. RMSProp's exponential moving average solves this by giving older gradients geometrically less weight.
Although AdaGrad reduces the need for per-coordinate tuning, the choice of the global eta still matters. The initial gradient magnitudes heavily influence the early values of s_t, which then determine effective learning rates for the remainder of training. A global eta that is too large leads to large early accumulator values that cripple learning later; one that is too small leaves the model under-trained.
AdaGrad doubles the memory required to hold parameters compared to plain SGD (one parameter copy for w, one for s). For very large models or tightly memory-constrained settings (mobile, edge inference fine-tuning) this is an honest cost, although it remains half the cost of Adam.
For most dense neural network training, especially anything that runs for more than a few thousand steps, AdaGrad is empirically dominated by Adam, AdamW, or even well-tuned SGD-momentum. The vanishing learning rate is the proximate cause, and it is essentially unfixable without turning the algorithm into RMSProp or AdaDelta. For this reason, AdaGrad is rarely the right choice for modern dense deep learning and is essentially never used for LLM pretraining.
Several variants and extensions of AdaGrad exist beyond the basic diagonal form.
In parallel with the JMLR version, Duchi, Hazan, and Singer also derived a regularized dual averaging (RDA) variant of AdaGrad. AdagradDA replaces the proximal mirror descent style update with Nesterov's dual averaging, which leads to sparser solutions in the presence of L1 regularization. TensorFlow's AdagradDAOptimizer is a direct implementation.
A scalar version of AdaGrad maintains a single scalar accumulator equal to the running sum of the squared gradient norms, rather than per-coordinate accumulators. This is the version analyzed in the Ward, Wu, and Bottou non-convex convergence paper [4] and is sometimes preferred when per-coordinate adaptivity is undesirable, for example when one wants the adaptive scheme purely for stepsize selection rather than per-feature reweighting.
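A minimal sketch of AdaGrad-Norm as described above; the function and parameter names are illustrative rather than taken from any library:

```python
import numpy as np

def adagrad_norm(grad_fn, w0, eta=0.01, eps=1e-10, num_steps=1000):
    """AdaGrad-Norm: one shared scalar stepsize driven by squared gradient norms."""
    w = np.asarray(w0, dtype=float).copy()
    b2 = 0.0                                   # scalar running sum of ||g_t||^2
    for _ in range(num_steps):
        g = grad_fn(w)
        b2 += float(np.dot(g, g))              # accumulate the squared norm
        w -= eta * g / (np.sqrt(b2) + eps)     # same stepsize for every coordinate
    return w
```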
The more general framework Duchi, Hazan, and Singer worked in is composite mirror descent, in which AdaGrad arises as a specific choice of adaptive proximal function. Under this lens, AdaGrad sits in a family of algorithms that includes variants for L1-regularized problems, group-sparse problems, and matrix-structured problems.
RMSProp, AdaDelta, Adam, and AdamW are not strictly variants of AdaGrad but direct algorithmic descendants. They are covered in their own articles; the next section places them in chronological context.
AdaGrad is the start of a lineage that effectively defines modern adaptive deep learning optimization. The table below summarizes the main timeline, with each successor reacting to a specific limitation of its predecessor.
| Year | Optimizer | Authors | Innovation over predecessor | Source |
|---|---|---|---|---|
| 2010 | Adaptive bound optimization (later AdaGrad) | McMahan, Streeter | Adaptive regularizer based on observed gradients | COLT 2010 |
| 2011 | AdaGrad | Duchi, Hazan, Singer | Diagonal preconditioner with cumulative squared gradient sum | JMLR 12 |
| 2012 | RMSProp | Tieleman, Hinton | Exponential moving average of squared gradients (forgets old gradients) | Coursera lecture 6e |
| 2012 | AdaDelta | Zeiler | EMA of squared gradients plus EMA of squared parameter updates (no global LR) | arXiv 1212.5701 |
| 2014 | Adam | Kingma, Ba | RMSProp plus momentum plus bias correction | arXiv 1412.6980 / ICLR 2015 |
| 2014 | Adamax | Kingma, Ba | Infinity-norm variant of Adam | Same paper |
| 2018 | AMSGrad | Reddi, Kale, Kumar | Forces v_t to be non-decreasing; fixes a convergence proof bug | ICLR 2018 |
| 2019 | AdamW | Loshchilov, Hutter | Decoupled weight decay from gradient-scaled L2 | ICLR 2019 |
| 2020s | LAMB, Lion, Sophia, Shampoo, Adafactor | Various | Large-batch friendly, low-memory, second-order, or matrix-aware variants | Various |
Reading top to bottom, every later algorithm inherits the per-parameter-adaptive idea from AdaGrad. The main subsequent innovations have been replacing the cumulative sum with an EMA (RMSProp, AdaDelta, Adam), adding momentum (Adam), bias correction (Adam), decoupling regularization (AdamW), and bounding memory (Adafactor, Lion). The conceptual core (divide each coordinate's update by the square root of an estimate of its second moment) traces back unchanged to the 2011 paper.
| Optimizer | Gradient state | Decay scheme | Momentum | Bias correction | Memory cost (state per parameter) | Best at |
|---|---|---|---|---|---|---|
| SGD | None | Fixed or manual schedule | Optional | n/a | 0 (or 1 with momentum) | Well-tuned dense problems |
| AdaGrad | Cumulative sum of g^2 | Monotone decay (1/sqrt(t)) | No | No | 1 | Sparse online learning |
| RMSProp | EMA of g^2 | Bounded by EMA window | Optional in some implementations | No | 1 | Recurrent nets, RL, non-stationary signals |
| AdaDelta | EMA of g^2 plus EMA of (delta w)^2 | Adaptive, no global LR needed | No | No | 2 | Avoiding LR tuning |
| Adam | EMA of g and EMA of g^2 | Bounded by EMA windows | Yes (first moment) | Yes | 2 | General deep learning, default for most cases |
| AdamW | EMA of g and EMA of g^2 plus decoupled weight decay | Bounded by EMA windows | Yes | Yes | 2 | LLM and Transformer pretraining |
Despite being out of fashion for general deep learning, AdaGrad still appears in production systems where its specific strengths matter.
AdaGrad has been widely used in NLP tasks where sparse feature representations are common. Word embedding training, text classification, and language models that use large vocabularies all benefit from AdaGrad's tendency to keep effective learning rates high for infrequent tokens. The original GloVe word embedding algorithm by Pennington, Socher, and Manning (Stanford, EMNLP 2014) was trained using AdaGrad [8]. The Stanford GloVe codebase ships with AdaGrad as the default optimizer because the algorithm's per-coordinate adaptivity matches the heavy-tailed frequency distribution of word co-occurrence data.
Large-scale recommendation models work over very high-dimensional sparse feature spaces (user IDs, item IDs, hashed feature crosses). AdaGrad naturally handles this imbalance by adapting learning rates based on per-feature update frequency. Google's Wide and Deep Learning for Recommender Systems paper (Cheng et al., 2016) [9] reports the production setup used at Google Play: the wide linear part was trained with FTRL-Proximal with L1 regularization (a relative of AdaGrad), and the deep neural network part was trained with AdaGrad. The paper specifically attributes the choice of AdaGrad on the deep side to its adaptive behaviour on sparse embedding tables. This pattern (FTRL on the wide head, AdaGrad on the deep towers) became a standard recipe in recommendation literature for several years.
Vowpal Wabbit, the open-source online learning library originally built at Yahoo Research and maintained at Microsoft Research, uses AdaGrad as the default optimizer for its online learners, including its contextual bandit pipeline [6]. Online learning is exactly the regime AdaGrad was designed for: data arrives one example at a time, parameters update after each example, and per-feature adaptivity matters because feature frequencies are unknown a priori.
For convex problems and linear models such as logistic regression, AdaGrad's monotonically decreasing learning rate is actually desirable because it ensures convergence to the optimum. Industrial-scale convex classifiers in advertising and search ranking have used AdaGrad and its dual-averaging variant AdagradDA successfully.
A modern niche where AdaGrad survives is training embedding tables for users, items, or vocabulary tokens where the access pattern is power-law sparse. PyTorch's torch.optim.SparseAdam and similar sparse-update variants of Adam exist precisely because dense Adam wastes work updating embeddings that received no gradient. AdaGrad has a natural sparse-update form (only update accumulator entries for the coordinates whose gradient is non-zero), which makes it efficient for embedding-heavy models even when other parts of the model use Adam.
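A hedged sketch of that pattern in PyTorch; the table size, batch of IDs, and toy loss are placeholders for a real embedding-heavy model:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Large embedding table with sparse gradients; only rows touched by a batch
# produce gradient entries.
table = nn.Embedding(num_embeddings=1_000_000, embedding_dim=64, sparse=True)
optimizer = optim.Adagrad(table.parameters(), lr=0.01)

ids = torch.randint(0, 1_000_000, (32,))   # a minibatch of item IDs
loss = table(ids).pow(2).sum()             # toy loss standing in for the real one
loss.backward()                            # gradient arrives as a sparse tensor
optimizer.step()                           # only the touched rows and their
optimizer.zero_grad()                      # accumulator entries are updated
```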
A short list of regimes where AdaGrad is not the right choice:
- Long, dense training runs (large vision models, LLM pretraining), where the monotonically shrinking learning rate eventually stalls progress.
- Non-stationary or drifting data streams, where old gradient mass should be forgotten; RMSProp-style exponential averaging is a better fit.
- General deep learning workloads where a well-tuned Adam, AdamW, or SGD-momentum baseline is already available.
AdaGrad is available in every major deep learning framework.
In PyTorch, AdaGrad is implemented as torch.optim.Adagrad [10]:
import torch
import torch.optim as optim
model = MyModel()
optimizer = optim.Adagrad(
model.parameters(),
lr=0.01,
lr_decay=0,
weight_decay=0,
initial_accumulator_value=0,
eps=1e-10,
)
for input, target in dataset:
optimizer.zero_grad()
output = model(input)
loss = loss_fn(output, target)
loss.backward()
optimizer.step()
The PyTorch defaults are lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, and eps=1e-10. The lr_decay parameter implements an additional polynomial decay on top of AdaGrad's intrinsic adaptivity, which is rarely useful. The initial_accumulator_value parameter sets the starting value of s and is typically left at 0. PyTorch also provides a sparse-gradient code path for cases where most gradient entries are zero.
In TensorFlow, AdaGrad is available through the Keras API:
import tensorflow as tf
model = tf.keras.Sequential([...])
optimizer = tf.keras.optimizers.Adagrad(
learning_rate=0.001,
initial_accumulator_value=0.1,
epsilon=1e-7,
)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)
The Keras defaults are learning_rate=0.001, initial_accumulator_value=0.1, and epsilon=1e-7. Note the difference from PyTorch: TensorFlow defaults to a non-zero initial accumulator value of 0.1, which gives a more conservative (smaller) initial effective learning rate and improves numerical stability in early steps. This default difference can lead to noticeably different training dynamics if a script is ported between frameworks without re-tuning.
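One way to avoid surprises when porting is to set these hyperparameters explicitly rather than relying on defaults; for example, configuring PyTorch's Adagrad to mirror the Keras defaults quoted above:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)   # placeholder model for illustration
optimizer = optim.Adagrad(
    model.parameters(),
    lr=0.001,                        # Keras default learning rate
    initial_accumulator_value=0.1,   # Keras default accumulator floor
    eps=1e-7,                        # Keras default epsilon
)
```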
In JAX with Optax, AdaGrad is optax.adagrad:
import optax
optimizer = optax.adagrad(
learning_rate=0.01,
initial_accumulator_value=0.1,
eps=1e-7,
)
opt_state = optimizer.init(params)
for batch in dataset:
grads = grad_fn(params, batch)
updates, opt_state = optimizer.update(grads, opt_state)
params = optax.apply_updates(params, updates)
Optax's defaults follow TensorFlow's conventions (initial accumulator 0.1, eps 1e-7).
A few practical points worth knowing:
- Defaults differ across frameworks: PyTorch uses lr=0.01, initial_accumulator_value=0, eps=1e-10, while Keras and Optax use learning_rate=0.001, initial_accumulator_value=0.1, epsilon=1e-7. Porting a script between frameworks without re-tuning can change training dynamics noticeably.
- A non-zero initial accumulator value acts as a floor on the denominator, giving smaller and more stable steps early in training.
- Implementations also differ on whether epsilon sits inside or outside the square root; the difference matters only in the first few steps on coordinates with near-zero accumulators.
AdaGrad is no longer the default optimizer for general deep learning. Adam and AdamW dominate that role. However, the algorithm remains relevant for several reasons.
First, AdaGrad is the conceptual ancestor of every modern adaptive optimizer. RMSProp simply replaces its cumulative sum with an EMA. Adam adds momentum and bias correction on top of RMSProp. AdamW decouples weight decay. Each of these starts from AdaGrad's per-coordinate normalization and improves on the specific ways AdaGrad falls short. Anyone who reads the Adam paper without first understanding AdaGrad is missing half the motivation.
Second, AdaGrad still wins on sparse online learning with bounded horizons. Vowpal Wabbit, FTRL-style click prediction systems, and the Stanford GloVe codebase are all examples of production systems where AdaGrad continues to be a sensible default.
Third, AdaGrad's regret bound is one of the cleanest theoretical results in online optimization, and it provides the standard baseline against which adaptive methods are still benchmarked. The Duchi-Hazan-Singer paper has more than 20,000 citations on Google Scholar (as of 2025) and is one of the most influential papers in machine learning optimization of the 2010s.
Fourth, the diagonal preconditioner perspective on AdaGrad opens the door to second-order and natural-gradient methods. Recent low-memory second-order optimizers (Shampoo, K-FAC, Adafactor) can all be understood as more sophisticated takes on the same idea AdaGrad introduced: precondition the gradient by something derived from gradient history.
In 2026, an instructor teaching adaptive optimizers in a deep learning course will still derive AdaGrad first, then RMSProp, then Adam, then AdamW. That pedagogical sequence reflects both the historical order and the conceptual one. Even where AdaGrad is no longer the optimizer one would actually run, it remains the right one to learn first.
Imagine you are learning to throw a basketball into different hoops. Some hoops you practice on every day, so you are already pretty good at those. Other hoops are in unusual spots, so you rarely get to try them.
A regular coach would have you practice every hoop the same way, spending equal effort on each one. But that is not very smart, because you are already good at some hoops and bad at others.
AdaGrad is like a smart coach. It watches you practice and keeps track of how much practice you have already done on each hoop. For the hoops you have practiced a lot, the coach says, "You are good at this one, so let's only make small adjustments." For the hoops you have barely tried, the coach says, "You need more help here, so let's make bigger changes."
The one problem with this smart coach is that after a very long time, it starts thinking you have practiced every hoop so much that it barely lets you change anything at all. That is when you need a different coach, like RMSProp or Adam, who does not remember every single practice session from the beginning of time but only the recent ones.