# AdaGrad

> Source: https://aiwiki.ai/wiki/adagrad
> Updated: 2026-07-11
> Categories: Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**AdaGrad** (short for Adaptive Gradient Algorithm) is an [optimizer](/wiki/optimizer) for [gradient descent](/wiki/gradient_descent)-based [machine learning](/wiki/machine_learning) that gives every parameter its own [learning rate](/wiki/learning_rate), scaling each step by the inverse square root of a running sum of that parameter's past squared gradients. It was introduced by [John Duchi](/wiki/john_duchi), [Elad Hazan](/wiki/elad_hazan), and [Yoram Singer](/wiki/yoram_singer) in the 2011 paper "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," published in the *Journal of Machine Learning Research* (volume 12, pages 2121-2159), and it was one of the first widely adopted adaptive learning rate methods [1]. The authors framed the goal vividly: the per-feature adaptation "allows us to find needles in haystacks in the form of very predictive but rarely seen features" [1]. The same family of algorithms was discovered concurrently by H. Brendan McMahan and Matthew Streeter, whose paper "Adaptive Bound Optimization for Online Convex Optimization" appeared at the 23rd Annual Conference on Learning Theory (COLT 2010) [2]. AdaGrad assigns larger effective learning rates to parameters associated with infrequent features and smaller ones to parameters associated with frequent features, which makes it particularly effective for problems involving [sparse features](/wiki/sparse_feature) such as [natural language processing](/wiki/natural_language_processing), recommendation systems, and large-scale online ad click prediction.

Although AdaGrad has been largely superseded by Adam-family optimizers for general [deep learning](/wiki/deep_learning) work, it remains the conceptual ancestor of every modern adaptive optimizer. The core idea, dividing each gradient component by the square root of an accumulator that summarizes how much that parameter has moved in the past, runs through [RMSProp](/wiki/rmsprop), [AdaDelta](/wiki/adadelta), [Adam](/wiki/adam_optimizer), and [AdamW](/wiki/adamw). Understanding AdaGrad is therefore the cleanest way to understand the entire adaptive-rate optimizer family.

## Why was AdaGrad created?

In standard [stochastic gradient descent (SGD)](/wiki/stochastic_gradient_descent_sgd), a single global learning rate is applied uniformly to every parameter at every step. This creates a fundamental problem: the optimal step size for one parameter may be far too large or too small for another. Real datasets contain features that appear at vastly different frequencies (some words occur in millions of documents, others in only a handful) and a uniform learning rate treats all of them identically.

Parameters tied to rare features receive gradient signal infrequently. Under a fixed schedule like the classical Robbins-Monro decay rule, these rarely updated parameters are penalized in the same way as parameters that update on nearly every step. The net effect is that rare-but-predictive features (the "needles in haystacks" referenced in the Duchi paper [1]) get drowned out. AdaGrad was designed to fix this imbalance by using each parameter's own gradient history to scale its [step size](/wiki/step_size).

The insight behind AdaGrad is that the magnitude of past gradients in a given coordinate is a useful proxy for the local geometry of the [loss function](/wiki/loss_function) along that coordinate. Coordinates that have accumulated large gradients tend to lie in well-explored, steep directions and can afford smaller steps. Coordinates with small accumulated gradients are in flatter, less-explored directions and benefit from larger steps. The algorithm formalizes this intuition by giving each parameter its own time-varying learning rate that depends only on the history observed for that one coordinate.

Duchi, Hazan, and Singer derived the method from the framework of online convex optimization with regularized dual averaging and composite mirror descent [1]. McMahan and Streeter arrived at the same diagonal preconditioner from a different angle, by adaptively choosing the regularizer in an online convex optimization algorithm to match the geometry of the data [2]. Both groups proved tight regret bounds and noted that the bounds could be exponentially better than non-adaptive methods on sparse problems. The two papers were presented in the same conference cycle and are now treated as co-discoveries.

## How does the AdaGrad update rule work?

The AdaGrad update rule operates on a per-parameter basis. Given a parameter vector **w** at time step *t*, the algorithm proceeds element-wise.

### Update rule

At each time step *t*:

1. Compute the gradient (or subgradient) of the loss with respect to the current parameters:

   $$
   \mathbf{g}_t = \nabla_{\mathbf{w}} L(y_t, f(\mathbf{x}_t, \mathbf{w}_{t-1}))
   $$

2. Accumulate the element-wise squared gradients into a running sum:

   $$
   \mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2
   $$

   where $$\mathbf{s}_0 = \mathbf{0}$$ (initialized to zeros) and the squaring is performed element-wise. Some implementations call this accumulator **G**_t instead of **s**_t.

3. Update the parameters:

   $$
   \mathbf{w}_t = \mathbf{w}_{t-1} - (\eta / (\sqrt{\mathbf{s}_t} + \epsilon)) \, \mathbf{g}_t
   $$

   where eta is the initial (global) learning rate and epsilon is a small smoothing constant (typically 1e-7 or 1e-10) that prevents division by zero on coordinates that have not yet seen any gradient.

All operations (squaring, square root, division, multiplication) are performed element-wise. Each parameter w_i therefore has its own effective learning rate of $$\eta / (\sqrt{s_{t,i}} + \epsilon)$$, which can only decrease over the course of training as more gradient mass accumulates in s_i.

A common alternative formulation places the epsilon inside the square root: $$\eta / \sqrt{s_{t,i} + \epsilon}$$. The two versions are mathematically slightly different but practically indistinguishable. PyTorch and JAX/Optax use the outside-the-square-root form by default; some textbook derivations use the inside form.

### Full-matrix formulation

The original Duchi paper actually presents two versions of the algorithm. The full-matrix version uses the matrix

$$
G_t = \sum_{\tau=1}^t \mathbf{g}_\tau \mathbf{g}_\tau^\top
$$

the outer product matrix of all observed gradients, and updates parameters via

$$
\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, G_t^{-1/2} \, \mathbf{g}_t
$$

This full-matrix update captures correlations between parameters and is closely related to second-order methods that use a preconditioner derived from the empirical [Fisher information matrix](/wiki/fisher_information). However, it has $$O(d^2)$$ memory and $$O(d^3)$$ per-step time complexity for d-dimensional models, which is prohibitive for any model bigger than a few thousand parameters. The diagonal version (using only the diagonal entries of G_t) is what every practical implementation refers to as "AdaGrad," and it is what every modern framework provides out of the box.

Later work has revisited the full-matrix form. Agarwal, Bullins, Chen, Hazan, Singh, and Zhang's 2019 paper "Efficient Full-Matrix Adaptive Regularization" [3] showed that approximate full-matrix AdaGrad can be made tractable for moderately sized problems and can outperform the diagonal version, though the diagonal form remains overwhelmingly dominant in practice.

### Per-parameter effective learning rate

For each individual parameter w_i, the effective learning rate at step *t* is

$$
\eta_{\text{eff}}(i, t) = \eta / \left( \sqrt{\sum_{\tau=1}^t g_{\tau,i}^2} + \epsilon \right)
$$

The table below shows what this implies for different gradient histories. Frequently updated coordinates have effective learning rates that drop quickly; rarely updated coordinates retain comparatively large step sizes.

| Gradient history | Accumulated sum (s_i) | Effective learning rate | Interpretation |
|---|---|---|---|
| Large, frequent gradients | Grows quickly | Decreases rapidly | Coordinate is well-explored; take smaller steps |
| Small or infrequent gradients | Grows slowly | Remains relatively large | Coordinate needs more exploration; take larger steps |
| Mixed (large early, small later) | Frozen near early peak | Stays small permanently | Vanishing-rate problem; the algorithm cannot recover |

### Sparse versus dense gradients

AdaGrad's behaviour differs sharply between sparse and dense problems, which is the single most important fact for understanding when to use it.

| Property | Sparse gradients (e.g. NLP, ad CTR) | Dense gradients (e.g. CNN training) |
|---|---|---|
| Per-coordinate gradient frequency | Most coordinates see $$g_i = 0$$ most steps | Almost every coordinate sees a non-zero gradient every step |
| Behaviour of accumulator s_i | Grows slowly for rare coordinates | Grows roughly linearly with t for every coordinate |
| Effective learning rate over time | Stays usable for rare coordinates indefinitely | Decays as $$O(1 / \sqrt{t})$$ for every coordinate |
| Theoretical regret bound | $$O(\sqrt{T} \cdot \sqrt{\sum_i \max_t \lvert g_{t,i} \rvert^2})$$, exponentially better than SGD on sparse problems | $$O(\sqrt{d\,T})$$, comparable to standard online subgradient methods |
| Empirical outcome | Strongly preferred over plain SGD | Often beaten by SGD-momentum or Adam after a moderate number of epochs |

This table also explains why AdaGrad survived even after Adam took over deep learning: the sparse-gradient regime is exactly where AdaGrad has a provable advantage, and that regime appears in [recommender systems](/wiki/recommender_systems), online ads, and certain NLP pipelines.

## Algorithm steps

The complete AdaGrad algorithm can be summarized as follows.

| Step | Action |
|---|---|
| 1 | Initialize parameter vector $$\mathbf{w}_0$$ and set the gradient accumulator $$\mathbf{s}_0 = \mathbf{0}$$ |
| 2 | Choose a global learning rate eta (commonly 0.01) and smoothing constant epsilon (commonly 1e-8 or 1e-10) |
| 3 | For each training iteration $$t = 1, 2, \ldots, T$$: |
| 3a | Sample a minibatch (or single example, in the online setting) and compute the gradient $$\mathbf{g}_t$$ of the loss with respect to $$\mathbf{w}$$ |
| 3b | Update the accumulator: $$\mathbf{s}_t = \mathbf{s}_{t-1} + \mathbf{g}_t^2$$ (element-wise) |
| 3c | Compute the [parameter update](/wiki/parameter_update): $$\mathbf{w}_t = \mathbf{w}_{t-1} - (\eta / (\sqrt{\mathbf{s}_t} + \epsilon)) \, \mathbf{g}_t$$ |
| 4 | Return the final parameter vector $$\mathbf{w}_T$$ |

A pseudocode form looks like this:

```
function Adagrad(loss_fn, w0, eta, epsilon, num_steps):
    w = w0
    s = zeros_like(w)
    for t in 1..num_steps:
        g = grad(loss_fn, w)
        s = s + g * g
        w = w - eta * g / (sqrt(s) + epsilon)
    return w
```

## What are AdaGrad's convergence guarantees?

AdaGrad enjoys strong theoretical guarantees in the convex online learning setting, which is the setting both the JMLR and COLT papers analyzed.

### Online convex optimization

In the online convex optimization framework, an adversary picks a sequence of convex loss functions f_1, ..., f_T, and the learner picks parameters **w**_t before seeing f_t. Regret is defined as

$$
R_T = \sum_{t=1}^T f_t(\mathbf{w}_t) - \min_{\mathbf{w}^*} \sum_{t=1}^T f_t(\mathbf{w}^*)
$$

that is, the total excess loss compared to the best fixed comparator chosen in hindsight.

Duchi et al. and McMahan and Streeter both proved that AdaGrad achieves a regret bound of order O(sqrt(T)) under standard assumptions on the convexity and bounded subgradients [1][2]. More precisely, the bound has the form

$$
R_T \le O(D_\infty \sqrt{\sum_i \sqrt{\sum_t g_{t,i}^2}})
$$

where D_infinity bounds the diameter of the comparator set. The crucial feature is that the sum over coordinates *i* depends on the per-coordinate gradient history. When the gradient is sparse (most g_(t,i) are zero), this bound can be exponentially smaller than the corresponding bound for plain online subgradient descent, which scales with sqrt(d * T) regardless of sparsity. On a problem with d dimensions of which only k carry typical signal, AdaGrad's bound scales like $$\sqrt{k\,T}$$ rather than $$\sqrt{d\,T}$$.

This sparsity-adaptive regret bound was the headline theoretical result and the main reason AdaGrad gained immediate adoption in NLP and recommendation pipelines. It provided a rigorous explanation for the empirical observation that adaptive per-feature learning rates outperform uniform schedules on text and click data.

### Convergence rate for stochastic convex optimization

Via standard online-to-batch conversion, the O(sqrt(T)) regret bound translates into a convergence rate of $$O(1/\sqrt{T})$$ on the average iterate for stochastic convex optimization with bounded subgradients [1]. This matches the optimal rate for first-order stochastic methods on non-smooth convex problems.

### Non-convex convergence

AdaGrad's original analysis assumed convex losses, but [neural network](/wiki/neural_network) training is famously non-convex. Ward, Wu, and Bottou's 2019 paper "AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes" [4] addressed this gap. They proved that the norm version of AdaGrad (AdaGrad-Norm, where the same scalar stepsize is used for all coordinates) converges to a stationary point at rate $$O(\log(N) / \sqrt{N})$$ in the stochastic non-convex setting and $$O(1/N)$$ in the deterministic case. Their analysis did not require knowledge of the loss's Lipschitz constant or the noise level, which is unusual for non-convex SGD analyses and which they took as evidence that AdaGrad's adaptivity is genuinely useful even outside its original convex setting.

Later work by Defossez, Bottou, Bach, and Usunier [5] gave a unified proof for both Adam and AdaGrad in the non-convex stochastic case, confirming that adaptive methods do converge under reasonable assumptions even though their behaviour on non-convex landscapes is less understood than on convex ones.

## Connection to natural gradient and second-order methods

The full-matrix version of AdaGrad is closely related to natural gradient descent and to other second-order optimization methods. Natural gradient descent uses the inverse [Fisher information matrix](/wiki/fisher_information) as a preconditioner, which produces updates that are invariant to reparameterization. AdaGrad's full-matrix preconditioner G_t^(-1/2) is built from the outer products of the observed gradients, which is closely related to the empirical Fisher information matrix (the Fisher under the empirical data distribution rather than the model distribution).

The diagonal AdaGrad accumulator s_i is the diagonal of the empirical second-moment matrix of the gradients. Dividing by sqrt(s_i) is therefore a coordinate-wise normalization by an estimate of gradient standard deviation, which can be viewed as a crude diagonal approximation to a Fisher- or Hessian-based preconditioner. This gives AdaGrad a loose interpretation as a memory-cheap, first-order substitute for genuine second-order methods, an idea that Adam later inherited and that connects through to recent work like Shampoo and second-order LLM optimizers.

## What are AdaGrad's strengths?

AdaGrad offers several benefits that explain its impact and its continued role in certain niches.

### Automatic per-parameter learning rate adaptation

The most important strength is that the algorithm automatically scales the learning rate per parameter without any manual schedule. A practitioner still needs to set the global learning rate eta, but the per-coordinate adaptation eliminates the need for elaborate per-feature tuning. In sparse settings this often replaces what would otherwise be a feature-engineering effort to balance frequencies.

### Strong performance on sparse data

AdaGrad excels when gradients are sparse or non-uniformly distributed across features. In NLP with bag-of-words or hashed-feature representations, most words in any given example receive a zero gradient. Parameters for rare words accumulate slowly, so their effective learning rate stays large. Parameters for common words accumulate quickly and shrink. This per-feature adaptivity is exactly the right behaviour for sparse high-dimensional problems and is the regime where AdaGrad has a provable advantage over uniform schedules.

### Online learning suitability

AdaGrad was developed in the online learning framework, where data arrives sequentially and the model must update after each observation. Its adaptive nature makes it well-suited for environments where the data distribution shifts gradually, since the per-parameter learning rates continuously adjust based on the observed gradients. Vowpal Wabbit, the open-source online learning system originally built at Yahoo Research and later maintained at Microsoft Research, uses AdaGrad as the default optimizer for its online learners [6].

### Memory efficiency relative to Adam

AdaGrad maintains a single accumulator vector of the same shape as the parameter vector. That is one extra copy of model parameters in memory. Adam and AdamW require two extra copies (the first and second moment estimates), which doubles the optimizer state. For a 7-billion-parameter LLM, the difference is roughly 28 GB of optimizer state in fp32 (one accumulator vs two for Adam), which can matter for fitting training on a given hardware budget. In practice this advantage is rarely the deciding factor because Adam's other benefits dominate, but it is real.

### Simplicity of implementation

The algorithm is short. There are no momentum buffers, no bias correction terms, no warmup schedules, no decoupled weight decay branches. A from-scratch AdaGrad fits in roughly five lines of code on top of an autograd library, which makes it easy to debug and easy to verify against published derivations.

## What are AdaGrad's weaknesses?

### Vanishing learning rate

The single most discussed weakness of AdaGrad is the vanishing learning rate. Because the accumulator s_t grows monotonically with t (gradients squared are always non-negative), the effective learning rate eta / sqrt(s_t) decreases monotonically and can shrink toward zero as training progresses. For long training runs on dense problems this means the optimizer eventually stops making meaningful progress, even when the loss has not yet plateaued.

Zeiler's AdaDelta paper, which was specifically designed to fix this problem, demonstrated empirically that AdaGrad performs well for the first several epochs of [neural network](/wiki/neural_network) training but then slows down sharply because the accumulator denominators continue to grow without bound [7]. This is the single observation that motivated the entire successor family (RMSProp, AdaDelta, Adam) to replace the cumulative sum with an exponential moving average.

### Inability to handle non-stationary objectives

In online settings where the data distribution shifts, AdaGrad cannot fully "forget" old gradient information. Once a coordinate has accumulated significant gradient mass, its effective learning rate stays small permanently, even if subsequent data suggests the parameter should now move quickly. RMSProp's exponential moving average solves this by giving older gradients geometrically less weight.

### Sensitivity to the global learning rate

Although AdaGrad reduces the need for per-coordinate tuning, the choice of the global eta still matters. The initial gradient magnitudes heavily influence the early values of s_t, which then determine effective learning rates for the remainder of training. A global eta that is too large leads to large early accumulator values that cripple learning later; one that is too small leaves the model under-trained.

### Memory overhead versus plain SGD

AdaGrad doubles the memory required to hold parameters compared to plain SGD (one parameter copy for **w**, one for **s**). For very large models or tightly memory-constrained settings (mobile, edge inference fine-tuning) this is an honest cost, although it remains half the cost of Adam.

### Poor empirical performance on dense deep learning

For most dense neural network training, especially anything that runs for more than a few thousand steps, AdaGrad is empirically dominated by Adam, AdamW, or even well-tuned SGD-momentum. The vanishing learning rate is the proximate cause, and it is essentially unfixable without changing the algorithm into RMSProp or AdaDelta. For this reason, AdaGrad is rarely the right choice for modern deep learning, and especially never for [LLM](/wiki/llm) pretraining.

## Variants and extensions

Several variants and extensions of AdaGrad exist beyond the basic diagonal form.

### AdagradDA (regularized dual averaging form)

In parallel with the JMLR version, Duchi, Hazan, and Singer also derived a regularized dual averaging (RDA) variant of AdaGrad. AdagradDA replaces the proximal mirror descent style update with Nesterov's dual averaging, which leads to sparser solutions in the presence of L1 [regularization](/wiki/regularization). TensorFlow's AdagradDAOptimizer is a direct implementation.

### AdaGrad-Norm

A scalar version of AdaGrad maintains a single scalar accumulator equal to the running sum of the squared gradient norms, rather than per-coordinate accumulators. This is the version analyzed in the Ward, Wu, and Bottou non-convex convergence paper [4] and is sometimes preferred when per-coordinate adaptivity is undesirable, for example when one wants the adaptive scheme purely for stepsize selection rather than per-feature reweighting.

### Composite mirror descent

The more general framework Duchi, Hazan, and Singer worked in is composite mirror descent, in which AdaGrad arises as a specific choice of adaptive proximal function. Under this lens, AdaGrad sits in a family of algorithms that includes variants for L1-regularized problems, group-sparse problems, and matrix-structured problems.

### Other named successors

Not strictly variants of AdaGrad, but direct algorithmic descendants:

- RMSProp (Hinton lecture notes, 2012)
- AdaDelta (Zeiler, 2012)
- Adam (Kingma and Ba, 2014)
- Adamax (Kingma and Ba, 2014, infinity-norm variant of Adam)
- AMSGrad (Reddi, Kale, Kumar, ICLR 2018, fixes a convergence bug in Adam)
- AdamW (Loshchilov and Hutter, 2019, decoupled weight decay)

These are covered in their own articles. The next section places them in chronological context.

## How does AdaGrad relate to RMSProp and Adam?

AdaGrad is the start of a lineage that effectively defines modern adaptive deep learning optimization. The table below summarizes the main timeline, with each successor reacting to a specific limitation of its predecessor.

| Year | Optimizer | Authors | Innovation over predecessor | Source |
|---|---|---|---|---|
| 2010 | Adaptive bound optimization (later AdaGrad) | McMahan, Streeter | Adaptive regularizer based on observed gradients | COLT 2010 |
| 2011 | AdaGrad | Duchi, Hazan, Singer | Diagonal preconditioner with cumulative squared gradient sum | JMLR 12 |
| 2012 | RMSProp | Hinton (Tieleman, Hinton) | Exponential moving average of squared gradients (forgets old gradients) | Coursera lecture 6e |
| 2012 | AdaDelta | Zeiler | EMA of squared gradients plus EMA of squared parameter updates (no global LR) | arXiv 1212.5701 |
| 2014 | Adam | Kingma, Ba | RMSProp plus momentum plus bias correction | arXiv 1412.6980 / ICLR 2015 |
| 2014 | Adamax | Kingma, Ba | Infinity-norm variant of Adam | Same paper |
| 2018 | AMSGrad | Reddi, Kale, Kumar | Forces v_t to be non-decreasing; fixes a convergence proof bug | ICLR 2018 |
| 2019 | AdamW | Loshchilov, Hutter | Decoupled weight decay from gradient-scaled L2 | ICLR 2019 |
| 2020s | LAMB, Lion, Sophia, Shampoo, Adafactor | Various | Large-batch friendly, low-memory, second-order, or matrix-aware variants | Various |

Reading top to bottom, every later algorithm inherits the per-parameter-adaptive idea from AdaGrad. The main subsequent innovations have been replacing the cumulative sum with an EMA (RMSProp, AdaDelta, Adam), adding [momentum](/wiki/momentum) (Adam), bias correction (Adam), decoupling regularization (AdamW), and bounding memory (Adafactor, Lion). The conceptual core (divide each coordinate's update by the square root of an estimate of its second moment) traces back unchanged to the 2011 paper.

### Direct comparison

| Optimizer | Gradient state | Decay scheme | Momentum | Bias correction | Memory cost (state per parameter) | Best at |
|---|---|---|---|---|---|---|
| [SGD](/wiki/stochastic_gradient_descent_sgd) | None | Fixed or manual schedule | Optional | n/a | 0 (or 1 with momentum) | Well-tuned dense problems |
| AdaGrad | Cumulative sum of g^2 | Monotone decay (1/sqrt(t)) | No | No | 1 | Sparse online learning |
| [RMSProp](/wiki/rmsprop) | EMA of g^2 | Bounded by EMA window | Optional in some implementations | No | 1 | Recurrent nets, RL, non-stationary signals |
| [AdaDelta](/wiki/adadelta) | EMA of g^2 plus EMA of (delta w)^2 | Adaptive, no global LR needed | No | No | 2 | Avoiding LR tuning |
| [Adam](/wiki/adam_optimizer) | EMA of g and EMA of g^2 | Bounded by EMA windows | Yes (first moment) | Yes | 2 | General deep learning, default for most cases |
| [AdamW](/wiki/adamw) | EMA of g and EMA of g^2 plus decoupled weight decay | Bounded by EMA windows | Yes | Yes | 2 | LLM and Transformer pretraining |

## What is AdaGrad used for?

Despite being out of fashion for general deep learning, AdaGrad still appears in production systems where its specific strengths matter.

### Natural language processing with sparse representations

AdaGrad has been widely used in NLP tasks where sparse feature representations are common. Word embedding training, text classification, and language models that use large vocabularies all benefit from AdaGrad's tendency to keep effective learning rates high for infrequent tokens. The original [GloVe](/wiki/glove) word embedding algorithm by Pennington, Socher, and Manning (Stanford, EMNLP 2014) was trained using AdaGrad [8]. The Stanford GloVe codebase ships with AdaGrad as the default optimizer because the algorithm's per-coordinate adaptivity matches the heavy-tailed frequency distribution of word co-occurrence data.

### Recommendation and click-through rate prediction

Large-scale recommendation models work over very high-dimensional sparse feature spaces (user IDs, item IDs, hashed feature crosses). AdaGrad naturally handles this imbalance by adapting learning rates based on per-feature update frequency. Google's Wide and Deep Learning for Recommender Systems paper (Cheng et al., 2016) [9] reports the production setup used at Google Play: the wide linear part was trained with FTRL-Proximal with L1 regularization (a relative of AdaGrad), and the deep neural network part was trained with AdaGrad. The paper specifically attributes the choice of AdaGrad on the deep side to its adaptive behaviour on sparse embedding tables. This pattern (FTRL on the wide head, AdaGrad on the deep towers) became a standard recipe in recommendation literature for several years.

### Online learning systems

Vowpal Wabbit, the open-source online learning library originally built at Yahoo Research and maintained at Microsoft Research, uses AdaGrad as the default optimizer for its online learners, including its contextual bandit pipeline [6]. Online learning is exactly the regime AdaGrad was designed for: data arrives one example at a time, parameters update after each example, and per-feature adaptivity matters because feature frequencies are unknown a priori.

### Convex optimization and large-scale linear models

For convex problems and [linear models](/wiki/linear_regression) such as [logistic regression](/wiki/logistic_regression), AdaGrad's monotonically decreasing learning rate is actually desirable because it ensures convergence to the optimum. Industrial-scale convex classifiers in advertising and search ranking have used AdaGrad and its dual-averaging variant AdagradDA successfully.

### Embedding training in recommender systems

A modern niche where AdaGrad survives is training embedding tables for users, items, or vocabulary tokens where the access pattern is power-law sparse. PyTorch's torch.optim.SparseAdam and similar sparse-update variants of Adam exist precisely because dense Adam wastes work updating embeddings that received no gradient. AdaGrad has a natural sparse-update form (only update accumulator entries for the coordinates whose gradient is non-zero), which makes it efficient for embedding-heavy models even when other parts of the model use Adam.

## When should you not use AdaGrad?

A short list of regimes where AdaGrad is not the right choice:

- Long training runs on dense problems. The vanishing learning rate kills long-horizon training. Use Adam or AdamW.
- LLM pretraining. The standard is AdamW, with beta_2 lowered to 0.95 to react faster to the shifting gradient distribution. AdaGrad would freeze too early.
- Training where momentum is known to help (CNNs trained with SGD-momentum on ImageNet, for example). Vanilla AdaGrad has no momentum.
- Reinforcement learning with non-stationary policies. Use Adam, RMSProp, or specialized RL optimizers.
- Any setting where you want the option to revisit a region of parameter space at a later step at full speed. AdaGrad's learning rate cannot recover.

## How do you use AdaGrad in PyTorch, TensorFlow, and JAX?

AdaGrad is available in every major deep learning framework.

### PyTorch

In [PyTorch](/wiki/pytorch), AdaGrad is implemented as `torch.optim.Adagrad` [10]:

```python
import torch
import torch.optim as optim

model = MyModel()
optimizer = optim.Adagrad(
    model.parameters(),
    lr=0.01,
    lr_decay=0,
    weight_decay=0,
    initial_accumulator_value=0,
    eps=1e-10,
)

for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```

The PyTorch defaults are lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, and eps=1e-10. The lr_decay parameter implements an additional polynomial decay on top of AdaGrad's intrinsic adaptivity, which is rarely useful. The initial_accumulator_value parameter sets the starting value of s and is typically left at 0. PyTorch also provides a sparse-gradient code path for cases where most gradient entries are zero.

### TensorFlow / Keras

In [TensorFlow](/wiki/tensorflow), AdaGrad is available through the Keras API:

```python
import tensorflow as tf

model = tf.keras.Sequential([...])
optimizer = tf.keras.optimizers.Adagrad(
    learning_rate=0.001,
    initial_accumulator_value=0.1,
    epsilon=1e-7,
)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
model.fit(x_train, y_train, epochs=10)
```

The Keras defaults are learning_rate=0.001, initial_accumulator_value=0.1, and epsilon=1e-7. Note the difference from PyTorch: TensorFlow defaults to a non-zero initial accumulator value of 0.1, which gives a more conservative (smaller) initial effective learning rate and improves numerical stability in early steps. This default difference can lead to noticeably different training dynamics if a script is ported between frameworks without re-tuning.

### JAX / Optax

In JAX with Optax, AdaGrad is `optax.adagrad`:

```python
import optax

optimizer = optax.adagrad(
    learning_rate=0.01,
    initial_accumulator_value=0.1,
    eps=1e-7,
)
opt_state = optimizer.init(params)

for batch in dataset:
    grads = grad_fn(params, batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
```

Optax's defaults follow TensorFlow's conventions (initial accumulator 0.1, eps 1e-7).

### Implementation notes across frameworks

A few practical points worth knowing:

- The default initial_accumulator_value differs between frameworks. PyTorch uses 0, TensorFlow and Optax use 0.1. The Keras choice produces a smoother first few hundred steps; the PyTorch choice produces a more aggressive first step.
- The default eps differs too. PyTorch uses 1e-10, TensorFlow and Optax use 1e-7. Larger eps gives a more conservative early update; smaller eps lets sparse coordinates jump farther on their first update.
- All frameworks place epsilon outside the square root by default, matching the standard textbook formulation eta * g / (sqrt(s) + eps). The original Duchi paper used the same form.
- Sparse updates are supported in PyTorch (via SparseAdagrad-style code paths), TensorFlow (via tf.IndexedSlices), and Optax (via optax.adagrad with sparse gradients), all of which avoid touching accumulator entries for coordinates with no gradient. This is essential for embedding tables in recommendation models.

## Is AdaGrad still used today?

AdaGrad is no longer the default optimizer for general deep learning. Adam and AdamW dominate that role. However, the algorithm remains relevant for several reasons.

First, AdaGrad is the conceptual ancestor of every modern adaptive optimizer. RMSProp simply replaces its cumulative sum with an EMA. Adam adds momentum and bias correction on top of RMSProp. AdamW decouples weight decay. Each of these starts from AdaGrad's per-coordinate normalization and improves on the specific ways AdaGrad falls short. Anyone who reads the Adam paper without first understanding AdaGrad is missing half the motivation.

Second, AdaGrad still wins on sparse online learning with bounded horizons. Vowpal Wabbit, FTRL-style click prediction systems, and the Stanford GloVe codebase are all examples of production systems where AdaGrad continues to be a sensible default.

Third, AdaGrad's regret bound is one of the cleanest theoretical results in online optimization, and it provides the standard baseline against which adaptive methods are still benchmarked. The Duchi-Hazan-Singer paper is one of the most heavily cited machine learning optimization papers of the 2010s and remains a standard reference in modern optimizer research [1].

Fourth, the diagonal preconditioner perspective on AdaGrad opens the door to second-order and natural-gradient methods. Recent low-memory second-order optimizers (Shampoo, K-FAC, Adafactor) can all be understood as more sophisticated takes on the same idea AdaGrad introduced: precondition the gradient by something derived from gradient history.

In 2026, an instructor teaching adaptive optimizers in a deep learning course will still derive AdaGrad first, then RMSProp, then Adam, then AdamW. That pedagogical sequence reflects both the historical order and the conceptual one. Even where AdaGrad is no longer the optimizer one would actually run, it remains the right one to learn first.

## Explain Like I'm 5 (ELI5)

Imagine you are learning to throw a basketball into different hoops. Some hoops you practice on every day, so you are already pretty good at those. Other hoops are in unusual spots, so you rarely get to try them.

A regular coach would have you practice every hoop the same way, spending equal effort on each one. But that is not very smart, because you are already good at some hoops and bad at others.

AdaGrad is like a smart coach. It watches you practice and keeps track of how much practice you have already done on each hoop. For the hoops you have practiced a lot, the coach says, "You are good at this one, so let's only make small adjustments." For the hoops you have barely tried, the coach says, "You need more help here, so let's make bigger changes."

The one problem with this smart coach is that after a very long time, it starts thinking you have practiced every hoop so much that it barely lets you change anything at all. That is when you need a different coach, like RMSProp or Adam, who does not remember every single practice session from the beginning of time but only the recent ones.

## See also

- [Adam optimizer](/wiki/adam_optimizer)
- [AdamW](/wiki/adamw)
- [RMSProp](/wiki/rmsprop)
- [AdaDelta](/wiki/adadelta)
- [Stochastic gradient descent (SGD)](/wiki/stochastic_gradient_descent_sgd)
- [Momentum](/wiki/momentum)
- [Learning rate](/wiki/learning_rate)
- [Optimizer](/wiki/optimizer)
- [GloVe](/wiki/glove)
- [Vowpal Wabbit](/wiki/vowpal_wabbit)

## References

1. Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." *Journal of Machine Learning Research*, 12, 2121-2159. https://jmlr.org/papers/v12/duchi11a.html
2. McMahan, H. B., and Streeter, M. (2010). "Adaptive Bound Optimization for Online Convex Optimization." *Proceedings of the 23rd Annual Conference on Learning Theory (COLT)*. https://arxiv.org/abs/1002.4908
3. Agarwal, N., Bullins, B., Chen, X., Hazan, E., Singh, K., and Zhang, C. (2019). "Efficient Full-Matrix Adaptive Regularization." *Proceedings of the 36th International Conference on Machine Learning*, PMLR 97:102-110. http://proceedings.mlr.press/v97/agarwal19b/agarwal19b.pdf
4. Ward, R., Wu, X., and Bottou, L. (2019). "AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes." *Proceedings of the 36th International Conference on Machine Learning*, PMLR 97:6677-6686. https://proceedings.mlr.press/v97/ward19a.html
5. Defossez, A., Bottou, L., Bach, F., and Usunier, N. (2022). "A Simple Convergence Proof of Adam and Adagrad." *Transactions on Machine Learning Research*. https://openreview.net/pdf?id=ZPQhzTSWA7
6. Vowpal Wabbit project documentation. "Online learning with AdaGrad." Microsoft Research / GitHub. https://github.com/VowpalWabbit/vowpal_wabbit
7. Zeiler, M. D. (2012). "ADADELTA: An Adaptive Learning Rate Method." *arXiv preprint arXiv:1212.5701*. https://arxiv.org/abs/1212.5701
8. Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 1532-1543. https://nlp.stanford.edu/pubs/glove.pdf
9. Cheng, H.-T., Koc, L., Harmsen, J., Shaked, T., Chandra, T., Aradhye, H., Anderson, G., Corrado, G., Chai, W., Ispir, M., Anil, R., Haque, Z., Hong, L., Jain, V., Liu, X., and Shah, H. (2016). "Wide & Deep Learning for Recommender Systems." *Proceedings of the 1st Workshop on Deep Learning for Recommender Systems*. https://arxiv.org/abs/1606.07792
10. PyTorch Documentation. "torch.optim.Adagrad." https://docs.pytorch.org/docs/stable/generated/torch.optim.Adagrad.html
11. TensorFlow Documentation. "tf.keras.optimizers.Adagrad." https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adagrad
12. Optax Documentation. "optax.adagrad." https://optax.readthedocs.io/en/latest/api/optimizers.html#optax.adagrad
13. Hinton, G. (2012). "Neural Networks for Machine Learning, Lecture 6e: rmsprop." Coursera lecture slides. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
14. Kingma, D. P., and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." *Proceedings of the 3rd International Conference on Learning Representations (ICLR)*. https://arxiv.org/abs/1412.6980
15. Loshchilov, I., and Hutter, F. (2019). "Decoupled Weight Decay Regularization." *Proceedings of the 7th International Conference on Learning Representations (ICLR)*. https://openreview.net/forum?id=Bkg6RiCqY7
16. Ruder, S. (2016). "An overview of gradient descent optimization algorithms." *arXiv preprint arXiv:1609.04747*. https://arxiv.org/abs/1609.04747
17. Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. (2023). "Dive into Deep Learning," Section 12.7: Adagrad. https://d2l.ai/chapter_optimization/adagrad.html
18. Cornell University Computational Optimization Open Textbook. "AdaGrad." https://optimization.cbe.cornell.edu/index.php?title=AdaGrad
19. Reddi, S. J., Kale, S., and Kumar, S. (2018). "On the Convergence of Adam and Beyond." *Proceedings of the 6th International Conference on Learning Representations (ICLR)*. https://openreview.net/pdf?id=ryQu7f-RZ
20. Duchi, J., Shalev-Shwartz, S., Singer, Y., and Tewari, A. (2010). "Composite Objective Mirror Descent." *Proceedings of the 23rd Annual Conference on Learning Theory (COLT)*. https://www.learningtheory.org/colt2010/papers/057Duchi.pdf