# Step size

> Source: https://aiwiki.ai/wiki/step_size
> Updated: 2026-06-27
> Categories: Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [learning rate](/wiki/learning_rate), [parameter update](/wiki/parameter_update), [Adam](/wiki/adam_optimizer), [SGD](/wiki/stochastic_gradient_descent_sgd)*

In [machine learning](/wiki/machine_learning), the **step size** (also called the **learning rate**, usually written as the Greek letter eta or alpha) is the scalar that controls how far the parameters of a model move on each update during gradient-based optimization. Formally, plain [gradient descent](/wiki/gradient_descent) sets theta_{t+1} = theta_t - eta * grad L(theta_t), so the step size eta multiplies the gradient to fix the distance the parameters travel at every iteration. It is the single [hyperparameter](/wiki/hyperparameter) that most strongly determines whether training converges, how fast it converges, and what generalization the final model reaches [9][21]. Practitioners often discover that a model that fails to learn at one step size will train cleanly at another value an order of magnitude smaller or larger, which is why the step size is usually the first thing tuned and the first thing blamed when a run goes wrong.

Leslie Smith, whose work underpins much of modern learning-rate practice, puts the priority bluntly: "Since learning rate is regarded as the most important hyper-parameter to tune (Bengio, 2012) then momentum is also important." [9][21]

## What is the step size (definition and update rule)?

Given a [loss function](/wiki/loss_function) L(theta) over model parameters theta, plain gradient descent updates the parameters by subtracting a scaled gradient:

theta_{t+1} = theta_t - eta * grad L(theta_t)

Here eta is the step size. Each iteration moves the parameters a distance proportional to eta in the direction of steepest descent. In stochastic and mini-batch settings the true gradient is replaced by a noisy estimate computed on a [batch](/wiki/batch_size) of examples, but the role of eta does not change [7].

The term "step size" comes from the optimization literature, where the per-iteration parameter movement is literally the length of a step taken across the loss landscape. In the deep learning community the same quantity is almost always called the learning rate. The two names are interchangeable, although "step size" is more common in convex optimization and theoretical papers, while "learning rate" dominates practical guides and code APIs.

## Why is the step size the most important hyperparameter?

In deep learning, varying the step size by a factor of two often changes final accuracy more than swapping architectures or doubling the model size. The reasons trace back to the geometry of the loss surface:

- **Too small.** If eta is much smaller than the inverse of the local curvature of the loss (roughly 1 / lambda_max, where lambda_max is the largest eigenvalue of the Hessian), training is stable but unbearably slow. The parameters inch toward a minimum and may never reach it within the compute budget. Loss curves look almost flat.
- **Too large.** If eta exceeds roughly 2 / lambda_max, the update overshoots in the directions of highest curvature. The loss bounces around or grows. With deep networks the result is typically NaN values within a few hundred [steps](/wiki/step) as activations and gradients explode.
- **Just right.** A well-chosen eta produces fast progress in low-curvature directions and stable progress in high-curvature directions. There is rarely a single perfect value, but there is usually a band of acceptable values spanning roughly one order of magnitude.
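
The three regimes can be reproduced on a one-dimensional quadratic. The sketch below assumes the toy loss L(x) = 0.5 * c * x^2 with curvature c = 10; the function name `run_gd` and all constants are illustrative and not taken from the cited sources. For this loss each update multiplies x by (1 - eta * c), so the iteration shrinks x only when eta < 2 / c.

```python
def run_gd(eta, curvature=10.0, x0=1.0, steps=50):
    """Run plain gradient descent on L(x) = 0.5 * curvature * x**2; return |x| at the end."""
    x = x0
    for _ in range(steps):
        grad = curvature * x   # dL/dx for the toy quadratic
        x = x - eta * grad     # theta_{t+1} = theta_t - eta * grad L(theta_t)
    return abs(x)

too_small = run_gd(eta=0.001)    # each step multiplies x by 0.99: stable but slow
well_chosen = run_gd(eta=0.05)   # each step multiplies x by 0.5: fast convergence
too_large = run_gd(eta=0.25)     # each step multiplies x by -1.5: oscillates and grows
```

With these toy settings the stability threshold is eta = 2 / 10 = 0.2, which is why 0.25 diverges while 0.05 converges quickly.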

Leslie Smith's 2018 report on disciplined hyperparameter tuning argues that the learning rate is the dominant control knob: tuning it well lets you set most other hyperparameters from sensible defaults [9].

## What is the difference between effective and nominal step size?

For plain [stochastic gradient descent](/wiki/stochastic_gradient_descent) (SGD), the value passed to the optimizer is also the actual displacement per gradient unit. With [momentum](/wiki/momentum) and adaptive optimizers, the effective step size diverges from the nominal value:

- With heavy-ball momentum coefficient mu, the effective step size in the steady state is roughly eta / (1 - mu). At mu = 0.9 the effective rate is about ten times the nominal eta, which is why SGD with momentum often uses smaller eta than plain SGD.
- With [Adam](/wiki/adam_optimizer) and other adaptive methods, the effective step in each parameter is eta divided by a running estimate of the gradient magnitude. The actual displacement per parameter is approximately eta in the early stages of training but shrinks for parameters with large historical gradients. Kingma and Ba observe that in Adam "the effective magnitude of the steps taken in parameter space at each timestep are approximately bounded by the stepsize setting alpha," a property they describe as "establishing a trust region around the current parameter value." [3]
- With Lion the update is the sign of a smoothed gradient, so the per-parameter displacement is exactly eta in absolute value. This is why Lion needs a learning rate three to ten times smaller than [AdamW](/wiki/adamw) [16].

This distinction matters when porting hyperparameters between optimizers or papers. A learning rate of 1e-3 in Adam is not comparable to 1e-3 in SGD or 1e-3 in Lion.
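
The eta / (1 - mu) figure for momentum can be checked numerically. This is a minimal sketch, assuming the velocity convention v = mu * v + g followed by theta = theta - eta * v, and a constant gradient; the helper name `steady_state_step` is hypothetical.

```python
def steady_state_step(eta, mu, grad=1.0, iters=500):
    """Heavy-ball momentum with a constant gradient; returns the final per-step displacement."""
    velocity = 0.0
    for _ in range(iters):
        velocity = mu * velocity + grad   # accumulate the (constant) gradient
    return eta * velocity                 # displacement applied to the parameter

nominal = 0.01
effective = steady_state_step(eta=nominal, mu=0.9)
ratio = effective / nominal   # approaches 1 / (1 - 0.9) = 10
```

Under a different momentum convention (for example one that scales the gradient by 1 - mu) the ratio would differ, so the comparison only holds for the convention assumed here.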

## What is a learning rate schedule?

The step size does not need to stay constant. Almost every modern training run varies eta over time according to a schedule. The schedule typically combines a warmup phase (eta starts small and grows) with a decay phase (eta shrinks toward zero). The table below summarizes the schedules in common use.

| Schedule | Formula | Where it is used | Notes |
|---|---|---|---|
| Constant | eta_t = eta_0 | Online learning, simple baselines | Easy to debug, rarely optimal for deep nets |
| Step decay | eta_t = eta_0 * gamma^{floor(t / s)} | Classic ImageNet ResNets | Cut by 10x at chosen epoch boundaries |
| Exponential decay | eta_t = eta_0 * gamma^t | Older RL and online setups | Smooth analog of step decay |
| Polynomial decay | eta_t = eta_0 * (1 - t/T)^p | BERT pretraining (p=1, linear) | Linear-to-zero is a common LLM default |
| Inverse square root | eta_t proportional to 1 / sqrt(t) | Original Transformer | Combined with linear warmup [6] |
| Cosine annealing | eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T)) | Modern vision and LLM pretraining | Smooth, single-cycle, no extra hyperparameters [5] |
| Cosine with warm restarts (SGDR) | Cosine over a cycle, then restart | Snapshot ensembles, fine-tuning | Periodic restarts can escape sharp minima [5] |
| One-cycle | Triangular up then down, plus a final tail | Fast.ai super-convergence regime | Allows much larger peak LR than usual [8] |
| Cyclical | Triangular oscillation between eta_min and eta_max | CV experiments | Avoids tuning a single peak value [4] |
| Linear warmup + cosine decay | Linear ramp for K steps, then cosine | GPT-3, Llama, Chinchilla | The default for LLM pretraining [14][15][18] |
| ReduceLROnPlateau | Drop eta when validation loss stops improving | Older PyTorch workflows | Reactive, not deterministic |

The two schedules dominating modern practice are linear warmup plus cosine decay (for large pretraining runs) and one-cycle (for shorter runs that benefit from aggressive maximum learning rates).

### What is learning rate warmup?

Warmup linearly increases eta from zero (or a small value) to a peak over the first K steps of training. Goyal and colleagues introduced gradual warmup in 2017 to make large-batch SGD work on ImageNet, and it is now standard for [transformers](/wiki/transformer) and any model trained with adaptive optimizers at scale [7]. Without warmup, adaptive optimizers like Adam apply unstable updates in the first few hundred steps, when the second-moment estimate is still noisy and the bias correction divides by a small denominator. Warmup gives the optimizer time to stabilize before applying full-strength updates.

Typical warmup lengths range from a few hundred steps for small models to ten thousand steps or more for the largest LLMs. GPT-3 used 375 million tokens of warmup [14], and Llama 3 405B used 8,000 steps [18].

### How does cosine annealing work?

The cosine annealing schedule was proposed by Loshchilov and Hutter in their 2017 ICLR paper SGDR (Stochastic Gradient Descent with Warm Restarts) [5]. Within a cycle of length T_i, the learning rate at step T_cur is:

eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * T_cur / T_i))

The value starts at eta_max, follows a half-cosine curve, and ends at eta_min. The schedule has no plateau and no abrupt drops, which empirically generalizes better than step decay. The original paper also proposed warm restarts: after T_i steps, reset T_cur to zero and start a new cycle, optionally with a longer T_i. Modern LLM training usually uses a single cosine cycle, with eta_min set to roughly 10% of eta_max (the choice popularized by Chinchilla) [15].
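
The formula and the restart rule can be written as two small functions. This is a sketch rather than a reference implementation: the names `cosine_lr` and `sgdr_lr` and the cycle-length multiplier `t_mult` are assumptions made for illustration.

```python
import math

def cosine_lr(t_cur, t_i, eta_max, eta_min=0.0):
    """Cosine annealing within one cycle of length t_i."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

def sgdr_lr(step, first_cycle, eta_max, eta_min=0.0, t_mult=2):
    """Cosine with warm restarts: each new cycle is t_mult times longer than the last."""
    t_i = first_cycle
    t_cur = step
    while t_cur >= t_i:      # walk past completed cycles
        t_cur -= t_i
        t_i *= t_mult
    return cosine_lr(t_cur, t_i, eta_max, eta_min)

start = cosine_lr(0, 100, eta_max=1e-3, eta_min=1e-4)       # equals eta_max
midpoint = cosine_lr(50, 100, eta_max=1e-3, eta_min=1e-4)   # halfway between the two
end = cosine_lr(100, 100, eta_max=1e-3, eta_min=1e-4)       # equals eta_min
restart = sgdr_lr(100, 100, eta_max=1e-3, eta_min=1e-4)     # first step of the second cycle
```

A single-cycle LLM schedule corresponds to calling `cosine_lr` with `t_i` equal to the planned number of decay steps and `eta_min` at about 10% of `eta_max`.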

### What is the inverse-square-root schedule?

The original Transformer paper by Vaswani and colleagues used a custom schedule combining linear warmup with inverse-square-root decay [6]:

lr = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})

This schedule auto-scales with the model dimension d_model, and its two branches meet at step = warmup_steps, where the learning rate peaks. The base Transformer used 4,000 warmup steps [6]. The schedule still appears in some NLP codebases, although cosine has largely replaced it in newer work.
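
The formula translates directly into code. The sketch below assumes d_model = 512 (the base Transformer width) and clamps the step to 1 to avoid dividing by zero; the function name `transformer_lr` is illustrative.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)   # step 0 would raise ZeroDivisionError in step ** -0.5
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)      # both branches agree here
half_up = transformer_lr(2000)   # halfway through warmup: half the peak
later = transformer_lr(16000)    # four times the warmup length: half the peak again
```

With these assumed values the peak is about 7e-4, and the rate halves every time the step count quadruples after warmup.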

### What is the one-cycle policy and super-convergence?

Leslie Smith introduced cyclical learning rates in 2017 and the one-cycle policy in 2018 [4][8]. The one-cycle schedule rises from a low value to a maximum over the first half of training, then falls symmetrically back, with a short final tail well below the starting point. Smith and Topin showed that this schedule enables "super-convergence": in their CIFAR-10 experiments, a ResNet-56 that needed about 80,000 iterations with a standard piecewise-constant schedule reached comparable or better accuracy in about 10,000 iterations with one-cycle and a much larger peak LR than constant-rate training would tolerate [8]. The schedule is implemented in PyTorch as `torch.optim.lr_scheduler.OneCycleLR` and is widely used in fastai workflows [19].
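
The shape described above can be sketched as a piecewise-linear function. This is a simplified illustration, not PyTorch's `OneCycleLR` implementation: the function name `one_cycle_lr`, the 10% tail fraction, and the two divisor values are assumptions chosen for the example.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=25.0,
                 final_div_factor=1e4, tail_frac=0.1):
    """Triangular rise and fall, then a short tail that drops well below the start value."""
    start_lr = max_lr / div_factor
    final_lr = start_lr / final_div_factor
    cycle_steps = round(total_steps * (1 - tail_frac))   # steps spent in the triangle
    half = cycle_steps // 2
    if step < half:                                      # ramp up to the peak
        return start_lr + (max_lr - start_lr) * step / half
    if step < cycle_steps:                               # ramp back down to the start value
        return max_lr - (max_lr - start_lr) * (step - half) / (cycle_steps - half)
    tail = max(total_steps - cycle_steps, 1)             # final tail below the start value
    frac = min((step - cycle_steps) / tail, 1.0)
    return start_lr + (final_lr - start_lr) * frac

first = one_cycle_lr(0, 1000, max_lr=1.0)
peak = one_cycle_lr(450, 1000, max_lr=1.0)
last = one_cycle_lr(1000, 1000, max_lr=1.0)
```

In practice the peak passed as `max_lr` would come from an LR range test rather than being picked by hand.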

### How do you find a good learning rate (the LR finder)?

Smith's 2017 cyclical paper also introduced the **LR range test**, a quick way to find a sensible peak learning rate for a new model and dataset [4]. In Smith's words, "In the LR range test, training starts with a small learning rate which is slowly increased linearly throughout a pre-training run. This single run provides valuable information on how well the network can be trained over a range of learning rates and what is the maximum learning rate." [9] The recipe:

1. Start training from a tiny learning rate (for example 1e-7).
2. Increase eta after each mini-batch: linearly in Smith's original test, exponentially in the widely used fastai-style variant.
3. Plot training loss against log(eta).
4. The loss decreases over a range of eta, then suddenly explodes upward. Pick a value just before the explosion (often one order of magnitude below the divergence point).

The LR finder takes a single short run (often a few hundred batches) and tends to give a usable peak learning rate without hand tuning. It is built into fastai as `Learner.lr_find` and is implemented in PyTorch Lightning's `Tuner.lr_find`.
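
The recipe can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in for a real range test: it uses a noise-free quadratic loss instead of a network, an exponential sweep, a stop rule of "loss exceeds four times the best loss so far", and the heuristic of backing off one order of magnitude from the rate at the lowest recorded loss. None of these specific choices come from the cited papers.

```python
def lr_range_test(curvature=10.0, w0=1.0, lr_start=1e-6, lr_end=1.0, num_steps=60):
    """Sweep the learning rate exponentially on L(w) = 0.5 * curvature * w**2."""
    w = w0
    history = []                       # (learning rate, loss measured before that update)
    best_loss = float("inf")
    for k in range(num_steps):
        lr = lr_start * (lr_end / lr_start) ** (k / (num_steps - 1))
        loss = 0.5 * curvature * w * w
        history.append((lr, loss))
        if loss > 4 * best_loss:       # the loss has exploded: stop the sweep
            break
        best_loss = min(best_loss, loss)
        w -= lr * curvature * w        # one gradient step at the current rate
    return history

history = lr_range_test()
lr_at_min, _ = min(history, key=lambda pair: pair[1])
suggested_lr = lr_at_min / 10          # back off one order of magnitude
```

For this toy loss the true divergence threshold is 2 / curvature = 0.2, and the suggested value lands roughly an order of magnitude below it. Real losses are noisy, so practical implementations usually smooth the loss curve before choosing a value.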

## How do adaptive optimizers set per-parameter step sizes?

A second family of methods sidesteps schedule tuning by adapting the step size per parameter using gradient statistics. The user still sets a global eta, but each parameter gets a different effective step.

| Optimizer | Introduced by | Key idea | Typical eta / notes |
|---|---|---|---|
| AdaGrad | Duchi, Hazan, Singer 2011 [1] | Divide by sqrt of historical sum of squared gradients | 0.01 |
| RMSProp | Hinton 2012 (Coursera lecture) [2] | Replace AdaGrad sum with exponential moving average | 1e-3 |
| Adam | Kingma & Ba 2014 [3] | Combine momentum and RMSProp with bias correction | 1e-3 default, 1e-4 to 3e-4 for transformers |
| AdamW | Loshchilov & Hutter ICLR 2019 [13] | Decouple weight decay from the gradient update | 1e-4 to 3e-4 |
| Adafactor | Shazeer & Stern 2018 [10] | Factor the second-moment matrix to save memory | Often runs without explicit eta |
| LAMB | You et al. 2019 | Layer-wise adaptive scaling for huge batches | Used to train BERT in 76 minutes |
| Lion | Chen et al. 2023 [16] | Sign of smoothed gradient; one momentum buffer | 3e-5 to 1e-4 (3 to 10x lower than AdamW) |
| Sophia | Liu et al. 2023 [17] | Diagonal Hessian estimate as preconditioner | About 2x faster than Adam in steps |

AdaGrad's accumulated denominator grows monotonically, so the effective step size shrinks toward zero over time [1]. This works well in convex problems but is too aggressive for deep nets, which is why RMSProp replaced the running sum with an exponential moving average [2]. Adam combines RMSProp's adaptive scaling with momentum and adds bias-correction terms that account for the small initial values of the moving averages [3]. Its defaults (eta = 0.001, beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8) work surprisingly well across many tasks, which is why Adam became the default for almost everything between 2015 and 2018 [3].
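
Two of these properties are easy to verify on a single scalar parameter: Adam's first step has magnitude close to eta regardless of gradient scale, and AdaGrad's step shrinks as squared gradients accumulate. The sketch below follows the textbook update rules; the function names and the constant-gradient setup are illustrative assumptions.

```python
import math

def adam_step(grad, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)   # bias correction for the second moment
    return -eta * m_hat / (math.sqrt(v_hat) + eps), m, v

def adagrad_step(grad, accum, eta=0.01, eps=1e-10):
    """One AdaGrad update for a scalar parameter; returns (delta, accum)."""
    accum = accum + grad * grad    # running sum of squared gradients, never decays
    return -eta * grad / (math.sqrt(accum) + eps), accum

# First Adam step: the displacement is about eta for a tiny or a huge gradient.
small, _, _ = adam_step(grad=1e-3, m=0.0, v=0.0, t=1)
large, _, _ = adam_step(grad=1e3, m=0.0, v=0.0, t=1)

# AdaGrad with a constant gradient of 1: the step shrinks like eta / sqrt(t).
accum, adagrad_steps = 0.0, []
for _ in range(100):
    delta, accum = adagrad_step(grad=1.0, accum=accum)
    adagrad_steps.append(abs(delta))
```

After 100 identical gradients the AdaGrad step has fallen to a tenth of its initial size, which illustrates why the monotone accumulator becomes too aggressive over long training runs.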

AdamW fixed a subtle problem in how Adam handled L2 [regularization](/wiki/regularization). In Adam with an L2 penalty, the decay term is added to the gradient and then divided by the same per-parameter denominator, which weakens the regularization for parameters with large historical gradients. AdamW decouples [weight decay](/wiki/weight_decay) from the gradient step so that decay applies uniformly [13]. This change matters most for [overfitting](/wiki/overfitting)-prone models and is the reason essentially every transformer trained after 2019 uses AdamW rather than Adam.
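
The difference can be seen on a single weight that receives a zero-mean gradient, so that any shrinkage comes from regularization alone. This is a toy sketch under stated assumptions: the alternating-sign gradient, the hyperparameters, and the function name `train` are invented for illustration, and the decay is applied after the adaptive step, which may differ in ordering from library implementations.

```python
import math

def train(decoupled, grad_scale, steps=2000, eta=1e-2, wd=0.1,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam on one scalar weight whose loss gradient alternates between +/- grad_scale."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = grad_scale if t % 2 else -grad_scale   # zero-mean: no net loss signal
        if not decoupled:
            grad += wd * w                            # L2 penalty folded into the gradient
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= eta * m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            w -= eta * wd * w                         # decay applied outside the adaptive scaling
    return w

adam_l2_small = train(decoupled=False, grad_scale=0.1)   # decays almost to zero
adam_l2_large = train(decoupled=False, grad_scale=10.0)  # barely regularized
adamw_small = train(decoupled=True, grad_scale=0.1)
adamw_large = train(decoupled=True, grad_scale=10.0)     # same decay as adamw_small
```

With the L2 form, the weight that sees large gradients keeps most of its magnitude while the one that sees small gradients is driven toward zero; with decoupled decay both shrink by the same amount.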

Lion (Evolved Sign Momentum) was discovered by Google's symbolic search over optimizer programs in 2023 [16]. Its update is the sign of a momentum-smoothed gradient, so each parameter moves by exactly eta in absolute value. Because the update has uniform magnitude, Lion needs a learning rate three to ten times smaller than AdamW; the authors recommend pairing this with a weight decay three to ten times larger [16]. Lion uses about half the optimizer memory of AdamW (one buffer instead of two) and performs comparably or better on language modeling and image classification [16].
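
The uniform step magnitude follows directly from the update rule. The sketch below applies one Lion step to a scalar weight, assuming beta1 = 0.9 and beta2 = 0.99 and no weight decay; the helper names `sign` and `lion_step` are illustrative.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(w, grad, m, eta=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update for a scalar weight; returns (new_w, new_m)."""
    c = beta1 * m + (1 - beta1) * grad       # interpolate momentum and current gradient
    new_w = w - eta * (sign(c) + wd * w)     # sign update plus decoupled weight decay
    new_m = beta2 * m + (1 - beta2) * grad   # the single momentum buffer
    return new_w, new_m

tiny_grad_w, _ = lion_step(w=1.0, grad=1e-6, m=0.0)
huge_grad_w, _ = lion_step(w=1.0, grad=1e6, m=0.0)
# Both weights move by exactly eta, regardless of the gradient's magnitude.
```

Because every coordinate moves by eta on every step, the norm of the full update is much larger than a typical AdamW update at the same nominal rate, which is consistent with the smaller learning rates recommended for Lion.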

Sophia, from Stanford in 2023, uses a cheap diagonal Hessian estimate as a preconditioner. On GPT-style language models from 125M to 1.5B parameters, Sophia reaches the same perplexity as AdamW in roughly half as many steps [17]. Its adoption is still limited compared to AdamW.

## What step sizes are recommended for common tasks?

The following are starting points, not final values. Always tune for your dataset and architecture.

| Task / setup | Optimizer | Peak step size | Schedule |
|---|---|---|---|
| ResNet-50 on ImageNet, batch 256 | SGD + momentum 0.9 | 0.1 | Step decay at epoch 30, 60, 90 |
| ResNet-50 on ImageNet, batch 8192 | SGD + momentum 0.9 | 3.2 (linear scaling from 0.1) [7] | 5 epoch warmup + step decay |
| Vision transformer | AdamW | 1e-3 | Linear warmup + cosine |
| BERT pretraining | AdamW | 1e-4 | Linear warmup + linear decay |
| GPT-3 175B pretraining | Adam | 6e-5 [14] | Cosine to 10% over 260B tokens, then constant (300B tokens total) |
| Llama 3 405B pretraining | AdamW | 8e-5 [18] | Cosine to 8e-7 over 1.2M steps |
| Llama 3 8B pretraining | AdamW | 3e-4 [18] | Cosine, 2,000 step warmup |
| Fine-tuning a pretrained LLM | AdamW | 1e-5 to 5e-5 | Linear warmup + linear decay |
| LoRA adapter fine-tuning | AdamW | 1e-4 to 5e-4 | Constant or linear |
| Diffusion model training | AdamW or Lion | 1e-4 (AdamW), 3e-5 (Lion) | Constant or cosine |
| Reinforcement learning (PPO) | Adam | 3e-4 | Linear decay over total steps |

Several patterns are visible. Larger models use smaller peak learning rates: GPT-3's 175B model trained at 6e-5 while smaller GPT-3 variants used up to 6e-4 [14]. Fine-tuning uses much smaller rates than pretraining (typically one to two orders of magnitude lower) because the pretrained weights are already good and aggressive updates would erase what the model already knows. Adapter methods like LoRA use higher rates than full fine-tuning because they update a smaller, randomly initialized parameter set.

## How are step size and batch size related?

Step size and [batch size](/wiki/batch_size) are tightly coupled. The two best-known rules:

- **Linear scaling rule (Goyal et al. 2017).** The rule states, verbatim: "When the minibatch size is multiplied by k, multiply the learning rate by k." [7] This lets a model trained with batch 256 and eta 0.1 also train with batch 8192 and eta 3.2. The rule requires gradual warmup; without it, the large-batch run diverges in the first few iterations. Using this rule, the authors report that their "Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy." [7]
- **Square-root scaling rule.** Some authors instead recommend multiplying eta by sqrt(k), arguing that the standard deviation of the mini-batch gradient estimate shrinks as 1 / sqrt(batch). This rule is more conservative and is sometimes preferred for adaptive optimizers.
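
The arithmetic behind both rules fits in one helper. This is a minimal sketch; the function name `scale_lr` and its `rule` argument are hypothetical, and the numbers reuse the batch 256 to batch 8192 example.

```python
import math

def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a learning rate when the batch size changes by a factor k."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k             # linear scaling rule
    if rule == "sqrt":
        return base_lr * math.sqrt(k)  # square-root scaling rule
    raise ValueError(f"unknown rule: {rule}")

linear_lr = scale_lr(0.1, 256, 8192)               # k = 32, so 0.1 * 32 = 3.2
sqrt_lr = scale_lr(0.1, 256, 8192, rule="sqrt")    # 0.1 * sqrt(32), about 0.566
```

The gap between the two results (3.2 versus roughly 0.57) shows how much more conservative the square-root rule is at large k.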

The linear rule eventually breaks down at very large batches. McCandlish, Kaplan and colleagues at OpenAI introduced the **gradient noise scale** in 2018 to predict where this breakdown happens [12]. In its simplest form, the noise scale is the ratio of the total per-example gradient variance to the squared norm of the true gradient. Below the noise scale, doubling the batch lets you double eta and halve training steps; above the noise scale, returns diminish quickly. The noise scale grows as a model trains, so the optimal batch size grows too. This insight informed the schedule of warming up batch size during GPT-3 pretraining [12][14].

Smith and Le (2018) argued from a Bayesian perspective that the relevant quantity is the ratio eta / batch_size, which sets what they call the "noise scale" of SGD [11]. Doubling eta or halving batch_size therefore has the same effect on the implicit noise, and decaying eta is roughly equivalent to growing the batch. This is why "don't decay the learning rate, increase the batch size" can be a viable schedule in distributed training [11].

## What are the common step-size failure modes?

Most training problems trace back to the step size. The diagnostic table:

| Symptom | Likely cause | Fix |
|---|---|---|
| Loss is NaN or Inf within a few hundred steps | eta far too high | Cut eta by 10x, add or lengthen warmup |
| Loss explodes after a long calm period | eta still too high or schedule misaligned | Check gradient norms, add gradient clipping |
| Loss decreases then plateaus very early | eta too low | Try 3x or 10x larger eta |
| Loss oscillates between values without trending down | eta too high in noisy direction | Lower eta, raise [batch size](/wiki/batch_size) |
| Loss looks fine, validation accuracy stays poor | eta probably fine, check regularization | Adjust weight decay, dropout, augmentation |
| Loss collapses to a constant (model predicts one class) | Sometimes too-low eta combined with bad init | Reset, try a different seed, use LR finder |
| Training diverges only with adaptive optimizer | Missing warmup | Add 1,000 to 10,000 step linear warmup |
| Loss spikes near end of training | Cosine schedule decayed too aggressively | Set eta_min above zero (10% of peak is common) |

Gradient clipping (typically `torch.nn.utils.clip_grad_norm_` with a maximum norm of 1.0) often saves runs that the scheduled step size alone would not. Modern LLM pretraining recipes routinely combine warmup, cosine decay, gradient clipping, and AdamW with weight decay 0.1 [18].
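
Norm clipping rescales the whole gradient vector when its global norm exceeds a threshold, which caps the distance a single update can travel. The sketch below works on a plain list of floats; the function name `clip_by_global_norm` and the small stabilizing constant are illustrative assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale grads so their L2 norm is at most max_norm; returns (clipped, original_norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0])    # norm 5.0 is rescaled to about 1.0
untouched, _ = clip_by_global_norm([0.3, 0.4])            # norm 0.5 is left unchanged
```

The direction of the gradient is preserved; only its length changes, so clipping acts like a temporary reduction of the step size on steps with unusually large gradients.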

## How do large language models set the step size?

Large language model pretraining has converged on a fairly narrow recipe for the step size:

- **Optimizer:** AdamW with beta1 = 0.9, beta2 = 0.95 (a small change from Adam's default 0.999, found to be more stable for LLMs) [14][18].
- **Schedule:** Linear warmup over a few thousand steps, then a single cosine decay.
- **Peak eta:** Roughly 1e-4 to 6e-4, scaling down for larger models. Empirically, peak eta tracks 1 / sqrt(model_width) reasonably well.
- **Final eta:** Decays to 10% of peak, following the Chinchilla recipe [15]. Some recent work (for example on warmup-stable-decay, or WSD, schedules) advocates for a constant peak followed by a fast linear decay to zero.
- **Decay length:** The cosine cycle should match the planned training horizon. Decaying too fast leaves performance on the table; decaying too slowly means the run ends before the schedule reaches eta_min.
- **Gradient clipping:** Norm clipping at 1.0 [18].
- **Weight decay:** 0.1, applied via AdamW's decoupled term [13][18].

GPT-3 175B used Adam with peak 6e-5 and cosine decay to 10% over the first 260B of its 300B training tokens, holding the rate at 10% afterward [14]. Llama 3 405B used AdamW with peak 8e-5, cosine to 8e-7 over 1.2 million steps, weight decay 0.1, and 8,000 warmup steps [18]. Smaller Llama 3 variants used higher peaks of 1.5e-4 (70B) to 3e-4 (8B) with 2,000 warmup steps [18]. The pattern (smaller model, larger peak LR) is consistent across published recipes.

For instruction tuning and supervised fine-tuning of LLMs, peak learning rates drop by an order of magnitude or more, typically 1e-5 to 5e-5. LoRA and QLoRA fine-tuning use higher rates (1e-4 to 5e-4) because only a small adapter is being trained from random initialization while the base weights stay frozen.

## How does the step size connect to scaling laws?

Research on neural [scaling laws](/wiki/scaling_laws) has shown that the optimal learning rate depends on model size, dataset size, and training horizon. The rough rules:

- Wider models prefer smaller peak learning rates. The Maximal Update Parametrization (muP) framework by Yang and Hu provides a principled way to transfer learning rates from a small model to a large one without re-tuning [22].
- For Chinchilla-optimal training (where training tokens equal roughly 20 times parameter count), the cosine schedule must be tuned to match the planned token count [15]. A cosine cycle that is noticeably shorter or longer than the actual run hurts the final loss.
- For longer-than-Chinchilla training, current evidence suggests that warmup-stable-decay schedules (constant peak followed by short linear decay) match or beat cosine while making it easier to extend a run after the fact.
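
A warmup-stable-decay schedule is simple to write down. The sketch below is one plausible form, assuming a linear warmup, a decay phase covering the final 10% of training, and linear decay to zero; the function name `wsd_lr` and these fractions are illustrative choices, not values from a specific paper.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=2000, decay_frac=0.1):
    """Warmup-stable-decay: linear warmup, constant plateau, linear decay to zero."""
    decay_steps = max(1, round(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * step / warmup_steps   # warmup phase
    if step < decay_start:
        return peak_lr                         # stable phase at the peak
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps   # final linear decay

warm = wsd_lr(1_000, 100_000, peak_lr=3e-4)     # halfway through warmup
stable = wsd_lr(50_000, 100_000, peak_lr=3e-4)  # on the plateau
decay = wsd_lr(95_000, 100_000, peak_lr=3e-4)   # halfway through the decay
```

Because the plateau does not depend on `total_steps`, a run can be extended by resuming from a checkpoint taken before the decay phase and choosing a later decay start, which a cosine schedule does not allow without changing the whole curve.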

## Which frameworks implement step-size schedules?

Every major framework provides scheduler classes. The PyTorch lineup, all in `torch.optim.lr_scheduler` [19]:

| Class | Schedule |
|---|---|
| `LambdaLR` | Arbitrary function of step |
| `MultiplicativeLR` | Multiply eta by a returned factor |
| `StepLR` | Cut by gamma every step_size epochs |
| `MultiStepLR` | Cut by gamma at chosen milestones |
| `ExponentialLR` | Multiply by gamma each epoch |
| `CosineAnnealingLR` | Cosine to eta_min over T_max epochs |
| `CosineAnnealingWarmRestarts` | SGDR with restarts |
| `OneCycleLR` | Smith's one-cycle policy |
| `CyclicLR` | Smith's triangular cyclical schedule |
| `LinearLR` | Linear interpolation between two factors |
| `PolynomialLR` | Polynomial decay |
| `ReduceLROnPlateau` | Drop on validation plateau |
| `SequentialLR` | Run several schedulers one after another, switching at chosen milestones |
| `ChainedScheduler` | Apply several schedulers at every step, compounding their effects |

In TensorFlow and Keras, the equivalent classes live under `tf.keras.optimizers.schedules` (for example `CosineDecay`, `PolynomialDecay`, `PiecewiseConstantDecay`). Hugging Face's `transformers.Trainer` exposes high-level scheduler types via the `lr_scheduler_type` argument, with `linear`, `cosine`, `cosine_with_restarts`, `polynomial`, `constant`, and `constant_with_warmup` among the supported choices, plus a `warmup_steps` argument that prepends a linear warmup to the schedules that support it [20]. JAX users typically build schedules from `optax.warmup_cosine_decay_schedule` and similar functions.

A minimal PyTorch example that reproduces the modern LLM recipe:

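```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 16)  # placeholder; substitute your own model

optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_lambda(step):
    """Return the multiplier applied to the base lr (3e-4) at a given step."""
    warmup = 2000
    total = 100_000
    if step < warmup:
        return step / warmup  # linear warmup from 0 to 1
    # Clamp progress so the multiplier stays at the 10% floor past `total` steps.
    progress = min(1.0, (step - warmup) / (total - warmup))
    return 0.1 + 0.9 * 0.5 * (1 + math.cos(math.pi * progress))  # cosine from 1.0 to 0.1

# Call scheduler.step() once after each optimizer.step().
scheduler = LambdaLR(optimizer, lr_lambda)
```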

## What are the practical heuristics for choosing a step size?

A short list of rules that survive most projects:

- Tune eta first. Get warmup, schedule, and peak value right before touching anything else [9].
- Use the LR finder for any new dataset or architecture. It costs less than one full epoch and almost always gets you within a factor of two of the right value [4].
- Always use warmup with adaptive optimizers. A few hundred to a few thousand linear warmup steps fixes most early-training instability [7].
- Cosine usually beats step decay. The single cosine cycle has no extra hyperparameters and tends to generalize better [5].
- Watch the loss curve, not only the number. A clean loss curve is smooth and downward-trending. Jagged means too high; flat means too low.
- When adding GPUs increases the global batch size, scale eta linearly with the batch (Goyal rule) and lengthen warmup proportionally [7].
- For [LLM](/wiki/llm) fine-tuning, start at 2e-5 and adjust by powers of two.
- Save the optimizer and scheduler state, not just model weights, so a run can resume on the right step of the schedule.

## Explain like I'm 5 (ELI5)

Imagine you are walking down a hill blindfolded, trying to find the lowest spot. The step size is how big each of your steps is. If you take huge leaps, you might fly right over the bottom and end up climbing the next hill. If you take tiny shuffles, you will get closer and closer to the bottom but it will take all day. The trick is to pick steps that are big enough to make progress but small enough that you do not jump past the goal. Most people start with bigger steps and slow down as they think they are getting close to the bottom.

## References

1. Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121-2159. [https://jmlr.org/papers/v12/duchi11a.html](https://jmlr.org/papers/v12/duchi11a.html)
2. Hinton, G. (2012). Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning.
3. Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. ICLR 2015. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980).
4. Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks. WACV 2017. [arXiv:1506.01186](https://arxiv.org/abs/1506.01186).
5. Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. ICLR 2017. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983).
6. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762).
7. Goyal, P., Dollar, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., & He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. [arXiv:1706.02677](https://arxiv.org/abs/1706.02677).
8. Smith, L. N., & Topin, N. (2017/2018). Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates. [arXiv:1708.07120](https://arxiv.org/abs/1708.07120).
9. Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay. [arXiv:1803.09820](https://arxiv.org/abs/1803.09820).
10. Shazeer, N., & Stern, M. (2018). Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. ICML 2018. [arXiv:1804.04235](https://arxiv.org/abs/1804.04235).
11. Smith, S. L., & Le, Q. V. (2018). A Bayesian Perspective on Generalization and Stochastic Gradient Descent. ICLR 2018. [arXiv:1710.06451](https://arxiv.org/abs/1710.06451).
12. McCandlish, S., Kaplan, J., Amodei, D., & OpenAI Dota Team (2018). An Empirical Model of Large-Batch Training. [arXiv:1812.06162](https://arxiv.org/abs/1812.06162).
13. Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. ICLR 2019. [arXiv:1711.05101](https://arxiv.org/abs/1711.05101).
14. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS 2020. [arXiv:2005.14165](https://arxiv.org/abs/2005.14165).
15. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). [arXiv:2203.15556](https://arxiv.org/abs/2203.15556).
16. Chen, X., et al. (2023). Symbolic Discovery of Optimization Algorithms (Lion). NeurIPS 2023. [arXiv:2302.06675](https://arxiv.org/abs/2302.06675).
17. Liu, H., Li, Z., Hall, D., Liang, P., & Ma, T. (2023). Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. [arXiv:2305.14342](https://arxiv.org/abs/2305.14342).
18. Grattafiori, A., et al. (2024). The Llama 3 Herd of Models. [arXiv:2407.21783](https://arxiv.org/abs/2407.21783).
19. PyTorch documentation: torch.optim.lr_scheduler. [https://docs.pytorch.org/docs/stable/optim.html](https://docs.pytorch.org/docs/stable/optim.html)
20. Hugging Face Transformers documentation: Optimizer and learning rate schedules. [https://huggingface.co/docs/transformers/main_classes/optimizer_schedules](https://huggingface.co/docs/transformers/main_classes/optimizer_schedules)
21. Bengio, Y. (2012). Practical Recommendations for Gradient-Based Training of Deep Architectures. In Neural Networks: Tricks of the Trade. [arXiv:1206.5533](https://arxiv.org/abs/1206.5533).
22. Yang, G., Hu, E. J., et al. (2022). Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer (muP). [arXiv:2203.03466](https://arxiv.org/abs/2203.03466).

