# Learning Rate

> Source: https://aiwiki.ai/wiki/learning_rate
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Training & Optimization
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

The **learning rate** is a [hyperparameter](/wiki/hyperparameter) in [machine learning](/wiki/machine_learning) that controls how much a model's parameters change in response to the estimated error each time the weights are updated. It is the scalar value (often written η or α) that multiplies the gradient during [gradient descent](/wiki/gradient_descent), so it decides whether the [optimizer](/wiki/optimizer) takes large, aggressive steps or small, cautious ones toward the minimum of the [loss function](/wiki/loss_function). The learning rate is widely regarded as the single most important hyperparameter in deep learning: too high a value makes training diverge, while too low a value makes it crawl or stall.

This importance is stated directly in the literature. Yoshua Bengio's 2012 guide to training deep architectures says the learning rate "is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2)" [13]. As a quick rule of thumb for adaptive optimizers, Andrej Karpathy popularized the half-joking "Karpathy constant" when he wrote that "3e-4 is the best learning rate for [Adam](/wiki/adam_optimizer), hands down" [14], a value still widely used as a safe starting point.

## Mathematical formulation

The role of the learning rate is most clearly seen in the parameter update rule for gradient descent. Given a parameter vector θ, a loss function L, and a learning rate α, the basic update rule is:

**θ = θ − α · ∇L(θ)**

Here, ∇L(θ) is the gradient of the loss with respect to the parameters, which points in the direction of steepest ascent. The negative sign ensures the update moves toward lower loss. The learning rate α scales the magnitude of this step.

For [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD) with [momentum](/wiki/momentum), the update becomes:

**v = β · v + α · ∇L(θ)**

**θ = θ − v**

where v is the velocity (accumulated gradient) and β is the momentum coefficient (typically 0.9). The learning rate still controls the overall magnitude of the updates, but momentum smooths the trajectory by incorporating information from previous gradients.
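The two update rules above can be sketched in plain Python. The toy loss L(θ) = θ², the starting point, and the function names are illustrative assumptions, not part of any library.

```python
def grad(theta):
    """Gradient of the toy loss L(theta) = theta**2 (an illustrative choice)."""
    return 2.0 * theta


def gd_step(theta, lr):
    """Plain gradient descent: theta = theta - lr * grad."""
    return theta - lr * grad(theta)


def momentum_step(theta, v, lr, beta=0.9):
    """SGD with momentum: v = beta * v + lr * grad, then theta = theta - v."""
    v = beta * v + lr * grad(theta)
    return theta - v, v


theta_gd = 5.0
theta_m, v = 5.0, 0.0
for _ in range(200):
    theta_gd = gd_step(theta_gd, lr=0.05)
    theta_m, v = momentum_step(theta_m, v, lr=0.05)

print(f"plain GD: {theta_gd:.2e}, momentum: {theta_m:.2e}")
```

Both variants approach the minimum at θ = 0. On this one-dimensional toy problem the comparison says nothing about which is faster in practice.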

## How does the learning rate affect model training?

The learning rate directly determines the speed and quality of [convergence](/wiki/convergence). Choosing the wrong value can cause training to fail entirely or produce a suboptimal model.

### What happens if the learning rate is too high?

When the learning rate is too high, the parameter updates are too large:

- **Overshooting.** The optimizer jumps past the loss minimum and lands on the other side, potentially at a higher loss value, causing oscillation that prevents convergence.
- **Divergence.** Each update makes the loss worse, so it grows without bound; the weights explode to very large values and eventually produce NaN values.
- **Instability.** A high learning rate produces erratic training curves with large fluctuations in loss from step to step.

### What happens if the learning rate is too low?

When the learning rate is too low, the parameter updates are very small:

- **Slow convergence.** Training takes an impractical number of iterations to reach a good solution.
- **Getting stuck in local minima.** Small steps make it difficult to escape shallow local minima or saddle points.
- **Poor exploration.** The optimizer stays close to its starting point and may never explore regions where better solutions exist.

### Summary of effects

| Learning Rate | Training Speed | Convergence | Risk |
|---|---|---|---|
| Too High | Fast initially | Unstable or diverges | Loss explosion, NaN values |
| Too Low | Very slow | Stable but may stall | Trapped in poor local minima |
| Well-Tuned | Moderate | Converges to good minimum | Requires careful selection |
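The three regimes in the table can be reproduced on a toy problem. The sketch below minimizes L(θ) = θ² with a fixed learning rate; the loss and the three learning rate values are assumptions chosen to make each regime visible.

```python
def run_gd(lr, steps=50, theta=1.0):
    """Minimize L(theta) = theta**2 with a fixed learning rate; return the final loss."""
    for _ in range(steps):
        theta -= lr * 2.0 * theta  # the gradient of theta**2 is 2 * theta
    return theta ** 2


for name, lr in [("too high", 1.1), ("too low", 0.001), ("well-tuned", 0.3)]:
    print(f"{name:>10} (lr={lr}): final loss = {run_gd(lr):.3e}")
```

For this loss, each step multiplies θ by (1 − 2·lr), so any learning rate above 1.0 makes the iterate grow instead of shrink.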

## Learning rate schedules

A fixed learning rate is rarely optimal for the entire training process. Learning rate schedulers adjust the learning rate during training according to a predefined rule. The general intuition is that a larger learning rate is useful early in training for fast progress, while a smaller rate is beneficial later for fine-grained convergence.

| Schedule | Formula / Description | Behavior | When to Use |
|---|---|---|---|
| Constant | lr = lr₀ | No change throughout training | Baselines; short training runs |
| Step Decay | lr = lr₀ · γ^(floor(epoch / step_size)) | Drops by factor γ every step_size epochs | Standard CNN training (e.g., ResNet) |
| Exponential Decay | lr = lr₀ · γ^epoch | Smooth exponential decrease each epoch | When gradual reduction is preferred |
| Cosine Annealing | lr = lr_min + 0.5·(lr₀ − lr_min)·(1 + cos(π·t/T)) | Follows cosine curve from lr₀ down to lr_min | [Transformer](/wiki/transformer) training, modern vision models |
| Warmup + Cosine | Linear increase for first N steps, then cosine decay | Starts low, rises, then smoothly decreases | Large language models, pre-training |
| Cyclical (Smith 2017) | Oscillates between lr_min and lr_max | Repeatedly increases and decreases | When exploring multiple local minima |
| One-Cycle (Smith and Topin 2018) | One cycle: warmup to peak, then annealing to near zero | Single large cycle with momentum changes | Fast convergence; super-convergence |
| Reduce on Plateau | Reduces lr by factor when metric stops improving for N epochs | Reactive; adapts to training dynamics | When the right schedule is unknown |
| Polynomial Decay | lr = (lr₀ − lr_end) · (1 − t/T)^power + lr_end | Decays according to polynomial function | BERT-style pre-training |
| Linear Decay | lr = lr₀ · (1 − t/T) | Straight line decrease to zero | BERT-style pre-training and fine-tuning |
| WSD (Warmup-Stable-Decay) | Linear warmup, constant phase, then rapid decay | Three distinct phases | Modern LLM pre-training (MiniCPM, etc.) |
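The closed-form schedules in the table translate directly into small functions. This is a minimal sketch: the function names and default values are illustrative, and the linear shape of the final WSD decay is an assumption, since the table only specifies a rapid decay.

```python
import math


def step_decay(lr0, epoch, step_size, gamma=0.1):
    """Multiply the rate by gamma every step_size epochs."""
    return lr0 * gamma ** (epoch // step_size)


def exponential_decay(lr0, epoch, gamma=0.95):
    """Multiply the rate by gamma every epoch."""
    return lr0 * gamma ** epoch


def cosine_annealing(lr0, t, T, lr_min=0.0):
    """Follow half a cosine wave from lr0 (t = 0) down to lr_min (t = T)."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))


def polynomial_decay(lr0, t, T, lr_end=0.0, power=1.0):
    """Decay from lr0 to lr_end; power = 1 gives a straight line."""
    return (lr0 - lr_end) * (1 - t / T) ** power + lr_end


def linear_decay(lr0, t, T):
    """Straight line from lr0 down to zero."""
    return lr0 * (1 - t / T)


def wsd(lr_peak, t, T, warmup_steps, decay_start, lr_end=0.0):
    """Warmup-Stable-Decay with a linear final decay (decay shape is an assumption)."""
    if t < warmup_steps:
        return lr_peak * t / warmup_steps
    if t < decay_start:
        return lr_peak
    return lr_peak + (lr_end - lr_peak) * (t - decay_start) / (T - decay_start)
```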

### Cosine annealing

Cosine annealing, proposed by Loshchilov and Hutter (2017) in their paper "SGDR: Stochastic Gradient Descent with Warm Restarts," decays the learning rate following a cosine curve [3]. Starting at the initial learning rate, it gradually decreases to a minimum value. The rate of decrease is slow at first, faster in the middle, and slow again near the end.

A variant called cosine annealing with warm restarts (SGDR) periodically resets the learning rate back to its initial value and begins a new cosine decay cycle [3]. Each restart allows the optimizer to potentially escape local minima and explore new regions of the loss landscape. The cycle length can be kept constant or increased after each restart.
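A minimal sketch of the warm-restart variant follows. The function name and the `cycle_len` and `cycle_mult` parameters are illustrative assumptions; `cycle_mult` controls whether each cycle stays the same length or grows after a restart.

```python
import math


def sgdr_lr(step, lr_max, lr_min=0.0, cycle_len=100, cycle_mult=1):
    """Cosine annealing with warm restarts.

    The rate starts each cycle at lr_max and decays toward lr_min.
    Each new cycle is cycle_mult times longer than the previous one.
    """
    t, length = step, cycle_len
    while t >= length:  # locate the cycle that contains this step
        t -= length
        length *= cycle_mult
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / length))
```

With `cycle_len=100` and `cycle_mult=2`, restarts fall at steps 100 and 300, since the cycles last 100, 200, and then 400 steps.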

Cosine annealing has become the default scheduler for many modern architectures, including vision transformers and large language models.

### Cyclical learning rates

Leslie N. Smith introduced cyclical learning rates (CLR) in a 2017 paper presented at the IEEE Winter Conference on Applications of Computer Vision (WACV) [1]. Instead of monotonically decreasing the learning rate, CLR oscillates it between a minimum and maximum bound in triangular or exponentially decaying patterns.

The motivation is that periodically increasing the learning rate helps the optimizer traverse saddle points and escape sharp minima that generalize poorly. Smith observed that this approach often converges faster than fixed or monotonically decaying schedules [1].
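The triangular pattern can be sketched as a function of the iteration count. Here `step_size` is the number of iterations in half a cycle; the function name is illustrative.

```python
import math


def triangular_clr(iteration, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate.

    The rate rises linearly from base_lr to max_lr over step_size
    iterations, then falls back over the next step_size iterations.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

With `step_size=1000`, the rate reaches `max_lr` at iterations 1000, 3000, 5000, and so on, and returns to `base_lr` at every even multiple of 1000.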

### One-cycle policy

Smith and Topin (2018) extended the cyclical approach into the one-cycle policy for "super-convergence" [2]. The schedule consists of a single cycle: the learning rate warms up from a low value to a high peak over the first portion of training (often 30 to 40 percent of total steps), then decays back down to a value much lower than the starting point. Simultaneously, momentum follows an inverse schedule, decreasing when the learning rate rises and increasing when it falls.

The one-cycle policy enables training with learning rates 10x to 20x larger than conventional schedules, allowing training to converge in far fewer epochs. Smith and Topin reported that ResNet-56 could be trained on CIFAR-10 in roughly 10 percent of the usual number of iterations [2].
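A one-cycle schedule with linear ramps can be sketched as below. The parameter names and default values (a 30 percent warmup, a starting rate of peak/25, momentum between 0.85 and 0.95) are illustrative assumptions rather than values prescribed by [2].

```python
def one_cycle(step, total_steps, lr_max, pct_start=0.3,
              div_factor=25.0, final_div_factor=1e4,
              mom_min=0.85, mom_max=0.95):
    """Return (learning_rate, momentum) at `step` for a linear one-cycle schedule."""
    lr_start = lr_max / div_factor         # low starting rate
    lr_end = lr_start / final_div_factor   # final rate, far below the start
    peak_step = pct_start * total_steps
    if step <= peak_step:
        # Warmup phase: learning rate rises while momentum falls.
        frac = step / peak_step
        lr = lr_start + frac * (lr_max - lr_start)
        mom = mom_max - frac * (mom_max - mom_min)
    else:
        # Annealing phase: learning rate falls while momentum rises.
        frac = (step - peak_step) / (total_steps - peak_step)
        lr = lr_max + frac * (lr_end - lr_max)
        mom = mom_min + frac * (mom_max - mom_min)
    return lr, mom
```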

## Why do large models use learning rate warmup?

Learning rate warmup starts training with a very small learning rate and gradually increases it to the target value over a specified number of steps or epochs. The increase is usually linear, though other schedules (e.g., exponential) are sometimes used.

Warmup is important for several reasons:

- **Stabilizing early training.** At initialization, the model's weights are random, and the gradients can be noisy and unreliable. Large updates based on these early gradients can push the model into bad regions of the parameter space from which it cannot recover.
- **Adaptive optimizer initialization.** Optimizers like [Adam](/wiki/adam_optimizer) maintain running estimates of gradient statistics. These estimates are unreliable at the start of training because they have seen very few gradients. A low initial learning rate prevents large updates based on these noisy estimates.
- **Large [batch size](/wiki/batch_size) training.** Goyal et al. (2017) showed that warmup is needed when training with large batch sizes [4]. Without warmup, the large learning rate prescribed by the linear scaling rule can make training unstable or divergent in its early stages.

Modern large-scale training pipelines almost universally use warmup. The original Transformer paper (Vaswani et al., 2017) used a warmup of 4,000 steps followed by inverse square root decay [7]. BERT used linear warmup followed by linear decay. GPT models use linear warmup followed by cosine decay.
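The Transformer schedule from [7] combines linear warmup and inverse square root decay in one expression. The sketch below uses the paper's formula; the function name and the clamp at step 1 are illustrative additions.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for warmup_steps, then decay proportional to 1 / sqrt(step)."""
    step = max(step, 1)  # guard against division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = transformer_lr(4000)
print(f"peak learning rate at step 4000: {peak:.2e}")
```

The two arguments of `min` are equal at `step == warmup_steps`, which is where the rate peaks.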

## How do you find a good learning rate? (the LR finder)

Smith (2017) also proposed the **learning rate range test** (commonly called the "LR finder"), a practical method for identifying good learning rate bounds before training begins [1].

The procedure works as follows:

1. Start with a very small learning rate (e.g., 1e-7).
2. Train the model for one epoch (or a few hundred iterations), gradually increasing the learning rate after each mini-batch, typically on a logarithmic scale up to a large value (e.g., 10).
3. Record the loss at each learning rate.
4. Plot loss vs. learning rate on a log scale.

The resulting plot typically shows the loss decreasing as the learning rate increases from very small values, reaching a minimum, and then sharply increasing as the learning rate becomes too large. A good learning rate is typically chosen from the region where the loss is decreasing most steeply, usually about one order of magnitude below the learning rate at which the loss is minimized [12].
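The procedure can be sketched on a toy problem. The code below is a simplified, deterministic illustration: the quadratic loss, the stopping threshold, and all names are assumptions, and a real range test runs on mini-batches of a real model and usually smooths the recorded loss.

```python
def lr_range_test(loss_and_grad, theta, lr_min=1e-5, lr_max=10.0, num_steps=100):
    """Increase the learning rate on a log scale, recording (lr, loss) pairs."""
    history = []
    best = float("inf")
    for i in range(num_steps):
        lr = lr_min * (lr_max / lr_min) ** (i / (num_steps - 1))
        _, g = loss_and_grad(theta)
        theta -= lr * g                   # one training step at this rate
        loss, _ = loss_and_grad(theta)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:               # stop once the loss blows up
            break
    return history


def toy_loss_and_grad(theta):
    """Toy loss 0.5 * theta**2 and its gradient (illustrative only)."""
    return 0.5 * theta ** 2, theta


history = lr_range_test(toy_loss_and_grad, theta=1.0)
lr_at_min = min(history, key=lambda pair: pair[1])[0]
suggested = lr_at_min / 10  # one order of magnitude below the minimum-loss rate
print(f"loss minimized near lr={lr_at_min:.3g}; suggested lr={suggested:.3g}")
```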

This technique is implemented in popular libraries such as PyTorch Lightning (via `Tuner.lr_find()`) and fast.ai (via `lr_find()`).

## Adaptive learning rate optimizers

To address the challenge of manually setting and scheduling learning rates, adaptive learning rate methods have been developed. These optimizers maintain per-parameter learning rates that are adjusted automatically based on the history of gradients for each parameter.

| Optimizer | Key Idea | Year | Reference |
|---|---|---|---|
| AdaGrad | Scales learning rate inversely by the square root of the sum of squared past gradients; larger updates for rare features | 2011 | Duchi et al. |
| Adadelta | Fixes AdaGrad's decaying learning rate by using a window of past gradients | 2012 | Zeiler |
| RMSProp | Uses exponential moving average of squared gradients instead of cumulative sum | 2012 | Hinton (unpublished lecture) |
| Adam | Combines momentum (first moment) with RMSProp (second moment); includes bias correction | 2015 | Kingma and Ba |
| AdamW | Decouples weight decay from the adaptive gradient update | 2019 | Loshchilov and Hutter |
| Adafactor | Memory-efficient Adam variant using factored second-moment estimates | 2018 | Shazeer and Stern |
| LAMB | Layer-wise adaptive learning rates for large batch training | 2020 | You et al. |

[AdaGrad](/wiki/adagrad) (Duchi et al., 2011) was the first widely used adaptive method [8]. It maintains a per-parameter sum of squared gradients and divides the learning rate by the square root of this sum. Parameters that receive large, frequent gradients get smaller learning rates, while parameters with small, infrequent gradients get larger ones. This is particularly useful for sparse data. The downside is that the accumulated squared gradients grow monotonically, causing the learning rate to eventually become vanishingly small [8].

RMSProp, proposed by Geoffrey Hinton in an unpublished lecture, addresses this by replacing the cumulative sum with an exponentially weighted moving average. This prevents the learning rate from shrinking to zero over time.
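The difference between the two accumulators shows up in the multiplier each applies to the base learning rate. The sketch below feeds a constant stream of gradients to both; the function names, the decay value of 0.9, and the constant gradient stream are illustrative assumptions.

```python
import math


def adagrad_scale(grads, eps=1e-8):
    """Step multiplier 1 / sqrt(sum of squared gradients) after each gradient."""
    total, scales = 0.0, []
    for g in grads:
        total += g * g
        scales.append(1.0 / (math.sqrt(total) + eps))
    return scales


def rmsprop_scale(grads, decay=0.9, eps=1e-8):
    """Step multiplier 1 / sqrt(moving average of squared gradients)."""
    avg, scales = 0.0, []
    for g in grads:
        avg = decay * avg + (1 - decay) * g * g
        scales.append(1.0 / (math.sqrt(avg) + eps))
    return scales


grads = [1.0] * 1000
ada = adagrad_scale(grads)
rms = rmsprop_scale(grads)
print(f"after 1000 steps: AdaGrad x{ada[-1]:.3f}, RMSProp x{rms[-1]:.3f}")
```

With constant gradients, the AdaGrad multiplier keeps shrinking as 1/√t, while the RMSProp multiplier settles near a fixed value.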

Adam (Kingma and Ba, 2015) combines the benefits of momentum (which tracks an exponential moving average of the gradient itself) with RMSProp's adaptive scaling. It also includes bias correction terms that account for the fact that the moving averages are initialized at zero [5]. Adam has become the default optimizer for many deep learning tasks.

AdamW (Loshchilov and Hutter, 2019) corrects a subtle issue with Adam's handling of L2 regularization [6]. In standard Adam, L2 regularization is added to the gradient before the adaptive scaling, which means the regularization effect is scaled differently for different parameters. AdamW applies weight decay directly to the weights, separate from the gradient update. This decoupled approach produces better generalization and is now the standard optimizer for training transformers and large language models.
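A single Adam update on one scalar parameter can be written out as below, as a minimal sketch. The function name is illustrative, and setting `weight_decay` above zero applies the decoupled, AdamW-style decay directly to the weight rather than adding it to the gradient.

```python
import math


def adam_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update on a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * weight_decay * theta   # decoupled weight decay (AdamW)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = adam_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(f"step taken: {1.0 - theta:.2e}")
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the step size is close to the learning rate regardless of the gradient's magnitude.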

## How are the learning rate and batch size related?

The learning rate and batch size are closely linked hyperparameters. When the batch size increases, the gradient estimate becomes less noisy because it is averaged over more samples. This reduced noise allows the optimizer to take larger steps without risking divergence.

Goyal et al. (2017) formalized this observation as the **linear scaling rule**: when the minibatch size is multiplied by a factor k, the learning rate should also be multiplied by k. Using this rule, they trained ResNet-50 on ImageNet with batches of 8,192 images in one hour on 256 GPUs, scaling the base learning rate from 0.1 (at batch size 256) to 3.2 (at batch size 8,192), and achieved roughly 90 percent scaling efficiency when moving from 8 to 256 GPUs while matching small-batch accuracy [4].

However, the linear scaling rule has limits. At very large batch sizes (beyond roughly 8,000 to 16,000 for ImageNet), simply scaling the learning rate linearly causes instability, especially during the early phases of training when the network is changing rapidly. Warmup is essential to make large-batch training work. Some researchers have also proposed a square root scaling rule, where the learning rate scales by √k rather than k, which can be more stable at extreme batch sizes.

For adaptive optimizers like Adam, the relationship is less straightforward because the optimizer already adjusts per-parameter learning rates based on gradient statistics. In practice, many practitioners still increase the learning rate when increasing the batch size, but the optimal scaling factor may differ from the linear rule.
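The two scaling rules discussed in this section amount to a one-line calculation. The helper below is an illustrative sketch (the function name and `rule` argument are assumptions) and reproduces the 0.1 to 3.2 scaling reported by Goyal et al. [4].

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate when the batch size changes by k = new / base."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k
    if rule == "sqrt":
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")


print(scale_lr(0.1, 256, 8192))               # linear rule, k = 32
print(scale_lr(0.1, 256, 8192, rule="sqrt"))  # square root rule
```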

## Weight decay and learning rate interaction

Weight decay and the learning rate interact in ways that can be subtle and counterintuitive, especially with adaptive optimizers.

In standard SGD, L2 regularization (adding (λ′/2)·||θ||² to the loss) is mathematically equivalent to weight decay (subtracting λ·θ from the weights at each step) when the two coefficients are related through the learning rate α by λ′ = λ/α [6]. This equivalence breaks down for adaptive optimizers like Adam. Because Adam scales the gradient by per-parameter second-moment estimates, adding L2 regularization to the loss results in the regularization term being scaled differently for each parameter. This means that parameters with large historical gradients receive less regularization than intended.
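The SGD equivalence described at the start of this section can be checked numerically. In this sketch the L2 penalty is written as (l2 / 2)·θ², so its gradient is l2·θ, and the two updates coincide when the L2 coefficient equals the decay factor divided by the learning rate. The function names and numbers are illustrative.

```python
def sgd_l2_step(theta, grad, lr, l2):
    """SGD on loss + (l2 / 2) * theta**2: the penalty adds l2 * theta to the gradient."""
    return theta - lr * (grad + l2 * theta)


def sgd_weight_decay_step(theta, grad, lr, wd):
    """SGD with weight decay: shrink the weight by a factor (1 - wd), then step."""
    return (1 - wd) * theta - lr * grad


lr, wd = 0.1, 0.01
theta, grad = 2.0, 0.5
with_l2 = sgd_l2_step(theta, grad, lr, l2=wd / lr)
with_wd = sgd_weight_decay_step(theta, grad, lr, wd)
print(with_l2, with_wd)
```

Because the matching L2 coefficient depends on the learning rate, changing the learning rate under L2 regularization also changes the effective decay.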

Loshchilov and Hutter (2019) showed that decoupling weight decay from the gradient-based update (as in AdamW) restores proper regularization behavior and, critically, makes the optimal learning rate and weight decay factor more independent of each other. With standard Adam and L2 regularization, changing the learning rate requires re-tuning the regularization strength. With AdamW, the two hyperparameters can be tuned more independently, which simplifies hyperparameter search [6].

A common default configuration for AdamW in transformer training is a learning rate in the range 1e-4 to 1e-3 paired with a weight decay of 0.1.

## How is the learning rate transferred across model sizes? (muP)

One of the most challenging aspects of large language model training is that the optimal learning rate changes with model size. A learning rate that works well for a 125M parameter model may not work at all for a 7B parameter model. Since hyperparameter sweeps at the scale of billions of parameters are prohibitively expensive, researchers have developed methods for transferring optimal learning rates from small models to large ones.

### Maximal Update Parameterization (muP)

Yang et al. (2022) proposed the **Maximal Update Parameterization** (muP), which modifies the standard parameterization of neural networks so that the optimal hyperparameters (including the learning rate) remain stable across different model widths. The key insight is that in standard parameterization, the magnitude of weight updates changes as the model width changes, which shifts the optimal learning rate. muP rescales the initialization, learning rate, and forward pass for each layer so that the dynamics of hidden representations remain consistent regardless of width [9].

The associated tuning procedure, which the authors call muTransfer, is to "parametrize the target model in muP, tune the HP indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all" [9]. In practice, muP enables the following workflow:

1. Train a small "proxy" model (e.g., 125M parameters) with a learning rate sweep to find the optimal learning rate.
2. Transfer the optimal learning rate directly to a much larger model (e.g., 7B parameters).
3. The transferred learning rate should be near-optimal without further tuning.
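One ingredient of this workflow can be sketched in code. For Adam, muP as described in [9] scales the learning rate of hidden weight matrices in proportion to 1/width, so a rate tuned at a small base width is divided by the width ratio for the larger model. This is only a partial, hedged illustration: the function name is hypothetical, and the full parameterization also changes initialization and output scaling, which the sketch omits.

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Adam learning rate for hidden weight matrices under a 1 / width scaling.

    base_lr is the rate tuned on the proxy model of width base_width.
    """
    return base_lr * base_width / width


proxy_lr = 1e-3  # hypothetical rate found by a sweep at width 256
print(mup_hidden_lr(proxy_lr, base_width=256, width=4096))
```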

The payoff can be large. Yang et al. report that "by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost" [9]. Dey et al. (2024) at Cerebras demonstrated the same principle in production: they tuned a 111M parameter model, transferred the hyperparameters to a 3B parameter model, and achieved performance comparable to contemporary 7B models while using 3.3x less compute [10].

### Limitations of muP

Recent research has identified important caveats. The scaling rules of muP rely on assumptions about the geometric alignment of a layer's inputs with its weights and gradient updates. These assumptions hold primarily at the start of training. For the remainder of training, weight decay appears to be more important than muP for stabilizing update dynamics across widths. In addition, certain hyperparameters like weight decay and dropout rates do not transfer under muP and still need to be tuned for the target model size.

| Aspect | Standard Parameterization | muP |
|---|---|---|
| LR Transfer Across Widths | Does not transfer | Approximately transfers |
| Key Assumption | None | Layer inputs stay aligned with weights and updates across widths |
| What Transfers | Nothing reliably | Learning rate, some optimizer params |
| What Does NOT Transfer | Everything | Weight decay, dropout |
| Practical Use | Tune at each scale | Tune small proxy, transfer to large |

## What learning rate should you use for different model sizes?

Even without muP, practitioners have accumulated empirical knowledge about what learning rates work at different scales. The following table summarizes commonly used peak learning rates for AdamW across different model sizes in LLM pre-training:

| Model Size | Typical Peak Learning Rate | Warmup Steps | Schedule | Examples |
|---|---|---|---|---|
| 125M-350M | 6e-4 to 1e-3 | 500-2,000 | Cosine to 10% of peak | GPT-2 Small, OLMo proxy models |
| 1B-3B | 3e-4 to 6e-4 | 1,000-2,000 | Cosine to 10% of peak | TinyLlama, Cerebras-GPT |
| 7B-13B | 1e-4 to 3e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA, LLaMA 2, Mistral |
| 30B-70B | 1e-4 to 1.5e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA 65B, LLaMA 2 70B |
| 175B+ | 6e-5 to 1.2e-4 | 2,000-5,000 | Cosine | GPT-3 175B |

The general trend is clear: as models get larger, the peak learning rate decreases. For reference, [GPT-3](/wiki/gpt_3) 175B was trained with a peak learning rate of 6e-5, the smallest in the GPT-3 family, consistent with the last row of the table. One informal explanation for this trend is that larger models have more parameters, and each parameter receives gradient contributions from more neurons. The signal from each individual gradient contribution is smaller relative to the noise, so a smaller learning rate is needed to avoid instability.

## Discriminative and differential learning rates

In [fine-tuning](/wiki/fine_tuning) scenarios, it is often beneficial to apply different learning rates to different layers of the model. This technique is known as discriminative fine-tuning or differential learning rates.

The idea was popularized by Howard and Ruder (2018) in their ULMFiT paper [11]. The intuition is that lower layers of a pre-trained model capture general, transferable features (such as basic language patterns or low-level visual features), while upper layers encode more task-specific representations. During fine-tuning, lower layers need smaller learning rates to preserve their general knowledge, while upper layers benefit from larger rates to adapt to the new task.

In practice, a common approach is to divide the model into groups of layers and assign each group a learning rate that is a fraction (e.g., 1/2.6) of the rate used for the group above it. For instance, if the top layers use a learning rate of 1e-3, the middle layers might use 3.8e-4, and the bottom layers might use 1.5e-4.
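The per-group rates in the example above follow from repeated division by the chosen factor. The helper below is an illustrative sketch; the function name is an assumption.

```python
def discriminative_lrs(top_lr, num_groups, factor=2.6):
    """Learning rates ordered from the top layer group down to the bottom group."""
    return [top_lr / factor ** i for i in range(num_groups)]


for name, lr in zip(["top", "middle", "bottom"], discriminative_lrs(1e-3, 3)):
    print(f"{name:>6}: {lr:.2e}")
```

In frameworks that support optimizer parameter groups, each of these values would typically be assigned to the group holding the corresponding layers.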

This technique has proven especially effective for transfer learning and fine-tuning large pre-trained models. Fast.ai's library implements discriminative learning rates as a first-class feature, and many practitioners use the approach when fine-tuning BERT, GPT, and other pre-trained models on downstream tasks. The concept is related to but distinct from layer-wise adaptive rate scaling (LARS) and LAMB, which automatically compute per-layer learning rate multipliers based on the ratio of weight norms to gradient norms.

## Practical guidelines

### Pre-training a language model

1. Choose a peak learning rate based on model size (see the table above).
2. Use linear warmup for the first 1 to 5 percent of total training steps.
3. Apply cosine decay to approximately 10 percent of the peak learning rate.
4. Use AdamW with beta1=0.9, beta2=0.95, and weight decay of 0.1.
5. If using muP, tune the learning rate on a small proxy model first.
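Steps 2 and 3 of this recipe can be combined into one schedule function. This is a minimal sketch: the function name, the 2 percent warmup default, and the choice to start warmup at a small nonzero rate are assumptions.

```python
import math


def warmup_cosine(step, total_steps, peak_lr, warmup_frac=0.02, final_frac=0.1):
    """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    min_lr = final_frac * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for s in (0, 200, 5_000, 10_000):
    print(s, f"{warmup_cosine(s, total_steps=10_000, peak_lr=3e-4):.2e}")
```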

### Fine-tuning a pre-trained model

1. Start with a learning rate 10x to 100x smaller than the pre-training peak (e.g., 1e-5 to 5e-5 for a 7B model).
2. Use a short warmup (3 to 10 percent of fine-tuning steps).
3. Apply cosine or linear decay.
4. Consider discriminative learning rates, with lower rates for early layers.
5. For LoRA or other parameter-efficient methods, learning rates of 1e-4 to 2e-4 often work well.

### Training a CNN from scratch

1. Use SGD with momentum of 0.9 and a learning rate of 0.1.
2. Apply step decay (divide by 10 at 30, 60, and 90 percent of training) or cosine annealing.
3. Use the LR finder to validate the chosen range.

### General tips

- **Start with standard defaults.** 3e-4 for Adam/AdamW on most tasks, 0.1 for SGD with momentum.
- **Use the LR finder.** Run a learning rate range test before committing to a schedule.
- **Always use warmup.** Especially for transformers and large batch training. One to five percent of total training steps is a common warmup duration.
- **Match scheduler to task.** Cosine annealing for pre-training, reduce-on-plateau for fine-tuning when the optimal number of epochs is uncertain, step decay for classical CNNs.
- **Scale with batch size.** When increasing the batch size by a factor of k, consider increasing the learning rate by √k or k (linear scaling), combined with warmup.
- **Monitor training curves.** If the loss oscillates wildly, the learning rate may be too high. If the loss decreases very slowly, it may be too low. If the loss plateaus, consider reducing the learning rate or switching to a schedule with decay.
- **Avoid [overfitting](/wiki/overfitting) with proper decay.** Reducing the learning rate toward the end of training helps the model settle into a flatter minimum, which tends to generalize better.

## Explain like I'm 5 (ELI5)

Imagine you are playing a game where you have to find a toy hidden somewhere in a dark room. You can only move by taking steps. The learning rate is how big your steps are. If you take really huge steps, you might walk right past the toy and keep going back and forth, never finding it. If you take tiny little baby steps, it will take you forever to get there. The learning rate is about finding the right step size so you reach your toy quickly without stepping over it.

In machine learning, the "toy" is the best answer the computer is looking for, and the "steps" are the changes the computer makes to get better at its job. A good learning rate helps it get better quickly without making wild, confusing changes.

## References

1. Smith, L.N. (2017). "Cyclical Learning Rates for Training Neural Networks." *IEEE Winter Conference on Applications of Computer Vision (WACV)*. [arXiv:1506.01186](https://arxiv.org/abs/1506.01186)
2. Smith, L.N. and Topin, N. (2018). "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates." *Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications*. [arXiv:1708.07120](https://arxiv.org/abs/1708.07120)
3. Loshchilov, I. and Hutter, F. (2017). "SGDR: Stochastic Gradient Descent with Warm Restarts." *International Conference on Learning Representations (ICLR)*. [arXiv:1608.03983](https://arxiv.org/abs/1608.03983)
4. Goyal, P., Dollár, P., Girshick, R., et al. (2017). "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour." *arXiv preprint*. [arXiv:1706.02677](https://arxiv.org/abs/1706.02677)
5. Kingma, D.P. and Ba, J. (2015). "Adam: A Method for Stochastic Optimization." *International Conference on Learning Representations (ICLR)*. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980)
6. Loshchilov, I. and Hutter, F. (2019). "Decoupled Weight Decay Regularization." *International Conference on Learning Representations (ICLR)*. [arXiv:1711.05101](https://arxiv.org/abs/1711.05101)
7. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)
8. Duchi, J., Hazan, E., and Singer, Y. (2011). "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." *Journal of Machine Learning Research*, 12, 2121-2159. [JMLR](https://www.jmlr.org/papers/v12/duchi11a.html)
9. Yang, G., Hu, E.J., Babuschkin, I., et al. (2022). "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer." *arXiv preprint*. [arXiv:2203.03466](https://arxiv.org/abs/2203.03466)
10. Dey, N., et al. (2024). "The Practitioner's Guide to the Maximal Update Parameterization." *Cerebras Blog*. [Link](https://www.cerebras.ai/blog/the-practitioners-guide-to-the-maximal-update-parameterization)
11. Howard, J. and Ruder, S. (2018). "Universal Language Model Fine-tuning for Text Classification." *Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)*. [arXiv:1801.06146](https://arxiv.org/abs/1801.06146)
12. Smith, L.N. (2018). "A Disciplined Approach to Neural Network Hyper-Parameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay." *arXiv preprint*. [arXiv:1803.09820](https://arxiv.org/abs/1803.09820)
13. Bengio, Y. (2012). "Practical Recommendations for Gradient-Based Training of Deep Architectures." *Neural Networks: Tricks of the Trade (2nd ed.)*, Springer. [arXiv:1206.5533](https://arxiv.org/abs/1206.5533)
14. Karpathy, A. (2016). "3e-4 is the best learning rate for Adam, hands down." Post on X (Twitter), November 23, 2016. [Link](https://x.com/karpathy/status/801621764144971776)