In machine learning, the learning rate is a hyperparameter that controls how much a model's parameters are adjusted in response to the estimated error each time the weights are updated. Concretely, it is a scalar that multiplies the computed gradient during gradient descent optimization, determining how large a step the optimizer takes when updating weights and biases at each training iteration.
The learning rate (often denoted as η or α) scales the gradient, determining whether the optimizer takes large, aggressive steps or small, cautious ones toward the minimum of the loss function. It is widely considered the single most important hyperparameter in deep learning. As Yoshua Bengio noted, if a practitioner has time to tune only one hyperparameter, it should be the learning rate.
The role of the learning rate is most clearly seen in the parameter update rule for gradient descent. Given a parameter vector θ, a loss function L, and a learning rate α, the basic update rule is:
θ = θ − α · ∇L(θ)
Here, ∇L(θ) is the gradient of the loss with respect to the parameters, which points in the direction of steepest ascent. The negative sign ensures the update moves toward lower loss. The learning rate α scales the magnitude of this step.
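The update rule can be demonstrated on a toy one-dimensional problem. The quadratic loss and the specific learning rates below are illustrative only:

```python
# Gradient descent on the toy loss L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimum is at theta = 3.

def grad(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, lr, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta <- theta - alpha * dL/dtheta
    return theta

# A moderate learning rate converges to the minimum. A rate that is too
# large (for this particular loss, anything above 1.0) makes each step
# overshoot by more than the current error, so the iterates diverge.
converged = gradient_descent(theta0=0.0, lr=0.1, steps=100)
diverged = gradient_descent(theta0=0.0, lr=1.5, steps=20)
```

The divergence threshold of 1.0 here is specific to this loss's curvature; in a real network the safe range depends on the loss landscape and must be found empirically.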
For stochastic gradient descent (SGD) with momentum, the update becomes:
v = β · v + α · ∇L(θ)
θ = θ − v
where v is the velocity (accumulated gradient) and β is the momentum coefficient (typically 0.9). The learning rate still controls the overall magnitude of the updates, but momentum smooths the trajectory by incorporating information from previous gradients.
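The momentum update can be sketched on a toy quadratic loss, following the exact form of the equations above (hyperparameter values are arbitrary examples):

```python
# SGD with momentum, following the update rule in the text:
#   v <- beta * v + alpha * grad;   theta <- theta - v
# Demonstrated on the toy loss (theta - 3)^2.

def sgd_momentum(theta0, lr, beta, steps):
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2
        v = beta * v + lr * g     # velocity accumulates past gradients
        theta = theta - v
    return theta

theta = sgd_momentum(theta0=0.0, lr=0.05, beta=0.9, steps=300)
```

Note that some libraries use the alternative convention v = β·v + ∇L(θ), θ = θ − α·v; the two differ only in how the learning rate is folded into the velocity.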
The learning rate directly determines the speed and quality of convergence. Choosing the wrong value can cause training to fail entirely or produce a suboptimal model.
When the learning rate is too high, the parameter updates overshoot the minimum: the loss oscillates, fails to settle, or diverges outright, sometimes producing NaN values.
When the learning rate is too low, the parameter updates are very small: training crawls, requires far more iterations to converge, and the optimizer may stall in a poor region of the loss landscape.
| Learning Rate | Training Speed | Convergence | Risk |
|---|---|---|---|
| Too High | Fast initially | Unstable or diverges | Loss explosion, NaN values |
| Too Low | Very slow | Stable but may stall | Trapped in poor local minima |
| Well-Tuned | Moderate | Converges to good minimum | Requires careful selection |
A fixed learning rate is rarely optimal for the entire training process. Learning rate schedulers adjust the learning rate during training according to a predefined rule. The general intuition is that a larger learning rate is useful early in training for fast progress, while a smaller rate is beneficial later for fine-grained convergence.
| Schedule | Formula / Description | Behavior | When to Use |
|---|---|---|---|
| Constant | lr = lr₀ | No change throughout training | Baselines; short training runs |
| Step Decay | lr = lr₀ · γ^(floor(epoch / step_size)) | Drops by factor γ every step_size epochs | Standard CNN training (e.g., ResNet) |
| Exponential Decay | lr = lr₀ · γ^epoch | Smooth exponential decrease each epoch | When gradual reduction is preferred |
| Cosine Annealing | lr = lr_min + 0.5·(lr₀ − lr_min)·(1 + cos(π·t/T)) | Follows cosine curve from lr₀ down to lr_min | Transformer training, modern vision models |
| Warmup + Cosine | Linear increase for first N steps, then cosine decay | Starts low, rises, then smoothly decreases | Large language models, pre-training |
| Cyclical (Smith 2017) | Oscillates between lr_min and lr_max | Repeatedly increases and decreases | When exploring multiple local minima |
| One-Cycle (Smith and Topin 2018) | One cycle: warmup to peak, then annealing to near zero | Single large cycle with momentum changes | Fast convergence; super-convergence |
| Reduce on Plateau | Reduces lr by factor when metric stops improving for N epochs | Reactive; adapts to training dynamics | When the right schedule is unknown |
| Polynomial Decay | lr = (lr₀ − lr_end) · (1 − t/T)^power + lr_end | Decays according to polynomial function | BERT-style pre-training |
| Linear Decay | lr = lr₀ · (1 − t/T) | Straight line decrease to zero | GPT-style pre-training |
| WSD (Warmup-Stable-Decay) | Linear warmup, constant phase, then rapid decay | Three distinct phases | Modern LLM pre-training (MiniCPM, etc.) |
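Two of the schedules from the table can be written as plain functions of the epoch or step, directly from their formulas (lr₀, γ, and the horizon values below are hypothetical examples):

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """lr = lr0 * gamma ** floor(epoch / step_size)"""
    return lr0 * gamma ** (epoch // step_size)

def cosine_annealing(lr0, lr_min, t, T):
    """lr = lr_min + 0.5 * (lr0 - lr_min) * (1 + cos(pi * t / T))"""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))

# Step decay drops by gamma every step_size epochs; cosine annealing
# sweeps smoothly from lr0 at t = 0 down to lr_min at t = T.
```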
Cosine annealing, proposed by Loshchilov and Hutter (2017) in their paper "SGDR: Stochastic Gradient Descent with Warm Restarts," decays the learning rate following a cosine curve. Starting at the initial learning rate, it gradually decreases to a minimum value. The rate of decrease is slow at first, faster in the middle, and slow again near the end.
A variant called cosine annealing with warm restarts (SGDR) periodically resets the learning rate back to its initial value and begins a new cosine decay cycle. Each restart allows the optimizer to potentially escape local minima and explore new regions of the loss landscape. The cycle length can be kept constant or increased after each restart.
Cosine annealing has become the default scheduler for many modern architectures, including vision transformers and large language models.
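The restart mechanism can be sketched by applying the cosine formula within each cycle. This simplified version keeps the cycle length constant, whereas SGDR also supports lengthening cycles after each restart:

```python
import math

def sgdr(lr0, lr_min, step, cycle_len):
    """Cosine annealing with warm restarts: the learning rate resets to
    lr0 at the start of every cycle (constant cycle length here)."""
    t = step % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```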
Leslie N. Smith introduced cyclical learning rates (CLR) in a 2017 paper presented at the IEEE Winter Conference on Applications of Computer Vision (WACV). Instead of monotonically decreasing the learning rate, CLR oscillates it between a minimum and maximum bound in triangular or exponentially decaying patterns.
The motivation is that periodically increasing the learning rate helps the optimizer traverse saddle points and escape sharp minima that generalize poorly. Smith observed that this approach often converges faster than fixed or monotonically decaying schedules.
Smith and Topin (2018) extended the cyclical approach into the one-cycle policy for "super-convergence." The schedule consists of a single cycle: the learning rate warms up from a low value to a high peak over the first portion of training (often 30 to 40 percent of total steps), then decays back down to a value much lower than the starting point. Simultaneously, momentum follows an inverse schedule, decreasing when the learning rate rises and increasing when it falls.
The one-cycle policy enables training with learning rates 10x to 20x larger than conventional schedules, allowing training to converge in far fewer epochs. Smith reported that ResNet-56 could be trained on CIFAR-10 in roughly 10 percent of the usual number of iterations.
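The shape of the policy (omitting the momentum half) can be sketched as linear warmup followed by cosine annealing. The warmup fraction and the start/end divisors below are illustrative defaults, not the exact values from the paper:

```python
import math

def one_cycle(step, total_steps, lr_max, pct_warmup=0.3,
              lr_start_div=25.0, lr_end_div=1e4):
    """One-cycle sketch: linear warmup to lr_max over the first
    pct_warmup of training, then cosine annealing far below the start."""
    warmup_steps = int(total_steps * pct_warmup)
    lr_start = lr_max / lr_start_div
    lr_end = lr_max / lr_end_div
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_start + frac * (lr_max - lr_start)   # linear warmup
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_max - lr_end) * (1 + math.cos(math.pi * frac))
```

In the full policy, momentum would follow the inverse trajectory: high at the start, low at the peak learning rate, and high again at the end.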
Learning rate warmup starts training with a very small learning rate and gradually increases it to the target value over a specified number of steps or epochs. The increase is usually linear, though other schedules (e.g., exponential) are sometimes used.
Warmup is important for several reasons:

- At initialization the weights are essentially random, so early gradients are large and poorly scaled; a full-size learning rate at this stage can push the network into a bad region or cause immediate divergence.
- Adaptive optimizers such as Adam compute their moment estimates from very few samples during the first steps, so the early adaptive scaling is unreliable; small initial updates limit the damage from noisy statistics.
- Large-batch training with the linear scaling rule is unstable without warmup (Goyal et al., 2017).
Modern large-scale training pipelines almost universally use warmup. The original Transformer paper (Vaswani et al., 2017) used a warmup of 4,000 steps followed by inverse square root decay. BERT used linear warmup followed by linear decay. GPT models use linear warmup followed by cosine decay.
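The original Transformer schedule can be written directly from the paper's formula; the defaults below match the values reported by Vaswani et al.:

```python
# lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
# Linear warmup for `warmup` steps, then inverse square-root decay.

def transformer_lr(step, d_model=512, warmup=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` meet exactly at `step == warmup`, which is where the learning rate peaks.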
Smith (2017) also proposed the learning rate range test (commonly called the "LR finder"), a practical method for identifying good learning rate bounds before training begins.
The procedure works as follows:

1. Start training with a very small learning rate (e.g., 1e-7).
2. After each batch, increase the learning rate, typically exponentially, toward a large maximum value.
3. Record the training loss at each step.
4. Stop once the loss clearly diverges, then plot loss against learning rate.
The resulting plot typically shows the loss decreasing as the learning rate increases from very small values, reaching a minimum, and then sharply increasing as the learning rate becomes too large. The optimal learning rate is typically selected from the region where the loss is decreasing most steeply, usually about one order of magnitude below the learning rate at which the loss is minimized.
This technique is implemented in popular libraries such as PyTorch Lightning (via Tuner.lr_find()) and fast.ai (via lr_find()).
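The sweep can be illustrated on a toy quadratic "model"; in practice the same loop runs over real batches of training data, and the lr bounds below are arbitrary example values:

```python
# Minimal LR range test: sweep the learning rate exponentially upward,
# take one step per value, and record the loss at each step.

def lr_range_test(lr_min=1e-5, lr_max=10.0, num_steps=100):
    theta = 0.0                       # toy 1-D parameter, true optimum at 3
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    history = []
    for i in range(num_steps):
        lr = lr_min * ratio ** i      # exponentially increasing lr
        g = 2.0 * (theta - 3.0)
        theta = theta - lr * g
        loss = (theta - 3.0) ** 2
        history.append((lr, loss))
        if loss > 1e6:                # stop once the loss clearly diverges
            break
    return history

history = lr_range_test()
losses = [loss for _, loss in history]
```

Plotting `history` reproduces the characteristic curve described above: loss falls, bottoms out, then explodes once the learning rate becomes too large.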
To address the challenge of manually setting and scheduling learning rates, adaptive learning rate methods have been developed. These optimizers maintain per-parameter learning rates that are adjusted automatically based on the history of gradients for each parameter.
| Optimizer | Key Idea | Year | Reference |
|---|---|---|---|
| AdaGrad | Scales learning rate inversely by the sum of squared past gradients; large updates for rare features | 2011 | Duchi et al. |
| Adadelta | Fixes AdaGrad's decaying learning rate by using a window of past gradients | 2012 | Zeiler |
| RMSProp | Uses exponential moving average of squared gradients instead of cumulative sum | 2012 | Hinton (unpublished lecture) |
| Adam | Combines momentum (first moment) with RMSProp (second moment); includes bias correction | 2015 | Kingma and Ba |
| AdamW | Decouples weight decay from the adaptive gradient update | 2019 | Loshchilov and Hutter |
| LAMB | Layer-wise adaptive learning rates for large batch training | 2020 | You et al. |
| Adafactor | Memory-efficient Adam variant using factored second-moment estimates | 2018 | Shazeer and Stern |
AdaGrad (Duchi et al., 2011) was the first widely used adaptive method. It maintains a per-parameter sum of squared gradients and uses this to scale the learning rate. Parameters that receive large, frequent gradients get smaller learning rates, while parameters with small, infrequent gradients get larger ones. This is particularly useful for sparse data. The downside is that the accumulated squared gradients grow monotonically, causing the learning rate to eventually become vanishingly small.
RMSProp, proposed by Geoffrey Hinton in an unpublished lecture, addresses this by replacing the cumulative sum with an exponentially weighted moving average. This prevents the learning rate from shrinking to zero over time.
Adam (Kingma and Ba, 2015) combines the benefits of momentum (which tracks an exponential moving average of the gradient itself) with RMSProp's adaptive scaling. It also includes bias correction terms that account for the fact that the moving averages are initialized at zero. Adam has become the default optimizer for many deep learning tasks.
AdamW (Loshchilov and Hutter, 2019) corrects a subtle issue with Adam's handling of L2 regularization. In standard Adam, L2 regularization is added to the gradient before the adaptive scaling, which means the regularization effect is scaled differently for different parameters. AdamW applies weight decay directly to the weights, separate from the gradient update. This decoupled approach produces better generalization and is now the standard optimizer for training transformers and large language models.
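The decoupling is easiest to see in a single-parameter sketch of the update; the hyperparameter values are the common defaults, and the toy minimization loop is illustrative only:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    theta = theta - lr * weight_decay * theta  # decoupled decay (AdamW)
    return theta, m, v

# Minimize the toy loss (theta - 3)^2 with weight_decay = 0 (plain Adam).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)
    theta, m, v = adamw_step(theta, g, m, v, t, lr=0.01)
```

In standard Adam with L2 regularization, the λ·θ term would instead be added to `grad` before the moment updates, and would therefore be rescaled by the per-parameter √v̂ factor, which is exactly the coupling AdamW removes.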
The learning rate and batch size are closely linked hyperparameters. When the batch size increases, the gradient estimate becomes less noisy because it is averaged over more samples. This reduced noise allows the optimizer to take larger steps without risking divergence.
Goyal et al. (2017) formalized this observation as the linear scaling rule: when the minibatch size is multiplied by a factor k, the learning rate should also be multiplied by k. Using this rule, they trained ResNet-50 on ImageNet with batches of 8,192 images in one hour, scaling the base learning rate from 0.1 (at batch size 256) to 3.2 (at batch size 8,192).
However, the linear scaling rule has limits. At very large batch sizes (beyond roughly 8,000 to 16,000 for ImageNet), simply scaling the learning rate linearly causes instability, especially during the early phases of training when the network is changing rapidly. Warmup is essential to make large-batch training work. Some researchers have also proposed a square root scaling rule, where the learning rate scales by √k rather than k, which can be more stable at extreme batch sizes.
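Both scaling rules reduce to one line of arithmetic; the helper below is a sketch, with the ImageNet numbers from the text as a check:

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate tuned at base_batch to a new batch size."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k              # linear scaling (Goyal et al., 2017)
    elif rule == "sqrt":
        return base_lr * math.sqrt(k)   # more conservative at extreme sizes
    raise ValueError(rule)

# Reproduces the example from the text: lr 0.1 at batch 256
# scales linearly to 3.2 at batch 8,192 (k = 32).
```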
For adaptive optimizers like Adam, the relationship is less straightforward because the optimizer already adjusts per-parameter learning rates based on gradient statistics. In practice, many practitioners still increase the learning rate when increasing the batch size, but the optimal scaling factor may differ from the linear rule.
Weight decay and the learning rate interact in ways that can be subtle and counterintuitive, especially with adaptive optimizers.
In standard SGD, L2 regularization (adding (λ/2)·||θ||² to the loss) is mathematically equivalent to weight decay (subtracting λ·θ from the weights at each step) when the two are related by the learning rate. This equivalence breaks down for adaptive optimizers like Adam. Because Adam scales the gradient by per-parameter second-moment estimates, adding L2 regularization to the loss results in the regularization term being scaled differently for each parameter. This means that parameters with large historical gradients receive less regularization than intended.
Loshchilov and Hutter (2019) showed that decoupling weight decay from the gradient-based update (as in AdamW) restores proper regularization behavior and, critically, makes the optimal learning rate and weight decay factor more independent of each other. With standard Adam and L2 regularization, changing the learning rate requires re-tuning the regularization strength. With AdamW, the two hyperparameters can be tuned more independently, which simplifies hyperparameter search.
A common default configuration for AdamW in transformer training is a learning rate in the range 1e-4 to 1e-3 paired with a weight decay of 0.1.
One of the most challenging aspects of large language model training is that the optimal learning rate changes with model size. A learning rate that works well for a 125M parameter model may not work at all for a 7B parameter model. Since hyperparameter sweeps at the scale of billions of parameters are prohibitively expensive, researchers have developed methods for transferring optimal learning rates from small models to large ones.
Yang et al. (2022) proposed the Maximal Update Parameterization (muP), which modifies the standard parameterization of neural networks so that the optimal hyperparameters (including the learning rate) remain stable across different model widths. The key insight is that in standard parameterization, the magnitude of weight updates changes as the model width changes, which shifts the optimal learning rate. muP rescales the initialization, learning rate, and forward pass for each layer so that the dynamics of hidden representations remain consistent regardless of width.
In practice, muP enables the following workflow:

1. Train a small proxy model and sweep the learning rate (and other transferable hyperparameters) cheaply.
2. Transfer the best values to the full-size model, which is parameterized with muP.
3. Train the large model directly, without a hyperparameter sweep at the target scale.
Dey et al. (2023) at Cerebras demonstrated this in practice: they tuned a 111M parameter model, transferred the hyperparameters to a 3B parameter model, and achieved performance comparable to contemporary 7B models while using 3.3x less compute.
Recent research has identified important caveats. The scaling rules of muP rely on assumptions about the geometric alignment of a layer's inputs with its weights and gradient updates. These assumptions hold primarily at the start of training. For the remainder of training, weight decay appears to be more important than muP for stabilizing update dynamics across widths. In addition, certain hyperparameters like weight decay and dropout rates do not transfer under muP and still need to be tuned for the target model size.
| Aspect | Standard Parameterization | muP |
|---|---|---|
| LR Transfer Across Widths | Does not transfer | Approximately transfers |
| Key Assumption | None | Update dynamics stable across widths |
| What Transfers | Nothing reliably | Learning rate, some optimizer params |
| What Does NOT Transfer | Everything | Weight decay, dropout |
| Practical Use | Tune at each scale | Tune small proxy, transfer to large |
Even without muP, practitioners have accumulated empirical knowledge about what learning rates work at different scales. The following table summarizes commonly used peak learning rates for AdamW across different model sizes in LLM pre-training:
| Model Size | Typical Peak Learning Rate | Warmup Steps | Schedule | Examples |
|---|---|---|---|---|
| 125M-350M | 6e-4 to 1e-3 | 500-2,000 | Cosine to 10% of peak | GPT-2 Small, OLMo proxy models |
| 1B-3B | 3e-4 to 6e-4 | 1,000-2,000 | Cosine to 10% of peak | TinyLlama, Cerebras-GPT |
| 7B-13B | 1e-4 to 3e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA, LLaMA 2, Mistral |
| 30B-70B | 1e-4 to 1.5e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA 65B, LLaMA 2 70B |
| 175B+ | 6e-5 to 1.2e-4 | 2,000-5,000 | Cosine | GPT-3 175B |
The general trend is clear: as models get larger, the peak learning rate decreases. For a given learning rate, larger models accumulate larger aggregate updates and tend to be more sensitive to instability, especially early in training, so smaller steps are needed to keep optimization stable.
In fine-tuning scenarios, it is often beneficial to apply different learning rates to different layers of the model. This technique is known as discriminative fine-tuning or differential learning rates.
The idea was popularized by Howard and Ruder (2018) in their ULMFiT paper. The intuition is that lower layers of a pre-trained model capture general, transferable features (such as basic language patterns or low-level visual features), while upper layers encode more task-specific representations. During fine-tuning, lower layers need smaller learning rates to preserve their general knowledge, while upper layers benefit from larger rates to adapt to the new task.
In practice, a common approach is to divide the model into groups of layers and assign each group a learning rate that is a fraction (e.g., 1/2.6) of the rate used for the group above it. For instance, if the top layers use a learning rate of 1e-3, the middle layers might use 3.8e-4, and the bottom layers might use 1.5e-4.
This technique has proven especially effective for transfer learning and fine-tuning large pre-trained models. Fast.ai's library implements discriminative learning rates as a first-class feature, and many practitioners use the approach when fine-tuning BERT, GPT, and other pre-trained models on downstream tasks. The concept is related to but distinct from layer-wise adaptive rate scaling (LARS) and LAMB, which automatically compute per-layer learning rate multipliers based on the ratio of weight norms to gradient norms.
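Computing the per-group rates is straightforward; the sketch below uses the ULMFiT factor of 2.6, and the group names are hypothetical:

```python
# Discriminative fine-tuning: each layer group gets the learning rate of
# the group above it divided by a decay factor.

def discriminative_lrs(groups, top_lr, factor=2.6):
    """Return {group_name: lr}, with `groups` ordered bottom -> top."""
    lrs = {}
    for depth, name in enumerate(reversed(groups)):  # top group first
        lrs[name] = top_lr / factor ** depth
    return lrs

lrs = discriminative_lrs(["embeddings", "encoder", "classifier"], top_lr=1e-3)
# classifier gets 1e-3, encoder ~3.8e-4, embeddings ~1.5e-4
```

In a framework such as PyTorch, the resulting rates would typically be passed to the optimizer as per-parameter groups rather than as a single dict.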
Imagine you are playing a game where you have to find a toy hidden somewhere in a dark room. You can only move by taking steps. The learning rate is how big your steps are. If you take really huge steps, you might walk right past the toy and keep going back and forth, never finding it. If you take tiny little baby steps, it will take you forever to get there. The learning rate is about finding the right step size so you reach your toy quickly without stepping over it.
In machine learning, the "toy" is the best answer the computer is looking for, and the "steps" are the changes the computer makes to get better at its job. A good learning rate helps it get better quickly without making wild, confusing changes.