In machine learning, the learning rate is a hyperparameter that controls how much a model's parameters are adjusted in response to the estimated error each time the weights are updated. Concretely, it is a scalar that multiplies the computed gradient during gradient descent optimization, determining how large a step the optimizer takes when updating weights and biases at each training iteration.
The learning rate (often denoted as η or α) scales the gradient, determining whether the optimizer takes large, aggressive steps or small, cautious ones toward the minimum of the loss function. It is widely considered the single most important hyperparameter in deep learning. As Yoshua Bengio noted, if a practitioner has time to tune only one hyperparameter, it should be the learning rate.
The role of the learning rate is most clearly seen in the parameter update rule for gradient descent. Given a parameter vector θ, a loss function L, and a learning rate α, the basic update rule is:
θ = θ − α · ∇L(θ)
Here, ∇L(θ) is the gradient of the loss with respect to the parameters, which points in the direction of steepest ascent. The negative sign ensures the update moves toward lower loss. The learning rate α scales the magnitude of this step.
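The update rule can be demonstrated on a toy one-dimensional problem. The quadratic loss and the specific learning rates below are illustrative only:

```python
# Gradient descent on the toy loss L(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3); the minimum is at theta = 3.

def grad(theta):
    return 2.0 * (theta - 3.0)

def gradient_descent(theta0, lr, steps):
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)  # theta <- theta - alpha * dL/dtheta
    return theta

# A moderate learning rate converges to the minimum. A rate that is too
# large (for this particular loss, anything above 1.0) makes each step
# overshoot by more than the current error, so the iterates diverge.
converged = gradient_descent(theta0=0.0, lr=0.1, steps=100)
diverged = gradient_descent(theta0=0.0, lr=1.5, steps=20)
```

The divergence threshold of 1.0 here is specific to this loss's curvature; in a real network the safe range depends on the loss landscape and must be found empirically.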
For stochastic gradient descent (SGD) with momentum, the update becomes:
v = β · v + α · ∇L(θ)
θ = θ − v
where v is the velocity (accumulated gradient) and β is the momentum coefficient (typically 0.9). The learning rate still controls the overall magnitude of the updates, but momentum smooths the trajectory by incorporating information from previous gradients.
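The momentum update can be sketched on a toy quadratic loss, following the exact form of the equations above (hyperparameter values are arbitrary examples):

```python
# SGD with momentum, following the update rule in the text:
#   v <- beta * v + alpha * grad;   theta <- theta - v
# Demonstrated on the toy loss (theta - 3)^2.

def sgd_momentum(theta0, lr, beta, steps):
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2
        v = beta * v + lr * g     # velocity accumulates past gradients
        theta = theta - v
    return theta

theta = sgd_momentum(theta0=0.0, lr=0.05, beta=0.9, steps=300)
```

Note that some libraries use the alternative convention v = β·v + ∇L(θ), θ = θ − α·v; the two differ only in how the learning rate is folded into the velocity.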
The learning rate directly determines the speed and quality of convergence. Choosing the wrong value can cause training to fail entirely or produce a suboptimal model.
When the learning rate is too high, the parameter updates overshoot the minimum: the loss oscillates, fails to settle, or diverges outright, sometimes producing NaN values.
When the learning rate is too low, the parameter updates are very small: training crawls, requires far more iterations to converge, and the optimizer may stall in a poor region of the loss landscape.
| Learning Rate | Training Speed | Convergence | Risk |
|---|---|---|---|
| Too High | Fast initially | Unstable or diverges | Loss explosion, NaN values |
| Too Low | Very slow | Stable but may stall | Trapped in poor local minima |
| Well-Tuned | Moderate | Converges to good minimum | Requires careful selection |
A fixed learning rate is rarely optimal for the entire training process. Learning rate schedulers adjust the learning rate during training according to a predefined rule. The general intuition is that a larger learning rate is useful early in training for fast progress, while a smaller rate is beneficial later for fine-grained convergence.
| Schedule | Formula / Description | Behavior | When to Use |
|---|---|---|---|
| Constant | lr = lr₀ | No change throughout training | Baselines; short training runs |
| Step Decay | lr = lr₀ · γ^(floor(epoch / step_size)) | Drops by factor γ every step_size epochs | Standard CNN training (e.g., ResNet) |
| Exponential Decay | lr = lr₀ · γ^epoch | Smooth exponential decrease each epoch | When gradual reduction is preferred |
| Cosine Annealing | lr = lr_min + 0.5·(lr₀ − lr_min)·(1 + cos(π·t/T)) | Follows cosine curve from lr₀ down to lr_min | Transformer training, modern vision models |
| Warmup + Cosine | Linear increase for first N steps, then cosine decay | Starts low, rises, then smoothly decreases | Large language models, pre-training |
| Cyclical (Smith 2017) | Oscillates between lr_min and lr_max | Repeatedly increases and decreases | When exploring multiple local minima |
| One-Cycle (Smith and Topin 2018) | One cycle: warmup to peak, then annealing to near zero | Single large cycle with momentum changes | Fast convergence; super-convergence |
| Reduce on Plateau | Reduces lr by factor when metric stops improving for N epochs | Reactive; adapts to training dynamics | When the right schedule is unknown |
| Polynomial Decay | lr = (lr₀ − lr_end) · (1 − t/T)^power + lr_end | Decays according to polynomial function | BERT-style pre-training |
| Linear Decay | lr = lr₀ · (1 − t/T) | Straight line decrease to zero | GPT-style pre-training |
| WSD (Warmup-Stable-Decay) | Linear warmup, constant phase, then rapid decay | Three distinct phases | Modern LLM pre-training (MiniCPM, etc.) |
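Two of the schedules from the table can be written as plain functions of the epoch or step, directly from their formulas (lr₀, γ, and the horizon values below are hypothetical examples):

```python
import math

def step_decay(lr0, gamma, step_size, epoch):
    """lr = lr0 * gamma ** floor(epoch / step_size)"""
    return lr0 * gamma ** (epoch // step_size)

def cosine_annealing(lr0, lr_min, t, T):
    """lr = lr_min + 0.5 * (lr0 - lr_min) * (1 + cos(pi * t / T))"""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))

# Step decay drops by gamma every step_size epochs; cosine annealing
# sweeps smoothly from lr0 at t = 0 down to lr_min at t = T.
```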
Cosine annealing, proposed by Loshchilov and Hutter (2017) in their paper "SGDR: Stochastic Gradient Descent with Warm Restarts," decays the learning rate following a cosine curve. Starting at the initial learning rate, it gradually decreases to a minimum value. The rate of decrease is slow at first, faster in the middle, and slow again near the end.
A variant called cosine annealing with warm restarts (SGDR) periodically resets the learning rate back to its initial value and begins a new cosine decay cycle. Each restart allows the optimizer to potentially escape local minima and explore new regions of the loss landscape. The cycle length can be kept constant or increased after each restart.
Cosine annealing has become the default scheduler for many modern architectures, including vision transformers and large language models.
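The restart mechanism can be sketched by applying the cosine formula within each cycle. This simplified version keeps the cycle length constant, whereas SGDR also supports lengthening cycles after each restart:

```python
import math

def sgdr(lr0, lr_min, step, cycle_len):
    """Cosine annealing with warm restarts: the learning rate resets to
    lr0 at the start of every cycle (constant cycle length here)."""
    t = step % cycle_len  # position within the current cycle
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / cycle_len))
```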
Leslie N. Smith introduced cyclical learning rates (CLR) in a 2017 paper presented at the IEEE Winter Conference on Applications of Computer Vision (WACV). Instead of monotonically decreasing the learning rate, CLR oscillates it between a minimum and maximum bound in triangular or exponentially decaying patterns.
The motivation is that periodically increasing the learning rate helps the optimizer traverse saddle points and escape sharp minima that generalize poorly. Smith observed that this approach often converges faster than fixed or monotonically decaying schedules.
Smith and Topin (2018) extended the cyclical approach into the one-cycle policy for "super-convergence." The schedule consists of a single cycle: the learning rate warms up from a low value to a high peak over the first portion of training (often 30 to 40 percent of total steps), then decays back down to a value much lower than the starting point. Simultaneously, momentum follows an inverse schedule, decreasing when the learning rate rises and increasing when it falls.
The one-cycle policy enables training with learning rates 10x to 20x larger than conventional schedules, allowing training to converge in far fewer epochs. Smith reported that ResNet-56 could be trained on CIFAR-10 in roughly 10 percent of the usual number of iterations.
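The shape of the policy (omitting the momentum half) can be sketched as linear warmup followed by cosine annealing. The warmup fraction and the start/end divisors below are illustrative defaults, not the exact values from the paper:

```python
import math

def one_cycle(step, total_steps, lr_max, pct_warmup=0.3,
              lr_start_div=25.0, lr_end_div=1e4):
    """One-cycle sketch: linear warmup to lr_max over the first
    pct_warmup of training, then cosine annealing far below the start."""
    warmup_steps = int(total_steps * pct_warmup)
    lr_start = lr_max / lr_start_div
    lr_end = lr_max / lr_end_div
    if step < warmup_steps:
        frac = step / warmup_steps
        return lr_start + frac * (lr_max - lr_start)   # linear warmup
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_max - lr_end) * (1 + math.cos(math.pi * frac))
```

In the full policy, momentum would follow the inverse trajectory: high at the start, low at the peak learning rate, and high again at the end.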
Learning rate warmup starts training with a very small learning rate and gradually increases it to the target value over a specified number of steps or epochs. The increase is usually linear, though other schedules (e.g., exponential) are sometimes used.
Warmup is important for several reasons:

- At initialization the weights are essentially random, so early gradients are large and poorly scaled; a full-size learning rate at this stage can push the network into a bad region or cause immediate divergence.
- Adaptive optimizers such as Adam compute their moment estimates from very few samples during the first steps, so the early adaptive scaling is unreliable; small initial updates limit the damage from noisy statistics.
- Large-batch training with the linear scaling rule is unstable without warmup (Goyal et al., 2017).
Modern large-scale training pipelines almost universally use warmup. The original Transformer paper (Vaswani et al., 2017) used a warmup of 4,000 steps followed by inverse square root decay. BERT used linear warmup followed by linear decay. GPT models use linear warmup followed by cosine decay.
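The original Transformer schedule can be written directly from the paper's formula; the defaults below match the values reported by Vaswani et al.:

```python
# lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
# Linear warmup for `warmup` steps, then inverse square-root decay.

def transformer_lr(step, d_model=512, warmup=4000):
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two branches of the `min` meet exactly at `step == warmup`, which is where the learning rate peaks.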
Smith (2017) also proposed the learning rate range test (commonly called the "LR finder"), a practical method for identifying good learning rate bounds before training begins.
The procedure works as follows:

1. Start training with a very small learning rate (e.g., 1e-7).
2. After each batch, increase the learning rate, typically exponentially, toward a large maximum value.
3. Record the training loss at each step.
4. Stop once the loss clearly diverges, then plot loss against learning rate.
The resulting plot typically shows the loss decreasing as the learning rate increases from very small values, reaching a minimum, and then sharply increasing as the learning rate becomes too large. The optimal learning rate is typically selected from the region where the loss is decreasing most steeply, usually about one order of magnitude below the learning rate at which the loss is minimized.
This technique is implemented in popular libraries such as PyTorch Lightning (via Tuner.lr_find()) and fast.ai (via lr_find()).
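The sweep can be illustrated on a toy quadratic "model"; in practice the same loop runs over real batches of training data, and the lr bounds below are arbitrary example values:

```python
# Minimal LR range test: sweep the learning rate exponentially upward,
# take one step per value, and record the loss at each step.

def lr_range_test(lr_min=1e-5, lr_max=10.0, num_steps=100):
    theta = 0.0                       # toy 1-D parameter, true optimum at 3
    ratio = (lr_max / lr_min) ** (1.0 / (num_steps - 1))
    history = []
    for i in range(num_steps):
        lr = lr_min * ratio ** i      # exponentially increasing lr
        g = 2.0 * (theta - 3.0)
        theta = theta - lr * g
        loss = (theta - 3.0) ** 2
        history.append((lr, loss))
        if loss > 1e6:                # stop once the loss clearly diverges
            break
    return history

history = lr_range_test()
losses = [loss for _, loss in history]
```

Plotting `history` reproduces the characteristic curve described above: loss falls, bottoms out, then explodes once the learning rate becomes too large.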
To address the challenge of manually setting and scheduling learning rates, adaptive learning rate methods have been developed. These optimizers maintain per-parameter learning rates that are adjusted automatically based on the history of gradients for each parameter.
| Optimizer | Key Idea | Year | Reference |
|---|---|---|---|
| AdaGrad | Scales learning rate inversely by the sum of squared past gradients; large updates for rare features | 2011 | Duchi et al. |
| Adadelta | Fixes AdaGrad's decaying learning rate by using a window of past gradients | 2012 | Zeiler |
| RMSProp | Uses exponential moving average of squared gradients instead of cumulative sum | 2012 | Hinton (unpublished lecture) |
| Adam | Combines momentum (first moment) with RMSProp (second moment); includes bias correction | 2015 | Kingma and Ba |
| AdamW | Decouples weight decay from the adaptive gradient update | 2019 | Loshchilov and Hutter |
| LAMB | Layer-wise adaptive learning rates for large batch training | 2020 | You et al. |
| Adafactor | Memory-efficient Adam variant using factored second-moment estimates | 2018 | Shazeer and Stern |
AdaGrad (Duchi et al., 2011) was the first widely used adaptive method. It maintains a per-parameter sum of squared gradients and uses this to scale the learning rate. Parameters that receive large, frequent gradients get smaller learning rates, while parameters with small, infrequent gradients get larger ones. This is particularly useful for sparse data. The downside is that the accumulated squared gradients grow monotonically, causing the learning rate to eventually become vanishingly small.
RMSProp, proposed by Geoffrey Hinton in an unpublished lecture, addresses this by replacing the cumulative sum with an exponentially weighted moving average. This prevents the learning rate from shrinking to zero over time.
Adam (Kingma and Ba, 2015) combines the benefits of momentum (which tracks an exponential moving average of the gradient itself) with RMSProp's adaptive scaling. It also includes bias correction terms that account for the fact that the moving averages are initialized at zero. Adam has become the default optimizer for many deep learning tasks.
AdamW (Loshchilov and Hutter, 2019) corrects a subtle issue with Adam's handling of L2 regularization. In standard Adam, L2 regularization is added to the gradient before the adaptive scaling, which means the regularization effect is scaled differently for different parameters. AdamW applies weight decay directly to the weights, separate from the gradient update. This decoupled approach produces better generalization and is now the standard optimizer for training transformers and large language models.
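The decoupling is easiest to see in a single-parameter sketch of the update; the hyperparameter values are the common defaults, and the toy minimization loop is illustrative only:

```python
import math

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    theta = theta - lr * weight_decay * theta  # decoupled decay (AdamW)
    return theta, m, v

# Minimize the toy loss (theta - 3)^2 with weight_decay = 0 (plain Adam).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (theta - 3.0)
    theta, m, v = adamw_step(theta, g, m, v, t, lr=0.01)
```

In standard Adam with L2 regularization, the λ·θ term would instead be added to `grad` before the moment updates, and would therefore be rescaled by the per-parameter √v̂ factor, which is exactly the coupling AdamW removes.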
The learning rate and batch size are closely linked hyperparameters. When the batch size increases, the gradient estimate becomes less noisy because it is averaged over more samples. This reduced noise allows the optimizer to take larger steps without risking divergence.
Goyal et al. (2017) formalized this observation as the linear scaling rule: when the minibatch size is multiplied by a factor k, the learning rate should also be multiplied by k. Using this rule, they trained ResNet-50 on ImageNet with batches of 8,192 images in one hour, scaling the base learning rate from 0.1 (at batch size 256) to 3.2 (at batch size 8,192).
However, the linear scaling rule has limits. At very large batch sizes (beyond roughly 8,000 to 16,000 for ImageNet), simply scaling the learning rate linearly causes instability, especially during the early phases of training when the network is changing rapidly. Warmup is essential to make large-batch training work. Some researchers have also proposed a square root scaling rule, where the learning rate scales by √k rather than k, which can be more stable at extreme batch sizes.
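Both scaling rules reduce to one line of arithmetic; the helper below is a sketch, with the ImageNet numbers from the text as a check:

```python
import math

def scaled_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Scale a learning rate tuned at base_batch to a new batch size."""
    k = new_batch / base_batch
    if rule == "linear":
        return base_lr * k              # linear scaling (Goyal et al., 2017)
    elif rule == "sqrt":
        return base_lr * math.sqrt(k)   # more conservative at extreme sizes
    raise ValueError(rule)

# Reproduces the example from the text: lr 0.1 at batch 256
# scales linearly to 3.2 at batch 8,192 (k = 32).
```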
For adaptive optimizers like Adam, the relationship is less straightforward because the optimizer already adjusts per-parameter learning rates based on gradient statistics. In practice, many practitioners still increase the learning rate when increasing the batch size, but the optimal scaling factor may differ from the linear rule.
Weight decay and the learning rate interact in ways that can be subtle and counterintuitive, especially with adaptive optimizers.
In standard SGD, L2 regularization (adding (λ/2)·||θ||² to the loss) is mathematically equivalent to weight decay (subtracting λ·θ from the weights at each step) when the two are related by the learning rate. This equivalence breaks down for adaptive optimizers like Adam. Because Adam scales the gradient by per-parameter second-moment estimates, adding L2 regularization to the loss results in the regularization term being scaled differently for each parameter. This means that parameters with large historical gradients receive less regularization than intended.
Loshchilov and Hutter (2019) showed that decoupling weight decay from the gradient-based update (as in AdamW) restores proper regularization behavior and, critically, makes the optimal learning rate and weight decay factor more independent of each other. With standard Adam and L2 regularization, changing the learning rate requires re-tuning the regularization strength. With AdamW, the two hyperparameters can be tuned more independently, which simplifies hyperparameter search.
A common default configuration for AdamW in transformer training is a learning rate in the range 1e-4 to 1e-3 paired with a weight decay of 0.1.
One of the most challenging aspects of large language model training is that the optimal learning rate changes with model size. A learning rate that works well for a 125M parameter model may not work at all for a 7B parameter model. Since hyperparameter sweeps at the scale of billions of parameters are prohibitively expensive, researchers have developed methods for transferring optimal learning rates from small models to large ones.
Yang et al. (2022) proposed the Maximal Update Parameterization (muP), which modifies the standard parameterization of neural networks so that the optimal hyperparameters (including the learning rate) remain stable across different model widths. The key insight is that in standard parameterization, the magnitude of weight updates changes as the model width changes, which shifts the optimal learning rate. muP rescales the initialization, learning rate, and forward pass for each layer so that the dynamics of hidden representations remain consistent regardless of width.
In practice, muP enables the following workflow:

1. Train a small proxy model and sweep the learning rate (and other transferable hyperparameters) cheaply.
2. Transfer the best values to the full-size model, which is parameterized with muP.
3. Train the large model directly, without a hyperparameter sweep at the target scale.
Dey et al. (2023) at Cerebras demonstrated this in practice: they tuned a 111M parameter model, transferred the hyperparameters to a 3B parameter model, and achieved performance comparable to contemporary 7B models while using 3.3x less compute.
Recent research has identified important caveats. The scaling rules of muP rely on assumptions about the geometric alignment of a layer's inputs with its weights and gradient updates. These assumptions hold primarily at the start of training. For the remainder of training, weight decay appears to be more important than muP for stabilizing update dynamics across widths. In addition, certain hyperparameters like weight decay and dropout rates do not transfer under muP and still need to be tuned for the target model size.
| Aspect | Standard Parameterization | muP |
|---|---|---|
| LR Transfer Across Widths | Does not transfer | Approximately transfers |
| Key Assumption | None | Update dynamics stable across widths |
| What Transfers | Nothing reliably | Learning rate, some optimizer params |
| What Does NOT Transfer | Everything | Weight decay, dropout |
| Practical Use | Tune at each scale | Tune small proxy, transfer to large |
Even without muP, practitioners have accumulated empirical knowledge about what learning rates work at different scales. The following table summarizes commonly used peak learning rates for AdamW across different model sizes in LLM pre-training:
| Model Size | Typical Peak Learning Rate | Warmup Steps | Schedule | Examples |
|---|---|---|---|---|
| 125M-350M | 6e-4 to 1e-3 | 500-2,000 | Cosine to 10% of peak | GPT-2 Small, OLMo proxy models |
| 1B-3B | 3e-4 to 6e-4 | 1,000-2,000 | Cosine to 10% of peak | TinyLlama, Cerebras-GPT |
| 7B-13B | 1e-4 to 3e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA, LLaMA 2, Mistral |
| 30B-70B | 1e-4 to 1.5e-4 | 2,000-4,000 | Cosine to 10% of peak | LLaMA 65B, LLaMA 2 70B |
| 175B+ | 6e-5 to 1.2e-4 | 2,000-5,000 | Cosine | GPT-3 175B |
The general trend is clear: as models get larger, the peak learning rate decreases. For a given learning rate, larger models accumulate larger aggregate updates and tend to be more sensitive to instability, especially early in training, so smaller steps are needed to keep optimization stable.
In fine-tuning scenarios, it is often beneficial to apply different learning rates to different layers of the model. This technique is known as discriminative fine-tuning or differential learning rates.
The idea was popularized by Howard and Ruder (2018) in their ULMFiT paper. The intuition is that lower layers of a pre-trained model capture general, transferable features (such as basic language patterns or low-level visual features), while upper layers encode more task-specific representations. During fine-tuning, lower layers need smaller learning rates to preserve their general knowledge, while upper layers benefit from larger rates to adapt to the new task.
In practice, a common approach is to divide the model into groups of layers and assign each group a learning rate that is a fraction (e.g., 1/2.6) of the rate used for the group above it. For instance, if the top layers use a learning rate of 1e-3, the middle layers might use 3.8e-4, and the bottom layers might use 1.5e-4.
This technique has proven especially effective for transfer learning and fine-tuning large pre-trained models. Fast.ai's library implements discriminative learning rates as a first-class feature, and many practitioners use the approach when fine-tuning BERT, GPT, and other pre-trained models on downstream tasks. The concept is related to but distinct from layer-wise adaptive rate scaling (LARS) and LAMB, which automatically compute per-layer learning rate multipliers based on the ratio of weight norms to gradient norms.
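Computing the per-group rates is straightforward; the sketch below uses the ULMFiT factor of 2.6, and the group names are hypothetical:

```python
# Discriminative fine-tuning: each layer group gets the learning rate of
# the group above it divided by a decay factor.

def discriminative_lrs(groups, top_lr, factor=2.6):
    """Return {group_name: lr}, with `groups` ordered bottom -> top."""
    lrs = {}
    for depth, name in enumerate(reversed(groups)):  # top group first
        lrs[name] = top_lr / factor ** depth
    return lrs

lrs = discriminative_lrs(["embeddings", "encoder", "classifier"], top_lr=1e-3)
# classifier gets 1e-3, encoder ~3.8e-4, embeddings ~1.5e-4
```

In a framework such as PyTorch, the resulting rates would typically be passed to the optimizer as per-parameter groups rather than as a single dict.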
Imagine you are playing a game where you have to find a toy hidden somewhere in a dark room. You can only move by taking steps. The learning rate is how big your steps are. If you take really huge steps, you might walk right past the toy and keep going back and forth, never finding it. If you take tiny little baby steps, it will take you forever to get there. The learning rate is about finding the right step size so you reach your toy quickly without stepping over it.
In machine learning, the "toy" is the best answer the computer is looking for, and the "steps" are the changes the computer makes to get better at its job. A good learning rate helps it get better quickly without making wild, confusing changes.