Cosine learning rate schedule
Last reviewed
May 25, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,600 words
The cosine learning rate schedule, also called cosine annealing, is a learning rate decay strategy that lowers the optimizer step size from a peak value to a small minimum along a half period of a cosine function over a fixed number of training iterations. It was introduced by Ilya Loshchilov and Frank Hutter in the paper "SGDR: Stochastic Gradient Descent with Warm Restarts" (arXiv:1608.03983), first posted on 13 August 2016 and presented at ICLR 2017.[1] The original SGDR formulation paired cosine decay with periodic warm restarts that reset the rate to its peak, but in modern practice the most common variant is a single half cosine cycle, usually preceded by a brief linear warmup phase.[1][2] As of 2026 it is a standard learning rate schedule for large language model pretraining, used in runs including GPT-3, the LLaMA family, and Chinchilla.[3][4][5][6][7] Implementations are provided in PyTorch (torch.optim.lr_scheduler.CosineAnnealingLR and CosineAnnealingWarmRestarts) and in the Hugging Face Transformers library (get_cosine_schedule_with_warmup).[8][9][10]
Stochastic gradient descent and its variants require a learning rate that controls the magnitude of each parameter update. A constant rate is rarely optimal: a high rate accelerates progress early in training but causes oscillation near a minimum, while a low rate is stable near convergence but slow at the start.[11] The standard remedy is a learning rate schedule that begins high and decreases over time. Before cosine annealing became dominant, several decay families were widely used.
The earliest large scale image classification networks, including AlexNet (Krizhevsky et al. 2012), used step decay: the rate is divided by a constant factor (typically 10) at hand chosen epochs or when validation loss plateaus.[12] He et al. applied the same approach when training ResNet on ImageNet, starting at 0.1 and dividing by 10 each time the error plateaued, with weight decay of 1e-4 and SGD momentum 0.9.[13] Exponential decay multiplies the rate by a constant less than one at every step, giving a smooth geometric falloff. Polynomial decay lowers the rate following a polynomial of the step index, often used with power 1 (a linear schedule).[10] BERT, for example, used a linear warmup followed by linear decay to zero.[10]
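The three decay families described above can be written as small closed-form functions. This is a minimal sketch for comparison only: the function names, the fixed drop interval used for step decay (the networks above used hand chosen or plateau-triggered drops), and the parameter names are assumptions made for illustration.

```python
def step_decay(step, base_lr, drop_every, factor=0.1):
    """Multiply the rate by `factor` once every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)


def exponential_decay(step, base_lr, gamma):
    """Multiply the rate by `gamma` (0 < gamma < 1) at every step."""
    return base_lr * gamma ** step


def polynomial_decay(step, base_lr, total_steps, power=1.0, end_lr=0.0):
    """Interpolate from base_lr to end_lr; power=1 gives a linear schedule."""
    remaining = 1.0 - min(step, total_steps) / total_steps
    return end_lr + (base_lr - end_lr) * remaining ** power
```

With power = 1 and end_lr = 0, polynomial_decay is the linear decay to zero used in the BERT recipe after its warmup.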
A complementary technique is learning rate warmup, popularised for deep learning by Goyal et al. in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (arXiv:1706.02677, 8 June 2017).[14] When scaling minibatch size from 256 to 8,192 on 256 GPUs, the authors found that simply scaling the peak rate linearly with batch size caused optimization difficulties early in training, when the network is changing rapidly. Their fix is a "hyper-parameter-free linear scaling rule" combined with a gradual warmup scheme that ramps the learning rate from a small value up to the target peak over the first five epochs, after which the regular schedule takes over.[14] Warmup proved essential for large batch stochastic gradient descent (SGD) and was later adopted in essentially every large transformer pretraining recipe.[3][4][5][6][7]
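The linear scaling rule and the warmup ramp can be sketched in a few lines. The function names and the 500-step warmup length are hypothetical; the reference rate of 0.1 at a minibatch of 256 follows the ResNet setting quoted earlier and is used here only as an illustrative input.

```python
def scaled_peak_lr(base_lr, base_batch, batch):
    """Linear scaling rule: scale the rate in proportion to the minibatch size."""
    return base_lr * batch / base_batch


def gradual_warmup(step, warmup_steps, start_lr, peak_lr):
    """Ramp linearly from start_lr to peak_lr, then hold peak_lr."""
    if step >= warmup_steps:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_steps


# Going from a minibatch of 256 to 8,192 multiplies the rate by 32.
peak = scaled_peak_lr(base_lr=0.1, base_batch=256, batch=8192)
# Hypothetical 500-step ramp from the small-batch rate up to the scaled peak.
ramp = [gradual_warmup(s, 500, 0.1, peak) for s in (0, 250, 500)]
```

In a full recipe the value returned after the ramp would be handed to whichever decay schedule follows, rather than held constant.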
The cosine annealing schedule emerged in this context as a smoother alternative to step decay, motivated by the analogy between learning rate decay and the cooling schedule of simulated annealing.[1] Loshchilov and Hutter framed their original SGDR contribution as a way to escape sharp minima by occasionally re-exciting the optimizer with a high learning rate, and the cosine curve served as a convenient, continuously differentiable decay between those excitations.[1]
The cosine schedule defined by Loshchilov and Hutter (2016) sets the learning rate at iteration t within cycle i to:
η_t = η_min + 0.5 (η_max − η_min) (1 + cos(π T_cur / T_i))
where η_max and η_min are the peak and minimum rates, T_cur is the number of iterations since the most recent restart, and T_i is the length of the current restart cycle.[1] When T_cur = 0, the cosine term equals 1 and η_t = η_max. When T_cur = T_i, the cosine term equals −1 and η_t = η_min. The curve is a half period of cosine, so the rate stays above a linear interpolation between η_max and η_min during the first half of the cycle and below it during the second half, and its first derivative at both endpoints is zero, producing a smooth handoff.[1]
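The formula and its endpoint behaviour can be checked directly. The sketch below evaluates it at a few points of a 100-iteration cycle next to a straight-line interpolation between the same endpoints; the function names sgdr_lr and linear_lr are illustrative, not taken from the paper.

```python
import math


def sgdr_lr(t_cur, t_i, eta_max, eta_min=0.0):
    """Rate after t_cur iterations of a cycle of length t_i."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))


def linear_lr(t_cur, t_i, eta_max, eta_min=0.0):
    """Straight-line interpolation between the same endpoints, for comparison."""
    return eta_max - (eta_max - eta_min) * t_cur / t_i


# Print step, cosine rate, and linear rate at the quarter points of the cycle.
for t in (0, 25, 50, 75, 100):
    print(t, round(sgdr_lr(t, 100, 1.0), 4), round(linear_lr(t, 100, 1.0), 4))
```

The two curves agree at the start, midpoint, and end of the cycle; the cosine rate is higher at the quarter point and lower at the three-quarter point.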
In the most common single cycle usage T_i is set equal to T_max, the total number of training steps after warmup, and η_min is either zero or a fixed fraction (often 10%) of η_max.[3][4][5] The formula then reduces to:
η_t = η_min + 0.5 (η_max − η_min) (1 + cos(π t / T_max))
This is the formula implemented as torch.optim.lr_scheduler.CosineAnnealingLR in PyTorch, which exposes T_max and eta_min (default 0) as constructor arguments.[8]
The original SGDR paper combined cosine decay with warm restarts. After T_i iterations of cosine decay from η_max down to η_min, the learning rate is reset to η_max and a new cycle begins.[1] The high rate at restart has been described as a way to "essentially catapult the parameters out of the minimum to which they previously converged and to a different area of the loss surface", after which the aggressive cosine annealing again drives the rate toward η_min and allows the optimizer to settle into a (possibly different) basin of attraction.[1]
Two parameters control the restart sequence. T_0 is the length of the first cycle. T_mult is a multiplier applied at every restart, so cycle lengths form the sequence T_0, T_0 · T_mult, T_0 · T_mult², and so on.[1] With T_mult = 1 cycles have constant length; with T_mult = 2 each cycle is twice as long as the previous one, so cycle lengths grow geometrically and restarts become progressively rarer. The authors recommend, when used with Adam, "to start with an initially small T_i (between 1% and 10% of the total number of epochs) and multiply it by a factor of T_mult (e.g. T_mult = 2) at every restart".[1] Their experiments on CIFAR-10 and CIFAR-100 used Wide Residual Networks such as WRN-28-10, with settings including T_0 = 10 and T_mult = 2, and reported best error rates of 3.14% and 16.21% respectively, state of the art at the time.[1]
PyTorch implements this variant as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0, T_mult=1, eta_min=0.0, last_epoch=-1), where T_0 is the iterations to the first restart and T_mult multiplies the restart interval after each cycle.[9]
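The bookkeeping behind T_0 and T_mult can be made concrete with a small sketch that maps a global step index to its cycle and evaluates the cosine formula inside that cycle. It is written in plain Python rather than against the PyTorch class, and the helper names locate_in_cycle and warm_restart_lr are hypothetical.

```python
import math


def locate_in_cycle(step, t_0, t_mult=1):
    """Return (cycle index i, T_cur, T_i) for a global step counted from 0."""
    i, t_i, start = 0, t_0, 0
    while step >= start + t_i:
        start += t_i      # skip past the finished cycle
        t_i *= t_mult     # the next cycle is t_mult times longer
        i += 1
    return i, step - start, t_i


def warm_restart_lr(step, t_0, t_mult, eta_max, eta_min=0.0):
    """Cosine decay within the current cycle; the rate jumps back to eta_max at each restart."""
    _, t_cur, t_i = locate_in_cycle(step, t_0, t_mult)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_i))
```

With T_0 = 10 and T_mult = 2 the cycles have lengths 10, 20, 40, and so on, so restarts fall at steps 10, 30, and 70. In this discretisation T_cur reaches at most T_i − 1 before the restart, so the rate approaches η_min at the end of a cycle without landing on it exactly.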
The schedule most commonly used in modern large language model pretraining is a single half cosine cycle preceded by a short linear warmup, with no warm restarts. The schedule has four parameters: the number of warmup steps W, the total number of training steps T, the peak learning rate η_max, and the floor η_min, which is often expressed as a fraction of η_max.[3][4][5][6][7]
During the warmup phase (steps 0 ≤ t < W) the rate rises linearly from 0 (or a small value) to η_max. From step W onwards the rate follows a half cosine that decays from η_max to η_min over the remaining T − W steps.[10] The warmup duration is typically 1% to 10% of total training; LLM pretraining runs often use a fixed warmup of a few thousand steps regardless of total length.[3][4]
The Hugging Face get_cosine_schedule_with_warmup function exposes exactly this pattern, taking num_warmup_steps, num_training_steps, and an optional num_cycles parameter that defaults to 0.5, corresponding to a single half cosine.[10] A related helper, get_cosine_with_hard_restarts_schedule_with_warmup, allows multiple hard restarts within the same total budget, and get_cosine_with_min_lr_schedule_with_warmup lets the user specify a non zero floor for the rate.[10]
Several modifications of the basic cosine schedule are used in practice.
Cosine to a final fraction of η_max. Rather than decaying all the way to zero, most LLM training runs decay to 10% of the peak rate.[3][4][5] GPT-3 uses cosine decay of the learning rate down to 10% of its peak value over 260 billion tokens, after which training continues at that floor, together with a linear warmup over the first 375 million tokens; the peak rate for the 175B model is 0.6e-4 (6e-5), with 6e-4 used only for the smallest model in the family.[3] LLaMA 1 and LLaMA 2 both use a cosine learning rate schedule with 2,000 warmup steps and a final learning rate equal to 10% of the peak.[4][5] The residual 10% floor is widely adopted on the empirical grounds that it improves downstream task performance compared with decaying to zero.[4][5]
Warmup Stable Decay (WSD) / trapezoidal schedule. Introduced as part of the MiniCPM training recipe by Shengding Hu and coauthors in "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies" (arXiv:2404.06395, submitted 9 April 2024), the WSD schedule splits training into three phases: a linear warmup, an extended "stable" phase at constant peak rate, and a short final decay phase.[15] The authors argue the design is "conducive to continuous training and domain adaptation" because the stable phase can be extended indefinitely without committing to a specific total token count; only the final decay phase needs to be matched to where the run will actually stop.[15] This addresses one of the main weaknesses of cosine, discussed below. Hugging Face Transformers added get_wsd_schedule to support this pattern, with configurable num_warmup_steps, num_stable_steps, and num_decay_steps, and decay_type that defaults to cosine.[10]
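The three phases can be sketched as a single function of the step index. This is an illustration under stated assumptions, not the MiniCPM or Transformers implementation: the name wsd_lr and its parameters are hypothetical, and the final phase uses a half cosine, which is only one of several possible decay shapes.

```python
import math


def wsd_lr(step, warmup, stable, decay, peak, floor=0.0):
    """Warmup-Stable-Decay: linear ramp, constant plateau, then a short decay to `floor`."""
    if step < warmup:
        return peak * step / max(1, warmup)
    if step < warmup + stable:
        return peak
    # Decay phase: a half cosine here (an assumption; other decay shapes are possible).
    progress = min(1.0, (step - warmup - stable) / max(1, decay))
    return floor + (peak - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Lengthening the run only changes the stable argument: every step before the decay phase sees the same rate as before, which is why a checkpoint from the plateau can be continued without redesigning the schedule.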
The Chinchilla cosine length finding. In "Training Compute Optimal Large Language Models" (arXiv:2203.15556, 29 March 2022), Jordan Hoffmann and coauthors observed that the cosine schedule has a sharp dependence on the match between cycle length and actual training duration.[6] Specifically, they noted that "setting the cosine cycle length to approximately match the number of training tokens results in the best final loss regardless of model size", and that when the cycle length overshoots the target by more than 25% "performance is noticeably degraded".[6] They formulated this as a methodological rule: the learning rate should decay by approximately 10x over the actual D training tokens used.[6] This was a significant departure from Kaplan et al. (2020), which had used a fixed schedule across models regardless of training duration, and it changed the interpretation of compute optimal training results in the Chinchilla scaling laws.[6]
Schedule free optimization. A more recent alternative, introduced in "The Road Less Scheduled" by Aaron Defazio and coauthors (arXiv:2405.15682, submitted 24 May 2024), eliminates explicit schedules altogether.[16] The Schedule-Free optimizer family unifies scheduling and iterate averaging in a single update, requires no advance knowledge of the stopping iteration T, and adds no hyperparameters over a standard momentum optimizer.[16] The method won the MLCommons 2024 AlgoPerf self tuning track and the authors report it matches or beats cosine schedules across convex problems and deep learning benchmarks.[16]
The cosine plus linear warmup schedule appears in the published training recipes of many major LLMs. The table below summarises reported peak rates, warmup, and final rate fraction, with non-cosine recipes included for comparison. Cells marked "not disclosed" indicate that the value is not given in the cited reference.
| Model | Schedule | Peak rate (η_max) | Floor (η_min / η_max) | Warmup | Training tokens |
|---|---|---|---|---|---|
| GPT-3 175B | cosine + linear warmup | 6e-5 | 10% | 375M tokens | 300B tokens (decay over 260B)[3] |
| Chinchilla 70B | cosine, cycle = D tokens | not disclosed | 10% | not disclosed | 1.4T tokens[6] |
| PaLM 540B | constant, then 1/√k decay (Adafactor); not cosine | 1e-2 | not applicable | none (constant rate for first 10,000 steps) | 780B tokens[17] |
| LLaMA 7B / 13B | cosine + warmup | 3.0e-4 | 10% | 2,000 steps | ~1T tokens[4] |
| LLaMA 33B / 65B | cosine + warmup | 1.5e-4 | 10% | 2,000 steps | ~1.4T tokens[4] |
| LLaMA 2 7B / 13B | cosine + warmup | 3.0e-4 | 10% | 2,000 steps | 2T tokens[5] |
| LLaMA 2 34B / 70B | cosine + warmup | 1.5e-4 | 10% | 2,000 steps | 2T tokens[5] |
| MiniCPM (variants) | WSD (trapezoidal) | varies | varies | linear | varies[15] |
The combination of linear warmup over a small fraction of training, followed by a half cosine decay to roughly 10% of the peak rate, has become a de facto default across the field. Megatron-LM provides --lr-decay-style cosine, --lr-warmup-iters, and --lr-decay-iters options that operationalise exactly this pattern, and DeepSpeed offers an equivalent scheduler in its JSON configuration.[18][19]
In PyTorch, the two relevant classes live in torch.optim.lr_scheduler.[8][9]
CosineAnnealingLR(optimizer, T_max, eta_min=0.0, last_epoch=-1) implements the single cycle form. After T_max steps the learning rate equals eta_min. Calling scheduler.step() after each optimizer step or epoch advances the curve.[8]
CosineAnnealingWarmRestarts(optimizer, T_0, T_mult=1, eta_min=0.0, last_epoch=-1) implements the SGDR form with cycle restarts. T_0 is the first restart interval and T_mult is the geometric multiplier on subsequent intervals.[9]
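Neither of these classes implements warmup directly. Practitioners typically place a warmup scheduler such as torch.optim.lr_scheduler.LinearLR in front of CosineAnnealingLR using torch.optim.lr_scheduler.SequentialLR, which switches from one scheduler to the next at a given milestone step; ChainedScheduler is an alternative that applies several schedulers at every step.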
The Hugging Face Transformers library exposes higher level helpers that integrate warmup directly.[10] get_cosine_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, num_cycles=0.5, last_epoch=-1) produces a LambdaLR that linearly warms up from 0 to the optimizer's base learning rate over num_warmup_steps, then follows the cosine curve from the base rate down to 0 over the remaining num_training_steps − num_warmup_steps steps. The num_cycles=0.5 default is one half cosine; num_cycles counts full cosine periods, so an integer value N makes the rate fall to 0 and rise smoothly back to the base rate N times rather than producing hard restarts.[10] The variants get_cosine_with_hard_restarts_schedule_with_warmup and get_cosine_with_min_lr_schedule_with_warmup provide explicit hard restarts and non zero floors respectively.[10] The lr_scheduler_type argument of TrainingArguments accepts the string "cosine" to use this helper automatically.[10]
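The multiplier that such a LambdaLR applies to the base rate can be sketched in plain Python, which makes the role of num_cycles easier to see. The function name cosine_warmup_multiplier is hypothetical, and the formula is an assumption based on the documented behaviour of the helper rather than a copy of the library source.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier applied to the optimizer's base learning rate at `step`."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # num_cycles counts full cosine periods over the decay phase; 0.5 is one half cosine.
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * 2.0 * num_cycles * progress)))
```

Under this formula num_cycles = 0.5 ends at a multiplier of 0, whereas num_cycles = 1 traces one full period, falling to 0 halfway through the decay phase and climbing back to 1 by the final step.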
Distributed training frameworks expose configuration knobs. In Megatron-LM the relevant flags are --lr-decay-style cosine, --lr <peak>, --min-lr <floor>, --lr-warmup-iters <W>, and --lr-decay-iters <T> (alternatively --lr-decay-samples, or --lr-decay-tokens where the fork in use provides it).[18] In DeepSpeed the cosine schedule is configured under the scheduler block in the JSON config with type "WarmupCosineLR", supplying total_num_steps, warmup_num_steps, warmup_min_ratio, and cos_min_ratio parameters.[19]
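As an illustration of the DeepSpeed form, the fragment below builds the scheduler block as a Python dictionary and serialises it to JSON. Only the key names come from the description above; every numeric value is hypothetical, the optimizer block is included only to give the schedule a peak rate, and treating the two ratio parameters as fractions of that peak rate is an assumption to verify against the DeepSpeed documentation.

```python
import json

# Hypothetical values chosen to resemble a 10%-floor recipe; adjust per run.
peak_lr = 3.0e-4
ds_config_fragment = {
    "optimizer": {"type": "AdamW", "params": {"lr": peak_lr}},
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": 100_000,   # length of the whole schedule
            "warmup_num_steps": 2_000,    # linear warmup steps
            "warmup_min_ratio": 0.0,      # assumed: starting rate as a fraction of peak
            "cos_min_ratio": 0.1,         # assumed: floor as a fraction of peak
        },
    },
}

# Emit the fragment in the JSON form a DeepSpeed config file would contain.
print(json.dumps(ds_config_fragment, indent=2))
```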
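A minimal, framework-independent Python implementation of cosine with linear warmup is approximately:

    import math

    def cosine_with_warmup(step, warmup, total, peak, floor):
        """Learning rate at `step`: linear warmup, then half cosine decay to `floor`."""
        if step < warmup:
            return peak * step / warmup  # linear ramp from 0 to peak
        # Fraction of the decay phase completed, clamped so the rate holds at `floor` past `total`.
        progress = min(1.0, (step - warmup) / max(1, total - warmup))
        cosine_factor = 0.5 * (1.0 + math.cos(math.pi * progress))
        return floor + (peak - floor) * cosine_factor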
This expression matches the LLaMA, LLaMA 2, and GPT-3 schedule shape when floor = 0.1 * peak.[3][4][5]
The smooth, monotone shape of the cosine curve has been argued to have several theoretical and empirical advantages over step or polynomial decay.
The first motivation is the connection to simulated annealing. Annealing schedules in combinatorial optimization start from a high temperature and gradually cool, allowing escape from poor local minima early while permitting fine grained search late. The original SGDR paper presented warm restart cosine annealing as a deep learning analogue, with the high rate at restart playing the role of high temperature re-excitation; Loshchilov and Hutter explicitly framed the design "as a complementary tool to existing means of dealing with the multimodal nature of the loss surface".[1]
A second motivation is smoothness. Step decay produces large, instantaneous changes in optimizer dynamics at the decay boundaries, which can manifest as visible kinks in training loss curves and force the optimizer to readjust its momentum buffers abruptly when the rate drops by an order of magnitude. Cosine decay is continuously differentiable everywhere and its derivative goes to zero at both endpoints, eliminating these discontinuities and producing visibly smoother loss curves in practice.[11] The zero derivative at t = 0 means the rate stays close to its peak for the first portion of the cycle, while the zero derivative at t = T_max means it lingers near the floor at the end. A cosine schedule therefore spends more iterations both near the peak rate and near the floor than a linear schedule of the same total length.[11]
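How long the rate lingers near each endpoint can be estimated numerically. The sketch below counts the share of evenly spaced steps whose decay factor lies within 10% of the peak or of the floor, for a half cosine and for a linear ramp; the helper names and the 10% band are arbitrary choices made for illustration.

```python
import math


def cosine_factor(p):
    """Half cosine decay factor at progress p in [0, 1]."""
    return 0.5 * (1.0 + math.cos(math.pi * p))


def fraction_of_steps(factor_fn, lo, hi, n=100_000):
    """Share of n evenly spaced steps whose decay factor lies in [lo, hi]."""
    hits = sum(1 for k in range(n) if lo <= factor_fn((k + 0.5) / n) <= hi)
    return hits / n


near_peak_cosine = fraction_of_steps(cosine_factor, 0.9, 1.0)
near_peak_linear = fraction_of_steps(lambda p: 1.0 - p, 0.9, 1.0)
near_floor_cosine = fraction_of_steps(cosine_factor, 0.0, 0.1)
```

The cosine curve keeps the factor above 0.9 for roughly the first fifth of the cycle (the exact share is arccos(0.8)/π ≈ 0.205) and below 0.1 for the same share at the end, about twice the 10% that a linear ramp spends in each band.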
A third motivation, specific to large model pretraining, is that cosine keeps the rate near its peak for longer than exponential or step decay before descending, which is reported to improve downstream task performance.[11] The cosine factor averages 1/2 over a half period from 1 to 0, the same as a linear ramp, so the two schedules deliver the same total step size budget; they differ in how that budget is distributed, with cosine spending more steps near the peak and near the floor and passing through intermediate rates fastest at mid cycle. However, the precise reason cosine outperforms other smooth schedules of similar shape is still an open question; recent work has reported that linear decay to zero matches or exceeds cosine in LLM pretraining when the schedule length matches training length, suggesting that the precise functional form may matter less than the endpoint matching property.[20]
The most discussed limitation is the requirement, identified by the Chinchilla authors, that the schedule length match the training length.[6] Because the cosine curve commits to a specific T_max in advance, two failure modes appear when T_max is chosen incorrectly. If T_max is set longer than the actual training run (e.g. because the user stops early), training ends with the rate still well above η_min and final loss is worse than it could be. If T_max is set shorter than the actual run, the rate hits η_min before training ends and the optimizer effectively trains at the floor for the remainder. The Chinchilla paper reported "noticeably degraded" performance when the cycle overshoots by more than 25%.[6]
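The first failure mode is easy to quantify. The sketch below computes the rate a run finishes on when the cosine cycle was sized for more steps than were actually trained; it ignores warmup for simplicity, and the function name and step counts are hypothetical.

```python
import math


def lr_at_stop(stop_step, t_max, peak, floor):
    """Rate reached at stop_step when the cosine was sized for t_max steps (no warmup)."""
    progress = min(1.0, stop_step / t_max)
    return floor + (peak - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))


peak, floor, stop = 1.0, 0.1, 100_000
for overshoot in (1.0, 1.25, 2.0):
    final = lr_at_stop(stop, int(stop * overshoot), peak, floor)
    print(f"cycle = {overshoot:.2f} x run length -> final rate = {final:.3f} x peak")
```

With a 10% floor, a cycle 25% longer than the run stops at about 0.19 of the peak rate instead of 0.10, and a cycle twice as long stops at 0.55, so most of the low rate phase is never reached.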
This constraint is awkward for continual pretraining and domain adaptation, where the user often does not know the final step count in advance, and for compute optimal scaling studies, where many training durations need to be compared. The Hoffmann et al. methodology change (refitting the cosine schedule for every training duration) helped explain the discrepancy between their results and those of Kaplan et al. (2020), but at the cost of a separate training run, with its own schedule, for every data point.[6]
The MiniCPM WSD schedule was explicitly designed to address this limitation by keeping the bulk of training at a constant rate, with only a short final decay tied to the actual endpoint.[15] Hu et al. report that the resulting schedule is "conducive to continuous training and domain adaptation" because the stable phase can be checkpointed and resumed without recommitting to a fixed T_max.[15]
A more radical response is to remove schedules entirely. The Schedule-Free optimizer from Defazio et al. (2024) requires no advance knowledge of T, achieves competitive or better results than cosine across a range of tasks, and won the MLCommons 2024 AlgoPerf self tuning track; the authors describe their construction as "a theoretical framework that unifies scheduling and iterate averaging".[16] Other recent work has examined infinite learning rate schedules and constant rates with brief end of training decay, both motivated by the same observation that the strict T dependence of cosine is inconvenient and may not be necessary; linear decay to zero in particular has been reported to match or exceed cosine when the schedule length is correctly matched to the training run.[20]
A second class of limitations concerns the warmup phase rather than the cosine portion. Although the combined recipe is robust enough to have become the field default, the choice of warmup length is largely heuristic, with values ranging from less than 1% of training (typical for very long pretraining runs where 2,000 steps is a tiny fraction) to around 10% (more common for short fine tuning runs).[3][4][5] No fundamental theory predicts the right value, and recent analyses have argued that warmup is largely a workaround for over scaled initial learning rates and could in principle be eliminated by better initialization.[20]
Despite these alternatives, the cosine plus linear warmup schedule remains the default in nearly all reported large language model pretraining runs as of 2026, and is the most widely supported schedule across PyTorch, Hugging Face Transformers, DeepSpeed, and Megatron-LM.[8][9][10][18][19]