Schedule-Free optimizer
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,091 words
Schedule-Free is a family of optimization algorithms for deep learning and convex stochastic optimization that matches or exceeds the performance of tuned learning-rate schedules without specifying a horizon T or any decay schedule in advance. Introduced by Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky in the May 2024 preprint "The Road Less Scheduled" (arXiv:2405.15682)[^1], the method replaces the momentum buffer of standard optimizers with a particular interpolation and averaging of three iterate streams (x, y, z). The Schedule-Free AdamW variant won the self-tuning track of the MLCommons AlgoPerf 2024 benchmark, beating the prize-qualification baseline by about 8 percent on the seven workloads where both algorithms trained successfully[^2]. The paper was selected as an Oral presentation at NeurIPS 2024 and the reference PyTorch implementation lives at the facebookresearch/schedule_free GitHub repository[^3][^4].
For most of the modern history of neural network training, the learning rate has been the single most sensitive hyperparameter in the recipe. Practitioners typically pair an optimizer such as Adam or AdamW with a decay schedule (step decay, linear decay, polynomial decay, or, since the late 2010s, cosine decay) that gradually reduces the step size over the course of training[^1]. The choice of schedule is tightly coupled to the choice of total number of training steps T: a cosine schedule that decays to zero at step T needs to know T in advance, and changing T usually requires retuning the schedule and other hyperparameters.
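The dependence on T can be made concrete with a minimal sketch of a cosine schedule. The function name `cosine_lr` and the numeric values are illustrative assumptions, not taken from any particular library:

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at step total_steps."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# The same step index receives a different learning rate under different horizons,
# so the schedule cannot be evaluated without committing to T.
lr_short = cosine_lr(step=5_000, total_steps=10_000, base_lr=1e-3)
lr_long = cosine_lr(step=5_000, total_steps=20_000, base_lr=1e-3)
```

Extending a run from 10,000 to 20,000 steps therefore changes the learning rate at every step already taken, which is why a cosine run cannot be lengthened without altering the schedule.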
This coupling creates several practical problems[^1]. First, if the practitioner decides to train longer than planned, the original cosine schedule is no longer valid and a new one with a different T must be selected, which makes mid-run extensions awkward. Second, comparing optimizers across different budgets is hard because the best schedule for budget T1, truncated at T2 < T1, is not the best schedule for budget T2. Third, large pretraining runs often have to commit to a stopping time before training begins, and any later change forces either an abrupt switch (potentially destabilizing) or a continued run with a stale schedule.
The Schedule-Free paper observes that the empirical curve traced out by Polyak-Ruppert averaging during training, where each point on the curve corresponds to a final iterate obtained by averaging up to that step, very closely tracks the Pareto frontier of "loss versus number of steps trained" that cosine and linear decay schedules achieve when each schedule is independently tuned to its own T[^1]. The authors then ask whether there exists a method that follows this Pareto frontier within a single run, without specifying any horizon or schedule, and prove that the answer is yes.
The motivation also has a theoretical side. In classical convex optimization theory, two ways of producing a "final iterate" from stochastic gradient descent are well known: Polyak-Ruppert averaging (averaging all iterates), and primal averaging (in which the gradient at each step is taken at a running average of past iterates). Both have provable convergence rates of O(1/sqrt(T)) for convex Lipschitz problems, but neither matches what well-tuned decaying step sizes do in deep learning practice[^1]. The Schedule-Free construction is, in a precise sense, the family that interpolates between these two extremes through a momentum-like parameter beta.
Polyak-Ruppert averaging, originally proposed in the late 1980s and early 1990s for stochastic approximation, says that if one runs stochastic gradient descent with a constant step size, the running mean of the iterates converges at the optimal rate for the convex Lipschitz problem class[^1]. In practice it is almost never used as written in deep learning, because while the average iterate has nice asymptotic properties, the gradients are still computed at the latest unaveraged iterate and the average tends to lag behind, hurting practical performance.
Primal averaging, attributed to Nesterov and Tao among others, runs the descent step at the average instead of at the latest iterate: each gradient is evaluated at the running mean. This corresponds, after standard reductions, to the "online-to-batch" conversion in online learning theory, where an online algorithm's regret bound automatically becomes a convergence bound for the average iterate. Primal averaging is theoretically optimal for convex Lipschitz problems with the right step sizes but converges slowly in early iterations because the average is dominated by the (poor) initial iterate for a long time[^1].
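The difference between the two averaging schemes comes down to where the gradient is evaluated. The following is a minimal sketch on a one-dimensional quadratic, chosen here purely for illustration (the theory above concerns convex Lipschitz problems); the function names are hypothetical:

```python
def quadratic_grad(w: float, target: float = 3.0) -> float:
    """Gradient of f(w) = 0.5 * (w - target)**2."""
    return w - target


def polyak_ruppert(grad, w0: float, lr: float, steps: int) -> float:
    """Step at the latest iterate z; report the running mean x of the iterates."""
    z, x = w0, w0
    for t in range(1, steps + 1):
        z -= lr * grad(z)          # gradient at the unaveraged iterate
        x += (z - x) / (t + 1)     # running mean of z_0, ..., z_t
    return x


def primal_averaging(grad, w0: float, lr: float, steps: int) -> float:
    """Same running mean, but the gradient is evaluated at the mean x itself."""
    z, x = w0, w0
    for t in range(1, steps + 1):
        z -= lr * grad(x)          # gradient at the running mean
        x += (z - x) / (t + 1)
    return x


pr = polyak_ruppert(quadratic_grad, w0=0.0, lr=0.1, steps=1000)
pa = primal_averaging(quadratic_grad, w0=0.0, lr=0.1, steps=1000)
```

The two functions differ only in the argument passed to `grad`, which is the distinction the Schedule-Free construction later interpolates over.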
By the 2020s, cosine decay (sometimes with linear warmup) had become the de facto schedule for Transformer pretraining, and step-wise schedules with warmup were standard for convolutional models. Cosine schedules are popular because empirically they produce a smooth descent followed by fine refinement near the end, and they avoid the abrupt drop of step decay[^1]. The downside is the rigidity already described: cosine ties the optimizer to a fixed budget.
A separate line of work, including stochastic weight averaging (SWA) and "Latest Weight Averaging" (LAWA), proposed averaging the last K iterates as a post-processing step. These methods improve over the final iterate but still rely on an underlying schedule and on choosing a tail window. Schedule-Free can be viewed as a principled, schedule-free generalization in which averaging is woven into the optimizer itself rather than applied as a post-hoc smoother[^1].
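Post-hoc tail averaging of this kind can be sketched in a few lines. This is a simplified illustration, assuming checkpoints are plain lists of floats; the helper name `tail_average` is hypothetical, and real SWA or LAWA implementations operate on network weights and differ in how checkpoints are sampled:

```python
def tail_average(checkpoints: list[list[float]], k: int) -> list[float]:
    """Elementwise mean of the last k parameter vectors."""
    if k <= 0 or k > len(checkpoints):
        raise ValueError("k must be between 1 and the number of checkpoints")
    tail = checkpoints[-k:]
    return [sum(values) / k for values in zip(*tail)]


# Three saved checkpoints of a two-parameter model; average the last two.
saved = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
averaged = tail_average(saved, k=2)
```

The window size `k` is the extra choice that such methods introduce on top of the underlying schedule.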
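Schedule-Free maintains three sequences of parameters[^1][^4]:
- z, the base iterate, on which the underlying SGD or AdamW step is taken;
- x, the running average of the z iterates, which is the model used for evaluation;
- y, an interpolation between z and x, at which the gradient is evaluated.

The relationships between the streams are:
y_t = (1 - beta) z_t + beta x_t
z_{t+1} = z_t - gamma grad f(y_t)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}, with c_{t+1} = 1 / (t + 1)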
Here gamma is a constant step size (no schedule), beta is a momentum-like interpolation parameter (default 0.9 in the reference implementation), and the weights c_t implement an equal-weighted Polyak average of z[^1][^4]. The key conceptual move is that the gradient is evaluated at y, a convex combination of the base iterate z and the average x, rather than at one or the other. Setting beta to 0 recovers pure Polyak-Ruppert averaging; setting beta to 1 recovers primal averaging; intermediate values like 0.9 give the practical sweet spot[^1].
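The three-stream update can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation: it uses a deterministic gradient, equal-weighted averaging with c = 1 / (t + 1), no warmup, and a toy quadratic objective; the function name is hypothetical.

```python
from typing import Callable


def schedule_free_sgd(
    grad: Callable[[list[float]], list[float]],
    w0: list[float],
    lr: float,
    beta: float = 0.9,
    steps: int = 2000,
) -> list[float]:
    """Run Schedule-Free SGD with a constant step size and return the average x."""
    z = list(w0)  # base iterate
    x = list(w0)  # running average of z (the reported model)
    for t in range(1, steps + 1):
        # Gradient is evaluated at y, a convex combination of z and x.
        y = [(1.0 - beta) * zi + beta * xi for zi, xi in zip(z, x)]
        g = grad(y)
        # Plain SGD step on z with the constant step size.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Equal-weighted running average of the z iterates.
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


def toy_grad(w: list[float]) -> list[float]:
    """Gradient of f(w) = 0.5 * (w[0] - 3)**2 + (w[1] + 1)**2."""
    return [w[0] - 3.0, 2.0 * (w[1] + 1.0)]


x_final = schedule_free_sgd(toy_grad, w0=[0.0, 0.0], lr=0.1)
```

With `beta=0.0` the gradient is taken at z and the loop reduces to Polyak-Ruppert averaging; with `beta=1.0` it is taken at x and the loop reduces to primal averaging.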
The paper's central theoretical contribution is a "Schedule-Free" lemma showing that a particular averaging scheme over a sequence of iterates is equivalent, in convergence-rate terms, to running unaveraged SGD with a tuned schedule that decays linearly to zero at the current step t[^1]. In other words, averaging with a particular increasing-weight scheme on iterates produced by constant-step-size SGD reproduces, at every step, the convergence guarantee of a linearly decaying schedule ending at that step. Because this holds at every t simultaneously, the method is "free" of needing T at the start: you get the right anytime guarantee with a single constant step size.
Concretely, for a convex G-Lipschitz loss function over a convex set of diameter D, Schedule-Free SGD with step size gamma = D / (G sqrt(T)) achieves[^1]:
E[F(x_T) - F(x_star)] <= D G / sqrt(T)
for every t up to T, for any beta in [0, 1]. The bound matches the worst-case convergence rate of any first-order method on this problem class. The novelty is not the rate itself but that it is achieved by a single algorithm, with a single constant step size, that hits the bound at every intermediate step rather than only at a pre-specified endpoint.
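As a quick arithmetic check of the step size and bound above, the sketch below plugs in hypothetical values for D, G, and T. The numbers and function names are illustrative only and are not taken from the paper:

```python
import math


def sf_sgd_step_size(diameter: float, lipschitz: float, horizon: int) -> float:
    """gamma = D / (G * sqrt(T))."""
    return diameter / (lipschitz * math.sqrt(horizon))


def sf_sgd_bound(diameter: float, lipschitz: float, horizon: int) -> float:
    """Upper bound D * G / sqrt(T) on the expected suboptimality."""
    return diameter * lipschitz / math.sqrt(horizon)


# Hypothetical problem constants: D = 10, G = 2, T = 10,000.
gamma = sf_sgd_step_size(10.0, 2.0, 10_000)
bound = sf_sgd_bound(10.0, 2.0, 10_000)
bound_4x = sf_sgd_bound(10.0, 2.0, 40_000)
```

Because the bound scales as 1 / sqrt(T), quadrupling the number of steps halves it.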
A follow-up paper by Ahn, Magakyan, and Cutkosky (November 2024) extended this story to the nonconvex setting through a "general framework for online-to-nonconvex conversion," proving that Schedule-Free SGD also achieves optimal iteration complexity for nonsmooth nonconvex problems[^5]. Their analysis closes a theoretical gap because the original convex theory did not formally explain Schedule-Free's empirical success on deep neural networks, which are far from convex.
Schedule-Free SGD is the simplest concrete algorithm. Compared to SGD with momentum it adds no new hyperparameters: the momentum coefficient becomes the interpolation parameter beta, the learning rate is a constant gamma, and standard weight decay can be applied to z[^1][^4]. The implementation requires one extra parameter buffer beyond standard SGD: in the reference PyTorch code, x and z are stored explicitly and y is computed on demand from x and z each step. Memory cost is therefore comparable to SGD with momentum (which also keeps one extra buffer for the velocity)[^4].
A subtle implementation detail is that the model must hold different parameter vectors for training and for evaluation. The reference repo recommends calling optimizer.train() before training steps, so that the model holds y (the point at which the gradient is computed), and optimizer.eval() before evaluation, so that it holds x[^4]. For models with batch normalization, the running statistics correspond to y by default; the README instructs users to run roughly 50 batches in model.train() plus optimizer.eval() mode before each evaluation so that the BN running mean and variance correspond to x[^4].
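The train/eval switch can be illustrated with a toy optimizer that keeps y in the shared parameter list during training and swaps in x for evaluation. `ToyScheduleFreeSGD` is a hypothetical class written for this sketch; it is not the schedulefree library's code, and it assumes beta > 0 so that the interpolation can be inverted:

```python
class ToyScheduleFreeSGD:
    """Toy Schedule-Free SGD over a shared list of floats (illustration only)."""

    def __init__(self, params: list[float], lr: float = 0.1, beta: float = 0.9):
        self.params = params      # shared with the "model"; holds y or x
        self.z = list(params)     # base iterate, kept inside the optimizer
        self.lr, self.beta = lr, beta
        self.t = 0
        self.training = True      # params hold y while True, x while False

    def _x_from_y(self, y: float, z: float) -> float:
        # Invert y = (1 - beta) * z + beta * x.
        return (y - (1.0 - self.beta) * z) / self.beta

    def train(self) -> None:
        if not self.training:
            for i, x in enumerate(self.params):
                self.params[i] = (1.0 - self.beta) * self.z[i] + self.beta * x
            self.training = True

    def eval(self) -> None:
        if self.training:
            for i, y in enumerate(self.params):
                self.params[i] = self._x_from_y(y, self.z[i])
            self.training = False

    def step(self, grads: list[float]) -> None:
        if not self.training:
            raise RuntimeError("call train() before step()")
        self.t += 1
        c = 1.0 / (self.t + 1)
        for i, g in enumerate(grads):
            x = self._x_from_y(self.params[i], self.z[i])
            self.z[i] -= self.lr * g
            x = (1.0 - c) * x + c * self.z[i]
            self.params[i] = (1.0 - self.beta) * self.z[i] + self.beta * x


params = [0.0]
opt = ToyScheduleFreeSGD(params, lr=0.1, beta=0.9)
opt.train()
for _ in range(2000):
    opt.step([params[0] - 3.0])   # gradient of 0.5 * (w - 3)**2 at y
opt.eval()
x_eval = params[0]                # evaluation sees the averaged iterate x
```

Forgetting the `eval()` call would evaluate the model at y instead of x, which is the mistake the mode switch exists to prevent.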
Schedule-Free AdamW applies the same x, y, z construction with the AdamW update on the z sequence instead of plain SGD. The Adam first-moment buffer is replaced by the Schedule-Free interpolation: the gradient is taken at y, the second moment is computed in the usual exponential-moving-average way, and z is updated with the Adam-style preconditioned step. The averaged stream x is again the reported model[^1][^4].
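A simplified sketch of this construction is shown below. It is an illustration under stated assumptions rather than the reference code: it uses a deterministic gradient, equal-weighted averaging, no warmup, standard Adam bias correction on the second moment, and weight decay applied at y; the function name and default values are hypothetical.

```python
import math
from typing import Callable


def schedule_free_adamw(
    grad: Callable[[list[float]], list[float]],
    w0: list[float],
    lr: float = 0.1,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    weight_decay: float = 0.0,
    steps: int = 3000,
) -> list[float]:
    """Adam-style preconditioned step on z, gradient at y, average in x."""
    z = list(w0)
    x = list(w0)
    v = [0.0] * len(w0)  # second-moment EMA; there is no first-moment buffer
    for t in range(1, steps + 1):
        y = [(1.0 - beta1) * zi + beta1 * xi for zi, xi in zip(z, x)]
        g = grad(y)
        bias2 = 1.0 - beta2 ** t
        for i in range(len(z)):
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            denom = math.sqrt(v[i] / bias2) + eps
            # Preconditioned step plus decoupled weight decay (assumed at y).
            z[i] -= lr * (g[i] / denom + weight_decay * y[i])
        c = 1.0 / (t + 1)
        x = [(1.0 - c) * xi + c * zi for xi, zi in zip(x, z)]
    return x


x_adam = schedule_free_adamw(lambda w: [w[0] - 3.0], w0=[0.0])
```

The interpolation parameter `beta1` takes the place of Adam's first-moment coefficient, so the sketch keeps only one moment buffer in addition to z and x.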
The practical hyperparameter guidance in the reference repository differs from standard AdamW[^4]:
- The optimizers accept a warmup_steps argument that linearly ramps up the step size. The ramp does not contradict the "no schedule" claim because it is local to the start and does not require knowing T.

Because of the train/eval mode requirement, the library also provides closure-based variants (SGDScheduleFreeClosure, AdamWScheduleFreeClosure) that compute the gradient at y in a single PyTorch closure and never expose the model to the y parameters between calls, eliminating the need to flip between train and eval modes manually[^4]. A ScheduleFreeWrapper was added later as an experimental wrapper that promotes an arbitrary PyTorch optimizer to a Schedule-Free version by handling the x and z bookkeeping externally and using the wrapped optimizer's step on z[^4].
A RAdamScheduleFree variant, contributed by an external developer, combines Schedule-Free with rectified Adam to eliminate both the schedule and the warmup, since RAdam's rectification term provides an implicit warmup[^4].
MLCommons ran the inaugural AlgoPerf: Training Algorithms Benchmark Competition from 2023 to 2024, with results announced in August 2024[^2]. The benchmark scored optimizer recipes by their wall-clock time to reach a target validation metric on eight fixed and several randomized deep-learning workloads spanning ResNet-50 / ImageNet, ViT / ImageNet (Vision Transformer), U-Net / FastMRI, Conformer / LibriSpeech, GNN / OGBG, DLRM / Criteo, and Transformer / WMT, among others. The competition had two tracks:
- External tuning, in which a submission may search over a limited set of hyperparameter configurations for each workload.
- Self-tuning, in which a submission must run on every workload with no per-workload hyperparameter tuning.

The self-tuning track is far harder because schedules and learning rates that work well on (say) ImageNet are usually wrong by an order of magnitude for WMT.
Schedule-Free AdamW, submitted by Aaron Defazio, Alice Yang, and Konstantin Mishchenko on behalf of Meta and Samsung AI, was the winner of the self-tuning track with a benchmark score of 0.85, and was the only submission in that track to beat the prize-qualification baseline[^2]. According to the MLCommons announcement, Schedule-Free AdamW was approximately 8 percent faster than the self-tuning prize-qualification baseline and roughly 10 percent faster than the external-tuning baseline averaged across the seven workloads on which both algorithms trained to target. The prize itself was not awarded to the team because of an author overlap with the working group leadership, but the result was reported as the first time a single algorithm matched the carefully tuned baselines across the AlgoPerf workload set in a no-tuning setting[^2].
The external-tuning track was won by Distributed Shampoo, submitted by a separate Meta team led by Hao-Jun Michael Shi, with a score of 0.78 and a 28 percent speedup over the external-tuning baseline; the Shampoo team received the 25,000 USD prize[^2]. In total, 18 submissions from 10 teams across academia and industry participated, with over 4,000 training runs across 14 workloads in JAX or PyTorch; all submissions were released under Apache 2.0[^2].
The AlgoPerf result has been widely cited as the strongest single piece of evidence that Schedule-Free is not just a theoretically elegant construction but a practically competitive alternative to tuned cosine schedules across a heterogeneous workload mix, since the competition spanned Transformers, ConvNets, U-Nets, graph networks, and tabular models with no per-workload re-tuning permitted[^2].
"The Road Less Scheduled" was accepted as an Oral presentation at NeurIPS 2024, one of a small number of papers selected for that highest-profile track at the main conference[^3]. The Oral session for the paper was listed as Oral Session 1C in the NeurIPS 2024 program[^3]. While the paper was not on the announced Best Paper or Best Paper Runner-Up shortlist for NeurIPS 2024[^6], the combination of an Oral acceptance with the AlgoPerf self-tuning win established it as one of the most influential optimization papers of the year. The latest arXiv revision (v4) is dated 29 October 2024[^1].
The reference implementation is the open-source facebookresearch/schedule_free repository under Meta's FAIR organization, released alongside the paper under the Apache 2.0 license[^4]. The package is also distributed as schedulefree on PyPI and is installable with pip install schedulefree[^4][^7]. The repository exposes:
- SGDScheduleFree and SGDScheduleFreeReference (memory-efficient and numerically conservative variants of Schedule-Free SGD).
- AdamWScheduleFree and AdamWScheduleFreeReference (Schedule-Free AdamW).
- RAdamScheduleFree (community-contributed RAdam variant).
- ScheduleFreeWrapper (experimental wrapper around an arbitrary base optimizer).

API parameters for SGDScheduleFree include lr (default 1.0), momentum (default 0.9, interpreted as beta), weight_decay (default 0), warmup_steps (default 0), r (default 0.0, an exponent for polynomial weighting of the average), and weight_lr_power (default 2.0, an exponent governing how the per-step weight in the running average grows during warmup)[^8].
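How warmup_steps, r, and weight_lr_power might interact can be sketched as follows. The exact formula is an assumption made for illustration (per-step weight equal to step**r times the largest learning rate seen so far raised to weight_lr_power, normalized by the running sum of weights); consult the repository for the authoritative computation. The function name is hypothetical.

```python
def averaging_coefficients(
    base_lr: float,
    warmup_steps: int,
    steps: int,
    r: float = 0.0,
    weight_lr_power: float = 2.0,
) -> list[tuple[float, float]]:
    """Return (learning rate, averaging coefficient c) for each step."""
    lr_max = 0.0
    weight_sum = 0.0
    out = []
    for k in range(steps):
        # Linear warmup that needs only warmup_steps, never the total horizon T.
        sched = min(1.0, (k + 1) / warmup_steps) if warmup_steps > 0 else 1.0
        lr = base_lr * sched
        lr_max = max(lr_max, lr)
        # Assumed weighting: polynomial in the step, power of the peak lr so far.
        weight = ((k + 1) ** r) * (lr_max ** weight_lr_power)
        weight_sum += weight
        out.append((lr, weight / weight_sum))
    return out


no_warmup = averaging_coefficients(base_lr=1.0, warmup_steps=0, steps=3)
with_warmup = averaging_coefficients(base_lr=1.0, warmup_steps=4, steps=5)
```

Under these assumptions, running without warmup and with r = 0 gives the equal-weighted average, while warmup down-weights the early small-step iterates so that later iterates count for more.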
The JAX-native implementation lives in DeepMind's Optax library under optax.contrib.schedule_free and optax.contrib.schedule_free_adamw, contributed and merged in 2024[^9]. The Optax variant follows the same x, y, z construction and exposes the same hyperparameters; like the PyTorch reference, it must be used with explicit train/eval handling because the parameters seen by the model in training mode (y) differ from those used for evaluation (x)[^9].
Beyond the official PyTorch and Optax implementations, community ports include independent JAX implementations and Schedule-Free variants of other optimizers (for example, a Sophia / Schedule-Free hybrid). Hugging Face Transformers has received community feature requests to expose Schedule-Free SGD and AdamW as first-class trainer options, with discussion focused on whether the train/eval-mode requirement can be hidden behind the existing Trainer API[^10].
Adoption beyond the AlgoPerf benchmark is best characterized as steady research-community uptake rather than full standardization. The Optax merge means that any JAX codebase that already uses Optax can swap in a Schedule-Free optimizer with a single line[^9]. The PyPI package has been picked up by a range of independent reimplementations and benchmark studies; the Schedule-Free recipe is now a standard baseline in optimizer benchmark papers such as "Benchmarking Optimizers for Large Language Model Pretraining" and "Schedulers for Schedule-Free"[^11].
Lead author Aaron Defazio has continued to develop and promote the method, with related work on D-Adaptation (which automates the learning rate but still requires a schedule) and follow-up theoretical results[^1]. The collaboration spans Meta FAIR (Defazio, Yang), Google Research (Mehta), Princeton (Khaled), Boston University (Cutkosky), and Samsung AI (Mishchenko, who has since moved organizations)[^1]. The reference repository is actively maintained as of 2025, with bug fixes and the experimental ScheduleFreeWrapper, RAdamScheduleFree, and weight-decay-at-y additions[^4].
The significance of Schedule-Free can be framed in three layers[^1][^2]:
A subtle consequence is that the practitioner's mental model of the optimizer shifts. With a cosine schedule one tracks "how far through training am I, and what step size does the schedule prescribe?". With Schedule-Free one runs at a constant step size and inspects the averaged iterate at any time; the algorithm itself does not need to know the planned endpoint.
Several limitations and caveats are documented in the paper and in the reference repository[^1][^4]:
- Batch normalization: run roughly 50 batches in model.train() plus optimizer.eval() mode before each evaluation, or alternatively use PreciseBN, to make the running statistics correspond to x rather than y. Without this step, validation numbers can be biased.

Active research on the Schedule-Free family in 2024 to 2025 includes[^5][^11]:
- ScheduleFreeWrapper mechanism[^4].
- warmup_steps parameter, contributed externally to the main repo[^4].

AdamW with cosine schedule has been the dominant recipe for Transformer pretraining since the late 2010s; its key sensitivity is the choice of T and the corresponding decay shape. Lion, the symbolic-search-discovered optimizer from Google in 2023, also targets schedule-aware training and still requires a learning rate decay; it competes with AdamW on memory but does not eliminate scheduling. Distributed Shampoo, the AlgoPerf 2024 external-tuning winner from Meta, is a second-order method that uses a Kronecker-factored preconditioner and also relies on an explicit schedule[^2]. Schedule-Free sits orthogonally to all of these: rather than changing the descent direction (Lion, Shampoo) or the per-parameter scaling (Adam), it changes the averaging of the iterates so that schedule-equivalent guarantees emerge from constant-step-size descent[^1].
Comparison summary:
| Optimizer | Schedule required | Extra hyperparameters vs SGD/AdamW | Targeted regime |
|---|---|---|---|
| SGD + momentum | Yes | None | Vision, classical recipes |
| AdamW + cosine | Yes | T, schedule shape | Transformer pretraining |
| Lion | Yes | None | Memory-efficient training |
| Distributed Shampoo | Yes | Preconditioner block size | External-tuning track of MLCommons AlgoPerf 2024 |
| Schedule-Free AdamW | No | None (beta replaces momentum) | Self-tuning track of MLCommons AlgoPerf 2024 |
Note that "no extra hyperparameters" for Schedule-Free is in the sense of "compared to AdamW with momentum"; the beta and constant gamma replace the existing momentum and base learning rate rather than adding new ones[^1].