Schedule-Free optimizer

Schedule-Free is a family of optimization algorithms for deep learning and convex stochastic optimization that matches or exceeds the performance of tuned learning-rate schedules without specifying a horizon T or any decay schedule in advance. Introduced by Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky in the May 2024 preprint "The Road Less Scheduled" (arXiv:2405.15682)[^1], the method replaces the momentum buffer of standard optimizers with a particular interpolation and averaging of three iterate streams (x, y, z). The Schedule-Free AdamW variant won the self-tuning track of the MLCommons AlgoPerf 2024 benchmark, beating the prize-qualification baseline by about 8 percent on the seven workloads where both algorithms trained successfully[^2]. The paper was selected as an Oral presentation at NeurIPS 2024 and the reference PyTorch implementation lives at the facebookresearch/schedule_free GitHub repository[^3][^4].

Motivation: the learning-rate schedule headache

For most of the modern history of neural network training, the learning rate has been the single most sensitive hyperparameter in the recipe. Practitioners typically pair an optimizer such as Adam or AdamW with a decay schedule (step decay, linear decay, polynomial decay, or, since the late 2010s, cosine decay) that gradually reduces the step size over the course of training[^1]. The choice of schedule is tightly coupled to the choice of total number of training steps T: a cosine schedule that decays to zero at step T needs to know T in advance, and changing T usually requires retuning the schedule and other hyperparameters.

This coupling creates several practical problems[^1]. First, if the practitioner decides to train longer than planned, the original cosine schedule is no longer valid and a new one with a different T must be selected; midway extensions are awkward. Second, comparing optimizers across different budgets is hard because the best schedule for budget T1, truncated to a shorter budget T2 < T1, is not the best schedule for T2. Third, large pretraining runs often have to commit to a stopping time before training begins, and any later change forces either an abrupt switch (potentially destabilizing) or a continued run with a stale schedule.

The Schedule-Free paper frames the target as the Pareto frontier of "loss versus number of steps trained", the curve obtained when cosine and linear decay schedules are each independently tuned to their own T[^1]. Classical Polyak-Ruppert averaging yields a usable averaged model at every step without knowing T, but in practice its loss curve falls short of that frontier. The authors then ask whether there exists a method that follows this Pareto frontier within a single run, without specifying any horizon or schedule, and argue, with theory and experiments, that the answer is yes.

The motivation also has a theoretical side. In classical convex optimization theory, two ways of producing a "final iterate" from stochastic gradient descent are well known: Polyak-Ruppert averaging (averaging all iterates), and primal averaging (in which the gradient at each step is taken at a running average of past iterates). Both have provable convergence rates of O(1/sqrt(T)) for convex Lipschitz problems, but neither matches what well-tuned decaying step sizes do in deep learning practice[^1]. The Schedule-Free construction is, in a precise sense, the family that interpolates between these two extremes through a momentum-like parameter beta.

Background

Polyak-Ruppert averaging

Polyak-Ruppert averaging, originally proposed in the late 1980s and early 1990s for stochastic approximation, says that if one runs stochastic gradient descent with a constant step size, the running mean of the iterates converges at the optimal rate for the convex Lipschitz problem class[^1]. In practice it is almost never used as written in deep learning, because while the average iterate has nice asymptotic properties, the gradients are still computed at the latest unaveraged iterate and the average tends to lag behind, hurting practical performance.

Primal averaging and online-to-batch conversion

Primal averaging, attributed to Nesterov and Tao among others, runs the descent step at the average instead of at the latest iterate: each gradient is evaluated at the running mean. This corresponds, after standard reductions, to the "online-to-batch" conversion in online learning theory, where an online algorithm's regret bound automatically becomes a convergence bound for the average iterate. Primal averaging is theoretically optimal for convex Lipschitz problems with the right step sizes but converges slowly in early iterations because the average is dominated by the (poor) initial iterate for a long time[^1].

Schedules in practice

By the 2020s, cosine decay (sometimes with linear warmup) had become the de facto schedule for Transformer pretraining, and step-wise schedules with warmup were standard for convolutional models. Cosine schedules are popular because empirically they produce a smooth descent followed by fine refinement near the end, and they avoid the abrupt drop of step decay[^1]. The downside is the rigidity already described: cosine ties the optimizer to a fixed budget.

A separate line of work, including stochastic weight averaging (SWA) and "Latest Weight Averaging" (LAWA), proposed averaging the last K iterates as a post-processing step. These methods improve over the final iterate but still rely on an underlying schedule and on choosing a tail window. Schedule-Free can be viewed as a principled generalization of this idea in which averaging is woven into the optimizer itself rather than applied as a post-hoc smoother, and in which no underlying schedule is needed[^1].

How it works

Three iterate streams

Schedule-Free maintains three sequences of parameters[^1][^4]:

z is the "base" iterate, updated by an underlying gradient step (the equivalent of the classical SGD or AdamW state).
y is the "evaluation point" of the gradient: gradients are queried at y, not at z, on every step.
x is the "averaged iterate", a particular weighted average of past z values; x is the sequence reported as the trained model and used for validation and inference.

The relationships between the streams are:

y_t = (1 - beta) z_t + beta x_t
z_{t+1} = z_t - gamma * grad f(y_t)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}, where c_{t+1} = 1 / (t+1) for the basic version

Here gamma is a constant step size (no schedule), beta is a momentum-like interpolation parameter (default 0.9 in the reference implementation), and the weights c_t implement an equal-weighted Polyak average of z[^1][^4]. The key conceptual move is that the gradient is evaluated at y, a convex combination of the base iterate z and the average x, rather than at one or the other. Setting beta to 0 recovers pure Polyak-Ruppert averaging; setting beta to 1 recovers primal averaging; intermediate values like 0.9 give the practical sweet spot[^1].
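The three updates above can be sketched in a few lines of plain Python. This is a minimal illustration on a scalar toy problem, not the reference implementation: the function names schedule_free_sgd and quadratic_grad, the objective f(w) = (w - 3)^2, and the chosen step count are all assumptions made for the example.

```python
def schedule_free_sgd(grad, w0, gamma=0.1, beta=0.9, steps=500):
    """Run basic Schedule-Free SGD on a scalar parameter; return (x, z)."""
    z = w0  # base iterate, moved by plain gradient steps
    x = w0  # averaged iterate, the one reported as the trained model
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # point where the gradient is queried
        z = z - gamma * grad(y)          # constant step size, no schedule
        c = 1.0 / (t + 1)                # equal-weight averaging coefficient
        x = (1.0 - c) * x + c * z        # running average of the z sequence
    return x, z


def quadratic_grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2 (hypothetical example).
    return 2.0 * (w - 3.0)


x_final, z_final = schedule_free_sgd(quadratic_grad, w0=0.0)
print(x_final, z_final)
```

Passing beta=0.0 or beta=1.0 to the same function runs the two extremes described above, with the gradient queried at z or at x respectively.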

Theory: unifying schedules and averaging

The paper's central theoretical contribution is a "Schedule-Free" lemma showing that a particular averaging scheme over a sequence of iterates is equivalent, in convergence-rate terms, to running unaveraged SGD with a tuned schedule that decays linearly to zero at the current step t[^1]. In other words, averaging with a particular increasing-weight scheme on iterates produced by constant-step-size SGD reproduces, at every step, the convergence guarantee of a linearly decaying schedule ending at that step. Because this holds at every t simultaneously, the method is "free" of needing T at the start: you get the right anytime guarantee with a single constant step size.

Concretely, for a convex G-Lipschitz loss function over a convex set of diameter D, Schedule-Free SGD with step size gamma = D / (G sqrt(T)) achieves[^1]:

E[F(x_T) - F(x_star)] <= D G / sqrt(T)

for any beta in [0, 1], where x_star is a minimizer of F. The bound matches the worst-case convergence rate of any first-order method on this problem class. The novelty is not the rate itself but that it is achieved by a single algorithm, for every choice of beta, with a constant step size and no decay schedule tied to a pre-specified endpoint.
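As a numerical sanity check, the bound can be evaluated on a one-dimensional toy problem. The sketch below assumes F(w) = |w| (so G = 1 and the minimizer is 0), a start at w0 = 1 (so D = 1), and deterministic subgradients; the function name schedule_free_sgd_abs is hypothetical, and a single run illustrates the statement rather than proving it.

```python
import math


def schedule_free_sgd_abs(T, beta=0.9, w0=1.0):
    """Schedule-Free SGD on F(w) = |w| with gamma = D / (G * sqrt(T))."""
    G = 1.0                         # |w| is 1-Lipschitz
    D = abs(w0)                     # distance from the start to the minimizer 0
    gamma = D / (G * math.sqrt(T))  # the step size used in the bound
    z = x = w0                      # x_1 = z_1
    for t in range(1, T):           # T - 1 updates produce x_T
        y = (1.0 - beta) * z + beta * x
        g = (y > 0) - (y < 0)       # a subgradient of |w| at y
        z = z - gamma * g
        c = 1.0 / (t + 1)
        x = (1.0 - c) * x + c * z
    return abs(x), D * G / math.sqrt(T)


gap, bound = schedule_free_sgd_abs(T=100)
print(f"F(x_T) - F(x_star) = {gap:.4f}, bound D G / sqrt(T) = {bound:.4f}")
```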

A follow-up paper by Ahn, Magakyan, and Cutkosky (November 2024) extended this story to the nonconvex setting through a "general framework for online-to-nonconvex conversion," proving that Schedule-Free SGD also achieves optimal iteration complexity for nonsmooth nonconvex problems[^5]. Their analysis closes a theoretical gap because the original convex theory did not formally explain Schedule-Free's empirical success on deep neural networks, which are far from convex.

Schedule-Free SGD

Schedule-Free SGD is the simplest concrete algorithm. Compared to SGD with momentum it adds no new hyperparameters: the momentum coefficient becomes the interpolation parameter beta, the learning rate is a constant gamma, and standard weight decay can be applied to z[^1][^4]. The implementation requires one extra parameter buffer beyond standard SGD: in the reference PyTorch code, x and z are stored explicitly and y is computed on demand from x and z each step. Memory cost is therefore comparable to SGD with momentum (which also keeps one extra buffer for the velocity)[^4].

A subtle implementation detail is that the parameters held by the model must be switched between y (for training) and x (for evaluation). The reference repo recommends calling optimizer.train() before training steps and optimizer.eval() before evaluation, so that the model sees y during the forward pass that produces the gradient and x during evaluation[^4]. For models with batch normalization, the running statistics correspond to y by default; the README instructs users to run roughly 50 batches in model.train() plus optimizer.eval() mode before each evaluation so that the BN running mean and variance correspond to x[^4].
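The mode switching described above might be organized as follows. The sketch is deliberately framework-agnostic: it only assumes a model with train(), eval(), and a call operator, and an optimizer with train(), eval(), zero_grad(), and step(), which is the interface the text attributes to the reference optimizers. The helper names train_one_epoch and evaluate, the eval_fn callback, and the no_grad parameter are hypothetical.

```python
import contextlib
import itertools


def train_one_epoch(model, optimizer, loader, loss_fn):
    """Train for one epoch with the optimizer in train mode (parameters at y)."""
    model.train()
    optimizer.train()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()


def evaluate(model, optimizer, train_loader, eval_fn,
             calib_batches=50, no_grad=contextlib.nullcontext):
    """Switch to the averaged parameters x, refresh BN statistics, evaluate."""
    model.train()      # BatchNorm layers update running statistics in this mode
    optimizer.eval()   # parameters now hold the averaged iterate x
    with no_grad():    # with PyTorch one would pass no_grad=torch.no_grad
        for inputs, _ in itertools.islice(train_loader, calib_batches):
            model(inputs)  # forward passes only, to recalibrate BN statistics
    model.eval()
    return eval_fn(model)
```

For models without batch normalization the calibration loop can be skipped by passing calib_batches=0.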

Schedule-Free AdamW

Schedule-Free AdamW applies the same x, y, z construction with the AdamW update on the z sequence instead of plain SGD. The Adam first-moment buffer is replaced by the Schedule-Free interpolation: the gradient is taken at y, the second moment is computed in the usual exponential-moving-average way, and z is updated with the Adam-style preconditioned step. The averaged stream x is again the reported model[^1][^4].

The practical hyperparameter guidance in the reference repository differs from standard AdamW[^4]:

Beta defaults to 0.9 but should be increased toward 0.95 to 0.98 for very long training runs to dampen oscillations late in training.
Learning rates can be 1x to 10x larger than the tuned cosine-AdamW learning rate for the same problem. (For Schedule-Free SGD they recommend 10x to 50x larger than the SGD-with-momentum baseline.)
Warmup is still recommended early in training, via a warmup_steps argument that linearly ramps up the step size. The ramp does not contradict the "no schedule" claim because it is local to the start and does not require knowing T.
Weight decay can be applied either at z (the default, as in classical AdamW) or at y (an experimental option in the wrapper version that the authors report gives better results on some problems).
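To make the AdamW variant and the warmup concrete, here is a scalar sketch of one way the pieces could fit together. It is an illustration under stated assumptions, not the reference code: weight decay is omitted, the Adam bias correction is folded into the step size, and the averaging weight is taken proportional to the squared step size so that warmup steps count for less. The function name and the toy objective are hypothetical.

```python
import math


def schedule_free_adamw_scalar(grad, w0, lr=0.1, beta1=0.9, beta2=0.999,
                               eps=1e-8, warmup_steps=10, steps=2000):
    """Scalar sketch of a Schedule-Free Adam-style update (no weight decay)."""
    z = x = w0
    v = 0.0          # exponential moving average of squared gradients
    lr_sq_sum = 0.0  # running sum of squared step sizes (averaging weights)
    for t in range(1, steps + 1):
        y = (1.0 - beta1) * z + beta1 * x      # gradient is taken at y
        g = grad(y)
        v = beta2 * v + (1.0 - beta2) * g * g  # usual second-moment estimate
        warmup = min(1.0, t / warmup_steps) if warmup_steps > 0 else 1.0
        lr_t = lr * warmup * math.sqrt(1.0 - beta2 ** t)  # warmup + bias fix
        z = z - lr_t * g / (math.sqrt(v) + eps)  # preconditioned step on z
        lr_sq_sum += lr_t * lr_t
        c = lr_t * lr_t / lr_sq_sum              # assumed averaging weight
        x = (1.0 - c) * x + c * z
    return x


# Toy objective f(w) = (w - 3)^2, chosen only for illustration.
x_hat = schedule_free_adamw_scalar(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(x_hat)
```

The warmup factor depends only on the current step count and warmup_steps, never on the total number of steps, which is why it is compatible with the "no schedule" claim.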

Closure and wrapper variants

Because of the train/eval mode requirement, the library also provides closure-based variants (SGDScheduleFreeClosure, AdamWScheduleFreeClosure) that compute the gradient at y in a single PyTorch closure and never expose the model to the y parameters between calls, eliminating the need to flip between train and eval modes manually[^4]. A ScheduleFreeWrapper was added later as an experimental wrapper that promotes an arbitrary PyTorch optimizer to a Schedule-Free version by handling the x and z bookkeeping externally and using the wrapped optimizer's step on z[^4].
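A closure-based training step might look like the sketch below. It uses the standard PyTorch closure pattern (the closure recomputes the loss and its gradients and is passed to step), which is assumed here to be how the closure variants are driven; the helper name closure_train_step is hypothetical and the exact calling convention should be checked against the repository README.

```python
def closure_train_step(model, optimizer, inputs, targets, loss_fn):
    """Perform one optimizer step, letting the optimizer call the closure."""

    def closure():
        # Recompute the loss and gradients whenever the optimizer asks;
        # a closure-based optimizer can place the parameters at y beforehand.
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        return loss

    return optimizer.step(closure)
```

Because the optimizer controls when the closure runs, the caller never has to flip between train and eval modes.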

A RAdamScheduleFree variant, contributed by an external developer, combines Schedule-Free with rectified Adam to eliminate both the schedule and the warmup, since RAdam's rectification term provides an implicit warmup[^4].

AlgoPerf 2024 results

MLCommons ran the inaugural AlgoPerf: Training Algorithms Benchmark Competition from 2023 to 2024, with results announced in August 2024[^2]. The benchmark scored optimizer recipes by their wall-clock time to reach a target validation metric on eight fixed and several randomized deep-learning workloads, including ResNet-50 / ImageNet, ViT / ImageNet (Vision Transformer), U-Net / FastMRI, Conformer / LibriSpeech, GNN / OGBG, DLRM / Criteo, and Transformer / WMT. The competition had two tracks:

External tuning: submissions could be given up to 20 hyperparameter trials per workload, and the best trial's time counted.
Self-tuning: submissions ran with a single hyperparameter setting on each workload; no per-workload retuning was allowed and the algorithm had to "tune itself" through the training run.

The self-tuning track is far harder because schedules and learning rates that work well on (say) ImageNet are usually wrong by an order of magnitude for WMT.

Schedule-Free AdamW, submitted by Aaron Defazio, Alice Yang, and Konstantin Mishchenko on behalf of Meta and Samsung AI, was the winner of the self-tuning track with a benchmark score of 0.85, and was the only submission in that track to beat the prize-qualification baseline[^2]. According to the MLCommons announcement, Schedule-Free AdamW was approximately 8 percent faster than the self-tuning prize-qualification baseline and roughly 10 percent faster than the external-tuning baseline averaged across the seven workloads on which both algorithms trained to target. The prize itself was not awarded to the team because of an author overlap with the working group leadership, but the result was reported as the first time a single algorithm matched the carefully tuned baselines across the AlgoPerf workload set in a no-tuning setting[^2].

The external-tuning track was won by Distributed Shampoo, submitted by a separate Meta team led by Hao-Jun Michael Shi, with a score of 0.78 and a 28 percent speedup over the external-tuning baseline; the Shampoo team received the 25,000 USD prize[^2]. In total, 18 submissions from 10 teams across academia and industry participated, with over 4,000 training runs across 14 workloads in JAX or PyTorch; all submissions were released under Apache 2.0[^2].

The AlgoPerf result has been widely cited as the strongest single piece of evidence that Schedule-Free is not just a theoretically elegant construction but a practically competitive alternative to tuned cosine schedules across a heterogeneous workload mix, since the competition spanned Transformers, ConvNets, U-Nets, graph networks, and tabular models with no per-workload re-tuning permitted[^2].

Recognition and venue

"The Road Less Scheduled" was accepted as an Oral presentation at NeurIPS 2024, one of a small number of papers selected for that highest-profile track at the main conference[^3]. The Oral session for the paper was listed as Oral Session 1C in the NeurIPS 2024 program[^3]. While the paper was not on the announced Best Paper or Best Paper Runner-Up shortlist for NeurIPS 2024[^6], the combination of an Oral acceptance with the AlgoPerf self-tuning win established it as one of the most influential optimization papers of the year. The latest arXiv revision (v4) is dated 29 October 2024[^1].

Implementations

Reference PyTorch implementation

The reference implementation is the open-source facebookresearch/schedule_free repository under Meta's FAIR organization, released alongside the paper under the Apache 2.0 license[^4]. The package is also distributed as schedulefree on PyPI and is installable with pip install schedulefree[^4][^7]. The repository exposes:

SGDScheduleFree and SGDScheduleFreeReference (memory-efficient and numerically conservative variants of Schedule-Free SGD).
AdamWScheduleFree and AdamWScheduleFreeReference (Schedule-Free AdamW).
RAdamScheduleFree (community-contributed RAdam variant).
ScheduleFreeWrapper (experimental wrapper around an arbitrary base optimizer).
Closure-based versions of each.

API parameters for SGDScheduleFree include lr (default 1.0), momentum (default 0.9, interpreted as beta), weight_decay (default 0), warmup_steps (default 0), r (default 0.0, an exponent for polynomial weighting of the average), and weight_lr_power (default 2.0, an exponent governing how the per-step weight in the running average grows during warmup)[^8].
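How r and weight_lr_power could shape the averaging coefficient is sketched below. The formula (per-step weight equal to (k+1)^r times the largest step size seen so far raised to weight_lr_power) is an assumption modelled on the parameter descriptions above and should be checked against the repository; the function name averaging_coefficients is hypothetical.

```python
def averaging_coefficients(lr, steps, warmup_steps=0, r=0.0,
                           weight_lr_power=2.0):
    """Return the coefficients c_1..c_steps used to fold z into the average x."""
    coeffs = []
    lr_max = 0.0
    weight_sum = 0.0
    for k in range(steps):
        # Linear warmup factor; equal to 1 when warmup is disabled.
        sched = min(1.0, (k + 1) / warmup_steps) if warmup_steps > 0 else 1.0
        lr_max = max(lr_max, lr * sched)
        # Assumed weighting: polynomial in the step index, power of the lr.
        weight = ((k + 1) ** r) * (lr_max ** weight_lr_power)
        weight_sum += weight
        coeffs.append(weight / weight_sum)
    return coeffs


plain = averaging_coefficients(lr=1.0, steps=5)
warm = averaging_coefficients(lr=1.0, steps=5, warmup_steps=4)
print(plain)
print(warm)
```

With the defaults and no warmup the coefficients reduce to 1 / (k+1), the equal-weight average of the basic version; with warmup, the early low-step-size iterates receive less weight.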

JAX / Optax implementation

The JAX-native implementation lives in DeepMind's Optax library under optax.contrib.schedule_free and optax.contrib.schedule_free_adamw, contributed and merged in 2024[^9]. The Optax variant follows the same x, y, z construction and exposes the same hyperparameters; like the PyTorch reference, it must be used with explicit train/eval handling because the parameters seen by the model in training mode (y) differ from those used for evaluation (x)[^9].

Other ports

Beyond the official PyTorch and Optax implementations, community ports include independent JAX implementations and Schedule-Free variants of other optimizers (for example, a Sophia / Schedule-Free hybrid). Hugging Face Transformers has received community feature requests to expose Schedule-Free SGD and AdamW as first-class trainer options, with discussion focused on whether the train/eval-mode requirement can be hidden behind the existing Trainer API[^10].

Adoption

Adoption beyond the AlgoPerf benchmark is best characterized as steady research-community uptake rather than full standardization. The Optax merge means that any JAX codebase that already uses Optax can swap in a Schedule-Free optimizer with a single line[^9]. The method has been picked up in independent reimplementations and benchmark studies; the Schedule-Free recipe now appears as a common baseline in optimizer papers such as "Benchmarking Optimizers for Large Language Model Pretraining" and "Schedulers for Schedule-Free"[^11].

Lead author Aaron Defazio has continued to develop and promote the method, with related work on D-Adaptation (which automates the learning rate but still requires a schedule) and follow-up theoretical results[^1]. The collaboration spans Meta FAIR (Defazio, Yang), Google Research (Mehta), Princeton (Khaled), Boston University (Cutkosky), and Samsung AI (Mishchenko, who has since moved organizations)[^1]. The reference repository is actively maintained as of 2025, with bug fixes and the experimental ScheduleFreeWrapper, RAdamScheduleFree, and weight-decay-at-y additions[^4].

Significance

The significance of Schedule-Free can be framed in three layers[^1][^2]:

Practical. The method removes one of the most fragile parts of a deep learning training recipe, namely the dependency of the learning rate schedule on the total step count T. For continuous-training settings (where models keep training as new data arrives), curriculum-style recipes, or any situation where T is unknown or might change, this is a direct quality-of-life improvement.
Empirical. AlgoPerf 2024 demonstrated that the method is not just convenient but actually competitive with carefully tuned schedules on a workload mix spanning vision, language, speech, graphs, and recommendation, while using a single hyperparameter setting across workloads[^2].
Theoretical. The unification of Polyak-Ruppert averaging and primal averaging via a beta-parameterized family clarifies a long-standing question about why averaging-based methods enjoy convex-optimal rates but practical decay schedules outperform them in deep learning; Schedule-Free is the family in between, and the analysis shows it inherits the rate guarantee at every t with no schedule[^1].

A subtle consequence is that the practitioner's mental model of the optimizer shifts. With a cosine schedule one tracks "how far through training am I, and what step size does the schedule prescribe?". With Schedule-Free one runs at a constant step size and inspects the averaged iterate at any time; the algorithm itself does not need to know the planned endpoint.

Limitations

Several limitations and caveats are documented in the paper and in the reference repository[^1][^4]:

The train and eval modes of the optimizer must be set explicitly because the parameters used for the forward pass during training (y) differ from those reported as the model (x). This is invisible in PyTorch optimizer APIs and easy to forget; the closure-based variants exist precisely to avoid this footgun.
Models with batch normalization require an extra ~50 batches of "calibration" in model.train() plus optimizer.eval() mode before each evaluation, or alternatively use PreciseBN, to make the running statistics correspond to x rather than y. Without this step, validation numbers can be biased.
The beta hyperparameter requires more careful tuning than classical momentum: for very long runs, the default of 0.9 may be too small, and the repo recommends 0.95 to 0.98 for extended training[^4]. Plain momentum is less sensitive in this respect, since 0.9 is usually treated as a safe default for the entire run.
Learning rates are not directly transferable from cosine-schedule recipes; they typically need to be increased 1x to 10x for AdamW or 10x to 50x for SGD. Reusing the cosine learning rate verbatim can substantially under-tune Schedule-Free[^4].
The method is not parameter-free: a constant gamma, beta, and weight decay still have to be set, and the optimal values are not the same as those of the corresponding scheduled optimizer.
Schedule-Free is not always strictly better. Several follow-up benchmark papers find that on certain workloads with very large batch sizes or highly engineered cosine recipes, the best tuned cosine + AdamW still ties or beats Schedule-Free AdamW; the gain is most consistent in the self-tuning regime[^11].
The 50-batch BatchNorm calibration is workload-dependent and can be expensive for models that evaluate frequently.

Variants and extensions

Active research on the Schedule-Free family in 2024 to 2025 includes[^5][^11]:

Schedule-Free + nonconvex theory. Ahn, Magakyan, and Cutkosky (arXiv:2411.07061, November 2024) show that the same construction enjoys optimal complexity for nonsmooth nonconvex problems, via a general "online-to-nonconvex" conversion, closing the gap between the original convex theory and the deep-learning practice.
Hybrids with other optimizers. Community work has combined Schedule-Free with Sophia, with Adam variants, and with second-order methods through the ScheduleFreeWrapper mechanism[^4].
Theoretical hyperparameter prescriptions. "Schedulers for Schedule-Free" (arXiv:2511.07767) proposes principled choices for the beta parameter and step size based on the theoretical analysis, aiming to remove the small remaining tuning burden[^11].
RAdam-style warmup-free variants that combine Schedule-Free with rectified Adam to eliminate the explicit warmup_steps parameter, contributed externally to the main repo[^4].

Comparison to other modern optimizers

AdamW with a cosine schedule has been the dominant recipe for Transformer pretraining since the late 2010s; its key sensitivity is the choice of T and the corresponding decay shape. Lion, the optimizer discovered through symbolic search at Google in 2023, is likewise run with a learning-rate decay schedule; it uses less memory than AdamW but does not eliminate scheduling. Distributed Shampoo, the AlgoPerf 2024 external-tuning winner from Meta, is a second-order method that uses a Kronecker-factored preconditioner and also relies on an explicit schedule[^2]. Schedule-Free is orthogonal to all of these: rather than changing the descent direction (Lion, Shampoo) or the per-parameter scaling (Adam), it changes the averaging of the iterates so that schedule-equivalent guarantees emerge from constant-step-size descent[^1].

Comparison summary:

Optimizer	Schedule required	Extra hyperparameters vs SGD/AdamW	Targeted regime
SGD + momentum	Yes	None	Vision, classical recipes
AdamW + cosine	Yes	T, schedule shape	Transformer pretraining
Lion	Yes	None	Memory-efficient training
Distributed Shampoo	Yes	Preconditioner block size	External-tuning track of MLCommons AlgoPerf 2024
Schedule-Free AdamW	No	None (beta replaces momentum)	Self-tuning track of MLCommons AlgoPerf 2024

Note that "no extra hyperparameters" for Schedule-Free is in the sense of "compared to AdamW with momentum"; the beta and constant gamma replace the existing momentum and base learning rate rather than adding new ones[^1].

Related Work

Polyak-Ruppert averaging: a 1990s technique for stochastic approximation where the running mean of SGD iterates achieves optimal convex rates; Schedule-Free generalizes this by interpolating with primal averaging[^1].
Primal averaging / online-to-batch conversion: a tool from online learning that converts online regret guarantees into convergence guarantees for the average iterate. Schedule-Free's theory is built on a refined online-to-batch lemma[^1].
Stochastic Weight Averaging (SWA) and tail averaging methods: post-hoc averages of the last K iterates from a scheduled run. Conceptually related but require a base schedule and a window to be chosen; Schedule-Free folds the averaging into the optimizer.
D-Adaptation (Defazio and Mishchenko, ICML 2023): a learning-rate-free method that estimates the optimal step size online but still requires a schedule. D-Adaptation and Schedule-Free target different axes of hyperparameter elimination and have been combined.
The AlgoPerf benchmark family is a sibling thread: a methodology to score training algorithms head-to-head, which provided the empirical proving ground for Schedule-Free[^2].

References
