The Road Less Scheduled

Aaron Defazio (1), Fundamental AI Research Team, Meta
Xingyu (Alice) Yang (2), Fundamental AI Research Team, Meta
Harsh Mehta, Google Research
Konstantin Mishchenko, Samsung AI Center
Ahmed Khaled, Princeton University
Ashok Cutkosky (3), Boston University
(1) Research Co-lead  (2) Engineering Co-lead  (3) Senior Author

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Existing learning rate schedules that do not require specification of the optimization stopping step T are greatly out-performed by learning rate schedules that depend on T. We propose an approach that avoids the need for this stopping time by eschewing the use of schedules entirely, while exhibiting state-of-the-art performance compared to schedules across a wide family of problems ranging from convex problems to large-scale deep learning problems. Our Schedule-Free approach introduces no additional hyper-parameters over standard optimizers with momentum. Our method is a direct consequence of a new theory we develop that unifies scheduling and iterate averaging. An open source implementation of our method is available at https://github.com/facebookresearch/schedule_free. Schedule-Free AdamW is the core algorithm behind our winning entry to the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning track.

1 Introduction

The theory of optimization, as applied in machine learning, has been successful at providing precise, prescriptive results for many problems. However, even in the simplest setting of stochastic gradient descent (SGD) applied to convex Lipschitz functions, there are glaring gaps between what our current theory prescribes and the methods used in practice. Consider the stochastic gradient descent (SGD) step with step size \gamma > 0,

z_{t+1} = z_t - \gamma g_t,

where g_t is the stochastic (sub-)gradient at time t, computed at the point z_t (formally defined in Section 1.1) of a convex Lipschitz function f. Although this is standard practice for many classes of problems, classical convergence theory suggests that the expected loss of this z sequence is suboptimal, and that the Polyak-Ruppert (PR) average x of the sequence should be returned instead (Polyak, 1990; Ruppert, 1988):

z_{t+1} = z_t - \gamma g_t,    (1)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},    (2)

where using c_{t+1} = 1/(t+1) results in x_T = \frac{1}{T}\sum_{t=1}^{T} z_t (a minimal code sketch of this update appears below). Despite their theoretical optimality, PR averages give much worse results in practice than using the last iterate of SGD (Figures 2a, 8), a folklore result in the field of optimization, and a large theory-practice gap that is often attributed to the mismatch between this simplified problem class and the complexity of problems addressed in practice.

[Figure 1: Schedule-Free methods (black) closely track the Pareto frontier of loss vs. training time in a single run. Both Schedule-Free SGD (left, ILSVRC 2012 ImageNet test accuracy with a ResNet-50) and AdamW (right, OpenWebText with GPT-2 124M) match or exceed the performance of cosine learning rate schedules of varying lengths (red).]

Recently, Zamani and Glineur (2023) and Defazio et al. (2023) showed that the exact worst-case optimal rates can be achieved via carefully chosen learning rate sequences (also known as schedules) alone, without the use of averaging. This result suggests that schedules have, in some sense, the same role to play as PR averaging in optimization. However, schedules have a critical disadvantage: they require setting the optimization stopping time T in advance.
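For concreteness, the following minimal sketch contrasts the last SGD iterate with its Polyak-Ruppert running average, updates (1)-(2). The one-dimensional objective E|x - zeta| and the fixed step size are illustrative assumptions, not an experiment from the paper.

```python
# A minimal sketch of SGD with a Polyak-Ruppert running average, (1)-(2).
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.1, 10_000
z, x = 5.0, 5.0                      # z_1 = x_1
for t in range(1, T + 1):
    g = np.sign(z - rng.normal())    # stochastic subgradient of E|x - zeta| at z_t
    z = z - gamma * g                # (1) SGD step
    c = 1.0 / (t + 1)
    x = (1 - c) * x + c * z          # (2) Polyak-Ruppert running average
print(f"last iterate z_T = {z:.3f}, PR average x_T = {x:.3f}")  # compare the two
```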
Motivated by the theory-practice gap for Polyak-Ruppert averaging, we ask the following question: Do there exist iterate averaging approaches that match the empirical performance of learning rate schedules, without sacrificing theoretical guarantees? By developing a new link between averaging and learning rate sequences, we introduce a new approach to averaging that maintains the worst-case convergence rate theory of PR averaging, while matching and often exceeding the performance of schedule-based approaches, firmly answering this question in the affirmative.

Summary of Results

- Our approach does not require the stopping time T to be known or set in advance. It closely tracks the Pareto frontier of loss versus training time during a single training run (Figure 1), while requiring no additional hyper-parameters over the base SGD (with momentum) or Adam optimizer.
- Our approach uses an alternative form of momentum that replaces traditional momentum. This form has appealing theoretical properties: it is worst-case optimal for any choice of the momentum parameter in the convex Lipschitz setting, a property that does not hold for traditional momentum.
- Our key theoretical result is a new online-to-batch conversion theorem, which establishes the optimality of our method while also unifying several existing online-to-batch theorems.
- We perform, to our knowledge, one of the largest machine learning optimization algorithm evaluations to date, consisting of 28 problems, ranging from logistic regression to large-scale deep learning problems. This evaluation contains more distinct and diverse large-scale machine-learning problems than any other optimizer evaluation we are aware of in the literature. Schedule-Free methods show strong performance, matching or out-performing heavily-tuned cosine schedules.
- Schedule-Free AdamW won the MLCommons 2024 AlgoPerf Algorithmic Efficiency Challenge Self-Tuning (Adaptive Algorithm) Track, providing independent verification of its SOTA performance against other optimization algorithms in cases where hyper-parameter tuning is limited. We provide details of our entry and plots comparing it to the competition baseline.

1.1 Notation

Consider the stochastic convex minimization \min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}_\zeta[f(x, \zeta)], where each f(x, \zeta) is Lipschitz and convex in x, and the expectation is taken over the random variable \zeta. With a slight abuse of notation, we assume we are given, at time step t and any point y that we choose, an arbitrary sub-gradient \nabla f(y, \zeta_t) from the sub-differential of f.

[Figure 2: (a) IWSLT14 (LSTM) test loss over training: Schedule-Free learning (4.18, SE 0.001) converges faster than classical averaging approaches, Polyak Averaging (4.35, SE 0.004) and Primal Averaging (4.44, SE 0.001), often out-performing tuned schedules (Cosine Schedule: 4.23, SE 0.003). (b) Minimal loss for different β and γ values: incorporating the momentum parameter β allows for convergence despite using larger learning rates γ on quadratic problems.]

We propose the following method, which we call Schedule-Free SGD:

y_t = (1 - \beta) z_t + \beta x_t,    (3)
z_{t+1} = z_t - \gamma \nabla f(y_t, \zeta_t),    (4)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},    (5)

where c_{t+1} = 1/(t+1) and z_1 = x_1 is the initial point. Note that, with this weighting, the x sequence is just a running average of the z sequence. The y sequence is the gradient location sequence (gradients are evaluated at y at each step). The z sequence is the base sequence, where the base optimizer's update is performed (in this case SGD). The x sequence is the evaluation sequence: our best estimate of the weights so far.
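For concreteness, here is a minimal runnable sketch of updates (3)-(5) on a toy stochastic convex problem. The quadratic objective, noise level, and hyper-parameter values are illustrative assumptions, not the paper's experimental setup.

```python
# A minimal sketch of Schedule-Free SGD, equations (3)-(5), on a toy
# stochastic convex problem. Setting beta = 0 recovers Polyak-Ruppert
# averaging and beta = 1 recovers primal averaging.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
A = np.diag(rng.uniform(0.1, 1.0, size=dim))
b = rng.normal(size=dim)

def stochastic_grad(y):
    # noisy gradient of 0.5 y'Ay - b'y, standing in for grad f(y, zeta_t)
    return A @ y - b + 0.1 * rng.normal(size=dim)

gamma, beta, T = 0.5, 0.9, 1000
z = np.zeros(dim)            # base sequence, z_1
x = z.copy()                 # evaluation sequence, x_1 = z_1
for t in range(1, T + 1):
    y = (1 - beta) * z + beta * x         # (3) gradient is evaluated at y
    z = z - gamma * stochastic_grad(y)    # (4) base SGD step
    c = 1.0 / (t + 1)
    x = (1 - c) * x + c * z               # (5) running average of the z sequence
# the model is evaluated/returned at x
```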
This method has a momentum parameter β that interpolates between Polyak-Ruppert averaging (β = 0) and primal averaging (β = 1). Primal averaging (Nesterov and Shikhman, 2015; Tao et al., 2018; Cutkosky, 2019; Kavis et al., 2019; Sebbouh et al., 2021; Defazio and Gower, 2021; Defazio and Jelassi, 2022) is an approach where the gradient is evaluated at the averaged point x, instead of z:

z_{t+1} = z_t - \gamma \nabla f(x_t, \zeta_t),    (6)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}.    (7)

This approach maintains the worst-case optimality of PR averaging but is generally considered to converge too slowly to be practical (Figures 2a, 8). The advantage of our interpolation is that we get the best of both worlds. We can achieve the fast convergence of Polyak-Ruppert averaging (since the z sequence moves much more quickly than the x sequence), while still keeping some coupling between the returned sequence x and the gradient-evaluation locations y, which increases stability. Values of β similar to standard momentum values, β ≈ 0.9, appear to work well in practice. We will use the notation α = 1 − β when convenient.

In this formulation, β = 0.9 gives the practical advantages of momentum, dampening the immediate impact of large gradients and resulting in more stable training. To see this, notice that the immediate effect of the gradient g_t at step t is to introduce (1 − β) g_t = 0.1 g_t into the iterate sequence y. This is similar to exponential-moving-average (EMA) momentum, where (1 − β) g_t is also added into the iterate sequence on step t. However, here the remainder of g_t is added into y only very slowly over time, via its place in the average x, whereas with an EMA with β = 0.9 the majority of the gradient is incorporated within the next 10 steps (a short calculation making this precise is given at the end of this subsection). From this viewpoint, the Schedule-Free updates can be seen as a version of momentum that has the same immediate effect, but with a greater delay for adding in the remainder of the gradient.

This form of momentum (by interpolation) also has a striking advantage: it does not result in any theoretical slowdown; it gives the optimal worst-case convergence (Nesterov, 2013) for the non-smooth convex setting (including constants), for any choice of momentum β between 0 and 1 inclusive:

Theorem 1. Suppose F is a convex function, and \zeta_1, \ldots, \zeta_T is an i.i.d. sequence of random variables such that F(x) = \mathbb{E}[f(x, \zeta)] for some function f that is G-Lipschitz in x. For any minimizer x_\star, define D = \|x_1 - x_\star\| and \gamma = D/(G\sqrt{T}). Then for any \beta \in [0, 1], Schedule-Free SGD ensures:

\mathbb{E}[F(x_T) - F(x_\star)] \le \frac{DG}{\sqrt{T}}.

In contrast, exponential-moving-average momentum in the non-smooth setting actually hurts the theoretical worst-case convergence rate. The Schedule-Free approach maintains the advantages of momentum (Sutskever et al., 2013) without the potential worst-case slow-down.
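To make the comparison with EMA momentum concrete, the following short calculation (ours, not from the paper; it follows directly from updates (3)-(5) with constant β and c_{t+1} = 1/(t+1)) tracks how a single gradient g_t enters the gradient-evaluation sequence y. Immediately after the step, the change in y attributable to g_t is

\Delta y_{t+1} = -(1-\beta)\gamma g_t - \frac{\beta}{t+1}\gamma g_t \approx -(1-\beta)\gamma g_t,

and after k further steps the accumulated contribution of g_t to y_{t+k} is

-\gamma g_t\left[(1-\beta) + \beta\,\frac{k}{t+k}\right],

which approaches the full step -\gamma g_t only once k is large relative to the elapsed time t. Under EMA momentum, m_t = \beta m_{t-1} + (1-\beta) g_t with x_{t+1} = x_t - \gamma m_t, the immediate effect is the same -(1-\beta)\gamma g_t, but the accumulated contribution after k further steps is -\gamma(1-\beta^{k+1}) g_t, so the bulk of g_t is applied within a few multiples of 1/(1-\beta) \approx 10 steps for β = 0.9.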
2.1 General Theory

The method analyzed in Theorem 1 is actually a special case of a more general result that incorporates arbitrary online optimization algorithms rather than only SGD, as well as arbitrary time-varying sequences \beta_t. The proof is provided in Appendix A.

Theorem 2. Let F be a convex function. Let \zeta_1, \ldots, \zeta_T be an i.i.d. sequence such that F(x) = \mathbb{E}_\zeta[f(x, \zeta)]. Let z_1, \ldots, z_T be arbitrary vectors and let w_1, \ldots, w_T and \beta_1, \ldots, \beta_T be arbitrary numbers in [0, 1] such that z_t, w_t and \beta_t are independent of \zeta_t, \ldots, \zeta_T. Set:

x_t = \frac{\sum_{i=1}^t w_i z_i}{\sum_{i=1}^t w_i} = \left(1 - \frac{w_t}{\sum_{i=1}^t w_i}\right) x_{t-1} + \frac{w_t}{\sum_{i=1}^t w_i} z_t,
y_t = \beta_t x_t + (1 - \beta_t) z_t,    (10)
g_t = \nabla f(y_t, \zeta_t).    (11)

Then we have for all x_\star:

\mathbb{E}[F(x_T) - F(x_\star)] \le \frac{\mathbb{E}\left[\sum_{t=1}^T w_t \langle g_t, z_t - x_\star \rangle\right]}{\sum_{i=1}^T w_i}.    (12)

To recover Theorem 1 from the above result, notice that the algorithm analyzed by Theorem 1 is captured by Theorem 2 with w_t = 1, \beta_t a constant β, and z_{t+1} = z_t - \gamma g_t for all t. Next, observe that the sequence z_1, \ldots, z_T is performing online gradient descent (Zinkevich, 2003), for which it is well known that the regret \sum_{t=1}^T \langle g_t, z_t - x_\star \rangle (appearing in the numerator of our result) is bounded by DG\sqrt{T}; dividing by \sum_{i=1}^T w_i = T then gives the DG/\sqrt{T} bound of Theorem 1. The regret is the principal object of study in online convex optimization (Hazan, 2022; Orabona, 2019).

Viewed in this light, Theorem 2 provides a way to convert an online convex optimization algorithm into a stochastic optimization algorithm: it is a form of online-to-batch conversion (Cesa-Bianchi et al., 2004). Classical online-to-batch conversions are a standard technique for obtaining convergence bounds for many stochastic optimization algorithms, including stochastic gradient descent (Zinkevich, 2003), AdaGrad (Duchi et al., 2011), AMSGrad (Reddi et al., 2018), and Adam (Kingma and Ba, 2014). All of these algorithms can be analyzed as online convex optimization algorithms: they provide bounds on the regret \sum_{t=1}^T \langle g_t, z_t - x_\star \rangle rather than direct convergence guarantees. It is then necessary (although sometimes left unstated) to convert these regret bounds into stochastic convergence guarantees via an online-to-batch conversion. Our result provides a more versatile method for effecting this conversion.

Theorem 2 actually provides a grand unification of a number of different online-to-batch conversions that have been proposed over the years (a code sketch of the general conversion is given at the end of this subsection). Most of these conversion methods were first developed specifically to provide convergence analysis for SGD (or some variant such as dual averaging or mirror descent), and then generalized into techniques that apply to any online convex optimization algorithm. For example, the classical Polyak averaging method can be generalized to form the standard online-to-batch conversion of Cesa-Bianchi et al. (2004), and is immediately recovered from Theorem 2 by setting w_t = 1 and \beta_t = 0 for all t. More recently, Nesterov and Shikhman (2015) and Tao et al. (2018) derived an alternative to Polyak averaging that was later generalized to work with arbitrary online convex optimization algorithms by Cutkosky (2019) and Kavis et al. (2019), and then observed to actually be equivalent to heavy-ball momentum by Defazio (2020); Defazio and Gower (2021); Defazio and Jelassi (2022). This method is recovered by our Theorem 2 by setting w_t = 1 and \beta_t = 1 for all t. Finally, very recently Zamani and Glineur (2023) discovered that gradient descent with a linear decay stepsize provides a last-iterate convergence guarantee, which was again generalized to an online-to-batch conversion by Defazio et al. (2023). This final result is also recovered by Theorem 2 by setting w_t = 1 and \beta_t = t/T (see Appendix B).

In Appendix C, we give a further tightening of Theorem 2: it can be improved to an equality by precisely tracking additional terms that appear on the right-hand side. This tightened version can be used to show convergence rate results for smooth losses, both with and without strong-convexity. As an example application, we show that schedule-free optimistic-gradient methods (Rakhlin and Sridharan, 2013) converge with accelerated rates:

\mathbb{E}[F(x_T) - F(x_\star)] = O\!\left(\frac{D^2 L}{T^2} + \frac{D\sigma}{\sqrt{T}}\right).
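The following is a minimal sketch of the conversion defined by Theorem 2, written so that the (w_t, \beta_t) choices discussed above can be swapped in. The base learner (plain online gradient descent on a random quadratic), the step size, and the horizon are illustrative assumptions, not part of the theorem.

```python
# A minimal sketch of the Theorem 2 online-to-batch conversion: maintain the
# weighted average x_t of the base iterates z_t and evaluate gradients at
# y_t = beta_t * x_t + (1 - beta_t) * z_t.
import numpy as np

def convert(w_fn, beta_fn, T, gamma=0.1, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    A = np.diag(rng.uniform(0.1, 1.0, size=dim))
    b = rng.normal(size=dim)
    grad = lambda p: A @ p - b                       # stands in for grad f(y_t, zeta_t)
    z = np.zeros(dim)
    x = z.copy()
    w_sum = w_fn(1)
    for t in range(1, T + 1):
        y = beta_fn(t) * x + (1 - beta_fn(t)) * z    # (10) gradient point
        z = z - gamma * grad(y)                      # base online learner: OGD step
        w_sum += w_fn(t + 1)
        c = w_fn(t + 1) / w_sum
        x = (1 - c) * x + c * z                      # weighted average of the z iterates
    return x

T = 1000
x_polyak = convert(lambda t: 1.0, lambda t: 0.0, T)      # Polyak-Ruppert (w_t = 1, beta_t = 0)
x_primal = convert(lambda t: 1.0, lambda t: 1.0, T)      # primal averaging (w_t = 1, beta_t = 1)
x_linear = convert(lambda t: 1.0, lambda t: t / T, T)    # linear-decay conversion (beta_t = t/T)
```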
2.2 On Large Learning Rates

Under classical worst-case convergence theory, the optimal choice of γ for a fixed training duration T is \gamma = D/(G\sqrt{T}). This is the rate used in our bounds for Theorem 1 above. For any-time convergence (i.e. when stopping is allowed at any timestep), our proposed method can, in theory, be used with the standard learning rate sequence

\gamma_t = \frac{D}{G\sqrt{t}}.

However, learning rate sequences of this form have poor practical performance (Defazio et al., 2023). Instead, much larger steps of the form \gamma = D/G give far better performance across virtually all problems in applications (Defazio and Mishchenko, 2023), another theory-practice mismatch that is virtually undiscussed in the literature. Existing theory suggests that this step size is too large to give O(1/\sqrt{T}) convergence; however, as we show below, there is an important special case where such large step sizes also give optimal rates up to constant factors.

Theorem 3. Consider the online learning setting with bounded gradients g_t. Let z_{t+1} = z_t - \gamma g_t. Let D = \|z_1 - z_\star\| for an arbitrary reference point z_\star, and define G = \max_{t \le T} \|g_t\|. Suppose that the chosen step size is \gamma = D/G. Then, if it holds that

\sum_{t=1}^T \langle g_t, z_t - z_1 \rangle \le D \sqrt{\sum_{t=1}^T \|g_t\|^2},    (15)

we have

\sum_{t=1}^T \langle g_t, z_t - z_\star \rangle = O\big(DG\sqrt{T}\big).

This regret bound for SGD implies a convergence rate bound for Schedule-Free SGD by application of our online-to-batch conversion. Condition (15) can be checked during a training run (using reference point z_\star = x_T, and so D = \|x_1 - x_T\|), and we find that it holds for every problem we consider in our experiments in Section 4. More generally, the full conditions under which large learning rates can be used are not yet fully understood for stochastic problems. In the quadratic case, Bach and Moulines (2013) established that large fixed step sizes give optimal convergence rates, and we conjecture that the success of large learning rates may be attributed to asymptotically quadratic behavior of the learning process.

Empirically, we find that Schedule-Free momentum enables the use of larger learning rates γ even in quadratic minimization problems f(x) = \frac{1}{2} x^\top A x - b^\top x. We generate 10 different such 20-dimensional problems with eigenvalues drawn log-uniformly in [10^{-6}, 1]. We plot the average minimal loss achieved as a function of the two parameters β and γ in Figure 2b. We can see that when the learning rate is small, the choice of β has little to no effect on the convergence of the algorithm. However, when γ is large, choosing β < 1 becomes crucial to achieving convergence.
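A minimal sketch of this quadratic experiment follows. The grid values, iteration count, and the use of exact (noise-free) gradients are illustrative assumptions; only the problem family (20-dimensional quadratics with eigenvalues drawn log-uniformly in [1e-6, 1], 10 problems, tracking the minimal loss reached) is taken from the description above.

```python
# A minimal sketch of the beta/gamma sweep of Figure 2b: Schedule-Free SGD
# on f(x) = 0.5 x'Ax - b'x, recording the minimal loss reached per run.
import numpy as np

rng = np.random.default_rng(0)
dim, steps = 20, 2000

def run(beta, gamma, A, b):
    z = np.zeros(dim)
    x = z.copy()
    best = np.inf
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x
        z = z - gamma * (A @ y - b)       # base SGD step using the gradient at y
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z
        best = min(best, 0.5 * x @ A @ x - b @ x)
    return best

results = {}
for beta in [0.0, 0.5, 0.9, 0.98]:
    for gamma in [0.1, 1.0, 10.0, 100.0]:
        losses = []
        for _ in range(10):                # 10 random problems per (beta, gamma) cell
            eig = np.exp(rng.uniform(np.log(1e-6), 0.0, dim))
            A = np.diag(eig)
            b = rng.normal(size=dim)
            losses.append(run(beta, gamma, A, b))
        results[(beta, gamma)] = np.mean(losses)
# results[(beta, gamma)] approximates the loss surface plotted in Figure 2b.
```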
3 Related Work

The proposed method has a striking resemblance to Nesterov's accelerated method (Nesterov, 1983, 2013) for L-smooth functions, which can be written in the AC-SA form (Lan, 2012):

y_t = (1 - c_{t+1}) x_t + c_{t+1} z_t,    (17)
z_{t+1} = z_t - \frac{t+1}{2L} \nabla f(y_t),    (18)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},    (19)

where c_{t+1} = 2/(t+2). This averaging constant, and more generally

c_{t+1} = \frac{r+1}{t+r+1}    (20)

for any real r > -1, is equivalent to the weighted average (Shamir and Zhang, 2013; Defazio and Gower, 2021) x_T \propto \sum_{t=1}^T t^{\underline{r}} z_t, where t^{\underline{r}} denotes the rth factorial power of t. Our framework is compatible with factorial power averages without sacrificing theoretical guarantees.

Our approach differs from conventional accelerated methods by using a different weight for the y_t and x_t interpolations: we use a constant weight for y_t and a decreasing weight for x_t. Accelerated methods for strongly-convex problems use a constant weight for both, and those for non-strongly-convex problems use a decreasing weight for both, so our approach does not directly correspond to either class of accelerated method. Accelerated methods also use a much larger step size for the z_t sequence than our approach.

The use of equal-weighted averages is less common than the use of exponential weighting in the practical deep learning optimization literature. Exponential moving averages (EMA) of the iterate sequence are used in the popular Lookahead optimizer (Zhang et al., 2019). In the case of SGD, it performs i = 1, \ldots, k inner steps:

z_{t,i} = z_{t,i-1} - \gamma \nabla f(z_{t,i-1}),    (21)

followed by an outer step:

x_t = x_{t-1} + \alpha (z_{t,k} - x_{t-1}).    (22)

The inner optimizer then restarts from the updated slow weights, z_{t+1,0} = x_t (a short code sketch of this inner/outer structure is given at the end of this section). The Lookahead method can be seen as the EMA version of primal averaging, just as exponential weight averaging is the EMA version of Polyak-Ruppert averaging.

Tail averaging, either using an exponential moving average or an equal-weighted average, is a common folklore technique that often yields a practical improvement. For instance, this kind of averaging is used without citation by the influential work of Szegedy et al. (2016): "Model evaluations are performed using a running average of the parameters computed over time", and by Vaswani et al. (2017): "...averaged the last 20 checkpoints". Tail averages are typically Polyak-Ruppert-style averages, as the average is not used for gradient evaluations during training. More sophisticated tail-averaging approaches such as Stochastic Weight Averaging (Izmailov et al., 2018) and Latest Weight Averaging (LAWA) (Kaddour, 2022; Sanyal et al., 2023) combine averaging with large or cyclic learning rates. They are not a replacement for scheduling; instead they aim to improve final test metrics. They generally introduce additional hyper-parameters to tune, and require additional memory. It is possible to use SWA and LAWA on top of our approach, potentially giving further gains.

Sandler et al. (2023) show via a stochastic quadratic analysis framework that averaging and learning rate decreases achieve the same effective learning rate. For instance, an average of two points along the training trajectory can give almost identical results to using a learning rate two times smaller. Stochastic quadratic problems are particularly special: Bach and Moulines (2013) have shown that Polyak averaging gives optimal O(1/T) rates in this setting without the use of decreasing, time-dependent step size sequences. Within optimization theory, tail averages can be used to improve the convergence rate for stochastic non-smooth SGD in the strongly convex setting from O(log(T)/T) to O(1/T) (Rakhlin et al., 2012), although at the expense of worse constants compared to using weighted averages of the whole sequence (Lacoste-Julien et al., 2012).

Portes et al. (2022) use cyclic learning rate schedules with increasing cycle periods to give a method that explores multiple points along the Pareto frontier of training time vs. eval performance. Each point at the end of a cycle is an approximation to the model from a tuned schedule ending at that time. Our method gives the entire frontier, rather than just a few points along the path. In addition, our method matches or improves upon the best known schedules, whereas their "cyclic trade-off curve underestimated the standard trade-off curve by a margin of 0.5% validation accuracy" (Portes et al., 2022).
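As mentioned above, a minimal sketch of the Lookahead inner/outer structure in (21)-(22) follows. The toy objective, the constants, and the choice of plain SGD as the inner optimizer are illustrative assumptions.

```python
# A minimal sketch of Lookahead-wrapped SGD, equations (21)-(22): k inner SGD
# steps on the fast weights z, then an EMA-style outer step on the slow weights x.
import numpy as np

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.1, 1.0, size=10))
b = rng.normal(size=10)
grad = lambda w: A @ w - b

gamma, alpha, k, outer_steps = 0.5, 0.5, 5, 200
x = np.zeros(10)                     # slow weights
for t in range(outer_steps):
    z = x.copy()                     # inner optimizer restarts at the slow weights
    for _ in range(k):
        z = z - gamma * grad(z)      # (21) inner SGD step on the fast weights
    x = x + alpha * (z - x)          # (22) outer step toward the final fast weights
```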
4 Experiments

For our deep learning experiments, we evaluated Schedule-Free learning on a set of benchmark tasks that are commonly used in the optimization research literature:

- CIFAR10: A Wide ResNet (16-8) architecture (Zagoruyko and Komodakis, 2016) on the CIFAR-10 image classification dataset.
- CIFAR100: A DenseNet architecture (Huang et al., 2017) on the CIFAR-100 (100-class) classification dataset.
- SVHN: A deep ResNet architecture (3-96) on the Street View House Numbers (SVHN) dataset.
- ImageNet: A standard ResNet-50 architecture (He et al., 2016) on the ILSVRC 2012 ImageNet (Russakovsky et al., 2015) classification dataset.
- IWSLT14: An LSTM architecture (Wiseman and Rush, 2016) on the IWSLT14 German-English translation dataset (Cettolo et al., 2014).
- DLRM: The DLRM (Naumov et al., 2019) architecture on the Criteo Kaggle Display Advertising dataset (Jean-Baptiste Tien, 2014).
- MRI: A stacked U-Net architecture (Sriram et al., 2020) on the fastMRI dataset (Zbontar et al., 2018).
- MAE: Fine-tuning a pretrained Masked Autoencoder (He et al., 2021) ViT (patch16-512d-8b) on the ILSVRC 2012 ImageNet dataset.
- NanoGPT: A 124M-parameter GPT-2 (Radford et al., 2019) style decoder-only transformer on the OpenWebText dataset (Gokaslan and Cohen, 2019).

For each problem, both the baseline and the Schedule-Free method were tuned by sweeping both the weight decay and the learning rate on a grid. We also swept β over two values, 0.9 and 0.98. Final hyper-parameters are listed in the Appendix. Schedule-Free SGD was used for CIFAR10, CIFAR100, SVHN and ImageNet, and Schedule-Free AdamW (Loshchilov and Hutter, 2019) was used for the remaining tasks. We further include a step-wise schedule as a comparison on problems where step-wise schedules are customary. Further results for Polyak and primal averaging are in Appendix I.

Our approach shows very strong performance (Figure 3), out-performing existing state-of-the-art cosine schedules on the CIFAR-10, CIFAR-100, SVHN, IWSLT14 (Figure 2a) and OpenWebText GPT-2 problems, as well as the state-of-the-art linear decay schedules on the fastMRI and Criteo DLRM tasks. On the remaining two problems, MAE fine-tuning and ImageNet ResNet-50 training, it ties with the existing best schedules. In general, the optimal learning rates for the Schedule-Free variants were larger than the optimal values for the base optimizers. The ability to use larger learning rates without diverging may be a contributing factor to the faster convergence of Schedule-Free methods. The β parameter works well at the default value of 0.9 for all problems except NanoGPT, where the loss started to increase rapidly when 0.9 was used (similar to the Polyak averaging results in Appendix I). The larger β = 0.98 value in our sweep was stable.

4.1 MLCommons Algorithmic Efficiency benchmark

The AlgoPerf challenge (Dahl et al., 2023) is designed to be a large-scale and comprehensive benchmark for deep learning optimization algorithms, covering major data domains and architectures. It includes Transformer, ConvNet and U-Net models across image, language, graph and speech domains, and contains 8 problems in total. We evaluated Schedule-Free AdamW following the competition guidelines, comparing against NAdamW, the competition reference algorithm, running 10 seeds of each. As this is a time-to-target competition, traditional error bars are not appropriate, so we instead plot all 10 seeds separately.
Note that we excluded one benchmark problem, ResNet-50 training, as neither AdamW nor NAdamW can hit the target accuracy on that task.

[Figure 3: Deep Learning Experiments. Final test metrics (mean, SE): CIFAR-10 (WRN-16-8): Schedule-Free 96.03% (SE 0.04), Step-Wise Schedule 95.59% (SE 0.03), Cosine Schedule 95.73% (SE 0.04). CIFAR-100 (DenseNet): Schedule-Free 78.71% (SE 0.06), Step-Wise 76.41% (SE 0.14), Cosine 77.41% (SE 0.09). SVHN (ResNet-3-96): Schedule-Free 98.40% (SE 0.01), Cosine 98.27% (SE 0.02), Step-Wise 98.20% (SE 0.01). ILSVRC 2012 ImageNet (ResNet-50): Schedule-Free 76.90% (SE 0.03), Cosine 76.90% (SE 0.06), Step-Wise 76.49% (SE 0.07). fastMRI Knee (VarNet 2.0): Cosine 0.9110 (SE 0.00016), Schedule-Free 0.9112 (SE 0.00012). Criteo Kaggle (DLRM): LD Schedule 0.7906 (SE 0.00006), Schedule-Free 0.7915 (SE 0.00003). MAE ImageNet Finetune (ViT): Schedule-Free 83.54 (SE 0.03), Cosine 83.52 (SE 0.02). OpenWebText (GPT-2 124M): Cosine 2.853 (SE 0.004), Schedule-Free 2.831 (SE 0.008).]

The self-tuning track restricts participants to provide a single set of hyper-parameters to use for all 8 problems. Given the large number of problems, this gives performance representative of a good default configuration. Schedule-Free AdamW performs well across all considered tasks, out-performing the baseline on the WMT, ViT, fastMRI and OGBG workloads, while tying on the Conformer and Criteo workloads, and marginally under-performing on the DeepSpeech workload. We attribute the performance on the Conformer and DeepSpeech tasks to their use of batch norm: the AlgoPerf setup does not easily allow us to update the BN running statistics on the x sequence, which is necessary with our method to get the best performance (see Section 4.3).

4.2 Convex Problems

We validated the Schedule-Free learning approach on a set of standard logistic regression problems from the LibSVM repository. For each problem, and each method separately, we performed a full learning rate sweep on a power-of-two grid, and plotted the mean and standard error of the final train accuracy from 10 seeds using the best learning rate found (a sketch of this sweep protocol is given below). Schedule-Free learning out-performs both averaging approaches and the state-of-the-art linear decay (LD) schedule baseline (Figure 7). It converges faster on all but 1 of the 12 problems, has higher accuracy on 6 of the problems, and ties the baseline on the remaining problems. This demonstrates that the performance advantages of Schedule-Free methods are not limited to non-convex problems.
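The following is a minimal sketch of that sweep protocol: a power-of-two learning rate grid, 10 seeds per grid point, and mean with standard error reported at the best grid point. The tiny synthetic logistic-regression task and all constants are placeholders standing in for a LibSVM problem, not the paper's code.

```python
# A minimal sketch of the convex-experiment protocol: power-of-two LR grid,
# 10 seeds at each value, report mean and SE of final train accuracy at the
# best learning rate. Schedule-Free SGD is used as the training method.
import numpy as np

def train_acc(lr, seed, n=200, dim=20, epochs=10, beta=0.9):
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)
    X = rng.normal(size=(n, dim))
    lab = np.sign(X @ w_true + 0.5 * rng.normal(size=n))
    z = np.zeros(dim); x = z.copy(); t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            y = (1 - beta) * z + beta * x
            margin = lab[i] * (X[i] @ y)
            g = -lab[i] * X[i] / (1 + np.exp(margin))   # logistic loss gradient
            z = z - lr * g
            c = 1.0 / (t + 1)
            x = (1 - c) * x + c * z
    return np.mean(np.sign(X @ x) == lab)

grid = [2.0 ** p for p in range(-8, 4)]                 # power-of-two grid
runs = {lr: [train_acc(lr, s) for s in range(10)] for lr in grid}
best = max(runs, key=lambda lr: np.mean(runs[lr]))
a = np.array(runs[best])
print(f"best lr {best}: {a.mean():.4f} (SE {a.std(ddof=1) / np.sqrt(len(a)):.4f})")
```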
4.3 Implementation Concerns

The Schedule-Free variant of a method typically has the same memory requirements as the base method. For instance, Schedule-Free SGD requires no extra memory over standard SGD with momentum. Whereas SGDM tracks the current point x and the momentum buffer m, we can track x and z. The quantity y can be computed directly from the latest values of x and z, and so does not need to be explicitly stored. It is also possible to instead store z and y, and then compute x when needed. This low memory usage is the case for AdamW also; see Algorithm 1. Our efficient PyTorch implementation actually uses one buffer to always store z, and the primary parameter buffer to store either x or y, with the stored quantity flipping between the two for training and test/inference passes.

[Figure 4: Schedule-Free AdamW compared to the target-setting NAdamW baseline on the AlgoPerf competition self-tuning track (normalized test metric vs. normalized time; workloads include Librispeech Conformer and Librispeech DeepSpeech, among others).]

Algorithm 1: Schedule-Free AdamW
1: Input: x_1, learning rate γ, decay λ, warmup steps T_warmup, β_1, β_2, ε
2: z_1 = x_1
3: v_0 = 0
4: for t = 1 to T do
5:   y_t = (1 - β_1) z_t + β_1 x_t                    (momentum via interpolation)
6:   g_t ← ∇f(y_t, ζ_t)                               (gradient is evaluated at y)
7:   v_t = β_2 v_{t-1} + (1 - β_2) g_t^2
8:   γ_t = γ √(1 - β_2^t) · min(1, t/T_warmup)        (LR includes warmup and Adam bias-correction)
9:   z_{t+1} = z_t - γ_t g_t / (√v_t + ε) - γ_t λ y_t
10:  c_{t+1} = γ_t^2 / Σ_{i=1}^t γ_i^2
11:  x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1}    (update weighted iterate average)
12: end for
13: Return x_T

Our method requires extra code to handle models where batch norm is used. This is due to the fact that BatchNorm layers maintain a running_mean and running_var to track batch statistics, which are computed at y. For model evaluation, these buffers need to be updated to match the statistics of the x sequence. This can be done by evaluating a small number of training batches using x right before each eval. More sophisticated approaches such as PreciseBN (Wu and Johnson, 2021) can also be used. This calculation is not needed for other normalization layers that do not use batch statistics.

Learning rate warmup is still necessary for our method. We use a linear warmup for a fixed duration, and fuse the Adam bias-correction term into the learning rate for simplicity (this potentially impacts the effect of weight decay during early iterations), giving a learning rate

γ_t = γ √(1 - β_2^t) · min(1, t/T_warmup)

that approaches γ when the warmup and bias-correction period ends.
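A minimal, self-contained sketch of Algorithm 1 for a single parameter tensor follows. It is a simplified illustration of the update only; the class and method names (`ScheduleFreeAdamW`, `gradient_point`, `step`) are invented for this sketch, and it omits parameter groups, closures, and the train/eval buffer flipping of the released facebookresearch/schedule_free implementation. The weighting of the averaging coefficient c is discussed immediately below.

```python
# A minimal sketch of Algorithm 1 (Schedule-Free AdamW) for one parameter tensor.
import torch

class ScheduleFreeAdamW:
    def __init__(self, param, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0, warmup_steps=0):
        self.x = param.detach().clone()     # evaluation sequence x (averaged iterate)
        self.z = param.detach().clone()     # base sequence z (where the Adam step happens)
        self.v = torch.zeros_like(param)    # second-moment EMA
        self.lr, (self.b1, self.b2) = lr, betas
        self.eps, self.wd, self.warmup = eps, weight_decay, warmup_steps
        self.t, self.lr_sq_sum = 0, 0.0

    def gradient_point(self):
        # y_t = (1 - beta1) z_t + beta1 x_t : compute the loss/gradient here
        return (1 - self.b1) * self.z + self.b1 * self.x

    def step(self, grad):
        self.t += 1
        y = self.gradient_point()
        self.v = self.b2 * self.v + (1 - self.b2) * grad ** 2
        warm = min(1.0, self.t / self.warmup) if self.warmup > 0 else 1.0
        lr_t = self.lr * (1 - self.b2 ** self.t) ** 0.5 * warm   # warmup + bias correction
        self.z = self.z - lr_t * grad / (self.v.sqrt() + self.eps) - lr_t * self.wd * y
        self.lr_sq_sum += lr_t ** 2
        c = lr_t ** 2 / self.lr_sq_sum                           # c_{t+1} = lr_t^2 / sum_i lr_i^2
        self.x = (1 - c) * self.x + c * self.z                   # weighted iterate average
        return self.x                                            # evaluate the model at x
```

In use, gradients would be computed with the model weights set to `gradient_point()` during training, and the model would be evaluated at x.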
We found that performance was greatly improved by using a weighted c_t sequence when warmup is used, weighted by the square of the γ_t used during warmup:

c_{t+1} = \frac{\gamma_t^2}{\sum_{i=1}^t \gamma_i^2}.    (23)

This sequence decreases at a 1/t rate after the learning rate warmup. It is shifted by one from the indexing used in Theorem 2, which is done to simplify the implementation. This sequence is motivated by Theorem 2's weighting sequences, which suggest weights proportional to polynomials of the learning rate. This sequence was used for both the SGD and AdamW experiments.

[Figure 6: Comparison of the LR sensitivity of Schedule-Free training and cosine-schedule training on ILSVRC 2012 ImageNet (ResNet-50), test accuracy over 200 epochs. Schedule-Free learning rate sweep: LR 5.0 (76.27% SE 0.03), 3.0 (77.39% SE 0.05), 1.5 (77.83% SE 0.03), 1.0 (77.42% SE 0.02), 0.5 (76.17% SE 0.07). SGD learning rate sweep: LR 0.5 (76.42% SE 0.05), 0.3 (77.05% SE 0.03), 0.15 (77.57% SE 0.02), 0.1 (77.68% SE 0.04), 0.05 (77.69% SE 0.04).]

Weight decay for Schedule-Free methods can be computed at either the y or the z sequence. We used decay at y for our experiments, as this matches the interpretation of weight decay as an additional L2-regularizer term in the loss. We found that computing the regularization at y gives significantly better performance on some problems, including ImageNet and NanoGPT training.

5 Parameter Sensitivity

[Figure 5: Sensitivity to momentum values on ILSVRC 2012 ImageNet (ResNet-50), 200-epoch momentum sweep: Schedule-Free MOM 0.75 (76.60% SE 0.02), 0.9 (77.78% SE 0.04), 0.95 (76.99% SE 0.05), 0.98 (75.37% SE 0.03).]

For Schedule-Free learning to be truly schedule-free, it is important that the momentum hyper-parameter does not implicitly have a dependence on the time horizon. If tuning this parameter gave different values depending on the training duration, then the problem of setting the horizon would just be shifted to setting the momentum value. In Figure 5 we run ImageNet training with Schedule-Free SGD for a longer-than-standard 200 epochs with a variety of momentum values, with the LR fixed to 1.5. We find that the best choice of momentum (β = 0.9) is the same for all durations of training.

Schedule-Free learning has a similar mild time-horizon dependency for the baseline learning rate value as schedule-based approaches. Figure 6 shows that the optimal learning rate stays the same across a broad range of training durations, for both Schedule-Free and schedule-based training. For short-duration training (≤ 25 epochs), larger LR values begin to show the best performance. Appendix J shows the sensitivity of the final test accuracy to the baseline learning rate for a selection of our test problems, in comparison to the baseline optimizer with a cosine schedule. We see that the overall sensitivity is similar to the baseline optimizer in each problem.

6 Conclusion

"Two roads diverged in a wood, and I,
I took the one less traveled by,
And that has made all the difference."
- Robert Frost

We have presented Schedule-Free learning, an optimization approach that removes the need to specify a learning rate schedule while matching or outperforming schedule-based learning.
The primary practical limitation is the need to sweep the learning rate and weight decay, as the best values differ from those used with a schedule. We provide a preliminary theoretical exploration of the method, establishing its worst-case optimal performance for non-smooth Lipschitz convex optimization.

Funding Acknowledgments

AC is supported by NSF grant number CCF-2211718.

References

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.

Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050–2057.

Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. (2014). Report on the 11th IWSLT evaluation campaign. In IWSLT.

Chiang, C.-K., Yang, T., Lee, C.-J., Mahdavi, M., Lu, C.-J., Jin, R., and Zhu, S. (2012). Online optimization with gradual variations. In Conference on Learning Theory. JMLR Workshop and Conference Proceedings.

Cutkosky, A. (2019). Anytime online-to-batch, optimism and acceleration. In International Conference on Machine Learning, pages 1446–1454. PMLR.

Dahl, G. E., Schneider, F., Nado, Z., Agarwal, N., Sastry, C. S., Hennig, P., Medapati, S., Eschenhagen, R., Kasimbeg, P., Suo, D., Bae, J., Gilmer, J., Peirson, A. L., Khan, B., Anil, R., Rabbat, M., Krishnan, S., Snider, D., Amid, E., Chen, K., Maddison, C. J., Vasudev, R., Badura, M., Garg, A., and Mattson, P. (2023). Benchmarking neural network training algorithms.

Defazio, A. (2020). Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization.

Defazio, A., Cutkosky, A., Mehta, H., and Mishchenko, K. (2023). When, why and how much? Adaptive learning rate scheduling by refinement.

Defazio, A. and Gower, R. M. (2021). The power of factorial powers: New parameter settings for (stochastic) optimization. In Balasubramanian, V. N. and Tsang, I., editors, Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research, pages 49–64. PMLR.

Defazio, A. and Jelassi, S. (2022). Adaptivity without compromise: A momentumized, adaptive, dual averaged gradient method for stochastic optimization. Journal of Machine Learning Research, 23:1–34.

Defazio, A. and Mishchenko, K. (2023). Learning-rate-free learning by D-Adaptation. The 40th International Conference on Machine Learning (ICML 2023).

Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61).

Gokaslan, A. and Cohen, V. (2019). OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus.

Hazan, E. (2022). Introduction to Online Convex Optimization. MIT Press.

Hazan, E. and Kale, S. (2010). Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine Learning, 80:165–188.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. (2021). Masked autoencoders are scalable vision learners. arXiv:2111.06377.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017).
Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261 2269. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. In Conference on Uncertainty in Artificial Intelligence (UAI). Jean-Baptiste Tien, joycenv, O. C. (2014). Display advertising challenge. Joulani, P., György, A., and Szepesvári, C. (2017). A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, and variational bounds. In International Conference on Algorithmic Learning Theory, pages 681 720. PMLR. Joulani, P., Raj, A., Gyorgy, A., and Szepesvári, C. (2020). A simpler approach to accelerated optimization: iterative averaging meets optimism. In International conference on machine learning, pages 4984 4993. PMLR. Kaddour, J. (2022). Stop wasting my time! saving days of Image Net and BERT training with latest weight averaging. Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. (2019). Uni XGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32. Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. In International Conference on Learning Representations. Lacoste-Julien, S., Schmidt, M., and Bach, F. (2012). A simpler approach to obtaining an o(1/t) convergence rate for the projected stochastic subgradient method. Lan, G. (2012). An optimal method for stochastic composite optimization. Mathematical Programming, 133(1):365 397. Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In International Conference on Learning Representations. Naumov, M., Mudigere, D., Shi, H. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C., Azzolini, A. G., Dzhulgakov, D., Mallevich, A., Cherniavskii, I., Lu, Y., Krishnamoorthi, R., Yu, A., Kondratenko, V., Pereira, S., Chen, X., Chen, W., Rao, V., Jia, B., Xiong, L., and Smelyanskiy, M. (2019). Deep learning recommendation model for personalization and recommendation systems. Co RR. Nesterov, Y. (1983). A method for solving a convex programming problem with convergence rate O(1/k2). Soviet Mathematics Doklady. Nesterov, Y. (2013). Lectures on Convex Optimization. Springer Nature. Nesterov, Y. and Shikhman, V. (2015). Quasi-monotone subgradient methods for nonsmooth convex minimization. Journal of Optimization Theory and Applications, 165(3):917 940. Orabona, F. (2019). A modern introduction to online learning. ar Xiv preprint ar Xiv:1912.13213. Polyak, B. (1990). New stochastic approximation type procedures. Avtomatica i Telemekhanika, 7:98 107. Portes, J., Blalock, D., Stephenson, C., and Frankle, J. (2022). Fast benchmarking of accuracy vs. training time with cyclic learning rates. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. Technical report, Open AI. Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning. Rakhlin, A. and Sridharan, K. (2013). Online learning with predictable sequences. In Conference on Learning Theory, pages 993 1019. PMLR. Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations. Ruppert, D. 
(1988). Efficient estimations from a slowly convergent Robbins-Monro process. Technical Report, Cornell University. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3). Sandler, M., Zhmoginov, A., Vladymyrov, M., and Miller, N. (2023). Training trajectories, mini-batch losses and the curious role of the learning rate. Sanyal, S., Neerkaje, A., Kaddour, J., Kumar, A., and Sanghavi, S. (2023). Early weight averaging meets high learning rates for LLM pre-training. Sebbouh, O., Gower, R. M., and Defazio, A. (2021). On the (asymptotic) convergence of stochastic gradient descent and stochastic heavy ball. In Conference on Learning Theory, COLT 2021, Proceedings of Machine Learning Research. PMLR. Shamir, O. and Zhang, T. (2013). Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In Proceedings of the 30th International Conference on Machine Learning. Sriram, A., Zbontar, J., Murrell, T., Defazio, A., Zitnick, C. L., Yakubova, N., Knoll, F., and Johnson, P. (2020). End-to-end variational networks for accelerated MRI reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer. Sutskever, I., Martens, J., Dahl, G., and Hinton, G. E. (2013). On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28. JMLR.org. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818 2826. Tao, W., Pan, Z., Wu, G., and Tao, Q. (2018). Primal averaging: A new gradient evaluation step to attain the optimal individual convergence. IEEE Transactions on Cybernetics, PP:1 11. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. (2017). Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc. Wiseman, S. and Rush, A. M. (2016). Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Wu, Y. and Johnson, J. (2021). Rethinking "batch" in batchnorm. Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC). Zamani, M. and Glineur, F. (2023). Exact convergence rate of the last iterate in subgradient methods. Zbontar, J., Knoll, F., Sriram, A., Muckley, M. J., Bruno, M., Defazio, A., Parente, M., Geras, K. J., Katsnelson, J., Chandarana, H., et al. (2018). fast MRI: An open dataset and benchmarks for accelerated MRI. ar Xiv preprint ar Xiv:1811.08839. Zhang, M., Lucas, J., Ba, J., and Hinton, G. E. (2019). Lookahead optimizer: k steps forward, 1 step back. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc. Zinkevich, M. (2003). 
Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, pages 928 935. Contributions Aaron Defazio discovered the method, led research experimentation and proved initial versions of Theorems 1 and 7, with experimental/theoretical contributions by Alice Yang. Alice Yang led the development of the research codebase. Ashok Cutkosky proved key results including Theorem 2 and led the theoretical investigation of the method. Ahmed Khaled developed preliminary theory for obtaining accelerated rates which was later supplanted by Theorem 2, and investigated the utility of β with large learning rates for quadratics. Additional derivations by Konstantin Mishchenko and Harsh Mehta are included in appendix sections. Discussions between Aaron Defazio, Ashok Cutkosky, Konstantin Mishchenko, Harsh Mehta, and Ahmed Khaled over the last year contributed to this scientific discovery. A Proof of Theorem 2 Theorem 2. Let F be a convex function. Let ζ1, . . . , ζT be an iid sequence such that F(x) = Eζ[f(x, ζ)]. Let z1, . . . , z T be arbitrary vectors and let w1, . . . , w T and β1, . . . , βT be arbitrary numbers in [0, 1] such that zt, wt and βt are independent of ζt, . . . , ζT . Set: xt = Pt i=1 wizi Pt i=1 wi = xt 1 1 wt Pt i=1 wi + wt Pt i=1 wi | {z } yt = βtxt + (1 βt)zt (10) gt = f(yt, ζt). (11) Then we have for all x : E[F(x T ) F(x )] E[PT t=1 wt gt, zt x ] PT i=1 wi . (12) Proof. Throughout this proof, we will use the notation w1:t = Pt i=1 wi. The result is established by showing the following identity: w1:t F(xt) w1:t 1F(xt 1) wt F(x ) wt F(yt), zt x . (24) Where here F(yt) indicates a subgradient of F at yt with E[gt|zt] = F(yt). Given the identity (24), we sum over all t from 1 to T. Then the LHS will telescope to obtain: w1:T (F(x T ) F(x )) t=1 wt F(yt), zt x , from which the conclusion immediately follows since E[gt|zt] = F(yt). So, let us establish (24). To do so, it will help to observe the following identities: wtzt = w1:txt w1:t 1xt 1 w1:t 1(xt xt 1) = wt(zt xt) (25) zt yt = βt 1 βt (yt xt). (26) Now, setting F(xt) to be an arbitrary subgradient of F at xt, we have: w1:t F(xt) w1:t 1F(xt 1) wt F(x ) = w1:t 1(F(xt) F(xt 1)) + wt(F(xt) F(x )) w1:t 1 F(xt), xt xt 1 + wt(F(xt) F(x )) using (25): = wt F(xt), zt xt + wt(F(xt) F(x )) = wt F(xt), zt xt + wt(F(xt) F(yt)) + wt(F(yt) F(x )) wt F(xt), zt xt + wt F(xt), xt yt + wt F(yt), yt x = wt F(xt) F(yt), zt yt + wt F(yt), zt x using (26): = wt βt 1 βt F(xt) F(yt), yt xt + wt F(yt), zt x Finally, recall that any convex function satisfies F(b) F(a), a b 0 for all a, b. This classical fact can be established by adding the following two subgradient identities: F(a) F(b) + F(b), a b , F(b) F(a) + F(a), b a . Then, since βt [0, 1], we have wt βt 1 βt F(xt) F(yt), yt xt 0, which establishes the desired identity (24). B Recovering Prior Conversions, and Connections to Momentum The following recursions provide an equivalent update to our main algorithm that casts the update in a more momentum-like form. Theorem 4. Under the same assumptions and notation as Theorem 2, set: t = zt+1 zt, mt = xt+1 xt, ut = yt+1 yt. mt = wt+1w1:t 1 wtw1:t+1 mt 1 + wt+1 ut = βt + (βt βt+1) w1:t mt + (1 βt) t. Here ut is playing the role of the update vector , as the sequence of points yt are where we will be evaluating gradients. 
The t value can be interpreted as a base update value: for the case that the zt sequence is specified by SGD (as in Theorem 1), t = ηgt. Thus, the update can be interpreted as a momentum term mt, plus an extra push in the direction of t scaled by 1 βt. Proof. Let s solve for mt in terms of previous values: mt = xt+1 xt w1:t+1 (zt+1 xt) w1:t+1 ( t + zt xt) w1:t+1 ( t + w1:t 1 wt (xt xt 1)) = wt+1w1:t 1 wtw1:t+1 mt 1 + wt+1 Now let s solve for ut: ut = βt+1xt+1 + (1 βt+1)zt+1 βtxt (1 βt)zt = βtmt + (1 βt) t + (βt βt+1)(zt+1 xt+1) = βtmt + (1 βt) t + (βt βt+1) w1:t wt+1 (xt+1 xt) = βtmt + (1 βt) t + (βt βt+1) w1:t = βt + (βt βt+1) w1:t mt + (1 βt) t In the special case that wt = 1 for all t, the updates simplify to: t + 1mt 1 + 1 t + 1 t ut = (βt + t(βt βt+1)) mt + (1 βt) t. In the special case that βt = β for all t, the update for ut simplifies to: ut = βmt + (1 β) t. From this, it is clear that if β = 1 and wt = 1, then we recover the standard Polyak momentum with a time-varying momentum factor mt = t 1 t+1mt 1 + 1 t+1 t, while if β = 0, then we have ordinary SGD without momentum. B.1 Recovering Linear Decay Let s take a look at the update for ut = yt+1 yt in the special case that wt = 1 for all t: ut = (βt + t(βt βt+1)) mt + (1 βt) t. Let us define αt = 1 βt. Then we can re-write this update as: ut = (1 αt + t(αt+1 αt)) mt + αt t. It looks like we might be able to set αt such that the coefficient of mt vanishes. In this case, αt would play the role of a schedule as the update would just be ut = αt t. Solving the recursion we get: αt 1 = t(αt+1 αt), αt+1 = (t + 1)αt 1 Amazingly, this recursion is satisfied by αt = T t T , which is the linear decay schedule! Notably, this schedule has αT = 0, which in turn implies that y T = x T , so that the last iterate of our algorithm is x T , for which Theorem 2 provides a convergence guarantee. The recursion is also satisfied by αt = 1 for all t (which recovers standard Polyak-Ruppert averaging). Notably, this recursion shows that α1 will determine all subsequent α values. The values will decease linearly to zero, and then they will try to go negative, which is not allowed. So the linear decay schedule is the value of α1 that is just barely allowed since it hits zero at αT . In general with arbitrary wt, the recursion is: 1 αt + (αt+1 αt) w1:t If we insist that αT = 0 (so that y T = x T and we get a last iterate guarantee), then solving the recursion yields: αt = wt+1:T which exactly recovers the main result of Defazio et al. (2023). C Generalizing Theorem 2 via Bregman Divergences Here, we provide a generalized version of Theorem 2 in the style of Joulani et al. (2020). This result employs Bregman divergences to tighten the inequality of Theorem 2 to an equality. Theorem 5. Let F be a convex function. Let ζ1, . . . , ζT be a sequence of i.i.d. random variables, and let g be a function such that E[g(x, ζt)] F(x) for all x and t. Let z1, . . . , z T be arbitrary vectors and let w1, . . . , w T and α1, . . . , αT be arbitrary non-negative real numbers with αt 1 such that zt, wt and αt are independent of ζt, . . . , ζT . Define the Bregman divergence of F as BF (a, b) = F(a) F(b) F(b), a b 2. Set: xt = Pt i=1 wizi Pt i=1 wi = xt 1 1 wt Pt i=1 wi + wi Pt i=1 wi zt yt = (1 αt)xt + αtzt gt = g(yt, ζt). 2if F is not differentiable, then by abuse of notation define F(b) = E[g(b, ζ)], which is a particular choice of subgradient of F. Define the compressed sum notation: w1:t = Pt i=1 wi, with w1:0 = 0. 
Then we have for all x : E[F(x T ) F(x )] = E "PT t=1 wt gt, zt x "PT t=1 wt αt BF (yt, xt) + wt(1 αt) αt BF (xt, yt) "PT t=1 w1:t 1BF (xt 1, xt) + wt BF (x , yt) Let s take a minute to unpack this result since it is depressingly complicated. Recall that the Bregman divergence for a convex function must be positive, and so all the subtracted Bregman divergence terms can be dropped to make the bound only looser. This recovers Theorem 2. However, in Section D, we show how to exploit the negative Bregman terms to achieve accelerated rates when F is smooth, and in Section E we show how to exploit the negative Bregman terms to achieve faster rates when F is strongly-convex. Proof. The proof is nearly the same as that of Theorem 2. The only difference is that we keep track of all the error terms in the inequalities via Bregman divergences. Throughout this proof, we use F(x) to indicate Eζ[g(x, ζ)]. When F is differentiable, this is simply the ordinary gradient at x. When F is non-differentiable, this reprents a specific choice of subgradient at x. Recall that any convex function satisfies F(b) F(a), a b = BF (a, b) BF (b, a) for all a, b. This classical fact can be established by adding the following two subgradient identities: F(a) = F(b) + F(b), a b + BF (a, b) F(b) = F(a) + F(a), b a + BF (b, a) F(b) F(a), a b = BF (a, b) BF (b, a). (27) The Theorem is established by showing the following identity: w1:t F(xt) w1:t 1F(xt 1) wt F(x ) = wt F(yt), zt x αt BF (yt, xt) wt(1 αt) αt BF (xt, yt) w1:t 1BF (xt 1, xt) wt BF (x , yt). (28) Given the identity (28), we sum over all t from 1 to T. Then the LHS will telescope to obtain: w1:T (F(x T ) F(x )) = t=1 wt F(yt), zt x wt αt BF (yt, xt) wt(1 αt) αt BF (xt, yt) t=1 w1:t 1BF (xt 1, xt) wt BF (x , yt), from which the conclusion immediately follows since E[gt|g1, . . . , gt 1] = E[ F(yt)|g1, . . . , gt 1]. So, let us establish (28). To do so, it will help to observe the following identities: wtzt = w1:txt w1:t 1xt 1 w1:t 1(xt xt 1) = wt(zt xt) (29) zt yt = 1 αt αt (yt xt). (30) So, we have: w1:t F(xt) w1:t 1F(xt 1) wt F(x ) = w1:t 1(F(xt) F(xt 1) + wt(F(xt) F(x )) = w1:t 1 F(xt), xt xt 1 + wt(F(xt) F(x )) w1:t 1BF (xt 1, xt) using (29): = wt F(xt), zt xt + wt(F(xt) F(x )) w1:t 1BF (xt 1, xt) = wt F(xt), zt xt + wt(F(xt) F(yt)) + wt(F(yt) F(x )) w1:t 1BF (xt 1, xt) = wt F(xt), zt xt + wt F(xt), xt yt + wt F(yt), yt x w1:t 1BF (xt 1, xt) wt BF (yt, xt) wt BF (x , yt) = wt F(xt) F(yt), zt yt + wt F(yt), zt x w1:t 1BF (xt 1, xt) wt BF (yt, xt) wt BF (x , yt) using (30): αt F(xt) F(yt), yt xt + wt F(yt), zt x w1:t 1BF (xt 1, xt) wt BF (yt, xt) wt BF (x , yt) using (27): = wt F(yt), zt x αt (BF (xt, yt) + BF (yt, xt)) w1:t 1BF (xt 1, xt) wt BF (yt, xt) wt BF (x , yt) = wt F(yt), zt x αt BF (yt, xt) wt(1 αt) αt BF (xt, yt) w1:t 1BF (xt 1, xt) wt BF (x , yt). D Acceleration In this section, we show that by instantiating our framework with an optimistic online learning algorithm (Rakhlin and Sridharan, 2013), we achieve accelerated convergence guarantees. Our results match those available in the prior literature (Kavis et al., 2019; Joulani et al., 2020). Our approach is inspired by Joulani et al. (2020),: their method is based upon a version of Theorem 5 for the special case that αt = 0. Our result simply extends their analysis to αt = O(1/t). First, we establish an important technical Corollary that simplifies Theorem 5 in the case that F is smooth and αt is sufficiently small. Corollary 1. 
Under the same conditions as Theorem 5, suppose additionally that F is L-smooth and suppose αt wt 10w1:t for all t. Then we have for all x : E[F(x T ) F(x )] E "PT t=1 wt gt, zt x "PT t=1 w1:t 1 F(yt) F(yt 1) 2 where above the value of y0 is arbitrary (since the coefficient is w1:0 = 0). Proof. The key thing is to observe that smoothness implies BF (a, b) 2L F(a) F(b) 2. The rest of the argument is straightforward manipulation of the terms in Theorem 5: αt BF (yt, xt) wt(1 αt) αt BF (xt, yt) wt(2 αt) 2Lαt F(xt) F(yt) 2 w1:t 1BF (xt 1, xt) wt BF (x , yt) w1:t 1 2L F(xt) F(xt 1) 2. Next, observe that for any vectors a, b, c, for any λ > 0: a + b + c 2 = a 2 b 2 c 2 2 a, b 2 b, c 2 a, c (1 2/λ) a 2 + (2λ 1)( b 2 + c 2), where we have used Young s inequality: | v, w | v 2 2 . Therefore, setting λt = 3 we obtain: w1:t 1BF (xt 1, xt) wt BF (x , yt) 6L F(yt) F(yt 1) 2 2L ( F(xt) F(yt) 2 + F(xt 1) F(yt 1) 2). Now, since αt wt 10w1:t 1, we obtain: αt BF (yt, xt) wt(1 αt) αt BF (xt, yt) w1:t 1BF (xt 1, xt) wt BF (x , yt) 6L F(yt) F(yt 1) 2 2L F(xt) F(yt) 2 + 5w1:t 1 2L F(xt 1) F(yt 1) 2. Now summing over t from 1 to T (and dropping one negative term), the sum telescopes to: 6L F(yt) F(yt 1) 2. The result now follows from Theorem 5. Now, we consider the case that zt is given by an optimistic mirror descent algorithm: Corollary 2. Suppose F is L-smooth. Define g0 = 0 and suppose also that for some D satisfying D y1 x : t=1 wt gt, zt x D t=1 w2 t gt gt 1 2. Finally, suppose E[ gt gt 1 2] F(yt) F(yt 1) 2 + σ2 t for some constants σ1, . . . , σT (these are just variance bounds on the stochastic gradient oracle). Then with wt = t and αt 1 5(t 1), we have: E[F(x T ) F(x )] 14D2L T(T + 1) + 2D q PT t=1 t2σ2 t T(T + 1) where σ is uniform upper-bound on σt. Note that the algorithm does not need to know L or σ. Algorithms producing z sequences obtaining the guarantee stated here are called optimistic online learning algorithms . Proof. Applying Corollary 1, we obtain immediately: 2 E[F(x T ) F(x )] t=1 t2 gt gt 1 2 12L F(yt) F(yt 1) 2 t=1 t2E [ F(yt) F(yt 1) 2] + t2σ2 t + F(y1) 2 t=1 t2E F(yt) F(yt 1) 2 t=1 t2E [ F(yt) F(yt 1) 2] + D t=1 t2σ2 t + F(y1) 2 t=1 t2E F(yt) F(yt 1) 2 Using the identity A 6D2L + L y1 x 2 t=1 t2σ2 t . Divide by T (T +1) 2 to conclude the result. D.1 An Optimistic Regret Bound In this section we provide an algorithm that achieves the optimistic regret bound required for our acceleration result Corollary 2. This algorithm is a mild variation on the established literature (Rakhlin and Sridharan, 2013; Chiang et al., 2012; Hazan and Kale, 2010; Joulani et al., 2017) to slightly improve a technical dependence on the maximum gradient value. Lemma 1. For a sequence of vectors g1, . . . , g T , set ηt = D 2 Pt i=1 gi gi 1 2 with g0 = 0, define mt = maxi t gi gi 1 and define the sequence of vectors zt, z t and gt by the recursions: z1 = z 1 = 0 gt = gt 1 + min (mt 1, gt gt 1 ) gt gt 1 m2 t + Pt i=1 gi gi 1 2 z t+1 = Π z t+1 Dz t ηt gt zt+1 = Π zt+1 Dz t+1 ηtgt. t=1 gt, zt x 7D t=1 gt gt 1 2. Proof. For purposes of notation, define g0 = 0 and z 0 = 0. Further, observe that: gt gt 1 mt 1 gt gt 1 gt gt 1 gt gt = mt mt 1 ηt D q Pt+1 i=1 gi gi 1 2 2 PT t=1 gt gt 1 2 Next, notice that z t+1 = argmin z D gt, z + 1 2ηt z z t 2. Therefore since x D, by first order optimality conditions: gt + z t+1 z t ηt , z t+1 x gt, z t+1 x 1 ηt z t z t+1, z t+1 x 2ηt z t+1 x 2 2ηt z t+1 z t 2 Similarly, we have zt = argmin z D gt 1, z + 1 2ηt 1 z z t 2. 
D.1 An Optimistic Regret Bound

In this section we provide an algorithm that achieves the optimistic regret bound required for our acceleration result, Corollary 2. This algorithm is a mild variation on the established literature (Rakhlin and Sridharan, 2013; Chiang et al., 2012; Hazan and Kale, 2010; Joulani et al., 2017) that slightly improves a technical dependence on the maximum gradient value.

Lemma 1. For a sequence of vectors $g_1, \ldots, g_T$ with $g_0 = 0$, define $m_t = \max_{i \le t}\|g_i - g_{i-1}\|$, set
$$\eta_t = \frac{D}{\sqrt{2\left(m_t^2 + \sum_{i=1}^t \|g_i - g_{i-1}\|^2\right)}},$$
and define the sequences of vectors $z_t$, $\hat z_t$ and $\hat g_t$ by the recursions:
$$z_1 = \hat z_1 = 0, \qquad \hat g_t = g_{t-1} + \min\left(m_{t-1}, \|g_t - g_{t-1}\|\right)\frac{g_t - g_{t-1}}{\|g_t - g_{t-1}\|},$$
$$\hat z_{t+1} = \Pi_{\|z\| \le D}\left(\hat z_t - \eta_t \hat g_t\right), \qquad z_{t+1} = \Pi_{\|z\| \le D}\left(\hat z_{t+1} - \eta_t g_t\right).$$
Then for any $x_\star$ with $\|x_\star\| \le D$:
$$\sum_{t=1}^T \langle g_t, z_t - x_\star\rangle \le 7D\sqrt{\sum_{t=1}^T \|g_t - g_{t-1}\|^2}.$$

Proof. For purposes of notation, define $\hat g_0 = 0$, $\hat z_0 = 0$ and $\eta_0 = \eta_1$. Further, observe that:
$$\|\hat g_t - g_{t-1}\| = \min\left(m_{t-1}, \|g_t - g_{t-1}\|\right) \le \|g_t - g_{t-1}\|, \qquad \|g_t - \hat g_t\| \le m_t - m_{t-1}, \qquad \frac{1}{\eta_T} \le \frac{2\sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2}}{D}.$$
Next, notice that $\hat z_{t+1} = \operatorname{argmin}_{\|z\|\le D}\, \langle \hat g_t, z\rangle + \frac{1}{2\eta_t}\|z - \hat z_t\|^2$. Therefore, since $\|x_\star\| \le D$, by first-order optimality conditions:
$$\left\langle \hat g_t + \tfrac{1}{\eta_t}(\hat z_{t+1} - \hat z_t),\, \hat z_{t+1} - x_\star\right\rangle \le 0 \implies \langle \hat g_t, \hat z_{t+1} - x_\star\rangle \le \frac{\|\hat z_t - x_\star\|^2}{2\eta_t} - \frac{\|\hat z_{t+1} - x_\star\|^2}{2\eta_t} - \frac{\|\hat z_{t+1} - \hat z_t\|^2}{2\eta_t}.$$
Similarly, we have $z_t = \operatorname{argmin}_{\|z\|\le D}\, \langle g_{t-1}, z\rangle + \frac{1}{2\eta_{t-1}}\|z - \hat z_t\|^2$. From this we have:
$$\left\langle g_{t-1} + \tfrac{1}{\eta_{t-1}}(z_t - \hat z_t),\, z_t - \hat z_{t+1}\right\rangle \le 0 \implies \langle g_{t-1}, z_t - \hat z_{t+1}\rangle \le \frac{\|\hat z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}} - \frac{\|z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}} - \frac{\|z_t - \hat z_t\|^2}{2\eta_{t-1}},$$
so that
$$\langle \hat g_t, z_t - \hat z_{t+1}\rangle \le \frac{\|\hat z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}} - \frac{\|z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}} - \frac{\|z_t - \hat z_t\|^2}{2\eta_{t-1}} + \langle \hat g_t - g_{t-1}, z_t - \hat z_{t+1}\rangle.$$
By Young's inequality, $\langle \hat g_t - g_{t-1}, z_t - \hat z_{t+1}\rangle \le \frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2} + \frac{\|z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}}$, and therefore
$$\langle \hat g_t, z_t - \hat z_{t+1}\rangle \le \frac{\|\hat z_t - \hat z_{t+1}\|^2}{2\eta_{t-1}} - \frac{\|z_t - \hat z_t\|^2}{2\eta_{t-1}} + \frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2}.$$
So, combining these facts (and noticing that $\eta_t \le \eta_{t-1}$):
$$\langle \hat g_t, z_t - x_\star\rangle \le \frac{\|\hat z_t - x_\star\|^2}{2\eta_t} - \frac{\|\hat z_{t+1} - x_\star\|^2}{2\eta_t} + \frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2},$$
$$\langle g_t, z_t - x_\star\rangle \le \frac{\|\hat z_t - x_\star\|^2}{2\eta_t} - \frac{\|\hat z_{t+1} - x_\star\|^2}{2\eta_t} + \frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2} + \langle g_t - \hat g_t, z_t - x_\star\rangle \le \frac{\|\hat z_t - x_\star\|^2}{2\eta_t} - \frac{\|\hat z_{t+1} - x_\star\|^2}{2\eta_t} + \frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2} + 2D(m_t - m_{t-1}).$$
So, we have:
$$\sum_{t=1}^T\langle g_t, z_t - x_\star\rangle \le 2Dm_T + \frac{\|\hat z_1 - x_\star\|^2}{2\eta_1} + \sum_{t=2}^T\left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}}\right)\|\hat z_t - x_\star\|^2 + \sum_{t=1}^T\frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2} \le 2Dm_T + \frac{2D^2}{\eta_T} + \sum_{t=1}^T\frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2}.$$
For the last sum, observe that
$$\eta_{t-1}\|\hat g_t - g_{t-1}\|^2 \le \frac{D\|g_t - g_{t-1}\|^2}{\sqrt{2\sum_{i=1}^t\|g_i - g_{i-1}\|^2}} \implies \sum_{t=1}^T\frac{\eta_{t-1}\|\hat g_t - g_{t-1}\|^2}{2} \le \frac{D}{\sqrt 2}\sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2}.$$
Finally, using $m_T \le \sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2}$ and $\frac{2D^2}{\eta_T} \le 4D\sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2}$, we obtain
$$\sum_{t=1}^T\langle g_t, z_t - x_\star\rangle \le \left(2 + 4 + \tfrac{1}{\sqrt 2}\right)D\sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2} \le 7D\sqrt{\sum_{t=1}^T\|g_t - g_{t-1}\|^2}.$$

E Strongly Convex Losses

Suppose that the expected loss $F$ is actually known to be $\mu$-strongly convex. Then we would like to have a convergence guarantee of $O(1/\mu T)$. This is achieved in Theorem 6 below.

Theorem 6. Under the same assumptions as Theorem 5, define $\ell_t(z) = \langle g_t, z\rangle + \frac{\mu}{2}\|y_t - z\|^2$ and define the regret of the sequence $z_t$ as
$$\mathrm{Regret}_\ell(x_\star) = \sum_{t=1}^T w_t\left(\ell_t(z_t) - \ell_t(x_\star)\right).$$
Then we have for $x_\star = \operatorname{argmin} F$:
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \mathbb{E}\left[\frac{\mathrm{Regret}_\ell(x_\star) - \sum_{t=1}^T \frac{w_t\mu}{2}\|z_t - y_t\|^2}{w_{1:T}}\right].$$
In particular, suppose $\|x_\star\| \le D$ for some known bound $D$, and that $\|g_t\| \le G$ for all $t$ whenever $\|y_t\| \le D$. Then if we define $w_t = t$ for all $t$ and set $z_t$ by
$$z_{t+1} = \Pi_{\|z\|\le D}\left(z_t - \frac{2}{\mu(t+1)}\left(g_t + \mu(z_t - y_t)\right)\right),$$
we have:
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \frac{2(G + 2\mu D)^2}{\mu(T+1)}.$$

Proof. From Theorem 5, we have:
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \mathbb{E}\left[\frac{\sum_{t=1}^T w_t\langle g_t, z_t - x_\star\rangle - \sum_{t=1}^T w_t B_F(x_\star, y_t)}{w_{1:T}}\right].$$
Now, since $F$ is $\mu$-strongly convex, we have $B_F(x_\star, y_t) \ge \frac{\mu}{2}\|y_t - x_\star\|^2$. Further, we have:
$$\sum_{t=1}^T w_t\langle g_t, z_t - x_\star\rangle = \sum_{t=1}^T w_t\left(\ell_t(z_t) - \ell_t(x_\star)\right) - \sum_{t=1}^T \frac{w_t\mu}{2}\|z_t - y_t\|^2 + \sum_{t=1}^T \frac{w_t\mu}{2}\|y_t - x_\star\|^2.$$
From this we obtain the desired result:
$$\mathbb{E}[F(x_T) - F(x_\star)] \le \mathbb{E}\left[\frac{\mathrm{Regret}_\ell(x_\star) - \sum_{t=1}^T \frac{w_t\mu}{2}\|z_t - y_t\|^2}{w_{1:T}}\right].$$
For the final statement, observe that with $w_t = t$, $w_t\ell_t(z) = t\langle g_t, z\rangle + \frac{t\mu}{2}\|z - y_t\|^2$ is $t\mu$-strongly convex. Therefore, if we use learning rate $\eta_t = \frac{1}{\mu w_{1:t}} = \frac{2}{\mu t(t+1)}$ (so that the update in the theorem statement is exactly $z_{t+1} = \Pi_{\|z\|\le D}(z_t - \eta_t\, t\,\nabla\ell_t(z_t))$), then standard analysis of projected online gradient descent yields:
$$\sum_{t=1}^T t\left(\ell_t(z_t) - \ell_t(x_\star)\right) \le \sum_{t=1}^T \left(t\langle \nabla\ell_t(z_t), z_t - x_\star\rangle - \frac{t\mu}{2}\|z_t - x_\star\|^2\right) \le -\frac{\|z_{T+1} - x_\star\|^2}{2\eta_T} + \sum_{t=2}^T\left(\frac{1}{2\eta_t} - \frac{1}{2\eta_{t-1}} - \frac{t\mu}{2}\right)\|z_t - x_\star\|^2 + \sum_{t=1}^T \frac{\eta_t t^2}{2}\|\nabla\ell_t(z_t)\|^2,$$
where the bracketed coefficients vanish by the choice of $\eta_t$. Hence
$$\sum_{t=1}^T t\left(\ell_t(z_t) - \ell_t(x_\star)\right) \le \sum_{t=1}^T \frac{\eta_t t^2}{2}\|\nabla\ell_t(z_t)\|^2 \le \frac{1}{\mu}\sum_{t=1}^T \|\nabla\ell_t(z_t)\|^2 = \frac{1}{\mu}\sum_{t=1}^T \|g_t + \mu(z_t - y_t)\|^2 \le \frac{T(G + 2\mu D)^2}{\mu},$$
where in the last inequality we have observed that since $\|z_t\| \le D$ and $y_t$ is a convex combination of past $z$ values, $\|y_t\| \le D$ as well. Finally, observing that $w_{1:T} = \frac{T(T+1)}{2}$, the result follows.
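The update prescribed in the final part of Theorem 6 is simple to state procedurally: take a projected gradient step on the regularized loss $\ell_t$ with effective step size $t\eta_t = \frac{2}{\mu(t+1)}$, and maintain the $w_t = t$ weighted average. The sketch below is illustrative only: the interpolation weight beta, the toy objective, and the noise level are placeholder choices, not values fixed by the theorem.

```python
import numpy as np

def strongly_convex_schedule_free(grad, mu, D, T, beta=0.9, d=10, seed=0):
    """Sketch of the Theorem 6 recipe: projected gradient steps on the
    regularized losses l_t(z) = <g_t, z> + (mu/2)||y_t - z||^2, combined
    with w_t = t weighted averaging. beta is a placeholder choice."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(d)
    z *= min(1.0, D / np.linalg.norm(z))        # start inside the ball ||z|| <= D
    x = z.copy()                                # weighted average of z_1, z_2, ...
    w_sum = 1.0                                 # running sum of weights, starting at w_1 = 1
    for t in range(1, T + 1):
        y = (1.0 - beta) * z + beta * x         # gradient-evaluation point
        g = grad(y)
        step = 2.0 / (mu * (t + 1))             # = t * eta_t with eta_t = 1 / (mu * w_{1:t})
        z = z - step * (g + mu * (z - y))       # gradient of l_t at z
        norm = np.linalg.norm(z)
        if norm > D:
            z *= D / norm                       # projection onto {||z|| <= D}
        w_sum += t + 1
        x = x + ((t + 1) / w_sum) * (z - x)     # weighted running average, w_t = t
    return x

# toy usage: a mu-strongly-convex quadratic with noisy gradients
mu, D = 0.5, 5.0
c = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
noise = np.random.default_rng(1)
grad = lambda y: mu * (y - c) + 0.05 * noise.standard_normal(10)
print(np.round(strongly_convex_schedule_free(grad, mu, D, T=5000), 2))
```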
F Large Step-size Convergence

Theorem 7. Consider the online learning setting with bounded gradients $g_t$ and iterates $z_{t+1} = z_t - \gamma g_t$. Let $D = \|z_1 - z_\star\|$ for an arbitrary reference point $z_\star$ and define $G = \max_{t \le T}\|g_t\|$. Suppose that the chosen step size is $\gamma = D/G$. Then, if it holds that
$$\sum_{t=1}^T \langle g_t, z_t - z_1\rangle \le D\sqrt{\sum_{t=1}^T\|g_t\|^2}, \tag{31}$$
then
$$\sum_{t=1}^T \langle g_t, z_t - z_\star\rangle = O\!\left(D\sqrt{\sum_{t=1}^T \|g_t\|^2}\right).$$

Proof. Consider SGD with fixed step size $\gamma$: $z_{t+1} = z_t - \gamma g_t$, and write $s_{t+1} = \sum_{i=1}^t \gamma g_i = z_1 - z_{t+1}$. Recall from D-Adaptation (Defazio and Mishchenko, 2023) theory that:
$$\sum_{t=1}^T \gamma\langle g_t, z_t - z_1\rangle = \frac{1}{2}\sum_{t=1}^T \gamma^2\|g_t\|^2 - \frac{1}{2}\|s_{T+1}\|^2, \tag{33}$$
$$\sum_{t=1}^T \gamma\langle g_t, z_t - z_\star\rangle \le \|s_{T+1}\|\,D + \sum_{t=1}^T \gamma\langle g_t, z_t - z_1\rangle. \tag{34}$$
Now suppose that the regret at time $T$ is negative. Then the theorem holds trivially:
$$\sum_{t=1}^T \langle g_t, z_t - z_\star\rangle \le 0 = O\!\left(D\sqrt{\sum_{t=1}^T\|g_t\|^2}\right);$$
therefore, without loss of generality we may assume that $\sum_{t=1}^T \gamma\langle g_t, z_t - z_\star\rangle \ge 0$. Then, combining Equation 34 with Equation 33, we have:
$$\frac{1}{2}\|s_{T+1}\|^2 \le \|s_{T+1}\|\,D + \frac{1}{2}\sum_{t=1}^T\gamma^2\|g_t\|^2.$$
This is a quadratic inequality in $\|s_{T+1}\|$, which we can solve explicitly via the quadratic formula, taking the largest root $\frac{-b + \sqrt{b^2 - 4ac}}{2a}$. Plugging in $a = \tfrac{1}{2}$, $b = -D$, $c = -\tfrac{1}{2}\sum_{t=1}^T\gamma^2\|g_t\|^2$:
$$\|s_{T+1}\| \le D + \sqrt{D^2 + \sum_{t=1}^T \gamma^2\|g_t\|^2} \le 2D + \gamma\sqrt{\sum_{t=1}^T\|g_t\|^2}.$$
Substituting this into Equation 34:
$$\sum_{t=1}^T \gamma\langle g_t, z_t - z_\star\rangle \le 2D^2 + \gamma D\sqrt{\sum_{t=1}^T\|g_t\|^2} + \sum_{t=1}^T\gamma\langle g_t, z_t - z_1\rangle.$$
Therefore, if $\sum_{t=1}^T\langle g_t, z_t - z_1\rangle \le D\sqrt{\sum_{t=1}^T\|g_t\|^2}$, then:
$$\sum_{t=1}^T\gamma\langle g_t, z_t - z_\star\rangle \le 2D^2 + 2\gamma D\sqrt{\sum_{t=1}^T\|g_t\|^2}.$$
Plugging in $\gamma = D/G$:
$$\sum_{t=1}^T\langle g_t, z_t - z_\star\rangle \le 2DG + 2D\sqrt{\sum_{t=1}^T\|g_t\|^2},$$
and the theorem follows since $G \le \sqrt{\sum_{t=1}^T\|g_t\|^2}$.

G Experimental Setup

G.1 Convex experiments

Each dataset is obtained from the LIBSVM repository and used without modifications.

Hyper-parameters:
- GPUs: 1 V100
- Batch size: 16
- Epochs: 100
- Seeds: 10
- Optimizer: Adam
- Decay: 0.0
- Schedule-Free β1: 0.9
- Baseline β1: 0.9
- β2: 0.95

G.2 CIFAR-10

We used custom training code based on the PyTorch tutorial code for this problem. Following standard data-augmentation practices, we applied random horizontal flips and random offset cropping down to 32x32, using reflection padding of 4 pixels. Input pixel data was normalized by centering around 0.5.

Hyper-parameters:
- Architecture: Wide ResNet 16-8
- Epochs: 300
- GPUs: 1 V100
- Batch size per GPU: 128
- Seeds: 10
- Decay: 0.0001
- Baseline Momentum: 0.9
- Baseline Stepwise LR: 0.1
- Baseline Cosine LR: 0.2
- Schedule-Free LR: 10
- Schedule-Free β: 0.9
- Cosine/Schedule-Free Warmup: 5%

G.3 CIFAR-100

We used the same codebase as for our CIFAR-10 experiments, with the same data augmentation. We normalized each input image using fixed mean and standard deviation values derived from preprocessing the data.

Hyper-parameters:
- Architecture: DenseNet [6,12,24,16], growth rate 12
- Epochs: 300
- GPUs: 1 V100
- Batch size per GPU: 64
- Seeds: 10
- Decay: 0.0002
- Baseline Momentum: 0.9
- Baseline Stepwise LR: 0.05
- Baseline Cosine LR: 0.05
- Schedule-Free LR: 5
- Schedule-Free β: 0.9
- Cosine/Schedule-Free Warmup: 5%

G.4 SVHN

We used the same codebase as for our CIFAR experiments, following the same data preprocessing.

Hyper-parameters:
- Batch size: 32
- Seeds: 10
- Weight decay (Cosine): 0.0001
- Weight decay (Step Sched): 5e-5
- Baseline Stepwise LR: 0.1
- Baseline Cosine LR: 0.1
- Schedule-Free decay: 0.0002
- Schedule-Free LR: 1.0
- Schedule-Free β: 0.9
- Cosine/Schedule-Free Warmup: 5%

G.5 ImageNet

We used the same codebase as for our CIFAR-10 experiments, and applied the same preprocessing procedure. The data augmentations consisted of PyTorch's RandomResizedCrop, cropping to 224x224, followed by random horizontal flips. Test images used a fixed resize to 256x256 followed by a center crop to 224x224.

Hyper-parameters:
- Architecture: ResNet-50
- Epochs: 100
- GPUs: 8 V100
- Batch size per GPU: 32
- Seeds: 5
- Decay: 0.0001
- Baseline Momentum: 0.9
- Baseline Stepwise LR: 0.1
- Baseline Cosine LR: 0.05
- Schedule-Free Decay: 0.00005
- Schedule-Free LR: 1.5
- Schedule-Free β: 0.9
- Cosine/Schedule-Free Warmup: 5%

G.6 IWSLT14

We used the FairSeq framework3 for our experiments. Rather than a vanilla LSTM, we used the variant from Wiseman and Rush (2016) provided in the FairSeq codebase.

Hyper-parameters:
- Architecture: lstm_wiseman_iwslt_de_en
- Max Epoch: 55
- GPUs: 1 V100
- Tokens per batch: 4096
- Update Frequency: 1
- Share decoder, input, output embed: True
- Float16: True
- Seeds: 10
- Dropout: 0.3
- Label smoothing: 0.1
- Decay: 0.05
- Warmup steps: 4000
- Baseline β1: 0.9
- β2: 0.98
- Baseline LR: 0.01
- Baseline schedule: Linear Decay
- Schedule-Free β1: 0.9
- Schedule-Free LR: 0.02
- Schedule-Free warmup: 5%
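For reference, the deep-learning runs in this appendix swap a schedule-tuned AdamW or SGD baseline for the corresponding Schedule-Free optimizer while keeping the rest of the training pipeline fixed. The sketch below shows how such a run might be wired up with the open-source schedulefree package; the model, data, loss, and hyper-parameter values are placeholders rather than the exact configurations in these tables, and the class and argument names reflect our understanding of the package (check the repository for the current interface).

```python
import torch
import schedulefree  # open-source Schedule-Free optimizers (pip install schedulefree)

model = torch.nn.Linear(128, 2)                     # placeholder model
optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=2.5e-3,                                      # placeholder; tuned per problem above
    betas=(0.9, 0.98),
    weight_decay=0.05,
    warmup_steps=1000,                              # warmup is still used; no decay schedule
)

optimizer.train()                                   # gradients are taken at the y sequence
for step in range(100):
    inputs = torch.randn(32, 128)                   # placeholder batch
    loss = model(inputs).pow(2).mean()              # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # no LR schedule and no stopping time T

optimizer.eval()                                    # evaluate at the averaged x sequence
with torch.no_grad():
    val_loss = model(torch.randn(32, 128)).pow(2).mean()
print(float(val_loss))
```

When the model contains BatchNorm layers, the running statistics generally need to be recomputed for the averaged weights before evaluation; see the repository documentation for details.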
G.7 NanoGPT

We followed the NanoGPT codebase4 as closely as possible, matching the default batch size, training length and schedule. Our runs replicate the stated 2.85 loss in the documentation. Disabling gradient norm clipping is crucial for the Schedule-Free runs.

Hyper-parameters:
- Architecture: transformer_lm_gpt
- Block Size: 1024
- Num layers: 12
- Num heads: 12
- Num embd: 768
- Batch size per GPU: 12
- Update Frequency: 16
- Max Iters: 600,000
- GPUs: 40 V100
- Tokens per sample: 512
- Float16: True
- Seeds: 5
- Dropout: 0.0
- Decay: 0.1
- Gradient Clipping: 0.0
- Warmup: 2,000
- Baseline LR: 0.0005
- Baseline β1, β2: 0.9, 0.95
- Schedule-Free LR: 0.005
- Schedule-Free β: 0.98
- Schedule-Free decay: 0.05

3 https://github.com/facebookresearch/fairseq
4 https://github.com/karpathy/nanoGPT

G.8 MAE ImageNet Finetuning

Our implementation uses the official code5, with hyper-parameters following the examples given in the repository.

5 https://github.com/fairinternal/mae

Hyper-parameters:
- Model: vit_base_patch16
- Epochs: 100
- GPUs: 32 V100
- Batch Size: 32
- Layer Decay: 0.65
- Weight Decay: 0.05
- Drop Path: 0.1
- Reprob: 0.25
- Mixup: 0.8
- Cutmix: 1.0
- Baseline LR: 5e-4
- Baseline β1: 0.9
- β2: 0.999
- Schedule-Free LR: 0.002
- Schedule-Free decay: 0.05
- Schedule-Free β1: 0.9

G.9 Criteo Kaggle DLRM

We used a custom implementation of the DLRM model based on the publicly available code. Our optimizer uses dense gradients for implementation simplicity. Although sparse gradients with AdaGrad are a more common baseline on this problem, we consider AdaGrad variants of our approach as future work.

Hyper-parameters:
- Iterations: 300,000
- Batch Size: 128
- Emb Dimension: 16
- GPUs: 8 V100
- Seeds: 5
- Decay: 0.0
- Warmup: 0
- Baseline LR: 0.0002
- Baseline β1: 0.9
- β2: 0.999
- Baseline schedule: Linear Decay
- Schedule-Free LR: 0.0005
- Schedule-Free β1: 0.9

G.10 fastMRI

We used the version of the fastMRI code base at https://github.com/facebookresearch/fastMRI/tree/main/banding_removal. Note that we found training failed using PyTorch 2 or newer, so we ran these experiments using PyTorch 1.9.

Hyper-parameters:
- Architecture: 12 layer VarNet 2.0
- Epochs: 50
- GPUs: 8 V100
- Batch size per GPU: 1
- Acceleration factor: 4
- Low frequency lines: 16
- Mask type: Offset-1
- Seeds: 5
- Decay: 0.0
- Baseline Schedule: Linear Decay
- Baseline LR: 0.005
- Baseline β1: 0.9
- β2: 0.999
- Schedule-Free LR: 0.005
- Schedule-Free β: 0.9

G.11 AlgoPerf

Our full AlgoPerf entry is available at https://github.com/facebookresearch/schedule_free/tree/main/schedulefree/algoperf. The hyper-parameters used for the self-tuning track submission are listed below.
Hyper-parameters:
- Learning Rate: 0.0025
- One-minus Beta1: 0.1
- Beta2 (default): 0.9955159689799007
- Weight Decay (default): 0.08121616522670176
- Dropout Rate: 0.1
- Warmup Percentage: 2%
- Label Smoothing: 0.2
- Polynomial in c_t average: 0.75

H Convex Experiments

Figure 7: Stochastic logistic regression experiments. [Twelve panels, one per LIBSVM dataset, each plotting test accuracy (%) over training for Polyak Averaging, Primal Averaging, Schedule-Free, and a linear decay (LD) schedule baseline; final accuracies and standard errors are reported in the panel legends.]

I Polyak and Primal Averaging Runs

These experiments follow the same tuning setup as Figure 3, where the learning rate and momentum are tuned separately for each method. In each case the c weighting sequence used for Schedule-Free training is also used, to ensure a fair comparison. The Polyak averaging runs include momentum in the base optimizer, as we found this gave the best results. We ran the NanoGPT experiment for a shorter 200,000 steps due to computational budget considerations. The NanoGPT runs show a divergence in test loss for Polyak averaging.
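For clarity, the two classical baselines compared in this section differ only in where the gradient is evaluated; the schematic sketch below spells this out (the quadratic test problem, learning rate, and uniform averaging weights are arbitrary illustrative choices, not the tuned settings used in the runs).

```python
import numpy as np

def averaged_sgd(mode, grad, z0, lr, T):
    """Schematic comparison of the two baselines in this section.
    'polyak': gradients at the SGD iterate z_t; the average x_t is only returned.
    'primal': gradients at the running average x_t itself.
    (Schedule-Free instead evaluates gradients at an interpolation of the two.)"""
    z, x = z0.copy(), z0.copy()
    for t in range(1, T + 1):
        point = z if mode == "polyak" else x
        z = z - lr * grad(point)
        x = x + (z - x) / (t + 1)      # uniform running average of z_1, ..., z_{t+1}
    return x

# toy demo on a noisy quadratic
rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 10.0, 5))
grad = lambda p: A @ p + 0.1 * rng.standard_normal(5)
for mode in ("polyak", "primal"):
    print(mode, np.round(averaged_sgd(mode, grad, np.ones(5), lr=0.05, T=5000), 3))
```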
Figure 8: Polyak and Primal Averaging Experiments. [Final values (with standard errors) from the eight panels: CIFAR-10 (WRN-16-8): Polyak 95.56% (SE 0.05), Primal 92.09% (SE 0.07), Schedule-Free reference 96.03%; CIFAR-100 (DenseNet): Polyak 76.99% (SE 0.12), Primal 70.20% (SE 0.19), Schedule-Free reference 78.71%; SVHN (ResNet-3-96): Polyak 98.32% (SE 0.01), Primal 97.36% (SE 0.03), Schedule-Free reference 98.40%; ILSVRC 2012 ImageNet (ResNet-50): Polyak 72.84% (SE 0.02), Primal 69.78% (SE 0.09), Schedule-Free reference 76.90%; fastMRI Knee (VarNet 2.0): Polyak 0.9088 (SE 0.00141), Primal 0.9100 (SE 0.00037), Schedule-Free reference 0.9112; Criteo Kaggle (DLRM): Polyak 0.7901 (SE 0.00004), Primal 0.7899 (SE 0.00006), Schedule-Free reference 0.7915; MAE ImageNet Finetune (ViT): Polyak 82.72 (SE 0.03), Primal 81.98 (SE 0.03), Schedule-Free reference 83.54%; OpenWebText (GPT-2 124M) test loss: Polyak 5.588 (SE 0.134), Primal 3.063 (SE 0.004), Schedule-Free reference at 200k steps 2.878.]

J Additional LR Sensitivity Plots

[Three panels: CIFAR-10 (WRN-16-8), CIFAR-100 (DenseNet), and SVHN (ResNet-3-96), each plotting final test accuracy (%) against learning rate for Schedule-Free and the cosine schedule baseline.]

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our paper contains new theory as well as extensive experimental results for our method.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We detail the limitations of our method in the conclusion section.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All theorems stated in the paper are proven in the Appendix.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: Detailed results concerning the experimental setup are provided in the appendix.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We have already open sourced our method. Experiments can be replicated using our open source PyTorch optimizer implementation together with existing open source code bases implementing each method.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: To the best of our knowledge, we provide sufficient information about hyperparameter settings for all experiments to be reproducible.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: All experiments reported use multiple seeds, and error bars are included in all plots.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [No]
Justification: As our method has no runtime overhead compared to existing approaches, we do not report computation times.

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have abided by the code of ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: There is no foreseeable societal impact from our work.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: Not applicable.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [No]
Justification: We do not distribute any models or datasets with this work. Datasets used for evaluation have license information available at the cited source.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: No new assets are being released.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: Not applicable.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: Not applicable.