# learningratefree_learning_by_dadaptation__b8796e49.pdf Learning-Rate-Free Learning by D-Adaptation Aaron Defazio 1 Konstantin Mishchenko 2 D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available1. 1. Introduction We consider the problem of unconstrained convex minimization, min x Rp f(x), where f has Lipschitz constant G and a non-empty set of minimizers. The standard approach to solving it is the subgradient method that, starting at a point x0, produces new iterates following the update rule: xk+1 = xk γkgk, where gk f(xk) is a subgradient of f. After running for n steps, the average iterate ˆxn = 1 n+1 Pn k=0 xk is returned. The learning rate γk, also known as the step size, is the main quantity controlling if and how fast the method converges. 1https://github.com/facebookresearch/ dadaptation 1Meta AI, Fundamental AI Research (FAIR) team, New York 2CNRS, ENS, INRIA SIERRA team, France. Correspondence to: Aaron Defazio , Konstantin Mishchenko . Proceedings of the 40 th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s). Algorithm 1 Dual Averaging with D-Adaptation Input: x0, d0 > 0, s0 = 0, g0 f(x0), γ0 = 1/ g0 If g0 = 0, exit with ˆxn = x0 for k = 0 to n do gk f(xk) sk+1 = sk + dkgk γk+1 = 1 q Pk i=0 gi 2 ˆdk+1 = γk+1 sk+1 2 Pk i=0 γid2 i gi 2 Option II: ˆdk+1 = Pk i=0 diγi gi, si sk+1 dk+1 = max dk, ˆdk+1 xk+1 = x0 γk+1sk+1 end for Return ˆxn = 1 Pn k=0 dk Pn k=0 dkxk If the learning rate sequence is chosen too large, the method might oscillate around the solution, whereas small values lead to very slow progress. Setting γk optimally requires knowledge of the distance to a solution. In particular, denote x to be any minimizer of f, D to be the associated distance D = x0 x , and f to be the optimal value, f = f(x ). Then, using the fixed step size: γk = D G n, the average iterate ˆxn converges in terms of function value at an inverse square-root rate: f(ˆxn) f = O(DG/ n). This rate is worst-case optimal for this complexity class (Nesterov, 2018). Setting this step size requires knowledge of two problem constants, D and G. Adaptivity to G can be achieved using a number of approaches, the most practical of which is the use of Ada Grad-Norm step sizes (Streeter & Mc Mahan, 2010; Duchi et al., 2011; Ward et al., 2019): γk = D q Pk i=0 gi 2 , Learning-Rate-Free Learning by D-Adaptation together with projection onto the D-ball around the origin. Ada Grad-Norm step sizes still require knowledge of D, and they perform poorly when it is estimated wrong. In the (typical) case where we don t have knowledge of D, we can start with loose lower and upper bounds d0 and dmax, and perform a hyper-parameter grid search on a log-spaced scale. In most machine learning applications a grid search is the current standard practice. In this work we take a different approach. 
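To make the pseudocode of Algorithm 1 concrete before analyzing it, the following is a minimal NumPy sketch of the deterministic method using the Option I estimate of ˆd. The function name and signature are ours, the code is a sketch of the pseudocode above rather than the released implementation, and the open-source package linked earlier should be treated as the reference.

```python
import numpy as np

def dadapt_dual_averaging(subgrad, x0, n, d0=1e-6):
    # Minimal sketch of Algorithm 1 (Dual Averaging with D-Adaptation),
    # Option I estimate of d-hat. subgrad(x) should return a subgradient
    # of f at x. Deterministic setting only; not the released implementation.
    x0 = np.asarray(x0, dtype=float)
    g0 = subgrad(x0)
    g0_norm = np.linalg.norm(g0)
    if g0_norm == 0.0:
        return x0.copy()                    # x0 already minimizes f
    d = float(d0)
    gamma = 1.0 / g0_norm                   # gamma_0 = 1 / ||g_0||
    s = np.zeros_like(x0)
    x = x0.copy()
    grad_sq_sum = 0.0                       # running sum of ||g_i||^2
    weighted_sq_sum = 0.0                   # running sum of gamma_i d_i^2 ||g_i||^2
    avg_num, avg_den = np.zeros_like(x0), 0.0

    for k in range(n + 1):                  # k = 0, ..., n
        g = subgrad(x)                      # g_k
        avg_num += d * x                    # accumulate d_k x_k for the average
        avg_den += d
        g_sq = float(np.dot(g, g))
        s = s + d * g                       # s_{k+1} = s_k + d_k g_k
        weighted_sq_sum += gamma * d ** 2 * g_sq
        grad_sq_sum += g_sq
        gamma = 1.0 / np.sqrt(grad_sq_sum)  # gamma_{k+1}
        s_norm = float(np.linalg.norm(s))
        # Option I lower bound on D = ||x0 - x*||
        d_hat = (gamma * s_norm ** 2 - weighted_sq_sum) / (2.0 * s_norm)
        d = max(d, d_hat)                   # d_{k+1} = max(d_k, d-hat_{k+1})
        x = x0 - gamma * s                  # x_{k+1} = x_0 - gamma_{k+1} s_{k+1}

    return avg_num / avg_den                # weighted average iterate
```

The only state beyond plain AdaGrad-style dual averaging is the scalar d and the running sum appearing in the ˆd estimate, so the per-step overhead is negligible.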
We describe a method that achieves the optimal rate, for sufficiently large n, by maintaining and updating a lower bound on D (Algorithm 1). Using this lower bound is provably sufficient to achieve the optimal rate of convergence: f(ˆxn) f(x ) = O DG n + 1 with no additional log factors, avoiding the need for a hyperparameter grid search. Our method is highly effective across a broad range of practical problems, matching a carefully hand-tuned baseline learning rate across a broad range of machine learning problems within computer vision, Natural language processing and recommendation systems. 2. Algorithm Our proposed approach is a simple modification of the Ada Grad step size applied to weighted dual averaging, together with our key innovation: D lower bounding. At each step, we construct a lower bound ˆdk on D using empirical quantities. If this bound is better (i.e. larger) than our current best bound dk of D, we use dk = ˆdk in subsequent steps. There are two options to estimate ˆdk, but since they have exactly the same theoretical properties, we only discuss the first option below. To construct the lower bound, we show that a weighted sum of the function values is bounded above as: k=0 dk (f(xk) f ) 2 d2 k gk 2 γn+1 There are two key differences from the classical bound (Orabona, 2019): k=0 dk (f(xk) f ) 1 2γ 1 n+1D2 + 2 d2 k gk 2 . Firstly, we are able to gain an additional negative term 1 2γn+1 sn+1 2. Secondly, we replace the typical D2 error term with D sn+1 , following the idea of Carmon & Hinder (2022). This bound is tighter than the classical bound, and equivalent when D = x0 xn+1 , since: 2γn+1 sn+1 2 2γ 1 n+1 D2 (D x0 xn+1 )2 1 From our bound, using the fact that k=0 dk (f(xk) f ) 0, 2 d2 k gk 2 γn+1 which can be rearranged to yield a lower bound on D, involving only known quantities: D ˆdn+1 = γn+1 sn+1 2 Pn k=0 γkd2 k gk 2 This bound is potentially vacuous if sn+1 2 is small in comparison to Pn k=0 γkd2 k gk 2. This only occurs once the algorithm is making fast-enough progress that bound adjustment is not necessary at that time. The maximum over seen bounds can not be negative since our algorithm begins with a user-specified positive lower bound d0, which sets the scale of the initial steps. Theorem 2.1. For a convex G-Lipschitz function f, Algorithm 1 returns a point ˆxn such that: f(ˆxn) f(x ) = O DG n + 1 as n , where D = x0 x for any x in the set of minimizers of f, as long as d0 D. The above result is asymptotic due to the existence of worst-case functions when n is fixed in advance. For any fixed choice of n, a function can be constructed such that Algorithm 1 run for n steps has a dependence on d0. Despite this, we are able to show that the non-asymptotic convergence rate is only worse by a log factor. Theorem 2.2. Consider Algorithm 1 run for n 2 log2(D/d0) iterations with the step size modified to be G2 + Pk i=0 gi 2 . (1) If we return the point ˆxt = 1 Pt k=0 dk Pt k=0 dkxk where t Learning-Rate-Free Learning by D-Adaptation is chosen to be t = arg mink n dk+1 Pk i=0 di , then using the notation log2+(x) = max(1, log2 (x)): f(ˆxt) f 16log2+(dn+1/d0) The modification to the step size can be avoided at the cost of having an extra term, namely we would have the following guarantee for the same iterate ˆxt: f(ˆxt) f 16DG log2+( D d0 ) n + 1 + 8DG2 log2+( D (n + 1) g0 . Notice that, unlike the bound in the theorem above, it also depends on the initial gradient norm g0 . Our algorithm returns a weighted average iterate ˆxn rather than the last iterate xn+1. 
This is standard practice when Ada Grad Norm schedules approaches are used, both for dual averaging and gradient descent. Techniques are known to obtain guarantees on the last-iterate either by the use of momentum (Defazio & Gower, 2021) or modified step-size sequences (Jain et al., 2019), although we have no explored if these approaches are compatible with D-Adaptation. 2.1. Why Dual Averaging? The new bound we develop is actually general enough to apply to both gradient descent and dual averaging. Using the same proof techniques, D-Adaptation can also be applied on top of a gradient descent step. However, we do not use the gradient descent version above for a technical reason: the asymptotic convergence rate has an additional log factor. The practical performance of the two methods is very similar. Theorem 2.3. Gradient Descent with D-Adaptation (Algorithm 2), under the assumptions of Theorem 2.1, returns a point ˆxn such that: f(ˆxn) f = O DG n + 2 log (n + 2) . This log factor arises whenever any-time step sizes are used on top of gradient descent when applied to unbounded domains, and is not specific to our method (Beck, 2014). 3. D-Adapted Ada Grad The D-Adaptation technique can be applied on top of the coordinate-wise scaling variant of Ada Grad with appropriate modifications. Algorithm 3 presents this method. This variant estimates the distance to the solution in the ℓ - norm instead of the Euclidean norm, D = x0 x . The theory for Ada Grad without D-Adaptation also uses Algorithm 2 Gradient Descent with D-Adaptation Input: d0, x0 s0 = 0 If g0 = 0, exit with ˆxn = x0 for k = 0 to n do G2 + Pk i=0 gi 2 sk+1 = sk + λkgk ˆdk+1 = sk+1 2 Pk i=0 λ2 i gi 2 dk+1 = max dk, ˆdk+1 xk+1 = xk λkgk end for Return ˆxn = 1 Pn k=0 λk Pn k=0 λkxk Algorithm 3 D-Adapted Ada Grad Input: x0, d0 (default 10 6), G s0 = 0, a0 = [G , . . . , G ] for k = 0 to n do gk f(xk, ξk) sk+1 = sk + dkgk a2 k+1 = a2 k + g2 k Ak+1 = diag(ak+1) ˆdk+1 = sk+1 2 A 1 k+1 Pk i=0 d2 i gi 2 A 1 i 2 sk+1 1 dk+1 = max dk, ˆdk+1 xk+1 = x0 A 1 k+1sk+1 end for Return ˆxn = 1 Pn k=0 dk Pn k=0 dkxk the same norm to measure the distance to solution, so this modification is natural, and results in the same adaptive convergence rate as Ada Grad up to constant factors without requiring knowledge of D . Theorem 3.1. For a convex p-dimensional function with G = maxx f(x) , D-Adapted Ada Grad (Algorithm 3) returns a point ˆxn such that f(ˆxn) f = O an+1 1 D = O p G D n + 1 as n , where D = x0 x for any x in the set of minimizers of f, as long as d0 D . Similarly to Theorem 2.2, we could achieve the same result up to higher order terms without using G in the initialization of a0. Following the standard approach for Ada Grad, Algorithm 3 maintains a vector a to track the Learning-Rate-Free Learning by D-Adaptation coordinate-wise denominator. We introduce a diagonal matrix Ak+1 which allows us to avoid using coordinatewise notation. 4. Discussion Figure 1 depicts the behavior of D-Adaptation on a toy problem - minimizing an absolute value function starting at x0 = 1.0. Here d0 is started at 0.1, below the known D value of 1.0. This example illustrates the growth of dk towards D. The value of dk typically doesn t asymptotically approach D, as this is not guaranteed nor required by our theory. Instead, we show in Theorem F.1 that under a mild assumption, dk is asymptotically greater than or equal to D/(1 + 3). The lower bound ˆdk will often start to decrease, and even go negative, once dk is large enough. 
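This behavior is easy to reproduce numerically. The following self-contained sketch (our own NumPy code, using the Option I estimate; not the released implementation) runs Algorithm 1 on the toy problem of Figure 1, f(x) = |x| with x0 = 1.0 and d0 = 0.1, and prints dk and ˆdk as the iterations proceed:

```python
import numpy as np

# Toy problem from Figure 1: f(x) = |x|, x0 = 1.0, so D = |x0 - x*| = 1.0.
x0, d = 1.0, 0.1                         # d0 = 0.1, below the true D
x, s = x0, 0.0
grad_sq, weighted_sq = 0.0, 0.0
gamma = 1.0                              # gamma_0 = 1 / |g_0| with g_0 = 1

for k in range(101):
    g = float(np.sign(x))                # subgradient of |x| (0 at x = 0)
    s += d * g                           # s_{k+1} = s_k + d_k g_k
    weighted_sq += gamma * d ** 2 * g ** 2
    grad_sq += g ** 2
    gamma = 1.0 / np.sqrt(grad_sq)       # gamma_{k+1}
    d_hat = (gamma * s ** 2 - weighted_sq) / (2.0 * max(abs(s), 1e-12))
    d = max(d, d_hat)                    # d_{k+1} = max(d_k, d-hat_{k+1})
    x = x0 - gamma * s                   # dual-averaging iterate
    if k % 20 == 0:
        print(f"k={k:3d}  d_k={d:.3f}  d_hat={d_hat:+.3f}  x_k={x:+.3f}")
```

On this run, dk grows from d0 while x approaches the minimizer, and ˆdk eventually decreases once progress is fast enough.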
Negative values of ˆdk were seen in most of the experiments in Section 7. 4.1. Different ways to estimate D Algorithm 3 is presented with two options for estimating ˆdk, where the numerator of the second option is provably larger or equal to that of the first option: k=0 γkdk gk, sk γn+1 2 d2 k gk 2 . We found the two options worked equally well in practice. The inner product between the step direction sk and the gradient gk, which shows up in the second option, is a quantity known as the (negative) hyper-gradient (Bengio, 2000; Domke, 2012; Pedregosa, 2016; Feurer & Hutter, 2019; Chandra et al., 2022; Wang et al., 2021). In classical applications of the hyper-gradient, the learning rate is increased when the gradient points in the same direction as the previous step, and it is decreased otherwise. In essence, the hyper-gradient indicates if the current learning rate is too large or to small. In works that use hypergradient to estimate learning rate, an additional hyperlearning rate parameter is needed to control the rate of change of the learning rate, whereas our approach requires no extra parameters. In our approach, the hyper-gradient quantity is used to provide an actual estimate of the magnitude of the optimal learning rate (or more precisely a lower bound), which is far more information than just a directional signal of too-large or too-small. This is important for instance when a learning rate schedule is being used, as we can anneal the learning rate down over time, without the hyper-gradient responding by pushing the learning rate back up. This is also useful during learning rate warmup, as we are able to build an estimate of D during the warmup, which is not possible when using a classical hyper-gradient approach. 5. Related Work We are not aware of any existing approach for convex Lipschitz (potentially non-smooth) optimization that avoids the need for knowledge of any hyper-parameters while still achieving the optimal rate asymptotically. Drori & Taylor (2020); Goujaud et al. (2022) give an algorithm for non-smooth convex problems which has no-tunable parameters, but requires an exact line-search. The Polyak step size (Polyak, 1987) is one approach that avoids requiring knowledge of D, instead, knowledge of f is required. Using estimates or approximations of f tend to result in unstable convergence, however a restarting scheme that maintains lower bounds on f can be shown to converge within a multiplicative log factor of the optimal rate (Hazan & Kakade, 2019). Instead of running sub-gradient descent on every grid-point on a log spaced grid from d0 to dmax, a bisection algorithm can be run instead on the same grid, resulting in a doublelogarithmic term (Carmon & Hinder, 2022). Approaches from Online Learning can be applied here such as coin-betting (Orabona, 2019; Orabona & Tommasi, 2017; Mc Mahan & Orabona, 2014; Zhang et al., 2022; Orabona & Pál, 2021). Asymptotic rates for these methods in the Offline Lipschitz setting are not currently known. In terms of nonasymptotic rates, their theoretical convergence rate is better by a factor p log(1 + D/d0) than our non-asymptotic rate. Another approach in the Online Learning setting is Streeter & Mc Mahan (2012) s reward-doubling technique, which tracks similar norm quantities to our approach, although they estimate a different quantity than D. Like our work, the Do G (Distance Over Gradients) approach of Ivgi et al. (2023) builds upon Carmon & Hinder (2022). They estimate D by rk = max i k xi x0 . 
This estimator is not necessarily bounded; they show a convex counter-example where rk goes to infinity. Nevertheless, by adding additional dampening in the denominator of the step-size, they are able to show learningrate free convergence in the stochastic setting. Their result is more general than ours, as we only prove convergence in the non-stochastic setting, although their rate contains additional multiplicative log-factors compared to our rate. Their work is concurrent with ours, and appears in the same venue. 6. Machine Learning Applications It is straightforward to adapt the D-Adaptation technique to stochastic optimization, although the theory no longer directly supports this case. Algorithm 4 and 5 are versions Learning-Rate-Free Learning by D-Adaptation 1.0 0.5 0.0 0.5 1.0 0 20 40 60 80 100 Figure 1. Toy problem illustrating the estimate of D over time, f(x) = |x|. x0 = 1.0 is shown as a blue dot on the left plot, and the following iterates are shown in purple. Algorithm 4 SGD with D-Adaptation Input: x0, d0 (default 10 6), γk (default 1), β = 0.9, G (default g0 ) s0 = 0, z0 = x0 for k = 0 to n do gk f(xk, ξk) G sk+1 = sk + λkgk zk+1 = zk λkgk xk+1 = βxk + (1 β)zk+1 ˆdk+1 = 2 Pk i=0 λi gi, si dk+1 = max dk, ˆdk+1 of D-Adaptation for SGD and Adam respectively. Both of the two methods solve the stochastic optimization problem, min x Rp E[f(x, ξ)] using stochastic subgradients gk f(xk, ξk). For the SGD variant (Algorithm 1), we multiply the D bound by a factor of two compared to Algorithm 4. This improves the practical performance of the method. Our theoretical rate is still valid up to constant factors, for any constant multiplier applied to the step-size, so this change is still covered by our theory. For the denominator of the step size, we use G = g0 , which is a crude approximation to the true G but appears to work very well in practice. We include momentum (β) implemented using the primal averaging technique, following the approach of Defazio (2020) and Defazio & Gower (2021). For Adam, we make the following modifications: Algorithm 5 Adam with D-Adaptation Input: x0, d0 (default 10 6), γk (default 1), β1, β2, ϵ (default 0.9, 0.999, 10 8). s0 = 0, m0 = 0, v0 = 0, r0 = 0 for k = 0 to n do gk f(xk, ξk) mk+1 = β1mk + (1 β1)dkγkgk vk+1 = β2vk + (1 β2)g2 k Ak+1 = diag( vk+1 + ϵ) xk+1 = xk A 1 k+1mk+1 Learning rate update sk+1 = β2sk + (1 β2)dkγkgk rk+1 = β2rk + (1 β2)dkγk gk, sk A 1 k+1 ˆdk+1 = rk+1 (1 β2) sk+1 1 dk+1 = max dk, ˆdk+1 The norms are now weighted instead of unweighted. Since sk is now updated by an exponential moving average, a correction factor of 1 β2 in the D bound is needed to keep everything at the same scale. The Adam variant adapts quicker than the SGD variant and we found no constant multiplier was needed for ˆd. A derivation of the weights of this Adam variant is included in Appendix G. We use ˆd Option II for both methods, which only makes a practical difference for the Adam variant; for the SGD case it is exactly equivalent to Option I. We include an optional γk constant sequence as input to the algorithms. This sequence should be set following a learning rate schedule if one is needed for the problem. This schedule should consider 1.0 as the base value, increase towards 1.0 during warm-up (if needed), and decrease from 1 during Learning-Rate-Free Learning by D-Adaptation learning rate annealing. Typically the same schedule can be used as would normally be used without D-Adaptation. 7. 
Experimental Results We compared our D-Adapted variants of Adam and SGD on a range of machine learning problems to demonstrate their effectiveness in practice. Unless otherwise mentioned, we used the standard learning rate schedule typically used for the problem, with the base learning rate set by D-Adaptation. Full hyper-parameter settings for each problem are included in the Appendix. Convex Problems For our convex experiments, we considered logistic regression applied to 12 commonly used benchmark problems from the LIBSVM repository. In each case, we consider 100 epochs of training, with a stage-wise schedule with 10-fold decreases at 60, 80, and 95 epochs. The learning rate for Adam was chosen as the value that gave the highest final accuracy using a grid search. D-Adaptation matches or exceeds the performance of the grid-search based learning rate on all 12 problems, to within 0.5% accuracy (Figure 4, in the Appendix). Convolutional Image Classification For a convolutional image classification benchmark, we used the three most common datasets used for optimization method testing: CIFAR10, CIFAR100 (Krizhevsky, 2009) and Image Net 2012 (Russakovsky et al., 2015). We varied the architectures to show the flexibility of D-Adaptation, using a Wide Res Net (Zagoruyko & Komodakis, 2016), a Dense Net (Huang et al., 2017) and a vanilla Res Net model (He et al., 2016) respectively. D-Adaptation matches or exceeds the baseline learning rates on each problem. LSTM Recurrent Neural Networks The IWSLT14 German-to-English dataset (Cettolo et al., 2014) is a standard choice for benchmarking machine translation models. We trained an LSTM model (Wiseman & Rush, 2016) commonly used for this problem. The standard training procedure includes an inverse-square-root learning rate schedule, which we used for both the baseline and for DAdaptation. Our model achieves comparable performance to the baseline training regimen without any need to tune the learning rate. Masked Language Modelling Bidirectional Encoder Representations from Transformers (BERT) is a popular approach to pretraining transformer models (Devlin et al., 2019). We use the 110M parameter Ro BERTA variant (Liu et al., 2019) of BERT for our experiments. This model size provides a large and realistic test problem for D-Adaptation. We train on the Book-Wiki corpus (combining books from Zhu et al. (2015) and a snapshot of Wikipedia). D-Adaptation again matches the baseline in test-set perplexity. Auto-regressive Language Modelling For our experiments on auto-regressive language modelling, we used the original GPT decoder-only transformer architecture (Radford et al., 2019). This model is small enough to train on a single machine, unlike the larger GPT-2/3 models. Its architecture is representative of other large language models. We trained on the large Book-Wiki corpus. D-Adaptation is comparable to the baseline with only a negligible perplexity difference. Object Detection The COCO 2017 object detection task is a popular benchmark in computer vision. We trained as Faster-RCNN (Ren et al., 2015) model as implemented in Detectron2 (Wu et al., 2019). For the backbone model, we used a pretrained Res Ne Xt-101-32x8d (Xie et al., 2017), the largest model available in Detectron2 for this purpose. Our initial experiments showed D-Adaptation overfitting. 
We identified that the default decay of 0.0001 in the code-base was not optimized for this backbone model, and increasing it to 0.00015 improved the test set accuracy for both the baseline (42.67 to 42.99) and D-adapted versions (41.92 to 43.07), matching the published result of 43 for this problem. Vision Transformers Vision transformers (Dosovitskiy et al., 2021) are a recently developed approach to image classification that differ significantly from the image classification approaches in Section 7. They are closer to the state-of-the-art than Res Net models, and require significantly more resources to train to high accuracy. We use the vit_tiny_patch16_224 model in the Py Torch Image Models framework (Wightman, 2019) as it is small enough to train on 8 GPUs. The standard training pipeline uses a cosine learning rate schedule. This is an example of a situation where D-Adaptation under-performs the baseline learning rate. This problem appears to be highly sensitive to the initial learning rate, which may explain the observed differences. fast MRI The fast MRI Knee Dataset (Zbontar et al., 2018) is a large-scale release of raw MRI data. The reconstruction task consists of producing a 2-dimensional, grey-scale image of the anatomy from the raw sensor data, under varying under-sampling regimes. We trained a Var Net 2.0 (Sriram et al., 2020) model, a strong baseline model on this dataset, using the code and training setup released by Meta (Knoll et al., 2020; Defazio, 2019). We again match the highly tuned baseline learning rate with D-Adaptation. Recommendation Systems The Criteo Kaggle Display Advertising dataset is a large, sparse dataset of user clickthrough events. The DLRM (Naumov et al., 2019) model Learning-Rate-Free Learning by D-Adaptation 0 50 100 150 200 250 300 Test Accuracy (%) CIFAR-10 (WRN-16-8) D-Adapt SGD (95.56% SE 0.04) SGD (95.59% SE 0.03) 0 50 100 150 200 250 300 CIFAR-10 (WRN-16-8) D-Adapt SGD (1.3e-03 SE 4.9e-05) SGD (1.6e-03 SE 4.6e-05) 0 50 100 150 200 250 300 Test Accuracy (%) CIFAR-100 (Dense Net) D-Adapt SGD (76.90% SE 0.06) SGD (76.41% SE 0.14) 0 50 100 150 200 250 300 CIFAR-100 (Dense Net) D-Adapt SGD (1.3e-02 SE 5.0e-04) SGD (1.3e-02 SE 5.2e-04) 0 20 40 60 80 Test Accuracy (%) ILSVRC 2012 Image Net (Res Net-50) D-Adapt SGD (76.43% SE 0.01) SGD (76.09% SE 0.06) 0 20 40 60 80 ILSVRC 2012 Image Net (Res Net-50) D-Adapt SGD (8.9e-01 SE 8.7e-03) SGD (9.5e-01 SE 2.3e-02) 0 50 100 150 200 250 300 Test Accuracy (%) ILSVRC 2012 Image Net (Vision Transformer) D-Adapt Adam (72.05%) Adam (74.01%) 0 50 100 150 200 250 300 ILSVRC 2012 Image Net (Vision Transformer) D-Adapt Adam (0.0039) Adam (0.0037) 0 50000 100000 150000 200000 250000 Test BBox AP COCO 2017 (Faster RCNN X-101) D-Adapt SGD (43.07 SE 0.09) SGD (42.99 SE 0.08) 0 50000 100000 150000 200000 250000 COCO 2017 (Faster RCNN X-101) D-Adapt SGD (0.42 SE 0.01) SGD (0.41 SE 0.00) Figure 2. Vision experiments. 
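Across the experiments above, D-Adaptation is used as a drop-in replacement for the baseline optimizer, keeping the existing learning rate schedule. A hedged sketch of what this looks like in PyTorch is given below; the class name DAdaptAdam and its constructor signature refer to the open-source package linked in the introduction and should be checked against that repository, and the dataset and model here are stand-ins so the loop runs end to end.

```python
import torch
from dadaptation import DAdaptAdam  # package linked in the introduction;
                                    # class name and signature assumed here

# Tiny stand-in dataset and model for illustration only.
data = torch.utils.data.TensorDataset(torch.randn(512, 784),
                                      torch.randint(0, 10, (512,)))
loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)
model = torch.nn.Linear(784, 10)

# The base learning rate stays at 1.0: D-Adaptation determines the actual
# step size, and any schedule is applied multiplicatively on top of it.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

The scheduler plays the role of the γk sequence from Section 6: it is defined relative to a base value of 1.0 rather than a hand-tuned learning rate.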
Learning-Rate-Free Learning by D-Adaptation 0 10000 20000 30000 40000 50000 60000 IWSLT14 (LSTM) Adam (4.31 SE 0.003) D-Adapt Adam (4.33 SE 0.003) 0 10000 20000 30000 40000 50000 60000 Training Loss IWSLT14 (LSTM) Adam (4.27 SE 0.003) D-Adapt Adam (4.03 SE 0.002) 5000 10000 15000 20000 Test Perplexity Book Wiki (Ro BERTa) Adam (3.97 SE 0.010) D-Adapt Adam (3.96 SE 0.008) 0 5000 10000 15000 20000 Training Perplexity Book Wiki (Ro BERTa) Adam (4.24 SE 0.018) D-Adapt Adam (4.26 SE 0.018) 0 10000 20000 30000 40000 50000 60000 Test Perplexity Book Wiki (GPT Transformer) Adam (19.49 SE 0.012) D-Adapt Adam (19.46 SE 0.019) 0 10000 20000 30000 40000 50000 60000 Training Perplexity Book Wiki (GPT Transformer) Adam (20.19 SE 0.084) D-Adapt Adam (20.16 SE 0.093) 0 10 20 30 40 50 fast MRI Knee (Var Net 2.0) Adam (0.9103 SE 0.00032) D-Adapt Adam (0.9105 SE 0.00057) 0 10 20 30 40 50 fast MRI Knee (Var Net 2.0) Adam (0.4281 SE 0.00021) D-Adapt Adam (0.4276 SE 0.00028) 0 20 40 60 80 100 Test Accuracy Criteo Kaggle (DLRM) Adam (0.7906 SE 0.00007) D-Adapt Adam (0.7905 SE 0.00014) 0 20 40 60 80 100 Criteo Kaggle (DLRM) Adam (0.4444 SE 0.00155) D-Adapt Adam (0.4440 SE 0.00150) Figure 3. NLP, medical and recommendation systems. Learning-Rate-Free Learning by D-Adaptation Problem Baseline LR D-Adapted LR Std. Dev. CIFAR10 1.0 2.085 0.078 CIFAR100 0.5 0.4544 0.029 Image Net 1.0 0.9227 0.084 IWSLT 0.01 0.003945 0.000086 GPT 0.001 0.0009218 0.000014 Ro BERTa 0.001 0.0009331 0.000011 COCO 0.2 0.2004 0.0026 Vi T 0.001 0.0073 0.00085 fast MRI 0.0003 0.0007596 0.00022 DLRM 0.0001 0.0001282 0.000056 Table 1. Comparison of baseline learning rates against final DAdapted learning rates for the deep learning experiments, with average and standard deviation shown over multiple seeds. is a common benchmark for this problem, representative of personalization and recommendation systems used in industry. Our method closely matches the performance of the tuned baseline learning rate. 7.1. Sensitivity to d0 According to our theory, as long as each training run reaches the asymptotic regime the resulting final loss should be independent of the choice of d0, as long as d0 D. We tested this hypothesis by running each of the 12 convex logistic regression problems using values of d0 ranging from 10 16 to 10 2. Figure 5 (Appendix E) shows that across every dataset, there is no dependence on the initial value of d0. Given these results, we do not consider d0 a hyper-parameter. There is no indication that d0 should be tuned in practice. 7.2. Observed learning rates Table 1 shows the learning rates obtained by D-Adaptation for each of our deep learning experiments. The adapted values show remarkable similarity to the hand-tuned values. The hand-tuned learning rates are given by either the paper or the public source code for each problem; It s unclear to what granularity they were tuned. In some cases DAdaptation gives notably higher learning rates, such as for CIFAR-10. For SGD experiments, we used Py Torch s dampening parameter for implementation consistency with Adam. This requires the learning rate to be multiplied by 1/(1 β1) compared to the undampened values, which is reflected in the baseline learning rates in this table. We observed that in cases where there is a wide range of good learning rates that give equal final test results, DAdaptation has a tendency to choose values at the higher end of the range. For instance, on CIFAR10, using learning rate 2.0 instead of the baseline 1.0 gives equal final test accuracy. 
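For concreteness, with β1 = 0.9 the CIFAR-10 baseline in Table 1 is the dampened equivalent of the conventional undampened value from Table 3:

\[
\gamma_{\text{dampened}} \;=\; \frac{\gamma_{\text{undampened}}}{1-\beta_1} \;=\; \frac{0.1}{1-0.9} \;=\; 1.0,
\]

and similarly 0.05 maps to 0.5 (CIFAR-100), 0.1 to 1.0 (ImageNet), and 0.02 to 0.2 (COCO).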
The default of 1.0 is likely used in practice just for simplicity. 8. Conclusion We have presented a simple approach to achieving parameter free learning of convex Lipshitz functions, by constructing successively better lower bounds on the key unknown quantity: the distance to solution x0 x . Our approach for constructing these lower bounds may be of independent interest. Our method is also highly practical, demonstrating excellent performance across a range of large and diverse machine learning problems. Acknowledgements We would like to thank Ashok Cutkosky for suggesting substantial simplifications to the proof of Lemma A.2. Beck, A. Introduction to Nonlinear Optimization. Society for Industrial and Applied Mathematics, 2014. (Cited on page 3) Bengio, Y. Gradient-based optimization of hyperparameters. In Neural Computation, 2000. (Cited on page 4) Carmon, Y. and Hinder, O. Making SGD parameter-free. In Conference on Learning Theory. PMLR, 2022. (Cited on pages 2 and 4) Cettolo, M., Niehues, J., Stüker, S., Bentivogli, L., and Federico, M. Report on the 11th IWSLT evaluation campaign. In IWSLT, 2014. (Cited on page 6) Chandra, K., Xie, A., Ragan-Kelley, J., and Meijer, E. Gradient descent: The ultimate optimizer. In 36th Conference on Neural Information Processing Systems (Neur IPS), 2022. (Cited on page 4) Defazio, A. Offset sampling improves deep learning based accelerated mri reconstructions by exploiting symmetry. ar Xiv preprint ar Xiv:1912.01101, 2019. (Cited on page 6) Defazio, A. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization, 2020. (Cited on page 5) Defazio, A. and Gower, R. M. The power of factorial powers: New parameter settings for (stochastic) optimization. In Proceedings of The 13th Asian Conference on Machine Learning, volume 157 of Proceedings of Machine Learning Research. PMLR, 2021. (Cited on pages 3 and 5) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Learning-Rate-Free Learning by D-Adaptation Association for Computational Lingustics, 2019. (Cited on page 6) Domke, J. Generic methods for optimization-based modeling. In Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS), 2012. (Cited on page 4) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. (Cited on page 6) Drori, Y. and Taylor, A. B. Efficient first-order methods for convex minimization: a constructive approach. Mathematical Programming, 2020. (Cited on page 4) Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61), 2011. (Cited on pages 1 and 22) Feurer, M. and Hutter, F. Automated Machine Learning, chapter Hyperparameter Optimization. Springer International Publishing, 2019. (Cited on page 4) Goujaud, B., Taylor, A., and Dieuleveut, A. Optimal firstorder methods for convex functions with a quadratic upper bound. Technical report, INRIA, 2022. (Cited on page 4) Hazan, E. and Kakade, S. M. Revisiting the polyak step size. Technical report, Google AI Princeton, 2019. 
(Cited on page 4) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016. (Cited on page 6) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261 2269, 2017. doi: 10.1109/ CVPR.2017.243. (Cited on page 6) Ivgi, M., Hinder, O., and Carmon, Y. Dog is sgd s best friend: A parameter-free dynamic step size schedule, 2023. (Cited on page 4) Jain, P., Nagaraj, D., and Netrapalli, P. Making the last iterate of sgd information theoretically optimal. Conference On Learning Theory (COLT), 2019. (Cited on page 3) Knoll, F., Zbontar, J., Sriram, A., Muckley, M. J., Bruno, M., Defazio, A., Parente, M., Geras, K. J., Katsnelson, J., Chandarana, H., Zhang, Z., Drozdzalv, M., Romero, A., Rabbat, M., Vincent, P., Pinkerton, J., Wang, D., Yakubova, N., Owens, E., Zitnick, C. L., Recht, M. P., Sodickson, D. K., and Lui, Y. W. fast MRI: A publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiology: Artificial Intelligence, 2(1), 2020. (Cited on page 6) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. (Cited on page 6) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Ro BERTa: A robustly optimized BERT pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019. (Cited on page 6) Mc Mahan, H. B. and Orabona, F. Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research. PMLR, 2014. (Cited on page 4) Naumov, M., Mudigere, D., Shi, H. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C., Azzolini, A. G., Dzhulgakov, D., Mallevich, A., Cherniavskii, I., Lu, Y., Krishnamoorthi, R., Yu, A., Kondratenko, V., Pereira, S., Chen, X., Chen, W., Rao, V., Jia, B., Xiong, L., and Smelyanskiy, M. Deep learning recommendation model for personalization and recommendation systems. Co RR, 2019. (Cited on page 6) Nesterov, Y. Lectures on Convex Optimization. Springer Nature, 2018. (Cited on page 1) Orabona, F. A modern introduction to online learning. Technical report, Boston University, 2019. (Cited on pages 2, 4, and 19) Orabona, F. and Pál, D. Parameter-free stochastic optimization of variationally coherent functions, 2021. (Cited on page 4) Orabona, F. and Tommasi, T. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, volume 30, 2017. (Cited on page 4) Pedregosa, F. Hyperparameter optimization with approximate gradient. In International conference on machine learning, pp. 737 746. PMLR, 2016. (Cited on page 4) Learning-Rate-Free Learning by D-Adaptation Polyak, B. T. Introduction to optimization. Optimization Software, Inc., 1987. (Cited on page 4) Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining. Technical report, Open AI, 2019. (Cited on page 6) Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. 
(Cited on page 6) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 2015. (Cited on page 6) Sriram, A., Zbontar, J., Murrell, T., Defazio, A., Zitnick, C. L., Yakubova, N., Knoll, F., and Johnson, P. End-to-end variational networks for accelerated MRI reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2020. (Cited on page 6) Streeter, M. and Mc Mahan, H. B. Less regret via online conditioning, 2010. (Cited on pages 1 and 14) Streeter, M. and Mc Mahan, H. B. No-regret algorithms for unconstrained online convex optimization. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), 2012. (Cited on page 4) Wang, X., Yuan, S., Wu, C., and Ge, R. Guarantees for tuning the step size using a learning-to-learn approach. In Proceedings of the 38th International Conference On Machine Learning, 2021. (Cited on page 4) Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, 2019. (Cited on page 1) Wightman, R. Pytorch image models. https://github. com/rwightman/pytorch-image-models, 2019. (Cited on page 6) Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2016. (Cited on page 6) Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. Detectron2. https://github.com/ facebookresearch/detectron2, 2019. (Cited on page 6) Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. (Cited on page 6) Zagoruyko, S. and Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016. (Cited on page 6) Zbontar, J., Knoll, F., Sriram, A., Muckley, M. J., Bruno, M., Defazio, A., Parente, M., Geras, K. J., Katsnelson, J., Chandarana, H., et al. fast MRI: An open dataset and benchmarks for accelerated MRI. ar Xiv preprint ar Xiv:1811.08839, 2018. (Cited on page 6) Zhang, Z., Cutkosky, A., and Paschalidis, I. C. Pde-based optimal strategy for unconstrained online learning. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), 2022. (Cited on page 4) Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015. (Cited on page 6) Learning-Rate-Free Learning by D-Adaptation A. Core Theory Here, we are going to consider a more general form of Algorithm 1 with arbitrary positive weights λk that do not have to be equal to dk. In particular, we will study the update rule sn+1 = sn + λngn and ˆdn+1 = γn+1 sn+1 2 Pn k=0 γkλ2 k gk 2 Later in the proofs, we will set λk = dk, but most intermediate results are applicable with other choices of λk as well. Lemma A.1. The inner product γkλk gk, sk is a key quantity that occurs in our theory. 
We can bound the sum of these inner products over time by considering the following expansion: k=0 γkλk gk, sk = γn+1 2 λ2 k gk 2 + 1 k=0 (γk+1 γk) sk+1 2 . This simplifies when γk = γn+1 and the weighting sequence is flat, i.e., if λk = 1 for all k: k=0 gk, sk = γn+1 2 sn+1 2 + γn+1 with λ weights: k=0 λk gk, sk = γn+1 2 sn+1 2 + γn+1 k=0 λ2 k gk 2 . Proof. This is straightforward to show by induction (it s a consequence of standard DA proof techniques, where sn 2 is expanded). 2 sn+1 2 = γn 2 sn+1 2 + 1 2 (γn+1 γn) sn+1 2 2 sn 2 + γnλn gn, sn + γn 2 λ2 n gn 2 + 1 2 (γn+1 γn) sn+1 2 . γnλn gn, sn = γn 2 sn 2 γn+1 2 sn+1 2 + γn 2 λ2 n gn 2 + 1 2 (γn+1 γn) sn+1 2 . Telescoping k=0 γkλk gk, sk = γn+1 2 λ2 k gk 2 + 1 k=0 (γk+1 γk) sk+1 2 . Lemma A.2. The iterates of Algorithm 1 satisfy k=0 λk (f(xk) f ) x0 x sn+1 + 2 λ2 k gk 2 γn+1 Learning-Rate-Free Learning by D-Adaptation Proof. Starting from convexity: k=0 λk (f(xk) f ) k=0 λk gk, xk x k=0 λk gk, xk x0 + x0 x = sn+1, x0 x + k=0 λk gk, xk x0 = sn+1, x0 x k=0 λkγk gk, sk k=0 λkγk gk, sk . We can further simplify with: k=0 γkλk gk, sk = γn+1 2 λ2 k gk 2 + 1 k=0 (γk+1 γk) sk+1 2 . Using the fact that γk+1 γk 0 we have: k=0 λk (f(xk) f ) x0 x sn+1 k=0 γkλk gk, sk x0 x sn+1 + 2 λ2 k gk 2 γn+1 Theorem A.3. For Algorithm 1, the initial distance to solution, D = x0 x , can be lower bounded as follows D ˆdn+1 = γn+1 sn+1 2 Pn k=0 γkλ2 k gk 2 Proof. The key idea is that the bound in Lemma A.2, k=0 λk (f(xk) f ) D sn+1 + 2 λ2 k gk 2 γn+1 gives some indication as to the magnitude of D in the case when the other terms on the right are negative. To proceed, we use Pn k=0 λk (f(xk) f ) 0, giving: 2 λ2 k gk 2 γn+1 which we can rearrange to: D sn+1 γn+1 2 λ2 k gk 2 . 2 sn+1 2 Pn k=0 γk 2 λ2 k gk 2 Learning-Rate-Free Learning by D-Adaptation Lemma A.4. In Algorithm 1, the norm of sn+1 is bounded by: γn+1 + Pn k=0 γkλ2 k gk 2 2dn+1 . (2) Proof. Using the definition of ˆdn+1 from Theorem A.3, and the property ˆdn+1 dn+1, we derive 2 λ2 k gk 2 = ˆdn+1 sn+1 dn+1 sn+1 . Using inequality 2αβ α2 + β2 with α2 = 2d2 n+1 γn+1 and β2 = γn+1 2 sn+1 2 and then the bound above, we establish 2αβ = 2dn+1 sn+1 2d2 n+1 γn+1 + γn+1 2 sn+1 2 2d2 n+1 γn+1 + dn+1 sn+1 + 2 λ2 k gk 2. Rearranging the terms, we obtain dn+1 sn+1 2d2 n+1 γn+1 + 2 λ2 k gk 2. It remains to divide this inequality by dn+1 to get the desired claim. Proposition A.5. (From Streeter & Mc Mahan (2010)) The gradient error term can be bounded as: G2 + Pk 1 i=0 gi 2 2 k=0 gk 2. (3) Moreover, if γk = 1 G2+Pk 1 i=0 gi 2 , then 2 gk 2 γn+1 Lemma A.6. Setting λk = dk, it holds for Algorithm 1: k=0 dk (f(xk) f ) 2Ddn+1 k=0 gk 2 + Ddn+1 k=0 γk gk 2. Proof. First, recall the key bound from Lemma A.2: k=0 λk (f(xk) f ) D sn+1 γn+1 2 λ2 k gk 2 2 λ2 k gk 2 . Now let us apply the bound from Lemma A.4: γn+1 + Pn k=0 γkλ2 k gk 2 which gives k=0 λk (f(xk) f ) 2Ddn+1 γn+1 + D Pn k=0 γkλ2 k gk 2 2 λ2 k gk 2 . Learning-Rate-Free Learning by D-Adaptation Using λk = dk dn+1 D and plugging in the step size, we obtain k=0 dk (f(xk) f ) 2Ddn+1 γn+1 + D Pn k=0 γkd2 n+1 gk 2 2 d2 n+1 gk 2 k=0 gk 2 + 1 k=0 γk gk 2 + 1 k=0 γk gk 2 k=0 gk 2 + Ddn+1 k=0 γk gk 2. This is exactly our result. A.1. Asymptotic analysis Theorem. (Theorem 2.1) The average iterate ˆxn returned by Algorithm 1 satisfies: f(ˆxn) f = O DG n + 1 Proof. In the case where g0 = 0, f(x0) = f(x ) and the theorem is trivially true, so we assume that g0 2 > 0. 
We will show the result holds for some n, where we choose n sufficiently large so that a number of criteria are met: Criterion 1: since dk is a non-decreasing sequence upper bounded by D, there must exist some ˆn such that after ˆn steps, dk 1 2dn+1 for all k, n ˆn. We take n 2ˆn. Criterion 2: since we assume the bound gk 2 G2, there must exist some r such that gn 2 Pn 1 k=0 gk 2 for all n r. Let us choose the smallest r that satisfies this condition, in which case gr 1 2 Pr 2 k=0 gk 2, otherwise we could have chosen r 1. Moreover, we have by definition γk 1 g0 for all k r 1. Combining this with the first bound from Proposition A.5, we derive k=0 γk gk 2 = k=r γk gk 2 + k=0 γk gk 2 k=r gk 2 + 1 g0 k=0 gk 2 + 2 g0 gr 1 2 k=0 gk 2 + 2 G2 We continue with the bound from Lemma A.6: k=0 dk (f(xk) f ) 2Ddn+1 k=0 gk 2 + Ddn+1 k=0 γk gk 2. From Criterion 1, we have that: 1 2dn+1 = 1 2(n ˆn + 1)dn+1 1 4(n + 1)dn+1, Learning-Rate-Free Learning by D-Adaptation hence 1 Pn k=0 dk 4 (n + 1)dn+1 . Plugging this back yields 1 Pn k=0 dk k=0 dk (f(xk) f ) 8D (n + 1) k=0 gk 2 + 4D n + 1 k=0 γk gk 2. Using the bound obtained from Criterion 2, we further get 1 Pn k=0 dk k=0 dk (f(xk) f ) 8D (n + 1) k=0 gk 2 + 4D n + 1 k=0 gk 2 + 2 G2 Using gk 2 G2, we simplify this to 1 Pn k=0 dk k=0 dk (f(xk) f ) 16DG n + 1 + 8DG2 (n + 1) g0 . Using Jensen s inequality, we can convert this to a bound on the average iterate defined as ˆxn = 1 Pn k=0 dk f(ˆxn) f 12DG n + 1 + 8DG2 (n + 1) g0 . Note that the second term on the right decreases faster than the first term with respect to n, so f(ˆxn) f = O DG n + 1 A.2. Non-asymptotic analysis Lemma A.7. Consider a sequence d0, . . . d N+1, where for each k, dk+1 dk and assume N + 1 2 log2(d N+1/d0). Then min n N dn+1 Pn k=0 dk 4log2+(d N+1/d0) N + 1 , (5) where log2+(x) = max(1, log2 (x)). Proof. Let r = log2+(d N+1/d0) . We proceed by an inductive argument on r. In the base case, if r 2, then dn+1 4d0 and the result follows immediately: min n N dn+1 Pn k=0 dk d N+1 PN k=0 d0 4d0 (N + 1)d0 = 4 N + 1 4log2+(d N+1/d0) So assume that r > 2 and define n = l N + 1 N+1 log2+(d N+1/d0) m . First we show that no induction is needed, and we may take n = N, if dn 1 2d N+1. In that case, since the sequence dk is monotonic, it also holds 2d N+1, for all k n . Learning-Rate-Free Learning by D-Adaptation Then, it is easy to see that 2 (N + 1 n ) d N+1 1 N + 1 N + 2 N + 1 log2+(d N+1/d0) N + 1 log2+(d N+1/d0) 1 d N+1. Since we assume that N+1 log2+(d N+1/d0) 2, we can reduce this bound to the following: (N + 1) log2+(d N+1/d0) 1 d N+1 (N + 1) d N+1 4 log2+(d N+1/d0). Rearranging this bound gives: d N+1 PN k=0 dk 2log2+(d N+1/d0) and therefore min n N dn+1 Pn k=0 dk 4log2+(d N+1/d0) Thus, the claim holds if dn 1 Now, suppose that dn 1 2d N+1. In that case, log2+(dn /d0) log2+( 1 2d N+1/d0) = r 1 and by definition n (N + 1) 1 1 log2+(d N+1/d0) 2 log2(d N+1/d0) 1 1 log2+(d N+1/d0) = 2 log2(d N+1/d0) 1 = 2 log2 2 log2(dn /d0). Therefore, we can apply the inductive hypothesis to the sequence d0, . . . , dn : min n n 1 dn+1 Pn k=0 dk 4log2+(dn /d0) Under this inductive hypothesis, we note that: log2+(dn /d0) n = 1 l N + 1 N+1 log2+(d N+1/d0) m log2+(dn /d0) 1 N + 1 N+1 log2+(d N+1/d0) log2+(dn /d0) = log2+(d N+1/d0) (N + 1) log2+(d N+1/d0) 1 log2+(dn /d0) = log2+(d N+1/d0) N + 1 log2+(dn /d0) log2+(d N+1/d0) 1. Let us now bound the last fraction. Since r > 2, we have log2(d N+1/d0) r 1 2, so log2+( 1 2d N+1/d0) = log2( 1 2d N+1/d0), and, therefore, log2+(dn /d0) log2+ = log2+(d N+1/d0) 1. 
Plugging this back in yields: log2+(dn /d0) n log2+(d N+1/d0) Learning-Rate-Free Learning by D-Adaptation Putting it all together, we have that: min n N dn+1 Pn k=0 dk min n n dn+1 Pn k=0 dk 4log2+(d N+1/d0) Theorem. (Theorem 2.2) Consider Algorithm 1 run for n steps, where n 2 log2(D/d0), if we return the point ˆxt = 1 Pt k=0 dk Pt k=0 dkxk where t is chosen to be: t = arg min k n dk+1 Pk i=0 di , f(ˆxt) f 16log2+(dn+1/d0) Proof. Consider the bound from Lemma A.6: 1 Pn k=0 dk k=0 dk (f(xk) f ) 2Ddn+1 Pn k=0 dk k=0 gk 2 + Ddn+1 Pn k=0 dk k=0 γk gk 2 (3) 2Ddn+1 Pn k=0 dk k=0 gk 2 + Ddn+1 Pn k=0 dk 2 = 4Ddn+1 Pn k=0 dk Now using Lemma A.7, we can return the point ˆxt and at time t = arg mink n dk+1 Pk i=0 di , ensuring that dt+1 Pt k=0 dk = min k n dk+1 Pk i=0 di (5) 4log2+(dn+1/d0) giving us an upper bound: f(ˆxt) f 16log2+(dn+1/d0) We note that a similar proof can be used to remove the G2 term from the numerator of γk. To this end, we could reuse the bound obtained in the proof of Theorem 2.1: k=0 γk gk 2 2 k=0 gk 2 + 2 G2 which holds for γk = 1 Pk 1 i=0 gi 2 . In the proof of Theorem 2.1, this bound was stated for n r, where r is the smallest number such that gk 2 Pk 1 i=0 gi 2 for all k r. However, the bound itself does not require n r, since for n < r it holds even without the first term in the right-hand side. The second term in that bound does not increase with n, and it would result in the following bound for the same iterate ˆxt as in Theorem 2.2: f(ˆxt) f 16DG log2+(D/d0) n + 1 + 8DG2 log2+(D/d0) (n + 1) g0 . Since the leading term in the bound above is of order O 1 n+1 , the extra term for not using G is negligible. Learning-Rate-Free Learning by D-Adaptation B. Gradient Descent Variant The gradient descent variant (Algorithm 2) results in the following specializations of earlier theorems resulting from plugging in γk = 1: Theorem B.1. It holds for the iterates of Algorithm 2, k=0 λk [f(xk) f ] sn+1 x0 x k=0 λk gk, sk . Lemma B.2. Gradient descent iterates satisfy k=0 λk gk, sk = 1 k=0 λ2 k gk 2 1 k=0 λ2 k gk 2 . Lemma B.3. Algorithm 2 satisfies sn+1 2dn+1 + Pn k=0 λ2 k gk 2 The logarithmic terms in the convergence rate of gradient descent arise from the use of the following standard lemma: Lemma B.4. (Lemma 4.13 from Orabona (2019)) Let at be a sequence with a0 0 and ϕ be non-increasing for non-negative values, then: n X Z Pn k=0 ak Corollary B.5. For any vectors g0, . . . , gn such that gk G for all k, it holds G2 + Pk i=0 gi 2 log (n + 2) . Proof. Applying Lemma B.4 with a0 = G2 and ak = gk 1 2 up to an+1 = gn 2 to the function ϕ(x) = 1/x gives: Z Pn+1 k=0 ak k=0 gk 2 !! log (n + 2) . Lemma B.6. For Algorithm 2, we have k=0 λk [f(xk) f ] 4dn+1D log (n + 2) . Learning-Rate-Free Learning by D-Adaptation Proof. Consider the result of Theorem B.1: k=0 λk [f(xk) f ] sn+1 D k=0 λk gk, sk . We may simplify this by substituting Lemmas B.2 and B.3: k=0 λk [f(xk) f ] 2dn+1 + Pn k=0 λ2 k gk 2 k=0 λ2 k gk 2 = 2dn+1D + 1 dn+1 + 1 n X k=0 λ2 k gk 2 . Now apply Corollary B.5: k=0 λk [f(xk) f ] 2dn+1D + 1 dn+1 + 1 d2 n+1 log (n + 2) dn+1 + dn+1 log (n + 2) 2dn+1D [1 + log (n + 2)] 4dn+1D log (n + 2) . B.1. Asymptotic case Theorem. (Theorem 2.3) It holds for Algorithm 2: f(ˆxn) f = O DG n + 2 log (n + 2) . Proof. Following the same logic as for the proof of Theorem 2.1, we may we take n large enough such that 4(n + 2)dn+1. (6) Then from Jensen s inequality: 1 Pn k=0 λk k=0 λk [f(xk) f ] f(ˆxn) f. Applying Lemma B.6, we get f(ˆxn) f 4dn+1D log (n + 2) Pn k=0 λk . 
Consider the denominator: G2 + Pk i=0 gi 2 1 1 + (k + 1) So: f(ˆxn) f 16DG n + 2 log (n + 2) . Learning-Rate-Free Learning by D-Adaptation B.2. Non-asymptotic case Theorem B.7. For Algorithm 2 run for n 2 log2(D/d0) iterations, with t chosen as: t = arg min k n dk+1 Pk i=0 di , we have: f(ˆxt) f 12DG n + 1 log (n + 2) log2+(dn+1/d0). Proof. Firstly, since f is convex, we can apply Jensen s inequality: f(ˆxt) f 1 Pt k=0 λk k=0 λk [f(xk) f ] . Applying Lemma B.6 to the right-hand side, we get f(ˆxt) f 4dn+1D log (n + 2) Pt k=0 λk . Plugging-in the definition of λk, we obtain G2 + Pk i=0 gi 2 1 1 + (k + 1) (n + 1) dn+1 2G t + 2 log2+(dn+1/d0). f(ˆxt) f 8DG t + 2 n + 1 log (n + 2) log2+(dn+1/d0) 12DG n + 1 log (n + 2) log2+(dn+1/d0). C. Coordinate-wise setting In the coordinate-wise setting we define the matrices An+1 as diagonal matrices with diagonal elements ai at step n defined as v u u t G2 + Let p be the number of dimensions. Define: D = x0 x ˆdn+1 = sn+1 2 A 1 n+1 Pn k=0 λ2 k gk 2 A 1 k 2 sn+1 1 . The following lemma applies to Algorithm 3 with general weights λk. Learning-Rate-Free Learning by D-Adaptation Lemma C.1. The inner product λk gk, A 1 k sk is a key quantity that occurs in our theory. Suppose that An+1 An for all n, then we can bound the sum of these inner products as follows: k=0 λk gk, A 1 k sk 1 2 sn+1 2 A 1 n+1 + 1 k=0 λ2 k gk 2 A 1 k . Proof. We start by expanding 1 2 sn+1 2 A 1 n+1 1 2 sn+1 2 A 1 n+1 1 2 sn+1 2 A 1 n 2 sn 2 A 1 n + λn gn, A 1 n sn + 1 2λ2 n gn 2 A 1 n . Therefore λn gn, A 1 n sn 1 2 sn 2 A 1 n 1 2 sn+1 2 A 1 n+1 + 1 2λ2 n gn 2 A 1 n . Telescoping over time gives: k=0 λk gk, A 1 k sk 1 2 sn+1 2 A 1 n+1 + 1 k=0 λ2 k gk 2 A 1 k . Below, we provide the analogue of Proposition A.5 for the coordinate-wise setting. Proposition C.2. (From Duchi et al. (2011)) The gradient error term can be bounded as: G2 + Pk 1 i=0 g2 ij 2 v u u t G2 + k=0 g2 kj, (7) as long as G gij for all i, j. Lemma C.3. It holds for the iterates of Algorithm 3 k=0 λk (f(xk) f ) sn+1 1 D 1 2 sn+1 2 A 1 n+1 + 1 k=0 λ2 k gk 2 A 1 k . Proof. We start by applying convexity: k=0 λk (f(xk) f ) k=1 λk gk, xk x k=1 λk gk, xk x0 + x0 x = sn+1, x0 x + k=1 λk gk, xk x0 = sn+1, x0 x k=1 λk gk, A 1 k sk sn+1 1 x0 x k=1 λk gk, A 1 k sk . Applying Lemma C.1 we have: k=0 λk (f(xk) f ) sn+1 1 x0 x 1 2 sn+1 2 A 1 n+1 + 1 k=0 λ2 k gk 2 A 1 k . Learning-Rate-Free Learning by D-Adaptation Theorem C.4. Consider the iterates of Algorithm 3. The initial ℓ -distance D = x0 x satisfies D ˆdn+1 = sn+1 2 A 1 n+1 Pn k=0 λ2 k gk 2 A 1 k 2 sn+1 1 . Proof. Applying f(xk) f 0 to the bound from Lemma C.3 gives: 0 sn+1 1 D 1 2 sn+1 2 A 1 n+1 + 1 k=0 λ2 k gk 2 A 1 k . Rearranging this inequality, we obtain 2 sn+1 2 A 1 n+1 1 k=0 λ2 k gk 2 A 1 k . and, therefore, D sn+1 2 A 1 n+1 Pn k=0 λ2 k gk 2 A 1 k 2 sn+1 1 . Lemma C.5. The ℓ1-norm of sn+1 is bounded by: sn+1 1 3dn+1 an+1 1 . Proof. By the definition of ˆdn+1 we have: 1 2 sn+1 2 A 1 n+1 = ˆdn+1 sn+1 1 + 1 k=0 λ2 k gk 2 A 1 k . and since ˆdn+1 dn+1, 1 2 sn+1 2 A 1 n+1 dn+1 sn+1 1 + 1 k=0 λ2 k gk 2 A 1 k . Furthermore, using λk = dk dn+1 and Proposition C.2, we obtain k=0 λ2 k gk 2 A 1 k 1 k=0 gk 2 A 1 k v u u t G2 + = d2 n+1 an+1 1 . Therefore, using inequality 2αβ α2 + β2 with α2 = 2d2 n+1a(n+1)i and β2 = s2 (n+1)i 2a(n+1)i , we get 2dn+1 sn+1 1 = i=1 2dn+1|s(n+1)i| 2d2 n+1a(n+1)i + s2 (n+1)i 2a(n+1)i = 2d2 n+1 an+1 1 + 1 2 sn+1 2 A 1 n+1 2d2 n+1 an+1 1 + dn+1 sn+1 1 + 1 k=0 λ2 k gk 2 A 1 k 2d2 n+1 an+1 1 + dn+1 sn+1 1 + d2 n+1 an+1 1. 
Learning-Rate-Free Learning by D-Adaptation Rearranging, we get dn+1 sn+1 1 3d2 n+1 an+1 1. Theorem. (Theorem 3.1) For a convex function with G = maxx f(x) , D-Adapted Ada Grad returns a point ˆxn such that f(ˆxn) f = O an+1 1 D = O p G D n + 1 as n , where D = x0 x for any x in the set of minimizers of f, as long as d0 D . Proof. As in the proof of Theorem 2.1, we will show the result holds for some sufficiently n. Since dk is a non-decreasing sequence upper bounded by D, there must exist some ˆn such that after ˆn steps, dk 1 2dn+1 for all k, n ˆn. We take n 2ˆn. 1 2dn+1 = 1 2(n ˆn + 1)dn+1 1 4(n + 1)dn+1, and, therefore, 1 Pn k=0 dk 4 (n + 1)dn+1 . Combining this with Lemma C.3 yields 1 Pn k=0 dk k=0 dk (f(xk) f ) 4 (n + 1)dn+1 sn+1 1 D + 1 k=0 d2 k gk 2 A 1 k From Proposition C.2 we have: k=0 d2 k gk 2 A 1 k 1 k=0 gk 2 A 1 k d2 n+1 an+1 1 . Plugging this in together with Lemma C.5 gives: 1 Pn k=0 dk k=0 dk (f(xk) f ) 4 (n + 1)dn+1 3dn+1 an+1 1 D + d2 n+1 an+1 1 = 4 n + 1 (3 an+1 1 D + dn+1 an+1 1) . So using dn+1 D we have: 1 Pn k=0 dk k=0 dk (f(xk) f ) 16 n + 1 an+1 1 D . Using Jensen s inequality on the left: f(ˆxn) f 16 n + 1 an+1 1 D . We can further simplify using an+1 1 = Pp j=1 q G2 + Pn k=0 g2 kj p n + 1G : f(ˆxn) f 16p G D n + 1 , which yields the result. D. Parameter settings In this section, we list the parameters, architectures and hardware that we used for the experiments. The information is collected in Tables 2 12. Learning-Rate-Free Learning by D-Adaptation Table 2. Logistic regression experiment. The problems are part of the LIBSVM repository. Since there are no standard train/test splits, and due to the small sizes of the datasets, we present training accuracy curves only. Hyper-parameter Value Epochs 100 GPUs 1 V100 Batch size 16 Epochs 100 LR schedule 60,80,95 tenthing Seeds 10 Decay 0.0 Momentum 0.0 Baseline LR grid search Table 3. CIFAR10 experiment. Our data augmentation pipeline followed standard practice: random horizontal flipping, then random cropping to 32 32 (padding 4), then normalization by centering around (0.5, 0.5, 0.5). Hyper-parameter Value Architecture Wide Res Net 16-8 Epochs 300 GPUs 1 V100 Batch size per GPU 128 LR schedule 150-225 tenthing Seeds 10 decay 0.0001 Momentum 0.9 SGD LR 0.1 Table 4. CIFAR100 experiment. Following standard practice, we normalized the channels by subtracting ((0.5074,0.4867,0.4411) and dividing by (0.2011,0.1987,0.2025)). Augmentations used at training time were: random horizontal flips, random crop (32, padding=4, reflect). Hyper-parameter Value Architecture Dense Net [6,12,24,16], growth rate 12 Epochs 300 GPUs 1 V100 Batch size per GPU 64 LR schedule 150-225 tenthing Seeds 10 Decay 0.0002 Momentum 0.9 SGD LR 0.05 Table 5. Image Net experiment. Normalization of the color channels involved subtracting (0.485, 0.456, 0.406), and dividing by (0.229, 0.224, 0.225). For data augmentation at training we used Py Torch s Random Resized Crop to 224, then random horizontal flips. At test time images were resized to 256 then center cropped to 224. Hyper-parameter Value Architecture Res Net50 Epochs 100 GPUs 8 V100 Batch size per GPU 32 LR schedule 30-60-90 tenthing Seeds 5 Decay 0.0001 Momentum 0.9 SGD LR 0.1 Table 6. fast MRI experiment. We used the implementation from https://github.com/ facebookresearch/fast MRI. 
Hyper-parameter Value Architecture 12 layer Var Net 2.0 Epochs 50 GPUs 8 V100 Batch size per GPU 1 Acceleration factor 4 Low frequency lines 16 Mask type Offset-1 LR schedule flat Seeds 5 Decay 0.0 Adam LR 0.0003 β1, β2 0.9, 0.999 Table 7. IWSLT14 experiment. Our implementation used Fair Seq https://github.com/facebookresearch/fairseq defaults except for the parameters listed below. Note that the default Adam optimizer uses decoupled weight decay. Hyper-parameter Value Architecture lstm_wiseman_iwslt_de_en Max Epoch 55 GPUs 1 V100 Max tokens per batch 4096 Warmup steps 4000 Dropout 0.3 Label smoothing 0.1 Share decoder, input, output embed True Float16 True Update Frequency 1 LR schedule Inverse square-root Seeds 10 Decay 0.05 Adam LR 0.01 β1, β2 0.9, 0.98 Learning-Rate-Free Learning by D-Adaptation Table 8. Ro BERTa Book Wiki experiment. Our implementation used Fair Seq defaults except for the parameters listed below. Hyper-parameter Value Architecture roberta_base Task masked_lm Max updates 23,000 GPUs 8 V100 Max tokens per sample 512 Dropout 0.1 Attention Dropout 0.1 Max sentences 16 Warmup 10,000 Sample Break Mode Complete Float16 True Update Frequency 16 LR schedule Polynomial decay Seeds 5 Decay 0.0 Adam LR 0.001 β1, β2 0.9, 0.98 Table 9. GPT Book Wiki experiment. Our implementation used Fair Seq defaults except for the parameters listed below. Hyper-parameter Value Architecture transformer_lm_gpt Task language_modeling Max updates 65,000 GPUs 8 V100 Max tokens per sample 512 Dropout 0.1 Attention Dropout 0.1 Max sentences 1 Warmup 10,000 Sample Break Mode Complete Share decoder, input, output embed True Float16 True Update Frequency 16 LR schedule Polynomial decay Seeds 5 Decay 0.005 Adam LR 0.001 β1, β2 0.9, 0.98 Table 10. COCO Object Detection experiment. We used the Detectron2 codebase https://github. com/facebookresearch/detectron2, with the faster_rcnn_X_101_32x8d_FPN_3x configuration. We list its key parameters below. Hyper-parameter Value Architecture X-101-32x8d Solver Steps (Schedule) 210000, 250000 Max Iter 270000 IMS Per Batch 16 Momentum 0.9 Decay 0.0001 SGD LR 0.02 Table 11. Vision Transformer experiment. We used the Pytorch Image Models codebase https://github.com/ rwightman/pytorch-image-models. Hyper-parameter Value Model vit_tiny_patch16_224 Epochs 300 Batch Size 512 Sched Cosine Warmup Epochs 5 Hflip 0.5 aa rand-m6-mstd0.5 mixup 0.1 cutmix 1.0 Crop Pct 0.9 BCE Loss True Seeds 5 Decay 0.1 Adam LR 0.001 β1, β2 0.9, 0.999 Table 12. Criteo Kaggle experiment. We used our own implementation of DLRM, based on the codebase provided at https://github.com/ facebookresearch/dlrm. Hyper-parameter Value Iterations 300 000 Batch Size 128 Schedule Flat Emb Dimension 16 Seeds 5 Decay 0.0 Adam LR 0.0001 β1, β2 0.9, 0.999 Learning-Rate-Free Learning by D-Adaptation E. 
E. Additional figures

Figure 4. Logistic Regression experiments. Each panel plots accuracy (%) over training for one problem, comparing Adam against D-Adapt Adam. Final accuracies with standard errors, as reported in the panel legends (Adam vs. D-Adapt Adam): 89.5 (0.01) vs. 90.0 (0.07); 94.1 (0.01) vs. 97.1 (0.01); 100.0 (0.00) vs. 100.0 (0.00); 72.0 (0.16) vs. 72.3 (0.19); 98.3 (0.11) vs. 98.6 (0.00); 77.8 (0.01) vs. 77.7 (0.01); 96.2 (0.02) vs. 95.8 (0.02); 92.9 (0.03) vs. 95.8 (0.04); 98.6 (0.01) vs. 98.6 (0.01); 82.1 (0.08) vs. 81.9 (0.16); 77.4 (0.11) vs. 77.4 (0.15); 100.0 (0.00) vs. 100.0 (0.00). (Plot panels omitted.)

Figure 5. Final accuracy as a function of d0. Setup described in Section 7. Error bars show a range of 2 standard errors above and below the mean of the 10 seeds. For most problems error bars are too narrow to be visible. (Each panel plots accuracy (%) against d0 on a log scale from 10^-14 to 10^-2; plot panels omitted.)

F. Additional notes

Theorem F.1. If $\|x_n - x_*\| \to 0$, and the learning rate (1) is used, then
\[
\lim_{n \to \infty} d_n \ge \frac{D}{1+\sqrt{3}} .
\]

Proof. By the triangle inequality, we can bound the distance to $x_*$ as
\[
D = \|x_0 - x_*\| \le \|x_n - x_*\| + \|x_n - x_0\| = \|x_n - x_*\| + \gamma_n \|s_n\| .
\]
We need to upper bound the last term $\gamma_n \|s_n\|$. To this end, we use the same argument as in the proof of Lemma A.4, starting with the definition of $\hat{d}_{n+1}$ and plugging in $\lambda_k = d_k$:
\[
\gamma_{n+1} \|s_{n+1}\|^2 - \sum_{k=0}^{n} \gamma_k d_k^2 \|g_k\|^2 = 2 \hat{d}_{n+1} \|s_{n+1}\| \le 2 d_{n+1} \|s_{n+1}\| .
\]
The main change from the proof of Lemma A.4 is that now we will use the inequality $2\alpha\beta \le \alpha^2 + \beta^2$ with $\alpha^2 = \theta \frac{d_{n+1}^2}{\gamma_{n+1}}$ and $\beta^2 = \frac{\gamma_{n+1}}{\theta} \|s_{n+1}\|^2$, with $\theta$ to be chosen later to make the bound optimal. Plugging this inequality into the previous bound, we derive
\[
2\alpha\beta = 2 d_{n+1} \|s_{n+1}\|
\le \theta \frac{d_{n+1}^2}{\gamma_{n+1}} + \frac{\gamma_{n+1}}{\theta} \|s_{n+1}\|^2
\le \theta \frac{d_{n+1}^2}{\gamma_{n+1}} + \frac{2}{\theta} d_{n+1} \|s_{n+1}\| + \frac{1}{\theta} \sum_{k=0}^{n} \gamma_k d_k^2 \|g_k\|^2 .
\]
Since the sequence $d_k$ is non-decreasing, we have $d_k \le d_{n+1}$, further giving us
\[
\frac{1}{\theta} \sum_{k=0}^{n} \gamma_k d_k^2 \|g_k\|^2
\le \frac{d_{n+1}^2}{\theta} \sum_{k=0}^{n} \gamma_k \|g_k\|^2
\le \frac{2 d_{n+1}^2}{\theta \gamma_{n+1}} . \quad (4)
\]
Plugging this back and rearranging, we get
\[
2\left(1 - \frac{1}{\theta}\right) d_{n+1} \|s_{n+1}\|
\le \theta \frac{d_{n+1}^2}{\gamma_{n+1}} + \frac{2 d_{n+1}^2}{\theta \gamma_{n+1}}
= \left(\theta + \frac{2}{\theta}\right) \frac{d_{n+1}^2}{\gamma_{n+1}} .
\]
Now it is time for us to choose $\theta$. Clearly, the optimal value of $\theta$ is the one that minimizes the ratio $\frac{\theta + 2/\theta}{2(1 - 1/\theta)} = \frac{\theta^2 + 2}{2(\theta - 1)}$. It can be shown that the value $\theta = 1 + \sqrt{3}$ is optimal and gives $\frac{\theta^2 + 2}{2(\theta - 1)} = 1 + \sqrt{3}$. Thus, we have
\[
\gamma_{n+1} \|s_{n+1}\| \le \left(1 + \sqrt{3}\right) d_{n+1} .
\]
Now, assume that $x_n \to x_*$ in norm, so $\|x_n - x_*\| \to 0$. In that case, the bounds combined yield
\[
D \le \lim_{n \to \infty} \left( \|x_n - x_*\| + \gamma_n \|s_n\| \right)
= \lim_{n \to \infty} \gamma_n \|s_n\|
\le \left(1 + \sqrt{3}\right) \lim_{n \to \infty} d_n .
\]
Thus, the value of $d_n$ is asymptotically lower bounded by $D/(1+\sqrt{3})$.
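As a quick cross-check of the choice of $\theta$ in the proof above (a standalone verification, not part of the original analysis), the following SymPy snippet confirms that $\theta = 1 + \sqrt{3}$ minimizes $(\theta^2 + 2)/(2(\theta - 1))$ over $\theta > 1$ and that the minimum value is again $1 + \sqrt{3}$, the constant appearing in $\gamma_{n+1}\|s_{n+1}\| \le (1+\sqrt{3})\, d_{n+1}$.

```python
# Verify symbolically that the ratio (theta^2 + 2) / (2 * (theta - 1)) from the
# proof of Theorem F.1 is minimized at theta = 1 + sqrt(3), with minimum value
# 1 + sqrt(3), so that gamma_{n+1} * ||s_{n+1}|| <= (1 + sqrt(3)) * d_{n+1}.
import sympy as sp

theta = sp.symbols("theta", real=True)
ratio = (theta**2 + 2) / (2 * (theta - 1))

# Stationary points: the derivative vanishes where theta^2 - 2*theta - 2 = 0.
critical = sp.solve(sp.Eq(sp.diff(ratio, theta), 0), theta)
theta_star = 1 + sp.sqrt(3)
assert any(sp.simplify(t - theta_star) == 0 for t in critical)

# The ratio at theta_star equals 1 + sqrt(3) ...
assert sp.simplify(ratio.subs(theta, theta_star) - theta_star) == 0
# ... and the second derivative there is positive, so it is a minimum on theta > 1.
assert float(sp.diff(ratio, theta, 2).subs(theta, theta_star)) > 0
print("theta = 1 + sqrt(3) minimizes the ratio; minimum value = 1 + sqrt(3)")
```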
F.1. A tighter lower bound on D

Using Lemma A.1, we can obtain a slightly tighter bound than in Theorem A.3. In particular, we have previously used the following bound:
\[
\sum_{k=0}^{n} \lambda_k \left(f(x_k) - f_*\right)
\le \sum_{k=0}^{n} \lambda_k \langle g_k, x_k - x_* \rangle
= \sum_{k=0}^{n} \lambda_k \langle g_k, x_k - x_0 + x_0 - x_* \rangle
= \langle s_{n+1}, x_0 - x_* \rangle + \sum_{k=0}^{n} \lambda_k \langle g_k, x_k - x_0 \rangle
= \langle s_{n+1}, x_0 - x_* \rangle - \sum_{k=0}^{n} \lambda_k \gamma_k \langle g_k, s_k \rangle
\le \|s_{n+1}\|\, D - \sum_{k=0}^{n} \lambda_k \gamma_k \langle g_k, s_k \rangle .
\]
From here, we can immediately conclude that
\[
D = \|x_0 - x_*\| \ge \tilde{d}_{n+1} = \frac{\sum_{k=0}^{n} \lambda_k \gamma_k \langle g_k, s_k \rangle}{\|s_{n+1}\|} .
\]
Notice that it always holds that $\tilde{d}_n \ge \hat{d}_n$. The only complication that we can face is with Lemma A.4, where we used the definition of $\hat{d}_n$ to obtain the upper bound. Nevertheless, one can prove the same bound with $\hat{d}_n$ replaced by $\tilde{d}_n$ by repeating the same argument:
\[
\gamma_{n+1} \|s_{n+1}\|^2 - \sum_{k=0}^{n} \gamma_k \lambda_k^2 \|g_k\|^2
= 2 \hat{d}_{n+1} \|s_{n+1}\|
\le 2 \tilde{d}_{n+1} \|s_{n+1}\|
\le 2 d_{n+1} \|s_{n+1}\| .
\]
From that point, the rest of the proof of Lemma A.4 follows in exactly the same way. The other proofs only use the monotonicity of the sequence and its boundedness by $D$, i.e. $d_k \le d_{n+1} \le D$, which would remain valid if we replace $\hat{d}_n$ with $\tilde{d}_n$.

G. Adam Derivation

Lemma G.1. Consider a positive constant $c$. Define the two sequences:
\[
u_{k+1} = u_k + \frac{1}{c^k} g_k , \qquad
\hat{u}_{k+1} = c\, \hat{u}_k + (1 - c)\, g_k .
\]
Then the following relationship holds between the two sequences:
\[
\hat{u}_{k+1} = c^k (1 - c)\, u_{k+1} ,
\]
assuming that $\hat{u}_0 = (1 - c)\, u_0$.

In this section, we use hat notation to denote the exponential moving averages of each quantity (other than $\hat{d}$). We drop the hat notation for simplicity when we present the method (Algorithm 5). We also treat each quantity as 1-dimensional, with the understanding that the final result also holds when applied element-wise. Our goal is to derive the EMA updates, given the following weighted updates:
\[
s_{k+1} = s_k + \lambda_k g_k , \qquad
v_{k+1} = v_k + \lambda_k^2 g_k^2 , \qquad
\gamma_{k+1} = \frac{1}{\sqrt{(1 - \beta_2)\, v_{k+1}}} ,
\]
\[
r_{k+1} = r_k + \gamma_{k+1} \lambda_k \langle g_k, s_k \rangle , \qquad
\hat{d}_{n+1} = \frac{\sum_{k=0}^{n} \gamma_{k+1} \lambda_k \langle g_k, s_k \rangle}{\|s_{n+1}\|_1} = \frac{r_{n+1}}{\|s_{n+1}\|_1} .
\]
Note that we normalized by $\gamma_{k+1}$ rather than $\gamma_k$ for this implemented variant. We also introduce the Adam denominator through gamma, in the style of the DA method, rather than through the step size as implemented in Algorithm 5. This is the only way currently supported by our theory. However, we will still use the non-DA step:
\[
x_{k+1} = x_k - \gamma_{k+1} \lambda_k g_k .
\]
The denominator of $\gamma$ is chosen to ensure that the step is properly normalized. To see that, note that
\[
\hat{v}_{k+1} = \beta_2^k (1 - \beta_2)\, v_{k+1} ,
\]
and therefore
\[
\sqrt{(1 - \beta_2)\, v_{k+1}} = \beta_2^{-k/2} \sqrt{\hat{v}_{k+1}} ,
\qquad\text{so}\qquad
x_{k+1} = x_k - \gamma_{k+1} \lambda_k g_k = x_k - \frac{\beta_2^{k/2} \lambda_k}{\sqrt{\hat{v}_{k+1}}}\, g_k .
\]
Note that
\[
\hat{s}_{k+1} = \beta_2^{k/2} \left(1 - \sqrt{\beta_2}\right) s_{k+1} ,
\]
and so
\[
\hat{s}_{k+1} = \sqrt{\beta_2}\, \hat{s}_k + \left(1 - \sqrt{\beta_2}\right) \beta_2^{k/2} \lambda_k g_k .
\]
So we have
\[
r_{k+1} = r_k + \gamma_{k+1} \lambda_k \langle g_k, s_k \rangle
= r_k + \frac{\beta_2^{k/2} \lambda_k}{\sqrt{\hat{v}_{k+1}}} \cdot \frac{\beta_2^{-(k-1)/2}}{1 - \sqrt{\beta_2}} \langle g_k, \hat{s}_k \rangle
= r_k + \frac{\sqrt{\beta_2}\, \lambda_k}{\left(1 - \sqrt{\beta_2}\right) \sqrt{\hat{v}_{k+1}}} \langle g_k, \hat{s}_k \rangle .
\]
Now define
\[
r'_{k+1} = r'_k + \frac{\sqrt{\beta_2}\, \lambda_k}{\sqrt{\hat{v}_{k+1}}} \langle g_k, \hat{s}_k \rangle ,
\qquad\text{then}\qquad
r'_{k+1} = \left(1 - \sqrt{\beta_2}\right) r_{k+1} .
\]
Now using
\[
\hat{r}_{k+1} = \sqrt{\beta_2}\, \hat{r}_k + \left(1 - \sqrt{\beta_2}\right) \frac{\beta_2^{(k+1)/2} \lambda_k}{\sqrt{\hat{v}_{k+1}}} \langle g_k, \hat{s}_k \rangle ,
\]
Lemma G.1 gives
\[
\hat{r}_{k+1} = \beta_2^{k/2} \left(1 - \sqrt{\beta_2}\right) r'_{k+1} = \beta_2^{k/2} \left(1 - \sqrt{\beta_2}\right)^2 r_{k+1} .
\]
Plugging this in gives
\[
\hat{d}_{n+1} = \frac{r_{n+1}}{\|s_{n+1}\|_1}
= \frac{\hat{r}_{n+1}}{\beta_2^{n/2} \left(1 - \sqrt{\beta_2}\right)^2 \|s_{n+1}\|_1}
= \frac{\hat{r}_{n+1}}{\left(1 - \sqrt{\beta_2}\right) \|\hat{s}_{n+1}\|_1} .
\]
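As a standalone sanity check of Lemma G.1 (not part of the derivation above), the following snippet runs both recursions from zero initialization on random data and confirms that the weighted running sum and the exponential moving average remain related by the stated $c^k(1-c)$ factor at every step; the constant $c$ plays the role of $\beta_2$ (or $\sqrt{\beta_2}$) in the derivation.

```python
# Numerical check of Lemma G.1: with u_0 = 0 and u_hat_0 = (1 - c) * u_0 = 0,
# the weighted sum        u_{k+1}     = u_k + g_k / c**k
# and the moving average  u_hat_{k+1} = c * u_hat_k + (1 - c) * g_k
# satisfy u_hat_{k+1} = c**k * (1 - c) * u_{k+1} at every step.
import numpy as np

rng = np.random.default_rng(1)
c = 0.99  # stands in for beta_2 (or sqrt(beta_2)) from the Adam derivation
g = rng.normal(size=50)

u, u_hat = 0.0, 0.0
for k, gk in enumerate(g):
    u = u + gk / c**k
    u_hat = c * u_hat + (1 - c) * gk
    assert np.isclose(u_hat, c**k * (1 - c) * u)
print("Lemma G.1 relationship verified on random data")
```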