# prodigy_an_expeditiously_adaptive_parameterfree_learner__be1371e6.pdf

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Konstantin Mishchenko 1 Aaron Defazio 2

We consider the problem of estimating the learning rate in adaptive methods, such as Ada Grad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution D, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of O( p

log(D/d0)), where d0 is the initial estimate of D. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and Res Net-50 training on CIFAR10, Vi T training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, Var Net on Knee MRI dataset, as well as Ro BERTa and GPT transformer training on Book Wiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam.

1. Introduction

Optimization is an essential tool in modern machine learning, enabling efficient solutions to large-scale problems that arise in various domains, such as computer vision, natural language processing, and reinforcement learning. One of the key challenges is the selection of appropriate learning rates, which can significantly impact the convergence speed and the quality of the final solution. Learning-rate tuning has been particularly challenging in applications where there are multiple agents that use their own optimizer. For instance, when training Generative Adversarial Networks (GANs) (Goodfellow et al., 2020), there are two neural networks with different architectures. In federated learning, tuning is even more challenging (Khodak et al., 2021), since there might be billions of devices (Kairouz et al., 2021),

1Samsung AI Center 2Fundamental AI Research Team, Meta. Correspondence to: Konstantin Mishchenko <konsta.mish@gmail.com>.

Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

each optimizing their objective locally. Another example is Neural Architecture Search (NAS) (Zoph & Le, 2017), where the goal is to find the best neural network architecture automatically by training a lot of networks and evaluating them on a validation set. In such cases it becomes very expensive to manually tune the learning rate.

Recently, parameter-free adaptive learning rate methods (Orabona & Tommasi, 2017; Cutkosky & Orabona, 2018; Zhang et al., 2022; Carmon & Hinder, 2022; Ivgi et al., 2023) have gained considerable attention due to their ability to automatically adjust learning rates based on the problem structure and data characteristics. Among these, the DAdaptation method, introduced by Defazio & Mishchenko (2023), has emerged as a promising practical approach for learning-rate-free optimization.

For a convex objective function f, D-Adaptation works by maintaining a lower bound on the initial distance to solution D = x0 x , for any x in the solution set of the following problem:

min x Rp f(x).

In practice, the lower bound estimated by D-Adaptation increases rapidly during the course of optimization, plateauing to a value close to the true D. This D quantity is the key unknown constant needed to set the learning rate for adaptive optimization methods, forming the numerator of the step size:

ηk = D q Pk i=0 gi 2 , where D = x0 x , (1)

and the denominator is based on the Ada Grad step size (Duchi et al., 2011; Streeter & Mc Mahan, 2010; Ward et al., 2019). The Gradient Descent form of D-Adaptation simply plugs in the current lower bound at each step in place of D. This simple approach can be applied to estimate the step size in Adam (Kingma & Ba, 2015), which yields stateof-the-art performance across a wide-range of deep learning problems. Defazio & Mishchenko (2023) also show that asymptotically, D-Adaptation is as fast as specifying the step size using the true D (up to small constant factors).

Contributions

We summarize our contributions as follows:

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

We present Prodigy, a modification of D-Adaptation that improves its worst-case non-asymptotic convergence rate.

Through extensive experiments, we demonstrate that Prodigy establishes a new state-of-the-art for learning rate adaptation, outperforming D-Adaptation.

We develop a lower complexity bound for methods which grow the learning rate at most exponentially fast. We show that this covers all methods that avoid significant overshooting.

Prodigy is highly practical. Our open-source implementation is already widely used for fine-tuning of vision and language models, and is the recommended optimizer for Hugging Face Diffusers Dream Booth Lo RA training.

2. Prodigy Approach

To understand how we can improve upon D-Adaptation, let us take a closer look at some details of its analysis. Consider Gradient Descent update

xk+1 = xk ηkgk,

where gk is the sub-gradient used at iteration k. The standard analysis of Ada Grad with a constant numerator gives the following error term:

Ada Grad error =

k=0 η2 k gk 2,

which is exactly why Ada Grad places q Pk i=0 gi 2 in the step-size denominator in equation (1). However, when setting the numerator adaptively with an estimate dk instead of the unknown D, as done in D-Adaptation, we end up with a different error term:

D-adaptation error =

k=0 d2 kη2 k gk 2.

The theory of D-Adaptation handles this error term by upper bounding by using the upper bound dk dn to get a similar error term to that of Ada Grad. This upper bound, however, is quite pessimistic since dn can be as large as D and dk can be as small as d0. Therefore, replacing d2 k with d2 n can introduce a multiplicative error of D2

d2 0 in this term.

In this paper, we take a different approach and instead handle the error term using modified Ada Grad-like step sizes. Since the error terms are now d2 i gi 2 instead of gi 2, the new adaptive step size should be

ηk = d2 k q Pk i=0 d2 i gi 2 .

Algorithm 1 Prodigy (GD version)

1: Input: d0 > 0, x0, G 0 2: for k = 0 to n do 3: gk f(xk) 4: Choose weight ωk (default: ωk = 1)

5: ηk = d2 kωk q

d2 k G2 + Pk i=0 d2 i ω2 i gi 2

6: xk+1 = xk ηkgk

7: ˆdk+1 = Pk i=0 ηi gi, x0 xi

xk+1 x0 8: dk+1 = max(dk, ˆdk+1) 9: end for 10: Return xn = 1 Pn k=0 d2 kωk Pn k=0 d2 kωkxk

Algorithm 2 Prodigy (Dual Averaging version)

1: Input: d0 > 0, x0, G 0; s0 = 0 Rp

2: for k = 0 to n do 3: gk f(xk) 4: sk+1 = sk + d2 kgk

5: ˆdk+1 = Pk i=0 d2 i gi, x0 xi sk+1 6: dk+1 = max(dk, ˆdk+1)

7: γk+1 = 1 q

d2 k+1G2 + Pk i=0 d2 i gi 2

8: xk+1 = x0 γk+1sk+1 9: end for 10: Return xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk

This way, we can still control the error term of D-Adaptation but the obtained step size is provably larger since dk is nondecreasing:

d2 k q Pk i=0 d2 i gi 2 d2 k q Pk i=0 d2 k gi 2 = dk q Pk i=0 gi 2 .

Having larger step sizes while preserving the main error term is the key reason why the new algorithms converge, as we show below, with a faster rate.

Notice, however, that the methods might still be slow because the denominator in the step size might grow too large over time. To remedy this, we introduce a modification for the step size by placing an extra weight ωk next to the gradients:

ηk = d2 kωk q Pk i=0 d2 i ω2 i gi 2 .

In fact, the modified step size might even increase between iterations, whereas the Ada Grad step size always decreases.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

We will show that as long as ωk does not grow too quickly, the worst-case convergence rate is almost the same.

To establish non-asymptotic theory, we also introduce in our algorithms an extra term G2 in the denominator which upper bound the gradient norm, in case such a bound exists. We define it formally in the assumption below. Assumption 1. We assume that the objective f is GLipschitz, which implies that its gradients are bounded by G: for any x Rp and g f(x), it holds g G.

Algorithm 1 and Algorithm 2 give Gradient Descent and the Dual Averaging variants of our new method. In contrast to Ada Grad, they estimate the product of D and G in the denominator, so we call the proposed technique Prodigy. We give the following convergence result for Algorithm 1: Theorem 1. Assume f is convex and G-Lipschitz. Given any weights 1 ω0 ωn, the functional gap of the average iterate of Algorithm 1 converges as

2ωn DGdn+1(2 + log(1 + Pn k=0 ω2 k)) p Pn k=0 ωkd2 k ,

(2) where xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk is the weighted average iterate.

Notice that we have the freedom to choose any nondecreasing sequence ωk as long as the right-hand side is decreasing. This allows us to put much more weight on the recent gradients and get more reasonable step sizes. For instance, we can choose ωk = kp, where p > 0 and since Pk k=0 k2p = O(k2p+2), it would result in an extra factor of 1 + p in the numerator due to the log term. The denominator, on the other hand, would increase as well, giving us a trade-off that depends on the values of dk s. We note that weights ωk = k have appeared previously in Accelegrad (Levy et al., 2018) and Uni XGrad (Kavis et al., 2019), which combine Ada Grad step sizes with momentum, and ωk =

k weighting is used in the recent MADGRAD method (Defazio & Jelassi, 2022).

To understand why this improves the convergence rate, consider the following lemma, which we prove in the appendix. The lemma presents an upper bound on the terms related to the dk sequence in the right-hand side of equation 2. Lemma 1. Let d0 d1 d N be positive numbers and assume N 2 log2+( d N

d0 ), where log2+( ) = 1 + log2( ). Then,

min t<N dt+1 q Pt k=0 d2 k 4 q

In contrast to the bound in (Defazio & Mishchenko, 2023), we bound dt+1 Pt k=0 d2 k instead of dt+1 Pt k=0 dk . This is the rea-

son why the overall guarantee improves by a factor of p

log2(D/d0). For instance, if we set ωk = 1 for all k and substitute the bound from Lemma 1, we get the convergence rate

f(xt) f = O

GD log(n) p

log2+(D/d0) n

where t n is chosen as the argmin from Lemma 1. Furthermore, for arbitrary increasing positive weights, we get the following guarantee by applying Lemma 1 directly to the bound in Theorem 1:

f(xt) f = O

GD log(n) q

log2+( ωn+1D

Even though our theory does not guarantee that it is beneficial to use increasing weights ωk, this result is, to the best of our knowledge, new for Ada Grad-like methods. It allows for a wide range of choices in ωk. For example, if we set ωk = β kp 2 with β2 < 1 and p < 1/3, then the method is still guaranteed to converge at the rate of O 1 n(1 3p)/2 . This is of particular interest when we study Adam-like methods, see Section 5 for a discussion.

The logarithmic term log(n) is, however, not necessary and only arises due to the use of Gradient Descent update. The Dual Averaging update of Algorithm 2, provides a tighter guarantee as given in the next theorem.

Theorem 2. Let f be a convex and G-Lipschitz function. For Algorithm 2, it holds that:

f(xt) f 4GD n

where t = arg mink n dk+1 Pk i=0 d2 i and log2+( ) = 1 +

Comparing this with the previous rate, the only difference is the removal of a multiplicative log(n) factor. This improvement, however, is mostly theoretical as Gradient Descent typically performs better in practice than Dual Averaging. We also note that we provide convergence results for Algorithm 2 without extra weights ωk.

2.1. Smooth analysis

One appealing property of Ada Grad is that it is a universal method, i.e., it works for both non-smooth and smooth problems (Levy et al., 2018), and the same is also known for Do G (Ivgi et al., 2023). Here, we show that Prodigy can adapt to smoothness as well, and works if the gradients are not bounded by G.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Assumption 2. We say f is L-smooth if it holds for any x Rp that

f(x) 2 2L(f(x) f ). (3)

We give the following result for Algorithm 1 when applied to smooth objectives.

Theorem 3. Assume f is L-smooth, and set G = 0 and ω0 = = ωn = 1 in Algorithm 1. Let xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk, then we get the following convergence guarantee:

f(xt) f = O log2+ D

d0 log2 2+ LD2

where t = arg mink n dk+1 Pk i=0 d2 i .

Compared to the non-smooth O(1/ n) convergence rate, the result in Theorem 3 offers a faster O(1/n) convergence rate. This also affects the first logarithmic term, which changes from p

log2+(D/d0) to log2+(D/d0), reflecting that n should be proportional to log2+(D/d0) in both cases. There is, unfortunately, an extra log2 2+ LD2

d0 g0 term in the bound, which we believe to be an artifact of our analysis.

3. Lower Complexity Bounds for Exponentially Bounded Algorithms

We can obtain an interesting class of algorithms, which contains our two D-Adaptation variants, by restricting the rate of growth.

Definition 1. An optimization algorithm is exponentially bounded if there exists a constant d0, so that for any sequence of G-bounded gradients it returns a sequence of iterates such that for all k:

xk x0 2kd0.

Theorem 4. D-Adaptation, Do G and Prodigy are exponentially bounded.

A simple lower complexity bound can be established via a simple 1-dimensional resisting oracle. The bound depends on the scale of the initial step of the algorithm, which is the size of the initial step from x0 to x1. This initial step is

G2 + g0 2 for D-Adaptation, and can be thought of as an algorithm-agnostic measure of d0.

Our lower bound allows the resisting oracle to choose a constant D after seeing x1, which is a much stronger oracle than required for establishing a lower bound. Ideally, a lower bound could be established where the constant D is fixed but unknown to the algorithm, and the actual distance

to solution x0 x D given by the oracle is allowed to depend on the iterate sequence.

The primary consequence of this difference is that our construction only tells us that hard problems exist for n small relative to D/d0, of the scale n < log(D/d0). It remains an open problem to show a more general lower bound for larger n. This is in a sense a trivial consequence of the exponentially bounded property, but is actually representative of the real behavior of the methods during the early steps of the algorithm, where both n and dk actually are small. Any more general lower bound must cover this case.

Our new D-Adaptation variants are optimal among exponentially bounded algorithms for this complexity class: Theorem 5. Consider any exponentially bounded algorithm for minimizing a convex G-Lipschitz function starting from x0, which has no knowledge of problem constants G and D. There exists a fixed gradient oracle such that any sequence of x1,..., xn, there exists a convex Lipschitz problem f with G = 1 and x0 x D for all minimizing points x , consistent with the gradient oracle such that:

min k n f(xk) f DG p

log2(D/x1) 2 n + 1 .

Using the simple construction from Theorem 5, we show in Appendix B that the class of exponentially bounded methods (potentially with an exponent other than 2) covers all Gradient Descent approaches that use an estimate of dk c D for some constant c, and use a step size γk dk/G without line-search or other additional queries. So the only way to achieve a log log dependence on d0 is by using a method that performs some queries that overshoot the standard D/G step size by more than a fixed constant factor. Although using larger step sizes is not problematic for Lipschitz functions, it comes with the risk of causing training divergence when applied to functions whose gradients are only locally bounded by G, which is common in machine learning settings.

Lower complexity bounds for the average regret in the more general online learning setting also apply here. They are of the form (Zhang et al., 2022):

k=0 gk, xk x = Ω

log2(D/ϵ) + ϵ n

where ϵ is a user-specified constant playing a similar role to x1. Bounds on the average regret directly bound function value sub-optimality as

f( x) f 1 n + 1

k=0 [f(xk) f ]

k=0 gk, xk x ,

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

where x = 1 n+1 Pn k=0 xk.

4. Related Work

In this section, we review the major classes of techniques for optimizing convex Lipschitz functions with some level of problem parameter independence.

The Polyak step size (Polyak, 1987) trades the knowledge of D for f , achieving optimal convergence rate without additional log factors. Stable convergence requires accurate f estimates. A restarting scheme converges within a multiplicative log factor of the optimal rate (Hazan & Kakade, 2019). There has been substantial recent research on modifications of the Polyak step size to make it better suited to machine learning tasks (Loizou et al., 2021; Gower et al., 2021; Orvieto et al., 2022) but as of yet they have not seen widespread adoption.

Normalized sub-gradient descent and Ada Grad have received a lot of attention due to their universality (Grimmer, 2019; Levy et al., 2018). Ada Grad s extensions with momentum have been studied by Levy et al. (2018) and Ene et al. (2021).

Coin-betting (Orabona & Tommasi, 2017; Mc Mahan & Orabona, 2014; Cutkosky & Orabona, 2018; Zhang et al., 2022; Orabona & P al, 2021) is a family of approaches from the online learning setting which are also applicable for convex non-smooth optimization. They work by establishing a relationship by duality between regret minimization and wealth-maximization. Existing approaches for wealthmaximization can then be mapped to algorithms for regret minimization. Coin-betting approaches give convergence rates for an equal-weighted average of the iterates of the form:

f( xn) f = O

log (1 + D/d0) n

Standard D-Adaptation obtains asymptotic rates without the log factor, but was otherwise (theoretically) slower in finite time, as it had a log( ) rather than a p

log( ) dependence on D/d0:

f(ˆxn) f 16DG log2+(D/d0) n .

The Prodigy method closes the gap, giving the same sqrt-log dependence as coin betting.

The Do G method (Ivgi et al., 2023), based on the bisection approach of Carmon & Hinder (2022), is the only other approach that we are aware of that estimates D in an online fashion. Do G estimates D by rk:

rk = max i k xi x0 .

Algorithm 3 Prodigy (Adam version)

1: Input: d0 > 0 (default 10 6), x0, β1 (default 0.9), β2 (default 0.999), ϵ (default 10 8), γk (default 1 with cosine annealing) 2: r0 = 0, s0 = 0, m0 = 0, v0 = 0 3: for k = 0 to n do 4: gk f(xk) 5: mk+1 = β1mk + (1 β1)dkgk 6: vk+1 = β2vk + (1 β2)d2 kg2 k 7: rk+1 = β2rk + (1 β2)γkd2 k gk, x0 xk 8: sk+1 = β2sk + (1 β2)γkd2 kgk 9: ˆdk+1 = rk+1 sk+1 1 10: dk+1 = max(dk, ˆdk+1) 11: xk+1 = xk γkdkmk+1/( vk+1 + dkϵ) 12: end for

Ivgi et al. (2023) use this quantity as a plug-in estimate for the numerator of the step size, similar to D-Adaptation s approach. This approach can divergence in theory, but with additional modifications to the step size, the tamed T-Do G method is shown to converge. It has a log+(D/d0) dependence on the initial sub-optimally of the D estimate, so our approach improves on this dependence by a p

log+(D/do) factor.

Malitsky & Mishchenko (2020) proposed Ad GD, a method for convex optimization that does not require any hyperparameters and has a rate that is at least as good as that of the optimally tuned Gradient Descent. However, Ad GD requires the objective to be locally smooth, which hinders its use in many practical problems. Latafat et al. (2023) partially addressed this gap by proposing a proximal extension, but the case of non-smooth differentiable functions has remained unstudied.

5. Deriving Adam-Like Step Sizes

To derive an Adam-like method, which should use an exponential moving average for the step size, we want to approximate Adam s update of the exponential moving average of squared gradients:

vk+1 = β2vk + (1 β2)g2 k = (1 β2)

i=0 βk i 2 g2 i ,

where g2 k is the coordinate-wise square of the gradient gk. We can achieve this using exponential weights, ωk = β k/2 2 , which after substituting the definition of ηk give us the following identity:

d4 k η2 k = d2 k ω2 k G2 + d2 k gk 2 +

i=0 βk i 2 d2 i gi 2.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

0 200 400 600 800 1000

Accuracy (%)

DA D-Adapt (14.9 SE 0.66)

Prodigy DA (32.8 SE 0.74)

SGD D-Adapt (16.8 SE 0.60)

Prodigy SGD (41.0 SE 0.36)

Do G (41.9 SE 0.35)

0 200 400 600 800 1000

DA D-Adapt (72.6 SE 0.03)

Prodigy DA (84.8 SE 0.01)

SGD D-Adapt (73.2 SE 0.03)

Prodigy SGD (85.5 SE 0.01)

Do G (78.8 SE 0.01)

0 200 400 600 800 1000

DA D-Adapt (88.2 SE 1.62)

Prodigy DA (99.1 SE 0.04)

SGD D-Adapt (90.1 SE 1.36)

Prodigy SGD (99.5 SE 0.03)

Do G (98.9 SE 0.15)

0 200 400 600 800 1000

DA D-Adapt (52.3 SE 1.93)

Prodigy DA (67.9 SE 0.21)

SGD D-Adapt (54.0 SE 2.07)

Prodigy SGD (67.8 SE 0.20)

Do G (67.6 SE 0.32)

0 200 400 600 800 1000

Accuracy (%)

DA D-Adapt (93.1 SE 0.39)

Prodigy DA (97.5 SE 0.09)

SGD D-Adapt (93.9 SE 0.36)

Prodigy SGD (98.1 SE 0.07)

Do G (97.3 SE 0.07)

0 200 400 600 800 1000

DA D-Adapt (68.2 SE 0.08)

Prodigy DA (72.8 SE 0.02)

SGD D-Adapt (68.8 SE 0.08)

Prodigy SGD (73.0 SE 0.01)

Do G (72.9 SE 0.02)

0 200 400 600 800 1000

DA D-Adapt (56.7 SE 0.99)

Prodigy DA (88.7 SE 0.24)

SGD D-Adapt (60.3 SE 0.93)

Prodigy SGD (90.9 SE 0.16)

Do G (84.4 SE 0.24)

0 200 400 600 800 1000

DA D-Adapt (22.2 SE 1.31)

Prodigy DA (26.9 SE 1.13)

SGD D-Adapt (22.4 SE 1.30)

Prodigy SGD (28.5 SE 1.11)

Do G (31.1 SE 0.88)

0 200 400 600 800 1000

Accuracy (%)

DA D-Adapt (77.3 SE 0.75)

Prodigy DA (94.8 SE 0.07)

SGD D-Adapt (79.6 SE 0.66)

Prodigy SGD (95.4 SE 0.06)

Do G (92.4 SE 0.11)

0 200 400 600 800 1000

DA D-Adapt (71.7 SE 0.65)

Prodigy DA (77.7 SE 0.10)

SGD D-Adapt (72.9 SE 0.46)

Prodigy SGD (79.0 SE 0.07)

Do G (79.1 SE 0.08)

0 200 400 600 800 1000

DA D-Adapt (64.7 SE 0.34)

Prodigy DA (71.8 SE 0.11)

SGD D-Adapt (65.1 SE 0.24)

Prodigy SGD (73.1 SE 0.03)

Do G (72.0 SE 0.11)

0 200 400 600 800 1000

DA D-Adapt (98.2 SE 0.11)

Prodigy DA (100.0 SE 0.00)

SGD D-Adapt (98.5 SE 0.17)

Prodigy SGD (100.0 SE 0.00)

Do G (99.9 SE 0.07)

Figure 1. Convex multi-class classification problems. Error bars show a range of 1 standard error over 10 seeds.

This can be seen as computing an exponential moving average of dkgk rather than gk itself. In addition, in Appendix A.6, we provide a coordinate-wise version of Algorithm 2 and study its convergence properties. Based on the theory presented there, the denominator for ˆdk+1 should use the ℓ1 norm of the weighted gradient sum to estimate the ℓ distance to the solution D = x0 x . Thus, combining this insight with the design of Algorithm 1, we obtain the following expression for the Adam estimate of D :

ˆdk+1 = Pk i=0 β(k i)/2 2 d2 i gi, x0 xi

Pk i=0 β(k i)/2 2 d2 i gi 1 .

The update uses exponential moving average as well, although it is more conservative as it uses β2 instead of β2. Note that there is an extra of (1 β2) in the update for vk, which can be optionally compensated for by using the bias correction discussed by Kingma & Ba (2015). These update rules are summarized in Algorithm 3. This is the main algorithm that we study numerically in the next section.

6. Experiments

We test our methods on convex logistic regression as well as deep learning problems. The Prodigy method is used as presented in Algorithm 3 in all deep learning experiments.

Logistic regression. For the convex setting, we ran a set of classification experiments. For each dataset, we used the multi-margin loss (Weston & Watkins, 1999), a multi-class generalization of the hinge loss. This non-smooth loss results in bounded gradients, which is required by our theory.

We perform full-batch rather that stochastic optimization, for two reasons. Firstly, it matches the assumptions of our theory. Secondly, fast learning rate adaptation is more crucial for full-batch optimization than stochastic optimization as fewer total steps will be performed. Our convex experiments use the theoretical variants Algorithm 1 and Algorithm 2, but with G = 0 following standard practice.

We performed 1,000 steps for each dataset, using a randomized x0 and plot the results of 10 seeds. We ran both DA and GD variants of each method. Each plot shows the accuracy of the average iterate for each method. Figure 1 shows that our proposed algorithm greatly out-performs regular D-Adaptation. Our weighted GD variant of D-Adaptation is faster consistently across each dataset. Additionally, it adapts faster than the Do G method (Ivgi et al., 2023) on 10 of the 12 problems.

CIFAR10. For neural network experiments1, we consider training on CIFAR10 (Krizhevsky, 2009) with batch size 256, where D-Adapted Adam has a gap of a few percent compared to the standard Adam. We use cosine annealing with initial step size 1 for all Adam-based methods and initial step size 10 3 for Adam itself. The considered networks are VGG11 (Simonyan & Zisserman, 2014) and Res Net50 (He et al., 2016)2. To simplify the experiment, we do not use weight decay, so both networks slightly overfit and do not reach high test accuracy values. All methods were run

1The Py Torch code of our optimizer is available at https: //github.com/konstmish/prodigy

2VGG11 and Res Net-50 implementation along with the data loaders were taken from https://github.com/ kuangliu/pytorch-cifar

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

0 10000 20000 30000 40000 50000 60000

IWSLT14 (LSTM)

Adam (4.31 SE 0.003) Prodigy (4.40 SE 0.005) D-Adapt Adam (4.33 SE 0.003)

0 10000 20000 30000 40000 50000 60000

Training Loss

IWSLT14 (LSTM)

Adam (4.27 SE 0.003) Prodigy (3.83 SE 0.003) D-Adapt Adam (4.03 SE 0.002)

5000 10000 15000 20000

Test Perplexity

Book Wiki (Ro BERTa)

Adam (3.97 SE 0.010) D-Adapt Adam (3.96 SE 0.008) Prodigy (3.96 SE 0.006)

0 5000 10000 15000 20000

Training Perplexity

Book Wiki (Ro BERTa)

Adam (4.24 SE 0.018) D-Adapt Adam (4.26 SE 0.018) Prodigy (4.23 SE 0.031)

0 10000 20000 30000 40000 50000 60000

Test Perplexity

Book Wiki (GPT Transformer)

Adam (19.49 SE 0.012) D-Adapt Adam (19.46 SE 0.019) Prodigy (19.65 SE 0.026)

0 10000 20000 30000 40000 50000 60000

Training Perplexity

Book Wiki (GPT Transformer)

Adam (20.19 SE 0.084) D-Adapt Adam (20.16 SE 0.093) Prodigy (20.13 SE 0.199)

0 20 40 60 80 100

Test Accuracy

Criteo Kaggle (DLRM)

Adam (0.7906 SE 0.00007) D-Adapt Adam (0.7905 SE 0.00014) Prodigy (0.7906 SE 0.00009)

0 20 40 60 80 100

Criteo Kaggle (DLRM)

Adam (0.4444 SE 0.00155) D-Adapt Adam (0.4440 SE 0.00150) Prodigy (0.4437 SE 0.00161)

0 10 20 30 40 50

fast MRI Knee

Adam (0.9103 SE 0.00032) Prodigy (0.9108 SE 0.00058) D-Adapt Adam (0.9105 SE 0.00057)

0 10 20 30 40 50

fast MRI Knee

Adam (0.4281 SE 0.00021) Prodigy (0.4278 SE 0.00055) D-Adapt Adam (0.4276 SE 0.00028)

0 50 100 150 200 250 300

Test Accuracy (%)

ILSVRC 2012 Image Net (Vision Transformer)

D-Adapt Adam (72.35% SE 0.07) Adam (75.40% SE 0.07) Prodigy (74.63% SE 0.21)

0 50 100 150 200 250 300

ILSVRC 2012 Image Net (Vision Transformer)

D-Adapt Adam (3.9e-03 SE 4.9e-06) Adam (3.7e-03 SE 1.9e-05) Prodigy (3.8e-03 SE 3.5e-05)

Figure 2. Adam-family experiments.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

0 50 100 150 200 Epoch

Test accuracy

Adam Do G L-Do G D-Adapt Adam Prodigy

0 50 100 150 200 Epoch

Adam Do G L-Do G D-Adapt Adam Prodigy

0 50 100 150 200

Do G D-Adapt Adam Prodigy

0 50 100 150 200 Epoch

Test accuracy

Adam Do G L-Do G D-Adapt Adam Prodigy

0 50 100 150 200 Epoch

Adam Do G L-Do G D-Adapt Adam Prodigy

0 50 100 150 200

Do G D-Adapt Adam Prodigy

Figure 3. VGG11 (top) and Res Net-50 (bottom) training on CIFAR10. Left: test accuracy (%), middle: train loss, right: step sizes. Prodigy is used as given in Algorithm 3. Prodigy estimates a larger step size than D-Adaptation, which helps it reach test accuracy closer to Adam.

using same 8 random seeds.

We show the results in Figure 3. As we can see, this gap is closed by Prodigy, which is achieved by the larger estimates of the step size.

For Do G and L-Do G, we compute the polynomial-averaging iterate and then report the best of the average and the last iterate. We average with γ = 8, see (Ivgi et al., 2023) for the details. While Do G produces larger step size estimate than Prodigy (see the right column in Figure 3, this is counterbalanced by the larger denominator in Do G. We also tried to modify Do G to use Adam-like step sizes but our heuristic modification diverged on this problem. We also observed that among Do G and its layer-wise version, L-Do G, there is no clear winner as the former performed better on VGG11 and the latter was better when training Res Net-50.

nano GPT transformer. We also train a 6-layer transformer network from nano GPT3 on the Shakespeare dataset. For all methods, we use batch size 256, clip the gradients to have norm not exceeding 1 and use float16 numbers. We use Adam W with hyperparameters given in the repository, i.e., β2 = 0.99, weight decay 0.1, step size 10 3, cosine annealing with warmup over 100 steps. The same weight decay value and cosine annealing is used for Prodigy and D-Adapted Adam, except that the latter two methods use step size 1. We accumulate minibatches of size 12 into a batch of size 480. We tuned the weight decay for Do G and L-Do G and found the value 10 4 to work well for this problem. We ran each method with 8 random seeds and report

3https://github.com/karpathy/nano GPT

the average as well as one-standard-deviation confidence intervals.

See Figure 4 for the results. In terms of the test loss, all methods are roughly equivalent except that Do G and LDo G were slower to reach the best value of roughly 1.5. For the train loss, Prodigy was on par with tuned Adam W and slightly better than D-Adapted Adam. Surprisingly, the estimated step size in Prodigy was very consistent across the 8 random seeds.

6.1. Large-scale Adam experiments

To validate the performance on large-scale practical applications directly against D-Adaptation, we ran the subset of the experiments from Defazio & Mishchenko (2023) that use the Adam optimizer. Methods without coordinate adaptivity are not competitive on these problems and so we exclude SGD and Do G from these comparisons.

LSTM, Ro BERTa, GPT, DLRM, Var Net. On the smallest problem of LSTM training, Prodigy appears to converge significantly faster in training loss and slightly overfits in test loss compared to the baselines. For Ro BERTa (Liu et al., 2019) and GPT (Radford et al., 2019) training on Book Wiki, Prodigy matches the performance of the baseline with only negligible differences. For the application problems, DLRM (Naumov et al., 2019) on the Criteo Kaggle Display Advertising dataset, and fast MRI Var Net (Zbontar et al., 2018), Prodigy again closely matches the baselines.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

0 2500 5000 7500 10000 12500 15000 Iteration

Adam W Do G L-Do G D-Adapt Adam Prodigy

0 2500 5000 7500 10000 12500 15000 Iteration

Adam W Do G L-Do G D-Adapt Adam Prodigy

0 2500 5000 7500 10000 12500 15000 Iteration

Adam W Do G D-Adapt Adam Prodigy

Figure 4. The test (left) and train (middle) loss curves as well as the estimated step size (right) when training a 6-layer nano GPT transformer.

Vi T training. Defazio & Mishchenko (2023) present a negative result for training vision transformer (Dosovitskiy et al., 2021), where D-Adaptation significantly underperforms tuned Adam. We investigated this effect, and we were able to reproduce this gap across a wide range of weightdecay values, although this problem has high run-to-run variance of 1-2% of test accuracy, which makes comparison difficult. Using weight decay 0.05 instead of 0.1 significantly improved performance of each variant, and so we present results for both the baselines and Prodigy at that value. We can see that Prodigy almost closes the gap between tuned Adam and D-Adaptation, giving a test accuracy of 74.63% compared to 75.4% for Adam, and more than 2% higher than D-Adaptation. See Figure 2 for the results.

7. Conclusion

We have presented Prodigy, a new method for learning-rate adaptation that improves upon the adaptation rate of the state-of-the-art D-Adaptation method. We proved convergence of Prodigy on both non-smooth and smooth problems. Prodigy was also shown to adapt faster than other known methods across a range of experiments. A practical limitation of our method is the increased memory requirement as it needs to store vectors x0 and sk in addition to Adam s vectors mk and vk. To remedy this issue, one can compress x0 and sk, for instance, by using lower precision data types, especially since we only use these vectors to produce scalar values for our dk estimates.

Impact Statement

This paper presents work whose goal is to improve the training speed of existing Machine Learning models. We believe the societal consequences of our work are minimal.

Carmon, Y. and Hinder, O. Making SGD parameter-free. In Conference on Learning Theory. PMLR, 2022.

Cutkosky, A. and Orabona, F. Black-box reductions for

parameter-free online learning in banach spaces. In Proceedings of the 31st Conference On Learning Theory, Proceedings of Machine Learning Research. PMLR, 2018.

Defazio, A. and Jelassi, S. Adaptivity without compromise: A momentumized, adaptive, dual averaged gradient method for stochastic optimization. Journal of Machine Learning Research, 23:1 34, 2022.

Defazio, A. and Mishchenko, K. Learning-rate-free learning by D-adaptation. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 7449 7479. PMLR, 23 29 Jul 2023.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61), 2011.

Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7314 7321, 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139 144, 2020.

Gower, R. M., Defazio, A., and Rabbat, M. Stochastic Polyak stepsize with a moving target. ar Xiv preprint ar Xiv:2106.11851, 2021.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Grimmer, B. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM Journal on Optimization, 29(2):1350 1365, 2019.

Hazan, E. and Kakade, S. M. Revisiting the Polyak step size. ar Xiv preprint ar Xiv:1905.00313, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.

Ivgi, M., Hinder, O., and Carmon, Y. Do G is SGD s best friend: A parameter-free dynamic step size schedule. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 14465 14499. PMLR, 23 29 Jul 2023.

Kairouz, P., Mc Mahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1), 2021.

Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. Uni XGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in neural information processing systems, 32, 2019.

Khodak, M., Tu, R., Li, T., Li, L., Balcan, M.-F. F., Smith, V., and Talwalkar, A. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. Advances in Neural Information Processing Systems, 34: 19184 19197, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. URL http://arxiv. org/abs/1412.6980.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Latafat, P., Themelis, A., Stella, L., and Patrinos, P. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. ar Xiv preprint ar Xiv:2301.04431, 2023.

Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. Advances in neural information processing systems, 31, 2018.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Ro BERTa: A robustly optimized BERT pretraining approach. ar Xiv preprint ar Xiv:1907.11692, 2019.

Loizou, N., Vaswani, S., Laradji, I., and Lacoste-Julien, S. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research. PMLR, 2021.

Malitsky, Y. and Mishchenko, K. Adaptive gradient descent without descent. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6702 6712. PMLR, 13 18 Jul 2020.

Mc Mahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research. PMLR, 2014.

Naumov, M., Mudigere, D., Shi, H. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C., Azzolini, A. G., Dzhulgakov, D., Mallevich, A., Cherniavskii, I., Lu, Y., Krishnamoorthi, R., Yu, A., Kondratenko, V., Pereira, S., Chen, X., Chen, W., Rao, V., Jia, B., Xiong, L., and Smelyanskiy, M. Deep learning recommendation model for personalization and recommendation systems. Co RR, 2019.

Orabona, F. and P al, D. Parameter-free stochastic optimization of variationally coherent functions, 2021.

Orabona, F. and Tommasi, T. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, volume 30, 2017.

Orvieto, A., Lacoste-Julien, S., and Loizou, N. Dynamics of SGD with stochastic Polyak stepsizes: Truly adaptive variants and convergence to exact solution. In Advances in Neural Information Processing Systems (Neur IPS). Neur IPS, 2022.

Polyak, B. T. Introduction to optimization. Optimization Software, Inc., 1987.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining. Technical report, Open AI, 2019.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ar Xiv preprint ar Xiv:1409.1556, 2014.

Streeter, M. and Mc Mahan, H. B. Less regret via online conditioning. ar Xiv preprint ar Xiv:1002.4862, 2010.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Ward, R., Wu, X., and Bottou, L. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, 2019.

Weston, J. and Watkins, C. Support vector machines for multi-class pattern recognition. pp. 219 224, 01 1999.

Zbontar, J., Knoll, F., Sriram, A., Muckley, M. J., Bruno, M., Defazio, A., Parente, M., Geras, K. J., Katsnelson, J., Chandarana, H., et al. fast MRI: An open dataset and benchmarks for accelerated MRI. ar Xiv preprint ar Xiv:1811.08839, 2018.

Zhang, Z., Cutkosky, A., and Paschalidis, I. C. PDE-based optimal strategy for unconstrained online learning. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), 2022.

Zoph, B. and Le, Q. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017. URL https:// openreview.net/forum?id=r1Ue8Hcxg.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

A. Analysis of Prodigy

As a reminder, we use the notation log2+(a) = 1 + log2(a) to denote the logarithm that is lower bounded by 1 for any a 1.

A.1. Useful propositions

Proposition 1 (Lemma A.2 in (Levy et al., 2018)). For any sequence of nonnegative real numbers a0, . . . , an v u u t

ak q Pk i=0 ai 2

k=0 ai. (4)

Proof. For completeness, we prove both statements here. Notice that for any α [0, 1], it holds 1 1 α α 2(1 1 α). Substituting α = ak Pk i=0 ai gives

1 ak Pk i=0 ai ak Pk i=0 ai 2

1 ak Pk i=0 ai

If we multiply all sides by q Pk i=0 ai, the inequality above becomes

i=0 ai ak q Pk i=0 ai 2

Summing over k = 0, . . . , n, we get the stated bound.

Proposition 2. For any sequence of nonnegative numbers a0, . . . , an and A > 0, it holds

ak A + Pk i=0 ai log A +

log(A). (5)

Proof. If ai = 0 for some i, we can simply ignore the corresponding summands, so let us assume that ai > 0 for all i. For any t > 0 it holds 1/(1 + t) log(1 + 1/t). Substituting t = Sk/ak, where Sk = A + Pk 1 i=0 ai for k > 0 and S0 = A, we get 1 1 + Sk

ak = ak ak + Sk = ak A + Pk i=0 ai log(1 + ak/Sk) = log(Sk+1) log(Sk).

Summing this over k from 0 to n, we get

ak A + Pk i=0 ai

k=0 (log(Sk+1) log(Sk)) = log(Sn+1) log(S0)

This is exactly what we wanted to prove.

A.2. Proof of Lemma 1

Proof. Following the proof in (Ivgi et al., 2023), we define K = log2 d N

d0 and n = N

K . Consider a partitioning of the sequence t N into half-open intervals Ik = [nk, n(k + 1)) for k = 0 to K 1. We want to show that there is at least one interval such that dk changes by at most a factor of 2 on that interval. We will use proof by contradiction.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Suppose that for all intervals, dnk < 1

2dn(k+1). Then dk at least doubles in every interval, and so:

2K dn K < 1

which implies that d N/d0 > 2K and so K < log2 (d N/d0) which contradicts our definition K = log2 d N

d0 . Therefore,

there exists some ˆk such that dnˆk 1

2dn(ˆk+1). We can now proceed with proving the Lemma by considering the summation over interval Iˆk only:

min t<N dt+1 q Pt k=0 d2 k dn(ˆk+1) q Pn(ˆk+1) 1 k=0 d2 k

dn(ˆk+1) q Pn(ˆk+1) 1 k=nˆk d2 k

dn(ˆk+1) q Pn(ˆk+1) 1 k=nˆk d2 nˆk

= dn(ˆk+1) q

dn(ˆk+1) q 1

4nd2 n(ˆk+1) = 2 n = 2 q N

N log2(d N/d0)+1 1 = 2 q

N log2+ d N

N 2 log2+( d N

A.3. GD analysis

Lemma 2. Assume that d0 D. Then, the estimate dk in Algorithm 1 satisfies dk D for all k.

Proof. By optimality of f , we have f(xk) f 0, so

k=0 ηk(f(xk) f )

k=0 ηk gk, xk x =

k=0 ηk gk, x0 x +

k=0 ηk gk, xk x0 .

Collecting the gradients in the first sum together and using Cauchy-Schwarz inequality, we obtain

k=0 ηk(f(xk) f ) x0 xn+1, x0 x +

k=0 ηk gk, xk x0

x0 xn+1 x0 x +

k=0 ηk gk, xk x0 . (6)

Using the definition of ˆdn+1, this is equivalent to 0 (D ˆdn+1) x0 xn+1 , which implies ˆdn+1 D. Therefore, since d0 D, we can show by induction dn+1 D as well.

Lemma 3. The following inequality holds for the iterates of Algorithm 1:

xn+1 x0 2dn+1 + 1 2dn+1

k=0 η2 k gk 2.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Proof. Let us rewrite ˆdn+1 in a slightly different manner:

ˆdn+1 xn+1 x0 def =

k=0 xk xk+1, x0 xk

1 2 xk+1 x0 2 xk xk+1 2 xk x0 2

2 xn+1 x0 2 1

k=0 xk xk+1 2.

Combining this with the property ˆdn+1 dn+1, we derive

1 2 xn+1 x0 2 1

k=0 xk xk+1 2 = ˆdn+1 xn+1 x0 dn+1 xn+1 x0 .

Applying inequality 2αβ α2 + β2 with α2 = 2d2 n+1 and β2 = 1

2 xn+1 x0 2 and plugging-in the bound above, we establish

2dn+1 xn+1 x0 = 2αβ α2 + β2 = 2d2 n+1 + 1

2 xn+1 x0 2

2d2 n+1 + dn+1 xn+1 x0 + 1

k=0 xk xk+1 2.

Rearranging the terms, we obtain

dn+1 xn+1 x0 2d2 n+1 + 1

k=0 xk xk+1 2 = 2d2 n+1 + 1

k=0 η2 k gk 2.

It remains to divide this inequality by dn+1 to get the desired claim.

Lemma 4. Assuming the weights ω0, . . . , ωn are positive, it holds for the iterates of Algorithm 1:

d4 kω2 k gk 2

d2 k G2 + Pk i=0 d2 i ω2 i gi 2 d2 n log

Proof. The lemma follows straightforwardly from Proposition 2 by substituting ak = d2 k d2n ω2 k gk 2 for k from 0 to n:

d4 kω2 k gk 2

d2 k G2 + Pk i=0 d2 i ω2 i gi 2 = d2 n

d2 k d2 n ω2 k gk 2

G2 + Pk i=0 d2 i d2 k ω2 i gi 2

d2 k d2n ω2 k gk 2

G2 + Pk i=0 d2 i d2n ω2 i gi 2

d2 k d2n ω2 k gk 2 log(G2)

d2 n log 1 +

where in the last step we used d2 k d2n ω2 k gk 2 ω2 k G2.

Let us restate Theorem 1:

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Theorem 6 (Same as Theorem 1). Given any weights 1 ω0 ωn, the functional gap of the average iterate of Algorithm 1 converges as

2ωn DG2dn+1 + dn+1 log(1 + Pn k=0 ω2 k) p Pn k=0 ωkd2 k .

Proof. The first steps in the proof follow the same lines as the theory in (Defazio & Mishchenko, 2023), but we still provide them for completeness.

Firstly, let us continue developing the bound proved in the proof of Lemma 2:

k=0 ηk(f(xk) f ) x0 xn+1 D +

k=0 ηk gk, xk x0

= x0 xn+1 D +

k=0 xk xk+1, xk x0

= x0 xn+1 D + 1

xk xk+1 2 + xk x0 2 xk+1 x0 2

x0 xn+1 D + 1

k=0 xk xk+1 2.

We upper bound the first term with the help of Lemma 3:

k=0 ηk(f(xk) f ) 2Ddn+1 + D 2dn+1

k=0 η2 k gk 2 + 1

k=0 η2 k gk 2.

Since by Lemma 2, 1 D dn+1 , we can simplify it to

k=0 ηk(f(xk) f ) 2Ddn+1 + D dn+1

k=0 η2 k gk 2

= 2Ddn+1 + D dn+1

d4 kω2 k d2 k G2 + Pk i=0 d2 i ω2 i gi 2 gk 2

(7) 2Ddn+1 + D dn+1 d2 n log 1 +

Using the convexity of f, we can apply Jensen s inequality on the iterate xn to get

f(xn) f 1 Pn k=0 ηk

k=0 ηk(f(xk) f ) 2Ddn+1 + D dn+1 d2 n log(1 + Pn k=0 ω2 k) Pn k=0 ηk

D2dn+1 + dn+1 log(1 + Pn k=0 ω2 k) Pn k=0 ηk . (8)

Notice that gi G and ωi ωn, so

ηk = d2 kωk q

d2 k G2 + Pk i=0 d2 i ω2 i gi 2 d2 kωk

d2 k + Pk i=0 d2 i ω2 i d2 kωk

G 2ωn q Pk i=0 d2 i ωi .

Summing over k from 0 to n gives

k=0 ηk 1 2ωn G

d2 kωk q Pk i=0 d2 i ωi

(4) 1 2ωn G

k=0 d2 kωk.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

f(xn) f (8)

2ωn DG dn+1 p Pn k=0 d2 kωk

2 + log 1 +

Corollary 1. Consider Algorithm 1 with n 2 log2 2D

and define t = arg mink n dk Pk i=0 d2 i . If we choose weights

ωk = 1, then it holds

2DG2 + log(n + 2) n

Proof. Substituting ωk in the bound of Theorem 1, we get for any n

f(xn) f (8)

2DG dn+1 p Pn k=0 d2 k log (n + 2) .

If n 2 log2 2D

, using the definition of t, the result of Lemma 1 and the property dn D, we obtain

2DG min k n dk+1 q Pk i=0 d2 i (2 + log (n + 2))

2DG2 + log(n + 2) n

If, in contrast, n 2 log2 2D

, then it trivially holds by convexity and Cauchy-Schwarz

f(xt) f f(x0) f g0, x0 x g0 x0 x GD

Corollary 2. Choose any p 0 ans set the weights to be ωk = (k + 1)p. Then,

f(xn) f = O DG(p + 1)3/2 log(n + 1) n + 1

Proof. Since the sequence d0, d1, . . . is non-decreasing and upper bounded by D, there exists an index ˆn such that dk 2dˆn for any k ˆn. Moreover, we have for n 2(ˆn + 1)

k=ˆn ωk 1 p + 1 (n + 1)p+1 (ˆn + 1)p+1 1 2(p + 1)(n + 1)p+1

k=1 k2p Z n+2

2 x2pdx 1 2p + 1(n + 2)2p+1 1 (n + 2)2p+1 1.

Let us plug this into the bound of Theorem 1 for n 2(ˆn + 1):

2ωn DG dn+1 p Pn k=0 d2 kωk

2(n + 1)p DG p

d2 ˆn Pn k=ˆn ωk (2 + (2p + 1) log(n + 2))

4 p + 1DG n + 1 (2 + (2p + 1) log(n + 2)) = O DG(p + 1)3/2 log(n + 1) n + 1

which matches our claim.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Notice that the bound in Corollary 2 does not depend on D/d0. This is only possible asymptotically for a large enough k and a similar bound without weights was presented by Defazio & Mishchenko (2023).

A.4. GD smooth analysis

We will need the following technical result to show convergence on smooth functions.

Lemma 5. If numbers a > 1, b > 0 are such that a b(1 + log a), then it also holds a 2b(1 + log b).

Proof. Firstly, let us plug-in the upper bound into itself:

a b(1 + log a) b(1 + log(b(1 + log a))) = b(1 + log b + log(1 + log a)).

Therefore, it is enough to show that log(1 + log a) 1 + log b. Notice that for any a > 1 it holds (1 + log a)2 2a, so 1 + log a 2a 1+log a 2b and log(1 + log a) log(2b) < 1 + log b.

Theorem 7 (Same as Theorem 3). Assume f is L-smooth, and set G = 0 and ω0 = = ωn = 1 in Algorithm 1. Then, we get the following convergence guarantee:

f(xt) f = O log2+ D

d0 log2 2+ LD2

where we used xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk and t = arg mink n dk+1 Pk i=0 d2 i .

Proof. Recall that we have established the upper bound

k=0 ηk(f(xk) f ) 2Ddn+1 + D dn+1

k=0 η2 k gk 2

= 2Ddn+1 + D dn+1

d4 k Pk i=0 d2 i gi 2 gk 2.

Firstly, let us work on the right-hand side. Since we do not assume that gk is bounded in the smooth case, we keep the gradients in the upper bound:

d4 k gk 2 Pk i=0 d2 i gi 2 = d2 0 +

d4 k gk 2 Pk i=0 d2 i gi 2

d2 0 + d2 n

d2 k gk 2 Pk i=0 d2 i gi 2

(5) d2 0 + d2 n log d2 0 g0 2 +

k=1 d2 k gk 2 log(d2 0 g0 2)

= d2 0 + d2 n log 1 +

= d2 0 + d2 n log n X

Secondly, we shall work on the left-hand side of the first inequality. The smoothness assumption implies gi 2 2L(f(xi) f ), so

ηk = d2 k q Pk i=0 d2 i gi 2 d2 k

2L q Pk i=0 d2 i (f(xi) f ) .

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Combining this lower bound with the first part of Proposition 1 with ak = d2 k(f(xk) f ) gives

k=0 ηk(f(xk) f )

d2 k(f(xk) f )

2L q Pk i=0 d2 i (f(xi) f ) 1

k=0 d2 k(f(xk) f ).

All together, the bounds above yield

k=0 d2 k(f(xk) f ) 2Ddn+1 + D dn+1 d2 0 + D dn+1 d2 n log n X

Since D dn+1 d2 0 D dn+1 d2 n D dn+1 d2 n+1 = Ddn+1, we have

k=0 d2 k(f(xk) f ) Ddn+1

3 + log n X

3 + log 2L d2 0 g0 2

k=0 d2 k(f(xk) f ) .

Denote Fn = q

2L d2 0 g0 2 Pn k=0 d2 k(f(xk) f ) and notice that the bounds above imply Fn 1. Then, we can rewrite the last bound above as

2L Fn Ddn+1(3 + log F 2 n) = Ddn+1(3 + 2 log Fn) < 3Ddn+1(1 + log Fn).

Applying Lemma 5 with a = Fn and b = 6LDdn+1

d0 g0 , we get

Fn 12LDdn+1

1 + log 6LDdn+1

1 + log 6LD2

Squaring both sides and rearranging yields

k=0 d2 k(f(xk) f ) 72LD2d2 n+1 1 + log 6LD2

Now we can apply Jensen s inequality:

f(xn) f 1 Pn k=0 d2 k

k=0 d2 k(f(xk) f ) d2 n+1 Pn k=0 d2 k 72LD2 1 + log 6LD2

Taking the minimum of the right-hand side over n, we can replace the first ratio in the right-hand side with 4 log2+(D2/d2 0) = 8 log2+(D2/d0).

A.5. DA analysis

Lemma 6. Considering Algorithm 2, we have

γn+1 + Pn k=0 γkd4 k gk 2

Proof. When studying Dual Averaging, we need to introduce an extra sequence that lower bounds dn:

dn+1 def = γn+1 sn+1 2 Pn k=0 γkd4 k gk 2

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Let us show that ˆdn+1 dn+1 by comparing their numerators:

ˆdn+1 sn+1 =

k=0 d2 k gk, x0 xk =

k=0 d2 kγk gk, sk =

k=0 γk sk+1 sk, sk

2 sk+1 2 sk+1 sk 2 sk 2

2 sn+1 2 + 1

k=0 (γk γk+1) sk+1 2 1

k=0 γkd4 k gk 2

γk γk+1 γn+1

k=0 γkd4 k gk 2

= dn+1 sn+1 .

Using the definition of dn+1, and the property dn+1 ˆdn+1 dn+1, we derive

k=0 γkd4 k gk 2 = dn+1 sn+1 dn+1 sn+1 .

Using inequality 2αβ α2 + β2 with α2 = 2d2 n+1 γn+1 and β2 = γn+1

2 sn+1 2 and then the bound above, we establish

2dn+1 sn+1 = 2αβ α2 + β2 = 2d2 n+1 γn+1 + γn+1

2d2 n+1 γn+1 + dn+1 sn+1 + 1

k=0 γkd4 k gk 2.

Rearranging the terms, we obtain

dn+1 sn+1 2d2 n+1 γn+1 + 1

k=0 γkd4 k gk 2.

It remains to divide both sides by dn+1.

Lemma 7. The Dual Averaging algorithm (Algorithm 2) satisfies

k=0 d2 k(f(xk) f ) (D ˆdn+1) sn+1 . (9)

Proof. Summing inequality f(xk) f gk, xk x with weights d2 k, we get

k=0 d2 k(f(xk) f )

k=0 d2 k gk, xk x =

k=0 d2 k gk, x0 x +

k=0 d2 k gk, xk x0 .

Using Cauchy-Schwarz on the first product in the right-hand side and then telescoping the second sum, we obtain

k=0 d2 k(f(xk) f ) sn+1 x0 x +

k=0 d2 k gk, xk x0

= sn+1 D ˆdn+1 sn+1 .

Next, we restate and prove Theorem 2:

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Theorem 8 (Same as Theorem 2). For Algorithm 2, it holds that:

f(xt) f 4GD n

where t = arg mink n dk+1 Pk i=0 d2 i .

Proof. Let us sum inequality d2 k(f(xk) f ) 0 and then apply Lemma 7:

k=0 d2 k(f(xk) f ) (9) (D ˆdn+1) sn+1 .

Clearly, this implies that ˆdn+1 D, and by induction it follows that dn+1 D as well. Now let us upper bound the functional values:

k=0 d2 k(f(xk) f ) (9) D sn+1

k=0 γkd2 k gk, sk

k=0 γk sk+1 sk, sk

= D sn+1 + 1

k=0 γk sk+1 sk 2 + sk 2 sk+1 2

= D sn+1 + 1

k=0 γk sk+1 sk 2 + 1

k=0 (γk γk 1) sk 2 γn

We can drop the last two terms since γk γk 1:

k=0 d2 k(f(xk) f ) D sn+1 + 1

k=0 γk sk+1 sk 2

= D sn+1 + 1

k=0 γkd4 k gk 2.

The first term in the right-hand side is readily bounded by Lemma 6:

k=0 d2 k(f(xk) f ) D sn+1 + 1

k=0 γkd4 k gk 2

γn+1 + D 2dn+1

k=0 γkd4 k gk 2 + 1

k=0 γkd4 k gk 2

dk dn dn+1 2Ddn+1

k=0 γkd2 k gk 2.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Algorithm 4 Prodigy (Coordinate-wise Dual Averaging version)

1: Input: d0 > 0, x0, G 0; s0 = 0 Rp, 1 = (1, . . . , 1) Rp

2: for k = 0 to n do 3: gk f(xk) 4: sk+1 = sk + d2 kgk

5: ˆdk+1 = Pk i=0 d2 i gi, x0 xi sk+1 1 6: dk+1 = max(dk, ˆdk+1) 7: a2 k+1 = d2 k+1G2 1 + Pk i=0 d2 i g2 i Coordinate-wise square 8: Ak+1 = diag(ak+1) 9: xk+1 = x0 A 1 k+1sk+1 10: end for 11: Return xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk

Then, apply Proposition 1:

k=0 d2 k(f(xk) f ) 2D

k=0 γkd2 k gk 2

d2 k G2 + Pk 1 i=0 d2 i gi 2 d2 k gk 2

d2 k gk 2 + Pk 1 i=0 d2 i gi 2 d2 k gk 2

γn+1 + 2Ddn

k=0 d2 k gk 2.

Let us now bound each gradient norm using gk G:

k=0 d2 k(f(xk) f ) 4Ddn+1

k=0 d2 k gk 2 4GDdn+1

Thus, we get the following convergence rate:

f(xt) f 4GDdt+1 q Pt k=0 d2 k Pt k=0 d2 k = 4GDdt+1 q Pt k=0 d2 k = min t <n 4GDdt +1 q Pt k=0 d2 k

A.6. Coordinate-wise Prodigy

Here we study Algorithm 4. The theory in this section follows closely the analysis in Section A.5. There are only a few minor differences such as the use of weighted norms, which we define as x, y A 1 = x A 1y for any matrix A 0. In addition, we use ℓ norm for the distance term and for the gradients, see the assumption below.

Assumption 3. The gradients are upper bounded coordinate-wise: gk G .

We begin with the analogue of Lemma 6:

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Lemma 8. It holds for the iterates of Algorithm 4:

sn+1 1 2dn+1 an+1 1 + 1 2dn+1

k=0 d4 k gk 2 A 1 k .

Proof. As in the proof of Lemma 6, let us introduce an extra sequence dn:

dn+1 def = sn+1 2 A 1 n+1 Pn k=0 d4 k gk 2 A 1 k 2 sn+1 1 .

The next step is to show that ˆdn+1 dn+1 by comparing the numerators:

ˆdn+1 sn+1 1 =

k=0 d2 k gk, x0 xk =

k=0 d2 k gk, sk A 1 k =

k=0 sk+1 sk, sk A 1 k

h sk+1 2 A 1 k sk+1 sk 2 A 1 k sk 2 A 1 k

2 sn+1 2 A 1 n + 1

k=0 sk+1 2 A 1 k A 1 k+1 1

k=0 d4 k gk 2 A 1 k

A 1 k A 1 k+1 1 2 sn+1 2 A 1 n+1 1

k=0 d4 k gk 2 A 1 k

= dn+1 sn+1 1.

Using the definition of dn+1, and the property dn+1 ˆdn+1 dn+1, we derive

1 2 sn+1 2 A 1 n+1 1

k=0 d4 k gk 2 A 1 k = dn+1 sn+1 1 dn+1 sn+1 1 .

Using inequality 2αβ α2 + β2 with α2 = 2d2 n+1a(n+1)i and β2 = 1 2a(n+1)i s2 (n+1)i for i = 1, . . . , p and then the bound above, we establish

2dn+1 sn+1 1 =

i=1 dn+1|s(n+1)i|

2d2 n+1a(n+1)i + 1 2a(n+1)i s2 (n+1)i

= 2d2 n+1 an+1 1 + 1

2 sn+1 A 1 n+1

2d2 n+1 an+1 1 + dn+1 sn+1 1 + 1

k=0 d4 k gk 2 A 1 k .

Rearranging the terms, we get

dn+1 sn+1 1 2d2 n+1 an+1 1 + 1

k=0 d4 k gk 2 A 1 k .

It remains to divide both sides by dn+1.

The next lemma is similar to Lemma 8 except that it uses ℓ norm for the distance to a solution and ℓ1 norm for the weighted gradient sum sn. Lemma 9. The coordinate-wise version of Prodigy (Algorithm 4) satisfies

k=0 d2 k(f(xk) f ) (D ˆdn+1) sn+1 1, (10)

where D = x0 x .

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Proof. Summing inequality f(xk) f gk, xk x with weights d2 k, we get

k=0 d2 k(f(xk) f )

k=0 d2 k gk, xk x =

k=0 d2 k gk, x0 x +

k=0 d2 k gk, xk x0 .

Using H older s inequality on the first product in the right-hand side and then telescoping the second sum, we obtain

k=0 d2 k(f(xk) f ) sn+1 1 x0 x +

k=0 d2 k gk, xk x0

= sn+1 1D ˆdn+1 sn+1 .

The use of ℓ1 norm for the term sn+1 above is motivated by the fact that it naturally arises in other parts of the theory.

Theorem 9. Algorithm 4 converges with the rate

f(xt) f 4p G D n

where t = arg mink n dk+1 Pk i=0 d2 i .

Proof. From Lemma 9, we get

k=0 d2 k(f(xk) f ) (10) (D ˆdn+1) sn+1 1,

so we can prove by induction that dn+1 D . Using the same bounds as before, we get for the average iterate

k=0 d2 k(f(xk) f ) D sn+1 1

k=0 d2 k gk, x0 xk

= D sn+1 1 + 1

k=0 d4 k gk 2 A 1 k + 1

k=0 sk 2 A 1 k A 1 k+1 1

2 sn+1 2 A 1 n+1

D sn+1 1 + 1

k=0 d4 k gk 2 A 1 k .

Let us plug in the bound from Lemma 8:

k=0 d2 k(f(xk) f ) 2D dn+1 an+1 1 + D

k=0 d4 k gk 2 A 1 k + 1

k=0 d4 k gk 2 A 1 k

dn+1 D 2D dn+1 an+1 1 + D

k=0 d4 k gk 2 A 1 k

dk dn 2D dn+1 an+1 1 + D

k=0 d2 k gk 2 A 1 k .

We now apply Proposition 1, and use g2 kj G2 :

k=0 d2 k(f(xk) f ) 2D dn+1 an+1 1 + D

d2 kg2 kj q

d2 k G2 + Pk 1 i=0 d2 i g2 ij

2D dn+1 an+1 1 + 2D

k=0 d2 kg2 kj

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Using Lemma 1, we get the rate for t = arg mint n dt +1 q Pt k=0 d2 k :

f(xt) f 4p G D n

B. Lower Complexity Theory

Theorem 10. Consider any exponentially bounded algorithm for minimizing a convex G-Lipschitz function starting from x0, which has no knowledge of problem constants G and D. There exists a fixed gradient oracle such that any sequence of x1,..., xn, there exists a convex Lipschitz problem f with G = 1 and x0 x D for all minimizing points x , consistent with the gradient oracle such that:

min k n f(xk) f DG p

log2(D/x1) 2 n + 1 .

Proof. We consider the construction of a 1D oracle for this problem. Our oracle returns g0 = 1 and f(xk) = xk for all queries. Without loss of generality we assume that xk > 0 for all k 1, and G = 1.

For each step k 1 we define: x = 2n+1x1,

and thus D = |x0 x | = 2k+1x1. and our construction uses the following function value and gradient sequence

f(x) = |x x | + x .

Note that for all query points x, the gradient is negative, and only the left arm of the absolute value function is seen by the algorithm, so the function appears linear for all test points. Using this construction, we have:

min k n [f(xk) f ] = min k n (x xk)

= 2n+1x1 max k n xk

2 2nx1 2nx1 = 2nx1

Now note that: p

log2(Dn/x1) = p

log2(2n+1) =

Combining these two results:

min k n f(xk) f 1

log2(D/x1) n + 1 .

Theorem 11. D-Adaptation, Do G and Prodigy are exponentially bounded.

Proof. Consider the D lower bound from D-Adaptation:

ˆdn+1 = Pn k=0 dkγk gk, sk

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Recall that n X

k=0 dkγk gk, sk γn+1 sn+1 2 .

Note also that γn+1 1

So the sequence dn is upper bounded by the sequence:

(Pn 1 k=0 ak n 1 d0 n = 0 .

This sequence has the following closed form:

an+1 = 2nd0 for n 1.

We can prove this by induction. The base case is by definition a1 = a0. Then

k=0 ak + an = an + an = 2an = 2nd0.

Note that for both the Dual Averaging form and the GD form we have, we have:

k=0 dk dn+1 2nd0.

It follows that D-Adaptation is exponentially bounded. For Prodigy, note that:

d2 n+1G2 = 1 dn+1G.

1 dn+1G sn+1 2

sn+1 1 dn+1G

The rest of the argument follows the D-Adaptation case, with:

k=0 dk dn+1 2nd0.

Prodigy: An Expeditiously Adaptive Parameter-Free Learner

For Do G, recall the basic Do G step is gradient descent with step sizes:

G2 + Pk i=0 gi 2 .

Using the triangle inequality we have:

xk+1 x0 = xk γkgk x0

xk x0 + γk gk

xk x0 + rk 2 rk.

Chaining gives the result.

Proposition 3. Suppose that dk c D and γk dk/G. then:

xk x0 (2c + 1)n x1 x0 .

Proof. Without loss of generality assume that G = 1. Firstly, note that using the absolute value function as constructed in Theorem 5, it s clear that there is always exists a function with Dk 2 xk x at step k consistent with the sequence of gradients seen so far. Therefore, it must hold that

dk c Dk 2c xk x0 .

We prove the result by induction. For the base case, trivially:

x1 x0 (2c + 1)1 x1 x0 .

For the inductive case:

xk+1 x0 = xk γkgk x0

xk x0 + γk gk

xk x0 + c Dk

xk x0 + c Dk (2c + 1) xk x0

(2c + 1)n+1 x1 x0 .