# Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Konstantin Mishchenko¹, Aaron Defazio²

¹Samsung AI Center  ²Fundamental AI Research Team, Meta. Correspondence to: Konstantin Mishchenko.

Abstract

We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on the Criteo dataset, VarNet on the Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam.

1. Introduction

Optimization is an essential tool in modern machine learning, enabling efficient solutions to large-scale problems that arise in various domains, such as computer vision, natural language processing, and reinforcement learning. One of the key challenges is the selection of appropriate learning rates, which can significantly impact the convergence speed and the quality of the final solution.

Learning-rate tuning has been particularly challenging in applications where multiple agents each use their own optimizer. For instance, when training Generative Adversarial Networks (GANs) (Goodfellow et al., 2020), there are two neural networks with different architectures. In federated learning, tuning is even more challenging (Khodak et al., 2021), since there might be billions of devices (Kairouz et al., 2021), each optimizing their objective locally. Another example is Neural Architecture Search (NAS) (Zoph & Le, 2017), where the goal is to find the best neural network architecture automatically by training many networks and evaluating them on a validation set. In such cases it becomes very expensive to manually tune the learning rate.

Recently, parameter-free adaptive learning-rate methods (Orabona & Tommasi, 2017; Cutkosky & Orabona, 2018; Zhang et al., 2022; Carmon & Hinder, 2022; Ivgi et al., 2023) have gained considerable attention due to their ability to automatically adjust learning rates based on the problem structure and data characteristics. Among these, the D-Adaptation method, introduced by Defazio & Mishchenko (2023), has emerged as a promising practical approach for learning-rate-free optimization. For a convex objective function $f$, D-Adaptation works by maintaining a lower bound on the initial distance to solution $D = \|x_0 - x_\star\|$, for any $x_\star$ in the solution set of the problem
$$\min_{x \in \mathbb{R}^p} f(x).$$
In practice, the lower bound estimated by D-Adaptation increases rapidly during the course of optimization, plateauing at a value close to the true $D$.
This $D$ quantity is the key unknown constant needed to set the learning rate for adaptive optimization methods, forming the numerator of the step size:
$$\eta_k = \frac{D}{\sqrt{\sum_{i=0}^{k} \|g_i\|^2}}, \qquad \text{where } D = \|x_0 - x_\star\|, \qquad (1)$$
and the denominator is based on the AdaGrad step size (Duchi et al., 2011; Streeter & McMahan, 2010; Ward et al., 2019). The Gradient Descent form of D-Adaptation simply plugs in the current lower bound at each step in place of $D$. This simple approach can be applied to estimate the step size in Adam (Kingma & Ba, 2015), which yields state-of-the-art performance across a wide range of deep learning problems. Defazio & Mishchenko (2023) also show that, asymptotically, D-Adaptation is as fast as specifying the step size using the true $D$ (up to small constant factors).

Contributions. We summarize our contributions as follows:

- We present Prodigy, a modification of D-Adaptation that improves its worst-case non-asymptotic convergence rate.
- Through extensive experiments, we demonstrate that Prodigy establishes a new state of the art for learning-rate adaptation, outperforming D-Adaptation.
- We develop a lower complexity bound for methods which grow the learning rate at most exponentially fast. We show that this covers all methods that avoid significant overshooting.
- Prodigy is highly practical. Our open-source implementation is already widely used for fine-tuning of vision and language models, and is the recommended optimizer for Hugging Face Diffusers DreamBooth LoRA training.

2. Prodigy Approach

To understand how we can improve upon D-Adaptation, let us take a closer look at some details of its analysis. Consider the Gradient Descent update $x_{k+1} = x_k - \eta_k g_k$, where $g_k$ is the sub-gradient used at iteration $k$. The standard analysis of AdaGrad with a constant numerator gives the following error term:
$$\text{AdaGrad error} = \sum_{k=0}^{n} \eta_k^2 \|g_k\|^2,$$
which is exactly why AdaGrad places $\sqrt{\sum_{i=0}^{k} \|g_i\|^2}$ in the step-size denominator in equation (1). However, when setting the numerator adaptively with an estimate $d_k$ instead of the unknown $D$, as done in D-Adaptation, we end up with a different error term:
$$\text{D-Adaptation error} = \sum_{k=0}^{n} d_k^2 \eta_k^2 \|g_k\|^2.$$
The theory of D-Adaptation handles this error term by upper bounding $d_k$ by $d_n$, which recovers an error term similar to that of AdaGrad. This upper bound, however, is quite pessimistic, since $d_n$ can be as large as $D$ while $d_k$ can be as small as $d_0$. Therefore, replacing $d_k^2$ with $d_n^2$ can introduce a multiplicative error of $D^2/d_0^2$ in this term.

In this paper, we take a different approach and instead handle the error term using modified AdaGrad-like step sizes. Since the error terms are now $d_i^2 \|g_i\|^2$ instead of $\|g_i\|^2$, the new adaptive step size should be
$$\eta_k = \frac{d_k^2}{\sqrt{\sum_{i=0}^{k} d_i^2 \|g_i\|^2}}.$$
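As a quick numerical sanity check of this choice (an illustration we add here, not part of the paper's experiments), the snippet below compares the D-Adaptation-style step size $d_k/\sqrt{\sum_{i\le k}\|g_i\|^2}$ with the reweighted step size $d_k^2/\sqrt{\sum_{i\le k} d_i^2\|g_i\|^2}$ on an arbitrary gradient sequence with a nondecreasing $d_k$. The reweighted step size is never smaller, which is exactly the comparison spelled out right after Algorithms 1 and 2 below.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
grad_norms = rng.uniform(0.1, 2.0, size=n)             # ||g_k||, arbitrary positive values
d = np.maximum.accumulate(rng.uniform(1e-6, 1.0, n))   # any nondecreasing d_0 <= ... <= d_{n-1}

cum_unweighted = np.cumsum(grad_norms**2)              # sum_{i<=k} ||g_i||^2
cum_weighted = np.cumsum(d**2 * grad_norms**2)         # sum_{i<=k} d_i^2 ||g_i||^2

eta_dadapt = d / np.sqrt(cum_unweighted)               # d_k / sqrt(sum ||g_i||^2)
eta_prodigy = d**2 / np.sqrt(cum_weighted)             # d_k^2 / sqrt(sum d_i^2 ||g_i||^2)

# Holds at every step because d_i <= d_k for all i <= k.
assert np.all(eta_prodigy >= eta_dadapt - 1e-12)
print(float(np.mean(eta_prodigy / eta_dadapt)))        # ratio is >= 1 throughout
```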
Algorithm 1 Prodigy (GD version)

1: Input: $d_0 > 0$, $x_0$, $G \ge 0$
2: for $k = 0$ to $n$ do
3:   $g_k \in \partial f(x_k)$
4:   Choose weight $\omega_k$ (default: $\omega_k = 1$)
5:   $\eta_k = \dfrac{d_k^2 \omega_k}{\sqrt{d_k^2 G^2 + \sum_{i=0}^{k} d_i^2 \omega_i^2 \|g_i\|^2}}$
6:   $x_{k+1} = x_k - \eta_k g_k$
7:   $\hat d_{k+1} = \dfrac{\sum_{i=0}^{k} \eta_i \langle g_i, x_0 - x_i\rangle}{\|x_{k+1} - x_0\|}$
8:   $d_{k+1} = \max(d_k, \hat d_{k+1})$
9: end for
10: Return $\bar x_n = \dfrac{1}{\sum_{k=0}^{n} d_k^2 \omega_k} \sum_{k=0}^{n} d_k^2 \omega_k x_k$

Algorithm 2 Prodigy (Dual Averaging version)

1: Input: $d_0 > 0$, $x_0$, $G \ge 0$; $s_0 = 0 \in \mathbb{R}^p$
2: for $k = 0$ to $n$ do
3:   $g_k \in \partial f(x_k)$
4:   $s_{k+1} = s_k + d_k^2 g_k$
5:   $\hat d_{k+1} = \dfrac{\sum_{i=0}^{k} d_i^2 \langle g_i, x_0 - x_i\rangle}{\|s_{k+1}\|}$
6:   $d_{k+1} = \max(d_k, \hat d_{k+1})$
7:   $\gamma_{k+1} = \dfrac{1}{\sqrt{d_{k+1}^2 G^2 + \sum_{i=0}^{k} d_i^2 \|g_i\|^2}}$
8:   $x_{k+1} = x_0 - \gamma_{k+1} s_{k+1}$
9: end for
10: Return $\bar x_n = \dfrac{1}{\sum_{k=0}^{n} d_k^2} \sum_{k=0}^{n} d_k^2 x_k$

This way, we can still control the error term of D-Adaptation, but the obtained step size is provably larger since $d_k$ is nondecreasing:
$$\frac{d_k^2}{\sqrt{\sum_{i=0}^{k} d_i^2 \|g_i\|^2}} \;\ge\; \frac{d_k^2}{\sqrt{\sum_{i=0}^{k} d_k^2 \|g_i\|^2}} \;=\; \frac{d_k}{\sqrt{\sum_{i=0}^{k} \|g_i\|^2}}.$$
Having larger step sizes while preserving the main error term is the key reason why the new algorithms converge, as we show below, with a faster rate. Notice, however, that the methods might still be slow because the denominator in the step size might grow too large over time. To remedy this, we introduce a modification of the step size by placing an extra weight $\omega_k$ next to the gradients:
$$\eta_k = \frac{d_k^2 \omega_k}{\sqrt{\sum_{i=0}^{k} d_i^2 \omega_i^2 \|g_i\|^2}}.$$
In fact, the modified step size might even increase between iterations, whereas the AdaGrad step size always decreases. We will show that as long as $\omega_k$ does not grow too quickly, the worst-case convergence rate is almost the same. To establish a non-asymptotic theory, we also introduce in our algorithms an extra term $G^2$ in the denominator, which upper bounds the gradient norm, in case such a bound exists. We define it formally in the assumption below.

Assumption 1. We assume that the objective $f$ is $G$-Lipschitz, which implies that its gradients are bounded by $G$: for any $x \in \mathbb{R}^p$ and $g \in \partial f(x)$, it holds that $\|g\| \le G$.

Algorithm 1 and Algorithm 2 give the Gradient Descent and the Dual Averaging variants of our new method. In contrast to AdaGrad, they estimate the product of $D$ and $G$ in the denominator, so we call the proposed technique Prodigy. We give the following convergence result for Algorithm 1:

Theorem 1. Assume $f$ is convex and $G$-Lipschitz. Given any weights $1 \le \omega_0 \le \dots \le \omega_n$, the functional gap of the average iterate of Algorithm 1 converges as
$$f(\bar x_n) - f_\star \;\le\; \frac{2\omega_n D G\, d_{n+1}\bigl(2 + \log\bigl(1 + \sum_{k=0}^{n} \omega_k^2\bigr)\bigr)}{\sqrt{\sum_{k=0}^{n} \omega_k d_k^2}}, \qquad (2)$$
where $\bar x_n = \frac{1}{\sum_{k=0}^{n} d_k^2} \sum_{k=0}^{n} d_k^2 x_k$ is the weighted average iterate.

Notice that we have the freedom to choose any nondecreasing sequence $\omega_k$ as long as the right-hand side is decreasing. This allows us to put much more weight on the recent gradients and get more reasonable step sizes. For instance, we can choose $\omega_k = k^p$ with $p > 0$; since $\sum_{i=0}^{k} i^{2p} = O(k^{2p+1})$, this results in an extra factor of $1 + p$ in the numerator due to the log term. The denominator, on the other hand, would increase as well, giving us a trade-off that depends on the values of the $d_k$'s. We note that weights $\omega_k = k$ have appeared previously in AcceleGrad (Levy et al., 2018) and UniXGrad (Kavis et al., 2019), which combine AdaGrad step sizes with momentum, and $\omega_k = \sqrt{k}$ weighting is used in the recent MADGRAD method (Defazio & Jelassi, 2022).
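For concreteness, here is a minimal NumPy transcription of Algorithm 1 with the default weights $\omega_k = 1$ (a sketch for illustration only, not the reference implementation from our repository; the small constant guarding the division by $\|x_{k+1}-x_0\|$ is ours).

```python
import numpy as np

def prodigy_gd(grad, x0, n_steps, d0=1e-6, G=0.0):
    """Sketch of Algorithm 1 (Prodigy, GD version) with weights w_k = 1.

    grad: a function returning a (sub)gradient of f at the given point.
    """
    x, d = x0.astype(float).copy(), d0
    denom_acc = 0.0          # running sum of d_i^2 ||g_i||^2
    numer_acc = 0.0          # running sum of eta_i <g_i, x0 - x_i>
    weights_sum = 0.0        # sum of d_k^2 (weights of the averaged iterate)
    x_avg = np.zeros_like(x, dtype=float)
    for _ in range(n_steps):
        g = grad(x)
        weights_sum += d**2                                     # weight for line 10 (w_k = 1)
        x_avg += d**2 * x
        denom_acc += d**2 * (g @ g)
        eta = d**2 / np.sqrt(d**2 * G**2 + denom_acc)           # line 5
        numer_acc += eta * (g @ (x0 - x))
        x = x - eta * g                                         # line 6
        d_hat = numer_acc / max(np.linalg.norm(x - x0), 1e-12)  # line 7
        d = max(d, d_hat)                                       # line 8
    return x_avg / weights_sum                                  # line 10
```

Here $d_0$ only needs to be a small positive initial guess; the estimates $d_k$ grow on their own during the run.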
To understand why this improves the convergence rate, consider the following lemma, which we prove in the appendix. The lemma presents an upper bound on the terms related to the $d_k$ sequence in the right-hand side of equation (2).

Lemma 1. Let $d_0 \le d_1 \le \dots \le d_N$ be positive numbers and assume $N \ge 2 \log_{2+}(d_N/d_0)$, where $\log_{2+}(\cdot) = 1 + \log_2(\cdot)$. Then, $\min_{t \le N} \dots$

Ivgi et al. (2023) use this quantity as a plug-in estimate for the numerator of the step size, similar to D-Adaptation's approach. This approach can diverge in theory, but with additional modifications to the step size, the tamed T-DoG method is shown to converge. It has a $\log_+(D/d_0)$ dependence on the initial sub-optimality of the $D$ estimate, so our approach improves on this dependence by a $\sqrt{\log_+(D/d_0)}$ factor.

Malitsky & Mishchenko (2020) proposed AdGD, a method for convex optimization that does not require any hyperparameters and has a rate that is at least as good as that of optimally tuned Gradient Descent. However, AdGD requires the objective to be locally smooth, which hinders its use in many practical problems. Latafat et al. (2023) partially addressed this gap by proposing a proximal extension, but the case of non-smooth differentiable functions has remained unstudied.

5. Deriving Adam-Like Step Sizes

To derive an Adam-like method, which should use an exponential moving average in the step size, we want to approximate Adam's update of the exponential moving average of squared gradients:
$$v_{k+1} = \beta_2 v_k + (1 - \beta_2) g_k^2 = (1 - \beta_2) \sum_{i=0}^{k} \beta_2^{\,k-i} g_i^2,$$
where $g_k^2$ is the coordinate-wise square of the gradient $g_k$. We can achieve this using exponential weights, $\omega_k = \beta_2^{-k/2}$, which after substituting the definition of $\eta_k$ gives the following identity:
$$\frac{d_k^4}{\eta_k^2} = \beta_2^{\,k} d_k^2 G^2 + \sum_{i=0}^{k} \beta_2^{\,k-i} d_i^2 \|g_i\|^2.$$
This can be seen as computing an exponential moving average of $d_k g_k$ rather than of $g_k$ itself.

In addition, in Appendix A.6, we provide a coordinate-wise version of Algorithm 2 and study its convergence properties. Based on the theory presented there, the denominator for $\hat d_{k+1}$ should use the $\ell_1$ norm of the weighted gradient sum to estimate the $\ell_\infty$ distance to the solution, $D_\infty = \|x_0 - x_\star\|_\infty$. Thus, combining this insight with the design of Algorithm 1, we obtain the following expression for the Adam estimate of $D_\infty$:
$$\hat d_{k+1} = \frac{\sum_{i=0}^{k} \beta_2^{(k-i)/2} d_i^2 \langle g_i, x_0 - x_i \rangle}{\bigl\|\sum_{i=0}^{k} \beta_2^{(k-i)/2} d_i^2 g_i\bigr\|_1}.$$
This update uses an exponential moving average as well, although it is more conservative, as it uses $\sqrt{\beta_2}$ instead of $\beta_2$. Note that there is an extra factor of $(1 - \beta_2)$ in the update for $v_k$, which can be optionally compensated for by using the bias correction discussed by Kingma & Ba (2015). These update rules are summarized in Algorithm 3; this is the main algorithm that we study numerically in the next section.

Algorithm 3 Prodigy (Adam version)

1: Input: $d_0 > 0$ (default $10^{-6}$), $x_0$, $\beta_1$ (default 0.9), $\beta_2$ (default 0.999), $\epsilon$ (default $10^{-8}$), $\gamma_k$ (default 1 with cosine annealing)
2: $r_0 = 0$, $s_0 = 0$, $m_0 = 0$, $v_0 = 0$
3: for $k = 0$ to $n$ do
4:   $g_k \in \partial f(x_k)$
5:   $m_{k+1} = \beta_1 m_k + (1 - \beta_1) d_k g_k$
6:   $v_{k+1} = \beta_2 v_k + (1 - \beta_2) d_k^2 g_k^2$
7:   $r_{k+1} = \sqrt{\beta_2}\, r_k + (1 - \sqrt{\beta_2})\, \gamma_k d_k^2 \langle g_k, x_0 - x_k \rangle$
8:   $s_{k+1} = \sqrt{\beta_2}\, s_k + (1 - \sqrt{\beta_2})\, \gamma_k d_k^2 g_k$
9:   $\hat d_{k+1} = r_{k+1} / \|s_{k+1}\|_1$
10:  $d_{k+1} = \max(d_k, \hat d_{k+1})$
11:  $x_{k+1} = x_k - \gamma_k d_k\, m_{k+1} / (\sqrt{v_{k+1}} + d_k \epsilon)$
12: end for
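To make Algorithm 3 concrete, the following sketch spells out a single step in NumPy. It is our own transcription of the pseudocode above for illustration: bias correction and weight decay are omitted, the scalar `lr` plays the role of $\gamma_k$, and the guard on the $\ell_1$ denominator is ours; the optimized PyTorch implementation is the one released in our repository.

```python
import numpy as np

def prodigy_adam_step(x, grad, state, x0, lr=1.0,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of Algorithm 3 (Prodigy, Adam version); lr plays the role of gamma_k."""
    d = state["d"]
    rb2 = np.sqrt(beta2)  # the r and s accumulators decay with sqrt(beta2), as in lines 7-8
    state["m"] = beta1 * state["m"] + (1 - beta1) * d * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * (d * grad) ** 2
    state["r"] = rb2 * state["r"] + (1 - rb2) * lr * d**2 * (grad @ (x0 - x))
    state["s"] = rb2 * state["s"] + (1 - rb2) * lr * d**2 * grad
    d_hat = state["r"] / max(np.sum(np.abs(state["s"])), 1e-12)  # ell_1 norm in the denominator
    state["d"] = max(d, d_hat)   # d_{k+1}; the x-update below still uses the old d_k
    return x - lr * d * state["m"] / (np.sqrt(state["v"]) + d * eps)

# State initialization: all accumulators at zero, small initial d, e.g.
# state = {"m": np.zeros_like(x), "v": np.zeros_like(x),
#          "s": np.zeros_like(x), "r": 0.0, "d": 1e-6}
```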
6. Experiments

We test our methods on convex logistic regression as well as deep learning problems. The Prodigy method is used as presented in Algorithm 3 in all deep learning experiments.

Logistic regression. For the convex setting, we ran a set of classification experiments. For each dataset, we used the multi-margin loss (Weston & Watkins, 1999), a multi-class generalization of the hinge loss. This non-smooth loss results in bounded gradients, which is required by our theory. We perform full-batch rather than stochastic optimization, for two reasons. Firstly, it matches the assumptions of our theory. Secondly, fast learning-rate adaptation is more crucial for full-batch optimization than for stochastic optimization, as fewer total steps will be performed.
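To make the setup concrete, the full-batch objective used here can be written in a few lines of PyTorch. This is a sketch of the setting, not our experiment harness; the synthetic tensors `X` and `y` stand in for an actual benchmark dataset.

```python
import torch
import torch.nn as nn

# Synthetic stand-ins for a dataset: N examples, p features, 5 classes.
X = torch.randn(1000, 20)
y = torch.randint(0, 5, (1000,))
W = torch.zeros(20, 5, requires_grad=True)   # linear multi-class model

loss_fn = nn.MultiMarginLoss()               # multi-class hinge loss (Weston & Watkins, 1999)

def full_batch_grad(W):
    # One deterministic (sub)gradient of the full-batch objective; no mini-batching.
    loss = loss_fn(X @ W, y)
    (g,) = torch.autograd.grad(loss, W)
    return loss.item(), g
```

The resulting sub-gradients are bounded, matching Assumption 1.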
Our convex experiments use the theoretical variants, Algorithm 1 and Algorithm 2, but with $G = 0$, following standard practice. We performed 1,000 steps for each dataset, using a randomized $x_0$, and plot the results over 10 seeds. We ran both the DA and GD variants of each method. Each plot shows the accuracy of the average iterate for each method.

Figure 1. Convex multi-class classification problems (12 datasets; accuracy of the average iterate over 1,000 steps for DA D-Adapt, Prodigy DA, SGD D-Adapt, Prodigy SGD, and DoG). Error bars show a range of ±1 standard error over 10 seeds.

Figure 1 shows that our proposed algorithm greatly outperforms regular D-Adaptation. Our weighted GD variant of D-Adaptation is consistently faster across each dataset. Additionally, it adapts faster than the DoG method (Ivgi et al., 2023) on 10 of the 12 problems.

CIFAR10. For the neural network experiments¹, we consider training on CIFAR10 (Krizhevsky, 2009) with batch size 256, where D-Adapted Adam has a gap of a few percent compared to standard Adam. We use cosine annealing with initial step size 1 for all Adam-based methods and initial step size $10^{-3}$ for Adam itself. The considered networks are VGG11 (Simonyan & Zisserman, 2014) and ResNet-50 (He et al., 2016)². To simplify the experiment, we do not use weight decay, so both networks slightly overfit and do not reach high test accuracy values. All methods were run using the same 8 random seeds.

¹The PyTorch code of our optimizer is available at https://github.com/konstmish/prodigy
²The VGG11 and ResNet-50 implementations, along with the data loaders, were taken from https://github.com/kuangliu/pytorch-cifar
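For reference, the training setup described above (initial step size 1 with cosine annealing for the adaptive methods) looks roughly as follows with our released optimizer. The pip distribution name `prodigyopt`, the `Prodigy` constructor arguments, and the VGG11 variant used here are assumptions for the sake of a self-contained sketch; see the repository in footnote 1 for the actual interface.

```python
import torch
import torchvision
from prodigyopt import Prodigy  # assumed package/class names; optimizer from github.com/konstmish/prodigy

model = torchvision.models.vgg11(num_classes=10)   # stand-in for the VGG11 setup above
opt = Prodigy(model.parameters(), lr=1.0)          # "learning rate" kept at 1; the method supplies d_k
num_epochs = 100                                   # illustrative value
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=num_epochs)

tfm = torchvision.transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10("data", train=True, download=True, transform=tfm)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

for epoch in range(num_epochs):
    for xb, yb in train_loader:
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(xb), yb).backward()
        opt.step()
    sched.step()   # cosine annealing of the multiplier gamma_k
```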
Figure 3. VGG11 (top) and ResNet-50 (bottom) training on CIFAR10 (Adam, DoG, L-DoG, D-Adapt Adam, and Prodigy). Left: test accuracy (%), middle: train loss, right: step sizes. Prodigy is used as given in Algorithm 3. Prodigy estimates a larger step size than D-Adaptation, which helps it reach test accuracy closer to Adam.

We show the results in Figure 3. As we can see, this gap is closed by Prodigy, which is achieved by its larger estimates of the step size. For DoG and L-DoG, we compute the polynomial-averaging iterate and then report the best of the average and the last iterate. We average with $\gamma = 8$; see Ivgi et al. (2023) for the details. While DoG produces a larger step-size estimate than Prodigy (see the right column of Figure 3), this is counterbalanced by the larger denominator in DoG. We also tried to modify DoG to use Adam-like step sizes, but our heuristic modification diverged on this problem. We also observed that between DoG and its layer-wise version, L-DoG, there is no clear winner: the former performed better on VGG11 and the latter was better when training ResNet-50.

nanoGPT transformer. We also train a 6-layer transformer network from nanoGPT³ on the Shakespeare dataset. For all methods, we use batch size 256, clip the gradients to have norm not exceeding 1, and use float16 numbers. We use AdamW with the hyperparameters given in the repository, i.e., $\beta_2 = 0.99$, weight decay 0.1, step size $10^{-3}$, and cosine annealing with warmup over 100 steps. The same weight decay value and cosine annealing are used for Prodigy and D-Adapted Adam, except that the latter two methods use step size 1. We accumulate minibatches of size 12 into a batch of size 480. We tuned the weight decay for DoG and L-DoG and found the value $10^{-4}$ to work well for this problem. We ran each method with 8 random seeds and report the average as well as one-standard-deviation confidence intervals. See Figure 4 for the results. In terms of the test loss, all methods are roughly equivalent, except that DoG and L-DoG were slower to reach the best value of roughly 1.5. For the train loss, Prodigy was on par with tuned AdamW and slightly better than D-Adapted Adam. Surprisingly, the estimated step size in Prodigy was very consistent across the 8 random seeds.

³https://github.com/karpathy/nanoGPT

6.1. Large-scale Adam experiments

To validate the performance on large-scale practical applications directly against D-Adaptation, we ran the subset of the experiments from Defazio & Mishchenko (2023) that use the Adam optimizer. Methods without coordinate adaptivity are not competitive on these problems, and so we exclude SGD and DoG from these comparisons.

LSTM, RoBERTa, GPT, DLRM, VarNet. On the smallest problem, LSTM training, Prodigy appears to converge significantly faster in training loss and slightly overfits in test loss compared to the baselines. For RoBERTa (Liu et al., 2019) and GPT (Radford et al., 2019) training on BookWiki, Prodigy matches the performance of the baseline with only negligible differences.
For the application problems, DLRM (Naumov et al., 2019) on the Criteo Kaggle Display Advertising dataset and the fastMRI VarNet (Zbontar et al., 2018), Prodigy again closely matches the baselines.

Figure 4. The test (left) and train (middle) loss curves, as well as the estimated step size (right), when training a 6-layer nanoGPT transformer (AdamW, DoG, L-DoG, D-Adapt Adam, and Prodigy).

ViT training. Defazio & Mishchenko (2023) present a negative result for training a vision transformer (Dosovitskiy et al., 2021), where D-Adaptation significantly underperforms tuned Adam. We investigated this effect and were able to reproduce the gap across a wide range of weight-decay values, although this problem has a high run-to-run variance of 1–2% in test accuracy, which makes comparison difficult. Using weight decay 0.05 instead of 0.1 significantly improved the performance of each variant, and so we present results for both the baselines and Prodigy at that value. We can see that Prodigy almost closes the gap between tuned Adam and D-Adaptation, giving a test accuracy of 74.63% compared to 75.4% for Adam, and more than 2% higher than D-Adaptation. See Figure 2 for the results.

Figure 2. Adam-family experiments: test and train curves for Adam, D-Adapt Adam, and Prodigy on IWSLT14 (LSTM), BookWiki (RoBERTa and GPT transformer), Criteo Kaggle (DLRM), fastMRI Knee (VarNet), and ILSVRC 2012 ImageNet (Vision Transformer).

7. Conclusion

We have presented Prodigy, a new method for learning-rate adaptation that improves upon the adaptation rate of the state-of-the-art D-Adaptation method. We proved convergence of Prodigy on both non-smooth and smooth problems. Prodigy was also shown to adapt faster than other known methods across a range of experiments. A practical limitation of our method is the increased memory requirement, as it needs to store the vectors $x_0$ and $s_k$ in addition to Adam's vectors $m_k$ and $v_k$. To remedy this issue, one can compress $x_0$ and $s_k$, for instance by using lower-precision data types, especially since we only use these vectors to produce scalar values for our $d_k$ estimates.

Impact Statement

This paper presents work whose goal is to improve the training speed of existing machine learning models. We believe the societal consequences of our work are minimal.

References

Carmon, Y. and Hinder, O. Making SGD parameter-free. In Conference on Learning Theory. PMLR, 2022.

Cutkosky, A. and Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the 31st Conference on Learning Theory, Proceedings of Machine Learning Research. PMLR, 2018.

Defazio, A. and Jelassi, S. Adaptivity without compromise: A momentumized, adaptive, dual averaged gradient method for stochastic optimization. Journal of Machine Learning Research, 23:1–34, 2022.

Defazio, A. and Mishchenko, K. Learning-rate-free learning by D-Adaptation. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 7449–7479. PMLR, 2023.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61), 2011.

Ene, A., Nguyen, H. L., and Vladu, A. Adaptive gradient methods for constrained convex optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 7314–7321, 2021.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.

Gower, R. M., Defazio, A., and Rabbat, M. Stochastic Polyak stepsize with a moving target. arXiv preprint arXiv:2106.11851, 2021.

Grimmer, B. Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM Journal on Optimization, 29(2):1350–1365, 2019.

Hazan, E. and Kakade, S. M. Revisiting the Polyak step size. arXiv preprint arXiv:1905.00313, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Ivgi, M., Hinder, O., and Carmon, Y. DoG is SGD's best friend: A parameter-free dynamic step size schedule. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 14465–14499. PMLR, 2023.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., Charles, Z., Cormode, G., Cummings, R., et al. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1), 2021.

Kavis, A., Levy, K. Y., Bach, F., and Cevher, V. UniXGrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. Advances in Neural Information Processing Systems, 32, 2019.

Khodak, M., Tu, R., Li, T., Li, L., Balcan, M.-F. F., Smith, V., and Talwalkar, A. Federated hyperparameter tuning: Challenges, baselines, and connections to weight-sharing. Advances in Neural Information Processing Systems, 34:19184–19197, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015. URL http://arxiv.org/abs/1412.6980.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Latafat, P., Themelis, A., Stella, L., and Patrinos, P. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. arXiv preprint arXiv:2301.04431, 2023.

Levy, K. Y., Yurtsever, A., and Cevher, V. Online adaptive methods, universality and acceleration. Advances in Neural Information Processing Systems, 31, 2018.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

Loizou, N., Vaswani, S., Laradji, I., and Lacoste-Julien, S. Stochastic Polyak step-size for SGD: An adaptive learning rate for fast convergence. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research. PMLR, 2021.
Malitsky, Y. and Mishchenko, K. Adaptive gradient descent without descent. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 6702–6712. PMLR, 2020.

McMahan, H. B. and Orabona, F. Unconstrained online linear learning in Hilbert spaces: Minimax algorithms and normal approximations. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research. PMLR, 2014.

Naumov, M., Mudigere, D., Shi, H. M., Huang, J., Sundaraman, N., Park, J., Wang, X., Gupta, U., Wu, C., Azzolini, A. G., Dzhulgakov, D., Mallevich, A., Cherniavskii, I., Lu, Y., Krishnamoorthi, R., Yu, A., Kondratenko, V., Pereira, S., Chen, X., Chen, W., Rao, V., Jia, B., Xiong, L., and Smelyanskiy, M. Deep learning recommendation model for personalization and recommendation systems. CoRR, 2019.

Orabona, F. and Pál, D. Parameter-free stochastic optimization of variationally coherent functions, 2021.

Orabona, F. and Tommasi, T. Training deep networks without learning rates through coin betting. In Advances in Neural Information Processing Systems, volume 30, 2017.

Orvieto, A., Lacoste-Julien, S., and Loizou, N. Dynamics of SGD with stochastic Polyak stepsizes: Truly adaptive variants and convergence to exact solution. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

Polyak, B. T. Introduction to Optimization. Optimization Software, Inc., 1987.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pretraining. Technical report, OpenAI, 2019.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Streeter, M. and McMahan, H. B. Less regret via online conditioning. arXiv preprint arXiv:1002.4862, 2010.

Ward, R., Wu, X., and Bottou, L. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In International Conference on Machine Learning, 2019.

Weston, J. and Watkins, C. Support vector machines for multi-class pattern recognition. pp. 219–224, 1999.

Zbontar, J., Knoll, F., Sriram, A., Muckley, M. J., Bruno, M., Defazio, A., Parente, M., Geras, K. J., Katsnelson, J., Chandarana, H., et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.

Zhang, Z., Cutkosky, A., and Paschalidis, I. C. PDE-based optimal strategy for unconstrained online learning. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), 2022.

Zoph, B. and Le, Q. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=r1Ue8Hcxg.

A. Analysis of Prodigy

As a reminder, we use the notation $\log_{2+}(a) = 1 + \log_2(a)$ to denote the logarithm that is lower bounded by 1 for any $a \ge 1$.

A.1. Useful propositions

Proposition 1 (Lemma A.2 in (Levy et al., 2018)). For any sequence of nonnegative real numbers $a_0, \dots, a_n$,
$$\sqrt{\sum_{i=0}^{n} a_i} \;\le\; \sum_{k=0}^{n} \frac{a_k}{\sqrt{\sum_{i=0}^{k} a_i}} \;\le\; 2 \sqrt{\sum_{i=0}^{n} a_i}. \qquad (4)$$

Proof. For completeness, we prove both statements here. Notice that for any $\alpha \in [0, 1]$, it holds that
$$1 - \sqrt{1 - \alpha} \;\le\; \alpha \;\le\; 2\bigl(1 - \sqrt{1 - \alpha}\bigr).$$
Substituting $\alpha = \frac{a_k}{\sum_{i=0}^{k} a_i}$ gives
$$1 - \sqrt{1 - \frac{a_k}{\sum_{i=0}^{k} a_i}} \;\le\; \frac{a_k}{\sum_{i=0}^{k} a_i} \;\le\; 2\Bigl(1 - \sqrt{1 - \frac{a_k}{\sum_{i=0}^{k} a_i}}\Bigr).$$
If we multiply all sides by $\sqrt{\sum_{i=0}^{k} a_i}$, the inequality above becomes
$$\sqrt{\sum_{i=0}^{k} a_i} - \sqrt{\sum_{i=0}^{k-1} a_i} \;\le\; \frac{a_k}{\sqrt{\sum_{i=0}^{k} a_i}} \;\le\; 2\Bigl(\sqrt{\sum_{i=0}^{k} a_i} - \sqrt{\sum_{i=0}^{k-1} a_i}\Bigr).$$
Summing over $k = 0, \dots, n$, we get the stated bound.

Proposition 2.
For any sequence of nonnegative numbers a0, . . . , an and A > 0, it holds ak A + Pk i=0 ai log A + log(A). (5) Proof. If ai = 0 for some i, we can simply ignore the corresponding summands, so let us assume that ai > 0 for all i. For any t > 0 it holds 1/(1 + t) log(1 + 1/t). Substituting t = Sk/ak, where Sk = A + Pk 1 i=0 ai for k > 0 and S0 = A, we get 1 1 + Sk ak = ak ak + Sk = ak A + Pk i=0 ai log(1 + ak/Sk) = log(Sk+1) log(Sk). Summing this over k from 0 to n, we get ak A + Pk i=0 ai k=0 (log(Sk+1) log(Sk)) = log(Sn+1) log(S0) This is exactly what we wanted to prove. A.2. Proof of Lemma 1 Proof. Following the proof in (Ivgi et al., 2023), we define K = log2 d N d0 and n = N K . Consider a partitioning of the sequence t N into half-open intervals Ik = [nk, n(k + 1)) for k = 0 to K 1. We want to show that there is at least one interval such that dk changes by at most a factor of 2 on that interval. We will use proof by contradiction. Prodigy: An Expeditiously Adaptive Parameter-Free Learner Suppose that for all intervals, dnk < 1 2dn(k+1). Then dk at least doubles in every interval, and so: 2K dn K < 1 which implies that d N/d0 > 2K and so K < log2 (d N/d0) which contradicts our definition K = log2 d N d0 . Therefore, there exists some ˆk such that dnˆk 1 2dn(ˆk+1). We can now proceed with proving the Lemma by considering the summation over interval Iˆk only: min t 1, b > 0 are such that a b(1 + log a), then it also holds a 2b(1 + log b). Proof. Firstly, let us plug-in the upper bound into itself: a b(1 + log a) b(1 + log(b(1 + log a))) = b(1 + log b + log(1 + log a)). Therefore, it is enough to show that log(1 + log a) 1 + log b. Notice that for any a > 1 it holds (1 + log a)2 2a, so 1 + log a 2a 1+log a 2b and log(1 + log a) log(2b) < 1 + log b. Theorem 7 (Same as Theorem 3). Assume f is L-smooth, and set G = 0 and ω0 = = ωn = 1 in Algorithm 1. Then, we get the following convergence guarantee: f(xt) f = O log2+ D d0 log2 2+ LD2 where we used xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk and t = arg mink n dk+1 Pk i=0 d2 i . Proof. Recall that we have established the upper bound k=0 ηk(f(xk) f ) 2Ddn+1 + D dn+1 k=0 η2 k gk 2 = 2Ddn+1 + D dn+1 d4 k Pk i=0 d2 i gi 2 gk 2. Firstly, let us work on the right-hand side. Since we do not assume that gk is bounded in the smooth case, we keep the gradients in the upper bound: d4 k gk 2 Pk i=0 d2 i gi 2 = d2 0 + d4 k gk 2 Pk i=0 d2 i gi 2 d2 0 + d2 n d2 k gk 2 Pk i=0 d2 i gi 2 (5) d2 0 + d2 n log d2 0 g0 2 + k=1 d2 k gk 2 log(d2 0 g0 2) = d2 0 + d2 n log 1 + = d2 0 + d2 n log n X Secondly, we shall work on the left-hand side of the first inequality. The smoothness assumption implies gi 2 2L(f(xi) f ), so ηk = d2 k q Pk i=0 d2 i gi 2 d2 k 2L q Pk i=0 d2 i (f(xi) f ) . Prodigy: An Expeditiously Adaptive Parameter-Free Learner Combining this lower bound with the first part of Proposition 1 with ak = d2 k(f(xk) f ) gives k=0 ηk(f(xk) f ) d2 k(f(xk) f ) 2L q Pk i=0 d2 i (f(xi) f ) 1 k=0 d2 k(f(xk) f ). All together, the bounds above yield k=0 d2 k(f(xk) f ) 2Ddn+1 + D dn+1 d2 0 + D dn+1 d2 n log n X Since D dn+1 d2 0 D dn+1 d2 n D dn+1 d2 n+1 = Ddn+1, we have k=0 d2 k(f(xk) f ) Ddn+1 3 + log n X 3 + log 2L d2 0 g0 2 k=0 d2 k(f(xk) f ) . Denote Fn = q 2L d2 0 g0 2 Pn k=0 d2 k(f(xk) f ) and notice that the bounds above imply Fn 1. Then, we can rewrite the last bound above as 2L Fn Ddn+1(3 + log F 2 n) = Ddn+1(3 + 2 log Fn) < 3Ddn+1(1 + log Fn). 
Applying Lemma 5 with a = Fn and b = 6LDdn+1 d0 g0 , we get Fn 12LDdn+1 1 + log 6LDdn+1 1 + log 6LD2 Squaring both sides and rearranging yields k=0 d2 k(f(xk) f ) 72LD2d2 n+1 1 + log 6LD2 Now we can apply Jensen s inequality: f(xn) f 1 Pn k=0 d2 k k=0 d2 k(f(xk) f ) d2 n+1 Pn k=0 d2 k 72LD2 1 + log 6LD2 Taking the minimum of the right-hand side over n, we can replace the first ratio in the right-hand side with 4 log2+(D2/d2 0) = 8 log2+(D2/d0). A.5. DA analysis Lemma 6. Considering Algorithm 2, we have γn+1 + Pn k=0 γkd4 k gk 2 Proof. When studying Dual Averaging, we need to introduce an extra sequence that lower bounds dn: dn+1 def = γn+1 sn+1 2 Pn k=0 γkd4 k gk 2 Prodigy: An Expeditiously Adaptive Parameter-Free Learner Let us show that ˆdn+1 dn+1 by comparing their numerators: ˆdn+1 sn+1 = k=0 d2 k gk, x0 xk = k=0 d2 kγk gk, sk = k=0 γk sk+1 sk, sk 2 sk+1 2 sk+1 sk 2 sk 2 2 sn+1 2 + 1 k=0 (γk γk+1) sk+1 2 1 k=0 γkd4 k gk 2 γk γk+1 γn+1 k=0 γkd4 k gk 2 = dn+1 sn+1 . Using the definition of dn+1, and the property dn+1 ˆdn+1 dn+1, we derive k=0 γkd4 k gk 2 = dn+1 sn+1 dn+1 sn+1 . Using inequality 2αβ α2 + β2 with α2 = 2d2 n+1 γn+1 and β2 = γn+1 2 sn+1 2 and then the bound above, we establish 2dn+1 sn+1 = 2αβ α2 + β2 = 2d2 n+1 γn+1 + γn+1 2d2 n+1 γn+1 + dn+1 sn+1 + 1 k=0 γkd4 k gk 2. Rearranging the terms, we obtain dn+1 sn+1 2d2 n+1 γn+1 + 1 k=0 γkd4 k gk 2. It remains to divide both sides by dn+1. Lemma 7. The Dual Averaging algorithm (Algorithm 2) satisfies k=0 d2 k(f(xk) f ) (D ˆdn+1) sn+1 . (9) Proof. Summing inequality f(xk) f gk, xk x with weights d2 k, we get k=0 d2 k(f(xk) f ) k=0 d2 k gk, xk x = k=0 d2 k gk, x0 x + k=0 d2 k gk, xk x0 . Using Cauchy-Schwarz on the first product in the right-hand side and then telescoping the second sum, we obtain k=0 d2 k(f(xk) f ) sn+1 x0 x + k=0 d2 k gk, xk x0 = sn+1 D ˆdn+1 sn+1 . Next, we restate and prove Theorem 2: Prodigy: An Expeditiously Adaptive Parameter-Free Learner Theorem 8 (Same as Theorem 2). For Algorithm 2, it holds that: f(xt) f 4GD n where t = arg mink n dk+1 Pk i=0 d2 i . Proof. Let us sum inequality d2 k(f(xk) f ) 0 and then apply Lemma 7: k=0 d2 k(f(xk) f ) (9) (D ˆdn+1) sn+1 . Clearly, this implies that ˆdn+1 D, and by induction it follows that dn+1 D as well. Now let us upper bound the functional values: k=0 d2 k(f(xk) f ) (9) D sn+1 k=0 γkd2 k gk, sk k=0 γk sk+1 sk, sk = D sn+1 + 1 k=0 γk sk+1 sk 2 + sk 2 sk+1 2 = D sn+1 + 1 k=0 γk sk+1 sk 2 + 1 k=0 (γk γk 1) sk 2 γn We can drop the last two terms since γk γk 1: k=0 d2 k(f(xk) f ) D sn+1 + 1 k=0 γk sk+1 sk 2 = D sn+1 + 1 k=0 γkd4 k gk 2. The first term in the right-hand side is readily bounded by Lemma 6: k=0 d2 k(f(xk) f ) D sn+1 + 1 k=0 γkd4 k gk 2 γn+1 + D 2dn+1 k=0 γkd4 k gk 2 + 1 k=0 γkd4 k gk 2 dk dn dn+1 2Ddn+1 k=0 γkd2 k gk 2. Prodigy: An Expeditiously Adaptive Parameter-Free Learner Algorithm 4 Prodigy (Coordinate-wise Dual Averaging version) 1: Input: d0 > 0, x0, G 0; s0 = 0 Rp, 1 = (1, . . . , 1) Rp 2: for k = 0 to n do 3: gk f(xk) 4: sk+1 = sk + d2 kgk 5: ˆdk+1 = Pk i=0 d2 i gi, x0 xi sk+1 1 6: dk+1 = max(dk, ˆdk+1) 7: a2 k+1 = d2 k+1G2 1 + Pk i=0 d2 i g2 i Coordinate-wise square 8: Ak+1 = diag(ak+1) 9: xk+1 = x0 A 1 k+1sk+1 10: end for 11: Return xn = 1 Pn k=0 d2 k Pn k=0 d2 kxk Then, apply Proposition 1: k=0 d2 k(f(xk) f ) 2D k=0 γkd2 k gk 2 d2 k G2 + Pk 1 i=0 d2 i gi 2 d2 k gk 2 d2 k gk 2 + Pk 1 i=0 d2 i gi 2 d2 k gk 2 γn+1 + 2Ddn k=0 d2 k gk 2. 
Let us now bound each gradient norm using gk G: k=0 d2 k(f(xk) f ) 4Ddn+1 k=0 d2 k gk 2 4GDdn+1 Thus, we get the following convergence rate: f(xt) f 4GDdt+1 q Pt k=0 d2 k Pt k=0 d2 k = 4GDdt+1 q Pt k=0 d2 k = min t 0 for all k 1, and G = 1. For each step k 1 we define: x = 2n+1x1, and thus D = |x0 x | = 2k+1x1. and our construction uses the following function value and gradient sequence f(x) = |x x | + x . Note that for all query points x, the gradient is negative, and only the left arm of the absolute value function is seen by the algorithm, so the function appears linear for all test points. Using this construction, we have: min k n [f(xk) f ] = min k n (x xk) = 2n+1x1 max k n xk 2 2nx1 2nx1 = 2nx1 Now note that: p log2(Dn/x1) = p log2(2n+1) = Combining these two results: min k n f(xk) f 1 log2(D/x1) n + 1 . Theorem 11. D-Adaptation, Do G and Prodigy are exponentially bounded. Proof. Consider the D lower bound from D-Adaptation: ˆdn+1 = Pn k=0 dkγk gk, sk Prodigy: An Expeditiously Adaptive Parameter-Free Learner Recall that n X k=0 dkγk gk, sk γn+1 sn+1 2 . Note also that γn+1 1 So the sequence dn is upper bounded by the sequence: (Pn 1 k=0 ak n 1 d0 n = 0 . This sequence has the following closed form: an+1 = 2nd0 for n 1. We can prove this by induction. The base case is by definition a1 = a0. Then k=0 ak + an = an + an = 2an = 2nd0. Note that for both the Dual Averaging form and the GD form we have, we have: k=0 dk dn+1 2nd0. It follows that D-Adaptation is exponentially bounded. For Prodigy, note that: d2 n+1G2 = 1 dn+1G. 1 dn+1G sn+1 2 sn+1 1 dn+1G The rest of the argument follows the D-Adaptation case, with: k=0 dk dn+1 2nd0. Prodigy: An Expeditiously Adaptive Parameter-Free Learner For Do G, recall the basic Do G step is gradient descent with step sizes: G2 + Pk i=0 gi 2 . Using the triangle inequality we have: xk+1 x0 = xk γkgk x0 xk x0 + γk gk xk x0 + rk 2 rk. Chaining gives the result. Proposition 3. Suppose that dk c D and γk dk/G. then: xk x0 (2c + 1)n x1 x0 . Proof. Without loss of generality assume that G = 1. Firstly, note that using the absolute value function as constructed in Theorem 5, it s clear that there is always exists a function with Dk 2 xk x at step k consistent with the sequence of gradients seen so far. Therefore, it must hold that dk c Dk 2c xk x0 . We prove the result by induction. For the base case, trivially: x1 x0 (2c + 1)1 x1 x0 . For the inductive case: xk+1 x0 = xk γkgk x0 xk x0 + γk gk xk x0 + c Dk xk x0 + c Dk (2c + 1) xk x0 (2c + 1)n+1 x1 x0 .