# MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent

Kaan Ozkara (1), Can Karakus (2), Parameswaran Raman (2), Mingyi Hong (2,3), Shoham Sabach (4), Branislav Kveton (2), Volkan Cevher (2,5)

## Abstract

Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam than other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.

(1) Department of Electrical and Computer Engineering, University of California Los Angeles, USA; (2) Amazon Web Services; (3) Department of Electrical and Computer Engineering, University of Minnesota, USA; (4) Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology, Israel; (5) LIONS, IEM, STI, École Polytechnique Fédérale de Lausanne, Switzerland. SS, MH, and VC hold concurrent appointments as Amazon Scholars and as faculty at Technion, the University of Minnesota, and EPFL, respectively. This paper describes their work performed at Amazon. Correspondence to: Can Karakus.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

The choice of an optimization algorithm plays a critical role in determining the downstream performance of a machine learning model. Adaptive moment optimizers are the most preferred class of optimizers employed in most learning tasks, such as training Large Language Models (LLMs) (Brown et al., 2020; Touvron et al., 2023) and Diffusion Models (Rombach et al., 2022). In particular, Adam (Kingma and Ba, 2015) is still the go-to optimizer in LLM training, despite the emergence of many other optimizers since then. Recently proposed optimizers (Chen et al., 2023; Xie et al., 2023; Liu et al., 2023; You et al., 2020; Foret et al., 2021) report improved performance compared to Adam in specific tasks. However, it is unclear if their strong performance generalizes across a wide range of tasks as Adam's does (Schmidt et al., 2021). It is also unclear if a single optimizer can be uniformly the best across all learning tasks and training regimes, such as different batch sizes, hyper-parameters, and datasets.

In this work, we introduce the concept of a parameterized optimizer, which can be viewed as a unified parameterization of a collection of given optimizers. The parameters of a parameterized optimizer define a convex polytope
(e.g., a hypercube), whose vertices may correspond to individual base optimizers, while the interior represents new optimizers formed by interpolating between them. To make this concept operational, we propose the meta-adaptive optimizer (MADA), which combines the parameterized optimizer with hyper-gradient descent (Baydin et al., 2018) to learn the optimizer coefficients. MADA dynamically adjusts the parameterized optimizer coefficients during training, and effectively adapts the optimizer choice to the learning task. While MADA bears some connections to optimizer search methods, such as (Chen et al., 2023), it does not solely output a final optimizer state. It dynamically adapts the optimizer to the current neighborhood of the loss landscape on-the-fly, removing the need for an offline optimizer search stage or an outer hyper-parameter selection loop.

**Contributions.** We make the following contributions:

- We introduce the concept of a parameterized optimizer, which takes a collection of existing optimizers and unifies them into a single optimizer, combining the individual update rules through learnable coefficients.
- We propose MADA, a meta-optimization framework that combines parameterized optimizers with hyper-gradient descent to learn a specific optimizer instance during training.
- We find that not all optimizers are suitable for use in a hyper-gradient optimization framework. Specifically, among popular optimizers, using AMSGrad (Reddi et al., 2018) results in poor performance within MADA due to its use of the maximum operator. Motivated by this, we propose a modification of it called AVGrad, which replaces the maximum operator on the second-order moments with time-averaging and leads to better performance when used as part of MADA. We also analyze its convergence properties (see Appendix B). We show that MADA converges to AVGrad in a simple example where it is known to perform better than Adam, providing evidence for the effectiveness of MADA in adapting the optimizer to the task (see Appendix C).
- To demonstrate that optimizer interpolations can improve convergence bounds in an analytically tractable scenario, we theoretically analyze the convergence behavior of interpolations between two optimizers, namely AVGrad and Adam. Our analysis shows that the interpolated optimizer improves the convergence bounds of the base optimizers up to constant factors.
- We develop a specific parameterized optimizer, which interpolates between Adam (Kingma and Ba, 2015), AVGrad, Yogi (Zaheer et al., 2018), Adan (Xie et al., 2023), and Lion (Chen et al., 2023). On language tasks we compare MADA, which is based on this parameterized optimizer, against Adam, Lion, Adan, and HyperAdam (throughout this paper, we refer to the version of Adam whose β1 and β2 parameters are tuned with hyper-gradients, as in (Chandra et al., 2022), as HyperAdam), and show that MADA consistently outperforms all baselines. We also illustrate the robustness of MADA to initial hyper-parameters and analyze the evolution of hyper-parameters during training. On vision tasks we compare MADA to Adam, SGD with momentum, and HyperAdam, and observe consistent performance improvement.

**Related work.** Motivated by the high cost of training large language models, a large number of novel optimizers have been proposed in recent years to speed up training, increase generalization performance, or train more resource-efficient models.
As mentioned before, Adam (Kingma and Ba, 2015) is the most commonly employed optimizer, and succeeding methods, in general, try to improve upon it under different scenarios. (Reddi et al., 2018) and (Zaheer et al., 2018) propose AMSGrad and Yogi, respectively, to fix Adam's potentially non-decreasing effective learning rate. (Xie et al., 2023) and (Dozat, 2016) introduce Adan and Nadam, respectively, to replace the heavy-ball momentum in Adam with Nesterov momentum. (You et al., 2017; 2020) propose LARS and LAMB to improve the performance of Adam in the large-batch regime. (Heo et al., 2021) proposes AdamP to avoid premature decay of scale-invariant weights. AdaBound (Luo et al., 2019) and AdaBelief (Zhuang et al., 2020) try to estimate a more stable second-order moment term. (Foret et al., 2021) proposes sharpness-aware minimization (SAM) and (Chen et al., 2020) proposes Padam to increase the generalization performance of the trained models. (Liu et al., 2020) stabilizes training by reducing gradient variance. The work in (Chen et al., 2023) is similar in spirit to ours, in that it introduces a method to symbolically search a space of optimizers; however, it considers an offline search method, whereas our method learns the optimizer during actual model training and uses hyper-gradients.

A separate line of work proposed learned optimizers, which delegate the optimization task to neural networks (fully connected or LSTMs) (Almeida et al., 2021; Metz et al., 2022; 2020). Instead of directly using first-order information as in optimizers, these methods treat gradient information, alongside other features such as training loss and validation loss (Almeida et al., 2021), as inputs to a neural network that outputs the update to be applied to a particular weight. Learned optimizers require resource-intensive offline training (e.g., thousands of TPU-months in (Metz et al., 2022)), since the updates are handled through a separate neural network that needs to be trained before deployment. Integrating a separate neural network into the optimization process introduces additional complexity, which MADA avoids by requiring only a few parameters to be learned on-the-fly. Another related line of work is on gradient-based hyper-parameter optimization (Almeida et al., 1999; Maclaurin et al., 2015; Franceschi et al., 2017; Baydin et al., 2018; Chandra et al., 2022); our work applies a similar idea in a new setting, namely optimizer search and adaptation. Finally, we note that our work more broadly relates to a long line of research on AutoML and meta-learning (Andrychowicz et al., 2016; Wichrowska et al., 2017; Hospedales et al., 2021; Real et al., 2020).

## 2. Parameterized Optimizers

We focus on minimizing a loss function $F : \mathbb{R}^d \to \mathbb{R}$, that is, optimization problems of the form

$$\min_{x \in \mathbb{R}^d} F(x).$$
Throughout the paper, we denote by $f : \mathbb{R}^d \to \mathbb{R}$ a random function computed on a minibatch sampled from the underlying data distribution. Therefore, for any $x \in \mathbb{R}^d$, we have $\mathbb{E}[f(x)] = F(x)$. Moreover, we assume $F$ is differentiable and that $\mathbb{E}[\nabla f(x)] = \nabla F(x)$ for all $x \in \mathbb{R}^d$. We use $f_t$ to denote the random function evaluated on the mini-batch sampled at time $t$.

Informally, a parameterized optimizer can be described as the convex hull of a set of optimizers, when mapped to a Euclidean space under a certain parameterization. This section is devoted to explaining this notion in more detail, before Section 3 discusses how to perform learning in this Euclidean space.

In order to build a parameterized optimizer, we begin with a collection of existing optimizers. For the sake of concreteness, we illustrate the idea through the following four optimizers: Adam, AMSGrad, Adan, and Yogi. A careful inspection reveals that these four optimizers can be described by one update rule that involves three iteratively generated sequences: model parameters, first-order moments, and second-order moments, which will be denoted throughout using the vectors $x$, $m$, and $v$, respectively. Each optimizer only differs in the way it updates these sequences. More precisely, starting with any $x_0 \in \mathbb{R}^d$ and setting $v_0 = m_0 = 0$, the generic update rule is given by

$$x_t = x_{t-1} - \alpha_t\,\frac{m_t}{\sqrt{v_t} + \epsilon}, \tag{1}$$

where $\alpha_t > 0$ is the learning rate and $\epsilon > 0$ is given. Table 1 shows how each optimizer defines its first- and second-moment iterates within this formulation, where we define $g_t := \nabla f_t(x_{t-1})$ and use $\beta_1, \beta_2, \beta_3$ to denote hyper-parameters controlling the moment updates. For simplicity of the exposition, we omit the bias-correction terms for all optimizers.

Table 1: A unified framework to express adaptive moment optimizers.

| Method | First-Order Moment | Second-Order Moment |
| --- | --- | --- |
| Adam (Kingma and Ba, 2015) | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ | $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$ |
| AMSGrad (Reddi et al., 2018) | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ | $\tilde v_t = \beta_2 \tilde v_{t-1} + (1-\beta_2) g_t^2$, $\;v_t = \max\{v_{t-1}, \tilde v_t\}$ |
| Adan (Xie et al., 2023) | $\tilde m_t = \beta_1 \tilde m_{t-1} + (1-\beta_1) g_t$, $\;n_t = \beta_3 n_{t-1} + (1-\beta_3)(g_t - g_{t-1})$, $\;m_t = \tilde m_t + \beta_3 n_t$ | $\hat g_t = g_t + \beta_3 (g_t - g_{t-1})$, $\;v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\hat g_t^2$ |
| Yogi (Zaheer et al., 2018) | $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ | $\hat g_t = v_{t-1} + g_t^2\,\mathrm{sign}(g_t^2 - v_{t-1})$, $\;v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\hat g_t$ |

To parameterize the design space of these four optimizers, we introduce real coefficients restricted between 0 and 1 that interpolate terms arising from the different optimizers. Particular choices of these coefficients (typically at the extreme values of 0 and 1) recover the underlying optimizers, while intermediate values express a new optimizer. As an example, observe that the first-order moment update rule of Adan already subsumes those of Adam, AMSGrad, and Yogi (where $\{\tilde m_t\}$ and $\{n_t\}$ are new sequences introduced by Adan):

$$\tilde m_t = \beta_1 \tilde m_{t-1} + (1-\beta_1) g_t, \qquad n_t = \beta_3 n_{t-1} + (1-\beta_3)(g_t - g_{t-1}), \qquad m_t = \tilde m_t + \beta_3 n_t, \tag{2}$$

where taking $\beta_3 = 0$ recovers the update rules of the other three optimizers since $m_t = \tilde m_t$. With respect to the second-order moments, Adan similarly covers Adam when $\beta_3 = 0$, since we then get $\hat g_t = g_t$. In order to incorporate the second-order moments of AMSGrad and Yogi in the same parameterization, we introduce two new coefficients $c, \rho \in [0, 1]$, and unify the second-moment computation as follows:

$$
\begin{aligned}
\hat g_t &= g_t + \beta_3 (g_t - g_{t-1}) \\
\tilde g_t^2 &= c\,\hat g_t^2 + (1-c)\big(\tilde v_{t-1} + \hat g_t^2\,\mathrm{sign}(\hat g_t^2 - \tilde v_{t-1})\big) \\
\tilde v_t &= \beta_2\,\tilde v_{t-1} + (1-\beta_2)\,\tilde g_t^2 \\
v_t^{(\max)} &= \max\{v_{t-1}^{(\max)}, \tilde v_t\} \\
v_t &= \rho\,\tilde v_t + (1-\rho)\,v_t^{(\max)}.
\end{aligned} \tag{3}
$$

It is easy to check that, for instance, the second-order moments of Adam can be recovered when β3 = 0 and c = ρ = 1; those of AMSGrad are recovered when β3 = ρ = 0 and c = 1; Adan can be recovered when c = ρ = 1; and Yogi can be recovered with β3 = c = 0 and ρ = 1. To summarize, in this example, for the four optimizers Adam, AMSGrad, Adan, and Yogi, our parameterized optimizer is given by the three update rules (1), (2), and (3).
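For concreteness, the following Python sketch implements one step of the update rules (1)-(3) on NumPy arrays. The function name, state layout, and numeric defaults are illustrative assumptions (they are not taken from our code release); the coefficient settings listed at the end simply restate the recoveries discussed above.

```python
import numpy as np

def parameterized_step(x, state, g, g_prev, coeffs, lr=1e-3, eps=1e-8):
    """One step of the parameterized optimizer defined by (1)-(3)."""
    b1, b2, b3, c, rho = (coeffs[k] for k in ("beta1", "beta2", "beta3", "c", "rho"))
    m_tilde, n, v_tilde, v_max = (state[k] for k in ("m_tilde", "n", "v_tilde", "v_max"))

    # First-order moments, eq. (2): Adan's rule subsumes the others (beta3 = 0).
    m_tilde = b1 * m_tilde + (1 - b1) * g
    n = b3 * n + (1 - b3) * (g - g_prev)
    m = m_tilde + b3 * n

    # Second-order moments, eq. (3): c mixes Adam/Adan- and Yogi-style increments,
    # rho mixes the raw second moment with the AMSGrad running maximum.
    g_hat = g + b3 * (g - g_prev)
    g_sq = c * g_hat**2 + (1 - c) * (v_tilde + g_hat**2 * np.sign(g_hat**2 - v_tilde))
    v_tilde = b2 * v_tilde + (1 - b2) * g_sq
    v_max = np.maximum(v_max, v_tilde)
    v = rho * v_tilde + (1 - rho) * v_max

    x = x - lr * m / (np.sqrt(v) + eps)            # generic update, eq. (1)
    return x, {"m_tilde": m_tilde, "n": n, "v_tilde": v_tilde, "v_max": v_max}

# Vertices of the coefficient polytope recover the base optimizers, for example:
adam    = {"beta1": 0.9, "beta2": 0.999, "beta3": 0.0, "c": 1.0, "rho": 1.0}
amsgrad = {"beta1": 0.9, "beta2": 0.999, "beta3": 0.0, "c": 1.0, "rho": 0.0}
yogi    = {"beta1": 0.9, "beta2": 0.999, "beta3": 0.0, "c": 0.0, "rho": 1.0}
adan    = {"beta1": 0.9, "beta2": 0.999, "beta3": 0.9, "c": 1.0, "rho": 1.0}
```

Interior points of the same coefficient polytope correspond to new, interpolated optimizers.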
Following the same line of arguments, one can generate parameterized optimizers for other collections of given optimizers as well. However, unless we know a priori how to set the corresponding coefficients, the parameterized optimizer is not readily usable in practice. In the next section, we develop our meta-optimizer, which uses a parameterized optimizer and learns the coefficients in an online fashion.

## 3. Meta-Adaptive Optimizers

In the previous section, we described how one can parameterize the design space of a given collection of optimizers. Here, we define MADA as a meta-optimization framework that dynamically learns the coefficients of a parameterized optimizer during training. In this work, we use hyper-gradient descent (Baydin et al., 2018; Chandra et al., 2022) to learn these coefficients, which avoids an expensive hyper-parameter optimization loop involving multiple training runs. However, we note that in principle other techniques can also be combined with parameterized optimizers to learn the coefficients.

**Learning interpolation coefficients.** Hyper-gradient descent views the hyper-parameters as trainable parameters of the loss function, and thus differentiates the loss with respect to them and updates them through gradient steps. (Chandra et al., 2022) re-purposes the PyTorch autograd machinery (Paszke et al., 2017) to automatically compute gradients with respect to optimization hyper-parameters such as the learning rate and Adam's β1 and β2 parameters. Since the update rule of a parameterized optimizer is differentiable with respect to its learnable coefficients, we can also update them using hyper-gradient descent steps.

To describe our meta-optimizer MADA precisely, we need additional notation. We denote a parameterized optimizer by $O_q$, where $q \in D$ is the vector of coefficients that defines the optimizer and $D$ is the domain of $q$. In the example from Section 2, we have $q = (\beta_1, \beta_2, \beta_3, \rho, c)$ and $O_q$ is given by (1), (2), and (3). Note that $O_q$ also encapsulates non-learnable state (such as the first- and second-order moments) and other hyper-parameters such as the learning rate, weight decay, and stability parameter. In the example from Section 2, the domain is the unit hypercube, where each element lies in the range [0, 1]. We denote by $\Pi_D$ the orthogonal projection onto the set $D$. Algorithm 1 provides pseudocode for a generic MADA.

**Algorithm 1** Pseudocode for a generic MADA

Input: a parameterized optimizer $O_q$ with $q \in D$, a hyper-learning rate $\alpha$, the total number of iterations $T$. Initialize $x_0$ and $q_0$.

1. For $t = 1$ to $T$:
2. Sample $f_t$.
3. Update model parameters: $x_t = O_{q_{t-1}}(x_{t-1})$.
4. Update optimizer coefficients: $q_t = \Pi_D\!\left[q_{t-1} - \alpha\,\nabla_q f_t(x_{t-1})\right]$.
5. End for.

Output: model weights $x_T$.
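A minimal runnable sketch of Algorithm 1 is shown below on a toy quadratic objective, assuming a single learnable coefficient ρ that interpolates between an Adam-style second moment and its running time-average (the AVGrad rule introduced in Section 4; ρ = 1 gives Adam, ρ = 0 gives AVGrad). The toy loss, the one-step (truncated) hyper-gradient, and all constants are illustrative choices and not our actual implementation, which builds on the autograd-based machinery of (Chandra et al., 2022).

```python
import torch

torch.manual_seed(0)
x_star = torch.tensor([3.0, -2.0])

def sample_loss(x):                                   # stochastic f_t on a toy quadratic
    noise = 0.1 * torch.randn(2)
    return 0.5 * ((x - x_star - noise) ** 2).sum()

rho = torch.tensor(1.0, requires_grad=True)           # learnable optimizer coefficient q
x = torch.zeros(2, requires_grad=True)                # model parameters x_0
m = torch.zeros(2)                                    # first-order moment
v_tilde = torch.zeros(2)                              # Adam-style second moment
v_bar = torch.zeros(2)                                # running average of past v_tilde's

lr, hyper_lr = 0.1, 5e-2
beta1, beta2, eps = 0.9, 0.95, 1e-8

for t in range(1, 201):
    loss = sample_loss(x)
    # Gradients w.r.t. the parameters and the coefficient; the dependence on rho
    # comes through the previous update step, i.e. the hyper-gradient of step 4.
    g, hyper_g = torch.autograd.grad(loss, [x, rho], allow_unused=True)
    if hyper_g is not None:                           # rho has no effect on x_0 at t = 1
        with torch.no_grad():
            rho -= hyper_lr * hyper_g
            rho.clamp_(0.0, 1.0)                      # projection onto D = [0, 1]

    g = g.detach()
    m = beta1 * m + (1 - beta1) * g
    v_tilde = beta2 * v_tilde + (1 - beta2) * g ** 2
    v_bar = (v_tilde + (t - 1) * v_bar) / t
    # Keep rho in the graph of the new iterate so the next loss is differentiable
    # w.r.t. rho (a one-step truncation of the dependence, as in eq. (4) below).
    v = rho * v_tilde + (1 - rho) * v_bar
    x = x.detach() - lr * m / (v.sqrt() + eps)

print(f"learned rho = {rho.item():.3f}, final x = {x.detach()}")
```

In an actual training run the same pattern extends to the full coefficient vector $q$; the only requirement is that each update rule remain differentiable in its coefficients.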
**Hyper-gradient computation.** Before concluding this section, we briefly illustrate the hyper-gradient computation (further details on hyper-gradients can be found in (Baydin et al., 2018; Chandra et al., 2022)). In order to compute the hyper-gradient of the loss with respect to a particular optimizer coefficient, we treat the updated model weights as a function of that coefficient. For instance, considering again the example from Section 2, we show how to compute the gradient of the function $f_t$ with respect to the coefficient $\rho$ under the parameterized optimizer given in (3). Using the chain rule, we obtain

$$\frac{\partial f_t}{\partial \rho} \;=\; \frac{\partial f_t(x_{t-1})}{\partial x_{t-1}} \cdot \frac{\alpha_{t-1}\, m_{t-1}}{2\sqrt{v_{t-1}}\,\big(\sqrt{v_{t-1}} + \epsilon\big)^2} \cdot \big(\tilde v_{t-1} - v_{t-1}^{(\max)}\big), \tag{4}$$

where the first term $\partial f_t(x_{t-1})/\partial x_{t-1}$ is readily computed using standard back-propagation (in (4), the first term is a row vector, the second term is a Jacobian matrix, and the third term is a column vector, so the products are matrix multiplications). Note that the latter two objects in (4) are not explicitly constructed in practice; instead, autograd simply continues back-propagating the already-computed parameter gradients into the optimizer coefficients, while handling the necessary book-keeping for the hyper-gradient computation.

## 4. On Convergence of Interpolated Optimizers

Since MADA uses hyper-gradients to update the optimizer, a prerequisite for a base optimizer to work well within MADA is that its update rules allow efficient flow of hyper-gradients to the parameterization coefficients. In particular, in our experiments with MADA we found that including AMSGrad among the base optimizers has an adverse effect on the end performance of the trained model. We conjecture that this is caused by the maximum operator in the second-moment term of AMSGrad. Specifically, note that back-propagation of gradients through $\max(a, b)$ corresponds to a simple routing operation to the larger of $a$ and $b$, while the smaller one is passed zero gradients. In AMSGrad, $v_t = \max\{v_{t-1}, \tilde v_t\}$, which means that for most steps the first term is larger, causing insufficient hyper-gradient updates to the β2 parameter through the $\tilde v_t$ path. (Strictly speaking, in the example parameterization (3), the $\rho\,\tilde v_t$ term provides another path for hyper-gradients on β2; however, in practice we observed that when AMSGrad is used, the ρ parameter tends to vanish, diminishing the effect of this path as well.) To remedy this, we introduce AVGrad, which replaces the maximum operator with time-averaging of the second moments, and which results in better hyper-gradient flow and validation loss.

### 4.1. AVGrad and its interpolation with Adam

Following the notation in Table 1, we define AVGrad as the optimizer whose first- and second-order moments are given by

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad \tilde v_t = \beta_2 \tilde v_{t-1} + (1-\beta_2) g_t^2, \qquad \bar v_t = \tfrac{1}{t}\big(\tilde v_t + (t-1)\,\bar v_{t-1}\big), \qquad v_t = \bar v_t, \tag{5}$$

where $\bar v_t$ is the running average of the past $\tilde v_t$'s. We provide the convergence analysis of AVGrad in Appendix B (Theorems 2 and 3). In what follows, we focus on the interpolation between Adam and AVGrad, with interpolation coefficient $\rho_t$. Specifically, we replace the last equation in (5) with $v_t := \rho_t \tilde v_t + (1-\rho_t)\bar v_t$, where $\rho_t$ interpolates between the second-order moments of Adam and AVGrad.
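To make the contrast concrete, the short PyTorch snippet below compares how a hyper-gradient with respect to β2 flows through AMSGrad's maximum versus the time-average in (5). The numerical values are arbitrary and chosen only so that the running maximum dominates, which is the typical situation described above.

```python
import torch

beta2 = torch.tensor(0.95, requires_grad=True)
g_sq = torch.tensor(0.04)        # current squared gradient g_t^2
v_prev = torch.tensor(0.25)      # previous second moment, tilde v_{t-1}
v_max_prev = torch.tensor(0.30)  # AMSGrad running max (already dominant)
v_bar_prev = torch.tensor(0.20)  # AVGrad running average over t - 1 = 9 steps

v_tilde = beta2 * v_prev + (1 - beta2) * g_sq        # = 0.2395 < v_max_prev

# AMSGrad path: max{., .} routes the gradient entirely to the larger input.
v_ams = torch.maximum(v_max_prev, v_tilde)
(g_ams,) = torch.autograd.grad(v_ams, beta2, retain_graph=True)

# AVGrad path, eq. (5) with t = 10: v_bar_t = (v_tilde + (t - 1) * v_bar_prev) / t.
v_avg = (v_tilde + 9 * v_bar_prev) / 10
(g_avg,) = torch.autograd.grad(v_avg, beta2)

print(g_ams.item())   # 0.0   -> no hyper-gradient reaches beta2 through the max
print(g_avg.item())   # 0.021 =  (v_prev - g_sq) / t, a nonzero hyper-gradient
```

Whenever the running maximum dominates, the max passes a zero hyper-gradient, whereas the average always propagates a signal proportional to $(\tilde v_{t-1} - g_t^2)/t$.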
**Soundness of AVGrad.** We start with a proposition showing that AVGrad is a valid alternative to AMSGrad in mitigating the sub-optimal convergence issue of Adam (Reddi et al., 2018). Specifically, when $\rho_t$ decays as $1/t$ (i.e., the optimizer converges to AVGrad), the interpolated optimizer fixes the non-decreasing effective learning rate problem of Adam, similar to AMSGrad.

**Proposition 1.** The following inequality is a sufficient condition for a non-increasing effective learning rate:
$$\rho_t \le \frac{1}{t(1-\beta_2) + 1}.$$

In Appendix C, we use MADA to solve the convex problem given in (Reddi et al., 2018), an example where Adam fails. We find that MADA quickly converges to AVGrad, and thus to the optimum, by adapting $\rho_t \to 0$, even when it is initialized from Adam ($\rho_0 = 1$).

### 4.2. Convergence of the interpolated optimizer

Due to the complexity of our overall parameterization, deriving theoretical guarantees for MADA at its full scope is challenging. Therefore, in this section, we reduce the scope and examine the convergence properties of interpolations between two optimizers, namely AVGrad and Adam. In this setting, we show that the parameterized optimizer allows the design of novel interpolated optimizers that improve the convergence bounds of either base optimizer up to constant factors, demonstrating the value of the parameterized optimizer formulation. We defer the proof and other technical details to Appendix A. We make the following standard assumptions: $F$ is bounded from below and $L$-smooth, and the stochastic gradients are bounded almost surely.

(Défossez et al., 2022) proposed a unified convergence analysis of Adagrad and Adam in the nonconvex setting. Since AVGrad can be seen as a version of Adagrad, the proof technique of (Défossez et al., 2022) can be adapted to AVGrad with some modifications. The modifications include the introduction of the ρ parameter, which interpolates between the second-order moment and the moment average and adds a degree of freedom; ρ can then be chosen to skew the updates towards the advantageous term under different scenarios. Here we state the convergence result for the interpolated optimizer without momentum, that is, when β1 = 0 (see the precise statement in Appendix A). The extension to the case with momentum can be done in the same way we extend Theorem 2 to Theorem 3, similar to (Défossez et al., 2022).

**Theorem 1** (Convergence of the interpolation of AVGrad and Adam, without momentum). Under the assumptions above, with $\alpha_t = \alpha/\sqrt{t}$ for some $\alpha > 0$ and $\rho_t = \rho/t$ for a constant $\rho \in [0, 1]$, we have

$$G_T \;\le\; C_1\,\frac{\sqrt{E(T)}}{T} \;+\; C_2\,\ln\!\big(E(T)\big)\,\frac{\sqrt{E(T)}}{T} \;+\; C_3\,\sqrt{E(T)}\,\ln\!\Big(\max\Big\{1, \frac{\rho}{\beta_2}\Big\}\Big), \tag{6}$$

where $G_T = \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\,\|\nabla F(x_t)\|^2$, $E(T) = \frac{\rho + (1-\rho)T}{1-\beta_2}$, and $C_1, C_2, C_3$ are constants independent of $T$, $\rho$, and $\beta_2$.

**Remark.** Aligned with the condition in Proposition 1, the contribution from $\tilde v_t$ in Theorem 1 is multiplied by $1/t$ in order to avoid divergent terms in the bound. Observe that when ρ = 0 or ρ = 1 we recover the convergence rates of AVGrad (see Theorem 2 in Appendix B) and Adam (Défossez et al., 2022), respectively. We also note that for the right-hand side to remain bounded, one needs either 0 ≤ ρ ≤ β2 (all terms vanish with T, as in AVGrad) or ρ = 1 (a constant term remains, as in Adam). To gain more insight, Figures 1 and 2 plot the value of the bound in Theorem 1 as a function of ρ for two representative cases. The first case, with larger T, is a more favorable setting for AVGrad, as all the terms vanish with T. The second is more favorable for Adam, since the vanishing terms vanish faster than for AVGrad. In both cases, the interpolated optimizer attains the best convergence bound at ρ = β2, which suggests an advantage for interpolated optimizers.

*Figure 1: Value of the right-hand side of (6) in Theorem 1 for a representative case with β2 = 0.9, T = 10,000.*

*Figure 2: Value of the right-hand side of (6) in Theorem 1 for a representative case with β2 = 0.9, T = 1,000.*
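The following short script gives a rough numerical illustration in the spirit of Figures 1 and 2, evaluating the right-hand side of (6) over a grid of ρ values. The constants C1 = C2 = C3 = 1 are placeholders chosen purely for illustration (the actual constants depend on quantities such as R, d, α, L, and ε from the analysis); only the qualitative shape, with a minimum near ρ = β2, is meant to carry over.

```python
import numpy as np

def bound_rhs(rho, beta2=0.9, T=10_000, C1=1.0, C2=1.0, C3=1.0):
    """Evaluate the right-hand side of (6) for a given rho (illustrative constants)."""
    E = (rho + (1 - rho) * T) / (1 - beta2)                  # E(T) from Theorem 1
    vanishing = C1 * np.sqrt(E) / T + C2 * np.log(E) * np.sqrt(E) / T
    # Term that is zero for rho <= beta2 and non-vanishing otherwise.
    non_vanishing = C3 * np.sqrt(E) * np.log(np.maximum(1.0, rho / beta2))
    return vanishing + non_vanishing

for T in (10_000, 1_000):                                    # the two cases of Figs. 1-2
    rhos = np.linspace(0.0, 1.0, 1001)
    values = bound_rhs(rhos, T=T)
    print(f"T = {T}: bound minimized near rho = {rhos[np.argmin(values)]:.2f}")
    # Both settings report a minimizer close to beta2 = 0.9.
```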
## 5. Experiments

In our experiments, we aim to answer how the generalization and downstream performance of MADA compare against fixed optimizers, and how robust MADA is to poor hyper-parameter initializations compared to fixed optimizers. We focus on the pre-training of models from scratch. We also try to gain insight into the behavior of MADA by monitoring the evolution of the optimizer coefficients. Additional details of our experimental setup and results can be found in Appendix C.

Table 2: Validation loss of MADA on OpenWebText vs. other adaptive optimizer baselines.

| Method | Validation Loss |
| --- | --- |
| Adam | 2.8956 |
| Adan | 2.8896 |
| HyperAdam | 2.8950 |
| Lion | 2.8892 |
| AVGrad | 2.8895 |
| MADA | 2.8806 |
| MADA-FS | 2.8766 |
| *Poor initialization* | |
| MADA | 2.8921 |
| Adam | 2.9157 |

### 5.1. A concrete parameterized optimizer

We use a concrete parameterization that is similar to the example presented in Section 2, with two changes. First, we replace AMSGrad with AVGrad, and second, we include Lion among the optimizers that we interpolate. Hence, the second-order moment in (3) becomes $v_t = \rho\,\tilde v_t + (1-\rho)\,\bar v_t$, as in the interpolation of Section 4. Moreover, the update term becomes $\gamma\,\frac{m_t}{\sqrt{v_t}+\epsilon} + (1-\gamma)\,\mathrm{sign}(u_t)$, where $u_t$ is the moment term from Lion (Chen et al., 2023). We refer the reader to Appendix C for a complete description of the parameterization. Lion was omitted from the example in Section 2 for the sake of simplicity, since it integrates less naturally with the rest of the optimizers.
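A small sketch of this mixed update direction is given below. The Lion bookkeeping (an interpolated moment $u_t$ whose sign is taken, plus Lion's own momentum buffer) follows the general form of Chen et al. (2023), but the function name, default constants, and exact placement of the Lion moment update are illustrative assumptions rather than our released code.

```python
import numpy as np

def mixed_update(x, m, v, m_lion, g, gamma, lr=1e-3, eps=1e-8,
                 lion_b1=0.9, lion_b2=0.99):
    """Mix the adaptive-moment direction of (1) with a Lion-style sign direction."""
    # u_t: Lion's interpolated moment, whose sign gives Lion's update direction.
    u = lion_b1 * m_lion + (1 - lion_b1) * g
    direction = gamma * m / (np.sqrt(v) + eps) + (1 - gamma) * np.sign(u)
    x = x - lr * direction
    m_lion = lion_b2 * m_lion + (1 - lion_b2) * g     # Lion's own momentum update
    return x, m_lion

# gamma = 1 recovers the generic update (1); gamma = 0 gives a Lion-style step.
```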
### 5.2. Experimental setting

**Data and models.** In recent years, auto-regressive models have been widely used for benchmarking and algorithm evaluation (Radford et al., 2019b; Liu et al., 2023). Motivated by this, we evaluate MADA on the causal language modeling task with GPT-2, over two datasets: Shakespeare (Karpathy, 2015) and OpenWebText (Gokaslan and Cohen, 2019). On the Shakespeare dataset, we train a 6-layer transformer with a context size of 256 (11M parameters) and fine-tune GPT-2 (XL) with 48 layers (1.5B parameters). On OpenWebText, we train GPT-2 (small) with 12 layers and a context size of 1024 (125M parameters), and GPT-2 (medium) with 24 layers and a context size of 1024 (335M parameters). Our code is available at https://github.com/amazon-science/mada_optimizer_search (the implementation builds on the nanoGPT code base, https://github.com/karpathy/nanoGPT).

**Baselines and learned optimizers.** On OpenWebText, we compare MADA to Adam (Kingma and Ba, 2015), HyperAdam (Chandra et al., 2022) (Adam with β1 and β2 learned through hyper-gradients), Adan (Xie et al., 2023), Lion (Chen et al., 2023), and AVGrad. For all methods we use decoupled weight decay (Loshchilov and Hutter, 2019), i.e., the AdamW variant of Adam. We keep the learning hyper-parameters such as the learning rate schedule and weight decay identical, except for Lion, where we choose a lower initial learning rate as suggested in (Liu et al., 2023). For a fair comparison, all optimizers use the same codebase based on (Chandra et al., 2022). For the OpenWebText experiments, we use established parameters for Adam (β1 = 0.9, β2 = 0.95, ε = 10⁻⁶), also use these values as the initial parameters for MADA, HyperAdam, and AVGrad, use the parameters suggested in the respective papers for the other methods, and measure the validation loss. For the Shakespeare experiments, we compare MADA to Adam and HyperAdam; the relatively small model size in this experiment allows us to sweep the grid of initial β parameters and compare the final training loss across many different hyper-parameter choices. In some experiments, we also evaluate the final state of MADA statically, i.e., we take the final optimizer learned by MADA and use it as the fixed optimizer from the beginning of training; we refer to this as MADA-FS.

**Tuning hyper-learning rates.** While hyper-learning rates are additional hyper-parameters to be tuned, the tasks are less sensitive to them than to other hyper-parameters such as the learning rate, most likely because each hyper-learning rate controls the update of a single parameter. As a result, hyper-learning rates can be fine-tuned easily and transferred across similar tasks. In general, for the hyper-learning rates of β1 and β2, since the corresponding gradients are relatively large, we search over a set of smaller values: {10⁻³, 5·10⁻⁴, 10⁻⁴}. For the other parameters, we search over {10⁻¹, 5·10⁻², 10⁻²}.

### 5.3. Results

In this section, we first present empirical results of training GPT-2 models on OpenWebText, and discuss the generalization performance and robustness of MADA. We also show how the interpolation coefficients evolve during training. Next, we provide a visualization of the optimal training loss for various initializations of the β1, β2 values on the smaller Shakespeare dataset. We present additional results in Appendix C, including a toy problem showing that MADA discovers optimal solutions that Adam cannot find, and a comparison of MADA and Adam for GPT-2 (medium).

**GPT-2 on OpenWebText.** When we initialize MADA from AVGrad and Adan, with β1,0 = 0.9, β2,0 = 0.95, β3,0 = 0.9, ρ0 = 0 (the subscript 0 denotes the time index), we observe from Table 2 that MADA consistently outperforms Adam and other recently proposed optimizers, including Lion, Adan, and HyperAdam. In particular, MADA-FS outperforms Adam with established parameters by 0.019 in validation loss, which is a significant improvement for this task. In Figures 3 and 4 we see that MADA converges to a lower loss than the baseline methods, in terms of both validation and training loss. Note that MADA converges faster than AVGrad and Adam with the same learning rate and schedule, which may be attributed to the theoretical result on interpolated optimizers (Theorem 1).

*Figure 3: Validation losses of competing methods on OpenWebText for the GPT-2 (125M) model, using the same random seed.*

*Figure 4: Training losses of competing methods on OpenWebText for the GPT-2 (125M) model, using the same random seed, smoothed for convenience.*

**Perplexity on benchmark datasets.** To measure the generalization of MADA to other datasets, we compute the validation perplexity of the trained models on three datasets, as shown in Table 3.
Table 3: Validation perplexities of competing methods on the OpenWebText, Wikitext, and Lambada datasets.

| Method | OpenWebText | Wikitext | Lambada |
| --- | --- | --- | --- |
| Adam | 18.0940 | 63.8544 | 77.3314 |
| Adan | 17.9863 | 63.5518 | 74.6970 |
| HyperAdam | 18.0843 | 61.7717 | 72.6803 |
| Lion | 17.9792 | 61.8661 | 75.3158 |
| AVGrad | 17.9840 | 64.2620 | 75.1317 |
| MADA | 17.8249 | 61.2513 | 74.2480 |
| MADA-FS | 17.7544 | 59.4086 | 73.5623 |
| *Poor initialization* | | | |
| MADA | 18.0317 | 57.1613 | 75.3550 |
| Adam | 18.4624 | 72.9017 | 79.1217 |

We see that on OpenWebText MADA outperforms Adam and HyperAdam by 0.34 and 0.33 points, respectively, while on Wikitext (Merity et al., 2016) MADA outperforms these two baselines by 4.45 and 2.37, respectively. On Lambada (Radford et al., 2019a) it is the second-best-performing method, with a small gap (0.88) behind HyperAdam.

**MADA-FS versus MADA.** We find that MADA-FS performs slightly better than MADA on OpenWebText (by 0.004 in validation loss, and by 0.07, 1.85, and 0.68 points in validation perplexity on OpenWebText, Wikitext, and Lambada, respectively). More details are in Tables 2 and 3. On the other hand, on Shakespeare we observe that the dynamically evolving optimizer generally performs better than the fixed final optimizer. We conjecture that the difference in the number of iterations, the model size, or the dataset may explain this phenomenon.

**Learning interpolation coefficients is crucial.** We also see that HyperAdam's performance is very close to Adam's (2.8950 vs. 2.8956), which suggests that the main performance improvement of MADA does not simply originate from tuning the β1, β2 parameters, but rather from adapting the interpolation coefficients in the optimizer space.

**Robustness to initial hyper-parameters.** Additionally, we compare MADA to Adam when both are initialized from the sub-optimal hyper-parameters β1,0 = 0.7, β2,0 = 0.8 (reported under "Poor initialization" in Tables 2 and 3). Notably, we observe that even sub-optimally initialized MADA is able to outperform Adam with the established initial parameters, as well as HyperAdam. This particular example indicates the robustness of MADA. We provide more examples with poor initial hyper-parameter choices on the Shakespeare dataset in the next subsections.

**Evolving hyper-parameters.** We monitor the evolution of the interpolation coefficients during training of the GPT-2 (125M) model on OpenWebText for various initializations of β1, β2 (Figures 5, 6, and 7).

*Figure 5: Parameter evolution for β1,0 = 0.9, β2,0 = 0.95, β3,0 = 0.*

*Figure 6: Parameter evolution for β1,0 = 0.7, β2,0 = 0.8, β3,0 = 0.*

*Figure 7: Parameter evolution for β1,0 = 0.9, β2,0 = 0.95, β3,0 = 0.9.*

First, we observe that sub-optimal initialization of the hyper-parameters (as in Figure 6) results in more significant updates of the coefficients. This suggests that the default optimizer state with β1,0 = 0.9, β2,0 = 0.95 is close to a local minimum. A closer inspection yields insight into how MADA accelerates training. First, in Figures 5 and 6 we see that β1 increases at the beginning, which facilitates taking larger steps, while towards the end it decreases to put more emphasis on individual gradients. Second, we observe that the interpolation factor of AVGrad, ρ, follows a three-phase pattern. This suggests that MADA puts more importance on the individual normalization term at the beginning and towards the end, and more emphasis on the less noisy averaging term during the intermediate stages. Another interesting behavior occurs when we initialize β3 (which governs the weight of Adan) from a higher value such as 0.9 (Figure 7): β1 reaches 0.99 and stays constant, in contrast with Figures 5 and 6.
Remarkably, MADA initialized with β3 = 0.9 automatically trains β1 to 0.99; i.e., MADA, when initialized from a combination of Adan and AVGrad, automatically finds the β1 value suggested by the authors of (Xie et al., 2023). Note that β1, β2, β3 in this work correspond to 1 − β1, 1 − β3, 1 − β2 in (Xie et al., 2023).

**Shakespeare dataset.** On the Shakespeare dataset we compare MADA to Adam and HyperAdam. We show that MADA yields a significant increase in performance across many initial β1, β2 values when training the 11M-parameter model. In particular, in Figure 8, we see that MADA performs better for the vast majority of initial β1 and β2 values, illustrating its robustness.

*Figure 8: Final training loss of Adam, MADA, and HyperAdam vs. initial β values on the Shakespeare dataset. MADA yields better performance for a wider choice of initial β values, illustrating its robustness.*

**Fine-tuning GPT-2 XL on the Shakespeare dataset.** We also fine-tune the pre-trained GPT-2 XL (1.5B parameters) model on Shakespeare, for approximately 2 epochs, using MADA, Adam, and HyperAdam. Table 4 shows the resulting training loss values, where MADA yields a large improvement. Notably, HyperAdam achieves no gain over Adam.

Table 4: Training losses after 2 epochs of fine-tuning on the Shakespeare dataset.

| Method | Training loss |
| --- | --- |
| MADA | 0.255 |
| Adam | 0.276 |
| HyperAdam | 0.278 |

**Vision tasks.** We also validated the performance of MADA on two computer vision models, a 5-layer CNN and ResNet-9, on the CIFAR-10 dataset. We observe that MADA shows a consistent improvement over our baselines (see the test accuracy results in Table 5).

Table 5: Test accuracy of competing methods for the CNN and ResNet models.

| Method | 5-layer CNN | ResNet-9 |
| --- | --- | --- |
| MADA | 66.12 ± 0.14 | 93.79 ± 0.11 |
| Adam | 65.84 ± 0.11 | 93.73 ± 0.10 |
| HyperAdam | 65.80 ± 0.21 | 93.69 ± 0.06 |
| Momentum SGD | 65.97 ± 0.94 | 92.60 ± 0.10 |

## 6. Conclusion

In this paper we propose an approach to unify a set of optimizers into a parameterized formulation. We further show how to dynamically learn the most suitable optimizer for a particular task during training using hyper-gradient descent. We employ MADA to train language models and observe that it consistently outperforms Adam and other fixed optimizers, both in terms of best validation performance and in terms of robustness to initial hyper-parameters.

**Limitations.** The main limitation of our framework is the additional computational requirement and memory usage of the parameterized optimizer due to the maintained optimizer states. The additional computation can be broken down into two components: additional per-parameter computational steps during the optimizer update, and the computation of hyper-gradients. The first component is negligible compared to the overall FLOPs requirement of a single training step in a language model. This is because, for a model with N parameters trained over a batch of T tokens, the model parameter update in the parameterized optimizer involves cN FLOPs (with c in the range of 10-20, depending on the parameterization), while the forward-backward passes require approximately 6TN FLOPs, with T on the order of thousands. To analyze the second component, consider the hyper-gradient example in (4), where the computation of the hyper-gradient with respect to ρ involves several vector-level element-wise operations (note that the multiplication of the first term with the second does not require a matrix-vector multiplication, since it can be implemented as a dot product with the numerator followed by element-wise division by the denominator), where each vector is of size N.
As a result, the FLOPs requirement is still O(N) (does not scale with T), and much smaller than the 6TN required for the forward-backward passes. The derivative with respect to the other coefficients can similarly be shown to have O(N) complexity. The additional memory usage arises from the fact that we need to maintain a larger optimizer state per parameter. To mitigate, one can employ sharded data parallelism techniques (Rajbhandari et al., 2020), which shards the optimizer state across a large number of data-parallel devices. Another approach is to analyze the contributions of the constituent optimizers to the learning performance, and prune the ones that have little contribution. In Appendix D, we provide profiling results showing the actual increase in iteration time and memory usage. Future work. Theoretically, we have shown the convergence of interpolation of AVGrad and Adam; as a future work we would like to construct a generic theoretical framework that can characterize the convergence of optimizer interpolations more generally. Empirically, we aim to gain a deeper understanding of the dynamics of the optimizer learning process in practice. MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent Impact Statement This paper presents work whose goal is to advance the field of Machine Learning and neural network optimization. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here. Diogo Almeida, Clemens Winter, Jie Tang, and Wojciech Zaremba. A generalizable approach to learning optimizers. ar Xiv preprint ar Xiv:2106.00958, 2021. Lu ıs B. Almeida, Thibault Langlois, Jose D. Amaral, and Alexander Plakhov. Parameter Adaptation in Stochastic Optimization, page 111 134. Publications of the Newton Institute. Cambridge University Press, 1999. doi: 10. 1017/CBO9780511569920.007. Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016. Atilim Gunes Baydin, Robert Cornish, David Martinez Rubio, Mark Schmidt, and Frank Wood. Online learning rate adaptation with hypergradient descent. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=Bkrs Az WAb. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877 1901, 2020. Kartik Chandra, Audrey Xie, Jonathan Ragan-Kelley, and Erik Meijer. Gradient descent: The ultimate optimizer. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https:// openreview.net/forum?id=-Qp-3L-5Zd I. Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks, 2020. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms, 2023. Alexandre D efossez, Leon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad. Transactions on Machine Learning Research, 2022. ISSN 2835-8856. 
URL https:// openreview.net/forum?id=ZPQhz TSWA7. Timothy Dozat. Incorporating nesterov momentum into adam. 2016. Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021. URL https:// openreview.net/forum?id=6Tm1mposlr M. Luca Franceschi, Michele Donini, Paolo Frasconi, and Massimiliano Pontil. Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning, pages 1165 1173. PMLR, 2017. Aaron Gokaslan and Vanya Cohen. Openwebtext corpus. http://Skylion007.github.io/ Open Web Text Corpus, 2019. Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, and Jung-Woo Ha. Adamp: Slowing down the slowdown for momentum optimizers on scale-invariant weights, 2021. Timothy Hospedales, Antreas Antoniou, Paul Micaelli, and Amos Storkey. Meta-learning in neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(9):5149 5169, 2021. Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks. http://karpathy.github. io/2015/05/21/rnn-effectiveness/, 2015. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann Le Cun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2023. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum? id=rkgz2a EKDr. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview. net/forum?id=Bkg6Ri Cq Y7. MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent Liangchen Luo, Yuanhao Xiong, and Yan Liu. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=Bkg3g2R9FX. Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning, pages 2113 2122. PMLR, 2015. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models, 2016. Luke Metz, Niru Maheswaranathan, C Daniel Freeman, Ben Poole, and Jascha Sohl-Dickstein. Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves. ar Xiv preprint ar Xiv:2009.11243, 2020. Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. ar Xiv preprint ar Xiv:2211.09760, 2022. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary De Vito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 
Language models are unsupervised multitask learners. 2019a. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. Open AI blog, 1(8):9, 2019b. Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1 16. IEEE, 2020. Esteban Real, Chen Liang, David So, and Quoc Le. Auto ML-zero: Evolving machine learning algorithms from scratch. In Hal Daum e III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8007 8019. PMLR, 13 18 Jul 2020. URL https://proceedings.mlr.press/ v119/real20a.html. Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum? id=ry Qu7f-RZ. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj orn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684 10695, 2022. Robin M Schmidt, Frank Schneider, and Philipp Hennig. Descending through a crowded valley - benchmarking deep learning optimizers. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 9367 9376. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr. press/v139/schmidt21a.html. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ee Lacroix, Baptiste Rozi ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. ar Xiv preprint ar Xiv:2302.13971, 2023. Olga Wichrowska, Niru Maheswaranathan, Matthew W Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando Freitas, and Jascha Sohl-Dickstein. Learned optimizers that scale and generalize. In International conference on machine learning, pages 3751 3760. PMLR, 2017. Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models, 2023. Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks, 2017. Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2020. URL https://openreview.net/ forum?id=Syx4wn Etv H. Manzil Zaheer, Sashank Reddi, Devendra Sachan, Satyen Kale, and Sanjiv Kumar. Adaptive methods for nonconvex optimization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3-Paper. pdf. MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Papademetris, and James Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. 
Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 18795 18806. Curran Associates, Inc., 2020. URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ d9d4f495e875a2e075a1a4a6e1b9770f-Paper. pdf. MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent A. Setup, Details, and Proof for Theorem 1 Assumptions. (i) F is bounded below by F , (ii) stochastic gradients, that is x Rd, f(x) R a.s. , (iii) the objective function is smooth, that is x, y Rd, F(x) F(y) 2 L x y 2. Setup. Our setup follows (D efossez et al., 2022). Let d N be the dimensionality of the model. Given a function h : Rd R we define ih as the i th component of the gradient. For the stochastic loss, we assume we have access to an oracle providing i.i.d samples (ft)t N. Et 1 is defined as the expectation conditioned upon observing the past stochastic function values: f1, . . . , ft 1. Unlike (D efossez et al., 2022), for simplicity of calculations, we define ϵt = ϵ/t where ϵ is a small constant for stability. Next we define adaptive moment updates in a generic way, and then specialize them to AVGrad. These updates are different from those in Section 2, but they are equivalent, and facilitates easier analysis. Adaptive updates. Similar to (D efossez et al., 2022) we define the following vectors iteratively. mt,i = β1mt 1,i + ift(xt 1) vt,i = β2vt 1,i + ift(xt 1)2 Note that ˆmt,i = (1 β1)mt,i and ˆvt,i = (1 β2)vt,i give the updates in Adam. (D efossez et al., 2022) uses the learning rate to absorb (1 β) terms. In particular, if one has an update xt,i = xt 1,i αt mt,i ϵ+vt,i ; Adam is recovered with αt = α 1 β1 1 β2 . AVGrad updates. Let αt = α t be the learning rate. AVGrad is defined via the following iterative steps v(sum) t,i = v(sum) t 1,i + vt,i = j=1 vj,i and v(avg) t,i = v(sum) t,i xt,i = xt 1,i αt mt,i q ϵt + v(avg) t,i = xt 1,i α mt,i q ϵ + v(sum) t,i Note that to obtain the AVGrad updates we also need to scale the learning rate with 1 β1 1 β2 ; but, for now, we assume it is already absorbed in the α for the sake of simplicity. We also drop the bias correction for simplicity as justified in (D efossez et al., 2022). A.1. Proof of Theorem 1 We multiply the normalization term of the update with t from the decaying learning rate; hence, there is a multiplying factor of t difference in definition of ψ in Theorem 1 and ψ in this proof. Let us start by defining ψt,i ψt,i = ρvt,i + (1 ρ)v(sum) t,i (7) We also define ψt,i: ψt,i = ρ vt,i + (1 ρ) v(sum) t,i = ρ(β2vt 1,i + Et 1 ift(xt 1)2) + (1 ρ)(v(sum) t 1,i + β2vt 1,i + Et 1 ift(xt 1)2) = (1 ρ)v(sum) t 1,i + β2vt 1,i + Et 1 ift(xt 1)2, (8) where the contribution of the last gradient is replaced by its expected value conditioned on the past iteration. Let us also define ut,i = ift(xt 1) q From the smoothness of the objective function F, we can use the Descent Lemma to obtain that F(xt) F(xt 1) α F(xt 1) ut + α2L MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent Taking the expectation of both sides conditioned on f1, . . . , ft 1, we have Et 1F(xt) F(xt 1) α X i F(xt 1) ift(xt 1) q 2 + Et 1[ ut 2]. (9) To upper bound the second term on the right hand side, we use the following lemma which generalizes Lemma 5.1 in (D efossez et al., 2022) to be used with interpolated optimizer and replaces vt with ψt,i as the normalization term. Lemma 1 (Updates approximately follow a descent direction). i F(xt 1) ift(xt 1) q ϵ + ψt,i 2REt 1 Proof. 
For the sake of simplicity, we define G = i F(xt 1), g = ift(xt 1), ψ = ψt,i, ψ = ψt,i. First obviously, we have Gg ϵ + ψ = Gg q ϵ + ψ + Gg ϵ + ψ Gg q ϵ + ψ | {z } A For the first term, we use the fact that Et 1[g] = G to obtain that ϵ + ψ . (12) In order to bound A, we begin with A = Gg ψ ψ ϵ + ψ q ϵ + ψ( ϵ + ψ + q ϵ + ψ) = Gg Et 1g2 g2 ϵ + ψ( ϵ + ψ + q where the last equality follows from the fact that ψ ψ = Et 1[g2] g2 (see (8) and the definition of ψt,i). Hence, from the triangle inequality we have that |A| |Gg| Et 1g2 ϵ + ψ(ϵ + ψ) | {z } A1 + |Gg| g2 q ϵ + ψ(ϵ + ψ) | {z } A2 where in the inequality follows from the fact that ϵ + ψ + q ϵ + ψ max( ϵ + ψ, q ϵ + ψ). A useful fact that will be used later is the following λ > 0, x, y R xy λx2 To bound A1, we use (13) with λ = ϵ+ ψ 2 , x = |G| ϵ+ ψ, y = |g| Et 1g2 ϵ+ ψ ϵ+ψ, which yields that ϵ + ψ + g2Et 1[g2]2 (ϵ + ψ)3/2(ϵ + ψ) . Taking the expectation and noting ϵ + ψ Et 1[g2] ensures that Et 1[A1] G2 ϵ + ψ + Et 1[g2] q ϵ + ψ + REt 1 MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent where the last inequality uses our boundedness assumption p Et 1[g2] R and the fact that q Similarly, for A2, we use (13) with λ = ϵ+ ψ 2Et 1g2 , x = |Gg| ϵ+ ψ, y = g2 ϵ+ψ, which gives us (we used similar bounding Et 1[A2] G2 ϵ + ψ + REt 1[g2] Combining both upper bounds we have, Et 1[|A|] G2 ϵ + ψ + 2REt 1[g2] which can be equivalently written as follows ϵ + ψ + 2REt 1[g2] Finally, combining further with (12) and substituting into (11) gives us, ϵ + ψ + 2REt 1[g2] ϵ + ψ 2REt 1[g2] which proves the desired result. Using Lemma 1 in (9) yields that Et 1[F(xt)] F(xt 1) α 1 β2 2R p ρ + (1 ρ)t F(xt 1) 2 2αREt 1[ ut 2] 2 Et 1[ ut 2]. Summing both sides for t = 1, . . . , T, and noting T implies that E[F(x T )] F(x0) α 1 β2 2R p t=0 F(xt) 2 + 2αR + α2L t=0 E[ ut 2]. Equivalently, we can write, α 1 β2 2R p t=0 F(xt) 2 F(x0) E[F(x T )] + 2αR + α2L t=0 E[ ut 2]. (14) Now, we would like to bound the last term on the right hand side. For this, we introduce the following lemma that generalizes Lemma 5.2 in (D efossez et al., 2022). Lemma 2. Define bt = ρ Pt j=1 βt j 2 aj + (1 ρ) Pt τ=1 Pτ j=1 βτ j 2 aj for t > 0 and b0 = 0, we have at ϵ + bt ln 1 + R2(ρ + (1 ρ)T) Proof. Let lt = Pt j=1 βt j 2 aj, rt = Pt τ=1 Pτ j=1 βτ j 2 aj, we have bt = ρlt + (1 ρ)rt. Since bt > at 0 we have that 1 z exp z with z = at ϵ+bt < 1. Hence, at ϵ + bt ln(ϵ + bt) ln(ϵ + bt at) MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent = ln(ϵ + bt) ln(ϵ + ρ(lt at) + (1 ρ)(rt at)) = ln(ϵ + bt) ln ϵ + ρβ2lt 1 + (1 ρ)rt 1 + (1 ρ)( j=1 βt j 2 aj at) = ln(ϵ + bt) ln ϵ + ρβ2lt 1 + (1 ρ)rt 1 + (1 ρ)β2 j=1 βt 1 j 2 aj | {z } lt 1 = ln(ϵ + bt) ln (ϵ + β2lt 1 + (1 ρ)rt 1) = ln ϵ + bt + ln ϵ + ρlt 1 + (1 ρ)rt 1 ϵ + β2lt 1 + (1 ρ)rt 1 + ln max{1, ρ = ln ϵ + bt where the first equality is due to definition of bt. Summing the inequality above for t = 1, 2, . . . , T, yields that (recall that b0 = 0) at ϵ + bt ln 1 + b T Also note that b T R2(ρ+(1 ρ)T ) Using Lemma 2 in (14), dividing both sides by T, and after some algebra we have t=1 F(xt) 2 2R p ρ + (1 ρ)T(F0 F ) ρ + (1 ρ)Td 1 β2T (2R + αL) ln 1 + R2(ρ + (1 ρ)T) which concludes the proof. B. Convergence of AVGrad Using the notation and assumptions in Section 3, we give the following convergence results for AVGrad. Theorem 2 (Convergence of AVGrad without momentum). 
Under the above assumptions and αt = α t for some α > 0, we have: t=1 F(xt) 2 2R(F0 F ) T 1 β2 + 2Rd T 1 β2 (2R + αL) ln 1 + R2T (1 β2)ϵ Theorem 3 (Convergence of AVGrad with momentum). Let τT {0, . . . , T 1} denote a random index such that j N, j < T, P[τ = j] 1 βT j 1 . Under the above assumptions and αt = α t for some α > 0, we have: E F(xτ) 2 2(1 β1)R T α 1 β2 T (F(x0) F ) + C Td T ln 1 + R2T (1 β2)ϵ where C = αRL 1 β2(1 β1) + 2β1α2L2 (1 β2)(1 β1)3 + 12R2 1 β1 and T = T β1 1 β1 . MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent Remark. Comparing our Theorems 2 and 3 to Theorems 3 and 4 in (D efossez et al., 2022), we observe that AVGrad does not have the non-vanishing, constant term that Adam has (see (D efossez et al., 2022), Theorem 2). Moreover, unlike the result for Adam, we do not require β2 > β1. Analogous to how momentum slows down the convergence bound of algorithms (by multiplicative factors) (D efossez et al., 2022) AVGrad slows down the convergence of Adagrad by multiplicative factors of (1 β2). B.1. Proof of Theorem 2 Proof of convergence of AVGrad can be obtained by replacing the interpolated second order moment term ψt, by v(sum) t in the previous section. In other words, setting ρ = 0 in the Proof of Theorem 1, going through the analysis will give the desired result t=1 F(xt) 2 2R(F0 F ) T 1 β2 + 2Rd 1 β2 T (2R + αL) ln 1 + TR2 B.2. Proof of Theorem 3 The idea in this proof is to essentially change gradient terms in the Descent Lemma with first order moments, as the difference in consequent model weights will now depend on the moment term rather than a gradient at some time point. The necessary changes in the lemmas closely follow the proof for Theorem 3,4 in (D efossez et al., 2022). We start by redefining some iterative vectors. mt,i = β1mt 1,i + ift(xt 1) vt,i = β2vt 1,i + ift(xt 1)2 xt,i = xt 1,i αt mt,i q ϵt + v(avg) t,i = xt 1,i α mt,i q ϵ + v(sum) t,i Let us further define Gt = F(xt 1), gt = ft(xt 1), ut,i = mt,i q ϵ+v(sum) t,i and Ut,i = gt,i q ϵ+v(sum) t,i . And also define: v(sum) t,k,i = v(sum) t k,i + Et k 1[ j=1 βτ j 2 g2 j,i] note that for j t k we have Et k 1[g2 j ] = g2 j , so v(sum) t,k,i essentially replaces the contribution of last k gradients with their expected values. From the smoothness of the objective function F, we have F(xt) F(xt 1) αG t ut + α2L Taking the expectations of both sides, E[F(xt)] E[F(xt 1)] α X i [d] E[G t,iut,i] + α2L 2 E[ ut 2 2]. (15) To bound the second term on the right hand side, we introduce the following approximate descent lemma, whose proof is provided at the end of section. Lemma 3. [Updates approximately follow a descent direction] Gt,i mt,i q ϵ + v(sum) t,i ϵ + v(sum) t,k+1,i 3R 1 β2 1 β1 k + 1βk 1E[ Ut k 2] MADA: Meta-Adaptive Optimizers Through Hyper-Gradient Descent l=1 ut l 2 t 1 X Using Lemma 3 in (15) for the second term on right hand side, yields that E[F(xt)] E[F(xt 1)] α ϵ + v(sum) t,k+1,i + 3αR 1 β2 1 β1 k + 1βk 1E[ Ut k 2 2] l=1 ut l 2 2 2 E[ ut 2 2]. (16) Let us define Ωt := q Pt τ=1 Pτ j=1 βτ j 2 , hence, from our boundedness assumption, q ϵ + v(sum) t,k+1,i R q Pt τ=1 Pτ j=1 βτ j 2 = RΩt, inserting in (16) implies that E[F(xt)] E[F(xt 1)] α 2RΩt k=0 βk 1E G2 t k,i ! 2 E ut 2 2 + 3αR 1 β2 1 β1 k + 1βk 1E Ut k 2 2 l=1 ut l 2 2 Summing for t = 1, . . . 
B.1. Proof of Theorem 2

The proof of convergence of AVGrad can be obtained by replacing the interpolated second-order moment term $\psi_t$ by $v^{(\mathrm{sum})}_t$ in the previous section. In other words, setting $\rho=0$ in the proof of Theorem 1 and going through the same analysis gives the desired result
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big[\|\nabla F(x_t)\|^2\big] \le \frac{2R\,(F_0-F_*)}{\alpha\sqrt{(1-\beta_2)T}} + \frac{2Rd}{\sqrt{(1-\beta_2)T}}\,(2R+\alpha L)\,\ln\!\Big(1+\frac{TR^2}{(1-\beta_2)\epsilon}\Big).$$

B.2. Proof of Theorem 3

The idea in this proof is essentially to replace the gradient terms in the descent lemma with first-order moments, since the difference between consecutive iterates now depends on the moment term rather than on the gradient at a single time point. The necessary changes in the lemmas closely follow the proofs of Theorems 3 and 4 in (Défossez et al., 2022). We start by redefining some iterative vectors:
$$m_{t,i} = \beta_1 m_{t-1,i} + \nabla_i f_t(x_{t-1}), \qquad v_{t,i} = \beta_2 v_{t-1,i} + \nabla_i f_t(x_{t-1})^2,$$
$$x_{t,i} = x_{t-1,i} - \alpha_t\,\frac{m_{t,i}}{\sqrt{\epsilon_t + v^{(\mathrm{avg})}_{t,i}}} = x_{t-1,i} - \alpha\,\frac{m_{t,i}}{\sqrt{\epsilon + v^{(\mathrm{sum})}_{t,i}}}.$$
Let us further define $G_t = \nabla F(x_{t-1})$, $g_t = \nabla f_t(x_{t-1})$, $u_{t,i} = \frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}$ and $U_{t,i} = \frac{g_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}$. We also define
$$\tilde{v}^{(\mathrm{sum})}_{t,k,i} = v^{(\mathrm{sum})}_{t-k,i} + \mathbb{E}_{t-k-1}\Big[\sum_{\tau=t-k+1}^{t}\sum_{j=1}^{\tau}\beta_2^{\tau-j}g_{j,i}^2\Big];$$
note that for $j \le t-k$ we have $\mathbb{E}_{t-k-1}[g_j^2] = g_j^2$, so $\tilde{v}^{(\mathrm{sum})}_{t,k,i}$ essentially replaces the contribution of the last $k$ gradients with their expected values.

From the smoothness of the objective function $F$, we have
$$F(x_t) \le F(x_{t-1}) - \alpha\,G_t^\top u_t + \frac{\alpha^2L}{2}\,\|u_t\|^2.$$
Taking the expectations of both sides,
$$\mathbb{E}[F(x_t)] \le \mathbb{E}[F(x_{t-1})] - \alpha\sum_{i\in[d]}\mathbb{E}[G_{t,i}u_{t,i}] + \frac{\alpha^2L}{2}\,\mathbb{E}\big[\|u_t\|_2^2\big]. \qquad (15)$$
To bound the second term on the right-hand side, we introduce the following approximate descent lemma, whose proof is provided at the end of the section.

Lemma 3. [Updates approximately follow a descent direction]
$$\mathbb{E}\Big[\sum_{i\in[d]} G_{t,i}\,\frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}\Big] \ge \frac{1}{2}\sum_{k=0}^{t-1}\beta_1^k\,\mathbb{E}\Big[\sum_{i\in[d]}\frac{G_{t-k,i}^2}{\sqrt{\epsilon+\tilde{v}^{(\mathrm{sum})}_{t,k+1,i}}}\Big] - \frac{3R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^k\,\mathbb{E}\big[\|U_{t-k}\|^2\big] - \frac{(1-\beta_1)\,\alpha^2L^2}{4R\sqrt{1-\beta_2}}\sum_{k=0}^{t-1}\beta_1^k\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|^2\big].$$

Using Lemma 3 in (15) for the second term on the right-hand side yields
$$\mathbb{E}[F(x_t)] \le \mathbb{E}[F(x_{t-1})] - \frac{\alpha}{2}\sum_{k=0}^{t-1}\beta_1^k\,\mathbb{E}\Big[\sum_{i\in[d]}\frac{G_{t-k,i}^2}{\sqrt{\epsilon+\tilde{v}^{(\mathrm{sum})}_{t,k+1,i}}}\Big] + \frac{3\alpha R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^k\,\mathbb{E}\big[\|U_{t-k}\|_2^2\big] + \frac{(1-\beta_1)\,\alpha^3L^2}{4R\sqrt{1-\beta_2}}\sum_{k=0}^{t-1}\beta_1^k\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|_2^2\big] + \frac{\alpha^2L}{2}\,\mathbb{E}\big[\|u_t\|_2^2\big]. \qquad (16)$$
Let us define $\Omega_t := \sqrt{\sum_{\tau=1}^{t}\sum_{j=1}^{\tau}\beta_2^{\tau-j}}$; hence, from our boundedness assumption, $\sqrt{\epsilon+\tilde{v}^{(\mathrm{sum})}_{t,k+1,i}} \le R\,\Omega_t$. Inserting this in (16) implies that
$$\mathbb{E}[F(x_t)] \le \mathbb{E}[F(x_{t-1})] - \frac{\alpha}{2R\,\Omega_t}\sum_{k=0}^{t-1}\beta_1^k\,\mathbb{E}\Big[\sum_{i\in[d]}G_{t-k,i}^2\Big] + \frac{\alpha^2L}{2}\,\mathbb{E}\big[\|u_t\|_2^2\big] + \frac{3\alpha R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^k\,\mathbb{E}\big[\|U_{t-k}\|_2^2\big] + \frac{(1-\beta_1)\,\alpha^3L^2}{4R\sqrt{1-\beta_2}}\sum_{k=0}^{t-1}\beta_1^k\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|_2^2\big].$$
Summing for $t=1,\dots,T$ results in
$$\underbrace{\frac{\alpha}{2R}\sum_{t=1}^{T}\frac{1}{\Omega_t}\sum_{k=0}^{t-1}\beta_1^{k}\,\mathbb{E}\big[\|G_{t-k}\|_2^2\big]}_{A} \le F(x_0)-F_* + \underbrace{\frac{\alpha^2L}{2}\sum_{t=1}^{T}\mathbb{E}\big[\|u_t\|^2\big]}_{B} + \underbrace{\frac{(1-\beta_1)\,\alpha^3L^2}{4R\sqrt{1-\beta_2}}\sum_{t=1}^{T}\sum_{k=0}^{t-1}\beta_1^{k}\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|^2\big]}_{C} + \underbrace{\frac{3\alpha R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{t=1}^{T}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^{k}\,\mathbb{E}\big[\|U_{t-k}\|^2\big]}_{D}. \qquad (17)$$
We will examine each term separately. For B, we state the following lemma (whose proof is given at the end of the section) to bound the sum of squared norm terms.

Lemma 4. Assume $0\le\beta_1<1$, $0\le\beta_2<1$, and that we have a sequence of real numbers $(a_t)_{t\in[T]}$. Let $c_t = \sum_{\tau=1}^{t}\beta_1^{t-\tau}a_\tau$ and $b_t = \sum_{\tau=1}^{t}\sum_{j=1}^{\tau}\beta_2^{\tau-j}a_j^2$. Then,
$$\sum_{t=1}^{T}\frac{c_t^2}{\epsilon+b_t} \le \frac{1}{(1-\beta_1)^2}\ln\!\Big(1+\frac{b_T}{\epsilon}\Big).$$

Lemma 4 implies that
$$\frac{\alpha^2L}{2}\sum_{t=1}^{T}\mathbb{E}\big[\|u_t\|^2\big] \le \frac{\alpha^2L}{2(1-\beta_1)^2}\sum_{i\in[d]}\ln\!\Big(1+\frac{v^{(\mathrm{sum})}_{T,i}}{\epsilon}\Big) \le \frac{\alpha^2L}{2(1-\beta_1)^2}\sum_{i\in[d]}\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big). \qquad (18)$$
Before moving on with the other terms, we state some useful facts that are proven in (Défossez et al., 2022).

Fact 1. Given $0<a<1$ and $Q\in\mathbb{N}$, we have $\sum_{q=0}^{Q}a^q\,q \le \frac{a}{(1-a)^2}$.

Fact 2. Given $0<a<1$ and $Q\in\mathbb{N}$, we have $\sum_{q=0}^{Q}a^q\sqrt{q+1} \le \frac{2}{(1-a)^{3/2}}$.

Fact 3. Given $0<a<1$ and $Q\in\mathbb{N}$, we have $\sum_{q=0}^{Q}a^q\,q\sqrt{q+1} \le \frac{4a}{(1-a)^{5/2}}$.

Now we move on to examining the other terms in (17). For C, we make the change of index $j=t-l$, which yields
$$\sum_{t=1}^{T}\sum_{k=0}^{t-1}\beta_1^{k}\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|^2\big] \le \sum_{j=1}^{T}\mathbb{E}\big[\|u_j\|^2\big]\sum_{q\ge0}\beta_1^{q}\,q\sqrt{q+1} \overset{(a)}{\le} \frac{4\beta_1}{(1-\beta_1)^{5/2}}\sum_{j=1}^{T}\mathbb{E}\big[\|u_j\|^2\big] \overset{(b)}{\le} \frac{4\beta_1}{(1-\beta_1)^{9/2}}\sum_{i\in[d]}\ln\!\Big(1+\frac{v^{(\mathrm{sum})}_{T,i}}{\epsilon}\Big) \overset{(c)}{\le} \frac{4\beta_1}{(1-\beta_1)^{9/2}}\sum_{i\in[d]}\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big), \qquad (19)$$
where in (a) we use Fact 3, in (b) we use Lemma 4, and in (c) the definition of $v^{(\mathrm{sum})}_{T,i}$.

For D, we make the change of index $j=t-k$ and obtain
$$\sum_{t=1}^{T}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^{k}\,\mathbb{E}\big[\|U_{t-k}\|^2\big] = \sum_{j=1}^{T}\mathbb{E}\big[\|U_j\|^2\big]\sum_{t\ge j}\sqrt{t-j+1}\,\beta_1^{t-j} \overset{(a)}{\le} \frac{2}{(1-\beta_1)^{3/2}}\sum_{j=1}^{T}\mathbb{E}\big[\|U_j\|^2\big] \overset{(b)}{\le} \frac{2}{(1-\beta_1)^{3/2}}\sum_{i\in[d]}\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big),$$
so that
$$D \le \frac{6\alpha R\sqrt{1-\beta_2}}{(1-\beta_1)^{5/2}}\sum_{i\in[d]}\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big), \qquad (21)$$
where in (a) we use Fact 2 and in (b) we use Lemma 2 (with $\rho=0$).

For A, first note that
$$\Omega_t \le \sqrt{\frac{t}{1-\beta_2}} \le \sqrt{\frac{T}{1-\beta_2}}. \qquad (23)$$
Let us change the index $j=t-k$ and use (23):
$$A \ge \frac{\alpha\sqrt{1-\beta_2}}{2R\sqrt{T}}\sum_{t=1}^{T}\sum_{j=1}^{t}\beta_1^{t-j}\,\mathbb{E}\big[\|G_j\|^2\big] = \frac{\alpha\sqrt{1-\beta_2}}{2(1-\beta_1)R\sqrt{T}}\sum_{j=1}^{T}\big(1-\beta_1^{T-j+1}\big)\,\mathbb{E}\big[\|G_j\|^2\big] = \frac{\alpha\sqrt{1-\beta_2}}{2(1-\beta_1)R\sqrt{T}}\sum_{j=0}^{T-1}\big(1-\beta_1^{T-j}\big)\,\mathbb{E}\big[\|\nabla F(x_j)\|^2\big].$$
Note that $\sum_{j=0}^{T-1}(1-\beta_1^{T-j}) = T - \beta_1\frac{1-\beta_1^{T}}{1-\beta_1} \ge T - \frac{\beta_1}{1-\beta_1} = \tilde{T}$, and let $\tau\in\{0,\dots,T-1\}$ with $\mathbb{P}[\tau=j]\propto 1-\beta_1^{T-j}$; then we have
$$A \ge \frac{\alpha\sqrt{1-\beta_2}}{2(1-\beta_1)R\sqrt{T}}\,\tilde{T}\,\mathbb{E}\big[\|\nabla F(x_\tau)\|^2\big]. \qquad (24)$$
Inserting (18), (19), (21), (24) in (17) and rearranging, after some algebra we have
$$\mathbb{E}\big[\|\nabla F(x_\tau)\|^2\big] \le \frac{2(1-\beta_1)R\sqrt{T}}{\alpha\sqrt{1-\beta_2}\,\tilde{T}}\,\big(F(x_0)-F_*\big) + \frac{\alpha R L\,\sqrt{T}d}{\sqrt{1-\beta_2}(1-\beta_1)\,\tilde{T}}\,\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big) + \frac{2\beta_1\alpha^2L^2\,\sqrt{T}d}{(1-\beta_2)(1-\beta_1)^3\,\tilde{T}}\,\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big) + \frac{12R^2\,\sqrt{T}d}{(1-\beta_1)\,\tilde{T}}\,\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big).$$
Equivalently,
$$\mathbb{E}\big[\|\nabla F(x_\tau)\|^2\big] \le \frac{2(1-\beta_1)R\sqrt{T}}{\alpha\sqrt{1-\beta_2}\,\tilde{T}}\,\big(F(x_0)-F_*\big) + C\,\frac{\sqrt{T}\,d}{\tilde{T}}\,\ln\!\Big(1+\frac{R^2T}{(1-\beta_2)\epsilon}\Big),$$
where $C = \frac{\alpha R L}{\sqrt{1-\beta_2}\,(1-\beta_1)} + \frac{2\beta_1\alpha^2L^2}{(1-\beta_2)(1-\beta_1)^3} + \frac{12R^2}{1-\beta_1}$, which concludes the proof.

Proof of Lemma 3. We start by separating the main term into two:
$$\sum_{i\in[d]} G_{t,i}\,\frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}} = \sum_{i\in[d]}\sum_{k=0}^{t-1} G_{t,i}\,\beta_1^{k}\,\frac{g_{t-k,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}} \qquad (25)$$
$$= \underbrace{\sum_{i\in[d]}\sum_{k=0}^{t-1} G_{t-k,i}\,\beta_1^{k}\,\frac{g_{t-k,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}}_{A} + \underbrace{\sum_{i\in[d]}\sum_{k=0}^{t-1} (G_{t,i}-G_{t-k,i})\,\beta_1^{k}\,\frac{g_{t-k,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}}_{B}.$$
For B, we again utilize the fact (13) that for all $\lambda>0$ and $x,y\in\mathbb{R}$, $xy \le \frac{\lambda x^2}{2}+\frac{y^2}{2\lambda}$, for each dimension and each $k$, with $\lambda = \frac{1-\beta_1}{2R\sqrt{(1-\beta_2)(k+1)}}$, $x = |G_{t,i}-G_{t-k,i}|$ and $y = \frac{|g_{t-k,i}|}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}$. Then,
$$|G_{t,i}-G_{t-k,i}|\,\frac{|g_{t-k,i}|}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}} \le \frac{1-\beta_1}{4R\sqrt{(1-\beta_2)(k+1)}}\,(G_{t,i}-G_{t-k,i})^2 + \frac{R\sqrt{(1-\beta_2)(k+1)}}{1-\beta_1}\,\frac{g_{t-k,i}^2}{\epsilon+v^{(\mathrm{sum})}_{t,i}}.$$
Note that $\epsilon+v^{(\mathrm{sum})}_{t,i} \ge \epsilon+v^{(\mathrm{sum})}_{t-k,i}$; hence,
$$\frac{g_{t-k,i}^2}{\epsilon+v^{(\mathrm{sum})}_{t,i}} \le \frac{g_{t-k,i}^2}{\epsilon+v^{(\mathrm{sum})}_{t-k,i}} = U_{t-k,i}^2.$$
Moreover, the smoothness of the objective function implies
$$\|G_t - G_{t-k}\|^2 \le L^2\,\|x_{t-1}-x_{t-k-1}\|^2 = L^2\,\Big\|\sum_{l=1}^{k}\alpha\,u_{t-l}\Big\|^2 \le \alpha^2L^2\,k\sum_{l=1}^{k}\|u_{t-l}\|^2.$$
As a result, using $\frac{k}{\sqrt{k+1}}\le\sqrt{k}$,
$$|B| \le \frac{(1-\beta_1)\,\alpha^2L^2}{4R\sqrt{1-\beta_2}}\sum_{k=0}^{t-1}\beta_1^{k}\sqrt{k}\sum_{l=1}^{k}\|u_{t-l}\|^2 + \frac{R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^{k}\,\|U_{t-k}\|^2. \qquad (27)$$
For term A, we focus on the main term of the summation, $\mathbb{E}\big[G_{t-k,i}\,\frac{g_{t-k,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}\big]$. Let $G = G_{t-k,i}$, $g = g_{t-k,i}$, $\tilde v = \tilde v^{(\mathrm{sum})}_{t,k+1,i}$, $v = v^{(\mathrm{sum})}_{t,i}$. Note that
$$\tilde v - v = \underbrace{\mathbb{E}_{t-k-1}\Big[\sum_{\tau=t-k}^{t}\sum_{j=t-k}^{\tau}\beta_2^{\tau-j}g_{j,i}^2\Big]}_{A_1^2} - \underbrace{\sum_{\tau=t-k}^{t}\sum_{j=t-k}^{\tau}\beta_2^{\tau-j}g_{j,i}^2}_{A_2^2},$$
since the contributions of the gradients with index $j\le t-k-1$ are identical in $\tilde v$ and $v$. We continue similarly to Lemma 1:
$$\frac{Gg}{\sqrt{\epsilon+v}} = \frac{Gg}{\sqrt{\epsilon+\tilde v}} + \underbrace{Gg\,\frac{A_1^2-A_2^2}{\sqrt{\epsilon+v}\,\sqrt{\epsilon+\tilde v}\,\big(\sqrt{\epsilon+v}+\sqrt{\epsilon+\tilde v}\big)}}_{C},$$
where, since $\tilde v$ is measurable at time $t-k-1$, we have $\mathbb{E}_{t-k-1}\big[\frac{Gg}{\sqrt{\epsilon+\tilde v}}\big] = \frac{G^2}{\sqrt{\epsilon+\tilde v}}$. By the triangle inequality,
$$|C| \le \underbrace{\frac{|Gg|\,A_1^2}{(\epsilon+\tilde v)\sqrt{\epsilon+v}}}_{C_1} + \underbrace{\frac{|Gg|\,A_2^2}{(\epsilon+v)\sqrt{\epsilon+\tilde v}}}_{C_2}.$$
We will again utilize (13). For $C_1$, let $\lambda = \frac{(1-\beta_1)\sqrt{\epsilon+\tilde v}}{2}$, $x = \frac{|G|}{\sqrt{\epsilon+\tilde v}}$, $y = \frac{|g|\,A_1^2}{\sqrt{\epsilon+\tilde v}\sqrt{\epsilon+v}}$. Then,
$$C_1 \le \frac{(1-\beta_1)\,G^2}{4\sqrt{\epsilon+\tilde v}} + \frac{1}{1-\beta_1}\,\frac{g^2A_1^4}{(\epsilon+\tilde v)^{3/2}(\epsilon+v)}. \qquad (29)$$
Given $\epsilon+\tilde v \ge A_1^2$ and taking the expectation,
$$\mathbb{E}_{t-k-1}[C_1] \le \frac{(1-\beta_1)\,G^2}{4\sqrt{\epsilon+\tilde v}} + \frac{1}{1-\beta_1}\,\frac{A_1^2}{\sqrt{\epsilon+\tilde v}}\,\mathbb{E}_{t-k-1}\Big[\frac{g^2}{\epsilon+v}\Big].$$
Similarly, for $C_2$, we let $\lambda = \frac{(1-\beta_1)\sqrt{\epsilon+\tilde v}}{2A_1^2}$, $x = \frac{|G|\,A_2}{\sqrt{\epsilon+\tilde v}}$, $y = \frac{A_2\,|g|}{\epsilon+v}$. Then,
$$C_2 \le \frac{(1-\beta_1)\,G^2}{4\sqrt{\epsilon+\tilde v}}\,\frac{A_2^2}{A_1^2} + \frac{1}{1-\beta_1}\,\frac{A_1^2}{\sqrt{\epsilon+\tilde v}}\,\frac{g^2A_2^2}{(\epsilon+v)^2}.$$
Using $\epsilon+v \ge A_2^2$ and $\mathbb{E}_{t-k-1}\big[\frac{A_2^2}{A_1^2}\big]=1$,
$$\mathbb{E}_{t-k-1}[C_2] \le \frac{(1-\beta_1)\,G^2}{4\sqrt{\epsilon+\tilde v}} + \frac{1}{1-\beta_1}\,\frac{A_1^2}{\sqrt{\epsilon+\tilde v}}\,\mathbb{E}_{t-k-1}\Big[\frac{g^2}{\epsilon+v}\Big],$$
so that
$$\mathbb{E}_{t-k-1}\big[|C|\big] \le \frac{(1-\beta_1)\,G^2}{2\sqrt{\epsilon+\tilde v}} + \frac{2}{1-\beta_1}\,\frac{A_1^2}{\sqrt{\epsilon+\tilde v}}\,\mathbb{E}_{t-k-1}\Big[\frac{g^2}{\epsilon+v}\Big].$$
Using $A_1 \le \sqrt{\epsilon+\tilde v}$ and the boundedness assumption, we obtain
$$\mathbb{E}_{t-k-1}\big[|C|\big] \le \frac{(1-\beta_1)\,G^2}{2\sqrt{\epsilon+\tilde v}} + \frac{2R\sqrt{(1-\beta_2)(k+1)}}{1-\beta_1}\,\mathbb{E}_{t-k-1}\Big[\frac{g^2}{\epsilon+v}\Big].$$
Taking the complete expectation, using $\epsilon+v^{(\mathrm{sum})}_{t,i}\ge\epsilon+v^{(\mathrm{sum})}_{t-k,i}$ and $1-\frac{1-\beta_1}{2}\ge\frac12$, and reintroducing the indices,
$$\mathbb{E}\Big[G_{t-k,i}\,\frac{g_{t-k,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}\Big] \ge \mathbb{E}\Big[\frac{G_{t-k,i}^2}{2\sqrt{\epsilon+\tilde v^{(\mathrm{sum})}_{t,k+1,i}}}\Big] - \frac{2R\sqrt{(1-\beta_2)(k+1)}}{1-\beta_1}\,\mathbb{E}\Big[\frac{g_{t-k,i}^2}{\epsilon+v^{(\mathrm{sum})}_{t-k,i}}\Big]. \qquad (30)$$
Then, inserting (30) and summing over $i\in[d]$ with the weights $\beta_1^k$, we have
$$\mathbb{E}[A] \ge \frac{1}{2}\sum_{k=0}^{t-1}\beta_1^{k}\,\mathbb{E}\Big[\sum_{i\in[d]}\frac{G_{t-k,i}^2}{\sqrt{\epsilon+\tilde v^{(\mathrm{sum})}_{t,k+1,i}}}\Big] - \frac{2R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^{k}\,\mathbb{E}\big[\|U_{t-k}\|^2\big]. \qquad (31)$$
Note that in (25) we have
$$\sum_{i\in[d]}G_{t,i}\,\frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}} = A + B \ge A - |B|;$$
taking the expectation of both sides yields
$$\mathbb{E}\Big[\sum_{i\in[d]}G_{t,i}\,\frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}\Big] \ge \mathbb{E}[A] - \mathbb{E}\big[|B|\big]. \qquad (32)$$
Inserting (31) and the negated (27) into (32) results in
$$\mathbb{E}\Big[\sum_{i\in[d]}G_{t,i}\,\frac{m_{t,i}}{\sqrt{\epsilon+v^{(\mathrm{sum})}_{t,i}}}\Big] \ge \frac{1}{2}\sum_{k=0}^{t-1}\beta_1^{k}\,\mathbb{E}\Big[\sum_{i\in[d]}\frac{G_{t-k,i}^2}{\sqrt{\epsilon+\tilde v^{(\mathrm{sum})}_{t,k+1,i}}}\Big] - \frac{3R\sqrt{1-\beta_2}}{1-\beta_1}\sum_{k=0}^{t-1}\sqrt{k+1}\,\beta_1^{k}\,\mathbb{E}\big[\|U_{t-k}\|^2\big] - \frac{(1-\beta_1)\,\alpha^2L^2}{4R\sqrt{1-\beta_2}}\sum_{k=0}^{t-1}\beta_1^{k}\sqrt{k}\sum_{l=1}^{k}\mathbb{E}\big[\|u_{t-l}\|^2\big],$$
which completes the proof.

Proof of Lemma 4. For some $t$,
$$\frac{c_t^2}{\epsilon+b_t} \le \frac{1}{1-\beta_1}\sum_{\tau=1}^{t}\beta_1^{t-\tau}\,\frac{a_\tau^2}{\epsilon+b_t}.$$
We have $\epsilon+b_t \ge \epsilon+b_\tau$ for any $\tau\le t$; then,
$$\frac{c_t^2}{\epsilon+b_t} \le \frac{1}{1-\beta_1}\sum_{\tau=1}^{t}\beta_1^{t-\tau}\,\frac{a_\tau^2}{\epsilon+b_\tau}.$$
Summing over $t$ and exchanging the order of summation,
$$\sum_{t=1}^{T}\frac{c_t^2}{\epsilon+b_t} \le \frac{1}{1-\beta_1}\sum_{\tau=1}^{T}\frac{a_\tau^2}{\epsilon+b_\tau}\sum_{t\ge\tau}\beta_1^{t-\tau} \le \frac{1}{(1-\beta_1)^2}\sum_{\tau=1}^{T}\frac{a_\tau^2}{\epsilon+b_\tau} \le \frac{1}{(1-\beta_1)^2}\ln\!\Big(1+\frac{b_T}{\epsilon}\Big),$$
where in the last inequality we apply Lemma 2 with $\rho=0$. Note that from the definition of $b_T$ we also have $b_T \le \frac{R^2T}{1-\beta_2}$.

C. Additional Experiments and Details

C.1. Overall Parameterization in the Experiments

For the first-order moment term we have
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad n_t = \beta_3 n_{t-1} + (1-\beta_3)\,(g_t - g_{t-1}), \qquad u_t = \mathrm{Lion}(g_t, m^{\mathrm{lion}}_t),$$
where $\beta_1$ controls the strength of heavy-ball momentum and $\beta_3$ extends the equation to Nesterov momentum. The first-order moment equations essentially capture the updates of Adan and Adam. For the second-order moment term we have
$$\hat g_t = g_t + \beta_3\,(g_t-g_{t-1}), \qquad \tilde g_t^2 = c_{t-1}\,\hat g_t^2 + (1-c_{t-1})\big(v_{t-1} + \hat g_t^2\,\mathrm{sign}(\hat g_t^2 - v_{t-1})\big),$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,\tilde g_t^2, \qquad \bar v_t = \frac{v_t + (t-1)\,\bar v_{t-1}}{t}, \qquad \hat v_t = \rho\,v_t + (1-\rho)\,\bar v_t,$$
where $c$ controls whether the difference of consecutive $v_t$'s grows with $\hat g_t^2\,\mathrm{sign}(\hat g_t^2 - v_{t-1})$, as in Yogi, or with $(\hat g_t^2 - v_{t-1})$, as in Adam; the former prevents rapid increases in the effective learning rate and provides more controlled updates. For the second-order moment term, $\rho_t$ determines whether to use the less noisy average of past $v_t$'s (we call this method AVGrad and defer its formal introduction to the next section) or the current $v_t$, which may be noisier but more up to date. Lastly, the update term together with the decoupled weight decay is as follows:
$$x_{t-1} \leftarrow x_{t-1} - \lambda\,\alpha_t\,x_{t-1}, \qquad x_t = x_{t-1} - \alpha_t\Big(\gamma\,\frac{m_t+\beta_3 n_t}{\sqrt{\hat v_t}+\epsilon} + (1-\gamma)\,\mathrm{sign}(u_t)\Big),$$
where $\epsilon$ is the stability parameter, $\mathrm{sign}(u_t)$ is the update term of the Lion optimizer (see Table 1), and $\gamma$ controls whether to perform a Lion-type update or not.
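For concreteness, the parameterization above can be collected into a single update step. The following is a minimal, illustrative PyTorch-style sketch (our own, not the authors' code): the tensor and dictionary names are assumptions, the Lion momentum is assumed to reuse $\beta_1,\beta_2$ for brevity, and the hyper-gradient updates of the coefficients themselves are omitted.

```python
import torch

def mada_step(p, grad, prev_grad, state, hp, lr, wd, t, eps=1e-8):
    """One MADA-style parameterized update for a single tensor p (illustrative sketch)."""
    beta1, beta2, beta3 = hp["beta1"], hp["beta2"], hp["beta3"]
    c, rho, gamma = hp["c"], hp["rho"], hp["gamma"]
    m, n, v, v_bar, m_lion = state["m"], state["n"], state["v"], state["v_bar"], state["m_lion"]

    # First-order moments: heavy-ball momentum (Adam) and gradient-difference term (Adan).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    n.mul_(beta3).add_(grad - prev_grad, alpha=1 - beta3)

    # Lion-style direction u_t (assumed form: interpolation of Lion momentum and current gradient).
    u = beta1 * m_lion + (1 - beta1) * grad
    m_lion.mul_(beta2).add_(grad, alpha=1 - beta2)

    # Second-order moment: c interpolates between an Adam-style and a Yogi-style increment.
    g_hat = grad + beta3 * (grad - prev_grad)
    g2_adam = g_hat**2
    g2_yogi = v + g_hat**2 * torch.sign(g_hat**2 - v)
    g2 = c * g2_adam + (1 - c) * g2_yogi
    v.mul_(beta2).add_(g2, alpha=1 - beta2)

    # rho interpolates between the current v_t (Adam) and its running time-average (AVGrad).
    v_bar.mul_((t - 1) / t).add_(v, alpha=1.0 / t)
    v_hat = rho * v + (1 - rho) * v_bar

    # Decoupled weight decay, then interpolate adaptive and Lion (sign) updates via gamma.
    p.mul_(1 - lr * wd)
    adaptive = (m + beta3 * n) / (v_hat.sqrt() + eps)
    p.add_(gamma * adaptive + (1 - gamma) * torch.sign(u), alpha=-lr)
```

In MADA, the coefficients passed in through `hp` (β1, β2, β3, c, ρ, γ) would themselves be updated by hyper-gradient descent at every iteration; here they are treated as constants, and the state tensors are assumed to be initialized to zeros, to keep the sketch focused on the parameterized update itself.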
C.2. Training setup for the experiments

On OpenWebText, we use a global batch size of 480 sequences and a cosine learning rate schedule with a peak learning rate of 6e-4 (1.5e-4 for Lion) and a final learning rate of 1.5e-5 (as chosen for the Sophia algorithm in (Liu et al., 2023)). For Lion we use the tuned learning rate from (Liu, 2023) for the same setting. For Adam, we found that employing Sophia's learning rate schedule results in better loss compared to employing Adam's learning rate schedule in (Liu et al., 2023). For our methods (AVGrad, MADA, MADA-FS), we directly employ Adam's learning rate and schedule without any tuning, since our goal is to demonstrate that MADA can be plugged in place of Adam without any changes to the learning rate schedule. We run the experiment for 100,000 iterations (the first 2,000 are warmup iterations), which corresponds to training on 49.2B tokens. We use a weight decay parameter of 0.1.

On Shakespeare, we use a batch size of 64 and a cosine learning rate schedule with a peak learning rate of 1e-3 and a final learning rate of 1e-4. We run the experiment for 5,000 iterations (the first 100 are warmup iterations).

We run our experiments on AWS p5.48xlarge instances equipped with 8 NVIDIA H100 GPUs. To be able to utilize multiple GPUs, we all-reduce the hyper-gradients across GPUs. We initialize MADA with β3 = 0.9, ρ = 0 on top of Adam's established β1, β2 parameters for the OpenWebText experiments, as this resulted in a better validation loss. For the Shakespeare experiments, we compare MADA against Adam over a grid of (β1, β2) values, where in the case of MADA (β1, β2) represents the initial values. For the OpenWebText experiments, we do not update the hyper-parameters for the first 50 iterations for the sake of stability. For both experiments we use SGD as the hyper-parameter optimizer. On Shakespeare, for the 10M model, we use a learning rate of 2.5e-3 and a momentum of 0.5 (we do not use momentum for γ) for learning the hyper-parameters. On OpenWebText (and for the 1.5B model on Shakespeare), we use a learning rate of 5e-4 for training β1, β2 and 1e-1 for the other hyper-parameters.
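To illustrate the mechanics just described (SGD as the hyper-parameter optimizer, hyper-gradients all-reduced across GPUs, per-coefficient hyper learning rates, and no momentum on γ), a rough sketch of the hyper-parameter step is given below. This is our own simplification with assumed names; the computation of the hyper-gradients themselves (differentiating the training loss through the update rule) is taken as given.

```python
import torch
import torch.distributed as dist

def hyper_step(coeffs, hyper_grads, bufs, hyper_lrs, momentum=0.5):
    """Illustrative SGD step on MADA's optimizer coefficients (not the authors' implementation).

    coeffs:      dict of scalar tensors, e.g. {"beta1": ..., "rho": ..., "gamma": ...}
    hyper_grads: dict with d(loss)/d(coefficient) for the current iteration
    bufs:        dict of momentum buffers, persistent across iterations
    hyper_lrs:   dict of per-coefficient hyper learning rates
    """
    for name, g in hyper_grads.items():
        # Average the hyper-gradient across data-parallel workers.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(g, op=dist.ReduceOp.SUM)
            g.div_(dist.get_world_size())

        # SGD with momentum; the text above uses no momentum for gamma.
        mom = 0.0 if name == "gamma" else momentum
        buf = bufs.setdefault(name, torch.zeros_like(g))
        buf.mul_(mom).add_(g)
        coeffs[name].sub_(hyper_lrs[name] * buf)

        # Keep interpolation coefficients inside the unit interval (one simple choice of projection).
        coeffs[name].clamp_(0.0, 1.0)
```

Called once per iteration, after the hyper-gradients have been computed and before the next MADA step, this matches the schedule described above (with the first 50 iterations skipped on OpenWebText).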
Vision tasks. For the 5-layer model experiments, we use a CNN whose first two layers are convolutional layers with 6 and 16 output channels and kernel size 5, and whose last three layers are fully connected layers with 120, 84, and 10 output dimensions. We use ReLU activations in all layers except the last one and max-pooling on the outputs of the convolutional layers. We use a constant learning rate of 1e-3 for MADA, Adam, and HyperAdam, and a learning rate of 1e-2 with momentum coefficient 0.9 for SGD with momentum. We initialize MADA from the Adam state; we set the hyper learning rate for β1, β2 to 1e-4 and for the other variables to 1e-2. We use a batch size of 256 and train for 50 epochs. For the ResNet-9 experiments, we use the model implementation from https://github.com/Moddy2024/ResNet-9. We use a one-cycle learning rate scheduler with a peak learning rate of 1e-2 for MADA, Adam, and HyperAdam, and 1e-1 for SGD with momentum. Again, we initialize MADA from Adam; we set the hyper learning rate for β1 to 1e-4, for β2 to 1e-4, and for the other variables to 1e-2. We use a batch size of 400 and train for 50 epochs.

C.3. Additional Experiments

Method            | OpenWebText (val. loss) | OpenWebText (ppl.) | WikiText (ppl.) | LAMBADA (ppl.)
MADA (ρ0 = 0)     | 2.8853                  | 17.9084            | 61.4689         | 73.1291
MADA-FS (ρ0 = 0)  | 2.8838                  | 17.8822            | 63.9886         | 75.6158

Table 6: Validation loss on OpenWebText and validation perplexities on OpenWebText, WikiText, and LAMBADA for GPT-2 (125M) models trained on OpenWebText with MADA. The initial optimizer state is AVGrad.

Method | OpenWebText (val. loss) | OpenWebText (ppl.) | WikiText (ppl.) | LAMBADA (ppl.)
Adam   | 2.6527                  | 14.1928            | 50.1410         | 53.7468
MADA   | 2.6422                  | 14.0441            | 43.8067         | 53.7575

Table 7: Validation loss on OpenWebText and validation perplexities on OpenWebText, WikiText, and LAMBADA for GPT-2 (355M) models trained on OpenWebText with Adam and MADA. The initial state of MADA is AVGrad + Adan (ρ = 0, β3 = 0.9).

Figure 9: Average regret, $x_t$, and $\rho_t$ with respect to iterations for MADA on the simple convex function.

Synthetic convex experiment. The online learning problem given in (Reddi et al., 2018) to motivate AMSGrad is as follows:
$$f_t(x) = \begin{cases} 1010\,x, & \text{for } t \bmod 101 = 1,\\ -10\,x, & \text{otherwise,}\end{cases} \qquad (33)$$
with constraint set $x\in[-1,1]$, where $t$ denotes the time index in the online learning setting. This is an example where Adam notably fails to converge to the optimal solution $x^*=-1$, whereas AMSGrad and AVGrad reach the optimum. In Figure 9, we plot the average regret, defined as $\frac{1}{t}\sum_{\tau\le t}\big(f_\tau(x_\tau)-f_\tau(-1)\big)$, as well as the evolution of $x$ and $\rho$ with respect to iterations. Here, we reduce the effect of Lion by assigning a small hyper-learning rate to γ, since signSGD-based methods neutralize the effect of the large gradients that allow faster progress towards the optimum. We observe that even when we initialize MADA from Adam (ρ0 = 1), it quickly recovers AVGrad (ρt → 0), which is the right optimizer to use for this example. This experiment shows that MADA can learn to behave like AVGrad even when initialized from Adam.

D. Profiling Results

We profile the memory footprints of the methods and find that the peak memory usage is 15.5 GB for Adam and 24.9 GB for MADA; the time per iteration is 0.65 s for Adam and 0.85 s for MADA. If we exclude Lion (which contributes the least in this setting), MADA results in 22.2 GB of memory usage and 0.72 s per iteration. We also investigated a large-batch setting: comparing Adam and MADA respectively, training GPT-2 (124M) with a batch size of 3840 (about 4 million tokens) requires 51.5 GB and 61.5 GB of memory, while the time per iteration is 4.72 s and 5.20 s. We note that our method is able to outperform Adam at the same wall-clock time as well. The main reason for the memory increase is the additional optimizer state required by the larger collection of base optimizers at this model size. In contrast, for vision tasks we observe virtually no additional memory usage, due to the smaller size of the models and the minimal overhead of storing the optimizer states.
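Referring back to the synthetic convex problem in Eq. (33) of Appendix C.3, the setup is simple enough to reproduce in a few lines. The sketch below is our own, with illustrative step size and β values rather than the ones used for Figure 9; it runs plain Adam and an AVGrad-style averaged variant on the constrained problem and reports the average regret and final iterate.

```python
import numpy as np

def grad(t):
    # f_t(x) = 1010*x if t mod 101 == 1 else -10*x (Eq. (33)); the gradient does not depend on x.
    return 1010.0 if t % 101 == 1 else -10.0

def run(method="avgrad", T=50_000, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    x = m = v = v_bar = regret = 0.0
    for t in range(1, T + 1):
        g = grad(t)
        regret += g * x - g * (-1.0)              # f_t(x_t) - f_t(x*), with x* = -1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_bar = (v + (t - 1) * v_bar) / t          # AVGrad: running average of the second moment
        denom = np.sqrt(v_bar if method == "avgrad" else v) + eps
        x = float(np.clip(x - lr * m / denom, -1.0, 1.0))   # project onto the constraint set [-1, 1]
    return regret / T, x

for method in ("adam", "avgrad"):
    avg_regret, x_T = run(method)
    print(f"{method:>7}: average regret {avg_regret:9.3f}, final x {x_T:+.3f}")
```

The paper reports that Adam drifts toward +1 on this problem while AVGrad reaches $x=-1$; the sketch omits bias correction and the hyper-gradient dynamics of ρ shown in Figure 9, so reproducing Figure 9 exactly would require the full MADA parameterization and its hyper-parameter updates.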