# The AdEMAMix Optimizer: Better, Faster, Older

Published as a conference paper at ICLR 2025

Matteo Pagliardini (EPFL)*, Pierre Ablin (Apple), David Grangier (Apple)

*Work done while interning at Apple.

Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show, quite surprisingly, that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a 1.3B parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (+95%). Moreover, our method significantly slows down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.

1 INTRODUCTION

With large neural networks, deep learning has revolutionized numerous fields, such as computer vision and natural language processing. At the heart of this paradigm lies the challenge of optimizing complex, non-convex loss functions using noisy gradient estimates. This optimization process is typically carried out using variants of Stochastic Gradient Descent (SGD) (Robbins & Monro, 1951) or adaptive methods such as Adam and AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019), which have become ubiquitous in training state-of-the-art models (Devlin et al., 2019; Brown et al., 2020; Dosovitskiy et al., 2021a; Radford et al., 2021; Zhai et al., 2022; Dehghani et al., 2023; Touvron et al., 2023; Dubey et al., 2024). A key component in many of these iterative optimization algorithms is momentum, which has long been shown to accelerate convergence (Nemirovskii & Nesterov, 1985) and often leads to solutions with superior generalization properties (Sutskever et al., 2013). By accumulating gradient vectors over successive optimization steps, momentum helps overcome small local variations of the loss landscape, potentially escaping shallow local minima, and accelerates in plateau regions (Qian, 1999; Ruder, 2016; Goh, 2017). Both SGD with momentum (SGD+M) and Adam incorporate momentum in the form of Exponential Moving Averages (EMAs) of past gradients $G_T = \{g^{(0)}, \dots, g^{(T)}\}$:

$$\mathrm{EMA}(\beta, G_T) \triangleq \beta\,\mathrm{EMA}(\beta, G_{T-1}) + (1-\beta)\,g^{(T)} = \sum_{i=0}^{T} \beta^i (1-\beta)\, g^{(T-i)}. \tag{EMA}$$

Two considerations support the use of EMAs. From a practical standpoint, the recursive formula of the EMA allows for efficient implementations which do not require maintaining a buffer of past gradients. From a theoretical standpoint, gradient descent with momentum leads to optimal convergence rates for quadratics (Polyak, 1964; Nesterov, 1983). However, those results do not guarantee any optimality for general non-quadratic cases (Goujaud et al., 2023).
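To make the weighting in equation (EMA) concrete, here is a minimal NumPy sketch (ours, for illustration only; not code from the paper) that prints the weight $(1-\beta)\beta^i$ an EMA assigns to the gradient from $i$ steps ago, for several values of $\beta$:

```python
import numpy as np

def ema_weights(beta: float, num_steps: int) -> np.ndarray:
    # Weight assigned to the gradient from i steps ago: (1 - beta) * beta**i.
    i = np.arange(num_steps)
    return (1.0 - beta) * beta ** i

for beta in (0.9, 0.999, 0.9999):
    w = ema_weights(beta, num_steps=10_000)
    print(
        f"beta={beta}: last gradient {w[0]:.1e}, "
        f"1,000 steps ago {w[1000]:.1e}, "
        f"mass on last 10 gradients {w[:10].sum():.3f}"
    )
```

With β = 0.9 essentially all of the mass falls on the last few gradients, while with β = 0.9999 every gradient receives a small, near-uniform weight; no single β achieves both at once, which is the limitation AdEMAMix is designed to address.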
[Figure 1 shows loss vs. iterations: (a) 110M parameters, AdamW vs. AdEMAMix trained on {17B, 26B, 33B} tokens; (b) 330M parameters, trained on {25B, 39B, 49B} tokens; (c) 1.3B parameters, trained on {34B, 101B, 131B} tokens, with an AdamW 197B-token reference line.]

Figure 1: Comparing AdamW and AdEMAMix on language modeling. In (a,b,c), we plot the loss obtained using AdamW and AdEMAMix (our optimizer) to train Transformer models of various sizes on the RedPajama dataset. In (a), we train multiple baselines for 256k, 400k, and 500k iterations, resulting in processing from 17B to 33B tokens. Two AdamW runs with different numbers of iterations look very different as we use a cosine decay for the learning rate. We compare those baselines to training AdEMAMix for 256k iterations. We observe that our method reaches a similar loss as an AdamW model trained on nearly twice the number of tokens. Analogous comparisons can be derived from (b) and (c). Notably, in (c), a 1.3B parameter AdEMAMix model trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (95% more, blue horizontal line). See § 4.1 and App. B.1 for a detailed description of our experimental setup, including hyperparameters.

The use of momentum in optimization is grounded in the varying nature of gradients. As local linear approximations of the loss landscape, their information can quickly become outdated as the optimization process progresses (Pascanu et al., 2013). For this reason, practitioners typically employ moderate momentum values (i.e. β ≈ 0.8 or 0.9), effectively creating a moving average of recent gradients while discarding older information. Selecting larger β values seems counter-intuitive, as it would suggest that older gradients maintain their relevance over extended periods of training. While it is tempting to see the use of small βs as a confirmation of the limited temporal relevance of gradients, our work reveals instead that older gradients can be used efficiently. When we increase β, we decrease the relative importance of recent gradients, and the iterate then fails to respond to local changes in the loss landscape. We observe that a single EMA cannot both give a significant weight to recent gradients and give a non-negligible weight to older gradients (see Fig. 3a). However, a linear combination of a fast-changing (e.g. β = 0.9 or β = 0) and a slow-changing (e.g. β = 0.9999) EMA allows the iterate to benefit from (i) the great speedup provided by the larger (slow-changing) momentum, while (ii) still being reactive to small changes in the loss landscape (fast-changing). More precisely, we find the following statement to convey an important intuition behind this approach: "while changing the direction of the slow momentum is difficult, any adjustment orthogonal to that direction is easy", which favors fast progress in sinuous valley-like landscapes: the fast EMA helps to stay in the valley while the slow EMA accelerates along the valley's direction. A toy illustration of this can be seen in Fig. 2. Based on this idea, we propose AdEMAMix (Adaptive EMA Mixture), a novel Adam-based optimizer which successfully leverages very old gradients to reach better solutions.

Contributions.
Our contributions can be summarized as follows:

- We propose AdEMAMix, a novel optimizer which better leverages past gradients by avoiding a common pitfall of EMA-based optimizers (see § 3).
- We empirically demonstrate the superiority of our method over Adam by training ViT and language models (Transformers and Mamba) of up to 1.3B parameters (see § 4). In addition, we show gains from switching mid-training from Adam to AdEMAMix (see Fig. 5).
- We show AdEMAMix forgets the training data more slowly than Adam (see Fig. 4).

More broadly, our findings contribute to a deeper understanding of the optimal balance between using historical gradients and adapting to the rapidly changing loss landscape. Our work invites further research in methods combining old and recent gradients, beyond EMAs.

[Figure 2 shows: (a) distance to the solution $\|x^{(t)} - x^\star\|_2$ vs. iterations for Adam (β2 = 0.999, β1 ∈ {0.9, 0.99, 0.999, 0.9999}) and AdEMAMix ((β1, β2) = (0.9, 0.999), α = 9, β3 ∈ {0.999, 0.9999}); (b) Adam trajectories; (c) AdEMAMix trajectories.]

Figure 2: Comparing Adam and AdEMAMix on the Rosenbrock function. Starting from $x^{(0)} = [-3, 5]$, we minimize the Rosenbrock function $f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2$. The global minimum (marked ⋆) is $x^\star = [1, 1]$. We use β2 = 0.999 for Adam and (β1, β2, α) = (0.9, 0.999, 9) for AdEMAMix (see § 3). We reduce the learning rate for AdEMAMix to compensate for the influence of α. We do a sweep over β1 (resp. β3) for Adam (resp. for AdEMAMix). In (b), when Adam's β1 is small (e.g. 0.9), the iterates do not oscillate but convergence is slow. Increasing β1 makes the iterates move faster but with large oscillations. In contrast, for AdEMAMix in (c), we observe that despite β3 being large, the iterates move fast and without oscillations. This results in reaching better solutions faster, as can be seen in (a).

2 RELATED WORK

Work on understanding momentum. From the seminal work of Polyak (1964), many publications analyzed the effect of gradient descent with momentum (GD+M) in both convex and non-convex settings (Ghadimi et al., 2015; Flammarion & Bach, 2015; Kidambi et al., 2018; Defazio, 2020; Sebbouh et al., 2021). While acceleration in the noise-free setting has long been theorized for convex functions, several publications indicate this effect might not necessarily extend to stochastic settings (Yuan et al., 2016; Kidambi et al., 2018; Leclerc & Madry, 2020), emphasizing instead a link between momentum and effective learning rate. Recent works have sought to understand the impact of momentum on generalization by studying the implicit bias of momentum methods (Ghosh et al., 2023; Papazov et al., 2024), exposing a preference of SGD+M for lower-norm solutions. Those works further exposed a link between higher momentum, higher effective learning rate, and higher variance reduction. Despite the volume of prior work on the subject, our understanding of momentum methods in non-convex stochastic settings is still incomplete (Yuan et al., 2016). Oscillatory behaviours and the sometimes ambiguous effect of variance on optimization render the analysis tedious. From a theoretical standpoint, our work raises several questions. First, given that we gain from averaging very old gradients, what can this reveal about the loss landscape and the consistency of one batch's gradient during training?
Second, would our approach not decrease the variance to a point that harms generalization (Ghosh et al., 2023)? While no answer to those questions is given in this work, we provide a toy justification indicating that large momentum values can have a positive impact in noise-free non-convex settings (see Fig. 2), which suggests the improvement of our approach is at least partially explainable without considering variance-reduction effects. We moreover expose a link between momentum and forgetting of the training data (see Fig. 4), which to our knowledge is novel.

Work on deep-learning optimizers. Despite the popularity of Adam and AdamW (Kingma & Ba, 2015; Loshchilov & Hutter, 2019) in training deep neural networks, optimizer design is a rich field of research, and we focus on a few of the works most relevant to this study. Adafactor (Shazeer & Stern, 2018) improves Adam's memory efficiency by factorizing the second-moment estimate. Lamb (You et al., 2020) extends Adam by adding layerwise normalized updates. Chen et al. (2023a) use algorithm discovery to derive the Lion optimizer. Contrary to Adam, Lion uses a single momentum term and the sign function to produce updates with the same magnitude across dimensions. Interestingly, Chen et al. (2023a) also report that better scores are obtained when using a slightly larger momentum term (β = 0.99). In this work we show how increasing the momentum well beyond this value can still be beneficial. See App. C.3.2 for a detailed comparison between AdEMAMix and Lion. Recently, Liu et al. (2023) introduced Sophia, a scalable second-order optimizer designed for LLM training. Sophia uses a Hessian-based pre-conditioner which better normalizes the step size, penalizing steps in high-curvature directions and accelerating in low-curvature directions. Understanding in which circumstances those novel optimizers bring improvements is still being investigated (Kaddour et al., 2023), and Adam's dominance remains mostly unchallenged.

Work incorporating an additional momentum term. Lee et al. (2024) introduce Grokfast, which uses a pre-filtering step on the gradient to amplify the low frequencies and help solve grokking. When combined with Adam, it effectively applies Adam's EMAs on top of another gradient-averaging method (e.g. an EMA for Grokfast-EMA). Somewhat similarly, Chen et al. (2023b) refer to the Double EMA (DEMA) used in some finance applications (Mulloy, 1994) as one motivation for their AdMeta optimizer. Our motivation behind AdEMAMix is to combine a high sensitivity to recent gradients with the incorporation of very distant gradients; in this respect, using nested EMAs is not the right candidate as it reduces the influence of recent gradients. A more detailed review of AdMeta and DEMA is in App. C.3.1. Most relevant to us, Lucas et al. (2019, AggMo) also observe that using a combination of EMAs can enable the use of larger βs, and incorporate a sum of K momentum terms into GD. They show their approach reaches similar performance to baseline optimizers, with faster convergence. In contrast, we modify Adam, and introduce schedulers that are critical to reaching good performance at larger scales. This allows us to outperform Adam by a significant margin. In App. C.3.3, we notice no further improvement from adding more momentum terms. Finally, Szegedy et al. (2024) propose a general framework to derive and study optimizers with linear combinations of memory vectors, which encompasses EMA mixtures.
Work on distributed optimization. Perhaps surprisingly, recent works on distributed optimization, DiLoCo and SlowMo (Douillard et al., 2023; Wang et al., 2020), are relevant to our work. N workers $\theta_1^{(t)}, \dots, \theta_N^{(t)}$ are trained independently for K steps (e.g. K = 500). Every K steps, the delta updates $\{\Delta\theta_i\}_{i=1}^{N} \triangleq \{\theta_i^{(t+K)} - \theta_i^{(t)}\}_{i=1}^{N}$ are averaged and applied to each worker using an outer optimizer with momentum β: $\theta_i^{(t+K)} = \theta_i^{(t)} - \eta\,\mathrm{Opt}\big(\tfrac{1}{N}\textstyle\sum_i \Delta\theta_i,\ \beta\big)$. The application of the outer momentum every 500 steps increases the importance of old gradients in the optimization trajectory. We believe this observation might in part explain the strong results provided by those methods, and it further motivated our work.

3 OUR METHOD: ADEMAMIX

Setup & notations. Let $\mathcal{L}_\theta : \mathcal{X} \to \mathbb{R}$ be a loss function parameterized by $\theta$ and mapping inputs $x \in \mathcal{X}$ to $\mathbb{R}$. Given a sampled batch $x$, let $g = \nabla_\theta \mathcal{L}_\theta(x)$ be a stochastic gradient of the loss w.r.t. $\theta$. To minimize the empirical loss, the Adam optimizer (Kingma & Ba, 2015) relies on first and second moments, resp. $m$ and $\nu$, estimated via two EMAs parametrized by $(\beta_1, \beta_2) \in [0,1)^2$. A weight-decay parameter $\lambda \in \mathbb{R}^+$ is often used as in Loshchilov & Hutter (2019):

$$m^{(t)} = \beta_1 m^{(t-1)} + (1-\beta_1)\, g^{(t)}, \qquad \hat m^{(t)} = \frac{m^{(t)}}{1-\beta_1^t},$$
$$\nu^{(t)} = \beta_2 \nu^{(t-1)} + (1-\beta_2)\, \big(g^{(t)}\big)^2, \qquad \hat \nu^{(t)} = \frac{\nu^{(t)}}{1-\beta_2^t},$$
$$\theta^{(t)} = \theta^{(t-1)} - \eta \left( \frac{\hat m^{(t)}}{\sqrt{\hat \nu^{(t)}} + \epsilon} + \lambda\, \theta^{(t-1)} \right), \tag{AdamW}$$

with $t > 0$ being the timestep, $\eta$ the learning rate, and $\epsilon$ a small constant. Initially $m^{(t=0)} = \nu^{(t=0)} = 0$. To prevent the bias induced by the initial $m^{(t=0)}$ and $\nu^{(t=0)}$, the outputs of the two EMAs are corrected into $\hat m^{(t)}$ and $\hat \nu^{(t)}$. Those are used to compute the final weight update, scaled by the learning rate.

[Figure 3 shows: (a) Limitation of EMAs: weights given to past gradients for β ∈ {0.9, 0.99, 0.999, 0.9999}; (b) Constant η-scheduler: 1.3B models, AdamW vs. AdEMAMix with β3 ∈ {0.9999, 0.99999}, with 100k steps of linear η decay; (c) Mamba results (168M): AdamW vs. AdEMAMix trained on {8B, 16B} tokens.]

Figure 3: Limitation of EMAs, constant η-scheduler, & Mamba results. In (a), we plot the weights $w_t$ for each past gradient $g^{(t)}$ given by different EMAs after 10k steps. For a given β, $\mathrm{EMA}(\beta, g^{(0)}, \dots, g^{(T)}) = \sum_{i=0}^{T} \beta^i (1-\beta)\, g^{(T-i)}$, which decays the contribution of past gradients exponentially. A small β (e.g. 0.9) will give a high weight to the immediate past and negligible weights to older timesteps. In contrast, a high β (e.g. 0.9999) gives a relatively uniform, yet non-negligible, weight to all gradients. No β value can simultaneously give a high weight to the immediate past and a non-negligible weight to very old timesteps. In (b), we train multiple 1.3B language models using 3k steps of warmup and then a constant learning rate $\eta = 10^{-4}$. This allows us to observe the gap between AdamW and AdEMAMix without the cosine decay as a confounder. We still use schedulers for α and β3 with $T_{\alpha,\beta_3}$ = 500k, α = 5. Similar to Zhai et al. (2022); Hu et al. (2024); Hägele et al. (2024), we decay the learning rate linearly at t = 1M and t = 1.3M. The loss-gap between AdamW and AdEMAMix increases at first, and then remains constant. AdEMAMix still outperforms AdamW after decaying the learning rate. See App. C.1.10 for the impact of the linear decay duration. In (c), we train 168M parameter Mamba models, showing how AdEMAMix's performance can generalize outside of the Transformer architecture.

How far to look into the past? A typical value for β1 is 0.9. Fig. 3a shows the weights given to past gradients for different values of β. The larger the β, the more uniform the average is. To put this in perspective, observing that $\sum_{i=0}^{\infty} \beta^i (1-\beta) = 1$ for $\beta \in [0,1)$, the number of successive previous
steps receiving a cumulative weight of 0.5 is $t_{\text{half}} = \frac{\ln(0.5)}{\ln(\beta)} - 1$. For β = 0.9, $t_{\text{half}} \approx 6$, meaning that half of the weight is given to the previous six gradients. This observation can also be extended to SGD with e.g. Polyak or Nesterov momentum (Polyak, 1964; Nesterov, 1983), which typically relies on similar β values. The value of β1 is rarely increased beyond 0.9. In our experiments with AdamW, increasing β1 further degraded the performance (see App. C.1.7). Does this mean older gradients are outdated? We show that this is not the case; rather, increasing β reduces the sensitivity to recent gradients too much. We design AdEMAMix such that the sensitivity to recent gradients is kept, while also incorporating information from much older gradients using an additional momentum term. This allows for the use of much larger β values, e.g. 0.9999. To compare, for β = 0.9999, $t_{\text{half}} \approx 6{,}930$, spreading half of the mass over the previous 6,930 past gradients.

AdEMAMix. To keep a high sensitivity to recent gradients, while also incorporating information from older gradients, we add a second EMA (changes compared to AdamW are in blue):

$$m_1^{(t)} = \beta_1 m_1^{(t-1)} + (1-\beta_1)\, g^{(t)}, \qquad \hat m_1^{(t)} = \frac{m_1^{(t)}}{1-\beta_1^t},$$
$$m_2^{(t)} = \beta_3 m_2^{(t-1)} + (1-\beta_3)\, g^{(t)},$$
$$\nu^{(t)} = \beta_2 \nu^{(t-1)} + (1-\beta_2)\, \big(g^{(t)}\big)^2, \qquad \hat \nu^{(t)} = \frac{\nu^{(t)}}{1-\beta_2^t},$$
$$\theta^{(t)} = \theta^{(t-1)} - \eta \left( \frac{\hat m_1^{(t)} + \alpha\, m_2^{(t)}}{\sqrt{\hat \nu^{(t)}} + \epsilon} + \lambda\, \theta^{(t-1)} \right). \tag{AdEMAMix}$$

In our experiments, while the values of β1, β2 remain similar to those of equation (AdamW), we often use β3 = 0.9999. We find α ∈ [4, 10] to work well in practice.

Tackling early training instabilities. Early training instabilities are commonplace when training deep models, and empirically motivated the introduction of methods such as learning rate warmup and gradient clipping. Gilmer et al. (2022) show how the use of learning rate warmup can be justified from a curvature perspective, allowing the iterates to move to parts of the optimization landscape where larger learning rates are stable. While we use learning rate warmup in all our experiments, we still noticed that AdEMAMix models using a large β3 would diverge early, despite not using bias correction over m2, which lets the momentum buffer fill itself slowly. Those failed runs are characterized by updates of large magnitude in the early phase of training (see App. C.1.10, Fig. 28). For this reason, we progressively increase the values of β3 and α using schedulers. For α we use a linear scheduler. A linear scheduler for β3 would be ill-fitted, as the same increment of β3 has a different impact for different values of β3. For instance, observe that an increase of β by δβ = 0.0001 barely increases $t_{\text{half}}$ for β = 0.9, while going from 0.999 to 0.999 + δβ increases $t_{\text{half}}$ by 77. For this reason, we design the β3 scheduler to increase $t_{\text{half}}$ linearly (see App. A.1 for a derivation of that scheduler). The two schedulers are summarized below:

$$\alpha^{(t)} = f_\alpha(t, \alpha, T_\alpha) = \min\left(\frac{t\,\alpha}{T_\alpha},\ \alpha\right), \tag{f_alpha}$$
$$\beta_3^{(t)} = f_{\beta_3}(t, \beta_3, \beta_{\text{start}}, T_{\beta_3}) = \min\left(\exp\left(\frac{\ln(\beta_{\text{start}})\,\ln(\beta_3)}{\big(1-\tfrac{t}{T_{\beta_3}}\big)\ln(\beta_3) + \tfrac{t}{T_{\beta_3}}\ln(\beta_{\text{start}})}\right),\ \beta_3\right), \tag{f_beta3}$$

where $T_\alpha$ and $T_{\beta_3}$ are respectively the warmup times for $\alpha^{(t)}$ and $\beta_3^{(t)}$ to reach their final, maximal values.
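For reference, here is a minimal Python sketch of the two schedulers $f_\alpha$ and $f_{\beta_3}$ above (our own transcription of the formulas; the function names and signatures are not from the paper):

```python
import math

def f_alpha(t: int, alpha: float, t_warmup: int) -> float:
    # Linear warmup of alpha from 0 to its final value over t_warmup steps.
    return min(t * alpha / t_warmup, alpha)

def f_beta3(t: int, beta3: float, beta_start: float, t_warmup: int) -> float:
    # Warmup of beta3 chosen so that t_half(beta) grows linearly from
    # t_half(beta_start) to t_half(beta3) over t_warmup steps.
    if t >= t_warmup:
        return beta3
    frac = t / t_warmup
    log_beta = (math.log(beta_start) * math.log(beta3)) / (
        (1 - frac) * math.log(beta3) + frac * math.log(beta_start)
    )
    return min(math.exp(log_beta), beta3)

# Sanity check: the schedule starts at beta_start and ends at beta3.
assert abs(f_beta3(0, 0.9999, 0.9, 1_000_000) - 0.9) < 1e-12
assert f_beta3(1_000_000, 0.9999, 0.9, 1_000_000) == 0.9999
```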
In practice we always set those two to the same value: $T_\alpha = T_{\beta_3} = T_{\alpha,\beta_3}$, and we typically use $T_{\alpha,\beta_3} = T$, with T being the total number of iterations. $\beta_{\text{start}}$ is always set to β1 in our experiments. The use of those schedulers is not always required; in particular, we found they have no impact when AdEMAMix is activated later during training (see Fig. 5). The full AdEMAMix optimizer, including the schedulers, is shown in Alg. 1.

Algorithm 1: AdEMAMix optimizer. Differences with AdamW are in blue.
1: Input: Data distribution D. Initial model parameters θ(0). Number of iterations T. Learning rate η. ϵ a small constant. AdamW parameters: β1, β2 and λ. AdEMAMix parameters: β3, α. Warmup parameter Tα,β3 (we usually set it to T). βstart is usually set to β1.
2: Initialize timestep: t ← 0
3: Initialize EMAs: m1(0) ← 0, m2(0) ← 0, ν(0) ← 0
4: for t ∈ [T] do
5:   t ← t + 1
6:   Optional: use schedulers η(t), β3(t) ← fβ3(t, β3, βstart, Tα,β3) and α(t) ← fα(t, α, Tα,β3)
7:   Sample batch: x ∼ D
8:   Compute gradient: g(t) ← ∇θ Lθ(t−1)(x)
9:   Update the fast EMA m1: m1(t) ← β1 m1(t−1) + (1 − β1) g(t)
10:  Update the slow EMA m2: m2(t) ← β3(t) m2(t−1) + (1 − β3(t)) g(t)
11:  Update the second-moment estimate: ν(t) ← β2 ν(t−1) + (1 − β2) (g(t))²
12:  Apply bias corrections: m̂1(t) ← m1(t) / (1 − β1^t), ν̂(t) ← ν(t) / (1 − β2^t)
13:  Update parameters: θ(t) ← θ(t−1) − η(t) [ (m̂1(t) + α(t) m2(t)) / (√ν̂(t) + ϵ) + λ θ(t−1) ]
14: end for
15: Return optimized parameters θ(T)

Using β1 = 0 to save memory. By setting β1 = 0, we can save on memory by replacing $\hat m_1^{(t)}$ by $g^{(t)}$. In this case, the memory cost of AdEMAMix is the same as for AdamW. We show in App. C.1.5 (Fig. 16b) and App. C.1.8 that using β1 = 0 often works, at the cost of potentially less stable training.

[Figure 4 shows the loss on batch B vs. iterations; top row: AdamW, bottom row: AdEMAMix; panels (a) t_B = 90k, (b) t_B = 170k, (c) t_B = 190k, (d) t_B = 230k; each panel compares training without B against training using B at t = t_B.]

Figure 4: Measuring forgetting using a held-out batch B. The top row is for AdamW, the bottom row for AdEMAMix. We train one AdamW and one AdEMAMix model on a RedPajama dataset not containing the batch B; those runs are in blue. We then run multiple experiments where we inject B into the training data at a specific timestep $t_B$; those runs are in orange. To inspect how much influence B had when injected at $t_B$, we can observe the evolution of the gap between the blue and the orange curves. For both optimizers, we observe a rapid decrease of the loss on B right after training on B. The sharpness of this decrease in loss is more pronounced for AdamW compared to AdEMAMix. However, when using AdamW, the loss on B then increases faster, which we interpret as the model forgetting B faster. In contrast, the curves for AdEMAMix are smoother, the loss on B goes back up more slowly, and ultimately B had a bigger impact on the training when using AdEMAMix, as can be seen from the larger gap between the orange and blue curves at the last iteration. Finally, the forgetting behaviour evolves during training, with later training batches being remembered better. See App. C.1.3 for more experiments on forgetting.

Computational overheads. Adding an additional EMA requires additional memory and compute. The added compute is negligible in comparison to what is required for the forward-backward pass, and has little impact on the total runtime, as shown in Fig. 5a. Moreover, when considering larger distributed setups, it is worth noting that AdEMAMix does not increase communication (gradient reduction) over Adam. Therefore, we expect the runtime overhead of AdEMAMix to shrink in those settings, as data movements occupy a larger fraction of the total runtime. A more significant overhead is in terms of memory when β1 ≠ 0, as AdEMAMix requires allocating both m1 and m2, which are of the same size as the model parameters θ. We believe this issue is of lesser consequence as Fully Sharded Data Parallelism (Zhao et al., 2023, FSDP) can always be used to distribute the optimizer states across compute nodes. As an example, consider using the Megatron library (Shoeybi et al., 2020) to
train an 8B model across 16 compute nodes, each with 8 GPUs, and assume a data-parallelism (DP) degree of 64 (2 GPUs per replica). This would imply sharding the optimizer states over the 64 replicas, causing a small per-GPU offset of 0.5GB when compared with AdamW (the extra buffer m2 holds 8B values; in 32-bit precision that is roughly 32GB, or 0.5GB once split 64 ways). In the cases where memory remains an issue, one mitigation strategy could be to apply factorization strategies as in Shazeer & Stern (2018); Zhao et al. (2024), or to reduce the size of the momentum buffers as in Zhang et al. (2025).

Hyperparameter sensitivity. While we introduce up to four new hyperparameters (α, β3, Tα, and Tβ3), in practice we always set Tα = Tβ3 = Tα,β3, and use Tα,β3 = T in most cases. We show in App. C.1.5 that Tα and Tβ3 only need to be large enough to prevent instabilities early during training. While all of our experiments on language modeling use β3 = 0.9999, other values such as 0.999 or even 0.99999 can still outperform the AdamW baseline (see App. C.1.5, Fig. 13). On vision tasks, the scatter plots in Fig. 6 show all the AdEMAMix models trained for those experiments, highlighting how easy it can be to find good α and β3 values. Overall, we find the ranges of values of α, β3 and Tα,β3 providing improvements over AdamW to be wide. See App. C.1.5 for hyperparameter sweeps.

Gradient clipping & β2. Adam can diverge in cases where β2 is not sufficiently large compared to β1 (Zhang et al., 2022). Intuitively, the denominator in the AdamW update can shrink faster than the numerator, causing a spike in update magnitude. Given the proximity of our method to AdamW, a similar observation applies to AdEMAMix, and we require β2 to be large enough. In practice, we observe that both β2 = 0.999 and 0.9999 work well. Moreover, we notice that gradient clipping mitigates the issue, and we use small gradient-clipping values in our experiments (e.g. 0.5 or 0.1). We also believe clipping helps deal with outlier gradients, which can have a long-lasting and detrimental effect for AdEMAMix given the large β3.

[Figure 5 shows: (a) training time in hours for 256k iterations, 110M and 1.3B models, AdamW vs. AdEMAMix; (b) 110M models, switching from AdamW to AdEMAMix at steps {0, 300k, 330k, 360k, 390k}, compared to an AdamW baseline trained on 33B tokens; (c) 1.3B models, switching at steps {0, 500k, 600k}, compared to AdamW baselines (1M iterations; 101B tokens).]

Figure 5: Training time comparison & starting AdEMAMix from AdamW. In (a), we compare the time required to train 110M and 1.3B parameter models for 256k iterations. The additional EMA renders AdEMAMix slightly slower than AdamW. However, if we were to train AdamW longer to compensate for this gap, we would only train for an additional 2,379 and 5,421 iterations for resp. the 110M and 1.3B parameter models. Those additional iterations would not be sufficient to close the gap (see Fig. 1). In (b) and (c), we show, for two different model sizes, the effect of switching from AdamW to AdEMAMix during training. AdEMAMix's additional parameter m2 is initialized to 0, and no scheduler is used for α or β3. For both model sizes, we observe the loss increases slightly at first before decreasing and outperforming the baseline. In both cases, the earlier AdEMAMix is used, the better the final loss. See App. C.1.10 for results using schedulers.

Limitations. AdEMAMix consists in leveraging very old gradients. Therefore, the method is best suited to settings where the number of iterations is large. We report on this effect in App. C.1.6,
which additionally shows how smaller values of β3 (e.g. β3 = 0.999) can be better in low-iteration scenarios. Moreover, retaining gradient information over many thousands of steps can pose a problem in domains requiring fast adaptation to a sudden distribution shift.

4 RESULTS

In this section we use AdEMAMix to train language models (§ 4.1 and § 4.2) and vision transformers (§ 4.3) ranging from 24M to 1.3B parameters.

4.1 TRANSFORMER LLM TRAINING

Experimental setup. We use a transformer architecture (Vaswani et al., 2017) with learnt positional encoding. All of our experiments use sequences of 1,024 tokens. We experiment with three sizes of transformers: 110M, 335M, and 1.3B parameters. For the learning rate, we use 3k warmup steps followed by, unless specified otherwise, a cosine decay to $10^{-5}$. We extensively tuned the hyperparameters for both AdamW and AdEMAMix models (see App. B.1). We use the RedPajama v2 (Computer, 2023) dataset for all of our experiments. We use batch sizes of 64, 96 and 128 for respectively our 110M, 335M, and 1.3B parameter models. Depending on the model, we vary the number of iterations from 256k to 1.5M. For AdEMAMix, we use β3 = 0.9999 and α ∈ {5, 8, 10} depending on the model. A full description of the architecture and hyperparameters used is in App. B.1. We train on up to 8 A100 NVIDIA GPUs using data-parallelism.

Why not simply increase AdamW's β1? While our toy experiment in Fig. 2 already gives some intuition on why increasing Adam's β1 is unlikely to have the same effect as adding a second EMA as in AdEMAMix, we verify this intuition by training multiple 110M models using Adam with large β1 ∈ {0.99, 0.999, 0.9999, 0.99999}. When we use a large β1 from the beginning of training, we observe instabilities for larger β1 values, and no β1 > 0.9 improves upon the AdamW baseline. One could imagine this to be due to increasing β1 too early. Therefore, we also modify AdamW and add the same scheduler on β1 as we use on AdEMAMix's β3; β1 is then increased steadily over the entire training duration. While this mostly stabilizes the training, none of the experiments outperformed the baseline using β1 = 0.9. Moreover, to rule out any effect that could be due to early training instabilities, we run the same experiments starting from a pre-trained AdamW checkpoint trained for 300k iterations. We simply resume training and either increase β1 suddenly or use a scheduler. Here again, unlike when using AdEMAMix, none of those experiments outperform the baseline. The details of those experiments are available in App. C.1.7. Those experiments show that simply increasing the β1 value in AdamW is not enough, which justifies our design of AdEMAMix.

Better perplexity for the same number of steps.
For all model sizes and all numbers of iterations used (ranging from 256k to 1M), AdEMAMix always outperforms the AdamW baseline. In Fig. 1, we show the validation loss curves for AdamW and AdEMAMix models trained at each size on various numbers of tokens. For 110M parameter models, training for 256k iterations gives similar results as training an AdamW model for 500k iterations. The gap between the baseline and our method seems to increase as we increase the number of iterations. For 1.3B parameter models, training for 770k steps is on par with training the baseline for 1.5M iterations, reaching the same performance with 51% fewer tokens (a saving of 96B tokens). In Fig. 3b, we observe similar improvements when using a constant and then linearly decayed learning rate scheduler.

Training speed comparison. We measure the time per iteration for all of our experiments. In Fig. 5a we plot the time required to train our 110M and 1.3B parameter models for 256k iterations. We observe that the impact of using an additional EMA on the training speed is negligible. If we were to train new models with a fixed time budget, the extra iterations afforded to the baseline would not be sufficient to close the gap. For instance, to match a 110M parameter AdEMAMix model trained for 256k iterations, we need to train an AdamW model for 500k iterations, while the time advantage of AdamW would only allow us to do 2,379 additional iterations. Moreover, as mentioned in § 3, we expect the time overhead to decrease when multi-node training is used, as I/O would become an important bottleneck, and AdEMAMix does not increase I/O.

Consistency of the gain across token budgets. In Fig. 1, provided that enough tokens have been used w.r.t. the model size, we observe consistent gains across token budgets. Our method seems to bring a constant improvement over the baseline. This can be seen more clearly in Fig. 3b, showing results when using a constant learning rate scheduler, which removes the confounder of the cosine learning rate decay. We observe how, after an initial phase in which the gap grows, this gap becomes seemingly constant after a sufficient number of iterations.

Resume from AdamW vs. training AdEMAMix from scratch. So far we trained AdEMAMix models from scratch. We show it is also possible to switch from an AdamW to an AdEMAMix state in the middle of training. When switching to AdEMAMix at step $t_{\text{switch}}$, we initialize $m_2^{(t_{\text{switch}})} = 0$ and replace t by $t - t_{\text{switch}}$ in the scheduler equations if those are used. However, we find that schedulers are not required when resuming training and report results without them in the main paper (see App. C.1.10 for more details). In Fig. 5b and Fig. 5c, we show that (i) it is possible to improve upon the baseline when activating AdEMAMix later during training, and (ii) the earlier the switch, the larger the gain, with diminishing returns. This indicates that the improvement of AdEMAMix cannot be attributed solely to early training dynamics; rather, late training dynamics seem to play an important role. This is further corroborated by the reverse experiment, which switches from AdEMAMix to AdamW mid-training and shows a performance degradation (see results in App. C.1.4).

AdEMAMix models forget the training data slower. As an attempt to understand the reason for AdEMAMix's improvements over AdamW, we study how fast a training batch is forgotten after being used during training. We focus on following one specific batch B.
We start by removing B from the RedPajama training data and train AdamW and AdEMAMix models. For those runs, B is therefore akin to a batch from the validation set. We measure the loss on B while training. Now we can resume training from various checkpoints and inject B into the training data at various times $t_B$. By comparing the two runs (one having trained on B, while the other never saw B), we can visualize how B is learned, and how it is forgotten. After training on B, we expect the loss on B to decrease suddenly and then increase as the model forgets the contribution of that batch. When comparing the forgetting curves for AdamW and AdEMAMix in Fig. 4, we see striking differences. AdamW models forget much faster: the loss over B increases faster than for AdEMAMix models. Moreover, at the end of training, batches processed by AdEMAMix see their loss being improved over many thousands of iterations. Additional experiments on forgetting can be found in App. C.1.3.

4.2 MAMBA LM TRAINING

Experimental setup & results. Our setup is similar to § 4.1, except that we now train 168M parameter Mamba models (Gu & Dao, 2023) using the FineWeb dataset (Penedo et al., 2024). See App. B.2 for details. In Fig. 3c, the improvement using AdEMAMix is consistent with our experiments on Transformer models, showing how AdEMAMix's gains can extend beyond Transformers.

4.3 VIT TRAINING

[Figure 6 shows test loss vs. training loss scatter plots: (a) 24M params, 11M images (ImageNet-21k); (b) 86M params, 11M images (ImageNet-21k); (c) 86M params, 1.3M images (ImageNet-1k); comparing AdamW against AdEMAMix with β3 ∈ {0.999, 0.9999}.]

Figure 6: ViT results for different capacity/data ratios. As described in § 4.3, we compare the train vs. test loss between AdEMAMix and AdamW in multiple scenarios (a,b,c), which have different capacity/data ratios. For each setting, we first tune an AdamW baseline by testing multiple hyperparameters; those are represented by blue dots. We then pick the hyperparameters of the best baseline (corresponding to the lowest test loss) and train multiple AdEMAMix models by simply testing various α ∈ {1, 5, 10, 15, 20} and β3 ∈ {0.999, 0.9999}. Those are represented by orange squares and red triangles. See App. B.3 for detailed hyperparameters. The red area in the plots represents the region in which we achieve both a better training and a better test loss compared to the best baseline. The diagonal black line represents the line we would follow when improving upon the baseline without adding any overfitting (i.e. an improvement in train loss yields the same improvement in test loss). (a) represents the most desirable setting, in which we have relatively a lot of data compared to the model size. In this setting, AdEMAMix outperforms the baseline for all hyperparameters tested. If we increase the model size as in (b), most AdEMAMix hyperparameters improve upon the baseline. If we now decrease the dataset size, using an 86M model with 1.3M images as in (c), doing a total of 320 epochs, we do not observe an improvement from using AdEMAMix. See App. C.2 for classification accuracies and loss curves for those experiments.

Experimental setup. In this section we consider a different setting in which the data is now a limited resource, and we perform multiple epochs (e.g. 37 or 320).
We use two subsets of the ImageNet (Russakovsky et al., 2015) dataset: (i) the widely used ImageNet-1k subset, consisting of 1.3M images and 1,000 possible classes; and (ii) a filtered and pre-processed version of ImageNet-21k (Ridnik et al., 2021) containing 11M images corresponding to 10,450 classes. For each, we measure the test loss on a held-out test set. We use the ViT architecture (Dosovitskiy et al., 2021b) at two different scales: 24M and 86M parameters. Importantly, while it is common in the vision literature to pre-train large ViT models and finetune them on smaller datasets, this work focuses on pre-training optimization, and we therefore train and test on the same distribution. The models' hyperparameters are detailed in App. B.3; we use a batch size of 4,096 for all our experiments. We used the training hyperparameters from Dosovitskiy et al. (2021b) as a starting point and did some additional tuning of the learning rate, dropout, and weight decay for our AdamW baselines. We then use the hyperparameters of the best AdamW baseline and train AdEMAMix models with various values of α ∈ {1, 5, 10, 15, 20} and β3 ∈ {0.999, 0.9999}. We train models for 320 and 37 epochs for resp. the ImageNet-1k and ImageNet-21k datasets, corresponding in both cases to 100k iterations. Data-augmentation techniques have been shown to be central to the efficient training of ViTs (Touvron et al., 2021; 2022). We use simple data-augmentations, which include mixup (Zhang et al., 2018). We train on 8 A100 NVIDIA GPUs using Pytorch Fully Sharded Data Parallelism (Zhao et al., 2023, FSDP).

AdEMAMix for different capacity/data ratios. We consider three scenarios that differ by their capacity/data ratios. First, we trained 24M parameter models on 11M images (ImageNet-21k) for 37 epochs. In this setting, as can be seen in Fig. 6a, it is trivial to find AdEMAMix parameters outperforming the baseline in terms of both training and test accuracy. We now increase the model size to 86M parameters. Fig. 6b shows it is still easy to find parameters outperforming the baseline. We now keep the model size of 86M parameters and switch to the smaller ImageNet-1k dataset (1.3M images), which, given our 100k iterations, increases the number of epochs to 320. In this setting, Fig. 6c shows that outperforming the baseline becomes difficult. AdEMAMix seems to perform best in scenarios with large volumes of data w.r.t. the capacity of the model.

5 CONCLUSION

In this work, we find that old gradients can be leveraged to efficiently train large language models and ViTs. Our proposed optimizer combines two momentum terms. A slow (large β) momentum gathers information over many timesteps, while a fast (small β) momentum can adapt the trajectory of the iterates to the rapidly changing loss landscape. We demonstrate the superiority of our optimizer over AdamW through a set of experiments on text modeling and image classification. We moreover reveal how our optimizer forgets the training data at a slower pace.

REFERENCES

Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. In NAACL-HLT (1), pp. 2357-2367. Association for Computational Linguistics, 2019. Sumithra Bhakthavatsalam, Daniel Khashabi, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, and Peter Clark.
Think you have solved direct-answer question answering? try arc-da, the direct-answer AI2 reasoning challenge. Co RR, abs/2102.03315, 2021. Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: reasoning about physical commonsense in natural language. In AAAI, pp. 7432 7439. AAAI Press, 2020. Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle Mc Donell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. Gpt-neox-20b: An open-source autoregressive language model. Co RR, abs/2204.06745, 2022. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne. Jax: composable transformations of python+numpy programs, 2018. URL https://github.com/google/jax. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam Mc Candlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Neur IPS, 2020. Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. In Neur IPS, 2023a. Yineng Chen, Zuchao Li, Lefei Zhang, Bo Du, and Hai Zhao. Bidirectional looking with A novel double exponential moving average to adaptive and non-adaptive momentum optimizers. In ICML, volume 202 of Proceedings of Machine Learning Research, pp. 4764 4803. PMLR, 2023b. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. In NAACL-HLT (1), pp. 2924 2936. Association for Computational Linguistics, 2019. Together Computer. Redpajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/Red Pajama-Data. Published as a conference paper at ICLR 2025 Aaron Defazio. Understanding the role of momentum in non-convex optimization: Practical insights from a lyapunov analysis. Co RR, abs/2010.00406, 2020. URL https://arxiv.org/abs/ 2010.00406. Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, and Neil Houlsby. Scaling vision transformers to 22 billion parameters. In ICML, volume 202 of Proceedings of Machine Learning Research, pp. 7480 7512. PMLR, 2023. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 
BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pp. 4171 4186. Association for Computational Linguistics, 2019. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. Open Review.net, 2021a. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. Open Review.net, 2021b. Arthur Douillard, Qixuang Feng, Andrei A. Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc Aurelio Ranzato, Arthur Szlam, and Jiajun Shen. Diloco: Distributed lowcommunication training of language models. Co RR, abs/2311.08105, 2023. Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris Mc Connell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab Al Badawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, and Kevin Stone. The llama 3 herd of models. Co RR, abs/2407.21783, 2024. Nicolas Flammarion and Francis R. Bach. From averaging to acceleration, there is only a step-size. In COLT, volume 40 of JMLR Workshop and Conference Proceedings, pp. 658 695. JMLR.org, 2015. Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony Di Pofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac h, Haonan Li, Kyle Mc Donell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836. Published as a conference paper at ICLR 2025 Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In ECC, pp. 310 315. IEEE, 2015. Avrajit Ghosh, He Lyu, Xitong Zhang, and Rongrong Wang. 
Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent. In ICLR. Open Review.net, 2023. Justin Gilmer, Behrooz Ghorbani, Ankush Garg, Sneha Kudugunta, Behnam Neyshabur, David Cardoze, George Edward Dahl, Zachary Nado, and Orhan Firat. A loss curvature perspective on training instabilities of deep learning models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Oc KMT-36v Us. Gabriel Goh. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. URL http://distill.pub/2017/momentum. Akhilesh Gotmare, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. In ICLR (Poster). Open Review.net, 2019. Baptiste Goujaud, Adrien Taylor, and Aymeric Dieuleveut. Provable non-accelerations of the heavyball method. ar Xiv preprint ar Xiv:2307.11291, 2023. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. ar Xiv preprint ar Xiv:2312.00752, 2023. Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro von Werra, and Martin Jaggi. Scaling laws and compute-optimal training beyond fixed training durations. Co RR, abs/2405.18392, 2024. Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, Xinrong Zhang, Zhen Leng Thai, Kai Zhang, Chongyi Wang, Yuan Yao, Chenyang Zhao, Jie Zhou, Jie Cai, Zhongwu Zhai, Ning Ding, Chao Jia, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm: Unveiling the potential of small language models with scalable training strategies. Co RR, abs/2404.06395, 2024. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. In EMNLP/IJCNLP (1), pp. 2567 2577. Association for Computational Linguistics, 2019. Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, and Matt J. Kusner. No train no gain: Revisiting efficient training algorithms for transformer-based language models. In Neur IPS, 2023. Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In ICLR. Open Review.net, 2018. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015. Taku Kudo and John Richardson. Sentence Piece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66 71, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-2012. URL https://aclanthology.org/D18-2012. Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling. Co RR, abs/2403.13787, 2024. Guillaume Leclerc and Aleksander Madry. The two regimes of deep network training. Co RR, abs/2002.10376, 2020. Jaerin Lee, Bong Gyun Kang, Kihoon Kim, and Kyoung Mu Lee. Grokfast: Accelerated grokking by amplifying slow gradients. Co RR, abs/2405.20233, 2024. Published as a conference paper at ICLR 2025 Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. 
In ACL (1), pp. 3214 3252. Association for Computational Linguistics, 2022. Hong Liu, Zhiyuan Li, David Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. Co RR, abs/2305.14342, 2023. Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: A challenge dataset for machine reading comprehension with logical reasoning. In IJCAI, pp. 3622 3628. ijcai.org, 2020. Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR (Poster). Open Review.net, 2019. James Lucas, Shengyang Sun, Richard S. Zemel, and Roger B. Grosse. Aggregated momentum: Stability through passive damping. In ICLR (Poster). Open Review.net, 2019. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP, pp. 2381 2391. Association for Computational Linguistics, 2018. Patrick G. Mulloy. Smoothing data with faster moving averages. Technical Analysis of Stocks and Commodities Magazine, 12, 1994. Arkaddii S Nemirovskii and Yu E Nesterov. Optimal methods of smooth convex minimization. USSR Computational Mathematics and Mathematical Physics, 25(2):21 30, 1985. Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence o(1/k2). In Doklady Akademii Nauk SSSR, 1983. URL https://api. semanticscholar.org/Corpus ID:202149403. Hristo Papazov, Scott Pesme, and Nicolas Flammarion. Leveraging continuous time to understand momentum when training diagonal linear networks. In AISTATS, volume 238 of Proceedings of Machine Learning Research, pp. 3556 3564. PMLR, 2024. Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML (3), volume 28 of JMLR Workshop and Conference Proceedings, pp. 1310 1318. JMLR.org, 2013. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Neur IPS, pp. 8024 8035, 2019. Guilherme Penedo, Hynek Kydlíˇcek, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557. B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 3(4):864 878, 1964. ISSN 0041-5553. doi: https://doi.org/10.1016/0041-5553(63)90382-3. URL https://www.sciencedirect. com/science/article/pii/0041555363903823. Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145 151, 1999. ISSN 0893-6080. doi: https://doi.org/10. 1016/S0893-6080(98)00116-6. URL https://www.sciencedirect.com/science/ article/pii/S0893608098001166. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 8748 8763. PMLR, 2021. 
Published as a conference paper at ICLR 2025 Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses, 2021. H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951. Sebastian Ruder. An overview of gradient descent optimization algorithms. Co RR, abs/1609.04747, 2016. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Image Net Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115 (3):211 252, 2015. doi: 10.1007/s11263-015-0816-y. Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, pp. 8732 8740. AAAI Press, 2020. Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In COLT, volume 134 of Proceedings of Machine Learning Research, pp. 3935 3971. PMLR, 2021. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In ICML, volume 80 of Proceedings of Machine Learning Research, pp. 4603 4611. PMLR, 2018. Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick Le Gresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053. Guijin Son, Hanwool Lee, Sungdong Kim, Seungone Kim, Niklas Muennighoff, Taekyoon Choi, Cheonbok Park, Kang Min Yoo, and Stella Biderman. KMMLU: measuring massive multitask language understanding in korean. Co RR, abs/2402.11548, 2024. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Sanjoy Dasgupta and David Mc Allester (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 1139 1147, Atlanta, Georgia, USA, 17 19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/sutskever13.html. Balázs Szegedy, Domonkos Czifra, and Péter K orösi-Szabó. Dynamic memory based adaptive optimization, 2024. URL https://arxiv.org/abs/2402.15262. Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, volume 139 of Proceedings of Machine Learning Research, pp. 10347 10357. PMLR, 2021. Hugo Touvron, Matthieu Cord, and Hervé Jégou. Deit III: revenge of the vit. In ECCV (24), volume 13684 of Lecture Notes in Computer Science, pp. 516 533. Springer, 2022. 
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pp. 5998 6008, 2017. Published as a conference paper at ICLR 2025 Jianyu Wang, Vinayak Tantia, Nicolas Ballas, and Michael G. Rabbat. Slowmo: Improving communication-efficient distributed SGD with slow momentum. In ICLR. Open Review.net, 2020. Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. In NUT@EMNLP, pp. 94 106. Association for Computational Linguistics, 2017. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. Huggingface s transformers: State-of-the-art natural language processing. Co RR, abs/1910.03771, 2019. Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training BERT in 76 minutes. In ICLR. Open Review.net, 2020. Kun Yuan, Bicheng Ying, and Ali H. Sayed. On the influence of momentum acceleration on online learning. Journal of Machine Learning Research, 17(192):1 66, 2016. URL http: //jmlr.org/papers/v17/16-157.html. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL (1), pp. 4791 4800. Association for Computational Linguistics, 2019. Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, pp. 1204 1213. IEEE, 2022. Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR (Poster). Open Review.net, 2018. Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, and Zhi-Quan Luo. Adam can converge without any modification on update rules. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 28386 28399. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/ file/b6260ae5566442da053e5ab5d691067a-Paper-Conference.pdf. Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Diederik P Kingma, Yinyu Ye, Zhi-Quan Luo, and Ruoyu Sun. Adam-mini: Use fewer learning rates to gain more. 
In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=iBExhaU3Lc.
Rosie Zhao, Depen Morwani, David Brandfonbrener, Nikhil Vyas, and Sham M. Kakade. Deconstructing what makes a good optimizer for language models. CoRR, abs/2407.07972, 2024.
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Pritam Damania, Bernard Nguyen, Geeta Chauhan, Yuchen Hao, Ajit Mathews, and Shen Li. PyTorch FSDP: Experiences on scaling fully sharded data parallel, 2023. URL https://arxiv.org/abs/2304.11277.

1 Introduction 1
2 Related Work 3
3 Our method: Ad EMAMix 4
4 Results 8
  4.1 Transformer LLM Training 8
  4.2 Mamba LM Training 9
  4.3 Vi T Training 10
5 Conclusion 11
A Implementation details 18
  A.1 Deriving the β3 scheduler 18
  A.2 Pytorch implementation 19
  A.3 Optax implementation 20
B Experimental details 21
  B.1 Transformer LLM experiments 21
  B.2 Mamba LM experiments 23
  B.3 Vi T experiments 24
C Additional results 25
  C.1 LLM experiments 25
    C.1.1 In-Context Learning (ICL) results 25
    C.1.2 Results using 3B parameter models 26
    C.1.3 More results on forgetting 27
    C.1.4 Removing the second EMA mid-training 29
    C.1.5 Hyperparameter sensitivity 30
    C.1.6 Impact of training for fewer iterations 32
    C.1.7 Limitations of a single EMA 33
    C.1.8 Removing m1 by setting β1 = 0 35
    C.1.9 Comparing m1 + αm2 versus (1 − α)m1 + αm2 36
    C.1.10 Miscellaneous results 38
  C.2 Vi T experiments 40
  C.3 Comparison with other methods 41
    C.3.1 Comparison with Ad Meta and DEMA 41
    C.3.2 Comparison with Lion 42
    C.3.3 Adding a third momentum term (Agg Mo) 43

A IMPLEMENTATION DETAILS

A.1 DERIVING THE β3 SCHEDULER

Let's consider S(t), the sum of the weights given to the last t gradients by an EMA parameterized by β ∈ [0, 1[:

S(t) = (1 − β) Σ_{i=0}^{t} β^i = 1 − β^{t+1}.

We want to know which timestep t would correspond to a cumulative weight of 0.5:

(1 − β) Σ_{i=0}^{t} β^i = 0.5  ⇔  β^{t+1} = 0.5  ⇔  t = ln(0.5)/ln(β) − 1.

Let f(β) = ln(0.5)/ln(β) − 1. This function provides how many past consecutive gradients receive a cumulative weight of 0.5.
Its inverse is:

f⁻¹(t) = 0.5^(1/(t+1)).

We want a scheduler which increases β from βstart to βend such that f(β) increases linearly. Given an interpolating parameter µ ∈ [0, 1], this scheduler can be written as:

β(µ) = f⁻¹((1 − µ) f(βstart) + µ f(βend)).

By replacing f and f⁻¹ by their respective formulas, one arrives at:

β(µ) = exp( (ln(βstart) ln(βend)) / ((1 − µ) ln(βend) + µ ln(βstart)) ).

By setting βend = β3 and µ = t / Tβ3, we arrive at the β3-scheduler introduced in Section 3. We show the shape of our scheduler and compare it to a linear scheduler in Fig. 7.

Figure 7: Ad EMAMix's β3 scheduler. We compare our scheduler to a linear scheduler for βstart = 0.9 and βend ∈ {0.999, 0.9999}. While our scheduler looks more aggressive at first glance, it increases fast for small values of β and slowly for larger ones. This makes sense, as the same increase of β has a bigger impact when applied to a large β value than to a smaller one. The two linear schedulers look practically the same despite βend values differing by one order of magnitude; this is not the case with our scheduler.

A.2 PYTORCH IMPLEMENTATION

The following is a code skeleton for our Ad EMAMix optimizer in Pytorch (Paszke et al., 2019). The skeleton also includes a simple sketch of the α and β3 scheduler helpers it calls (a linear warmup for α, and the β3 schedule derived in App. A.1). The full implementation of Ad EMAMix in Pytorch is provided in the following repository: https://github.com/apple/ml-ademamix

Listing 1: Ad EMAMix code skeleton using Pytorch

import math
import torch
from torch.optim import Optimizer


def linear_scheduler(step, start, end, T):
    # Linear warmup from `start` to `end` over T steps; T == 0 disables the scheduler.
    if T == 0:
        return end
    mu = min(step / T, 1.0)
    return (1 - mu) * start + mu * end


def alpha_scheduler(step, start, end, T):
    # The alpha warmup is linear (sketch).
    return linear_scheduler(step, start, end, T)


def beta3_scheduler(step, start, end, T):
    # Increase beta3 such that f(beta) = ln(0.5)/ln(beta) - 1 grows linearly (see App. A.1).
    # Assumes 0 < start <= end < 1; T == 0 disables the scheduler.
    if T == 0:
        return end
    mu = min(step / T, 1.0)
    return math.exp((math.log(start) * math.log(end)) /
                    ((1 - mu) * math.log(end) + mu * math.log(start)))


class AdEMAMix(Optimizer):

    def __init__(self,
                 params,
                 lr=1e-3,
                 betas=(0.9, 0.999, 0.9999),
                 alpha=5.0,
                 T_beta3=0,
                 T_alpha=0,
                 eps=1e-8,
                 weight_decay=0.0):
        # Init the optimizer: store the hyperparameters as per-group defaults.
        defaults = dict(lr=lr, betas=betas, alpha=alpha, T_beta3=T_beta3,
                        T_alpha=T_alpha, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:

            lr = group["lr"]
            lmbda = group["weight_decay"]
            eps = group["eps"]
            beta1, beta2, beta3_final = group["betas"]
            T_beta3 = group["T_beta3"]
            T_alpha = group["T_alpha"]
            alpha_final = group["alpha"]

            for p in group["params"]:

                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]

                # State initialization
                if len(state) == 0:
                    state["step"] = 0                    # step counter used for bias correction
                    state["m1"] = torch.zeros_like(p)    # fast EMA
                    state["m2"] = torch.zeros_like(p)    # slow EMA
                    state["nu"] = torch.zeros_like(p)    # second moment estimate

                m1, m2, nu = state["m1"], state["m2"], state["nu"]

                # Bias correction: no correction for beta3's EMA
                state["step"] += 1
                bias_correction1 = 1 - beta1 ** state["step"]
                bias_correction2 = 1 - beta2 ** state["step"]

                # Calling the schedulers for alpha and beta3
                alpha = alpha_scheduler(state["step"], start=0, end=alpha_final, T=T_alpha)
                beta3 = beta3_scheduler(state["step"], start=beta1, end=beta3_final, T=T_beta3)

                # Update the EMAs
                m1.mul_(beta1).add_(grad, alpha=1 - beta1)
                m2.mul_(beta3).add_(grad, alpha=1 - beta3)
                nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

                # Compute step
                denom = (nu.sqrt() / math.sqrt(bias_correction2)).add_(eps)
                update = (m1.div(bias_correction1) + alpha * m2) / denom

                # Add weight decay
                update.add_(p, alpha=lmbda)

                # Apply the update scaled by the learning rate
                p.add_(-lr * update)

        return loss

A.3 OPTAX IMPLEMENTATION
The following is a code skeleton for our Ad EMAMix optimizer in Optax, an optimization library based on Jax (Bradbury et al., 2018). The full implementation of Ad EMAMix in Jax is provided in the following repository: https://github.com/apple/ml-ademamix

Listing 2: Ad EMAMix code skeleton using JAX and Optax

from typing import NamedTuple

import chex
import jax.numpy as jnp
from jax import tree_util as jtu
from optax._src import transform, combine, base, numerics
# Tree helpers (tree_zeros_like, tree_update_moment, ...) ship with recent optax
# versions; see the repository for a self-contained implementation.
from optax.tree_utils import (tree_zeros_like, tree_update_moment,
                              tree_update_moment_per_elem_norm, tree_bias_correction)


class ScaleByAdemamixState(NamedTuple):
    count: chex.Array
    count_m2: chex.Array
    m1: base.Updates
    m2: base.Updates
    nu: base.Updates


def ademamix(lr,
             b1=0.9,
             b2=0.999,
             b3=0.9999,
             alpha=5.0,
             b3_scheduler=None,
             alpha_scheduler=None,
             eps=1e-8,
             weight_decay=0.0):
    return combine.chain(
        scale_by_ademamix(b1, b2, b3, alpha, b3_scheduler, alpha_scheduler, eps),
        transform.add_decayed_weights(weight_decay),
        transform.scale_by_learning_rate(lr),
    )


def scale_by_ademamix(b1,
                      b2,
                      b3,
                      alpha,
                      b3_scheduler,
                      alpha_scheduler,
                      eps):

    def init_fn(params):
        m1 = tree_zeros_like(params)  # fast EMA
        m2 = tree_zeros_like(params)  # slow EMA
        nu = tree_zeros_like(params)  # second moment estimate
        return ScaleByAdemamixState(
            count=jnp.zeros([], jnp.int32),
            count_m2=jnp.zeros([], jnp.int32),
            m1=m1,
            m2=m2,
            nu=nu,
        )

    def update_fn(updates, state, params=None):
        del params
        c_b3 = b3_scheduler(state.count_m2) if b3_scheduler is not None else b3
        c_alpha = alpha_scheduler(state.count_m2) if alpha_scheduler is not None else alpha
        m1 = tree_update_moment(updates, state.m1, b1, 1)  # m1 = b1 * m1 + (1-b1) * updates
        m2 = tree_update_moment(updates, state.m2, c_b3, 1)
        nu = tree_update_moment_per_elem_norm(updates, state.nu, b2, 2)
        count_inc = numerics.safe_int32_increment(state.count)
        count_m2_inc = numerics.safe_int32_increment(state.count_m2)
        m1_hat = tree_bias_correction(m1, b1, count_inc)
        nu_hat = tree_bias_correction(nu, b2, count_inc)
        updates = jtu.tree_map(
            lambda m1_, m2_, v_: (m1_ + c_alpha * m2_) / (jnp.sqrt(v_) + eps),
            m1_hat,
            m2,
            nu_hat,
        )
        return updates, ScaleByAdemamixState(
            count=count_inc,
            count_m2=count_m2_inc,
            m1=m1,
            m2=m2,
            nu=nu,
        )

    return base.GradientTransformation(init_fn, update_fn)

B EXPERIMENTAL DETAILS

B.1 TRANSFORMER LLM EXPERIMENTS

Architecture details. Our architecture is based on the transformer decoder of Vaswani et al. (2017). We use learnt positional encodings. We use a SentencePiece (Kudo & Richardson, 2018) tokenizer with a vocabulary of 32000 tokens. The model specific parameters for the different sizes are summarized in Table 1. Our implementation uses Jax (Bradbury et al., 2018), and we train using bfloat16, except for the normalization modules and the softmax, which use float32. The optimizer states and operations are in float32.

Table 1: Model parameters for our LLM experiments.

Model parameters       110M    330M    1.3B
Hidden size             768    1024    2048
FFW expansion factor      4       4       4
Attention heads          12      16      16
Layers                   12      24      24

How did we tune the hyperparameters? Starting from our smallest models (110M parameters), we first tuned the hyperparameters for our Adam W models. We then used the best hyperparameters found as a starting point for Ad EMAMix and tuned β3 and α. When we use schedulers for α and β3, we set Tα = Tβ3 = T, with T being the total number of iterations.
Table 2 summarizes the hyperparameters we tried for this model size. The impact of Ad EMAMix s hyperparameters is discussed extensively in App. C.1.5. For our 330M parameter models, we mostly kept the best hyperparameters found for the 110M model and re-tuned the learning rate and gradient clipping. For Ad EMAMix, we additionally tested multiple β3 and α values. We summarize this process in Table 3. Finally, for our 1.3B parameter models, we re-iterated the same process, re-tuning only the learning rate and gradient clipping parameters for our Adam W runs. When trying to transfer the best learning rate found to Ad EMAMix, we found it to be too high, causing instabilities we couldn t fix using gradient clipping. For this reason, we also tuned the learning rate for Ad EMAMix for this model size. This process is described in Table 4. Table 2: Hyperparameter tuning for our 110M parameter LLM models. In this table we report the hyperparameters we tuned for our 110M parameter models. When multiple values are given, we bold the parameters we found to give the best results. Hyperparameter Value Learning rate η 0.005, 0.002, 0.001, 0.0005, 0.0001 Number of warmup steps 2000, 3000, 4000, 5000, 6000 Sequence length 1024 Weight decay λ 0.1, 0.0 Learning rate decay scheduler no-decay, cosine-decay Batch size 64 Gradient clipping None, 5.0, 1.0, 0.5 Adam W β1 0.9, 0.99, 0.999, 0.9999 Adam W β2 0.95, 0.999 Ad EMAMix β1 0.9 Ad EMAMix β2 0.999 Ad EMAMix β3 0.999, 0.9999, 0.99999 Ad EMAMix α 2, 4, 6, 8, 10, 15, 20 Hyperparameters for experiments switching from Adam W and Ad EMAMix. For experiments in Fig. 5b and Fig. 5c, when we switch from Adam W to Ad EMAMix during training, for our 110M parameter models (Fig. 5b), we use α = 2, β3 = 0.9999 and Tα,β3 = 0. For our 1.3B parameter Published as a conference paper at ICLR 2025 Table 3: Hyperparameter tuning for our 330M parameter LLM models. In this table we report the hyperparameters we tuned for our 330M parameter models. When multiple values are given, we bold the parameters we found to give the best results. Hyperparameter Value Learning rate η 0.001, 0.0005, 0.0001 Number of warmup steps 3000 Sequence length 1024 Weight decay λ 0.1 Learning rate decay scheduler cosine-decay Batch size 96 Gradient clipping 1.0, 0.5, 0.1 Adam W β1 0.9 Adam W β2 0.999 Ad EMAMix β1 0.9 Ad EMAMix β2 0.999 Ad EMAMix β3 0.999, 0.9999, 0.99999 Ad EMAMix α 5, 8, 10, 15 Table 4: Hyperparameter tuning for our 1.3B parameter LLM models. In this table we report the hyperparameters we tuned for our 1.3B parameter models. When multiple values are given, we bold the parameters we found to give the best results. Hyperparameter Value Number of warmup steps 3000 Sequence length 1024 Weight decay λ 0.1 Learning rate decay scheduler cosine-decay Batch size 128 Gradient clipping 1.0, 0.5, 0.1 Learning rate η for Adam W 0.001, 0.0005, 0.0003, 0.0001, 0.00005 Adam W β1 0.9 Adam W β2 0.999 Learning rate η for Ad EMAMix 0.0005, 0.0003 Ad EMAMix β1 0.9 Ad EMAMix β2 0.999 Ad EMAMix β3 0.999, 0.9999, 0.99999 Ad EMAMix α 1, 3, 5, 8, 10, 15 models (Fig. 5c), we use α = 1, β3 = 0.9999 and Tα,β3 = 0. Other hyperparameters are inherited from the Adam W model we are switching from. Hyperparameters used our constant learning rate scheduler experiments. For Fig. 3b we use η = 10 4, the remaining of the hyperparameters are identical as those used for our other 1.3B experiments and provided in Table 4. Hyperparameters used in our forgetting experiments. For the experiments in Fig. 
4, we used 110M parameter models and hyperparameters from Table 2. Published as a conference paper at ICLR 2025 B.2 MAMBA LM EXPERIMENTS Architecture details. We use a Mamba architecture (Gu & Dao, 2023) with an embedding dimension of 768, an expansion factor of 2, a state size of 16, and a depth of 24. We use a batch size of 120 sequences of 1024 tokens. We use the Eleuther AI/gpt-neox-20b tokenizer from Black et al. (2022), using the Hugging Face library (Wolf et al., 2019). Hyperparameter details. We used parameters mostly taken from Gu & Dao (2023): Learning rate: η = 0.0006, Warmup steps: 3k, η-scheduler type: cosine decay to 10 5, Weight-decay: λ = 0.1, Gradient clipping value: 1, Total training steps T {64k, 128k}. For experiments with Adam W, we use β1 = 0.9 and β2 = 0.999. For Ad EMAMix, we use β1 = 0.9, β2 = 0.999, β3 = 0.9999, Tα,β3 = T and α = 8. Published as a conference paper at ICLR 2025 B.3 VIT EXPERIMENTS Architecture details. We use a Vi T architecture following Dosovitskiy et al. (2021b). The architecture details for both our 24M and 86M parameter models can be seen in Table 5. We do not use an EMA of the iterates as our final model. Our implementation is in Pytorch (Paszke et al., 2019), and use bfloat16. Table 5: Model parameters for our Vi T experiments. Model parameters 24M 86M Patch size 16 16 Number of patches 14 14 14 14 Image size 224 224 Embedding dim. 384 768 MLP dim. 1536 3072 Layers 12 12 Number of heads 6 12 Data augmentation and Mixup. While more sophisticated augmentation methods exist (Touvron et al., 2022), we simply use random-resized cropping, random horizontal flip (p = 0.5), and random color jitter (applied with a probability p = 0.8). How did we tune the hyperparameters? For each model size, we started by tuning the hyperparameters for the Adam W baseline. The hyperparameters we tuned and the values we retained for our three different settings are in Table 6. For our experiments on Image Net-1k, given that all models overfit the dataset, we select the best model according to the minimum validation loss, akin to using early stopping. For Ad EMAMix, for each setting, we use the hyperparameters of the best Adam W model and only tune α and β3. For each setting, we train 8 models using α {1, 5, 10, 15, 20} and β3 {0.999, 0.9999}. All of the Ad EMAMix models are shown in Fig. 6 and Fig. 31. Only the best Adam W model is shown in Fig. 31. Table 6: Hyperparameters tuning for our Vi T Adam W experiments. We started our hyperparameter search from the values given in (Dosovitskiy et al. (2021b), Table 3), which recommends using (learning-rate, weight decay, dropout)= (0.003, 0.3, 0.1) when training on Image Net-1k and (0.001, 0.03, 0.1) when training on Image Net-21k. The various hyperparameters we experimented with are in the following table, values which gave us the lowest validation loss are in bold. For each setting, Ad EMAMix experiments use the same hyperparameters as the best Adam W baselines. Setting 24M params, 11M images 86M params, 11M images 86M params, 1.3M images (Image Net-21k) (Image Net-21k) (Image Net-1k) Learning rate 0.001, 0.003, 0.005 0.001, 0.003 0.001, 0.003, 0.005 Weight decay 0.03, 0.1, 0.3 0.03, 0.1, 0.3 0.03, 0.1, 0.3, 0.5, 0.7 Dropout 0.1 0.1 0.1 Adam W β1 0.9, 0.99, 0.999 0.9, 0.99, 0.999 0.9, 0.99, 0.999 Batch size 4096 4096 4096 Published as a conference paper at ICLR 2025 C ADDITIONAL RESULTS C.1 LLM EXPERIMENTS C.1.1 IN-CONTEXT LEARNING (ICL) RESULTS In-Context Learning (ICL) results. 
We evaluate our largest (1.3B) LLM models on in-context learning tasks. We use the lm-eval package (Gao et al., 2023). We consider the following tasks: Hella Swag (Zellers et al., 2019), Winogrande (Sakaguchi et al., 2020), ARC (Bhakthavatsalam et al., 2021), Bool Q (Clark et al., 2019), Logi QA (Liu et al., 2020), Math QA (Amini et al., 2019), MMLU (Son et al., 2024), Openbook QA (Mihaylov et al., 2018), PIQA (Bisk et al., 2020), Pubmed QA (Jin et al., 2019), Reward Bench (Lambert et al., 2024), Sciq (Welbl et al., 2017), Truthful QA (Lin et al., 2022). The two Ad EMAMis and Adam W models we benchmark have been trained for 1M steps, with a batch size of 128, corresponding to around 131B tokens. The results of the evaluation are in Table 7. We observe how the model trained with Ad EMAMix is outperforming the Adam W model on nearly all of the tasks, sometimes by a large margin for e.g. Pubmed QA. In addition to Table 7, we track the evolution of some of those scores during training for MMLU, Reward Bench, ARC-Challenge, and ARC-Easy, results can be seen in Fig. 8. For that figure, we modify the way MMLU is evaluated. Instead of appending all the multiple answers, we concatenate the question prompt and each answer separately, and pick the most likely answer. We found it allows a much better comparison of smaller models, which otherwise fluctuates around random guessing. Table 7: In-context learning results for our 1.3B parameter models. We compare Adam W and Ad EMAMix models trained on 131B tokens. We use the lm-eval package. Task Adam W Ad EMAMix Task Adam W Ad EMAMix ARC-Challenge 0.262 0.274 Openbook QA 0.240 0.238 ARC-Easy 0.612 0.619 PIQA 0.715 0.715 Bool Q 0.569 0.576 Pubmed QA 0.556 0.632 Hella Swag 0.426 0.436 Reward Bench 0.569 0.573 Logi QA 0.235 0.225 Reward Bench (reasoning) 0.617 0.630 Math QA 0.226 0.236 Sciq 0.903 0.907 Winogrande 0.563 0.580 Truthful QA 0.361 0.352 MMLU 0.244 0.248 15B 75B 131B Tokens Adam W Ad EMAMix 15B 75B 131B Tokens Adam W Ad EMAMix (b) Reward Bench. 15B 75B 131B Tokens Adam W Ad EMAMix (c) ARC-Challenge. 15B 75B 131B Tokens Adam W Ad EMAMix (d) ARC-Easy. Figure 8: In-context learning results for our 1.3B models. We periodically measure the performances of Adam W and Ad EMAMix models on in-context learning tasks. Except for a couple exceptions, Ad EMAMix is performing better than Adam W. Published as a conference paper at ICLR 2025 C.1.2 RESULTS USING 3B PARAMETER MODELS Experimental setup. We train 3B parameter models on Red Pajama v2 data. We use a learning rate of η = 0.0003, a weight decay λ = 0.1, a gradient clipping value of 0.1, a batch size of 1024. The sequence length is 1024. We use a cosine η decay and 3000 steps of warmup. For Ad EMAMix we used β3 {0.9999, 0.99995} and β1 {0.0, 0.9}. For Adam W, we use β1 = 0.9. Both methods use β2 = 0.999. We train models for 477k iterations, on a total of 500B tokens. Results. Results are shown in Fig. 9. We observe that Ad EMAMix is outperforming the Adam W baseline. Moreover, using β1 = 0.0 in Ad EMAMix is not deteriorating the results. More experiments are required to estimate how many additional steps would be needed for the Adam W baseline to reach the same performances as Ad EMAMix. 0 477k 645k Iterations Adam W Ad EMAMix β1 = 0.9 Ad EMAMix β1 = 0.0 (a) β3 = 0.9999, β1 {0.0, 0.9}. 0 477k Iterations Adam W Ad EMAMix (b) β3 = 0.99995, β1 = 0.9. Figure 9: Results using 3B parameter models. For Adam W models, we use β1 = 0.9, β2 = 0.999. 
In (a) we compare the Adam W baseline to Ad EMAMix models trained with β3 = 0.9999 and β1 {0.0, 0.9}. Both Ad EMAMix models reach a similar final loss and outperform the baseline. The convergence to the solution is slower at the beginning for β1 = 0.0. When β1 = 0.0, the memory footprint of Ad EMAMix is the same as for Adam W, meaning that we can get the improvement in loss at no extra cost. In (b), we train an Ad EMAMix model with a larger β3 = 0.99995. While we observe a lower loss in the early training phase, the final loss is similar to the one obtained with β3 = 0.9999. Ad EMAMix reaches the same loss as an Adam W model trained for 168k more steps. Published as a conference paper at ICLR 2025 C.1.3 MORE RESULTS ON FORGETTING Evolution of forgetting during training. In Fig. 4 in the main paper, we follow the loss on one specific batch over the entire training. In the following experiment, we do a closeup on the forgetting of different batches as training progresses. The goal being to visualize how later batches are ultimately more remembered, and compare the forgetting profiles of Adam W and Ad EMAMix. For this, in Fig. 10a and Fig. 10b, we follow fixed batches at different stages of the training process. Every 10k iterations, we measure the loss over a specific batch B before and after training on that batch. 0 5000 10000 Iterations after training on the batch Normalized loss (a) Adam W. 0 5000 10000 Iterations after training on the batch Index of tracked batch (b) Ad EMAMix. Figure 10: Comparing forgetting between Adam W and Ad EMAMix. In (a) and (b), we follow over 10k iterations the loss of training batches Bτ for τ {30k, 40k, . . . , 250k}. We measure the loss right before training on Bτ, and then monitor this loss over the next 10k steps. For instance, the most yellow curve represents the loss on batch B250k that we track for the last 6k iterations of training (we train for 256k iterations). The loss is normalized such that the loss for Bτ is 0 and the one at time τ + 50 is 1. This allows us to overlap curves and observe the tendency of the model to forget batches throughout training. We observe that Ad EMAMix improves its loss on training batches Bτ over many timesteps, a striking difference compared to Adam W, which has its loss on Bτ increasing faster. Moreover likely due to the learning rate scheduler the models forget significantly slower at the end of training. In this later training stage (τ > 180k), the difference between the two optimizers is especially pronounced. While this visualization allows to efficiently compare forgetting at different stages of training, the normalization prevents us from comparing the drop in loss between Ad EMAMix and Adam W. To compare loss values, see Fig. 4. Why are training batches forgotten more at the end of training? Given the previous observation about how both Adam W and Ad EMAMix models are forgetting the training batches at different paces given different stages of training, one can wonder what is causing this phenomenon? To answer this question, we contrast the results of Fig. 4 with results obtained using a different scheduler with a constant learning rate and a linear decay. We train 110M parameter models for 300k iterations, using a max learning rate of 0.001 and a batch size of 64. Results for those experiments are in Fig. 11. Those results indicate that the decaying learning rate might be the main parameter controlling how much a batch is remembered during training. 
This has interesting implications when selecting which learning rate decay to use. A cosine decay, with a rather long period of decay, might remember more than a constant learning rate scheduler with only a small number of steps of linear decay at the end. Published as a conference paper at ICLR 2025 70k 230k 300k Loss on batch B 70k 230k 300k Loss on batch B (a) t B = 230k. (b) t B = 270k. (c) t B = 280k. Adam W training without B Adam W training using B at t = t B t B Ad EMAMix training without B Ad EMAMix training using B at t = t B t B (d) t B = 290k. Figure 11: Measuring forgetting using a held-out batch B, using a constant η-scheduler with linear decay. The top row is for Adam W, the bottom row is for Ad EMAMix. The setup is similar to the one from Fig. 4, except that we now use a different learning rate scheduler. We use 3000 steps of linear warmup, them 247k steps of constant learning rate η = 0.001, and finally 50k steps of linear decay (represented by the shaded green area in the above plots). In Fig. 4, we observed that the batches used later during training are remembered more by both Adam W and Ad EMAMix. Multiple hypotheses could explain this phenomenon: (i) it could be that the learning rate decay is the main parameter controlling forgetting, or (ii) it could be that a sufficient number of steps are required to reach a part of the optimization landscape where gradients are more orthogonal and baches are remembered. The present experiment refutes that second hypothesis. Indeed, before the start of the η-decay (as seen in (a)), training batches are not remembered at the end of training. They only start to be remembered when the decay periods starts, as seen in (b,c,d). Moreover, we can make the same observation as in Fig. 4: compared to Adam W, Ad EMAMix models tend to keep the training data longer in memory. Published as a conference paper at ICLR 2025 C.1.4 REMOVING THE SECOND EMA MID-TRAINING Removing the second EMA mid-training. In Fig. 5b and Fig. 5c, we looked at what is happening when we switch from Adam W to Ad EMAMix in the middle of training. In this section, we will look at the opposite conversion: removing the m2 parameter of Ad EMAMix during training, effectively switching back to Adam W. Results for this experiment can be seen in Fig. 12. Right after the switch, we observe a drop of the loss, followed by an increase and finally back to convergence. Ultimately, the final loss value is in between the ones obtained training only using Adam W and only using Ad EMAMix. 0 50000 100000 150000 200000 250000 Iterations tswitch Adam W Ad EMAMix α = 10.0, β3 = 0.9999, Tα = Tβ3 = 256k Ad EMAMix α = 10.0, β3 = 0.9999, Tα = Tβ3 = 90k Figure 12: Switching from Ad EMAMix to Adam W at tswitch = 128k. Using our Ad EMAMix optimizer, we train two 110M parameters models with different Tα = Tβ3 {90k, 256k}. For historical reasons, those experiments were done using a batch size of 32 instead of 64 (as we used for all the other 110M parameter experiments in this paper). We switch from Ad EMAMix to Adam W at tswitch = 128k. At first, the removal of the +αm2 contribution in the update is suddenly decreasing the effective learning rate, causing the loss to drop suddenly. The loss then increases slightly and ends up in between the Adam W only (blue curve) and Ad EMAMix only (grey curves) losses. Published as a conference paper at ICLR 2025 C.1.5 HYPERPARAMETER SENSITIVITY Hyperparameter sensitivity. 
Depending on whether the α and β3 schedulers are used, Ad EMAMix adds up to 4 new hyperparameters: α, β3, Tα and Tβ3. In all our experiments we tied Tα = Tβ3 = Tα,β3. In this section we analyze the sensitivity of Ad EMAMix to those hyperparameters. We study the impact of α and β3 in Fig. 13, revealing wide ranges of values for which hyperparameters are outperforming the Adam W baseline. We study the sensitivity of Tα,β3 as well as justify the choice of using a scheduler on α in Fig. 14. When training from scratch Tα,β3 needs simply to be large enough to avoid early divergence. In Fig. 16 we study the sensitivity to the gradient clipping and Ad EMAMix s β1 value. While gradient clipping can help stabilize training and smooth the loss curves, it has little impact over the final loss value. Reducing the value of β1, we observe some loss spikes, yet the final loss value is relatively unaffected. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix β3 = 0.9999, α = 2 Ad EMAMix β3 = 0.9999, α = 6 Ad EMAMix β3 = 0.9999, α = 10 Ad EMAMix β3 = 0.9999, α = 15 Ad EMAMix β3 = 0.9999, α = 20 Ad EMAMix β3 = 0.9999, α = 30 Ad EMAMix β3 = 0.9999, α = 40 (a) Sensitivity to α. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix β3 = 0.99, α = 8 Ad EMAMix β3 = 0.999, α = 8 Ad EMAMix β3 = 0.9999, α = 8 Ad EMAMix β3 = 0.99999, α = 8 Ad EMAMix β3 = 0.999995, α = 8 (b) Sensitivity to β3. Figure 13: Sensitivity of Ad EMAMix to α and β3. We test the sensitivity of 110M parameter models to the values of α and β3. In (a), we vary α {2, 6, 10, 15, 20, 30, 40} while keeping all other parameters equal (β3 = 0.9999, Tα,β3 = 256k, clipping = 0.5, η = 0.001, λ = 0.1). We see that the larger the α, the more significant is the loss decrease early during training, yet this does not necessarily translates into a better final loss. The last iterate s loss is larger for large values of α. For this model, a sweet spot seems to be reach for α = 10. Moreover, the span of values resulting in good results is very large. In (b), we now vary β3 {0.99, 0.999, 0.9999, 0.99999, 0.999995}. We first observe that except for β3 = 0.999995 all those values are outperforming the Adam W baseline. Moreover, we observe that there is a sweet spot, at β3 = 0.9999, after which the final loss increases. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix Tα = Tβ3 = 10k Ad EMAMix Tα = Tβ3 = 30k Ad EMAMix Tα = Tβ3 = 50k Ad EMAMix Tα = Tβ3 = 70k Ad EMAMix Tα = Tβ3 = 90k Ad EMAMix Tα = Tβ3 = 128k (a) Sensitivity to Tα = Tβ3 = Tα,β3. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix Tβ3 = 256k, Tα = 0 Ad EMAMix Tβ3 = 256k, Tα = 10k Ad EMAMix Tβ3 = 256k, Tα = 50k Ad EMAMix Tβ3 = 256k, Tα = 70k Ad EMAMix Tβ3 = 256k, Tα = 256k (b) Sensitivity to Tα. Figure 14: Sensitivity of Ad EMAMix to Tα and Tβ3. For all the Ad EMAMix experiments in this figure, we used β1 = 0.9, β2 = 0.999, β3 = 0.9999, α = 10. In (a), we test the sensitivity of 110M parameter models to the values of Tα and Tβ3. We tied the two values as in all the experiments in the main paper: Tα = Tβ3 = Tα,β3. We observe that when not enough warmup steps are used, the iterates either diverge or converge to slightly worse loss values. In (b) we set Tβ3 = 256k and only vary Tα. We observe that Tα needs to be large enough to avoid divergence. This demonstrates the necessity of using a warmup on α. This being said, the necessity of increasing α linearly depends on the value of α. 
Using a smaller α value can work with Tα = 0, but using an α-scheduler alleviates any instability issue and increases the range of optimal α values (see Fig. 15). Moreover, in other experiments, we observe how not using schedulers yields detrimental large updates in the early training phase, see App. C.1.10, Fig. 28. Published as a conference paper at ICLR 2025 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix Tβ3 = 256k, Tα = 0, α = 2 Ad EMAMix Tα,β3 = 256k, α = 2 Ad EMAMix Tβ3 = 256k, Tα = 0, α = 6 Ad EMAMix Tα,β3 = 256k, α = 6 (a) No α-scheduler (110M). 0 50000 100000 150000 200000 250000 Iterations Ad EMAMix Tα = 0, α = 1 Ad EMAMix Tα = 0, α = 3 Ad EMAMix Tα = 0, α = 5 Ad EMAMix Tα = 256k, α = 5 (b) No α-scheduler (1.3B). Figure 15: The α-scheduler reduces the sensitivity to α. We compare Ad EMAMix experiments with and without α-scheduler. The experiments with α-scheduler are more stable. In (b), for 1.3B parameter models, only α = 1 converges in a stable fashion (the purple curve with α = 5 diverged early), and reaches a final loss slightly worse than the Ad EMAMix model trained using an α-scheduler over 256k steps. In (a), for smaller 110M parameter models, we observe (i) that it is possible to converge without α-scheduler for small α values, as well as (ii) that it seems easier to reach good loss values with a scheduler (the dotted lines have a lower final loss on average). 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix grad-clip= 0.1, α = 8 Ad EMAMix grad-clip= 1.0, α = 8 Ad EMAMix grad-clip= 5.0, α = 8 Ad EMAMix grad-clip= 0.1, α = 2 Ad EMAMix grad-clip= 1.0, α = 2 Ad EMAMix grad-clip= 5.0, α = 2 (a) Sensitivity to gradient clipping. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix β1 = 0.0 Ad EMAMix β1 = 0.1 Ad EMAMix β1 = 0.3 Ad EMAMix β1 = 0.5 Ad EMAMix β1 = 0.7 Ad EMAMix β1 = 0.9 Ad EMAMix β1 = 0.99 (b) Sensitivity to β1. Figure 16: Sensitivity of Ad EMAMix to β1 and the gradient clipping value. Unless specified in the caption, all the Ad EMAMix experiments in this figure used β1 = 0.9, β2 = 0.999, β3 = 0.9999, α = 10, Tα,β3 = 256k. In (a) we vary the amount of gradient clipping used during training. We notice how gradient clipping can help to smooth the curves, yet bringing only minor improvements. In (b) we perform a sweep over β1 {0, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99}. The value β1 = 0.0 is especially interesting as it allows us to remove entirely m1, therefore saving memory. We observe that smaller β1 values yield noisier curves, with multiple loss spikes. At the 110M parameter scale, those spike do not significantly impact the convergence, we conjecture that they could become a problem at larger scales. We also notice that slightly better losses could be obtained using smaller β1 values (e.g. β1 = 0.3), we keep β1 = 0.9 in all of our experiments for the ease of comparison with Adam W. Published as a conference paper at ICLR 2025 C.1.6 IMPACT OF TRAINING FOR FEWER ITERATIONS Sensitivity to the number of iterations. As we rely on old gradients, on question that arises is whether Ad EMAMix would still perform well when reducing the number of iterations. In this section we compare the loss obtained by 110M parameter models when halving the number of iterations and doubling the batch size, in such a way that the number of tokens consumed for training is always the same (17B). In Fig. 17, we observe that decreasing the number of iterations too much increases the final loss of the model. This effect is more pronounced for Ad EMAMix. 
However, at both 32k and 64k iterations, Ad EMAMix still outperforms Adam W. In Fig. 18 we observe that reducing β3 can mostly alleviate the problem. 0 5B 10B 17B Tokens Adam W 256k steps, bs = 64 Adam W 128k steps, bs = 128 Adam W 64k steps, bs = 256 Adam W 32k steps, bs = 512 (a) Adam W with fewer steps. 0 5B 10B 17B Tokens Ad EMAMix 256k steps, bs = 64 Ad EMAMix 128k steps, bs = 128 Ad EMAMix 64k steps, bs = 256 Ad EMAMix 32k steps, bs = 512 (b) Ad EMAMix with fewer steps. 32k 64k 128k 256k Number of training iterations Adam W 17B Ad EMAMix 17B (c) Comparing the final loss. Figure 17: Impact of using fewer iterations on Ad EMAMix vs. Adam W. We train multiple 110M parameter models. All models are trained on 17B tokens, but using different batch sizes (bs {64, 128, 256, 512}), which results in different numbers of iterations: {32k, 64k, 128k, 256k}. For both Adam W and Ad EMAMix, we use hyperparameters from Table 2 (β3 = 0.9999 for Ad EMAMix). In (a), we observe a drop in performance when reducing the number of steps, as the loss of the final iterates is higher for the model trained for 32k steps compared to the ones trained for more iterations. In (b) we observe a similar story for Ad EMAMix models, with a slightly more pronounced performance degradation. Yet, as it can be seen in (c), the gap between Ad EMAMix and Adam W still remains, even when training for 32k iterations. Importantly, this experiment does not suggest that Ad EMAMix is performing worse when using larger batches. In our Vi T experiments, we successfully use a batch size of 4096. 0 5B 10B 17B Tokens Ad EMAMix 256k steps, bs = 64 Ad EMAMix 128k steps, bs = 128 Ad EMAMix 64k steps, bs = 256 Ad EMAMix 32k steps, bs = 512 (a) Ad EMAMix with fewer steps β3 = 0.9999 (same as Fig. 17b). 0 5B 10B 17B Tokens Ad EMAMix 256k steps, bs = 64 Ad EMAMix 128k steps, bs = 128 Ad EMAMix 64k steps, bs = 256 Ad EMAMix 32k steps, bs = 512 (b) Ad EMAMix with fewer steps β3 = 0.999. 32k 64k 128k 256k Number of training iterations Adam W 17B Ad EMAMix 17B, β3 = 0.9999 Ad EMAMix 17B, β3 = 0.999 (c) Comparing the final loss. Figure 18: Reducing β3 can work better with low numbers of iterations. We take the same experimental protocol as Fig. 17 but this time we compare training Ad EMAMix models with β = 0.999 and β = 0.9999. In (a,b) we observe how using a smaller β3 = 0.999 solves in a large part the problem of decreasing performances when using a small number of iterations. In (c), while final loss for T {128k, 256k} is lower using β3 = 0.9999, using β3 = 0.999 is better for T {32k, 64k}. Using β3 = 0.999, the gap between Ad EMAMix and Adam W increases as the number of iterations decreases. Published as a conference paper at ICLR 2025 C.1.7 LIMITATIONS OF A SINGLE EMA Results on a 2D toy example. A natural question that arises from our method is whether it is possible to obtain the same results without the additional EMA. In this section we aim to convince the reader that the two EMAs are required. First, we propose to study a small toy 2D optimization problem: (x , y ) arg min (x,y) f(x, y), with f(x, y) = 8(x 1)2 (1.3x2 + 2x + 1) + 0.5(y 4)2 This function was introduced by Liu et al. (2023) as a function with sharp curvature along the x-axis, and flatter curvature along the y-axis. Initializing the first iterate x(0) = (0.3, 1.5), we start optimizing with (i) Adam with a large β1 = 0.999, and (ii) Ad EMAMix with β1 = 0.9 and β3 = 0.999. In both cases β2 = 0.999. 
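To make this toy setup easy to reproduce, below is a minimal Python sketch of the 2D objective and its analytic gradient. This is not the code used for Fig. 19: the product form of the x-term and the location of the local minimum are our reading of the expression above, and the optimizers themselves are omitted.

import numpy as np

def f(x, y):
    # Toy objective from Liu et al. (2023): sharp curvature along x, flat along y.
    return 8 * (x - 1) ** 2 * (1.3 * x ** 2 + 2 * x + 1) + 0.5 * (y - 4) ** 2

def grad_f(x, y):
    # Analytic gradient; the x-derivative factors as 8 * x * (x - 1) * (5.2 * x + 3.4).
    return np.array([8 * x * (x - 1) * (5.2 * x + 3.4), y - 4])

# Global minimum at (1, 4); under this form, a local minimum sits near x = -3.4 / 5.2.
print(f(1.0, 4.0), grad_f(1.0, 4.0))  # 0.0 and [0, 0] at the global optimum
print(grad_f(0.3, 1.5))               # gradient at the initial iterate x(0) = (0.3, 1.5)

Plugging grad_f into the Adam or Ad EMAMix update rules, with the momentum buffers initialized as described next, reproduces the kind of trajectories shown in Fig. 19.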
To make the experiment interesting, we initialize the EMA buffers for both methods to either ( 3, 0) or ( 0.8, 3). This has for effect to give an initial "speed" to the first iterate. As a result of this speed, Adam with a large β1 is unable to correct his trajectory. In contrast, using two EMAs, one with a small, and one with a large β as in Ad EMAMix converges to the solution. The fast changing EMA can correct for the bias introduced by the slow changing EMA. See Fig. 19 for results. Adam β1 = 0.999 Ad EMAMix β1 = 0.9, β3 = 0.999 (a) m(0) 1 = m(0) 2 = [ 3, 0]. Adam β1 = 0.999 Ad EMAMix β1 = 0.9, β3 = 0.999 (b) m(0) 1 = m(0) 2 = [ 0.8, 3]. Figure 19: Showing the necessity of a second momentum term on a toy example. In this experiment, we use Adam and Ad EMAMix to optimize a simple function (same function as in Fig.2 in Liu et al. (2023), is a local optimum). We want to understand what is preventing the use of large β1 values in Adam, and how Ad EMAMix is overcoming this problem. Starting from x(0) = [0.3, 1.5], we initialize m1 (resp. m1 and m2) for Adam (resp. Ad EMAMix) to either [ 3, 0] or [ 0.8, 3] shown with black arrows (not to scale). ν(0) is still initialized to 0. This is similar to giving an initial speed to the iterate. In both cases (a) and (b), the iterates following the Adam trajectory fail to converge in the given number of iterations. This is due to the large β1 which requires many hundreds of iterations to significantly alter m1. As a result, the iterates mostly follow the initial direction imposed by m(0) 1 (ν is still used to scale m1). In contrast, Ad EMAMix converges to the solution. The fast (smaller β) EMA of Ad EMAMix corrects the trajectory of the slow (high β) EMA. Interestingly, Ad EMAMix initially overshoots the optimum. It has to wait for the influence of m(0) 2 to fade away before finally converging to the solution. While this experiment does not explain why a larger momentum would be beneficial, it illustrates how using two EMAs can enable the use of large β values. A better intuition behind why a large momentum could be beneficial is presented in Fig. 2. Results training 110M parameters LMs. To further demonstrate that simply increasing β1 in Adam W does not provide nearly the same gains as Ad EMAMix, we run several additional experiments. In Fig. 20 we show what we obtain when training from scratch using a single EMA with a β1 {0.99, 0.999, 0.9999, 0.99999}. Naively increasing β1 in Adam W does not work; adding a scheduler on β1 to smoothly increase its value during the entirety of the training also fails. In Fig. 21, we show results when increasing β1 in the middle of training, with and without scheduler on β1. This differs from the previous setting as we bypass the initial training phase capable of causing instabilities (see Published as a conference paper at ICLR 2025 3), as well as bypass the initial iterations during which the bias correction done by Adam W can have an impact. Here again we observe increasing the β1 value does not provide any noticeable gain. 0 50000 100000 150000 200000 250000 Iterations Adam W β1 = 0.9 Adam W β1 = 0.99 Adam W β1 = 0.999 Adam W β1 = 0.9999 Adam W β1 = 0.99999 (a) Vanilla Adam W. 0 50000 100000 150000 200000 250000 Iterations Adam W β1 = 0.9, β2 = 0.999 Adam W β1 = 0.99, β2 = 0.999 Adam W β1 = 0.999, β2 = 0.999 Adam W β1 = 0.9999, β2 = 0.999 Adam W β1 = 0.99995, β2 = 0.999 Adam W β1 = 0.99999, β2 = 0.999 Adam W β1 = 0.99999, β2 = 0.9999 (b) Adam W with β1-scheduler. Figure 20: Increasing β1 in Adam W. 
In (a) we perform a simple sweep over the β1 hyperparameter of Adam W, we observe how the optimizer is unable to cope with large values of β1, diverging for β1 > 0.999. In (b) we modify Adam W to incorporate the tricks we used in Ad EMAMix, i.e. we add a scheduler on β1. While this prevents divergence, the final loss is getting worse as we increase β1. We had to increase β2 to stabilise the training when using β1 = 0.99999. Those experiments support the importance of the additional EMA in Ad EMAMix. 200000 250000 300000 350000 400000 450000 500000 Adam W β1 = 0.9 from scratch Adam W β1 = 0.99 Adam W β1 = 0.999 Adam W β1 = 0.9999 Adam W β1 = 0.99999 (a) Increasing β1 without warmup. 200000 250000 300000 350000 400000 450000 500000 Adam W β1 = 0.9 from scratch Adam W β1 = 0.99, Tβ1 = 200k Adam W β1 = 0.999, Tβ1 = 200k Adam W β1 = 0.9999, Tβ1 = 200k Adam W β1 = 0.99999, Tβ1 = 200k (b) Increasing β1 with warmup. Figure 21: Increasing β1 in Adam W later during training. In (a) we load an Adam W checkpoint and resume training using Adam W with a larger β1. We observe no gain over the baseline from increasing the value of β1. In (b), we add a scheduler over β1 (similar to the scheduler over β3 for Ad EMAMix) to smooth the transition between β1 = 0.9 and larger values. We set Tβ1 = 200k iterations. Despite the smoother transition, no gain over the baseline is observed. Published as a conference paper at ICLR 2025 C.1.8 REMOVING m1 BY SETTING β1 = 0 Effect of the batch size when β1 = 0. In this section we investigate the impact of setting β1 = 0, which allows to replace m(t) 1 by the current gradient g(t), saving memory. In this scenario the memory complexity of Ad EMAMix is the same as the one of Adam W. We noticed in Fig. 16b how reducing β1 can make the training slightly more unstable, triggering some loss spikes. In Fig. 22 we study the impact of the batch size on those instability, revealing that β1 = 0 is less stable when using larger batch sizes. However, despite the spikes, the final loss for β1 = 0 and β1 = 0.9 are very similar β1 = 0.9 yielding slightly better results for smaller batch sizes. 0 128k 256k Iterations Ad EMAMix batch-size= 4, β1 = 0 Ad EMAMix batch-size= 4, β1 = 0.9 Ad EMAMix batch-size= 8, β1 = 0 Ad EMAMix batch-size= 8, β1 = 0.9 Ad EMAMix batch-size= 16, β1 = 0 Ad EMAMix batch-size= 16, β1 = 0.9 Ad EMAMix batch-size= 32, β1 = 0 Ad EMAMix batch-size= 32, β1 = 0.9 Ad EMAMix batch-size= 64, β1 = 0 Ad EMAMix batch-size= 64, β1 = 0.9 Ad EMAMix batch-size= 128, β1 = 0 Ad EMAMix batch-size= 128, β1 = 0.9 Figure 22: Effect of batch size with β1 = 0. For a 110M parameter model, we use β1 {0.0, 0.9} and vary the batch-size {4, 8, 16, 32, 64, 128}. Lighter curves use β1 = 0, darker curves are their β1 = 0.9 counterpart. For small batch sizes, no instabilities are observed, and β1 = 0.9 reaches slightly better performances. For larger batch sizes, instabilities are observed with using β1 = 0, yet the final loss values are similar as the ones obtained with β1 = 0.9. This indicates that one can potentially save memory without compromising results. Setting β1 = 0 for 1.3B parameter models. In this section, using a 1.3B parameter model, we re-do the same experiment as in Fig. 1, but we set β1 = 0. In Fig. 23, we show the results of this experiment. 0 200k 500k 770k Iterations Adam W 101B tokens Ad EMAMix 101B tokens β1 = 0.9, β3 = 0.9999 Ad EMAMix 101B tokens β1 = 0.0, β3 = 0.9999 Figure 23: Using β1 = 0 with a 1.3B parameter model. We use β2 = 0.999 for all those experiments. 
We observe how it is possible to converge despite using β1 = 0. The final loss with β1 = 0 is slightly better than the one obtained with β1 = 0.9.

C.1.9 COMPARING m1 + αm2 VERSUS (1 − α)m1 + αm2

In equation (Ad EMAMix), we combine m1 and m2 using m1 + αm2. In this section we contrast this with using a convex combination (1 − α)m1 + αm2, with α ∈ [0, 1].

Are those two somewhat equivalent? Given the combination η(m1 + αm2), it is possible to rewrite it as a convex combination as follows:

η(m1 + αm2) = η̂((1 − α̂)m1 + α̂m2), with η̂ = η(α + 1) and α̂ = α / (α + 1).

Therefore, the two formulas are equivalent up to a reparametrization. However, in practice, we use schedulers on η and α. While at any timestep t it is possible to reparametrize one formula into the other, there is no single reparametrization that holds for all timesteps. Assuming that we do not change the nature of the schedulers for α and η, those two formulations are therefore not equivalent. In Fig. 24 we show the evolution of the weights given to m1 and m2 for the two approaches. We assume a cosine scheduler for η and a linear scheduler for α. We can see how the two formulations give very different weights to m1 and m2.

Figure 24: Weights given to m1 and m2 for the two formulations. In (a), we show the weights given to m1 and m2 when we use η(t)(m1 + α(t)m2): the weight for m1 is η(t) and the weight for m2 is η(t)α(t). We use α = 8 (α is the final value reached by the scheduler α(t)). In (b), we show the weights given to m1 and m2 when we use η(t)((1 − α(t))m1 + α(t)m2): the weight for m1 is η(t)(1 − α(t)) and the weight for m2 is η(t)α(t). We use α = 0.9. Importantly, in both cases, we assume a cosine scheduler for η(t) and a linear scheduler for α(t). Given this choice of schedulers, there is no reparametrization which can make those two formulations equal. In Fig. 25 we show that using the convex combination formulation yields worse results.

Is the convex combination better? Now that we have established that the two formulations are different, which one is better? To answer this, we run two grid searches, varying η ∈ {0.4, 0.6, 0.8, 1, 1.2, 1.4, 1.6} and α ∈ {2, 4, 6, 8, 10, 12, 14} for the formulation in equation (Ad EMAMix), and α ∈ {0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9} for the convex combination formulation. Results in Fig. 25 show that the formulation used in Ad EMAMix provides a better loss for all the pairs (η, α) tested, also outperforming Adam W for all of those. In contrast, for the convex combination, while most (η, α) pairs outperform Adam W, the improvement is smaller.

Getting the same results by changing the schedulers for the convex combination. We concluded before that, unless we tamper with the schedulers, it is not possible to find a reparametrization that works for all timesteps and makes the two formulations equivalent. What if we could change the schedulers? Which schedulers would we need to use with the convex formulation to get the same results as with the original formulation? The answer is simple: let η(t) and α(t) denote the cosine and linear schedulers for resp. η and α. The convex combination would need to use the following schedulers:

η̂(t) = η(t)(α(t) + 1),    α̂(t) = α(t) / (α(t) + 1).

We plot those schedulers in Fig. 26.
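The single-timestep equivalence above is easy to verify numerically. The following standalone sketch uses arbitrary example values for η, α, m1 and m2 (none of them taken from an actual training run):

import numpy as np

rng = np.random.default_rng(0)
m1, m2 = rng.normal(size=5), rng.normal(size=5)  # stand-ins for the two EMA buffers
eta, alpha = 1e-3, 8.0                           # example values; alpha = 8 as in Fig. 24

eta_hat = eta * (alpha + 1)      # reparametrized learning rate
alpha_hat = alpha / (alpha + 1)  # reparametrized mixing coefficient

u1 = eta * (m1 + alpha * m2)                            # Ad EMAMix formulation
u2 = eta_hat * ((1 - alpha_hat) * m1 + alpha_hat * m2)  # convex-combination formulation
print(np.allclose(u1, u2))  # True: identical at a fixed timestep, but the equality
                            # breaks once eta and alpha follow different schedulers.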
Published as a conference paper at ICLR 2025 0.4 0.6 0.8 1 1.2 1.4 1.6 η 103 (a) m1 + αm2. 0.4 0.6 0.8 1 1.2 1.4 1.6 η 103 (b) (1 α)m1 + αm2. Figure 25: η-α hyperparameter sweep for the two formulations. For pairs of (η, α), we train small (6-layers) transformer LMs on Red Pajama v2 data. We plot the results for the two formulations, the color represents the final validation loss, and is normalized between the two plots. We also do a sweep over η to fine the value giving the best results for Adam W. We use a green level curve to show the performance of the best Adam W model. Points in the interior of the green level curve represents configurations outperforming Adam W. In (a), we color the entire frame in green as all the pairs of hyperparameters are outperforming Adam W. It is clear from those two figures that using m1 + αm2 yields better results, more consistently, compared to using (1 α)m1 + αm2. 0 100000 200000 t ˆη(t) = η(t)(α(t) + 1) (a) η schedulers. 0 100000 200000 t ˆα(t) = α(t) (b) α schedulers. Figure 26: Using schedulers ˆη(t) and ˆα(t) with the convex combination formulation is the same as using the original Ad EMAMix. The original Ad EMAMix forumulation uses a cosine and linear schedulers for resp. η and α. Those schedulers are in blue. For the weights given to m1 and m2 to be the same when using the convex combination formulation, one need to use different schedulers ˆη(t) and ˆα(t) (in orange). Using the original Ad EMAMix formulation allows to use simpler schedulers. Published as a conference paper at ICLR 2025 C.1.10 MISCELLANEOUS RESULTS Impact of β3 and α schedulers when starting Ad EMAMix from Adam W. As mentioned in 3, we do not necessarily need the α and β3 schedulers when resuming training from a sufficiently late checkpoint. Fig. 5b and Fig. 5c do not use schedulers, unless we start using Ad EMAMix from scratch. In Fig. 27, we vary the warmup periods for α and β3 when starting Ad EMAMix from an Adam W checkpoint at tswitch = 300k. We observe how increasing Tβ3 = Tα only slows down the convergence. This serves to illustrate that the schedulers for α and β3 are only required to stabilise the optimization during the early training phase. 200000 250000 300000 350000 400000 450000 500000 Adam W Ad EMAMix β3 = 0.9999, α = 2.0, Tα = Tβ3 = 0 Ad EMAMix β3 = 0.9999, α = 2.0, Tα = Tβ3 = 25k Ad EMAMix β3 = 0.9999, α = 2.0, Tα = Tβ3 = 50k Ad EMAMix β3 = 0.9999, α = 2.0, Tα = Tβ3 = 100k Figure 27: No need to use schedulers when switching from Adam W to Ad EMAMix sufficiently late. For a 110M parameter model, we switch from Adam W to Ad EMAMix after tswitch = 300k iterations. We vary Tβ3 = Tα {0, 25k, 50k, 100k}, the other prameters are fixed: β3 = 0.9999, α = 2, the learning rate is inherited from Adam W and stays on his decaying course following a cosine decay. Ad EMAMix s m2 is initialized to 0. The larger Tβ3 = Tα, the slower the convergence is, the final loss value are very similar. This shows that the schedulers for α and β3 are not important when starting Ad EMAMix from a sufficiently late checkpoint. Ad EMAMix from scratch without schedulers. In this section we provide more justification for the use of schedulers on α and β3. We reveal that, without the use of schedulers, the norms of the updates increases significantly, even for small α values. Unstable and large weight updates are known to occur in the early phases of training when learning rate warmup is not used (Gotmare et al., 2019). Using large momentum values too early seem to increase this phenomenon. See Fig. 
28. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix α = 0.9, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.6, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.3, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.1, β3 = 0.9999, Tα,β3 = 0 (a) Norm of 1000 consecutive updates. 0 50000 100000 150000 200000 250000 Iterations Adam W Ad EMAMix α = 0.9, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.6, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.3, β3 = 0.9999, Tα,β3 = 0 Ad EMAMix α = 0.1, β3 = 0.9999, Tα,β3 = 0 Figure 28: The norm of the updates is exploding without schedulers. For historical reasons, those experiments were done with a 110M parameter model using a batch size of 32, and without weight decay (λ = 0). We use β2 = 0.999 and β1 = 0.9. We use 3000 steps of learning rate warmup for all the experiments. In (a), we monitor the norm of 1000 consecutive updates: θ(t) 2 = θ(t+1000) θ(t) 2. Even when using small values of α, the norm of θ(t) is increasing significantly at the beginning of training. As a result, in (b), we observe how without the use of a scheduler on α and β3 Ad EMAMix models are either diverging or fail to recover and improve upon Adam W. Published as a conference paper at ICLR 2025 Impact of the linear decay duration when using a constant η-scheduler. In Fig. 3b we show results using Ad EMAMix with a linear warmup constant linear decay learning rate scheduler. We used 100k of linear decay. In this section we experiment with 200k steps of linear decay, the rest of the parameters are the same: we use a 1.3B parameter model with a max learning rate of 10 4 and remaining hyperparameters as in Table 4. Results can be seen in Fig. 29. 0 500k = Tα,β3 900k 1.1M Iterations Adam W Ad EMAMix β3 = 0.9999 Ad EMAMix β3 = 0.99999 200k steps of linear η decay Figure 29: Testing a different decay duration compared to Fig. 3b. Increasing the length of the learning rate decay improves the results for each methods. The gap between Ad EMAMix using β3 = 0.99999 and β3 = 0.9999 is shrinking, but the gap between Ad EMAMix models and Adam W remains. Same figure as Fig. 1, including the Adam W trained on 197B tokens. In Fig. 1 we represent the Adam W experiment trained on 197B tokens by a blue horizontal line for aesthetic reasons. In Fig. 30 we include this missing curve to the same plot. 0 256k 770k 1M 1.5M Iterations Adam W 197B Adam W trained on {34B, 101B, 131B, 197B} tokens Ad EMAMix trained on {34B, 101B, 131B} tokens Figure 30: Same as Fig. 1 adding Adam W 197B. Published as a conference paper at ICLR 2025 C.2 VIT EXPERIMENTS Top-k accuracies. In Fig. 6, we plot the test and train loss for the final iterates. In Fig. 31, we give a more detailed view by reporting the evolution of the training loss, test loss, and top-1 accuracy. Looking at the first row, the training loss for Ad EMAMix is systematically better than the Adam W baseline. The second row shows that in cases where the test loss correlates well with the train loss, Ad EMAMix works well. The top-1 accuracy carries the same message. 0 50000 100000 Adam W Ad EMAMix β3 = 0.999 Ad EMAMix β3 = 0.9999 0 50000 100000 0 50000 100000 Iterations Top-1 Accuracy (a) 24M params, 11M images. (Image Net-21k) 0 50000 100000 Adam W Ad EMAMix β3 = 0.999 Ad EMAMix β3 = 0.9999 0 50000 100000 0 50000 100000 Iterations Top-1 Accuracy (b) 86M params, 11M images. (Image Net-21k) 0 50000 100000 Adam W Ad EMAMix β3 = 0.999 Ad EMAMix β3 = 0.9999 0 50000 100000 0 50000 100000 Iterations Top-1 Accuracy (c) 86M params, 1.3M images. 
Same figure as Fig. 1, including the AdamW model trained on 197B tokens. In Fig. 1 we represent the AdamW experiment trained on 197B tokens by a blue horizontal line for aesthetic reasons. In Fig. 30 we add this missing curve to the same plot.

[Figure 30: loss vs. iterations (up to 1.5M) for AdamW trained on {34B, 101B, 131B, 197B} tokens and AdEMAMix trained on {34B, 101B, 131B} tokens, including the AdamW 197B reference.]

Figure 30: Same as Fig. 1, adding the AdamW 197B run.

C.2 VIT EXPERIMENTS

Top-k accuracies. In Fig. 6, we plot the test and train loss for the final iterates. In Fig. 31, we give a more detailed view by reporting the evolution of the training loss, test loss, and top-1 accuracy. Looking at the first row, the training loss for AdEMAMix is systematically better than the AdamW baseline. The second row shows that in cases where the test loss correlates well with the train loss, AdEMAMix works well. The top-1 accuracy carries the same message.

[Figure 31 panels (training loss, test loss, and top-1 accuracy vs. iterations, for AdamW and AdEMAMix with β3 ∈ {0.999, 0.9999}): (a) 24M params, 11M images (ImageNet-21k); (b) 86M params, 11M images (ImageNet-21k); (c) 86M params, 1.3M images (ImageNet-1k).]

Figure 31: AdEMAMix & capacity/data ratio. These figures complement Fig. 6 and provide the top-1 accuracy as well as the training and test losses for the three experimental setups used in our ViT experiments (see Section 4.3). From left to right, the ratio between data and model capacity worsens. In (a), we train a 24M parameter model on a dataset of 11M images (ImageNet-21k), doing a total of 37 epochs. In this setting the training and test loss curves are very similar, signaling no blatant overfitting, and AdEMAMix easily outperforms the AdamW baseline. As the ratio between data and model capacity worsens, in (b) (more capacity) and (c) (more capacity and less data), it becomes more and more difficult to find hyperparameters outperforming the baseline. Yet, across all those settings, the training loss for AdEMAMix is always lower than that of the AdamW baseline. When a decreasing training loss correlates well with a decreasing test loss, AdEMAMix can be expected to work well.

C.3 COMPARISON WITH OTHER METHODS

C.3.1 COMPARISON WITH ADMETA AND DEMA

Double Exponential Moving Average (DEMA). Originally introduced by Mulloy (1994), DEMA aims to emphasize the weight of recent asset price fluctuations, making the DEMA indicator more reactive to changes compared to simple EMAs. Given the notations introduced in the main paper, let $\mathrm{EMA}^{(T-N:T)}_{\beta} \triangleq \mathrm{EMA}(\beta, g^{(T-N)}, \ldots, g^{(T)})$, with N the window size representing how many consecutive values are considered in the average. The formula for DEMA can be written as follows:

$$\mathrm{DEMA}(\beta, g^{(T-2N)}, \ldots, g^{(T)}) = 2\,\mathrm{EMA}^{(T-N:T)}_{\beta} - \mathrm{EMA}\big(\beta,\ \mathrm{EMA}^{(T-2N:T-N)}_{\beta}, \ldots, \mathrm{EMA}^{(T-N:T)}_{\beta}\big).$$

While a simple EMA already gives a significant weight to the most recent observations, a DEMA emphasizes this behaviour by removing some of the weight given to older observations. This is not what we suggest doing in this work: we want both high sensitivity to recent observations and non-negligible weights given to older observations.

AdMeta. Chen et al. (2023b) take inspiration from DEMA and use nested EMAs in their AdMeta-S optimizer:

$$
\begin{aligned}
m_1^{(t)} &= \beta_1 m_1^{(t-1)} + g^{(t)} \\
h^{(t)} &= \kappa g^{(t)} + \mu m_1^{(t)} \\
m_2^{(t)} &= \beta_2 m_2^{(t-1)} + (1-\beta_2) h^{(t)} \\
\theta^{(t)} &= \theta^{(t-1)} - \eta\, m_2^{(t)},
\end{aligned}
\qquad \text{(AdMeta-S)}
$$

with µ and κ parameterized by β1 ∈ [0, 1[. In their AdMeta-S experiments, they use β1 = 0.9, corresponding to (µ, κ) = (4.88, 2.11), and β2 takes values ranging from 0.1 to 0.4. As a result, unlike AdEMAMix, AdMeta does not leverage very old gradients. Analysing the AdMeta algorithm, we see that it consists of two nested EMAs. We show the shape of nested EMAs in Fig. 32. In sharp contrast with our approach, we observe that (i) it reduces the weights given to recent gradients and (ii) it gives a small weight to old gradients.

[Figure 32 panels (weight given to each timestep, for nested EMAs with β1 = 0.9, β2 = 0.4 vs. a sum of two EMAs with β1 = 0.9, β2 = 0.999): (a) linear scale; (b) logarithmic scale.]

Figure 32: Comparison between nested EMAs and a linear combination of EMAs. In (a) we observe how two nested EMAs actually decrease the influence of recent timesteps. This is the justification for the DEMA method, which aims to increase the sensitivity to recent gradients by subtracting nested EMAs. In (b), using a log-scale for the y-axis, we see how, for the values used in AdMeta, the older gradients receive a negligible weight.

In our work, we show how very old gradients can be leveraged to get better results by keeping a high sensitivity to recent gradients while giving non-negligible weights to older gradients.
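The contrast shown in Fig. 32 can be reproduced numerically by computing the weight each past gradient receives, i.e. the impulse response of each accumulator. The sketch below uses a simplified nested-EMA form (both stages with (1 − β) normalization, without AdMeta's κ and µ) and an illustrative α for the mixture, so the exact curves differ from the figure; it only illustrates the qualitative shapes.

```python
import numpy as np

def ema_weights(beta, n):
    # Weight of the gradient from i steps ago in a normalized EMA: (1 - beta) * beta**i.
    i = np.arange(n)
    return (1.0 - beta) * beta ** i

def nested_ema_weights(beta1, beta2, n):
    # Two nested EMAs (simplified): the impulse response is the convolution
    # of the two individual EMA impulse responses.
    return np.convolve(ema_weights(beta1, n), ema_weights(beta2, n))[:n]

def sum_of_emas_weights(beta_fast, beta_slow, alpha, n):
    # Linear combination m1 + alpha * m2, as in an AdEMAMix-style mixture.
    return ema_weights(beta_fast, n) + alpha * ema_weights(beta_slow, n)

n = 200
nested = nested_ema_weights(0.9, 0.4, n)            # nested EMAs (values from Fig. 32)
mixture = sum_of_emas_weights(0.9, 0.999, 1.0, n)   # sum of two EMAs (alpha is illustrative)

print(nested[:3], mixture[:3])      # nesting damps the weight of the most recent gradients,
print(nested[150], mixture[150])    # and leaves old gradients with a negligible weight.
```

Plotting `nested` and `mixture` on linear and log scales reproduces the two behaviours discussed above: the mixture keeps a large weight on recent gradients while still assigning a non-negligible weight to gradients hundreds of steps old.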
C.3.2 COMPARISON WITH LION

The Lion optimizer. Chen et al. (2023a) derived the following optimizer. We change notations to facilitate the comparison with AdEMAMix:

$$
\begin{aligned}
\theta^{(t)} &= \theta^{(t-1)} - \eta\left(\operatorname{sign}\!\big(\alpha m^{(t-1)} + (1-\alpha) g^{(t)}\big) + \lambda \theta^{(t-1)}\right) \\
m^{(t)} &= \beta m^{(t-1)} + (1-\beta) g^{(t)}.
\end{aligned}
\qquad \text{(Lion)}
$$

The Lion optimizer uses a sign function and updates its EMA after updating the parameters. Moreover, it does not normalize the updates. While Lion and AdEMAMix are quite different from each other, we can draw one similarity: the interpolation $\alpha m^{(t-1)} + (1-\alpha) g^{(t)}$ is similar to combining two EMAs, one of them using β = 0. We also explore the possibility of setting AdEMAMix's β1 to 0 in Section 3, and we show in App. C.1.8 and App. C.1.5 (Fig. 16b) that this often works, despite more unstable training. Interestingly, Chen et al. (2023a) find that larger β = 0.99 values work best. Besides this similarity, the two optimizers behave very differently, the biggest difference being the use of the sign function, which Chen et al. (2023a) claim can help regularize the training. We test the Lion optimizer on language modeling. To tune the hyperparameters, we took values from Chen et al. (2023a) as a starting point, as well as recipes provided by the Optax JAX library. A summary of our hyperparameter tuning is in Table 8.

Table 8: Hyperparameter tuning for Lion 110M parameter models. When multiple values are given, we bold the parameters we found to give the best results.

Hyperparameter | Value
Learning rate η | 0.00005, 0.0001, 0.0002, 0.0004, 0.0006, 0.0008, 0.001
Number of warmup steps | 3000
Sequence length | 1024
Weight decay λ | 0.01, 0.125, 0.166, 0.25, 0.5, 1.0
Learning rate decay scheduler | cosine-decay
Batch size | 64
Gradient clipping | 1.0, 0.5
Lion α | 0.5, 0.7, 0.9
Lion β | 0.99, 0.9999

The training curve associated with the best hyperparameters is in Fig. 33. We observe that Lion does not outperform our carefully tuned AdamW baseline. Moreover, no attempt to increase β beyond 0.99 was successful, as those models mostly diverged. This emphasizes one of our main contributions: the introduction of schedulers enabling the use of large momentum values.

[Figure 33: loss vs. iterations (up to 250k) for AdamW and the best Lion model.]

Figure 33: Comparing AdamW with the best Lion model found. Hyperparameters used are in Table 8. We train many Lion models and select the one with the lowest final validation loss. Lion does not outperform our carefully tuned AdamW baseline.

C.3.3 ADDING A THIRD MOMENTUM TERM (AGGMO)

Lucas et al. (2019, AggMo) propose to add an arbitrary number (K) of momentum terms to the gradient descent algorithm:

$$
\begin{cases}
m_i^{(t)} = \beta_i m_i^{(t-1)} + g^{(t)} & \text{for } 1 \le i \le K \\
\theta^{(t)} = \theta^{(t-1)} - \eta \sum_{1 \le i \le K} m_i^{(t)}.
\end{cases}
\qquad \text{(AggMo)}
$$

In order to see whether more momentum terms can be beneficial, we modify our AdEMAMix optimizer to include a third momentum m3. We name the resulting optimizer Ad3EMAMix:

$$
\begin{aligned}
m_1^{(t)} &= \beta_1 m_1^{(t-1)} + (1-\beta_1) g^{(t)}, \qquad \hat{m}_1^{(t)} = \frac{m_1^{(t)}}{1-\beta_1^{t}} \\
m_2^{(t)} &= \beta_3 m_2^{(t-1)} + (1-\beta_3) g^{(t)} \\
m_3^{(t)} &= \beta_4 m_3^{(t-1)} + (1-\beta_4) g^{(t)} \\
\nu^{(t)} &= \beta_2 \nu^{(t-1)} + (1-\beta_2) \big(g^{(t)}\big)^2, \qquad \hat{\nu}^{(t)} = \frac{\nu^{(t)}}{1-\beta_2^{t}} \\
\theta^{(t)} &= \theta^{(t-1)} - \eta \left( \frac{\hat{m}_1^{(t)} + \alpha\big(m_2^{(t)} + m_3^{(t)}\big)}{\sqrt{\hat{\nu}^{(t)}} + \epsilon} + \lambda \theta^{(t-1)} \right).
\end{aligned}
\qquad \text{(Ad3EMAMix)}
$$
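As a reading aid, the Ad3EMAMix update above translates directly into a few lines of array code. The following is a minimal NumPy sketch of a single parameter update, not a training-ready implementation; the function name, the hyperparameter defaults, and the omission of the α/β3/β4 warmup schedulers are our simplifications.

```python
import numpy as np

def ad3emamix_step(theta, grad, state, t, lr=1e-4,
                   beta1=0.9, beta2=0.999, beta3=0.9999, beta4=0.99999,
                   alpha=4.0, eps=1e-8, wd=0.1):
    """One Ad3EMAMix update (sketch following the equation above)."""
    m1, m2, m3, nu = state
    m1 = beta1 * m1 + (1 - beta1) * grad        # fast EMA (bias-corrected below)
    m2 = beta3 * m2 + (1 - beta3) * grad        # first slow EMA
    m3 = beta4 * m3 + (1 - beta4) * grad        # second slow EMA (the extra term)
    nu = beta2 * nu + (1 - beta2) * grad ** 2   # second-moment EMA
    m1_hat = m1 / (1 - beta1 ** t)
    nu_hat = nu / (1 - beta2 ** t)
    update = (m1_hat + alpha * (m2 + m3)) / (np.sqrt(nu_hat) + eps) + wd * theta
    return theta - lr * update, (m1, m2, m3, nu)

# Toy usage on a 3-parameter "model".
theta = np.zeros(3)
state = tuple(np.zeros(3) for _ in range(4))
for t in range(1, 11):
    grad = np.ones(3)                           # stand-in gradient
    theta, state = ad3emamix_step(theta, grad, state, t)
```

Dropping m3 and its β4 recovers the standard AdEMAMix update with a single slow EMA.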
We train a 110M parameter model with the same hyperparameters as in Table 2, but we use α = 4 instead of α = 8. We apply the same scheduler to β4 as for β3. In Fig. 34 we show the resulting validation loss curves for various (β1, β3, β4) triplets. Our experiments show no advantage in adding an extra β4-EMA. Rather than suggesting that more momentum terms should be used, our work shows that, in order to use a large momentum, an additional term is needed to remain sensitive to local fluctuations of the loss landscape. In that sense, we believe two terms should be enough.

[Figure 34: validation loss vs. iterations (up to 256k) for AdamW, AdEMAMix, and Ad3EMAMix with (β1, β3, β4) ∈ {(0.9, 0.9999, 0.99999), (0.9, 0.999, 0.99999), (0.9, 0.999, 0.9999), (0.9, 0.99, 0.9999), (0.9, 0.99, 0.99999)}.]

Figure 34: Adding an extra momentum term to AdEMAMix does not provide better results. We train multiple 110M parameter Ad3EMAMix models with various (β1, β3, β4) triplets (lighter curves). None of those models ends up outperforming AdEMAMix, indicating that the additional m3 momentum term does not seem to bring any advantage.