# DECOUPLED WEIGHT DECAY REGULARIZATION

Published as a conference paper at ICLR 2019

Ilya Loshchilov & Frank Hutter
University of Freiburg, Freiburg, Germany
ilya.loshchilov@gmail.com, fh@cs.uni-freiburg.de

L2 regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but, as we demonstrate, this is not the case for adaptive gradient algorithms such as Adam. While common implementations of these algorithms employ L2 regularization (often calling it "weight decay", which may be misleading due to the inequivalence we expose), we propose a simple modification to recover the original formulation of weight decay regularization by decoupling the weight decay from the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of the weight decay factor from the setting of the learning rate for both standard SGD and Adam and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). Our proposed decoupled weight decay has already been adopted by many researchers, and the community has implemented it in TensorFlow and PyTorch; the complete source code for our experiments is available at https://github.com/loshchil/AdamW-and-SGDW

## 1 INTRODUCTION

Adaptive gradient methods, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman & Hinton, 2012), Adam (Kingma & Ba, 2014) and, most recently, AMSGrad (Reddi et al., 2018), have become a default method of choice for training feed-forward and recurrent neural networks (Xu et al., 2015; Radford et al., 2015). Nevertheless, state-of-the-art results for popular image classification datasets, such as CIFAR-10 and CIFAR-100 (Krizhevsky, 2009), are still obtained by applying SGD with momentum (Gastaldi, 2017; Cubuk et al., 2018). Furthermore, Wilson et al. (2017) suggested that adaptive gradient methods do not generalize as well as SGD with momentum when tested on a diverse set of deep learning tasks, such as image classification, character-level language modeling and constituency parsing. Different hypotheses about the origins of this worse generalization have been investigated, such as the presence of sharp local minima (Keskar et al., 2016; Dinh et al., 2017) and inherent problems of adaptive gradient methods (Wilson et al., 2017).

In this paper, we investigate whether it is better to use L2 regularization or weight decay regularization to train deep neural networks with SGD and Adam. We show that a major factor in the poor generalization of the most popular adaptive gradient method, Adam, is that L2 regularization is not nearly as effective for it as for SGD. Specifically, our analysis of Adam leads to the following observations:

**L2 regularization and weight decay are not identical.** Contrary to a belief that seems popular among some practitioners, the two techniques are not equivalent. For SGD, they can be made equivalent by a reparameterization of the weight decay factor based on the learning rate; this is not the case for Adam. In particular, when combined with adaptive gradients, L2 regularization leads to weights with large parameter and/or gradient amplitudes being regularized less than they would be when using weight decay.

**L2 regularization is not effective in Adam.**
One possible explanation why Adam and other adaptive gradient methods might be outperformed by SGD with momentum is that common deep learning libraries only implement L2 regularization, not the original weight decay. Therefore, on tasks/datasets where the use of L2 regularization is beneficial for SGD (e.g., on many popular image classification datasets), Adam leads to worse results than SGD with momentum (for which L2 regularization behaves as expected).

**Weight decay is equally effective in both SGD and Adam.** For SGD, it is equivalent to L2 regularization, while for Adam it is not.

**Optimal weight decay depends on the total number of batch passes/weight updates.** Our empirical analysis of SGD and Adam suggests that the larger the runtime/number of batch passes to be performed, the smaller the optimal weight decay. This effect tends to be neglected because hyperparameters are often tuned for a fixed number of training epochs. As a result, the weight decay values found to perform best for short runs do not generalize to much longer runs.

The main contribution of this paper is to improve regularization in Adam by decoupling the weight decay from the gradient-based update. In a comprehensive analysis, we show that Adam generalizes substantially better with decoupled weight decay than with L2 regularization, achieving a 15% relative improvement in test error (see Figures 2 and 3); this holds true for various image recognition datasets (CIFAR-10 and ImageNet32x32), training budgets (ranging from 100 to 1800 epochs), and learning rate schedules (fixed, drop-step, and cosine annealing; see Figure 1). We also demonstrate that our decoupled weight decay renders the optimal settings of the learning rate and the weight decay factor much more independent, thereby easing hyperparameter optimization (see Figure 2).

The main motivation of this paper is to improve Adam to make it competitive w.r.t. SGD with momentum even for those problems where it did not use to be competitive. We hope that, as a result, practitioners no longer need to switch between Adam and SGD, which in turn should reduce the common issue of selecting dataset/task-specific training algorithms and their hyperparameters.

## 2 DECOUPLING THE WEIGHT DECAY FROM THE GRADIENT-BASED UPDATE

In the weight decay described by Hanson & Pratt (1988), the weights θ decay exponentially as

$$\theta_{t+1} = (1 - \lambda)\theta_t - \alpha \nabla f_t(\theta_t), \qquad (1)$$

where λ defines the rate of the weight decay per step and ∇f_t(θ_t) is the t-th batch gradient, to be multiplied by a learning rate α. For standard SGD, this is equivalent to standard L2 regularization:

**Proposition 1** (Weight decay = L2 reg for standard SGD). Standard SGD with base learning rate α executes the same steps on batch loss functions f_t(θ) with weight decay λ (defined in Equation 1) as it executes without weight decay on $f^{reg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|^2_2$, with $\lambda' = \lambda/\alpha$.

The proofs of this well-known fact, as well as of our other propositions, are given in Appendix A. Due to this equivalence, L2 regularization is very frequently referred to as weight decay, including in popular deep learning libraries. However, as we will demonstrate later in this section, this equivalence does not hold for adaptive gradient methods. One fact that is often overlooked already for the simple case of SGD is that in order for the equivalence to hold, the L2 regularizer λ' has to be set to λ/α, i.e., if there is an overall best weight decay value λ, the best value of λ' is tightly coupled with the learning rate α.
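To make this reparameterization concrete, here is a minimal NumPy sketch (ours, not taken from the paper's code release) that checks Proposition 1 numerically: a single plain SGD step with decoupled weight decay λ coincides with a single SGD step on the L2-regularized loss, provided the L2 coefficient is set to λ' = λ/α.

```python
# Numerical check of Proposition 1 (illustrative sketch, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=5)        # current parameters theta_t
grad = rng.normal(size=5)         # batch gradient nabla f_t(theta_t)
alpha, lam = 0.1, 1e-4            # learning rate alpha and weight decay factor lambda
lam_prime = lam / alpha           # reparameterized L2 coefficient lambda' = lambda / alpha

# SGD with decoupled weight decay (Equation 1)
theta_decay = (1 - lam) * theta - alpha * grad

# SGD without weight decay on f_t(theta) + (lambda'/2) * ||theta||^2
theta_l2 = theta - alpha * (grad + lam_prime * theta)

assert np.allclose(theta_decay, theta_l2)   # identical iterates
```

The check only goes through because lam_prime is tied to alpha; changing the learning rate without rescaling the L2 coefficient breaks the equivalence, which is exactly the coupling discussed above.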
In order to decouple the effects of these two hyperparameters, we advocate decoupling the weight decay step as proposed by Hanson & Pratt (1988) (Equation 1). Looking first at the case of SGD, we propose to decay the weights simultaneously with the gradient-based update of θ_t in line 9 of Algorithm 1. This yields our proposed variant of SGD with momentum using decoupled weight decay (SGDW). This simple modification explicitly decouples λ and α (although some problem-dependent implicit coupling may of course remain, as for any two hyperparameters). In order to account for a possible scheduling of both α and λ, we introduce a scaling factor η_t delivered by a user-defined procedure SetScheduleMultiplier(t).

Algorithm 1: SGD with L2 regularization and SGD with decoupled weight decay (SGDW), both with momentum. The λθ_{t-1} term in line 6 implements L2 regularization (SGD with L2 only); the η_t λθ_{t-1} term in line 9 implements decoupled weight decay (SGDW only).
1: given initial learning rate α ∈ ℝ, momentum factor β1 ∈ ℝ, weight decay / L2 regularization factor λ ∈ ℝ
2: initialize time step t ← 0, parameter vector θ_{t=0} ∈ ℝ^n, first moment vector m_{t=0} ← 0, schedule multiplier η_{t=0} ∈ ℝ
3: repeat
4:   t ← t + 1
5:   ∇f_t(θ_{t-1}) ← SelectBatch(θ_{t-1})  ▷ select batch and return the corresponding gradient
6:   g_t ← ∇f_t(θ_{t-1}) + λθ_{t-1}
7:   η_t ← SetScheduleMultiplier(t)  ▷ can be fixed, decay, or be used for warm restarts
8:   m_t ← β1 m_{t-1} + η_t α g_t
9:   θ_t ← θ_{t-1} - m_t - η_t λθ_{t-1}
10: until stopping criterion is met
11: return optimized parameters θ_t

Algorithm 2: Adam with L2 regularization and Adam with decoupled weight decay (AdamW). The λθ_{t-1} term in line 6 implements L2 regularization (Adam only); the λθ_{t-1} term in line 12 implements decoupled weight decay (AdamW only).
1: given α = 0.001, β1 = 0.9, β2 = 0.999, ε = 10^{-8}, λ ∈ ℝ
2: initialize time step t ← 0, parameter vector θ_{t=0} ∈ ℝ^n, first moment vector m_{t=0} ← 0, second moment vector v_{t=0} ← 0, schedule multiplier η_{t=0} ∈ ℝ
3: repeat
4:   t ← t + 1
5:   ∇f_t(θ_{t-1}) ← SelectBatch(θ_{t-1})  ▷ select batch and return the corresponding gradient
6:   g_t ← ∇f_t(θ_{t-1}) + λθ_{t-1}
7:   m_t ← β1 m_{t-1} + (1 - β1) g_t  ▷ here and below all operations are element-wise
8:   v_t ← β2 v_{t-1} + (1 - β2) g_t^2
9:   m̂_t ← m_t / (1 - β1^t)  ▷ β1 is taken to the power of t
10:  v̂_t ← v_t / (1 - β2^t)  ▷ β2 is taken to the power of t
11:  η_t ← SetScheduleMultiplier(t)  ▷ can be fixed, decay, or also be used for warm restarts
12:  θ_t ← θ_{t-1} - η_t (α m̂_t / (√v̂_t + ε) + λθ_{t-1})
13: until stopping criterion is met
14: return optimized parameters θ_t

Now, let's turn to adaptive gradient algorithms like the popular optimizer Adam (Kingma & Ba, 2014), which scale gradients by their historic magnitudes. Intuitively, when Adam is run on a loss function f plus L2 regularization, weights that tend to have large gradients in f do not get regularized as much as they would with decoupled weight decay, since the gradient of the regularizer gets scaled along with the gradient of f. This leads to an inequivalence of L2 regularization and decoupled weight decay for adaptive gradient algorithms:

**Proposition 2** (Weight decay ≠ L2 reg for adaptive gradients). Let O denote an optimizer that has iterates $\theta_{t+1} \leftarrow \theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on batch loss function $f_t(\theta)$ without weight decay, and $\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha M_t \nabla f_t(\theta_t)$ when run on $f_t(\theta)$ with weight decay, respectively, with $M_t \neq kI$ (where $k \in \mathbb{R}$). Then, for O there exists no L2 coefficient λ' such that running O on the batch loss $f^{reg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|^2_2$ without weight decay is equivalent to running O on $f_t(\theta)$ with decay $\lambda \in \mathbb{R}^+$.

We decouple weight decay and loss-based gradient updates in Adam as shown in line 12 of Algorithm 2; this gives rise to our variant of Adam with decoupled weight decay (AdamW).
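The following single-step Python sketch (ours; a simplified view, not the authors' reference implementation) mirrors Algorithm 2 and isolates the only difference between Adam with L2 regularization and AdamW: whether the λθ term enters in line 6, before the moment estimates, or in line 12, after them.

```python
# One parameter update following Algorithm 2 (illustrative sketch).
import numpy as np

def adam_step(theta, grad, m, v, t, *, alpha=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, lam=0.0, eta=1.0, decoupled=False):
    """Returns (theta, m, v) after one step; decoupled=True corresponds to AdamW."""
    if not decoupled:
        grad = grad + lam * theta                 # line 6: L2 term folded into the gradient (Adam)
    m = beta1 * m + (1 - beta1) * grad            # line 7: first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # line 8: second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # line 9: bias correction
    v_hat = v / (1 - beta2 ** t)                  # line 10: bias correction
    update = alpha * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        update = update + lam * theta             # line 12: decoupled weight decay (AdamW)
    return theta - eta * update, m, v
```

Because the decoupled λθ term bypasses m_t and v_t entirely, it is never rescaled by the adaptive denominator, which is precisely the behavior that, by Proposition 2, cannot be reproduced by any L2 coefficient.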
Having shown that L2 regularization and weight decay regularization differ for adaptive gradient algorithms raises the question of how they differ and how to interpret their effects. Their equivalence for standard SGD remains very helpful for intuition: both mechanisms push weights closer to zero, at the same rate. However, for adaptive gradient algorithms they differ: with L2 regularization, the sums of the gradient of the loss function and the gradient of the regularizer (i.e., the L2 norm of the weights) are adapted, whereas with weight decay, only the gradients of the loss function are adapted (with the weight decay step separated from the adaptive gradient mechanism). With L2 regularization, both types of gradients are normalized by their typical (summed) magnitudes, and therefore weights x with large typical gradient magnitude s are regularized by a smaller relative amount than other weights. In contrast, weight decay regularizes all weights at the same rate λ, effectively regularizing weights x with large s more than standard L2 regularization does. We demonstrate this formally for the simple special case of an adaptive gradient algorithm with a fixed preconditioner:

**Proposition 3** (Weight decay = scale-adjusted L2 reg for adaptive gradient algorithms with a fixed preconditioner). Let O denote an algorithm with the same characteristics as in Proposition 2, using a fixed preconditioner matrix $M_t = \mathrm{diag}(s)^{-1}$ (with $s_i > 0$ for all i). Then, O with base learning rate α executes the same steps on batch loss functions $f_t(\theta)$ with weight decay λ as it executes without weight decay on the scale-adjusted regularized batch loss

$$f^{sreg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\left\|\theta \odot \sqrt{s}\right\|^2_2, \qquad (2)$$

where ⊙ and √· denote element-wise multiplication and element-wise square root, respectively, and $\lambda' = \lambda/\alpha$.
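The effect described in Proposition 3 can be seen in a few lines of NumPy (our illustration, under the proposition's assumption of a fixed diagonal preconditioner M = diag(s)^{-1}, with the loss gradient set to zero to isolate the regularization):

```python
# L2 regularization vs. decoupled weight decay under a fixed preconditioner (sketch).
import numpy as np

s = np.array([100.0, 1.0])      # per-coordinate gradient scales; preconditioner is diag(s)^{-1}
theta = np.array([1.0, 1.0])
grad = np.zeros(2)              # zero loss gradient, to isolate the regularization effect
alpha, lam = 0.1, 1e-2
lam_prime = lam / alpha

# L2 regularization: the regularizer's gradient is preconditioned along with the loss gradient
theta_l2 = theta - alpha * (grad + lam_prime * theta) / s
# Decoupled weight decay: the same multiplicative shrinkage for every coordinate
theta_decay = (1 - lam) * theta - alpha * grad / s

print(1 - theta_l2 / theta)     # relative shrinkage under L2:    [0.0001, 0.01]
print(1 - theta_decay / theta)  # relative shrinkage under decay: [0.01,   0.01]
```

The coordinate with the large gradient scale s_i = 100 is barely shrunk by L2 regularization but is shrunk at the full rate λ by decoupled weight decay, matching the scale-adjusted regularizer in Equation 2.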
## 3 JUSTIFICATION OF DECOUPLED WEIGHT DECAY VIA A VIEW OF ADAPTIVE GRADIENT METHODS AS BAYESIAN FILTERING

We now discuss a justification of decoupled weight decay in the framework of Bayesian filtering for a unified theory of adaptive gradient algorithms due to Aitchison (2018). After we posted a preliminary version of our current paper on arXiv, Aitchison noted that his theory "gives us a theoretical framework in which we can understand the superiority of this weight decay over L2 regularization, because it is weight decay, rather than L2 regularization, that emerges through the straightforward application of Bayesian filtering" (Aitchison, 2018). While full credit for this theory goes to Aitchison, we summarize it here to shed some light on why weight decay may be favored over L2 regularization.

Aitchison (2018) views stochastic optimization of n parameters x_1, ..., x_n as a Bayesian filtering problem with the goal of inferring a distribution over the optimal value of each parameter x_i given the current values of the other parameters θ_{-i}(t) at time step t. When the other parameters do not change, this is an optimization problem, but when they do change, it becomes one of tracking the optimizer using Bayesian filtering, as follows. One is given a probability distribution P(θ_t | y_{1:t}) of the optimizer at time step t that takes into account the data y_{1:t} from the first t mini-batches, a state transition prior P(θ_{t+1} | θ_t) reflecting a (small) data-independent change in this distribution from one step to the next, and a likelihood P(y_{t+1} | θ_{t+1}) derived from the mini-batch at step t + 1. The posterior distribution P(θ_{t+1} | y_{1:t+1}) of the optimizer at time step t + 1 can then be computed (as usual in Bayesian filtering) by marginalizing over θ_t to obtain the one-step-ahead prediction P(θ_{t+1} | y_{1:t}) and then applying Bayes' rule to incorporate the likelihood P(y_{t+1} | θ_{t+1}). Aitchison (2018) assumes a Gaussian state transition distribution P(θ_{t+1} | θ_t) and an approximate conjugate likelihood P(y_{t+1} | θ_{t+1}), leading to the following closed-form update of the filtering distribution's mean:

$$\mu_{post} = \mu_{prior} + \Sigma_{post}\, g, \qquad (3)$$

where g is the gradient of the log likelihood of the mini-batch at time t. This result implies a preconditioner of the gradients given by the posterior uncertainty Σ_post of the filtering distribution: updates are larger for parameters we are more uncertain about and smaller for parameters we are more certain about. Aitchison (2018) goes on to show that popular adaptive gradient methods, such as Adam and RMSprop, as well as Kronecker-factorized methods, are special cases of this framework.

Decoupled weight decay fits very naturally into this unified framework, which can express weight decay as part of the state-transition distribution: Aitchison (2018) assumes a slow change of the optimizer according to the following Gaussian:

$$P(\theta_{t+1} \mid \theta_t) = N\big((I - A)\theta_t,\; Q\big), \qquad (4)$$

where Q is the covariance of Gaussian perturbations of the weights, and A is a regularizer to avoid values growing unboundedly over time. When instantiated as A = λI, this regularizer A plays exactly the role of decoupled weight decay as described in Equation 1, since it leads to multiplying the current mean estimate θ_t by (1 - λ) at each step. Notably, this regularization is applied directly to the prior and does not depend on the uncertainty in each of the parameters (which would be required for L2 regularization).

Figure 1: Adam performs better with decoupled weight decay (bottom row, AdamW) than with L2 regularization (top row, Adam). We show the final test error of a 26 2x64d ResNet on CIFAR-10 after 100 epochs of training with a fixed learning rate (left column), a step-drop learning rate (with drops at epoch indexes 30, 60 and 80; middle column) and cosine annealing (right column). AdamW leads to a more separable hyperparameter search space, especially when a learning rate schedule such as step-drop or cosine annealing is applied. Cosine annealing yields clearly superior results.

## 4 EXPERIMENTAL VALIDATION

We now evaluate the performance of decoupled weight decay under various training budgets and learning rate schedules. Our experimental setup follows that of Gastaldi (2017), who proposed, in addition to L2 regularization, to apply the new Shake-Shake regularization to a 3-branch residual DNN, which allowed achieving a new state-of-the-art result of 2.86% test error on the CIFAR-10 dataset (Krizhevsky, 2009). We always used a batch size of 128 and applied the regular data augmentation procedure used for the CIFAR datasets. We used the same model/source code based on fb.resnet.torch¹. The base networks are a 26 2x64d ResNet (i.e., the network has a depth of 26, 2 residual branches and the first residual block has a width of 64) and a 26 2x96d ResNet, with 11.6M and 25.6M parameters, respectively. For a detailed description of the network and the Shake-Shake method, we refer the interested reader to Gastaldi (2017).
¹ https://github.com/xgastaldi/shake-shake

We also perform experiments on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32x32 pixel images.

### 4.1 EVALUATING DECOUPLED WEIGHT DECAY WITH DIFFERENT LEARNING RATE SCHEDULES

In our first experiment, we compare Adam with L2 regularization to Adam with decoupled weight decay (AdamW), using three different learning rate schedules: a fixed learning rate, a drop-step schedule, and a cosine annealing schedule (Loshchilov & Hutter, 2016). For each learning rate schedule and weight decay variant, we trained a 2x64d ResNet for 100 epochs, using different settings of the initial learning rate α and the weight decay factor λ. Figure 1 shows that decoupled weight decay outperforms L2 regularization for all learning rate schedules, with larger differences for better learning rate schedules. We also note that decoupled weight decay leads to a more separable hyperparameter search space, especially when a learning rate schedule such as step-drop or cosine annealing is applied. The figure also shows that cosine annealing clearly outperforms the other learning rate schedules; we thus used cosine annealing for the remainder of the experiments.

Figure 2: The Top-1 test error of a 26 2x64d ResNet on CIFAR-10 measured after 100 epochs. The proposed SGDW and AdamW (right column) have a more separable hyperparameter space.

Figure 3: Learning curves (top row) and generalization results (bottom row) obtained by a 26 2x96d ResNet trained with Adam and AdamW on CIFAR-10. See text for details. Supp Figure 4 in the Appendix shows the same qualitative results for ImageNet32x32.

### 4.2 DECOUPLING THE WEIGHT DECAY AND INITIAL LEARNING RATE PARAMETERS

In order to verify our hypothesis about the coupling of α and λ, in Figure 2 we compare the performance of L2 regularization vs. decoupled weight decay in SGD (SGD vs. SGDW, top row) and in Adam (Adam vs. AdamW, bottom row). In SGD (Figure 2, top left), L2 regularization is not decoupled from the learning rate (the common way, as described in Algorithm 1), and the figure clearly shows that the basin of best hyperparameter settings (depicted by color, with the top-10 hyperparameter settings marked by black circles) is not aligned with the x-axis or the y-axis but lies on the diagonal. This suggests that the two hyperparameters are interdependent and need to be changed simultaneously, while changing only one of them might substantially worsen results. Consider, e.g., the setting at the top-left black circle (α = 1/2, λ = 1/8 * 0.001); changing only α or only λ by itself would worsen results, while changing both of them could still yield clear improvements. We note that this coupling of the initial learning rate and the L2 regularization factor might have contributed to SGD's reputation of being very sensitive to its hyperparameter settings.

In contrast, the results for SGD with decoupled weight decay (SGDW) in Figure 2 (top right) show that weight decay and initial learning rate are decoupled. The proposed approach renders the two hyperparameters more separable: even if the learning rate is not yet well tuned (e.g., consider the value of 1/1024 in Figure 2, top right), leaving it fixed and only optimizing the weight decay factor would yield a good value (of 1/4 * 0.001). This is not the case for SGD with L2 regularization (see Figure 2, top left).
The results for Adam with L2 regularization are given in Figure 2 (bottom left). Adam's best hyperparameter settings performed clearly worse than SGD's best ones (compare Figure 2, top left). While both methods used L2 regularization, Adam did not benefit from it at all: its best results obtained for non-zero L2 regularization factors were comparable to the best ones obtained without L2 regularization, i.e., when λ = 0. Similarly to the original SGD, the shape of the hyperparameter landscape suggests that the two hyperparameters are coupled.

In contrast, the results for our new variant of Adam with decoupled weight decay (AdamW) in Figure 2 (bottom right) show that AdamW largely decouples weight decay and learning rate. The results for the best hyperparameter settings were substantially better than the best ones of Adam with L2 regularization and rivaled those of SGD and SGDW.

In summary, the results in Figure 2 support our hypothesis that the weight decay and learning rate hyperparameters can be decoupled, and that this in turn simplifies the problem of hyperparameter tuning in SGD and improves Adam's performance to be competitive w.r.t. SGD with momentum.

### 4.3 BETTER GENERALIZATION OF ADAMW

While the previous experiment suggested that the basin of optimal hyperparameters of AdamW is broader and deeper than that of Adam, we next investigated the results for much longer runs of 1800 epochs to compare the generalization capabilities of AdamW and Adam. We fixed the initial learning rate to 0.001, which is both the default learning rate for Adam and the one that showed reasonably good results in our experiments. Figure 3 shows the results for 12 settings of the L2 regularization of Adam and 7 settings of the normalized weight decay of AdamW (the normalized weight decay represents a rescaling formally defined in Appendix B.1; it amounts to a multiplicative factor that depends on the number of batch passes). Interestingly, while the dynamics of the learning curves of Adam and AdamW often coincided for the first half of the training run, AdamW often led to lower training loss and test errors (see Figure 3, top left and top right, respectively). Importantly, the use of L2 regularization in Adam did not yield results as good as decoupled weight decay in AdamW (see also Figure 3, bottom left).

Next, we investigated whether AdamW's better results were only due to better convergence or due to better generalization. The results in Figure 3 (bottom right) for the best settings of Adam and AdamW suggest that AdamW not only yielded better training loss but also better generalization performance for similar training loss values. The results on ImageNet32x32 (see Supp Figure 4 in the Appendix) lead to the same conclusion of substantially improved generalization performance.

Figure 4: Top-1 test error on CIFAR-10 (left) and Top-5 test error on ImageNet32x32 (right). For a better resolution and with training loss curves, see Supp Figure 5 and Supp Figure 6 in the supplementary material.

### 4.4 ADAMWR WITH WARM RESTARTS FOR BETTER ANYTIME PERFORMANCE

In order to improve the anytime performance of SGDW and AdamW, we extended them with the warm restarts of Loshchilov & Hutter (2016) to obtain SGDWR and AdamWR, respectively (see Section B.2 in the Appendix). As Figure 4 shows, AdamWR greatly sped up AdamW on CIFAR-10 and ImageNet32x32, up to a factor of 10 (see the results at the first restart).
For the default learning rate of 0.001, AdamW achieved a 15% relative improvement in test errors compared to Adam both on CIFAR-10 (also see Figure 3) and on ImageNet32x32 (also see Supp Figure 5). AdamWR achieved the same improved results but with much better anytime performance. These improvements closed most of the gap between Adam and SGDWR on CIFAR-10 and yielded comparable performance on ImageNet32x32.

### 4.5 USE OF ADAMW ON OTHER DATASETS AND ARCHITECTURES

Several other research groups have already successfully applied AdamW in citable works. For example, Wang et al. (2018) used AdamW to train a novel architecture for face detection on the standard WIDER FACE dataset (Yang et al., 2016), obtaining almost 10x faster predictions than the previous state-of-the-art algorithms while achieving comparable performance. Völker et al. (2018) employed AdamW with cosine annealing to train convolutional neural networks to classify and characterize error-related brain signals measured from intracranial electroencephalography (EEG) recordings. While their paper does not provide a comparison to Adam, they kindly provided us with a direct comparison of the two on their best-performing problem-specific network architecture Deep4Net and a variant of ResNet. AdamW with the same hyperparameter setting as Adam yielded higher test set accuracy on Deep4Net (73.68% versus 71.37%) and statistically significantly higher test set accuracy on ResNet (72.04% versus 61.34%). Radford et al. (2018) employed AdamW to train Transformer (Vaswani et al., 2017) architectures to obtain new state-of-the-art results on a wide range of benchmarks for natural language understanding. Zhang et al. (2018) compared L2 regularization vs. weight decay for SGD, Adam and the Kronecker-Factored Approximate Curvature (K-FAC) optimizer (Martens & Grosse, 2015) on the CIFAR datasets with ResNet and VGG architectures, reporting that decoupled weight decay consistently outperformed L2 regularization in the cases where they differ.

## 5 CONCLUSION AND FUTURE WORK

Following suggestions that adaptive gradient methods such as Adam might lead to worse generalization than SGD with momentum (Wilson et al., 2017), we identified and exposed the inequivalence of L2 regularization and weight decay for Adam. We empirically showed that our version of Adam with decoupled weight decay yields substantially better generalization performance than the common implementation of Adam with L2 regularization. We also proposed to use warm restarts for Adam to improve its anytime performance. Our results obtained on image classification datasets must be verified on a wider range of tasks, especially ones where the use of regularization is expected to be important. It would be interesting to integrate our findings on weight decay into other methods that attempt to improve Adam, e.g., normalized direction-preserving Adam (Zhang et al., 2017). While we focused our experimental analysis on Adam, we believe that similar results also hold for other adaptive gradient methods, such as AdaGrad (Duchi et al., 2011) and AMSGrad (Reddi et al., 2018).

## 6 ACKNOWLEDGMENTS

This work was supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant no. 716721, by the German Research Foundation (DFG) under the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086) and through grant no.
INST 37/935-1 FUGG, and by the German state of Baden-Württemberg through bwHPC. We thank Patryk Chrabaszcz for helping to run experiments with ImageNet32x32. We thank Matthias Feurer and Robin Schirrmeister for providing valuable feedback on this paper in several iterations. We thank Martin Völker, Robin Schirrmeister, and Tonio Ball for providing us with a comparison of AdamW and Adam on their EEG data. Finally, we thank the following members of the deep learning community for implementing decoupled weight decay in various deep learning libraries:

- Jingwei Zhang, Lei Tai, Robin Schirrmeister, and Kashif Rasul for their implementations in PyTorch (see https://github.com/pytorch/pytorch/pull/4429)
- Phil Jund for his implementation in TensorFlow, described at https://www.tensorflow.org/api_docs/python/tf/contrib/opt/DecoupledWeightDecayExtension
- Sylvain Gugger, Anand Saha, Jeremy Howard and other members of fast.ai for their implementation available at https://github.com/sgugger/Adam-experiments
- Guillaume Lambard for his implementation in Keras available at https://github.com/GLambard/AdamW_Keras
- Yagami Lin for his implementation in Caffe available at https://github.com/Yagami123/Caffe-AdamW-AdamWR

## REFERENCES

Laurence Aitchison. A unified theory of adaptive stochastic gradient descent as Bayesian filtering. arXiv:1507.02030, 2018.

Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv:1707.08819, 2017.

Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation policies from data. arXiv:1805.09501, 2018.

Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv:1703.04933, 2017.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121-2159, 2011.

Xavier Gastaldi. Shake-Shake regularization. arXiv:1705.07485, 2017.

Stephen José Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In Proceedings of the 1st International Conference on Neural Information Processing Systems, pp. 177-185, 1988.

Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E. Hopcroft, and Kilian Q. Weinberger. Snapshot ensembles: Train 1, get M for free. arXiv:1704.00109, 2017.

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv:1712.09913, 2017.

Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983, 2016.

James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408-2417, 2015.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434, 2015.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018.

Leslie N. Smith. Cyclical learning rates for training neural networks. arXiv:1506.01186v3, 2016.

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31, 2012.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008, 2017.

Martin Völker, Jiří Hammer, Robin T. Schirrmeister, Joos Behncke, Lukas D. J. Fiederer, Andreas Schulze-Bonhage, Petr Marusič, Wolfram Burgard, and Tonio Ball. Intracranial error detection via deep learning. arXiv:1805.01667, 2018.

Jianfeng Wang, Ye Yuan, Gang Yu, and Sun Jian. SFace: An efficient network for face detection in large scale variations. arXiv:1804.06559, 2018.

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. arXiv:1705.08292, 2017.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048-2057, 2015.

Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525-5533, 2016.

Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. arXiv:1810.12281, 2018.

Zijun Zhang, Lin Ma, Zongpeng Li, and Chuan Wu. Normalized direction-preserving Adam. arXiv:1709.04546, 2017.

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. arXiv:1707.07012, 2017.

## A FORMAL ANALYSIS OF WEIGHT DECAY VS L2 REGULARIZATION

**Proof of Proposition 1.** The proof of this well-known fact is straightforward. SGD without weight decay has the following iterates on $f^{reg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|^2_2$:

$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla f^{reg}_t(\theta_t) = \theta_t - \alpha \nabla f_t(\theta_t) - \alpha\lambda'\theta_t. \qquad (5)$$

SGD with weight decay has the following iterates on $f_t(\theta)$:

$$\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha \nabla f_t(\theta_t). \qquad (6)$$

These iterates are identical since $\lambda' = \lambda/\alpha$.

**Proof of Proposition 2.** Similarly to the proof of Proposition 1, the iterates of O without weight decay on $f^{reg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta\|^2_2$ and of O with weight decay λ on $f_t$ are, respectively:

$$\theta_{t+1} \leftarrow \theta_t - \alpha\lambda' M_t\theta_t - \alpha M_t \nabla f_t(\theta_t), \qquad (7)$$

$$\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha M_t \nabla f_t(\theta_t). \qquad (8)$$

The equality of these iterates for all θ_t would imply $\lambda\theta_t = \alpha\lambda' M_t\theta_t$. This can only hold for all θ_t if $M_t = kI$ with $k \in \mathbb{R}$, which is not the case for O. Therefore, no L2 regularizer $\frac{\lambda'}{2}\|\theta\|^2_2$ exists that makes the iterates equivalent.

**Proof of Proposition 3.** O without weight decay has the following iterates on $f^{sreg}_t(\theta) = f_t(\theta) + \frac{\lambda'}{2}\|\theta \odot \sqrt{s}\|^2_2$:

$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla f^{sreg}_t(\theta_t)/s \qquad (9)$$

$$= \theta_t - \alpha \nabla f_t(\theta_t)/s - \alpha\lambda'(\theta_t \odot s)/s \qquad (10)$$

$$= \theta_t - \alpha \nabla f_t(\theta_t)/s - \alpha\lambda'\theta_t, \qquad (11)$$

where the division by s is element-wise.
O with weight decay has the following iterates on $f_t(\theta)$:

$$\theta_{t+1} \leftarrow (1-\lambda)\theta_t - \alpha \nabla f_t(\theta_t)/s \qquad (12)$$

$$= \theta_t - \alpha \nabla f_t(\theta_t)/s - \lambda\theta_t. \qquad (13)$$

These iterates are identical since $\lambda' = \lambda/\alpha$.

## B ADDITIONAL PRACTICAL IMPROVEMENTS OF ADAM

Having discussed decoupled weight decay for improving Adam's generalization, in this section we introduce two additional components to improve Adam's performance in practice.

### B.1 NORMALIZED WEIGHT DECAY

Our preliminary experiments showed that different weight decay factors are optimal for different computational budgets (defined in terms of the number of batch passes). Relatedly, Li et al. (2017) demonstrated that a smaller batch size (for the same total number of epochs) makes the shrinking effect of weight decay more pronounced. Here, we propose to reduce this dependence by normalizing the values of weight decay. Specifically, we replace the hyperparameter λ by a new (more robust) normalized weight decay hyperparameter λ_norm, and use it to set λ as

$$\lambda = \lambda_{norm}\sqrt{\frac{b}{BT}},$$

where b is the batch size, B is the total number of training points and T is the total number of epochs.² Thus, λ_norm can be interpreted as the weight decay to be used if only one batch pass were allowed. We emphasize that our choice of normalization is merely one possibility, informed by few experiments; a more lasting conclusion we draw is that using some normalization can substantially improve results.

² In the context of our AdamWR variant discussed in Section B.2, T is the total number of epochs in the current restart.
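As a small helper (our sketch of the rule above, not code from the paper), the raw weight decay factor can be recovered from λ_norm as follows:

```python
# Normalized weight decay (Section B.1): lambda = lambda_norm * sqrt(b / (B * T)).
import math

def raw_weight_decay(lambda_norm: float, batch_size: int,
                     num_train_points: int, num_epochs: int) -> float:
    """b = batch size, B = total number of training points, T = total number of epochs
    (for AdamWR/SGDWR, T is the number of epochs in the current restart)."""
    return lambda_norm * math.sqrt(batch_size / (num_train_points * num_epochs))

# Example: batch size 128, CIFAR-10 (50,000 training images), 100 epochs
lam = raw_weight_decay(0.05, 128, 50_000, 100)   # roughly 2.5e-4
```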
### B.2 ADAM WITH COSINE ANNEALING AND WARM RESTARTS

We now apply cosine annealing and warm restarts to Adam, following the recent work of Loshchilov & Hutter (2016). There, the authors proposed Stochastic Gradient Descent with Warm Restarts (SGDR) to improve the anytime performance of SGD by quickly cooling down the learning rate according to a cosine schedule and periodically increasing it. SGDR has been successfully adopted to obtain new state-of-the-art results for popular image classification benchmarks (Huang et al., 2017; Gastaldi, 2017; Zoph et al., 2017), and we therefore tried extending it to Adam. However, while our initial version of Adam with warm restarts had better anytime performance than Adam, it was not competitive with SGD with warm restarts, precisely because L2 regularization was not working as well as in SGD. Now, having fixed this issue by means of the original weight decay regularization (Section 2) and having also introduced normalized weight decay (Section B.1), the original work on cosine annealing and warm restarts by Loshchilov & Hutter (2016) directly carries over to Adam.

In the interest of keeping the presentation self-contained, we briefly describe how SGDR schedules the change of the effective learning rate in order to accelerate the training of DNNs. Here, we decouple the initial learning rate α and its multiplier η_t used to obtain the actual learning rate at iteration t (see, e.g., line 8 in Algorithm 1). In SGDR, we simulate a new warm-started run/restart of SGD once T_i epochs have been performed, where i is the index of the run. Importantly, the restarts are not performed from scratch but emulated by increasing η_t while the old value of θ_t is used as the initial solution. The amount by which η_t is increased controls to which extent the previously acquired information (e.g., momentum) is used.

Within the i-th run, the value of η_t decays according to a cosine annealing schedule (Loshchilov & Hutter, 2016) for each batch as follows:

$$\eta_t = \eta^{(i)}_{min} + 0.5\,\big(\eta^{(i)}_{max} - \eta^{(i)}_{min}\big)\big(1 + \cos(\pi T_{cur}/T_i)\big), \qquad (14)$$

where $\eta^{(i)}_{min}$ and $\eta^{(i)}_{max}$ are ranges for the multiplier and $T_{cur}$ accounts for how many epochs have been performed since the last restart. $T_{cur}$ is updated at each batch iteration t and is thus not constrained to integer values. Adjusting (e.g., decreasing) $\eta^{(i)}_{min}$ and $\eta^{(i)}_{max}$ at every i-th restart (see also Smith (2016)) could potentially improve performance, but we do not consider that option here because it would involve additional hyperparameters. For $\eta^{(i)}_{max} = 1$ and $\eta^{(i)}_{min} = 0$, one can simplify Eq. (14) to

$$\eta_t = 0.5 + 0.5\cos(\pi T_{cur}/T_i). \qquad (15)$$

In order to achieve good anytime performance, one can start with an initially small $T_i$ (e.g., from 1% to 10% of the expected total budget) and multiply it by a factor of $T_{mult}$ (e.g., $T_{mult} = 2$) at every restart. The (i+1)-th restart is triggered when $T_{cur} = T_i$ by setting $T_{cur}$ to 0. An example setting of the schedule multiplier is given in Appendix C.

Our proposed AdamWR algorithm is AdamW (see Algorithm 2) with η_t following Eq. (15) and λ computed at each iteration using the normalized weight decay described in the previous section. We note that normalized weight decay allowed us to use a constant parameter setting across the short and long runs performed within AdamWR and SGDWR (SGDW with warm restarts).

## C AN EXAMPLE SETTING OF THE SCHEDULE MULTIPLIER

An example schedule of the schedule multiplier η_t is given in Supp Figure 1 for $T_{i=0} = 100$ and $T_{mult} = 2$. After the initial 100 epochs the learning rate will reach 0 because $\eta_{t=100} = 0$. Then, since $T_{cur} = T_{i=0}$, we restart by resetting $T_{cur} = 0$, causing the multiplier η_t to be reset to 1 due to Eq. (15). This multiplier then decreases again from 1 to 0, but now over the course of 200 epochs because $T_{i=1} = T_{i=0} T_{mult} = 200$. Solutions obtained right before the restarts, when η_t = 0 (e.g., at epoch indexes 100, 300, 700 and 1500, as shown in Supp Figure 1), are recommended by the optimizer as the solutions, with more recent solutions prioritized.
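The schedule of Eqs. (14) and (15) with warm restarts can be sketched as follows (our illustration; for brevity it updates T_cur once per epoch, whereas the paper updates it at every batch iteration):

```python
# Cosine annealing multiplier eta_t with warm restarts (illustrative sketch).
import math

def cosine_multiplier(t_cur: float, t_i: float,
                      eta_min: float = 0.0, eta_max: float = 1.0) -> float:
    """Eq. (14); reduces to Eq. (15) for eta_min = 0 and eta_max = 1."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

# Emulated restarts for T_0 = 100 and T_mult = 2, as in the example of Appendix C:
t_cur, t_i, t_mult = 0.0, 100.0, 2
for epoch in range(1500):
    eta_t = cosine_multiplier(t_cur, t_i)   # multiplies the base learning rate alpha
    t_cur += 1
    if t_cur >= t_i:                        # restart: reset T_cur and enlarge the budget
        t_cur, t_i = 0.0, t_i * t_mult      # restarts land at epochs 100, 300, 700, 1500
```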
## D ADDITIONAL RESULTS

Supp Figure 1: An example schedule of the learning rate multiplier η_t as a function of the epoch index, for $T_0 = 100$ and $T_{mult} = 2$. The first run is scheduled to converge at epoch $T_{i=0} = 100$; then the budget for the next run is doubled to $T_{i=1} = T_{i=0} T_{mult} = 200$, etc.

We investigated whether the use of much longer runs (1800 epochs) of standard Adam (Adam with L2 regularization and a fixed learning rate) makes the use of cosine annealing unnecessary. Supp Figure 2 shows the results of standard Adam for a 4 by 4 logarithmic grid of hyperparameter settings (the coarseness of the grid is due to the high computational expense of 1800-epoch runs). Even after taking the low resolution of the grid into account, the results appear to be at best comparable to the ones obtained with AdamW with 18 times fewer epochs and a smaller network (see Supp Figure 3, top row, middle). These results are not very surprising given Figure 2 in the main paper (which demonstrates the effectiveness of AdamW) and Supp Figure 1 (which demonstrates the necessity of using some learning rate schedule, such as cosine annealing).

Our experimental results with Adam and SGD suggested that the total runtime in terms of the number of epochs affects the basin of optimal hyperparameters (see Supp Figure 3). More specifically, the greater the total number of epochs, the smaller the values of the weight decay should be. Supp Figure 4 shows that our remedy for this problem, the normalized weight decay defined in Section B.1, simplifies hyperparameter selection because the optimal values observed for short runs are similar to the ones for much longer runs. We used our initial experiments on CIFAR-10 to suggest the square root normalization we proposed in Section B.1 and double-checked that this is not a coincidence on the ImageNet32x32 dataset (Chrabaszcz et al., 2017), a downsampled version of the original ImageNet dataset with 1.2 million 32x32 pixel images, where an epoch is 24 times longer than on CIFAR-10. This experiment also supported the square root scaling: the best values of the normalized weight decay observed on CIFAR-10 represented nearly optimal values for ImageNet32x32 (see Supp Figure 3). In contrast, had we used for ImageNet32x32 the same raw weight decay values λ as for CIFAR-10 and for the same number of epochs, without the proposed normalization, λ would have been roughly 5 times too large for ImageNet32x32, leading to much worse performance. The optimal normalized weight decay values were also very similar (e.g., λ_norm = 0.025 and λ_norm = 0.05) across SGDW and AdamW.

Supp Figure 4 is the equivalent of Figure 3 in the main paper, but for ImageNet32x32 instead of CIFAR-10. The qualitative results are identical: weight decay leads to better training loss (cross-entropy) than L2 regularization, and to an even greater improvement in test error. Supp Figure 5 and Supp Figure 6 are the equivalents of Figure 4 in the main paper, supplemented with training loss curves in their bottom rows. The results show that Adam and its variants with decoupled weight decay converge faster (in terms of training loss) on CIFAR-10 than the corresponding SGD variants (the difference for ImageNet32x32 is small). As discussed in the main paper, when the same values of training loss are considered, AdamW demonstrates better test error than Adam. Interestingly, Supp Figure 5 and Supp Figure 6 show that the restart variants AdamWR and SGDWR also demonstrate better generalization than AdamW and SGDW, respectively.

Supp Figure 2: Performance of "standard Adam": Adam with L2 regularization and a fixed learning rate. We show the final test error of a 26 2x96d ResNet on CIFAR-10 after 1800 epochs of the original Adam for different settings of the learning rate and the weight decay used for L2 regularization.

Supp Figure 3: Effect of normalized weight decay. We show the final Top-1 test error on CIFAR-10 (first two rows, for AdamW without and with normalized weight decay) and the Top-5 error on ImageNet32x32 (last two rows, for AdamW and SGDW, both with normalized weight decay) of a 26 2x64d ResNet after different numbers of epochs (see columns). While the optimal settings of the raw weight decay change significantly for different runtime budgets (see the first row), the values of the normalized weight decay remain very similar for different budgets (see the second row) and different datasets (here, CIFAR-10 and ImageNet32x32), and even across AdamW and SGDW.
Supp Figure 4: Learning curves (top row) and generalization results (Top-5 errors, bottom row) obtained by a 26 2x96d ResNet trained with Adam and AdamW on ImageNet32x32.

Supp Figure 5: Test error curves (top row) and training loss curves (bottom row) for CIFAR-10.

Supp Figure 6: Test error curves (top row) and training loss curves (bottom row) for ImageNet32x32.