Improving Robustness with Adaptive Weight Decay

Amin Ghiasi, Ali Shafahi, Reza Ardekani
Apple, Cupertino, CA, 95014
{mghiasi2, ashafahi, rardekani}@apple.com

We propose adaptive weight decay, which automatically tunes the hyper-parameter for weight decay during each training iteration. For classification problems, we propose changing the value of the weight decay hyper-parameter on the fly based on the strength of updates from the classification loss (i.e., the gradient of cross-entropy) and the regularization loss (i.e., the ℓ2-norm of the weights). We show that this simple modification can result in large improvements in adversarial robustness (an area which suffers from robust overfitting) without requiring extra data, across various datasets and architecture choices. For example, our reformulation results in a 20% relative robustness improvement on CIFAR-100 and a 10% relative robustness improvement on CIFAR-10 compared to the best-tuned hyper-parameters of traditional weight decay, resulting in models that have comparable performance to SOTA robustness methods. In addition, this method has other desirable properties, such as less sensitivity to learning rate and smaller weight norms; the latter contributes to robustness against overfitting to label noise and to pruning.

1 Introduction

Deep Neural Networks (DNNs) have exceeded human capability on many computer vision tasks. Due to their high capacity for memorizing training examples (Zhang et al., 2021), DNN generalization heavily relies on the training algorithm. To reduce memorization and improve generalization, several approaches have been taken, including regularization and augmentation. Some of these augmentation techniques alter the network input (DeVries & Taylor, 2017; Chen et al., 2020; Cubuk et al., 2019, 2020; Müller & Hutter, 2021), some alter hidden states of the network (Srivastava et al., 2014; Ioffe & Szegedy, 2015; Gastaldi, 2017; Yamada et al., 2019), some alter the expected output (Warde-Farley & Goodfellow, 2016; Kannan et al., 2018), and some affect multiple levels (Zhang et al., 2017; Yun et al., 2019; Hendrycks et al., 2019b). Typically, augmentation methods aim to enhance generalization by increasing the diversity of the dataset. Regularizers, such as weight decay (Plaut et al., 1986; Krogh & Hertz, 1991), serve to prevent overfitting by eliminating solutions that solely memorize training examples and by constraining the complexity of the DNN. Regularization methods are most beneficial in areas such as adversarial robustness and noisy-data settings, which suffer from catastrophic overfitting.

In this paper, we revisit weight decay, a regularizer mainly used to avoid overfitting. The rest of the paper is organized as follows: In Section 2, we revisit tuning the weight decay hyper-parameter to improve adversarial robustness and introduce Adaptive Weight Decay. Also in Section 2, through extensive experiments on various image classification datasets, we show that adversarial training with Adaptive Weight Decay improves both robustness and natural generalization compared to traditional non-adaptive weight decay. Next, in Section 3, we briefly mention other potential applications of Adaptive Weight Decay to network pruning, robustness to sub-optimal learning rates, and training on noisy labels.
2 Adversarial Robustness

DNNs are susceptible to adversarial perturbations (Szegedy et al., 2013; Biggio et al., 2013). In the adversarial setting, the adversary adds a small imperceptible noise to the image, which fools the network into making an incorrect prediction. To ensure that the adversarial noise is imperceptible to the human eye, noise with bounded p-norms is usually studied (Sharif et al., 2018). In such settings, the objective of the adversary is to maximize the following loss:

$$\max_{\|\delta\|_p \leq \epsilon} \mathrm{Xent}(f(x + \delta, w), y), \qquad (1)$$

where Xent is the cross-entropy loss, δ is the adversarial perturbation, x is the clean example, y is the ground-truth label, ε is the adversarial perturbation budget, and w are the DNN parameters.

A multitude of papers concentrate on the adversarial task and propose methods to generate strong adversarial examples through various approaches, including the modification of the loss function and the provision of optimization techniques to effectively optimize the adversarial generation loss functions (Goodfellow et al., 2014; Madry et al., 2017; Carlini & Wagner, 2017; Izmailov et al., 2018; Croce & Hein, 2020a; Andriushchenko et al., 2020). An additional area of research centers on mitigating the impact of potent adversarial examples. While certain studies on adversarial defense prioritize approaches with theoretical guarantees (Wong & Kolter, 2018; Cohen et al., 2019), in practical applications, variations of adversarial training have emerged as the prevailing defense strategy against adversarial attacks (Madry et al., 2017; Shafahi et al., 2019; Wong et al., 2020; Rebuffi et al., 2021; Gowal et al., 2020). Adversarial training involves the on-the-fly generation of adversarial examples during the training process and subsequently training the model on these examples. The adversarial training loss can be formulated as a min-max optimization problem:

$$\min_{w} \max_{\|\delta\|_p \leq \epsilon} \mathrm{Xent}(f(x + \delta, w), y). \qquad (2)$$

2.1 Robust overfitting and relationship to weight decay

Adversarial training is a strong baseline for defending against adversarial attacks; however, it often suffers from a phenomenon referred to as Robust Overfitting (Rice et al., 2020). Weight decay regularization, as discussed in 2.1.1, is a common technique used for preventing overfitting.

2.1.1 Weight Decay

Weight decay encourages the weights of networks to have smaller magnitudes (Zhang et al., 2018) and is widely used to improve generalization. Weight decay regularization can take many forms (Loshchilov & Hutter, 2017), and we focus on the popular ℓ2-norm variant. More precisely, we focus on classification problems with cross-entropy as the main loss (such as adversarial training) and weight decay as the regularizer, which was popularized by Krizhevsky et al. (2017):

$$\mathrm{Loss}_w(x, y) = \mathrm{Xent}(f(x, w), y) + \frac{\lambda_{wd}}{2}\|w\|_2^2, \qquad (3)$$

where w are the network parameters, (x, y) is the training data, and λ_wd is the weight-decay hyper-parameter. λ_wd is a crucial hyper-parameter in weight decay, determining the weight penalty relative to the main loss (e.g., cross-entropy). A small λ_wd may cause overfitting, while a large value can yield a low weight-norm solution that poorly fits the training data. Thus, selecting an appropriate λ_wd value is essential for achieving an optimal balance.

2.1.2 Robust overfitting phenomenon revisited

To study robust overfitting, we focus on evaluating the ℓ∞ adversarial robustness on the CIFAR-10 dataset while limiting the adversarial budget of the attacker to ε = 8/255, a common setting for evaluating robustness.
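To make eqs. 1-3 concrete, the sketch below shows one way the inner maximization (a PGD attack) and the regularized outer minimization could be implemented. This is a minimal PyTorch illustration under our own assumptions (a generic `model`, an SGD-style `optimizer`, pixel values in [0, 1], and the step size `alpha`), not the authors' released code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Approximate the inner maximization of eq. 2 with projected gradient descent."""
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        # Ascend the cross-entropy loss, then project back onto the l_inf ball of radius eps.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    # Keep the perturbed image in the valid pixel range.
    return (x + delta).clamp(0, 1).detach()

def adversarial_training_step(model, optimizer, x, y, lam_wd=5e-4):
    """One outer-minimization step of eq. 2 with the non-adaptive weight decay of eq. 3."""
    x_adv = pgd_attack(model, x, y)
    loss = F.cross_entropy(model(x_adv), y)
    loss = loss + 0.5 * lam_wd * sum((p ** 2).sum() for p in model.parameters())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the ℓ2 penalty is usually applied through the optimizer's weight_decay argument rather than added to the loss explicitly; it is written out here only to mirror eq. 3.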
For these experiments, we use a WideResNet 28-10 architecture (Zagoruyko & Komodakis, 2016) and the widely adopted PGD adversarial training (Madry et al., 2017) to solve the adversarial training loss with weight decay regularization:

$$\min_{w} \max_{\|\delta\|_\infty \leq 8/255} \mathrm{Xent}(f(x + \delta, w), y) + \frac{\lambda_{wd}}{2}\|w\|_2^2. \qquad (4)$$

We reserve 10% of the training examples as a held-out validation set for early stopping and checkpoint selection. In practice, to solve eq. 4, the network parameters w are updated after generating adversarial examples in real time using a 7-step PGD adversarial attack. We train for 200 epochs, using an initial learning rate of 0.1 combined with a cosine learning-rate schedule. Throughout training, at the end of each epoch, the robust accuracy and robustness loss (i.e., the cross-entropy loss on adversarial examples) are evaluated on the validation set by subjecting the held-out validation examples to a 3-step PGD attack. For further details, please refer to A.1.

To further understand the robust overfitting phenomenon in the presence of weight decay, we train different models by varying the weight-norm hyper-parameter λ_wd in eq. 4. Figure 1 illustrates the accuracy and cross-entropy loss on the adversarial examples built for the held-out validation set for three choices of λ_wd throughout training (Figure 2 captures the complete set of λ_wd values we tested). As seen in Figure 1(a), for small λ_wd choices, the robust validation accuracy does not monotonically increase towards the end of training. This non-monotonic behavior, which is related to robust overfitting, is even more pronounced if we look at the robustness loss computed on the held-out validation set (Figure 1(b)). Note that this behavior is still evident even if we look at the best hyper-parameter value according to the validation set (λ_wd = 0.00089).

Figure 1: Robust validation accuracy (a), validation loss (b), and training loss (c) on CIFAR-10 subsets. λ_wd = 0.00089 is the best-performing hyper-parameter we found by doing a grid search. The other two hyper-parameters are two points from our grid search, one with a larger and the other with a smaller hyper-parameter for weight decay. The thickness of the plot lines corresponds to the magnitude of the weight-norm penalties. As can be seen in (a) and (b), networks trained with small values of λ_wd suffer from robust overfitting, while networks trained with larger values of λ_wd do not; however, the larger λ_wd further prevents the network from fitting the data (c), resulting in reduced overall robustness.

Various methods have been proposed to rectify robust overfitting, including early stopping (Rice et al., 2020), use of extra unlabeled data (Carmon et al., 2019), synthesized images (Gowal et al., 2020), pre-training (Hendrycks et al., 2019a), use of data augmentations (Rebuffi et al., 2021), and stochastic weight averaging (Izmailov et al., 2018). In Fig. 1, we observe that simply having smaller weight norms (by increasing λ_wd) could reduce this non-monotonic behavior on the validation-set adversarial examples. However, this comes at the cost of a larger cross-entropy loss on the training-set adversarial examples, as shown in Figure 1(c). Even though the overall loss function in eq. 4 is a minimization problem, the terms in the loss function implicitly have conflicting objectives: during the training process, when the cross-entropy term holds dominance, effectively reducing the weight norm becomes challenging, resulting in non-monotonic behavior of robust validation metrics towards the later stages of training.
Conversely, when the weight-norm term takes precedence, the cross-entropy objective encounters difficulties in achieving significant reductions. In the next section, we introduce Adaptive Weight Decay, which explicitly strikes a balance between these two terms during training.

2.2 Adaptive Weight Decay

Inspired by the findings in 2.1.2, we propose Adaptive Weight Decay (AWD). The goal of AWD is to maintain a balance between weight decay and cross-entropy updates during training in order to guide the optimization to a solution which satisfies both objectives more effectively. To derive AWD, we study one gradient descent step for updating the parameters w at step t + 1 from their value at step t:

$$w_{t+1} = w_t - \nabla w_t \cdot lr - \lambda_{wd}\, w_t \cdot lr, \qquad (5)$$

where ∇w_t is the gradient computed from the cross-entropy objective and λ_wd·w_t is the gradient computed from the weight decay term in eq. 3. We define λ_awd(t) as a metric that keeps track of the ratio of the magnitudes coming from each objective:

$$\lambda_{awd}(t) = \frac{\|\lambda_{wd}\, w_t\|}{\|\nabla w_t\|}. \qquad (6)$$

To keep a balance between the two objectives, we aim to keep this ratio constant during training. AWD is a simple yet effective way of maintaining this balance. Adaptive weight decay shares similarities with non-adaptive (traditional) weight decay, with the only distinction being that the hyper-parameter λ_wd is not fixed throughout training. Instead, λ_wd changes dynamically in each iteration to ensure λ_awd(t) ≈ λ_awd(t−1) ≈ ⋯ ≈ λ_awd. To keep this ratio constant at every step t, we can rewrite the λ_awd equation (eq. 6) as:

$$\lambda_{wd}(t) = \frac{\lambda_{awd}\, \|\nabla w_t\|}{\|w_t\|}. \qquad (7)$$

Eq. 7 allows us to have a different weight decay hyper-parameter value λ_wd(t) for every optimization iteration t, which keeps the gradients received from the cross-entropy and weight decay terms balanced throughout the optimization. Note that the weight decay penalty λ_wd(t) can be computed on the fly with almost no computational overhead during training. Using the exponentially weighted average λ̄_t = 0.1·λ̄_{t−1} + 0.9·λ_wd(t), we can make the penalty more stable (Algorithm 1).

Algorithm 1 Adaptive Weight Decay
1: Input: λ_awd > 0
2: λ̄ ← 0
3: for (x, y) in loader do
4:    p ← model(x)                            ▷ Get the model's prediction.
5:    main ← CrossEntropy(p, y)               ▷ Compute the cross-entropy loss.
6:    ∇w ← backward(main)                     ▷ Compute the gradients of the main loss w.r.t. the weights.
7:    λ ← ‖∇w‖ · λ_awd / ‖w‖                  ▷ Compute this iteration's weight decay hyper-parameter (eq. 7).
8:    λ̄ ← 0.1·λ̄ + 0.9·stop_gradient(λ)        ▷ Update the exponentially weighted average as a scalar.
9:    w ← w − lr·(∇w + λ̄·w)                   ▷ Update the network's parameters.
10: end for

2.2.1 Differences between Adaptive and Non-Adaptive Weight Decay

To study the differences between adaptive and non-adaptive weight decay and to build intuition, we can plug λ_wd(t) of the adaptive method (eq. 7) directly into the equation for traditional weight decay (eq. 3) and derive the total loss based on Adaptive Weight Decay:

$$\mathrm{Loss}_{w_t}(x, y) = \mathrm{Xent}(f(x, w_t), y) + \frac{\lambda_{awd}}{2}\, \|\nabla w_t\|\, \|w_t\|. \qquad (8)$$

Please note that directly solving eq. 8 would invoke the computation of second-order derivatives, since λ_wd(t) is computed using first-order derivatives. However, as stated in Alg. 1, we convert λ_wd(t) into a non-tensor scalar to save computation and avoid second-order derivatives. We treat ‖∇w_t‖ in eq. 8 as a constant and do not allow gradients to back-propagate through it. As a result, adaptive weight decay has negligible computational overhead compared to traditional non-adaptive weight decay.
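For concreteness, below is a minimal PyTorch sketch of Algorithm 1 as we read it. The 0.1/0.9 smoothing constants and the update rule follow Algorithm 1, while `model`, `loader`, `lr`, and the plain (momentum-free) SGD update are our own simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def train_awd_epoch(model, loader, lr=0.1, lam_awd=0.022, device="cuda"):
    """One epoch of training with Adaptive Weight Decay (our sketch of Algorithm 1)."""
    lam_bar = 0.0  # exponentially weighted average of the per-iteration penalty
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        main = F.cross_entropy(model(x), y)                   # line 5: cross-entropy
        params = [p for p in model.parameters() if p.requires_grad]
        grads = torch.autograd.grad(main, params)             # line 6: gradients w.r.t. weights
        with torch.no_grad():
            grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
            weight_norm = torch.sqrt(sum((p ** 2).sum() for p in params))
            # eq. 7, converted to a plain Python scalar (the stop_gradient in line 7/8).
            lam = (grad_norm * lam_awd / weight_norm).item()
            lam_bar = 0.1 * lam_bar + 0.9 * lam               # line 8: smoothing
            # line 9: plain SGD update with the adaptive weight decay penalty.
            for p, g in zip(params, grads):
                p.add_(g + lam_bar * p, alpha=-lr)
    return lam_bar
```

In practice the penalty λ̄ would typically be folded into an existing optimizer (the paper's experiments use momentum SGD and, later, ASAM); the plain update above only mirrors the structure of lines 4-9.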
By comparing the weight decay term of the adaptive loss (eq. 8), (λ_awd/2)·‖∇w‖·‖w‖, with that of the traditional weight decay loss (eq. 3), (λ_wd/2)·‖w‖², we can build intuition about some of the differences between the two. For example, the non-adaptive weight decay regularization term approaches zero only when the weight norms are close to zero, whereas in AWD it also happens when the cross-entropy gradients are close to zero. Consequently, AWD prevents over-optimization of weight norms in flat minima, allowing more (relative) weight to be given to the cross-entropy objective. Additionally, AWD penalizes weight norms more when the gradient of cross-entropy is large, preventing it from falling into steep local minima and hence overfitting early in training.

We verify our intuition that AWD can reduce robust overfitting in practice by replacing the non-adaptive weight decay with AWD and monitoring the same two metrics from 2.1.2. The results for a good choice of the AWD hyper-parameter (λ_awd) and various choices of the non-adaptive weight decay hyper-parameter (λ_wd) are summarized in Figure 2 (see Appendix C.4 for a similar analysis on other datasets).

Figure 2: Robust accuracy (a) and loss (b) on the CIFAR-10 validation subset. Both figures highlight the best-performing hyper-parameter for non-adaptive weight decay, λ_wd = 0.00089, with sharp strokes. As can be seen, lower values of λ_wd cause robust overfitting, while high values prevent the network from fitting entirely. Training with adaptive weight decay, however, prevents overfitting and achieves the highest robustness.

2.2.2 Related works to Adaptive Weight Decay

The studies most related to AWD are AdaDecay (Nakamura & Hong, 2019) and LARS (You et al., 2017). AdaDecay changes the weight decay hyper-parameter adaptively for each individual parameter, as opposed to ours, which tunes the hyper-parameter for the entire network. LARS is a common optimizer for large batch sizes which adaptively changes the learning rate for each layer. We evaluate these relevant methods in the context of improving adversarial robustness and experimentally compare them with AWD in Table 2 and Appendix D (due to space limitations, we defer detailed discussions and comparisons to Appendix D).

2.3 Experimental Robustness results for Adaptive Weight Decay

AWD can help improve the robustness on various datasets which suffer from robust overfitting. To illustrate this, we focus on six datasets: SVHN, Fashion-MNIST, Flowers, CIFAR-10, CIFAR-100, and Tiny ImageNet. Tiny ImageNet is a subset of ImageNet, consisting of 200 classes and images of size 64×64×3. For all experiments, we use the widely accepted 7-step PGD adversarial training to solve eq. 4 (Madry et al., 2017) while keeping 10% of the examples from the training set as a held-out validation set for the purpose of early stopping. For early stopping, we select the checkpoint with the highest ℓ∞ = 8/255 robust accuracy measured by a 3-step PGD attack on the held-out validation set. For the CIFAR-10, CIFAR-100, and Tiny ImageNet experiments, we use a WideResNet 28-10 architecture, and for SVHN, Fashion-MNIST, and Flowers, we use a ResNet-18 architecture. Other details about the experimental setup can be found in Appendix A.1.

For all experiments, we tune the conventional non-adaptive weight decay hyper-parameter (λ_wd) for improving robustness generalization and compare that to tuning the λ_awd hyper-parameter of adaptive weight decay. To ensure that we search over enough values for λ_wd, we use up to twice as many values for λ_wd as for λ_awd. Figure 3 plots the robust accuracy measured by applying AutoAttack (Croce & Hein, 2020b) to the test examples for the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets, respectively.
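For reference, the sketch below shows how such an AutoAttack evaluation is commonly run with the publicly released autoattack package; the `AutoAttack` constructor and `run_standard_evaluation` call reflect our understanding of that library's interface, and `model`, `x_test`, `y_test` are assumed to be a trained classifier and test tensors on the same device.

```python
import torch
from autoattack import AutoAttack  # https://github.com/fra31/auto-attack

def robust_accuracy(model, x_test, y_test, eps=8/255, batch_size=256):
    """Evaluate l_inf robust accuracy with the standard AutoAttack ensemble."""
    model.eval()
    adversary = AutoAttack(model, norm='Linf', eps=eps, version='standard')
    x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=batch_size)
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds == y_test).float().mean().item()
```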
Figure 3: ℓ∞ = 8/255 robust accuracy on the test set of adversarially trained WideResNet 28-10 networks on CIFAR-10, CIFAR-100, and Tiny ImageNet, using (a, c, e) different choices for the hyper-parameter of non-adaptive weight decay (λ_wd) and (b, d, f) different choices of the hyper-parameter of adaptive weight decay (λ_awd).

We observe that training with adaptive weight decay improves the robustness by a margin of 4.84% on CIFAR-10, 5.08% on CIFAR-100, and 3.01% on Tiny ImageNet, compared to the non-adaptive counterpart. These margins translate to relative improvements of 10.7%, 20.5%, and 18.0% on CIFAR-10, CIFAR-100, and Tiny ImageNet, respectively.

Method          | Dataset       | Opt | ‖W‖₂   | Nat Acc | AutoAttack | Xent + (λ/2)‖w‖²₂
λ_wd = 0.00089  | CIFAR-10      | SGD | 35.58  | 84.31   | 45.19      | 0.58
λ_awd = 0.022   | CIFAR-10      | SGD | 7.11   | 87.08   | 50.03      | 0.08
λ_wd = 0.0005   | CIFAR-100     | SGD | 51.32  | 60.15   | 22.53      | 0.67
λ_awd = 0.022   | CIFAR-100     | SGD | 13.41  | 61.39   | 27.15      | 1.51
λ_wd = 0.00211  | Tiny ImageNet | SGD | 25.62  | 47.87   | 16.73      | 3.56
λ_awd = 0.01714 | Tiny ImageNet | SGD | 15.01  | 48.46   | 19.74      | 2.80
λ_wd = 5e-7     | SVHN          | SGD | 102.11 | 92.04   | 44.16      | 1.02
λ_awd = 0.02378 | SVHN          | SGD | 5.39   | 93.04   | 47.10      | 1.15
λ_wd = 0.00089  | Fashion-MNIST | SGD | 14.39  | 83.96   | 78.73      | 0.51
λ_awd = 0.01414 | Fashion-MNIST | SGD | 9.05   | 85.42   | 79.24      | 0.44
λ_wd = 0.005    | Flowers       | SGD | 19.94  | 90.98   | 32.35      | 1.72
λ_awd = 0.06727 | Flowers       | SGD | 13.87  | 90.39   | 39.22      | 1.42

Table 1: Adversarial robustness of PGD-7 adversarially trained networks using adaptive and non-adaptive weight decay. The table summarizes the best-performing hyper-parameter for each method on each dataset. Not only does the adaptive method outperform the non-adaptive method in terms of robust accuracy, it is also superior in terms of natural accuracy. Models trained with AWD have considerably smaller weight norms. In the last column, we report the total loss value of the non-adaptive weight decay objective, using the best-tuned λ found by grid search for each dataset. Interestingly, when we measure the non-adaptive total loss (eq. 4) on the training set, we observe that networks trained with the adaptive method often have a smaller non-adaptive training loss, even though AWD does not optimize that loss directly.

Increasing robustness often comes at the cost of drops in clean accuracy (Zhang et al., 2019). This observation could be attributed, at least in part, to the phenomenon that certain p-norm bounded adversarial examples bear a closer resemblance to the network's predicted class than to their original class (Sharif et al., 2018). An active area of research seeks a better trade-off between robustness and natural accuracy by finding other points on the Pareto-optimal curve of robustness and accuracy. For example, Balaji et al. (2019) use instance-specific perturbation budgets during training. Interestingly, when comparing the most robust network trained with non-adaptive weight decay (λ_wd) to that trained with AWD (λ_awd), we notice that those trained with the adaptive method have higher clean accuracy across various datasets (Table 1). In addition, we observe comparatively smaller weight norms for models trained with adaptive weight decay, which might contribute to their better generalization and lesser robust overfitting. For the SVHN dataset, the best model trained with AWD has a 20x smaller weight norm compared to the best model trained with traditional weight decay.
Achieving networks with such small weight norms that still maintain good performance on validation data is difficult with traditional weight decay, as previously illustrated in sec. 2.1.2. Perhaps most interestingly, when we compute the value of the non-adaptive weight decay loss for AWD-trained models, we observe that they are sometimes even superior in terms of that objective, as can be seen in the last column of Table 1. This behavior could imply that the AWD reformulation is better in terms of optimization and can more easily find a balance that simultaneously decreases both objective terms in eq. 4, which is aligned with the intuitive explanation in sec. 2.2.1.

The results shown so far suggest an excellent potential for adversarial training with AWD. To further study this potential, we only substituted the momentum-SGD optimizer used in all previous experiments with the ASAM optimizer (Kwon et al., 2021), and used the same hyper-parameters as in previous experiments, for comparison with advanced adversarial robustness algorithms. To the best of our knowledge, and according to RobustBench (Croce et al., 2020), the state-of-the-art ℓ∞ = 8/255 defense for CIFAR-100 without extra synthesized or captured data using WRN28-10 achieves 29.80% robust accuracy (Rebuffi et al., 2021). We achieve 29.54% robust accuracy, which is comparable to these advanced algorithms, even though our approach is a simple modification to weight decay. Moreover, our proposed method achieves 63.93% natural accuracy, which is 1% higher. See Table 2 for more details. In addition to comparing with the SOTA method on WRN28-10, we compare with a large suite of strong robustness methods which report their performance on WRN32-10 in Table 2. In addition to these two architectures, in Appendix C.1 we test other architectures to demonstrate the robustness of the proposed method to various architectural choices. For ablations demonstrating the robustness of AWD to various other parameter choices for adversarial training, such as the number of epochs and the adversarial budget (ε), please refer to Appendix C.3 and Appendix C.2, respectively.

Adaptive Weight Decay (AWD) can help improve the robustness over traditional weight decay on many datasets, as summarized in Table 1.

Method                       | WRN   | Aug    | Epochs | ASAM | TR | SWA | Nat   | AA
λ_AdaDecay = 0.002*          | 28-10 | P&C    | 200    | -    | -  | -   | 57.17 | 24.18
(Rebuffi et al., 2021)       | 28-10 | CutMix | 400    | -    | X  | X   | 62.97 | 29.80
(Rebuffi et al., 2021)       | 28-10 | P&C    | 400    | -    | X  | X   | 59.06 | 28.75
λ_wd = 0.0005                | 28-10 | P&C    | 200    | -    | -  | -   | 60.15 | 22.53
λ_wd = 0.0005 + ASAM         | 28-10 | P&C    | 100    | X    | -  | -   | 58.09 | 22.55
λ_wd = 0.00281* + ASAM       | 28-10 | P&C    | 100    | X    | -  | -   | 62.24 | 26.38
λ_awd = 0.022                | 28-10 | P&C    | 200    | -    | -  | -   | 61.39 | 27.15
λ_awd = 0.022 + ASAM         | 28-10 | P&C    | 100    | X    | -  | -   | 63.93 | 29.54
AT (Madry et al., 2017)      | 32-10 | P&C    | 100    | -    | -  | -   | 60.13 | 24.76
TRADES (Zhang et al., 2019)  | 32-10 | P&C    | 100    | -    | X  | -   | 60.73 | 24.90
MART (Wang et al., 2020)     | 32-10 | P&C    | 100    | -    | -  | -   | 54.08 | 25.30
FAT (Zhang et al., 2020a)    | 32-10 | P&C    | 100    | -    | -  | -   | 66.74 | 20.88
AWP (Wu et al., 2020)        | 32-10 | P&C    | 100    | -    | -  | -   | 55.16 | 25.16
GAIRAT (Zhang et al., 2020b) | 32-10 | P&C    | 100    | -    | -  | -   | 58.43 | 17.54
MAIL-AT (Liu et al., 2021)   | 32-10 | P&C    | 100    | -    | -  | -   | 60.74 | 22.44
MAIL-TR (Liu et al., 2021)   | 32-10 | P&C    | 100    | -    | X  | -   | 60.13 | 24.80
λ_awd = 0.022 + ASAM         | 32-10 | P&C    | 100    | X    | -  | -   | 64.49 | 29.70

Table 2: CIFAR-100 adversarial robustness performance of various strong methods. Adaptive weight decay with the ASAM optimizer outperforms many strong baselines. For experiments marked with *, we do another round of hyper-parameter search.
λ_AdaDecay indicates using the work of Nakamura & Hong (2019). The columns represent the method, the depth and width of the WideResNets used, the augmentation, the number of epochs, and whether ASAM, TRADES (Zhang et al., 2019), and Stochastic Weight Averaging (Izmailov et al., 2018) were used in training, followed by the natural accuracy and the adversarial accuracy under AutoAttack. In the augmentation column, P&C is short for Pad and Crop. The WRN32-10 baseline results are based on results from Liu et al. (2021), which use a custom choice of parameters to alleviate robust overfitting. We also experimented with methods related to AWD, such as LARS; we observed no improvement, so we do not report those results here. More details can be found in Appendix D.2.

In Table 2, we demonstrate that AWD, when combined with advanced optimization methods such as ASAM, can result in models which have good natural and robust accuracies compared with advanced methods on the CIFAR-100 dataset. Table 3 compares AWD+ASAM with various advanced methods on the CIFAR-10 dataset. The hyper-parameters used in this experiment are the same as those used before for the CIFAR-100 dataset. As can be seen, despite its simplicity, AWD shows improvements over very strong baselines on two extensively studied datasets of the adversarial machine learning domain (ImageNet robustness results can be seen in Appendix C.11).

Method                       | WRN   | Aug | Epochs | Nat   | AA
AT (Madry et al., 2017)      | 32-10 | P&C | 100    | 87.80 | 48.46
TRADES (Zhang et al., 2019)  | 32-10 | P&C | 100    | 86.36 | 53.40
MART (Wang et al., 2020)     | 32-10 | P&C | 100    | 84.76 | 51.40
FAT (Zhang et al., 2020a)    | 32-10 | P&C | 100    | 89.70 | 47.48
AWP (Wu et al., 2020)        | 32-10 | P&C | 100    | 57.55 | 53.08
GAIRAT (Zhang et al., 2020b) | 32-10 | P&C | 100    | 86.30 | 40.30
MAIL-AT (Liu et al., 2021)   | 32-10 | P&C | 100    | 84.83 | 47.10
MAIL-TR (Liu et al., 2021)   | 32-10 | P&C | 100    | 84.00 | 53.90
λ_awd = 0.022 + ASAM         | 32-10 | P&C | 100    | 88.55 | 54.04

Table 3: CIFAR-10 natural accuracy (Nat) and AutoAttack robust accuracy (AA). WRN32-10 CIFAR-10 models trained with AWD and the ASAM optimizer are more robust than models trained with various sophisticated algorithms from the literature.

3 Additional Properties of Adaptive Weight Decay

Due to the properties mentioned before, such as reducing overfitting and resulting in networks with smaller weight norms, Adaptive Weight Decay (AWD) can be seen as a good choice for other applications which can benefit from robustness. In particular, in 3.1 below we study the effect of adaptive weight decay in the noisy-label setting. More specifically, we show roughly 4% accuracy improvement on CIFAR-100 and 2% on CIFAR-10 when training in the 20% symmetric label-flipping setting (Bartlett et al., 2006). In addition, in the Appendix, we show the potential of AWD for reducing sensitivity to sub-optimal learning rates. We also show that, among naturally trained networks achieving roughly similar accuracy, those trained with adaptive weight decay tend to have lower weight norms. This phenomenon can have exciting implications for pruning networks (LeCun et al., 1989; Hassibi & Stork, 1992).

3.1 Robustness to Noisy Labels

Popular vision datasets, such as MNIST (LeCun & Cortes, 2010), CIFAR (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009), contain some amount of label noise (Yun et al., 2021; Zhang, 2017).
While some studies provide methods for identifying and correcting such errors (Yun et al., 2021; Müller & Markert, 2019; Al-Rawi & Karatzas, 2018; Kuriyama, 2020), others provide training algorithms that avoid over-fitting to, or even better, avoid fitting, the incorrectly labeled examples entirely (Jiang et al., 2018; Song et al., 2019; Jiang et al., 2020). In this section, we perform a preliminary investigation of adaptive weight decay's resistance to fitting training data with label noise.

Following previous studies, we use symmetric label flipping (Bartlett et al., 2006) to create noisy data for CIFAR-10 and CIFAR-100 and use ResNet-34 as the backbone. Other experimental setup details can be found in Appendix A.2. As in the previous section, we test different hyper-parameters for adaptive and non-adaptive weight decay. To ease comparison in this setting, we train two networks for each hyper-parameter: (1) with a certain degree of label noise and (2) with no noisy labels. We then report the accuracy on the clean-label test set. The test accuracy of the second network, the one trained with no label noise, is simply the clean accuracy. Having the clean accuracy coupled with the accuracy after training on noisy data enables an easier understanding of the sensitivity of each training algorithm and choice of hyper-parameter to label noise.

Figure 4 gathers the results of the noisy-data experiment on CIFAR-100. It demonstrates that networks trained with adaptive weight decay exhibit a more modest decline in performance when label noise is present in the training set. For instance, Figure 4(a) shows that λ_awd = 0.028 for adaptive and λ_wd = 0.0089 for non-adaptive weight decay achieve roughly 70% accuracy when trained on clean data, while the adaptive version achieves 4% higher accuracy when trained on the noisy data. Appendix C.5 includes similar results for CIFAR-10. Intuitively, based on eq. 7, in the later stages of training, when examples with label noise generate large gradients, adaptive weight decay intensifies the weight decay penalty. This mechanism effectively prevents the model from fitting the noisy data by regularizing the gradients.

Figure 4: Comparison of similarly performing networks once trained on CIFAR-100 clean data, after training with 20% (a), 40% (b), 60% (c), and 80% (d) label noise. Networks trained with adaptive weight decay are less sensitive to label noise compared to ones trained with non-adaptive weight decay.
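As an illustration of the corruption process described above, below is a minimal sketch of symmetric label flipping as we interpret it (each selected label is replaced by a uniformly random different class); the function name and arguments are our own, not the authors' code, and the noise_rate argument corresponds to the 20%-80% settings in Figure 4.

```python
import numpy as np

def symmetric_label_flip(labels, num_classes, noise_rate, seed=0):
    """Flip a noise_rate fraction of labels to a uniformly random *different* class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n = len(labels)
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # Draw from the other num_classes - 1 classes so the label actually changes.
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels

# Example: corrupt 20% of CIFAR-100 training labels.
# noisy_labels = symmetric_label_flip(train_labels, num_classes=100, noise_rate=0.2)
```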
4 Conclusion

Regularization methods have long aided deep neural networks in generalizing to data not seen during training. Because they have a significant effect on the outcome, it is crucial to apply the right amount of regularization and to correctly tune training hyper-parameters. We propose Adaptive Weight Decay (AWD), a simple modification to weight decay, one of the most commonly employed regularization methods. In our study, we conduct a comprehensive comparison between AWD and non-adaptive weight decay in various settings, including adversarial robustness and training with noisy labels. Through rigorous experimentation, we demonstrate that AWD consistently yields enhanced robustness. By conducting experiments on diverse datasets and architectures, we provide empirical evidence of the effectiveness of our approach in mitigating robust overfitting.

5 Acknowledgement

We extend our sincere appreciation to Zhile Ren for their invaluable support and perceptive contributions during the publication of this manuscript.

References

Al-Rawi, M. and Karatzas, D. On the labeling correctness in computer vision datasets. In IAL@PKDD/ECML, pp. 1-23, 2018.

Andriushchenko, M., Croce, F., Flammarion, N., and Hein, M. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision, pp. 484-501. Springer, 2020.

Balaji, Y., Goldstein, T., and Hoffman, J. Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets. arXiv preprint arXiv:1910.08051, 2019.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138-156, 2006.

Bergstra, J., Yamins, D., and Cox, D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International Conference on Machine Learning, pp. 115-123. PMLR, 2013.

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., Giacinto, G., and Roli, F. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387-402. Springer, 2013.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017.

Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J. C., and Liang, P. S. Unlabeled data improves adversarial robustness. Advances in Neural Information Processing Systems, 32, 2019.

Chen, P., Liu, S., Zhao, H., and Jia, J. GridMask data augmentation. arXiv preprint arXiv:2001.04086, 2020.

Cohen, J., Rosenfeld, E., and Kolter, Z. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310-1320. PMLR, 2019.

Croce, F. and Hein, M. Minimally distorted adversarial examples with a fast adaptive boundary attack. In International Conference on Machine Learning, pp. 2196-2205. PMLR, 2020a.

Croce, F. and Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning, pp. 2206-2216. PMLR, 2020b.

Croce, F., Andriushchenko, M., Sehwag, V., Debenedetti, E., Flammarion, N., Chiang, M., Mittal, P., and Hein, M. RobustBench: a standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113-123, 2019.

Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. V. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702-703, 2020.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.

Gastaldi, X. Shake-shake regularization. arXiv preprint arXiv:1705.07485, 2017.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

Gowal, S., Qin, C., Uesato, J., Mann, T., and Kohli, P. Uncovering the limits of adversarial training against norm-bounded adversarial examples. arXiv preprint arXiv:2010.03593, 2020.

Hassibi, B. and Stork, D. Second order derivatives for network pruning: Optimal brain surgeon. Advances in Neural Information Processing Systems, 5, 1992.

Hendrycks, D., Lee, K., and Mazeika, M. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pp. 2712-2721. PMLR, 2019a.

Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., and Lakshminarayanan, B. AugMix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781, 2019b.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448-456. PMLR, 2015.

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., and Wilson, A. G. Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407, 2018.

Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304-2313. PMLR, 2018.

Jiang, L., Huang, D., Liu, M., and Yang, W. Beyond synthetic noise: Deep learning on controlled noisy labels. In International Conference on Machine Learning, pp. 4804-4815. PMLR, 2020.

Kannan, H., Kurakin, A., and Goodfellow, I. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.

Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84-90, 2017.

Krogh, A. and Hertz, J. A simple weight decay can improve generalization. Advances in Neural Information Processing Systems, 4, 1991.

Kuriyama, K. Autocleansing: Unbiased estimation of deep learning with mislabeled data. 2020.

Kwon, J., Kim, J., Park, H., and Choi, I. K. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905-5914. PMLR, 2021.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

LeCun, Y., Denker, J., and Solla, S. Optimal brain damage. Advances in Neural Information Processing Systems, 2, 1989.

Liu, F., Han, B., Liu, T., Gong, C., Niu, G., Zhou, M., Sugiyama, M., et al. Probabilistic margins for instance reweighting in adversarial training. Advances in Neural Information Processing Systems, 34:23258-23269, 2021.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Mendoza, H., Klein, A., Feurer, M., Springenberg, J. T., and Hutter, F. Towards automatically-tuned neural networks. In Workshop on Automatic Machine Learning, pp. 58-65. PMLR, 2016.

Müller, N. M. and Markert, K. Identifying mislabeled instances in classification datasets. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. IEEE, 2019.
Müller, S. G. and Hutter, F. TrivialAugment: Tuning-free yet state-of-the-art data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 774-782, 2021.

Nakamura, K. and Hong, B.-W. Adaptive weight decay for deep neural networks. IEEE Access, 7:118857-118865, 2019.

Plaut, D. C. et al. Experiments on learning by back propagation. 1986.

Rasley, J., Rajbhandari, S., Ruwase, O., and He, Y. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505-3506, 2020.

Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Aging evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, volume 2, pp. 2, 2019.

Rebuffi, S.-A., Gowal, S., Calian, D. A., Stimberg, F., Wiles, O., and Mann, T. Fixing data augmentation to improve adversarial robustness. arXiv preprint arXiv:2103.01946, 2021.

Rice, L., Wong, E., and Kolter, Z. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning, pp. 8093-8104. PMLR, 2020.

Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., Davis, L. S., Taylor, G., and Goldstein, T. Adversarial training for free! Advances in Neural Information Processing Systems, 32, 2019.

Sharif, M., Bauer, L., and Reiter, M. K. On the suitability of lp-norms for creating and preventing adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1605-1613, 2018.

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990, 2022.

Song, H., Kim, M., and Lee, J.-G. SELFIE: Refurbishing unclean samples for robust deep learning. In International Conference on Machine Learning, pp. 5907-5915. PMLR, 2019.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105-6114. PMLR, 2019.

Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., and Gu, Q. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2020.

Warde-Farley, D. and Goodfellow, I. Adversarial perturbations of deep neural networks. In Perturbations, Optimization, and Statistics, 311:5, 2016.

Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286-5295. PMLR, 2018.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. arXiv preprint arXiv:2001.03994, 2020.

Wu, D., Xia, S.-T., and Wang, Y. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958-2969, 2020.
Yamada, Y., Iwamura, M., Akiba, T., and Kise, K. ShakeDrop regularization for deep residual learning. IEEE Access, 7:186126-186136, 2019.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.

Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., and Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032, 2019.

Yun, S., Oh, S. J., Heo, B., Han, D., Choe, J., and Chun, S. Re-labeling ImageNet: from single to multi-labels, from global to localized labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2340-2350, 2021.

Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.

Zhang, G., Wang, C., Xu, B., and Grosse, R. Three mechanisms of weight decay regularization. arXiv preprint arXiv:1810.12281, 2018.

Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pp. 7472-7482. PMLR, 2019.

Zhang, J., Xu, X., Han, B., Niu, G., Cui, L., Sugiyama, M., and Kankanhalli, M. Attacks which do not kill training make adversarial learning stronger. In International Conference on Machine Learning, pp. 11278-11287. PMLR, 2020a.

Zhang, J., Zhu, J., Niu, G., Han, B., Sugiyama, M., and Kankanhalli, M. Geometry-aware instance-reweighted adversarial training. arXiv preprint arXiv:2010.01736, 2020b.

Zhang, X. A method of data label checking and the wrong labels in MNIST and CIFAR10. Available at SSRN 3072167, 2017.

Zhou, Y., Ebrahimi, S., Arık, S. Ö., Yu, H., Liu, H., and Diamos, G. Resource-efficient neural architect. arXiv preprint arXiv:1806.07912, 2018.