# Exploring Memorization in Adversarial Training

Published as a conference paper at ICLR 2022

Yinpeng Dong1,2, Ke Xu4, Xiao Yang1, Tianyu Pang1, Zhijie Deng1, Hang Su1,3, Jun Zhu1,2,3
1 Dept. of Comp. Sci. and Tech., Institute for AI, Tsinghua-Bosch Joint ML Center, THBI Lab, BNRist Center, Tsinghua University, Beijing, China; 2 RealAI; 3 Peng Cheng Laboratory; 4 CMU
{dongyinpeng, suhangss, dcszj}@mail.tsinghua.edu.cn, kx1@andrew.cmu.edu

ABSTRACT

Deep learning models have a propensity for fitting the entire training set even with random labels, which requires memorization of every training sample. In this paper, we explore the memorization effect in adversarial training (AT) to promote a deeper understanding of model capacity, convergence, generalization, and especially robust overfitting of adversarially trained models. We first demonstrate that deep networks have sufficient capacity to memorize adversarial examples of training data with completely random labels, but not all AT algorithms can converge under this extreme circumstance. Our study of AT with random labels motivates further analyses of the convergence and generalization of AT. We find that some AT approaches suffer from a gradient instability issue, and that most recently suggested complexity measures cannot explain robust generalization once models trained on random labels are taken into account. Furthermore, we identify a significant drawback of memorization in AT: it can result in robust overfitting. We then propose a new mitigation algorithm motivated by detailed memorization analyses. Extensive experiments on various datasets validate the effectiveness of the proposed method.

1 INTRODUCTION

Deep neural networks (DNNs) usually exhibit excellent generalization ability in pattern recognition tasks, despite their sufficient capacity to overfit or memorize the entire training set with completely random labels (Zhang et al., 2017). The memorization behavior in deep learning has aroused tremendous attention to identifying the differences between learning on true and random labels (Arpit et al., 2017; Neyshabur et al., 2017), and to examining what and why DNNs memorize (Feldman, 2020; Feldman & Zhang, 2020; Maennel et al., 2020). This phenomenon has also motivated a growing body of work on model capacity (Arpit et al., 2017; Belkin et al., 2019), convergence (Allen-Zhu et al., 2019; Du et al., 2019; Zou et al., 2020), and generalization (Neyshabur et al., 2017; Bartlett et al., 2017), which consequently provides a better understanding of the working mechanism of DNNs.

In this paper, we explore the memorization behavior of a different learning algorithm: adversarial training (AT). Owing to the security threat of adversarial examples, i.e., maliciously generated inputs with imperceptible perturbations that cause misclassification (Szegedy et al., 2014; Goodfellow et al., 2015), various defense methods have been proposed to improve the adversarial robustness of DNNs (Kurakin et al., 2017; Madry et al., 2018; Liao et al., 2018; Wong & Kolter, 2018; Cohen et al., 2019; Zhang et al., 2019b; Pang et al., 2019; 2020; Dong et al., 2020a). AT is arguably the most effective defense technique (Athalye et al., 2018; Dong et al., 2020b), in which the network is trained on adversarially augmented samples instead of the natural ones (Madry et al., 2018). Despite its popularity, the memorization behavior in AT is less explored. Schmidt et al.
(2018) show that a model is able to fully (over)fit the training set against an adversary, i.e., reach almost 100% robust training accuracy, while its performance on test data is much inferior, exhibiting a significant generalization gap. The overfitting phenomenon in AT is further investigated in Rice et al. (2020). However, it is not clear whether DNNs can memorize adversarial examples of training data with completely random labels. Answering this question helps to examine the effects of memorization in AT under this extreme circumstance and facilitates a deeper understanding of the capacity, convergence, generalization, and robust overfitting of adversarially trained models. In general, it is difficult for a classifier to memorize adversarial examples with random labels since the model entails a much more complicated decision boundary, as illustrated in Fig. 1. Even though the networks have sufficient capacity, AT may not necessarily converge. Therefore, we aim to comprehensively study this problem and explore how the analysis can motivate better algorithms.

* Jun Zhu is the corresponding author. This work was done when Ke Xu was visiting Tsinghua University.

Figure 1: A conceptual illustration of decision boundaries learned via standard training and adversarial training with true and random labels, respectively. The model needs a significantly more complicated decision boundary to memorize adversarial examples of training data with random labels.

Our contributions. We first empirically investigate the memorization behavior in AT by performing PGD-AT (Madry et al., 2018) and TRADES (Zhang et al., 2019b) with random labels sampled uniformly over all classes. Different from standard training (ST), which easily memorizes random labels (Zhang et al., 2017), AT may fail to converge, with PGD-AT being a typical example. Nevertheless, TRADES can converge under this circumstance, demonstrating that DNNs have sufficient capacity to memorize adversarial examples of training data with completely random labels. This phenomenon is commonly observed across multiple datasets, network architectures, and threat models.

The memorization analysis has further implications for understanding the convergence and generalization of AT. We conduct a convergence analysis on gradient magnitude and stability to explain the counter-intuitive difference in convergence between PGD-AT and TRADES with random labels, since the two behave similarly when trained on true labels (Rice et al., 2020). We corroborate that PGD-AT suffers from a gradient instability issue, while the gradients of TRADES are relatively stable thanks to its adversarial loss. Moreover, by considering models trained on random labels, our generalization analysis indicates that several recently suggested complexity measures are inadequate to explain robust generalization, which is complementary to the findings in ST (Neyshabur et al., 2017). Accordingly, an appropriate explanation of robust generalization remains largely under-addressed.

Lastly, but most importantly, we identify a significant drawback of memorization in AT: it can result in robust overfitting (Rice et al., 2020). We argue that the cause of robust overfitting lies in the memorization of one-hot labels in typical AT methods.
The one-hot labels can be inappropriate or even noisy for some adversarial examples because some data naturally lie close to the decision boundary, and the corresponding adversarial examples should be assigned low predictive confidence (Stutz et al., 2020; Cheng et al., 2020). To solve this problem, we propose a new mitigation algorithm that impedes over-confident predictions through regularization, avoiding excessive memorization of adversarial examples with possibly noisy labels. Experiments validate that our method eliminates robust overfitting to a large extent across multiple datasets, network architectures, threat models, and AT methods, achieving better robustness under a variety of adversarial attacks than the baselines.

2 BACKGROUND

2.1 ADVERSARIAL TRAINING

Let D = {(x_i, y_i)}_{i=1}^{n} denote a training dataset with n samples, where x_i ∈ ℝ^d is a natural example and y_i ∈ {1, ..., C} is its true label over C classes, often encoded as a one-hot vector 1_{y_i}. Adversarial training (AT) can be formulated as a robust optimization problem (Madry et al., 2018):

\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \max_{x_i' \in \mathcal{S}(x_i)} \mathcal{L}(f_{\theta}(x_i'), y_i),   (1)

where f_θ is a DNN classifier with parameters θ that predicts probabilities over all classes, L is the classification loss (i.e., the cross-entropy loss \mathcal{L}(f_\theta(x), y) = -\mathbf{1}_y^\top \log f_\theta(x)), and S(x) = {x' : ∥x' − x∥_p ≤ ϵ} is an adversarial region centered at x with radius ϵ > 0 under the ℓ_p-norm threat models (e.g., the ℓ2 and ℓ∞ norms that we consider). The robust optimization problem (1) is solved by using adversarial attacks to approximate the inner maximization and updating the model parameters θ via gradient descent. A typical method uses projected gradient descent (PGD) (Madry et al., 2018) for the inner problem, which starts at a randomly initialized point in S(x_i) and iteratively updates the adversarial example under the ℓ∞-norm threat model by

x_i' = \Pi_{\mathcal{S}(x_i)}\big(x_i' + \alpha \cdot \mathrm{sign}(\nabla_{x} \mathcal{L}(f_\theta(x_i'), y_i))\big),   (2)

where Π(·) is the projection operator and α is the step size. Besides PGD-AT, another typical AT method is TRADES (Zhang et al., 2019b), which balances the trade-off between robustness and natural accuracy by minimizing a different adversarial loss

\mathcal{L}(f_\theta(x_i), y_i) + \beta \cdot \max_{x_i' \in \mathcal{S}(x_i)} \mathcal{D}\big(f_\theta(x_i) \,\|\, f_\theta(x_i')\big),   (3)

where L is the clean cross-entropy loss on the natural example, D is the Kullback-Leibler divergence, and β is a balancing parameter. The inner maximization of TRADES is also solved by PGD. Recent progress in AT includes designing new adversarial losses (Mao et al., 2019; Qin et al., 2019; Pang et al., 2020; Wang et al., 2020; Dong et al., 2020a) and network architectures (Xie et al., 2019), accelerating training (Shafahi et al., 2019; Zhang et al., 2019a; Wong et al., 2020), and exploiting more training data (Hendrycks et al., 2019; Alayrac et al., 2019; Carmon et al., 2019; Zhai et al., 2019). Recent works also highlight the importance of training tricks in AT (Gowal et al., 2020; Pang et al., 2021).

2.2 RELATED WORK ON DNN MEMORIZATION

It has been observed that DNNs can easily memorize training data with random labels (Zhang et al., 2017), which requires rethinking conventional techniques (e.g., VC dimension) for explaining generalization. Arpit et al. (2017) identify qualitative differences between learning on true and random labels. Further works examine what and why DNNs memorize (Feldman, 2020; Feldman & Zhang, 2020; Maennel et al., 2020).
Motivated by the memorization phenomenon in deep learning, convergence of training has been analyzed in the over-parameterized setting (Allen-Zhu et al., 2019; Du et al., 2019; Zou et al., 2020), while generalization has been studied with numerous theoretical and empirical complexity measures (Neyshabur et al., 2015; 2017; Bartlett et al., 2017; Novak et al., 2018; Arora et al., 2018; Cao & Gu, 2019; Jiang et al., 2020; Chen et al., 2020). In contrast, the memorization behavior in AT has been less explored. Previous works demonstrate that DNNs can fit training data against an adversary (Madry et al., 2018; Schmidt et al., 2018; Rice et al., 2020), e.g., achieving nearly 100% robust training accuracy against a PGD adversary, but this behavior has not been explored when training on random labels. This paper is dedicated to investigating memorization in AT under the extreme condition of random labels, while drawing connections to capacity, convergence, generalization, and robust overfitting, with the overarching goal of better understanding the working mechanism of AT.

3 MEMORIZATION IN AT AND IMPLICATIONS

In this section, we first explore the memorization behavior in AT through an empirical study. Our analysis raises new questions about the convergence and generalization of AT, many of which cannot be answered by existing works. Thereafter, we provide further analytical studies on the convergence and generalization of AT by particularly considering models trained on random labels.

3.1 AT WITH RANDOM LABELS

We explore the memorization behavior of PGD-AT (Madry et al., 2018) and TRADES (Zhang et al., 2019b) as two study cases. The experiments are conducted on CIFAR-10 (Krizhevsky & Hinton, 2009) with a WideResNet model (Zagoruyko & Komodakis, 2016) of depth 28 and widen factor 10 (WRN-28-10). Similar to Zhang et al. (2017), we train a network on the original dataset with true labels and on a copy of the dataset in which the true labels are replaced by random ones. For training and robustness evaluation, a 10-step ℓ∞ PGD adversary with ϵ = 8/255 and α = 2/255 is adopted. For TRADES, the PGD adversary maximizes the KL divergence during training but maximizes the cross-entropy loss for robustness evaluation, as is common practice (Zhang et al., 2019b). We set β = 6.0. In the sequel, we denote the accuracy of a classifier against the 10-step PGD adversary as "robust accuracy" and the accuracy on natural examples as "natural accuracy".

Figure 2: (a) and (b) show the natural and robust training accuracies of PGD-AT and TRADES, respectively, when trained on true or random labels. (c) shows the generalization gap under varying levels of label noise.

Fig. 2(a) and Fig. 2(b) show the learning curves of PGD-AT and TRADES without explicit regularizations. Both methods achieve almost 100% natural and robust training accuracies when trained on true labels. When the labels are random, we observe totally different behaviors between PGD-AT and TRADES: PGD-AT fails to converge while TRADES still reaches nearly 100% training accuracies. This phenomenon is somewhat striking because PGD-AT and TRADES perform similarly on true labels (Rice et al., 2020).
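For concreteness, the 10-step ℓ∞ PGD adversary used above for both training and robustness evaluation (Eq. (2)) can be sketched in a few lines of PyTorch. This is a minimal illustration under common conventions (inputs in [0, 1], a model returning logits, ϵ = 8/255, α = 2/255), not the exact implementation used for the experiments:

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity PGD adversary of Eq. (2): random start in S(x), iterative
    signed-gradient ascent on the cross-entropy loss, and projection back
    onto the eps-ball around x (and onto the valid pixel range [0, 1])."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()                    # ascent step
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)  # project onto S(x)
            x_adv = x_adv.clamp(0, 1)                              # keep valid pixels
    return x_adv.detach()
```

A PGD-AT step then minimizes F.cross_entropy(model(pgd_linf(model, x, y)), y) over mini-batches; for TRADES, the same loop is run with the KL divergence to f(x) in place of the cross-entropy loss during training.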
We find that the different memorization behaviors of PGD-AT and TRADES when trained on random labels are commonly observed across a variety of datasets, model architectures, and threat models (shown in Appendix A.1), indicating that this is a general phenomenon of memorization in the two AT methods. Therefore, our finding is:

DNNs have sufficient capacity to memorize adversarial examples of training data with completely random labels, but the convergence depends on the AT algorithm.

Partially corrupted labels. We then inspect the behavior of AT under varying levels of label noise, from 0% (true labels) to 100% (completely random labels). The generalization gap (i.e., the difference between training and test accuracies) presented in Fig. 2(c) grows steadily as we increase the noise rate, up to the point where the network fails to converge. The learning curves are provided in Appendix A.1.

Explicit regularizations. We study the role of common regularizers in AT memorization, including data augmentation, weight decay, and dropout (Srivastava et al., 2014). We train TRADES on true and random labels with several combinations of regularizers. We observe that the explicit regularizers do not significantly affect the model's ability to memorize adversarial examples, similar to the finding in ST (Zhang et al., 2017; Arpit et al., 2017). The detailed results are provided in Appendix A.1.

3.2 CONVERGENCE ANALYSIS OF AT WITH RANDOM LABELS

Since we have observed the counter-intuitive fact that PGD-AT and TRADES exhibit different convergence properties with random labels, it is necessary to perform a convergence analysis to understand this phenomenon. Note that our finding can hardly be explained by previous works (Gao et al., 2019; Wang et al., 2019; Zhang et al., 2020).

We first study the effects of different training settings on PGD-AT with random labels. We conduct experiments to analyze each training factor individually, including network architecture, attack steps, optimizer, and perturbation budget. We find that tuning the training settings cannot make PGD-AT converge with random labels (Appendix A.2 details the results). Based on this analysis, we think that the convergence issue of PGD-AT could be a result of the adversarial loss function in Eq. (1) rather than of other training configurations. Specifically, TRADES in Eq. (3) minimizes a clean cross-entropy (CE) loss on natural examples, making DNNs memorize natural examples with random labels before fitting adversarial examples. As seen in Fig. 2(b), at the very early stage of TRADES training (the first 25 epochs), the natural accuracy starts to increase while the robust accuracy does not. However, PGD-AT in Eq. (1) directly minimizes the CE loss on adversarial samples with random labels, which can introduce unstable gradients with large variance, making it fail to converge. To corroborate this argument, we analyze the gradient magnitude and stability below.

Gradient magnitude. First, we calculate the average gradient norm of the adversarial loss in Eq. (1) w.r.t. model parameters over the training samples for PGD-AT, and similarly calculate the average gradient norms of the clean CE loss (the first term) and the KL loss (the second term) in Eq. (3) w.r.t. parameters for TRADES, to analyze their respective effects. We present the gradient norms along training in Fig. 3(a).
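Before interpreting the curves, the following sketch shows how such per-term gradient norms can be measured for TRADES. The helper names are ours, the model is assumed to be a standard PyTorch classifier returning logits, and the KL direction follows common TRADES implementations:

```python
import torch
import torch.nn.functional as F

def param_grad_norm(loss, model):
    """L2 norm of the gradient of a scalar loss w.r.t. all model parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads]).norm().item()

def trades_grad_norms(model, x, y, x_adv):
    """Gradient norms of the two TRADES terms in Eq. (3) on one batch:
    the clean CE loss on x and the KL term between f(x) and f(x_adv)."""
    ce_norm = param_grad_norm(F.cross_entropy(model(x), y), model)
    kl = F.kl_div(F.log_softmax(model(x_adv), dim=1),   # KL(f(x) || f(x_adv)),
                  F.softmax(model(x), dim=1),           # as in common TRADES code
                  reduction='batchmean')
    kl_norm = param_grad_norm(kl, model)
    return ce_norm, kl_norm
```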
Figure 3: (a) Gradient norm of PGD-AT and TRADES along the training process. (b) The ratio of the gradient norms of PGD-AT and TRADES during the first 1000 training iterations.

We can see that at the initial training epochs, the gradient norm of the KL loss in TRADES is much smaller than that of the CE loss, which indicates that the CE loss dominates TRADES training initially. As training progresses, the KL loss has a larger gradient norm, making the network memorize adversarial examples. However, it is still unclear why PGD-AT does not rely on a similar learning tactic for convergence. To make a direct comparison with TRADES, we rewrite the adversarial loss of PGD-AT in Eq. (1) as

\max_{x_i' \in \mathcal{S}(x_i)} \mathcal{L}(f_\theta(x_i'), y_i) = \mathcal{L}(f_\theta(x_i), y_i) + \mathcal{R}(x_i, y_i, \theta),

where R(x_i, y_i, θ) denotes the difference between the CE loss on the adversarial example x_i' and that on the natural example x_i. Hence we can separately calculate the gradient norms of L(f_θ(x_i), y_i) and R(x_i, y_i, θ) w.r.t. the parameters θ to find out the effect of R(x_i, y_i, θ) on training. Specifically, we measure the relative gradient magnitude: in PGD-AT we calculate the ratio of gradient norms \|\nabla_\theta \mathcal{R}(x_i, y_i, \theta)\|_2 / \|\nabla_\theta \mathcal{L}(f_\theta(x_i), y_i)\|_2, while in TRADES we similarly calculate the ratio of the gradient norm of the KL loss to that of the CE loss. Fig. 3(b) illustrates these ratios for PGD-AT and TRADES during the first 1000 training iterations. The ratio of PGD-AT is consistently higher than that of TRADES, meaning that R(x_i, y_i, θ) has a non-negligible impact on training.

Figure 4: (a) The ℓ2 distance between the gradients at θ and θ + λd of different losses, where θ is at initialization and λ ∈ [−0.05, 0.05]. (b) The cosine similarity between the gradients in each two successive epochs.

Gradient stability. We then analyze the gradient stability to explain why PGD-AT cannot converge. We denote the adversarial loss of PGD-AT as \mathcal{J}(x, y, \theta) = \max_{x' \in \mathcal{S}(x)} \mathcal{L}(f_\theta(x'), y), with the subscript i omitted for notational simplicity. We have the following theorem on gradient stability.

Theorem 1. Suppose the gradient of the clean cross-entropy loss is locally Lipschitz continuous, i.e.,

\|\nabla_\theta \mathcal{L}(f_\theta(x'), y) - \nabla_\theta \mathcal{L}(f_\theta(x), y)\|_2 \le K \|x' - x\|_p,   (4)

for any x ∈ ℝ^d, x' ∈ S(x), and any θ, where K is the Lipschitz constant. Then we have

\|\nabla_\theta \mathcal{J}(x, y, \theta_1) - \nabla_\theta \mathcal{J}(x, y, \theta_2)\|_2 \le \|\nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2 + 2\epsilon K.   (5)

We provide the proof in Appendix B, where we show that the upper bound in Eq. (5) is tight. Theorem 1 indicates that the gradient of the adversarial loss J(x, y, θ) of PGD-AT will change more dramatically than that of the clean CE loss L(f_θ(x), y). When θ1 and θ2 are close, the difference between the gradients of L at θ1 and θ2 is close to 0 due to the semi-smoothness of over-parameterized DNNs (Allen-Zhu et al., 2019), but that of J is relatively large due to the 2ϵK term in Eq. (5). To validate this, we visualize the change of the gradient when moving the parameters θ along a random direction d with magnitude λ. In particular, we set θ at initialization, and d is sampled from a Gaussian distribution and normalized filter-wise (Li et al., 2018). For PGD-AT and TRADES, we craft adversarial examples on-the-fly for the model with parameters θ + λd and measure the change of the gradient by its ℓ2 distance to the gradient at θ, averaged over all data samples. The curves of the gradient change of PGD-AT, TRADES, and the clean CE loss are shown in Fig. 4(a). In a small neighborhood of θ (i.e., small λ), the gradient of PGD-AT changes abruptly while the gradients of TRADES and the clean CE loss are more continuous.
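A minimal sketch of this measurement is given below. The filter-wise normalization follows Li et al. (2018), while the function names and the generic loss_fn interface are our own simplifications:

```python
import copy
import torch

def filter_normalized_direction(model):
    """Random direction d, rescaled filter-wise to match the norms of the
    corresponding filters of the model weights (Li et al., 2018)."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:                                   # conv / linear weight tensors
            p_norm = p.flatten(1).norm(dim=1)             # per-filter norm of the weights
            d_norm = d.flatten(1).norm(dim=1) + 1e-10
            d = d * (p_norm / d_norm).view(-1, *([1] * (p.dim() - 1)))
        direction.append(d)
    return direction

def flat_param_grad(model, loss_fn, x, y):
    """Flattened gradient of loss_fn(model, x, y) w.r.t. the model parameters."""
    grads = torch.autograd.grad(loss_fn(model, x, y), list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def gradient_change(model, loss_fn, x, y, direction, lam):
    """L2 distance between the gradients at theta and at theta + lam * d."""
    g0 = flat_param_grad(model, loss_fn, x, y)
    shifted = copy.deepcopy(model)
    with torch.no_grad():
        for p, d in zip(shifted.parameters(), direction):
            p.add_(lam * d)
    g1 = flat_param_grad(shifted, loss_fn, x, y)
    return (g1 - g0).norm().item()
```

Here loss_fn(m, x, y) can be the clean CE loss or an adversarial loss that first runs PGD against the model it receives, so that adversarial examples are crafted on-the-fly for θ + λd as described above.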
The gradient instability leads to a lower cosine similarity between the gradient directions w.r.t. the same data in each two successive training epochs of PGD-AT, as illustrated in Fig. 4(b). Therefore, the training of PGD-AT is rather unstable: the gradient exhibits large variance, making it fail to converge.

Figure 5: AT by Eq. (6).

Clean CE loss helps PGD-AT converge. To further verify our argument, we add the clean CE loss into the PGD-AT objective to resemble the learning of TRADES with random labels, as

(1 - \gamma)\, \mathcal{L}(f_\theta(x_i), y_i) + \gamma \cdot \max_{x_i' \in \mathcal{S}(x_i)} \mathcal{L}(f_\theta(x_i'), y_i),   (6)

where γ is gradually increased from 0 to 1. By using Eq. (6), the gradient is more stable at the initial stage and training on random labels successfully converges, as shown in Fig. 5. In summary, our convergence analysis identifies the gradient instability issue of PGD-AT, provides new insights into the differences between PGD-AT and TRADES, and partially explains the failures of AT under other realistic settings beyond the scope of this section, as detailed in Appendix A.2.

3.3 GENERALIZATION ANALYSIS OF AT WITH RANDOM LABELS

As our study demonstrates DNNs' ability to memorize adversarial examples with random labels, we raise the question of whether DNNs rely on a similar memorization tactic on true labels and how to explain/ensure robust generalization. Although many efforts have been devoted to studying robust generalization of AT theoretically or empirically (Yin et al., 2018; Schmidt et al., 2018; Bubeck et al., 2019; Tu et al., 2019; Wu et al., 2020), they do not take models trained on random labels into consideration. As it is easy to show that explicit regularizations are not an adequate explanation of generalization in ST (Zhang et al., 2017; Arpit et al., 2017) or AT (see Appendix A.3), people resort to complexity measures of a model to explain generalization (i.e., a lower complexity should imply a smaller generalization gap). Here we show how recently proposed complexity measures fail to explain robust generalization when comparing models trained on true and random labels.

We consider several norm-based and sharpness/flatness-based measures. We denote the parameters of a network by θ := {W_i}_{i=1}^{m}. The norm-based measures include the spectral norm \frac{1}{\gamma_{\mathrm{margin}}} \prod_{i=1}^{m} \|W_i\|_2 and the ℓ1 norm \frac{1}{\gamma_{\mathrm{margin}}} \sum_{i=1}^{m} \|W_i\|_1 of model parameters, where γ_margin is a margin on the model output that makes them scale-insensitive (Neyshabur et al., 2017). The spectral norm appears in theoretical robust generalization bounds (Yin et al., 2018; Tu et al., 2019) and is related to the Lipschitz constant of neural networks (Cisse et al., 2017). The ℓ1 norm is adopted to reduce the robust generalization gap (Yin et al., 2018). The sharpness/flatness-based measures include the curvature of the input loss landscape (Moosavi-Dezfooli et al., 2019), measured as the dominant eigenvalue of the Hessian matrix, as well as the flatness of the weight loss landscape (Wu et al., 2020), related to the change of the adversarial loss when moving the weights along a random direction.
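For reference, the two norm-based measures can be computed directly from the weight tensors. The sketch below is ours and makes simplifying assumptions: convolution kernels are flattened into matrices, and the output margin γ_margin is assumed to be precomputed and passed in as `margin`:

```python
import torch

def norm_based_measures(model, margin):
    """Product of spectral norms and sum of element-wise L1 norms of the
    weight matrices, both scaled by 1 / gamma_margin (here `margin`)."""
    spectral_product, l1_sum = 1.0, 0.0
    for name, p in model.named_parameters():
        if p.dim() < 2 or 'weight' not in name:
            continue                                    # skip biases and norm layers
        w = p.detach().flatten(1)                       # treat conv kernels as matrices
        spectral_product *= torch.linalg.matrix_norm(w, ord=2).item()
        l1_sum += w.abs().sum().item()
    return spectral_product / margin, l1_sum / margin
```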
Figure 6: The results on four complexity measures of the adversarially trained models w.r.t. the robust generalization gap. The training settings of these models are provided in Appendix A.3.

Fig. 6 plots the four complexity measures w.r.t. the robust generalization gap of several models trained with various combinations of regularizations on true or random labels. The results show that the first three measures can hardly ensure robust generalization: lower complexity does not necessarily imply a smaller robust generalization gap; e.g., models trained on random labels can even exhibit lower complexity than those trained on true labels. Among them, the flatness of the weight loss landscape (Wu et al., 2020) is more reliable. In summary, the generalization analysis indicates that previous approaches, especially various complexity measures, cannot adequately explain or ensure the robust generalization performance of AT. Our finding on robust generalization in AT is complementary to that on standard generalization in ST (Zhang et al., 2017; Neyshabur et al., 2017; Jiang et al., 2020). Accordingly, robust generalization of adversarially trained models remains an open problem for future research.

4 ROBUST OVERFITTING ANALYSIS

Rice et al. (2020) have identified robust overfitting as a dominant phenomenon in AT, i.e., shortly after the first learning rate decay, further training continues to decrease the robust test accuracy. They further show that several remedies for overfitting, including explicit ℓ1 and ℓ2 regularizations, data augmentation, etc., cannot gain improvements upon early stopping. Although robust overfitting has been thoroughly investigated, an explanation of why it occurs is still lacking. In this section, we draw a connection between memorization and robust overfitting in AT by showing that robust overfitting is caused by excessive memorization of one-hot labels in typical AT methods. Motivated by the analysis, we then propose an effective strategy to eliminate robust overfitting.

Figure 7: (a) The accuracy curves of PGD-AT with true labels to reproduce robust overfitting. (b) The robust test accuracy of PGD-AT under various perturbation budgets ϵ. (c) The adversarial loss of two networks independently trained by PGD-AT on 500 samples, sorted by the loss of the first model.

4.1 EXPLAINING ROBUST OVERFITTING

The typical AT approaches (e.g., PGD-AT, TRADES) commonly adopt one-hot labels as the targets for training, as introduced in Sec. 2.1. The one-hot labels could be inappropriate for some adversarial examples because it is difficult for a network to assign high-confidence one-hot labels to all perturbed samples within the perturbation budget ϵ (Stutz et al., 2020; Cheng et al., 2020). Intuitively, some examples may naturally lie close to the decision boundary, and their worst-case adversarial examples should be assigned lower predictive confidence. This indicates that the one-hot labels of some training data may be noisy in AT.¹ After a certain training epoch, the model memorizes these hard training examples with possibly noisy labels, leading to a reduction of test robustness, as shown in Fig. 7(a). Thus, we hypothesize that the cause of robust overfitting lies in the memorization of one-hot labels.

Our hypothesis is supported by two pieces of evidence. First, we find that when the perturbation budget ϵ is small, robust overfitting does not occur, as shown in Fig. 7(b). This observation implies that the one-hot labels are appropriate targets for adversarial examples within a smaller neighborhood, while they become noisier under a larger perturbation budget and lead to overfitting. Second, we validate that the hard training examples with higher adversarial loss values are consistent across different models.
We first train two independent networks (with the same architecture and different random seeds) by PGD-AT and calculate the adversarial loss for each training sample. We show the adversarial losses on 500 samples, sorted by the loss of the first model, in Fig. 7(c). It can be seen that the samples with lower adversarial losses under the first model also have relatively lower losses under the second one, and vice versa. We further quantitatively measure the consistency of the adversarial losses of all training samples between the two models using Kendall's rank coefficient (Kendall, 1938), which is 0.85 in this case. A similar result can be observed for two different model architectures (see Appendix C.1). The results verify that the hard training examples with possibly noisy labels are intrinsic to a dataset, supporting our hypothesis on why robust overfitting occurs.

4.2 MITIGATING ROBUST OVERFITTING

Based on the above analysis, we resort to methods that are less prone to overfitting noisy labels for mitigating robust overfitting in AT. Although learning with noisy labels has been broadly studied in ST (Natarajan et al., 2013; Patrini et al., 2017; Jiang et al., 2018; Han et al., 2018; Zhang & Sabuncu, 2018), we find that most of these approaches are not suitable for AT. For example, a typical line of methods filters out noisy samples and trains the models on the identified clean samples (Jiang et al., 2018; Han et al., 2018; Ren et al., 2018). However, these methods neglect a portion of training data with noisy labels, which can lead to inferior results for AT due to the reduction of training data (Schmidt et al., 2018); Table 2 shows results validating this. To address the problem, we propose to regularize the predictions of adversarial examples from being over-confident by integrating the temporal ensembling (TE) approach (Laine & Aila, 2017) into the AT frameworks. TE maintains an ensemble prediction of each sample and penalizes the difference between the current prediction and the ensemble prediction, which is effective for semi-supervised learning and learning with noisy labels (Laine & Aila, 2017). We think that TE is suitable for AT since it leverages all training samples and hinders the network from excessive memorization of one-hot labels through a regularization term. Specifically, we denote the ensemble prediction of a training sample x_i as p_i, which is updated in each training epoch as p_i ← η · p_i + (1 − η) · f_θ(x_i), where η is the momentum term.

¹ Note that we argue the one-hot labels are noisy when used in AT, but do not argue that the ground-truth labels of the dataset are noisy, which is different from a previous work (Sanyal et al., 2021).

Table 1: Test accuracy (%) of several methods on CIFAR-10, CIFAR-100, and SVHN under the ℓ∞ norm with ϵ = 8/255 based on the ResNet-18 architecture. We choose the best checkpoint according to the highest robust accuracy on the test set under PGD-10. Each cell reports Best / Final / Diff.

| Method | Natural Accuracy | PGD-10 | PGD-1000 | C&W-1000 | AutoAttack |
|---|---|---|---|---|---|
| PGD-AT | 83.75 / 84.82 / -1.07 | 52.64 / 44.92 / 7.72 | 51.22 / 42.74 / 8.48 | 50.11 / 43.63 / 7.48 | 47.74 / 41.84 / 5.90 |
| PGD-AT+TE | 82.35 / 82.79 / -0.44 | 55.79 / 54.83 / 0.96 | 54.65 / 53.30 / 1.35 | 52.30 / 51.73 / 0.57 | 50.59 / 49.62 / 0.97 |
| TRADES | 81.19 / 82.48 / -1.29 | 53.32 / 50.25 / 3.07 | 52.44 / 48.67 / 3.77 | 49.88 / 48.14 / 1.74 | 49.03 / 46.80 / 2.23 |
| TRADES+TE | 83.86 / 83.97 / -0.11 | 55.15 / 54.42 / 0.73 | 53.74 / 53.03 / 0.71 | 50.77 / 50.63 / 0.14 | 49.77 / 49.20 / 0.57 |

(a) The evaluation results on CIFAR-10.
| Method | Natural Accuracy | PGD-10 | PGD-1000 | C&W-1000 | AutoAttack |
|---|---|---|---|---|---|
| PGD-AT | 57.54 / 57.51 / 0.03 | 29.40 / 21.75 / 7.65 | 28.54 / 20.63 / 7.91 | 27.06 / 21.17 / 5.89 | 24.72 / 19.34 / 5.38 |
| PGD-AT+TE | 56.45 / 57.12 / -0.67 | 31.74 / 30.24 / 1.50 | 31.27 / 29.80 / 1.47 | 28.27 / 27.36 / 0.91 | 26.30 / 25.34 / 0.96 |
| TRADES | 57.98 / 56.32 / 1.66 | 29.93 / 27.70 / 2.23 | 29.51 / 26.93 / 2.58 | 25.46 / 24.42 / 1.04 | 24.61 / 23.40 / 1.21 |
| TRADES+TE | 59.35 / 58.72 / 0.63 | 31.09 / 30.12 / 0.97 | 30.54 / 29.45 / 1.09 | 26.61 / 25.94 / 0.67 | 25.27 / 24.55 / 0.72 |

(b) The evaluation results on CIFAR-100.

| Method | Natural Accuracy | PGD-10 | PGD-1000 | C&W-1000 | AutoAttack |
|---|---|---|---|---|---|
| PGD-AT | 89.00 / 90.55 / -1.55 | 54.51 / 46.97 / 7.54 | 52.22 / 42.85 / 9.37 | 48.66 / 44.13 / 4.53 | 46.61 / 38.24 / 8.37 |
| PGD-AT+TE | 90.09 / 90.91 / -0.82 | 59.74 / 59.05 / 0.69 | 57.71 / 56.46 / 1.25 | 54.55 / 53.94 / 0.61 | 51.44 / 50.61 / 0.83 |
| TRADES | 90.88 / 91.30 / -0.42 | 59.50 / 57.04 / 2.46 | 52.78 / 50.17 / 2.61 | 52.76 / 50.53 / 2.23 | 40.36 / 38.88 / 1.48 |
| TRADES+TE | 89.01 / 88.52 / 0.49 | 59.81 / 58.49 / 1.32 | 58.24 / 56.66 / 1.58 | 54.00 / 53.24 / 0.76 | 51.45 / 50.16 / 1.29 |

(c) The evaluation results on SVHN.

The training objective of PGD-AT with TE can be expressed as

\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \Big\{ \max_{x_i' \in \mathcal{S}(x_i)} \mathcal{L}(f_\theta(x_i'), y_i) + w \cdot \|f_\theta(x_i') - \hat{p}_i\|_2^2 \Big\},   (7)

where \hat{p}_i is the normalization of p_i as a probability vector and w is a balancing weight. TE can be integrated with TRADES in the same way, using the same regularization term. The network learns to fit relatively easy samples with one-hot labels in the initial training stage, as shown in Fig. 7(a). After the learning rate decays, the regularization term in Eq. (7) lets the network keep assigning low confidence to hard samples and avoid fitting their one-hot labels. Therefore, the proposed algorithm enables learning under label noise in AT and alleviates the robust overfitting problem.

5 EMPIRICAL EVALUATION ON MITIGATING ROBUST OVERFITTING

In this section, we provide experimental results on the CIFAR-10, CIFAR-100 (Krizhevsky & Hinton, 2009), and SVHN (Netzer et al., 2011) datasets to validate the effectiveness of the proposed method. Code is available at https://github.com/dongyp13/memorization-AT.

Training details. We adopt the common setting with perturbation budget ϵ = 8/255 under the ℓ∞ norm in most experiments. We consider PGD-AT and TRADES as two typical AT baselines and integrate the proposed TE approach into each of them. We use the ResNet-18 (He et al., 2016) model as the classifier in most experiments. In training, we use the 10-step PGD adversary with α = 2/255. The models are trained via the SGD optimizer with momentum 0.9, weight decay 0.0005, and batch size 128. For CIFAR-10/100, the learning rate is 0.1 initially and is decayed by a factor of 10 at epochs 100 and 150, with 200 training epochs in total. For SVHN, the learning rate starts from 0.01 with a cosine annealing schedule for a total of 80 training epochs. In our method, we set η = 0.9 and w = 30 along a Gaussian ramp-up curve (Laine & Aila, 2017).
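To make the training objective in Eq. (7) concrete, the following sketch shows one PGD-AT+TE update in PyTorch. It reuses the pgd_linf sketch from the Sec. 3.1 example; the class and function names are ours, and the bookkeeping (per-epoch updates, the Gaussian ramp-up of w, the learning-rate schedule) is simplified relative to the released implementation:

```python
import torch
import torch.nn.functional as F

class TemporalEnsemble:
    """Momentum-averaged prediction p_i for every training sample (Laine & Aila, 2017)."""
    def __init__(self, num_samples, num_classes, eta=0.9):
        self.p = torch.zeros(num_samples, num_classes)
        self.eta = eta

    def update(self, idx, probs):
        self.p[idx] = self.eta * self.p[idx] + (1 - self.eta) * probs.detach().cpu()

    def target(self, idx, device):
        p = self.p[idx].to(device)                      # hat_p: normalized probability vector
        return p / p.sum(dim=1, keepdim=True).clamp_min(1e-12)

def pgd_at_te_step(model, optimizer, ensemble, x, y, idx, w):
    """One PGD-AT+TE update (Eq. (7)): CE on adversarial examples plus a squared
    penalty pulling f(x') toward the ensemble prediction hat_p_i."""
    x_adv = pgd_linf(model, x, y)                       # inner maximization (Sec. 3.1 sketch)
    logits = model(x_adv)
    reg = ((F.softmax(logits, dim=1) - ensemble.target(idx, x.device)) ** 2).sum(dim=1).mean()
    # w is ramped up from zero (Gaussian ramp-up), so the zero-initialized
    # ensemble targets have no effect during the early epochs.
    loss = F.cross_entropy(logits, y) + w * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                               # track predictions on natural x_i
        ensemble.update(idx, F.softmax(model(x), dim=1))
    return loss.item()
```

For TRADES+TE, the same squared penalty is added to the TRADES objective in Eq. (3).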
Evaluation results. We adopt PGD-10, PGD-1000, C&W-1000 (Carlini & Wagner, 2017), and AutoAttack (Croce & Hein, 2020b) to evaluate adversarial robustness rigorously. AutoAttack is a strong attack for evaluating model robustness, composed of an ensemble of diverse attacks, including APGD-CE (Croce & Hein, 2020b), APGD-DLR (Croce & Hein, 2020b), FAB (Croce & Hein, 2020a), and the Square attack (Andriushchenko et al., 2020). To show the effect of robust overfitting, we report the test accuracy at the best checkpoint, which achieves the highest robust test accuracy under PGD-10, and at the final checkpoint, as well as the difference between these two checkpoints. The results of PGD-AT, TRADES, and their combinations with our proposed approach (denoted as PGD-AT+TE and TRADES+TE) on the CIFAR-10, CIFAR-100, and SVHN datasets are shown in Table 1.

Figure 8: The natural and robust test accuracy curves (under PGD-10) of PGD-AT, TRADES, and their extensions by integrating the proposed TE approach. The models are trained on CIFAR-10 under the ℓ∞ norm with ϵ = 8/255 based on the ResNet-18 architecture.

We observe that the differences between the best and final test accuracies of our method are reduced to around 1%, while the accuracy gaps of PGD-AT and TRADES are much larger. This indicates that our method largely eliminates robust overfitting. Being less affected by robust overfitting, our method also achieves higher robust accuracies than the baselines. We further show the learning curves of these methods in Fig. 8, and we consistently demonstrate the effectiveness of our method on other network architectures (including WRN-34-10 and VGG-16) and threat models (including the ℓ2 norm) in Appendix C.2.

Discussion and comparison with related works. Our method is related to the label smoothing (LS) technique, which has been studied in AT (Pang et al., 2021). Recent works have also introduced smoothness in training labels and model weights (Chen et al., 2021; Huang et al., 2020), which can alleviate robust overfitting to some extent. The significant difference between our work and theirs is that we provide a reasonable explanation for robust overfitting (one-hot labels are noisy for AT), while previous methods did not give such an explanation and can be viewed as solutions to the problem we identify. To empirically compare with these methods, we conduct experiments on CIFAR-10 with the ResNet-18 network. Under the PGD-AT framework, we compare with the baseline PGD-AT, PGD-AT+LS, self-adaptive training (SAT) (Huang et al., 2020), and knowledge distillation with stochastic weight averaging (KD-SWA) (Chen et al., 2021). We also adopt the Co-teaching approach (Han et al., 2018) adapted to PGD-AT, which jointly trains two models using the filtered samples given by each other. The results under the adopted attacks are presented in Table 2. Although various techniques can alleviate robust overfitting, our method achieves better robustness than the others, validating its effectiveness. For Co-teaching, though robust overfitting is alleviated, the performance is worse than that of our proposed method due to the reduction of training data.

Table 2: Test accuracy (%) of the proposed method and other methods on CIFAR-10 under the ℓ∞ norm with ϵ = 8/255 based on the ResNet-18 architecture. Each cell reports Best / Final / Diff.
| Method | Natural Accuracy | PGD-10 | PGD-1000 | C&W-1000 | AutoAttack |
|---|---|---|---|---|---|
| PGD-AT | 83.75 / 84.82 / -1.07 | 52.64 / 44.92 / 7.72 | 51.22 / 42.74 / 8.48 | 50.11 / 43.63 / 7.48 | 47.74 / 41.84 / 5.90 |
| PGD-AT+LS | 82.68 / 85.16 / -2.48 | 53.70 / 48.90 / 4.80 | 52.56 / 46.31 / 6.25 | 50.41 / 46.06 / 4.35 | 49.02 / 44.39 / 4.63 |
| SAT | 82.81 / 81.86 / 0.95 | 53.81 / 53.31 / 0.50 | 52.41 / 52.00 / 0.41 | 51.99 / 51.71 / 0.28 | 50.21 / 49.73 / 0.48 |
| KD-SWA | 84.84 / 85.26 / -0.42 | 54.89 / 53.80 / 1.09 | 53.31 / 52.45 / 0.86 | 51.48 / 50.91 / 0.57 | 50.42 / 49.83 / 0.59 |
| Co-teaching | 81.94 / 82.22 / -0.28 | 51.27 / 50.52 / 0.75 | 50.15 / 49.12 / 1.03 | 50.85 / 49.86 / 0.99 | 49.60 / 48.49 / 1.11 |
| PGD-AT+TE | 82.35 / 82.79 / -0.44 | 55.79 / 54.83 / 0.96 | 54.65 / 53.30 / 1.35 | 52.30 / 51.73 / 0.57 | 50.59 / 49.62 / 0.97 |

6 CONCLUSION

In this paper, we demonstrate the capacity of DNNs to fit adversarial examples with random labels by exploring memorization in adversarial training, which also poses open questions on the convergence and generalization of adversarially trained models. We validate that some AT methods suffer from a gradient instability issue and that robust generalization can hardly be explained by complexity measures. We further identify a significant drawback of memorization in AT related to the robust overfitting phenomenon: robust overfitting is caused by memorizing one-hot labels in adversarial training. We propose a new mitigation algorithm to address this issue, whose effectiveness is validated extensively.

ACKNOWLEDGEMENTS

This work was supported by the National Key Research and Development Program of China (2020AAA0106000, 2020AAA0104304, 2020AAA0106302), NSFC Projects (Nos. 61620106010, 62061136001, 61621136008, 62076147, U19B2034, U1811461, U19A2081), Beijing NSF Project (No. JQ19016), Beijing Academy of Artificial Intelligence (BAAI), Tsinghua-Alibaba Joint Research Program, Tsinghua Institute for Guo Qiang, and Tsinghua-OPPO Joint Research Center for Future Terminal Technology.

ETHICS STATEMENT

The existence of adversarial examples can pose severe security threats to machine learning and deep learning models when they are deployed in real-world applications. The vulnerability to adversarial examples could also lower public confidence in machine learning techniques. Therefore, it is important to develop more robust models. As the most effective method for promoting model robustness, adversarial training (AT) has not been fully investigated. This paper aims to investigate the memorization effect of AT to facilitate a better understanding of its working mechanism. Some findings in this paper can be analyzed more deeply, including theoretical analysis of AT convergence, generalization, etc., which we leave to future work.

REPRODUCIBILITY STATEMENT

Most of the experiments are easily reproducible. We provide the code for reproducing the results at https://github.com/dongyp13/memorization-AT.

REFERENCES

Jean-Baptiste Alayrac, Jonathan Uesato, Po-Sen Huang, Alhussein Fawzi, Robert Stanforth, and Pushmeet Kohli. Are labels required for improving adversarial robustness? In Advances in Neural Information Processing Systems (NeurIPS), pp. 12214-12223, 2019.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), pp. 242-252, 2019.

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, and Matthias Hein. Square attack: a query-efficient black-box adversarial attack via random search. In European Conference on Computer Vision (ECCV), pp. 484-501, 2020.
Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), pp. 254 263, 2018. Devansh Arpit, Stanisław Jastrz ebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International Conference on Machine Learning (ICML), pp. 233 242, 2017. Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML), pp. 274 283, 2018. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 6240 6249, 2017. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machinelearning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Published as a conference paper at ICLR 2022 Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning (ICML), pp. 831 840, 2019. Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2019. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy (SP), pp. 39 57, 2017. Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems (Neur IPS), pp. 11192 11203, 2019. Tianlong Chen, Zhenyu Zhang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Robust overfitting may be mitigated by properly learned smoothening. In International Conference on Learning Representations (ICLR), 2021. Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? In International Conference on Learning Representations (ICLR), 2020. Minhao Cheng, Qi Lei, Pin-Yu Chen, Inderjit Dhillon, and Cho-Jui Hsieh. Cat: Customized adversarial training for improved robustness. ar Xiv preprint ar Xiv:2002.06789, 2020. Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In International Conference on Machine Learning (ICML), pp. 854 863, 2017. Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning (ICML), pp. 1310 1320, 2019. Francesco Croce and Matthias Hein. Minimally distorted adversarial examples with a fast adaptive boundary attack. In International Conference on Machine Learning (ICML), pp. 2196 2205, 2020a. Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International Conference on Machine Learning (ICML), pp. 2206 2216, 2020b. Yinpeng Dong, Zhijie Deng, Tianyu Pang, Jun Zhu, and Hang Su. Adversarial distributional training for robust deep learning. In Advances in Neural Information Processing Systems (Neur IPS), pp. 8270 8283, 2020a. 
Yinpeng Dong, Qi-An Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao Xiao, and Jun Zhu. Benchmarking adversarial robustness on image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 321 331, 2020b. Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning (ICML), pp. 1675 1685, 2019. Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954 959, 2020. Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2881 2891, 2020. Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of adversarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems (Neur IPS), pp. 13029 13040, 2019. Published as a conference paper at ICLR 2022 Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. Sven Gowal, Chongli Qin, Jonathan Uesato, Timothy Mann, and Pushmeet Kohli. Uncovering the limits of adversarial training against norm-bounded adversarial examples. ar Xiv preprint ar Xiv:2010.03593, 2020. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Advances in Neural Information Processing Systems (Neur IPS), pp. 8527 8537, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning (ICML), pp. 2712 2721, 2019. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700 4708, 2017. Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: beyond empirical risk minimization. In Advances in Neural Information Processing Systems (Neur IPS), pp. 19365 19376, 2020. Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Li Fei-Fei. Mentornet: Learning datadriven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning (ICML), pp. 2304 2313, 2018. Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations (ICLR), 2020. Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81 93, 1938. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. 
In International Conference on Learning Representations (ICLR), 2017. Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), 2017. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems (Neur IPS), pp. 6389 6399, 2018. Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1778 1787, 2018. Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, and Sabine Süsstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. In Advances in Neural Information Processing Systems (Neur IPS), 2020a. Shengchao Liu, Dimitris Papailiopoulos, and Dimitris Achlioptas. Bad global minima exist and sgd can reach them. In Advances in Neural Information Processing Systems (Neur IPS), 2020b. Published as a conference paper at ICLR 2022 Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR), 2018. Hartmut Maennel, Ibrahim Alabdulmohsin, Ilya Tolstikhin, Robert JN Baldock, Olivier Bousquet, Sylvain Gelly, and Daniel Keysers. What do neural networks learn when trained with random labels? In Advances in Neural Information Processing Systems (Neur IPS), pp. 19693 19704, 2020. Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray. Metric learning for adversarial robustness. In Advances in Neural Information Processing Systems (Neur IPS), pp. 480 491, 2019. Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9078 9086, 2019. Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS), pp. 1196 1204, 2013. Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In Dokl. akad. nauk Sssr, volume 269, pp. 543 547, 1983. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory (COLT), pp. 1376 1401, 2015. Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems (NIPS), pp. 5947 5956, 2017. Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations (ICLR), 2018. Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In International Conference on Machine Learning (ICML), pp. 4970 4979, 2019. 
Tianyu Pang, Kun Xu, Yinpeng Dong, Chao Du, Ning Chen, and Jun Zhu. Rethinking softmax crossentropy loss for adversarial robustness. In International Conference on Learning Representations (ICLR), 2020. Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. In International Conference on Learning Representations (ICLR), 2021. Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1944 1952, 2017. Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. In Advances in Neural Information Processing Systems (Neur IPS), pp. 13847 13856, 2019. Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning (ICML), pp. 4334 4343, 2018. Published as a conference paper at ICLR 2022 Leslie Rice, Eric Wong, and Zico Kolter. Overfitting in adversarially robust deep learning. In International Conference on Machine Learning (ICML), pp. 8093 8104, 2020. Amartya Sanyal, Puneet K. Dokania, Varun Kanade, and Philip Torr. How benign is benign overfitting ? In International Conference on Learning Representations (ICLR), 2021. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems (Neur IPS), pp. 5014 5026, 2018. Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! In Advances in Neural Information Processing Systems (Neur IPS), pp. 3358 3369, 2019. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929 1958, 2014. David Stutz, Matthias Hein, and Bernt Schiele. Confidence-calibrated adversarial training: Generalizing to unseen attacks. In International Conference on Machine Learning (ICML), pp. 9155 9166, 2020. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014. Zhuozhuo Tu, Jingwei Zhang, and Dacheng Tao. Theoretical analysis of adversarial learning: A minimax approach. In Advances in Neural Information Processing Systems (Neur IPS), pp. 12280 12290, 2019. Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning (ICML), pp. 6586 6595, 2019. Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations (ICLR), 2020. Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML), pp. 5286 5295, 2018. Eric Wong, Leslie Rice, and J. Zico Kolter. 
Fast is better than free: Revisiting adversarial training. In International Conference on Learning Representations (ICLR), 2020. Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems (Neur IPS), pp. 2958 2969, 2020. Cihang Xie and Alan Yuille. Intriguing properties of adversarial training. In International Conference on Learning Representations (ICLR), 2020. Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 501 509, 2019. Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. In International Conference on Machine Learning (ICML), pp. 7085 7094, 2018. Fisher Yu, Dequan Wang, Evan Shelhamer, and Trevor Darrell. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2403 2412, 2018. Published as a conference paper at ICLR 2022 Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016. Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. ar Xiv preprint ar Xiv:1906.00555, 2019. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017. Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. In Advances in Neural Information Processing Systems (Neur IPS), pp. 227 238, 2019a. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML), pp. 7472 7482, 2019b. Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Overparameterized adversarial training: An analysis overcoming the curse of dimensionality. In Advances in Neural Information Processing Systems (Neur IPS), pp. 679 688, 2020. Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems (Neur IPS), pp. 8778 8788, 2018. Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes overparameterized deep relu networks. Machine Learning, 109(3):467 492, 2020. Published as a conference paper at ICLR 2022 A ADDITIONAL EXPERIMENTS ON MEMORIZATION IN AT In this section, we provide additional experiments on the memorization behavior in AT. All of the experiments are conducted on NVIDIA 2080 Ti GPUs. The source code of this paper is submitted as the supplementary material, and will be released after the review process. A.1 MEMORIZATION OF PGD-AT AND TRADES A.1.1 DIFFERENT TRAINING SETTINGS We first demonstrate that the different memorization behaviors between PGD-AT and TRADES can be generally observed under various settings. (a) PGD-AT (b) TRADES Figure A.1: The natural and robust training accuracies of PGD-AT and TRADES on CIFAR-100 when trained on true or random labels. 
Figure A.2: The natural and robust training accuracies of (a) PGD-AT and (b) TRADES on SVHN when trained on true or random labels.

Datasets. Similar to Fig. 2, we show the accuracy curves of PGD-AT and TRADES when trained on true or random labels on CIFAR-100 (Krizhevsky & Hinton, 2009) in Fig. A.1 and on SVHN (Netzer et al., 2011) in Fig. A.2. We consistently observe that PGD-AT fails to converge with random labels, while TRADES converges successfully, although it does not reach 100% accuracy on SVHN.

Model architectures. We then consider other network architectures, including the DenseNet-121 model (Huang et al., 2017) and the deep layer aggregation (DLA) model (Yu et al., 2018). The corresponding results are shown in Fig. A.3. Similar results can be observed, although it may take more training epochs for TRADES to converge with the smaller DenseNet-121 network.

Figure A.3: The natural and robust training accuracies of PGD-AT and TRADES on CIFAR-10 with different architectures, (a) DenseNet-121 and (b) DLA, when trained on random labels.

Figure A.4: The natural and robust training accuracies of PGD-AT and TRADES on CIFAR-10 under the ℓ2-norm threat model when trained on random labels.

Threat models. We further consider the ℓ2-norm threat model, in which we set ϵ = 1.0 and α = 0.25 in the 10-step PGD adversary. The learning curves of PGD-AT and TRADES are shown in Fig. A.4 and exhibit similar results.

Perturbation budget. We study the memorization behavior in AT with different perturbation budgets ϵ. In Fig. A.5, we show that when the perturbation budget is ϵ = 16/255 (ℓ∞ norm), TRADES trained on random labels can still converge. But when we set a larger budget (e.g., ϵ = 32/255), neither PGD-AT nor TRADES can obtain near 100% robust training accuracy. We also find that under this condition, even AT trained on true labels cannot reach 100% robust training accuracy, indicating that the gradient instability issue discussed in Sec. 3.2 causes the convergence problem.

In summary, our empirical observation that PGD-AT and TRADES perform differently when trained on random labels is general across multiple datasets, network architectures, and threat models.

A.1.2 LEARNING CURVES UNDER DIFFERENT NOISE RATES

We show the learning curves of PGD-AT and TRADES under varying levels of label noise in Fig. A.6 and Fig. A.7, respectively. In this experiment, we adopt weight decay and data augmentation as regularization. We can see that the network achieves its maximum accuracy on the test set before fitting the noisy training set. Thus the model learns easy and simple patterns first before fitting the noise, similar to the finding in ST (Arpit et al., 2017). It can also be observed that under an 80% noise rate, PGD-AT fails to converge. Note that when the noise rate is 0%, the network is trained on true labels, but the robust test accuracy still decreases after a certain epoch. This phenomenon is called robust overfitting (Rice et al., 2020), which is studied in Sec. 4.

Figure A.5: The natural and robust training accuracies of (a) PGD-AT and (b) TRADES on CIFAR-10 with ϵ = 16/255 when trained on true or random labels.

Figure A.6: Accuracy curves of PGD-AT under different noise rates on CIFAR-10.

Figure A.7: Accuracy curves of TRADES under different noise rates on CIFAR-10.
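For reference, the uniform label noise used in this experiment can be injected with a few lines of code. The following is a minimal sketch, not the authors' released code, assuming a torchvision-style CIFAR-10 dataset and hypothetical names: each training label is replaced, with probability equal to the noise rate, by a class drawn uniformly from all classes, so a noise rate of 100% recovers the completely random labels of Appendix A.1.1.

```python
# Minimal sketch (not the authors' code) of injecting uniform label noise at a
# given noise rate: each training label is replaced, with probability
# `noise_rate`, by a class sampled uniformly from all `num_classes` classes.
# noise_rate=1.0 corresponds to completely random labels; noise_rate=0.0 keeps
# the true labels.
import numpy as np
from torchvision import datasets

def inject_uniform_label_noise(labels, noise_rate, num_classes, seed=0):
    rng = np.random.RandomState(seed)
    labels = np.array(labels).copy()
    flip = rng.rand(len(labels)) < noise_rate        # which samples get flipped
    labels[flip] = rng.randint(0, num_classes, size=flip.sum())
    return labels

# Example on CIFAR-10 (10 classes); the adversarially trained model is then fit
# to the (image, noisy label) pairs exactly as in ordinary PGD-AT/TRADES training.
train_set = datasets.CIFAR10(root="./data", train=True, download=True)
train_set.targets = inject_uniform_label_noise(train_set.targets, noise_rate=0.8,
                                               num_classes=10).tolist()
```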
A.1.3 EXPLICIT REGULARIZATIONS

We consider three common regularizers, including data augmentation, weight decay, and dropout (Srivastava et al., 2014). We train models based on TRADES on true and random labels with several combinations of explicit regularizers. As shown in Table A.1, the explicit regularizations do not significantly affect the model's ability to memorize adversarial examples with random labels.

A.2 MORE RESULTS ON THE CONVERGENCE OF AT

A.2.1 TRAINING CONFIGURATIONS

We study different training configurations for PGD-AT with random labels, considering the factors below. These experiments are conducted on CIFAR-10.

Model capacity. Recent work suggests that model size is a critical factor for obtaining better robustness (Madry et al., 2018; Xie & Yuille, 2020). A possible reason why PGD-AT fails to converge with random labels may therefore be insufficient model capacity. We thus try larger models, including WRN-34-20 (used in Rice et al. (2020)) and WRN-70-16 (used in Gowal et al. (2020)). However, using larger models under this setting does not solve the convergence problem.

Table A.1: The training accuracy, test accuracy, and generalization gap (%) of TRADES when trained on true or random labels, with and without explicit regularizations, including data augmentation (random crop and flip), weight decay (0.0002), and dropout (0.2).

Labels | Data Augmentation | Weight Decay | Dropout | Training Accuracy (Natural / Robust) | Test Accuracy (Natural / Robust) | Generalization Gap (Natural / Robust)
true | | | | 99.73 / 99.65 | 77.53 / 37.47 | 22.20 / 62.18
true | | | | 99.57 / 97.03 | 82.91 / 45.37 | 16.93 / 51.66
true | | | | 99.59 / 99.53 | 77.31 / 38.94 | 22.28 / 60.59
true | | | | 99.65 / 99.40 | 79.96 / 39.86 | 19.69 / 59.54
true | | | | 99.50 / 97.28 | 84.26 / 49.16 | 15.24 / 48.12
true | | | | 99.41 / 99.20 | 80.28 / 41.64 | 19.13 / 57.56
random | | | | 99.80 / 99.55 | 9.79 / 0.15 | 90.01 / 99.40
random | | | | 99.36 / 86.10 | 9.71 / 0.24 | 89.65 / 85.86
random | | | | 99.84 / 99.53 | 10.13 / 0.23 | 89.71 / 99.30
random | | | | 99.15 / 92.23 | 9.04 / 0.17 | 90.11 / 92.06
random | | | | 99.25 / 69.62 | 9.67 / 0.24 | 89.58 / 69.38
random | | | | 99.38 / 81.57 | 9.54 / 0.19 | 89.84 / 81.38

Attack steps. We adopt the weaker FGSM adversary (Goodfellow et al., 2015) for training. We also adopt the random initialization trick advocated by Wong et al. (2020) and adjust the step size to α = 10/255, yielding the fast adversarial training method (Wong et al., 2020). However, fast AT still cannot converge.

Optimizer. We try various optimizers, including SGD with momentum, Adam (Kingma & Ba, 2015), and SGD with Nesterov momentum (Nesterov, 1983); different learning rate schedules, including piecewise decay and cosine schedules; and different learning rates (0.1 and 0.01). None of these attempts makes PGD-AT converge.

Perturbation budget. The perturbation budget ϵ is an important factor affecting the convergence of PGD-AT. When ϵ approaches 0, PGD-AT degenerates into standard training, which can easily converge (Zhang et al., 2017). We hence try different values of ϵ and find that PGD-AT can converge with a smaller ϵ (e.g., ϵ = 1/255) but cannot converge when ϵ ≥ 2/255.

A.2.2 GRADIENT STABILITY UNDER COSINE SIMILARITY

In Fig. 4(a), we show the gradient change of PGD-AT, TRADES, and the clean CE loss under the ℓ2 distance. We further show the cosine similarity between the gradients at θ and θ + λd in Fig. A.8. The cosine similarity is also averaged over all data samples.
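The measurement reported in Fig. A.8 (and, with the ℓ2 distance, in Fig. 4(a)) can be reproduced along the following lines. This is a minimal sketch under assumed names (model, loss_fn, and loader are hypothetical placeholders), not the released code: for PGD-AT and TRADES, loss_fn would first run the inner PGD maximization before evaluating the loss, while for the CE curve it is the plain cross-entropy on natural examples.

```python
# Minimal sketch of the gradient-stability measurement: compare the parameter
# gradient at theta with the gradient at theta + lambda * d for a random
# direction d, via cosine similarity averaged over the evaluated samples.
import copy
import torch
import torch.nn.functional as F

def flat_grad(model, loss):
    # Flattened gradient of `loss` w.r.t. all trainable parameters of `model`.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def grad_cosine_similarity(model, loss_fn, loader, lam, device="cuda"):
    model.eval()
    # One random direction d over all parameters, normalized to unit norm so
    # that lambda controls the size of the parameter perturbation.
    direction = [torch.randn_like(p) for p in model.parameters()]
    norm = torch.sqrt(sum((d ** 2).sum() for d in direction))
    direction = [d / norm for d in direction]

    # A copy of the model evaluated at theta + lambda * d.
    perturbed = copy.deepcopy(model)
    with torch.no_grad():
        for p, d in zip(perturbed.parameters(), direction):
            p.add_(lam * d)
    perturbed.eval()

    sims = []
    for x, y in loader:  # batch size 1 gives a strictly per-sample average
        x, y = x.to(device), y.to(device)
        g1 = flat_grad(model, loss_fn(model, x, y))
        g2 = flat_grad(perturbed, loss_fn(perturbed, x, y))
        sims.append(F.cosine_similarity(g1, g2, dim=0).item())
    return sum(sims) / len(sims)

# Example with the clean cross-entropy loss; sweeping lam over [-0.05, 0.05]
# traces out one curve of Fig. A.8.
ce_loss = lambda model, x, y: F.cross_entropy(model(x), y)
```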
The results based on cosine similarity are consistent with those based on the ℓ2 distance, showing that the gradient of PGD-AT changes more abruptly.

A.2.3 THE FAILURES OF AT UNDER REALISTIC SETTINGS

We find that some AT methods (e.g., PGD-AT) suffer from a gradient instability issue, which results in the convergence problem when trained on random labels. Our analysis may also be valuable under other, more realistic settings.

First, PGD-AT fails to converge under an 80% noise rate. We think that the unstable gradients can overwhelm the useful gradients given by clean examples. To verify this, we train models under 80% uniform label noise, by either PGD-AT or standard training (ST) on natural examples. We then select 100 training images with wrong labels and another 100 training images with true labels for evaluation, and calculate the gradient norm of the cross-entropy loss w.r.t. the model parameters for each method. We show the results in Fig. A.9 (a code sketch of this measurement is given after Fig. A.10). For ST, the gradient norm of clean examples is larger than that of noisy examples at the beginning, which makes the model learn to classify the clean data. However, for PGD-AT, the gradient norm of clean examples is almost the same as that of noisy examples (the two curves overlap). The unstable gradients provided by noisy examples thus overwhelm the useful gradients given by clean examples, making the network fail to converge.

Figure A.8: The cosine similarity between the gradients at θ and θ + λd for different losses (PGD-AT, TRADES, and the clean CE loss), where θ is the model at initialization and λ ∈ [-0.05, 0.05].

Figure A.9: The gradient norm of PGD-AT and ST on clean examples and noisy examples, when trained under 80% uniform label noise.

Second, when the perturbation budget is large (e.g., ϵ = 64/255), PGD-AT cannot converge even with true labels, while TRADES can achieve about 50% training accuracy. This can also be explained by our convergence analysis: the gradient of PGD-AT is very unstable with a larger perturbation budget, making it fail to converge.

A.3 MORE DISCUSSIONS ON THE GENERALIZATION OF AT

As shown in Table A.1, when trained on true labels, although the regularizers help to reduce the generalization gap, the model without any regularization can still generalize non-trivially. The three explicit regularizations do not significantly affect the model's ability to memorize adversarial examples with random labels. Consequently, explicit regularizers are not an adequate explanation of generalization. Inspecting the learning dynamics of AT under different noise rates in Fig. A.6 and Fig. A.7, we see that the network achieves its maximum accuracy on the test set before fitting the noisy training set, meaning that the model learns simple patterns (i.e., clean data) before memorizing the hard examples with wrong labels, similar to the observation in standard training (Arpit et al., 2017). The results suggest that optimization by itself serves as an implicit regularizer to find a model with good generalization performance.

Figure A.10: The natural and robust test accuracies (over 200 training epochs) of TRADES on CIFAR-10 with random or adversarial initialization, trained with (a) vanilla SGD and (b) SOTA SGD.
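To make the measurement behind Fig. A.9 concrete, the following minimal sketch (with hypothetical variable names, not the released code) logs the ℓ2 norm of the parameter gradient of the cross-entropy loss separately on a fixed set of clean-label and flipped-label training examples; for PGD-AT the inputs would first be replaced by their PGD adversarial examples, while ST uses them as-is.

```python
# Minimal sketch of the per-group gradient-norm measurement in Fig. A.9.
import torch
import torch.nn.functional as F

def grad_norm(model, x, y):
    """L2 norm of the parameter gradient of the CE loss on examples (x, y)."""
    loss = F.cross_entropy(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads]).norm().item()

# clean_x, clean_y: 100 training images whose labels were left untouched
# noisy_x, noisy_y: 100 training images whose labels were flipped by the noise
def log_gradient_norms(model, clean_x, clean_y, noisy_x, noisy_y):
    model.eval()
    return {"clean": grad_norm(model, clean_x, clean_y),
            "noisy": grad_norm(model, noisy_x, noisy_y)}
```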
A recent work (Liu et al., 2020b) points out that in standard training, pre-training on random labels can lead to substantial performance degradation of subsequent SGD training on true labels, while adding regularizations can overcome the bad initialization caused by pre-training with random labels. In this paper, we further investigate whether this finding generalizes to adversarial training.

As PGD-AT cannot converge with random labels, we adopt TRADES for these experiments. Following Liu et al. (2020b), we consider two initialization strategies: random initialization and adversarial initialization, obtained by training on a random labeling of the training data. We also consider two training methods: vanilla SGD training and SOTA SGD training with data augmentation (random crops and flips), weight decay, and momentum. The results are shown in Fig. A.10. With vanilla SGD, the adversarial initialization leads to worse performance than the random initialization. But with the regularization techniques, the models with different initializations converge to nearly the same test accuracy. The results are consistent with the findings in Liu et al. (2020b).

B PROOF OF THEOREM 1

Proof. Recall that $\mathcal{J}(x, y, \theta) = \max_{x' \in S(x)} \mathcal{L}(f_\theta(x'), y)$ is the adversarial loss of PGD-AT. First, we have

$$
\begin{aligned}
\|\nabla_\theta \mathcal{J}(x, y, \theta_1) - \nabla_\theta \mathcal{J}(x, y, \theta_2)\|_2
&= \big\|\nabla_\theta \mathcal{J}(x, y, \theta_1) - \nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{J}(x, y, \theta_2) + \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y) \\
&\qquad + \nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\big\|_2 \\
&\le \|\nabla_\theta \mathcal{J}(x, y, \theta_1) - \nabla_\theta \mathcal{L}(f_{\theta_1}(x), y)\|_2 + \|\nabla_\theta \mathcal{J}(x, y, \theta_2) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2 \\
&\qquad + \|\nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2.
\end{aligned}
\tag{B.1}
$$

From the assumption, for any $x \in \mathbb{R}^d$ and $x' \in S(x)$, we have
$$\|\nabla_\theta \mathcal{L}(f_\theta(x'), y) - \nabla_\theta \mathcal{L}(f_\theta(x), y)\|_2 \le K \|x' - x\|_p \le \epsilon K,$$
due to the definition of $S(x)$. We also note that $\mathcal{J}(x, y, \theta)$ is the maximal cross-entropy loss $\mathcal{L}$ within $S(x)$, so that its gradient equals the gradient of $\mathcal{L}$ at the inner maximizer $x^* \in S(x)$, and thus
$$\|\nabla_\theta \mathcal{J}(x, y, \theta) - \nabla_\theta \mathcal{L}(f_\theta(x), y)\|_2 \le \epsilon K.
\tag{B.2}$$
Combining Eq. (B.1) and Eq. (B.2), we can obtain Eq. (5). Note that the bound is tight since all the equalities can be attained.

Remark 1. We note that a recent work (Liu et al., 2020a) gives a similar result on gradient stability. The difference is that they assume the loss function satisfies an additional Lipschitzian smoothness condition,
$$\|\nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2 \le K_\theta \|\theta_1 - \theta_2\|_2,$$
where $K_\theta$ is another constant. They then prove that
$$\|\nabla_\theta \mathcal{J}(x, y, \theta_1) - \nabla_\theta \mathcal{J}(x, y, \theta_2)\|_2 \le K_\theta \|\theta_1 - \theta_2\|_2 + 2\epsilon K.$$
Note that under this additional assumption, their result follows directly from Theorem 1. Therefore, Theorem 1 is more general than the previous one.

B.1 THEORETICAL ANALYSIS FOR TRADES

Note that TRADES adopts the KL divergence in its adversarial loss. The KL divergence is defined on two predicted probability distributions over all classes, as
$$D\big(f_\theta(x) \,\|\, f_\theta(x')\big) = \sum_{y \in \{1, \dots, C\}} f_\theta(x)_y \log \frac{f_\theta(x)_y}{f_\theta(x')_y}.$$
However, based on the local Lipschitz continuity assumption on the clean cross-entropy loss (which only involves the predicted probability of the true class) in Eq. (4), we cannot derive a theoretical bound on the gradient stability of TRADES similar to Eq. (5). Therefore, we need to make a different assumption on the KL divergence. For example, suppose the gradient of the KL divergence satisfies
$$\|\nabla_\theta D\big(f_\theta(x) \,\|\, f_\theta(x')\big)\|_2 \le K' \|x' - x\|_p$$
for any $x \in \mathbb{R}^d$, $x' \in S(x)$, and any $\theta$, where $K'$ is another constant. Denoting the adversarial loss of TRADES by $\mathcal{J}'(x, y, \theta)$, we then have

$$
\begin{aligned}
\|\nabla_\theta \mathcal{J}'(x, y, \theta_1) - \nabla_\theta \mathcal{J}'(x, y, \theta_2)\|_2
&\le \|\nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2 \\
&\qquad + \beta \Big\|\nabla_\theta \max_{x' \in S(x)} D\big(f_{\theta_1}(x) \,\|\, f_{\theta_1}(x')\big)\Big\|_2
+ \beta \Big\|\nabla_\theta \max_{x' \in S(x)} D\big(f_{\theta_2}(x) \,\|\, f_{\theta_2}(x')\big)\Big\|_2 \\
&\le \|\nabla_\theta \mathcal{L}(f_{\theta_1}(x), y) - \nabla_\theta \mathcal{L}(f_{\theta_2}(x), y)\|_2 + 2\beta\epsilon K'.
\end{aligned}
$$
Although we can derive a similar bound on the gradient stability of TRADES, this bound is not directly comparable to Eq. (5) since we cannot establish the relationship between K and K′. However, our empirical analysis of the gradient magnitude in Sec. 3.2 has shown that TRADES is dominated by the clean cross-entropy loss in the initial training epochs, so the gradient stability of the TRADES loss is similar to that of the clean cross-entropy loss, as also revealed in Fig. 4(a). Therefore, the gradient of TRADES is relatively stable.

C FULL EXPERIMENTS ON ROBUST OVERFITTING

C.1 ADDITIONAL EXPERIMENTS ON EXPLAINING ROBUST OVERFITTING

Figure C.1: The robust test accuracy of TRADES under various perturbation budgets ϵ.

Figure C.2: The adversarial loss of WRN-28-10 and ResNet-18 trained by PGD-AT on 500 samples, sorted by the loss of the first model (i.e., WRN-28-10).

First, we show the robust test accuracy curves of TRADES under various perturbation budgets in Fig. C.1. It can also be observed that when the perturbation budget is small, robust overfitting does not occur.

Second, we show that the hard training examples with higher adversarial loss values are consistent across different model architectures. We train one WRN-28-10 model and one ResNet-18 model with PGD-AT and calculate the adversarial loss of every training sample for both models. We show the adversarial loss on 500 samples, sorted by the loss of the first model (i.e., WRN-28-10), in Fig. C.2. The samples with lower adversarial losses under the first model also have relatively lower losses under the second one, and vice versa. The Kendall rank correlation coefficient of the adversarial loss between the two models is 0.78 in this case.

Third, we visualize the hard training examples with high adversarial loss values in Fig. C.3. These examples are difficult to recognize and their labels may be wrong. Therefore, the one-hot labels for these hard training examples can be noisy for AT, leading to the robust overfitting problem.

Figure C.3: The hard training examples (natural examples and their adversarial counterparts) with high adversarial loss values.

Table C.1: Test accuracy (%) of several methods using different model architectures and threat models. We choose the best checkpoint according to the highest robust accuracy on the test set under PGD-10.

Methods | Networks | Norms | Natural Accuracy (Best / Final / Diff) | PGD-10 (Best / Final / Diff)
PGD-AT | WRN-34-10 | ℓ∞ (ϵ = 8/255) | 86.58 / 86.83 / -0.25 | 55.83 / 49.52 / 6.31
PGD-AT+TE | WRN-34-10 | ℓ∞ (ϵ = 8/255) | 85.43 / 85.10 / 0.33 | 59.30 / 56.63 / 2.67
PGD-AT | VGG-16 | ℓ∞ (ϵ = 8/255) | 79.60 / 81.26 / -1.66 | 48.52 / 43.02 / 5.50
PGD-AT+TE | VGG-16 | ℓ∞ (ϵ = 8/255) | 78.19 / 79.13 / -0.94 | 52.06 / 51.29 / 0.77
PGD-AT | ResNet-18 | ℓ2 (ϵ = 128/255) | 88.82 / 88.96 / -0.14 | 69.05 / 65.96 / 3.09
PGD-AT+TE | ResNet-18 | ℓ2 (ϵ = 128/255) | 87.95 / 88.20 / -0.25 | 72.58 / 71.93 / 0.65
TRADES | ResNet-18 | ℓ2 (ϵ = 128/255) | 86.50 / 86.57 / -0.07 | 70.22 / 66.07 / 4.15
TRADES+TE | ResNet-18 | ℓ2 (ϵ = 128/255) | 88.42 / 88.60 / -0.18 | 72.72 / 72.43 / 0.29

C.2 ADDITIONAL EXPERIMENTS ON MITIGATING ROBUST OVERFITTING

We show the results of our proposed method on other network architectures (including WRN-34-10 and VGG-16) and threat models (including the ℓ2 norm) in Table C.1. The results consistently demonstrate the effectiveness of the proposed method. We further show the results of PGD-AT, PGD-AT+TE, TRADES, and TRADES+TE on CIFAR-10 over 3 runs in Table C.2.

Table C.2: Test accuracy (%) of several methods on CIFAR-10 under the ℓ∞ norm with ϵ = 8/255 based on the ResNet-18 architecture. We show the mean/std of the results over 3 runs.
Natural Accuracy
Method | Best | Final | Diff
PGD-AT | 83.76 ± 0.02 | 84.93 ± 0.26 | -1.17 ± 0.29
PGD-AT+TE | 82.36 ± 0.18 | 82.69 ± 0.14 | -0.33 ± 0.31
TRADES | 81.34 ± 0.15 | 82.70 ± 0.21 | -1.36 ± 0.36
TRADES+TE | 83.66 ± 0.19 | 83.89 ± 0.09 | -0.23 ± 0.21

PGD-10
Method | Best | Final | Diff
PGD-AT | 52.62 ± 0.10 | 44.91 ± 0.01 | 7.71 ± 0.11
PGD-AT+TE | 55.74 ± 0.17 | 54.82 ± 0.23 | 0.92 ± 0.07
TRADES | 53.25 ± 0.07 | 50.48 ± 0.23 | 2.77 ± 0.16
TRADES+TE | 54.93 ± 0.16 | 54.04 ± 0.19 | 0.89 ± 0.13

PGD-1000
Method | Best | Final | Diff
PGD-AT | 51.26 ± 0.03 | 42.72 ± 0.06 | 8.54 ± 0.06
PGD-AT+TE | 54.54 ± 0.27 | 53.01 ± 0.34 | 1.53 ± 0.24
TRADES | 52.24 ± 0.20 | 48.74 ± 0.17 | 3.50 ± 0.13
TRADES+TE | 53.55 ± 0.16 | 52.93 ± 0.07 | 0.62 ± 0.11

C&W-1000
Method | Best | Final | Diff
PGD-AT | 50.24 ± 0.12 | 43.59 ± 0.07 | 6.65 ± 0.19
PGD-AT+TE | 52.31 ± 0.01 | 51.67 ± 0.12 | 0.64 ± 0.11
TRADES | 49.83 ± 0.05 | 48.11 ± 0.04 | 1.72 ± 0.02
TRADES+TE | 50.80 ± 0.02 | 50.61 ± 0.07 | 0.19 ± 0.08

AutoAttack
Method | Best | Final | Diff
PGD-AT | 47.85 ± 0.17 | 41.62 ± 0.16 | 6.23 ± 0.26
PGD-AT+TE | 50.37 ± 0.22 | 49.36 ± 0.24 | 1.01 ± 0.03
TRADES | 48.86 ± 0.18 | 46.73 ± 0.07 | 2.13 ± 0.11
TRADES+TE | 49.40 ± 0.27 | 48.77 ± 0.21 | 0.63 ± 0.05