Journal of Machine Learning Research 26 (2025) 1-48. Submitted 12/22; Revised 11/24; Published 2/25

Regularizing Hard Examples Improves Adversarial Robustness

Hyungyu Lee (rucy74@snu.ac.kr), Electrical and Computer Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul 08826, Republic of Korea
Saehyung Lee (halo8218@snu.ac.kr), Electrical and Computer Engineering, Interdisciplinary Program in Artificial Intelligence, Seoul National University, Seoul 08826, Republic of Korea
Ho Bae* (hobae@ewha.ac.kr), Department of Cyber Security, Ewha Womans University, Seoul 03760, Republic of Korea
Sungroh Yoon* (sryoon@snu.ac.kr), Electrical and Computer Engineering, Interdisciplinary Program in Artificial Intelligence, AIIS, ASRI, INMC, and ISRC, Seoul National University, Seoul 08826, Republic of Korea

*Corresponding authors

Editor: Quanquan Gu

Abstract

Recent studies have validated that pruning hard-to-learn examples from training improves the generalization performance of neural networks (NNs). In this study, we investigate this intriguing phenomenon, namely the negative effect of hard examples on generalization, in adversarial training. In particular, we theoretically demonstrate that the increase in the difficulty of hard examples in adversarial training is significantly greater than the increase in the difficulty of easy examples. Furthermore, we verify that hard examples are only fitted through memorization of the label in adversarial training. We conduct both theoretical and empirical analyses of this memorization phenomenon, showing that pruning hard examples in adversarial training can enhance the model's robustness. However, the challenge remains in finding the optimal threshold for removing hard examples that degrade robustness performance. Based upon these observations, we propose a new approach, difficulty proportional label smoothing (DPLS), to adaptively mitigate the negative effect of hard examples, thereby improving the adversarial robustness of NNs. Notably, our experimental results indicate that our method can successfully leverage hard examples while circumventing the negative effect.

Keywords: adversarial robustness, adversarial examples, hard examples, memorization, robust overfitting

(c) 2025 Hyungyu Lee, Saehyung Lee, Ho Bae, and Sungroh Yoon. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v26/22-1428.html.

1. Introduction

The structural regularities of classification datasets have been investigated in several studies (Jiang et al., 2021; Paul et al., 2021; Wu et al., 2018), and measures have been proposed for identifying the regularities of training samples through training statistics. For example, Jiang et al. (2021) discovered that the learning speed of training examples is strongly correlated with the structural regularities, i.e., regular examples are learned quickly, whereas irregular examples are learned slowly. Paul et al. (2021) measured the difficulty of a training example by using the loss gradient norm. Based on the proposed measures, the aforementioned studies clarified the relationship between example difficulty and generalization in the context of deep neural networks (DNNs). In particular, a compelling finding among their observations motivates our study: memorizing hard-to-learn/irregular examples can deteriorate the generalization ability of DNNs.
In other words, hard-to-learn examples are unrepresentative outliers of a class or data with corrupted labels, and thus, these examples can result in the degradation of generalization. In this study, we explore this intriguing phenomenon, namely the negative effect of hard examples on generalization, in adversarial settings.

Figure 1: The test accuracy (for clean examples) of the standard (STD) models and the test accuracy (for adversarial examples) of the adversarial (PGD (Madry et al., 2017)) models trained on the subsets of CIFAR-10. On pruning hard examples from training, test accuracy increases both in the STD and PGD models, but the increase in the PGD model is greater than that in the STD model.

Adversarial training differs from standard training: it uses adversarial examples as the training data. Notably, adversarial training requires more data than standard training (Schmidt et al., 2018). Additionally, the improvement in generalization rendered by using more data is observably greater in an adversarial setting than in a standard setting (Lee et al., 2021; Gowal et al., 2021b). Therefore, recent studies on adversarial training have focused on approaches for using more data effectively to bridge the gap between train and test inferences of DNNs (Carmon et al., 2019; Gowal et al., 2021b). Nevertheless, in this study, we demonstrate that an improvement in generalization performance can be achieved by regularizing the use of data. Specifically, we demonstrate that the improvement achieved by regularizing the training of hard examples is greater in adversarial training.

Fig. 1 illustrates the difference between the effect of pruning hard examples in standard training (denoted as STD) and adversarial training using projected gradient descent (Madry et al., 2017) (denoted as PGD). After training models on subsets of CIFAR-10 (Krizhevsky et al., 2009) for 100 epochs, we either pruned hard examples from the training dataset or continued by training on the entire training data. Example difficulty is measured by accumulating 0-1 loss along the training trajectory; for further details, refer to Appendix B.2. We evaluated the test accuracy for clean examples in the STD models and the test accuracy for adversarial examples in the PGD models. Observably, pruning hard examples during training can improve the generalization performance, which is consistent with the previous studies' finding that training on hard examples can deteriorate generalization. However, the result indicates that the performance increase is greater in adversarial training. It can be inferred that the negative effect of hard examples on training is intensified by adversarial perturbations, and thus, the improvement by pruning hard examples is more substantial in adversarial training.
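For concreteness, the difficulty bookkeeping described above (accumulating the 0-1 loss of each training example along the training trajectory) can be sketched as follows. This is a minimal sketch, not the reference implementation; names such as `model`, `loader`, and the assumption that the loader yields per-example indices are illustrative.

```python
import torch

def update_difficulty(model, loader, difficulty, device="cuda"):
    """Accumulate the 0-1 loss (misclassification indicator) per training example.

    `difficulty` is a float tensor of length len(dataset); `loader` is assumed
    to yield (inputs, labels, indices) so that each example can be tracked by
    its dataset index. Called once per epoch along the training trajectory.
    """
    model.eval()
    with torch.no_grad():
        for x, y, idx in loader:
            x, y = x.to(device), y.to(device)
            pred = model(x).argmax(dim=1)
            # 0-1 loss: 1 if misclassified, 0 otherwise
            difficulty[idx] += (pred != y).float().cpu()
    return difficulty

# After accumulating over the trajectory, the hardest examples are those with
# the largest accumulated 0-1 loss, e.g.:
# hard_idx = torch.topk(difficulty, k=10000).indices
```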
In this study, we verify that the cause of this result is the negative effect of hard examples in adversarial training. We theoretically prove that the difficulty of each example increases in an adversarial setting and that the increase is more significant for hard examples than for easy examples. In this context, we demonstrate that hard examples are fitted only through memorization in adversarial training and that pruning them from the training process improves robustness. Based on this, we conduct a theoretical analysis of the memorization phenomenon of hard examples, proving that including hard examples in adversarial training reduces robustness performance. To demonstrate this empirically, we perform analytical experiments on the assumptions and outcomes of the theoretical analysis, showing that the theoretical analysis aligns with actual results. Consequently, we show that hard examples negatively impact the model's robustness. However, the challenge remains that finding the optimal difficulty threshold for removing hard examples that deteriorate robustness is an intractable proposition.

To address this problem, we propose a regularization method, difficulty proportional label smoothing (DPLS), to mitigate the negative effect of hard examples. Label smoothing (LS) (Szegedy et al., 2016) regularizes the prediction of each example to prevent the memorization of the one-hot label. From this perspective, DPLS extends the LS approach by adopting a smoothing factor that is proportional to the difficulty of the example, thereby regularizing the training of each example. Using the results of experiments on a variety of datasets and algorithms, we assess our proposed methodology and find that DPLS can successfully leverage hard examples while avoiding the negative effect.

In Section 2, we introduce the background and previous studies related to the topic addressed in this paper. In Section 3, we conduct a theoretical and empirical analysis of learning hard examples in adversarial training. In Section 4, we analyze approaches for effectively learning hard examples in adversarial training and propose a methodology. In Section 5, we demonstrate the effectiveness of the proposed methodology through experiments on various algorithms and datasets. Finally, in Section 6, we present the conclusions of the paper.

2. Related work

Adversarial robustness. Adversarial training (Madry et al., 2017) employs adversarial examples as training data to improve robustness. Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ represents a clean data point and $y_i \in \{1, \ldots, C\}$ denotes its corresponding label among $C$ classes. Adversarial training can be formulated in an empirical risk minimization form featuring the following optimization:

$$\min_\theta \frac{1}{n} \sum_{i=1}^{n} \max_{\delta \in S} L(f(x_i + \delta), y_i; \theta). \quad (1)$$

Here, $\theta$ denotes the parameters of a model $f$, $L$ represents the loss function, and $S = \{\delta \in \mathbb{R}^d : \|\delta\| \leq \epsilon\}$, where $\epsilon$ is an adversarial budget, which depicts an upper bound on adversarial perturbations. Another form of adversarial training is TRADES (Zhang et al., 2019), which controls the trade-off between robustness and standard accuracy by dividing the training loss into the loss of clean examples and a regularization term for robustness. For adversarial attacks, the fast gradient sign method (FGSM) is a general method for generating adversarial examples (Goodfellow et al., 2014). PGD is a multi-step version of FGSM (Madry et al., 2017). Adaptive auto attack (A3) is an efficient and reliable attack which utilizes adaptive random starts and a statistics-based discarding strategy (Liu et al., 2022b).

Robust and non-robust feature. In Tsipras et al. (2018), certain data features were defined as moderately or slightly correlated with the data labels. These features are useful for improving performance when training a standard classifier, but they can be manipulated by adversarial attacks and thereby reduce the classifier's performance. Such features are referred to as non-robust features.
In contrast, features that are not manipulated by adversarial attacks and maintain their correlation with the data labels are referred to as robust features. Ilyas et al. (2019) formulated a mathematical definition of robust and non-robust features based on their correlation with the labels. Specifically, for a data pair $(x, y)$ from a dataset $D$, a feature $g$ is defined as a $\rho$-useful feature if it satisfies $\mathbb{E}[y \cdot g(x)] \geq \rho$, where $\rho > 0$. Here, for an adversarial perturbation $\delta$, a useful feature $g$ is defined as a robust feature if it satisfies $\mathbb{E}[\inf_{\delta} y \cdot g(x + \delta)] \geq \rho$, and as a non-robust feature if it does not satisfy this condition. We conduct a theoretical analysis utilizing these concepts of robust and non-robust features as defined in previous studies.

Overfitting and memorization in adversarial training. Wong et al. (2020) conducted an analysis of overfitting in adversarial training, termed robust overfitting. Chen et al. (2020) utilized a smoothing method that combines knowledge distillation (Hinton et al., 2015) and stochastic weight averaging (Izmailov et al., 2018). Dong et al. (2022) also used a knowledge distillation method to mitigate overfitting caused by distribution shift and label noise. Yu et al. (2022) proposed a method to prevent overfitting of small-loss data combined with adversarial weight perturbation (Wu et al., 2020). Huang et al. (2020), Liu et al. (2021), and Dong et al. (2021) used the exponential moving average of predictions as a label for training to prevent robust overfitting. In particular, Dong et al. (2021) demonstrated that DNNs are sufficient to memorize training adversarial examples with random labels; reportedly, the cause of robust overfitting may be the memorization of one-hot labels.

Measurement of the difficulty of examples. Paul et al. (2021) demonstrated that the norm of the error vector (EL2N score) can be used to identify important and difficult training data. They discovered that examples with a high score tend to be hard to learn but important for improving performance. However, pruning certain examples with the highest scores can improve generalization performance, indicating that the highest-score examples tend to be unrepresentative. Jiang et al. (2021) proposed the consistency score (C-score), which is the expected accuracy of each instance for models trained with various data subsets excluding the given instance. While these scores can be utilized in several applications (Lee et al., 2022), we use these scores as the example difficulty in our study.

3. Analysis

In this section, we conduct a theoretical and empirical analysis of the effect of hard examples on training. Through this analysis, we reveal that in adversarial training, learning from hard examples primarily occurs through memorization. Next, we perform a theoretical analysis of the memorization of hard examples that occurs in adversarial training. Finally, we demonstrate the theoretical analysis through various empirical verifications of the memorization of hard examples.

3.1 Analysis on the effect of hard examples

We first demonstrate through a theoretical analysis of hard examples that training hard examples becomes more difficult as the adversarial budget in adversarial training increases, and we illustrate this theoretical analysis with experimental evidence.
Next, we compare the learning patterns of hard examples in standard and adversarial training, confirming that in adversarial training, the learning of hard examples primarily occurs through memorization.

3.1.1 Theoretical analysis of the effect of hard examples

Figure 2: A schematic illustration of easy ($x_2$) and hard ($x_1$) examples relative to the decision boundary.

We theoretically analyze the effect of hard examples in both standard and adversarial training. A simple model is designed to demonstrate the manner in which adversarial perturbations affect hard examples in adversarial training. We aim to design examples featuring different difficulties in learning. Specifically, we define a binary classification model $f(x) = \sigma(w^\top x + b)$, which is a mapping function $f: \mathcal{X} \to \mathcal{Y}$ from input space $\mathcal{X}$ to output space $\mathcal{Y}$, where $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$ and $\sigma(\cdot)$ represents a sigmoid function. To ensure simplicity, we assume that $\|w\| = 1$. Thereafter, we set the loss function as the cross-entropy loss and define easy and hard examples by controlling the distance between each example and the decision boundary. The hard example is set as $(x_1, y_1 = +1)$ where $w^\top x_1 + b = d_1$, and the easy example is set as $(x_2, y_2 = +1)$ where $w^\top x_2 + b = d_2$ and $d_1 < d_2$. The differences between the gradient norms of the examples in standard and adversarial settings are as follows:

Definition 1 (Gradient difference) The differences between the loss gradient norms on $x_1$ and $x_2$ in standard and adversarial training are defined as follows:

$$D_{std}(x_1, x_2) = \|(1 - \sigma(w^\top x_1 + b))\, x_1\| - \|(1 - \sigma(w^\top x_2 + b))\, x_2\|,$$
$$D_{adv}(x_1, x_2, \epsilon) = \|(1 - \sigma(w^\top (x_1 + \delta) + b))\, (x_1 + \delta)\| - \|(1 - \sigma(w^\top (x_2 + \delta) + b))\, (x_2 + \delta)\|. \quad (2)$$

Here, $\delta$ represents an adversarial perturbation with $w^\top \delta = -\epsilon$ (the loss-maximizing direction within the budget). $D_{std}$ denotes the difference between the loss gradient norms on the inputs in standard training, and $D_{adv}$ denotes that in adversarial training. The following theorem and corollary demonstrate that the difference between the loss gradient norms of hard and easy examples in adversarial training is more significant than that in standard training:

Theorem 2 Assume that the gradient norm of easy examples is smaller than the gradient norm of hard examples in both standard and adversarial training. Then, if $w^\top x_1 \geq 0$ and $0 < \epsilon < \frac{d_1 + d_2}{2}$, it satisfies

$$D_{std}(x_1, x_2) < D_{adv}(x_1, x_2, \epsilon). \quad (3)$$

Corollary 3 The increase in the gradient difference by an adversarial attack is proportional to the difficulty of the example and the size of the upper bound of adversarial perturbations. Thus, if $w^\top x_3 + b = d_3$ and $d_2 < d_3$, it satisfies

$$D_{adv}(x_1, x_2, \epsilon) < D_{adv}(x_1, x_3, \epsilon), \quad (4)$$

and if $\epsilon_1 < \epsilon_2$, it satisfies

$$D_{adv}(x_1, x_2, \epsilon_1) < D_{adv}(x_1, x_2, \epsilon_2). \quad (5)$$

The proof is reported in Appendix A.1. In a model whose training loss is sufficiently converged, the outputs of easy examples are saturated, i.e., $\sigma(w^\top x_2 + b) \approx 1$. Thus, the loss gradient norm of easy examples in standard training is assumed to be similar to that in adversarial training. This assumption is validated through empirical analysis. The change in the gradient norm of hard examples caused by adding an adversarial perturbation is larger than that of easy examples. According to Paul et al. (2021), the gradient norm can be used as a difficulty (importance) score. In other words, the difficulty increment of hard examples is more significant than that of easy examples.
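The quantities in Definition 1 can be checked numerically with a small NumPy sketch of the sigmoid model above. All concrete values (margins $d_1$, $d_2$, the budget $\epsilon$, and the dimensionality) are illustrative assumptions, chosen only to show the behavior stated in Theorem 2.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm(x, w, b):
    """Cross-entropy gradient norm of a positive example under f(x) = sigmoid(w.x + b)."""
    return (1.0 - sigmoid(w @ x + b)) * np.linalg.norm(x)

# Toy 2-d setup: ||w|| = 1; the hard example x1 has a small margin d1, the easy x2 a large margin d2.
w, b = np.array([1.0, 0.0]), 0.0
x1, x2 = np.array([0.2, 1.0]), np.array([3.0, 1.0])   # margins d1 = 0.2, d2 = 3.0
eps = 0.1
delta = -eps * w                                       # loss-maximizing perturbation, w.delta = -eps

D_std = grad_norm(x1, w, b) - grad_norm(x2, w, b)
D_adv = grad_norm(x1 + delta, w, b) - grad_norm(x2 + delta, w, b)
print(D_std, D_adv)   # D_adv > D_std, consistent with Theorem 2
```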
3.1.2 Empirical analysis of the effect of hard examples

The preliminary analysis verifies that the increase in the difficulty of hard examples is greater than the increase in that of easy examples in adversarial training, and we demonstrated that hard examples exert a negative effect on generalization (Fig. 1). We accordingly hypothesize that a major factor deteriorating generalization in adversarial training is the large increase in the difficulty of hard examples. We empirically verify this hypothesis. We select hard examples according to the average loss of the models along the training trajectory, in accordance with the strategy reported by Jiang et al. (2021). The loss can be any measure of the error of each example; we employ the 0-1 loss as the measurement of difficulty. For a clear representation, the model trained with standard training is denoted as STD, and the model trained via adversarial training as PGD. We represent the accuracy for clean examples as standard accuracy and the accuracy for adversarial examples as robust accuracy.

Figure 3: Average error vector norm (EL2N score (Paul et al., 2021)) of training examples. Left: easy and hard subsets for adversarial budgets $\epsilon = 0, 2, 4, 8$. Right: the difference between clean and adversarial examples for the easy and hard subsets at $\epsilon = 8$.

Comparing the difficulty of hard examples. We initially conduct an empirical study on the difficulty of hard examples by comparing the gradient norm in standard and adversarial training. Paul et al. (2021) reported that the average error vector norm (EL2N score) can approximate the expected training loss gradient norm of each example. Thus, we employ this score to compare the gradient norm in both types of training. We select the top 10k hard and top 10k easy example subsets from the training dataset of CIFAR-10 and subsequently calculate the average error vector norm of the examples in each subset during training, where the error vector norm of an input pair $(x, y)$ is defined as $\mathbb{E}\,\|f(x) - y\|_2$ for the one-hot label vector $y$. We train several models by varying the adversarial budget $\epsilon = 8$ (PGD), $4$, $2$, and $0$ (STD), and the result is obtained after the learning rate decay epoch, which is the training point indicating that the model is sufficiently trained. As illustrated in Fig. 3 (left), the average norm values of easy examples are small in all models, and the difference in the average norm of easy examples between models is also comparatively small. However, for hard examples, the average norm increases significantly as the adversarial budget $\epsilon$ increases. In the PGD model (right), a greater difference is observed between the clean and adversarial versions of hard examples than of easy examples. This result indicates that the loss gradient norm of hard examples is rendered greater in adversarial training while that of easy examples is not, which demonstrates our theoretical analysis presented in Section 3.1.
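The EL2N score used above can be computed with a short sketch like the following; the per-example index bookkeeping and the averaging over checkpoints are assumptions of the sketch rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def el2n_scores(model, loader, num_classes=10, device="cuda"):
    """EL2N score of Paul et al. (2021): L2 norm of the error vector ||softmax(f(x)) - y||_2.

    `loader` is assumed to yield (inputs, labels, indices); averaging over several
    checkpoints or seeds is done by calling this repeatedly and accumulating.
    """
    scores = {}
    model.eval()
    with torch.no_grad():
        for x, y, idx in loader:
            x, y = x.to(device), y.to(device)
            p = F.softmax(model(x), dim=1)
            onehot = F.one_hot(y, num_classes).float()
            err = (p - onehot).norm(dim=1)          # per-example error vector norm
            for i, s in zip(idx.tolist(), err.cpu().tolist()):
                scores[i] = s
    return scores
```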
Comparing the training of hard examples. The difficulty of hard examples is significantly amplified in adversarial training. We therefore investigate whether hard examples are fitted normally in adversarial training. Figs. 4a and 4b show the test accuracy and training accuracy of the top 10k hard, top 10k easy, and entire training examples in the STD and PGD models. Observably, test accuracy decreases marginally following the learning rate decay (epoch 100) in the STD model. In contrast, the result of the PGD model indicates that when the training robust accuracy of hard examples starts to increase (at the decay epoch 100), the test robust accuracy starts to decrease rapidly. Moreover, the training accuracy of the easy examples remains unchanged after the decay, indicating that the decrease in test performance is mostly attributed to the training of hard examples. Fig. 4c shows the training accuracy of hard examples and the test accuracy of models with varying adversarial budget $\epsilon$. The decrease in test accuracy and the increase in training accuracy of hard examples after the learning rate decay become more severe as $\epsilon$ increases. Therefore, given these characteristics of the learning patterns, we can regard this phenomenon as the memorization of hard examples that occurs in adversarial training, in contrast to standard training. Specifically, as the adversarial budget $\epsilon$ increases in adversarial training, an increasing number of hard examples are learned through memorization.

Figure 4: (a) and (b) show the training and test accuracy curves in the STD and PGD models, respectively. (c) shows the training accuracy of hard examples and the test accuracy of the models with different adversarial budgets $\epsilon$.

Table 1: Performance of models with and without pruning 10k hard or easy examples.

  Model          Std.    Rob.
  None           87.19   51.80
  Easy pruning   87.27   49.71
  Hard pruning   84.70   53.36

Pruning hard examples from training. We observed the phenomenon of hard examples being memorized after the learning rate decay, as shown in Fig. 4b. Next, we conducted experiments to verify whether the hard examples fitted through memorization actually contribute to the model's performance. To do this, as in Fig. 1, we trained the model after excluding the hard examples from the training dataset and then compared the performance. Specifically, we pruned the top 10k easy and top 10k hard examples from the training dataset of CIFAR-10, respectively. Table 1 shows the standard and robust accuracies of the PGD (None) and PGD pruning models (Easy and Hard pruning). From the robust accuracy results, we can see that excluding easy examples from the training dataset results in the expected decrease in robust accuracy. In contrast, when hard examples are excluded, robust accuracy increases. This suggests that learning hard examples may reduce the model's robustness. However, when focusing on standard accuracy, we observe that excluding hard examples from the training process leads to a decrease in standard performance. Taking this into consideration, learning hard examples appears to increase standard accuracy but decrease robust accuracy. This characteristic resembles the robustness-accuracy trade-off observed by Tsipras et al. (2018). Therefore, it implies that the model may be learning non-robust features through the memorization of hard examples.
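The pruning experiment behind Table 1 reduces to removing the top-k examples by difficulty and retraining. A minimal sketch, assuming a per-example `difficulty` array such as the accumulated 0-1 loss computed earlier (dataset and variable names are illustrative):

```python
import numpy as np
from torch.utils.data import Subset

def prune_subset(train_set, difficulty, k=10000, prune="hard"):
    """Return a Subset of the training set with the k hardest (or easiest) examples removed."""
    order = np.argsort(difficulty)                     # ascending: easiest first
    drop = order[-k:] if prune == "hard" else order[:k]
    keep = np.setdiff1d(np.arange(len(train_set)), drop)
    return Subset(train_set, keep.tolist())

# Usage sketch: adversarially train on the pruned subset and compare robust accuracy, e.g.
# pruned = prune_subset(cifar10_train, difficulty, k=10000, prune="hard")
```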
3.2 Theoretical analysis on the training of hard examples

In Section 3.1, we demonstrated that learning hard examples is challenging in adversarial training and observed that these hard examples are primarily learned through memorization. Subsequently, through hard example pruning experiments, we observed that the memorization of hard examples exhibits patterns similar to the learning of non-robust features. In this section, we conduct a theoretical analysis of this phenomenon in adversarial training. Specifically, we perform a theoretical analysis of the effects of learning through the memorization of hard examples in a binary classification task. We investigate the impact of hard examples on the model's performance in adversarial training, and through this, we theoretically demonstrate that including hard examples in the training process leads to a decrease in robustness performance.

Definition of data distributions in the binary classification task. We first define the task for the theoretical analysis of learning through the memorization of hard examples. Following Tsipras et al. (2018) and Ilyas et al. (2019), we define a binary classification task consisting of the following input-output pair $(x, y)$.

Definition 4 (Normal example data distribution)
$$y \overset{\text{u.a.r.}}{\sim} \{-1, +1\}, \qquad x \in \mathbb{R}^{d+2}: \quad x_0 \sim N(\eta y, 1), \quad x_{i \in \{1,\ldots,d\}} \overset{\text{i.i.d.}}{\sim} N(\beta \eta y, 1), \quad x_{d+1} = 0, \qquad (6)$$
where $\beta$ is a constant that satisfies $0 < \beta < 1$.

Here, assuming $\beta\eta < \epsilon < \eta$, $x_0$ corresponds to a robust feature, while $x_{i \in \{1,\ldots,d\}}$ correspond to non-robust features. Additionally, we introduce a memorization feature corresponding to $x_{d+1}$. The memorization feature represents a feature that exists only in specific data, specifically in hard examples, but does not exist in normal examples. If a model learns a memorization feature, it can be considered that the model has memorized it. We define that data sampled from the normal example data distribution does not exhibit memorization features. Therefore, we refer to examples without memorization features sampled from the distribution of Equation 6 as normal examples. Here, the set of datasets that can be generated from this distribution is defined as $\mathcal{X}_{normal}$.

Next, we define the distribution of hard examples in the above binary classification task. In Section 3.1, we observed that hard examples are learned through memorization. Additionally, we observed that this memorization seems to involve the learning of non-robust features. Based on this, we make the following assumptions about hard examples.

Assumption 1 (Properties of hard examples from the perspective of adversarial robustness) A hard example, difficult for the model to learn, consists only of non-robust and memorization features. The memorization feature functions as a robust feature during the training phase but acts as a non-robust feature for unseen data during the test phase.

These assumptions describe hard examples that become more difficult to learn as the adversarial budget increases, aligning them with the properties of non-robust features. Their learning through memorization corresponds to memorization features, which act as robust features during training by avoiding zero assignment in adversarial training (Tsipras et al., 2018). However, during testing on unseen data, these features behave as non-robust features, increasing standard accuracy but decreasing robust accuracy. Based on these assumptions, we define the distribution of hard examples.

Definition 5 (Hard example data distribution)
$$y \overset{\text{u.a.r.}}{\sim} \{-1, +1\}, \qquad x^{hard} \in \mathbb{R}^{d+2}: \quad x_0 = 0, \quad x_{i \in \{1,\ldots,d\}} \overset{\text{i.i.d.}}{\sim} N(\beta \eta y, 1), \quad x_{d+1} \sim \begin{cases} N(\eta y, 1), & \text{training phase} \\ N(\beta \eta y, 1), & \text{test phase (unseen data)} \end{cases} \qquad (7)$$

The hard example data distribution is characterized by the robust feature $x_0$ taking a value of 0. The memorization feature, while being a non-robust feature, is defined to behave like a robust feature during training. We define the set of datasets that can be generated from this distribution as $\mathcal{X}_{hard}$.

Analysis on the effect of training hard examples. Next, we compare the performance of a binary classification model trained on datasets sampled from the data distributions introduced above. To train the model, we sample $m$ normal examples to form the training dataset $X^{train}_{normal} \subset \mathcal{X}_{normal}$ and sample $n$ hard examples to construct a hard example training set $X^{train}_{hard} \subset \mathcal{X}_{hard}$. We then examine the case where the model is trained using the combined dataset $X^{train}_{normal} \cup X^{train}_{hard}$, consisting of $m+n$ data points. For simplicity, we redefine the model $f$ as $f = \mathrm{sign}(w^\top x + b)$ and assume $\|w\| = 1$. Here, the weights $w_{adv}$ of the adversarially trained model $f_{adv}$ (with $\beta\eta < \epsilon < \eta$) are learned as follows, and the model's standard and robust accuracies are expressed as:

$$w_0 = \frac{m}{Z_{adv}(m+n)}, \qquad w_{i \in \{1,\ldots,d\}} = 0, \qquad w_{d+1} = \frac{n}{Z_{adv}(m+n)},$$
$$\Pr\left[f_{adv}(x) = y\right] = \Pr\left[\frac{1}{Z_{adv}} \frac{m}{m+n} N(\eta, 1) + b y > 0\right], \qquad \Pr\left[f_{adv}(x + \delta) = y\right] = \Pr\left[\frac{1}{Z_{adv}} \frac{m}{m+n} N(\eta - \epsilon, 1) + b y > 0\right]. \qquad (8)$$

Here, $Z_{adv}$ is a constant that ensures $\|w\| = 1$. The model's performance is measured on a subset of the normal example data distribution, $X^{test}_{normal} \subset (\mathcal{X}_{normal} \setminus X^{train}_{normal})$, which does not include the training data. Since normal examples do not have memorization features, the results are as above. Then, based on Equation 8, the following theorem holds when $m > 0$.

Theorem 6 (Degradation of normal test accuracy with increasing hard examples in training) For a model $f_{adv}$ adversarially trained on the combined dataset $X^{train}_{normal} \cup X^{train}_{hard}$, where $X^{train}_{normal}$ consists of $m$ normal examples and $X^{train}_{hard}$ consists of $n$ hard examples, both the standard accuracy $\Pr[f_{adv}(x) = y]$ and the robust accuracy $\Pr[f_{adv}(x + \delta) = y]$ on the normal example test dataset $X^{test}_{normal}$ are monotonically decreasing with respect to $n$.

The above theorem demonstrates that including additional hard examples during training does not contribute to improving the performance on normal examples in adversarial training. Now, we similarly evaluate the model's performance on hard examples. The performance is measured on a subset of the hard example data distribution, $X^{test}_{hard} \subset (\mathcal{X}_{hard} \setminus X^{train}_{hard})$.

$$\Pr\left[f_{adv}(x^{hard}) = y\right] = \Pr\left[\frac{1}{Z_{adv}} \frac{n}{m+n} N(\beta\eta, 1) + b y > 0\right], \qquad \Pr\left[f_{adv}(x^{hard} + \delta) = y\right] = \Pr\left[\frac{1}{Z_{adv}} \frac{n}{m+n} N(\beta\eta - \epsilon, 1) + b y > 0\right]. \qquad (9)$$

Hard examples do not contain robust features, and their performance is determined by the model weight corresponding to the memorization feature. Then, based on Equation 9, the following theorem holds when $n > 0$.

Theorem 7 (Trade-off in hard test accuracy with increasing hard examples in training) For a model $f_{adv}$ adversarially trained on the combined dataset $X^{train}_{normal} \cup X^{train}_{hard}$, where $X^{train}_{normal}$ consists of $m$ normal examples and $X^{train}_{hard}$ consists of $n$ hard examples, the standard accuracy $\Pr[f_{adv}(x^{hard}) = y]$ on the hard example test dataset $X^{test}_{hard}$ is monotonically increasing with respect to $n$, but the robust accuracy $\Pr[f_{adv}(x^{hard} + \delta) = y]$ is monotonically decreasing with respect to $n$.
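A small Monte-Carlo sketch of the toy distributions in Definitions 4 and 5 and of the Eq. (8)-style weights can make these trends concrete. The parameter values, the zero-bias simplification ($b = 0$), and the worst-case $\ell_\infty$ perturbation are assumptions of the sketch only; with $b = 0$ the sketch reproduces the robust-accuracy degradation on normal test data as $n$ grows, while the standard-accuracy behavior additionally depends on the bias term treated in the full derivation in Appendix A.2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, eta, beta, eps = 10, 2.0, 0.25, 1.0     # illustrative values with beta*eta < eps < eta
m, N_test = 10000, 200000

def sample(kind, n_samples, phase="test"):
    """Sample from Definition 4 (kind='normal') or Definition 5 (kind='hard')."""
    y = rng.choice([-1.0, 1.0], size=n_samples)
    x = np.zeros((n_samples, d + 2))
    x[:, 1:d + 1] = rng.normal(beta * eta * y[:, None], 1.0, size=(n_samples, d))  # non-robust
    if kind == "normal":
        x[:, 0] = rng.normal(eta * y, 1.0)            # robust feature; memorization feature stays 0
    else:
        mem_mean = eta if phase == "train" else beta * eta
        x[:, d + 1] = rng.normal(mem_mean * y, 1.0)   # memorization feature; robust feature stays 0
    return x, y

def eval_weights(n_hard):
    """Accuracy of Eq. (8)-style weights w = (m, 0, ..., 0, n_hard)/Z on normal test data."""
    w = np.zeros(d + 2)
    w[0], w[d + 1] = m, n_hard
    w /= np.linalg.norm(w)
    x, y = sample("normal", N_test)
    margin = y * (x @ w)
    std = (margin > 0).mean()
    rob = (margin > eps * np.abs(w).sum()).mean()     # worst-case perturbation of size eps per coordinate
    return std, rob

for n_hard in [0, 2000, 10000]:
    print(n_hard, eval_weights(n_hard))               # robust accuracy drops as n_hard grows
```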
To summarize Theorem 6 and Theorem 7, including hard examples in adversarial training can lead to either an increase or a decrease in standard accuracy. This is because the performance on normal examples decreases, while the performance on hard examples increases. However, robust accuracy always decreases because performance declines across all types of examples. The derivation process of the equations introduced in this section, as well as the proofs of the theorems and corollaries, can be found in Appendix A.2.

3.3 Empirical verification on the training of hard examples

In this section, we conduct an empirical analysis to validate the assumptions and theorems discussed in Section 3.2. We first verify the assumptions regarding hard examples, and then we empirically examine the theorems.

3.3.1 Empirical verification of the properties of hard examples

We proceed with the validation of Assumption 1. If the assumption holds, the properties of a model learning hard examples are characterized as follows.

1. Memorization: For the hard example dataset $X^{train}_{hard}$, the model $f_{adv,hard}$, which was adversarially trained with $X^{train}_{hard}$ included, exhibits high robust accuracy on $X^{train}_{hard}$. However, the model $f_{adv,\setminus hard}$, which was adversarially trained without $X^{train}_{hard}$, shows nearly 0% robust accuracy on the unseen $X^{train}_{hard}$.

2. Infeasibility of adversarial training: Standard training is possible with only hard examples, but adversarial training is not. Specifically, let $f$ be a model trained solely on $X^{train}_{hard}$. Then, $\Pr[f_{std}(x) = y] > \Pr[f_{adv}(x) = y] > \frac{1}{C}$, while $\Pr[f_{adv}(x + \delta) = y] < \frac{1}{C}$, where $C$ is the number of classes.

3. Trade-off: As the proportion of hard examples in the training dataset increases, the test robust accuracy decreases, while the test standard accuracy increases.

Property 1: Memorization. Based on Assumption 1, when a hard example dataset $X^{train}_{hard}$ is included in adversarial training, the memorization feature acts as a robust feature for the hard examples included in the training. As a result, even when adversarial attacks are applied to these hard examples, the memorization feature enables high robust accuracy on these examples. In contrast, if the same dataset $X^{train}_{hard}$ is not included in adversarial training, the memorization feature acts as a non-robust feature for these hard examples or may not have been learned at all. In this case, when adversarial attacks are applied to these unseen hard examples, the memorization feature cannot be utilized, leaving no available features for prediction, resulting in 0% accuracy. To illustrate this, we conduct the following experiment.

Figure 5: Comparison of the memorization property in adversarial training. Only hard examples exhibit the memorization property.

For the CIFAR-10 dataset, the hard example data subset $X^{train}_{hard}$ is composed of the top 5k hard examples. Two models are trained: one including the top 5k hard examples ($f_{adv,hard}$) and the other excluding them ($f_{adv,\setminus hard}$). The model including the hard examples was trained using all 50k CIFAR-10 data points, while the model excluding the hard examples was trained using the remaining 45k data points after removing the top 5k hard examples. After training, the performance of both models was evaluated on the top 5k hard examples $X^{train}_{hard}$. Additionally, for comparison with normal examples, a normal example data subset was composed of the top 5k easy examples, and the same experiments were conducted. Fig. 5 compares the memorization property in adversarial training.
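The evaluation protocol for this check can be sketched as follows. Model, loader, and attack names are illustrative; `attack(model, x, y)` is assumed to be any callable that returns adversarially perturbed inputs (for instance, a standard PGD attack).

```python
import torch

def robust_accuracy(model, loader, attack, device="cuda"):
    """Robust accuracy of `model` on `loader` under the given attack function."""
    correct, total = 0, 0
    model.eval()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = attack(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return 100.0 * correct / total

# Memorization check (Property 1): evaluate both models on the same hard subset.
# acc_with    = robust_accuracy(model_trained_with_hard,    hard_subset_loader, attack)
# acc_without = robust_accuracy(model_trained_without_hard, hard_subset_loader, attack)
# Under Assumption 1, acc_with is high while acc_without is near 0%.
```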
Observably, normal examples show significant accuracy on data not included in the training, even though they were not part of the training process. In contrast, when hard examples are included in the training, the model can achieve up to 100% accuracy on these hard examples. However, when they are not included in the training, the model shows 0% accuracy. This demonstrates that, unlike normal examples, hard examples exhibit the memorization property. Additional details regarding Fig. 5 can be found in Appendix B.4.

Property 2: Infeasibility of adversarial training. From Assumption 1, when adversarial training is conducted using only hard examples, the model only learns the memorization feature; this leads to a robust accuracy that is lower than random prediction because the memorization feature acts as a non-robust feature. Conversely, standard accuracy is higher than random prediction because of the same memorization feature. In contrast, standard training, which learns both non-robust and memorization features, will show higher standard accuracy compared to the adversarially trained model. To illustrate this, we conduct the following experiment.

Table 2: Performance of models trained with 10k examples.

  Method    STD     PGD (Std.)   PGD (Rob.)
  Normal    78.46   61.63        39.56
  Hard      77.60   19.92        8.68

We construct a hard example data subset using the top 10k hard examples and a normal example data subset using the top 10k easy examples. Then, instead of using the entire dataset, we train models using only these respective subsets through standard training and adversarial training to compare their performance. We measure the standard (Std.) and robust (Rob.) accuracy for the standard (STD) and adversarial (PGD) models. In Table 2, the STD model trained with the hard example data subset exhibits performance comparable to that of the STD model trained with the easy example data subset. On the other hand, the PGD model trained with the hard example data subset shows lower standard accuracy than the STD model. As expected, in terms of robust accuracy, the model performs worse than random prediction (10%). Therefore, it can be concluded that standard training is possible with only hard examples, but adversarial training is not. Additional details can be found in Appendix B.4.

Figure 6: The standard (left) and robust (right) accuracies of the models whose labels of subsets (5k normal or 5k hard examples) are corrupted.

Property 3: Trade-off. We have already demonstrated in Table 1, through the hard example pruning experiment, that learning hard examples exhibits the robustness-accuracy trade-off. To further support this, we additionally conduct a label corruption (shuffling) experiment, where the labels of hard examples are randomly changed. When label corruption is applied to hard examples, the corresponding weight of the memorization feature is learned incorrectly but does not become zero. Consequently, while this does not increase standard accuracy, the vulnerability from the non-robust features remains, so unlike with pruning, robust accuracy is expected to remain unchanged rather than increase. We construct a hard example data subset using the top 5k hard examples and a normal example data subset using the top 5k easy examples. We replace only the labels of data in each subset with random labels.
We then conduct adversarial training on each of these datasets. The standard (left) and robust (right) accuracy curves of the models trained on the datasets with corrupted labels for normal and hard examples are shown in Fig. 6. Examining the results for standard accuracy (left), it is evident that performance decreases for both normal and hard examples when label corruption is applied. In the results for robust accuracy (right), it is observed that while the performance of the models with corrupted normal examples consistently declines during training, the performance of the model with corrupted hard examples behaves similarly to that of the PGD model (None). This indicates that even when the labels of hard examples are corrupted, they do not significantly affect the robust accuracy of the adversarial training model. Additional details regarding Fig. 6 can be found in Appendix B.4.

Consequently, we empirically verify that the various characteristics that can arise in the learning of hard examples, as suggested by Assumption 1, appear as expected. Therefore, we confirm that Assumption 1, which posits that hard examples consist only of non-robust features and memorization features, with the memorization feature acting as a robust feature on the training data and as a non-robust feature on the test data, is supported to a certain degree.

Figure 7: (a) and (b) show the standard and robust accuracies of STD and PGD models where a subset of the training data is pruned (pruning ratio from 0.00 to 0.35). (c) shows the results of repeating the experiment for different adversarial budgets.

3.3.2 Empirical verification of the training of hard examples

In this section, we conduct an empirical analysis of Theorem 6 and Theorem 7, which provide a theoretical analysis of the effect of increasing the number of hard examples in the training data during adversarial training. We empirically verify the theory through further experiments related to Fig. 1 and Table 1. We prune the top $n\%$ easy examples ($\subset \mathcal{X}_{normal}$) and the top $n\%$ hard examples ($\subset \mathcal{X}_{hard}$) from the training dataset of CIFAR-10, respectively ($0 \leq n \leq 35$). Figs. 7a and 7b show the standard and robust accuracies of the PGD pruning models. In Fig. 7b, pruning hard examples improves the robust accuracy. This aligns with the theoretical analysis presented in Section 3.2, which suggests that the robustness of adversarial training models decreases as the number of hard examples in the training dataset increases. In contrast to hard examples, pruning normal examples leads to a decrease in performance. This outcome is likely due to the increased proportion of hard examples in the overall dataset. Fig. 7c demonstrates that the improvement from pruning hard examples decreases as the adversarial budget $\epsilon$ decreases. Therefore, the performance improvement observed by pruning hard examples from the training dataset can be considered a characteristic of adversarial training. Through Theorem 6 and Theorem 7, we observed that increasing the number of hard examples in the training dataset decreases the standard accuracy on normal examples but increases the standard accuracy on hard examples. Fig. 7a further confirms that the increase in standard accuracy is indeed predominant.
To verify this, we conduct a theoretical analysis comparing the increase and decrease in standard accuracy as presented in Theorem 6 and Theorem 7. As noted in Corollary 2 in Appendix A.3, when the number of normal examples is sufficiently greater than the number of hard examples, increasing the number of hard examples in the training dataset leads to an increase in overall standard accuracy, as observed in Fig. 7a. Since typical datasets are assumed to have more normal examples than hard examples, this condition appears to be probable.

4. Mitigating the negative effect of hard examples

In this section, we explore approaches to mitigate the negative effects of learning hard examples. We evaluate the effectiveness of methodologies that regularize the learning of hard examples based on label smoothing (LS). Through this evaluation, we find that basic regularization techniques like label smoothing are inadequate for regularizing the learning of hard examples. To address this, we theoretically demonstrate that the appropriate direction for label smoothing regularization in the context of hard examples is to allocate the label smoothing factor in proportion to the difficulty of the hard examples. Accordingly, we propose difficulty proportional label smoothing (DPLS), a method that properly regularizes hard examples and enables effective adversarial training.

4.1 Mitigation of the negative effect of hard examples

Although training hard examples impairs the performance in adversarial training, simply pruning hard examples from training is not the optimal solution. In Section 3.2, we defined normal and hard examples as disjoint distributions, but in real data, this may not be the case. As shown in Fig. 7b, pruning the top 20% of hard examples improves performance, but beyond that point, performance starts to decline. Moreover, according to Fig. 7c, the point at which performance degradation begins varies depending on the adversarial budget $\epsilon$. These results suggest that finding the appropriate hard example pruning point depends on the type of dataset and the adversarial budget, making the task of identifying the optimal difficulty threshold for removing hard examples that deteriorate robustness an intractable proposition.

Thus, instead of pruning hard examples, we applied label smoothing (LS) (Szegedy et al., 2016) to subsets of the dataset to mitigate the negative effect of hard examples. Label smoothing is a technique aimed at reducing model overconfidence by assigning a value of $\lambda$ to the correct label while distributing the remaining value $1 - \lambda$ uniformly across the other labels. Here, $\lambda$ represents the label smoothing factor and lies within the range $[\frac{1}{C}, 1]$, where $C$ is the number of classes. Following the above pruning experiment, we applied LS ($\lambda = 0.9$) to the top $n\%$ of hard examples during training and then evaluated the model's performance. Fig. 8b depicts the test robust accuracy of the PGD LS models. As with the pruning models, applying LS to the hard example subset increases robust accuracy, but applying LS to the normal example subset decreases performance. Fig. 8c shows that the increase in robust accuracy is similar to that observed with pruning. According to Fig. 8a, only marginal decreases are observed in standard accuracy. This implies that LS produces effects similar to pruning; however, the decrease in standard accuracy is less pronounced compared to pruning. Nevertheless, considering the performance drop observed when LS is applied to normal examples, it is expected that continuing to increase the proportion of hard examples to which LS is applied will eventually lead to a decrease in performance. Consequently, similar to pruning, determining the optimal difficulty threshold for applying LS to hard examples that negatively impact robustness is a challenging and complex task.
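The subset label-smoothing baseline above can be sketched as follows: a fixed factor ($\lambda = 0.9$ in the experiment) is applied only to the hardest examples. The helper names, the set `hard_idx`, and the assumption of index-aware batches are illustrative.

```python
import torch
import torch.nn.functional as F

def smooth_labels(y, lam, num_classes):
    """Standard label smoothing: the true class gets lam, the rest share 1 - lam."""
    target = torch.full((y.size(0), num_classes), (1.0 - lam) / (num_classes - 1))
    target[torch.arange(y.size(0)), y] = lam
    return target

def subset_ls_targets(y, idx, hard_idx, lam=0.9, num_classes=10):
    """Apply LS with a fixed factor only to examples whose dataset index is in `hard_idx`."""
    target = F.one_hot(y, num_classes).float()
    mask = torch.tensor([i in hard_idx for i in idx.tolist()])
    if mask.any():
        target[mask] = smooth_labels(y[mask], lam, num_classes)
    return target  # train with soft-target cross-entropy on these targets
```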
Figure 8: (a) the test standard accuracy and (b) the test robust accuracy of the PGD LS models as the label smoothing ratio varies from 0.00 to 0.20. (c) the performance comparison of the PGD LS and pruning models.

4.2 Our method to mitigate the negative effect of hard examples

Since it is difficult to accurately measure and define the difficulty of data, it is challenging to precisely determine whether a given data point is a hard example that does not contain robust features. We can expect that there will be data points on the boundary between hard and normal examples, depending on the difficulty of learning the data. Therefore, considering this difficulty, we define an input-label pair distribution that includes both hard and normal examples as follows.

Definition 8 (Data distribution with difficulty)
$$y \overset{\text{u.a.r.}}{\sim} \{-1, +1\}, \qquad x \in \mathbb{R}^{d+2}: \quad x_0 \sim N(\alpha \eta y, 1), \quad x_{i \in \{1,\ldots,d\}} \overset{\text{i.i.d.}}{\sim} N(\beta \eta y, 1), \quad x_{d+1} \sim \begin{cases} N(\gamma \eta y, 1), & \text{training phase} \\ N(\beta \eta y, 1), & \text{test phase (unseen data)} \end{cases} \qquad (10)$$

Here, $0 \leq \alpha \leq 1$ represents the difficulty of the data, with each data point having a different value. This reflects the difficulty through the coefficient of the robust feature, with lower difficulty corresponding to a higher value of $\alpha$. When $\alpha\eta \leq \epsilon$, the data point can be considered a hard example. $0 \leq \gamma \leq 1$ represents the coefficient of the memorization feature, which is inversely related to the robust feature coefficient $\alpha$; when $\gamma\eta > \epsilon$, the data point is classified as a hard example.

Next, we explore approaches to mitigate the negative effects of hard examples (data with low $\alpha$) when training a model using a dataset composed of data points with varying levels of difficulty. Unlike pruning, LS sacrifices less standard accuracy and also allows for adjusting the intensity of the regularization through the LS factor $\lambda \in [\frac{1}{C}, 1]$. Therefore, we explore an approach that adaptively mitigates the negative effects of hard examples by leveraging these characteristics of LS. It is established that during adversarial training, the optimal weight configuration for achieving high robust accuracy is one where the weight values corresponding to features other than the robust feature are zero, i.e., $w^\ast = (1, 0, 0, \ldots)$. Therefore, when training a model using a dataset composed of data with varying levels of difficulty, we explore an LS approach that aims to train the model's weight $w$ to be as close as possible to $w^\ast$. Let $\mathcal{X}_{difficulty}$ denote the set of datasets that can be generated by sampling from Equation 10, and let the training dataset be $X^{train}_{difficulty} \subset \mathcal{X}_{difficulty}$. Then, the optimal LS factor for each data point that trains the model's weight $w$ to be close to $w^\ast$ during adversarial training with $X^{train}_{difficulty}$ follows the theoretical principle outlined below.
Theorem 9 Let the robust feature coefficients of each data point in the dataset $X^{train}_{difficulty}$, consisting of $k$ data points $x_1, x_2, \ldots, x_k$, be denoted as $\alpha_1, \alpha_2, \ldots, \alpha_k$. Then, when performing adversarial training using $X^{train}_{difficulty}$, the optimal LS factors $\lambda_1, \lambda_2, \ldots, \lambda_k$ for each training example are proportional to their respective robust feature coefficients $\alpha_1, \alpha_2, \ldots, \alpha_k$.

The proof of Theorem 9 can be found in Appendix A.4. Theorem 9 demonstrates that applying LS in proportion to the robust feature coefficient of each data point during adversarial training is the optimal learning approach for achieving robustness. In other words, this suggests that the optimal LS approach for adversarial training is to decrease the LS factor as the difficulty of the data increases, making the intensity of smoothing regularization proportional to the difficulty. Therefore, based on Theorem 9, we propose a regularization approach that adjusts the LS factor for each example during training. We modify the LS method to ensure that the training of hard examples is regularized while the training of other examples is less affected. We use an LS factor that is proportional to the difficulty of each example and thereby control the intensity of regularization, which we term difficulty proportional label smoothing (DPLS). The difficulty is calculated by using the average loss of the training models along the training trajectory. We train the model and accumulate the difficulty until the difficulty calculation epoch $T$, and the training then progresses with the application of DPLS using the calculated difficulty. We set the LS factor of the easiest example to $\lambda = 1$ (no smoothing) and the factor of the most difficult example to $\lambda \in [\frac{1}{C}, 1)$, where $C$ represents the number of classes. The procedure of DPLS is described in Algorithms 1 and 2. Our method maintains the training effect of each training example, but it avoids the negative effect of hard examples. We can utilize several loss types for difficulty measurement, including cross-entropy loss, the norm of the error vector (EL2N score) (Paul et al., 2021), the consistency score (C-score) (Jiang et al., 2021), and the 0-1 loss. Applying DPLS via any error measure is an effective strategy for mitigating the negative effect of hard examples (refer to Table 5). However, in our experiments, as the 0-1 loss exhibits a higher performance than the other loss types, we mainly use the 0-1 loss as the difficulty measurement loss. Notably, our method calculates the difficulty using clean examples because adversarial examples vary across training algorithms and training epochs.
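Before the pseudocode in Algorithms 1 and 2, here is a minimal Python sketch of the per-example DPLS targets: the min-max normalized difficulty maps the easiest example to a factor of 1 (no smoothing) and the hardest to the chosen minimum factor. The value of `lam_min` and the index-based interface are illustrative assumptions; this is a sketch of the procedure of Algorithm 2, not the reference implementation.

```python
import torch

def dpls_targets(y, idx, difficulty, lam_min=0.7, num_classes=10):
    """Difficulty proportional label smoothing targets for one batch.

    `difficulty` holds the accumulated per-example difficulty for the whole
    dataset (e.g. 0-1 loss along the trajectory); `idx` are the dataset indices
    of the batch; `lam_min` is the factor assigned to the hardest example.
    """
    s = (difficulty - difficulty.min()) / (difficulty.max() - difficulty.min() + 1e-12)
    lam = 1.0 - (1.0 - lam_min) * s[idx]                       # easiest -> 1, hardest -> lam_min
    target = torch.zeros(y.size(0), num_classes)
    target += ((1.0 - lam) / (num_classes - 1)).unsqueeze(1)   # off-class mass
    target[torch.arange(y.size(0)), y] = lam                   # true-class mass
    return target  # train with soft-target cross-entropy on these targets
```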
Algorithm 1: DPLS adversarial training
Require: dataset D, model parameters θ, batch size n, difficulty calculation epoch T, total training epochs K, learning rate α, difficulty loss storage S
 1: for t = 1 to K do
 2:   for each batch in D do
 3:     sample a batch and its indices: {x_i, y_i}_{i=1}^n, I ⊂ D
 4:     compute or apply DPLS:
 5:     if t < T then
 6:       S[I] ← S[I] + {L_difficulty(x_i, y_i)}_{i=1}^n
 7:     else
 8:       {ỹ_i}_{i=1}^n ← DPLS(S[I])
 9:     end if
10:     adversarial training:
11:     {δ_i}_{i=1}^n ← argmax_{δ_i} L_adv(x_i + δ_i, ỹ_i)
12:     L ← (1/n) Σ_{i=1}^n L_adv(x_i + δ_i, ỹ_i)
13:     θ ← θ − α ∇_θ L
14:   end for
15: end for
16: Output: robust classifier f

Algorithm 2: Calculation of DPLS
Require: dataset D, dataset size N, number of classes C, loss storage S, DPLS factor λ
 1: Initialization:
 2: apply min-max normalization:
 3: S ← (S − min(S)) / (max(S) − min(S))
 4: for i = 1 to N do
 5:   (x_i, y_i) ← D[i]
 6:   apply LS using S:
 7:   ỹ_i ← LS(y_i, 1 − (1 − λ)·S[i], C)
 8:   D[i] ← (x_i, ỹ_i)
 9: end for
10: function LS(y, λ, C)
11:   for j = 1 to C do
12:     if y[j] = 1 then
13:       y[j] ← λ
14:     else
15:       y[j] ← (1 − λ)/(C − 1)
16:     end if
17:   end for
18:   return y

5. Experiment

Implementation details. We conducted experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and STL-10 (Coates et al., 2011) by employing PGD (Madry et al., 2017) and TRADES (Zhang et al., 2019) as the baseline adversarial training algorithms. We utilized WideResNet-28-10 (Zagoruyko and Komodakis, 2016) as the architecture of our models. We set the LS factor of the most difficult example in DPLS considering the average difficulty score of each dataset and algorithm. We selected the difficulty calculation epoch as T = 90 for stability, which is 10 epochs before the first learning rate decay. To evaluate the robustness of the models, we employed the PGD and adaptive auto attack (A3) (Liu et al., 2022b). Further details are presented in Appendix B.1.

Improvement in robustness. We compared four mitigation methods: the baseline (-), pruning hard examples (Pruning), label smoothing (LS), and our proposed method (DPLS). We pruned the top 10% of examples with high difficulty for the pruning models. For the LS models, we applied the same LS factor as the average factor of DPLS. Thus, with the same LS budget, our method assigns a large portion of the budget to the hard examples, whereas LS assigns it uniformly to every example. We trained the baseline model until epoch 90 and continued this training with the application of each method. Table 3 shows that pruning hard examples and DPLS are effective for mitigating the effect of hard examples and improving robust generalization.

Table 3: Performance of the models for the mitigation methods. The best results are indicated in bold.

  Dataset  Method     Std.    Robust (PGD)  Robust (A3)   Method     Std.    Robust (PGD)  Robust (A3)
           PGD        87.19   56.44         51.80         TRADES     85.66   58.46         54.08
           +Pruning   86.44   56.18         52.56         +Pruning   84.07   58.04         54.64
           +LS        86.96   57.20         51.99         +LS        86.32   58.72         54.02
           +DPLS      87.13   58.43         54.03         +DPLS      85.36   59.33         55.19

           PGD        62.88   31.82         27.36         TRADES     60.78   33.19         28.05
           +Pruning   62.28   31.64         28.05         +Pruning   60.61   33.53         29.01
           +LS        63.02   32.26         27.67         +LS        62.43   33.29         27.83
           +DPLS      62.95   35.10         29.33         +DPLS      63.32   34.53         29.68

           PGD        92.96   52.88         35.20         TRADES     91.84   59.14         47.46
           +Pruning   92.68   55.52         40.87         +Pruning   91.54   60.27         47.79
           +LS        92.65   66.20         38.44         +LS        92.27   59.65         46.37
           +DPLS      92.94   76.53         42.01         +DPLS      92.33   59.40         49.05

           PGD        66.95   36.49         33.12         TRADES     66.40   36.27         33.16
           +Pruning   66.75   35.27         33.15         +Pruning   65.39   37.39         34.91
           +LS        66.90   35.66         32.05         +LS        66.49   37.89         33.99
           +DPLS      66.16   38.27         34.75         +DPLS      65.63   39.94         36.06

           PGD        91.78   72.60         71.96         TRADES     90.36   74.37         73.50
           +Pruning   91.24   73.03         72.64         +Pruning   89.12   73.76         73.32
           +LS        92.21   72.39         71.55         +LS        90.92   74.78         73.67
           +DPLS      92.07   74.90         74.01         +DPLS      90.43   75.38         74.47

           PGD        94.17   83.20         83.04         TRADES     93.32   84.42         84.29
           +Pruning   92.95   82.73         82.70         +Pruning   92.08   83.84         83.69
           +LS        94.56   83.46         83.13         +LS        93.14   84.33         83.96
           +DPLS      94.06   84.87         84.51         +DPLS      92.94   85.18         84.79

The performance improvement of the pruning models demonstrates that the negative effect of hard examples appears in various datasets and algorithms. The results of the LS models show no improvements in the robustness performance. Contrarily, DPLS increases the robustness of the models across different training algorithms and datasets for all adversarial budgets $\epsilon$. The difference in robustness performance between LS and DPLS indicates that regularization which does not consider the effect of each example on training can hardly improve robust generalization.

Fig. 9 illustrates the training and test robust accuracy (left) of the PGD and PGD with DPLS (denoted as DPLS) models, and the training accuracy (right) of the top N% hard examples of the 50,000 training examples of CIFAR-10 (e.g., 0-20% denotes the top 10k hard examples). Observably, comparing DPLS with PGD, the decrease in the training accuracy of the DPLS model coincides with the increase in the test accuracy. DPLS (right) prevents overfitting of hard examples and causes the fast convergence of comparatively easy examples. This result indicates that our method prevents the memorization of hard examples and mitigates their negative effect, thereby decreasing the gap between training and test accuracy and improving the generalization performance.

Figure 9: The training and test robust accuracy curves (left) of the PGD and the PGD with DPLS models on CIFAR-10, and the training accuracy (right) of the hard example subsets (0-20%, 20-40%, 40-60%, 60-80%, 80-100%).

Table 4: Performance comparison with other robust overfitting mitigation methods.

  Method       Std.    Robust (PGD)   Robust (A3)
  PGD          87.19   56.44          51.8
  +SAT         86.94   57.52          52.94
  +KD+SWA      88.28   57.93          53.49
  +TE          85.83   58.31          52.84
  +DPLS        86.17   58.14          53.98
  +DPLS+SWA    87.14   60.07          55.91

Table 5: Robust accuracy of DPLS according to various difficulty measurements.

  Method     PGD     TRADES
  -          51.8    54.08
  C-score    52.9    54.71
  CE loss    52.73   54.45
  EL2N       53.38   54.32
  0-1 loss   54.03   55.19

Table 6: Performance improvements in other adversarial training algorithms.

  Method    Std.    Robust (PGD)   Robust (A3)
  AWP       84.82   60.19          55.54
  +DPLS     84.4    60.26          56.09
  MART      83.9    59.05          52.19
  +DPLS     83.81   59.54          53.18
  RST       84.82   61.53          57.26
  +DPLS     84.43   62.03          57.81

Comparison with other mitigation methods. We compared DPLS with previous robust overfitting mitigation methods: self-adaptive training (SAT) (Huang et al., 2020), knowledge distillation with stochastic weight averaging (KD+SWA) (Chen et al., 2020), and temporal ensemble (TE) (Dong et al., 2021). We trained the PGD model combined with each method for a total of 200 training epochs. The details of the settings for each method are described in Appendix B.5. The results listed in Table 4 indicate that all the methods are effective at mitigating robust overfitting.
Comparison with other mitigation methods  We compared DPLS with previous robust overfitting mitigation methods: self-adaptive training (SAT) (Huang et al., 2020), knowledge distillation with stochastic weight averaging (KD+SWA) (Chen et al., 2020), and temporal ensemble (TE) (Dong et al., 2021). We trained the PGD model combined with each method for a total of 200 training epochs. The settings of each method are described in Appendix B.5. The results listed in Table 4 indicate that all the methods are effective at mitigating robust overfitting. Among them, DPLS exhibits the highest improvement in test robust accuracy against A3. Additionally, because our method is orthogonal to SWA (Izmailov et al., 2018), we also trained a model combining DPLS and SWA, and further improvement is observed with this combination. These results indicate that our method successfully mitigates the memorization of hard examples and prevents overfitting.

Applying DPLS to other training algorithms  We applied our method to other adversarial training algorithms to verify that DPLS consistently increases robust generalization independent of the algorithm. We combined DPLS with three algorithms: adversarial weight perturbation (AWP) (Wu et al., 2020), misclassification-aware adversarial training (MART) (Wang et al., 2019), and robust self-training (RST) (Carmon et al., 2019). We trained each algorithm combined with TRADES+DPLS and evaluated the trained models. Table 6 shows that all training algorithms combined with DPLS yield improvements in robustness. For the RST models, we calculated the difficulty of the unlabeled dataset as the ratio of the number of training iterations to the number of correct predictions. The results show that DPLS successfully mitigates the negative effect of hard examples in the unlabeled dataset and improves robust accuracy against both the A3 and PGD attacks.

Various difficulty losses for DPLS  We compared the types of loss used to measure difficulty for DPLS. Table 5 shows the performance of TRADES models that apply DPLS with several loss functions: C-score, cross-entropy loss (CE loss), EL2N score, and 0-1 loss. All models to which DPLS was applied performed better than the baseline models. Difficulty calculated from another model (e.g., the C-score from a standard model) was also effective for mitigation. Every difficulty loss was effective, and DPLS with the 0-1 loss offered the best performance. The results indicate that the 0-1 loss provides comparatively objective values because it is less affected by the scale or variance of the loss at each checkpoint of the model. Because the difficulty calculation can be detached from training, further improvement is achievable with a better difficulty measurement loss. Further experiments and ablation studies on DPLS can be found in Appendix B.5 and Appendix D.

6. Conclusion

In this study, we analyzed the effect of hard examples in adversarial training. We demonstrated that training on hard examples can deteriorate performance in adversarial training and verified that the cause is the memorization of hard examples. We accordingly analyzed this memorization phenomenon and observed that increasing the number of hard examples during adversarial training decreases the model's robust accuracy. While we found that pruning hard examples from the training process enhances the model's robustness, the difficulty lies in identifying the optimal threshold for removing the hard examples that weaken robustness. Thus, we proposed a method, DPLS, which adaptively mitigates the memorization of hard examples. Through experiments on various datasets and algorithms, we verified that our method can successfully leverage hard examples, thereby improving robustness.
Acknowledgments This work was supported by the National Research Foundation of Korea (NRF) grants funded by the Korea government (Ministry of Science and ICT, MSIT) (2022R1A3B1077720 and 2022R1A5A708390811), by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (NO.RS2021-II211343 for the Artificial Intelligence Graduate School Program at Seoul National University and RS-2024-004398199 for AI-Based Automated Vulnerability Detection and Safe Code Generation), by the Korea Evaluation Institute of Industrial Technology (KEIT) grants funded by the Korea government (MOTIE), and by the BK21 FOUR program of the Lee, Lee, Bae, and Yoon Education and Research Program for Future ICT Pioneers, Seoul National University in 2025. Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International conference on machine learning, pages 274 283. PMLR, 2018. Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, John C Duchi, and Percy S Liang. Unlabeled data improves adversarial robustness. In Advances in Neural Information Processing Systems, pages 11190 11201, 2019. Jinghui Chen and Quanquan Gu. Rays: A ray searching method for hard-label adversarial attack. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1739 1747, 2020. Tianlong Chen, Zhenyu Zhang, Sijia Liu, Shiyu Chang, and Zhangyang Wang. Robust overfitting may be mitigated by properly learned smoothening. In International Conference on Learning Representations, 2020. Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215 223. JMLR Workshop and Conference Proceedings, 2011. Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. ar Xiv preprint ar Xiv:2003.01690, 2020. Francesco Croce, Maksym Andriushchenko, Vikash Sehwag, Edoardo Debenedetti, Nicolas Flammarion, Mung Chiang, Prateek Mittal, and Matthias Hein. Robustbench: a standardized adversarial robustness benchmark. ar Xiv preprint ar Xiv:2010.09670, 2020. Terrance De Vries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. ar Xiv preprint ar Xiv:1708.04552, 2017. Chengyu Dong, Liyuan Liu, and Jingbo Shang. Label noise in adversarial training: A novel perspective to study robust overfitting. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=9_O9m TLYJQp. Yinpeng Dong, Ke Xu, Xiao Yang, Tianyu Pang, Zhijie Deng, Hang Su, and Jun Zhu. Exploring memorization in adversarial training. ar Xiv preprint ar Xiv:2106.01606, 2021. Vitaly Feldman. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 954 959, 2020. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014. Regularizing Hard Examples Improves Adversarial Robustness Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. 
Advances in Neural Information Processing Systems, 34, 2021a. Sven Gowal, Sylvestre-Alvise Rebuffi, Olivia Wiles, Florian Stimberg, Dan Andrei Calian, and Timothy A Mann. Improving robustness using generated data. Advances in Neural Information Processing Systems, 34, 2021b. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770 778, 2016. Geoffrey Hinton, Oriol Vinyals, JeffDean, et al. Distilling the knowledge in a neural network. ar Xiv preprint ar Xiv:1503.02531, 2(7), 2015. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840 6851, 2020. Lang Huang, Chao Zhang, and Hongyang Zhang. Self-adaptive training: beyond empirical risk minimization. Advances in neural information processing systems, 33:19365 19376, 2020. Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. ar Xiv preprint ar Xiv:1905.02175, 2019. Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. Averaging weights leads to wider optima and better generalization. ar Xiv preprint ar Xiv:1803.05407, 2018. Linxi Jiang, Xingjun Ma, Zejia Weng, James Bailey, and Yu-Gang Jiang. Imbalanced gradients: A new cause of overestimated adversarial robustness. 2020. Ziheng Jiang, Chiyuan Zhang, Kunal Talwar, and Michael C Mozer. Characterizing structural regularities of labeled data in overparameterized models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5034 5044. PMLR, 18 24 Jul 2021. URL https://proceedings.mlr.press/v139/jiang21k.html. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Saehyung Lee, Changhwa Park, Hyungyu Lee, Jihun Yi, Jonghyun Lee, and Sungroh Yoon. Removing undesirable feature contributions using out-of-distribution data. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum? id=e IHYL6fpbk A. Saehyung Lee, Sanghyuk Chun, Sangwon Jung, Sangdoo Yun, and Sungroh Yoon. Dataset condensation with contrastive signals. ar Xiv preprint ar Xiv:2202.02916, 2022. Lee, Lee, Bae, and Yoon Chen Liu, Zhichao Huang, Mathieu Salzmann, Tong Zhang, and Sabine Süsstrunk. On the impact of hard adversarial instances on overfitting in adversarial training. ar Xiv preprint ar Xiv:2112.07324, 2021. Chen Liu, Zhichao Huang, Mathieu Salzmann, Tong Zhang, and Sabine Süsstrunk. On the impact of hard adversarial instances on overfitting in adversarial training, 2022a. URL https://openreview.net/forum?id=hb GV3vz MPz G. Ye Liu, Yaya Cheng, Lianli Gao, Xianglong Liu, Qilong Zhang, and Jingkuan Song. Practical evaluation of adversarial robustness via adaptive auto attack. ar Xiv preprint ar Xiv:2203.05154, 2022b. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. ar Xiv preprint ar Xiv:1706.06083, 2017. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011. 
Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Animashree Anandkumar. Diffusion models for adversarial purification. In International Conference on Machine Learning, pages 16805 16827. PMLR, 2022. Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training. In International Conference on Learning Representations, 2021. URL https: //openreview.net/forum?id=Xb8xvrt B8Ce. Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34, 2021. Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei Calian, Florian Stimberg, Olivia Wiles, and Timothy A Mann. Data augmentation can improve robustness. Advances in Neural Information Processing Systems, 34:29935 29948, 2021. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014 5026, 2018. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818 2826, 2016. Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958 1970, 2008. Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. ar Xiv preprint ar Xiv:1805.12152, 2018. Regularizing Hard Examples Improves Adversarial Robustness Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations, 2019. Eric Wong, Leslie Rice, and Zico Kolter. Overfitting in adversarially robust deep learning. In Proceedings of Machine Learning and Systems 2020, pages 5304 5315. 2020. Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems, 33:2958 2969, 2020. Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8817 8826, 2018. Han Xu, Xiaorui Liu, Yaxin Li, Anil Jain, and Jiliang Tang. To be robust or to be fair: Towards fairness in adversarial training. In International Conference on Machine Learning, pages 11492 11501. PMLR, 2021. Chaojian Yu, Bo Han, Li Shen, Jun Yu, Chen Gong, Mingming Gong, and Tongliang Liu. Understanding robust overfitting of adversarial training and beyond. In International Conference on Machine Learning, pages 25595 25610. PMLR, 2022. Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pages 6023 6032, 2019. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. ar Xiv preprint ar Xiv:1605.07146, 2016. Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. 
Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 7472-7482, Long Beach, California, USA, 09-15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/zhang19p.html. https://github.com/yaodongyu/TRADES.

Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

Appendix A. Proofs

A.1 Section 3.1 proofs

Definition 1 (Gradient difference) The differences between the loss gradient norms at x_1 and x_2 in standard and adversarial training are defined as follows:

D_{std}(x_1, x_2) = \|(1 - \sigma(w^\top x_1 + b))\, x_1\| - \|(1 - \sigma(w^\top x_2 + b))\, x_2\|,
D_{adv}(x_1, x_2, \epsilon) = \|(1 - \sigma(w^\top (x_1 + \delta) + b))(x_1 + \delta)\| - \|(1 - \sigma(w^\top (x_2 + \delta) + b))(x_2 + \delta)\|.   (11)

The gradient of the loss with respect to w is

\nabla_w L(f(x), y) = \nabla_w \big[ -y \log \sigma(w^\top x + b) - (1 - y)\log(1 - \sigma(w^\top x + b)) \big]
                    = -\big( y(1 - \sigma(w^\top x + b)) - (1 - y)\,\sigma(w^\top x + b) \big)\, x.

Here, y = +1 for x_1 and x_2, so

\nabla_w L(f(x), y = +1) = -(1 - \sigma(w^\top x + b))\, x.   (12)

The gradient norm difference between x_1 and x_2 in standard training is therefore

\|\nabla_w L(f(x_1), y_1)\| - \|\nabla_w L(f(x_2), y_2)\| = \|(1 - \sigma(w^\top x_1 + b))\, x_1\| - \|(1 - \sigma(w^\top x_2 + b))\, x_2\|,   (13)

and the gradient norm difference between the corresponding adversarial examples is

\|\nabla_w L(f(x_1 + \delta), y_1)\| - \|\nabla_w L(f(x_2 + \delta), y_2)\| = \|(1 - \sigma(w^\top (x_1 + \delta) + b))(x_1 + \delta)\| - \|(1 - \sigma(w^\top (x_2 + \delta) + b))(x_2 + \delta)\|.   (14)

Theorem 1 Suppose that the gradient norm of easy examples is smaller than the gradient norm of hard examples in both standard and adversarial training. Then, if w^\top x_1 + b \ge 0 and 0 < \epsilon < (d_1 + d_2)/2, where d_i = w^\top x_i + b, it holds that

D_{std}(x_1, x_2) < D_{adv}(x_1, x_2, \epsilon).   (15)

Proof Because the gradient norm of easy examples is smaller than the gradient norm of hard examples in both standard and adversarial training, D_{std}(x_1, x_2) > 0 and D_{adv}(x_1, x_2, \epsilon) > 0. Then, we can rewrite Equation 15 as

\frac{D_{std}(x_1, x_2)}{D_{adv}(x_1, x_2, \epsilon)} < 1.   (16)

Suppose the prediction for x_1 is not incorrect, that is, w^\top x_1 + b \ge 0. We further assume that the norm of every input x defined in the input space X is almost the same, and that the worst-case perturbation shifts each margin by \epsilon, i.e., w^\top \delta = -\epsilon. Then the left-hand side of Equation 16 can be written as

\frac{D_{std}}{D_{adv}}
= \frac{\|(1-\sigma(w^\top x_1+b))\,x_1\| - \|(1-\sigma(w^\top x_2+b))\,x_2\|}{\|(1-\sigma(w^\top (x_1+\delta)+b))(x_1+\delta)\| - \|(1-\sigma(w^\top (x_2+\delta)+b))(x_2+\delta)\|}   (17)
= \frac{(1-\sigma(w^\top x_1+b))\,\|x_1\| - (1-\sigma(w^\top x_2+b))\,\|x_2\|}{(1-\sigma(w^\top (x_1+\delta)+b))\,\|x_1+\delta\| - (1-\sigma(w^\top (x_2+\delta)+b))\,\|x_2+\delta\|}   (18)
= \frac{(1-\sigma(w^\top x_1+b)) - (1-\sigma(w^\top x_2+b))}{(1-\sigma(w^\top (x_1+\delta)+b)) - (1-\sigma(w^\top (x_2+\delta)+b))}   (19)
= \frac{\sigma(-w^\top x_1 - b) - \sigma(-w^\top x_2 - b)}{\sigma(-w^\top (x_1+\delta) - b) - \sigma(-w^\top (x_2+\delta) - b)}   (20)
= \frac{\sigma(-d_1) - \sigma(-d_2)}{\sigma(-(d_1-\epsilon)) - \sigma(-(d_2-\epsilon))}   (21)
= \frac{\sigma(-d_1) - \sigma(-(d_1 + (d_2-d_1)))}{\sigma(-(d_1-\epsilon)) - \sigma(-(d_1-\epsilon + (d_2-d_1)))}.   (22)

Equation 16 is therefore satisfied if the function h(x) = \sigma(-x) - \sigma(-(x + (d_2 - d_1))) is monotonically decreasing for x \ge d_1 - \epsilon, where \epsilon > 0. Let a = d_2 - d_1 > 0. Then

\frac{d}{dx}\big[\sigma(-x) - \sigma(-(x+a))\big] = \frac{d}{dx}\left[\frac{1}{1+e^x} - \frac{1}{1+e^{x+a}}\right]   (23)
= \frac{e^{x+a}}{(1+e^{x+a})^2} - \frac{e^x}{(1+e^x)^2}   (24)
= \frac{e^x (e^a - 1)(1 - e^{2x+a})}{(e^x+1)^2 (e^{x+a}+1)^2}.   (25)

Since a > 0, the sign of Equation 25 is determined by the factor 1 - e^{2x+a}, which is negative whenever 2x + a > 0.   (26)

Thus, the function h(x) = \sigma(-x) - \sigma(-(x + (d_2 - d_1))) is monotonically decreasing for x > -(d_2 - d_1)/2, and the minimum value of x considered here is d_1 - \epsilon. Therefore, if d_1 - \epsilon > -(d_2 - d_1)/2, which can be rewritten as \epsilon < (d_1 + d_2)/2, it satisfies D_{std}(x_1, x_2) < D_{adv}(x_1, x_2, \epsilon).
(27) Corollary 1 The increase in gradient difference by an adversarial attack is proportional to the difficulty of the example and the size of the upper bound of adversarial perturbations. Lee, Lee, Bae, and Yoon Thus, if w x3 = d3 and d2 < d3, it satisfies Dadv(x1, x2, ϵ) < Dadv(x1, x3, ϵ), (28) and if ϵ1 < ϵ2, it satisfies Dadv(x1, x2, ϵ1) < Dadv(x1, x2, ϵ2). (29) Proof In the equation 23 of the above proof, it satisfies Equation 28 if the function h(x) = σ( (d1 ϵ)) σ( (d1 ϵ + (x d1))) is monotonically increasing for input x > d1. Then, d dx σ( (d1 ϵ)) σ( (d1 ϵ + (x d1))) = d 1 1 + ex d1 (1 + ex d1)2 In the same way, it satisfies Equation 29 if the function h(x) = σ( (d1 ϵ)) σ( (d1 ϵ + (d2 d1))) is monotonically increasing for input ϵ 0. Let x denote ϵ d1 and a denote d2 d1. Then, if ϵ d1 < d2 d1 2 , which can be rewritten as ϵ < d1+d2 2 , it satisfies d dx σ(x) σ(x a)) = d 1 1 + e x 1 1 + e x+a = ex a (ea 1)(1 e2x a) (ex + 1)2(ex a + 1)2 A.2 Section 3.2 proofs Next, we proceed with the proofs of Theorems 6 and 7. Before the proof, we will examine the derivation process of the adversarial training model weight wadv in Equation 8, which represents the weights of models trained with m normal examples and n hard examples. Following Tsipras et al. (2018), we utilize a soft-margin SVM classifier. The optimization problem for the model is as follows. min w E max(0, 1 yw x) subject to w = 1 (34) Given that x 1 and βη < ϵ < η, the weight learning process for the adversarial training model is as follows. E max(0, 1 yw (x + δ)) = E 1 yw (x + δ) (35) = E 1 w0N(η ϵ, 1) i=1 wi N(sign(wi) (βη ϵ), 1) wd+1N(η ϵ, 1) (36) = 1 m m + nw0(η ϵ) i=1 wisign(wi) (βη ϵ) n m + nwd+1(η ϵ) (37) Regularizing Hard Examples Improves Adversarial Robustness According to the above equation, in adversarial training, the optimal solution is to assign a value of 0 to the weights corresponding to the non-robust features, which is consistent with what was demonstrated by Tsipras et al. (2018). Therefore, when wi 1,...,d = 0, Equation 34 can be simplified as follows. min w m m + nw0(η ϵ) n m + nwd+1(η ϵ) subject to w2 0 + w2 d+1 = 1 (38) Using the Lagrange multiplier method, we define the above equation, and the partial derivatives with respect to each term are as follows. L(w0, wd+1, µ) = m m + nw0(η ϵ) n m + nwd+1(η ϵ) + µ(w2 0 + w2 d+1 1) (39) L w0 = m m + n(η ϵ) + 2µw0 = 0 (40) L wd+1 = n m + n(η ϵ) + 2µwd+1 = 0 (41) L µ = w2 0 + w2 d+1 1 = 0 (42) Given that m m+n(η ϵ) = a0 and n m+n(η ϵ) = a1, we can rewrite the equations as a0 = 2µw0 and a1 = 2µwd+1. Dividing the two equations gives a0 a1 = w0 wd+1 , so a0 w0. Substituting this into w2 0 + w2 d+1 1 = 0 gives w2 0 = a2 0 a2 0+a2 1 and w2 d+1 = a2 1 a2 0+a2 1 . Therefore, to obtain the minimum values, w0 = a0 a2 0+a2 1 and wd+1 = a1 a2 0+a2 1 . Dividing a0 and a1 by (η ϵ) gives w0 = 1 Zadv m m+n and wd+1 = 1 Zadv n m+n. Next, we proceed with the proof of Theorem 2. Theorem 2 For a model fadv adversarially trained on the combined dataset X train normal X train hard , where X train normal consists of m normal examples and X train hard consists of n hard examples, both the standard accuracy Pr fadv(x) = y and the robust accuracy Pr fadv(x + δ) = y on the normal example test dataset X test normal are monotonically decreasing with respect to n. Proof First, the standard accuracy Pr fadv(x) = y of fadv on X test normal is expressed as follows. Here, the bias b is assumed to be b = 0, and it is also assumed that the proportions of the two classes in the test data are equal. 
Lee, Lee, Bae, and Yoon Pr fadv(x) = y = Pr y(w x) + b > 0 (43) = Pr 1 Zadv m m + n N(η, 1) + y b > 0 (44) m2 + n2 N(η, 1) + y b > 0 (45) = Pr N(0, 1) > y b 2Pr N(0, 1) > b 2Pr N(0, 1) > b Let the CDF Φ( ) represent the cumulative distribution function, and the PDF φ( ) represent the probability density function. Additionally, for readability, we substitute g(n) = b Pr fadv(x) = y Φ( ) is CDF = 1 2 2 Φ( g(n) η) Φ(g(n) η) (48) 1 + Φ(g(n) + η) Φ(g(n) η) (49) φ( ) is PDF derivative w.r.t. n 1 2 φ(g(n) + η) g (n) φ(g(n) η) g (n) (50) exp (g(n) + η)2 exp (g(n) η)2 When b < 0, g (n) 0 and g(n) + η 2 > g(n) η 2, so the above result is negative. Similarly, when b > 0, g (n) 0 and g(n) + η 2 < g(n) η 2, so the above result is also negative. Consequently, with respect to the size n of the hard example training dataset, the standard accuracy Pr fadv(x) = y of fadv on the normal example test dataset X test normal is monotonically decreasing as n increases. For robust accuracy Pr fadv(x + δ) = y , since η ϵ > 0, replacing η with η ϵ in the above proof regarding standard accuracy does not change the outcome of the proof. Therefore, similar to standard accuracy, the robust accuracy Pr fadv(x + δ) = y is also monotonically decreasing with respect to n. Theorem 3 For a model fadv adversarially trained on the combined dataset X train normal X train hard , where X train normal consists of m normal examples and X train hard consists of n hard examples, the standard accuracy Pr fadv(x) = y on the hard example test dataset X test hard is monotonically increasing with respect to n, but the robust accuracy Pr fadv(x + δ) = y is monotonically decreasing with respect to n. Regularizing Hard Examples Improves Adversarial Robustness First, the standard accuracy Pr fadv(x) = y of fadv on X test hard is expressed as follows. As before, it is assumed that the bias b is b = 0, and that the two classes in the test data are equally represented. Pr fadv(x) = y = Pr y(w x) + b > 0 (52) = Pr 1 Zadv n m + n N(βη, 1) + y b > 0 (53) m2 + n2 N(βη, 1) + y b > 0 (54) = Pr N(0, 1) > y b 2Pr N(0, 1) > b 2Pr N(0, 1) > b As with the above, let the CDF Φ( ) represent the cumulative distribution function, and the PDF φ( ) represent the probability density function. Additionally, we substitute g(n) = b Pr fadv(x) = y Φ( ) is CDF = 1 2 2 Φ( g(n) βη) Φ(g(n) βη) (57) 1 + Φ(g(n) + βη) Φ(g(n) βη) (58) φ( ) is PDF derivative w.r.t. n 1 2 φ(g(n) + βη) g (n) φ(g(n) βη) g (n) (59) exp (g(n) + βη)2 exp (g(n) βη)2 The CDF Φ( ) represents the cumulative distribution function, and the PDF φ( ) represents the probability density function. For readability, we substitute g(n) = b n . Since the sign of g (n) in this result is opposite to that in the proof of Theorem 2, the outcome is also opposite. When b < 0, g (n) 0 and g(n) + η 2 > g(n) η 2, so the result is positive. Similarly, when b > 0, g (n) 0 and g(n) + η 2 < g(n) η 2, so the result is also positive. Consequently, with respect to the size n of the hard example training dataset, the standard accuracy Pr fadv(x) = y of fadv on the hard example test dataset X test hard is monotonically increasing with n. For robust accuracy Pr fadv(x + δ) = y , since η ϵ < 0, replacing η with η ϵ in the proof for standard accuracy results in g(n) + η 2 and g(n) η 2 being replaced by g(n) + η ϵ 2 and g(n) η + ϵ 2, respectively, which reverses the order of the two terms. 
Consequently, the result is the opposite of that for standard accuracy, and the robust accuracy Pr fadv(x + δ) = y is monotonically decreasing with n. Lee, Lee, Bae, and Yoon A.3 Section 3.3 proofs Next, by comparing the relative increases and decreases in standard accuracy in Theorem 2 and Theorem 3, we can derive the following corollary. Corollary 2 For a model fadv adversarially trained on the combined dataset X train normal X train hard , where X train normal consists of m normal examples and X train hard consists of n hard examples, the standard accuracy Pr fadv(x) = y on the overall test dataset X test normal X est hard is monotonically increasing with respect to n, if n3 βm3. To compare the expressions for the increase and decrease in standard accuracy derived from the proofs of the two theorems above, we approximate each value using a first-order Taylor expansion. The results are as follows. exp (g(n) + η)2 exp (g(n) η)2 2π( 2g(n)η), (normal) exp (g(n) + βη)2 exp (g(n) βη)2 2π( 2g(n)βη), (hard) Here, we examine the sign of the sum of the two terms above. It is assumed that the ratio of normal examples to hard examples in the test phase is m : n. Let g(n) for hard examples be denoted as ghard(n), which can be expressed as follows. 2π( 2g(n)η) + n g hard(n) 2π ( 2ghard(n)βη) = 1 2 m + 2m2b2βη 2πmn2 n3 + βm3 (64) Therefore, if n3 βm3, increasing the number of hard examples n in the training dataset will lead to an increase in standard accuracy, whereas in the opposite case, it will decrease. Since datasets collected under normal circumstances usually have a higher proportion of normal examples compared to hard examples, it can generally be assumed that the condition n3 βm3 is satisfied. Consequently, increasing the number of hard examples n in the training dataset will typically result in an increase in standard accuracy. Regularizing Hard Examples Improves Adversarial Robustness A.4 Section 4.2 proofs We proceed with the proof of the following theory discussed in Section 4.1. Theorem 4 Let the robust feature coefficients of each data point in the dataset X train difficulty, consisting of k data points x1, x2, . . . , xk, be denoted as α1, α2, . . . , αk. Then, when performing adversarial training using X train difficulty, the optimal LS factors λ1, λ2, . . . , λk for each training example are proportional to their respective robust feature coefficients α1, α2, . . . , αk. First, the weight of the adversarially trained model using the data given in Theorem 4 is derived through the same process as the weight wadv of the above fadv, and is given as follows. w0 = Pk i=1 (2λi 1)(αiη ϵ) wi 1,...,k = 0 wd+1 = Pk i=1 (2λi 1)(γiη ϵ) Here, Z is a normalization term tuned to satisfy w = 1. To ensure that the weight parameter w is close to w = (1, 0, ..., 0), the LS factor is assigned in such a way that w0 increases and wd+1 decreases. Therefore, the optimization equation to be solved can be expressed as follows. min λ wd+1 w0 subject to 0.5 λi 1, i = 1, ..., k (66) Here, λ = (λ0, ..., λk). The optimization objective can be expressed in the form of a vector operation with respect to λ as follows. wd+1 w0 = 2ηλ (α γ) η(α γ) 1 (67) Therefore, the optimization objective is minimized when λ is proportional to (α γ). Here, let c1, c2 > 0 be constants. Since γ is inversely proportional to α, it can be expressed as γ = c1α. Consequently, λ satisfies λ = min(0.5, max(1, c2(1 + c1)α)). Therefore, the optimal LS factor can ultimately be said to be proportional to the robust feature coefficient. 
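Before turning to the experimental details, the inequality of Theorem 1 can be checked numerically under the simplifications used in its proof (inputs of roughly equal norm, and a worst-case perturbation that reduces each margin d_i by ϵ). The margins below are illustrative values, not measured quantities.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_norm_difference(d1, d2, eps=0.0):
        # Difference of loss-gradient norms for two examples with margins d1 < d2,
        # up to the common input-norm factor; eps = 0 recovers the standard case.
        return (1.0 - sigmoid(d1 - eps)) - (1.0 - sigmoid(d2 - eps))

    d1, d2 = 0.5, 2.0            # hard and easy margins (illustrative)
    for eps in [0.0, 0.5, 1.0]:  # all below (d1 + d2) / 2 = 1.25
        print(eps, round(gradient_norm_difference(d1, d2, eps), 4))
    # 0.0 -> 0.2583, 0.5 -> 0.3176, 1.0 -> 0.3535: the adversarial difference
    # exceeds the standard one, as Theorem 1 states.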
Appendix B. Additional implementation and experimental details B.1 Implementation details Datasets CIFAR-10 (Krizhevsky et al., 2009) consists of 50,000 training images and 10,000 test images with 10 classes. CIFAR-100 (Krizhevsky et al., 2009) consists of 50,000 training Lee, Lee, Bae, and Yoon images and 10,000 test images with 100 classes. CIFAR-10 and CIFAR-100 images have sizes of 32 32 pixels. The CIFAR datasets are subsets of the 80 million tiny images dataset (Torralba et al., 2008), and the 80 million tiny images dataset contains images downloaded from seven independent image search engines: Altavista, Ask, Flickr, Cydral, Google, Picsearch, and Webshots. SVHN (Netzer et al., 2011) consists of 73,257 training images and 26,032 test images with 10 classes. SVHN images have sizes of 32 32 pixels. SVHN is obtained from a very large set of images from urban areas in various countries by using Google Street View. STL-10 (Coates et al., 2011) consists of 5,000 training images and 8,000 test images with 10 classes. Additionally, it provides 100,000 unlabeled images to support unsupervised learning. STL-10 images have sizes of 96 96 pixels, which is larger than the typical 32 32 images found in datasets like CIFAR-10 (Krizhevsky et al., 2009). In our experiments, we resized the STL-10 dataset images to 64 64 pixels for model training. Further details can be found in https://paperswithcode.com/datasets. Implementation details We conducted experiments on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), and STL-10 (Coates et al., 2011). We used PGD (Madry et al., 2017) and TRADES (Zhang et al., 2019) as the baseline adversarial training algorithms. We used Wide Res Net28-10 (Zagoruyko and Komodakis, 2016) as the architecture of our models. The learning rates for CIFAR-10, CIFAR-100, and STL-10 are set to 0.1 and 0.01 for SVHN, and the decay at 100 and 105 of the total training epoch 110 with decay factor 0.1 following Pang et al. (2021). We additionally conducted experiments with the learning rate decay at 50% and 75% of the total training epoch 200 with decay factor 0.1 following Madry et al. (2017). We used stochastic gradient descent optimizer with the weight decay factor 5e-4 and the momentum 0.9. The upper bounds of adversarial perturbation were set to 0.031 (ϵ = 8), 0.0155 (ϵ = 4), and 0.00775 (ϵ = 2), and the step-size of training adversarial examples of each model were set to one fourth of the ℓ -bound of each model with 10 steps. To evaluate robustness of the models, we used the 10-step and 20-step PGD attacks, and adaptive auto attack (A3) (Liu et al., 2022b) for reliable evaluation. We used a single RTX 8000 GPU with CUDA11.6 and Cu DNN7.6.5 in our experiments. B.2 Experiment details and futher experiments on subset training Figure 1 For the experiments in Fig. 1, we executed 200 training epochs on the CIFAR-10 dataset in both standard and adversarial training. We constituted datasets for both standard and adversarial (Madry et al., 2017) training by selecting 5k hard and 5k random examples from the training set, where the difficulty of data is measured by accumulating 0-1 loss along the training trajectory of the models trained with standard and adversarial training until epoch 90. Considering that the size of the dataset is small and to prevent overfitting before the learning rate decay in both standard and adversarial training, we used the Res Net-18 (He et al., 2016) architecture. 
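The accumulated 0-1 loss used as the difficulty measure here (and in the other appendix experiments) only requires a per-example counter that is updated once per epoch; a minimal sketch, with illustrative names, is:

    import numpy as np

    class DifficultyTracker:
        """Accumulates the 0-1 loss of each training example along the training
        trajectory; in adversarial training the predictions are those on the
        perturbed inputs."""

        def __init__(self, num_examples):
            self.scores = np.zeros(num_examples)

        def update(self, indices, predictions, targets):
            # Add 1 for every example that is misclassified at this epoch.
            self.scores[indices] += (predictions != targets).astype(np.float64)

Accumulating until the difficulty-calculation epoch (90 here) and min-max normalizing the scores gives the inputs consumed by DPLS in Algorithm 2.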
The ℓ∞-bound and step size of the adversarial examples were set to 0.0155 and 0.0035, respectively, with 5 steps. The adversarial training time on the 5k+5k subset of CIFAR-10 is 2 hours for 200 epochs. The adversarial training time on the 5k subset (hard examples excluded) is 0.5 hours for the 100 epochs after the learning rate decay (epoch 100).

For further experiments in adversarial training, we constituted a new dataset as in Fig. 1, using only the top 10k hard and 10k random examples (20k in total). We trained the PGD model until the learning rate decay and then continued training with three datasets: (1) the entire examples, (2) the 10k random examples (hard pruning), and (3) the top 10k hard examples (hard only). We used 10k instead of 5k to prevent overfitting before pruning. In Fig. 10, the models trained with the datasets that include hard examples show a decrease in test robust accuracy, whereas the model trained with the dataset from which hard examples are pruned shows an increase in test robust accuracy. Thus, pruning hard examples improves robustness, indicating that hard examples degrade performance in adversarial training. The implementation details for Fig. 10 are the same as those of Fig. 3. The adversarial training time on the 10k+10k subset of CIFAR-10 is 20 hours for 200 epochs. The adversarial training time on the 10k subset (hard examples excluded) is 5 hours for the 100 epochs after the learning rate decay (epoch 100).

Figure 10: Test accuracy of PGD models trained with subsets of CIFAR-10.

B.3 Experiment details for comparing the EL2N score

Figure 3  For the experiments in Fig. 3, we executed 200 training epochs on the CIFAR-10 dataset in each training. The other experimental details are summarized in Appendix B.1. The adversarial training time on the entire CIFAR-10 dataset is 47 hours for 200 epochs. The full results of Fig. 3 can be seen in Fig. 11. For the PGD model before the learning rate decay, because hard examples are also difficult to learn from the perspective of standard accuracy under a high learning rate, the result shows a large EL2N score for hard clean examples. After the learning rate decay, when the model starts to fit hard examples, the difficulty difference between the clean and adversarial versions of hard examples starts to increase.

Figure 11: Average error vector norm (EL2N score) of hard and easy examples.

B.4 Experiment details and further experiments on other analyses

Figure 4  We executed 200 training epochs on the CIFAR-10 dataset in each training. The other experimental details are summarized in Appendix B.1. The hard and easy examples are selected according to the difficulty of each model. The training and test accuracy and loss results for the PGD models with ϵ = 4 and ϵ = 2 can be seen in Fig. 12. The robust overfitting phenomenon is observed to be more severe as the adversarial budget of the training setting increases.

Figure 12: (a) presents the training and test loss curves of the PGD model. (b) shows the training and test accuracy of the PGD models with ϵ = 4. (c) shows the training and test accuracy of the PGD models with ϵ = 2.

Figure 5  We executed 200 training epochs on the CIFAR-10 dataset in each training. The other experimental details are summarized in Appendix B.1. We selected the top 5k easy and top 5k hard examples and then trained models on the dataset that prunes the top 5k easy examples and on the dataset that prunes the top 5k hard examples, respectively. The memorization results, including experiments on adversarial training with ϵ = 4, ϵ = 2, and ϵ = 0 (STD), are described in Fig. 13. These results illustrate the necessity of a memorization feature. To explain this, let us consider the possibility that the results in Fig. 5 could be explained solely by the presence of robust and non-robust features, without the need for a memorization feature. First, if we assume that hard examples possess robust features, then it should be possible to predict the adversarial examples of hard examples through training on the remaining 45k data points of CIFAR-10 in Fig. 5, even if these hard examples are not included in the adversarial training. However, since the model trained without the hard examples shows 0% accuracy on them, this assumption contradicts the idea that hard examples contain robust features. Second, let us assume that hard examples are composed solely of non-robust features. In this case, the train accuracy for these examples should be 0%. However, Fig. 5 shows 100% train accuracy, which contradicts this assumption. Therefore, the assumption that the results can be explained solely by the presence of robust and non-robust features, without a memorization feature, is difficult to support. To explain the results in Fig. 5, the existence of a feature that acts as a robust feature during training but as a non-robust feature during testing, in other words a memorization feature, is necessary.

Table 2  For the experiments in Table 2, we executed 110 training epochs on the CIFAR-10 dataset for each model. The other experimental details are summarized in Appendix B.1. Table 7 shows the performance of models trained with the top 10k easy and top 10k hard examples, as in Table 2, while varying the adversarial budget ϵ. The robust accuracy is evaluated using A3 (Liu et al., 2022b).

Figure 13: Memorization of easy and hard examples in each setting. Clean example results for the standard setting and adversarial example results for the adversarial settings.

As the adversarial budget decreases, the standard accuracy
of the models that prunes hard examples increases, which suggests that the number of anti-correlated features which were slightly correlated features decreases. Table 7: Performance of models trained with 10k examples. Method STD PGD (ϵ = 2) PGD (ϵ = 4) PGD (ϵ = 8) Std. Rob. Std. Rob. Std. Rob. Std. Rob. Easy 78.46 - 75.66 62.00 69.65 51.51 61.63 39.56 Hard 77.60 - 45.47 22.54 32.17 9.04 19.92 8.68 Figure 6 We assigned random labels to the hard examples for the corruption model from the beginning of the training, and we pruned the hard examples from the beginning of the training for the pruning model. The robust accuracy is evaluated using A3 (Liu et al., 2022b). Table 8 shows the final accuracy for the models trained with applying random labels to the top 5k hard examples. We compared the results with the performance of the normal models and the models that prunes top 5k hard examples for PGD and TRADES. Observably, assigning random labels to hard examples does not change the robustness performance. Table 8: Performance comparison of the final accuracy for models with label corruption. Std. Robust Robust Std. Robust Robust (PGD) (A3) (PGD) (A3) Normal 86.82 50.25 46.07 84.83 51.43 46.69 Corrupted 82.31 49.32 45.67 82.11 50.64 46.15 Pruned 85.99 52.36 48.81 84.71 53.44 48.78 Distinguishing hard examples In Section 3.2, we assumed properties of hard examples and conducted a theoretical analysis on the impact of learning hard examples on model performance. Based on this theoretical analysis, we can distinguish hard examples using the following metrics: Lee, Lee, Bae, and Yoon 1. Prf A(Dadv)[arg max f(xi) = yi] Prf A(Dadv\{i})[arg max f(xi) = yi] 2. E(x,y) Dstd [Pr (fstd(x; A(Dstd)) = y) Pr (fstd(x; A(Dstd \ {i})) = y)] E(x,y) Dadv [Pr (fadv(x; A(Dadv)) = y) Pr (fadv(x; A(Dadv \ {i})) = y)] 3. E(x,y) Dstd [Pr (fadv(x; A(Dadv)) = y) Pr (fadv(x; A(Dadv \ {i})) = y)] E(x,y) Dadv [Pr (fadv(x; A(Dadv)) = y) Pr (fadv(x; A(Dadv \ {i})) = y)] In all the above metrics, a higher value indicates a stronger representation of the properties presented in Assumption 1. The first metric is the same as the memorization score (Feldman, 2020), which reflects the difference in performance for a data point (xi, yi) when it is included in training versus when it is not. The second metric compares the difference for the data point (xi, yi) between standard training (with and without inclusion in training) and adversarial training (with and without inclusion). The third metric measures the trade-off(xi, yi) by comparing the standard accuracy and adversarial accuracy in adversarial training when the data point is included versus when it is excluded from training. While the metrics above can identify data with high values as hard examples, it is challenging to train every model with and without each data point for comparison. Therefore, as in Section 3.1, we conducted analyses in Section 3.3 using hard examples identified through 0-1 loss to check whether data that is difficult to learn actually exhibits the properties assumed for hard examples in Assumption 1. Figure 7 For the experiments in Fig. 7, We executed 110 training epochs on the CIFAR-10 dataset in each model. The other experimental details are summarized in Appendix B.1. We trained the baseline models until epoch 90, which is the 10 epoch before the learning rate decay for the stability of the training, then we continued by the training using the dataset with the subset pruned as the experiments in Fig. 1. 
We evaluated the performance of the best checkpoint in each model using A3 (Liu et al., 2022b). B.5 Experiment details and ablation studies of DPLS Table 3 We executed 110 training epochs on the CIFAR-10 dataset in adversarial training. The other experimental details are summarized in Appendix B.1. We mainly use 0-1 loss as difficulty measurement loss for DPLS. We set the label smoothing factor of the most difficult example in DPLS considering the average difficulty score of each dataset and algorithm. We tuned the factor such that the average label smoothing factor is between 0.8 and 0.9 in 10-class datasets (CIFAR-10, SVHN, and STL-10) and is between 0.7 and 0.9 in a 100-class dataset (CIFAR-100). We selected difficulty calculation epoch as T = 90, which is 10 epochs before the first learning rate decay. The label smoothing factor hyperparameter λ of the most difficult example for the DPLS models and the label smoothing factor for LS models in Table 3 of the main paper is summarized in Table 9. The PGD model in SVHN with label smoothing methods shows a result similar to gradient masking effect (Athalye et al., 2018). We applied DPLS only to the labels of outer maximization loss to reduce the effect. However, it is considered as the properties of the SVHN dataset with PGD and label smoothing because the experiments on other datasets and algorithms do not show gradient masking. Furthermore, the results of our method still show improvement also against A3 for SVHN by 6.81%p, which could ignore gradient masking and evaluate true robustness. It means that our method increases the true robustness of the model also for SVHN. Regularizing Hard Examples Improves Adversarial Robustness Table 9: The smoothing factor λ and the average of label smoothing factor for each model in Table 3 in the main paper. Dataset Model Method λ (smoothing factor) Avg. smoothing CIFAR-10 PGD LS 0.900 0.900 DPLS 0.500 0.897 TRADES LS 0.900 0.900 DPLS 0.500 0.893 CIFAR-100 PGD LS 0.900 0.900 DPLS 0.700 0.862 TRADES LS 0.700 0.700 DPLS 0.300 0.703 SVHN PGD LS 0.900 0.900 DPLS 0.100 0.940 TRADES LS 0.900 0.900 DPLS 0.100 0.927 PGD LS 0.800 0.800 DPLS 0.500 0.788 TRADES LS 0.800 0.800 DPLS 0.500 0.797 Figure 9 and Table 4 For the experiments of previous robust overfitting mitigation methods, we used the code uploaded in the official Github by the author of each method. The initial learning rate was set to 0.1, and the learning rate decay was applied at the learning rate decay scheduling of each method with total training epochs 200 and with a decay factor of 0.1. The learning rate decay was applied at 50 and 150 epochs for the method of knowledge distillation with stochastic weight averaging (KD+SWA) (Chen et al., 2020) and the learning rate decay was applied at 100 and 150 epochs for the other methods (the baseline model, our method (DPLS), self-adpative training (SAT) (Huang et al., 2020), and temporal ensemble (TE) (Dong et al., 2021)). We executed 200 training epochs on the CIFAR-10 dataset in adversarial training. The other experimental details are summarized in Appendix B.1. Table 10: Performance of DPLS according to various difficulty measurement losses. Std. Robust Robust Std. 
Robust Robust (PGD) (A3) (PGD) (A3) - 87.19 56.44 51.8 85.66 58.46 54.08 C-score 88.05 57.58 52.9 86.6 59.0 54.71 CE loss 87.51 57.47 52.73 86.02 58.64 54.45 EL2N 87.27 58.89 53.38 86.94 58.94 54.32 0-1 loss 87.13 58.43 54.03 85.36 59.33 55.19 Lee, Lee, Bae, and Yoon Table 5 For the experiments in Table 5, we executed 110 training epochs on the subsets of CIFAR-10 dataset in each model. The other experimental details are summarized in Appendix B.1. We employed the pre-calculated C-score (Jiang et al., 2021) from a standard model, and the difficulty from the other losses was calculated during training according to the procedure described in Algorithm 1. The detailed results of the performance of DPLS according to various difficulty losses are listed in Table 10. Table 11: Applying DPLS to MART (Wang et al., 2019). DPLS model denotes the model that calculate the difficulty of training examples for DPLS. Method DPLS model Standard Robust (A3) MART - 83.9 52.19 MART 83.18 52.84 TRADES 83.81 53.2 Table 12: Applying DPLS to RST (Carmon et al., 2019). DPLS dataset denotes the dataset to which DPLS is applied. Method DPLS dataset Standard Robust (A3) - 84.82 57.26 Sup 84.36 57.35 Unsup 85.2 57.62 Sup+Unsup 84.43 57.81 Table 6 For the experiments of applying DPLS to other adversarial training algorithms, we used the code uploaded in the official Github by the author of each method. We executed 110 training epochs on the CIFAR-10 dataset in adversarial training. The other experimental details are summarized in Appendix B.1. Table 11 and Table 12 show the experimental details of applying DPLS to misclassification aware adversarial training (MART) (Wang et al., 2019) and robust self training (RST) (Carmon et al., 2019). We use the 1,000k extra data generated by a denoising diffusion probabilistic model (DDPM) (Ho et al., 2020) as in Gowal et al. (2021a). For the MART models, we used the difficulty score calculated from the TRADES model because MART uses weighted loss considering difficulty (ground-truth class confidence) of each sample, in which the trained model by MART can be biased and miscalculate the difficulty of each sample. The MART model combined with DPLS where the difficulty is calculated from TRADES shows higher performance than the other models. It can be inferred that calculating accurate difficulty score is important, but applying DPLS with approximate difficulty is also effective at improving robustness. For the RST models, we applied DPLS to both labeled and unlabeled dataset (Sup+Unsup), only the labeled dataset (Sup), and only the unlabeled dataset (Unsup). We calculated difficulty of the unlabeled dataset as the ratio of the number of training to the number of right predictions. For the DPLS (Sup) model, it is inferred that the model is trained to depend on the unlabeled dataset because of regularization only on the labeled data. For the DPLS (Unsup) model, although the difficulty of the unlabeled dataset is not calculated from the full trajectory, the DPLS successfully mitigates the negative effect of hard examples for the unlabeled dataset. B.6 Other experiment details Additional details of easy and hard examples Fig. 14 shows the class distributions of 10k easy and 10k hard examples in the training dataset of CIFAR-10 selected by using accumulated 0-1 loss difficulty along the training trajectory of TRADES Zhang et al. (2019). Fig. 
15 shows the 0-1 loss distributions of the training datasets for CIFAR-10, CIFAR-100, Regularizing Hard Examples Improves Adversarial Robustness 0 1 2 3 4 5 6 7 8 9 Class Number of data (a) Easy examples 0 1 2 3 4 5 6 7 8 9 Class Number of data (b) Hard examples Figure 14: The class distributions of 10k easy and 10k hard examples in CIFAR-10 selected by using accumulated 0-1 loss difficulty. 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0-1 loss Number of data (a) CIFAR-10 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0-1 loss Number of data (b) CIFAR-100 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0-1 loss Number of data Figure 15: The 0-1 loss distributions of the each training dataset: CIFAR-10, CIFAR-100, and SVHN. and SVHN, which are calculated with the training trajectory of TRADES. The value of 0-1 loss is normalized by using min-max normalization. Fig. 16 shows easy and hard examples in the training dataset of CIFAR-10. Appendix C. Related work Wong et al. (2020) conducted analysis on overfitting in adversarial training, which is termed as robust overfitting. They applied various data augmentation and regularziation methods to reduce the generalization gap between training and test robust accuracy, concluding that the effect of early stopping can be matched by other methods. Chen et al. (2020) utilized smoothing of both logits and weights in adversarial training. They applied knowledge distillation (Hinton et al., 2015) to logits instead of label smoothing (Szegedy et al., 2016) with robust-trained and standard-trained models as teacher models and used stochastic weight average (Izmailov et al., 2018) as a weight smoothing method by averaging weights in another model along the training trajectory. They demonstrated that the smoothing methods are effective at mitigating robust overfitting. Rebuffiet al. (2021) combined data augmentation methods and weight averaging and significantly improved robust accuracy. They noted that weight averaging is effective when robust accuracy between training iterations is maintained. They experimented on several heuristics-driven augmentation such as cutout (De Vries and Taylor, 2017), mixup (Zhang et al., 2018), and cutmix (Yun et al., 2019). Dong et al. Lee, Lee, Bae, and Yoon Original Airplane Car Bird Cat Deer Dog Frog Horse Ship Truck Prediction Airplane Car Bird Cat Deer Dog Frog Horse Ship Truck Prediction Airplane Car Bird Cat Deer Dog Frog Horse Ship Truck (a) Easy examples Original Airplane Car Bird Cat Deer Dog Frog Horse Ship Truck Prediction Car Ship Airplane Deer Truck Bird Cat Brid Frog Airplane Prediction Frog Brid Deer Dog Car Cat Deer Deer Cat Bird (b) Hard examples Figure 16: Easy and hard examples in CIFAR-10 Krizhevsky et al. (2009) selected by using accumulated 0-1 loss difficulty. Original denotes the ground-truth class of images and prediction denotes the most frequently predicted class of images at each epoch of training until the calculation epoch. (2021) demonstrated that DNNs are sufficient to memorize training adversarial examples with random labels. They showed that the memorization of hard training examples leads to decrease in training loss but increase in test error, noting that the causes of robust overfitting may be the memorization of one-hot labels. Specifically, they showed that training adversarial examples with completely random labels are possible when the training algorithm is TRADES (Zhang et al., 2019). 
For the PGD (Madry et al., 2017) models, when the label noise ratio exceeds a certain threshold, the generalization gap between train and test accuracy becomes zero and the model exhibits a performance similar to that of the random prediction. In Liu et al. (2022a), they utilize average training loss as difficulty measurement and analyze overfitting mitigation methods in adversarial training. However, instead of using difficulty score to improve the performance of adversarial training, they proposed Adpative Target and Reweight by using moving average predictions as in Huang et al. (2020); Dong et al. (2021) and ground truth prediction as in Wang et al. (2019). Because the performance for standard adversarial training is not provided in the paper, but the performance only for fast adversarial training and fine-tuning with extra data is provided, we trained and evaluated their models for the comparison. The results of Table 13 shows that Reweight is not applicable to the standard adversarial training and Adaptive Target shows the same performance as SAT Huang et al. (2020) because it is simply a linear combination of SAT and one-hot label. In Liu et al. (2022a), they constituted a subset dataset in a balanced form between classes. However, the difficulty of classes can be imbalanced, and selecting the Regularizing Hard Examples Improves Adversarial Robustness Table 13: Performance comparison with the methods in Liu et al. (2021). The models trained using the code in Liu et al. (2021) are denoted as (*). Method Std. Robust (PGD) Robust (A3) PGD 87.19 56.44 51.8 +DPLS 86.17 58.14 53.98 PGD (*) 86.75 56.24 51.81 +Reweight (*) 25.91 22.16 21.93 +Adaptive Target (*) 86.04 58.21 52.87 +Reweight+Adaptive Target (*) 30.49 25.81 25.6 most difficult data considering the balance between classes can result in a deviation from correct observations. As noted in Xu et al. (2021), adversarial training differently increases the robustness of each class in a dataset, so the performance imbalance between classes becomes more significant in adversarial training. Thus, it can be negligible in standard training to balance the number of each class in training example subsets, but it can be non-trivial in adversarial training. The distributions of hard and easy example subsets in our paper are plotted in Fig 14. The distributions of easy and hard example subsets are highly class-imbalanced. The comparison study on the pruning balanced or imbalanced training subsets with different difficulty losses can be seen in Table 19, where the models that prune imbalanced subsets consistently show higher performance than the models that prune balanced subsets. Furthermore, we showed that our method does not only exploit the performance increase of the easy classes, but increases the both performance of easy and hard classes as shown in Table 20. Table 14: Performance of DPLS according to the DPLS calculation epoch T. T PGD TRADES Std. Rob. Std. Rob. 10 84.13 53.34 84.88 54.69 50 86.33 54.05 85.56 55.0 90 87.13 54.03 85.36 55.19 100 86.94 53.8 85.53 54.77 Table 15: Performance of DPLS according to the re-initialized epoch, where DPLS is precalculated by using DPLS of the calculation epoch 90. T Reinit. PGD TRADES Std. Rob. Std. Rob. 0 86.62 53.69 84.99 55.05 50 86.68 54.09 84.89 55.11 90 87.13 54.03 85.36 55.19 100 86.76 53.76 85.26 54.93 Appendix D. Additional ablation study Ablation on the DPLS calculation epoch In Table 14, we conducted experiments varying the calculation epoch of DPLS. 
The results show that applying DPLS at any epoch Lee, Lee, Bae, and Yoon is effective at increasing robustness. However, the results show that the improvement is slightly higher when using the difficulty calculated until higher epoch. Additionally, when DPLS is applied after the learning rate decay, the result shows that the increase in robustness is diminished, but it still shows better performance than the baseline model. Thus, the application of DPLS before memorization of hard examples can maximize the mitigation effect of the method, which indicates the overfitting prevention effect of DPLS. Ablation on the re-initialization of training by using DPLS In Table 15, we conducted ablation study varying the training re-initialization epoch with pre-calculated DPLS by using the difficulty of that calculated until the training epoch 90. There are small differences between the results. Comparing the result in Table 14 with the result in Table 15, it indicates that the difficulty score is important for DPLS and the applying checkpoint is less important after fitting training examples and before memorization of hard examples. However, all models still show the improvement in the robustness performance of the model as in the above DPLS calculation epoch ablation study. Table 16: Performance of DPLS applying to the PGD model on CIFAR-10 according to the smoothing factor λ. Model λ (smoothing factor) Avg. smoothing Standard Robust (PGD) Robust (A3) - 1.0 87.19 56.44 51.8 0.9 0.98 87.73 56.8 52.37 0.7 0.939 87.25 57.58 53.17 0.5 0.898 87.13 58.43 54.03 0.3 0.857 86.09 58.51 54.29 0.1 0.816 85.2 58.36 53.92 Table 17: Performance of DPLS applying to the TRADES model on CIFAR-10 according to the smoothing factor λ. Model λ (smoothing factor) Avg. smoothing Standard Robust (PGD) Robust (A3) - 1.0 85.66 58.46 54.08 0.9 0.979 85.49 58.74 54.39 0.7 0.935 85.41 59.08 54.8 0.5 0.893 85.36 59.33 55.19 0.3 0.85 84.95 59.29 55.22 0.1 0.807 84.06 59.25 55.38 Ablation study on the factor of DPLS Table 16 and Table 17 show the performance of DPLS applying to the PGD and TRADES models on CIFAR-10, respectively. We calculated Regularizing Hard Examples Improves Adversarial Robustness the average value for the smoothing factor of DPLS. In both results, it is observed that applying DPLS with the factor of high value further increases the robustness performance. Ablation study on the regularization of DPLS We used an LS approach for mitigating the effect of hard examples, instead of simply reweighting the loss of hard examples. Pruning hard examples corresponds to assigning zero values to the loss of hard examples, so we compare the performance of DPLS with the reweighting method. We applied the difficulty of each example to the loss to reweight the loss of each example, which is the same difficulty with DPLS. Because TRADES utilized the robustness loss without labels, we applied DPLS to cross-entropy loss for clean examples and reweighted the robust loss with the same difficulty; thus, the difference between DPLS and the reweighting method in TRADES is the difference of application of the method to the loss for clean examples. Table18 lists the result of applying difficulty proportional reweight (DPR) and DPLS. In the results of both PGD and TRADES, DPLS shows higher standard accuracy. 
Table 18 lists the results of applying difficulty proportional reweighting (DPR) and DPLS. For both PGD and TRADES, DPLS yields higher standard accuracy. We infer that because DPR still drives the predictions toward the one-hot labels and therefore continues to rely on the memorization of examples, it improves robustness at the cost of accuracy, following the robustness-accuracy trade-off. In contrast, instead of memorizing hard examples with one-hot labels, our method takes advantage of hard examples by allowing soft labels and learning a distribution that smoothly includes them, which improves robustness while reducing the loss in standard performance.

Table 18: Performance comparison of DPLS with the reweighting method (DPR) for the PGD and TRADES models.

    Method   PGD: Std.   Robust (PGD)   Robust (A3)   TRADES: Std.   Robust (PGD)   Robust (A3)
    -        87.19       56.44          51.8          85.66          58.46          54.08
    DPR      86.1        57.32          53.2          84.63          58.54          54.88
    DPLS     87.13       58.43          54.03         85.36          59.33          55.19

Study on balancing the number of data in each class for pruning. In our experiments, we selected easy and hard examples without considering class imbalance. The class distributions of the 10k easy and 10k hard examples used in our experiments are shown in Figure 14. We conducted an ablation study on the effect of balancing the class distribution of the hard-example subset. Table 19 shows the results for TRADES models that prune the top 10% most difficult examples from the CIFAR-10 training set. We varied both the loss used to measure difficulty and whether the number of pruned examples per class is balanced (500 pruned examples per class in the balanced setting); the balanced pruning subset is extracted by selecting the top 10% most difficult examples within each class. While the pruning models that ignore class imbalance improve robustness, the pruning models with the balanced pruning subset show robustness similar to that of the baseline model. This indicates that, when selecting hard examples that negatively affect training, enforcing a balanced number of pruned examples per class is less effective for improving robustness. A sketch of the two selection strategies follows Table 19.

Table 19: Performance of pruning models according to various difficulty losses, with and without balancing the number of excluded examples per class.

    Method     Balancing   Standard   Robust (PGD)   Robust (CW)   Robust (A3)
    -          -           85.66      58.46          56.53         54.08
    C-score    yes         85.2       57.87          56.72         53.97
    C-score    no          85.35      58.52          56.82         54.25
    CE loss    yes         83.7       57.49          56.76         54.01
    CE loss    no          83.82      58.37          57.31         54.45
    EL2N       yes         83.8       57.76          56.89         54.07
    EL2N       no          83.25      57.86          56.9          54.28
    0-1 loss   yes         83.81      57.84          56.76         54.22
    0-1 loss   no          84.07      58.04          57.4          54.64
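A minimal sketch of the two selection strategies compared in Table 19, assuming a per-example difficulty vector and the corresponding class labels (names are illustrative): the unbalanced variant takes the globally hardest 10% of examples, whereas the balanced variant takes the hardest 10% within each class (500 examples per CIFAR-10 class).

```python
import numpy as np

def hardest_global(difficulty, prune_frac=0.1):
    """Unbalanced pruning: indices of the globally hardest prune_frac of examples."""
    n_prune = int(len(difficulty) * prune_frac)
    return np.argsort(difficulty)[-n_prune:]          # largest difficulty = hardest

def hardest_per_class(difficulty, labels, prune_frac=0.1):
    """Balanced pruning: the hardest prune_frac within each class
    (500 per class for CIFAR-10, which has 5,000 training examples per class)."""
    pruned = []
    for c in np.unique(labels):
        cls_idx = np.where(labels == c)[0]
        n_prune = int(len(cls_idx) * prune_frac)
        pruned.append(cls_idx[np.argsort(difficulty[cls_idx])[-n_prune:]])
    return np.concatenate(pruned)

# Training then proceeds on the complement of the returned index set.
```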
Study on robust fairness. Xu et al. (2021) showed that adversarial training increases the robustness of each class in a dataset to a different degree: it effectively increases the robustness of easy classes but is less effective for hard classes. They noted that this performance imbalance between classes, termed the robust fairness problem, becomes more significant in adversarial training. Because DPLS regularizes hard examples, one might suspect that its robustness gains come solely from improving the easy classes. We therefore compare the robustness of easy and hard classes in Table 20. We divide the CIFAR-10 classes into 5 easy and 5 hard classes, referring to the class distributions of easy and hard examples in Figure 14: easy (classes 0, 1, 7, 8, 9) and hard (classes 2, 3, 4, 5, 6). From Table 20, the performance increase of the DPLS models is attributed to robustness improvements in both the easy and the hard class subsets. This indicates that our method does not merely exploit the performance increase of the easy classes but improves the performance of both easy and hard classes.

Table 20: Robustness against adversarial attacks for the easy-class and hard-class subsets of CIFAR-10.

    Model    Method   PGD     PGD (easy)   PGD (hard)   A3      A3 (easy)   A3 (hard)
    PGD      -        56.54   70.5         42.58        51.8    67.62       35.98
    PGD      DPLS     58.49   71.82        45.16        54.05   68.66       39.44
    TRADES   -        58.52   71.42        45.62        54.08   68.64       39.52
    TRADES   DPLS     59.32   72.9         45.74        55.19   70.22       40.16

Table 21: Performance evaluation of DPLS using the MD (Jiang et al., 2020) and RayS (Chen and Gu, 2020) attacks.

    Method   PGD: Std.   Robust (MD)   Robust (RayS)   Robust (A3)   TRADES: Std.   Robust (MD)   Robust (RayS)   Robust (A3)
    -        87.19       51.93         61.13           51.80         85.66          54.17         60.84           54.08
    DPLS     87.13       54.07         62.14           54.03         85.36          55.25         62.11           55.19

Appendix E. Additional comparison study

Additional comparative study using existing evaluation attack methods. We primarily evaluated our models against AutoAttack (Croce and Hein, 2020) because it provides an accurate assessment without giving a false sense of security due to issues such as gradient obfuscation. Since the MD ensemble attack (Jiang et al., 2020), which includes margin decomposition (MD), appears to be a variant of AutoAttack, we also measured the CIFAR-10 models' performance against this attack. The results are presented in Table 21. The MD ensemble attack yields results and patterns very similar to those of AutoAttack, demonstrating that our method, DPLS, is also robust against the MD attack. Furthermore, we evaluated our models against RayS (Chen and Gu, 2020), an attack method that can reveal vulnerabilities that would otherwise lead to a false sense of security. The results, also presented in Table 21, indicate that DPLS is robust against RayS and improves over the baseline model.

Table 22: Performance evaluation of DPLS combined with DiffPure (Nie et al., 2022) on RobustBench (Croce et al., 2020) for the PGD and TRADES models.

    Method                       PGD: Std.   Robust (A3)   TRADES: Std.   Robust (A3)
    Baseline                     88.28       53.32         86.13          56.05
    Baseline + DPLS              86.72       56.25         85.15          56.84
    Baseline + DiffPure          88.28       66.87         87.11          66.80
    Baseline + DPLS + DiffPure   87.50       68.61         86.72          71.09

Additional comparative study combining existing defense methods. DiffPure (Nie et al., 2022) is an image purification method that leverages pre-trained diffusion models. Since the classifier operates independently of the purification step, DiffPure can be combined with any adversarial training-based approach. We applied DiffPure to two models: one trained with baseline adversarial training without DPLS, and another trained with DPLS. In the CIFAR-10 setting of DiffPure, we measured performance by replacing the classifier model and evaluated robustness with RobustBench (Croce et al., 2020). As shown in Table 22, we compared adding only DiffPure to the PGD and TRADES baseline models with adding both DiffPure and DPLS together. The models with DPLS achieve relatively higher performance.
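Because DiffPure purifies inputs independently of the downstream classifier, combining it with a DPLS-trained model amounts to swapping the classifier inside a purification-then-classification pipeline. A minimal sketch is given below, with a hypothetical purifier object standing in for the pre-trained diffusion model; this is not DiffPure's actual interface.

```python
import torch.nn as nn

class PurifyThenClassify(nn.Module):
    """Evaluate robustness end to end: purify the input, then classify it.

    purifier: stand-in for a pre-trained diffusion-based purification model
              (e.g., DiffPure); not DiffPure's actual API.
    classifier: any adversarially trained network, with or without DPLS.
    """
    def __init__(self, purifier, classifier):
        super().__init__()
        self.purifier = purifier
        self.classifier = classifier

    def forward(self, x):
        x_pur = self.purifier(x)         # remove the adversarial perturbation
        return self.classifier(x_pur)    # classify the purified image

# Only the classifier changes between the "Baseline + DiffPure" and
# "Baseline + DPLS + DiffPure" rows of Table 22.
```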
Since DiffPure operates independently of the classifier model to enhance adversarial defense performance, using a classifier trained with DPLS, which is comparatively more robust, appears to result in better defense performance.