# Improving Adversarial Robustness Requires Revisiting Misclassified Examples

Published as a conference paper at ICLR 2020

Yisen Wang1, Difan Zou2, Jinfeng Yi3, James Bailey4, Xingjun Ma4, Quanquan Gu2
1Shanghai Jiao Tong University  2University of California, Los Angeles  3JD.com  4The University of Melbourne
eewangyisen@gmail.com, {knowzou, qgu}@cs.ucla.edu, yijinfeng@jd.com, {baileyj, xingjun.ma}@unimelb.edu.au

ABSTRACT

Deep neural networks (DNNs) are vulnerable to adversarial examples crafted by imperceptible perturbations. A range of defense techniques have been proposed to improve DNN robustness to adversarial examples, among which adversarial training has been demonstrated to be the most effective. Adversarial training is often formulated as a min-max optimization problem, with the inner maximization generating adversarial examples. However, there is a simple yet easily overlooked fact: adversarial examples are only defined on correctly classified (natural) examples, but inevitably, some natural examples will be misclassified during training. In this paper, we investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. Specifically, we find that misclassified examples indeed have a significant impact on the final robustness. More surprisingly, we find that different maximization techniques on misclassified examples may have a negligible influence on the final robustness, while different minimization techniques are crucial. Motivated by these findings, we propose a new defense algorithm called Misclassification Aware adveRsarial Training (MART), which explicitly differentiates the misclassified and correctly classified examples during training. We also propose a semi-supervised extension of MART that can leverage unlabeled data to further improve robustness. Experimental results show that MART and its variant significantly improve the state-of-the-art adversarial robustness.

1 INTRODUCTION

Despite their great success in applications such as computer vision (He et al., 2016), speech recognition (Wang et al., 2017) and natural language processing (Devlin et al., 2018; Zeng et al., 2019), deep neural networks (DNNs) are extremely vulnerable to adversarial examples crafted by adding small adversarial perturbations to natural examples (Szegedy et al., 2013; Goodfellow et al., 2015; Wu et al., 2020). Given a DNN classifier h_θ with parameters θ and a correctly classified natural example x with class label y (h_θ(x) = y), an adversarial example x′ can be generated by perturbing x such that h_θ(x′) ≠ y, i.e., the natural example is correctly classified before perturbation but misclassified after perturbation. The perturbation required for misclassification is often small and bounded by an L_p-norm constraint ‖x′ − x‖_p ≤ ϵ, which keeps x′ within the ϵ-ball centered at x, so that it is visually the same for human observers. This vulnerability of DNNs raises serious security concerns about their practicability in security-critical applications (Chen et al., 2015; Kurakin et al., 2016; Jiang et al., 2019; Finlayson et al., 2019; Ma et al., 2019).
Compared with pre/post-processing methods such as feature squeezing (Xu et al., 2017), input denoising (Guo et al., 2018; Liao et al., 2018; Samangouei et al., 2018; Bai et al., 2019) and adversarial detection (Feinman et al., 2017; Ma et al., 2018; Lee et al., 2018), several defense techniques have been proposed to train DNNs that are inherently robust to adversarial examples, including defensive distillation (Papernot et al., 2016), gradient regularization (Gu & Rigazio, 2014; Papernot et al., 2017; Ross & Doshi-Velez, 2018; Tramèr et al., 2018), model compression (Das et al., 2018; Liu et al., 2018) and activation pruning (Dhillon et al., 2018; Rakin et al., 2018), among which adversarial training has been demonstrated to be the most effective (Athalye et al., 2018).

[Figure 1: The distinctive influence of misclassified examples (S−) versus correctly classified ones (S+) on the robustness of adversarial training, plotted as test robustness (%) against training epoch. We test the white-box robustness of different strategies applied to either subset of examples: (a) using them directly for training ("not perturb"); (b) using a weak attack (FGSM) in the inner maximization; and (c) using a regularized CE loss in the outer minimization.]

Adversarial training can be regarded as a data augmentation technique that trains DNNs on adversarial examples, and can be viewed as solving the following min-max optimization problem (Madry et al., 2018):

    min_θ (1/n) Σ_{i=1}^n max_{‖x′_i − x_i‖_p ≤ ϵ} ℓ(h_θ(x′_i), y_i),    (1)

where n is the number of training examples and ℓ(·) is the classification loss, such as the commonly used cross-entropy (CE) loss. The inner maximization generates adversarial examples that are used by the outer minimization to train robust DNNs. Recently, adversarial training with adversarial examples generated by Projected Gradient Descent (PGD) (Madry et al., 2018) has been demonstrated to be the only method that can train moderately robust DNNs without being fully attacked (Athalye et al., 2018). However, there is still a significant gap between adversarial robustness (test accuracy on adversarial examples) and natural accuracy (test accuracy on natural examples), even for simple image datasets like CIFAR-10 (Krizhevsky & Hinton, 2009).
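To make the inner maximization of (1) concrete, the following is a minimal PGD sketch in PyTorch under an L∞ constraint. It assumes a generic `model` mapping images in [0, 1] to logits; the function name `pgd_attack` and its default hyperparameters are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, num_steps=10):
    """Approximately solve the inner maximization of Eq. (1) with PGD
    under an L-infinity constraint (a sketch, not the authors' code)."""
    x_adv = x.detach() + torch.empty_like(x).uniform_(-eps, eps)  # random start
    x_adv = torch.clamp(x_adv, 0.0, 1.0)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + step_size * grad.sign()        # gradient ascent step
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
        x_adv = torch.clamp(x_adv, 0.0, 1.0)                    # keep a valid image
    return x_adv.detach()
```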
Compared with natural training (on natural examples), training adversarially robust DNNs is particularly difficult (Madry et al., 2018). Nakkiran (2019) showed that a model requires more capacity to be robust (i.e., simple models can have high natural accuracy but are less likely to be robust). In addition, the sample complexity of adversarial training can be significantly higher than that of natural training; that is, training robust DNNs tends to require more data, either labeled (Schmidt et al., 2018) or unlabeled (Uesato et al., 2019; Carmon et al., 2019; Najafi et al., 2019; Zhai et al., 2019). Moreover, Tsipras et al. (2019) and Zhang et al. (2019) demonstrated that adversarial robustness may be inherently at odds with natural accuracy.

Parallel to these studies, in this paper, we provide some new insights on the adversarial examples used for adversarial training. Recall that the formal definition of an adversarial example is conditioned on it being correctly classified (Carlini et al., 2019). From this perspective, adversarial examples generated from misclassified examples are "undefined". Most adversarial training variants neglect this distinction: all training examples are treated equally in both the maximization and the minimization processes, regardless of whether or not they are correctly classified. The only exception we are aware of is Ding et al. (2018), which proposes to use maximal margin optimization for correctly classified examples, yet pays little attention to misclassified examples. A deeper understanding of the influence of misclassified and correctly classified examples on robustness is still missing in the literature. Therefore, we raise the following questions:

Are the adversarial examples generated from i) misclassified and ii) correctly classified examples equally important for adversarial robustness? If not, how can one make better use of the difference to improve robustness?

In this paper, we investigate this intriguing, yet thus far overlooked, aspect of adversarial training, and find that misclassified and correctly classified examples have a distinctive influence on the final robustness. To illustrate this phenomenon, we conduct a proof-of-concept experiment on CIFAR-10 in a white-box setting with L∞ maximum perturbation ϵ = 8/255. We first train an 8-layer Convolutional Neural Network (CNN) using standard adversarial training with 10-step PGD (PGD10) and step size ϵ/4, then use this network (87% training accuracy) to select two subsets of natural training examples to investigate: 1) a subset of misclassified examples S− (13% of the training data), and 2) a subset of correctly classified examples S+ (also 13% of the training data, |S+| = |S−|). In this paper, both "correctly classified" and "misclassified" refer to predictions on natural training examples. Using these two subsets, we explore different ways to re-train the same network, and evaluate its robustness against white-box PGD20 (step size ϵ/10) attacks on the test set.
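As a rough illustration, selecting the two subsets amounts to partitioning the natural training examples by the current model's predictions; the helper below is a sketch, and its name `split_by_correctness` is ours rather than the authors'.

```python
import torch

@torch.no_grad()
def split_by_correctness(model, x, y):
    """Return boolean masks for correctly classified (S+) and misclassified (S-)
    natural examples under the current model (illustrative helper)."""
    pred = model(x).argmax(dim=1)
    correct = pred.eq(y)   # membership in S+
    wrong = ~correct       # membership in S-
    return correct, wrong
```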
In Figure 1(a), we find that misclassified examples have a significant impact on the final robustness. Compared with standard adversarial training (dashed blue line), the final robustness drops drastically if examples in subset S− are not perturbed (solid green line) during adversarial training (other examples are still perturbed by PGD10). In contrast, the same operation on subset S+ only slightly affects the final robustness (solid orange line). Previous work has found that removing a small proportion of training examples does not reduce robustness (Ding et al., 2019); this seems to hold for correctly classified examples, but apparently not for misclassified examples.

To further understand the distinctive influence of misclassified and correctly classified examples, we test different techniques on them within either the maximization or the minimization process of adversarial training. First, we apply different maximization techniques while keeping the minimization loss (CE) unchanged. As shown in Figure 1(b), the final robustness is barely affected when we use a weak attack (e.g., the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015)) to perturb the misclassified examples S− (all other training examples are still perturbed by PGD10). This suggests that different maximization techniques on misclassified examples S− may have a negligible influence on the final robustness, provided that the inner maximization problem is solved to a moderate precision (Wang et al., 2019). However, for subset S+, a weak attack in the maximization tends to degrade the robustness. Second, we test different minimization techniques with the inner maximization still solved by PGD10. Interestingly, we find that different minimization techniques on misclassified examples make a significant difference to the final robustness. As shown in Figure 1(c), compared with standard adversarial training (dashed blue line) with the CE loss, the final robustness is significantly improved when the outer minimization on misclassified examples is regularized (solid green line) by an additional term (a KL-divergence term used previously in Zheng et al. (2016); Zhang et al. (2019)). The same regularization applied to correctly classified examples also helps the final robustness (solid orange line), though not as significantly as for misclassified examples.

Motivated by the above observations, we reformulate the adversarial risk to incorporate an explicit differentiation of misclassified examples in the form of a regularizer. We then propose a new defense algorithm that achieves this in a dynamic way during adversarial training. Our main contributions are:

- We investigate the distinctive influence of misclassified and correctly classified examples on the final robustness of adversarial training. We find that the manipulation of misclassified examples has more impact on the final robustness, and that minimization techniques are more crucial than maximization techniques under the min-max optimization framework.
- We propose a regularized adversarial risk that incorporates an explicit differentiation of misclassified examples as a regularizer. Based on that, we further propose a new defense algorithm, called Misclassification Aware adveRsarial Training (MART).
- Experimentally, we show that adversarial robustness can be significantly improved over the state-of-the-art by a specific focus on misclassified examples. It also helps improve recently proposed adversarial training with unlabeled data.

2 MISCLASSIFICATION AWARE ADVERSARIAL RISK

In this section, we propose a regularized adversarial risk that incorporates an explicit differentiation of misclassified examples.

2.1 PRELIMINARIES

We first define some notation. We use lower case and lower case bold face to denote scalars and vectors, respectively, and upper case calligraphic symbols to denote sets. For a K-class (K ≥ 2) classification problem, given a dataset {(x_i, y_i)}_{i=1,...,n} with x_i ∈ R^d a natural example and y_i ∈ {1, . . . , K} its associated label, a DNN classifier h_θ with model parameters θ predicts the class of an input example x_i:

    h_θ(x_i) = argmax_{k=1,...,K} p_k(x_i, θ),   p_k(x_i, θ) = exp(z_k(x_i, θ)) / Σ_{k′=1}^K exp(z_{k′}(x_i, θ)),    (2)

where z_k(x_i, θ) is the logit output of the network for class k, and p_k(x_i, θ) is the probability (softmax on logits) of x_i belonging to class k. The adversarial risk (Madry et al., 2018) on dataset {(x_i, y_i)}_{i=1,...,n} and classifier h_θ can be defined with respect to the 0-1 loss (Zhang et al., 2019) as:

    (1/n) Σ_{i=1}^n max_{x′_i ∈ B_ϵ(x_i)} 1(h_θ(x′_i) ≠ y_i),    (3)

where 1(·) is the indicator function and B_ϵ(x_i) = {x′ : ‖x′ − x_i‖_p ≤ ϵ} denotes the L_p-norm ball centered at x_i with radius ϵ. We focus on the L∞-ball in this paper.
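For reference, the empirical adversarial risk (3) can be estimated by counting how many attacked examples are misclassified; the sketch below reuses the hypothetical `pgd_attack` helper from above as the attack, and the function name is ours.

```python
import torch

def empirical_adversarial_risk(model, loader, attack_fn):
    """Estimate the 0-1 adversarial risk of Eq. (3): the fraction of examples
    whose adversarial counterpart is misclassified (a sketch)."""
    model.eval()
    errors, total = 0, 0
    for x, y in loader:
        x_adv = attack_fn(model, x, y)              # e.g. the pgd_attack sketch above
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        errors += pred.ne(y).sum().item()           # counts 1(h(x') != y)
        total += y.size(0)
    return errors / total
```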
2.2 MISCLASSIFICATION AWARE REGULARIZATION

Note that the adversarial risk in (3) is defined on adversarial examples within the ϵ-ball of all natural examples, regardless of whether they are correctly classified (h_θ(x_i) = y_i) or misclassified (h_θ(x_i) ≠ y_i) by the current model h_θ. To differentiate the two, we reformulate the adversarial risk based on the predictions of the current network h_θ. Specifically, the natural training examples can be divided into two subsets with respect to h_θ, one of correctly classified examples (S+_hθ) and one of misclassified examples (S−_hθ):

    S+_hθ = {i : i ∈ [n], h_θ(x_i) = y_i}  and  S−_hθ = {i : i ∈ [n], h_θ(x_i) ≠ y_i}.

We then define the adversarial risk separately for correctly classified and misclassified examples. As observed in Figure 1(c), regularization on misclassified examples can significantly improve robustness. Therefore, for misclassified examples, we formulate the adversarial risk as:

    R−(h_θ, x_i) := 1(h_θ(x̂′_i) ≠ y_i) + 1(h_θ(x_i) ≠ h_θ(x̂′_i)),    (4)

where the adversarial example x̂′_i is generated by solving

    x̂′_i = argmax_{x′_i ∈ B_ϵ(x_i)} 1(h_θ(x′_i) ≠ y_i).    (5)

We remark that the first and second terms on the R.H.S. of (4) correspond to the standard adversarial risk and the regularization term, respectively. Moreover, we would like to clarify that the regularization term 1(h_θ(x_i) ≠ h_θ(x̂′_i)) aims to encourage the output of the network to be stable against adversarial perturbations of misclassified examples. For misclassified examples, direct minimization of the standard adversarial risk may be too hard, as these examples cannot be classified correctly even without any perturbation. A similar idea has been used in stability training on all training examples (Zheng et al., 2016; Kannan et al., 2018; Zhang et al., 2019).

We then consider correctly classified examples. As can be observed in Figure 1(c), regularization on correctly classified examples does not provide as significant an improvement as that on misclassified ones. Moreover, in this case 1(h_θ(x_i) ≠ h_θ(x̂′_i)) = 1(h_θ(x̂′_i) ≠ y_i) since h_θ(x_i) = y_i, which implies that the regularizer has exactly the same form as the adversarial risk. Therefore, for correctly classified examples, we simply use the standard adversarial risk, i.e.,

    R+(h_θ, x_i) := max_{x′_i ∈ B_ϵ(x_i)} 1(h_θ(x′_i) ≠ y_i) = 1(h_θ(x̂′_i) ≠ y_i),    (6)

where the adversarial example x̂′_i is defined in (5). Finally, combining the proposed adversarial risks for correctly classified and misclassified examples in an adversarial training framework, we train a network that minimizes the following risk:

    min_θ R_misc(h_θ) := (1/n) [ Σ_{i ∈ S+_hθ} R+(h_θ, x_i) + Σ_{i ∈ S−_hθ} R−(h_θ, x_i) ]
                       = (1/n) Σ_{i=1}^n [ 1(h_θ(x̂′_i) ≠ y_i) + 1(h_θ(x_i) ≠ h_θ(x̂′_i)) · 1(h_θ(x_i) ≠ y_i) ],    (7)

where x̂′_i is defined in (5) and the second equality follows from the definitions of S+_hθ and S−_hθ. The new risk defined above is a regularized adversarial risk with regularization term (1/n) Σ_{i=1}^n 1(h_θ(x_i) ≠ h_θ(x̂′_i)) · 1(h_θ(x_i) ≠ y_i), which we call the misclassification aware regularization.
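To spell out the decomposition in (7), the sketch below evaluates the regularized 0-1 risk on a batch, given already-generated adversarial examples; it is an illustrative evaluation of the definition (the 0-1 loss is not what is optimized in practice), and the function name is ours.

```python
import torch

@torch.no_grad()
def misclassification_aware_risk(model, x_nat, x_adv, y):
    """Evaluate the 0-1 risk of Eq. (7) on a batch (a sketch of the definition)."""
    pred_nat = model(x_nat).argmax(dim=1)
    pred_adv = model(x_adv).argmax(dim=1)
    adv_error = pred_adv.ne(y).float()            # 1(h(x_adv) != y)
    instability = pred_nat.ne(pred_adv).float()   # 1(h(x) != h(x_adv))
    misclassified = pred_nat.ne(y).float()        # 1(h(x) != y)
    return (adv_error + instability * misclassified).mean().item()
```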
3 PROPOSED DEFENSE: MISCLASSIFICATION AWARE ADVERSARIAL TRAINING (MART)

In the previous section, we derived the misclassification aware adversarial risk based on the 0-1 loss. However, optimization over the 0-1 loss is intractable in practice. We next propose a Misclassification Aware adveRsarial Training (MART) algorithm, obtained by replacing the 0-1 losses with surrogate loss functions that are both physically meaningful and computationally tractable. Following that, we analyze the difference between MART and existing work, and propose a semi-supervised extension.

3.1 THE PROPOSED DEFENSE ALGORITHM

Surrogate Loss for Outer Minimization. As presented in (7), the minimization involves three indicator functions: (1) 1(h_θ(x̂′_i) ≠ y_i); (2) 1(h_θ(x_i) ≠ h_θ(x̂′_i)); and (3) 1(h_θ(x_i) ≠ y_i).

For the first indicator function 1(h_θ(x̂′_i) ≠ y_i), we propose to use a boosted cross entropy (BCE) loss as the surrogate, instead of the CE loss commonly used in (Madry et al., 2018; Wang et al., 2019). This is largely because classifying adversarial examples requires a stronger classifier than classifying natural examples, as the presence of adversarial examples makes the classification decision boundary more complicated. This was pointed out by Madry et al. (2018), who increase the model capacity to obtain a stronger classifier. The benefit of using BCE instead of CE is shown in the experiment section. The proposed BCE loss is defined as:

    BCE(p(x̂′_i, θ), y_i) = −log(p_{y_i}(x̂′_i, θ)) − log(1 − max_{k ≠ y_i} p_k(x̂′_i, θ)),    (8)

where p_k(x̂′_i, θ) is the probability output defined in (2), the first term −log(p_{y_i}(x̂′_i, θ)) is the commonly used CE loss, denoted CE(p(x̂′_i, θ), y_i), and the second term −log(1 − max_{k ≠ y_i} p_k(x̂′_i, θ)) is a margin term used to improve the decision margin of the classifier. A similar idea has been used for improving adversarial strength by Carlini & Wagner (2017). Note that BCE is just a simple boost that works well in our experiments, and other boosted losses could also work here.

For the second indicator function 1(h_θ(x_i) ≠ h_θ(x̂′_i)), we can use the KL divergence as the surrogate loss (Zhang et al., 2019; Zheng et al., 2016), since h_θ(x_i) ≠ h_θ(x̂′_i) implies that adversarial examples have output distributions different from those of natural examples. Thus, we have

    KL(p(x_i, θ) || p(x̂′_i, θ)) = Σ_{k=1}^K p_k(x_i, θ) · log( p_k(x_i, θ) / p_k(x̂′_i, θ) ).    (9)

The third indicator function 1(h_θ(x_i) ≠ y_i) is a condition that emphasizes learning on misclassified examples. However, this condition cannot be directly optimized if we make a hard decision during training (Ding et al. (2019) use a hard decision and do not optimize the condition). Instead, we propose a soft decision scheme, replacing 1(h_θ(x_i) ≠ y_i) with the output probability 1 − p_{y_i}(x_i, θ), which is large for misclassified examples and small for correctly classified ones.

Surrogate Loss for Inner Maximization. The goal of the inner maximization is to generate an adversarial example x̂′_i for natural example x_i by solving (5). Therefore, we need a surrogate loss function for the indicator function 1(h_θ(x′_i) ≠ y_i). Here, we use the commonly used CE loss as the surrogate and find the adversarial example x̂′_i as follows:

    x̂′_i = argmax_{x′_i ∈ B_ϵ(x_i)} CE(p(x′_i, θ), y_i).    (10)

Following our findings in Figure 1(b) that a strong attack can help robustness (though the effect is negligible for misclassified examples), we use the (strong) PGD attack to maximize the CE loss for both correctly classified and misclassified examples, the same as standard adversarial training. Note that other surrogate loss functions, such as those exploited in (Athalye et al., 2018; Carlini & Wagner, 2017) for adversarial attack, could also be used here. We slightly abuse the notation of x̂′_i as the maximizer of the surrogate loss rather than of the 0-1 loss in (5).
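As a concrete reading of the surrogate losses (8) and (9), here is a PyTorch sketch of per-example BCE and KL terms, assuming `logits_adv` and `logits_nat` are the network's logit outputs on adversarial and natural inputs; the helper names are ours, and the small constant inside the logarithms is only for numerical stability.

```python
import torch
import torch.nn.functional as F

def bce_loss(logits_adv, y):
    """Per-example boosted cross entropy of Eq. (8):
    -log p_y(x_adv) - log(1 - max_{k != y} p_k(x_adv))."""
    ce = F.cross_entropy(logits_adv, y, reduction='none')
    probs = F.softmax(logits_adv, dim=1)
    probs_other = probs.clone()
    probs_other.scatter_(1, y.unsqueeze(1), 0.0)              # zero out the true class
    margin = -torch.log(1.0 - probs_other.max(dim=1)[0] + 1e-12)
    return ce + margin

def kl_term(logits_nat, logits_adv):
    """Per-example KL(p(x) || p(x_adv)) of Eq. (9)."""
    p_nat = F.softmax(logits_nat, dim=1)
    return (p_nat * (torch.log(p_nat + 1e-12)
                     - F.log_softmax(logits_adv, dim=1))).sum(dim=1)
```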
The Overall Objective. Based on the surrogate loss functions above, we can state the final objective function of our proposed Misclassification Aware adveRsarial Training (MART) defense:

    L^MART(θ) = (1/n) Σ_{i=1}^n ℓ(x_i, y_i, θ),    (11)

where ℓ(x_i, y_i, θ) is defined as

    ℓ(x_i, y_i, θ) := BCE(p(x̂′_i, θ), y_i) + λ · KL(p(x_i, θ) || p(x̂′_i, θ)) · (1 − p_{y_i}(x_i, θ)).

Here the adversarial example x̂′_i is generated by (10), and λ is a tunable scaling parameter that balances the two parts of the final loss and is fixed for all training examples. The complete training procedure of MART is described in Appendix A.
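Putting the pieces together, a sketch of the per-batch MART objective (11) could look as follows, reusing the hypothetical `bce_loss` and `kl_term` helpers above; `lam` corresponds to the paper's λ, and the function name and default value are illustrative.

```python
import torch
import torch.nn.functional as F

def mart_loss(model, x_nat, x_adv, y, lam=6.0):
    """Batch-averaged MART objective of Eq. (11):
    BCE(p(x_adv), y) + lam * KL(p(x) || p(x_adv)) * (1 - p_y(x))."""
    logits_adv, logits_nat = model(x_adv), model(x_nat)
    p_true_nat = F.softmax(logits_nat, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    weight = 1.0 - p_true_nat                     # soft misclassification weight
    per_example = bce_loss(logits_adv, y) + lam * kl_term(logits_nat, logits_adv) * weight
    return per_example.mean()
```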
3.2 RELATION TO EXISTING WORK

In this section, we briefly discuss the difference between MART and existing defense methods, including standard adversarial training (Standard) (Madry et al., 2018), logit pairing methods (Kannan et al., 2018), max-margin adversarial training (MMA) (Ding et al., 2018) and TRADES (Zhang et al., 2019), as presented in Table 1.

Table 1: Loss function comparison with existing work. The adversarial example x̂′ is generated by (10) for all defense methods except TRADES and MMA: the adversarial example in TRADES is generated by maximizing its regularization term (the KL divergence), and the adversarial example in MMA is generated by solving (10) with a different (example-dependent) perturbation limit ϵ.

    Standard:  CE(p(x̂′, θ), y)
    ALP:       CE(p(x̂′, θ), y) + λ · ||p(x̂′, θ) − p(x, θ)||_2^2
    CLP:       CE(p(x, θ), y) + λ · ||p(x̂′, θ) − p(x, θ)||_2^2
    TRADES:    CE(p(x, θ), y) + λ · KL(p(x, θ) || p(x̂′, θ))
    MMA:       CE(p(x̂′, θ), y) · 1(h_θ(x) = y) + CE(p(x, θ), y) · 1(h_θ(x) ≠ y)
    MART:      BCE(p(x̂′, θ), y) + λ · KL(p(x, θ) || p(x̂′, θ)) · (1 − p_y(x, θ))

Specifically, the Standard algorithm minimizes the standard adversarial loss, i.e., the cross-entropy loss on adversarial examples. The logit pairing methods, adversarial logit pairing (ALP) and clean logit pairing (CLP), introduce a regularization term enclosing both natural examples and their adversarial counterparts. The objective of TRADES is also a linear combination of a natural loss and a regularization term on the output probabilities of natural examples and their adversarial counterparts, using the KL divergence. However, none of these algorithms differentiates misclassified from correctly classified examples.

The most relevant work is MMA, which proposes to use maximal margin optimization for correctly classified examples while keeping the optimization on misclassified examples unchanged. Specifically, for correctly classified examples, MMA adopts the cross-entropy loss on adversarial examples, which are generated by solving (10) with an example-dependent perturbation limit. For misclassified examples, MMA directly applies the cross-entropy loss on natural examples. We emphasize that MART differs from MMA in the following aspects: (1) MMA performs a hard decision to identify misclassified examples in the training data, while MART uses a soft decision scheme based on the corresponding output probabilities, which can be jointly learned during training; (2) for correctly classified examples, MMA adopts the cross-entropy loss on adversarial examples with different perturbation limits, while MART uses the proposed BCE loss on adversarial examples with the same perturbation limit; (3) for misclassified examples, MMA adopts the cross-entropy loss on natural examples, while MART adopts a regularized adversarial loss involving both adversarial and natural examples. Because of these differences, we will later show that MART outperforms MMA in the experiments.

3.3 SEMI-SUPERVISED EXTENSION WITH UNLABELED DATA

Recent work has shown that semi-supervised learning with additional unlabeled data can improve adversarial robustness (Uesato et al., 2019; Carmon et al., 2019; Najafi et al., 2019; Zhai et al., 2019). The training loss in these semi-supervised methods is typically a weighted sum of a supervised loss (on the labeled data) and an unsupervised one (on the unlabeled data), i.e., L(θ) = L_sup(θ) + γ · L_unsup(θ), where γ > 0 is the weight of the unsupervised loss. As pointed out in Uesato et al. (2019), there are multiple choices of the unsupervised loss L_unsup(θ), leading to different defense methods, among which the most effective is UAT++. In particular, UAT++ first trains a natural model on the labeled data, and then uses this model to generate pseudo labels for the unlabeled data. Given a training example (x, y) (which can be either labeled or pseudo-labeled), the supervised and unsupervised losses adopted in UAT++ are defined as

    ℓ^UAT++_sup(x, y; θ) = ℓ^UAT++_unsup(x, y; θ) = max_{x′ ∈ B_ϵ(x)} CE(p(x′, θ), y) + λ · max_{x′ ∈ B_ϵ(x)} KL(p(x, θ) || p(x′, θ)),

where λ is a tunable hyperparameter (for efficiency, the adversarial example x′ in the KL term is reused from the CE term in Uesato et al. (2019)). A similar idea was also proposed in the concurrent work of Carmon et al. (2019), leading to another semi-supervised defense method called RST. The first stage of RST likewise generates pseudo labels for the unlabeled data by training a natural model on the labeled data. In the second stage, RST applies the TRADES loss to train the robust model on both labeled and unlabeled data, i.e., given a training example (x, y), the supervised and unsupervised losses adopted in RST are defined as

    ℓ^RST_sup(x, y; θ) = ℓ^RST_unsup(x, y; θ) = CE(p(x, θ), y) + λ · max_{x′ ∈ B_ϵ(x)} KL(p(x, θ) || p(x′, θ)).

As pointed out in Figure 1(b) and in the following experiment section, the maximization technique has a negligible influence on robustness; the major difference between UAT++ and RST is therefore the objective function for the minimization. Since MART is also an objective function, it can easily be combined with semi-supervised learning on unlabeled data. Following RST, we propose the following semi-supervised version of MART:

    L^MART_semi(θ) = Σ_{i ∈ S_sup} ℓ^MART_sup(x_i, y_i; θ) + γ · Σ_{i ∈ S_unsup} ℓ^MART_unsup(x_i, y_i; θ),

with the supervised and unsupervised loss functions defined as

    ℓ^MART_sup(x, y; θ) = ℓ^MART_unsup(x, y; θ) = BCE(p(x̂′, θ), y) + λ · KL(p(x, θ) || p(x̂′, θ)) · (1 − p_y(x, θ)),

where the adversarial example x̂′ is generated by solving (10), and S_sup and S_unsup denote the sets of labeled and unlabeled data, respectively.
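A minimal sketch of this semi-supervised objective in PyTorch might look as follows, reusing the hypothetical `mart_loss` and `pgd_attack` sketches above; the pseudo labels `y_pseudo` are assumed to come from a separately trained natural model, and the function name and default `gamma` are illustrative.

```python
import torch

def semi_supervised_mart_loss(model, x_lab, y_lab, x_unlab, y_pseudo,
                              attack_fn, gamma=1.0, lam=6.0):
    """Sketch of the semi-supervised MART objective: the same per-example MART
    loss is applied to labeled data and to pseudo-labeled unlabeled data,
    with the unsupervised part weighted by gamma."""
    x_lab_adv = attack_fn(model, x_lab, y_lab)          # PGD on the labeled batch
    x_unlab_adv = attack_fn(model, x_unlab, y_pseudo)   # PGD using pseudo labels
    loss_sup = mart_loss(model, x_lab, x_lab_adv, y_lab, lam=lam)
    loss_unsup = mart_loss(model, x_unlab, x_unlab_adv, y_pseudo, lam=lam)
    return loss_sup + gamma * loss_unsup
```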
4 EXPERIMENTS

In this section, we first conduct a set of experiments to provide a comprehensive understanding of our proposed defense MART, then evaluate its robustness on benchmark datasets in both white-box and black-box settings. Finally, we benchmark the state-of-the-art robustness and explore using unlabeled data for further improvement.

4.1 UNDERSTANDING THE PROPOSED MART

Here, we investigate MART from four different perspectives: (1) removing components of the MART loss function, (2) replacing components of the MART loss function, (3) applying the misclassification aware loss to certain proportions of the training data, and (4) sensitivity to the regularization parameter λ.

Experimental Setup. We train ResNet-18 (He et al., 2016) with different variants of MART on CIFAR-10 (Krizhevsky & Hinton, 2009). All models are trained using SGD with momentum 0.9, weight decay 2 × 10^−4 and an initial learning rate of 0.1, which is divided by 10 at the 75-th and 90-th epoch. All natural images are normalized into [0, 1], and simple data augmentations are used, including 4-pixel padding with 32 × 32 random crop and random horizontal flip. The maximum perturbation is ϵ = 8/255 and the parameter λ = 6. The training attack is PGD10 with random start and step size ϵ/4, while the test attack is PGD20 with random start and step size ϵ/10.

[Figure 2: Comprehensive ablation experiments of MART, plotted as test robustness (%) against training epoch (or against λ for panel (d)): (a) removing components, (b) replacing components, (c) proportion of training data (%) using the regularizer, and (d) regularization parameter λ (compared against TRADES). In each plot, the dashed blue line represents the original MART method.]

Removing Components of MART. Recalling the objective function of MART in (11), it has three terms in the loss: BCE, KL and 1 − p (for simplicity, we use these abbreviations in Section 4.1). As illustrated in Figure 2(a), removing 1 − p, KL, or both leads to a significant robustness degradation. In particular, we found that the soft decision term 1 − p brings a consistent robustness improvement throughout the training process, while the KL term helps mitigate overfitting at a later stage of training (after 80 epochs). When the two terms are combined, they boost the final robustness considerably without causing overfitting.

Replacing Components of MART. As shown in Figure 2(b), when the BCE component is either replaced by a CE term or redefined on natural examples (x_nat), the final robustness decreases by a substantial amount. This suggests that learning with CE instead of our proposed BCE suffers from insufficient learning, with lower robustness throughout the entire training process, while learning with BCE on natural examples exhibits severe overfitting at a later stage (solid green line). We did not observe any benefit from replacing CE by KL in the inner maximization of the adversarial min-max framework (solid red line), an observation consistent with Figure 1(b).

Ablation on Training Data. Here, we show the contribution of our proposed misclassification aware regularization (the KL · (1 − p) term in (11)) to the final robustness with respect to the training data. Specifically, we gradually increase the proportion of training examples that are trained with the proposed misclassification aware regularization term, and display the corresponding robustness in Figure 2(c). The training examples using the proposed regularization are randomly selected, and the BCE term is still defined on all training (adversarial) examples. As can be observed, the robustness improves steadily as the proposed regularization is applied to more data. This verifies the benefit of differentiating correctly classified and misclassified examples. A sketch of this partial-regularization variant is given below.
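As referenced above, here is a minimal sketch of applying the regularizer to only part of the data, reusing the hypothetical `bce_loss` and `kl_term` helpers; for simplicity it draws the regularized subset at random per batch, whereas the paper fixes a randomly selected subset of the training data, and `reg_fraction` is illustrative.

```python
import torch
import torch.nn.functional as F

def mart_loss_partial_reg(model, x_nat, x_adv, y, lam=6.0, reg_fraction=0.5):
    """Ablation sketch: the BCE term covers every adversarial example, while the
    KL * (1 - p_y) regularizer is applied only to a fraction of the batch."""
    logits_adv, logits_nat = model(x_adv), model(x_nat)
    weight = 1.0 - F.softmax(logits_nat, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)
    mask = (torch.rand(y.size(0), device=y.device) < reg_fraction).float()
    reg = kl_term(logits_nat, logits_adv) * weight * mask
    return (bce_loss(logits_adv, y) + lam * reg).mean()
```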
Sensitivity to the Regularization Parameter λ. We further investigate the parameter λ in the MART objective (11), which controls the strength of the regularization, and also test the regularization parameter λ of TRADES (see Table 1). We present the results in Figure 2(d) for λ ∈ [1/2, 50] (for clarity, the ticks on the x-axis are placed at uniform intervals for the different values of λ). By explicitly differentiating the misclassified and correctly classified examples, MART achieves good stability and robustness across different choices of λ, and is consistently better and more stable than TRADES.

4.2 ROBUSTNESS EVALUATION AND ANALYSIS

In this part, we evaluate the robustness of MART on both the MNIST (LeCun et al., 1998) and CIFAR-10 (Krizhevsky & Hinton, 2009) datasets against various white-box and black-box attacks.

Baselines. (1) Standard (Madry et al., 2018); (2) MMA (Ding et al., 2019); (3) Dynamic (Wang et al., 2019); and (4) TRADES (Zhang et al., 2019). We only compare with adversarial training variants, since they are the most effective defenses to date (Athalye et al., 2018). Standard, MMA and TRADES have been introduced in Section 3.2; Dynamic is adversarial training with a criterion that dynamically controls the convergence quality of the inner maximization.

Defense Settings. For MNIST, all defense models are built on a 4-layer CNN and trained using SGD with momentum 0.9. The initial learning rate is 0.01, divided by 10 at the 20-th and 40-th epoch. For CIFAR-10, we use SGD with momentum 0.9, weight decay 3.5 × 10^−3 and an initial learning rate of 0.01, which is divided by 10 at the 75-th and 90-th epoch. The training attack is again PGD10 with random start and step size ϵ/4. The perturbation limit is ϵ = 0.3 for MNIST and ϵ = 8/255 for CIFAR-10. For MART, we set λ = 5. Hyperparameters of the baselines are configured as in their original papers: the max margin is set to 0.45 (MNIST) or 12/255 (CIFAR-10) for MMA, the maximum criterion value is c_max = 0.5 for Dynamic, and λ = 4 for TRADES.

Table 2: White-box robustness (accuracy (%) under white-box test attacks) on MNIST and CIFAR-10.

                       MNIST                            CIFAR-10
    Defense    Natural  FGSM   PGD20  CW∞       Natural  FGSM   PGD20  CW∞
    Standard   99.11    97.17  94.62  94.25     84.44    61.89  47.55  45.98
    MMA        98.92    97.25  95.25  94.77     84.76    62.08  48.33  45.77
    Dynamic    98.96    97.34  95.27  94.85     83.33    62.47  49.40  46.94
    TRADES     99.25    96.67  94.58  94.03     82.90    62.82  50.25  48.29
    MART       98.74    97.87  96.48  96.10     83.07    65.65  55.57  54.87

Table 3: Black-box robustness (accuracy (%) under black-box test attacks) on MNIST and CIFAR-10.

                       MNIST                            CIFAR-10
    Defense    FGSM   PGD10  PGD20  CW∞         FGSM   PGD10  PGD20  CW∞
    Standard   96.12  95.73  95.47  96.34       79.98  80.27  80.01  80.85
    MMA        96.11  95.94  95.81  96.87       80.28  80.52  80.48  81.32
    Dynamic    97.60  96.25  95.82  97.03       81.37  81.71  81.38  82.05
    TRADES     97.49  96.03  95.73  97.20       81.52  81.73  81.53  82.11
    MART       97.77  96.96  96.97  98.36       82.75  82.93  82.70  82.95

White-box Robustness. We evaluate the robustness of all defense models against three types of attacks on both MNIST and CIFAR-10: FGSM, PGD20 (20-step PGD with step size ϵ/10), and CW∞ (the L∞ version of CW optimized by PGD). All attacks have full access to the model parameters and are constrained by the same perturbation limit ϵ.
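The CW∞ attack mentioned above is the CW objective optimized with PGD under the L∞ constraint; a common choice for that objective is the logit margin between the runner-up class and the true class, sketched below. The exact margin formulation used for evaluation is not spelled out in the text, so treat this as an assumption; the helper name is ours, and maximizing this quantity with the `pgd_attack` routine above (in place of the CE loss) yields a CW∞-style attack.

```python
import torch

def cw_margin_objective(logits, y):
    """CW-style margin to be maximized by PGD: max_{k != y} z_k - z_y,
    averaged over the batch (a sketch, assuming a logit-margin formulation)."""
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    logits_other = logits.clone()
    logits_other.scatter_(1, y.unsqueeze(1), float('-inf'))  # exclude the true class
    runner_up = logits_other.max(dim=1)[0]
    return (runner_up - true_logit).mean()
```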
The white-box robustness of all defense models is reported in Table 2, where "Natural" denotes the accuracy on natural test images. Our proposed defense MART achieves the best robustness against all three types of attacks on both MNIST and CIFAR-10. Compared with MNIST, the robustness improvements of MART over the other baselines are more significant on CIFAR-10. This is because adversarial training on CIFAR-10 is a more challenging problem with more misclassified examples during training, and MART can better handle those misclassified examples due to its regularization term in (11). Note that the robustness improvement of MART is not caused by so-called obfuscated gradients (Athalye et al., 2018). This can be verified by two phenomena: (1) strong test attacks (e.g., PGD20) have higher success rates (lower accuracies) than weak test attacks (e.g., FGSM), and (2) white-box test attacks have higher success rates than black-box test attacks (compare Table 2 with Table 3). In addition, we conduct an extra check using the gradient-free attack SPSA (Uesato et al., 2018). The SPSA attack does not obtain lower accuracy than gradient-based attacks like PGD, which confirms that the robustness of MART-trained models is not due to gradient masking.

Black-box Robustness. Black-box test attacks are crafted from the natural test images by attacking a surrogate model whose architecture is either a copy of the defense model (MNIST) or a more complex ResNet-50 (CIFAR-10). Both surrogate models are trained separately from the defense models on the original training sets. The attacks used here are FGSM, PGD10, PGD20 and CW∞. The black-box robustness of all defense models is reported in Table 3. Again, the proposed defense MART achieves higher robustness than the other baselines. Compared with the white-box results, all defense methods achieve much better robustness against black-box attacks, even close to the natural accuracy. This suggests that adversarial training is indeed a very practical choice for defense scenarios where the target model can be kept secret from potential attackers. It is also observed that robustness against strong attacks like CW∞ is higher than against weak attacks like FGSM, which indicates that strong attacks have less transferability than weak attacks (Madry et al., 2018).

4.3 BENCHMARKING THE STATE-OF-THE-ART ROBUSTNESS

In this part, we conduct more experiments on a large-capacity network, WideResNet (Zagoruyko & Komodakis, 2016), to benchmark the state-of-the-art robustness, and also explore using unlabeled data for a further robustness boost.

Performance on WideResNet. We employ WideResNet-34-10 (depth 34 and width 10) to explore the full power of our proposed MART defense, and also to benchmark the state-of-the-art robustness on CIFAR-10. The robustness of all defense models is tested against white-box FGSM, PGD20, PGD100 and CW∞ attacks, under the same settings as Section 4.1. We report the robustness of both the best and the last epoch models obtained during training in Table 4 (explanations about the "best" and "last" results are given in Appendix B).

Table 4: White-box robustness (%) on CIFAR-10 using WideResNet-34-10; "Best" and "Last" denote the best-epoch and last-epoch models.

                          FGSM           PGD20          PGD100         CW∞
    Defense    Natural    Best   Last    Best   Last    Best   Last    Best   Last
    Standard   87.30      56.10  56.10   52.68  49.31   51.55  49.03   50.73  48.47
    Dynamic    84.51      63.53  63.53   55.03  51.70   54.12  50.07   51.34  49.27
    TRADES     84.22      64.70  64.70   56.40  53.16   55.68  51.27   51.98  51.12
    MART       84.17      67.51  67.51   58.56  57.39   57.88  55.04   54.58  54.53

Table 5: White-box robustness (%) on WideResNet with additional unlabeled data.
    (a) WideResNet-34-8 with 100K unlabeled data
    Defense    Natural  PGD20
    UAT++      86.04    59.41
    RST        88.24    59.60
    MART       86.68    61.88

    (b) WideResNet-28-10 with 500K unlabeled data
    Defense    Natural  PGD20
    UAT++      86.21    62.76
    RST        89.70    63.10
    MART       86.30    65.04

For each defense method against each attack, "best" refers to the highest robustness ever achieved across checkpoints. Specifically, against the FGSM attack, the best model was found at the last epoch (i.e., "best" is also "last") for all defense methods, while against the PGD20, PGD100 and CW∞ attacks, the best model was found at the epoch right after the first learning rate decay (i.e., epoch 76). Our proposed MART outperforms all baseline methods in terms of the robustness of both the best and the last epoch models. In particular, under the most common comparison setting (against PGD20 attacks on CIFAR-10), MART improves by 8% over Standard and by 4% even over TRADES for the last epoch model. A similar trend of improvement is also observed for the best epoch models. Considering the worst-case accuracies against all attacks, MART still gains 6% and 3.5% robustness improvement over Standard and TRADES, respectively.

Boosting with Additional Unlabeled Data. Here, we evaluate the proposed semi-supervised version of MART and show that it can also benefit from additional unlabeled data and achieve better robustness. Following the exact settings of UAT++ (Uesato et al., 2019) and RST (Carmon et al., 2019), we compare the robustness of MART with them on WideResNet-34-8 and WideResNet-28-10 against PGD20 (FGSM20), respectively (the same setting as reported in their papers). The dataset is CIFAR-10 with 100K and 500K unlabeled data extracted from the 80 Million Tiny Images dataset (Torralba et al., 2008). As confirmed in Table 5, our proposed defense MART also benefits from unlabeled data, and further improves over the UAT++ and RST defenses. This again verifies the benefit of differentiating misclassified and correctly classified examples for improving robustness, and further demonstrates the superiority of our proposed method.

5 CONCLUSION AND FUTURE WORK

In this paper, we investigated the interesting observation that misclassified examples have a recognizable impact on the final robustness of adversarial training, especially through the outer minimization process. Based on this observation, we designed a misclassification aware adversarial risk, formulated by adding a misclassification aware regularization term to the standard adversarial risk. Following the regularized adversarial risk, we proposed a new defense algorithm, called Misclassification Aware adveRsarial Training (MART), with appropriate surrogate loss functions. Experimental results demonstrated that MART significantly improves adversarial robustness over the state-of-the-art, and also achieves better robustness with additional unlabeled data. In the future, we plan to investigate the effect of differentiating correctly classified and misclassified training examples in the recently proposed certified/provable robustness frameworks (Cohen et al., 2019; Salman et al., 2019) and explore the potential improvements brought by this differentiation.

ACKNOWLEDGEMENT

We thank the anonymous reviewers and area chair for their helpful comments. Part of the experiments were done on the JD AI Platform NeuHub when Yisen Wang was at JD.com.
REFERENCES

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.

Yang Bai, Yan Feng, Yisen Wang, Tao Dai, Shu-Tao Xia, and Yong Jiang. Hilbert-based generative defense for adversarial examples. In ICCV, 2019.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In S&P, 2017.

Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.

Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. arXiv preprint arXiv:1905.13736, 2019.

Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In CVPR, 2015.

Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Fred Hohman, Siwei Li, Li Chen, Michael E Kounavis, and Duen Horng Chau. Compression to the rescue: Defending from adversarial attacks across modalities. In KDD, 2018.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. In ICLR, 2018.

Gavin Weiguang Ding, Yash Sharma, Kry Yik Chau Lui, and Ruitong Huang. Max-margin adversarial (MMA) training: Direct input space margin maximization through adversarial training. arXiv preprint arXiv:1812.02637, 2018.

Gavin Weiguang Ding, Kry Yik Chau Lui, Xiaomeng Jin, Luyu Wang, and Ruitong Huang. On the sensitivity of adversarial robustness to input data distributions. In ICLR, 2019.

Reuben Feinman, Ryan R. Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.

Samuel G Finlayson, John D Bowers, Joichi Ito, Jonathan L Zittrain, Andrew L Beam, and Isaac S Kohane. Adversarial attacks on medical machine learning. Science, 363(6433):1287–1289, 2019.

Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

Shixiang Gu and Luca Rigazio. Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068, 2014.

Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. In ICLR, 2018.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, and Yu-Gang Jiang. Black-box adversarial attacks on video recognition models. In ACM MM, pp. 864–872, 2019.

Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, pp. 7167–7177, 2018.

Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In CVPR, 2018.

Qi Liu, Tao Liu, Zihao Liu, Yanzhi Wang, Yier Jin, and Wujie Wen. Security analysis and enhancement of model compressed deep learning systems under adversarial attacks. arXiv preprint arXiv:1802.05193, 2018.

Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In ICLR, 2018.

Xingjun Ma, Yuhao Niu, Lin Gu, Yisen Wang, Yitian Zhao, James Bailey, and Feng Lu. Understanding adversarial attacks on deep learning based medical image analysis systems. arXiv preprint arXiv:1907.10456, 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.

Amir Najafi, Shin-ichi Maeda, Masanori Koyama, and Takeru Miyato. Robustness to adversarial perturbations in learning from incomplete data. arXiv preprint arXiv:1905.13021, 2019.

Preetum Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In S&P, 2016.

Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In AsiaCCS, 2017.

Adnan Siraj Rakin, Jinfeng Yi, Boqing Gong, and Deliang Fan. Defend deep neural networks against adversarial examples via fixed and dynamic quantized activation functions. arXiv preprint arXiv:1807.06714, 2018.

Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, 2018.

Hadi Salman, Greg Yang, Jerry Li, Pengchuan Zhang, Huan Zhang, Ilya Razenshteyn, and Sebastien Bubeck. Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584, 2019.

Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In ICLR, 2018.

Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In NeurIPS, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In ICLR, 2018.

Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, 2019.

Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, Pushmeet Kohli, et al. Are labels required for improving adversarial robustness? arXiv preprint arXiv:1905.13725, 2019.

Yisen Wang, Xuejiao Deng, Songbai Pu, and Zhiheng Huang. Residual convolutional CTC networks for automatic speech recognition. arXiv preprint arXiv:1702.07793, 2017.

Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In ICML, 2019.

Dongxian Wu, Yisen Wang, Shu-Tao Xia, James Bailey, and Xingjun Ma. Skip connections matter: On the transferability of adversarial examples generated with ResNets. In ICLR, 2020.

Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

Min Zeng, Yisen Wang, and Yuan Luo. Dirichlet latent variable hierarchical recurrent encoder-decoder in dialogue generation. In EMNLP, 2019.

Runtian Zhai, Tianle Cai, Di He, Chen Dan, Kun He, John Hopcroft, and Liwei Wang. Adversarially robust generalization just requires more unlabeled data. arXiv preprint arXiv:1906.00555, 2019.

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.

Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In CVPR, 2016.

A MART ALGORITHM

Algorithm 1 Misclassification Aware adveRsarial Training (MART)
1: Input: training data {(x_i, y_i)}_{i=1,...,n}, outer iteration number T_O, inner iteration number T_I, maximum perturbation ϵ, inner step size η_I, outer step size η_O
2: Initialization: standard random initialization of h_θ
3: for t = 1, . . . , T_O do
4:   Uniformly sample a minibatch of training data B(t)
5:   for x_i ∈ B(t) do
6:     x′_i = x_i + ϵ · ξ, with ξ ∼ U(−1, 1)   # U is the uniform distribution
7:     for s = 1, . . . , T_I do
8:       x′_i ← Π_{B_ϵ(x_i)}( x′_i + η_I · sign(∇_{x′_i} CE(p(x′_i, θ), y_i)) )   # Π(·) is the projection operator
9:     end for
10:    x̂′_i ← x′_i
11:  end for
12:  θ ← θ − η_O · Σ_{x_i ∈ B(t)} ∇_θ ℓ(x_i, y_i, x̂′_i; θ)
13: end for
14: Output: robust classifier h_θ

A PyTorch sketch of this training loop is given after Appendix B.

B EXPLANATIONS ABOUT THE RESULTS OF TRADES

To make a fair comparison with the latest method TRADES (Zhang et al., 2019), our code is built upon the TRADES framework. We would like to point out that the robustness reported in their paper is the *best* robustness that occurred during training. To avoid any confusion about our results, we have reported both the *best* and the *last* results in Table 4, and the *best* results for TRADES match their original paper.
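For completeness, below is a PyTorch sketch of one training epoch of Algorithm 1, combining the PGD inner maximization on the CE surrogate (10) with the MART objective (11). The function name, arguments, and default hyperparameters are ours, batch-normalization handling during attack generation is simplified, and this is a sketch rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def mart_train_epoch(model, loader, optimizer, eps=8/255, step_size=2/255,
                     num_steps=10, lam=6.0, device='cuda'):
    """One epoch of MART (Algorithm 1), sketched in PyTorch."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)

        # Inner maximization: PGD on the CE surrogate of Eq. (10)
        x_adv = x.detach() + eps * torch.empty_like(x).uniform_(-1.0, 1.0)
        x_adv = torch.clamp(x_adv, 0.0, 1.0)
        for _ in range(num_steps):
            x_adv.requires_grad_(True)
            loss_inner = F.cross_entropy(model(x_adv), y)
            grad = torch.autograd.grad(loss_inner, x_adv)[0]
            x_adv = x_adv.detach() + step_size * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        x_adv = x_adv.detach()

        # Outer minimization: MART loss of Eq. (11) (line 12 of Algorithm 1)
        logits_adv, logits_nat = model(x_adv), model(x)
        probs_adv = F.softmax(logits_adv, dim=1)
        probs_nat = F.softmax(logits_nat, dim=1)

        # boosted cross entropy on adversarial examples (Eq. 8)
        ce = F.cross_entropy(logits_adv, y, reduction='none')
        probs_other = probs_adv.clone()
        probs_other.scatter_(1, y.unsqueeze(1), 0.0)
        bce = ce - torch.log(1.0 - probs_other.max(dim=1)[0] + 1e-12)

        # KL(p(x) || p(x_adv)) per example (Eq. 9)
        kl = (probs_nat * (torch.log(probs_nat + 1e-12)
                           - F.log_softmax(logits_adv, dim=1))).sum(dim=1)

        # soft misclassification weight 1 - p_y(x)
        weight = 1.0 - probs_nat.gather(1, y.unsqueeze(1)).squeeze(1)

        loss = (bce + lam * kl * weight).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```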