# Cost-Sensitive Robustness against Adversarial Examples

Published as a conference paper at ICLR 2019

Xiao Zhang
Department of Computer Science, University of Virginia
xz7bc@virginia.edu

David Evans
Department of Computer Science, University of Virginia
evans@virginia.edu

ABSTRACT

Several recent works have developed methods for training classifiers that are certifiably robust against norm-bounded adversarial perturbations. These methods assume that all the adversarial transformations are equally important, which is seldom the case in real-world applications. We advocate for cost-sensitive robustness as the criterion for measuring the classifier's performance for tasks where some adversarial transformations are more important than others. We encode the potential harm of each adversarial transformation in a cost matrix, and propose a general objective function to adapt the robust training method of Wong & Kolter (2018) to optimize for cost-sensitive robustness. Our experiments on simple MNIST and CIFAR10 models with a variety of cost matrices show that the proposed approach can produce models with substantially reduced cost-sensitive robust error, while maintaining classification accuracy.

1 INTRODUCTION

Despite the exceptional performance of deep neural networks (DNNs) on various machine learning tasks such as malware detection (Saxe & Berlin, 2015), face recognition (Parkhi et al., 2015) and autonomous driving (Bojarski et al., 2016), recent studies (Szegedy et al., 2014; Goodfellow et al., 2015) have shown that deep learning models are vulnerable to misclassifying inputs, known as adversarial examples, that are crafted with targeted but visually-imperceptible perturbations. While several defense mechanisms have been proposed and empirically demonstrated to be successful against particular existing attacks (Papernot et al., 2016; Goodfellow et al., 2015), new attacks (Carlini & Wagner, 2017; Tramèr et al., 2018; Athalye et al., 2018) are repeatedly found that circumvent such defenses. To end this arms race, recent works (Wong & Kolter, 2018; Raghunathan et al., 2018; Wong et al., 2018; Wang et al., 2018) propose methods to certify examples to be robust against some specific norm-bounded adversarial perturbations for given inputs and to train models to optimize for certifiable robustness.

However, all of the aforementioned methods aim at improving the overall robustness of the classifier. This means that the methods to improve robustness are designed to prevent seed examples in any class from being misclassified as any other class. Achieving such a goal (at least for some definitions of adversarial robustness) requires producing a perfect classifier, and has, unsurprisingly, remained elusive. Indeed, Mahloujifar et al. (2019) proved that if the metric probability space is concentrated, overall adversarial robustness is unattainable for any classifier with initial constant error. We argue that overall robustness may not be the appropriate criterion for measuring system performance in security-sensitive applications, since only certain kinds of adversarial misclassifications pose meaningful threats that provide value for potential adversaries. Whereas overall robustness places equal emphasis on every adversarial transformation, from a security perspective only certain transformations matter. As a simple example, misclassifying a malicious program as benign has far more severe consequences than the reverse.
In this paper, we propose a general method for adapting provable defenses against norm-bounded perturbations to take into account the potential harm of different adversarial class transformations. Inspired by cost-sensitive learning (Domingos, 1999; Elkan, 2001) for non-adversarial contexts, we capture the impact of different adversarial class transformations using a cost matrix $C$, where each entry represents the cost of an adversary being able to take a natural example from the first class and perturb it so as to be misclassified by the model as the second class. Instead of reducing the overall robust error, our goal is to minimize the cost-weighted robust error (which we define for both binary and real-valued costs in $C$). The proposed method incorporates the specified cost matrix into the training objective function, which encourages stronger robustness guarantees on cost-sensitive class transformations, while maintaining the overall classification accuracy on the original inputs.

Contributions. By encoding the consequences of different adversarial transformations into a cost matrix, we introduce the notion of cost-sensitive robustness (Section 3.1) as a metric to assess the expected performance of a classifier when facing adversarial examples. We propose an objective function for training a cost-sensitive robust classifier (Section 3.2). The proposed method is general in that it can incorporate any type of cost matrix, including both binary and real-valued. We demonstrate the effectiveness of the proposed cost-sensitive defense model for a variety of cost scenarios on two benchmark image classification datasets: MNIST (Section 4.1) and CIFAR10 (Section 4.2). Compared with the state-of-the-art overall robust defense model (Wong & Kolter, 2018), our model achieves significant improvements in cost-sensitive robustness for different tasks, while maintaining approximately the same classification accuracy on both datasets.

Notation. We use lower-case boldface letters such as $\mathbf{x}$ for vectors and capital boldface letters such as $\mathbf{A}$ for matrices. Let $[m]$ be the index set $\{1, 2, \ldots, m\}$ and $A_{ij}$ be the $(i,j)$-th entry of matrix $\mathbf{A}$. Denote the $i$-th natural basis vector, the all-ones vector and the identity matrix by $\mathbf{e}_i$, $\mathbf{1}$ and $\mathbf{I}$, respectively. For any vector $\mathbf{x} \in \mathbb{R}^d$, the $\ell_\infty$-norm of $\mathbf{x}$ is defined as $\|\mathbf{x}\|_\infty = \max_{i \in [d]} |x_i|$.

2 BACKGROUND

In this section, we provide a brief introduction to related topics, including neural network classifiers, adversarial examples, defenses with certified robustness, and cost-sensitive learning.

2.1 NEURAL NETWORK CLASSIFIERS

A $K$-layer neural network classifier can be represented by a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ such that $f(\mathbf{x}) = f_{K-1}(f_{K-2}(\cdots(f_1(\mathbf{x}))))$ for any $\mathbf{x} \in \mathcal{X}$. For $k \in \{1, 2, \ldots, K-2\}$, the mapping function $f_k(\cdot)$ typically consists of two operations: an affine transformation (either matrix multiplication or convolution) and a nonlinear activation. In this paper, we consider the rectified linear unit (ReLU) as the activation function. If we denote the feature vector of the $k$-th layer as $\mathbf{z}_k$, then $f_k(\cdot)$ is defined as
$$\mathbf{z}_{k+1} = f_k(\mathbf{z}_k) = \max\{\mathbf{W}_k \mathbf{z}_k + \mathbf{b}_k, \mathbf{0}\}, \quad k \in \{1, 2, \ldots, K-2\},$$
where $\mathbf{W}_k$ denotes the weight parameter matrix and $\mathbf{b}_k$ the bias vector. The output function $f_{K-1}(\cdot)$ maps the feature vector in the last hidden layer to the output space $\mathcal{Y}$ solely through matrix multiplication:
$$\mathbf{z}_K = f_{K-1}(\mathbf{z}_{K-1}) = \mathbf{W}_{K-1} \mathbf{z}_{K-1} + \mathbf{b}_{K-1},$$
where $\mathbf{z}_K$ can be regarded as the estimated score vector of input $\mathbf{x}$ for the different possible output classes.
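For concreteness, the following PyTorch sketch instantiates a small $K$-layer ReLU network of this form, using the MNIST architecture described in Section 4.1 (two convolutional layers with 16 and 32 filters, followed by fully-connected layers with 100 and 10 units). The kernel sizes, strides, and padding are illustrative assumptions, not values taken from the paper.

```python
import torch.nn as nn

# A minimal sketch of the K-layer ReLU network f described above, instantiated
# with the small MNIST architecture from Section 4.1 (two convolutional layers
# with 16 and 32 filters, then fully-connected layers with 100 and 10 units).
# Kernel sizes, strides, and padding are illustrative assumptions.
def build_mnist_classifier():
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),   # f_1: affine (convolution) + ReLU
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # f_2: affine (convolution) + ReLU
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 100),                             # f_3: affine + ReLU
        nn.ReLU(),
        nn.Linear(100, 10),                                      # f_{K-1}: final linear map to the score vector z_K
    )
```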
In the following discussions, we use $f_\theta$ to represent the neural network classifier, where $\theta = \{\mathbf{W}_1, \ldots, \mathbf{W}_{K-1}, \mathbf{b}_1, \ldots, \mathbf{b}_{K-1}\}$ denotes the model parameters. To train the neural network, a loss function $\sum_{i=1}^N L(f_\theta(\mathbf{x}_i), y_i)$ is defined for a set of training examples $\{\mathbf{x}_i, y_i\}_{i=1}^N$, where $\mathbf{x}_i$ is the $i$-th input vector and $y_i$ denotes its class label. Cross-entropy loss is typically used for multiclass image classification. With proper initialization, all model parameters are then updated iteratively using backpropagation. For any input example $\tilde{\mathbf{x}}$, the predicted label $\hat{y}$ is given by the index of the largest predicted score among all classes, $\arg\max_j [f_\theta(\tilde{\mathbf{x}})]_j$.

2.2 ADVERSARIAL EXAMPLES

An adversarial example is an input, generated by some adversary, which is visually indistinguishable from an example from the natural distribution, but is able to mislead the target classifier. Since visual indistinguishability depends on human perception, which is hard to define rigorously, we consider the most popular alternative: input examples with perturbations bounded in $\ell_\infty$-norm (Goodfellow et al., 2015). More formally, the set of adversarial examples with respect to seed example $\{\mathbf{x}_0, y_0\}$ and classifier $f_\theta(\cdot)$ is defined as
$$\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta) = \big\{\mathbf{x} \in \mathcal{X} : \|\mathbf{x} - \mathbf{x}_0\|_\infty \le \epsilon \ \text{and} \ \arg\max_j [f_\theta(\mathbf{x})]_j \neq y_0\big\}, \tag{2.1}$$
where $\epsilon > 0$ denotes the maximum perturbation distance. Although $\ell_p$ distances are commonly used in adversarial examples research, they are not an adequate measure of perceptual similarity (Sharif et al., 2018), and other minimal geometric transformations can be used to find adversarial examples (Engstrom et al., 2017; Kanbak et al., 2018; Xiao et al., 2018). Nevertheless, there is considerable interest in improving robustness in this simple domain, and hope that as this research area matures we will find ways to apply results from studying simplified problems to more realistic ones.

2.3 DEFENSES WITH CERTIFIED ROBUSTNESS

A line of recent work has proposed defenses that are guaranteed to be robust against norm-bounded adversarial perturbations. Hein & Andriushchenko (2017) proved formal robustness guarantees against $\ell_2$-norm bounded perturbations for two-layer neural networks, and provided a training method based on a surrogate robust bound. Raghunathan et al. (2018) developed an approach based on semidefinite relaxation for training certified robust classifiers, but it was limited to two-layer fully-connected networks. Our work builds most directly on Wong & Kolter (2018), which can be applied to deep ReLU-based networks and achieves the state-of-the-art certified robustness on the MNIST dataset.

Following the definitions in Wong & Kolter (2018), an adversarial polytope $\mathcal{Z}_\epsilon(\mathbf{x})$ with respect to a given example $\mathbf{x}$ is defined as
$$\mathcal{Z}_\epsilon(\mathbf{x}) = \big\{f_\theta(\mathbf{x} + \boldsymbol{\Delta}) : \|\boldsymbol{\Delta}\|_\infty \le \epsilon\big\}, \tag{2.2}$$
which contains all the possible output vectors for the given classifier $f_\theta$ obtained by perturbing $\mathbf{x}$ within an $\ell_\infty$-norm ball with radius $\epsilon$. A seed example $\{\mathbf{x}_0, y_0\}$ is said to be certified robust with respect to maximum perturbation distance $\epsilon$ if the corresponding adversarial example set $\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta)$ is empty. Equivalently, if we solve, for every output class $y_{\mathrm{targ}} \neq y_0$, the optimization problem
$$\min_{\mathbf{z}_K} \; [\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}, \quad \text{subject to} \ \mathbf{z}_K \in \mathcal{Z}_\epsilon(\mathbf{x}_0), \tag{2.3}$$
then according to the definition of $\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta)$ in (2.1), $\{\mathbf{x}_0, y_0\}$ is guaranteed to be robust provided that the optimal objective value of (2.3) is positive for every output class.
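As a concrete illustration of this certification criterion, the sketch below checks robustness for a single seed example, assuming some sound procedure (such as the dual-network bound introduced next) has already produced a lower bound on the margin $[\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}$ for every target class; the function name and array layout are our own.

```python
import numpy as np

def is_certified_robust(margin_lower_bounds: np.ndarray, true_label: int) -> bool:
    """Certification check for one seed example.

    margin_lower_bounds: length-m array whose j-th entry is a sound lower bound
    on [z_K]_{y0} - [z_K]_j over the adversarial polytope (the entry for j = y0
    is ignored). The example is certified robust if every other bound is positive.
    """
    margins = np.delete(margin_lower_bounds, true_label)
    return bool(np.all(margins > 0))
```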
To train a robust model on a given dataset $\{\mathbf{x}_i, y_i\}_{i=1}^N$, standard robust optimization aims to minimize the sample loss function at the worst-case locations through the following adversarial loss:
$$\min_\theta \; \frac{1}{N}\sum_{i=1}^N \max_{\|\boldsymbol{\Delta}\|_\infty \le \epsilon} L\big(f_\theta(\mathbf{x}_i + \boldsymbol{\Delta}), y_i\big), \tag{2.4}$$
where $L(\cdot, \cdot)$ denotes the cross-entropy loss. However, due to the nonconvexity of the neural network classifier $f_\theta(\cdot)$ introduced by the nonlinear ReLU activations, both the adversarial polytope (2.2) and the training objective (2.4) are highly nonconvex. In addition, solving optimization problem (2.3) for each pair of input example and output class is computationally intractable. Instead of solving the optimization problem directly, Wong & Kolter (2018) proposed an alternative training objective function based on convex relaxation, which can be efficiently optimized through a dual network. Specifically, they relaxed $\mathcal{Z}_\epsilon(\mathbf{x})$ into a convex outer adversarial polytope $\tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$ by replacing the ReLU constraint $z = \max\{\hat{z}, 0\}$ at each neuron with the set of linear inequalities
$$z \ge 0, \quad z \ge \hat{z}, \quad -u\hat{z} + (u - \ell)z \le -u\ell, \tag{2.5}$$
where $\ell, u$ denote the lower and upper bounds on the considered pre-ReLU activation.[1] Based on the relaxed outer bound $\tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$, they propose the following alternative optimization problem,
$$\min_{\mathbf{z}_K} \; [\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}, \quad \text{subject to} \ \mathbf{z}_K \in \tilde{\mathcal{Z}}_\epsilon(\mathbf{x}_0), \tag{2.6}$$
which is in fact a linear program. Since $\mathcal{Z}_\epsilon(\mathbf{x}) \subseteq \tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$ for any $\mathbf{x} \in \mathcal{X}$, solving (2.6) for all output classes provides stronger robustness guarantees compared with (2.3), provided all the optimal objective values are positive. In addition, they derived a guaranteed lower bound, denoted by $J_\epsilon(\mathbf{x}_0, g_\theta(\mathbf{e}_{y_0} - \mathbf{e}_{y_{\mathrm{targ}}}))$, on the optimal objective value of (2.6) using duality theory, where $g_\theta(\cdot)$ is a $K$-layer feedforward dual network (Theorem 1 in Wong & Kolter (2018)). Finally, according to the properties of the cross-entropy loss, they minimize the following objective to train the robust model, which serves as an upper bound of the adversarial loss (2.4):
$$\min_\theta \; \frac{1}{N}\sum_{i=1}^N L\Big(-J_\epsilon\big(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i}\mathbf{1}^\top - \mathbf{I})\big), y_i\Big), \tag{2.7}$$
where $g_\theta(\cdot)$ is regarded as a columnwise function when applied to a matrix.

[1] The elementwise activation bounds can be computed efficiently using Algorithm 1 in Wong & Kolter (2018).

Although the proposed method in Wong & Kolter (2018) achieves certified robustness, its computational complexity is quadratic in the network size in the worst case, so it only scales to small networks. Recently, Wong et al. (2018) extended the training procedure to scale to larger networks by using nonlinear random projections. However, when the network size allows for both methods, we observe a small decrease in performance using the training method provided in Wong et al. (2018). Therefore, we only use the approximation techniques for the experiments on CIFAR10 (Section 4.2), and use the less scalable method for the MNIST experiments (Section 4.1).

2.4 COST-SENSITIVE LEARNING

Cost-sensitive learning (Domingos, 1999; Elkan, 2001; Liu & Zhou, 2006) was proposed to deal with unequal misclassification costs and class imbalance problems commonly found in classification applications. The key observation is that cost-blind learning algorithms tend to be dominated by the majority class, even though the neglected minority class is often the one of primary interest. For example, in medical diagnosis, misclassifying a rare cancerous lesion as benign is extremely costly.
Various cost-sensitive learning algorithms (Kukar & Kononenko, 1998; Zadrozny et al., 2003; Zhou & Liu, 2010; Khan et al., 2018) have been proposed in the literature, but only a few algorithms, limited to simple classifiers, have considered adversarial settings.[2] Dalvi et al. (2004) studied the naive Bayes classifier for spam detection in the presence of a cost-sensitive adversary, and developed an adversary-aware classifier based on game theory. Asif et al. (2015) proposed a cost-sensitive robust minimax approach that hardens a linear discriminant classifier with robustness in the adversarial context. All of these methods are designed for simple linear classifiers, and cannot be directly extended to neural network classifiers. In addition, the robustness of their proposed classifiers is only examined experimentally based on the performance against some specific adversary, so they do not provide any notion of certified robustness. Recently, Dreossi et al. (2018) advocated for the idea of using application-level semantics in adversarial analysis; however, they did not provide a formal method for training such a classifier. Our work provides a practical training method that hardens neural network classifiers with certified cost-sensitive robustness against adversarial perturbations.

[2] Given the vulnerability of standard classifiers to adversarial examples, it is not surprising that standard cost-sensitive classifiers are also ineffective against adversaries. The experiments described in Appendix B support this expectation.

3 TRAINING A COST-SENSITIVE ROBUST CLASSIFIER

The approach introduced in Wong & Kolter (2018) penalizes all adversarial class transformations equally, even though the consequences of adversarial examples usually depend on the specific class transformations. Here, we provide a formal definition of cost-sensitive robustness (Section 3.1) and propose a general method for training cost-sensitive robust models (Section 3.2).

3.1 CERTIFIED COST-SENSITIVE ROBUSTNESS

Our approach uses a cost matrix $C$ that encodes the cost (i.e., potential harm to the model deployer) of different adversarial examples. First, we consider the case where there are $m$ classes and $C$ is an $m \times m$ binary matrix with $C_{jj'} \in \{0, 1\}$. The value $C_{jj'}$ indicates whether we care about an adversary transforming a seed input in class $j$ into one recognized by the model as being in class $j'$: if the adversarial transformation $j \rightarrow j'$ matters, $C_{jj'} = 1$, otherwise $C_{jj'} = 0$. Let $\Omega_j = \{j' \in [m] : C_{jj'} \neq 0\}$ be the index set of output classes that induce cost with respect to input class $j$. For any $j \in [m]$, let $\delta_j = 0$ if $\Omega_j$ is an empty set, and $\delta_j = 1$ otherwise. We are only concerned with adversarial transformations from a seed class $j$ to target classes $j' \in \Omega_j$.

For any example $\mathbf{x}$ in seed class $j$, $\mathbf{x}$ is said to be certified cost-sensitive robust if the lower bound $J_\epsilon(\mathbf{x}, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'})) \ge 0$ for all $j' \in \Omega_j$. That is, no adversarial perturbation in an $\ell_\infty$-norm ball around $\mathbf{x}$ with radius $\epsilon$ can mislead the classifier to any target class in $\Omega_j$. The cost-sensitive robust error on a dataset $\{\mathbf{x}_i, y_i\}_{i=1}^N$ is defined as the number of examples that are not guaranteed to be cost-sensitive robust over the number of non-zero-cost candidate seed examples:
$$\text{cost-sensitive robust error} = 1 - \frac{\#\big\{i \in [N] : \delta_{y_i} = 1 \ \text{and} \ J_\epsilon\big(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_{j'})\big) \ge 0 \ \text{for all} \ j' \in \Omega_{y_i}\big\}}{\sum_{j : \delta_j = 1} N_j},$$
where $\#A$ represents the cardinality of a set $A$, and $N_j$ is the total number of examples in class $j$.
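The definition above translates directly into code. The sketch below computes the binary cost-sensitive robust error from a precomputed matrix of certified lower bounds, which we assume has shape (N, m) with entry (i, j) equal to $J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_j))$; the function and variable names are illustrative.

```python
import numpy as np

def cost_sensitive_robust_error(J: np.ndarray, labels: np.ndarray, C: np.ndarray) -> float:
    """Binary cost-sensitive robust error, as defined above.

    J      : (N, m) array of certified lower bounds, assumed precomputed.
    labels : (N,) array of true classes y_i.
    C      : (m, m) binary cost matrix; Omega_j = {j' : C[j, j'] != 0}.
    """
    num_not_robust, num_candidates = 0, 0
    for i, y in enumerate(labels):
        omega = np.flatnonzero(C[y] != 0)   # cost-sensitive target classes for seed class y
        if omega.size == 0:                  # delta_y = 0: this class contributes no candidates
            continue
        num_candidates += 1
        if np.any(J[i, omega] < 0):          # some valued transformation cannot be certified
            num_not_robust += 1
    return num_not_robust / num_candidates
```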
Next, we consider the more general case where $C$ is an $m \times m$ real-valued cost matrix. Each entry of $C$ is a non-negative real number representing the cost of the corresponding adversarial transformation. To take into account the different potential costs among adversarial examples, we measure cost-sensitive robustness by the average certified cost of adversarial examples. The cost of an adversarial example $\mathbf{x}$ in class $j$ is defined as the sum of all $C_{jj'}$ such that $J_\epsilon(\mathbf{x}, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'})) < 0$. Intuitively, an adversarial example induces more cost if it can be adversarially misclassified as more target classes with high cost. Accordingly, the robust cost is defined as the total cost of adversarial examples divided by the total number of valued seed examples:
$$\text{robust cost} = \frac{\sum_{i \in [N]} \sum_{j' \in \Omega_{y_i}} C_{y_i j'} \, \mathbf{1}\big(J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_{j'})) < 0\big)}{\sum_{j : \delta_j = 1} N_j}, \tag{3.1}$$
where $\mathbf{1}(\cdot)$ denotes the indicator function.

3.2 COST-SENSITIVE ROBUST OPTIMIZATION

Recall that our goal is to develop a classifier with certified cost-sensitive robustness, as defined in Section 3.1, while maintaining overall classification accuracy. Based on the guaranteed lower bound $J_\epsilon(\mathbf{x}_0, g_\theta(\mathbf{e}_{y_0} - \mathbf{e}_{y_{\mathrm{targ}}}))$ on (2.6), and inspired by the cost-sensitive CE loss (Khan et al., 2018), we propose the following robust optimization problem with respect to a neural network classifier $f_\theta$:
$$\min_\theta \; \frac{1}{N}\bigg[\sum_{i \in [N]} L\big(f_\theta(\mathbf{x}_i), y_i\big) + \alpha \sum_{j \in [m]} \sum_{i\,:\,y_i = j} \log\Big(1 + \sum_{j' \in \Omega_j} C_{jj'} \exp\big(-J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'}))\big)\Big)\bigg], \tag{3.2}$$
where $\alpha \ge 0$ denotes the regularization parameter. The first term in (3.2) is the cross-entropy loss for standard classification, whereas the second term accounts for cost-sensitive robustness. Compared with the overall robustness training objective (2.7), we include a regularization parameter $\alpha$ to control the trade-off between classification accuracy on original inputs and adversarial robustness. To provide cost-sensitivity, the loss function selectively penalizes adversarial examples based on their cost. For binary cost matrices, the regularization term penalizes every cost-sensitive adversarial example equally, but has no impact on instances where $C_{jj'} = 0$. For real-valued costs, a larger value of $C_{jj'}$ increases the weight of the corresponding adversarial transformation in the training objective. The optimization problem (3.2) can be solved efficiently using gradient-based algorithms, such as stochastic gradient descent and Adam (Kingma & Ba, 2015).
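The sketch below shows one way objective (3.2) can be written as a batch loss in PyTorch, assuming a differentiable routine has already produced the matrix of certified lower bounds J (e.g., the dual network of Wong & Kolter, 2018). The interface is our own, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cost_sensitive_robust_loss(logits, J, labels, C, alpha):
    """One possible batch form of objective (3.2).

    logits : (N, m) clean predictions f_theta(x_i).
    J      : (N, m) certified lower bounds J_eps(x_i, g_theta(e_{y_i} - e_j)),
             assumed to come from a differentiable dual-network routine.
    labels : (N,) tensor of true classes.
    C      : (m, m) cost matrix (binary or real-valued) with a zero diagonal,
             so the j' = y_i term contributes nothing.
    alpha  : regularization parameter trading off accuracy and robustness.
    """
    clean_loss = F.cross_entropy(logits, labels, reduction="sum")
    costs = torch.as_tensor(C, dtype=J.dtype, device=J.device)[labels]   # (N, m): row y_i of C
    robust_term = torch.log1p((costs * torch.exp(-J)).sum(dim=1)).sum()  # log(1 + sum_{j'} C_{y_i j'} exp(-J))
    return (clean_loss + alpha * robust_term) / labels.shape[0]
```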
4 EXPERIMENTS

We evaluate the performance of our cost-sensitive robustness training method on models for two benchmark image classification datasets: MNIST (LeCun et al., 2010) and CIFAR10 (Krizhevsky & Hinton, 2009). We compare our results for various cost scenarios with overall robustness training (Section 2.3) as a baseline. For both datasets, the relevant family of attacks is specified as all adversarial perturbations that are bounded in an $\ell_\infty$-norm ball. Our goal in the experiments is to evaluate how well a variety of different types of cost matrices can be supported. MNIST and CIFAR10 are toy datasets, so there are no obvious cost matrices that correspond to meaningful security applications for these datasets. Instead, we select representative tasks and design cost matrices to capture them.

4.1 MNIST

For MNIST, we use the same convolutional neural network architecture (LeCun et al., 1998) as Wong & Kolter (2018), which includes two convolutional layers, with 16 and 32 filters respectively, and two fully-connected layers, consisting of 100 and 10 hidden units respectively. ReLU activations are applied to each layer except the last one. For both our cost-sensitive robust model and the overall robust model, we randomly split the 60,000 training samples into five folds of equal size, and train the classifier for 60 epochs on four of them using the Adam optimizer (Kingma & Ba, 2015) with batch size 50 and learning rate 0.001. We treat the remaining fold as a validation dataset for model selection. In addition, we use ϵ-scheduling and learning rate decay, where we increase ϵ linearly from 0.05 to the desired value over the first 20 epochs and decay the learning rate by 0.5 every 10 epochs for the remaining epochs.

Baseline: Overall Robustness. Figure 1(a) illustrates the learning curves of both the classification error and the overall robust error during training based on the robust loss (2.7) with maximum perturbation distance ϵ = 0.2. The model with classification error less than 4% and minimum overall robust error on the validation dataset is selected over the 60 training epochs. The best classifier reaches 3.39% classification error and 13.80% overall robust error on the 10,000 MNIST testing samples. We report the robust test error for every adversarial transformation in Figure 1(b) (for the model without any robustness training, all of the values are 100%). The (i, j)-th entry is a bound on the robustness of that seed-target transformation: the fraction of testing examples in class i that cannot be certified robust against transformation into class j under any $\ell_\infty$ attack bounded by ϵ.

Figure 1: Preliminary results on MNIST using the overall robust classifier: (a) learning curves of the classification error and overall robust error over the 60 training epochs; (b) heatmap of the robust test error for pairwise class transformations based on the best trained classifier.
As shown in Figure 1(b), the vulnerability to adversarial transformations differs considerably among class pairs and appears correlated with perceptual similarity. For instance, only 0.26% of seeds in class 1 cannot be certified robust for target class 9, compared to about 10% of seeds from class 9 into class 4.

Binary Cost Matrix. Next, we evaluate the effectiveness of cost-sensitive robustness training in producing models that are more robust for adversarial transformations designated as valuable. We consider four types of tasks defined by different binary cost matrices that capture different sets of adversarial transformations:

- single pair: a particular seed class s to a particular target class t;
- single seed: a particular seed class s to any target class;
- single target: any seed class to a particular target class t; and
- multiple: multiple seed and target classes.

For each setting, the cost matrix is defined as $C_{ij} = 1$ if $(i, j)$ is selected, and $C_{ij} = 0$ otherwise (a construction sketch for these matrices is given below). In general, we expect that the sparser the cost matrix, the more opportunity there is for cost-sensitive training to improve cost-sensitive robustness over models trained for overall robustness.

For the single pair task, we selected three representative adversarial goals: a low vulnerability pair (0, 2), a medium vulnerability pair (6, 5) and a high vulnerability pair (4, 9). We selected these pairs by considering the robust error results on the overall-robustness trained model (Figure 1(b)) as a rough measure of transformation hardness. This is generally consistent with intuitions about the MNIST digit classes (e.g., 9 and 4 look similar, so it is harder to induce robustness against that adversarial transformation), as well as with the visualization results produced by dimension reduction techniques such as t-SNE (Maaten & Hinton, 2008).
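For concreteness, the binary cost matrices for the four task types can be constructed as in the following sketch (class indices and helper names are illustrative):

```python
import numpy as np

def single_pair(m, s, t):
    """C_st = 1 only for the chosen seed-target pair (s, t)."""
    C = np.zeros((m, m)); C[s, t] = 1.0; return C

def single_seed(m, s):
    """All transformations out of seed class s are valued."""
    C = np.zeros((m, m)); C[s, :] = 1.0; C[s, s] = 0.0; return C

def single_target(m, t):
    """All transformations into target class t are valued."""
    C = np.zeros((m, m)); C[:, t] = 1.0; C[t, t] = 0.0; return C

def multiple(m, pairs):
    """An arbitrary set of valued seed-target pairs, e.g. the ten most vulnerable ones."""
    C = np.zeros((m, m))
    for s, t in pairs:
        C[s, t] = 1.0
    return C

# Example: the medium-vulnerability single pair task (6, 5) on MNIST (m = 10).
C_pair = single_pair(10, 6, 5)
```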
Similarly, for the single seed and single target tasks we select three representative examples representing low, medium, and high vulnerability to include in Table 1, and provide full results for all the single-seed and single-target tasks for MNIST in Figure 2.

Figure 2: Cost-sensitive robust error using the proposed model and baseline model on MNIST for different binary tasks: (a) treating each digit as the seed class of concern; (b) treating each digit as the target class of concern.

For the multiple transformations task, we consider four variations: (i) the ten most vulnerable seed-target transformations; (ii) ten randomly-selected seed-target transformations; (iii) all the class transformations from any odd digit seed class to any other class; (iv) all the class transformations from any even digit seed class to any other class.

Table 1 summarizes the results, comparing the cost-sensitive robust error between the baseline model trained for overall robustness and a model trained using our cost-sensitive robust optimization. The cost-sensitive robust defense model is trained with ϵ = 0.2 based on the loss function (3.2) and the corresponding cost matrix C. The regularization parameter α is tuned via cross validation (see Appendix A for details). We report the selected best α, classification error and cost-sensitive robust error on the testing dataset.

Table 1: Comparisons between different robust defense models on the MNIST dataset against $\ell_\infty$ norm-bounded adversarial perturbations with ϵ = 0.2. The sparsity gives the number of non-zero entries in the cost matrix over the total number of possible adversarial transformations. The candidates column is the number of potential seed examples for each task.

| Task | Description | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Error (baseline) | Robust Error (ours) |
|---|---|---|---|---|---|---|---|---|
| single pair | (0,2) | 1/90 | 980 | 10.0 | 3.39% | 2.68% | 0.92% | 0.31% |
| single pair | (6,5) | 1/90 | 958 | 5.0 | 3.39% | 2.49% | 3.55% | 0.42% |
| single pair | (4,9) | 1/90 | 982 | 4.0 | 3.39% | 3.00% | 10.08% | 1.02% |
| single seed | digit 0 | 9/90 | 980 | 10.0 | 3.39% | 3.48% | 3.67% | 0.92% |
| single seed | digit 2 | 9/90 | 1032 | 1.0 | 3.39% | 2.91% | 14.34% | 3.68% |
| single seed | digit 8 | 9/90 | 974 | 0.4 | 3.39% | 3.37% | 22.28% | 5.75% |
| single target | digit 1 | 9/90 | 8865 | 4.0 | 3.39% | 3.29% | 2.23% | 0.14% |
| single target | digit 5 | 9/90 | 9108 | 2.0 | 3.39% | 3.24% | 3.10% | 0.29% |
| single target | digit 8 | 9/90 | 9026 | 1.0 | 3.39% | 3.52% | 5.24% | 0.54% |
| multiple | top 10 | 10/90 | 6024 | 0.4 | 3.39% | 3.34% | 11.14% | 7.02% |
| multiple | random 10 | 10/90 | 7028 | 0.4 | 3.39% | 3.18% | 5.01% | 2.18% |
| multiple | odd digit | 45/90 | 5074 | 0.2 | 3.39% | 3.30% | 14.45% | 9.97% |
| multiple | even digit | 45/90 | 4926 | 0.1 | 3.39% | 2.82% | 13.13% | 9.44% |

Our model achieves a substantial improvement in cost-sensitive robustness compared with the baseline model on all of the considered tasks, with no significant increase in normal classification error. The cost-sensitive robust error reduction varies from 30% to 90%, and is generally higher for sparse cost matrices. In particular, our classifier reduces the number of cost-sensitive adversarial examples from 198 to 12 on the single target task with digit 1 as the target class.

Real-valued Cost Matrices. Loosely motivated by a check forging adversary who obtains value by changing the semantic interpretation of a number (Papernot et al., 2016), we consider two real-valued cost matrices: small-large, where only adversarial transformations from a smaller digit class to a larger one are valued, and the cost of a valued transformation is quadratic in the absolute difference between the seed and target class digits, i.e., $C_{ij} = (i - j)^2$ if $j > i$ and $C_{ij} = 0$ otherwise; and large-small, where only adversarial transformations from a larger digit class to a smaller one are valued, i.e., $C_{ij} = (i - j)^2$ if $i > j$ and $C_{ij} = 0$ otherwise. We tune α for the cost-sensitive robust model on the MNIST training dataset via cross validation, and set all the other parameters the same as in the binary case. The certified robust error for every adversarial transformation on the MNIST testing dataset is shown in Figure 3, and the classification error and robust cost are given in Table 2. Compared with the model trained for overall robustness (Figure 1(b)), our trained classifier achieves stronger robustness guarantees on the adversarial transformations that induce costs, especially for those with larger costs.

Table 2: Comparison results of different robust defense models for tasks with real-valued cost matrices.

| Dataset | Task | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Cost (baseline) | Robust Cost (ours) |
|---|---|---|---|---|---|---|---|---|
| MNIST | small-large | 45/90 | 10000 | 0.04 | 3.39% | 3.47% | 2.245 | 0.947 |
| MNIST | large-small | 45/90 | 10000 | 0.04 | 3.39% | 3.13% | 3.344 | 1.549 |
| CIFAR | vehicle | 40/90 | 4000 | 0.1 | 31.80% | 26.19% | 4.183 | 3.095 |
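The two real-valued cost matrices can be built as in the sketch below (helper names are our own); note that large-small is simply the transpose of small-large, since $(i - j)^2$ is symmetric.

```python
import numpy as np

def small_large(m=10):
    """C_ij = (i - j)^2 for j > i (smaller digit perturbed to a larger one), else 0."""
    C = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if j > i:
                C[i, j] = (i - j) ** 2
    return C

def large_small(m=10):
    """C_ij = (i - j)^2 for i > j (larger digit perturbed to a smaller one), else 0."""
    return small_large(m).T
```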
Figure 3: Heatmaps of robust test error using our cost-sensitive robust classifier on MNIST for the real-valued cost tasks: (a) small-large; (b) large-small.

4.2 CIFAR10

We use the same neural network architecture for the CIFAR10 dataset as Wong et al. (2018), with four convolutional layers and two fully-connected layers. For memory and computational efficiency, we incorporate the approximation technique based on nonlinear random projections during the training phase (Wong et al. (2018), Section 3.2). We train both the baseline model and our model using random projections of 50 dimensions, and optimize the training objective using SGD.
Other parameters such as learning rate and batch size are set the same as those in Wong et al. (2018). Given a specific task, we train the cost-sensitive robust classifier on 80% randomly-selected training examples, and tune the regularization parameter α according to the performance on the remaining examples, which serve as a validation dataset. The tasks are similar to those for MNIST (Section 4.1), except that for the multiple transformations task we cluster the ten CIFAR10 classes into two large groups, animals and vehicles, and consider the cases where only transformations from an animal class to a vehicle class (A-V), or the converse (V-A), are sensitive.

Table 3 shows results on the testing data based on different robust defense models with ϵ = 2/255. For all of the aforementioned tasks, our models substantially reduce the cost-sensitive robust error while keeping a lower classification error than the baseline.

Table 3: Cost-sensitive robust models for the CIFAR10 dataset against adversarial examples, ϵ = 2/255.

| Task | Description | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Error (baseline) | Robust Error (ours) |
|---|---|---|---|---|---|---|---|---|
| single pair | (frog, bird) | 1/90 | 1000 | 10.0 | 31.80% | 27.88% | 19.90% | 1.20% |
| single pair | (cat, plane) | 1/90 | 1000 | 10.0 | 31.80% | 28.63% | 9.30% | 2.60% |
| single seed | dog | 9/90 | 1000 | 0.2 | 31.80% | 30.69% | 57.20% | 28.90% |
| single seed | truck | 9/90 | 1000 | 0.8 | 31.80% | 31.55% | 35.60% | 15.40% |
| single target | deer | 9/90 | 9000 | 0.1 | 31.80% | 26.69% | 16.99% | 3.77% |
| single target | ship | 9/90 | 9000 | 0.1 | 31.80% | 24.80% | 9.42% | 3.06% |
| multiple | A-V | 24/90 | 6000 | 0.1 | 31.80% | 26.65% | 16.67% | 7.42% |
| multiple | V-A | 24/90 | 4000 | 0.2 | 31.80% | 27.60% | 12.07% | 8.00% |

For the real-valued task, we are concerned with adversarial transformations from seed examples in vehicle classes to all other target classes. In addition, more cost is placed on transformations from vehicle to animal: the cost is 10 times larger than that of a transformation from vehicle to vehicle. Figures 4(a) and 4(b) illustrate the pairwise robust test error using the overall robust model and the proposed classifier for this real-valued task on CIFAR10.

Figure 4: Heatmaps of robust test error for the real-valued task on CIFAR10 using different robust classifiers: (a) baseline model; (b) our proposed cost-sensitive robust model.

4.3 VARYING ADVERSARY STRENGTH

We investigate the performance of our model against different levels of adversarial strength by varying the value of ϵ that defines the $\ell_\infty$ ball available to the adversary. Figure 5 shows the overall classification error and cost-sensitive robust error of our best trained model, compared with the baseline model, on the MNIST single seed task with digit 9 and the CIFAR10 single seed task with dog as the seed class of concern, as we vary the maximum $\ell_\infty$ perturbation distance.

Figure 5: Results for different adversary strengths ϵ: (a) MNIST single seed task with digit 9 as the chosen class; (b) CIFAR10 single seed task with dog as the chosen class.

Under all the considered attack models, the proposed classifier achieves better cost-sensitive adversarial robustness than the baseline, while maintaining similar classification accuracy on original data points. As the adversarial strength increases, the improvement for cost-sensitive robustness over overall robustness becomes more significant.

5 CONCLUSION

By focusing on overall robustness, previous robustness training methods expend a large fraction of the capacity of the network on unimportant transformations. We argue that for most scenarios, the actual harm caused by an adversarial transformation often varies depending on the seed and target class, so robust training methods should be designed to account for these differences. By incorporating a cost matrix into the training objective, we develop a general method for producing a cost-sensitive robust classifier.
Our experimental results show that our cost-sensitive training method works across a variety of different types of cost matrices, so we believe it can be generalized to other cost matrix scenarios that would be found in realistic applications. There remains a large gap between the small models and limited attacker capabilities for which we can achieve certifiable robustness, and the complex models and unconstrained attacks that may be important in practice. The scalability of our techniques is limited to the toy models and simple attack norms for which certifiable robustness is currently feasible, so considerable progress is needed before they can be applied to realistic scenarios. However, we hope that considering cost-sensitive robustness instead of overall robustness is a step towards achieving more realistic robustness goals.

AVAILABILITY

Our implementation, including code for reproducing all our experiments, is available as open source code at https://github.com/xiaozhanguva/Cost-Sensitive-Robustness.

ACKNOWLEDGEMENTS

We thank Eric Wong for providing the implementation of certified robustness we built on, as well as for insightful discussions. We thank Jianfeng Chi for helpful advice on implementing our experiments. This work was supported by grants from the National Science Foundation (#1804603 and #1619098) and research awards from Amazon, Baidu, and Intel.

REFERENCES

Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.

Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classification. In 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2004.

Pedro Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

Tommaso Dreossi, Somesh Jha, and Sanjit A. Seshia. Semantic adversarial deep learning. In International Conference on Computer Aided Verification, 2018.

Charles Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, 2001.

Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, 2017.

Can Kanbak, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Geometric robustness of deep networks: Analysis and improvement. In Computer Vision and Pattern Recognition, 2018.
Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573-3587, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Matjaž Kukar and Igor Kononenko. Cost-sensitive learning with neural networks. In 13th European Conference on Artificial Intelligence, 1998.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist, 2010.

Xu-Ying Liu and Zhi-Hua Zhou. The influence of class imbalance on cost-sensitive learning: An empirical study. In Sixth International Conference on Data Mining, 2006.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

Saeed Mahloujifar, Dimitrios I. Diochnos, and Mohammad Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In AAAI Conference on Artificial Intelligence, 2019.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016.

Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.

Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using two dimensional binary program features. In 10th International Conference on Malicious and Unwanted Software, 2015.

Mahmood Sharif, Lujo Bauer, and Michael K. Reiter. On the suitability of Lp-norms for creating and preventing adversarial examples. In CVPR Workshop on Bright and Dark Sides of Computer Vision: Challenges and Opportunities for Privacy and Security, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.

Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal security analysis of neural networks using symbolic intervals. In USENIX Security Symposium, 2018.

Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 2018.

Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Conference on Neural Information Processing Systems, 2018.

Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, 2003.

Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232-257, 2010.

Cost-Sensitive Robustness against Adversarial Examples: Supplemental Materials

A PARAMETER TUNING

For the experiments on the MNIST dataset, we first perform a coarse tuning of the regularization parameter α over the search grid $\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, and select the most appropriate value, denoted by $\alpha_{\text{coarse}}$, with overall classification error less than 4% and the lowest cost-sensitive robust error on the validation dataset. Then, we further fine-tune α over the range $\{2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\} \cdot \alpha_{\text{coarse}}$, and choose the best robust model according to the same criteria.

Figures 6(a) and 6(b) show the learning curves for the single seed task with digit 9 as the selected seed class, based on the proposed cost-sensitive robust model with varying α (we show digit 9 because it is one of the most vulnerable seed classes). The results suggest that as the value of α increases, the corresponding classifier has a lower cost-sensitive robust error but a higher classification error, which is what we expect from the design of (3.2). We observe similar trends in the learning curves for the other tasks, so do not present them here. For the CIFAR10 experiments, a similar tuning strategy is implemented; the only difference is that we use 35% as the threshold on overall classification error for selecting the best α.

Figure 6: Learning curves for the single seed task with digit 9 as the selected seed class on MNIST using the proposed model with varying α: (a) learning curves of classification error; (b) learning curves of cost-sensitive robust error. The maximum perturbation distance is specified as ϵ = 0.2.

B COMPARISON WITH STANDARD COST-SENSITIVE CLASSIFIER

As discussed in Section 2.4, prior work on cost-sensitive learning mainly focuses on the non-adversarial setting. In this section, we investigate the robustness of the cross-entropy based cost-sensitive classifier proposed in Khan et al. (2018), and compare the performance of their classifier with our proposed cost-sensitive robust classifier. Given a set of training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ and a cost matrix $C$ with each entry representing the cost of the corresponding misclassification, the evaluation metric for cost-sensitive learning is defined as the average cost of misclassifications, or more concretely,
$$\text{misclassification cost} = \frac{1}{N} \sum_{i \in [N]} C_{y_i \hat{y}_i}, \quad \text{where} \ \hat{y}_i = \arg\max_{j \in [m]} [f_\theta(\mathbf{x}_i)]_j,$$
where $m$ is the total number of class labels and $f_\theta(\cdot)$ denotes the neural network classifier as introduced in Section 2.1. In addition, the cross-entropy based cost-sensitive training objective takes the following form:
$$\min_\theta \; \frac{1}{N} \sum_{j \in [m]} \sum_{i\,:\,y_i = j} \log\Big(1 + \sum_{j' \neq j} C_{jj'} \exp\big([f_\theta(\mathbf{x}_i)]_{j'} - [f_\theta(\mathbf{x}_i)]_j\big)\Big). \tag{B.1}$$
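A possible batch implementation of objective (B.1) is sketched below, assuming a zero-diagonal cost matrix as in the experiment that follows; the function name and interface are illustrative.

```python
import torch

def cost_sensitive_ce_loss(logits, labels, C):
    """One possible batch form of the standard cost-sensitive objective (B.1).

    logits : (N, m) scores f_theta(x_i); labels : (N,) true classes;
    C      : (m, m) cost matrix, assumed to have a zero diagonal.
    """
    costs = torch.as_tensor(C, dtype=logits.dtype, device=logits.device)[labels]  # (N, m): row y_i of C
    true_scores = logits.gather(1, labels.unsqueeze(1))                            # (N, 1): [f(x_i)]_{y_i}
    margins = logits - true_scores                                                 # [f(x_i)]_{j'} - [f(x_i)]_{y_i}
    return torch.log1p((costs * torch.exp(margins)).sum(dim=1)).mean()
```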
To provide a fair comparison, we assume the cost matrix used for (B.1) coincides with the cost matrix used for cost-sensitive robust training (3.2) in our experiment, although the two are unlikely to be the same in real security applications. For instance, misclassifying a benign program as malicious may still induce some cost in the non-adversarial setting, whereas the adversary may only benefit from transforming a malicious program into a benign one. We consider the small-large real-valued task for MNIST, where the cost matrix $C$ is designed as $C_{ij} = 0.1$ if $i > j$, $C_{ij} = 0$ if $i = j$, and $C_{ij} = (i - j)^2$ otherwise.

Table 4: Comparison of different trained classifiers for the small-large real-valued task on MNIST with maximum perturbation distance ϵ = 0.2.

| Classifier | Classification Error | Misclassification Cost | Robust Cost |
|---|---|---|---|
| Baseline | 0.91% | 0.054 | 85.197 |
| Cost-Sensitive Standard | 2.57% | 0.016 | 85.197 |
| Overall Robustness | 3.49% | 0.252 | 1.982 |
| Cost-Sensitive Robustness | 3.38% | 0.060 | 0.915 |

Table 4 shows the comparison of the different classifiers in this setting: the baseline standard deep learning classifier, a standard cost-sensitive classifier (Khan et al., 2018) trained using (B.1), a classifier trained for overall robustness (Wong & Kolter, 2018), and our proposed classifier trained for cost-sensitive robustness. Compared with the baseline, the standard cost-sensitive classifier indeed reduces the misclassification cost, but it does not provide any improvement in the robust cost, as defined in (3.1). In sharp contrast, our robust training method significantly improves cost-sensitive robustness.