# Cost-Sensitive Robustness against Adversarial Examples

Published as a conference paper at ICLR 2019

Xiao Zhang
Department of Computer Science, University of Virginia
xz7bc@virginia.edu

David Evans
Department of Computer Science, University of Virginia
evans@virginia.edu

ABSTRACT

Several recent works have developed methods for training classifiers that are certifiably robust against norm-bounded adversarial perturbations. These methods assume that all the adversarial transformations are equally important, which is seldom the case in real-world applications. We advocate for cost-sensitive robustness as the criterion for measuring the classifier's performance for tasks where some adversarial transformations are more important than others. We encode the potential harm of each adversarial transformation in a cost matrix, and propose a general objective function to adapt the robust training method of Wong & Kolter (2018) to optimize for cost-sensitive robustness. Our experiments on simple MNIST and CIFAR10 models with a variety of cost matrices show that the proposed approach can produce models with substantially reduced cost-sensitive robust error, while maintaining classification accuracy.

1 INTRODUCTION

Despite the exceptional performance of deep neural networks (DNNs) on various machine learning tasks such as malware detection (Saxe & Berlin, 2015), face recognition (Parkhi et al., 2015) and autonomous driving (Bojarski et al., 2016), recent studies (Szegedy et al., 2014; Goodfellow et al., 2015) have shown that deep learning models are vulnerable to misclassifying inputs, known as adversarial examples, that are crafted with targeted but visually-imperceptible perturbations. While several defense mechanisms have been proposed and empirically demonstrated to be successful against particular existing attacks (Papernot et al., 2016; Goodfellow et al., 2015), new attacks (Carlini & Wagner, 2017; Tramèr et al., 2018; Athalye et al., 2018) are repeatedly found that circumvent such defenses. To end this arms race, recent works (Wong & Kolter, 2018; Raghunathan et al., 2018; Wong et al., 2018; Wang et al., 2018) propose methods to certify examples to be robust against some specific norm-bounded adversarial perturbations for given inputs and to train models to optimize for certifiable robustness.

However, all of the aforementioned methods aim at improving the overall robustness of the classifier. This means that the methods to improve robustness are designed to prevent seed examples in any class from being misclassified as any other class. Achieving such a goal (at least for some definitions of adversarial robustness) requires producing a perfect classifier, and has, unsurprisingly, remained elusive. Indeed, Mahloujifar et al. (2019) proved that if the metric probability space is concentrated, overall adversarial robustness is unattainable for any classifier with initial constant error. We argue that overall robustness may not be the appropriate criterion for measuring system performance in security-sensitive applications, since only certain kinds of adversarial misclassifications pose meaningful threats that provide value for potential adversaries. Whereas overall robustness places equal emphasis on every adversarial transformation, from a security perspective only certain transformations matter. As a simple example, misclassifying a malicious program as benign has far more severe consequences than the reverse.
In this paper, we propose a general method for adapting provable defenses against norm-bounded perturbations to take into account the potential harm of different adversarial class transformations. Inspired by cost-sensitive learning (Domingos, 1999; Elkan, 2001) for non-adversarial contexts, we capture the impact of different adversarial class transformations using a cost matrix $C$, where each entry represents the cost of an adversary being able to take a natural example from the first class and perturb it so as to be misclassified by the model as the second class. Instead of reducing the overall robust error, our goal is to minimize the cost-weighted robust error (which we define for both binary and real-valued costs in $C$). The proposed method incorporates the specified cost matrix into the training objective function, which encourages stronger robustness guarantees on cost-sensitive class transformations, while maintaining the overall classification accuracy on the original inputs.

Contributions. By encoding the consequences of different adversarial transformations into a cost matrix, we introduce the notion of cost-sensitive robustness (Section 3.1) as a metric to assess the expected performance of a classifier when facing adversarial examples. We propose an objective function for training a cost-sensitive robust classifier (Section 3.2). The proposed method is general in that it can incorporate any type of cost matrix, including both binary and real-valued. We demonstrate the effectiveness of the proposed cost-sensitive defense model for a variety of cost scenarios on two benchmark image classification datasets: MNIST (Section 4.1) and CIFAR10 (Section 4.2). Compared with the state-of-the-art overall robust defense model (Wong & Kolter, 2018), our model achieves significant improvements in cost-sensitive robustness for different tasks, while maintaining approximately the same classification accuracy on both datasets.

Notation. We use lower-case boldface letters such as $\mathbf{x}$ for vectors and capital boldface letters such as $\mathbf{A}$ for matrices. Let $[m]$ be the index set $\{1, 2, \ldots, m\}$ and $A_{ij}$ be the $(i,j)$-th entry of matrix $\mathbf{A}$. Denote the $i$-th natural basis vector, the all-ones vector and the identity matrix by $\mathbf{e}_i$, $\mathbf{1}$ and $\mathbf{I}$, respectively. For any vector $\mathbf{x} \in \mathbb{R}^d$, the $\ell_\infty$-norm of $\mathbf{x}$ is defined as $\|\mathbf{x}\|_\infty = \max_{i \in [d]} |x_i|$.

2 BACKGROUND

In this section, we provide a brief introduction to related topics, including neural network classifiers, adversarial examples, defenses with certified robustness, and cost-sensitive learning.

2.1 NEURAL NETWORK CLASSIFIERS

A $K$-layer neural network classifier can be represented by a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ such that $f(\mathbf{x}) = f_{K-1}(f_{K-2}(\cdots(f_1(\mathbf{x}))))$ for any $\mathbf{x} \in \mathcal{X}$. For $k \in \{1, 2, \ldots, K-2\}$, the mapping function $f_k(\cdot)$ typically consists of two operations: an affine transformation (either matrix multiplication or convolution) and a nonlinear activation. In this paper, we consider the rectified linear unit (ReLU) as the activation function. If we denote the feature vector of the $k$-th layer as $\mathbf{z}_k$, then $f_k(\cdot)$ is defined as
$$\mathbf{z}_{k+1} = f_k(\mathbf{z}_k) = \max\{\mathbf{W}_k \mathbf{z}_k + \mathbf{b}_k, \mathbf{0}\}, \quad k \in \{1, 2, \ldots, K-2\},$$
where $\mathbf{W}_k$ denotes the weight parameter matrix and $\mathbf{b}_k$ the bias vector. The output function $f_{K-1}(\cdot)$ maps the feature vector in the last hidden layer to the output space $\mathcal{Y}$ solely through matrix multiplication:
$$\mathbf{z}_K = f_{K-1}(\mathbf{z}_{K-1}) = \mathbf{W}_{K-1} \mathbf{z}_{K-1} + \mathbf{b}_{K-1},$$
where $\mathbf{z}_K$ can be regarded as the estimated score vector of input $\mathbf{x}$ for the different possible output classes.
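For concreteness, the following PyTorch sketch instantiates a small $K$-layer ReLU network of this form, using the MNIST architecture described in Section 4.1 (two convolutional layers with 16 and 32 filters, followed by fully-connected layers with 100 and 10 units). The kernel sizes, strides, and padding are illustrative assumptions, not values taken from the paper.

```python
import torch.nn as nn

# A minimal sketch of the K-layer ReLU network f described above, instantiated
# with the small MNIST architecture from Section 4.1 (two convolutional layers
# with 16 and 32 filters, then fully-connected layers with 100 and 10 units).
# Kernel sizes, strides, and padding are illustrative assumptions.
def build_mnist_classifier():
    return nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1),   # f_1: affine (convolution) + ReLU
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),  # f_2: affine (convolution) + ReLU
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * 7 * 7, 100),                             # f_3: affine + ReLU
        nn.ReLU(),
        nn.Linear(100, 10),                                      # f_{K-1}: final linear map to the score vector z_K
    )
```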
In the following discussions, we use $f_\theta$ to represent the neural network classifier, where $\theta = \{\mathbf{W}_1, \ldots, \mathbf{W}_{K-1}, \mathbf{b}_1, \ldots, \mathbf{b}_{K-1}\}$ denotes the model parameters. To train the neural network, a loss function $\sum_{i=1}^N L(f_\theta(\mathbf{x}_i), y_i)$ is defined for a set of training examples $\{\mathbf{x}_i, y_i\}_{i=1}^N$, where $\mathbf{x}_i$ is the $i$-th input vector and $y_i$ denotes its class label. Cross-entropy loss is typically used for multiclass image classification. With proper initialization, all model parameters are then updated iteratively using backpropagation. For any input example $\tilde{\mathbf{x}}$, the predicted label $\hat{y}$ is given by the index of the largest predicted score among all classes, $\arg\max_j [f_\theta(\tilde{\mathbf{x}})]_j$.

2.2 ADVERSARIAL EXAMPLES

An adversarial example is an input, generated by some adversary, which is visually indistinguishable from an example from the natural distribution, but is able to mislead the target classifier. Since visual indistinguishability depends on human perception, which is hard to define rigorously, we consider the most popular alternative: input examples with perturbations bounded in $\ell_\infty$-norm (Goodfellow et al., 2015). More formally, the set of adversarial examples with respect to seed example $\{\mathbf{x}_0, y_0\}$ and classifier $f_\theta(\cdot)$ is defined as
$$\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta) = \big\{\mathbf{x} \in \mathcal{X} : \|\mathbf{x} - \mathbf{x}_0\|_\infty \le \epsilon \ \text{and} \ \arg\max_j [f_\theta(\mathbf{x})]_j \neq y_0\big\}, \tag{2.1}$$
where $\epsilon > 0$ denotes the maximum perturbation distance. Although $\ell_p$ distances are commonly used in adversarial examples research, they are not an adequate measure of perceptual similarity (Sharif et al., 2018), and other minimal geometric transformations can be used to find adversarial examples (Engstrom et al., 2017; Kanbak et al., 2018; Xiao et al., 2018). Nevertheless, there is considerable interest in improving robustness in this simple domain, and hope that as this research area matures we will find ways to apply results from studying simplified problems to more realistic ones.

2.3 DEFENSES WITH CERTIFIED ROBUSTNESS

A line of recent work has proposed defenses that are guaranteed to be robust against norm-bounded adversarial perturbations. Hein & Andriushchenko (2017) proved formal robustness guarantees against $\ell_2$-norm bounded perturbations for two-layer neural networks, and provided a training method based on a surrogate robust bound. Raghunathan et al. (2018) developed an approach based on semidefinite relaxation for training certified robust classifiers, but it was limited to two-layer fully-connected networks. Our work builds most directly on Wong & Kolter (2018), which can be applied to deep ReLU-based networks and achieves the state-of-the-art certified robustness on the MNIST dataset.

Following the definitions in Wong & Kolter (2018), an adversarial polytope $\mathcal{Z}_\epsilon(\mathbf{x})$ with respect to a given example $\mathbf{x}$ is defined as
$$\mathcal{Z}_\epsilon(\mathbf{x}) = \big\{f_\theta(\mathbf{x} + \boldsymbol{\Delta}) : \|\boldsymbol{\Delta}\|_\infty \le \epsilon\big\}, \tag{2.2}$$
which contains all the possible output vectors for the given classifier $f_\theta$ obtained by perturbing $\mathbf{x}$ within an $\ell_\infty$-norm ball with radius $\epsilon$. A seed example $\{\mathbf{x}_0, y_0\}$ is said to be certified robust with respect to maximum perturbation distance $\epsilon$ if the corresponding adversarial example set $\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta)$ is empty. Equivalently, if we solve, for every output class $y_{\mathrm{targ}} \neq y_0$, the optimization problem
$$\min_{\mathbf{z}_K} \; [\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}, \quad \text{subject to} \ \mathbf{z}_K \in \mathcal{Z}_\epsilon(\mathbf{x}_0), \tag{2.3}$$
then according to the definition of $\mathcal{A}_\epsilon(\mathbf{x}_0, y_0; \theta)$ in (2.1), $\{\mathbf{x}_0, y_0\}$ is guaranteed to be robust provided that the optimal objective value of (2.3) is positive for every output class.
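As a concrete illustration of this certification criterion, the sketch below checks robustness for a single seed example, assuming some sound procedure (such as the dual-network bound introduced next) has already produced a lower bound on the margin $[\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}$ for every target class; the function name and array layout are our own.

```python
import numpy as np

def is_certified_robust(margin_lower_bounds: np.ndarray, true_label: int) -> bool:
    """Certification check for one seed example.

    margin_lower_bounds: length-m array whose j-th entry is a sound lower bound
    on [z_K]_{y0} - [z_K]_j over the adversarial polytope (the entry for j = y0
    is ignored). The example is certified robust if every other bound is positive.
    """
    margins = np.delete(margin_lower_bounds, true_label)
    return bool(np.all(margins > 0))
```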
To train a robust model on a given dataset $\{\mathbf{x}_i, y_i\}_{i=1}^N$, standard robust optimization aims to minimize the sample loss function at the worst-case locations through the following adversarial loss:
$$\min_\theta \; \frac{1}{N}\sum_{i=1}^N \max_{\|\boldsymbol{\Delta}\|_\infty \le \epsilon} L\big(f_\theta(\mathbf{x}_i + \boldsymbol{\Delta}), y_i\big), \tag{2.4}$$
where $L(\cdot, \cdot)$ denotes the cross-entropy loss. However, due to the nonconvexity of the neural network classifier $f_\theta(\cdot)$ introduced by the nonlinear ReLU activations, both the adversarial polytope (2.2) and the training objective (2.4) are highly nonconvex. In addition, solving optimization problem (2.3) for each pair of input example and output class is computationally intractable. Instead of solving the optimization problem directly, Wong & Kolter (2018) proposed an alternative training objective function based on convex relaxation, which can be efficiently optimized through a dual network. Specifically, they relaxed $\mathcal{Z}_\epsilon(\mathbf{x})$ into a convex outer adversarial polytope $\tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$ by replacing the ReLU constraint $z = \max\{\hat{z}, 0\}$ at each neuron with the set of linear inequalities
$$z \ge 0, \quad z \ge \hat{z}, \quad -u\hat{z} + (u - \ell)z \le -u\ell, \tag{2.5}$$
where $\ell, u$ denote the lower and upper bounds on the considered pre-ReLU activation.[1] Based on the relaxed outer bound $\tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$, they propose the following alternative optimization problem,
$$\min_{\mathbf{z}_K} \; [\mathbf{z}_K]_{y_0} - [\mathbf{z}_K]_{y_{\mathrm{targ}}}, \quad \text{subject to} \ \mathbf{z}_K \in \tilde{\mathcal{Z}}_\epsilon(\mathbf{x}_0), \tag{2.6}$$
which is in fact a linear program. Since $\mathcal{Z}_\epsilon(\mathbf{x}) \subseteq \tilde{\mathcal{Z}}_\epsilon(\mathbf{x})$ for any $\mathbf{x} \in \mathcal{X}$, solving (2.6) for all output classes provides stronger robustness guarantees compared with (2.3), provided all the optimal objective values are positive. In addition, they derived a guaranteed lower bound, denoted by $J_\epsilon(\mathbf{x}_0, g_\theta(\mathbf{e}_{y_0} - \mathbf{e}_{y_{\mathrm{targ}}}))$, on the optimal objective value of (2.6) using duality theory, where $g_\theta(\cdot)$ is a $K$-layer feedforward dual network (Theorem 1 in Wong & Kolter (2018)). Finally, according to the properties of the cross-entropy loss, they minimize the following objective to train the robust model, which serves as an upper bound of the adversarial loss (2.4):
$$\min_\theta \; \frac{1}{N}\sum_{i=1}^N L\Big(-J_\epsilon\big(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i}\mathbf{1}^\top - \mathbf{I})\big), y_i\Big), \tag{2.7}$$
where $g_\theta(\cdot)$ is regarded as a columnwise function when applied to a matrix.

[1] The elementwise activation bounds can be computed efficiently using Algorithm 1 in Wong & Kolter (2018).

Although the proposed method in Wong & Kolter (2018) achieves certified robustness, its computational complexity is quadratic in the network size in the worst case, so it only scales to small networks. Recently, Wong et al. (2018) extended the training procedure to scale to larger networks by using nonlinear random projections. However, when the network size allows for both methods, we observe a small decrease in performance using the training method provided in Wong et al. (2018). Therefore, we only use the approximation techniques for the experiments on CIFAR10 (Section 4.2), and use the less scalable method for the MNIST experiments (Section 4.1).

2.4 COST-SENSITIVE LEARNING

Cost-sensitive learning (Domingos, 1999; Elkan, 2001; Liu & Zhou, 2006) was proposed to deal with unequal misclassification costs and class imbalance problems commonly found in classification applications. The key observation is that cost-blind learning algorithms tend to be dominated by the majority class, even though the neglected minority class is often the one of primary interest. For example, in medical diagnosis, misclassifying a rare cancerous lesion as benign is extremely costly.
Various cost-sensitive learning algorithms (Kukar & Kononenko, 1998; Zadrozny et al., 2003; Zhou & Liu, 2010; Khan et al., 2018) have been proposed in the literature, but only a few algorithms, limited to simple classifiers, have considered adversarial settings.[2] Dalvi et al. (2004) studied the naive Bayes classifier for spam detection in the presence of a cost-sensitive adversary, and developed an adversary-aware classifier based on game theory. Asif et al. (2015) proposed a cost-sensitive robust minimax approach that hardens a linear discriminant classifier with robustness in the adversarial context. All of these methods are designed for simple linear classifiers, and cannot be directly extended to neural network classifiers. In addition, the robustness of their proposed classifiers is only examined experimentally based on the performance against some specific adversary, so they do not provide any notion of certified robustness. Recently, Dreossi et al. (2018) advocated for the idea of using application-level semantics in adversarial analysis; however, they did not provide a formal method for training such a classifier. Our work provides a practical training method that hardens neural network classifiers with certified cost-sensitive robustness against adversarial perturbations.

[2] Given the vulnerability of standard classifiers to adversarial examples, it is not surprising that standard cost-sensitive classifiers are also ineffective against adversaries. The experiments described in Appendix B support this expectation.

3 TRAINING A COST-SENSITIVE ROBUST CLASSIFIER

The approach introduced in Wong & Kolter (2018) penalizes all adversarial class transformations equally, even though the consequences of adversarial examples usually depend on the specific class transformations. Here, we provide a formal definition of cost-sensitive robustness (Section 3.1) and propose a general method for training cost-sensitive robust models (Section 3.2).

3.1 CERTIFIED COST-SENSITIVE ROBUSTNESS

Our approach uses a cost matrix $C$ that encodes the cost (i.e., potential harm to the model deployer) of different adversarial examples. First, we consider the case where there are $m$ classes and $C$ is an $m \times m$ binary matrix with $C_{jj'} \in \{0, 1\}$. The value $C_{jj'}$ indicates whether we care about an adversary transforming a seed input in class $j$ into one recognized by the model as being in class $j'$: if the adversarial transformation $j \rightarrow j'$ matters, $C_{jj'} = 1$, otherwise $C_{jj'} = 0$. Let $\Omega_j = \{j' \in [m] : C_{jj'} \neq 0\}$ be the index set of output classes that induce cost with respect to input class $j$. For any $j \in [m]$, let $\delta_j = 0$ if $\Omega_j$ is an empty set, and $\delta_j = 1$ otherwise. We are only concerned with adversarial transformations from a seed class $j$ to target classes $j' \in \Omega_j$.

For any example $\mathbf{x}$ in seed class $j$, $\mathbf{x}$ is said to be certified cost-sensitive robust if the lower bound $J_\epsilon(\mathbf{x}, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'})) \ge 0$ for all $j' \in \Omega_j$. That is, no adversarial perturbation in an $\ell_\infty$-norm ball around $\mathbf{x}$ with radius $\epsilon$ can mislead the classifier to any target class in $\Omega_j$. The cost-sensitive robust error on a dataset $\{\mathbf{x}_i, y_i\}_{i=1}^N$ is defined as the number of examples that are not guaranteed to be cost-sensitive robust over the number of non-zero-cost candidate seed examples:
$$\text{cost-sensitive robust error} = 1 - \frac{\#\big\{i \in [N] : \delta_{y_i} = 1 \ \text{and} \ J_\epsilon\big(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_{j'})\big) \ge 0 \ \text{for all} \ j' \in \Omega_{y_i}\big\}}{\sum_{j : \delta_j = 1} N_j},$$
where $\#A$ represents the cardinality of a set $A$, and $N_j$ is the total number of examples in class $j$.
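The definition above translates directly into code. The sketch below computes the binary cost-sensitive robust error from a precomputed matrix of certified lower bounds, which we assume has shape (N, m) with entry (i, j) equal to $J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_j))$; the function and variable names are illustrative.

```python
import numpy as np

def cost_sensitive_robust_error(J: np.ndarray, labels: np.ndarray, C: np.ndarray) -> float:
    """Binary cost-sensitive robust error, as defined above.

    J      : (N, m) array of certified lower bounds, assumed precomputed.
    labels : (N,) array of true classes y_i.
    C      : (m, m) binary cost matrix; Omega_j = {j' : C[j, j'] != 0}.
    """
    num_not_robust, num_candidates = 0, 0
    for i, y in enumerate(labels):
        omega = np.flatnonzero(C[y] != 0)   # cost-sensitive target classes for seed class y
        if omega.size == 0:                  # delta_y = 0: this class contributes no candidates
            continue
        num_candidates += 1
        if np.any(J[i, omega] < 0):          # some valued transformation cannot be certified
            num_not_robust += 1
    return num_not_robust / num_candidates
```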
Next, we consider the more general case where $C$ is an $m \times m$ real-valued cost matrix. Each entry of $C$ is a non-negative real number representing the cost of the corresponding adversarial transformation. To take into account the different potential costs among adversarial examples, we measure cost-sensitive robustness by the average certified cost of adversarial examples. The cost of an adversarial example $\mathbf{x}$ in class $j$ is defined as the sum of all $C_{jj'}$ such that $J_\epsilon(\mathbf{x}, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'})) < 0$. Intuitively, an adversarial example induces more cost if it can be adversarially misclassified as more target classes with high cost. Accordingly, the robust cost is defined as the total cost of adversarial examples divided by the total number of valued seed examples:
$$\text{robust cost} = \frac{\sum_{i \in [N]} \sum_{j' \in \Omega_{y_i}} C_{y_i j'} \, \mathbf{1}\big(J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_{y_i} - \mathbf{e}_{j'})) < 0\big)}{\sum_{j : \delta_j = 1} N_j}, \tag{3.1}$$
where $\mathbf{1}(\cdot)$ denotes the indicator function.

3.2 COST-SENSITIVE ROBUST OPTIMIZATION

Recall that our goal is to develop a classifier with certified cost-sensitive robustness, as defined in Section 3.1, while maintaining overall classification accuracy. Based on the guaranteed lower bound $J_\epsilon(\mathbf{x}_0, g_\theta(\mathbf{e}_{y_0} - \mathbf{e}_{y_{\mathrm{targ}}}))$ on (2.6), and inspired by the cost-sensitive CE loss (Khan et al., 2018), we propose the following robust optimization problem with respect to a neural network classifier $f_\theta$:
$$\min_\theta \; \frac{1}{N}\bigg[\sum_{i \in [N]} L\big(f_\theta(\mathbf{x}_i), y_i\big) + \alpha \sum_{j \in [m]} \sum_{i\,:\,y_i = j} \log\Big(1 + \sum_{j' \in \Omega_j} C_{jj'} \exp\big(-J_\epsilon(\mathbf{x}_i, g_\theta(\mathbf{e}_j - \mathbf{e}_{j'}))\big)\Big)\bigg], \tag{3.2}$$
where $\alpha \ge 0$ denotes the regularization parameter. The first term in (3.2) is the cross-entropy loss for standard classification, whereas the second term accounts for cost-sensitive robustness. Compared with the overall robustness training objective (2.7), we include a regularization parameter $\alpha$ to control the trade-off between classification accuracy on original inputs and adversarial robustness. To provide cost-sensitivity, the loss function selectively penalizes adversarial examples based on their cost. For binary cost matrices, the regularization term penalizes every cost-sensitive adversarial example equally, but has no impact on instances where $C_{jj'} = 0$. For real-valued costs, a larger value of $C_{jj'}$ increases the weight of the corresponding adversarial transformation in the training objective. The optimization problem (3.2) can be solved efficiently using gradient-based algorithms, such as stochastic gradient descent and Adam (Kingma & Ba, 2015).
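The sketch below shows one way objective (3.2) can be written as a batch loss in PyTorch, assuming a differentiable routine has already produced the matrix of certified lower bounds J (e.g., the dual network of Wong & Kolter, 2018). The interface is our own, not the released implementation.

```python
import torch
import torch.nn.functional as F

def cost_sensitive_robust_loss(logits, J, labels, C, alpha):
    """One possible batch form of objective (3.2).

    logits : (N, m) clean predictions f_theta(x_i).
    J      : (N, m) certified lower bounds J_eps(x_i, g_theta(e_{y_i} - e_j)),
             assumed to come from a differentiable dual-network routine.
    labels : (N,) tensor of true classes.
    C      : (m, m) cost matrix (binary or real-valued) with a zero diagonal,
             so the j' = y_i term contributes nothing.
    alpha  : regularization parameter trading off accuracy and robustness.
    """
    clean_loss = F.cross_entropy(logits, labels, reduction="sum")
    costs = torch.as_tensor(C, dtype=J.dtype, device=J.device)[labels]   # (N, m): row y_i of C
    robust_term = torch.log1p((costs * torch.exp(-J)).sum(dim=1)).sum()  # log(1 + sum_{j'} C_{y_i j'} exp(-J))
    return (clean_loss + alpha * robust_term) / labels.shape[0]
```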
4 EXPERIMENTS

We evaluate the performance of our cost-sensitive robustness training method on models for two benchmark image classification datasets: MNIST (LeCun et al., 2010) and CIFAR10 (Krizhevsky & Hinton, 2009). We compare our results for various cost scenarios with overall robustness training (Section 2.3) as a baseline. For both datasets, the relevant family of attacks is specified as all adversarial perturbations that are bounded in an $\ell_\infty$-norm ball. Our goal in the experiments is to evaluate how well a variety of different types of cost matrices can be supported. MNIST and CIFAR10 are toy datasets, so there are no obvious cost matrices that correspond to meaningful security applications for these datasets. Instead, we select representative tasks and design cost matrices to capture them.

4.1 MNIST

For MNIST, we use the same convolutional neural network architecture (LeCun et al., 1998) as Wong & Kolter (2018), which includes two convolutional layers, with 16 and 32 filters respectively, and two fully-connected layers, consisting of 100 and 10 hidden units respectively. ReLU activations are applied to each layer except the last one. For both our cost-sensitive robust model and the overall robust model, we randomly split the 60,000 training samples into five folds of equal size, and train the classifier for 60 epochs on four of them using the Adam optimizer (Kingma & Ba, 2015) with batch size 50 and learning rate 0.001. We treat the remaining fold as a validation dataset for model selection. In addition, we use ϵ-scheduling and learning rate decay, where we increase ϵ linearly from 0.05 to the desired value over the first 20 epochs and decay the learning rate by 0.5 every 10 epochs for the remaining epochs.

Baseline: Overall Robustness. Figure 1(a) illustrates the learning curves of both the classification error and the overall robust error during training based on the robust loss (2.7) with maximum perturbation distance ϵ = 0.2. The model with classification error less than 4% and minimum overall robust error on the validation dataset is selected over the 60 training epochs. The best classifier reaches 3.39% classification error and 13.80% overall robust error on the 10,000 MNIST testing samples. We report the robust test error for every adversarial transformation in Figure 1(b) (for the model without any robustness training, all of the values are 100%). The (i, j)-th entry is a bound on the robustness of that seed-target transformation: the fraction of testing examples in class i that cannot be certified robust against transformation into class j under any $\ell_\infty$ attack bounded by ϵ.

Figure 1: Preliminary results on MNIST using the overall robust classifier: (a) learning curves of the classification error and overall robust error over the 60 training epochs; (b) heatmap of the robust test error for pairwise class transformations based on the best trained classifier.
As shown in Figure 1(b), the vulnerability to adversarial transformations differs considerably among class pairs and appears correlated with perceptual similarity. For instance, only 0.26% of seeds in class 1 cannot be certified robust for target class 9, compared to about 10% of seeds from class 9 into class 4.

Binary Cost Matrix. Next, we evaluate the effectiveness of cost-sensitive robustness training in producing models that are more robust for adversarial transformations designated as valuable. We consider four types of tasks defined by different binary cost matrices that capture different sets of adversarial transformations:

- single pair: a particular seed class s to a particular target class t;
- single seed: a particular seed class s to any target class;
- single target: any seed class to a particular target class t; and
- multiple: multiple seed and target classes.

For each setting, the cost matrix is defined as $C_{ij} = 1$ if $(i, j)$ is selected, and $C_{ij} = 0$ otherwise (a construction sketch for these matrices is given below). In general, we expect that the sparser the cost matrix, the more opportunity there is for cost-sensitive training to improve cost-sensitive robustness over models trained for overall robustness.

For the single pair task, we selected three representative adversarial goals: a low vulnerability pair (0, 2), a medium vulnerability pair (6, 5) and a high vulnerability pair (4, 9). We selected these pairs by considering the robust error results on the overall-robustness trained model (Figure 1(b)) as a rough measure of transformation hardness. This is generally consistent with intuitions about the MNIST digit classes (e.g., 9 and 4 look similar, so it is harder to induce robustness against that adversarial transformation), as well as with the visualization results produced by dimension reduction techniques such as t-SNE (Maaten & Hinton, 2008).
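For concreteness, the binary cost matrices for the four task types can be constructed as in the following sketch (class indices and helper names are illustrative):

```python
import numpy as np

def single_pair(m, s, t):
    """C_st = 1 only for the chosen seed-target pair (s, t)."""
    C = np.zeros((m, m)); C[s, t] = 1.0; return C

def single_seed(m, s):
    """All transformations out of seed class s are valued."""
    C = np.zeros((m, m)); C[s, :] = 1.0; C[s, s] = 0.0; return C

def single_target(m, t):
    """All transformations into target class t are valued."""
    C = np.zeros((m, m)); C[:, t] = 1.0; C[t, t] = 0.0; return C

def multiple(m, pairs):
    """An arbitrary set of valued seed-target pairs, e.g. the ten most vulnerable ones."""
    C = np.zeros((m, m))
    for s, t in pairs:
        C[s, t] = 1.0
    return C

# Example: the medium-vulnerability single pair task (6, 5) on MNIST (m = 10).
C_pair = single_pair(10, 6, 5)
```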
Similarly, for the single seed and single target tasks we select three representative examples representing low, medium, and high vulnerability to include in Table 1, and provide full results for all the single-seed and single-target tasks for MNIST in Figure 2.

Figure 2: Cost-sensitive robust error using the proposed model and baseline model on MNIST for different binary tasks: (a) treating each digit as the seed class of concern; (b) treating each digit as the target class of concern.

For the multiple transformations task, we consider four variations: (i) the ten most vulnerable seed-target transformations; (ii) ten randomly-selected seed-target transformations; (iii) all the class transformations from any odd digit seed class to any other class; (iv) all the class transformations from any even digit seed class to any other class.

Table 1 summarizes the results, comparing the cost-sensitive robust error between the baseline model trained for overall robustness and a model trained using our cost-sensitive robust optimization. The cost-sensitive robust defense model is trained with ϵ = 0.2 based on the loss function (3.2) and the corresponding cost matrix C. The regularization parameter α is tuned via cross validation (see Appendix A for details). We report the selected best α, classification error and cost-sensitive robust error on the testing dataset.

Table 1: Comparisons between different robust defense models on the MNIST dataset against $\ell_\infty$ norm-bounded adversarial perturbations with ϵ = 0.2. The sparsity gives the number of non-zero entries in the cost matrix over the total number of possible adversarial transformations. The candidates column is the number of potential seed examples for each task.

| Task | Description | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Error (baseline) | Robust Error (ours) |
|---|---|---|---|---|---|---|---|---|
| single pair | (0,2) | 1/90 | 980 | 10.0 | 3.39% | 2.68% | 0.92% | 0.31% |
| single pair | (6,5) | 1/90 | 958 | 5.0 | 3.39% | 2.49% | 3.55% | 0.42% |
| single pair | (4,9) | 1/90 | 982 | 4.0 | 3.39% | 3.00% | 10.08% | 1.02% |
| single seed | digit 0 | 9/90 | 980 | 10.0 | 3.39% | 3.48% | 3.67% | 0.92% |
| single seed | digit 2 | 9/90 | 1032 | 1.0 | 3.39% | 2.91% | 14.34% | 3.68% |
| single seed | digit 8 | 9/90 | 974 | 0.4 | 3.39% | 3.37% | 22.28% | 5.75% |
| single target | digit 1 | 9/90 | 8865 | 4.0 | 3.39% | 3.29% | 2.23% | 0.14% |
| single target | digit 5 | 9/90 | 9108 | 2.0 | 3.39% | 3.24% | 3.10% | 0.29% |
| single target | digit 8 | 9/90 | 9026 | 1.0 | 3.39% | 3.52% | 5.24% | 0.54% |
| multiple | top 10 | 10/90 | 6024 | 0.4 | 3.39% | 3.34% | 11.14% | 7.02% |
| multiple | random 10 | 10/90 | 7028 | 0.4 | 3.39% | 3.18% | 5.01% | 2.18% |
| multiple | odd digit | 45/90 | 5074 | 0.2 | 3.39% | 3.30% | 14.45% | 9.97% |
| multiple | even digit | 45/90 | 4926 | 0.1 | 3.39% | 2.82% | 13.13% | 9.44% |

Our model achieves a substantial improvement in cost-sensitive robustness compared with the baseline model on all of the considered tasks, with no significant increase in normal classification error. The cost-sensitive robust error reduction varies from 30% to 90%, and is generally higher for sparse cost matrices. In particular, our classifier reduces the number of cost-sensitive adversarial examples from 198 to 12 on the single target task with digit 1 as the target class.

Real-valued Cost Matrices. Loosely motivated by a check forging adversary who obtains value by changing the semantic interpretation of a number (Papernot et al., 2016), we consider two real-valued cost matrices: small-large, where only adversarial transformations from a smaller digit class to a larger one are valued, and the cost of a valued transformation is quadratic in the absolute difference between the seed and target class digits, i.e., $C_{ij} = (i - j)^2$ if $j > i$ and $C_{ij} = 0$ otherwise; and large-small, where only adversarial transformations from a larger digit class to a smaller one are valued, i.e., $C_{ij} = (i - j)^2$ if $i > j$ and $C_{ij} = 0$ otherwise. We tune α for the cost-sensitive robust model on the MNIST training dataset via cross validation, and set all the other parameters the same as in the binary case. The certified robust error for every adversarial transformation on the MNIST testing dataset is shown in Figure 3, and the classification error and robust cost are given in Table 2. Compared with the model trained for overall robustness (Figure 1(b)), our trained classifier achieves stronger robustness guarantees on the adversarial transformations that induce costs, especially for those with larger costs.

Table 2: Comparison results of different robust defense models for tasks with real-valued cost matrices.

| Dataset | Task | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Cost (baseline) | Robust Cost (ours) |
|---|---|---|---|---|---|---|---|---|
| MNIST | small-large | 45/90 | 10000 | 0.04 | 3.39% | 3.47% | 2.245 | 0.947 |
| MNIST | large-small | 45/90 | 10000 | 0.04 | 3.39% | 3.13% | 3.344 | 1.549 |
| CIFAR | vehicle | 40/90 | 4000 | 0.1 | 31.80% | 26.19% | 4.183 | 3.095 |
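The two real-valued cost matrices can be built as in the sketch below (helper names are our own); note that large-small is simply the transpose of small-large, since $(i - j)^2$ is symmetric.

```python
import numpy as np

def small_large(m=10):
    """C_ij = (i - j)^2 for j > i (smaller digit perturbed to a larger one), else 0."""
    C = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if j > i:
                C[i, j] = (i - j) ** 2
    return C

def large_small(m=10):
    """C_ij = (i - j)^2 for i > j (larger digit perturbed to a smaller one), else 0."""
    return small_large(m).T
```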
Figure 3: Heatmaps of robust test error using our cost-sensitive robust classifier on MNIST for the real-valued cost tasks: (a) small-large; (b) large-small.

4.2 CIFAR10

We use the same neural network architecture for the CIFAR10 dataset as Wong et al. (2018), with four convolutional layers and two fully-connected layers. For memory and computational efficiency, we incorporate the approximation technique based on nonlinear random projections during the training phase (Wong et al. (2018), Section 3.2). We train both the baseline model and our model using random projections of 50 dimensions, and optimize the training objective using SGD.
Other parameters such as learning rate and batch size are set the same as those in Wong et al. (2018). Given a specific task, we train the cost-sensitive robust classifier on 80% randomly-selected training examples, and tune the regularization parameter α according to the performance on the remaining examples, which serve as a validation dataset. The tasks are similar to those for MNIST (Section 4.1), except that for the multiple transformations task we cluster the ten CIFAR10 classes into two large groups, animals and vehicles, and consider the cases where only transformations from an animal class to a vehicle class (A-V), or the converse (V-A), are sensitive.

Table 3 shows results on the testing data based on different robust defense models with ϵ = 2/255. For all of the aforementioned tasks, our models substantially reduce the cost-sensitive robust error while keeping a lower classification error than the baseline.

Table 3: Cost-sensitive robust models for the CIFAR10 dataset against adversarial examples, ϵ = 2/255.

| Task | Description | Sparsity | Candidates | Best α | Classification Error (baseline) | Classification Error (ours) | Robust Error (baseline) | Robust Error (ours) |
|---|---|---|---|---|---|---|---|---|
| single pair | (frog, bird) | 1/90 | 1000 | 10.0 | 31.80% | 27.88% | 19.90% | 1.20% |
| single pair | (cat, plane) | 1/90 | 1000 | 10.0 | 31.80% | 28.63% | 9.30% | 2.60% |
| single seed | dog | 9/90 | 1000 | 0.2 | 31.80% | 30.69% | 57.20% | 28.90% |
| single seed | truck | 9/90 | 1000 | 0.8 | 31.80% | 31.55% | 35.60% | 15.40% |
| single target | deer | 9/90 | 9000 | 0.1 | 31.80% | 26.69% | 16.99% | 3.77% |
| single target | ship | 9/90 | 9000 | 0.1 | 31.80% | 24.80% | 9.42% | 3.06% |
| multiple | A-V | 24/90 | 6000 | 0.1 | 31.80% | 26.65% | 16.67% | 7.42% |
| multiple | V-A | 24/90 | 4000 | 0.2 | 31.80% | 27.60% | 12.07% | 8.00% |

For the real-valued task, we are concerned with adversarial transformations from seed examples in vehicle classes to all other target classes. In addition, more cost is placed on transformations from vehicle to animal: the cost is 10 times larger than that of a transformation from vehicle to vehicle. Figures 4(a) and 4(b) illustrate the pairwise robust test error using the overall robust model and the proposed classifier for this real-valued task on CIFAR10.

Figure 4: Heatmaps of robust test error for the real-valued task on CIFAR10 using different robust classifiers: (a) baseline model; (b) our proposed cost-sensitive robust model.

4.3 VARYING ADVERSARY STRENGTH

We investigate the performance of our model against different levels of adversarial strength by varying the value of ϵ that defines the $\ell_\infty$ ball available to the adversary. Figure 5 shows the overall classification error and cost-sensitive robust error of our best trained model, compared with the baseline model, on the MNIST single seed task with digit 9 and the CIFAR10 single seed task with dog as the seed class of concern, as we vary the maximum $\ell_\infty$ perturbation distance.

Figure 5: Results for different adversary strengths ϵ: (a) MNIST single seed task with digit 9 as the chosen class; (b) CIFAR10 single seed task with dog as the chosen class.

Under all the considered attack models, the proposed classifier achieves better cost-sensitive adversarial robustness than the baseline, while maintaining similar classification accuracy on original data points. As the adversarial strength increases, the improvement for cost-sensitive robustness over overall robustness becomes more significant.

5 CONCLUSION

By focusing on overall robustness, previous robustness training methods expend a large fraction of the capacity of the network on unimportant transformations. We argue that for most scenarios, the actual harm caused by an adversarial transformation often varies depending on the seed and target class, so robust training methods should be designed to account for these differences. By incorporating a cost matrix into the training objective, we develop a general method for producing a cost-sensitive robust classifier.
Our experimental results show that our cost-sensitive training method works across a variety of different types of cost matrices, so we believe it can be generalized to other cost matrix scenarios that would be found in realistic applications. There remains a large gap between the small models and limited attacker capabilities for which we can achieve certifiable robustness, and the complex models and unconstrained attacks that may be important in practice. The scalability of our techniques is limited to the toy models and simple attack norms for which certifiable robustness is currently feasible, so considerable progress is needed before they can be applied to realistic scenarios. However, we hope that considering cost-sensitive robustness instead of overall robustness is a step towards achieving more realistic robustness goals.

AVAILABILITY

Our implementation, including code for reproducing all our experiments, is available as open source code at https://github.com/xiaozhanguva/Cost-Sensitive-Robustness.

ACKNOWLEDGEMENTS

We thank Eric Wong for providing the implementation of certified robustness we built on, as well as for insightful discussions. We thank Jianfeng Chi for helpful advice on implementing our experiments. This work was supported by grants from the National Science Foundation (#1804603 and #1619098) and research awards from Amazon, Baidu, and Intel.

REFERENCES

Kaiser Asif, Wei Xing, Sima Behpour, and Brian D. Ziebart. Adversarial cost-sensitive classification. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.

Nilesh Dalvi, Pedro Domingos, Sumit Sanghai, Deepak Verma, et al. Adversarial classification. In 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2004.

Pedro Domingos. MetaCost: A general method for making classifiers cost-sensitive. In Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

Tommaso Dreossi, Somesh Jha, and Sanjit A. Seshia. Semantic adversarial deep learning. In International Conference on Computer Aided Verification, 2018.

Charles Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, 2001.

Logan Engstrom, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, 2017.

Can Kanbak, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Geometric robustness of deep networks: Analysis and improvement. In Computer Vision and Pattern Recognition, 2018.
Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems, 29(8):3573-3587, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

Matjaž Kukar and Igor Kononenko. Cost-sensitive learning with neural networks. In 13th European Conference on Artificial Intelligence, 1998.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist, 2010.

Xu-Ying Liu and Zhi-Hua Zhou. The influence of class imbalance on cost-sensitive learning: An empirical study. In Sixth International Conference on Data Mining, 2006.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

Saeed Mahloujifar, Dimitrios I. Diochnos, and Mohammad Mahmoody. The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In AAAI Conference on Artificial Intelligence, 2019.

Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, 2016.

Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. In International Conference on Learning Representations, 2018.

Joshua Saxe and Konstantin Berlin. Deep neural network based malware detection using two dimensional binary program features. In 10th International Conference on Malicious and Unwanted Software, 2015.

Mahmood Sharif, Lujo Bauer, and Michael K. Reiter. On the suitability of Lp-norms for creating and preventing adversarial examples. In CVPR Workshop on Bright and Dark Sides of Computer Vision: Challenges and Opportunities for Privacy and Security, 2018.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.

Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana. Formal security analysis of neural networks using symbolic intervals. In USENIX Security Symposium, 2018.

Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 2018.

Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. In Conference on Neural Information Processing Systems, 2018.

Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, 2003.

Zhi-Hua Zhou and Xu-Ying Liu. On multi-class cost-sensitive learning. Computational Intelligence, 26(3):232-257, 2010.

Cost-Sensitive Robustness against Adversarial Examples: Supplemental Materials

A PARAMETER TUNING

For the experiments on the MNIST dataset, we first perform a coarse tuning of the regularization parameter α over the search grid $\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}$, and select the most appropriate value, denoted by $\alpha_{\text{coarse}}$, with overall classification error less than 4% and the lowest cost-sensitive robust error on the validation dataset. Then, we further fine-tune α over the range $\{2^{-3}, 2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}, 2^{3}\} \cdot \alpha_{\text{coarse}}$, and choose the best robust model according to the same criteria.

Figures 6(a) and 6(b) show the learning curves for the single seed task with digit 9 as the selected seed class, based on the proposed cost-sensitive robust model with varying α (we show digit 9 because it is one of the most vulnerable seed classes). The results suggest that as the value of α increases, the corresponding classifier has a lower cost-sensitive robust error but a higher classification error, which is what we expect from the design of (3.2). We observe similar trends in the learning curves for the other tasks, so do not present them here. For the CIFAR10 experiments, a similar tuning strategy is implemented; the only difference is that we use 35% as the threshold on overall classification error for selecting the best α.

Figure 6: Learning curves for the single seed task with digit 9 as the selected seed class on MNIST using the proposed model with varying α: (a) learning curves of classification error; (b) learning curves of cost-sensitive robust error. The maximum perturbation distance is specified as ϵ = 0.2.

B COMPARISON WITH STANDARD COST-SENSITIVE CLASSIFIER

As discussed in Section 2.4, prior work on cost-sensitive learning mainly focuses on the non-adversarial setting. In this section, we investigate the robustness of the cross-entropy based cost-sensitive classifier proposed in Khan et al. (2018), and compare the performance of their classifier with our proposed cost-sensitive robust classifier. Given a set of training examples $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ and a cost matrix $C$ with each entry representing the cost of the corresponding misclassification, the evaluation metric for cost-sensitive learning is defined as the average cost of misclassifications, or more concretely,
$$\text{misclassification cost} = \frac{1}{N} \sum_{i \in [N]} C_{y_i \hat{y}_i}, \quad \text{where} \ \hat{y}_i = \arg\max_{j \in [m]} [f_\theta(\mathbf{x}_i)]_j,$$
where $m$ is the total number of class labels and $f_\theta(\cdot)$ denotes the neural network classifier as introduced in Section 2.1. In addition, the cross-entropy based cost-sensitive training objective takes the following form:
$$\min_\theta \; \frac{1}{N} \sum_{j \in [m]} \sum_{i\,:\,y_i = j} \log\Big(1 + \sum_{j' \neq j} C_{jj'} \exp\big([f_\theta(\mathbf{x}_i)]_{j'} - [f_\theta(\mathbf{x}_i)]_j\big)\Big). \tag{B.1}$$
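A possible batch implementation of objective (B.1) is sketched below, assuming a zero-diagonal cost matrix as in the experiment that follows; the function name and interface are illustrative.

```python
import torch

def cost_sensitive_ce_loss(logits, labels, C):
    """One possible batch form of the standard cost-sensitive objective (B.1).

    logits : (N, m) scores f_theta(x_i); labels : (N,) true classes;
    C      : (m, m) cost matrix, assumed to have a zero diagonal.
    """
    costs = torch.as_tensor(C, dtype=logits.dtype, device=logits.device)[labels]  # (N, m): row y_i of C
    true_scores = logits.gather(1, labels.unsqueeze(1))                            # (N, 1): [f(x_i)]_{y_i}
    margins = logits - true_scores                                                 # [f(x_i)]_{j'} - [f(x_i)]_{y_i}
    return torch.log1p((costs * torch.exp(margins)).sum(dim=1)).mean()
```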
To provide a fair comparison, we assume the cost matrix used for (B.1) coincides with the cost matrix used for cost-sensitive robust training (3.2) in our experiment, although the two are unlikely to be the same in real security applications. For instance, misclassifying a benign program as malicious may still induce some cost in the non-adversarial setting, whereas the adversary may only benefit from transforming a malicious program into a benign one. We consider the small-large real-valued task for MNIST, where the cost matrix $C$ is designed as $C_{ij} = 0.1$ if $i > j$, $C_{ij} = 0$ if $i = j$, and $C_{ij} = (i - j)^2$ otherwise.

Table 4: Comparison of different trained classifiers for the small-large real-valued task on MNIST with maximum perturbation distance ϵ = 0.2.

| Classifier | Classification Error | Misclassification Cost | Robust Cost |
|---|---|---|---|
| Baseline | 0.91% | 0.054 | 85.197 |
| Cost-Sensitive Standard | 2.57% | 0.016 | 85.197 |
| Overall Robustness | 3.49% | 0.252 | 1.982 |
| Cost-Sensitive Robustness | 3.38% | 0.060 | 0.915 |

Table 4 shows the comparison of the different classifiers in this setting: the baseline standard deep learning classifier, a standard cost-sensitive classifier (Khan et al., 2018) trained using (B.1), a classifier trained for overall robustness (Wong & Kolter, 2018), and our proposed classifier trained for cost-sensitive robustness. Compared with the baseline, the standard cost-sensitive classifier indeed reduces the misclassification cost, but it does not provide any improvement in the robust cost, as defined in (3.1). In sharp contrast, our robust training method significantly improves cost-sensitive robustness.