Certified Adversarial Robustness with Additive Noise

Bai Li (Department of Statistical Science, Duke University, bai.li@duke.edu); Changyou Chen (Department of CSE, University at Buffalo, SUNY, cchangyou@gmail.com); Wenlin Wang (Department of ECE, Duke University, wenlin.wang@duke.edu); Lawrence Carin (Department of ECE, Duke University, lcarin@duke.edu)

Abstract

The existence of adversarial data examples has drawn significant attention in the deep-learning community; such data are seemingly minimally perturbed relative to the original data, but lead to very different outputs from a deep-learning algorithm. Although a significant body of work has been devoted to developing defensive models, most such models are heuristic and are often vulnerable to adaptive attacks. Defensive methods that provide theoretical robustness guarantees have been studied intensively, yet most fail to obtain non-trivial robustness when a large-scale model and data are present. To address these limitations, we introduce a framework that is scalable and provides certified bounds on the norm of the input manipulation for constructing adversarial examples. We establish a connection between robustness against adversarial perturbation and additive random noise, and propose a training strategy that can significantly improve the certified bounds. Our evaluation on MNIST, CIFAR-10 and ImageNet suggests that the proposed method is scalable to complicated models and large data sets, while providing competitive robustness to state-of-the-art provable defense methods.

1 Introduction

Although deep neural networks have achieved significant success on a variety of challenging machine learning tasks, including state-of-the-art accuracy on large-scale image classification [1, 2], the discovery of adversarial examples [3] has drawn attention and raised concerns. Adversarial examples are carefully perturbed versions of the original data that successfully fool a classifier. In the image domain, for example, adversarial examples are images that have no visual difference from natural images, but that lead to different classification results [4].

A large body of work has been developed on defensive methods to tackle adversarial examples, yet most remain vulnerable to adaptive attacks [3-10]. A major drawback of many defensive models is that they are heuristic and fail to obtain a theoretically justified guarantee of robustness. On the other hand, many works have focused on providing provable/certified robustness of deep neural networks [11-17]. Recently, [18] provided theoretical insight on certified robust prediction, building a connection between differential privacy and model robustness. It was shown that adding properly chosen noise to the classifier will lead to certified robust prediction. Building on ideas in [18], we conduct an analysis of model robustness based on the Rényi divergence [19] between the outputs of models for natural and adversarial examples when random noise is added, and show a higher upper bound on the tolerable size of perturbations compared to [18]. In addition, our analysis naturally leads to a connection between adversarial defense and robustness to random noise.
Based on this, we introduce a comprehensive framework that incorporates stability training for additive random noise, to improve classification accuracy and hence the certified bound. Our contributions are as follows:

- We derive a certified bound for robustness to adversarial attacks, applicable to models with general activation functions and network structures. Specifically, according to [20], the derived ℓ1 bound is tight in the binary case.
- Using our derived bound, we establish a strong connection between robustness to adversarial perturbations and additive random noise.
- We propose a new training strategy that accounts for this connection. The new training strategy leads to significant improvement on the certified bounds.
- We conduct a comprehensive set of experiments to evaluate both the theoretical and empirical performance of our methods, with results that are competitive with the state of the art.

2 Background and Related Work

Much research has focused on providing provable/certified robustness for neural networks. One line of such studies considers distributionally robust optimization, which aims to provide robustness to changes of the data-generating distribution. For example, [21] studies robust optimization over a φ-divergence ball of a nominal distribution. Robustness with respect to the Wasserstein distance between natural and adversarial distributions was provided in [22]. One limitation of distributional robustness is that the divergence between distributions is rarely used as an empirical measure of the strength of adversarial attacks.

Alternatively, studies have attempted to provide a certified bound on the minimum distortion. In [11] a certified bound is derived, with a small loss in accuracy, for robustness to ℓ2 perturbations in two-layer networks. A method based on semi-definite relaxation is proposed in [12] for calculating a certified bound, yet their analysis cannot be applied to networks with more than one hidden layer. A robust optimization procedure is developed in [13] by considering a convex outer approximation of the set of activations reachable through a norm-bounded perturbation. Their analysis, however, is still limited to ReLU networks and pure feedforward networks. Algorithms to efficiently compute a certified bound are considered in [14], also by utilizing properties of ReLU networks. Recently, their idea has been extended by [15] and relaxed to general activation functions. However, both of these analyses only apply to the multi-layer perceptron (MLP), limiting the applicability of their results. In general, most analyses for certified bounds rely on the properties of specific activation functions or model structures, and are difficult to scale.

Several works aim to generalize their analysis to accommodate flexible model structures and large-scale data. For example, [16] formed an optimization problem to obtain an upper bound via Lagrangian relaxation. They successfully obtained the first non-trivial bound for the CIFAR-10 data set. The analysis of [13] was improved in [17], where it was scaled up to large neural networks with general activation functions, obtaining state-of-the-art results on MNIST and CIFAR-10. Certified robustness is obtained by [18] by analyzing the connection between adversarial robustness and differential privacy. Similar to [16, 17], their certified bound is agnostic to model structure and is scalable, but it is loose and is not comparable to [16, 17].
Our approach maintains all the advantages of [18], and significantly improves the certified bound with more advanced analysis.

The connection between adversarial robustness and robustness to added random noise has been studied in several works. In [23] this connection is established by exploring the curvature of the classifier's decision boundary. Later, [24] showed that adversarial robustness requires reducing error rates to essentially zero under large additive noise. While most previous works use concentration of measure as their analysis tool, we approach this connection from a different perspective using the Rényi divergence [19]; we illustrate the connection to robustness to random noise in a more direct manner. More importantly, our analysis suggests that improving robustness to additive Gaussian noise can directly result in an improvement of the certified bound.

3 Preliminaries

3.1 Notation

We consider the task of image classification. Natural images are represented by $x \in \mathcal{X} \subseteq [0, 1]^{h \times w \times c}$, where $\mathcal{X}$ represents the image space, with $h$, $w$ and $c$ denoting the height, width, and number of channels of an image, respectively. An image classifier over $k$ classes is considered as a function $f : \mathcal{X} \to \{1, \dots, k\}$. We only consider classifiers constructed by deep neural networks (DNNs). To present our framework, we define a stochastic classifier, a function $f$ over $x$ whose output $f(x)$ is a multinomial distribution over $\{1, \dots, k\}$, i.e., $P(f(x) = i) = p_i$ with $\sum_i p_i = 1$. One can classify $x$ by picking $\operatorname{argmax}_i p_i$. Note this distribution is different from the one generated by softmax.

3.2 Rényi Divergence

Our theoretical result depends on the Rényi divergence, defined as follows [19].

Definition 1 (Rényi Divergence) For two probability distributions $P$ and $Q$ over $\mathbb{R}$, the Rényi divergence of order $\alpha > 1$ is
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q} \left[ \left( \frac{P(x)}{Q(x)} \right)^{\alpha} \right].$$

3.3 Adversarial Examples

Given a classifier $f : \mathcal{X} \to \{1, \dots, k\}$ and an image $x \in \mathcal{X}$, an adversarial example $x'$ satisfies $D(x, x') < \epsilon$ for some small $\epsilon > 0$ and $f(x) \neq f(x')$, where $D(\cdot, \cdot)$ is some distance metric, i.e., $x'$ is close to $x$ but yields a different classification result. The distance is often described in terms of an ℓp metric, and in most of the literature the ℓ2 and ℓ∞ metrics are considered. In our development, we focus on the ℓ2 metric but also provide experimental results for ℓ∞. More general definitions of adversarial examples are considered in some works [25], but we only address norm-bounded adversarial examples in this paper.

3.4 Adversarial Defense

Classification models that are robust to adversarial examples are referred to as adversarial defense models. We introduce the most advanced defense models in two categories. Empirically, the most successful defense model is based on adversarial training [4, 26], that is, augmenting adversarial examples during training to help improve model robustness. TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization (TRADES) [27] is a variant of adversarial training that introduces a regularization term for adversarial examples:
$$\mathcal{L} = \mathcal{L}(f(x), y) + \gamma \, \mathcal{L}(f(x), f(x_{adv})),$$
where $\mathcal{L}(\cdot, \cdot)$ is the cross-entropy loss. This defense model won 1st place in the NeurIPS 2018 Adversarial Vision Challenge (Robust Model Track) and has shown better performance compared to previous models [26]. On the other hand, the state-of-the-art approach for provable robustness is proposed by [17], where a dual network is considered for computing a bound for adversarial perturbation using linear programming (LP), as in [13].
They optimize the bound during training to achieve strong provable robustness. Although empirical robustness and provable robustness are often considered as orthogonal research directions, we propose an approach that provides both. In our experiments, presented in Sec. 6, we show our approach is competitive with both of the aforementioned methods.

4 Certified Robust Classifier

We propose a framework that yields an upper bound on the tolerable size of attacks, enabling certified robustness for a classifier. Intuitively, our approach adds random noise to the pixels of adversarial examples before classification, to eliminate the effects of adversarial perturbations.

Algorithm 1 Certified Robust Classifier
Require: An input image x; a standard deviation σ > 0; a classifier f over {1, . . . , k}; a number of iterations n (n = 1 is sufficient if only the robust classification c is desired).
1: Set i = 1.
2: for i ∈ [n] do
3:   Add i.i.d. Gaussian noise N(0, σ²) to each pixel of x and apply the classifier f to it. Let the output be $c_i = f(x + N(0, \sigma^2 I))$.
4: end for
5: Estimate the distribution of the output as $p_j = \#\{c_i = j : i = 1, \dots, n\} / n$.
6: Calculate the upper bound
$$L = \sup_{\alpha > 1} \left( -\frac{2\sigma^2}{\alpha} \log \left( 1 - p_{(1)} - p_{(2)} + 2 \left( \tfrac{1}{2} \left( p_{(1)}^{1-\alpha} + p_{(2)}^{1-\alpha} \right) \right)^{\frac{1}{1-\alpha}} \right) \right)^{1/2},$$
where $p_{(1)}$ and $p_{(2)}$ are the first and the second largest values in $p_1, \dots, p_k$.
7: Return the classification result $c = \operatorname{argmax}_i p_i$ and the tolerable size of the attack L.

Our approach is summarized in Algorithm 1. In the following, we develop theory to prove the certified robustness of the proposed algorithm. Our goal is to show that if the classification of x in Algorithm 1 is in class c, then for any example x' such that $\|x - x'\|_2 \leq L$, the classification of x' is also in class c.

To prove our claim, first recall that a stochastic classifier f over {1, . . . , k} has an output f(x) corresponding to a multinomial distribution over {1, . . . , k}, with probabilities $(p_1, \dots, p_k)$. In this context, robustness to an adversarial example x' generated from x means $\operatorname{argmax}_i p_i = \operatorname{argmax}_j p'_j$, with $P(f(x) = i) = p_i$ and $P(f(x') = j) = p'_j$, where P(·) denotes the probability of an event. In the remainder of this section, we show Algorithm 1 achieves such robustness based on the Rényi divergence, starting with the following lemma.

Lemma 1 Let $P = (p_1, \dots, p_k)$ and $Q = (q_1, \dots, q_k)$ be two multinomial distributions over the same index set $\{1, \dots, k\}$. If the indices of the largest probabilities do not match on P and Q, that is $\operatorname{argmax}_i p_i \neq \operatorname{argmax}_j q_j$, then
$$D_\alpha(Q \| P) \geq -\log \left( 1 - p_{(1)} - p_{(2)} + 2 \left( \tfrac{1}{2} \left( p_{(1)}^{1-\alpha} + p_{(2)}^{1-\alpha} \right) \right)^{\frac{1}{1-\alpha}} \right),$$
where $p_{(1)}$ and $p_{(2)}$ are the largest and the second largest probabilities among the set of all $p_i$.

To simplify notation, we define the generalized mean $M_p(x_1, \dots, x_n) = \left( \frac{1}{n} \sum_{i=1}^n x_i^p \right)^{1/p}$. The right hand side (RHS) of the condition in Lemma 1 then becomes $-\log\left(1 - 2M_1(p_{(1)}, p_{(2)}) + 2M_{1-\alpha}(p_{(1)}, p_{(2)})\right)$. Lemma 1 gives a lower bound on the Rényi divergence required for changing the index of the maximum of P, i.e., for any distribution Q, if $D_\alpha(Q \| P) < -\log\left(1 - 2M_1(p_{(1)}, p_{(2)}) + 2M_{1-\alpha}(p_{(1)}, p_{(2)})\right)$, the index of the maximum of P and Q must be the same. Based on Lemma 1, we obtain our main theorem on certified robustness as follows, validating our claim.

Theorem 2 Suppose we have $x \in \mathcal{X}$, and a potential adversarial example $x' \in \mathcal{X}$ such that $\|x - x'\|_2 \leq L$. Given a k-classifier $f : \mathcal{X} \to \{1, \dots, k\}$, let $f(x + N(0, \sigma^2 I)) \sim (p_1, \dots, p_k)$ and $f(x' + N(0, \sigma^2 I)) \sim (p'_1, \dots, p'_k)$.
If the following condition is satisfied, with $p_{(1)}$ and $p_{(2)}$ being the first and second largest probabilities in $\{p_i\}$:
$$\sup_{\alpha > 1} \left( -\frac{2\sigma^2}{\alpha} \log \left( 1 - 2M_1(p_{(1)}, p_{(2)}) + 2M_{1-\alpha}(p_{(1)}, p_{(2)}) \right) \right) \geq L^2,$$
then $\operatorname{argmax}_i p_i = \operatorname{argmax}_j p'_j$.

The conclusion of Theorem 2 can be extended to the ℓ1 case by replacing Gaussian with Laplacian noise. Specifically, notice that the Rényi divergence between two Laplacian distributions $\Lambda(x, \lambda)$ and $\Lambda(x', \lambda)$ is
$$\frac{1}{\alpha - 1} \log \left( \frac{\alpha}{2\alpha - 1} \exp\left( \frac{(\alpha - 1) \|x - x'\|_1}{\lambda} \right) + \frac{\alpha - 1}{2\alpha - 1} \exp\left( -\frac{\alpha \|x - x'\|_1}{\lambda} \right) \right).$$
Meanwhile, $-\log\left(1 - 2M_1(p_{(1)}, p_{(2)}) + 2M_{1-\alpha}(p_{(1)}, p_{(2)})\right) \to -\log\left(1 - p_{(1)} + p_{(2)}\right)$ as $\alpha \to \infty$; thus we have the following upper bound for ℓ1 perturbations.

Theorem 3 In the same setting as in Theorem 2, with $\|x - x'\|_1 \leq L$, let $f(x + \Lambda(0, \lambda)) \sim (p_1, \dots, p_k)$ and $f(x' + \Lambda(0, \lambda)) \sim (p'_1, \dots, p'_k)$. If $-\lambda \log\left(1 - p_{(1)} + p_{(2)}\right) \geq L$ is satisfied, then $\operatorname{argmax}_i p_i = \operatorname{argmax}_j p'_j$.

In the rest of this paper, we focus on the ℓ2 norm with Gaussian noise, but most conclusions are also applicable to the ℓ1 norm with Laplacian noise. A more comprehensive analysis for the ℓ1 norm can be found in [20]. Interestingly, they have proved that the bound $-\lambda \log(1 - p_{(1)} + p_{(2)})$ is tight in the binary case for the ℓ1 norm [20].

With Theorem 2, we can enable certified ℓ2 (ℓ1) robustness for any classifier f by adding i.i.d. Gaussian (Laplacian) noise to the pixels of inputs during testing, as done in Algorithm 1. It provides an upper bound on the tolerable size of attacks for a classifier, i.e., as long as the perturbation size is less than the upper bound (the sup in Theorem 2), any adversarial sample can be classified correctly.

Confidence Interval and Sample Size  In practice we can only estimate $p_{(1)}$ and $p_{(2)}$ from samples, thus the obtained lower bound is not precise and requires adjustment. Note that $(p_1, \dots, p_k)$ forms a multinomial distribution, and therefore confidence intervals for $p_{(1)}$ and $p_{(2)}$ can be estimated using the one-sided Clopper-Pearson interval along with a Bonferroni correction. We refer to [18] for further details. In all our subsequent experiments, we use the end points (lower for $p_{(1)}$ and upper for $p_{(2)}$) of the 95% confidence intervals for estimating $p_{(1)}$ and $p_{(2)}$, and multiply the corresponding accuracy by 95%. Moreover, the estimates of the confidence intervals become more precise when we increase the sample size n, but at the cost of extra computational burden. In practice, we find a sample size of n = 100 is sufficient.

Choice of σ  The formula for our lower bound indicates that a higher standard deviation σ results in a higher bound. In practice, however, a larger amount of added noise also leads to higher classification error and a smaller gap between $p_{(1)}$ and $p_{(2)}$, which gives a smaller bound. Therefore, the best choice of σ is not obvious. We demonstrate the effect of different choices of σ in the experiments of Sec. 6.

5 Improved Certified Robustness

Based on the properties of the generalized mean, one can show that the upper bound is larger when the difference between $p_{(1)}$ and $p_{(2)}$ becomes larger. This is consistent with the intuition that a larger difference between $p_{(1)}$ and $p_{(2)}$ indicates more confident classification. In other words, more confident and accurate prediction in the presence of additive Gaussian noise, in the sense that $p_{(1)}$ is much larger than $p_{(2)}$, leads to better certified robustness. To this end, a connection between robustness to adversarial examples and robustness to added random noise has been established by our analysis.
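Before turning to how this connection can be exploited in training, the following minimal NumPy sketch makes the certification procedure of Algorithm 1 and Theorem 2 concrete; it is not the released implementation linked in Sec. 6. The base-classifier interface `f`, the finite grid over α, and the plug-in estimates of p(1) and p(2) are illustrative assumptions, whereas the experiments use Clopper-Pearson confidence bounds with a Bonferroni correction.

```python
import numpy as np

def certified_l2_radius(p1, p2, sigma, alphas=np.linspace(1.001, 25.0, 1000)):
    """Certified L2 radius from Theorem 2: the supremum over alpha > 1 of
    sqrt(-(2 * sigma^2 / alpha) * log(1 - 2*M_1 + 2*M_{1-alpha})), where M_r is
    the generalized mean of (p1, p2) with exponent r.  The supremum is
    approximated on a finite grid of alpha values."""
    if p1 <= p2:                      # no margin between top classes -> no certificate
        return 0.0
    p2 = max(p2, 1e-12)               # avoid 0 ** (negative exponent)
    m1 = 0.5 * (p1 + p2)              # M_1(p1, p2)
    best = 0.0
    for a in alphas:
        m1a = (0.5 * (p1 ** (1.0 - a) + p2 ** (1.0 - a))) ** (1.0 / (1.0 - a))  # M_{1-alpha}
        inner = 1.0 - 2.0 * m1 + 2.0 * m1a
        if inner <= 0.0:
            continue
        best = max(best, np.sqrt(-(2.0 * sigma ** 2 / a) * np.log(inner)))
    return best

def certified_l1_radius(p1, p2, lam):
    """Certified L1 radius from Theorem 3, for Laplacian noise with scale lam."""
    return -lam * np.log(1.0 - p1 + p2) if p1 > p2 else 0.0

def certify(f, x, sigma, n=100):
    """Algorithm 1 (sketch): classify x under additive Gaussian noise and return
    the majority-vote class with a certified L2 radius.  `f` is assumed to map a
    noisy image to a class index in {0, ..., k-1}."""
    votes = [f(x + np.random.normal(0.0, sigma, size=x.shape)) for _ in range(n)]
    counts = np.bincount(votes)
    probs = np.sort(counts / float(n))[::-1]
    p1 = probs[0]
    p2 = probs[1] if len(probs) > 1 else 0.0
    return int(np.argmax(counts)), certified_l2_radius(p1, p2, sigma)
```

A per-image certificate obtained this way is what Sec. 6.1 aggregates into the accuracy lower bounds of Figure 1.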
Such a connection is beneficial, because robustness to additive Gaussian noise is much easier to achieve than robustness to carefully crafted adversarial examples. Consequently, it is natural to consider improving the adversarial robustness of a model by first improving its robustness to added random noise. In the context of Algorithm 1, we aim to improve the robustness of f to additive random noise. Note that improving robustness to added Gaussian noise as a way of improving adversarial robustness has been proposed by [28] and was later shown ineffective [29]. Our method is different in that it requires added Gaussian noise during the testing phase, and more importantly it is supported theoretically.

There have been notable efforts at developing neural networks that are robust to added random noise [30, 31]; yet, these methods failed to defend against adversarial attacks, as they are not particularly designed for this task. Within our framework, since Algorithm 1 places no constraint on the classifier f, which gives the flexibility to modify f, we can adapt these methods to improve the accuracy of classification when Gaussian noise is present, hence improving the robustness to adversarial attacks. In this paper, we only discuss stability training, but a much wider scope of literature exists for robustness to added random noise [30, 31].

5.1 Stability Training

The idea of introducing perturbations during training to improve model robustness has been studied widely. In [32] the authors considered perturbing models as a construction of pseudo-ensembles, to improve semi-supervised learning. More recently, [33] used a similar training strategy, named stability training, to improve classification robustness on noisy images.

For any natural image x, stability training encourages its perturbed version x' to yield a similar classification result under a classifier f, i.e., D(f(x), f(x')) is small for some distance measure D. Specifically, given a loss function $\mathcal{L}_0$ for the original classification task, stability training introduces a regularization term:
$$\mathcal{L}(x, x') = \mathcal{L}_0 + \gamma \mathcal{L}_{\text{stability}}(x, x') = \mathcal{L}_0 + \gamma D(f(x), f(x')),$$
where γ controls the strength of the stability term. As we are interested in a classification task, we use the cross-entropy between f(x) and f(x') as the distance D, yielding the stability loss
$$\mathcal{L}_{\text{stability}} = -\sum_j P(y_j \mid x) \log P(y_j \mid x'),$$
where $P(y_j \mid x)$ and $P(y_j \mid x')$ are the probabilities generated after softmax. In this paper, we add i.i.d. Gaussian noise to each pixel of x to construct x', as suggested in [33] (a training-step sketch is given at the end of this section).

Stability training is in the same spirit as adversarial training, but is only designed to improve the classification accuracy under Gaussian perturbation. Within our framework, we can apply stability training to f, to improve the robustness of Algorithm 1 against adversarial perturbations. We call the resulting defense method Stability Training with Noise (STN).

Adversarial Logit Pairing  Adversarial Logit Pairing (ALP) was proposed in [34]; it adds D(f(x), f(x')) as a regularizer, with x' being an adversarial example. Subsequent work has shown that ALP fails to obtain adversarial robustness [35]. Our method is different from ALP and any other regularizer-based approach, as the key component of our framework is the added Gaussian noise at the testing phase of Algorithm 1, while stability training is merely a technique for improving the robustness further. We do not claim that stability training alone yields adversarial robustness.
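The training-step sketch referenced in Sec. 5.1 is given below in PyTorch. The model interface, the noise level σ, and the weight γ are illustrative assumptions rather than the exact training configuration used in the experiments.

```python
import torch
import torch.nn.functional as F

def stn_training_loss(model, x, y, sigma=0.7, gamma=1.0):
    """Stability training with Gaussian noise (sketch).

    L(x, x') = L0(f(x), y) + gamma * L_stability(x, x'), where x' is a
    Gaussian-perturbed copy of x and L_stability is the cross-entropy between
    the softmax outputs on x and on x'."""
    logits_clean = model(x)                          # f(x)
    x_noisy = x + sigma * torch.randn_like(x)        # x' = x + N(0, sigma^2 I), per pixel
    logits_noisy = model(x_noisy)                    # f(x')
    task_loss = F.cross_entropy(logits_clean, y)     # L0: standard classification loss
    # L_stability = - sum_j P(y_j | x) * log P(y_j | x')
    p_clean = F.softmax(logits_clean, dim=1)
    log_p_noisy = F.log_softmax(logits_noisy, dim=1)
    stability_loss = -(p_clean * log_p_noisy).sum(dim=1).mean()
    return task_loss + gamma * stability_loss

# Typical usage inside a training loop:
#   loss = stn_training_loss(model, images, labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```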
6 Experiments

We perform experiments on the MNIST and CIFAR-10 data sets, to evaluate the theoretical and empirical performance of our methods. We subsequently also consider the larger ImageNet dataset. For the MNIST data set, the model architecture follows the models used in [36], which consist of two convolutional layers, each containing 64 filters, followed by a fully connected layer of size 128. For the CIFAR-10 dataset, we use a convolutional neural network with seven convolutional layers along with max pooling. In both datasets, image intensities are scaled to [0, 1], and the sizes of attacks are rescaled accordingly. For reference, a distortion of 0.031 in the [0, 1] scale corresponds to 8 in the [0, 255] scale. The source code can be found at https://github.com/Bai-Li/STN-Code.

6.1 Theoretical Bound

With Algorithm 1, we are able to classify a natural image x and calculate an upper bound L on the tolerable size of attacks for this particular image. Thus, for a given attack size L', the classification must be robust if L' < L. If a natural example is correctly classified and robust for L' simultaneously, any adversarial example x' with $\|x - x'\|_2 < L'$ will be classified correctly. Therefore, we can determine the proportion of such examples in the test set as a lower bound on the accuracy at size L'. We plot in Figure 1 the resulting lower bounds for various choices of σ and L', for both MNIST and CIFAR-10. To interpret the results, for example on MNIST, when σ = 0.7, Algorithm 1 achieves at least 51% accuracy under any attack whose ℓ2-norm size is smaller than 1.4.

Figure 1: Accuracy lower bounds for MNIST (left) and CIFAR-10 (right). We test various choices of σ in Algorithm 1. For reference, we include results for PixelDP (green) and the lower bound without stability training (orange).

When σ is fixed, there exists a threshold beyond which the certified lower bound degenerates to zero. The larger the deviation σ is, the later the degeneration happens. However, larger σ also leads to a worse bound when L' is small, as the added noise reduces the accuracy of classification. As an ablation study, we include the corresponding lower bound without stability training. The improvement due to stability training is significant. In addition, to demonstrate the improvement of the certified bound compared to PixelDP [18], we also include the results for PixelDP. Although PixelDP also has tuning parameters (δ and ε) playing a role similar to σ in our setting, we only include the optimal pair of parameters found by grid search, for simplicity of the plots. One observes that the accuracy lower bounds for PixelDP (green) are dominated by our bounds.

We also compare STN with the approach from [17]. Besides training a single robust classifier, [17] also proposed a strategy of training a sequence of robust classifiers as a cascade model, which results in better provable robustness, although it reduces the accuracy on natural examples. We compare both to STN in Table 1. Since both methods show a clear trade-off between the certified bound and the corresponding robust accuracy, we include the certified bounds and the corresponding accuracy in parentheses for both models, along with the accuracy on natural examples.

Table 1: Comparison on MNIST and CIFAR-10. The numbers a (b%) mean a certified bound a with the corresponding accuracy b%.

| Model          | MNIST Robust Accuracy | MNIST Natural | CIFAR-10 Robust Accuracy | CIFAR-10 Natural |
|----------------|-----------------------|---------------|--------------------------|------------------|
| [17] (Single)  | 1.58 (43.5%)          | 88.2%         | 36.00 (53.0%)            | 61.2%            |
| [17] (Cascade) | 1.58 (74.6%)          | 81.4%         | 36.00 (58.7%)            | 68.8%            |
| STN            | 1.58 (69.0%)          | 98.9%         | 36.00 (65.6%)            | 80.5%            |

Our bound is close to the one from [17] on MNIST, and becomes better on CIFAR-10. In addition, since the training objective of [17] is particularly designed for provable robustness and depends on the size of attacks, its accuracy on natural examples decreases drastically when accommodating stronger attacks. On the other hand, STN is capable of keeping a high accuracy on natural examples while providing strong robustness guarantees.
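The accuracy lower bounds reported in Figure 1 and Table 1 are obtained by aggregating per-image certificates from Algorithm 1. The routine below is a minimal sketch of that aggregation step, assuming a list of (correct, radius) records collected over the test set; it is not the evaluation script from the repository linked above.

```python
import numpy as np

def certified_accuracy(records, attack_sizes):
    """Lower bound on accuracy under attacks of a given L2 size (sketch).

    `records` holds one (correct, radius) pair per test image: `correct` is
    whether the robust prediction of Algorithm 1 matches the label, and
    `radius` is the certified L2 radius L for that image.  An image counts
    toward the bound at attack size L' only if it is correct and L' < L.
    (The reported numbers additionally account for the 95% confidence level
    of the estimated radii.)"""
    correct = np.array([c for c, _ in records], dtype=bool)
    radius = np.array([r for _, r in records], dtype=float)
    return [float(np.mean(correct & (radius > l_prime))) for l_prime in attack_sizes]

# e.g. certified_accuracy(records, attack_sizes=np.linspace(0.0, 2.0, 21))
```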
Certified Robustness on ImageNet  As our framework adds almost no extra computational burden to the training procedure, we are able to compute accuracy lower bounds for ImageNet [37], a large-scale image dataset that contains over 1 million images and 1,000 classes. We compare our bound with PixelDP in Figure 2. Clearly, our bound is higher than the one obtained via PixelDP.

Figure 2: Comparison of the certified bound from STN (orange) and PixelDP (blue) on ImageNet.

6.2 Empirical Results

We next perform classification and measure the accuracy on real adversarial examples, to evaluate the empirical performance of our defense methods. For each pair of attack and defense model, we generate a robust-accuracy vs. perturbation-size curve for a comprehensive evaluation. We compare our method to TRADES on MNIST and CIFAR-10. Although we have emphasized the theoretical bound of the defense, the empirical performance is promising as well. More details and results of the experiments, such as for gradient-free attacks, are included in the Appendix.

Avoid Gradient Masking  A defensive model incorporating randomness may make it difficult to apply standard attacks by causing gradient masking, as discussed in [10], thus achieving robustness unfairly. To ensure the robustness of our approach is not due to gradient masking, we use the expectation of the gradient with respect to the randomization when estimating gradients, to ensemble over randomization and eliminate the effect of randomness, as recommended in [10, 38]. In particular, the gradient is estimated as
$$\mathbb{E}_{r \sim N(0, \sigma^2 I)} \left[ \nabla_{x+r} \mathcal{L}(\theta, x + r, y) \right] \approx \frac{1}{n_0} \sum_{i=1}^{n_0} \nabla_{x + r_i} \mathcal{L}(\theta, x + r_i, y),$$
where the $r_i$ are i.i.d. samples from the $N(0, \sigma^2 I)$ distribution, and $n_0$ is the number of samples. We assume threat models are aware of the value of σ in Algorithm 1 and use the same value for attacks.

White-box Attacks  For ℓ∞ attacks, we use Projected Gradient Descent (PGD) attacks [26]. PGD constructs an adversarial example by iteratively updating the natural input along the sign of its gradient and projecting it into the constrained space, to ensure it is a valid input. For ℓ2 attacks, we perform a Carlini & Wagner attack [8], which constructs an adversarial example by solving an optimization problem that minimizes the distortion distance while maximizing the classification error. We also use the technique from [39], which has been shown to be more effective against adversarially trained models, where the gradients are estimated as the average of the gradients of multiple randomly perturbed samples. This yields a variant of the Carlini & Wagner attack with the same form as the ensemble-over-randomization estimate mentioned above, so using it is all the more fair. The results for white-box attacks are illustrated in Figure 3.

Figure 3: MNIST and CIFAR-10: Comparison of the adversarial robustness of TRADES and STN for various attack sizes, for both ℓ2 and ℓ∞. The plots are ordered as: MNIST(ℓ2), MNIST(ℓ∞), CIFAR-10(ℓ2), and CIFAR-10(ℓ∞). Both white-box (solid lines) and black-box (dashed lines) attacks are considered.
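As a companion to the evaluation protocol above, the following is a minimal PyTorch sketch of the ensemble-over-randomization gradient estimate and a single ℓ∞ PGD step. The sample count n0, the step size, and ε are illustrative values, and the ℓ2 Carlini & Wagner attack used in the experiments is not reproduced here.

```python
import torch
import torch.nn.functional as F

def eot_gradient(model, x, y, sigma, n0=20):
    """Estimate E_{r ~ N(0, sigma^2 I)}[ grad_x L(theta, x + r, y) ] by averaging
    input gradients over n0 Gaussian-perturbed copies of x (Sec. 6.2)."""
    grad_sum = torch.zeros_like(x)
    for _ in range(n0):
        x_r = (x + sigma * torch.randn_like(x)).detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_r), y)
        grad_sum += torch.autograd.grad(loss, x_r)[0]
    return grad_sum / n0

def pgd_linf_step(x_adv, x_nat, grad, step_size=0.01, eps=0.031):
    """One L-infinity PGD update with the estimated gradient: move along the
    gradient sign, project back into the eps-ball around the natural image,
    and clip to the valid [0, 1] pixel range."""
    x_adv = x_adv + step_size * grad.sign()
    x_adv = torch.min(torch.max(x_adv, x_nat - eps), x_nat + eps)
    return x_adv.clamp(0.0, 1.0)
```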
Black-box Attacks  To better understand the behavior of our methods, and to further ensure there is no gradient masking, we include results for black-box attacks. After comparison, we find that the adversarial examples generated from Madry's model [26] result in the strongest black-box attacks against both TRADES and STN. Therefore, we apply the ℓ2 and ℓ∞ white-box attacks to Madry's model and test the resulting adversarial examples on TRADES and STN. The results are reported as the dashed lines in Figure 3.

Summary of Results  Overall, STN shows a promising level of robustness, especially regarding ℓ2-bounded distortions, as anticipated. One observes that STN performs slightly worse than TRADES when the size of attacks is small, and becomes better when the size increases. Intuitively, for small attack sizes the accuracy loss caused by the added random noise dominates, while the noise becomes beneficial against stronger attacks. It is worth noting that Algorithm 1 adds almost no computational burden, as it only requires multiple forward passes, and stability training only requires augmenting randomly perturbed examples. On the other hand, TRADES is extremely time-consuming, due to the iterative construction of adversarial examples.

Choice of σ  In the previous experiments, we use σ = 0.7 and σ = 100/255 for MNIST and CIFAR-10, respectively. However, the choice of σ plays an important role, as shown in Figure 1; therefore, we study in Figure 4 how different values of σ affect the empirical robust accuracy. The results make it clearer that the noise hurts when small attacks are considered, but helps against large attacks. Ideally, using an adaptive amount of noise, letting the amount of added noise grow with the size of attacks, could lead to better empirical results, yet this is practically impossible as the size of attacks is unknown beforehand. In addition, we include results for σ = 0, which is equivalent to a model without additive Gaussian noise. Its vulnerability indicates that the additive Gaussian noise is the essential component of our framework.

Figure 4: Robust accuracy of STN with different choices of σ, for both ℓ2 and ℓ∞ attacks. The plots are ordered as: MNIST(ℓ2), MNIST(ℓ∞), CIFAR-10(ℓ2), and CIFAR-10(ℓ∞).

7 Comparison to [40]

Following this work, [40] proposed a tighter ℓ2 bound than the one in Section 4. Although they do indeed improve our bound, our work has unique contributions in several ways: (i) We propose stability training to improve the bound and robustness, while they only use Gaussian augmentation. In general, stability training works better than Gaussian augmentation, as shown in Figure 1. Thus, stability training is an important and unique contribution of this paper. (ii) We conduct empirical evaluation against real attacks and compare to the state-of-the-art defense method (adversarial training) to show our approach is competitive. [40] only discusses the certified bound and does not provide evaluation against real attacks. (iii) The analysis in [40] is difficult to extend to other norms, because it requires isotropy. On the other hand, ours leads to a tight certified ℓ1 bound by adding Laplacian noise, as discussed in Section 4.

8 Conclusions

We propose an analysis for constructing defensive models with certified robustness. Our analysis leads to a connection between robustness to adversarial attacks and robustness to additive random perturbations.
We then propose a new strategy based on stability training for improving the robustness of the defense models. The experimental results show that our defense model provides competitive provable robustness and empirical robustness compared to state-of-the-art models. It especially yields strong robustness when strong attacks are considered. There is a noticeable gap between the theoretical lower bounds and the empirical accuracy, indicating either that the proposed bound might not be tight, or that the empirical results would be worse under stronger attacks that have not yet been developed, as has happened to many defense models. We believe each explanation points to a direction for future research.

Acknowledgments

This research was supported in part by DARPA, DOE, NIH, NSF and ONR.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016.
[2] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4700-4708, 2017.
[3] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
[4] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
[5] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574-2582, 2016.
[6] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582-597. IEEE, 2016.
[7] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
[8] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39-57. IEEE, 2017.
[9] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.
[10] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
[11] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266-2276, 2017.
[12] Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
[13] J. Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.
[14] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Duane Boning, Inderjit S. Dhillon, and Luca Daniel. Towards fast computation of certified robustness for ReLU networks. arXiv preprint arXiv:1804.09699, 2018.
[15] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pages 4944-4953, 2018.
[16] Krishnamurthy Dvijotham, Robert Stanforth, Sven Gowal, Timothy Mann, and Pushmeet Kohli. A dual approach to scalable verification of deep networks. arXiv preprint arXiv:1803.06567, 2018.
[17] Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J. Zico Kolter. Scaling provable adversarial defenses. arXiv preprint arXiv:1805.12514, 2018.
[18] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
[19] Tim van Erven and Peter Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797-3820, 2014.
[20] Anonymous. ℓ1 adversarial robustness certificates: a randomized smoothing approach. Submitted to the International Conference on Learning Representations, 2020. Under review.
[21] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971-2980, 2017.
[22] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.
[23] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632-1640, 2016.
[24] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
[25] T. B. Brown, N. Carlini, C. Zhang, C. Olsson, P. Christiano, and I. Goodfellow. Unrestricted adversarial examples. arXiv preprint arXiv:1809.08352, 2018.
[26] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
[27] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019.
[28] Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 39-49. ACM, 2017.
[29] Nicholas Carlini and David Wagner. MagNet and "Efficient Defenses Against Adversarial Attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478, 2017.
[30] Junyuan Xie, Linli Xu, and Enhong Chen. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pages 341-349, 2012.
[31] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing, 26(7):3142-3155, 2017.
[32] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems, pages 3365-3373, 2014.
[33] Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480-4488, 2016.
[34] Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.
[35] Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272, 2018.
[36] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
[37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[38] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
[39] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. On norm-agnostic robustness of adversarial training, 2019.
[40] Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.