# Second-Order Provable Defenses against Adversarial Attacks

Sahil Singla, Soheil Feizi

Department of Computer Science, University of Maryland, College Park. Correspondence to: Sahil Singla, Soheil Feizi.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

## Abstract

A robustness certificate against adversarial examples is the minimum distance of a given input to the decision boundary of the classifier (or a lower bound on it). For any perturbation of the input with a magnitude smaller than the certificate value, the classification output will provably remain unchanged. Computing exact robustness certificates for neural networks is difficult in general since it requires solving a non-convex optimization. In this paper, we provide computationally-efficient robustness certificates for neural networks with differentiable activation functions in two steps. First, we show that if the eigenvalues of the Hessian of the network (the curvatures of the network) are bounded (globally or locally), we can compute a robustness certificate in the $\ell_2$ norm efficiently using convex optimization. Second, we derive a computationally-efficient differentiable upper bound on the curvature of a deep network. We also use the curvature bound as a regularization term during training to boost the certified robustness of the network. Putting these results together leads to our proposed Curvature-based Robustness Certificate (CRC) and Curvature-based Robust Training (CRT). Our numerical results show that CRT leads to significantly higher certified robust accuracy compared to interval-bound propagation (IBP) based training. On the MNIST dataset, we achieve certified robust accuracies of 69.79%, 57.78% and 53.19% on 2-, 3- and 4-layer networks, respectively, while IBP-based methods achieve 44.96%, 44.74% and 44.66%.

## 1. Introduction

Modern neural networks achieve high accuracy on tasks such as image classification and speech recognition, but are known to be brittle to small, adversarially chosen perturbations of their inputs (Szegedy et al., 2014). A classifier that correctly classifies an image $x$ can be fooled by an adversary into misclassifying an adversarial example $x + \delta$, such that $x + \delta$ is indistinguishable from $x$ to a human. Adversarial examples can also fool systems when they are printed out on paper and photographed with a smartphone (Kurakin et al., 2016a). Even in a black-box threat model, where the adversary has no access to the model parameters, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a yield or another sign (Papernot et al., 2016). This trend is worrisome and suggests that adversarial vulnerabilities need to be appropriately addressed before neural networks can be deployed in security-critical applications.

In this work, we propose a new approach for developing provable defenses against $\ell_2$-bounded adversarial attacks as well as for computing robustness certificates of pre-trained deep networks with differentiable activations. In contrast to existing certificates (Weng et al., 2018; Zhang et al., 2018b) that use first-order information (upper and lower bounds on the slope), our approach is based on second-order information (upper and lower bounds on the curvature values, i.e., the eigenvalues of the Hessian).
Our approach is based on two key theoretically-justified steps. First, in Theorems 1 and 2, we show that if the eigenvalues of the Hessian of the network (curvatures of the network) are bounded (globally or locally), we can efficiently compute a robustness certificate and develop a defense method against $\ell_2$-bounded adversarial attacks using convex optimization. Second, in Theorem 4, we derive a computationally-efficient differentiable bound on the curvature (eigenvalues of the Hessian) of a deep network. We derive this bound by explicitly characterizing the Hessian of a deep network in Lemma 1.

Although the problem of finding the closest adversarial example to a given point leads to a non-convex optimization for deep networks, our proposed Curvature-based Robustness Certificate (CRC), under some verifiable conditions, is able to compute points on the decision boundary that are provably closest to the input. That is, it provides the tightest certificate in those cases. For example, for 2-, 3- and 4-layer networks trained on MNIST, we can find provably closest adversarial points in 44.17%, 22.59% and 19.53% of cases, respectively (Table 2). To the best of our knowledge, our method is the first approach that can efficiently compute provably closest adversarial examples for a significant fraction of examples in non-trivial neural networks.

We note that un-regularized networks, especially deep ones, can have large curvature bounds, which can lead to small robustness certificates. However, by using the derived curvature bound as a regularizer during training, we significantly decrease the curvature values of the network, with little or no decrease in its performance (Table 5, Figure 1). Using this technique, our method significantly outperforms interval-bound propagation (IBP) (Wong et al., 2018; Zhang et al., 2019a) and achieves state-of-the-art certified accuracy (Tables 3 and 4). In particular, our method achieves certified robust accuracies of 69.79%, 57.78% and 53.19% while IBP-based methods achieve 44.96%, 44.74% and 44.66% on 2-, 3- and 4-layer networks, respectively, on the MNIST dataset (with similar results on Fashion-MNIST).

Other recent works (e.g., Moosavi-Dezfooli et al. (2019); Qin et al. (2019)) empirically show that using an estimate of the curvature at inputs as a regularizer leads to empirical robustness on par with adversarial training. In this work, however, we use a bound on the absolute value of the curvature (and not an estimate) as a regularizer and show that it results in high certified robustness. Moreover, previous works have tried to certify robustness by bounding the Lipschitz constant of the neural network (Anil et al., 2018; Hein & Andriushchenko, 2017; Peck et al., 2017; Szegedy et al., 2014; Zhang et al., 2018c). Our approach, however, is based on bounding the Lipschitz constant of the gradient, which in turn leads to bounds on the eigenvalues of the Hessian of deep neural networks.

In summary, we make the following contributions:

- We derive a closed-form expression for the Hessian of a deep network with differentiable activation functions (Lemma 1) and derive bounds on the curvature using this closed-form formula (Theorems 3 and 4).
- We develop computationally efficient methods for both the robustness certification and the adversarial attack problems (Theorems 1 and 2). We provide verifiable conditions under which our method is able to compute points on the decision boundary that are provably closest to the input.
- Empirically, we show that this condition holds for a significant fraction of examples (Table 2).
- We show that using our proposed curvature bounds as a regularizer during training leads to improved certified accuracy on 2-, 3- and 4-layer networks (on the MNIST and Fashion-MNIST datasets) compared to IBP-based adversarial training (Wong & Kolter, 2017; Zhang et al., 2019a) (Tables 3 and 4). Our robustness certificate (CRC) significantly outperforms CROWN's certificate (Zhang et al., 2018b) when trained with our regularizer (Table 5).

To the best of our knowledge, this is the first work that (a) demonstrates the utility of second-order information for provable robustness, (b) derives a framework to find the exact robustness certificate in the $\ell_2$ norm and the exact worst-case adversarial perturbation in an $\ell_2$ ball of a given radius under some conditions, and (c) derives an exact closed-form expression for the Hessian and bounds on the curvature values using it.

## 2. Related work

In the last couple of years, several empirical defenses have been proposed for training classifiers to be robust against adversarial perturbations (Kurakin et al., 2016b; Madry et al., 2018; Miyato et al., 2017; Papernot et al., 2016; Samangouei et al., 2018; Zhang et al., 2019b; Zheng et al., 2016). Although these defenses robustify classifiers against particular types of attacks, they can still be vulnerable to stronger attacks (Athalye & Carlini, 2018; Athalye et al., 2018; Carlini & Wagner, 2017; Laidlaw & Feizi, 2019; Uesato et al., 2018). For example, Athalye et al. (2018) showed that most of the empirical defenses proposed in ICLR 2018 can be broken by developing tailored attacks for each of them.

To end the cycle between defenses and attacks, a line of work on certified defenses has gained attention, where the goal is to train classifiers whose predictions are provably robust within some given region (Bunel et al., 2017; Carlini et al., 2017; Cheng et al., 2017; Croce et al., 2018; Dutta et al., 2018; Dvijotham et al., 2018a;b; Ehlers, 2017; Fischetti & Jo, 2018; Gehr et al., 2018; Gowal et al., 2018; Huang et al., 2016; Katz et al., 2017; Levine & Feizi, 2020a;b;c; Lomuscio & Maganti, 2017; Mirman et al., 2018; Raghunathan et al., 2018a;b; Singh et al., 2018; Wang et al., 2018a;b; Weng et al., 2018; Wong & Kolter, 2017; Wong et al., 2018; Zhang et al., 2018b; 2019a). These methods, however, do not scale to the large, practical networks used in modern machine learning problems. Another line of defense work focuses on randomized smoothing, where the prediction is robust within some region around the input with a user-chosen probability (Cao & Gong, 2017; Cohen et al., 2019; Lécuyer et al., 2018; Li et al., 2018; Liu et al., 2017; Salman et al., 2019). Although these methods can scale to large networks, certifying robustness with probability close to 1 often requires generating a large number of noisy samples around the input, which leads to high inference-time computational complexity. We discuss existing works in more detail in Appendix A.

Table 1. A summary of the primal and dual concepts used in the paper. $f$ denotes the function of the decision boundary, i.e., $z^{(L)}_y - z^{(L)}_t$, where $y$ is the true label and $t$ is the attack target. $m$ and $M$ are lower and upper bounds on the smallest and largest eigenvalues of the Hessian of $f$, respectively.
|  | Certificate problem ((·) = cert) | Attack problem ((·) = attack) |
|---|---|---|
| primal problem, $p^*_{(\cdot)}$ | $\min_{f(x)=0} \frac{1}{2}\|x - x^{(0)}\|^2$ | $\min_{\|x - x^{(0)}\| \le \rho} f(x)$ |
| dual function, $d_{(\cdot)}(\eta)$ | $\min_x \frac{1}{2}\|x - x^{(0)}\|^2 + \eta f(x)$ | $\min_x f(x) + \frac{\eta}{2}\left(\|x - x^{(0)}\|^2 - \rho^2\right)$ |
| When is the dual solvable? | $-1/M \le \eta \le -1/m$ | $-m \le \eta$ |
| dual problem, $d^*_{(\cdot)}$ | $\max_{-1/M \le \eta \le -1/m} d_{cert}(\eta)$ | $\max_{-m \le \eta} d_{attack}(\eta)$ |
| When does primal = dual? | $f(x^{(cert)}) = 0$ | $\|x^{(attack)} - x^{(0)}\| = \rho$ |

## 3. Notation

Consider a fully connected neural network with $L$ layers and $N_I$ neurons in the $I$th layer ($L \ge 2$, $I \in [L]$) for a classification problem with $C$ classes ($N_L = C$). The corresponding function of the neural network is $z^{(L)}: \mathbb{R}^D \to \mathbb{R}^C$, where $D$ is the dimension of the input. For an input $x$, we use $z^{(I)}(x) \in \mathbb{R}^{N_I}$ and $a^{(I)}(x) \in \mathbb{R}^{N_I}$ to denote the input (before applying the activation function) and the output (after applying the activation function) of the neurons in the $I$th hidden layer, respectively. To simplify notation, and when no confusion arises, we make the dependency of $z^{(I)}$ and $a^{(I)}$ on $x$ implicit. We define $a^{(0)}(x) = x$ and $N_0 = D$. With a fully connected architecture, each $z^{(I)}$ and $a^{(I)}$ is computed using a transformation matrix $W^{(I)} \in \mathbb{R}^{N_I \times N_{I-1}}$, a bias vector $b^{(I)} \in \mathbb{R}^{N_I}$ and an activation function $\sigma(\cdot)$ as follows:

$$z^{(I)} = W^{(I)} a^{(I-1)} + b^{(I)}, \qquad a^{(I)} = \sigma\!\left(z^{(I)}\right)$$

We use $(z^{(L)}_i - z^{(L)}_j)(x)$ as a shorthand for $z^{(L)}_i(x) - z^{(L)}_j(x)$. We use $[p]$ to denote the set $\{1, \dots, p\}$ and $[p, q]$, $p \le q$, to denote the set $\{p, p+1, \dots, q\}$. We use lowercase letters $i, j, k$ to denote indices over a vector or the rows of a matrix, and capital letters $I, J$ to denote indices over the layers of the network. The element in the $i$th position of a vector $v$ is $v_i$, the vector in the $i$th row of a matrix $A$ is $A_i$, and the element in the $i$th row and $j$th column of $A$ is $A_{i,j}$. We use $\|v\|$ and $\|A\|$ to denote the 2-norm of the vector $v$ and the operator 2-norm of the matrix $A$, respectively. We use $|v|$ and $|A|$ to denote the vector and matrix constructed by taking elementwise absolute values. We use $\lambda_{max}(A)$ and $\lambda_{min}(A)$ to denote the largest and smallest eigenvalues of a symmetric matrix $A$. We use $\mathrm{diag}(v)$ to denote the diagonal matrix constructed by placing each element of $v$ along the diagonal. We use $\odot$ to denote the Hadamard product and $I$ to denote the identity matrix. We use $\preceq$ and $\succeq$ to denote Linear Matrix Inequalities (LMIs): for two symmetric matrices $A$ and $B$, $A \preceq B$ means $B - A$ is positive semi-definite (PSD).

## 4. Using duality to solve the attack and certificate problems

Consider an input $x^{(0)}$ with true label $y$ and attack target $t$. In the certificate problem, our goal is to find a lower bound on the minimum $\ell_2$ distance between $x^{(0)}$ and the decision boundary $f(x) = 0$, where $f(x) = (z^{(L)}_y - z^{(L)}_t)(x)$. The problem of solving for the exact distance (the primal) can be written as:

$$p^*_{cert} = \min_{f(x)=0} \frac{1}{2}\left\|x - x^{(0)}\right\|^2 = \min_x \max_{\eta} \left[\frac{1}{2}\left\|x - x^{(0)}\right\|^2 + \eta f(x)\right] \quad (1)$$

However, solving the above problem can be hard in general. Using weak duality (primal $\ge$ dual), we can write the dual of the above problem as follows:

$$p^*_{cert} \ge \max_{\eta} d_{cert}(\eta), \qquad d_{cert}(\eta) = \min_x \left[\frac{1}{2}\left\|x - x^{(0)}\right\|^2 + \eta f(x)\right] \quad (2)$$

From the theory of duality, we know that $d_{cert}(\eta)$ for each value of $\eta$ gives a lower bound on the exact certification value (the primal solution) $p^*_{cert}$. However, since $f$ is non-convex, solving for $d_{cert}(\eta)$ for every $\eta$ can be difficult. In the next section, we will prove that the curvature of the function $f$ is bounded globally:

$$mI \preceq \nabla^2_x f \preceq MI \qquad \forall x \in \mathbb{R}^D \quad (3)$$

In this case, we have the following theorem ($d^*_{cert}$ is defined in Table 1):

**Theorem 1.** $d_{cert}(\eta)$ is a convex optimization problem for $-1/M \le \eta \le -1/m$.
Moreover, if $x^{(cert)}$ is the solution to $d^*_{cert}$ such that $f(x^{(cert)}) = 0$, then $p^*_{cert} = d^*_{cert}$.

Below, we briefly outline the proof; the full proof is presented in Appendix E.1. The Hessian of the objective function of the dual $d_{cert}(\eta)$, i.e., the function inside the $\min_x$, is given by:

$$\nabla^2_x \left[\frac{1}{2}\left\|x - x^{(0)}\right\|^2 + \eta f(x)\right] = I + \eta \nabla^2_x f$$

From equation (3), we know that the eigenvalues of $I + \eta \nabla^2_x f$ are bounded between $(1 + \eta m, 1 + \eta M)$ if $\eta \ge 0$, and between $(1 + \eta M, 1 + \eta m)$ if $\eta \le 0$. In both cases, for $-1/M \le \eta \le -1/m$ all eigenvalues are non-negative, making the objective function convex. When $x^{(cert)}$ satisfies $f(x^{(cert)}) = 0$, we have $d^*_{cert} = \frac{1}{2}\|x^{(cert)} - x^{(0)}\|^2$. From weak duality we have $d^*_{cert} \le p^*_{cert}$, and since $x^{(cert)}$ is feasible for the primal, the definition of $p^*_{cert}$ gives $p^*_{cert} \le d^*_{cert}$. Combining the two inequalities, we get $p^*_{cert} = d^*_{cert}$.

Next, we consider the attack problem. The goal here is to find an adversarial example inside an $\ell_2$ ball of radius $\rho$ such that $f(x)$ is minimized. Using similar arguments, we get the following theorem for the attack problem ($p^*_{attack}$, $d^*_{attack}$ and $d_{attack}$ are defined in Table 1):

**Theorem 2.** $d_{attack}(\eta)$ is a convex optimization problem for $\eta \ge -m$. Moreover, if $x^{(attack)}$ is the solution to $d^*_{attack}$ such that $\|x^{(attack)} - x^{(0)}\| = \rho$, then $p^*_{attack} = d^*_{attack}$.

The proof is presented in Appendix E.2. We note that both Theorems 1 and 2 hold for any non-convex function with continuous gradients; they may therefore also be of interest in other problems, such as the optimization of neural networks. Using Theorems 1 and 2, we have the following definitions for the certification and attack optimizations:

**Definition 1.** (Curvature-based Certificate Optimization) Given an input $x^{(0)}$ with true label $y$ and false target $t$, we define $(\eta^{(cert)}, x^{(cert)})$ as the solution of the following max-min optimization:

$$\max_{-1/M \le \eta \le -1/m} \; \min_x \left[\frac{1}{2}\left\|x - x^{(0)}\right\|^2 + \eta f(x)\right]$$

We refer to $\|x^{(cert)} - x^{(0)}\|$ as the Curvature-based Robustness Certificate (CRC).

**Definition 2.** (Curvature-based Attack Optimization) Given an input $x^{(0)}$ with label $y$, false target $t$, and $\ell_2$ ball radius $\rho$, we define $(\eta^{(attack)}, x^{(attack)})$ as the solution of the following optimization:

$$\max_{\eta \ge -m} \; \min_x \left[\frac{\eta}{2}\left(\left\|x - x^{(0)}\right\|^2 - \rho^2\right) + f(x)\right]$$

When $x^{(attack)}$ is used for training in an adversarial training framework, we call the method Curvature-based Robust Training (CRT).

A direct implication of Theorems 1 and 2 is that the tightness of our robustness certificate crucially depends on the tightness of our curvature bounds, $m$ and $M$. If $m$ and $M$ are very large compared to the true eigenvalue bounds of the Hessian of the network, the resulting robustness certificate will be vacuous. In Table 5 (and Figure 1), we show that by adding the derived bound as a regularization term during training, we can significantly decrease the curvature bounds of the network, with little or no decrease in its performance. This leads to high robustness certificates against adversarial attacks.
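To make Definitions 1 and 2 concrete, the following is a minimal sketch of the certificate optimization for a differentiable boundary function $f = z^{(L)}_y - z^{(L)}_t$ with known curvature bounds $m < 0 < M$. The solver choices here (a simple grid over $\eta$ and plain gradient descent for the inner convex problem) and the function names are illustrative rather than the exact procedure used in our experiments.

```python
import torch

def inner_min(f, x0, eta, steps=500, lr=0.1):
    """d_cert(eta) = min_x 0.5*||x - x0||^2 + eta*f(x); convex for -1/M <= eta <= -1/m."""
    x = x0.clone().requires_grad_(True)
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        obj = 0.5 * (x - x0).pow(2).sum() + eta * f(x)
        obj.backward()
        opt.step()
    with torch.no_grad():
        d_eta = 0.5 * (x - x0).pow(2).sum() + eta * f(x)
    return d_eta.item(), x.detach()

def crc_certificate(f, x0, m, M, num_etas=20):
    """Maximize the (concave) dual d_cert(eta) over a grid of admissible eta values."""
    best_d, best_x = -float("inf"), None
    for eta in torch.linspace(-1.0 / M, -1.0 / m, num_etas):
        d_eta, x_eta = inner_min(f, x0, eta.item())
        if d_eta > best_d:
            best_d, best_x = d_eta, x_eta
    # best_d lower-bounds 0.5 * (distance to the boundary)^2.  If f(best_x) = 0,
    # Theorem 1 says the bound is tight and ||best_x - x0|| is the CRC of Definition 1.
    return (best_x - x0).norm().item(), best_d
```

The attack optimization of Definition 2 can be handled the same way, with the inner objective replaced by $\frac{\eta}{2}(\|x - x^{(0)}\|^2 - \rho^2) + f(x)$ and the outer search restricted to $\eta \ge -m$.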
## 5. Curvature bounds for deep networks

In this section, we provide a computationally efficient approach to compute curvature bounds for neural networks with differentiable activation functions. To the best of our knowledge, there is no prior work on finding provable bounds on the curvature values of deep neural networks.

### 5.1. Closed-form expression for the Hessian

Using the chain rule for second derivatives, we can express $\nabla^2_x z^{(L)}_i$ as a sum of matrix products:

**Lemma 1.** Given an $L$-layer neural network, the Hessian of the $i$th output unit with respect to the input $x$, i.e., $\nabla^2_x z^{(L)}_i$, is given by the following formula:

$$\nabla^2_x z^{(L)}_i = \sum_{I=1}^{L-1} \left(B^{(I)}\right)^T \mathrm{diag}\!\left(F^{(L,I)}_i \odot \sigma''\!\left(z^{(I)}\right)\right) B^{(I)}$$

where $B^{(I)}$ is the Jacobian of $z^{(I)}$ with respect to $x$ (of dimensions $N_I \times D$), and $F^{(L,I)}$ is the Jacobian of $z^{(L)}$ with respect to $a^{(I)}$ (of dimensions $N_L \times N_I$).

The proof is presented in Appendix E.3. Using the chain rule, we can compute the matrices $B^{(I)}$ and $F^{(L,I)}$ in Lemma 1 recursively as follows:

$$B^{(I)} = \begin{cases} W^{(1)}, & I = 1 \\ W^{(I)} \, \mathrm{diag}\!\left(\sigma'(z^{(I-1)})\right) B^{(I-1)}, & I \ge 2 \end{cases}$$

$$F^{(L,I)} = \begin{cases} W^{(L)}, & I = L-1 \\ W^{(L)} \, \mathrm{diag}\!\left(\sigma'(z^{(L-1)})\right) F^{(L-1,I)}, & I \le L-2 \end{cases}$$

This leads to a fast, back-propagation-like method for computing the Hessian. Note that Lemma 1 only assumes a matrix multiplication operation from $a^{(I-1)}$ to $z^{(I)}$. Since a convolution operation can also be expressed as a matrix multiplication, we can directly extend this lemma to deep convolutional networks. Furthermore, Lemma 1 can also be of independent interest in other related problems, such as second-order interpretation methods for deep learning (e.g., Singla et al. (2019)).

### 5.2. Curvature bounds for two-layer networks

For a two-layer network, using Lemma 1, $\nabla^2_x (z^{(2)}_y - z^{(2)}_t)$ is given by:

$$\left(W^{(1)}\right)^T \mathrm{diag}\!\left(\left(W^{(2)}_y - W^{(2)}_t\right) \odot \sigma''\!\left(z^{(1)}\right)\right) W^{(1)}$$

In the above equation, note that only the term $\sigma''(z^{(1)})$ depends on $x$. We can maximize and minimize each element in the diag term, $(W^{(2)}_{y,i} - W^{(2)}_{t,i})\,\sigma''(z^{(1)}_i)$, independently, subject to the constraint that $\sigma''(\cdot)$ is bounded. Using this procedure, we construct matrices $P$ and $N$ that satisfy the properties given in the following theorem:

**Theorem 3.** Given a two-layer network whose activation function has a bounded second derivative, $h_L \le \sigma''(x) \le h_U$ for all $x \in \mathbb{R}$:

(a) We have the following linear matrix inequalities (LMIs): $N \preceq \nabla^2_x (z^{(2)}_y - z^{(2)}_t) \preceq P$ for all $x \in \mathbb{R}^D$.

(b) If $h_U \ge 0$ and $h_L \le 0$, then $P$ is a PSD matrix and $N$ is an NSD matrix.

(c) This gives the following global bounds on the eigenvalues of the Hessian:

$$mI \preceq \nabla^2_x \left(z^{(2)}_y - z^{(2)}_t\right) \preceq MI \quad (4)$$

where $M = \lambda_{max}(P)$ and $m = \lambda_{min}(N)$. $P$ and $N$ are independent of $x$ and are defined in equations (51) and (52) in Appendix E.4.

The proof is presented in Appendix E.4. Because power iteration finds the eigenvalue with the largest magnitude, we can use it to find $m$ and $M$ only when $P$ is PSD and $N$ is NSD. We solve for $h_U$ and $h_L$ for the sigmoid, tanh and softplus activation functions in Appendix F and show that this is indeed the case for them. We note that this result does not hold for ReLU networks since the ReLU function is not differentiable everywhere. However, in Appendix G, we devise a method to compute the certificate for a two-layer ReLU network by finding a quadratic function that is a provable lower bound for $z^{(2)}_y - z^{(2)}_t$. We show that the resulting method significantly outperforms CROWN-Ada (see Appendix Table 9).
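As a sanity check on Lemma 1, the short PyTorch sketch below evaluates the closed-form Hessian of one logit of a small, randomly initialized sigmoid network using the recursion above and compares it against automatic differentiation. The network sizes and helper names are illustrative, and the $F^{(L,I)}$ recursion is written in an equivalent backward (fixed-$L$) form.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)
dims = [5, 8, 6, 3]                                    # D, N1, N2, C for a 3-layer net
Ws = [torch.randn(dims[k + 1], dims[k]) for k in range(3)]
bs = [torch.randn(dims[k + 1]) for k in range(3)]
sig = torch.sigmoid
dsig = lambda z: sig(z) * (1 - sig(z))                 # sigma'
ddsig = lambda z: dsig(z) * (1 - 2 * sig(z))           # sigma''

def preacts(x):
    """Pre-activations z^(1), ..., z^(L); z^(L) are the logits."""
    zs, a = [], x
    for W, b in zip(Ws, bs):
        zs.append(W @ a + b)
        a = sig(zs[-1])
    return zs

def hessian_lemma1(x, i):
    L, zs = len(Ws), preacts(x)
    # B^(I): Jacobian of z^(I) w.r.t. x  (B[I-1] holds B^(I))
    B = [Ws[0]]
    for I in range(1, L - 1):
        B.append(Ws[I] @ torch.diag(dsig(zs[I - 1])) @ B[-1])
    # F^(L,I): Jacobian of z^(L) w.r.t. a^(I), built backwards from I = L-1
    F = [None] * L
    F[L - 1] = Ws[L - 1]
    for I in range(L - 2, 0, -1):
        F[I] = F[I + 1] @ torch.diag(dsig(zs[I])) @ Ws[I]
    # Lemma 1: sum over layers of B^T diag(F_i * sigma''(z)) B
    return sum(B[I - 1].T @ torch.diag(F[I][i] * ddsig(zs[I - 1])) @ B[I - 1]
               for I in range(1, L))

x = torch.randn(dims[0])
H_autograd = torch.autograd.functional.hessian(lambda v: preacts(v)[-1][0], x)
print(torch.allclose(hessian_lemma1(x, 0), H_autograd))   # True
```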
### 5.3. Curvature bounds for deep networks

Using Lemma 1, we know that $\nabla^2_x z^{(L)}_i$ is a sum of products of the matrices $B^{(I)}$ and the rows $F^{(L,I)}_i$. Thus, if we can find upper bounds on $\|B^{(I)}\|$ and $|F^{(L,I)}_i|$, we can get an upper bound on $\|\nabla^2_x z^{(L)}_i\|$. Using this intuition (the proof is presented in Appendix E.5), we have the following result:

**Theorem 4.** Given an $L$-layer neural network whose activation function satisfies $|\sigma'(x)| \le g$ and $|\sigma''(x)| \le h$ for all $x \in \mathbb{R}$, the absolute values of the eigenvalues of $\nabla^2_x z^{(L)}_i$ are globally bounded by the following quantity:

$$\left\|\nabla^2_x z^{(L)}_i\right\| \le h \sum_{I=1}^{L-1} \left(r^{(I)}\right)^2 \max_j \left(S^{(L,I)}_{i,j}\right), \qquad \forall x \in \mathbb{R}^D$$

where $r^{(I)}$ and $S^{(L,I)}$ are independent of $x$ and are defined recursively as:

$$r^{(I)} = \begin{cases} \left\|W^{(1)}\right\|, & I = 1 \\ g \left\|W^{(I)}\right\| r^{(I-1)}, & I \ge 2 \end{cases} \quad (5)$$

$$S^{(L,I)} = \begin{cases} \left|W^{(L)}\right|, & I = L-1 \\ g \left|W^{(L)}\right| S^{(L-1,I)}, & I \le L-2 \end{cases} \quad (6)$$

The above expressions allow for an extremely efficient computation of curvature bounds for deep networks. We consider simplifications of this result for the sigmoid, tanh and softplus activations in Appendix F. The curvature bounds for $z^{(L)}_y - z^{(L)}_t$ can be computed by replacing $W^{(L)}_i$ with $W^{(L)}_y - W^{(L)}_t$ in Theorem 4. The resulting bound is independent of $x$ and depends only on the network weights $W^{(I)}$, the true label $y$ and the target $t$. We denote it by $K(W, y, t)$, or simply $K$ when no confusion arises. In our experiments, for two-layer networks we use $M$ and $m$ from Theorem 3 since it provides tighter curvature bounds. For deeper networks ($L \ge 3$), we use $M = K$ and $m = -K$.

## 6. Adversarial training with curvature regularization

Since the term $B^{(I)}$ in Lemma 1 is the Jacobian of $z^{(I)}$ with respect to $x$, its operator norm $\|B^{(I)}\|$ is bounded by the Lipschitz constant of the neural network constructed from the first $I$ layers of the original network. Finding tight bounds on the Lipschitz constant is an active area of research (Fazlyab et al., 2019; Scaman & Virmaux, 2018; Weng et al., 2018), and the product of the operator norms of the weight matrices is known to be a loose bound on the Lipschitz constant for deep networks. Since we use the same product to bound $\|B^{(I)}\|$ in Theorem 4, the resulting curvature bound is likely to be loose for very deep networks.

In Figure 1, we observe the same trend: as the depth of the network increases, the upper bound $K_{ub}$ computed using Theorem 4 becomes significantly larger than the lower bound $K_{lb}$ (computed by taking the maximum of the largest eigenvalue of the Hessian across all test images with label $y$ and second-largest logit $t$, then averaging across different $(y, t)$). However, by regularizing the network to have small curvature during training, the bound becomes significantly tighter. Interestingly, using curvature regularization, even with this loose curvature bound for deep networks, we achieve significantly higher robust accuracy than current state-of-the-art methods while also enjoying significantly higher standard accuracy (see Tables 3 and 4).

Figure 1. (a) 2-layer network; (b) 3-layer network. Illustration of the lower ($K_{lb}$) and upper ($K_{ub}$) bounds on the curvature of 2- and 3-layer networks with sigmoid activations trained on MNIST. Without any curvature regularization ($\gamma = 0$), curvature bounds increase significantly for deeper networks. Similarly, with $\gamma = 0$, networks adversarially trained with PGD have high curvature as well (note the log scale of the y-axis). However, using our curvature bound as a regularizer, the bound becomes tight and CRC gives high certificate values (Table 5). We report the curvature bounds ($K_{lb}$ and $K_{ub}$) for networks with different depths in Appendix Table 10.

To regularize the network to have small curvature values, we penalize the curvature bound $K$ during training. To compute the gradient of $K$ with respect to the network weights, note that, using Theorem 4, $K$ can be computed using absolute values, matrix multiplications and operator norms.
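Concretely, the sketch below assembles the bound $K(W, y, t)$ from exactly these operations (forward computation only; the gradient of the operator norm is discussed next), with the last-layer row replaced by $W^{(L)}_y - W^{(L)}_t$ as described above. The function name, the random example network and the unrolled (backward) form of the $S^{(L,I)}$ recursion are illustrative; the sigmoid constants used in the example, $g = 1/4$ and $h = 1/(6\sqrt{3})$, are standard bounds on $|\sigma'|$ and $|\sigma''|$.

```python
import numpy as np

def curvature_bound(Ws, y, t, g, h):
    """Global curvature bound K(W, y, t) of Theorem 4 for the logit difference z_y - z_t.

    Ws: weight matrices W^(1..L); g, h: bounds with |sigma'| <= g and |sigma''| <= h.
    """
    L = len(Ws)
    Ws = list(Ws[:-1]) + [(Ws[-1][y] - Ws[-1][t])[None, :]]   # last layer -> W_y - W_t
    # r^(I): bound on the operator norm of B^(I), I = 1..L-1 (0-indexed list below)
    r = [np.linalg.norm(Ws[0], 2)]
    for k in range(1, L - 1):
        r.append(g * np.linalg.norm(Ws[k], 2) * r[-1])
    # S^(L,I): elementwise bound on |F^(L,I)|; loop is the unrolled form of recursion (6)
    S = [None] * (L - 1)
    S[L - 2] = np.abs(Ws[L - 1])
    for k in range(L - 3, -1, -1):
        S[k] = g * S[k + 1] @ np.abs(Ws[k + 1])
    return h * sum(r[k] ** 2 * S[k].max() for k in range(L - 1))

# Example on a random 3-layer network with sigmoid activations
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((64, 784)), rng.standard_normal((64, 64)),
      rng.standard_normal((10, 64))]
print(curvature_bound(Ws, y=0, t=1, g=0.25, h=1 / (6 * np.sqrt(3))))
```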
Since the gradient of the operator norm does not exist in standard libraries, we created a new layer where the gradient of $\|W^{(I)}\|$ with respect to $W^{(I)}$ is given by:

$$\frac{\partial \left\|W^{(I)}\right\|}{\partial W^{(I)}} = u^{(I)} \left(v^{(I)}\right)^T, \qquad \text{where} \quad W^{(I)} v^{(I)} = \left\|W^{(I)}\right\| u^{(I)}$$

Note that $\|W^{(I)}\|$, $u^{(I)}$ and $v^{(I)}$ can be computed using power iteration. Since the network weights do not change significantly during a single training step, we can use the singular vectors $u^{(I)}$ and $v^{(I)}$ computed in the previous training step to update $\|W^{(I)}\|$ using one iteration of the power method. This approach to computing the gradient of the largest singular value of a matrix has also been used in previously published work (Miyato et al., 2018). Thus, the per-sample loss for training with curvature regularization is given by:

$$\ell\!\left(z^{(L)}(x^{(0)}), y\right) + \gamma\, K(W, y, t) \quad (7)$$

where $\ell$ denotes the cross-entropy loss, $y$ is the true label of the input $x^{(0)}$, $t$ is the attack target and $\gamma$ is the regularization coefficient for penalizing large curvature values. As in adversarial training, in CRT we use $x^{(attack)}$ instead of $x^{(0)}$ in equation (7).
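A minimal PyTorch sketch of such an operator-norm layer is shown below; the class name and buffer handling are illustrative rather than our exact implementation. One power-iteration step reuses the singular-vector estimates cached from the previous call, and the returned scalar $u^T W v$ approximates $\|W\|$ with gradient $u v^T$ with respect to $W$, matching the expression above.

```python
import torch

class OperatorNorm(torch.nn.Module):
    """Differentiable estimate of ||W|| using one power-iteration step per call."""
    def __init__(self, out_features, in_features):
        super().__init__()
        self.register_buffer("u", torch.randn(out_features))
        self.register_buffer("v", torch.randn(in_features))

    def forward(self, W):
        with torch.no_grad():                       # refine u, v from the previous step
            self.v = torch.nn.functional.normalize(W.t() @ self.u, dim=0)
            self.u = torch.nn.functional.normalize(W @ self.v, dim=0)
        return self.u @ W @ self.v                  # gradient w.r.t. W is u v^T

# The per-sample loss of equation (7) then reads (schematically):
#   loss = cross_entropy(logits, y) + gamma * K
# where K is assembled from such operator-norm estimates following Theorem 4, and
# under CRT the cross-entropy term is evaluated at x_attack rather than x_0.
```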
## 7. Experiments

Certified robust accuracy is the fraction of correctly classified test samples whose robustness certificates (computed using CRC) are greater than a pre-specified radius $\rho$. Unless otherwise specified, we use the class with the second-largest logit as the attack target (i.e., the class $t$). The notation ($L \times$ [1024], activation) denotes an $L$-layer network with 1024 hidden units per layer and the specified activation, ($\gamma = c$) denotes standard training with $\gamma$ set to $c$, and (CRT, $c$) denotes CRT training with $\gamma = c$. Certificates are computed over 150 randomly chosen, correctly classified images. We use a single NVIDIA GeForce RTX 2080 Ti GPU.

### 7.1. Fraction of inputs with the tightest robustness certificate

Using the verifiable conditions of Theorems 1 and 2, our approach is able to (1) find points that are provably the worst-case adversarial perturbations (in the $\ell_2$ norm) in the attack problem, and (2) find points on the decision boundary that are provably closest to the input in the $\ell_2$ norm in the certification problem. In particular, in Table 2, we observe that for curvature-regularized networks, our approach finds provably worst-case adversarial perturbations for all of the inputs with a small drop in accuracy. Moreover, for 2-, 3- and 4-layer networks, our method finds provably closest adversarial examples for 44.17%, 22.59% and 19.53% of inputs in the MNIST test set, respectively.

Table 2. Certificate success rate denotes the fraction of points satisfying $z_y - z_t = 0$, and attack success rate denotes the fraction of points $x^{(0)}$ satisfying $\|x^{(attack)} - x^{(0)}\|_2 = \rho = 0.5$, implying primal = dual in Theorems 1 and 2, respectively. We use the MNIST dataset.

| Network | $\gamma$ | Accuracy | Attack success | Certificate success |
|---|---|---|---|---|
| 2 × [1024], sigmoid | 0 | 98.77% | 5.05% | 2.24% |
| | 0.03 | 98.30% | 100% | 44.17% |
| 3 × [1024], sigmoid | 0 | 98.52% | 0% | 0.12% |
| | 0.05 | 97.60% | 100% | 22.59% |
| 4 × [1024], sigmoid | 0 | 98.22% | 0% | 0.01% |
| | 0.07 | 95.24% | 100% | 19.53% |

We note that the technique presented in this work is not applicable to ReLU networks due to the absence of curvature information. Verifying the robustness property in an $\ell_2$ ball around the input is known to be an NP-complete problem for ReLU networks (Katz et al., 2017): for an arbitrary ReLU network, it is computationally challenging to even verify (let alone find) that a given adversarial perturbation is the worst-case perturbation in polynomial time unless P = NP. Using curvature-regularized neural networks with smooth activation functions, we show that it is possible to find (and not just verify) the exact worst-case perturbation (and robustness certificate) for a significant fraction of test inputs. We note that, in the worst case, even for smooth classifiers the problem of finding the worst-case adversarial perturbation remains computationally challenging. However, our theoretical and empirical results provide strong evidence that bounding the curvature of the network and using smooth activation functions can be critical to achieving high robustness guarantees for a significant fraction of samples.

### 7.2. Comparison with existing provable defenses

We compare against the certified defense techniques proposed in Wong et al. (2018) and Zhang et al. (2019a) in Table 3 for the MNIST dataset (LeCun & Cortes, 2010) and Table 4 for the Fashion-MNIST dataset (Xiao et al., 2017), with $\ell_2$ radius $\rho = 1.58$. Even though our proposed CRT requires fully differentiable activation functions such as softplus, sigmoid, tanh, etc., we include comparisons with ReLU networks because the methods proposed in Wong et al. (2018) and Zhang et al. (2019a) use ReLU. Since CROWN-IBP can be trained using the softplus activation function, we include it in our comparison. A similar comparison with $\ell_2$ radius $\rho = 0.5$ is given in Appendix Table 2 (MNIST) and Appendix Table 3 (Fashion-MNIST). We observe that CRT (certified with CRC) gives significantly higher certified robust accuracy as well as standard accuracy compared to either of the methods on both MNIST and Fashion-MNIST for both values of $\rho$. Since shallow fully connected networks are known to perform poorly on the CIFAR-10 dataset, we do not include those results in our comparison.

Table 3. Comparison with interval-bound propagation based adversarial training methods, COAP, i.e., Convex Outer Adversarial Polytope (Wong et al., 2018), and CROWN-IBP (Zhang et al., 2019a), against Curvature-based Robust Training (ours) with attack radius $\rho = 1.58$ on MNIST. For CROWN-IBP, we vary the final beta hyperparameter between 0.8 and 3 and use the model with the best certified accuracy. Results with $\rho = 0.5$ are in Appendix Table 2.

| Network | Training | Standard Accuracy | Certified Robust Accuracy |
|---|---|---|---|
| 2 × [1024], softplus | CRT, 0.01 | 98.68% | 69.79% |
| | CROWN-IBP | 88.48% | 42.36% |
| 2 × [1024], relu | COAP | 89.33% | 44.29% |
| | CROWN-IBP | 89.49% | 44.96% |
| 3 × [1024], softplus | CRT, 0.05 | 97.43% | 57.78% |
| | CROWN-IBP | 86.58% | 42.14% |
| 3 × [1024], relu | COAP | 89.12% | 44.21% |
| | CROWN-IBP | 87.77% | 44.74% |
| 4 × [1024], softplus | CRT, 0.07 | 95.60% | 53.19% |
| | CROWN-IBP | 82.74% | 41.34% |
| 4 × [1024], relu | COAP | 90.17% | 44.66% |
| | CROWN-IBP | 84.4% | 43.83% |

In Appendix Table 5, we compare CRT with Randomized Smoothing (Cohen et al., 2019). For 2- and 3-layer networks, we achieve higher robust accuracy. However, we note that since our certificate is deterministic while the smoothing-based certificate is probabilistic (although with high probability), the results are not directly comparable. As a separate result, we also prove that randomized smoothing bounds the curvature of the network (Theorem 1 in Appendix E.6). We also include a comparison with empirical defense methods, namely PGD and TRADES, in Appendix Table 8.

### 7.3. Comparison with existing certificates

In Table 5, we compare CRC with CROWN-general (Zhang et al., 2018a). For 2-layer networks, CRC outperforms CROWN significantly. For deeper networks, CRC works better than CROWN when the network is trained with curvature regularization.
However, with a small $\gamma = 0.01$, we see a significant increase in CRC with only a very small drop in test accuracy (without any adversarial training). We can see that with $\gamma = 0.01$, non-trivial certified accuracies of 83.53%, 88.33% and 89.61% can be achieved on 2-, 3- and 4-layer sigmoid networks, respectively, without any adversarial training. Adversarial training using CRT further boosts the certified accuracy to 95.59%, 94.99% and 93.41%, respectively. We show some results on the CIFAR-10 dataset in Appendix Table 7; we again observe improvements in the robustness certificate and certified robust accuracy using CRC and CRT.

Table 4. Comparison between COAP (Wong et al., 2018), CROWN-IBP (Zhang et al., 2019a) and Curvature-based Robust Training (ours) with attack radius $\rho = 1.58$ on Fashion-MNIST. Results with $\rho = 0.5$ are in Appendix Table 3.

| Network | Training | Standard Accuracy | Certified Robust Accuracy |
|---|---|---|---|
| 2 × [1024], softplus | CRT, 0.01 | 80.31% | 54.39% |
| | CROWN-IBP | 69.23% | 47.19% |
| 2 × [1024], relu | COAP | 74.1% | 46.3% |
| | CROWN-IBP | 70.73% | 48.61% |
| 3 × [1024], softplus | CRT, 0.05 | 78.39% | 53.4% |
| | CROWN-IBP | 68.72% | 46.52% |
| 3 × [1024], relu | COAP | 73.9% | 46.3% |
| | CROWN-IBP | 70.79% | 48.69% |
| 4 × [1024], softplus | CRT, 0.07 | 75.61% | 49.6% |
| | CROWN-IBP | 68.31% | 46.21% |
| 4 × [1024], relu | COAP | 73.6% | 45.1% |
| | CROWN-IBP | 70.21% | 48.08% |

Table 5. Effect of curvature regularization and CRT on certified robust accuracy and the robustness certificate.

(a) Effect of $\gamma$ on certified robust accuracy:

| Network | Training | Standard Accuracy | Certified Robust Accuracy |
|---|---|---|---|
| 2 × [1024], sigmoid | standard | 98.37% | 54.17% |
| | $\gamma = 0.01$ | 98.08% | 83.53% |
| | CRT, 0.01 | 98.57% | 95.59% |
| 3 × [1024], sigmoid | standard | 98.37% | 0.00% |
| | $\gamma = 0.01$ | 97.71% | 88.33% |
| | CRT, 0.01 | 97.23% | 94.99% |
| 4 × [1024], sigmoid | standard | 98.39% | 0.00% |
| | $\gamma = 0.01$ | 97.41% | 89.61% |
| | CRT, 0.01 | 97.83% | 93.41% |

(b) Comparison between CROWN-general (Zhang et al., 2018a) and CRC (mean certificate):

| Network | Training | CROWN | CRC |
|---|---|---|---|
| 2 × [1024], sigmoid | standard | 0.28395 | 0.48500 |
| | $\gamma = 0.01$ | 0.32548 | 0.84719 |
| | CRT, 0.01 | 0.43061 | 1.54673 |
| 3 × [1024], sigmoid | standard | 0.24644 | 0.06874 |
| | $\gamma = 0.01$ | 0.39799 | 1.07842 |
| | CRT, 0.01 | 0.39603 | 1.24100 |
| 4 × [1024], sigmoid | standard | 0.19501 | 0.00454 |
| | $\gamma = 0.01$ | 0.40620 | 1.05323 |
| | CRT, 0.01 | 0.40327 | 1.06208 |

### 7.4. Results using local curvature bounds

From Theorems 1 and 2, we can observe that if the curvature is locally bounded within a convex region around the input (we call it the safe region), then the corresponding dual problems ($d^*_{cert}$, $d^*_{attack}$) are again convex optimization problems, provided the optimization trajectory does not escape the safe region. Theorem 3 can be directly extended to compute the local curvature bound using bounds on the second derivatives, i.e., on $\sigma''(z^{(1)})$, in the local region. In Table 6, we show significant improvements in the CRC certificate for two-layer sigmoid networks on the MNIST dataset for $\gamma = 0$. However, with curvature regularization, the difference is insignificant. We also observe that the certified accuracy for (CRT, 0.0) improves from 95.04% to 95.31% and for standard training from 54.17% to 58.06%; the certified accuracy remains the same in the other cases. Implementation details are in Appendix Section C.6.
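A minimal sketch of one way to obtain such local bounds in the two-layer case is given below, assuming the safe region is the $\ell_2$ ball of radius $\rho$ around $x^{(0)}$: by Cauchy-Schwarz, each pre-activation satisfies $|z^{(1)}_i(x) - z^{(1)}_i(x^{(0)})| \le \rho \|W^{(1)}_i\|$, so $\sigma''$ only needs to be bounded over a per-neuron interval. The interval-bounding step and the helper names here are illustrative; the exact procedure used in our experiments is described in Appendix C.6.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_dd(z):                                     # sigma'' for the sigmoid
    s = sigmoid(z)
    return s * (1 - s) * (1 - 2 * s)

def local_sigma_dd_bounds(W1, b1, x0, rho, num=101):
    """Per-neuron bounds [h_L, h_U] on sigma''(z^(1)) over the l2 ball of radius rho.

    The per-interval extrema are approximated on a dense grid here; a sound
    implementation would use the known shape of sigma'' on each interval.
    """
    z0 = W1 @ x0 + b1
    slack = rho * np.linalg.norm(W1, axis=1)           # |z_i(x) - z_i(x0)| <= slack_i
    grid = np.linspace(z0 - slack, z0 + slack, num)    # shape (num, N1)
    vals = sigmoid_dd(grid)
    return vals.min(axis=0), vals.max(axis=0)
```

The resulting per-neuron $h_L$, $h_U$ then replace the global second-derivative bounds when constructing $P$ and $N$ in Theorem 3.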
Computing local curvature bounds for deeper networks, however, is more challenging due to the presence of terms involving products of first and second derivatives. A straightforward extension of Theorem 4, wherein we compute upper bounds on $\sigma'$ and $\sigma''$ in a local region around the input across all neurons in all layers, does not yield significant improvements over the global method; we therefore do not include those results in our comparison.

Table 6. Comparison of certified robust accuracy and CRC for 2-layer sigmoid and tanh networks using global and local curvature bounds on the MNIST dataset with $\rho = 0.5$.

| Network | Training | CRC (Global) | CRC (Local) |
|---|---|---|---|
| 2 × [1024], sigmoid | standard | 0.5013 | 0.5847 |
| | CRT, 0.0 | 1.0011 | 1.1741 |
| | CRT, 0.01 | 1.5705 | 1.6047 |
| | CRT, 0.02 | 1.6720 | 1.6831 |

## 8. Extension to convolutional neural networks

The formula derived in Lemma 1 is valid even for convolutional neural networks. However, to derive the curvature bound (using Theorem 4), we need to compute a bound on the singular values of the Jacobian of the convolution layer (i.e., $\|W^{(I)}\|$). To do this, one can use the spectral bounds for convolution layers derived in Sedghi et al. (2018). We present some preliminary results using this technique in Singla & Feizi (2019) for a single-layer convolutional neural network with softplus activations.

## 9. Conclusion

In this paper, we develop computationally-efficient convex relaxations for the robustness certification and adversarial attack problems, given that the classifier has bounded curvature. We also show that this convex relaxation is tight under some general, verifiable conditions. To be able to use the proposed certification and attack convex optimizations, we derive global curvature bounds for deep networks with differentiable activation functions. This result is a consequence of a closed-form expression that we derive for the Hessian of a deep network. Adversarial training using our attack method, coupled with curvature regularization, results in significantly higher certified robust accuracy than existing provable defense methods. Our proposed curvature-based robustness certificate significantly outperforms the CROWN certificate when trained with our regularizer. Scaling up our proposed curvature-based robustness certification and training methods, as well as further tightening the derived curvature bounds, are interesting directions for future work.

## 10. Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230, HR00111990077, HR001119S0026 and a Simons Fellowship on Foundations of Deep Learning.

## References

Anil, C., Lucas, J., and Grosse, R. B. Sorting out Lipschitz function approximation. In ICML, 2018.

Athalye, A. and Carlini, N. On the robustness of the CVPR 2018 white-box adversarial example defenses. arXiv, abs/1804.03286, 2018.

Athalye, A., Carlini, N., and Wagner, D. A. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.

Bunel, R., Turkaslan, I., Torr, P. H. S., Kohli, P., and Mudigonda, P. K. A unified view of piecewise linear neural network verification. In NeurIPS, 2017.

Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. arXiv, abs/1709.05583, 2017.

Carlini, N. and Wagner, D. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, 2017.

Carlini, N., Katz, G., Barrett, C. E., and Dill, D. L. Provably minimally-distorted adversarial examples. 2017.
Cheng, C.-H., Nührenberg, G., and Ruess, H. Maximum resilience of artificial neural networks. In ATVA, 2017.

Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of ReLU networks via maximization of linear regions. arXiv, abs/1810.07481, 2018.

Dutta, S., Jha, S., Sankaranarayanan, S., and Tiwari, A. Output range analysis for deep feedforward neural networks. In NFM, 2018.

Dvijotham, K., Gowal, S., Stanforth, R., Arandjelovic, R., O'Donoghue, B., Uesato, J., and Kohli, P. Training verified learners with learned verifiers. arXiv, abs/1805.10265, 2018a.

Dvijotham, K., Stanforth, R., Gowal, S., Mann, T. A., and Kohli, P. A dual approach to scalable verification of deep networks. In UAI, 2018b.

Ehlers, R. Formal verification of piece-wise linear feedforward neural networks. arXiv, abs/1705.01320, 2017.

Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. J. Efficient and accurate estimation of Lipschitz constants for deep neural networks. CoRR, abs/1906.04893, 2019. URL http://arxiv.org/abs/1906.04893.

Fischetti, M. and Jo, J. Deep neural networks and mixed integer linear optimization. Constraints, 23:296-309, 2018.

Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, P., Chaudhuri, S., and Vechev, M. T. AI2: Safety and robustness certification of neural networks with abstract interpretation. 2018 IEEE Symposium on Security and Privacy (SP), pp. 3-18, 2018.

Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv, abs/1810.12715, 2018.

Hein, M. and Andriushchenko, M. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 30, pp. 2266-2276, 2017.

Huang, X., Kwiatkowska, M. Z., Wang, S., and Wu, M. Safety verification of deep neural networks. arXiv, abs/1610.06940, 2016.

Katz, G., Barrett, C. W., Dill, D. L., Julian, K., and Kochenderfer, M. J. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV, 2017.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial examples in the physical world. arXiv, abs/1607.02533, 2016a.

Kurakin, A., Goodfellow, I. J., and Bengio, S. Adversarial machine learning at scale. arXiv, abs/1611.01236, 2016b.

Laidlaw, C. and Feizi, S. Functional adversarial attacks. In NeurIPS, 2019.

LeCun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

Lécuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. K. K. Certified robustness to adversarial examples with differential privacy. In IEEE S&P 2019, 2018.

Levine, A. and Feizi, S. (De)randomized smoothing for certifiable defense against patch attacks. arXiv, abs/2002.10733, 2020a.

Levine, A. and Feizi, S. Robustness certificates for sparse adversarial attacks by randomized ablation. In AAAI, 2020b.

Levine, A. J. and Feizi, S. Wasserstein smoothing: Certified robustness against Wasserstein adversarial attacks. arXiv, abs/1910.10783, 2020c.

Li, B. H., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive Gaussian noise. 2018.
Liu, X., Cheng, M., Zhang, H., and Hsieh, C.-J. Towards robust neural networks via random self-ensemble. arXiv, abs/1712.00673, 2017.

Lomuscio, A. and Maganti, L. An approach to reachability analysis for feed-forward ReLU neural networks. arXiv, abs/1706.07351, 2017.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Mirman, M., Gehr, T., and Vechev, M. T. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.

Miyato, T., Maeda, S.-i., Koyama, M., and Ishii, S. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1979-1993, 2017.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Moosavi-Dezfooli, S. M., Fawzi, A., Uesato, J., and Frossard, P. Robustness via curvature regularization, and vice versa. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582-597, May 2016. doi: 10.1109/SP.2016.41.

Papernot, N., McDaniel, P. D., Goodfellow, I. J., Jha, S., Celik, Z. B., and Swami, A. Practical black-box attacks against deep learning systems using adversarial examples. CoRR, abs/1602.02697, 2016. URL http://arxiv.org/abs/1602.02697.

Peck, J., Roels, J., Goossens, B., and Saeys, Y. Lower bounds on the robustness to adversarial perturbations. In NIPS, 2017.

Qin, C., Martens, J., Gowal, S., Krishnan, D., Fawzi, A., De, S., Stanforth, R., and Kohli, P. Adversarial robustness through local linearization. arXiv preprint arXiv:1907.02610, 2019.

Raghunathan, A., Steinhardt, J., and Liang, P. Certified defenses against adversarial examples. arXiv, abs/1801.09344, 2018a.

Raghunathan, A., Steinhardt, J., and Liang, P. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018b.

Salman, H., Yang, G., Li, J., Zhang, P., Zhang, H., Razenshteyn, I. P., and Bubeck, S. Provably robust deep learning via adversarially trained smoothed classifiers. arXiv, abs/1906.04584, 2019.

Samangouei, P., Kabkab, M., and Chellappa, R. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BkJ3ibb0-.

Scaman, K. and Virmaux, A. Lipschitz regularity of deep neural networks: analysis and efficient estimation, 2018.

Sedghi, H., Gupta, V., and Long, P. M. The singular values of convolutional layers. arXiv preprint arXiv:1805.10408, 2018.

Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. T. Fast and effective robustness certification. In NeurIPS, 2018.

Singla, S. and Feizi, S. Bounding singular values of convolution layers, 2019.

Singla, S., Wallace, E., Feng, S., and Feizi, S. Understanding impacts of high-order loss approximations and features in deep learning interpretation. In ICML, 2019.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.

Uesato, J., O'Donoghue, B., Kohli, P., and van den Oord, A. Adversarial risk and the dangers of evaluating against weak attacks. In ICML, 2018.

Wang, S., Chen, Y., Abdou, A., and Jana, S. K. K. MixTrain: Scalable training of verifiably robust neural networks. 2018a.

Wang, S., Pei, K., Whitehouse, J., Yang, J., and Jana, S. K. K. Efficient formal safety analysis of neural networks. In NeurIPS, 2018b.

Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D. S., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for ReLU networks. arXiv, abs/1804.09699, 2018.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv, abs/1711.00851, 2017.

Wong, E., Schmidt, F. R., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In NeurIPS, 2018.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.

Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 4939-4948. Curran Associates, Inc., 2018a.

Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. arXiv, abs/1811.00866, 2018b.

Zhang, H., Zhang, P., and Hsieh, C.-J. RecurJac: An efficient recursive algorithm for bounding Jacobian matrix of neural networks and its applications. In AAAI, 2018c.

Zhang, H., Chen, H., Xiao, C., Li, B., Boning, D. S., and Hsieh, C. Towards stable and efficient training of verifiably robust neural networks. CoRR, abs/1906.06316, 2019a. URL http://arxiv.org/abs/1906.06316.

Zhang, H., Yu, Y., Jiao, J., Xing, E. P., Ghaoui, L. E., and Jordan, M. I. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019b.

Zheng, S., Song, Y., Leung, T., and Goodfellow, I. J. Improving the robustness of deep neural networks via stability training. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4480-4488, 2016.