# Improved techniques for deterministic l2 robustness

Sahil Singla, Department of Computer Science, University of Maryland (ssingla@umd.edu)
Soheil Feizi, Department of Computer Science, University of Maryland (sfeizi@umd.edu)

Abstract

Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the l2 norm is useful for adversarial robustness, interpretable gradients and stable training. 1-Lipschitz CNNs are usually designed by enforcing each layer to have an orthogonal Jacobian matrix (for all inputs) to prevent the gradients from vanishing during backpropagation. However, their performance often significantly lags behind that of heuristic methods that enforce Lipschitz constraints without making the resulting CNN provably 1-Lipschitz. In this work, we reduce this gap by introducing (a) a procedure to certify robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP, which significantly improves both standard and provably robust accuracy, (b) a method to significantly reduce the training time per epoch for Skew Orthogonal Convolution (SOC) layers (> 30% reduction for deeper networks) and (c) a class of pooling layers based on the mathematical property that the l2 distance of an input to a manifold is 1-Lipschitz. Using these methods, we significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 (gains of +1.79% and +3.82%) and similarly on CIFAR-100 (+3.78% and +4.75%) across all networks. Code is available at https://github.com/singlasahil14/improved_l2_robustness.

1 Introduction

The Lipschitz constant¹ of a neural network $f : \mathbb{R}^d \to \mathbb{R}^c$, denoted by $\mathrm{Lip}(f)$, bounds the change in the output relative to the change in the input (both changes measured in the l2 norm). Previous work provides evidence that a small Lipschitz constant is useful for interpretable saliency maps [Tsipras et al., 2018], generalization bounds [Long and Sedghi, 2020], Wasserstein distance estimation [Villani, 2008], adversarial robustness [Szegedy et al., 2014] and preventing gradient explosion during backpropagation [Xiao et al., 2018]. Several prior works [Miyato et al., 2018, Gulrajani et al., 2017] use heuristic methods to enforce Lipschitz constraints and successfully address problems such as stabilizing GAN training. However, these methods do not enforce a guaranteed Lipschitz bound, and it remains challenging to achieve strong results with provable guarantees.

Using the composition property, i.e. $\mathrm{Lip}(g \circ h) \le \mathrm{Lip}(g)\,\mathrm{Lip}(h)$, we can construct a 1-Lipschitz neural network by constraining each layer to be 1-Lipschitz. However, a key difficulty with this approach is that because a 1-Lipschitz layer can only reduce the gradient norm during backpropagation, for deeper networks this results in small gradient values for layers closer to the input, making training slow and difficult. To address this problem, Anil et al. [2018] introduce Gradient Norm Preserving (GNP) architectures where each layer preserves the gradient norm during backpropagation. For 1-Lipschitz Convolutional Neural Networks (CNNs), this involves using orthogonal convolutions (convolution layers with an orthogonal Jacobian matrix) [Li et al., 2019b, Trockman and Kolter, 2021, Singla and Feizi, 2021, Yu et al., 2022, Kiani et al., 2022] and a class of GNP activation functions called Householder activations [Singla et al., 2022].

¹ In this work, we assume the Lipschitz constant under the l2 norm.
Table 1: CIFAR-10 results using faster gradients, projection pooling and CRC-Lip.

| SOC layers | Time/epoch ours (s) | Time/epoch previous (s) | Reduction | Std. acc. ours | Std. acc. previous | Robust acc. ours | Robust acc. previous |
|---|---|---|---|---|---|---|---|
| 6 | 25.34 | 30.63 | -17.27% | 79.36% | 76.68% | 67.13% | 60.09% |
| 11 | 37.11 | 48.44 | -23.39% | 79.57% | 77.73% | 66.75% | 62.82% |
| 16 | 49.27 | 66.74 | -26.17% | 79.44% | 77.78% | 66.99% | 62.75% |
| 21 | 61.59 | 83.18 | -25.96% | 79.13% | 77.50% | 66.45% | 63.31% |
| 26 | 71.51 | 100.70 | -28.99% | 79.19% | 77.18% | 66.28% | 62.46% |
| 31 | 84.00 | 119.55 | -29.74% | 78.64% | 74.43% | 66.05% | 59.65% |
| 36 | 95.06 | 137.10 | -30.66% | 78.57% | 72.73% | 65.94% | 57.18% |
| 41 | 106.01 | 156.25 | -32.16% | 78.41% | 71.33% | 65.51% | 55.74% |

Among orthogonal convolutions, SOC [Singla and Feizi, 2021] achieves state-of-the-art results on CIFAR-10 and CIFAR-100 [Singla et al., 2022]. SOC uses the following two mathematical properties to construct an orthogonal convolution: (a) the exponential of a skew-symmetric matrix is orthogonal and (b) the matrix exponential can be computed using the exponential series. A drawback of using the exponential series is that it requires applying the convolution operation multiple times to achieve a reasonable approximation of an orthogonal matrix. Consequently, during backpropagation, computing the gradient with respect to the weights for a single SOC layer requires computing the weight gradient for each of these convolutions, resulting in significant training overhead. In this work, we show that because the matrix is skew-symmetric, using some approximations, the gradients with respect to the weights can be computed in significantly reduced time with no loss in performance. This enables us to reduce the training time per epoch using SOC by > 30% for networks with a large number of SOC layers (results in Table 1).

Another limitation of 1-Lipschitz CNNs is that their performance often lags significantly behind that of standard CNNs. Recently, Singla et al. [2022] introduced a procedure for certifying robustness by relaxing the orthogonality requirements of the last linear layer (i.e. the linear layer mapping penultimate neurons to class logits), achieving state-of-the-art results on CIFAR-10 and CIFAR-100 [Krizhevsky, 2009]. Since MLPs are more expressive than linear layers, one would expect improved standard accuracy by replacing the last linear layer with them. However, the resulting networks are not 1-Lipschitz, and achieving high robustness guarantees (provable robust accuracy) is difficult. To certify robustness for these networks, note that since the mapping from the input to the penultimate layer is 1-Lipschitz, a robustness certificate for the penultimate output also provides a certificate for the input. Thus, we first replace the linear layer mapping the penultimate output to the logits with a 1-hidden-layer MLP (multi-layer perceptron), because such MLPs are easier to certify compared to deep MLPs. To certify robustness for the MLP, we use the Curvature-based Robustness Certificate or CRC [Singla and Feizi, 2020]. To train the MLP to have high robustness guarantees, we use a variant of adversarial training that applies adversarial perturbations only to the MLP (not the whole network). We call our certification procedure CRC-Lip. This results in improvements in both the standard and the provable robust accuracy across all network architectures on CIFAR-10 (≥ 1.63% and ≥ 3.14%) and CIFAR-100 (≥ 2.49% and ≥ 2.27%, respectively).
Results are in Tables 3 and 4.

While several works have attempted to construct novel and more expressive orthogonal convolution layers and GNP activation functions, current state-of-the-art 1-Lipschitz CNNs still construct pooling layers by taking the max of different elements. In this work, we introduce a class of 1-Lipschitz pooling layers called projection pooling using the following mathematical property: given a manifold $\mathcal{M} \subseteq \mathbb{R}^n$ and input $x \in \mathbb{R}^n$, the function $d_{\mathcal{M}} : \mathbb{R}^n \to \mathbb{R}$ defined as the l2 distance of $x$ to $\mathcal{M}$ is 1-Lipschitz. Thus, to construct a pooling layer, we can first define a learnable manifold with parameters $\Theta$ (example in Section 6). During training, for input $x \in \mathbb{R}^n$, the pooling layer simply outputs $d_{\mathcal{M}}(x) \in \mathbb{R}$, reducing the input dimension by a factor of $n$. Moreover, since the output $d_{\mathcal{M}}(x)$ is also a function of $\Theta$, $\Theta$ can be learned during training. However, solving for the distance $d_{\mathcal{M}}(x)$ can be difficult, especially when $\mathcal{M}$ is a high-dimensional manifold. To address this limitation, (a) we use 2D projection pooling layers that reduce the dimension by a factor of 2 and (b) we construct these layers using piecewise linear curves, for which the distance can be computed efficiently as the minimum distance to all line segments and connecting points (example in Appendix Figure 2). If the curve is closed and without self-intersections, we can also define a signed projection pooling for which the signs of the output for points inside and outside the region enclosed by the curve are different (x and y in Figure 2). This allows the subsequent layers to distinguish between inputs inside and outside the region. In this work, we show some preliminary results using a simple 2D projection pooling layer (Section 6). We leave the problem of constructing high-performance 2D projection pooling layers open for future research.

In summary, in this paper, we make the following contributions:

- We introduce a method for faster computation of the weight gradient for SOC layers. For deeper networks, we observe a reduction in training time per epoch of > 30% (Table 1).
- We introduce a certification procedure called CRC-Lip that replaces the last linear layer with a 1-hidden-layer MLP and results in significantly improved standard and provable robust accuracy. For deeper networks (≥ 35 layers), we observe improvements of ≥ 5.84% and ≥ 8.00% in standard accuracy and ≥ 8.76% and ≥ 8.65% in provable robust accuracy (l2 radius 36/255) on CIFAR-10 and CIFAR-100, respectively.
- We introduce a large class of 1-Lipschitz pooling layers called projection pooling using the mathematical property that the l2 distance $d_{\mathcal{M}}(x)$ of an input $x$ to a manifold $\mathcal{M}$ is 1-Lipschitz.

On CIFAR-10, across all architectures, we achieve the best standard and provable robust accuracy (at 36/255) of 79.57% and 67.13%, respectively (gains of +1.79% and +3.82% over prior work). Similarly, on CIFAR-100, we achieve 51.84% and 39.27% (+3.78% and +4.75% over prior work). These results establish a new state-of-the-art for standard and provable robust accuracy on both datasets.

2 Related work

Provable defenses against adversarial examples: For a provably robust classifier, we can guarantee that its predictions remain constant within some region around the input.
Most of the existing methods for provable robustness either bound the Lipschitz constant or use convex relaxations [Singh et al., 2017, 2018a, 2019a,b,c, Weng et al., 2018, Salman et al., 2019b, Zhang et al., 2019, 2018, Wong et al., 2018, Wong and Kolter, 2018, Singh et al., 2018b, Raghunathan et al., 2018, Dvijotham* et al., 2018, Croce et al., 2019, Gowal et al., 2019, Dvijotham* et al., 2020, Lu and Kumar, 2020, Singla and Feizi, 2020, Bunel et al., 2020, Leino et al., 2021, Leino and Fredrikson, 2021, Zhang et al., 2021, 2022, Wang et al., 2021, Huang et al., 2021, Müller et al., 2021, Singh et al., 2021, Palma et al., 2021]. However, these methods are often not scalable to large neural networks while achieving high performance. In contrast, randomized smoothing [Liu et al., 2018, Cao and Gong, 2017, Lécuyer et al., 2018, Li et al., 2019a, Cohen et al., 2019, Salman et al., 2019a, Levine et al., 2019, Kumar et al., 2020a,b, Salman et al., 2020, 2021] scales to large neural networks but is a probabilistically certified defense: certifying robustness with high probability requires generating a large number of noisy samples, leading to high inference time. The defense we propose in this work is deterministic and not comparable to randomized smoothing.

Provably Lipschitz neural networks: The class of Gradient Norm Preserving (GNP) and 1-Lipschitz fully connected neural networks was first introduced by Anil et al. [2018]. To make each layer GNP, they orthogonalize weight matrices and use a class of piecewise linear GNP activations called GroupSort. Later, Singla et al. [2022] proved that for any piecewise linear GNP function to be continuous, different Jacobian matrices in a neighborhood must change via Householder transformations, implying that GroupSort is a special case of the more general Householder activations. Several previous works enforce Lipschitz constraints on convolution layers using spectral normalization, clipping or regularization [Cissé et al., 2017, Tsuzuku et al., 2018, Qian and Wegman, 2019, Gouk et al., 2020, Sedghi et al., 2019]. However, these methods either enforce loose Lipschitz bounds or do not scale to large networks. To ensure that the Lipschitz constraint on convolution layers is tight, recent works construct convolution layers with an orthogonal Jacobian [Li et al., 2019b, Trockman and Kolter, 2021, Singla and Feizi, 2021, Yu et al., 2022, Su et al., 2022, Kiani et al., 2022]. These approaches avoid the aforementioned issues and allow training of large, provably 1-Lipschitz CNNs achieving state-of-the-art results for deterministic l2 robustness.

3 Problem setup and notation

For a vector $v$, $v_j$ denotes its $j$th element. For a matrix $A$, $A_{j,:}$ and $A_{:,k}$ denote the $j$th row and $k$th column, respectively. Both $A_{j,:}$ and $A_{:,k}$ are assumed to be column vectors (thus $A_{j,:}$ is the transpose of the $j$th row of $A$). $A_{j,k}$ denotes the element in the $j$th row and $k$th column of $A$. $A_{:j,:k}$ denotes the matrix containing the first $j$ rows and $k$ columns of $A$. We define $A_{:j} = A_{:j,:}$ and $A_{j:} = A_{j:,:}$. Similar notation applies to higher-order tensors. $I$ denotes the identity matrix and $\mathbb{R}$ the field of real numbers. We construct a 1-Lipschitz neural network $f : \mathbb{R}^d \to \mathbb{R}^c$ ($d$ is the input dimension, $c$ is the number of classes) by composing 1-Lipschitz convolution layers and GNP activation functions.
We often use the abbreviation $f_i - f_j : \mathbb{R}^d \to \mathbb{R}$ to denote the function such that:

$$(f_i - f_j)(x) = f_i(x) - f_j(x), \quad \forall\, x \in \mathbb{R}^d$$

For a matrix $A \in \mathbb{R}^{q \times r}$ and a tensor $B \in \mathbb{R}^{p \times q \times r}$, $\overline{A}$ denotes the vector constructed by stacking the rows of $A$, and $\overline{B}$ the vector constructed by stacking the vectors $\overline{B_{j,:,:}}$, $j \in [p-1]$, so that:

$$\overline{A}^T = \left[ A_{0,:}^T,\ A_{1,:}^T,\ \dots,\ A_{q-1,:}^T \right], \qquad \overline{B}^T = \left[ \overline{B_{0,:,:}}^T,\ \overline{B_{1,:,:}}^T,\ \dots,\ \overline{B_{p-1,:,:}}^T \right]$$

For a 2D convolution filter $L \in \mathbb{R}^{p \times q \times r \times s}$ and input $X \in \mathbb{R}^{q \times n \times n}$, we use $L \star X \in \mathbb{R}^{p \times n \times n}$ to denote the convolution of the filter $L$ with $X$. We use the notation $L \star^i X \equiv L \star^{i-1} (L \star X)$ and $L \star^0 X = X$. Unless specified, we assume zero padding and stride 1 in each direction.

4 Faster gradient computation for Skew Orthogonal Convolutions

Among the existing orthogonal convolution layers in the literature, Skew Orthogonal Convolutions (SOC) by Singla and Feizi [2021] achieves state-of-the-art results on CIFAR-10 and CIFAR-100 [Singla et al., 2022]. SOC first constructs a convolution filter $L$ whose Jacobian is skew-symmetric. This is followed by a convolution exponential [Hoogeboom et al., 2020] operation. Since the exponential of a skew-symmetric matrix is orthogonal, the Jacobian of the resulting layer is an orthogonal matrix. However, a drawback of this procedure is that the convolution exponential requires multiple convolution operations per SOC layer to achieve a reasonable approximation of the orthogonal Jacobian. Consequently, if we use $k$ convolution operations in the SOC layer during the forward pass, we need to compute the gradient with respect to the weights $L$ (called the convolution weight gradient) per convolution operation ($k$ times), which can lead to slower training, especially when the number of SOC layers is large. To address this limitation, we show that even if we use $k$ convolutions in the forward pass of an SOC layer, the weight gradient for the layer can be computed using a single convolution weight gradient during backpropagation, leading to a significant reduction in training time.

For simplicity, let us first consider the case of an orthogonal fully connected layer (i.e. not a convolution) with the same input and output size ($n$). Later, we will see that our analysis leads to improvements for orthogonal convolution layers. Further, assume that the weights $A \in \mathbb{R}^{n \times n}$ are skew-symmetric, i.e. $A = -A^T$, and given the input $x \in \mathbb{R}^n$, the output $z \in \mathbb{R}^n$ is computed as follows:

$$z = \exp(A)\, x + b, \quad \text{where } A = -A^T \qquad (1)$$

We approximate the exponential series $\exp(A) = \sum_{i=0}^{\infty} A^i / i!$ using a finite number of terms ($k$).

Forward pass ($z$): To compute $z$ during the forward pass, we use the following iterations:

$$u^{(i)} = \begin{cases} x & i = k-1 \\ x + \left(A u^{(i+1)}\right)/(i+1) & i \le k-2 \end{cases} \qquad (2)$$

It can be shown that $z = u^{(0)}$ (Appendix C). During backpropagation, given the gradient of the loss $\ell$ w.r.t. the layer output $z$, i.e. $\nabla_z \ell$, we want to compute $\nabla_x \ell$ (input gradient) and $\nabla_A \ell$ (weight gradient).

Input gradient ($\nabla_x \ell$): To compute $\nabla_x \ell$, observe that $z$ is a linear function of $x$. Thus, using the chain rule, skew-symmetry ($A^T = -A$) and the property $\left(A^i\right)^T = \left(A^T\right)^i$, we have $\nabla_x \ell = \exp(A)^T \nabla_z \ell = \exp(-A)\, \nabla_z \ell$. We can again approximate the exponential series using the same number of terms as in the forward pass, i.e. $k$. To compute the finite-term approximation, we use the following iterations:

$$v^{(i)} = \begin{cases} \nabla_z \ell & i = k-1 \\ \nabla_z \ell - \left(A v^{(i+1)}\right)/(i+1) & i \le k-2 \end{cases} \qquad (3)$$

Similar to the forward pass, it can be shown that $\nabla_x \ell = v^{(0)}$ (Appendix C).
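To make the recurrences in equations (2) and (3) concrete, the following is a minimal PyTorch sketch of the fully connected case (an illustration under the assumptions above, ignoring the bias; it is not the released SOC code, and the helper names `exp_matvec` and `exp_matvec_input_grad` are ours):

```python
# Fully connected analogue of SOC: z = exp(A) x with A skew-symmetric,
# computed via the recurrence in equation (2); input gradient via equation (3).
import torch

def skew(M):
    """Skew-symmetric parameterization A = M - M^T."""
    return M - M.t()

def exp_matvec(A, x, k):
    """Approximate exp(A) @ x with k series terms using the recurrence u^(i)."""
    u = x
    for i in range(k - 2, -1, -1):        # i = k-2, ..., 0
        u = x + (A @ u) / (i + 1)
    return u                               # u^(0) = z

def exp_matvec_input_grad(A, grad_z, k):
    """Approximate exp(A)^T grad_z = exp(-A) grad_z via the recurrence v^(i)."""
    v = grad_z
    for i in range(k - 2, -1, -1):
        v = grad_z - (A @ v) / (i + 1)
    return v                               # v^(0) = grad_x

if __name__ == "__main__":
    torch.manual_seed(0)
    n, k = 8, 12
    A = skew(0.3 * torch.randn(n, n))
    x = torch.randn(n)
    z = exp_matvec(A, x, k)
    # exp(A) is orthogonal, so the norm is preserved up to truncation error.
    print(torch.norm(x).item(), torch.norm(z).item())
    # The recurrence for the input gradient matches autograd exactly.
    x_req = x.clone().requires_grad_(True)
    exp_matvec(A, x_req, k).sum().backward()
    grad_manual = exp_matvec_input_grad(A, torch.ones(n), k)
    print(torch.allclose(x_req.grad, grad_manual, atol=1e-5))
```

Because the truncated series is itself a linear map in $x$, the recurrence for $v^{(i)}$ reproduces the autograd input gradient exactly; the savings discussed next concern only the weight gradient.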
Weight gradient ($\nabla_A \ell$): We first derive the exact expression for $\nabla_A \ell$ in the theorem below:

Theorem 1 The gradient of the loss function $\ell$ w.r.t. $A$, i.e. $\nabla_A \ell$, is given by:

$$\nabla_A \ell = \sum_{i=1}^{k-1} \frac{1}{i!} \left[ \left(A^{i-1} x\right) \left(v^{(i)}\right)^T - v^{(i)} \left(A^{i-1} x\right)^T \right] \qquad (4)$$

where $v^{(i)}$ is defined as in equation (3). Note that the first outer product, i.e. $\left(A^{i-1} x\right)\left(v^{(i)}\right)^T$, and the second, i.e. $v^{(i)} \left(A^{i-1} x\right)^T$, are transposes of each other, implying that each term in the summation is skew-symmetric. Although these outer products can be computed in a straightforward way for orthogonal fully connected layers, this is not the case for orthogonal convolution layers (SOC in this case). This is because, for a convolution filter $L \in \mathbb{R}^{p \times q \times r \times s}$, the term analogous to $A^{i-1}x$ is a patch of size $q \times r \times s$ and the term analogous to $v^{(i)}$ is another patch of size $p \times 1 \times 1$, resulting in an outer product of the desired size $p \times q \times r \times s$ per patch. Thus, for SOC layers, each term inside the summation is computed by summing the outer products over all such patches. For large input sizes, the number of such patches can often be large, making this computation expensive. To address this limitation, we use the following approximation:

$$\nabla_A \ell \approx u^{(1)} \left(v^{(1)}\right)^T - v^{(1)} \left(u^{(1)}\right)^T \qquad (5)$$

In Appendix B, we show that the above approximation is principled because, after subtracting the exact and approximate gradients (equations (4) and (5)) and simplifying, each term in the resulting series is divided by a large value in its denominator (and is thus ≈ 0). This approximation is useful because in equation (5) the outer product needs to be computed only once, whereas in equation (4) the outer products need to be computed $k$ times. Also, both $u^{(1)}$ and $v^{(1)}$ are computed while computing $u^{(0)} = z$ during the forward pass and $v^{(0)} = \nabla_x \ell$ during the backward pass using the recurrences in equations (2) and (3). Thus, we can simply store $u^{(1)}$ and $v^{(1)}$ during the forward and backward pass, respectively, so that $\nabla_A \ell$ can be computed directly using equation (5). In our experiments, we observe that this leads to a significant reduction in training time with almost no drop in performance (Tables 1, 3, 4).
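The following sketch illustrates how the stored $u^{(1)}$ and $v^{(1)}$ yield the approximate weight gradient of equation (5), again for the fully connected case. It is our illustration (assuming the skew parameterization $A = M^T - M$), not the released implementation, and it simply compares the approximation against the exact gradient obtained by autograd:

```python
# Approximate weight gradient (equation (5)) from u^(1) and v^(1).
import torch

def forward_and_u1(A, x, k):
    u, u1 = x, None
    for i in range(k - 2, -1, -1):
        u = x + (A @ u) / (i + 1)
        if i == 1:
            u1 = u                       # store u^(1) during the forward pass
    return u, u1                         # z = u^(0), u^(1)

def backward_and_v1(A, grad_z, k):
    v, v1 = grad_z, None
    for i in range(k - 2, -1, -1):
        v = grad_z - (A @ v) / (i + 1)
        if i == 1:
            v1 = v                       # store v^(1) during the backward pass
    return v, v1                         # grad_x = v^(0), v^(1)

if __name__ == "__main__":
    torch.manual_seed(0)
    n, k = 8, 12
    M = (0.1 * torch.randn(n, n)).requires_grad_(True)
    x, g = torch.randn(n), torch.randn(n)

    # Exact gradient w.r.t. the free parameter M through the truncated series.
    A = M.t() - M
    z, u1 = forward_and_u1(A, x, k)
    (g @ z).backward()

    # One-outer-product approximation using only u^(1) and v^(1).
    with torch.no_grad():
        _, v1 = backward_and_v1(A.detach(), g, k)
        approx = torch.outer(u1.detach(), v1) - torch.outer(v1, u1.detach())
        rel_err = (approx - M.grad).norm() / M.grad.norm()
        # The gap shrinks as the norm of A decreases (terms divided by factorials).
        print(f"relative error of the one-outer-product approximation: {rel_err:.3f}")
```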
5 Curvature-based Robustness Certificate

A key property of 1-Lipschitz CNNs is that the output of each layer is 1-Lipschitz with respect to the input. Given an input $x \in \mathbb{R}^d$, consider the penultimate output $g(x) \in \mathbb{R}^m$ and logits $f(x) \in \mathbb{R}^c$ for some 1-Lipschitz CNN. Existing robustness certificates [Li et al., 2019b, Singla et al., 2022] rely on the linearity of the mapping from $g(x)$ to $f(x)$. However, since MLPs have higher expressive power than linear functions [Cybenko, 1989], one way to improve performance could be to replace this mapping with an MLP. However, certifying robustness for deep MLPs is difficult. To address these challenges, we replace the mapping from $g(x)$ to $f(x)$ with a 1-hidden-layer MLP because such MLPs are easier to certify than deeper networks. Because computing exact certificates for ReLU networks is known to be NP-complete [Sinha et al., 2018], we use the differentiable Softplus activation [Dugas et al., 2000] to certify robustness. To certify robustness, we use the Curvature-based Robustness Certificate or CRC [Singla and Feizi, 2020] because it provides exact certificates for a significant fraction of inputs for shallow MLPs. We provide a brief review of CRC in Appendix Section D.

Let $g : \mathbb{R}^d \to \mathbb{R}^m$ be a 1-Lipschitz continuous function and $h : \mathbb{R}^m \to \mathbb{R}^c$ be a 1-hidden-layer MLP such that $f = h \circ g$. Further, let $l$ be the predicted class for input $x$, i.e. $f_l(x) \ge \max_{i \ne l} f_i(x)$. Since $g$ is 1-Lipschitz, it can be shown that if $h$ is provably robust in an l2 radius $R$ around the input $g(x)$, then $f$ is also provably robust in the l2 radius $R$ around $x$. The resulting procedure is called CRC-Lip and is given in the following proposition:

Figure 1: Illustration of (a) the rearrangement operation [Singla and Feizi, 2021] (left) and (b) 2D projection pooling (right).

Proposition 1 (CRC-Lip) For an input $x$ such that $f_l(x) \ge \max_{i \ne l} f_i(x)$, let $R$ be the robustness certificate for $h$ using input $g(x)$; then $R$ is also a robustness certificate for the function $f$:

$$\min_{i \ne l}\ \min_{h_l(y') = h_i(y')} \|y' - g(x)\|_2 \ge R \implies \min_{i \ne l}\ \min_{f_l(x') = f_i(x')} \|x' - x\|_2 \ge R \qquad (6)$$

The proof is in Appendix A.2. In our experiments, we find that although replacing the linear layer with an MLP achieves high standard accuracy, directly using the above certificate often leads to very small robustness certificates and thus low certified robust accuracy. To address this problem, we introduce (a) an adversarial training procedure that applies perturbations only to the input of the MLP, $g(x)$ (i.e. not to the input of the neural network, $x$), and (b) curvature regularization to reduce the curvature of the MLP. It was shown in Singla and Feizi [2021] that combining adversarial training with curvature regularization leads to significantly improved provable robust accuracy with a small reduction in standard accuracy. This results in the following loss function for training:

$$\min_{\Omega, \Psi}\ \mathbb{E}_{(x, l) \sim D} \left[ \max_{\|y' - g_{\Psi}(x)\|_2 \le \rho} \ell\left(h_{\Omega}(y'),\, l\right) \right] + \gamma\, K_h \qquad (7)$$

In the above equation, $\Omega$ denotes the parameters of the MLP (i.e. $h$), $\Psi$ the parameters of the 1-Lipschitz function $g$, $K_h$ the bound on the curvature of the MLP, $\rho$ the l2 perturbation radius and $\gamma$ the curvature regularization coefficient. The curvature bound $K_h$ is the same as in Singla and Feizi [2020]. Also, since we apply adversarial perturbations in the penultimate layer, we need to backpropagate through this procedure to enable training of the previous layers. To this end, we simply use an identity map (the gradient output equals the gradient input) to backpropagate through the adversarial training procedure and find that it works well in practice, achieving significantly better results compared to the state-of-the-art. It is possible that a more principled method of backpropagation may lead to better results, and we leave that avenue open for future research.
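A minimal sketch of this training procedure (our illustration, not the authors' implementation): adversarial perturbations are computed only on the penultimate features, and the perturbation is treated as a constant so that gradients pass back to the backbone through an identity map; `backbone`, `mlp_head`, `rho`, `gamma` and `curvature_bound` are hypothetical placeholders.

```python
# Adversarial training in the penultimate space with identity-map backprop.
import torch
import torch.nn.functional as F

def perturb_features(y, labels, mlp_head, rho, steps=5):
    """l2 PGD on the penultimate features y (no gradient to the backbone here)."""
    delta = torch.zeros_like(y, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(mlp_head(y + delta), labels)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            step = grad / grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
            delta = delta + 0.5 * rho * step
            norms = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1)
            delta = delta * (rho / norms).clamp(max=1.0)   # project onto the l2 ball
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

def crc_lip_loss(backbone, mlp_head, x, labels, rho, gamma, curvature_bound):
    y = backbone(x)                       # 1-Lipschitz feature map g(x)
    delta = perturb_features(y.detach(), labels, mlp_head, rho)
    # Identity-map ("straight-through") backprop: the forward pass uses y + delta,
    # but delta is a constant, so gradients flow to y (and the backbone) unchanged.
    logits = mlp_head(y + delta)
    return F.cross_entropy(logits, labels) + gamma * curvature_bound(mlp_head)
```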
6 Projection pooling layers

In 1-Lipschitz CNNs, pooling is usually carried out as follows: given an input $X \in \mathbb{R}^{q \times r \times r}$ ($r$ is even), we first use the rearrangement operation [Jacobsen et al., 2018] illustrated in Figure 1a to construct $X' \in \mathbb{R}^{4q \times (r/2) \times (r/2)}$. Next, we apply an orthogonal convolution, which gives an output of the same size, $Z \in \mathbb{R}^{4q \times (r/2) \times (r/2)}$, and divide it into two tensors of equal size (along the channel dimension), giving $Z_{:2q} \in \mathbb{R}^{2q \times (r/2) \times (r/2)}$ and $Z_{2q:} \in \mathbb{R}^{2q \times (r/2) \times (r/2)}$. We then define $\max(Z_{:2q}, Z_{2q:})$ (or either one of $Z_{:2q}$, $Z_{2q:}$) as the output of the pooling layer. Although $\max$ is 1-Lipschitz, its expressive power is limited. For example, consider the 1-Lipschitz function $\|(x, y)\|_2 = \sqrt{x^2 + y^2}$. It is easy to see that if $x = y$, the error between $\|(x, y)\|_2$ and $\max(x, y)$ can be arbitrarily large for large values of $x, y$. To address such limitations, we construct expressive pooling layers using the following mathematical property:

Theorem 2 Given $x \in \mathbb{R}^n$ and a manifold $\mathcal{M} \subseteq \mathbb{R}^n$, the distance function (in the l2 norm) is 1-Lipschitz:

$$d_{\mathcal{M}}(x) = \min_{x' \in \mathcal{M}} \|x - x'\|_2 \implies |d_{\mathcal{M}}(x) - d_{\mathcal{M}}(y)| \le \|x - y\|_2 \qquad (8)$$

The above theorem provides a powerful method for constructing a learnable pooling layer by selecting a learnable manifold $\mathcal{M}_{\Theta} \subseteq \mathbb{R}^n$ ($\Theta$ denotes the set of learnable parameters). To apply the pooling operation to an input $x \in \mathbb{R}^n$, we simply output its l2 distance to the manifold $\mathcal{M}_{\Theta}$, denoted by $d_{\mathcal{M}_{\Theta}}(x)$. Since $d_{\mathcal{M}_{\Theta}}(x) \in \mathbb{R}$ while $x \in \mathbb{R}^n$, this operation reduces the input size by a factor of $n$. As an example, let $\mathcal{M}_u = \{u\}$, where $u \in \mathbb{R}^n$ is the learnable parameter; the distance function is $d_{\mathcal{M}_u}(x) = \|x - u\|_2$. Even in this very simple case, for $u = 0$, $d_{\mathcal{M}_u}(x) = \|x\|_2$, and for $n = 2$ this function can learn to represent the function $\|(x, y)\|_2$ discussed earlier exactly. We also prove that a signed l2 distance function can be defined when $\mathcal{M}$ satisfies certain properties:

Corollary 1 If $\mathcal{M}$ is a connected manifold that divides $\mathbb{R}^n$ into two nonempty connected sets $A$ and $B$ such that $A \cap B = \emptyset$ and every path from $a \in A$ to $b \in B$ intersects a point on $\mathcal{M}$, then there exists a signed l2 distance function with different signs in $A$ and $B$.

Proofs of Theorem 2 and Corollary 1 are given in Appendices A.3 and A.4, respectively. In practice, computing $d_{\mathcal{M}}$ can be difficult for high-dimensional $\mathcal{M}$. To tackle this issue, we use 2D pooling layers where $\mathcal{M}$ is defined to be a piecewise linear curve in 2D. The l2 distance can then be computed by finding the minimum distance to all the line segments and connecting points. We emphasize that even in this relatively simple case, $\mathcal{M}$ can have a large number of parameters and $d_{\mathcal{M}}$ can still be efficient to compute because these individual distances can be computed in parallel. We use the manifold $\mathcal{M}_{\theta}$ defined below (illustrated in Figure 1b; the lines L1, L2 correspond to $\phi = +\theta, -\theta$):

$$\mathcal{M}_{\theta} = \left\{ R(\cos\phi, \sin\phi) : R \ge 0,\ \phi \in \{+\theta, -\theta\} \right\}$$

In each colored region (Figure 1b), $d_{\mathcal{M}_{\theta}}$ can be computed using the following:

$$d_{\mathcal{M}_{\theta}}\left(R(\cos\alpha, \sin\alpha)\right) = \begin{cases} R\,|\sin(\alpha - \theta)|, & 0 \le \alpha \le \theta + \pi/2 \\ R\,|\sin(\alpha + \theta)|, & 0 < -\alpha \le \theta + \pi/2 \\ R, & \text{otherwise} \end{cases} \qquad (9)$$

To apply projection pooling to $Z_{:2q}, Z_{2q:}$ discussed previously, we output $d_{\mathcal{M}_{\theta}}(Z_{:2q}, Z_{2q:})$.
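As an illustration of such a layer (a sketch under the assumptions of this section, not the released code), the following module computes the l2 distance of $(a, b)$ to the two rays of $\mathcal{M}_{\theta}$ by projecting onto each ray and clamping, which sidesteps the case analysis of equation (9) while producing the same 1-Lipschitz output:

```python
# 2D projection pooling: distance of (a, b) to two rays at angles +theta, -theta.
import torch
import torch.nn as nn

class ProjectionPool2D(nn.Module):
    def __init__(self, init_theta=0.25 * torch.pi):
        super().__init__()
        self.theta = nn.Parameter(torch.tensor(float(init_theta)))  # learnable

    @staticmethod
    def _dist_to_ray(a, b, phi):
        # Distance from (a, b) to the ray {R (cos phi, sin phi) : R >= 0}.
        t = (a * torch.cos(phi) + b * torch.sin(phi)).clamp_min(0.0)
        return torch.sqrt((a - t * torch.cos(phi)) ** 2
                          + (b - t * torch.sin(phi)) ** 2 + 1e-12)

    def forward(self, a, b):
        # a, b: the two halves of the channel dimension, e.g. Z[:, :2q] and Z[:, 2q:].
        d_plus = self._dist_to_ray(a, b, self.theta)
        d_minus = self._dist_to_ray(a, b, -self.theta)
        return torch.minimum(d_plus, d_minus)   # 1-Lipschitz in (a, b)

if __name__ == "__main__":
    pool = ProjectionPool2D()
    z = torch.randn(4, 64, 16, 16)              # toy activation tensor
    out = pool(z[:, :32], z[:, 32:])            # halve the channel dimension
    print(out.shape)                            # torch.Size([4, 32, 16, 16])
```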
7 Experiments

We perform experiments in the setting of provably robust image classification on the CIFAR-10 and CIFAR-100 datasets. We use the LipConvnet-5, 10, 15, ..., 40 architectures for comparison, with SOC as the orthogonal convolution and MaxMin as the activation in all architectures. All experiments were performed on 1 NVIDIA GeForce RTX 2080 Ti GPU. All networks were trained for 200 epochs with an initial learning rate of 0.1, dropped by a factor of 0.1 after 100 and 150 epochs. For adversarial training with curvature regularization, we use ρ = 36/255 (≈ 0.1411), γ = 0.5 for CIFAR-10 and ρ = 0.2, γ = 0.75 for CIFAR-100. We find that certifying robustness using CRC-Lip is computationally expensive for CIFAR-100 due to the large number of classes. To address this issue, we only use the classes with the top-10 logits (instead of all 100 classes) for CIFAR-100 (Table 4). Since CRC-Lip requires solving a convex optimization problem, we consider a certificate to be valid if the input is correctly classified and the gradient at the optimal solution is $\le 10^{-6}$ (the certificate is set to 0 otherwise). All results are reported using the complete test sets of CIFAR-10 and CIFAR-100. We compare the provable robust accuracy using 3 different l2 perturbation radii: 36/255, 72/255 and 108/255.

Table 2: LipConvnet-n and LipBlock-n/5 architectures.

(a) LipConvnet-n:

| Output size | Layer |
|---|---|
| 32 × 32 × 32 | Conv + MaxMin |
| 64 × 16 × 16 | LipBlock-n/5 |
| 128 × 8 × 8 | LipBlock-n/5 |
| 256 × 4 × 4 | LipBlock-n/5 |
| 512 × 2 × 2 | LipBlock-n/5 |
| 1024 × 1 × 1 | LipBlock-n/5 |
| # of classes | Linear/MLP-1 |

(b) LipBlock-n/5:

| Output size | Layer | Repeats |
|---|---|---|
| q × r × r | Input | -- |
| q × r × r | Conv + MaxMin | n/5 - 1 |
| 4q × (r/2) × (r/2) | Rearrange | 1 |
| 4q × (r/2) × (r/2) | Conv | 1 |
| 2q × (r/2) × (r/2) | Pooling | 1 |

Table 3: Provable robustness results on CIFAR-10 (LipConvnet-5, 10 results in Appendix Table 5).

| LipConvnet | Method | Standard accuracy | Robust (36/255) | Robust (72/255) | Robust (108/255) | Increase (standard) | Increase (36/255) |
|---|---|---|---|---|---|---|---|
| 15 | Baseline | 77.78% | 62.75% | 46.34% | 31.38% | -- | -- |
| 15 | + Fast | 77.75% | 62.52% | 46.23% | 31.19% | -0.03% | -0.23% |
| 15 | + CRC | 79.44% | 66.99% | 52.56% | 38.30% | +1.66% | +4.24% |
| 20 | Baseline | 77.50% | 63.31% | 46.42% | 31.53% | -- | -- |
| 20 | + Fast | 77.13% | 62.05% | 45.86% | 31.13% | -0.37% | -1.26% |
| 20 | + CRC | 79.13% | 66.45% | 52.45% | 38.12% | +1.63% | +3.14% |
| 25 | Baseline | 77.18% | 62.46% | 45.78% | 31.16% | -- | -- |
| 25 | + Fast | 76.94% | 61.91% | 45.59% | 30.69% | -0.24% | -0.55% |
| 25 | + CRC | 79.19% | 66.28% | 51.74% | 37.99% | +2.01% | +3.82% |
| 30 | Baseline | 74.43% | 59.65% | 43.76% | 29.16% | -- | -- |
| 30 | + Fast | 74.69% | 58.84% | 43.33% | 28.93% | +0.26% | -0.81% |
| 30 | + CRC | 78.64% | 66.05% | 51.31% | 37.30% | +4.21% | +6.40% |
| 35 | Baseline | 72.73% | 57.18% | 42.08% | 28.09% | -- | -- |
| 35 | + Fast | 72.91% | 57.58% | 41.52% | 27.37% | +0.18% | +0.40% |
| 35 | + CRC | 78.57% | 65.94% | 52.04% | 37.63% | +5.84% | +8.76% |
| 40 | Baseline | 71.33% | 55.74% | 39.32% | 26.06% | -- | -- |
| 40 | + Fast | 71.60% | 56.15% | 39.82% | 25.63% | +0.27% | +0.41% |
| 40 | + CRC | 78.41% | 65.51% | 51.32% | 37.30% | +7.08% | +9.77% |

Table details: For the baseline ("Baseline" in Tables 3, 4), we use standard max pooling with the certificate based on LLN [Singla et al., 2022] due to its superior performance over prior works. In Tables 3 and 4, for each architecture, "+ Fast" adds faster gradient computation, and "+ CRC" replaces max pooling with projection pooling (equation (9)) and replaces the last linear layer with a 1-hidden-layer MLP (CRC-Lip certificate), while also using faster gradients. For each architecture, the columns "Increase (standard)" and "Increase (36/255)" denote the increase in standard and provable robust accuracy relative to the "Baseline" standard and provable robust accuracy (at 36/255). Results where projection pooling and CRC-Lip are added separately are given in Appendix Tables 5 and 6.

LipConvnet architecture: We use a 1-Lipschitz CNN architecture called LipConvnet-n, where n is a multiple of 5 and n + 1 is the total number of convolution layers. It consists of an initial SOC layer that expands the number of channels from 3 to 32. This is followed by 5 blocks that reduce the spatial dimensions (height and width) by half while doubling the number of channels. The architecture is summarized in Tables 2a and 2b. The last layer outputs the class logits and is either a linear layer, in which case we use LLN to certify robustness, or a single-hidden-layer MLP, in which case we use CRC-Lip.

Correcting the certificates: Since SOC is an approximation, the Lipschitz constant of each SOC layer can be slightly more than 1, and if we use a large number of SOC layers (e.g. 41 in LipConvnet-40), the Lipschitz constant of the full network, Lip(f), can be significantly larger than 1. To mitigate this issue, we take the following steps: (a) we use a large number of terms (k = 15) at test time to approximate the exponential, which results in a small approximation error (using the error bound in Singla and Feizi [2021]), and (b) we compute Lip(f) by multiplying the Lipschitz constants of all SOC layers (estimated using the power method) and then divide the certificate by Lip(f).
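A sketch of this correction step (our illustration, with hypothetical stand-in layers rather than trained SOC modules): each layer's Lipschitz constant, i.e. the spectral norm of its Jacobian, is estimated with the power method using autograd for the adjoint, the per-layer constants are multiplied to bound Lip(f), and the certificates are divided by this bound.

```python
# Certificate correction: bound Lip(f) by a product of per-layer spectral norms.
import torch

def layer_spectral_norm(layer, in_shape, iters=100):
    """Power method on J^T J, where J is the Jacobian of a (linear) layer map."""
    x = torch.randn(1, *in_shape)
    for _ in range(iters):
        x = x / x.norm().clamp_min(1e-12)
        x.requires_grad_(True)
        y = layer(x)
        # J^T y via a vector-Jacobian product (autograd).
        (x_next,) = torch.autograd.grad(y, x, grad_outputs=y)
        x = x_next.detach()
    x = x / x.norm().clamp_min(1e-12)
    return layer(x).norm().item()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-ins for the network's layers; the real procedure would pass the
    # trained SOC layer modules themselves.
    convs = [torch.nn.Conv2d(16, 16, 3, padding=1, bias=False) for _ in range(5)]
    lip_f = 1.0
    for conv in convs:
        lip_f *= layer_spectral_norm(conv, (16, 32, 32))
    raw_certificates = torch.tensor([0.20, 0.05, 0.31])   # toy per-example radii
    print(lip_f, raw_certificates / lip_f)                # divide by Lip(f)
```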
Table 4: Provable robustness results on CIFAR-100 (LipConvnet-5, 10 results in Appendix Table 6).

| LipConvnet | Method | Standard accuracy | Robust (36/255) | Robust (72/255) | Robust (108/255) | Increase (standard) | Increase (36/255) |
|---|---|---|---|---|---|---|---|
| 15 | Baseline | 48.06% | 34.52% | 23.08% | 14.70% | -- | -- |
| 15 | + Fast | 47.97% | 33.84% | 22.66% | 14.26% | -0.09% | -0.68% |
| 15 | + CRC | 50.79% | 37.50% | 26.16% | 17.27% | +2.73% | +2.98% |
| 20 | Baseline | 47.37% | 33.99% | 23.40% | 14.69% | -- | -- |
| 20 | + Fast | 46.41% | 33.07% | 22.06% | 14.00% | -0.96% | -0.92% |
| 20 | + CRC | 51.84% | 38.54% | 27.32% | 18.53% | +4.47% | +4.55% |
| 25 | Baseline | 45.77% | 32.08% | 21.36% | 13.64% | -- | -- |
| 25 | + Fast | 45.28% | 31.67% | 20.69% | 13.26% | -0.49% | -0.41% |
| 25 | + CRC | 51.59% | 39.27% | 27.94% | 19.06% | +5.82% | +7.19% |
| 30 | Baseline | 46.39% | 33.08% | 22.02% | 13.77% | -- | -- |
| 30 | + Fast | 45.86% | 32.54% | 21.18% | 12.77% | -0.53% | -0.54% |
| 30 | + CRC | 50.97% | 38.77% | 27.73% | 19.28% | +4.58% | +5.69% |
| 35 | Baseline | 43.42% | 30.36% | 19.71% | 12.66% | -- | -- |
| 35 | + Fast | 42.78% | 29.88% | 19.73% | 12.52% | -0.64% | -0.48% |
| 35 | + CRC | 51.42% | 39.01% | 28.94% | 20.29% | +8.00% | +8.65% |
| 40 | Baseline | 41.72% | 28.53% | 18.37% | 11.49% | -- | -- |
| 40 | + Fast | 42.07% | 28.51% | 18.86% | 11.89% | +0.35% | -0.02% |
| 40 | + CRC | 50.11% | 38.69% | 28.45% | 20.05% | +8.39% | +10.16% |

7.1 Results using faster gradient computation

We show the reduction in training time per epoch (in seconds) on CIFAR-10 in Table 1 and on CIFAR-100 in Appendix Table 8. In both tables, we observe that for deeper networks (≥ 25 layers), the reduction in time per epoch is around 30%. The corresponding standard and provable robust accuracy numbers are given in Tables 3 (CIFAR-10) and 4 (CIFAR-100) in the row "+ Fast". From the columns "Increase (standard)" and "Increase (36/255)", we observe that the performance is similar to the baseline across all network architectures. For the deeper networks LipConvnet-35 and 40, we observe an increase in performance for CIFAR-10 and a small decrease (≤ 0.64%) for CIFAR-100.

7.2 Results using projection pooling and CRC-Lip

We observe that using CRC and projection pooling (row "+ CRC") leads to significant improvements in performance across all LipConvnet architectures. On CIFAR-10, we observe significant improvements in both the standard (≥ 1.63%, column "Increase (standard)") and provable robust accuracy (≥ 3.14%, column "Increase (36/255)") across all architectures. On CIFAR-100, we also observe significant improvements in the standard (≥ 2.49%) and provable robust accuracy (≥ 2.27%). For deeper networks (LipConvnet-35, 40), on CIFAR-10, we observe even more significant gains in the standard (≥ 5.84%) and provable robust accuracy (≥ 8.76%). Similarly, on CIFAR-100, we observe gains of ≥ 8.00% (standard) and ≥ 8.65% (provable robust).

Our results establish a new state-of-the-art for both the standard and provable robust accuracy across all attack radii. In Table 3 (CIFAR-10), the best "Baseline" standard and provable robust accuracy values (at 36/255, 72/255, 108/255) are 77.78% and 63.31%, 46.42%, 31.53%, respectively. The best "+ CRC" values are 79.57% and 67.13%, 53.17%, 38.60%. This results in improvements of +1.79% and +3.82%, +6.75%, +7.07%, respectively. Similarly, in Table 4 (CIFAR-100), the best "Baseline" standard and provable robust values are 48.06% and 34.52%, 23.40%, 14.70%. The best "+ CRC" values are 51.84% (+3.78%) and 39.27% (+4.75%), 27.94% (+4.54%), 20.29% (+5.59%).

8 Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230, a grant from NIST 60NANB20D134, HR001119S0026 (GARD), ONR YIP award N00014-22-1-2271, Army Grant No. W911NF2120076 and the NSF award CCF2212458.

References

C. Anil, J. Lucas, and R. B. Grosse. Sorting out lipschitz function approximation. In ICML, 2018.
R. Bunel, J. Lu, I. Turkaslan, P. H. S. Torr, P. Kohli, and M. P. Kumar. Branch and bound for piecewise linear neural network verification. J. Mach. Learn. Res., 21:42:1–42:39, 2020.
X. Cao and N. Z. Gong. Mitigating evasion attacks to deep neural networks via region-based classification. In Proceedings of the 33rd Annual Computer Security Applications Conference, ACSAC 2017, pages 278–287, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450353458. doi: 10.1145/3134600.3134606. URL https://doi.org/10.1145/3134600.3134606.
M. Cissé, P. Bojanowski, E. Grave, Y. N. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 854–863. PMLR, 2017. URL http://proceedings.mlr.press/v70/cisse17a.html.
J. M. Cohen, E. Rosenfeld, and J. Z. Kolter. Certified adversarial robustness via randomized smoothing. In ICML, 2019.
F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of relu networks via maximization of linear regions. AISTATS 2019, 2019.
G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS), 2(4):303–314, Dec. 1989. ISSN 0932-4194. doi: 10.1007/BF02551274. URL http://dx.doi.org/10.1007/BF02551274.
C. Dugas, Y. Bengio, F. Bélisle, C. Nadeau, and R. Garcia. Incorporating second-order functional knowledge for better option pricing. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/2000/file/44968aece94f667e4095002d140b5896-Paper.pdf.
K. Dvijotham*, R. Stanforth, S. Gowal, T. Mann, and P. Kohli. A dual approach to scalable verification of deep networks. In Proceedings of the Thirty-Fourth Annual Conference on Uncertainty in Artificial Intelligence (UAI-18), Corvallis, Oregon, 2018. AUAI Press.
K. D. Dvijotham*, A. Raghunathan, J. Uesato, S. Dathathri, A. Kurakin, I. Goodfellow, P. Kohli, J. Steinhardt, and P. Liang. Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. In Advances in Neural Information Processing Systems, 2020.
H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree. Regularisation of neural networks by enforcing lipschitz continuity, 2020.
S. Gowal, K. Dvijotham*, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. A. Mann, and P. Kohli. On the effectiveness of interval bound propagation for training verifiably robust models. In ICCV, volume 9, 2019.
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of wasserstein gans. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5767–5777. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.
E. Hoogeboom, V. G. Satorras, J. Tomczak, and M. Welling. The convolution exponential and generalized sylvester flows. arXiv, abs/2006.01910, 2020.
Y. Huang, H. Zhang, Y. Shi, J. Z. Kolter, and A. Anandkumar. Training certifiably robust neural networks with efficient local lipschitz bounds. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=FTt28RYj5Pc.
J.-H. Jacobsen, A. W. Smeulders, and E. Oyallon. i-revnet: Deep invertible networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.
B. Kiani, R. Balestriero, Y. Lecun, and S. Lloyd. projunn: efficient method for training deep networks with unitary matrices, 2022. URL https://arxiv.org/abs/2203.05483.
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
A. Kumar, A. Levine, S. Feizi, and T. Goldstein. Certifying confidence via randomized smoothing. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 5165–5177. Curran Associates, Inc., 2020a. URL https://proceedings.neurips.cc/paper/2020/file/37aa5dfc44dddd0d19d4311e2c7a0240-Paper.pdf.
A. Kumar, A. Levine, T. Goldstein, and S. Feizi. Curse of dimensionality on randomized smoothing for certifiable robustness. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5458–5467. PMLR, 13–18 Jul 2020b. URL http://proceedings.mlr.press/v119/kumar20b.html.
M. Lécuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. K. K. Jana. Certified robustness to adversarial examples with differential privacy. In IEEE S&P 2019, 2018.
K. Leino and M. Fredrikson. Relaxing local robustness. In Neural Information Processing Systems (NIPS), 2021.
K. Leino, Z. Wang, and M. Fredrikson. Globally-robust neural networks. In International Conference on Machine Learning (ICML), 2021.
A. Levine, S. Singla, and S. Feizi. Certifiably robust interpretation in deep learning, 2019.
B. Li, C. Chen, W. Wang, and L. Carin. Certified adversarial robustness with additive noise. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 9464–9474. Curran Associates, Inc., 2019a. URL https://proceedings.neurips.cc/paper/2019/file/335cd1b90bfa4ee70b39d08a4ae0cf2d-Paper.pdf.
Q. Li, S. Haque, C. Anil, J. Lucas, R. Grosse, and J.-H. Jacobsen. Preventing gradient attenuation in lipschitz constrained convolutional networks. Conference on Neural Information Processing Systems, 2019b.
X. Liu, M. Cheng, H. Zhang, and C. Hsieh. Towards robust neural networks via random self-ensemble. In ECCV, 2018.
P. M. Long and H. Sedghi. Generalization bounds for deep convolutional neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=r1e_FpNFDr.
J. Lu and M. P. Kumar. Neural network branching for neural network verification. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=B1evfa4tPB.
T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1QRgziT-.
C. Müller, F. Serre, G. Singh, M. Püschel, and M. Vechev. Scaling polyhedral neural network verification on gpus. In A. Smola, A. Dimakis, and I. Stoica, editors, Proceedings of Machine Learning and Systems, volume 3, pages 733–746, 2021. URL https://proceedings.mlsys.org/paper/2021/file/ca46c1b9512a7a8315fa3c5a946e8265-Paper.pdf.
A. D. Palma, H. Behl, R. R. Bunel, P. Torr, and M. P. Kumar. Scaling the convex barrier with active sets. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=uQfOy7LrlTR.
H. Qian and M. N. Wegman. L2-nonexpansive neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxGSsR9FQ.
A. Raghunathan, J. Steinhardt, and P. Liang. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018.
H. Salman, J. Li, I. Razenshteyn, P. Zhang, H. Zhang, S. Bubeck, and G. Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 11292–11303. Curran Associates, Inc., 2019a. URL https://proceedings.neurips.cc/paper/2019/file/3a24b25a7b092a252166a1641ae953e7-Paper.pdf.
H. Salman, G. Yang, H. Zhang, C.-J. Hsieh, and P. Zhang. A convex relaxation barrier to tight robustness verification of neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/246a3c5544feb054f3ea718f61adfa16-Paper.pdf.
H. Salman, M. Sun, G. Yang, A. Kapoor, and J. Z. Kolter. Denoised smoothing: A provable defense for pretrained classifiers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS '20, Red Hook, NY, USA, 2020. Curran Associates Inc. ISBN 9781713829546.
H. Salman, S. Jain, E. Wong, and A. Madry. Certified patch robustness via smoothed vision transformers. CoRR, abs/2110.07719, 2021. URL https://arxiv.org/abs/2110.07719.
H. Sedghi, V. Gupta, and P. M. Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJevYoA9Fm.
G. Singh, M. Püschel, and M. Vechev. Fast polyhedra abstract domain. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, pages 46–59, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450346603. doi: 10.1145/3009837.3009885. URL https://doi.org/10.1145/3009837.3009885.
G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev. Fast and effective robustness certification. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018a. URL https://proceedings.neurips.cc/paper/2018/file/f2f446980d8e971ef3da97af089481c3-Paper.pdf.
G. Singh, T. Gehr, M. Mirman, M. Puschel, and M. T. Vechev. Fast and effective robustness certification. In NeurIPS, 2018b.
G. Singh, R. Ganvir, M. Püschel, and M. Vechev. Beyond the single neuron convex barrier for neural network certification. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a. URL https://proceedings.neurips.cc/paper/2019/file/0a9fdbb17feb6ccb7ec405cfb85222c4-Paper.pdf.
G. Singh, T. Gehr, M. Püschel, and M. Vechev. An abstract domain for certifying neural networks. Proc. ACM Program. Lang., 3(POPL), Jan 2019b. doi: 10.1145/3290354. URL https://doi.org/10.1145/3290354.
G. Singh, T. Gehr, M. Püschel, and M. Vechev. Robustness certification with refinement. In International Conference on Learning Representations, 2019c. URL https://openreview.net/forum?id=HJgeEh09KQ.
H. Singh, M. P. Kumar, P. Torr, and K. D. Dvijotham. Overcoming the convex barrier for simplex inputs. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=JXREUkyHi7u.
S. Singla and S. Feizi. Second-order provable defenses against adversarial attacks. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8981–8991. PMLR, 13–18 Jul 2020. URL http://proceedings.mlr.press/v119/singla20a.html.
S. Singla and S. Feizi. Skew orthogonal convolutions. In ICML, 2021. URL https://arxiv.org/abs/2105.11417.
S. Singla, S. Singla, and S. Feizi. Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=tD7eCtaSkR.
A. Sinha, H. Namkoong, and J. Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk6kPgZA-.
J. Su, W. Byeon, and F. Huang. Scaling-up diverse orthogonal convolutional networks with a paraunitary framework. In ICML, 2022. URL https://arxiv.org/abs/2106.09121.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
A. Trockman and J. Z. Kolter. Orthogonalizing convolutional layers with the cayley transform. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Pbj8H_jEHYv.
D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In ICLR, 2018.
Y. Tsuzuku, I. Sato, and M. Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In NeurIPS, 2018.
C. Villani. Optimal transport, old and new, 2008.
S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C.-J. Hsieh, and J. Z. Kolter. Beta-CROWN: Efficient bound propagation with per-neuron split constraints for neural network robustness verification. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=ahYIlRBeCFw.
L. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, L. Daniel, D. Boning, and I. Dhillon. Towards fast computation of certified robustness for ReLU networks. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5276–5285. PMLR, 10–15 Jul 2018. URL https://proceedings.mlr.press/v80/weng18a.html.
E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5286–5295, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/wong18a.html.
E. Wong, F. R. Schmidt, J. H. Metzen, and J. Z. Kolter. Scaling provable adversarial defenses. In NeurIPS, 2018.
L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz, and J. Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5393–5402, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/xiao18a.html.
T. Yu, J. Li, Y. CAI, and P. Li. Constructing orthogonal convolutions in an explicit manner. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Zr5W2LSRhD.
B. Zhang, T. Cai, Z. Lu, D. He, and L. Wang. Towards certifying l-infinity robustness using neural networks with l-infinity-dist neurons. In ICML, 2021.
B. Zhang, D. Jiang, D. He, and L. Wang. Boosting the certified robustness of l-infinity distance nets. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=Q76Y7wkiji.
H. Zhang, T.-W. Weng, P.-Y. Chen, C.-J. Hsieh, and L. Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems (NIPS), arXiv preprint arXiv:1811.00866, Dec 2018.
H. Zhang, P. Zhang, and C.-J. Hsieh. Recurjac: An efficient recursive algorithm for bounding jacobian matrix of neural networks and its applications. In AAAI Conference on Artificial Intelligence (AAAI), arXiv preprint arXiv:1810.11783, Dec 2019.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] In Section 7, we discuss how CRC-Lip might lead to slower certification.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A]
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A] Because it is expensive to run.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [N/A]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]