# Skew Orthogonal Convolutions

Sahil Singla¹, Soheil Feizi¹

¹Department of Computer Science, University of Maryland, College Park. Correspondence to: Sahil Singla.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Training convolutional neural networks with a Lipschitz constraint under the ℓ2 norm is useful for provable adversarial robustness, interpretable gradients, stable training, etc. While 1-Lipschitz networks can be designed by imposing a 1-Lipschitz constraint on each layer, training such networks requires each layer to be gradient norm preserving (GNP) to prevent gradients from vanishing. However, existing GNP convolutions suffer from slow training, lead to significant reductions in accuracy, and provide no guarantees on their approximations. In this work, we propose a GNP convolution layer called Skew Orthogonal Convolution (SOC) that uses the following mathematical property: when a matrix is Skew-Symmetric, its exponential function is an orthogonal matrix. To use this property, we first construct a convolution filter whose Jacobian is Skew-Symmetric. Then, we use the Taylor series expansion of the Jacobian exponential to construct the SOC layer, which is orthogonal. To efficiently implement SOC, we keep a finite number of terms from the Taylor series and provide a provable guarantee on the approximation error. Our experiments on CIFAR-10 and CIFAR-100 show that SOC allows us to train provably Lipschitz, large convolutional neural networks significantly faster than prior works while achieving significant improvements in both standard and certified robust accuracies.

1. Introduction

The Lipschitz constant of a neural network (throughout, measured under the ℓ2 norm unless specified otherwise) puts an upper bound on how much the output can change in proportion to a change in the input. Previous work has shown that a small Lipschitz constant leads to improved generalization bounds (Bartlett et al., 2017; Long & Sedghi, 2020), adversarial robustness (Cissé et al., 2017; Szegedy et al., 2014) and interpretable gradients (Tsipras et al., 2018). The Lipschitz constant also upper bounds the change in gradient norm during backpropagation and can thus prevent gradient explosion during training, allowing us to train very deep networks (Xiao et al., 2018). Moreover, the Wasserstein distance between two probability distributions can be expressed as a maximization over 1-Lipschitz functions (Villani, 2008; Peyré & Cuturi, 2018), and has been used for training Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) and Wasserstein VAEs (Tolstikhin et al., 2018).

Using the Lipschitz composition property, i.e. Lip(f ∘ g) ≤ Lip(f) · Lip(g), the Lipschitz constant of a neural network can be bounded by the product of the Lipschitz constants of all of its layers. 1-Lipschitz neural networks can thus be designed by imposing a 1-Lipschitz constraint on each layer. However, Anil et al. (2018) identified a key difficulty with this approach: because a layer with a Lipschitz bound of 1 can only reduce the norm of the gradient during backpropagation, each step of backprop gradually attenuates the gradient norm, resulting in a much smaller gradient for the layers closer to the input, thereby making training slow and difficult. To address this problem, they introduced Gradient Norm Preserving (GNP) architectures, where each layer preserves the gradient norm by ensuring that its Jacobian is an Orthogonal matrix (for all inputs to the layer).
For convolutional layers, this involves constraining the Jacobian of each convolution layer to be an Orthogonal matrix (Li et al., 2019b; Xiao et al., 2018) and using a GNP activation function called GroupSort (Anil et al., 2018). Li et al. (2019b) introduced an Orthogonal convolution layer called Block Convolutional Orthogonal Parametrization (BCOP). BCOP uses a clever application of 1D Orthogonal convolution filters of sizes 2×1 and 1×2 to construct a 2D Orthogonal convolution filter. It overcomes common issues of Lipschitz-constrained networks, such as gradient norm attenuation and loose Lipschitz bounds, and enables training of large, provably 1-Lipschitz Convolutional Neural Networks (CNNs), achieving results competitive with existing methods for provable adversarial robustness. However, BCOP suffers from slow training and a significant reduction in accuracy, and it provides no guarantees on its approximation of an Orthogonal Jacobian matrix (details in Section 2).

Figure 1. Each color denotes a scalar; the minus sign (−) on top of a color denotes the negative of the scalar with that color. (a) Given any convolution filter M, we can construct a Skew-Symmetric filter. (b) Next, we apply spectral normalization to bound the norm of the Jacobian. (c) On input X, applying the convolution exponential (L ∗e X) results in an Orthogonal convolution.

To address these shortcomings, we introduce an Orthogonal convolution layer called Skew Orthogonal Convolution (SOC). For provably Lipschitz CNNs, SOC results in significantly improved standard and certified robust accuracies compared to BCOP while requiring significantly less training time (Table 2). We also derive provable guarantees on our approximation of an Orthogonal Jacobian. Our work is based on the following key mathematical property: if A is a Skew-Symmetric matrix (i.e. A = −A^T), then exp(A) is an Orthogonal matrix (i.e. exp(A)^T exp(A) = exp(A) exp(A)^T = I), where

exp(A) = I + A + A^2/2! + A^3/3! + ⋯    (1)

To design an Orthogonal convolution layer using this property, we need to: (a) construct Skew-Symmetric filters, i.e. convolution filters whose Jacobian is Skew-Symmetric; and (b) efficiently approximate exp(J) with a guaranteed small error, where J is the Jacobian of a Skew-Symmetric filter.

To construct Skew-Symmetric convolution filters, we prove (in Theorem 2) that every Skew-Symmetric filter L can be written as L = M − conv_transpose(M) for some filter M, where conv_transpose denotes the convolution transpose operator defined in equation (3) (note that this operator is different from the matrix transpose). This result is analogous to the property that every real Skew-Symmetric matrix A can be written as A = B − B^T for some real matrix B.

We can efficiently approximate exp(J) using a finite number of terms in equation (1) and the convolution exponential (Hoogeboom et al., 2020). But it is unclear whether the series can be approximated with high precision and how many terms need to be computed to achieve the desired approximation error. To resolve these issues, we derive a bound on the ℓ2 norm of the difference between exp(J) and its approximation using the first k terms of equation (1), denoted S_k(J), when J is Skew-Symmetric (Theorem 3):

||exp(J) − S_k(J)||_2 ≤ ||J||_2^k / k!    (2)
This guarantee suggests that when ||J||_2 is small, exp(J) can be approximated with high precision using a small number of terms. Moreover, the factorial term in the denominator causes the error to decay very quickly as k increases. In our experiments, we observe that using k = 12 and ||J||_2 ≤ 1.8 leads to an error bound of 2.415 × 10^{-6}. We can use spectral normalization (Miyato et al., 2018) to ensure that ||J||_2 is provably bounded, using the theoretical result of Singla & Feizi (2021). The design of SOC is summarized in Figure 1. Code is available at https://github.com/singlasahil14/SOC.

To summarize, we make the following contributions:

- We introduce an Orthogonal convolution layer (called Skew Orthogonal Convolution, or SOC) by first designing a Skew-Symmetric convolution filter (Theorem 2) and then computing the exponential function of its Jacobian using a finite number of terms of its Taylor series.
- For a Skew-Symmetric filter with Jacobian J, we derive a bound on the approximation error between exp(J) and its k-term approximation (Theorem 3).
- SOC achieves significantly higher standard and provable robust accuracy on 1-Lipschitz convolutional neural networks than BCOP while requiring less training time (Table 2). For example, SOC achieves 2.82% higher standard and 3.91% higher provable robust accuracy with 54.6% less training time on CIFAR-10 using the LipConvnet-20 architecture (details in Section 6.5). For deeper networks (≥ 30 layers), SOC outperforms BCOP by more than 10% in both standard and robust accuracy while again achieving around a 50% reduction in training time.
- In Theorem 4, we prove that for every Skew-Symmetric filter with Jacobian J, there exists a Skew-Symmetric matrix B satisfying exp(B) = exp(J) and ||B||_2 ≤ π. Since ||J||_2 can be large, this can allow us to reduce the approximation error without sacrificing expressive power.

2. Related work

Provably Lipschitz convolutional neural networks: Anil et al. (2018) proposed a class of fully connected neural networks (FCNs) that are Gradient Norm Preserving (GNP) and provably 1-Lipschitz, using the GroupSort activation and Orthogonal weight matrices. Since then, there have been numerous attempts to tightly enforce 1-Lipschitz constraints on convolutional neural networks (CNNs) (Cissé et al., 2017; Tsuzuku et al., 2018; Qian & Wegman, 2019; Gouk et al., 2020; Sedghi et al., 2019). However, these approaches either enforce loose Lipschitz bounds or are computationally intractable for large networks. Li et al. (2019b) introduced an Orthogonal convolution layer called Block Convolutional Orthogonal Parametrization (BCOP) that avoids the aforementioned issues and allows the training of large, provably 1-Lipschitz CNNs while achieving provable robust accuracy comparable with existing methods.
However, BCOP suffers from several issues: (a) it can only represent a subset of all Orthogonal convolutions; (b) it requires a BCOP convolution filter with 2n channels to represent all the connected components of a BCOP convolution filter with n channels, thus requiring 4 times more parameters; (c) to construct a convolution filter of size k × k with n input/output channels, it requires 2k − 1 matrices of size 2n × 2n that must remain Orthogonal throughout training, leading to the well-known difficulties of optimization over the Stiefel manifold (Edelman et al., 1998); (d) it constructs convolution filters from symmetric projectors, and errors in these projectors can lead to errors in the final convolution filter, yet BCOP provides no guarantees on this error.

Provable defenses against adversarial examples: A classifier is said to be provably robust if one can guarantee that the classifier's prediction remains constant within some region around the input. Most existing methods for provable robustness bound either the Lipschitz constant of the neural network or that of its individual layers (Weng et al., 2018; Zhang et al., 2019; 2018; Wong et al., 2018; Wong & Kolter, 2018; Raghunathan et al., 2018; Croce et al., 2019; Singh et al., 2018; Singla & Feizi, 2020). However, these methods do not scale to large and practical networks on ImageNet. To scale to such large networks, randomized smoothing (Liu et al., 2018; Cao & Gong, 2017; Lécuyer et al., 2018; Li et al., 2019a; Cohen et al., 2019; Salman et al., 2019; Kumar et al., 2020; Levine et al., 2019) has been proposed as a probabilistically certified defense. In contrast, the defense we propose in this work is deterministic and hence not directly comparable to randomized smoothing.

3. Notation

For a vector v, v_j denotes its jth element. For a matrix A, A_{j,:} and A_{:,k} denote the jth row and kth column, respectively; both are treated as column vectors (thus A_{j,:} is the transpose of the jth row of A). A_{j,k} denotes the element in the jth row and kth column of A, and A_{:j,:k} denotes the matrix containing the first j rows and k columns of A. The same rules extend directly to higher-order tensors. Bold zero (0) denotes the matrix (or tensor) with all elements equal to zero, and I denotes the identity matrix. ⊗ denotes the Kronecker product. We use C to denote the field of complex numbers and R the field of real numbers. For a scalar a ∈ C, ā denotes its complex conjugate; for a matrix (or tensor) A, Ā denotes the element-wise complex conjugate. For A ∈ C^{m×n}, A^H denotes the Hermitian transpose (i.e. A^H = Ā^T). For a ∈ C, Re(a), Im(a) and |a| denote the real part, imaginary part and modulus of a, respectively. We use ι to denote the imaginary unit (i.e. ι² = −1).

For a matrix A ∈ C^{q×r} and a tensor B ∈ C^{p×q×r}, vec(A) denotes the vector constructed by stacking the rows of A, and vec(B) the vector constructed by stacking the vectors vec(B_{j,:,:}), j ∈ [p − 1], so that:

vec(A)^T = [A_{0,:}^T, A_{1,:}^T, …, A_{q−1,:}^T],   vec(B)^T = [vec(B_{0,:,:})^T, vec(B_{1,:,:})^T, …, vec(B_{p−1,:,:})^T]

For a 2D convolution filter L ∈ C^{p×q×r×s}, we define the tensor conv_transpose(L) ∈ C^{q×p×r×s} as follows:

[conv_transpose(L)]_{i,j,k,l} = [L]_{j,i,r−1−k,s−1−l}    (3)

Note that this is very different from the usual matrix transpose; see the example in Section 4. Given an input X ∈ C^{q×n×n}, we use L ∗ X ∈ C^{p×n×n} to denote the convolution of the filter L with X, and L ∗i X ≜ L ∗ (L ∗(i−1) X) to denote i repeated convolutions. Unless specified, we assume zero padding and stride 1 in each direction.
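To make the conv_transpose operator in equation (3) concrete, the following is a minimal PyTorch sketch (not taken from the paper's released code). It assumes the standard (out-channels, in-channels, height, width) filter layout and real-valued filters, for which conjugation is unnecessary: conv_transpose simply swaps the two channel axes and rotates each spatial slice by 180 degrees.

```python
import torch

def conv_transpose_filter(L: torch.Tensor) -> torch.Tensor:
    # Equation (3) for real filters: swap the output/input channel axes
    # and flip both spatial axes (a 180-degree rotation of each 2D slice).
    return L.transpose(0, 1).flip((2, 3))

# The 1x1x3x3 example from Section 4: entries a..i become i..a.
L = torch.arange(1.0, 10.0).reshape(1, 1, 3, 3)
print(conv_transpose_filter(L).reshape(3, 3))
# tensor([[9., 8., 7.],
#         [6., 5., 4.],
#         [3., 2., 1.]])
```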
4. Filters with Skew-Symmetric Jacobians

We know that for any matrix A that is Skew-Symmetric (A = −A^T), exp(A) is an Orthogonal matrix:

exp(A) (exp(A))^T = (exp(A))^T exp(A) = I

This suggests that if we can parametrize the complete set of convolution filters with Skew-Symmetric Jacobians, we can use the convolution exponential (Hoogeboom et al., 2020) to approximate an Orthogonal matrix. To construct this set, we first prove that if convolution with a filter L ∈ R^{m×m×p×q} (p and q odd) has Jacobian J, then convolution with conv_transpose(L) has Jacobian J^T. We note that convolution with the conv_transpose(L) filter is exactly the operation often called transposed convolution (the adjoint of the convolution operator), which appears in backpropagation through convolution layers (Goodfellow et al., 2016).

To motivate our proof, consider a filter L ∈ R^{1×1×3×3}. Applying conv_transpose (equation (3)), we get:

L = [[a, b, c], [d, e, f], [g, h, i]],   conv_transpose(L) = [[i, h, g], [f, e, d], [c, b, a]]

That is, for a 2D convolution filter with 1 channel, conv_transpose flips it along the horizontal and vertical directions. To understand why this flipping transposes the Jacobian, we provide another example for a 1D convolution filter in Figure 2.

Figure 2. Each color denotes a scalar. Flipping a convolution filter (of odd size) transposes its Jacobian; thus, any odd-sized filter that equals the negative of its flip leads to a Skew-Symmetric Jacobian.

Our proof uses the following expression for the Jacobian of convolution with a filter L ∈ R^{1×1×(2p+1)×(2q+1)} and input X ∈ R^{1×n×n}:

∇_X(L ∗ X) = Σ_{i=−p}^{p} Σ_{j=−q}^{q} L_{0,0,p+i,q+j} (P^(i) ⊗ P^(j)),

where P^(k) ∈ R^{n×n} with [P^(k)]_{i,j} = 1 if i − j = k and 0 otherwise. The above equation leads to the following theorem:

Theorem 1. Consider a 2D convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} and an input X ∈ R^{m×n×n}. Let J = ∇_X(L ∗ X); then J^T = ∇_X(conv_transpose(L) ∗ X).

Next, we prove that any 2D convolution filter L whose Jacobian is a Skew-Symmetric matrix can be expressed as L = M − conv_transpose(M), where M has the same dimensions as L. This allows us to parametrize the set of all convolution filters with Skew-Symmetric Jacobian matrices.

Theorem 2. Consider a 2D convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} and an input X ∈ R^{m×n×n}. The Jacobian ∇_X(L ∗ X) is Skew-Symmetric if and only if L = M − conv_transpose(M) for some filter M ∈ R^{m×m×(2p+1)×(2q+1)}.

Thus, convolution with the filter L results in a skew-symmetric operator. This operator can also be interpreted as an element of the Lie algebra of the special orthogonal group, i.e. the group of orthogonal matrices with determinant 1. We prove Theorems 1 and 2 for the more general case of complex convolution filters (L_{i,j,k,l} ∈ C) in Appendix Sections B.1 and B.2. Theorem 2 allows us to convert any arbitrary convolution filter into a filter with a Skew-Symmetric Jacobian. This leads to the following definition:

Definition 1. (Skew-Symmetric Convolution Filter) A convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} is said to be Skew-Symmetric if, given an input X ∈ R^{m×n×n}, the Jacobian matrix ∇_X(L ∗ X) is Skew-Symmetric.

We note that although Theorem 2 requires the height and width of M to be odd integers, we can also construct a Skew-Symmetric filter when M has even height and/or width by zero padding M to make the desired dimensions odd.
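As a quick numerical illustration of Theorem 2 and the exp-of-Skew-Symmetric property, the sketch below (our own, under the paper's stride-1, zero-padding assumptions, reusing the conv_transpose_filter helper from above) materializes the Jacobian of a small convolution with L = M − conv_transpose(M) and checks that it is Skew-Symmetric and that its matrix exponential is Orthogonal:

```python
import torch
import torch.nn.functional as F

def conv_transpose_filter(L):
    return L.transpose(0, 1).flip((2, 3))        # equation (3), real case

torch.manual_seed(0)
m, n = 2, 5                                      # channels and spatial size (kept tiny)
M = torch.randn(m, m, 3, 3, dtype=torch.float64)
L = M - conv_transpose_filter(M)                 # Theorem 2: filter with Skew-Symmetric Jacobian

# Materialize the Jacobian J of X -> L * X (zero padding, stride 1) by pushing
# every standard-basis input through the convolution; column e of J is vec(L * e).
basis = torch.eye(m * n * n, dtype=torch.float64).reshape(m * n * n, m, n, n)
J = F.conv2d(basis, L, padding=1).reshape(m * n * n, m * n * n).T

print(torch.allclose(J, -J.T, atol=1e-6))        # Jacobian is Skew-Symmetric
Q = torch.linalg.matrix_exp(J)                   # so exp(J) should be Orthogonal
print(torch.allclose(Q.T @ Q, torch.eye(m * n * n, dtype=torch.float64), atol=1e-6))
```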
5. Skew Orthogonal Convolution layers

In this section, we derive a method to approximate the exponential of the Jacobian of a Skew-Symmetric convolution filter (i.e. exp(J)), and we derive a bound on the approximation error.

Given an input X ∈ R^{m×n×n} and a Skew-Symmetric convolution filter L ∈ R^{m×m×k×k} (k odd), let J be the Jacobian of the convolution with L, so that:

J vec(X) = vec(L ∗ X)    (4)

By construction, J is a Skew-Symmetric matrix, and thus exp(J) is an Orthogonal matrix. We are interested in efficiently computing exp(J) vec(X), where:

exp(J) vec(X) = vec(X) + J vec(X)/1! + J^2 vec(X)/2! + J^3 vec(X)/3! + ⋯

Using equation (4), the above expression can be written as:

exp(J) vec(X) = vec(X) + vec(L ∗ X)/1! + vec(L ∗2 X)/2! + vec(L ∗3 X)/3! + ⋯

where, as in Section 3, L ∗i X ≜ L ∗ (L ∗(i−1) X). Using the above equation, we define L ∗e X as follows:

L ∗e X = X + (L ∗ X)/1! + (L ∗2 X)/2! + (L ∗3 X)/3! + ⋯    (5)

The above operation is called the convolution exponential and was introduced by Hoogeboom et al. (2020). By construction, L ∗e X satisfies vec(L ∗e X) = exp(J) vec(X). Thus, the Jacobian of L ∗e X with respect to X equals exp(J), which is Orthogonal (since J is Skew-Symmetric). However, L ∗e X can only be approximated using a finite number of terms of the series in equation (5), so we need to bound the error of such an approximation.

5.1. Bounding the Approximation Error

To bound the approximation error with a finite number of terms, first note that since the Jacobian matrix J is Skew-Symmetric, all of its eigenvalues are purely imaginary. For a purely imaginary scalar λ ∈ C (i.e. Re(λ) = 0), we first bound the error between exp(λ) and the approximation p_k(λ) computed using k terms of the exponential series:

|exp(λ) − p_k(λ)| ≤ |λ|^k / k!,   ∀ λ : Re(λ) = 0    (6)

The above result then allows us to prove the following result for a Skew-Symmetric matrix in a straightforward manner:

Theorem 3. For a Skew-Symmetric matrix J, we have the inequality:

||exp(J) − S_k(J)||_2 ≤ ||J||_2^k / k!,   where S_k(J) = Σ_{j=0}^{k−1} J^j / j! is the sum of the first k terms of the series in equation (1).

A more general proof of Theorem 3 (for J ∈ C^{n×n} and skew-Hermitian, i.e. J = −J^H) is given in Appendix Section B.3. The above theorem allows us to bound the approximation error between the true matrix exponential (which is Orthogonal) and its k-term approximation as a function of the number of terms k and the Jacobian norm ||J||_2. The factorial term in the denominator causes the error to decay very quickly as the number of terms increases. We call the resulting algorithm Skew Orthogonal Convolution (SOC); it is summarized in Algorithm 1 below. We emphasize that the above theorem is valid only for Skew-Symmetric matrices and is hence not directly applicable to the general convolution exponential (Hoogeboom et al., 2020).

```
Algorithm 1: Skew Orthogonal Convolution (SOC)
Input:  feature map X ∈ R^{ci×n×n}, convolution filter M ∈ R^{m×m×h×w} (m = max(ci, co)), number of terms K
Output: Y, the result of applying the convolution exponential

if ci < co then
    X ← pad(X, (co − ci, 0, 0))          # zero-pad the input channels
end
L ← M − conv_transpose(M)                # Skew-Symmetric filter (Theorem 2)
L ← spectral_normalization(L)            # bound the Jacobian norm (Section 6.1)
Y ← X
factorial ← 1
for j = 2 to K do
    factorial ← factorial × (j − 1)
    X ← L ∗ X                            # next power of the filter applied to the input
    Y ← Y + X / factorial
end
if ci > co then
    Y ← Y[0 : co, :, :]                  # keep only the first co output channels
end
return Y
```
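Before turning to the complete-set discussion, here is a small numerical sanity check of the Theorem 3 bound (a sketch of our own, using dense random Skew-Symmetric matrices rather than convolution Jacobians; the values 1.8 and k = 12 anticipate the settings reported in Section 7):

```python
import torch
from math import factorial

torch.manual_seed(0)

# Random Skew-Symmetric matrix rescaled so that ||J||_2 = 1.8, the largest
# Jacobian norm observed in the experiments of Section 7.
A = torch.randn(64, 64, dtype=torch.float64)
J = A - A.T
J = 1.8 * J / torch.linalg.matrix_norm(J, ord=2)

def S_k(J, k):
    # First k terms of the exponential series: sum_{j < k} J^j / j!
    out, term = torch.zeros_like(J), torch.eye(J.shape[0], dtype=J.dtype)
    for j in range(k):
        out = out + term / factorial(j)
        term = term @ J
    return out

k = 12
err = torch.linalg.matrix_norm(torch.linalg.matrix_exp(J) - S_k(J, k), ord=2)
bound = 1.8 ** k / factorial(k)              # ~2.415e-6, the value quoted in Section 7
print(float(err), bound, float(err) <= bound)
```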
5.2. Complete Set of Skew Orthogonal Convolutions

Observe that for Re(λ) = 0 (i.e. λ = ιθ, θ ∈ R), we have:

exp(λ) = exp(λ + 2ιπk) = cos(θ) + ι sin(θ),   ∀ k ∈ Z

This suggests that we can shift λ by integer multiples of 2πι without changing exp(λ), while reducing the approximation error (via Theorem 3). For example, exp(ιπ/3) requires fewer terms to achieve a desired approximation error (using equation (6)) than, say, exp(ι(π/3 + 2π)), because the latter has a larger modulus (2π + π/3 = 7π/3) than the former (π/3). This insight leads to the following theorem:

Theorem 4. Given a real Skew-Symmetric matrix A, we can construct another real Skew-Symmetric matrix B such that (i) exp(A) = exp(B) and (ii) ||B||_2 ≤ π.

A proof is given in Appendix Section B.4. This proves that every real Skew-Symmetric Jacobian matrix J (associated with some Skew-Symmetric convolution filter L) can be replaced with a Skew-Symmetric matrix B such that exp(B) = exp(J) and ||B||_2 ≤ π (note that ||J||_2 can be arbitrarily large). This strictly reduces the approximation error (Theorem 3) without sacrificing expressive power. We make the following observations about Theorem 4: (a) even if J equals the Jacobian of some Skew-Symmetric convolution filter, B may not satisfy this property, i.e. it may not exhibit the block doubly Toeplitz structure of the Jacobian of a 2D convolution filter (Sedghi et al., 2019) and thus may not equal the Jacobian of any Skew-Symmetric convolution filter; (b) even if B does satisfy this property, the filter size of the Skew-Symmetric filter whose Jacobian equals B can be very different from that of the filter with Jacobian J. In this sense, Theorem 4 cannot directly be used to parametrize the complete set of SOC layers, because it is not clear how to efficiently parametrize the set of all matrices B that satisfy (a) ||B||_2 ≤ π and (b) exp(B) = exp(J), where J is the Jacobian of some Skew-Symmetric convolution filter. We leave the question of efficiently parametrizing the complete set of Skew Orthogonal Convolution layers open for future research.

5.3. Extensions to 3D and Complex Convolutions

When a matrix A ∈ C^{n×n} is skew-Hermitian (A = −A^H), exp(A) is a unitary matrix:

exp(A) (exp(A))^H = (exp(A))^H exp(A) = I

To use this property to construct a unitary convolution layer with complex weights, we first define:

Definition 2. (Skew-Hermitian Convolution Filter) A convolution filter L ∈ C^{m×m×(2p+1)×(2q+1)} is said to be Skew-Hermitian if, given an input X ∈ C^{m×n×n}, the Jacobian matrix ∇_X(L ∗ X) is Skew-Hermitian.

Using the extensions of Theorems 1 and 2 to complex convolution filters (proofs in Appendix Sections B.1 and B.2), we can construct a 2D Skew-Hermitian convolution filter. Next, using an extension of Theorem 3 to complex Skew-Hermitian matrices (proof in Appendix Section B.3), we obtain exactly the same bound on the approximation error. The resulting algorithm is called Skew Unitary Convolution (SUC). We also prove an extension of Theorem 4 for complex Skew-Hermitian matrices in Appendix Section B.5, and we discuss the construction of 3D Skew-Hermitian convolution filters in Appendix Sections B.6 and B.7.

6. Implementation details of SOC

In this section, we explain the key implementation details of SOC (summarized in Algorithm 1).

6.1. Bounding the norm of the Jacobian

To bound the norm of the Jacobian of a Skew-Symmetric convolution filter, we use the following result:

Theorem (Singla & Feizi, 2021). Consider a convolution filter L ∈ R^{co×ci×h×w} applied to an input X, and let J be the Jacobian of L ∗ X with respect to X. Then:

||J||_2 ≤ √(hw) · min(||R||_2, ||S||_2, ||T||_2, ||U||_2),

where R ∈ R^{co·h×ci·w}, S ∈ R^{co·w×ci·h}, T ∈ R^{co×ci·h·w} and U ∈ R^{co·h·w×ci} are obtained by reshaping the filter L.

Using the above theorem, we divide the Skew-Symmetric convolution filter by min(||R||_2, ||S||_2, ||T||_2, ||U||_2), so that the spectral norm of the Jacobian of the resulting filter is bounded by √(hw). We then multiply the normalized filter by the hyperparameter 0.7, as we find that this allows faster convergence with no loss in performance. Unless specified, we use h = w = 3 in all of our experiments, resulting in a norm bound of 0.7 × 3 = 2.1.

Note that while the above theorem also allows us to bound the Lipschitz constant of a convolution layer, for deep networks (say 40 layers) the resulting Lipschitz bound (assuming a 1-Lipschitz activation function) would grow to 2.1^40 ≈ 7.74 × 10^12. Thus, the above bound alone is unlikely to enforce a tight global Lipschitz constraint.
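The following is a sketch of this normalization step. The exact axis groupings used to form R, S, T and U are our reading of Singla & Feizi (2021) and should be treated as an assumption; the 0.7 scaling follows Section 6.1 above.

```python
import torch

def four_norms_min(L: torch.Tensor) -> torch.Tensor:
    # Four reshapes of the filter (assumed axis groupings, for illustration);
    # each spectral norm, scaled by sqrt(h*w), upper-bounds the Jacobian norm.
    co, ci, h, w = L.shape
    R = L.permute(0, 2, 1, 3).reshape(co * h, ci * w)
    S = L.permute(0, 3, 1, 2).reshape(co * w, ci * h)
    T = L.reshape(co, ci * h * w)
    U = L.permute(0, 2, 3, 1).reshape(co * h * w, ci)
    norms = [torch.linalg.matrix_norm(A, ord=2) for A in (R, S, T, U)]
    return torch.stack(norms).min()

def normalize_filter(L: torch.Tensor, scale: float = 0.7) -> torch.Tensor:
    # Dividing by the smallest of the four norms bounds the Jacobian norm by
    # sqrt(h*w); the extra 0.7 gives the 0.7 * sqrt(9) = 2.1 bound for 3x3 filters.
    return scale * L / four_norms_min(L)

# hypothetical 3x3 filter with 16 input and 16 output channels
print(normalize_filter(torch.randn(16, 16, 3, 3)).shape)
```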
6.2. Different input and output channels

In general, we may want to construct an orthogonal convolution that maps ci input channels to co output channels, where ci ≠ co. Consider the two cases:

Case 1 (co < ci): We construct a Skew-Symmetric convolution filter with ci channels. After applying the convolution exponential, we keep only the first co channels of the output.

Case 2 (co > ci): We construct a Skew-Symmetric convolution filter with co channels. We zero-pad the input with co − ci channels and then compute the convolution exponential.

6.3. Strided convolution

Given an input X ∈ R^{ci×n×n} (n even), we may want to construct an orthogonal convolution with output Y ∈ R^{co×(n/2)×(n/2)}, i.e. an orthogonal convolution with stride 2. To perform a strided convolution, we first apply the invertible downsampling operation ψ (Jacobsen et al., 2018), shown in Figure 3, to construct an input with shape 4ci × (n/2) × (n/2). Next, we apply the convolution exponential to this input using a Skew-Symmetric convolution filter with 4ci input and co output channels.

Figure 3. Invertible downsampling operation ψ.

6.4. Number of terms for the approximation

During training, we use 6 terms to approximate the exponential function, for speed. During evaluation, we use 12 terms to ensure that the exponential of the Jacobian is sufficiently close to an orthogonal matrix.

6.5. Network architecture

We design a provably 1-Lipschitz architecture called LipConvnet-n, where n is the number of convolution layers (a multiple of 5 in our experiments). It consists of five blocks; each block has (n/5) − 1 Orthogonal convolutions with stride 1 (each followed by the MaxMin activation) and one Orthogonal convolution with stride 2 (again followed by MaxMin). The architecture is summarized in Table 1, where conv[k×k, m, s] denotes a convolution layer with a filter of size k×k, m output channels and stride s. The network ends with a fully connected layer that outputs the class logits. The MaxMin activation function (Anil et al., 2018) is described in Appendix Section C.

Table 1. LipConvnet-n architecture. Each convolution layer is followed by the MaxMin activation.

| Output size | Convolution layer  | Repeats   |
|-------------|--------------------|-----------|
| 16×16       | conv[3×3, 32, 1]   | (n/5) − 1 |
|             | conv[3×3, 64, 2]   | 1         |
| 8×8         | conv[3×3, 64, 1]   | (n/5) − 1 |
|             | conv[3×3, 128, 2]  | 1         |
| 4×4         | conv[3×3, 128, 1]  | (n/5) − 1 |
|             | conv[3×3, 256, 2]  | 1         |
| 2×2         | conv[3×3, 256, 1]  | (n/5) − 1 |
|             | conv[3×3, 512, 2]  | 1         |
| 1×1         | conv[3×3, 512, 1]  | (n/5) − 1 |
|             | conv[1×1, 1024, 2] | 1         |
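Putting Sections 6.1–6.4 together, the following PyTorch-style sketch mirrors Algorithm 1 for the stride-1 case. It is a simplified illustration, not the official implementation (which is available at the repository linked in Section 1); it reuses the conv_transpose_filter and four_norms_min helpers sketched earlier, and the class name and initialization are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOCSketch(nn.Module):
    """Simplified, stride-1 sketch of Algorithm 1 (strided and complex
    variants omitted). Assumes conv_transpose_filter and four_norms_min
    from the earlier sketches are in scope."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, terms: int = 6):
        super().__init__()
        self.c_in, self.c_out, self.terms = c_in, c_out, terms
        m = max(c_in, c_out)
        self.M = nn.Parameter(0.1 * torch.randn(m, m, kernel_size, kernel_size))
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.c_in < self.c_out:                      # zero-pad the input channels
            x = F.pad(x, (0, 0, 0, 0, 0, self.c_out - self.c_in))
        L = self.M - conv_transpose_filter(self.M)      # Skew-Symmetric filter (Theorem 2)
        L = 0.7 * L / four_norms_min(L)                 # spectral normalization (Section 6.1)
        y, z, fact = x, x, 1.0
        for j in range(1, self.terms):                  # y = sum_{j < K} (L *j x) / j!
            z = F.conv2d(z, L, padding=self.pad)
            fact = fact * j
            y = y + z / fact
        if self.c_in > self.c_out:                      # keep the first c_out channels
            y = y[:, : self.c_out]
        return y

# hypothetical usage: a 32 -> 64 channel SOC layer on a CIFAR-sized feature map
layer = SOCSketch(32, 64, terms=12)
print(layer(torch.randn(8, 32, 16, 16)).shape)          # torch.Size([8, 64, 16, 16])
```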
7. Experiments

Our goal is to evaluate the expressiveness of our method (SOC) compared to BCOP for constructing Orthogonal convolution layers. To study this, we perform experiments in three settings: (a) provably robust image classification, (b) standard training, and (c) adversarial training. All experiments were performed using a single NVIDIA GeForce RTX 2080 Ti GPU. All networks were trained for 200 epochs with an initial learning rate of 0.1, dropped by a factor of 0.1 after epochs 50 and 150. We use no weight decay when training with BCOP convolutions, since it significantly reduces their performance; for training with standard convolutions and SOC, we use a weight decay of 10^{-4}. We use the same setup for training with BCOP as given in its GitHub repository. While that implementation uses 20 Björck iterations for orthogonalizing matrices, we also compare against BCOP with 30 Björck iterations in Appendix Table 5; unless specified, we use BCOP with 20 Björck iterations.

To evaluate the approximation error of SOC at convergence (using Theorem 3), we compute the norm of the Jacobian of the Skew-Symmetric convolution filter using real spectral normalization (Ryu et al., 2019). We observe that the maximum norm (across experiments and layers of the network) is below 1.8, i.e. slightly below the theoretical upper bound of 2.1 discussed in Section 6.1, resulting in a maximum approximation error of 1.8^{12}/12! ≈ 2.415 × 10^{-6}.

7.1. Provable Defenses against Adversarial Attacks

To certify provable robustness of a 1-Lipschitz network f on some input x, we first define the margin of prediction:

M_f(x) = max(0, y_t − max_{i≠t} y_i),

where y = [y_1, y_2, …] are the logits predicted by f on x and y_t is the logit of the correct class t. Using Theorem 7 in Li et al. (2019b), we can derive the robustness certificate M_f(x)/√2. The provable robust accuracy, evaluated using an ℓ2 perturbation radius of 36/255 (the same as in Li et al. (2019b)), equals the fraction of data points x in the test dataset satisfying M_f(x)/√2 ≥ 36/255. Additional results using an ℓ2 perturbation radius of 72/255 are given in Appendix Table 6.
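A minimal sketch of this certification rule follows (our own illustration; the logits and labels below are hypothetical):

```python
import torch

def certified_robust_accuracy(logits, labels, radius=36 / 255):
    # A point is certified at l2 radius r if M_f(x) / sqrt(2) >= r, where
    # M_f(x) = max(0, y_t - max_{i != t} y_i) is the prediction margin.
    y_t = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    runner_up = logits.scatter(1, labels.unsqueeze(1), float("-inf")).amax(dim=1)
    margin = (y_t - runner_up).clamp(min=0.0)
    return (margin / 2 ** 0.5 >= radius).float().mean()

# hypothetical batch of logits and labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.3, 0.0]])
labels = torch.tensor([0, 1])
print(certified_robust_accuracy(logits, labels))
```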
In Table 2, we show the results of our experiments using LipConvnet architectures with varying numbers of layers on the CIFAR-10 and CIFAR-100 datasets.

Table 2. Results for provable robustness against adversarial examples (ℓ2 perturbation radius of 36/255). Time per epoch is the training time per epoch, in seconds.

| Model         | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Robust acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Robust acc. | CIFAR-100 Time/epoch (s) |
|---------------|------------|--------|--------|---------|--------|--------|---------|
| LipConvnet-5  | BCOP       | 74.35% | 58.01% | 96.153  | 42.61% | 28.67% | 94.463  |
|               | SOC        | 75.78% | 59.16% | 31.096  | 42.73% | 27.82% | 30.844  |
| LipConvnet-10 | BCOP       | 74.47% | 58.48% | 122.115 | 42.08% | 27.75% | 119.038 |
|               | SOC        | 76.48% | 60.82% | 48.242  | 43.71% | 29.39% | 48.363  |
| LipConvnet-15 | BCOP       | 73.86% | 57.39% | 145.944 | 39.98% | 26.17% | 144.173 |
|               | SOC        | 76.68% | 61.30% | 63.742  | 42.93% | 28.79% | 63.540  |
| LipConvnet-20 | BCOP       | 69.84% | 52.10% | 170.009 | 36.13% | 22.50% | 172.266 |
|               | SOC        | 76.43% | 61.92% | 77.226  | 43.07% | 29.18% | 76.460  |
| LipConvnet-25 | BCOP       | 68.26% | 49.92% | 207.359 | 28.41% | 16.34% | 205.313 |
|               | SOC        | 75.19% | 60.18% | 98.534  | 43.31% | 28.59% | 95.950  |
| LipConvnet-30 | BCOP       | 64.11% | 43.39% | 227.916 | 26.87% | 14.03% | 229.840 |
|               | SOC        | 74.47% | 59.04% | 110.531 | 42.90% | 28.74% | 107.163 |
| LipConvnet-35 | BCOP       | 63.05% | 41.72% | 267.272 | 21.71% | 10.33% | 274.256 |
|               | SOC        | 73.70% | 58.44% | 130.671 | 42.44% | 28.31% | 126.368 |
| LipConvnet-40 | BCOP       | 60.17% | 38.87% | 295.350 | 19.97% | 8.66%  | 289.369 |
|               | SOC        | 71.63% | 54.36% | 144.556 | 41.83% | 27.98% | 140.458 |

We make the following observations: (a) SOC achieves significantly higher standard and provable robust accuracy than BCOP across architectures and datasets; (b) SOC requires significantly less training time per epoch than BCOP; and (c) as the number of layers increases, the performance of BCOP degrades rapidly while that of SOC remains largely consistent. For example, on the LipConvnet-40 architecture, SOC achieves 11.46% higher standard accuracy and 15.49% higher provable robust accuracy on CIFAR-10, and 21.86% higher standard accuracy and 19.32% higher provable robust accuracy on CIFAR-100. We further emphasize that none of the other well-known deterministic provable defenses (discussed in Section 2) scale to networks as large as the ones in Table 2. BCOP, while scalable, achieves significantly lower standard and provable robust accuracies than SOC for deep networks.

7.2. Standard Training

For standard training, we perform experiments using the ResNet-18, ResNet-34 and ResNet-50 architectures on the CIFAR-10 and CIFAR-100 datasets. Results are presented in Table 3. We again observe that SOC achieves higher standard accuracy than BCOP across architectures and datasets while requiring significantly less time to train. For ResNet-50, the performance of SOC almost matches that of standard convolution layers, while BCOP results in an Out Of Memory (OOM) error. However, for ResNet-18 and ResNet-34, the difference is not as significant as the one observed for LipConvnet architectures in Table 2. We conjecture that this is because the residual connections allow the gradient to flow relatively freely, compared to being restricted to flow through the convolution layers in LipConvnet architectures.

Table 3. Results for standard accuracy. For ResNet-50, we observe an OOM (Out Of Memory) error when using BCOP.

| Model     | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Time/epoch (s) |
|-----------|------------|--------|---------|--------|---------|
| ResNet-18 | Standard   | 95.10% | 13.289  | 77.60% | 13.440  |
|           | BCOP       | 92.38% | 128.383 | 71.16% | 128.146 |
|           | SOC        | 94.24% | 110.750 | 74.55% | 103.633 |
| ResNet-34 | Standard   | 95.54% | 22.348  | 78.60% | 22.806  |
|           | BCOP       | 93.79% | 237.068 | 73.38% | 235.367 |
|           | SOC        | 94.44% | 170.864 | 75.52% | 164.178 |
| ResNet-50 | Standard   | 95.47% | 38.834  | 78.11% | 37.454  |
|           | SOC        | 94.68% | 584.762 | 77.95% | 597.297 |

7.3. Adversarial Training

For adversarial training, we use a threat model with an ℓ∞ attack radius of 8/255. Note that we use the ℓ∞ threat model (instead of ℓ2) because it is known to be a stronger adversarial threat model for evaluating empirical robustness (Madry et al., 2018). For training, we use the FGSM variant of Wong et al. (2020). For evaluation, we use 50 iterations of PGD with a step size of 2/255 and 10 random restarts. Results are presented in Table 4. We observe that for the ResNet-18 architecture, on both CIFAR-10 and CIFAR-100, SOC results in significantly improved standard and empirical robust accuracy compared to BCOP while requiring significantly less time to train.

Table 4. Results for empirical robustness against adversarial examples (ℓ∞ perturbation radius of 8/255).

| Model     | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Robust acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Robust acc. | CIFAR-100 Time/epoch (s) |
|-----------|------------|--------|--------|---------|--------|--------|---------|
| ResNet-18 | Standard   | 83.05% | 44.39% | 28.139  | 59.87% | 22.78% | 28.147  |
|           | BCOP       | 79.26% | 34.85% | 264.694 | 54.80% | 16.00% | 252.868 |
|           | SOC        | 82.24% | 43.73% | 203.860 | 58.95% | 22.65% | 199.188 |
The performance of SOC comes close to that of a standard convolution layer, with the difference being less than 1% in both standard and robust accuracy on both datasets.

8. Discussion and Future work

In this work, we design a new orthogonal convolution layer by first constructing a Skew-Symmetric convolution filter and then applying the convolution exponential (Hoogeboom et al., 2020) to the filter. We also derive provable guarantees on the approximation of the exponential using a finite number of terms. Our method achieves significantly higher accuracy than BCOP for various network architectures and datasets under standard, adversarial and provably robust training setups, while requiring less training time per epoch. We suggest the following directions for future research:

Reducing the evaluation time: While SOC requires less time to train than BCOP, it requires more time for evaluation, because the convolution filter needs to be applied multiple times to approximate the orthogonal matrix with the desired error. In contrast, BCOP constructs an orthogonal convolution filter that needs to be applied only once during evaluation. From Theorem 3, we know that we can reduce the number of terms required to achieve the desired approximation error by reducing the Jacobian norm ||J||_2. Training approaches such as spectral norm regularization (Singla & Feizi, 2021) and singular value clipping (Sedghi et al., 2019) could be useful to further lower ||J||_2 and thus reduce the evaluation time.

Complete set of SOC convolutions: While Theorem 4 suggests that the complete set of SOC convolutions can be constructed from a subset of Skew-Symmetric matrices B that satisfy (a) ||B||_2 ≤ π and (b) exp(B) = exp(A), where A is the Jacobian of some Skew-Symmetric convolution filter, it is an open question how to efficiently parametrize this subset for training Lipschitz convolutional neural networks. This remains an interesting problem for future research.

9. Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230, HR001119S0026, HR00112090132, NIST 60NANB20D134 and a Simons Fellowship on Foundations of Deep Learning.

References

Anil, C., Lucas, J., and Grosse, R. B. Sorting out Lipschitz function approximation. In ICML, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 2017. URL http://dl.acm.org/citation.cfm?id=3295222.3295372.

Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. In ACSAC, 2017. URL https://doi.org/10.1145/3134600.3134606.

Cissé, M., Bojanowski, P., Grave, E., Dauphin, Y. N., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017. URL http://proceedings.mlr.press/v70/cisse17a.html.
Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of ReLU networks via maximization of linear regions. In AISTATS, 2019.

Edelman, A., Arias, T. A., and Smith, S. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20:303–353, 1998.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of neural networks by enforcing Lipschitz continuity, 2020.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NeurIPS, 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.

Hoogeboom, E., Satorras, V. G., Tomczak, J., and Welling, M. The convolution exponential and generalized Sylvester flows. arXiv, abs/2006.01910, 2020.

Jacobsen, J.-H., Smeulders, A. W., and Oyallon, E. i-RevNet: Deep invertible networks. In ICLR, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.

Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In ICML, 2020. URL http://proceedings.mlr.press/v119/kumar20b.html.

Lécuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. K. K. Certified robustness to adversarial examples with differential privacy. In IEEE S&P, 2019.

Levine, A., Singla, S., and Feizi, S. Certifiably robust interpretation in deep learning, 2019.

Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In NeurIPS, 2019a. URL https://proceedings.neurips.cc/paper/2019/file/335cd1b90bfa4ee70b39d08a4ae0cf2d-Paper.pdf.

Li, Q., Haque, S., Anil, C., Lucas, J., Grosse, R., and Jacobsen, J.-H. Preventing gradient attenuation in Lipschitz constrained convolutional networks. In NeurIPS, 2019b.

Liu, X., Cheng, M., Zhang, H., and Hsieh, C. Towards robust neural networks via random self-ensemble. In ECCV, 2018.

Long, P. M. and Sedghi, H. Generalization bounds for deep convolutional neural networks. In ICLR, 2020. URL https://openreview.net/forum?id=r1e_FpNFDr.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In ICLR, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Peyré, G. and Cuturi, M. Computational optimal transport, 2018.

Qian, H. and Wegman, M. N. L2-nonexpansive neural networks. In ICLR, 2019. URL https://openreview.net/forum?id=ByxGSsR9FQ.

Raghunathan, A., Steinhardt, J., and Liang, P. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018.

Ryu, E., Liu, J., Wang, S., Chen, X., Wang, Z., and Yin, W. Plug-and-play methods provably converge with properly trained denoisers. In ICML, 2019. URL http://proceedings.mlr.press/v97/ryu19a.html.

Salman, H., Li, J., Razenshteyn, I., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, 2019. URL https://proceedings.neurips.cc/paper/2019/file/3a24b25a7b092a252166a1641ae953e7-Paper.pdf.

Sedghi, H., Gupta, V., and Long, P. M. The singular values of convolutional layers. In ICLR, 2019. URL https://openreview.net/forum?id=rJevYoA9Fm.

Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. T. Fast and effective robustness certification. In NeurIPS, 2018.

Singla, S. and Feizi, S. Second-order provable defenses against adversarial attacks. In ICML, 2020. URL http://proceedings.mlr.press/v119/singla20a.html.

Singla, S. and Feizi, S. Fantastic four: Differentiable and efficient bounds on singular values of convolution layers. In ICLR, 2021. URL https://openreview.net/forum?id=JCRblSgs34Z.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In ICLR, 2014. URL http://arxiv.org/abs/1312.6199.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In ICLR, 2018. URL https://openreview.net/forum?id=HkL7n1-0b.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In ICLR, 2018.

Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In NeurIPS, 2018.

Villani, C. Optimal transport, old and new, 2008.

Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for ReLU networks. In ICML, 2018.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018. URL http://proceedings.mlr.press/v80/wong18a.html.
Wong, E., Schmidt, F. R., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In NeurIPS, 2018.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In ICLR, 2020. URL https://openreview.net/forum?id=BJx040EFvH.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018. URL http://proceedings.mlr.press/v80/xiao18a.html.

Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. In NeurIPS, 2018. arXiv:1811.00866.

Zhang, H., Zhang, P., and Hsieh, C.-J. RecurJac: An efficient recursive algorithm for bounding Jacobian matrix of neural networks and its applications. In AAAI, 2019. arXiv:1810.11783.