# Skew Orthogonal Convolutions

Sahil Singla¹, Soheil Feizi¹

¹Department of Computer Science, University of Maryland, College Park. Correspondence to: Sahil Singla.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Training convolutional neural networks with a Lipschitz constraint under the ℓ2 norm is useful for provable adversarial robustness, interpretable gradients, stable training, etc. While 1-Lipschitz networks can be designed by imposing a 1-Lipschitz constraint on each layer, training such networks requires each layer to be gradient norm preserving (GNP) to prevent gradients from vanishing. However, existing GNP convolutions suffer from slow training, lead to significant reductions in accuracy, and provide no guarantees on their approximations. In this work, we propose a GNP convolution layer called Skew Orthogonal Convolution (SOC) that uses the following mathematical property: when a matrix is Skew-Symmetric, its exponential function is an orthogonal matrix. To use this property, we first construct a convolution filter whose Jacobian is Skew-Symmetric. Then, we use the Taylor series expansion of the Jacobian exponential to construct the SOC layer, which is orthogonal. To efficiently implement SOC, we keep a finite number of terms from the Taylor series and provide a provable guarantee on the approximation error. Our experiments on CIFAR-10 and CIFAR-100 show that SOC allows us to train provably Lipschitz, large convolutional neural networks significantly faster than prior works while achieving significant improvements in both standard and certified robust accuracies.

1. Introduction

The Lipschitz constant of a neural network (throughout, measured under the ℓ2 norm unless specified otherwise) puts an upper bound on how much the output can change in proportion to a change in the input. Previous work has shown that a small Lipschitz constant leads to improved generalization bounds (Bartlett et al., 2017; Long & Sedghi, 2020), adversarial robustness (Cissé et al., 2017; Szegedy et al., 2014) and interpretable gradients (Tsipras et al., 2018). The Lipschitz constant also upper bounds the change in gradient norm during backpropagation and can thus prevent gradient explosion during training, allowing us to train very deep networks (Xiao et al., 2018). Moreover, the Wasserstein distance between two probability distributions can be expressed as a maximization over 1-Lipschitz functions (Villani, 2008; Peyré & Cuturi, 2018), and has been used for training Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) and Wasserstein VAEs (Tolstikhin et al., 2018).

Using the Lipschitz composition property, i.e. Lip(f ∘ g) ≤ Lip(f) · Lip(g), the Lipschitz constant of a neural network can be bounded by the product of the Lipschitz constants of all of its layers. 1-Lipschitz neural networks can thus be designed by imposing a 1-Lipschitz constraint on each layer. However, Anil et al. (2018) identified a key difficulty with this approach: because a layer with a Lipschitz bound of 1 can only reduce the norm of the gradient during backpropagation, each step of backprop gradually attenuates the gradient norm, resulting in a much smaller gradient for the layers closer to the input, thereby making training slow and difficult. To address this problem, they introduced Gradient Norm Preserving (GNP) architectures, where each layer preserves the gradient norm by ensuring that its Jacobian is an Orthogonal matrix (for all inputs to the layer).
For convolutional layers, this involves constraining the Jacobian of each convolution layer to be an Orthogonal matrix (Li et al., 2019b; Xiao et al., 2018) and using a GNP activation function called GroupSort (Anil et al., 2018). Li et al. (2019b) introduced an Orthogonal convolution layer called Block Convolutional Orthogonal Parametrization (BCOP). BCOP uses a clever application of 1D Orthogonal convolution filters of sizes 2×1 and 1×2 to construct a 2D Orthogonal convolution filter. It overcomes common issues of Lipschitz-constrained networks, such as gradient norm attenuation and loose Lipschitz bounds, and enables training of large, provably 1-Lipschitz Convolutional Neural Networks (CNNs), achieving results competitive with existing methods for provable adversarial robustness. However, BCOP suffers from slow training and a significant reduction in accuracy, and it provides no guarantees on its approximation of an Orthogonal Jacobian matrix (details in Section 2).

Figure 1. Each color denotes a scalar; the minus sign (−) on top of a color denotes the negative of the scalar with that color. (a) Given any convolution filter M, we can construct a Skew-Symmetric filter. (b) Next, we apply spectral normalization to bound the norm of the Jacobian. (c) On input X, applying the convolution exponential (L ∗e X) results in an Orthogonal convolution.

To address these shortcomings, we introduce an Orthogonal convolution layer called Skew Orthogonal Convolution (SOC). For provably Lipschitz CNNs, SOC results in significantly improved standard and certified robust accuracies compared to BCOP while requiring significantly less training time (Table 2). We also derive provable guarantees on our approximation of an Orthogonal Jacobian. Our work is based on the following key mathematical property: if A is a Skew-Symmetric matrix (i.e. A = −A^T), then exp(A) is an Orthogonal matrix (i.e. exp(A)^T exp(A) = exp(A) exp(A)^T = I), where

exp(A) = I + A + A^2/2! + A^3/3! + ⋯    (1)

To design an Orthogonal convolution layer using this property, we need to: (a) construct Skew-Symmetric filters, i.e. convolution filters whose Jacobian is Skew-Symmetric; and (b) efficiently approximate exp(J) with a guaranteed small error, where J is the Jacobian of a Skew-Symmetric filter.

To construct Skew-Symmetric convolution filters, we prove (in Theorem 2) that every Skew-Symmetric filter L can be written as L = M − conv_transpose(M) for some filter M, where conv_transpose denotes the convolution transpose operator defined in equation (3) (note that this operator is different from the matrix transpose). This result is analogous to the property that every real Skew-Symmetric matrix A can be written as A = B − B^T for some real matrix B.

We can efficiently approximate exp(J) using a finite number of terms in equation (1) and the convolution exponential (Hoogeboom et al., 2020). But it is unclear whether the series can be approximated with high precision and how many terms need to be computed to achieve the desired approximation error. To resolve these issues, we derive a bound on the ℓ2 norm of the difference between exp(J) and its approximation using the first k terms of equation (1), denoted S_k(J), when J is Skew-Symmetric (Theorem 3):

||exp(J) − S_k(J)||_2 ≤ ||J||_2^k / k!    (2)
This guarantee suggests that when ||J||_2 is small, exp(J) can be approximated with high precision using a small number of terms. Moreover, the factorial term in the denominator causes the error to decay very quickly as k increases. In our experiments, we observe that using k = 12 and ||J||_2 ≤ 1.8 leads to an error bound of 2.415 × 10^{-6}. We can use spectral normalization (Miyato et al., 2018) to ensure that ||J||_2 is provably bounded, using the theoretical result of Singla & Feizi (2021). The design of SOC is summarized in Figure 1. Code is available at https://github.com/singlasahil14/SOC.

To summarize, we make the following contributions:

- We introduce an Orthogonal convolution layer (called Skew Orthogonal Convolution, or SOC) by first designing a Skew-Symmetric convolution filter (Theorem 2) and then computing the exponential function of its Jacobian using a finite number of terms of its Taylor series.
- For a Skew-Symmetric filter with Jacobian J, we derive a bound on the approximation error between exp(J) and its k-term approximation (Theorem 3).
- SOC achieves significantly higher standard and provable robust accuracy on 1-Lipschitz convolutional neural networks than BCOP while requiring less training time (Table 2). For example, SOC achieves 2.82% higher standard and 3.91% higher provable robust accuracy with 54.6% less training time on CIFAR-10 using the LipConvnet-20 architecture (details in Section 6.5). For deeper networks (≥ 30 layers), SOC outperforms BCOP by more than 10% in both standard and robust accuracy while again achieving around a 50% reduction in training time.
- In Theorem 4, we prove that for every Skew-Symmetric filter with Jacobian J, there exists a Skew-Symmetric matrix B satisfying exp(B) = exp(J) and ||B||_2 ≤ π. Since ||J||_2 can be large, this can allow us to reduce the approximation error without sacrificing expressive power.

2. Related work

Provably Lipschitz convolutional neural networks: Anil et al. (2018) proposed a class of fully connected neural networks (FCNs) that are Gradient Norm Preserving (GNP) and provably 1-Lipschitz, using the GroupSort activation and Orthogonal weight matrices. Since then, there have been numerous attempts to tightly enforce 1-Lipschitz constraints on convolutional neural networks (CNNs) (Cissé et al., 2017; Tsuzuku et al., 2018; Qian & Wegman, 2019; Gouk et al., 2020; Sedghi et al., 2019). However, these approaches either enforce loose Lipschitz bounds or are computationally intractable for large networks. Li et al. (2019b) introduced an Orthogonal convolution layer called Block Convolutional Orthogonal Parametrization (BCOP) that avoids the aforementioned issues and allows the training of large, provably 1-Lipschitz CNNs while achieving provable robust accuracy comparable with existing methods.
However, BCOP suffers from several issues: (a) it can only represent a subset of all Orthogonal convolutions; (b) it requires a BCOP convolution filter with 2n channels to represent all the connected components of a BCOP convolution filter with n channels, thus requiring 4 times more parameters; (c) to construct a convolution filter of size k × k with n input/output channels, it requires 2k − 1 matrices of size 2n × 2n that must remain Orthogonal throughout training, leading to the well-known difficulties of optimization over the Stiefel manifold (Edelman et al., 1998); (d) it constructs convolution filters from symmetric projectors, and errors in these projectors can lead to errors in the final convolution filter, yet BCOP provides no guarantees on this error.

Provable defenses against adversarial examples: A classifier is said to be provably robust if one can guarantee that the classifier's prediction remains constant within some region around the input. Most existing methods for provable robustness bound either the Lipschitz constant of the neural network or that of its individual layers (Weng et al., 2018; Zhang et al., 2019; 2018; Wong et al., 2018; Wong & Kolter, 2018; Raghunathan et al., 2018; Croce et al., 2019; Singh et al., 2018; Singla & Feizi, 2020). However, these methods do not scale to large and practical networks on ImageNet. To scale to such large networks, randomized smoothing (Liu et al., 2018; Cao & Gong, 2017; Lécuyer et al., 2018; Li et al., 2019a; Cohen et al., 2019; Salman et al., 2019; Kumar et al., 2020; Levine et al., 2019) has been proposed as a probabilistically certified defense. In contrast, the defense we propose in this work is deterministic and hence not directly comparable to randomized smoothing.

3. Notation

For a vector v, v_j denotes its jth element. For a matrix A, A_{j,:} and A_{:,k} denote the jth row and kth column, respectively; both are treated as column vectors (thus A_{j,:} is the transpose of the jth row of A). A_{j,k} denotes the element in the jth row and kth column of A, and A_{:j,:k} denotes the matrix containing the first j rows and k columns of A. The same rules extend directly to higher-order tensors. Bold zero (0) denotes the matrix (or tensor) with all elements equal to zero, and I denotes the identity matrix. ⊗ denotes the Kronecker product. We use C to denote the field of complex numbers and R the field of real numbers. For a scalar a ∈ C, ā denotes its complex conjugate; for a matrix (or tensor) A, Ā denotes the element-wise complex conjugate. For A ∈ C^{m×n}, A^H denotes the Hermitian transpose (i.e. A^H = Ā^T). For a ∈ C, Re(a), Im(a) and |a| denote the real part, imaginary part and modulus of a, respectively. We use ι to denote the imaginary unit (i.e. ι² = −1).

For a matrix A ∈ C^{q×r} and a tensor B ∈ C^{p×q×r}, vec(A) denotes the vector constructed by stacking the rows of A, and vec(B) the vector constructed by stacking the vectors vec(B_{j,:,:}), j ∈ [p − 1], so that:

vec(A)^T = [A_{0,:}^T, A_{1,:}^T, …, A_{q−1,:}^T],   vec(B)^T = [vec(B_{0,:,:})^T, vec(B_{1,:,:})^T, …, vec(B_{p−1,:,:})^T]

For a 2D convolution filter L ∈ C^{p×q×r×s}, we define the tensor conv_transpose(L) ∈ C^{q×p×r×s} as follows:

[conv_transpose(L)]_{i,j,k,l} = [L]_{j,i,r−1−k,s−1−l}    (3)

Note that this is very different from the usual matrix transpose; see the example in Section 4. Given an input X ∈ C^{q×n×n}, we use L ∗ X ∈ C^{p×n×n} to denote the convolution of the filter L with X, and L ∗i X ≜ L ∗ (L ∗(i−1) X) to denote i repeated convolutions. Unless specified, we assume zero padding and stride 1 in each direction.
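To make the conv_transpose operator in equation (3) concrete, the following is a minimal PyTorch sketch (not taken from the paper's released code). It assumes the standard (out-channels, in-channels, height, width) filter layout and real-valued filters, for which conjugation is unnecessary: conv_transpose simply swaps the two channel axes and rotates each spatial slice by 180 degrees.

```python
import torch

def conv_transpose_filter(L: torch.Tensor) -> torch.Tensor:
    # Equation (3) for real filters: swap the output/input channel axes
    # and flip both spatial axes (a 180-degree rotation of each 2D slice).
    return L.transpose(0, 1).flip((2, 3))

# The 1x1x3x3 example from Section 4: entries a..i become i..a.
L = torch.arange(1.0, 10.0).reshape(1, 1, 3, 3)
print(conv_transpose_filter(L).reshape(3, 3))
# tensor([[9., 8., 7.],
#         [6., 5., 4.],
#         [3., 2., 1.]])
```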
4. Filters with Skew-Symmetric Jacobians

We know that for any matrix A that is Skew-Symmetric (A = −A^T), exp(A) is an Orthogonal matrix:

exp(A) (exp(A))^T = (exp(A))^T exp(A) = I

This suggests that if we can parametrize the complete set of convolution filters with Skew-Symmetric Jacobians, we can use the convolution exponential (Hoogeboom et al., 2020) to approximate an Orthogonal matrix. To construct this set, we first prove that if convolution with a filter L ∈ R^{m×m×p×q} (p and q odd) has Jacobian J, then convolution with conv_transpose(L) has Jacobian J^T. We note that convolution with the conv_transpose(L) filter is exactly the operation often called transposed convolution (the adjoint of the convolution operator), which appears in backpropagation through convolution layers (Goodfellow et al., 2016).

To motivate our proof, consider a filter L ∈ R^{1×1×3×3}. Applying conv_transpose (equation (3)), we get:

L = [[a, b, c], [d, e, f], [g, h, i]],   conv_transpose(L) = [[i, h, g], [f, e, d], [c, b, a]]

That is, for a 2D convolution filter with 1 channel, conv_transpose flips it along the horizontal and vertical directions. To understand why this flipping transposes the Jacobian, we provide another example for a 1D convolution filter in Figure 2.

Figure 2. Each color denotes a scalar. Flipping a convolution filter (of odd size) transposes its Jacobian; thus, any odd-sized filter that equals the negative of its flip leads to a Skew-Symmetric Jacobian.

Our proof uses the following expression for the Jacobian of convolution with a filter L ∈ R^{1×1×(2p+1)×(2q+1)} and input X ∈ R^{1×n×n}:

∇_X(L ∗ X) = Σ_{i=−p}^{p} Σ_{j=−q}^{q} L_{0,0,p+i,q+j} (P^(i) ⊗ P^(j)),

where P^(k) ∈ R^{n×n} with [P^(k)]_{i,j} = 1 if i − j = k and 0 otherwise. The above equation leads to the following theorem:

Theorem 1. Consider a 2D convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} and an input X ∈ R^{m×n×n}. Let J = ∇_X(L ∗ X); then J^T = ∇_X(conv_transpose(L) ∗ X).

Next, we prove that any 2D convolution filter L whose Jacobian is a Skew-Symmetric matrix can be expressed as L = M − conv_transpose(M), where M has the same dimensions as L. This allows us to parametrize the set of all convolution filters with Skew-Symmetric Jacobian matrices.

Theorem 2. Consider a 2D convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} and an input X ∈ R^{m×n×n}. The Jacobian ∇_X(L ∗ X) is Skew-Symmetric if and only if L = M − conv_transpose(M) for some filter M ∈ R^{m×m×(2p+1)×(2q+1)}.

Thus, convolution with the filter L results in a skew-symmetric operator. This operator can also be interpreted as an element of the Lie algebra of the special orthogonal group, i.e. the group of orthogonal matrices with determinant 1. We prove Theorems 1 and 2 for the more general case of complex convolution filters (L_{i,j,k,l} ∈ C) in Appendix Sections B.1 and B.2. Theorem 2 allows us to convert any arbitrary convolution filter into a filter with a Skew-Symmetric Jacobian. This leads to the following definition:

Definition 1. (Skew-Symmetric Convolution Filter) A convolution filter L ∈ R^{m×m×(2p+1)×(2q+1)} is said to be Skew-Symmetric if, given an input X ∈ R^{m×n×n}, the Jacobian matrix ∇_X(L ∗ X) is Skew-Symmetric.

We note that although Theorem 2 requires the height and width of M to be odd integers, we can also construct a Skew-Symmetric filter when M has even height and/or width by zero padding M to make the desired dimensions odd.
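As a quick numerical illustration of Theorem 2 and the exp-of-Skew-Symmetric property, the sketch below (our own, under the paper's stride-1, zero-padding assumptions, reusing the conv_transpose_filter helper from above) materializes the Jacobian of a small convolution with L = M − conv_transpose(M) and checks that it is Skew-Symmetric and that its matrix exponential is Orthogonal:

```python
import torch
import torch.nn.functional as F

def conv_transpose_filter(L):
    return L.transpose(0, 1).flip((2, 3))        # equation (3), real case

torch.manual_seed(0)
m, n = 2, 5                                      # channels and spatial size (kept tiny)
M = torch.randn(m, m, 3, 3, dtype=torch.float64)
L = M - conv_transpose_filter(M)                 # Theorem 2: filter with Skew-Symmetric Jacobian

# Materialize the Jacobian J of X -> L * X (zero padding, stride 1) by pushing
# every standard-basis input through the convolution; column e of J is vec(L * e).
basis = torch.eye(m * n * n, dtype=torch.float64).reshape(m * n * n, m, n, n)
J = F.conv2d(basis, L, padding=1).reshape(m * n * n, m * n * n).T

print(torch.allclose(J, -J.T, atol=1e-6))        # Jacobian is Skew-Symmetric
Q = torch.linalg.matrix_exp(J)                   # so exp(J) should be Orthogonal
print(torch.allclose(Q.T @ Q, torch.eye(m * n * n, dtype=torch.float64), atol=1e-6))
```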
5. Skew Orthogonal Convolution layers

In this section, we derive a method to approximate the exponential of the Jacobian of a Skew-Symmetric convolution filter (i.e. exp(J)), and we derive a bound on the approximation error.

Given an input X ∈ R^{m×n×n} and a Skew-Symmetric convolution filter L ∈ R^{m×m×k×k} (k odd), let J be the Jacobian of the convolution with L, so that:

J vec(X) = vec(L ∗ X)    (4)

By construction, J is a Skew-Symmetric matrix, and thus exp(J) is an Orthogonal matrix. We are interested in efficiently computing exp(J) vec(X), where:

exp(J) vec(X) = vec(X) + J vec(X)/1! + J^2 vec(X)/2! + J^3 vec(X)/3! + ⋯

Using equation (4), the above expression can be written as:

exp(J) vec(X) = vec(X) + vec(L ∗ X)/1! + vec(L ∗2 X)/2! + vec(L ∗3 X)/3! + ⋯

where, as in Section 3, L ∗i X ≜ L ∗ (L ∗(i−1) X). Using the above equation, we define L ∗e X as follows:

L ∗e X = X + (L ∗ X)/1! + (L ∗2 X)/2! + (L ∗3 X)/3! + ⋯    (5)

The above operation is called the convolution exponential and was introduced by Hoogeboom et al. (2020). By construction, L ∗e X satisfies vec(L ∗e X) = exp(J) vec(X). Thus, the Jacobian of L ∗e X with respect to X equals exp(J), which is Orthogonal (since J is Skew-Symmetric). However, L ∗e X can only be approximated using a finite number of terms of the series in equation (5), so we need to bound the error of such an approximation.

5.1. Bounding the Approximation Error

To bound the approximation error with a finite number of terms, first note that since the Jacobian matrix J is Skew-Symmetric, all of its eigenvalues are purely imaginary. For a purely imaginary scalar λ ∈ C (i.e. Re(λ) = 0), we first bound the error between exp(λ) and the approximation p_k(λ) computed using k terms of the exponential series:

|exp(λ) − p_k(λ)| ≤ |λ|^k / k!,   ∀ λ : Re(λ) = 0    (6)

The above result then allows us to prove the following result for a Skew-Symmetric matrix in a straightforward manner:

Theorem 3. For a Skew-Symmetric matrix J, we have the inequality:

||exp(J) − S_k(J)||_2 ≤ ||J||_2^k / k!,   where S_k(J) = Σ_{j=0}^{k−1} J^j / j! is the sum of the first k terms of the series in equation (1).

A more general proof of Theorem 3 (for J ∈ C^{n×n} and skew-Hermitian, i.e. J = −J^H) is given in Appendix Section B.3. The above theorem allows us to bound the approximation error between the true matrix exponential (which is Orthogonal) and its k-term approximation as a function of the number of terms k and the Jacobian norm ||J||_2. The factorial term in the denominator causes the error to decay very quickly as the number of terms increases. We call the resulting algorithm Skew Orthogonal Convolution (SOC); it is summarized in Algorithm 1 below. We emphasize that the above theorem is valid only for Skew-Symmetric matrices and is hence not directly applicable to the general convolution exponential (Hoogeboom et al., 2020).

```
Algorithm 1: Skew Orthogonal Convolution (SOC)
Input:  feature map X ∈ R^{ci×n×n}, convolution filter M ∈ R^{m×m×h×w} (m = max(ci, co)), number of terms K
Output: Y, the result of applying the convolution exponential

if ci < co then
    X ← pad(X, (co − ci, 0, 0))          # zero-pad the input channels
end
L ← M − conv_transpose(M)                # Skew-Symmetric filter (Theorem 2)
L ← spectral_normalization(L)            # bound the Jacobian norm (Section 6.1)
Y ← X
factorial ← 1
for j = 2 to K do
    factorial ← factorial × (j − 1)
    X ← L ∗ X                            # next power of the filter applied to the input
    Y ← Y + X / factorial
end
if ci > co then
    Y ← Y[0 : co, :, :]                  # keep only the first co output channels
end
return Y
```
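Before turning to the complete-set discussion, here is a small numerical sanity check of the Theorem 3 bound (a sketch of our own, using dense random Skew-Symmetric matrices rather than convolution Jacobians; the values 1.8 and k = 12 anticipate the settings reported in Section 7):

```python
import torch
from math import factorial

torch.manual_seed(0)

# Random Skew-Symmetric matrix rescaled so that ||J||_2 = 1.8, the largest
# Jacobian norm observed in the experiments of Section 7.
A = torch.randn(64, 64, dtype=torch.float64)
J = A - A.T
J = 1.8 * J / torch.linalg.matrix_norm(J, ord=2)

def S_k(J, k):
    # First k terms of the exponential series: sum_{j < k} J^j / j!
    out, term = torch.zeros_like(J), torch.eye(J.shape[0], dtype=J.dtype)
    for j in range(k):
        out = out + term / factorial(j)
        term = term @ J
    return out

k = 12
err = torch.linalg.matrix_norm(torch.linalg.matrix_exp(J) - S_k(J, k), ord=2)
bound = 1.8 ** k / factorial(k)              # ~2.415e-6, the value quoted in Section 7
print(float(err), bound, float(err) <= bound)
```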
5.2. Complete Set of Skew Orthogonal Convolutions

Observe that for Re(λ) = 0 (i.e. λ = ιθ, θ ∈ R), we have:

exp(λ) = exp(λ + 2ιπk) = cos(θ) + ι sin(θ),   ∀ k ∈ Z

This suggests that we can shift λ by integer multiples of 2πι without changing exp(λ), while reducing the approximation error (via Theorem 3). For example, exp(ιπ/3) requires fewer terms to achieve a desired approximation error (using equation (6)) than, say, exp(ι(π/3 + 2π)), because the latter has a larger modulus (2π + π/3 = 7π/3) than the former (π/3). This insight leads to the following theorem:

Theorem 4. Given a real Skew-Symmetric matrix A, we can construct another real Skew-Symmetric matrix B such that (i) exp(A) = exp(B) and (ii) ||B||_2 ≤ π.

A proof is given in Appendix Section B.4. This proves that every real Skew-Symmetric Jacobian matrix J (associated with some Skew-Symmetric convolution filter L) can be replaced with a Skew-Symmetric matrix B such that exp(B) = exp(J) and ||B||_2 ≤ π (note that ||J||_2 can be arbitrarily large). This strictly reduces the approximation error (Theorem 3) without sacrificing expressive power. We make the following observations about Theorem 4: (a) even if J equals the Jacobian of some Skew-Symmetric convolution filter, B may not satisfy this property, i.e. it may not exhibit the block doubly Toeplitz structure of the Jacobian of a 2D convolution filter (Sedghi et al., 2019) and thus may not equal the Jacobian of any Skew-Symmetric convolution filter; (b) even if B does satisfy this property, the filter size of the Skew-Symmetric filter whose Jacobian equals B can be very different from that of the filter with Jacobian J. In this sense, Theorem 4 cannot directly be used to parametrize the complete set of SOC layers, because it is not clear how to efficiently parametrize the set of all matrices B that satisfy (a) ||B||_2 ≤ π and (b) exp(B) = exp(J), where J is the Jacobian of some Skew-Symmetric convolution filter. We leave the question of efficiently parametrizing the complete set of Skew Orthogonal Convolution layers open for future research.

5.3. Extensions to 3D and Complex Convolutions

When a matrix A ∈ C^{n×n} is skew-Hermitian (A = −A^H), exp(A) is a unitary matrix:

exp(A) (exp(A))^H = (exp(A))^H exp(A) = I

To use this property to construct a unitary convolution layer with complex weights, we first define:

Definition 2. (Skew-Hermitian Convolution Filter) A convolution filter L ∈ C^{m×m×(2p+1)×(2q+1)} is said to be Skew-Hermitian if, given an input X ∈ C^{m×n×n}, the Jacobian matrix ∇_X(L ∗ X) is Skew-Hermitian.

Using the extensions of Theorems 1 and 2 to complex convolution filters (proofs in Appendix Sections B.1 and B.2), we can construct a 2D Skew-Hermitian convolution filter. Next, using an extension of Theorem 3 to complex Skew-Hermitian matrices (proof in Appendix Section B.3), we obtain exactly the same bound on the approximation error. The resulting algorithm is called Skew Unitary Convolution (SUC). We also prove an extension of Theorem 4 for complex Skew-Hermitian matrices in Appendix Section B.5, and we discuss the construction of 3D Skew-Hermitian convolution filters in Appendix Sections B.6 and B.7.

6. Implementation details of SOC

In this section, we explain the key implementation details of SOC (summarized in Algorithm 1).

6.1. Bounding the norm of the Jacobian

To bound the norm of the Jacobian of a Skew-Symmetric convolution filter, we use the following result:

Theorem (Singla & Feizi, 2021). Consider a convolution filter L ∈ R^{co×ci×h×w} applied to an input X, and let J be the Jacobian of L ∗ X with respect to X. Then:

||J||_2 ≤ √(hw) · min(||R||_2, ||S||_2, ||T||_2, ||U||_2),

where R ∈ R^{co·h×ci·w}, S ∈ R^{co·w×ci·h}, T ∈ R^{co×ci·h·w} and U ∈ R^{co·h·w×ci} are obtained by reshaping the filter L.

Using the above theorem, we divide the Skew-Symmetric convolution filter by min(||R||_2, ||S||_2, ||T||_2, ||U||_2), so that the spectral norm of the Jacobian of the resulting filter is bounded by √(hw). We then multiply the normalized filter by the hyperparameter 0.7, as we find that this allows faster convergence with no loss in performance. Unless specified, we use h = w = 3 in all of our experiments, resulting in a norm bound of 0.7 × 3 = 2.1.

Note that while the above theorem also allows us to bound the Lipschitz constant of a convolution layer, for deep networks (say 40 layers) the resulting Lipschitz bound (assuming a 1-Lipschitz activation function) would grow to 2.1^40 ≈ 7.74 × 10^12. Thus, the above bound alone is unlikely to enforce a tight global Lipschitz constraint.
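The following is a sketch of this normalization step. The exact axis groupings used to form R, S, T and U are our reading of Singla & Feizi (2021) and should be treated as an assumption; the 0.7 scaling follows Section 6.1 above.

```python
import torch

def four_norms_min(L: torch.Tensor) -> torch.Tensor:
    # Four reshapes of the filter (assumed axis groupings, for illustration);
    # each spectral norm, scaled by sqrt(h*w), upper-bounds the Jacobian norm.
    co, ci, h, w = L.shape
    R = L.permute(0, 2, 1, 3).reshape(co * h, ci * w)
    S = L.permute(0, 3, 1, 2).reshape(co * w, ci * h)
    T = L.reshape(co, ci * h * w)
    U = L.permute(0, 2, 3, 1).reshape(co * h * w, ci)
    norms = [torch.linalg.matrix_norm(A, ord=2) for A in (R, S, T, U)]
    return torch.stack(norms).min()

def normalize_filter(L: torch.Tensor, scale: float = 0.7) -> torch.Tensor:
    # Dividing by the smallest of the four norms bounds the Jacobian norm by
    # sqrt(h*w); the extra 0.7 gives the 0.7 * sqrt(9) = 2.1 bound for 3x3 filters.
    return scale * L / four_norms_min(L)

# hypothetical 3x3 filter with 16 input and 16 output channels
print(normalize_filter(torch.randn(16, 16, 3, 3)).shape)
```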
6.2. Different input and output channels

In general, we may want to construct an orthogonal convolution that maps ci input channels to co output channels, where ci ≠ co. Consider the two cases:

Case 1 (co < ci): We construct a Skew-Symmetric convolution filter with ci channels. After applying the convolution exponential, we keep only the first co channels of the output.

Case 2 (co > ci): We construct a Skew-Symmetric convolution filter with co channels. We zero-pad the input with co − ci channels and then compute the convolution exponential.

6.3. Strided convolution

Given an input X ∈ R^{ci×n×n} (n even), we may want to construct an orthogonal convolution with output Y ∈ R^{co×(n/2)×(n/2)}, i.e. an orthogonal convolution with stride 2. To perform a strided convolution, we first apply the invertible downsampling operation ψ (Jacobsen et al., 2018), shown in Figure 3, to construct an input with shape 4ci × (n/2) × (n/2). Next, we apply the convolution exponential to this input using a Skew-Symmetric convolution filter with 4ci input and co output channels.

Figure 3. Invertible downsampling operation ψ.

6.4. Number of terms for the approximation

During training, we use 6 terms to approximate the exponential function, for speed. During evaluation, we use 12 terms to ensure that the exponential of the Jacobian is sufficiently close to an orthogonal matrix.

6.5. Network architecture

We design a provably 1-Lipschitz architecture called LipConvnet-n, where n is the number of convolution layers (a multiple of 5 in our experiments). It consists of five blocks; each block has (n/5) − 1 Orthogonal convolutions with stride 1 (each followed by the MaxMin activation) and one Orthogonal convolution with stride 2 (again followed by MaxMin). The architecture is summarized in Table 1, where conv[k×k, m, s] denotes a convolution layer with a filter of size k×k, m output channels and stride s. The network ends with a fully connected layer that outputs the class logits. The MaxMin activation function (Anil et al., 2018) is described in Appendix Section C.

Table 1. LipConvnet-n architecture. Each convolution layer is followed by the MaxMin activation.

| Output size | Convolution layer  | Repeats   |
|-------------|--------------------|-----------|
| 16×16       | conv[3×3, 32, 1]   | (n/5) − 1 |
|             | conv[3×3, 64, 2]   | 1         |
| 8×8         | conv[3×3, 64, 1]   | (n/5) − 1 |
|             | conv[3×3, 128, 2]  | 1         |
| 4×4         | conv[3×3, 128, 1]  | (n/5) − 1 |
|             | conv[3×3, 256, 2]  | 1         |
| 2×2         | conv[3×3, 256, 1]  | (n/5) − 1 |
|             | conv[3×3, 512, 2]  | 1         |
| 1×1         | conv[3×3, 512, 1]  | (n/5) − 1 |
|             | conv[1×1, 1024, 2] | 1         |
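Putting Sections 6.1–6.4 together, the following PyTorch-style sketch mirrors Algorithm 1 for the stride-1 case. It is a simplified illustration, not the official implementation (which is available at the repository linked in Section 1); it reuses the conv_transpose_filter and four_norms_min helpers sketched earlier, and the class name and initialization are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SOCSketch(nn.Module):
    """Simplified, stride-1 sketch of Algorithm 1 (strided and complex
    variants omitted). Assumes conv_transpose_filter and four_norms_min
    from the earlier sketches are in scope."""

    def __init__(self, c_in: int, c_out: int, kernel_size: int = 3, terms: int = 6):
        super().__init__()
        self.c_in, self.c_out, self.terms = c_in, c_out, terms
        m = max(c_in, c_out)
        self.M = nn.Parameter(0.1 * torch.randn(m, m, kernel_size, kernel_size))
        self.pad = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.c_in < self.c_out:                      # zero-pad the input channels
            x = F.pad(x, (0, 0, 0, 0, 0, self.c_out - self.c_in))
        L = self.M - conv_transpose_filter(self.M)      # Skew-Symmetric filter (Theorem 2)
        L = 0.7 * L / four_norms_min(L)                 # spectral normalization (Section 6.1)
        y, z, fact = x, x, 1.0
        for j in range(1, self.terms):                  # y = sum_{j < K} (L *j x) / j!
            z = F.conv2d(z, L, padding=self.pad)
            fact = fact * j
            y = y + z / fact
        if self.c_in > self.c_out:                      # keep the first c_out channels
            y = y[:, : self.c_out]
        return y

# hypothetical usage: a 32 -> 64 channel SOC layer on a CIFAR-sized feature map
layer = SOCSketch(32, 64, terms=12)
print(layer(torch.randn(8, 32, 16, 16)).shape)          # torch.Size([8, 64, 16, 16])
```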
7. Experiments

Our goal is to evaluate the expressiveness of our method (SOC) compared to BCOP for constructing Orthogonal convolution layers. To study this, we perform experiments in three settings: (a) provably robust image classification, (b) standard training, and (c) adversarial training. All experiments were performed using a single NVIDIA GeForce RTX 2080 Ti GPU. All networks were trained for 200 epochs with an initial learning rate of 0.1, dropped by a factor of 0.1 after epochs 50 and 150. We use no weight decay when training with BCOP convolutions, since it significantly reduces their performance; for training with standard convolutions and SOC, we use a weight decay of 10^{-4}. We use the same setup for training with BCOP as given in its GitHub repository. While that implementation uses 20 Björck iterations for orthogonalizing matrices, we also compare against BCOP with 30 Björck iterations in Appendix Table 5; unless specified, we use BCOP with 20 Björck iterations.

To evaluate the approximation error of SOC at convergence (using Theorem 3), we compute the norm of the Jacobian of the Skew-Symmetric convolution filter using real spectral normalization (Ryu et al., 2019). We observe that the maximum norm (across experiments and layers of the network) is below 1.8, i.e. slightly below the theoretical upper bound of 2.1 discussed in Section 6.1, resulting in a maximum approximation error of 1.8^{12}/12! ≈ 2.415 × 10^{-6}.

7.1. Provable Defenses against Adversarial Attacks

To certify provable robustness of a 1-Lipschitz network f on some input x, we first define the margin of prediction:

M_f(x) = max(0, y_t − max_{i≠t} y_i),

where y = [y_1, y_2, …] are the logits predicted by f on x and y_t is the logit of the correct class t. Using Theorem 7 in Li et al. (2019b), we can derive the robustness certificate M_f(x)/√2. The provable robust accuracy, evaluated using an ℓ2 perturbation radius of 36/255 (the same as in Li et al. (2019b)), equals the fraction of data points x in the test dataset satisfying M_f(x)/√2 ≥ 36/255. Additional results using an ℓ2 perturbation radius of 72/255 are given in Appendix Table 6.
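A minimal sketch of this certification rule follows (our own illustration; the logits and labels below are hypothetical):

```python
import torch

def certified_robust_accuracy(logits, labels, radius=36 / 255):
    # A point is certified at l2 radius r if M_f(x) / sqrt(2) >= r, where
    # M_f(x) = max(0, y_t - max_{i != t} y_i) is the prediction margin.
    y_t = logits.gather(1, labels.unsqueeze(1)).squeeze(1)
    runner_up = logits.scatter(1, labels.unsqueeze(1), float("-inf")).amax(dim=1)
    margin = (y_t - runner_up).clamp(min=0.0)
    return (margin / 2 ** 0.5 >= radius).float().mean()

# hypothetical batch of logits and labels
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.3, 0.0]])
labels = torch.tensor([0, 1])
print(certified_robust_accuracy(logits, labels))
```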
In Table 2, we show the results of our experiments using LipConvnet architectures with varying numbers of layers on the CIFAR-10 and CIFAR-100 datasets.

Table 2. Results for provable robustness against adversarial examples (ℓ2 perturbation radius of 36/255). Time per epoch is the training time per epoch, in seconds.

| Model         | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Robust acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Robust acc. | CIFAR-100 Time/epoch (s) |
|---------------|------------|--------|--------|---------|--------|--------|---------|
| LipConvnet-5  | BCOP       | 74.35% | 58.01% | 96.153  | 42.61% | 28.67% | 94.463  |
|               | SOC        | 75.78% | 59.16% | 31.096  | 42.73% | 27.82% | 30.844  |
| LipConvnet-10 | BCOP       | 74.47% | 58.48% | 122.115 | 42.08% | 27.75% | 119.038 |
|               | SOC        | 76.48% | 60.82% | 48.242  | 43.71% | 29.39% | 48.363  |
| LipConvnet-15 | BCOP       | 73.86% | 57.39% | 145.944 | 39.98% | 26.17% | 144.173 |
|               | SOC        | 76.68% | 61.30% | 63.742  | 42.93% | 28.79% | 63.540  |
| LipConvnet-20 | BCOP       | 69.84% | 52.10% | 170.009 | 36.13% | 22.50% | 172.266 |
|               | SOC        | 76.43% | 61.92% | 77.226  | 43.07% | 29.18% | 76.460  |
| LipConvnet-25 | BCOP       | 68.26% | 49.92% | 207.359 | 28.41% | 16.34% | 205.313 |
|               | SOC        | 75.19% | 60.18% | 98.534  | 43.31% | 28.59% | 95.950  |
| LipConvnet-30 | BCOP       | 64.11% | 43.39% | 227.916 | 26.87% | 14.03% | 229.840 |
|               | SOC        | 74.47% | 59.04% | 110.531 | 42.90% | 28.74% | 107.163 |
| LipConvnet-35 | BCOP       | 63.05% | 41.72% | 267.272 | 21.71% | 10.33% | 274.256 |
|               | SOC        | 73.70% | 58.44% | 130.671 | 42.44% | 28.31% | 126.368 |
| LipConvnet-40 | BCOP       | 60.17% | 38.87% | 295.350 | 19.97% | 8.66%  | 289.369 |
|               | SOC        | 71.63% | 54.36% | 144.556 | 41.83% | 27.98% | 140.458 |

We make the following observations: (a) SOC achieves significantly higher standard and provable robust accuracy than BCOP across architectures and datasets; (b) SOC requires significantly less training time per epoch than BCOP; and (c) as the number of layers increases, the performance of BCOP degrades rapidly while that of SOC remains largely consistent. For example, on the LipConvnet-40 architecture, SOC achieves 11.46% higher standard accuracy and 15.49% higher provable robust accuracy on CIFAR-10, and 21.86% higher standard accuracy and 19.32% higher provable robust accuracy on CIFAR-100. We further emphasize that none of the other well-known deterministic provable defenses (discussed in Section 2) scale to networks as large as the ones in Table 2. BCOP, while scalable, achieves significantly lower standard and provable robust accuracies than SOC for deep networks.

7.2. Standard Training

For standard training, we perform experiments using the ResNet-18, ResNet-34 and ResNet-50 architectures on the CIFAR-10 and CIFAR-100 datasets. Results are presented in Table 3. We again observe that SOC achieves higher standard accuracy than BCOP across architectures and datasets while requiring significantly less time to train. For ResNet-50, the performance of SOC almost matches that of standard convolution layers, while BCOP results in an Out Of Memory (OOM) error. However, for ResNet-18 and ResNet-34, the difference is not as significant as the one observed for LipConvnet architectures in Table 2. We conjecture that this is because the residual connections allow the gradient to flow relatively freely, compared to being restricted to flow through the convolution layers in LipConvnet architectures.

Table 3. Results for standard accuracy. For ResNet-50, we observe an OOM (Out Of Memory) error when using BCOP.

| Model     | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Time/epoch (s) |
|-----------|------------|--------|---------|--------|---------|
| ResNet-18 | Standard   | 95.10% | 13.289  | 77.60% | 13.440  |
|           | BCOP       | 92.38% | 128.383 | 71.16% | 128.146 |
|           | SOC        | 94.24% | 110.750 | 74.55% | 103.633 |
| ResNet-34 | Standard   | 95.54% | 22.348  | 78.60% | 22.806  |
|           | BCOP       | 93.79% | 237.068 | 73.38% | 235.367 |
|           | SOC        | 94.44% | 170.864 | 75.52% | 164.178 |
| ResNet-50 | Standard   | 95.47% | 38.834  | 78.11% | 37.454  |
|           | SOC        | 94.68% | 584.762 | 77.95% | 597.297 |

7.3. Adversarial Training

For adversarial training, we use a threat model with an ℓ∞ attack radius of 8/255. Note that we use the ℓ∞ threat model (instead of ℓ2) because it is known to be a stronger adversarial threat model for evaluating empirical robustness (Madry et al., 2018). For training, we use the FGSM variant of Wong et al. (2020). For evaluation, we use 50 iterations of PGD with a step size of 2/255 and 10 random restarts. Results are presented in Table 4. We observe that for the ResNet-18 architecture, on both CIFAR-10 and CIFAR-100, SOC results in significantly improved standard and empirical robust accuracy compared to BCOP while requiring significantly less time to train.

Table 4. Results for empirical robustness against adversarial examples (ℓ∞ perturbation radius of 8/255).

| Model     | Conv. type | CIFAR-10 Standard acc. | CIFAR-10 Robust acc. | CIFAR-10 Time/epoch (s) | CIFAR-100 Standard acc. | CIFAR-100 Robust acc. | CIFAR-100 Time/epoch (s) |
|-----------|------------|--------|--------|---------|--------|--------|---------|
| ResNet-18 | Standard   | 83.05% | 44.39% | 28.139  | 59.87% | 22.78% | 28.147  |
|           | BCOP       | 79.26% | 34.85% | 264.694 | 54.80% | 16.00% | 252.868 |
|           | SOC        | 82.24% | 43.73% | 203.860 | 58.95% | 22.65% | 199.188 |
The performance of SOC comes close to that of a standard convolution layer, with the difference being less than 1% in both standard and robust accuracy on both datasets.

8. Discussion and Future work

In this work, we design a new orthogonal convolution layer by first constructing a Skew-Symmetric convolution filter and then applying the convolution exponential (Hoogeboom et al., 2020) to the filter. We also derive provable guarantees on the approximation of the exponential using a finite number of terms. Our method achieves significantly higher accuracy than BCOP for various network architectures and datasets under standard, adversarial and provably robust training setups, while requiring less training time per epoch. We suggest the following directions for future research:

Reducing the evaluation time: While SOC requires less time to train than BCOP, it requires more time for evaluation, because the convolution filter needs to be applied multiple times to approximate the orthogonal matrix with the desired error. In contrast, BCOP constructs an orthogonal convolution filter that needs to be applied only once during evaluation. From Theorem 3, we know that we can reduce the number of terms required to achieve the desired approximation error by reducing the Jacobian norm ||J||_2. Training approaches such as spectral norm regularization (Singla & Feizi, 2021) and singular value clipping (Sedghi et al., 2019) could be useful to further lower ||J||_2 and thus reduce the evaluation time.

Complete set of SOC convolutions: While Theorem 4 suggests that the complete set of SOC convolutions can be constructed from a subset of Skew-Symmetric matrices B that satisfy (a) ||B||_2 ≤ π and (b) exp(B) = exp(A), where A is the Jacobian of some Skew-Symmetric convolution filter, it is an open question how to efficiently parametrize this subset for training Lipschitz convolutional neural networks. This remains an interesting problem for future research.

9. Acknowledgements

This project was supported in part by NSF CAREER AWARD 1942230, HR001119S0026, HR00112090132, NIST 60NANB20D134 and a Simons Fellowship on Foundations of Deep Learning.

References

Anil, C., Lucas, J., and Grosse, R. B. Sorting out Lipschitz function approximation. In ICML, 2018.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. Spectrally-normalized margin bounds for neural networks. In NeurIPS, 2017. URL http://dl.acm.org/citation.cfm?id=3295222.3295372.

Cao, X. and Gong, N. Z. Mitigating evasion attacks to deep neural networks via region-based classification. In ACSAC, 2017. URL https://doi.org/10.1145/3134600.3134606.

Cissé, M., Bojanowski, P., Grave, E., Dauphin, Y. N., and Usunier, N. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017. URL http://proceedings.mlr.press/v70/cisse17a.html.
Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. In ICML, 2019.

Croce, F., Andriushchenko, M., and Hein, M. Provable robustness of ReLU networks via maximization of linear regions. In AISTATS, 2019.

Edelman, A., Arias, T. A., and Smith, S. The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20:303–353, 1998.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Gouk, H., Frank, E., Pfahringer, B., and Cree, M. J. Regularisation of neural networks by enforcing Lipschitz continuity, 2020.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NeurIPS, 2017. URL https://proceedings.neurips.cc/paper/2017/file/892c3b1c6dccd52936e27cbd0ff683d6-Paper.pdf.

Hoogeboom, E., Satorras, V. G., Tomczak, J., and Welling, M. The convolution exponential and generalized Sylvester flows. arXiv, abs/2006.01910, 2020.

Jacobsen, J.-H., Smeulders, A. W., and Oyallon, E. i-RevNet: Deep invertible networks. In ICLR, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.

Kumar, A., Levine, A., Goldstein, T., and Feizi, S. Curse of dimensionality on randomized smoothing for certifiable robustness. In ICML, 2020. URL http://proceedings.mlr.press/v119/kumar20b.html.

Lécuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. K. K. Certified robustness to adversarial examples with differential privacy. In IEEE S&P, 2019.

Levine, A., Singla, S., and Feizi, S. Certifiably robust interpretation in deep learning, 2019.

Li, B., Chen, C., Wang, W., and Carin, L. Certified adversarial robustness with additive noise. In NeurIPS, 2019a. URL https://proceedings.neurips.cc/paper/2019/file/335cd1b90bfa4ee70b39d08a4ae0cf2d-Paper.pdf.

Li, Q., Haque, S., Anil, C., Lucas, J., Grosse, R., and Jacobsen, J.-H. Preventing gradient attenuation in Lipschitz constrained convolutional networks. In NeurIPS, 2019b.

Liu, X., Cheng, M., Zhang, H., and Hsieh, C. Towards robust neural networks via random self-ensemble. In ECCV, 2018.

Long, P. M. and Sedghi, H. Generalization bounds for deep convolutional neural networks. In ICLR, 2020. URL https://openreview.net/forum?id=r1e_FpNFDr.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In ICLR, 2018. URL https://openreview.net/forum?id=B1QRgziT-.

Peyré, G. and Cuturi, M. Computational optimal transport, 2018.

Qian, H. and Wegman, M. N. L2-nonexpansive neural networks. In ICLR, 2019. URL https://openreview.net/forum?id=ByxGSsR9FQ.

Raghunathan, A., Steinhardt, J., and Liang, P. Semidefinite relaxations for certifying robustness to adversarial examples. In NeurIPS, 2018.

Ryu, E., Liu, J., Wang, S., Chen, X., Wang, Z., and Yin, W. Plug-and-play methods provably converge with properly trained denoisers. In ICML, 2019. URL http://proceedings.mlr.press/v97/ryu19a.html.

Salman, H., Li, J., Razenshteyn, I., Zhang, P., Zhang, H., Bubeck, S., and Yang, G. Provably robust deep learning via adversarially trained smoothed classifiers. In NeurIPS, 2019. URL https://proceedings.neurips.cc/paper/2019/file/3a24b25a7b092a252166a1641ae953e7-Paper.pdf.

Sedghi, H., Gupta, V., and Long, P. M. The singular values of convolutional layers. In ICLR, 2019. URL https://openreview.net/forum?id=rJevYoA9Fm.

Singh, G., Gehr, T., Mirman, M., Püschel, M., and Vechev, M. T. Fast and effective robustness certification. In NeurIPS, 2018.

Singla, S. and Feizi, S. Second-order provable defenses against adversarial attacks. In ICML, 2020. URL http://proceedings.mlr.press/v119/singla20a.html.

Singla, S. and Feizi, S. Fantastic four: Differentiable and efficient bounds on singular values of convolution layers. In ICLR, 2021. URL https://openreview.net/forum?id=JCRblSgs34Z.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In ICLR, 2014. URL http://arxiv.org/abs/1312.6199.

Tolstikhin, I., Bousquet, O., Gelly, S., and Schoelkopf, B. Wasserstein auto-encoders. In ICLR, 2018. URL https://openreview.net/forum?id=HkL7n1-0b.

Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., and Madry, A. Robustness may be at odds with accuracy. In ICLR, 2018.

Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In NeurIPS, 2018.

Villani, C. Optimal transport, old and new, 2008.

Weng, T.-W., Zhang, H., Chen, H., Song, Z., Hsieh, C.-J., Boning, D., Dhillon, I. S., and Daniel, L. Towards fast computation of certified robustness for ReLU networks. In ICML, 2018.

Wong, E. and Kolter, J. Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018. URL http://proceedings.mlr.press/v80/wong18a.html.
Wong, E., Schmidt, F. R., Metzen, J. H., and Kolter, J. Z. Scaling provable adversarial defenses. In NeurIPS, 2018.

Wong, E., Rice, L., and Kolter, J. Z. Fast is better than free: Revisiting adversarial training. In ICLR, 2020. URL https://openreview.net/forum?id=BJx040EFvH.

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., and Pennington, J. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, 2018. URL http://proceedings.mlr.press/v80/xiao18a.html.

Zhang, H., Weng, T.-W., Chen, P.-Y., Hsieh, C.-J., and Daniel, L. Efficient neural network robustness certification with general activation functions. In NeurIPS, 2018. arXiv:1811.00866.

Zhang, H., Zhang, P., and Hsieh, C.-J. RecurJac: An efficient recursive algorithm for bounding Jacobian matrix of neural networks and its applications. In AAAI, 2019. arXiv:1810.11783.