# Efficient Training of Low-Curvature Neural Networks

Suraj Srinivas¹ (Harvard University) ssrinivas@seas.harvard.edu
Kyle Matoba (Idiap Research Institute & EPFL) kyle.matoba@epfl.ch
Himabindu Lakkaraju (Harvard University) hlakkaraju@hbs.edu
François Fleuret (University of Geneva) francois.fleuret@unige.ch

*Equal contribution. ¹Work done partially at Idiap Research Institute.
36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Standard deep neural networks often have excess non-linearity, making them susceptible to issues such as low adversarial robustness and gradient instability. Common methods to address these downstream issues, such as adversarial training, are expensive and often sacrifice predictive accuracy. In this work, we address the core issue of excess non-linearity via curvature, and demonstrate low-curvature neural networks (LCNNs) that obtain drastically lower curvature than standard models while exhibiting similar predictive performance. This leads to improved robustness and stable gradients, at a fraction of the cost of standard adversarial training. To achieve this, we decompose overall model curvature in terms of curvatures and slopes of its constituent layers. To enable efficient curvature minimization of constituent layers, we introduce two novel architectural components: first, a non-linearity called centered-softplus that is a stable variant of the softplus non-linearity, and second, a Lipschitz-constrained batch normalization layer. Our experiments show that LCNNs have lower curvature, more stable gradients and increased off-the-shelf adversarial robustness when compared to standard neural networks, all without affecting predictive performance. Our approach is easy to use and can be readily incorporated into existing neural network architectures. Code to implement our method and replicate our experiments is available at https://github.com/kylematoba/lcnn.

1 Introduction

The high degree of flexibility present in deep neural networks is critical to achieving good performance in complex tasks such as image classification, language modelling and generative modelling of images [1-3]. However, excessive flexibility is undesirable, as it can lead to model under-specification [4], which results in unpredictable behaviour on out-of-domain inputs, such as vulnerability to adversarial examples. Such under-specification can be avoided in principle via Occam's razor, which requires training models that are as simple as possible for the task at hand, and not any more. Motivated by this principle, in this work we focus on training neural network models without excess non-linearity in their input-output map (e.g., see Fig. 1b), such that predictive performance remains unaffected.

Figure 1: Decision boundaries of (a) a standard NN (β = 1000) and (b) an LCNN (β = 1) trained on the two-moons dataset. The LCNN recovers highly regular decision boundaries in contrast to the standard NN. (c) Comparison of the softplus and centered-softplus non-linearities (defined in §3.3), each shown for β = 1.0 and β = 0.2. These behave similarly for large β values, and converge to linear maps for low β values. However, softplus diverges while centered-softplus stays close to the origin.
Central to this work is a precise notion of curvature, a mathematical quantity that encodes the flexibility, or degree of non-linearity, of a function at a point. In deep learning, the curvature of a function at a point is often quantified as the norm of the Hessian at that point [5-7]. Hessian norms are zero everywhere if and only if the function is linear, making them suitable for measuring the degree of non-linearity. However, they suffer from a dependence on the scaling of model gradients, which makes them unsuitable for studying the interplay between non-linearity and model robustness. In particular, robust models have small gradient norms [8], which naturally imply smaller Hessian norms. But are they truly more linear as a result? To be able to study robustness independently of non-linearity, we propose normalized curvature, which normalizes the Hessian norm by the corresponding gradient norm, thus disentangling the two measures. Surprisingly, we find that normalized curvature is a stable measure across train and test samples (see Table 3), whereas the usual curvature is not.

A conceptually straightforward approach to training models with low-curvature input-output maps [6, 9] involves directly penalizing curvature locally at training samples. However, these methods involve expensive Hessian computations, and only minimize local point-wise curvature, not curvature everywhere. A complementary approach is that of Dombrowski et al. [10], who propose architectures that have small global curvature, but do not explicitly penalize this curvature during training. In contrast, we propose efficient mechanisms to explicitly penalize the normalized curvature globally. In addition, while previous methods [6, 7] penalize the Frobenius norm of the Hessian, we penalize its spectral norm, thus providing tighter and more interpretable robustness bounds.¹

Our overall contributions in this paper are:

1. In §3.1 we propose to measure the curvature of deep models via normalized curvature, which is invariant to scaling of the model gradients.
2. In §3.2 we show that the normalized curvature of neural networks can be upper bounded in terms of the normalized curvatures and slopes of individual layers.
3. We introduce an architecture for training LCNNs combining a novel activation function called the centered-softplus (§3.3), and recent innovations to constrain the Lipschitz constant of convolutional (§3.4.1) and batch normalization (§3.4.2) layers.
4. In §4 we prove bounds on the relative gradient robustness and adversarial robustness of models in terms of normalized curvature, showing that controlling normalized curvature directly controls these properties.
5. In §5, we show experiments demonstrating that our proposed innovations are successful in training low-curvature models without sacrificing training accuracy, and that such models have robust gradients and are more robust to adversarial examples out of the box.

¹Note that the Frobenius norm of a matrix strictly upper bounds its spectral norm.

2 Related Work

Adversarial Robustness of Neural Networks: The well-known phenomenon of adversarial vulnerability of neural networks [11, 12] shows that adding small amounts of imperceptible noise can cause deep neural networks to misclassify points with high confidence. The canonical method to defend against this vulnerability is adversarial training [13], which trains models to accurately classify adversarial examples generated via an attack such as projected gradient descent (PGD).
However, this approach is computationally expensive and provides no formal guarantees on robustness. Cohen et al. [14] proposed randomized smoothing, which provides a formal guarantee on robustness by generating a smooth classifier from any black-box classifier. Hein and Andriushchenko [8] identified the local Lipschitz constant as a critical quantity for proving formal robustness guarantees. Moosavi-Dezfooli et al. [6] penalize the Frobenius norm of the Hessian, and show that the resulting models perform similarly to models trained via adversarial training. Qin et al. [9] introduce a local linearity regularizer, which also implicitly penalizes the Hessian. Similar to these works, we enforce low curvature to induce robustness, but we focus on the out-of-the-box robustness of LCNNs.

Unreliable Gradient Interpretations in Neural Networks: Gradient explanations in neural networks can be unreliable. Ghorbani et al. [15] and Zhang et al. [16] showed that for any input, it is possible to find adversarial inputs such that the gradient explanations for these points are highly dissimilar to each other. Srinivas and Fleuret [17] showed that pre-softmax logit gradients, however, are independent of model behaviour, and as a result we focus on post-softmax loss gradients in this work. Ros and Doshi-Velez [18] showed empirically that robustness can be improved by gradient regularization; however, Dombrowski et al. [10] showed that gradient instability is primarily due to large Hessian norms. This suggests that the gradient penalization of Ros and Doshi-Velez [18] performed unintentional Hessian regularization, which is consistent with our experimental results. To alleviate this, Dombrowski et al. [7] proposed to train low-curvature models via softplus activations and weight decay, broadly similar to our strategy. However, while Dombrowski et al. [7] focused on the Frobenius norm of the Hessian, we penalize the normalized curvature, a scaled version of the Hessian spectral norm, which is strictly smaller than the Frobenius norm; this results in a more sophisticated penalization strategy.

Lipschitz Layers in Neural Networks: There has been extensive work on methods to bound the Lipschitz constant of neural networks. Cisse et al. [19] introduced Parseval networks, which penalize the deviation of linear layers from orthonormality; since an orthonormal linear operator evidently has a Lipschitz constant of one, this shrinks the Lipschitz constant of a layer towards one. Trockman and Kolter [20] use a reparameterization of the weight matrix, called the Cayley transform, that is orthogonal by construction. Miyato et al. [21] and Ryu et al. [22] proposed spectral normalization, where linear layers are re-parameterized by dividing by their spectral norm, ensuring that the overall spectral norm of the parameterized layer is one.

3 Training Low-Curvature Neural Networks

In this section, we introduce our approach for training low-curvature neural nets (LCNNs). Unless otherwise specified, we shall consider a neural network classifier $f$ that maps inputs $x \in \mathbb{R}^d$ to logits which characterize the prediction, and can be further combined with the true label distribution and a loss function to give a scalar loss value $f(x) \in \mathbb{R}_+$.

3.1 Measuring Relative Model Non-Linearity via Normalized Curvature

We begin our analysis by discussing a desirable definition of curvature $C_f(x) \in \mathbb{R}_+$.
While curvature is a well-studied topic in differential geometry [23], where Hessian normalization is a common theme, our discussion will be largely independent of this literature, in a manner more suited to the requirements of deep learning. Regardless of the definition, a typical property of a curvature measure is that $C_f(x) = 0 \;\; \forall x \in \mathbb{R}^d \iff f$ is linear, and the higher the curvature, the further from linear the function is. Hence $\max_{x \in \mathbb{R}^d} C_f(x)$ can be thought of as a measure of a model's non-linearity.

A common way to define curvature in machine learning [5-7] has been via Hessian norms. However, these measures are sensitive to gradient scaling, which is undesirable. After all, the degree of model non-linearity must intuitively be independent of how large its gradients are. For example, if two functions $f, g$ are scaled ($f = k\, g$) or rotated versions of each other, then we would like them to have similar curvatures, in the interest of disentangling curvature (i.e., degree of non-linearity) from scaling. It is easy to see that Hessian norms $\|\nabla^2_x f(x)\|_2$ do not have this property, as scaling the function also scales the corresponding Hessian. We would like to be able to recover low-curvature models with steep gradients (that are non-robust), precisely to be able to disentangle their properties from low-curvature models with small gradients (that are robust), which Hessian norms do not allow us to do. To avoid this problem, we propose a definition of curvature that is approximately normalized:

$$C_f(x) = \frac{\|\nabla^2 f(x)\|_2}{\|\nabla f(x)\|_2 + \varepsilon}.$$

Here $\|\nabla f(x)\|_2$ and $\|\nabla^2 f(x)\|_2$ are the $\ell_2$ norm of the gradient and the spectral norm of the Hessian, respectively, where $\nabla f(x) \in \mathbb{R}^d$, $\nabla^2 f(x) \in \mathbb{R}^{d \times d}$, and $\varepsilon > 0$ is a small constant to ensure well-behavedness of the measure. This definition measures the Hessian norm relative to the gradient norm, and captures a notion of relative local linearity, which can be seen via an application of Taylor's theorem:

$$\underbrace{\frac{\left|f(x + \epsilon) - f(x) - \nabla f(x)^\top \epsilon\right|}{\|\epsilon\|_2 \, \|\nabla f(x)\|_2}}_{\text{relative local linearity}} \;\le\; \frac{\|\epsilon\|_2}{2} \, \underbrace{\max_{x \in \mathbb{R}^d} C_f(x)}_{\text{normalized curvature}}$$

Here the numerator on the left captures the local linearity error, scaled by the gradient norm in the denominator, and this quantity can be shown using Taylor's theorem to be upper bounded by the normalized curvature. We shall consider penalizing this notion of normalized curvature, and we shall simply refer to this quantity as curvature in the rest of the paper.
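To make the normalized curvature concrete, the sketch below shows one way it could be estimated in PyTorch for a scalar-valued loss, using power iteration on Hessian-vector products. The function name, arguments, and the choice of a plain power iteration are ours and are not part of the paper's released code.

```python
import torch

def normalized_curvature(loss_fn, x, n_iter=10, eps=1e-6):
    """Rough estimate of C_f(x) = ||H(x)||_2 / (||g(x)||_2 + eps) for a scalar loss_fn.

    The Hessian spectral norm is approximated by power iteration on
    Hessian-vector products, so only double backpropagation is needed.
    """
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(loss_fn(x), x, create_graph=True)[0]
    grad_norm = grad.detach().norm()

    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(n_iter):
        # Hessian-vector product: differentiate <grad, v> with respect to x.
        hv = torch.autograd.grad((grad * v).sum(), x, retain_graph=True)[0]
        v = hv / (hv.norm() + 1e-12)
    hess_norm = torch.autograd.grad((grad * v).sum(), x, retain_graph=True)[0].norm()
    return hess_norm / (grad_norm + eps)
```

For a classifier, `loss_fn(x)` would be, for example, the cross-entropy of `model(x)` against a fixed label for a single input.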
3.2 A Data-Free Upper Bound on Curvature

Directly penalizing the curvature is computationally expensive, as it requires backpropagating an estimate of the Hessian norm, which itself requires backpropagating gradient-vector products. This requires chaining through the entire computational graph of the model at least three times. Moosavi-Dezfooli et al. [6] reduce the complexity of this operation by computing a finite-difference approximation to the Hessian from gradients, but even this double-backpropagation is expensive. We require an efficient penalization procedure that takes a single backpropagation step. To this end, we propose to minimize a data-free upper bound on the curvature. To illustrate the idea, we first show this upper bound for the simplified case of a composition of one-dimensional functions ($f : \mathbb{R} \to \mathbb{R}$).

Lemma 1. Given a 1-dimensional compositional function $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ with $f_i : \mathbb{R} \to \mathbb{R}$ for $i = 1, 2, \ldots, L$, the normalized curvature $C_f := |f''/f'|$ is bounded by

$$\left|\frac{f''}{f'}\right| \;\le\; \sum_{i=1}^{L} \left|\frac{f_i''}{f_i'}\right| \prod_{j=1}^{i-1} |f_j'|.$$

Proof. We first have $f' = \prod_{i=1}^{L} f_i'$, where each $f_i'$ is evaluated at the output of the layers below it. Differentiating this expression, we have

$$f'' = \sum_{i=1}^{L} f_i'' \prod_{j=1}^{i-1} f_j' \prod_{k=1,\, k \neq i}^{L} f_k'.$$

Dividing by $f'$, taking the absolute value of both sides, and using the triangle inequality, we have the intended result.

Extending the above expression to functions $\mathbb{R}^n \to \mathbb{R}$ is not straightforward, as the intermediate layer Hessians are order-three tensors. We derive this using a result from Wang et al. [24] that connects the spectral norms of order-$n$ tensors to the spectral norms of their matrix unfoldings. The full derivation is presented in the appendix, and the (simplified) result is stated below.

Theorem 1. Given a function $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$ with $f_i : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$, the curvature $C_f$ can be bounded by the sum of curvatures of the individual layers $C_{f_i}(x)$, i.e.,

$$C_f(x) \;\le\; \sum_{i=1}^{L} n_i \, C_{f_i}(x_{i-1}) \prod_{j=1}^{i-1} \left\|\nabla f_j(x_{j-1})\right\|_2 \;\le\; \sum_{i=1}^{L} n_i \max_{x'} C_{f_i}(x') \prod_{j=1}^{i-1} \max_{x'} \left\|\nabla f_j(x')\right\|_2,$$

where $x_j = f_j \circ \cdots \circ f_1(x)$ (with $x_0 = x$) denotes the intermediate activation fed to layer $j+1$. The rightmost term is independent of $x$ and thus holds uniformly across all data points.

This bound shows that controlling the curvature and Lipschitz constant of each layer of a neural network enables us to control the overall curvature of the model. Practical neural networks typically consist of¹ linear maps, such as convolutions, fully connected layers, and batch normalization layers, and non-linear activation functions. Linear maps have zero curvature by definition, and non-linear layers often have bounded gradients ($\le 1$), which simplifies computations. In the sections that follow, we shall see how to penalize the remaining terms, i.e., the curvature of the non-linear activations and the Lipschitz constants of the linear layers.

¹We ignore self-attention layers in this work.

3.3 Centered-Softplus: Activation Function with Trainable Curvature

Theorem 1 shows that the curvature of a neural network depends on the curvature of its constituent activation functions. We thus propose to use activation functions with minimal curvature. The kink at the origin of the ReLU function implies an undefined second derivative. On the other hand, a smooth activation such as the softplus function, $s(x; \beta) = \log(1 + \exp(\beta x))/\beta$, is better suited to analyzing questions of curvature. Despite not being a common baseline choice, softplus does see regular use, especially where its smoothness facilitates analysis [25]. The curvature of the softplus function is

$$C_{s(\cdot;\beta)}(x) = \beta \left(1 - \frac{d s(x; \beta)}{dx}\right).$$

Thus using softplus with small $\beta$ values ensures low curvature. However, we observe two critical drawbacks of softplus preventing its usage with small $\beta$: (1) divergence for small $\beta$, where $s(x; \beta \to 0) = \infty$, which ensures that well-behaved low-curvature maps cannot be recovered, and (2) instability upon composition, i.e., from

$$s^n(x = 0; \beta) = \underbrace{s \circ s \circ \cdots \circ s}_{n \text{ times}}(x = 0; \beta) = \frac{\log(n+1)}{\beta}$$

we have that $s^n(x = 0; \beta) \to \infty$ as $n \to \infty$, which shows that composing softplus functions exacerbates instability around the origin. This is critical for deep models with a large number of layers $n$, which is precisely the scenario of interest to us. To remedy this problem, we propose the centered-softplus $s_0(x; \beta)$, a simple modification of softplus obtained by introducing a normalizing term as follows:

$$s_0(x; \beta) = s(x; \beta) - \frac{\log 2}{\beta} = \frac{1}{\beta} \log\left(\frac{1 + \exp(\beta x)}{2}\right).$$

This ensures that $s_0(x = 0; \beta) = s_0^n(x = 0; \beta) = 0$ for any positive integer $n$, and hence also ensures stability upon composition. More importantly, we have $s_0(x; \beta \to 0) = x/2$, which is a scaled linear map, while still retaining $s_0(x; \beta \to \infty) = \mathrm{ReLU}(x)$. This ensures that we are able to learn both well-behaved linear maps, as well as highly non-linear ReLU-like maps if required. We further propose to cast $\beta$ as a learnable parameter and penalize its value, hence directly penalizing the curvature of that layer.
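As a concrete illustration, a centered-softplus layer with a learnable per-layer $\beta$ could be implemented roughly as below; the class name and the log-parameterization of $\beta$ are our choices, not prescribed by the paper.

```python
import math
import torch
import torch.nn as nn

class CenteredSoftplus(nn.Module):
    """Centered-softplus s0(x; beta) = (softplus(beta * x) - log 2) / beta
    with a learnable, per-layer curvature parameter beta > 0."""

    def __init__(self, beta_init=1.0):
        super().__init__()
        # Parameterize beta through its log so it stays positive during training.
        self.log_beta = nn.Parameter(torch.log(torch.tensor(float(beta_init))))

    def forward(self, x):
        beta = self.log_beta.exp()
        z = beta * x
        # Numerically stable softplus: log(1 + e^z) = max(z, 0) + log(1 + e^-|z|).
        softplus = torch.clamp(z, min=0) + torch.log1p(torch.exp(-z.abs()))
        return (softplus - math.log(2.0)) / beta
```

As $\beta \to 0$ this output tends to $x/2$ and as $\beta \to \infty$ it approaches $\mathrm{ReLU}(x)$, matching the behaviour described above.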
Having accounted for the curvature of the non-linearities, the next section discusses controlling the gradients of the linear layers.

3.4 Lipschitz Linear Layers

Theorem 1 shows that penalizing the Lipschitz constants of the constituent linear layers of a model is necessary to penalize overall model curvature. There are broadly three classes of linear layers we consider: convolutions, fully connected layers, and batch normalization.

3.4.1 Spectrally Normalized Convolutions and Fully Connected Layers

We propose to penalize the Lipschitz constant of convolutions and fully connected layers via existing spectral-normalization-like techniques. For fully connected layers, we use vanilla spectral normalization (Miyato et al. [21]), which controls the spectral norm of a fully connected layer by reparameterization, replacing a weight matrix $W$ by $W / \|W\|_2$, which has unit spectral norm. Ryu et al. [22] generalize this to convolutions by presenting a power iteration method that works directly on the linear mapping implicit in a convolution: maintaining 3D left and right singular vectors of the 4D tensor of convolutional filters and developing the corresponding update rules. They call this real spectral normalization to distinguish it from the approximation that [21] proposed for convolutions. Using spectral normalization on fully connected layers and real spectral normalization on convolutional layers ensures that the spectral norm of these layers is exactly equal to one, further simplifying the bound in Theorem 1.

3.4.2 γ-Lipschitz Batch Normalization

In principle, at inference time batch normalization (BN) is multiplication by the diagonal matrix of inverse running standard deviation estimates.¹ Thus we can spectrally normalize the BN layer by computing the reciprocal of the smallest running standard deviation value across dimensions $i$, i.e.,

$$\|\mathrm{BN}\|_2 = \max_{x \neq 0} \frac{\|\mathrm{BN}(x)\|_2}{\|x\|_2} = \frac{1}{\min_i \text{running-std}(i)}.$$

In practice, we found that models with such spectrally normalized BN layers tended to either diverge or fail to train in the first place, indicating that the scale introduced by BN is necessary for training. To remedy this, we introduce γ-Lipschitz batch normalization, defined by

$$\text{1-Lipschitz-BN}(x) = \frac{\mathrm{BN}(x)}{\|\mathrm{BN}\|_2}, \qquad \text{γ-Lipschitz-BN}(x) = \underbrace{\min\left(\gamma, \|\mathrm{BN}\|_2\right)}_{\text{scaling factor} \,\le\, \gamma} \cdot \, \text{1-Lipschitz-BN}(x).$$

By clipping the scaling above at $\gamma$ (equivalently, the running standard deviation below, at $1/\gamma$), we can ensure that the Lipschitz constant of a batch normalization layer is at most equal to $\gamma \in \mathbb{R}_+$. We provide a PyTorch-style code snippet in the appendix. As with $\beta$, described in §3.3, we cast $\gamma$ as a learnable parameter in order to penalize it during training. Gouk et al. [26] proposed a similar (simpler) solution; whereas they fit a common $\gamma$ to all BN layers, we let $\gamma$ vary freely by layer.

¹We ignore the learnable parameters of batch normalization for simplicity. The architecture we propose subsequently also does not have trainable affine parameters.
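The paper provides its own PyTorch-style snippet in the appendix; the sketch below is our simplified reading of the γ-Lipschitz BN layer, with affine parameters disabled, $\gamma$ kept $\ge 1$ via a softplus parameterization of our choosing, and the cap computed from the running statistics.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GammaLipschitzBatchNorm2d(nn.BatchNorm2d):
    """Batch normalization whose Lipschitz constant is capped at a learnable gamma >= 1.

    For affine-free BN, ||BN||_2 = 1 / min_i running_std_i; dividing the output by
    max(1, ||BN||_2 / gamma) therefore caps the layer's Lipschitz constant at gamma.
    """

    def __init__(self, num_features, gamma_init=1.0, **kwargs):
        super().__init__(num_features, affine=False, **kwargs)
        self.gamma_raw = nn.Parameter(torch.tensor(float(gamma_init)))

    def gamma(self):
        # Keep gamma >= 1, consistent with the log(gamma) >= 0 penalty terms.
        return 1.0 + F.softplus(self.gamma_raw)

    def forward(self, x):
        out = super().forward(x)
        running_std = torch.sqrt(self.running_var + self.eps)
        bn_norm = 1.0 / running_std.min()            # ||BN||_2
        scale = torch.clamp(bn_norm / self.gamma(), min=1.0)
        return out / scale
```

Note that during training this caps the map defined by the running statistics, which only approximates the batch-statistics map actually applied in that mode; the paper's appendix snippet is the authoritative version.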
3.5 Penalizing Curvature

We have discussed three architectural innovations: centered-softplus activations with a trainable β, spectrally normalized linear and convolutional layers, and a γ-Lipschitz batch normalization layer. We now discuss methods to penalize the overall curvature of a model built with these layers. Since convolutional and fully connected layers have spectral norms equal to one by construction, they contribute nothing to curvature. Thus, we restrict attention to batch normalization and activation layers, whose parameter sets will subsequently be referred to as $\beta_{SP}$ and $\gamma_{BN}$, respectively.

For models with batch normalization, naively using the upper bound in Theorem 1 is problematic due to the exponential growth in the product of Lipschitz constants of batch normalization layers. To alleviate this, we propose to use a penalization $R_f$ where the $\gamma_j$ terms are aggregated additively across batch normalization layers, independently of the $\beta_i$ terms, in the following manner:

$$R_f = \lambda_\beta \sum_{i \in \beta_{SP}} \beta_i + \lambda_\gamma \sum_{j \in \gamma_{BN}} \log \gamma_j. \tag{5}$$

An additive aggregation ensures that the penalization is well-behaved during training and does not grow exponentially. Note that the underlying model is necessarily linear if the penalization term is zero, thus making it an appropriate measure of model non-linearity. Also note that $\beta_i \ge 0$ and $\gamma_j \ge 1$ by construction. We shall henceforth use the term LCNN to refer to a model trained with the proposed architectural components (centered-softplus, spectral normalization, and γ-Lipschitz batch normalization) and the associated regularization terms on $\beta, \gamma$. We next discuss the robustness and interpretability benefits that we obtain with LCNNs.
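A minimal sketch of how the penalty in Eq. (5) could be assembled from the modules sketched above (the hypothetical CenteredSoftplus and GammaLipschitzBatchNorm2d classes); the λ values shown are placeholders rather than the paper's settings.

```python
import torch

def curvature_penalty(model, lambda_beta=1e-3, lambda_gamma=1e-3):
    """R_f = lambda_beta * sum_i beta_i + lambda_gamma * sum_j log(gamma_j),
    gathered over the model's activation and batch-norm layers (cf. Eq. 5)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for module in model.modules():
        if isinstance(module, CenteredSoftplus):
            penalty = penalty + lambda_beta * module.log_beta.exp()
        elif isinstance(module, GammaLipschitzBatchNorm2d):
            penalty = penalty + lambda_gamma * torch.log(module.gamma())
    return penalty
```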
4 Why Train Low-Curvature Models?

In this section we discuss the advantages that low-curvature models offer, particularly as they pertain to robustness and gradient stability. These statements apply not just to LCNNs, but to low-curvature models in general, obtained via any mechanism.

4.1 Low Curvature Models have Stable Gradients

Recent work [15, 16] has shown that gradient explanations are manipulable, and that we can easily find inputs whose explanations differ maximally from those at the original inputs, making them unreliable in practice for identifying important features. As we shall show, this amounts to models having a large curvature. In particular, we show that the relative gradient difference across points $\|\epsilon\|_2$ apart in the $\ell_2$ sense is upper bounded by the normalized curvature $C_f$, as given below.

Proposition 1. Consider a model $f$ with $\max_x C_f(x) \le \delta_C$, and two inputs $x$ and $x + \epsilon$ ($\in \mathbb{R}^d$). The relative distance between the gradients at these points is bounded by

$$\frac{\|\nabla f(x + \epsilon) - \nabla f(x)\|_2}{\|\nabla f(x)\|_2} \;\le\; \|\epsilon\|_2\, \delta_C \exp\left(\|\epsilon\|_2\, \delta_C\right) \;\approx\; \|\epsilon\|_2\, C_f(x) \quad \text{(quadratic approximation)}.$$

The proof expands $\nabla f(x + \epsilon)$ in a Taylor expansion around $x$ and bounds the magnitude of the second- and higher-order terms over the neighborhood of $x$; the full argument is given in the appendix. The upper bound is considerably simpler when we assume that the function is locally quadratic, which corresponds to the rightmost term $\|\epsilon\|_2 C_f(x)$. Thus, the smaller the model curvature, the more locally stable the gradients.

4.2 Low Curvature is Necessary for ℓ2 Robustness

Having a small gradient norm is known to be an important aspect of adversarial robustness [8]. However, small gradients alone are not sufficient, and low curvature is necessary to achieve robustness. This is easy to see intuitively: a model may have small gradients at a point, leading to robustness for small noise values, but if the curvature is large, then gradient norms at neighboring points can quickly increase, leading to misclassification at even slightly larger noise levels. This effect is an instance of gradient masking [27], which provides an illusion of robustness by making models only locally robust.

In the result below, we formalize this intuition and establish an upper bound on the change in the model output between two nearby points, which we show depends on both the gradient norm (as was known previously) as well as the maximum curvature of the underlying model.

Proposition 2. Consider a model $f$ with $\max_x C_f(x) \le \delta_C$. Then for two inputs $x$ and $x + \epsilon$ ($\in \mathbb{R}^d$), we have the following expression for robustness:

$$|f(x + \epsilon) - f(x)| \;\le\; \|\epsilon\|_2\, \|\nabla f(x)\|_2 \left(1 + \|\epsilon\|_2\, \delta_C \exp\left(\|\epsilon\|_2\, \delta_C\right)\right) \;\approx\; \|\epsilon\|_2\, \|\nabla f(x)\|_2 \left(1 + \|\epsilon\|_2\, C_f(x)\right) \quad \text{(quadratic approximation)}.$$

The proof uses similar techniques to those of Proposition 1, and is also given in the appendix. This result shows that, given two models with equally small gradients at the data points, greater robustness will be achieved by the model with the smaller curvature.

5 Experiments

In this section we perform experiments to (1) evaluate the effectiveness of our proposed method in training models with low curvature as originally intended, (2) evaluate whether low-curvature models have robust gradients in practice, and (3) evaluate the effectiveness of low-curvature models for adversarial robustness. Our experiments are primarily conducted on a base ResNet-18 architecture [28] using the CIFAR10 and CIFAR100 datasets [29], and using the PyTorch [30] framework. Our methods entailed fairly modest computation: our most involved computations can be completed in under three GPU-days, and all experimental results could be computed in less than 60 GPU-days. We used a mixture of GPUs, primarily NVIDIA GeForce GTX 1080 Ti cards, on an internal compute cluster.

Baselines and Our Methods: Our baseline model for comparison is a ResNet-18 model with softplus activations with a high β = 10³, chosen to mimic ReLU while still having well-defined curvature. Another baseline is gradient norm regularization (which we henceforth call GradReg) [31], where the same baseline model is trained with an additional penalty on the gradient norm. We train two variants of our approach: a base LCNN, which involves penalizing the curvature, and another variant combining the LCNN penalty and gradient norm regularization (LCNN + GradReg), which controls both the curvature and the gradient norm. Our theory indicates that the LCNN + GradReg variant is likely to produce more robust models, which we verify experimentally. We also compare with CURE [6], softplus with weight decay [7], and adversarial training with ℓ2 PGD [13] with a noise magnitude of 0.1 and 3 iterations of PGD. We provide experimental details in the appendix.

Parameter Settings: All our models are trained for 200 epochs with an SGD + momentum optimizer, with a momentum of 0.9, an initial learning rate of 0.1 which decays by a factor of 10 at 150 and 175 epochs, and a weight decay of 5×10⁻⁴.
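To make the comparison between the variants concrete, here is a hedged sketch of a single training step covering the standard, GradReg, LCNN, and LCNN + GradReg configurations; the coefficient names and defaults are illustrative only, and the curvature_penalty helper is the hypothetical one sketched after §3.5.

```python
import torch
import torch.nn.functional as F

def training_step(model, x, y, optimizer, lambda_grad=0.0, use_lcnn_penalty=False):
    """One optimizer step on: cross-entropy + lambda_grad * E[||d loss / d x||^2] + R_f.

    lambda_grad > 0 gives GradReg [31]; use_lcnn_penalty adds the curvature
    penalty of Eq. (5); enabling both gives the LCNN + GradReg variant.
    """
    optimizer.zero_grad()
    x = x.requires_grad_(True)
    loss = F.cross_entropy(model(x), y)

    if lambda_grad > 0:
        # Double-backpropagation gradient-norm penalty on the input gradients.
        input_grad = torch.autograd.grad(loss, x, create_graph=True)[0]
        loss = loss + lambda_grad * input_grad.flatten(1).pow(2).sum(dim=1).mean()

    if use_lcnn_penalty:
        loss = loss + curvature_penalty(model)

    loss.backward()
    optimizer.step()
    return loss.item()
```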
5.1 Evaluating the Efficacy of Curvature Penalization

In this section, we evaluate whether LCNNs indeed reduce model curvature in practice. Table 1 contains our results, from which we make the following observations: (1) most of our baselines, except CURE and adversarial training, do not meaningfully lose predictive performance; (2) GradReg and adversarially trained models are best at reducing the gradient norm, while LCNN-based models are best at penalizing curvature. Overall, these experimental results show that LCNN-based models indeed minimize curvature as intended. We also observe in Table 1 that GradReg [31] has an unexpected regularizing effect on the Hessian and curvature.

We conjecture that this is partly due to the following decomposition of the loss Hessian, which can be written as

$$\nabla^2 f(x) \;=\; \nabla f_l(x)^\top \, \nabla^2_{f_l} \mathrm{LSE}(x) \, \nabla f_l(x) \;+\; \nabla_{f_l} \mathrm{LSE}(x) \, \nabla^2 f_l(x),$$

where $\mathrm{LSE}(x)$ is the LogSumExp function and $f_l(x)$ is the pre-softmax logit output. We observe that the first term strongly depends on the gradients, which may explain the Hessian regularization effects of GradReg, while the second term depends on the Hessian of the bulk of the neural network, which is penalized by the LCNN penalties. This also explains why combining both penalizations (LCNN + GradReg) further reduces curvature.

We also measure the average per-epoch training times on a GTX 1080 Ti, which are: standard models / softplus + weight decay (≈100 s), LCNN (≈160 s), GradReg (≈270 s), LCNN + GradReg (≈350 s), and CURE / adversarial training (≈500 s). Note that the increase in computation for LCNN is primarily due to the use of spectral normalization layers. The results show that LCNNs are indeed able to penalize curvature while only marginally (1.6×) increasing training time, and using LCNN + GradReg increases time only 1.3× over GradReg while providing curvature benefits.

Table 1: Model geometry of ResNet-18 models trained with various regularizers, on the CIFAR100 test dataset. Gradient norm regularized models [31] (GradReg) are best at reducing gradient norms, while LCNN-based models are best at reducing curvature, leaving gradients unpenalized. We obtain the benefits of both by combining these penalties. Results are averaged across two runs.

| Model | $\mathbb{E}_x \|\nabla f(x)\|_2$ | $\mathbb{E}_x \|\nabla^2 f(x)\|_2$ | $\mathbb{E}_x C_f(x)$ | Accuracy (%) |
|---|---|---|---|---|
| Standard | 19.66 ± 0.33 | 6061.96 ± 968.05 | 270.89 ± 75.04 | 77.42 ± 0.11 |
| LCNNs | 22.04 ± 1.41 | 1143.62 ± 99.38 | 69.50 ± 2.41 | 77.30 ± 0.11 |
| GradReg [31] | 8.86 ± 0.12 | 776.56 ± 63.62 | 89.47 ± 5.86 | 77.20 ± 0.26 |
| LCNNs + GradReg | 9.87 ± 0.27 | 154.36 ± 0.22 | 25.30 ± 0.09 | 77.29 ± 0.07 |
| CURE [6] | 8.86 ± 0.01 | 979.45 ± 14.05 | 116.31 ± 4.58 | 76.48 ± 0.07 |
| Softplus + Wt. Decay [7] | 18.08 ± 0.05 | 1052.84 ± 7.27 | 70.39 ± 0.88 | 77.44 ± 0.28 |
| Adversarial Training [32] | 7.99 ± 0.03 | 501.43 ± 18.64 | 63.79 ± 1.65 | 76.96 ± 0.26 |

5.2 Impact of Curvature on Gradient Robustness

In §4.1, we showed that low-curvature models tend to have more robust gradients. Here we evaluate this prediction empirically by measuring the relative gradient robustness for models with various ranges of curvature values and noise levels. In particular, we measure robustness to random noise at fixed magnitudes ranging logarithmically from 1×10⁻³ to 1×10⁻¹. We plot our results in Figure 2, where we find that our results match the theory (i.e., the simplified quadratic approximation in §4.1) quite closely in terms of the overall trends, and that low-curvature models have an order of magnitude improvement in robustness over standard models.

Figure 2: Plot showing relative gradient robustness $\|\nabla f(x+\epsilon) - \nabla f(x)\|_2 / \|\nabla f(x)\|_2$ as a function of added noise $\|\epsilon\|_2$ on (a) CIFAR10 and (b) CIFAR100 with a ResNet-18 model, for the Standard, GradReg, LCNN, and LCNN + GradReg models together with their theoretical bounds. We observe that low-curvature models lead to an order of magnitude improvement in gradient robustness, and this improvement closely follows the trend predicted by the theoretical upper bound in §4.
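The measurement behind Figure 2 can be reproduced, up to implementation details we do not know, with a sketch like the following, which draws random perturbations of a fixed ℓ2 norm and compares loss gradients; the function name and batching choices are ours.

```python
import torch
import torch.nn.functional as F

def relative_gradient_robustness(model, x, y, noise_norm):
    """||grad f(x + eps) - grad f(x)||_2 / ||grad f(x)||_2 for random eps with
    ||eps||_2 = noise_norm per sample, averaged over the batch (cf. Figure 2)."""
    def input_grad(inp):
        inp = inp.detach().requires_grad_(True)
        loss = F.cross_entropy(model(inp), y)
        return torch.autograd.grad(loss, inp)[0].flatten(1)

    eps = torch.randn_like(x).flatten(1)
    eps = noise_norm * eps / eps.norm(dim=1, keepdim=True)
    g0 = input_grad(x)
    g1 = input_grad(x + eps.view_as(x))
    return ((g1 - g0).norm(dim=1) / g0.norm(dim=1)).mean().item()
```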
5.3 Impact of Curvature on Adversarial Robustness

Our theory in §4.2 shows that having low curvature, along with small gradient norms, is necessary for robustness. In this section we evaluate this claim empirically, by generating adversarial examples via ℓ2 PGD [13] adversaries with various noise magnitudes. We use the CleverHans library [33] to implement PGD. Our results are presented in Table 2, where we find that LCNN + GradReg models perform on par with adversarial training, without the accompanying loss of clean accuracy.

Table 2: Off-the-shelf model accuracies (%) on ℓ2 PGD adversarial examples across various noise magnitudes. Adversarial training performs the best overall, but sacrifices clean accuracy. LCNN + GradReg models perform similarly but without a significant loss of clean accuracy. Results are averaged across two runs.

| Model | Clean Acc. (%) | $\|\epsilon\|_2 = 0.05$ | $\|\epsilon\|_2 = 0.1$ | $\|\epsilon\|_2 = 0.15$ | $\|\epsilon\|_2 = 0.2$ |
|---|---|---|---|---|---|
| Standard | 77.42 ± .10 | 59.97 ± .11 | 37.55 ± .13 | 23.41 ± .08 | 16.11 ± .21 |
| LCNN | 77.16 ± .07 | 61.17 ± .53 | 39.72 ± .17 | 25.60 ± .32 | 17.66 ± .18 |
| GradReg | 77.20 ± .26 | 71.90 ± .11 | 61.06 ± .03 | 49.19 ± .12 | 38.09 ± .47 |
| LCNNs + GradReg | 77.29 ± .26 | 72.68 ± .52 | 63.36 ± .39 | 52.96 ± .76 | 42.70 ± .77 |
| CURE [6] | 76.48 ± .07 | 71.39 ± .12 | 61.28 ± .32 | 49.60 ± .09 | 39.04 ± .16 |
| Softplus + Wt. Decay [7] | 77.44 ± .28 | 60.86 ± .36 | 38.04 ± .43 | 23.85 ± .33 | 16.20 ± .01 |
| Adversarial Training [32] | 76.96 ± .26 | 72.76 ± .15 | 64.70 ± .20 | 54.80 ± .25 | 44.98 ± .57 |

5.4 Train-Test Discrepancy in Model Geometry

During our experiments, we observed a consistent phenomenon where the gradient norms and Hessian norms for test data points were much larger than those for the train data points, which hints at a form of overfitting with respect to these quantities. We term this phenomenon the train-test discrepancy in model geometry. Interestingly, we did not observe any such discrepancy using our proposed curvature measure, indicating that it may be a more reliable measure of model geometry. We report our results in Table 3, where we measure the relative discrepancy, finding that the discrepancy for our proposed curvature measure is multiple orders of magnitude smaller than the corresponding quantity for gradient and Hessian norms. We leave further investigation of this phenomenon, i.e., why curvature is stable across train and test, as a topic for future work.

Table 3: Train-test discrepancy in model geometry, where the relative discrepancy $tt_g(X) = \left|\frac{g(X_{\text{test}}) - g(X_{\text{train}})}{g(X_{\text{train}})}\right|$ is shown for three different geometric measures. We observe that (1) there exists a large train-test discrepancy, with the test gradient / Hessian norms being more than 10× the corresponding values for the train set, and (2) the discrepancy is 2-3 orders of magnitude smaller for our proposed curvature measure, indicating that it may be a stable model property.

| Model | $tt$ for $\mathbb{E}_{x \in X} \|\nabla f(x)\|_2$ | $tt$ for $\mathbb{E}_{x \in X} \|\nabla^2 f(x)\|_2$ | $tt$ for $\mathbb{E}_{x \in X} C_f(x)$ |
|---|---|---|---|
| Standard | 11.75 | 12.28 | 0.025 |
| GradReg | 11.33 | 11.22 | 0.017 |
| LCNN | 19.99 | 11.33 | 0.129 |
| LCNNs + GradReg | 21.82 | 10.43 | 0.146 |
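The paper evaluates robustness with the CleverHans implementation of PGD; for completeness, below is a self-contained ℓ2 PGD sketch of our own (the step-size heuristic and helper name are assumptions, not the paper's settings) that could be used to reproduce a Table 2-style evaluation.

```python
import torch
import torch.nn.functional as F

def pgd_l2(model, x, y, eps, n_iter=3, step_size=None):
    """Untargeted l2 PGD: ascend the loss along l2-normalized gradients and
    project back onto the l2 ball of radius eps around the clean input x."""
    step_size = step_size if step_size is not None else 2.5 * eps / n_iter
    x_adv = x.clone()
    for _ in range(n_iter):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        grad_norm = grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        x_adv = x_adv.detach() + step_size * grad / (grad_norm + 1e-12)
        # Project back onto the eps-ball around the clean input.
        delta = (x_adv - x).flatten(1)
        factor = torch.clamp(eps / (delta.norm(dim=1) + 1e-12), max=1.0)
        x_adv = x + (delta * factor.unsqueeze(1)).view_as(x)
    return x_adv.detach()
```

Robust accuracy for a given noise magnitude is then simply the model's accuracy evaluated on `pgd_l2(model, x, y, eps)`.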
Summary of Experimental Results: Overall, our experiments show that: (1) LCNNs have lower curvature than standard models, as advertised, and combining them with gradient norm regularization further decreases curvature (see Table 1); the latter phenomenon is unexpected, as our curvature measure ignores gradient scaling. (2) LCNNs combined with gradient norm regularization achieve an order of magnitude improved gradient robustness over standard models (see Figure 2). (3) LCNNs combined with gradient norm regularization outperform adversarial training in terms of achieving better predictive accuracy at a lower curvature (see Table 1), and are competitive in terms of adversarial robustness (see Table 2), while being 1.4× faster. (4) There exists a train-test discrepancy for standard geometric quantities like the gradient and Hessian norms, and this discrepancy disappears for our proposed curvature measure (see Table 3). We also present ablation experiments, additional adversarial attacks, and evaluations on more datasets and architectures in the appendix.

6 Discussion

In this paper, we presented a modular approach to remove excess curvature in neural network models. Importantly, we found that combining vanilla LCNNs with gradient norm regularization resulted in models with the smallest curvature and the most stable gradients, as well as those that are the most adversarially robust. Notably, this procedure achieves adversarial robustness without explicitly generating adversarial examples during training. The current limitations of our approach are that we only consider convolutional and fully connected layers, and not self-attention or recurrent layers. We also do not investigate the learning-theoretic benefits (or harms) of low-curvature models, or study their generalization for small numbers of training samples, or their robustness to label noise (which we already observe in Fig. 1b). Investigating these is an important topic for future work.

Acknowledgments and Disclosure of Funding

The authors would like to thank the anonymous reviewers for their helpful feedback and all the funding agencies listed below for supporting this work. SS and HL are supported in part by NSF awards #IIS-2008461 and #IIS-2040989, and research awards from Google, JP Morgan, Amazon, the Harvard Data Science Initiative, and the D3 Institute at Harvard. KM and SS (partly) are supported by the Swiss National Science Foundation under grant number FNS-188758 "CORTI". HL would like to thank Sujatha and Mohan Lakkaraju for their continued support and encouragement.

References

[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[2] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[3] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[4] Alexander D'Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 1(3), 2020.
[5] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1019-1028. PMLR, 2017. URL https://proceedings.mlr.press/v70/dinh17b.html.
[6] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9070-9078, Los Alamitos, CA, USA, June 2019. IEEE Computer Society. URL https://doi.ieeecomputersociety.org/10.1109/CVPR.2019.00929.
[7] Ann-Kathrin Dombrowski, Christopher J. Anders, Klaus-Robert Müller, and Pan Kessel. Towards robust explanations for deep neural networks. Pattern Recognition, 121:108194, 2022. ISSN 0031-3203. URL https://doi.org/10.1016/j.patcog.2021.108194.
[8] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in Neural Information Processing Systems, 30, 2017.
[9] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. Advances in Neural Information Processing Systems, 32, 2019.
[10] Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel. Explanations can be manipulated and geometry is to blame. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/bb836c01cdc9120a9c984c525e4b1a4a-Paper.pdf.
[11] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6199.
[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv e-prints, December 2014. URL http://arxiv.org/abs/1412.6572.
[13] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.
[14] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310-1320. PMLR, 2019.
[15] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3681-3688, 2019. URL https://doi.org/10.1609/aaai.v33i01.33013681.
[16] Xinyang Zhang, Ningfei Wang, Hua Shen, Shouling Ji, Xiapu Luo, and Ting Wang. Interpretable deep learning under fire. In 29th USENIX Security Symposium (USENIX Security 20), 2020.
[17] Suraj Srinivas and Francois Fleuret. Rethinking the role of gradient-based attribution methods for model interpretability. In International Conference on Learning Representations, 2020.
[18] Andrew Slavin Ros and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18. AAAI Press, 2018. ISBN 978-1-57735-800-8.
[19] Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv e-prints, arXiv:1704.08847, April 2017.
[20] Asher Trockman and J. Zico Kolter. Orthogonalizing convolutional layers with the Cayley transform. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Pbj8H_jEHYv.
[21] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv e-prints, February 2018. URL http://arxiv.org/abs/1802.05957.
[22] Ernest Ryu, Jialin Liu, Sicheng Wang, Xiaohan Chen, Zhangyang Wang, and Wotao Yin. Plug-and-play methods provably converge with properly trained denoisers. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5546-5557. PMLR, 2019. URL https://proceedings.mlr.press/v97/ryu19a.html.
[23] John M. Lee. Riemannian Manifolds: An Introduction to Curvature, volume 176. Springer Science & Business Media, 2006.
[24] Miaoyan Wang, Khanh Dao Duc, Jonathan Fischer, and Yun S. Song. Operator norm inequalities between tensor unfoldings on the partition lattice. Linear Algebra and its Applications, 520:44-66, 2017. ISSN 0024-3795. URL https://doi.org/10.1016/j.laa.2017.01.017.
[25] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, and David Duvenaud. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJxgknCcK7.
[26] Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael J. Cree. Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning, 110(2):393-416, 2021.
[27] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274-283. PMLR, 2018.
[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770-778, 2016. URL https://doi.org/10.1109/CVPR.2016.90.
[29] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS 2017 Workshop Autodiff Program, 2017.
[31] H. Drucker and Y. LeCun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991-997, 1992. doi: 10.1109/72.165600.
[32] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv e-prints, June 2017. URL http://arxiv.org/abs/1706.06083.
[33] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, Aurko Roy, Alexander Matyasko, Vahid Behzadan, Karen Hambardzumyan, Zhishuai Zhang, Yi-Lin Juang, Zhi Li, Ryan Sheatsley, Abhibhav Garg, Jonathan Uesato, Willi Gierke, Yinpeng Dong, David Berthelot, Paul Hendricks, Jonas Rauber, and Rujun Long. Technical report on the CleverHans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2018. URL https://arxiv.org/abs/1610.00768.
Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes]
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] We present foundational work on the geometry of neural network models and as such we do not foresee any negative societal impacts beyond those that already exist with the usage of neural network models.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] Proofs are provided in the appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Some training details are mentioned in the appendix.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes]
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes]
   (b) Did you mention the license of the assets? [Yes]
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
5. If you used crowdsourcing or conducted research with human subjects... We did not use crowdsourcing or conduct research with human subjects.
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]