Published as a conference paper at ICLR 2019

GENERALIZABLE ADVERSARIAL TRAINING VIA SPECTRAL NORMALIZATION

Farzan Farnia*, Jesse M. Zhang*, David N. Tse
Department of Electrical Engineering, Stanford University
{farnia,jessez,dntse}@stanford.edu
*Equal contributors.

ABSTRACT

Deep neural networks (DNNs) have set benchmarks on a wide array of supervised learning tasks. Trained DNNs, however, often lack robustness to minor adversarial perturbations of the input, which undermines their true practicality. Recent works have increased the robustness of DNNs by fitting networks using adversarially-perturbed training samples, but the improved performance can still be far below the performance seen in non-adversarial settings. A significant portion of this gap can be attributed to the decrease in generalization performance due to adversarial training. In this work, we extend the notion of margin loss to adversarial settings and bound the generalization error for DNNs trained under several well-known gradient-based attack schemes, motivating an effective regularization scheme based on spectral normalization of the DNN's weight matrices. We also provide a computationally-efficient method for normalizing the spectral norm of convolutional layers with arbitrary stride and padding schemes in deep convolutional networks. We evaluate the power of spectral normalization extensively on combinations of datasets, network architectures, and adversarial training schemes.

1 INTRODUCTION

Despite their impressive performance on many supervised learning tasks, deep neural networks (DNNs) are often highly susceptible to adversarial perturbations imperceptible to the human eye (Szegedy et al., 2013; Goodfellow et al., 2014b). These "adversarial attacks" have received enormous attention in the machine learning literature over recent years (Goodfellow et al., 2014b; Moosavi-Dezfooli et al., 2016; Carlini & Wagner, 2016; Kurakin et al., 2016; Papernot et al., 2016; Carlini & Wagner, 2017; Papernot et al., 2017; Madry et al., 2018; Tramèr et al., 2018). Adversarial attack studies have mainly focused on developing effective attack and defense schemes. While attack schemes attempt to mislead a trained classifier via additive perturbations to the input, defense mechanisms aim to train classifiers that are robust to these perturbations. Although existing defense methods result in considerably better performance compared to standard training methods, the improved performance can still be far below the performance in non-adversarial settings (Athalye et al., 2018; Schmidt et al., 2018).

A standard adversarial training scheme involves fitting a classifier using adversarially-perturbed samples (Szegedy et al., 2013; Goodfellow et al., 2014b) with the intention of producing a trained classifier with better robustness to attacks on future (i.e. test) samples. Madry et al. (2018) provide a robust optimization interpretation of the adversarial training approach, demonstrating that this strategy finds the optimal classifier minimizing the average worst-case loss over an adversarial ball centered at each training sample. This minimax interpretation can also be extended to distributionally-robust training methods (Sinha et al., 2018), where the offered robustness is over a Wasserstein ball around the empirical distribution of the training data. Recently, Schmidt et al. (2018) have shown that standard adversarial training produces networks that generalize poorly.
The performance of adversarially-trained DNNs on test samples can be significantly worse than their training performance, and this gap can be far greater than the generalization gap achieved using standard empirical risk minimization (ERM). This discrepancy suggests that the overall adversarial test performance can be improved by applying effective regularization schemes during adversarial training.

Figure 1: Adversarial training performance with and without spectral normalization (SN) for AlexNet fit on CIFAR10 (panels: FGM, PGM, and WRM training; curves: train, valid, train (SN), valid (SN)). The gains in the final test accuracies for FGM, PGM, and WRM after spectral normalization are 0.09, 0.11, and 0.04, respectively (see Table 1 in the Appendix). For FGM and PGM, perturbations have $\ell_2$ magnitude 2.44.

In this work, we propose using spectral normalization (SN) (Miyato et al., 2018) as a computationally-efficient and statistically-powerful regularization scheme for adversarial training of DNNs. SN has been successfully implemented and applied to DNNs in the context of generative adversarial networks (GANs) (Goodfellow et al., 2014a), resulting in state-of-the-art deep generative models on several benchmark tasks (Miyato et al., 2018). Moreover, SN (Tsuzuku et al., 2018) and other similar Lipschitz regularization techniques (Cisse et al., 2017) have been successfully applied in non-adversarial training settings to improve the robustness of ERM-trained networks to adversarial attacks. The theoretical results in (Bartlett et al., 2017; Neyshabur et al., 2017a) and empirical results in (Yoshida & Miyato, 2017) also suggest that SN can close the generalization gap for DNNs in the non-adversarial ERM setting.

On the theoretical front, we extend the standard notion of margin loss to adversarial settings. We leverage the PAC-Bayes generalization framework (McAllester, 1999) to prove generalization bounds for spectrally-normalized DNNs in terms of our defined adversarial margin loss. We obtain adversarial generalization error bounds for three well-known gradient-based attack schemes: the fast gradient method (FGM) (Goodfellow et al., 2014b), the projected gradient method (PGM) (Kurakin et al., 2016), and Wasserstein risk minimization (WRM) (Sinha et al., 2018). Our theoretical analysis shows that the adversarial generalization error vanishes when SN is applied to all layers.

On the empirical front, we show that SN can significantly improve the test performance of adversarially-trained DNNs. We perform numerical experiments over various standard datasets and DNN architectures. In almost all of our experiments, we obtain better test performance after applying SN. For example, Figure 1 shows the training and validation performance for AlexNet fit on the CIFAR10 dataset using FGM, PGM, and WRM, resulting in adversarial test accuracy improvements of 9, 11, and 4 percent, respectively. To perform our numerical experiments, we develop a computationally-efficient approach for normalizing the spectral norm of convolutional layers with arbitrary stride and padding schemes.

To summarize, the main contributions of this work are:
1. Proposing SN as a regularization scheme for adversarial training of DNNs,
2. Extending the concepts of margin-based generalization analysis to adversarial settings and proving margin-based generalization bounds for three gradient-based adversarial attack schemes,
3. Developing an efficient method for normalizing the spectral norm of convolutional layers in deep convolutional networks,
4. Numerically demonstrating the improved test and generalization performance of DNNs trained with SN.

2 PRELIMINARIES

In this section, we first review some standard concepts of margin-based generalization analysis in learning theory. We then extend these notions to adversarial training settings.

2.1 SUPERVISED LEARNING, DEEP NEURAL NETWORKS, GENERALIZATION ERROR

Consider samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn i.i.d. from an underlying distribution $P_{X,Y}$. We suppose $X \in \mathcal{X}$ and $Y \in \{1, 2, \ldots, m\}$, where $m$ represents the number of different labels. Given a loss function $\ell$ and a function class $\mathcal{F} = \{f_w : w \in \mathcal{W}\}$ parameterized by $w$, a supervised learner aims to find the function in $\mathcal{F}$ minimizing the expected loss (risk) averaged over the underlying distribution $P$.

We consider $\mathcal{F}_{\mathrm{nn}}$ as the class of $d$-layer neural networks with $h$ hidden units per layer and activation function $\sigma : \mathbb{R} \to \mathbb{R}$. Each $f_w : \mathcal{X} \to \mathbb{R}^m$ in $\mathcal{F}_{\mathrm{nn}}$ maps a data point $x$ to an $m$-dimensional vector. Specifically, we can express each $f_w \in \mathcal{F}_{\mathrm{nn}}$ as $f_w(x) = W_d\,\sigma(W_{d-1}\,\sigma(\cdots \sigma(W_1 x)\cdots))$. We use $\|W_i\|_2$ to denote the spectral norm of matrix $W_i$, defined as the largest singular value of $W_i$, and $\|W_i\|_F$ to denote $W_i$'s Frobenius norm.

A classifier $f_w$'s performance over the true distribution of the data can be different from its training performance over the empirical distribution $\hat{P}$ of the training samples. The difference between the empirical and true averaged losses, evaluated on the training and test samples respectively, is called the generalization error. Similar to Neyshabur et al. (2017a), we evaluate a DNN's generalization performance using its expected margin loss, defined for margin parameter $\gamma > 0$ as

$$L_\gamma(f_w) := P\Big( f_w(X)[Y] \le \gamma + \max_{j \ne Y} f_w(X)[j] \Big), \qquad (1)$$

where $f_w(X)[j]$ denotes the $j$th entry of $f_w(X) \in \mathbb{R}^m$. For a given data point $X$, we predict the label corresponding to the maximum entry of $f_w(X)$. Also, we use $\hat{L}_\gamma(f_w)$ to denote the empirical margin loss averaged over the training samples. The goal of margin-based generalization analysis is to provide a theoretical comparison between the true and empirical margin risks.
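As a concrete reading of (1), the following NumPy sketch (our own illustration, not code from the paper) estimates the empirical margin loss $\hat{L}_\gamma$ from a matrix of network outputs; setting $\gamma = 0$ recovers the usual misclassification rate (with ties counted as errors).

```python
import numpy as np

def empirical_margin_loss(scores, labels, gamma):
    """Estimate the empirical margin loss of Eq. (1).

    scores: (n, m) array with scores[i, j] = f_w(x_i)[j]
    labels: (n,) integer array with entries in {0, ..., m-1}
    gamma:  margin parameter (gamma = 0 gives the 0-1 error)
    """
    n = scores.shape[0]
    true_scores = scores[np.arange(n), labels]
    # Largest score among the *other* classes for each sample.
    masked = scores.copy()
    masked[np.arange(n), labels] = -np.inf
    runner_up = masked.max(axis=1)
    # Sample i contributes a loss of 1 when its margin is at most gamma.
    return np.mean(true_scores <= gamma + runner_up)

# Toy usage: three samples, four classes.
scores = np.array([[2.0, 0.1, -1.0, 0.5],
                   [0.2, 0.3, 0.1, 0.0],
                   [-0.5, 1.5, 0.0, 1.4]])
labels = np.array([0, 1, 1])
print(empirical_margin_loss(scores, labels, gamma=0.0))   # 0-1 error
print(empirical_margin_loss(scores, labels, gamma=0.5))   # stricter margin
```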
2.2 ADVERSARIAL ATTACKS, ADVERSARIAL TRAINING

A supervised learner observes only the training samples and hence does not know the true distribution of the data. A standard approach to training a classifier is therefore to minimize the empirical expected loss $\ell$ over the function class $\mathcal{F} = \{f_w : w \in \mathcal{W}\}$, i.e.

$$\min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x_i), y_i\big). \qquad (2)$$

This approach is called empirical risk minimization (ERM). For better optimization performance, the loss function $\ell$ is commonly chosen to be smooth. Hence, 0-1 and margin losses are replaced by smooth surrogate loss functions such as the cross-entropy loss. However, we still use the margin loss as defined in (1) for evaluating the test and generalization performance of DNN classifiers.

While ERM training usually achieves good performance for DNNs, several recent observations reveal that adding an adversarially-chosen perturbation to each sample can significantly degrade the trained DNN's performance. Given a norm function $\|\cdot\|$ and adversarial noise power $\epsilon > 0$, the adversarial additive noise for sample $(x, y)$ and classifier $f_w$ is defined as

$$\delta^{\mathrm{adv}}_w(x) := \operatorname*{argmax}_{\|\delta\| \le \epsilon} \; \ell\big(f_w(x + \delta), y\big). \qquad (3)$$

To provide adversarial robustness against the above attack scheme, a standard technique, called adversarial training, performs ERM over the adversarially-perturbed samples by solving

$$\min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x_i + \delta^{\mathrm{adv}}_w(x_i)), y_i\big) \;:=\; \min_{w \in \mathcal{W}} \; \frac{1}{n}\sum_{i=1}^{n} \max_{\|\delta_i\| \le \epsilon} \ell\big(f_w(x_i + \delta_i), y_i\big). \qquad (4)$$

However, (3) and (4) are intractable optimization problems. Therefore, several schemes have been proposed in the literature to approximate the optimal solution of (3). In this work, we analyze the generalization performance of the following three gradient-based methods for approximating the solution to (3). We note that several other attack schemes, such as DeepFool (Moosavi-Dezfooli et al., 2016), CW attacks (Carlini & Wagner, 2017), and targeted and least-likely attacks (Kurakin et al., 2016), have been introduced and examined in the literature; these can lead to interesting future directions for this work.

1. Fast Gradient Method (FGM) (Goodfellow et al., 2014b): FGM approximates the solution to (3) by considering a linearized DNN loss around a given data point. Hence, FGM perturbs $(x, y)$ by adding the following noise vector:

$$\delta^{\mathrm{fgm}}_w(x) := \operatorname*{argmax}_{\|\delta\| \le \epsilon} \; \delta^T \nabla_x \ell\big(f_w(x), y\big). \qquad (5)$$

For the special case of the $\ell_\infty$-norm $\|\cdot\|_\infty$, the above representation of FGM recovers the fast gradient sign method (FGSM), where each data point $(x, y)$ is perturbed by the $\epsilon$-normalized sign vector of the loss's gradient. For the $\ell_2$-norm $\|\cdot\|_2$, we similarly normalize the loss's gradient vector to have Euclidean norm $\epsilon$.

2. Projected Gradient Method (PGM) (Kurakin et al., 2016): PGM is the iterative version of FGM and applies projected gradient descent to solve (3). For a given number of steps $r$, PGM follows the update rule

$$\forall\, 1 \le i \le r: \quad \delta^{\mathrm{pgm},i+1}_w(x) := \Pi_{\{\delta:\,\|\delta\| \le \epsilon\}}\Big( \delta^{\mathrm{pgm},i}_w(x) + \alpha\, \nu^{(i)}_w \Big), \qquad \nu^{(i)}_w := \operatorname*{argmax}_{\|\delta\| \le 1} \; \delta^T \nabla_x \ell\big(f_w(x + \delta^{\mathrm{pgm},i}_w(x)), y\big). \qquad (6)$$

Here, we first find the direction $\nu^{(i)}_w$ along which the loss at the $i$th perturbed point changes the most, and then we move the perturbed point along this direction by stepsize $\alpha$, followed by projecting the resulting perturbation onto the $\epsilon$-bounded set $\{\delta : \|\delta\| \le \epsilon\}$.

3. Wasserstein Risk Minimization (WRM) (Sinha et al., 2018): WRM solves the following variant of (3) for data point $(x, y)$, where the norm constraint in (3) is replaced by a norm-squared Lagrangian penalty term:

$$\delta^{\mathrm{wrm}}_w(x) := \operatorname*{argmax}_{\delta} \; \ell\big(f_w(x + \delta), y\big) - \frac{\lambda}{2}\|\delta\|^2. \qquad (7)$$

As discussed earlier, the optimization problem (3) is generally intractable. However, in the case of the Euclidean norm $\|\cdot\|_2$, if we assume the Lipschitz constant of $\nabla_x \ell(f_w(x), y)$ is upper-bounded by $\lambda$, then the WRM optimization (7) reduces to a convex optimization problem and can be efficiently solved using gradient methods.

To obtain efficient adversarial defense schemes, we can substitute $\delta^{\mathrm{fgm}}_w$, $\delta^{\mathrm{pgm}}_w$, or $\delta^{\mathrm{wrm}}_w$ for $\delta^{\mathrm{adv}}_w$ in (4). Instead of fitting the classifier over true adversarial examples, which are NP-hard to obtain, we can instead train the DNN over FGM-, PGM-, or WRM-adversarially perturbed samples.
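To make the three attack schemes concrete, here is a minimal single-example PyTorch sketch of the $\ell_2$ versions of FGM (5) and PGM (6) and of the WRM objective (7). It is our own illustration under simplifying assumptions (a generic `model` and scalar `loss_fn`, a zero initial perturbation, and the $\lambda/2$-weighted squared penalty as written in (7)); it is not the authors' TensorFlow implementation. Substituting any of these perturbations for $\delta^{\mathrm{adv}}_w$ in (4) yields the corresponding adversarial training scheme.

```python
import torch

def loss_grad(model, loss_fn, x, y):
    """Gradient of the loss l(f_w(x), y) with respect to the input x."""
    x = x.detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)
    return grad

def fgm_l2(model, loss_fn, x, y, eps):
    """Eq. (5) under the l2 norm: rescale the input gradient to norm eps."""
    g = loss_grad(model, loss_fn, x, y)
    return eps * g / (g.norm() + 1e-12)

def pgm_l2(model, loss_fn, x, y, eps, alpha, r):
    """Eq. (6): r steps of l2-normalized gradient ascent on the perturbation,
    each followed by projection onto the l2 ball of radius eps."""
    delta = torch.zeros_like(x)
    for _ in range(r):
        g = loss_grad(model, loss_fn, x + delta, y)
        delta = delta + alpha * g / (g.norm() + 1e-12)                  # move along nu^(i)
        delta = delta * min(1.0, eps / (delta.norm().item() + 1e-12))   # project onto eps-ball
    return delta

def wrm(model, loss_fn, x, y, lam, lr=0.1, steps=15):
    """Eq. (7): gradient ascent on l(f_w(x + delta), y) - (lam/2) ||delta||^2."""
    delta = torch.zeros_like(x).requires_grad_(True)
    opt = torch.optim.SGD([delta], lr=lr)
    for _ in range(steps):
        obj = loss_fn(model(x + delta), y) - 0.5 * lam * delta.pow(2).sum()
        opt.zero_grad()
        (-obj).backward()   # SGD minimizes, so ascend by negating the objective
        opt.step()
    return delta.detach()
```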
2.3 ADVERSARIAL GENERALIZATION ERROR

The goal of adversarial training is to improve robustness against adversarial attacks not only on the training samples but also on test samples; however, the adversarial training problem (4) focuses only on the training samples. To evaluate adversarial generalization performance, we extend the notion of margin loss defined in (1) to adversarial training settings by defining the adversarial margin loss as

$$L^{\mathrm{adv}}_\gamma(f_w) = P\Big( f_w\big(X + \delta^{\mathrm{adv}}_w(X)\big)[Y] \le \gamma + \max_{j \ne Y} f_w\big(X + \delta^{\mathrm{adv}}_w(X)\big)[j] \Big). \qquad (8)$$

Here, we measure the margin loss over adversarially-perturbed samples, and we use $\hat{L}^{\mathrm{adv}}_\gamma(f_w)$ to denote the empirical adversarial margin loss. We also use $L^{\mathrm{fgm}}_\gamma(f_w)$, $L^{\mathrm{pgm}}_\gamma(f_w)$, and $L^{\mathrm{wrm}}_\gamma(f_w)$ to denote the adversarial margin losses under the FGM (5), PGM (6), and WRM (7) attacks, respectively.

3 MARGIN-BASED ADVERSARIAL GENERALIZATION BOUNDS

As previously discussed, generalization performance can differ between adversarial and non-adversarial settings. In this section, we provide generalization bounds for DNN classifiers under adversarial attacks in terms of the spectral norms of the trained DNN's weight matrices. The bounds motivate regularizing these spectral norms in order to limit the DNN's capacity and improve its generalization performance under adversarial attacks. We use the PAC-Bayes framework (McAllester, 1999; 2003) to prove our main results. To derive adversarial generalization error bounds for DNNs with smooth activation functions $\sigma$, we first extend a recent margin-based generalization bound for the ReLU activation function (Neyshabur et al., 2017a) to general 1-Lipschitz activation functions.

Theorem 1. Consider $\mathcal{F}_{\mathrm{nn}} = \{f_w : w \in \mathcal{W}\}$, the class of $d$-layer neural networks with $h$ units per hidden layer and 1-Lipschitz activation $\sigma$ satisfying $\sigma(0) = 0$. Suppose that $\mathcal{X}$, the support set of $X$, is norm-bounded as $\|x\|_2 \le B$ for all $x \in \mathcal{X}$. Also assume that for a constant $M \ge 1$ any $f_w \in \mathcal{F}_{\mathrm{nn}}$ satisfies

$$\forall i: \quad \frac{1}{M} \le \frac{\|W_i\|_2}{\beta_w} \le M, \qquad \beta_w := \Big(\prod_{i=1}^{d} \|W_i\|_2\Big)^{1/d}.$$

Here $\beta_w$ denotes the geometric mean of $f_w$'s spectral norms across all layers. Then, for any $\eta, \gamma > 0$, with probability at least $1 - \eta$, for any $f_w \in \mathcal{F}_{\mathrm{nn}}$ we have

$$L_0(f_w) \le \hat{L}_\gamma(f_w) + O\left(\sqrt{\frac{B^2 d^2 h \log(dh)\,\Phi_{\mathrm{erm}}(f_w) + d\log\frac{dn\log M}{\eta}}{\gamma^2 n}}\right),$$

where we define the complexity score $\Phi_{\mathrm{erm}}(f_w) := \prod_{i=1}^{d} \|W_i\|_2^2 \sum_{i=1}^{d} \frac{\|W_i\|_F^2}{\|W_i\|_2^2}$.

Proof. We defer the proof to the Appendix.

We now generalize this result to adversarial settings where the DNN's performance is evaluated under adversarial attacks. We prove three separate adversarial generalization error bounds for FGM, PGM, and WRM attacks. For the following results, we consider $\mathcal{F}_{\mathrm{nn}}$, the class of neural nets defined in Theorem 1. Moreover, we assume that the training loss $\ell(\hat{y}, y)$ and its first-order derivative are 1-Lipschitz. Similar to Sinha et al. (2018), we assume the activation $\sigma$ is smooth and its derivative $\sigma'$ is 1-Lipschitz. This class of activations includes the ELU (Clevert et al., 2015) and tanh functions but not the ReLU function. However, our numerical results in Table 1 in the Appendix suggest similar generalization performance for ELU and ReLU activations.

Theorem 2. Consider $\mathcal{F}_{\mathrm{nn}}$ and $\mathcal{X}$ as in Theorem 1 and a training loss function $\ell$ satisfying the assumptions stated above. We consider an FGM attack with noise power $\epsilon$ according to the Euclidean norm $\|\cdot\|_2$. For any $f_w \in \mathcal{F}_{\mathrm{nn}}$, assume $\kappa \le \|\nabla_x \ell(f_w(x), y)\|_2$ holds for a constant $\kappa > 0$, any $y \in \mathcal{Y}$, and any $x \in B_{\epsilon,2}(\mathcal{X})$, i.e. any $x$ within $\ell_2$ distance $\epsilon$ of $\mathcal{X}$, the support set of $X$. Then, for any $\eta, \gamma > 0$, with probability at least $1 - \eta$ the following bound holds for the FGM margin loss of any $f_w \in \mathcal{F}_{\mathrm{nn}}$:

$$L^{\mathrm{fgm}}_0(f_w) \le \hat{L}^{\mathrm{fgm}}_\gamma(f_w) + O\left(\sqrt{\frac{(B+\epsilon)^2 d^2 h \log(dh)\,\Phi^{\mathrm{fgm}}_{\epsilon,\kappa}(f_w) + d\log\frac{dn\log M}{\eta}}{\gamma^2 n}}\right),$$

where $\Phi^{\mathrm{fgm}}_{\epsilon,\kappa}(f_w) := \Big[\prod_{i=1}^{d} \|W_i\|_2 \Big(1 + \frac{\epsilon}{\kappa}\big(\prod_{i=1}^{d} \|W_i\|_2\big)\sum_{i=1}^{d}\prod_{j=1}^{i}\|W_j\|_2\Big)\Big]^2 \sum_{i=1}^{d} \frac{\|W_i\|_F^2}{\|W_i\|_2^2}$.

Proof. We defer the proof to the Appendix.

Note that the above theorem assumes that the rate of change of the loss function around test samples is at least $\kappa$, which provides a baseline for measuring the attack power $\epsilon$. In our numerical experiments, we validate this assumption on standard image recognition tasks.
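To make the capacity terms concrete, the following NumPy sketch (our own illustration) evaluates $\Phi_{\mathrm{erm}}$ from Theorem 1 and $\Phi^{\mathrm{fgm}}_{\epsilon,\kappa}$ as rendered above for a list of weight matrices, using exact SVDs for the spectral norms rather than power iteration. The toy comparison at the end illustrates how rescaling every layer to unit spectral norm (as SN enforces with $\beta = 1$) shrinks both scores in this example.

```python
import numpy as np

def spectral_norm(W):
    return np.linalg.norm(W, ord=2)                         # largest singular value

def phi_erm(weights):
    """Capacity term of Theorem 1: prod_i ||W_i||_2^2 * sum_i ||W_i||_F^2 / ||W_i||_2^2."""
    s = np.array([spectral_norm(W) for W in weights])
    fro = np.array([np.linalg.norm(W) for W in weights])    # Frobenius norms
    return np.prod(s**2) * np.sum(fro**2 / s**2)

def phi_fgm(weights, eps, kappa):
    """Capacity term of Theorem 2 (as rendered above)."""
    s = np.array([spectral_norm(W) for W in weights])
    fro = np.array([np.linalg.norm(W) for W in weights])
    prefix = np.cumprod(s)                                  # prod_{j<=i} ||W_j||_2
    factor = np.prod(s) * (1.0 + (eps / kappa) * np.prod(s) * np.sum(prefix))
    return factor**2 * np.sum(fro**2 / s**2)

# Toy example: a 3-layer network with random weights, before and after
# rescaling every layer to unit spectral norm.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((64, 64)) for _ in range(3)]
normalized = [W / spectral_norm(W) for W in weights]
print(phi_erm(weights), phi_erm(normalized))
print(phi_fgm(weights, eps=0.5, kappa=0.1), phi_fgm(normalized, eps=0.5, kappa=0.1))
```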
Next, we generalize this result to adversarial settings with the PGM attack, i.e. the iterative version of the FGM attack.

Theorem 3. Consider $\mathcal{F}_{\mathrm{nn}}$, $\mathcal{X}$, and a training loss function $\ell$ for which the assumptions in Theorem 2 hold. We consider a PGM attack with noise power $\epsilon$ under the Euclidean norm $\|\cdot\|_2$, $r$ attack iterations, and stepsize $\alpha$. Then, for any $\eta, \gamma > 0$, with probability at least $1 - \eta$ the following bound applies to the PGM margin loss of any $f_w \in \mathcal{F}_{\mathrm{nn}}$:

$$L^{\mathrm{pgm}}_0(f_w) \le \hat{L}^{\mathrm{pgm}}_\gamma(f_w) + O\left(\sqrt{\frac{(B+\epsilon)^2 d^2 h \log(dh)\,\Phi^{\mathrm{pgm}}_{\epsilon,\kappa,r,\alpha}(f_w) + d\log\frac{rdn\log M}{\eta}}{\gamma^2 n}}\right).$$

Here we define $\Phi^{\mathrm{pgm}}_{\epsilon,\kappa,r,\alpha}(f_w)$ as the following expression:

$$\Phi^{\mathrm{pgm}}_{\epsilon,\kappa,r,\alpha}(f_w) := \left[\prod_{i=1}^{d}\|W_i\|_2 \left(1 + \frac{\alpha}{\kappa}\cdot\frac{1 - \big((2\alpha/\kappa)\,\mathrm{lip}(\nabla\ell \circ f_w)\big)^{r}}{1 - (2\alpha/\kappa)\,\mathrm{lip}(\nabla\ell \circ f_w)} \prod_{i=1}^{d}\|W_i\|_2 \sum_{j=1}^{d}\prod_{k=1}^{j}\|W_k\|_2\right)\right]^2 \sum_{i=1}^{d}\frac{\|W_i\|_F^2}{\|W_i\|_2^2},$$

where $\mathrm{lip}(\nabla\ell \circ f_w) := \prod_{i=1}^{d}\|W_i\|_2 \sum_{i=1}^{d}\prod_{j=1}^{i}\|W_j\|_2$ provides an upper bound on the Lipschitz constant of $\nabla_x \ell(f_w(x), y)$.

Proof. We defer the proof to the Appendix.

In the above result, notice that if $\mathrm{lip}(\nabla\ell \circ f_w)/\kappa < 1/(2\alpha)$, then for any number of gradient steps the PGM margin-based generalization bound grows over the FGM generalization error bound in Theorem 2 by a factor of $1/\big(1 - (2\alpha/\kappa)\,\mathrm{lip}(\nabla\ell \circ f_w)\big)$. We next extend our adversarial generalization analysis to WRM attacks.

Theorem 4. For the neural net class $\mathcal{F}_{\mathrm{nn}}$ and training loss $\ell$ satisfying Theorem 2's assumptions, consider a WRM attack with Lagrangian coefficient $\lambda$ and Euclidean norm $\|\cdot\|_2$. Given a parameter $0 < \tau < 1$, assume $\mathrm{lip}(\nabla\ell \circ f_w)$ defined in Theorem 3 is upper-bounded by $\lambda(1-\tau)$ for any $f_w \in \mathcal{F}_{\mathrm{nn}}$. For any $\eta > 0$, the following WRM margin-based generalization bound holds with probability at least $1 - \eta$ for any $f_w \in \mathcal{F}_{\mathrm{nn}}$:

$$L^{\mathrm{wrm}}_0(f_w) \le \hat{L}^{\mathrm{wrm}}_\gamma(f_w) + O\left(\sqrt{\frac{\big(B + \frac{1}{\lambda}\prod_{i=1}^{d}\|W_i\|_2\big)^2 d^2 h \log(dh)\,\Phi^{\mathrm{wrm}}_{\lambda}(f_w) + d\log\frac{dn\log M}{\eta}}{\gamma^2 n}}\right),$$

where we define

$$\Phi^{\mathrm{wrm}}_{\lambda}(f_w) := \left[\prod_{i=1}^{d}\|W_i\|_2 \left(1 + \frac{1}{\lambda}\,\mathrm{lip}(\nabla\ell \circ f_w)\sum_{j=1}^{d}\prod_{k=1}^{j}\|W_k\|_2\right)\right]^2 \sum_{i=1}^{d}\frac{\|W_i\|_F^2}{\|W_i\|_2^2}.$$

Proof. We defer the proof to the Appendix.

As discussed by Sinha et al. (2018), the condition $\mathrm{lip}(\nabla\ell \circ f_w) < \lambda$ on the actual Lipschitz constant of $\nabla_x\ell(f_w(x), y)$ is in fact required to guarantee WRM's convergence to the global solution. Notice that the WRM generalization error bound in Theorem 4 is bounded by the product of $\frac{1}{\lambda}\,\mathrm{lip}(\nabla\ell \circ f_w)$ and the FGM generalization bound in Theorem 2.

4 SPECTRAL NORMALIZATION OF CONVOLUTIONAL LAYERS

To control the Lipschitz constant of our trained network, we need to ensure that the spectral norm associated with each linear operation in the network does not exceed some pre-specified $\beta$. For fully-connected layers (i.e. regular matrix multiplication), please see Appendix B. For a general class of linear operations including convolution, Tsuzuku et al. (2018) propose to compute the operation's spectral norm by computing the gradient of the Euclidean norm of the operation's output. Here, we leverage the deconvolution operation to further simplify and accelerate computing the spectral norm of the convolution operation. Additionally, Sedghi et al. (2018) develop a method for computing all the singular values, including the largest one, i.e. the spectral norm. While elegant, the method only applies to convolution filters with stride 1 and zero padding. However, in practice the normalization factor depends on the stride size and padding scheme governing the convolution operation. Here we develop an efficient approach for computing the maximum singular value, i.e. the spectral norm, of convolutional layers with arbitrary stride and padding schemes.
As also discussed by Gouk et al. (2018), the $i$th output feature map $\psi_i$ of a convolutional layer is a linear operation of the input $X$:

$$\psi_i(X) = \sum_{j=1}^{M} F_{i,j} * X_j,$$

where $X$ has $M$ feature maps, $F_{i,j}$ is a filter, and $*$ denotes the convolution operation (which also encapsulates the stride size and padding scheme). For simplicity, we ignore the additive bias terms here. By vectorizing $X$ and letting $V_{i,j}$ represent the overall linear operation associated with $F_{i,j}$, we see that $\psi_i(X) = [V_{i,1} \cdots V_{i,M}]\,X$, and therefore the overall convolution operation with $N$ output feature maps can be described using

$$W = \begin{bmatrix} V_{1,1} & \cdots & V_{1,M} \\ \vdots & \ddots & \vdots \\ V_{N,1} & \cdots & V_{N,M} \end{bmatrix}.$$

While explicitly reconstructing $W$ is expensive, we can still compute $\sigma(W)$, the spectral norm of $W$, by leveraging the convolution transpose operation implemented in several modern deep learning packages. This allows us to efficiently perform matrix multiplication with $W^T$ without explicitly constructing $W$. Therefore we can approximate $\sigma(W)$ using a modified version of power iteration (Algorithm 1), wrapping the appropriate stride size and padding arguments into the convolution and convolution transpose operations. After obtaining $\sigma(W)$, we compute $W_{\mathrm{SN}}$ in the same manner as for the fully-connected layers. Like Miyato et al., we exploit the fact that SGD only makes small updates to $W$ from training step to training step, reusing the same $u$ and running only one iteration per step. Unlike Miyato et al., rather than enforcing $\sigma(W) = \beta$, we instead enforce the looser constraint $\sigma(W) \le \beta$:

$$W_{\mathrm{SN}} = W / \max\big(1, \sigma(W)/\beta\big), \qquad (9)$$

which we observe to result in faster training for supervised learning tasks.

Algorithm 1 Convolutional power iteration
  Initialize u with a random vector matching the shape of the convolution input
  for t = 0, ..., T-1 do
    v <- conv(W, u) / ||conv(W, u)||_2
    u <- conv_transpose(W, v) / ||conv_transpose(W, v)||_2
  end for
  sigma <- <v, conv(W, u)>
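The following PyTorch sketch illustrates Algorithm 1 together with the normalization step (9). It is our own illustration rather than the paper's TensorFlow code: instead of calling a convolution-transpose primitive directly, it applies $W^T$ through autograd (mathematically the same operation), which automatically respects whatever stride and padding the forward convolution uses; the function names and the example kernel are ours.

```python
import torch
import torch.nn.functional as F

def conv_operator_norm(weight, input_shape, stride=1, padding=0, n_iters=1, u=None):
    """Algorithm 1: power iteration on the linear map x -> conv2d(x, weight).

    weight:      (out_ch, in_ch, k, k) convolution kernel
    input_shape: (in_ch, H, W) shape of one layer input
    Returns an estimate of sigma(W) and the updated u (reusable across SGD steps).
    """
    if u is None:
        u = torch.randn(1, *input_shape)
    for _ in range(n_iters):
        u = u.detach().requires_grad_(True)
        y = F.conv2d(u, weight, stride=stride, padding=padding)   # conv(W, u)
        v = (y / y.norm()).detach()                               # v <- conv(W,u)/||.||
        # conv_transpose(W, v) obtained as the gradient of <conv(W,u), v> w.r.t. u
        (wt_v,) = torch.autograd.grad((y * v).sum(), u)
        u = wt_v / wt_v.norm()                                    # u <- W^T v / ||W^T v||
    y = F.conv2d(u.detach(), weight, stride=stride, padding=padding)
    sigma = (y * v).sum()                                         # sigma <- <v, conv(W, u)>
    return sigma, u.detach()

def spectrally_normalize(weight, sigma, beta):
    """Eq. (9): enforce sigma(W) <= beta via W_SN = W / max(1, sigma(W)/beta)."""
    return weight / torch.clamp(sigma / beta, min=1.0)

# Example: a 3x3 kernel with stride 2 and padding 1 applied to 32x32 RGB inputs.
kernel = torch.randn(16, 3, 3, 3)
sigma, u = conv_operator_norm(kernel, input_shape=(3, 32, 32), stride=2, padding=1, n_iters=50)
kernel_sn = spectrally_normalize(kernel, sigma, beta=1.0)
```

During training, one would store `u` with the layer and run a single iteration per step, since SGD changes the kernel only slightly between steps.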
5 NUMERICAL EXPERIMENTS

In this section we provide an array of empirical experiments to validate both the bounds we derived in Section 3 and our implementation of spectral normalization described in Section 4. We show that spectral normalization improves both test accuracy and generalization for a variety of adversarial training schemes, datasets, and network architectures. All experiments are implemented in TensorFlow (Abadi et al., 2016). For each experiment, we cross-validate 4 to 6 values of $\beta$ (see (9)) using a fixed validation set of 500 samples. For PGM, we used $r = 15$ iterations and $\alpha = 2\epsilon/r$. Additionally, for FGM and PGM we used $\ell_2$-type attacks (unless specified otherwise) with magnitude $\epsilon = 0.05\,\mathbb{E}_{\hat{P}}[\|X\|_2]$ (this value was approximately 2.44 for CIFAR10). For WRM, we implemented gradient ascent as discussed by Sinha et al. (2018). Additionally, for WRM training we used a Lagrangian coefficient of $0.002\,\mathbb{E}_{\hat{P}}[\|X\|_2]$ for CIFAR10 and SVHN and a Lagrangian coefficient of $0.04\,\mathbb{E}_{\hat{P}}[\|X\|_2]$ for MNIST, in a similar manner to Sinha et al. (2018). The code will be made readily available.

5.1 VALIDATION OF SPECTRAL NORMALIZATION IMPLEMENTATION AND BOUNDS

We first demonstrate the effect of the proposed spectral normalization approach on the final DNN weights by comparing the $\ell_2$ norm of the input $x$ to that of the output $f_w(x)$. As shown in Figure 2(a), without spectral normalization ($\beta = \infty$ in (9)), the norm gain can be large. Additionally, because we are using the cross-entropy loss, the weights (and therefore the norm gain) can grow arbitrarily large if we continue training, as reported by Neyshabur et al. (2017b). As we decrease $\beta$, however, we produce more constrained networks, resulting in a decrease in norm gain. At $\beta = 1$, the gain of the network cannot be greater than 1, which is consistent with what we observe. Additionally, we provide a comparison of our method to that of Miyato et al. (2018) in Appendix A.1, empirically demonstrating that Miyato et al.'s method does not properly control the spectral norm of convolutional layers, resulting in worse generalization performance.

Figure 2(b) shows that the $\ell_2$ norms of the gradients with respect to the training samples are nicely distributed after spectral normalization. Additionally, this figure suggests that the minimum gradient $\ell_2$-norm assumption (the $\kappa$ condition in Theorems 2 and 3) holds for spectrally-normalized networks.

The first column of Figure 3 shows that, as observed by Bartlett et al. (2017), AlexNet trained using ERM generates similar margin distributions for both random and true labels on CIFAR10 unless we normalize the margins appropriately. We see that even without further correction, ERM training with SN allows AlexNet to have distinguishable performance between the two datasets. This observation suggests that SN as a regularization scheme enforces the generalization error bounds shown for spectrally-normalized DNNs by Bartlett et al. (2017) and Neyshabur et al. (2017a). Additionally, the margin normalization factor (the capacity norm $\Phi$ in Theorems 1-4) is much smaller for networks trained with SN. As demonstrated by the other columns in Figure 3, a smaller normalization factor results in larger normalized margin values and much tighter margin-based generalization bounds (by a factor of $10^2$ for ERM and a factor of $10^5$ for FGM and PGM; see Theorems 1-4).

Figure 2: Validation of the SN implementation and distribution of the gradient norms using AlexNet trained on CIFAR10. (a) $\ell_2$ norm gain $\|f(x)\|/\|x\|$ due to the network $f$ trained using ERM. (b) Distributions of the $\ell_2$ norms of the gradients $\|\nabla_x \ell(f_w(x))\|_2$ with respect to training samples, with training regularized by SN.

5.2 SPECTRAL NORMALIZATION IMPROVES GENERALIZATION AND ADVERSARIAL ROBUSTNESS

The phenomenon of overfitting random labels described by Zhang et al. (2016) can be observed even for adversarial training methods. Figure 4 shows how the FGM, PGM, and WRM adversarial training schemes only slightly delay the rate at which AlexNet fits random labels on CIFAR10, and therefore the generalization gap can be quite large without proper regularization. After introducing spectral normalization, however, we see that the network has a much harder time fitting both the random and true labels. With the proper amount of SN (chosen via cross validation), we can obtain networks that struggle to fit random labels while still obtaining the same or better test performance on true labels.

We also observe that training schemes regularized with SN result in networks more robust to adversarial attacks. Figure 5 shows that even without adversarial training, AlexNet with SN becomes more robust to FGM, PGM, and WRM attacks.
Adversarial training improves adversarial robustness more than SN by itself; however, we see that we can further improve the robustness of the trained networks significantly by combining SN with adversarial training.

Figure 3: Effect of SN on distributions of unnormalized (leftmost column) and normalized (other three columns) margins for AlexNet fit on CIFAR10, for true and random labels; panels compare ERM, FGM, and PGM training with and without SN. The normalization factor is the capacity norm $\Phi$ reported in Theorems 1-4.

Figure 4: Fitting random and true labels on CIFAR10 with AlexNet using adversarial training (panels: random labels, random labels (SN), true labels (SN); curves: ERM, FGM, PGM, WRM).

Figure 5: Robustness of AlexNet trained on CIFAR10 to various adversarial attacks (each panel pairs a training scheme, ERM, FGM $\ell_2$, PGM $\ell_2$, or WRM, with FGM, PGM, or WRM attacks).

5.3 OTHER DATASETS AND ARCHITECTURES

We demonstrate the power of regularization via SN on several combinations of datasets, network architectures, and adversarial training schemes. The datasets we evaluate are CIFAR10, MNIST, and SVHN. We fit CIFAR10 using the AlexNet and Inception networks described by Zhang et al. (2016), 1-hidden-layer and 2-hidden-layer multilayer perceptrons (MLPs) with ELU activation and 512 hidden nodes in each layer, and the ResNet architecture (He et al., 2016) provided in TensorFlow for fitting CIFAR10. We fit MNIST using the ELU network described by Sinha et al. (2018) and the 1-hidden-layer and 2-hidden-layer MLPs. Finally, we fit SVHN using the same AlexNet architecture we used to fit CIFAR10. Our implementations do not use any additional regularization schemes, including weight decay, dropout (Srivastava et al., 2014), and batch normalization (Ioffe & Szegedy, 2015), as these approaches are not motivated by the theory developed in this work; however, we provide numerical experiments comparing the proposed approach with weight decay, dropout, and batch normalization in Appendix A.2.

Table 1 in the Appendix reports the pre- and post-SN test accuracies for all 42 combinations evaluated. Figure 1 in the Introduction and Figures 7-9 in the Appendix show examples of training and validation curves for some of these combinations. We see that the validation curve generally improves after regularization with SN, and the observed improvements in validation accuracy are confirmed by the test accuracies reported in Table 1. Figure 6 visually summarizes Table 1, showing how SN can often significantly improve the test accuracy (and therefore decrease the generalization gap) for several of the combinations. We also provide Table 2 in the Appendix, which shows the proportional increase in training time after introducing SN with our TensorFlow implementation.

Figure 6: Test accuracy improvement after SN for various datasets and network architectures (panels: ERM, FGM, PGM, and WRM training; gain in test accuracy after SN plotted against test accuracy for CIFAR10, MNIST, and SVHN).
6 RELATED WORKS

Providing theoretical guarantees for the adversarial robustness of various classifiers has been studied in multiple works. Wang et al. (2017) analyze the adversarial robustness of the nearest neighbor approach. Gilmer et al. (2018) study the effect of the complexity of the data-generating manifold on the final adversarial robustness of a specific trained model. Fawzi et al. (2018) prove lower bounds on the complexity of robust learning in adversarial settings, targeting the population distribution of the data. Xu et al. (2009) show that the regularized support vector machine (SVM) can be interpreted via robust optimization. Fawzi et al. (2016) analyze the robustness of a fixed classifier to random and adversarial perturbations of the input data. While all of these works seek to understand the robustness properties of different classification function classes, unlike our work they do not focus on the generalization aspects of learning over DNNs under adversarial attacks.

Concerning the generalization aspect of adversarial training, Sinha et al. (2018) provide optimization and generalization guarantees for WRM under the assumptions discussed after Theorem 4. However, their generalization guarantee applies only to the Wasserstein cost function, which is different from the 0-1 or margin loss and does not explicitly suggest a regularization scheme. In a recent related work, Schmidt et al. (2018) numerically show the wide generalization gap in PGM adversarial training and theoretically establish lower bounds on the sample complexity of linear classifiers in Gaussian settings. While our work does not provide sample complexity lower bounds, we study the broader function class of DNNs, for which we provide upper bounds on the adversarial generalization error and suggest an explicit regularization scheme for adversarial training of DNNs.

Generalization in deep learning has been a topic of great interest in machine learning (Zhang et al., 2016). In addition to margin-based bounds (Bartlett et al., 2017; Neyshabur et al., 2017a), various other tools, including the VC dimension (Anthony & Bartlett, 2009), norm-based capacity scores (Bartlett & Mendelson, 2002; Neyshabur et al., 2015), and flatness of local minima (Keskar et al., 2016; Neyshabur et al., 2017b), have been used to analyze generalization properties of DNNs. Recently, Arora et al. (2018) introduced a compression approach to further improve the margin-based bounds presented by Bartlett et al. (2017) and Neyshabur et al. (2017a). The PAC-Bayes bound has also been considered and computed by Dziugaite & Roy (2017), resulting in non-vacuous bounds for the MNIST dataset.

ACKNOWLEDGMENTS

We are grateful for support under National Science Foundation grant CCF-1563098 and from the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Martin Anthony and Peter L Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, pp. 274 283, 2018. Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463 482, 2002. Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241 6250, 2017. Nicholas Carlini and David Wagner. Defensive distillation is not robust to adversarial examples. ar Xiv preprint ar Xiv:1607.04311, 2016. Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39 57. IEEE, 2017. Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. ar Xiv preprint ar Xiv:1704.08847, 2017. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). ar Xiv preprint ar Xiv:1511.07289, 2015. Gintare Karolina Dziugaite and Daniel M Roy. Entropy-sgd optimizes the prior of a pac-bayes bound: Datadependent pac-bayes priors via differential privacy. ar Xiv preprint ar Xiv:1712.09376, 2017. Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pp. 1632 1640, 2016. Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. ar Xiv preprint ar Xiv:1802.08686, 2018. Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. ar Xiv preprint ar Xiv:1801.02774, 2018. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672 2680, 2014a. Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. ar Xiv preprint ar Xiv:1412.6572, 2014b. Henry Gouk, Eibe Frank, Bernhard Pfahringer, and Michael Cree. Regularisation of neural networks by enforcing lipschitz continuity. ar Xiv preprint ar Xiv:1804.04368, 2018. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pp. 630 645. Springer, 2016. Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ar Xiv preprint ar Xiv:1502.03167, 2015. Published as a conference paper at ICLR 2019 Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. ar Xiv preprint ar Xiv:1609.04836, 2016. Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. ar Xiv preprint ar Xiv:1611.01236, 2016. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. David Mc Allester. 
Simplified pac-bayesian margin bounds. In Learning theory and Kernel machines, pp. 203 215. Springer, 2003. David A Mc Allester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pp. 164 170. ACM, 1999. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. ar Xiv preprint ar Xiv:1802.05957, 2018. Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, pp. 1376 1401, 2015. Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. ar Xiv preprint ar Xiv:1707.09564, 2017a. Behnam Neyshabur, Srinadh Bhojanapalli, David Mc Allester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 5949 5958, 2017b. Nicolas Papernot, Patrick Mc Daniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582 597. IEEE, 2016. Nicolas Papernot, Patrick Mc Daniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506 519. ACM, 2017. Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander M adry. Adversarially robust generalization requires more data. ar Xiv preprint ar Xiv:1804.11285, 2018. Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. ar Xiv preprint ar Xiv:1805.10408, 2018. Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1): 1929 1958, 2014. Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013. Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick Mc Daniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018. Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389 434, 2012. Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. ar Xiv preprint ar Xiv:1802.04034, 2018. Yizhen Wang, Somesh Jha, and Kamalika Chaudhuri. Analyzing the robustness of nearest neighbors to adversarial examples. ar Xiv preprint ar Xiv:1706.03922, 2017. Published as a conference paper at ICLR 2019 Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. 
Journal of Machine Learning Research, 10(Jul):1485 1510, 2009. Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. ar Xiv preprint ar Xiv:1705.10941, 2017. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ar Xiv preprint ar Xiv:1611.03530, 2016. Published as a conference paper at ICLR 2019 A FURTHER EXPERIMENTAL RESULTS Table 1: Train and test accuracies before and after spectral normalization for various datasets, network architectures, and training schemes. The amount of spectral normalization was selected from 4-6 values of β via cross validation on 500 samples. For each row, the greater test accuracy is bolded (both are bolded in the event of a tie). ℓ adversarial training was performed with magnitude 0.1. Dataset Architecture Training Train acc Test acc Train acc (SN) Test acc (SN) CIFAR10 Alex Net ERM 1.00 0.79 1.00 0.79 CIFAR10 Alex Net FGM ℓ2 0.98 0.54 0.93 0.63 CIFAR10 Alex Net FGM ℓ 1.00 0.51 0.67 0.56 CIFAR10 Alex Net PGM ℓ2 0.99 0.50 0.92 0.62 CIFAR10 Alex Net PGM ℓ 0.99 0.44 0.86 0.54 CIFAR10 Alex Net WRM 1.00 0.61 0.76 0.65 CIFAR10 ELU-Alex Net ERM 1.00 0.79 1.00 0.79 CIFAR10 ELU-Alex Net FGM ℓ2 0.97 0.52 0.68 0.60 CIFAR10 ELU-Alex Net PGM ℓ2 0.98 0.53 0.88 0.61 CIFAR10 ELU-Alex Net WRM 1.00 0.60 1.00 0.60 CIFAR10 Inception ERM 1.00 0.85 1.00 0.86 CIFAR10 Inception PGM ℓ2 0.99 0.53 1.00 0.58 CIFAR10 Inception PGM ℓ 0.98 0.48 0.62 0.56 CIFAR10 Inception WRM 1.00 0.66 1.00 0.67 CIFAR10 1-layer MLP ERM 0.98 0.49 0.68 0.53 CIFAR10 1-layer MLP FGM ℓ2 0.60 0.36 0.60 0.46 CIFAR10 1-layer MLP PGM ℓ2 0.57 0.36 0.55 0.46 CIFAR10 1-layer MLP WRM 0.60 0.41 0.62 0.50 CIFAR10 2-layer MLP ERM 0.99 0.51 0.79 0.56 CIFAR10 2-layer MLP FGM ℓ2 0.57 0.36 0.66 0.49 CIFAR10 2-layer MLP PGM ℓ2 0.93 0.35 0.66 0.48 CIFAR10 2-layer MLP WRM 0.87 0.35 0.73 0.52 CIFAR10 Res Net ERM 1.00 0.80 1.00 0.83 CIFAR10 Res Net PGM ℓ2 0.99 0.49 1.00 0.55 CIFAR10 Res Net PGM ℓ 0.98 0.44 0.72 0.53 CIFAR10 Res Net WRM 1.00 0.63 1.00 0.66 MNIST ELU-Net ERM 1.00 0.99 1.00 0.99* MNIST ELU-Net FGM ℓ2 0.98 0.97 1.00 0.97 MNIST ELU-Net PGM ℓ2 0.99 0.97 1.00 0.97 MNIST ELU-Net WRM 0.95 0.92 0.95 0.93 MNIST 1-layer MLP ERM 1.00 0.98 1.00 0.98* MNIST 1-layer MLP FGM ℓ2 0.88 0.88 1.00 0.96 MNIST 1-layer MLP PGM ℓ2 1.00 0.96 1.00 0.96 MNIST 1-layer MLP WRM 0.92 0.88 0.92 0.88 MNIST 2-layer MLP ERM 1.00 0.98 1.00 0.98 MNIST 2-layer MLP FGM ℓ2 0.97 0.91 1.00 0.96 MNIST 2-layer MLP PGM ℓ2 1.00 0.96 1.00 0.97 MNIST 2-layer MLP WRM 0.97 0.88 0.98 0.90 SVHN Alex Net ERM 1.00 0.93 1.00 0.93* SVHN Alex Net FGM ℓ2 0.97 0.76 0.95 0.83 SVHN Alex Net PGM ℓ2 1.00 0.78 0.85 0.81 SVHN Alex Net WRM 1.00 0.83 0.87 0.84 * β = (i.e. no spectral normalization) achieved the highest validation accuracy. Published as a conference paper at ICLR 2019 Table 2: Runtime increase after introducing spectral normalization for various datasets, network architectures, and training schemes. These ratios were obtained by running the experiments on one NVIDIA Titan Xp GPU for 40 epochs. 
Dataset Architecture Training no SN runtime SN runtime ratio CIFAR10 Alex Net ERM 229 s 283 s 1.24 CIFAR10 Alex Net FGM ℓ2 407 s 463 s 1.14 CIFAR10 Alex Net FGM ℓ 408 s 465 s 1.14 CIFAR10 Alex Net PGM ℓ2 2917 s 3077 s 1.05 CIFAR10 Alex Net PGM ℓ 2896 s 3048 s 1.05 CIFAR10 Alex Net WRM 3076 s 3151 s 1.02 CIFAR10 ELU-Alex Net ERM 231 s 283 s 1.23 CIFAR10 ELU-Alex Net FGM ℓ2 410 s 466 s 1.14 CIFAR10 ELU-Alex Net PGM ℓ2 2939 s 3093 s 1.05 CIFAR10 ELU-Alex Net WRM 3094 s 3150 s 1.02 CIFAR10 Inception ERM 632 s 734 s 1.16 CIFAR10 Inception PGM ℓ2 9994 s 6082 s 0.61 CIFAR10 Inception PGM ℓ 9948 s 6063 s 0.61 CIFAR10 Inception WRM 10247 s 6356 s 0.62 CIFAR10 1-layer MLP ERM 22 s 31 s 1.42 CIFAR10 1-layer MLP FGM ℓ2 25 s 35 s 1.43 CIFAR10 1-layer MLP PGM ℓ2 79 s 93 s 1.18 CIFAR10 1-layer MLP WRM 73 s 86 s 1.18 CIFAR10 2-layer MLP ERM 23 s 37 s 1.59 CIFAR10 2-layer MLP FGM ℓ2 27 s 41 s 1.51 CIFAR10 2-layer MLP PGM ℓ2 91 s 108 s 1.19 CIFAR10 2-layer MLP WRM 85 s 103 s 1.21 CIFAR10 Res Net ERM 315 s 547 s 1.73 CIFAR10 Res Net PGM ℓ2 2994 s 3300 s 1.10 CIFAR10 Res Net PGM ℓ 2980 s 3300 s 1.11 CIFAR10 Res Net WRM 3187 s 3457 s 1.08 MNIST ELU-Net ERM 55 s 97 s 1.76 MNIST ELU-Net FGM ℓ2 91 s 136 s 1.49 MNIST ELU-Net PGM ℓ2 614 s 676 s 1.10 MNIST ELU-Net WRM 635 s 670 s 1.06 MNIST 1-layer MLP ERM 15 s 24 s 1.60 MNIST 1-layer MLP FGM ℓ2 17 s 27 s 1.57 MNIST 1-layer MLP PGM ℓ2 57 s 71 s 1.24 MNIST 1-layer MLP WRM 51 s 63 s 1.24 MNIST 2-layer MLP ERM 17 s 31 s 1.84 MNIST 2-layer MLP FGM ℓ2 20 s 35 s 1.77 MNIST 2-layer MLP PGM ℓ2 67 s 89 s 1.32 MNIST 2-layer MLP WRM 62 s 81 s 1.30 SVHN Alex Net ERM 334 s 412 s 1.23 SVHN Alex Net FGM ℓ2 596 s 676 s 1.13 SVHN Alex Net PGM ℓ2 4270 s 4495 s 1.05 SVHN Alex Net WRM 4501 s 4572 s 1.02 Published as a conference paper at ICLR 2019 0 25000 50000 75000 training steps Alex Net FGM ( = 0.1) train valid train (SN) valid (SN) 0 25000 50000 75000 training steps Alex Net PGM ( = 0.1) 0 25000 50000 75000 training steps Alex Net FGM ( = 0.3) 0 25000 50000 75000 training steps Alex Net PGM ( = 0.3) Figure 7: Adversarial training performance with and without spectral normalization for Alex Net fit on CIFAR10. 0 25000 50000 75000 training steps Inception PGM train valid train (SN) valid (SN) 0 25000 50000 75000 training steps Inception WRM 0 25000 50000 75000 training steps Res Net PGM 0 20000 40000 training steps Res Net WRM Figure 8: Adversarial training performance with and without spectral normalization for Inception and Res Net fit on CIFAR10. 0 20000 40000 60000 80000 training steps ELU-Alex Net FGM train valid train (SN) valid (SN) 0 20000 40000 60000 80000 training steps ELU-Alex Net PGM 0 20000 40000 60000 80000 training steps ELU-Alex Net WRM Figure 9: Adversarial training performance with and without spectral normalization for Alex Net with ELU activation functions fit on CIFAR10. A.1 COMPARISON OF PROPOSED METHOD TO MIYATO ET AL. (2018) S METHOD For the optimal β chosen when fitting Alex Net to CIFAR10 with PGM, we repeat the experiment using the spectral normalization approach suggested by Miyato et al. (2018). This approach performs spectral normalization on convolutional layers by scaling the convolution kernel by the spectral norm of the kernel rather than the spectral norm of the overall convolution operation. Because it does not account for how the kernel can amplify perturbations in a single pixel multiple times (see Section 4), it does not properly control the spectral norm. 
In Figure 10, we see that for the optimal $\beta$ reported in the main text, using Miyato et al. (2018)'s SN method results in worse generalization performance. This is because although we specified $\beta = 1.6$, the actual $\beta$ obtained using Miyato et al. (2018)'s method can be much greater for convolutional layers, resulting in overfitting (hence the training curve quickly approaches 1.0 accuracy). The AlexNet architecture used has two convolutional layers. For the proposed method, the final spectral norms of the convolutional layers were both 1.60; for Miyato et al. (2018)'s method, the final spectral norms of the convolutional layers were 7.72 and 7.45, despite the corresponding convolution kernels having spectral norms of 1.60. Our proposed method is less computationally efficient than Miyato et al. (2018)'s approach because each power iteration step requires a convolution operation rather than a division operation. As shown in Table 3, the proposed approach is not significantly less efficient with our TensorFlow implementation.

Figure 10: Adversarial training performance with the proposed SN versus Miyato et al. (2018)'s SN for AlexNet fit on CIFAR10 using PGM ($\beta = 1.6$ in both cases; curves: train, valid). The final train and validation accuracies for the proposed method are 0.92 and 0.60. The final train and validation accuracies for Miyato et al. (2018)'s method are 1.00 and 0.55.

Table 3: Runtime increase of the proposed spectral normalization approach compared to Miyato et al. (2018)'s approach for CIFAR10 and various network architectures and training schemes (ratio of proposed SN runtime to Miyato SN runtime). These ratios were obtained by running the experiments on one NVIDIA Titan Xp GPU for 40 epochs.

Dataset    Architecture  Training   Proposed SN runtime / Miyato SN runtime
CIFAR10    AlexNet       ERM        1.11
CIFAR10    AlexNet       FGM ℓ2     1.06
CIFAR10    AlexNet       FGM ℓ∞     1.11
CIFAR10    AlexNet       PGM ℓ2     1.01
CIFAR10    AlexNet       PGM ℓ∞     1.11
CIFAR10    AlexNet       WRM        1.02
CIFAR10    Inception     ERM        0.98
CIFAR10    Inception     PGM ℓ2     1.04
CIFAR10    Inception     PGM ℓ∞     1.06
CIFAR10    Inception     WRM        1.03

A.2 COMPARISON OF PROPOSED METHOD TO WEIGHT DECAY, DROPOUT, AND BATCH NORMALIZATION

Figure 11: Adversarial training performance with the proposed SN versus batch normalization, weight decay, and dropout for AlexNet fit on CIFAR10 using PGM. The dropout rate was 0.8, and the amount of weight decay was 5e-4 for all weights. The leftmost plot is from Figure 1 and compares the final performance of no regularization (train accuracy 1.00, validation accuracy 0.48) to that of SN (train accuracy 0.92, validation accuracy 0.60). The final train and validation accuracies for batch normalization are 1.00 and 0.54; for weight decay, 0.84 and 0.55; and for dropout, 0.99 and 0.52.

B SPECTRAL NORMALIZATION OF FULLY-CONNECTED LAYERS

For fully-connected layers, we approximate the spectral norm of a given matrix $W$ using the approach described by Miyato et al. (2018): the power iteration method.
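A minimal NumPy sketch of this procedure, which the remainder of this appendix describes in detail, is given below. It is our own illustration (the one-shot helper, the SVD sanity check, and the default of a single iteration per call are ours), combining the power-iteration estimate of $\sigma(W)$ with the normalization step (9).

```python
import numpy as np

def power_iteration_sn(W, beta, u=None, n_iters=1):
    """Estimate sigma(W) by power iteration and return W_SN = W / max(1, sigma(W)/beta).

    In training, u is stored with the layer and a single iteration is run
    per SGD step, since W changes only slightly between steps.
    """
    if u is None:
        u = np.random.randn(W.shape[1])
    for _ in range(n_iters):
        v = W @ u
        v = v / np.linalg.norm(v)          # v <- W u / ||W u||
        u = W.T @ v
        u = u / np.linalg.norm(u)          # u <- W^T v / ||W^T v||
    sigma = v @ (W @ u)                    # sigma(W) ~= v^T W u
    W_sn = W / max(1.0, sigma / beta)
    return W_sn, sigma, u

# Sanity check against the exact spectral norm from the SVD.
W = np.random.randn(256, 128)
W_sn, sigma, _ = power_iteration_sn(W, beta=1.0, n_iters=100)
print(sigma, np.linalg.svd(W, compute_uv=False)[0])
print(np.linalg.svd(W_sn, compute_uv=False)[0])   # <= beta, up to estimation error
```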
For each W, we randomly initialize a vector u and approximate both the left and right singular vectors by iterating the update rules v W u/ W u 2 u W T v/ W T v 2. The final singular value can be approximated with σ(W) v T W u. Like Miyato et al., we exploit the fact that SGD only makes small updates to W from training step to training step, reusing the same u and running only one iteration per step. Unlike Miyato et al., rather than enforcing σ(W) = β, we instead enforce the looser constraint σ(W) β as described by Gouk et al. (2018): WSN = W/ max(1, σ(W)/β), which we observe to result in faster training in practice for supervised learning tasks. C.1 PROOF OF THEOREM 1 First let us quote the following two lemmas from (Neyshabur et al., 2017a). Lemma 1 (Neyshabur et al. (2017a)). Consider Fnn = {fw : w W} as the class of neural nets parameterized by w where each fw maps input x X to Rm. Let Q be a distribution on parameter vector chosen independently from the n training samples. Then, for each η > 0 with probability at least 1 η for any w and any random perturbation u satisfying Pru maxx X fw+u(x) fw(x) γ L0(fw) b Lγ(fw) + 4 KL(Pw+u Q) + log 6n η n 1 . (10) Published as a conference paper at ICLR 2019 Lemma 2 (Neyshabur et al. (2017a)). Consider a d-layer neural net fw with 1-Lipschitz activation function σ where σ(0) = 0. Then for any norm-bounded input x 2 B and weight perturbation u : Ui 2 1 d Wi 2, we have the following perturbation bound: fw+u(x) fw(x) 2 e B d Y Ui 2 Wi 2 . (11) To prove Theorem 1, consider f ew with weights ew. Since (1 + 1 d)d e and 1 d)d 1, for any weight vector w such that Wi 2 f Wi 2 1 d f Wi 2 for every i we have: (1/e) d d 1 i=1 f Wi 2. (12) We apply Lemma 1, choosing Q to be a zero-mean multivariate Gaussian distribution with diagonal covariance matrix, where each entry of the ith layer Ui has standard deviation ξi = f Wi 2 β e w ξ with ξ chosen later in the proof. Note that βw defined earlier in the theorem is the geometric average of spectral norms across all layers. Then for the ith layer s random perturbation vector ui N(0, ξ2 i I), we get the following bound from (Tropp, 2012) with h representing the width of the ith hidden layer: Pr β ew Ui 2 f Wi 2 > t 2h exp( t2 2hξ2 ). (13) We now use a union bound over all layers for a maximum union probability of 1/2, which implies the normalized β ew Ui 2 f Wi 2 for each layer is upper-bounded by ξ p 2h log(4hd). Then for any w satisfying Wi 2 f Wi 2 1 d f Wi 2 for all i s max x 2 B fw+u(x) fw(x) 2 e B d Y (a) e2B d Y Ui 2 f Wi 2 = e2Bβd 1 ew i=1 β ew Ui 2 f Wi 2 e2d Bβd 1 ew ξ p 2h log(4hd). (14) Here (a) holds, since 1 Wj Qd i=1 Wi 2 e f Wj Qd i=1 f Wi 2 is true for each j. Hence we choose ξ = γ 30d Bβd 1 e w h log(4hd) for which the perturbation vector satisfies the assumptions of Lemma 2. Then, we bound the KL-divergence term in Lemma 1 as Wi 2 F 2ξ2 i = 302d2B2β2d ew h log(4hd) 2γ2 Wi 2 F f Wi 2 2 (b) 302e2d2B2 Qd i=1 Wi 2 2h log(4hd) 2γ2 Wi 2 F Wi 2 2 = O d2B2h log(hd) Qd i=1 Wi 2 2 γ2 Wi 2 F Wi 2 2 Note that (b) holds, because we assume Wi 2 f Wi 2 1 d f Wi 2 implying 1 f Wj Qd i=1 f Wi 2 (1 1 d) (d 1) 1 Wj Qd i=1 Wi 2 e Wj Qd i=1 Wi 2 for each j. 
There- fore, Lemma 1 implies with probability 1 η we have the following bound hold for any w satisfying Published as a conference paper at ICLR 2019 Wi 2 f Wi 2 1 d f Wi 2 for all i s, L0(fw) b Lγ(fw) + O s B2d2h log(dh)Φerm(fw) + log n Then, we can give an upper-bound over all the functions in Fnn by finding the covering number of the set of ew s where for each feasible w we have the mentioned condition satisfied for at least one of ew s. We only need to form the bound for ( γ 2B )1/d βw ( γ n 2B )1/d which can be covered using a cover of size dn1/2d as discussed in (Neyshabur et al., 2017a). Then, from the theorem s assumption we know each Wi 2 will be in the interval [ 1 M βw, Mβw] which we want to cover such that for any β in the interval there exists a eβ satisfying |β eβ| eβ/d. For this purpose we can use a cover of size 2 log1+1/d M 2(d + 1) log M,1 which combined for all i s gives a cover with size O((d log M)d) whose logarithm is growing as d log(d log M). This together with (15) completes the proof. C.2 PROOF OF THEOREM 2 We start by proving the following lemmas providing perturbation bound for FGM attacks. Lemma 3. Consider a d-layer neural net fw with 1-Lipschitz and 1-smooth (1-Lipschitz derivative) activation σ where σ(0) = 0. Let training loss ℓ: (Rm, Y) R also be 1-Lipschitz and 1-smooth for any fixed label y Y. Then, for any input x, label y, and perturbation vector u satisfying i : Ui 2 1 d Wi 2 we have xℓ fw+u(x), y) xℓ fw(x), y) 2 (16) i=1 Wi 2 d X Wi 2 + x 2 i Y j=1 Wj 2 i X Proof. Since for a fixed y ℓsatisfies the same Lipschitzness and smoothness properties as σ, then zℓ(z, y) 2 1 and applying the chain rule implies: xℓ fw+u(x), y) xℓ fw(x), y) 2 = xfw+u(x) ( ℓ) fw+u(x), y) xfw(x) ( ℓ) fw(x), y) 2 xfw+u(x) ( ℓ) fw+u(x), y) xfw(x) ( ℓ) fw+u(x), y) 2 + xfw(x) ( ℓ) fw+u(x), y) xfw(x) ( ℓ) fw(x), y) 2 xfw+u(x) xfw(x) 2 + d Y i=1 Wi 2 ( ℓ) fw+u(x), y) ( ℓ) fw(x), y) 2 xfw+u(x) xfw(x) 2 + d Y i=1 Wi 2 fw+u(x) fw(x) 2 xfw+u(x) xfw(x) 2 + e x 2 d Y i=1 Wi 2 2 d X Ui 2 Wi 2 . (17) The above result is a conclusion of Lemma 2 and the lemma s assumptions implying xfw(x) 2 Qd i=1 Wi 2 for every x. Now, we define k = xf (k) w+u(x) xf (k) w (x) 2 where f (k) w (x) := Wkσ(Wk 1 σ(W1x)) ) denotes the DNN s output at layer k. With (17) in mind, we complete this lemma s proof by showing the following inequality via induction: i=1 Wi 2 k X Wi 2 + x 2 i 1 Y j=1 Wj 2 i 1 X 1Note that log 1 x 1 x implying log 1 1 1 d+1 1 d+1 and hence (log(1 + 1/d)) 1 d + 1. Published as a conference paper at ICLR 2019 The above equation will prove the lemma because for k d we have (1 + 1 d)d e. For k = 0, 0 = 0 since f (0) w (x) = x and does not change with w. 
C.2 PROOF OF THEOREM 2

We start by proving the following lemmas, which provide a perturbation bound for FGM attacks.

Lemma 3. Consider a $d$-layer neural net $f_w$ with a 1-Lipschitz and 1-smooth (1-Lipschitz derivative) activation $\sigma$ where $\sigma(0)=0$. Let the training loss $\ell:\mathbb{R}^m\times\mathcal{Y}\to\mathbb{R}$ also be 1-Lipschitz and 1-smooth for any fixed label $y\in\mathcal{Y}$. Then, for any input $x$, label $y$, and perturbation vector $u$ satisfying $\|U_i\|_2\le\tfrac1d\|W_i\|_2$ for every $i$, we have
$$\big\|\nabla_x\ell(f_{w+u}(x),y) - \nabla_x\ell(f_w(x),y)\big\|_2 \;\le\; e^2\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2} + \|x\|_2\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg]. \tag{16}$$

Proof. Since for a fixed $y$ the loss $\ell$ satisfies the same Lipschitzness and smoothness properties as $\sigma$, we have $\|\nabla_z\ell(z,y)\|_2\le1$, and applying the chain rule implies:
$$\begin{aligned}
\big\|\nabla_x\ell(f_{w+u}(x),y)-\nabla_x\ell(f_w(x),y)\big\|_2 &= \big\|\nabla_x f_{w+u}(x)\,\nabla\ell(f_{w+u}(x),y) - \nabla_x f_w(x)\,\nabla\ell(f_w(x),y)\big\|_2\\
&\le \big\|\nabla_x f_{w+u}(x)\,\nabla\ell(f_{w+u}(x),y) - \nabla_x f_w(x)\,\nabla\ell(f_{w+u}(x),y)\big\|_2 + \big\|\nabla_x f_w(x)\,\nabla\ell(f_{w+u}(x),y) - \nabla_x f_w(x)\,\nabla\ell(f_w(x),y)\big\|_2\\
&\le \big\|\nabla_x f_{w+u}(x)-\nabla_x f_w(x)\big\|_2 + \Big(\prod_{i=1}^d\|W_i\|_2\Big)\big\|\nabla\ell(f_{w+u}(x),y)-\nabla\ell(f_w(x),y)\big\|_2\\
&\le \big\|\nabla_x f_{w+u}(x)-\nabla_x f_w(x)\big\|_2 + \Big(\prod_{i=1}^d\|W_i\|_2\Big)\big\|f_{w+u}(x)-f_w(x)\big\|_2\\
&\le \big\|\nabla_x f_{w+u}(x)-\nabla_x f_w(x)\big\|_2 + e\,\|x\|_2\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\sum_{i=1}^d\frac{\|U_i\|_2}{\|W_i\|_2}.
\end{aligned}\tag{17}$$
The last step is a consequence of Lemma 2 together with the lemma's assumptions, which imply $\|\nabla_x f_w(x)\|_2\le\prod_{i=1}^d\|W_i\|_2$ for every $x$. Now, define $\Delta_k:=\big\|\nabla_x f^{(k)}_{w+u}(x)-\nabla_x f^{(k)}_w(x)\big\|_2$, where $f^{(k)}_w(x):=W_k\sigma(W_{k-1}\sigma(\cdots\sigma(W_1x)\cdots))$ denotes the DNN's output at layer $k$. With (17) in mind, we complete the lemma's proof by showing the following inequality via induction:
$$\Delta_k \;\le\; \Big(1+\frac1d\Big)^{k}\Big(\prod_{i=1}^k\|W_i\|_2\Big)\sum_{i=1}^k\bigg[\frac{\|U_i\|_2}{\|W_i\|_2} + \|x\|_2\Big(\prod_{j=1}^{i-1}\|W_j\|_2\Big)\sum_{j=1}^{i-1}\frac{\|U_j\|_2}{\|W_j\|_2}\bigg]. \tag{18}$$
This inequality will prove the lemma because for $k\le d$ we have $(1+\tfrac1d)^d\le e$. For $k=0$, $\Delta_0=0$ since $f^{(0)}_w(x)=x$ does not change with $w$. Given that (18) holds for $k$, we have
$$\begin{aligned}
\Delta_{k+1} &= \big\|\nabla_x f^{(k+1)}_{w+u}(x)-\nabla_x f^{(k+1)}_w(x)\big\|_2 = \big\|\nabla_x\big[(W_{k+1}+U_{k+1})\sigma(f^{(k)}_{w+u}(x))\big] - \nabla_x\big[W_{k+1}\sigma(f^{(k)}_w(x))\big]\big\|_2\\
&= \big\|\nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_{w+u}(x))\,(W_{k+1}+U_{k+1})^T - \nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T\big\|_2\\
&\le \big\|\nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_{w+u}(x))\,(W_{k+1}+U_{k+1})^T - \nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_w(x))\,(W_{k+1}+U_{k+1})^T\big\|_2\\
&\quad + \big\|\nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_w(x))\,(W_{k+1}+U_{k+1})^T - \nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T\big\|_2\\
&\quad + \big\|\nabla_x f^{(k)}_{w+u}(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T - \nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T\big\|_2\\
&\le \Big(1+\frac1d\Big)\|W_{k+1}\|_2\,\big\|\nabla_x f^{(k)}_{w+u}(x)\big\|_2\,\big\|f^{(k)}_{w+u}(x)-f^{(k)}_w(x)\big\|_2 + \|U_{k+1}\|_2\,\big\|\nabla_x f^{(k)}_{w+u}(x)\big\|_2\,\big\|\sigma'(f^{(k)}_w(x))\big\|_2 + \|W_{k+1}\|_2\,\big\|\sigma'(f^{(k)}_w(x))\big\|_2\,\Delta_k.
\end{aligned}$$
Bounding the first term via the 1-smoothness of $\sigma$ and Lemma 2 applied to the first $k$ layers, the second term via $\|\sigma'\|\le1$ and $\|\nabla_x f^{(k)}_{w+u}(x)\|_2\le(1+\tfrac1d)^k\prod_{i=1}^k\|W_i\|_2$, and the third term via the induction hypothesis (18), we obtain
$$\Delta_{k+1}\le\Big(1+\frac1d\Big)^{k+1}\Big(\prod_{i=1}^{k+1}\|W_i\|_2\Big)\sum_{i=1}^{k+1}\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\|x\|_2\Big(\prod_{j=1}^{i-1}\|W_j\|_2\Big)\sum_{j=1}^{i-1}\frac{\|U_j\|_2}{\|W_j\|_2}\bigg],$$
which shows (18) holds for $k+1$. Therefore, combining (17) and (18), the lemma's proof is complete. $\square$

Before presenting the perturbation bound for FGM attacks, we first prove the following simple lemma.

Lemma 4. Consider vectors $z_1, z_2$ and a norm $\|\cdot\|$. If $\max\{\|z_1\|,\|z_2\|\}\ge\kappa$, then
$$\Big\|\epsilon\frac{z_1}{\|z_1\|}-\epsilon\frac{z_2}{\|z_2\|}\Big\| \;\le\; \frac{2\epsilon}{\kappa}\|z_1-z_2\|. \tag{19}$$

Proof. Without loss of generality suppose $\|z_2\|\le\|z_1\|$, and therefore $\kappa\le\|z_1\|$. Then,
$$\Big\|\epsilon\frac{z_1}{\|z_1\|}-\epsilon\frac{z_2}{\|z_2\|}\Big\| = \Big\|\frac{\epsilon}{\|z_1\|}(z_1-z_2) + \epsilon\, z_2\Big(\frac{1}{\|z_1\|}-\frac{1}{\|z_2\|}\Big)\Big\| \le \frac{\epsilon}{\|z_1\|}\|z_1-z_2\| + \epsilon\,\frac{\|z_1\|-\|z_2\|}{\|z_1\|} \le \frac{2\epsilon}{\|z_1\|}\|z_1-z_2\| \le \frac{2\epsilon}{\kappa}\|z_1-z_2\|. \;\square$$

Lemma 5. Consider a $d$-layer neural network $f_w$ with a 1-Lipschitz, 1-smooth activation $\sigma$ where $\sigma(0)=0$. Consider FGM attacks with noise power $\epsilon$ measured in the Euclidean norm $\|\cdot\|_2$. Suppose $\kappa\le\|\nabla_x\ell(f_w(x),y)\|_2$ holds over the $\epsilon$-ball around the support set $\mathcal{X}$. Then, for any norm-bounded perturbation vector $u$ such that $\|U_i\|_2\le\tfrac1d\|W_i\|_2$ for every $i$, we have
$$\big\|\delta^{\mathrm{fgm}}_{w+u}(x)-\delta^{\mathrm{fgm}}_w(x)\big\|_2 \;\le\; \frac{2e^2\epsilon}{\kappa}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\|x\|_2\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg].$$

Proof. The FGM attack under the Euclidean norm is simply the DNN loss's gradient normalized to have Euclidean norm $\epsilon$. The lemma is hence a direct result of combining Lemmas 3 and 4. $\square$
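To make the object of Lemma 5 concrete, the $\ell_2$ FGM perturbation is just the input gradient of the loss rescaled to norm $\epsilon$. Below is a minimal NumPy sketch; `grad_loss` is a user-supplied callable returning $\nabla_x\ell(f_w(x),y)$ and is an assumption of this illustration rather than an interface defined in the paper.

```python
import numpy as np

def fgm_attack(x, grad_loss, eps):
    """l2 FGM: delta^fgm_w(x) = eps * g / ||g||_2, with g = grad_x loss(f_w(x), y)."""
    g = grad_loss(x)
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return np.zeros_like(x)  # degenerate case: no ascent direction
    return eps * g / g_norm
```

The lower bound $\kappa\le\|g\|_2$ assumed in Lemma 5 is what rules out the degenerate branch above.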
To prove Theorem 2, we apply Lemma 1 together with the result of Lemma 5. Similar to the proof of Theorem 1, given weights $\widetilde w$ we consider a zero-mean multivariate Gaussian perturbation vector $u$ with diagonal covariance matrix, where each element of the $i$th layer $u_i$ has the scaled standard deviation $\xi_i=\frac{\|\widetilde W_i\|_2}{\beta_{\widetilde w}}\xi$, with $\xi$ properly chosen later in the proof. Consider weights $w$ for which
$$\forall i:\;\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac1d\|\widetilde W_i\|_2. \tag{20}$$
Since $u_i\sim\mathcal{N}(0,\xi_i^2 I)$, (Tropp, 2012) shows the following bound holds:
$$\Pr\!\Big(\frac{\beta_{\widetilde w}}{\|\widetilde W_i\|_2}\|U_i\|_2>t\Big)\le 2h\exp\!\Big(-\frac{t^2}{2h\xi^2}\Big). \tag{21}$$
We then apply a union bound over all layers for a maximum union probability of $1/2$, implying the normalized $\frac{\beta_{\widetilde w}}{\|\widetilde W_i\|_2}\|U_i\|_2$ for each layer is upper-bounded by $\xi\sqrt{2h\log(4hd)}$. Now, if the assumptions of Lemma 5 hold for the perturbation vector $u$ given the choice of $\xi$, then for the FGM attack with noise power $\epsilon$ in the Euclidean norm $\|\cdot\|_2$ we have
$$\begin{aligned}
&\big\|f_{w+u}\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{fgm}}_w(x)\big)\big\|_2\\
&\le \big\|f_{w+u}\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)\big\|_2 + \big\|f_w\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{fgm}}_w(x)\big)\big\|_2\\
&\le \big\|f_{w+u}\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{fgm}}_{w+u}(x)\big)\big\|_2 + \Big(\prod_{i=1}^d\|W_i\|_2\Big)\big\|\delta^{\mathrm{fgm}}_{w+u}(x)-\delta^{\mathrm{fgm}}_w(x)\big\|_2\\
&\le e(B+\epsilon)\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\frac{\|U_i\|_2}{\|W_i\|_2} + \frac{2e^2\epsilon}{\kappa}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+B\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg]\\
&\le e^2(B+\epsilon)\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)\sum_{i=1}^d\frac{\|U_i\|_2}{\|\widetilde W_i\|_2} + \frac{2e^5\epsilon}{\kappa}\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)^2\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|\widetilde W_i\|_2}+B\Big(\prod_{j=1}^i\|\widetilde W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|\widetilde W_j\|_2}\bigg]\\
&\le 2e^5 d(B+\epsilon)\,\xi\sqrt{2h\log(4hd)}\,\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)\bigg[1+\frac{\epsilon}{\kappa}\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)\Big(\frac1B+\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)\bigg]. 
\end{aligned}\tag{22}$$
Hence we choose
$$\xi=\frac{\gamma}{8e^5 d(B+\epsilon)\sqrt{2h\log(4hd)}\,\big(\prod_{i=1}^d\|\widetilde W_i\|_2\big)\Big(1+\frac{\epsilon}{\kappa}\big(\prod_{i=1}^d\|\widetilde W_i\|_2\big)\big(\frac1B+\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\big)\Big)}, \tag{23}$$
for which the assumptions of Lemmas 1 and 5 hold. Assuming $B\ge1$, similar to Theorem 1's proof we can show that for any $w$ such that $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\tfrac1d\|\widetilde W_i\|_2$ we have
$$\sum_{i=1}^d\frac{\|W_i\|_F^2}{2\xi_i^2}=O\!\Bigg(\frac{d^2(B+\epsilon)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)^2\Big(1+\frac{\epsilon}{\kappa}\prod_{i=1}^d\|\widetilde W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|\widetilde W_i\|_2^2}\Bigg) \le O\!\Bigg(\frac{d^2(B+\epsilon)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\Big(1+\frac{\epsilon}{\kappa}\prod_{i=1}^d\|W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}\Bigg).$$
Then, applying Lemma 1 reveals that given any $\eta>0$, with probability at least $1-\eta$, for any $w$ such that $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\tfrac1d\|\widetilde W_i\|_2$ we have
$$L^{\mathrm{fgm}}_0(f_w)\le\widehat{L}^{\mathrm{fgm}}_\gamma(f_w)+O\!\left(\sqrt{\frac{(B+\epsilon)^2d^2h\log(dh)\,\Phi^{\mathrm{fgm}}_{\epsilon,\kappa}(f_w)+\log\frac{n}{\eta}}{\gamma^2 n}}\right), \tag{24}$$
where $\Phi^{\mathrm{fgm}}_{\epsilon,\kappa}(f_w):=\Big(\prod_{i=1}^d\|W_i\|_2\big(1+\frac{\epsilon}{\kappa}\prod_{i=1}^d\|W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\big)\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}$. Note that, similar to our proof of Theorem 1, we can find a cover of size $O\big((d\log M)^d\,d\,n^{1/2d}\big)$ for the feasible set of spectral norms, where for any $\|W_i\|_2$ we have an $a_i$ such that $\big|\|W_i\|_2-a_i\big|\le a_i/d$. Applying this covering number bound to (24) completes the proof.

C.3 PROOF OF THEOREM 3

We use the following two lemmas to extend the proof of Theorem 2 for FGM attacks to Theorem 3 for PGM attacks.

Lemma 6. Consider a $d$-layer neural network $f_w$ with a 1-Lipschitz, 1-smooth activation $\sigma$ where $\sigma(0)=0$. Consider PGM attacks with noise power $\epsilon$ in the Euclidean norm $\|\cdot\|_2$, $r$ iterations, and stepsize $\alpha$. Suppose $\kappa\le\|\nabla_x\ell(f_w(x),y)\|_2$ holds over the $\epsilon$-ball around the support set $\mathcal{X}$. Then for any perturbation vector $u$ such that $\|U_i\|_2\le\tfrac1d\|W_i\|_2$ for every $i$ we have
$$\big\|\delta^{\mathrm{pgm},r}_{w+u}(x)-\delta^{\mathrm{pgm},r}_w(x)\big\|_2\le e^2\,\frac{2\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^r\,\mathrm{lip}(\nabla\ell\circ f_w)^r}{1-\tfrac{2\alpha}{\kappa}\,\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+(\|x\|_2+\epsilon)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg]. \tag{25}$$
Here $\mathrm{lip}(\nabla\ell\circ f_w)$ denotes the actual Lipschitz constant of $\nabla_x\ell(f_w(x),y)$.

Proof. We use induction on $r$. The case $r=1$ is a direct consequence of Lemma 5. Suppose the result is true for $r=k$. Then, Lemmas 3 and 4 imply
$$\begin{aligned}
\big\|\delta^{\mathrm{pgm},k+1}_{w+u}(x)-\delta^{\mathrm{pgm},k+1}_w(x)\big\|_2&\le\frac{2\alpha}{\kappa}\big\|\nabla_x\ell\big(f_{w+u}(x+\delta^{\mathrm{pgm},k}_{w+u}(x)),y\big)-\nabla_x\ell\big(f_w(x+\delta^{\mathrm{pgm},k}_w(x)),y\big)\big\|_2\\
&\le\frac{2\alpha}{\kappa}\Big[\big\|\nabla_x\ell\big(f_{w+u}(x+\delta^{\mathrm{pgm},k}_{w+u}(x)),y\big)-\nabla_x\ell\big(f_w(x+\delta^{\mathrm{pgm},k}_{w+u}(x)),y\big)\big\|_2+\big\|\nabla_x\ell\big(f_w(x+\delta^{\mathrm{pgm},k}_{w+u}(x)),y\big)-\nabla_x\ell\big(f_w(x+\delta^{\mathrm{pgm},k}_w(x)),y\big)\big\|_2\Big]\\
&\le\frac{2\alpha}{\kappa}\bigg[e^2\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\Big(\frac{\|U_i\|_2}{\|W_i\|_2}+(\|x\|_2+\epsilon)\prod_{j=1}^i\|W_j\|_2\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\Big)+\mathrm{lip}(\nabla\ell\circ f_w)\,\big\|\delta^{\mathrm{pgm},k}_{w+u}(x)-\delta^{\mathrm{pgm},k}_w(x)\big\|_2\bigg]\\
&\le\frac{2\alpha}{\kappa}\,e^2\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\Big(\cdots\Big)+\frac{2\alpha}{\kappa}\,\mathrm{lip}(\nabla\ell\circ f_w)\cdot e^2\,\frac{2\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^k\mathrm{lip}(\nabla\ell\circ f_w)^k}{1-\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\Big(\cdots\Big)\\
&=e^2\,\frac{2\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^{k+1}\mathrm{lip}(\nabla\ell\circ f_w)^{k+1}}{1-\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+(\|x\|_2+\epsilon)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg],
\end{aligned}$$
where $(\cdots)$ abbreviates the same summand as in the preceding line, the fourth line applies the induction hypothesis, and the last line follows from the identity $\sum_{i=0}^k s^i=\frac{1-s^{k+1}}{1-s}$. Therefore, by induction the lemma holds for every $r\ge1$. $\square$
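For concreteness, the $r$-step attack analyzed in Lemma 6 can be sketched as follows: each step moves along the normalized gradient with stepsize $\alpha$, and the iterate is kept inside the $\epsilon$-ball. The NumPy sketch below uses a user-supplied `grad_loss` callable and a Euclidean-ball projection; both are assumptions of this illustration rather than code from the paper.

```python
import numpy as np

def pgm_attack(x, grad_loss, eps, alpha, r):
    """r normalized-gradient ascent steps, projected back onto the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(r):
        g = grad_loss(x + delta)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        delta = delta + alpha * g / g_norm      # stepsize alpha, normalized gradient
        d_norm = np.linalg.norm(delta)
        if d_norm > eps:
            delta = eps * delta / d_norm        # project onto the eps-ball
    return delta
```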
Lemma 7. Consider a $d$-layer neural network $f_w$ with a 1-Lipschitz, 1-smooth activation $\sigma$ where $\sigma(0)=0$. Also, assume the training loss $\ell$ is 1-Lipschitz and 1-smooth. Then,
$$\mathrm{lip}\big(\nabla_x\ell(f_w(x),y)\big)\;\le\;\mathrm{lip}(\nabla\ell\circ f_w):=\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2. \tag{26}$$

Proof. First note that according to the chain rule,
$$\mathrm{lip}\big(\nabla_x\ell(f_w(x),y)\big)=\mathrm{lip}\big(\nabla_x f_w(x)\,\nabla\ell(f_w(x),y)\big)\le\mathrm{lip}\big(\nabla_x f_w(x)\big)+\mathrm{lip}(f_w)^2\le\mathrm{lip}\big(\nabla_x f_w(x)\big)+\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2.$$
Considering this, we complete the proof by inductively proving $\mathrm{lip}\big(\nabla_x f_w(x)\big)\le\big(\prod_{i=1}^d\|W_i\|_2\big)\sum_{i=1}^d\prod_{j=1}^{i-1}\|W_j\|_2$. For $d=1$, $\nabla_x f_w(x)$ is constant and hence the result holds. Assume the statement holds for $d=k$. Due to the chain rule, $\nabla_x f^{(k+1)}_w(x)=\nabla_x\big[W_{k+1}\sigma(f^{(k)}_w(x))\big]=\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T$, and therefore for any $x$ and $v$,
$$\begin{aligned}
\big\|\nabla_x f^{(k+1)}_w(x+v)-\nabla_x f^{(k+1)}_w(x)\big\|_2&=\big\|\nabla_x f^{(k)}_w(x+v)\,\sigma'(f^{(k)}_w(x+v))\,W_{k+1}^T-\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\,W_{k+1}^T\big\|_2\\
&\le\|W_{k+1}\|_2\,\big\|\nabla_x f^{(k)}_w(x+v)\,\sigma'(f^{(k)}_w(x+v))-\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\big\|_2\\
&\le\|W_{k+1}\|_2\Big[\big\|\nabla_x f^{(k)}_w(x+v)\,\sigma'(f^{(k)}_w(x+v))-\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x+v))\big\|_2+\big\|\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x+v))-\nabla_x f^{(k)}_w(x)\,\sigma'(f^{(k)}_w(x))\big\|_2\Big]\\
&\le\|W_{k+1}\|_2\Big[\big\|\nabla_x f^{(k)}_w(x+v)-\nabla_x f^{(k)}_w(x)\big\|_2+\big\|\nabla_x f^{(k)}_w(x)\big\|_2\,\big\|\sigma'(f^{(k)}_w(x+v))-\sigma'(f^{(k)}_w(x))\big\|_2\Big]\\
&\le\|W_{k+1}\|_2\Big[\mathrm{lip}\big(\nabla_x f^{(k)}_w\big)+\mathrm{lip}\big(f^{(k)}_w\big)\prod_{j=1}^k\|W_j\|_2\Big]\|v\|_2\le\Big(\prod_{i=1}^{k+1}\|W_i\|_2\Big)\sum_{i=1}^{k+1}\prod_{j=1}^{i-1}\|W_j\|_2\,\|v\|_2,
\end{aligned}$$
which shows the statement holds for $d=k+1$ and therefore completes the proof via induction. $\square$

In order to prove Theorem 3, we note that for any norm-bounded $\|x\|_2\le B$ and perturbation vector $u$ such that $\|U_i\|_2\le\tfrac1d\|W_i\|_2$ for every $i$, we have
$$\begin{aligned}
&\big\|f_{w+u}\big(x+\delta^{\mathrm{pgm},r}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{pgm},r}_w(x)\big)\big\|_2\\
&\le\big\|f_{w+u}\big(x+\delta^{\mathrm{pgm},r}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{pgm},r}_{w+u}(x)\big)\big\|_2+\big\|f_w\big(x+\delta^{\mathrm{pgm},r}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{pgm},r}_w(x)\big)\big\|_2\\
&\le e(B+\epsilon)\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\frac{\|U_i\|_2}{\|W_i\|_2}+e^2\,\frac{2\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^r\mathrm{lip}(\nabla\ell\circ f_w)^r}{1-\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+(B+\epsilon)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg],
\end{aligned}$$
where the last inequality holds since, as shown in Lemma 7, $\mathrm{lip}(\nabla_x\ell(f_w(x),y))\le\mathrm{lip}(\nabla\ell\circ f_w):=\prod_{i=1}^d\|W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2$. Here the upper bound $\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})$ for $\widetilde w$ changes by a factor of at most $e^{2/r}$ for $w$ such that $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac{1}{rd}\|\widetilde W_i\|_2$. Therefore, given $\widetilde w$, if similar to the proof of Theorem 2 we choose a zero-mean multivariate Gaussian distribution $Q$ for $u$ with the $i$th layer $u_i$'s standard deviation $\xi_i=\frac{\|\widetilde W_i\|_2}{\beta_{\widetilde w}}\xi$, where
$$\xi=\frac{\gamma}{8d(B+\epsilon)\sqrt{2h\log(4hd)}\,e^4\frac{\alpha}{\kappa}\cdot\frac{1-e^2(\tfrac{2\alpha}{\kappa})^r\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})^r}{1-e^{2/r}\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})}\,\big(\prod_{i=1}^d\|\widetilde W_i\|_2\big)\Big(1+\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)},$$
then for any $w$ satisfying $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac{1}{rd}\|\widetilde W_i\|_2$, applying a union bound shows that the assumption of Lemma 1, $\Pr_u\big(\max_{x\in\mathcal{X}}\|f_{w+u}(x+\delta^{\mathrm{pgm},r}_{w+u}(x))-f_w(x+\delta^{\mathrm{pgm},r}_w(x))\|_\infty\le\tfrac{\gamma}{4}\big)\ge\tfrac12$, holds for $Q$, and further we have
$$\sum_{i=1}^d\frac{\|W_i\|_F^2}{2\xi_i^2}=O\!\Bigg(\frac{d^2(B+\epsilon)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)^2\Big(1+\frac{\alpha}{\kappa}\cdot\frac{1-e^2(\tfrac{2\alpha}{\kappa})^r\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})^r}{1-e^{2/r}\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})}\prod_{i=1}^d\|\widetilde W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|\widetilde W_i\|_2^2}\Bigg)\le O\!\Bigg(\frac{d^2(B+\epsilon)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\Big(1+\frac{\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^r\mathrm{lip}(\nabla\ell\circ f_w)^r}{1-\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_w)}\prod_{i=1}^d\|W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}\Bigg).$$
Applying the above bound to Lemma 1 shows that for any $\eta>0$ the following holds with probability $1-\eta$ for any $w$ where $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac{1}{rd}\|\widetilde W_i\|_2$:
$$L^{\mathrm{pgm}}_0(f_w)\le\widehat{L}^{\mathrm{pgm}}_\gamma(f_w)+O\!\left(\sqrt{\frac{(B+\epsilon)^2d^2h\log(dh)\,\Phi^{\mathrm{pgm}}_{\epsilon,\kappa,r,\alpha}(f_w)+\log\frac{n}{\eta}}{\gamma^2n}}\right),$$
where we define $\Phi^{\mathrm{pgm}}_{\epsilon,\kappa,r,\alpha}(f_w)$ as
$$\bigg(\prod_{i=1}^d\|W_i\|_2\Big(1+\frac{\alpha}{\kappa}\cdot\frac{1-(\tfrac{2\alpha}{\kappa})^r\mathrm{lip}(\nabla\ell\circ f_w)^r}{1-\tfrac{2\alpha}{\kappa}\mathrm{lip}(\nabla\ell\circ f_w)}\prod_{i=1}^d\|W_i\|_2\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\Big)\bigg)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}.$$
Using a similar argument to our proof of Theorem 2, we can properly cover the spectral norms for each $W_i$ with $2rd\log M$ points, such that for any feasible $\|W_i\|_2$ value satisfying the assumptions we have a value $a_i$ in our cover where $\big|\|W_i\|_2-a_i\big|\le\frac{1}{rd}a_i$. Therefore, we can cover all feasible combinations of spectral norms with $(2rd\log M)^d\,d\,n^{1/2d}$ points, which combined with the above discussion completes the proof.
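Because the surrogate $\mathrm{lip}(\nabla\ell\circ f_w)$ of Lemma 7 depends only on the per-layer spectral norms, it is cheap to evaluate, for example to check the condition $\mathrm{lip}(\nabla\ell\circ f_w)<\lambda$ required by the WRM analysis in the next subsection. The NumPy sketch below is a reading aid only; the guard-band form $\mathrm{lip}(\nabla\ell\circ f_w)\le(1-\tau)\lambda$ used in the check is an assumption of this illustration.

```python
import numpy as np

def grad_lipschitz_surrogate(spectral_norms):
    """lip(grad-ell o f_w) := (prod_i ||W_i||_2) * sum_i prod_{j<=i} ||W_j||_2  (Lemma 7)."""
    s = np.asarray(spectral_norms, dtype=float)
    return float(np.prod(s) * np.cumprod(s).sum())

def wrm_guard_band_ok(spectral_norms, lam, tau):
    """Check lip(grad-ell o f_w) <= (1 - tau) * lambda (assumed guard-band form)."""
    return grad_lipschitz_surrogate(spectral_norms) <= (1.0 - tau) * lam

# example: three spectrally-normalized layers with beta = 1 give 1 * (1 + 1 + 1) = 3
print(grad_lipschitz_surrogate([1.0, 1.0, 1.0]), wrm_guard_band_ok([1.0, 1.0, 1.0], lam=10.0, tau=0.5))
```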
C.4 PROOF OF THEOREM 4

We first show the following lemma, providing a perturbation bound for WRM attacks.

Lemma 8. Consider a $d$-layer neural net $f_w$ satisfying the assumptions of Lemma 3. Then, for any weight perturbation $u$ such that $\|U_i\|_2\le\tfrac1d\|W_i\|_2$ we have
$$\big\|\delta^{\mathrm{wrm}}_{w+u}(x)-\delta^{\mathrm{wrm}}_w(x)\big\|_2\le\frac{e^2}{\lambda-\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\Big(\|x\|_2+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\Big)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg].$$
In the above inequality, $\mathrm{lip}(\nabla\ell\circ f_w)$ denotes the Lipschitz constant of $\nabla_x\ell(f_w(x),y)$.

Proof. First note that for any $x$ we have $\|\delta^{\mathrm{wrm}}_w(x)\|_2\le\big(\prod_{i=1}^d\|W_i\|_2\big)/\lambda$, because we assume $\mathrm{lip}(\nabla\ell\circ f_w)<\lambda$, implying that WRM's inner optimization is a convex optimization problem whose global solution $\delta^{\mathrm{wrm}}_w(x)$ satisfies $\delta^{\mathrm{wrm}}_w(x)=\frac1\lambda\nabla_x\ell\big(f_w(x+\delta^{\mathrm{wrm}}_w(x)),y\big)$, which is norm-bounded by $\mathrm{lip}(\ell\circ f_w)/\lambda\le\big(\prod_{i=1}^d\|W_i\|_2\big)/\lambda$. Moreover, applying Lemma 3 we have
$$\begin{aligned}
\big\|\delta^{\mathrm{wrm}}_{w+u}(x)-\delta^{\mathrm{wrm}}_w(x)\big\|_2&=\Big\|\frac1\lambda\nabla_x\ell\big(f_{w+u}(x+\delta^{\mathrm{wrm}}_{w+u}(x)),y\big)-\frac1\lambda\nabla_x\ell\big(f_w(x+\delta^{\mathrm{wrm}}_w(x)),y\big)\Big\|_2\\
&\le\frac1\lambda\Big\|\nabla_x\ell\big(f_{w+u}(x+\delta^{\mathrm{wrm}}_{w+u}(x)),y\big)-\nabla_x\ell\big(f_w(x+\delta^{\mathrm{wrm}}_{w+u}(x)),y\big)\Big\|_2+\frac1\lambda\Big\|\nabla_x\ell\big(f_w(x+\delta^{\mathrm{wrm}}_{w+u}(x)),y\big)-\nabla_x\ell\big(f_w(x+\delta^{\mathrm{wrm}}_w(x)),y\big)\Big\|_2\\
&\le\frac{e^2}{\lambda}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\Big(\|x\|_2+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\Big)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg]+\frac{\mathrm{lip}(\nabla\ell\circ f_w)}{\lambda}\big\|\delta^{\mathrm{wrm}}_{w+u}(x)-\delta^{\mathrm{wrm}}_w(x)\big\|_2,
\end{aligned}$$
which shows the following inequality and hence completes the proof:
$$\Big(1-\frac{\mathrm{lip}(\nabla\ell\circ f_w)}{\lambda}\Big)\big\|\delta^{\mathrm{wrm}}_{w+u}(x)-\delta^{\mathrm{wrm}}_w(x)\big\|_2\le\frac{e^2}{\lambda}\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\Big(\|x\|_2+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\Big)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg].\;\square$$

Combining the above lemma with Lemma 2, for any norm-bounded $\|x\|_2\le B$ and perturbation vector $u$ where $\|U_i\|_2\le\tfrac1d\|W_i\|_2$,
$$\begin{aligned}
&\big\|f_{w+u}\big(x+\delta^{\mathrm{wrm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{wrm}}_w(x)\big)\big\|_2\\
&\le\big\|f_{w+u}\big(x+\delta^{\mathrm{wrm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{wrm}}_{w+u}(x)\big)\big\|_2+\big\|f_w\big(x+\delta^{\mathrm{wrm}}_{w+u}(x)\big)-f_w\big(x+\delta^{\mathrm{wrm}}_w(x)\big)\big\|_2\\
&\le e\Big(B+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\Big)\Big(\prod_{i=1}^d\|W_i\|_2\Big)\sum_{i=1}^d\frac{\|U_i\|_2}{\|W_i\|_2}+\frac{e^2}{\lambda-\mathrm{lip}(\nabla\ell\circ f_w)}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\sum_{i=1}^d\bigg[\frac{\|U_i\|_2}{\|W_i\|_2}+\Big(B+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\Big)\Big(\prod_{j=1}^i\|W_j\|_2\Big)\sum_{j=1}^i\frac{\|U_j\|_2}{\|W_j\|_2}\bigg].
\end{aligned}$$
Similar to the proofs of Theorems 2 and 3, given $\widetilde w$ we choose a zero-mean multivariate Gaussian distribution $Q$ with diagonal covariance matrix for the random perturbation $u$, with the $i$th layer $u_i$'s standard deviation parameter $\xi_i=\frac{\|\widetilde W_i\|_2}{\beta_{\widetilde w}}\xi$, where $\xi$ is chosen (up to an appropriate absolute constant, as in the previous proofs) proportionally to
$$\frac{\gamma}{d\sqrt{2h\log(4hd)}\Big(B+\frac{\prod_{j=1}^d\|\widetilde W_j\|_2}{\lambda}\Big)\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)\Big(1+\frac{1}{\lambda-\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})}\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)}.$$
Using a union bound shows that the assumption of Lemma 1, $\Pr_u\big(\max_{x\in\mathcal{X}}\|f_{w+u}(x+\delta^{\mathrm{wrm}}_{w+u}(x))-f_w(x+\delta^{\mathrm{wrm}}_w(x))\|_\infty\le\tfrac\gamma4\big)\ge\tfrac12$, holds for $Q$. Then, for any $w$ satisfying $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac{\tau}{4d}\|\widetilde W_i\|_2$ we have
$$\mathrm{lip}(\nabla\ell\circ f_w)\le\big(e^{\tau/4}\big)^2\,\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})\le\big(e^{\tau/4}\big)^2(1-\tau)\lambda\le\Big(1-\frac\tau2\Big)\lambda,$$
which implies that the guard-band $\tau$ for $\widetilde w$ applies to $w$ after being reduced by a factor of $2$. Hence,
$$\sum_{i=1}^d\frac{\|W_i\|_F^2}{2\xi_i^2}=O\!\Bigg(\frac{d^2\big(B+\frac{\prod_{j=1}^d\|\widetilde W_j\|_2}{\lambda}\big)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|\widetilde W_i\|_2\Big)^2\Big(1+\frac{1}{\lambda-\mathrm{lip}(\nabla\ell\circ f_{\widetilde w})}\sum_{i=1}^d\prod_{j=1}^i\|\widetilde W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|\widetilde W_i\|_2^2}\Bigg)\le O\!\Bigg(\frac{d^2\big(B+\frac{\prod_{j=1}^d\|W_j\|_2}{\lambda}\big)^2h\log(hd)}{\gamma^2}\Big(\prod_{i=1}^d\|W_i\|_2\Big)^2\Big(1+\frac{1}{\lambda-\mathrm{lip}(\nabla\ell\circ f_w)}\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\Big)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}\Bigg).$$
Using this bound in Lemma 1 implies that for any $\eta>0$ the following bound holds with probability $1-\eta$ for any $w$ where $\big|\|W_i\|_2-\|\widetilde W_i\|_2\big|\le\frac{\tau}{4d}\|\widetilde W_i\|_2$:
$$L^{\mathrm{wrm}}_0(f_w)\le\widehat{L}^{\mathrm{wrm}}_\gamma(f_w)+O\!\left(\sqrt{\frac{\big(B+\prod_{i=1}^d\|W_i\|_2/\lambda\big)^2d^2h\log(dh)\,\Phi^{\mathrm{wrm}}_\lambda(f_w)+d\log\frac n\eta}{\gamma^2n}}\right),$$
where we define $\Phi^{\mathrm{wrm}}_\lambda(f_w)$ to be
$$\bigg(\prod_{i=1}^d\|W_i\|_2\Big(1+\frac{1}{\lambda-\mathrm{lip}(\nabla\ell\circ f_w)}\sum_{i=1}^d\prod_{j=1}^i\|W_j\|_2\Big)\bigg)^2\sum_{i=1}^d\frac{\|W_i\|_F^2}{\|W_i\|_2^2}.$$
Using a similar argument to our proofs of Theorems 2 and 3, we can cover the possible spectral norms for each $W_i$ with $O\big(\frac{8d}{\tau}\log M\big)$ points, such that for any feasible $\|W_i\|_2$ value satisfying the theorem's assumptions, we have a value $a_i$ in our cover where $\big|\|W_i\|_2-a_i\big|\le\frac{\tau}{4d}a_i$. Therefore, we can cover all feasible combinations of spectral norms with $O\big((\frac{8d}{\tau}\log M)^d\,d\,n^{1/2d}\big)$ points, which combined with the above discussion finishes the proof.
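To make the attack analyzed in this subsection concrete: WRM's inner problem maximizes $\ell(f_w(x+\delta),y)-\tfrac\lambda2\|\delta\|_2^2$ over $\delta$, and when $\mathrm{lip}(\nabla\ell\circ f_w)<\lambda$ its unique solution satisfies the fixed-point equation $\delta=\tfrac1\lambda\nabla_x\ell(f_w(x+\delta),y)$ used in the proof of Lemma 8. The NumPy sketch below iterates that fixed point with a user-supplied `grad_loss` callable; the iteration count, and the choice of a plain fixed-point update rather than the gradient-ascent solver of Sinha et al. (2018), are assumptions made for this illustration only.

```python
import numpy as np

def wrm_attack(x, grad_loss, lam, n_iters=20):
    """Fixed-point iteration for delta = (1/lam) * grad_x loss(f_w(x + delta), y).

    When lip(grad-ell o f_w) < lam, this map is a contraction, so the iteration converges."""
    delta = np.zeros_like(x)
    for _ in range(n_iters):
        delta = grad_loss(x + delta) / lam
    return delta
```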