# Adversarial Training from Mean Field Perspective

Soichiro Kumano, The University of Tokyo, kumano@cvm.t.u-tokyo.ac.jp
Hiroshi Kera, Chiba University, kera@chiba-u.jp
Toshihiko Yamasaki, The University of Tokyo, yamasaki@cvm.t.u-tokyo.ac.jp

Although adversarial training is known to be effective against adversarial examples, training dynamics are not well understood. In this study, we present the first theoretical analysis of adversarial training in random deep neural networks without any assumptions on data distributions. We introduce a new theoretical framework based on mean field theory, which addresses the limitations of existing mean field-based approaches. Based on this framework, we derive (empirically tight) upper bounds of the ℓq norm-based adversarial loss with ℓp norm-based adversarial examples for various values of p and q. Moreover, we prove that networks without shortcuts are generally not adversarially trainable and that adversarial training reduces network capacity. We also show that network width alleviates these issues. Furthermore, we present the various impacts of the input and output dimensions on the upper bounds and the time evolution of the weight variance.

1 Introduction

Adversarial training [38, 58] is one of the most effective approaches against adversarial examples [89]. Various studies have aimed to improve the performance of adversarial training [28, 94, 113], leading to numerous observations. Adversarial training improves accuracy for adversarial examples but decreases it for clean images [58, 88]. Moreover, it requires additional training data [14, 40, 77] and achieves high robust accuracy on a training dataset but not on a test dataset [58].

To improve the reliability of adversarial training and address the aforementioned challenges, a theoretical analysis is essential. Specifically, it is crucial to gain insight into network evolution during adversarial training, conditions for adversarial trainability, and differences between adversarial and standard training. However, the theoretical understanding of network training is challenging, even for standard training, due to the non-convexity of the loss surface and the stochasticity of optimization. Recent studies employed mean field theory and analyzed the early stage of standard training for randomly initialized deep neural networks (random networks) [71, 81]. Some studies explored network trainability regarding gradient vanishing/explosion [81, 104, 105] and dynamical isometry [69, 79, 100]. Others examined network representation power [49, 71, 101]. The theoretical results on the early stage of training have been empirically observed to fit well with fully trained networks [25, 84], partially supported by recent theoretical results [46]. However, existing mean field-based approaches cannot manage the probabilistic properties of an entire network (e.g., the distribution of a network Jacobian) or the dependence between network inputs and parameters, which are crucial for analyzing adversarial training.

In this study, we propose a mean field-based framework that addresses the aforementioned limitations (Thm 4.1) and apply it to the analysis of adversarial training.
While previous studies on adversarial training rely on strong assumptions (e.g., Gaussian data [14, 80] and linear classifiers [76, 107]), the proposed framework covers various scenarios (e.g., ℓp norm-based adversarial examples and deep neural networks with or without shortcuts, i.e., residual or vanilla networks) without any assumptions on data distributions. Our analysis reveals various adversarial training characteristics that have been only experimentally observed or were previously unknown. The results are summarized as follows.

Upper bounds of adversarial loss. We derive upper bounds of the adversarial loss, quantifying the adverse effect of adversarial examples for various combinations of the ℓq-adversarial loss with the ℓp-norm ϵ-ball (p, q ∈ {1, 2, ∞}) (Thm 5.1). Numerical experiments confirm the tightness of these bounds. We also investigate the impacts of the input and output dimensions on these bounds and discover that for the combination (p, q) = (2, ∞), the bound is independent of these dimensions.

Time evolution of weight variance. We present the time (training step) evolution of the weight variance in training (Thms 5.4 and G.9). Weight variance has been used to assess training properties [49, 71, 81]. Our analysis indicates that adversarial training significantly regularizes weights and exhibits consistent weight dynamics across different norm choices.

Vanilla networks are not adversarially trainable under mild conditions. We show that gradient vanishing occurs in vanilla networks (without shortcuts) with large depths and small widths, making them untrainable via adversarial training, even with careful weight initialization (Thm 5.7). This contrasts with standard training, where deep vanilla networks can be trained with proper initialization [81, 100]. However, residual networks are adversarially trainable even without proper initialization (Thm 5.8 and Prop G.10), proving the importance of shortcuts for adversarial training.

Degradation of network capacity and role of network width. As adversarial robustness requires high network capacity [63], deep networks can be used in adversarial training. However, we reveal that network capacity, measured by the Fisher-Rao norm [56], sharply degrades in deep networks during adversarial training (Thms 5.9 and G.14). Specifically, we confirm that the capacity at training step t is Θ(L − tL²/N), where L and N denote the network depth and width, respectively. Interestingly, this result contrasts the roles of depth and width: the depth increases the initial capacity but accelerates its decay as training proceeds, whereas the width preserves it. While our theory is validated only during the initial stages of training, our experiments confirm that the adversarial robustness after full training is significantly influenced by the network width (cf. Tabs. A5 and A6), as demonstrated by our theorems.

Other contributions. Furthermore, we show the following: (a) Equality of the adversarial loss is obtained instead of inequality (upper bound) under several assumptions (Props G.2 and G.3). (b) Adversarial training leads to faster weight decay and less stable gradients compared with ℓ2 weight regularization. (c) Capacity degradation is discussed for metrics other than the Fisher-Rao norm. (d) Adversarial risk cannot be mitigated while maintaining trainability and expressivity. (e) The discussion on ReLU-like activations extends to Lipschitz continuous activations under certain conditions. (f) A single-gradient-descent attack can find adversarial examples that flip the prediction of a binary classifier (Prop K.1).
Contributions (b)-(f) are found in Appx. K. Although this study focuses on adversarial training, our theoretical framework can be applied to other training methods that require the probabilistic properties of an entire network and the dependence between network inputs and parameters. Consequently, we believe that this study can potentially contribute to the theoretical understanding of adversarial training and various deep learning methods.

2 Related work

Here, we provide a summary; the full version is in Appx. A. A technical discussion follows in Sec. 5.1.

Mean field theory. Mean field theory in machine learning investigates the training dynamics of random networks in chaotic and ordered phases [71]. Networks can be trained at the boundary between these phases [81]. The theory has been extended to networks with shortcuts [105, 106], recurrent connections [15, 69], and batch normalization [104]. It has been employed to study dynamical isometry [15, 69, 70, 79], and a subsequent study achieved training of 10,000-layer networks without shortcuts [100]. Moreover, the theory has been applied to analyze network representation power [49, 71, 101]. However, existing mean field-based analyses cannot handle the properties of an entire network or input-parameter dependence, which is a drawback for some deep learning methods, e.g., adversarial training. Thus, we propose a new framework to address these limitations.

Adversarial training. Various questions related to adversarial training have been theoretically addressed, including the robustness-accuracy trade-off [29, 47, 75, 76, 92, 113], the generalization gap [6, 50, 102, 107], sample complexity [1, 14, 62, 80, 110], the large model requirement [63], and enhanced transfer learning performance [27]. However, these results are obtained in limited settings (e.g., Gaussian data and linear classifiers) and are not easily extended to deep neural networks or realistic data distributions. To explore more general settings, recent studies used the neural tangent kernel theory [4, 46, 55]. In the kernel regime, adversarial training, even with a heuristic attack, finds a robust network [35, 115]. In our study, we investigate adversarial training dynamics from a mean field perspective, covering general multilayered networks with or without shortcuts and without assumptions on data distributions.

3 Preliminaries

3.1 Setting

Notations are summarized in Tab. A2. For an integer n ∈ N, let [n] := {1, . . . , n}. In this study, we focus on random deep neural networks with ReLU-like activations, called random ReLU-like networks, formally defined as follows:

Definition 3.1 (ReLU-like network). A network is called a ReLU-like network if all its activation functions are ϕ(z) := uz for z ≥ 0 and vz for z < 0, with u, v ∈ R.

ReLU-like activations [33, 57] are widely used in theoretical and practical applications [42, 45, 53, 85, 109]. In Appx. K, we extend our theorems to networks with Lipschitz continuous activations. A ReLU-like network f : R^d → R^K comprises L ∈ N trainable layers and two non-trainable layers for adjusting the input and output dimensions. The input layer projects x^in ∈ R^d to an N-dimensional vector x^(0) ∈ R^N using a random matrix P^in ∈ R^{N×d}. Subsequently, L consecutive affine transformations and activations are applied by g : R^N → R^N. Then, g(x^(0)) is multiplied by a random matrix P^out ∈ R^{K×N} to obtain the output vector f(x^in). Finally, the network function is given by f(x^in) := P^out g(P^in x^in).
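To make the setup concrete, the following is a minimal NumPy sketch of this construction (our illustration, not the authors' code), assuming the Gaussian initialization specified in the next paragraph; the default slopes u = 1, v = 0 give the standard ReLU.

```python
# Sketch of a random ReLU-like network f(x) = P_out g(P_in x) (cf. Eq. (1)).
# Initialization: P_in ~ N(0, 1/d), P_out ~ N(0, 1/N), W ~ N(0, sigma_w^2/N),
# b ~ N(0, sigma_b^2), as specified in Sec. 3.1.
import numpy as np

def phi(z, u=1.0, v=0.0):
    """ReLU-like activation: u*z for z >= 0 and v*z for z < 0 (Defn 3.1)."""
    return np.where(z >= 0, u * z, v * z)

def sample_network(d, N, K, L, sigma_w2=2.0, sigma_b2=0.01, rng=None):
    """Draw one random network: projection matrices, weights, and biases."""
    rng = rng or np.random.default_rng()
    P_in = rng.normal(0.0, np.sqrt(1.0 / d), (N, d))
    P_out = rng.normal(0.0, np.sqrt(1.0 / N), (K, N))
    Ws = [rng.normal(0.0, np.sqrt(sigma_w2 / N), (N, N)) for _ in range(L)]
    bs = [rng.normal(0.0, np.sqrt(sigma_b2), N) for _ in range(L)]
    return P_in, P_out, Ws, bs

def forward(x_in, net, u=1.0, v=0.0):
    """f(x_in) = P_out phi(W_L phi(... phi(W_1 P_in x_in + b_1) ...) + b_L)."""
    P_in, P_out, Ws, bs = net
    x = P_in @ x_in
    for W, b in zip(Ws, bs):
        x = phi(W @ x + b, u, v)
    return P_out @ x

net = sample_network(d=100, N=1000, K=10, L=5)
print(forward(np.ones(100), net).shape)  # (10,)
```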
We assume that d and K are sufficiently large, and each entry of P^in and P^out is i.i.d. and sampled from the Gaussians N(0, 1/d) and N(0, 1/N), respectively. An L-layer neural network g comprises weights W^(l) = (W^(l)_ij) ∈ R^{N×N} and biases b^(l) = (b^(l)_1, . . . , b^(l)_N) ∈ R^N, where l ∈ [L] denotes the layer index. The network is assumed to possess a sufficiently large width (i.e., N is sufficiently large). The l-th pre- and post-activations are defined as h^(l) := W^(l) x^(l−1) + b^(l) and x^(l) := ϕ(h^(l)), respectively, where the ReLU-like activation ϕ operates entry-wise. The weight W^(l)_ij and bias b^(l)_i are i.i.d. and sampled from N(0, σ_w²/N) and N(0, σ_b²), respectively. The network function is represented by Eq. (1). For the residual network setting, refer to Appx. C.

$$ f(x^{in}) := P^{out}\,\phi\big(W^{(L)}\phi\big(\cdots \phi\big(W^{(1)}P^{in}x^{in} + b^{(1)}\big)\cdots\big) + b^{(L)}\big). \tag{1} $$

3.2 Background

Mean field theory. Mean field theory employs probabilistic methods to analyze the properties of random deep neural networks. It assumes that h^(l) follows a Gaussian, justified by the central limit theorem when the width N is sufficiently large [71]. Here, we review the mean field-based approach to analyzing the forward and backward dynamics of a network. Let L : R^d → R represent the loss function. The mean squared pre-activation E[(h^(l)_i)²] and gradient χ^(l) := E[(∂L(x^in)/∂x^(l)_i)²], where i denotes the neuron index, are calculated as follows [71, 81]:

$$ \mathbb{E}\big[(h^{(l)}_i)^2\big] = \sigma_w^2\,\mathbb{E}\big[\phi(h^{(l-1)}_i)^2\big] + \sigma_b^2, \qquad \chi^{(l)} = \sigma_w^2\,\mathbb{E}\big[\phi'(h^{(l+1)}_i)^2\big]\,\chi^{(l+1)}. \tag{2} $$

These equations represent the dynamics between adjacent layers. We can infer that gradients vanish when σ_w² E[ϕ′(h^(l+1)_i)²] < 1 and explode when σ_w² E[ϕ′(h^(l+1)_i)²] > 1, indicating that a network is trainable only if σ_w² E[ϕ′(h^(l+1)_i)²] ≈ 1.

Adversarial training. We define the adversarial loss as follows:

$$ \mathcal{L}_{adv}(x^{in}) := \max_{\|\eta\|_p \le \epsilon} \big\| f(x^{in}+\eta) - f(x^{in}) \big\|_q \tag{3} $$
$$ = \max_{\|\eta\|_p \le \epsilon} \big\| P^{out}\phi\big(W^{(L)}\phi(\cdots\phi(W^{(1)}P^{in}(x^{in}+\eta)+b^{(1)})\cdots)+b^{(L)}\big) - P^{out}\phi\big(W^{(L)}\phi(\cdots\phi(W^{(1)}P^{in}x^{in}+b^{(1)})\cdots)+b^{(L)}\big) \big\|_q, \tag{4} $$

where ϵ > 0 and p, q ∈ {1, 2, ∞}. Minimizing the adversarial loss minimizes the difference between the network outputs for natural and adversarial inputs. Networks are trained by minimizing the sum of the standard loss L_std : R^d × Y → R (e.g., the cross-entropy loss), where Y denotes a label set, and the adversarial loss L_adv. Mean field analysis typically assumes gradient independence for loss functions (Appx. B) [81]. We use this assumption for L_std, but not for L_adv. Although Eq. (3) differs from the standard adversarial loss based on cross-entropy [58], our definition is employed in more robust methods, e.g., TRADES [113], and is theoretically simpler to analyze. Therefore, the aforementioned loss is analyzed herein. However, even this simplified definition (Eq. (3)) poses a challenge for theoretical analysis due to the complex nested structure of a deep neural network (cf. Eq. (4)).

4 Theoretical framework

4.1 Limitations of existing mean field-based approaches

We propose a new theoretical framework based on mean field theory to analyze adversarial training. Here, we describe two limitations of existing mean field-based approaches, e.g., Eq. (2).

Layer-wise approach. Existing approaches focus on the dynamics between adjacent layers (cf. Eq. (2)). However, analyzing the adversarial loss (Eq. (3)) requires a framework that handles the probabilistic properties of an entire network instead of adjacent layers. This analysis becomes difficult due to the complex nested structure of networks (cf. Eqs. (1) and (4)).
For example, there is no clarity on the probabilistic behavior of f(x^in), the distribution of f(x^in + η) − f(x^in), or the dependence between f(x^in) and its inputs. We need a framework that disentangles the nested structure of networks and manages the probabilistic properties of an entire network. Recent mean field theory studies [15, 36, 69, 70, 100] involve concepts related to ours; a comparative analysis is given in Sec. 5.1.

Difficulty in analyzing input-parameter dependence. The analysis of adversarial training requires consideration of input-parameter dependence, since adversarial perturbations are designed based on network parameters (cf. Eq. (3)). However, this cannot be readily addressed using existing approaches because their frameworks (e.g., Eq. (2)) do not offer a clear view of the dependence between the perturbation η and the network parameters W^(1)_{1,1}, W^(1)_{1,2}, . . ., W^(L)_{N,N}. The proposed framework resolves these limitations and provides a simple network representation that allows us to capture the entire network property with clear input-parameter dependence.

4.2 Proposed framework

The current mean field-based approaches cannot capture the probabilistic properties of an entire network due to the complex nested structure of deep neural networks. Moreover, it is difficult to consider the dependence between inputs and the numerous parameters. To address these limitations, we employ a linear-like representation of a ReLU-like network and establish its probabilistic properties. As ReLU-like networks are piecewise linear, a vanilla ReLU-like network can be represented as:

$$ f(x^{in}) = J(x^{in})\,x^{in} + a(x^{in}), \tag{5} $$
$$ J(x^{in}) := P^{out}\,D\big(\phi'(h^{(L)}(x^{in}))\big)\,W^{(L)}\,D\big(\phi'(h^{(L-1)}(x^{in}))\big)\,W^{(L-1)} \cdots W^{(1)}\,P^{in}, \tag{6} $$

where D(·) denotes a diagonal matrix and a(x^in) is defined similarly to J(x^in) (cf. Eq. (A48)). For a residual network, refer to Eq. (A73). Importantly, this representation does not rely on approximations, e.g., Taylor expansions. As W^(l) and b^(l) are randomly sampled, J(x^in) and a(x^in) are a random matrix and a random vector, respectively. Unlike the original network definition (Eq. (1)), this representation (Eq. (5)) is non-nested and involves only the two parameters J(x^in) and a(x^in), thereby simplifying network analysis.

[Figure 1: Distribution of J(x^in)_{1,1} in a vanilla ReLU network with d = 1,000, K = 1, N = 5,000, L = 10, σ_w² = 2, and σ_b² = 0.01. The blue histogram represents the experimental results (10,000 samplings), and the orange curve is predicted by Thm 4.1.]

[Table 1: Values of β_{p,q} for each norm pair (p, q) with p, q ∈ {1, 2, ∞}. Under further assumptions, we can obtain equality of the adversarial loss rather than an inequality (upper bound): some entries hold with equality when ϵ is sufficiently small, and others when, additionally, K = 1. The numeric entries were not recoverable from the extraction; blank cells correspond to pairs for which no upper bound is derived.]

Remarkably, we show the following properties of J(x^in) and a(x^in):

Theorem 4.1 (Properties and distributions of J(x^in) and a(x^in)). Suppose that the width N is sufficiently large. Then, for any x^in ∈ R^d, (I) J(x^in) and a(x^in) are independent; (II) each entry of J(x^in) and a(x^in) is i.i.d. and follows the Gaussians below:

$$ J(x^{in})_{ij} \sim \mathcal{N}\!\left(0,\ \frac{\omega^L}{d}\right), \qquad a(x^{in})_i \sim \mathcal{N}\!\left(0,\ \alpha\sigma_b^2 \sum_{l=1}^{L} \omega^{L-l}\right), \tag{7} $$

where α := (u² + v²)/2 (cf. Defn 3.1) and ω is ω_v := ασ_w² for vanilla networks and ω_r := 1 + ασ_w² for residual networks.
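As a sanity check on Thm 4.1 (mirroring the Fig. 1 experiment), the following sketch resamples a vanilla ReLU network many times, computes the exact Jacobian entry J(x^in)_{1,1} for a fixed input by tracking the piecewise-linear activation masks, and compares the empirical variance against the ω^L/d of Eq. (7) as reconstructed above; the smaller d and N used here are our choices for speed, not the paper's.

```python
import numpy as np

def jacobian_entry(x_in, N, L, sigma_w2=2.0, sigma_b2=0.01, rng=None):
    """Exact J(x_in)_{1,1} of one vanilla ReLU network (Eq. (6)), with K = 1.
    M tracks the Jacobian of the current post-activation w.r.t. x_in."""
    rng = rng or np.random.default_rng()
    d = x_in.size
    P_in = rng.normal(0, np.sqrt(1 / d), (N, d))
    P_out = rng.normal(0, np.sqrt(1 / N), (1, N))
    x, M = P_in @ x_in, P_in
    for _ in range(L):
        W = rng.normal(0, np.sqrt(sigma_w2 / N), (N, N))
        b = rng.normal(0, np.sqrt(sigma_b2), N)
        h = W @ x + b
        mask = (h > 0).astype(float)            # phi'(h) for ReLU (u=1, v=0)
        x, M = np.maximum(h, 0.0), mask[:, None] * (W @ M)
    return float((P_out @ M)[0, 0])

d, N, L, alpha, sigma_w2 = 200, 500, 10, 0.5, 2.0
x_in = np.random.default_rng(0).normal(size=d)  # identical input every sampling
samples = [jacobian_entry(x_in, N, L, sigma_w2) for _ in range(300)]
omega = alpha * sigma_w2                        # omega_v = 1 at criticality
print(np.var(samples), omega**L / d)            # empirical vs. predicted variance
```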
A significance of this theorem lies in the fact that, despite being functions of x^in, the distributions of J(x^in) and a(x^in) do not depend on x^in.¹ In other words, although J(x^in) and a(x^in) are determined by (a fixed) x^in for an initialized network, they differ for each sampling of weights and biases, and the selection of these values is independent of x^in. Besides, J(x^in) and a(x^in) are independent, and their distributions are Gaussian, which exhibits convenient properties. A sketch of the proof is given in Appx. D; the formal proofs are in Appxs. E and F.

To validate Thm 4.1, we conducted a numerical experiment and present the results in Fig. 1. We randomly sampled 10,000 vanilla ReLU networks and computed J(x^in)_{1,1} for each network using the identical input x^in. Additional experimental results can be found in Appx. L.

Broader applicability. The proposed framework (Thm 4.1) manages an entire network using only two Gaussians. While we focus on adversarial training, Thm 4.1 can be valuable for other mean field-based analyses of deep neural networks. For example, contrastive learning [16] can be investigated, as it aims to minimize the distance between original and positive samples while maximizing it for negative samples. In this context, instead of considering the adversarial loss ||f(x^in + η) − f(x^in)||, we can analyze loss functions such as ||f(x^in_pos) − f(x^in_ori)|| and ||f(x^in_neg) − f(x^in_ori)||, where x^in_ori, x^in_pos, and x^in_neg represent the original, positive, and negative samples, respectively. The complex nested structure of a network makes it challenging to consider the difference between two network outputs; Thm 4.1 makes such analyses theoretically manageable.

5 Analysis of adversarial training

The proof of each theorem is described in Appx. G.

¹ This does not imply that the random variables J(x) and J(y) are identical for x ≠ y. In addition, J(x) and J(y) are not always independent. Please also refer to the empirical results in Appx. L.

5.1 Upper bounds of adversarial loss

As the ReLU-like network f is locally linear and its input Jacobian at x^in is J(x^in) (cf. Eq. (5)), we can consider a more tractable form of the adversarial loss than Eq. (4):

$$ \mathcal{L}_{adv}(x^{in}) \le \max_{x \in \mathbb{R}^d,\ \|\eta\|_p \le \epsilon} \|J(x)\,\eta\|_q = \epsilon \max_{x \in \mathbb{R}^d} \|J(x)\|_{p,q}, \tag{8} $$

where ||J(x)||_{p,q} := max_{||η||_p = 1} ||J(x)η||_q denotes the (p, q)-operator norm of J(x). Using Thm 4.1, which describes the properties of J(x), we transform Ineq. (8) and obtain the following:

Theorem 5.1 (Upper bounds of adversarial loss). Suppose that the input dimension d, output dimension K, and width N are sufficiently large. Then, for any x^in ∈ R^d, the following inequality holds:

$$ \mathcal{L}_{adv}(x^{in}) \le \epsilon\,\beta_{p,q}\,\omega^{L/2} = \begin{cases} \epsilon\,\beta_{p,q}\left(\dfrac{\alpha}{LN}\displaystyle\sum_{W \in \mathcal{W}} W^2\right)^{L/2} & \text{(vanilla)} \\[4pt] \epsilon\,\beta_{p,q}\left(1 + \dfrac{\alpha}{LN}\displaystyle\sum_{W \in \mathcal{W}} W^2\right)^{L/2} & \text{(residual)} \end{cases}, \tag{9} $$

where 𝒲 := {W^(1)_{1,1}, W^(1)_{1,2}, . . ., W^(L)_{N,N}} denotes the set of all network weights. The constant β_{p,q} for each norm pair (p, q) is described in Tab. 1.

For some choices of (p, q), we cannot derive upper bounds; thus, Tab. 1 contains blanks. Numerical experiments show the tightness of the bounds (cf. Fig. 2). The theorem indicates that (i) the bounds increase linearly with the perturbation budget ϵ, (ii) the effects of the input and output dimensions depend on the norms (p, q) (cf. Tab. 1), and (iii) the network depth L exponentially impacts the bounds, with ω = 1 as a threshold between order and chaos. Further, the squared sum of the weights in Ineq. (9) suggests that adversarial training exhibits a weight regularization effect, which is compared with ℓ2 weight regularization in Appx. I.
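To make Ineq. (8) concrete: for a ReLU-like network, the adversarial loss around x^in reduces to ϵ||J(x^in)||_{p,q}, i.e., an operator-norm computation. The sketch below (our illustration; the β_{p,q} constants of Tab. 1 are not reproduced) evaluates the closed-form (p, q)-operator norms of a Jacobian, e.g., one obtained from the sampling sketch above.

```python
import numpy as np

def op_norm(J, p, q):
    """(p, q)-operator norm max_{||eta||_p = 1} ||J eta||_q, closed-form pairs only."""
    if p == 1:                        # maximized by a one-hot eta: max column q-norm
        return np.linalg.norm(J, ord=q, axis=0).max()
    if (p, q) == (2, 2):              # largest singular value
        return np.linalg.svd(J, compute_uv=False)[0]
    if (p, q) == (np.inf, np.inf):    # eta = signs of the heaviest row: max row l1-norm
        return np.abs(J).sum(axis=1).max()
    raise NotImplementedError(f"no closed form implemented for {(p, q)}")

eps = 0.1
rng = np.random.default_rng(0)
# Stand-in for J(x_in): i.i.d. N(0, omega^L / d) entries with omega^L = 1, d = 1000.
J = rng.normal(0.0, np.sqrt(1.0 / 1000), (100, 1000))
for p, q in [(1, 1), (1, 2), (2, 2), (np.inf, np.inf)]:
    print((p, q), eps * op_norm(J, p, q))   # local evaluation of eps * ||J||_{p,q}
```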
Besides, we derive equality rather than inequality (upper bound) under specific assumptions (e.g., small ϵ) for some (p, q) in Appx. K, as indicated in Tab. 1.

The input and output dimensions d and K influence the bounds through the (p, q)-dependent parameter β_{p,q}. In Tab. 1, β_{p,q} displays a wide range of dependencies on d and K. When q = ∞, d significantly affects the two phases of the adversarial loss, where p = 2 marks the transition point from order (p = 1) to chaos (p = ∞). In contrast, under the realistic assumption d ≫ K, K affects the bounds negligibly. Interestingly, the dimensions do not impact the upper bounds when (p, q) = (2, ∞). In practice, we scale the perturbation budget and the adversarial loss according to the choices of p and q, respectively; the scaling effects are discussed in Appx. K.

Comparison with other studies. To analyze Ineq. (8), one can consider studies on the global Lipschitz constants of networks in certified adversarial defenses [3, 18, 93] and spectral regularization [60, 108]. A key difference is that our probabilistic approach contrasts with their deterministic approach. By imposing probabilistic constraints on network parameters, we obtain exponentially tighter, more interpretable, and more theoretically manageable bounds, which facilitates the discussion in the subsequent sections. A mathematical comparison is described in Appx. H. Results from [15, 36, 69, 70, 100] can be used to analyze the Jacobian's singular value distribution. Compared with their approaches, which are limited to (p, q) = (2, 2), the proposed method offers greater generality and flexibility, providing upper bounds for various (p, q). Moreover, Thm 4.1 enables the derivation of equality instead of inequality (upper bound) in Props G.2 and G.3, which is not achievable using the aforementioned approaches because they cannot incorporate the perturbation-Jacobian dependence. Moreover, we do not require their assumption that the variance V[h^(l)_i] is constant for all l ∈ [L], which is often difficult to satisfy. Further, we refer to [78], which established a theoretical link between adversarial training and the (p, q)-operator norm of a Jacobian. Their findings support Ineq. (8) in training scenarios using heuristic attacks, e.g., projected gradient descent [58]. In this study, we derive concrete upper bounds beyond their theoretical link, enabling further investigation of adversarial training properties.

5.2 Time evolution of weight variance

Weight variance plays a critical role in determining deep neural network properties [49, 81]. We substitute the adversarial loss definition (Eq. (3)) with L_adv := ϵβ_{p,q}ω(t)^{L/2}, where t ≥ 0 denotes the continuous training step. Considering gradient descent with an infinitely small learning rate (gradient flow), the model parameter θ(t) at step t is updated as:

$$ \dot{\theta}(t) = -\frac{\partial (\mathcal{L}_{std} + \mathcal{L}_{adv})}{\partial \theta(t)}. \tag{10} $$

We make the following assumption.

Assumption 5.2. For 0 ≤ t ≤ T ≪ N, the model parameters are independent, and each weight and bias follows the Gaussians N(0, σ_w²(t)/N) and N(0, σ_b²(t)), respectively.

This assumption ensures that the properties of the model parameters remain close to their initial values during the early stages of training (t ≤ T). Under Asm 5.2, the original and proposed mean field theories (Thm 4.1) remain valid during training. For a moderately small value of T, Asm 5.2 is not strong because the model parameters change minimally and retain near-initialized states under sufficiently small learning rates.
Recent neural tangent kernel studies partially support this assumption [4, 46, 55], and random network theories are known to align well with fully trained networks [25, 84]. For example, T = 160 is reasonable in a specific training setting (cf. Fig. 4). We now summarize the other assumptions for reference:

Assumption 5.3. The input dimension d, output dimension K, and width N are sufficiently large. Asm B.1 is applied to the standard loss function L_std. The adversarial loss is defined as L_adv := ϵβ_{p,q}ω(t)^{L/2} (cf. Thm 5.1).

Based on the aforementioned settings, we obtain the time evolution of the weight variance.

Theorem 5.4 (Weight time evolution of a vanilla network in adversarial training). Suppose that Asms 5.2 and 5.3 hold. Then, the time evolution of σ_w² of a vanilla network in adversarial training is given by:

$$ \sigma_w^2(t) = \left(1 - \frac{\epsilon\,\alpha\,\beta_{p,q}\,\omega_v(0)^{L/2-1}}{N}\,t\right)\sigma_w^2(0). \tag{11} $$

A similar result is obtained for residual networks (Thm G.9). The theorem reveals that the weight variance decreases linearly with t, which can be attributed to the weight regularization effect of adversarial training (cf. Sec. 5.1). In addition, the norm pair (p, q) affects only the time-invariant constant β_{p,q}, and the dynamics of the weight variance can be represented as a consistent function of t regardless of (p, q). In other words, adversarial training exhibits consistent weight dynamics irrespective of the norm selection, with only a scale factor varying.

5.3 Vanilla networks are not adversarially trainable under mild conditions

We show that in adversarial training, vanilla networks can fit a training dataset only in limited cases (small depth and large width), whereas residual networks can in most cases. This result suggests that residual networks are better suited for adversarial training. First, we present the trainability condition based on the concept in [81, 105].

Definition 5.5 ((M, m)-trainability condition). A network is said to be (M, m)-trainable if it satisfies m ≤ χ^(0)/χ^(L) ≤ M, where 0 ≤ m ≤ 1 and M ≥ 1.²

The value χ^(l) denotes the squared length of the gradient in the l-th layer. A near-zero χ^(0)/χ^(L) suggests gradient vanishing, while a large value implies gradient explosion. Hence, Defn 5.5 is directly linked to successful training. In contrast to the existing condition χ^(l−1)/χ^(l) ≈ 1 [81, 105], Defn 5.5 incorporates M and m for the subsequent discussion. Then, we establish specific (M, m)-trainability conditions for ReLU-like networks.

Lemma 5.6 (Vanilla and residual (M, m)-trainability conditions). Suppose that the width N is sufficiently large. Then, the (M, m)-trainability conditions for vanilla and residual networks are respectively given by:

$$ m^{1/L} \le \alpha\sigma_w^2 \le M^{1/L} \ \text{(vanilla)}, \qquad \alpha\sigma_w^2 \le M^{1/L} - 1 \ \text{(residual)}. \tag{12} $$

² Trainability depends on other factors, such as the number of parameters; however, these are not considered herein to maintain simplicity. It is empirically known that this condition represents trainability well [81].

Using the weight time evolution (Thm 5.4) and the (M, m)-trainability condition (Lemma 5.6), the following theorem can be readily derived.

Theorem 5.7 (Vanilla networks are not adversarially trainable). Consider a vanilla network. Suppose that Asms 5.2 and 5.3 hold, the (M, m)-trainability condition holds at t = 0, and ασ_w²(0) = 1. If

$$ T \ge \frac{(1 - m^{1/L})\,N}{\epsilon\,\alpha\,\beta_{p,q}}, \tag{13} $$

then there exists 0 < τ ≤ T such that the (M, m)-trainability condition does not hold for τ ≤ t ≤ T.
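The right-hand side of Ineq. (13) gives an explicit step budget beyond which trainability is predicted to fail. The helper below is a sketch evaluating it, with β_{p,q} left as a free parameter since the constants of Tab. 1 are not reproduced here; the printed values match the qualitative depth/width behavior of Fig. 4.

```python
def trainability_horizon(L, N, eps, alpha=0.5, beta_pq=1.0, m=1e-4):
    """Right-hand side of Ineq. (13): if T exceeds this value, Thm 5.7 predicts
    that a vanilla network violates the (M, m)-trainability condition during
    adversarial training. beta_pq depends on (p, q) (Tab. 1); free parameter here."""
    return (1.0 - m ** (1.0 / L)) * N / (eps * alpha * beta_pq)

# Depth shrinks the horizon (m^{1/L} -> 1 as L grows); width stretches it linearly:
for L, N in [(10, 256), (20, 256), (20, 512)]:
    print(L, N, round(trainability_horizon(L, N, eps=0.3), 1))
```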
Theorem 5.7 indicates the potential untrainability of vanilla networks in adversarial training, even when they satisfy the (M, m)-trainability condition at initialization, in contrast to standard training, where extremely deep networks can be trained if initialized properly [81]. This issue arises from the inconsistency between the trainability condition of a vanilla network, i.e., m^{1/L} ≤ ασ_w² (cf. Lemma 5.6), and the monotonically decreasing nature of σ_w²(t) in adversarial training (cf. Thm 5.4). As stated in Asm 5.2, we assume that T is small. Therefore, if the right-hand side of Ineq. (13) becomes large, the assumption and Thm 5.7 are violated. In summary, for large L (many layers) and ϵ (a large perturbation constraint), vanilla networks are not adversarially trainable. Moreover, a large N (a wide network) can mitigate this issue in vanilla networks. For example, the vanilla network with L = 20, N = 256, and ϵ = 0.3 is not adversarially trainable; however, when the width is increased to 512, it becomes trainable (cf. Fig. 4). In contrast, we can claim the following for residual networks:

Theorem 5.8 (Residual networks are adversarially trainable). Consider a residual network. Suppose that Asms 5.2 and 5.3 hold, the (M, m)-trainability condition holds at t = 0, and ασ_w²(0) ≤ 1. Then, the (M, m)-trainability condition always holds for 0 ≤ t ≤ T.

This result follows because the (M, m)-trainability condition for residual networks has no lower bound (cf. Lemma 5.6) and σ_w² decreases monotonically (cf. Thm G.9). Besides, residual networks are adversarially trainable even without careful weight initialization (Prop G.10). These theorems highlight the robust stability of adversarial training in residual networks.

5.4 Degradation of network capacity

We demonstrate that adversarial training degrades network capacity. We use the Fisher-Rao norm as a metric [56], with alternative metrics discussed in Appx. K. The Fisher-Rao norm is defined as:

$$ \|w\|_{FR}^2 := w^\top F(x^{in})\,w, \tag{14} $$

where w := (W^(1)_{11}, W^(1)_{12}, . . ., W^(L)_{NN}) represents the vector of all the network weights and F(x^in) denotes the empirical Fisher information matrix when only one training data point is considered (for simplicity). The Fisher-Rao norm is preferred over other norm-based capacity metrics [8, 65-67] owing to its invariance to node-wise rescaling [56]. We present the following theorem based on [49]:

Theorem 5.9 (Adversarial training degrades network capacity). Consider a vanilla network. Suppose that Asms 5.2 and 5.3 hold, Asm B.1 is applied to the network output, the (M, m)-trainability condition holds at t = 0, and ασ_w²(0) = 1. Assume ||x^in||₂² = d and σ_b²(t) = 0. Then, the expectation of the Fisher-Rao norm is given by:

$$ \mathbb{E}\big[\|w(t)\|_{FR}\big] = LK\left(1 - \frac{\epsilon\,\alpha\,\beta_{p,q}\,L}{N}\,t\right). \tag{15} $$

A related result for residual networks is presented in Thm G.14. As described in [63], adversarial robustness necessitates high capacity. However, Thm 5.9 indicates that the capacity decreases linearly with the step t in adversarial training. To address this conflict, a large depth (large L), which increases the initial capacity as Θ(L), can be considered, but this accelerates the degradation speed as Θ(L²). To preserve capacity, we must increase the width N accordingly. Consequently, to achieve high capacity in adversarial training, it is necessary to increase not only the number of layers L but also the width N so as to keep L²/N constant.
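The following sketch evaluates the capacity schedule of Thm 5.9 as reconstructed above (β_{p,q} again left free), illustrating the depth/width trade-off: the initial capacity grows as Θ(L), the absolute per-step decay as Θ(L²/N), so keeping L²/N constant preserves the schedule.

```python
def fisher_rao_capacity(t, L, N, K, eps, alpha=0.5, beta_pq=1.0):
    """E[||w(t)||_FR] from Eq. (15): L*K*(1 - eps*alpha*beta_pq*L*t / N),
    clipped at 0; valid only for the early training steps covered by Asm 5.2."""
    return L * K * max(0.0, 1.0 - eps * alpha * beta_pq * L * t / N)

# Doubling L alone quadruples the absolute decay rate; scaling N with L^2 restores it:
for L, N in [(10, 1000), (20, 1000), (20, 4000)]:
    print(L, N, [round(fisher_rao_capacity(t, L, N, K=10, eps=0.3), 1)
                 for t in (0, 50, 100)])
```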
Although Thms 5.9 and G.14 are established only in the early stages of training, numerical experiments show that the adversarial robustness after full training is significantly influenced by the network width (cf. Tabs. A5 and A6).

[Figure 2: Adversarial loss (Eq. (3)) in vanilla networks with N = 40,000, K = 100, L = 3, and ϵ = 0.1, plotted against the input dimension d for panels (p, q) = (1, 1), (1, 2), (2, 2), and others. We generated 100 adversarial examples for each input dimension. The blue curves and bands represent the mean and standard deviation of the adversarial loss, respectively, whereas the orange curves (upper bounds) are predicted based on Thm 5.1. Some samples slightly exceed the upper bounds because we used a finite network width (cf. Appx. L).]

[Figure 3: Time evolution of the weight variance σ_w² in the vanilla network with L = 10, p = ∞, q = ∞, and ϵ = 0.3, showing standard training, ℓ2-regularized training, and adversarial training with N = 1,000 and N = 2,000. We used N = 1,000 for standard and ℓ2-regularized training. The solid lines represent experimental results; the dashed lines are predicted by Thm 5.4.]

[Figure 4: Heat map of the training accuracy of vanilla networks over width N (100-800) and depth, with p = ∞, q = ∞, and ϵ = 0.3. The dashed lines represent the condition on T in Thm 5.7 with m = 0.0001. In standard training, high accuracy is obtained across all depths and widths (cf. Fig. A19).]

6 Experimental results

We validate Thms 5.1, 5.4, and 5.7 via numerical experiments. The vanilla ReLU networks were initialized with σ_w² = 2 and σ_b² = 0.01 to meet the (M, m)-trainability conditions (Lemma 5.6). To verify Thms 5.4 and 5.7, we used MNIST [26]. Setup details and more results are given in Appx. L.

Verification of Thm 5.1. We created adversarial examples for initialized networks and computed the adversarial loss (Eq. (3)). As shown in Fig. 2, the upper bounds in Thm 5.1 are considerably tight. Some samples slightly exceed the upper bounds because we used a finite network width, while an infinite width is assumed (cf. Appx. L).

Verification of Thm 5.4. We trained vanilla networks normally (with and without ℓ2 regularization) and adversarially. The time evolution of the weight variance is shown in Fig. 3. Adversarial training significantly reduces the weight variance, whereas a wide width (i.e., large N) suppresses this reduction. The validity of Thm 5.4, which forms the basis of results such as Thms 5.7 and 5.9, supports our theorems.

Verification of Thm 5.7. We trained vanilla networks with various depth and width settings and monitored the training accuracy. As shown in Fig. 4, vanilla networks struggle to fit the training dataset when the depth is large and the width is small; increased width helps fitting. Although we currently lack a theoretical prediction of the boundary between the trainable and untrainable regions determined by Ineq. (13), empirical evidence suggests that T = 160 is relevant.

7 Limitations

The mean field theory offers valuable insight into network training. However, its applicability is restricted to the initial stages of training. Although recent studies suggest empirically [25, 84] and theoretically [46] that the analysis of early-stage training extends well to full training, the strict relationship is yet to be explored.
Our results share these limitations. Although some theorems accurately capture the behavior during the initial stages of training and even after full training (cf. Tabs. A5 and A6), as training progresses, some theorems begin to diverge from the actual behavior (cf. Fig. A17). Another caveat is that the mean field theory assumes an infinite network width, which is practically infeasible. Empirically, our theorems hold well when the width exceeds approximately 1,000 (cf. Fig. A7), while Thm 5.1 requires a larger width of approximately 10,000 or more for (p, q) = (2, 2) (cf. Fig. A14). These limitations derive from the mean field theory itself and are not unique to our study. Despite these limitations, we consider that this study provides a powerful theoretical framework that extends the applicability of the mean field theory to various training methods, and that the results obtained for adversarial training are insightful.

8 Conclusions

We proposed a framework based on the mean field theory and conducted a theoretical analysis of adversarial training. The proposed framework addresses the limitations of existing mean field-based approaches, which cannot handle the probabilistic properties of an entire network or the dependence between network inputs and parameters [71, 81]. Based on this framework, we examined adversarial training from various perspectives, unveiling upper bounds of the adversarial loss, relationships between the adversarial loss and the network input/output dimensions, the time evolution of the weight variance, trainability conditions, and the degradation of network capacity. The theorems of this study were validated via numerical experiments. The proposed theoretical framework is highly versatile and can help analyze various training methods, e.g., contrastive learning.

Acknowledgments and Disclosure of Funding

We would like to thank Huishuai Zhang for useful discussions. This work was supported by JSPS KAKENHI Grant Numbers JP23KJ0789 and JP22K17962, by JST ACT-X Grant Number JPMJAX23C7, Japan, and by Microsoft Research Asia.

References

[1] J.-B. Alayrac, J. Uesato, P.-S. Huang, A. Fawzi, R. Stanforth, and P. Kohli. Are labels required for improving adversarial robustness? In NeurIPS, volume 32, 2019.
[2] L. Amsaleg, J. Bailey, D. Barbe, S. Erfani, M. E. Houle, V. Nguyen, and M. Radovanović. The vulnerability of learning to adversarial perturbation increases with intrinsic dimensionality. In WIFS, pages 1-6, 2017.
[3] C. Anil, J. Lucas, and R. Grosse. Sorting out Lipschitz function approximation. In ICML, pages 291-301, 2019.
[4] S. Arora, S. S. Du, W. Hu, Z. Li, R. R. Salakhutdinov, and R. Wang. On exact computation with an infinitely wide neural net. In NeurIPS, volume 32, 2019.
[5] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, pages 274-283, 2018.
[6] P. Awasthi, N. Frank, and M. Mohri. Adversarial learning guarantees for linear hypotheses and neural networks. In ICML, pages 431-441, 2020.
[7] P. Bartlett, S. Bubeck, and Y. Cherapanamjeri. Adversarial examples in multi-layer random ReLU networks. In NeurIPS, volume 34, pages 9241-9252, 2021.
[8] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In NeurIPS, volume 30, 2017.
[9] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3(Nov):463-482, 2002.
[10] Y. Blumenfeld, D. Gilboa, and D. Soudry.
A mean field theory of quantized deep networks: The quantization-depth trade-off. In NeurIPS, volume 32, 2019.
[11] S. Bubeck, Y. Cherapanamjeri, G. Gidel, and R. Tachet des Combes. A single gradient step finds adversarial examples on random two-layers neural networks. In NeurIPS, volume 34, pages 10081-10091, 2021.
[12] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM WS, pages 3-14, 2017.
[13] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In SSP, pages 39-57, 2017.
[14] Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness. In NeurIPS, 2019.
[15] M. Chen, J. Pennington, and S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In ICML, pages 873-882, 2018.
[16] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597-1607, 2020.
[17] Y. Cho and L. Saul. Kernel methods for deep learning. In NeurIPS, volume 22, 2009.
[18] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, pages 854-863, 2017.
[19] J. Cohen, E. Rosenfeld, and Z. Kolter. Certified adversarial robustness via randomized smoothing. In ICML, pages 1310-1320, 2019.
[20] F. Croce and M. Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, pages 2206-2216, 2020.
[21] A. Damianou and N. D. Lawrence. Deep Gaussian processes. In AISTATS, pages 207-215, 2013.
[22] A. Daniely. SGD learns the conjugate kernel class of the network. In NeurIPS, volume 30, 2017.
[23] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NeurIPS, volume 29, 2016.
[24] A. Daniely and H. Schacham. Most ReLU networks suffer from ℓ2 adversarial perturbations. In NeurIPS, volume 33, pages 6629-6636, 2020.
[25] G. De Palma, B. Kiani, and S. Lloyd. Adversarial robustness guarantees for random deep neural networks. In ICML, pages 2522-2534, 2021.
[26] L. Deng. The MNIST database of handwritten digit images for machine learning research. Signal Process. Mag., 29(6):141-142, 2012.
[27] Z. Deng, L. Zhang, K. Vodrahalli, K. Kawaguchi, and J. Y. Zou. Adversarial training helps transfer learning via better representations. In NeurIPS, volume 34, pages 25179-25191, 2021.
[28] G. W. Ding, Y. Sharma, K. Y. C. Lui, and R. Huang. MMA training: Direct input space margin maximization through adversarial training. In ICLR, 2020.
[29] E. Dobriban, H. Hassani, D. Hong, and A. Robey. Provable tradeoffs in adversarially robust classification. arXiv:2006.05161, 2020.
[30] A. Fawzi, H. Fawzi, and O. Fawzi. Adversarial vulnerability for any classifier. In NeurIPS, volume 31, 2018.
[31] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers' robustness to adversarial perturbations. ML, 107(3):481-508, 2018.
[32] A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In NeurIPS, volume 29, 2016.
[33] K. Fukushima. Cognitron: A self-organizing multilayered neural network. Biol. Cybern., 20(3):121-136, 1975.
[34] A. Galloway, A. Golubeva, T. Tanay, M. Moussa, and G. W. Taylor. Batch normalization is a cause of adversarial vulnerability. In ICML WS, 2019.
[35] R.
Gao, T. Cai, H. Li, C.-J. Hsieh, L. Wang, and J. D. Lee. Convergence of adversarial training in overparametrized neural networks. In NeurIPS, volume 32, 2019.
[36] D. Gilboa, B. Chang, M. Chen, G. Yang, S. S. Schoenholz, E. H. Chi, and J. Pennington. Dynamical isometry and a mean field theory of LSTMs and GRUs. arXiv:1901.08987, 2019.
[37] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow. Adversarial spheres. In ICLR WS, 2018.
[38] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
[39] S. Gowal, K. D. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli. Scalable verified training for provably robust image classification. In ICCV, pages 4842-4851, 2019.
[40] S. Gowal, S.-A. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. A. Mann. Improving robustness using generated data. In NeurIPS, 2021.
[41] S. Hayou, A. Doucet, and J. Rousseau. On the selection of initialization and activation function for deep neural networks. arXiv:1805.08266, 2018.
[42] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[43] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NeurIPS, volume 30, 2017.
[44] J. Hron, Y. Bahri, J. Sohl-Dickstein, and R. Novak. Infinite attention: NNGP and NTK for deep attention networks. In ICML, pages 4376-4386, 2020.
[45] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700-4708, 2017.
[46] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In NeurIPS, volume 31, 2018.
[47] A. Javanmard, M. Soltanolkotabi, and H. Hassani. Precise tradeoffs in adversarial training for linear regression. In COLT, pages 2034-2078, 2020.
[48] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial logit pairing. arXiv:1803.06373, 2018.
[49] R. Karakida, S. Akaho, and S.-i. Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. In AISTATS, pages 1032-1041, 2019.
[50] J. Khim and P.-L. Loh. Adversarial risk bounds via function transformation. arXiv:1810.09519, 2018.
[51] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[52] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
[53] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, pages 1097-1105, 2012.
[54] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as Gaussian processes. In ICLR, 2018.
[55] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In NeurIPS, volume 32, 2019.
[56] T. Liang, T. Poggio, A. Rakhlin, and J. Stokes. Fisher-Rao metric, geometry, and complexity of neural networks. In AISTATS, pages 888-896, 2019.
[57] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30, page 3, 2013.
[58] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
[59] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E.
Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. In ICLR, 2018.
[60] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[61] A. Montanari and Y. Wu. Adversarial examples in random neural networks with general activations. arXiv:2203.17209, 2022.
[62] A. Najafi, S.-i. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data. In NeurIPS, volume 32, 2019.
[63] P. Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv:1901.00532, 2019.
[64] R. M. Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29-53. Springer, 1996.
[65] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In NeurIPS, volume 30, 2017.
[66] B. Neyshabur, R. R. Salakhutdinov, and N. Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In NeurIPS, volume 28, 2015.
[67] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In COLT, pages 1376-1401, 2015.
[68] R. Novak, L. Xiao, J. Lee, Y. Bahri, G. Yang, J. Hron, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, 2019.
[69] J. Pennington, S. Schoenholz, and S. Ganguli. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In NeurIPS, volume 30, 2017.
[70] J. Pennington, S. Schoenholz, and S. Ganguli. The emergence of spectral universality in deep networks. In AISTATS, pages 1924-1932, 2018.
[71] B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. In NeurIPS, volume 29, 2016.
[72] R. Rade and S.-M. Moosavi-Dezfooli. Helper-based adversarial training: Reducing excessive margin to achieve a better accuracy vs. robustness trade-off. In ICML, 2021.
[73] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. In ICML, pages 2847-2854, 2017.
[74] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In ICLR, 2018.
[75] A. Raghunathan, S. M. Xie, F. Yang, J. Duchi, and P. Liang. Understanding and mitigating the tradeoff between robustness and accuracy. In ICML, 2020.
[76] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang. Adversarial training can hurt generalization. In ICML WS, 2019.
[77] S.-A. Rebuffi, S. Gowal, D. A. Calian, F. Stimberg, O. Wiles, and T. Mann. Fixing data augmentation to improve adversarial robustness. arXiv:2103.01946, 2021.
[78] K. Roth, Y. Kilcher, and T. Hofmann. Adversarial training is a form of data-dependent operator norm regularization. In NeurIPS, volume 33, pages 14973-14985, 2020.
[79] A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
[80] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In NeurIPS, 2018.
[81] S. S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In ICLR, 2017.
[82] A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In ICLR, 2019.
[83] A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein.
Adversarial training for free! In NeurIPS, 2019.
[84] C.-J. Simon-Gabriel, Y. Ollivier, L. Bottou, B. Schölkopf, and D. Lopez-Paz. First-order adversarial vulnerability of neural networks and input dimension. In ICML, pages 5809-5817, 2019.
[85] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
[86] A. Sinha, H. Namkoong, R. Volpi, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
[87] H. Sompolinsky, A. Crisanti, and H.-J. Sommers. Chaos in random neural networks. Phys. Rev. Lett., 61(3):259, 1988.
[88] D. Su, H. Zhang, H. Chen, J. Yi, P.-Y. Chen, and Y. Gao. Is robustness the cost of accuracy? A comprehensive study on the robustness of 18 deep image classification models. In ECCV, pages 631-648, 2018.
[89] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
[90] F. Tramer, N. Carlini, W. Brendel, and A. Madry. On adaptive attacks to adversarial example defenses. In NeurIPS, volume 33, pages 1633-1645, 2020.
[91] J. A. Tropp. Topics in sparse approximation. The University of Texas at Austin, 2004.
[92] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry. Robustness may be at odds with accuracy. In ICLR, 2019.
[93] Y. Tsuzuku, I. Sato, and M. Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In NeurIPS, volume 31, 2018.
[94] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
[95] C. Williams. Computing with infinite networks. In NeurIPS, volume 9, 1996.
[96] E. Wong and Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, pages 5286-5295, 2018.
[97] E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial training. In ICLR, 2020.
[98] B. Wu, J. Chen, D. Cai, X. He, and Q. Gu. Do wider neural networks really help adversarial robustness? In NeurIPS, volume 34, pages 7054-7067, 2021.
[99] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
[100] L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. Schoenholz, and J. Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML, pages 5393-5402, 2018.
[101] L. Xiao, J. Pennington, and S. Schoenholz. Disentangling trainability and generalization in deep neural networks. In ICML, pages 10462-10472, 2020.
[102] Y. Xing, Q. Song, and G. Cheng. On the generalization properties of adversarial training. In AISTATS, pages 505-513, 2021.
[103] G. Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760, 2019.
[104] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz. A mean field theory of batch normalization. In ICLR, 2019.
[105] G. Yang and S. Schoenholz. Mean field residual networks: On the edge of chaos. In NeurIPS, volume 30, 2017.
[106] G. Yang and S. S. Schoenholz. Deep mean field theory: Layerwise variance and width variation as methods to control gradient explosion. OpenReview, 2018.
[107] D. Yin, R. Kannan, and P. Bartlett. Rademacher complexity for adversarially robust generalization.
In ICML, pages 7085-7094, 2019.
[108] Y. Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv:1705.10941, 2017.
[109] S. Zagoruyko and N. Komodakis. Wide residual networks. In BMVC, pages 1-12, 2016.
[110] R. Zhai, T. Cai, D. He, C. Dan, K. He, J. Hopcroft, and L. Wang. Adversarially robust generalization just requires more unlabeled data. arXiv:1906.00555, 2019.
[111] D. Zhang, T. Zhang, Y. Lu, Z. Zhu, and B. Dong. You only propagate once: Accelerating adversarial training via maximal principle. In NeurIPS, 2019.
[112] H. Zhang, D. Yu, Y. Lu, and D. He. Adversarial noises are linearly separable for (nearly) random neural networks. arXiv:2206.04316, 2022.
[113] H. Zhang, Y. Yu, J. Jiao, E. Xing, L. El Ghaoui, and M. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, pages 7472-7482, 2019.
[114] J. Zhang, X. Xu, B. Han, G. Niu, L. Cui, M. Sugiyama, and M. Kankanhalli. Attacks which do not kill training make adversarial learning stronger. In ICML, pages 11278-11287, 2020.
[115] Y. Zhang, O. Plevrakis, S. S. Du, X. Li, Z. Song, and S. Arora. Over-parameterized adversarial training: An analysis overcoming the curse of dimensionality. In NeurIPS, volume 33, pages 679-688, 2020.

Table A2: Notation. While h^(l) is a function that takes x^in as input, we omit the argument and write h^(l) := h^(l)(x^in) for notational simplicity. Other symbols sometimes follow this convention.

| Notation | Description | Cf. |
|---|---|---|
| d ∈ N | Input dimension of the network | Sec. 3.1 |
| K ∈ N | Output dimension of the network | Sec. 3.1 |
| L ∈ N | Number of layers in the network, i.e., network depth | Sec. 3.1 |
| N ∈ N | Number of neurons in a layer, i.e., network width | Sec. 3.1 |
| u, v ∈ R | Slopes of a ReLU-like function | Defn 3.1 |
| p ∈ {1, 2, ∞} | Perturbation constraint norm, \|\|η\|\|_p ≤ ϵ | Eq. (3) |
| q ∈ {1, 2, ∞} | Output difference norm, \|\|f(x^in + η) − f(x^in)\|\|_q | Eq. (3) |
| m ∈ [0, 1] | (M, m)-trainability condition | Defn 5.5 |
| ϵ ≥ 0 | Perturbation constraint, \|\|η\|\|_p ≤ ϵ | Eq. (3) |
| α ≥ 0 | α := (u² + v²)/2 | Thm 4.1 |
| β_{p,q} ≥ 0 | Time-invariant constant determined by (p, q) | Thm 5.1 |
| t ≥ 0 | Continuous training step | Eq. (10) |
| T ≥ 0 | Upper limit of training steps | Asm 5.2 |
| σ_w² ≥ 0 | Weight variance, W_ij ~ N(0, σ_w²/N) | Sec. 3.1 |
| σ_b² ≥ 0 | Bias variance, b_i ~ N(0, σ_b²) | Sec. 3.1 |
| ω ≥ 0 | ω_v := ασ_w² (vanilla) and ω_r := 1 + ασ_w² (residual) | Thm 4.1 |
| M ≥ 1 | (M, m)-trainability condition | Defn 5.5 |
| x^in ∈ R^d | Input vector | Sec. 3.1 |
| η ∈ R^d | Perturbation, \|\|η\|\|_p ≤ ϵ | Eq. (3) |
| W^(l) ∈ R^{N×N} | l-th weight, W^(l)_ij ~ N(0, σ_w²/N) | Sec. 3.1 |
| 𝒲 | Set of all network weights, {W^(1)_11, W^(1)_12, . . ., W^(L)_NN} | Thm 5.1 |
| w ∈ R^{LN²} | Vector of all the weights, (W^(1)_11, . . ., W^(L)_NN) | Eq. (14) |
| b^(l) ∈ R^N | l-th bias, b^(l)_i ~ N(0, σ_b²) | Sec. 3.1 |
| P^in ∈ R^{N×d} | Random matrix for input adjustment, P^in_ij ~ N(0, 1/d) | Sec. 3.1 |
| P^out ∈ R^{K×N} | Random matrix for output adjustment, P^out_ij ~ N(0, 1/N) | Sec. 3.1 |
| P^(l) ∈ R^{N×N} | Random matrix in the l-th shortcut, P^(l)_ij ~ N(0, 1/N) | Eq. (A16) |
| ϕ : R → R | ReLU-like activation, ϕ(z) := uz (z ≥ 0); vz (z < 0) | Defn 3.1 |
| χ^(l) : R^d → R | Mean squared l-th gradient, E[(∂L(x^in)/∂x^(l)_i)²] | Sec. 3.1 |
| h^(l) : R^d → R^N | l-th pre-activation, h^(l) := W^(l) x^(l−1)(x^in) + b^(l) | Sec. 3.1 |
| x^(l) : R^d → R^N | l-th post-activation, x^(l) := ϕ(h^(l)(x^in)) | Sec. 3.1 |
| F : R^d → R^{LN²×LN²} | Empirical Fisher information matrix | Eq. (14) |
| f : R^d → R^K | Overall network, f(x^in) := P^out g(P^in x^in) | Sec. 3.1 |
| g : R^N → R^N | L-trainable-layer ReLU-like network | Sec. 3.1 |
| J : R^d → R^{K×d} | Slope of a piecewise linear region at x^in | Thm 4.1 |
| a : R^d → R^K | Bias of a piecewise linear region at x^in | Thm 4.1 |
| D : R^N → R^{N×N} | Diagonal matrix | Eq. (5) |
| L_std : R^d × Y → R | Standard loss function, where Y represents a label set | Sec. 3.2 |
| L_adv : R^d → R≥0 | Adversarial loss function | Eq. (3) |
A Additional related work

Please also refer to Secs. 2 and 5.1.

A.1 Random networks and mean field theory

Understanding general deep neural networks is challenging due to the non-convexity of loss surfaces and the stochastic nature of optimization. Researchers have therefore explored random networks, which have random parameters instead of trained parameters. Random networks have been primarily studied in three areas: compositional kernels [17, 22, 23], neural network Gaussian processes [21, 44, 54, 59, 64, 68, 95, 103], and mean field theory. These fields share close relationships, particularly between neural network Gaussian processes and mean field theory. For example, the forward dynamics in mean field theory (Eq. (2)) are equivalent to the kernel representation of a Gaussian process [64, 71]. Furthermore, the accuracy of a Gaussian process is significantly influenced by the edge of chaos, which has been studied in mean field theory [68].

We provide a more detailed review of the mean field theory. Mean field theory for neural networks was first introduced in [87] and later extended in [71, 81]. This theory examines the signal propagation, dynamics, and trainability of random networks in two phases: ordered and chaotic. The ordered phase is characterized by decreasing layer output variance and vanishing gradients during backpropagation, while the chaotic phase is characterized by expanding variance and exploding gradients. Effective training of deep neural networks occurs near the boundary between these two phases [71, 81]. Researchers have applied mean field theory to study various network architectures, including dropout [81], batch normalization [104], residual networks [105, 106], recurrent networks [15], quantized networks [10], and Swish [41]. Recent research [15, 36, 69, 70, 100] has also utilized mean field theory to analyze dynamical isometry, where all singular values of the Jacobian are one. Some of these findings can be applied to analyze Ineq. (8) in this study, with detailed comparisons provided in Sec. 5.1. Other studies have explored network representation power [49, 71, 101]. This work utilized the proof idea of [49] in the derivation of Thms 5.9 and G.14.

In this study, we employed mean field theory to analyze adversarial training behavior. However, the original theory has two limitations that make it unsuitable for this purpose (cf. Sec. 4.1). To address these limitations, we introduced a new mean field-based framework (cf. Sec. 4.2). Our theory is also applicable to mean field-based analyses of other deep learning methods beyond adversarial training.

A.2 Adversarial examples

A.2.1 Adversarial examples in random neural networks

We summarize the literature on adversarial examples in random neural networks [7, 11, 24, 25, 61, 84, 112]. It has been reported that adversarial perturbations can be found through gradient flow in most random ReLU networks with decreasing layer widths [24]. This result was extended to two-layer random networks with width greater than the input dimension [11] and generalized to random ReLU networks with constant depth and wide width [7]. Recently, similar results were presented without width restrictions and for locally Lipschitz continuous activations [61].
Additionally, it has been demonstrated that adversarial noises generated by a single-step attack are linearly separable in a two-layer random network and in the neural tangent kernel regime [112]. More information on [25, 84] can be found in Appx. A.2.2. Although the context differs somewhat, we mention [34], which provides empirical evidence that batch normalization leads to adversarial vulnerability. This observation aligns with previous results from mean field theory suggesting that batch normalization causes gradient explosion [104].

In this study, we focused on early-stage adversarial training properties rather than adversarial examples. A key difference is that we primarily investigated the maximum difference between network outputs for standard and adversarial inputs, rather than misclassification, which was the main focus of previous studies [7, 11, 24, 61]. Nonetheless, some findings in Sec. 5.1 can be applied to the understanding of adversarial examples in random networks. In addition, in Appx. K, we proved the existence of adversarial examples in random networks to demonstrate the effectiveness of Thm 4.1.

A.2.2 Adversarial examples and input dimension

We present research investigating the relationship between adversarial examples and input dimensions [2, 25, 30-32, 37, 38, 82, 84]. Apart from [38] (which omitted weight scaling for simplicity; its results are essentially consistent with those of other studies once the scaling is considered), most studies have indicated that adversarial example threats increase with the square root of the input dimension [25, 30-32, 37, 82, 84]. Some studies have focused on specific data distributions or simple classifiers [30-32, 37, 82], while others targeted random networks [25, 84]. The study in [25] examined the distance from the decision boundary in random networks and concluded that adversarial examples deceive classifiers more easily as the square root of the input dimension grows. Another study utilized the first-order Taylor expansion of a loss function to analyze the loss gradient with respect to an input [84]. While a direct comparison between our work and [84] is challenging due to differing loss functions, our theorems offer two advantages. First, we considered both the input and output dimensions, whereas they addressed only input dimensions. Second, our theorems are applicable to residual networks, whereas their assumptions are not.

In this study, we examined the relationship between the input dimension and adversarial risk in Sec. 5.1. Our analysis did not depend on specific data distributions or architectures, except for ReLU-like activations and random parameters, which is advantageous. In addition, we assessed it for a wide variety of norms, demonstrating that the impact of adversarial examples is not limited to the square root of the input dimension alone (cf. Tab. 1). Moreover, our bound considers not only the input dimension but also the number of classes.

A.3 Adversarial training

Numerous empirical adversarial defenses have been proposed; however, most are ineffective against stronger attacks [5, 12, 13, 20, 90]. Some studies have focused on theoretically certified defenses [19, 39, 43, 74, 96], but these are often only applicable to specific or small networks, or are weaker than empirical methods. Adversarial training [38, 58] is considered the most effective empirical defense against various attacks [5, 20, 90].
This involves training a classifier using a dataset that contains natural images and adversarial examples [38], or solely adversarial examples [58]. Various forms of adversarial training exist, including more effective loss functions [28, 48, 58, 94, 113], time-efficient frameworks [83, 97, 111], and procedures that preserve the clean accuracy [72, 113, 114]. Recent studies have demonstrated that combining adversarial training and data augmentation with unlabeled or generated data results in high robustness [40, 77]. Despite significant progress in empirical methods, a theoretical understanding of adversarial training remains incomplete.

We provide a summary of theoretical studies on adversarial training, including robust generalization research. Several studies have investigated the trade-off between robust and clean accuracy [29, 47, 75, 76, 92, 113], initially observed empirically [58, 88]. The trade-off has been proven inevitable even in the infinite data limit, assuming data is constructed from a single moderately correlated feature and many weakly correlated features [92]. Similar claims were made in [113] for different data distributions. It has been reported that the trade-off in finite data settings for linear and slightly more complex predictors can be mitigated with additional unlabeled data [76]. While comparable results were reported in [75], a contrasting study also exists [47]. Moreover, the trade-off has been shown to depend on class imbalance in a dataset using Gaussian classification models [29].

Various studies have examined the generalization gap of adversarially robust models [6, 50, 102, 107]. For example, a ResNet [42] trained adversarially on CIFAR-10 [52] achieved 96% robust training accuracy but only 47% robust test accuracy [107]. The lower bound of adversarial Rademacher complexity [9] has been shown to increase with the square root of the input dimension for linear classifiers trained with ℓ∞ adversarial examples [107], which was later extended in [6]. Similar results using a tree transformation approach have been reported in [50]. However, the influence of network width and depth, or of other training settings, such as training with ℓ1 adversarial examples, on the bound remains unclear. The generalization gap has also been investigated for linear regression models and two-layer neural networks with lazy training in data interpolation contexts [102].

Numerous studies have explored the sample complexity of robust generalization [1, 14, 62, 80, 110]. Robust learning may require a larger sample size than standard learning [80]. Subsequent research has indicated that unlabeled data can be sufficient to achieve robust generalization [1, 14, 62, 110]. The majority of these studies have focused on data sampled from Gaussian mixture models [1, 14, 110]. Moreover, the sample complexity of distributionally robust learning with perturbations in the Wasserstein ball has been examined [62], and some studies have suggested that the trade-off between robustness and accuracy can be mitigated using additional unlabeled data [75, 76]. Recent findings have suggested that robust classification requires complex classifiers [63], which is supported by results in the neural tangent kernel regime [35]. In the context of transfer learning, robust classifiers have been shown to perform better [27]. Furthermore, a certifiable adversarial training procedure has been established, constraining perturbations by the distributional Wasserstein distance [86].
The aforementioned outcomes rely on specific data distributions, such as Gaussian or heuristically tuned distributions, or simplistic models, such as linear classifiers or two-layer neural networks. The lack of theoretical research on adversarial training in deep neural networks stems from the complexity of training these models, including non-convex loss surfaces and stochastic optimization. Recent research has employed the neural tangent kernel regime to address these challenges, demonstrating that adversarial training can yield a robust network with near-zero robust loss [35]. This result was later extended to two-layer neural networks [115], eliminating the assumption in [35] that requires exponentially large width and runtime.

In this study, we conducted a theoretical analysis of adversarial training, focusing on the time evolution of network parameters during training, the conditions promoting adversarial trainability, and the differences between adversarial and standard training. Our analysis targeted deep neural networks without relying on any assumptions about the data distribution, and employed ℓp norms practically used as perturbation constraints, rather than more impractical metrics such as the distributional Wasserstein distance. To address the training difficulty of deep neural networks, we utilized mean field theory.

Finally, we mention a report arguing, from a perturbation instability perspective, that increasing network width does not necessarily improve robustness in adversarial training [98]. This may seem to contradict our results, which suggest that a wider network helps the model maintain capacity during adversarial training, implying greater robustness in wider networks. However, these two claims are compatible. Robustness is determined by both perturbation instability (a negative effect) and network capacity (a positive effect). While the negative effect of width appears dominant in the experiments of [98] on CIFAR-10 and WideResNet, the positive effect appeared more prevalent in our experiments on MNIST, Fashion-MNIST, and fully connected networks with or without shortcuts. The dominant factor may depend on the dataset and model architecture.

B Gradient independence assumption

The gradient independence assumption was first introduced in [81] for backward dynamics in mean field theory and later refined in [105]. We provide a definition based on [105] as follows:

Assumption B.1 (Gradient independence assumption [105]). (a) We use a different set of weights for backpropagation than those used to compute the network outputs, but sampled i.i.d. from the same distributions. (b) For any loss L, the gradient at layer l, ∂L/∂x^(l), is independent of h^(l) and x^(l−1).

Although not strictly accurate, this assumption has been empirically found to hold well [81, 105]. It has been rigorously justified for specific architectures, including vanilla, residual, and convolutional networks [103]. In this study, we applied the assumption to the standard loss function but not to the adversarial loss function. In addition, we regard a network output as a loss and apply Asm B.1 to the network output in Sec. 5.4.

C Setting of residual networks

The mean field theory for residual networks is studied in [105]. Although they employed trainable weights in shortcuts, we employ an untrainable matrix. Formally, the pre- and post-activations in the l-th layer of a residual network are defined as follows:

h^(l) := W^(l) x^(l−1) + b^(l),  x^(l) := x^(l−1) + P^(l) ϕ(h^(l)), (A16)

where P^(l) ∈ ℝ^{N×N} is an untrained random matrix.
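The forward recursion of Eq. (A16) is straightforward to simulate. The following NumPy sketch propagates a random input through a few residual blocks under the initialization described here; the width, depth, and variances are illustrative choices, not the paper's experimental settings.

```python
import numpy as np

# Minimal sketch of the residual block in Eq. (A16):
#   W^(l)_ij ~ N(0, sigma_w^2/N), b^(l)_i ~ N(0, sigma_b^2),
#   untrained shortcut P^(l)_ij ~ N(0, 1/N),
# with the ReLU-like activation of Defn 3.1 (slopes u and v).

def phi(z, u=1.0, v=0.0):
    return np.where(z >= 0.0, u * z, v * z)

def residual_block(x_prev, sigma_w2, sigma_b2, rng):
    N = x_prev.shape[0]
    W = rng.normal(0.0, np.sqrt(sigma_w2 / N), size=(N, N))
    b = rng.normal(0.0, np.sqrt(sigma_b2), size=N)
    P = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
    h = W @ x_prev + b             # h^(l) := W^(l) x^(l-1) + b^(l)
    return x_prev + P @ phi(h)     # x^(l) := x^(l-1) + P^(l) phi(h^(l))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
for _ in range(10):                # propagate through ten blocks
    x = residual_block(x, sigma_w2=0.1, sigma_b2=0.01, rng=rng)
print(np.mean(x ** 2))             # grows like omega_r^l (cf. Eq. (A55))
```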
Each entry of P^(l) is i.i.d. and sampled from the Gaussian N(0, 1/N) at initialization. The random matrix P^(l) is introduced to simplify probabilistic calculations and is applied in accordance with Asm B.1(a). The definition of x^(l) given in Eq. (A16) is slightly different from the original definition in [105]. Therefore, we derive the basic probabilistic properties of the pre- and post-activations again, based on Eq. (A16). Following a similar approach to [71, 105], the mean squared pre- and post-activations can be calculated as follows:

E[(h^(l)_i)^2] = σ_w^2 E[(x^(l−1)_i)^2] + σ_b^2,  E[(x^(l)_i)^2] = E[(x^(l−1)_i)^2] + E[ϕ(h^(l)_i)^2]. (A17)

Additionally, following [81, 105], the mean squared gradients with respect to the pre- and post-activations can be calculated as follows:

E[(∂L/∂h^(l)_i)^2] = σ_w^2 E[ϕ'(h^(l)_i)^2] E[(∂L/∂h^(l+1)_i)^2], (A18)
χ^(l) = (1 + σ_w^2 E[ϕ'(h^(l+1)_i)^2]) χ^(l+1). (A19)

We can derive Eq. (A18) under Asm B.1 as follows:

E[(∂L/∂h^(l)_i)^2] = E[(Σ_j (∂L/∂h^(l+1)_j) W^(l+1)_{ji} ϕ'(h^(l)_i))^2] = σ_w^2 E[ϕ'(h^(l)_i)^2] E[(∂L/∂h^(l+1)_j)^2]. (A20-A22)

We can derive Eq. (A19) under Asm B.1 as follows. Using

∂x^(l+1)_j/∂x^(l)_i = δ_{ij} + Σ_{k=1}^N P^(l+1)_{jk} ϕ'(h^(l+1)_k) W^(l+1)_{ki}, (A26-A28)

we expand

χ^(l) = E[(Σ_j (∂L/∂x^(l+1)_j)(∂x^(l+1)_j/∂x^(l)_i))^2] (A23)

into three terms (A24). The term arising from δ_{ij} equals χ^(l+1). The cross term vanishes because

Σ_{j,k} E[P^(l+1)_{jk}] E[(∂L/∂x^(l+1)_j)^2 ϕ'(h^(l+1)_k) W^(l+1)_{ki}] = 0. (A29-A30)

The squared term is rearranged as

E[(Σ_{j,k} (∂L/∂x^(l+1)_j) P^(l+1)_{jk} ϕ'(h^(l+1)_k) W^(l+1)_{ki})^2] = Σ_{j,k} E[(∂L/∂x^(l+1)_j)^2] E[(P^(l+1)_{jk})^2] E[ϕ'(h^(l+1)_k)^2] E[(W^(l+1)_{ki})^2] = σ_w^2 E[ϕ'(h^(l+1)_k)^2] χ^(l+1). (A31-A33)

Summing the three terms yields

χ^(l) = (1 + σ_w^2 E[ϕ'(h^(l+1)_k)^2]) χ^(l+1). (A25)

D Sketch of proof for Thm 4.1

In this section, we provide a plain but informal proof of Thm 4.1 as a stepping stone to the formal proof. We present two levels of explanation: the most straightforward one, and another that is closer to the formal proof.

First, we introduce the simplest proof of the counterintuitive independence of the distribution of J(xin) from xin. As indicated in Eq. (6), the definition of J(xin) includes ϕ'(h^(l)(xin)); thus, the distribution appears to depend on xin. Here, we consider the distribution of ϕ'(h(x)), where h(x) := wx with w ∼ N(0, σ_w^2) and x ∈ ℝ. From the following proposition, although ϕ'(h(x)) is defined as a function of x, its distribution is independent of x. This characteristic holds even for J(xin), which encompasses multiple ϕ'(h^(l)(xin)).

Proposition D.1. Let ϕ(z) := uz (z ≥ 0); vz (z < 0) be a ReLU-like function, w ∼ N(0, σ_w^2) be a Gaussian variable, and x ∈ ℝ be a fixed real number. Then, the distribution of ϕ'(wx) is independent of x.

Proof. Consider the derivative of ϕ, given by ϕ'(z) := u (z ≥ 0); v (z < 0). The input to ϕ', i.e., wx, follows the Gaussian N(0, x^2 σ_w^2). The probability of a zero-mean Gaussian being greater than or equal to zero equals that of it being less than zero, regardless of the value of x. Therefore, for any given x, the probabilities of ϕ'(wx) = u and ϕ'(wx) = v are invariant to x. Consequently, the claim is established.

Then, let us consider Thm 4.1 in a neural network with one activation and one weight layer, f(xin) := P^out ϕ(W^(1) P^in xin). In this setting, the network Jacobian is represented by J(xin) := P^out D(ϕ'(W^(1) P^in xin)) W^(1) P^in. Moreover, we assume that uncorrelated Gaussian variables are independent. This assumption is incorrect.
Uncorrelated Gaussian variables are not necessarily independent (cf. Remark E.3). However, to build intuition for the formal proof, we assume it here. The simplified Thm 4.1 and its proof are as follows:

Proposition D.2. Consider a neural network f(xin) := P^out ϕ(W^(1) P^in xin). Suppose that the width N is sufficiently large. Assume that uncorrelated Gaussian variables are independent. Then, for any xin ∈ ℝ^d, each entry of J(xin) := P^out D(ϕ'(W^(1) P^in xin)) W^(1) P^in is i.i.d. and follows the Gaussian N(0, ασ_w^2/d).

Proof. First, we prove that each entry of J(xin) is i.i.d. and follows a Gaussian. To this end, we consider the probabilistic properties of P^in xin, D(ϕ'(W^(1) P^in xin)), W^(1) P^in, and P^out D(ϕ'(W^(1) P^in xin)) W^(1) P^in, in this order. Then, we derive the concrete distribution, i.e., its mean and variance, by comparing the distributions of f(xin)^2 and (J(xin) xin)^2.

We first consider P^in xin =: x^(0). Recall that xin is a deterministic vector and P^in is a random matrix whose entries are i.i.d. Gaussian. Since a weighted sum of independent Gaussian variables follows a Gaussian, each entry of x^(0) is i.i.d. and follows a Gaussian. Then, let us consider W^(1) x^(0). The i-th entry of W^(1) x^(0) is expressed as Σ_{j=1}^N W^(1)_{ij} x^(0)_j. Since W^(1) is a random matrix with i.i.d. entries, W^(1)_{ij} x^(0)_j is i.i.d. with respect to j. By the central limit theorem with infinite N, the i-th entry of W^(1) x^(0) follows a Gaussian. In addition, the entries of W^(1) x^(0) follow the same Gaussian. Moreover, since E[(Σ_{j=1}^N W^(1)_{ij} x^(0)_j)(Σ_{j=1}^N W^(1)_{i'j} x^(0)_j)] = 0 holds for i ≠ i', different entries of W^(1) x^(0) are uncorrelated and thus independent (this is originally incorrect but guaranteed by the assumption here). In conclusion, W^(1) x^(0) is a random vector with i.i.d. Gaussian entries. Naturally, D(ϕ'(W^(1) P^in xin)) is a diagonal matrix with i.i.d. entries. Similar to the discussion above, W^(1) P^in =: A is a random matrix with i.i.d. entries.

Finally, consider P^out D(ϕ'(W^(1) P^in xin)) W^(1) P^in. For notational simplicity, we denote D := D(ϕ'(W^(1) P^in xin)); thus, P^out D(ϕ'(W^(1) P^in xin)) W^(1) P^in = P^out D A. The entry in the i-th row and j-th column of this matrix is expressed as Σ_{k=1}^N P^out_{ik} D_k A_{kj}. Note that P^out is independent of DA and its entries are i.i.d. Moreover, by network symmetry, the dependence between D_k and A_{kj} is invariant to k. Thus, P^out_{ik} D_k A_{kj} is i.i.d. with respect to k, and each entry of this matrix follows the same Gaussian by the central limit theorem. Moreover, since E[(Σ_{k=1}^N P^out_{ik} D_k A_{kj})(Σ_{k=1}^N P^out_{i'k} D_k A_{kj'})] = 0 holds for (i, j) ≠ (i', j'), different entries of P^out D A are uncorrelated and thus independent. Therefore, each entry of J(xin) is i.i.d. and follows a Gaussian.

Next, we derive the mean and variance of J(xin). Trivially, E[J(xin)_{ij}] = 0 since the mean of an entry of P^out is zero. We derive the variance by comparing E[f(xin)_i^2] and E[(Σ_{j=1}^d J(xin)_{ij} xin_j)^2]. Recall f(xin)_i := Σ_{j=1}^d J(xin)_{ij} xin_j. As a preliminary, for a Gaussian variable z ∼ N(0, σ^2) with probability density function g(z; σ^2), we derive the following equation:

E_{z∼N(0,σ^2)}[ϕ(z)^2] = ∫ ϕ(z)^2 g(z; σ^2) dz (A34)
= ∫_0^∞ u^2 z^2 g(z; σ^2) dz + ∫_{−∞}^0 v^2 z^2 g(z; σ^2) dz (A35)
= ((u^2 + v^2)/2) σ^2 (A36)
= ασ^2. (A37)

In addition, if P^out and z are independent, then

E[(Σ_{j=1}^N P^out_{ij} ϕ(z)_j)^2] = Σ_{j=1}^N E[(P^out_{ij})^2] E[ϕ(z)_j^2] = ασ^2. (A38)

As shown in the discussion above, each entry of W^(1) x^(0) follows the same Gaussian. Moreover, we can simply derive its variance as σ_w^2 ‖xin‖^2 / d.
Thus,

E[f(xin)_i^2] = E[(Σ_{j=1}^N P^out_{ij} ϕ(W^(1) x^(0))_j)^2] = ασ_w^2 ‖xin‖^2 / d. (A39)

Since f(xin)_i := Σ_{j=1}^d J(xin)_{ij} xin_j, E[f(xin)_i^2] is also expanded as:

E[f(xin)_i^2] = E[(Σ_{j=1}^d J(xin)_{ij} xin_j)^2] = Σ_{j=1}^d E[J(xin)_{ij}^2](xin_j)^2 = E[J(xin)_{ij}^2] ‖xin‖^2. (A40)

Comparing the two equations above, we obtain E[J(xin)_{ij}^2] = ασ_w^2/d. Therefore, the claim is established.

E Derivation of Thm 4.1 for vanilla networks

E.1 Preliminary

First, we introduce some lemmas. These results are more general and not restricted to the context of neural networks. The notation in this section is independent of that in other sections. We refer to a random matrix in which the mean of every entry is zero as a zero-mean matrix. Formally:

Definition E.1 (zero-mean matrix/vector). A random matrix/vector A is called a zero-mean matrix/vector if E[A_{ij}] = 0 for any i and j.

For zero-mean matrices, the following properties hold:

Remark E.2 (Properties of zero-mean matrices). Let A be a zero-mean matrix. If B is a zero-mean matrix, then their sum, A + B, is a zero-mean matrix. If B is a random matrix independent of A, then their products, AB and BA, are zero-mean matrices.

Then, we review several properties of Gaussian variables.

Remark E.3 (Properties of Gaussian variables). The following statements hold: Any linear combination of multiple Gaussian variables follows a Gaussian if and only if they are jointly Gaussian. A weighted sum of independent Gaussian variables is Gaussian distributed. Jointly Gaussian variables are uncorrelated if and only if they are independent.

Next, we examine the properties of random matrices/vectors with i.i.d. Gaussian entries, called Gaussian matrices/vectors. The formal definition is as follows:

Definition E.4 (Gaussian matrix/vector). A matrix/vector is called a Gaussian matrix/vector if all its entries are i.i.d. and follow a Gaussian.

Gaussian matrices and vectors satisfy the following lemmas.

Lemma E.5. If A and B are independent Gaussian matrices, then A + B is a Gaussian matrix. In particular, when A and B are zero-mean, A + B is a zero-mean Gaussian matrix.

Proof. By Remarks E.2 and E.3, this is trivial.

The following lemmas are derived for the proof of Thm 4.1.

Lemma E.6. Suppose that random matrices A ∈ ℝ^{m×n} and B ∈ ℝ^{n×l} satisfy the following properties: (a) A and B are independent. (b) A is a zero-mean matrix. (c) All the entries of A are i.i.d. (d) All the entries of B are identically distributed. (e) Any two entries of B from different rows are independent. (f) For any i ∈ [n], j, k ∈ [l], j ≠ k, the dependence between B_{ij} and B_{ik} is invariant to i, j, k. (g) For any i ∈ [n], j, k ∈ [l], j ≠ k, E[B_{ij} B_{ik}] = 0 holds. Then, for a sufficiently large n, AB is a zero-mean Gaussian matrix.

Proof. By (a), (b), and Remark E.2, AB is a zero-mean matrix. A linear combination of all the entries of AB is expressed as

c := Σ_{i=1}^m Σ_{j=1}^l s_{ij} (AB)_{ij} = Σ_{i=1}^m Σ_{j=1}^l s_{ij} Σ_{k=1}^n A_{ik} B_{kj} = Σ_{k=1}^n (Σ_{i=1}^m Σ_{j=1}^l s_{ij} A_{ik} B_{kj}),

where s_{ij} ∈ ℝ is a constant. By (a), (c), (d), (e), and (f), Σ_{i=1}^m Σ_{j=1}^l s_{ij} A_{ik} B_{kj} is i.i.d. with respect to k. By the central limit theorem, c follows a Gaussian. By Remark E.3, all the entries of AB are jointly Gaussian. Moreover, by (c) and (d), the entries of AB follow the same Gaussian. By (a), (b), (c), (e), and (g), the covariance between any two entries of AB is zero. Thus, by Remark E.3, the claim is established.

Lemma E.7. Suppose that random matrices A ∈ ℝ^{m×n}, B ∈ ℝ^{n×l1}, and C ∈ ℝ^{n×l2} satisfy the following properties: (a) A, B, and C are independent. (b) A is a zero-mean matrix.
(c) All the entries of A are i.i.d. (d) All the entries of B are identically distributed. (e) All the entries of C are identically distributed. (f) Any two entries of B from different rows are independent. (g) Any two entries of C from different rows are independent. (h) For any i ∈ [n], j, k ∈ [l1], j ≠ k, the dependence between B_{ij} and B_{ik} is invariant to i, j, k. (i) For any i ∈ [n], j, k ∈ [l2], j ≠ k, the dependence between C_{ij} and C_{ik} is invariant to i, j, k. (j) For any i ∈ [n], j ∈ [l1], k ∈ [l2], E[B_{ij} C_{ik}] = 0 holds. Then, for a sufficiently large n, AB and AC are independent.

Proof. By (a), (c), (d), (e), (f), (g), (h), and (i), similar to Lemma E.6, all the entries of AB and AC are jointly Gaussian. By (a), (b), (c), and (j), the covariance between any two entries of AB and AC is zero. Thus, by Remark E.3, the claim is established.

Lemma E.8. Let w, x ∈ ℝ^m be Gaussian vectors. Let ϕ'(z) := u (z ≥ 0); v (z < 0). Then, for a sufficiently large m, ϕ'(w^⊤x)w and x are independent.

Proof. We prove P[ϕ'(w^⊤x)w = a | x = b] = P[ϕ'(w^⊤x)w = a]. For any x, w = a/u + x/√m or w = a/v − x/√m satisfies ϕ'(w^⊤x)w = a. Since m is sufficiently large, w = a/u + x/√m and w = a/v − x/√m asymptotically approach w = a/u and w = a/v, respectively. Thus, the claim is established.

E.2 Definitions of J(xin) and a(xin) for vanilla networks

Here, we define (derive) J(xin) and a(xin) for vanilla networks. The l-th pre- and post-activations are defined as follows:

h^(l)(xin) := W^(l) x^(l−1)(xin) + b^(l),  x^(l)(xin) := ϕ(h^(l)(xin)). (A42)

Using ϕ(h^(l)(xin)) = D(ϕ'(h^(l)(xin))) h^(l)(xin), the l-th pre-activation can be rearranged as follows:

h^(l)(xin) = W^(l) D(ϕ'(h^(l−1)(xin))) h^(l−1)(xin) + b^(l). (A43)

Because ReLU-like networks are piecewise linear, the l-th pre-activation is also represented as follows:

h^(l)(xin) = J^(l)(xin) x^(0)(xin) + a^(l)(xin). (A44)

Substituting Eq. (A44) into Eq. (A43), the equation can be rearranged as follows:

h^(l)(xin) = W^(l) D(ϕ'(h^(l−1)(xin))) J^(l−1)(xin) x^(0)(xin) + W^(l) D(ϕ'(h^(l−1)(xin))) a^(l−1)(xin) + b^(l). (A45)

Comparing Eq. (A44) and Eq. (A45), the following equations can be derived:

J^(l)(xin) := W^(l) D(ϕ'(h^(l−1)(xin))) J^(l−1)(xin), (A46)
a^(l)(xin) := W^(l) D(ϕ'(h^(l−1)(xin))) a^(l−1)(xin) + b^(l), (A47)

where J^(1)(xin) := W^(1) and a^(1)(xin) := b^(1). Finally, J(xin) and a(xin) are defined as follows:

f(xin) = J(xin) xin + a(xin), (5)
J(xin) := P^out D(ϕ'(h^(L)(xin))) J^(L)(xin) P^in, (6)
a(xin) := P^out D(ϕ'(h^(L)(xin))) a^(L)(xin). (A48)

E.3 Main derivation

First, we remark on several fundamental properties of the weights, biases, random projections, and pre-activations in a network.

Remark E.9. By the definitions in Sec. 3.1, the following statements hold for any l ∈ [L]: (Gaussian matrix/vector) W^(l), P^in, and P^out are zero-mean Gaussian matrices, and b^(l) is a zero-mean Gaussian vector. (Variance) The entry variances of W^(l), P^in, and P^out are σ_w^2/N, 1/d, and 1/N, respectively. (Independence) W^(l) and b^(l) are independent of each other and of any random variable in the layers before the l-th. In addition, P^out is independent of any random variable other than the network output. (Dependence between pre-activation and bias) For i = j, the dependence between h^(l)_i and b^(l)_i is invariant to i. For i ≠ j, h^(l)_i and b^(l)_j are independent.

Lemma E.10. Consider a vanilla network. For any l ∈ [L], the following hold: (a) J^(l)(xin) is a zero-mean Gaussian matrix. (b) a^(l)(xin) is a zero-mean Gaussian vector. (c) J^(l)(xin) and a^(l)(xin) are independent.

Proof. We prove the claim by induction. As a preliminary, see Remark E.9.
The case l = 1 is trivial. Suppose that all the claims hold for l − 1 ∈ [L − 1]. (a) By Lemma E.6, the claim is established. Note that E[ϕ'(h^(l)_i)^2 J^(l)(xin)_{ij} J^(l)(xin)_{ik}] = E[ϕ'(h^(l)_i)^2] E[J^(l)(xin)_{ij}] E[J^(l)(xin)_{ik}] = 0 since N is sufficiently large. (b) By Lemma E.6, the claim is established. (c) By Lemma E.7, the claim is established.

Lemma E.11. The mean squared network output is given by:

E[f(xin)_i^2] = ω^L ‖xin‖_2^2 / d + ασ_b^2 Σ_{i=1}^L ω^{i−1}. (A49)

Proof. By Remark E.9, the mean squared pre- and post-activations are calculated as follows (cf. Eq. (A34)):

E[(h^(l)_i)^2] = σ_w^2 E[(x^(l−1)_i)^2] + σ_b^2, (A50)
E[(x^(l)_i)^2] = αE[(h^(l)_i)^2] (vanilla);  E[(x^(l−1)_i)^2] + αE[(h^(l)_i)^2] (residual). (A51)

Recursively applying Eqs. (A50) and (A51), the mean squared post-activation in a vanilla network is calculated as follows:

E[(x^(l)_i)^2] = αE[(h^(l)_i)^2] = ω_v E[(x^(l−1)_i)^2] + ασ_b^2 = ω_v^l E[(x^(0)_i)^2] + ασ_b^2 Σ_{i=1}^l ω_v^{i−1}. (A52)

The mean squared post-activation in a residual network is calculated as follows:

E[(x^(l)_i)^2] = E[(x^(l−1)_i)^2] + α(σ_w^2 E[(x^(l−1)_i)^2] + σ_b^2) (A53)
= ω_r E[(x^(l−1)_i)^2] + ασ_b^2 (A54)
= ω_r^l E[(x^(0)_i)^2] + ασ_b^2 Σ_{i=1}^l ω_r^{i−1}. (A55)

Using x^(0) := P^in xin, the above results are expanded as follows:

E[(x^(l)_i)^2] = ω^l Σ_{j=1}^d E[(P^in_{ij})^2](xin_j)^2 + ασ_b^2 Σ_{i=1}^l ω^{i−1} = ω^l ‖xin‖_2^2 / d + ασ_b^2 Σ_{i=1}^l ω^{i−1}. (A56)

The mean squared network output equals the mean squared L-th post-activation:

E[f(xin)_i^2] = Σ_{j=1}^N E[(P^out_{ij})^2] E[(x^(L)_j)^2] = E[(x^(L)_i)^2]. (A57)

By Eqs. (A56) and (A57), the claim is established.

Theorem 4.1 (Properties and distributions of J(xin) and a(xin)). Suppose that the width N is sufficiently large. Then, for any xin ∈ ℝ^d, (I) J(xin) and a(xin) are independent; (II) each entry of J(xin) and a(xin) is i.i.d. and follows the Gaussian below:

J(xin)_{ij} ∼ N(0, ω^L/d),  a(xin)_i ∼ N(0, ασ_b^2 Σ_{l=1}^L ω^{l−1}), (7)

where α := (u^2 + v^2)/2 (cf. Defn 3.1) and ω is ω_v := ασ_w^2 for vanilla networks and ω_r := 1 + ασ_w^2 for residual networks.

Proof. Similar to Lemma E.8, P^in and a(xin) are independent. Similar to Lemmas E.10 and F.2, J(xin) and a(xin) are a zero-mean Gaussian matrix and a zero-mean Gaussian vector, respectively, and J(xin) and a(xin) are independent. The mean squared network output can be expanded as follows:

E[f(xin)_i^2] = E[(Σ_{j=1}^d J(xin)_{ij} xin_j + a(xin)_i)^2] (A58)
= Σ_{j=1}^d Σ_{k=1}^d E[J(xin)_{ij} J(xin)_{ik}] xin_j xin_k + 2 Σ_{j=1}^d E[J(xin)_{ij} a(xin)_i] xin_j + E[a(xin)_i^2] (A59)
= Σ_{j=1}^d E[J(xin)_{ij}^2](xin_j)^2 + E[a(xin)_i^2]. (A60)

Using the symmetry of the entries of J(xin) and a(xin), the equation can be simplified as follows:

E[f(xin)_i^2] = E[J(xin)_{ij}^2] ‖xin‖_2^2 + E[a(xin)_i^2]. (A61)

Comparing with Lemma E.11, Eq. (7) is obtained. Thus, the claim is established.

F Derivation of Thm 4.1 for residual networks

F.1 Definitions of J(xin) and a(xin) for residual networks

Similar to Appx. E.2, we define (derive) J(xin) and a(xin) for residual networks. For notational simplicity, we omit the argument xin. First, we represent x^(l) as follows:

x^(l) = (I + V^(l)) x^(0) + c^(l), (A62)

where V^(0) := 0 and c^(0) := 0. With V^(l) and c^(l), we can represent h^(l) as follows:

h^(l) = W^(l) x^(l−1) + b^(l) (A63)
= W^(l)((I + V^(l−1)) x^(0) + c^(l−1)) + b^(l) (A64)
= W^(l)(I + V^(l−1)) x^(0) + W^(l) c^(l−1) + b^(l). (A65)

Then, we derive recurrence representations of V^(l) and c^(l) as follows:

x^(l) = x^(l−1) + P^(l) ϕ(h^(l)) (A66)
= (I + V^(l−1)) x^(0) + c^(l−1) + P^(l) D(ϕ'(h^(l)))(W^(l)(I + V^(l−1)) x^(0) + W^(l) c^(l−1) + b^(l)). (A67)

For reference, we denote

U^(l) := P^(l) D(ϕ'(h^(l))) W^(l),  d^(l) := P^(l) D(ϕ'(h^(l))) b^(l).
(A68)

With the notation above,

x^(l) = (I + V^(l−1)) x^(0) + c^(l−1) + U^(l)(I + V^(l−1)) x^(0) + U^(l) c^(l−1) + d^(l) (A69)
= (I + U^(l))(I + V^(l−1)) x^(0) + (I + U^(l)) c^(l−1) + d^(l) (A70)
= (I + U^(l) + V^(l−1) + U^(l) V^(l−1)) x^(0) + c^(l−1) + U^(l) c^(l−1) + d^(l). (A71)

Thus,

V^(l) := U^(l) + V^(l−1) + U^(l) V^(l−1),  c^(l) := c^(l−1) + U^(l) c^(l−1) + d^(l). (A72)

Finally, J(xin) and a(xin) are defined as follows:

J(xin) := P^out (I + V^(L)) P^in,  a(xin) := P^out c^(L). (A73)

F.2 Main derivation

First, we note the properties of the random projections in shortcuts.

Remark F.1. By definition (cf. Appx. C), the following statements hold for any l ∈ [L]: (Gaussian matrix) P^(l) is a zero-mean Gaussian matrix. (Variance) The entry variance of P^(l) is 1/N. (Independence) P^(l) is independent of W^(l), b^(l), and h^(l), and of any random variable in the layers before the l-th.

Finally, we introduce a lemma similar to Lemma E.10. The subsequent discussion to derive Thm 4.1 is the same as the proof in Appx. E.3.

Lemma F.2. Consider a residual network. For any l ∈ [L], the following statements hold: (a) J^(l)(xin) is a zero-mean Gaussian matrix. (b) a^(l)(xin) is a zero-mean Gaussian vector. (c) J^(l)(xin) and a^(l)(xin) are independent.

Proof. As a preliminary, please refer to Remarks E.9 and F.1 and Lemma E.10. Note that the effects of all the random variables in the layers before the l-th, such as P^(l−1), h^(l−1), W^(l−1), and b^(l−1), are aggregated into x^(l−1). By Lemma E.8, U^(l) and d^(l) are independent of U^(l') and d^(l') for any l' ≠ l. Similarly, as V^(l) and c^(l) consist of U^(1), ..., U^(l), d^(1), ..., d^(l), U^(l+1) and d^(l+1) are independent of V^(l) and c^(l). In addition, similar to Lemma E.10, U^(l) and d^(l) are independent by Lemma E.7. We prove the claim by induction. Assume that V^(l−1) is a Gaussian matrix. This holds for l = 1 because V^(1) = U^(1). Similar to Lemma E.8, U^(l)_{ij} is independent of (U^(l) V^(l−1))_{ij}. Since U^(l) and V^(l−1) are independent Gaussian matrices and N is sufficiently large, V^(l) is a Gaussian matrix. Similarly, c^(l) is a Gaussian vector. By Lemma E.7, V^(l) and c^(l) are independent. Thus, similar to Lemma E.10, J(xin) and a(xin) are an independent Gaussian matrix and vector.

G Derivation of the theorems in Sec. 5

G.1 Derivation of the theorems in Sec. 5.1

First, we introduce the following lemma, which is required to derive the maximum ℓ∞ norm of a column of J(xin).

Lemma G.1. The maximum absolute value of n ∈ ℕ i.i.d. Gaussian variables with zero mean and variance σ^2 > 0 is smaller than √(2σ^2 ln n) with high probability.

Table A3: Computable combinations of the (p, q)-operator norm [91].

 | q = 1 | q = 2 | q = ∞
p = 1 | max. ℓ1 norm of a column | max. ℓ2 norm of a column | max. ℓ∞ norm of a column
p = 2 | NP-hard | max. singular value | max. ℓ2 norm of a row
p = ∞ | NP-hard | NP-hard | max. ℓ1 norm of a row

Proof. Let z ∈ ℝ be a Gaussian variable with zero mean and variance σ^2, and let a > 0 be a positive value. First, we compute the upper bound of the probability of z > a as follows:

P[z > a] = ∫_a^∞ g(z; σ^2) dz (A74)
≤ ∫_a^∞ (z/a) g(z; σ^2) dz (A75)
= (σ^2/a) g(a; σ^2). (A76-A77)

Then, we compute the lower bound of the probability of |z| < a as follows:

P[|z| < a] = ∫_{−a}^{a} g(z; σ^2) dz (A78)
= 1 − 2 ∫_a^∞ g(z; σ^2) dz (A79-A81)
≥ 1 − (2σ^2/a) g(a; σ^2). (A82)

Let {z_i}_{i=1}^n be n i.i.d. Gaussian variables. Finally, we compute the upper bound of the probability of max_{i∈[n]} |z_i| > a as follows:

P[max_{i∈[n]} |z_i| > a] = 1 − P[|z| < a]^n ≤ 1 − (1 − (2σ^2/a) g(a; σ^2))^n. (A83-A84)

In particular, when a = √(2σ^2 ln n), we have (2σ^2/a) g(a; σ^2) = 1/(n√(π ln n)), and

P[max_{i∈[n]} |z_i| > √(2σ^2 ln n)] ≤ 1 − (1 − 1/(n√(π ln n)))^n → 0 (n → ∞). (A85)

Thus, the claim is established.
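Lemma G.1 is easy to probe numerically. The following Monte Carlo sketch (the sample sizes and trial counts are our own illustrative choices) estimates how often the maximum exceeds √(2σ^2 ln n).

```python
import numpy as np

# Monte Carlo probe of Lemma G.1: the maximum absolute value of n i.i.d.
# N(0, sigma^2) variables rarely exceeds sqrt(2 sigma^2 ln n), and the
# exceedance frequency shrinks as n grows.
rng = np.random.default_rng(0)
sigma2 = 1.0
for n in (10, 1_000, 100_000):
    bound = np.sqrt(2.0 * sigma2 * np.log(n))
    hits = [np.max(np.abs(rng.normal(0.0, np.sqrt(sigma2), n))) > bound
            for _ in range(200)]
    print(n, np.mean(hits))   # empirical exceedance probability
```

Theorem 5.1 (Upper bounds of adversarial loss).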
Suppose that the input dimension d, output dimension K, and width N are sufficiently large. Then, for any xin ∈ ℝ^d, the following inequality holds:

Ladv(xin) ≤ ϵβ_{p,q} ω^{L/2} = ϵβ_{p,q} ((α/(LN)) Σ_{W∈W} W^2)^{L/2} (vanilla);  ϵβ_{p,q} (1 + (α/(LN)) Σ_{W∈W} W^2)^{L/2} (residual), (9)

where W := {W^(1)_{11}, W^(1)_{12}, ..., W^(L)_{NN}} denotes the set of all network weights. The constant β_{p,q} for each norm pair (p, q) is described in Tab. 1.

Proof. We note the following: By Thm 4.1, each entry of J(xin) is i.i.d. and follows a Gaussian with zero mean and variance V[J(xin)_{ij}] = ω^L/d. The input and output dimensions, d and K, are sufficiently large (cf. Sec. 3.1). For some combinations of (p, q), the (p, q)-operator norms are computable (cf. Tab. A3) [91]. In addition, we note the following upper bound (cf. Sec. 5.1):

Ladv(xin) ≤ ϵ max_{x∈ℝ^d} ‖J(x)‖_{p,q}. (Ineq. (8))

As a preliminary, we compute the mean absolute value of J(xin)_{ij} as follows:

E[|J(xin)_{ij}|] = √(2ω^L/(πd)). (A86)

In the above derivation, we use the equation E_{z∼N(0,σ^2)}[|z|] = √(2σ^2/π). (A87)

Based on Tab. A3, we compute ‖J(xin)‖_{p,q} as follows:

Maximum ℓ1 norm of a column, ‖J(xin)‖_{1,1}:
max_{j∈[d]} Σ_{i=1}^K |J(xin)_{ij}| = max_{j∈[d]} K (K^{−1} Σ_{i=1}^K |J(xin)_{ij}|) (A88)
= max_{j∈[d]} K E[|J(xin)_{ij}|] (A89-A90)
= K √(2/(πd)) ω^{L/2}. (A91)

Maximum ℓ2 norm of a column, ‖J(xin)‖_{1,2}:
max_{j∈[d]} √(Σ_{i=1}^K J(xin)_{ij}^2) = max_{j∈[d]} √(K (K^{−1} Σ_{i=1}^K J(xin)_{ij}^2)) (A92)
= max_{j∈[d]} √(K E[J(xin)_{ij}^2]) (A93-A94)
= √(K/d) ω^{L/2}. (A95)

Maximum ℓ∞ norm of a column, ‖J(xin)‖_{1,∞}: By Lemma G.1,
max_{i∈[K]} |J(xin)_{ij}| ≤ √(2 V[J(xin)_{ij}] ln K) = √(2 ln K / d) ω^{L/2}. (A96)

Maximum singular value, ‖J(xin)‖_{2,2}: By the Marchenko-Pastur law,
‖J(xin)‖_{2,2} ≤ (1 + √(K/d)) ω^{L/2}, (A97)
where ‖·‖_2 denotes the spectral norm (largest singular value).

Maximum ℓ2 norm of a row, ‖J(xin)‖_{2,∞}:
max_{i∈[K]} √(Σ_{j=1}^d J(xin)_{ij}^2) = max_{i∈[K]} √(d E[J(xin)_{ij}^2]) (A98-A100)
= ω^{L/2}. (A101)

Maximum ℓ1 norm of a row, ‖J(xin)‖_{∞,∞}:
max_{i∈[K]} Σ_{j=1}^d |J(xin)_{ij}| = max_{i∈[K]} d E[|J(xin)_{ij}|] (A102-A104)
= √(2d/π) ω^{L/2}. (A105)

In addition, we try to derive equalities rather than inequalities (upper bounds). First, we note the following equation:

Ladv(xin) := max_{‖η‖_p≤ϵ} ‖f(xin + η) − f(xin)‖_q (Eq. (3))
= max_{‖η‖_p≤ϵ} ‖J(xin + η)(xin + η) + a(xin + η) − J(xin) xin − a(xin)‖_q. (A106)

Because J(xin) and a(xin) are the slope and bias of a piecewise linear region, respectively, for sufficiently small η, J(xin + η) and a(xin + η) are identical to J(xin) and a(xin), respectively. Therefore, for sufficiently small η, we can derive the following equation:

Ladv(xin) = max_{‖η‖_p≤ϵ} ‖J(xin) η‖_q = ϵ ‖J(xin)‖_{p,q}. (A107)

Proposition G.2. Suppose that the perturbation constraint ϵ is sufficiently small. For any xin ∈ ℝ^d and (p, q) = (1, 1), (1, 2), (2, ∞), and (∞, ∞), the following equality holds:

Ladv(xin) = ϵβ_{p,q} ω^{L/2}. (A108)

Proof. As shown in Eq. (A107), the adversarial loss (Eq. (3)) equals the (p, q)-operator norm of J(xin) under this assumption. For (p, q) = (1, 1), (1, 2), (2, ∞), and (∞, ∞), we obtain equalities for the (p, q)-operator norms (cf. the proof of Thm 5.1). Thus, the claim is established.

Note that, for (p, q) = (1, ∞) and (2, 2), we can derive only upper bounds (cf. the proof of Thm 5.1).

Proposition G.3. Suppose that the perturbation constraint ϵ is sufficiently small and the output dimension is one. For any xin ∈ ℝ^d and (p, q) = (2, ∞) and (∞, ∞), the following equality holds:

Ladv(xin) = ϵβ_{p,q} ω^{L/2}. (A109)

In addition, for any xin ∈ ℝ^d and (p, q) = (2, 2), the following equality holds:

Ladv(xin) = ϵω^{L/2}. (A110)

Proof. Similar to Prop G.2, the upper bounds for (p, q) = (2, ∞) and (∞, ∞) become equalities.
The upper bounds for (p, q) = (1, 1) and (1, 2) require a sufficiently large K, and thus they do not become equalities. In addition, as Tab. A3 shows, the (2, 2)-operator norm is the maximum singular value. When K = 1, the maximum singular value is identical to the Frobenius norm (the ℓ2 norm of the single row), which is calculated as follows:

‖J(xin)‖_{2,2} = √(Σ_{j=1}^d J(xin)_{1j}^2) = √(d E[J(xin)_{1j}^2]) = ω^{L/2}. (A111-A114)

Thus, in contrast to the proof of Thm 5.1, we can obtain the equality of the (2, 2)-operator norm instead of the inequality (upper bound).

G.2 Derivation of the theorems in Sec. 5.2

In the following, we set Ladv := ϵβ_{p,q} ω^{L/2}. As a preliminary, we state the following two lemmas. These are used to solve the differential equations derived from gradient flow.

Lemma G.4. For t, a ∈ ℝ, b ∈ ℝ, x : ℝ → ℝ, and x(0) := x_0, the following holds:

dx(t)/dt = −a x(t)^b  ⟹  x(t) = (a(b − 1)t + x_0^{−(b−1)})^{−1/(b−1)}. (A115)

Proof. From

dx(t)/dt = −a x(t)^b, (A116)

we have

∫ x^{−b} dx = −a ∫ dt (A117)
⟹ x^{−(b−1)}/(b − 1) = at + C_1 (A118)
⟹ x(t) = (a(b − 1)t + C_2)^{−1/(b−1)}. (A119)

By the initial value, x(0) := x_0 = C_2^{−1/(b−1)} (A120), i.e., C_2 = x_0^{−(b−1)} (A121). Thus,

x(t) = (a(b − 1)t + x_0^{−(b−1)})^{−1/(b−1)}. (A122)

Lemma G.5. For t, c ∈ ℝ, a, b ≥ 0, x : ℝ → ℝ, x(0) := x_0, and 0 ≤ bx(0) ≪ 1, the following holds:

dx(t)/dt = −a(1 + bx(t))^c x(t)  ⟹  x(t) = x_0 / ((1 + bcx_0) exp(at) − bcx_0). (A123)

Proof. By −a(1 + bx(t))^c x(t) ≤ 0, x(t) does not increase with t; that is, bx(t) ≤ bx(0) ≪ 1. Therefore, we can approximate (1 + bx(t))^c by 1 + bcx(t) using the binomial theorem. Thus, we solve the following differential equation:

dx(t)/dt = −a(1 + bcx(t)) x(t), (A124)

whose solution is

x(t) = C_1 / (bc(exp(at) − C_1)). (A125)

By the initial value,

x(0) := x_0 = C_1 / (bc(1 − C_1)) (A126)
⟹ bc(1 − C_1)x_0 = C_1 (A127)
⟹ bcx_0 = (1 + bcx_0)C_1 (A128)
⟹ C_1 = bcx_0 / (1 + bcx_0). (A129)

Substituting C_1 back,

x(t) = (bcx_0/(1 + bcx_0)) / (bc(exp(at) − bcx_0/(1 + bcx_0))) (A130)
= bcx_0 / (bc((1 + bcx_0) exp(at) − bcx_0)) (A131)
= x_0 / ((1 + bcx_0) exp(at) − bcx_0). (A132)

Then, using the update equation of the parameters (Eq. (10)), we consider the differential equation of the weight variance:

Lemma G.6. Suppose that Asm 5.2 holds. Let L_1, ..., L_n : ℝ^d → ℝ be n loss functions and W := {W^(1)_{11}, W^(1)_{12}, ..., W^(L)_{NN}} be the set of all network weights. A network is trained by minimizing the sum of the loss functions, Σ_{i=1}^n L_i(xin). The network parameters are updated according to Eq. (10). The differential equation of σ_w^2(t) is given by:

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) ∂(Σ_{i=1}^n L_i)/∂W(t)]. (A133)

Proof. Since Asm 5.2 holds and the number of weights, LN^2, is sufficiently large, σ_w^2(t + dt) can be represented as follows:

σ_w^2(t + dt) = N V_{W∈W}[W(t + dt)] = N E_{W∈W}[W(t + dt)^2]. (A134)

Expanding W(t + dt) using Eq. (10) and rearranging the above equation,

σ_w^2(t + dt) = σ_w^2(t) − N E_{W∈W}[W(t) ∂(Σ_i L_i)/∂W(t)] dt + O(dt^2), (A135-A137)

and thus

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) ∂(Σ_i L_i)/∂W(t)]. (A138-A139)

Lemma G.7. For Ladv := ϵβ_{p,q} ω^{L/2}, the following equality holds:

∂Ladv/∂W = (ϵαβ_{p,q}/N) ω^{L/2−1} W. (A140)

Proof. If the number of network weights, LN^2, is sufficiently large, we can represent the weight variance as follows:

σ_w^2/N = (1/(LN^2)) Σ_{W∈W} W^2. (A141)

The derivative of ω with respect to W ∈ W can be calculated as follows:

∂ω_v/∂W = ∂(ασ_w^2)/∂W = αN ∂/∂W [(1/(LN^2)) Σ_{W∈W} W^2] = (2α/(LN)) W, (A142)
∂ω_r/∂W = ∂(1 + ασ_w^2)/∂W = (2α/(LN)) W. (A143)

The derivative of ω^{L/2} with respect to W ∈ W is then

∂ω^{L/2}/∂W = (L/2) ω^{L/2−1} (2α/(LN)) W = (α/N) ω^{L/2−1} W. (A144)

Thus,

∂Ladv/∂W = ϵβ_{p,q} ∂ω^{L/2}/∂W = (ϵαβ_{p,q}/N) ω^{L/2−1} W. (A145)

Lemma G.8. Suppose that Asm 5.2 holds. Let Lstd : ℝ^d → ℝ be the standard loss function and Ladv : ℝ^d → ℝ_{≥0} be the adversarial loss function. Suppose that Asm B.1 applies to Lstd.
The adversarial loss function is defined as Ladv(xin; t) := ϵβ_{p,q} ω(t)^{L/2}. A network is trained by minimizing Lstd + Ladv. Network parameters are updated by Eq. (10). Then, the differential equation of σ_w^2(t) is given by:

dσ_w^2(t)/dt = −(ϵαβ_{p,q}/N) ω(t)^{L/2−1} σ_w^2(t). (A146)

Proof. By Lemma G.6,

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) ∂Lstd/∂W(t)] − N E_{W∈W}[W(t) ∂Ladv/∂W(t)]. (A147)

Since the standard loss satisfies Asm B.1, E[W(t) ∂Lstd/∂W(t)] = 0, and

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) ∂Ladv/∂W(t)]. (A148)

By Lemma G.7,

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) (ϵαβ_{p,q}/N) ω(t)^{L/2−1} W(t)] (A149)
= −ϵαβ_{p,q} ω(t)^{L/2−1} E[W(t)^2] (A150)
= −(ϵαβ_{p,q}/N) ω(t)^{L/2−1} σ_w^2(t). (A151)

Theorem 5.4 (Weight time evolution of a vanilla network in adversarial training). Suppose that Asms 5.2 and 5.3 hold. Then, the time evolution of σ_w^2 of a vanilla network in adversarial training is given by:

σ_w^2(t) = (1 − ϵαβ_{p,q} ω_v(0)^{L/2−1} t/N) σ_w^2(0). (11)

Proof. By Lemma G.8, the time evolution of the weight variance in a vanilla network is represented as follows:

dσ_w^2(t)/dt = −(ϵαβ_{p,q}/N) ω_v(t)^{L/2−1} σ_w^2(t) (A152)
= −(ϵαβ_{p,q}/N) α^{L/2−1} σ_w^2(t)^{L/2−1} σ_w^2(t) (A153)
= −(ϵα^{L/2}β_{p,q}/N) σ_w^2(t)^{L/2}. (A154)

Denote L' := L/2 − 1. By Lemma G.4, we obtain

σ_w^2(t) = ((ϵα^{L/2}β_{p,q}/N)(L/2 − 1)t + σ_w^2(0)^{−(L/2−1)})^{−1/(L/2−1)} (A155)
= (1 + ϵα^{L'+1} σ_w^2(0)^{L'} β_{p,q} L' t/N)^{−1/L'} σ_w^2(0) (A156-A157)
= (1 + ϵαω_v(0)^{L'} β_{p,q} L' t/N)^{−1/L'} σ_w^2(0). (A158)

Since t ≤ T ≪ N (Asm 5.2), applying (1 + x)^{−1/L'} ≈ 1 − x/L' yields

σ_w^2(t) = (1 − ϵαβ_{p,q} ω_v(0)^{L'} t/N) σ_w^2(0). (A159)

Theorem G.9 (Weight time evolution of a residual network in adversarial training). Suppose that Asm 5.2 holds and ασ_w^2(0) ≪ 1. The time evolution of σ_w^2 of a residual network in adversarial training is given by:

σ_w^2(t) = (1 − (1 + αL' σ_w^2(0)) ϵαβ_{p,q} t/N) σ_w^2(0), (A160)

where L' := L/2 − 1.

Proof. By Lemma G.8, the time evolution of the weight variance in a residual network is represented as follows:

dσ_w^2(t)/dt = −(ϵαβ_{p,q}/N) ω_r(t)^{L/2−1} σ_w^2(t) = −(ϵαβ_{p,q}/N)(1 + ασ_w^2(t))^{L/2−1} σ_w^2(t). (A161)

By Lemma G.5, we obtain

σ_w^2(t) = σ_w^2(0) / ((1 + α(L/2 − 1)σ_w^2(0)) exp(ϵαβ_{p,q}t/N) − α(L/2 − 1)σ_w^2(0)) (A162)
= ((σ_w^2(0)^{−1} + αL') exp(ϵαβ_{p,q}t/N) − αL')^{−1}. (A163)

By the Maclaurin expansion of the exponential function and t ≤ T ≪ N,

σ_w^2(t) ≈ ((σ_w^2(0)^{−1} + αL')(1 + ϵαβ_{p,q}t/N) − αL')^{−1} (A164)
= (1 + (1 + αL' σ_w^2(0)) ϵαβ_{p,q}t/N)^{−1} σ_w^2(0) (A165-A166)
= (1 − (1 + αL' σ_w^2(0)) ϵαβ_{p,q}t/N) σ_w^2(0). (A167)

Then, we consider the time evolution of the weight variance, Thms 5.4 and G.9, at an initialization satisfying the (M, m)-trainability condition (Lemma 5.6). Here, we assume ω(0) ≈ 1, satisfying Lemma 5.6. Under this assumption, the time evolution of the weight variance in vanilla networks, Eq. (11), can be rearranged as follows:

σ_w^2(t) = (1 − ϵαβ_{p,q} ω_v(0)^{L/2−1} t/N) σ_w^2(0) ≈ (1 − ϵαβ_{p,q} t/N) σ_w^2(0) (using ω_v(0) ≈ 1). (A168)

In addition, the time evolution of the weight variance in residual networks, Eq. (A160), can be rearranged as follows:

σ_w^2(t) = (1 − (1 + αL' σ_w^2(0)) ϵαβ_{p,q} t/N) σ_w^2(0) ≈ (1 − ϵαβ_{p,q} t/N) σ_w^2(0) (using ω_r(0) ≈ 1). (A169)

Thus, the time evolution of the weight variance is consistent between vanilla and residual networks at initializations satisfying Lemma 5.6.

G.3 Derivation of the theorems in Sec. 5.3

Lemma 5.6 (Vanilla and residual (M, m)-trainability conditions). Suppose that the width N is sufficiently large. Then, the (M, m)-trainability conditions for vanilla and residual networks are respectively given by:

m^{1/L} ≤ ασ_w^2 ≤ M^{1/L} (vanilla),  ασ_w^2 ≤ M^{1/L} − 1 (residual). (12)

Proof. Applying Eq. (A38) to Eqs. (2) and (A19), we obtain the following equations:

χ^(l)/χ^(l+1) = σ_w^2 E[ϕ'(h^(l+1))^2] = ασ_w^2 = ω_v (vanilla);  1 + σ_w^2 E[ϕ'(h^(l+1))^2] = 1 + ασ_w^2 = ω_r (residual).
(A170)

Then, recursively computing the above equation, we derive:

χ^(0)/χ^(L) = ω^L = (ασ_w^2)^L (vanilla);  (1 + ασ_w^2)^L (residual). (A171)

Applying this equation to Defn 5.5,

m ≤ χ^(0)/χ^(L) ≤ M  ⟺  m^{1/L} ≤ ω ≤ M^{1/L}. (A172)

For a vanilla network,

m^{1/L} ≤ ασ_w^2 ≤ M^{1/L}. (A173)

For a residual network,

m^{1/L} − 1 ≤ ασ_w^2 ≤ M^{1/L} − 1. (A174)

In particular, for a residual network, by m^{1/L} − 1 ≤ 0 and ασ_w^2 ≥ 0,

(0 ≤) ασ_w^2 ≤ M^{1/L} − 1. (A175)

Theorem 5.7 (Vanilla networks are not adversarially trainable). Consider a vanilla network. Suppose that Asms 5.2 and 5.3 hold, the (M, m)-trainability condition holds at t = 0, and ασ_w^2(0) = 1. If

T ≥ (1 − m^{1/L}) N / (ϵαβ_{p,q}), (13)

then there exists 0 < τ ≤ T such that the (M, m)-trainability condition does not hold for τ ≤ t ≤ T.

Proof. Let us consider T breaking the (M, m)-trainability condition. As Thm 5.4 claims, σ_w^2(t) decreases monotonically with t. Thus, we consider T such that σ_w^2(T) falls below the lower bound:

m^{1/L} ≥ ασ_w^2(T) = (1 − ϵαβ_{p,q}T/N) ασ_w^2(0) = 1 − ϵαβ_{p,q}T/N (A176-A177)
⟺ T ≥ (1 − m^{1/L}) N / (ϵαβ_{p,q}). (A178)

Theorem 5.8 (Residual networks are adversarially trainable). Consider a residual network. Suppose that Asms 5.2 and 5.3 hold, the (M, m)-trainability condition holds at t = 0, and ασ_w^2(0) ≪ 1. Then, the (M, m)-trainability condition always holds for 0 ≤ t ≤ T.

Proof. By Thm G.9, σ_w^2(t) decreases monotonically. Thus, the following inequality holds, and, by Lemma 5.6, the (M, m)-trainability condition always holds for 0 ≤ t ≤ T:

ασ_w^2(t) ≤ ασ_w^2(0) ≤ M^{1/L} − 1. (A179)

Proposition G.10 (Residual networks are adversarially trainable without careful weight initialization). Suppose that Asm 5.2 holds, the network is residual, and the (M, m)-trainability condition is not satisfied at t = 0. If

T ≥ (ασ_w^2(0) − (M^{1/L} − 1)) N / (ϵα^{L/2+1} β_{p,q} σ_w^2(0)^{L/2}), (A180)

then there exists 0 < τ ≤ T such that the (M, m)-trainability condition holds for τ ≤ t ≤ T.

Proof. Denote the time evolution of the weight variance of a vanilla network by σ_{w,v}^2 and that of a residual network by σ_{w,r}^2. By Eq. (A161), using 1 + ασ_w^2 ≥ ασ_w^2,

dσ_w^2(t)/dt = −(ϵαβ_{p,q}/N)(1 + ασ_w^2(t))^{L/2−1} σ_w^2(t) ≤ −(ϵα^{L/2}β_{p,q}/N) σ_w^2(t)^{L/2}. (A181)

Recall that the following differential equation represents the time evolution of the weight variance of a vanilla network (cf. Thm 5.4):

dσ_w^2(t)/dt = −(ϵα^{L/2}β_{p,q}/N) σ_w^2(t)^{L/2}. (A182)

From the above, σ_{w,r}^2(t) ≤ σ_{w,v}^2(t) if σ_w^2(0) is the same. Here, we consider T such that σ_{w,v}^2(T) falls below the upper bound of the (M, m)-trainability condition. By Thm 5.4,

ασ_{w,v}^2(T) ≤ M^{1/L} − 1 (A183)
⟸ (1 − ϵαβ_{p,q} ω_v(0)^{L/2−1} T/N) ασ_w^2(0) ≤ M^{1/L} − 1 (A184)
⟺ 1 − ϵα^{L/2}β_{p,q} σ_w^2(0)^{L/2−1} T/N ≤ (M^{1/L} − 1)/(ασ_w^2(0)) (A185)
⟺ (ασ_w^2(0) − M^{1/L} + 1) N / (ϵα^{L/2+1} β_{p,q} σ_w^2(0)^{L/2}) ≤ T. (A186-A188)

G.4 Derivation of the theorems in Sec. 5.4

First, we state the following two lemmas as preliminaries.

Lemma G.11. Consider a vanilla network. Suppose that Asm B.1 is applied to the network output. For any l ∈ [L], the mean squared gradient of f_i with respect to h^(l)_j is given by:

E[(∂f_i/∂h^(l)_j)^2] = αω_v^{L−l} χ^(L). (A189)

Proof. By the chain rule,

∂f_i/∂h^(l)_j = Σ_k (∂f_i/∂h^(l+1)_k) W^(l+1)_{kj} ϕ'(h^(l)_j). (A190-A192)

By Asm B.1 and Eq. (A38),

E[(∂f_i/∂h^(l)_j)^2] = σ_w^2 E[ϕ'(h^(l)_j)^2] E[(∂f_i/∂h^(l+1)_k)^2] = ω_v E[(∂f_i/∂h^(l+1)_k)^2]. (A193-A194)

Moreover, by Asm B.1,

E[(∂f_i/∂h^(L)_j)^2] = E[(∂f_i/∂x^(L)_j)^2 ϕ'(h^(L)_j)^2] = αχ^(L). (A195)

Recursively, E[(∂f_i/∂h^(l)_j)^2] = αω_v^{L−l} χ^(L). (A196)

Lemma G.12. Suppose that Asm B.1 is applied to the network output. For any l ∈ [L], the mean squared gradient of f_i with respect to a weight is given by:

E[(∂f_i/∂W^(l)_{jk})^2] = αω^{L−1} χ^(L) E[(x^(0)_j)^2] + α^2 ω^{L−l} χ^(L) σ_b^2 Σ_{m=1}^{l−1} ω^{m−1}. (A197)

Proof. Vanilla. By Asm B.1,

E[(∂f_i/∂W^(l)_{jk})^2] = E[(∂f_i/∂h^(l)_j)^2 (∂h^(l)_j/∂W^(l)_{jk})^2] = E[(∂f_i/∂h^(l)_j)^2] E[(x^(l−1)_k)^2].
(A198)

By Lemmas E.11 and G.11,

E[(∂f_i/∂W^(l)_{jk})^2] = αω_v^{L−l} χ^(L) (ω_v^{l−1} E[(x^(0)_j)^2] + ασ_b^2 Σ_{m=1}^{l−1} ω_v^{m−1}) (A199)
= αω_v^{L−1} χ^(L) E[(x^(0)_j)^2] + α^2 ω_v^{L−l} χ^(L) σ_b^2 Σ_{m=1}^{l−1} ω_v^{m−1}. (A200)

Residual. By Asm B.1,

E[(∂f_i/∂W^(l)_{jk})^2] = E[(∂f_i/∂x^(l)_j)^2 ϕ'(h^(l)_j)^2 (x^(l−1)_k)^2] = χ^(l) E[ϕ'(h^(l)_j)^2] E[(x^(l−1)_k)^2]. (A201-A204)

By Eq. (A38),

E[(∂f_i/∂W^(l)_{jk})^2] = αχ^(l) E[(x^(l−1)_k)^2]. (A205)

By Eqs. (A19) and (A38),

χ^(l) = ω_r^{L−l} χ^(L). (A206)

Thus, by Lemma E.11,

E[(∂f_i/∂W^(l)_{jk})^2] = αω_r^{L−l} χ^(L) (ω_r^{l−1} E[(x^(0)_j)^2] + ασ_b^2 Σ_{m=1}^{l−1} ω_r^{m−1}) (A207)
= αω_r^{L−1} χ^(L) E[(x^(0)_j)^2] + α^2 ω_r^{L−l} χ^(L) σ_b^2 Σ_{m=1}^{l−1} ω_r^{m−1}. (A208)

Now, we can simply represent the mean of the Fisher-Rao norm.

Lemma G.13. Suppose that Asm B.1 is applied to the network output. Suppose ‖xin‖_2 = √d and σ_b^2 = 0. The mean of the Fisher-Rao norm is given by:

E[‖w‖_FR] = LKασ_w^2 ω^{L−1}. (A209)

Proof. The following discussion is based on [49]. We rearrange and extend their argument to ReLU-like networks, and also consider residual networks. Note that w := (W^(1)_{11}, ..., W^(L)_{NN})^⊤. As a preliminary, see Remark E.9. The Fisher-Rao norm is calculated as follows:

‖w‖_FR := w^⊤ F w = Σ_{i,j} w_i F_{ij} w_j = Σ_{k=1}^K Σ_{i,j} (∂f_k/∂w_i)(∂f_k/∂w_j) w_i w_j. (A210)

By Asm B.1 and E[w_i w_j] = δ_{ij} σ_w^2/N,

E[‖w‖_FR] = Σ_{k=1}^K Σ_i E[(∂f_k/∂w_i)^2] E[w_i^2]. (A211-A213)

By Lemma G.12, E[(∂f_i/∂W^(l)_{jk})^2] is invariant to l if σ_b^2 = 0; in other words, E[(∂f_k/∂w_i)^2] does not depend on i if σ_b^2 = 0. Thus,

E[‖w‖_FR] = LNKσ_w^2 E[(∂f_k/∂w_i)^2] = LNKασ_w^2 ω^{L−1} χ^(L) E[(x^(0)_i)^2]. (A214)

Moreover,

χ^(L) = E[(∂/∂x^(L)_j Σ_{k=1}^N P^out_{ik} x^(L)_k)^2] = E[(P^out_{ij})^2] = 1/N, (A215)

and

E[(x^(0)_i)^2] = E[(Σ_{j=1}^d P^in_{ij} xin_j)^2] = Σ_{j=1}^d E[(P^in_{ij})^2](xin_j)^2 = ‖xin‖_2^2/d = 1. (A216)

Thus,

E[‖w‖_FR] = LKασ_w^2 ω^{L−1}. (A217)

Theorem 5.9 (Adversarial training degrades network capacity). Consider a vanilla network. Suppose that Asms 5.2 and 5.3 hold, Asm B.1 is applied to the network output, the (M, m)-trainability condition holds at t = 0, and ασ_w^2(0) = 1. Assume ‖xin‖_2 = √d and σ_b^2(t) = 0. Then, the expectation of the Fisher-Rao norm is given by:

E[‖w(t)‖_FR] = LK (1 − ϵαβ_{p,q} t/N)^L.

Proof. By Lemma G.13,

E[‖w‖_FR] = LKασ_w^2(t) ω_v(t)^{L−1} = LKα^L σ_w^2(t)^L. (A218)

By Thm 5.4 and ασ_w^2(0) = 1,

E[‖w‖_FR] = LKα^L σ_w^2(0)^L (1 − ϵαβ_{p,q}t/N)^L = LK (1 − ϵαβ_{p,q}t/N)^L. (A219)

Since ϵαβ_{p,q}t/N ≪ 1,

E[‖w‖_FR] ≈ LK (1 − Lϵαβ_{p,q}t/N).

Theorem G.14 (Adversarial training degrades network capacity). Consider a residual network. Suppose that Asm 5.2 holds, Asm B.1 is applied to the network output, and the (M, m)-trainability condition holds at t = 0, e.g., ασ_w^2(0) ≪ 1. Suppose ‖xin‖_2 = √d and σ_b^2(t) = 0. Then, the mean of the Fisher-Rao norm is given by:

E[‖w‖_FR] = LKασ_w^2(0) [1 + (L−1)ασ_w^2(0) − (2(L−1)ασ_w^2(0) + 1)(1 + (L/2 − 1)ασ_w^2(0)) ϵαβ_{p,q}t/N]. (A220)

Proof. By Lemma G.13,

E[‖w‖_FR] = LKασ_w^2(t) ω_r(t)^{L−1} (A221)
= LKασ_w^2(t)(1 + ασ_w^2(t))^{L−1} (A222)
≈ LKασ_w^2(t)(1 + (L−1)ασ_w^2(t)). (A223)

Temporarily, denote A := (1 + (L/2 − 1)ασ_w^2(0)) ϵαβ_{p,q}t/N. By Thm G.9,

E[‖w‖_FR] = LKα(1 − A)σ_w^2(0)(1 + (L−1)α(1 − A)σ_w^2(0)) (A224)
= LKασ_w^2(0)(1 − A)(1 + (L−1)ασ_w^2(0) − (L−1)Aασ_w^2(0)) (A225)
= LKασ_w^2(0)(1 − A + (L−1)(1 − A)ασ_w^2(0) − (L−1)Aασ_w^2(0) + O(A^2)). (A226)

By O(1/N^2) ≈ 0, O(A^2) ≈ 0. Thus,

E[‖w‖_FR] ≈ LKασ_w^2(0)(1 − A + (L−1)(1 − A)ασ_w^2(0) − (L−1)Aασ_w^2(0)) (A227)
= LKασ_w^2(0)(1 − A + (L−1)ασ_w^2(0) − 2(L−1)Aασ_w^2(0)) (A228)
= LKασ_w^2(0)(1 + (L−1)ασ_w^2(0) − (2(L−1)ασ_w^2(0) + 1)A) (A229)
= LKασ_w^2(0)[1 + (L−1)ασ_w^2(0) − (2(L−1)ασ_w^2(0) + 1)(1 + (L/2 − 1)ασ_w^2(0)) ϵαβ_{p,q}t/N]. (A230)

H Comparison with matrix decomposition approaches

Here, we analyze Ineq. (8) with another approach, which decomposes J(xin) into a product of matrices and considers the norm of each factor. As related studies based on this approach, we cite the literature on certified adversarial defenses [3, 18, 93] and spectral regularization [60, 108]. We consider the case of (p, q) = (2, 2) for simplicity.
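Before detailing the comparison, the gap between the direct spectral norm and the layerwise product bound is easy to observe numerically. The following sketch builds the Jacobian of a random vanilla ReLU network and evaluates both quantities; all sizes and σ_w^2 = 2 are illustrative choices.

```python
import numpy as np

# Contrast the direct spectral norm of the end-to-end Jacobian
# J(x) = P_out D_L W_L ... D_1 W_1 P_in with the layerwise product bound
# of Eq. (A232), for a random vanilla ReLU network (u = 1, v = 0).
rng = np.random.default_rng(0)
d, N, K, L, sigma_w2, u, v = 784, 500, 10, 10, 2.0, 1.0, 0.0

P_in = rng.normal(0.0, np.sqrt(1.0 / d), (N, d))
x = rng.normal(size=d)
J = P_in.copy()                             # running product, shape (N, d)
h = P_in @ x                                # x^(0)
prod_bound = np.linalg.norm(P_in, 2)
for _ in range(L):
    W = rng.normal(0.0, np.sqrt(sigma_w2 / N), (N, N))
    h = W @ h                               # pre-activation h^(l) (no bias)
    D = np.where(h >= 0.0, u, v)            # diagonal of D(phi'(h^(l)))
    J = D[:, None] * (W @ J)
    prod_bound *= max(abs(u), abs(v)) * np.linalg.norm(W, 2)
    h = np.where(h >= 0.0, u * h, v * h)    # post-activation x^(l)
P_out = rng.normal(0.0, np.sqrt(1.0 / N), (K, N))
J = P_out @ J
prod_bound *= np.linalg.norm(P_out, 2)

print(np.linalg.norm(J, 2), prod_bound)     # direct norm vs. layerwise bound
```

In this regime (ω_v = ασ_w^2 = 1), Thm 5.1 predicts the direct norm stays O(1), while the product bound grows exponentially in L, which is the looseness quantified next.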
In comparison to them, our approach differs in two ways. First, their bound is not tractable due to its deterministic nature. For example, they bound the spectral norm of J(xin) by decomposing J(xin) and computing the norm of each factor:

‖J(xin)‖_2 = ‖P^out D(ϕ'(h^(L)(xin))) W^(L) ··· D(ϕ'(h^(1)(xin))) W^(1) P^in‖_2 (A231)
≤ ‖P^out‖_2 ‖D(ϕ'(h^(L)(xin)))‖_2 ‖W^(L)‖_2 ··· ‖D(ϕ'(h^(1)(xin)))‖_2 ‖W^(1)‖_2 ‖P^in‖_2 (A232)
≤ max(|u|, |v|)^L ‖P^in‖_F ‖P^out‖_F Π_{l=1}^L ‖W^(l)‖_F. (A233)

Because Eqs. (A232) and (A233) are hard to handle theoretically, they cannot yield theorems such as Thms 5.4, 5.7, and 5.9. Second, because we calculate the spectral norm of J(xin) directly in Thm 5.1 instead of bounding the norm of each factor, and the spectral norm is submultiplicative, our upper bounds are always tighter than those derived from Eq. (A232).

Here, we examine the tightness of their bound using probability theory. One approach is to rearrange Eq. (A232) using the Marchenko-Pastur law. As a preliminary, we note the following:

‖P^in‖_2 ≈ 1 + √(N/d),  ‖P^out‖_2 ≈ 1 + √(K/N),  ‖W^(l)‖_2 ≈ 2√(σ_w^2). (A234)

Using the above, Eq. (A232) can be rearranged as follows:

‖J(xin)‖_2 ≲ max(|u|, |v|)^L 2^L (σ_w^2)^{L/2}. (A235)

This is exponentially looser than Thm 5.1. Another approach is to rearrange Eq. (A233) under the assumption that each matrix is sufficiently large. As a preliminary, we note the following:

‖P^in‖_F = √(Nd E[(P^in_{ij})^2]) = √N, (A236)
‖P^out‖_F = √(KN E[(P^out_{ij})^2]) = √K, (A237)
‖W^(l)‖_F = √(N^2 E[(W^(l)_{ij})^2]) = √(Nσ_w^2). (A238)

Using the above, Eq. (A233) can be rearranged as follows:

‖J(xin)‖_2 ≲ max(|u|, |v|)^L √N √K (√(Nσ_w^2))^L = max(|u|, |v|)^L K^{1/2} N^{(L+1)/2} (σ_w^2)^{L/2}. (A239)

This is also exponentially looser than Thm 5.1.

I Comparison with ℓ2 weight regularization

As stated in Secs. 5.1 and 5.2, adversarial training serves as a weight regularizer. Here, we compare adversarial training with ℓ2 weight regularization. First, we define ℓ2 weight regularization as follows:

Lw(W; λ) := λ Σ_{W∈W} W^2, (A240)

where λ > 0 is a scaling factor. Then, we determine the scaling factor λ based on the concept of gradient vanishing and explosion. The mean of Lw(W; λ) is calculated as follows:

E[Lw(W; λ)] = λ Σ_{W∈W} E[W^2] = λLN^2 E[W^2] = λLN^2 σ_w^2/N = λLNσ_w^2. (A241)

To prevent (gradient) vanishing and explosion of E[Lw(W; λ)] for sufficiently large L and N, λ should be Θ(1/(LN)). For simplicity of the derivation, we set λ := 1/(2LN). Finally, ℓ2-regularized training minimizes the following loss function:

L := Lstd + (1/(2LN)) Σ_{W∈W} W^2. (A242)

Next, we consider the time evolution of σ_w^2 under ℓ2 weight regularization. Similar to Thms 5.4 and G.9, we can derive the following proposition:

Proposition I.1 (Weight time evolution with ℓ2 weight regularization). Suppose that Asm 5.2 holds. Let Lstd be the standard loss function and Lw(W) := (1/(2LN)) Σ_{W∈W} W^2 be the ℓ2 regularization loss function. Suppose that Asm B.1 applies to Lstd, but not to Lw. A network is trained by minimizing Lstd + Lw(W). Then, the time evolution of σ_w^2 is given by:

σ_w^2(t) = (1 − t/(LN)) σ_w^2(0). (A243)

Proof. Similar to Lemma G.8,

dσ_w^2(t)/dt = −N E_{W∈W}[W(t) ∂Lw/∂W(t)] = −N E_{W∈W}[W(t)^2/(LN)] = −σ_w^2(t)/(LN). (A244-A247)

Thus,

σ_w^2(t) = exp(−t/(LN)) σ_w^2(0). (A248)

By t ≤ T ≪ N and the Maclaurin expansion of the exponential function,

σ_w^2(t) = (1 − t/(LN)) σ_w^2(0). (A249)

Here, we consider the adversarial loss defined as Ladv(xin) := ϵβ_{p,q} ω^{L/2} (cf. Thm 5.1).
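The two decay laws, Eq. (11) for adversarial training and Eq. (A243) for ℓ2 regularization, can be compared directly. The following sketch evaluates both with the illustrative constants used in the example below (they are not fitted to any experiment).

```python
import numpy as np

# Compare the weight-variance decay of adversarial training (Eq. (11),
# vanilla network with omega_v(0) = 1) against l2 regularization
# (Eq. (A243)). Constants mirror the worked example in the text:
# L = 100, N = 1000, eps = 0.1, alpha = 1/2, beta_{inf,inf} = sqrt(2d/pi).
L, N, eps, alpha, d = 100, 1000, 0.1, 0.5, 3 * 224 * 224
beta = np.sqrt(2.0 * d / np.pi)

for t in (0.0, 10.0, 25.0, 50.0):
    adv = 1.0 - eps * alpha * beta * t / N   # sigma_w^2(t) / sigma_w^2(0)
    l2 = 1.0 - t / (L * N)
    print(f"t={t:5.1f}  adversarial={adv:.4f}  l2={l2:.6f}")
```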
The difference between adversarial training and ℓ2 regularization lies in the scale: ϵαβ_{p,q}/N in adversarial training (cf. Thm 5.4) versus 1/(LN) in ℓ2-regularized training. Adversarial training with strong adversarial examples (e.g., ℓ∞-constrained ones) reduces σ_w^2(t) far more drastically than ℓ2-regularized training. For example, with L = 100, N = 1,000, d = 3 × 224 × 224, ϵ = 0.1, α = 1/2, p = ∞, and q = ∞, we have Θ(ϵαβ_{p,q}/N) = 10^{−2} and Θ(1/(LN)) = 10^{−5}.

Finally, we compare adversarial training and ℓ2 regularization in terms of the stability of gradient descent. The derivatives of both losses with respect to a weight W ∈ W are given by (cf. Lemma G.7):

∂Ladv/∂W = (ϵαβ_{p,q}/N) ω^{L/2−1} W,  ∂Lw/∂W = W/(LN). (A250)

This indicates that the derivative of the ℓ2 regularizer is smooth across the entire weight space, whereas the derivative of the adversarial loss changes significantly at the ω = 1 boundary. Note that ω evolves during training (cf. the definition in Thm 4.1). The steep derivative in adversarial training can prevent gradient descent from reaching a minimum.

J Revisiting the trainability condition in adversarial training

Here, we revisit Defn 5.5. Recall that this condition can be derived exclusively under Asm B.1, because χ^(l) can only be computed under Asm B.1 [81]. In this study, we assume that Asm B.1 applies to the standard loss Lstd but not to the adversarial loss Ladv. Consequently, the composite loss L := Lstd + Ladv does not satisfy Asm B.1, implying that, strictly speaking, Defn 5.5 is not valid for L, and considering Defn 5.5 in adversarial training might not be entirely accurate. However, we must emphasize that Defn 5.5 continues to be a determining factor in the success of adversarial training, i.e., training with L. This is primarily because, if Defn 5.5 is not met, networks will be unable to adequately fit the training dataset owing to gradient vanishing or explosion. From a mathematical standpoint, the necessity of Defn 5.5 is underlined by the linearity of the differential operator. The update equation for the parameter θ(t) is formulated as follows:

dθ(t)/dt = −∂Lstd/∂θ(t) − ∂Ladv/∂θ(t). (A251)

As demonstrated by the equation above, the updates with Lstd and Ladv are conducted independently. Although Defn 5.5 may not be applicable to Ladv, it can be applied to Lstd. To forestall gradient vanishing or explosion in the standard loss and to ensure classification ability for clean images, Defn 5.5 remains necessary. Therefore, it can be concluded that Defn 5.5 continues to dictate the success of training, even in the context of adversarial training.

K Other theoretical results

Scaled effect of input and output dimensions. In Tab. 1, we showed a wide range of effects of the input dimension d and the output dimension K on the upper bounds of the adversarial loss. In practice, the perturbation budget ϵ and the adversarial loss are scaled based on the chosen p and q. For example, we can rearrange the definition of the adversarial loss (Eq. (3)) as follows:

Ladv(xin) := max_{‖η‖_p ≤ λ_p ϵ} λ_q ‖f(xin + η) − f(xin)‖_q, (A252)

where λ_p and λ_q denote the scaling factors. In this context, the upper bound (Thm 5.1) can be rearranged as follows:

Ladv(xin) ≤ ϵ λ_p λ_q β_{p,q} ω^{L/2} = ϵβ'_{p,q} ω^{L/2}, (A253)

where β'_{p,q} := λ_p λ_q β_{p,q}. For example, we consider the following λ_p and λ_q:

λ_p := d (p = 1); √d (p = 2); 1 (p = ∞),  λ_q := 1/K (q = 1); 1/√K (q = 2); 1 (q = ∞). (A254)

Under these conditions, β'_{p,q} corresponds to Tab. A4.

Table A4: Values of β'_{p,q}. The description is the same as in Tab. 1.

 | q = 1 | q = 2 | q = ∞
p = 1 | √(2/π) √d | √d | √(2d ln K)
p = 2 | - | √(d/K) + 1 | √d
p = ∞ | - | - | √(2d/π)
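The entries of Tab. A4 are plain arithmetic on the β_{p,q} values from the proof of Thm 5.1, so they can be recomputed in a few lines; the concrete d and K below are illustrative.

```python
import numpy as np

# Recompute beta'_{p,q} = lambda_p * lambda_q * beta_{p,q} (Eq. (A253))
# from the beta_{p,q} values derived in the proof of Thm 5.1.
d, K = 3 * 224 * 224, 1000
lam_p = {1: d, 2: np.sqrt(d), np.inf: 1.0}
lam_q = {1: 1.0 / K, 2: 1.0 / np.sqrt(K), np.inf: 1.0}
beta = {
    (1, 1): np.sqrt(2 / np.pi) * K / np.sqrt(d),
    (1, 2): np.sqrt(K / d),
    (1, np.inf): np.sqrt(2 * np.log(K) / d),
    (2, 2): 1 + np.sqrt(K / d),
    (2, np.inf): 1.0,
    (np.inf, np.inf): np.sqrt(2 * d / np.pi),
}
for (p, q), b in beta.items():
    # All scaled constants are Theta(sqrt(d)) up to log K and 1/sqrt(K) factors.
    print(p, q, lam_p[p] * lam_q[q] * b)
```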
This table reveals that (i) the input dimension impacts the upper bounds as Θ(√d), whereas the output dimension has little or no effect; (ii) although the dimensions generally exert a similar influence on the bounds, the specific factors vary widely; (iii) for (p, q) = (1, ∞) and (2, 2), the output dimension behaves in a distinctive manner. In conclusion, our upper bounds, even with scaling effects included, provide several insights into the relationship between the adversarial loss and the input/output dimensions.

Other metrics of capacity. Here, we consider capacity metrics other than the Fisher-Rao norm. Norm-based metrics, such as the path norm [66], the group norm [67], and the spectral norm [8], have definitions similar to the Fisher-Rao norm: they are constructed from sums and products of the weights of a network. The trajectory length [73] is defined by the arc length of a network output as the parameters change. A similar definition based on curvature was also used in [71]. These metrics all conclude that capacity increases with the weight variance of a network, albeit at different speeds. Therefore, we can derive results similar to those in Sec. 5.4 for metrics other than the Fisher-Rao norm.

Mitigation of adversarial risk. Here, we consider the mitigation of adversarial risk, i.e., the mitigation of the upper bounds of the adversarial loss. There are three candidate solutions: (i) sample each entry of P^in from N(0, 1/d^2) instead of N(0, 1/d); (ii) sample each entry of W^(l) from N(0, σ_w^2/N^2) instead of N(0, σ_w^2/N) for l ∈ [L]; (iii) sample each entry of P^out from N(0, 1/N^2) instead of N(0, 1/N). First, let us examine scenario (i). In this setting, the variance of J(xin) becomes V[J(xin)_{ij}] = ω^L/d^2 instead of V[J(xin)_{ij}] = ω^L/d (cf. the proof of Thm 4.1). In addition, β_{∞,∞} changes from Θ(β_{∞,∞}) = √d to Θ(β_{∞,∞}) = 1 (cf. the proof of Thm 5.1). These modifications suggest a potential reduction in adversarial risk in scenario (i). However, this leads to training failure. Consider the variance of a network output with no bias (for simplicity). For a sufficiently large input dimension d and xin ∈ [0, 1]^d, it can be calculated as follows:

V[f(xin)_i] = V[(J(xin) xin)_i] = V[J(xin)_{ij}] ‖xin‖_2^2 = ω^L ‖xin‖_2^2 / d^2 → 0. (A255)

This equation implies that the network always outputs a zero vector for any input xin. In other words, input information is lost during signal propagation in the network. In this situation, networks cannot learn the data structure well, and thus training does not proceed (cf. [71, 81, 105]). Scenarios (ii) and (iii) are similar to (i): they can mitigate the adversarial risk, but they destroy input information during the forward and backward passes. Therefore, we conclude that it is not feasible to mitigate the adversarial effect without compromising the network's trainability.

Extension to Lipschitz continuous activations. In this study, we primarily established our theorems for networks employing ReLU-like activations. Here, we attempt to generalize these theorems to networks with Lipschitz continuous activations. However, this extension involves looser upper bounds or potentially unrealistic assumptions. First, we examine the upper bound for p = 2 and q = 2. Since J(xin) is the Jacobian at xin, Ineq. (8) remains valid for Lipschitz continuous activations. Denote the Lipschitz constant by k ≥ 0. Similar to the discussion in Appx. H, we can derive the following upper bound:

‖J(xin)‖_2 ≤ k^L 2^L (σ_w^2)^{L/2}.
Extension to Lipschitz continuous activations. In this study, we primarily established our theorems for networks employing ReLU-like activations. Here, we attempt to generalize these theorems to networks with Lipschitz continuous activations. However, this extension involves looser upper bounds or potentially unrealistic assumptions.

First, we examine the upper bounds for p = 2 and q = 2. Since J(x^in) is a Jacobian at x^in, Ineq. (8) remains valid for Lipschitz continuous activations. Denote the Lipschitz constant by k ≥ 0. Similar to the discussion in Appx. H, we can derive the following upper bound:

\|J(x^{\mathrm{in}})\|_2 \le k^L 2^L (\sigma_w^2)^{L/2}. \tag{A256}

An important observation here is that ω^{L/2}, which dictates the training properties, is common to networks with both ReLU-like and Lipschitz continuous activations. Consequently, we claim that our theorems also extend to Lipschitz continuous activations, despite differences in the constant factor. However, this bound is exponentially loose, casting doubt on its ability to accurately represent the properties of the adversarial loss.

Second, we use the analysis of a network Jacobian in [69]. For a comprehensive comparison between our approach and theirs, see Sec. 5.1. Here, we adopt their assumption that the variance V[h^(l)] is constant for every l ∈ [L]. Drawing from the result expressed in Eq. (22) of [69], we can assert that ‖J(x^in)‖₂ ≤ Θ((σ²_w)^{L/2}) holds. As with the first approach, this implies that our theorems extend to general activations, including Lipschitz continuous ones. However, the assumption clearly does not hold in our theorems, such as Thm 5.4.

Single-gradient descent attack can find adversarial examples. Here, we demonstrate that a single-gradient descent attack, such as the fast gradient sign method [38], can find adversarial examples. This is not a substantially novel contribution (see, e.g., [7, 11, 24, 25, 61, 84, 112]); however, we present it to introduce a new approach based on Thm 4.1.

Proposition K.1 (Single-gradient descent attack can find adversarial examples). Suppose that ‖x^in‖₂² = d, the perturbation constraint ϵ is sufficiently small, and the input dimension d is sufficiently large. Then, a single-gradient descent attack finds ℓ∞ constrained adversarial examples that flip the prediction of a single-output random deep neural network with probability

\mathrm{erf}\left( \epsilon \sqrt{ \frac{\omega^L d}{\pi \left( \omega^L + \alpha \sigma_b^2 \sum_{k=1}^{L} \omega^{k-1} \right)} } \right), \tag{A257}

which approaches one for d ≫ 1, where erf is the error function.

Proof. Since η is sufficiently small, the same linear region of a ReLU-like network contains both x^in and x^in + η, i.e., J(x^in) = J(x^in + η). When an ℓ∞ constrained adversarial example with ‖η‖∞ ≤ ϵ flips the prediction of the classifier, the inequality |J(x^in)x^in + a(x^in)| < |J(x^in)η| holds. The perturbation generated by a single gradient-descent step is defined as η := ϵ sgn(J(x^in)). Note that J(x^in) is a one-dimensional (row) vector in single-output networks. Using this perturbation, by Eq. (A87) and Thm 4.1, the right-hand side of the inequality can be transformed as follows:

|J(x^{\mathrm{in}})\eta| = \epsilon \sum_{i=1}^{d} |J(x^{\mathrm{in}})_i| = \epsilon d\, \mathbb{E}[|J(x^{\mathrm{in}})_i|] = \epsilon \sqrt{\frac{2\omega^L d}{\pi}}. \tag{A258}

Then, let us consider |J(x^in)x^in + a(x^in)|. By the reproductive property of the Gaussian, J(x^in)x^in + a(x^in) follows the Gaussian N(0, ‖x^in‖₂² V[J(x^in)_ij] + V[a(x^in)_i]). The variance can be rearranged as follows:

\|x^{\mathrm{in}}\|_2^2\, \mathbb{V}[J(x^{\mathrm{in}})_{ij}] + \mathbb{V}[a(x^{\mathrm{in}})_i] = d \cdot \frac{\omega^L}{d} + \alpha \sigma_b^2 \sum_{k=1}^{L} \omega^{k-1} = \omega^L + \alpha \sigma_b^2 \sum_{k=1}^{L} \omega^{k-1}. \tag{A259}

Note that the c.d.f. of the folded normal distribution derived from the Gaussian N(0, σ²) is given by erf(x/√(2σ²)). Thus, we can compute the probability that |J(x^in)x^in + a(x^in)| is less than ϵ√(2ω^L d/π) as follows:

\mathrm{erf}\left( \frac{\epsilon \sqrt{2\omega^L d / \pi}}{\sqrt{2\left( \omega^L + \alpha \sigma_b^2 \sum_{k=1}^{L} \omega^{k-1} \right)}} \right) = \mathrm{erf}\left( \epsilon \sqrt{ \frac{\omega^L d}{\pi \left( \omega^L + \alpha \sigma_b^2 \sum_{k=1}^{L} \omega^{k-1} \right)} } \right), \tag{A261}

which tends to one as d → ∞.
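Prop K.1 can be checked with a short simulation. The following sketch is our own construction: the per-layer variance choices approximate the initialization used in our experiments, the network sizes are arbitrary, and the sign factor in the attack step is chosen so that the single gradient step pushes the output toward zero.

```python
import torch

torch.manual_seed(0)
d, N, L, eps, sw2, sb2 = 1000, 1000, 5, 0.1, 2.0, 0.01

def random_net():
    """Single-output vanilla ReLU net: P_in ~ N(0, 1/d), hidden
    W ~ N(0, sw2/N), P_out ~ N(0, 1/N), biases ~ N(0, sb2)."""
    sizes = [d] + [N] * L + [1]
    layers = []
    for i, (fan_in, fan_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        lin = torch.nn.Linear(fan_in, fan_out)
        std = (1.0 / d) ** 0.5 if i == 0 else \
              (1.0 / N) ** 0.5 if i == L else (sw2 / N) ** 0.5
        with torch.no_grad():
            lin.weight.normal_(0.0, std)
            lin.bias.normal_(0.0, sb2 ** 0.5)
        layers += [lin, torch.nn.ReLU()]
    return torch.nn.Sequential(*layers[:-1])  # no ReLU on the output

flips, trials = 0, 50
for _ in range(trials):
    net, x = random_net(), torch.rand(d, requires_grad=True)
    y = net(x)
    y.backward()
    # One signed-gradient step; the -sign(y) factor picks the flipping
    # direction (a convention of this sketch).
    eta = -eps * torch.sign(y.detach()) * torch.sign(x.grad)
    flips += int(torch.sign(net(x + eta)) != torch.sign(y.detach()))
print(f"empirical flip rate: {flips / trials:.2f}")
```

Under this scaling, the flip rate should approach the erf expression in Eq. (A261) as d grows; the exact constant depends on how the input and output projections are counted, which is an assumption here.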
L Other experimental results

L.1 Setting

We focused on ReLU networks, i.e., u = 1, v = 0, and α = 1/2. Adversarial examples were generated using auto-projected gradient descent [20] and the loss function defined in Eq. (3). Networks were initialized to satisfy the (M, m)-trainability condition (Lemma 5.6), i.e., σ²_w = 2 for vanilla networks and σ²_w = 0.1 or σ²_w = 0.01 for residual networks. The initial bias variance was set to σ²_b = 0.01. We employed MNIST [26] and Fashion-MNIST [99] as the training datasets. All experiments were conducted on an NVIDIA A100.

L.2 Verification of Thm 4.1 (new mean field-based framework)

As shown in Fig. 1, the empirical distribution of J(x^in) in the vanilla network aligned well with Thm 4.1. To further verify Thm 4.1, we provide additional results. We sampled 10,000 random networks with d = 1,000 and K = 1. For vanilla networks, we set σ²_w = 2, and for residual networks, we set σ²_w = 0.01. The network width N was set to 5,000, and the default depth L to 10. The empirical distributions of J(x^in) and a(x^in) for different network depths, L = 10 and 100, and two randomly generated inputs, x_1^in and x_2^in, are shown in Figs. A5 and A6. The accuracy of Thm 4.1 in predicting these distributions remains consistent across varying network depths. Moreover, the distributions of J(x^in) and a(x^in) did not depend on the input x^in, even though they are defined as functions of x^in.

We should note that our theory operates under the assumption of infinite network width. To assess the influence of network width on Thm 4.1, refer to Fig. A7. We found that for small widths, for instance N = 10, the empirical distributions of J(x^in) and a(x^in) deviate from the theoretical prediction of Thm 4.1. As the network width increases, the alignment between the empirical results and the predictions of Thm 4.1 improves.

It is important to clarify that while Thm 4.1 asserts that the distributions of J(x^in) and a(x^in) do not depend on x^in, it does not imply that the random variables J(x^in) and a(x^in) are independent of J(y^in) and a(y^in), respectively, when x^in ≠ y^in. For slightly different inputs (y^in := 0.999 x^in, 0.99 x^in, and 0.5 x^in), we computed the correlation coefficients between J(x^in), a(x^in) and J(y^in), a(y^in), respectively.⁴ These results, shown in Figs. A8 and A9, clearly demonstrate that more similar inputs yield higher correlation coefficients. However, this does not contravene the claims made in Thm 4.1. Furthermore, for two randomly generated inputs, x_1^in and x_2^in, the random variables J(x_1^in), a(x_1^in) and J(x_2^in), a(x_2^in) were found to be relatively uncorrelated. A theoretical prediction for this correlation is currently unavailable, which is a topic for future work.

To validate the independence of different entries of J(x^in) and a(x^in), we computed the correlation coefficient between two distinct entries.⁴ As presented in Fig. A10, the two distinct entries of J(x^in) and of a(x^in) were indeed uncorrelated, corroborating the claims in Thm 4.1. Furthermore, Fig. A10 illustrates that the entries of J(x^in) and a(x^in) were mutually uncorrelated, which also aligns with Thm 4.1.

⁴While the correlation coefficient is not necessarily indicative of dependence, we regard it as a conveniently measurable quantity. For more information, refer to Remark E.3.
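As a complement to Figs. A5–A7, the following sketch (our own; the architecture, the linear readout from the last layer, and the P_in, P_out scalings are assumptions chosen to mirror the initializations above) samples the Jacobian entry J(x^in)_{1,1} across random vanilla ReLU networks and compares its empirical variance with the Thm 4.1 prediction ω^L/d quoted in Appx. K.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, L, trials = 100, 500, 5, 300
sw2, alpha = 2.0, 0.5              # ReLU: omega = alpha * sw2 = 1
x = rng.uniform(0.0, 1.0, size=d)  # a fixed input

def jacobian_entry():
    """J(x)_{1,1} of one random vanilla ReLU net, computed as the product
    of layer Jacobians with the ReLU masks evaluated at x. The output is
    read out linearly from the last pre-activation (an assumption)."""
    P_in = rng.normal(0.0, np.sqrt(1.0 / d), size=(N, d))
    h = P_in @ x              # first-layer pre-activations
    jvec = P_in[:, 0].copy()  # d h / d x_1
    for _ in range(L - 1):
        W = rng.normal(0.0, np.sqrt(sw2 / N), size=(N, N))
        mask = h > 0.0        # ReLU derivative phi'(h)
        jvec = W @ (mask * jvec)
        h = W @ np.maximum(h, 0.0)
    P_out = rng.normal(0.0, np.sqrt(1.0 / N), size=N)
    return P_out @ jvec       # entry J(x)_{1,1}

samples = np.array([jacobian_entry() for _ in range(trials)])
omega = alpha * sw2
print(f"empirical V[J_11] = {samples.var():.2e}, "
      f"Thm 4.1 prediction omega^L/d = {omega**L / d:.2e}")
```

Exact agreement of the constant depends on how the depth and the input/output projections are counted in Thm 4.1, so this comparison is indicative rather than definitive.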
Figure A5: Distributions of J(x^in)_{1,1} and a(x^in)_1 in the vanilla ReLU network with d = 1,000, K = 1, N = 5,000, σ²_w = 2, and σ²_b = 0.01. The description is the same as Fig. 1.

Figure A6: Distributions of J(x^in)_{1,1} and a(x^in)_1 in the residual ReLU network with d = 1,000, K = 1, N = 5,000, σ²_w = 0.01, and σ²_b = 0.01. The description is the same as Fig. 1.

An examination of the distributions of J(x^in) and a(x^in) after training may yield interesting insights. However, as trained networks have different weight and bias variances, the observed J(x^in) and a(x^in) in each network do not follow the same distribution. Consequently, empirically evaluating the validity of Thm 4.1 for trained networks presents substantial challenges. Despite this, based on experimental findings such as those presented in Fig. A18, we consider that Thm 4.1 offers an accurate representation of the early stages of the training process.

L.3 Verification of Thm 5.1 (upper bounds of adversarial loss)

As shown in Fig. 2, the upper bounds in Thm 5.1 are markedly tight in vanilla networks. To further verify Thm 5.1, we provide additional results. We generated 100 adversarial examples and calculated the adversarial losses defined in Eq. (3). For vanilla networks, we set σ²_w = 2, and for residual networks, we set σ²_w = 0.01. The network width N was set to 40,000 for vanilla networks and 35,000 for residual networks, and the network depth L was set to 3. We set the perturbation constraint ϵ to 0.1 and the number of projected gradient descent iterations to 50.

In Fig. 2, we illustrate the tightness of the bounds with varying input dimensions in vanilla networks. We provide further results for residual networks in Fig. A11. Additionally, Figs. A12 and A13 show the adversarial loss as a function of the output dimension in vanilla and residual networks, respectively. Overall, our findings affirm the tightness of our bounds across both network types. For some (p, q) combinations, empirically observed adversarial loss samples exceeded the theoretical upper bounds. We consider that this discrepancy can be mitigated by expanding the network width, a topic elaborated upon in the following paragraph.

Figure A7: Distributions of J(x^in)_{1,1} and a(x^in)_1 in the vanilla ReLU network with d = 1,000, K = 1, L = 10, σ²_w = 0.01, and σ²_b = 0.01, for varying network widths N. The description is the same as Fig. 1.

The derivation of Thm 5.1 assumes a network of infinite width. In practical implementations, however, an infinite network width is unachievable, leading to instances where empirical results do not coincide with Thm 5.1. Specifically, certain samples were observed to exceed the theoretical upper bounds, as shown in Fig. 2. To evaluate the influence of network width, we varied it while sampling the adversarial losses, with the results displayed in Fig. A14. In the case of (p, q) = (2, 2), a wider network diminished the discrepancy between the sampled losses and the theoretical bounds. For the other (p, q) pairings, even with a relatively small width (e.g., N = 100), the empirical results aligned with Thm 5.1.

The impact of the perturbation constraint ϵ can be found in Fig. A15. Larger ϵ values corresponded to a greater divergence between the empirically sampled adversarial losses and the theoretical bounds. This is because, for larger ϵ, it is harder for projected gradient descent to tune the adversarial examples well. The effect of the network depth L can be confirmed in Fig. A16. For (p, q) = (2, 2), as the number of layers increased, the sampled values exceeded the upper bounds. In contrast, for (p, q) = (∞, ∞), the values fell below the bounds.

These differences can be attributed to the varying complexity of optimizing the adversarial loss for each combination of p and q. That is, for (p, q) = (2, 2), the optimization of adversarial examples is relatively straightforward, enabling the generation of adversarial losses that exceed the upper bound imposed under the constraint of finite width. In contrast, for (p, q) = (∞, ∞), the optimization is challenging, and it is difficult to generate high-quality adversarial examples. The underlying reasons for these disparities in optimization complexity remain a topic for future work.
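For reference, the following is a minimal PGD sketch (a generic ℓ∞ sign-ascent loop of our own, not the auto-PGD implementation [20] used in our experiments) for estimating the (p, q) = (∞, ∞) adversarial loss of Eq. (3); its output can then be compared against the bound ϵβ_{∞,∞}ω^{L/2} of Thm 5.1.

```python
import torch

def adv_loss_linf(net, x, eps=0.1, steps=50, step_size=0.01):
    """PGD estimate of max_{||eta||_inf <= eps} ||f(x+eta) - f(x)||_inf."""
    f0 = net(x).detach()
    eta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = (net(x + eta) - f0).abs().max()
        loss.backward()
        with torch.no_grad():
            eta += step_size * eta.grad.sign()  # sign-gradient ascent step
            eta.clamp_(-eps, eps)               # project onto the eps-ball
        eta.grad.zero_()
    return (net(x + eta) - f0).abs().max().item()
```

The max over output coordinates is not smooth; more careful implementations ascend on a smoothed surrogate, but the sign update above suffices for an illustration.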
Furthermore, we assessed the adversarial loss during training on the MNIST dataset, as shown in Fig. A17. Vanilla networks were trained using stochastic gradient descent with a learning rate of 0.001. While Thm 5.1 broadly holds for (p, q) = (1, 2), the disparity between the theoretical prediction and the empirical results grows as training progresses for (p, q) = (2, 2), (2, ∞), and (∞, ∞). Determining a theoretical prediction that provides a tight bound on the adversarial loss for fully trained deep neural networks remains an open challenge for future research.

L.4 Verification of Thms 5.4 and G.9 (time evolution of weight variance)

As shown in Fig. 3, the empirical weight variance observed in the vanilla network aligns well with Thm 5.4. To verify Thm G.9, we provide Fig. A18. We trained 10-layer vanilla and residual networks. The vanilla networks were initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; the residual networks with σ²_w(0) = 0.1 and σ²_b(0) = 0.01. For standard training, N was set to 1,000. For adversarial training, we set p = ∞, q = ∞, and ϵ = 0.3. In both regimes, we used stochastic gradient descent with a small learning rate of 0.001. Note that gradient flow assumes an infinitesimal learning rate (cf. Eq. (10)), so a theoretically defined training step t under gradient flow is not equal to a training step in the experiment. We linked them by t := training steps × learning rate.

Figure A8: Correlation coefficient of J(·)_{1,1} for two inputs in the vanilla ReLU network with d = 1,000, K = 1, N = 5,000, L = 10, σ²_w = 2, and σ²_b = 0.01. The correlation coefficients are 0.99, 0.91, 0.5, and 0.03 for y^in = 0.999 x_1^in, 0.99 x_1^in, 0.5 x_1^in, and x_2^in, respectively. The description of the histograms is the same as Fig. 1.

In practice, setting the weight variance precisely is not feasible. For example, in a vanilla network, σ²_w(0) might be set to 1.999 even though we tried to initialize it to satisfy σ²_w(0) = 2. Thus, for visibility, the experimental curves were shifted in parallel to satisfy σ²_w(0) = 2 in vanilla networks and σ²_w(0) = 0.1 in residual networks.

From Fig. A18, it is evident that adversarial training facilitates weight regularization for both vanilla and residual networks. Compared with the ℓ2 weight regularization discussed in Appx. I, it offers stronger regularization. We can verify that Thms 5.4 and G.9 predict these empirical behaviors in the initial stage of training. The validity of Thms 5.4 and G.9, which underpin our theorems such as Thms 5.7 and 5.9, lends credence to our theoretical assertions.
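The mapping between discrete training steps and gradient-flow time used above is simple but easy to get wrong, so we make it explicit below (our own helper). The accompanying decay function is only illustrative: it applies the Θ(ϵαβ_{p,q}/N) rate discussed in Appx. I linearly in t, and the exact functional form of Thm 5.4 is not reproduced here.

```python
def gradient_flow_time(steps: int, learning_rate: float) -> float:
    """t := training steps x learning rate (cf. Eq. (10))."""
    return steps * learning_rate

def sigma_w2_linear_decay(sigma_w2_0: float, t: float, eps: float,
                          alpha: float, beta_pq: float, N: int) -> float:
    """Illustrative weight-variance decay at rate Theta(eps*alpha*beta/N).
    The leading constant and the precise time dependence are assumptions;
    consult Thm 5.4 for the exact law."""
    return sigma_w2_0 - (eps * alpha * beta_pq / N) * t

# Example: 2,000 SGD steps at learning rate 1e-3 correspond to t = 2.0.
print(gradient_flow_time(2000, 1e-3))
```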
Figure A9: Correlation coefficient of a(·)_1 for two inputs in the vanilla ReLU network with d = 1,000, K = 1, N = 5,000, L = 10, σ²_w = 2, and σ²_b = 0.01. The correlation coefficients are 0.99, 0.9, 0.53, and 0.42 for y^in = 0.999 x_1^in, 0.99 x_1^in, 0.5 x_1^in, and x_2^in, respectively. The description of the histograms is the same as Fig. 1.

L.5 Verification of Thms 5.7 and 5.8 (adversarial trainability)

As shown in Fig. 4, vanilla networks with small width and large depth were not adversarially trainable, which aligns well with Thm 5.7. To verify Thms 5.7 and 5.8, we provide Fig. A19. We trained vanilla and residual networks under various width and depth settings and observed the training accuracy. For fast training convergence, we used Adam [51]. Training was stopped if the training accuracy had not improved over the last 200 steps. We set the learning rate to 0.001 or 0.0001, adopting whichever yielded the higher training accuracy at the final training step. The vanilla networks were initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; the residual networks with σ²_w(0) = 0.01 and σ²_b(0) = 0.01. This initialization satisfies the (M, m)-trainability conditions (Lemma 5.6). In adversarial training, we set p = ∞, q = ∞, ϵ = 0.1, and the number of projected gradient descent iterations to 10.

From Fig. A19, we can confirm that vanilla networks with large depth and small width were not adversarially trainable and that this training difficulty is unique to adversarial training of such vanilla networks. Note that Prop G.10 could not be verified, as the scenario it requires (training proceeding without machine errors while the trainability condition is violated at t = 0) is challenging to realize in practice.

Figure A10: Correlation coefficient of two different entries of J(x^in) and a(x^in) in the vanilla ReLU network with d = 1,000, K = 1, N = 5,000, L = 10, σ²_w = 2, and σ²_b = 0.01. The correlation coefficients are −0.01, 0.0, and 0.0. The description of the histograms is the same as Fig. 1.

Figure A11: Adversarial loss (Eq. (3)) in residual networks with N = 35,000, K = 100, L = 3, σ²_w = 0.01, σ²_b = 0.01, and ϵ = 0.1, as a function of the input dimension d. The description is the same as Fig. 2.

L.6 Verification of Thms 5.9 and G.14 (capacity degradation)

To verify the degradation of network capacity during adversarial training, we trained vanilla and residual networks on MNIST with p = ∞, q = ∞, and ϵ = 0.1, and observed the Fisher–Rao norm as a measure of capacity. The number of training steps was set to 500. As Thms 5.9 and G.14 are derived under a sufficiently small learning rate (cf. Eq. (10)), we employed a small learning rate of 0.0001. Vanilla networks were initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; residual networks with σ²_w(0) = 0.01 and σ²_b(0) = 0.01. In the derivation of Thms 5.9 and G.14, using Asm B.1, we calculated

\mathbb{E}\left[ \frac{\partial f}{\partial w_i} \frac{\partial f}{\partial w_j} w_i w_j \right] = \mathbb{E}\left[ \frac{\partial f}{\partial w_i} \frac{\partial f}{\partial w_j} \right] \mathbb{E}[w_i w_j] = 0 \quad \text{for } i \ne j.

Nevertheless, in practice, this calculation does not always hold. Therefore, for the computation of the Fisher–Rao norm, we employed the diagonal of the Fisher information matrix F instead of F itself.
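The diagonal-F computation described above can be sketched as follows (our own illustration; estimating the Fisher diagonal from squared gradients of the scalarized network output over a batch is one of several common estimators and is an assumption of this sketch).

```python
import torch

def diag_fisher_rao_norm(net: torch.nn.Module, batch: torch.Tensor) -> float:
    """Fisher-Rao norm with diag(F) in place of the full Fisher matrix F:
    ||theta||_FR^2 = sum_i F_ii * theta_i^2."""
    params = [p for p in net.parameters() if p.requires_grad]
    fisher_diag = [torch.zeros_like(p) for p in params]
    for x in batch:                      # accumulate squared gradients
        net.zero_grad()
        net(x).sum().backward()          # scalar surrogate of the output
        for acc, p in zip(fisher_diag, params):
            acc += p.grad.detach() ** 2
    norm_sq = sum(((acc / len(batch)) * p.detach() ** 2).sum()
                  for acc, p in zip(fisher_diag, params))
    return norm_sq.item() ** 0.5
```

Tracking this quantity every few training steps reproduces the qualitative decay shown in Figs. A20 and A21.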
As shown in Figs. A20 and A21, large network depths (large L) produced high Fisher–Rao norms at initialization but degraded them drastically during training. Moreover, a wider network maintained the Fisher–Rao norm near its initial value. The discrepancy between these experimental values and the values predicted by Thms 5.9 and G.14 is considered to originate from Asm B.1, which does not hold when the diagonal of F is used instead of F itself.

Additionally, we assessed the influence of network width on the robust test accuracy of fully trained networks. The robust test accuracy was determined using ℓ∞-AutoAttack with ϵ = 0.1. The adversarial training was conducted with p = ∞, q = ∞, ϵ = 0.1, and 10 iterations. We employed Adam with a learning rate of 0.001 and 200 epochs. From Tabs. A5 and A6, it is evident that the robust test accuracy depends on the network width rather than the depth. The same superscripts in the tables indicate results from networks with an identical number of parameters. For example, the accuracy for (N, L) = (250, 5) significantly surpasses that for (N, L) = (125, 20). Although Thms 5.9 and G.14 are primarily applicable to the early stages of training, these theorems, which emphasize the importance of network width, hold true for fully trained networks.

Figure A12: Adversarial loss (Eq. (3)) in vanilla networks with d = 100, N = 40,000, L = 3, σ²_w = 2, σ²_b = 0.01, and ϵ = 0.1, as a function of the output dimension K. The description is the same as Fig. 2.

Figure A13: Adversarial loss (Eq. (3)) in residual networks with d = 100, N = 35,000, L = 3, σ²_w = 0.01, σ²_b = 0.01, and ϵ = 0.1, as a function of the output dimension K. The description is the same as Fig. 2.

Figure A14: Adversarial loss (Eq. (3)) in vanilla networks with d = 500, K = 100, L = 3, σ²_w = 2, σ²_b = 0.01, and ϵ = 0.1, as a function of the network width N. The description is similar to Fig. 2.

Figure A15: Adversarial loss (Eq. (3)) in vanilla networks with d = 500, N = 40,000, K = 100, L = 3, σ²_w = 2, and σ²_b = 0.01, as a function of the perturbation constraint ϵ. The description is similar to Fig. 2.

Figure A16: Adversarial loss (Eq. (3)) in vanilla networks with d = 500, N = 40,000, K = 100, σ²_w = 2, σ²_b = 0.01, and ϵ = 0.1, as a function of the network depth L. The description is similar to Fig. 2.
Table A5: Test accuracy (%) on MNIST [26] under ℓ∞-AutoAttack with the perturbation constraint 0.1. We used residual networks with σ²_w = 0.01 and σ²_b = 0.01. For the adversarial attack in training, we used p = ∞, q = ∞, ϵ = 0.1, and 10 iterations. We set the optimizer to Adam, the learning rate to 0.001, and the number of epochs to 200. Values marked with the same superscript denote results from networks with the same number of parameters LN².

            L = 5   L = 10  L = 15  L = 20
N = 125     13.8    13.1    9.79    8.69
N = 250     45.8    49.8    47.0    48.1
N = 500     79.2    76.5    75.3    73.4
N = 1,000   89.8    88.4    87.3    87.9

Figure A17: Adversarial loss (Eq. (3)) calculated with ϵ = 0.1. We used vanilla networks with d = 28 × 28, N = 10,000, K = 10, L = 3, σ²_w(0) = 2, and σ²_b(0) = 0.01. We trained the vanilla networks normally (not adversarially) with a learning rate of 0.001. The description is similar to Fig. 2.

Figure A18: Time evolution of the weight variance in the vanilla and residual networks with L = 10 during adversarial training with p = ∞, q = ∞, and ϵ = 0.3. In standard and adversarial training, we set the learning rate to 0.0001. In standard training with or without ℓ2 weight regularization, we set N = 1,000; adversarial training was run with N = 1,000 and N = 2,000. The vanilla network was initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; the residual network with σ²_w(0) = 0.1 and σ²_b(0) = 0.01. The orange dashed lines are predicted by Thms 5.4 and G.9; to derive the theoretical predictions, we calculated t := training steps × learning rate. For visibility, the experimental curves were shifted in parallel to satisfy σ²_w(0) = 2 on the vanilla networks and σ²_w(0) = 0.1 on the residual networks.

Figure A19: Heat map of the training accuracy of the vanilla and residual networks trained on MNIST. The description is the same as Fig. 4. The vanilla network was initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; the residual network with σ²_w(0) = 0.1 and σ²_b(0) = 0.01. The horizontal axis represents the network width and the vertical axis the depth.

Figure A20: Fisher–Rao norm of vanilla and residual networks adversarially trained on MNIST. We set p = ∞, q = ∞, ϵ = 0.1, and the learning rate to 0.0001. The vanilla network was initialized with σ²_w(0) = 2 and σ²_b(0) = 0.01; the residual network with σ²_w(0) = 0.1 and σ²_b(0) = 0.01. Curves correspond to (L, N) = (10, 1250), (20, 1250), (20, 2500), (20, 5000), and (20, 10000).

Figure A21: Fig. A20 with the origin shifted in parallel to zero.

Table A6: Test accuracy (%) on Fashion-MNIST [99] under ℓ∞-AutoAttack. The description is the same as Tab. A5.

            L = 5   L = 10  L = 15  L = 20
N = 125     7.78    9.22    10.4    11.0
N = 250     33.1    33.6    32.6    34.0
N = 500     51.7    51.7    51.6    50.8
N = 1,000   67.9    67.3    66.7    66.8