# a_sublinear_adversarial_training_algorithm__b4d645c4.pdf

Published as a conference paper at ICLR 2024

A SUBLINEAR ADVERSARIAL TRAINING ALGORITHM

Yeqi Gao Tsinghua Univeristy Beijing, China gaoyq23@mails.tsinghua.edu.cn

Lianke Qin UC Santa Barbara Santa Barbara, CA, USA lianke@ucsb.edu

Zhao Song Adobe Research Seattle, WA, USA zsong@adobe.com

Yitan Wang Yale University New Haven, CT, USA yitan.wang@yale.edu

Adversarial training is a widely used strategy for making neural networks resistant to adversarial perturbations. For a neural network of width m, n input training data in d dimension, it takes Ω(mnd) time cost per training iteration for the forward and backward computation. In this paper we analyze the convergence guarantee of adversarial training procedure on a two-layer neural network with shifted Re LU activation, and shows that only o(m) neurons will be activated for each input data per iteration. Furthermore, we develop an algorithm for adversarial training with time cost o(mnd) per iteration by applying half-space reporting data structure.

1 INTRODUCTION

After modest adversarial input perturbations, gradient-trained deep neural networks have a tendency to change their prediction results in an incorrect manner (Szegedy et al., 2013). Numerous steps have been taken to build deep neural networks immune to such malicious input. Among such efforts, adversarial training with a min-max object is most effective (Madry et al., 2018) to obtain a robust neural network against perturbed input according to Carlini & Wagner (2017); Athalye et al. (2018). Adversarial training usually yields small robust training loss in practice. The min-max formulation can be viewed as a two-player game. One player is the learner of neural network and the other player is an adversary allowed to arbitrarily perturb the input up to some norm constraint. For every round, the adversary creates adversarial inputs against the existing neural network. Then, the learner adjusts the parameters of neural network by taking a gradient descent step to reduce its prediction loss evaluated by adversarial inputs.

In the past few years, convergence analysis for training neural network on original input has been established. The seminal work of Jacot et al. (2018) initiates the study of neural tangent kernel (NTK), which is a very useful analytical model in the deep learning theory area. By over-parameterizing the neural network so that the network width is relatively large (m Ω(poly(n))), one can show that the training dynamic on a neural network is almost the same as that on an NTK. Analyzing the convergence guarantee via over-parameterization has been broadly studied (Li & Liang, 2018; Jacot et al., 2018; Du et al., 2019b; Allen-Zhu et al., 2019b;c; Du et al., 2019a; Song & Yang, 2019; Zou et al., 2020; Oymak & Soltanolkotabi, 2020; Lee et al., 2020; Brand et al., 2021; Song et al., 2021a; Huang et al., 2021; Song et al., 2021b; Chen & Xu, 2021; Munteanu et al., 2022; Hu et al., 2022; Zhang, 2022). Inspired by the over-parameterized neural network convergence theory, (Zhang et al., 2020) applies a similar analysis to adversarial training, i.e., training with perturbed input.

Training such adversarial neural networks is usually done via gradient descent, whose time is determined by the product of a number of training iterations and time cost spent on every training iteration. Many previous papers focus on accelerating the time cost per training iteration via nearest neighbor search (Chen et al., 2020; 2021; Daghaghi et al., 2021). For example, SLIDE Chen et al. (2020) speeds up the forward computation by efficiently retrieving activated neurons with the maximum inner product via a locality-sensitive hashing data structure. Using data structure to speed

Published as a conference paper at ICLR 2024

up optimization algorithms has made much progress for problems like linear programming (Cohen et al., 2021; Lee et al., 2019; Ye, 2020; Song & Yu, 2021; Dong et al., 2021; Jiang et al., 2021), semi-definite programming (Jiang et al., 2020; Huang et al., 2022), sum-of-squares (Jiang et al., 2022), non-convex optimization (Song et al., 2021a; Brand et al., 2021; Song et al., 2021b), reinforcement learning (Shrivastava et al., 2021a), frank-wolfe method (Shrivastava et al., 2021b; Song et al., 2022a) and discrepancy (Song et al., 2022b; Zhang, 2022). Following this line, we will also strive to reduce the time required for each training iteration by designing an algorithm to identify the activated neurons during each training iteration efficiently via high dimensional search data structure.

In this paper, we study a two-layer fully connected neural networks with shifted Re LU στ : R R activation. We first define the two-layer Re LU activation neural network

r=1 ar στ( wr, x + br)

where m is the number of hidden neurons and στ(x) := 1[x > τ] x is the shifted Re LU activation function with threshold τ. {wr}m r=1 Rd is the weight vector, {ar}m r=1 R is the output weight vector, and x Rd is the input vector. The total number of inputs is n. Therefore, in each training iteration, we need to compute the forward pass for each input vector which requires m vector inner product of d dimension. This implies a Ω(mnd) cost per training iteration. One question arises:

Can we design an adversarial training algorithm only requires o(mnd) cost per iteration?

The answer is affirmative. We outline our contributions as follows:

We analyze the convergence of adversarial training for a two-layer neural network with shifted Re LU activation, and show that in each iteration only o(m) of neurons are activated for a single input.

We have designed an adversarial training algorithm for a two-layer fully-connected neural network with shifted Re LU activation. We leverage a half-space reporting data structure of weights to identify sparsely activated neurons, enabling a sublinear training time cost per iteration.

1.1 MAIN RESULT

We give a formal statement of our main result in Theorem 1.1. Theorem 1.1 (Time complexity). Given n data points in Rd and a neural network model defined in Eq.(2), there exists an adversarial training algorithm (Algorithm 1) whose expected time cost per-iteration is o(mnd).

Roadmap. We present a brief overview of our techniques in Section 2. We present notations and preliminary tools in Section 3. We prove the existence of pseudo-network to approximate the target f in Appendix B. We provide the convergence analysis in Appendix C. And we provide some key results in Section 4. We will conduct a time complexity analysis of our adversarial algorithm in Section 5. We conclude the contribution of this paper in Section 6.

2 TECHNIQUE OVERVIEW

In this section, we briefly present an overview of the techniques used in this paper.

Technique I: Approximation via pseudo-network. An important fact about over-parameterized neural network used in many recent papers is that if a highly over-parameterized neural network f(x; W) = Pm r=1 ar,0 στ( wr, x + br,0) has weight close to its random initialization, then by the a pseudo-network g(x; W), f(x; W) could be approximated, where g(x; W) is defined as

r=1 ar,0 wr wr,0, x 1[ wr,0, x + br,0 τ]

Published as a conference paper at ICLR 2024

We show that supx X |f(x; W) g(x; W)| is very small with a probability 1 θc.1 We decompose f(x) into three components f(x) = A(x) + B(x) + C(x). A(x), B(x) and C(x) are represented as follows:

r=1 ar,0 wr, x Φx,r

r=1 ar,0( wr,0, x + br,0)Φ(0) x,r

r=1 ar,0( wr,0, x + br,0)(Φx,r Φ(0) x,r)

It follows from triangle inequality that |f(x; W) g(x; W)| |A(x) g(x)| + |B(x)| + |C(x)| . Φx,r and Φ(0) x,r are defined in Definition 3.6. Then we derive upper bounds for the three terms on the right-hand side respectively. With the exponential failure probability, we could construct an ϵ-net of X where ϵ = 1 poly(m) and apply union bound on this ϵ-net. Then we prove the stability of f and

g, where we get the bound of |f(x; W) g(x; W)| as O(K2 m 1/10) with Prob. 1 θ1/5. And we also give the perturbation analysis for f(x; W) and g(x; W) by choosing perturbation µ with µ 2 1

m and τ = O( 1

m). The upper bound of |g(x+µ; W) g(x)| and |f(x+µ; W) f(x; W)| can be bounded by O(K m 1/2) and O(K m 1 + 1/m1/5) separately with Pro. 1 θ1/2. With the pseudo-network at hand, we can give a further analysis of the gradient descent of the f(x; W). In this paper, the convergence of our network f(x; W) is established with the analysis above whereby choosing proper learning rate (η = Θ(ϵm 1/5)) and training iterations (T = Θ(ϵ 2K2)), with W near the initialization, we can attain

LA (f W ) + ϵ 1

t=1 LA(f Wt)

And then, we can show that with Prob. 1 θ1/5 and τ = O( 1

m), the existence of W near the initialization where

W0 W 2, K m3/5

and a small enough loss where LA (f W ) ϵ, is established.

Technique II: Shifted Re LU activation. The second technique is based on the observation that, shifted Re LU activation function on two-layer neural network, the number of activated neurons for each training data point is sublinear in the network width m. The Shifted Re LU activation στ is defined as

στ(x) := 1[x > τ] x.

By choosing τ properly here, we can attain o(m) activated neurons at initialization. Therefore, we first prove that at initialization, the number of activated neurons is upper bounded by Q0 = 2m exp( ( K2

4m1/5 )) = O(mc Q) with probability 1 n exp( Ω(m exp( ( K2

4m1/5 ))). Then we analyze the size of neuron set which has different signs at each training iteration compared to the neuron weights at initialization. it can be upper bounded by O(nm7/8) with probability at most exp( Ω(m1/5)). We can use them to derive the upper bound of activated neurons per training iteration via triangle inequality.

For the correctness of our paper, we also give the analysis of the approximation, the convergence, and the perturbation of our network based on the shifted Re LU function we proposed. Base on the Theorem A.3, we make a bridge between 1[ wr, x +br > τ] with τ = O( 1

m) and 1[ wr, x +br > 0], which is used in the prior work. And with this bridge, we can finish the proof of the analysis above. By choosing τ = O( 1

m), we show that |g(x; W) f(x; W)| O(K2 m 1/10) with Prob. 1 θ1/5. According to such a statement, the prediction loss can also be compressed to any ϵ (0, 1) with the

1We define θc := exp( Ω(mc)).

Published as a conference paper at ICLR 2024

same requirement on τ = O( 1

m) as well as the probability 1 θ1/5. For the perturbation analysis, we also give the bound of |g(x + µ; W) g(x)| and |f(x + µ; W) f(x; W)| with Pro. 1 θ1/2 and the Shifted Re LU activation we choose.

In total, in this paper by using Shifted Re LU activation, we attain o(mnd) time complexity per training iteration and give the convergence analysis and the perturbation analysis to make sure the existence of such a network f(x; W ) which leads to a small enough prediction loss.

Technique III: Half-space reporting. We leverage the data structure designed by Agarwal et al. (1992) to implement the half-space range reporting functionality. The half-space range reporting problem requires us to maintain a data structure to store a finite point set P Rd. In addition to storing P, half-space range problem also has queries. Each query can be denoted by (a, b) Rd R. For the query (a, b), the data structure should report a subset Pa,b P where Pa,b is defined as

Pa,b := {x Rd : x P, sgn( a, x b) 0}

One brute force implementation for the half-space range reporting problem is to maintain P in an array and enumerate all points in P to determine which points are contained in Pa,b. As we hope to do adversarial training with lower time cost, we introduce a tree data structure to organize points in P. With the tree data structure, we could report Pa,b efficiently. And we assume that the ar and br are not updated in this paper. In this structure, we use function query(x, (τ br)) to find satisfied wr. And we use Px,(τ br) as follows

Px,(τ br) = {wr Rd : wr P, sgn( wr, x (τ br)) 0}

We preprocess the network weights to construct a half-space range reporting (HSR) data structure for wr s in order that we can efficiently identify the set of activated neurons Qt,i for each of the adversarial input exi. By Agarwal et al. (1992), we can also attain the time complexity of the functions in the data structure for half-space reporting including init, query, update (See details in Section 5), which is presented as follows

Tinit(n, d) = Od(n log n), Tquery (n, d, k) = Od(n1 1/ d/2 + k), amortized Tupdate = Od(log2(n)).

Td,ϵ(n d/2 +ϵ), Tquery (n, d, k) = Od(log(n) + k), amortized Tupdate = Od(n d/2 1).

To get the running time of training per iteration, we compute the time spent on querying the active neuron set for adversarial training data points, forward and backward computation with the activated neuron set, and updating the HSR search data structure respectively. Based on the number of activated neurons at the initialization (where the number of activated neurons is proved to be o(m) by us) and during training, by summing all parts we obtain an algorithm whose running time per iteration is e O(m1 Θ(1/d)nd).

3 PRELIMINARIES

3.1 NOTATIONS

In this paper, x p denotes the ℓp norm and mainly focuses on p = 1, 2 or for a vector x.

B Rn k denotes the transpose of a matrix B Rk n. And we use B 1 as the entry-wise ℓ1 norm, B as the spectral norm and B F as the Frobenius norm. Bj denotes the j-th column of B where j [n]. B 2,1 denotes Pn j=1 Bj 2. And B 2, denotes maxj [n] Bj 2. We use N(µ, σ2) to represent a Gaussian distribution where σ is a covariance and µ is a mean. We use θc := exp( Ω(mc)). We define f as follows

f = sup x X |f(x)|, (1)

where f : X R.

Published as a conference paper at ICLR 2024

3.2 NETWORK FUNCTION

We consider a neural network f which is parameterized by (a, b, W) Rm Rm Rd m:

r=1 ar στ( wr, x + br) (2)

Note that this neural network is two-layer, and also called one-hidden layer. The activation function we consider here is shifted Re LU. F represent the class of the function above. Because a Rm and b Rm remain constant throughout adversarial training and only W Rd m is updated, we will also use f(x; W) to denote the network. And with τ > 0, στ : R R 0 is defined as follows

στ(x) := 1[x > τ] x.

Let S Rd R be the training data with element (xi, yi) where i [n]. There are some standard assumptions in our paper regarding the training set. While maintaining generality, we set for every xi Rd wherei [n] that xi,d = 1/2 and xi 2 = 1. Therefore, a set X := {x Rd : xd = 1/2, x 2 = 1} is defined. We also define X1 as a maximal 1 kc -net of X where c R+ is a constant. We set |yi| 1 where i [n] for simplicity. Definition 3.1 (Initialization of a, W, b). We define the initialization of a0 Rm whose entries are uniformly sampled from { 1 m1/5 , + 1 m1/5 }, W0 Rd m whose entries are i.i.d. sampled from N(0, 1

m), b0 Rm whose entries are i.i.d. sampled from N(0, 1

3.3 ROBUST AND ADVERSARY LOSS

We take into account the following loss function for analysing the neural nets. Definition 3.2 (Lipschitz convex regression loss). If a loss function ℓ: R R R meets the below criteria, it is a regression loss (Convex and Lipschitz): ℓ(x, x) = 0 for every x R, positive, 1 Lipshcitz and convex in the second argument. Definition 3.3 (ρ-Bounded adversary). With B2(u, ρ) := {v Rd : v u 2 ρ} X. Suppose ρ > 0. For adversary A : F X R X, we claim A is ρ-bounded, if A(f, x, y) B2(x, ρ). And then against a loss function ℓ, A is defined as the worst-case ρ-bounded adversary as follows:

A (f, x, y) := argmax ex B2(x,ρ) ℓ(y, f(ex))

Given a neural net f, we represent an adversarial dataset generated by A with respect to it as

A(f, S) := {(A(f, xi, yi), yi)}n i=1.

Then we give formal definition of robust loss function. Definition 3.4 (Robust loss and training loss). For a raw data set S = {(xi, yi)}n i=1 with n training points and a prediction model f, the training loss is

L(S, f) := 1

i=1 ℓ(yi, f(xi)).

The robust training loss is defined for ρ-bounded adversary A as

LA(f) := L(A(f, S), f) = 1

i=1 ℓ(yi, f(A(f, xi, yi)))

In addition, the worst-case of the above loss is defined similarly as

LA (f) := L(A (f, S), f) = 1

i=1 max e xi B2(xi,ρ) ℓ(yi, f( exi))

Definition 3.5 (Chi-square distribution). If a1, , am are independent, standard normal random variables, then the sum of their squares, Q := Pm i=1 a2 i , is distributed according to the chi-squared distribution with m degrees of freedom.

Published as a conference paper at ICLR 2024

Algorithm 1 Sublinear adversarial training

1: procedure FASTADVERSARIALTRAINING(a, b) 2: Adversary A 3: Learning rate η. 4: Training set S = {(xi, yi)}n 1 5: Initialization a0, b0, W0. 6: Data Structure for Half-space reporting ds 7: ds.INIT(w1, , wm) 8: for t in [T] do 9: S(t) := 10: for i in [n] do 11: ex(t) i = A(f Wt, xi, yi)

12: Qt,i ds.QUERY(ex(t) i ) 13: Qt,i [m], it is a set of indices j such that j-th neuron is activated

14: S(t) = S(t) (ex(t) i , yi) 15: end for 16: Qt i [n]Qt,i 17: Forward pass for ex(t) i only on neurons in Qt,i for i [n]

18: Calculate gradient for ex(t) i only on neurons in Qt,i for i [n] 19: Gradient update W(t + 1) = W(t) η W L(f W (t), S(t)) for the neurons in Qt 20: ds.DELETE(wr(t)) for r Qt 21: ds.ADD(wr(t + 1)) for r Qt 22: end for 23: return {W(t)}T t=1 24: end procedure

Definition 3.6 (Boolean function for activated neurons). For r [m], we define wr, Φx,r and Φ(0) x,r as follows:

wr := wr wr,0, Φx,r := 1[ wr,0 + wr, x + br,0 τ],

Φ(0) x,r := 1[ wr,0, x + br,0 τ]

3.4 WELL-SEPARATED TRAINING SETS

A commonly made assumption in the literature on over-parameterized neural network is to assume the training set is well-separated. There are several variants of separability defined in related literature. We give the formal definition of the γ-separability adopted in this paper.

Definition 3.7 (γ-separability). If for all i = j [n], γ ϵ(ϵ 2ρ) and xi xj 2 ϵ, a training dataset X will be γ-separable for a ρ-bounded adversary.

3.5 SUBLINEAR ADVERSARIAL TRAINING ALGORITHM

Associated with an adversary A, the algorithm 1 can be used to describe training a network in an adversarial way. The adversary creates adversarial samples against the present neural network in the inner loop. To reduce its loss on the new adversarial samples, the neural network s parameter undergoes a gradient descent step in the outer loop.

Remark 3.8. We assume that S(t) and Wt are independent, when calculating the W L(S(t), f Wt) so that we don t need differentiating over A.

3.6 ACTIVATED NEURON SET

In the analysis of our algorithm, we pay attention to the activated neurons. We give the formal definition of activated neuron set in Definition 3.9.

Published as a conference paper at ICLR 2024

Definition 3.9 (Activated neuron set). For all t {0, 1, , T}, τ > 0 and i [n], we define the set which has activated neurons at time t as Qt,i := {r [m] : wr,t, xi + br > τ} The number of activated neurons at time t is also defined as the size of Qt,i ki,t := |Qt,i| (3)

4 ANALYSIS OF OUR ALGORITHM

In this section, we will provide an analysis from two perspectives. Firstly, we will delve into the convergence analysis when employing the Shifted Re LU activation function. Secondly, we will establish the upper bound of activated neurons during each iteration, a crucial step in our algorithm s time complexity analysis. Now, let s shift our focus to the convergence analysis.

4.1 CONVERGENCE ANALYSIS OF ADVERSARIAL TRAINING WITH SHIFTED RELU ACTIVATION

In this section, we provide essential components of the correctness proof. Due to space constraints, we have included the complete proof in the appendix. Here, we offer an overview of our approach to demonstrating the convergence of the training process, considering both adversarial data and the shifted Re LU activation function.

For analyzing the gradient descent of f(x; W), a fake-network g(x; W) is defined in this paper. And we show that by choosing W near the initialization and τ = O( 1

m), with Prob. 1 θ1/5 the neuron network f(x; W) can be approximated by g(x; W) properly. The proof is given in Appendix C.1. Theorem 4.1 (Approximation of f(x; W)). With K 1, τ = O( 1

m), for every m poly(d), and every W Rd m where W0 W 2, K m3/5 , with probability at most 1/ exp(Ω(m1/5)) over the initialization(See Definition 3.1), we have that

sup x X |g(x; W) f(x; W)| O(K2 m 1/10)

We do perturbation analysis for f(x; W) and g(x; W) with shifted Re LU activation in the following lemma when the perturbation µ satisfies that µ 2 1

m and x + µ X. We obtain the perturbation upper bound for |f(x + µ; W) f(x; W)| and |g(x + µ|; W) g(x; W)| with the shifted Re LU activation threshold set as τ = O(1/m), which is proved in Appendix C.9. Lemma 4.2 (Perturbation analysis of f(x; W) and g(x; W)). For every x X1, every µ Rd where µ 2 1

m and x + µ X, with τ = O( 1

m) and probability at least 1 1/ exp(Ω(m1/2)), we have

|g(x + µ; W) g(x; W)| O(K m 1/2). (4) and |f(x + µ; W) f(x; W)| O(K m 1 + 1/m1/5) (5)

The convergence of our neural network f(x; W) with shifted Re LU is stated as follows: Theorem 4.3 (Optimal weights versus the initialization). For all K > 0, ϵ (0, 1), and for every m larger than poly(n, K, 1/ϵ), we set η = Θ(ϵm 1/5) and T = Θ(ϵ 2K2). in Algorithm 1. For every W Rd m with W0 W 2, K m3/5 , Algorithm 1 outputs weights {Wt}T t=1 such that

LA (f W ) + ϵ 1

t=1 LA(f Wt)

with succeed probability at least 1 θ1/5. The randomness is from random initialization (See Definition 3.1).

In the following statement, we show that by choosing proper τ = O( 1

m) and m as a large enough constant, a W around the initialization and a small enough worst-case robust loss can be attained by us. The proof is given in Appendix B.8. Theorem 4.4 (Convergence of Algorithm 1). For all ϵ (0, 1), we can get K = poly((n/ϵ)1/γ) and let m be larger than some poly(d, (n/ϵ)1/γ) we can find, with probability 1 θ1/5 and τ = O( 1

m), there will be a W Rd m such that LA (f W ) ϵ and W0 W 2, K m3/5 . The randomness is because of the selection of W0, b0, a0.

Published as a conference paper at ICLR 2024

4.2 THE UPPER BOUND OF ACTIVE NEURONS DURING TRAINING

For the purpose of getting the time complexity of per training iteration, we propose the following two statements, Lemma 4.5 and Claim 4.6. And their proofs are given in Appendix C.13 and C.12. By applying shifted Re LU activation στ in neural network and choosing K/m3/5 as τ, we can attain o(m) activated neurons here. With o(m) activated neurons, we can attain the running time per training iteration. Lemma 4.5 (Number of activated neurons during initialization). Let c Q (0, 1) denote a fixed constant. Let Q0 = 2m exp( ( K2

4m1/5 )) = O(mc Q). For every η R+, x X and r [m], Ψr(x, η) is defined as Ψr(x, η) := 1[| wr,0, x + br,0| η]. With succeed probability at least 1 n/ exp(Ω(m exp( ( K2

4m1/5 ))), it holds that X

r [m] Ψr(x, K/m3/5) Q0.

Besides the activated neurons during initialization bounded, we also give the bound of activated neurons versus the initialization. The proof given in Appendix C.12 is based on the shifted Re LU activation στ with τ > 0. The sublinear activation of neurons, along with the limitation on the number of neurons versus the initialization which is the primary focus of our computations during the training, provides us with an opportunity to employ the Half-Space Data Structure introduced in the following section. This allows us to maintain a time complexity of o(nmd).

Claim 4.6 (Bound for activated neurons versus the initialization). For any wr 2 m 15/24 and any subset {xi}n 1 X. Then it holds with succeed probability at least 1 θ1/5 that

r=1 1[ i [n], sgn( xi, wr,0 + wr + br,0 τ) = sgn( xi, wr,0 + br,0 τ)] < O(nm7/8)

i [m], 1[ i [n], sgn( xi, wr,0 + wr + br,0 τ) = sgn( xi, wr,0 + br,0 τ)] < O(nm 1/8)

The randomness is from initialized states (See Definition 3.1).

5 TIME COMPLEXITY

For the identification of activated neurons at each iteration, we will introduce a half-space data structure. This section 5 analyzes the adversarial training algorithm s time complexity. In section 5.1, we provide the Half-space reporting data structure at first. In Section 5.2, we examine the time complexity of our sublinear time adversarial training algorithm.

5.1 DATA STRUCTURE FOR HALF-SPACE REPORTING

We formally give the definition of the problem of the half-space range reporting, which is of importance in the field of computational geometry. And we give the data-structure in Agarwal et al. (1992) whose functions are outlined in Algorithm 2 and complexity is given in Corollary 5.2. Definition 5.1 (Half-space range reporting). For a set of m points P Rd, three operations are supported:

QUERY(W): find each point in P W with W Rd as a half-space.

INSERT(p) : add a point p to P.

REMOVE(p) : remove a point p from P.

We address the problem (See Definition 5.1) with the data-structure in Agarwal et al. (1992) which has the functions outlined in Algorithm 2. Because the data-structure partitions the set P recursively as well as uses a tree data-structure to organizes the points. Given a query (b, c), all the points z Rd from P which satisfy b, z c 0 can be quickly found from P. And in our paper, the problem is finding all r [m] such that wr, z + br τ > 0. The Half-Space (See Definition 5.1.) in our algorithm is defined by (z, (τ br)) for r [m] and the QUERY(b, c) holds for QUERY(wr, (τ br)).

Published as a conference paper at ICLR 2024

Algorithm 2 Data Structure For Half Space Reporting

1: data structure 2: INIT(P, n, d) Construct our data structure via P Rd, |P| = n 3: QUERY(b, c) b, c Rd. Find all the points z P which satisfies sgn( b, z c) 0 4: INSERT(z) Insert a point z Rd to P 5: REMOVE(z) Remove a point z Rd from P 6: end data structure

Corollary 5.2 (Agarwal et al. (1992)). Considering n points in Rd as a set, we can solve the halfspace reporting problem with the complexity as follows, where TINIT indicates the time to construct the data structure, TQUERY represent the time spent on each query and Tupdate represent the time on each update:

Tinit(n, d) = Od(n log n), Tquery (n, d, k) = Od(n1 1/ d/2 + k), amortized Tupdate = Od(log2(n)).

Td,ϵ(n d/2 +ϵ), Tquery (n, d, k) = Od(log(n) + k), amortized Tupdate = Od(n d/2 1).

5.2 WEIGHTS PREPROCESSING

We offer an adversarial training algorithm with sublinear time in this part. The foundation of our algorithm (Algorithm 1) is the construction of a search data structure (in high dimensional space) for neural network weights. To avoid needless calculation, the activated neurons can be identified quickly by using the Half-Space reporting technique. We provide an algorithm with preprocessing procedure for wr where r [m], which is widely used (see Chen et al. (2020); Kitaev et al. (2020); Chen et al. (2021)). The two-layer Re LU network is given in section 3.2 as f(x) = Pm r=1 ar στ( wr, x + br). We can rapidly identify a collection of active neurons Qt,i for every i (which is the adversarial training sample xi) by building a HSR data structure. Pseudo-code is shown in Algorithm 1. The time complexity analysis of the algorithm is the major topic of the remaining portion of this section.

5.3 TIME COMPLEXITY PER TRAINING ITERATION

Since we leverage the Half-Space Data Structure, the training time complexity is inherently linked to the activated neurons, as elaborated in Section 4.2. Given that the number of activated neurons is sublinear during each iteration, the following conclusion naturally ensues.

Lemma 5.3 (Time complexity). Given a two-layer Re LU network defined in Eq.(2). Assume there are n data points. Say those points are from d-dimensional space. Then, the expected time cost for each iteration of our adversarial algorithm (Algorithm 1) runs in time e O(m1 Θ(1/d)nd).

The detailed proof is deferred to Appendix C.16.

6 CONCLUSION

Making neural networks more resilient to the impact of adversarial perturbations is a common application of adversarial training. The usual time complexity of training a neural network of width m, and n input training data in d dimension is Ω(mnd) per iteration for the forward and backward computation. We show that only a sublinear number of neurons are activated for every input data during every iteration when we apply the shifted Re LU activation layer. To exploit such characteristics, we leverage a high-dimensional search data structure for half-space reporting to design an adversarial training algorithm that only requires e O(m1 Θ(1/d)nd) time for every training iteration. To the best of our knowledge, our work does not result in adverse effects on society. We also emphasize that the training procedure proposed in this paper is equivalent to applying adversarial training to neural network. Although there is no extra negative societal impact introduced by our method, to make sure the neural network model is properly used, the user needs extra attention, which goes beyond the scope of this paper.

Published as a conference paper at ICLR 2024

ACKNOWLEDGMENTS

Yitan Wang gratefully acknowledges support from ONR Award N00014-20-1-2335 and a Simons Investigator award from the Simons Foundation to Daniel Spielman.

Pankaj K Agarwal, David Eppstein, and Jiri Matousek. Dynamic half-space reporting, geometric optimization, and minimum spanning trees. In Annual Symposium on Foundations of Computer Science, volume 33, pp. 80 80. IEEE COMPUTER SOCIETY PRESS, 1992.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Neur IPS. ar Xiv preprint ar Xiv:1811.04918, 2019a.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via overparameterization. In ICML, 2019b.

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In Neur IPS, 2019c.

Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. ar Xiv preprint ar Xiv:1802.00420, 2018.

Sergei Bernstein. On a modification of chebyshev s inequality and of the error formula of laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38 49, 1924.

Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.

Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39 57. IEEE, 2017.

Beidi Chen, Tharun Medini, James Farwell, Charlie Tai, Anshumali Shrivastava, et al. Slide: In defense of smart algorithms over hardware acceleration for large-scale deep learning systems. Proceedings of Machine Learning and Systems, 2:291 306, 2020.

Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, and Christopher Re. Mongoose: A learnable lsh framework for efficient neural network training. In International Conference on Learning Representations, 2021.

Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same rkhs. In ICLR, 2021.

Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, pp. 493 507, 1952.

Michael B Cohen, Yin Tat Lee, and Zhao Song. Solving linear programs in the current matrix multiplication time. Journal of the ACM (JACM), 68(1):1 39, 2021.

Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, and Anshumali Shrivastava. Accelerating slide deep learning on modern cpus: Vectorization, quantizations, memory optimizations, and more. Proceedings of Machine Learning and Systems, 3:156 166, 2021.

Sally Dong, Yin Tat Lee, and Guanghao Ye. A nearly-linear time algorithm for linear programs with small treewidth: A multiscale representation of robust central path. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pp. 1784 1797, 2021.

Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019a.

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR, 2019b.

Published as a conference paper at ICLR 2024

Roy Frostig, Cameron Musco, Christopher Musco, and Aaron Sidford. Principal component projection without principal component analysis. In International Conference on Machine Learning, pp. 2349 2357, 2016.

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13 30, 1963. ISSN 01621459.

Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. In ar Xiv preprint ar Xiv: 2208.04508, 2022.

Baihe Huang, Xiaoxiao Li, Zhao Song, and Xin Yang. Fl-ntk: A neural tangent kernel-based framework for federated learning analysis. In International Conference on Machine Learning, pp. 4423 4434. PMLR, 2021.

Baihe Huang, Shunhua Jiang, Zhao Song, Runzhou Tao, and Ruizhe Zhang. Solving sdp faster: A robust ipm framework and efficient implementation. In FOCS, 2022.

Arthur Jacot, Franck Gabriel, and Cl ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571 8580, 2018.

Haotian Jiang, Tarun Kathuria, Yin Tat Lee, Swati Padmanabhan, and Zhao Song. A faster interior point method for semidefinite programming. In 2020 IEEE 61st annual symposium on foundations of computer science (FOCS), pp. 910 918. IEEE, 2020.

Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. A faster algorithm for solving general lps. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2021, pp. 823 832, New York, NY, USA, 2021. Association for Computing Machinery.

Shunhua Jiang, Bento Natura, and Omri Weinstein. A faster interior-point method for sum-of-squares optimization. ar Xiv preprint ar Xiv:2202.08489, 2022.

Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. ar Xiv preprint ar Xiv:2001.04451, 2020.

B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302 1338, 2000.

Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural networks. In Neur IPS, 2020.

Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In Conference on Learning Theory, pp. 2140 2157. PMLR, 2019.

Wenbo V Li and Q-M Shao. Gaussian processes: inequalities, small ball probabilities and applications. Handbook of Statistics, 19:533 597, 2001.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Neur IPS, 2018.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR. ar Xiv preprint ar Xiv:1706.06083, 2018.

Alexander Munteanu, Simon Omlor, Zhao Song, and David Woodruff. Bounding the width of neural networks via coupled initialization a worst case analysis. In International Conference on Machine Learning, pp. 16083 16122. PMLR, 2022.

Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84 105, 2020.

Sushant Sachdeva, Nisheeth K Vishnoi, et al. Faster algorithms via approximation theory. Foundations and Trends in Theoretical Computer Science, 9(2):125 210, 2014.

Published as a conference paper at ICLR 2024

Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. Sublinear least-squares value iteration via locality sensitive hashing. ar Xiv preprint ar Xiv:2105.08285, 2021a.

Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems (Neur IPS), 34, 2021b.

Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. ar Xiv preprint ar Xiv:1906.03593, 2019.

Zhao Song and Zheng Yu. Oblivious sketching-based central path method for linear programming. In International Conference on Machine Learning, pp. 9835 9847. PMLR, 2021.

Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems, 34, 2021a.

Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. ar Xiv preprint ar Xiv:2112.07628, 2021b.

Zhao Song, Zhaozhuo Xu, Yuanyuan Yang, and Lichen Zhang. Accelerating frank-wolfe algorithm using low-dimensional and adaptive data structures. ar Xiv preprint ar Xiv:2207.09002, 2022a.

Zhao Song, Zhaozhuo Xu, and Lichen Zhang. Speeding up sparsification using inner product search data structures. ar Xiv preprint ar Xiv:2204.03209, 2022b.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. ar Xiv preprint ar Xiv:1312.6199, 2013.

Guanghao Ye. Fast algorithm for solving structured convex programs. The University of Washington, Undergraduate Thesis, 2020.

Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master s thesis, Carnegie Mellon University, 2022.

Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Overparameterized adversarial training: An analysis overcoming the curse of dimensionality. Advances in Neural Information Processing Systems, 33:679 688, 2020.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes overparameterized deep relu networks. In Mach. Learn., 2020.

Published as a conference paper at ICLR 2024

Roadmap. We first present some useful tools from previous work in Section A, then demonstrate the existence of pseudo-network which can approximate f in Section B. We then present our convergence analysis in Section C.

A MORE TOOLS FROM PREVIOUS WORK

Lemma A.1 (Corollary 5.4 in Frostig et al. (2016)). With k = 1 η2 ln (2/ϵ1) and x [ 1, η] [η, 1],

we have | sgn(x) pk(x)| ϵ1/2, where pk(z) := z Pk i=0(1 z2)i Qi j=1 2j 1

Theorem A.2 (Theorem 3.3 from Sachdeva et al. (2014)). With X1, . . . , Xm as independent 1 random variables, Xm := Pm i=1 Xi and t 0, for all m, t which are positive integers and all z [ 1, 1], we have

|EX1,...,Xm[TZm(z) 1[|Xm| t]] zm| 2e t2/(2m)

Theorem A.3 (Theorem 3.1 from Li & Shao (2001)). Suppose that η > 0 and τ > 0. We obtain,

exp( η2/2) Pr x N(0,1)[|x| τ] Pr x N(0,1)[|x η| τ] Pr x N(0,1)[|x| τ].

The following results will be used in our proof. Definition A.4 (Lipschitz continuity). A function f : X Y is called C-Lipschitz continuous if, for each x1, x2 X it holds that d(f(x1), f(x2)) C d(x1, x2), where d stands for the distance. Lemma A.5. With X is sampled from N(0, σ2), we have Pr[|X| η] ( 2

5 η σ) . Lemma A.6 (Hoeffding bound Hoeffding (1963)). For {Xi}n 1 where Xi is bounded by [ci, di] independently, σi = (di ci) and U = Pn i [n] Xi, it holds that Pr[|U E[U]| η] 2 exp( 2η2 P

i [n] σ2 i ).

Lemma A.7 (A Sharper Bound for a Chi-square Variable). Let X be chi-square with n degrees of freedom. For any positive t, the tail of X can be bounded by Pr[X n 2

nt + 2t] e t and Pr[X n 2

nt] e t. Lemma A.8 (Chernoff bound Chernoff (1952)). With Y = Pn i=1 Yi, where Pr[Yi = 1] = pi and Pr[Yi = 0] = 1 pi for all i [n], and Yi are independent, and µ = E[Y ] = Pn i=1 pi, it holds that

0 < ϵ < 1, Pr[Y µ(1 ϵ)] exp( ϵ2µ/2),

ϵ > 0, Pr[Y µ(1 + ϵ)] exp( ϵ2µ/3). Lemma A.9 (Bernstein inequality Bernstein (1924)). We use Y1, , Yn to represent independent random variables whose mean are 0. Assume that |Yi| M almost certainly. We have for every η > 0,

i=1 Yi > η] exp

1 2η2 Pn j=1 E[Y 2 j ] + 1

Claim A.10. Let w, b be sampled from N(0, Id) and N(0, 1) respectively. Suppose η 0 and for every x X, it holds that Pr[| w, x + b| η] = O(η).

B EXISTENCE OF PSEUDO-NETWORK TO APPROXIMATE f

In this section, a complexity measure s definition for polynomials is given first in Section B.1. In Section B.2, we state a tool from previous work. We demonstrate the existence of a function f that reliably fits the training dataset and has low complexity in Section B.3. We prove that there exists a univariate polynomial qϵ1(z) to estimate the step function in Section B.4. We provide a stronger version of the polynomial approximation guarantee in Section B.5. An upper bound of the coefficients of Ck(x) is given in Section B.6.

Then we demonstrate that by using a pseudo-network we can approximate f in Section B.7. We put them together to prove Theorem B.10 in Section B.8.

Published as a conference paper at ICLR 2024

B.1 DEFINITIONS

In the beginning, we define a complexity measure of polynomials according to Allen-Zhu et al. (2019a) with K = 1 in this work. Definition B.1. With parameter ϵ1 > 0, c as a large enough constant and univariate polynomial ϕ(z) = Pk j=0 αjzj with any degree-k, the following two complexity measures are what we define

C(ϕ, ϵ1) :=

j=0 cj (1 + ( p

ln(1/ϵ1)/j)j) |αj|

j=0 (j + 1)1.75|αj|.

B.2 TOOLS FROM PREVIOUS WORK

We state a tool from previous work (Laurent & Massart, 2000). Lemma B.2 (Laurent-Massart (Laurent & Massart, 2000)). Let a1, ..., an be nonegative, and set

|a| = sup i [n] |ai|, |a|2 2 =

For i.i.d Zi N(0, 1), let X = Pn i=1 ai(Z2 i 1). Then the following inequalities hold for any positive t:

Pr[X 2|a|2 2

t + 2|a| ] e t

Lemma B.3 (Lemma 6.2 from Allen-Zhu et al. (2019a)). Let C be defined as Definition B.1. For every ϵ2 (0, 1/C(ϕ)). Let ϕ : R R be a univariate polynomial. There will be a function h : R2 [ C(ϕ, ϵ2), C(ϕ, ϵ2)] so that for all y, w Rd where y 2 = w 2 = 1: the following E[h( w , u , β) 1{ u, y + β 0}] ϕ( w , y ) ϵ2.

holds. In the above equation, β N(0, 1), u N(0, Id).

B.3 ROBUST FITTING WITH POLYNOMIALS

The purpose of this section is to give the proof of Lemma B.4. Lemma B.4 (Lemma 6.2 in Zhang et al. (2020)). With M = 24γ 1 ln(48n/ϵ), we have a polynomial q : R R where the upper bound of its coefficients is O(γ 126M) and the upper bound of its degree is M, such that for every exj B2(xj, ρ) and j [n],

i=1 yi q( xi, exj ) yi ϵ/3.

B.4 EXISTENCE OF STEP FUNCTION ESTIMATION

In the following claim, we prove the existence of a univariate polynomial qϵ1(z) to roughly estimate step function.

Claim B.5 (Lemma 6.4 in Zhang et al. (2020)). With M = 24 ln(16/ϵ1)

γ and ϵ1 (0, 1), there will be

a univariate polynomial qϵ1(z) with the bound of coefficients is O( 26M

γ ) and degree at most M, such that

1. |qϵ1(z)| ϵ1, for all z such that 1 z < 1 (δ ρ)2/2.

2. |qϵ1(z) 1| ϵ1, for all z such that 1 ρ2/2 z 1

Published as a conference paper at ICLR 2024

B.5 STRONGER POLYNOMIAL APPROXIMATION

According to Frostig et al. (2016), we need a more robust version of the sgn function s polynomial approximation result in the following lemma. Lemma B.6 (Lemma 6.4 in Zhang et al. (2020)). Let M = 3

ηϵ1 ) and η, ϵ1 (0, 1). There will be a univariate polynomial

j {0,1, ,k} αjxj

with |αj| 24M and degree k M, which is the sgn function ϵ1-approximation in [ 1, 1] \ ( η, η) such that

1. x [ 1, η], |pϵ1(x) + 1| ϵ1.

2. x [η, 1], |pϵ1(x) 1| ϵ1.

B.6 UPPER BOUND OF SIZE OF COEFFICIENTS OF Ck(x)

In the following we get the upper bound of coefficients of Ck(x) s size. Proposition B.7 (Proposition A.13 in Zhang et al. (2020)). We define Ck(x) as follows:

x2j( 1)i jxk 2i

The size of the coefficients of Ck(x) is at most 22k.

The following outcome is a direct result of the Proposition B.7: Corollary B.8 (Corollary A.14 in Zhang et al. (2020)). With s > 0, Ms := Ps i=1 Xi, M 0 and X1, . . . , Xs iid 1 random variables, ps,M(x) is defined as:

ps,M(x) := EX1,...,Xs[CMs(x) 1[|Ms| M]].

The upper bound of ps,M(x) s degree is M and the upper bound of its coefficients is 22M.

B.7 FAKE-NETWORK APPROXIMATES f

Suppose m > poly(d, (n/ϵ)1/γ). For a matrix W Rd m, with a fake-network g(x; W ), f can be approximated uniformly across X. Compared to Zhang et al. (2020), our neuron network here is based on the shifted Re LU activation whose threshold is τ. Lemma B.9 (A variation of Lemma 6.5 in Zhang et al. (2020)). For all ϵ (0, 1), there will be K = poly((n/ϵ)1/γ) so that let τ = O( 1

m), m is larger than some poly(d, (n/ϵ)1/γ we can find and then with probability 1 1/ exp(Ω(( m

n )1/2)), there will be a W Rd m such that

sup x X |g(x; W ) f (x)| ϵ/3

W W0 2, K m3/5

Recall that the randomness is due to initialization (See Definition 3.1).

Proof. With q(z) as the polynomial constructed from Lemma B.4 for every i [n], its complexities C(q) and C(q, ϵ3) (See Definition B.1) can be bounded now, where we will set ϵ3 after C(q) bounded (ϵ3 < 1/C(q) according to Lemma C.18).

Next, we explain how to upper bound the C(q), one can get

j=0 (j + 1)1.75 1

Published as a conference paper at ICLR 2024

< c c2 (M + 1)2.75

where the 1st inequality is due to Definition B.1 and M is the upper bound of the degree of q and c2 1

γ 26M is the upper bound of the size of its coefficients, where c2 R+ is a constant and

γ ln(48n/ϵ), the second step is due to (M + 1)2.75 PM j=0(j + 1)1.75.

By ϵ3 = (c c2 (M+1)2.75

γ 26M) 1, we can get:

ln(1/ϵ3) = ln(c c2 (M + 1)2.75

Then we discuss about how to upper bound C(q, ϵ3),

C(q, ϵ3) c2

1 γ 26Mcj(1 + p

ln(1/ϵ3)/j)j

γ 26M(M + 1)c Me

γ 26M(M + 1)c Me

where the first step is due to Definition B.1 and M is the upper bound of the degree of q and c2 1

is the upper bound of the size of its coefficients, the second step comes from the geometric sum formula, the third step comes from Eq. (6), and the fourth step follows that c Me

M O(M) 2O(M).

With regard to the selection of k and the random variables, we now describe how to carry out the n use of the lemma.

e B := c1 d ϵ2 3 C2(q, ϵ3) .

We have that n e B d( n

ϵ )c/γ m with a large enough c chosen. We use the Lemma C.18 with k = m

n for i [n 1] and with k = m (n 1) m

n for i = n.

For the i-th datapoint, we use the (m1/2W(i 1) m

n +r,0, m1/2b(i 1) m

n +r,0, a(i 1) m

n +r,0) as (wr,0, br,0, a0) in the Lemma C.18.

By using a union bound over n different terms, we can recompute the probability, which is

1 n/ exp(Ω( p

m/n)) = 1 1/ exp(Ω( p

Recall the movement of weight,

W = [ W (1), , W (m)] Rd m.

Then we can obtain the following upper bound for W 2, :

W 2, O(C(q, ϵ3)m1/5

O(m 4/5 C(q, ϵ3) n)

O(m 3/5 C(q, ϵ3) n)

O(m 3/5 (n/ϵ)O(γ 1))

Published as a conference paper at ICLR 2024

where the reason behind the first inequality is Lemma C.18, the second inequality is due to simplifying the terms, the third step is due to 1 m4/5 1 m3/5 and the final step comes from e Bn (n/ϵ)c/γd m.

r=1 1[ wr,0, x + br,0 τ]ar,0 wr, x

i=1 yiq( xi, x )| nϵ3 ϵ/3,

where the final inequality serves as a rough but adequate bound.

B.8 PUTTING IT ALL TOGETHER

When we construct function the gx;W and f(x; W), we bring the new shifted RELU activated function. And existence problem is base on the new function constructed compared to Zhang et al. (2020). Put everything in this section together we can prove the following theorem:

Theorem B.10 (A variation of Theorem 5.3 in Zhang et al. (2020)). For all ϵ (0, 1), we can get K = poly((n/ϵ)1/γ) and let m be larger than some poly(d, (n/ϵ)1/γ) we can find and τ = O( 1

m), with probability 1 θ1/5 , there will be a W Rd m so that the following inequality holds:

LA (f W ) ϵ

W0 W 2, K m3/5

The randomness is because of selection of W0, b0, a0.

Proof. To prove Theorem B.10, we shall make use of Lemmas B.4, B.9, and Theorem C.1. f is the result of Lemma B.4.

As a result of integrating Theorem C.1 with Lemma B.9 and m max{M, poly(d)}, we obtain with prob.

1 1/ exp(Ω(m1/5)) 1/ exp(Ω( p

there will be a W Rd m such that

x X, |g(x; W ) f(x; W )| O( K2

m1/10 ) and |g(x; W ) f (x)| ϵ/3. (7)

W0 W 2, K m3/5

Therefore, for every exi B(xi, ρ), i [n],

ℓ(yi, f(exi; W )) |f(exi; W ) yi| |f (exi) yi| + |g(exi; W ) f (exi)| + |f(exi; W ) g(exi; W )|

where the first step follows from the 1-Lipschitz defined in Definition 3.2 , the second step is because of triangle inequality and the third step is due to Eq. (7) and the final step comes from ϵ/3 O( K2

m1/10 ). poly(d, (n/ϵ)1/γ) m, given a sufficiently big polynomial. Therefore, we can get LA (f ) ϵ.

We get 1 1/ exp(Ω(m1/5)) as the bound of probability by choosing m n2.

Published as a conference paper at ICLR 2024

C OUR ANALYSIS

We demonstrate the stability of f and g in Section C.1 and the upper bound of the sum of Fr in Section C.2. In Section C.3, we upper bound Ψr. In Section C.12, we analyze the upper bound of number of activated neurons during initialization. In Section C.4, we define several helper functions A(x), B(x) and C(x). In Section C.5, we first prove the upper bound of |A(x) g(x)|. Then we upper bound |B(x)| in Section C.6. We further upper bound |C(x)| in Section C.7. We can obtain the overall upper bound for |f(x) g(x)| in Section C.8. We provide some perturbation analysis in Section C.9. We prove the convergence of our algorithm in Section C.10. We analyze the gradient coupling of our algorithm in Section C.11. We upper bound the number of activated neurons for every iteration in Section C.13. We prove the pseudo-network approximation guarantee in Section C.14. We analyze the stability of S1 and S2 in Section C.15.

C.1 PROOF OF STABILITY OF f AND g

Note that when we prove the stability of f and g, our definition of f and g is different from Zhang et al. (2020) because of the introduced shifted Re LU operation. With a selection of τ = O( 1

m), we prove a different upper bound for supx X |f(x; W) g(x; W)|, where f(x; W) = Pm r=1 ar,0( wr,0 + wr, x + br,0)Φx,r and g(x; W) = Pm r=1 Φ(0) x,rar,0 wr, x compared with Zhang et al. (2020).

Theorem C.1 (A variation of Theorem 5.1 in Zhang et al. (2020)). With K 1, τ = O( 1

m), for every m poly(d), and every W Rd m where W0 W 2, K m3/5 , with probability at most 1/ exp(Ω(m1/5)) over the initialization(See Definition 3.1), we have that

sup x X |g(x; W) f(x; W)| O(K2 m 1/10)

Proof. We have W W0 2, K m3/5 , which coincides randomly with the W0, a0, b0. The above limitation is enough to get the upper bound of f W g W (See Definition 1).

We will take W into consideration now. For simplicity, f(x; W), g(x; W) are written as f, g separately. Now, we can get

r=1 ar,0( wr,0 + wr, x + br,0)Φx,r

r=1 ar,0 wr, x Φ(0) x,r

We represent Ψr(x, η) as follows for any x X, η R+ and r [m].

Ψr(x, η) := 1[| wr,0, x + br,0 τ| η]. (8)

With Claim A.10, we can get

Pr[Ψr(x, η) = 1] < Pr[| wr,0, x + br,0| η] O(η m).

Fr := 1[Φ(0) x,r = Φx,r].

Now, we can get the bound of the size of Pm r=1 Fr in Claim C.2. With probability at least 1 1/ exp(Ω(m9/10)),

r=1 Fr O(K m9/10), x X.

Published as a conference paper at ICLR 2024

What needs to be done now is union bounding above inequality across each x X. Obviously, the issue that X is not countable still remains. Therefore, across a net of X with an extremely fine grain, we will get a union bound first and then discuss how f and g have changed as x X as input have changed slightly.

With X1 as a maximal 1

m-net of X, we can have that |X1| ( 1

m)O(d). By Lemma C.8 and a union bound across x X1, we can know that for m Ω(d3),

1 (1/ exp(Ω(m1/5))) exp(O(d log m)) = 1 θ1/5,

one can obtain the following guarantee:

x X1, |g(x; W) f(x; W)| O(K2 m 1/10) (9)

The perturbation analysis comes last. We leverage Lemma C.9 which works for fixed inputs to apply a union bound across x X1. By m Ω(d3), for all x X1 and µ Rd where µ 2 1 m and x + µ X, we can have that with probability at most (1/ exp(Ω(m1/2))) exp(O(d log m)) = 1/ exp(Ω(m1/2)), the Eq. (15) and Eq. (14) fail.

Now, with prob.

1 1/ exp(Ω(m1/5)) 1/ exp(Ω(m1/2)) = 1 θ1/5.

One can further upper bound g(x; W) f(x; W) by:

g(x; W) f(x; W) O(K2 m 1/10 + m 1/5 + K m 1/2 + K m 1)

= O(K2 m 1/10)

where the first step comes from combining Eq. (9), Eq. (15) and Eq. (14), and applying a union bound, and the second step is due to combining the terms.

The proof is complete.

C.2 UPPER BOUND OF SUM OF Fr

The sum of Fr is to be upper bound in this section.

Recall that Fr = 1[Φx,r = Φ(0) x,r] and our definitions of Φx,r and Φ(0) x,r are based on the shifted Re LU activation. Note that Pm r=1 Fr represents the number of neurons whose outputs have different signs between the initialized weights wr,0 and the updated weights wr,0 + wr. And we obtain a different upper bound of sum of Fr compared with Zhang et al. (2020). Claim C.2 (A variation of Claim A.3 in Zhang et al. (2020)). For each x X ,

r=1 Fr O(K m9/10)] 1 θ9/10.

Proof. With an x X fixed, W 2, K/m3/5 and x 2 = 1, we can get

Fr 1[ wr 2 | wr,0, x + br,0 τ|]

Ψr(x, K m 3/5).

where the first step comes from the definition of Φ(0) x,r and Φx,r(See Definition 1.), and the second step comes from Eq. (8) and W 2, K/m3/5.

With Gaussian anti-concentration from Lemma A.5, we can have that for all r [m]:

Pr[Ψr(x, K/m3/5) = 1] < Pr[ wr,0, x + br,0| K m 3/5]

Published as a conference paper at ICLR 2024

< O(K m 1/10),

By fixing x, there will be m Bernoulli random variables that are independent. With Bernstein inequality from Lemma A.9, with probability at most 1/ exp(Ω(m9/10)), we attain that

r=1 Ψr(x, K m 3/5) O(K m9/10).

We can also have

r=1 Ψr(x, K m 3/5),

which implies

r=1 Fr O(K m9/10).

The proof is complete now.

C.3 UPPER BOUND OF Ψr

In this section, we want to prove the upper bound of Ψr. Lemma C.3 (Upper bound of Ψr). For x X, r [m], η R+ and τ > 0, we define Ψr(x, η) as follows:

Ψr(x, η) := 1[| wr,0, x + br,0 τ| η].

with probability at least 1 θ9/10

r=1 Ψr(x, K/m3/5) O(K m9/10).

Proof. We decompose f into three functions A(x), B(x), C(x) in Definition C.4. Then the proof comes from Lemma C.5, Lemma C.6 and Lemma C.7.

C.4 DEFINITIONS OF HELP FUNCTIONS

We can split the f(x) into the following three sub-functions for future analysis purpose. Definition C.4. The following is how A(x), B(x), C(x) are defined:

r=1 ar,0 wr, x Φx,r

r=1 ar,0( wr,0, x + br,0)Φ(0) x,r

r=1 ar,0( wr,0, x + br,0)(Φx,r Φ(0) x,r)

We will acquire the upper bound of |A(x) g(x)|, |B(x)| and |C(x)| to upper bound |g(x) f(x)|, because f(x) is the summation of A(x), B(x) and C(x).

Recall that our definitions of Φ(0) x,r and Φx,r are based on shifted Re LU activation with a threshold set as τ, which is different from Zhang et al. (2020).

Φ(0) x,r = 1[ wr,0, x + br,0 τ]

Φx,r = 1[ wr,0 + wr, x + br,0 τ].

Published as a conference paper at ICLR 2024

C.5 UPPER BOUND OF |A(x) g(x)|

First, we want to obtain the upper bound of |A(x) g(x)|. We obtain a different upper bound based on the shifted Re LU activation compared with Zhang et al. (2020) because our definitions of Φx,r and Φ(0) x,r are different due to the shifted Re LU activation.

Claim C.5 (A variation of Claim A.5 in Zhang et al. (2020)). We have that

Pr[|g(x) A(x)| O(K2 m 1/10)] θ9/10

Proof. We can easily have that

|A(x) g(x)| = |

r=1 ar(Φx,r Φ(0) x,r) wr, x |

r=1 |ar| Fr | wr, x |

where the 1st equality is due to the Def. of A(x) and g(x), the 2nd inequality is because |Φx,r Φ(0) x,r| Fr and the definition of Fr, and the last step is because W 2, K m3/5 , ar { 1 m1/5 }.

By Claim C.2, we can have that:

Pr[|g(x) A(x)| O(K2 m 1/10)] 1/ exp(Ω(m9/10)) = θ9/10.

The proof is complete now.

C.6 UPPER BOUND OF |B(x)|

Then we want to upper bound B(x). We obtain a different upper bound O(1/m3/10) and different failure probability θ1/5 = 1/ exp(Ω(m1/5)) based on the shifted Re LU activation whose threshold is τ compared to Zhang et al. (2020).

Claim C.6 (A variation of Claim A.6 in Zhang et al. (2020)). With probability at most θ1/5, it gives

|B(x)| O(m 1/10)

Proof. Note that

r=1 ar,0στ( wr,0, x + br,0).

With στ(x) = 1[x > τ]x, we can have

r=1 σ2 τ( wr,0, x + br,0)

r=1 ( wr,0, x + br,0)2 (10)

We can obtain that with probability θ1,

r=1 σ2 τ( wr,0, x + br,0)

r=1 ( wr,0, x + br,0)2

Published as a conference paper at ICLR 2024

where the first step is because of Eq. (10), and the second step is because the random variable wr,0, x + br,0 is sampled from N(0, 2/m) independently where r [m].

Now, because ar,0στ( wr,0, x + br,0) for r [m] are independence and are Chi-Square random variables(See Definition 3.5.), by using Hoeffding s concentration inequality from Lemma A.6, for some large constant c > 0,

r=1 ar,0στ( wr,0, x + br,0)| c m1/10 | b0, W0]

exp( Ω( m 1/5

1 m2/5 Pm r=1 σ2τ( wr,0, x + br,0))).

By utilizing the bound above, with probability at most 1/ exp(Ω(m1/5)), we have

|B(x)| O(m 1/10)

The proof is complete.

C.7 UPPER BOUND OF |C(x)|

Finally, we want to obtain a new upper bound |C(x)| based on shifted Re LU in the following Claim C.7. Because our definition of C(x) is different from Zhang et al. (2020), we obtain a different upper bound O(K2/m3/10) and failure probability exp( Ω(m5/6)) compared with Zhang et al. (2020).

Claim C.7 (A variation of Claim A.7 in Zhang et al. (2020)). With στ and probability at most exp( Ω(m9/10)) , we have

|C(x)| O(K2 m 1/10).

r=1 ar,0(Φx,r Φ(0) x,r)( wr,0, x + br,0)

r=1 |ar,0||Φx,r Φ(0) x,r|| wr,0, x + br,0|

r=1 |Φx,r Φ(0) x,r|| wr,0, x + br,0|

where the first step is because of the definition of C(x), the second step is due to triangle inequality, and the third step is due to |ar,0| = 1 m1/5 .

We have that

|Φ(0) x,r Φx,r| Fr

Ψr(x, K m 3/5). (11)

And then recall that

Ψr(x, K m 3/5) = 0 | wr,0, x + br,0| K m 3/5, (12)

which implies that

| wr,0, x + br,0| |Φ(0) x,r Φx,r|

| wr,0, x + br,0| Ψr(x, K m 3/5)

Published as a conference paper at ICLR 2024

K m3/5 Ψr(x, K m 3/5)

where the first step is becasue of Eq. (11), the final step is due to Eq. (12).

Now we can have,

r [m] Ψr(x, K m 3/5).

As we ve already demonstrated, with Pr[] at most θ9/10,

r=1 Ψr(x, K m 3/5) O(K m9/10).

Therefore, with Pr[] at most θ9/10,

|C(x)| O(K2 m 1/10).

The proof is complete now.

C.8 UPPER BOUND OF |f(x) g(x)|

With Claim C.5, Claim C.6 and Claim C.7 in hand, We are prepared to demonstrate that, |f(x; W) g(x; W)| is tiny for every fixed x X with a high degree of probability. Recall that our definition of A(x), B(x) and C(x) are different from Zhang et al. (2020) and we obtain different upper bounds in Claim C.5, Claim C.6 and Claim C.7. Consequently, we obtain a different upper bound for |f(x) g(x)| compared with Zhang et al. (2020). Lemma C.8 (A variation of Lemma 4 in Zhang et al. (2020)). For each x X, with probability θ1/5,

|f(x; W) g(x; W)| O(K2 m 1/10)

Proof. With Lemma C.5, Lemma C.6 and Lemma C.7 aggregated together and a union bound, for x X, with succeed probability

1 θ9/10 θ1/5 = 1 θ1/5,

it provides

|f(x; W) g(x; W)| |A(x) g(x)| + |B(x)| + |C(x)|

O(K2/m1/10) (13)

where the first step is due to triangle inequality and f(x) = A(x) + B(x) + C(x), and the second step is the result of Claim C.5, Claim C.6 and Claim C.7.

This completes the proof.

C.9 PERTURBATION ANALYSIS

We do perturbation analysis for f(x) and g(x) with shifted Re LU activation in the following lemma when the perturbation µ satisfies that µ 2 1

m and x + µ X. We obtain a different perturbation upper bound for |f(x + µ) f(x)| with the shifted Re LU activation threshold set as τ = O(1/m) compared with Zhang et al. (2020). Lemma C.9 (A variation of Lemma A.8 in Zhang et al. (2020)). For every x X1, every µ Rd

where µ 2 1 m and x + µ X, with τ = O( 1

m) and probability at least 1 1/ exp(Ω(m1/2)), we have

|g(x + µ; W) g(x; W)| O(K m 1/2). (14)

|f(x + µ; W) f(x; W)| O(K m 1 + 1/m1/5) (15)

Published as a conference paper at ICLR 2024

Proof. We define µ as a small perturbation of x which depends on a(0) Rd, W0 Rm d, b0 Rd arbitrarily and has properties listed in the statement of Lemma.

|f(x + µ; W) f(x; W)| =

r=1 ar,0 στ( wr,0 + wr, x + µ + br,0) στ( wr,0 + wr, x + br,0)

r=1 |ar,0|(| wr,0 + wr, µ | + τ)

r=1 |ar,0| wr,0 + wr 2 +

r=1 |ar,0|τ

r=1 wr,0 + wr 2 + τm4/5

r=1 wr,0 2 + 1 m6/5

r=1 wr 2 + O( 1 m1/5 )

r=1 wr,0 2 + K

m + O( 1 m1/5 )

O( 1 m1/5 + K

where the first step is due to the definition of f(x; W), the second step follows is due to triangle inequality, the third step follows µ 2 1 m, the fourth step comes from |ar,0| = 1 m1/5 , the fifth step is due to triangle inequality, and the sixth inequality is due to P

r [m] wr 2 K m1/5, and the final step follows that for any r [m], with probability 1/ exp(Ω(m2/d)), we have wr,0 2 2 O(1).

By concentration (See Lemma A.6.) of the sum of a set of χ2 random variables which are independent and with union bound over r [m] and m d,

Pr[O(1) W0 2, ] θ1. (16)

And then we take g into account.

|g(x + µ; W) g(x; W)|

r=1 1[ wr,0, x + µ + br,0 τ]ar,0 wr, x + µ

r=1 1[ wr,0, x + br,0 τ]ar,0 wr, x

r=1 |ar,0| wr 2

1[ wr,0, x + µ + br,0 τ] 1[ wr,0, x + br,0 τ] |ar,0| | wr, x |

1[ wr,0, x + µ + br,0 τ] 1[ wr,0, x + br,0 τ] .

where the first step is due to the definition of g(x; W), the second step follows from triangle inequality, and the third step is because that Pm r=1 wr 2 K m1/5 and |ar,0| = 1 m1/5 .

About the last sum, from Eq. (16), we have W0 2, O(1). And the we can attain that

1[ wr,0, x + µ + br,0 τ] 1[ wr,0, x + br,0 τ]

r=1 Ψr(x, O(1/m))

By Claim A.10, with probability at most O( 1 m1/2 ), we can get Ψr(x, O(m 1)) = 1. With x fixed, Ψr(x, O(m 1)) where r [m] can be seen as m independent Bernoulli random variables. Therefore,

Published as a conference paper at ICLR 2024

by Chernoff bound (See Lemma A.8), with probability at least 1 1/ exp(Ω(m1/2)), we can have

r=1 Ψr(x, O(m 1)) O(m1/2).

The proof of Eq. (14) is finished now.

C.10 CONVERGENCE ANALYSIS

We first define the 2,1 and 2, norms in the following definition.

Definition C.10. Let W Rd m and wr Rd denote the r-th column of the matrix W. We define

W 2,1 := Pm r=1 wr 2

W 2, := maxr [m] wr 2

We define the gradient of real net gradient and pseudo-net gradient as follows: Definition C.11. The two notions of gradients are represented as convenient notations in the following:

Gradient of pseudo-net b (t)(g) := W L(S(t), g(Wt)) Rd m

Gradient of real net (t)(f) := W L(S(t), f(Wt)) Rd m

In the following theorem we analyze the convergence of our neural network f W with shifted Re LU and choose a different parameter:

m Ω(max{n300/31, (Kn/ϵ)24/15, (K2/ϵ)300/29}),

T = Θ(m 1/5ϵ)

compared with Zhang et al. (2020). Theorem C.12 (A variation of Theorem 4.1 in Zhang et al. (2020)). For all K > 0, and ϵ (0, 1), for every m which is larger than some poly(n, K, 1/ϵ). We set

η = Θ(ϵm 1/5) and T = Θ(ϵ 2K2),

Running Algorithm 1, for every W Rd m with W0 W 2, K m3/5 , there will be weights {Wt}T t=1 Rd m such that

LA (f W ) + ϵ 1

t=1 LA(f Wt)

The succeed probability is 1 θ1/5. The randomness is coming from initialized states (See Definition 3.1).

Proof. As for T and η, we will give them value in the proof later. We define various distances as follows

Dmax := max t [T ] Wt W0 2, (17)

DW := W W0 2, (18)

where DW = O( K m3/5 ).

By Lemma C.14, with high probability Dmax and DW will be coupled, if Wt remains near initiation (Dmax m 15/24).

b (t)(g) (t)(f) 2,1 O(nm27/50)

Published as a conference paper at ICLR 2024

Remark C.13. We assume Dmax m 15/24 first. Finally, we will set η, T and m to the appropriate values to ensure that it occurs.

We can bound the size of gradient for all r [m]:

(t)(f)r 2 ( 1

i=1 στ( wr,t, xi + br,0) exi 2) |ar|

1 m1/5 (19)

where the first step is because the loss is 1-Lipschitz, and the second step is due to |ar| = 1/m1/5 and στ( wr,t, xi + br,0) exi 2 O(1).

The L(S, (g(W)) is a convex function with respect to W Rd m because g is linear with respect to W Rd m.

We express the inner product of two two identically sized matrices B and C as

B, C := tr[B C]

And then we have:

L(S(t), g(Wt)) L(S(t), g(W ))

b (t)(g) (t)(f), Wt W + (t)(f), Wt W

b (t)(g) (t)(f) 2,1 Wt W 2, + (t)(f), Wt W (20)

where the first step is due to triangle inequality, and the second step is because the inner product and the definition of 2,1 and 2, .

For simplicity, we define α(t) R and β(t) R as follows:

α(t) := (t)(f), Wt W

β(t) := b (t)(g) (t)(f) 2,1 Wt W 2,

Therefore, we have:

L(S(t), g(Wt)) L(S(t), g(W )) α(t) + β(t)

α(t) and β(t) terms will be dealt with separately.

As for α(t), we can have:

Wt+1 W 2 F = Wt η (t)(f) W 2 F = Wt W 2 F + η2 (t)(f) 2 F 2ηα(t)

where the first step us due to Wt+1 = Wt η (t)(f), and the second step is because of rearranging the terms.

And then by reorganizing the terms, we can have

2 (t)(f) 2 F + 1

2η ( Wt W 2 F WT +1 W 2 F )

By summing over t, we have

t=1 (t)(f) 2 F + 1

2η ( W0 W 2 F WT +1 W 2 F )

2 T + m D2 W 2η (21)

Published as a conference paper at ICLR 2024

where the first inequality is due to the upper bound of α(t), and the second inequality is because W W0 2 F m W W0 2, = m D2 W and (t)(f) 2 F Pm r=1 (t) r 2 2 m1/5.

For the β(t) s,

β(t) Wt W 2, O(nm27/50)

(DW + Dmax) O(nm27/50) (22)

where the 1st inequality is because Lemma C.14, and the 2nd inequality is due to triangle inequality and the definition of Dmax and DW in Eq. (17) and Eq. (18).

Additionally, we may get the bound of Dmax s size using the bound of gradients.

Dmax = max t [T ] W0 Wt 2,

t=1 η max r [m] (t)(f)r 2

where the first equality is due to the value of Dmax, the second inequality replies on accumulating T iterations, and the final inequality is due to Eq. (19).

Combining it with the preexisting condition DW = O( K m4/5 ), we arrive to the following:

t=1 L(S(t), g(Wt))

t=1 L(S(t), g(W ))

O(1)(m1/5ηT + K2

m1/5η + ηT 2nm31/150 + ηKTn

where the first step is due to Eq. (20), and the second step is due to the upper bound in Eq. (21) and Eq. (22).

t [T ] L(S(t), g(Wt)) 1

t [T ] L(S(t), g(W )) O(ϵ).

The following values were chosen for the hyper-parameters η, m, and T:

T = Θ(m 1/5ϵ),

m Ω(max{n300/31, (Rn/ϵ)24/15, (K2/ϵ)300/29}),

T = Θ(ϵ 2K2)

where m is required to satisfy ηT 2nm31/150 + ηKT n m29/150 O(ϵ), Dmax m 15/24 and to fulfill the prerequisite for using the Theorem C.1, we have for all t [T] that

sup x X |g(x; Wt) f(x; Wt)| O(ϵ)

Published as a conference paper at ICLR 2024

Therefore, we have

t=1 L(S(t), f Wt) 1

t=1 L(S(t), f W ) c ϵ

where c R+ is a large constant. With L(S(t), f Wt) = LA(f Wt) and L(S(t), f W ) LA (f W ), if we replace ϵ with ϵ

c, the proof will hold for all ϵ > 0. We finish the proof now.

C.11 GRADIENT COUPLING

In this section, we show that b (t)(g) (t)(f) 2,1 ( 2,1 is from Definition C.10) can be upper bounded with high probability for all iterations t. Due to the new shifted Re LU activation with threshold τ, the new upper bound is O(nm27/50) which is different from Zhang et al. (2020). Lemma C.14 (A variation of Lemma A.10 in Zhang et al. (2020)). For every t [T] let Wt W0 2, O(m 15/24). Then

Pr[ b (t)(g) (t)(f) 2,1 O(nm27/50)] 1 θ1/5.

Proof. For every t [T], by Claim C.16 and Dmax = Wt W0 2, , we have,

r=1 1[ (t) r = b (t) r ] O(nm7/8)] 1 θ1/5. (23)

For r [m] where (t) r = b (t) r ,

b (t) r (t) r 2 |ar| 1

i=1 | exi 2 1[ wr,t, exi + br,0 τ] 1[ wr,0, xi + br,0 τ]|

i=1 | 1[ wr,t, exi + br,0 τ] 1[ wr,0, xi + br,0 τ]|

1 m1/5 (24)

where the first step is because that the loss is 1-Lipschitz, and the second step is because that ar { 1 m1/5 , 1 m1/5 } , and the third step is because that | 1[ wr,t, exi + br,0 τ] 1[ wr,0, xi + br,0 τ]| 1.

Thus, we conclude

b (t)(g) (t)(f) 2,1 =

r=1 b (t) r (t) r 2

1 m1/5 O(nm7/8)

= O(nm27/50)

where the first step is due to the definition of 2,1, the second step is due to Eq. (23) and Eq. (24), and the final step is the result of merging the terms.

C.12 UPPER BOUND OF NUMBER OF ACTIVATED NEURONS AT INITIALIZATION

We prove the upper bound of number of activated neurons with shifted Re LU activation whose threshold is τ at initialization. We will use the below lemma to compute the time cost per training iteration.

Published as a conference paper at ICLR 2024

Lemma C.15 (Restatement of Lemma 4.5, Number of activated neurons during initialization). Let c Q (0, 1) denote a fixed constant. Let Q0 = 2m exp( ( K2

4m1/5 )) = O(mc Q). For every η R+, x X and r [m], Ψr(x, η) is defined as follows:

Ψr(x, η) := 1[| wr,0, x + br,0| η].

with succeed probability 1 n/ exp(Ω(m exp( ( K2

r [m] Ψr(x, K/m3/5) Q0.

Proof. Fix x, wr,0, x + br,0 N(0, 2

m), due to the reason that b0 and W0 are random variables from N(0, 1

m) independently. By Gaussian tail bounds and replacing wr,0, x + br,0 with z, the probability that each initial neuron is activated is

Pr[ wr,0, x + br,0 K m 3/5] = Pr z N(0, 2

m )[z K m 3/5]

By using 1r Q0,i, we have

E[1r Q0,i] exp( ( K2

By Bernstein inequality from Lemma A.9, with η > 0

Pr[Qi,0 > η + t] exp( t2/2

η + t/3), t > 0

where η := m exp( ( K2

By choosing t = η, we can have:

Pr[Qi,0 > 2η] exp( 3η/8)

By applying union bound across i [n], we can attain that for each xi X with at least probability

1 n 1/ exp(Ω(m exp( ( K2

the upper bound of Qi,0 (See definition 3.9.) is 2m exp( ( K2

C.13 UPPER BOUND OF THE SIZE OF ACTIVATED NEURON SET

We have obtained the upper bound of the size of activated neuron set at initialization. In this section, we want to upper bound the size of neuron set which has different signs compared to the neuron weights at initialization under shifted Re LU activation per training iteration. When computing the sign, we need to take the activation threshold τ into account compared with Zhang et al. (2020). Claim C.16 (A variation of Claim A.11 in Zhang et al. (2020)). For any wr 2 m 15/24 and any subset {xi}n 1 X. Then we have

r=1 1[ i [n], sgn( xi, wr,0 + wr + br,0 τ) = sgn( xi, wr,0 + br,0 τ)] < O(nm7/8)

i [m], 1[ i [n], sgn( xi, wr,0 + wr + br,0 τ) = sgn( xi, wr,0 + br,0 τ)] < O(nm 1/8)

The succeed probability is 1 θ1/5.

The randomness is from initialized states (See Definition 3.1).

Published as a conference paper at ICLR 2024

Proof. We first give a proof of Claim C.16 with n points in the set fixed. At last we get the final result by using a union bound across all the sets.

With {x1, . . . , xn} X where xi s are fixed, we have the following definition:

Cr := 1[ i [n], sgn( wr,t, xi + br,0 τ) = sgn( wr,0, xi + br,0 τ)]

Now we will focus on bounding the size of Pm r=1 Cr. By Claim A.10, for xi X we have that

Pr[| wr,0, xi + br,0 τ| m 15/24] < Pr[| wr,0, xi + br,0| 1 m15/24 ]

O( 1 m1/8 )

By a union bound across i [n], we have

Pr[ i [n] s.t. | wr,0, xi + br,0 τ| m 15/24] O( n m1/8 )

Pr[Cr = 1] Pr[ i [n] s.t. | wr,0, xi + br,0 τ| 1 m15/24 ]

< O( n m1/8 )

With xi X s fixed, Cr s will be m Bernoulli random variables which are independent where r [m]. Therefore, by Chernoff bound from Lemma A.8,

r [m] Cr O(m7/8n)] 1/ exp(Ω(m7/8n)).

We will only amplify by just exp(O(nd log m)) on the failure probability with a big enough m when over product space n X, we assume a union bound over a 1

Remark C.17. In each training iteration we only use Q0 + O(nm7/8) activated neurons to do the computation to save training cost. We will use the second upper bound to compute the time eventually. Q0 is the activated neuron set size during initialization.

C.14 PROOF OF PSEUDO-NETWORK APPROXIMATION

We will first demonstrate how individual components of f may be approximated by pseudo-networks and combine them to create a sizable pseudo-network by which f is approximated.

Lemma C.18 (A variation of Lemma A.16 in Zhang et al. (2020)). Let q : R R univariate polynomial, i [n], c1 be a large constant, ϵ3 (0, 1 C(q)) and k c1 d

ϵ2 3 C2(q, ϵ3). For every r [k], with independent random variables wr,0, br,0 and ar,0, where wr,0 is a d-dimentional standard Gaussian random variable, br,0 is a standard normal variable , and ar,0 { 1 m1/5 , + 1 m1/5 } uniformly at random. With τ = O(k 1/2) and probability at least 1 1/ exp(Ω(k1/2)), there will be W (i) Rd k such that

r=1 αr,0 wr(i), x 1[ wr,0, x + br,0 τ] yiq( xi, x )| 3ϵ3

W (i) 2, O(C(q, ϵ3)

Published as a conference paper at ICLR 2024

Proof. Notice that wr, x + br N(0, 2), and then by claim A.10, we can have Pr[1[ w, x + b < τ]] = O(τ). And then there will a function defined as h : R2 [ C(q, ϵ3), C(q, ϵ3)] which satisfies that: E b N(0,1),w N(0,Id)[1[ w, x + β < τ] h( xi, w , b)] O(τ)C(q, ϵ3) (25)

We use Lemma B.3 by setting ϕ(z) = yiq(z) and ϵ1 = ϵ3. Note that |yi| O(1), ϕ will have an equal complexity to q, with some constant. Therefore, we have a function h : R2 [ C(q, ϵ3), C(q, ϵ3)] s. t. for every x in set X

E b N(0,1),w N(0,Id)[h( xi, w , b) 1[ w, x + b τ] ] yiq( xi, x )

E b N(0,1),w N(0,Id)[h( xi, w , b) 1[ w, x + b 0]] yiq( xi, x )

+ E b N(0,1),w N(0,Id)[h( xi, w , b) 1[ w, x + b < τ] ]

ϵ3 + O(τ)C(q, ϵ3)

ϵ3 + O(C(q, ϵ3)

k1/2 ) (26)

where the first step is due to triangle inequality, and the second step is because of Lemma B.3 and Eq.(25), the last step comes from τ = O(k 1/2).

With an x X fixed and by Hoeffding s inequality (Lemma A.6), with probability at least 1 1/ exp(Ω( ϵ2 3k C2(q,ϵ3))), we have

r=1 1[ wr,0, x + br,0 τ]h( wr,0, xi , br,0)

E b N(0,1),w N(0,Id)[1[ w, x + b τ]] h( w, xi , b)]

wr(i) = 1 αr,0

1 k 2h( wr,0, xi , br,0) be.

where be is a vector whose only the last element is 1 and otherwise is 0, we can obtain the following upper bound:

W (i) 2, O(C(q, ϵ3)

For every x X, because xd = 1/2, with probability at most exp( Ω( kϵ2 3 C2(q,ϵ3))), we have

r=1 1[ wr,0, x + br,0 τ]αr,0 wr(i), x

E b N(0,1),w N(0,Id)[h( xi, w , b) 1[ w, x + b τ]]

With X1 as a maximal 1 kc -net of X and c R+ as a large enough constant , we all understand that |X1| ( 1

k)O(d). With a union bound over X1 for Eq. (28) and c1 as a significant constant, we have that for k c1 d

ϵ2 3 C2(q, ϵ3)),

r=1 1[ wr,0, x + br,0 τ]ar,0 wr,i, x

Published as a conference paper at ICLR 2024

E b N(0,1),w N(0,Id)[h( w, xi , b) 1[ w, x + b τ]]

(1/ exp(Ω( kϵ2 3 C2(q, ϵ3)))) exp(O(d log k))

= 1/ exp(Ω( kϵ2 3 C2(q, ϵ3))) (29)

where the first step is due to Eq.(27), the second step is because of adding terms, and the last step demonstrate that for each x X1, with a high probability, if x is adjusted by no more than 1 kc in ℓ2, the LHS of Eq. (28) only slightly changes.

After showing that a given x X is stable, a union bound will be performed.

Indeed, with a large enough c and combining Eq. (26), (29) and Claim C.19, with probability at least 1 1/ exp(Ω(k1/2)) 1/ exp(Ω( kϵ2 3 C2(q,ϵ3))), we attain that

r=1 ar,0 wr,i, x 1[ wr,0, x + br,0 τ] yiq( xi, x )| O(k 1/2 C(q, ϵ3)) + 2ϵ3

since k c1 d

ϵ2 3 C2(q, ϵ3) for a large constant c1, the proof is complete.

C.15 STABILITY OF S1 AND S2

In this section, we ll talk about the stability of S1 and S2 based on the shifted Re LU activation with a threshold set as τ. When we bound S1 and S2, we need to bound the sum of k independent Bernoulli random variables 1[| wr,0, x + br,0 τ| 1

k] which is different from Zhang et al. (2020).

Claim C.19 (A variation of Claim A.17 in Zhang et al. (2020)). For τ > 0 and every x X1, let c be large enough and µ Rd such that µ 2 1 kc and x + µ X . With independent random variables wr,0, br,0 and ar,0, where wr,0 is a d-dimentional standard Gaussian random variable, br,0 is a standard normal variable, and probability 1 1/ exp(Ω(k1/2)), we have

r=1 αr,0 wr(i), x + µ 1[ wr,0, x + µ + br,0 τ]

r=1 αr,0 wr(i), x 1[ wr,0, x + br,0 τ]

O(k 1/2 C(q, ϵ3))

S2 := E b N(0,1),w N(0,Id)[ h( w, xi , b) 1[ w, x + µ + b τ]]]

E b N(0,1),w N(0,Id)[ h( w, xi , b) 1[ w, x + b τ]]

O(k 1/2 C(q, ϵ3))

Proof. The upper bound of D1 will be given first. Note that W (i) rj = 0 for j d 1 based on how we generated W(i). And wr(i), µ = 0 when vd = 0.

With conditions as follows:

|αr,0| = 1 k1/5 and W (i) 2, O(C(q, ϵ3)m1/5

we will have

S1 O(C(q, ϵ3)

1[ wr,0, x + µ + br,0 τ] 1[ wr,0, x + br,0 τ]

Published as a conference paper at ICLR 2024

r=1 1[sgn( wr,0, x + µ + br,0 τ) = sgn( wr,0, x + br,0 τ)]

r=1 (1[| wr,0, x + br,0 τ| k 1/2] + 1[ wr,0 2 > c2 k1/2]).

where the first step is due to |h( )| C(q, ϵ3), the second step is the result of the value of 1[ wr,0, x+

µ +br,0 τ] 1[ wr,0, x +br,0 τ] is determined by the number of differences between the sign

of wr,0, x+µ +br,0 and wr,0, x +br,0, and the third step follows that only when wr,0, x+µ +br,0 and wr,0, x + br,0 have different sign, 1[ wr,0, x + µ + br,0) = sgn( wr,0, x + br,0)] is 1 which is decided by wr,0 and µ, where µ 2 1 kc , w N(0, Id) and 1[ wr,0 2 > c2k1/2] 0.

We can choose a large enough constant c2 if an sufficiently large constant c is chosen.

And then we demonstrate that with probability at least 1 1/ exp(Ω(k)), for every r [k], wr,0 2 O(k1/2).

By the concentration property of sum of a set of independent Chi-Square random variables from Lemma B.2 and Lemma A.7, for all r [k], with probability 1/ exp(Ω(k2/d)), we have wr,0 2 2 O(k). By applying union bound over all r [k] we can complete the proof with k d.

Therefore, by selecting c2 correctly, we can obtain the following bound:

S1 O(k 1 C(q, ϵ3))

r=1 1[| wr,0, x + br,0 τ| k 1/2] (30)

The succeed probability is 1 1/ exp(Ω(k)).

Now, 1[| wr,0, x + br,0 τ| k 1/2] can be seen as k independent Bernoulli random variables.

Pr[| wr,0, x + br,0 τ| k 1/2] Pr[| wr,0, x + br,0| k 1/2]

where the first step is due to Theorem A.3, the second step is due to Claim A.10.

The corresponding probability of 1[| wr,0, x + br,0 τ| k 1/2] = 1 is bounded by O(k 1/2). Therefore, by Chernoff bounds (Lemma A.8) with probability at least 1 1/ exp(Ω(k1/2)), we have

r=1 1[| wr,0, x + br,0 τ| k 1/2] O(k1/2). (31)

With a union bound over two bounds in Eq. (30) and Eq. (31).

One can obtain the following bound for S1:

S1 O(k 1/2 C(q, ϵ3)).

The succeed probability of above equation is

1 1/ exp(Ω(k1/2)) 1/ exp(Ω(k)) = 1 1/ exp(Ω(k1/2)).

And now we will focus on getting the bound of S2. We have

S2 C(q, ϵ3) E b N(0,1),w N(0,Id)[| 1[ w, x + µ + b τ] 1[ w, x + b τ]|]

Published as a conference paper at ICLR 2024

C(q, ϵ3) E b N(0,1),w N(0,Id)[1[| w, x + b| k 1/2] + 1[k1/2 c2 < w 2]]

where the first step is because |h( )| C(q, ϵ3), and the second step is because | 1[ w, x + µ + b τ] 1[ w, x +b τ]| is decided by the sign of 1[ w, x+µ +b τ] and 1[ w, x +b τ]. The above difference is decided by the value of µ 2 1 kc and w N(0, Id), and 1[k1/2 c2 < w 2] 0. The constant c2 is the same as previously.

Pr b N(0,1),w N(0,Id)[k 1/2 | w, x + b τ|] Pr b N(0,1),w N(0,Id)[k 1/2 | w, x + b|]

where the first step is due to Theorem A.3, the second step is due to Lemma A.10.

Then we have

Pr b N(0,1),w N(0,Id)[k1/2 c2 < w 2] 1/ exp(Ω(k)).

S2 O(k 1/2 C(q, ϵ3)).

This completes the proof.

C.16 PROOF OF TIME COMPLEXITY PER TRAINING ITERATION

We give the proof of Lemma 5.3 in this section.

Proof. The complexity for Algorithm 1 at each iteration can be decomposed as follows:

Querying the active neuron set for n adversarial training data points exi takes e O(m1 Θ(1/d)nd) time: Pn i=1 Tquery(2m, d, ki,t) = e O(m1 Θ(1/d)nd) according to Corollary 5.2.

Forward computation takes O(d (Q0 + nm7/8)) time: Pn i=1 O(d ki,t) = O(d (Q0 + nm7/8)) according to Lemma 4.5 and Claim 4.6.

Backward computation involving computing gradient W and updating Wt+1 takes O(d (Q0 + nm7/8)) time: O(d nnz(P)) = O(d (Q0 + nm7/8)) according to Lemma 4.5 and Claim 4.6.

Updating the weight vectors takes O((Q0 + nm7/8) log2(2m)) time.

Tupdate (|Qt,i| + |Qt+1,i|)

= O(log2(2m)) (

i=1 ki,t + ki,t+1)

= O(log2(2m)) (Q0 + O(nm7/8))

= O((Q0 + nm7/8) log2(2m))

where the first step is according to Corollary 5.2, the second step is the result of Lemma 4.5 and Claim 4.6, and the third step is the result of aggregating the terms.

Summing over all the quantities in four bullets gives us the per iteration cost e O(m1 Θ(1/d)nd).