# Most Neural Networks Are Almost Learnable

Amit Daniely, Hebrew University and Google, amit.daniely@mail.huji.ac.il
Nathan Srebro, TTI-Chicago, nati@ttic.edu
Gal Vardi, TTI-Chicago and Hebrew University, galvardi@ttic.edu

**Abstract.** We present a PTAS for learning random constant-depth networks. We show that for any fixed $\epsilon > 0$ and depth $i$, there is a poly-time algorithm that, for any distribution on $\sqrt{d}\cdot\mathbb{S}^{d-1}$, learns random Xavier networks of depth $i$ up to an additive error of $\epsilon$. The algorithm runs in time and sample complexity of $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar{d}$ is the size of the network. For some cases of sigmoid and ReLU-like activations the bound can be improved to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$, resulting in a quasi-poly-time algorithm for learning constant-depth random networks.

## 1 Introduction

One of the greatest mysteries surrounding deep learning is the discrepancy between its phenomenal capabilities in practice and the fact that, despite a great deal of research, polynomial-time algorithms for learning deep models are known only for very restrictive cases. Indeed, state-of-the-art results are only capable of dealing with two-layer networks under assumptions on the input distribution and the network's weights. Furthermore, theoretical study shows that even for very naive architectures, learning neural networks is worst-case computationally intractable.

In this paper, we contrast the aforementioned theoretical state of affairs, and show that, perhaps surprisingly, even though constant-depth networks are completely out of reach from a worst-case perspective, most of them are not as hard as one would imagine. That is, they are distribution-free learnable in polynomial time up to any desired constant accuracy. This is the first polynomial-time approximation scheme (PTAS) for learning neural networks of depth greater than 2 (see the related work section for more details). Moreover, we show that the standard SGD algorithm on a ReLU network can be used as a PTAS for learning random networks. The question of whether learning random networks can be done efficiently was posed by Daniely et al. [15], and our work provides a positive result in that respect.

In a bit more detail, we consider constant-depth random networks obtained using the standard Xavier initialization scheme [22, 26], and any input distribution supported on the sphere $\sqrt{d}\cdot\mathbb{S}^{d-1}$. For Lipschitz activation functions, our algorithm runs in time $(\bar{d})^{\mathrm{poly}(\epsilon^{-1})}$, where $\bar{d}$ is the network's size including the $d$ input components, and $\epsilon$ is the desired accuracy. While this complexity is polynomial for constant $\epsilon$, we also consider the special cases of sigmoid and ReLU-like activations, where the bound can be improved to $(\bar{d})^{\mathrm{polylog}(\epsilon^{-1})}$.

The main technical idea in our work is that constant-depth random neural networks with Lipschitz activations can be approximated sufficiently well by low-degree polynomials. This result follows by analyzing the network obtained by replacing each activation function with its polynomial approximation using Hermite polynomials. It implies that efficient algorithms for learning polynomials can be used for learning random neural networks, and specifically that we can use the SGD algorithm on ReLU networks for this task.

### 1.1 Results

In this work, we show that random fully-connected feedforward neural networks can be well-approximated by low-degree polynomials, which implies a PTAS for learning random networks.
We start by defining the network architecture. We denote by $\sigma : \mathbb{R} \to \mathbb{R}$ the activation function and assume that it is $L$-Lipschitz. To simplify the presentation, we also assume that it is normalized in the sense that $\mathbb{E}_{X \sim N(0,1)}\, \sigma^2(X) = 1$. Define
$$\epsilon_\sigma(n) = \min_{\deg(p) = n}\ \mathbb{E}_{X \sim N(0,1)} \big( \sigma(X) - p(X) \big)^2,$$
namely, the error when approximating $\sigma$ with a degree-$n$ polynomial, and note that $\lim_{n \to \infty} \epsilon_\sigma(n) = 0$.

We consider fully connected networks of depth $i$ and use $d_0 = d$ to denote the input dimension and $d_1, \ldots, d_i$ to denote the number of neurons in each layer. Denote also $\bar{d} = \sum_{j=0}^{i} d_j$. Given weight matrices $W = (W^1, \ldots, W^i) \in \mathbb{R}^{d_1 \times d_0} \times \cdots \times \mathbb{R}^{d_i \times d_{i-1}}$ and $x \in \mathbb{R}^{d_0}$, we define $\Psi^0_W(x) = x$. Then, for $1 \le j \le i$, we define recursively
$$\Phi^j_W(x) = W^j \Psi^{j-1}_W(x), \qquad \Psi^j_W(x) = \sigma\big( \Phi^j_W(x) \big).$$
We consider random networks in which the weight matrices are random Xavier matrices [22, 26]. That is, each entry in $W^j$ is a centered Gaussian of variance $\frac{1}{d_{j-1}}$. This choice is motivated by the fact that it is standard practice to initialize the network's weights with Xavier matrices, and furthermore, it ensures that the scale across the network is the same. That is, for any example $x$ and any neuron $n$, the second moment of the output of $n$ (w.r.t. the choice of $W$) is $1$.

Our main result shows that $\Phi^i_W$ can be approximated, up to any constant accuracy $\epsilon$, via constant-degree polynomials (the degree depends only on $\epsilon$, the depth $i$, and the activation $\sigma$). We consider the input space $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$. Here, and throughout the paper, $\|x\|$ stands for the normalized Euclidean norm $\|x\| = \sqrt{\frac{1}{d} \sum_{i=1}^d x_i^2}$.

**Theorem 1.1.** For every $i$ and $n$ such that $\epsilon_\sigma(n) \le \frac{1}{2}$ there is a constant $D = D(n, i, \sigma)$ such that if $d_1, \ldots, d_{i-1} \ge D$ the following holds. For any weights $W$, there is a degree-$n^{i-1}$ polynomial $p_W$ such that for any distribution $\mathcal{D}$ on $\mathbb{S}^{d-1}$,
$$\mathbb{E}_W\, \mathbb{E}_{x \sim \mathcal{D}} \left\| \Phi^i_W(x) - p_W(x) \right\| \;\le\; 14\,(L+1)^2\, \big( \epsilon_\sigma(n) \big)^{\frac{1}{2^{i-1}}} \;\le\; \frac{14\,(L+1)^3}{(n+1)^{\frac{1}{2^{i-1}}}}.$$
Furthermore, the coefficients of $p_W$ are bounded by $(2\bar{d})^{4 n^{i-1}}$.

Since constant-degree polynomials are learnable in polynomial time, Theorem 1.1 implies a PTAS for learning random networks of constant depth. In fact, as shown in [9], constant-degree polynomials with polynomially bounded coefficients are efficiently learnable via SGD on ReLU networks starting from standard Xavier initialization. Thus, this PTAS can be standard SGD on neural networks. To be more specific, for any constant $\epsilon > 0$ there is an algorithm with $(\bar{d})^{\mathrm{poly}(1/\epsilon)}$ time and sample complexity that is guaranteed to return a hypothesis whose loss is at most $\epsilon$ in expectation. For some specific activations, such as the sigmoid $\sigma(x) = \operatorname{erf}(x) := \sqrt{\frac{2}{\pi}} \int_0^x e^{-\frac{t^2}{2}}\, dt$, or the ReLU-like activation $\sigma(x) = \int_0^x \big( \operatorname{erf}(t) + 1 \big)\, dt$, we have that $\epsilon_\sigma(n)$ approaches $0$ exponentially fast (see Lemma A.4 in the appendix). In this case, we get quasi-polynomial time and sample complexity of $(\bar{d})^{\mathrm{polylog}(1/\epsilon)}$, where the polylog exponent scales with $\log \frac{14(L+1)^3}{\epsilon}$.

**Corollary 1.2.** For every constants $\epsilon$, $i$ and $\sigma$ there is a constant $D$, a polynomial $p$ and a polynomial-time algorithm $\mathcal{A}$ such that if $d_1, \ldots, d_{i-1} \ge D$ the following holds. For any distribution $\mathcal{D}$ on $\mathbb{S}^{d-1}$, if $h$ is the output of $\mathcal{A}$ upon seeing $p(d_0, \ldots, d_i)$ examples from $\mathcal{D}$, then
$$\mathbb{E}_h\, \mathbb{E}_W\, \mathbb{E}_{x \sim \mathcal{D}} \left\| \Phi^i_W(x) - h(x) \right\| \le \epsilon$$
(the leftmost expectation $\mathbb{E}_h$ is over the examples provided to $\mathcal{A}$, as well as the internal randomness of $\mathcal{A}$). Furthermore, $\mathcal{A}$ can be taken to be SGD on a ReLU network starting from a Xavier initialization.
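To make the setup above concrete, here is a minimal Python sketch (assuming NumPy and scikit-learn) that instantiates a random Xavier network of constant depth on the normalized sphere and fits a low-degree polynomial to samples of it by ridge regression over monomial features. It only illustrates the kind of guarantee stated in Theorem 1.1; the widths, depth, degree, regularization and sample sizes below are arbitrary illustrative choices, and this is not the learner of Corollary 1.2 (which can also be taken to be SGD on a ReLU network).

```python
# A minimal numerical sketch (not the paper's algorithm): approximate a random
# Xavier network on the normalized sphere by low-degree polynomial regression.
# All parameter choices below (widths, depth, degree, sample sizes) are illustrative.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def sample_sphere(m, d):
    """Sample m points from S^{d-1} w.r.t. the normalized norm
    ||x|| = sqrt(mean(x_i^2)), i.e. points of Euclidean norm sqrt(d)."""
    x = rng.standard_normal((m, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

def xavier_net(dims, sigma):
    """Return Phi^i_W for random Xavier weights W^j with entries N(0, 1/d_{j-1})."""
    Ws = [rng.standard_normal((dims[j + 1], dims[j])) / np.sqrt(dims[j])
          for j in range(len(dims) - 1)]
    def phi(X):
        H = X.T                       # columns are inputs; Psi^0_W(x) = x
        for j, W in enumerate(Ws):
            H = W @ H                 # Phi^j_W(x)
            if j < len(Ws) - 1:       # no activation after the last layer
                H = sigma(H)          # Psi^j_W(x)
        return H.T
    return phi

# depth i = 3 network with a 1-Lipschitz activation (tanh; not exactly normalized,
# which only affects constants)
dims = [8, 64, 64, 1]                 # d_0, d_1, d_2, d_3
target = xavier_net(dims, np.tanh)

# fit a low-degree polynomial (degree n^{i-1} in the paper; here a small fixed degree)
X_train = sample_sphere(20000, dims[0])
y_train = target(X_train).ravel()
feats = PolynomialFeatures(degree=3, include_bias=True)
model = Ridge(alpha=1e-3).fit(feats.fit_transform(X_train), y_train)

X_test = sample_sphere(2000, dims[0])
err = np.mean(np.abs(model.predict(feats.transform(X_test)) - target(X_test).ravel()))
print(f"mean absolute approximation error: {err:.3f}")
```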
### 1.2 Related Work

**Learning neural networks efficiently.** Efficiently learning classes of neural networks has attracted much interest in recent years. Several works established polynomial-time algorithms for learning one-hidden-layer neural networks with certain input distributions (such as the Gaussian distribution) under the assumption that the weight matrix of the hidden layer is non-degenerate [27, 34, 19, 20, 5, 32, 4]. For example, Awasthi et al. [4] showed such a result for non-degenerate one-hidden-layer ReLU networks with bias terms under Gaussian inputs, and also concluded that one-hidden-layer networks can be learned efficiently under the smoothed-analysis framework. Efficient algorithms for learning one-hidden-layer ReLU networks with Gaussian inputs were also shown in Diakonikolas et al. [18] and Diakonikolas and Kane [17]. These results do not require non-degenerate weight matrices, but they require that the output-layer weights are all positive, as well as a sub-linear upper bound on the number of hidden neurons. Chen et al. [8] recently showed an efficient algorithm for learning one-hidden-layer ReLU networks with Gaussian inputs, under the assumption that the number of hidden neurons is a constant. Note that all of the aforementioned works consider only one-hidden-layer networks. Chen et al. [7] gave an algorithm for learning deeper ReLU networks, whose complexity is polynomial in the input dimension but exponential in the other parameters (such as the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network). Finally, several works established algorithms for learning neural networks whose complexity is exponential unless we impose strong assumptions on the norms of both the inputs and the weights [23, 30, 33, 24].

**Hardness of learning neural networks.** As we discussed in the previous paragraph, efficient algorithms for learning ReLU networks are known only for depth-2 networks, and only under certain assumptions on both the network weights and the input distribution. The limited progress in learning ReLU networks can be partially understood by an abundance of hardness results. Learning neural networks without any assumptions on the input distribution or the weights is known to be hard (under cryptographic and average-case hardness assumptions) already for depth-2 ReLU networks [28, 3, 11]. For depth-3 networks, hardness results were obtained even when the input distribution is Gaussian [13, 6]. All of the aforementioned hardness results are for improper learning, namely, they do not impose any restrictions on the learning algorithm or on the hypothesis that it returns. For statistical query (SQ) algorithms, unconditional superpolynomial lower bounds were obtained for learning depth-3 networks with Gaussian inputs [6], and superpolynomial lower bounds for correlational SQ (CSQ) algorithms were obtained already for learning depth-2 networks with Gaussian inputs [25, 18].

The above negative results suggest that assumptions on the input distribution alone may not suffice for obtaining efficient learning algorithms. Since for one-hidden-layer networks efficient algorithms exist when imposing assumptions on both the input distribution and the weights, a natural question is whether this approach might also work for deeper networks. Recently, Daniely et al. [15] gave a hardness result for improperly learning depth-3 ReLU networks under the Gaussian distribution even when the weight matrices are non-degenerate. This result suggests that learning networks of depth larger than 2 might require new approaches and new assumptions.
Moreover, [15] showed hardness of learning depth-3 networks under the Gaussian distribution even when a small random perturbation is added to the network's parameters, namely, they proved hardness in the smoothed-analysis framework. While adding a small random perturbation to the parameters does not seem to make the problem computationally easier, they posed the question of whether learning random networks, which roughly corresponds to adding a large random perturbation, can be done efficiently. The current work gives a positive result in that respect.

Daniely and Vardi [12] studied whether there exist some natural properties of the network's weights that may suffice to allow efficient distribution-free learning, where a natural property is any property that holds w.h.p. in random networks. More precisely, they considered a setting where the target network is random, an adversary chooses some input distribution (that may depend on the target network), and the learning algorithm needs to learn the random target network under this input distribution. They gave a hardness result for improper learning (within constant accuracy) in this setting. Thus, they showed that learning random networks is hard when the input distribution may depend on the random network. Note that in the current work, we give a positive result in a setting where we first fix an input distribution and then draw a random network. Finally, learning deep random networks was studied in Das et al. [16] and Agarwal et al. [1], where the authors showed hardness of learning networks of depth $\omega(\log(d))$ in the SQ model.

## 2 Proof of Theorem 1.1

### 2.1 Notation

We recall that for vectors $x \in \mathbb{R}^d$ we use the normalized Euclidean norm $\|x\| = \sqrt{\frac{\sum_{i=1}^d x_i^2}{d}}$ and take the unit sphere $\mathbb{S}^{d-1} = \{x \in \mathbb{R}^d : \|x\| = 1\}$ w.r.t. this norm as our instance space. Inner products are also normalized: for $x, y \in \mathbb{R}^d$ we denote $\langle x, y \rangle = \frac{\sum_{i=1}^d x_i y_i}{d}$. For $x \in \mathbb{R}^d$ and a closed set $A \subseteq \mathbb{R}^d$ we denote $d(x, A) := \min_{x' \in A} \|x - x'\|$. Unless otherwise specified, a random scalar is assumed to be a standard normal, a random vector in $\mathbb{R}^d$ is assumed to be a centered Gaussian vector with covariance matrix $\frac{1}{d} I$, and a random matrix is assumed to be a Xavier matrix. For $f : \mathbb{R} \to \mathbb{R}$, we denote $\|f\|^2 = \mathbb{E}_X f^2(X)$. We denote the Kronecker delta by $\delta_{ij}$, i.e. $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.

### 2.2 Some Preliminaries

We will use the Hermite polynomials [29], which are defined via the following recursion formula:
$$h_{n+1}(x) = \frac{x}{\sqrt{n+1}}\, h_n(x) - \sqrt{\frac{n}{n+1}}\, h_{n-1}(x), \qquad h_0(x) = 1, \quad h_1(x) = x. \tag{1}$$
The Hermite polynomials are the sequence of normalized orthogonal polynomials w.r.t. the standard Gaussian measure. That is, it holds that
$$\mathbb{E}_X\, h_i(X) h_j(X) = \delta_{ij}.$$
More generally, if $(X, Y)$ is a Gaussian vector with covariance matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, then
$$\mathbb{E}_{X,Y}\, h_i(X) h_j(Y) = \delta_{ij}\, \rho^i. \tag{2}$$
We will use the fact that
$$h_n' = \sqrt{n}\, h_{n-1} \tag{3}$$
and that for even $n$,
$$\mathbb{E}_X\, X^n = (n-1)!!. \tag{4}$$
Let $\sigma = \sum_{i=0}^{\infty} a_i h_i$ be the representation of the activation function $\sigma$ in the basis of the Hermite polynomials. We will also use the dual activation $\hat{\sigma}(\rho) = \sum_{i=0}^{\infty} a_i^2 \rho^i$, as defined in [14]. We note that $\hat{\sigma}$ is defined on $[-1, 1]$ and satisfies $\hat{\sigma}(1) = \|\sigma\|^2 = 1$.
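As a concrete companion to these preliminaries, the following sketch (assuming NumPy and SciPy) evaluates the normalized Hermite polynomials via recursion (1), estimates the Hermite coefficients $a_i = \mathbb{E}_X[\sigma(X) h_i(X)]$ of an activation by Gauss-Hermite quadrature, and reports the truncation error $\epsilon_\sigma(k) = \|\sigma\|^2 - \sum_{i \le k} a_i^2$ together with the (truncated) dual activation $\hat{\sigma}$. The erf-style activation and the degrees used are illustrative choices and the activation is not exactly normalized, which only affects constants.

```python
# Sketch: Hermite expansion of an activation, the truncation error eps_sigma(n),
# and the dual activation sigma_hat (Section 2.2). The activation and degrees
# below are illustrative choices (and sigma is not exactly normalized).
import numpy as np
from numpy.polynomial.hermite_e import hermegauss   # probabilists' Gauss-Hermite nodes
from scipy.special import erf

def hermite_basis(n, x):
    """Normalized Hermite polynomials h_0,...,h_n at the points x, via the
    recursion h_{k+1} = (x*h_k - sqrt(k)*h_{k-1}) / sqrt(k+1) of Eq. (1)."""
    H = [np.ones_like(x), x]
    for k in range(1, n):
        H.append((x * H[k] - np.sqrt(k) * H[k - 1]) / np.sqrt(k + 1))
    return np.stack(H[: n + 1])

def gaussian_quadrature(deg=100):
    """Nodes and weights such that E_{X~N(0,1)}[f(X)] ~= sum(w * f(x))."""
    x, w = hermegauss(deg)              # weight function exp(-x^2 / 2)
    return x, w / np.sqrt(2 * np.pi)

sigma = lambda t: erf(t / np.sqrt(2))   # an erf-style sigmoid (illustrative choice)
n = 11

x, w = gaussian_quadrature()
a = hermite_basis(n, x) @ (w * sigma(x))        # a_i = E[sigma(X) h_i(X)], i = 0..n
norm_sq = np.sum(w * sigma(x) ** 2)             # ||sigma||^2 = E[sigma(X)^2]
sigma_hat = lambda rho: np.sum(a ** 2 * rho ** np.arange(n + 1))  # truncated dual activation

for k in range(1, n + 1, 2):
    eps_k = norm_sq - np.sum(a[: k + 1] ** 2)   # eps_sigma(k) = ||sigma||^2 - sum_{i<=k} a_i^2
    print(f"eps_sigma({k:2d}) ~= {eps_k:.2e}")
print(f"sigma_hat(1) ~= ||sigma||^2 ~= {norm_sq:.4f}")
```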
### 2.3 Defining a Shadow Network

In order to approximate $\Phi^i_W$ via a polynomial, we will use a "shadow network" that is obtained by replacing the activation $\sigma$ with a polynomial approximation of it. We will show that for random networks we can approximate each activation sufficiently well with low-degree Hermite polynomials.

Recall that $\sigma = \sum_{i=0}^{\infty} a_i h_i$ is the representation of $\sigma$ in the basis of the Hermite polynomials. Define $\sigma_n = \frac{1}{\sqrt{\sum_{i=0}^n a_i^2}} \sum_{i=0}^n a_i h_i$. We have $\epsilon_\sigma(n) = \sum_{i=n+1}^{\infty} a_i^2$ and hence $\sigma_n = \frac{1}{\sqrt{1 - \epsilon_\sigma(n)}} \sum_{i=0}^n a_i h_i$.

We next define a shadow network. For $x \in \mathbb{R}^d$ we let $\Psi^{0,n}_W(x) = x$. We then define recursively
$$\Phi^{j,n}_W(x) = W^j \Psi^{j-1,n}_W(x) \quad \text{for } 1 \le j \le i, \qquad \Psi^{j,n}_W(x) = \sigma_n\big( \Phi^{j,n}_W(x) \big) \quad \text{for } 1 \le j \le i-1,$$
so that the output of the shadow network is $\Phi^{i,n}_W(x) = W^i \Psi^{i-1,n}_W(x)$. We will prove the following theorem, which implies Theorem 1.1.

**Theorem 2.1.** Fix $i$ and let $n$ be large enough so that $\epsilon_\sigma(n) \le \frac{1}{2}$. There is a constant $D = D(n, i, \sigma)$ such that if $d_1, \ldots, d_{i-1} \ge D$ then for any $x \in \mathbb{S}^{d-1}$,
$$\mathbb{E}_W \left\| \Phi^i_W(x) - \Phi^{i,n}_W(x) \right\| \le 13\,(L+1)^2\, \big( \epsilon_\sigma(n) \big)^{\frac{1}{2^{i-1}}}.$$
Since $\epsilon_\sigma(n)$ is the error in the approximation of a single activation $\sigma$ with a degree-$n$ polynomial, it is natural to expect that the above bound will depend on $\epsilon_\sigma(n)$.

To see why Theorem 2.1 (together with Lemma A.3, which bounds $\epsilon_\sigma(n)$) implies Theorem 1.1, note that $\Phi^{i,n}_W(x)$ is a polynomial of degree $n^{i-1}$ in $x$. This implies Theorem 1.1, except for the requirement that the coefficients of the polynomial are polynomially bounded. To deal with this, define
$$\bar{\Phi}^{i,n}_W(x) = \begin{cases} \Phi^{i,n}_W(x) & \text{if all entries in } W \text{ are at most } \sum_{j=0}^i d_j, \\ 0 & \text{otherwise.} \end{cases}$$
As we show next, $\lim_{\min(d_1, \ldots, d_{i-1}) \to \infty} \mathbb{E}_W \left\| \Phi^{i,n}_W(x) - \bar{\Phi}^{i,n}_W(x) \right\| = 0$. Hence, in the theorem we can replace $\Phi^{i,n}_W$ by $\bar{\Phi}^{i,n}_W$, which has polynomially bounded coefficients. See Appendices A.3 and A.4 for the proofs.

**Lemma 2.2.** For every $\epsilon$ and $n$ there is a constant $D$ such that if $d_1, \ldots, d_{i-1} \ge D$ then for any $x \in \mathbb{S}^{d-1}$, $\mathbb{E}_W \left\| \Phi^{i,n}_W(x) - \bar{\Phi}^{i,n}_W(x) \right\| < \epsilon$.

**Lemma 2.3.** $\bar{\Phi}^{i,n}_W$ computes a polynomial whose sum of coefficients is at most $(2\bar{d})^{4 n^{i-1}}$.

### 2.4 Proof of Theorem 2.1 for depth-two networks

We will first prove Theorem 2.1 for depth-2 networks (i.e. for $i = 2$). We will prove Lemma 2.5 below, which implies that for every $\epsilon$ there is $n$ such that for any $x \in \mathbb{S}^{d-1}$, $\mathbb{E}_W \left\| \Psi^{1,n}_W(x) - \Psi^1_W(x) \right\| \le \epsilon$. We will then prove Lemma 2.6, which together with Lemma 2.5 shows that $\mathbb{E}_W \left\| \Phi^{2,n}_W(x) - \Phi^2_W(x) \right\| \le \epsilon$, thus proving Theorem 2.1 for $i = 2$. We start, however, with the following lemma that will be useful throughout (see Appendix A.5 for the proof).

**Lemma 2.4.** Fix $f, g : \mathbb{R} \to \mathbb{R}$, $x, y \in \mathbb{R}^{d_1}$ and a Xavier matrix $W \in \mathbb{R}^{d_2 \times d_1}$. Let $(X, Y)$ be a centered Gaussian vector with covariance matrix $\begin{pmatrix} \|x\|^2 & \langle x, y \rangle \\ \langle x, y \rangle & \|y\|^2 \end{pmatrix}$. Then
$$\mathbb{E}_W \left\| f(Wx) - g(Wy) \right\| \le \sqrt{\mathbb{E}_W \left\| f(Wx) - g(Wy) \right\|^2} = \sqrt{\mathbb{E}_{X,Y} \big( f(X) - g(Y) \big)^2}.$$

**Lemma 2.5.** Fix $x \in \mathbb{S}^{d_1 - 1}$. Let $W \in \mathbb{R}^{d_2 \times d_1}$ be a Xavier matrix. Then
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(Wx) \right\| \le \sqrt{2\, \epsilon_\sigma(n)}.$$

*Proof.* By Lemma 2.4 we have
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(Wx) \right\| \le \sqrt{\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(Wx) \right\|^2} = \sqrt{\mathbb{E}_X \big( \sigma(X) - \sigma_n(X) \big)^2}.$$
Now, the above equals
$$\sqrt{\sum_{i=n+1}^{\infty} a_i^2 + \left( 1 - \frac{1}{\sqrt{1 - \epsilon_\sigma(n)}} \right)^2 \sum_{i=0}^n a_i^2} = \sqrt{\epsilon_\sigma(n) + \left( 1 - \sqrt{1 - \epsilon_\sigma(n)} \right)^2} = \sqrt{\epsilon_\sigma(n) + \frac{\big( \epsilon_\sigma(n) \big)^2}{\left( 1 + \sqrt{1 - \epsilon_\sigma(n)} \right)^2}} \le \sqrt{2\, \epsilon_\sigma(n)}. \qquad \square$$

Lemma 2.5 implies that $\mathbb{E}_W \left\| \Psi^{1,n}_W(x) - \Psi^1_W(x) \right\| \le \sqrt{2\, \epsilon_\sigma(n)}$. Thus, given $\epsilon > 0$, for sufficiently large $n$, $\mathbb{E}_W \left\| \Psi^{1,n}_W(x) - \Psi^1_W(x) \right\| \le \epsilon$. The following lemma therefore implies that $\mathbb{E}_W \left\| \Phi^{2,n}_W(x) - \Phi^2_W(x) \right\| \le \sqrt{2\, \epsilon_\sigma(n)}$, and thus implies Theorem 2.1 for depth-two networks.

**Lemma 2.6.** For any $x \in \mathbb{S}^{d-1}$,
$$\mathbb{E}_W \left\| \Phi^{i,n}_W(x) - \Phi^i_W(x) \right\| \le \mathbb{E}_W \left\| \Psi^{i-1,n}_W(x) - \Psi^{i-1}_W(x) \right\|.$$

*Proof.* By Lemma 2.4 (applied with $f = g = \mathrm{id}$, conditionally on $W^1, \ldots, W^{i-1}$) we have
$$\mathbb{E}_{W^i} \left\| \Phi^{i,n}_W(x) - \Phi^i_W(x) \right\| = \mathbb{E}_{W^i} \left\| W^i \big( \Psi^{i-1,n}_W(x) - \Psi^{i-1}_W(x) \big) \right\| \le \sqrt{\mathbb{E}_{X \sim N\left(0,\, \left\| \Psi^{i-1,n}_W(x) - \Psi^{i-1}_W(x) \right\|^2\right)} X^2} = \left\| \Psi^{i-1,n}_W(x) - \Psi^{i-1}_W(x) \right\|.$$
Taking expectation over $W^1, \ldots, W^{i-1}$ concludes the proof. $\square$
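Before moving to general depth, here is a small numerical sketch (assuming NumPy) of the depth-two case just proved: it builds a random Xavier network $\Phi^2_W$ and its shadow $\Phi^{2,n}_W$, in which $\sigma$ is replaced by the rescaled degree-$n$ Hermite truncation $\sigma_n$ of Section 2.3, and measures the average of $\|\Phi^2_W(x) - \Phi^{2,n}_W(x)\|$ over inputs on the normalized sphere. The activation (tanh, which is 1-Lipschitz but not exactly normalized), widths, truncation degree and batch size are illustrative choices, not values from the paper.

```python
# Sketch: compare a depth-2 random Xavier network Phi^2_W with its shadow
# Phi^{2,n}_W, where sigma is replaced by the rescaled degree-n Hermite
# truncation sigma_n of Section 2.3. All sizes below are illustrative.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

rng = np.random.default_rng(1)
d0, d1, d2, n = 16, 400, 1, 4
sigma = np.tanh                                   # 1-Lipschitz (not exactly normalized)

def hermite_basis(n, z):
    # normalized Hermite polynomials h_0,...,h_n at z, via recursion (1)
    H = [np.ones_like(z), z]
    for k in range(1, n):
        H.append((z * H[k] - np.sqrt(k) * H[k - 1]) / np.sqrt(k + 1))
    return np.stack(H[: n + 1])

# Hermite coefficients a_i = E[sigma(X) h_i(X)] by Gauss-Hermite quadrature
xq, wq = hermegauss(100)
wq = wq / np.sqrt(2 * np.pi)
a = hermite_basis(n, xq) @ (wq * sigma(xq))

def sigma_n(z):
    # sigma_n = (sum_{i<=n} a_i h_i) / sqrt(sum_{i<=n} a_i^2)
    return np.tensordot(a / np.sqrt(np.sum(a ** 2)), hermite_basis(n, z), axes=1)

# random Xavier weights and inputs on S^{d0-1} (normalized norm 1)
W1 = rng.standard_normal((d1, d0)) / np.sqrt(d0)
W2 = rng.standard_normal((d2, d1)) / np.sqrt(d1)
X = rng.standard_normal((d0, 2000))
X = np.sqrt(d0) * X / np.linalg.norm(X, axis=0, keepdims=True)

phi2 = W2 @ sigma(W1 @ X)                         # Phi^2_W(x)
phi2_shadow = W2 @ sigma_n(W1 @ X)                # Phi^{2,n}_W(x)

# normalized norm of the difference, averaged over the batch (cf. Theorem 2.1, i = 2)
diff = np.sqrt(np.mean((phi2 - phi2_shadow) ** 2, axis=0))
print(f"average ||Phi^2_W(x) - Phi^{{2,n}}_W(x)||: {np.mean(diff):.4f}")
```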
### 2.5 Proof of Theorem 2.1 for General Networks

For $x \in \mathbb{R}^{d_{i-1}}$ we denote $\Psi_{W^i}(x) = \sigma(W^i x)$ and $\Psi^n_{W^i}(x) = \sigma_n(W^i x)$. Lemma 2.5 can be roughly phrased as
$$(x = x') \text{ and } (\|x\| = 1) \;\Longrightarrow\; \Psi_{W^i}(x) \approx \Psi^n_{W^i}(x').$$
In order to prove Theorem 2.1 for general networks we will extend it by replacing the strict equality conditions with softer ones. That is, we will show that
$$(x \approx x') \text{ and } (\|x\| \approx 1) \text{ and } (\|x'\| \approx 1) \;\Longrightarrow\; \Psi_{W^i}(x) \approx \Psi^n_{W^i}(x'). \tag{5}$$
This will be enough to prove Theorem 2.1 for general networks. Indeed, the conditions $\|x\| \approx 1$ and $\|x'\| \approx 1$ are valid w.h.p. via a simple probabilistic argument. Thus, Eq. (5) implies that
$$x \approx x' \;\Longrightarrow\; \Psi_{W^i}(x) \approx \Psi^n_{W^i}(x'). \tag{6}$$
Now, for $x \in \mathbb{S}^{d-1}$, Eq. (6) implies that $\Psi_{W^1}(x) \approx \Psi^n_{W^1}(x)$. Using Eq. (6) again, we get that $\Psi_{W^2} \circ \Psi_{W^1}(x) \approx \Psi^n_{W^2} \circ \Psi^n_{W^1}(x)$. Using it $i - 3$ more times, we get that $\Psi_{W^{i-1}} \circ \cdots \circ \Psi_{W^1}(x) \approx \Psi^n_{W^{i-1}} \circ \cdots \circ \Psi^n_{W^1}(x)$, or in other words, that $\Psi^{i-1}_W(x) \approx \Psi^{i-1,n}_W(x)$. As we will show, "$\approx$" stands for a sufficiently strong approximation, which guarantees that $\mathbb{E}_W \left\| \Psi^{i-1}_W(x) - \Psi^{i-1,n}_W(x) \right\|$ is small enough, and hence Lemma 2.6 implies Theorem 2.1.

To prove Eq. (5) we first prove Lemma 2.7, which softens the requirement that $x = x'$. That is, it shows that
$$(x \approx x') \text{ and } (\|x\| = \|x'\| = 1) \;\Longrightarrow\; \Psi_{W^i}(x) \approx \Psi^n_{W^i}(x').$$
The second condition, which requires that $\|x\| = \|x'\| = 1$, is softened via Lemmas 2.8 and 2.9. Lemma 2.10 then wraps the two softenings together and shows that Eq. (5) is valid. Finally, in Section 2.5.1 we use Lemma 2.10 to prove Theorem 2.1.

**Lemma 2.7.** Fix $x,\, x + v \in \mathbb{S}^{d_1 - 1}$ with $\|v\| \le \epsilon$. Let $W \in \mathbb{R}^{d_2 \times d_1}$ be a Xavier matrix. Then
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(W(x + v)) \right\| \le \sqrt{2\, \epsilon_\sigma(n)} + L \sqrt{\frac{2\epsilon}{1 - \epsilon_\sigma(n)}}.$$

*Proof.* We have
$$\left\| \sigma(Wx) - \sigma_n(W(x + v)) \right\| \le \left\| \sigma(Wx) - \sigma_n(Wx) \right\| + \left\| \sigma_n(Wx) - \sigma_n(W(x + v)) \right\|.$$
By Lemma 2.5 we have $\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(Wx) \right\| \le \sqrt{2\, \epsilon_\sigma(n)}$. It remains to bound $\mathbb{E}_W \left\| \sigma_n(Wx) - \sigma_n(W(x + v)) \right\|$. By Lemma 2.4 we have
$$\mathbb{E}_W \left\| \sigma_n(Wx) - \sigma_n(W(x + v)) \right\| \le \sqrt{\mathbb{E}_{X,Y} \big( \sigma_n(X) - \sigma_n(Y) \big)^2},$$
where $(X, Y)$ is a centered Gaussian vector with covariance matrix $\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ for $\rho = \langle x, x + v \rangle \ge 1 - \epsilon$. Finally, we have
$$\mathbb{E}_{X,Y} \big( \sigma_n(X) - \sigma_n(Y) \big)^2 = \frac{1}{1 - \epsilon_\sigma(n)}\, \mathbb{E}_{X,Y} \left( \sum_{i=0}^{n} a_i \big( h_i(X) - h_i(Y) \big) \right)^2 = \frac{1}{1 - \epsilon_\sigma(n)} \sum_{i=0}^{n} \sum_{j=0}^{n} a_i a_j\, \mathbb{E}_{X,Y} \big( h_i(X) - h_i(Y) \big) \big( h_j(X) - h_j(Y) \big) \overset{\text{Eq. (2)}}{=} \frac{1}{1 - \epsilon_\sigma(n)} \sum_{i=0}^{n} a_i^2 \big( 2 - 2\rho^i \big) \le \frac{2}{1 - \epsilon_\sigma(n)} \big( \hat{\sigma}(1) - \hat{\sigma}(\rho) \big).$$
In Lemma A.1 we show that $\hat{\sigma}$ is $L^2$-Lipschitz. Hence, the above is at most $\frac{2L^2}{1 - \epsilon_\sigma(n)}\, \epsilon$. $\square$

We next give a lemma that allows us to almost jointly project a pair of points $x_1, x_2 \in \mathbb{R}^d$ onto a closed set $A \subseteq \mathbb{R}^d$, without expanding the distance too much. See Appendix A.6 for the proof.

**Lemma 2.8.** Let $A \subseteq \mathbb{R}^d$ be a closed set and fix $x_1, x_2 \in \mathbb{R}^d$. There are $\tilde{x}_1, \tilde{x}_2 \in A$ such that $\|x_1 - \tilde{x}_1\| \le 2\, d(x_1, A)$, $\|x_2 - \tilde{x}_2\| \le 2\, d(x_2, A)$ and $\|\tilde{x}_1 - \tilde{x}_2\| \le 3\, \|x_1 - x_2\|$.

**Lemma 2.9.** Let $x,\, x + v \in \mathbb{R}^{d_1}$ be vectors such that $\|x\| = 1$ and $\|v\| \le \epsilon \le 1$. Let $W \in \mathbb{R}^{d_2 \times d_1}$ be a Xavier matrix. Then
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma(W(x + v)) \right\| \le L\epsilon$$
and
$$\mathbb{E}_W \left\| \sigma_n(Wx) - \sigma_n(W(x + v)) \right\| \le 2^{2n+1} \big( 9\, (4n - 1)!! \big)^{1/4}\, \epsilon =: \lambda(n)\, \epsilon.$$

*Proof.* Fix a centered Gaussian vector $(X, Y)$ with covariance matrix $\begin{pmatrix} 1 & \langle x + v, x \rangle \\ \langle x + v, x \rangle & \|x + v\|^2 \end{pmatrix}$. Let $Z = Y - X$. Note that $\mathrm{Var}(Z) \le \epsilon^2$. By Lemma 2.4 we have
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma(W(x + v)) \right\| \le \sqrt{\mathbb{E} \big( \sigma(X) - \sigma(X + Z) \big)^2} \le \sqrt{L^2\, \mathbb{E} Z^2} \le L\epsilon.$$
We now prove the second part. In Lemma A.2, we show that $|h_i(x) - h_i(x + y)| \le 2^i \max(|x|, |x + y|, 1)^i |y|$. Therefore, using $|a_i| \le \sqrt{\sum_{j \le n} a_j^2} = \sqrt{1 - \epsilon_\sigma(n)}$ for $i \le n$,
$$|\sigma_n(x) - \sigma_n(x + y)| \le \sum_{i=0}^{n} \frac{|a_i|}{\sqrt{1 - \epsilon_\sigma(n)}}\, |h_i(x) - h_i(x + y)| \le \sum_{i=0}^{n} 2^i \max(|x|^i, |x + y|^i, 1)\, |y| \le 2^{n+1} \max(|x|^n, |x + y|^n, 1)\, |y|.$$
Hence, by the Cauchy-Schwarz inequality,
$$\mathbb{E} \big( \sigma_n(X) - \sigma_n(X + Z) \big)^2 \le 2^{2n+2}\, \mathbb{E} \left[ Z^2 \max(|X|^n, |X + Z|^n, 1)^2 \right] \le 2^{2n+2} \sqrt{\mathbb{E} Z^4}\, \sqrt{\mathbb{E} \max(|X|^{4n}, |X + Z|^{4n}, 1)} \le 2^{2n+2} \sqrt{3}\, \epsilon^2\, \sqrt{\mathbb{E} \left[ |X|^{4n} + |X + Z|^{4n} + 1 \right]} \overset{\text{Eq. (4)}}{=} 2^{2n+2} \sqrt{3}\, \epsilon^2\, \sqrt{1 + (4n - 1)!! \left( \|x + v\|^{4n} + \|x\|^{4n} \right)} \le 2^{2n+2} \sqrt{3}\, \epsilon^2\, \sqrt{3\, (1 + \epsilon)^{4n}\, (4n - 1)!!} \le 3 \cdot 2^{4n+2} \sqrt{(4n - 1)!!}\; \epsilon^2.$$
Lemma 2.4 now implies that
$$\mathbb{E}_W \left\| \sigma_n(Wx) - \sigma_n(W(x + v)) \right\| \le 2^{2n+1} \big( 9\, (4n - 1)!! \big)^{1/4}\, \epsilon. \qquad \square$$
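The proof of Lemma 2.8 in Appendix A.6 is an explicit two-case construction: project each point separately when their distances to $A$ are at most their mutual distance, and otherwise reuse the projection of the closer point for both. The sketch below (assuming NumPy) implements that case analysis for a generic nearest-point oracle and checks the three guarantees numerically on the unit sphere; the set $A$, the oracle, and the test points are illustrative choices.

```python
# Sketch of the two-case construction behind Lemma 2.8 (Appendix A.6): jointly
# move x1, x2 into a closed set A without expanding their distance by more than
# a factor of 3. The projection oracle and test points are illustrative.
import numpy as np

rng = np.random.default_rng(2)

def nnorm(v):
    """Normalized Euclidean norm ||v|| = sqrt(mean(v_i^2)) used throughout the paper."""
    return np.sqrt(np.mean(v ** 2))

def joint_project(x1, x2, proj):
    """Return points of A within 2*d(x_i, A) of x_i, at distance <= 3*||x1 - x2||."""
    p1, p2 = proj(x1), proj(x2)
    if nnorm(x1 - p1) > nnorm(x2 - p2):      # w.l.o.g. x1 is (weakly) closer to A
        t2, t1 = joint_project(x2, x1, proj)
        return t1, t2
    if nnorm(x2 - p2) <= nnorm(x1 - x2):     # Case I: project each point separately
        return p1, p2
    return p1, p1                            # Case II: reuse the projection of the closer point

# A = S^{d-1} (normalized norm 1); the nearest-point projection is rescaling.
proj_sphere = lambda v: v / nnorm(v)

d = 10
x1 = 1.3 * rng.standard_normal(d)
x2 = x1 + 0.05 * rng.standard_normal(d)
t1, t2 = joint_project(x1, x2, proj_sphere)

print("||x1 - t1|| <= 2 d(x1, A):", nnorm(x1 - t1) <= 2 * nnorm(x1 - proj_sphere(x1)) + 1e-12)
print("||x2 - t2|| <= 2 d(x2, A):", nnorm(x2 - t2) <= 2 * nnorm(x2 - proj_sphere(x2)) + 1e-12)
print("||t1 - t2|| <= 3 ||x1 - x2||:", nnorm(t1 - t2) <= 3 * nnorm(x1 - x2) + 1e-12)
```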
**Lemma 2.10.** Let $x,\, x + v \in \mathbb{R}^{d_1}$ be vectors such that $\|v\| \le \epsilon$, $\big| \|x\| - 1 \big| \le \delta \le \frac{1}{2}$ and $\big| \|x + v\| - 1 \big| \le \delta$. Let $W \in \mathbb{R}^{d_2 \times d_1}$ be a Xavier matrix. Then
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(W(x + v)) \right\| \le 2L\delta + \sqrt{2\, \epsilon_\sigma(n)} + L \sqrt{\frac{6\epsilon}{1 - \epsilon_\sigma(n)}} + 2\lambda(n)\, \delta.$$

*Proof.* By Lemma 2.8 (applied with $A = \mathbb{S}^{d_1 - 1}$) there are vectors $x', v'$ such that $\|x'\| = \|x' + v'\| = 1$ and
$$\|x - x'\| \le 2\delta, \qquad \|(x + v) - (x' + v')\| \le 2\delta, \qquad \|v'\| \le 3\|v\| \le 3\epsilon.$$
Now, we have, by Lemmas 2.7 and 2.9,
$$\mathbb{E}_W \left\| \sigma(Wx) - \sigma_n(W(x + v)) \right\| \le \mathbb{E}_W \left\| \sigma(Wx) - \sigma(Wx') \right\| + \mathbb{E}_W \left\| \sigma(Wx') - \sigma_n(W(x' + v')) \right\| + \mathbb{E}_W \left\| \sigma_n(W(x' + v')) - \sigma_n(W(x + v)) \right\| \le 2L\delta + \sqrt{2\, \epsilon_\sigma(n)} + L \sqrt{\frac{6\epsilon}{1 - \epsilon_\sigma(n)}} + 2\lambda(n)\, \delta. \qquad \square$$

#### 2.5.1 Concluding the proof of Theorem 2.1

For $\delta > 0$ define
$$\Psi^i_W(x, \delta) = \begin{cases} 0 & \text{if } \big| 1 - \|\Psi^j_W(x)\| \big| > \delta \text{ or } \big| 1 - \|\Psi^{j,n}_W(x)\| \big| > \delta \text{ for some } j < i, \\ \Psi^i_W(x) & \text{otherwise,} \end{cases}$$
$$\Psi^{i,n}_W(x, \delta) = \begin{cases} 0 & \text{if } \big| 1 - \|\Psi^j_W(x)\| \big| > \delta \text{ or } \big| 1 - \|\Psi^{j,n}_W(x)\| \big| > \delta \text{ for some } j < i, \\ \Psi^{i,n}_W(x) & \text{otherwise.} \end{cases}$$
We have
$$\mathbb{E}_W \left\| \Psi^i_W(x) - \Psi^{i,n}_W(x) \right\| \le \mathbb{E}_W \left\| \Psi^i_W(x) - \Psi^i_W(x, \delta) \right\| + \mathbb{E}_W \left\| \Psi^i_W(x, \delta) - \Psi^{i,n}_W(x, \delta) \right\| + \mathbb{E}_W \left\| \Psi^{i,n}_W(x, \delta) - \Psi^{i,n}_W(x) \right\|.$$
Theorem 2.1 now follows from Lemmas 2.11 and 2.12 below, together with Lemma 2.6.

**Lemma 2.11.** Let $n$ be large enough so that $\epsilon_\sigma(n) \le \frac{1}{2}$ and let $\delta < \frac{\epsilon_\sigma(n)}{2L + 2\lambda(n)}$. Then, for any $x \in \mathbb{S}^{d-1}$,
$$\mathbb{E}_W \left\| \Psi^i_W(x, \delta) - \Psi^{i,n}_W(x, \delta) \right\| \le 12\,(L+1)^2\, \big( \epsilon_\sigma(n) \big)^{2^{-i}}.$$

*Proof.* We will prove the result by induction on $i$. The case $i = 0$ is clear, as $\Psi^0_W(x, \delta) = \Psi^{0,n}_W(x, \delta)$. Fix $i > 0$. For every $\delta < \frac{1}{2}$ and $n$ we have, by Lemma 2.10 (applied conditionally on $W^1, \ldots, W^{i-1}$),
$$\mathbb{E}_{W^i} \left\| \Psi^i_W(x, \delta) - \Psi^{i,n}_W(x, \delta) \right\| \le 2L\delta + \sqrt{2\, \epsilon_\sigma(n)} + L \sqrt{\frac{6}{1 - \epsilon_\sigma(n)} \left\| \Psi^{i-1}_W(x, \delta) - \Psi^{i-1,n}_W(x, \delta) \right\|} + 2\lambda(n)\, \delta.$$
Taking expectation over $W^1, \ldots, W^{i-1}$ and using Jensen's inequality, we get
$$\mathbb{E}_W \left\| \Psi^i_W(x, \delta) - \Psi^{i,n}_W(x, \delta) \right\| \le 2L\delta + \sqrt{2\, \epsilon_\sigma(n)} + L \sqrt{\frac{6}{1 - \epsilon_\sigma(n)}\, \mathbb{E}_W \left\| \Psi^{i-1}_W(x, \delta) - \Psi^{i-1,n}_W(x, \delta) \right\|} + 2\lambda(n)\, \delta.$$
Since $\delta < \frac{\epsilon_\sigma(n)}{2L + 2\lambda(n)}$ we have $2L\delta + 2\lambda(n)\delta \le \epsilon_\sigma(n)$, and since $\epsilon_\sigma(n) \le \frac{1}{2}$ we have $\frac{6}{1 - \epsilon_\sigma(n)} \le 12$ and $\epsilon_\sigma(n) + \sqrt{2\, \epsilon_\sigma(n)} \le 4 \sqrt{\epsilon_\sigma(n)}$. Hence, the above is at most
$$4 \sqrt{\epsilon_\sigma(n)} + L \sqrt{12\, \mathbb{E}_W \left\| \Psi^{i-1}_W(x, \delta) - \Psi^{i-1,n}_W(x, \delta) \right\|} \overset{\text{induction hypothesis}}{\le} 4 \sqrt{\epsilon_\sigma(n)} + L \sqrt{12 \cdot 12\,(L+1)^2\, \big( \epsilon_\sigma(n) \big)^{2^{-i+1}}} = 4 \sqrt{\epsilon_\sigma(n)} + 12\, L(L+1)\, \big( \epsilon_\sigma(n) \big)^{2^{-i}} \le 12\,(L+1)^2\, \big( \epsilon_\sigma(n) \big)^{2^{-i}},$$
where the last inequality uses $4 \sqrt{\epsilon_\sigma(n)} \le 12\, \big( \epsilon_\sigma(n) \big)^{2^{-i}} \le 12\,(L+1)\, \big( \epsilon_\sigma(n) \big)^{2^{-i}}$. $\square$

**Lemma 2.12.** Fix $i$, $n$, $\delta$ and $\epsilon > 0$. There is a constant $D$ such that if $d_1, \ldots, d_{i-1} \ge D$ then for any $x \in \mathbb{S}^{d-1}$,
$$\mathbb{E}_W \left\| \Psi^i_W(x) - \Psi^i_W(x, \delta) \right\| + \mathbb{E}_W \left\| \Psi^{i,n}_W(x, \delta) - \Psi^{i,n}_W(x) \right\| \le \epsilon.$$

*Proof sketch (see Appendix A.7 for the formal proof).* Let $B_{i,\delta}$ be the event that for some $j < i$, $\big| 1 - \|\Psi^j_W(x)\| \big| > \delta$ or $\big| 1 - \|\Psi^{j,n}_W(x)\| \big| > \delta$. We have
$$\mathbb{E}_W \left\| \Psi^i_W(x) - \Psi^i_W(x, \delta) \right\| = \mathbb{E}_W \left[ \left\| \Psi^i_W(x) \right\| \mathbb{1}_{B_{i,\delta}} \right] \le \sqrt{\mathbb{E}_W \left\| \Psi^i_W(x) \right\|^2}\, \sqrt{\Pr(B_{i,\delta})}$$
and similarly
$$\mathbb{E}_W \left\| \Psi^{i,n}_W(x, \delta) - \Psi^{i,n}_W(x) \right\| \le \sqrt{\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^2}\, \sqrt{\Pr(B_{i,\delta})}.$$
Now, the lemma follows by proving that $\mathbb{E}_W \left\| \Psi^i_W(x) \right\|^2$ and $\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^2$ are bounded by a constant (independent of $d_0, \ldots, d_i$), and that for every $\delta$, $\epsilon'$, $i$ and $n$, there is a constant $D$ such that if $d_1, \ldots, d_{i-1} \ge D$ then $\Pr(B_{i,\delta}) < \epsilon'$.

## 3 Conclusion and Future Work

One of the prominent approaches for explaining the success of neural networks is to show that they are capable of learning complex and deep models. So far this approach has had relatively limited success. Although significant progress has been made in showing that neural networks can learn shallow models, so far neural networks have been shown to learn only toy deep models (e.g. [21, 2, 10, 31]). Not only that, but there are almost no known rich families of deep models that are efficiently learnable by some algorithm (not necessarily gradient methods on neural networks). Our paper suggests that random neural networks might be candidate models.

To take this approach further, a natural next step, and a central open question that arises from our work, is to show the existence of an algorithm that learns random networks in time that is polynomial both in $\frac{1}{\epsilon}$ and the network size. This question is already open for depth-two ReLU networks with two hidden neurons. We note that, as implied by [31], such a result, even for a single neuron, will have to go beyond polynomial approximation of the network, and even more generally, beyond kernel methods. Our result requires a lower bound $D$ on the network's width, where $D$ is a constant. We conjecture that this requirement can be relaxed, and leave it to future work.
Additional open directions are: (i) the analysis of random convolutional networks, (ii) achieving time and sample complexity of $(\bar{d})^{O(\epsilon^{-2})}$ for random networks of any constant depth (and not only for depth two), and (iii) finding a PTAS for random networks of depth $\omega(1)$.

## Acknowledgments and Disclosure of Funding

The research described in this paper was funded by the European Research Council (ERC) under the European Union's Horizon 2022 research and innovation program (grant agreement No. 101041711), and the Israel Science Foundation (grant number 2258/19). This research was done as part of the NSF-Simons Sponsored Collaboration on the Theoretical Foundations of Deep Learning.

## References

[1] N. Agarwal, P. Awasthi, and S. Kale. A deep conditioning treatment of neural networks. In Algorithmic Learning Theory, pages 249–305. PMLR, 2021.

[2] Z. Allen-Zhu and Y. Li. What can ResNet learn efficiently, going beyond kernels? arXiv preprint arXiv:1905.10337, 2019.

[3] B. Applebaum, B. Barak, and A. Wigderson. Public-key cryptography from different assumptions. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, pages 171–180, 2010.

[4] P. Awasthi, A. Tang, and A. Vijayaraghavan. Efficient algorithms for learning depth-2 neural networks with general ReLU activations. Advances in Neural Information Processing Systems, 34:13485–13496, 2021.

[5] A. Bakshi, R. Jayaram, and D. P. Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.

[6] S. Chen, A. Gollakota, A. R. Klivans, and R. Meka. Hardness of noise-free learning for two-hidden-layer neural networks. arXiv preprint arXiv:2202.05258, 2022.

[7] S. Chen, A. R. Klivans, and R. Meka. Learning deep ReLU networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022.

[8] S. Chen, Z. Dou, S. Goel, A. R. Klivans, and R. Meka. Learning narrow one-hidden-layer ReLU networks. arXiv preprint arXiv:2304.10524, 2023.

[9] A. Daniely. SGD learns the conjugate kernel class of the network. In NIPS, 2017.

[10] A. Daniely and E. Malach. Learning parities with neural networks. In NIPS, 2020.

[11] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning DNF's. In Conference on Learning Theory, pages 815–830, 2016.

[12] A. Daniely and G. Vardi. Hardness of learning neural networks with natural weights. In NIPS, 2020.

[13] A. Daniely and G. Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pages 1358–1394. PMLR, 2021.

[14] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.

[15] A. Daniely, N. Srebro, and G. Vardi. Computational complexity of learning neural networks: Smoothness and degeneracy. arXiv preprint arXiv:2302.07426, 2023.

[16] A. Das, S. Gollapudi, R. Kumar, and R. Panigrahy. On the learnability of deep random networks. arXiv preprint arXiv:1904.03866, 2019.

[17] I. Diakonikolas and D. M. Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 184–195. IEEE, 2020.

[18] I. Diakonikolas, D. M. Kane, V. Kontonis, and N. Zarifis. Algorithms and SQ lower bounds for PAC learning one-hidden-layer ReLU networks. In Conference on Learning Theory, pages 1514–1539. PMLR, 2020.
[19] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[20] R. Ge, R. Kuditipudi, Z. Li, and X. Wang. Learning two-layer neural networks with symmetric inputs. arXiv preprint arXiv:1810.06793, 2018.

[21] B. Ghorbani, S. Mei, T. Misiakiewicz, and A. Montanari. Limitations of lazy training of two-layers neural network. In Advances in Neural Information Processing Systems, pages 9108–9118, 2019.

[22] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.

[23] S. Goel and A. R. Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.

[24] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory, pages 1004–1042. PMLR, 2017.

[25] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. arXiv preprint arXiv:2006.12011, 2020.

[26] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[27] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

[28] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.

[29] R. O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

[30] S. Vempala and J. Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.

[31] G. Yehudai and O. Shamir. On the power and limitations of random features for understanding neural networks. arXiv preprint arXiv:1904.00687, 2019.

[32] X. Zhang, Y. Yu, L. Wang, and Q. Gu. Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1524–1534. PMLR, 2019.

[33] Y. Zhang, J. D. Lee, and M. I. Jordan. $\ell_1$-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.

[34] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International Conference on Machine Learning, pages 4140–4149. PMLR, 2017.

## A Missing Proofs

### A.1 Some Technical Lemmas

**Lemma A.1.** If $\sigma$ is $L$-Lipschitz then $\hat{\sigma}$ is $L^2$-Lipschitz on $[-1, 1]$.

*Proof.* As shown in [14], $(\hat{\sigma})' = \widehat{\sigma'}$. Hence, for $\rho \in [-1, 1]$,
$$\left| (\hat{\sigma})'(\rho) \right| = \left| \widehat{\sigma'}(\rho) \right| \le \widehat{\sigma'}(1) = \|\sigma'\|^2 \le L^2. \qquad \square$$

**Lemma A.2.** $|h_n(x) - h_n(x + y)| \le 2^n \max(|x|, |x + y|, 1)^n |y|$.

*Proof.* It is not hard to verify by induction on Eq. (1) that
$$|h_n(x)| \le 2^{n/2} \max(1, |x|^n).$$
This implies that for some $\xi \in [x, x + y]$,
$$|h_n(x) - h_n(x + y)| = |h_n'(\xi)\, y| \overset{\text{Eq. (3)}}{=} \sqrt{n}\, |h_{n-1}(\xi)\, y| \le \sqrt{n}\, 2^{n/2} \max(|x|, |x + y|, 1)^n |y| \le 2^n \max(|x|, |x + y|, 1)^n |y|. \qquad \square$$

### A.2 Bounds on $\epsilon_\sigma(n)$

By Eq. (3), if $\sigma$ is differentiable $k$ times then we have $\sigma^{(k)} = \sum_{i=k}^{\infty} \sqrt{\frac{i!}{(i-k)!}}\, a_i\, h_{i-k}$. Hence, for $k \le n + 1$,
$$\epsilon_\sigma(n) = \sum_{i=n+1}^{\infty} a_i^2 \le \frac{(n+1-k)!}{(n+1)!} \sum_{i=n+1}^{\infty} \frac{i!}{(i-k)!}\, a_i^2 \le \frac{(n+1-k)!}{(n+1)!} \left\| \sigma^{(k)} \right\|^2. \tag{7}$$

**Lemma A.3.** For any $L$-Lipschitz $\sigma$ we have $\epsilon_\sigma(n) \le \frac{L^2}{n+1}$.

*Proof.*
By Eq. (7) with $k = 1$ we get
$$\epsilon_\sigma(n) \le \frac{1}{n+1}\, \|\sigma'\|^2 \le \frac{L^2}{n+1}. \qquad \square$$

**Lemma A.4.** For the sigmoid activation $\sigma(x) = \int_0^x e^{-\frac{t^2}{2}}\, dt$ we have $\epsilon_\sigma(n) \le 2^{-n}$.

*Proof.* We have $\sigma^{(k)}(x) = (-1)^{k-1} \sqrt{(k-1)!}\; h_{k-1}(x)\, e^{-\frac{x^2}{2}}$. Indeed, it is not hard to verify this for $k = 1$ and $k = 2$. For $k > 2$ we have, via induction,
$$\sigma^{(k+1)}(x) = (-1)^{k-1} \sqrt{(k-1)!}\, \left[ h'_{k-1}(x) - x\, h_{k-1}(x) \right] e^{-\frac{x^2}{2}} \overset{\text{Eq. (3)}}{=} (-1)^{k-1} \sqrt{(k-1)!}\, \left[ \sqrt{k-1}\, h_{k-2}(x) - x\, h_{k-1}(x) \right] e^{-\frac{x^2}{2}} \overset{\text{Eq. (1)}}{=} (-1)^{k} \sqrt{k!}\; h_k(x)\, e^{-\frac{x^2}{2}}.$$
Hence, $|\sigma^{(k)}(x)| \le \left| \sqrt{(k-1)!}\; h_{k-1}(x) \right|$, and now Eq. (7) implies that for any $k \le n + 1$,
$$\epsilon_\sigma(n) \le \frac{(n+1-k)!}{(n+1)!}\, (k-1)! = \frac{(n+1-k)!\, k!}{(n+1)!\, k} = \frac{1}{k \binom{n+1}{k}}.$$
Taking $k = \frac{n+1}{2}$ we conclude that $\epsilon_\sigma(n) \le 2^{-n}$. $\square$

### A.3 Proof of Lemma 2.2

Let $A$ be the event that there is an entry in $W$ that is greater than $\sum_{j=0}^i d_j$. We have
$$\mathbb{E}_W \left\| \Phi^{i,n}_W(x) - \bar{\Phi}^{i,n}_W(x) \right\| = \mathbb{E} \left[ \left\| \Phi^{i,n}_W(x) \right\| \mathbb{1}_A \right] \le \sqrt{\mathbb{E} \left\| \Phi^{i,n}_W(x) \right\|^2}\, \sqrt{\Pr(A)}.$$
Now, it is not hard to verify that $\mathbb{E} \left\| \Phi^{i,n}_W(x) \right\|^2$ is polynomial in $\sum_{j=0}^i d_j$, while $\Pr(A)$ converges to $0$ exponentially fast in $\sum_{j=0}^i d_j$. Thus, if $\min(d_1, \ldots, d_{i-1})$ is large enough then $\mathbb{E}_W \left\| \Phi^{i,n}_W(x) - \bar{\Phi}^{i,n}_W(x) \right\| < \epsilon$. $\square$

### A.4 Proof of Lemma 2.3

We assume that $\bar{\Phi}^{i,n}_W = \Phi^{i,n}_W$, as otherwise $\bar{\Phi}^{i,n}_W \equiv 0$, in which case the lemma is clear. Write $\sigma_n(x) = \sum_{k=0}^n b_k x^k$ and $h_j(x) = \sum_{k=0}^j c_{j,k} x^k$. Via induction on Eq. (1), we have $|c_{j,k}| \le 2^{\frac{j}{2}}$. Hence,
$$|b_k| \le \frac{1}{\sqrt{\sum_{j=0}^n a_j^2}} \sum_{j=0}^n |a_j|\, |c_{j,k}| \le \frac{1}{\sqrt{\sum_{j=0}^n a_j^2}} \sum_{j=0}^n |a_j|\, 2^{\frac{j}{2}} \le \frac{1}{\sqrt{\sum_{j=0}^n a_j^2}} \sqrt{\sum_{j=0}^n a_j^2}\, \sqrt{\sum_{j=0}^n 2^j} \le 2^{\frac{n+1}{2}}.$$
Now, let $M_j$ be the maximal sum of coefficients of any polynomial computed by an output neuron of $\Psi^{j,n}_W$. We next show by induction that $M_j \le (2\bar{d})^{2 \sum_{k=1}^j n^k}$. This will conclude the proof, as it will imply that the sum of the coefficients of the polynomial computed by $\bar{\Phi}^{i,n}_W$ is at most $(2\bar{d})^2 M_{i-1} \le (2\bar{d})^{2 \sum_{k=0}^{i-1} n^k} \le (2\bar{d})^{4 n^{i-1}}$.

For $j = 0$ we have $M_0 = 1$. For $j \ge 1$, each output neuron of $\Psi^{j,n}_W$ computes $\sigma_n$ applied to a linear combination, with at most $\bar{d}$ coefficients of magnitude at most $\bar{d}$ (since $\bar{\Phi}^{i,n}_W = \Phi^{i,n}_W$), of polynomials with coefficient sums at most $M_{j-1}$; the inner polynomial therefore has coefficient sum at most $(\bar{d})^2 M_{j-1}$, and
$$M_j \le \sum_{k=0}^n |b_k| \left( (\bar{d})^2 M_{j-1} \right)^k \le 2^{\frac{n+1}{2}} (n+1) \left( (\bar{d})^2 M_{j-1} \right)^n \le (2\bar{d})^{2n} M_{j-1}^n.$$
By the induction hypothesis we have
$$M_j \le (2\bar{d})^{2n + 2n \sum_{k=1}^{j-1} n^k} = (2\bar{d})^{2 \sum_{k=1}^{j} n^k}. \qquad \square$$

### A.5 Proof of Lemma 2.4

By Jensen's inequality,
$$\mathbb{E}_W \left\| f(Wx) - g(Wy) \right\| \le \sqrt{\mathbb{E}_W \left\| f(Wx) - g(Wy) \right\|^2} = \sqrt{\frac{1}{d_2} \sum_{j=1}^{d_2} \mathbb{E}_W \big( f((Wx)_j) - g((Wy)_j) \big)^2}.$$
Now, the lemma follows from the fact that $\{ ((Wx)_j, (Wy)_j) \}_{j=1}^{d_2}$ are independent centered Gaussian vectors with covariance matrix $\begin{pmatrix} \|x\|^2 & \langle x, y \rangle \\ \langle x, y \rangle & \|y\|^2 \end{pmatrix}$. $\square$

### A.6 Proof of Lemma 2.8

Let $P_A : \mathbb{R}^d \to A$ be a function such that for any $x \in \mathbb{R}^d$, $\|P_A(x) - x\| = d(x, A)$. Assume w.l.o.g. that $\|x_1 - P_A(x_1)\| \le \|x_2 - P_A(x_2)\|$.

*Case I: $\|x_2 - P_A(x_2)\| \le \|x_1 - x_2\|$.* Simply define $\tilde{x}_i = P_A(x_i)$. We have $\|x_1 - \tilde{x}_1\| = \|x_1 - P_A(x_1)\|$, $\|x_2 - \tilde{x}_2\| = \|x_2 - P_A(x_2)\|$ and
$$\|\tilde{x}_1 - \tilde{x}_2\| \le \|P_A(x_1) - x_1\| + \|x_1 - x_2\| + \|x_2 - P_A(x_2)\| \le 3 \|x_1 - x_2\|.$$

*Case II: $\|x_1 - x_2\| \le \|x_2 - P_A(x_2)\|$.* Define $\tilde{x}_1 = \tilde{x}_2 = P_A(x_1)$. We have $\|x_1 - \tilde{x}_1\| = \|x_1 - P_A(x_1)\|$, $\|\tilde{x}_1 - \tilde{x}_2\| = 0 \le \|x_1 - x_2\|$ and
$$\|x_2 - \tilde{x}_2\| \le \|x_2 - x_1\| + \|x_1 - P_A(x_1)\| \le 2 \|x_2 - P_A(x_2)\|. \qquad \square$$

### A.7 Proof of Lemma 2.12

Let $B_{i,\delta}$ be the event that for some $j < i$, $\big| 1 - \|\Psi^j_W(x)\| \big| > \delta$ or $\big| 1 - \|\Psi^{j,n}_W(x)\| \big| > \delta$. We have
$$\mathbb{E}_W \left\| \Psi^i_W(x) - \Psi^i_W(x, \delta) \right\| = \mathbb{E}_W \left[ \left\| \Psi^i_W(x) \right\| \mathbb{1}_{B_{i,\delta}} \right] \le \sqrt{\mathbb{E}_W \left\| \Psi^i_W(x) \right\|^2}\, \sqrt{\Pr(B_{i,\delta})}$$
and similarly
$$\mathbb{E}_W \left\| \Psi^{i,n}_W(x, \delta) - \Psi^{i,n}_W(x) \right\| \le \sqrt{\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^2}\, \sqrt{\Pr(B_{i,\delta})}.$$
The lemma now follows from the following two claims.

**Claim 1.** $\mathbb{E}_W \left\| \Psi^i_W(x) \right\|^2$ and $\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^2$ are bounded by a constant (independent of $d_0, \ldots, d_i$).

*Proof.* Let $w$ denote a row of $W^i$, so that $w \cdot \Psi^{i-1}_W(x)$ is a centered Gaussian with variance $\left\| \Psi^{i-1}_W(x) \right\|^2$. We have
$$\mathbb{E}_{W^i} \left\| \Psi^i_W(x) \right\|^2 = \mathbb{E}_w\, \sigma^2\big( w \cdot \Psi^{i-1}_W(x) \big) \le 2\sigma^2(0) + 2L^2\, \mathbb{E}_w \big( w \cdot \Psi^{i-1}_W(x) \big)^2 = 2\sigma^2(0) + 2L^2 \left\| \Psi^{i-1}_W(x) \right\|^2.$$
By induction on $i$, this implies that $\mathbb{E}_W \left\| \Psi^i_W(x) \right\|^2$ is bounded by a constant that depends only on $i$ and $L$ (but not on $d_1, \ldots, d_i$).

For $\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^2$ we have
$$\mathbb{E}_{W^i} \left\| \Psi^{i,n}_W(x) \right\|^2 = \mathbb{E}_w\, \sigma_n^2\big( w \cdot \Psi^{i-1,n}_W(x) \big).$$
Hence, $\mathbb{E}_{W^i} \left\| \Psi^{i,n}_W(x) \right\|^2$ is an even polynomial in $\left\| \Psi^{i-1,n}_W(x) \right\|$ of degree $2n$. The polynomial depends only on $\sigma_n$. It is therefore enough to show that for any $i$ and $k$, $\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^{2k}$ is bounded, by a bound that is independent of $d_0, \ldots, d_i$. We will show this via induction on $i$. For $i = 0$ this is trivial, as $\left\| \Psi^{0,n}_W(x) \right\|^{2k} = 1$. Fix $i \ge 1$.
We have
$$\mathbb{E}_{W^i} \left\| \Psi^{i,n}_W(x) \right\|^{2k} = \mathbb{E}_{W^i} \left( \frac{1}{d_i} \sum_{j=1}^{d_i} \sigma_n^2\Big( \big( W^i \Psi^{i-1,n}_W(x) \big)_j \Big) \right)^{k} \overset{\text{Jensen's inequality}}{\le} \frac{1}{d_i} \sum_{j=1}^{d_i} \mathbb{E}_{W^i}\, \sigma_n^{2k}\Big( \big( W^i \Psi^{i-1,n}_W(x) \big)_j \Big) = \mathbb{E}_w\, \sigma_n^{2k}\big( w \cdot \Psi^{i-1,n}_W(x) \big).$$
The last expression is an even polynomial in $\left\| \Psi^{i-1,n}_W(x) \right\|$. The polynomial depends only on $2k$ and $n$. By the induction hypothesis we conclude that $\mathbb{E}_W \left\| \Psi^{i,n}_W(x) \right\|^{2k}$ is bounded by a bound that is independent of $d_0, \ldots, d_i$. $\square$

**Claim 2.** For every $\delta$, $\epsilon'$, $i$ and $n$, there is a constant $D$ such that if $d_1, \ldots, d_{i-1} \ge D$ then $\Pr(B_{i,\delta}) < \epsilon'$.

*Proof.* We will prove the claim by induction on $i$. For $i = 1$ this is immediate, as $\Pr(B_{1,\delta}) = 0$. Fix $i \ge 2$. Let $\delta'$ be small enough so that if $\big| \|x'\| - 1 \big| \le \delta'$ then
$$\big| \mathbb{E}_w\, \sigma^2(w \cdot x') - 1 \big| < \frac{\delta}{4}, \qquad \big| \mathbb{E}_w\, \sigma_n^2(w \cdot x') - 1 \big| < \frac{\delta}{4}, \qquad \big| \mathbb{E}_w\, \sigma^4(w \cdot x') - \mathbb{E}_X\, \sigma^4(X) \big| < 1 \qquad \text{and} \qquad \big| \mathbb{E}_w\, \sigma_n^4(w \cdot x') - \mathbb{E}_X\, \sigma_n^4(X) \big| < 1.$$
We have
$$\Pr(B_{i,\delta}) \le \Pr\big( B_{i,\delta} \mid B^c_{i-1,\delta'} \big) + \Pr\big( B_{i-1,\delta'} \big).$$
By Chebyshev's inequality, $\Pr\big( B_{i,\delta} \mid B^c_{i-1,\delta'} \big) < \frac{\epsilon'}{2}$ for sufficiently large $d_{i-1}$. By the induction hypothesis, $\Pr\big( B_{i-1,\delta'} \big) < \frac{\epsilon'}{2}$ for sufficiently large $d_1, \ldots, d_{i-2}$. $\square$